Skip to content

feat(tracking): plumb wandb tags through Tracking init#1698

Open
dinhxuanvu wants to merge 1 commit into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/wandb-tags-tracking
Open

feat(tracking): plumb wandb tags through Tracking init#1698
dinhxuanvu wants to merge 1 commit into
NovaSky-AI:mainfrom
dinhxuanvu:vdinh/wandb-tags-tracking

Conversation

@dinhxuanvu
Copy link
Copy Markdown
Contributor

Why

Tracking doesn't currently pass tags to wandb.init, so SkyRL-driven runs are always untagged. Tags are W&B's primary primitive for filtering, grouping, and organizing runs in the UI; not having them means callers can't distinguish, e.g., "baseline vs. ablation," "SFT vs. RL," or sibling runs in a sweep.

This also closes a small surprise gap: the lower-level WandbTracker in skyrl/utils/log.py already plumbs arbitrary kwargs to wandb.init, so power users discovered they could set tags through that path but not through the supported Tracking adapter.

What

  • Tracking.__init__ accepts a new explicit tags: Optional[List[str]] = None parameter.
  • The wandb branch forwards it to wandb.init(tags=tags).
  • New trainer.tags config field (default None).
  • get_tracker in main_base.py forwards cfg.trainer.tags.
  • The SFT path mirrors this: SFTConfig.tags bridges to cfg.trainer.tags, and SFTTrainer._init_tracker forwards it.
  • Tags are wandb-specific; other backends are unchanged.

Why an explicit param

Tracking is an adapter over five backends (wandb, mlflow, swanlab, tensorboard, console). A **kwargs passthrough would couple the adapter to wandb's API surface. An explicit, typed parameter keeps the contract clear and leaves a clean place to add matching primitives (mlflow tags, etc.) without breaking changes.

Test plan

  • Unit test: Tracking(..., tags=["a","b"]) results in wandb.init(..., tags=["a","b"]).
  • Unit test: default tags=None is forwarded when not passed.
  • Format/lint clean.
  • No behavior change when tags is not passed.

Compatibility

Strictly additive. wandb.init(tags=None) is a no-op; the SDK still honors WANDB_TAGS env var.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces Mixture of Experts (MoE) auxiliary-loss metrics collection and logging, a background GPU/RAM monitor (RayGpuMonitor) that scrapes Ray node Prometheus endpoints, and improved exception logging to wandb. Feedback focuses on correcting the type of moe_aux_loss_coeff to float to prevent truncation, safeguarding against potential ZeroDivisionError when self._num_training_gpus is zero, and using a threading.Event instead of a boolean flag and time.sleep to allow the background monitoring thread to shut down immediately.

Comment thread skyrl/train/config/config.py Outdated
Comment thread skyrl/train/trainer.py Outdated
Comment thread skyrl/train/sft_trainer.py Outdated
Comment thread skyrl/train/sft_trainer.py Outdated
Comment thread skyrl/train/utils/ray_gpu_monitor.py Outdated
Comment thread skyrl/train/utils/ray_gpu_monitor.py Outdated
Comment thread skyrl/train/utils/ray_gpu_monitor.py Outdated
Comment thread skyrl/train/utils/ray_gpu_monitor.py Outdated
Comment thread skyrl/train/utils/ray_gpu_monitor.py Outdated
Adds a `tags` parameter to `Tracking.__init__` and a corresponding
`trainer.tags` config field, allowing callers to label W&B runs at
init time. Other backends (mlflow, swanlab, tensorboard, console)
ignore `tags` since they don't have an equivalent primitive.

Without this, all SkyRL-driven W&B runs are untagged regardless of
caller intent, making it impossible to filter or group runs in the
W&B UI by anything other than project/run-name.

Tags can also be set via the `WANDB_TAGS` env var; the explicit
parameter takes precedence (W&B SDK behavior, unchanged).

Signed-off-by: Vu Dinh <vudinh@outlook.com>
@dinhxuanvu dinhxuanvu force-pushed the vdinh/wandb-tags-tracking branch from d716435 to c851008 Compare May 26, 2026 06:19
@dinhxuanvu
Copy link
Copy Markdown
Contributor Author

dinhxuanvu commented May 26, 2026

Please ignore the Gemini's review as there were some unrelated commits which were pushed accidentally and Gemini reviewed polluted code which was incorrect. These were removed.

@SumanthRH
Copy link
Copy Markdown
Member

Can you resolve merge conflicts @dinhxuanvu ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants