feat(tracking): plumb wandb tags through Tracking init#1698
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces Mixture of Experts (MoE) auxiliary-loss metrics collection and logging, a background GPU/RAM monitor (RayGpuMonitor) that scrapes Ray node Prometheus endpoints, and improved exception logging to wandb. Feedback focuses on correcting the type of moe_aux_loss_coeff to float to prevent truncation, safeguarding against potential ZeroDivisionError when self._num_training_gpus is zero, and using a threading.Event instead of a boolean flag and time.sleep to allow the background monitoring thread to shut down immediately.
Adds a `tags` parameter to `Tracking.__init__` and a corresponding `trainer.tags` config field, allowing callers to label W&B runs at init time. Other backends (mlflow, swanlab, tensorboard, console) ignore `tags` since they don't have an equivalent primitive. Without this, all SkyRL-driven W&B runs are untagged regardless of caller intent, making it impossible to filter or group runs in the W&B UI by anything other than project/run-name. Tags can also be set via the `WANDB_TAGS` env var; the explicit parameter takes precedence (W&B SDK behavior, unchanged). Signed-off-by: Vu Dinh <vudinh@outlook.com>
d716435 to
c851008
Compare
|
Please ignore the Gemini's review as there were some unrelated commits which were pushed accidentally and Gemini reviewed polluted code which was incorrect. These were removed. |
|
Can you resolve merge conflicts @dinhxuanvu ? |
Why
Trackingdoesn't currently passtagstowandb.init, so SkyRL-driven runs are always untagged. Tags are W&B's primary primitive for filtering, grouping, and organizing runs in the UI; not having them means callers can't distinguish, e.g., "baseline vs. ablation," "SFT vs. RL," or sibling runs in a sweep.This also closes a small surprise gap: the lower-level
WandbTrackerinskyrl/utils/log.pyalready plumbs arbitrary kwargs towandb.init, so power users discovered they could set tags through that path but not through the supportedTrackingadapter.What
Tracking.__init__accepts a new explicittags: Optional[List[str]] = Noneparameter.wandb.init(tags=tags).trainer.tagsconfig field (defaultNone).get_trackerinmain_base.pyforwardscfg.trainer.tags.SFTConfig.tagsbridges tocfg.trainer.tags, andSFTTrainer._init_trackerforwards it.Why an explicit param
Trackingis an adapter over five backends (wandb, mlflow, swanlab, tensorboard, console). A**kwargspassthrough would couple the adapter to wandb's API surface. An explicit, typed parameter keeps the contract clear and leaves a clean place to add matching primitives (mlflow tags, etc.) without breaking changes.Test plan
Tracking(..., tags=["a","b"])results inwandb.init(..., tags=["a","b"]).tags=Noneis forwarded when not passed.tagsis not passed.Compatibility
Strictly additive.
wandb.init(tags=None)is a no-op; the SDK still honorsWANDB_TAGSenv var.