Stop training if all components have reached their limit #1416

m4xw wants to merge 1 commit into Nerogar:master from
Conversation
Pull request overview
This PR aims to stop the training loop automatically once all model components have reached their configured stop_training_after limit, avoiding wasted steps when nothing remains trainable.
Changes:
- Added an epoch-level early exit when no parameters in `self.parameters` have `requires_grad=True`.
- Guarded `loss.backward()` behind `loss.requires_grad` to avoid errors when the loss graph is not differentiable.
```python
if not any(p.requires_grad for p in self.parameters):
    print("All trainable components have reached their stop_training_after limit. Stopping training.")
    return
```
The early-stop check only runs once per epoch. requires_grad can be toggled mid-epoch (e.g., many BaseModelSetup.after_optimizer_step() implementations call _setup_model_part_requires_grad(...) based on model.train_progress), so training can continue for the rest of the epoch with nothing trainable. Consider checking this condition right after after_optimizer_step() (and/or before stepping the optimizer) and exiting immediately once all params are frozen. Also, in multi-GPU this print(...) will run on every rank; gate the message behind multi.is_master() (or route through callbacks.on_update_status) to avoid duplicated output.
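The suggested fix can be sketched as a step-level check that is re-evaluated right after each per-step `requires_grad` update instead of once per epoch. This is a minimal, self-contained sketch: `parameters`, `after_optimizer_step`, and `is_master` here are stand-ins for OneTrainer's trainer state and callbacks, not the project's real API.

```python
import torch

def all_frozen(parameters) -> bool:
    # True when no parameter can receive gradients.
    return not any(p.requires_grad for p in parameters)

def training_steps(parameters, after_optimizer_step, is_master=True):
    # Yields step indices; exits as soon as everything is frozen,
    # including mid-epoch, since after_optimizer_step() may toggle
    # requires_grad based on training progress.
    step = 0
    while True:
        if all_frozen(parameters):
            if is_master:  # gate output so only rank 0 prints in multi-GPU
                print("All trainable components have reached their "
                      "stop_training_after limit. Stopping training.")
            return
        yield step
        after_optimizer_step()  # may freeze components mid-epoch
        step += 1
```

With a callback that freezes the parameters after three steps, the generator stops immediately on the next check rather than finishing the epoch.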
```diff
     loss = loss / self.config.gradient_accumulation_steps
-    if scaler:
-        scaler.scale(loss).backward()
-    else:
-        loss.backward()
+    if loss.requires_grad:
+        if scaler:
+            scaler.scale(loss).backward()
+        else:
+            loss.backward()
     has_gradient = True
```
Guarding backward() with if loss.requires_grad: prevents the runtime error, but it also allows the loop to keep advancing train_progress, stepping the LR scheduler, and calling optimizer.step() even when nothing is trainable (or when the loss is unexpectedly detached). That can silently produce “training” runs with zero parameter updates. Suggestion: if loss.requires_grad is false, explicitly stop training (when no params require grad) or raise an error when any parameter still has requires_grad=True, and ensure the optimizer/LR-scheduler/progress aren’t advanced in the no-grad case.
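The behavior being suggested can be sketched as follows. This is a hypothetical helper, not OneTrainer's actual code: when the loss carries no gradient, it either signals a clean stop (everything frozen) or fails loudly (the loss was detached by mistake), and only a `True` return tells the caller it is safe to advance the optimizer, LR scheduler, and progress counters.

```python
import torch

class StopTraining(Exception):
    # Raised to end the training loop cleanly.
    pass

def backward_or_stop(loss, parameters, scaler=None):
    # Runs backward(); refuses to silently skip it.
    if not loss.requires_grad:
        if not any(p.requires_grad for p in parameters):
            raise StopTraining  # nothing left to train: stop the loop
        # Trainable parameters remain, so a detached loss is a bug.
        raise RuntimeError("loss is detached but trainable parameters remain")
    if scaler is not None:
        scaler.scale(loss).backward()
    else:
        loss.backward()
    return True  # caller may now step the optimizer/scheduler/progress
```

The point of the explicit `RuntimeError` branch is exactly the reviewer's concern: a detached loss with live parameters should surface as an error, not as a run that "trains" with zero parameter updates.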
ah woops pushed it to wrong branch, ignore last push
No description provided.