Stop training if all components have reached their limit #1416

Open

m4xw wants to merge 1 commit into Nerogar:master from m4xw:upstream
Conversation


m4xw commented Apr 14, 2026

No description provided.

Copilot AI review requested due to automatic review settings April 14, 2026 13:02
Contributor

Copilot AI left a comment


Pull request overview

This PR aims to stop the training loop automatically once all model components have reached their configured stop_training_after limit, avoiding wasted steps when nothing remains trainable.

Changes:

  • Added an epoch-level early-exit when no parameters in self.parameters have requires_grad=True.
  • Guarded loss.backward() behind loss.requires_grad to avoid errors when the loss graph is not differentiable.


Comment on lines +652 to +654

```python
if not any(p.requires_grad for p in self.parameters):
    print("All trainable components have reached their stop_training_after limit. Stopping training.")
    return
```

Copilot AI Apr 14, 2026


The early-stop check only runs once per epoch. requires_grad can be toggled mid-epoch (e.g., many BaseModelSetup.after_optimizer_step() implementations call _setup_model_part_requires_grad(...) based on model.train_progress), so training can continue for the rest of the epoch with nothing trainable. Consider checking this condition right after after_optimizer_step() (and/or before stepping the optimizer) and exiting immediately once all params are frozen.

Also, in multi-GPU this print(...) will run on every rank; gate the message behind multi.is_master() (or route through callbacks.on_update_status) to avoid duplicated output.
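The reviewer's suggestion could look roughly like the sketch below. The names `all_components_frozen`, `maybe_stop_after_optimizer_step`, and the `is_master` flag are hypothetical, not OneTrainer APIs; the helpers are duck-typed over anything exposing a `requires_grad` attribute, so they work on `torch.nn.Parameter` objects as well:

```python
def all_components_frozen(parameters) -> bool:
    """True when no parameter can receive gradients, i.e. every
    component has passed its stop_training_after limit."""
    return not any(p.requires_grad for p in parameters)


def maybe_stop_after_optimizer_step(parameters, is_master: bool) -> bool:
    """Hypothetical check to run right after after_optimizer_step(),
    where requires_grad flags may have just been toggled.

    Returns True if the training loop should exit immediately
    instead of finishing the epoch with nothing trainable."""
    if all_components_frozen(parameters):
        if is_master:  # print only once in multi-GPU runs
            print("All trainable components have reached their "
                  "stop_training_after limit. Stopping training.")
        return True
    return False
```

Calling this at the end of every optimizer step (rather than once per epoch) means the loop exits on the exact step where the last component freezes.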

Comment on lines 758 to +765

```diff
 loss = loss / self.config.gradient_accumulation_steps
-if scaler:
-    scaler.scale(loss).backward()
-else:
-    loss.backward()
+if loss.requires_grad:
+    if scaler:
+        scaler.scale(loss).backward()
+    else:
+        loss.backward()

-has_gradient = True
+    has_gradient = True
```

Copilot AI Apr 14, 2026


Guarding backward() with if loss.requires_grad: prevents the runtime error, but it also allows the loop to keep advancing train_progress, stepping the LR scheduler, and calling optimizer.step() even when nothing is trainable (or when the loss is unexpectedly detached). That can silently produce “training” runs with zero parameter updates. Suggestion: if loss.requires_grad is false, explicitly stop training (when no params require grad) or raise an error when any parameter still has requires_grad=True, and ensure the optimizer/LR-scheduler/progress aren’t advanced in the no-grad case.
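One way to implement the suggested behaviour, as a minimal sketch. The function name `backward_or_stop` and its return contract are assumptions for illustration, not code from this PR; it distinguishes the benign all-frozen case (caller stops cleanly) from a detached loss while trainable parameters remain (a bug, so it raises):

```python
def backward_or_stop(loss, parameters, scaler=None):
    """Run the backward pass if the loss participates in autograd.

    Returns True when gradients were produced (the optimizer may
    step), False when nothing is trainable (the caller should stop
    training instead of advancing progress or the LR scheduler).
    Raises if the loss is detached while trainable params remain."""
    if loss.requires_grad:
        if scaler is not None:
            scaler.scale(loss).backward()
        else:
            loss.backward()
        return True
    if any(p.requires_grad for p in parameters):
        # Trainable parameters exist but the loss graph is detached:
        # this is a bug, not a normal end-of-training condition.
        raise RuntimeError(
            "loss does not require grad although trainable parameters remain")
    return False  # all components frozen: stop training cleanly
```

With this shape, the call site can break out of the training loop on a False return, so train_progress, the LR scheduler, and optimizer.step() are never advanced in the no-grad case.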


m4xw commented Apr 15, 2026

ah woops pushed it to wrong branch, ignore last push
