Stop training if all components have reached their limit #1416

m4xw wants to merge 1 commit into Nerogar:master from
Conversation
Pull request overview
This PR aims to stop the training loop automatically once all model components have reached their configured stop_training_after limit, avoiding wasted steps when nothing remains trainable.
Changes:
- Added an epoch-level early exit when no parameters in `self.parameters` have `requires_grad=True`.
- Guarded `loss.backward()` behind `loss.requires_grad` to avoid errors when the loss graph is not differentiable.
```python
if not any(p.requires_grad for p in self.parameters):
    print("All trainable components have reached their stop_training_after limit. Stopping training.")
    return
```
The early-stop check only runs once per epoch. requires_grad can be toggled mid-epoch (e.g., many BaseModelSetup.after_optimizer_step() implementations call _setup_model_part_requires_grad(...) based on model.train_progress), so training can continue for the rest of the epoch with nothing trainable. Consider checking this condition right after after_optimizer_step() (and/or before stepping the optimizer) and exiting immediately once all params are frozen. Also, in multi-GPU this print(...) will run on every rank; gate the message behind multi.is_master() (or route through callbacks.on_update_status) to avoid duplicated output.
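The suggested fix can be sketched as a step-level check that is re-evaluated right after each per-step `requires_grad` update instead of once per epoch. This is a minimal, self-contained sketch: `parameters`, `after_optimizer_step`, and `is_master` here are stand-ins for OneTrainer's trainer state and callbacks, not the project's real API.

```python
import torch

def all_frozen(parameters) -> bool:
    # True when no parameter can receive gradients.
    return not any(p.requires_grad for p in parameters)

def training_steps(parameters, after_optimizer_step, is_master=True):
    # Yields step indices; exits as soon as everything is frozen,
    # including mid-epoch, since after_optimizer_step() may toggle
    # requires_grad based on training progress.
    step = 0
    while True:
        if all_frozen(parameters):
            if is_master:  # gate output so only rank 0 prints in multi-GPU
                print("All trainable components have reached their "
                      "stop_training_after limit. Stopping training.")
            return
        yield step
        after_optimizer_step()  # may freeze components mid-epoch
        step += 1
```

With a callback that freezes the parameters after three steps, the generator stops immediately on the next check rather than finishing the epoch.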
```diff
     loss = loss / self.config.gradient_accumulation_steps
-    if scaler:
-        scaler.scale(loss).backward()
-    else:
-        loss.backward()
+    if loss.requires_grad:
+        if scaler:
+            scaler.scale(loss).backward()
+        else:
+            loss.backward()
     has_gradient = True
```
Guarding backward() with if loss.requires_grad: prevents the runtime error, but it also allows the loop to keep advancing train_progress, stepping the LR scheduler, and calling optimizer.step() even when nothing is trainable (or when the loss is unexpectedly detached). That can silently produce “training” runs with zero parameter updates. Suggestion: if loss.requires_grad is false, explicitly stop training (when no params require grad) or raise an error when any parameter still has requires_grad=True, and ensure the optimizer/LR-scheduler/progress aren’t advanced in the no-grad case.
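The behavior being suggested can be sketched as follows. This is a hypothetical helper, not OneTrainer's actual code: when the loss carries no gradient, it either signals a clean stop (everything frozen) or fails loudly (the loss was detached by mistake), and only a `True` return tells the caller it is safe to advance the optimizer, LR scheduler, and progress counters.

```python
import torch

class StopTraining(Exception):
    # Raised to end the training loop cleanly.
    pass

def backward_or_stop(loss, parameters, scaler=None):
    # Runs backward(); refuses to silently skip it.
    if not loss.requires_grad:
        if not any(p.requires_grad for p in parameters):
            raise StopTraining  # nothing left to train: stop the loop
        # Trainable parameters remain, so a detached loss is a bug.
        raise RuntimeError("loss is detached but trainable parameters remain")
    if scaler is not None:
        scaler.scale(loss).backward()
    else:
        loss.backward()
    return True  # caller may now step the optimizer/scheduler/progress
```

The point of the explicit `RuntimeError` branch is exactly the reviewer's concern: a detached loss with live parameters should surface as an error, not as a run that "trains" with zero parameter updates.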
ah woops pushed it to wrong branch, ignore last push
No description provided.