Skip to content

Stop the training subprocess on every exit path#81

Open
jfrieli wants to merge 1 commit into
mainfrom
fix/stop-executor-on-abort
Open

Stop the training subprocess on every exit path#81
jfrieli wants to merge 1 commit into
mainfrom
fix/stop-executor-on-abort

Conversation

@jfrieli

@jfrieli jfrieli commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Motivation

Aborting a training cancels the trainer task, but for executor-based trainers (e.g. classification, yolov5) the python train.py subprocess was left running — still holding the GPU. The CancelledError from abort() propagated past _train's only cleanup (except TrainingError), so nothing terminated the executor. This was surfaced by the loop's new abort_training flow (zauberzeug/loop#346 + #80): once a node is on 0.20.0 and extends the executor-based TrainerLogic, abort orphans the process.

Implementation

  • Move the executor cleanup in TrainerLogic._train out of the except TrainingError block into a finally, so the training subprocess is terminated on every exit path (normal return, TrainingError, CancelledError/abort, and any unexpected exception)
  • Use asyncio.shield(self.executor.stop_and_wait()) so the terminate+wait completes even while the task is being cancelled
  • Add test_abort_during_training_kills_executor: starts a real training subprocess, aborts mid-run, asserts executor.is_running() becomes False (no orphan)

Notes / risks

  • stop() (graceful) and abort() remain distinct: finally only guarantees the subprocess is dead; whether _train returns normally (→ upload) or with CancelledError (→ ReadyForCleanup, discard) is unchanged. Covered by the existing test_stop_during_training_uploads_model / test_abort_during_training.
  • The TrainingError path is behavior-preserving (executor still killed, previous_state still reset) — only the internal ordering changed.
  • Relevant train-state tests pass locally. The SocketIO-dependent integration tests could not be run locally (no mock loop running here) and fail identically on the unmodified base; they should be verified in CI.

⬛ claude-opus-4-8[1m]

Move the executor cleanup in TrainerLogic._train from the
`except TrainingError` block into a `finally`, so the training subprocess
is terminated on every exit path - in particular abort(), where a
CancelledError previously propagated past the cleanup and orphaned the
executor process (leaving it holding the GPU).

Add test_abort_during_training_kills_executor asserting the subprocess is
no longer running after abort().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jfrieli jfrieli changed the title Stop the training subprocess on abort Stop the training subprocess on every exit path Jun 16, 2026
@jfrieli

jfrieli commented Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

@denniswittich @NiklasNeugebauer
Ready to review

@denniswittich denniswittich marked this pull request as ready for review June 23, 2026 09:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant