Add on-demand full-state checkpointing for OpenShift AI / KubeFlow preemption #686

Open
RobotSail wants to merge 6 commits into main from claude/on-demand-checkpointing-dCsmo

@RobotSail RobotSail commented Feb 26, 2026

Implements signal-driven checkpoint-and-exit for distributed training jobs
running in OpenShift AI as KubeFlow training jobs or on multi-node bare metal.

When on_demand_checkpointing=True is set in TrainingArgs:

  • Parent process (run_training) installs handlers for SIGTERM, SIGINT,
    SIGUSR1, SIGUSR2, SIGXCPU, and SIGHUP, covering the signals
    Kubernetes/OpenShift may send before the hard SIGKILL.
  • On signal receipt, a trigger file is atomically written to /dev/shm
    (tmpfs, shared within the pod, zero disk I/O).
  • Worker processes check for the trigger file after each optimizer step
    via an all_reduce(MAX) collective, ensuring global consensus across
    all ranks on all nodes.
  • When any rank detects the trigger, all ranks collectively save a
    full-state distributed checkpoint (model + optimizer + LR scheduler)
    then exit gracefully.
  • Parent waits up to 300s for workers to complete the checkpoint before
    proceeding with normal shutdown.
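
To make the first two bullets concrete, here is a minimal sketch of a parent-side handler that writes a trigger file atomically (temporary file in the same directory, then rename). The names `TRIGGER_DIR`, `TRIGGER_NAME`, `write_trigger_file`'s signature, and `install_handlers` are assumptions made for illustration; the PR's actual module may differ.

```python
import os
import signal
import tempfile
from pathlib import Path

# Assumed location and file name; the PR only states that the trigger lives in /dev/shm.
TRIGGER_DIR = Path("/dev/shm")
TRIGGER_NAME = "on_demand_checkpoint.trigger"


def write_trigger_file(directory: Path = TRIGGER_DIR) -> Path:
    """Create the trigger file atomically: write a temp file, then rename it."""
    target = directory / TRIGGER_NAME
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".trigger-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(str(os.getpid()))
        os.replace(tmp_path, target)  # atomic because both paths share a filesystem
    except BaseException:
        # Clean up the temp file if the write or rename failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target


def install_handlers(directory: Path = TRIGGER_DIR) -> dict:
    """Install the handler for the signals listed above; return the previous handlers."""

    def _handler(signum, frame):
        write_trigger_file(directory)

    previous = {}
    for name in ("SIGTERM", "SIGINT", "SIGUSR1", "SIGUSR2", "SIGXCPU", "SIGHUP"):
        sig = getattr(signal, name, None)  # some signals do not exist on non-POSIX platforms
        if sig is not None:
            previous[sig] = signal.signal(sig, _handler)
    return previous
```

Returning the previous handlers lets the caller restore them later with `signal.signal(sig, old)`. Note that `signal.signal` must be called from the main thread.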

https://claude.ai/code/session_01HSxsk7SnMULJxy7uafe7t3

Summary by CodeRabbit

  • New Features
    • Added a training config option and CLI flag to enable on-demand, signal-triggered full-state checkpointing for preemption scenarios.
    • Introduced coordinated parent/worker checkpoint orchestration with namespaced trigger coordination, graceful worker wait/cleanup, and improved exit handling and logging.
    • Added worker-side utilities to detect, save, and remove on-demand checkpoint requests.
    • Batch processing is now interrupt-aware and reports an interrupted flag in batch metrics.


coderabbitai bot commented Feb 26, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f5ee1f49-22cf-446d-a7a0-89c7746899ab

📥 Commits

Reviewing files that changed from the base of the PR and between d089910 and d21bf12.

📒 Files selected for processing (4)
  • src/instructlab/training/batch_loss_manager.py
  • src/instructlab/training/config.py
  • src/instructlab/training/main_ds.py
  • src/instructlab/training/on_demand_checkpoint.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/instructlab/training/config.py

📝 Walkthrough

Adds an opt‑in on‑demand, signal‑triggered full‑state checkpointing mode: new TrainingArgs flag and CLI option, parent-side signal handler to create a shared trigger, worker-side consensus check and save-on-demand flow, and integration points in minibatch processing to interrupt and persist training state.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration (src/instructlab/training/config.py) | Adds public boolean field on_demand_checkpointing: bool = False to TrainingArgs with description. |
| Training Orchestration / CLI (src/instructlab/training/main_ds.py) | Adds --on_demand_checkpointing CLI flag; extends train(...) signature with on_demand_checkpointing; propagates flag into subprocess launch (run_training) and appends CLI arg when enabled; installs/uninstalls ParentSignalHandler in parent; extends parent wait/termination logic and exit-code handling when enabled; integrates checkpoint checks into minibatch loop and around optimizer step. |
| On‑Demand Checkpointing Module (src/instructlab/training/on_demand_checkpoint.py) | New module implementing trigger-file orchestration in /dev/shm: write_trigger_file, trigger_file_exists, remove_trigger_file; ParentSignalHandler that writes trigger on received signals; check_checkpoint_requested (all-reduce consensus across ranks); and save_on_demand_checkpoint wrapper that calls existing checkpoint save with full_state=True/hf_format=True. |
| Batch Processing (src/instructlab/training/batch_loss_manager.py) | Adds interrupted: bool to BatchMetrics, an optional interrupt_check callback parameter to BatchLossManager.process_batch(...), calls the callback at three points per minibatch, and adjusts loss averaging to handle early-interruption cases (supports float or Tensor accumulated losses). |

Sequence Diagram

sequenceDiagram
    participant Parent as Parent Process
    participant Signal as ParentSignalHandler
    participant Worker as Worker Process(es)
    participant Trigger as Trigger File (/dev/shm)
    participant Dist as Distributed Backend
    participant Checkpoint as Checkpoint Storage

    Note over Parent,Worker: On‑demand checkpoint flow

    Parent->>Signal: install()
    Worker->>Worker: training loop -> process_batch(interrupt_check)

    Parent->>Parent: receives termination signal
    Parent->>Signal: handler invoked
    Signal->>Trigger: write_trigger_file(job_id)

    Worker->>Trigger: trigger_file_exists()
    Worker->>Dist: all_reduce(MAX, local_flag)
    Dist-->>Worker: consensus_flag

    alt consensus_flag == true
        Worker->>Checkpoint: save_on_demand_checkpoint(full_state=True)
        Checkpoint-->>Worker: saved
        Worker->>Trigger: remove_trigger_file()
        Worker->>Worker: exit early
    end

    Parent->>Parent: wait for workers (timeout)
    Parent->>Signal: uninstall()
    Parent->>Parent: exit
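
The consensus step in the diagram can be sketched as follows. The MAX reduction means that one rank seeing the trigger file is enough for every rank to agree, which matters when only some ranks can see the file. `check_checkpoint_requested` is named in the PR, but the signature used here, the `device` parameter, and the `simulate_all_reduce_max` helper are assumptions for illustration only.

```python
from typing import Sequence


def simulate_all_reduce_max(local_flags: Sequence[bool]) -> bool:
    """Pure-Python stand-in for all_reduce(MAX): one 0/1 flag per rank."""
    return max(int(flag) for flag in local_flags) == 1


def check_checkpoint_requested(trigger_exists: bool, device: str = "cpu") -> bool:
    """Collective variant; every rank must call this at the same point in the loop."""
    import torch  # imported lazily so the simulation above runs without PyTorch
    import torch.distributed as dist

    # With the NCCL backend the tensor has to live on the rank's GPU, hence `device`.
    flag = torch.tensor([int(trigger_exists)], dtype=torch.int32, device=device)
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)  # in place; every rank receives the max
    return bool(flag.item())
```

For example, `simulate_all_reduce_max([False, True, False])` models three ranks where only the second one found the trigger file; all three would proceed to save.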

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 A tiny tap in shared RAM's nest,
Ranks whisper "save" and do their best.
Parent rings the bell, the file appears,
Workers tuck state, then calm their gears.
Hop! A checkpoint safe — carrots and cheers.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Add on-demand full-state checkpointing for OpenShift AI / KubeFlow preemption' directly and clearly describes the main objective of the pull request, which is to implement signal-driven, on-demand full-state checkpoint-and-exit functionality for distributed training jobs. |


@mergify mergify bot added the ci-failure label Feb 26, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/instructlab/training/on_demand_checkpoint.py (1)

225-229: Consider rank-gating the global-consensus log.

When a checkpoint is requested, every rank logs the same message. Logging only on rank 0 would reduce shutdown-time log bursts on large jobs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/instructlab/training/on_demand_checkpoint.py` around lines 225 - 229, The
log message is emitted by every rank when a checkpoint is requested; gate it to
only run on the main/rank-0 process to avoid log storms. Wrap the existing
logger.info block (the code that runs when requested is truthy) with a check for
the main process—e.g., if torch.distributed.is_initialized() and
torch.distributed.get_rank() == 0: or, if the project exposes a helper like
is_main_process(), use that—then call logger.info only inside that conditional
while leaving the checkpoint request flow unchanged.
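
A rough sketch of the rank-gating suggested above. The helper names `current_rank` and `log_checkpoint_requested`, the logger name, and the log text are hypothetical. The fallback to the `RANK` environment variable (which launchers such as torchrun set) is an assumption about how the job is started.

```python
import logging
import os

logger = logging.getLogger("on_demand_checkpoint_sketch")


def current_rank() -> int:
    """Use torch.distributed when it is initialized, else the RANK env var, else 0."""
    try:
        import torch.distributed as dist

        if dist.is_available() and dist.is_initialized():
            return dist.get_rank()
    except ImportError:
        pass  # PyTorch not installed; fall back to the environment
    return int(os.environ.get("RANK", "0"))


def log_checkpoint_requested(requested: bool) -> bool:
    """Log the consensus message on rank 0 only; return True if a line was logged."""
    if requested and current_rank() == 0:
        logger.info("On-demand checkpoint requested; all ranks will save and exit.")
        return True
    return False
```

Only the log call is gated. The checkpoint decision itself still has to be evaluated on every rank.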
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/instructlab/training/main_ds.py`:
- Around line 882-900: The code computes failure using process.poll() before
sending terminate()/kill(), so if the subprocess exits after forced shutdown the
failure status can be stale; update the logic inside the shutdown path in
main_ds.py to recompute process_code and failure after you perform
terminate()/kill() and any subsequent wait() calls (use process.wait with a
timeout then process.poll()), and then decide whether to log success or raise
based on the new failure value; apply the same fix for the second occurrence
referenced (the block around the later terminate/kill sequence) and reference
process.wait, process.poll, terminate(), kill(), and the logger.error messages
when updating the flow.
- Around line 821-833: The ParentSignalHandler is being instantiated without a
job identifier causing shared trigger files; update the instantiation to pass a
stable job id (e.g., use train_args.job_id or another unique training identifier
available in scope) so ParentSignalHandler(job_id=...) is used and the
handler.install() uses a namespaced trigger path; ensure the same job_id is
passed to any worker-side reader logic so trigger files live under a per-job
namespace instead of the global default.
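
One way to namespace the trigger path per job, as the second comment asks. This is a sketch only: the function name `trigger_path`, the file-name pattern, and the sanitizing rule are assumptions, not the PR's code.

```python
import re
from pathlib import Path
from typing import Optional


def trigger_path(job_id: Optional[str] = None, base: Path = Path("/dev/shm")) -> Path:
    """Return a per-job trigger path; without a job id, fall back to a shared default."""
    if job_id is None:
        return base / "on_demand_checkpoint.trigger"
    # Replace characters that are unsafe in a file name (for example "/").
    safe = re.sub(r"[^A-Za-z0-9_.-]", "_", job_id)
    return base / f"on_demand_checkpoint.{safe}.trigger"
```

The parent and every worker would need to derive the path from the same job id, otherwise the workers poll a file the parent never writes.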

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1f02ea6 and 4d82b3d.

📒 Files selected for processing (3)
  • src/instructlab/training/config.py
  • src/instructlab/training/main_ds.py
  • src/instructlab/training/on_demand_checkpoint.py

- Fix mypy error: properly type _original_handlers dict with _SignalHandler
  type alias instead of bare object
- Fix ruff/isort: remove duplicate comment, fix import ordering
- Namespace trigger file with rdzv_id as job_id so concurrent jobs sharing
  /dev/shm don't interfere with each other
- Recompute subprocess failure status after forced termination to avoid
  stale exit code
- Gate consensus log message to rank 0 to reduce log noise on large jobs

Move the checkpoint request check from after the full optimizer step to
after each minibatch's backward pass inside BatchLossManager.process_batch.
This ensures the system responds within one fwd+bwd cycle (~1-2s) even
when gradient accumulation spans many minibatches, giving more time to
save before Kubernetes sends SIGKILL after the grace period.

The check is passed as an optional interrupt_check callback to keep
checkpoint-specific logic out of BatchLossManager. When triggered, the
batch loop breaks early and the training loop saves the checkpoint
immediately, skipping the optimizer step to preserve the pre-step model
state for exact resumption.

When the training subprocess fails after an on-demand checkpoint signal
was received, the error message now includes guidance to increase
terminationGracePeriodSeconds or reduce fwd/bwd pass time so the
checkpoint check fires before SIGKILL arrives.
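
A minimal sketch of the parent-side shutdown path this describes, including the earlier fix of reading the exit code only after any forced termination. The function name `shutdown_and_report`, the `kill_wait` parameter, and the message wording are hypothetical. The 300 s default mirrors the wait stated in the PR description.

```python
import subprocess
import sys


def shutdown_and_report(
    process: subprocess.Popen,
    signal_received: bool,
    grace: float = 300.0,
    kill_wait: float = 10.0,
) -> int:
    """Wait for the worker, escalate to terminate()/kill() if needed, then report."""
    try:
        process.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        process.terminate()
        try:
            process.wait(timeout=kill_wait)
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()

    # Read the exit code only now, after any forced shutdown, so it cannot be stale.
    code = process.poll()
    if code != 0:
        msg = f"training subprocess exited with code {code}"
        if signal_received:
            msg += (
                "; the on-demand checkpoint may not have completed. Consider "
                "increasing terminationGracePeriodSeconds or reducing the "
                "forward/backward pass time"
            )
        raise RuntimeError(msg)
    return code
```

A quick way to try it is to launch `subprocess.Popen([sys.executable, "-c", "pass"])` and pass the handle in; a clean exit returns 0, while a process that has to be terminated raises with the guidance text appended.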

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/instructlab/training/main_ds.py`:
- Around line 1160-1170: The help text for the "--on_demand_checkpointing"
argparse option is inaccurate: it says workers check "after each training step"
but the implementation triggers checks after each minibatch backward pass (see
BatchLossManager.process_batch). Update the parser.add_argument help string for
"--on_demand_checkpointing" to explicitly say the check happens after each
minibatch/backward pass (or "after each minibatch backward pass") and mention
that this is the granularity for checkpoint-trigger latency so the doc matches
the behavior in BatchLossManager.process_batch.
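
For illustration, the corrected option might look roughly like this. The help wording and the use of `action="store_true"` are assumptions; only the flag name comes from the PR.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with only the on-demand checkpointing flag (sketch)."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--on_demand_checkpointing",
        action="store_true",  # assumed; the real option may be declared differently
        help=(
            "Enable signal-triggered full-state checkpointing. Workers check for "
            "the trigger after each minibatch backward pass, so that is the "
            "granularity of checkpoint-trigger latency."
        ),
    )
    return parser
```

With this parser, `build_parser().parse_args(["--on_demand_checkpointing"])` yields a namespace whose `on_demand_checkpointing` attribute is `True`; omitting the flag leaves it `False`.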

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7cffa29 and d089910.

📒 Files selected for processing (2)
  • src/instructlab/training/batch_loss_manager.py
  • src/instructlab/training/main_ds.py

Update --on_demand_checkpointing help text and TrainingArgs description
to accurately state that workers check for the trigger file after each
minibatch backward pass, not after each training step.

@mergify mergify bot added ci-failure and removed ci-failure labels Mar 2, 2026

Expand on-demand checkpointing to check for a trigger at five points:
1. Before each minibatch forward pass
2. Before each minibatch backward pass
3. After each minibatch backward pass (existing)
4. Before the optimizer step
5. After the optimizer step

This minimizes the latency between a termination signal arriving and
the checkpoint being saved, which is critical when the SIGKILL grace
period is short (e.g. 30s on OpenShift/Kubernetes).

Also cleans up the save-and-exit logic in train() by extracting a
_save_and_exit() helper to eliminate three nearly identical blocks,
and fixes _compute_average_loss to handle the case where the
minibatch loop is interrupted before any forward pass completes.
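
The three per-minibatch check points and the empty-average case can be sketched like this. It is a simplified stand-in, not the real `BatchLossManager`: `BatchMetricsSketch`, the free function `process_batch`, and its `forward`/`backward` callables are hypothetical, and the two optimizer-step checks are left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class BatchMetricsSketch:
    avg_loss: float
    minibatches_done: int
    interrupted: bool


def process_batch(
    minibatches: Iterable[float],
    forward: Callable[[float], float],
    backward: Callable[[float], None],
    interrupt_check: Optional[Callable[[], bool]] = None,
) -> BatchMetricsSketch:
    """Run forward/backward per minibatch, polling interrupt_check at three points."""
    check = interrupt_check or (lambda: False)
    total, done, interrupted = 0.0, 0, False
    for mb in minibatches:
        if check():  # 1. before the forward pass
            interrupted = True
            break
        loss = forward(mb)
        if check():  # 2. before the backward pass
            interrupted = True
            break
        backward(loss)
        total += float(loss)
        done += 1
        if check():  # 3. after the backward pass
            interrupted = True
            break
    # Guard against division by zero when interrupted before any minibatch completed.
    avg = total / done if done else 0.0
    return BatchMetricsSketch(avg, done, interrupted)
```

Passing the check as a callback keeps the loop unaware of trigger files or collectives; the caller decides what "interrupted" means.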

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mergify mergify bot removed the ci-failure label Mar 20, 2026