fix: Remove race condition during multi-GPU training#39

Open
uermel wants to merge 2 commits into MLI-lab:master from uermel:master

Conversation

uermel commented Jun 9, 2026

Fixes #27.

Summary

Under multi-GPU DDP, update_hparam performed an unsynchronized read-modify-write on the shared hparams.yaml, causing intermittent crashes and a corrupted hyperparameters file. This PR restricts the write to the global-zero rank.

Background

fit-model enables DDPStrategy when launched on >1 GPU (ddw/fit_model.py), running one process per rank. Each rank executes the full LightningModule lifecycle.

Call trace into the race

The hooks that reach update_hparam are not rank-gated, so every rank runs them:

on_train_start()        (epoch 0)         # all ranks
  └─ update_normalization()
on_train_epoch_end()    (every N epochs)  # all ranks
  └─ update_normalization()
       └─ update_hparam("unet_params", …)

All ranks resolve the same hparams.yaml (PL syncs the logger version across ranks → one shared version_N/ dir), then each performs:

hparams = yaml.safe_load(open(path, "r"))    # READ
hparams[key] = value                         # MODIFY
open(path, "w") ... yaml.dump(...)           # WRITE (truncates on open)
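The truncate-on-open window can be reproduced in a single process. This is a minimal sketch using only the standard library: the temporary file is a hypothetical stand-in for hparams.yaml, and the two file handles stand in for two ranks. It is not code from this repository.

```python
import os
import tempfile

# Hypothetical stand-in for the shared hparams.yaml.
fd, path = tempfile.mkstemp(suffix=".yaml")
with os.fdopen(fd, "w") as f:
    f.write("unet_params: {}\n")

# "Rank A" opens the file for writing; mode "w" truncates it immediately.
writer = open(path, "w")

# "Rank B" reads inside that window and sees an empty file.
with open(path, "r") as reader:
    seen_by_reader = reader.read()

# "Rank A" finishes its write only afterwards.
writer.write("unet_params: {}\n")
writer.close()

with open(path, "r") as f:
    after_write = f.read()
os.remove(path)

print(repr(seen_by_reader))  # empty string: the reader hit the truncated file
print(repr(after_write))     # full content is back once the writer closes
```

To my understanding, yaml.safe_load returns None for an empty stream, so a rank that reads in this window would then fail on the hparams[key] = value assignment, which would match the sporadic nature of the error.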

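If one rank has just opened the file for writing (which truncates it) or is mid-write while another rank reads it, the reader sees an empty or partially written file; this race causes the error. The fix avoids it by gating the write on the module's global rank attribute, so only the global-zero rank touches the file.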
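A rank-gated update_hparam could look roughly like the sketch below. It is not the diff from this PR: RankedModule is a hypothetical stand-in for the LightningModule (only its global_rank attribute matters here), and json replaces yaml so the example runs with the standard library alone.

```python
import json
import os
import tempfile


class RankedModule:
    """Hypothetical stand-in for the LightningModule; exposes global_rank."""

    def __init__(self, global_rank, path):
        self.global_rank = global_rank
        self.path = path  # stand-in for the shared hparams file

    def update_hparam(self, key, value):
        # Only the global-zero rank performs the read-modify-write.
        if self.global_rank != 0:
            return False
        with open(self.path, "r") as f:
            hparams = json.load(f)   # the real code would use yaml.safe_load
        hparams[key] = value
        with open(self.path, "w") as f:
            json.dump(hparams, f)    # the real code would use yaml.dump
        return True


# Set up a temporary hparams file (assumption: json instead of yaml).
fd, path = tempfile.mkstemp(suffix=".json")
with os.fdopen(fd, "w") as f:
    json.dump({"unet_params": None}, f)

# Simulate four ranks reaching the hook one after another.
wrote = [
    RankedModule(rank, path).update_hparam("unet_params", {"set_by_rank": rank})
    for rank in range(4)
]

with open(path, "r") as f:
    final = json.load(f)
os.remove(path)

print(wrote)  # only the first entry (rank 0) is True
print(final)  # holds the value written by rank 0
```

The ranks run sequentially here, so this only illustrates the gating logic, not real concurrency. With a single writer there is no second process to interleave with, which removes the read-modify-write race between ranks.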

Development

Successfully merging this pull request may close these issues.

Sporadic error when updating hparams