fix: Remove race condition during multi-GPU training by uermel · Pull Request #39 · MLI-lab/DeepDeWedge

uermel · 2026-06-09T05:02:47Z

Fixes #27.

Summary

Under multi-GPU DDP, update_hparam performed an unsynchronized read-modify-write on the shared hparams.yaml, causing intermittent crashes and a corrupted hyperparameters file. This restricts the write to the global-zero rank.

Background

fit-model enables DDPStrategy when launched on >1 GPU (ddw/fit_model.py), running one process per rank. Each rank executes the full LightningModule lifecycle.

Call trace into the race

The hooks that reach update_hparam are not rank-gated, so every rank runs them:

on_train_start()        (epoch 0)         # all ranks
  └─ update_normalization()
on_train_epoch_end()    (every N epochs)  # all ranks
  └─ update_normalization()
       └─ update_hparam("unet_params", …)

All ranks resolve the same hparams.yaml (PL syncs the logger version across ranks → one shared version_N/ dir), then each performs:

hparams = yaml.safe_load(open(path, "r"))    # READ
hparams[key] = value                         # MODIFY
open(path, "w") ... yaml.dump(...)           # WRITE (truncates on open)
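
The "truncates on open" note is the heart of the problem, and it can be reproduced without DDP. The following is a minimal single-process sketch using only the standard library: a temporary file stands in for hparams.yaml, and the file contents and the variable name `seen_by_other_rank` are illustrative assumptions, not taken from the repository.

```python
import os
import tempfile

# Hypothetical stand-in for the shared hparams.yaml; the contents are made up.
fd, path = tempfile.mkstemp(suffix=".yaml")
os.close(fd)
with open(path, "w") as f:
    f.write("unet_params: {chans: 32}\n")

# "Rank A" opens the file for writing. Mode "w" truncates the file right here,
# before any of the new content has been written.
writer = open(path, "w")

# "Rank B" reads at exactly this moment and finds the file empty.
with open(path, "r") as reader:
    seen_by_other_rank = reader.read()

# Rank A finishes its write; by then rank B already holds the empty contents.
writer.write("unet_params: {chans: 32}\n")
writer.close()
os.remove(path)

print(repr(seen_by_other_rank))  # prints ''
```

PyYAML's `safe_load` returns `None` for an empty document, so a rank that reads during this window would likely fail on the following `hparams[key] = value` step, or, if it reads a partially written file, load incomplete data and write that back.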

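If one rank has just opened the file for writing, or is mid-write, while another rank reads it, the reader sees an empty or partial file, and this race causes the error. The fix avoids it by checking the module's global rank attribute, so that only the global-zero rank performs the read-modify-write.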
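
A minimal sketch of the rank gate, not the actual diff from this PR: the function signature, the explicit `global_rank` argument, and the example values are assumptions for illustration (in a LightningModule the rank would come from `self.global_rank`). `json` stands in for PyYAML only so the sketch runs with the standard library; the gating logic is independent of the file format.

```python
import json
import os
import tempfile


def update_hparam(path, key, value, global_rank):
    """Read-modify-write a hyperparameter file, but only on global rank 0."""
    if global_rank != 0:
        # Non-zero ranks never touch the file, so there is a single writer.
        return False
    with open(path, "r") as f:
        hparams = json.load(f)   # READ
    hparams[key] = value         # MODIFY
    with open(path, "w") as f:
        json.dump(hparams, f)    # WRITE
    return True


# Hypothetical hyperparameter file with made-up contents.
fd, path = tempfile.mkstemp(suffix=".json")
os.close(fd)
with open(path, "w") as f:
    json.dump({"lr": 0.001}, f)

# Simulate four ranks reaching the same hook; only rank 0 writes.
wrote = [update_hparam(path, "unet_params", {"chans": 32}, rank) for rank in range(4)]

with open(path, "r") as f:
    final = json.load(f)
os.remove(path)
```

Here `wrote` records which simulated ranks touched the file. With a single writer there is no concurrent reader or writer left to collide with, which is why a rank check suffices and no file lock is needed.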

uermel added 2 commits June 8, 2026 18:41

fix DDP race

07c3ae4

less verbose comment, remove potential silent error

298deb1

uermel wants to merge 2 commits into MLI-lab:master from uermel:master
