Skip to content

test(data_plane): session-scope mooncake fixtures#2838

Open
ZhiyuLi-Nvidia wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
ZhiyuLi-Nvidia:zhiyul/mooncake-init-cleanup-followup
Open

test(data_plane): session-scope mooncake fixtures#2838
ZhiyuLi-Nvidia wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
ZhiyuLi-Nvidia:zhiyul/mooncake-init-cleanup-followup

Conversation

@ZhiyuLi-Nvidia

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Move TQ client fixtures to session scope in a new tests/unit/data_plane/conftest.py so each pytest worker has exactly one mooncake init per session.
  • Drop the per-file tq_client_backends and tq_client fixtures in test_tq_lifecycle.py and test_seqpack_equivalence.py; both files now consume from the shared conftest.
  • Rename tq_clienttq_client_backends in test_seqpack_equivalence.py to match the unified fixture name.

Why

The bug: a race in mooncake's C++ engine — its process-global mount registry survives Python tq.close(), so any re-init in the same pytest worker leaks stale endpoints and the next batch_upsert_from fails with TRANSFER_FAIL (-800).

Production never trips this because the driver creates the client once (bootstrap=True) and workers only attach (bootstrap=False) — one mooncake init per process, ever.

Relation to #2760 — this is the structural fix

#2760 made each test file's fixture module-scoped. That removed within-file re-init and turned CI green for today's two files. But the underlying race between files is not fixed though #2760 helps stabilize the CI tests.

What's still missing after #2760:

  • Each test file owns its own mooncake init/teardown. Two files = two sequential mooncake clients in the same pytest worker = the same race condition that produced -800.
  • Today it happens to work because pytest's between-module teardown timing leaves the C++ registry in a recoverable state for the second init. That's accidental, not by design.
  • Any new test file under tests/unit/data_plane/ that touches mooncake_cpu re-creates the second-init code path and re-trips -800.

What this PR changes: the single TQ client moves to session scope in conftest.py. Now the entire pytest worker shares one mooncake init — the "create a second client" code path that the C++ engine miscompiles doesn't exist anymore. It's the same contract production already uses (one init per process). The bug can't reappear by adding tests.

Test plan

Verified locally on Slurm (interactive partition, 1 node × 8 GPUs)

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia requested a review from a team as a code owner June 16, 2026 09:55
@copy-pr-bot

copy-pr-bot Bot commented Jun 16, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/mooncake-init-cleanup-followup branch from f524c5c to 56d3572 Compare June 16, 2026 10:29
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Jun 16, 2026
@ZhiyuLi-Nvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 56d3572

@ZhiyuLi-Nvidia

Copy link
Copy Markdown
Contributor Author

/ok to test f4615e1

ZhiyuLi-Nvidia and others added 2 commits June 22, 2026 18:26
Move TQ client fixtures to session scope in a new conftest.py so each
pytest worker has exactly one mooncake init per session. Rename
tq_client -> tq_client_backends in test files that use the parametrized
multi-backend fixture.

Root cause: mooncake's C++ engine keeps a process-global mount registry
that survives Python tq.close() (upstream transfer_queue.close() also
leaves mooncake_master running). A second mooncake_cpu init in the
same pytest worker leaks stale segment endpoints; the new client's
first batch_upsert_from then routes to the dead prior endpoint and
returns TRANSFER_FAIL (-800).

Production never trips this -- driver bootstraps once and workers
attach via bootstrap=False (tq_policy.py, sync_rollout_actor.py,
worker_mixin.py). Session-scoping the test fixtures mirrors production:
one mooncake init per pytest worker, period.

Structural follow-up to PR NVIDIA-NeMo#2760 (module-scope), which dodged the bug
for the existing 7 tests but is fragile: adding any new mooncake test
file under tests/unit/data_plane/ would re-trip the race. The shared
conftest.py prevents that by construction.

Verified on Slurm (job 12854381) -- 16 tests passed in 166s, including
all 8 [mooncake_cpu] variants. test_smoke_round_trip setup: 162.96s
(the single session-fixture init). test_smoke_round_trip_backends
[mooncake_cpu] setup: 0.01s (reused). No TRANSFER_FAIL / -800 in the
log.

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
@ZhiyuLi-Nvidia ZhiyuLi-Nvidia force-pushed the zhiyul/mooncake-init-cleanup-followup branch from f4615e1 to 782d78f Compare June 23, 2026 01:27
@ZhiyuLi-Nvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 782d78f

@ZhiyuLi-Nvidia

Copy link
Copy Markdown
Contributor Author

/ok to test 782d78f

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant