test(data_plane): session-scope mooncake fixtures#2838
Open
ZhiyuLi-Nvidia wants to merge 2 commits into
Open
Conversation
f524c5c to
56d3572
Compare
Contributor
Author
|
/ok to test 56d3572 |
Contributor
Author
|
/ok to test f4615e1 |
Move TQ client fixtures to session scope in a new conftest.py so each pytest worker has exactly one mooncake init per session. Rename tq_client -> tq_client_backends in test files that use the parametrized multi-backend fixture. Root cause: mooncake's C++ engine keeps a process-global mount registry that survives Python tq.close() (upstream transfer_queue.close() also leaves mooncake_master running). A second mooncake_cpu init in the same pytest worker leaks stale segment endpoints; the new client's first batch_upsert_from then routes to the dead prior endpoint and returns TRANSFER_FAIL (-800). Production never trips this -- driver bootstraps once and workers attach via bootstrap=False (tq_policy.py, sync_rollout_actor.py, worker_mixin.py). Session-scoping the test fixtures mirrors production: one mooncake init per pytest worker, period. Structural follow-up to PR NVIDIA-NeMo#2760 (module-scope), which dodged the bug for the existing 7 tests but is fragile: adding any new mooncake test file under tests/unit/data_plane/ would re-trip the race. The shared conftest.py prevents that by construction. Verified on Slurm (job 12854381) -- 16 tests passed in 166s, including all 8 [mooncake_cpu] variants. test_smoke_round_trip setup: 162.96s (the single session-fixture init). test_smoke_round_trip_backends [mooncake_cpu] setup: 0.01s (reused). No TRANSFER_FAIL / -800 in the log. Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
f4615e1 to
782d78f
Compare
Contributor
Author
|
/ok to test 782d78f |
Contributor
Author
|
/ok to test 782d78f |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tests/unit/data_plane/conftest.pyso each pytest worker has exactly one mooncake init per session.tq_client_backendsandtq_clientfixtures intest_tq_lifecycle.pyandtest_seqpack_equivalence.py; both files now consume from the shared conftest.tq_client→tq_client_backendsintest_seqpack_equivalence.pyto match the unified fixture name.Why
The bug: a race in mooncake's C++ engine — its process-global mount registry survives Python
tq.close(), so any re-init in the same pytest worker leaks stale endpoints and the nextbatch_upsert_fromfails withTRANSFER_FAIL (-800).Production never trips this because the driver creates the client once (
bootstrap=True) and workers only attach (bootstrap=False) — one mooncake init per process, ever.Relation to #2760 — this is the structural fix
#2760 made each test file's fixture module-scoped. That removed within-file re-init and turned CI green for today's two files. But the underlying race between files is not fixed though #2760 helps stabilize the CI tests.
What's still missing after #2760:
-800.tests/unit/data_plane/that touchesmooncake_cpure-creates the second-init code path and re-trips-800.What this PR changes: the single TQ client moves to session scope in
conftest.py. Now the entire pytest worker shares one mooncake init — the "create a second client" code path that the C++ engine miscompiles doesn't exist anymore. It's the same contract production already uses (one init per process). The bug can't reappear by adding tests.Test plan
Verified locally on Slurm (interactive partition, 1 node × 8 GPUs)