test(data_plane): session-scope mooncake fixtures by ZhiyuLi-Nvidia · Pull Request #2838 · NVIDIA-NeMo/RL

ZhiyuLi-Nvidia · 2026-06-16T09:55:10Z

Summary

Move TQ client fixtures to session scope in a new tests/unit/data_plane/conftest.py so each pytest worker has exactly one mooncake init per session.
Drop the per-file tq_client_backends and tq_client fixtures in test_tq_lifecycle.py and test_seqpack_equivalence.py; both files now consume from the shared conftest.
Rename tq_client → tq_client_backends in test_seqpack_equivalence.py to match the unified fixture name.

Why

The bug: a race in mooncake's C++ engine — its process-global mount registry survives Python tq.close(), so any re-init in the same pytest worker leaks stale endpoints and the next batch_upsert_from fails with TRANSFER_FAIL (-800).

Production never trips this because the driver creates the client once (bootstrap=True) and workers only attach (bootstrap=False) — one mooncake init per process, ever.

Relation to #2760 — this is the structural fix

#2760 made each test file's fixture module-scoped. That removed within-file re-init and turned CI green for today's two files. But the underlying race between files is not fixed though #2760 helps stabilize the CI tests.

What's still missing after #2760:

Each test file owns its own mooncake init/teardown. Two files = two sequential mooncake clients in the same pytest worker = the same race condition that produced -800.
Today it happens to work because pytest's between-module teardown timing leaves the C++ registry in a recoverable state for the second init. That's accidental, not by design.
Any new test file under tests/unit/data_plane/ that touches mooncake_cpu re-creates the second-init code path and re-trips -800.

What this PR changes: the single TQ client moves to session scope in conftest.py. Now the entire pytest worker shares one mooncake init — the "create a second client" code path that the C++ engine miscompiles doesn't exist anymore. It's the same contract production already uses (one init per process). The bug can't reappear by adding tests.

Test plan

Verified locally on Slurm (interactive partition, 1 node × 8 GPUs)

copy-pr-bot · 2026-06-16T09:55:14Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

ZhiyuLi-Nvidia · 2026-06-16T10:30:41Z

/ok to test 56d3572

ZhiyuLi-Nvidia · 2026-06-22T17:36:46Z

/ok to test f4615e1

Move TQ client fixtures to session scope in a new conftest.py so each pytest worker has exactly one mooncake init per session. Rename tq_client -> tq_client_backends in test files that use the parametrized multi-backend fixture. Root cause: mooncake's C++ engine keeps a process-global mount registry that survives Python tq.close() (upstream transfer_queue.close() also leaves mooncake_master running). A second mooncake_cpu init in the same pytest worker leaks stale segment endpoints; the new client's first batch_upsert_from then routes to the dead prior endpoint and returns TRANSFER_FAIL (-800). Production never trips this -- driver bootstraps once and workers attach via bootstrap=False (tq_policy.py, sync_rollout_actor.py, worker_mixin.py). Session-scoping the test fixtures mirrors production: one mooncake init per pytest worker, period. Structural follow-up to PR NVIDIA-NeMo#2760 (module-scope), which dodged the bug for the existing 7 tests but is fragile: adding any new mooncake test file under tests/unit/data_plane/ would re-trip the race. The shared conftest.py prevents that by construction. Verified on Slurm (job 12854381) -- 16 tests passed in 166s, including all 8 [mooncake_cpu] variants. test_smoke_round_trip setup: 162.96s (the single session-fixture init). test_smoke_round_trip_backends [mooncake_cpu] setup: 0.01s (reused). No TRANSFER_FAIL / -800 in the log. Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia · 2026-06-23T02:16:02Z

/ok to test 782d78f

ZhiyuLi-Nvidia · 2026-06-23T08:23:11Z

/ok to test 782d78f

ZhiyuLi-Nvidia requested a review from a team as a code owner June 16, 2026 09:55

ZhiyuLi-Nvidia force-pushed the zhiyul/mooncake-init-cleanup-followup branch from f524c5c to 56d3572 Compare June 16, 2026 10:29

ZhiyuLi-Nvidia added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Jun 16, 2026

copy-pr-bot Bot temporarily deployed to public June 16, 2026 10:31 Inactive

copy-pr-bot Bot temporarily deployed to public June 16, 2026 10:36 Inactive

copy-pr-bot Bot temporarily deployed to public June 22, 2026 17:37 Inactive

copy-pr-bot Bot temporarily deployed to test June 22, 2026 17:39 Inactive

copy-pr-bot Bot temporarily deployed to public June 22, 2026 17:40 Inactive

ZhiyuLi-Nvidia and others added 2 commits June 22, 2026 18:26

chore: fix lint blank line in test_tq_lifecycle

782d78f

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>

ZhiyuLi-Nvidia force-pushed the zhiyul/mooncake-init-cleanup-followup branch from f4615e1 to 782d78f Compare June 23, 2026 01:27

copy-pr-bot Bot temporarily deployed to public June 23, 2026 02:16 Inactive

copy-pr-bot Bot temporarily deployed to test June 23, 2026 02:19 Inactive

copy-pr-bot Bot temporarily deployed to public June 23, 2026 02:20 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(data_plane): session-scope mooncake fixtures#2838

test(data_plane): session-scope mooncake fixtures#2838
ZhiyuLi-Nvidia wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
ZhiyuLi-Nvidia:zhiyul/mooncake-init-cleanup-followup

ZhiyuLi-Nvidia commented Jun 16, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 16, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 22, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 23, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ZhiyuLi-Nvidia commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Relation to #2760 — this is the structural fix

Test plan

Uh oh!

copy-pr-bot Bot commented Jun 16, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 16, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 22, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 23, 2026

Uh oh!

ZhiyuLi-Nvidia commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ZhiyuLi-Nvidia commented Jun 16, 2026 •

edited

Loading