Skip to content

Dynamo Nemo-RL K8s integration#2429

Draft
jthomson04 wants to merge 68 commits into
NVIDIA-NeMo:mainfrom
jthomson04:dynamo-k8s-integration
Draft

Dynamo Nemo-RL K8s integration#2429
jthomson04 wants to merge 68 commits into
NVIDIA-NeMo:mainfrom
jthomson04:dynamo-k8s-integration

Conversation

@jthomson04

Copy link
Copy Markdown
Contributor

What does this PR do ?

Add a one line overview of what this PR aims to accomplish.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

@copy-pr-bot

copy-pr-bot Bot commented May 6, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@jthomson04 jthomson04 force-pushed the dynamo-k8s-integration branch from f929d7d to 582c243 Compare May 11, 2026 16:28
@terrykong terrykong mentioned this pull request May 11, 2026
@github-actions

Copy link
Copy Markdown

✅ Submodule Fast-Forward Check Results

Check based on commit: 0159c21 (PR #2429 from dynamo-k8s-integration)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

@jthomson04 jthomson04 force-pushed the dynamo-k8s-integration branch from 2e5307d to 0159c21 Compare May 11, 2026 23:19
@github-actions

Copy link
Copy Markdown

✅ Submodule Fast-Forward Check Results

Check based on commit: 0159c21 (PR #2429 from dynamo-k8s-integration)

✅ Submodules that are properly updated:

Gym: ✅ PR branch is ahead of main branch (fast-forward)

All submodule changes look good! ✨

jthomson04 and others added 23 commits May 25, 2026 02:45
Adds policy.generation.backend=dynamo, a Kubernetes-only generation
backend that forwards rollouts to an externally-managed DynamoGraphDeployment
frontend over HTTP. The class is a thin wrapper around the resolved frontend
URL — no etcd / NATS / worker subprocess management. The DGD owns the
inference stack; nemo-rl just points nemo-gym at it.

Two ways to specify the frontend (mutually exclusive in the config):
  * dgd_name (+ optional namespace, frontend_port) — the class derives the
    cluster-internal URL from the dynamo operator's stable Service naming
    convention (<dgd-name>-frontend). Requires running inside a pod.
  * frontend_url — explicit URL escape hatch for hand-rolled DGDs, external
    clusters, or non-K8s environments. Disables the in-pod assertion.

GRPO setup wiring:
  * is_dynamo flag forces colocated_inference=False and routes all GPUs to
    training (the DGD handles inference on its own pods).
  * Dispatch case mirrors the vllm/sglang branches via
    initialize_generation_with_policy.
  * NEED_REFIT=False for the dynamo backend in both grpo_train and
    async_grpo_train — refit isn't supported in this phase, so dynamo
    runs are effectively frozen-policy (eval / inference-only experiments).
    Live refit deferred to a later phase via DGD restart or in-place
    worker reload.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Teaches the nrl-k8s CLI to bring up a DynamoGraphDeployment (DGD)
alongside the training RayCluster. After this lands, a single
nrl-k8s run --raycluster brings up both, waits for the DGD to be
state=successful, and stamps the DGD's name into the recipe before
submitting the training Ray Job.

Schema (infra/nrl_k8s/src/nrl_k8s/schema.py):
  * New DynamoGraphSpec — references a standalone DGD manifest by path
    (typically one of dynamo/recipes/...). Supports name override and
    deep-merged overrides without forking the upstream recipe.
  * New infra.dynamo: dict[str, DynamoGraphSpec], parallel to
    kuberay and deployments.

DGD module (infra/nrl_k8s/src/nrl_k8s/dgd.py):
  * load_dgd_manifest — resolves repo-relative paths against the infra
    YAML's directory; picks the DGD doc out of multi-doc files (skipping
    benchmark Pods that ship alongside in dynamo/recipes/).
  * build_dgd_manifest — deep-merges DynamoGraphSpec.overrides onto the
    loaded .spec, applies metadata.name override, sets namespace, merges
    labels, and patches cross-cutting infra fields (image as default
    only, imagePullSecrets, serviceAccount) across services[*].extraPodSpec.
  * resolve_dgd_name — for the recipe-injection path, returns the
    post-override metadata.name.
  * apply_dgd / get_dgd / delete_dgd / wait_for_dgd_ready / wait_for_dgd_gone
    mirror the RayCluster helpers in k8s.py.
  * is_dgd_crd_installed — namespaced list probe; treats 403 as
    "installed, user lacks list-RBAC" so restricted RBAC doesn't false-
    positive trigger the install hint.

Orchestrator (infra/nrl_k8s/src/nrl_k8s/orchestrate.py):
  * ensure_dgd / delete_dgd mirror ensure_deployment semantics
    (idempotent reuse on match, warn on drift, --recreate to replace).
  * _inject_dynamo_into_recipe stamps policy.generation.backend=dynamo and
    policy.generation.dynamo_cfg.dgd_name=<resolved-name> when exactly
    one DGD is declared. Multi-DGD configs leave the recipe alone.
  * run() brings up DGDs alongside Deployments before RayClusters.
  * LoadedConfig.infra_source_path tracks where the dynamo: block was
    declared so manifest paths resolve correctly in both bundled and
    split layouts.

CLI (infra/nrl_k8s/src/nrl_k8s/cli.py):
  * --target dynamo.<key> resolves alongside kuberay.<role> and
    deployments.<key>. cluster up/down dispatch through ensure_dgd /
    delete_dgd; --dry-run prints the rendered DGD manifest.
  * nrl-k8s check fails fast with a helmfile-install hint when
    infra.dynamo is set but the DGD CRD is missing.
  * _print_check_summary surfaces a DYNAMO section listing each DGD's
    services and ready-timeout.

Helmfile (infra/helm/):
  * dynamo-platform release added with values that mesh with our
    existing kai-scheduler install (install=false, enabled=true) and
    use bundled etcd + NATS for service discovery / event plane.

Examples:
  * Working recipe + infra pair targeting kind clusters
    (dynamo_qwen3_0.6b.{yaml,kind.infra.yaml}) plus a minimal DGD
    manifest (examples_dgd/qwen3_0.6b_kind.yaml).

Out of scope this PR (follow-ups):
  * --rayjob mode integration (needs ownerReferences on the DGD pointing
    at the RayJob).
  * Grove integration for multi-node DGDs.
  * Refit + planner autoscaling (Phase 3 — borrow slime/slynamo's
    external_discovery.py topology-fingerprint pattern).

Tests: 241 unit tests pass under Python 3.12. nrl-k8s check renders the
new example pair end-to-end against a live cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
When nrl-k8s is the one applying a DGD (i.e. infra.dynamo.<key> is
declared), tie its lifetime to the training RayCluster via Kubernetes
ownerReferences. K8s GC cascades the DGD when the RayCluster is deleted,
so inference GPUs free at the same moment as the training GPUs without
hand-rolled cleanup logic.

Owner-of-choice: RayCluster, not RayJob — picking the cluster as parent
means DGD pods die the moment shutdownAfterJobFinishes fires, instead of
hanging around until ttlSecondsAfterFinished (default 1h).

User-managed DGDs (recipes that only set policy.generation.dynamo_cfg.frontend_url
without an infra.dynamo entry) are untouched: nrl-k8s never applies the
DGD, so it never sets an ownerReference, so it has nothing to cascade.

Implementation:
  * dgd.build_owner_reference helper sets controller=False
    (the dynamo operator already controls the DGD; we're a non-controlling
    owner solely for GC).
  * dgd.build_dgd_manifest accepts an owner_ref kwarg that lands in
    metadata.ownerReferences[0].
  * ensure_dgd threads owner_ref through to the builder.
  * k8s.wait_for_rayjob_raycluster_name polls the RayJob's
    .status.rayClusterName so the rayjob path can resolve KubeRay's
    auto-generated cluster name.

Long-lived path (orchestrate.run):
  * When any DGDs are declared, ensure the training RayCluster *first*,
    fetch its UID, and pass an ownerReference into the DGD apply loop.
  * When no DGDs are declared, behaviour is unchanged (no extra API call).

Rayjob path (cli._run_rayjob):
  * After applying the RayJob, poll for its .status.rayClusterName, look
    up the RayCluster's UID, then apply each DGD with an ownerReference
    pointing at the RayCluster. The DGD apply happens in parallel with
    KubeRay's own entrypoint submission — the entrypoint should
    tolerate a brief window where DGD pods are still coming up
    (a /health curl-loop in the recipe is the natural pattern).
  * --dry-run also renders the DGD manifest (without owner ref, since
    the UID isn't known until apply time).

Tests: 249 unit tests pass (8 new). Covers owner_ref attachment, the new
rayjob status poll, and end-to-end orchestrate.run wiring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
The qwen3-30b infra entrypoint pipes `python ... 2>&1 | tee "$LOG"` under
/bin/dash, which has no `set -o pipefail` and no $PIPESTATUS. A Python
crash (e.g. ImportError before training starts) leaves the pipeline
exit 0 because tee succeeds, and KubeRay records the RayJob as
SUCCEEDED — the failure is silent until someone reads the log.

Route python's stdout/stderr through a fifo that tee drains, so the
shell sees python's real exit code and `exit "$RC"` propagates it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`nrl-k8s status` looked up the RayCluster by the bare `cluster.name`
(the value users put in `kuberay.<role>.name`). In `--rayjob` mode
KubeRay creates the RayCluster with a 5-char random suffix and writes
the suffixed name to the RayJob's `.status.rayClusterName`, so the bare
lookup 404s and every role rendered as "(not found)" for the entire
run lifetime.

When the bare cluster lookup misses, fall through to a RayJob lookup
on the same name and follow `.status.rayClusterName` to find the
suffixed cluster. The displayed `name` stays the configured name;
pod listing and daemon dashboard URLs use the resolved name.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`_patch_dgd_service_account` was unconditionally overwriting
`serviceAccountName` on every service in a DynamoGraphDeployment, which
broke the dynamo operator's standard pattern of generating a per-DGD
`<dgd>-k8s-service-discovery` SA with RBAC for `endpointslices` and
`dynamoworkermetadatas`. With our infra's SA injected instead, the
worker pods 403 on their discovery reflectors and the DGD deadlocks
at state=pending.

The dynamo operator owns the SA wiring for DGD pods. nrl-k8s should
only honour an explicit `serviceAccountName` already declared in the
manifest's extraPodSpec; otherwise leave the field unset so the
operator can fill it in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
Two coupled changes that together let the dynamo backend run a full
GRPO step end-to-end through nemo-gym:

1. `_should_use_async_rollouts` and `_should_use_nemo_gym` in
   `algorithms/grpo.py` accept `backend == "dynamo"`. DynamoGeneration
   exposes `dp_openai_server_base_urls` (the DGD frontend URL) the same
   way an async-vLLM generator does, so the gym dispatch path works
   without further changes — the gates just hard-asserted vLLM only.
   The vllm-specific `expose_http_server` check is now scoped to the
   vllm branch (Dynamo always exposes a frontend; there's no analogous
   knob).

2. `print_performance_metrics` in `algorithms/utils.py` short-circuits
   to `training_num_gpus = total_num_gpus` and `generation_num_gpus = 0`
   when `backend == "dynamo"`. With dynamo, generation lives in a
   separate DGD outside the Ray cluster, so the existing read of
   `policy.generation.colocated.resources.gpus_per_node` (which is null
   on the dynamo path) was raising `TypeError: unsupported operand
   type(s) for *: 'NoneType' and 'int'` after a successful step.
   Also guarded the per-GPU generation-throughput division to avoid a
   ZeroDivisionError in this branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`_postprocess_nemo_gym_to_nemo_rl_result` asserts that the policy's
`prompt_token_ids` for turn N+1 form a byte-identical extension of
the tokens accumulated through turn N. Against the Dynamo
`tokenize-endpoint` image (jwillthomson/dynamo-arm-rl-tokenize-endpoint-*)
this fires immediately on the first multi-turn rollout — the frontend
appears to re-tokenize the chat history rather than carry token IDs
verbatim across turns.

Disable the assert to a `RuntimeWarning` so the dynamo+nrl-k8s
integration smoke can validate the rest of the pipeline (gym dispatch,
reward computation, Megatron logprobs/training step, perf metrics
print, teardown). A `TODO(dynamo-smoke):` block marks the spot to
re-enable once the tokenize endpoint returns verbatim token IDs (or
nemo-gym is taught to re-derive contiguity from text + tokenizer).

Until then, advantages and logprobs computed on the dynamo path are
approximate — quality numbers from this rollout flow can't be trusted
yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
End-to-end smoke pair for the dynamo k8s integration on a GB300 NVL72:

  * recipe: examples/nemo_gym/grpo_workplace_assistant_dynamo_smoke.yaml
    Inherits from grpo_workplace_assistant_nemotron_nano_v2_9b.yaml,
    swaps in Qwen3-4B-Thinking-2507 with Megatron TP=1, clears the
    Nemotron MoE knobs, sets policy.generation.backend=dynamo, and
    trims to 4 rollouts/step × 1 step (no validation, no checkpoint,
    no wandb).

  * infra: infra/nrl_k8s/examples/grpo_workplace_assistant_dynamo_smoke.gb300.infra.yaml
    1-GPU Ray training cluster + a `dynamo:` block referencing the
    DGD manifest. The entrypoint passes `+policy.generation.dynamo_cfg.dgd_name=${user:}-dynamo-wpa-smoke`
    as a Hydra override since the orchestrator's auto-inject only fires
    in code_source: upload mode.

  * DGD: infra/nrl_k8s/examples_dgd/qwen3_4b_thinking_gb300.yaml
    Frontend on customer-cpu, 1× VllmDecodeWorker on a GB300 node, with
    `nvidia.com/kai-scheduler-queue: backfill` annotation (the
    operator's default queue `dynamo` is rejected by Kyverno on this
    cluster). vLLM worker uses --dyn-tool-call-parser hermes and
    --dyn-reasoning-parser qwen3.

Smoke validates: nrl-k8s cluster up + DGD apply + DGD ready + gym
3-server bring-up + Dynamo HTTP rollouts + Megatron policy step +
perf metrics print, in ~2m15s on a warm cluster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jothomson@nvidia.com>
`mini_swe_agent → litellm → policy_model → Dynamo` JSON-serializes token
IDs and logprobs as floats; `torch.tensor(...)` then infers FloatTensor,
which downstream embedding lookups reject ("Expected indices to have
scalar types Long, Int; got FloatTensor"). Pin dtype=long on the
prompt/generation token_ids and dtype=float32 on generation_logprobs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…odule

Reverts the assertion-to-warning downgrade from a847616 ("chore(nemo-gym):
downgrade token-contiguity assert to a warning"). That downgrade was a
temporary unblock for Dynamo smokes while the underlying multi-turn
re-tokenization issue was diagnosed.

Root cause: Dynamo's chat preprocessor and tokenize endpoint honor a
required_prefix_token_ids field that splices verbatim model-emitted tokens
into the template-tokenized prefix at the EOS turn boundary, but only when
the field is explicitly populated (no auto-derive on Dynamo's side, unlike
the NeMoRLOpenAIChatRequestMixin in vllm_worker_async.py).

Fix lives in the gym submodule (bumped here): vllm_model app.py now
auto-derives required_prefix_token_ids from the latest assistant message's
prompt_token_ids + generation_token_ids and forwards it to both /chat and
/tokenize. With that in place, the contiguity invariant holds on Dynamo
just as it always has on vLLM, so the strict assert is correct again.

Verified: 80/80 multi-turn rollouts on Dynamo (Qwen3-4B-Thinking +
workplace_assistant gym, 5 steps x 16 rollouts), KL 0.0005-0.0042,
reward learning healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dynamo operator marks .status.state=successful once it has
reconciled the desired-vs-observed pod count, but pods may still be in
ContainerCreating / image-pull / probe-failing at that moment. The CLI
would return from wait_for_dgd_ready while pods were still warming up,
and the first training-side request would hit connection-refused.

Tighten the gate to three conditions instead of one:
  * operator state == successful (existing behavior)
  * every pod owned by the DGD has all containers Ready (kubelet view)
  * the frontend service answers an HTTP request via the K8s API
    server proxy (proves the python dynamo.frontend process has
    actually bound to :8000, not just that the pod is Ready)

The frontend HTTP probe goes through connect_get_namespaced_service_proxy
so it works without standing up a port-forward. 404 from /v1/models is
treated as ready (listener is up, just doesn't recognize the path);
503 / 502 / connection-refused all fall back to "not ready".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two independent smoke artifacts:

1. Nemotron-3-Nano-30B-A3B-BF16 proof of life on Dynamo + mini_swe_agent
   (3 files). The 30B Mamba-hybrid MoE is the first non-Qwen3 model run
   through the Dynamo path on k8s. Working shape: 2 GB300 worker pods
   (TP=4 × DP=2, AdamW state sharded across all 8 ranks via the
   distributed optimizer — no host offload). 1 single-GPU Dynamo
   decoder serves the rollouts, with --dyn-tool-call-parser nemotron_deci
   + --dyn-reasoning-parser nemotron_nano and mamba_ssm_cache=float32.
   Recipe sets moe_per_layer_logging=false explicitly (parent V2 9B
   recipe doesn't ship it; the policy worker reads the key and would
   KeyError without it) and make_sequence_length_divisible_by=4 (TP=4
   + sequence_parallel requires the seq-len-divides constraint).

2. WPA planner load source recipe (1 file). 32 prompts × 8 generations
   = 256 in-flight WPA rollouts that drive KV utilization past the
   Dynamo planner's static `latency` scale-up threshold (40%). Pure
   chat traffic — no singularity / SWE-bench CPU overhead — so the load
   lands almost entirely on the decoder's KV cache. Pair this with a
   planner-enabled DGD (single decoder, max_gpu_budget=4) and watch the
   planner scale replicas up toward the cap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous order (apply RayJob → wait for RayCluster → apply DGD with
ownerRef) lost a race against KubeRay: the driver entrypoint is fired
as soon as the RayCluster head is Ready, but DGD pods are still being
admitted, image-pulled, and probed at that point. The gym's first
generation request would hit connection-refused on the dynamo frontend.

Invert the order so inference is already serving before the driver
starts:
  1. apply each DGD (no ownerRef yet — the RayCluster doesn't exist)
  2. wait for DGDs to reach the three-gate readiness state
  3. apply the RayJob
  4. wait for KubeRay to spawn the RayCluster, get its UID
  5. PATCH the ownerReference back onto each DGD

K8s GC reconciles ownerReferences continuously, so adding the ref late
still produces cascade-delete-on-RayCluster-shutdown — the DGD is just
not anchored during the brief window between steps 1 and 5.

Rollback paths cover both failure modes: if a DGD apply fails partway
through, any siblings already up are deleted (they have no parent yet
so GC won't reap them); if the RayJob apply fails, all DGDs that were
brought up are deleted so the next attempt isn't blocked by name
collision on the admission webhook.

Works across KubeRay versions — no dependency on .spec.suspend or any
other feature gate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ds fix

The gym branch jthomson04/required-prefix-token-ids was rebased onto
current NVIDIA-NeMo/Gym main and dropped three unrelated tool-call /
reasoning-content fixes that had accumulated on the original commit.
The submodule now points at the scoped fix: one commit, one file, just
the required_prefix_token_ids auto-derive on /chat + tokenize allowlist.

Functionally equivalent to the previous pointer for the contiguity
assert.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d refit

Adds a third weight-sync path alongside the existing IPC ZMQ (colocated)
and NCCL collective (non-colocated) paths. Opt-in via
cluster.weight_sync.method: "mx" + an MxConfig.

Why a third path:
* Cross-node like NCCL collective, but uses NIXL RDMA so it's sub-second
  even on multi-TB models.
* Rank-to-rank by default: each trainer rank publishes only its local
  DTensor shard (no tensor.full_tensor() allgather, no rank-0 funnel),
  the same-rank inference peer pulls directly via RDMA.
* MoE-aware: when ep_world_size > 1, only the owning rank's expert
  shard transits the wire; receivers in EP mode skip non-owned experts.
* Topology-aware: tree fan-out (TensorHub paper 2604.09107v1 pipeline
  replication) -- a receiver becomes a source after refit so the next
  receiver doesn't contend on the trainer's NIC.
* Elastic: NIXL connections are dynamic; rollouts can join/leave
  mid-run without rebuilding a NCCL group.

What's added:
* nemo_rl/distributed/mx_helpers.py:
  - MxConfig (parsed from cfg.cluster.weight_sync) with knobs:
    same_rank_only (default True; required on GCP GB200 / AWS EFA where
    cross-NIC subnets are unrouted), tree_scale_out, moe_expert_filter,
    nic_pin (default "auto" via NUMA-local IB device probe),
    retain_latest_k.
  - build_v2_publisher / build_v2_receiver convenience constructors that
    also pin the local NIC before NIXL initializes.
  - collect_named_local_shards: walks state_dict yielding tensor.to_local()
    for DTensors; never calls full_tensor().
  - detect_moe_expert_layout: heuristic detection of MoE expert tensors
    + ownership map; configurable via NRL_MX_EXPERT_TENSOR_PATTERN.

* New abstract methods (default NotImplementedError; opt-in per backend):
  - ColocatablePolicyInterface.stream_weights_via_mx(version, mx_config)
  - GenerationInterface.update_weights_via_mx(version, mx_config)

* Implementations:
  - DTensorPolicyWorker.stream_weights_via_mx: uses tensor.to_local()
    instead of full_tensor() (no allgather), MxV2TrainingPublisher with
    cached NIXL registration across optimizer steps.
  - VllmInternalWorkerExtension.update_weights_via_mx: discovers
    same-rank source, RDMA-receives into vLLM's named_parameters,
    runs through _load_weights for FP8 / GPT-OSS transpose / draft-weight
    handling, optionally republishes self for tree fan-out.
  - vllm_generation.py + vllm_worker.py + lm_policy.py: Ray-actor and
    worker-group plumbing.

* Driver: refit_policy_generation in algorithms/grpo.py now branches on
  weight_sync_method == "mx" in the non-colocated path. Default behavior
  stays NCCL collective for backward compat.

Validated end-to-end on a real GB200 cluster (dynamo-gcp-dev-02, kavin
namespace) via the standalone v2_moe_e2e_demo.py in NVIDIA/modelexpress
on branch kavink/nemo_rl_moe: 4 ranks x 2 refit cycles, NIXL RDMA,
all transfers correct, same-rank routing + freshness dedup + MoE expert
sharding all behaving as designed.

Designs and learnings:
* pensieve/RL/NemoRL/04_design_v2_moe_rank_to_rank.md (the v2 design)
* pensieve/RL/PrimeRL/06_status_2026_05_06.md (the GB200 RDMA-fabric
  lessons that made same_rank_only the default)
Adds the trainer-side hooks needed to drive Kavin's cherry-picked refit
(2e5307d) through the Dynamo backend, plus two fixes uncovered in
bring-up on the Qwen3-4B-Thinking + GB300 smoke:

- grpo.py: pass `weight_sync_method` + `mx_config` through every
  `refit_policy_generation` call site; parse `cluster.weight_sync.mx_config`
  via `MxConfig.from_dict` at trainer startup; drop the Dynamo backend's
  `NEED_REFIT = False` short-circuit so MX refit actually runs.
- generation/dynamo/config.py: `DynamoGenerationConfig` gains the fields
  the MX path needs.
- generation/dynamo/dynamo_generation.py: implement
  `prepare_refit_info` / `update_weights_via_mx` as no-ops (receiver-side
  polling makes the trainer-→-worker RPC optional).
- policy/workers/dtensor_policy_worker.py: `stream_weights_via_mx` was
  referencing `self.world_size` / `getattr(self, 'model_name', ...)` which
  don't exist on this worker (Kavin's commit only validated against his
  Megatron path) — switch to `self.dp_size` and `self.cfg['model_name']`
  so the publisher initializes correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
End-to-end smoke for the cherry-picked MX path: Qwen3-4B-Thinking-2507
on GB300, 2 steps × 4 rollouts. Validates the trainer publishing local
DTensor shards to a ModelExpress server and the DGD's vLLM worker
pulling them via NIXL RDMA.

Adds:
- examples/nemo_gym/grpo_workplace_assistant_dynamo_mx.yaml — recipe
  inheriting from the non-MX smoke and adding cluster.weight_sync.method=mx
- infra/nrl_k8s/examples/grpo_workplace_assistant_dynamo_mx.gb300.infra.yaml
  — RoCE DRA on the trainer, IPC_LOCK for NIXL pinned host memory, points
  at the MX DGD manifest below
- infra/nrl_k8s/examples_dgd/qwen3_4b_thinking_gb300_mx.yaml — DGD with
  DYN_MX_REFIT_ENABLED, RoCE claim on the worker
- infra/nrl_k8s/dynamo_mx/ — supporting files: modelexpress-server.yaml
  (Redis-backed deploy), bootstrap_mx.sh (PYTHONPATH override +
  NIXL_PLUGIN_DIR workaround), Dockerfile.nemorl (overlay that bakes
  nixl-cu12 + modelexpress + protobuf>=6.31 into the trainer venv),
  DEBUGGING_POSTMORTEM.md (full diagnosis of the UCX-CUDA classification
  blocker we're working through with the NIXL team)

Status: trainer side is end-to-end clean (publishes 400 tensors per step,
MX server confirms). Worker side reaches NIXL transport but fails at
ucp_mem_query — UCX classifies CUDA buffers as host inside vLLM v1's
EngineCore Worker, independent of UCX version (1.20.x or 1.21.x).
Postmortem has the full evidence trail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Replace the no-op DynamoGeneration.update_weights_via_mx placeholder with
a real synchronous orchestrator. The trainer drives each refit cycle
end-to-end and the call returns only after all workers have actually
refitted (or raises loud on any failure), eliminating the
silent-stale-weights failure mode of the prior receiver-side polling
architecture (DEBUGGING_POSTMORTEM §15).

Algorithm:
  1. GET <dgd-frontend>:8000/health → enumerate worker pods from
     instances[*], dedupe per instance_id; system_url is
     http://<podIP>:<DYN_SYSTEM_PORT> (default 9090).
  2. For each new instance:
       POST {system_url}/engine/pause_generation
       POST {system_url}/engine/update_weights_via_mx   (blocks on NIXL receive)
       POST {system_url}/engine/flush_cache
       POST {system_url}/engine/resume_generation       (try/finally)
  3. Re-discover via /health; if new instance_ids appeared (scale-up
     mid-cycle), refit those too. Bounded at 5 convergence passes.

The worker-side /engine/<route> endpoints are registered by the dynamo
branch jthomson04/tokenize-endpoint-merge-main-05-07 (commits
0e5590e812 + 8590c2694e). Uses the dynamo frontend's existing
GET /health endpoint for discovery — no new RBAC, no DYN_ENABLE_RL,
no biswapanda RL-discovery PR dependency.

E2E validated on Qwen3-4B-Thinking + Dynamo vLLM v1 GRPO smoke:
  step=1: RDMA transfer complete: 8.82 GB, 399 tensors, 0.20s, 358.5 Gbps
  step=2: RDMA transfer complete: 8.82 GB, 399 tensors, 0.18s, 386.0 Gbps

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
…nstead

The new dynamo worker image (built from
jthomson04/tokenize-endpoint-merge-main-05-07 @ 8590c2694e with
ENABLE_MODELEXPRESS_P2P=true + MODELEXPRESS_REF=8594fd6) bakes in
everything bootstrap_mx.sh was patching at container start:

  * dynamo Python source with MX refit + /engine/<route> handlers
  * modelexpress @ kavink/nemo_rl_moe @ 8594fd6 (sidecar fix +
    post-rebase fixups)
  * UCX bundled inside the nixl_cu12 wheel (commit 8590c2694e:
    auditwheel repair + contrib/wheel_add_ucx_plugins.py)

So we can drop:
  * bootstrap_mx.sh        — PYTHONPATH shadows + NIXL_PLUGIN_DIR probe
  * dev_sync.sh            — rsync helper for the dynamo-dev/ shadow
  * /mnt/rl-workspace/.../dynamo-dev/  (no longer in this repo, just
    deleting the Lustre flat copy)

DGD worker spec now:
  * Uses the new image tag jwillthomson/dynamo-arm-tokenize-endpoint-8590c26
  * Invokes `python3 -m dynamo.vllm` directly (no bash -c source)
  * Sets NIXL_PLUGIN_DIR in env (replaces the bootstrap probe — the
    dynamo operator hardcodes a path that only has GDS, not UCX, so the
    venv path needs to win)

README updated to reflect the bake-in flow + drop the hot-reload docs.

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
UCX/NIXL env tuning that the GB300 RoCE smoke needs to route
prep_xfer_dlist correctly + temporary trainer-image workarounds that
unwind with a separate nemo-rl-mx image rebake.

UCX/NIXL env (both trainer pod's MxV2TrainingPublisher and the DGD
worker — DGD env is set in examples_dgd/qwen3_4b_thinking_gb300_mx.yaml):
  - UCX_TLS=rc,cuda_copy  (restricted: the full
    cuda_copy,cuda_ipc,rc,sm,self caused prep_xfer_dlist to fail for
    cross-pod descriptors because UCX picks sm/cuda_ipc/self first)
  - NIXL_UCX_TLS=rc,cuda_copy
  - UCX_IB_GPU_DIRECT_RDMA=yes
  - UCX_CUDA_COPY_DMABUF=yes
  - MX_RDMA_NIC_PIN=auto (activates per-rank NUMA-local NIC selection
    via modelexpress.ucx_utils.apply_nic_pin_for_device)

Trainer-side workarounds (unblock when nemo-rl-mx image is rebaked with
NIXL_VERSION=0.10.1 + MX_REF=8594fd6):
  - PYTHONPATH shadow over baked modelexpress (1d8049d5 → 8594fd6
    rebased branch on Lustre)
  - Per-actor `pip install --force-reinstall nixl-cu12==0.10.1` loop in
    entrypoint (image bakes 1.1.0; empirically validated as
    load-bearing — DEBUGGING_POSTMORTEM §15)
  - lifecycle.postStart on head + worker doing the same downgrade from
    a Lustre-staged wheel (DNS-free path)

Resource downsize for fit on customer-cpu / GB300 capacity:
  - Head: limits 16→8 CPU, 64→32 Gi
  - GPU worker: limits 32→16 CPU, 200→100 Gi

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
…-mx image

The new trainer image jwillthomson/nemo-rl-mx:05-24 bakes in
nixl-cu12==0.10.1 and modelexpress @ kavink/nemo_rl_moe @ 8594fd6
directly (built from Dockerfile.nemorl with NIXL_VERSION=0.10.1 +
MX_REF=8594fd6), matching the dynamo worker image's versions. So we
can drop the runtime workarounds that were keeping the trainer
aligned with the worker:

  - PYTHONPATH shadow over /mnt/rl-workspace/<user>/modelexpress/...
    in the trainer entrypoint
  - Per-actor `pip install --force-reinstall nixl-cu12==0.10.1` loop
    in the trainer entrypoint
  - lifecycle.postStart blocks on head + worker installing the same
    downgrade from /mnt/rl-workspace/jothomson/nixl_wheels

E2E validated:
  step=1: RDMA transfer complete: 8.82 GB, 399 tensors, 0.20s, 358.0 Gbps
  step=2: RDMA transfer complete: 8.82 GB, 399 tensors, 0.18s, 384.9 Gbps

Net diff: -42 / +7.

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Two operator-side races kept ad-hoc dynamo runs from launching cleanly without
manual intervention; close both inside nrl-k8s itself.

1. DGD frontend HTTP ready-gate
   `_frontend_http_ready` probed via the kube-apiserver service-proxy, which
   has been returning `ServiceUnavailable: EOF` against the dynamo frontend
   even when the listener is healthy (direct curl succeeds). When the caller
   is in-cluster, hit the ClusterIP service directly via Kubernetes DNS
   instead; keep the api-server-proxy path as the laptop fallback.

2. KubeRay rayjob HTTPMode driver-submission stall
   The kuberay-operator's rayjob reconcile loop intermittently misses the
   RayCluster's `RayClusterProvisioned=True` transition (timing-correlated
   with the worker pod's CNI sandbox retries during DRA NIC plumbing),
   leaving `/api/jobs/` empty and the rayjob hanging in
   `jobDeploymentStatus=Running` indefinitely. Add
   `ensure_rayjob_driver_submitted` to k8s.py: after apply, wait for
   `RayClusterProvisioned=True` directly on the RayCluster (the
   raycluster controller is reliable), give the operator a 90s grace
   window, then POST the entrypoint to the head dashboard ourselves with
   `submission_id=jobId` (kuberay's own convention) if the grace expires.
   Wire it into `_run_rayjob` between apply and the wait-for-terminal
   block so it fires in both `--wait` and `--no-wait` modes.
@jthomson04 jthomson04 force-pushed the dynamo-k8s-integration branch from 457dc13 to ac0df5b Compare May 25, 2026 03:40
KavinKrishnan and others added 23 commits June 9, 2026 13:26
…lassifier

Phase A of the Megatron-Core MX integration plan
(temp/NemoRL_Megatron_MX_Design.md §4). Adds the trainer-side publisher
that emits per-rank Megatron-native shards to ModelExpress, with each
parameter classified into one of seven Megatron roles (qkv_column,
gated_mlp_column, column, row, vocab_parallel, replicated,
expert_column / expert_row). Receiver-side slice planner is in
ai-dynamo/modelexpress PR NVIDIA-NeMo#421 commits 12c73a7 + b26e80f.

What's new
    * nemo_rl/distributed/mx_megatron_helpers.py
        - detect_megatron_role(name, param, model, ...) — classifies
          parameters by walking to the enclosing module and matching
          against Megatron-Core's parallel-layer class names
          (ColumnParallelLinear, RowParallelLinear,
          VocabParallelEmbedding, plus heuristic detection of fused
          QKV / fused gated MLP via name patterns).
        - collect_megatron_publish_set(model, tp_size=..., pp_rank=...,
          ...) — generator that yields (name, local_shard, role_spec,
          full_extras) tuples ready for MxV2TrainingPublisher.add_tensor.
          Skips replicated tensors on tp_rank > 0 (rank 0 publishes for
          the replicated set).
        - Role overrides via MxConfig.megatron_role_overrides for
          non-mainline Megatron forks with different class / name
          conventions.
        - Expert pattern via NRL_MX_EXPERT_TENSOR_PATTERN env var,
          mirroring detect_moe_expert_layout's convention.

    * MegatronPolicyWorkerImpl.stream_weights_via_mx
        - Lazy-init the v2 publisher (once per worker lifetime).
        - Resolve TP / PP / EP rank position via parallel_state.
        - Pull num_attention_heads / num_query_groups / kv_channels
          from the trainer's transformer_config; descriptor extras for
          qkv_column become receiver-checkable.
        - Stamp the publisher's mesh position before publish() so the
          source identity / sidecar carry tp_rank / pp_rank / ep_rank.
        - For each parameter: classify role, emit local shard with
          per-tensor megatron_role + megatron_extras in the registry
          via add_tensor's new kwargs.
        - publish(version=N) + mark_ready() — same shape as the
          DTensor stream_weights_via_mx.

Memory invariant
    Trainer never materialises the full tensor. Megatron-Core stores
    native shards; the publisher passes them through (tensor.detach()
    + dtype cast + .contiguous()). Peak GPU memory for any layer
    stays at model_weights / (tp * pp * ep).

Out of scope (this commit)
    * Phase C: receiver translator (_translate_megatron_to_hf,
      _uninterleave_qkv) — lives on the inference side, separate
      file + PR.
    * Phase D: GRPO refit branch + receiver auto-detect.
    * FP8 KV-cache scales (existing NotImplementedError matches the
      DTensor path's behaviour).

Tested by
    * Pre-existing unit suite continues to pass (no functional changes
      to existing paths).
    * MX-side planner: 36/36 v2 tests passing in
      ai-dynamo/modelexpress @ kavink/fix-publish-self-as-source-v2-transports.
    * End-to-end validation pending — needs Phase C + D + new image
      + GB200 deploy. Tracked in
      temp/NemoRL_Megatron_MX_Design.md §8 / §11.
…ole classification

Pivot the role classifier in detect_megatron_role to consult
megatron.bridge.models.conversion.param_mapping.AutoMapping._MODULE_TYPE_REGISTRY
as the authoritative source. Bridge's registry is curated to cover every
TE / Inference / Quant variant of column-parallel, row-parallel, and
replicated modules:

    column     ColumnParallelLinear, TEColumnParallelLinear,
               TELayerNormColumnParallelLinear, TEColumnParallelGroupedLinear,
               InferenceLayerNormColumnParallelLinear, QuantColumnParallelLinear,
               LinearCrossEntropyModule, VocabParallelEmbedding,
               DotProductAttention (attention sink), TEDotProductAttention
    row        RowParallelLinear, TERowParallelLinear, TERowParallelGroupedLinear,
               QuantRowParallelLinear, InferenceRowParallelLinear
    replicated TENorm, FusedLayerNorm, WrappedTorchNorm, LayerNorm, RMSNorm,
               L2Norm, InferenceTopKRouter, IdentityOp, LinearForLastLayer,
               TopKRouter

Previously the classifier matched on substrings of the base class names
("ColumnParallel", "RowParallel"), which silently misclassified many TE
variants as replicated and would have wasted publish bandwidth (best
case) or routed shards through the wrong assembly path (worst case).

When Bridge is not importable (e.g. CPU-only unit tests), the classifier
falls back to substring matching against the base names — sufficient for
mainline Megatron-Core. The fallback also handles the
LayerNormColumnParallelLinear special-case Bridge tracks separately.

Why this matters
    The receiver-side translator (Phase C) is a thin adapter over Bridge's
    own MegatronParamMapping.megatron_to_hf calls (and the standalone
    helpers like split_qkv_weights / merge_qkv_weights). By using Bridge's
    classification on the publisher side too, both halves of the transport
    agree on what a tensor's role is, regardless of whether it's a
    mainline Megatron-Core class or a TE variant. No more parallel
    truth-source to maintain.

See temp/NemoRL_Megatron_MX_Design.md §9b ("Architecture pivot:
Bridge-driven receiver") for the full design, and
temp/NemoRL_Megatron_MX_Phase_C_Handoff.md for the receiver-side
implementation plan that this Phase A change unblocks.
…in v2 sidecar

Trainer-side companion to modelexpress commit 76ac523. At trainer init,
the Megatron policy worker now derives:

  1. MegatronTransformerConfig (head counts + dims) from
     self.megatron_bridge.transformer_config
  2. Megatron→HF name map from a Bridge mapping-registry walk
     (no weight transfer; just iterates the conversion-task list and
     reads each task.mapping.hf_param)

…and stamps them onto the MxV2TrainingPublisher via
``set_megatron_sidecar(sidecar)``. The publisher then merges them into
the v2 sidecar JSON at every publish() call, so the receiver's
parse_megatron_sidecar (modelexpress.megatron_translator) gets:

  * transformer_config — head counts + dims for QKV un-interleave
  * hf_name_map — entries like
      [
        ["decoder.layers.0.self_attention.linear_qkv.weight",
         ["q_proj.weight", "k_proj.weight", "v_proj.weight"]],
        ["decoder.layers.0.self_attention.linear_proj.weight",
         ["o_proj.weight"]],
        ...
      ]

The receiver doesn't need to import Megatron-Bridge to know either —
exactly what we want for the production Dynamo VllmDecodeWorker image
which does not ship Bridge.

QKVMapping ordering: Bridge's QKVMapping uses a dict-shaped hf_param
({"q": ..., "k": ..., "v": ...}). The sidecar emit forces the
canonical [q, k, v] order so the receiver-side translator's
hf_names[0/1/2] line up with split_qkv_weights' (q, k, v) return
tuple. Other dict-shaped mappings (rare today) emit dict.values()
order; receivers can override per-tensor if needed.

Failure mode: if Bridge's build_conversion_tasks raises (older Bridge,
unsupported model arch), warn and emit just the transformer_config
piece. The receiver then falls back to deriving names independently
(e.g. from the inference vLLM model's state-dict + plan.role).
…erExtension

Phase D of the Megatron-MX integration plan. Wires the new
modelexpress.megatron_translator path into the existing
update_weights_via_mx flow on NemoRL's direct-vLLM extension.

What's new
    * First-cycle Megatron detection: peek at any candidate's
      megatron_meta. None → existing DTensor / FSDP path. Set →
      latched into self._mx_megatron_mode and routed through
      _update_weights_via_mx_megatron forevermore on this receiver.
    * _update_weights_via_mx_megatron: builds a MegatronReceiverContext
      once (transformer_config + Megatron→HF name map from the source
      sidecar; receive_specs from the per-tensor TensorDescriptorV2
      registry); per refit, calls run_refit_cycle which drives the full
      discover → plan → assemble → translate → yield(hf_name, tensor)
      pipeline. The yielded HF tensors flow through the existing
      _load_weights + FP8-KV-cache hooks.
    * Tree fan-out (publish_self_as_source) is preserved on the
      Megatron path identical to the DTensor path.

Backwards compat
    * DTensor / FSDP receivers see _mx_megatron_mode = False on the
      first cycle and stay on the existing path. PR NVIDIA-NeMo#2700's prime-rl
      receivers and John's Dynamo-side worker extension are
      unaffected.
    * Sources that advertise publisher_kind=megatron but are missing
      the transformer_config sidecar (older trainer image) trigger a
      one-shot warning and fall back to non-Megatron mode.

Out of scope (deferred)
    * The pull callback in _update_weights_via_mx_megatron uses
      self._mx_receiver._receiver._nixl.pull_to, which is the
      conceptual API; the actual NIXL plumbing for sliced pulls
      (registering a sub-view of a parent tensor + completing the
      transfer) needs a small wrapper in modelexpress.refit_receiver.
      v0 of the actual NIXL sliced-pull plumbing lands in a follow-up
      commit; this commit gets the Phase D control flow + receive
      context build right so the next commit only needs to fill in
      the pull function.
    * Wiring into Dynamo's MxRefitWorkerExtension (a separate Dynamo
      repo edit; same shape — import run_refit_cycle, call from the
      worker extension's update_weights_via_mx).
… receiver

Pivot the Phase D receiver path from 'per-source pull callback' to
'bulk receive_weights + pre-filled buffers' since:

  1. The existing MxRefitReceiver.receive_weights primitive already
     does the matched-TP single-source NIXL pull correctly (same
     primitive the DTensor path uses).
  2. modelexpress.megatron_translator's new
     pre_assembled_buffers arg lets run_refit_cycle skip
     assemble_into_destination + the per-source pull callback.

Result: matched-TP Megatron-MX path uses ZERO new NIXL plumbing.
Per-rank Megatron buffers are pre-allocated + registered with the
existing register_tensors call, bulk-filled via the existing
receive_weights call, then walked by the translator.

Mixed-TP support (target_tp != source_tp) is gated behind 'no
matched-TP source for this receiver's tp_rank' and prints a clear
'v1 not yet wired' message — the planner already returns the right
slice info; only the per-source NIXL plumbing remains.
Two small fixes surfaced while validating Phase E shape 1 on a real
Megatron-loaded Qwen3-4B-Thinking model on the kavin GB200 cluster.

1. _build_megatron_sidecar called bridge.build_conversion_tasks but the
   Bridge API is bridge.get_conversion_tasks. Rename.
2. model.named_parameters() returns 'module.<...>' prefixed names when
   the model is wrapped, but Bridge's get_conversion_tasks returns
   unprefixed names. The publisher now strips 'module.' in
   collect_megatron_publish_set so the catalog and the sidecar name_map
   use the same form. The receiver-side wire-up gets a defensive strip
   too in case an older publisher emits the prefix.

Phase E shape 1 validation result with these fixes: 398/398 HF tensors
byte-identical against bridge.export_hf_weights ground truth after a
full A->B->C->D round trip; 313 Gbps single-NIC bulk pull. Writeup in
pensieve/RL/NemoRL/14_megatron_mx_phase_e_shape1_2026_06_08.md.
Replaces the matched-TP gate in _update_weights_via_mx_megatron with a
mixed-TP path that pulls each source's full manifest into scratch via
receive_weights_scratch, then satisfies per-source slice requests by
copying from scratch into the planner's pre-narrowed dest view.

Handles MegatronSliceSource.source_subslice for target-narrower cases
(vLLM TP > source TP, each receiver carves a sub-range out of one
source's shard). For target-wider (vLLM TP < source TP, each receiver
concatenates multiple sources' full shards), source_subslice is None
and we copy the full source tensor into the corresponding dest range.

Bandwidth: O(sum(source_bytes)) per receiver. For target-wider this is
the minimum needed; for target-narrower it over-pulls by the target
fan-out factor. A v1 sliced-pull primitive (offset+size NIXL
descriptor on the source side) would cut narrow-target bandwidth by
target_tp / source_tp.

Matched-TP fast path is preserved via the is_matched_tp branch; only
mixed-TP falls into the scratch-then-copy path. Both branches share
the same translation step (assemble_into_destination +
translate_megatron_to_hf).
Megatron-Core's TE-grouped linears (TEColumnParallelGroupedLinear,
TERowParallelGroupedLinear) expose each local expert's slice as a
separate nn.Parameter named weight0, weight1, ..., weightN, rather
than a single grouped buffer. This holds even at EP=1.

Two classifier fixes needed:

1. _enclosing_module previously only recognised weight/bias/scale/
   _extra_state as parameter leaves; with grouped naming the
   per-expert leaf is e.g. weight17. Extended the leaf detector to
   match weight<digits>, bias<digits>, scale<digits>. Without this,
   _enclosing_module walks past the parameter into the leaf and the
   classifier sees class name "Parameter" → unclassified → falls to
   ROLE_REPLICATED.

2. detect_megatron_role's per-expert path required ep_size > 1
   because it assumed the legacy single-.weight-leading-axis grouped
   layout. The TE-grouped layout has N parameters even at EP=1.
   Added a 2a branch that fires whenever the param name matches the
   expert pattern AND the leaf is weight<digits>: extracts the
   expert id from the suffix, classifies as ROLE_EXPERT_COLUMN /
   ROLE_EXPERT_ROW based on the enclosing module class, and emits
   expert_id + expert_layout=grouped in descriptor_extras. The
   legacy ep_size>1 leading-axis path is kept as 2b for back-compat.

Validation on Qwen3-30B-A3B-Instruct-2507 (real MoE: 48 layers, 128
experts) on the GB200 trainer:

  pre-fix role distribution (broken — all experts replicated):
    replicated 12529, row 48, qkv_column 48, vocab_parallel 1, column 1

  post-fix role distribution (correct):
    expert_column 6144  (= 48 layers x 128 experts of linear_fc1)
    expert_row    6144  (= 48 layers x 128 experts of linear_fc2)
    replicated     241  (norms, scalars, routers)
    row             48  (linear_proj)
    qkv_column      48  (linear_qkv)
    vocab_parallel   1
    column           1

Total classified: 12627 (matches Bridge's get_conversion_tasks count
and the publisher's named_parameters iteration count).
… run config

Wandb metric-history fix:
- async_grpo_train omitted step_finished=True on its final per-step log, so the
  per-step metric HISTORY never committed to wandb (empty scan_history, dashboard
  "no data") though run.summary populated. Add it (the sync path already had it).
- The generation_metrics/* per-worker figures are NOT the cause: the dc3m70us
  baseline logs them on this same async path and still commits history. Keep them
  for the per-worker tab.
- Drop the ~700MB/step full_result wandb Table (500s on upload; ungated on the
  async path; large media volume).
- Curate the Dynamo->wandb metrics sampler into canonical generation_metrics/*.

SWE2 24-worker run config:
- DGD qwen3_30b_thinking_gb300_mx_4gpu: 24 TP2 decode workers + nodeAffinity
  exclusion of the broken-aws-cni node ip-7-248-83-146.
- debug recipe/infra: full-size entrypoint (5 steps); should_log_nemo_gym_responses off.

Verified on a 24-worker run (wandb djdu4txk): generation_metrics (20 keys) +
history both commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Bug surfaced on Qwen3-MoE-30B-A3B during Phase E shape 2 cluster
validation:

The fix in 9bdf5e4 stripped the `module.` prefix from
`model.named_parameters()`'s name BEFORE passing it to
`detect_megatron_role`, so the publisher and Bridge's name_map (which
uses unprefixed names) would agree on the catalog key. That fix was
correct for the catalog/name-map side, but it broke the classifier's
model-walk: `_enclosing_module(name, model)` descends through
attribute lookups, and for a DDP-wrapped model the descent starts at
`model.module.decoder...`. With the prefix stripped, `getattr(model,
'decoder', None)` returns None on the first step, the walk fails,
and every TP-sharded non-expert role (column, row, qkv_column,
gated_mlp_column, vocab_parallel) falls through to ROLE_REPLICATED.

Expert tensors didn't show the bug because the grouped-MoE 2a branch
classifies via the trailing `weight<N>` suffix on the name string,
not via the model walk — that branch fires before _enclosing_module.

Cluster evidence (Qwen3-30B-A3B, kavin/GB200, pre-fix):

  pass1: detect_megatron_role(raw_name=`module.decoder...`)
    expert_column 6144  expert_row 6144
    replicated     241  row 48  qkv_column 48
    vocab_parallel   1  column 1                    ← correct

  pass2: collect_megatron_publish_set (stripped name)
    expert_column 12288 (= 6144 fc1 + 6144 fc2 — both → "_column")
    replicated      339 (everything else collapsed)
                                                    ← broken

Fix: pass the raw (prefixed) name to detect_megatron_role so the
model-walk succeeds; keep stripping the prefix for the published
name (so the name_map lookup on the receiver still works). This
restores the 7-role classification and unblocks Phase E shape 2.

Validated locally: classifier now produces 7 distinct roles on
Qwen3-30B-A3B (6144 expert_column + 6144 expert_row + 241 replicated
+ 48 row + 48 qkv_column + 1 vocab + 1 column = 12 627 total).
Adds the v1 sliced-pull data plane to
VllmInternalWorkerExtension._update_weights_via_mx_megatron's mixed-TP
branch, in tandem with the new MxRefitReceiver.pull_to primitive
(MX-side commit).

Restructured the mixed-TP path into five phases:

  1. Pre-allocate per-plan dest tensors (target_shape) and classify
     each (plan, source) contribution: contiguous narrows along axis 0
     route to v1 sliced pull; non-contiguous (axis-1 row-parallel) or
     per_expert (dict assembly) route to v0 scratch+copy.

  2. NIXL-register the v1 dest buffers (one register_tensors call).

  3. Per-source combined sliced pull: each source's batch of
     SlicedTransferRequests issued as a single pull_to() call → one
     NIXL transfer with N descriptor pairs.

  4. v0 fallback: for the plans that need scratch (row-parallel, per
     expert), bulk-pull each contributing source via
     receive_weights_scratch into a per-source scratch dict.

  5. Assemble + translate per plan: v1 plans use the pre-filled dest
     directly (no host-side assembly); v0 plans use
     assemble_into_destination with the existing scratch-based pull
     callback. Both paths feed translate_megatron_to_hf identically.

Bandwidth profile change vs v0:

  * Target-narrower (source_tp > target_tp): unchanged — already
    bandwidth-optimal in v0 (validated 8/8 byte-identical on
    cluster, 2026-06-10).

  * Target-wider (source_tp < target_tp): wire bytes drop by
    target_tp/source_tp× for axis-0 roles. Cluster validation needs
    a vllm tensor_parallel_size>1 DGD worker; deferred to a separate
    deployment-only step.

Backwards-compatibility: matched-TP fast path (is_matched_tp branch)
unchanged. Existing 18 867/18 867 byte-identical validation on
Qwen3-MoE-30B-A3B continues to hold (matched-TP path doesn't touch
this code).
Smallest viable Megatron + MX GRPO recipe — Qwen2.5-1.5B Megatron
TP=1 EP=1 PP=1 with vLLM generation, weight_sync.method="mx" pointing
at the cluster-validated Megatron-MX data plane:

  trainer:  stream_weights_via_mx (per-rank native shards + sidecar)
  refit:    MX server catalogs the source, ready=True
  receiver: _update_weights_via_mx_megatron
              -> v1 sliced-pull (axis-0 roles) / v0 scratch+copy (axis-1)
              -> translate Megatron->HF (vendored Bridge helpers)
              -> vllm.model.load_weights()

5 GRPO steps with val_period=5 gives one refit cycle + one validation
in ~10-20 min wall clock. Bump max_num_steps after first run lands.

End-to-end byte-identity of the Megatron path is independently
validated on:
  * Qwen3-4B-Thinking-2507 (dense): 398/398 HF tensors byte-identical
  * Qwen3-MoE-30B-A3B-Instruct-2507: 18 867/18 867 byte-identical
    (12 288 grouped per-expert tensors via the new split_gated_mlp_tp)
  * Synthetic 2-rank Megatron TP=2 mixed-TP: 8/8 byte-identical

See pensieve/RL/NemoRL/17_megatron_mx_grpo_loop_runbook.md for the
deployment runbook + expected log lines + bumping guide.

Recipe is fully wired; actual GRPO loop run pending cluster session
restoration.
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Companion to the matched-TP cluster smokes on Qwen3-4B-Thinking-2507
(shape 1, 398/398 byte-identical, 2026-06-08). Inherits from the
existing grpo_workplace_assistant_dynamo_mx.yaml (DTensor + MX recipe)
and flips back to Megatron:

  policy.megatron_cfg.enabled: true
  policy.dtensor_cfg.enabled:  false
  policy.optimizer:            null   (Megatron reads megatron_cfg.optimizer)
  policy.scheduler:            null

cluster.weight_sync.method: "mx" + mx_config points at the kavin
MX server. 2 GRPO steps for the first cycle.

Requires the DGD's dynamo/vllm/mx_refit/extension.py to be patched
with a Megatron dispatch (calls modelexpress.megatron_translator.
run_refit_cycle for Megatron sources). The patch is staged in
pensieve/RL/NemoRL/v1_targetwider_smoke_2026_06_11/dynamo_vllm_mx_
refit_extension_megatron.py as a drop-in overlay; the natural
follow-up is an upstream PR to ai-dynamo/dynamo.

Companion writeup: pensieve/RL/NemoRL/18_v1_targetwider_validated_
dynamo_megatron_patch_2026_06_11.md.
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
feat(dynamo): add qwen3 fp8 kv-cache mx recipe
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
feat(k8s): add EAGLE specdec MX refit support
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants