fix(qwen3-vl): per-segment mRoPE + vision under CP + THD packing by Zhichenzzz · Pull Request #1308 · radixark/miles

Zhichenzzz · 2026-06-08T21:40:29Z

Follow-up to #1272 (stacked on its branch, so this diff shows only the #1296 changes — merge after #1272).

What

Makes Qwen3-VL train end-to-end under context parallelism + THD sequence packing in bridge mode.

Per-segment mRoPE under CP — when the THD row is CP-sharded (zigzag), _build_packed_positions all-gathers the per-rank rows, de-interleaves to the full row (_reassemble_full_row, unit-tested in tests/fast/test_qwen3_vl_cp_mrope.py), rebuilds per-segment MRoPE, and re-slices into this rank's zigzag layout.
Don't double-shard — when the input is already CP-local, the bridge's internal preprocess_packed_seqs is wrapped to an identity that returns miles' full-cu packed_seq_params (CP attention still sees the full cu; the data isn't re-split).
CP-local vision embeds — select_local_vision_embeds maps each rank's local vision tokens to the matching slice of the full vision-tower output (and deepstack). Cooperates with a small hook in megatron-bridge (separate PR to radixark/Megatron-Bridge).
Wires calculate_per_token_loss into the bridge provider (Qwen3-VL asserts on it under CP) and adds a defensive AllGatherVisionEmbeddings.apply kwarg shim.
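
For readers unfamiliar with the zigzag layout, the split and its inverse can be sketched in plain Python. This is a single-segment illustration only: it assumes the common load-balanced scheme in which a sequence is cut into 2 * cp_size equal chunks and rank r keeps chunk r plus its mirror chunk, and the function names are hypothetical (they are not the miles implementations of slice_with_cp or _reassemble_full_row). For a packed THD row the same split would presumably be applied per segment.

```python
from typing import List, Sequence


def zigzag_slice(tokens: Sequence[int], cp_size: int, rank: int) -> List[int]:
    """Tokens one CP rank holds: chunk `rank` plus its mirror chunk."""
    n_chunks = 2 * cp_size
    if len(tokens) % n_chunks != 0:
        raise ValueError("sequence length must be divisible by 2 * cp_size")
    chunk = len(tokens) // n_chunks
    mirror = n_chunks - 1 - rank
    front = tokens[rank * chunk:(rank + 1) * chunk]
    back = tokens[mirror * chunk:(mirror + 1) * chunk]
    return list(front) + list(back)


def reassemble_full_row(per_rank: Sequence[Sequence[int]], cp_size: int) -> List[int]:
    """Invert zigzag_slice given the all-gathered rows (list index = rank)."""
    n_chunks = 2 * cp_size
    chunk = len(per_rank[0]) // 2
    chunks: List[List[int]] = [[] for _ in range(n_chunks)]
    for rank, row in enumerate(per_rank):
        chunks[rank] = list(row[:chunk])                  # front half goes to slot `rank`
        chunks[n_chunks - 1 - rank] = list(row[chunk:])   # back half goes to the mirror slot
    return [tok for c in chunks for tok in c]


full = list(range(8))
shards = [zigzag_slice(full, cp_size=2, rank=r) for r in range(2)]
restored = reassemble_full_row(shards, cp_size=2)
```

With cp_size=2 the eight tokens split into four chunks of two, so rank 0 holds chunks 0 and 3 and rank 1 holds chunks 1 and 2; reassembly puts them back in order.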

Validation (e2e)

Qwen3-VL-2B geo3k, CP=2 TP=4, 8×H200, THD packed, bridge mode: stable over 3 steps, train_rollout_logprob_abs_diff 0.0141 → 0.0146 (healthy, matching the non-CP range of 0.011–0.016), rollout/raw_reward ~0.4, no crashes.

Depends on

feat: Fix Qwen3-VL THD packed mRoPE positions #1272 (base of this PR)
megatron-bridge vision hook (next PR to radixark/Megatron-Bridge) — until it lands, the miles-side monkeypatch provides the hook.

Fixes #1296

Update: cleanup pass + re-validation

helpers' first parameter renamed self → model (they are free functions, not methods)
the preprocess_packed_seqs identity wrapper forwards *args/**kwargs instead of hard-coding the upstream signature
the patch warns once at install when the running megatron-bridge lacks the _miles_select_local_vision_embeds hook (instead of silently mis-placing vision embeddings under CP); points at the matching Megatron-Bridge patch

Re-validated after cleanup: unit tests 6/6; Qwen3-VL-2B CP2 TP4 THD geo3k RL — train_rollout_logprob_abs_diff 0.0127–0.0131 (same healthy band as the original validation), coherent rollouts, no hook warning with the bridge patch installed.

Note on stacking: this branch contains #1272; the cleanup lives here because this PR owns the final shape of qwen3_vl_packed_mrope.py.

Update: vision-embed plumbing removed (−53 lines)

Megatron-Bridge PR #9 now selects CP-local vision embeddings natively inside Qwen3VLModel (no override hook), so this PR no longer carries select_local_vision_embeds, the local→full position mapping, or any hook installation. What remains on the miles side: per-segment mRoPE position reconstruction and the preprocess_packed_seqs identity wrapper (both still needed — they are about positions, not embeddings), plus a warning when the running bridge lacks the native support.
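
To make "per-segment position reconstruction" concrete, here is a deliberately simplified sketch. It assumes text-only segments, where the three mRoPE axes share one position index, and it ignores the image and video tokens that the real reconstruction must handle. The function name is hypothetical.

```python
from typing import List, Sequence


def packed_text_positions(cu_seqlens: Sequence[int]) -> List[List[int]]:
    """Positions for a packed row: restart at 0 at every segment boundary.

    Returns three identical rows (one per mRoPE axis), which is only valid
    for text-only tokens (an assumption of this sketch).
    """
    positions: List[int] = []
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        positions.extend(range(end - start))
    return [positions[:] for _ in range(3)]


# Two packed segments of lengths 3 and 2.
pos = packed_text_positions([0, 3, 5])
```

The point of the sketch is only that positions must not run continuously across the packed row; each segment gets its own origin.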

Re-validated end-to-end after the removal (Qwen3-VL-2B, CP2 TP2, THD, geo3k RL): train_rollout_logprob_abs_diff 0.0130, coherent rollouts, no warnings, no crashes.

Follow-up to #1272, which handled non-CP packed mRoPE and left CP as a logged dense fallback. Under context parallelism miles shards the THD row with the load-balanced zigzag layout (slice_with_cp), so the model sees only this rank's …

…rg bug megatron-bridge 0.5.0 calls AllGatherVisionEmbeddings.apply(..., cp_group=...) in the Qwen3-VL vision_dp_when_cp path, but torch.autograd.Function.apply rejects keyword arguments. Install a shim (at import, alongside the mRoPE patch) whose .apply accepts cp_group as a kwarg and forwards it positionally. One of the blockers for end-to-end Qwen3-VL CP training (see #1296).
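
The shim pattern described here can be shown without torch. The sketch below uses a hypothetical stand-in class whose apply takes positional-only parameters, mimicking an entry point that rejects keyword arguments; it is not the bridge's AllGatherVisionEmbeddings, and the names are illustrative.

```python
class PositionalOnlyOp:
    """Hypothetical stand-in for an op whose apply() rejects keyword arguments."""

    @staticmethod
    def apply(x, seqlens, group, /):
        return (x, seqlens, group)


def make_kwarg_shim(orig):
    """Build a class whose apply() accepts cp_group by keyword and forwards it positionally."""

    class KwargShim:
        @staticmethod
        def apply(x, seqlens, cp_group=None):
            return orig.apply(x, seqlens, cp_group)

    return KwargShim


Shim = make_kwarg_shim(PositionalOnlyOp)
result = Shim.apply("embeds", [3, 5], cp_group="cp")

# Calling the original with a keyword fails, which is the situation the shim avoids.
try:
    PositionalOnlyOp.apply("embeds", [3, 5], group="cp")
    direct_call_failed = False
except TypeError:
    direct_call_failed = True
```

Callers that already pass the group positionally are unaffected, since the shim's third parameter also accepts a positional value.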

…1296) Completes the CP+packing path for Qwen3-VL. The bridge forward assumes the FULL unsharded input: it re-shards internally (preprocess_packed_seqs) and inserts vision embeddings against a full mask. But miles pre-shards the THD row (slice_with_cp), so the data was sharded twice and the vision mask no longer matched the full vision-tower output.

When the input is already CP-sharded (cu_seqlens_q[-1] == cp_size * local_len):

- preprocess_packed_seqs is wrapped to be an identity that returns miles' full-cu packed_seq_params, so the bridge does not re-split the already-local data while CP attention still sees the full cu_seqlens;
- a per-rank vision-embed selector (select_local_vision_embeds) maps this rank's local vision tokens to the matching slice of the full vision-tower output (and deepstack), via a reconstructed full row + zigzag local->full position map;
- the bridge model.py gains a no-op _miles_select_local_vision_embeds hook at the vision/deepstack insertion sites that miles overrides at import.

Plus calculate_per_token_loss wired into the bridge provider (CP asserts on it).

Validated end-to-end: Qwen3-VL-2B geo3k, CP=2 TP=4 8xH200, THD packed, bridge mode, train_rollout_logprob_abs_diff = 0.0141 at step 0 (healthy, matches non-CP 0.011-0.016), rollout raw_reward 0.42, no crashes.

The bridge-side hook is a 4-line change delivered separately as a patch (belongs upstream in megatron-bridge).

Fixes #1296
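
The already-sharded condition quoted in this commit message, cu_seqlens_q[-1] == cp_size * local_len, can be illustrated with a small predicate. The function name and the extra cp_size > 1 guard are assumptions of this sketch, not taken from the patch.

```python
from typing import Sequence


def is_already_cp_sharded(cu_seqlens_q: Sequence[int], local_len: int, cp_size: int) -> bool:
    """True when cu_seqlens describes the full row but the tensor holds one CP shard."""
    total = int(cu_seqlens_q[-1])  # length of the full, unsharded packed row
    # The cp_size > 1 guard is an addition of this sketch.
    return cp_size > 1 and total == cp_size * local_len


# Two packed segments of 8 and 16 tokens give a full row of 24 tokens.
cu = [0, 8, 24]
sharded = is_already_cp_sharded(cu, local_len=12, cp_size=2)   # rank holds 24 / 2 tokens
unsharded = is_already_cp_sharded(cu, local_len=24, cp_size=2)  # rank holds the full row
```

The check works because miles keeps the full cumulative lengths in packed_seq_params while the token tensor itself is already the local shard.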

gemini-code-assist

Code Review

This pull request introduces support for Qwen3-VL context parallelism (CP) and THD packed mRoPE reconstruction. It adds patches to handle CP-local vision embeddings, bypasses redundant re-sharding in preprocess_packed_seqs, and implements a shim for AllGatherVisionEmbeddings to accept cp_group as a keyword argument. Additionally, a comprehensive CPU unit test suite is added to verify the correctness of the zigzag reconstruction. The reviewer feedback suggests improving the robustness of the patches by having the _AllGatherVisionEmbeddingsKwargShim inherit from the original class to preserve its type hierarchy, and using *args and **kwargs in the preprocess_packed_seqs wrapper to guard against future signature changes.

gemini-code-assist · 2026-06-08T21:42:04Z

+    class _AllGatherVisionEmbeddingsKwargShim:
+        _miles_kwarg_shim = True
+
+        @staticmethod
+        def apply(input, seqlens_on_cp_ranks, cp_group=None):
+            return orig.apply(input, seqlens_on_cp_ranks, cp_group)


To ensure that _AllGatherVisionEmbeddingsKwargShim behaves identically to the original AllGatherVisionEmbeddings class (e.g., preserving class attributes, static methods, or satisfying issubclass / isinstance checks in downstream code), it is safer to have the shim inherit from orig instead of being a completely separate, plain class.

Suggested change

-    class _AllGatherVisionEmbeddingsKwargShim:
-        _miles_kwarg_shim = True
-
-        @staticmethod
-        def apply(input, seqlens_on_cp_ranks, cp_group=None):
-            return orig.apply(input, seqlens_on_cp_ranks, cp_group)
+    class _AllGatherVisionEmbeddingsKwargShim(orig):
+        _miles_kwarg_shim = True
+
+        @staticmethod
+        def apply(input, seqlens_on_cp_ranks, cp_group=None):
+            return orig.apply(input, seqlens_on_cp_ranks, cp_group)

gemini-code-assist · 2026-06-08T21:42:05Z

+    def wrapped(input_ids, attention_mask, pre_process=True, pg_collection=None):
+        ctx = getattr(_tls, "cp_local", None)
+        if ctx is not None:
+            # already-local CP path: do not re-shard; return the data unchanged together with
+            # miles' full-cu packed_seq_params (callers ignore the psp; the model's CP attention
+            # uses the packed_seq_params passed into forward, which already has the full cu).
+            return input_ids, ctx["psp"]
+        return orig(input_ids, attention_mask, pre_process=pre_process, pg_collection=pg_collection)


To make the monkeypatched wrapped function more robust against future signature changes in preprocess_packed_seqs (e.g., if the bridge library adds or reorders arguments), it is highly recommended to use *args and **kwargs and extract input_ids dynamically. This prevents potential TypeError exceptions due to signature mismatches.

Suggested change

-    def wrapped(input_ids, attention_mask, pre_process=True, pg_collection=None):
-        ctx = getattr(_tls, "cp_local", None)
-        if ctx is not None:
-            # already-local CP path: do not re-shard; return the data unchanged together with
-            # miles' full-cu packed_seq_params (callers ignore the psp; the model's CP attention
-            # uses the packed_seq_params passed into forward, which already has the full cu).
-            return input_ids, ctx["psp"]
-        return orig(input_ids, attention_mask, pre_process=pre_process, pg_collection=pg_collection)
+    def wrapped(*args, **kwargs):
+        ctx = getattr(_tls, "cp_local", None)
+        if ctx is not None:
+            # already-local CP path: do not re-shard; return the data unchanged together with
+            # miles' full-cu packed_seq_params (callers ignore the psp; the model's CP attention
+            # uses the packed_seq_params passed into forward, which already has the full cu).
+            input_ids = kwargs.get("input_ids") if "input_ids" in kwargs else args[0]
+            return input_ids, ctx["psp"]
+        return orig(*args, **kwargs)

- rename the free helpers' first parameter self -> model (they are not methods; the model instance is passed in from the patched forward)
- forward *args/**kwargs in the preprocess_packed_seqs identity wrapper instead of hard-coding the upstream signature
- warn once at patch install when the running megatron-bridge lacks the _miles_select_local_vision_embeds hook, instead of silently mis-placing vision embeddings under CP (points at the matching Megatron-Bridge patch)

Megatron-Bridge PR #9 renamed its extension point to the vendor-neutral select_cp_local_vision_embeds; look that up first and keep the old _miles_-prefixed name as a fallback for older patched bridges.
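
A lookup of this kind, combined with the warn-once behaviour from the cleanup commit, might look like the following. This is a sketch only: where the hook actually lives (module or class) is not stated here, so the target object, the helper name, and the warning text are all assumptions.

```python
import warnings

# New vendor-neutral name first, then the older _miles_-prefixed fallback.
_HOOK_NAMES = ("select_cp_local_vision_embeds", "_miles_select_local_vision_embeds")
_warned = False


def find_vision_hook(target):
    """Return the first hook name found on target, or None after warning once."""
    global _warned
    for name in _HOOK_NAMES:
        if hasattr(target, name):
            return name
    if not _warned:
        warnings.warn("bridge lacks a CP-local vision-embed hook", RuntimeWarning)
        _warned = True
    return None


class NewBridge:  # hypothetical: exposes the renamed hook
    select_cp_local_vision_embeds = staticmethod(lambda embeds: embeds)


class OldBridge:  # hypothetical: exposes only the old name
    _miles_select_local_vision_embeds = staticmethod(lambda embeds: embeds)


class BareBridge:  # hypothetical: no hook at all
    pass


new_name = find_vision_hook(NewBridge)
old_name = find_vision_hook(OldBridge)
```

Checking the new name first means a bridge that carries both names resolves to the vendor-neutral one.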

Megatron-Bridge now selects CP-local vision embeddings natively inside Qwen3VLModel (no override hook), so remove select_local_vision_embeds, the local->full position mapping it needed, and the hook installation; keep a warning when the running bridge lacks the native support. The cp_local context now only carries packed_seq_params for the preprocess_packed_seqs identity wrapper.

Zhichenzzz added 3 commits June 8, 2026 21:01

Zhichenzzz requested review from fzyzcjy, maocheng23, yueming-yuan and yushengsu-thu as code owners June 8, 2026 21:40

gemini-code-assist Bot reviewed Jun 8, 2026

Zhichenzzz mentioned this pull request Jun 8, 2026

fix(qwen3-vl): CP-local vision-embed hook + AllGatherVisionEmbeddings.apply kwarg radixark/Megatron-Bridge#9

Open

Zhichenzzz added 6 commits June 10, 2026 00:40

style: isort import in test_qwen3_vl_cp_mrope

bf50ebe

refactor(qwen3-vl): track the renamed bridge vision-embed hook

afa1b5e

Merge base branch (main's transformers alias fix for CI)

c135815

style: tighten comments

82f3215

Zhichenzzz wants to merge 9 commits into zhichen/qwen3-vl-thd-miles-hijack from fix/1296-qwen3vl-cp-mrope
