feat(models): declarative invertible conversion-op framework for all models #2797
Draft
S1ro1 wants to merge 8 commits into
ConvOp vocabulary (Rename, PrefixRename, Drop, MoEExperts+FusedGateUp, Synthetic, MapValue, SqueezeLeading, Conditional) + apply_hf_to_tt / apply_tt_to_hf runners. Sharding-aware (name ops are value-agnostic; the expert stack/unstack takes a global-expert offset for shard-local operation), no gathers in the ops themselves.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…alence test
Proves the framework: forward (per-expert & fused), backward, and roundtrip all match the legacy imperative convert_hf_to_tt_moe/convert_tt_to_hf_moe.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ivalence tests
Encode every model's HF<->prime conversion as an invertible op chain (conversion_chains.py + models/<name>/conversion_chain.py), registered by model_type and reachable via PreTrainedModelPrimeRL.conversion_ops. Covers qwen3_moe, qwen3_5_moe, glm4_moe, glm_moe_dsa, minimax_m2, laguna, afmoe, nemotron_h, and gpt_oss (identity). Each chain is verified against the legacy imperative convert_* functions on mock state dicts (forward, backward, and roundtrip where lossless); 43 passing tests in tests/unit/train/models/conversions. The ops are sharding-aware (name ops are value-agnostic; the expert stack/unstack takes a global-expert offset) and never gather.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
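Roundtrip is only checked "where lossless" because some ops in the vocabulary are deliberately asymmetric. The following is a minimal sketch of that asymmetry, not the PR's implementation: the class bodies, the `like` guard, the `center` transform, and all key names are assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Callable


def identity(value):
    return value


def center(values):
    """Hypothetical lossy transform: subtracting the mean discards the mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


@dataclass
class Drop:
    """Removes a key that has no counterpart, in whichever direction it appears."""
    key: str

    def forward(self, sd: dict) -> None:
        sd.pop(self.key, None)

    def backward(self, sd: dict) -> None:
        sd.pop(self.key, None)


@dataclass
class MapValue:
    """Value transform with an explicit backward (identity by default)."""
    key: str
    fwd: Callable
    bwd: Callable = identity

    def forward(self, sd: dict) -> None:
        if self.key in sd:
            sd[self.key] = self.fwd(sd[self.key])

    def backward(self, sd: dict) -> None:
        if self.key in sd:
            sd[self.key] = self.bwd(sd[self.key])


@dataclass
class Synthetic:
    """Creates a prime-only entry going forward, removes it going backward."""
    key: str
    like: str  # assumed guard: only synthesize next to this existing key

    def forward(self, sd: dict) -> None:
        if self.like in sd:
            sd[self.key] = [0.0] * len(sd[self.like])

    def backward(self, sd: dict) -> None:
        sd.pop(self.key, None)


chain = [
    Drop("mtp.head"),
    MapValue("router.bias", fwd=center),
    Synthetic("experts.w3", like="experts.w1"),
]

hf = {"mtp.head": [9.0], "router.bias": [1.0, 2.0, 3.0], "experts.w1": [0.5, 0.5]}

tt = dict(hf)
for op in chain:
    op.forward(tt)

back = dict(tt)
for op in reversed(chain):
    op.backward(back)
```

In this sketch `back` differs from `hf` in two ways: the dropped key does not return, and the centered bias is not restored. That is the kind of chain for which only forward and backward equivalence, not roundtrip, can be asserted.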
…/SplitConcat base-ops
Replace the two MoE-specific value ops with orthogonal primitives:
- Stack: stack/unstack a variable {e}-indexed group along a new dim (index_offset
for shard-local global numbering)
- SplitConcat: split/concat fixed parts along an existing dim
- Sequence: bundle ops into one (lets the routed-experts helper stay a single op)
The MoE-ness now lives only in the _routed_experts_op composition helper (Stack
per proj + a Conditional that splits the fused gate_up input), not in the op
vocabulary. NemotronH's fused case collapses to a plain Rename. All 43
equivalence tests unchanged and green.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
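The two value primitives in this commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `conversion_ops.py`: the function names, the key patterns, the equal-sized split, and the gate-then-up ordering are hypothetical, and nested Python lists stand in for tensors (a "new dim" is a new outer list, an "existing dim" is the row axis).

```python
def stack(sd, pattern, out_key, index_offset=0):
    """Forward: gather consecutive {e}-indexed entries into one stacked list.

    index_offset is the global index of the first expert held locally, so a
    shard owning experts 4 and 5 stacks them at local positions 0 and 1.
    """
    group = []
    e = index_offset
    while pattern.format(e=e) in sd:
        group.append(sd.pop(pattern.format(e=e)))
        e += 1
    if group:  # present-guarded: nothing to stack means nothing changes
        sd[out_key] = group


def unstack(sd, pattern, out_key, index_offset=0):
    """Backward of stack: re-emit one entry per expert with global numbering."""
    if out_key in sd:
        for i, value in enumerate(sd.pop(out_key)):
            sd[pattern.format(e=index_offset + i)] = value


def split(sd, fused_key, part_keys):
    """Split a fused tensor into equal parts along its existing row dim."""
    if fused_key in sd:
        rows = sd.pop(fused_key)
        size = len(rows) // len(part_keys)
        for i, key in enumerate(part_keys):
            sd[key] = rows[i * size:(i + 1) * size]


def concat(sd, fused_key, part_keys):
    """Backward of split: concatenate the parts back along the row dim."""
    if all(key in sd for key in part_keys):
        sd[fused_key] = [row for key in part_keys for row in sd.pop(key)]


# A shard that holds global experts 4 and 5 only.
shard = {"experts.4.w1": [[1, 2]], "experts.5.w1": [[3, 4]]}
stack(shard, "experts.{e}.w1", "experts.w1", index_offset=4)
stacked = dict(shard)
unstack(shard, "experts.{e}.w1", "experts.w1", index_offset=4)

# A fused gate_up tensor with four rows: first half gate, second half up.
fused = {"gate_up_proj": [[1, 1], [2, 2], [3, 3], [4, 4]]}
split(fused, "gate_up_proj", ["w1", "w3"])
parts = dict(fused)
concat(fused, "gate_up_proj", ["w1", "w3"])
```

Because neither function looks beyond the keys it is given, the shard-local call needs no gather; only `index_offset` ties local positions back to global expert numbers.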
…nversion_chain.py
Consistency: every model's chain now lives in its own package; conversion_chains.py holds only the shared helpers (_routed_experts_op, _GATE_DOWN_UP) and the registry.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…orm dispatch
A single per-layer Conditional detects layer type from a signature key (present
in either HF or prime form, so it works both directions) and dispatches:
attention/mamba keep bulk PrefixRename(mixer.->{self_attn,mamba}.), MoE uses its
specific ops (incl. the gated Synthetic w3). Drops the layers_block_type
argument — build_nemotron_h_chain(num_layers) now matches every other model's
signature. Equivalence to the imperative converter (which still uses
layers_block_type) unchanged: 43 tests green.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
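The signature-key dispatch described in this commit might look roughly like the sketch below. It assumes layer-relative keys and uses `q_proj.weight` as the attention signature; both are illustrative guesses, and the classes are simplified stand-ins rather than the PR's `PrefixRename` and `Conditional`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PrefixRename:
    """Bulk rename of every key under one prefix; values are never inspected."""
    hf: str
    tt: str

    def forward(self, sd: dict) -> None:
        for key in [k for k in sd if k.startswith(self.hf)]:
            sd[self.tt + key[len(self.hf):]] = sd.pop(key)

    def backward(self, sd: dict) -> None:
        for key in [k for k in sd if k.startswith(self.tt)]:
            sd[self.hf + key[len(self.tt):]] = sd.pop(key)


@dataclass
class Conditional:
    """Runs `then` or `else_` depending on which keys are present."""
    predicate: Callable[[dict], bool]
    then: PrefixRename
    else_: PrefixRename

    def _branch(self, sd: dict) -> PrefixRename:
        return self.then if self.predicate(sd) else self.else_

    def forward(self, sd: dict) -> None:
        self._branch(sd).forward(sd)

    def backward(self, sd: dict) -> None:
        self._branch(sd).backward(sd)


def is_attention(sd: dict) -> bool:
    # The suffix survives the rename, so the test holds in HF and prime form.
    return any(key.endswith("q_proj.weight") for key in sd)


layer_op = Conditional(
    predicate=is_attention,
    then=PrefixRename("mixer.", "self_attn."),
    else_=PrefixRename("mixer.", "mamba."),
)

attn = {"mixer.q_proj.weight": 1, "mixer.o_proj.weight": 2}
layer_op.forward(attn)
attn_prime = dict(attn)
layer_op.backward(attn)

mamba = {"mixer.A_log": 3}
layer_op.forward(mamba)
mamba_prime = dict(mamba)
layer_op.backward(mamba)
```

The predicate only looks at a key suffix that the rename leaves intact, which is what lets one `Conditional` serve both directions without a separate `layers_block_type` argument.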
Each model defines `conversion_chain(config)` in its `converting_<model>.py`; the `PreTrainedModelPrimeRL` base has a single set of `convert_to_hf` / `convert_to_prime` / `convert_layer_*` implementations that play the chain forward and backward. Because every op is present-guarded, the same chain works over a full state dict, one layer's keys, or a local shard.
Removes the now-redundant scaffolding:
- per-model imperative `convert_hf_to_tt` / `convert_tt_to_hf` functions
- the standalone `conversion_chain.py` builders and the `conversion_chains` registry
- the mock equivalence tests (superseded by the 20-step math KL-mismatch validation in PR #2797) and the obsolete classmethod roundtrip tests
Net -2.2k lines. Validated end-to-end: qwen3_moe/qwen3_5_moe/nemotron_h/glm4_moe mean KL mismatch < 0.0015 over 20 steps on math.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings back the per-model conversion roundtrip tests that predated the declarative refactor (qwen3_5_moe, nemotron_h reverse + roundtrip, qwen3_5 VLM), rewritten to call `convert_to_hf` / `convert_to_prime` on a model instance (playing the declarative chain) instead of the removed classmethods. The NemotronH reverse test keeps its pre-existing xfail (HF now uses fused expert tensors). gpu-marked.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Replaces the per-model imperative `convert_hf_to_tt` / `convert_tt_to_hf` functions with a single declarative, invertible, sharding-aware conversion-operator framework, and encodes all 9 models against it.
Motivation: the HF↔prime conversions were hand-written per model and drifting (the NIXL weight-transfer adapter only covered a couple of the renames). Expressing each conversion as a chain of small bidirectional ops makes the inverse fall out for free, makes the conversion introspectable, and removes the per-model imperative duplication.
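A minimal sketch of why the inverse "falls out for free": if every op carries its own backward, the inverse of a chain is each op's backward played in reverse order. The `Rename` class, the runner names, and the key names below are hypothetical stand-ins, not the PR's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rename:
    """Moves one key; the value is never touched."""
    hf: str
    tt: str

    def forward(self, sd: dict) -> None:
        if self.hf in sd:  # present-guarded: a missing key is a no-op
            sd[self.tt] = sd.pop(self.hf)

    def backward(self, sd: dict) -> None:
        if self.tt in sd:
            sd[self.hf] = sd.pop(self.tt)


def apply_forward(chain, sd: dict) -> dict:
    for op in chain:
        op.forward(sd)
    return sd


def apply_backward(chain, sd: dict) -> dict:
    # Reverse order matters: the second op consumes the first op's output.
    for op in reversed(chain):
        op.backward(sd)
    return sd


chain = [
    Rename("mlp.gate.weight", "mlp.router.weight"),
    Rename("mlp.router.weight", "mlp.router.gate.weight"),
]

hf = {"mlp.gate.weight": 1.0, "norm.weight": 2.0}
prime = apply_forward(chain, dict(hf))
restored = apply_backward(chain, dict(prime))

# The same chain applied to keys it does not mention leaves them untouched.
other_layer = apply_forward(chain, {"attn.q_proj.weight": 3.0})
```

Playing the backwards in forward order instead would stop at `mlp.router.weight`, because the first op's backward finds nothing to rename until the second op's backward has run.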
Framework (`trainer/models/conversion_ops.py`)
A conversion is a flat list of `ConvOp`; `apply_hf_to_tt` plays them forward, `apply_tt_to_hf` plays each op's backward in reverse. The op vocabulary is deliberately small and general (no model-specific ops):
- `Rename`, `PrefixRename`: value-agnostic name maps (trivially shard-safe, no gather).
- `Drop`: symmetric removal of keys with no counterpart (prime-only buffers like `tokens_per_expert` / `reorderer`; HF-only MTP heads).
- `Stack`: stack a variable-cardinality `{e}`-indexed group of per-expert tensors along a new dim (with `index_offset` for shard-local global expert numbering); its backward unstacks.
- `SplitConcat`: split/concat fixed parts along an existing dim (e.g. a fused `gate_up_proj` ↔ separate `w1` / `w3`).
- `Sequence`: bundle ops into one unit; `Conditional(predicate, then, else_)`: dispatch on which keys are present (used for fused-vs-per-expert inputs, singular/plural names, and NemotronH layer-type dispatch).
- `MapValue`: explicit value transform with its own backward (NemotronH's lossy router-bias shift, identity backward).
- `Synthetic`: a prime-only tensor created forward / dropped backward (NemotronH dummy `w3`).
- `SqueezeLeading`: backward-only leading-singleton squeeze (GLM shared-expert `shape[0]==1`).
A shared `routed_experts_op(prefix, …)` helper (+ `GATE_DOWN_UP`) composes the common MoE expert stack/unstack (per-expert and fused-`gate_up` layouts) so each model's chain stays a few lines. Each model defines `conversion_chain(config)` in its `converting_<model>.py`; the `PreTrainedModelPrimeRL` base has a single set of `convert_to_hf` / `convert_to_prime` / `convert_layer_to_*` implementations that play the chain forward/backward. Since every op is present-guarded, the same chain works over a full state dict, a single layer's keys, or a local shard.
Coverage: Qwen3-MoE, Qwen3.5-MoE (incl. fused `gate_up` + shared expert), GLM-4 MoE / GLM-MoE-DSA (shared experts with `shape[0]==1` squeeze + MLA passthrough), MiniMax-M2 (`block_sparse_moe` namespace + literal `w1`/`w2`/`w3` proj names), Laguna (singular/plural shared-expert and dual bias-key inputs), AFMoE (no router rename, 3× expert stack, reorderer drop), NemotronH (`backbone.` prefix, layer-type `mixer` → `{mamba,self_attn,mlp}` dispatch, lossy router-bias shift, synthetic `w3`), and GPT-OSS (identity / empty chain).
Validation: KL mismatch (bugbot requirement)
Mean trainer-vs-inference KL mismatch (`mismatch_kl/all/mean`) over 20 steps on the `math` env, `batch_size=64`, smallest released checkpoint per architecture, on SLURM (4-trainer-node budget: EP=8, `optim_cpu_offload`, activation checkpointing + offloading, `sign_sgd`). Threshold for a passing custom model is < 0.015.
| `arch` | Checkpoint |
| --- | --- |
| `qwen3_moe` | `Qwen/Qwen3-30B-A3B` |
| `qwen3_5_moe` | `Qwen/Qwen3.5-35B-A3B` |
| `nemotron_h` | `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` |
| `glm4_moe` | `zai-org/GLM-4.5-Air` |
| `afmoe` | `arcee-ai/Trinity-Mini` |
| `laguna` | `poolside/Laguna-XS.2` |
| `gpt_oss` | `unsloth/gpt-oss-20b-BF16` |
| `minimax_m2` | `MiniMaxAI/MiniMax-M2` |
`glm_moe_dsa` was excluded from this run by request (shares the GLM-4 MoE structure already covered by `glm4_moe`).
Notes on the non-passing models: none are caused by this refactor
These were investigated; in every case the conversion chain produced by this PR is byte-for-byte equivalent to `main`'s imperative converter, so the discrepancy is not introduced here:
- … `main`'s converters. AFMoE's step-0 KL is 0.0016 (the load conversion is exact); the KL only grows under training, and Laguna is elevated from step 0, i.e. a pre-existing prime-vs-vLLM modeling-parity gap, independent of weight conversion.
- … `trainer/models/layers/moe.py` `expert_parallel` (`TypeError: wrapper() takes 4–5 positional args but 6 were given`), reproducible at both `ep=1` and `ep=8` and untouched by this PR.
- … `weight_scale_inv` is not handled in the loader, nor in `main`'s or this PR's converter), so the bf16 trainer loads raw fp8 → KL ≈ 6.1. A pre-existing fp8-checkpoint-load limitation, not the conversion refactor. (Inference also needs `tp=4` so the per-shard expert dim 1536/4=384 is divisible by the fp8 block size 128.)
The four bf16 MoE families that exercise the full breadth of the op vocabulary (router renames, per-expert and fused-`gate_up` expert stacking, shared experts, prime-only buffer drops, NemotronH's mamba/attn/moe dispatch + synthetic `w3`) all pass comfortably under 0.0015.
🤖 Generated with Claude Code
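The block-alignment arithmetic in the fp8 note can be checked with a few lines. The helper name is hypothetical; the numbers (expert dim 1536, block size 128) are the ones quoted in the note, and the list of tensor-parallel degrees is just a sample.

```python
EXPERT_DIM = 1536  # per-expert dim quoted in the note above
FP8_BLOCK = 128    # fp8 block size quoted in the note above


def shard_is_block_aligned(dim: int, tp: int, block: int) -> bool:
    """True if dim splits evenly over tp ranks into whole fp8 blocks."""
    return dim % tp == 0 and (dim // tp) % block == 0


for tp in (1, 2, 4, 8):
    aligned = shard_is_block_aligned(EXPERT_DIM, tp, FP8_BLOCK)
    print(f"tp={tp}: shard dim {EXPERT_DIM // tp}, aligned={aligned}")
```

Of the sampled degrees, `tp=8` is the one that breaks alignment: it leaves a per-shard dim of 192, which is one and a half blocks.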
Note
High Risk
This refactor sits on every checkpoint load, HF export, and NCCL weight-broadcast path; a subtle op-order or predicate bug could corrupt weights across nine architectures, though the PR targets behavioral parity with the old imperative converters.
Overview
Introduces `conversion_ops.py`, a small bidirectional operator vocabulary (`Rename`, `PrefixRename`, `Drop`, `Stack`, `SplitConcat`, `Conditional`, `MapValue`, `Synthetic`, `SqueezeLeading`, plus `routed_experts_op`) with `apply_hf_to_tt` / `apply_tt_to_hf` playing each model's chain forward or backward in place.
`PreTrainedModelPrimeRL` no longer requires per-model classmethod `convert_to_hf` / `convert_to_prime` / layer variants; models implement `conversion_chain(self.config)` (instance method) and the base `convert_to_*` methods run the chain. Present-guarded ops mean the same chain is intended to work on full checkpoints, single-layer shards, and NCCL broadcast slices without separate layer loops.
Per-model `converting_*.py` files drop hundreds of lines of imperative stacking/renaming in favor of short `conversion_chain` definitions for AFMoE, GLM-4 MoE, GLM-MoE-DSA (reuses `glm_moe_layer_ops`), Laguna, MiniMax M2, NemotronH, Qwen3/Qwen3.5 MoE, and GPT-OSS (empty chain). NemotronH additionally drops runtime `layers_block_type` inference and bespoke per-layer `backbone.` handling in favor of predicate-based layer dispatch in the chain.
Qwen3.5-MoE VLM keeps nested `model.language_model.*` keys via instance `convert_to_hf` / `convert_to_prime` wrappers that flatten/remap text weights before calling the base chain. Unit tests switch from `ModelClass.convert_to_*` to `model.convert_to_*`.
Reviewed by Cursor Bugbot for commit 63de2fd. Bugbot is set up for automated code reviews on this repo.