feat(models): declarative invertible conversion-op framework for all models #2797

Draft
S1ro1 wants to merge 8 commits into main from feat/declarative-conversions

S1ro1 (Collaborator) commented Jun 13, 2026

Summary

Replaces the per-model imperative convert_hf_to_tt / convert_tt_to_hf functions with a single declarative, invertible, sharding-aware conversion-operator framework, and encodes all 9 models against it.

Motivation: the HF↔prime conversions were hand-written per model and drifting (the NIXL weight-transfer adapter only covered a couple of the renames). Expressing each conversion as a chain of small bidirectional ops makes the inverse fall out for free, makes the conversion introspectable, and removes the per-model imperative duplication.

Framework (trainer/models/conversion_ops.py)

A conversion is a flat list of ConvOp; apply_hf_to_tt plays them forward, apply_tt_to_hf plays each op's backward in reverse. The op vocabulary is deliberately small and general (no model-specific ops):

  • Rename, PrefixRename — value-agnostic name maps (trivially shard-safe, no gather).
  • Drop — symmetric removal of keys with no counterpart (prime-only buffers like tokens_per_expert/reorderer; HF-only MTP heads).
  • Stack — stack a variable-cardinality {e}-indexed group of per-expert tensors along a new dim (with index_offset for shard-local global expert numbering); its backward unstacks.
  • SplitConcat — split/concat fixed parts along an existing dim (e.g. a fused gate_up_proj ↔ separate w1/w3).
  • Sequence — bundle ops into one unit; Conditional(predicate, then, else_) — dispatch on which keys are present (used for fused-vs-per-expert inputs, singular/plural names, and NemotronH layer-type dispatch).
  • MapValue — explicit value transform with its own backward (NemotronH's lossy router-bias shift, identity backward).
  • Synthetic — a prime-only tensor created forward / dropped backward (NemotronH dummy w3).
  • SqueezeLeading — backward-only leading-singleton squeeze (GLM shared-expert shape[0]==1).
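To make the forward/backward playback concrete, here is a minimal sketch of how such a chain could be run. It is not the code in conversion_ops.py: the `ConvOp` protocol, the method names `forward` / `backward`, the runner signatures, and the key names are all assumptions made for illustration, and the real `Drop` may match patterns rather than a single key.

```python
from dataclasses import dataclass
from typing import Protocol


class ConvOp(Protocol):
    """Hypothetical op interface: each op mutates the state dict in place."""

    def forward(self, state: dict) -> None: ...
    def backward(self, state: dict) -> None: ...


@dataclass(frozen=True)
class Rename:
    """Value-agnostic rename; a no-op when the source key is absent."""

    hf: str
    tt: str

    def forward(self, state: dict) -> None:
        if self.hf in state:  # present-guarded
            state[self.tt] = state.pop(self.hf)

    def backward(self, state: dict) -> None:
        if self.tt in state:
            state[self.hf] = state.pop(self.tt)


@dataclass(frozen=True)
class Drop:
    """Symmetric removal: the key is deleted in whichever direction it shows up."""

    key: str

    def forward(self, state: dict) -> None:
        state.pop(self.key, None)

    def backward(self, state: dict) -> None:
        state.pop(self.key, None)


def apply_hf_to_tt(chain: list, state: dict) -> dict:
    """Play every op forward, in chain order."""
    for op in chain:
        op.forward(state)
    return state


def apply_tt_to_hf(chain: list, state: dict) -> dict:
    """Play every op's backward, in reverse chain order."""
    for op in reversed(chain):
        op.backward(state)
    return state


# Illustrative key names only.
chain = [
    Rename("model.layers.0.mlp.gate.weight", "model.layers.0.mlp.router.gate.weight"),
    Drop("model.layers.0.mlp.tokens_per_expert"),
]
hf = {"model.layers.0.mlp.gate.weight": [1.0, 2.0]}
tt = apply_hf_to_tt(chain, dict(hf))
roundtrip = apply_tt_to_hf(chain, dict(tt))
```

Because the backward pass reuses the same list in reverse, there is no second function per model to keep in sync, which is the drift problem described in the motivation.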

A shared routed_experts_op(prefix, …) helper (+ GATE_DOWN_UP) composes the common MoE expert stack/unstack (per-expert and fused-gate_up layouts) so each model's chain stays a few lines. Each model defines conversion_chain(config) in its converting_<model>.py; the PreTrainedModelPrimeRL base has a single set of convert_to_hf / convert_to_prime / convert_layer_to_* implementations that play the chain forward/backward — since every op is present-guarded, the same chain works over a full state dict, a single layer's keys, or a local shard.
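The shard-local stacking with index_offset can be sketched as follows. This is an assumed shape for a `Stack`-like op, not the PR's implementation: tensors are plain nested lists so the sketch needs no tensor library, the key names are made up, and discovering the expert count by probing consecutive indices is a guess at how "variable cardinality" might be handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """Hypothetical stack op: per-expert keys <-> one stacked entry.

    `per_expert` is a template containing "{e}". `index_offset` is the global
    index of the first expert held by this shard.
    """

    per_expert: str
    stacked: str
    index_offset: int = 0

    def forward(self, state: dict) -> None:
        parts, e = [], self.index_offset
        # Collect consecutive experts starting at the shard's first global index.
        while self.per_expert.format(e=e) in state:
            parts.append(state.pop(self.per_expert.format(e=e)))
            e += 1
        if parts:  # present-guarded: untouched when no expert key matches
            state[self.stacked] = parts

    def backward(self, state: dict) -> None:
        if self.stacked not in state:
            return
        for local, tensor in enumerate(state.pop(self.stacked)):
            state[self.per_expert.format(e=self.index_offset + local)] = tensor


# A shard that holds global experts 4 and 5 only.
op = Stack("experts.{e}.down_proj.weight", "experts.w2", index_offset=4)
shard = {
    "experts.4.down_proj.weight": [[1, 2]],
    "experts.5.down_proj.weight": [[3, 4]],
}
stacked = dict(shard)
op.forward(stacked)
restored = dict(stacked)
op.backward(restored)
```

The op only ever touches keys that exist locally, so nothing in it requires gathering experts from other ranks; the offset alone restores global numbering on the way back.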

Coverage: Qwen3-MoE, Qwen3.5-MoE (incl. fused gate_up + shared expert), GLM-4 MoE / GLM-MoE-DSA (shared experts with shape[0]==1 squeeze + MLA passthrough), MiniMax-M2 (block_sparse_moe namespace + literal w1/w2/w3 proj names), Laguna (singular/plural shared-expert and dual bias-key inputs), AFMoE (no router rename, 3× expert stack, reorderer drop), NemotronH (backbone. prefix, layer-type mixer→{mamba,self_attn,mlp} dispatch, lossy router-bias shift, synthetic w3), and GPT-OSS (identity / empty chain).
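For the fused gate_up layout mentioned above, a split/concat op might look like the sketch below. It assumes equal-sized parts split along dim 0 of a nested-list "tensor"; the actual split dimension, part sizes, and key names in the PR are not stated here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitConcat:
    """Hypothetical op: one fused tensor <-> equal-sized parts along dim 0."""

    fused: str
    parts: tuple

    def forward(self, state: dict) -> None:
        if self.fused not in state:  # present-guarded
            return
        rows = state.pop(self.fused)
        size = len(rows) // len(self.parts)
        for i, name in enumerate(self.parts):
            state[name] = rows[i * size:(i + 1) * size]

    def backward(self, state: dict) -> None:
        if not all(name in state for name in self.parts):
            return
        # Concatenate the parts back in their declared order.
        state[self.fused] = [row for name in self.parts for row in state.pop(name)]


op = SplitConcat("experts.gate_up_proj", ("experts.w1", "experts.w3"))
fused = {"experts.gate_up_proj": [[1], [2], [3], [4]]}
split = dict(fused)
op.forward(split)
merged = dict(split)
op.backward(merged)
```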

Validation — KL mismatch (bugbot requirement)

Mean trainer-vs-inference KL mismatch (mismatch_kl/all/mean) over 20 steps on the math env, batch_size=64, smallest released checkpoint per architecture, on SLURM (4-trainer-node budget: EP=8, optim_cpu_offload, activation checkpointing + offloading, sign_sgd). Threshold for a passing custom model is < 0.015.

| Model (arch) | Checkpoint | Mean KL (20 steps) | < 0.015 |
|---|---|---|---|
| Qwen3-MoE (qwen3_moe) | Qwen/Qwen3-30B-A3B | 0.0015 | yes |
| Qwen3.5-MoE (qwen3_5_moe) | Qwen/Qwen3.5-35B-A3B | 0.0006 | yes |
| Nemotron-H (nemotron_h) | nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 0.0007 | yes |
| GLM-4.5 MoE (glm4_moe) | zai-org/GLM-4.5-Air | 0.0005 | yes |
| AFMoE (afmoe) | arcee-ai/Trinity-Mini | 0.077 | ⚠️ pre-existing (see below) |
| Laguna (laguna) | poolside/Laguna-XS.2 | 0.034 (19 steps) | ⚠️ pre-existing (see below) |
| GPT-OSS (gpt_oss) | unsloth/gpt-oss-20b-BF16 | n/a | blocked (see below) |
| MiniMax-M2 (minimax_m2) | MiniMaxAI/MiniMax-M2 | n/a | fp8 load gap (see below) |

glm_moe_dsa was excluded from this run by request (shares the GLM-4 MoE structure already covered by glm4_moe).
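As a quick cross-check of the reported numbers against the 0.015 threshold, the comparison can be written out directly. The values are copied from the results above; the dictionary layout and the `passes` helper are illustrative and not part of the repo.

```python
THRESHOLD = 0.015  # passing bound for a custom model, as stated in this PR

# Mean mismatch_kl/all/mean per architecture, copied from the results above.
mean_kl = {
    "qwen3_moe": 0.0015,
    "qwen3_5_moe": 0.0006,
    "nemotron_h": 0.0007,
    "glm4_moe": 0.0005,
    "afmoe": 0.077,
    "laguna": 0.034,  # 19 steps
}


def passes(kl: float, threshold: float = THRESHOLD) -> bool:
    """A run passes when its mean KL is strictly below the threshold."""
    return kl < threshold


passing = sorted(arch for arch, kl in mean_kl.items() if passes(kl))
failing = sorted(arch for arch, kl in mean_kl.items() if not passes(kl))
worst_passing = max(mean_kl[arch] for arch in passing)
```

The largest passing value is 0.0015, one tenth of the threshold, while both failing runs exceed the threshold by more than a factor of two.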

Notes on the non-passing models — none are caused by this refactor

These were investigated; in every case the conversion chain produced by this PR is byte-for-byte equivalent to main's imperative converter, so the discrepancy is not introduced here:

  • AFMoE / Laguna — their chains are identical to main's converters. AFMoE's step-0 KL is 0.0016 (the load conversion is exact); the KL only grows under training, and Laguna is elevated from step 0 — i.e. a pre-existing prime-vs-vLLM modeling-parity gap, independent of weight conversion.
  • GPT-OSS — conversion is the identity (empty chain), so it cannot be affected by this PR. Training is blocked by a pre-existing crash in trainer/models/layers/moe.py expert_parallel (TypeError: wrapper() takes 4–5 positional args but 6 were given), reproducible at both ep=1 and ep=8 and untouched by this PR.
  • MiniMax-M2 — the only fp8-block-quantized checkpoint. The trainer's HF-load path does not dequantize fp8 weights (weight_scale_inv is not handled in the loader, nor in main's or this PR's converter), so the bf16 trainer loads raw fp8 → KL ≈ 6.1. A pre-existing fp8-checkpoint-load limitation, not the conversion refactor. (Inference also needs tp=4 so the per-shard expert dim 1536/4=384 is divisible by the fp8 block size 128.)
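The tp=4 remark in the MiniMax-M2 note is plain divisibility arithmetic, sketched below. The two constants come from the note; the helper name is made up, and the check says nothing about memory limits or any other reason a given tp might be unusable.

```python
EXPERT_DIM = 1536  # expert dim quoted in the MiniMax-M2 note
FP8_BLOCK = 128    # fp8 block size quoted in the MiniMax-M2 note


def shard_is_block_aligned(tp: int, dim: int = EXPERT_DIM, block: int = FP8_BLOCK) -> bool:
    """True when dim splits evenly over tp ranks and each shard is a whole number of blocks."""
    return dim % tp == 0 and (dim // tp) % block == 0


aligned = {tp: shard_is_block_aligned(tp) for tp in (1, 2, 4, 8)}
per_shard = {tp: EXPERT_DIM // tp for tp in (1, 2, 4, 8)}
```

With tp=4 each shard has 384 = 3 × 128 columns, whereas tp=8 would give 192, which is not a multiple of 128.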

The four bf16 MoE families that exercise the full breadth of the op vocabulary (router renames, per-expert and fused-gate_up expert stacking, shared experts, prime-only buffer drops, NemotronH's mamba/attn/moe dispatch + synthetic w3) all pass comfortably: each mean KL is about 0.0015 or lower, at most a tenth of the 0.015 threshold.

🤖 Generated with Claude Code


Note

High Risk
This refactor sits on every checkpoint load, HF export, and NCCL weight-broadcast path; a subtle op-order or predicate bug could corrupt weights across nine architectures, though the PR targets behavioral parity with the old imperative converters.

Overview
Introduces conversion_ops.py, a small bidirectional operator vocabulary (Rename, PrefixRename, Drop, Stack, SplitConcat, Conditional, MapValue, Synthetic, SqueezeLeading, plus routed_experts_op) with apply_hf_to_tt / apply_tt_to_hf playing each model’s chain forward or backward in place.

PreTrainedModelPrimeRL no longer requires per-model classmethod convert_to_hf / convert_to_prime / layer variants; models implement conversion_chain(self.config) (instance method) and the base convert_to_* methods run the chain. Present-guarded ops mean the same chain is intended to work on full checkpoints, single-layer shards, and NCCL broadcast slices without separate layer loops.

Per-model converting_*.py files drop hundreds of lines of imperative stacking/renaming in favor of short conversion_chain definitions for AFMoE, GLM-4 MoE, GLM-MoE-DSA (reuses glm_moe_layer_ops), Laguna, MiniMax M2, NemotronH, Qwen3/Qwen3.5 MoE, and GPT-OSS (empty chain). NemotronH additionally drops runtime layers_block_type inference and bespoke per-layer backbone. handling in favor of predicate-based layer dispatch in the chain.
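The predicate-based layer dispatch can be pictured with the sketch below. It is an assumption-laden illustration rather than the NemotronH chain itself: the `Conditional` and `PrefixRename` shapes, the `layer_has` helper, and the signature suffixes (`A_log`, `q_proj.weight`) are chosen for the example. The point it demonstrates is that a predicate keyed on a suffix that survives the rename gives the same answer in both directions.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class PrefixRename:
    """Hypothetical bulk rename of every key under a prefix."""

    hf: str
    tt: str

    def forward(self, state: dict) -> None:
        for key in [k for k in state if k.startswith(self.hf)]:
            state[self.tt + key[len(self.hf):]] = state.pop(key)

    def backward(self, state: dict) -> None:
        for key in [k for k in state if k.startswith(self.tt)]:
            state[self.hf + key[len(self.tt):]] = state.pop(key)


@dataclass(frozen=True)
class Conditional:
    """Hypothetical dispatch: pick a branch from the keys that are present."""

    predicate: Callable[[dict], bool]
    then: object
    else_: Optional[object] = None

    def _branch(self, state: dict):
        return self.then if self.predicate(state) else self.else_

    def forward(self, state: dict) -> None:
        op = self._branch(state)
        if op is not None:
            op.forward(state)

    def backward(self, state: dict) -> None:
        op = self._branch(state)
        if op is not None:
            op.backward(state)


def layer_has(layer: str, suffix: str) -> Callable[[dict], bool]:
    """Predicate: does any key of this layer end with `suffix`?"""
    return lambda state: any(k.startswith(layer) and k.endswith(suffix) for k in state)


def layer_op(i: int) -> Conditional:
    layer = f"layers.{i}."
    return Conditional(
        layer_has(layer, "A_log"),  # illustrative mamba signature key
        then=PrefixRename(layer + "mixer.", layer + "mamba."),
        else_=Conditional(
            layer_has(layer, "q_proj.weight"),  # illustrative attention signature key
            then=PrefixRename(layer + "mixer.", layer + "self_attn."),
        ),
    )


chain = [layer_op(0), layer_op(1)]
hf = {"layers.0.mixer.A_log": 1, "layers.1.mixer.q_proj.weight": 2}
tt = dict(hf)
for op in chain:
    op.forward(tt)
back = dict(tt)
for op in reversed(chain):
    op.backward(back)
```

No list of layer types is passed in: the keys themselves decide the branch, which is what lets the chain run on a single layer's keys.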

Qwen3.5-MoE VLM keeps nested model.language_model.* keys via instance convert_to_hf / convert_to_prime wrappers that flatten/remap text weights before calling the base chain. Unit tests switch from ModelClass.convert_to_* to model.convert_to_*.
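One way such a flatten/remap wrapper could work is sketched below. The prefixes follow the `model.language_model.*` naming mentioned above, but the helper name, the flat `model.` target prefix, the vision key, and the stand-in `toy_chain` are all assumptions; the actual wrappers may remap differently.

```python
NESTED = "model.language_model."
FLAT = "model."  # assumed prefix the base chain expects for text weights


def convert_text_weights(state: dict, convert) -> dict:
    """Flatten text keys, run `convert` on them, then restore the nesting.

    Keys outside the nested text namespace are passed through unchanged.
    """
    text, result = {}, {}
    for key, value in state.items():
        if key.startswith(NESTED):
            text[FLAT + key[len(NESTED):]] = value
        else:
            result[key] = value
    for key, value in convert(text).items():
        nested_key = NESTED + key[len(FLAT):] if key.startswith(FLAT) else key
        result[nested_key] = value
    return result


def toy_chain(state: dict) -> dict:
    """Stand-in for the base conversion: renames a single router key."""
    out = dict(state)
    if "model.layers.0.mlp.gate.weight" in out:
        out["model.layers.0.mlp.router.gate.weight"] = out.pop("model.layers.0.mlp.gate.weight")
    return out


vlm_state = {
    "model.language_model.layers.0.mlp.gate.weight": [1.0],
    "model.visual.patch_embed.weight": [2.0],  # illustrative non-text key
}
converted = convert_text_weights(vlm_state, toy_chain)
```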

Reviewed by Cursor Bugbot for commit 63de2fd.

S1ro1 and others added 8 commits June 13, 2026 01:16
ConvOp vocabulary (Rename, PrefixRename, Drop, MoEExperts+FusedGateUp,
Synthetic, MapValue, SqueezeLeading, Conditional) + apply_hf_to_tt /
apply_tt_to_hf runners. Sharding-aware (name ops are value-agnostic; the
expert stack/unstack takes a global-expert offset for shard-local
operation), no gathers in the ops themselves.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…alence test

Proves the framework: forward (per-expert & fused), backward, and roundtrip
all match the legacy imperative convert_hf_to_tt_moe/convert_tt_to_hf_moe.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ivalence tests

Encode every model's HF<->prime conversion as an invertible op chain
(conversion_chains.py + models/<name>/conversion_chain.py), registered by
model_type and reachable via PreTrainedModelPrimeRL.conversion_ops. Covers
qwen3_moe, qwen3_5_moe, glm4_moe, glm_moe_dsa, minimax_m2, laguna, afmoe,
nemotron_h, and gpt_oss (identity).

Each chain is verified against the legacy imperative convert_* functions on
mock state dicts (forward, backward, and roundtrip where lossless) — 43
passing tests in tests/unit/train/models/conversions. The ops are
sharding-aware (name ops are value-agnostic; the expert stack/unstack takes
a global-expert offset) and never gather.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…/SplitConcat base-ops

Replace the two MoE-specific value ops with orthogonal primitives:
- Stack: stack/unstack a variable {e}-indexed group along a new dim (index_offset
  for shard-local global numbering)
- SplitConcat: split/concat fixed parts along an existing dim
- Sequence: bundle ops into one (lets the routed-experts helper stay a single op)

The MoE-ness now lives only in the _routed_experts_op composition helper (Stack
per proj + a Conditional that splits the fused gate_up input), not in the op
vocabulary. NemotronH's fused case collapses to a plain Rename. All 43
equivalence tests unchanged and green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…nversion_chain.py

Consistency: every model's chain now lives in its own package; conversion_chains.py
holds only the shared helpers (_routed_experts_op, _GATE_DOWN_UP) and the registry.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…orm dispatch

A single per-layer Conditional detects layer type from a signature key (present
in either HF or prime form, so it works both directions) and dispatches:
attention/mamba keep bulk PrefixRename(mixer.->{self_attn,mamba}.), MoE uses its
specific ops (incl. the gated Synthetic w3). Drops the layers_block_type
argument — build_nemotron_h_chain(num_layers) now matches every other model's
signature. Equivalence to the imperative converter (which still uses
layers_block_type) unchanged: 43 tests green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Each model defines `conversion_chain(config)` in its `converting_<model>.py`;
the `PreTrainedModelPrimeRL` base has a single set of `convert_to_hf` /
`convert_to_prime` / `convert_layer_*` implementations that play the chain
forward and backward. Because every op is present-guarded, the same chain
works over a full state dict, one layer's keys, or a local shard.

Removes the now-redundant scaffolding:
- per-model imperative `convert_hf_to_tt` / `convert_tt_to_hf` functions
- the standalone `conversion_chain.py` builders and the `conversion_chains` registry
- the mock equivalence tests (superseded by the 20-step math KL-mismatch
  validation in PR #2797) and the obsolete classmethod roundtrip tests

Net -2.2k lines. Validated end-to-end: qwen3_moe/qwen3_5_moe/nemotron_h/glm4_moe
mean KL mismatch < 0.0015 over 20 steps on math.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings back the per-model conversion roundtrip tests that predated the
declarative refactor (qwen3_5_moe, nemotron_h reverse + roundtrip, qwen3_5 VLM),
rewritten to call `convert_to_hf` / `convert_to_prime` on a model instance
(playing the declarative chain) instead of the removed classmethods. The
NemotronH reverse test keeps its pre-existing xfail (HF now uses fused expert
tensors). gpu-marked.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@S1ro1 S1ro1 marked this pull request as ready for review June 13, 2026 21:38
@S1ro1 S1ro1 marked this pull request as draft June 13, 2026 21:54