feat(models): declarative invertible conversion-op framework for all models by S1ro1 · Pull Request #2797 · PrimeIntellect-ai/prime-rl

S1ro1 · 2026-06-13T01:35:51Z

Summary

Replaces the per-model imperative convert_hf_to_tt / convert_tt_to_hf functions with a single declarative, invertible, sharding-aware conversion-operator framework, and encodes all 9 models against it.

Motivation: the HF↔prime conversions were hand-written per model and drifting (the NIXL weight-transfer adapter only covered a couple of the renames). Expressing each conversion as a chain of small bidirectional ops makes the inverse fall out for free, makes the conversion introspectable, and removes the per-model imperative duplication.

Framework (`trainer/models/conversion_ops.py`)

A conversion is a flat list of ConvOp; apply_hf_to_tt plays them forward, apply_tt_to_hf plays each op's backward in reverse. The op vocabulary is deliberately small and general (no model-specific ops):

- Rename, PrefixRename: value-agnostic name maps (trivially shard-safe, no gather).
- Drop: symmetric removal of keys with no counterpart (prime-only buffers like tokens_per_expert/reorderer; HF-only MTP heads).
- Stack: stack a variable-cardinality {e}-indexed group of per-expert tensors along a new dim (with index_offset for shard-local global expert numbering); its backward unstacks.
- SplitConcat: split/concat fixed parts along an existing dim (e.g. a fused gate_up_proj ↔ separate w1/w3).
- Sequence: bundle ops into one unit.
- Conditional(predicate, then, else_): dispatch on which keys are present (used for fused-vs-per-expert inputs, singular/plural names, and NemotronH layer-type dispatch).
- MapValue: explicit value transform with its own backward (NemotronH's lossy router-bias shift, identity backward).
- Synthetic: a prime-only tensor created forward / dropped backward (NemotronH dummy w3).
- SqueezeLeading: backward-only leading-singleton squeeze (GLM shared-expert shape[0]==1).
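
To make the "inverse falls out for free" point concrete, here is a minimal sketch of how such a chain could be played in both directions. It is not the code in `conversion_ops.py`: the class bodies, the runner signatures, and the key names are assumptions for illustration, and plain nested lists stand in for tensors so the example runs without any dependencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rename:
    """Value-agnostic key rename; a no-op when the source key is absent."""
    hf: str
    tt: str

    def forward(self, sd: dict) -> None:
        if self.hf in sd:  # present-guard
            sd[self.tt] = sd.pop(self.hf)

    def backward(self, sd: dict) -> None:
        if self.tt in sd:
            sd[self.hf] = sd.pop(self.tt)


@dataclass(frozen=True)
class SplitConcat:
    """Forward: split a fused value into equal parts along dim 0.
    Backward: concatenate the parts back into the fused value."""
    fused: str
    parts: tuple

    def forward(self, sd: dict) -> None:
        if self.fused not in sd:
            return
        rows = sd.pop(self.fused)
        step = len(rows) // len(self.parts)  # assumes equal-sized parts
        for i, name in enumerate(self.parts):
            sd[name] = rows[i * step:(i + 1) * step]

    def backward(self, sd: dict) -> None:
        if not all(name in sd for name in self.parts):
            return
        sd[self.fused] = [row for name in self.parts for row in sd.pop(name)]


def apply_hf_to_tt(chain, sd: dict) -> dict:
    """Play every op forward, in order (mutates and returns sd)."""
    for op in chain:
        op.forward(sd)
    return sd


def apply_tt_to_hf(chain, sd: dict) -> dict:
    """Play every op's backward, in reverse order."""
    for op in reversed(chain):
        op.backward(sd)
    return sd


# Hypothetical key names, chosen only for the example.
chain = [
    Rename("mlp.gate.weight", "mlp.router.gate.weight"),
    SplitConcat("mlp.gate_up_proj", ("mlp.w1", "mlp.w3")),
]
hf = {"mlp.gate.weight": [[1.0]], "mlp.gate_up_proj": [[1, 2], [3, 4], [5, 6], [7, 8]]}
tt = apply_hf_to_tt(chain, dict(hf))
roundtrip = apply_tt_to_hf(chain, dict(tt))
```

Because each op carries its own backward, the reverse conversion needs no separate per-model function: the runner only has to walk the same list in the opposite order.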

A shared routed_experts_op(prefix, …) helper (+ GATE_DOWN_UP) composes the common MoE expert stack/unstack (per-expert and fused-gate_up layouts) so each model's chain stays a few lines. Each model defines conversion_chain(config) in its converting_<model>.py; the PreTrainedModelPrimeRL base has a single set of convert_to_hf / convert_to_prime / convert_layer_to_* implementations that play the chain forward/backward — since every op is present-guarded, the same chain works over a full state dict, a single layer's keys, or a local shard.
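
The shard-local behaviour of the expert stack can be sketched as follows. This is an illustrative reading of `Stack` and `index_offset`, not the PR's implementation: the field names, the regex-based key matching, and the example keys are assumptions, and nested lists again stand in for tensors.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    """Stack '{e}'-indexed per-expert entries along a new leading dim."""
    per_expert: str        # key template containing '{e}'
    stacked: str           # key of the stacked value
    index_offset: int = 0  # global index of this shard's first expert

    def _pattern(self):
        head, tail = self.per_expert.split("{e}")
        return re.compile(re.escape(head) + r"(\d+)" + re.escape(tail))

    def forward(self, sd: dict) -> None:
        pattern = self._pattern()
        found = {}
        for key in list(sd):
            match = pattern.fullmatch(key)
            if match:
                found[int(match.group(1))] = sd.pop(key)
        if found:  # present-guard: nothing to stack on this layer or shard
            sd[self.stacked] = [found[e] for e in sorted(found)]

    def backward(self, sd: dict) -> None:
        if self.stacked not in sd:
            return
        for local, value in enumerate(sd.pop(self.stacked)):
            # Local slot i on this shard is global expert i + index_offset.
            sd[self.per_expert.format(e=local + self.index_offset)] = value


# A shard holding experts 4 and 5 of a larger model (hypothetical keys).
op = Stack("experts.{e}.w2", "experts.w2", index_offset=4)
shard = {"experts.w2": [[1], [2]], "norm.weight": [0]}
op.backward(shard)
unstacked = dict(shard)
op.forward(shard)
```

In this sketch the offset only matters on the way back to HF names, where local slot numbers have to be turned into global expert indices; no gather is needed because each shard names only the experts it owns.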

Coverage: Qwen3-MoE, Qwen3.5-MoE (incl. fused gate_up + shared expert), GLM-4 MoE / GLM-MoE-DSA (shared experts with shape[0]==1 squeeze + MLA passthrough), MiniMax-M2 (block_sparse_moe namespace + literal w1/w2/w3 proj names), Laguna (singular/plural shared-expert and dual bias-key inputs), AFMoE (no router rename, 3× expert stack, reorderer drop), NemotronH (backbone. prefix, layer-type mixer→{mamba,self_attn,mlp} dispatch, lossy router-bias shift, synthetic w3), and GPT-OSS (identity / empty chain).

Validation — KL mismatch (bugbot requirement)

Mean trainer-vs-inference KL mismatch (mismatch_kl/all/mean) over 20 steps on the math env, batch_size=64, smallest released checkpoint per architecture, on SLURM (4-trainer-node budget: EP=8, optim_cpu_offload, activation checkpointing + offloading, sign_sgd). Threshold for a passing custom model is < 0.015.

| Model (`arch`) | Checkpoint | Mean KL (20 steps) | < 0.015 |
|---|---|---|---|
| Qwen3-MoE (`qwen3_moe`) | `Qwen/Qwen3-30B-A3B` | 0.0015 | ✅ |
| Qwen3.5-MoE (`qwen3_5_moe`) | `Qwen/Qwen3.5-35B-A3B` | 0.0006 | ✅ |
| Nemotron-H (`nemotron_h`) | `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` | 0.0007 | ✅ |
| GLM-4.5 MoE (`glm4_moe`) | `zai-org/GLM-4.5-Air` | 0.0005 | ✅ |
| AFMoE (`afmoe`) | `arcee-ai/Trinity-Mini` | 0.077 | ⚠️ pre-existing (see below) |
| Laguna (`laguna`) | `poolside/Laguna-XS.2` | 0.034 (19 steps) | ⚠️ pre-existing (see below) |
| GPT-OSS (`gpt_oss`) | `unsloth/gpt-oss-20b-BF16` | n/a, blocked (see below) | n/a |
| MiniMax-M2 (`minimax_m2`) | `MiniMaxAI/MiniMax-M2` | n/a, fp8 load gap (see below) | n/a |

glm_moe_dsa was excluded from this run by request (shares the GLM-4 MoE structure already covered by glm4_moe).

Notes on the non-passing models — none are caused by this refactor

These were investigated; in every case the conversion chain in this PR produces output that is byte-for-byte equivalent to that of main's imperative converter, so the discrepancy is not introduced here:

AFMoE / Laguna — their chains are identical to main's converters. AFMoE's step-0 KL is 0.0016 (the load conversion is exact); the KL only grows under training, and Laguna is elevated from step 0 — i.e. a pre-existing prime-vs-vLLM modeling-parity gap, independent of weight conversion.
GPT-OSS — conversion is the identity (empty chain), so it cannot be affected by this PR. Training is blocked by a pre-existing crash in trainer/models/layers/moe.py expert_parallel (TypeError: wrapper() takes 4–5 positional args but 6 were given), reproducible at both ep=1 and ep=8 and untouched by this PR.
MiniMax-M2 — the only fp8-block-quantized checkpoint. The trainer's HF-load path does not dequantize fp8 weights (weight_scale_inv is not handled in the loader, nor in main's or this PR's converter), so the bf16 trainer loads raw fp8 → KL ≈ 6.1. A pre-existing fp8-checkpoint-load limitation, not the conversion refactor. (Inference also needs tp=4 so the per-shard expert dim 1536/4=384 is divisible by the fp8 block size 128.)
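
The divisibility constraint in the MiniMax-M2 note is simple arithmetic and can be checked directly. The snippet below is only a worked check of the numbers quoted above (expert dim 1536, fp8 block size 128); the helper name is made up for the example.

```python
EXPERT_DIM = 1536  # per-expert dim quoted in the note above
FP8_BLOCK = 128    # fp8 quantization block size quoted in the note above


def shard_is_block_aligned(dim: int, tp: int, block: int = FP8_BLOCK) -> bool:
    """True when dim splits evenly across tp ranks and each shard is a whole number of blocks."""
    return dim % tp == 0 and (dim // tp) % block == 0


for tp in (4, 8):
    per_shard = EXPERT_DIM // tp
    print(f"tp={tp}: per-shard dim {per_shard}, aligned={shard_is_block_aligned(EXPERT_DIM, tp)}")
```

At tp=4 each shard holds 384 = 3 × 128 rows, so block boundaries line up; at tp=8 the shard would hold 192 rows, which is not a multiple of 128.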

The four bf16 MoE families that exercise the full breadth of the op vocabulary (router renames, per-expert and fused-gate_up expert stacking, shared experts, prime-only buffer drops, NemotronH's mamba/attn/moe dispatch + synthetic w3) all pass comfortably: each has a mean KL at or below 0.0015, an order of magnitude under the 0.015 threshold.

🤖 Generated with Claude Code

Note

High Risk
This refactor sits on every checkpoint load, HF export, and NCCL weight-broadcast path; a subtle op-order or predicate bug could corrupt weights across nine architectures, though the PR targets behavioral parity with the old imperative converters.

Overview
Introduces conversion_ops.py, a small bidirectional operator vocabulary (Rename, PrefixRename, Drop, Stack, SplitConcat, Conditional, MapValue, Synthetic, SqueezeLeading, plus routed_experts_op) with apply_hf_to_tt / apply_tt_to_hf playing each model’s chain forward or backward in place.

PreTrainedModelPrimeRL no longer requires per-model classmethod convert_to_hf / convert_to_prime / layer variants; models implement conversion_chain(self.config) (instance method) and the base convert_to_* methods run the chain. Present-guarded ops mean the same chain is intended to work on full checkpoints, single-layer shards, and NCCL broadcast slices without separate layer loops.

Per-model converting_*.py files drop hundreds of lines of imperative stacking/renaming in favor of short conversion_chain definitions for AFMoE, GLM-4 MoE, GLM-MoE-DSA (reuses glm_moe_layer_ops), Laguna, MiniMax M2, NemotronH, Qwen3/Qwen3.5 MoE, and GPT-OSS (empty chain). NemotronH additionally drops runtime layers_block_type inference and bespoke per-layer backbone. handling in favor of predicate-based layer dispatch in the chain.
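
Predicate-based layer dispatch can be illustrated with a small sketch. The `Conditional` and `PrefixRename` bodies below are assumptions written for this example rather than the PR's code, and the signature keys (`A_log`, `q_proj.weight`) are picked only to show a predicate that holds in both the HF and the prime naming; the dict represents the keys of a single layer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PrefixRename:
    """Rename every key that starts with one prefix to start with the other."""
    hf: str
    tt: str

    def forward(self, sd: dict) -> None:
        for key in [k for k in sd if k.startswith(self.hf)]:
            sd[self.tt + key[len(self.hf):]] = sd.pop(key)

    def backward(self, sd: dict) -> None:
        for key in [k for k in sd if k.startswith(self.tt)]:
            sd[self.hf + key[len(self.tt):]] = sd.pop(key)


@dataclass(frozen=True)
class Conditional:
    """Pick a branch by inspecting which keys are present, in either direction."""
    predicate: Callable[[dict], bool]
    then: tuple = ()
    else_: tuple = ()

    def _branch(self, sd: dict) -> tuple:
        return self.then if self.predicate(sd) else self.else_

    def forward(self, sd: dict) -> None:
        for op in self._branch(sd):
            op.forward(sd)

    def backward(self, sd: dict) -> None:
        for op in reversed(self._branch(sd)):
            op.backward(sd)


def has_suffix(suffix: str) -> Callable[[dict], bool]:
    """Predicate on a signature key suffix that survives the prefix rename."""
    return lambda sd: any(key.endswith(suffix) for key in sd)


layer_op = Conditional(
    has_suffix(".A_log"),
    then=(PrefixRename("mixer.", "mamba."),),
    else_=(
        Conditional(
            has_suffix(".q_proj.weight"),
            then=(PrefixRename("mixer.", "self_attn."),),
        ),
    ),
)

mamba_layer = {"mixer.A_log": 1, "mixer.in_proj.weight": 2}
layer_op.forward(mamba_layer)
attn_layer = {"mixer.q_proj.weight": 3}
layer_op.forward(attn_layer)
```

Because the predicate looks at a suffix that the rename leaves untouched, the same `Conditional` selects the same branch when the chain is played backward, which is what lets the layer type be inferred from the keys instead of being passed in.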

Qwen3.5-MoE VLM keeps nested model.language_model.* keys via instance convert_to_hf / convert_to_prime wrappers that flatten/remap text weights before calling the base chain. Unit tests switch from ModelClass.convert_to_* to model.convert_to_*.

Reviewed by Cursor Bugbot for commit 63de2fd. Bugbot is set up for automated code reviews on this repo.

ConvOp vocabulary (Rename, PrefixRename, Drop, MoEExperts+FusedGateUp, Synthetic, MapValue, SqueezeLeading, Conditional) + apply_hf_to_tt / apply_tt_to_hf runners. Sharding-aware (name ops are value-agnostic; the expert stack/unstack takes a global-expert offset for shard-local operation), no gathers in the ops themselves. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…alence test Proves the framework: forward (per-expert & fused), backward, and roundtrip all match the legacy imperative convert_hf_to_tt_moe/convert_tt_to_hf_moe. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…ivalence tests Encode every model's HF<->prime conversion as an invertible op chain (conversion_chains.py + models/<name>/conversion_chain.py), registered by model_type and reachable via PreTrainedModelPrimeRL.conversion_ops. Covers qwen3_moe, qwen3_5_moe, glm4_moe, glm_moe_dsa, minimax_m2, laguna, afmoe, nemotron_h, and gpt_oss (identity). Each chain is verified against the legacy imperative convert_* functions on mock state dicts (forward, backward, and roundtrip where lossless) — 43 passing tests in tests/unit/train/models/conversions. The ops are sharding-aware (name ops are value-agnostic; the expert stack/unstack takes a global-expert offset) and never gather. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…/SplitConcat base-ops Replace the two MoE-specific value ops with orthogonal primitives:
- Stack: stack/unstack a variable {e}-indexed group along a new dim (index_offset for shard-local global numbering)
- SplitConcat: split/concat fixed parts along an existing dim
- Sequence: bundle ops into one (lets the routed-experts helper stay a single op)

The MoE-ness now lives only in the _routed_experts_op composition helper (Stack per proj + a Conditional that splits the fused gate_up input), not in the op vocabulary. NemotronH's fused case collapses to a plain Rename. All 43 equivalence tests unchanged and green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…nversion_chain.py Consistency: every model's chain now lives in its own package; conversion_chains.py holds only the shared helpers (_routed_experts_op, _GATE_DOWN_UP) and the registry. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…orm dispatch A single per-layer Conditional detects layer type from a signature key (present in either HF or prime form, so it works both directions) and dispatches: attention/mamba keep bulk PrefixRename(mixer.->{self_attn,mamba}.), MoE uses its specific ops (incl. the gated Synthetic w3). Drops the layers_block_type argument — build_nemotron_h_chain(num_layers) now matches every other model's signature. Equivalence to the imperative converter (which still uses layers_block_type) unchanged: 43 tests green. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Each model defines `conversion_chain(config)` in its `converting_<model>.py`; the `PreTrainedModelPrimeRL` base has a single set of `convert_to_hf` / `convert_to_prime` / `convert_layer_*` implementations that play the chain forward and backward. Because every op is present-guarded, the same chain works over a full state dict, one layer's keys, or a local shard.

Removes the now-redundant scaffolding:
- per-model imperative `convert_hf_to_tt` / `convert_tt_to_hf` functions
- the standalone `conversion_chain.py` builders and the `conversion_chains` registry
- the mock equivalence tests (superseded by the 20-step math KL-mismatch validation in PR #2797) and the obsolete classmethod roundtrip tests

Net -2.2k lines. Validated end-to-end: qwen3_moe/qwen3_5_moe/nemotron_h/glm4_moe mean KL mismatch < 0.0015 over 20 steps on math.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Brings back the per-model conversion roundtrip tests that predated the declarative refactor (qwen3_5_moe, nemotron_h reverse + roundtrip, qwen3_5 VLM), rewritten to call `convert_to_hf` / `convert_to_prime` on a model instance (playing the declarative chain) instead of the removed classmethods. The NemotronH reverse test keeps its pre-existing xfail (HF now uses fused expert tensors). gpu-marked. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

S1ro1 and others added 8 commits June 13, 2026 01:16

S1ro1 marked this pull request as ready for review June 13, 2026 21:38

S1ro1 marked this pull request as draft June 13, 2026 21:54

S1ro1 wants to merge 8 commits into main from feat/declarative-conversions
