Add Tencent Hy3 (HYV3) MoE model support #2789
Draft
hallerite wants to merge 1 commit into
Implement full pipeline for Hy3-preview (295B MoE, 21B activated/token): modeling code, HF<->PrimeRL weight conversion, registration, and mini-model verification.

Key points:
- Sigmoid router with e_score_correction_bias selection, route norm, and router_scaling_factor=2.826, mapped onto TokenChoiceTopKRouter semantics
- Per-head QK norm (Apertus-style) via existing AttentionConfig
- Conversion handles both the hub checkpoint format (per-expert experts, mlp.router.gate, mlp.expert_bias, mlp.shared_mlp) and the transformers in-memory format (fused gate_up_proj, mlp.gate, e_score_correction_bias, mlp.shared_experts); convert_to_hf emits the hub format which both vLLM and transformers load natively
- MTP layer (model.layers.80, speculative decoding only) dropped on load
- Config parses the hub config.json field names directly; emits expert_hidden_dim for vLLM compatibility
- Fix stale SFT warm-up command in docs/development.md

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
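The routing steps named in the key points (sigmoid scores, bias-corrected selection, route norm, scaling) can be sketched for a single token as below. This is a minimal illustration in plain Python, not the PR's `TokenChoiceTopKRouter` code. The function name `route_token` and its parameters are hypothetical, and it assumes the common convention that the correction bias affects only which experts are selected while the combine weights come from the unbiased sigmoid scores.

```python
import math


def route_token(logits, correction_bias, top_k=8, route_norm=True, scaling_factor=2.826):
    """Pick top_k experts for one token and return (expert_indices, combine_weights).

    Hypothetical helper for illustration only.
    """
    # Sigmoid scores per expert (instead of a softmax over experts).
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]

    # Assumption: the bias is added only for expert selection.
    biased = [s + b for s, b in zip(scores, correction_bias)]
    chosen = sorted(range(len(scores)), key=lambda i: biased[i], reverse=True)[:top_k]

    # Combine weights use the unbiased scores of the selected experts.
    weights = [scores[i] for i in chosen]
    if route_norm:
        total = sum(weights)
        weights = [w / total for w in weights]

    # Final scaling by the router scaling factor.
    weights = [w * scaling_factor for w in weights]
    return chosen, weights


if __name__ == "__main__":
    experts, weights = route_token(
        logits=[0.0, 2.0, -1.0, 1.0],
        correction_bias=[0.0, 0.0, 3.0, 0.0],
        top_k=2,
    )
    print(experts, weights)
```

In this toy call the large bias on expert 2 gets it selected even though its raw score is the lowest, while its combine weight stays small because the weights are computed from the unbiased scores.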
Adds Tencent Hy3-preview (295B MoE, 21B activated/token, 192 experts top-8 + 1 shared) as a custom trainer model.
What's in here
- Modeling code (`src/prime_rl/trainer/models/hy_v3/`): `glm4_moe`-style decoder built from existing layer primitives: per-head QK norm (Apertus-style), sigmoid router with `e_score_correction_bias`-based selection + top-k normalization + `router_scaling_factor=2.826` (maps 1:1 onto `TokenChoiceTopKRouter` semantics), 1 dense + 79 sparse layers, shared expert via `BCFeedForward`.
- Conversion handles both the hub checkpoint format (per-expert experts, `mlp.router.gate.weight`, `mlp.expert_bias`, `mlp.shared_mlp.*`) and the transformers 5.x in-memory format (fused `gate_up_proj`, `mlp.gate.weight`, `mlp.e_score_correction_bias`, `mlp.shared_experts.*`). `convert_to_hf` emits the hub format, which both vLLM and transformers load natively. The MTP layer (`model.layers.80`, speculative decoding only) is dropped on load.
- Config parses the hub `config.json` field names directly (`first_k_dense_replace`, `qk_norm`, `route_norm`, ...), derives `mlp_layer_types` for compatibility with transformers' native `HYV3ForCausalLM`, and emits `expert_hidden_dim` so checkpoints saved by the trainer stay loadable by vLLM. The hub's `use_grouped_mm: false` (which refers to HF's eager experts impl) is harmlessly overridden by the trainer from `ModelConfig.moe_use_grouped_mm`.
- Registration and mini-model verification (`hy_v3`) against transformers' native implementation.
- Fixes the stale SFT warm-up command in `docs/development.md` (`--data.type sft`, `--ckpt.weights.save-sharded false`).

No inference-side changes needed: vLLM 0.22.0 (our pin) already ships `HYV3ForCausalLM`, the `hy_v3` reasoning/tool parsers, and MTP speculative decoding; transformers 5.6.2 has the architecture natively (no `trust_remote_code`).

Validation

`mini_moe` create + verify (fp32, CUDA):

SFT warm-up on reverse-text (200 steps, loss 11.75 → 4.27), then the full RL stack (`configs/ci/integration/reverse_text_moe/start.toml`, batch 128, 20 steps, FA2 trainer vs vLLM triton MoE backend):

All entries well under the 0.015 bar. Note: this is the reverse-text smoke env on the mini model (2× RTX PRO 6000); the documented `math` + `batch_size=64` table on the real checkpoint needs a node that can hold 295B.

Known limitations / follow-ups

- No Hy3 renderer in `deps/renderers` yet: runs need `[orchestrator.renderer] name = "default"` (chat-template fallback). A dedicated renderer handling Hy3's `reasoning_effort` kwarg + tool format is a follow-up (renderers repo).
- `torch.compile` hits a pre-existing Inductor unbacked-symint assertion on this dev box; it reproduces identically with `glm4_moe`, so it's environmental, not model-specific.
- … `--vllm-extra.moe_backend triton` on such boxes.

🤖 Generated with Claude Code
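For readers unfamiliar with the two checkpoint layouts, the simple key renames and the MTP-layer drop described in this PR can be sketched as follows. This is only an illustration under stated assumptions, not the PR's conversion code: the function `rename_hub_keys`, the hub-to-transformers direction, and the full key paths (for example the `model.layers.{i}.` prefix and the `up_proj.weight` suffix used in the usage example) are hypothetical, and fusing per-expert weights into `gate_up_proj` is deliberately left out.

```python
# Suffix renames named in the PR text (hub name -> transformers in-memory name).
HUB_TO_TRANSFORMERS_SUFFIX = {
    "mlp.router.gate.weight": "mlp.gate.weight",
    "mlp.expert_bias": "mlp.e_score_correction_bias",
}
SHARED_HUB = "mlp.shared_mlp."
SHARED_TRANSFORMERS = "mlp.shared_experts."
# The MTP layer is only used for speculative decoding, so it is skipped.
MTP_PREFIX = "model.layers.80."


def rename_hub_keys(state_dict):
    """Return a new dict with hub-style keys renamed and MTP keys dropped.

    Hypothetical helper; values are passed through untouched.
    """
    renamed = {}
    for key, value in state_dict.items():
        if key.startswith(MTP_PREFIX):
            continue  # drop the MTP layer on load

        new_key = key
        for hub_suffix, tf_suffix in HUB_TO_TRANSFORMERS_SUFFIX.items():
            if key.endswith(hub_suffix):
                new_key = key[: -len(hub_suffix)] + tf_suffix
                break
        else:
            # No suffix rename matched: check the shared-expert prefix instead.
            if SHARED_HUB in key:
                new_key = key.replace(SHARED_HUB, SHARED_TRANSFORMERS)

        renamed[new_key] = value
    return renamed


if __name__ == "__main__":
    example = {
        "model.layers.1.mlp.router.gate.weight": "router",
        "model.layers.1.mlp.expert_bias": "bias",
        "model.layers.1.mlp.shared_mlp.up_proj.weight": "shared",
        "model.layers.80.mlp.expert_bias": "mtp",
    }
    for name in rename_hub_keys(example):
        print(name)
```

Keeping the renames in a small table makes it straightforward to invert the mapping for the opposite direction, which is what emitting the hub format on save would need.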