[Klaud Cold] DSV4 MI355X vLLM disagg smoke test (8k1k conc=32) #2081
functionstackx wants to merge 1 commit into
DeepSeek-V4-Pro disaggregated P/D on MI355X via vLLM + MoRI-IO: new dsv4-fp4-mi355x-vllm-disagg config (1P1D, TP8/EP1, 8k1k conc=32 only), thin multi_node launcher, and DeepSeek-V4-Pro entry in models_vllm.yaml reusing the validated aggregated serving flags. Uses the latest vLLM ROCm nightly (2026-07-04). Refreshes closed #1707 against current main, which already carries the #1585 patch-free MoRI-IO path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
6bcdd1d to 972ef1e
Thanks for the contribution! Please reach out to the respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging a core maintainer on Slack for review. For the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase … For PR verification, add the … PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. See GitHub's docs on re-running failed jobs.
DeepSeek-V4-Pro:
  # DeepSeek-V4-Pro is mixed-precision FP4+FP8 (FP4 MoE expert weights dominate
  # the ~960 GB footprint; FP8 on attention/norm/router; FP8 KV cache at runtime).
  # InferenceX classifies this as the fp4 variant.
  #
  # Serving flags reuse the validated single-node MI355X recipe
  # (benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh, from
  # vllm-project/recipes#433) so the per-node engine config is identical to the
  # known-good aggregated run; disaggregation only adds the MoRIIO kv-transfer
  # role (injected by server_vllm.sh). Each P/D worker is a full TP=8 node, EP=1
  # (matching the aggregated recipe, which runs DSv4 on TP=8 without expert
  # parallelism). DEP decode is a follow-up.
  #
  # --moe-backend triton_unfused is REQUIRED for the FP4 MoE expert weight format;
  # the auto backend doesn't register the FP4 scale params and safetensors load
  # raises KeyError. --enforce-eager (no CUDA graphs) keeps the first disagg recipe
  # robust against cudagraph/MoRIIO-hook interactions; FULL/PIECEWISE capture is a
  # follow-up. --async-scheduling is intentionally omitted (not used by the kimi /
  # minimax vllm-disagg recipes).
  prefill_flags: "--tensor-parallel-size 8 --distributed-executor-backend mp --kv-cache-dtype fp8 --moe-backend triton_unfused --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 --no-enable-prefix-caching --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 --enforce-eager"
  decode_flags: "--tensor-parallel-size 8 --distributed-executor-backend mp --kv-cache-dtype fp8 --moe-backend triton_unfused --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 --no-enable-prefix-caching --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 --enforce-eager"
  env: "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"
  hf_dir: "models--deepseek-ai--DeepSeek-V4-Pro"
🟡 The new dsv4-fp4-mi355x-vllm-disagg entry sets --moe-backend triton_unfused and omits VLLM_ROCM_USE_AITER_MOE=1, but the yaml comment, PR description, and perf-changelog all claim per-node flags 'reuse the aggregated recipe verbatim' / 'identical to the known-good aggregated run'. The current aggregated recipe (benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh, post-PR #1980) actually uses --moe-backend aiter + VLLM_ROCM_USE_AITER_MOE=1 — the disagg entry looks pinned to the pre-#1980 backend. Either flip it to aiter + VLLM_ROCM_USE_AITER_MOE=1 to actually match, or update the comment/perf-changelog to state the divergence is intentional (e.g. a conservative first bring-up).
Extended reasoning...
What the bug is
The new disagg recipe added by this PR (configs/amd-master.yaml and the DeepSeek-V4-Pro entry in benchmarks/multi_node/amd_utils/models_vllm.yaml) claims, in three places, to reuse the validated single-node MI355X DSv4 recipe verbatim:
- PR description: "reusing the validated single-node flags verbatim (--moe-backend triton_unfused required for the FP4 expert format...)"
- yaml comment (models_vllm.yaml:63-79): "Serving flags reuse the validated single-node MI355X recipe... so the per-node engine config is identical to the known-good aggregated run", and further "--moe-backend triton_unfused is REQUIRED for the FP4 MoE expert weight format; the auto backend doesn't register the FP4 scale params..."
- perf-changelog (line 4463): "Per-node flags reuse the aggregated recipe verbatim (--moe-backend triton_unfused for the FP4 expert format...)"
But this description is stale. As of PR #1980 (merged the same day as this PR), the aggregated recipe at benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh no longer uses triton_unfused. It uses AITER MoE:
    # dsv4_fp4_mi355x_vllm.sh, lines 47-48:
    export VLLM_ROCM_USE_AITER=1
    export VLLM_ROCM_USE_AITER_MOE=1
    ...
    # line 79:
    --moe-backend aiter \
The aggregated recipe's own comment (lines 15-19) explicitly says: "Use the AITER MoE backend (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 MoE expert weights... The AITER MXFP4 path registers the FP4 scale parameters (w13_weight_scale / w2_weight_scale), so safetensors loads correctly and decode runs on the fused AITER experts instead of triton_unfused."
The new disagg entry, meanwhile, uses:
    prefill_flags: "... --moe-backend triton_unfused ..."
    decode_flags: "... --moe-backend triton_unfused ..."
    env: "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"  # no VLLM_ROCM_USE_AITER_MOE
So the recipe was written against a pre-#1980 aggregated recipe.
Why the yaml comment's justification is stale
The comment justifies triton_unfused as REQUIRED because "the auto backend doesn't register the FP4 scale params and safetensors load raises KeyError". That was true for --moe-backend auto before #1980 — but the aggregated recipe doesn't use auto, it uses aiter explicitly, and the whole point of #1980 was that AITER MXFP4 DOES register w13_weight_scale/w2_weight_scale correctly. The nightly image pinned here (nightly-f329ce405b…, 2026-07-04) is newer than the aggregated recipe's image, so AITER MoE support is present.
Step-by-step proof
1. Open benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh on main and grep for moe-backend: it matches line 79 (--moe-backend aiter). Grep for AITER_MOE: it matches line 48 (export VLLM_ROCM_USE_AITER_MOE=1). There are no occurrences of triton_unfused in the file.
2. Open benchmarks/multi_node/amd_utils/models_vllm.yaml at the new DeepSeek-V4-Pro block: prefill_flags and decode_flags both contain --moe-backend triton_unfused. The env string contains VLLM_ROCM_USE_AITER=1 but not VLLM_ROCM_USE_AITER_MOE=1.
3. git log --oneline -- benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh and perf-changelog.yaml show PR #1980 ([AMD] DeepSeek-V4 FP4 MI355X vLLM STP: bump image to latest nightly) with the entry: "Switch the MoE backend from triton_unfused to AITER MoE (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 experts."
4. The PR description here explicitly claims verbatim reuse of the single-node recipe, but (1)+(2) show the flags don't match.
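The grep checks in steps 1 and 2 could be scripted along these lines. This is a minimal sketch, not part of the PR: the function names moe_backends and check_moe_backend_match are hypothetical, and it assumes the backend is always spelled `--moe-backend <value>` (or `--moe-backend=<value>`). It scans whole files, comments included, so for a multi-model file such as models_vllm.yaml it would need to be pointed at an extract containing only the DeepSeek-V4-Pro block.

```shell
# Hypothetical helper (not part of this PR): print the distinct values passed
# to --moe-backend anywhere in a file, one per line, sorted.
moe_backends() {
    grep -oE -- '--moe-backend[ =][A-Za-z0-9_]+' "$1" \
        | sed -E 's/^--moe-backend[ =]//' \
        | sort -u
}

# Compare the backend sets of two recipe files.
# Returns 0 when they are identical, 1 when they differ.
check_moe_backend_match() {
    local a b
    a=$(moe_backends "$1")
    b=$(moe_backends "$2")
    if [ "$a" = "$b" ]; then
        echo "match: ${a:-<none>}"
    else
        echo "drift: $1 uses '${a:-<none>}', $2 uses '${b:-<none>}'"
        return 1
    fi
}
```

Calling check_moe_backend_match with the aggregated script and an extract of the new yaml block as its two arguments would be expected to report drift for the state described in steps 1 and 2.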
Impact
This is a smoke test (a single ISL/OSL, single conc=32 point) with full-sweep-fail-fast on, so runtime blast radius is bounded to that one job. Two outcomes are possible:
- triton_unfused still works on the pinned nightly → the disagg smoke perf number will be lower than the aggregated recipe it claims to mirror (AITER was chosen in #1980 for perf reasons); the recorded baseline is misleading.
- triton_unfused has bitrotted on the newer nightly → the smoke job fails and CI catches it before merge.
Either way, once the smoke test is expanded to a real conc sweep (per the yaml comment, that's the follow-up), the divergence from the aggregated recipe becomes a real perf-comparison gotcha.
How to fix
Pick one:
(a) Actually match the aggregated recipe (what the description promises):
    prefill_flags: "... --moe-backend aiter ..."  # was: triton_unfused
    decode_flags: "... --moe-backend aiter ..."
    env: "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"
and rewrite the yaml comment to say the AITER MXFP4 path registers the FP4 scale params (mirroring the aggregated recipe's own comment).
(b) Keep triton_unfused intentionally, but update the comment, PR description, and perf-changelog to explicitly state "the disagg recipe intentionally diverges from the aggregated recipe: it pins to the pre-#1980 triton_unfused path as a conservative first bring-up before enabling AITER MoE + MoRIIO together", so future readers aren't misled by the "verbatim / identical" claims.
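If option (a) is taken, the flag and the env var have to move together. The sketch below is a hypothetical lint, not an existing repo script: check_aiter_moe_env is an invented name, and it assumes that `--moe-backend aiter` should always be paired with VLLM_ROCM_USE_AITER_MOE=1, which is what the aggregated recipe does but is not confirmed here as a hard vLLM requirement.

```shell
# Hypothetical lint (assumption: --moe-backend aiter must be paired with
# VLLM_ROCM_USE_AITER_MOE=1, as in the aggregated recipe).
# $1: a flags string such as prefill_flags; $2: the env string.
check_aiter_moe_env() {
    local flags="$1" env_str="$2"
    # Pad with spaces so only whole, space-delimited tokens match.
    case " $flags " in
        *" --moe-backend aiter "*)
            case " $env_str " in
                *" VLLM_ROCM_USE_AITER_MOE=1 "*)
                    return 0 ;;
                *)
                    echo "missing VLLM_ROCM_USE_AITER_MOE=1 for --moe-backend aiter" >&2
                    return 1 ;;
            esac ;;
    esac
    # Any other backend: nothing to enforce.
    return 0
}
```

The space padding matters because VLLM_ROCM_USE_AITER=1 is a prefix of the variable being checked; matching whole tokens keeps the existing env string from passing by accident.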
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28721684504
Closing per cleanup of non-mergeable PRs. The DSV4 MI355X vLLM disagg smoke test failed validation; the branch is kept for diagnosis and re-cut.
Summary
- dsv4-fp4-mi355x-vllm-disagg: DeepSeek-V4-Pro disaggregated prefill/decode on MI355X via vLLM + MoRI-IO, a refresh of the closed #1707 ([Klaud Cold] dsv4-fp4-mi355x-vllm-disagg: DeepSeek-V4-Pro vLLM disagg (8k1k conc=1 smoke test)) rebuilt against current main, with the smoke point moved to 8k1k conc=32 only and the image bumped to the latest vLLM ROCm nightly (vllm/vllm-openai-rocm:nightly-f329ce405b…, 2026-07-04, verified on Docker Hub).
- (… setup_deps.sh, read_mode via kv_connector_extra_config in server_vllm.sh, mori_low_latency a2a, no VLLM_MORIIO_CONNECTOR_READ_MODE anywhere), so unlike #1707 this PR touches no framework plumbing, only the three additive pieces:
  - configs/amd-master.yaml: the new entry, 1P1D with TP8/EP1 on both sides (matching the aggregated DSv4 recipe; no a2a backend needed at EP1), runner: mi355x-disagg, framework: vllm-disagg.
  - benchmarks/multi_node/dsv4_fp4_mi355x_vllm-disagg.sh: thin launcher identical in shape to the kimi/minimax vllm-disagg wrappers.
  - amd_utils/models_vllm.yaml: DeepSeek-V4-Pro entry reusing the validated single-node flags verbatim (--moe-backend triton_unfused required for the FP4 expert format, deepseek_v4 tokenizer/reasoning parser, fp8 KV, --enforce-eager for first bring-up).
- perf-changelog.yaml entry. full-sweep-fail-fast label applied.
🤖 Generated with Claude Code