[Klaud Cold] DSV4 MI355X vLLM disagg smoke test (8k1k conc=32) by functionstackx · Pull Request #2081 · SemiAnalysisAI/InferenceX

functionstackx · 2026-07-04T22:28:59Z

Summary

Adds dsv4-fp4-mi355x-vllm-disagg: DeepSeek-V4-Pro disaggregated prefill/decode on MI355X via vLLM + MoRI-IO. This is a refresh of the closed #1707 ([Klaud Cold] dsv4-fp4-mi355x-vllm-disagg: DeepSeek-V4-Pro vLLM disagg (8k1k conc=1 smoke test)), rebuilt against current main, with the smoke point moved to 8k1k conc=32 only and the image bumped to the latest vLLM ROCm nightly (vllm/vllm-openai-rocm:nightly-f329ce405b…, 2026-07-04, verified on Docker Hub).
Research notes: main already carries the #1585 migration ([Fix] Remove MoRI-IO patches from vLLM Disagg benchmarks): MoRI-IO runtime patches removed from setup_deps.sh, read_mode passed via kv_connector_extra_config in server_vllm.sh, mori_low_latency a2a, and no VLLM_MORIIO_CONNECTOR_READ_MODE anywhere. So, unlike #1707, this PR touches no framework plumbing, only the three additive pieces:
1. configs/amd-master.yaml: the new entry — 1P1D, TP8/EP1 both sides (matching the aggregated DSv4 recipe; no a2a backend needed at EP1), runner: mi355x-disagg, framework: vllm-disagg.
2. benchmarks/multi_node/dsv4_fp4_mi355x_vllm-disagg.sh: thin launcher identical in shape to the kimi/minimax vllm-disagg wrappers.
3. amd_utils/models_vllm.yaml: DeepSeek-V4-Pro entry reusing the validated single-node flags verbatim (--moe-backend triton_unfused required for the FP4 expert format, deepseek_v4 tokenizer/reasoning parser, fp8 KV, --enforce-eager for first bring-up).
Plus the required perf-changelog.yaml entry. full-sweep-fail-fast label applied.

🤖 Generated with Claude Code

DeepSeek-V4-Pro disaggregated P/D on MI355X via vLLM + MoRI-IO: new dsv4-fp4-mi355x-vllm-disagg config (1P1D, TP8/EP1, 8k1k conc=32 only), thin multi_node launcher, and DeepSeek-V4-Pro entry in models_vllm.yaml reusing the validated aggregated serving flags. Uses the latest vLLM ROCm nightly (2026-07-04). Refreshes closed #1707 against current main, which already carries the #1585 patch-free MoRI-IO path. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

github-actions · 2026-07-04T22:29:06Z

Thanks for the contribution! Please reach out to the respective company's CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging a core maintainer on Slack for review. For the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase "As a PR reviewer and CODEOWNER, I have reviewed this and have".

For PR verification, add the full-sweep-fail-fast label (strongly recommended) to this PR — the benchmark sweep only runs on labeled PRs. Use full-sweep-enabled only if you need matrix jobs to keep running past a failure.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and simply re-running the failed jobs will fix them. See GitHub's docs on re-running failed jobs.

claude · 2026-07-04T22:44:51Z

+
+DeepSeek-V4-Pro:
+  # DeepSeek-V4-Pro is mixed-precision FP4+FP8 (FP4 MoE expert weights dominate
+  # the ~960 GB footprint; FP8 on attention/norm/router; FP8 KV cache at runtime).
+  # InferenceX classifies this as the fp4 variant.
+  #
+  # Serving flags reuse the validated single-node MI355X recipe
+  # (benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh, from
+  # vllm-project/recipes#433) so the per-node engine config is identical to the
+  # known-good aggregated run; disaggregation only adds the MoRIIO kv-transfer
+  # role (injected by server_vllm.sh). Each P/D worker is a full TP=8 node, EP=1
+  # — matching the aggregated recipe, which runs DSv4 on TP=8 without expert
+  # parallelism. DEP decode is a follow-up.
+  #
+  # --moe-backend triton_unfused is REQUIRED for the FP4 MoE expert weight format;
+  # the auto backend doesn't register the FP4 scale params and safetensors load
+  # raises KeyError. --enforce-eager (no CUDA graphs) keeps the first disagg recipe
+  # robust against cudagraph/MoRIIO-hook interactions; FULL/PIECEWISE capture is a
+  # follow-up. --async-scheduling is intentionally omitted (not used by the kimi /
+  # minimax vllm-disagg recipes).
+  prefill_flags: "--tensor-parallel-size 8 --distributed-executor-backend mp --kv-cache-dtype fp8 --moe-backend triton_unfused --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 --no-enable-prefix-caching --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 --enforce-eager"
+  decode_flags: "--tensor-parallel-size 8 --distributed-executor-backend mp --kv-cache-dtype fp8 --moe-backend triton_unfused --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 --no-enable-prefix-caching --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 --enforce-eager"
+  env: "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"
+  hf_dir: "models--deepseek-ai--DeepSeek-V4-Pro"
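
As a rough sanity check on the flags above, the ~960 GB footprint and --gpu-memory-utilization 0.9 can be turned into a per-GPU budget. This is a back-of-envelope sketch, not a measurement: it assumes the weights shard evenly across the 8 tensor-parallel ranks, and the 288 GB HBM capacity for MI355X is an assumption taken from outside this PR.

```python
# Back-of-envelope memory budget per GPU for one TP=8 worker.
TOTAL_WEIGHTS_GB = 960   # "~960 GB footprint" from the yaml comment
TP_SIZE = 8              # --tensor-parallel-size 8
GPU_MEM_UTIL = 0.9       # --gpu-memory-utilization 0.9
HBM_PER_GPU_GB = 288     # ASSUMPTION: MI355X HBM capacity, not stated in this PR


def per_gpu_budget(total_weights_gb, tp_size, hbm_gb, mem_util):
    """Return (weights_gb, usable_gb, headroom_gb) per GPU, assuming even sharding."""
    weights = total_weights_gb / tp_size
    usable = hbm_gb * mem_util
    return weights, usable, usable - weights


weights, usable, headroom = per_gpu_budget(
    TOTAL_WEIGHTS_GB, TP_SIZE, HBM_PER_GPU_GB, GPU_MEM_UTIL
)
print(f"weights/GPU ~{weights:.0f} GB, usable ~{usable:.0f} GB, headroom ~{headroom:.0f} GB")
```

Under these assumptions the headroom is what remains for the FP8 KV cache and activations; real numbers will differ because some tensors are replicated rather than sharded.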


🟡 The new dsv4-fp4-mi355x-vllm-disagg entry sets --moe-backend triton_unfused and omits VLLM_ROCM_USE_AITER_MOE=1, but the yaml comment, PR description, and perf-changelog all claim per-node flags 'reuse the aggregated recipe verbatim' / 'identical to the known-good aggregated run'. The current aggregated recipe (benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh, post-PR #1980) actually uses --moe-backend aiter + VLLM_ROCM_USE_AITER_MOE=1 — the disagg entry looks pinned to the pre-#1980 backend. Either flip it to aiter + VLLM_ROCM_USE_AITER_MOE=1 to actually match, or update the comment/perf-changelog to state the divergence is intentional (e.g. a conservative first bring-up).

Extended reasoning...

What the bug is

The new disagg recipe added by this PR (configs/amd-master.yaml and the DeepSeek-V4-Pro entry in benchmarks/multi_node/amd_utils/models_vllm.yaml) claims — in three places — to reuse the validated single-node MI355X DSv4 recipe verbatim:

PR description: "reusing the validated single-node flags verbatim (--moe-backend triton_unfused required for the FP4 expert format...)"

yaml comment (models_vllm.yaml:63-79): "Serving flags reuse the validated single-node MI355X recipe... so the per-node engine config is identical to the known-good aggregated run", and further "--moe-backend triton_unfused is REQUIRED for the FP4 MoE expert weight format; the auto backend doesn't register the FP4 scale params..."

perf-changelog (line 4463): "Per-node flags reuse the aggregated recipe verbatim (--moe-backend triton_unfused for the FP4 expert format...)"

But this description is stale. As of PR #1980 (merged the same day as this PR), the aggregated recipe at benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh no longer uses triton_unfused. It uses AITER MoE:

# dsv4_fp4_mi355x_vllm.sh, lines 47-48:
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
...
# line 79:
--moe-backend aiter \

The aggregated recipe's own comment (lines 15-19) explicitly says: "Use the AITER MoE backend (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 MoE expert weights... The AITER MXFP4 path registers the FP4 scale parameters (w13_weight_scale / w2_weight_scale), so safetensors loads correctly and decode runs on the fused AITER experts instead of triton_unfused."

The new disagg entry, meanwhile, uses:

prefill_flags: "... --moe-backend triton_unfused ..."
decode_flags: "... --moe-backend triton_unfused ..."
env: "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"  # no VLLM_ROCM_USE_AITER_MOE

So the recipe was written against a pre-#1980 aggregated recipe.

Why the yaml comment's justification is stale

The comment justifies triton_unfused as REQUIRED because "the auto backend doesn't register the FP4 scale params and safetensors load raises KeyError". That was true for --moe-backend auto before #1980 — but the aggregated recipe doesn't use auto, it uses aiter explicitly, and the whole point of #1980 was that AITER MXFP4 DOES register w13_weight_scale/w2_weight_scale correctly. The nightly image pinned here (nightly-f329ce405b…, 2026-07-04) is newer than the aggregated recipe's image, so AITER MoE support is present.

Step-by-step proof

1. Open benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh on main and grep for moe-backend: it matches line 79 (--moe-backend aiter). Grep for AITER_MOE: it matches line 48 (export VLLM_ROCM_USE_AITER_MOE=1). There are no occurrences of triton_unfused in the file.

2. Open benchmarks/multi_node/amd_utils/models_vllm.yaml at the new DeepSeek-V4-Pro block: prefill_flags and decode_flags both contain --moe-backend triton_unfused. The env string contains VLLM_ROCM_USE_AITER=1 but not VLLM_ROCM_USE_AITER_MOE=1.

3. git log --oneline -- benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh and perf-changelog.yaml show PR #1980 ([AMD] DeepSeek-V4 FP4 MI355X vLLM STP: bump image to latest nightly) with the entry: "Switch the MoE backend from triton_unfused to AITER MoE (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 experts."

4. The PR description here explicitly claims verbatim reuse of the single-node recipe, but (1) and (2) show the flags don't match.
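
The manual grep comparison above could also be expressed as a small script. The following is a minimal sketch, not part of the repository: the helper names (flag_value, parse_env, drift_report) are hypothetical, and because the aggregated recipe is only partially quoted in this review, only the quoted pieces (the MoE backend and the two AITER env vars) are compared.

```python
import shlex

# Values quoted in this review for the new disagg entry.
DISAGG_FLAGS = (
    "--tensor-parallel-size 8 --distributed-executor-backend mp "
    "--kv-cache-dtype fp8 --moe-backend triton_unfused "
    "--tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 "
    "--no-enable-prefix-caching --gpu-memory-utilization 0.9 "
    "--max-num-batched-tokens 8192 --enforce-eager"
)
DISAGG_ENV = "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"

# Only the parts of the aggregated recipe that the review quotes.
AGG_MOE_BACKEND = "aiter"
AGG_ENV = "VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1"


def flag_value(flags, name):
    """Return the value following `name` (or after `name=`), else None."""
    tokens = shlex.split(flags)
    for i, tok in enumerate(tokens):
        if tok == name and i + 1 < len(tokens):
            return tokens[i + 1]
        if tok.startswith(name + "="):
            return tok.split("=", 1)[1]
    return None


def parse_env(env):
    """Turn 'A=1 B=2' into {'A': '1', 'B': '2'}."""
    return dict(item.split("=", 1) for item in shlex.split(env))


def drift_report(flags, env, agg_backend, agg_env):
    """List the ways the disagg entry differs from the quoted aggregated settings."""
    problems = []
    backend = flag_value(flags, "--moe-backend")
    if backend != agg_backend:
        problems.append(
            f"--moe-backend is {backend!r}, aggregated recipe uses {agg_backend!r}"
        )
    have = parse_env(env)
    for key, value in parse_env(agg_env).items():
        if have.get(key) != value:
            problems.append(
                f"env {key}={value} is set in the aggregated recipe "
                f"but is {have.get(key)!r} here"
            )
    return problems


report = drift_report(DISAGG_FLAGS, DISAGG_ENV, AGG_MOE_BACKEND, AGG_ENV)
for line in report:
    print(line)
```

Run against the strings quoted in this review, the sketch reports two differences: the backend mismatch and the missing VLLM_ROCM_USE_AITER_MOE variable.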

Impact

This is a smoke test (a single ISL/OSL, single conc=32 point) with full-sweep-fail-fast on, so runtime blast radius is bounded to that one job. Two outcomes are possible:

triton_unfused still works on the pinned nightly → the disagg smoke perf number will be lower than the aggregated recipe it claims to mirror (AITER was chosen in #1980 for perf reasons); the recorded baseline is misleading.

triton_unfused has bitrotted on the newer nightly → the smoke job fails and CI catches it before merge.

Either way, once the smoke test is expanded to a real conc sweep (per the yaml comment, that's the follow-up), the divergence from the aggregated recipe becomes a real perf-comparison gotcha.

How to fix

Pick one:

(a) Actually match the aggregated recipe (what the description promises):

prefill_flags: "... --moe-backend aiter ..."  # was: triton_unfused
decode_flags: "... --moe-backend aiter ..."
env: "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"

and rewrite the yaml comment to say the AITER MXFP4 path registers the FP4 scale params (mirroring the aggregated recipe's own comment).

(b) Keep triton_unfused intentionally, but update the comment, PR description, and perf-changelog to explicitly state "the disagg recipe intentionally diverges from the aggregated recipe: it pins to the pre-#1980 triton_unfused path as a conservative first bring-up before enabling AITER MoE + MoRIIO together" — so future readers aren't misled by the "verbatim / identical" claims.
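
For illustration, option (a) amounts to a mechanical rewrite of the two flag strings and the env string. The sketch below is hypothetical (switch_to_aiter_moe is not a repository helper) and only handles the space-separated `--moe-backend VALUE` form used in this entry; the flags string in the example is shortened for readability.

```python
import shlex

AITER_MOE_ENV = "VLLM_ROCM_USE_AITER_MOE=1"


def switch_to_aiter_moe(flags, env):
    """Return (flags, env) with --moe-backend set to aiter and the AITER MoE env var present."""
    tokens = shlex.split(flags)
    for i in range(len(tokens) - 1):
        if tokens[i] == "--moe-backend":
            tokens[i + 1] = "aiter"

    env_items = shlex.split(env)
    if not any(item.startswith("VLLM_ROCM_USE_AITER_MOE=") for item in env_items):
        # Place the new variable right after VLLM_ROCM_USE_AITER=... when present,
        # otherwise append it at the end.
        position = len(env_items)
        for i, item in enumerate(env_items):
            if item.startswith("VLLM_ROCM_USE_AITER="):
                position = i + 1
                break
        env_items.insert(position, AITER_MOE_ENV)
    return " ".join(tokens), " ".join(env_items)


# Shortened flags string, for illustration only.
old_flags = "--tensor-parallel-size 8 --moe-backend triton_unfused --enforce-eager"
old_env = "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"

new_flags, new_env = switch_to_aiter_moe(old_flags, old_env)
print(new_flags)
print(new_env)
```

Applying the function twice leaves the strings unchanged, so it is safe to run on an entry that has already been migrated. The yaml comment would still need a manual rewrite, as noted in (a).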

github-actions · 2026-07-05T03:46:51Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28721684504
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28721684504

functionstackx · 2026-07-05T15:26:29Z

Closing per cleanup of non-mergeable PRs. The DSV4 MI355X vLLM disagg smoke test failed validation — branch kept for diagnosis and re-cut.

functionstackx requested a review from a team July 4, 2026 22:29

functionstackx added the full-sweep-fail-fast label Jul 4, 2026

functionstackx requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners July 4, 2026 22:29

github-project-automation Bot added this to InferenceMAX Board Jul 4, 2026

functionstackx force-pushed the klaud/dsv4-mi355x-vllm-disagg-smoke branch from 6bcdd1d to 972ef1e Compare July 4, 2026 22:29

claude Bot reviewed Jul 4, 2026

View reviewed changes

functionstackx closed this Jul 5, 2026

github-project-automation Bot moved this to Done in InferenceMAX Board Jul 5, 2026

functionstackx wants to merge 1 commit into main from klaud/dsv4-mi355x-vllm-disagg-smoke
