
[Klaud Cold] DSV4 MI355X vLLM disagg smoke test (8k1k conc=32) #2081

Closed
functionstackx wants to merge 1 commit into main from klaud/dsv4-mi355x-vllm-disagg-smoke

Conversation

@functionstackx

Collaborator

Summary

Chinese description

🤖 Generated with Claude Code

DeepSeek-V4-Pro disaggregated P/D on MI355X via vLLM + MoRI-IO: new
dsv4-fp4-mi355x-vllm-disagg config (1P1D, TP8/EP1, 8k1k conc=32 only),
thin multi_node launcher, and DeepSeek-V4-Pro entry in models_vllm.yaml
reusing the validated aggregated serving flags. Uses the latest vLLM
ROCm nightly (2026-07-04). Refreshes closed #1707 against current main,
which already carries the #1585 patch-free MoRI-IO path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
functionstackx force-pushed the klaud/dsv4-mi355x-vllm-disagg-smoke branch from 6bcdd1d to 972ef1e on July 4, 2026 22:29
@github-actions

github-actions Bot commented Jul 4, 2026

Contributor

Thanks for the contribution! Please ask the CODEOWNER at the relevant company to fill in the latest PR_REVIEW_CHECKLIST.md before pinging a core maintainer on Slack for review. For the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template exactly, including the phrase "As a PR reviewer and CODEOWNER, I have reviewed this and have".

For PR verification, add the full-sweep-fail-fast label (strongly recommended) to this PR; the benchmark sweep only runs on labeled PRs. Use full-sweep-enabled only if you need matrix jobs to keep running past a failure.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. See GitHub's docs on re-running failed jobs.


Comment on lines +62 to +85

DeepSeek-V4-Pro:
  # DeepSeek-V4-Pro is mixed-precision FP4+FP8 (FP4 MoE expert weights dominate
  # the ~960 GB footprint; FP8 on attention/norm/router; FP8 KV cache at runtime).
  # InferenceX classifies this as the fp4 variant.
  #
  # Serving flags reuse the validated single-node MI355X recipe
  # (benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh, from
  # vllm-project/recipes#433) so the per-node engine config is identical to the
  # known-good aggregated run; disaggregation only adds the MoRIIO kv-transfer
  # role (injected by server_vllm.sh). Each P/D worker is a full TP=8 node, EP=1
  # — matching the aggregated recipe, which runs DSv4 on TP=8 without expert
  # parallelism. DEP decode is a follow-up.
  #
  # --moe-backend triton_unfused is REQUIRED for the FP4 MoE expert weight format;
  # the auto backend doesn't register the FP4 scale params and safetensors load
  # raises KeyError. --enforce-eager (no CUDA graphs) keeps the first disagg recipe
  # robust against cudagraph/MoRIIO-hook interactions; FULL/PIECEWISE capture is a
  # follow-up. --async-scheduling is intentionally omitted (not used by the kimi /
  # minimax vllm-disagg recipes).
  prefill_flags: "--tensor-parallel-size 8 --distributed-executor-backend mp --kv-cache-dtype fp8 --moe-backend triton_unfused --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 --no-enable-prefix-caching --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 --enforce-eager"
  decode_flags: "--tensor-parallel-size 8 --distributed-executor-backend mp --kv-cache-dtype fp8 --moe-backend triton_unfused --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 --no-enable-prefix-caching --gpu-memory-utilization 0.9 --max-num-batched-tokens 8192 --enforce-eager"
  env: "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"
  hf_dir: "models--deepseek-ai--DeepSeek-V4-Pro"

🟡 The new dsv4-fp4-mi355x-vllm-disagg entry sets --moe-backend triton_unfused and omits VLLM_ROCM_USE_AITER_MOE=1, but the yaml comment, PR description, and perf-changelog all claim per-node flags 'reuse the aggregated recipe verbatim' / 'identical to the known-good aggregated run'. The current aggregated recipe (benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh, post-PR #1980) actually uses --moe-backend aiter + VLLM_ROCM_USE_AITER_MOE=1 — the disagg entry looks pinned to the pre-#1980 backend. Either flip it to aiter + VLLM_ROCM_USE_AITER_MOE=1 to actually match, or update the comment/perf-changelog to state the divergence is intentional (e.g. a conservative first bring-up).

Extended reasoning...

What the bug is

The new disagg recipe added by this PR (configs/amd-master.yaml and the DeepSeek-V4-Pro entry in benchmarks/multi_node/amd_utils/models_vllm.yaml) claims — in three places — to reuse the validated single-node MI355X DSv4 recipe verbatim:

  • PR description: "reusing the validated single-node flags verbatim (--moe-backend triton_unfused required for the FP4 expert format...)"
  • yaml comment (models_vllm.yaml:63-79): "Serving flags reuse the validated single-node MI355X recipe... so the per-node engine config is identical to the known-good aggregated run", and further "--moe-backend triton_unfused is REQUIRED for the FP4 MoE expert weight format; the auto backend doesn't register the FP4 scale params..."
  • perf-changelog (line 4463): "Per-node flags reuse the aggregated recipe verbatim (--moe-backend triton_unfused for the FP4 expert format...)"

But this description is stale. As of PR #1980 (merged the same day as this PR), the aggregated recipe at benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh no longer uses triton_unfused. It uses AITER MoE:

# dsv4_fp4_mi355x_vllm.sh, lines 47-48:
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
...
# line 79:
    --moe-backend aiter \

The aggregated recipe's own comment (lines 15-19) explicitly says: "Use the AITER MoE backend (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 MoE expert weights... The AITER MXFP4 path registers the FP4 scale parameters (w13_weight_scale / w2_weight_scale), so safetensors loads correctly and decode runs on the fused AITER experts instead of triton_unfused."

The new disagg entry, meanwhile, uses:

prefill_flags: "... --moe-backend triton_unfused ..."
decode_flags:  "... --moe-backend triton_unfused ..."
env: "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"   # no VLLM_ROCM_USE_AITER_MOE

So the recipe was written against a pre-#1980 aggregated recipe.
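
The mismatch described above can be reproduced mechanically. The following is a minimal sketch in Python, using only the fragments quoted in this review (the elided "..." parts are left out). The helper names moe_settings and diff_settings are hypothetical and are not part of the repo.

```python
import shlex

# Fragments quoted in this review; everything elided with "..." is omitted.
DISAGG_FLAGS = "--moe-backend triton_unfused"
DISAGG_ENV = "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"
AGG_FLAGS = "--moe-backend aiter"
AGG_ENV = "VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1"


def moe_settings(flags, env):
    """Extract the MoE backend flag value and the AITER MoE env toggle."""
    tokens = shlex.split(flags)
    backend = None
    if "--moe-backend" in tokens:
        i = tokens.index("--moe-backend")
        if i + 1 < len(tokens):
            backend = tokens[i + 1]
    env_vars = dict(item.split("=", 1) for item in shlex.split(env))
    return {
        "moe_backend": backend,
        "aiter_moe": env_vars.get("VLLM_ROCM_USE_AITER_MOE"),
    }


def diff_settings(a, b):
    """Return {key: (a_value, b_value)} for every key whose values differ."""
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}


disagg = moe_settings(DISAGG_FLAGS, DISAGG_ENV)
agg = moe_settings(AGG_FLAGS, AGG_ENV)
print(diff_settings(disagg, agg))
# {'moe_backend': ('triton_unfused', 'aiter'), 'aiter_moe': (None, '1')}
```

Both settings differ, so the "verbatim" claim does not hold for the quoted fragments.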

Why the yaml comment's justification is stale

The comment justifies triton_unfused as REQUIRED because "the auto backend doesn't register the FP4 scale params and safetensors load raises KeyError". That was true for --moe-backend auto before #1980 — but the aggregated recipe doesn't use auto, it uses aiter explicitly, and the whole point of #1980 was that AITER MXFP4 DOES register w13_weight_scale/w2_weight_scale correctly. The nightly image pinned here (nightly-f329ce405b…, 2026-07-04) is newer than the aggregated recipe's image, so AITER MoE support is present.

Step-by-step proof

  1. Open benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh on main — grep for moe-backend: matches line 79 (--moe-backend aiter). Grep for AITER_MOE: matches line 48 (export VLLM_ROCM_USE_AITER_MOE=1). No occurrences of triton_unfused in the file.
  2. Open benchmarks/multi_node/amd_utils/models_vllm.yaml at the new DeepSeek-V4-Pro block — prefill_flags and decode_flags both contain --moe-backend triton_unfused. The env string contains VLLM_ROCM_USE_AITER=1 but not VLLM_ROCM_USE_AITER_MOE=1.
  3. git log --oneline -- benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh and perf-changelog.yaml show PR #1980 ("[AMD] DeepSeek-V4 FP4 MI355X vLLM STP: bump image to latest nightly") with the entry: "Switch the MoE backend from triton_unfused to AITER MoE (VLLM_ROCM_USE_AITER_MOE=1 + --moe-backend aiter) for the FP4 experts."
  4. The PR description here explicitly claims verbatim reuse of the single-node recipe, but (1)+(2) show the flags don't match.
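
Steps 1 and 2 amount to a substring search with line numbers. A minimal Python sketch of that check follows. grep_lines is a hypothetical helper, and SAMPLE is an inline stand-in holding only the three recipe lines quoted earlier, so its line numbers do not correspond to lines 48 and 79 of the real script.

```python
def grep_lines(text, needle):
    """Return (1-based line number, stripped line) for each line containing needle."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line
    ]


# Stand-in for the aggregated recipe: only the lines quoted in this review.
SAMPLE = "\n".join([
    "export VLLM_ROCM_USE_AITER=1",
    "export VLLM_ROCM_USE_AITER_MOE=1",
    "    --moe-backend aiter \\",
])

for needle in ("moe-backend", "AITER_MOE", "triton_unfused"):
    print(needle, grep_lines(SAMPLE, needle))
```

Against a real checkout, the same helper can be fed the text of benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_vllm.sh (for example via pathlib.Path(...).read_text()); an empty result for triton_unfused is what step 1 reports.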

Impact

This is a smoke test (a single ISL/OSL, single conc=32 point) with full-sweep-fail-fast on, so runtime blast radius is bounded to that one job. Two outcomes are possible:

Either way, once the smoke test is expanded to a real conc sweep (per the yaml comment, that's the follow-up), the divergence from the aggregated recipe becomes a real perf-comparison gotcha.

How to fix

Pick one:

(a) Actually match the aggregated recipe (what the description promises):

prefill_flags: "... --moe-backend aiter ..."       # was: triton_unfused
decode_flags:  "... --moe-backend aiter ..."
env: "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"

and rewrite the yaml comment to say the AITER MXFP4 path registers the FP4 scale params (mirroring the aggregated recipe's own comment).

(b) Keep triton_unfused intentionally, but update the comment, PR description, and perf-changelog to explicitly state "the disagg recipe intentionally diverges from the aggregated recipe: it pins to the pre-#1980 triton_unfused path as a conservative first bring-up before enabling AITER MoE + MoRIIO together" — so future readers aren't misled by the "verbatim / identical" claims.
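
Option (a) is a pure string rewrite of the entry's flags and env values. A minimal sketch in Python, assuming the strings are exactly those in the reviewed entry; switch_to_aiter is a hypothetical helper, not a function in the repo.

```python
import shlex

# Strings copied from the DeepSeek-V4-Pro entry under review.
FLAGS = (
    "--tensor-parallel-size 8 --distributed-executor-backend mp "
    "--kv-cache-dtype fp8 --moe-backend triton_unfused "
    "--tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4 "
    "--no-enable-prefix-caching --gpu-memory-utilization 0.9 "
    "--max-num-batched-tokens 8192 --enforce-eager"
)
ENV = "VLLM_USE_V1=1 VLLM_ROCM_USE_AITER=1 VLLM_ENGINE_READY_TIMEOUT_S=3600"


def switch_to_aiter(flags, env):
    """Set --moe-backend to aiter and ensure VLLM_ROCM_USE_AITER_MOE=1 is present."""
    tokens = shlex.split(flags)
    i = tokens.index("--moe-backend")  # raises ValueError if the flag is absent
    tokens[i + 1] = "aiter"

    env_items = shlex.split(env)
    if not any(item.startswith("VLLM_ROCM_USE_AITER_MOE=") for item in env_items):
        # Place the new toggle right after VLLM_ROCM_USE_AITER=1 when it exists,
        # which reproduces the ordering shown in option (a); otherwise append.
        if "VLLM_ROCM_USE_AITER=1" in env_items:
            position = env_items.index("VLLM_ROCM_USE_AITER=1") + 1
        else:
            position = len(env_items)
        env_items.insert(position, "VLLM_ROCM_USE_AITER_MOE=1")

    return " ".join(tokens), " ".join(env_items)


new_flags, new_env = switch_to_aiter(FLAGS, ENV)
print(new_flags)
print(new_env)
```

The sketch only rewrites strings. Whether the AITER MoE path runs cleanly together with MoRIIO is something only the smoke test itself can show.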

@github-actions

github-actions Bot commented Jul 5, 2026

Contributor

@functionstackx

Collaborator Author

Closing per cleanup of non-mergeable PRs. The DSV4 MI355X vLLM disagg smoke test failed validation — branch kept for diagnosis and re-cut.
