[Cherry-Pick][Optimization] Use triton qk_norm both in Prefill and Decode.(#7213)#7305
EmmonsCurse wants to merge 1 commit into PaddlePaddle:release/2.5
Co-authored-by: “liuruian” <liuruian@baidu.com>
Thanks for your contribution!
fastdeploy-bot left a comment
🤖 AI Code Review | 2026-04-10 16:01 CST
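📋 Review Summary
PR overview: Use the fused Triton QK RMSNorm operator in both the Prefill and Decode phases, and remove the CUDA Graph mode restriction.
Scope of changes: fastdeploy/model_executor/layers/normalization.py, tests/e2e/test_Qwen3VL_serving.py
Impact tag: [OP]
PR convention check
✅ The title contains the valid tags [Cherry-Pick] and [Optimization]
✅ Motivation and Modifications are filled in completely
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | test_Qwen3VL_serving.py:176 | The change to the expected test output is not explained in the PR description |
Overall assessment
The code change is reasonable. With the step_use_cudagraph condition removed, the triton kernel can also be used in the Prefill phase, which matches the optimization goal. The triton kernel implementation supports dynamic batch sizes, and a separate numerical-precision test verifies its correctness.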
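As a rough illustration of what a QK RMSNorm computes, here is a minimal NumPy reference. It is not FastDeploy's Triton kernel: the function names `rms_norm` and `qk_rms_norm`, the `[tokens, heads, head_dim]` layout, and the `eps` value are all assumptions made for this sketch. Because the token count is only a leading axis, the same function handles a many-token Prefill batch and a single-token Decode step, which is the property the review refers to as supporting dynamic batch sizes.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Hypothetical reference: normalize over the last axis (head_dim),
    # computing in float32 and casting back to the input dtype.
    x32 = x.astype(np.float32)
    inv_rms = 1.0 / np.sqrt(np.mean(x32 * x32, axis=-1, keepdims=True) + eps)
    return (x32 * inv_rms * weight.astype(np.float32)).astype(x.dtype)

def qk_rms_norm(q, k, q_weight, k_weight, eps=1e-6):
    # Apply RMSNorm to query and key heads independently.
    return rms_norm(q, q_weight, eps), rms_norm(k, k_weight, eps)

rng = np.random.default_rng(0)
head_dim = 8
q_w = np.ones(head_dim, dtype=np.float32)
k_w = np.ones(head_dim, dtype=np.float32)

# Prefill-like input: 5 tokens, 4 query heads, 2 key heads (assumed shapes).
q_p = rng.standard_normal((5, 4, head_dim)).astype(np.float32)
k_p = rng.standard_normal((5, 2, head_dim)).astype(np.float32)
q_pn, k_pn = qk_rms_norm(q_p, k_p, q_w, k_w)

# Decode-like input: a single token, same code path.
q_dn, k_dn = qk_rms_norm(q_p[:1], k_p[:1], q_w, k_w)
```

A fused kernel would produce both outputs in one launch instead of running several separate elementwise and reduction ops; the arithmetic per head is the same as in this reference.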
# base result
-content2 = "视频中手机支架的颜色是黑色的。"
+content2 = "视频中手机支架的颜色是黑色。"
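🟡 Suggestion: The expected test output was changed from "视频中手机支架的颜色是黑色的。" to "视频中手机支架的颜色是黑色。". The reason for this change should be stated in the PR description.
Consider adding a note to the PR description: because the triton qk_rmsnorm_fused kernel replaces the paddle operators, the model output differs slightly in the end-to-end test (a floating-point difference within the normal range).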
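To illustrate the kind of difference the reviewer describes, the sketch below compares two mathematically equivalent RMSNorm formulations in NumPy. It is not the PR's kernel or its precision test: the function names, the input shape, and the tolerance are assumptions chosen for this example. One version keeps its stored intermediates in float16, the other computes in float32 and casts at the end. The two agree within a small tolerance but are not guaranteed to be bit-identical.

```python
import numpy as np

def rms_norm_fp16_steps(x, eps=1e-6):
    # Keep the input and every stored intermediate in float16.
    x = x.astype(np.float16)
    mean_sq = np.mean(x * x, axis=-1, keepdims=True, dtype=np.float16)
    return x / np.sqrt(mean_sq + np.float16(eps))

def rms_norm_fp32_accum(x, eps=1e-6):
    # Compute in float32 and cast only the final result to float16.
    x32 = x.astype(np.float32)
    mean_sq = np.mean(x32 * x32, axis=-1, keepdims=True)
    return (x32 / np.sqrt(mean_sq + eps)).astype(np.float16)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 128)).astype(np.float16)

a = rms_norm_fp16_steps(x)
b = rms_norm_fp32_accum(x)

# Largest elementwise gap between the two formulations.
max_abs_diff = float(np.max(np.abs(a.astype(np.float32) - b.astype(np.float32))))
print("max abs diff:", max_abs_diff)
```

An end-to-end test that compares generated text exactly is sensitive to gaps like this, which is why a kernel swap can alter an expected string even when a tolerance-based numerical check passes.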
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.5 #7305 +/- ##
==============================================
Coverage ? 69.93%
==============================================
Files ? 390
Lines ? 54417
Branches ? 8581
==============================================
Hits ? 38056
Misses ? 13619
Partials ? 2742
Cherry-pick of #7213 (authored by @K11OntheBoat) to release/2.5.
dev PR: #7213
Motivation
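Use the fused QKRMSNorm operator in the Prefill phase. For some models, the kernel itself runs 2 to 7 times faster. For models with large Prefill bubbles (idle gaps between kernels), a single Forward pass can be about 2 times faster.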
Modifications
Use QKRMSNorm in place of the separate paddle ops.
Usage or Command
Accuracy Tests
Checklist
[FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
pre-commit before commit.
release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.