[Cherry-Pick][Optimization] Use triton qk_norm both in Prefill and Decode. (#7213) #7306
EmmonsCurse wants to merge 1 commit into PaddlePaddle:release/2.6
Conversation
Co-authored-by: "liuruian" <liuruian@baidu.com>

Thanks for your contribution!
fastdeploy-bot
left a comment
🤖 AI Code Review | 2026-04-10 16:01 CST
📋 Review Summary
PR overview: Cherry-pick PR that merges the QKRMSNorm fused-operator optimization from the develop branch into release/2.6, using the Triton fused operator in both the Prefill and Decode stages to improve performance.
Scope of change: fastdeploy/model_executor/layers/normalization.py, tests/e2e/test_Qwen3VL_serving.py
Impact tags: [Optimization] [OP]
📝 PR Convention Check
The PR title and description largely follow the conventions, but the unchecked Checklist items cause CI to fail.
Suggested title (already compliant, no change needed):
[Cherry-Pick][Optimization] Use triton qk_norm both in Prefill and Decode.(#7213)
Description template (suggested additions):
## Motivation
Use the fused QKRMSNorm operator in the Prefill stage. For some models the kernel itself is 2-7x faster, and models with a large Prefill bubble see roughly a 2x speedup per Forward pass.
## Modifications
Use QKRMSNorm to replace the scattered paddle ops.
## Usage or Command
## Accuracy Tests
- The original PR #7213 has already passed CI tests
- The expected result in the test file was updated (from "黑色的" to "黑色"); this minor output change is caused by the operator optimization and is expected
## Checklist
- [x] Add at least a tag in the PR title.
- [x] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. (tests/operators/test_qk_rmsnorm_fused.py already exists)
- [ ] Provide accuracy results. (not provided in the original PR; the cherry-pick can refer to the develop branch)
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | fastdeploy/model_executor/layers/normalization.py:344 | Consider adding a comment explaining why the step_use_cudagraph condition can be removed |
No blocking issues were found.
Overall Evaluation
The core change of this PR (removing the step_use_cudagraph condition so the fused operator is also used in the Prefill stage) is a reasonable performance optimization. From the code review:
- The qk_rmsnorm_fused operator is implemented correctly and supports a variable M (batch_size/token_num), so it can handle the large token batches of the Prefill stage
- The original PR #7213 has been merged into the develop branch and passed most CI tests
- The changes to the test file reflect the output change after the optimization and are expected
The author is advised to update the Checklist in the PR description and add a note on accuracy testing.
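To illustrate the reviewer's point that the fused operator must support a variable M (batch_size/token_num), here is a minimal NumPy reference of per-head RMSNorm. This is a sketch, not the FastDeploy implementation; the name `qk_rmsnorm_ref` and the shapes are illustrative assumptions.

```python
import numpy as np

def qk_rmsnorm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm over the last (head_dim) axis.

    x: (M, num_heads, head_dim), where M = batch_size * token_num.
       M is small in Decode (one token per sequence) and large in Prefill.
    weight: (head_dim,) learned scale.
    """
    rms = np.sqrt(np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

# The same code path handles Decode-sized and Prefill-sized inputs.
head_dim = 128
w = np.ones(head_dim, dtype=np.float32)
decode_q = np.random.randn(4, 8, head_dim).astype(np.float32)      # M = 4
prefill_q = np.random.randn(4096, 8, head_dim).astype(np.float32)  # M = 4096

out_small = qk_rmsnorm_ref(decode_q, w)
out_large = qk_rmsnorm_ref(prefill_q, w)
```

A Triton kernel typically parallelizes this over the M * num_heads rows, which is why a correct handling of variable M is all that is needed to cover both stages.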
        proxy_rmsnorm=None,
    ) -> paddle.Tensor:
-        if proxy_rmsnorm is None and self.qk_norm_fused and forward_meta.step_use_cudagraph:
+        if proxy_rmsnorm is None and self.qk_norm_fused:
🟡 Suggestion: add a comment explaining why the step_use_cudagraph condition can be removed
step_use_cudagraph originally restricted the fused operator to the Decode stage (which uses CUDA Graph). With this condition removed, the fused operator is also used in the Prefill stage.
It is suggested to add a comment in the code explaining the reason for this change and its performance impact:
# Use the fused QKRMSNorm operator in both Prefill and Decode.
# The step_use_cudagraph condition is removed to enable the Prefill-stage optimization.
# Expected speedup: 2-7x for the kernel on some models, ~2x per Forward for models with a large Prefill bubble.
if proxy_rmsnorm is None and self.qk_norm_fused:
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## release/2.6 #7306 +/- ##
==============================================
Coverage ? 73.84%
==============================================
Files ? 376
Lines ? 52915
Branches ? 8255
==============================================
Hits ? 39076
Misses ? 11110
Partials ? 2729
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Cherry-pick of #7213 (authored by @K11OntheBoat) to release/2.6.
dev PR: #7213
Motivation
Use the fused QKRMSNorm operator in the Prefill stage. For some models the kernel itself is 2-7x faster, and models with a large Prefill bubble see roughly a 2x speedup per Forward pass.
Modifications
Use QKRMSNorm to replace the scattered paddle ops.
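For context on what replacing "scattered paddle ops" with a fused operator means, here is a hedged NumPy sketch: the unfused path evaluates RMSNorm as several separate elementwise and reduction ops (each a kernel launch and an intermediate tensor on GPU), while the fused Triton kernel computes the same expression in a single pass. The function names here are illustrative, not FastDeploy API.

```python
import numpy as np

def rmsnorm_scattered(x, weight, eps=1e-6):
    # Unfused path: each step corresponds to a separate op / kernel launch
    # and materializes an intermediate tensor.
    sq = x * x                               # elementwise square
    mean = sq.mean(axis=-1, keepdims=True)   # reduction
    inv = 1.0 / np.sqrt(mean + eps)          # rsqrt
    return x * inv * weight                  # two more elementwise ops

def rmsnorm_fused(x, weight, eps=1e-6):
    # Fused path: one kernel evaluates the whole per-row expression,
    # avoiding the intermediates and the extra launch overhead.
    return x * (1.0 / np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)) * weight

x = np.random.randn(16, 128).astype(np.float32)
w = np.ones(128, dtype=np.float32)
scattered_out = rmsnorm_scattered(x, w)
fused_out = rmsnorm_fused(x, w)
```

Both paths compute the same values; the win of fusion is in launch count and memory traffic, which is what the 2-7x kernel speedup quoted above reflects.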
Usage or Command
Accuracy Tests
Checklist
- Add at least a tag in the PR title: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Format your code, run `pre-commit` before commit.
- If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.