[RL] change glm rope_emb calculation#7316
Conversation
Thanks for your contribution!
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##             develop    #7316   +/-   ##
==========================================
  Coverage           ?   74.11%
==========================================
  Files              ?      383
  Lines              ?    53637
  Branches           ?     8412
==========================================
  Hits               ?    39754
  Misses             ?    11182
  Partials           ?     2701

Flags with carried forward coverage won't be shown.
fastdeploy-bot
left a comment
🤖 AI Code Review
2026-04-11
📋 Review Summary
PR overview: optimizes the RoPE implementation for GLM models by disabling EnforceFmulRN and reworking the Python-layer computation, improving performance (0.355s → 0.123s)
Scope of changes: custom_ops/gpu_ops/append_attn/, fastdeploy/model_executor/layers/
Impact tags: [OP] [RL]
📝 PR Convention Check
The "Modifications" section of the PR description is empty; please add concrete change notes.
Suggested template (copy-paste ready):
## Modifications
1. GPU operator layer: hard-code the `EnforceFmulRN` template parameter to `false`, replacing the IEEE-754 round-to-nearest `__fmul_rn` with a plain multiply
2. Python layer: add a branch controlled by the `FD_ENABLE_RL` environment variable, replacing `paddle.einsum` with `paddle.outer`
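The Python-layer swap above (`paddle.outer` in place of `paddle.einsum`) rests on the fact that `einsum("i,j->ij", t, inv_freq)` is exactly an outer product. A minimal sketch of the equivalence, using NumPy for a self-contained illustration (paddle's `outer` and `einsum` follow the same semantics here; `head_dim` and `max_len` values are illustrative, not taken from the PR):

```python
import numpy as np

# Illustrative sizes, not the PR's actual model dimensions.
head_dim, max_len = 8, 16

# Standard RoPE inverse-frequency table and position indices.
inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2) / head_dim))
t = np.arange(max_len, dtype=np.float64)

# Before: einsum-based outer product of positions and inverse frequencies.
freqs_einsum = np.einsum("i,j->ij", t, inv_freq)

# After: a plain outer product, avoiding einsum's parsing/dispatch overhead.
freqs_outer = np.outer(t, inv_freq)

# The two formulations produce identical frequency tables.
assert np.array_equal(freqs_einsum, freqs_outer)
print(freqs_outer.shape)
```

Since the outputs are bit-identical, the einsum-to-outer change affects only the Python-side overhead, not numerical results; the precision difference reported in the PR comes from the GPU-side multiply change.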
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | decoder_write_cache_with_rope_kernel.cu:149 | Hard-coding affects all models; prefer environment-variable control |
| 🟡 Suggestion | encoder_write_cache_with_rope_impl.cuh:2546 | Same as above |
| 🟡 Suggestion | gqa_rope_write_cache.cu:390 | Same as above |
| 🟡 Suggestion | speculate_write_cache_with_rope_kernel.cu:133 | Same as above |
Overall Assessment
The performance gain is significant (roughly 3x), and the precision loss stays within an acceptable range (max diff: 0.00048828). However, the hard-coded GPU-layer change affects every model that uses partial rotary embedding, not just GLM. Consider gating it behind the FD_ENABLE_RL environment variable for consistency with the Python layer.
```cpp
    append_decode_cache_T_neox_partial_rope_kernel<T,
                                                   PackSize,
                                                   EnforceFmulRN>;
auto* kernelFn = append_decode_cache_T_neox_partial_rope_kernel<
```
🟡 Suggestion: Hard-coding EnforceFmulRN = false affects every model that uses these kernels (including DeepSeek V3 and other models with partial rotary embedding), not only GLM.
Consider controlling this parameter via the FD_ENABLE_RL environment variable to stay consistent with the Python layer.
For a reference implementation, see the fmul_func definition at helper.h:746; the kernel launch can select a different template instantiation based on the environment variable.
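The gating pattern the review asks for can be sketched as follows. This is shown in Python for brevity (the actual fix would live in the CUDA launch code, selecting a template instantiation); the helper names are hypothetical, and only the FD_ENABLE_RL variable name comes from the PR:

```python
import os

def rope_use_plain_mul() -> bool:
    # Hypothetical helper mirroring the FD_ENABLE_RL gating the review
    # suggests; "1"/"true" opt in to the faster, plain-multiply RL path.
    return os.getenv("FD_ENABLE_RL", "0").lower() in ("1", "true")

def launch_rope_kernel() -> str:
    # Sketch of dispatch: pick the kernel variant (represented here as
    # strings standing in for the two template instantiations) from the
    # environment variable, so EnforceFmulRN=false is opt-in rather than
    # hard-coded for all models.
    if rope_use_plain_mul():
        return "kernel<EnforceFmulRN=false>"  # plain multiply, faster
    return "kernel<EnforceFmulRN=true>"       # __fmul_rn, IEEE-754 rounding

os.environ["FD_ENABLE_RL"] = "1"
print(launch_rope_kernel())
```

With this shape, models that rely on `__fmul_rn` precision (e.g. other partial-rotary-embedding models) keep the default behavior unless the environment variable is set.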
```cpp
    GQANeoxVariableLengthPartialRotaryKernel<T,
                                             PackSize,
                                             EnforceFmulRN>;
auto *kernelFn = GQANeoxVariableLengthPartialRotaryKernel<
```
🟡 Suggestion: Same as above; the hard-coding affects all models, so prefer environment-variable control.
```cpp
    max_model_len,
    head_dim,
    rotary_dim);
launchWithPdlWhenEnabled(GQAVariableLengthNeoxPartialRotarySplitKernel<
```
🟡 Suggestion: Same as above; the hard-coding affects all models, so prefer environment-variable control.
```cpp
    PackSize,
    QKV_TYPE,
    EnforceFmulRN>
append_speculate_cache_neox_partial_rope_kernel<
```
🟡 Suggestion: Same as above; the hard-coding affects all models, so prefer environment-variable control.
EmmonsCurse
left a comment
LGTM~ Skip coverage check as it mainly relies on tests with RL.
Motivation
Rework the current GLM RoPE implementation.
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.