
[RL] change glm rope_emb calculation#7316

Merged
zoooo0820 merged 3 commits into PaddlePaddle:develop from zoooo0820:glm_rope
Apr 11, 2026

Conversation

Collaborator

@zoooo0820 zoooo0820 commented Apr 10, 2026

Motivation

Revise the current GLM RoPE implementation.

Original implementation time: 0.3555927276611328 s
New implementation time: 0.12303400039672852 s
max diff: Tensor(shape=[], dtype=float32, place=Place(gpu:0), stop_gradient=True,
       0.00048828)

Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag in the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code and run pre-commit before committing.
  • Add unit tests. Please write the reason in this PR if no unit tests.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 10, 2026

Thanks for your contribution!


@codecov-commenter

codecov-commenter commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 14.28571% with 6 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@00005c9). Learn more about missing BASE report.

Files with missing lines Patch % Lines
...stdeploy/model_executor/layers/rotary_embedding.py 14.28% 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7316   +/-   ##
==========================================
  Coverage           ?   74.11%           
==========================================
  Files              ?      383           
  Lines              ?    53637           
  Branches           ?     8412           
==========================================
  Hits               ?    39754           
  Misses             ?    11182           
  Partials           ?     2701           
Flag Coverage Δ
GPU 74.11% <14.28%> (?)




@fastdeploy-bot fastdeploy-bot left a comment


🤖 AI Code Review | 2026-04-11

📋 Review Summary

PR overview: optimizes the RoPE implementation for GLM models, improving performance (0.355 s → 0.123 s) by disabling EnforceFmulRN and changing the Python-layer computation.

Scope of changes: custom_ops/gpu_ops/append_attn/, fastdeploy/model_executor/layers/

Impact tags: [OP] [RL]

📝 PR Convention Check

The "Modifications" section of the PR description is empty; please add a concrete summary of the changes.

Description template (copy-paste ready):

## Modifications

1. GPU operator layer: hard-code the `EnforceFmulRN` template parameter to `false`, replacing the IEEE-754 round-to-nearest `__fmul_rn` with plain multiplication
2. Python layer: add an `FD_ENABLE_RL` environment-variable branch and use `paddle.outer` instead of `paddle.einsum`
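The replacement described in item 2 is a like-for-like identity: for 1-D inputs, `einsum("i,j->ij", t, f)` and `outer(t, f)` build the same position-frequency table. A minimal sketch of that equivalence, using NumPy as a stand-in for the Paddle API (the identity is framework-agnostic; the array sizes are made-up examples):

```python
import numpy as np

# Token positions and inverse frequencies, as used when building a RoPE table.
positions = np.arange(8, dtype=np.float32)
inv_freq = 1.0 / (10000.0 ** (np.arange(0, 8, 2, dtype=np.float32) / 8.0))

# The einsum formulation (the original code path)...
freqs_einsum = np.einsum("i,j->ij", positions, inv_freq)
# ...and the outer-product formulation (the new code path).
freqs_outer = np.outer(positions, inv_freq)

# For 1-D inputs both produce the same [seq_len, dim/2] table.
assert np.allclose(freqs_einsum, freqs_outer)
```

The speedup comes from avoiding einsum's expression parsing and general-contraction dispatch for what is, in this case, just an outer product.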

Issues

Level | File | Summary
🟡 Suggestion | decoder_write_cache_with_rope_kernel.cu:149 | Hard-coding affects all models; control it via an environment variable instead
🟡 Suggestion | encoder_write_cache_with_rope_impl.cuh:2546 | Same as above
🟡 Suggestion | gqa_rope_write_cache.cu:390 | Same as above
🟡 Suggestion | speculate_write_cache_with_rope_kernel.cu:133 | Same as above

Overall Assessment

The performance gain is significant (roughly 3x), and the precision loss is within an acceptable range (max diff: 0.00048828). However, hard-coding at the GPU layer affects every model that uses partial rotary embedding, not just GLM; it would be better to control this through the FD_ENABLE_RL environment variable, keeping it consistent with the Python layer.
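The quoted max diff (0.00048828 ≈ 2⁻¹¹) is the kind of bound one can sanity-check by comparing against a higher-precision reference. A hedged NumPy sketch of such a check — float64 as reference against a float32 path, with a hypothetical tolerance; this is not the PR's actual test:

```python
import numpy as np

# Build a reference RoPE cosine table in float64, then repeat via float32.
positions = np.arange(4096, dtype=np.float64)
inv_freq = 1.0 / (10000.0 ** (np.arange(0, 128, 2) / 128.0))

ref = np.cos(np.outer(positions, inv_freq))                        # float64 reference
approx = np.cos(np.outer(positions, inv_freq).astype(np.float32))  # float32 path

# Max absolute deviation of the lower-precision path from the reference.
max_diff = np.abs(ref - approx.astype(np.float64)).max()
assert max_diff < 1e-3  # hypothetical tolerance, same order as the PR's 4.88e-4
```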

append_decode_cache_T_neox_partial_rope_kernel<T,
PackSize,
EnforceFmulRN>;
auto* kernelFn = append_decode_cache_T_neox_partial_rope_kernel<

🟡 Suggestion: hard-coding EnforceFmulRN = false affects every model that uses these kernels (including DeepSeek V3 and other models with partial rotary embedding), not just GLM.

Consider controlling this parameter through the FD_ENABLE_RL environment variable instead, keeping it consistent with the Python layer.

For the change, see the fmul_func definition at helper.h:746; the kernel launch can select a different template instantiation based on the environment variable.
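The reviewer's suggestion amounts to gating the code path on an environment variable. A hypothetical Python-level sketch of that pattern (the function name and bodies are invented for illustration; FD_ENABLE_RL is the variable named in the review, and NumPy stands in for the Paddle calls):

```python
import os
import numpy as np

def compute_rope_freqs(positions: np.ndarray, inv_freq: np.ndarray) -> np.ndarray:
    """Build the RoPE position-frequency table, selecting the path via FD_ENABLE_RL."""
    if os.getenv("FD_ENABLE_RL", "0") == "1":
        # RL path: plain outer product (the faster formulation from this PR).
        return np.outer(positions, inv_freq)
    # Default path: the original einsum formulation.
    return np.einsum("i,j->ij", positions, inv_freq)
```

At the CUDA level the analogous change would read the variable once on the host and pick the `EnforceFmulRN=true` or `=false` template instantiation at launch time, rather than hard-coding `false` for every model.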

GQANeoxVariableLengthPartialRotaryKernel<T,
PackSize,
EnforceFmulRN>;
auto *kernelFn = GQANeoxVariableLengthPartialRotaryKernel<

🟡 Suggestion: same as above — hard-coding affects all models; control it via an environment variable instead.

max_model_len,
head_dim,
rotary_dim);
launchWithPdlWhenEnabled(GQAVariableLengthNeoxPartialRotarySplitKernel<

🟡 Suggestion: same as above — hard-coding affects all models; control it via an environment variable instead.

PackSize,
QKV_TYPE,
EnforceFmulRN>
append_speculate_cache_neox_partial_rope_kernel<

🟡 Suggestion: same as above — hard-coding affects all models; control it via an environment variable instead.

Collaborator

@EmmonsCurse EmmonsCurse left a comment


LGTM. Skipping the coverage check, since this change is mainly exercised by RL tests.

@zoooo0820 zoooo0820 merged commit 076ab07 into PaddlePaddle:develop Apr 11, 2026
54 of 57 checks passed
@zoooo0820 zoooo0820 deleted the glm_rope branch April 11, 2026 10:36
zoooo0820 added a commit that referenced this pull request Apr 11, 2026
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci
zoooo0820 added a commit that referenced this pull request Apr 11, 2026
* change glm rope_emb calculation

* glm without EnforceFmulRN

* fix ci


6 participants