[RL] change glm rope_emb calculation#7316
Conversation
Thanks for your contribution!
Codecov Report ❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##             develop    #7316   +/-   ##
==========================================
  Coverage           ?   74.11%
==========================================
  Files              ?      383
  Lines              ?    53637
  Branches           ?     8412
==========================================
  Hits               ?    39754
  Misses             ?    11182
  Partials           ?     2701

Flags with carried forward coverage won't be shown.
fastdeploy-bot
left a comment
🤖 AI Code Review
2026-04-11
📋 Review Summary
PR overview: optimizes the RoPE implementation for GLM models by disabling EnforceFmulRN and reworking the Python-layer computation, improving performance (0.355s → 0.123s)
Scope of changes: custom_ops/gpu_ops/append_attn/, fastdeploy/model_executor/layers/
Impact tags: [OP] [RL]
📝 PR Convention Check
The "Modifications" section of the PR description is empty; please add concrete change notes.
Suggested template (copy-paste ready):
## Modifications
1. GPU operator layer: hard-code the `EnforceFmulRN` template parameter to `false`, replacing the IEEE-754 round-to-nearest `__fmul_rn` with a plain multiply
2. Python layer: add a branch controlled by the `FD_ENABLE_RL` environment variable, replacing `paddle.einsum` with `paddle.outer`
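The Python-layer swap above (`paddle.outer` in place of `paddle.einsum`) rests on the fact that `einsum("i,j->ij", t, inv_freq)` is exactly an outer product. A minimal sketch of the equivalence, using NumPy for a self-contained illustration (paddle's `outer` and `einsum` follow the same semantics here; `head_dim` and `max_len` values are illustrative, not taken from the PR):

```python
import numpy as np

# Illustrative sizes, not the PR's actual model dimensions.
head_dim, max_len = 8, 16

# Standard RoPE inverse-frequency table and position indices.
inv_freq = 1.0 / (10000.0 ** (np.arange(0, head_dim, 2) / head_dim))
t = np.arange(max_len, dtype=np.float64)

# Before: einsum-based outer product of positions and inverse frequencies.
freqs_einsum = np.einsum("i,j->ij", t, inv_freq)

# After: a plain outer product, avoiding einsum's parsing/dispatch overhead.
freqs_outer = np.outer(t, inv_freq)

# The two formulations produce identical frequency tables.
assert np.array_equal(freqs_einsum, freqs_outer)
print(freqs_outer.shape)
```

Since the outputs are bit-identical, the einsum-to-outer change affects only the Python-side overhead, not numerical results; the precision difference reported in the PR comes from the GPU-side multiply change.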
Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | decoder_write_cache_with_rope_kernel.cu:149 | Hard-coding affects all models; prefer environment-variable control |
| 🟡 Suggestion | encoder_write_cache_with_rope_impl.cuh:2546 | Same as above |
| 🟡 Suggestion | gqa_rope_write_cache.cu:390 | Same as above |
| 🟡 Suggestion | speculate_write_cache_with_rope_kernel.cu:133 | Same as above |
Overall Assessment
The performance gain is significant (roughly 3x), and the precision loss stays within an acceptable range (max diff: 0.00048828). However, the hard-coded GPU-layer change affects every model that uses partial rotary embedding, not just GLM. Consider gating it behind the FD_ENABLE_RL environment variable for consistency with the Python layer.
```cpp
    append_decode_cache_T_neox_partial_rope_kernel<T,
                                                   PackSize,
                                                   EnforceFmulRN>;
auto* kernelFn = append_decode_cache_T_neox_partial_rope_kernel<
```
🟡 Suggestion: Hard-coding EnforceFmulRN = false affects every model that uses these kernels (including DeepSeek V3 and other models with partial rotary embedding), not only GLM.
Consider controlling this parameter via the FD_ENABLE_RL environment variable to stay consistent with the Python layer.
For a reference implementation, see the fmul_func definition at helper.h:746; the kernel launch can select a different template instantiation based on the environment variable.
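The gating pattern the review asks for can be sketched as follows. This is shown in Python for brevity (the actual fix would live in the CUDA launch code, selecting a template instantiation); the helper names are hypothetical, and only the FD_ENABLE_RL variable name comes from the PR:

```python
import os

def rope_use_plain_mul() -> bool:
    # Hypothetical helper mirroring the FD_ENABLE_RL gating the review
    # suggests; "1"/"true" opt in to the faster, plain-multiply RL path.
    return os.getenv("FD_ENABLE_RL", "0").lower() in ("1", "true")

def launch_rope_kernel() -> str:
    # Sketch of dispatch: pick the kernel variant (represented here as
    # strings standing in for the two template instantiations) from the
    # environment variable, so EnforceFmulRN=false is opt-in rather than
    # hard-coded for all models.
    if rope_use_plain_mul():
        return "kernel<EnforceFmulRN=false>"  # plain multiply, faster
    return "kernel<EnforceFmulRN=true>"       # __fmul_rn, IEEE-754 rounding

os.environ["FD_ENABLE_RL"] = "1"
print(launch_rope_kernel())
```

With this shape, models that rely on `__fmul_rn` precision (e.g. other partial-rotary-embedding models) keep the default behavior unless the environment variable is set.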
```cpp
    GQANeoxVariableLengthPartialRotaryKernel<T,
                                             PackSize,
                                             EnforceFmulRN>;
auto *kernelFn = GQANeoxVariableLengthPartialRotaryKernel<
```
🟡 Suggestion: Same as above; the hard-coding affects all models, so prefer environment-variable control.
```cpp
    max_model_len,
    head_dim,
    rotary_dim);
launchWithPdlWhenEnabled(GQAVariableLengthNeoxPartialRotarySplitKernel<
```
🟡 Suggestion: Same as above; the hard-coding affects all models, so prefer environment-variable control.
```cpp
    PackSize,
    QKV_TYPE,
    EnforceFmulRN>
append_speculate_cache_neox_partial_rope_kernel<
```
🟡 Suggestion: Same as above; the hard-coding affects all models, so prefer environment-variable control.
EmmonsCurse
left a comment
LGTM~ Skip coverage check as it mainly relies on tests with RL.
Motivation
Rework the current GLM RoPE implementation.
Modifications
Usage or Command
Accuracy Tests
Checklist
- PR tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run `pre-commit` before commit.
- For a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.