[Optimization] [OP] [Models] dsk del prefill mask #7313

Jiang-Jia-Jun merged 3 commits into PaddlePaddle:develop from
Conversation

Thanks for your contribution!
Codecov Report ❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## develop #7313 +/- ##
==========================================
Coverage ? 74.14%
==========================================
Files ? 383
Lines ? 53631
Branches ? 8411
==========================================
Hits ? 39764
Misses ? 11161
Partials ? 2706
fastdeploy-bot left a comment
🤖 AI Code Review | 2026-04-11

📋 Review Summary

PR overview: DeepSeek V3 model performance optimizations, including rotary kernel support for >65535 tokens and multi-head_dim support in the merge operator.

Scope of changes: custom_ops/gpu_ops/, model_executor/models/, tests/operators/

Impact tags: [Optimization] [OP] [Models]

📝 PR Convention Check

The PR title and description follow the conventions: they carry the [Optimization], [OP], and [Models] tags, and the Motivation and Modifications sections are filled in.

Issues found

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | tests/operators/test_fused_rotary_position_encoding.py:118 | test_large_num_tokens only checks that no exception is raised; it does not verify output correctness |
| 🟡 Suggestion | fastdeploy/model_executor/models/deepseek_v3.py:445 | Confirm whether the hard-coded max_token=1 is valid for all scenarios (see comment details) |

Overall assessment

The changes are logically correct: the rotary kernel's use of a 2D grid to break the 65535 limit, and the merge operator's multi-head_dim support, are both implemented reasonably. However, consider adding unit tests for merge_prefill_decode_output and strengthening test_large_num_tokens with a correctness check.
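The 65535 ceiling mentioned above comes from CUDA's hardware limit on gridDim.y and gridDim.z. One common workaround, and likely the spirit of the kernel change here, is to factor a large index across two grid dimensions and reassemble it inside the kernel. The Python sketch below models that factoring (all names are illustrative, not the actual kernel's); note that `ceil(n / ceil(n / 65535))` is itself always ≤ 65535, so both resulting dimensions are safe:

```python
import math

CUDA_MAX_GRID_YZ = 65535  # hardware limit on gridDim.y and gridDim.z

def grid_dims_2d(num_tokens: int):
    """Split the token index over (grid_x, grid_y) so that neither
    dimension exceeds the 65535 limit."""
    grid_y = math.ceil(num_tokens / CUDA_MAX_GRID_YZ)
    grid_x = math.ceil(num_tokens / grid_y)
    return grid_x, grid_y

def token_index(block_x: int, block_y: int, grid_x: int) -> int:
    # Inside the kernel this would be: blockIdx.y * gridDim.x + blockIdx.x,
    # with an early return when the flattened index >= num_tokens.
    return block_y * grid_x + block_x

gx, gy = grid_dims_2d(65537)
# Every token id in [0, 65537) is reachable by exactly one (block_x, block_y) pair.
covered = {token_index(bx, by, gx) for by in range(gy) for bx in range(gx)}
assert set(range(65537)) <= covered
```

Since `grid_x * grid_y` can slightly exceed `num_tokens`, the kernel must bounds-check the flattened index before touching memory.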
```python
@@ -116,18 +116,21 @@ def test_neox_mode(self):
        self._check_correctness(num_tokens=3, num_heads=2, num_kv_heads=2, head_size=8, rot_dim=8, is_neox=True)

    def test_large_num_tokens(self):
```
🟡 Suggestion: test_large_num_tokens only verifies that the operator does not raise an exception; it never checks that the output is correct.

Consider adding a correctness check (similar to the other test cases that use _check_correctness):
```python
def test_large_num_tokens(self):
    num_tokens, num_heads, head_size = 65537, 1, 4
    num_kv_heads, rot_dim = 1, 4
    query_np = np.random.rand(num_tokens, num_heads, head_size).astype("float32")
    key_np = np.random.rand(num_tokens, num_kv_heads, head_size).astype("float32")
    position_ids_np = np.arange(num_tokens, dtype="int32")
    cos_sin_cache_np = self._make_cos_sin_cache(num_tokens, rot_dim)
    query_out, key_out = self._run_op(
        query_np, key_np, position_ids_np, cos_sin_cache_np, head_size, is_neox=False
    )
    # Add a correctness check
    query_ref, key_ref = self._ref_rotary(
        query_np, key_np, position_ids_np, cos_sin_cache_np, head_size, is_neox=False
    )
    np.testing.assert_allclose(query_out, query_ref, rtol=1e-5, atol=1e-6)
    np.testing.assert_allclose(key_out, key_ref, rtol=1e-5, atol=1e-6)
```

fastdeploy/model_executor/models/deepseek_v3.py:

```python
    forward_meta.cu_seqlens_q,
    self.num_attention_heads_tp,
    self.v_head_dim,
    1,
```
❓ Question: max_token=1 is hard-coded in the call to merge_prefill_decode_output.

In the CUDA kernel, max_token is used to compute the z dimension of the grid:

```cpp
const int warps = head_dim / 32;
const int tokens_block = (max_token + warps - 1) / warps;
dim3 grid_dims(batch_size, head_num, tokens_block);
```

When max_token=1, tokens_block=1 and token_id = warp_id (range 0-31). If seq_lens_this_time[bidb] > warps, some tokens will never be processed.

Please confirm:

- In the DeepSeek V3 usage scenario, is seq_lens_this_time[bidb] always ≤ warps?
- Are there scenarios where a single decode step produces multiple tokens (e.g. speculative decoding)?

Suggestion: if max_token=1 is an optimization specific to the current scenario, add a comment explaining why; otherwise, consider passing forward_meta.max_len_tensor_cpu[2] (max_dec_len_this_time) instead.
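The reviewer's concern can be checked numerically. The toy model below (plain Python with illustrative names; the real mapping lives in the CUDA kernel) follows the reviewer's reading of the grid sizing, where each z-block covers `warps` tokens, so only `tokens_block * warps` tokens per sequence are reachable:

```python
import math

def reachable_tokens(max_token: int, head_dim: int) -> int:
    """Toy model of the kernel's grid sizing: with tokens_block =
    ceil(max_token / warps) z-blocks, each covering `warps` tokens,
    only tokens_block * warps token slots exist per sequence."""
    warps = head_dim // 32
    tokens_block = math.ceil(max_token / warps)  # (max_token + warps - 1) / warps
    return tokens_block * warps

# With max_token hard-coded to 1 and head_dim=128 (warps=4),
# only 4 tokens per sequence are reachable.
assert reachable_tokens(1, 128) == 4

# A hypothetical speculative-decoding step emitting 6 tokens would
# need max_token >= 6 for full coverage; with max_token=6 the grid
# rounds up to 8 reachable slots.
assert reachable_tokens(6, 128) == 8
```

Under this model, hard-coding max_token=1 is safe only if every decode step emits at most `warps` tokens per sequence, which is exactly what the reviewer asks the author to confirm.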
Motivation

DeepSeek V3 model performance optimizations, including rotary kernel support for >65535 tokens and multi-head_dim support in the merge operator.
Modifications
Usage or Command
Accuracy Tests
Checklist

- Add at least one of the following tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a PR targeting the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.