[Optimization] [OP] [Models] dsk del prefill mask#7313

Merged
Jiang-Jia-Jun merged 3 commits into PaddlePaddle:develop from chang-wenbin:DSK_DEL
Apr 11, 2026

Conversation

@chang-wenbin (Collaborator) commented Apr 10, 2026

Motivation

Performance optimization for the DeepSeek V3 model, including support for >65535 tokens in the rotary kernel and multiple head_dim values in the merge operator.

💡 If this PR is a Cherry Pick, the PR title needs to follow the format by adding the [Cherry-Pick] label at the very beginning and appending the original PR ID at the end. For example, [Cherry-Pick][CI] Add check trigger and logic(#5191)


Modifications

Usage or Command

Accuracy Tests

Checklist

  • Add at least one tag to the PR title.
    • Tag list: [[FDConfig],[APIServer],[Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • Format your code; run pre-commit before committing.
  • Add unit tests. If there are no unit tests, explain why in this PR.
  • Provide accuracy results.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

@paddle-bot

paddle-bot bot commented Apr 10, 2026

Thanks for your contribution!


@chang-wenbin chang-wenbin changed the title [Optimization] dsk del prefill mask [Optimization] [OP] [Models] dsk del prefill mask Apr 10, 2026
@codecov-commenter

codecov-commenter commented Apr 10, 2026

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@734fbcf). Learn more about missing BASE report.

Files with missing lines Patch % Lines
fastdeploy/model_executor/models/deepseek_v3.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop    #7313   +/-   ##
==========================================
  Coverage           ?   74.14%           
==========================================
  Files              ?      383           
  Lines              ?    53631           
  Branches           ?     8411           
==========================================
  Hits               ?    39764           
  Misses             ?    11161           
  Partials           ?     2706           
Flag Coverage Δ
GPU 74.14% <0.00%> (?)

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.


@fastdeploy-bot fastdeploy-bot left a comment


🤖 AI Code Review | 2026-04-11

📋 Review Summary

PR overview: Performance optimization for the DeepSeek V3 model, including support for >65535 tokens in the rotary kernel and multiple head_dim values in the merge operator.

Change scope: custom_ops/gpu_ops/, model_executor/models/, tests/operators/

Impact tags: [Optimization] [OP] [Models]

📝 PR Convention Check

The PR title and description follow the conventions: they carry the [Optimization][OP][Models] tags and fill in the Motivation and Modifications sections.

Issues Found

Severity | File | Summary
🟡 Suggestion | tests/operators/test_fused_rotary_position_encoding.py:118 | test_large_num_tokens only checks that no exception is raised; it does not verify output correctness
🟡 Suggestion | fastdeploy/model_executor/models/deepseek_v3.py:445 | Confirm whether the hard-coded max_token=1 is valid in all scenarios (see comment for details)

Overall Assessment

The changes are logically correct: using a 2D grid in the rotary kernel to exceed the 65535-block limit and supporting multiple head_dim values in the merge operator are both reasonable. However, consider adding a unit test for merge_prefill_decode_output and completing the correctness check in test_large_num_tokens.
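For context on the 65535 figure: CUDA caps the y and z grid dimensions at 65535 blocks, so a kernel that maps one grid dimension directly to tokens hits a wall past that count. A common workaround is to split the flat block index across two grid dimensions and reconstruct it in-kernel. The sketch below simulates that indexing scheme in Python; the function names and the tokens_per_block parameter are illustrative, not taken from the PR's kernel:

```python
# Illustrative sketch (not the PR's actual kernel): cover more than 65535
# blocks worth of tokens by folding a flat block index into a 2D grid.

GRID_DIM_CAP = 65535  # CUDA cap on gridDim.y / gridDim.z


def plan_2d_grid(num_tokens, tokens_per_block=1):
    """Split a 1D block range into a 2D grid that respects the cap."""
    num_blocks = (num_tokens + tokens_per_block - 1) // tokens_per_block
    grid_x = min(num_blocks, GRID_DIM_CAP)
    grid_y = (num_blocks + grid_x - 1) // grid_x
    return grid_x, grid_y


def covered_tokens(num_tokens, tokens_per_block=1):
    """Enumerate the tokens each (bx, by) block would process."""
    grid_x, grid_y = plan_2d_grid(num_tokens, tokens_per_block)
    seen = set()
    for by in range(grid_y):
        for bx in range(grid_x):
            flat_block = by * grid_x + bx  # reconstructed in-kernel
            start = flat_block * tokens_per_block
            stop = min(start + tokens_per_block, num_tokens)
            seen.update(range(start, stop))
    return seen


# 65537 tokens exceed the per-dimension cap yet are fully covered.
assert covered_tokens(65537) == set(range(65537))
```

The `65537` used in `test_large_num_tokens` is exactly the smallest interesting case here: one token past the cap forces a second row of blocks.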

@@ -116,18 +116,21 @@ def test_neox_mode(self):
self._check_correctness(num_tokens=3, num_heads=2, num_kv_heads=2, head_size=8, rot_dim=8, is_neox=True)

def test_large_num_tokens(self):

🟡 Suggestion: test_large_num_tokens only verifies that the operator does not raise an exception; it does not verify the correctness of the output.

Consider adding a correctness check (similar to the other test cases that use _check_correctness):

```python
def test_large_num_tokens(self):
    num_tokens, num_heads, head_size = 65537, 1, 4
    num_kv_heads, rot_dim = 1, 4
    query_np = np.random.rand(num_tokens, num_heads, head_size).astype("float32")
    key_np = np.random.rand(num_tokens, num_kv_heads, head_size).astype("float32")
    position_ids_np = np.arange(num_tokens, dtype="int32")
    cos_sin_cache_np = self._make_cos_sin_cache(num_tokens, rot_dim)

    query_out, key_out = self._run_op(
        query_np, key_np, position_ids_np, cos_sin_cache_np, head_size, is_neox=False
    )
    # verify correctness against the reference implementation
    query_ref, key_ref = self._ref_rotary(
        query_np, key_np, position_ids_np, cos_sin_cache_np, head_size, is_neox=False
    )
    np.testing.assert_allclose(query_out, query_ref, rtol=1e-5, atol=1e-6)
    np.testing.assert_allclose(key_out, key_ref, rtol=1e-5, atol=1e-6)
```
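The suggested test relies on a reference helper (_ref_rotary) whose body is not shown here. A plausible NumPy reference for the non-neox (interleaved) rotary convention is sketched below. The cache layout assumed (per row: rot_dim//2 cos values followed by rot_dim//2 sin values) and the function name are assumptions for illustration, not the repository's actual code:

```python
import numpy as np


def ref_rotary_interleaved(query, key, position_ids, cos_sin_cache, rot_dim):
    """Reference rotary embedding, interleaved (non-neox) convention.

    cos_sin_cache: [max_pos, rot_dim], first half of each row = cos,
    second half = sin (an assumed, common cache layout).
    Rotates each adjacent pair (2i, 2i+1) of the first rot_dim lanes.
    """
    half = rot_dim // 2
    cos = cos_sin_cache[position_ids, :half]         # [num_tokens, half]
    sin = cos_sin_cache[position_ids, half:rot_dim]  # [num_tokens, half]

    def rotate(x):
        out = x.copy()
        x1 = x[..., 0:rot_dim:2]  # even lane of each pair
        x2 = x[..., 1:rot_dim:2]  # odd lane of each pair
        c = cos[:, None, :]       # broadcast across heads
        s = sin[:, None, :]
        out[..., 0:rot_dim:2] = x1 * c - x2 * s
        out[..., 1:rot_dim:2] = x1 * s + x2 * c
        return out

    return rotate(query), rotate(key)
```

With a cache built from cos/sin of position-dependent angles, position 0 has cos=1, sin=0, so outputs equal inputs there, and each pairwise rotation preserves vector norms; both make quick sanity checks.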

forward_meta.cu_seqlens_q,
self.num_attention_heads_tp,
self.v_head_dim,
1,

❓ Question: max_token=1 is hard-coded in the call to merge_prefill_decode_output.

In the CUDA kernel, max_token determines the z dimension of the grid:

```cpp
const int warps = head_dim / 32;
const int tokens_block = (max_token + warps - 1) / warps;
dim3 grid_dims(batch_size, head_num, tokens_block);
```

With max_token=1, tokens_block=1 and token_id = warp_id (range 0-31). If seq_lens_this_time[bidb] > warps, some tokens will never be processed.

To confirm:

  1. In the DeepSeek V3 usage scenario, is seq_lens_this_time[bidb] always ≤ warps?
  2. Are there scenarios that decode more than one token per step (e.g. speculative decoding)?

Suggestion: if max_token=1 is an optimization specific to the current scenario, add a comment explaining why; otherwise consider passing forward_meta.max_len_tensor_cpu[2] (max_dec_len_this_time) instead.
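The coverage concern above can be checked numerically. The sketch below simulates, in Python, which token ids the grid sizing shown in the kernel snippet would touch, under the review's stated assumption that each warp in a z-block handles one token (token_id = block_z * warps + warp_id); the function name is illustrative:

```python
def processed_tokens(max_token, head_dim, seq_len):
    """Simulate token ids covered by the kernel's grid sizing."""
    warps = head_dim // 32
    tokens_block = (max_token + warps - 1) // warps  # grid z dimension
    covered = set()
    for block_z in range(tokens_block):
        for warp_id in range(warps):
            token_id = block_z * warps + warp_id
            if token_id < seq_len:
                covered.add(token_id)
    return covered


# head_dim=128 -> warps=4. With max_token=1 only tokens 0..3 are covered,
# so a step producing 6 tokens would leave tokens 4 and 5 untouched.
assert processed_tokens(max_token=1, head_dim=128, seq_len=6) == {0, 1, 2, 3}
# Sizing the grid from the true max token count covers everything.
assert processed_tokens(max_token=6, head_dim=128, seq_len=6) == set(range(6))
```

This illustrates why the hard-coded value is safe only if decode steps never emit more tokens than there are warps per block.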

@Jiang-Jia-Jun Jiang-Jia-Jun merged commit ba01d7a into PaddlePaddle:develop Apr 11, 2026
35 of 38 checks passed