[Optimization] [OP] [Models] dsk del prefill mask #7313

Jiang-Jia-Jun merged 3 commits into PaddlePaddle:develop from
Conversation

Thanks for your contribution!
Codecov Report ❌ Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
## develop #7313 +/- ##
==========================================
Coverage ? 74.14%
==========================================
Files ? 383
Lines ? 53631
Branches ? 8411
==========================================
Hits ? 39764
Misses ? 11161
Partials ? 2706
fastdeploy-bot left a comment
🤖 AI Code Review | 2026-04-11

📋 Review Summary

PR overview: DeepSeek V3 model performance optimizations, including rotary kernel support for >65535 tokens and multi-head_dim support in the merge operator.

Scope of changes: custom_ops/gpu_ops/, model_executor/models/, tests/operators/

Impact tags: [Optimization] [OP] [Models]

📝 PR Convention Check

The PR title and description follow the conventions: they carry the [Optimization], [OP], and [Models] tags, and the Motivation and Modifications sections are filled in.

Issues found

| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | tests/operators/test_fused_rotary_position_encoding.py:118 | test_large_num_tokens only checks that no exception is raised; it does not verify output correctness |
| 🟡 Suggestion | fastdeploy/model_executor/models/deepseek_v3.py:445 | Confirm whether the hard-coded max_token=1 is valid for all scenarios (see comment details) |

Overall assessment

The changes are logically correct: the rotary kernel's use of a 2D grid to break the 65535 limit, and the merge operator's multi-head_dim support, are both implemented reasonably. However, consider adding unit tests for merge_prefill_decode_output and strengthening test_large_num_tokens with a correctness check.
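The 65535 ceiling mentioned above comes from CUDA's hardware limit on gridDim.y and gridDim.z. One common workaround, and likely the spirit of the kernel change here, is to factor a large index across two grid dimensions and reassemble it inside the kernel. The Python sketch below models that factoring (all names are illustrative, not the actual kernel's); note that `ceil(n / ceil(n / 65535))` is itself always ≤ 65535, so both resulting dimensions are safe:

```python
import math

CUDA_MAX_GRID_YZ = 65535  # hardware limit on gridDim.y and gridDim.z

def grid_dims_2d(num_tokens: int):
    """Split the token index over (grid_x, grid_y) so that neither
    dimension exceeds the 65535 limit."""
    grid_y = math.ceil(num_tokens / CUDA_MAX_GRID_YZ)
    grid_x = math.ceil(num_tokens / grid_y)
    return grid_x, grid_y

def token_index(block_x: int, block_y: int, grid_x: int) -> int:
    # Inside the kernel this would be: blockIdx.y * gridDim.x + blockIdx.x,
    # with an early return when the flattened index >= num_tokens.
    return block_y * grid_x + block_x

gx, gy = grid_dims_2d(65537)
# Every token id in [0, 65537) is reachable by exactly one (block_x, block_y) pair.
covered = {token_index(bx, by, gx) for by in range(gy) for bx in range(gx)}
assert set(range(65537)) <= covered
```

Since `grid_x * grid_y` can slightly exceed `num_tokens`, the kernel must bounds-check the flattened index before touching memory.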
```python
@@ -116,18 +116,21 @@ def test_neox_mode(self):
        self._check_correctness(num_tokens=3, num_heads=2, num_kv_heads=2, head_size=8, rot_dim=8, is_neox=True)

    def test_large_num_tokens(self):
```
🟡 Suggestion: test_large_num_tokens only verifies that the operator does not raise an exception; it never checks that the output is correct.

Consider adding a correctness check (similar to the other test cases that use _check_correctness):
```python
def test_large_num_tokens(self):
    num_tokens, num_heads, head_size = 65537, 1, 4
    num_kv_heads, rot_dim = 1, 4
    query_np = np.random.rand(num_tokens, num_heads, head_size).astype("float32")
    key_np = np.random.rand(num_tokens, num_kv_heads, head_size).astype("float32")
    position_ids_np = np.arange(num_tokens, dtype="int32")
    cos_sin_cache_np = self._make_cos_sin_cache(num_tokens, rot_dim)
    query_out, key_out = self._run_op(
        query_np, key_np, position_ids_np, cos_sin_cache_np, head_size, is_neox=False
    )
    # Add a correctness check
    query_ref, key_ref = self._ref_rotary(
        query_np, key_np, position_ids_np, cos_sin_cache_np, head_size, is_neox=False
    )
    np.testing.assert_allclose(query_out, query_ref, rtol=1e-5, atol=1e-6)
    np.testing.assert_allclose(key_out, key_ref, rtol=1e-5, atol=1e-6)
```

fastdeploy/model_executor/models/deepseek_v3.py:

```python
    forward_meta.cu_seqlens_q,
    self.num_attention_heads_tp,
    self.v_head_dim,
    1,
```
❓ Question: max_token=1 is hard-coded in the call to merge_prefill_decode_output.

In the CUDA kernel, max_token is used to compute the z dimension of the grid:

```cpp
const int warps = head_dim / 32;
const int tokens_block = (max_token + warps - 1) / warps;
dim3 grid_dims(batch_size, head_num, tokens_block);
```

When max_token=1, tokens_block=1 and token_id = warp_id (range 0-31). If seq_lens_this_time[bidb] > warps, some tokens will never be processed.

Please confirm:

- In the DeepSeek V3 usage scenario, is seq_lens_this_time[bidb] always ≤ warps?
- Are there scenarios where a single decode step produces multiple tokens (e.g. speculative decoding)?

Suggestion: if max_token=1 is an optimization specific to the current scenario, add a comment explaining why; otherwise, consider passing forward_meta.max_len_tensor_cpu[2] (max_dec_len_this_time) instead.
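The reviewer's concern can be checked numerically. The toy model below (plain Python with illustrative names; the real mapping lives in the CUDA kernel) follows the reviewer's reading of the grid sizing, where each z-block covers `warps` tokens, so only `tokens_block * warps` tokens per sequence are reachable:

```python
import math

def reachable_tokens(max_token: int, head_dim: int) -> int:
    """Toy model of the kernel's grid sizing: with tokens_block =
    ceil(max_token / warps) z-blocks, each covering `warps` tokens,
    only tokens_block * warps token slots exist per sequence."""
    warps = head_dim // 32
    tokens_block = math.ceil(max_token / warps)  # (max_token + warps - 1) / warps
    return tokens_block * warps

# With max_token hard-coded to 1 and head_dim=128 (warps=4),
# only 4 tokens per sequence are reachable.
assert reachable_tokens(1, 128) == 4

# A hypothetical speculative-decoding step emitting 6 tokens would
# need max_token >= 6 for full coverage; with max_token=6 the grid
# rounds up to 8 reachable slots.
assert reachable_tokens(6, 128) == 8
```

Under this model, hard-coding max_token=1 is safe only if every decode step emits at most `warps` tokens per sequence, which is exactly what the reviewer asks the author to confirm.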
Motivation

DeepSeek V3 model performance optimizations, including rotary kernel support for >65535 tokens and multi-head_dim support in the merge operator.
Modifications
Usage or Command
Accuracy Tests
Checklist

- Add at least one of the following tags: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
- Run pre-commit before commit.
- For a PR targeting the release branch, make sure the PR has been submitted to the develop branch first, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.