[CI]【Hackathon 10th Spring No.45】SM-tier compile guards for T4/V100#7330
Open
r-cloudforge wants to merge 1 commit intoPaddlePaddle:developfrom
Open
Conversation
…0/SM75 Additive fix on top of merged PaddlePaddle#6488: - Add #ifdef ENABLE_SM75_EXT_OPS guard for 5 cutlass/FP8 op registrations (prevents linker error on SM70) - Add #ifdef ENABLE_SM80_EXT_OPS guard for 7 tail MoE/MLA op registrations (prevents linker error on SM70/SM75) Uses ENABLE_SM75_EXT_OPS (passed to both cxx and nvcc compilers) instead of ENABLE_SCALED_MM_C2X (nvcc-only) for the cutlass guard, since cpp_extensions.cc is compiled by the host C++ compiler.
|
Thanks for your contribution! |
|
cloudforge1 seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
fastdeploy-bot
left a comment
There was a problem hiding this comment.
🤖 AI Code Review |
2026-04-11
📋 Review 摘要
PR 概述:为 T4 (SM75) 和 V100 (SM70) GPU 添加条件编译守卫,使 custom_ops 能够在旧架构 GPU 上编译运行
变更范围:custom_ops/gpu_ops/cpp_extensions.cc
影响面 Tag:[OP] [CI]
📝 PR 规范检查
PR 标题和描述格式符合规范:
- 标题包含有效 Tag
[CI] - Motivation 和 Modifications 部分完整
- 提供了详细的表格说明各 guard 块覆盖的 ops
- 包含验证脚本和硬件测试证据
问题
| 级别 | 文件 | 概述 |
|---|
未发现阻塞性问题。
总体评价
PR 实现正确,条件编译守卫结构完整(14 个 #ifdef/#ifndef 与 14 个 #endif 平衡)。宏定义与 setup_ops.py 中的编译参数设置一致,确保了 T4 和 V100 GPU 上的兼容性。描述中提供的验证脚本和硬件测试证据充分,代码变更只影响构建时的 op 注册,不影响运行时逻辑。
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Motivation
Task 45 requires FastDeploy's
custom_opsto compile on T4 (SM75) and V100 (SM70) GPUs. Currently,cpp_extensions.ccregisters all 117 ops unconditionally, causing link errors when SM80+-only CUDA kernels (MoE, MLA, speculative decoding, append attention) are absent from the build.This PR adds conditional compilation guards to
cpp_extensions.ccand corresponding macro definitions insetup_ops.py, gating SM80+ op bindings behindENABLE_SM80_EXT_OPS, SM75+ ops behindENABLE_SM75_EXT_OPS/ENABLE_SCALED_MM_C2X, and SM70'sgelu_tanhbehindDISABLE_GELU_TANH_OP.Modifications
cpp_extensions.cc(+28 lines)14 guard blocks wrapping 78 of 117 ops (updated after merge with latest upstream):
ENABLE_SM80_EXT_OPSENABLE_SM75_EXT_OPSENABLE_SCALED_MM_C2XDISABLE_GELU_TANH_OPThe remaining 39 ops (per_token_quant, get_padding_offset, fused_rotary_position_encoding, noaux_tc, etc.) compile on all SM tiers and remain unguarded.
setup_ops.py(+19 lines, -1 line)ENABLE_SM75_EXT_OPSadded to bothcc_compile_argsandnvcc_compile_argsatcc >= 75— also addsmoe_deepgemm_permute.cuandmoe_deepgemm_depermute.cusources (these kernels have no BF16 dependency)ENABLE_SM80_EXT_OPSadded to bothcc_compile_argsandnvcc_compile_argsatcc >= 80DISABLE_GELU_TANH_OPadded to both compile args when SM70 is in the target architectures — also removesgelu_tanh.cufrom sources to avoid compiling unsupported SM75 Tanh instructionssm_versionscomputed once and reused (avoids redundantget_sm_version()call)dict.fromkeys()beforesetup()to prevent duplicate translation units from overlappingfind_end_files()callsUsage or Command
Verification script (run from repo root)
Hardware Verification (AI Studio V100)
Guard counts verified on Tesla V100-SXM2-32GB via AI Studio CLI pipeline:
p-1051a228d3c7Guard balance:
#if*=18, #endif=18— balanced.Full V100 nvcc compilation blocked by GFW (cutlass submodule requires GitHub access from AI Studio). Guard structure and macro gating verified independently on hardware.
Accuracy Tests
#if*= 18#endif.Pipeline Evidence:
Checklist
pre-commit) passed for modified files.