feat(ascend): add 9 Ascend operator kernels #47

Open
zhangyue207 wants to merge 15 commits into feat/ascend-framework from feat/ascend-operators
Conversation

@zhangyue207
Collaborator

Adds nine Ascend operator kernels: Add, RmsNorm, Swiglu, Matmul, CausalSoftmax, AddRmsNorm, ReshapeAndCache, RotaryEmbedding, FlashAttention.

zhangyue added 15 commits April 8, 2026 10:52
…ld integration

Add Ascend platform scaffolding:
- `device_.h`: `DeviceEnabled<kAscend>` specialization
- `data_type_.h`: `toAclDtype()`, `isIntegerDtype()`
- `common.h`: `buildAclTensor()` with optional transpose
- `workspace_pool_.h`: stream-keyed workspace allocator
- `runtime_.h`: `Runtime<kAscend>` (Malloc, Free, Memcpy, Memset)
- 5 new operator base classes (AddRmsNorm, FlashAttention, Matmul,
  ReshapeAndCache, RotaryEmbedding)

Integrate into CMake build system, Python binding generation (stream +
optional tensor support), and examples runtime API.
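The stream-keyed workspace allocator mentioned above (`workspace_pool_.h`) can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: `WorkspacePool`, `Entry`, and `ensure` model the described behavior, with plain `malloc`/`free` standing in for `aclrtMalloc`/`aclrtFree` on device memory.

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_map>

// Hypothetical sketch of a stream-keyed workspace pool: each stream gets one
// grow-only scratch buffer that is reused across kernel launches. Plain
// malloc/free stands in for aclrtMalloc/aclrtFree so the sketch is
// self-contained.
class WorkspacePool {
public:
    // Return a buffer of at least `size` bytes bound to `stream`,
    // reallocating only when the cached buffer is too small.
    void* ensure(void* stream, size_t size) {
        Entry& e = arenas_[stream];
        if (e.capacity < size) {
            std::free(e.ptr);
            e.ptr = std::malloc(size);
            assert(e.ptr != nullptr);
            e.capacity = size;
        }
        return e.ptr;
    }

    ~WorkspacePool() {
        for (auto& kv : arenas_) std::free(kv.second.ptr);
    }

private:
    struct Entry {
        void* ptr = nullptr;
        size_t capacity = 0;
    };
    std::unordered_map<void*, Entry> arenas_;  // keyed by stream handle
};
```

A grow-only policy like this trades peak memory for zero allocations in steady state, which matters when every kernel launch may need scratch space.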
…emove missing include

- Wrap `aclrtMemcpy` (5-arg) and `aclrtMemset` (4-arg) in lambdas to
  match the generic 4-arg / 3-arg calling convention used by examples.
- Assert `aclrtMalloc` return value in `WorkspacePool::ensure()`.
- Remove `ascend/gemm/kernel.h` include from `runtime_api.h` (file
  does not exist until the kernels commit).
- Add Ascend GEMM specialization using `aclnnAddmm`/`aclnnBaddbmm`.
- Add `get_npu_stream()` helper and NPU device detection in test utils.
- Add `skip_unsupported_dtype` fixture for Ascend in conftest.
- Update `runtime_api.h` with Ascend backend entry.
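The lambda-wrapping trick in the first bullet can be illustrated like this. The `fakeAclrt*` functions are stand-ins (real `aclrtMemcpy` takes a `destMax` capacity argument and `aclrtMemset` a `maxCount`, which is what makes them 5-arg and 4-arg); the adapter lambdas are hypothetical names showing how to fit them to a generic 4-arg/3-arg convention.

```cpp
#include <cassert>
#include <cstring>

// Stand-ins mimicking the 5-arg aclrtMemcpy / 4-arg aclrtMemset shapes,
// where destMax/maxCount carry the destination capacity.
using MemcpyKind = int;
int fakeAclrtMemcpy(void* dst, size_t destMax, const void* src,
                    size_t count, MemcpyKind) {
    if (count > destMax) return -1;
    std::memcpy(dst, src, count);
    return 0;
}
int fakeAclrtMemset(void* dst, size_t maxCount, int value, size_t count) {
    if (count > maxCount) return -1;
    std::memset(dst, value, count);
    return 0;
}

// Adapter lambdas matching a generic 4-arg memcpy / 3-arg memset convention:
// pass `count` for the capacity argument as well, since the generic caller
// has no separate notion of destination capacity.
auto genericMemcpy = [](void* dst, const void* src, size_t count, MemcpyKind k) {
    return fakeAclrtMemcpy(dst, count, src, count, k);
};
auto genericMemset = [](void* dst, int value, size_t count) {
    return fakeAclrtMemset(dst, count, value, count);
};
```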
The `aclrtMalloc` call was the sole expression inside `assert()`, so it
was compiled away in release builds (NDEBUG). This left the workspace
buffer null, causing `aclnnAddmm` to return ACLNN_ERR_PARAM_NULLPTR
(161001) for any operation that requires workspace (e.g. alpha != 1.0).
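The NDEBUG pitfall described above is a classic: `assert` is a macro that expands to nothing in release builds, taking its argument expression (and any side effects) with it. A minimal sketch, with `fakeMalloc`, `buggyEnsure`, and `fixedEnsure` as hypothetical names:

```cpp
#include <cassert>
#include <cstdlib>

// Illustrative stand-in for aclrtMalloc: allocates and reports success.
int fakeMalloc(void** out, size_t size) {
    *out = std::malloc(size);
    return *out ? 0 : -1;
}

void* buggyEnsure(size_t size) {
    void* buf = nullptr;
    // BUG: when NDEBUG is defined, the entire assert expression is compiled
    // away, so the allocation never happens and buf stays null.
    assert(fakeMalloc(&buf, size) == 0);
    return buf;
}

void* fixedEnsure(size_t size) {
    void* buf = nullptr;
    // Fix: evaluate the call unconditionally, then assert on the result.
    int rc = fakeMalloc(&buf, size);
    (void)rc;  // silence unused-variable warning under NDEBUG
    assert(rc == 0);
    return buf;
}
```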
Add, RmsNorm, Swiglu, Matmul, CausalSoftmax, AddRmsNorm,
ReshapeAndCache, RotaryEmbedding, FlashAttention.
Pass stream to all CANN ops in existing tests; add FlashAttention,
ReshapeAndCache, RotaryEmbedding, and E2E LLaMA layer tests.
…/Linear/Mul operators

Descriptor caching (`AclTensorCache` + `aclSetRawTensorAddr`), executor caching
(`aclSetAclOpExecutorRepeatable`), D2H sync elimination, `add_rms_norm` decomposition,
and `WorkspacePool` thread-local fast path. Host dispatch dropped from ~255 µs/call to
17–57 µs/call for all cacheable operators. New operators: Cast (`aclnnCast`), Cat
(`aclnnCat` with TensorList executor caching), Linear (`aclnnAddmm`/`aclnnBaddbmm`/
`aclnnMatmul`), Mul (`aclnnMul`). Full regression: 2040 passed, 0 failed.
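The caching scheme above splits each op call into an expensive shape-dependent part (built once) and a cheap per-call pointer rebind (what `aclSetRawTensorAddr` does for ACL descriptors). A generic sketch of that split, with `ExecCache`, `CachedExec`, and the shape-string key all as hypothetical illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of descriptor/executor caching: build the expensive
// shape-dependent state once per shape signature, then on cache hits only
// rebind the raw data pointer (mirroring aclSetRawTensorAddr plus a
// repeatable executor).
struct CachedExec {
    std::vector<int64_t> shape;  // the immutable part, built once
    void* dataAddr = nullptr;    // the part rebound on every call
};

class ExecCache {
public:
    // Return the cached executor for this shape, creating it on first use.
    CachedExec& get(const std::vector<int64_t>& shape, void* addr) {
        std::string key;
        for (int64_t d : shape) key += std::to_string(d) + "x";
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++misses_;  // only here would the real code build a descriptor
            it = cache_.emplace(key, CachedExec{shape, nullptr}).first;
        }
        it->second.dataAddr = addr;  // cheap rebind on every call
        return it->second;
    }
    int misses() const { return misses_; }

private:
    std::unordered_map<std::string, CachedExec> cache_;
    int misses_ = 0;
};
```

The per-call cost collapses to a hash lookup plus a pointer store, which is consistent with host dispatch dropping by an order of magnitude.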
Use `unique_ptr<WorkspaceArena>` in the arena map so that thread-local
cached pointers remain valid across `unordered_map` rehashes.  Remove
unused `detail::reshapeView` helper from FlashAttention.
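The `unique_ptr` indirection works because growth then moves only the handles, never the pointees. (Note that standard `std::unordered_map` already guarantees reference stability across rehash; the indirection is the usual fix for containers that relocate elements, such as open-addressing hash maps.) A sketch of the general principle, using `std::vector` as a stand-in for a relocating container, with `Arena` and `cachedPointerStable` as hypothetical names:

```cpp
#include <memory>
#include <vector>

// Containers that relocate elements on growth (vectors, open-addressing hash
// maps) invalidate pointers into their storage. Storing unique_ptr<Arena>
// instead of Arena by value keeps each arena at a fixed heap address: growth
// moves only the unique_ptr handles, never the pointees.
struct Arena { int capacity = 0; };

bool cachedPointerStable() {
    std::vector<std::unique_ptr<Arena>> arenas;
    arenas.push_back(std::make_unique<Arena>());
    Arena* cached = arenas[0].get();   // e.g. a thread-local fast path

    // Force repeated reallocation of the container's storage.
    for (int i = 0; i < 1000; ++i) arenas.push_back(std::make_unique<Arena>());

    // The cached pointer still refers to the same live arena.
    return cached == arenas[0].get();
}
```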
…tion

Normalize negative `dim` in the base class constructor (e.g. -1 → last
dimension).  Add comment in the Ascend kernel explaining why
`aclSetRawTensorAddr` on TensorList-contained descriptors is sufficient
without `aclSetInputTensorAddr`.  Add negative-dim test case.
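The negative-`dim` normalization is a one-liner; a hedged sketch of what the base-class constructor presumably does (`normalizeDim` is a hypothetical name):

```cpp
#include <stdexcept>

// Map a possibly-negative dim into [0, rank), so that -1 selects the last
// dimension, -rank the first; anything outside [-rank, rank) is an error.
int normalizeDim(int dim, int rank) {
    if (dim < -rank || dim >= rank)
        throw std::out_of_range("dim out of range");
    return dim < 0 ? dim + rank : dim;
}
```

Normalizing once in the constructor means every downstream kernel can assume a non-negative dimension index.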
@zhangyue207 zhangyue207 force-pushed the feat/ascend-framework branch from 80acc8b to 7628b2f Compare April 10, 2026 16:58
