feat(ascend): add 9 Ascend operator kernels #47

Open
zhangyue207 wants to merge 15 commits into feat/ascend-framework from feat/ascend-operators
Conversation

@zhangyue207
Collaborator

Adds nine Ascend operator kernels: Add, RmsNorm, Swiglu, Matmul, CausalSoftmax, AddRmsNorm, ReshapeAndCache, RotaryEmbedding, FlashAttention.

zhangyue added 15 commits April 8, 2026 10:52
…ld integration

Add Ascend platform scaffolding:
- `device_.h`: `DeviceEnabled<kAscend>` specialization
- `data_type_.h`: `toAclDtype()`, `isIntegerDtype()`
- `common.h`: `buildAclTensor()` with optional transpose
- `workspace_pool_.h`: stream-keyed workspace allocator
- `runtime_.h`: `Runtime<kAscend>` (Malloc, Free, Memcpy, Memset)
- 5 new operator base classes (AddRmsNorm, FlashAttention, Matmul,
  ReshapeAndCache, RotaryEmbedding)

Integrate into CMake build system, Python binding generation (stream +
optional tensor support), and examples runtime API.
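The stream-keyed workspace allocator mentioned above (`workspace_pool_.h`) can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: `WorkspacePool`, `Entry`, and `ensure` model the described behavior, with plain `malloc`/`free` standing in for `aclrtMalloc`/`aclrtFree` on device memory.

```cpp
#include <cassert>
#include <cstdlib>
#include <unordered_map>

// Hypothetical sketch of a stream-keyed workspace pool: each stream gets one
// grow-only scratch buffer that is reused across kernel launches. Plain
// malloc/free stands in for aclrtMalloc/aclrtFree so the sketch is
// self-contained.
class WorkspacePool {
public:
    // Return a buffer of at least `size` bytes bound to `stream`,
    // reallocating only when the cached buffer is too small.
    void* ensure(void* stream, size_t size) {
        Entry& e = arenas_[stream];
        if (e.capacity < size) {
            std::free(e.ptr);
            e.ptr = std::malloc(size);
            assert(e.ptr != nullptr);
            e.capacity = size;
        }
        return e.ptr;
    }

    ~WorkspacePool() {
        for (auto& kv : arenas_) std::free(kv.second.ptr);
    }

private:
    struct Entry {
        void* ptr = nullptr;
        size_t capacity = 0;
    };
    std::unordered_map<void*, Entry> arenas_;  // keyed by stream handle
};
```

A grow-only policy like this trades peak memory for zero allocations in steady state, which matters when every kernel launch may need scratch space.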
…emove missing include

- Wrap `aclrtMemcpy` (5-arg) and `aclrtMemset` (4-arg) in lambdas to
  match the generic 4-arg / 3-arg calling convention used by examples.
- Assert `aclrtMalloc` return value in `WorkspacePool::ensure()`.
- Remove `ascend/gemm/kernel.h` include from `runtime_api.h` (file
  does not exist until the kernels commit).
- Add Ascend GEMM specialization using `aclnnAddmm`/`aclnnBaddbmm`.
- Add `get_npu_stream()` helper and NPU device detection in test utils.
- Add `skip_unsupported_dtype` fixture for Ascend in conftest.
- Update `runtime_api.h` with Ascend backend entry.
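The lambda-wrapping trick in the first bullet can be illustrated like this. The `fakeAclrt*` functions are stand-ins (real `aclrtMemcpy` takes a `destMax` capacity argument and `aclrtMemset` a `maxCount`, which is what makes them 5-arg and 4-arg); the adapter lambdas are hypothetical names showing how to fit them to a generic 4-arg/3-arg convention.

```cpp
#include <cassert>
#include <cstring>

// Stand-ins mimicking the 5-arg aclrtMemcpy / 4-arg aclrtMemset shapes,
// where destMax/maxCount carry the destination capacity.
using MemcpyKind = int;
int fakeAclrtMemcpy(void* dst, size_t destMax, const void* src,
                    size_t count, MemcpyKind) {
    if (count > destMax) return -1;
    std::memcpy(dst, src, count);
    return 0;
}
int fakeAclrtMemset(void* dst, size_t maxCount, int value, size_t count) {
    if (count > maxCount) return -1;
    std::memset(dst, value, count);
    return 0;
}

// Adapter lambdas matching a generic 4-arg memcpy / 3-arg memset convention:
// pass `count` for the capacity argument as well, since the generic caller
// has no separate notion of destination capacity.
auto genericMemcpy = [](void* dst, const void* src, size_t count, MemcpyKind k) {
    return fakeAclrtMemcpy(dst, count, src, count, k);
};
auto genericMemset = [](void* dst, int value, size_t count) {
    return fakeAclrtMemset(dst, count, value, count);
};
```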
The `aclrtMalloc` call was the sole expression inside `assert()`, so it
was compiled away in release builds (NDEBUG). This left the workspace
buffer null, causing `aclnnAddmm` to return ACLNN_ERR_PARAM_NULLPTR
(161001) for any operation that requires workspace (e.g. alpha != 1.0).
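The NDEBUG pitfall described above is a classic: `assert` is a macro that expands to nothing in release builds, taking its argument expression (and any side effects) with it. A minimal sketch, with `fakeMalloc`, `buggyEnsure`, and `fixedEnsure` as hypothetical names:

```cpp
#include <cassert>
#include <cstdlib>

// Illustrative stand-in for aclrtMalloc: allocates and reports success.
int fakeMalloc(void** out, size_t size) {
    *out = std::malloc(size);
    return *out ? 0 : -1;
}

void* buggyEnsure(size_t size) {
    void* buf = nullptr;
    // BUG: when NDEBUG is defined, the entire assert expression is compiled
    // away, so the allocation never happens and buf stays null.
    assert(fakeMalloc(&buf, size) == 0);
    return buf;
}

void* fixedEnsure(size_t size) {
    void* buf = nullptr;
    // Fix: evaluate the call unconditionally, then assert on the result.
    int rc = fakeMalloc(&buf, size);
    (void)rc;  // silence unused-variable warning under NDEBUG
    assert(rc == 0);
    return buf;
}
```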
Add, RmsNorm, Swiglu, Matmul, CausalSoftmax, AddRmsNorm,
ReshapeAndCache, RotaryEmbedding, FlashAttention.
Pass stream to all CANN ops in existing tests; add FlashAttention,
ReshapeAndCache, RotaryEmbedding, and E2E LLaMA layer tests.
…/Linear/Mul operators

Descriptor caching (`AclTensorCache` + `aclSetRawTensorAddr`), executor caching
(`aclSetAclOpExecutorRepeatable`), D2H sync elimination, `add_rms_norm` decomposition,
and `WorkspacePool` thread-local fast path. Host dispatch dropped from ~255 µs/call to
17–57 µs/call for all cacheable operators. New operators: Cast (`aclnnCast`), Cat
(`aclnnCat` with TensorList executor caching), Linear (`aclnnAddmm`/`aclnnBaddbmm`/
`aclnnMatmul`), Mul (`aclnnMul`). Full regression: 2040 passed, 0 failed.
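The caching scheme above splits each op call into an expensive shape-dependent part (built once) and a cheap per-call pointer rebind (what `aclSetRawTensorAddr` does for ACL descriptors). A generic sketch of that split, with `ExecCache`, `CachedExec`, and the shape-string key all as hypothetical illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical sketch of descriptor/executor caching: build the expensive
// shape-dependent state once per shape signature, then on cache hits only
// rebind the raw data pointer (mirroring aclSetRawTensorAddr plus a
// repeatable executor).
struct CachedExec {
    std::vector<int64_t> shape;  // the immutable part, built once
    void* dataAddr = nullptr;    // the part rebound on every call
};

class ExecCache {
public:
    // Return the cached executor for this shape, creating it on first use.
    CachedExec& get(const std::vector<int64_t>& shape, void* addr) {
        std::string key;
        for (int64_t d : shape) key += std::to_string(d) + "x";
        auto it = cache_.find(key);
        if (it == cache_.end()) {
            ++misses_;  // only here would the real code build a descriptor
            it = cache_.emplace(key, CachedExec{shape, nullptr}).first;
        }
        it->second.dataAddr = addr;  // cheap rebind on every call
        return it->second;
    }
    int misses() const { return misses_; }

private:
    std::unordered_map<std::string, CachedExec> cache_;
    int misses_ = 0;
};
```

The per-call cost collapses to a hash lookup plus a pointer store, which is consistent with host dispatch dropping by an order of magnitude.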
Use `unique_ptr<WorkspaceArena>` in the arena map so that thread-local
cached pointers remain valid across `unordered_map` rehashes.  Remove
unused `detail::reshapeView` helper from FlashAttention.
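The `unique_ptr` indirection works because growth then moves only the handles, never the pointees. (Note that standard `std::unordered_map` already guarantees reference stability across rehash; the indirection is the usual fix for containers that relocate elements, such as open-addressing hash maps.) A sketch of the general principle, using `std::vector` as a stand-in for a relocating container, with `Arena` and `cachedPointerStable` as hypothetical names:

```cpp
#include <memory>
#include <vector>

// Containers that relocate elements on growth (vectors, open-addressing hash
// maps) invalidate pointers into their storage. Storing unique_ptr<Arena>
// instead of Arena by value keeps each arena at a fixed heap address: growth
// moves only the unique_ptr handles, never the pointees.
struct Arena { int capacity = 0; };

bool cachedPointerStable() {
    std::vector<std::unique_ptr<Arena>> arenas;
    arenas.push_back(std::make_unique<Arena>());
    Arena* cached = arenas[0].get();   // e.g. a thread-local fast path

    // Force repeated reallocation of the container's storage.
    for (int i = 0; i < 1000; ++i) arenas.push_back(std::make_unique<Arena>());

    // The cached pointer still refers to the same live arena.
    return cached == arenas[0].get();
}
```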
…tion

Normalize negative `dim` in the base class constructor (e.g. -1 → last
dimension).  Add comment in the Ascend kernel explaining why
`aclSetRawTensorAddr` on TensorList-contained descriptors is sufficient
without `aclSetInputTensorAddr`.  Add negative-dim test case.
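The negative-`dim` normalization is a one-liner; a hedged sketch of what the base-class constructor presumably does (`normalizeDim` is a hypothetical name):

```cpp
#include <stdexcept>

// Map a possibly-negative dim into [0, rank), so that -1 selects the last
// dimension, -rank the first; anything outside [-rank, rank) is an error.
int normalizeDim(int dim, int rank) {
    if (dim < -rank || dim >= rank)
        throw std::out_of_range("dim out of range");
    return dim < 0 ? dim + rank : dim;
}
```

Normalizing once in the constructor means every downstream kernel can assume a non-negative dimension index.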
@zhangyue207 zhangyue207 force-pushed the feat/ascend-framework branch from 80acc8b to 7628b2f Compare April 10, 2026 16:58
