By Qingfeng Xia
ascend-numpy architecture from top to bottom
- numpy API, in Python
- cupy._core, in Cython
- cupy.xpu: high-level backend API, in Cython
- cupy.backends.backend: abstraction of the XPU low-level backend API, in C
- cupy.backends.ascend: implementation, in Cython/C++
- Oct 12: MVP for add, cos, matmul, benchmark
- Oct 23: after the xpu refactor, benchmark.py can run the NPU tests
- Nov 08: reduction ops such as sum() are working; ~90% of the math ops supported by ACLOP have been added to numpy-ascend. UnitTest: pytest tests/cupy_tests/logic_tests/test_truth.py
- Nov 15: concatenate(), clip(), copy() and other non-math/irregular ops initially supported
  - but array() does not seem to work properly; the cause is async_copy: when two arrays are created without a print in between, the first ends up with the same values as the second (see the repro sketch after this list)
  - scalar conversion not working, probably because the exp scalar op was not registered. DONE
  - matmul(a, b) returns the same result as np.matmul(b, a); there must be a bug somewhere in the code
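A minimal repro sketch of the async_copy symptom (hypothetical code against the numpy-ascend build; the exact trigger may differ):

import cupy as cp  # the numpy-ascend build

a = cp.array([1, 2, 3])
b = cp.array([4, 5, 6])   # the second host->device copy clobbers the first
print(a)                  # expected [1 2 3]; before the fix, prints [4 5 6]
print(b)                  # [4 5 6]
# printing `a` between the two creations forces a sync and hides the bug,
# which points at the asynchronous host->device copy path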
- Dec 06: creation APIs: arange() added, but its test failed; the concatenate test passed
- Dec 13: sorting APIs sort()/argsort() added, but there is no partition() aclop; sort also depends on rollaxis()
TODO:
- creation/manipulation/indexing/linalg ops
- statistics ops: passing string arg
https://data-apis.org/array-api/latest/API_specification/index.html
https://github.com/data-apis/array-api
import numpy as np
import cupy as cp

np_arr = np.array([1, 2, 3])
xp_np = np_arr.__array_namespace__()
print(xp_np.__name__)  # typically prints 'numpy.array_api'

cp_arr = cp.array([1, 2, 3])
xp_cp = cp_arr.__array_namespace__()
print(xp_cp.__name__)  # typically prints 'cupy.array_api'
import cupy as cp
# import CuPy's Array API module directly
import cupy.array_api as cpx

# use functions from the cpx module to create arrays and run computations
x_gpu = cpx.asarray([1, 2, 3, 4], device='cuda')  # specify the device explicitly
y_gpu = cpx.reshape(x_gpu, (2, 2))
z_gpu = cpx.matmul(y_gpu, y_gpu)

import torch
import torch_npu
# import torch's numpy-compatibility module directly
import torch._numpy as cp

device = "npu"  # can also be "cuda" for torch-cuda
a_xpu = cp.asarray([1, 2, 3, 4], dtype=cp.int32).tensor.to(device)
# here _numpy wraps/proxies torch.Tensor into an ndarray class type

- not registered: einsum, cbrt (cube root, not a standard API), fix (Trunc), rint (Round), round/around, convolve (?)
- implement ourselves: radians (deg2rad), degrees (rad2deg), deg2rad, rad2deg, lcm, divmod
- missing numerical-computation ops: gradient, interp, trapezoid, diff
- missing: frexp, ldexp
- complex-number ops: angle, conj, real, complex; a few ops are missing, but they are easy to implement ourselves
- scan (numpy has no such op), true_divide
- cupy.math_op(scalar, tensor): can the aclop kernel's broadcasting handle this?
- slicing? working, but it does not use the Slice aclop
- math.scan() is a dummy/empty function; there is no such aclop
- aclop has take, put (InplacePut), slice, but no choose
There may be a lot of incompatibility; the testing workload will not be small.
- CuPy's reshape, split do not need a kernel; they are done in Cython code on the host (Reshape API)
- ACLOP has roll, permute, flip, repeat, while repeat/rollaxis() are written in Cython, no kernel needed
- cupy uses concatenate to implement vstack, stack, hstack without any CUDA kernel (_manipulation/rearange.py); see the sketch after this list
- slicing is used to implement flip, rotate
- squeeze: removes size-one axes from the shape of an array
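A rough numpy sketch of this kernel-free approach (my illustration, not cupy's actual code):

import numpy as np

def my_vstack(arrays):
    # promote 1-D inputs to row vectors, then concatenate along axis 0
    return np.concatenate([np.atleast_2d(a) for a in arrays], axis=0)

def my_stack(arrays, axis=0):
    # insert a new size-1 axis on each input, then concatenate along it
    return np.concatenate([np.expand_dims(a, axis) for a in arrays], axis=axis)

a, b = np.arange(3), np.arange(3, 6)
assert (my_vstack([a, b]) == np.vstack([a, b])).all()
assert (my_stack([a, b], axis=1) == np.stack([a, b], axis=1)).all()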
- ACLOP misses numpy ops: left_shift, right_shift
- cupy_is_close should be used as a.isclose(b)
- is_nan(): TODO
- but why does aclnnEqual have no tensor-scalar version?
- registered: median, var, mean, std, bincount, histogram (histc); the main question is how NaN is handled; partially registered
- missing: average, quantile, percentile; these vector ops should not be hard to implement
- ptp (range of values (maximum - minimum) along an axis) -> Aminmax
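The ptp mapping in numpy terms (using Aminmax is an assumption about which aclnn kernel fits):

import numpy as np

a = np.array([[4, 9, 2],
              [6, 1, 7]])
print(np.ptp(a, axis=0))              # [2 8 5]
print(a.max(axis=0) - a.min(axis=0))  # same result; one fused min/max pass suffices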
TODO: passing keyword args
unique is 1-D; the Array API standard supports only:
- unique_all
- unique_counts
- unique_inverse
- unique_values
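With numpy, a single unique() call covers all four variants; the mapping:

import numpy as np

a = np.array([1, 2, 2, 3, 3, 3])
values, indices, inverse, counts = np.unique(
    a, return_index=True, return_inverse=True, return_counts=True)
# unique_values  -> values
# unique_counts  -> (values, counts)
# unique_inverse -> (values, inverse)
# unique_all     -> (values, indices, inverse, counts)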
The AsNumpy project has implementations for:
- fmin, nanmin, min, amin
- remainder, fmod, modf
- rint, round, around
- dot, matmul, mm, gemm, inner
- fabs (real numbers only), abs
- the aclEvent mapping may have an error to fix, causing memory errors
- multi-NPU initialization is not yet designed/tested
get_default_dtype(): torch has such an API, while cupy/numpy have no equivalent; float64 is the implicit default.
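For comparison:

import numpy as np
import torch

print(torch.get_default_dtype())  # torch.float32
print(np.array([1.0]).dtype)      # float64: numpy's implicit default
print(np.ones(3).dtype)           # float64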
- most ACLOP ops do not support double or int64, while + - * / seem supported but slow; add (and all the arithmetic ops) supports double and int64 vectors, but it is slow, probably executed by AICPU
- matrix/linalg: dot/matmul support only float32, float16, bfloat; bfloat is not a standard numpy type, so it will not be supported
- numpy.int64 is long 'l' on POSIX OSes, 'q' on Windows? (see the check below)
- cupy_scalar_to_acl_scalar(_cupy_scalar s): cupy scalar operands must be of the cupy._scalar type; this may be extended to python scalars in numpy-ascend (TODO)
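A quick check of the platform-dependent int64 mapping:

import numpy as np

print(np.dtype(np.int64).char)      # 'l' on POSIX, 'q' on Windows
print(np.dtype(np.int64).itemsize)  # 8 on both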
- if two operands have different dtypes, cupy applies promote_types in ElementwiseKernel
- astype() involves a cast op (https://data-apis.org/array-api/latest/API_specification/index.html)
- a CANN aclnn op kernel can deal with broadcasting internally, just as pytorch/numpy do, while cupy handles broadcasting itself, outside the kernel
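The numpy rules that cupy mirrors, for reference:

import numpy as np

print(np.promote_types(np.float32, np.int64))  # float64
print(np.promote_types(np.float16, np.int8))   # float16
a = np.ones((3, 1))
b = np.ones((1, 4))
print((a + b).shape)  # (3, 4): size-1 dimensions are broadcast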
Currently only tensor-op-tensor is supported; some ops support tensor-op-scalar (an aclScalar, not a python double/int).
- power(scalar, tensor) is not supported; some refactoring is needed, e.g. Operand as a union of aclTensor* and aclScalar*
- inplace operators like add are working, but it is unclear whether they use the ASCEND InplaceOp or the nonInplace op; the two may differ, and passing the same memory address as both self and out to an op may lead to errors
- creation/manipulation/indexing and general ops are not registered, not tested
- masked tensor/ndarray: possible using kwargs, using aclnn ops
- scalar op scalar: do numpy/cupy even support this? (see the check below)
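A quick check with numpy; the cupy behavior noted below is my understanding of upstream:

import numpy as np

r = np.add(2, 3)
print(r, type(r))  # 5 <class 'numpy.int64'>: numpy accepts two python scalars
# upstream cupy ufuncs require at least one cupy.ndarray operand
# (cupy.add(2, 3) raises TypeError), so numpy-ascend could keep that rule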
=====================================================
Developing without an NPU: you must comment out initialize_backend(0) in runtime.pyx, otherwise cupy cannot be imported.
Ubuntu 22.04 still seems to be the recommended platform in 2025; the main issue is that Python 3.12 is unsupported, but a Python 3.10 installed via conda works, and CANN can then be installed the same way.
Python 3.12 is not supported on CANN 8.2 RC, so install miniconda with Python 3.10:
[Toolkit] [20250912-21:52:11] [ERROR] There is no python3.7,python3.8,python3.9,python3.10,python3.11 in the current environment !
dpkg: error processing package ascend-cann-toolkit (--configure):
installed ascend-cann-toolkit package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
ascend-cann-toolkit
During the development phase, there is no plan to support Windows or conda; only pip is supported.
ImportError: libstdc++.so.6: version `GLIBCXX_3.4.32' not found
ldd does not help here.
strings /home/qingfeng/miniconda3/bin/../lib/libstdc++.so.6 | grep GLIBCXX_3.4
# the systemwide version is high enough; g++ compiles against this version
strings /lib/x86_64-linux-gnu/libstdc++.so.6 | grep GLIBCXX_3.4
# while at runtime python uses miniconda's libstdc++, so point it at the system one
ln -s /lib/x86_64-linux-gnu/libstdc++.so.6 /home/qingfeng/miniconda3/lib/libstdc++.so.6
Install VS Code extensions:
- Python C++ Debugger (mixed-mode debugging; not sure it works for Cython)
- Cython: Cython syntax highlighting
- vscode: a CANN debugger is under way
# c++ basic dev environment
apt-get install -y gcc g++ make cmake libsqlite3-dev zlib1g-dev libssl-dev libffi-dev net-tools
# python dependencies
pip3 install attrs cython numpy==1.24 decorator sympy cffi pyyaml pathlib2 psutil protobuf==3.20 scipy requests absl-py cython==3.1
# this package is needed but not documented
pip3 install fastrlock
Cython 3.0 is not high enough; use cython 3.1 for automatic string conversion.
The CANN community edition is the early-access release with more new features.
Install into the user's HOME; no root permission is needed. To actually run and benchmark, you need the matching Ascend hardware.
# install driver, skip here
./Downloads/Ascend-cann-toolkit_8.2.RC1_linux-x86_64.run --install
# add set_env.sh into ~/.bashrc
./Downloads/Ascend-cann-nnal_8.2.RC1_linux-x86_64.run --install
# add set_env.sh into ~/.bashrc
# install kernel, skip here
After cann-toolkit installs successfully, remember to source set_env.sh. If two CANN installations exist (root and non-root), the subsequent nnal install will fail.
Installing the driver requires root permission. Because there is no NPU, or perhaps because Ubuntu 24.04 is an unsupported OS, sudo dpkg -i *.deb failed.
So I used dpkg -x *.deb to unpack the driver, copied the driver directory inside it to $HOME/Ascend, and set the environment in .bashrc:
# emulate driver/set_env.sh
export LD_LIBRARY_PATH=$HOME/Ascend/driver/lib64/driver:$HOME/Ascend/driver/lib64/common:$LD_LIBRARY_PATH
This machine has no Ascend card, yet the driver kernel libraries must still be installed; otherwise cupy cannot be imported to test whether the Cython-built .so files are loadable.
CANN 8.2RC1 (the latest stable version is recommended; only 8.2 has FFT and the AsdSip BLAS functions):

| Package | Role |
|---|---|
| Ascend-cann-toolkit_8.2.RC1_linux-x86_64.run | runtime |
| Ascend-cann-nnal_8.2.RC1_linux-x86_64.run | acceleration library (BLAS) |
After nnal installs successfully, remember to source set_env.sh:
If you want to use asdsip module:
- To take effect for current user, you can exec command below: source /home/qingfeng/Ascend/nnal/asdsip/set_env.sh or add "source /home/qingfeng/Ascend/nnal/asdsip/set_env.sh" to ~/.bashrc.
# the torch-npu install guide on gitee depends on torch cpu 2.6.0
# https://pytorch.org/get-started/locally/ has a detailed guide
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu
pip3 install torch-npu==2.6.0
pip3 install triton-ascend
Probably because the ascend driver is not installed, import torch fails with this error:
ImportError: libascend_hal.so: cannot open shared object file: No such file or directory, You can disable extension auto-loading with TORCH_DEVICE_BACKEND_AUTOLOAD=0.
ModelArts EulerOS (corresponding to OpenEuler 20.03) in docker, CANN 8.2, python 3.9 (Huawei ModelArts server with 4x 910B).
Next: further test the install script and the benchmark speedup.
For vector ops, double is supported, possibly via AICPU, so it is very slow, even slower than CPU, but it keeps data in device memory.
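A rough way to see this, assuming the numpy-ascend build of cupy (this bench() helper and the asnumpy-based sync are my sketch, not benchmark.py itself):

import time
import cupy as cp  # the numpy-ascend build

def bench(dtype, n=1_000_000, repeat=10):
    a = cp.ones(n, dtype=dtype)
    b = cp.ones(n, dtype=dtype)
    t0 = time.perf_counter()
    for _ in range(repeat):
        c = a + b
    cp.asnumpy(c)  # copy back to host, which forces a device sync
    return time.perf_counter() - t0

print("float32:", bench(cp.float32))
print("float64:", bench(cp.float64))  # expect much slower if routed to AICPU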
Early development/testing stage: it is recommended to clone and modify the code.
git clone [email protected]:qingfengxia/numpy-ascend.git
git checkout ascend
Build from source (assuming CANN 8.2 is already installed):
cd cupy-ascend
export CUPY_INSTALL_USE_ASCEND=1 # corresponds to CUPY_USE_ASCEND in the C code; selects the ascend backend at compile time
#export ASCEND_TOOLKIT_HOME=/home/qingfeng/Ascend/ascend-toolkit/latest
#export PATH=$ASCEND_TOOLKIT_HOME/bin:$PATH
which bisheng
# cython --inplace for gdb debugging
clear && export CUPY_INSTALL_USE_ASCEND=1 && python setup.py develop && python -c "import cupy._core"
python -c "import cupy._core" # to test if it is importable without installation
clear && export CUPY_INSTALL_USE_ASCEND=1 && python setup.py develop && python benchmark.py
If you modify a .h file but no .pyx file, the rebuild may not be triggered; in that case run clean_cpp_so_files.sh for a full clean.
# stage 2: once testing is reasonably stable, pull the code directly with git and build a binary wheel
pip install git+https://github.com/qingfengxia/cupy-ascend.git
# stage 3: once large-scale tests pass, a binary pip package will be published
pip install numpy-ascend
# pip install cupy-cuda12x
After about 100 hours of development, the MVP (minimum viable product) was reached: matmul, cos, add were tested and achieved very considerable speedups on the 910B, tens of times up to a hundred times; see benchmark.py.
There is still a great deal of work left, estimated at one person-year; you are welcome to join the testing and development.
See TODO.md for the list of tasks.
Many more operators still need to be added; see the commit below for the fixed template for adding ASCEND operators to numpy-ascend. Testing is welcome. https://github.com/qingfengxia/numpy-ascend/commit/863e0ff4c07994a45a204b8032db7c3da17f6c90
After adding code, run the following command; if it compiles and imports, the addition succeeded.
export CUPY_INSTALL_USE_ASCEND=1 && python setup.py develop && python -c "import cupy._core"
Only floating-point numbers can represent NaN (like inf, it is a special float value):
- fmin: element-wise minimum of two arrays, ignoring NaN. binary op
- minimum: element-wise minimum of two arrays, propagating NaN. binary op
- amin: minimum of an array along a given axis, propagating NaN. reduction op
- nanmin: minimum of an array along a given axis, ignoring NaN. reduction op
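The four behaviors, demonstrated with numpy:

import numpy as np

a = np.array([1.0, np.nan, 3.0])
b = np.array([2.0, 2.0, np.nan])
print(np.fmin(a, b))     # [1. 2. 3.]    -- ignores NaN
print(np.minimum(a, b))  # [ 1. nan nan] -- propagates NaN
print(np.amin(a))        # nan -- propagates
print(np.nanmin(a))      # 1.0 -- ignores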
This has been submitted upstream as a refactor MR, but I had some build problems and it has not been accepted yet.
NVIDIA end-user license agreement portal: https://www.nvidia.com/en-us/about-nvidia/eula-agreement/
NVIDIA's EULA does not forbid recompiling CUDA source code.
- Legal route: technologies like AMD's HIP (the HIPIFY tool) and Intel's SYCL (the SYCLomatic tool) work by converting CUDA source code into code for another, compatible programming model, then compiling it with the target platform's own compiler and toolchain. This process does not reverse-engineer or decompile the output of the CUDA SDK, so it is compliant.
- Key distinction: the difference between "translation" and "porting". The EULA forbids direct translation of compiled binary/PTX code, but it does not forbid converting and recompiling source code.
Emulating the CUDA runtime API may not violate the EULA, but there is no need to take on the future legal risk; neutralize the naming directly.
- cupy.cuda -> cupy.xpu; XPU refers generically to any compute accelerator other than the CPU
- cupy_backends.cuda -> cupy.backends.backend; it is called "backends" to stay consistent with torch
- xpuXXX as the abstract runtime API
Meanwhile, keep communicating with the CuPy authors: confirm whether CuPy holds trademarks/copyright and whether other XPUs have already been licensed to use it.
The architecture neutralization can also be discussed with the cupy authors, to see whether this kind of refactoring would be accepted upstream. NVIDIA is working on its own official pynumeric, so the future of the community-driven cupy is uncertain.
A python script (api_replace_tool.py) handles this renaming. After the rewrite, the cuda backend still needs work (cuda api -> xpu api) and cannot compile for now, so I put it on a new branch, xpu, for development.
- cudaDataType -> xpuDataType (this is a typedef)
- cuDoubleComplex -> xpuComplex128 (numpy and torch use this naming style)
- enum xpuFunction_attribute
- cuGetErrorString ->
// Context
// stub: report success so callers can proceed without a real BLAS context
xpuBlasStatus cublasCreate(...) {
    return CUBLAS_STATUS_SUCCESS;
}