poyea/lollipop

lollipop

Sweet GPU compute kernels in CUDA, wrapped in Python via CuPy.

uv sync && uv pip install -e . && python examples/mandelbrot.py

You need CUDA Toolkit 11.8 (newer versions may not work) and an NVIDIA GPU. The HMMA kernels require sm_75 or higher, i.e. Turing or anything newer. CuPy's bundled NVRTC compiles each kernel at first use, picking up mma.h and related headers from CUDA_PATH.
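To illustrate the compile-at-first-use behaviour, here is a minimal sketch of wrapping a CUDA kernel with CuPy's `RawKernel`. The `scale` kernel, `launch_dims`, and `scale_on_gpu` are hypothetical names made up for this example; they are not part of lollipop, and the repository's own wrappers may be structured differently.

```python
import math

# Hypothetical kernel source for illustration only; not taken from lollipop.
SCALE_SRC = r"""
extern "C" __global__
void scale(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}
"""


def launch_dims(n, block=256):
    """Return (grid, block) tuples giving one thread per element."""
    return (max(1, math.ceil(n / block)),), (block,)


def scale_on_gpu(values, a):
    """Compute a * values on the GPU and return a NumPy array."""
    # Imported lazily so this module still loads without CuPy or a GPU.
    import cupy as cp

    # Creating the RawKernel is cheap; NVRTC compiles it on the first launch.
    kernel = cp.RawKernel(SCALE_SRC, "scale")
    x = cp.asarray(values, dtype=cp.float32)
    y = cp.empty_like(x)
    grid, block = launch_dims(x.size)
    # Scalars are passed as sized NumPy/CuPy types matching the C signature.
    kernel(grid, block, (x, y, cp.float32(a), cp.int32(x.size)))
    return cp.asnumpy(y)
```

On a machine with a working CUDA setup, `scale_on_gpu([1.0, 2.0, 3.0], 2.0)` should pay the compilation cost on its first call only.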

Example Kernels

| Kernel | What it does |
| --- | --- |
| reduction_v2 | sum-reduce a 1D float array |
| reduction_cg | same sum-reduce via Cooperative Groups cg::reduce |
| prefix_sum | device-wide exclusive scan, hierarchical Blelloch |
| radix_sort | LSD radix sort of uint32 keys, multi-block |
| matrix_transpose | 2D fp32 transpose, 32×33 padded smem tile |
| softmax_vec4 | row-wise softmax with float4 loads |
| flash_attention_hmma | FA-2 forward, fp16 in / fp32 accum, wmma 16×16×16 |
| gemm_tiled | dense fp32 GEMM, 128×128 macro / 8×8 register micro, manual smem double-buffer |
| gemm_int8 | W8A8 INT8 GEMM, per-row act scale × per-channel weight scale |
| gemm_int4 | W4A16 weight-only (AWQ/GPTQ-shaped), G=64 asymmetric, dequant-fuse-matmul |
| fused_ffn_tail | RMSNorm → ×γ → +bias → GELU/SiLU → +residual, one kernel |
| rope | rotary positional embedding, Llama half-rotation (pair separation D/2), in-place safe |
| rmsnorm | RMSNorm forward + backward, per-row fused reductions, dgamma in fp32 accum |
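As a reading aid for the prefix_sum row, the following is a CPU-only sketch of a Blelloch exclusive scan and of one common way to make it hierarchical (scan each block, scan the block totals, add the offsets back). It is a reference model under assumptions: the function names and the block size are hypothetical, and the actual CUDA kernel in the repository may organise the passes differently.

```python
def blelloch_exclusive_scan(values):
    """Exclusive prefix sum using Blelloch's up-sweep / down-sweep."""
    n = len(values)
    size = 1
    while size < n:          # pad to the next power of two
        size *= 2
    a = list(values) + [0] * (size - n)

    # Up-sweep: build partial sums in place, as in a reduction tree.
    stride = 1
    while stride < size:
        for i in range(2 * stride - 1, size, 2 * stride):
            a[i] += a[i - stride]
        stride *= 2

    # Down-sweep: clear the root, then push prefixes back down the tree.
    a[size - 1] = 0
    stride = size // 2
    while stride >= 1:
        for i in range(2 * stride - 1, size, 2 * stride):
            left = a[i - stride]
            a[i - stride] = a[i]
            a[i] += left
        stride //= 2
    return a[:n]


def hierarchical_exclusive_scan(values, block=4):
    """Scan per block, scan the block totals, then add per-block offsets.

    `block` stands in for the number of elements one thread block handles
    (an assumption for illustration).
    """
    blocks = [values[i:i + block] for i in range(0, len(values), block)]
    local = [blelloch_exclusive_scan(b) for b in blocks]
    totals = [sum(b) for b in blocks]
    offsets = blelloch_exclusive_scan(totals)
    return [x + off for scanned, off in zip(local, offsets) for x in scanned]
```

A reference like this can be handy for checking GPU output on small inputs, since both functions should agree with a naive running sum.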

License

MIT
