
[CUDA] implement Hadamard transform #3179

Open

Lyxot wants to merge 3 commits into ml-explore:main from Lyxot:cuda/hadamard

Conversation


@Lyxot Lyxot commented Feb 27, 2026

Proposed changes

This PR adds CUDA support for the Hadamard transform (mx.hadamard_transform), using the same staged decomposition strategy as the Metal backend.
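For readers unfamiliar with the operation, a minimal pure-Python fast Walsh–Hadamard sketch (reference only, not the CUDA code in this PR) shows the butterfly structure for the power-of-two case:

```python
# Minimal in-place fast Walsh-Hadamard transform for power-of-two n.
# Reference sketch only: the PR implements this on the GPU with staged
# kernels; this just illustrates the O(n log n) butterfly structure.
def fwht(x):
    x = list(x)
    h = 1
    while h < len(x):
        for i in range(0, len(x), h * 2):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x

print(fwht([1.0, 0.0, 0.0, 0.0]))  # first basis vector -> [1.0, 1.0, 1.0, 1.0]
```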

Changed files

  • mlx/backend/cuda/hadamard.cu: implemented CUDA Hadamard::eval_gpu and JIT launch flow (n1/n2/m staged execution), reusing decompose_hadamard(...).
  • mlx/backend/cuda/device/hadamard.cuh: added JIT device kernels hadamard_n<...> and hadamard_m<...> plus radix helpers.
  • python/tests/cuda_skip.py: removed the CUDA skip entries for test_hadamard and test_hadamard_grad_vmap.

Validation

  • python -m pytest python/tests/test_ops.py -k test_hadamard -q passed.
  • python -m pytest python/tests/test_ops.py -k test_hadamard_grad_vmap -q passed.

Checklist

Put an x in the boxes that apply.

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Copilot AI review requested due to automatic review settings February 27, 2026 02:48

Copilot AI left a comment


Pull request overview

This PR implements CUDA support for the Hadamard transform (mx.hadamard_transform), following the same staged decomposition strategy as the Metal backend. The implementation decomposes the Hadamard transform into three stages (n1, n2, and m) to efficiently handle large transforms while respecting GPU memory constraints.
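As a sanity check on the staging idea (an illustration, not code from this PR): applying a Hadamard stage along one axis and then the other is equivalent to multiplying by the Kronecker product H_n1 ⊗ H_n2, since Sylvester's construction gives H_(2n) = H_2 ⊗ H_n. A small NumPy sketch with arbitrarily chosen sizes:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction for power-of-two n.
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

x = np.random.default_rng(0).standard_normal(8)

full = hadamard(8) @ x           # one-shot length-8 transform
staged = x.reshape(2, 4)         # view as n1 x n2 (row-major)
staged = staged @ hadamard(4).T  # stage 1: transform along n2
staged = hadamard(2) @ staged    # stage 2: transform along n1
staged = staged.reshape(8)

assert np.allclose(full, staged)  # H_8 == H_2 (x) H_4 applied in stages
```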

Changes:

  • Added CUDA kernel implementation for Hadamard transform with JIT compilation support
  • Enabled CUDA tests by removing skip entries for test_hadamard and test_hadamard_grad_vmap
  • Integrated the implementation into the CUDA backend build system

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • mlx/backend/cuda/hadamard.cu: Implements the main CUDA evaluation logic with staged kernel launches and JIT code generation for non-power-of-two radices
  • mlx/backend/cuda/device/hadamard.cuh: Provides device-side kernel templates for n-stage and m-stage transforms with vectorized memory access
  • mlx/backend/cuda/primitives.cpp: Removes NO_GPU(Hadamard) to enable the GPU evaluation path
  • mlx/backend/cuda/jit_module.cpp: Registers the hadamard.cuh header for JIT compilation
  • mlx/backend/cuda/CMakeLists.txt: Adds hadamard.cu to the build sources
  • python/tests/cuda_skip.py: Removes CUDA skip entries to enable the Hadamard tests


@Lyxot Lyxot requested a review from zcbenz February 27, 2026 09:51
@nastya236 nastya236 self-requested a review February 27, 2026 11:34
@nastya236 nastya236 (Collaborator) commented Feb 27, 2026

Looks great, thanks for the contribution.

Could you please share bandwidth numbers for the proposed kernel across a range of shapes? I'm also particularly interested in the case where the Hadamard transform is applied to tiled inputs with N=16 or N=32. Something like:

x = mx.random.uniform(shape=(4096, 4096))
mx.hadamard_transform(x.reshape(4096, 4096 // N, N))

where N=16 or N=32.
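For anyone reproducing such numbers, one way to turn a measured time into an achieved-bandwidth figure is the following sketch; the read-plus-write traffic model, float32 dtype, and 0.5 ms timing are assumptions for illustration, not measurements from this PR:

```python
# Hypothetical helper for converting a measured kernel time into
# achieved bandwidth. Assumes each element is read once and written
# once; dtype size and elapsed time below are illustrative only.
def achieved_bandwidth_gbps(num_elements, itemsize_bytes, elapsed_s):
    bytes_moved = 2 * num_elements * itemsize_bytes  # one read + one write
    return bytes_moved / elapsed_s / 1e9

# e.g. a 4096 x 4096 float32 tensor transformed in 0.5 ms:
print(achieved_bandwidth_gbps(4096 * 4096, 4, 0.5e-3))  # ~268.4 GB/s
```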
