Feat/pretransposed states #36

Open

higgsboson1710 wants to merge 2 commits into inclusionAI:main from higgsboson1710:feat/pretransposed-states

Conversation

@higgsboson1710

This PR implements the pre-transposed BHVK state layout optimization.

Updated the core C++/CUDA kernel and Python API to natively handle the BHVK layout.

Updated tests/test_lightning_attn.py and tests/test_la_decode.py to match the new layout.

Added an end-to-end prefill → decode test to verify that the state passes through directly, without manual transposes.


@gemini-code-assist (bot) left a comment


Code Review

This pull request transitions the attention state layout from Column-Major to Row-Major (BHVK) within the lightning attention kernels and updates the associated loading and storing logic. The review feedback highlights critical concerns regarding the use of non-contiguous transposed tensors with kernels that assume fixed memory layouts, which could lead to silent data corruption. The reviewer recommends using contiguous allocations, simplifying the test suite by removing redundant transpose operations, and correcting minor indentation inconsistencies.

  # Allocate state pool if not provided
  if state_pool is None:
-     state_pool = torch.zeros(N, H, D, D, dtype=torch.float32, device=Q.device)
+     state_pool = torch.zeros(N, H, D, D, dtype=torch.float32, device=Q.device).transpose(-1,-2)

Severity: high

This in-place update on a non-contiguous state_pool is highly problematic. Because the kernel operates on the raw pointer using a fixed layout, it will write data to memory locations that do not align with the non-contiguous strides of the transposed PyTorch view. This will lead to silent data corruption or unexpected results when the state_pool is accessed later in Python. The state pool should be allocated as a contiguous tensor in the layout the kernel expects.

Suggested change
- state_pool = torch.zeros(N, H, D, D, dtype=torch.float32, device=Q.device).transpose(-1,-2)
+ state_pool = torch.zeros(N, H, D, D, dtype=torch.float32, device=Q.device)
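To see why the transposed view is risky, here is a small NumPy sketch of the stride mechanics (NumPy shares PyTorch's stride semantics; NumPy is used only to keep the illustration CPU-only):

```python
import numpy as np

D = 4
pool = np.zeros((D, D), dtype=np.float32)  # contiguous, row-major allocation
view = pool.T                              # same storage, strides swapped -- no data moves

print(pool.flags['C_CONTIGUOUS'], pool.strides)  # True (16, 4)
print(view.flags['C_CONTIGUOUS'], view.strides)  # False (4, 16)

# A kernel that takes the raw pointer and assumes a fixed row-major layout
# (stride D, 1) writes the buffer's rows; reading those writes back through
# `view`, whose strides say "columns first", silently scrambles them.
```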

Copy link
Copy Markdown
Collaborator


We assume the input and output states are all pre-transposed, so we do not need to transpose the input states explicitly.

Comment on lines +413 to +415
fstate_layout = cute.make_layout((D, D, (H, B)),
stride=(D, 1, (D * D, D * D * H)),
)

Severity: medium

The indentation of the stride argument and the closing parenthesis for make_layout is inconsistent with the surrounding code. It is better to keep the arguments aligned for readability.

        fstate_layout = cute.make_layout((D, D, (H, B)),
                                         stride=(D, 1, (D * D, D * D * H)))
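For reference, the stride arithmetic in this layout addresses memory exactly like a C-contiguous (B, H, D, D) tensor; a quick pure-Python check of that claim (a sketch, independent of CuTe):

```python
# Check that a layout of shape (D, D, (H, B)) with strides
# (D, 1, (D*D, D*D*H)) -- as built by cute.make_layout above --
# produces the same linear offsets as a C-contiguous (B, H, D, D) tensor.
D, H, B = 4, 2, 3

def cute_offset(i, j, h, b):
    # coordinate (i, j, (h, b)) under strides (D, 1, (D*D, D*D*H))
    return i * D + j * 1 + h * (D * D) + b * (D * D * H)

def contiguous_offset(b, h, i, j):
    # C-contiguous (B, H, D, D): strides (H*D*D, D*D, D, 1)
    return b * (H * D * D) + h * (D * D) + i * D + j

assert all(
    cute_offset(i, j, h, b) == contiguous_offset(b, h, i, j)
    for b in range(B) for h in range(H)
    for i in range(D) for j in range(D)
)
```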


  if output_final_state:
-     ht = torch.zeros(B, H, D, D, dtype=torch.float32, device=Q.device)
+     ht = torch.zeros(B, H, D, D, dtype=torch.float32, device=Q.device).transpose(-1,-2)

Severity: medium

Applying .transpose(-1, -2) to the newly allocated ht tensor creates a non-contiguous view. Since the kernel (at line 417/424) wraps the raw pointer in a fixed fstate_layout that assumes a specific physical memory ordering (Row-Major with strides D, 1), passing a non-contiguous tensor is dangerous. If the kernel writes to the pointer in Row-Major order, the resulting ht tensor in Python will appear transposed logically. It would be much cleaner to allocate the tensor with the correct shape and ensure it is contiguous, avoiding reliance on the kernel ignoring PyTorch strides.

Suggested change
- ht = torch.zeros(B, H, D, D, dtype=torch.float32, device=Q.device).transpose(-1,-2)
+ ht = torch.zeros(B, H, D, D, dtype=torch.float32, device=Q.device)
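The "appears transposed logically" effect can be illustrated with a small NumPy sketch, where the row-major write through the raw buffer stands in for the CUDA kernel (the buffer name and simulated write are hypothetical):

```python
import numpy as np

D = 3
buf = np.zeros((D, D), dtype=np.float32)  # the contiguous allocation the kernel sees
ht = buf.T                                # the non-contiguous view handed back to Python

# Simulate the kernel writing row-major through the raw pointer,
# ignoring the view's strides entirely:
buf.reshape(-1)[:] = np.arange(D * D)

# What the kernel wrote as its first row surfaces as ht's first column:
print(ht[:, 0])  # [0. 1. 2.] -- kernel row 0
print(ht[0])     # [0. 3. 6.] -- kernel column 0
```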

Collaborator


@higgsboson1710 since ht is a zero-initialized tensor, we do not need the transpose.

  k = torch.randn(B, H, D, device=device, dtype=torch.bfloat16)
  v = torch.randn(B, H, D, device=device, dtype=torch.bfloat16)
- state = torch.randn(B, H, D, D, device=device, dtype=torch.float32) * 0.01
+ state = torch.randn(B, H, D, D, device=device, dtype=torch.float32).transpose(-1, -2).contiguous().transpose(-1, -2) * 0.01

Severity: medium

The double-transpose transpose(-1, -2).contiguous().transpose(-1, -2) creates a non-contiguous tensor. However, in run_la_decode (line 87), this tensor is immediately cloned, which produces a contiguous version and negates the effect of the previous layout manipulation. This adds unnecessary complexity to the test setup. If the goal is to test the kernel with BHVK layout, simply providing a standard contiguous tensor of the appropriate shape is sufficient and clearer.

Suggested change
- state = torch.randn(B, H, D, D, device=device, dtype=torch.float32).transpose(-1, -2).contiguous().transpose(-1, -2) * 0.01
+ state = torch.randn(B, H, D, D, device=device, dtype=torch.float32) * 0.01
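A NumPy sketch of why this pattern is fragile: the double transpose yields a non-contiguous array with unchanged values, and a default (C-ordered) copy silently discards the layout manipulation again. (NumPy's `copy()` defaults to C order, analogous to the clone the review mentions; PyTorch's `clone()` semantics differ in detail.)

```python
import numpy as np

B, H, D = 2, 2, 3
state = np.random.randn(B, H, D, D).astype(np.float32)

# The double-transpose pattern: physically column-major in the last two
# dims, but logically the same values as `state`.
nc = np.ascontiguousarray(state.transpose(0, 1, 3, 2)).transpose(0, 1, 3, 2)
print(nc.flags['C_CONTIGUOUS'])   # False
print(np.array_equal(nc, state))  # True -- values unchanged

# A C-ordered copy (NumPy's default) restores contiguity, so any code
# that copies the tensor downstream undoes the layout manipulation:
restored = nc.copy()
print(restored.flags['C_CONTIGUOUS'])  # True
```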

Collaborator


As discussed above, I think you could just remove the transpose here.

  V = torch.randn(B, S, H, D, device="cuda", dtype=torch.bfloat16) * 0.1
  decay = torch.full((H,), decay_val, device="cuda", dtype=torch.float32)
- h0 = torch.randn(B, H, D, D, device="cuda", dtype=torch.float32) * 0.01
+ h0 = torch.randn(B, H, D, D, device="cuda", dtype=torch.float32).transpose(-1, -2).contiguous().transpose(-1, -2) * 0.01

Severity: medium

The use of the double-transpose pattern here and in subsequent tests (lines 398, 528, 592) to create non-contiguous inputs is confusing. Given that the kernel ignores input strides and uses a fixed Row-Major layout, passing a physically Column-Major tensor (which this pattern produces) will result in the kernel seeing a transposed matrix. It is recommended to use standard contiguous tensors to ensure the layout is handled predictably.

Suggested change
- h0 = torch.randn(B, H, D, D, device="cuda", dtype=torch.float32).transpose(-1, -2).contiguous().transpose(-1, -2) * 0.01
+ h0 = torch.randn(B, H, D, D, device="cuda", dtype=torch.float32) * 0.01

@higgsboson1710
Author

"Hi @icavan, I've completed the roadmap. I see the bot is flagging the non-contiguous state pool allocations and the double-transpose pattern in the tests. I used these to ensure the memory matches the new BHVK layout, but let me know if you'd prefer I refactor these to standard contiguous allocations to satisfy the linter/bot."


  gCol_ht = cute.make_tensor(gState_ht.iterator + local_tidx * _D, cute.make_layout(_D, stride=1))
  out_flat = cute.make_tensor(tTR_rKV.iterator, layout=cute.make_layout(_D))
- cute.autovec_copy(out_flat, gRow_ht)
+ cute.autovec_copy(out_flat, gCol_ht)
Collaborator


We need to track the performance change here. Could you share the results of bench_lightning_attn.py?
