feat: BHVK (K-last) state layout for Lightning Attention prefill & decode #56

Merged

icavan merged 10 commits into main from feat/vk_states on Apr 22, 2026
Conversation

@icavan (Collaborator) commented Apr 19, 2026

NOTE: based on the code base from #36 by higgsboson1710.

Motivation

The decode kernel (la_decode) uses BHVK state layout (B, H, V, K) where K is contiguous, while the prefill kernel (lightning_attn) previously used BHKV (B, H, K, V). This mismatch required a transpose between prefill and decode, adding latency on the critical serving path.

This PR unifies both kernels on the BHVK layout so that prefill output state flows directly into decode without any transpose.
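The layout difference can be pictured with a short NumPy sketch (illustrative shapes only; NumPy stands in for the torch tensors the kernels actually exchange):

```python
import numpy as np

# Illustrative sizes only, not the kernel's actual tile configuration.
B, H, K, V = 1, 2, 128, 128

# Old flow: prefill emitted BHKV (V contiguous) while decode expects
# BHVK (K contiguous), so a transpose + copy sat on the serving path.
ht_bhkv = np.zeros((B, H, K, V), dtype=np.float32)
ht_for_decode = np.ascontiguousarray(np.swapaxes(ht_bhkv, -1, -2))  # extra copy

# New flow: prefill emits BHVK directly, already contiguous in K,
# so the state can be handed to decode as-is.
ht_bhvk = np.zeros((B, H, V, K), dtype=np.float32)
assert ht_bhvk.shape == ht_for_decode.shape == (B, H, V, K)
assert ht_bhvk.strides[-1] == ht_bhvk.itemsize  # K is the fastest-varying axis
```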

Changes

Kernel (cula/ops/lightning_attn.py)

  • Changed fstate_layout stride from (1, D, ...) (K-major, BHKV) to (D, 1, ...) (K-contiguous, BHVK)
  • Implemented SMEM-mediated cooperative state load/store to maintain coalesced GMEM access despite the VK memory layout:
    • 128 CE threads cooperatively load/store states in row-major order (coalesced 128-bit GMEM transactions)
    • 2 strips × 64 V-rows processed sequentially through a 33KB padded SMEM buffer
    • SMEM row stride = D + 4 (132 for D=128) eliminates bank conflicts (stride mod 32 = 4)
    • Added sStateBuf field to SharedStorage struct (~33KB, within 228KB SMEM budget at occ=1)
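The padding arithmetic behind the D + 4 row stride can be checked in a few lines (assuming 32 four-byte SMEM banks and 128-bit accesses serviced in phases of 8 threads; a sketch of the argument, not kernel code):

```python
# Sketch of the bank-conflict argument behind the D + 4 SMEM row stride
# (assumes 32 four-byte banks and 128-bit accesses split into phases of
# 8 threads; illustration of the arithmetic, not kernel code).
D = 128
WORDS_PER_VEC = 4  # one 128-bit access spans 4 consecutive 32-bit banks

def start_banks(stride, rows):
    """Bank holding the first word of each row's 128-bit access in column 0."""
    return [(r * stride) % 32 for r in range(rows)]

# Unpadded stride 128: every row of a column starts at bank 0, so a
# column walk serializes completely.
assert set(start_banks(D, 8)) == {0}

# Padded stride 132 (132 mod 32 = 4): the 8 rows of one phase start at
# banks 0, 4, 8, ..., 28; each 4-wide vector covers a disjoint bank
# group, so one phase touches all 32 banks exactly once.
starts = start_banks(D + 4, 8)
covered = sorted(b for s in starts for b in range(s, s + WORDS_PER_VEC))
assert covered == list(range(32))
```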

Tests (tests/test_lightning_attn.py, tests/test_la_decode.py)

  • All state inputs converted BHKV→BHVK via .transpose(-1, -2).contiguous() before passing to CuTe kernel
  • All state outputs converted BHVK→BHKV via .transpose(-1, -2) before comparing with FLA/PyTorch reference
  • Added test_prefill_decode_e2e: verifies prefill ht passes directly into decode without manual transpose
  • API docstrings updated to document BHVK layout requirement
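A minimal sketch of the conversion shim the tests wrap around the kernel (NumPy standing in for the torch `.transpose(-1, -2).contiguous()` calls; names are illustrative):

```python
import numpy as np

def to_bhvk(state_bhkv):
    """BHKV -> BHVK, materialized so K becomes the contiguous axis."""
    return np.ascontiguousarray(np.swapaxes(state_bhkv, -1, -2))

def to_bhkv(state_bhvk):
    """BHVK -> BHKV view for comparison against the reference."""
    return np.swapaxes(state_bhvk, -1, -2)

# The round-trip is exact: the layout shim itself adds no precision error.
h0 = np.random.default_rng(0).standard_normal((1, 2, 64, 64)).astype(np.float32)
assert np.array_equal(to_bhkv(to_bhvk(h0)), h0)
```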

Benchmarks (benchmarks/bench_lightning_attn.py)

  • h0_cute converted to BHVK before timing; ht converted back to BHKV for accuracy comparison
  • Transpose is outside time_fn — not included in reported timings

Performance (vs FLA Triton baseline, GB200)

Mode                         Avg Speedup   Min     Max
h0_ht (prefill with state)   1.33x         0.90x   1.88x
varlen (persistent)          1.44x         0.83x   2.05x
no_state (prefill only)      1.50x         1.12x   1.89x

The no_state mode is unaffected by this change (no state load/store). The h0_ht and varlen modes show ~1-5% overhead vs the previous BHKV kernel (from the SMEM transpose), which is acceptable given the elimination of the prefill→decode transpose on the serving path.

Performance Evolution (vs origin/main BHKV baseline)

Mode      origin/main (BHKV)   VK version 1 (no SMEM)   Final (SMEM-mediated)   Recovery
h0_ht     avg 1.32x            avg 1.24x (-6%)          avg 1.33x (+0.8%)       fully recovered
varlen    avg 1.54x            avg 1.12x (-27%)         avg 1.44x (-6.5%)       mostly recovered
no_state  avg 1.49x            avg 1.50x (flat)         avg 1.50x (flat)        unaffected

The naive VK layout change caused severe varlen regression (-27%) due to uncoalesced column-strided GMEM accesses (each thread writing stride-D elements). The SMEM cooperative transpose recovers most of the loss:

  • h0_ht: fully recovered (+0.8% vs BHKV, within noise)
  • varlen: recovered from -27% to -6.5%, remaining gap from SMEM transpose overhead in the persistent loop
  • no_state: unchanged (no state I/O)
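The coalescing difference can be seen by listing the byte addresses one warp touches under each access pattern (4-byte elements, D = 128; illustration only):

```python
# Why the naive VK-layout store was uncoalesced: byte addresses touched
# by one warp (32 threads, 4-byte elements, D = 128; illustration only).
D, ELEM = 128, 4

# Naive column walk: thread t writes element t * D, so consecutive
# threads are 512 bytes apart and the warp spans 32 distinct 128-byte
# segments (one GMEM transaction each).
naive = [t * D * ELEM for t in range(32)]
assert len({addr // 128 for addr in naive}) == 32

# Cooperative row-major store via SMEM: thread t writes element t, so
# the warp's accesses fall in a single 128-byte segment (fully coalesced).
row_major = [t * ELEM for t in range(32)]
assert len({addr // 128 for addr in row_major}) == 1
```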

Precision

RMSE matches FLA exactly — no precision loss from the layout change:

  • Output O RMSE: 0.2347% (identical to FLA)
  • State Ht RMSE: 0.0198% (identical to FLA)
  • la_decode: 20/20 tests pass, state relative RMSE < 0.1%

Test Results

tests/test_lightning_attn.py — 10/10 passed
tests/test_la_decode.py      — 20/20 passed

🚀 Pull Request Checklist

Thank you for contributing to cuLA! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing.

⚡ Performance

Reviewer Notes

higgsboson1710 and others added 10 commits April 5, 2026 21:11
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Implement cooperative GMEM↔SMEM transpose for VK state access:
- 128 CE threads load/store states cooperatively (coalesced GMEM)
- 2 strips of 64 V-rows with padded SMEM (stride D+4) to eliminate bank conflicts
- Row-based addressing: each iteration covers 4 rows × 32 cols with LDG/STG.128
- Per-thread SMEM read/write uses padded stride for conflict-free bank access

Performance (vs FLA): h0_ht avg 1.33x, varlen avg 1.44x, no_state avg 1.46x
Precision: RMSE matches FLA exactly (0.2347% O, 0.0198% Ht)
All 10 tests pass
run_la_decode now transposes state_4d from BHKV to BHVK before passing
to the kernel, and transposes output state back to BHKV for comparison.
E2E test also converts prefill BHVK output to BHKV before passing to
run_la_decode and torch_la_decode_ref.

All 20 tests pass.
@gemini-code-assist (Bot) left a comment

Code Review

This pull request transitions the lightning attention kernel to a BHVK (K-contiguous) layout for initial and final states, improving global memory access efficiency through a shared memory transpose buffer and cooperative loading/storing logic. Benchmarks and tests have been updated to accommodate this layout change, and a new end-to-end prefill-to-decode test has been added. I have no feedback to provide.

@KevinZeng08 (Collaborator) left a comment

LGTM

@zheyang0825 (Collaborator) left a comment

LGTM

@icavan icavan mentioned this pull request Apr 20, 2026
@icavan icavan merged commit 33ec0b3 into main Apr 22, 2026