[Refactor] Explicit usage of mbarrier arrive_and_expect_tx #88
yaoyaoding merged 2 commits into main from
Conversation
Pull request overview
Refactors TMA synchronization so that transaction expectation is no longer implicit inside tma.global_to_shared; callers must explicitly issue mbarrier.arrive_and_expect_tx(..., tx_count=...) before TMA loads, and barrier arrival counts are adjusted accordingly. This also renames the barrier-arrive parameter from per_thread_count to count and adds a dedicated IR instruction/emitter path for arrive+expect-tx.
Changes:
- Make TMA transaction expectation explicit via `mbarrier.arrive_and_expect_tx`, and update tests/examples to call it and reduce barrier counts (e.g., 2 → 1 for two grouped TMA copies).
- Rename the barrier arrive parameter `per_thread_count` → `count` across the IR/builder/emitter stack.
- Add `ArriveExpectTxBarrierInst` and emit `mbarrier.arrive.expect_tx` via a dedicated CUDA emitter.
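The barrier-count reduction (2 → 1) follows from PTX mbarrier phase-completion semantics: a phase completes only when both the pending arrival count and the pending transaction byte count reach zero, so one `arrive_and_expect_tx` can cover several TMA copies as long as its `tx_count` sums their bytes. A minimal Python model of those semantics (a sketch of the hardware behavior, not the tilus API; the 16384-byte copy size is made up for illustration):

```python
class MBarrier:
    """Toy model of a PTX mbarrier with transaction-count tracking."""

    def __init__(self, arrive_count: int) -> None:
        self.arrive_count = arrive_count      # arrivals expected per phase
        self.pending_arrivals = arrive_count
        self.pending_tx = 0                   # expected-but-unfinished bytes
        self.phase = 0

    def _maybe_complete(self) -> None:
        # A phase completes only when BOTH counts drain to zero.
        if self.pending_arrivals == 0 and self.pending_tx == 0:
            self.phase ^= 1
            self.pending_arrivals = self.arrive_count

    def arrive_and_expect_tx(self, tx_count: int) -> None:
        # One arrival that also registers tx_count expected bytes.
        self.pending_tx += tx_count
        self.pending_arrivals -= 1
        self._maybe_complete()

    def complete_tx(self, nbytes: int) -> None:
        # Performed by the TMA unit as data lands in shared memory.
        self.pending_tx -= nbytes
        self._maybe_complete()


# Two TMA copies grouped under a single explicit arrive: the barrier is
# initialized with count=1 (previously 2, one implicit arrive per copy).
bar = MBarrier(arrive_count=1)
bar.arrive_and_expect_tx(tx_count=2 * 16384)  # bytes for both copies
bar.complete_tx(16384)                        # first copy finishes
assert bar.phase == 0                         # not complete yet
bar.complete_tx(16384)                        # second copy finishes
assert bar.phase == 1                         # phase flips: data ready
```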
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/instructions/test_tcgen05_mma.py | Updates barrier allocation and adds explicit arrive_and_expect_tx before two TMA loads. |
| tests/instructions/test_copy_async_tensor.py | Adds explicit arrive_and_expect_tx before TMA global→shared load. |
| python/tilus/lang/instructions/mbarrier.py | Renames arrive arg to count; adds arrive_and_expect_tx API. |
| python/tilus/ir/tensor.py | Changes SharedTensor.size computation and imports prod. |
| python/tilus/ir/instructions/cuda/mbarrier.py | Renames ArriveBarrierInst field and introduces ArriveExpectTxBarrierInst. |
| python/tilus/ir/builders/stmt_builder.py | Updates builder API and adds arrive_expect_tx_barrier emission. |
| python/tilus/backends/emitters/cuda/mbarrier.py | Updates arrive emitter and adds arrive+expect-tx emitter. |
| python/tilus/backends/emitters/cuda/cp_async_tensor.py | Removes implicit arrive+expect-tx emission from TMA global→shared copies. |
| examples/hopper_matmul/matmul_v0.py | Updates barrier counts and adds explicit arrive_and_expect_tx before TMA copies. |
| examples/hopper_matmul/matmul_v1.py | Same as v0, single-stage variant. |
| examples/hopper_matmul/matmul_v2.py | Same as v0, pipelined/staged variant (preload + main loop). |
| examples/hopper_matmul/matmul_v3.py | Same as v2 with producer/consumer barrier scheme. |
| examples/blackwell_matmul/matmul_v1.py | Updates barrier counts and adds explicit arrive_and_expect_tx before TMA copies. |
| examples/blackwell_matmul/matmul_v2.py | Same as v1, staged variant with preload. |
| examples/blackwell_matmul/matmul_v3.py | Same as v2, producer/consumer barrier scheme. |
| examples/blackwell_matmul/matmul_v4.py | Updates Pipeline usage (producer arrive count) and adds explicit arrive_and_expect_tx. |
| examples/blackwell_matmul/matmul_v7.py | Same as v4 for a different pipeline/example variant. |
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
python/tilus/backends/emitters/smem.py:42
- Shared memory allocation now uses `tensor.storage_nbytes`, but the declared pointer type still uses `shape=[tensor.size]` (logical element count). For non-compact/swizzled layouts where `optional_layout.count_size() > prod(shape)`, this can make typed indexing appear out-of-bounds (and may break codegen/type-checking). Declare the shared pointer with a storage-sized shape (e.g., `optional_layout.count_size()` or `storage_nbytes // dtype.nbytes`) to match the allocated storage.
allocator_addr = ctx.allocate_shared_tensor(tensor, nbytes=tensor.storage_nbytes)
self.tensor2var[tensor] = self.declare_var(
    name="shared",
    tp=tensor_pointer_type(dtype=tensor.dtype, shape=[tensor.size]),
    init=dynamic_shared_memory(byte_offset=allocator_addr, dtype=tensor.dtype),
)
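The mismatch the review flags can be shown with hypothetical numbers: for a padded (non-compact) shared layout, the storage element count exceeds the logical `prod(shape)`, so a pointer typed with the logical size undersizes the allocation. The row pitch and dtype below are made-up values for illustration:

```python
from math import prod

shape = [128, 32]        # logical tensor shape
row_pitch = 40           # padded row stride in elements (swizzle padding)
dtype_nbytes = 2         # e.g. float16

logical_size = prod(shape)            # what shape=[tensor.size] declares
storage_elems = shape[0] * row_pitch  # what optional_layout.count_size() reports
storage_nbytes = storage_elems * dtype_nbytes

assert logical_size == 4096
assert storage_elems == 5120          # > 4096: typed indexing past the declared shape
# Declaring the pointer with storage_nbytes // dtype_nbytes matches the allocation:
assert storage_nbytes // dtype_nbytes == storage_elems
```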
API Change Summary:
`self.tma.global_to_shared` no longer internally calls `mbarrier_arrive_and_expect_tx`. Users must now explicitly call `self.mbarrier.arrive_and_expect_tx(barrier, tx_count=...)` before issuing TMA loads, and reduce the barrier allocation count accordingly (e.g., from 2 to 1 when grouping two TMA copies under one arrive). `ArriveBarrierInst.per_thread_count` has been renamed to `count`.
PR Summary:
Decouple `arrive_and_expect_tx` from `tma.global_to_shared`, making transaction expectation explicit at the user level for direct control over barrier arrival semantics and transaction byte counts. All examples, tests, and the Pipeline class are updated accordingly.
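The new call-site contract can be illustrated with a minimal recorder stand-in (the classes below are mocks for illustration, not the tilus runtime; only the method names and the `tx_count` keyword are taken from the PR description, and the buffer names and byte counts are invented):

```python
calls = []

class MBarrierAPI:
    def arrive_and_expect_tx(self, barrier, tx_count):
        # Explicit arrive + expected-bytes registration, now done by the user.
        calls.append(("arrive_and_expect_tx", barrier, tx_count))

class TmaAPI:
    def global_to_shared(self, dst, src, barrier):
        # No implicit arrive/expect-tx here any more: just the copy.
        calls.append(("global_to_shared", barrier))

mbarrier, tma = MBarrierAPI(), TmaAPI()
barrier = "bar0"

# New usage: one explicit arrive covers both grouped copies, so the
# barrier can be allocated with count=1 instead of 2.
mbarrier.arrive_and_expect_tx(barrier, tx_count=2 * 16384)
tma.global_to_shared("smem_a", "gmem_a", barrier)
tma.global_to_shared("smem_b", "gmem_b", barrier)

assert calls[0][0] == "arrive_and_expect_tx"  # arrive precedes the loads
```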