
[Refactor] Explicit usage of mbarrier arrive_and_expect_tx (#88)

Merged
yaoyaoding merged 2 commits into main from refactor-mbarrier
Mar 6, 2026

Conversation

@yaoyaoding
Member

API Change Summary:
self.tma.global_to_shared no longer internally calls mbarrier_arrive_and_expect_tx. Users must now explicitly call self.mbarrier.arrive_and_expect_tx(barrier, tx_count=...) before issuing TMA loads, and reduce the barrier allocation count accordingly (e.g., from 2 to 1 when grouping two TMA copies under one arrive). ArriveBarrierInst.per_thread_count has been renamed to count.

PR Summary:
Decouple arrive_and_expect_tx from tma.global_to_shared, making transaction expectation explicit at the user level for direct control over barrier arrival semantics and transaction byte counts. All examples, tests, and the Pipeline class are updated accordingly.
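To illustrate the decoupling, here is a minimal toy model (not the tilus implementation; all names are hypothetical) of the new barrier contract: the TMA copy only reports completed transaction bytes, and the caller must have issued one explicit arrive-and-expect-tx covering them. Because a single arrive can cover several grouped copies, the barrier's arrival count can drop (e.g., from 2 to 1 for two copies).

```python
# Toy model of the new, explicit mbarrier contract. Nothing here is tilus
# code; ToyMbarrier and toy_tma_global_to_shared are illustrative stand-ins.

class ToyMbarrier:
    def __init__(self) -> None:
        self.arrivals = 0
        self.expected_tx = 0
        self.completed_tx = 0

    def arrive_and_expect_tx(self, tx_count: int) -> None:
        # One arrival that also registers tx_count pending transaction bytes.
        self.arrivals += 1
        self.expected_tx += tx_count

    def complete_tx(self, nbytes: int) -> None:
        # Called by the (simulated) TMA engine as bytes land in shared memory.
        self.completed_tx += nbytes

    def phase_done(self, expected_arrivals: int) -> bool:
        return (self.arrivals == expected_arrivals
                and self.completed_tx == self.expected_tx)


def toy_tma_global_to_shared(barrier: ToyMbarrier, nbytes: int) -> None:
    # The old behavior bundled arrive_and_expect_tx inside the copy; now the
    # copy only reports bytes, and the caller must have expected them.
    barrier.complete_tx(nbytes)


bar = ToyMbarrier()
# One explicit arrive covers BOTH copies, so the barrier can be allocated
# with an arrival count of 1 instead of 2.
bar.arrive_and_expect_tx(tx_count=4096 + 2048)
toy_tma_global_to_shared(bar, 4096)  # e.g. the A tile
toy_tma_global_to_shared(bar, 2048)  # e.g. the B tile
assert bar.phase_done(expected_arrivals=1)
```

The point of the sketch is the ordering requirement: `arrive_and_expect_tx` must precede the copies it covers, and the byte total must match what the copies deliver.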

Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Copilot AI (Contributor) left a comment


Pull request overview

Refactors TMA synchronization so that transaction expectation is no longer implicit inside tma.global_to_shared; callers must explicitly issue mbarrier.arrive_and_expect_tx(..., tx_count=...) before TMA loads, and barrier arrival counts are adjusted accordingly. This also renames the barrier-arrive parameter from per_thread_count to count and adds a dedicated IR instruction/emitter path for arrive+expect-tx.

Changes:

  • Make TMA transaction expectation explicit via mbarrier.arrive_and_expect_tx and update tests/examples to call it and reduce barrier counts (e.g., 2 → 1 for two grouped TMA copies).
  • Rename the barrier arrive parameter per_thread_count → count across the IR/builder/emitter stack.
  • Add ArriveExpectTxBarrierInst and emit mbarrier.arrive.expect_tx via a dedicated CUDA emitter.
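Since the transaction byte count is now supplied by the caller, it must equal the total bytes moved by all TMA copies grouped under one arrive. A small hedged sketch (the helper `tma_tx_bytes` is hypothetical, not part of the tilus API) of how that count might be computed:

```python
# Hypothetical helper: total bytes moved by a TMA copy of a tile with the
# given element shape and per-element size. This is how a caller could
# derive the tx_count argument for mbarrier.arrive_and_expect_tx.
def tma_tx_bytes(shape: tuple[int, ...], dtype_nbytes: int) -> int:
    n = 1
    for d in shape:
        n *= d
    return n * dtype_nbytes


# Two copies grouped under one arrive: a 128x64 fp16 A tile plus a
# 64x128 fp16 B tile (example numbers, not taken from the PR).
tx = tma_tx_bytes((128, 64), 2) + tma_tx_bytes((64, 128), 2)
# self.mbarrier.arrive_and_expect_tx(barrier, tx_count=tx)
```

If the supplied `tx_count` disagrees with the bytes the TMA engine actually delivers, the barrier phase never completes, so this computation is the part the caller now owns.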

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tests/instructions/test_tcgen05_mma.py Updates barrier allocation and adds explicit arrive_and_expect_tx before two TMA loads.
tests/instructions/test_copy_async_tensor.py Adds explicit arrive_and_expect_tx before TMA global→shared load.
python/tilus/lang/instructions/mbarrier.py Renames arrive arg to count; adds arrive_and_expect_tx API.
python/tilus/ir/tensor.py Changes SharedTensor.size computation and imports prod.
python/tilus/ir/instructions/cuda/mbarrier.py Renames ArriveBarrierInst field and introduces ArriveExpectTxBarrierInst.
python/tilus/ir/builders/stmt_builder.py Updates builder API and adds arrive_expect_tx_barrier emission.
python/tilus/backends/emitters/cuda/mbarrier.py Updates arrive emitter and adds arrive+expect-tx emitter.
python/tilus/backends/emitters/cuda/cp_async_tensor.py Removes implicit arrive+expect-tx emission from TMA global→shared copies.
examples/hopper_matmul/matmul_v0.py Updates barrier counts and adds explicit arrive_and_expect_tx before TMA copies.
examples/hopper_matmul/matmul_v1.py Same as v0, single-stage variant.
examples/hopper_matmul/matmul_v2.py Same as v0, pipelined/staged variant (preload + main loop).
examples/hopper_matmul/matmul_v3.py Same as v2 with producer/consumer barrier scheme.
examples/blackwell_matmul/matmul_v1.py Updates barrier counts and adds explicit arrive_and_expect_tx before TMA copies.
examples/blackwell_matmul/matmul_v2.py Same as v1, staged variant with preload.
examples/blackwell_matmul/matmul_v3.py Same as v2, producer/consumer barrier scheme.
examples/blackwell_matmul/matmul_v4.py Updates Pipeline usage (producer arrive count) and adds explicit arrive_and_expect_tx.
examples/blackwell_matmul/matmul_v7.py Same as v4 for a different pipeline/example variant.


Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (1)

python/tilus/backends/emitters/smem.py:42

  • Shared memory allocation now uses tensor.storage_nbytes, but the declared pointer type still uses shape=[tensor.size] (logical element count). For non-compact/swizzled layouts where optional_layout.count_size() > prod(shape), this can make typed indexing appear out-of-bounds (and may break codegen/type-checking). Declare the shared pointer with a storage-sized shape (e.g., optional_layout.count_size() or storage_nbytes // dtype.nbytes) to match the allocated storage.
        allocator_addr = ctx.allocate_shared_tensor(tensor, nbytes=tensor.storage_nbytes)
        self.tensor2var[tensor] = self.declare_var(
            name="shared",
            tp=tensor_pointer_type(dtype=tensor.dtype, shape=[tensor.size]),
            init=dynamic_shared_memory(byte_offset=allocator_addr, dtype=tensor.dtype),
        )
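The reviewer's concern can be illustrated with toy numbers (none of this is tilus code; the layout and sizes are invented): with a swizzled or padded layout, the allocated storage can hold more elements than the logical shape, so a pointer type declared from `prod(shape)` under-declares the allocation.

```python
# Toy illustration of the mismatch the review comment describes.
from math import prod

logical_shape = (64, 64)       # logical tensor shape
dtype_nbytes = 2               # e.g. fp16
layout_count_size = 64 * 72    # padded layout: 8 extra elements per row
storage_nbytes = layout_count_size * dtype_nbytes

size = prod(logical_shape)                       # logical element count
storage_elems = storage_nbytes // dtype_nbytes   # allocated element count

# Declaring shape=[size] under-declares the allocation when the layout pads:
assert storage_elems > size
# The suggested fix: derive the pointer's shape from storage, not from prod(shape).
declared_shape = [storage_elems]
```

With a compact layout the two counts coincide, which is why the bug only surfaces for non-compact or swizzled layouts.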


@yaoyaoding yaoyaoding merged commit a68f881 into main Mar 6, 2026
12 checks passed
@yaoyaoding yaoyaoding deleted the refactor-mbarrier branch March 6, 2026 20:24