[Refactor] Explicit usage of mbarrier arrive_and_expect_tx #88
yaoyaoding merged 2 commits into main from
Conversation
Pull request overview
Refactors TMA synchronization so that transaction expectation is no longer implicit inside tma.global_to_shared; callers must explicitly issue mbarrier.arrive_and_expect_tx(..., tx_count=...) before TMA loads, and barrier arrival counts are adjusted accordingly. This also renames the barrier-arrive parameter from per_thread_count to count and adds a dedicated IR instruction/emitter path for arrive+expect-tx.
Changes:
- Make TMA transaction expectation explicit via `mbarrier.arrive_and_expect_tx`, and update tests/examples to call it and reduce barrier counts (e.g., 2 → 1 for two grouped TMA copies).
- Rename the barrier arrive parameter `per_thread_count` → `count` across the IR/builder/emitter stack.
- Add `ArriveExpectTxBarrierInst` and emit `mbarrier.arrive.expect_tx` via a dedicated CUDA emitter.
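The barrier-count reduction (2 → 1) follows from PTX mbarrier phase-completion semantics: a phase completes only when both the pending arrival count and the pending transaction byte count reach zero, so one `arrive_and_expect_tx` can cover several TMA copies as long as its `tx_count` sums their bytes. A minimal Python model of those semantics (a sketch of the hardware behavior, not the tilus API; the 16384-byte copy size is made up for illustration):

```python
class MBarrier:
    """Toy model of a PTX mbarrier with transaction-count tracking."""

    def __init__(self, arrive_count: int) -> None:
        self.arrive_count = arrive_count      # arrivals expected per phase
        self.pending_arrivals = arrive_count
        self.pending_tx = 0                   # expected-but-unfinished bytes
        self.phase = 0

    def _maybe_complete(self) -> None:
        # A phase completes only when BOTH counts drain to zero.
        if self.pending_arrivals == 0 and self.pending_tx == 0:
            self.phase ^= 1
            self.pending_arrivals = self.arrive_count

    def arrive_and_expect_tx(self, tx_count: int) -> None:
        # One arrival that also registers tx_count expected bytes.
        self.pending_tx += tx_count
        self.pending_arrivals -= 1
        self._maybe_complete()

    def complete_tx(self, nbytes: int) -> None:
        # Performed by the TMA unit as data lands in shared memory.
        self.pending_tx -= nbytes
        self._maybe_complete()


# Two TMA copies grouped under a single explicit arrive: the barrier is
# initialized with count=1 (previously 2, one implicit arrive per copy).
bar = MBarrier(arrive_count=1)
bar.arrive_and_expect_tx(tx_count=2 * 16384)  # bytes for both copies
bar.complete_tx(16384)                        # first copy finishes
assert bar.phase == 0                         # not complete yet
bar.complete_tx(16384)                        # second copy finishes
assert bar.phase == 1                         # phase flips: data ready
```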
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/instructions/test_tcgen05_mma.py | Updates barrier allocation and adds explicit arrive_and_expect_tx before two TMA loads. |
| tests/instructions/test_copy_async_tensor.py | Adds explicit arrive_and_expect_tx before TMA global→shared load. |
| python/tilus/lang/instructions/mbarrier.py | Renames arrive arg to count; adds arrive_and_expect_tx API. |
| python/tilus/ir/tensor.py | Changes SharedTensor.size computation and imports prod. |
| python/tilus/ir/instructions/cuda/mbarrier.py | Renames ArriveBarrierInst field and introduces ArriveExpectTxBarrierInst. |
| python/tilus/ir/builders/stmt_builder.py | Updates builder API and adds arrive_expect_tx_barrier emission. |
| python/tilus/backends/emitters/cuda/mbarrier.py | Updates arrive emitter and adds arrive+expect-tx emitter. |
| python/tilus/backends/emitters/cuda/cp_async_tensor.py | Removes implicit arrive+expect-tx emission from TMA global→shared copies. |
| examples/hopper_matmul/matmul_v0.py | Updates barrier counts and adds explicit arrive_and_expect_tx before TMA copies. |
| examples/hopper_matmul/matmul_v1.py | Same as v0, single-stage variant. |
| examples/hopper_matmul/matmul_v2.py | Same as v0, pipelined/staged variant (preload + main loop). |
| examples/hopper_matmul/matmul_v3.py | Same as v2 with producer/consumer barrier scheme. |
| examples/blackwell_matmul/matmul_v1.py | Updates barrier counts and adds explicit arrive_and_expect_tx before TMA copies. |
| examples/blackwell_matmul/matmul_v2.py | Same as v1, staged variant with preload. |
| examples/blackwell_matmul/matmul_v3.py | Same as v2, producer/consumer barrier scheme. |
| examples/blackwell_matmul/matmul_v4.py | Updates Pipeline usage (producer arrive count) and adds explicit arrive_and_expect_tx. |
| examples/blackwell_matmul/matmul_v7.py | Same as v4 for a different pipeline/example variant. |
Signed-off-by: Yaoyao Ding <dingyaoyao.cs@gmail.com>
Pull request overview
Copilot reviewed 19 out of 19 changed files in this pull request and generated no new comments.
Comments suppressed due to low confidence (1)
python/tilus/backends/emitters/smem.py:42
- Shared memory allocation now uses `tensor.storage_nbytes`, but the declared pointer type still uses `shape=[tensor.size]` (logical element count). For non-compact/swizzled layouts where `optional_layout.count_size() > prod(shape)`, this can make typed indexing appear out-of-bounds (and may break codegen/type-checking). Declare the shared pointer with a storage-sized shape (e.g., `optional_layout.count_size()` or `storage_nbytes // dtype.nbytes`) to match the allocated storage.
allocator_addr = ctx.allocate_shared_tensor(tensor, nbytes=tensor.storage_nbytes)
self.tensor2var[tensor] = self.declare_var(
    name="shared",
    tp=tensor_pointer_type(dtype=tensor.dtype, shape=[tensor.size]),
    init=dynamic_shared_memory(byte_offset=allocator_addr, dtype=tensor.dtype),
)
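The mismatch the review flags can be shown with hypothetical numbers: for a padded (non-compact) shared layout, the storage element count exceeds the logical `prod(shape)`, so a pointer typed with the logical size undersizes the allocation. The row pitch and dtype below are made-up values for illustration:

```python
from math import prod

shape = [128, 32]        # logical tensor shape
row_pitch = 40           # padded row stride in elements (swizzle padding)
dtype_nbytes = 2         # e.g. float16

logical_size = prod(shape)            # what shape=[tensor.size] declares
storage_elems = shape[0] * row_pitch  # what optional_layout.count_size() reports
storage_nbytes = storage_elems * dtype_nbytes

assert logical_size == 4096
assert storage_elems == 5120          # > 4096: typed indexing past the declared shape
# Declaring the pointer with storage_nbytes // dtype_nbytes matches the allocation:
assert storage_nbytes // dtype_nbytes == storage_elems
```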
API Change Summary:
`self.tma.global_to_shared` no longer internally calls `mbarrier_arrive_and_expect_tx`. Users must now explicitly call `self.mbarrier.arrive_and_expect_tx(barrier, tx_count=...)` before issuing TMA loads, and reduce the barrier allocation count accordingly (e.g., from 2 to 1 when grouping two TMA copies under one arrive). `ArriveBarrierInst.per_thread_count` has been renamed to `count`.
PR Summary:
Decouple `arrive_and_expect_tx` from `tma.global_to_shared`, making transaction expectation explicit at the user level for direct control over barrier arrival semantics and transaction byte counts. All examples, tests, and the Pipeline class are updated accordingly.
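The new call-site contract can be illustrated with a minimal recorder stand-in (the classes below are mocks for illustration, not the tilus runtime; only the method names and the `tx_count` keyword are taken from the PR description, and the buffer names and byte counts are invented):

```python
calls = []

class MBarrierAPI:
    def arrive_and_expect_tx(self, barrier, tx_count):
        # Explicit arrive + expected-bytes registration, now done by the user.
        calls.append(("arrive_and_expect_tx", barrier, tx_count))

class TmaAPI:
    def global_to_shared(self, dst, src, barrier):
        # No implicit arrive/expect-tx here any more: just the copy.
        calls.append(("global_to_shared", barrier))

mbarrier, tma = MBarrierAPI(), TmaAPI()
barrier = "bar0"

# New usage: one explicit arrive covers both grouped copies, so the
# barrier can be allocated with count=1 instead of 2.
mbarrier.arrive_and_expect_tx(barrier, tx_count=2 * 16384)
tma.global_to_shared("smem_a", "gmem_a", barrier)
tma.global_to_shared("smem_b", "gmem_b", barrier)

assert calls[0][0] == "arrive_and_expect_tx"  # arrive precedes the loads
```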