This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvgpu] Add `mbarrier.arrive.expect_tx` and `mbarrier.try_wait.parity`
Closed, Public

Authored by guraypp on Jun 29 2023, 8:35 AM.

Details

Summary

This work adds two Ops:
mbarrier.arrive.expect_tx performs an expect_tx operation on an mbarrier.barrier and returns an mbarrier.barrier.token
mbarrier.try_wait.parity waits on an mbarrier.barrier and an mbarrier.barrier.token

mbarrier.arrive.expect_tx is one of the requirements for enabling H100 TMA support.

Depends on D154074 D154076 D154059 D154060

Event Timeline

guraypp created this revision. Jun 29 2023, 8:35 AM
Herald added a project: Restricted Project. Jun 29 2023, 8:35 AM
guraypp requested review of this revision. Jun 29 2023, 8:35 AM
manishucsd added inline comments.
mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
476

Is txcount the number of elements or the number of bytes?

guraypp added inline comments. Jun 30 2023, 12:00 AM
mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
476

It is bytes; see the CUDA example here:
https://gist.github.com/grypp/5594a5036ffed6689ff3d7527b0a8370#file-tma_00_simple-cu-L131

This is from the PTX definition:

The optional qualifier .expect_tx specifies that an expect-tx operation is performed prior to the arrive-on operation. The 32-bit unsigned integer operand txCount specifies the expectCount argument to the expect-tx operation. When both qualifiers .arrive and .expect_tx are specified, then the count argument of the arrive-on operation is assumed to be 1.
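To make the quoted semantics concrete, here is a toy Python model. It is only a sketch, not the hardware behavior or the nvgpu lowering. It assumes, following the PTX mbarrier description, that a phase completes once both the pending arrival count and the tx-count reach zero. The class `ToyMBarrier` and all of its method names are hypothetical and exist only for this illustration.

```python
class ToyMBarrier:
    """Toy model of an mbarrier with a transaction (byte) count. Hypothetical."""

    def __init__(self, expected_arrivals):
        self.expected_arrivals = expected_arrivals
        self.pending_arrivals = expected_arrivals
        self.tx_count = 0   # bytes still expected from asynchronous copies
        self.phase = 0      # phase 0 is even (parity 0), phase 1 is odd, ...

    def _maybe_complete_phase(self):
        # Assumption: a phase completes when no arrivals and no bytes are pending.
        if self.pending_arrivals == 0 and self.tx_count == 0:
            self.phase += 1
            self.pending_arrivals = self.expected_arrivals

    def arrive_expect_tx(self, tx_count_bytes):
        # expect-tx is performed first, then an arrive-on with count 1.
        self.tx_count += tx_count_bytes
        self.pending_arrivals -= 1
        self._maybe_complete_phase()

    def complete_tx(self, num_bytes):
        # Models an asynchronous copy reporting that num_bytes have landed.
        self.tx_count -= num_bytes
        self._maybe_complete_phase()

    def test_wait_parity(self, parity):
        # True once the phase with the given parity has completed, which in
        # this model means the current phase has the opposite parity.
        return (self.phase & 1) != parity


bar = ToyMBarrier(expected_arrivals=1)
bar.arrive_expect_tx(256)          # one thread arrives and expects 256 bytes
print(bar.test_wait_parity(0))     # False: the 256 bytes have not landed yet
bar.complete_tx(128)
bar.complete_tx(128)
print(bar.test_wait_parity(0))     # True: phase 0 has completed
```

The 256-byte count is arbitrary. The point of the sketch is that the arriving thread alone does not release the barrier; the expected bytes must also be accounted for.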
qcolombet accepted this revision. Jun 30 2023, 1:21 AM

LGTM with nits.

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
474

Could you add some description here to better explain what this is doing?
Essentially what it means to "perform arrive with expect_tx".

From what I understand, when the thread arrives at this point, we set the expect count of the barrier to txcount and produce a token. Later we'll wait on the produced token until the barrier is released.

Feel free to reword/fix, but we need some kind of explanation :).

484

Ditto on the description.

This revision is now accepted and ready to land. Jun 30 2023, 1:21 AM
guraypp updated this revision to Diff 536770. Jul 3 2023, 7:42 AM

update the op information

guraypp marked 2 inline comments as done and an inline comment as not done. Jul 3 2023, 7:50 AM
guraypp added inline comments.
mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
476

I understand what you mean: the nvgpu dialect uses the number of elements, not bytes, and `nvgpu.device_async_copy` is an example of that.

I cannot make txcount a number of elements, because the Op does not take any memref from which I could calculate the bytes. See the Op usage below:

func.func @test(%barrier : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>) {
  %txcountBytes = arith.constant 256 : index
  %token = nvgpu.mbarrier.arrive.expect_tx %barrier, %txcountBytes : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>> -> !nvgpu.mbarrier.token
  return
}

I could make the Op take a memref and get rid of txcount, but this does not quite match the PTX instruction.

func @test(%barrier : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>, %buffer: memref<128x128xf16, 3>) {
  %token = nvgpu.mbarrier.arrive.expect_tx %barrier, %buffer : !nvgpu.mbarrier.barrier<memorySpace = #gpu.address_space<workgroup>>, memref<128x128xf16, 3>
}
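For reference, the byte count that would have to be derived from a statically shaped memref such as `memref<128x128xf16, 3>` is simple arithmetic. The following Python sketch shows the calculation; the helper name `tx_count_bytes` is hypothetical, and it assumes a densely packed buffer with a static shape.

```python
from math import prod


def tx_count_bytes(shape, element_bits):
    """Bytes occupied by a densely packed buffer with a static shape.

    Hypothetical helper for illustration; not part of the nvgpu dialect.
    """
    total_bits = prod(shape) * element_bits
    if total_bits % 8 != 0:
        raise ValueError("buffer size is not a whole number of bytes")
    return total_bits // 8


# memref<128x128xf16>: 128 * 128 elements, 16 bits each.
print(tx_count_bytes((128, 128), 16))  # 32768
```

With an explicit txcount operand, as in the first example, this computation is left to whoever builds the Op.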

Thanks for the added descriptions. Still LGTM :).

Couple of nits

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
499

nit: is is

501

Nit: A suspended thread...
or
Suspended threads resume

mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
607

Add newline

guraypp updated this revision to Diff 542413. Jul 20 2023, 3:26 AM

update ptx and nvgpu ops

This revision was landed with ongoing or failed builds. Jul 20 2023, 4:48 AM
This revision was automatically updated to reflect the committed changes.