This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvgpu] Add a nvgpu.rewrite_copy_as_tma transform operation.
ClosedPublic

Authored by nicolasvasilache on Aug 4 2023, 5:10 AM.

Details

Summary

This revision adds support for directly lowering a linalg.copy between global and shared memory buffers to a TMA async load plus synchronization operations.
This uses the recently introduced Hopper NVVM and NVGPU abstraction to connect things end to end.
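As a rough sketch, the transform might be driven from a transform-dialect script along the following lines (the op name comes from the patch title; the match pattern and result types are illustrative assumptions, not taken from the diff):

```mlir
// Illustrative only: match linalg.copy ops in the payload and rewrite each
// one as a TMA async load with the required synchronization.
transform.sequence failures(propagate) {
^bb0(%arg0: !transform.any_op):
  %copies = transform.structured.match ops{["linalg.copy"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
  transform.nvgpu.rewrite_copy_as_tma %copies : (!transform.any_op) -> ()
}
```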

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript · Aug 4 2023, 5:10 AM
nicolasvasilache requested review of this revision. · Aug 4 2023, 5:10 AM

Add higher-level IR test

ftynse added inline comments. · Aug 4 2023, 5:45 AM
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
759–772
792

Nit: ///

836–837

Why is this necessary? Are we not setting the listener properly when creating the nested builder? If so, we should fix that.

854–855

It's impossible to yield a different number of values from the then and else branches.

Speaking of which, I don't know much about the barrier behavior, but double-check that it's okay to have a barrier with and without operands in diverging branches.

865–871

This seems to be confusing terminology. The GPU dialect is supposed to use vendor-neutral terminology aligned with Khronos group specs (e.g. OpenCL), though I know it's not systematic. Mapping that terminology to CUDA terminology gives us: "workgroup" -> "block", "workgroup memory" -> "shared memory", i.e. memory accessible by any workitem/thread in the workgroup/block. It's unclear to me what "workgroup address space" refers to here and how it differs from "shared memory space". The enum values for the different address spaces in the GPU dialect are arbitrary and must not be interpreted as LLVM-compatible integers. A proper conversion, which should already be available in type converters, should be used to convert these.
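To illustrate the distinction being drawn here, a sketch of the two levels (the memref shapes are made up for the example):

```mlir
// GPU-dialect level: the workgroup ("shared" in CUDA terms) address space is
// expressed symbolically with an attribute, not a raw integer.
%a = memref.alloc() : memref<64x64xf32, #gpu.address_space<workgroup>>

// NVVM/LLVM level, after type conversion: the address space becomes the
// target-specific integer 3. The mapping between the two must go through a
// type converter; the dialect enum value itself is arbitrary.
// memref<64x64xf32, 3>
```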

917–925

This is the third occurrence in a row of this snippet with commented-out code. It is likely worth factoring out into a function that can be easily updated.

943

Please expand auto unless the type is obvious from context or impossible to spell.

974–975

Nit: explain the magic number.

1003

Does this have to hardcode num threads?

1049

It would be helpful to indicate which of the ops failed the precondition.

1057–1060

I don't actually see any way for the callee to return failure. Consider changing its return type and dropping the message here.

1062–1063

Would it make sense for the builder logic to erase this, so it is usable as a C++ call?

guraypp added inline comments. · Aug 7 2023, 1:09 AM
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
789

I guess this comment is left over from elsewhere, because it talks about wgmma descriptors.

826

The use of threadIdx.x is going to change; see the example below. It is okay right now, but just giving you a heads-up: we will select the leader thread using the elect instruction in PTX. I need to implement that first.

https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss.hpp#L344-L345

1003

We can use blockDim.x instead of 128 for this code, but programs are slightly faster when it's hardcoded.

It could also be read from %block_x = %c128 on gpu.launch.
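For context, a minimal sketch of what reading the block size from the launch op would look like (the surrounding values are illustrative, not from the patch):

```mlir
// The x block dimension is bound to a region argument of gpu.launch, so the
// constant 128 could be recovered from %sx instead of being hardcoded.
%c1 = arith.constant 1 : index
%c128 = arith.constant 128 : index
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
           threads(%tx, %ty, %tz) in (%sx = %c128, %sy = %c1, %sz = %c1) {
  // ... %sx holds the x block dimension (here, 128) ...
  gpu.terminator
}
```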

nicolasvasilache marked 15 inline comments as done.

Address comments.

mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
826

ack, thanks!

836–837

no reason, remnant from a previous state, thanks for catching!

854–855

Well, the other path would just yield 0.
Refactored to make it less confusing.

865–871

Ok, I understand the issue now. I was mistakenly using gpu::GPUMemorySpaceMappingAttr, which seems to have been added for the purpose of transforms but does not lower any further.

guraypp accepted this revision. · Aug 8 2023, 1:58 AM

It is really nice that we can use linalg.copy for Hopper's TMA load. Thanks for working on this.
Looks clear to me.

mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
955

Nice, this is way better.

This revision is now accepted and ready to land. · Aug 8 2023, 1:58 AM