This revision adds support for directly lowering a linalg.copy on buffers between global and shared memory to a TMA async load plus synchronization operations.
It uses the recently introduced Hopper NVVM and NVGPU abstractions to connect things end to end.
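As a rough sketch of the mapping (shapes, SSA names, and the exact op syntax below are illustrative, not taken verbatim from the patch), the transform rewrites a bufferized copy into the NVGPU TMA ops:

```mlir
// Input (sketch): bufferized copy from global into workgroup (shared) memory.
//   linalg.copy ins(%global : memref<64x32xf32>)
//               outs(%shared : memref<64x32xf32, #gpu.address_space<workgroup>>)
//
// Lowered form (sketch): TMA descriptor + async load + mbarrier synchronization.
%desc = nvgpu.tma.create.descriptor %global box[%c64, %c32]
    : memref<64x32xf32> -> !nvgpu.tensormap.descriptor<...>
%bar = nvgpu.mbarrier.create
    -> !nvgpu.mbarrier.group<memorySpace = #gpu.address_space<workgroup>>
nvgpu.mbarrier.init %bar[%c0], %c1 : ...
// One thread issues the async copy; all threads then wait on the barrier.
nvgpu.tma.async.load %desc[%c0, %c0], %bar[%c0] to %shared : ...
nvgpu.mbarrier.try_wait.parity %bar[%c0], %phase, %ticks : ...
```

The elided types and indices depend on the tile shape and the number of barriers the lowering actually creates.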
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
Line | Comment
---|---
759–772 | Can this just use https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/IR/Value.h#L435 ?
792 | Nit: ///
836–837 | Why is this necessary? Are we not setting the listener properly when creating the nested builder? If so, we should fix that.
854–855 | It's impossible to yield a different number of values from the then and else branches. Speaking of which, I don't know much about the barrier behavior, but double-check that it's okay to have a barrier with and without operands in diverging branches.
865–871 | This seems to be confusing terminology. The GPU dialect is supposed to use vendor-neutral terminology aligned with Khronos group specs (e.g. OpenCL), though I know it's not systematic. Mapping that terminology to CUDA terminology gives us: "workgroup" -> "block", "workgroup memory" -> "shared memory", i.e. memory accessed by any workitem/thread in the workgroup/block. It's unclear to me what "workgroup address space" refers to here and how it differs from "shared memory space". The enum values for the different address spaces in the GPU dialect are arbitrary and must not be interpreted as LLVM-compatible integers. A proper conversion, which should already be available in the type converters, should be used to convert them.
917–925 | It's the third occurrence in a row of this snippet with commented-out code. This is likely worth factoring out into a function that can be easily updated.
943 | Please expand auto unless the type is obvious from context or impossible to spell.
974–975 | Nit: explain the magic number.
1003 | Does this have to hardcode the number of threads?
1049 | It would be helpful to indicate which of the ops failed the precondition.
1057–1060 | I don't actually see any way for the callee to return failure. Consider changing its return type and dropping the message here.
1062–1063 | Would it make sense for the builder logic to erase this, so it is usable as a C++ call?
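On the 854–855 point, a minimal sketch of the constraint (names and types are illustrative): scf.if requires both branches to yield the same number and types of values, so the non-leader branch has to yield a placeholder even though it performs no transaction:

```mlir
// Both branches of scf.if must yield the same number (and types) of values.
%bytes = scf.if %is_leader -> (index) {
  // Leader thread: arrive on the barrier expecting %num_bytes to be transferred.
  nvgpu.mbarrier.arrive.expect_tx %bar[%c0], %num_bytes : !nvgpu.mbarrier.group<...>
  scf.yield %num_bytes : index
} else {
  // Other threads: arrive with no transaction count, but must still yield a value.
  scf.yield %c0 : index
}
```

This is what the "the other path would just yield 0" reply below is describing.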
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
Line | Comment
---|---
789 | I guess this comment was forgotten here, because it talks about wgmma descriptors.
826 | Using threadIdx.x is going to change, see an example below. It is okay right now, but just giving you a heads-up.
1003 | We can use blockDim.x instead of 128 for this code, but programs are slightly faster when it's hardcoded. It could be read from %block_x = %c128 on the gpu.launch.
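On the 1003 point, the block size is already bound as a region argument of gpu.launch, so the lowering could reuse that value instead of re-materializing the constant. A sketch (names are illustrative):

```mlir
%c1 = arith.constant 1 : index
%c128 = arith.constant 128 : index
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
           threads(%tx, %ty, %tz) in (%bdx = %c128, %bdy = %c1, %bdz = %c1) {
  // %bdx carries blockDim.x; it can be used wherever the lowering
  // currently hardcodes the 128-thread assumption.
  gpu.terminator
}
```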
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
Line | Comment
---|---
826 | ack, thanks!
836–837 | No reason, a remnant from a previous state, thanks for catching it!
854–855 | Well, the other path would just yield 0.
865–871 | Ok, I understand the issue now. I was mistakenly using gpu::GPUMemorySpaceMappingAttr, which seems to have been added for the purpose of transforms but does not lower any further.
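A sketch of the distinction being acknowledged here (illustrative, not from the patch): the memory space on the memref type should be the GPU dialect's address-space attribute, which the standard type converters then map to the target's numeric address space, rather than an integer interpreted as an LLVM address space:

```mlir
// Portable form: the logical workgroup space as a GPU dialect attribute.
%buf = memref.alloc() : memref<64x32xf32, #gpu.address_space<workgroup>>
// After type conversion for NVVM this memory space becomes, roughly,
// LLVM address space 3 (CUDA shared memory); the mapping lives in the
// type converter, not in the enum's numeric value.
```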
It is really nice that we can use linalg.copy for Hopper's TMA load. Thanks for working on this.
Looks clear to me.
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
Line | Comment
---|---
955 | Nice, this is way better.