[MLIR][NVGPU] Adding nvgpu.wargroup.mma Op for Hopper GPUs
This work introduces a new operation called wargroup.mma to the NVGPU dialect of MLIR. The purpose of this operation is to facilitate warpgroup-level matrix multiply and accumulate (WGMMA) operations on Hopper GPUs with the sm_90a architecture.
Previously, the nvvm.wgmma.mma_async operation was introduced to support warpgroup-level matrix operations in the NVVM dialect. Achieving a desired computation shape with it requires issuing multiple instances of nvvm.wgmma.mma_async and managing their asynchronous execution by hand. The new nvgpu.wargroup.mma operation abstracts this complexity and provides a higher-level interface for performing warpgroup-level matrix operations.
The nvgpu.wargroup.mma op does the following:
- Corresponds to multiple wgmma instructions.
- Iterates over the input matrix descriptors to achieve the desired computation shape.
- Groups and runs wgmma instructions asynchronously, and eventually waits on them. This is done via wgmma.fence.aligned, wgmma.commit.group.sync.aligned, and wgmma.wait.group.sync.aligned.
- Returns the results as fragmented matrices.
Here's an example usage of the nvgpu.wargroup.mma operation:
%wgmmaResult, %wgmmaResult2 = nvgpu.wargroup.mma %descA, %descB, %acc, group = 1 {transposeB} :
    !nvgpu.wgmma.descriptor<tensor = memref<128x64xf16, 3>>,
    !nvgpu.wgmma.descriptor<tensor = memref<64x128xf16, 3>>,
    vector<128x128xf32>
    -> !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>,
       !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>
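Conceptually, the op lowers to a sequence of NVVM dialect operations roughly like the sketch below. This is illustrative only: the operand values, types, and per-instruction shapes are placeholders, not the actual lowering emitted by the pass.

```mlir
// Sketch of the lowering (operands and types are hypothetical):
nvvm.wgmma.fence.aligned                   // fence before the async group

// One nvvm.wgmma.mma_async per tile, iterating the shared-memory
// descriptors until the full 128x128 computation shape is covered.
%r0 = nvvm.wgmma.mma_async %descA_0, %descB_0, %acc_0 : ...
%r1 = nvvm.wgmma.mma_async %descA_1, %descB_1, %acc_1 : ...
// ... further instances as required by the shape ...

nvvm.wgmma.commit.group.sync.aligned       // commit the group of wgmma ops
nvvm.wgmma.wait.group.sync.aligned 0       // wait for all committed groups
```

The single nvgpu.wargroup.mma op thus replaces hand-written descriptor iteration plus the fence/commit/wait bookkeeping.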