This work introduces a new operation, nvgpu.wargroup.mma.store, to the NVGPU dialect of MLIR. Its purpose is to store the fragmented results of WGMMA into the given memref.
An example of the fragmentation is given here:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#wgmma-64n16-d
The nvgpu.wargroup.mma.store op does the following:
- Takes one or more fragmented result matrices.
- Calculates the per-thread indices within the warp group and stores the data into the given memref (see the index-calculation sketch after the example below).
Here's an example usage of the nvgpu.wargroup.mma.store operation:
%res1, %res2 = nvgpu.wargroup.mma ...
nvgpu.wargroup.mma.store [%res1, %res2], %matrixD :
    !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>,
    !nvgpu.warpgroup.result<tensor = !llvm.struct<...>>
    to memref<128x128xf32,3>
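For reference, below is a rough, illustrative sketch of the per-thread index calculation for the wgmma m64nNk16 D-matrix fragment layout documented at the PTX link above. It is not the actual lowering code; the function and variable names are hypothetical, and the layout is assumed to follow the figure in the PTX documentation.

// Rough sketch (not the actual lowering code) of the per-thread index
// calculation for the wgmma m64nNk16 D-matrix fragment layout.
#include <cstdio>

struct Coord { int row, col; };

// Maps (thread id within the warp group, register index) to the (row, col)
// of the accumulator element that the thread holds in the 64xN matrix D.
Coord dFragmentCoord(int threadId, int regId) {
  int warpId = threadId / 32;  // 4 warps of 32 threads per warp group
  int laneId = threadId % 32;
  // Each warp owns a 16-row slice; registers come in groups of 4 per
  // 8 columns, split over the two 8-row halves of the slice.
  int row = warpId * 16 + laneId / 4 + 8 * ((regId % 4) / 2);
  int col = 8 * (regId / 4) + (laneId % 4) * 2 + (regId % 2);
  return {row, col};
}

int main() {
  // Example: the element of D held by thread 37 in its register 5.
  Coord c = dFragmentCoord(/*threadId=*/37, /*regId=*/5);
  std::printf("row=%d col=%d\n", c.row, c.col);  // prints: row=17 col=11
  return 0;
}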
Depends on D158434