For the mma.sync Tensor Core lowering path with split-k reduction and multi-buffering, the mainloop's vector.transfer_read ops read from 4-D memrefs.
See example:
```mlir
%27 = vector.transfer_read %alloc_1[%20, %26, %c0, %c0], %cst_0 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3) -> (d1, d3)>} : memref<4x32x1x32xf16, #gpu.address_space<workgroup>>, vector<16x16xf16>
%28 = vector.transfer_read %alloc_1[%20, %26, %c0, %c16], %cst_0 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3) -> (d1, d3)>} : memref<4x32x1x32xf16, #gpu.address_space<workgroup>>, vector<16x16xf16>
%30 = vector.transfer_read %alloc_2[%20, %c0, %c0, %29], %cst_0 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3) -> (d3, d2)>} : memref<4x1x32x32xf16, #gpu.address_space<workgroup>>, vector<16x16xf16>
%31 = vector.transfer_read %alloc_2[%20, %c0, %c16, %29], %cst_0 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3) -> (d3, d2)>} : memref<4x1x32x32xf16, #gpu.address_space<workgroup>>, vector<16x16xf16>
```
Without this change, all four vector.transfer_read ops are marked as needing a transpose, so ldmatrix.trans is emitted for both matrixA and matrixB in the row-row matmul, even though only matrixB actually needs the transpose. This patch fixes that.
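To make the distinction concrete, here is a minimal sketch (not the actual patch) of how a lowering pass might classify such reads by looking at the permutation map's two result dimensions instead of comparing against an identity map of the memref's full rank; the helper name `isTransposedRead` and its surrounding structure are hypothetical.

```cpp
#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/IR/AffineMap.h"

using namespace mlir;

// Hypothetical helper: decide whether a 2-D vector.transfer_read from an
// arbitrary-rank memref is a transposed read. With split-k + multi-buffering
// the source memref is 4-D, so a rank-based identity-map check would
// mis-classify every read; instead, inspect only the two result dims.
static bool isTransposedRead(vector::TransferReadOp readOp) {
  AffineMap map = readOp.getPermutationMap();
  if (map.getNumResults() != 2)
    return false; // This sketch only handles 2-D (16x16 fragment) reads.
  // Assumes both results are plain dim exprs (no broadcast dimensions).
  unsigned pos0 = map.getDimPosition(0);
  unsigned pos1 = map.getDimPosition(1);
  // (d0, d1, d2, d3) -> (d1, d3): dims in increasing order, no transpose
  // (the matrixA reads above).
  // (d0, d1, d2, d3) -> (d3, d2): dims swapped, transpose needed
  // (the matrixB reads above), i.e. lower to ldmatrix.trans.
  return pos0 > pos1;
}
```

With a classification along these lines, only the matrixB reads would be lowered to ldmatrix.trans, matching the permutation maps in the IR above.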