This is an archive of the discontinued LLVM Phabricator instance.

[mlir][vector-to-gpu] Fix the Transpose Check in `mma.sync` VectorToGPU Lowering Path
ClosedPublic

Authored by manishucsd on Apr 6 2023, 5:13 PM.

Details

Summary

For the mma.sync Tensor Core lowering path with split-k reduction and multi-buffering, the mainloop's vector.transfer_read ops read from 4-D memref instances.

See example:

%27 = vector.transfer_read %alloc_1[%20, %26, %c0, %c0], %cst_0 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3) -> (d1, d3)>} : memref<4x32x1x32xf16, #gpu.address_space<workgroup>>, vector<16x16xf16>
%28 = vector.transfer_read %alloc_1[%20, %26, %c0, %c16], %cst_0 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3) -> (d1, d3)>} : memref<4x32x1x32xf16, #gpu.address_space<workgroup>>, vector<16x16xf16>

%30 = vector.transfer_read %alloc_2[%20, %c0, %c0, %29], %cst_0 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3) -> (d3, d2)>} : memref<4x1x32x32xf16, #gpu.address_space<workgroup>>, vector<16x16xf16>
%31 = vector.transfer_read %alloc_2[%20, %c0, %c16, %29], %cst_0 {in_bounds = [true, true], permutation_map = affine_map<(d0, d1, d2, d3) -> (d3, d2)>} : memref<4x1x32x32xf16, #gpu.address_space<workgroup>>, vector<16x16xf16>

Without this change, all four vector.transfer_read ops above are marked as requiring a transpose, resulting in ldmatrix.trans for both matrixA and matrixB in a row-row matmul, while only matrixB needs a transpose. This patch fixes the issue.
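For illustration only, here is a minimal sketch of a rank-aware transpose check. The function name isTransposedLoad and its exact structure are assumptions rather than the actual patch code; it relies only on standard MLIR APIs (vector::TransferReadOp::getPermutationMap, Builder::getAffineDimExpr, AffineMap::get). The idea is to build the expected transposed map from the permutation map's own dimension count instead of comparing against a fixed 2-D map.

#include "mlir/Dialect/Vector/IR/VectorOps.h"
#include "mlir/IR/AffineExpr.h"
#include "mlir/IR/AffineMap.h"
#include "mlir/IR/Builders.h"

// Hypothetical helper: returns true only when a 2-D vector.transfer_read swaps
// the two innermost dimensions of its (possibly higher-rank) source memref.
static bool isTransposedLoad(mlir::OpBuilder &b, mlir::vector::TransferReadOp op) {
  mlir::AffineMap map = op.getPermutationMap();
  if (map.getNumResults() != 2 || map.getNumDims() < 2)
    return false;
  unsigned nDim = map.getNumDims();
  mlir::AffineExpr outerDim = b.getAffineDimExpr(nDim - 2);
  mlir::AffineExpr innerDim = b.getAffineDimExpr(nDim - 1);
  // (d0, ..., dN-2, dN-1) -> (dN-1, dN-2) is the transposed minor map.
  mlir::AffineMap transposedMap = mlir::AffineMap::get(
      nDim, /*symbolCount=*/0, {innerDim, outerDim}, op.getContext());
  return map == transposedMap;
}

With the maps from the example above, such a check returns false for the matrixA reads ((d0, d1, d2, d3) -> (d1, d3)) and true for the matrixB reads ((d0, d1, d2, d3) -> (d3, d2)), so only matrixB lowers to ldmatrix.trans.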

Diff Detail

Event Timeline

manishucsd created this revision.Apr 6 2023, 5:13 PM
Herald added a project: Restricted Project.
manishucsd requested review of this revision.Apr 6 2023, 5:13 PM

Make sure to add a test.

mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp
661

nit: remove empty line

664–666

Either make it an assert or return false. We should not be emitting an error and then just continuing.

669

The rule is that you should use auto only when it is a cast where the type is explicitly stated, or when the type is complex (iterator-like), so here you need to spell out the type.

676

Same as above: emitting an error here doesn't sound right. Either we should have code preventing such a case earlier and then assert, or we should return false or propagate an error.

704

nit: /*transpose=*/ doesn't add much anymore; I would remove it.

ThomasRaoux added inline comments.Apr 7 2023, 10:16 AM
mlir/test/Conversion/VectorToGPU/vector-to-mma-ops-mma-sync.mlir
257 (On Diff #511714)

This comment doesn't really have meaning in the context of the upstream flow. It would be better to describe what problem the test is trying to catch.

271–309 (On Diff #511714)

Can you minimize the test? It should be as simple as possible while still catching the bug you are fixing.

Applied comments.

manishucsd added inline comments.Apr 7 2023, 12:16 PM
mlir/test/Conversion/VectorToGPU/vector-to-mma-ops-mma-sync.mlir
271–309 (On Diff #511714)

Updated the comment for the test.

Also, pruned the test so that it only checks the transpose flags on multi-dimensional vector.transfer_read ops:
vector.transfer_read for A (transpose=false), vector.transfer_read for B (transpose=true), and vector.transfer_read for C (transpose=false).
The mma.sync op is kept so that DCE does not remove the transfer_read ops; see the illustrative sketch below.
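For reference, an illustrative shape of such checks for the A and B operands could look roughly like the lines below (not the committed test; it assumes the transpose flag surfaces on the generated nvgpu.ldmatrix ops):

// A-operand reads lower to non-transposed loads.
// CHECK: nvgpu.ldmatrix {{.*}} transpose = false
// B-operand reads lower to transposed loads.
// CHECK: nvgpu.ldmatrix {{.*}} transpose = true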

manishucsd marked 6 inline comments as done.Apr 7 2023, 12:16 PM

Looks good, one last nit

mlir/test/Conversion/VectorToGPU/vector-to-mma-ops-mma-sync.mlir
278 (On Diff #511743)

Replace %c0 so the checks do not assume specific SSA value naming, here and below.
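For example (FileCheck technique only; the op and operands below are placeholders, not the actual test content), the constant can be captured once and the capture reused instead of matching the literal name:

// CHECK: %[[C0:.+]] = arith.constant 0 : index
// CHECK: vector.transfer_read %{{.*}}[%[[C0]], %[[C0]]]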

manishucsd updated this revision to Diff 511796.Apr 7 2023, 2:13 PM
manishucsd marked 2 inline comments as done.Apr 7 2023, 3:17 PM
ThomasRaoux accepted this revision.Apr 9 2023, 10:52 PM
This revision is now accepted and ready to land.Apr 9 2023, 10:52 PM
This revision was automatically updated to reflect the committed changes.