This is an archive of the discontinued LLVM Phabricator instance.

Support `transpose` mode for `gpu.subgroup` WMMA ops
ClosedPublic

Authored by navdeepkk on Nov 30 2022, 8:50 AM.

Details

Summary

Add support for loading, computing, and storing gpu.subgroup WMMA ops
in transpose mode. Update the GPU-to-NVVM lowerings to support
transpose mode, and update the integration tests accordingly.
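
For illustration, a rough sketch of how a GPU-to-NVVM lowering can map such transpose flags onto NVVM fragment layouts (a sketch only; the accessor names are hypothetical and need not match this diff):

  // Hypothetical sketch: derive the NVVM MMA fragment layouts from the
  // GPU-level transpose flags when lowering the compute op.
  NVVM::MMALayout aLayout =
      op.getATranspose() ? NVVM::MMALayout::col : NVVM::MMALayout::row;
  NVVM::MMALayout bLayout =
      op.getBTranspose() ? NVVM::MMALayout::col : NVVM::MMALayout::row;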

Diff Detail

Event Timeline

navdeepkk created this revision. Nov 30 2022, 8:50 AM
navdeepkk requested review of this revision. Nov 30 2022, 8:50 AM
navdeepkk updated this revision to Diff 478995. Nov 30 2022, 9:45 AM

Clang-format

ThomasRaoux added inline comments.
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1162

Please add some doc (or move it from below).

1182–1188

Do you understand why the compute operation has to know that? The transposition should happen during the load; I don't understand why we need to know it at that point. It looks like in your test you don't set it and it works fine.
Maybe @christopherbate would know?

1231–1232

This doc seems to be in the wrong spot?

mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
13

We should also test the non-transposed case. Is it somewhere?

ThomasRaoux requested changes to this revision. Nov 30 2022, 10:28 AM
ThomasRaoux added inline comments.
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp
156

Please also change the lowering to SPIR-V to avoid a miscompile (it's fine to fail the lowering pattern, but silently ignoring the new field is wrong).
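
For example, something along these lines in the SPIR-V pattern (a sketch; getTranspose is a hypothetical accessor name):

  // Fail the pattern explicitly instead of silently dropping the new field.
  if (op.getTranspose())
    return rewriter.notifyMatchFailure(op, "transpose is not supported");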

This revision now requires changes to proceed. Nov 30 2022, 10:28 AM
bondhugula added inline comments. Nov 30 2022, 5:43 PM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1182–1188

I didn't understand this question. Please see the NVVM lowering below. Why would the LLVM/NVVM builder take aLayout and bLayout in that case?

rewriter.replaceOpWithNewOp<NVVM::WMMAMmaOp>(
-      op, adaptor.getOpC().getType(), m, n, k, layout, layout, sourceType,
+      op, adaptor.getOpC().getType(), m, n, k, aLayout, bLayout, sourceType,
       destType, unpackedOps);

It looks like the mma intrinsic allows you to multiply elements in the transposed order and this is orthogonal to the load. So, you can load in whatever way, but do the multiplication on the transpose of the values if desired?

1228–1232

Doc added in the wrong place.

mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
96

As a convention, attributes like these should use snake_case: a_transpose?

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir
30

How is this change related/needed? The comment above would now be inaccurate.

ThomasRaoux added inline comments. Nov 30 2022, 5:51 PM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1182–1188

> It looks like the mma intrinsic allows you to multiply elements in the transposed order and this is orthogonal to the load. So, you can load in whatever way, but do the multiplication on the transpose of the values if desired?

From the PTX spec, this doesn't sound like what it is:

> The qualifiers .alayout and .blayout must match the layout specified on the wmma.load instructions that produce the contents of operands a and b respectively. Similarly, the qualifiers .atype, .btype and .ctype must match the corresponding qualifiers on the wmma.load instructions that produce the contents of operands a, b and c respectively.

https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma

I'm guessing this is a way to allow a more flexible implementation of WMMA, but it makes the representation awkward. Interestingly enough, the SPIR-V equivalent doesn't have this restriction. This is why I wanted to understand whether we can simplify the design.

bondhugula added inline comments. Nov 30 2022, 5:58 PM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1182–1188

Shouldn't the LLVM (and NVVM) dialects mimic the underlying intrinsics 1:1? If the underlying intrinsic expects it and requires it to be matched, the MLIR counterpart shouldn't remove it but should expose the same requirements on the attributes. This should also be documented.

ThomasRaoux added inline comments. Nov 30 2022, 6:18 PM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
1182–1188

The LLVM and NVVM dialects should mimic the intrinsics 1:1, but the GPU dialect can try to abstract away those API-specific details if possible.

Anyway, I don't have a good solution, so it is fine to leave as is if nobody has a better idea; however, the current approach feels very error-prone.

navdeepkk updated this revision to Diff 479141. Nov 30 2022, 7:34 PM
navdeepkk marked 7 inline comments as done.

Address comments

navdeepkk updated this revision to Diff 479147. Nov 30 2022, 8:15 PM
navdeepkk marked 5 inline comments as done.

Update SPIR-V lowerings.

ThomasRaoux accepted this revision. Dec 1 2022, 9:07 AM
This revision is now accepted and ready to land. Dec 1 2022, 9:07 AM

Several minor improvements are needed on the readability side.

mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp
79–80

Please use a ternary operator; it's more readable and compact.
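
For example (a sketch with placeholder names, not the actual code in this file):

  // The suggested ternary form:
  auto frag = isTranspose ? colMajorFrag : rowMajorFrag;

instead of an if/else block.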

156

Likewise.

227–233

Likewise.

mlir/lib/Conversion/GPUToSPIRV/WmmaOpsToSPIRV.cpp
119–120

You can fix the typo in the variable name while you're at it.

120

useColMajor -> isColumnMajor, to be consistent with the one below.

mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp
476–478

Use the `/*arg=..*/` style for the last one.
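
That is, LLVM's inline argument-comment convention, e.g. (a hypothetical call):

  // Name the trailing argument inline at the call site:
  createLoadOp(builder, loc, /*isTranspose=*/false);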

804

Likewise.

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir
25

with the column index

navdeepkk updated this revision to Diff 479566. Dec 2 2022, 2:33 AM
navdeepkk marked 8 inline comments as done.

Address comments

navdeepkk updated this revision to Diff 479820. Dec 3 2022, 2:59 AM

Remove whitespace errors.

navdeepkk updated this revision to Diff 479973. Dec 4 2022, 9:04 PM

Sync with upstream