This is an archive of the discontinued LLVM Phabricator instance.

[MLIR][GPU] Add NvGpu mma.sync path to the VectorToGPU pass
ClosedPublic

Authored by christopherbate on Apr 1 2022, 2:11 PM.

Details

Summary

This change adds an option to lower to NvGpu dialect ops during the
VectorToGPU conversion pass. Because this transformation reuses
existing VectorToGPU logic, a separate VectorToNvGpu conversion pass is
not created; instead, the option use-nvgpu is added to the VectorToGPU
pass. When it is true, the pass attempts to convert slices rooted at
vector.contract operations into nvgpu.mma.sync ops, and
vector.transfer_read ops are converted to either nvgpu.ldmatrix or
one or more vector.load operations. The specific data loaded by each
thread depends on its thread id within a subgroup (warp). These index
calculations depend on the data type and shape of the MMA op, as
specified by the downstream PTX documentation. The code supporting
these details is separated into NvGpuSupport.cpp|h.
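
For orientation, here is a minimal sketch of the kind of IR this path produces. The fragment shapes, numTiles values, and memref types below follow the m16n8k16 f16 case and are illustrative assumptions, not excerpts from the patch or its tests:

    // Per-thread operand fragments loaded from shared memory (address space 3).
    %a = nvgpu.ldmatrix %A[%c0, %c0] {numTiles = 4 : i32, transpose = false}
         : memref<16x16xf16, 3> -> vector<4x2xf16>
    %b = nvgpu.ldmatrix %B[%c0, %c0] {numTiles = 2 : i32, transpose = true}
         : memref<16x8xf16, 3> -> vector<2x2xf16>
    // %c is the accumulator fragment held by this thread (vector<2x2xf16>).
    // The vector.contract at the root of the slice becomes a single mma.sync.
    %d = nvgpu.mma.sync(%a, %b, %c) {mmaShape = [16, 8, 16]}
         : (vector<4x2xf16>, vector<2x2xf16>, vector<2x2xf16>) -> vector<2x2xf16>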

Diff Detail

Event Timeline

christopherbate created this revision. Apr 1 2022, 2:11 PM
Herald added a project: Restricted Project.
christopherbate requested review of this revision. Apr 1 2022, 2:11 PM

Removed dead code

Fix ldmatrix use condition to check address space

Thanks for contributing these. Can you separate out the VectorToGPU changes into another revision?

mlir/include/mlir/Conversion/VectorToGPU/VectorToGPU.h
20–24

Can you please document useWmma here?

mlir/include/mlir/Dialect/GPU/GPUDialect.h
172 (On Diff #420382)

Doc comment here.

Thanks for contributing these. Can you separate out the VectorToGPU changes into another revision?

OK, I will separate this into two different revisions

Need to rebase this on the recent move of the mma.sync and ldmatrix ops from the gpu dialect to the nvgpu dialect. Should have this updated tomorrow morning.

Rebased on top-of-tree. The first part of this patch was already merged
(the ldmatrix and mma.sync ops in the nvgpu dialect). What remains are the
lowerings from vector to nvgpu.

christopherbate added a comment (edited). Apr 26 2022, 8:02 PM

I introduced an option use-nvgpu to the VectorToGPU pass rather than introducing a VectorToNvGPU pass. The reason is that a fair amount of code can be shared. I think we should still break this conversion into its own pass (VectorToNvGPU) and factor out the common utilities that can be shared by both. Pending discussion/input from others, I'm happy to do that in this patch, or land this patch and follow up with the creation of the new pass (or keep it as is/something else).
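
For reference, enabling the new path looks something like the following RUN line, assuming the standard mlir-opt pass-option syntax (the exact invocation in the committed test may differ):

    // RUN: mlir-opt %s -convert-vector-to-gpu="use-nvgpu=true" -canonicalize | FileCheck %s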

Add documentation / address comments.

mlir/include/mlir/Conversion/VectorToGPU/VectorToGPU.h
20–24

Done. I can add more detail if required, but basically the canonical forms of the contractions differ between the two paths, which is why this is required.
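
Concretely, the difference comes down to which indexing maps each path treats as canonical for the vector.contract. A minimal sketch (the maps below are the usual conventions, not copied from the patch):

    // "row-row-row" (wmma canonical form): B is indexed as (k, n).
    indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                     affine_map<(m, n, k) -> (k, n)>,
                     affine_map<(m, n, k) -> (m, n)>]

    // "row-col-row" (mma.sync canonical form): B is indexed as (n, k).
    indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                     affine_map<(m, n, k) -> (n, k)>,
                     affine_map<(m, n, k) -> (m, n)>]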

christopherbate edited the summary of this revision. Apr 26 2022, 8:14 PM
nirvedhmeshram added inline comments.
mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp
83

Why is this ordering different from the wmma case?

nirvedhmeshram added inline comments. May 2 2022, 1:46 PM
mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp
826

Isn't this always false, so the nvgpu sub-path is never used?

nirvedhmeshram added inline comments. May 2 2022, 2:11 PM
mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp
826

I see that for the nvgpu path another function is provided. Sorry about the noise.

christopherbate marked 3 inline comments as done. May 5 2022, 12:36 PM
christopherbate added inline comments.
mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp
83

The wmma code treats the canonical layout as "row-row-row", whereas we need "row-col-row", because "row-col-row" is what the actual mma.sync instructions expect. If the user provides data as "row-row-row", we need to ensure that a transpose operation is manifested in the IR (and merged into the vector.transfer_read). This then serves as an indicator that the elements should be transposed during loading, either by passing the "transpose = true" attribute to nvgpu.ldmatrix or by loading the elements individually in a transposed manner.
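
As a sketch of what that looks like in IR (a hypothetical B-operand read; the shapes, types, and numTiles value are illustrative and not taken from the patch):

    // Row-major data read through a transposing permutation map.
    %b = vector.transfer_read %B[%c0, %c0], %cst
           {in_bounds = [true, true],
            permutation_map = affine_map<(d0, d1) -> (d1, d0)>}
           : memref<16x8xf16, 3>, vector<8x16xf16>

    // The transpose folded into the transfer_read signals that the load itself
    // should transpose the elements:
    %frag = nvgpu.ldmatrix %B[%c0, %c0] {numTiles = 2 : i32, transpose = true}
            : memref<16x8xf16, 3> -> vector<2x2xf16>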

This looks good to me, thanks for adding this! A few minor comments.

mlir/include/mlir/Conversion/VectorToGPU/NvGpuSupport.h
1 (On Diff #425402)

Is this header meant to be public? If not, I think it would be better to move it into lib/Conversion/VectorToGPU. There are precedents for such cases in lib/Conversion if you want to see examples.

30 (On Diff #425402)

I would remove this namespace, as it isn't consistent with how other dialect conversions are written.

mlir/lib/Conversion/VectorToGPU/NvGpuSupport.cpp
31

Could we avoid using gpu::MMAMatrixType at all during conversion to nvgpu ops? This type is meant to be used with the wmma operations, and adding a dependency is not ideal. I think we could just use a simple structure to represent the matrix kind we want.

105

You don't really need a builder just to create an AffineMap, as it is not an operation. It seems like the builder is only used to get the context; you can get that from elementType instead.

216

nit: spell out type

mlir/test/Conversion/VectorToGPU/vector-to-mma-ops-mma-sync.mlir
287

nit: missing newline

ThomasRaoux added inline comments. May 11 2022, 6:23 AM
mlir/test/Conversion/VectorToGPU/vector-to-mma-ops-mma-sync.mlir
8

nit: Note that the trick to make CHECK-LABEL work with map variables is to do:

// CHECK-DAG: [[$rowA0_map:#.+]] = affine_map<()[s0] -> (s0 mod 16 + 1)>

This way the variables are not reset by CHECK-LABEL.
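
For instance (the function name and the affine.apply use below are made up for illustration), a $-prefixed variable defined before the label remains usable after it, because FileCheck treats $-prefixed variables as global and only clears the non-global ones at each CHECK-LABEL:

    // CHECK-DAG: [[$rowA0_map:#.+]] = affine_map<()[s0] -> (s0 mod 16 + 1)>

    // CHECK-LABEL: func @m16n8k16_fp16
    //       CHECK: affine.apply [[$rowA0_map]]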

237

should be [[C0]]

christopherbate marked an inline comment as done.

Small functional changes and additional tests based on TF32 and Int8 testing.

christopherbate retitled this revision from [MLIR][GPU] Add VectorToGPU, GPU to nvvm.mma.sync lowering path to [MLIR][GPU] Add NvGpu mma.sync path to the VectorToGPU pass. May 13 2022, 9:11 PM
christopherbate edited the summary of this revision.

I think this should complete any functional updates. I still need to address Thomas' comments.

christopherbate marked an inline comment as done.
christopherbate retitled this revision from [MLIR][GPU] Add NvGpu mma.sync path to the VectorToGPU pass to [MLIR][GPU] Add NvGpu mma.sync path to the VectorToGPU pass.

Address reviewer comments:
Remove use of GPU Dialect's WarpMatrixFragmentType in NvGpu conversion path.
Fix variable naming in FileCheck tests to allow for CHECK-LABEL.
Cleanup misc typos/small fixes.

christopherbate marked 5 inline comments as done and an inline comment as not done. May 17 2022, 4:59 PM

Addressed Thomas' comments.

mlir/lib/Conversion/VectorToGPU/NvGpuSupport.cpp
216

I think I fixed this; let me know if there's still an issue.

This revision is now accepted and ready to land. May 18 2022, 3:10 PM
This revision was automatically updated to reflect the committed changes.
christopherbate marked an inline comment as done.