This changes adds the option to lower to NvGpu dialect ops during the
VectorToGPU convsersion pass. Because this transformation reuses
existing VectorToGPU logic, a seperate VectorToNvGpu conversion pass is
not created. The option use-nvgpu is added to the VectorToGPU pass.
When this is true, the pass will attempt to convert slices rooted at
vector.contract operations into nvgpu.mma.sync ops, and
vector.transfer_read ops are converted to either nvgpu.ldmatrix or
one or more vector.load operations.  The specific data loaded will
depend on the thread id within a subgroup (warp). These index
calculations depend on data type and shape of the MMA op
according to the downstream PTX specification. The code for supporting
these details is separated into NvGpuSupport.cpp|h.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Need to rebase this on the most recent movement of mma.sync and ldmatrix ops from the gpu dialect to nvgpu dialect. should have this updated tomorrow morning.
Rebased on top-of-tree. The first part of this patch was already merged
(the ldmatrix and mma.sync ops in nvgpu dialect). What remains is the lowerings
from vector to nvgpu.
I introduced an option use-nvgpu to the VectorToGPU pass rather than introducing a VectorToNvGPU pass. The reason is because a fair amount of code can be shared. I think we should still break this conversion into its own pass (VectorToNvGPU) and factor out the common utilities that can be shared by both. Pending discussion/input from others, I'm happy to do that in this patch or land this patch and follow up with the creation of the new pass (or keep it as is/something else)
Add documentation / address comments.
| mlir/include/mlir/Conversion/VectorToGPU/VectorToGPU.h | ||
|---|---|---|
| 21–22 | Done. I can add more detail if required, but basically the canonical forms of the contractions differ, so that is why this is required. | |
| mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp | ||
|---|---|---|
| 88 | why is this ordering different than the wmma case? | |
| mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp | ||
|---|---|---|
| 844–845 | Isnt this always false so nvgpu sub-path is never used? | |
| mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp | ||
|---|---|---|
| 844–845 | I see that for the nvgpu path another function is provided. Sorry about the noise. | |
| mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp | ||
|---|---|---|
| 88 | The wmma code is treating the canonical layout as "row-row-row" whereas we need "row-col-row", because "row-col-row" is what the actual mma.sync instructions are asking for. If the user is giving data as "row-row-row", we need to ensure that a transpose operation is manifested in the IR (and merged into the vector.transfer_read). This then serves as an indicator that the elements should be transposed during loading (by either passing the "transpose=true" attribute to nvgpu.ldmatrix or loading the elements individually in a transposed manner. | |
This looks good to me thanks for adding this! Few minor comments.
| mlir/include/mlir/Conversion/VectorToGPU/NvGpuSupport.h | ||
|---|---|---|
| 1 ↗ | (On Diff #425402) | Is this header meant to be public? If not I think it would be better to move it in lib/Conversion/VectorToGPU. There are precedents of such cases in lib/Conversion if you want to see examples. | 
| 30 ↗ | (On Diff #425402) | I would remove this namespace as it isn't consistent with how other dialect conversion are written. | 
| mlir/lib/Conversion/VectorToGPU/NvGpuSupport.cpp | ||
| 30 ↗ | (On Diff #425402) | Could we not use gpu::MMAMatrixType at all during conversion to nvppu ops, this type is means to be used with the wmma operations and adding a dependency is not ideal, I think we could just use a simple structure to represent the matrix kind we want. | 
| 104 ↗ | (On Diff #425402) | You don't really need a builder just to create an AffineMap as it is not an operation. It seems like it is only used to get the context, you can get it from elementType instead | 
| 215 ↗ | (On Diff #425402) | nit: spell out type | 
| mlir/test/Conversion/VectorToGPU/vector-to-mma-ops-mma-sync.mlir | ||
| 287 | nit: missing newline | |
I think this should complete any functional updates. I still need to address Thomas' comments.
Address reviewer comments:
Remove use of GPU Dialect's WarpMatrixFragmentType in NvGpu conversion path.
Fix variable naming in FileCheck tests to allow for CHECK-LABEL.
Cleanup misc typos/small fixes.
Addressed Thomas' comments.
| mlir/lib/Conversion/VectorToGPU/NvGpuSupport.cpp | ||
|---|---|---|
| 215 ↗ | (On Diff #425402) | I think I fixed this, let me know if there's still an issue. | 
Can you please document useWmma here?