This is an archive of the discontinued LLVM Phabricator instance.

Add Lowerings for GPU WMMA F16/F32 ops to ROCDL dialect
Needs Review · Public

Authored by navdeepkk on Aug 6 2023, 6:39 AM.

Details

Summary

The following support is added:
1.) Lowering for the GPU WMMA load op for AOp, BOp, and COp. The lowering supports transposed and non-transposed loads for AOp and BOp; only non-transposed loads are supported for COp. COp loads also support the opSelect bit. (See the sketch below.)
2.) Lowering for the GPU WMMA mma op with support for the opSelect bit.
3.) Lowering for the GPU WMMA store op with support for the opSelect bit.
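
For reference, a minimal sketch of the GPU dialect ops this patch handles. Shapes, the leadDimension values, and the placement of the `transpose` attribute are illustrative assumptions, not taken from the patch itself:

```mlir
func.func @wmma_f32_16x16x16_f16(%matA: memref<16x16xf16>,
                                 %matB: memref<16x16xf16>,
                                 %matC: memref<16x16xf32>) {
  %c0 = arith.constant 0 : index
  // Non-transposed AOp load.
  %a = gpu.subgroup_mma_load_matrix %matA[%c0, %c0] {leadDimension = 16 : index}
      : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
  // Transposed BOp load (assuming the unit attribute is spelled `transpose`).
  %b = gpu.subgroup_mma_load_matrix %matB[%c0, %c0]
      {leadDimension = 16 : index, transpose}
      : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
  // COp load: only the non-transposed form is supported.
  %c = gpu.subgroup_mma_load_matrix %matC[%c0, %c0] {leadDimension = 16 : index}
      : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
  %d = gpu.subgroup_mma_compute %a, %b, %c
      : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
      -> !gpu.mma_matrix<16x16xf32, "COp">
  gpu.subgroup_mma_store_matrix %d, %matC[%c0, %c0] {leadDimension = 16 : index}
      : !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>
  return
}
```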

Diff Detail

Event Timeline

navdeepkk created this revision. Aug 6 2023, 6:39 AM
Herald added a reviewer: dcaballe.
Herald added a project: Restricted Project.
navdeepkk requested review of this revision. Aug 6 2023, 6:39 AM
navdeepkk edited the summary of this revision. Aug 6 2023, 6:42 AM
krzysz00 requested changes to this revision. Aug 6 2023, 7:52 AM
  1. In order to maintain good abstractions, could you refactor this into a GPUToAMDGPU pass instead, targeting the amdgpu.wmma operation, which will then be lowered to rocdl? This should make the code less fragile and mean that we only have one place to change if we need to adjust the compiler intrinsics in the future.
  2. https://reviews.llvm.org/D154666 defines a lowering for lane_id, please use it
  3. Minor comments on test organization

I'm willing to drop point 1 here if you're particularly opposed to it

mlir/test/CMakeLists.txt
32

Could this be a derived flag that defaults to "ROCm integration tests enabled and chipset is known to support WMMA"?

mlir/test/Integration/GPU/ROCM/WMMA/wmma_f32_16_16_16_f16_a_b_transpose.mlir
5

Please don't hardcode the chip in here and use %chip instead like in the other ROCm integration tests.

This revision now requires changes to proceed. Aug 6 2023, 7:52 AM
> 1. In order to maintain good abstractions, could you refactor this into a GPUToAMDGPU pass instead, targeting the amdgpu.wmma operation, which will then be lowered to rocdl? This should make the code less fragile and mean that we only have one place to change if we need to adjust the compiler intrinsics in the future.
> 2. https://reviews.llvm.org/D154666 defines a lowering for lane_id, please use it
> 3. Minor comments on test organization
>
> I'm willing to drop point 1 here if you're particularly opposed to it

Thanks for the review. I'll work on addressing 2 and 3. Regarding point 1: since the GPU dialect is supposed to serve as a common abstraction for both AMD and Nvidia GPUs, we can lower it directly to the ROCDL intrinsics. I am not sure the AMDGPU dialect ops will deviate significantly from the GPU dialect ops already present. If we go from the GPU dialect to the AMDGPU dialect and the AMDGPU ops capture more information, we would have to carry that information in the GPU dialect ops somehow as well. Alternatively, we could skip the GPU dialect completely, which I don't think is a good idea, because frontends would then need two different mechanisms (likely sharing significant code) to target AMD or Nvidia GPUs. The current patch lets users generate GPU dialect ops and lower them to either AMD or Nvidia GPUs by switching a single pass. Please let me know if there is a stronger motivation or if I am missing something.

Thanks!

Re 1: The motivation is that, while the rocdl dialect provides direct access to the LLVM intrinsics, the amdgpu dialect provides a wrapper around those intrinsics that is more MLIR-flavored. This means that only one place in the code needs to know the exact details of WMMA, such as any relevant type mangling (e.g. the fact that i8 inputs should be turned into i64s).

I would like to leverage this abstraction when lowering from the GPU dialect.

That is, the GPU dialect abstracts over the amdgpu and rocdl dialects.

(Having GPU-to-AMDGPU would also make selecting MFMA vs WMMA vs ... easier)
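
To make the proposed two-stage split concrete, a hedged sketch follows. The assembly formats are written from memory and the exact in-tree syntax may differ:

```mlir
func.func @stage1(%a: vector<16xf16>, %b: vector<16xf16>,
                  %c: vector<8xf32>) -> vector<8xf32> {
  // gpu-to-amdgpu produces the MLIR-flavored wrapper op...
  %d = amdgpu.wmma %a * %b + %c
      : vector<16xf16>, vector<16xf16>, vector<8xf32>
  // ...and amdgpu-to-rocdl later rewrites it into the raw intrinsic, so any
  // type mangling lives in exactly one place, e.g.:
  //   %d = rocdl.wmma.f32.16x16x16.f16 %a, %b, %c
  //       : (vector<16xf16>, vector<16xf16>, vector<8xf32>) -> vector<8xf32>
  return %d : vector<8xf32>
}
```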

> Re 1: The motivation is that, while the rocdl dialect provides direct access to the LLVM intrinsics, the amdgpu dialect provides a wrapper around those intrinsics that is more MLIR-flavored. This means that only one place in the code needs to know the exact details of WMMA, such as any relevant type mangling (e.g. the fact that i8 inputs should be turned into i64s).
>
> I would like to leverage this abstraction when lowering from the GPU dialect.
>
> That is, the GPU dialect abstracts over the amdgpu and rocdl dialects.
>
> (Having GPU-to-AMDGPU would also make selecting MFMA vs WMMA vs ... easier)

Thanks. This sheds more light on this. I am trying to understand whether the GPU dialect ops themselves can capture, via attributes, all the information needed to lower them to ROCDL ops (for both WMMA and MFMA). If so, the lowering from the GPU dialect to ROCDL would be the single place where all the information/logic resides. If it is not possible to represent all of the WMMA and MFMA functionality in the GPU dialect ops, then we can refactor this patch into two stages. Another point is that there is no intrinsic for the load and store ops. Would we even want amdgpu counterparts for them, or should we keep the GPU dialect ops and lower them to ROCDL/LLVM transparently?

On a separate note, AFAIK there is currently no path to lower MFMA ops from the GPU dialect to the amdgpu dialect. Do we plan to add that support, or are users supposed to emit amdgpu dialect ops directly?

krzysz00 added a subscriber: sjw36. Aug 14 2023, 7:18 AM

The reason I want to lower the matrix ops from GPU to AMDGPU + memref/vector + ... instead of to ROCDL and LLVM directly is separation of concerns and progressive lowering.

gpu-to-amdgpu lowers GPU matrix operations to MLIR constructs that abstract away some of the fiddlier details of the underlying intrinsics (such as how F16 * F16 => F16 doesn't return a vector of F16 (with half its entries updated) but instead a vector of 32-bit values with the low or high halves updated).

Going straight from GPU to ROCDL means you have to copy all the complexity that's present in amdgpu-to-rocdl, and I don't want to maintain two copies of the same code if I don't have to, hence my request to do a two-stage lowering.

And re load and store: unlike on Nvidia, WMMA is just an instruction; loads and stores should be lowered to operations on a private memory attribution (aka an alloca()) that replaces each GPU matrix value.
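
As a hedged illustration of that load/store point: the function name, the 8-element fragment size, and the indexing below are assumptions for illustration, not the real WMMA lane layout.

```mlir
func.func @lower_c_load(%src: memref<16x16xf32>) {
  // Private buffer standing in for a !gpu.mma_matrix<16x16xf32, "COp"> value.
  %buf = memref.alloca() : memref<8xf32, #gpu.address_space<private>>
  %lane = gpu.lane_id
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %c8 = arith.constant 8 : index
  %c16 = arith.constant 16 : index
  %col = arith.remui %lane, %c16 : index
  // Each lane copies its fragment elements with plain loads/stores; no
  // special intrinsic is involved.
  scf.for %i = %c0 to %c8 step %c1 {
    %row = arith.muli %i, %c2 : index
    %v = memref.load %src[%row, %col] : memref<16x16xf32>
    memref.store %v, %buf[%i] : memref<8xf32, #gpu.address_space<private>>
  }
  return
}
```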

As to MFMA, @sjw36 for more there since he's been poking at that lowering

Just wanted to check on what you're planning here

> Just wanted to check on what you're planning here

Hi @krzysz00, I will take all the suggestions and update this patch in a few days.

navdeepkk updated this revision to Diff 557107. Sep 20 2023, 5:55 AM

Address second major comment on the revision. Use gpu.lane_id lowering to get
laneID while converting WMMA ops.

navdeepkk updated this revision to Diff 557276. Sep 24 2023, 1:14 AM
navdeepkk marked 2 inline comments as done.

Add a separate pass to convert gpu.subgroup_mma_compute op to amdgpu.wmma op

navdeepkk updated this revision to Diff 557277. Sep 24 2023, 1:21 AM

Add missing commits

navdeepkk updated this revision to Diff 557278. Sep 24 2023, 1:26 AM

Format comment in LowerGPUOpsToAMDGPUOps.cpp

I think this is headed in a good direction; I just have some thoughts, mostly about how the pass options were set up.

mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h
72

I don't want opSelect here, though I'm willing to allow warpSize so long as that becomes a module attribute that SerializeToHsaco picks up on for correct device library linking/option setting.

Anything that's requiring reasoning about opSelect should be moved up to gpu-to-amdgpu.

Heck, you probably can and should implement the case where you use a sequence of two WMMA ops, one writing low halves and one writing high halves, to do a larger matrix multiplication that produces fp16 values.
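
To sketch what that could look like, assuming the opsel bit is exposed on amdgpu.wmma as a subwordOffset-style attribute (the attribute name is from memory):

```mlir
// Two f16 WMMAs packed into the same accumulator registers: the first
// writes the low halves, the second writes the high halves.
func.func @paired_f16_wmma(%a0: vector<16xf16>, %b0: vector<16xf16>,
                           %a1: vector<16xf16>, %b1: vector<16xf16>,
                           %acc: vector<16xf16>) -> vector<16xf16> {
  %lo = amdgpu.wmma %a0 * %b0 + %acc
      : vector<16xf16>, vector<16xf16>, vector<16xf16>
  %hi = amdgpu.wmma %a1 * %b1 + %lo {subwordOffset = 1 : i32}
      : vector<16xf16>, vector<16xf16>, vector<16xf16>
  return %hi : vector<16xf16>
}
```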

mlir/include/mlir/Conversion/Passes.td
432

I'm not convinced this needs to be here.

You could implement a larger subgroup MMA operation that uses both halves of the output registers.

Alternatively, you could make amdgpu.opSelect an attribute on the relevant operations instead of making it a pass-wide knob.

518

This does make some sense to have here, yeah

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
221

Not sure we'll need these?

mlir/lib/Conversion/GPUToAMDGPU/WmmaOpsToAMDGPU.cpp
85

(To clarify, this could be two compute ops, one with opSelect = false and one with opSelect = true)

165

Do we need this here?

(also, if you'll have this here, you'll want the data layout incantations somewhere in the pass)

mlir/lib/Conversion/GPUToROCDL/WmmaOpsToROCDL.cpp
142

I'm not sure why these are *LaneFirst functions. Could you expand on that?

257

Grammar nit here, perhaps "only size 32 wavefronts are supported"

425

That's too strong a restriction - WMMA is available on all the gfx11NN GPUs

487

There's something suspicious about this being here

mlir/test/CMakeLists.txt
38

Yeah, let's loosen this up and have it check for prefix gfx11 or something like that.
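
Something along these lines, perhaps. This is only a sketch; the variable names are assumptions, not the actual contents of mlir/test/CMakeLists.txt:

```cmake
# Accept any gfx11-series chipset rather than requiring an exact name.
if(MLIR_ENABLE_ROCM_RUNNER AND ROCM_TEST_CHIPSET MATCHES "^gfx11")
  set(MLIR_RUN_WMMA_TESTS ON)
endif()
```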

42

This should be a user-configurable option?

navdeepkk updated this revision to Diff 557722. Oct 17 2023, 1:49 AM
navdeepkk marked 11 inline comments as done.

Address comments and implement upper/lower half FP16 WMMA

@krzysz00 Please review.

mlir/include/mlir/Conversion/Passes.td
432

Thanks, I wasn't aware of this use case. I was under the impression that it was the job of the underlying register allocator to utilize the upper and lower halves of the registers properly. But if that isn't the case, then it makes sense to have this as an op attribute and generate code with the opSelect attribute both set and unset.

mlir/include/mlir/Dialect/LLVMIR/ROCDLOps.td
221

These are actually needed currently to check some conditions in the lowerings.

mlir/lib/Conversion/GPUToAMDGPU/WmmaOpsToAMDGPU.cpp
165

This can be dropped actually. Thanks.

mlir/lib/Conversion/GPUToROCDL/WmmaOpsToROCDL.cpp
142

It's just a naming convention I used, as the laneId is the innermost position of the indexing calculation. I can rename it if it's too confusing.

425

Are the thread data mappings the same for them too? Can you please point me to a reference? This patch has only been tested on gfx1100; it can be extended and tested later.

mlir/test/CMakeLists.txt
38

I have relaxed the checks in the passes. I think this variable has to have the complete name. If gfx1100 is the minimum series that supports WMMA, then I think this will not be problematic. Code generated for gfx1100 will work on any gfx11-series card, right?