The following support is added:
1.) Lowering for the GPU WMMA load op for AOp, BOp, and COp. The lowering supports transposed and non-transposed loads for AOp and BOp. Only non-transposed loads are supported for COp. Loading for COp also supports the opSelect bit.
2.) Lowering for the GPU WMMA mma op with support for the opSelect bit.
3.) Lowering for the GPU WMMA store op with support for the opSelect bit.
- In order to maintain good abstractions, could you refactor this into a GPUToAMDGPU pass instead, targeting the amdgpu.wmma operation, which will then be lowered to rocdl? This should make the code less fragile and mean that we only have one place to change if we need to adjust the compiler intrinsics in the future.
- https://reviews.llvm.org/D154666 defines a lowering for lane_id, please use it
- Minor comments on test organization
I'm willing to drop point 1 here if you're particularly opposed to it
Could this be a derived flag that defaults to "ROCm integration tests enabled and chipset is known to support WMMA"?
Please don't hardcode the chip in here and use %chip instead like in the other ROCm integration tests.
Thanks for the review. I'll work on addressing 2 and 3. Regarding point 1: since the GPU dialect is supposed to serve as a common abstraction for both AMD and Nvidia GPUs, we can just lower it to the ROCDL intrinsics. I am not sure the AMD GPU intrinsics will deviate significantly from the GPU dialect ops already present. If we decide to go from the GPU dialect to the AMDGPU dialect intrinsics, and the AMDGPU dialect intrinsics capture more information, we would have to represent that information in the GPU dialect ops somehow as well. Or we could skip the GPU dialect completely, which I don't think would be a good idea, as the frontends would need two different mechanisms (possibly sharing significant code) to map to AMD or Nvidia GPUs. The current patch enables users to just generate GPU dialect ops and lower them to AMD or Nvidia GPUs with the switch of a single pass. Please let me know if there is a stronger motivation or if I am missing something.
Re 1: The motivation is that, while the rocdl dialect provides direct access to the LLVM intrinsics, the amdgpu dialect provides a wrapper around those intrinsics that is more MLIR-flavored. This means that there is only one place in the code that needs to know the exact details of WMMA and any relevant type mangling (ex. the fact that i8 inputs should be turned into i64s).
I would like to leverage this abstraction when lowering from the GPU dialect.
That is, the GPU dialect abstracts over the amdgpu and rocdl dialects.
(Having GPU-to-AMDGPU would also make selecting MFMA vs WMMA vs ... easier)
Thanks. This sheds more light on this. I am trying to understand whether the GPU dialect ops themselves can capture all the information (via attributes) that is needed to lower them to the ROCDL dialect ops (for both WMMA and MFMA). If that is the case, then the lowering from the GPU dialect to ROCDL will be the single place where all the information/logic resides. If it is not possible to represent all of the WMMA and MFMA functionality in the GPU dialect ops, then we can refactor this patch into two stages. Another point is that there is no intrinsic for the load and store ops. Will we even want them to be present in the AMDGPU dialect, or should we just keep the GPU dialect ops and lower them to ROCDL/LLVM transparently?
On a separate note, AFAIK there is no path to lower MFMA ops from the GPU dialect to AMDGPU currently. Do we have a plan to add that support, or are users supposed to use AMDGPU dialect ops directly?
The reason I want to lower the matrix ops from GPU to AMDGPU + memref/vector + ... instead of to ROCDL and LLVM directly is separation of concerns and progressive lowering.
gpu-to-amdgpu lowers GPU matrix operations to MLIR constructs that abstract away some of the fiddlier details of the underlying intrinsics (such as how F16 * F16 => F16 doesn't return a vector of F16 (with half its entries updated) but instead a vector of 32-bit values with the low or high halves updated).
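To make that packing detail concrete, here's a small Python sketch (not the actual lowering code; the function name and register model are invented for illustration) of how an f16 result lands in one half of each 32-bit accumulator register, selected by opSelect:

```python
def write_f16_accumulator(acc_words, f16_halves, op_select):
    """Model of the f16*f16=>f16 WMMA result layout: each lane's result
    register is 32 bits wide, and opSelect picks which 16-bit half
    receives the f16 value; the other half is preserved."""
    out = []
    for word, h in zip(acc_words, f16_halves):
        if op_select:
            out.append((word & 0x0000FFFF) | ((h & 0xFFFF) << 16))  # write high half
        else:
            out.append((word & 0xFFFF0000) | (h & 0xFFFF))          # write low half
    return out
```

The point of the two-stage lowering is that only amdgpu-to-rocdl has to know about this layout; gpu-to-amdgpu works at the level of whole matrix values.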
Going straight from GPU to ROCDL means you have to copy all the complexity that's present in amdgpu-to-rocdl, and I don't want to maintain two copies of the same code if I don't have to, hence my request to do a two-stage lowering.
And re load and store: unlike on Nvidia, WMMA is just an instruction; loads and stores should be lowered to operations on a private memory attribution (aka an alloca()) that replaces each GPU matrix value.
As to MFMA, see @sjw36 for more there, since he's been poking at that lowering.
I think this is headed in a good direction, I just have some thoughts, mostly about how the pass options were set up
I don't want opSelect here, though I'm willing to allow warpSize so long as that becomes a module attribute that SerializeToHsaco picks up on for correct device library linking/option setting.
Anything that's requiring reasoning about opSelect should be moved up to gpu-to-amdgpu.
Heck, you probably can and should implement the case where you use a sequence of two WMMA ops, one writing low halves and one writing high halves, to do a larger matrix multiplication that produces fp16 values.
I'm not convinced this needs to be here.
You could implement a larger subgroup MMA operation that uses both halves of the output registers.
Alternatively, you could make amdgpu.opSelect an attribute on the relevant operations instead of making it a pass-wide knob.
This does make some sense to have here, yeah
Not sure we'll need these?
(To clarify, this could be two compute ops, one with opSelect = false and one with opSelect = true)
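A Python sketch of what merging the two compute ops' results would look like (function name invented for illustration; this just models the register halves, not the actual codegen): the opSelect=false op leaves its f16 results in the low halves, the opSelect=true op in the high halves, so each 32-bit register ends up carrying two independent f16 results.

```python
def combine_two_wmma_results(low_words, high_words):
    """Merge the accumulator registers of two WMMA compute ops:
    one run with opSelect=false (results in low 16-bit halves) and
    one with opSelect=true (results in high 16-bit halves)."""
    return [(hi & 0xFFFF0000) | (lo & 0x0000FFFF)
            for lo, hi in zip(low_words, high_words)]
```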
Do we need this here?
(also, if you'll have this here, you'll want the data layout incantations somewhere in the pass)
I'm not sure why these are *LaneFirst functions. Could you expand on that?
Grammar nit here, perhaps "only size 32 wavefronts are supported"
That's too strong a restriction - WMMA is available on all the gfx11NN GPUs
There's something suspicious about this being here
Yeah, let's loosen this up and have it check for prefix gfx11 or something like that.
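The loosened check could look something like this Python sketch (the function name is invented; the actual patch is C++, but the predicate is the same):

```python
def chipset_supports_wmma(chipset):
    """Accept any gfx11xx chipset rather than hardcoding gfx1100,
    since WMMA is available across the gfx11 series."""
    return chipset.startswith("gfx11")
```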
This should be a user-configurable option?
Thanks, I wasn't aware of this use case. I was under the impression that it was the job of the underlying register allocator to utilize the upper and lower halves of the registers properly. But if that isn't the case, then it makes sense to have this as an op attribute and just generate code with the opSelect attribute set and unset.
These are actually needed currently to check some conditions in the lowerings.
This can be dropped actually. Thanks.
It's just a naming convention I used, as the laneId is the innermost position of the indexing calculation. I can rename it if it's too confusing.
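To illustrate the naming (a sketch with invented names, not the code from the patch): "lane first" here just means the lane id is the innermost, fastest-varying term of the linearized index, so consecutive lanes of a wave address consecutive elements.

```python
def lane_first_offset(row, row_stride, lane_id):
    """Linearized element offset with laneId in the innermost position:
    lanes 0..N-1 of a wave touch consecutive elements within a row."""
    return row * row_stride + lane_id
```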
Are the thread data mappings the same for them too? Can you please point me to a reference? This patch has only been tested on gfx1100, and can be extended/tested later.
I have relaxed the checks in the passes. I think this variable has to have the complete name. If gfx1100 is the minimum series that supports WMMA, then this will not be problematic, I think. Code generated for this will work on any GFX11-series card, right?