This is an archive of the discontinued LLVM Phabricator instance.

[mlir][AMDGPU] Add `mfma` operation to wrap mfma intrinsics.
AbandonedPublic

Authored by krzysz00 on Jun 7 2022, 12:30 PM.

Details

Summary

The mfma (matrix fused multiply add) instructions present on some
AMDGPUs provide hardware support for particular matrix multiplication
sies and formats.

In LVVM, these operations are exposed via intrinsics. In order to make
their usage in MLIR more ergonomic, we define a amdgpu.mfma
operation that takes a MFMAInstr enum to specify which instruction
should be used. This allows higher-level code to select the mfma
operation to be used by changing an enum value instead of by selecting
a different operation, improving the ergonomics of generating matrix
multiplication kernels.

The amdgpu.mfma operation also allows operations that logically take
vectors of bytes as inputs, instead of requiring, as LLVM does, that
the inputs be concatenated into an i32 or i64.

Diff Detail

Event Timeline

krzysz00 created this revision.Jun 7 2022, 12:30 PM
krzysz00 requested review of this revision.Jun 7 2022, 12:30 PM
mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173–190

This seems awkward. Do you really need different ops for all of these, rather than having a single op that considers the types of its arguments, or perhaps takes some small number of parameters? You may need to generate something that says which ops are valid on a particular architecture, but that seems preferable to me.

krzysz00 added inline comments.Jun 10 2022, 12:05 PM
mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173–190

MLIR-side, this is one op?

And we do need to case, because (to selectively quote the intrinsics list)

def int_amdgcn_mfma_f32_16x16x1f32  : AMDGPUMfmaIntrinsic<llvm_v16f32_ty, llvm_float_ty>;
def int_amdgcn_mfma_f32_32x32x2f32  : AMDGPUMfmaIntrinsic<llvm_v16f32_ty, llvm_float_ty>;

These two instructions take the same types of argument but have different semantics. They both take [64 simd]xf32 inputs for A and B and return a [64]x16xf32 output, but there's more than one option for how to reshape [64 things] * [64 things] -> [256 things] as a matrix multiply. One option as that you're doing [32 x 2] * [2 x 32] -> [32 x 32], but another is [64 x 1] * [1 x 16] -> [64 x 16] - though what the instruction name is saying is that we have 4x[16x1] * [1x16] -> 4x[16x16] (or its transpose if N is the long dimension)

So, no, we can't dispatch off type alone.

ThomasRaoux added inline comments.Jun 12 2022, 8:35 PM
mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173

Just a few suggestions:
Could the type be inferred from the operand type instead of being part of the enum?
I wonder if having the dimensions be integer attributes would make the code more generic?
I assume that decoupling this slightly from the rocdl intrinsics may be beneficial.

krzysz00 added inline comments.
mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173

Pulling in @jerryyin and @whchung for their thoughts, but, from where I'm standing, this is at least partially meant to be wrapper around the intrinsics that lets us have things like the constants being specified as attributes on the op instead of additional arguments.

jerryyin added inline comments.Jun 14 2022, 8:53 AM
mlir/include/mlir/Dialect/AMDGPU/AMDGPU.td
173

I think @ThomasRaoux brings about a valid point: We indeed can bring more structure around this enum, which can provide more information to our side of the xdlopsSelect.h in miopen-dialect.

Each fields carries a meaning in the xdlops instruction, take f32_32x32x1f32 as an example:

  • First f32 is the return type
  • 32x32 is the size of dimension for A and B matrix
  • 1 is the number of gemm we performed, if the number is larger than 1, then it is a reduction (with sum)
  • Last f32 is the argument type

Desirably this can be constructed from a number of fields of attributes that comes inherently with the instruction naming.

krzysz00 abandoned this revision.Jun 17 2022, 8:38 AM

Per feedback here, I'm abandoning this revision in favor of

  1. A new revision that adds the new ROCDL intrinsics from LLVM but doesn't touch AMDGPU
  2. Going back downstream to design a better mfma operation that looks like something like mfma {k = K, m = M, n = N, ...} %c = %a * %b.