This is an archive of the discontinued LLVM Phabricator instance.

[MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops
ClosedPublic

Authored by navdeepkk on Jan 24 2021, 11:10 PM.

Details

Summary

[MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops
Add warp synchronous matrix-multiply accumulate ops to the GPU and NVVM
dialects. Add the following three ops to the GPU dialect :-

1.) subgroup_mma_load_matrix
2.) subgroup_mma_store_matrix
3.) subgroup_mma_compute

Add the following three ops to the NVVM dialect :-

1.) wmma.m16n16k16.load.[a,b,c].[f16,f32].row.stride
2.) wmma.m16n16k16.store.d.[f16,f32].row.stride
3.) wmma.m16n16k16.mma.row.row.[f16,f32].[f16,f32]
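
For illustration, the GPU-level ops are intended to compose roughly as follows for a single 16x16x16 tile. This is only a sketch: the operand tags, shapes, and exact syntax are illustrative, and the fragment type evolved during review from memref-based operands to !gpu.mmafragment and finally to !gpu.mma_matrix.

  %a = gpu.subgroup_mma_load_matrix %srcA[%i, %k] {leadDimension = 32 : index} : memref<32x32xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
  %b = gpu.subgroup_mma_load_matrix %srcB[%k, %j] {leadDimension = 32 : index} : memref<32x32xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
  %c = gpu.subgroup_mma_load_matrix %srcC[%i, %j] {leadDimension = 32 : index} : memref<32x32xf16> -> !gpu.mma_matrix<16x16xf16, "COp">
  %d = gpu.subgroup_mma_compute %a, %b, %c : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
  gpu.subgroup_mma_store_matrix %d, %dst[%i, %j] {leadDimension = 32 : index} : !gpu.mma_matrix<16x16xf16, "COp">, memref<32x32xf16>

Each of these lowers to one of the NVVM intrinsics listed above (load -> wmma.load, compute -> wmma.mma, store -> wmma.store).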

Diff Detail

Event Timeline

navdeepkk created this revision.Jan 24 2021, 11:10 PM
navdeepkk requested review of this revision.Jan 24 2021, 11:10 PM
ftynse requested changes to this revision.Jan 25 2021, 5:26 AM

Thanks! I have some comments.

mlir/include/mlir/Dialect/GPU/GPUOps.td
943–944

We usually use Index rather than I64 for indexing into memrefs. Is there a specific reason why I64 is needed here?

945

Could we use a longer name for this, e.g., leadDimension?

1006

Have you considered encoding this into a TypeConstraint in ODS instead of using AnyMemRef?

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
19

I am not a fan of creating a dependency from the NVVM dialect on the GPU dialect only for the purpose of reusing an attribute. Have you considered having two versions of this attribute for different dialects, or putting it into some included file shared by both?

153

Please follow the code style inside inline code blocks. Here in particular, add spaces between "if" and "(".

154–163

We generally prefer to have one operation per intrinsic in this dialect. There is certainly value in having an operation where this duplication is abstracted away, but I suppose the one in the GPU dialect is just right for that. This dispatch should happen when converting from the gpu.subgroup_mma_load to nvvm.wmma_* rather than in translation to LLVM IR. This will also resolve the dialect layering problem I pointed at above.

189

function-type($args, $res) ?

257–262

This looks intimidatingly verbose. Are we ever expected to have different types for operands or can we just print one instead?

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
980

Nit: please don't use magic numbers. Rather define this as GPUDialect::kThreadPrivateMemorySpace and so on.

984–989

This can be defined in ODS, instead of using AnyMemRef.

991

This can be put into ODS.

1004

Braces must be symmetric.

1079–1084

Nit: please use LogicalResult for verify* functions, even internal. Otherwise, use the name that makes it clear what are the intended semantics of the return value, e.g. isValidType.

mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
135

Please fix linter findings.

This revision now requires changes to proceed.Jan 25 2021, 5:26 AM

Adding Thomas and Christian to see whether they have comments wrt. the GPU operations.

ThomasRaoux added inline comments.Jan 26 2021, 12:36 AM
mlir/include/mlir/Dialect/GPU/GPUOps.td
917–938

The part that I don't understand is whether we expect the destination memref to have a defined layout or whether it is opaque. As far as I can tell, in both CUDA and Vulkan the layout of the mma matrix type is opaque in private memory.
If the layout is opaque, how can we perform elementwise operations on the matrix type without going back to global or shared memory? One common usage would be to fuse the MMA compute with elementwise operations and use the result of the MMA directly without going back to memory. Is it possible to represent such a case with the current design?

bondhugula added inline comments.Jan 26 2021, 3:10 AM
mlir/include/mlir/Dialect/GPU/GPUOps.td
1016

Incorrect copy/paste example.

navdeepkk updated this revision to Diff 321329.Feb 3 2021, 11:56 PM
navdeepkk marked 15 inline comments as done.

Issues in diff 318905 :-

1.) The design used memrefs to model MMA fragments, which were allocated in `.local`
  space in the generated PTX. This completely defeated the purpose of the WMMA ops (to
  use operands in `.reg` space).

Changes in this diff :-

1.) Address comments on diff 318905.
2.) Introduce a new type !gpu.mmafragment to enable use of MMAOps operands in
registers.
3.) Modify all the introduced ops/test cases to use the !gpu.mmafragment type.
4.) Add the NVVM IR translation test previously in Revision D95333.

Can you use mma_fragment instead of mmafragment for better readability?

bondhugula added inline comments.Feb 4 2021, 12:50 AM
mlir/include/mlir/Dialect/GPU/GPUOps.td
935

The design looks much more in line now with the memref operands being replaced with pure values and with a result value instead of an output memref.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
32

Nit: MMAFragmentType

72–79

Use an enum?

bondhugula added inline comments.Feb 4 2021, 12:52 AM
mlir/include/mlir/Dialect/GPU/GPUDialect.h
90

Matrix-multiply -> matrix-matrix multiply

98–115

All of these need doc comments.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
309–312

I think all of this code should be in a C++ file. Please see other ops for reference, or check if there is a guideline. @ftynse?

ftynse added a comment.Feb 9 2021, 6:12 AM

Looks better with a dedicated type. I have some further concerns about memref layout and a bunch of small comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
46–55

This class looks unnecessary. It has only one derived class, which doesn't actually use anything from the base class.

76–80

There's no need in getters if member fields are public.

85–86

This shadows the elementType from the base class. You'll actually have two elementType values this way...

mlir/include/mlir/Dialect/GPU/GPUOps.td
918

Does it make any assumptions about the data storage in the memref? It can have an arbitrary layout now, not only consecutive elements.

923–924

Have you considered always loading the entire matrix starting from indices 0x0 and asking the user to use std.subview to position the view in a way they want? There may be a reason to use indices here, but it is unclear to me. Otherwise, it feels like this op will be a subset of functionality that subview already provides.

925–926

Why is this attribute necessary? It looks redundant with the dimension of the memref available in its type. If it may differ from the memref size, then this op needs to clarify how it handles such non-contiguous cases.

934

Please document why it is important to specify which operand is being loaded (I suppose because of how the fragment is laid out).

935

Nit: this example breaks the verifier, which expects leadDimension to have i32 type.

941

Why I32Attr and not IndexAttr?

958–976

Same remarks as for the "load" op.

1011–1012

These shapes don't add up for me.

1023

Nit: I would have just used functional-type(operands, results) here

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
147

It looks like operand is never used

153

Nit: use [{ }] for multi-line text.

177

Tablegen doesn't have $-substitution; you need something like !strconcat( and a way of accessing the common string.

298–303

This is still super-verbose. Do we expect operands of different types?

308

Any reason why this cannot be implemented with declarative assembly format?

309

We shouldn't be needing global namespace qualification in this code. It goes into the body of an OpTy::parse, which itself lives in the MLIR namespace.

309–312

There are no guidelines AFAIK. I prefer and ask to put any non-trivial code with more than ~5 lines in a C++ file.

311

This is useless; it is only necessary if the variable is used solely in assertions, to avoid -Wunused-variable in release builds. Here, it is always used.

314

This is unnecessary, ArrayRef is implicitly constructible from a single element, just pass resType to addTypes.

317–327

Chain these with || in a single if condition. This is the reason why Parser methods return bool.

330

I doubt "20" was chosen for any particular reason. SmallVector now supports automatic selection of the number of stack elements if none is provided in the template; just use that if there is no specific reason to do otherwise.

340

Use getOperationName() instead

344–345

These can be combined in a single string...

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
51–56

It is better to inspect elementType rather than to construct a new one for comparison. Construction takes a lock in the context, inspection does not.

Also, if (condition) return false; return true should be written as return !condition.

121–123

All user-visible error messages need a test.

133–138

parseType accepts references to a derived type and does this check for you. Just declare elementType as VectorType

141

This should be in the type verifier instead.

142

Note that the location points to the token after the last consumed one, i.e. after >. Capture the location before parsing the size, or at least point to the first token in the type.

146–154

Don't duplicate the verifier, use getChecked instead.

977

Replace these magic numbers with named constants defined above.

999–1002

Don't duplicate the checks already enforced by type constraints in ODS.

Also, if you had added tests for user-visible errors, you would have seen that this message is never produced because the error condition is caught by ODS-generated verifier parts with a different message.

ftynse requested changes to this revision.Feb 9 2021, 6:12 AM
This revision now requires changes to proceed.Feb 9 2021, 6:12 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

mlir/include/mlir/Dialect/GPU/GPUOps.td
917–938

Hi, I quote from the PTX side of these LLVM intrinsics,

The distribution of fragments loaded by the threads in a warp is unspecified and is target architecture dependent. The contents of a matrix fragment can be manipulated by reading and writing to individual registers in the fragment, provided the following conditions are satisfied:

  1. All matrix element in the fragment are operated on uniformly across threads, using the same parameters.
  2. The order of the matrix elements is not changed.

So as long as the operation is something like a bias addition, which is uniform throughout the output matrix, I think it would be possible to realize the fusion using the present design. The way to go would be to introduce another op in the GPU dialect, something like:

gpu.subgroup_mma_fuse_bias %0, %1 : memref<1x16xvector<2xf16>>, f16

The argument memref will be the result of a gpu.subgroup_mma_compute. And the other argument will be the bias. With the appropriate LLVM lowerings this would reuse the results of gpu.subgroup_mma_compute in registers and hence prevent trip to global/shared memory.

Let me know what you think, I can implement the above op and check, I think it should work.

917–938

Also, could you please tell me whether "fuse the MMA compute with elementwise operations" refers to fusing an elementwise operation with the register/warp-level tile of the accumulator? That is what I assumed in my reply above. Is that correct?

943–944

Yes. When calculating the start of the load address, I emit LLVM ops of this sort:

%[[LDM:.*]] = llvm.mlir.constant(32 : i64) : !llvm.i64
%[[ILDM:.*]] = llvm.mul %[[LDM]], %[[OFF]] : !llvm.i64
%[[IJLDM:.*]] = llvm.add %[[ILDM]], %[[OFF]] : !llvm.i64
%[[IJOLDM:.*]] = llvm.add %[[IJLDM]], %[[OFFSETT]] : !llvm.i64

Now the leading dimension is of type i64, and to have the same types in the add/mul ops I have used I64 for the indices too.

945

Yes, I'll change it.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
257–262

Since this op is mapped to a single intrinsic, the type of the operands is fixed. I modified the op to print only one type.

ftynse added a comment.Feb 9 2021, 9:48 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

mlir/include/mlir/Dialect/GPU/GPUOps.td
943–944

This is exactly why you should use index or the result of converting index to LLVM. There is no guarantee that it is i64, so you should not be mixing i64 constants with anything that indexes a memref, e.g. the offset in the descriptor. We actually have a flow that converts index to i32 on GPUs as an optimization, and this flow will likely break if you emit the code described above.
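
To illustrate the concern, the same address arithmetic written against the converted index type (here assuming the index-to-i32 GPU flow mentioned above; the value names are made up) would be:

  %ldm  = llvm.mlir.constant(32 : i32) : !llvm.i32
  %ildm = llvm.mul %ldm, %row : !llvm.i32
  %ij   = llvm.add %ildm, %col : !llvm.i32
  %addr = llvm.add %ij, %off : !llvm.i32

i.e., the emitted constants and arithmetic must follow whatever type index lowers to, rather than hard-coding i64.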

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

Okay, but we would still need to pack the operands into <2xhalf> from scalar values at some point to supply them to the intrinsic. Is there some op which would pack scalar values into a <2xhalf>?
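
(For illustration only, one way such packing could be expressed with existing ops is vector.insertelement; this is an assumption about how a lowering could do it, not something in this patch, and %s0/%s1 stand for the scalar f16 values being packed:

  %c0   = constant 0 : i32
  %c1   = constant 1 : i32
  %zero = constant dense<0.0> : vector<2xf16>
  %v0   = vector.insertelement %s0, %zero[%c0 : i32] : vector<2xf16>
  %v1   = vector.insertelement %s1, %v0[%c1 : i32] : vector<2xf16>
)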

navdeepkk updated this revision to Diff 324324.EditedFeb 17 2021, 8:49 AM
navdeepkk marked 40 inline comments as done.

Changes in this diff:-

1.) Address comments on diff 321329.
2.) Add tests for all user-visible error messages.
bondhugula added a comment.EditedFeb 22 2021, 10:55 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is
supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

bondhugula accepted this revision.Feb 22 2021, 11:02 AM

Thanks for addressing the large number of comments. Some additional minor ones and one that was missed (or not pushed). This overall looks great to me!

mlir/include/mlir/Dialect/GPU/GPUBase.td
60

Comment here please.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
328

Looks like this comment is still unaddressed but marked "Done". Did you forget to push? @ftynse suggested doing:

if (parser.parseType... || parser.resolveOperands(...) ...
  return failure();
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1030–1032

Can drop trivial braces.

ftynse added a comment.Mar 2 2021, 5:09 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is
supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

You are right that this op is a low-level abstraction, and that's why it doesn't feel like it really belongs to the GPU dialect. I understand there is a need for an op that is better integrated with the rest of MLIR, e.g., uses memref, and such an op wouldn't fit into the NVVM dialect either. So I wouldn't press my objection as long as another reviewer (@herhut, @csigg or @ThomasRaoux) agrees to have this in GPU dialect as is.

More generally, I think we will run into this problem again: we need some way of having MLIR-friendlier versions of LLVM IR intrinsics without having to duplicate abstractions. There are similar cases in "target vector" dialects like AVX512. Ideas welcome.

mehdi_amini added inline comments.Mar 3 2021, 2:07 PM
mlir/include/mlir/Dialect/GPU/GPUDialect.h
78

elements

79

May be useful to document the syntax and provide examples.

106

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is
supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

You are right that this op is a low-level abstraction, and that's why it doesn't feel like it really belongs to the GPU dialect. I understand there is a need for an op that is better integrated with the rest of MLIR, e.g., uses memref, and such an op wouldn't fit into the NVVM dialect either. So I wouldn't press my objection as long as another reviewer (@herhut, @csigg or @ThomasRaoux) agrees to have this in GPU dialect as is.

More generally, I think we will run into this problem again: we need some way of having MLIR-friendlier versions of LLVM IR intrinsics without having to duplicate abstractions. There are similar cases in "target vector" dialects like AVX512. Ideas welcome.

I think it would be good to have it in the GPU dialect to have a common layer for SPIR-V, NVVM and potentially ROCDL if it applies. The challenge is to pick a good abstraction for all of those. SPIR-V dialect already has CooperativeMatrix ops and types which are the exact equivalent to MMA. I think we should try to remove what is an overfit for NVVM like using vector type for mmafragment but otherwise this is the right direction in my opinion.

mlir/test/Dialect/GPU/invalid.mlir
464–470

I think we should also have a positive test in mlir/test/Dialect/GPU/ops.mlir

Hi all,
I just wanted to point out that the vector type here was used only for the F16 accumulate version of these intrinsics. For the C matrix, the type will change for the F32 accumulate version. So I wanted to make the gpu.mmafragment type flexible for that and perhaps just adjust the constraints as we go on to add more ops to the nvvm dialect.

I also just checked and saw that the SPIR-V dialect does not model the fragments held by a particular workitem (as I do here) but models the whole cooperative matrix as a single type. I am not very sure how the SPIR-V ops are lowered to the target. Are they library calls? If yes, then it is very different from the NVVM dialect, which is actually a 1-to-1 mapping with the intrinsics. In that case, it would be more difficult to choose the right abstraction; e.g., for F16 accumulate the library may just do all the packing/unpacking to get to the same machine instruction that we get to via NVVM->LLVMIR->PTX. Here we have to ensure that the input is in the correct data format.

Please let me know what you think. Also can someone please point out the details on how SPIR-V is lowered?

mlir/include/mlir/Dialect/GPU/GPUDialect.h
46–55

Removed the class; now directly inheriting from TypeStorage in MMAFragmentStorageType.

85–86

Removed the redundant base class as pointed out in a previous comment. Retained this member.

mlir/include/mlir/Dialect/GPU/GPUOps.td
918

Enforcing identity layout maps for the source memref right now. Can be extended for generic layouts in future commits.

923–924

Not really. Since the GPU dialect is meant to be closer to the target, I preferred to have it without the view abstraction and specify the start location's actual indices.

925–926

Whoever creates the op can specify the leadDimension, and in the lowering to the NVVM dialect the attribute can be used directly. I assume you mean non-contiguity in the leading dimension; in that case, the op would still work as long as the layout is identity and the leading dimension is correctly specified. I can enforce an identity layout for now and add support for generic layouts in future commits.
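
To make the semantics concrete (the shapes and the type/attribute spelling here are illustrative only): loading a 16x16 tile at offset (%i, %j) out of a 64x64 row-major buffer would pass the buffer's row stride, not the tile width, so element (r, c) of the tile is read from base + (i + r) * leadDimension + (j + c):

  %a = gpu.subgroup_mma_load_matrix %buf[%i, %j] {leadDimension = 64 : index} : memref<64x64xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">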

1011–1012

!gpu.mmafragment<4xvector<3xf16>> -> !gpu.mmafragment<4xvector<2xf16>>.

The shapes represent the part of the MMA fragment each thread holds. E.g., the result is of shape <4xvector<2xhalf>>, i.e., 8 fp16 elements per thread. So in all, 32 threads (a warp) hold 256 elements, which is exactly the shape of the output, 16x16.

1023

functional-type is not parsing the !gpu.mmafragment type; retaining the current format.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
298–303

Sorry, I had already removed the printing but forgot to update the example.

308

I tried to use custom directives to print and parse a single type for all the operands. But there is a restriction on custom directives, quoted as "The type directives must refer to a variable, but that variable need not also be a parameter to the custom directive." This prevents us from parsing a single type for all the operand types, and parser.resolveOperands() would fail.

309–312

Please let me know if this should be moved, and to where.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
121–123

I have removed these messages as they seemed redundant. The parser already emits errors for the required thing. I am adding tests for other user-visible messages.

999–1002

Removed redundant checks.

I think it would be good to have it in the GPU dialect to have a common layer for SPIR-V, NVVM and potentially ROCDL if it applies.

+1: even for a "low-level" operation, if the operation is available on multiple targets (SPIR-V or others; we can even consider that proprietary targets can benefit from this), then properly building the abstraction in the GPU dialect is really the intention here.

herhut added a comment.Mar 4 2021, 3:47 AM

Thank you @ftynse and @bondhugula for your many comments on this patch and thanks @navdeepkk for addressing them!

IMHO, this operation fits perfectly into the GPU dialect, as it allows targeting mma operations independently of the GPU target used. Regarding the type encoding: Ultimately, it only needs to be rich enough to allow conversion to the required LLVM types. We do not really have operations that extract the contained type, so it is truly opaque in that way. I am missing why we need the vector type. Why would mma.fragment<8x2xf16> not suffice for this? This can still be lowered as today to LLVM.

To apply element-wise operations to the fragment, we would need special operations anyway or a special operation that extracts the element to compute on from the fragment. The latter could produce a vector if that is desired.

An open question for me is how this lowers to SPIR-V. Would mma.fragment<8x2xf16> be rich enough for that case, as well? @ThomasRaoux

mlir/include/mlir/Dialect/GPU/GPUBase.td
121

nit: Is the "GPU Dialect" here intentional?

mlir/include/mlir/Dialect/GPU/GPUDialect.h
73

nit: These?

Hi Navdeep,

in SPIR-V the type CooperativeMatrixNVType is indeed an opaque type, and I think we should keep the MMA type in the GPU dialect opaque as well. SPIR-V cooperative matrices are just native SPIR-V ops and don't require any library call. Underneath they map to the exact same functionality as the NVVM MMA intrinsics, so it makes sense to try to find a common abstraction. If we make gpu.mmafragment an opaque type, we should be able to have a 1:1 mapping to both NVVM and SPIR-V. As @herhut pointed out, we cannot extract from this type, so it makes sense not to make any assumptions about the layout. In SPIR-V there are ways to extract elements, but the layout can only be known at runtime, so it is not practical to use in general. Ideally we should avoid that and use gpu.mmafragment until we store back to shared or global memory.

There are some lowering to SPIR-V cooperative matrix in IREE here: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToSPIRVPass.cpp#L330
In this case it goes directly from the vector dialect to SPIR-V cooperative matrix. This skips the level of abstraction you are adding, which is a hack to make it work and test cooperative matrix end to end. This is the reason why it is not upstreamed in MLIR.

Hi all, Thanks for the valuable comments. @ThomasRaoux Thanks for clarifying things on the SPIR-V side.

I am missing why we need the vector type. Why would mma.fragment<8x2xf16> not suffice for this? This can still be lowered as today to LLVM.

@herhut I used the vector type just so I could extract elements from the gpu.mmafragment type and supply them to the nvvm intrinsic as-is. Now that I think about it, mma.fragment<8x2xf16> can be used equivalently.

I will take some time and try to develop a more generic abstraction and look at removing the vector<> type. I'll post again for some more feedback before implementing the next iteration.

Thanks!

Hey @navdeepkk, I'm curious to know how this is progressing.
I'm working on enabling CUDA codegen support in IREE (https://github.com/google/iree/tree/main/iree/compiler/Conversion/LinalgToNVVM). I have some basic vectorization working and I'm probably 1 to 2 weeks away from a point where I would like to hook that up to MMA ops. My plan is to add some Vector to GPU transformation in MLIR to create the MMA ops from vector.contract/vector.transfer kind of ops.
Let me know if I can do anything to help this move forward. I can help address some parts of the patch or do anything else that would help you make progress.

I'm happy to chat more about my plans if you want and discuss how we can collaborate on this. Let me know.

Thanks

Cool, please rope me into any offline discussions as I plan to start touching some of that in the short-term future too.

Hi @ThomasRaoux,
Sorry for the late reply. Great to hear that these ops can be reused in the IREE pipeline too. I was actually busy with some parallel work using these ops and getting it ready for an upcoming submission. The comments regarding the types are still to be addressed. I will surely be working on this, but I will get started on any major changes only by next week. As you mention, it would be great to know what your plans are and how you wish to proceed.

Thanks!

Hi @navdeepkk, great to hear you were able to use those ops. Next week sounds good; if you think you won't have bandwidth to progress in the next couple of weeks, please let me know and hopefully you can offload some of the work to me.
To give more details on the IREE side, I'm trying to match what cutlass does using MLIR transformations. The flow will look like: linalg on tensor -> tiling along M,N,K and distribute to blocks -> promote operands to shared memory -> (double buffering and pipelining) -> tiling to warp -> tiling shared memory copy -> linalg vectorization -> vector.contract unrolling to mma size -> vector to GPU to create MMA GPU ops -> lowering to llvm/nvvm.
A lot of this infrastructure is already there in IREE.

If you are able to share, it would be great to hear about the flow you used those ops with.

@ThomasRaoux Just to clarify: @navdeepkk has all of the MLIR-based GPU code generation implemented on top of these ops (the ones in this revision) already working using the affine dialect infrastructure: we don't use IREE, linalg, or the vector dialects, but our pipeline effectively includes all of the transformations you list above and several more in between those lowerings.

Getting to this PR itself, are there specific changes to make before this can be landed? @ftynse and others (Stephan and Mehdi) appear to be fine with having these ops in the GPU dialect. Could you please list those changes out if any or LGTM this? This will allow further development and progress upstream. If the remaining changes are minor, these could be made quickly and landed.

ThomasRaoux accepted this revision.Apr 13 2021, 9:15 PM

Thanks for explaining, @bondhugula. It is great to know that you have all those transformations. Is it publicly available? (or will it be?).
I 100% support having these ops in the GPU dialect. The only concern I had with the PR was about the type including vector instead of being completely opaque, because, as I mentioned, this will make it harder to share with the Vulkan path. However, I don't think it has to be a blocker, and it is something that we can iterate on, so I LGTM the patch.

Thanks @ThomasRaoux. Yes, our goal is to upstream this infrastructure - most of the underlying affine dialect transformations and utilities needed have already been there in MLIR for a long time. We'd like to send out things for review starting with the lower level abstractions and support first - so this is really the first revision. (One op/abstraction that is publicly available but not in upstream that we used for our purpose is the memref_vector_cast op: https://reviews.llvm.org/D85885 - this is also available to work with recent upstream at https://github.com/polymage-labs/mlirx )

@navdeepkk, @bondhugula, any update on this?

Hi @ThomasRaoux, I'll get started on this now, and complete it on priority in the next 3-4 days or even sooner. Hope that works.

Thank you

navdeepkk updated this revision to Diff 342265.May 2 2021, 1:19 PM
navdeepkk marked 8 inline comments as done.

Changes in this diff :-

1.) Rebase on upstream/main.
2.) Address comments on previous diff(324324).
3.) Remove gpu.mmafragment and introduce gpu.mma_matrix type.
bondhugula accepted this revision.May 2 2021, 5:46 PM

Looks great. Mostly minor comments on style and movement of some C++ code out of the td file.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
87

It's not immediately clear what this StringRef is for.

143

unsigned?

151

Nit: Operand -> operand

151

An operand is expected to be an SSA Value. It's not clear what this returns.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
358–378

We shouldn't have so much C++ code in the tablegen file. Please move this to a C++ file (NVVMOps.cpp) and call that from here.

382–390

Likewise.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
48

unsigned ?

68

Operand -> operand

144

Nit: "x" -> 'x'

146

Likewise.

990

Only -> only

1041

Operands -> operands

1050

Likewise.

mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
197–198

No need of the else.

215

Likewise.

237

Likewise. clang-tidy will also show a warning here.

274–275

Use getOperands() instead of doing getOpOperands() and then doing a get()?

ThomasRaoux accepted this revision.May 3 2021, 6:55 PM
ftynse accepted this revision.May 4 2021, 2:50 AM
This revision is now accepted and ready to land.May 4 2021, 2:50 AM
navdeepkk updated this revision to Diff 343281.May 5 2021, 9:50 PM
navdeepkk marked 17 inline comments as done.

Changes in this diff :-

1.) Address comments on previous diff(342265).
navdeepkk edited the summary of this revision. (Show Details)May 5 2021, 9:53 PM
bondhugula accepted this revision.May 5 2021, 11:05 PM
navdeepkk updated this revision to Diff 343291.May 5 2021, 11:23 PM
navdeepkk marked an inline comment as done.

Changes in this diff :-

1.) Clang-format fix.
2.) Added TODO to generate MMAMatrix via ODS.
This revision was landed with ongoing or failed builds.May 5 2021, 11:37 PM
This revision was automatically updated to reflect the committed changes.