
[MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops
ClosedPublic

Authored by navdeepkk on Jan 24 2021, 11:10 PM.

Details

Summary

[MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops
Add warp synchronous matrix-multiply accumulate ops to the GPU and NVVM
dialects. Add the following three ops to the GPU dialect:

1.) subgroup_mma_load_matrix
2.) subgroup_mma_store_matrix
3.) subgroup_mma_compute

Add the following three ops to the NVVM dialect:

1.) wmma.m16n16k16.load.[a,b,c].[f16,f32].row.stride
2.) wmma.m16n16k16.store.d.[f16,f32].row.stride
3.) wmma.m16n16k16.mma.row.row.[f16,f32].[f16,f32]
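Taken together, the three GPU-dialect ops are intended to chain into a load / compute / store sequence. The following is an illustrative sketch only: the exact op syntax, attribute names (e.g. leadDimension), and fragment shapes are based on the types discussed later in this review, not necessarily the final landed form.

```mlir
// Load the A, B, and C operands of a 16x16x16 MMA from memory.
%a = gpu.subgroup_mma_load_matrix %srcA[%i, %j] {leadDimension = 16 : i32}
       : memref<16x16xf16> -> !gpu.mmafragment<8xvector<2xf16>>
%b = gpu.subgroup_mma_load_matrix %srcB[%i, %j] {leadDimension = 16 : i32}
       : memref<16x16xf16> -> !gpu.mmafragment<8xvector<2xf16>>
%c = gpu.subgroup_mma_load_matrix %srcC[%i, %j] {leadDimension = 16 : i32}
       : memref<16x16xf16> -> !gpu.mmafragment<4xvector<2xf16>>
// D = A * B + C, computed cooperatively by the 32 threads of a warp.
%d = gpu.subgroup_mma_compute %a, %b, %c
       : !gpu.mmafragment<8xvector<2xf16>>, !gpu.mmafragment<8xvector<2xf16>>,
         !gpu.mmafragment<4xvector<2xf16>> -> !gpu.mmafragment<4xvector<2xf16>>
// Write the result tile back to memory.
gpu.subgroup_mma_store_matrix %d, %dst[%i, %j] {leadDimension = 16 : i32}
       : !gpu.mmafragment<4xvector<2xf16>>, memref<16x16xf16>
```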

Diff Detail

Event Timeline

This revision now requires changes to proceed. Feb 9 2021, 6:12 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

mlir/include/mlir/Dialect/GPU/GPUOps.td
920–941

Hi, I quote from the PTX side of these LLVM intrinsics,

The distribution of fragments loaded by the threads in a warp is unspecified and is target architecture dependent. The contents of a matrix fragment can be manipulated by reading and writing to individual registers in the fragment, provided the following conditions are satisfied:

  1. All matrix elements in the fragment are operated on uniformly across threads, using the same parameters.
  2. The order of the matrix elements is not changed.

So as long as the operation is something like a bias addition, which is uniform throughout the output matrix, I think it would be possible to realize the fusion using the present design. The way to go would be to introduce another op in the GPU dialect, something like

gpu.subgroup_mma_fuse_bias %0, %1 : memref<1x16xvector<2xf16>>, f16

The argument memref will be the result of a gpu.subgroup_mma_compute. And the other argument will be the bias. With the appropriate LLVM lowerings this would reuse the results of gpu.subgroup_mma_compute in registers and hence prevent trip to global/shared memory.

Let me know what you think; I can implement the above op and check. I think it should work.
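A hypothetical sketch of how such a fusion op could slot in between the compute and the store (gpu.subgroup_mma_fuse_bias does not exist in this patch; op names and types here are illustrative only):

```mlir
// D = A * B + C, with the result held in registers as an MMA fragment.
%d = gpu.subgroup_mma_compute %a, %b, %c
       : !gpu.mmafragment<8xvector<2xf16>>, !gpu.mmafragment<8xvector<2xf16>>,
         !gpu.mmafragment<4xvector<2xf16>> -> !gpu.mmafragment<4xvector<2xf16>>
// Hypothetical op: add a bias that is uniform across the output matrix,
// reusing the register-resident result without a round trip to memory.
%biased = gpu.subgroup_mma_fuse_bias %d, %bias
       : !gpu.mmafragment<4xvector<2xf16>>, f16
// Only now store the fused result back to global/shared memory.
gpu.subgroup_mma_store_matrix %biased, %dst[%i, %j]
       : !gpu.mmafragment<4xvector<2xf16>>, memref<16x16xf16>
```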

920–941

Also, could you please clarify whether "fuse the MMA compute with elementwise operations" means fusing an elementwise operation with the register/warp-level tile of the accumulator? That is what I assumed in my reply above.

946–947

Yes. When calculating the starting load address, I emit LLVM ops of this sort:

%[[LDM:.*]] = llvm.mlir.constant(32 : i64) : !llvm.i64
%[[ILDM:.*]] = llvm.mul %[[LDM]], %[[OFF]] : !llvm.i64
%[[IJLDM:.*]] = llvm.add %[[ILDM]], %[[OFF]] : !llvm.i64
%[[IJOLDM:.*]] = llvm.add %[[IJLDM]], %[[OFFSETT]] : !llvm.i64

Now the leading dimension is of type i64, and to have the same types in the add/mul ops I have used i64 for the indices too.
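For reference, the chain of llvm.mul/llvm.add above is just row-major linearization of the load address; a quick Python sanity check (the function and parameter names are my own, not from the patch):

```python
def linearized_index(i, j, ldm, base_offset):
    """Row-major linearization matching the emitted mul/add chain:
    index = i * ldm + j + base_offset."""
    return i * ldm + j + base_offset

# Leading dimension 32 (as in the constant above), element (2, 5), no extra offset:
print(linearized_index(2, 5, 32, 0))  # 69
```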

948

Yes, I'll change it.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
264–269

Since this op is mapped to a single intrinsic, the type of the operands is fixed. I modified the op to print only one type.

ftynse added a comment. Feb 9 2021, 9:48 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

mlir/include/mlir/Dialect/GPU/GPUOps.td
946–947

This is exactly why you should use index or the result of converting index to LLVM. There is no guarantee that it is i64 so you should not be mixing i64 constants with anything that indexes a memref, e.g. the offset in the descriptor. We actually have a flow that converts index to i32 on GPUs as optimization, and this flow will likely get broken if you emit the code described above.


This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

Okay, but we would still need to pack the operands into <2xhalf> from scalar values at some point to supply them to the intrinsic. Is there an op that would pack scalar values into a <2xhalf>?
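For intuition, two f16 scalars occupy a single 32-bit register when stored as <2 x half>; the bit-level packing such an op would perform can be sketched in Python with the struct module's half-precision format (this only illustrates the layout, not any MLIR machinery):

```python
import struct

def pack_2xhalf(lo, hi):
    """Pack two f16 values into one 32-bit word, low element first,
    mirroring the <2 x half> layout the NVVM intrinsics expect."""
    raw = struct.pack('<2e', lo, hi)    # two IEEE-754 half-precision floats
    return struct.unpack('<I', raw)[0]  # reinterpret the 4 bytes as one u32

def unpack_2xhalf(word):
    """Inverse: split one 32-bit word back into its two f16 elements."""
    raw = struct.pack('<I', word)
    return struct.unpack('<2e', raw)

# 1.0 and -2.0 are exactly representable in f16, so the round trip is exact.
print(unpack_2xhalf(pack_2xhalf(1.0, -2.0)))  # (1.0, -2.0)
```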

navdeepkk updated this revision to Diff 324324. Edited Feb 17 2021, 8:49 AM
navdeepkk marked 40 inline comments as done.

Changes in this diff:

1.) Address comments on diff 321329.
2.) Add tests for all user-visible error messages.
bondhugula added a comment. Edited Feb 22 2021, 10:55 AM


This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is
supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

bondhugula accepted this revision. Feb 22 2021, 11:02 AM

Thanks for addressing the large number of comments. Some additional minor ones and one that was missed (or not pushed). This overall looks great to me!

mlir/include/mlir/Dialect/GPU/GPUBase.td
60

Comment here please.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
335

Looks like this comment is still unaddressed but marked "Done". Did you forget to push? @ftynse suggested doing:

if (parser.parseType... || parser.resolveOperands(...) ...
  return failure();
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1054–1056

Can drop trivial braces.

ftynse added a comment. Mar 2 2021, 5:09 AM


But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

You are right that this op is a low-level abstraction, and that's why it doesn't feel like it really belongs to the GPU dialect. I understand there is a need for an op that is better integrated with the rest of MLIR, e.g., uses memref, and such an op wouldn't fit into the NVVM dialect either. So I wouldn't press my objection as long as another reviewer (@herhut, @csigg or @ThomasRaoux) agrees to have this in GPU dialect as is.

More generally, I think we will run into this problem again: we need some way of having MLIR-friendlier versions of LLVM IR intrinsics without having to duplicate abstractions. There are similar cases in "target vector" dialects like AVX512. Ideas welcome.

mehdi_amini added inline comments. Mar 3 2021, 2:07 PM
mlir/include/mlir/Dialect/GPU/GPUDialect.h
79

elements

80

May be useful to document the syntax and provide examples.

107


You are right that this op is a low-level abstraction, and that's why it doesn't feel like it really belongs to the GPU dialect. I understand there is a need for an op that is better integrated with the rest of MLIR, e.g., uses memref, and such an op wouldn't fit into the NVVM dialect either. So I wouldn't press my objection as long as another reviewer (@herhut, @csigg or @ThomasRaoux) agrees to have this in GPU dialect as is.

More generally, I think we will run into this problem again: we need some way of having MLIR-friendlier versions of LLVM IR intrinsics without having to duplicate abstractions. There are similar cases in "target vector" dialects like AVX512. Ideas welcome.

I think it would be good to have it in the GPU dialect as a common layer for SPIR-V, NVVM, and potentially ROCDL if it applies. The challenge is to pick a good abstraction for all of those. The SPIR-V dialect already has CooperativeMatrix ops and types, which are the exact equivalent of MMA. I think we should try to remove what is overfit to NVVM, like using the vector type for mmafragment, but otherwise this is the right direction in my opinion.

mlir/test/Dialect/GPU/invalid.mlir
464–470

I think we should also have a positive test in mlir/test/Dialect/GPU/ops.mlir


I think it would be good to have it in the GPU dialect as a common layer for SPIR-V, NVVM, and potentially ROCDL if it applies. The challenge is to pick a good abstraction for all of those. The SPIR-V dialect already has CooperativeMatrix ops and types, which are the exact equivalent of MMA. I think we should try to remove what is overfit to NVVM, like using the vector type for mmafragment, but otherwise this is the right direction in my opinion.

Hi all,
I just wanted to point out that the vector type here is used only for the F16-accumulate version of these intrinsics. For the C matrix, the type will change for the F32-accumulate version. So I wanted to keep the gpu.mmafragment type flexible for that, and perhaps just adjust the constraints as we go on to add more ops to the NVVM dialect.

I also just checked and saw that the SPIR-V dialect does not model the fragments held by a particular workitem (as I do here) but models the whole cooperative matrix as a single type. I am not sure how the SPIR-V ops are lowered to the target. Are they library calls? If so, that is very different from the NVVM dialect, which is an actual 1-to-1 mapping to the intrinsics. In that case it would be more difficult to choose the right abstraction; e.g., for F16 accumulate the library may do all the packing/unpacking needed to reach the same machine instruction that we reach via NVVM->LLVMIR->PTX, whereas here we have to ensure that the input is in the correct data format.

Please let me know what you think. Also, can someone point me to details on how SPIR-V is lowered?

mlir/include/mlir/Dialect/GPU/GPUDialect.h
47–56

Removed the class; now inheriting directly from TypeStorage in MMAFragmentStorageType.

86–87

Removed the redundant base class as pointed out in a previous comment. Retained this member.

mlir/include/mlir/Dialect/GPU/GPUOps.td
921

Enforcing identity layout maps for the source memref right now. Can be extended for generic layouts in future commits.

926–927

Not really. Since the GPU dialect is meant to be closer to the target, I preferred to have it without the view abstraction and to specify the actual indices of the start location.

928–929

The op generator can specify the LeadingDimension attribute, and in the lowering to the NVVM dialect the attribute can be used directly. I assume you mean non-contiguity in the leading dimension; in that case the op would still work as long as the layout is the identity and the leading dimension is correctly specified. I can enforce an identity layout for now and add support for generic layouts in future commits.

1014–1015

!gpu.mmafragment<4xvector<3xf16>> -> !gpu.mmafragment<4xvector<2xf16>>.

The shapes represent the part of the MMA fragment each thread holds. E.g., the result is of shape <4xvector<2xf16>>, which is 8 f16 elements per thread. So in all, the 32 threads (a warp) hold 256 elements, which is exactly the shape of the 16x16 output.
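The arithmetic in that explanation can be made explicit (a sanity check, not part of the patch):

```python
# Each thread holds a D fragment of 4 x vector<2xf16> = 8 f16 values.
elements_per_thread = 4 * 2
warp_size = 32
total_elements = elements_per_thread * warp_size
print(total_elements)             # 256
print(total_elements == 16 * 16)  # True: the full 16x16 output tile
```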

1026

functional-type does not parse the !gpu.mmafragment type, so I am retaining the current format.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
305–310

Sorry, I had already removed the printing but forgot to update the example.

315

I tried to use custom directives to print and parse a single type for all the operands. But there is a restriction on custom directives, quoted as: "The type directives must refer to a variable, but that variable need not also be a parameter to the custom directive." This prevents us from parsing a single type for all the operand types, and parser.resolveOperands() would fail.

316–319

Please let me know if this is to be moved and to where.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
132–134

I have removed these messages as they seemed redundant; the parser already emits errors for the missing required elements. I am adding tests for the other user-visible messages.

1023–1026

Removed redundant checks.

I think it would be good to have it in the GPU dialect to have a common layer for SPIR-V, NVVM and potentially ROCDL if it applies.

+1: even for a "low-level" operation, if the operation is available on multiple targets (SPIR-V or others; we can even consider that proprietary targets could benefit from this), then properly building the abstraction in the GPU dialect is really the intention here.

herhut added a comment.Mar 4 2021, 3:47 AM

Thank you @ftynse and @bondhugula for your many comments on this patch and thanks @navdeepkk for addressing them!

IMHO, this operation fits perfectly into the GPU dialect, as it allows targeting MMA operations independently of the underlying GPU target. Regarding the type encoding: ultimately, it only needs to be rich enough to allow conversion to the required LLVM types. We do not really have operations that extract the contained type, so it is truly opaque in that way. I am missing why we need the vector type. Why would mma.fragment<8x2xf16> not suffice for this? It can still be lowered to LLVM exactly as today.

To apply element-wise operations to the fragment, we would need special operations anyway or a special operation that extracts the element to compute on from the fragment. The latter could produce a vector if that is desired.

An open question for me is how this lowers to SPIR-V. Would mma.fragment<8x2xf16> be rich enough for that case, as well? @ThomasRaoux

mlir/include/mlir/Dialect/GPU/GPUBase.td
122

nit: Is the "GPU Dialect" here intentional?

mlir/include/mlir/Dialect/GPU/GPUDialect.h
74

nit: These?


Hi all,
I just wanted to point out that the vector type here is used only for the F16-accumulate version of these intrinsics. For the C matrix, the type will change for the F32-accumulate version. So I wanted to keep the gpu.mmafragment type flexible for that, and perhaps just adjust the constraints as we go on to add more ops to the NVVM dialect.

I also just checked and saw that the SPIR-V dialect does not model the fragments held by a particular workitem (as I do here) but models the whole cooperative matrix as a single type. I am not sure how the SPIR-V ops are lowered to the target. Are they library calls? If so, that is very different from the NVVM dialect, which is an actual 1-to-1 mapping to the intrinsics. In that case it would be more difficult to choose the right abstraction; e.g., for F16 accumulate the library may do all the packing/unpacking needed to reach the same machine instruction that we reach via NVVM->LLVMIR->PTX, whereas here we have to ensure that the input is in the correct data format.

Please let me know what you think. Also, can someone point me to details on how SPIR-V is lowered?

Hi Navdeep,

in SPIR-V the type CooperativeMatrixNVType is indeed an opaque type, and I think we should keep the MMA type in the GPU dialect opaque as well. SPIR-V cooperative matrix operations are just native SPIR-V ops and don't require any library call. Underneath they map to the exact same functionality as the NVVM MMA intrinsics, so it makes sense to try to find a common abstraction. If we make gpu.mmafragment an opaque type, we should be able to have a 1:1 mapping to both NVVM and SPIR-V. As @herhut pointed out, we cannot extract from this type, so it makes sense not to make any assumptions about the layout. In SPIR-V there are ways to extract elements, but the layout can only be known at runtime, so it is not practical to use in general. Ideally we should avoid that and use gpu.mmafragment until we store back to shared or global memory.

There are some lowering to SPIR-V cooperative matrix in IREE here: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToSPIRVPass.cpp#L330
In this case it goes directly from the vector dialect to SPIR-V cooperative matrix ops. That skips the level of abstraction you are adding, which is a hack to make cooperative matrix work and be testable end to end; this is the reason it is not upstreamed in MLIR.

Hi all, Thanks for the valuable comments. @ThomasRaoux Thanks for clarifying things on the SPIR-V side.

I am missing why we need the vector type. Why would mma.fragment<8x2xf16> not suffice for this? This can still be lowered as today to LLVM.

@herhut I used the vector type so that I could extract elements from the gpu.mmafragment type and supply them to the NVVM intrinsic as-is. Now that I think about it, mma.fragment<8x2xf16> can be used equivalently.

I will take some time to try to develop a more generic abstraction and look at removing the vector<> type. I'll post again for more feedback before implementing the next iteration.

Thanks!

Hey @navdeepkk, I'm curious to know how this is progressing.
I'm working on enabling CUDA codegen support in IREE (https://github.com/google/iree/tree/main/iree/compiler/Conversion/LinalgToNVVM). I have some basic vectorization working, and I'm probably 1 to 2 weeks away from the point where I would like to hook that up to MMA ops. My plan is to add a vector-to-GPU transformation in MLIR to create the MMA ops from vector.contract/vector.transfer kinds of ops.
Let me know if I can do anything to help move this forward. I can help address some parts of the patch or do anything else that would help you make progress.

I'm happy to chat more about my plans if you want and discuss how we can collaborate on this. Let me know.

Thanks


Cool, please rope me into any offline discussions as I plan to start touching some of that in the short-term future too.

Hi @ThomasRaoux,
Sorry for the late reply. Great to hear that these ops can be reused in the IREE pipeline too. I have been busy with some parallel work that uses these ops, getting it ready for an upcoming submission. The comments regarding the types are still to be addressed. I will surely be working on this, but I will only get started on any major changes next week. As you mention, it would be great to know what your plans are and how you wish to proceed.

Thanks!


Hi @navdeepkk, great to hear you were able to use those ops. Next week sounds good; if you think you won't have bandwidth to make progress in the next couple of weeks, please let me know, and hopefully I can take some of the work off your hands.
To give more detail on the IREE side: I'm trying to match what CUTLASS does using MLIR transformations. The flow will look like: linalg on tensors -> tiling along M, N, K and distribution to blocks -> promotion of operands to shared memory -> (double buffering and pipelining) -> tiling to warps -> tiling of the shared-memory copy -> linalg vectorization -> vector.contract unrolling to MMA size -> vector-to-GPU conversion to create the MMA GPU ops -> lowering to LLVM/NVVM.
A lot of this infrastructure is already there in IREE.

If you are able to share, it would be great to hear about the flow you used those ops with.


@ThomasRaoux Just to clarify: @navdeepkk already has all of the MLIR-based GPU code generation on top of these ops (the ones in this revision) working, using the affine dialect infrastructure: we don't use IREE, linalg, or the vector dialects, but our pipeline effectively includes all of the transformations you list above and several more in between those lowerings.

Getting to this PR itself: are there specific changes to make before this can be landed? @ftynse and others (Stephan and Mehdi) appear to be fine with having these ops in the GPU dialect. Could you please list out those changes, if any, or LGTM this? That will allow further development and progress upstream. If the remaining changes are minor, they could be made quickly and landed.

ThomasRaoux accepted this revision. Apr 13 2021, 9:15 PM

Thanks for explaining, @bondhugula. It is great to know that you have all those transformations. Is it publicly available? (or will it be?).
I 100% support having these ops in the GPU dialect. The only concern I had with the PR was about the type including vector instead of being completely opaque, because, as I mentioned, this will make it harder to share with the Vulkan path. However, I don't think it has to be a blocker; it is something we can iterate on, so I LGTM the patch.

Thanks @ThomasRaoux. Yes, our goal is to upstream this infrastructure - most of the underlying affine dialect transformations and utilities needed have been in MLIR for a long time. We'd like to send things out for review starting with the lower-level abstractions and support first - so this is really the first revision. (One op/abstraction that is publicly available but not upstream, and that we used for our purpose, is the memref_vector_cast op: https://reviews.llvm.org/D85885 - it also works with recent upstream at https://github.com/polymage-labs/mlirx )
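For context, the memref_vector_cast op referenced above reinterprets a memref of scalars as a memref of vectors, so that vector-width (e.g. WMMA-fragment-width) accesses can be expressed on plain memrefs. A rough sketch of its use, with the syntax approximated from the linked revision (the exact op name and type constraints may have evolved):

```mlir
// View a 16x16 f16 buffer as 16 rows of two <8 x f16> elements,
// so each load/store moves a full vector instead of a scalar.
%v = memref_vector_cast %buf : memref<16x16xf16> to memref<16x2xvector<8xf16>>
```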

@navdeepkk, @bondhugula, any update on this?

Hi @ThomasRaoux, I'll get started on this now and complete it as a priority in the next 3-4 days, or even sooner. Hope that works.

Thank you

navdeepkk updated this revision to Diff 342265. May 2 2021, 1:19 PM
navdeepkk marked 8 inline comments as done.

Changes in this diff :-

1.) Rebase on upstream/main.
2.) Address comments on the previous diff (324324).
3.) Remove gpu.mmafragment and introduce gpu.mma_matrix type.
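For readers following the type change, the reworked ops with the new `!gpu.mma_matrix` type compose roughly as below. This is an illustrative sketch of a single 16x16x16 f16 tile; the exact printed syntax (attribute names, operand order) is whatever this revision finally lands, so treat the details as approximate:

```mlir
// Load the A, B, and accumulator tiles from memrefs; %i and %j are the
// tile origin, and leadDimension is the source buffer's leading dimension.
%A = gpu.subgroup_mma_load_matrix %srcA[%i, %j] {leadDimension = 16 : index}
       : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
%B = gpu.subgroup_mma_load_matrix %srcB[%i, %j] {leadDimension = 16 : index}
       : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
%C = gpu.subgroup_mma_load_matrix %srcC[%i, %j] {leadDimension = 16 : index}
       : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "COp">

// D = A * B + C, performed cooperatively by the threads of the subgroup.
%D = gpu.subgroup_mma_compute %A, %B, %C
       : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">
       -> !gpu.mma_matrix<16x16xf16, "COp">

// Write the result tile back to memory.
gpu.subgroup_mma_store_matrix %D, %dst[%i, %j] {leadDimension = 16 : index}
  : !gpu.mma_matrix<16x16xf16, "COp">, memref<16x16xf16>
```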
bondhugula accepted this revision. May 2 2021, 5:46 PM

Looks great. Mostly minor comments on style and movement of some C++ code out of the td file.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
88

It's not immediately clear what this StringRef is for.

144

unsigned?

152

Nit: Operand -> operand

152

An operand is expected to be an SSA Value. It's not clear what this returns.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
365–385

We shouldn't have so much C++ code in the tablegen file. Please move this to a C++ file (NVVMOps.cpp) and call that from here.

389–397

Likewise.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
48

unsigned?

68

Operand -> operand

164

Nit: "x" -> 'x'

166

Likewise.

1014

Only -> only

1065

Operands -> operands

1074

Likewise.

mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
197–198

No need of the else.

215

Likewise.

237

Likewise. clang-tidy will also show a warning here.

274–275

Use getOperands() instead of doing getOpOperands() and then doing a get()?

ThomasRaoux accepted this revision. May 3 2021, 6:55 PM
ftynse accepted this revision. May 4 2021, 2:50 AM
This revision is now accepted and ready to land. May 4 2021, 2:50 AM
navdeepkk updated this revision to Diff 343281. May 5 2021, 9:50 PM
navdeepkk marked 17 inline comments as done.

Changes in this diff :-

1.) Address comments on the previous diff (342265).
navdeepkk edited the summary of this revision. May 5 2021, 9:53 PM
bondhugula accepted this revision. May 5 2021, 11:05 PM
navdeepkk updated this revision to Diff 343291. May 5 2021, 11:23 PM
navdeepkk marked an inline comment as done.

Changes in this diff :-

1.) Clang-format fix.
2.) Added TODO to generate MMAMatrix via ODS.
This revision was landed with ongoing or failed builds. May 5 2021, 11:37 PM
This revision was automatically updated to reflect the committed changes.