This is an archive of the discontinued LLVM Phabricator instance.

[MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops
ClosedPublic

Authored by navdeepkk on Jan 24 2021, 11:10 PM.

Details

Summary

[MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops
Add warp synchronous matrix-multiply accumulate ops to the GPU and NVVM
dialects. Add the following three ops to the GPU dialect :-

1.) subgroup_mma_load_matrix
2.) subgroup_mma_store_matrix
3.) subgroup_mma_compute

Add the following three ops to the NVVM dialect :-

1.) wmma.m16n16k16.load.[a,b,c].[f16,f32].row.stride
2.) wmma.m16n16k16.store.d.[f16,f32].row.stride
3.) wmma.m16n16k16.mma.row.row.[f16,f32].[f16,f32]
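
For illustration, the GPU-level ops are intended to compose roughly as follows for a single 16x16x16 tile. This is only a sketch: the operand tags, shapes, and exact syntax are illustrative, and the fragment type evolved during review from memref-based operands to !gpu.mmafragment and finally to !gpu.mma_matrix.

  %a = gpu.subgroup_mma_load_matrix %srcA[%i, %k] {leadDimension = 32 : index} : memref<32x32xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
  %b = gpu.subgroup_mma_load_matrix %srcB[%k, %j] {leadDimension = 32 : index} : memref<32x32xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
  %c = gpu.subgroup_mma_load_matrix %srcC[%i, %j] {leadDimension = 32 : index} : memref<32x32xf16> -> !gpu.mma_matrix<16x16xf16, "COp">
  %d = gpu.subgroup_mma_compute %a, %b, %c : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
  gpu.subgroup_mma_store_matrix %d, %dst[%i, %j] {leadDimension = 32 : index} : !gpu.mma_matrix<16x16xf16, "COp">, memref<32x32xf16>

Each of these lowers to one of the NVVM intrinsics listed above (load -> wmma.load, compute -> wmma.mma, store -> wmma.store).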

Diff Detail

Event Timeline

navdeepkk created this revision.Jan 24 2021, 11:10 PM
navdeepkk requested review of this revision.Jan 24 2021, 11:10 PM
ftynse requested changes to this revision.Jan 25 2021, 5:26 AM

Thanks! I have some comments.

mlir/include/mlir/Dialect/GPU/GPUOps.td
943–944

We usually use Index rather than I64 for indexing into memrefs. Is there a specific reason why I64 is needed here?

945

Could we use a longer name for this, e.g., leadDimension?

1006

Have you considered encoding this into a TypeConstraint in ODS instead of using AnyMemRef?

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
19

I am not a fan of creating a dependency from the NVVM dialect on the GPU dialect only for the purpose of reusing an attribute. Have you considered having two versions of this attribute for different dialects, or putting it into some included file shared by both?

153

Please follow the code style inside inline code blocks. Here in particular, add spaces between "if" and "(".

154–163

We generally prefer to have one operation per intrinsic in this dialect. There is certainly value in having an operation where this duplication is abstracted away, but I suppose the one in the GPU dialect is just right for that. This dispatch should happen when converting from the gpu.subgroup_mma_load to nvvm.wmma_* rather than in translation to LLVM IR. This will also resolve the dialect layering problem I pointed at above.

189

function-type($args, $res) ?

257–262

This looks intimidatingly verbose. Are we ever expected to have different types for operands or can we just print one instead?

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
980

Nit: please don't use magic numbers. Rather define this as GPUDialect::kThreadPrivateMemorySpace and so on.

984–989

This can be defined in ODS, instead of using AnyMemRef.

991

This can be put into ODS.

1004

Braces must be symmetric.

1079–1084

Nit: please use LogicalResult for verify* functions, even internal. Otherwise, use the name that makes it clear what are the intended semantics of the return value, e.g. isValidType.

mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
135

Please fix linter findings.

This revision now requires changes to proceed.Jan 25 2021, 5:26 AM

Adding Thomas and Christian to see whether they have comments wrt. the GPU operations.

ThomasRaoux added inline comments.Jan 26 2021, 12:36 AM
mlir/include/mlir/Dialect/GPU/GPUOps.td
917–938

The part that I don't understand is whether we expect the destination memref to have a defined layout or whether it is opaque. As far as I can tell, in both CUDA and Vulkan the layout of the mma matrix type is opaque in private memory.
If the layout is opaque, how can we perform elementwise operations on the matrix type without going back to global or shared memory? One common usage would be to fuse the MMA compute with elementwise operations and use the result of the MMA directly without going back to memory. Is it possible to represent such a case with the current design?

bondhugula added inline comments.Jan 26 2021, 3:10 AM
mlir/include/mlir/Dialect/GPU/GPUOps.td
1016

Incorrect copy/paste example.

navdeepkk updated this revision to Diff 321329.Feb 3 2021, 11:56 PM
navdeepkk marked 15 inline comments as done.

Issues in diff 318905 :-

1.) The design used memrefs to model MMA fragments, which were allocated in `.local`
  space in the generated PTX. This completely defeated the purpose of the WMMA ops (to
  use operands in `.reg` space).

Changes in this diff :-

1.) Address comments on diff 318905.
2.) Introduce a new type !gpu.mmafragment to enable use of MMAOps operands in
registers.
3.) Modify all the introduced ops/test cases to use the !gpu.mmafragment type.
4.) Add the NVVM IR translation test previously in Revision D95333.

Can you use mma_fragment instead of mmafragment for better readability?

bondhugula added inline comments.Feb 4 2021, 12:50 AM
mlir/include/mlir/Dialect/GPU/GPUOps.td
935

The design looks much more in line now with the memref operands being replaced with pure values and with a result value instead of an output memref.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
32

Nit: MMAFragmentType

72–79

Use an enum?

bondhugula added inline comments.Feb 4 2021, 12:52 AM
mlir/include/mlir/Dialect/GPU/GPUDialect.h
90

Matrix-multiply -> matrix-matrix multiply

98–115

All of these need doc comments.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
309–312

I think all of this code should be in a C++ file. Please see other ops for reference, or check if there is a guideline. @ftynse?

ftynse added a comment.Feb 9 2021, 6:12 AM

Looks better with a dedicated type. I have some further concerns about memref layout and a bunch of small comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
46–55

This class looks unnecessary. It has only one derived class, which doesn't actually use anything from the base class.

76–80

There's no need in getters if member fields are public.

85–86

This shadows the elementType from the base class. You'll actually have two elementType values this way...

mlir/include/mlir/Dialect/GPU/GPUOps.td
918

Does it make any assumptions about the data storage in the memref? It can have an arbitrary layout now, not only consecutive elements.

923–924

Have you considered always loading the entire matrix starting from indices 0x0 and asking the user to use std.subview to position the view in a way they want? There may be a reason to use indices here, but it is unclear to me. Otherwise, it feels like this op will be a subset of functionality that subview already provides.

925–926

Why is this attribute necessary? It looks redundant with the dimension of the memref available in its type. If it may differ from the memref size, then this op needs to clarify how it handles such non-contiguous cases.

934

Please document why it is important to specify which operand is being loaded (I suppose because of how the fragment is laid out).

935

Nit: this example breaks the verifier, which expects leadDimension to have i32 type.

941

Why I32Attr and not IndexAttr?

958–976

Same remarks as for the "load" op.

1011–1012

These shapes don't add up for me.

1023

Nit: I would have just used functional-type(operands, results) here

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
147

It looks like operand is never used

153

Nit: use [{ }] for multi-line text.

177

Tablegen doesn't have $-substitution; you need something like !strconcat( and a way of accessing the common string.

298–303

This is still super-verbose. Do we expect operands of different types?

308

Any reason why this cannot be implemented with declarative assembly format?

309

We shouldn't be needing global namespace qualification in this code. It goes into the body of an OpTy::parse, which itself lives in the MLIR namespace.

309–312

There are no guidelines AFAIK. I prefer and ask to put any non-trivial code with more than ~5 lines in a C++ file.

311

This is useless; it is only necessary if the variable is used solely in assertions, to avoid -Wunused-variable in release builds. Here, it is always used.

314

This is unnecessary, ArrayRef is implicitly constructible from a single element, just pass resType to addTypes.

317–327

Chain these with || in a single if condition. This is the reason why Parser methods return bool.

330

I doubt "20" was chosen for any particular reason. SmallVector now supports automatic selection of the number of stack elements if none is provided in the template; just use that if there is no specific reason to do otherwise.

340

Use getOperationName() instead

344–345

These can be combined in a single string...

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
51–56

It is better to inspect elementType rather than to construct a new one for comparison. Construction takes a lock in the context, inspection does not.

Also, if (condition) return false; return true should be written as return !condition.

121–123

All user-visible error messages need a test.

133–138

parseType accepts references to a derived type and does this check for you. Just declare elementType as VectorType

141

This should be in the type verifier instead.

142

Note that the location points to the token after the last consumed one, i.e. after >. Capture the location before parsing the size, or at least point to the first token in the type.

146–154

Don't duplicate the verifier, use getChecked instead.

977

Replace these magic numbers with named constants defined above.

999–1002

Don't duplicate the checks already enforced by type constraints in ODS.

Also, if you had added tests for user-visible errors, you would have seen that this message is never produced because the error condition is caught by ODS-generated verifier parts with a different message.

ftynse requested changes to this revision.Feb 9 2021, 6:12 AM
This revision now requires changes to proceed.Feb 9 2021, 6:12 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

mlir/include/mlir/Dialect/GPU/GPUOps.td
917–938

Hi, I quote from the PTX side of these LLVM intrinsics,

The distribution of fragments loaded by the threads in a warp is unspecified and is target architecture dependent. The contents of a matrix fragment can be manipulated by reading and writing to individual registers in the fragment, provided the following conditions are satisfied:

  1. All matrix element in the fragment are operated on uniformly across threads, using the same parameters.
  2. The order of the matrix elements is not changed.

So as long as the operation is something like a bias addition, which is uniform throughout the output matrix, I think it would be possible to realize the fusion using the present design. The way to go would be to introduce another op in the GPU dialect, something like:

gpu.subgroup_mma_fuse_bias %0, %1 : memref<1x16xvector<2xf16>>, f16

The argument memref will be the result of a gpu.subgroup_mma_compute. And the other argument will be the bias. With the appropriate LLVM lowerings this would reuse the results of gpu.subgroup_mma_compute in registers and hence prevent trip to global/shared memory.

Let me know what you think, I can implement the above op and check, I think it should work.

917–938

Also, could you please tell me whether "fuse the MMA compute with elementwise operations" refers to fusing an elementwise operation with the register/warp-level tile of the accumulator? That is what I assumed in my reply above. Is that correct?

943–944

Yes. When calculating the start of the load address, I emit LLVM ops of this sort:

%[[LDM:.*]] = llvm.mlir.constant(32 : i64) : !llvm.i64
%[[ILDM:.*]] = llvm.mul %[[LDM]], %[[OFF]] : !llvm.i64
%[[IJLDM:.*]] = llvm.add %[[ILDM]], %[[OFF]] : !llvm.i64
%[[IJOLDM:.*]] = llvm.add %[[IJLDM]], %[[OFFSETT]] : !llvm.i64

Now the leading dimension is of type i64, and to have the same types in the add/mul ops I have used I64 for the indices too.

945

Yes, I'll change it.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
257–262

Since this op is mapped to a single intrinsic, the type of the operands is fixed. I modified the op to print only one type.

ftynse added a comment.Feb 9 2021, 9:48 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

mlir/include/mlir/Dialect/GPU/GPUOps.td
943–944

This is exactly why you should use index or the result of converting index to LLVM. There is no guarantee that it is i64, so you should not be mixing i64 constants with anything that indexes a memref, e.g. the offset in the descriptor. We actually have a flow that converts index to i32 on GPUs as an optimization, and this flow will likely break if you emit the code described above.
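
To illustrate the concern, the same address arithmetic written against the converted index type (here assuming the index-to-i32 GPU flow mentioned above; the value names are made up) would be:

  %ldm  = llvm.mlir.constant(32 : i32) : !llvm.i32
  %ildm = llvm.mul %ldm, %row : !llvm.i32
  %ij   = llvm.add %ildm, %col : !llvm.i32
  %addr = llvm.add %ij, %off : !llvm.i32

i.e., the emitted constants and arithmetic must follow whatever type index lowers to, rather than hard-coding i64.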

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

Okay, but we would still need to pack the operands into <2xhalf> from scalar values at some point to supply them to the intrinsic. Is there some op which would pack scalar values into a <2xhalf>?
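
(For illustration only, one way such packing could be expressed with existing ops is vector.insertelement; this is an assumption about how a lowering could do it, not something in this patch, and %s0/%s1 stand for the scalar f16 values being packed:

  %c0   = constant 0 : i32
  %c1   = constant 1 : i32
  %zero = constant dense<0.0> : vector<2xf16>
  %v0   = vector.insertelement %s0, %zero[%c0 : i32] : vector<2xf16>
  %v1   = vector.insertelement %s1, %v0[%c1 : i32] : vector<2xf16>
)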

navdeepkk updated this revision to Diff 324324.EditedFeb 17 2021, 8:49 AM
navdeepkk marked 40 inline comments as done.

Changes in this diff:-

1.) Address comments on diff 321329.
2.) Add tests for all user-visible error messages.
bondhugula added a comment.EditedFeb 22 2021, 10:55 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is
supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

bondhugula accepted this revision.Feb 22 2021, 11:02 AM

Thanks for addressing the large number of comments. Some additional minor ones and one that was missed (or not pushed). This overall looks great to me!

mlir/include/mlir/Dialect/GPU/GPUBase.td
60

Comment here please.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
328

Looks like this comment is still unaddressed but marked "Done". Did you forget to push? @ftynse suggested doing:

if (parser.parseType... || parser.resolveOperands(...) ...
  return failure();
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1030–1032

Can drop trivial braces.

ftynse added a comment.Mar 2 2021, 5:09 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is
supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

You are right that this op is a low-level abstraction, and that's why it doesn't feel like it really belongs to the GPU dialect. I understand there is a need for an op that is better integrated with the rest of MLIR, e.g., uses memref, and such an op wouldn't fit into the NVVM dialect either. So I wouldn't press my objection as long as another reviewer (@herhut, @csigg or @ThomasRaoux) agrees to have this in GPU dialect as is.

More generally, I think we will run into this problem again: we need some way of having MLIR-friendlier versions of LLVM IR intrinsics without having to duplicate abstractions. There are similar cases in "target vector" dialects like AVX512. Ideas welcome.

mehdi_amini added inline comments.Mar 3 2021, 2:07 PM
mlir/include/mlir/Dialect/GPU/GPUDialect.h
78

elements

79

May be useful to document the syntax and provide examples.

106

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is
supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

You are right that this op is a low-level abstraction, and that's why it doesn't feel like it really belongs to the GPU dialect. I understand there is a need for an op that is better integrated with the rest of MLIR, e.g., uses memref, and such an op wouldn't fit into the NVVM dialect either. So I wouldn't press my objection as long as another reviewer (@herhut, @csigg or @ThomasRaoux) agrees to have this in GPU dialect as is.

More generally, I think we will run into this problem again: we need some way of having MLIR-friendlier versions of LLVM IR intrinsics without having to duplicate abstractions. There are similar cases in "target vector" dialects like AVX512. Ideas welcome.

I think it would be good to have it in the GPU dialect to have a common layer for SPIR-V, NVVM and potentially ROCDL if it applies. The challenge is to pick a good abstraction for all of those. SPIR-V dialect already has CooperativeMatrix ops and types which are the exact equivalent to MMA. I think we should try to remove what is an overfit for NVVM like using vector type for mmafragment but otherwise this is the right direction in my opinion.

mlir/test/Dialect/GPU/invalid.mlir
464–470

I think we should also have a positive test in mlir/test/Dialect/GPU/ops.mlir

Hi all,
I just wanted to point out that the vector type here was used only for the F16 accumulate version of these intrinsics. For the C matrix, the type will change for the F32 accumulate version. So I wanted to make the gpu.mmafragment type flexible for that and perhaps just adjust the constraints as we go on to add more ops to the nvvm dialect.

I also just checked and saw that the SPIR-V dialect does not model the fragments held by a particular workitem (as I do here) but models the whole cooperative matrix as a single type. I am not very sure how the SPIR-V ops are lowered to the target. Are they library calls? If yes, then it is very different from the NVVM dialect, which is actually a 1-to-1 mapping with the intrinsics. In that case, it would be more difficult to choose the right abstraction; e.g., for F16 accumulate the library may just do all the packing/unpacking to get to the same machine instruction that we get to via NVVM->LLVMIR->PTX. Here we have to ensure that the input is in the correct data format.

Please let me know what you think. Also can someone please point out the details on how SPIR-V is lowered?

mlir/include/mlir/Dialect/GPU/GPUDialect.h
46–55

Removed the class; now directly inheriting from TypeStorage in MMAFragmentStorageType.

85–86

Removed the redundant base class as pointed out in a previous comment. Retained this member.

mlir/include/mlir/Dialect/GPU/GPUOps.td
918

Enforcing identity layout maps for the source memref right now. Can be extended for generic layouts in future commits.

923–924

Not really. Since the GPU dialect is meant to be closer to the target, I preferred to have it without the view abstraction and specify the start location's actual indices.

925–926

Whoever creates the op can specify the leadDimension, and in the lowering to the NVVM dialect the attribute can be used directly. I assume you mean non-contiguity in the leading dimension; in that case, the op would still work as long as the layout is identity and the leading dimension is correctly specified. I can enforce an identity layout for now and add support for generic layouts in future commits.
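
To make the semantics concrete (the shapes and the type/attribute spelling here are illustrative only): loading a 16x16 tile at offset (%i, %j) out of a 64x64 row-major buffer would pass the buffer's row stride, not the tile width, so element (r, c) of the tile is read from base + (i + r) * leadDimension + (j + c):

  %a = gpu.subgroup_mma_load_matrix %buf[%i, %j] {leadDimension = 64 : index} : memref<64x64xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">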

1011–1012

!gpu.mmafragment<4xvector<3xf16>> -> !gpu.mmafragment<4xvector<2xf16>>.

The shapes represent the part of the MMA fragment each thread holds. E.g., the result is of shape <4xvector<2xhalf>>, i.e., 8 fp16 elements per thread. So in all, 32 threads (a warp) hold 256 elements, which is exactly the shape of the output, 16x16.

1023

functional-type is not parsing the !gpu.mmafragment type; retaining the current format.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
298–303

Sorry, I had already removed the printing but forgot to update the example.

308

I tried to use custom directives to print and parse a single type for all the operands. But there is a restriction on custom directives, quoted as "The type directives must refer to a variable, but that variable need not also be a parameter to the custom directive." This prevents us from parsing a single type for all the operand types, and parser.resolveOperands() would fail.

309–312

Please let me know if this should be moved, and to where.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
121–123

I have removed these messages as they seemed redundant. The parser already emits errors for the required thing. I am adding tests for other user-visible messages.

999–1002

Removed redundant checks.

I think it would be good to have it in the GPU dialect to have a common layer for SPIR-V, NVVM and potentially ROCDL if it applies.

+1: even for a "low-level" operation, if the operation is available on multiple targets (SPIR-V or others; we can even consider that proprietary targets can benefit from this), then properly building the abstraction in the GPU dialect is really the intention here.

herhut added a comment.Mar 4 2021, 3:47 AM

Thank you @ftynse and @bondhugula for your many comments on this patch and thanks @navdeepkk for addressing them!

IMHO, this operation fits perfectly into the GPU dialect, as it allows targeting mma operations independently of the GPU target used. Regarding the type encoding: Ultimately, it only needs to be rich enough to allow conversion to the required LLVM types. We do not really have operations that extract the contained type, so it is truly opaque in that way. I am missing why we need the vector type. Why would mma.fragment<8x2xf16> not suffice for this? This can still be lowered as today to LLVM.

To apply element-wise operations to the fragment, we would need special operations anyway or a special operation that extracts the element to compute on from the fragment. The latter could produce a vector if that is desired.

An open question for me is how this lowers to SPIR-V. Would mma.fragment<8x2xf16> be rich enough for that case, as well? @ThomasRaoux

mlir/include/mlir/Dialect/GPU/GPUBase.td
121

nit: Is the "GPU Dialect" here intentional?

mlir/include/mlir/Dialect/GPU/GPUDialect.h
73

nit: These?

Hi Navdeep,

in SPIR-V the type CooperativeMatrixNVType is indeed an opaque type, and I think we should keep the MMA type in the GPU dialect opaque as well. SPIR-V cooperative matrices are just native SPIR-V ops and don't require any library call. Underneath they map to the exact same functionality as the NVVM MMA intrinsics, so it makes sense to try to find a common abstraction. If we make gpu.mmafragment an opaque type, we should be able to have a 1:1 mapping to both NVVM and SPIR-V. As @herhut pointed out, we cannot extract from this type, so it makes sense not to make any assumptions about the layout. In SPIR-V there are ways to extract elements, but the layout can only be known at runtime, so it is not practical to use in general. Ideally we should avoid that and use gpu.mmafragment until we store back to shared or global memory.

There are some lowering to SPIR-V cooperative matrix in IREE here: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToSPIRVPass.cpp#L330
In this case it goes directly from the vector dialect to SPIR-V cooperative matrix. This skips the level of abstraction you are adding, which is a hack to make it work and test cooperative matrix end to end. This is the reason why it is not upstreamed in MLIR.

Hi all, Thanks for the valuable comments. @ThomasRaoux Thanks for clarifying things on the SPIR-V side.

I am missing why we need the vector type. Why would mma.fragment<8x2xf16> not suffice for this? This can still be lowered as today to LLVM.

@herhut I used the vector type just so I could extract elements from the gpu.mmafragment type and supply them to the nvvm intrinsic as-is. Now that I think about it, mma.fragment<8x2xf16> can be used equivalently.

I will take some time and try to develop a more generic abstraction and look at removing the vector<> type. I'll post again for some more feedback before implementing the next iteration.

Thanks!

Hey @navdeepkk, I'm curious to know how this is progressing.
I'm working on enabling CUDA codegen support in IREE (https://github.com/google/iree/tree/main/iree/compiler/Conversion/LinalgToNVVM). I have some basic vectorization working and I'm probably 1 to 2 weeks away from a point where I would like to hook that up to MMA ops. My plan is to add some Vector to GPU transformation in MLIR to create the MMA ops from vector.contract/vector.transfer kind of ops.
Let me know if I can do anything to help this move forward. I can help address some parts of the patch or do anything else that would help you make progress.

I'm happy to chat more about my plans if you want and discuss how we can collaborate on this. Let me know.

Thanks

Cool, please rope me into any offline discussions as I plan to start touching some of that in the short-term future too.

Hi @ThomasRaoux,
Sorry for the late reply. Great to hear that these ops can be reused in the IREE pipeline too. I was actually busy with some parallel work using these ops and getting it ready for an upcoming submission. The comments regarding the types are still to be addressed. I will surely be working on this, but I will get started on any major changes only by next week. As you mention, it would be great to know what your plans are and how you wish to proceed.

Thanks!

Hi @navdeepkk, great to hear you were able to use those ops. Next week sounds good; if you think you won't have bandwidth to progress in the next couple of weeks, please let me know and hopefully you can offload some of the work to me.
To give more details on the IREE side, I'm trying to match what cutlass does using MLIR transformations. The flow will look like: linalg on tensor -> tiling along M,N,K and distribute to blocks -> promote operands to shared memory -> (double buffering and pipelining) -> tiling to warp -> tiling shared memory copy -> linalg vectorization -> vector.contract unrolling to mma size -> vector to GPU to create MMA GPU ops -> lowering to llvm/nvvm.
A lot of this infrastructure is already there in IREE.

If you are able to share, it would be great to hear about the flow you used those ops with.

@ThomasRaoux Just to clarify: @navdeepkk has all of the MLIR-based GPU code generation implemented on top of these ops (the ones in this revision) already working using the affine dialect infrastructure: we don't use IREE, linalg, or the vector dialects, but our pipeline effectively includes all of the transformations you list above and several more in between those lowerings.

Getting to this PR itself, are there specific changes to make before this can be landed? @ftynse and others (Stephan and Mehdi) appear to be fine with having these ops in the GPU dialect. Could you please list those changes out if any or LGTM this? This will allow further development and progress upstream. If the remaining changes are minor, these could be made quickly and landed.

ThomasRaoux accepted this revision.Apr 13 2021, 9:15 PM

Thanks for explaining, @bondhugula. It is great to know that you have all those transformations. Is it publicly available? (or will it be?).
I 100% support having these ops in the GPU dialect. The only concern I had with the PR was about the type including vector instead of being completely opaque, because, as I mentioned, this will make it harder to share with the Vulkan path. However, I don't think it has to be a blocker, and it is something that we can iterate on, so I LGTM the patch.

Thanks @ThomasRaoux. Yes, our goal is to upstream this infrastructure - most of the underlying affine dialect transformations and utilities needed have already been there in MLIR for a long time. We'd like to send out things for review starting with the lower level abstractions and support first - so this is really the first revision. (One op/abstraction that is publicly available but not in upstream that we used for our purpose is the memref_vector_cast op: https://reviews.llvm.org/D85885 - this is also available to work with recent upstream at https://github.com/polymage-labs/mlirx )

@navdeepkk, @bondhugula, any update on this?

Hi @ThomasRaoux, I'll get started on this now, and complete it on priority in the next 3-4 days or even sooner. Hope that works.

Thank you

navdeepkk updated this revision to Diff 342265.May 2 2021, 1:19 PM
navdeepkk marked 8 inline comments as done.

Changes in this diff :-

1.) Rebase on upstream/main.
2.) Address comments on previous diff(324324).
3.) Remove gpu.mmafragment and introduce gpu.mma_matrix type.
bondhugula accepted this revision.May 2 2021, 5:46 PM

Looks great. Mostly minor comments on style and movement of some C++ code out of the td file.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
87

It's not immediately clear what this StringRef is for.

143

unsigned?

151

Nit: Operand -> operand

151

An operand is expected to be an SSA Value. It's not clear what this returns.

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
358–378

We shouldn't have so much C++ code in the tablegen file. Please move this to a C++ file (NVVMOps.cpp) and call that from here.

382–390

Likewise.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
48

unsigned ?

68

Operand -> operand

144

Nit: "x" -> 'x'

146

Likewise.

990

Only -> only

1041

Operands -> operands

1050

Likewise.

mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
197–198

No need of the else.

215

Likewise.

237

Likewise. clang-tidy will also show a warning here.

274–275

Use getOperands() instead of doing getOpOperands() and then doing a get()?

ThomasRaoux accepted this revision.May 3 2021, 6:55 PM
ftynse accepted this revision.May 4 2021, 2:50 AM
This revision is now accepted and ready to land.May 4 2021, 2:50 AM
navdeepkk updated this revision to Diff 343281.May 5 2021, 9:50 PM
navdeepkk marked 17 inline comments as done.

Changes in this diff :-

1.) Address comments on previous diff(342265).
navdeepkk edited the summary of this revision. (Show Details)May 5 2021, 9:53 PM
bondhugula accepted this revision.May 5 2021, 11:05 PM
navdeepkk updated this revision to Diff 343291.May 5 2021, 11:23 PM
navdeepkk marked an inline comment as done.

Changes in this diff :-

1.) Clang-format fix.
2.) Added TODO to generate MMAMatrix via ODS.
This revision was landed with ongoing or failed builds.May 5 2021, 11:37 PM
This revision was automatically updated to reflect the committed changes.