This is an archive of the discontinued LLVM Phabricator instance.

Differential D95330

[MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops
ClosedPublic

Authored by navdeepkk on Jan 24 2021, 11:10 PM.

Details

Reviewers

bondhugula
ftynse
herhut
ThomasRaoux
csigg

Commits

rG875eb523c132: [MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops

Summary

Add warp synchronous matrix-multiply accumulate ops to the GPU and NVVM
dialects. Add the following three ops to the GPU dialect:

1.) subgroup_mma_load_matrix
2.) subgroup_mma_store_matrix
3.) subgroup_mma_compute

Add the following three ops to the NVVM dialect:

1.) wmma.m16n16k16.load.[a,b,c].[f16,f32].row.stride
2.) wmma.m16n16k16.store.d.[f16,f32].row.stride
3.) wmma.m16n16k16.mma.row.row.[f16,f32].[f16,f32]

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

navdeepkk created this revision. Jan 24 2021, 11:10 PM

Herald added a reviewer: ftynse. Jan 24 2021, 11:10 PM

Herald added subscribers: teijeong, rdzhabarov, tatianashp and 16 others.

navdeepkk requested review of this revision. Jan 24 2021, 11:10 PM

Herald added a reviewer: herhut. Jan 24 2021, 11:10 PM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

navdeepkk added a child revision: D95331: [MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply accumulate GPU ops. Jan 24 2021, 11:18 PM

Harbormaster completed remote builds in B86506: Diff 318905. Jan 25 2021, 12:06 AM

Thanks! I have some comments.

mlir/include/mlir/Dialect/GPU/GPUOps.td
946–947	We usually use Index rather than I64 for indexing into memrefs. Is there a specific reason why I64 is needed here?
948	Could we use a longer name for this, e.g., leadDimension?
1009	Have you considered encoding this into a TypeConstraint in ODS instead of using AnyMemRef?
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
19	I am not a fan of creating a dependency from the NVVM dialect on the GPU dialect only for the purpose of reusing an attribute. Have you considered having two versions of this attribute for different dialects, or putting it into some included file shared by both?
160	Please follow the code style inside inline code blocks. Here in particular, add spaces between "if" and "(".
161–170	We generally prefer to have one operation per intrinsic in this dialect. There is certainly value in having an operation where this duplication is abstracted away, but I suppose the one in the GPU dialect is just right for that. This dispatch should happen when converting from the `gpu.subgroup_mma_load` to `nvvm.wmma_*` rather than in translation to LLVM IR. This will also resolve the dialect layering problem I pointed at above.
196	function-type($args, $res) ?
264–269	This looks intimidatingly verbose. Are we ever expected to have different types for operands or can we just print one instead?
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1004	Nit: please don't use magic numbers. Rather define this as `GPUDialect::kThreadPrivateMemorySpace` and so on.
1008–1013	This can be defined in ODS, instead of using AnyMemRef.
1015	This can be put into ODS.
1028	Braces must be symmetric.
1103–1108	Nit: please use LogicalResult for verify* functions, even internal ones. Otherwise, use a name that makes the intended semantics of the return value clear, e.g. `isValidType`.
mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
135	Please fix linter findings.
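
As an illustration of the "no magic numbers" nit above, here is a minimal standalone sketch, not the code from the patch. `GPUDialectSketch`, both helper functions, and the numeric values 3 and 5 are assumptions made for this example; only the name `kThreadPrivateMemorySpace` comes from the review comment.

```cpp
#include <cassert>

// Hypothetical stand-in for the dialect class; the real GPUDialect is in MLIR.
struct GPUDialectSketch {
  // Assumed values, for illustration only.
  static constexpr unsigned kWorkgroupMemorySpace = 3;
  static constexpr unsigned kThreadPrivateMemorySpace = 5;
};

// Returns true if the memory space is one of the two named spaces above.
inline bool isSupportedFragmentMemorySpace(unsigned memorySpace) {
  return memorySpace == GPUDialectSketch::kWorkgroupMemorySpace ||
         memorySpace == GPUDialectSketch::kThreadPrivateMemorySpace;
}

// Returns a printable name; the caller must pass a supported memory space.
inline const char *fragmentMemorySpaceName(unsigned memorySpace) {
  assert(isSupportedFragmentMemorySpace(memorySpace) &&
         "unsupported memory space");
  return memorySpace == GPUDialectSketch::kWorkgroupMemorySpace ? "workgroup"
                                                                : "private";
}
```

With the constants named, a check such as `isSupportedFragmentMemorySpace(space)` reads without needing a comment to explain what 3 or 5 means.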

This revision now requires changes to proceed. Jan 25 2021, 5:26 AM

Adding Thomas and Christian to see whether they have comments wrt. the GPU operations.

ThomasRaoux added inline comments. Jan 26 2021, 12:36 AM

mlir/include/mlir/Dialect/GPU/GPUOps.td
920–941	The part that I don't understand is whether we expect the destination memref to have a defined layout or whether it is opaque. As far as I can tell, in both CUDA and Vulkan the layout of the mma matrix type is opaque in private memory. If the layout is opaque, how can we perform elementwise operations on the matrix type without going back to global or shared memory? One common usage would be to fuse the MMA compute with elementwise operations and directly use the result of the MMA without going back to memory. Is it possible to represent such a case with the current design?

bondhugula added inline comments. Jan 26 2021, 3:10 AM

mlir/include/mlir/Dialect/GPU/GPUOps.td
1019	Incorrect copy/paste example.

Issues in diff 318905:

1.) The design used memrefs to model mmafragments, and these were allocated in
  `.local` space in the generated PTX. This completely defeated the purpose of
  the wmma ops (to use operands in `.reg` space).

Changes in this diff :-

1.) Address comments on diff 318905.
2.) Introduce a new type !gpu.mmafragment to enable use of MMAOps operands in
registers.
3.) Modify all the introduced ops/test cases to use the !gpu.mmafragment type.
4.) Add the NVVM IR translation test previously in Revision D95333.

Harbormaster completed remote builds in B87840: Diff 321329. Feb 4 2021, 12:33 AM

Can you use mma_fragment instead of mmafragment for better readability?

bondhugula added inline comments. Feb 4 2021, 12:50 AM

mlir/include/mlir/Dialect/GPU/GPUOps.td
938	The design looks much more in line now with the `memref` operands being replaced with pure values and with a result value instead of an output memref.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
32	Nit: MMAFragmentType
83–90	Use an `enum`?

bondhugula added inline comments. Feb 4 2021, 12:52 AM

mlir/include/mlir/Dialect/GPU/GPUDialect.h
91	Matrix-multiply -> matrix-matrix multiply
99–116	All of these need doc comments.
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
316–319	I think all of this code should be in a C++ file. Please see other things for reference or if there is a guideline. @ftynse?

Looks better with a dedicated type. I have some further concerns about memref layout and a bunch of small comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
47–56	This class looks unnecessary. It has only one derived class, which doesn't actually use anything from the base class.
77–81	There's no need in getters if member fields are public.
86–87	This shadows the `elementType` from the base class. You'll actually have two `elementType` values this way...
mlir/include/mlir/Dialect/GPU/GPUOps.td
921	Does it make any assumptions about the data storage in the memref? It can have an arbitrary layout now, not only consecutive elements.
926–927	Have you considered always loading the entire matrix starting from indices 0x0 and asking the user to use `std.subview` to position the view in a way they want? There may be a reason to use indices here, but it is unclear to me. Otherwise, it feels like this op will be a subset of functionality that subview already provides.
928–929	Why is this attribute necessary? It looks redundant with the dimension of the memref available in its type. If it may differ from the memref size, then this op needs to clarify how it handles such non-contiguous cases.
937	Please document why it is important to specify which operand is being loaded (I suppose because of how the fragment is laid out).
938	Nit: this example breaks the verifier, which expects `leadDimension` to have `i32` type.
944	Why I32Attr and not IndexAttr?
961–979	Same remarks as for the "load" op.
1014–1015	These shapes don't add up for me.
1026	Nit: I would have just used `functional-type(operands, results)` here
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
154	It looks like `operand` is never used
160	Nit: use `[{ }]` for multi-line text.
184	Tablegen doesn't have $-substitution, you need something like `!strconcat(` and a way of accessing the common string.
305–310	This is still super-verbose. Do we expect operands of different types?
315	Any reason why this cannot be implemented with declarative assembly format?
316	We shouldn't be needing global namespace qualification in this code. It goes into the body of a `OpTy::parse`, which itself lives in the MLIR namespace.
316–319	There are no guidelines AFAIK. I prefer and ask to put any non-trivial code with more than ~5 lines in a C++ file.
318	This is useless. It is only necessary if the variable is used solely in assertions, to avoid -Wunused-variable in release builds. Here, it is always used.
321	This is unnecessary, ArrayRef is implicitly constructible from a single element, just pass `resType` to `addTypes`.
324–334	Chain these with `||` in a single `if` condition. This is the reason why `Parser` methods return `bool`.
337	I doubt "20" was chosen for any particular reason. SmallVector now supports automatic selection of the number of stack elements if none is provided in the template, just use that if there is no specific reason to do otherwise.
347	Use `getOperationName()` instead
351–352	These can be combined in a single string...
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
51–56	It is better to inspect `elementType` rather than to construct a new one for comparison. Construction takes a lock in the context, inspection does not. Also, `if (condition) return false; return true` should be written as `return !condition`.
132–134	All user-visible error messages need a test.
144–149	`parseType` accepts references to a derived type and does this check for you. Just declare `elementType` as `VectorType`
152	This should be in the type verifier instead.
153	Note that the location is pointing to the token after the last consumed one, i.e. after `>`. Capture the location before parsing the size, or at least point to the first token in the type.
157–165	Don't duplicate the verifier, use `getChecked` instead.
1001	Replace these magic numbers with named constants defined above.
1023–1026	Don't duplicate the checks already enforced by type constraints in ODS. Also, if you had added tests for user-visible errors, you would have seen that this message is never produced because the error condition is caught by ODS-generated verifier parts with a different message.

ftynse requested changes to this revision. Feb 9 2021, 6:12 AM

This revision now requires changes to proceed. Feb 9 2021, 6:12 AM

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

mlir/include/mlir/Dialect/GPU/GPUOps.td
920–941	Hi, I quote from the PTX side of these LLVM intrinsics: "The distribution of fragments loaded by the threads in a warp is unspecified and is target architecture dependent. The contents of a matrix fragment can be manipulated by reading and writing to individual registers in the fragment, provided the following conditions are satisfied: All matrix element in the fragment are operated on uniformly across threads, using the same parameters. The order of the matrix elements is not changed." So as long as the operation is something like a bias addition, which is uniform throughout the output matrix, I think it would be possible to realize the fusion using the present design. The way to go would be to introduce another op in the GPU dialect, something like `gpu.subgroup_mma_fuse_bias %0, %1 : memref<1x16<vectorxf16>>, f16`. The memref argument will be the result of a `gpu.subgroup_mma_compute`, and the other argument will be the bias. With the appropriate LLVM lowerings this would reuse the results of `gpu.subgroup_mma_compute` in registers and hence avoid a trip to global/shared memory. Let me know what you think. I can implement the above op and check; I think it should work.
920–941	Also, could you please tell if `fuse the MMA compute with elementwise operations` is the fusion of an elementwise operation with the register/warp level tile of the accumulator, which I assumed in my reply above. Is that correct?
946–947	Yes. When calculating the start of the load address, I emit LLVM ops of this sort:
  %[[LDM:.*]] = llvm.mlir.constant(32 : i64) : !llvm.i64
  %[[ILDM:.*]] = llvm.mul %[[LDM]], %[[OFF]] : !llvm.i64
  %[[IJLDM:.*]] = llvm.add %[[ILDM]], %[[OFF]] : !llvm.i64
  %[[IJOLDM:.*]] = llvm.add %[[IJLDM]], %[[OFFSETT]] : !llvm.i64
Now the leading dimension is of type i64, and to have the same types in the Add/Mul ops I have used I64 for the indices too.
948	Yes, I'll change it.
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
264–269	Since this op is mapped to a single intrinsic, The type of the operands is fixed. I modified the op to print only one type.
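
The uniformity argument in the reply on lines 920–941 can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions: the 32-thread warp and 8 elements per thread follow the numbers discussed in this review, while `slotOf` is a made-up element-to-register mapping (the real one is unspecified) and every name below is hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kThreads = 32;                   // threads in one warp
constexpr std::size_t kPerThread = 8;                  // elements per thread
constexpr std::size_t kElems = kThreads * kPerThread;  // 16 x 16 = 256

using Matrix = std::array<float, kElems>;
using Fragments = std::array<std::array<float, kPerThread>, kThreads>;

// Made-up bijection from matrix element to register slot. The real mapping
// is unspecified, which is the point: addBias below never relies on it.
inline std::size_t slotOf(std::size_t element) {
  assert(element < kElems);
  return (element * 37 + 11) % kElems;  // 37 is odd, so this permutes 0..255
}

// "Load": scatter matrix elements into per-thread registers.
inline Fragments distribute(const Matrix &m) {
  Fragments frags{};
  for (std::size_t e = 0; e < kElems; ++e) {
    std::size_t slot = slotOf(e);
    frags[slot / kPerThread][slot % kPerThread] = m[e];
  }
  return frags;
}

// "Store": gather registers back into matrix order.
inline Matrix gather(const Fragments &frags) {
  Matrix m{};
  for (std::size_t e = 0; e < kElems; ++e) {
    std::size_t slot = slotOf(e);
    m[e] = frags[slot / kPerThread][slot % kPerThread];
  }
  return m;
}

// Uniform elementwise update: every thread applies the same operation with
// the same parameter to each register it holds.
inline void addBias(Fragments &frags, float bias) {
  for (auto &regs : frags)
    for (float &r : regs)
      r += bias;
}

// Returns true if adding the bias in registers matches adding it directly.
inline bool fusedBiasMatchesDirect(float bias) {
  Matrix m{};
  for (std::size_t e = 0; e < kElems; ++e)
    m[e] = 0.5f * static_cast<float>(e);
  Fragments frags = distribute(m);
  addBias(frags, bias);
  Matrix fused = gather(frags);
  for (std::size_t e = 0; e < kElems; ++e)
    if (fused[e] != m[e] + bias)
      return false;
  return true;
}
```

Because `addBias` treats every register identically, the result does not depend on which matrix element a register holds. An operation that needs element coordinates (for example a per-row bias) would not have this property.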

In D95330#2551547, @navdeepkk wrote:

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

mlir/include/mlir/Dialect/GPU/GPUOps.td
946–947	This is exactly why you should use `index` or the result of converting `index` to LLVM. There is no guarantee that it is `i64` so you should not be mixing `i64` constants with anything that indexes a memref, e.g. the offset in the descriptor. We actually have a flow that converts `index` to `i32` on GPUs as optimization, and this flow will likely get broken if you emit the code described above.
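
To make the `index` versus `i64` point concrete, here is a minimal sketch of the same address arithmetic written once and instantiated at whatever width `index` is lowered to. `linearOffset` and its parameter names are hypothetical; the formula mirrors the mul/add sequence quoted in the earlier reply.

```cpp
#include <cassert>
#include <cstdint>

// Linear offset of element (i, j) in a row-major buffer with the given
// leading dimension, relative to baseOffset. IndexT stands for whatever
// integer type `index` is converted to (e.g. 32-bit or 64-bit).
template <typename IndexT>
IndexT linearOffset(IndexT i, IndexT j, IndexT leadDimension,
                    IndexT baseOffset) {
  assert(i >= 0 && j >= 0 && j < leadDimension && "indices out of range");
  return i * leadDimension + j + baseOffset;
}
```

For example, `linearOffset<std::int32_t>(2, 3, 32, 4)` and `linearOffset<std::int64_t>(2, 3, 32, 4)` compute the same offset; only the operand width changes, so no constant of a fixed width is mixed with values of the index type.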

In D95330#2551743, @ftynse wrote:

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

Okay, but we would still need to pack the operands in <2xhalf> from scalar values, at some point, to supply them to the intrinsic. Is there some op which would pack scalar values into a <2xhalf>?

Changes in this diff:-

1.) Address comments on diff 321329.
2.) Add tests for all user-visible error messages.

Harbormaster completed remote builds in B89557: Diff 324324. Feb 17 2021, 10:01 AM

In D95330#2551743, @ftynse wrote:

This is an anti-argument for me, I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is
supposed to abstract that away defies the purpose of the GPU dialect. It sounds like memfragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

Thanks for addressing the large number of comments. Some additional minor ones and one that was missed (or not pushed). This overall looks great to me!

mlir/include/mlir/Dialect/GPU/GPUBase.td
60	Comment here please.
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
335	Looks like this comment is still unaddressed but marked "Done". Did you forget to push? @ftynse suggested doing: if (parser.parseType... || parser.resolveOperands(...) ... return failure();
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1054–1056	Can drop trivial braces.
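
The chaining suggested for line 335 relies on parse methods returning a value that is true on failure, so `||` stops at the first failing step. Below is a minimal standalone sketch of that convention. `MiniParser` and `parseShapeElements` are hypothetical and are not the MLIR parser API.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical mini parser; each method returns true on failure.
class MiniParser {
public:
  explicit MiniParser(std::string input) : text(std::move(input)) {}

  // Consumes `token` if it comes next; returns true on failure.
  bool parseToken(const std::string &token) {
    assert(!token.empty());
    if (text.compare(pos, token.size(), token) != 0)
      return true;
    pos += token.size();
    return false;
  }

  // Consumes a decimal integer into `value`; returns true on failure.
  bool parseInteger(int &value) {
    std::size_t start = pos;
    value = 0;
    while (pos < text.size() &&
           std::isdigit(static_cast<unsigned char>(text[pos])))
      value = value * 10 + (text[pos++] - '0');
    return pos == start;
  }

private:
  std::string text;
  std::size_t pos = 0;
};

// Parses "<RxC>" and returns R * C, or -1 if any step fails.
inline int parseShapeElements(const std::string &text) {
  MiniParser parser(text);
  int rows = 0, cols = 0;
  // Short-circuit evaluation stops at the first failing step.
  if (parser.parseToken("<") || parser.parseInteger(rows) ||
      parser.parseToken("x") || parser.parseInteger(cols) ||
      parser.parseToken(">"))
    return -1;
  return rows * cols;
}
```

A single `if` with one failure path replaces a sequence of separate `if (...) return failure();` statements.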

In D95330#2579401, @bondhugula wrote:

But this op is a low-level abstraction. It may be in the GPU dialect but I don't see a value in raising the abstraction to only lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for yet no good reason: it can always be done if there are other backends served by it in the future but even that looks unlikely given how specialized this operation is.

You are right that this op is a low-level abstraction, and that's why it doesn't feel like it really belongs to the GPU dialect. I understand there is a need for an op that is better integrated with the rest of MLIR, e.g., uses memref, and such an op wouldn't fit into the NVVM dialect either. So I wouldn't press my objection as long as another reviewer (@herhut, @csigg or @ThomasRaoux) agrees to have this in GPU dialect as is.

More generally, I think we will run into this problem again: we need some way of having MLIR-friendlier versions of LLVM IR intrinsics without having to duplicate abstractions. There are similar cases in "target vector" dialects like AVX512. Ideas welcome.

Herald added a subscriber: cota. Mar 2 2021, 5:09 AM

mehdi_amini added inline comments. Mar 3 2021, 2:07 PM

mlir/include/mlir/Dialect/GPU/GPUDialect.h
79	elements
80	May be useful to document the syntax and provide examples.
107	Can this type be defined in ODS? https://mlir.llvm.org/docs/OpDefinitions/#type-definitions

In D95330#2597019, @ftynse wrote:

You are right that this op is a low-level abstraction, and that's why it doesn't feel like it really belongs to the GPU dialect. I understand there is a need for an op that is better integrated with the rest of MLIR, e.g., uses memref, and such an op wouldn't fit into the NVVM dialect either. So I wouldn't press my objection as long as another reviewer (@herhut, @csigg or @ThomasRaoux) agrees to have this in GPU dialect as is.

More generally, I think we will run into this problem again: we need some way of having MLIR-friendlier versions of LLVM IR intrinsics without having to duplicate abstractions. There are similar cases in "target vector" dialects like AVX512. Ideas welcome.

I think it would be good to have it in the GPU dialect to have a common layer for SPIR-V, NVVM and potentially ROCDL if it applies. The challenge is to pick a good abstraction for all of those. SPIR-V dialect already has CooperativeMatrix ops and types which are the exact equivalent to MMA. I think we should try to remove what is an overfit for NVVM like using vector type for mmafragment but otherwise this is the right direction in my opinion.

mlir/test/Dialect/GPU/invalid.mlir
464–470	I think we should also have a positive test in `mlir/test/Dialect/GPU/ops.mlir`

In D95330#2602206, @ThomasRaoux wrote:

I think it would be good to have it in the GPU dialect to have a common layer for SPIR-V, NVVM and potentially ROCDL if it applies. The challenge is to pick a good abstraction for all of those. SPIR-V dialect already has CooperativeMatrix ops and types which are the exact equivalent to MMA. I think we should try to remove what is an overfit for NVVM like using vector type for mmafragment but otherwise this is the right direction in my opinion.

Hi all,
I just wanted to point out that the vector type here was used only for the F16 accumulate version of these intrinsics. For the C matrix, the type will change for the F32 accumulate version. So I wanted to make the gpu.mmafragment type flexible for that and perhaps just adjust the constraints as we go on to add more ops in the NVVM dialect.

I also just checked and saw that the SPIR-V dialect does not model the fragments held by a particular workitem (as I do here) but models the whole cooperative matrix as a single type. I am not very sure how the SPIR-V ops are lowered to the target. Are they library calls? If yes, then it is very different from the NVVM dialect, which is actually a 1-to-1 mapping with the intrinsics. In that case, it would be more difficult to choose the right abstraction. E.g., for F16 accumulate the library may just do all the packing and unpacking to get to the same machine instruction that we reach via NVVM->LLVMIR->PTX. Here we have to ensure that the input is in the correct data format.

Please let me know what you think. Also can someone please point out the details on how SPIR-V is lowered?

mlir/include/mlir/Dialect/GPU/GPUDialect.h
47–56	Removed the class. MMAFragmentStorageType now inherits directly from TypeStorage.
86–87	Removed the redundant base class as pointed out in a previous comment. Retained this member.
mlir/include/mlir/Dialect/GPU/GPUOps.td
921	Enforcing identity layout maps for the source memref right now. Can be extended for generic layouts in future commits.
926–927	Not really. Since the GPU dialect is meant to be closer to the target, I preferred to have it without the view abstraction and specify the start location's actual indices.
928–929	The op generator can specify the LeadingDimension, and in the lowering to the NVVM dialect the attribute can be used directly. I assume you mean non-contiguity in the leading dimension. In that case, the op would still work as long as the layout is identity and the leading dimension is correctly specified. I can enforce an identity layout for now and add support for generic layouts in future commits.
1014–1015	!gpu.mmafragment<4xvector<3xf16>> -> !gpu.mmafragment<4xvector<2xf16>>. The shapes represent the part of the MMAFragment each thread holds. E.g., the result is of shape <4xvector<2xhalf>>, which says 8 fp16 elements per thread. So in all, 32 threads(a warp) have 256 elements, which is actually the shape of the output, 16x16.
1026	functional-type does not parse the !gpu.mmafragment type. Retaining the current format.
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
305–310	Sorry, I had already removed the printing but forgot to update the example.
315	I tried to use custom directives to print and parse a single type for all the operands. But there is a restriction on custom directives, quoted as `The type directives must refer to a variable, but that variable need not also be a parameter to the custom directive`. This prevents us from parsing a single type for all the operand types, and parser.resolveOperands() would fail.
316–319	Please let me know if this is to be moved and to where.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
132–134	I have removed these messages as they seemed redundant. The parser already emits errors for the required thing. I am adding tests for other user-visible messages.
1023–1026	Removed redundant checks.
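
The shape arithmetic in the reply on lines 1014–1015 can be checked with a few lines of code. This is a sketch using only the numbers stated in that reply (4 vectors of 2 f16 elements per thread, 32 threads per warp, a 16x16 result); the function and constant names are hypothetical.

```cpp
#include <cassert>

constexpr unsigned kWarpSize = 32;  // threads per warp, as stated in the reply

// Elements held by one thread for a fragment shaped like
// <numVectors x vector<vectorWidth x f16>>.
constexpr unsigned elementsPerThread(unsigned numVectors,
                                     unsigned vectorWidth) {
  return numVectors * vectorWidth;
}

// Elements held by the whole warp for that fragment shape.
constexpr unsigned elementsPerWarp(unsigned numVectors, unsigned vectorWidth) {
  return kWarpSize * elementsPerThread(numVectors, vectorWidth);
}

// 4 x 2 = 8 elements per thread, times 32 threads, covers the 16x16 result.
static_assert(elementsPerWarp(4, 2) == 16 * 16,
              "per-thread fragments must cover the 16x16 result matrix");
```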

I think it would be good to have it in the GPU dialect to have a common layer for SPIR-V, NVVM and potentially ROCDL if it applies.

+1: even for a "low-level" operation, if the operation is available on multiple targets (SPIRV or other, we can even consider that proprietary target can benefit from this), then properly building the abstraction in the GPU dialect is really the intention here.

Thank you @ftynse and @bondhugula for your many comments on this patch and thanks @navdeepkk for addressing them!

IMHO, this operation fits perfectly into the GPU dialect, as it allows targeting mma operations independently of the GPU target used. Regarding the type encoding: ultimately, it only needs to be rich enough to allow conversion to the required LLVM types. We do not really have operations that extract the contained type, so it is truly opaque in that way. I am missing why we need the vector type. Why would mma.fragment<8x2xf16> not suffice for this? This can still be lowered to LLVM as it is today.

To apply element-wise operations to the fragment, we would need special operations anyway or a special operation that extracts the element to compute on from the fragment. The latter could produce a vector if that is desired.

An open question for me is how this lowers to SPIR-V. Would mma.fragment<8x2xf16> be rich enough for that case, as well? @ThomasRaoux

mlir/include/mlir/Dialect/GPU/GPUBase.td
122	nit: Is the "GPU Dialect" here intentional?
mlir/include/mlir/Dialect/GPU/GPUDialect.h
74	nit: These?

In D95330#2602259, @navdeepkk wrote:

In D95330#2602206, @ThomasRaoux wrote:

In D95330#2597019, @ftynse wrote:

In D95330#2579401, @bondhugula wrote:

In D95330#2551743, @ftynse wrote:

In D95330#2551547, @navdeepkk wrote:

Hi, Thanks for the comments.

A high-level design question: why does the element type of mmafragment have to be a vector type? I'd just use 2D indexing for the fragment, it's not like we are going to extract vectors from it.

I have tried to keep the types as close to what is expected by the corresponding LLVM intrinsics. As the mma.compute intrinsic expects operands in <2 x half> form and also returns things in a similar form, I have used the vector type.

This is an anti-argument for me: I see very little value in just lifting the low-level LLVM abstractions to higher levels. Hardcoding NVVM-specific modeling in the GPU dialect that is supposed to abstract that away defies the purpose of the GPU dialect. It sounds like mmafragment<AxBxf16> would make all of the code, except for a tiny part of the conversion, simpler.

But this op is a low-level abstraction. It may be in the GPU dialect, but I don't see value in raising the abstraction only to lower it again immediately with only one path out of it. Having some ops that are tied to specialized lower-level instructions sounds like a reasonable tradeoff to me and by no means against the purpose of the GPU dialect. The abstraction could be raised if there are other lower-level ops of that nature that this could be mapped to. As pointed out downthread, trying to put abstractions in here would mean that the operands would have to be packed into <2xhalf> from scalar values. This would just be additional deabstraction and lowering for no good reason yet: it can always be done if there are other backends served by it in the future, but even that looks unlikely given how specialized this operation is.

You are right that this op is a low-level abstraction, and that's why it doesn't feel like it really belongs to the GPU dialect. I understand there is a need for an op that is better integrated with the rest of MLIR, e.g., uses memref, and such an op wouldn't fit into the NVVM dialect either. So I wouldn't press my objection as long as another reviewer (@herhut, @csigg or @ThomasRaoux) agrees to have this in GPU dialect as is.

More generally, I think we will run into this problem again: we need some way of having MLIR-friendlier versions of LLVM IR intrinsics without having to duplicate abstractions. There are similar cases in "target vector" dialects like AVX512. Ideas welcome.

I think it would be good to have it in the GPU dialect to have a common layer for SPIR-V, NVVM and potentially ROCDL if it applies. The challenge is to pick a good abstraction for all of those. The SPIR-V dialect already has CooperativeMatrix ops and types, which are the exact equivalent of MMA. I think we should try to remove what is an overfit for NVVM, like using the vector type for mmafragment, but otherwise this is the right direction in my opinion.

Hi all,
I just wanted to point out that the vector type here was used only for the F16 accumulate version of these intrinsics. For the C matrix, the type will change for the F32 accumulate version. So I wanted to make the gpu.mmafragment type flexible for that, and perhaps just adjust the constraints as we go on to add more ops in the NVVM dialect.

I also just checked and saw that the SPIR-V dialect does not model the fragments held by a particular workitem (as I do here) but models the whole cooperative matrix as a single type. I am not sure how the SPIR-V ops are lowered to the target. Are they library calls? If yes, then it is very different from the NVVM dialect, which is actually a 1-to-1 mapping with the intrinsics. In that case, it would be more difficult to choose the right abstraction; e.g., for F16 accumulate the library may just do all the packing and unpacking to get to the same machine instruction that we reach via NVVM->LLVMIR->PTX. Here we have to ensure that the input is in the correct data format.

Please let me know what you think. Also can someone please point out the details on how SPIR-V is lowered?
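The F16 versus F32 accumulator point above can be made concrete with a little arithmetic. The following is only a sketch under stated assumptions: a 32-thread warp, a 16x16 accumulator (C/D) tile split evenly across the warp, and f16 values packed two per register as <2 x half>. The names `AccumulatorFragment` and `accumulatorFragment` are hypothetical and are not part of MLIR or NVVM; the A and B operands are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of what one thread holds for an accumulator (C/D) tile.
struct AccumulatorFragment {
  std::size_t elementsPerThread;  // scalar elements owned by one thread
  std::size_t lanesPerRegister;   // 2 when packed as <2 x half>, 1 for float
  std::size_t registersPerThread; // values handed to the intrinsic per thread
};

// Assumes the tile divides evenly over the warp, that f16 elements are packed
// two per register, and that f32 elements are not packed.
constexpr AccumulatorFragment accumulatorFragment(std::size_t rows,
                                                  std::size_t cols, bool isF16,
                                                  std::size_t warpSize = 32) {
  return AccumulatorFragment{
      rows * cols / warpSize, isF16 ? std::size_t{2} : std::size_t{1},
      rows * cols / warpSize / (isF16 ? std::size_t{2} : std::size_t{1})};
}
```

Under these assumptions, `accumulatorFragment(16, 16, true)` reports four packed registers per thread and `accumulatorFragment(16, 16, false)` reports eight, which matches the "4 vector<2xf16>s" and "8 f32s" figures given in the MMAMatrixType documentation of this patch.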

Hi Navdeep,

In SPIR-V, the type CooperativeMatrixNVType is indeed an opaque type, and I think we should keep the MMA type in the GPU dialect opaque as well. SPIR-V cooperative matrix ops are just native SPIR-V ops and don't require any library call. Underneath, they map to the exact same functionality as the NVVM MMA intrinsics, so it makes sense to try to find a common abstraction. If we make gpu.mmafragment an opaque type, we should be able to have a 1:1 mapping to both NVVM and SPIR-V. As @herhut pointed out, we cannot extract from this type, so it makes sense not to make any assumption about the layout. In SPIR-V there are ways to extract elements, but the layout can only be known at runtime, so it is not practical to use in general. Ideally we should avoid that and use gpu.mmafragment until we store back to shared or global memory.

There is some lowering to SPIR-V cooperative matrix in IREE here: https://github.com/google/iree/blob/main/iree/compiler/Conversion/LinalgToSPIRV/ConvertToSPIRVPass.cpp#L330
In this case it goes directly from the vector dialect to SPIR-V cooperative matrix. This skips the level of abstraction you are adding, which is a hack to make it work and to test cooperative matrix end to end. This is the reason why this is not upstreamed in MLIR.
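The idea of an opaque matrix value that is only loaded, computed on, and stored back can be sketched in plain C++. This is an illustration only, not the GPU dialect's implementation: `OpaqueMatrix`, `loadMatrix`, `compute` and `storeMatrix` are hypothetical names, the tile is a fixed 16x16 float tile, and the internal layout is simply row-major. The point is that callers have no element accessor, so they cannot depend on that layout.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kDim = 16; // tile edge, matching the 16x16 shapes here
using Tile = std::array<float, kDim * kDim>;

// Hypothetical opaque matrix value: no element accessor is exposed, so callers
// cannot rely on how elements are arranged internally.
class OpaqueMatrix {
  Tile data{};
  friend OpaqueMatrix loadMatrix(const Tile &memory);
  friend OpaqueMatrix compute(const OpaqueMatrix &a, const OpaqueMatrix &b,
                              const OpaqueMatrix &c);
  friend void storeMatrix(const OpaqueMatrix &m, Tile &memory);
};

// Load a row-major tile from "memory" into the opaque value.
OpaqueMatrix loadMatrix(const Tile &memory) {
  OpaqueMatrix m;
  m.data = memory;
  return m;
}

// D = A * B + C on whole tiles; the only way to combine opaque values.
OpaqueMatrix compute(const OpaqueMatrix &a, const OpaqueMatrix &b,
                     const OpaqueMatrix &c) {
  OpaqueMatrix d;
  for (std::size_t i = 0; i < kDim; ++i)
    for (std::size_t j = 0; j < kDim; ++j) {
      float acc = c.data[i * kDim + j];
      for (std::size_t k = 0; k < kDim; ++k)
        acc += a.data[i * kDim + k] * b.data[k * kDim + j];
      d.data[i * kDim + j] = acc;
    }
  return d;
}

// Store the opaque value back to a row-major tile in "memory".
void storeMatrix(const OpaqueMatrix &m, Tile &memory) { memory = m.data; }
```

A caller would write `storeMatrix(compute(loadMatrix(a), loadMatrix(b), loadMatrix(c)), d)` and only ever inspect `d` after the store, which mirrors the "use the fragment until we store back to memory" suggestion above.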

Hi all, Thanks for the valuable comments. @ThomasRaoux Thanks for clarifying things on the SPIR-V side.

I am missing why we need the vector type. Why would mma.fragment<8x2xf16> not suffice for this? This can still be lowered as today to LLVM.

@herhut I had the vector type just to extract elements from the gpu::mmafragment type and supply them to the NVVM intrinsic as is. Now that I think about it, mma.fragment<8x2xf16> can be used equivalently.

I will take some time and try to develop a more generic abstraction and look at removing the vector<> type. I'll post again for some more feedback before implementing the next iteration.

Thanks!

Herald added a subscriber: dcaballe. Mar 7 2021, 9:22 AM

Hey @navdeepkk, I'm curious to know how this is progressing.
I'm working on enabling CUDA codegen support in IREE (https://github.com/google/iree/tree/main/iree/compiler/Conversion/LinalgToNVVM). I have some basic vectorization working and I'm probably 1 to 2 weeks away from the point where I would like to hook that up to MMA ops. My plan is to add some Vector to GPU transformation in MLIR to create the MMA ops from vector.contract/vector.transfer kinds of ops.
Let me know if I can do anything to help move this forward. I can help address some points of the patch or do anything else that would help you make progress.

I'm happy to chat more about my plans if you want and discuss how we can collaborate on this. Let me know.

Thanks

Cool, please rope me into any offline discussions as I plan to start touching some of that in the short-term future too.

Hi @ThomasRaoux,
Sorry for the late reply. Great to hear that these ops can be reused in the IREE pipeline too. I was actually busy with some parallel work using these ops and getting it ready for an upcoming submission. The comments regarding the types are still to be addressed. I will surely be working on this, but I will get started on any major changes only by next week. As you mention, it would be great to know what your plans are and how you wish to proceed.

Thanks!

Hi @navdeepkk, great to hear you were able to use those ops. Next week sounds good; if you think you won't have bandwidth to progress in the next couple of weeks, please let me know and hopefully you can offload some of the work to me.
To give more details, what I'm doing on the IREE side is trying to match what CUTLASS does using MLIR transformations. The flow will look like: linalg on tensor -> tiling along M,N,K and distribute to blocks -> promote operands to shared memory -> (double buffering and pipelining) -> tiling to warp -> tiling shared memory copy -> linalg vectorization -> vector.contract unrolling to mma size -> vector to GPU to create MMA GPU ops -> lowering to llvm/nvvm.
A lot of this infrastructure is already there in IREE.

If you are able to share, it would be great to hear about the flow you used those ops with.
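The "unrolling to mma size" step in the flow above amounts to decomposing a larger matrix multiplication into fixed 16x16x16 steps. The sketch below shows that decomposition on the CPU with row-major buffers. It is only an illustration: `mmaTile` and `tiledMatmul` are hypothetical names, the sizes are assumed to be multiples of 16, and nothing here reflects how IREE or the affine-based pipeline actually performs the transformation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTile = 16; // m = n = k = 16, as in m16n16k16

// One MMA-sized step on row-major buffers: C[tile] += A[tile] * B[tile].
// A is m x k (leading dimension k); B is k x n and C is m x n (leading
// dimension n).
void mmaTile(const std::vector<float> &a, const std::vector<float> &b,
             std::vector<float> &c, std::size_t row0, std::size_t col0,
             std::size_t k0, std::size_t n, std::size_t k) {
  for (std::size_t i = 0; i < kTile; ++i)
    for (std::size_t j = 0; j < kTile; ++j) {
      float acc = c[(row0 + i) * n + col0 + j];
      for (std::size_t p = 0; p < kTile; ++p)
        acc += a[(row0 + i) * k + k0 + p] * b[(k0 + p) * n + col0 + j];
      c[(row0 + i) * n + col0 + j] = acc;
    }
}

// Computes C += A * B by visiting every 16x16x16 tile once.
// Assumes m, n and k are multiples of kTile.
void tiledMatmul(const std::vector<float> &a, const std::vector<float> &b,
                 std::vector<float> &c, std::size_t m, std::size_t n,
                 std::size_t k) {
  assert(m % kTile == 0 && n % kTile == 0 && k % kTile == 0);
  for (std::size_t row0 = 0; row0 < m; row0 += kTile)
    for (std::size_t col0 = 0; col0 < n; col0 += kTile)
      for (std::size_t k0 = 0; k0 < k; k0 += kTile)
        mmaTile(a, b, c, row0, col0, k0, n, k);
}
```

In a GPU pipeline each call to `mmaTile` would correspond to one load/compute/store sequence executed by a warp, with the accumulator kept in registers across the `k0` loop rather than re-read from memory as this sketch does.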

@ThomasRaoux Just to clarify: @navdeepkk has all of the MLIR-based GPU code generation implemented on top of these ops (the ones in this revision) already working using the affine dialect infrastructure: we don't use IREE, linalg, or the vector dialects, but our pipeline effectively includes all of the transformations you list above and several more in between those lowerings.

Getting to this PR itself, are there specific changes to make before this can be landed? @ftynse and others (Stephan and Mehdi) appear to be fine with having these ops in the GPU dialect. Could you please list those changes out if any or LGTM this? This will allow further development and progress upstream. If the remaining changes are minor, these could be made quickly and landed.

Thanks for explaining, @bondhugula. It is great to know that you have all those transformations. Is it publicly available? (or will it be?).
I 100% support having these ops in the GPU dialect. The only concern I had with the PR was about the type including vector instead of being completely opaque because, as I mentioned, this will make it harder to share with the Vulkan path. However, I don't think it has to be a blocker, and it is something that we can iterate on, so I LGTM the patch.

Thanks @ThomasRaoux. Yes, our goal is to upstream this infrastructure - most of the underlying affine dialect transformations and utilities needed have already been there in MLIR for a long time. We'd like to send out things for review starting with the lower level abstractions and support first - so this is really the first revision. (One op/abstraction that is publicly available but not in upstream that we used for our purpose is the memref_vector_cast op: https://reviews.llvm.org/D85885 - this is also available to work with recent upstream at https://github.com/polymage-labs/mlirx )

@navdeepkk, @bondhugula, any update on this?

Hi @ThomasRaoux, I'll get started on this now, and complete it on priority in the next 3-4 days or even sooner. Hope that works.

Thank you

Changes in this diff :-

1.) Rebase on upstream/main.
2.) Address comments on previous diff(324324).
3.) Remove gpu.mmafragment and introduce gpu.mma_matrix type.
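For readers unfamiliar with the new type, its textual form in this patch looks like `!gpu.mma_matrix<16x16xf16, "AOp">`: a shape, an element type, and a string naming the operand. The sketch below pulls those parts out of such a string and rejects malformed ones. It is a standalone illustration, not MLIR's parser: `MmaMatrixDesc` and `parseMmaMatrix` are hypothetical names, and the restrictions to two dimensions and to f16/f32 element types are assumptions based on the examples and summary of this revision.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical plain description of a parsed !gpu.mma_matrix type string.
struct MmaMatrixDesc {
  long rows = 0;
  long cols = 0;
  std::string elementType; // "f16" or "f32" (assumed set)
  std::string operand;     // "AOp", "BOp", "COp" or "DOp"
};

// Returns true and fills `out` if `text` has the form
//   !gpu.mma_matrix<RxCxTYPE, "OPERAND">
// with a 2-D shape, an f16/f32 element type and a known operand name.
bool parseMmaMatrix(const std::string &text, MmaMatrixDesc &out) {
  static const std::regex pattern(
      R"re(!gpu\.mma_matrix<(\d{1,9})x(\d{1,9})x(f16|f32), "(AOp|BOp|COp|DOp)">)re");
  std::smatch match;
  if (!std::regex_match(text, match, pattern))
    return false;
  out.rows = std::stol(match[1].str());
  out.cols = std::stol(match[2].str());
  out.elementType = match[3].str();
  out.operand = match[4].str();
  return true;
}
```

Parsing `!gpu.mma_matrix<16x16xf16, "AOp">` with this helper yields a 16x16 f16 matrix tagged as the A operand, while an unknown operand name such as "EOp" is rejected, in the same spirit as the construction invariants that the type's `verify` hook checks.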

Harbormaster completed remote builds in B102194: Diff 342265. May 2 2021, 1:57 PM

Looks great. Mostly minor comments on style and movement of some C++ code out of the td file.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
88	It's not immediately clear what this StringRef is for.
144	`unsigned`?
152	Nit: Operand -> operand
152	An `operand` is expected to be an SSA `Value`. It's not clear what this returns.
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td
365–385	We shouldn't have so much C++ code in the tablegen file. Please move this to a C++ file (NVVMOps.cpp) and call that from here.
389–397	Likewise.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
48	`unsigned` ?
68	`Operand` -> `operand`
164	Nit: "x" -> 'x'
166	Likewise.
1014	`Only` -> `only`
1065	`Operands` -> `operands`
1074	Likewise.
mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp
197–198	No need of the `else`.
215	Likewise.
237	Likewise. clang-tidy will also show a warning here.
274–275	Use `getOperands()` instead of doing getOpOperands() and then doing a `get()`?

ThomasRaoux accepted this revision. May 3 2021, 6:55 PM

ftynse accepted this revision. May 4 2021, 2:50 AM

This revision is now accepted and ready to land. May 4 2021, 2:50 AM

Changes in this diff :-

1.) Address comments on previous diff(342265).

navdeepkk edited the summary of this revision. May 5 2021, 9:53 PM

Harbormaster completed remote builds in B102908: Diff 343281. May 5 2021, 10:26 PM

bondhugula accepted this revision. May 5 2021, 11:05 PM

Changes in this diff :-

1.) Clang-format fix.
2.) Added TODO to generate MMAMatrix via ODS.

This revision was landed with ongoing or failed builds. May 5 2021, 11:37 PM

Closed by commit rG875eb523c132: [MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops (authored by navdeepkk, committed by bondhugula).

This revision was automatically updated to reflect the committed changes.

bondhugula added a commit: rG875eb523c132: [MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops.

Harbormaster completed remote builds in B102918: Diff 343291. May 6 2021, 12:10 AM

Revision Contents

Path                                                               Size
mlir/include/mlir/Dialect/GPU/GPUBase.td                           25 lines
mlir/include/mlir/Dialect/GPU/GPUDialect.h                         116 lines
mlir/include/mlir/Dialect/GPU/GPUOps.td                            118 lines
mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td                        250 lines
mlir/include/mlir/Target/LLVMIR/ModuleTranslation.h                8 lines
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp                             193 lines
mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp                         220 lines
mlir/lib/Target/LLVMIR/Dialect/NVVM/NVVMToLLVMIRTranslation.cpp    1 line
mlir/lib/Target/LLVMIR/ModuleTranslation.cpp                       24 lines
mlir/test/Dialect/GPU/invalid.mlir                                 113 lines
mlir/test/Dialect/GPU/ops.mlir                                     11 lines
mlir/test/Dialect/LLVMIR/invalid.mlir                              159 lines
mlir/test/Target/LLVMIR/nvvmir.mlir                                37 lines

Diff 343297

mlir/include/mlir/Dialect/GPU/GPUBase.td

(51 lines hidden; enclosing context: let extraClassDeclaration = [{)
     static unsigned getPrivateAddressSpace() { return 5; }
   }];
 }

 def GPU_AsyncToken : DialectType<
   GPU_Dialect, CPred<"$_self.isa<::mlir::gpu::AsyncTokenType>()">, "async token type">,
              BuildableType<"mlir::gpu::AsyncTokenType::get($_builder.getContext())">;

+// Predicat to check if type is gpu::MMAMatrixType.
[inline comment] bondhugula (Done): Comment here please.
+def IsMMAMatrixTypePred : CPred<"$_self.isa<::mlir::gpu::MMAMatrixType>()">;

+def GPU_MMAMatrix : DialectType<
+  GPU_Dialect, IsMMAMatrixTypePred, "MMAMatrix type">;

+class MMAMatrixOf<list<Type> allowedTypes> :
+  ContainerType<AnyTypeOf<allowedTypes>, IsMMAMatrixTypePred,
+  "$_self.cast<::mlir::gpu::MMAMatrixType>().getElementType()",
+  "gpu.mma_matrix", "::mlir::gpu::MMAMatrixType">;

 def GPU_AsyncOpInterface : OpInterface<"AsyncOpInterface"> {
   let description = [{
     Interface for GPU operations that execute asynchronously on the device.

     GPU operations implementing this interface take a list of dependencies
     as `gpu.async.token` arguments and optionally return a `gpu.async.token`.

     The op doesn't start executing until all depent ops producing the async
(29 lines hidden; enclosing context: InterfaceMethod<[{)
       "OpResult", "getAsyncToken", (ins), [{}], [{
         ConcreteOp op = cast<ConcreteOp>(this->getOperation());
         return op.asyncToken().template dyn_cast_or_null<OpResult>();
       }]
     >
   ];
 }

+// Cases of the String enum Attribute for SubgroupMmaOpLayout, representing
+// the layouts of the operands supported by the ops that use this attribute.
+def RowMajor: StrEnumAttrCase<"RowMajor", 0>;
+def ColMajor: StrEnumAttrCase<"ColMajor", 1>;

+// Specifies a String enum Attribute for Warp wide matrix operations,
+// representing the layout of respective operands. The layout later governs
[inline comment] herhut (Done): nit: Is the "GPU Dialect" here intentional?
+// the lowerings to appropriate intrinsics.
+def SubgroupMmaOpLayout: StrEnumAttr<"Layout", "Specifies whether op is row/col major",
+  [RowMajor, ColMajor]> {
+  let stringToSymbolFnName = "LayoutStrToEnum";
+  let symbolToStringFnName = "EnumToLayoutStr";
+}

 #endif // GPU_BASE

mlir/include/mlir/Dialect/GPU/GPUDialect.h

(38 lines hidden)

 class AsyncTokenType
     : public Type::TypeBase<AsyncTokenType, Type, TypeStorage> {
 public:
   // Used for generic hooks in TypeBase.
   using Base::Base;
 };

+/// MMAMatrixType storage and uniquing. Array is uniqued based on its shape
+/// and type.
+struct MMAMatrixStorageType : public TypeStorage {
+  MMAMatrixStorageType(unsigned numDims, const int64_t *dimShapes,
+                       Type elementType, StringRef operand)
+      : dimShapes(dimShapes), numDims(numDims), elementType(elementType),
+        operand(operand) {}

+  /// The hash key for uniquing.
+  using KeyTy = std::tuple<ArrayRef<int64_t>, Type, StringRef>;
[inline comment] ftynse (Done): This class looks unnecessary. It has only one derived class, that doesn't to actually use anything from the base class.
[inline comment] navdeepkk (Author, Done): Removed the class, Now directly inheriting from TypeStorage in MMAFragmentStorageType.
+  bool operator==(const KeyTy &key) const {
+    return key == KeyTy(getShape(), elementType, operand);
+  }

+  /// Construction.
+  static MMAMatrixStorageType *construct(TypeStorageAllocator &allocator,
+                                         const KeyTy &key) {
+    ArrayRef<int64_t> shape = allocator.copyInto(std::get<0>(key));
+    StringRef operand = allocator.copyInto(std::get<2>(key));

+    return new (allocator.allocate<MMAMatrixStorageType>())
+        MMAMatrixStorageType(shape.size(), shape.data(), std::get<1>(key),
+                             operand);
+  }

+  ArrayRef<int64_t> getShape() const {
+    return ArrayRef<int64_t>(dimShapes, numDims);
+  }
[inline comment] herhut (Done): nit: These?

+  StringRef getOperand() const { return operand; }

+  /// Reference to the shape of the MMA matrix.
+  const int64_t *dimShapes;
[inline comment] mehdi_amini (Done): elements

[inline comment] mehdi_amini (Done): May be useful to document the syntax and provide examples.
+  /// Number of dimensions in the MMA matrix.
[inline comment] ftynse (Done): There's no need in getters if member fields are public.
+  unsigned numDims;

+  /// Element type of elements held in the MMA matrix.
+  Type elementType;

+  /// MMA operand that this MMAMatrix holds. The general form of operation this
[inline comment] ftynse (Done): This shadows the `elementType` from the base class. You'll actually have two `elementType` values this way...
[inline comment] navdeepkk (Author, Done): Removed the redundant base class as pointed out in a previous comment. Retained this member.
+  /// type supports is given by the equation D = (alpha(AB)) + (beta*C). This
[inline comment] bondhugula (Done): It's not immediately clear what this StringRef is for.
+  /// field specifies which operand in the given equation is held by this type.
+  /// The valid values are "AOp", "BOp", "COp" and "DOp".
+  StringRef operand;
[inline comment] bondhugula (Done): Matrix-multiply -> matrix-matrix multiply
+};

				/// MMAMatrix represents a matrix held by a subgroup for matrix-matrix multiply
				/// accumulate operations. MMAMatrices are taken as direct operands by these
				/// operations and are also produced as results. These matrices are meant to
				/// reside in the registers. A limited number of pointwise operations can be
				/// performed on these matrices, i.e., operations which operate uniformly on
				/// all the elements in the matrix and do not change the order of matrix
				/// elements. The above conditions exist because the layout of matrix elements
				/// inside the matrix is opaque i.e., the elements may be present in the
				/// matrix in any order. The general usage of this type is shown as follows:-
				///
				/// %0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {leadDimension = 16 :
				/// index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
				///
				/// The MMAMatrixType describes the shape of the matrix being loaded and the
				mehdi_aminiUnsubmitted Not Done Reply Inline Actions Can this type be defined in ODS? https://mlir.llvm.org/docs/OpDefinitions/#type-definitions mehdi_amini: Can this type be defined in ODS? https://mlir.llvm.org/docs/OpDefinitions/#type-definitions
				/// operand being loaded too. The operand needs to be specified to aid the
				/// lowering of this type to dialects such as NVVM where each workitem may
				/// hold different amount of elements depending on the elementType of the
				/// matrix. For e.g., Each workitem holds 4 vector<2xf16>s for f16 data type
				/// and 8 f32s for f32 data type of MMAMatrix. Some other instances of usage
				/// are:-
				///
				/// %3 = gpu.subgroup_mma_compute %0, %1, %2 : !gpu.mma_matrix<16x16xf16,
				/// "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf32,
				bondhugulaUnsubmitted Done Reply Inline Actions All of these need doc comments. bondhugula: All of these need doc comments.
				/// "COp"> -> !gpu.mma_matrix<16x16xf32, "DOp">
				///
				///
				/// gpu.subgroup_mma_store_matrix %3, %arg22[%c0, %c0] {leadDimension = 16
				/// : index}: !gpu.mma_matrix<16x16xf32, "DOp">, memref<16x16xf32>
				// TODO: consider moving this to ODS.
				class MMAMatrixType
				: public Type::TypeBase<MMAMatrixType, Type, MMAMatrixStorageType> {
				public:
				using Base::Base;

				/// Get MMAMatrixType and verify construction Invariants.
				static MMAMatrixType get(ArrayRef<int64_t> shape, Type elementType,
				StringRef operand);

				/// Get MMAMatrixType at a particular location and verify construction
				/// Invariants.
				static MMAMatrixType getChecked(function_ref<InFlightDiagnostic()> emitError,
				ArrayRef<int64_t> shape, Type elementType,
				StringRef operand);

				/// Check if a type is valid a MMAMatrixType elementType.
				static bool isValidElementType(Type elementType);

				/// Verify that shape and elementType are actually allowed for the
				/// MMAMatrixType.
				static LogicalResult verify(function_ref<InFlightDiagnostic()> emitError,
				ArrayRef<int64_t> shape, Type elementType,
				bondhugulaUnsubmitted Done Reply Inline Actions `unsigned`? bondhugula: `unsigned`?
				StringRef operand);

				/// Get number of dims.
				unsigned getNumDims() const;

				/// Get shape of the matrix.
				ArrayRef<int64_t> getShape() const;

				bondhugulaUnsubmitted Done Reply Inline Actions Nit: Operand -> operand bondhugula: Nit: Operand -> operand
				bondhugulaUnsubmitted Done Reply Inline Actions An `operand` is expected to be an SSA `Value`. It's not clear what this returns. bondhugula: An `operand` is expected to be an SSA `Value`. It's not clear what this returns.
				/// Get elementType of a single element.
				Type getElementType() const;

				/// The general form of operation this type supports is given by the equation
				/// D = (alpha*(A*B)) + (beta*C). This function returns which operand in the
				/// given equation is held by this type. The string returned can be one of
				/// "AOp", "BOp", "COp" and "DOp".
				StringRef getOperand() const;
				};

	// Adds a `gpu.async.token` to the front of the argument list.			// Adds a `gpu.async.token` to the front of the argument list.
	void addAsyncDependency(Operation *op, Value token);			void addAsyncDependency(Operation *op, Value token);

	} // end namespace gpu			} // end namespace gpu
	} // end namespace mlir			} // end namespace mlir

	#include "mlir/Dialect/GPU/GPUOpsDialect.h.inc"			#include "mlir/Dialect/GPU/GPUOpsDialect.h.inc"

	#include "mlir/Dialect/GPU/GPUOpInterfaces.h.inc"			#include "mlir/Dialect/GPU/GPUOpInterfaces.h.inc"

	#define GET_OP_CLASSES			#define GET_OP_CLASSES
	#include "mlir/Dialect/GPU/GPUOps.h.inc"			#include "mlir/Dialect/GPU/GPUOps.h.inc"

	#endif // MLIR_DIALECT_GPU_GPUDIALECT_H			#endif // MLIR_DIALECT_GPU_GPUDIALECT_H

mlir/include/mlir/Dialect/GPU/GPUOps.td

Show First 20 Lines • Show All 906 Lines • ▼ Show 20 Lines	def GPU_MemcpyOp : GPU_Op<"memcpy", [GPU_AsyncOpInterface]> {

let assemblyFormat = [{		let assemblyFormat = [{
custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)		custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
$dst`,` $src `:` type($dst)`,` type($src) attr-dict		$dst`,` $src `:` type($dst)`,` type($src) attr-dict
}];		}];
let verifier = [{ return ::verify(*this); }];		let verifier = [{ return ::verify(*this); }];
}		}

		def GPU_SubgroupMmaLoadMatrixOp : GPU_Op<"subgroup_mma_load_matrix",
		[MemoryEffects<[MemRead]>]>{

		let summary = "GPU warp synchronous matrix load";

		let description = [{
		The `gpu.subgroup_mma_load_matrix` operation loads a matrix collectively
		ftynse: Does it make any assumptions about the data storage in the memref? It can have an arbitrary layout now, not only consecutive elements.
		navdeepkk (author): Enforcing identity layout maps for the source memref right now. Can be extended for generic layouts in future commits.
		using all the threads in a subgroup.

		This operation takes a memref as argument. It is the source matrix from which
		data is to be loaded. The op returns a `!gpu.mma_matrix`. The source memref
		can be in the global or shared memory space. The start address of the load
		is determined using the indices provided. The matrix being loaded is specified in
		ftynse: Have you considered always loading the entire matrix starting from indices 0x0 and asking the user to use `std.subview` to position the view in a way they want? There may be a reason to use indices here, but it is unclear to me. Otherwise, it feels like this op will be a subset of functionality that subview already provides.
		navdeepkk (author): Not really. Since the GPU dialect is meant to be closer to the target, I preferred to have it without the view abstraction and specify the start location's actual indices.
		the result type. This attribute is necessary because there exists a different
		LLVM intrinsic for loading each operand. This is probably because all operands
		ftynse: Why is this attribute necessary? It looks redundant with the dimension of the memref available in its type. If it may differ from the memref size, then this op needs to clarify how it handles such non-contiguous cases.
		navdeepkk (author): The op generator can specify the leading dimension, and in the lowering to the NVVM dialect the attribute can be used directly. I assume you mean non-contiguity in the leading dimension. In that case, the op would still work as long as the layout is identity and the leading dimension is correctly specified. I can enforce an identity layout for now and add support for generic layouts in future commits.
		need to be laid out in the registers in a specific, operand-dependent way for
		the operation. The `leadDimension` attribute specifies the leading dimension
		of the source matrix.

		This op is meant to be used along with `gpu.subgroup_mma_store_matrix` and
		`gpu.subgroup_mma_compute`.

		Example:

		ftynse: Please document why it is important to specify which operand is being loaded (I suppose because of how the fragment is laid out).
		```mlir
		bondhugula: The design looks much more in line now with the `memref` operands being replaced with pure values and with a result value instead of an output memref.
		ftynse: Nit: this example breaks the verifier, which expects `leadDimension` to have `i32` type.
		%0 = gpu.subgroup_mma_load_matrix %src[%i, %j] {leadDimension = 32 : index}
		: memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "AOp">
		```
		ThomasRaoux: The part that I don't understand is if we expect the destination memref to have a defined layout or if it is opaque. As far as I can tell in both CUDA and Vulkan the layout of the mma matrix type is opaque in private memory. If the layout is opaque how can we perform elementwise operations on the matrix type without going back to global or shared memory? One of the common usages would be to fuse the MMA compute with elementwise operations and use directly the result of the MMA without going back to memory. Is it possible to represent such a case with the current design?
		navdeepkk (author): Hi, I quote from the PTX side of these LLVM intrinsics: "The distribution of fragments loaded by the threads in a warp is unspecified and is target architecture dependent. The contents of a matrix fragment can be manipulated by reading and writing to individual registers in the fragment, provided the following conditions are satisfied: All matrix element in the fragment are operated on uniformly across threads, using the same parameters. The order of the matrix elements is not changed." So as long as the operation is something like a bias addition, which is uniform throughout the output matrix, I think it would be possible to realize the fusion using the present design. The way to go would be to introduce another op in the GPU dialect, something like gpu.subgroup_mma_fuse_bias %0, %1 : memref<1x16<vectorxf16>>, f16. The argument memref will be the result of a `gpu.subgroup_mma_compute`, and the other argument will be the bias. With the appropriate LLVM lowerings this would reuse the results of `gpu.subgroup_mma_compute` in registers and hence prevent a trip to global/shared memory. Let me know what you think. I can implement the above op and check; I think it should work.
		navdeepkk (author): Also, could you please tell if `fuse the MMA compute with elementwise operations` is the fusion of an elementwise operation with the register/warp level tile of the accumulator, which I assumed in my reply above. Is that correct?
		}];

		let arguments = (ins Arg<MemRefRankOf<[F16, F32], [2]>, "", [MemRead]>:$srcMemref,
		ftynse: Why I32Attr and not IndexAttr?
		Variadic<Index>:$indices,
		IndexAttr:$leadDimension);

		ftynse: We usually use Index rather than I64 for indexing into memrefs. Is there a specific reason why I64 is needed here?
		navdeepkk (author): Yes. When calculating the start of the load address, I emit LLVM ops of this sort: %[[LDM:.]] = llvm.mlir.constant(32 : i64) : !llvm.i64 %[[ILDM:.]] = llvm.mul %[[LDM]], %[[OFF]] : !llvm.i64 %[[IJLDM:.]] = llvm.add %[[ILDM]], %[[OFF]] : !llvm.i64 %[[IJOLDM:.]] = llvm.add %[[IJLDM]], %[[OFFSETT]] : !llvm.i64 Now the leading dimension is of type i64, and to have the same types in the Add/Mul ops I have used I64 for the indices too.
		ftynse: This is exactly why you should use `index` or the result of converting `index` to LLVM. There is no guarantee that it is `i64` so you should not be mixing `i64` constants with anything that indexes a memref, e.g. the offset in the descriptor. We actually have a flow that converts `index` to `i32` on GPUs as optimization, and this flow will likely get broken if you emit the code described above.
		let results = (outs GPU_MMAMatrix:$res);
		ftynse: Could we use a longer name for this, e.g., leadDimension?
		navdeepkk (author): Yes, I'll change it.

		let assemblyFormat = [{
		$srcMemref`[`$indices`]` attr-dict `:` type($srcMemref) `->` type($res)
		}];

		let verifier = [{ return ::verify(*this); }];
		}

		def GPU_SubgroupMmaStoreMatrixOp : GPU_Op<"subgroup_mma_store_matrix",
		[MemoryEffects<[MemWrite]>]>{

		let summary = "GPU warp synchronous matrix store";

		let description = [{
		The `gpu.subgroup_mma_store_matrix` operation stores a matrix collectively
		using all the threads in a subgroup.

		This operation takes a `!gpu.mma_matrix` and a memref as arguments.
		`!gpu.mma_matrix` is the source which contains the data to be stored.
		The destination can be in the global or shared memory space. The start
		address of the store is determined using the indices provided. The `leadDimension`
		attribute specifies the leading dimension of the destination matrix.

		This op is meant to be used along with `gpu.subgroup_mma_load_matrix` and
		`gpu.subgroup_mma_compute`.

		Example:

		```mlir
		gpu.subgroup_mma_store_matrix %D, %sg[%i, %j] {leadDimension = 32 : index} :
		!gpu.mma_matrix<16x16xf16, "DOp">, memref<32x32xf16, 3>
		ftynse: Same remarks as for the "load" op.
		```
		}];

		let arguments = (ins Arg<MMAMatrixOf<[F16, F32]>>:$src,
		Arg<MemRefRankOf<[F16, F32], [2]>, "",[MemWrite]>:$dstMemref,
		Variadic<Index>:$indices,
		IndexAttr:$leadDimension);

		let assemblyFormat = [{
		$src`,` $dstMemref`[`$indices`]` attr-dict `:` type($src)`,` type($dstMemref)
		}];

		let verifier = [{ return ::verify(*this); }];
		}

		def GPU_SubgroupMmaComputeOp : GPU_Op<"subgroup_mma_compute", []>{

		let summary = "GPU warp synchronous matrix multiply accumulate";

		let description = [{
		The `gpu.subgroup_mma_compute` operation performs a matrix-multiply accumulate (mma)
		operation using all the threads in a subgroup.

		This operation takes three `!gpu.mma_matrix`s as arguments. These hold the `A`,
		`B` and `C` operands for the mma operation. The operation performed is represented
		as `D = A * B + C`. The op returns a `!gpu.mma_matrix` which contains the result of
		the operation held by the current thread.

		This op is meant to be used along with `gpu.subgroup_mma_store_matrix` and
		`gpu.subgroup_mma_load_matrix`.
		ftynse: Have you considered encoding this into a TypeConstraint in ODS instead of using AnyMemRef?

		Example:

		```mlir
		%D = gpu.subgroup_mma_compute %A, %B, %C :
		!gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">,
		ftynse: These shapes don't add up for me.
		navdeepkk (author): !gpu.mmafragment<4xvector<3xf16>> -> !gpu.mmafragment<4xvector<2xf16>>. The shapes represent the part of the MMAFragment each thread holds. E.g., the result is of shape <4xvector<2xhalf>>, which says 8 fp16 elements per thread. So in all, 32 threads (a warp) have 256 elements, which is actually the shape of the output, 16x16.
		!gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "DOp">
		```
		}];

		bondhugula: Incorrect copy/paste example.
		let arguments = (ins Arg<MMAMatrixOf<[F16]>>:$opA,
		Arg<MMAMatrixOf<[F16]>>:$opB,
		Arg<MMAMatrixOf<[F16, F32]>>:$opC);

		let results = (outs GPU_MMAMatrix:$res);

		let assemblyFormat = [{
		ftynse: Nit: I would have just used `functional-type(operands, results)` here.
		navdeepkk (author): functional-type is not parsing the !gpu.mmafragment type. Retaining the current format.
		$opA`,` $opB`,` $opC attr-dict `:` type($opA)`,` type($opB)`,` type($opC) `->` type($res)
		}];

		let verifier = [{ return ::verify(*this); }];
		}

#endif // GPU_OPS		#endif // GPU_OPS
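
The load and store op descriptions above say the start address is determined from the indices and `leadDimension`, and the review thread settles on identity layouts only. Under those assumptions, the address arithmetic discussed in the thread (multiply the row index by the leading dimension, then add the column index and the base offset) can be sketched in C++ as follows. The helper name `tileStartOffset` and the use of `std::int64_t` are illustrative assumptions and are not part of the patch; as ftynse notes, an actual lowering should use whatever type `index` converts to rather than a fixed `i64`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the patch: linear element offset of the
// first element of a tile inside a row-major buffer with identity layout.
// `leadDimension` is the number of elements between two consecutive rows.
// std::int64_t is an assumption made for this sketch only.
constexpr std::int64_t tileStartOffset(std::int64_t row, std::int64_t col,
                                       std::int64_t leadDimension,
                                       std::int64_t baseOffset = 0) {
  return baseOffset + row * leadDimension + col;
}

// A tile starting at [0, 0] begins at the buffer's base offset.
static_assert(tileStartOffset(0, 0, 32) == 0, "origin tile");
// A tile starting at [16, 16] in a 32x32 buffer begins 16 rows and 16
// columns in: 16 * 32 + 16 = 528 elements from the base.
static_assert(tileStartOffset(16, 16, 32) == 528, "interior tile");
```

For the `memref<32x32xf16, 3>` used in the examples, the four 16x16 tiles would start at element offsets 0, 16, 512 and 528 under this scheme.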

mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td

Show All 10 Lines
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef NVVMIR_OPS		#ifndef NVVMIR_OPS
#define NVVMIR_OPS		#define NVVMIR_OPS

include "mlir/Dialect/LLVMIR/LLVMOpBase.td"		include "mlir/Dialect/LLVMIR/LLVMOpBase.td"
include "mlir/Interfaces/SideEffectInterfaces.td"		include "mlir/Interfaces/SideEffectInterfaces.td"

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
		ftynse: I am not a fan of creating a dependency from the NVVM dialect on the GPU dialect only for the purpose of reusing an attribute. Have you considered having two versions of this attribute for different dialects, or putting it into some included file shared by both?
// NVVM dialect definitions		// NVVM dialect definitions
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

def NVVM_Dialect : Dialect {		def NVVM_Dialect : Dialect {
let name = "nvvm";		let name = "nvvm";
let cppNamespace = "::mlir::NVVM";		let cppNamespace = "::mlir::NVVM";
let dependentDialects = ["LLVM::LLVMDialect"];		let dependentDialects = ["LLVM::LLVMDialect"];
let hasOperationAttrVerify = 1;		let hasOperationAttrVerify = 1;
▲ Show 20 Lines • Show All 118 Lines • ▼ Show 20 Lines	def NVVM_MmaOp :
string llvmBuilder = [{		string llvmBuilder = [{
$res = createIntrinsicCall(		$res = createIntrinsicCall(
builder, llvm::Intrinsic::nvvm_mma_m8n8k4_row_col_f32_f32, $args);		builder, llvm::Intrinsic::nvvm_mma_m8n8k4_row_col_f32_f32, $args);
}];		}];
let assemblyFormat = "$args attr-dict `:` functional-type($args, $res)";		let assemblyFormat = "$args attr-dict `:` functional-type($args, $res)";
let verifier = [{ return ::verify(*this); }];		let verifier = [{ return ::verify(*this); }];
}		}

		// Base class for all the variants of WMMA loadOps that may be defined.
		ftynse: It looks like `operand` is never used.
		class NVVM_WMMALoadOp<string mnemonic> : NVVM_Op<mnemonic>,
		Results<(outs LLVM_AnyStruct:$res)>,
		Arguments<(ins Variadic<LLVM_Type>:$args)> {

		let summary = "Warp synchronous matrix load";

		ftynse: Please follow the code style inside inline code blocks. Here in particular, add spaces between "if" and "(".
		ftynse: Nit: use `[{ }]` for multi-line text.
		string baseDescription = [{"The `nvvm.wmma.mnk*.load.[a, b, c]` operation"
		"loads a matrix collectively using all the threads in a warp."

		"The operation takes two arguments, the address from where the matrix"
		"elements are to be loaded from and a stride. The stride argument"
		"represents the leading dimension of the source matrix. The address and"
		"the stride are required to be the same across all threads in the warp."
		"Each thread in a warp holds a certain number of elements. The Op returns"
		"a LLVMStruct which holds the elements of the matrix held by this thread."

		ftynse: We generally prefer to have one operation per intrinsic in this dialect. There is certainly value in having an operation where this duplication is abstracted away, but I suppose the one in the GPU dialect is just right for that. This dispatch should happen when converting from the `gpu.subgroup_mma_load` to `nvvm.wmma_` rather than in translation to LLVM IR. This will also resolve the dialect layering problem I pointed at above.
		"This op is meant to be used along with `nvvm.wmma.mnk*.store` and"
		"`nvvm.wmma.mnk*.mma`."}];

		let assemblyFormat = "$args attr-dict `:` functional-type($args, $res)";
		}

		def NVVM_WMMALoadAM16N16K16Op :
		NVVM_WMMALoadOp<"wmma.m16n16k16.load.a.f16.row.stride">{

		string llvmBuilder = [{
		$res = createNvvmIntrinsicCall(
		builder, llvm::Intrinsic::nvvm_wmma_m16n16k16_load_a_f16_row_stride, $args);
		}];

		ftynse: Tablegen doesn't have $-substitution, you need something like `!strconcat(` and a way of accessing the common string.
		string opDescription = [{
		Example:

		```mlir
		%2 = nvvm.wmma.m16n16k16.load.a %0, %1 : !llvm.ptr<i32, 3>, !llvm.i32 ->
		!llvm.struct<(vec<2 x half>, vec<2 x half>, vec<2 x half>, vec<2 x half>,
		vec<2 x half>, vec<2 x half>, vec<2 x half>, vec<2 x half>)>
		```
		}];

		let description = !strconcat(baseDescription, opDescription);

		ftynse: function-type($args, $res) ?
		let verifier = [{ return ::verify(*this); }];
		}

		def NVVM_WMMALoadBM16N16K16Op :
		NVVM_WMMALoadOp<"wmma.m16n16k16.load.b.f16.row.stride">{

		string llvmBuilder = [{
		$res = createNvvmIntrinsicCall(
		builder, llvm::Intrinsic::nvvm_wmma_m16n16k16_load_b_f16_row_stride, $args);
		}];

		string opDescription = [{
		Example:

		```mlir
		%2 = nvvm.wmma.m16n16k16.load.b %0, %1 : !llvm.ptr<i32, 3>, !llvm.i32 ->
		!llvm.struct<(vec<2 x half>, vec<2 x half>, vec<2 x half>, vec<2 x half>,
		vec<2 x half>, vec<2 x half>, vec<2 x half>, vec<2 x half>)>
		```
		}];

		let description = !strconcat(baseDescription, opDescription);

		let verifier = [{ return ::verify(*this); }];
		}

		def NVVM_WMMALoadCF16M16N16K16Op :
		NVVM_WMMALoadOp<"wmma.m16n16k16.load.c.f16.row.stride">{
		string llvmBuilder = [{
		$res = createNvvmIntrinsicCall(
		builder, llvm::Intrinsic::nvvm_wmma_m16n16k16_load_c_f16_row_stride, $args);
		}];

		string opDescription = [{
		Example:

		```mlir
		%2 = nvvm.wmma.m16n16k16.load.c.f16.row.stride %0, %1 : !llvm.ptr<i32, 3>, !llvm.i32 ->
		!llvm.struct<(vec<2 x half>, vec<2 x half>, vec<2 x half>, vec<2 x half>)>
		```
		}];

		let description = !strconcat(baseDescription, opDescription);

		let verifier = [{ return ::verify(*this); }];
		}

		def NVVM_WMMALoadCF32M16N16K16Op :
		NVVM_WMMALoadOp<"wmma.m16n16k16.load.c.f32.row.stride">{
		string llvmBuilder = [{
		$res = createNvvmIntrinsicCall(
		builder, llvm::Intrinsic::nvvm_wmma_m16n16k16_load_c_f32_row_stride, $args);
		}];

		string opDescription = [{
		Example:

		```mlir
		%2 = nvvm.wmma.m16n16k16.load.c.f32.row.stride %0, %1 : !llvm.ptr<i32, 3>, !llvm.i32 ->
		!llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
		```
		}];

		let description = !strconcat(baseDescription, opDescription);

		let verifier = [{ return ::verify(*this); }];
		}

		// Base class for all the variants of WMMA storeOps that may be defined.
		class NVVM_WMMAStoreOp<string mnemonic> : NVVM_Op<mnemonic>,
		Arguments<(ins Variadic<LLVM_Type>:$args)>{
		let summary = "Warp synchronous matrix store";

		ftynse: This looks intimidatingly verbose. Are we ever expected to have different types for operands or can we just print one instead?
		navdeepkk (author): Since this op is mapped to a single intrinsic, the type of the operands is fixed. I modified the op to print only one type.
		string baseDescription = [{
		The `nvvm.wmma.mnk*.store` operation stores a matrix collectively using
		all the threads in a warp.

		The operation takes as arguments the address to where the matrix elements are
		to be stored, a stride and the elements to store, held by the current thread.
		The stride argument represents the leading dimension of the destination matrix.
		The address and the stride are required to be the same across all threads in the
		warp.

		This op is meant to be used along with `nvvm.wmma.m16n16k16.load` and
		`nvvm.wmma.m16n16k16.mma`.
		}];

		let assemblyFormat = "$args attr-dict `:` type($args)";
		}

		def NVVM_WMMAStoreF16M16N16K16Op : NVVM_WMMAStoreOp<"wmma.m16n16k16.store.d.f16.row.stride"> {
		string llvmBuilder = [{
		createNvvmIntrinsicCall(
		builder, llvm::Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_row_stride, $args);
		}];

		string opDescription = [{
		Example:

		```mlir
		nvvm.wmma.m16n16k16.store.d.f16.row.stride %0, %1, %2, %3, %4, %5, %6 : !llvm.ptr<i32, 3>,
		!llvm.struct<(vec<2 x half>, vec<2 x half>, vec<2 x half>, vec<2 x half>)>, !llvm.i32
		```
		}];

		let description = !strconcat(baseDescription, opDescription);

		let verifier = [{ return ::verify(*this); }];
		}

		def NVVM_WMMAStoreF32M16N16K16Op : NVVM_WMMAStoreOp<"wmma.m16n16k16.store.d.f32.row.stride"> {
		string llvmBuilder = [{
		createNvvmIntrinsicCall(
		builder, llvm::Intrinsic::nvvm_wmma_m16n16k16_store_d_f32_row_stride, $args);
		ftynse: This is still super-verbose. Do we expect operands of different types?
		navdeepkk (author): Sorry, I had already removed the printing but forgot to update the example.
		}];

		string opDescription = [{
		Example:

		ftynse: Any reason why this cannot be implemented with declarative assembly format?
		navdeepkk (author): I tried to use custom directives to print and parse a single type for all the operands. But, there is a restriction in custom directives, quoted as `The type directives must refer to a variable, but that variable need not also be a parameter to the custom directive`. This restricts us to parse a single type for all the operand types and parser.resolveOperands() would fail.
		```mlir
		ftynse: We shouldn't be needing global namespace qualification in this code. It goes into the body of a `OpTy::parse`, which itself lives in the MLIR namespace.
		nvvm.wmma.m16n16k16.store.d.f32.row.stride %0, %1, %2, %3, %4, %5, %6, %7, %8, %9,
		%10 : !llvm.ptr<i32, 3>, !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>,
		ftynse: This is useless; it is only necessary if the variable is used only in assertions, to avoid -Wunused-variable in release builds. Here, it is always used.
		!llvm.i32
		bondhugula: I think all of this code should be in a C++ file. Please see other things for reference or if there is a guideline. @ftynse?
		ftynse: There are no guidelines AFAIK. I prefer and ask to put any non-trivial code with more than ~5 lines in a C++ file.
		navdeepkk (author): Please let me know if this is to be moved and to where.
		```
		}];
		ftynse: This is unnecessary, ArrayRef is implicitly constructible from a single element, just pass `resType` to `addTypes`.

		let description = !strconcat(baseDescription, opDescription);

		let verifier = [{ return ::verify(*this); }];
		}

		// Base class for all the variants of WMMA mmaOps that may be defined.
		class NVVM_WMMAMmaOp<string mnemonic> : NVVM_Op<mnemonic>,
		Results<(outs LLVM_AnyStruct:$res)>,
		Arguments<(ins Variadic<LLVM_Type>:$args)>{
		let summary = "Warp synchronous matrix-multiply accumulate using tensor cores.";

		string baseDescription = [{
		ftynse: Chain these with `||` in a single `if` condition. This is the reason why `Parser` methods return `bool`.
		The `nvvm.wmma.mnk*.mma` operation performs a matrix-multiply accumulate
		bondhugula: Looks like this comment is still unaddressed but marked "Done". Did you forget to push? @ftynse suggested doing: if (parser.parseType... || parser.resolveOperands(...) ... return failure();
		(mma) operation using all the threads in a warp.

		ftynse: I doubt "20" was chosen for any particular reason. SmallVector now supports automatic selection of the number of stack elements if none is provided in the template; just use that if there is no specific reason to do otherwise.
		The operation performed is represented as `D = A * B + C`. The operation takes
		as arguments the elements of the matrices `A`, `B`, `C` and `D`, held by the
		current thread. The op returns a LLVM struct which holds a part of the result
		held by the current thread.

		This op is meant to be used along with `nvvm.wmma.m16n16k16.load` and
		`nvvm.wmma.m16n16k16.store`.
		}];
		}

		ftynse: Use `getOperationName()` instead.
		def NVVM_WMMAMmaF16F16M16N16K16Op : NVVM_WMMAMmaOp<"wmma.m16n16k16.mma.row.row.f16.f16">{
		string llvmBuilder = [{
		$res = createNvvmIntrinsicCall(
		builder, llvm::Intrinsic::nvvm_wmma_m16n16k16_mma_row_row_f16_f16, $args);
		}];
		ftynse: These can be combined in a single string...

		string opDescription = [{
		Example:

		```mlir
		%20 = nvvm.wmma.m16n16k16.mma.row.row.f16.f16 %0, %1, %2, %3, %4, %5, %6, %7, %8,
		%9, %10, %11, %12, %13, %14, %15, %16, %17, %18, %19 : vector<2xf16> -> !llvm.struct
		<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
		```
		}];

		let parser = [{
		return parseWMMAMmaF16F16M16N16K16Op(parser, result);
		}];

		let printer = [{
		printWMMAMmaF16F16M16N16K16Op(p, *this);
		}];

		let description = !strconcat(baseDescription, opDescription);

		let verifier = [{ return ::verify(*this); }];
		}

		def NVVM_WMMAMmaF32F32M16N16K16Op : NVVM_WMMAMmaOp<"wmma.m16n16k16.mma.row.row.f32.f32">{
		string llvmBuilder = [{
		$res = createNvvmIntrinsicCall(
		builder, llvm::Intrinsic::nvvm_wmma_m16n16k16_mma_row_row_f32_f32, $args);
		}];

		string opDescription = [{
		Example:

		bondhugula: We shouldn't have so much C++ code in the tablegen file. Please move this to a C++ file (NVVMOps.cpp) and call that from here.
		```mlir
		%24 = nvvm.wmma.m16n16k16.mma.row.row.f32.f32 %0, %1, %2, %3, %4, %5, %6, %7, %8,
		%9, %10, %11, %12, %13, %14, %15, %16, %17, %18, %19, %20, %21, %22, %23 :
		(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>,
		vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>,
		vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>,
		vector<2xf16>, f32, f32, f32, f32, f32, f32, f32, f32) -> !llvm.struct<(f32,
		f32, f32, f32, f32, f32, f32, f32)>
		```
		}];

		let assemblyFormat = "$args attr-dict `:` functional-type($args, $res)";
		bondhugula: Likewise.

		let description = !strconcat(baseDescription, opDescription);

		let verifier = [{ return ::verify(*this); }];
		}

#endif // NVVMIR_OPS		#endif // NVVMIR_OPS
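
The struct result types in the NVVM examples above follow from the per-thread fragment arithmetic given in the review thread: a warp of 32 threads jointly holds the 256 elements of a 16x16 accumulator, so each thread holds 8 of them. A minimal C++ sketch of that check, with illustrative names (`kWarpSize`, `elemsPerThread`) that do not appear in the patch:

```cpp
#include <cassert>

// Assumption for this sketch: 32 threads per warp, as stated in the review
// thread for the m16n16k16 accumulator fragments.
constexpr int kWarpSize = 32;
constexpr int kTileElems = 16 * 16;  // elements in a 16x16 accumulator tile

// Hypothetical helper: number of matrix elements one thread holds when its
// fragment is a struct of `numMembers` members of `elemsPerMember` each.
constexpr int elemsPerThread(int numMembers, int elemsPerMember) {
  return numMembers * elemsPerMember;
}

// f16 accumulator: struct of 4 x vec<2 x half> -> 8 elements per thread.
static_assert(elemsPerThread(4, 2) * kWarpSize == kTileElems,
              "f16 C/D fragment covers the 16x16 tile");
// f32 accumulator: struct of 8 x f32 -> 8 elements per thread.
static_assert(elemsPerThread(8, 1) * kWarpSize == kTileElems,
              "f32 C/D fragment covers the 16x16 tile");
```

This only checks the accumulator (`C`/`D`) fragments that the thread discusses; the `A` and `B` fragment structs in the load examples are larger and are not covered by this count.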

mlir/include/mlir/Target/LLVMIR/ModuleTranslation.h

Show First 20 Lines • Show All 251 Lines • ▼ Show 20 Lines	llvm::Constant getLLVMConstant(llvm::Type llvmType, Attribute attr,
Location loc,		Location loc,
const ModuleTranslation &moduleTranslation);		const ModuleTranslation &moduleTranslation);

/// Creates a call to an LLVM IR intrinsic function with the given arguments.		/// Creates a call to an LLVM IR intrinsic function with the given arguments.
llvm::Value *createIntrinsicCall(llvm::IRBuilderBase &builder,		llvm::Value *createIntrinsicCall(llvm::IRBuilderBase &builder,
llvm::Intrinsic::ID intrinsic,		llvm::Intrinsic::ID intrinsic,
ArrayRef<llvm::Value *> args = {},		ArrayRef<llvm::Value *> args = {},
ArrayRef<llvm::Type *> tys = {});		ArrayRef<llvm::Type *> tys = {});

		/// Creates a call to an LLVM IR intrinsic function with the given arguments
		/// for NVVM WMMA ops. Handles cases where the intrinsic name is overloaded
		/// using the types of arguments supplied. Selects the correct intrinsic
		/// by inspecting the argument types.
		llvm::Value *createNvvmIntrinsicCall(llvm::IRBuilderBase &builder,
		llvm::Intrinsic::ID intrinsic,
		ArrayRef<llvm::Value *> args = {});
} // namespace detail		} // namespace detail

} // namespace LLVM		} // namespace LLVM
} // namespace mlir		} // namespace mlir

#endif // MLIR_TARGET_LLVMIR_MODULETRANSLATION_H		#endif // MLIR_TARGET_LLVMIR_MODULETRANSLATION_H

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

...
#include "mlir/IR/PatternMatch.h"		#include "mlir/IR/PatternMatch.h"
#include "mlir/IR/TypeUtilities.h"		#include "mlir/IR/TypeUtilities.h"
#include "llvm/ADT/TypeSwitch.h"		#include "llvm/ADT/TypeSwitch.h"

using namespace mlir;		using namespace mlir;
using namespace mlir::gpu;		using namespace mlir::gpu;

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
		// MMAMatrixType
		bondhugula: Nit: MMAFragmentType
		//===----------------------------------------------------------------------===//

		MMAMatrixType MMAMatrixType::get(ArrayRef<int64_t> shape, Type elementType,
		StringRef operand) {
		return Base::get(elementType.getContext(), shape, elementType, operand);
		}

		MMAMatrixType
		MMAMatrixType::getChecked(function_ref<InFlightDiagnostic()> emitError,
		ArrayRef<int64_t> shape, Type elementType,
		StringRef operand) {
		return Base::getChecked(emitError, elementType.getContext(), shape,
		elementType, operand);
		}

		unsigned MMAMatrixType::getNumDims() const { return getImpl()->numDims; }
		bondhugula: `unsigned`?

		ArrayRef<int64_t> MMAMatrixType::getShape() const {
		return getImpl()->getShape();
		}

		Type MMAMatrixType::getElementType() const { return getImpl()->elementType; }

		StringRef MMAMatrixType::getOperand() const { return getImpl()->getOperand(); }
		ftynse: It is better to inspect `elementType` rather than to construct a new one for comparison. Construction takes a lock in the context, inspection does not. Also, `if (condition) return false; return true` should be written as `return !condition`.

		bool MMAMatrixType::isValidElementType(Type elementType) {
		return elementType.isF16() || elementType.isF32();
		}

		LogicalResult
		MMAMatrixType::verify(function_ref<InFlightDiagnostic()> emitError,
		ArrayRef<int64_t> shape, Type elementType,
		StringRef operand) {
		if (!operand.equals("AOp") && !operand.equals("BOp") &&
		!operand.equals("COp") && !operand.equals("DOp"))
		return emitError() << "operand expected to be one of AOp, BOp, COp or DOp";
		bondhugula: `Operand` -> `operand`

		if (shape.size() != 2)
		return emitError() << "MMAMatrixType must have exactly two dimensions";

		if (!MMAMatrixType::isValidElementType(elementType))
		return emitError() << "MMAMatrixType elements must be F16 or F32";

		return success();
		}
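
The checks in `MMAMatrixType::verify` above can be tried outside MLIR with a small stand-alone sketch. It is only a model of the logic shown in the diff: `ElemKind` and `verifyMmaMatrix` are hypothetical names, the element type is reduced to an enum instead of `mlir::Type`, and an empty string stands in for `success()`.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for mlir::Type; only F16 and F32 are accepted.
enum class ElemKind { F16, F32, I32 };

// Mirrors the order of checks in the diff: operand tag, rank, element type.
// Returns an empty string on success, otherwise the diagnostic text.
std::string verifyMmaMatrix(const std::vector<int64_t> &shape,
                            ElemKind elementType,
                            const std::string &operand) {
  if (operand != "AOp" && operand != "BOp" && operand != "COp" &&
      operand != "DOp")
    return std::string("operand expected to be one of AOp, BOp, COp or DOp");

  if (shape.size() != 2)
    return std::string("MMAMatrixType must have exactly two dimensions");

  if (elementType != ElemKind::F16 && elementType != ElemKind::F32)
    return std::string("MMAMatrixType elements must be F16 or F32");

  return std::string();
}
```

For example, `verifyMmaMatrix({16, 16}, ElemKind::F16, "AOp")` returns an empty string, while a one-dimensional shape yields the rank diagnostic.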

		//===----------------------------------------------------------------------===//
// GPUDialect		// GPUDialect
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		/// GPU memory space identifiers.
		enum GPUMemorySpace {
		/// Generic memory space identifier.
		kGenericMemorySpace = 0,

		/// Global memory space identifier.
		kGlobalMemorySpace = 1,

		bondhugula: Use an `enum`?
		/// Shared memory space identifier.
		kSharedMemorySpace = 3
		};

bool GPUDialect::isKernel(Operation *op) {		bool GPUDialect::isKernel(Operation *op) {
UnitAttr isKernelAttr = op->getAttrOfType<UnitAttr>(getKernelFuncAttrName());		UnitAttr isKernelAttr = op->getAttrOfType<UnitAttr>(getKernelFuncAttrName());
return static_cast<bool>(isKernelAttr);		return static_cast<bool>(isKernelAttr);
}		}

void GPUDialect::initialize() {		void GPUDialect::initialize() {
addTypes<AsyncTokenType>();		addTypes<AsyncTokenType>();
		addTypes<MMAMatrixType>();
addOperations<		addOperations<
#define GET_OP_LIST		#define GET_OP_LIST
#include "mlir/Dialect/GPU/GPUOps.cpp.inc"		#include "mlir/Dialect/GPU/GPUOps.cpp.inc"
>();		>();
}		}

Type GPUDialect::parseType(DialectAsmParser &parser) const {		Type GPUDialect::parseType(DialectAsmParser &parser) const {
// Parse the main keyword for the type.		// Parse the main keyword for the type.
StringRef keyword;		StringRef keyword;
if (parser.parseKeyword(&keyword))		if (parser.parseKeyword(&keyword))
return Type();		return Type();
MLIRContext *context = getContext();		MLIRContext *context = getContext();

// Handle 'async token' types.		// Handle 'async token' types.
if (keyword == "async.token")		if (keyword == "async.token")
return AsyncTokenType::get(context);		return AsyncTokenType::get(context);

		if (keyword == "mma_matrix") {
		llvm::SMLoc beginLoc = parser.getNameLoc();

		// Parse '<'.
		if (parser.parseLess())
		return nullptr;

		// Parse the size and elementType.
		SmallVector<int64_t> shape;
		Type elementType;
		if (parser.parseDimensionList(shape, /*allowDynamic=*/false) ||
		parser.parseType(elementType))
		return nullptr;

		// Parse ','
		ftynse: All user-visible error messages need a test.
		navdeepkk: I have removed these messages as they seemed redundant. The parser already emits errors for the required thing. I am adding tests for other user-visible messages.
		if (parser.parseComma())
		return nullptr;

		// Parse operand.
		StringRef operand;
		if (failed(parser.parseOptionalString(&operand)))
		return nullptr;

		// Parse '>'.
		if (parser.parseGreater())
		return nullptr;

		return MMAMatrixType::getChecked(mlir::detail::getDefaultDiagnosticEmitFn(
		parser.getEncodedSourceLoc(beginLoc)),
		shape, elementType, operand);
		ftynse: `parseType` accepts references to a derived type and does this check for you. Just declare `elementType` as `VectorType`.
		}

parser.emitError(parser.getNameLoc(), "unknown gpu type: " + keyword);		parser.emitError(parser.getNameLoc(), "unknown gpu type: " + keyword);
		ftynse: This should be in the type verifier instead.
return Type();		return Type();
		ftynse: Note that the location is pointing to the token after the last consumed one, i.e. after `>`. Capture the location before parsing the size, or at least point to the first token in the type.
}		}

void GPUDialect::printType(Type type, DialectAsmPrinter &os) const {		void GPUDialect::printType(Type type, DialectAsmPrinter &os) const {
TypeSwitch<Type>(type)		TypeSwitch<Type>(type)
.Case<AsyncTokenType>([&](Type) { os << "async.token"; })		.Case<AsyncTokenType>([&](Type) { os << "async.token"; })
		.Case<MMAMatrixType>([&](MMAMatrixType fragTy) {
		os << "mma_matrix<";
		auto shape = fragTy.getShape();
		for (auto dim = shape.begin(), e = shape.end() - 1; dim != e; ++dim)
		os << *dim << 'x';
		os << shape.back() << 'x' << fragTy.getElementType();
		bondhugula: Nit: "x" -> 'x'
		os << ", \"" << fragTy.getOperand() << "\"" << '>';
		ftynse: Don't duplicate the verifier, use `getChecked` instead.
		})
		bondhugula: Likewise.
.Default([](Type) { llvm_unreachable("unexpected 'gpu' type kind"); });		.Default([](Type) { llvm_unreachable("unexpected 'gpu' type kind"); });
}		}
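
As a rough illustration of the textual form produced by the `MMAMatrixType` case in `printType`, the following stand-alone sketch builds the same string with the standard library. `formatMmaMatrixType` is a hypothetical helper, and the element type is passed as already-printed text rather than as an `mlir::Type`.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Builds `mma_matrix<D0xD1x<elem>, "<operand>">`. Every dimension is followed
// by 'x', then the element type, then the quoted operand tag.
std::string formatMmaMatrixType(const std::vector<int64_t> &shape,
                                const std::string &elementType,
                                const std::string &operand) {
  std::ostringstream os;
  os << "mma_matrix<";
  for (int64_t dim : shape)
    os << dim << 'x';
  os << elementType << ", \"" << operand << "\">";
  return os.str();
}
```

Under these assumptions a 16x16 `f16` fragment for the `A` operand is rendered as `mma_matrix<16x16xf16, "AOp">`, which is also the form the parser above expects to read back.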

LogicalResult GPUDialect::verifyOperationAttribute(Operation *op,		LogicalResult GPUDialect::verifyOperationAttribute(Operation *op,
NamedAttribute attr) {		NamedAttribute attr) {
if (!attr.second.isa<UnitAttr>() ||		if (!attr.second.isa<UnitAttr>() ||
attr.first != getContainerModuleAttrName())		attr.first != getContainerModuleAttrName())
return success();		return success();
...
auto walkResult = module.walk([&module](LaunchFuncOp launchOp) -> WalkResult {
}		}

return success();		return success();
});		});

return walkResult.wasInterrupted() ? failure() : success();		return walkResult.wasInterrupted() ? failure() : success();
}		}

template <typename T> static LogicalResult verifyIndexOp(T op) {		template <typename T>
		static LogicalResult verifyIndexOp(T op) {
auto dimension = op.dimension();		auto dimension = op.dimension();
if (dimension != "x" && dimension != "y" && dimension != "z")		if (dimension != "x" && dimension != "y" && dimension != "z")
return op.emitError("dimension \"") << dimension << "\" is invalid";		return op.emitError("dimension \"") << dimension << "\" is invalid";
return success();		return success();
}		}

static LogicalResult verifyAllReduce(gpu::AllReduceOp allReduce) {		static LogicalResult verifyAllReduce(gpu::AllReduceOp allReduce) {
if (allReduce.body().empty() != allReduce.op().hasValue())		if (allReduce.body().empty() != allReduce.op().hasValue())
...
if (asyncTokenType)
printer << "async ";		printer << "async ";
if (asyncDependencies.empty())		if (asyncDependencies.empty())
return;		return;
printer << "[";		printer << "[";
llvm::interleaveComma(asyncDependencies, printer);		llvm::interleaveComma(asyncDependencies, printer);
printer << "]";		printer << "]";
}		}

		//===----------------------------------------------------------------------===//
		// GPU_SubgroupMmaLoadMatrixOp
		//===----------------------------------------------------------------------===//

		static LogicalResult verify(SubgroupMmaLoadMatrixOp op) {
		auto srcType = op.srcMemref().getType();
		auto resType = op.res().getType();
		auto resMatrixType = resType.cast<gpu::MMAMatrixType>();
		auto operand = resMatrixType.getOperand();
		auto srcMemrefType = srcType.cast<MemRefType>();
		auto srcMemSpace = srcMemrefType.getMemorySpaceAsInt();

		ftynse: Replace these magic numbers with named constants defined above.
		if (!srcMemrefType.getAffineMaps().empty() &&
		!srcMemrefType.getAffineMaps().front().isIdentity())
		return op.emitError("expected identity layout map for source memref");
		ftynse: Nit: please don't use magic numbers. Rather define this as `GPUDialect::kThreadPrivateMemorySpace` and so on.

		if (srcMemSpace != kGenericMemorySpace && srcMemSpace != kSharedMemorySpace &&
		srcMemSpace != kGlobalMemorySpace)
		return op.emitError(
		"source memorySpace kGenericMemorySpace, kSharedMemorySpace or "
		"kGlobalMemorySpace only allowed");

		if (!operand.equals("AOp") && !operand.equals("BOp") &&
		!operand.equals("COp"))
		ftynse: This can be defined in ODS, instead of using AnyMemRef.
		return op.emitError("only AOp, BOp and COp can be loaded");
		bondhugula: `Only` -> `only`

		ftynse: This can be put into ODS.
		return success();
		}
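
The memory-space test used by the load and store verifiers can be isolated as a small predicate. This is a minimal sketch that reuses the enum values from the diff; `isSupportedMmaMemorySpace` is a hypothetical helper name, not something the patch defines.

```cpp
// Same identifiers and values as the GPUMemorySpace enum in the diff.
enum GPUMemorySpace {
  kGenericMemorySpace = 0,
  kGlobalMemorySpace = 1,
  kSharedMemorySpace = 3
};

// True when a memref in this memory space may be used as the source of a
// load or the destination of a store, per the checks in the verifiers.
bool isSupportedMmaMemorySpace(unsigned memorySpace) {
  switch (memorySpace) {
  case kGenericMemorySpace:
  case kGlobalMemorySpace:
  case kSharedMemorySpace:
    return true;
  default:
    return false;
  }
}
```

Naming the spaces once and testing them in one place keeps the two verifiers from repeating the three-way comparison.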

		//===----------------------------------------------------------------------===//
		// GPU_SubgroupMmaStoreMatrixOp
		//===----------------------------------------------------------------------===//

		static LogicalResult verify(SubgroupMmaStoreMatrixOp op) {
		auto srcType = op.src().getType();
		auto dstType = op.dstMemref().getType();
		auto srcMatrixType = srcType.cast<gpu::MMAMatrixType>();
		ftynse: Don't duplicate the checks already enforced by type constraints in ODS. Also, if you had added tests for user-visible errors, you would have seen that this message is never produced because the error condition is caught by ODS-generated verifier parts with a different message.
		navdeepkk: Removed redundant checks.
		auto dstMemrefType = dstType.cast<MemRefType>();
		auto dstMemSpace = dstMemrefType.getMemorySpaceAsInt();
		ftynse: Braces must be symmetric.

		if (!dstMemrefType.getAffineMaps().empty() &&
		!dstMemrefType.getAffineMaps().front().isIdentity())
		return op.emitError("expected identity layout map for destination memref");

		if (dstMemSpace != kGenericMemorySpace && dstMemSpace != kSharedMemorySpace &&
		dstMemSpace != kGlobalMemorySpace)
		return op.emitError(
		"destination memorySpace of kGenericMemorySpace, "
		"kGlobalMemorySpace or kSharedMemorySpace only allowed");

		if (!srcMatrixType.getOperand().equals("DOp"))
		return op.emitError(
		"expected the operand matrix being stored to have 'DOp' operand type");

		return success();
		}

		//===----------------------------------------------------------------------===//
		// GPU_SubgroupMmaComputeOp
		//===----------------------------------------------------------------------===//

		static LogicalResult verify(SubgroupMmaComputeOp op) {
		enum OperandMap { A, B, C };
		SmallVector<MMAMatrixType, 3> opTypes;

		auto populateOpInfo = [&opTypes, &op]() {
		opTypes.push_back(op.opA().getType().cast<MMAMatrixType>());
		bondhugula: Can drop trivial braces.
		opTypes.push_back(op.opB().getType().cast<MMAMatrixType>());
		opTypes.push_back(op.opC().getType().cast<MMAMatrixType>());
		};
		populateOpInfo();

		if (!opTypes[A].getOperand().equals("AOp") ||
		!opTypes[B].getOperand().equals("BOp") ||
		!opTypes[C].getOperand().equals("COp"))
		return op.emitError("operands must be in the order AOp, BOp, COp");
		bondhugula: `Operands` -> `operands`

		ArrayRef<int64_t> aShape, bShape, cShape;
		aShape = opTypes[A].getShape();
		bShape = opTypes[B].getShape();
		cShape = opTypes[C].getShape();

		if (aShape[1] != bShape[0] || aShape[0] != cShape[0] ||
		bShape[1] != cShape[1])
		return op.emitError("operand shapes do not satisfy matmul constraints");
		bondhugula: Likewise.

		return success();
		}
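
The shape condition checked by the `SubgroupMmaComputeOp` verifier is the usual matmul constraint: for `A` of shape MxK, `B` must be KxN and `C` must be MxN. A minimal stand-alone sketch of that predicate, with `Shape2D` and `satisfiesMatmulShapes` as hypothetical names:

```cpp
#include <array>
#include <cstdint>

using Shape2D = std::array<int64_t, 2>;

// A is MxK, B is KxN, C is MxN. This is the positive form of the condition
// that the verifier rejects.
bool satisfiesMatmulShapes(const Shape2D &a, const Shape2D &b,
                           const Shape2D &c) {
  return a[1] == b[0] && a[0] == c[0] && b[1] == c[1];
}
```

The 16x16 fragments targeted by this patch satisfy the predicate trivially; a case such as `A = 8x4`, `B = 8x32` fails because the inner dimensions differ.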

#include "mlir/Dialect/GPU/GPUOpInterfaces.cpp.inc"		#include "mlir/Dialect/GPU/GPUOpInterfaces.cpp.inc"

#define GET_OP_CLASSES		#define GET_OP_CLASSES
#include "mlir/Dialect/GPU/GPUOps.cpp.inc"		#include "mlir/Dialect/GPU/GPUOps.cpp.inc"
		ftynse: Nit: please use LogicalResult for verify* functions, even internal. Otherwise, use the name that makes it clear what are the intended semantics of the return value, e.g. `isValidType`.

mlir/lib/Dialect/LLVMIR/IR/NVVMDialect.cpp

...
static LogicalResult verify(MmaOp op) {
auto f16Ty = Float16Type::get(context);		auto f16Ty = Float16Type::get(context);
auto f16x2Ty = LLVM::getFixedVectorType(f16Ty, 2);		auto f16x2Ty = LLVM::getFixedVectorType(f16Ty, 2);
auto f32Ty = Float32Type::get(context);		auto f32Ty = Float32Type::get(context);
auto f16x2x4StructTy = LLVM::LLVMStructType::getLiteral(		auto f16x2x4StructTy = LLVM::LLVMStructType::getLiteral(
context, {f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty});		context, {f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty});
auto f32x8StructTy = LLVM::LLVMStructType::getLiteral(		auto f32x8StructTy = LLVM::LLVMStructType::getLiteral(
context, {f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty});		context, {f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty});

SmallVector<Type, 12> operand_types(op.getOperandTypes().begin(),		SmallVector<Type, 12> operandTypes(op.getOperandTypes().begin(),
op.getOperandTypes().end());		op.getOperandTypes().end());
if (operand_types != SmallVector<Type, 8>(8, f16x2Ty) &&		if (operandTypes != SmallVector<Type, 8>(8, f16x2Ty) &&
operand_types != SmallVector<Type, 12>{f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty,		operandTypes != SmallVector<Type, 12>{f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty,
f32Ty, f32Ty, f32Ty, f32Ty, f32Ty,		f32Ty, f32Ty, f32Ty, f32Ty, f32Ty,
f32Ty, f32Ty, f32Ty}) {		f32Ty, f32Ty, f32Ty}) {
return op.emitOpError(		return op.emitOpError(
"expected operands to be 4 <halfx2>s followed by either "		"expected operands to be 4 <halfx2>s followed by either "
"4 <halfx2>s or 8 floats");		"4 <halfx2>s or 8 floats");
}		}
if (op.getType() != f32x8StructTy && op.getType() != f16x2x4StructTy) {		if (op.getType() != f32x8StructTy && op.getType() != f16x2x4StructTy) {
return op.emitOpError("expected result type to be a struct of either 4 "		return op.emitOpError("expected result type to be a struct of either 4 "
"<halfx2>s or 8 floats");		"<halfx2>s or 8 floats");
}		}

auto alayout = op->getAttrOfType<StringAttr>("alayout");		auto alayout = op->getAttrOfType<StringAttr>("alayout");
auto blayout = op->getAttrOfType<StringAttr>("blayout");		auto blayout = op->getAttrOfType<StringAttr>("blayout");

if (!(alayout && blayout) ||		if (!(alayout && blayout) ||
!(alayout.getValue() == "row" || alayout.getValue() == "col") ||		!(alayout.getValue() == "row" || alayout.getValue() == "col") ||
!(blayout.getValue() == "row" || blayout.getValue() == "col")) {		!(blayout.getValue() == "row" || blayout.getValue() == "col")) {
return op.emitOpError(		return op.emitOpError(
"alayout and blayout attributes must be set to either "		"alayout and blayout attributes must be set to either "
"\"row\" or \"col\"");		"\"row\" or \"col\"");
}		}

if (operand_types == SmallVector<Type, 12>{f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty,		if (operandTypes == SmallVector<Type, 12>{f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty,
f32Ty, f32Ty, f32Ty, f32Ty, f32Ty,		f32Ty, f32Ty, f32Ty, f32Ty, f32Ty,
f32Ty, f32Ty, f32Ty} &&		f32Ty, f32Ty, f32Ty} &&
op.getType() == f32x8StructTy && alayout.getValue() == "row" &&		op.getType() == f32x8StructTy && alayout.getValue() == "row" &&
blayout.getValue() == "col") {		blayout.getValue() == "col") {
return success();		return success();
}		}
return op.emitOpError("unimplemented mma.sync variant");		return op.emitOpError("unimplemented mma.sync variant");
}		}

		template <typename T>
		static LogicalResult verifyWMMALoadOp(T op, StringRef operand) {
		MLIRContext *context = op.getContext();
		ftynse: Please fix linter findings.
		auto i32Ty = IntegerType::get(context, 32);
		auto i32Ptr1Ty = LLVM::LLVMPointerType::get(i32Ty, 1);
		auto i32Ptr3Ty = LLVM::LLVMPointerType::get(i32Ty, 3);
		auto i32Ptr0Ty = LLVM::LLVMPointerType::get(i32Ty, 0);
		auto f16Ty = FloatType::getF16(context);
		auto f32Ty = FloatType::getF32(context);
		auto f16x2Ty = VectorType::get(2, f16Ty);
		auto f16x2x4StructTy = LLVM::LLVMStructType::getLiteral(
		context, {f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty});
		auto f16x2x8StructTy = LLVM::LLVMStructType::getLiteral(
		context,
		{f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty});
		auto f32x8StructTy = LLVM::LLVMStructType::getLiteral(
		context, {f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty});

		SmallVector<Type, 2> operandTypes(op.getOperandTypes().begin(),
		op.getOperandTypes().end());
		if (operandTypes != SmallVector<Type, 2>{i32Ptr1Ty, i32Ty} &&
		operandTypes != SmallVector<Type, 2>{i32Ptr3Ty, i32Ty} &&
		operandTypes != SmallVector<Type, 2>{i32Ptr0Ty, i32Ty}) {
		return op.emitOpError("expected operands to be a source pointer in memory "
		"space 0, 1, 3 followed by ldm of the source");
		}

		if (operand.equals("AOp") || operand.equals("BOp")) {
		if (op.getType() != f16x2x8StructTy) {
		return op.emitOpError("expected result type of loadAOp and loadBOp to be "
		"a struct of 8 <halfx2>s");
		}
		} else if (operand.equals("COp")) {
		if (op.getType() != f16x2x4StructTy && op.getType() != f32x8StructTy) {
		return op.emitOpError("expected result type of loadCOp to be a struct of "
		"4 <halfx2>s or 8 f32s");
		}
		}

		return success();
		}

		static LogicalResult verify(WMMALoadAM16N16K16Op op) {
		return verifyWMMALoadOp(op, "AOp");
		}

		static LogicalResult verify(WMMALoadBM16N16K16Op op) {
		return verifyWMMALoadOp(op, "BOp");
		}

		static LogicalResult verify(WMMALoadCF16M16N16K16Op op) {
		return verifyWMMALoadOp(op, "COp");
		}

		static LogicalResult verify(WMMALoadCF32M16N16K16Op op) {
		return verifyWMMALoadOp(op, "COp");
		}

		template <typename T>
		static bool verifyWMMAStoreOp(T op, SmallVector<Type> &containedElems) {
		SmallVector<Type> operandTypes(op.getOperandTypes().begin(),
		op.getOperandTypes().end());
		if (operandTypes == containedElems)
		return true;

		return false;
		bondhugula: No need of the `else`.
		}
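
The reviewers' remarks about `if (condition) return ...; return ...;` apply to helpers like the one above: a boolean comparison can be returned directly. A minimal sketch, assuming types are modelled as strings rather than `mlir::Type`, with `TypeList` and `operandTypesMatch` as hypothetical names:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in: printed type names instead of mlir::Type values.
using TypeList = std::vector<std::string>;

// Returns the result of the comparison directly instead of branching on it.
// The name also states what `true` means, as suggested for non-LogicalResult
// helpers in the review.
bool operandTypesMatch(const TypeList &operandTypes, const TypeList &expected) {
  return operandTypes == expected;
}
```

Vector equality already compares both length and element order, so no extra size check is needed.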

		static LogicalResult verify(WMMAStoreF16M16N16K16Op op) {
		MLIRContext *context = op.getContext();
		auto i32Ty = IntegerType::get(context, 32);
		auto i32Ptr1Ty = LLVM::LLVMPointerType::get(i32Ty, 1);
		auto i32Ptr3Ty = LLVM::LLVMPointerType::get(i32Ty, 3);
		auto i32Ptr0Ty = LLVM::LLVMPointerType::get(i32Ty, 0);
		auto f16Ty = FloatType::getF16(context);
		auto f16x2Ty = VectorType::get(2, f16Ty);
		SmallVector<Type> type1{i32Ptr1Ty, f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty, i32Ty};
		SmallVector<Type> type0{i32Ptr0Ty, f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty, i32Ty};
		SmallVector<Type> type3{i32Ptr3Ty, f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty, i32Ty};
		if (verifyWMMAStoreOp(op, type1) || verifyWMMAStoreOp(op, type0) ||
		verifyWMMAStoreOp(op, type3))
		return success();

		bondhugula: Likewise.
		return op.emitOpError("expected operands to be a source pointer in memory "
		"space 0, 1, 3 followed by ldm of the source");
		}

		static LogicalResult verify(WMMAStoreF32M16N16K16Op op) {
		MLIRContext *context = op.getContext();
		auto i32Ty = IntegerType::get(context, 32);
		auto i32Ptr1Ty = LLVM::LLVMPointerType::get(i32Ty, 1);
		auto i32Ptr3Ty = LLVM::LLVMPointerType::get(i32Ty, 3);
		auto i32Ptr0Ty = LLVM::LLVMPointerType::get(i32Ty, 0);
		auto f32Ty = FloatType::getF32(context);

		SmallVector<Type> type1{i32Ptr1Ty, f32Ty, f32Ty, f32Ty, f32Ty,
		f32Ty, f32Ty, f32Ty, f32Ty, i32Ty};
		SmallVector<Type> type0{i32Ptr0Ty, f32Ty, f32Ty, f32Ty, f32Ty,
		f32Ty, f32Ty, f32Ty, f32Ty, i32Ty};
		SmallVector<Type> type3{i32Ptr3Ty, f32Ty, f32Ty, f32Ty, f32Ty,
		f32Ty, f32Ty, f32Ty, f32Ty, i32Ty};
		if (verifyWMMAStoreOp(op, type0) || verifyWMMAStoreOp(op, type1) ||
		verifyWMMAStoreOp(op, type3))
		return success();

		bondhugula: Likewise. clang-tidy will also show a warning here.
		return op.emitOpError("expected operands to be a source pointer in memory "
		"space 0, 1, 3 followed by ldm of the source");
		}
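
Diagnostics in these verifiers are written as adjacent string literals split across lines. The compiler concatenates adjacent literals with no separator, so one of the two pieces has to carry the space between words. A small stand-alone sketch of the behaviour (both function names are hypothetical):

```cpp
#include <string>

// No space at the seam: the two words are fused in the resulting message.
std::string joinedWithoutSpace() {
  return "memory"
         "space 0, 1, 3";
}

// Trailing space inside the first literal keeps the words apart.
std::string joinedWithSpace() {
  return "memory "
         "space 0, 1, 3";
}
```

Tests that match on diagnostic text (for example in `invalid.mlir`) are sensitive to this, since the expected string must agree character for character with what is emitted.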

		static LogicalResult verify(WMMAMmaF16F16M16N16K16Op op) {
		MLIRContext *context = op.getContext();
		auto f16Ty = FloatType::getF16(context);
		auto f16x2Ty = VectorType::get(2, f16Ty);
		auto f16x2x4StructTy = LLVM::LLVMStructType::getLiteral(
		context, {f16x2Ty, f16x2Ty, f16x2Ty, f16x2Ty});

		SmallVector<Type, 2> operandTypes(op.getOperandTypes().begin(),
		op.getOperandTypes().end());
		if (operandTypes != SmallVector<Type, 20>(20, f16x2Ty))
		return op.emitOpError("expected 20 <halfx2>s as operands");

		if (op.getResult().getType() != f16x2x4StructTy)
		return op.emitOpError("expected result type to be a struct of 4 <halfx2>s");

		return success();
		}

		static LogicalResult parseWMMAMmaF16F16M16N16K16Op(OpAsmParser &parser,
		OperationState &result) {
		SmallVector<OpAsmParser::OperandType, 4> operands;
		::llvm::SMLoc operandsLoc;
		Type operandType;
		Type resType;

		operandsLoc = parser.getCurrentLocation();
		if (parser.parseOperandList(operands) ||
		parser.parseOptionalAttrDict(result.attributes) || parser.parseColon() ||
		parser.parseType(operandType) || parser.parseArrow())
		return failure();

		unsigned numOperands = operands.size();
		SmallVector<Type> operandTypes(numOperands, operandType);
		if (parser.parseType(resType))
		bondhugula: Use `getOperands()` instead of doing getOpOperands() and then doing a `get()`?
		return failure();
		result.addTypes(resType);
		if (parser.resolveOperands(operands, operandTypes, operandsLoc,
		result.operands))
		return failure();
		return success();
		}

		static void printWMMAMmaF16F16M16N16K16Op(OpAsmPrinter &p,
		WMMAMmaF16F16M16N16K16Op &op) {
		p << op.getOperationName();
		p << ' ';
		p << op.args();
		p.printOptionalAttrDict(op->getAttrs(), {});
		p << " : ";
		p << op->getOperand(0).getType();
		p << ' ' << "->";
		p << ' ';
		p << ::llvm::ArrayRef<::mlir::Type>(op.res().getType());
		}

		static LogicalResult verify(WMMAMmaF32F32M16N16K16Op op) {
		unsigned numABOperands = 16;
		unsigned numCOperands = 8;
		MLIRContext *context = op.getContext();
		auto f16Ty = FloatType::getF16(context);
		auto f32Ty = FloatType::getF32(context);
		auto f16x2Ty = VectorType::get(2, f16Ty);
		auto f32x8StructTy = LLVM::LLVMStructType::getLiteral(
		context, {f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty, f32Ty});

		SmallVector<Type> abOpTypes;
		SmallVector<Type> bOpTypes;
		SmallVector<Type> cOpTypes;

		for (auto operand : op->getOperands().take_front(numABOperands)) {
		abOpTypes.push_back(operand.getType());
		}

		for (auto operand :
		op->getOperands().drop_front(numABOperands).take_front(numCOperands)) {
		cOpTypes.push_back(operand.getType());
		}

		if (abOpTypes != SmallVector<Type>(16, f16x2Ty))
		return op.emitOpError("expected 16 <halfx2>s for `a` and `b` operand");

		if (cOpTypes != SmallVector<Type>(8, f32Ty))
		return op.emitOpError("expected 8 f32s for `c` operand");

		if (op.getResult().getType() != f32x8StructTy)
		return op.emitOpError("expected result type to be a struct of 8 f32s");

		return success();
		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// NVVMDialect initialization, type parsing, and registration.		// NVVMDialect initialization, type parsing, and registration.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

// TODO: This should be the llvm.nvvm dialect once this is supported.		// TODO: This should be the llvm.nvvm dialect once this is supported.
void NVVMDialect::initialize() {		void NVVMDialect::initialize() {
addOperations<		addOperations<
#define GET_OP_LIST		#define GET_OP_LIST
#include "mlir/Dialect/LLVMIR/NVVMOps.cpp.inc"		#include "mlir/Dialect/LLVMIR/NVVMOps.cpp.inc"
>();		>();

// Support unknown operations because not all NVVM operations are registered.		// Support unknown operations because not all NVVM operations are
		// registered.
allowUnknownOperations();		allowUnknownOperations();
}		}

LogicalResult NVVMDialect::verifyOperationAttribute(Operation *op,		LogicalResult NVVMDialect::verifyOperationAttribute(Operation *op,
NamedAttribute attr) {		NamedAttribute attr) {
// Kernel function attribute should be attached to functions.		// Kernel function attribute should be attached to functions.
if (attr.first == NVVMDialect::getKernelFuncAttrName()) {		if (attr.first == NVVMDialect::getKernelFuncAttrName()) {
if (!isa<LLVM::LLVMFuncOp>(op)) {		if (!isa<LLVM::LLVMFuncOp>(op)) {
...

mlir/lib/Target/LLVMIR/Dialect/NVVM/NVVMToLLVMIRTranslation.cpp

...
	#include "mlir/Target/LLVMIR/ModuleTranslation.h"			#include "mlir/Target/LLVMIR/ModuleTranslation.h"

	#include "llvm/IR/IRBuilder.h"			#include "llvm/IR/IRBuilder.h"
	#include "llvm/IR/IntrinsicsNVPTX.h"			#include "llvm/IR/IntrinsicsNVPTX.h"

	using namespace mlir;			using namespace mlir;
	using namespace mlir::LLVM;			using namespace mlir::LLVM;
	using mlir::LLVM::detail::createIntrinsicCall;			using mlir::LLVM::detail::createIntrinsicCall;
				using mlir::LLVM::detail::createNvvmIntrinsicCall;

	static llvm::Intrinsic::ID getShflBflyIntrinsicId(llvm::Type *resultType,			static llvm::Intrinsic::ID getShflBflyIntrinsicId(llvm::Type *resultType,
	bool withPredicate) {			bool withPredicate) {
	if (withPredicate) {			if (withPredicate) {
	resultType = cast<llvm::StructType>(resultType)->getElementType(0);			resultType = cast<llvm::StructType>(resultType)->getElementType(0);
	return resultType->isFloatTy() ? llvm::Intrinsic::nvvm_shfl_sync_bfly_f32p			return resultType->isFloatTy() ? llvm::Intrinsic::nvvm_shfl_sync_bfly_f32p
	: llvm::Intrinsic::nvvm_shfl_sync_bfly_i32p;			: llvm::Intrinsic::nvvm_shfl_sync_bfly_i32p;
	}			}
...

mlir/lib/Target/LLVMIR/ModuleTranslation.cpp

...
	#include "llvm/ADT/SetVector.h"			#include "llvm/ADT/SetVector.h"
	#include "llvm/Frontend/OpenMP/OMPIRBuilder.h"			#include "llvm/Frontend/OpenMP/OMPIRBuilder.h"
	#include "llvm/IR/BasicBlock.h"			#include "llvm/IR/BasicBlock.h"
	#include "llvm/IR/CFG.h"			#include "llvm/IR/CFG.h"
	#include "llvm/IR/Constants.h"			#include "llvm/IR/Constants.h"
	#include "llvm/IR/DerivedTypes.h"			#include "llvm/IR/DerivedTypes.h"
	#include "llvm/IR/IRBuilder.h"			#include "llvm/IR/IRBuilder.h"
	#include "llvm/IR/InlineAsm.h"			#include "llvm/IR/InlineAsm.h"
				#include "llvm/IR/IntrinsicsNVPTX.h"
	#include "llvm/IR/LLVMContext.h"			#include "llvm/IR/LLVMContext.h"
	#include "llvm/IR/MDBuilder.h"			#include "llvm/IR/MDBuilder.h"
	#include "llvm/IR/Module.h"			#include "llvm/IR/Module.h"
	#include "llvm/IR/Verifier.h"			#include "llvm/IR/Verifier.h"
	#include "llvm/Transforms/Utils/BasicBlockUtils.h"			#include "llvm/Transforms/Utils/BasicBlockUtils.h"
	#include "llvm/Transforms/Utils/Cloning.h"			#include "llvm/Transforms/Utils/Cloning.h"

	using namespace mlir;			using namespace mlir;
...
	llvm::Value *mlir::LLVM::detail::createIntrinsicCall(			llvm::Value *mlir::LLVM::detail::createIntrinsicCall(
	llvm::IRBuilderBase &builder, llvm::Intrinsic::ID intrinsic,			llvm::IRBuilderBase &builder, llvm::Intrinsic::ID intrinsic,
	ArrayRef<llvm::Value *> args, ArrayRef<llvm::Type *> tys) {			ArrayRef<llvm::Value *> args, ArrayRef<llvm::Type *> tys) {
	llvm::Module *module = builder.GetInsertBlock()->getModule();			llvm::Module *module = builder.GetInsertBlock()->getModule();
	llvm::Function *fn = llvm::Intrinsic::getDeclaration(module, intrinsic, tys);			llvm::Function *fn = llvm::Intrinsic::getDeclaration(module, intrinsic, tys);
	return builder.CreateCall(fn, args);			return builder.CreateCall(fn, args);
	}			}

				llvm::Value *
				mlir::LLVM::detail::createNvvmIntrinsicCall(llvm::IRBuilderBase &builder,
				llvm::Intrinsic::ID intrinsic,
				ArrayRef<llvm::Value *> args) {
				llvm::Module *module = builder.GetInsertBlock()->getModule();
				llvm::Function *fn;
				if (llvm::Intrinsic::isOverloaded(intrinsic)) {
				if (intrinsic != llvm::Intrinsic::nvvm_wmma_m16n16k16_mma_row_row_f16_f16 &&
				intrinsic != llvm::Intrinsic::nvvm_wmma_m16n16k16_mma_row_row_f32_f32) {
				// NVVM load and store intrinsic names are overloaded on the
				// source/destination pointer type. Pointer is the first argument in the
				// corresponding NVVM Op.
				fn = llvm::Intrinsic::getDeclaration(module, intrinsic,
				{args[0]->getType()});
				} else {
				fn = llvm::Intrinsic::getDeclaration(module, intrinsic, {});
				}
				} else {
				fn = llvm::Intrinsic::getDeclaration(module, intrinsic);
				}
				return builder.CreateCall(fn, args);
				}
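The branch structure of `createNvvmIntrinsicCall` can be read as a small decision about which overload types to pass to `getDeclaration`. The following is a minimal sketch of that decision only, assuming a hypothetical `IntrinsicInfo` struct and a hypothetical `selectOverloadTypes` helper with types spelled as strings; it does not call into LLVM and is not the code in this revision.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of an intrinsic. LLVM exposes this information
// through llvm::Intrinsic::isOverloaded and the intrinsic ID, not a struct.
struct IntrinsicInfo {
  bool overloaded; // what llvm::Intrinsic::isOverloaded would report
  bool isWmmaMma;  // true for the two wmma.m16n16k16.mma.row.row IDs
};

// Returns the overload types the helper would hand to getDeclaration, with
// each type spelled as a plain string for illustration.
std::vector<std::string>
selectOverloadTypes(const IntrinsicInfo &info,
                    const std::vector<std::string> &argTypes) {
  // Non-overloaded intrinsics and the mma intrinsics take no overload types.
  if (!info.overloaded || info.isWmmaMma)
    return {};
  // Load and store ops carry the pointer as their first operand, and the
  // intrinsic name is overloaded on that pointer type.
  assert(!argTypes.empty() && "expected a leading pointer operand");
  return {argTypes.front()};
}
```

Under these assumptions, a load with operand types `ptr<i32, 3>` and `i32` yields a single overload type (the pointer), while an mma intrinsic yields none.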

	/// Given a single MLIR operation, create the corresponding LLVM IR operation			/// Given a single MLIR operation, create the corresponding LLVM IR operation
	/// using the `builder`.			/// using the `builder`.
	LogicalResult			LogicalResult
	ModuleTranslation::convertOperation(Operation &op,			ModuleTranslation::convertOperation(Operation &op,
	llvm::IRBuilderBase &builder) {			llvm::IRBuilderBase &builder) {
	const LLVMTranslationDialectInterface *opIface = iface.getInterfaceFor(&op);			const LLVMTranslationDialectInterface *opIface = iface.getInterfaceFor(&op);
	if (!opIface)			if (!opIface)
	return op.emitError("cannot be converted to LLVM IR: missing "			return op.emitError("cannot be converted to LLVM IR: missing "

mlir/test/Dialect/GPU/invalid.mlir

	}			}

	// -----			// -----

	func @memcpy_incompatible_shape(%dst : memref<7xf32>, %src : memref<9xf32>) {			func @memcpy_incompatible_shape(%dst : memref<7xf32>, %src : memref<9xf32>) {
	// expected-error @+1 {{'gpu.memcpy' op arguments have incompatible shape}}			// expected-error @+1 {{'gpu.memcpy' op arguments have incompatible shape}}
	gpu.memcpy %dst, %src : memref<7xf32>, memref<9xf32>			gpu.memcpy %dst, %src : memref<7xf32>, memref<9xf32>
	}			}

				// -----

				func @mmamatrix_invalid_shape(){
				%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
				%i = constant 16 : index
				// expected-error @+1 {{MMAMatrixType must have exactly two dimensions}}
				%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16x16xf16, "AOp">
				return
				}
				ThomasRaoux: I think we should also have a positive test in `mlir/test/Dialect/GPU/ops.mlir`.

				// -----

				func @mmamatrix_operand_type(){
				%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
				%i = constant 16 : index
				// expected-error @+1 {{operand expected to be one of AOp, BOp, COp or DOp}}
				%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "EOp">
				return
				}

				// -----

				func @mmamatrix_invalid_element_type(){
				%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
				%i = constant 16 : index
				// expected-error @+1 {{MMAMatrixType elements must be F16 or F32}}
				%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xi32, "AOp">
				return
				}

				// -----

				#layout_map_col_major = affine_map<(i, j) -> (j, i)>

				func @mmaLoadOp_identity_layout(){
				%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, #layout_map_col_major, 3>
				%i = constant 16 : index
				// expected-error @+1 {{expected identity layout map for source memref}}
				%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, #layout_map_col_major, 3> -> !gpu.mma_matrix<16x16xf16, "AOp">
				return
				}

				// -----

				func @mmaLoadOp_invalid_mem_space(){
				%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 5>
				%i = constant 16 : index
				// expected-error @+1 {{source memorySpace kGenericMemorySpace, kSharedMemorySpace or kGlobalMemorySpace only allowed}}
				%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 5> -> !gpu.mma_matrix<16x16xf16, "AOp">
				return
				}

				// -----

				func @mmaLoadOp_operand_type(){
				%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
				%i = constant 16 : index
				// expected-error @+1 {{only AOp, BOp and COp can be loaded}}
				%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "DOp">
				return
				}

				// -----

				#layout_map_col_major = affine_map<(i, j) -> (j, i)>

				func @wmmaStoreOp_invalid_map(%arg0 : !gpu.mma_matrix<16x16xf16, "DOp">) -> () {
				%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, #layout_map_col_major, 3>
				%i = constant 16 : index
				%j = constant 16 : index
				// expected-error @+1 {{expected identity layout map for destination memref}}
				gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "DOp">, memref<32x32xf16,#layout_map_col_major, 3>
				return
				}

				// -----

				func @wmmaStoreOp_invalid_mem_space(%arg0 : !gpu.mma_matrix<16x16xf16, "DOp">) -> () {
				%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, 5>
				%i = constant 16 : index
				%j = constant 16 : index
				// expected-error @+1 {{destination memorySpace of kGenericMemorySpace, kGlobalMemorySpace or kSharedMemorySpace only allowed}}
				gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "DOp">, memref<32x32xf16, 5>
				return
				}

				// -----

				func @wmmaStoreOp_invalid_store_operand(%arg0 : !gpu.mma_matrix<16x16xf16, "AOp">) -> () {
				%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, 3>
				%i = constant 16 : index
				%j = constant 16 : index
				// expected-error @+1 {{expected the operand matrix being stored to have 'DOp' operand type}}
				gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "AOp">, memref<32x32xf16, 3>
				return
				}

				// -----

				func @wmmaMmaOp_invalid_operand_order(%A : !gpu.mma_matrix<16x16xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
				// expected-error @+1 {{operands must be in the order AOp, BOp, COp}}
				%D = gpu.subgroup_mma_compute %B, %A, %C : !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "DOp">
				return
				}

				// -----

				func @wmmaMmaOp_invalid_operand_shapes(%A : !gpu.mma_matrix<16x32xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
				// expected-error @+1 {{operand shapes do not satisfy matmul constraints}}
				%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x32xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "DOp">
				return
				}
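The negative tests above imply a set of verifier rules for `!gpu.mma_matrix` and `gpu.subgroup_mma_compute`. The sketch below restates those rules over plain data, reusing the diagnostic strings from the tests. `MmaMatrix`, `verifyMmaMatrix` and `verifyCompute` are hypothetical names, the order of the checks is an assumption, and the matmul rule (A is MxK, B is KxN, C is MxN) is the conventional reading of "matmul constraints" rather than something quoted from the verifier.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical plain-data stand-in for !gpu.mma_matrix<MxNxT, "Operand">.
struct MmaMatrix {
  std::vector<int64_t> shape;
  std::string elementType; // e.g. "f16", "f32", "i32"
  std::string operand;     // e.g. "AOp"
};

// Returns an empty string when the type is well formed, otherwise the
// diagnostic text the tests expect. The check order is an assumption.
std::string verifyMmaMatrix(const MmaMatrix &m) {
  if (m.shape.size() != 2)
    return "MMAMatrixType must have exactly two dimensions";
  if (m.operand != "AOp" && m.operand != "BOp" && m.operand != "COp" &&
      m.operand != "DOp")
    return "operand expected to be one of AOp, BOp, COp or DOp";
  if (m.elementType != "f16" && m.elementType != "f32")
    return "MMAMatrixType elements must be F16 or F32";
  return "";
}

// Checks operand order and the shape rule for D = A * B + C.
std::string verifyCompute(const MmaMatrix &a, const MmaMatrix &b,
                          const MmaMatrix &c) {
  assert(a.shape.size() == 2 && b.shape.size() == 2 && c.shape.size() == 2);
  if (a.operand != "AOp" || b.operand != "BOp" || c.operand != "COp")
    return "operands must be in the order AOp, BOp, COp";
  // A is MxK, B is KxN, C is MxN.
  if (a.shape[1] != b.shape[0] || a.shape[0] != c.shape[0] ||
      b.shape[1] != c.shape[1])
    return "operand shapes do not satisfy matmul constraints";
  return "";
}
```

With this model, the `16x32` A operand from `@wmmaMmaOp_invalid_operand_shapes` is rejected because its K dimension (32) does not match the 16 rows of B.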

mlir/test/Dialect/GPU/ops.mlir

func @memcpy(%dst : memref<3x7xf32>, %src : memref<3x7xf32, 1>) {		func @memcpy(%dst : memref<3x7xf32>, %src : memref<3x7xf32, 1>) {
// CHECK: gpu.memcpy {{.*}}, {{.*}} : memref<3x7xf32>, memref<3x7xf32, 1>		// CHECK: gpu.memcpy {{.*}}, {{.*}} : memref<3x7xf32>, memref<3x7xf32, 1>
gpu.memcpy %dst, %src : memref<3x7xf32>, memref<3x7xf32, 1>		gpu.memcpy %dst, %src : memref<3x7xf32>, memref<3x7xf32, 1>
// CHECK: %[[t0:.*]] = gpu.wait async		// CHECK: %[[t0:.*]] = gpu.wait async
%0 = gpu.wait async		%0 = gpu.wait async
// CHECK: {{.*}} = gpu.memcpy async [%[[t0]]] {{.*}}, {{.*}} : memref<3x7xf32>, memref<3x7xf32, 1>		// CHECK: {{.*}} = gpu.memcpy async [%[[t0]]] {{.*}}, {{.*}} : memref<3x7xf32>, memref<3x7xf32, 1>
%1 = gpu.memcpy async [%0] %dst, %src : memref<3x7xf32>, memref<3x7xf32, 1>		%1 = gpu.memcpy async [%0] %dst, %src : memref<3x7xf32>, memref<3x7xf32, 1>
return		return
}		}

		func @mmamatrix_valid_element_type(){
		// CHECK-LABEL: func @mmamatrix_valid_element_type
		%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
		// CHECK: %[[wg:.*]] = memref.alloca()
		%i = constant 16 : index
		// CHECK: %[[i:.*]] = constant 16 : index
		%0 = gpu.subgroup_mma_load_matrix %wg[%i, %i] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "AOp">
		// CHECK: gpu.subgroup_mma_load_matrix %[[wg]][%[[i]], %[[i]]] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "AOp">
		return
		}
}		}

mlir/test/Dialect/LLVMIR/invalid.mlir

llvm.func @accessGroups(%arg0 : !llvm.ptr<i32>) {		llvm.func @accessGroups(%arg0 : !llvm.ptr<i32>) {
// expected-error@below {{expected '@metadata' to reference an access_group op}}		// expected-error@below {{expected '@metadata' to reference an access_group op}}
%0 = llvm.load %arg0 { "access_groups" = [@metadata] } : !llvm.ptr<i32>		%0 = llvm.load %arg0 { "access_groups" = [@metadata] } : !llvm.ptr<i32>
llvm.return		llvm.return
}		}
llvm.metadata @metadata {		llvm.metadata @metadata {
llvm.return		llvm.return
}		}
}		}

		// -----

		llvm.func @wmmaLoadOp_invalid_mem_space(%arg0: !llvm.ptr<i32, 5>, %arg1: i32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.load.a.f16.row.stride' op expected operands to be a source pointer in memory space 0, 1, 3 followed by ldm of the source}}
		%0 = nvvm.wmma.m16n16k16.load.a.f16.row.stride %arg0, %arg1 : (!llvm.ptr<i32, 5>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>

		llvm.return
		}

		// -----

		llvm.func @wmmaLoadOp_invalid_missing_ldm(%arg0: !llvm.ptr<i32, 3>, %arg1: i32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.load.a.f16.row.stride' op expected operands to be a source pointer in memory space 0, 1, 3 followed by ldm of the source}}
		%0 = nvvm.wmma.m16n16k16.load.a.f16.row.stride %arg0: (!llvm.ptr<i32, 3>) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>

		llvm.return
		}

		// -----

		llvm.func @wmmaLoadOp_invalid_AOp(%arg0: !llvm.ptr<i32, 3>, %arg1: i32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.load.a.f16.row.stride' op expected result type of loadAOp and loadBOp to be a struct of 8 <halfx2>s}}
		%0 = nvvm.wmma.m16n16k16.load.a.f16.row.stride %arg0, %arg1 : (!llvm.ptr<i32, 3>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>

		llvm.return
		}

		// -----

		llvm.func @wmmaLoadOp_invalid_AOp(%arg0: !llvm.ptr<i32, 3>, %arg1: i32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.load.a.f16.row.stride' op expected result type of loadAOp and loadBOp to be a struct of 8 <halfx2>s}}
		%0 = nvvm.wmma.m16n16k16.load.a.f16.row.stride %arg0, %arg1 : (!llvm.ptr<i32, 3>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>

		llvm.return
		}

		// -----

		llvm.func @wmmaLoadOp_invalid_BOp(%arg0: !llvm.ptr<i32, 3>, %arg1: i32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.load.b.f16.row.stride' op expected result type of loadAOp and loadBOp to be a struct of 8 <halfx2>s}}
		%0 = nvvm.wmma.m16n16k16.load.b.f16.row.stride %arg0, %arg1 : (!llvm.ptr<i32, 3>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>

		llvm.return
		}

		// -----

		llvm.func @wmmaLoadOp_invalid_COp(%arg0: !llvm.ptr<i32, 3>, %arg1: i32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.load.c.f16.row.stride' op expected result type of loadCOp to be a struct of 4 <halfx2>s or 8 f32s}}
		%0 = nvvm.wmma.m16n16k16.load.c.f16.row.stride %arg0, %arg1 : (!llvm.ptr<i32, 3>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>)>

		llvm.return
		}

		// -----

		llvm.func @wmmaStoreOp_invalid_mem_space(%arg0: !llvm.ptr<i32, 5>, %arg1: vector<2 x f16>,
		%arg2: vector<2 x f16>, %arg3: vector<2 x f16>,
		%arg4: vector<2 xf16>, %arg5: i32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.store.d.f16.row.stride' op expected operands to be a source pointer in memoryspace 0, 1, 3 followed by ldm of the source}}
		nvvm.wmma.m16n16k16.store.d.f16.row.stride %arg0, %arg1, %arg2, %arg3, %arg4, %arg5 : !llvm.ptr<i32, 5>, vector<2 x f16>, vector<2 x f16>, vector<2 x f16>, vector<2 x f16>, i32
		llvm.return
		}

		// -----

		llvm.func @wmmaStoreOp_invalid_missing_ldm(%arg0: !llvm.ptr<i32, 3>, %arg1: vector<2 x f16>,
		%arg2: vector<2 x f16>, %arg3: vector<2 x f16>,
		%arg4: vector<2 xf16>, %arg5: i32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.store.d.f16.row.stride' op expected operands to be a source pointer in memoryspace 0, 1, 3 followed by ldm of the source}}
		nvvm.wmma.m16n16k16.store.d.f16.row.stride %arg0, %arg1, %arg2, %arg3, %arg4 : !llvm.ptr<i32, 3>, vector<2 x f16>, vector<2 x f16>, vector<2 x f16>, vector<2 x f16>
		llvm.return
		}

		// -----

		llvm.func @gpu_wmma_mma_op_invalid_operands(%arg0: vector<2 x f16>, %arg1: vector<2 x f16>,
		%arg2: vector<2 x f16>, %arg3: vector<2 x f16>,
		%arg4: vector<2 x f16>, %arg5: vector<2 x f16>,
		%arg6: vector<2 x f16>, %arg7: vector<2 x f16>,
		%arg8: vector<2 x f16>, %arg9: vector<2 x f16>,
		%arg10: vector<2 x f16>, %arg11: vector<2 x f16>,
		%arg12: vector<2 x f16>, %arg13: vector<2 x f16>,
		%arg14: vector<2 x f16>, %arg15: vector<2 x f16>,
		%arg16: vector<2 x f16>, %arg17: vector<2 x f16>,
		%arg18: vector<2 x f16>) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.mma.row.row.f16.f16' op expected 20 <halfx2>s as operands}}
		%0 = nvvm.wmma.m16n16k16.mma.row.row.f16.f16 %arg0, %arg1, %arg2, %arg3, %arg4, %arg5, %arg6, %arg7, %arg8, %arg9, %arg10, %arg11, %arg12, %arg13, %arg14, %arg15, %arg16, %arg17, %arg18 : vector<2 x f16> -> !llvm.struct<(vector<2 x f16>, vector<2 x f16>, vector<2 x f16>, vector<2 x f16>)>
		llvm.return
		}

		// -----

		llvm.func @gpu_wmma_mma_op_results(%arg0: vector<2 x f16>, %arg1: vector<2 x f16>,
		%arg2: vector<2 x f16>, %arg3: vector<2 x f16>,
		%arg4: vector<2 x f16>, %arg5: vector<2 x f16>,
		%arg6: vector<2 x f16>, %arg7: vector<2 x f16>,
		%arg8: vector<2 x f16>, %arg9: vector<2 x f16>,
		%arg10: vector<2 x f16>, %arg11: vector<2 x f16>,
		%arg12: vector<2 x f16>, %arg13: vector<2 x f16>,
		%arg14: vector<2 x f16>, %arg15: vector<2 x f16>,
		%arg16: vector<2 x f16>, %arg17: vector<2 x f16>,
		%arg18: vector<2 x f16>, %arg19: vector<2 x f16>) {
		// expected-error@+1 {{expected result type to be a struct of 4 <halfx2>s}}
		%0 = nvvm.wmma.m16n16k16.mma.row.row.f16.f16 %arg0, %arg1, %arg2, %arg3, %arg4, %arg5, %arg6, %arg7, %arg8, %arg9, %arg10, %arg11, %arg12, %arg13, %arg14, %arg15, %arg16, %arg17, %arg18, %arg19 : vector<2 x f16> -> !llvm.struct<(vector<2 x f16>, vector<2 x f16>, vector<2 x f16>)>
		llvm.return
		}

		// -----

		llvm.func @gpu_wmma_mma_op_invalid_ab_operands(%arg0: vector<2 x f16>, %arg1: vector<2 x f16>,
		%arg2: vector<2 x f16>, %arg3: vector<2 x f16>,
		%arg4: vector<2 x f16>, %arg5: vector<2 x f16>,
		%arg6: vector<2 x f16>, %arg7: vector<2 x f16>,
		%arg8: vector<2 x f16>, %arg9: vector<2 x f16>,
		%arg10: vector<2 x f16>, %arg11: vector<2 x f16>,
		%arg12: vector<2 x f16>, %arg13: vector<2 x f16>,
		%arg14: vector<2 x f16>, %arg15: f32,
		%arg16: f32, %arg17: f32, %arg18: f32, %arg19: f32,
		%arg20: f32, %arg21: f32, %arg22: f32, %arg23: f32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.mma.row.row.f32.f32' op expected 16 <halfx2>s for `a` and `b` operand}}
		%0 = nvvm.wmma.m16n16k16.mma.row.row.f32.f32 %arg0, %arg1, %arg2, %arg3, %arg4, %arg5, %arg6, %arg7, %arg8, %arg9, %arg10, %arg11, %arg12, %arg13, %arg14, %arg15, %arg16, %arg17, %arg18, %arg19, %arg20, %arg21, %arg22, %arg23 : (vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, f32, f32, f32, f32, f32, f32, f32, f32, f32) -> !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
		llvm.return
		}

		// -----

		llvm.func @gpu_wmma_mma_op_invalid_c_operand(%arg0: vector<2 x f16>, %arg1: vector<2 x f16>,
		%arg2: vector<2 x f16>, %arg3: vector<2 x f16>,
		%arg4: vector<2 x f16>, %arg5: vector<2 x f16>,
		%arg6: vector<2 x f16>, %arg7: vector<2 x f16>,
		%arg8: vector<2 x f16>, %arg9: vector<2 x f16>,
		%arg10: vector<2 x f16>, %arg11: vector<2 x f16>,
		%arg12: vector<2 x f16>, %arg13: vector<2 x f16>,
		%arg14: vector<2 x f16>, %arg15: vector<2xf16>,
		%arg16: f32, %arg17: f32, %arg18: f32, %arg19: f32,
		%arg20: f32, %arg21: f32, %arg22: f32, %arg23: vector<2xf16>) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.mma.row.row.f32.f32' op expected 8 f32s for `c` operand}}
		%0 = nvvm.wmma.m16n16k16.mma.row.row.f32.f32 %arg0, %arg1, %arg2, %arg3, %arg4, %arg5, %arg6, %arg7, %arg8, %arg9, %arg10, %arg11, %arg12, %arg13, %arg14, %arg15, %arg16, %arg17, %arg18, %arg19, %arg20, %arg21, %arg22, %arg23 : (vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, f32, f32, f32, f32, f32, f32, f32, vector<2xf16>) -> !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
		llvm.return
		}

		// -----

		llvm.func @gpu_wmma_mma_op_invalid_result(%arg0: vector<2 x f16>, %arg1: vector<2 x f16>,
		%arg2: vector<2 x f16>, %arg3: vector<2 x f16>,
		%arg4: vector<2 x f16>, %arg5: vector<2 x f16>,
		%arg6: vector<2 x f16>, %arg7: vector<2 x f16>,
		%arg8: vector<2 x f16>, %arg9: vector<2 x f16>,
		%arg10: vector<2 x f16>, %arg11: vector<2 x f16>,
		%arg12: vector<2 x f16>, %arg13: vector<2 x f16>,
		%arg14: vector<2 x f16>, %arg15: vector<2xf16>,
		%arg16: f32, %arg17: f32, %arg18: f32, %arg19: f32,
		%arg20: f32, %arg21: f32, %arg22: f32, %arg23: f32) {
		// expected-error@+1 {{'nvvm.wmma.m16n16k16.mma.row.row.f32.f32' op expected result type to be a struct of 8 f32s}}
		%0 = nvvm.wmma.m16n16k16.mma.row.row.f32.f32 %arg0, %arg1, %arg2, %arg3, %arg4, %arg5, %arg6, %arg7, %arg8, %arg9, %arg10, %arg11, %arg12, %arg13, %arg14, %arg15, %arg16, %arg17, %arg18, %arg19, %arg20, %arg21, %arg22, %arg23 : (vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, f32, f32, f32, f32, f32, f32, f32, f32) -> !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, vector<2xf16>)>
		llvm.return
		}
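The mma tests above encode fixed operand and result counts: the `f16.f16` variant takes 20 `<halfx2>` operands (8 for `a`, 8 for `b`, 4 for `c`) and returns 4, while the `f32.f32` variant takes 16 `<halfx2>` operands followed by 8 `f32` values and returns 8 `f32` values. A minimal sketch of such a check follows, reusing the diagnostic strings from the tests. The `Ty` enum, `allOf` and `verifyWmmaMma` are hypothetical names, and the treatment of operand lists with an unexpected length in the `f32` case is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical tags for the two value types the wmma mma ops accept.
enum class Ty { HalfX2, F32 };

// True when every element of v in [begin, end) equals want.
static bool allOf(const std::vector<Ty> &v, std::size_t begin,
                  std::size_t end, Ty want) {
  assert(end <= v.size() && "range must lie inside the vector");
  for (std::size_t i = begin; i < end; ++i)
    if (v[i] != want)
      return false;
  return true;
}

// Returns "" on success or the diagnostic text otherwise. f32Accum selects
// the f32.f32 variant; false selects the f16.f16 variant.
std::string verifyWmmaMma(bool f32Accum, const std::vector<Ty> &operands,
                          const std::vector<Ty> &results) {
  if (!f32Accum) {
    if (operands.size() != 20 || !allOf(operands, 0, 20, Ty::HalfX2))
      return "expected 20 <halfx2>s as operands";
    if (results.size() != 4 || !allOf(results, 0, 4, Ty::HalfX2))
      return "expected result type to be a struct of 4 <halfx2>s";
    return "";
  }
  // Assumption: too few operands is reported against `a` and `b` first.
  if (operands.size() < 16 || !allOf(operands, 0, 16, Ty::HalfX2))
    return "expected 16 <halfx2>s for `a` and `b` operand";
  if (operands.size() != 24 || !allOf(operands, 16, 24, Ty::F32))
    return "expected 8 f32s for `c` operand";
  if (results.size() != 8 || !allOf(results, 0, 8, Ty::F32))
    return "expected result type to be a struct of 8 f32s";
  return "";
}
```

The size comparison precedes each `allOf` call, so the range passed to `allOf` is always in bounds.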

mlir/test/Target/LLVMIR/nvvmir.mlir

llvm.func @nvvm_mma(%a0 : vector<2xf16>, %a1 : vector<2xf16>,		llvm.func @nvvm_mma(%a0 : vector<2xf16>, %a1 : vector<2xf16>,
%b0 : vector<2xf16>, %b1 : vector<2xf16>,		%b0 : vector<2xf16>, %b1 : vector<2xf16>,
%c0 : f32, %c1 : f32, %c2 : f32, %c3 : f32,		%c0 : f32, %c1 : f32, %c2 : f32, %c3 : f32,
%c4 : f32, %c5 : f32, %c6 : f32, %c7 : f32) -> !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)> {		%c4 : f32, %c5 : f32, %c6 : f32, %c7 : f32) -> !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)> {
// CHECK: call { float, float, float, float, float, float, float, float } @llvm.nvvm.mma.m8n8k4.row.col.f32.f32		// CHECK: call { float, float, float, float, float, float, float, float } @llvm.nvvm.mma.m8n8k4.row.col.f32.f32
%0 = nvvm.mma.sync %a0, %a1, %b0, %b1, %c0, %c1, %c2, %c3, %c4, %c5, %c6, %c7 {alayout="row", blayout="col"} : (vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, f32, f32, f32, f32, f32, f32, f32, f32) -> !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>		%0 = nvvm.mma.sync %a0, %a1, %b0, %b1, %c0, %c1, %c2, %c3, %c4, %c5, %c6, %c7 {alayout="row", blayout="col"} : (vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, f32, f32, f32, f32, f32, f32, f32, f32) -> !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
llvm.return %0 : !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>		llvm.return %0 : !llvm.struct<(f32, f32, f32, f32, f32, f32, f32, f32)>
}		}

		// The test below checks the correct mapping of the nvvm.wmma.*.load.* op to the correct intrinsic
		// in the LLVM NVPTX backend.
		llvm.func @gpu_wmma_load_op(%arg0: !llvm.ptr<i32, 3>, %arg1: i32) {
		// CHECK: call { <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half>, <2 x half> } @llvm.nvvm.wmma.m16n16k16.load.a.row.stride.f16.p3i32(i32 addrspace(3)* %{{.*}}, i32 %{{.*}})
		%0 = nvvm.wmma.m16n16k16.load.a.f16.row.stride %arg0, %arg1 : (!llvm.ptr<i32, 3>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>

		llvm.return
		}

		// The test below checks the correct mapping of the nvvm.wmma.*.store.* op to the correct intrinsic
		// in the LLVM NVPTX backend.
		llvm.func @gpu_wmma_store_op(%arg0: !llvm.ptr<i32, 3>, %arg1: vector<2 x f16>,
		%arg2: vector<2 x f16>, %arg3: vector<2 x f16>,
		%arg4: vector<2 xf16>, %arg5: i32) {
		// CHECK: call void @llvm.nvvm.wmma.m16n16k16.store.d.row.stride.f16.p3i32(i32 addrspace(3)* %{{.*}}, <2 x half> {{.*}}, <2 x half> %{{.*}}, <2 x half> %{{.*}}, <2 x half> %{{.*}}, i32 %{{.*}})
		nvvm.wmma.m16n16k16.store.d.f16.row.stride %arg0, %arg1, %arg2, %arg3, %arg4, %arg5 : !llvm.ptr<i32, 3>, vector<2 x f16>, vector<2 x f16>, vector<2 x f16>, vector<2 x f16>, i32
		llvm.return
		}

		// The test below checks the correct mapping of the nvvm.wmma.*.mma.* op to the correct intrinsic
		// in the LLVM NVPTX backend.
		llvm.func @gpu_wmma_mma_op(%arg0: vector<2 x f16>, %arg1: vector<2 x f16>,
		%arg2: vector<2 x f16>, %arg3: vector<2 x f16>,
		%arg4: vector<2 x f16>, %arg5: vector<2 x f16>,
		%arg6: vector<2 x f16>, %arg7: vector<2 x f16>,
		%arg8: vector<2 x f16>, %arg9: vector<2 x f16>,
		%arg10: vector<2 x f16>, %arg11: vector<2 x f16>,
		%arg12: vector<2 x f16>, %arg13: vector<2 x f16>,
		%arg14: vector<2 x f16>, %arg15: vector<2 x f16>,
		%arg16: vector<2 x f16>, %arg17: vector<2 x f16>,
		%arg18: vector<2 x f16>, %arg19: vector<2 x f16>) {
		// CHECK: call { <2 x half>, <2 x half>, <2 x half>, <2 x half> } @llvm.nvvm.wmma.m16n16k16.mma.row.row.f16.f16(<2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}}, <2 x half> {{.}})
		%0 = nvvm.wmma.m16n16k16.mma.row.row.f16.f16 %arg0, %arg1, %arg2, %arg3, %arg4, %arg5, %arg6, %arg7, %arg8, %arg9, %arg10, %arg11, %arg12, %arg13, %arg14, %arg15, %arg16, %arg17, %arg18, %arg19 : vector<2 x f16> -> !llvm.struct<(vector<2 x f16>, vector<2 x f16>, vector<2 x f16>, vector<2 x f16>)>

		llvm.return
		}

// This function has the "kernel" attribute attached and should appear in the		// This function has the "kernel" attribute attached and should appear in the
// NVVM annotations after conversion.		// NVVM annotations after conversion.
llvm.func @kernel_func() attributes {nvvm.kernel} {		llvm.func @kernel_func() attributes {nvvm.kernel} {
llvm.return		llvm.return
}		}

// CHECK: !nvvm.annotations =		// CHECK: !nvvm.annotations =
// CHECK-NOT: {i32 ()* @nvvm_special_regs, !"kernel", i32 1}		// CHECK-NOT: {i32 ()* @nvvm_special_regs, !"kernel", i32 1}
// CHECK: {void ()* @kernel_func, !"kernel", i32 1}		// CHECK: {void ()* @kernel_func, !"kernel", i32 1}
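The wmma load and store CHECK lines in `nvvmir.mlir` expect declaration names that end in `.p3i32`, which reflects the overload on the pointer operand: a `.p` followed by the address space and the pointee type. The sketch below only reproduces that suffix as it appears in the tests, for typed pointers. `mangleWmmaName` is a hypothetical helper, not an LLVM API.

```cpp
#include <cassert>
#include <string>

// Builds the declaration name seen in the CHECK lines: the base intrinsic
// name followed by ".p<address space><pointee type>". This mirrors the
// typed-pointer spelling in the tests and nothing more.
std::string mangleWmmaName(const std::string &base, unsigned addrSpace,
                           const std::string &pointee) {
  assert(!base.empty() && !pointee.empty());
  return base + ".p" + std::to_string(addrSpace) + pointee;
}
```

For the load test, the base name `llvm.nvvm.wmma.m16n16k16.load.a.row.stride.f16` with address space 3 and pointee `i32` gives the name matched by the CHECK line. The mma intrinsic carries no pointer operand, so its expected name has no such suffix.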

Diff 343297
