This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
mlir/
-
include/mlir/Dialect/GPU/
-
mlir/
-
Dialect/
-
GPU/
-
GPUBase.td
1/1
GPUDialect.h
8/11
GPUOps.td
-
lib/
-
Conversion/GPUToNVVM/
-
GPUToNVVM/
4/4
LowerGpuOpsToNVVMOps.cpp
-
Dialect/GPU/IR/
-
GPU/
-
IR/
3/5
GPUDialect.cpp
-
test/
-
Conversion/GPUToNVVM/
-
GPUToNVVM/
-
gpu-to-nvvm.mlir
-
Dialect/GPU/
-
GPU/
-
invalid.mlir
-
ops.mlir

Differential D119191

[mlir][gpu] Add device side async copy operations
ClosedPublic

Authored by ThomasRaoux on Feb 7 2022, 2:22 PM.

Download Raw Diff

Details

Reviewers

mravishankar
herhut
antiagainst
bondhugula
ftynse

Commits

rG5ab04bc068d5: [mlir][gpu] Add device side async copy operations

Summary

Add new operations to the gpu dialect to represent device side
asynchronous copies. This also add the lowering of those operations to
nvvm dialect.
Those ops are meant to be low level and map directly to llvm dialects
like nvvm or rocdl.

We can further add higher level of abstraction by building on top of
those operations.
This has been discuss here:
https://discourse.llvm.org/t/modeling-gpu-async-copy-ampere-feature/4924

Diff Detail

Event Timeline

ThomasRaoux created this revision.Feb 7 2022, 2:22 PM

Herald added a reviewer: bondhugula. · View Herald TranscriptFeb 7 2022, 2:22 PM

Herald added subscribers: sdasgup3, wenzhicui, wrengr and 20 others. · View Herald Transcript

ThomasRaoux requested review of this revision.Feb 7 2022, 2:22 PM

Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2022, 2:22 PM

Herald added subscribers: stephenneuendorffer, nicolasvasilache. · View Herald Transcript

Harbormaster completed remote builds in B148099: Diff 406609.Feb 7 2022, 6:12 PM

Great to see this -- some comments.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
49	Doc comment here.
mlir/include/mlir/Dialect/GPU/GPUOps.td
1230	Nit: device-side
1232	`cp` -> `copy` to be consistent with other op names like `memref.copy` etc.
1232	operation
1237	"commit the copy" s confusing here. "commit" is often used synonymously with "completion". The name `device_async_commit` is also missing a connection to the group in the name. How about `device_aysnc_attach_to_group`, `device_aysnc_add_to_group`. Even `device_async_commit_to_group` will be clearer here. Is `commit` motivated by any lower-level naming on the NVVM side?
1237–1239	From this, it isn't entirely clear if just the latter `gpu.device_async_wait` will suffice. You could just reference these respective ops' doc instead of reproducing it here. I think the confusion is mainly due to the naming.
1267	This is ...
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
125	`Lowering` prefix to be consistent.
176	Likewise.
191–194	This has a single user --- just inline it unless you want to expose this publicly
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
119–122	Nit: Sorted order.

Address review comments

Thanks for the review @bondhugula. I tried to improve the op naming and description, let me know what you think.

mlir/include/mlir/Dialect/GPU/GPUOps.td
1237	That's a good point, yes commit was picked because of no better name. Thinking more about this I think it should be called `device_async_create_group` as this is the op that takes all the copies and put them into a group. Let me know if you think this makes sense or if you prefer `device_aysnc_attach_to_group`.
1237–1239	Good point, I changed the description to just mention reference the other ops.

Remove missed dead code

Harbormaster completed remote builds in B148160: Diff 406695.Feb 7 2022, 10:43 PM

As modeled I dont have any blockers, but IMO without the higher level ops (which should be the load bearing ops), these ops are going to make transformations jump directly to something that is essentially LLVM. I'd strongly suggest that the transformations dont directly target these "low level ops", cause then it would be harder to back track form there. Ill take a look again for details tomorrow, but leaving comments for now.

mlir/include/mlir/Dialect/GPU/GPUOps.td
1259	Maybe this is what you mean by "lower level ops", but I would really like to see a token returned by the copy operation that is used to create the groups. These are relying on textual order of things that makes me very uncomfortable. I see why they are needed this way cause NVIDIA hardware do it this way.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1202	Using `3` here is taking an opnion of shared memory being mapped to memory space 3. I think its better the GPU dialect just take the opinion that 3 maps to shared memory, create an enum type that does this mapping and use that. instead of hard-coding such numbers. For a later patch maybe, but this technical debt will only grow.

remove hardcoded number

ThomasRaoux added inline comments.Feb 8 2022, 8:47 PM

mlir/include/mlir/Dialect/GPU/GPUOps.td
1259	It would have to lowered to this representation though otherwise the gap to lower to llvm is to large.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1202	removed, I meant to remove it and missed it.

In D119191#3306729, @mravishankar wrote:

As modeled I dont have any blockers, but IMO without the higher level ops (which should be the load bearing ops), these ops are going to make transformations jump directly to something that is essentially LLVM. I'd strongly suggest that the transformations dont directly target these "low level ops", cause then it would be harder to back track form there. Ill take a look again for details tomorrow, but leaving comments for now.

Finding the right level of abstraction is a challenge as discussed. That's why I am starting with the layers that I think are needed no matter what. If I understand well you are suggesting having a level where the copy would directly feed tokens into the wait. This probably make sense but it is only slightly lower level than this representation:

token1 = copy
token2 = copy
....
wait token1, token2

would get transformed:

copy
copy
...
token = create_group
wait token

Your concern is that we rely on side effect to enforce the dependency between copy and create_group, I think this is a legit concern but it is not clear to me that the first representation is worth having another level of abstraction. Since I believe the representation in this patch will be needed I think it makes sense to move forward. Let me know what you think.

While gpu.device_async_create_group and gpu.device_async_wait are fairly straightforward in how they are paired together, there is a non-trivial amount of implicit global state associated to device_async_copy and device_async_wait_group.

Let me see if I infer correctly an abstract model of this, we need a global thread_local bag of groups and a global thread_local list of "copies not yet added to a group".
From there I can see how to describe the existing ops in terms of read/write effects to these globals I guess.

I'm still not sure about two things right now:

how to target the gpu.device_async_wait_group or why it is exposed the way it is? For example, I have a hard time understanding the composability of this op: entering a function I may not know how many pending groups are in flight and then it's not clear to me how to pick a number here?
the constraint this puts on the ordering: it seems that for example because of the implicit state between a copy and a create_group, we can't reschedule things. For example in this case:

src2 = ...
...
src1 = ...
gpu.device_async_copy %src1
%token = gpu.device_async_create_group
gpu.device_async_copy %src2
%token2 = gpu.device_async_create_group

It would be interesting to schedule the copy from src2 earlier since src2 is available before src1 somehow, but doing so would require to move the two operations together (gpu.device_async_copy %src2 and %token2 = gpu.device_async_create_group).

Harbormaster completed remote builds in B148409: Diff 407039.Feb 8 2022, 9:33 PM

In D119191#3306806, @ThomasRaoux wrote:
token1 = copy
token2 = copy
....
wait token1, token2
would get transformed:
copy
copy
...
token = create_group
wait token

I don't think this is correct (in the absolute): what if your ... is also using some async? For example:

token1 = copy
token2 = copy
...
token3 = copy
...
wait token1, token2
...
wait token3

In D119191#3306831, @mehdi_amini wrote:

While gpu.device_async_create_group and gpu.device_async_wait are fairly straightforward in how they are paired together, there is a non-trivial amount of implicit global state associated to device_async_copy and device_async_wait_group.

Let me see if I infer correctly an abstract model of this, we need a global thread_local bag of groups and a global thread_local list of "copies not yet added to a group".
From there I can see how to describe the existing ops in terms of read/write effects to these globals I guess.

Yes this is correct, and those are modeled by side effect in this case.

I'm still not sure about two things right now:

how to target the gpu.device_async_wait_group or why it is exposed the way it is? For example, I have a hard time understanding the composability of this op: entering a function I may not know how many pending groups are in flight and then it's not clear to me how to pick a number here?

gpu.device_async_wait_group is there to have a straight forward way to lower to nvvm kind of semantic. The way it would be generated is by lowering from gpu.device_async_wait and walking all the potential paths going from this op to the associated gpu.device_async_create_group and finding the minimum number of other gpu.device_async_create_group in those paths. Of course picking a lower value than needed is always correct so this would have to be a conservative transformation. There are alternative solutions, for instance cuda compiler creates a big switch case when it cannot resolve statically the number of workgroup and just use a dynamic counter. I think we could have several strategy of lowering but this stage has to be done no matter what.

the constraint this puts on the ordering: it seems that for example because of the implicit state between a copy and a create_group, we can't reschedule things. For example in this case:
src2 = ...
...
src1 = ...
gpu.device_async_copy %src1
%token = gpu.device_async_create_group
gpu.device_async_copy %src2
%token2 = gpu.device_async_create_group
It would be interesting to schedule the copy from src2 earlier since src2 is available before src1 somehow, but doing so would require to move the two operations together (gpu.device_async_copy %src2 and %token2 = gpu.device_async_create_group).

Correct this would not be possible and would have to be done at a higher level of abstraction. The main thing missing in this representation is how the group are created, once the groups are set it is too late to do such re-ordering. That being said in general you would want to put consecutive copy without interleave work in the same group so you shouldn't have to do the re-ordering.

The reason why I have those representations is because what I need is:
-a representation to apply pipelining transformation on. It should have the groups picked and I have an explicit ssa link between the group and the wait to know directly how they are linked.
-A representation after pipelining and before convertToNVVM where each wait knows the max number of groups under progress to unblock the thread.

In D119191#3306837, @mehdi_amini wrote:
In D119191#3306806, @ThomasRaoux wrote:
token1 = copy
token2 = copy
....
wait token1, token2
would get transformed:
copy
copy
...
token = create_group
wait token
I don't think this is correct (in the absolute): what if your ... is also using some async? For example:
token1 = copy
token2 = copy
...
token3 = copy
...
wait token1, token2
...
wait token3

True, in this case the group creating may not be trivial which is one problem I tried to avoid in the representation. In this example the resprentation is still simple, if we have something like below it becomes complicated:

token1 = copy
...
token3 = copy
...
token2 = copy
...
wait token1, token2
...
wait token3

In D119191#3306854, @ThomasRaoux wrote:

gpu.device_async_wait_group is there to have a straight forward way to lower to nvvm kind of semantic. The way it would be generated is by lowering from gpu.device_async_wait and walking all the potential paths going from this op to the associated gpu.device_async_create_group and finding the minimum number of other gpu.device_async_create_group in those paths.

I actually missed in the description: "Groups are executed the order they are created", this makes sense now :)

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

Two concrete questions:

How does this work in presence of control flow (i.e., copies in different blocks that branch to a single block with create_group, and copies in a block that branches to two blocks each with own create_group?
Why require byte size in the copy instead of a subview/subset of memref indices to copy? Specifically, this requires the caller to compute the byte size of the type rather than having the compiler compute it based on the element type. And this lets one read data that is not normally indexed by the memref, e.g., an 8-byte read from a non-contiguous f32 memref with innermost stride 2 will access the memory that cannot be read by a memref.load (4 bytes after the first element), which sounds very annoying from aliasing perspective.

mlir/include/mlir/Dialect/GPU/GPUOps.td
1263
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
149	Nit: please use a constant / helper function instead of the magic `1` here.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1202	Nit: can this variable be called `workgroupAddressSpace` to be consistent with the terminology?

ftynse added a reviewer: ftynse.Feb 9 2022, 1:29 AM

Add extra description and address comments on naming

In D119191#3307203, @ftynse wrote:

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Two concrete questions:

How does this work in presence of control flow (i.e., copies in different blocks that branch to a single block with create_group, and copies in a block that branches to two blocks each with own create_group?

It works as if there were an implicit queue of pending copies and an implicit queue of group submitted. So the create_group op would make a group with copies pending from whatever path was taken.

Why require byte size in the copy instead of a subview/subset of memref indices to copy? Specifically, this requires the caller to compute the byte size of the type rather than having the compiler compute it based on the element type. And this lets one read data that is not normally indexed by the memref, e.g., an 8-byte read from a non-contiguous f32 memref with innermost stride 2 will access the memory that cannot be read by a memref.load (4 bytes after the first element), which sounds very annoying from aliasing perspective.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

Harbormaster completed remote builds in B148514: Diff 407203.Feb 9 2022, 11:46 AM

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks! IMO, there are bigger abstraction gaps that are covered by lowerings, but I don't have a strong opinion here.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

I don't follow here. Neither memref.load/store nor vector.load/store require explicit byte size. The former accesses only one element. The latter has a requirement on the innermost dimension having stride 1, so reads are always contiguous, but their "size" is provided as the number of elements to read (in the vector type of the result). Having byte size is especially problematic because it duplicates (and therefore will end up contradicting) the data layout information. I would rather take the number of elements or reused the vector type somehow. Plus a verifier for a memref stride being either dynamic or static 1.

In D119191#3308431, @ThomasRaoux wrote:

In D119191#3307203, @ftynse wrote:

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks for the examples. One option of this is to have only the "higher level" abstractions in the GPU dialect, and have the ops that reflect the intrinsic in NVVM directly stay in NVVM dialect. You can have a staged lowering of operations when lowering to NVVM, and lower from the higher level of abstraction in GPU dialect to the lower level of abstraction in NVVM dialect. Basically, these are exactly mirroring the ops in NVVM dialect, and are redundant for the most part. I cant really see why lowering to these are better than using the ops from NVVM dialect directly. Also, these ops are very NVVM specific. I dont know how these ops can be supported when lowering to SPIR-V (its always a useful check point to say what about SPIR-V when adding things to GPU dialect).
The above comments seems to suggest I have a strong opinion, but I am mostly on the fence on this (albeit slightly negative) and lot of this is originating from how these ops are exposed in NVVM/PTX. My over-arching concern is that if we have transformations that target these directly, we cant back track (and I can almost guarantee these ops specifications will change in different versions of CUDA, so we might be hard-coding behavior that is transient, and ill-formed). If we today have code that generate NVVM dialect ops directly and those have to change as the NVVM intrinsics evolve, that makes more sense to me (its a one-off we need performance effort and wont be long term stable). When targeting GPU dialect, I'd expect something more stable.

Two concrete questions:

How does this work in presence of control flow (i.e., copies in different blocks that branch to a single block with create_group, and copies in a block that branches to two blocks each with own create_group?

It works as if there were an implicit queue of pending copies and an implicit queue of group submitted. So the create_group op would make a group with copies pending from whatever path was taken.

So just to fill in an offline conversation

copy
copy
if(something)
  copy
commit

Here the commit could have 3 copies or 2 copies based on the conditional.

Why require byte size in the copy instead of a subview/subset of memref indices to copy? Specifically, this requires the caller to compute the byte size of the type rather than having the compiler compute it based on the element type. And this lets one read data that is not normally indexed by the memref, e.g., an 8-byte read from a non-contiguous f32 memref with innermost stride 2 will access the memory that cannot be read by a memref.load (4 bytes after the first element), which sounds very annoying from aliasing perspective.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

I echo this point about byte-size. When you say load/store semantics this is not consistent with the load/store semantics in MLIR. memref.load and memref.store is loading an element and not a byte. Also could be source of potential inconsistency. What if my element type is f32 and I say read 2 bytes? Number of elements makes more sense here.
Another related aspect that is a bit strange is that these cannot operate on memref that have an affine maps (as structured). Of course the op-semantics does not allow that, but you implicitly require that the memref be "contiguous" (I left a comment about this in the code as well).

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1215	I think you also need to enforce that the memref region that is read and written is "contiguous". Enforcing that condition especially in dynamic cases will be challenging. You might have to say that this operation only works with 1D memrefs and rely on `memref.expand_shape`/`memref.collapse_shape` to get things into the right shape.

In D119191#3309294, @mravishankar wrote:

In D119191#3308431, @ThomasRaoux wrote:

In D119191#3307203, @ftynse wrote:

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks for the examples. One option of this is to have only the "higher level" abstractions in the GPU dialect, and have the ops that reflect the intrinsic in NVVM directly stay in NVVM dialect. You can have a staged lowering of operations when lowering to NVVM, and lower from the higher level of abstraction in GPU dialect to the lower level of abstraction in NVVM dialect. Basically, these are exactly mirroring the ops in NVVM dialect, and are redundant for the most part. I cant really see why lowering to these are better than using the ops from NVVM dialect directly. Also, these ops are very NVVM specific. I dont know how these ops can be supported when lowering to SPIR-V (its always a useful check point to say what about SPIR-V when adding things to GPU dialect).
The above comments seems to suggest I have a strong opinion, but I am mostly on the fence on this (albeit slightly negative) and lot of this is originating from how these ops are exposed in NVVM/PTX. My over-arching concern is that if we have transformations that target these directly, we cant back track (and I can almost guarantee these ops specifications will change in different versions of CUDA, so we might be hard-coding behavior that is transient, and ill-formed). If we today have code that generate NVVM dialect ops directly and those have to change as the NVVM intrinsics evolve, that makes more sense to me (its a one-off we need performance effort and wont be long term stable). When targeting GPU dialect, I'd expect something more stable.

Discussed this offline but in my opinion doing the complex transformations of isolating groups and "counting" the max number of in flights group during conversion to NVVM sounds like it would mix too many things and would increase the complexity of the code significantly. I think we need a way to express at least those two things before converting to LLVM/SPIRV IR. This feature is not exposed in SPIRV but if it was targeting the same HW, it is hard to imagine how different this could be. And without another equivalent of this feature it is hard to tell how to pick a common abstraction.
I'm open to other designs but I think to be practical it need to expose:

Which ops are in the same group and ideally enforce that they don't interleave with other groups
Be able to represent the result of the analysis of how many groups are in flights (it could just be an optional attribute)

For the second case I had started with an attribute but it turns out that removing the asycn token during conversion to llvm is actually non trivial because the type convert needs something to convert it to.

Two concrete questions:

How does this work in presence of control flow (i.e., copies in different blocks that branch to a single block with create_group, and copies in a block that branches to two blocks each with own create_group?

It works as if there were an implicit queue of pending copies and an implicit queue of group submitted. So the create_group op would make a group with copies pending from whatever path was taken.

So just to fill in an offline conversation
copy
copy
if(something)
  copy
commit
Here the commit could have 3 copies or 2 copies based on the conditional.

Why require byte size in the copy instead of a subview/subset of memref indices to copy? Specifically, this requires the caller to compute the byte size of the type rather than having the compiler compute it based on the element type. And this lets one read data that is not normally indexed by the memref, e.g., an 8-byte read from a non-contiguous f32 memref with innermost stride 2 will access the memory that cannot be read by a memref.load (4 bytes after the first element), which sounds very annoying from aliasing perspective.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

I echo this point about byte-size. When you say load/store semantics this is not consistent with the load/store semantics in MLIR. memref.load and memref.store is loading an element and not a byte. Also could be source of potential inconsistency. What if my element type is f32 and I say read 2 bytes? Number of elements makes more sense here.
Another related aspect that is a bit strange is that these cannot operate on memref that have an affine maps (as structured). Of course the op-semantics does not allow that, but you implicitly require that the memref be "contiguous" (I left a comment about this in the code as well).

Yes as @ftynse pointed out this is not the right way to do it, and I'm changing the op to take a number of elements instead. If the memref is not contiguous the operation should be broke into scalar like it is done for vector.load/vector.store

In D119191#3309411, @ThomasRaoux wrote:

In D119191#3309294, @mravishankar wrote:

In D119191#3308431, @ThomasRaoux wrote:

In D119191#3307203, @ftynse wrote:

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks for the examples. One option of this is to have only the "higher level" abstractions in the GPU dialect, and have the ops that reflect the intrinsic in NVVM directly stay in NVVM dialect. You can have a staged lowering of operations when lowering to NVVM, and lower from the higher level of abstraction in GPU dialect to the lower level of abstraction in NVVM dialect. Basically, these are exactly mirroring the ops in NVVM dialect, and are redundant for the most part. I cant really see why lowering to these are better than using the ops from NVVM dialect directly. Also, these ops are very NVVM specific. I dont know how these ops can be supported when lowering to SPIR-V (its always a useful check point to say what about SPIR-V when adding things to GPU dialect).
The above comments seems to suggest I have a strong opinion, but I am mostly on the fence on this (albeit slightly negative) and lot of this is originating from how these ops are exposed in NVVM/PTX. My over-arching concern is that if we have transformations that target these directly, we cant back track (and I can almost guarantee these ops specifications will change in different versions of CUDA, so we might be hard-coding behavior that is transient, and ill-formed). If we today have code that generate NVVM dialect ops directly and those have to change as the NVVM intrinsics evolve, that makes more sense to me (its a one-off we need performance effort and wont be long term stable). When targeting GPU dialect, I'd expect something more stable.

Discussed this offline but in my opinion doing the complex transformations of isolating groups and "counting" the max number of in flights group during conversion to NVVM sounds like it would mix too many things and would increase the complexity of the code significantly. I think we need a way to express at least those two things before converting to LLVM/SPIRV IR. This feature is not exposed in SPIRV but if it was targeting the same HW, it is hard to imagine how different this could be. And without another equivalent of this feature it is hard to tell how to pick a common abstraction.
I'm open to other designs but I think to be practical it need to expose:

Which ops are in the same group and ideally enforce that they don't interleave with other groups

Be able to represent the result of the analysis of how many groups are in flights (it could just be an optional attribute)

For the second case I had started with an attribute but it turns out that removing the asycn token during conversion to llvm is actually non trivial because the type convert needs something to convert it to.

I am just looking at these ops and comparing the difference between these ops and the NVVM ops. The copy op to create_group op have similar semantics as the NVVM dialect version of these ops. The create_group to wait have different semantics. So you are saying you want to be able to represent ops that allow you to track how many groups are in flight so that when lowering to NVVM you can bridge the gap and account for the relationship between the create_group and the wait ops in NVVM. There is the same implicit relationship between copy and create_group in the NVVM ops, but that implicit relationship is carried over into the GPU dialect. That's the disconnect I am trying to resolve. The higher up the stack this leaks, the more is the cost of it in the longer term.
The argument you are giving is that you don't want to add all of that complexity when lowering to NVVM. That's fair, but that seems to suggest that lowering to NVVM is an atomic step, which it need not be. You could stage it with unrealized_conversion_casts and these get connected up later. So leave aside the ugly mechanics this introduces, whats the difference between the copy_op and create_group added here and the ones added in NVVM dialect.
Anyway, as I mentioned offline (and in the comment above), this whole things makes me uneasy, but is lot driven by the way these are exposed in NVVM (proper). So I am not really blocking on this, but still asking questions. That is a technical debt now we inherit, question is how far up the stack do we let that debt leak.

I am just looking at these ops and comparing the difference between these ops and the NVVM ops. The copy op to create_group op have similar semantics as the NVVM dialect version of these ops. The create_group to wait have different semantics. So you are saying you want to be able to represent ops that allow you to track how many groups are in flight so that when lowering to NVVM you can bridge the gap and account for the relationship between the create_group and the wait ops in NVVM. There is the same implicit relationship between copy and create_group in the NVVM ops, but that implicit relationship is carried over into the GPU dialect. That's the disconnect I am trying to resolve. The higher up the stack this leaks, the more is the cost of it in the longer term.
The argument you are giving is that you don't want to add all of that complexity when lowering to NVVM. That's fair, but that seems to suggest that lowering to NVVM is an atomic step, which it need not be. You could stage it with unrealized_conversion_casts and these get connected up later. So leave aside the ugly mechanics this introduces, whats the difference between the copy_op and create_group added here and the ones added in NVVM dialect.
Anyway, as I mentioned offline (and in the comment above), this whole things makes me uneasy, but is lot driven by the way these are exposed in NVVM (proper). So I am not really blocking on this, but still asking questions. That is a technical debt now we inherit, question is how far up the stack do we let that debt leak.

Ok I had misunderstood what your concern is. So your concern is that there are missing links between copy and commit. I really don't mind adding those as they would just get dropped during translation. The problem is that as far as I know there is no way to cleanly drop those extra types during conversion to llvm. The type convert would require those types to be converted to something as they are not legal types. This is the reason why I added the second version of the wait operation for the commit/wait case. If there is a solution to handle that during conversion to llvm then we remove the second version of the wait op and we can add explicit ssa links between copy and commit.

Change attribute from size in bytes to number of element and add verifier to check that the memref stride is 1

In D119191#3309238, @ftynse wrote:

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks! IMO, there are bigger abstraction gaps that are covered by lowerings, but I don't have a strong opinion here.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

I don't follow here. Neither memref.load/store nor vector.load/store require explicit byte size. The former accesses only one element. The latter has a requirement on the innermost dimension having stride 1, so reads are always contiguous, but their "size" is provided as the number of elements to read (in the vector type of the result). Having byte size is especially problematic because it duplicates (and therefore will end up contradicting) the data layout information. I would rather take the number of elements or reused the vector type somehow. Plus a verifier for a memref stride being either dynamic or static 1.

You are right, I changed the op to take a number of element and add check in the verifier to that the innermost dimension has a stride of 1. Thanks for catching that.

Harbormaster completed remote builds in B148656: Diff 407392.Feb 9 2022, 10:22 PM

After a longer offline discussions, this is where my head is at,

I think the copy op has an implicit requirement that the memrefs be contiguous. Enforcing that requirement is hard. On the other hand, memref.collapse_shape codifies this semantics. You can reshape a memref type value only if it is contiguous (and the verifier of this op already has some logic to do so). To avoid duplicating that logic, I would rely on having a memref.collapse_shape on the operand and the result and enforce that the input and output memref type have rank 1.
It seems cleaner to me to have an explicit link between the copy operation and the create_group operation. If the issue there is in terms of lowering to LLVM, that should be handle-able by a) Make the gpu.async_copy return a token and the gpu.create_group accept a list of tokens b) Lower gpu.async_copy to nvvm.async_copy + %t = llvm.const <token> (or equivalent, could just be lowered to undef for that matter). Use %t to replace all uses of gpu.async_copy. c) Ignore these values in the operands when lowering the gpu.create_group. I think this works, but I havent tried it out obviously.

This is actually a slightly different model than what the nvvm ops support. For example ` %t1 = copy A; %t2 = copy B; %t3 = copy C; %g1 = create_group %t1, %t2; %g2 = create_group %t3; ` should be illegal currently. Catching this case (and similar cases) will require additional verification using dominance, etc. Still I think approach is better and we could have verifications that allow only a subset of interleavings valid.

So to restate the Point (2) above, I see as it could go either way, and I can put that down as preference. The bigger issue for me though is Point (1) above cause that will result in silent miscompilation in cases where the memref is not contiguous. Whether a memref is contiguous is not always provable and even for cases it is, I'd rather not duplicate the logic to check this many places.

I am fine if the consensus is to go ahead with the patch as is. I am not stamping to give others a chance to respond, but happy to stamp. I think I have raised all my concerns.

In D119191#3310066, @mravishankar wrote:

After a longer offline discussions, this is where my head is at,

I think the copy op has an implicit requirement that the memrefs be contiguous. Enforcing that requirement is hard. On the other hand, memref.collapse_shape codifies this semantics. You can reshape a memref type value only if it is contiguous (and the verifier of this op already has some logic to do so). To avoid duplicating that logic, I would rely on having a memref.collapse_shape on the operand and the result and enforce that the input and output memref type have rank 1.

That's not the semantic I had in mind for this op. I don't think we want to read element across dimensions. The number of elements should be read only along the most inner dimension in my opinion and I think it might make a difference if we later introduce masking.
I can look more into what you are suggesting but I would rather not change to this semantic in this patch or until I tried it on a prototype.

It seems cleaner to me to have an explicit link between the copy operation and the create_group operation. If the issue there is in terms of lowering to LLVM, that should be handle-able by a) Make the gpu.async_copy return a token and the gpu.create_group accept a list of tokens b) Lower gpu.async_copy to nvvm.async_copy + %t = llvm.const <token> (or equivalent, could just be lowered to undef for that matter). Use %t to replace all uses of gpu.async_copy.

I don't believe llvm.const token would be valid. What worked for me was to convert the token type to a dummy i32. and just leave it dead. I don't mind going in that direction if people overall agree that it is better, I changed it because it felt too hacky at the time. Let me know what you prefer.

I don't have a strong opinion on 2) and I'm fine with either representation as long as we have a practical way to do the lowering.

In D119191#3310094, @ThomasRaoux wrote:

In D119191#3310066, @mravishankar wrote:

After a longer offline discussions, this is where my head is at,

I think the copy op has an implicit requirement that the memrefs be contiguous. Enforcing that requirement is hard. On the other hand, memref.collapse_shape codifies this semantics. You can reshape a memref type value only if it is contiguous (and the verifier of this op already has some logic to do so). To avoid duplicating that logic, I would rely on having a memref.collapse_shape on the operand and the result and enforce that the input and output memref type have rank 1.

That's not the semantic I had in mind for this op. I don't think we want to read element across dimensions. The number of elements should be read only along the most inner dimension in my opinion and I think it might make a difference if we later introduce masking.
I can look more into what you are suggesting but I would rather not change to this semantic in this patch or until I tried it on a prototype.

It seems cleaner to me to have an explicit link between the copy operation and the create_group operation. If the issue there is in terms of lowering to LLVM, that should be handle-able by a) Make the gpu.async_copy return a token and the gpu.create_group accept a list of tokens b) Lower gpu.async_copy to nvvm.async_copy + %t = llvm.const <token> (or equivalent, could just be lowered to undef for that matter). Use %t to replace all uses of gpu.async_copy.

I don't believe llvm.const token would be valid. What worked for me was to convert the token type to a dummy i32. and just leave it dead. I don't mind going in that direction if people overall agree that it is better, I changed it because it felt too hacky at the time. Let me know what you prefer.

When I said const I didn't really mean an actual const. I was thinking more of a dummy i32 as well. My bad on not being careful enough with the terminology. I dont see it as hacky, but ¯\_(ツ)_/¯ . Thats the only way to have a link at gpu dialect level and not have the link at NVVM level. Seems OK to me.

I don't have a strong opinion on 2) and I'm fine with either representation as long as we have a practical way to do the lowering.

Add SSA link between copy and create_group

In D119191#3311925, @mravishankar wrote:

In D119191#3310094, @ThomasRaoux wrote:

In D119191#3310066, @mravishankar wrote:

After a longer offline discussions, this is where my head is at,

I think the copy op has an implicit requirement that the memrefs be contiguous. Enforcing that requirement is hard. On the other hand, memref.collapse_shape codifies this semantics. You can reshape a memref type value only if it is contiguous (and the verifier of this op already has some logic to do so). To avoid duplicating that logic, I would rely on having a memref.collapse_shape on the operand and the result and enforce that the input and output memref type have rank 1.

That's not the semantic I had in mind for this op. I don't think we want to read element across dimensions. The number of elements should be read only along the most inner dimension in my opinion and I think it might make a difference if we later introduce masking.
I can look more into what you are suggesting but I would rather not change to this semantic in this patch or until I tried it on a prototype.

It seems cleaner to me to have an explicit link between the copy operation and the create_group operation. If the issue there is in terms of lowering to LLVM, that should be handle-able by a) Make the gpu.async_copy return a token and the gpu.create_group accept a list of tokens b) Lower gpu.async_copy to nvvm.async_copy + %t = llvm.const <token> (or equivalent, could just be lowered to undef for that matter). Use %t to replace all uses of gpu.async_copy.

I don't believe llvm.const token would be valid. What worked for me was to convert the token type to a dummy i32. and just leave it dead. I don't mind going in that direction if people overall agree that it is better, I changed it because it felt too hacky at the time. Let me know what you prefer.

When I said const I didn't really mean an actual const. I was thinking more of a dummy i32 as well. My bad on not being careful enough with the terminology. I dont see it as hacky, but ¯\_(ツ)_/¯ . Thats the only way to have a link at gpu dialect level and not have the link at NVVM level. Seems OK to me.

Ok I added SSA links between copy and create_group that I'm dropping as dummy i32 during conversion to NVVM. I removed the wait_group op as I don't need it anymore with this change I can just use an optional attribute. Please take another look. Note that there currently no verification that the order is correct to keep the patch small.

Harbormaster completed remote builds in B148818: Diff 407641.Feb 10 2022, 1:11 PM

Looks good.

Misread the semantics w.r.t to memref. I was thinking more vector.transfer_read/transfer_write than vector.load/store.

This revision is now accepted and ready to land.Feb 10 2022, 2:17 PM

Closed by commit rG5ab04bc068d5: [mlir][gpu] Add device side async copy operations (authored by ThomasRaoux). · Explain WhyFeb 10 2022, 5:26 PM

This revision was automatically updated to reflect the committed changes.

ThomasRaoux added a commit: rG5ab04bc068d5: [mlir][gpu] Add device side async copy operations.

Revision Contents

Path

Size

mlir/

include/

mlir/

Dialect/

GPU/

GPUBase.td

7 lines

GPUDialect.h

8 lines

GPUOps.td

94 lines

lib/

Conversion/

GPUToNVVM/

LowerGpuOpsToNVVMOps.cpp

69 lines

Dialect/

GPU/

IR/

GPUDialect.cpp

25 lines

test/

Conversion/

GPUToNVVM/

gpu-to-nvvm.mlir

34 lines

Dialect/

GPU/

invalid.mlir

32 lines

ops.mlir

14 lines

Diff 406695

mlir/include/mlir/Dialect/GPU/GPUBase.td

Show First 20 Lines • Show All 54 Lines • ▼ Show 20 Lines	def GPU_Dialect : Dialect {
let dependentDialects = ["arith::ArithmeticDialect"];		let dependentDialects = ["arith::ArithmeticDialect"];
let useDefaultAttributePrinterParser = 1;		let useDefaultAttributePrinterParser = 1;
}		}

def GPU_AsyncToken : DialectType<		def GPU_AsyncToken : DialectType<
GPU_Dialect, CPred<"$_self.isa<::mlir::gpu::AsyncTokenType>()">, "async token type">,		GPU_Dialect, CPred<"$_self.isa<::mlir::gpu::AsyncTokenType>()">, "async token type">,
BuildableType<"mlir::gpu::AsyncTokenType::get($_builder.getContext())">;		BuildableType<"mlir::gpu::AsyncTokenType::get($_builder.getContext())">;

		/// Device-side synchronization token.
		def GPU_DeviceAsyncToken : DialectType<
		GPU_Dialect, CPred<"$_self.isa<::mlir::gpu::DeviceAsyncTokenType>()">,
		"device async token type">,
		BuildableType<
		"mlir::gpu::DeviceAsyncTokenType::get($_builder.getContext())">;

// Predicat to check if type is gpu::MMAMatrixType.		// Predicat to check if type is gpu::MMAMatrixType.
def IsMMAMatrixTypePred : CPred<"$_self.isa<::mlir::gpu::MMAMatrixType>()">;		def IsMMAMatrixTypePred : CPred<"$_self.isa<::mlir::gpu::MMAMatrixType>()">;

def GPU_MMAMatrix : DialectType<		def GPU_MMAMatrix : DialectType<
GPU_Dialect, IsMMAMatrixTypePred, "MMAMatrix type">;		GPU_Dialect, IsMMAMatrixTypePred, "MMAMatrix type">;

class MMAMatrixOf<list<Type> allowedTypes> :		class MMAMatrixOf<list<Type> allowedTypes> :
ContainerType<AnyTypeOf<allowedTypes>, IsMMAMatrixTypePred,		ContainerType<AnyTypeOf<allowedTypes>, IsMMAMatrixTypePred,
▲ Show 20 Lines • Show All 49 Lines • Show Last 20 Lines

mlir/include/mlir/Dialect/GPU/GPUDialect.h

	Show All 40 Lines

	class AsyncTokenType			class AsyncTokenType
	: public Type::TypeBase<AsyncTokenType, Type, TypeStorage> {			: public Type::TypeBase<AsyncTokenType, Type, TypeStorage> {
	public:			public:
	// Used for generic hooks in TypeBase.			// Used for generic hooks in TypeBase.
	using Base::Base;			using Base::Base;
	};			};

				/// Device-side token storage type. There is only one type of device-side token.
				bondhugulaUnsubmitted Done Reply Inline Actions Doc comment here. bondhugula: Doc comment here.
				class DeviceAsyncTokenType
				: public Type::TypeBase<DeviceAsyncTokenType, Type, TypeStorage> {
				public:
				// Used for generic hooks in TypeBase.
				using Base::Base;
				};

	/// MMAMatrixType storage and uniquing. Array is uniqued based on its shape			/// MMAMatrixType storage and uniquing. Array is uniqued based on its shape
	/// and type.			/// and type.
	struct MMAMatrixStorageType : public TypeStorage {			struct MMAMatrixStorageType : public TypeStorage {
	MMAMatrixStorageType(unsigned numDims, const int64_t *dimShapes,			MMAMatrixStorageType(unsigned numDims, const int64_t *dimShapes,
	Type elementType, StringRef operand)			Type elementType, StringRef operand)
	: dimShapes(dimShapes), numDims(numDims), elementType(elementType),			: dimShapes(dimShapes), numDims(numDims), elementType(elementType),
	operand(operand) {}			operand(operand) {}

	▲ Show 20 Lines • Show All 126 Lines • Show Last 20 Lines

mlir/include/mlir/Dialect/GPU/GPUOps.td

Show First 20 Lines • Show All 1,219 Lines • ▼ Show 20 Lines let extraClassDeclaration = [{

} }

}]; }];

let assemblyFormat = [{ let assemblyFormat = [{

$operation $args attr-dict `:` functional-type($args, $res) $operation $args attr-dict `:` functional-type($args, $res)

}]; }];

} }

def GPU_DeviceAsyncCopyOp : GPU_Op<"device_async_copy",

[AttrSizedOperandSegments]> {

let summary = "device-side asynchronous copy";

bondhugulaUnsubmitted

Done

Nit: device-side

bondhugula: Nit: device-side

let description = [{

The `gpu.device_async_copy` op initiates an asynchronous copy operation of

bondhugulaUnsubmitted

Done

cp -> copy to be consistent with other op names like memref.copy etc.

bondhugula: `cp` -> `copy` to be consistent with other op names like `memref.copy` etc.

bondhugulaUnsubmitted

Done

operation

bondhugula: operation

`$size` bytes from source to the destination without blocking the thread.

The destination has to be in shared memory.

This op is meant to be used with `gpu.device_async_create_group` and

`gpu.device_async_wait` to synchronize copies as explained in those ops

bondhugulaUnsubmitted

Not Done

"commit the copy" s confusing here. "commit" is often used synonymously with "completion". The name device_async_commit is also missing a connection to the group in the name. How about device_aysnc_attach_to_group, device_aysnc_add_to_group. Even device_async_commit_to_group will be clearer here. Is commit motivated by any lower-level naming on the NVVM side?

bondhugula: "commit the copy" s confusing here. "commit" is often used synonymously with "completion". The…

ThomasRaouxAuthorUnsubmitted

Done

That's a good point, yes commit was picked because of no better name. Thinking more about this I think it should be called device_async_create_group as this is the op that takes all the copies and put them into a group. Let me know if you think this makes sense or if you prefer device_aysnc_attach_to_group.

ThomasRaoux: That's a good point, yes commit was picked because of no better name. Thinking more about this…

descriptions.

bondhugulaUnsubmitted

Not Done

From this, it isn't entirely clear if just the latter gpu.device_async_wait will suffice. You could just reference these respective ops' doc instead of reproducing it here. I think the confusion is mainly due to the naming.

bondhugula: From this, it isn't entirely clear if just the latter `gpu.device_async_wait` will suffice. You…

ThomasRaouxAuthorUnsubmitted

Done

Good point, I changed the description to just mention reference the other ops.

ThomasRaoux: Good point, I changed the description to just mention reference the other ops.

Example:

```mlir

gpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 16 :

memref<4x5xf32> to memref<2x7x5xf32, 3>

```

}];

let arguments = (ins Arg<AnyMemRef, "", [MemWrite]>:$dst,

Variadic<Index>:$dstIndices,

Arg<AnyMemRef, "", [MemRead]>:$src,

Variadic<Index>:$srcIndices,

IndexAttr:$size);

let assemblyFormat = [{

$src `[` $srcIndices `]` `,` $dst `[` $dstIndices `]` `,` $size attr-dict

`:` type($src) `to` type($dst)

}];

let hasVerifier = 1;

}

def GPU_DeviceAsyncCreateGroupOp : GPU_Op<"device_async_create_group", []> {

mravishankarUnsubmitted

Not Done

Maybe this is what you mean by "lower level ops", but I would really like to see a token returned by the copy operation that is used to create the groups. These are relying on textual order of things that makes me very uncomfortable. I see why they are needed this way cause NVIDIA hardware do it this way.

mravishankar: Maybe this is what you mean by "lower level ops", but I would really like to see a token…

ThomasRaouxAuthorUnsubmitted

Done

It would have to lowered to this representation though otherwise the gap to lower to llvm is to large.

ThomasRaoux: It would have to lowered to this representation though otherwise the gap to lower to llvm is to…

let summary = "device side asynchronous create group operation";

let description = [{

The `gpu.device_async_create_group` op create a group of memory accesses

ftynseUnsubmitted

Done

let description = [{

- The `gpu.device_async_create_group` op create a group of memory accesses

+ The `gpu.device_async_create_group` op creates a group of memory accesses

containing all the pending copies that haven't been added to a group yet.

ftynse:

containing all the pending copies that haven't been added to a group yet.

It returns a token that can be use to wait until the group fully completes.

This is meant to be used with `gpu.device_async_wait` or

bondhugulaUnsubmitted

Done

This is ...

bondhugula: This is ...

`gpu.device_async_wait_group` to synchronize copies as explained in those ops

descriptions.

Groups are executed the order they are created.

Example:

```mlir

%0 = gpu.device_async_create_group

```

}];

let results = (outs GPU_DeviceAsyncToken:$asyncToken);

let assemblyFormat = [{

attr-dict

}];

}

def GPU_DeviceAsyncWaitTokenOp : GPU_Op<"device_async_wait", []> {

let summary = "Wait for async gpu ops to complete.";

let description = [{

The `gpu.device_async_wait` op will block the execution thread until the group

associated with the source token is fully completed.

Example:

```mlir

gpu.device_async_wait %0

```

}];

let arguments = (ins GPU_DeviceAsyncToken:$asyncDependencies);

let assemblyFormat = [{

$asyncDependencies attr-dict

}];

}

def GPU_DeviceAsyncWaitGroupOp : GPU_Op<"device_async_wait_group", []> {

let summary = "Wait for async gpu ops to complete.";

let description = [{

The `gpu.device_async_wait` op will block the execution thread until the

number of groups uncompleted becomes lower or equal to the integer attribute

`numGroups`.

Example:

```mlir

gpu.device_async_wait_group 1

```

}];

let arguments = (ins I32Attr:$numGroups);

let assemblyFormat = [{

$numGroups attr-dict

}];

}

#endif // GPU_OPS #endif // GPU_OPS

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

Show First 20 Lines • Show All 116 Lines • ▼ Show 20 Lines	matchAndRewrite(gpu::ShuffleOp op, OpAdaptor adaptor,
Value isActiveSrcLane = rewriter.create<LLVM::ExtractValueOp>(		Value isActiveSrcLane = rewriter.create<LLVM::ExtractValueOp>(
loc, predTy, shfl, rewriter.getIndexArrayAttr(1));		loc, predTy, shfl, rewriter.getIndexArrayAttr(1));

rewriter.replaceOp(op, {shflValue, isActiveSrcLane});		rewriter.replaceOp(op, {shflValue, isActiveSrcLane});
return success();		return success();
}		}
};		};

		struct GPUAsyncCopyLowering
		bondhugulaUnsubmitted Done Reply Inline Actions `Lowering` prefix to be consistent. bondhugula: `Lowering` prefix to be consistent.
		: public ConvertOpToLLVMPattern<gpu::DeviceAsyncCopyOp> {
		using ConvertOpToLLVMPattern<gpu::DeviceAsyncCopyOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(gpu::DeviceAsyncCopyOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		Location loc = op->getLoc();
		auto dstMemrefType = op.dst().getType().cast<MemRefType>();
		Value dstPtr = getStridedElementPtr(loc, dstMemrefType, adaptor.dst(),
		adaptor.dstIndices(), rewriter);
		auto i8Ty = IntegerType::get(op.getContext(), 8);
		auto dstPointerType =
		LLVM::LLVMPointerType::get(i8Ty, dstMemrefType.getMemorySpaceAsInt());
		dstPtr = rewriter.create<LLVM::BitcastOp>(loc, dstPointerType, dstPtr);

		auto srcMemrefType = op.src().getType().cast<MemRefType>();

		Value scrPtr = getStridedElementPtr(loc, srcMemrefType, adaptor.src(),
		adaptor.srcIndices(), rewriter);
		auto srcPointerType =
		LLVM::LLVMPointerType::get(i8Ty, srcMemrefType.getMemorySpaceAsInt());
		scrPtr = rewriter.create<LLVM::BitcastOp>(loc, srcPointerType, scrPtr);
		// Intrinsics takes a global pointer so we need an address space cast.
		auto srcPointerGlobalType = LLVM::LLVMPointerType::get(i8Ty, 1);
		ftynseUnsubmitted Done Reply Inline Actions Nit: please use a constant / helper function instead of the magic `1` here. ftynse: Nit: please use a constant / helper function instead of the magic `1` here.
		scrPtr = rewriter.create<LLVM::AddrSpaceCastOp>(loc, srcPointerGlobalType,
		scrPtr);

		rewriter.replaceOpWithNewOp<NVVM::CpAsyncOp>(
		op, dstPtr, scrPtr,
		rewriter.getI32IntegerAttr(adaptor.size().getZExtValue()));
		return success();
		}
		};

		struct GPUAsyncCreateGroupLowering
		: public ConvertOpToLLVMPattern<gpu::DeviceAsyncCreateGroupOp> {
		using ConvertOpToLLVMPattern<
		gpu::DeviceAsyncCreateGroupOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(gpu::DeviceAsyncCreateGroupOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		// Token cannot be represented in NVVM, only lower if the op has no use.
		if (!op.use_empty())
		return failure();
		rewriter.create<NVVM::CpAsyncCommitGroupOp>(op.getLoc());
		rewriter.eraseOp(op);
		return success();
		}
		};

		bondhugulaUnsubmitted Done Reply Inline Actions Likewise. bondhugula: Likewise.
		struct GPUAsyncWaitLowering
		: public ConvertOpToLLVMPattern<gpu::DeviceAsyncWaitGroupOp> {
		using ConvertOpToLLVMPattern<
		gpu::DeviceAsyncWaitGroupOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(gpu::DeviceAsyncWaitGroupOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		int32_t numGroups = adaptor.numGroups();
		rewriter.create<NVVM::CpAsyncWaitGroupOp>(op.getLoc(), numGroups);
		rewriter.eraseOp(op);
		return success();
		}
		};

/// Import the GPU Ops to NVVM Patterns.		/// Import the GPU Ops to NVVM Patterns.
#include "GPUToNVVM.cpp.inc"		#include "GPUToNVVM.cpp.inc"

		bondhugulaUnsubmitted Done Reply Inline Actions This has a single user --- just inline it unless you want to expose this publicly bondhugula: This has a single user --- just inline it unless you want to expose this publicly
/// A pass that replaces all occurrences of GPU device operations with their		/// A pass that replaces all occurrences of GPU device operations with their
/// corresponding NVVM equivalent.		/// corresponding NVVM equivalent.
///		///
/// This pass only handles device code and is not meant to be run on GPU host		/// This pass only handles device code and is not meant to be run on GPU host
/// code.		/// code.
struct LowerGpuOpsToNVVMOpsPass		struct LowerGpuOpsToNVVMOpsPass
: public ConvertGpuOpsToNVVMOpsBase<LowerGpuOpsToNVVMOpsPass> {		: public ConvertGpuOpsToNVVMOpsBase<LowerGpuOpsToNVVMOpsPass> {
LowerGpuOpsToNVVMOpsPass() = default;		LowerGpuOpsToNVVMOpsPass() = default;
▲ Show 20 Lines • Show All 118 Lines • ▼ Show 20 Lines	void mlir::populateGpuToNVVMConversionPatterns(LLVMTypeConverter &converter,
patterns.add<OpToFuncCallLowering<math::RsqrtOp>>(converter, "__nv_rsqrtf",		patterns.add<OpToFuncCallLowering<math::RsqrtOp>>(converter, "__nv_rsqrtf",
"__nv_rsqrt");		"__nv_rsqrt");
patterns.add<OpToFuncCallLowering<math::SinOp>>(converter, "__nv_sinf",		patterns.add<OpToFuncCallLowering<math::SinOp>>(converter, "__nv_sinf",
"__nv_sin");		"__nv_sin");
patterns.add<OpToFuncCallLowering<math::SqrtOp>>(converter, "__nv_sqrtf",		patterns.add<OpToFuncCallLowering<math::SqrtOp>>(converter, "__nv_sqrtf",
"__nv_sqrt");		"__nv_sqrt");
patterns.add<OpToFuncCallLowering<math::TanhOp>>(converter, "__nv_tanhf",		patterns.add<OpToFuncCallLowering<math::TanhOp>>(converter, "__nv_tanhf",
"__nv_tanh");		"__nv_tanh");
		patterns.add<GPUAsyncCopyLowering, GPUAsyncCreateGroupLowering,
		GPUAsyncWaitLowering>(converter);
}		}

std::unique_ptr<OperationPass<gpu::GPUModuleOp>>		std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
mlir::createLowerGpuOpsToNVVMOpsPass(unsigned indexBitwidth) {		mlir::createLowerGpuOpsToNVVMOpsPass(unsigned indexBitwidth) {
return std::make_unique<LowerGpuOpsToNVVMOpsPass>(indexBitwidth);		return std::make_unique<LowerGpuOpsToNVVMOpsPass>(indexBitwidth);
}		}

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

Show First 20 Lines • Show All 110 Lines • ▼ Show 20 Lines	struct GPUInlinerInterface : public DialectInlinerInterface {
bool isLegalToInline(Operation , Region , bool,		bool isLegalToInline(Operation , Region , bool,
BlockAndValueMapping &) const final {		BlockAndValueMapping &) const final {
return true;		return true;
}		}
};		};
} // namespace		} // namespace

void GPUDialect::initialize() {		void GPUDialect::initialize() {
addTypes<AsyncTokenType>();		addTypes<AsyncTokenType>();
		addTypes<DeviceAsyncTokenType>();
addTypes<MMAMatrixType>();		addTypes<MMAMatrixType>();
addOperations<		addOperations<
		bondhugulaUnsubmitted Done Reply Inline Actions Nit: Sorted order. bondhugula: Nit: Sorted order.
#define GET_OP_LIST		#define GET_OP_LIST
#include "mlir/Dialect/GPU/GPUOps.cpp.inc"		#include "mlir/Dialect/GPU/GPUOps.cpp.inc"
>();		>();
addAttributes<		addAttributes<
#define GET_ATTRDEF_LIST		#define GET_ATTRDEF_LIST
#include "mlir/Dialect/GPU/GPUOpsAttributes.cpp.inc"		#include "mlir/Dialect/GPU/GPUOpsAttributes.cpp.inc"
>();		>();
addInterfaces<GPUInlinerInterface>();		addInterfaces<GPUInlinerInterface>();
}		}

Type GPUDialect::parseType(DialectAsmParser &parser) const {		Type GPUDialect::parseType(DialectAsmParser &parser) const {
// Parse the main keyword for the type.		// Parse the main keyword for the type.
StringRef keyword;		StringRef keyword;
if (parser.parseKeyword(&keyword))		if (parser.parseKeyword(&keyword))
return Type();		return Type();
MLIRContext *context = getContext();		MLIRContext *context = getContext();

// Handle 'async token' types.		// Handle 'async token' types.
if (keyword == "async.token")		if (keyword == "async.token")
return AsyncTokenType::get(context);		return AsyncTokenType::get(context);
		// Handle 'device async token' types.
		if (keyword == "device.async.token")
		return DeviceAsyncTokenType::get(context);

if (keyword == "mma_matrix") {		if (keyword == "mma_matrix") {
SMLoc beginLoc = parser.getNameLoc();		SMLoc beginLoc = parser.getNameLoc();

// Parse '<'.		// Parse '<'.
if (parser.parseLess())		if (parser.parseLess())
return nullptr;		return nullptr;

Show All 24 Lines	Type GPUDialect::parseType(DialectAsmParser &parser) const {

parser.emitError(parser.getNameLoc(), "unknown gpu type: " + keyword);		parser.emitError(parser.getNameLoc(), "unknown gpu type: " + keyword);
return Type();		return Type();
}		}

void GPUDialect::printType(Type type, DialectAsmPrinter &os) const {		void GPUDialect::printType(Type type, DialectAsmPrinter &os) const {
TypeSwitch<Type>(type)		TypeSwitch<Type>(type)
.Case<AsyncTokenType>([&](Type) { os << "async.token"; })		.Case<AsyncTokenType>([&](Type) { os << "async.token"; })
		.Case<DeviceAsyncTokenType>([&](Type) { os << "device.async.token"; })
.Case<MMAMatrixType>([&](MMAMatrixType fragTy) {		.Case<MMAMatrixType>([&](MMAMatrixType fragTy) {
os << "mma_matrix<";		os << "mma_matrix<";
auto shape = fragTy.getShape();		auto shape = fragTy.getShape();
for (auto dim = shape.begin(), e = shape.end() - 1; dim != e; ++dim)		for (auto dim = shape.begin(), e = shape.end() - 1; dim != e; ++dim)
os << *dim << 'x';		os << *dim << 'x';
os << shape.back() << 'x' << fragTy.getElementType();		os << shape.back() << 'x' << fragTy.getElementType();
os << ", \"" << fragTy.getOperand() << "\"" << '>';		os << ", \"" << fragTy.getOperand() << "\"" << '>';
})		})
▲ Show 20 Lines • Show All 992 Lines • ▼ Show 20 Lines

} // namespace		} // namespace

void AllocOp::getCanonicalizationPatterns(RewritePatternSet &results,		void AllocOp::getCanonicalizationPatterns(RewritePatternSet &results,
MLIRContext *context) {		MLIRContext *context) {
results.add<SimplifyDimOfAllocOp>(context);		results.add<SimplifyDimOfAllocOp>(context);
}		}

		//===----------------------------------------------------------------------===//
		// GPU_DeviceAsyncCopyOp
		//===----------------------------------------------------------------------===//

		LogicalResult DeviceAsyncCopyOp::verify() {
		auto srcMemref = src().getType().cast<MemRefType>();
		auto dstMemref = dst().getType().cast<MemRefType>();
		if (dstMemref.getMemorySpaceAsInt() != 3)
		mravishankarUnsubmitted Not Done Reply Inline Actions Using `3` here is taking an opnion of shared memory being mapped to memory space 3. I think its better the GPU dialect just take the opinion that 3 maps to shared memory, create an enum type that does this mapping and use that. instead of hard-coding such numbers. For a later patch maybe, but this technical debt will only grow. mravishankar: Using `3` here is taking an opnion of shared memory being mapped to memory space 3. I think its…
		ThomasRaouxAuthorUnsubmitted Done Reply Inline Actions removed, I meant to remove it and missed it. ThomasRaoux: removed, I meant to remove it and missed it.
		ftynseUnsubmitted Done Reply Inline Actions Nit: can this variable be called `workgroupAddressSpace` to be consistent with the terminology? ftynse: Nit: can this variable be called `workgroupAddressSpace` to be consistent with the terminology?
		return emitError("destination memref must have memory space 3");
		if (dstMemref.getElementType() != srcMemref.getElementType())
		return emitError("source and destination must have the same element type");
		if (srcMemref.getRank() != srcIndices().size())
		return emitOpError() << "expected " << srcMemref.getRank()
		<< " source indices, got " << srcIndices().size();
		if (dstMemref.getRank() != dstIndices().size())
		return emitOpError() << "expected " << dstMemref.getRank()
		<< " destination indices, got " << dstIndices().size();
		return success();
		}

#include "mlir/Dialect/GPU/GPUOpInterfaces.cpp.inc"		#include "mlir/Dialect/GPU/GPUOpInterfaces.cpp.inc"
		mravishankarUnsubmitted Not Done Reply Inline Actions I think you also need to enforce that the memref region that is read and written is "contiguous". Enforcing that condition especially in dynamic cases will be challenging. You might have to say that this operation only works with 1D memrefs and rely on `memref.expand_shape`/`memref.collapse_shape` to get things into the right shape. mravishankar: I think you also need to enforce that the memref region that is read and written is…
#include "mlir/Dialect/GPU/GPUOpsEnums.cpp.inc"		#include "mlir/Dialect/GPU/GPUOpsEnums.cpp.inc"

#define GET_ATTRDEF_CLASSES		#define GET_ATTRDEF_CLASSES
#include "mlir/Dialect/GPU/GPUOpsAttributes.cpp.inc"		#include "mlir/Dialect/GPU/GPUOpsAttributes.cpp.inc"

#define GET_OP_CLASSES		#define GET_OP_CLASSES
#include "mlir/Dialect/GPU/GPUOps.cpp.inc"		#include "mlir/Dialect/GPU/GPUOps.cpp.inc"

mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir

Show First 20 Lines • Show All 473 Lines • ▼ Show 20 Lines	gpu.module @test_module {
// CHECK-LABEL: @kernel_func		// CHECK-LABEL: @kernel_func
// CHECK: attributes		// CHECK: attributes
// CHECK: gpu.kernel		// CHECK: gpu.kernel
// CHECK: nvvm.kernel		// CHECK: nvvm.kernel
gpu.func @kernel_func() kernel {		gpu.func @kernel_func() kernel {
gpu.return		gpu.return
}		}
}		}

		// -----

		gpu.module @test_module {
		// CHECK-LABEL: @async_cp(
		// CHECK-SAME: %[[IDX:[a-zA-Z0-9_]+]]: i64)
		gpu.func @async_cp(
		%src: memref<128x128xf32>, %dst: memref<3x16x128xf32, 3>, %i : index) kernel {
		// CHECK-DAG: %[[BASEDST:.]] = llvm.extractvalue %{{.}}[1] : !llvm.struct<(ptr<f32, 3>, ptr<f32, 3>, i64, array<3 x i64>, array<3 x i64>)>
		// CHECK-DAG: %[[S0:.*]] = llvm.mlir.constant(2048 : index) : i64
		// CHECK-DAG: %[[LI:.*]] = llvm.mul %[[IDX]], %[[S0]] : i64
		// CHECK-DAG: %[[S1:.*]] = llvm.mlir.constant(128 : index) : i64
		// CHECK-DAG: %[[FI0:.*]] = llvm.mul %[[IDX]], %[[S1]] : i64
		// CHECK-DAG: %[[FI1:.*]] = llvm.add %[[LI]], %[[FI0]] : i64
		// CHECK-DAG: %[[FI2:.*]] = llvm.add %[[FI1]], %[[IDX]] : i64
		// CHECK-DAG: %[[ADDRESSDST:.*]] = llvm.getelementptr %[[BASEDST]][%[[FI2]]] : (!llvm.ptr<f32, 3>, i64) -> !llvm.ptr<f32, 3>
		// CHECK-DAG: %[[CAST0:.*]] = llvm.bitcast %[[ADDRESSDST]] : !llvm.ptr<f32, 3> to !llvm.ptr<i8, 3>

		// CHECK-DAG: %[[BASESRC:.]] = llvm.extractvalue %{{.}}[1] : !llvm.struct<(ptr<f32>, ptr<f32>, i64, array<2 x i64>, array<2 x i64>)>
		// CHECK-DAG: %[[S3:.*]] = llvm.mlir.constant(128 : index) : i64
		// CHECK-DAG: %[[FI3:.*]] = llvm.mul %[[IDX]], %[[S3]] : i64
		// CHECK-DAG: %[[FI4:.*]] = llvm.add %[[FI3]], %[[IDX]] : i64
		// CHECK-DAG: %[[ADDRESSSRC:.*]] = llvm.getelementptr %[[BASESRC]][%[[FI4]]] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
		// CHECK-DAG: %[[CAST1:.*]] = llvm.bitcast %[[ADDRESSSRC]] : !llvm.ptr<f32> to !llvm.ptr<i8>
		// CHECK-DAG: %[[CAST2:.*]] = llvm.addrspacecast %[[CAST1]] : !llvm.ptr<i8> to !llvm.ptr<i8, 1>
		// CHECK-DAG: nvvm.cp.async.shared.global %[[CAST0]], %[[CAST2]], 16
		gpu.device_async_copy %src[%i, %i], %dst[%i, %i, %i], 16 : memref<128x128xf32> to memref<3x16x128xf32, 3>
		// CHECK: nvvm.cp.async.commit.group
		%0 = gpu.device_async_create_group
		// CHECK: nvvm.cp.async.wait.group 1
		gpu.device_async_wait_group 1
		gpu.return
		}
		}

mlir/test/Dialect/GPU/invalid.mlir

	Show First 20 Lines • Show All 549 Lines • ▼ Show 20 Lines

	// -----			// -----

	func @wmmaMmaOp_invalid_operand_shapes(%A : !gpu.mma_matrix<16x32xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {			func @wmmaMmaOp_invalid_operand_shapes(%A : !gpu.mma_matrix<16x32xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
	// expected-error @+1 {{operand shapes do not satisfy matmul constraints}}			// expected-error @+1 {{operand shapes do not satisfy matmul constraints}}
	%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x32xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">			%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x32xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
	return			return
	}			}

				// -----

				func @async_cp_memory_space(%dst : memref<16xf32>, %src : memref<16xf32>, %i : index) -> () {
				// expected-error @+1 {{destination memref must have memory space 3}}
				gpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16xf32> to memref<16xf32>
				return
				}

				// -----

				func @async_cp_memref_type(%dst : memref<16xi32, 3>, %src : memref<16xf32>, %i : index) -> () {
				// expected-error @+1 {{source and destination must have the same element type}}
				gpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16xf32> to memref<16xi32, 3>
				return
				}

				// -----

				func @async_cp_num_src_indices(%dst : memref<16xf32, 3>, %src : memref<16x16xf32>, %i : index) -> () {
				// expected-error @+1 {{expected 2 source indices, got 1}}
				gpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16x16xf32> to memref<16xf32, 3>
				return
				}

				// -----

				func @async_cp_num_dst_indices(%dst : memref<16x16xf32, 3>, %src : memref<16xf32>, %i : index) -> () {
				// expected-error @+1 {{expected 2 destination indices, got 1}}
				gpu.device_async_copy %src[%i], %dst[%i], 16 : memref<16xf32> to memref<16x16xf32, 3>
				return
				}

mlir/test/Dialect/GPU/ops.mlir

Show First 20 Lines • Show All 234 Lines • ▼ Show 20 Lines	func @mmamatrix_valid_element_type(){
// CHECK: gpu.subgroup_mma_load_matrix %[[wg]][%[[i]], %[[i]]] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "AOp">		// CHECK: gpu.subgroup_mma_load_matrix %[[wg]][%[[i]], %[[i]]] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "AOp">
%1 = gpu.subgroup_mma_constant_matrix %cst : !gpu.mma_matrix<16x16xf32, "COp">		%1 = gpu.subgroup_mma_constant_matrix %cst : !gpu.mma_matrix<16x16xf32, "COp">
// CHECK: gpu.subgroup_mma_elementwise addf %{{.}}, %{{.}} : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">		// CHECK: gpu.subgroup_mma_elementwise addf %{{.}}, %{{.}} : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
%2 = gpu.subgroup_mma_elementwise addf %1, %1 : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">		%2 = gpu.subgroup_mma_elementwise addf %1, %1 : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
// CHECK: gpu.subgroup_mma_elementwise maxf %{{.}}, %{{.}} : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">		// CHECK: gpu.subgroup_mma_elementwise maxf %{{.}}, %{{.}} : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
%3 = gpu.subgroup_mma_elementwise maxf %2, %1 : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">		%3 = gpu.subgroup_mma_elementwise maxf %2, %1 : (!gpu.mma_matrix<16x16xf32, "COp">, !gpu.mma_matrix<16x16xf32, "COp">) -> !gpu.mma_matrix<16x16xf32, "COp">
return		return
}		}

		func @async_cp(%dst : memref<2x7x5xf32, 3>, %src : memref<4x5xf32>){
		// CHECK-LABEL: func @async_cp
		%c0 = arith.constant 0 : index
		// CHECK: gpu.device_async_copy %{{.}}[{{.}}, {{.}}], %{{.}}[{{.}}, {{.}}, {{.*}}], 16 : memref<4x5xf32> to memref<2x7x5xf32, 3>
		gpu.device_async_copy %src[%c0, %c0], %dst[%c0, %c0, %c0], 16 : memref<4x5xf32> to memref<2x7x5xf32, 3>
		// CHECK: %{{.*}} = gpu.device_async_create_group
		%token = gpu.device_async_create_group
		// CHECK: gpu.device_async_wait %{{.*}}
		gpu.device_async_wait %token
		// CHECK: gpu.device_async_wait_group 1
		gpu.device_async_wait_group 1
		return
		}
}		}

This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Add device side async copy operationsClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 406695

mlir/include/mlir/Dialect/GPU/GPUBase.td

mlir/include/mlir/Dialect/GPU/GPUDialect.h

mlir/include/mlir/Dialect/GPU/GPUOps.td

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

mlir/test/Conversion/GPUToNVVM/gpu-to-nvvm.mlir

mlir/test/Dialect/GPU/invalid.mlir

mlir/test/Dialect/GPU/ops.mlir

[mlir][gpu] Add device side async copy operations
ClosedPublic