This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Add device side async copy operations
ClosedPublic

Authored by ThomasRaoux on Feb 7 2022, 2:22 PM.

Details

Summary

Add new operations to the gpu dialect to represent device side
asynchronous copies. This also add the lowering of those operations to
nvvm dialect.
Those ops are meant to be low level and map directly to llvm dialects
like nvvm or rocdl.

We can further add higher level of abstraction by building on top of
those operations.
This has been discuss here:
https://discourse.llvm.org/t/modeling-gpu-async-copy-ampere-feature/4924

Diff Detail

Event Timeline

ThomasRaoux created this revision.Feb 7 2022, 2:22 PM
ThomasRaoux requested review of this revision.Feb 7 2022, 2:22 PM

Great to see this -- some comments.

mlir/include/mlir/Dialect/GPU/GPUDialect.h
49

Doc comment here.

mlir/include/mlir/Dialect/GPU/GPUOps.td
1231

Nit: device-side

1233

cp -> copy to be consistent with other op names like memref.copy etc.

1233

operation

1238

"commit the copy" s confusing here. "commit" is often used synonymously with "completion". The name device_async_commit is also missing a connection to the group in the name. How about device_aysnc_attach_to_group, device_aysnc_add_to_group. Even device_async_commit_to_group will be clearer here. Is commit motivated by any lower-level naming on the NVVM side?

1238–1240

From this, it isn't entirely clear if just the latter gpu.device_async_wait will suffice. You could just reference these respective ops' doc instead of reproducing it here. I think the confusion is mainly due to the naming.

1268

This is ...

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
133

Lowering prefix to be consistent.

184

Likewise.

199–202

This has a single user --- just inline it unless you want to expose this publicly

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
119–122

Nit: Sorted order.

Address review comments

ThomasRaoux marked 9 inline comments as done.Feb 7 2022, 8:49 PM

Thanks for the review @bondhugula. I tried to improve the op naming and description, let me know what you think.

mlir/include/mlir/Dialect/GPU/GPUOps.td
1238

That's a good point, yes commit was picked because of no better name. Thinking more about this I think it should be called device_async_create_group as this is the op that takes all the copies and put them into a group. Let me know if you think this makes sense or if you prefer device_aysnc_attach_to_group.

1238–1240

Good point, I changed the description to just mention reference the other ops.

Remove missed dead code

As modeled I dont have any blockers, but IMO without the higher level ops (which should be the load bearing ops), these ops are going to make transformations jump directly to something that is essentially LLVM. I'd strongly suggest that the transformations dont directly target these "low level ops", cause then it would be harder to back track form there. Ill take a look again for details tomorrow, but leaving comments for now.

mlir/include/mlir/Dialect/GPU/GPUOps.td
1260

Maybe this is what you mean by "lower level ops", but I would really like to see a token returned by the copy operation that is used to create the groups. These are relying on textual order of things that makes me very uncomfortable. I see why they are needed this way cause NVIDIA hardware do it this way.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1218

Using 3 here is taking an opnion of shared memory being mapped to memory space 3. I think its better the GPU dialect just take the opinion that 3 maps to shared memory, create an enum type that does this mapping and use that. instead of hard-coding such numbers. For a later patch maybe, but this technical debt will only grow.

remove hardcoded number

ThomasRaoux added inline comments.Feb 8 2022, 8:47 PM
mlir/include/mlir/Dialect/GPU/GPUOps.td
1260

It would have to lowered to this representation though otherwise the gap to lower to llvm is to large.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1218

removed, I meant to remove it and missed it.

As modeled I dont have any blockers, but IMO without the higher level ops (which should be the load bearing ops), these ops are going to make transformations jump directly to something that is essentially LLVM. I'd strongly suggest that the transformations dont directly target these "low level ops", cause then it would be harder to back track form there. Ill take a look again for details tomorrow, but leaving comments for now.

Finding the right level of abstraction is a challenge as discussed. That's why I am starting with the layers that I think are needed no matter what. If I understand well you are suggesting having a level where the copy would directly feed tokens into the wait. This probably make sense but it is only slightly lower level than this representation:

token1 = copy
token2 = copy
....
wait token1, token2

would get transformed:

copy
copy
...
token = create_group
wait token

Your concern is that we rely on side effect to enforce the dependency between copy and create_group, I think this is a legit concern but it is not clear to me that the first representation is worth having another level of abstraction. Since I believe the representation in this patch will be needed I think it makes sense to move forward. Let me know what you think.

While gpu.device_async_create_group and gpu.device_async_wait are fairly straightforward in how they are paired together, there is a non-trivial amount of implicit global state associated to device_async_copy and device_async_wait_group.

Let me see if I infer correctly an abstract model of this, we need a global thread_local bag of groups and a global thread_local list of "copies not yet added to a group".
From there I can see how to describe the existing ops in terms of read/write effects to these globals I guess.

I'm still not sure about two things right now:

  • how to target the gpu.device_async_wait_group or why it is exposed the way it is? For example, I have a hard time understanding the composability of this op: entering a function I may not know how many pending groups are in flight and then it's not clear to me how to pick a number here?
  • the constraint this puts on the ordering: it seems that for example because of the implicit state between a copy and a create_group, we can't reschedule things. For example in this case:
src2 = ...
...
src1 = ...
gpu.device_async_copy %src1
%token = gpu.device_async_create_group
gpu.device_async_copy %src2
%token2 = gpu.device_async_create_group

It would be interesting to schedule the copy from src2 earlier since src2 is available before src1 somehow, but doing so would require to move the two operations together (gpu.device_async_copy %src2 and %token2 = gpu.device_async_create_group).

mehdi_amini added a comment.EditedFeb 8 2022, 9:39 PM
token1 = copy
token2 = copy
....
wait token1, token2

would get transformed:

copy
copy
...
token = create_group
wait token

I don't think this is correct (in the absolute): what if your ... is also using some async? For example:

token1 = copy
token2 = copy
...
token3 = copy
...
wait token1, token2
...
wait token3

While gpu.device_async_create_group and gpu.device_async_wait are fairly straightforward in how they are paired together, there is a non-trivial amount of implicit global state associated to device_async_copy and device_async_wait_group.

Let me see if I infer correctly an abstract model of this, we need a global thread_local bag of groups and a global thread_local list of "copies not yet added to a group".
From there I can see how to describe the existing ops in terms of read/write effects to these globals I guess.

Yes this is correct, and those are modeled by side effect in this case.

I'm still not sure about two things right now:

  • how to target the gpu.device_async_wait_group or why it is exposed the way it is? For example, I have a hard time understanding the composability of this op: entering a function I may not know how many pending groups are in flight and then it's not clear to me how to pick a number here?

gpu.device_async_wait_group is there to have a straight forward way to lower to nvvm kind of semantic. The way it would be generated is by lowering from gpu.device_async_wait and walking all the potential paths going from this op to the associated gpu.device_async_create_group and finding the minimum number of other gpu.device_async_create_group in those paths. Of course picking a lower value than needed is always correct so this would have to be a conservative transformation. There are alternative solutions, for instance cuda compiler creates a big switch case when it cannot resolve statically the number of workgroup and just use a dynamic counter. I think we could have several strategy of lowering but this stage has to be done no matter what.

  • the constraint this puts on the ordering: it seems that for example because of the implicit state between a copy and a create_group, we can't reschedule things. For example in this case:
src2 = ...
...
src1 = ...
gpu.device_async_copy %src1
%token = gpu.device_async_create_group
gpu.device_async_copy %src2
%token2 = gpu.device_async_create_group

It would be interesting to schedule the copy from src2 earlier since src2 is available before src1 somehow, but doing so would require to move the two operations together (gpu.device_async_copy %src2 and %token2 = gpu.device_async_create_group).

Correct this would not be possible and would have to be done at a higher level of abstraction. The main thing missing in this representation is how the group are created, once the groups are set it is too late to do such re-ordering. That being said in general you would want to put consecutive copy without interleave work in the same group so you shouldn't have to do the re-ordering.

The reason why I have those representations is because what I need is:
-a representation to apply pipelining transformation on. It should have the groups picked and I have an explicit ssa link between the group and the wait to know directly how they are linked.
-A representation after pipelining and before convertToNVVM where each wait knows the max number of groups under progress to unblock the thread.

token1 = copy
token2 = copy
....
wait token1, token2

would get transformed:

copy
copy
...
token = create_group
wait token

I don't think this is correct (in the absolute): what if your ... is also using some async? For example:

token1 = copy
token2 = copy
...
token3 = copy
...
wait token1, token2
...
wait token3

True, in this case the group creating may not be trivial which is one problem I tried to avoid in the representation. In this example the resprentation is still simple, if we have something like below it becomes complicated:

token1 = copy
...
token3 = copy
...
token2 = copy
...
wait token1, token2
...
wait token3

gpu.device_async_wait_group is there to have a straight forward way to lower to nvvm kind of semantic. The way it would be generated is by lowering from gpu.device_async_wait and walking all the potential paths going from this op to the associated gpu.device_async_create_group and finding the minimum number of other gpu.device_async_create_group in those paths.

I actually missed in the description: "Groups are executed the order they are created", this makes sense now :)

ftynse added a subscriber: ftynse.Feb 9 2022, 1:29 AM

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

Two concrete questions:

  • How does this work in presence of control flow (i.e., copies in different blocks that branch to a single block with create_group, and copies in a block that branches to two blocks each with own create_group?
  • Why require byte size in the copy instead of a subview/subset of memref indices to copy? Specifically, this requires the caller to compute the byte size of the type rather than having the compiler compute it based on the element type. And this lets one read data that is not normally indexed by the memref, e.g., an 8-byte read from a non-contiguous f32 memref with innermost stride 2 will access the memory that cannot be read by a memref.load (4 bytes after the first element), which sounds very annoying from aliasing perspective.
mlir/include/mlir/Dialect/GPU/GPUOps.td
1264
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
157

Nit: please use a constant / helper function instead of the magic 1 here.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1218

Nit: can this variable be called workgroupAddressSpace to be consistent with the terminology?

Add extra description and address comments on naming

ThomasRaoux marked 3 inline comments as done.Feb 9 2022, 10:39 AM

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Two concrete questions:

  • How does this work in presence of control flow (i.e., copies in different blocks that branch to a single block with create_group, and copies in a block that branches to two blocks each with own create_group?

It works as if there were an implicit queue of pending copies and an implicit queue of group submitted. So the create_group op would make a group with copies pending from whatever path was taken.

  • Why require byte size in the copy instead of a subview/subset of memref indices to copy? Specifically, this requires the caller to compute the byte size of the type rather than having the compiler compute it based on the element type. And this lets one read data that is not normally indexed by the memref, e.g., an 8-byte read from a non-contiguous f32 memref with innermost stride 2 will access the memory that cannot be read by a memref.load (4 bytes after the first element), which sounds very annoying from aliasing perspective.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

ftynse added a comment.Feb 9 2022, 2:47 PM

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks! IMO, there are bigger abstraction gaps that are covered by lowerings, but I don't have a strong opinion here.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

I don't follow here. Neither memref.load/store nor vector.load/store require explicit byte size. The former accesses only one element. The latter has a requirement on the innermost dimension having stride 1, so reads are always contiguous, but their "size" is provided as the number of elements to read (in the vector type of the result). Having byte size is especially problematic because it duplicates (and therefore will end up contradicting) the data layout information. I would rather take the number of elements or reused the vector type somehow. Plus a verifier for a memref stride being either dynamic or static 1.

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks for the examples. One option of this is to have only the "higher level" abstractions in the GPU dialect, and have the ops that reflect the intrinsic in NVVM directly stay in NVVM dialect. You can have a staged lowering of operations when lowering to NVVM, and lower from the higher level of abstraction in GPU dialect to the lower level of abstraction in NVVM dialect. Basically, these are exactly mirroring the ops in NVVM dialect, and are redundant for the most part. I cant really see why lowering to these are better than using the ops from NVVM dialect directly. Also, these ops are very NVVM specific. I dont know how these ops can be supported when lowering to SPIR-V (its always a useful check point to say what about SPIR-V when adding things to GPU dialect).
The above comments seems to suggest I have a strong opinion, but I am mostly on the fence on this (albeit slightly negative) and lot of this is originating from how these ops are exposed in NVVM/PTX. My over-arching concern is that if we have transformations that target these directly, we cant back track (and I can almost guarantee these ops specifications will change in different versions of CUDA, so we might be hard-coding behavior that is transient, and ill-formed). If we today have code that generate NVVM dialect ops directly and those have to change as the NVVM intrinsics evolve, that makes more sense to me (its a one-off we need performance effort and wont be long term stable). When targeting GPU dialect, I'd expect something more stable.

Two concrete questions:

  • How does this work in presence of control flow (i.e., copies in different blocks that branch to a single block with create_group, and copies in a block that branches to two blocks each with own create_group?

It works as if there were an implicit queue of pending copies and an implicit queue of group submitted. So the create_group op would make a group with copies pending from whatever path was taken.

So just to fill in an offline conversation

copy
copy
if(something)
  copy
commit

Here the commit could have 3 copies or 2 copies based on the conditional.

  • Why require byte size in the copy instead of a subview/subset of memref indices to copy? Specifically, this requires the caller to compute the byte size of the type rather than having the compiler compute it based on the element type. And this lets one read data that is not normally indexed by the memref, e.g., an 8-byte read from a non-contiguous f32 memref with innermost stride 2 will access the memory that cannot be read by a memref.load (4 bytes after the first element), which sounds very annoying from aliasing perspective.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

I echo this point about byte-size. When you say load/store semantics this is not consistent with the load/store semantics in MLIR. memref.load and memref.store is loading an element and not a byte. Also could be source of potential inconsistency. What if my element type is f32 and I say read 2 bytes? Number of elements makes more sense here.
Another related aspect that is a bit strange is that these cannot operate on memref that have an affine maps (as structured). Of course the op-semantics does not allow that, but you implicitly require that the memref be "contiguous" (I left a comment about this in the code as well).

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1231

I think you also need to enforce that the memref region that is read and written is "contiguous". Enforcing that condition especially in dynamic cases will be challenging. You might have to say that this operation only works with 1D memrefs and rely on memref.expand_shape/memref.collapse_shape to get things into the right shape.

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks for the examples. One option of this is to have only the "higher level" abstractions in the GPU dialect, and have the ops that reflect the intrinsic in NVVM directly stay in NVVM dialect. You can have a staged lowering of operations when lowering to NVVM, and lower from the higher level of abstraction in GPU dialect to the lower level of abstraction in NVVM dialect. Basically, these are exactly mirroring the ops in NVVM dialect, and are redundant for the most part. I cant really see why lowering to these are better than using the ops from NVVM dialect directly. Also, these ops are very NVVM specific. I dont know how these ops can be supported when lowering to SPIR-V (its always a useful check point to say what about SPIR-V when adding things to GPU dialect).
The above comments seems to suggest I have a strong opinion, but I am mostly on the fence on this (albeit slightly negative) and lot of this is originating from how these ops are exposed in NVVM/PTX. My over-arching concern is that if we have transformations that target these directly, we cant back track (and I can almost guarantee these ops specifications will change in different versions of CUDA, so we might be hard-coding behavior that is transient, and ill-formed). If we today have code that generate NVVM dialect ops directly and those have to change as the NVVM intrinsics evolve, that makes more sense to me (its a one-off we need performance effort and wont be long term stable). When targeting GPU dialect, I'd expect something more stable.

Discussed this offline but in my opinion doing the complex transformations of isolating groups and "counting" the max number of in flights group during conversion to NVVM sounds like it would mix too many things and would increase the complexity of the code significantly. I think we need a way to express at least those two things before converting to LLVM/SPIRV IR. This feature is not exposed in SPIRV but if it was targeting the same HW, it is hard to imagine how different this could be. And without another equivalent of this feature it is hard to tell how to pick a common abstraction.
I'm open to other designs but I think to be practical it need to expose:

  1. Which ops are in the same group and ideally enforce that they don't interleave with other groups
  2. Be able to represent the result of the analysis of how many groups are in flights (it could just be an optional attribute)

For the second case I had started with an attribute but it turns out that removing the asycn token during conversion to llvm is actually non trivial because the type convert needs something to convert it to.

Two concrete questions:

  • How does this work in presence of control flow (i.e., copies in different blocks that branch to a single block with create_group, and copies in a block that branches to two blocks each with own create_group?

It works as if there were an implicit queue of pending copies and an implicit queue of group submitted. So the create_group op would make a group with copies pending from whatever path was taken.

So just to fill in an offline conversation

copy
copy
if(something)
  copy
commit

Here the commit could have 3 copies or 2 copies based on the conditional.

  • Why require byte size in the copy instead of a subview/subset of memref indices to copy? Specifically, this requires the caller to compute the byte size of the type rather than having the compiler compute it based on the element type. And this lets one read data that is not normally indexed by the memref, e.g., an 8-byte read from a non-contiguous f32 memref with innermost stride 2 will access the memory that cannot be read by a memref.load (4 bytes after the first element), which sounds very annoying from aliasing perspective.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

I echo this point about byte-size. When you say load/store semantics this is not consistent with the load/store semantics in MLIR. memref.load and memref.store is loading an element and not a byte. Also could be source of potential inconsistency. What if my element type is f32 and I say read 2 bytes? Number of elements makes more sense here.
Another related aspect that is a bit strange is that these cannot operate on memref that have an affine maps (as structured). Of course the op-semantics does not allow that, but you implicitly require that the memref be "contiguous" (I left a comment about this in the code as well).

Yes as @ftynse pointed out this is not the right way to do it, and I'm changing the op to take a number of elements instead. If the memref is not contiguous the operation should be broke into scalar like it is done for vector.load/vector.store

The side-effecting behavior makes it difficult to reason about the IR. It would be helpful to document it more instead of cross-referencing ops. State something like "this creates a pending memory access" in the device_async_copy description, and "this will associate pending memory accesses created by preceding device_async_copy operations" in the device_async_create_group. A complete example in the documentation of one of the ops would also help.

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks for the examples. One option of this is to have only the "higher level" abstractions in the GPU dialect, and have the ops that reflect the intrinsic in NVVM directly stay in NVVM dialect. You can have a staged lowering of operations when lowering to NVVM, and lower from the higher level of abstraction in GPU dialect to the lower level of abstraction in NVVM dialect. Basically, these are exactly mirroring the ops in NVVM dialect, and are redundant for the most part. I cant really see why lowering to these are better than using the ops from NVVM dialect directly. Also, these ops are very NVVM specific. I dont know how these ops can be supported when lowering to SPIR-V (its always a useful check point to say what about SPIR-V when adding things to GPU dialect).
The above comments seems to suggest I have a strong opinion, but I am mostly on the fence on this (albeit slightly negative) and lot of this is originating from how these ops are exposed in NVVM/PTX. My over-arching concern is that if we have transformations that target these directly, we cant back track (and I can almost guarantee these ops specifications will change in different versions of CUDA, so we might be hard-coding behavior that is transient, and ill-formed). If we today have code that generate NVVM dialect ops directly and those have to change as the NVVM intrinsics evolve, that makes more sense to me (its a one-off we need performance effort and wont be long term stable). When targeting GPU dialect, I'd expect something more stable.

Discussed this offline but in my opinion doing the complex transformations of isolating groups and "counting" the max number of in flights group during conversion to NVVM sounds like it would mix too many things and would increase the complexity of the code significantly. I think we need a way to express at least those two things before converting to LLVM/SPIRV IR. This feature is not exposed in SPIRV but if it was targeting the same HW, it is hard to imagine how different this could be. And without another equivalent of this feature it is hard to tell how to pick a common abstraction.
I'm open to other designs but I think to be practical it need to expose:

  1. Which ops are in the same group and ideally enforce that they don't interleave with other groups
  2. Be able to represent the result of the analysis of how many groups are in flights (it could just be an optional attribute)

For the second case I had started with an attribute but it turns out that removing the asycn token during conversion to llvm is actually non trivial because the type convert needs something to convert it to.

I am just looking at these ops and comparing the difference between these ops and the NVVM ops. The copy op to create_group op have similar semantics as the NVVM dialect version of these ops. The create_group to wait have different semantics. So you are saying you want to be able to represent ops that allow you to track how many groups are in flight so that when lowering to NVVM you can bridge the gap and account for the relationship between the create_group and the wait ops in NVVM. There is the same implicit relationship between copy and create_group in the NVVM ops, but that implicit relationship is carried over into the GPU dialect. That's the disconnect I am trying to resolve. The higher up the stack this leaks, the more is the cost of it in the longer term.
The argument you are giving is that you don't want to add all of that complexity when lowering to NVVM. That's fair, but that seems to suggest that lowering to NVVM is an atomic step, which it need not be. You could stage it with unrealized_conversion_casts and these get connected up later. So leave aside the ugly mechanics this introduces, whats the difference between the copy_op and create_group added here and the ones added in NVVM dialect.
Anyway, as I mentioned offline (and in the comment above), this whole things makes me uneasy, but is lot driven by the way these are exposed in NVVM (proper). So I am not really blocking on this, but still asking questions. That is a technical debt now we inherit, question is how far up the stack do we let that debt leak.

I am just looking at these ops and comparing the difference between these ops and the NVVM ops. The copy op to create_group op have similar semantics as the NVVM dialect version of these ops. The create_group to wait have different semantics. So you are saying you want to be able to represent ops that allow you to track how many groups are in flight so that when lowering to NVVM you can bridge the gap and account for the relationship between the create_group and the wait ops in NVVM. There is the same implicit relationship between copy and create_group in the NVVM ops, but that implicit relationship is carried over into the GPU dialect. That's the disconnect I am trying to resolve. The higher up the stack this leaks, the more is the cost of it in the longer term.
The argument you are giving is that you don't want to add all of that complexity when lowering to NVVM. That's fair, but that seems to suggest that lowering to NVVM is an atomic step, which it need not be. You could stage it with unrealized_conversion_casts and these get connected up later. So leave aside the ugly mechanics this introduces, whats the difference between the copy_op and create_group added here and the ones added in NVVM dialect.
Anyway, as I mentioned offline (and in the comment above), this whole things makes me uneasy, but is lot driven by the way these are exposed in NVVM (proper). So I am not really blocking on this, but still asking questions. That is a technical debt now we inherit, question is how far up the stack do we let that debt leak.

Ok I had misunderstood what your concern is. So your concern is that there are missing links between copy and commit. I really don't mind adding those as they would just get dropped during translation. The problem is that as far as I know there is no way to cleanly drop those extra types during conversion to llvm. The type convert would require those types to be converted to something as they are not legal types. This is the reason why I added the second version of the wait operation for the commit/wait case. If there is a solution to handle that during conversion to llvm then we remove the second version of the wait op and we can add explicit ssa links between copy and commit.

Change attribute from size in bytes to number of element and add verifier to check that the memref stride is 1

I agree the side effect part is not ideal, however this is the only solution I found to avoid having an abstraction gap during lowering to llvm. I'm open to suggestions if there is a better way to do it. I added a complete example in gpu.device_async_copy description.

Thanks! IMO, there are bigger abstraction gaps that are covered by lowerings, but I don't have a strong opinion here.

That's something I went back and forth on but overall I want a semantic which is equivalent to a load+store rather than a memcpy kind of semantic. I would like to be able to converted from a vector.load+vector.store so using this representation feels more natural. I do think it should have the same restriction as a memref.load and should be treated similarly.

I don't follow here. Neither memref.load/store nor vector.load/store require explicit byte size. The former accesses only one element. The latter has a requirement on the innermost dimension having stride 1, so reads are always contiguous, but their "size" is provided as the number of elements to read (in the vector type of the result). Having byte size is especially problematic because it duplicates (and therefore will end up contradicting) the data layout information. I would rather take the number of elements or reused the vector type somehow. Plus a verifier for a memref stride being either dynamic or static 1.

You are right, I changed the op to take a number of element and add check in the verifier to that the innermost dimension has a stride of 1. Thanks for catching that.

After a longer offline discussions, this is where my head is at,

  1. I think the copy op has an implicit requirement that the memrefs be contiguous. Enforcing that requirement is hard. On the other hand, memref.collapse_shape codifies this semantics. You can reshape a memref type value only if it is contiguous (and the verifier of this op already has some logic to do so). To avoid duplicating that logic, I would rely on having a memref.collapse_shape on the operand and the result and enforce that the input and output memref type have rank 1.
  2. It seems cleaner to me to have an explicit link between the copy operation and the create_group operation. If the issue there is in terms of lowering to LLVM, that should be handle-able by a) Make the gpu.async_copy return a token and the gpu.create_group accept a list of tokens b) Lower gpu.async_copy to nvvm.async_copy + %t = llvm.const <token> (or equivalent, could just be lowered to undef for that matter). Use %t to replace all uses of gpu.async_copy. c) Ignore these values in the operands when lowering the gpu.create_group. I think this works, but I havent tried it out obviously.

    This is actually a slightly different model than what the nvvm ops support. For example ` %t1 = copy A; %t2 = copy B; %t3 = copy C; %g1 = create_group %t1, %t2; %g2 = create_group %t3; ` should be illegal currently. Catching this case (and similar cases) will require additional verification using dominance, etc. Still I think approach is better and we could have verifications that allow only a subset of interleavings valid.

So to restate the Point (2) above, I see as it could go either way, and I can put that down as preference. The bigger issue for me though is Point (1) above cause that will result in silent miscompilation in cases where the memref is not contiguous. Whether a memref is contiguous is not always provable and even for cases it is, I'd rather not duplicate the logic to check this many places.

I am fine if the consensus is to go ahead with the patch as is. I am not stamping to give others a chance to respond, but happy to stamp. I think I have raised all my concerns.

After a longer offline discussions, this is where my head is at,

  1. I think the copy op has an implicit requirement that the memrefs be contiguous. Enforcing that requirement is hard. On the other hand, memref.collapse_shape codifies this semantics. You can reshape a memref type value only if it is contiguous (and the verifier of this op already has some logic to do so). To avoid duplicating that logic, I would rely on having a memref.collapse_shape on the operand and the result and enforce that the input and output memref type have rank 1.

That's not the semantic I had in mind for this op. I don't think we want to read element across dimensions. The number of elements should be read only along the most inner dimension in my opinion and I think it might make a difference if we later introduce masking.
I can look more into what you are suggesting but I would rather not change to this semantic in this patch or until I tried it on a prototype.

  1. It seems cleaner to me to have an explicit link between the copy operation and the create_group operation. If the issue there is in terms of lowering to LLVM, that should be handle-able by a) Make the gpu.async_copy return a token and the gpu.create_group accept a list of tokens b) Lower gpu.async_copy to nvvm.async_copy + %t = llvm.const <token> (or equivalent, could just be lowered to undef for that matter). Use %t to replace all uses of gpu.async_copy.

I don't believe llvm.const token would be valid. What worked for me was to convert the token type to a dummy i32. and just leave it dead. I don't mind going in that direction if people overall agree that it is better, I changed it because it felt too hacky at the time. Let me know what you prefer.

I don't have a strong opinion on 2) and I'm fine with either representation as long as we have a practical way to do the lowering.

After a longer offline discussions, this is where my head is at,

  1. I think the copy op has an implicit requirement that the memrefs be contiguous. Enforcing that requirement is hard. On the other hand, memref.collapse_shape codifies this semantics. You can reshape a memref type value only if it is contiguous (and the verifier of this op already has some logic to do so). To avoid duplicating that logic, I would rely on having a memref.collapse_shape on the operand and the result and enforce that the input and output memref type have rank 1.

That's not the semantic I had in mind for this op. I don't think we want to read element across dimensions. The number of elements should be read only along the most inner dimension in my opinion and I think it might make a difference if we later introduce masking.
I can look more into what you are suggesting but I would rather not change to this semantic in this patch or until I tried it on a prototype.

  1. It seems cleaner to me to have an explicit link between the copy operation and the create_group operation. If the issue there is in terms of lowering to LLVM, that should be handle-able by a) Make the gpu.async_copy return a token and the gpu.create_group accept a list of tokens b) Lower gpu.async_copy to nvvm.async_copy + %t = llvm.const <token> (or equivalent, could just be lowered to undef for that matter). Use %t to replace all uses of gpu.async_copy.

I don't believe llvm.const token would be valid. What worked for me was to convert the token type to a dummy i32. and just leave it dead. I don't mind going in that direction if people overall agree that it is better, I changed it because it felt too hacky at the time. Let me know what you prefer.

When I said const I didn't really mean an actual const. I was thinking more of a dummy i32 as well. My bad on not being careful enough with the terminology. I dont see it as hacky, but ¯\_(ツ)_/¯ . Thats the only way to have a link at gpu dialect level and not have the link at NVVM level. Seems OK to me.

I don't have a strong opinion on 2) and I'm fine with either representation as long as we have a practical way to do the lowering.

Add SSA link between copy and create_group

After a longer offline discussions, this is where my head is at,

  1. I think the copy op has an implicit requirement that the memrefs be contiguous. Enforcing that requirement is hard. On the other hand, memref.collapse_shape codifies this semantics. You can reshape a memref type value only if it is contiguous (and the verifier of this op already has some logic to do so). To avoid duplicating that logic, I would rely on having a memref.collapse_shape on the operand and the result and enforce that the input and output memref type have rank 1.

That's not the semantic I had in mind for this op. I don't think we want to read element across dimensions. The number of elements should be read only along the most inner dimension in my opinion and I think it might make a difference if we later introduce masking.
I can look more into what you are suggesting but I would rather not change to this semantic in this patch or until I tried it on a prototype.

  1. It seems cleaner to me to have an explicit link between the copy operation and the create_group operation. If the issue there is in terms of lowering to LLVM, that should be handle-able by a) Make the gpu.async_copy return a token and the gpu.create_group accept a list of tokens b) Lower gpu.async_copy to nvvm.async_copy + %t = llvm.const <token> (or equivalent, could just be lowered to undef for that matter). Use %t to replace all uses of gpu.async_copy.

I don't believe llvm.const token would be valid. What worked for me was to convert the token type to a dummy i32. and just leave it dead. I don't mind going in that direction if people overall agree that it is better, I changed it because it felt too hacky at the time. Let me know what you prefer.

When I said const I didn't really mean an actual const. I was thinking more of a dummy i32 as well. My bad on not being careful enough with the terminology. I dont see it as hacky, but ¯\_(ツ)_/¯ . Thats the only way to have a link at gpu dialect level and not have the link at NVVM level. Seems OK to me.

Ok I added SSA links between copy and create_group that I'm dropping as dummy i32 during conversion to NVVM. I removed the wait_group op as I don't need it anymore with this change I can just use an optional attribute. Please take another look. Note that there currently no verification that the order is correct to keep the patch small.

mravishankar accepted this revision.Feb 10 2022, 2:17 PM

Looks good.

Misread the semantics w.r.t to memref. I was thinking more vector.transfer_read/transfer_write than vector.load/store.

This revision is now accepted and ready to land.Feb 10 2022, 2:17 PM
This revision was automatically updated to reflect the committed changes.