This is an archive of the discontinued LLVM Phabricator instance.

[MLIR][GPU] Add canonicalization patterns for folding simple gpu.wait ops.
ClosedPublic

Authored by arnab-oss on Mar 16 2022, 9:10 PM.

Details

Summary
  • Fold away redundant %t = gpu.wait async + gpu.wait [%t] pairs.
  • Fold away %t = gpu.wait async ... ops when %t has no uses.
  • Fold away gpu.wait [] ops.
  • In case of %t1 = gpu.wait async [%t0], replace all uses of %t1 with %t0.

Diff Detail

Event Timeline

arnab-oss created this revision.Mar 16 2022, 9:10 PM
Herald added a project: Restricted Project. · View Herald Transcript
arnab-oss requested review of this revision.Mar 16 2022, 9:10 PM
arnab-oss updated this revision to Diff 416065.Mar 16 2022, 9:28 PM

Resolved build issues.

Thanks for taking care of this!

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228

Untested, but would it work to simply remove the async dependency and let the other canonicalizer take care of erasing ops?

ping @csigg, can you please take look at my comment?

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228
%0 = gpu.wait async
 %memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait [%0]

In the above example, we cannot exactly remove the %0 async dependency from the 2nd gpu.wait right @csigg ?

bondhugula added inline comments.Apr 5 2022, 12:24 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228

@arnab-oss Does %memref have any uses in this example? If it doesn't, this is all dead code. Is the rule eliminating these now? Can you add this test case?

bondhugula added inline comments.Apr 8 2022, 11:22 PM
mlir/test/Dialect/GPU/canonicalize.mlir
12

Nit: You can just instead use:

CHECK-NEXT: return

for a stronger and more direct check.

bondhugula requested changes to this revision.Apr 9 2022, 8:09 PM
bondhugula added inline comments.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258

These conditions are all completely wrong: you can't erase a gpu.wait op just because its result token has no uses. Also, you can't simply replace its uses (which could be empty) by its aysnc dependencies and replace the op.

Can you please read through the gpu.wait doc description (which is pretty straightforward and simple) before adding these patterns?

This revision now requires changes to proceed.Apr 9 2022, 8:09 PM
mehdi_amini added inline comments.Apr 9 2022, 8:13 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1230

Please remove.

1248–1258

I'm confused, the description seems to exactly call for this:

If the op contains the async keyword, it returns a new async token which
is synchronized with the op arguments. This new token is merely a shortcut
to the argument list, and one could replace the uses of the result with the
arguments for the same effect.

bondhugula added inline comments.Apr 9 2022, 8:18 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1233

In fact, 1) is just a trivial special case of 3), but anyway, both (1) and (3) are wrong. You can't erase the gpu.wait op if its token has no uses. The op synchronizes the device/host (internally, it destroys the stream) You can only replace it if its result token has at least one use that is another gpu.wait op.

bondhugula added inline comments.Apr 9 2022, 8:20 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258

If none of those result uses are another gpu.wait op, you've eliminated a device synchronization incorrectly.

bondhugula added inline comments.Apr 9 2022, 8:33 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258

Actually, the documentation is messed up here, it says:

If the op does not contain the `async` keyword, it does not return a new
    async token but blocks until all ops producing the async dependency tokens
    finished execution.

But the op can definitely return a new token even without the async keyword. That's how a fresh token is created in the first place!

mehdi_amini added inline comments.Apr 9 2022, 8:34 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258

How so? What is specific to gpu.wait in how it is using a token that does not apply to the other users?
Also: what kind of synchronization does an asynchronous gpu.wait without users providing?

(if you can also explain how you interpret what the doc says).

Dismissing change request.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1248–1258

Sorry, looks like I was mistaken on this part (whenever you return an async token and also have async dependencies).

This revision now requires review to proceed.Apr 9 2022, 8:55 PM
bondhugula accepted this revision.Apr 9 2022, 9:03 PM

LGTM - mostly doc/code comment fix suggestions.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1232
  1. %t = gpu.wait ... ops where %t has no uses (regardless of async dependencies).
1232–1234

Nit: The order of these doesn't match the order in the implementation. Permute.

1233

gpu.wait ops that neither have any async dependencies nor return any token.

1242–1243

Erase gpu.wait ops that neither have any async dependencies nor return any async token.

1248–1249

Replace uses of %t1 = gpu.wait async [%t0] ops with %t0 and erase the op.

This revision is now accepted and ready to land.Apr 9 2022, 9:03 PM
mehdi_amini added inline comments.Apr 9 2022, 9:14 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1253

Ideally this should be implemented as a folder I think.

bondhugula added inline comments.Apr 9 2022, 11:27 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1253

Yes, I think this pattern can be the folding hook since we aren't erasing any other ops.

mehdi_amini added inline comments.Apr 10 2022, 10:20 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1253

Unfortunately I think the folding hook can't implement conditional deletion, only replacements. Can it?
Otherwise the eraseOp() can't be moved there.

csigg added inline comments.Apr 11 2022, 12:14 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228
%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait [%0]

In the above example, we cannot exactly remove the %0 async dependency from the 2nd gpu.wait right @csigg ?

I think we can. The second gpu.wait does not synchronize with anything. The order of the gpu.alloc and second gpu.wait can be swapped, unless gpu.wait also uses %asyncToken (which would still allow you to remove %0).

bondhugula added inline comments.Apr 12 2022, 3:57 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228

But if you remove the gpu.wait, you've left a stream "undestroyed". Also, we shouldn't be converting an async alloc to a sync alloc -- the latter won't even lower through gpu-to-llvm - so I'm confused here.

csigg added inline comments.Apr 12 2022, 6:41 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228

But if you remove the gpu.wait, you've left a stream "undestroyed".

I assume there is a gpu.wait [<dependency of %asyncToken>] somewhere. Otherwise the input is not really valid.

Also, we shouldn't be converting an async alloc to a sync alloc

That's not what's happening.

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait [%0]

should be folded to

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait []

should be folded to

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
bondhugula added inline comments.Apr 12 2022, 6:55 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228

I think I see where the difference in understanding is. I was going by how the lowering functions worked (to CUDA streams) as opposed to the op's semantics. In the snippet below, asyncToken doesn't need to have a use later: when the wait on %0 happens, the stream is synchronized and destroyed: this means the memcpy will be completed (even though %asyncToken isn't specified as a dep on it) and anything else attached to that stream. And one doesn't need another use of %asyncToken -- this IR currently lowers and executes correctly as intended AFAIU. In fact, deleting the second gpu.wait would eliminate a synchronization and violate semantics.

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait [%0]

However, per the doc, the second gpu.wait would really "block" on the completion of the first op which is a trivial block and so can be eliminated.

bondhugula added inline comments.Apr 12 2022, 4:45 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228

@csigg Furthermore, based on what you said, the following would be valid IR:

%0 = gpu.wait async
%memref, %asyncToken = gpu.alloc async [%0] () : memref<5xf16>
gpu.wait [%0]
gpu.wait %asyncToken

But won't this crash on lowering today since it would destroy the same stream twice?
https://github.com/llvm/llvm-project/blob/32f3633171aa9d7352e9507c12d219efb48899a0/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp#L560

mehdi_amini added inline comments.Apr 12 2022, 9:39 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228

The notion of stream being destroyed in a wait or being left undestroyed is a bit confusing to me, and I can't connect this to the concept of token and async dependencies. Seems like some lowering issues to me more than anything to fix at this level of abstraction.

mehdi_amini added inline comments.Apr 12 2022, 9:45 PM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1211

Why does multiple uses prevent optimizing it away?

1222

Seems like instead of handling by looking at the user, we should root the pattern on the user and look for the operands.
That way it would handle the "multiple uses" above, and it could operate by eliminating operands from the operands list of the gpu.wait.
An async gpu.wait without users could be just deleted.

bondhugula added inline comments.Apr 13 2022, 2:25 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1193–1228

I agree - this all makes sense to me and I think @csigg's change can safely be applied here.

1248–1258

A gpu.wait that returns a token doesn't provide any synchronization. We are talking about a gpu.wait that doesn't return an async token but has async deps. This as you know will block till those producers are complete. If one of those producers is a gpu.wait, we can erase it -- this is fine per the semantics of the op and this is why @csigg's suggestion above can be applied.

bondhugula added inline comments.Apr 13 2022, 2:27 AM
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
1222

An async gpu.wait without users could be just deleted.

The pattern further below already does this: see line 1218.

arnab-oss updated this revision to Diff 422466.Apr 13 2022, 4:08 AM
arnab-oss marked 18 inline comments as done.

Adressed comments by @csigg, @bondhugula

bondhugula accepted this revision.Apr 13 2022, 5:38 AM