This is an archive of the discontinued LLVM Phabricator instance.

Add async dependencies support for gpu.launch op
ClosedPublic

Authored by bondhugula on Apr 11 2022, 5:50 AM.

Details

Summary

Add async dependencies support for gpu.launch op: this allows specifying
a list of async tokens ("streams") as dependencies for the launch.

Update the GPU kernel outlining pass lowering to propagate async
dependencies from gpu.launch to gpu.launch_func op. Previously, a new
stream was being created and destroyed for a kernel launch. The async
deps support allows the kernel launch to be serialized on an existing
stream.

Diff Detail

Event Timeline

bondhugula created this revision.Apr 11 2022, 5:50 AM
Herald added a project: Restricted Project. · View Herald Transcript
bondhugula requested review of this revision.Apr 11 2022, 5:50 AM
csigg added a comment.Apr 11 2022, 6:19 AM

Not opposed to this change at all, but what's the motivation for allowing gpu-async-region to run before gpu-kernel-outlining?

Some cleanup.

bondhugula edited the summary of this revision. (Show Details)

Update commit summary.

bondhugula added a comment.EditedApr 11 2022, 6:52 AM

Not opposed to this change at all, but what's the motivation for allowing gpu-async-region to run before gpu-kernel-outlining?

I didn't know that pass existed! Do let me know if that approach subsumes or is more canonical/systematic than this one. I guess if a user wishes to explicitly specify which streams the kernels should be part of and how they should be ordered/overlapped, they could go with not attaching aysnc deps on gpu.launch and let gpu-async-region mark them async with deps. If they wanted full control, they would specify these. So, there is still motivation for running gpu-async-region before gpu-kernel-outlining in pipelines -- but could I know which pipeline/use case you were referring to?

On a separate note, at least with NVIDIA GPUs, aren't all kernel launches always asynchronous (w.r.t host)? What does it mean to not mark a gpu.launch_func async?! Does the GPU dialect leave that unspecified (target-dependent)?

csigg added a comment.Apr 11 2022, 8:11 AM

The gpu-async-region pass simply chains together sequences of gpu ops, with the intention of using async.execute to separate independent work that runs on separate streams. For that case, gpu ops can be synchronous during lowering from higher dialects because the async.execute regions specify which gpu ops should run in sequence and which ones can run in parallel.

But you are right, users might want to set dependencies though tokens manually, and might want to do that before lowering from gpu.launch to gpu.launch_func.

On a separate note, at least with NVIDIA GPUs, aren't all kernel launches always asynchronous (w.r.t host)? What does it mean to not mark a gpu.launch_func async?! Does the GPU dialect leave that unspecified (target-dependent)?

A gpu.launch_func without async implies that it is synchronous (same for gpu.memset, gpu.memcpy etc). If the lowering to the gpu runtime wouldn't reject it, it would need to insert a mgpuStreamSynchronize call.

A gpu.launch_func without async implies that it is synchronous (same for gpu.memset, gpu.memcpy etc). If the lowering to the gpu runtime wouldn't reject it, it would need to insert a mgpuStreamSynchronize call.

I see -- it's been made "synchronous" as a result of the synchronize call inserted below. This makes sense to me - thanks. But in that case, async with 0 async dep tokens wouldn't appear to be a meaningful configuration for the op. The lowering does check for it and fails, but should it just be disallowed? For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print). It should perhaps be a parse error and similarly for the launch_func op?

Add AsyncOpInterface to gpu.launch and move common methods in GPUDialect.cpp up.

Sorted order.

csigg added a comment.Apr 12 2022, 8:01 AM

async with 0 async dep tokens wouldn't appear to be a meaningful configuration for the op. The lowering does check for it and fails, but should it just be disallowed?

I'm not sure. I would say the semantics are clear if an op uses async (the host does not wait for the op to complete) but no dependencies (it can run immediately without waiting for anything else), and it's OK for the current lowering to be limited in what it can handle and rely on gpu-async-region to bring it into lowering-compatible form. I kind of like the symmetry of these ops (including gpu.wait, where gpu.wait async [] needs to be valid).

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print). It should perhaps be a parse error and similarly for the launch_func op?

I don't think this is a good idea. The 'async' keyword stands for the !gpu.async.token return type and should really be independent of the token operands inside []. I would not infer the former from the latter, but fail verification (for all but gpu.wait) if we want that.

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print).

I don't understand what you're describing: you don't get to control the number of output of the operation, this is parsed by MLIR before your custom parser and if you don't provide a result type matching the number of result MLIR will fatal_error() anyway.

async with 0 async dep tokens wouldn't appear to be a meaningful configuration for the op. The lowering does check for it and fails, but should it just be disallowed?

I'm not sure. I would say the semantics are clear if an op uses async (the host does not wait for the op to complete) but no dependencies (it can run immediately without waiting for anything else), and it's OK for the current lowering to be limited in what it can handle and rely on gpu-async-region to bring it into lowering-compatible form. I kind of like the symmetry of these ops (including gpu.wait, where gpu.wait async [] needs to be valid).

This makes sense to me.

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print). It should perhaps be a parse error and similarly for the launch_func op?

I don't think this is a good idea. The 'async' keyword stands for the !gpu.async.token return type and should really be independent of the token operands inside []. I would not infer the former from the latter, but fail verification (for all but gpu.wait) if we want that.

This makes sense to me too -- follows from the previous para. I'll update the revision to make sure there is a meaning to the "async" keyword (regardless of the tokens) in that it returns a token indicating async execution.

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print).

I don't understand what you're describing: you don't get to control the number of output of the operation, this is parsed by MLIR before your custom parser and if you don't provide a result type matching the number of result MLIR will fatal_error() anyway.

I meant "adding a result type". A single result type will be added if there is an async keyword. Why would it be a fatal error instead of a parse error, which it should be if a user provided a number of results on the LHS that are inconsistent with the expected number?

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print).

I don't understand what you're describing: you don't get to control the number of output of the operation, this is parsed by MLIR before your custom parser and if you don't provide a result type matching the number of result MLIR will fatal_error() anyway.

I meant "adding a result type". A single result type will be added if there is an async keyword. Why would it be a fatal error instead of a parse error, which it should be if a user provided a number of results on the LHS that are inconsistent with the expected number?

Right it'll be a parse error

Fix semantics and syntax to allow async without any deps.

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print). It should perhaps be a parse error and similarly for the launch_func op?

I don't think this is a good idea. The 'async' keyword stands for the !gpu.async.token return type and should really be independent of the token operands inside []. I would not infer the former from the latter, but fail verification (for all but gpu.wait) if we want that.

This makes sense to me too -- follows from the previous para. I'll update the revision to make sure there is a meaning to the "async" keyword (regardless of the tokens) in that it returns a token indicating async execution.

Done and addressed. I've updated the doc comment as well. @csigg, this is ready for review now.

Missed updates for gpu.launch -> gpu.launch_func.

Add a couple more test cases for the outlining pass.

Any more comments here @csigg ?

csigg accepted this revision.Apr 21 2022, 1:02 AM
csigg added inline comments.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
490

Isn't this checked elsewhere already? Also, the error messages refers to the 'async dependencies' instead of the 'async keyword'.

This revision is now accepted and ready to land.Apr 21 2022, 1:02 AM

Adjust error message.

bondhugula marked an inline comment as done.Apr 21 2022, 3:55 AM
bondhugula added inline comments.
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
490

Adjusted the error message to refer to 'async keyword'. This isn't checked anywhere else (not in any automatically generated verifier hooks); we check for this during (custom) parsing, but the verifier needs to separately check for it as well.

This revision was landed with ongoing or failed builds.Apr 21 2022, 3:56 AM
This revision was automatically updated to reflect the committed changes.
bondhugula marked an inline comment as done.