This is an archive of the discontinued LLVM Phabricator instance.

[mlir][GPU] Add gpu.create_queue op and add queue as an optional argument to gpu ops
Needs Review · Public

Authored by Hardcode84 on Jul 9 2023, 3:34 PM.

Details

Summary

Supersedes https://reviews.llvm.org/D146556

This patch introduces a new op, gpu.create_queue, and adds a queue as an optional argument to some gpu dialect ops.
Some GPU runtimes (OpenCL/SYCL) pass a queue as an explicit argument to all operations.
An explicit queue allows more flexibility in scheduling kernels, e.g. interleaving execution across multiple devices, or creating separate queues for kernel execution and data copying.
Queues do not introduce any additional synchronization semantics to the ops; async dependencies are still controlled by async tokens.
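
A minimal sketch of what the proposed IR might look like (hypothetical assembly based on this patch's description; the exact spelling of gpu.create_queue, the queue type, and the placement of the queue operand were still under review):

    // Hypothetical queue creation op and type, as proposed in this patch.
    %queue = gpu.create_queue : !gpu.queue

    // Existing async ops, with the proposed optional queue operand in <>.
    // The queue adds no synchronization; ordering still comes from async tokens.
    %t0 = gpu.wait async
    %memref, %t1 = gpu.alloc <%queue> async [%t0] (%size) : memref<?xf32>
    %t2 = gpu.launch_func <%queue> async [%t1] @kernels::@kernel
              blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
              args(%memref : memref<?xf32>)
    gpu.wait [%t2]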

The link to the discussion is:
https://discourse.llvm.org/t/proposal-to-add-stream-queue-as-an-optional-argument-to-few-gpu-dialect-ops/67920/13

Diff Detail

Event Timeline

Hardcode84 created this revision.Jul 9 2023, 3:34 PM
Herald added a project: Restricted Project.
Hardcode84 requested review of this revision.Jul 9 2023, 3:34 PM
Hardcode84 updated this revision to Diff 538464.Jul 9 2023, 4:01 PM

check for stream in gpu-to-llvm

Hardcode84 updated this revision to Diff 538471.Jul 9 2023, 5:57 PM

update doc

aartbik added inline comments.Jul 10 2023, 2:35 PM
mlir/docs/Dialects/GPU.md
69

I don't feel super strongly about this, but should we adjust our syntax a bit?
(1) it feels like the queue should be up front, just like the tokens
(2) since we use [ ... ] for tokens, we could use something similar for the queue, e.g. <%queue1>,

or, alternatively, use something keyword-based like
      OP tokens %token1 queue %queue

Again, not a blocker, and I don't have a clean design in mind yet, but it feels like this is mixing and matching styles a bit.
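
(For reference, the two spellings discussed above might look roughly like this; both are hypothetical illustrations, not syntax taken from the patch:)

    // Angle-bracket form, with the queue up front like the token list.
    %t = gpu.launch_func <%queue1> async [%token1] @kernels::@kernel
             blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)

    // Keyword-based form.
    %t2 = gpu.launch_func async [%token1] queue %queue1 @kernels::@kernel
              blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)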

rebase, use <> for queue arg

Hardcode84 marked an inline comment as done.Jul 13 2023, 7:27 AM
Hardcode84 added inline comments.
mlir/docs/Dialects/GPU.md
69

Yes, I think it looks better.

Hardcode84 marked an inline comment as done.

update doc

I don't think the discussion we had in the thread converged?

> I don't think the discussion we had in the thread converged?

Right, no activity in the thread.

mehdi_amini added inline comments.Jul 22 2023, 3:24 PM
mlir/docs/Dialects/GPU.md
44

This is back to the discussion in the thread, where I don't think I got clear answers, and I'd rather see it addressed there.

Right now it is claimed that the queues "allow more fine-grained control over kernel scheduling", but I don't quite get this argument: since they are entirely out-of-order, it's not clear to me where the "fine-grained control" is here.

  • Have different queues for kernel execution and data copying.

Why is it useful? Since queues are out-of-order, other than mapping to a given device, I don't see the semantic difference between one queue and two here.

Hardcode84 added inline comments.Jul 24 2023, 9:32 AM
mlir/docs/Dialects/GPU.md
44

Our main use case is running on multiple devices, which you cannot express with the current dialect at all. Parallelizing execution and copying is handled automatically in SYCL, yes, but for lower-level APIs (Intel Level Zero) you need to create two separate queues (which is what SYCL does under the hood).
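
(A hypothetical sketch of that use case with the proposed op; how a queue gets bound to a particular device, and the exact operand placement, are assumptions and not spelled out here:)

    // Separate queues for data transfers and kernel execution (illustrative only).
    %copy_queue    = gpu.create_queue : !gpu.queue
    %compute_queue = gpu.create_queue : !gpu.queue

    %t0 = gpu.wait async
    %dst, %t1 = gpu.alloc <%copy_queue> async [%t0] (%size) : memref<?xf32>
    %t2 = gpu.memcpy <%copy_queue> async [%t1] %dst, %src : memref<?xf32>, memref<?xf32>
    %t3 = gpu.launch_func <%compute_queue> async [%t2] @kernels::@kernel
              blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
              args(%dst : memref<?xf32>)
    gpu.wait [%t3]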

mehdi_amini added inline comments.Jul 24 2023, 4:20 PM
mlir/docs/Dialects/GPU.md
44

> Our main use case is running on multiple devices, which you cannot express with the current dialect at all.

OK, but we're mixing two things here: 1) how to map launches to a specific device and 2) scheduling aspects.

You're still not addressing my questions clearly, neither here nor in the Discourse thread; it's not clear to me how we'll converge on this patch right now.