This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Add `gpu.alloc_managed`
Needs Review · Public

Authored by guraypp on Jul 10 2023, 6:08 AM.

Details

Summary

This work adds a new op, gpu.alloc_managed, that allocates a memory region whose pointer is visible to both the GPU and the CPU. It is similar to memref.alloc; however, the data is migrated automatically between GPU and CPU by the driver, via page faults or underlying software.

gpu.alloc_managed works in a synchronous fashion, not asynchronously. Therefore, it cannot be mapped to the existing gpu.alloc, which can be executed asynchronously. Note that gpu.alloc has a host_shared option, which is not used anywhere; I am not sure what the intent was there.
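A minimal sketch of how the proposed op might be used, assuming its assembly format mirrors the synchronous form of gpu.alloc (the exact syntax is whatever the diff defines):

  func.func @use_managed() {
    // Synchronous allocation: no async token is produced. The buffer
    // is visible to both host and device; the driver migrates the
    // data on demand.
    %buf = gpu.alloc_managed () : memref<8x64xf32>
    // ... read/write %buf from host code or GPU kernels ...
    // Whether gpu.dealloc handles managed buffers is an assumption here.
    gpu.dealloc %buf : memref<8x64xf32>
    return
  }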

Diff Detail

Event Timeline

guraypp created this revision. Jul 10 2023, 6:08 AM
Herald added a reviewer: dcaballe.
Herald added a project: Restricted Project.
guraypp requested review of this revision. Jul 10 2023, 6:08 AM

This op works in a synchronous fashion; therefore, it cannot be mapped to the existing gpu.alloc, which is executed asynchronously. gpu.alloc has host_shared; I guess the intent was to use managed memory. However, it is not used anywhere, so this work removes it.

We are using host_shared downstream. I personally don't have a preference between a flag and a separate op (though a separate op will require more copy-paste in the .td file), but the async argument is not relevant here, as the existing gpu.alloc cannot work asynchronously either (the stream argument is just ignored).

We are using host_shared downstream.

Thanks for the review! I decided to delete it because I didn't see any use of host_shared. Are you using it to allocate managed memory or unified memory?

I personally don't have a preference between a flag and a separate op (though a separate op will require more copy-paste in the .td file), but the async argument is not relevant here, as the existing gpu.alloc cannot work asynchronously either (the stream argument is just ignored).

async can be useful for NVIDIA targets, where one can call `cuMemAllocAsync`. I can put up another PR that uses the stream, at least for NVIDIA GPUs.
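For reference, this is the existing async form of gpu.alloc from the op documentation, which such a PR could lower to `cuMemAllocAsync` on the stream backing the token:

  %memref, %token = gpu.alloc async [%dep] (%width) : memref<64x?xf32, 1>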

I created a new op because gpu.alloc implements GPU_AsyncOpInterface, and managed memory allocation is not an asynchronous operation. I don't know of any system that does this asynchronously.

We are using sycl::malloc_shared for host_shared allocations (and sycl::malloc_device for device ones); SYCL doesn't support async allocations, but doing them synchronously will still produce the expected observable results, so having dependencies is harmless. IMO, the proposed separation is too CUDA-specific (as is the managed name), and the GPU dialect is intended to be a more cross-API dialect. If some implementation doesn't support a specific combination of flags and async, it can just safely fall back to a synchronous alloc.

We are using sycl::malloc_shared for host_shared allocations (and sycl::malloc_device for device ones); SYCL doesn't support async allocations, but doing them synchronously will still produce the expected observable results, so having dependencies is harmless.

Sounds like SYCL needs asynchronous allocation.

IMO, the proposed separation is too CUDA-specific (as is the managed name), and the GPU dialect is intended to be a more cross-API dialect. If some implementation doesn't support a specific combination of flags and async, it can just safely fall back to a synchronous alloc.

It is not CUDA-specific. I've implemented it in HIP, which has the same name and implementation.

managed means that the data is managed by the driver or runtime. SYCL's malloc_shared does not say anything about how the data is shared.

I would wait for other people's opinions, but naming aside, I still don't see much reason for this change: you are encoding a specific API restriction (I believe a dedicated enough person should be able to implement async allocs even for host_shared/managed memory, something along the lines of calling the allocation from a different thread), and you are copy-pasting the entire op (you forgot to copy-paste the verifier, and your copy-pasted canonicalization is missing tests and won't work).

guraypp updated this revision to Diff 540919. Jul 17 2023, 2:39 AM

rebase
add verifier
dont remove host_shared

guraypp edited the summary of this revision. Jul 17 2023, 3:01 AM
guraypp updated this revision to Diff 540970. Jul 17 2023, 5:32 AM
guraypp edited the summary of this revision.

fix issues in verifier

@Hardcode84 does this look fine to you? I added a verifier and put host_shared back.

I would wait for other people's opinions, but naming aside, I still don't see much reason for this change,

We need a new op. As I mentioned, gpu.alloc is asynchronous, while gpu.alloc_managed cannot allocate asynchronously.

One alternative solution could be:

  1. Make the existing gpu.alloc a synchronous op. Use host_shared for managed memory.
  2. Add a new op, gpu.alloc_async, for asynchronous allocation.

Either way, we need a new op. A rough sketch of option 2 follows.
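In that alternative, gpu.alloc_async is hypothetical and not part of this patch; the IR could look roughly like this:

  // 1. gpu.alloc becomes synchronous; host_shared covers managed memory.
  %shared = gpu.alloc host_shared () : memref<8x64xf32>
  // 2. A hypothetical gpu.alloc_async carries the async token instead.
  %m, %t = gpu.alloc_async [%dep] () : memref<8x64xf32>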

As I said, asynchronous allocation is too tied to a specific runtime; there are other runtimes that don't support async allocations at all. For SYCL we do the allocation synchronously regardless of the async tokens passed (which may be suboptimal, but will still produce a correct result); you can do the same (ignore the stream if host_shared is passed). IMO, adding a new op in addition to host_shared just bloats the code and the API (btw, your copy-pasted canonicalization is still lacking tests and won't actually work).

Alternatively, you can just disallow having both host_shared and async tokens on the alloc in the verifier.
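Under that rule, the verifier would reject IR like the following (illustrative only; the combination follows the op's documented syntax):

  // Invalid under the proposed restriction: host_shared together
  // with async tokens.
  %m, %t = gpu.alloc async [%dep] host_shared () : memref<8x64xf32>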

Matt added a subscriber: Matt. Jul 19 2023, 12:56 PM
guraypp added a comment (edited). Jul 19 2023, 1:05 PM

As I said, asynchronous allocation is too tied to a specific runtime; there are other runtimes that don't support async allocations at all.

This isn't important. Allocation can be done asynchronously. The op is designed to be async. I hope we are on the same page about that. CUDA is just an example.

for SYCL we do the allocation synchronously regardless of the async tokens passed

It sounds incorrect to me.

(which may be suboptimal, but will still produce a correct result)

Running a parallel program sequentially would also produce a correct result.

you can do the same (ignore the stream if host_shared is passed). IMO,

Are you proposing to continue down the incorrect path?

adding a new op in addition to host_shared just bloats the code and the API

The PR deleted host_shared initially. I put it back because you asked. I think we should delete it!

(btw, your copy-pasted canonicalization is still lacking tests and won't actually work).

This is the only productive comment from your side. I will ask you to elaborate: why will it not work?

The Op is designed to be async.

This op can work either sync or async, depending on whether it has async tokens or not; quoting from the doc:

If the `async` keyword is present, the op is executed asynchronously (i.e.
it does not block until the execution has finished on the device). In
that case, it also returns a !gpu.async.token.

host_shared without tokens will give you the exact semantics you want.
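That is, something like this, which blocks the host until the allocation completes:

  // Synchronous form: no async keyword, no token returned.
  %buf = gpu.alloc host_shared () : memref<8x64xf32>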

It sounds incorrect to me.

There is nothing we can do there (and the same will apply to OpenCL).

you can do the same (ignore stream if host_shared is passed). IMO,

Are you proposing to continue incorrect path?

I don't see a big issue here (especially considering no one has bothered to actually change the CUDA impl to async in years).

Anyway, I'm not a GPU code owner, and I want to hear other people's opinions (@csigg and @bondhugula reviewed the original host_shared flag).

@guraypp

Also, do you have any specific use case that benefits from async allocs? It would be interesting to look at it.

@guraypp

Also, do you have any specific use case that benefits from async allocs? It would be interesting to look at it.

Large memory allocations are expensive. One can imagine overlapping work on the host with async allocation of device memory.
This work is not related to async allocs, though, so I don't have one.