
[mlir][gpu] Introduce `shared` flag to `gpu.alloc`
ClosedPublic

Authored by Hardcode84 on Sep 8 2022, 3:05 PM.

Details

Summary

Motivation: we have a lowering pipeline based on the upstream gpu and spirv dialects, and we are using shared GPU memory to transfer data between host and device.
Add a shared flag to gpu.alloc to distinguish between shared and device-only GPU memory allocations.
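A minimal sketch of the proposed assembly syntax (the flag was later renamed host_shared during the review below; the SSA names are illustrative):

```mlir
// Device-only allocation (existing behavior, default).
%memref = gpu.alloc () : memref<13xf32>

// Proposed flagged allocation, accessible from both host and device.
%memref2 = gpu.alloc host_shared () : memref<13xf32>
```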

Diff Detail

Event Timeline

Hardcode84 created this revision.Sep 8 2022, 3:05 PM
Hardcode84 requested review of this revision.Sep 8 2022, 3:05 PM

This patch doesn't look complete. You would lower this incorrectly to LLVM (gpu-to-llvm) since the new attribute is being ignored?

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
943

Have you now created a double space if the shared keyword isn't present? The space should have been inside the parentheses for shared here?

We are not using upstream gpu-to-llvm lowering.

mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
943

I can do this, but it seems it is already covered by the generated printer, as only a single space is generated either way.

bondhugula requested changes to this revision.Sep 11 2022, 6:55 AM

We are not using upstream gpu-to-llvm lowering.

Would the current gpu-to-llvm conversion behavior be correct in the presence of this flag? Shouldn't it fail?

This revision now requires changes to proceed.Sep 11 2022, 6:55 AM

do not print space, add check in llvm lowering

csigg added inline comments.Sep 13 2022, 12:47 AM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
923

Shared memory is not accessible on host.

928

You can't have shared at the same time as async or [%dep].

Could you please update the example (maybe split it in two) and add a verifier?
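A hedged sketch of how the split examples might look, assuming the reviewer's suggested restriction that host_shared and async are mutually exclusive (the dependency token %dep is hypothetical):

```mlir
// Asynchronous device-only allocation, participating in a dependency chain.
%memref, %token = gpu.alloc async [%dep] () : memref<13xf32>

// Host-shared allocation; synchronous, so no async token is produced.
%memref2 = gpu.alloc host_shared () : memref<13xf32>
```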

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
469

Please use rewriter.notifyMatchFailure() to improve debuggability.

Hardcode84 added inline comments.Sep 13 2022, 1:12 AM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
923

It is, and that is the main reason it exists (https://spec.oneapi.io/level-zero/latest/core/PROG.html#memory). Are we really talking about the same thing?

928

I don't see any issue with the op having both shared and async.

bondhugula requested changes to this revision.Sep 13 2022, 1:26 AM
bondhugula added inline comments.
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
923

GPU shared memory is used to refer to GPU scratchpads (on-chip shared memory) as far as NVIDIA GPUs go, and not the kind of memory in the GPU DRAM that you are referring to here. This line will have to be rewritten for clarity.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
467

"Shared memory" again is misleading here.

This revision now requires changes to proceed.Sep 13 2022, 1:26 AM
csigg added inline comments.Sep 13 2022, 1:43 AM
mlir/include/mlir/Dialect/GPU/IR/GPUOps.td
923

Indeed, I was thinking of on-chip shared memory. The OneAPI shared memory corresponds to managed memory in CUDA speak. Sorry for the confusion.

Would you mind explaining your use-case a little more? The main purpose of managed memory AFAIK is to incrementally port a large code base to CUDA, where inserting appropriate h2d/d2h copies is non-trivial. The migration logic is not intended to be very efficient, but the CUDA API allows you to provide some hints to fix the biggest inefficiencies. So I see managed memory more as a crutch than something you would actively want to design for.

Hardcode84 added a comment.EditedSep 13 2022, 2:21 AM

We (MLIR codegen for Intel GPUs) are currently using such memory to transfer data between the GPU and the host in our pipeline. We also have libraries that use shared memory under the hood. It is very effective for integrated GPUs (as they share the same physical memory), less effective for our discrete GPUs, but still usable. We may replace some of these cases with explicit copies in the future, but some cases will definitely remain.

I understand this is not an ideal naming choice; do you have a better idea?


host_gpu_shared?


Maybe just host_shared then, as it is already a gpu op?

Hardcode84 updated this revision to Diff 464490.Oct 1 2022, 5:13 AM

rebase, rename to host_shared, notifyMatchFailure

Hardcode84 marked an inline comment as done.Oct 1 2022, 5:16 AM
bondhugula added inline comments.Oct 1 2022, 9:39 PM
mlir/test/Dialect/GPU/ops.mlir
212–215

Can host_shared go along with any memory space or just the default memory space? (I forgot whether that's the same as memref<13xf32, 0>.) For example, it doesn't make sense to use host_shared with memory space 3 (which is for GPU scratchpad/shared memory). Shouldn't that be a verifier error itself?

It looks like you are missing a verifier check to ensure host_shared is used only for memref allocations in GPU global memory space?

Besides this, I don't have any other concerns on this revision.

Can host_shared go along with any memory space or just the default memory space

With the broad range of GPU runtimes (CUDA, OpenCL, Level Zero, shaders, etc.) and their potentially different values and semantics for memory spaces, I do not want to put any restrictions on the high-level op. The lowering to an actual runtime can add a check if the flag and memory space are incompatible. In our specific pipeline we use memrefs without any memory space (which is assumed to be global in our GPU lowering passes).
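A brief sketch of the two cases being discussed, assuming numeric memory-space attributes (the specific space values and their meanings are runtime-dependent, as noted above):

```mlir
// Default memory space (no annotation), as used in the author's pipeline;
// assumed to be global in their GPU lowering passes.
%memref = gpu.alloc host_shared () : memref<13xf32>

// Explicit memory-space annotation; whether host_shared is compatible with a
// given space is left to the runtime-specific lowering rather than the verifier.
%memref2 = gpu.alloc () : memref<13xf32, 1>
```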

bondhugula accepted this revision.Oct 4 2022, 10:35 PM
This revision is now accepted and ready to land.Oct 4 2022, 10:35 PM
This revision was automatically updated to reflect the committed changes.