Motivation: we have a lowering pipeline based on the upstream gpu and spirv dialects, and we are using shared GPU memory to transfer data between host and device.
Add a shared flag to gpu.alloc to distinguish between shared and device-only GPU memory allocations.
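For context, the proposed syntax looks roughly like this (a sketch based on the upstream `gpu.alloc` assembly format; the flag is spelled `host_shared` in the test file referenced later in this review, though this summary calls it `shared`):

```mlir
// Device-only allocation (existing behavior).
%memref0 = gpu.alloc () : memref<13xf32>

// Allocation accessible from both host and device (the new flag).
%memref1 = gpu.alloc host_shared () : memref<13xf32>
```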
Event Timeline
This patch doesn't look complete: wouldn't this lower incorrectly to LLVM (gpu-to-llvm), since the new attribute is being ignored?
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 943 | Have you now created a double space if the `shared` keyword isn't present? The space should have been inside the parentheses for `shared` here? |
We are not using the upstream gpu-to-llvm lowering.
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 943 | I can do this, but it seems it is already covered by the generated printer, as only a single space is still generated. |
Would the current gpu-to-llvm conversion behavior be correct in the presence of this flag? Shouldn't it fail?
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 923 | Shared memory is not accessible on host. |
| 928 | You can't have `shared` at the same time as `async` or `[%dep]`. Could you please update the example (maybe split it in two) and add a verifier? |

| mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp | |
|---|---|
| 469 | Please use `rewriter.notifyMatchFailure()` to improve debuggability. |
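For reference, splitting the op documentation example as requested might look like this (a sketch using the `host_shared` spelling; whether the flag may be combined with `async` is debated below):

```mlir
// Synchronous form with the new flag.
%memref0 = gpu.alloc host_shared () : memref<13xf32>

// Asynchronous form: returns a !gpu.async.token and takes a dependency.
%memref1, %token = gpu.alloc async [%dep] () : memref<13xf32>
```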
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 923 | It is, and that is the main reason it exists (https://spec.oneapi.io/level-zero/latest/core/PROG.html#memory). Are we really talking about the same thing? |
| 928 | I don't see any issue with the op having both `shared` and `async`. |
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 923 | GPU shared memory is used to refer to GPU scratchpads (on-chip shared memory) as far as NVIDIA GPUs go, and not the kind of memory in the GPU DRAM that you are referring to here. This line will have to be rewritten for clarity. |

| mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp | |
|---|---|
| 467 | "Shared memory" again is misleading here. |
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 923 | Indeed, I was thinking of on-chip shared memory. The OneAPI shared memory corresponds to managed memory in CUDA speak. Sorry for the confusion. Would you mind explaining your use case a little more? The main purpose of managed memory AFAIK is to incrementally port a large code base to CUDA, where inserting appropriate h2d/d2h copies is non-trivial. The migration logic is not intended to be very efficient, but the CUDA API allows you to provide some hints to fix the biggest inefficiencies. So I see managed memory more as a crutch than something you would actively want to design for. |
We (MLIR codegen for Intel GPUs) are currently using such memory to transfer data between the GPU and the host in our pipeline. We also have libraries that use shared memory under the hood. It is very effective for integrated GPUs (as host and device share the same physical memory) and less effective for our discrete GPUs, but still usable. We may replace some of these cases with explicit copies in the future, but some cases will definitely remain.
I understand this is not an ideal naming choice; do you have a better idea?
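To illustrate the workflow described above, here is a hypothetical snippet (the kernel name `@kernels::@scale` and the buffer size are made up for illustration):

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%v = arith.constant 1.0 : f32

// Allocate memory visible to both host and device.
%buf = gpu.alloc host_shared () : memref<13xf32>

// The host writes directly into the buffer; no explicit
// host-to-device copy is needed.
memref.store %v, %buf[%c0] : memref<13xf32>

// A kernel can then read and write the same buffer.
gpu.launch_func @kernels::@scale
    blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
    args(%buf : memref<13xf32>)
```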
| mlir/test/Dialect/GPU/ops.mlir | |
|---|---|
| 212–215 | Can `host_shared` go along with any memory space, or just the default memory space? (I forgot whether that's the same as `memref<13xf32, 0>`.) For example, it doesn't make sense to use `host_shared` with memory space 3 (which is for GPU scratchpad/shared memory). Shouldn't that be a verifier error itself? It looks like you are missing a verifier check to ensure `host_shared` is used only for memref allocations in the GPU global memory space. Besides this, I don't have any other concerns on this revision. |
> Can host_shared go along with any memory space or just the default memory space?

With the broad range of GPU runtimes (CUDA, OpenCL, Level Zero, shaders, etc.), each with potentially different values and semantics for memory spaces, I do not want to put any restrictions on the high-level op. Lowering to an actual runtime can add a check if the flag and the memory space are incompatible. In our specific pipeline we use memrefs without any memory space (which is assumed to be global in our GPU lowering passes).
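As a sketch of what such a lowering-time check would face (the memory-space numbering here follows the NVVM convention mentioned above and is only illustrative):

```mlir
// Default memory space, as used in our pipeline; compatible with host_shared.
%a = gpu.alloc host_shared () : memref<13xf32>

// An explicit memory space is preserved by the op; a runtime-specific
// lowering could reject combinations that make no sense for it, e.g.
// workgroup/scratchpad memory (space 3 in the NVVM convention)
// together with host_shared.
%b = gpu.alloc host_shared () : memref<13xf32, 3>
```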