Motivation: we have a lowering pipeline based on the upstream gpu and spirv dialects, and we are using shared GPU memory to transfer data between host and device.
Add a shared flag to gpu.alloc to distinguish between shared and device-only GPU memory allocations.
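For context, the proposed syntax looks roughly like this (a sketch based on the upstream `gpu.alloc` assembly format; the flag is spelled `host_shared` in the test file referenced later in this review, though this summary calls it `shared`):

```mlir
// Device-only allocation (existing behavior).
%memref0 = gpu.alloc () : memref<13xf32>

// Allocation accessible from both host and device (the new flag).
%memref1 = gpu.alloc host_shared () : memref<13xf32>
```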
Event Timeline
This patch doesn't look complete: wouldn't this lower incorrectly to LLVM (gpu-to-llvm), since the new attribute is being ignored?
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 943 | Have you now created a double space if the `shared` keyword isn't present? The space should have been inside the parentheses for `shared` here? |
We are not using the upstream gpu-to-llvm lowering.
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 943 | I can do this, but it seems it is already covered by the generated printer, as only a single space is still generated. |
Would the current gpu-to-llvm conversion behavior be correct in the presence of this flag? Shouldn't it fail?
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 923 | Shared memory is not accessible on host. |
| 928 | You can't have `shared` at the same time as `async` or `[%dep]`. Could you please update the example (maybe split it in two) and add a verifier? |

| mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp | |
|---|---|
| 469 | Please use `rewriter.notifyMatchFailure()` to improve debuggability. |
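For reference, splitting the op documentation example as requested might look like this (a sketch using the `host_shared` spelling; whether the flag may be combined with `async` is debated below):

```mlir
// Synchronous form with the new flag.
%memref0 = gpu.alloc host_shared () : memref<13xf32>

// Asynchronous form: returns a !gpu.async.token and takes a dependency.
%memref1, %token = gpu.alloc async [%dep] () : memref<13xf32>
```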
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 923 | It is, and that is the main reason it exists (https://spec.oneapi.io/level-zero/latest/core/PROG.html#memory). Are we really talking about the same thing? |
| 928 | I don't see any issue with the op having both `shared` and `async`. |
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 923 | GPU shared memory is used to refer to GPU scratchpads (on-chip shared memory) as far as NVIDIA GPUs go, and not the kind of memory in the GPU DRAM that you are referring to here. This line will have to be rewritten for clarity. |

| mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp | |
|---|---|
| 467 | "Shared memory" again is misleading here. |
| mlir/include/mlir/Dialect/GPU/IR/GPUOps.td | |
|---|---|
| 923 | Indeed, I was thinking of on-chip shared memory. The OneAPI shared memory corresponds to managed memory in CUDA speak. Sorry for the confusion. Would you mind explaining your use case a little more? The main purpose of managed memory AFAIK is to incrementally port a large code base to CUDA, where inserting appropriate h2d/d2h copies is non-trivial. The migration logic is not intended to be very efficient, but the CUDA API allows you to provide some hints to fix the biggest inefficiencies. So I see managed memory more as a crutch than something you would actively want to design for. |
We (MLIR codegen for Intel GPUs) are currently using such memory to transfer data between the GPU and the host in our pipeline. We also have libraries that use shared memory under the hood. It is very effective for integrated GPUs (as host and device share the same physical memory) and less effective for our discrete GPUs, but still usable. We may replace some of these cases with explicit copies in the future, but some cases will definitely remain.
I understand this is not an ideal naming choice; do you have a better idea?
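To illustrate the workflow described above, here is a hypothetical snippet (the kernel name `@kernels::@scale` and the buffer size are made up for illustration):

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
%v = arith.constant 1.0 : f32

// Allocate memory visible to both host and device.
%buf = gpu.alloc host_shared () : memref<13xf32>

// The host writes directly into the buffer; no explicit
// host-to-device copy is needed.
memref.store %v, %buf[%c0] : memref<13xf32>

// A kernel can then read and write the same buffer.
gpu.launch_func @kernels::@scale
    blocks in (%c1, %c1, %c1) threads in (%c1, %c1, %c1)
    args(%buf : memref<13xf32>)
```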
| mlir/test/Dialect/GPU/ops.mlir | |
|---|---|
| 212–215 | Can `host_shared` go along with any memory space, or just the default memory space? (I forgot whether that's the same as `memref<13xf32, 0>`.) For example, it doesn't make sense to use `host_shared` with memory space 3 (which is for GPU scratchpad/shared memory). Shouldn't that be a verifier error itself? It looks like you are missing a verifier check to ensure `host_shared` is used only for memref allocations in the GPU global memory space. Besides this, I don't have any other concerns on this revision. |
> Can host_shared go along with any memory space or just the default memory space?

With the broad range of GPU runtimes (CUDA, OpenCL, Level Zero, shaders, etc.), each with potentially different values and semantics for memory spaces, I do not want to put any restrictions on the high-level op. Lowering to an actual runtime can add a check if the flag and the memory space are incompatible. In our specific pipeline we use memrefs without any memory space (which is assumed to be global in our GPU lowering passes).
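As a sketch of what such a lowering-time check would face (the memory-space numbering here follows the NVVM convention mentioned above and is only illustrative):

```mlir
// Default memory space, as used in our pipeline; compatible with host_shared.
%a = gpu.alloc host_shared () : memref<13xf32>

// An explicit memory space is preserved by the op; a runtime-specific
// lowering could reject combinations that make no sense for it, e.g.
// workgroup/scratchpad memory (space 3 in the NVVM convention)
// together with host_shared.
%b = gpu.alloc host_shared () : memref<13xf32, 3>
```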