The lifetime management of streams is not correct, I think. Even with the assumption that "forks" are explicit, we might leak streams. For instance if we have

%t3 = gpu.wait async [%t1, %t2] // last use of %t1 and %t2
...

the streams would never be freed. I wonder whether it would be easier to not deallocate streams at all when lowering and then have a fix-up pass that inserts the stream destruction where needed.
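To make the leak concrete, here is a minimal C++ sketch that models streams as integer handles. `FakeRuntime`, `lowerWaitAsync`, `lowerWaitSync`, and `countLeakedStreams` are hypothetical names for illustration, not the MLIR runtime wrappers. The sketch assumes only what is described above: the `gpu.wait async` lowering creates a new stream and does not destroy the streams it depends on.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for the GPU runtime: streams are integer handles.
struct FakeRuntime {
  int nextHandle = 0;
  std::set<int> liveStreams;

  int streamCreate() {
    int handle = nextHandle++;
    liveStreams.insert(handle);
    return handle;
  }

  void streamDestroy(int handle) {
    // Destroying an unknown or already destroyed stream is a bug in the model.
    assert(liveStreams.count(handle) == 1 && "stream is not live");
    liveStreams.erase(handle);
  }
};

// Models lowering `%t3 = gpu.wait async [%t1, %t2]`: the result is a new
// stream that waits on every dependency. The dependencies are NOT destroyed,
// which is the behavior under discussion.
int lowerWaitAsync(FakeRuntime &runtime, const std::vector<int> &dependencies) {
  int result = runtime.streamCreate();
  for (int dependency : dependencies) {
    // A real lowering would record an event on `dependency` and make `result`
    // wait on it; the model only checks that the stream is still live.
    assert(runtime.liveStreams.count(dependency) == 1);
  }
  return result;
}

// Models lowering a synchronous `gpu.wait [%t]`: synchronize, then destroy.
void lowerWaitSync(FakeRuntime &runtime, const std::vector<int> &dependencies) {
  for (int dependency : dependencies)
    runtime.streamDestroy(dependency);
}

// Runs the scenario from the comment and returns the number of leaked streams.
int countLeakedStreams() {
  FakeRuntime runtime;
  int t1 = runtime.streamCreate();
  int t2 = runtime.streamCreate();
  int t3 = lowerWaitAsync(runtime, {t1, t2}); // last use of t1 and t2
  lowerWaitSync(runtime, {t3});               // only t3 is ever destroyed
  return static_cast<int>(runtime.liveStreams.size());
}
```

In this model `countLeakedStreams()` reports the two streams behind `%t1` and `%t2` as still live, because nothing after the `gpu.wait async` refers to them any more.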

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
498	If there is more than one dependency, this would need to insert synchronization, right?
515	This assumes the property that all "forks" (i.e., tokens that have multiple uses) are made explicit via `gpu.wait`, right? This idea is not captured in the comments anywhere and also we do not enforce it in the rewritings. If this property is not true, the lowering will be incorrect.
516	nit: left-over tab.

This revision now requires changes to proceed. Oct 22 2020, 6:26 AM

In D89933#2347104, @herhut wrote:
The lifetime management of streams is not correct, I think. Even with the assumption that "forks" are explicit, we might leak streams. For instance if we have
%t3 = gpu.wait async [%t1, %t2] // last use of %t1 and %t2
...
the streams would never be freed. I wonder whether it would be easier to not deallocate streams at all when lowering and then have a fix-up pass that inserts the stream destruction where needed.

Yes, you are right. The gpu.wait async lowering doesn't currently destroy the streams it depends on. The gpu.wait async ops we create in D89937 have no dependencies.

My plan is that gpu.wait async with dependencies will be inserted for tokens that have multiple uses (the explicit "forks"). There, the streams we synchronize on should not be destroyed.
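As a rough illustration of what "tokens that have multiple uses" means, the following C++ sketch finds such tokens in a flat list of uses. The `TokenUse` representation and `findForkedTokens` are hypothetical and are not part of the patch or of MLIR; a real pass would walk the def-use chains of the async tokens instead.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical flat representation of a use: (token name, name of the user op).
using TokenUse = std::pair<std::string, std::string>;

// Returns the tokens that have more than one use, sorted by name. These are
// the tokens for which an explicit fork would have to be inserted.
std::vector<std::string> findForkedTokens(const std::vector<TokenUse> &uses) {
  std::map<std::string, int> useCount;
  for (const TokenUse &use : uses)
    ++useCount[use.first];

  std::vector<std::string> forked;
  for (const auto &entry : useCount)
    if (entry.second > 1)
      forked.push_back(entry.first);
  return forked;
}
```

For a use list in which `%t1` feeds two ops and `%t2` feeds one, only `%t1` is returned.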

We should verify that we don't leak, but hopefully that will be a little easier than a fix-up pass.

The change here primarily fixes the existing stream leaks that we encountered in benchmarks when lowering non-async gpu.launch_func.

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
498	Correct. The rewriter checks that there is at most one dependency on line 449.
515	Yes, that is correct. I added a comment explaining this assumption. I will address the TODO in a follow up change.

Add a comment that tokens are expected to not be reused.

csigg added inline comments. Oct 22 2020, 10:13 AM

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
516	Isn't this a marker that the line didn't change? I can't find any tab in the code.

Harbormaster completed remote builds in B76058: Diff 300032. Oct 22 2020, 10:32 AM

Fail when synchronous version has async dependencies.
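The two preconditions that the lowering checks can be sketched as a plain function. This is an illustration only: `LaunchCheck`, `checkLaunchPreconditions`, and `describe` are hypothetical names, and the two parameters stand in for `launchOp.asyncDependencies().size()` and `launchOp.asyncToken()` from the diff.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for the precondition check.
enum class LaunchCheck { Ok, TooManyDependencies, SyncWithDependencies };

// Mirrors the two match failures in the lowering, using plain values.
LaunchCheck checkLaunchPreconditions(std::size_t numAsyncDependencies,
                                     bool producesAsyncToken) {
  if (numAsyncDependencies > 1)
    return LaunchCheck::TooManyDependencies;
  // The synchronous lowering destroys the stream, so the stream must be one
  // that the lowering created itself, not one passed in as a dependency.
  if (!producesAsyncToken && numAsyncDependencies != 0)
    return LaunchCheck::SyncWithDependencies;
  return LaunchCheck::Ok;
}

// Maps a result to the message used in the corresponding match failure.
const char *describe(LaunchCheck result) {
  switch (result) {
  case LaunchCheck::Ok:
    return "ok";
  case LaunchCheck::TooManyDependencies:
    return "Cannot convert with more than one async dependency.";
  case LaunchCheck::SyncWithDependencies:
    return "Cannot convert non-async op with async dependencies.";
  }
  assert(false && "unknown LaunchCheck value");
  return "";
}
```

A synchronous launch with one dependency is rejected, while an async launch with zero or one dependency passes.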

Harbormaster completed remote builds in B76378: Diff 300632. Oct 26 2020, 4:45 AM

mehdi_amini added inline comments. Oct 26 2020, 6:29 PM

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
320	drive by comment: it's really nice that you add messages!

herhut accepted this revision. Oct 29 2020, 8:57 AM

herhut added inline comments.

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
516	Oh, interesting. My brain is primed to other review tools :)

This revision is now accepted and ready to land. Oct 29 2020, 8:57 AM

mehdi_amini added inline comments. Oct 29 2020, 9:36 AM

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
516	Yeah phab indicates like this that only the indentation changed.

Rebase.

Harbormaster completed remote builds in B76938: Diff 301668. Oct 29 2020, 10:25 AM

Harbormaster completed remote builds in B76940: Diff 301670. Oct 29 2020, 10:41 AM

Rebase.

Harbormaster completed remote builds in B76969: Diff 301738. Oct 29 2020, 1:58 PM

This revision was landed with ongoing or failed builds. Oct 29 2020, 2:16 PM

Closed by commit rGfce99e5f739e: [mlir][gpu] Handle async in gpu.launch_func lowering. (authored by csigg).

This revision was automatically updated to reflect the committed changes.

csigg added a commit: rGfce99e5f739e: [mlir][gpu] Handle async in gpu.launch_func lowering.

Diff 300632

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp

[... 288 lines not shown ...]
 // streams (i.e. void*). The converted op synchronizes the host with every
 // stream and then destroys it. That is, it assumes that the stream is not used
 // afterwards. In case this isn't correct, we will get a runtime error.
 // Eventually, we will have a pass that guarantees this property.
 LogicalResult ConvertWaitOpToGpuRuntimeCallPattern::matchAndRewrite(
     Operation *op, ArrayRef<Value> operands,
     ConversionPatternRewriter &rewriter) const {
   if (cast<gpu::WaitOp>(op).asyncToken())
-    return failure(); // The gpu.wait is async.
+    return rewriter.notifyMatchFailure(op, "Cannot convert async op.");

   Location loc = op->getLoc();

   for (auto asyncDependency : operands)
     streamSynchronizeCallBuilder.create(loc, rewriter, {asyncDependency});
   for (auto asyncDependency : operands)
     streamDestroyCallBuilder.create(loc, rewriter, {asyncDependency});

   rewriter.eraseOp(op);
   return success();
 }

 // Converts `gpu.wait async` to runtime calls. The result is a new stream that
 // is synchronized with all operands, which are CUDA or ROCm streams (i.e.
 // void*). We create and record an event after the definition of the stream
 // and make the new stream wait on that event before destroying it again. This
 // assumes that there is no other use between the definition and this op, and
 // the plan is to have a pass that guarantees this property.
 LogicalResult ConvertWaitAsyncOpToGpuRuntimeCallPattern::matchAndRewrite(
     Operation *op, ArrayRef<Value> operands,
     ConversionPatternRewriter &rewriter) const {
   if (!cast<gpu::WaitOp>(op).asyncToken())
-    return failure(); // The gpu.wait is not async.
+    return rewriter.notifyMatchFailure(op, "Can only convert async op.");

   Location loc = op->getLoc();

   auto insertionPoint = rewriter.saveInsertionPoint();
   SmallVector<Value, 1> events;
   for (auto pair : llvm::zip(op->getOperands(), operands)) {
     auto token = std::get<0>(pair);
     if (auto *defOp = token.getDefiningOp()) {
[... 103 lines not shown ...]
 // %0 = call %binarygetter
 // %1 = call %moduleLoad(%0)
 // %2 = <see generateKernelNameConstant>
 // %3 = call %moduleGetFunction(%1, %2)
 // %4 = call %streamCreate()
 // %5 = <see generateParamsArray>
 // call %launchKernel(%3, <launchOp operands 0..5>, 0, %4, %5, nullptr)
 // call %streamSynchronize(%4)
+// call %streamDestroy(%4)
+//
+// If the op is async, the stream corresponds to the (single) async dependency
+// as well as the async token the op produces.
 LogicalResult ConvertLaunchFuncOpToGpuRuntimeCallPattern::matchAndRewrite(
     Operation *op, ArrayRef<Value> operands,
     ConversionPatternRewriter &rewriter) const {
   if (!llvm::all_of(operands, isLLVMType))
     return rewriter.notifyMatchFailure(
         op, "Cannot convert if operands aren't of LLVM type.");

   auto launchOp = cast<gpu::LaunchFuncOp>(op);

+  if (launchOp.asyncDependencies().size() > 1)
+    return rewriter.notifyMatchFailure(
+        op, "Cannot convert with more than one async dependency.");
+
+  // Fail when the synchronous version of the op has async dependencies. The
+  // lowering destroys the stream, and we do not want to check that there is no
+  // use of the stream after this op.
+  if (!launchOp.asyncToken() && !launchOp.asyncDependencies().empty())
+    return rewriter.notifyMatchFailure(
+        op, "Cannot convert non-async op with async dependencies.");
+
   Location loc = launchOp.getLoc();

   // Create an LLVM global with CUBIN extracted from the kernel annotation and
   // obtain a pointer to the first byte in it.
   auto kernelModule = SymbolTable::lookupNearestSymbolFrom<gpu::GPUModuleOp>(
       launchOp, launchOp.getKernelModuleName());
   assert(kernelModule && "expected a kernel module");

[... 14 lines not shown, still inside ConvertLaunchFuncOpToGpuRuntimeCallPattern::matchAndRewrite ...]
   // Get the function from the module. The name corresponds to the name of
   // the kernel function.
   auto kernelName = generateKernelNameConstant(
       launchOp.getKernelModuleName(), launchOp.getKernelName(), loc, rewriter);
   auto function = moduleGetFunctionCallBuilder.create(
       loc, rewriter, {module.getResult(0), kernelName});
   auto zero = rewriter.create<LLVM::ConstantOp>(loc, llvmInt32Type,
                                                 rewriter.getI32IntegerAttr(0));
-  // Grab the global stream needed for execution.
-  auto stream = streamCreateCallBuilder.create(loc, rewriter, {});
+  auto adaptor = gpu::LaunchFuncOpAdaptor(operands, op->getAttrDictionary());
+  Value stream =
+      adaptor.asyncDependencies().empty()
+          ? streamCreateCallBuilder.create(loc, rewriter, {}).getResult(0)
+          : adaptor.asyncDependencies().front();
   // Create array of pointers to kernel arguments.
   auto kernelParams = generateParamsArray(launchOp, operands, rewriter);
   auto nullpointer = rewriter.create<LLVM::NullOp>(loc, llvmPointerPointerType);
   launchKernelCallBuilder.create(
       loc, rewriter,
       {function.getResult(0), launchOp.gridSizeX(), launchOp.gridSizeY(),
        launchOp.gridSizeZ(), launchOp.blockSizeX(), launchOp.blockSizeY(),
-       launchOp.blockSizeZ(), zero, /* sharedMemBytes */
-       stream.getResult(0), /* stream */
-       kernelParams, /* kernel params */
-       nullpointer /* extra */});
-  streamSynchronizeCallBuilder.create(loc, rewriter, stream.getResult(0));
+       launchOp.blockSizeZ(), /*sharedMemBytes=*/zero, stream, kernelParams,
+       /*extra=*/nullpointer});

+  if (launchOp.asyncToken()) {
+    // Async launch: make dependent ops use the same stream.
+    rewriter.replaceOp(op, {stream});
+  } else {
+    // Synchronize with host and destroy stream. This must be the stream created
+    // above (with no other uses) because we check that the synchronous version
+    // does not have any async dependencies.
+    streamSynchronizeCallBuilder.create(loc, rewriter, stream);
+    streamDestroyCallBuilder.create(loc, rewriter, stream);
-  rewriter.eraseOp(op);
+    rewriter.eraseOp(op);
+  }

   return success();
 }

 std::unique_ptr<mlir::OperationPass<mlir::ModuleOp>>
 mlir::createGpuToLLVMConversionPass(StringRef gpuBinaryAnnotation) {
   return std::make_unique<GpuToLLVMConversionPass>(gpuBinaryAnnotation);
 }

[... 14 lines not shown ...]

This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Handle async in gpu.launch_func lowering.
ClosedPublic
