This is an archive of the discontinued LLVM Phabricator instance.

Differential D89686

[mlir][gpu] Add lowering to LLVM for `gpu.wait` and `gpu.wait async`.
Closed · Public

Authored by csigg on Oct 19 2020, 4:23 AM.

Details

Reviewers

herhut

Commits

rG3ac561d8c348: [mlir][gpu] Add lowering to LLVM for `gpu.wait` and `gpu.wait async`.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
200 ms	windows > LLVM.tools/llvm-objdump/X86::source-interleave-prefix-windows.test
220 ms	windows > LLVM.tools/llvm-objdump/X86::source-interleave-prefix.test

Event Timeline

csigg created this revision. Oct 19 2020, 4:23 AM

Herald added a project: Restricted Project. Oct 19 2020, 4:23 AM

Herald added subscribers: rdzhabarov, tatianashp, msifontes and 14 others.

csigg requested review of this revision. Oct 19 2020, 4:23 AM

Herald added subscribers: stephenneuendorffer, nicolasvasilache. Oct 19 2020, 4:23 AM

Harbormaster completed remote builds in B75519: Diff 299002. Oct 19 2020, 4:52 AM

Remove checking operands to be LLVMType because it's guaranteed by the type conversion.

Harbormaster completed remote builds in B75522: Diff 299009. Oct 19 2020, 5:10 AM

herhut added inline comments. Oct 20 2020, 4:15 AM

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
327	`defOp` could be null, e.g. a block argument, in which case you have to insert these late (maybe at block start?)
330	This is just `asyncDependency`, right?
330	Does this actually work with a `gpu.call`? That would create a stream and then enqueue the kernel on the stream, which is a use of the stream. So the `record event` would be inserted before that use, right?

Apply thoughtful comments from herhut.

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
327	Oh, thanks for catching this. I've changed it to insert the mgpuEventRecord at block start if defOp is null.
330	Yes, much cleaner.
330	Good point. I changed the insertion point to after the definition of the original operand.

If I understand correctly,

%t0 = gpu.launch_func async ...
%t1 = gpu.wait async [%t0]

will now convert to

%stream0 = llvm.call @mgpuStreamCreate()
llvm.call @mgpuLaunchKernel(%stream0, ...)       // assumption: inserted before gpu.launch_func
%t0 = gpu.launch_func async ...                  // marked for deletion
%event = llvm.call @mgpuEventCreate()            // after definition of %t0, not %stream0
llvm.call @mgpuEventRecord(%event, %stream0)
%stream1 = llvm.call @mgpuStreamCreate()
llvm.call @mgpuStreamWaitEvent(%stream1, %event)
llvm.call @mgpuEventDestroy(%event)
%t1 = gpu.wait async [%t0]                       // marked for deletion

Side note: my intention is that streams will only be created with explicit gpu.wait async ops, but your point still holds.
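The insertion-point rule discussed in this thread can be sketched outside MLIR. The following is a minimal toy model in C++17, not the code from the patch: `Block`, `recordInsertionIndex`, and `insertEventRecord` are hypothetical names, and a block is modeled as a plain list of operation names. It only illustrates the decision "insert after the defining op if there is one, otherwise at block start".

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for a block: an ordered list of operation names.
using Block = std::vector<std::string>;

// Returns the position at which the event calls for one dependency token
// should be inserted. `defIndex` is the position of the op defining the token,
// or std::nullopt when the token has no defining op (e.g. a block argument).
std::size_t recordInsertionIndex(const Block &block,
                                 std::optional<std::size_t> defIndex) {
  if (defIndex) {
    assert(*defIndex < block.size() && "defining op must be in the block");
    return *defIndex + 1; // Directly after the defining op.
  }
  return 0; // No defining op: fall back to the start of the block.
}

// Inserts the two event calls for one dependency and returns the new block.
Block insertEventRecord(Block block, std::optional<std::size_t> defIndex) {
  const Block calls = {"mgpuEventCreate", "mgpuEventRecord"};
  const auto at =
      static_cast<std::ptrdiff_t>(recordInsertionIndex(block, defIndex));
  block.insert(block.begin() + at, calls.begin(), calls.end());
  return block;
}
```

Calling `insertEventRecord` with the index of `gpu.launch_func` places the two event calls right after it, which mirrors the sequence listed above; passing `std::nullopt` places them at the front of the block.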

Harbormaster completed remote builds in B75821: Diff 299562. Oct 20 2020, 11:47 PM

Thanks.

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp
326	Can you add a comment here explaining why `getOperands` is used? This is somewhat special and not obvious.
330	Can you just copy this to the comment in the code? I think it would serve well as documentation.
mlir/test/Conversion/GPUCommon/lower-wait-to-gpu-runtime-calls.mlir
5	Please remember to add a test with a `gpu.launch` once that supports async launches.

This revision is now accepted and ready to land. Oct 21 2020, 7:19 AM

This revision was landed with ongoing or failed builds. Oct 21 2020, 9:20 AM

Closed by commit rG3ac561d8c348: [mlir][gpu] Add lowering to LLVM for `gpu.wait` and `gpu.wait async`. (authored by csigg).

This revision was automatically updated to reflect the committed changes.

csigg added a commit: rG3ac561d8c348: [mlir][gpu] Add lowering to LLVM for `gpu.wait` and `gpu.wait async`.

Revision Contents

Path	Size
mlir/include/mlir/Conversion/Passes.td	1 line
mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp	99 lines
mlir/test/Conversion/GPUCommon/lower-wait-to-gpu-runtime-calls.mlir	21 lines

Diff 299562

mlir/include/mlir/Conversion/Passes.td

	[85 lines not shown]

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// GPUCommon			// GPUCommon
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	def GpuToLLVMConversionPass : Pass<"gpu-to-llvm", "ModuleOp"> {			def GpuToLLVMConversionPass : Pass<"gpu-to-llvm", "ModuleOp"> {
	let summary = "Convert GPU dialect to LLVM dialect with GPU runtime calls";			let summary = "Convert GPU dialect to LLVM dialect with GPU runtime calls";
	let constructor = "mlir::createGpuToLLVMConversionPass()";			let constructor = "mlir::createGpuToLLVMConversionPass()";
				let dependentDialects = ["LLVM::LLVMDialect"];
	let options = [			let options = [
	Option<"gpuBinaryAnnotation", "gpu-binary-annotation", "std::string",			Option<"gpuBinaryAnnotation", "gpu-binary-annotation", "std::string",
	"", "Annotation attribute string for GPU binary">,			"", "Annotation attribute string for GPU binary">,
	];			];
	}			}

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// GPUToNVVM			// GPUToNVVM
	[294 lines not shown]

mlir/lib/Conversion/GPUCommon/ConvertLaunchFuncToRuntimeCalls.cpp

[151 lines not shown]
	ConvertHostRegisterOpToGpuRuntimeCallPattern(LLVMTypeConverter &typeConverter)
: ConvertOpToGpuRuntimeCallPattern<gpu::HostRegisterOp>(typeConverter) {}		: ConvertOpToGpuRuntimeCallPattern<gpu::HostRegisterOp>(typeConverter) {}

private:		private:
LogicalResult		LogicalResult
matchAndRewrite(Operation *op, ArrayRef<Value> operands,		matchAndRewrite(Operation *op, ArrayRef<Value> operands,
ConversionPatternRewriter &rewriter) const override;		ConversionPatternRewriter &rewriter) const override;
};		};

		/// A rewrite pattern to convert gpu.wait operations into a GPU runtime
		/// call. Currently it supports CUDA and ROCm (HIP).
		class ConvertWaitOpToGpuRuntimeCallPattern
		: public ConvertOpToGpuRuntimeCallPattern<gpu::WaitOp> {
		public:
		ConvertWaitOpToGpuRuntimeCallPattern(LLVMTypeConverter &typeConverter)
		: ConvertOpToGpuRuntimeCallPattern<gpu::WaitOp>(typeConverter) {}

		private:
		LogicalResult
		matchAndRewrite(Operation *op, ArrayRef<Value> operands,
		ConversionPatternRewriter &rewriter) const override;
		};

		/// A rewrite pattern to convert gpu.wait async operations into a GPU runtime
		/// call. Currently it supports CUDA and ROCm (HIP).
		class ConvertWaitAsyncOpToGpuRuntimeCallPattern
		: public ConvertOpToGpuRuntimeCallPattern<gpu::WaitOp> {
		public:
		ConvertWaitAsyncOpToGpuRuntimeCallPattern(LLVMTypeConverter &typeConverter)
		: ConvertOpToGpuRuntimeCallPattern<gpu::WaitOp>(typeConverter) {}

		private:
		LogicalResult
		matchAndRewrite(Operation *op, ArrayRef<Value> operands,
		ConversionPatternRewriter &rewriter) const override;
		};

/// A rewrite patter to convert gpu.launch_func operations into a sequence of		/// A rewrite patter to convert gpu.launch_func operations into a sequence of
/// GPU runtime calls. Currently it supports CUDA and ROCm (HIP).		/// GPU runtime calls. Currently it supports CUDA and ROCm (HIP).
///		///
/// In essence, a gpu.launch_func operations gets compiled into the following		/// In essence, a gpu.launch_func operations gets compiled into the following
/// sequence of runtime calls:		/// sequence of runtime calls:
///		///
/// * moduleLoad -- loads the module given the cubin / hsaco data		/// * moduleLoad -- loads the module given the cubin / hsaco data
/// * moduleGetFunction -- gets a handle to the actual kernel function		/// * moduleGetFunction -- gets a handle to the actual kernel function
[84 lines not shown]
	auto arguments =
typeConverter.promoteOperands(loc, op->getOperands(), operands, rewriter);		typeConverter.promoteOperands(loc, op->getOperands(), operands, rewriter);
arguments.push_back(elementSize);		arguments.push_back(elementSize);
hostRegisterCallBuilder.create(loc, rewriter, arguments);		hostRegisterCallBuilder.create(loc, rewriter, arguments);

rewriter.eraseOp(op);		rewriter.eraseOp(op);
return success();		return success();
}		}

		// Converts `gpu.wait` to runtime calls. The operands are all CUDA or ROCm
		// streams (i.e. void*). The converted op synchronizes the host with every
		// stream and then destroys it. That is, it assumes that the stream is not used
		// afterwards. In case this isn't correct, we will get a runtime error.
		// Eventually, we will have a pass that guarantees this property.
		LogicalResult ConvertWaitOpToGpuRuntimeCallPattern::matchAndRewrite(
		Operation *op, ArrayRef<Value> operands,
		ConversionPatternRewriter &rewriter) const {
		if (cast<gpu::WaitOp>(op).asyncToken())
		return failure(); // The gpu.wait is async.

		Location loc = op->getLoc();

		for (auto asyncDependency : operands)
		streamSynchronizeCallBuilder.create(loc, rewriter, {asyncDependency});
		for (auto asyncDependency : operands)
		streamDestroyCallBuilder.create(loc, rewriter, {asyncDependency});

		rewriter.eraseOp(op);
		return success();
		}

		// Converts `gpu.wait async` to runtime calls. The result is a new stream that
		// is synchronized with all operands, which are CUDA or ROCm streams (i.e.
		// void*). We create and record an event after the definition of the stream
		// and make the new stream wait on that event before destroying it again. This
		// assumes that there is no other use between the definition and this op, and
		// the plan is to have a pass that guarantees this property.
		LogicalResult ConvertWaitAsyncOpToGpuRuntimeCallPattern::matchAndRewrite(
		Operation *op, ArrayRef<Value> operands,
		ConversionPatternRewriter &rewriter) const {
		if (!cast<gpu::WaitOp>(op).asyncToken())
		return failure(); // The gpu.wait is not async.

		Location loc = op->getLoc();

		auto insertionPoint = rewriter.saveInsertionPoint();
		SmallVector<Value, 1> events;
		for (auto pair : llvm::zip(op->getOperands(), operands)) {
		auto token = std::get<0>(pair);
		if (auto *defOp = token.getDefiningOp()) {
		rewriter.setInsertionPointAfter(defOp);
		} else {
		// If we can't find the defining op, we record the event at block start,
		// which is late and therefore misses parallelism, but still valid.
		rewriter.setInsertionPointToStart(op->getBlock());
		}
		auto event = eventCreateCallBuilder.create(loc, rewriter, {}).getResult(0);
		auto stream = std::get<1>(pair);
		eventRecordCallBuilder.create(loc, rewriter, {event, stream});
		events.push_back(event);
		}
		rewriter.restoreInsertionPoint(insertionPoint);
		auto stream = streamCreateCallBuilder.create(loc, rewriter, {}).getResult(0);
		for (auto event : events)
		streamWaitEventCallBuilder.create(loc, rewriter, {stream, event});
		for (auto event : events)
		eventDestroyCallBuilder.create(loc, rewriter, {event});
		rewriter.replaceOp(op, {stream});

		return success();
		}

// Creates a struct containing all kernel parameters on the stack and returns		// Creates a struct containing all kernel parameters on the stack and returns
// an array of type-erased pointers to the fields of the struct. The array can		// an array of type-erased pointers to the fields of the struct. The array can
// then be passed to the CUDA / ROCm (HIP) kernel launch calls.		// then be passed to the CUDA / ROCm (HIP) kernel launch calls.
// The generated code is essentially as follows:		// The generated code is essentially as follows:
//		//
// %struct = alloca(sizeof(struct { Parameters... }))		// %struct = alloca(sizeof(struct { Parameters... }))
// %array = alloca(NumParameters * sizeof(void *))		// %array = alloca(NumParameters * sizeof(void *))
// for (i : [0, NumParameters))		// for (i : [0, NumParameters))
[138 lines not shown]
std::unique_ptr<mlir::OperationPass<mlir::ModuleOp>>		std::unique_ptr<mlir::OperationPass<mlir::ModuleOp>>
mlir::createGpuToLLVMConversionPass(StringRef gpuBinaryAnnotation) {		mlir::createGpuToLLVMConversionPass(StringRef gpuBinaryAnnotation) {
return std::make_unique<GpuToLLVMConversionPass>(gpuBinaryAnnotation);		return std::make_unique<GpuToLLVMConversionPass>(gpuBinaryAnnotation);
}		}

void mlir::populateGpuToLLVMConversionPatterns(		void mlir::populateGpuToLLVMConversionPatterns(
LLVMTypeConverter &converter, OwningRewritePatternList &patterns,		LLVMTypeConverter &converter, OwningRewritePatternList &patterns,
StringRef gpuBinaryAnnotation) {		StringRef gpuBinaryAnnotation) {
patterns.insert<ConvertHostRegisterOpToGpuRuntimeCallPattern>(converter);		converter.addConversion(
		[context = &converter.getContext()](gpu::AsyncTokenType type) -> Type {
		return LLVM::LLVMType::getInt8PtrTy(context);
		});
		patterns.insert<ConvertHostRegisterOpToGpuRuntimeCallPattern,
		ConvertWaitOpToGpuRuntimeCallPattern,
		ConvertWaitAsyncOpToGpuRuntimeCallPattern>(converter);
patterns.insert<ConvertLaunchFuncOpToGpuRuntimeCallPattern>(		patterns.insert<ConvertLaunchFuncOpToGpuRuntimeCallPattern>(
converter, gpuBinaryAnnotation);		converter, gpuBinaryAnnotation);
patterns.insert<EraseGpuModuleOpPattern>(&converter.getContext());		patterns.insert<EraseGpuModuleOpPattern>(&converter.getContext());
}		}
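Both new patterns are registered for `gpu::WaitOp`, and each one returns `failure()` when the op is not the variant it handles. A minimal sketch of that dispatch idea follows, as a self-contained C++ toy with no MLIR dependency. `WaitOp`, `Pattern`, `rewriteHostWait`, `rewriteAsyncWait`, and `applyFirstMatch` are hypothetical names, and the returned strings only label which pattern fired; the real conversion driver is more involved than this loop.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for gpu.wait: it only tracks whether the
// op produces an async token.
struct WaitOp {
  bool hasAsyncToken;
};

// A pattern returns true if it rewrote the op and false to signal "no match".
using Pattern = std::function<bool(const WaitOp &, std::string &)>;

// Handles only the host-synchronizing form (no async token).
bool rewriteHostWait(const WaitOp &op, std::string &result) {
  if (op.hasAsyncToken)
    return false; // The gpu.wait is async.
  result = "synchronize and destroy streams";
  return true;
}

// Handles only the async form (produces a token).
bool rewriteAsyncWait(const WaitOp &op, std::string &result) {
  if (!op.hasAsyncToken)
    return false; // The gpu.wait is not async.
  result = "record events and create stream";
  return true;
}

// Tries each pattern in order and returns the label of the first that matches.
std::string applyFirstMatch(const std::vector<Pattern> &patterns,
                            const WaitOp &op) {
  assert(!patterns.empty() && "expected at least one pattern");
  std::string result;
  for (const Pattern &pattern : patterns)
    if (pattern(op, result))
      return result;
  return "unconverted";
}
```

Because the two guards are mutually exclusive, the order in which the patterns are tried does not matter in this toy.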

mlir/test/Conversion/GPUCommon/lower-wait-to-gpu-runtime-calls.mlir

This file was added.

				// RUN: mlir-opt -allow-unregistered-dialect %s --gpu-to-llvm | FileCheck %s

				module attributes {gpu.container_module} {

				func @foo() {
				// CHECK: %[[t0:.*]] = llvm.call @mgpuStreamCreate
				// CHECK: %[[e0:.*]] = llvm.call @mgpuEventCreate
				// CHECK: llvm.call @mgpuEventRecord(%[[e0]], %[[t0]])
				%t0 = gpu.wait async
				// CHECK: %[[t1:.*]] = llvm.call @mgpuStreamCreate
				// CHECK: llvm.call @mgpuStreamWaitEvent(%[[t1]], %[[e0]])
				// CHECK: llvm.call @mgpuEventDestroy(%[[e0]])
				%t1 = gpu.wait async [%t0]
				// CHECK: llvm.call @mgpuStreamSynchronize(%[[t0]])
				// CHECK: llvm.call @mgpuStreamSynchronize(%[[t1]])
				// CHECK: llvm.call @mgpuStreamDestroy(%[[t0]])
				// CHECK: llvm.call @mgpuStreamDestroy(%[[t1]])
				gpu.wait [%t0, %t1]
				return
				}
				}
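The call order that the CHECK lines expect can be reproduced with a small mock. This is a sketch under assumptions, not the runtime wrappers themselves: `MockRuntime`, `lowerWaitAsync`, and `lowerWait` are hypothetical names, streams and events are plain strings, and every call is only logged. The mock runs strictly in program order, so it does not model the insertion-point logic that places the event record after the token's definition; for this test the two orders happen to coincide.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical mock of the runtime: it logs each call instead of using a GPU.
struct MockRuntime {
  std::vector<std::string> trace;
  int nextStream = 0;
  int nextEvent = 0;

  std::string streamCreate() {
    std::string stream = "t" + std::to_string(nextStream++);
    trace.push_back("mgpuStreamCreate -> " + stream);
    return stream;
  }
  std::string eventCreate() {
    std::string event = "e" + std::to_string(nextEvent++);
    trace.push_back("mgpuEventCreate -> " + event);
    return event;
  }
  void log(const std::string &call) { trace.push_back(call); }
};

// Models `gpu.wait async [deps]`: record one event per dependency stream, then
// make a fresh stream wait on all events, and finally destroy the events.
std::string lowerWaitAsync(MockRuntime &rt,
                           const std::vector<std::string> &deps) {
  std::vector<std::string> events;
  for (const std::string &dep : deps) {
    assert(!dep.empty() && "a dependency must name a stream");
    std::string event = rt.eventCreate();
    rt.log("mgpuEventRecord(" + event + ", " + dep + ")");
    events.push_back(event);
  }
  std::string stream = rt.streamCreate();
  for (const std::string &event : events)
    rt.log("mgpuStreamWaitEvent(" + stream + ", " + event + ")");
  for (const std::string &event : events)
    rt.log("mgpuEventDestroy(" + event + ")");
  return stream;
}

// Models `gpu.wait [deps]`: synchronize the host with every stream first, and
// only then destroy the streams.
void lowerWait(MockRuntime &rt, const std::vector<std::string> &deps) {
  for (const std::string &dep : deps)
    rt.log("mgpuStreamSynchronize(" + dep + ")");
  for (const std::string &dep : deps)
    rt.log("mgpuStreamDestroy(" + dep + ")");
}
```

Running `lowerWaitAsync(rt, {})`, then `lowerWaitAsync(rt, {t0})`, then `lowerWait(rt, {t0, t1})` on one `MockRuntime` yields a trace in the same order as the ten CHECK lines in `@foo`.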
