Details

Reviewers

ftynse
ThomasRaoux
dcaballe
herhut
mehdi_amini

Commits

rGfcfeb1e5b3cd: [mlir][gpu] Add GPU target support to `gpu-to-llvm`.

Summary

For an explanation of these patches see D154153.

This patch modifies the lowering of gpu.module & gpu.launch_func in the gpu-to-llvm pass,
allowing the usage of the new GPU compilation mechanism in the patch series ending in D154153.

Instead of removing GPU modules, this patch preserves any module that has target attributes so that the
gpu-module-to-binary pass can serialize it later.

Instead of lowering the kernel calls to the LLVM dialect, this patch primarily updates the operation's
arguments, leaving the job of converting the operation into LLVM instructions to the translation stage.
The reason for not lowering the operation to LLVM at this stage is that kernel launches do not have a
single one-to-one representation in LLVM. For example, a kernel launch can be represented by a call
to a kernel stub, like in CUDA or HIP.
Kernel launches are also intrinsically linked to the binary associated with the call, and the binaries are
converted during translation.
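The legality rules this summary relies on can be modeled without MLIR. The following is a minimal standalone C++ sketch, not the patch's code: `GpuModule`, `LaunchFunc`, `SymbolTable`, `isModuleLegal`, `isLaunchLegal`, and `demoLegality` are hypothetical stand-ins, and `typesConverted` abstracts the type-converter checks. It assumes the rules shown in this revision's diff: a module is kept when it has a targets attribute, and a launch is left for the translation stage once its types are converted and its module has at least one target.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for gpu.module; only the optional targets list matters.
struct GpuModule {
  std::optional<std::vector<std::string>> targets;
};

// Hypothetical stand-in for gpu.launch_func.
struct LaunchFunc {
  std::string kernelModuleName;
  bool typesConverted; // abstracts the operand and result type checks
};

using SymbolTable = std::unordered_map<std::string, GpuModule>;

// A module is preserved when the targets attribute exists, even if it is empty.
bool isModuleLegal(const GpuModule &module) {
  return module.targets.has_value();
}

// A launch is legal when its types are converted and its module names a target.
bool isLaunchLegal(const LaunchFunc &op, const SymbolTable &symbols) {
  auto it = symbols.find(op.kernelModuleName);
  if (it == symbols.end())
    return false;
  const auto &targets = it->second.targets;
  return op.typesConverted && targets.has_value() && !targets->empty();
}

// Usage sketch: one module with a target, one legacy module without.
void demoLegality() {
  SymbolTable symbols;
  symbols["kernel_module"] = GpuModule{std::vector<std::string>{"#gpu.nvptx"}};
  symbols["legacy_module"] = GpuModule{};
  assert(isModuleLegal(symbols.at("kernel_module")));
  assert(!isModuleLegal(symbols.at("legacy_module")));
  assert(isLaunchLegal(LaunchFunc{"kernel_module", true}, symbols));
  assert(!isLaunchLegal(LaunchFunc{"legacy_module", true}, symbols));
}
```

Note that in this model the two predicates differ slightly: an empty targets list keeps the module but does not make the launch legal.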

Depends on D154149

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fmorac created this revision. Jun 29 2023, 2:04 PM

Herald added a reviewer: ftynse. Jun 29 2023, 2:04 PM

Herald added a reviewer: ThomasRaoux.

Herald added a reviewer: dcaballe.

Herald added a project: Restricted Project.

Herald added subscribers: gysit, Dinistro, bviyer and 25 others.

fmorac added a child revision: D154153: [mlir][gpu] Update GPU translation to accept binaries. Jun 29 2023, 2:12 PM

Harbormaster completed remote builds in B242236: Diff 535992. Jun 29 2023, 4:17 PM

Rebasing.

Harbormaster completed remote builds in B242305: Diff 536081. Jun 29 2023, 10:01 PM

fmorac published this revision for review. Jun 30 2023, 6:30 AM

Herald added a reviewer: herhut. Jun 30 2023, 6:30 AM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Can we get a test for the bare pointer calling convention please?

Added bare pointer test.

Added new line.

Harbormaster completed remote builds in B247432: Diff 543215. Jul 22 2023, 12:24 PM

Can you add context / motivation in the description? The why is often more important than the what there.

(This would actually be valuable for most of the patches in this series)

mehdi_amini added inline comments. Jul 23 2023, 11:51 PM

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
1235	You created a symbolTable, we should try to take advantage of it everywhere I think?
1274	If you're just changing the operands of the op: can't it be done in-place?
mlir/test/Conversion/GPUCommon/lower-launch-func-bare-ptr.mlir
1 ↗	(On Diff #543215)	Please remove -allow-unregistered-dialect, you may use the test dialect if needed

I think I understand this: this is for backward compatibility where the new mechanism isn't used, and when it is used there is a target attribute and the LLVM translation will be done through the new mechanism.

As mentioned in another patch: this is the kind of context that is helpful to explain in the description.

In D154152#4526546, @mehdi_amini wrote:

Can you add context / motivation in the description? The why is often more important than the what there.

(This would actually be valuable for most of the patches in this series)

Ok, I'll do it for the patch series.

In D154152#4526813, @mehdi_amini wrote:

I think I understand this: this is for backward compatibility where the new mechanism isn't used, and when it is used there is a target attribute and the LLVM translation will be done through the new mechanism.
As mentioned in another patch: this is the kind of context that is helpful to explain in the description.

Yes, the changes in this patch preserve the old mechanism while introducing the new one. The idea is to deprecate the old one once this gets approval, and then patch the LaunchFuncOp pattern to only use the new one.
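The coexistence of the two mechanisms can be pictured as a three-way choice made per kernel module. This is a minimal standalone C++ sketch under stated assumptions, not the pattern's actual code: `GpuModule`, `LaunchLowering`, `selectLowering`, and `demoPathSelection` are hypothetical names, and `binaryAnnotation` stands in for the attribute named by the `gpu-binary-annotation` option (for example `nvvm.cubin`).

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for gpu.module.
struct GpuModule {
  std::optional<std::vector<std::string>> targets; // new mechanism
  std::optional<std::string> binaryAnnotation;     // old mechanism
};

enum class LaunchLowering {
  KeepLaunchOp, // new path: update operands, lower during translation
  RuntimeCalls, // old path: emit runtime calls from the embedded binary
  Error         // neither targets nor a binary annotation
};

// Targets take precedence; the binary annotation is only consulted without them.
LaunchLowering selectLowering(const GpuModule &module) {
  if (module.targets.has_value())
    return LaunchLowering::KeepLaunchOp;
  if (module.binaryAnnotation.has_value())
    return LaunchLowering::RuntimeCalls;
  return LaunchLowering::Error;
}

// Usage sketch covering the three cases.
void demoPathSelection() {
  GpuModule withTargets{std::vector<std::string>{"#gpu.nvptx"}, std::nullopt};
  GpuModule legacy{std::nullopt, std::string("CUBIN")};
  GpuModule neither{};
  assert(selectLowering(withTargets) == LaunchLowering::KeepLaunchOp);
  assert(selectLowering(legacy) == LaunchLowering::RuntimeCalls);
  assert(selectLowering(neither) == LaunchLowering::Error);
}
```

Once the old mechanism is deprecated, the `RuntimeCalls` branch in this model would be the one to disappear.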

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
1235	Yes, I'll change it.
1274	I think so? I don't remember if the in-place update gave me issues. In any case, I'll change it or explain what the issue was.
mlir/test/Conversion/GPUCommon/lower-launch-func-bare-ptr.mlir
1 ↗	(On Diff #543215)	Ok, yeah this was a test I copied & modified for testing the bare-ptr convention, so I presume it was an old test. I'll change it.

Removed the flag allow-unregistered-dialect & added the option for reusing the symbol table.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
1274	I'm not sure the `launch_func` op can be updated safely in place, as in some instances I change the number of results of the op (I have to get rid of the async token as it's illegal).
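The result-count change mentioned here follows from how the stream is chosen. Below is a minimal standalone C++ sketch of that decision, assuming the behavior shown in this revision's diff; `Stream`, `LaunchRewrite`, `rewriteLaunch`, and `demoStreamHandling` are hypothetical names, and `createStream` stands in for the stream-creation runtime call.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <vector>

using Stream = int; // hypothetical opaque stream handle

struct LaunchRewrite {
  std::optional<Stream> stream; // stream passed to the new launch op, if any
  bool oldOpReplacedByStream;   // true: async token replaced; false: op erased
};

// Returns std::nullopt when the rewrite does not apply (match failure).
std::optional<LaunchRewrite>
rewriteLaunch(const std::vector<Stream> &asyncDeps, bool hasAsyncToken,
              const std::function<Stream()> &createStream) {
  // A non-async launch with async dependencies cannot be converted.
  if (!hasAsyncToken && !asyncDeps.empty())
    return std::nullopt;

  std::optional<Stream> stream;
  if (!asyncDeps.empty())
    stream = asyncDeps.front(); // reuse the first dependency as the stream
  else if (hasAsyncToken)
    stream = createStream(); // async with no dependencies: create a stream

  // The old op had a token result only in the async case, so it is either
  // replaced by the stream value or erased outright.
  return LaunchRewrite{stream, hasAsyncToken};
}

// Usage sketch: an async launch without dependencies creates one stream.
void demoStreamHandling() {
  int created = 0;
  auto make = [&created]() { ++created; return 42; };
  auto result = rewriteLaunch({}, true, make);
  assert(result.has_value());
  assert(result->stream == 42);
  assert(result->oldOpReplacedByStream);
  assert(created == 1);
}
```

Because the async case replaces the token result with a stream value while the synchronous case has no result at all, the sketch treats the rewrite as creating a new op rather than mutating the old one.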

fmorac added a reviewer: mehdi_amini. Aug 7 2023, 5:50 AM

Harbormaster completed remote builds in B250759: Diff 547749. Aug 7 2023, 7:57 AM

Please update the commit description to provide better context as discussed before.

This revision is now accepted and ready to land. Aug 7 2023, 7:51 PM

fmorac edited the summary of this revision. Aug 9 2023, 7:00 PM

Rebasing.

Harbormaster completed remote builds in B251916: Diff 549335. Aug 11 2023, 6:03 AM

Closed by commit rGfcfeb1e5b3cd: [mlir][gpu] Add GPU target support to `gpu-to-llvm`. (authored by fmorac). Aug 11 2023, 5:27 PM

This revision was automatically updated to reflect the committed changes.

fmorac added a commit: rGfcfeb1e5b3cd: [mlir][gpu] Add GPU target support to `gpu-to-llvm`.

Diff 535992

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp

(716 unchanged lines not shown)
@@ void GpuToLLVMConversionPass::runOnOperation() {
   LowerToLLVMOptions options(&getContext());
   options.useOpaquePointers = useOpaquePointers;
   options.useBarePtrCallConv = hostBarePtrCallConv;
 
   LLVMTypeConverter converter(&getContext(), options);
   RewritePatternSet patterns(&getContext());
   LLVMConversionTarget target(getContext());
 
-  target.addIllegalDialect<gpu::GPUDialect>();
+  SymbolTable symbolTable = SymbolTable(getOperation());
+
+  target.addDynamicallyLegalOp<gpu::GPUModuleOp>(
+      [](gpu::GPUModuleOp module) -> bool {
+        return module.getTargetsAttr() != nullptr;
+      });
+
+  target.addDynamicallyLegalOp<gpu::LaunchFuncOp>(
+      [&](gpu::LaunchFuncOp op) -> bool {
+        auto module =
+            symbolTable.lookup<gpu::GPUModuleOp>(op.getKernelModuleName());
+        return converter.isLegal(op->getOperandTypes()) &&
+               converter.isLegal(op->getResultTypes()) &&
+               (module && module.getTargetsAttr() &&
+                module.getTargetsAttr().size());
+      });
 
   mlir::arith::populateArithToLLVMConversionPatterns(converter, patterns);
   mlir::cf::populateControlFlowToLLVMConversionPatterns(converter, patterns);
   populateVectorToLLVMConversionPatterns(converter, patterns);
   populateFinalizeMemRefToLLVMConversionPatterns(converter, patterns);
   populateFuncToLLVMConversionPatterns(converter, patterns);
   populateAsyncStructuralTypeConversionsAndLegality(converter, patterns,
                                                     target);
(478 unchanged lines not shown)
@@ LogicalResult ConvertLaunchFuncOpToGpuRuntimeCallPattern::matchAndRewrite(
   if (!launchOp.getAsyncToken() && !launchOp.getAsyncDependencies().empty())
     return rewriter.notifyMatchFailure(
         launchOp, "Cannot convert non-async op with async dependencies.");
 
   Location loc = launchOp.getLoc();
 
   // Create an LLVM global with CUBIN extracted from the kernel annotation and
   // obtain a pointer to the first byte in it.
   auto kernelModule = SymbolTable::lookupNearestSymbolFrom<gpu::GPUModuleOp>(
       launchOp, launchOp.getKernelModuleName());
   assert(kernelModule && "expected a kernel module");
 
+  // If the module has Targets then just update the op operands.
+  if (ArrayAttr targets = kernelModule.getTargetsAttr()) {
+    Value stream = Value();
+    if (adaptor.getAsyncDependencies().size())
+      stream = adaptor.getAsyncDependencies().front();
+    // If the async keyword is present and there are no dependencies, then a
+    // stream must be created to pass to subsequent operations.
+    else if (launchOp.getAsyncToken())
+      stream = streamCreateCallBuilder.create(loc, rewriter, {}).getResult();
+
+    // Lower the kernel operands to match kernel parameters.
+    SmallVector<Value, 4> arguments;
+    if (kernelBarePtrCallConv) {
+      // Hack the bare pointer value on just for the argument promotion
+      LLVMTypeConverter *converter = getTypeConverter();
+      LowerToLLVMOptions options = converter->getOptions();
+      LowerToLLVMOptions overrideToMatchKernelOpts = options;
+      overrideToMatchKernelOpts.useBarePtrCallConv = true;
+      converter->dangerousSetOptions(overrideToMatchKernelOpts);
+      arguments =
+          converter->promoteOperands(loc, launchOp.getKernelOperands(),
+                                     adaptor.getKernelOperands(), rewriter);
+      converter->dangerousSetOptions(options);
+    } else {
+      arguments = getTypeConverter()->promoteOperands(
+          loc, launchOp.getKernelOperands(), adaptor.getKernelOperands(),
+          rewriter);
+    }
+
+    rewriter.create<gpu::LaunchFuncOp>(
+        launchOp.getLoc(), launchOp.getKernelAttr(),
+        gpu::KernelDim3{adaptor.getGridSizeX(), adaptor.getGridSizeY(),
+                        adaptor.getGridSizeZ()},
+        gpu::KernelDim3{adaptor.getBlockSizeX(), adaptor.getBlockSizeY(),
+                        adaptor.getBlockSizeZ()},
+        adaptor.getDynamicSharedMemorySize(), arguments, stream);
+    if (launchOp.getAsyncToken())
+      rewriter.replaceOp(launchOp, {stream});
+    else
+      rewriter.eraseOp(launchOp);
+    return success();
+  }
+
   auto binaryAttr =
       kernelModule->getAttrOfType<StringAttr>(gpuBinaryAnnotation);
   if (!binaryAttr) {
     kernelModule.emitOpError()
         << "missing " << gpuBinaryAnnotation << " attribute";
     return failure();
   }
 
(689 unchanged lines not shown)

mlir/test/Conversion/GPUCommon/lower-launch-func-to-gpu-runtime-calls.mlir

-// RUN: mlir-opt %s --gpu-to-llvm="gpu-binary-annotation=nvvm.cubin use-opaque-pointers=1" | FileCheck %s
-// RUN: mlir-opt %s --gpu-to-llvm="gpu-binary-annotation=rocdl.hsaco use-opaque-pointers=1" | FileCheck %s --check-prefix=ROCDL
+// RUN: mlir-opt %s --gpu-to-llvm="gpu-binary-annotation=nvvm.cubin use-opaque-pointers=1" -split-input-file | FileCheck %s
+// RUN: mlir-opt %s --gpu-to-llvm="gpu-binary-annotation=rocdl.hsaco use-opaque-pointers=1" -split-input-file | FileCheck %s --check-prefix=ROCDL
 
 module attributes {gpu.container_module} {
 
   // CHECK: llvm.mlir.global internal constant @[[KERNEL_NAME:.*]]("kernel\00")
   // CHECK: llvm.mlir.global internal constant @[[GLOBAL:.*]]("CUBIN")
   // ROCDL: llvm.mlir.global internal constant @[[GLOBAL:.*]]("HSACO")
 
   gpu.module @kernel_module attributes {
(45 unchanged lines not shown)
@@ module attributes {gpu.container_module} {
 
     // CHECK: llvm.call @mgpuLaunchKernel([[FUNC]], [[C8]], [[C8]], [[C8]],
     // CHECK-SAME: [[C8]], [[C8]], [[C8]], [[C256]], [[STREAM]],
     // CHECK-SAME: [[PARAMS]], [[EXTRA_PARAMS]])
     // CHECK: llvm.call @mgpuStreamSynchronize
     // CHECK: llvm.call @mgpuStreamDestroy
     // CHECK: llvm.call @mgpuModuleUnload
 }
+
+// -----
+
+module attributes {gpu.container_module} {
+  // CHECK: gpu.module
+  // ROCDL: gpu.module
+  gpu.module @kernel_module [#gpu.nvptx] {
+    llvm.func @kernel(%arg0: i32, %arg1: !llvm.ptr<f32>,
+        %arg2: !llvm.ptr<f32>, %arg3: i64, %arg4: i64,
+        %arg5: i64) attributes {gpu.kernel} {
+      llvm.return
+    }
+  }
+
+  func.func @foo(%buffer: memref<?xf32>) {
+    // CHECK: [[C8:%.*]] = llvm.mlir.constant(8 : index) : i64
+    // CHECK: [[C32:%.*]] = llvm.mlir.constant(32 : i32) : i32
+    // CHECK: [[C256:%.*]] = llvm.mlir.constant(256 : i32) : i32
+    %c8 = arith.constant 8 : index
+    %c32 = arith.constant 32 : i32
+    %c256 = arith.constant 256 : i32
+
+    // CHECK: gpu.launch_func @kernel_module::@kernel
+    // CHECK: blocks in ([[C8]], [[C8]], [[C8]]) : i64 threads in ([[C8]], [[C8]], [[C8]]) : i64
+    // CHECK: dynamic_shared_memory_size [[C256]]
+    // CHECK: args([[C32]] : i32, %{{.*}} : !llvm.ptr, %{{.*}} : !llvm.ptr, %{{.*}} : i64, %{{.*}} : i64, %{{.*}} : i64)
+    gpu.launch_func @kernel_module::@kernel
+        blocks in (%c8, %c8, %c8)
+        threads in (%c8, %c8, %c8)
+        dynamic_shared_memory_size %c256
+        args(%c32 : i32, %buffer : memref<?xf32>)
+    return
+  }
+}

This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Add GPU target support to `gpu-to-llvm`.
Closed, Public
