This is an archive of the discontinued LLVM Phabricator instance.


Differential D130716

[mlir][GPU] Allow bare pointer memrefs when calling GPU kernels
ClosedPublic

Authored by krzysz00 on Jul 28 2022, 9:50 AM.


Details

Reviewers

ftynse
ThomasRaoux
herhut

Commits

rGc2fc8d9b95bd: [mlir][GPU] Allow bare pointer memrefs when calling GPU kernels

Summary

In the ROCm runtime (and probably CUDA as well), all kernel arguments
are aligned. Therefore, enable using bare pointers for memref
arguments to kernels when these memrefs have static shape and a
trivial layout.

This is a substantial optimization to launching kernels that use
memrefs with known, static sizes, since it causes the kernel launch
packet to no longer include information already known to the kernel,
which can enable packing the kernel launch arguments into launch
packets instead of having to allocate an entire separate structure to
hold unneeded memref information.
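
To make the size argument concrete, the sketch below models the default ranked-memref descriptor as a plain C++ struct and compares its size with a bare pointer. The struct and function names are illustrative assumptions, not MLIR's API. The layout (allocated pointer, aligned pointer, offset, then per-dimension sizes and strides, all with 64-bit index values) is assumed from MLIR's default memref lowering.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed model of the descriptor a ranked memref lowers to by default.
// Names are hypothetical; only the field order mirrors the real descriptor.
template <typename T, std::size_t Rank>
struct RankedMemRefDescriptor {
  T *allocated;                // pointer returned by the allocator
  T *aligned;                  // aligned pointer used for accesses
  std::int64_t offset;         // element offset from the aligned pointer
  std::int64_t sizes[Rank];    // extent of each dimension
  std::int64_t strides[Rank];  // stride of each dimension, in elements
};

// Bytes one memref kernel argument occupies when passed as a descriptor.
template <typename T, std::size_t Rank>
constexpr std::size_t descriptorArgBytes() {
  return sizeof(RankedMemRefDescriptor<T, Rank>);
}

// Bytes the same argument occupies under the bare pointer convention.
template <typename T>
constexpr std::size_t barePtrArgBytes() {
  return sizeof(T *);
}
```

On a typical 64-bit host, a rank-2 `float` memref comes to 56 bytes under this model, against 8 bytes for the bare pointer. Everything beyond the pointer is already known to the kernel when the shape is static and the layout is trivial.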

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

krzysz00 created this revision. Jul 28 2022, 9:50 AM

Herald added a reviewer: ftynse. Jul 28 2022, 9:50 AM

Herald added a reviewer: ThomasRaoux.

Herald added a project: Restricted Project.

Herald added subscribers: bzcheeseman, awarzynski, sdasgup3 and 20 others.

krzysz00 requested review of this revision. Jul 28 2022, 9:50 AM

Herald added a reviewer: herhut. Jul 28 2022, 9:50 AM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B178097: Diff 448368. Jul 28 2022, 10:32 AM

ftynse requested changes to this revision. Aug 2 2022, 2:58 AM

ftynse added inline comments.

mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp
142
156–157	Shouldn't this be an assertion? We checked that the original type is a memref above, so it not being remapped is a converter misconfiguration.
mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
659	This will discard any customizations that could have been applied to `converter` (e.g., support for additional types). I suppose we actually want a pseudo-copy constructor for `LLVMTypeConverter` that allows to take different options, or just a mechanism to change options with big warning flags that it should preserve consistency.
mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
44	Please document top-level entities.
49	Nit: `getLayout().getAffineMap()`, there is no guarantee the layout will be an `AffineMapAttr` in the longer run.
112	Nit: diagnostics should start with lower case.
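
The check these comments concern accepts a kernel only when every memref argument can be passed as a bare pointer, which the summary describes as having static shape and a trivial layout. A minimal sketch of that predicate follows. It uses a hypothetical `MemRefModel` struct in place of MLIR's type classes, so the names and fields are assumptions for illustration only.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical marker for a dimension whose extent is unknown at compile time.
constexpr std::int64_t kDynamic = -1;

// Hypothetical, simplified stand-in for a memref type (not MLIR's API).
struct MemRefModel {
  std::vector<std::int64_t> shape;  // kDynamic marks a dynamic extent
  bool identityLayout;              // true when the layout is the identity map
};

// True when every dimension has a compile-time extent.
bool hasStaticShape(const MemRefModel &type) {
  for (std::int64_t dim : type.shape)
    if (dim == kDynamic)
      return false;
  return true;
}

// Assumed condition from the summary: static shape and identity layout.
bool canConvertToBarePtr(const MemRefModel &type) {
  return hasStaticShape(type) && type.identityLayout;
}

// A kernel qualifies only if all of its memref arguments qualify.
bool canBeCalledWithBarePointers(const std::vector<MemRefModel> &args) {
  for (const MemRefModel &arg : args)
    if (!canConvertToBarePtr(arg))
      return false;
  return true;
}
```

A single dynamically shaped or non-identity argument is enough to reject the whole kernel, which matches the pass emitting one error and failing rather than mixing conventions.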

This revision now requires changes to proceed. Aug 2 2022, 2:58 AM

Address review comments

(Also, @ThomasRaoux - should this be added to the NVVM conversion? I'm not doing it right now because I can't test the change, but I thought I'd raise the possibility)

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
659	Good call, changed
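
The revised diff resolves this by saving the converter's options, overriding them through `dangerousSetOptions`, promoting the operands, and then restoring the saved options by hand. As a sketch of the same save, override, restore sequence, the code below wraps it in a scope guard. This is an alternative formulation, not what the diff does. `Options`, `Converter`, and `ScopedOptionsOverride` are hypothetical stand-ins that model only the members needed here.

```cpp
#include <utility>

// Hypothetical stand-in for the lowering options; only one field is modeled.
struct Options {
  bool useBarePtrCallConv = false;
};

// Hypothetical stand-in for the type converter.
class Converter {
public:
  const Options &getOptions() const { return options; }
  void dangerousSetOptions(Options newOptions) {
    options = std::move(newOptions);
  }

private:
  Options options;
};

// Installs override options on construction and restores the saved
// options on destruction, so the restore cannot be skipped.
class ScopedOptionsOverride {
public:
  ScopedOptionsOverride(Converter &converter, Options overrideOptions)
      : target(converter), saved(converter.getOptions()) {
    target.dangerousSetOptions(std::move(overrideOptions));
  }
  ~ScopedOptionsOverride() { target.dangerousSetOptions(saved); }

  ScopedOptionsOverride(const ScopedOptionsOverride &) = delete;
  ScopedOptionsOverride &operator=(const ScopedOptionsOverride &) = delete;

private:
  Converter &target;
  Options saved;
};

// Returns the calling convention visible while the override is active.
bool barePtrSeenDuringOverride(Converter &converter) {
  Options overrideOptions = converter.getOptions();
  overrideOptions.useBarePtrCallConv = true;
  ScopedOptionsOverride guard(converter, overrideOptions);
  return converter.getOptions().useBarePtrCallConv;
}
```

Either way, the converter object itself is reused, so any customizations registered on it (such as extra type conversions) stay in place, which was the concern raised in the review.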

Harbormaster completed remote builds in B178786: Diff 449320. Aug 2 2022, 10:39 AM

Thanks!

This revision is now accepted and ready to land. Aug 2 2022, 11:25 AM

Closed by commit rGc2fc8d9b95bd: [mlir][GPU] Allow bare pointer memrefs when calling GPU kernels (authored by krzysz00). Aug 2 2022, 1:58 PM

This revision was automatically updated to reflect the committed changes.

krzysz00 marked an inline comment as done.

krzysz00 added a commit: rGc2fc8d9b95bd: [mlir][GPU] Allow bare pointer memrefs when calling GPU kernels.

Revision Contents

Path

Size

mlir/

include/

mlir/

Conversion/

GPUCommon/

GPUCommonPass.h

6 lines

GPUToROCDL/

GPUToROCDLPass.h

1 line

LLVMCommon/

TypeConverter.h

9 lines

Passes.td

4 lines

lib/

Conversion/

GPUCommon/

GPUOpsLowering.cpp

29 lines

GPUToLLVMConversion.cpp

53 lines

GPUToROCDL/

LowerGpuOpsToROCDLOps.cpp

44 lines

test/

Conversion/

GPUToROCDL/

memref.mlir

15 lines

Integration/

GPU/

ROCM/

vecadd.mlir

19 lines

Diff 449426

mlir/include/mlir/Conversion/GPUCommon/GPUCommonPass.h

Show First 20 Lines • Show All 44 Lines • ▼ Show 20 Lines	using LoweringCallback = std::function<std::unique_ptr<llvm::Module>(
Operation *, llvm::LLVMContext &, StringRef)>;		Operation *, llvm::LLVMContext &, StringRef)>;

/// Creates a pass to convert a GPU operations into a sequence of GPU runtime		/// Creates a pass to convert a GPU operations into a sequence of GPU runtime
/// calls.		/// calls.
///		///
/// This pass does not generate code to call GPU runtime APIs directly but		/// This pass does not generate code to call GPU runtime APIs directly but
/// instead uses a small wrapper library that exports a stable and conveniently		/// instead uses a small wrapper library that exports a stable and conveniently
/// typed ABI on top of GPU runtimes such as CUDA or ROCm (HIP).		/// typed ABI on top of GPU runtimes such as CUDA or ROCm (HIP).
std::unique_ptr<OperationPass<ModuleOp>> createGpuToLLVMConversionPass();		std::unique_ptr<OperationPass<ModuleOp>>
		createGpuToLLVMConversionPass(bool kernelBarePtrCallConv = false);

/// Collect a set of patterns to convert from the GPU dialect to LLVM and		/// Collect a set of patterns to convert from the GPU dialect to LLVM and
/// populate converter for gpu types.		/// populate converter for gpu types.
void populateGpuToLLVMConversionPatterns(LLVMTypeConverter &converter,		void populateGpuToLLVMConversionPatterns(LLVMTypeConverter &converter,
RewritePatternSet &patterns,		RewritePatternSet &patterns,
StringRef gpuBinaryAnnotation = {});		StringRef gpuBinaryAnnotation = {},
		bool kernelBarePtrCallConv = false);

} // namespace mlir		} // namespace mlir

#endif // MLIR_CONVERSION_GPUCOMMON_GPUCOMMONPASS_H_		#endif // MLIR_CONVERSION_GPUCOMMON_GPUCOMMONPASS_H_

mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h

	Show All 35 Lines

	/// Creates a pass that lowers GPU dialect operations to ROCDL counterparts. The			/// Creates a pass that lowers GPU dialect operations to ROCDL counterparts. The
	/// index bitwidth used for the lowering of the device side index computations			/// index bitwidth used for the lowering of the device side index computations
	/// is configurable.			/// is configurable.
	std::unique_ptr<OperationPass<gpu::GPUModuleOp>>			std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
	createLowerGpuOpsToROCDLOpsPass(			createLowerGpuOpsToROCDLOpsPass(
	const std::string &chipset = "gfx900",			const std::string &chipset = "gfx900",
	unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,			unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout,
				bool useBarePtrCallConv = false,
	gpu::amd::Runtime runtime = gpu::amd::Runtime::Unknown);			gpu::amd::Runtime runtime = gpu::amd::Runtime::Unknown);

	} // namespace mlir			} // namespace mlir

	#endif // MLIR_CONVERSION_GPUTOROCDL_GPUTOROCDLPASS_H_			#endif // MLIR_CONVERSION_GPUTOROCDL_GPUTOROCDLPASS_H_

mlir/include/mlir/Conversion/LLVMCommon/TypeConverter.h

Show First 20 Lines • Show All 74 Lines • ▼ Show 20 Lines	public:
/// Returns the MLIR context.		/// Returns the MLIR context.
MLIRContext &getContext();		MLIRContext &getContext();

/// Returns the LLVM dialect.		/// Returns the LLVM dialect.
LLVM::LLVMDialect *getDialect() { return llvmDialect; }		LLVM::LLVMDialect *getDialect() { return llvmDialect; }

const LowerToLLVMOptions &getOptions() const { return options; }		const LowerToLLVMOptions &getOptions() const { return options; }

		/// Set the lowering options to `newOptions`. Note: using this after some
		/// conversions have been performed can lead to inconsistencies in the
		/// IR.
		void dangerousSetOptions(LowerToLLVMOptions newOptions) {
		options = std::move(newOptions);
		}

/// Promote the LLVM representation of all operands including promoting MemRef		/// Promote the LLVM representation of all operands including promoting MemRef
/// descriptors to stack and use pointers to struct to avoid the complexity		/// descriptors to stack and use pointers to struct to avoid the complexity
/// of the platform-specific C/C++ ABI lowering related to struct argument		/// of the platform-specific C/C++ ABI lowering related to struct argument
/// passing.		/// passing.
SmallVector<Value, 4> promoteOperands(Location loc, ValueRange opOperands,		SmallVector<Value, 4> promoteOperands(Location loc, ValueRange opOperands,
ValueRange operands,		ValueRange operands,
OpBuilder &builder);		OpBuilder &builder);

Show All 30 Lines	public:
/// Returns the size of the memref descriptor object in bytes.		/// Returns the size of the memref descriptor object in bytes.
unsigned getMemRefDescriptorSize(MemRefType type, const DataLayout &layout);		unsigned getMemRefDescriptorSize(MemRefType type, const DataLayout &layout);

/// Returns the size of the unranked memref descriptor object in bytes.		/// Returns the size of the unranked memref descriptor object in bytes.
unsigned getUnrankedMemRefDescriptorSize(UnrankedMemRefType type,		unsigned getUnrankedMemRefDescriptorSize(UnrankedMemRefType type,
const DataLayout &layout);		const DataLayout &layout);

/// Check if a memref type can be converted to a bare pointer.		/// Check if a memref type can be converted to a bare pointer.
bool canConvertToBarePtr(BaseMemRefType type);		static bool canConvertToBarePtr(BaseMemRefType type);

protected:		protected:
/// Pointer to the LLVM dialect.		/// Pointer to the LLVM dialect.
LLVM::LLVMDialect *llvmDialect;		LLVM::LLVMDialect *llvmDialect;

private:		private:
/// Convert a function type. The arguments and results are converted one by		/// Convert a function type. The arguments and results are converted one by
/// one. Additionally, if the function returns more than one value, pack the		/// one. Additionally, if the function returns more than one value, pack the
▲ Show 20 Lines • Show All 93 Lines • Show Last 20 Lines

mlir/include/mlir/Conversion/Passes.td

Show First 20 Lines • Show All 367 Lines • ▼ Show 20 Lines	def ConvertGpuOpsToROCDLOps : Pass<"convert-gpu-to-rocdl", "gpu::GPUModuleOp"> {
let dependentDialects = ["ROCDL::ROCDLDialect"];		let dependentDialects = ["ROCDL::ROCDLDialect"];
let options = [		let options = [
Option<"chipset", "chipset", "std::string",		Option<"chipset", "chipset", "std::string",
/default=/"\"gfx000\"",		/default=/"\"gfx000\"",
"Chipset that these operations will run on">,		"Chipset that these operations will run on">,
Option<"indexBitwidth", "index-bitwidth", "unsigned",		Option<"indexBitwidth", "index-bitwidth", "unsigned",
/default=kDeriveIndexBitwidthFromDataLayout/"0",		/default=kDeriveIndexBitwidthFromDataLayout/"0",
"Bitwidth of the index type, 0 to use size of machine word">,		"Bitwidth of the index type, 0 to use size of machine word">,
		Option<"useBarePtrCallConv", "use-bare-ptr-memref-call-conv", "bool",
		/default=/"false",
		"Replace memref arguments in GPU functions with bare pointers."
		"All memrefs must have static shape">,
Option<"runtime", "runtime", "::mlir::gpu::amd::Runtime",		Option<"runtime", "runtime", "::mlir::gpu::amd::Runtime",
"::mlir::gpu::amd::Runtime::Unknown",		"::mlir::gpu::amd::Runtime::Unknown",
"Runtime code will be run on (default is Unknown, can also use HIP or OpenCl)",		"Runtime code will be run on (default is Unknown, can also use HIP or OpenCl)",
[{::llvm::cl::values(		[{::llvm::cl::values(
clEnumValN(::mlir::gpu::amd::Runtime::Unknown, "unknown", "Unknown (default)"),		clEnumValN(::mlir::gpu::amd::Runtime::Unknown, "unknown", "Unknown (default)"),
clEnumValN(::mlir::gpu::amd::Runtime::HIP, "HIP", "HIP"),		clEnumValN(::mlir::gpu::amd::Runtime::HIP, "HIP", "HIP"),
clEnumValN(::mlir::gpu::amd::Runtime::OpenCL, "OpenCL", "OpenCL")		clEnumValN(::mlir::gpu::amd::Runtime::OpenCL, "OpenCL", "OpenCL")
)}]>		)}]>
▲ Show 20 Lines • Show All 572 Lines • Show Last 20 Lines

mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp

//===- GPUOpsLowering.cpp - GPU FuncOp / ReturnOp lowering ----------------===//

// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.

// See https://llvm.org/LICENSE.txt for license information.

// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception

//===----------------------------------------------------------------------===//

#include "GPUOpsLowering.h"

#include "mlir/Dialect/LLVMIR/LLVMDialect.h"

#include "mlir/IR/Builders.h"

#include "llvm/ADT/STLExtras.h"

#include "llvm/Support/FormatVariadic.h"

using namespace mlir;

LogicalResult

GPUFuncOpLowering::matchAndRewrite(gpu::GPUFuncOp gpuFuncOp, OpAdaptor adaptor,

ConversionPatternRewriter &rewriter) const {

Location loc = gpuFuncOp.getLoc();

▲ Show 20 Lines • Show All 112 Lines • ▼ Show 20 Lines

GPUFuncOpLowering::matchAndRewrite(gpu::GPUFuncOp gpuFuncOp, OpAdaptor adaptor,

// Move the region to the new function, update the entry block signature.

rewriter.inlineRegionBefore(gpuFuncOp.getBody(), llvmFuncOp.getBody(),

llvmFuncOp.end());

if (failed(rewriter.convertRegionTypes(&llvmFuncOp.getBody(), *typeConverter,

&signatureConversion)))

return failure();

// If bare memref pointers are being used, remap them back to memref

// descriptors This must be done after signature conversion to get rid of the

ftynse (Not Done), suggested edit:

- // descriptors This must be done after signature conversion to get rid of the

+ // descriptors. This must be done after signature conversion to get rid of the

// unrealized casts.

if (getTypeConverter()->getOptions().useBarePtrCallConv) {

OpBuilder::InsertionGuard guard(rewriter);

rewriter.setInsertionPointToStart(&llvmFuncOp.getBody().front());

for (const auto &en : llvm::enumerate(gpuFuncOp.getArgumentTypes())) {

auto memrefTy = en.value().dyn_cast<MemRefType>();

if (!memrefTy)

continue;

assert(memrefTy.hasStaticShape() &&

"Bare pointer conversion used with dynamically-shaped memrefs");

// Use a placeholder when replacing uses of the memref argument to prevent

// circular replacements.

auto remapping = signatureConversion.getInputMapping(en.index());

assert(remapping && remapping->size == 1 &&

"Type converter should produce 1-to-1 mapping for bare memrefs");

ftynse (Done): Shouldn't this be an assertion? We checked that the original type is a memref above, so it not being remapped is a converter misconfiguration.

BlockArgument newArg =

llvmFuncOp.getBody().getArgument(remapping->inputNo);

auto placeholder = rewriter.create<LLVM::UndefOp>(

loc, getTypeConverter()->convertType(memrefTy));

rewriter.replaceUsesOfBlockArgument(newArg, placeholder);

Value desc = MemRefDescriptor::fromStaticShape(

rewriter, loc, *getTypeConverter(), memrefTy, newArg);

rewriter.replaceOp(placeholder, {desc});

}

}

rewriter.eraseOp(gpuFuncOp);

return success();

}

static const char formatStringPrefix[] = "printfFormat_";

template <typename T>

static LLVM::LLVMFuncOp getOrDefineFunction(T &moduleOp, const Location loc,

▲ Show 20 Lines • Show All 185 Lines • Show Last 20 Lines

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp

Show First 20 Lines • Show All 43 Lines • ▼ Show 20 Lines

namespace {		namespace {

class GpuToLLVMConversionPass		class GpuToLLVMConversionPass
: public GpuToLLVMConversionPassBase<GpuToLLVMConversionPass> {		: public GpuToLLVMConversionPassBase<GpuToLLVMConversionPass> {
public:		public:
GpuToLLVMConversionPass() = default;		GpuToLLVMConversionPass() = default;

		GpuToLLVMConversionPass(bool kernelBarePtrCallConv)
		: GpuToLLVMConversionPass() {
		if (this->kernelBarePtrCallConv.getNumOccurrences() == 0)
		this->kernelBarePtrCallConv = kernelBarePtrCallConv;
		}

GpuToLLVMConversionPass(const GpuToLLVMConversionPass &other)		GpuToLLVMConversionPass(const GpuToLLVMConversionPass &other)
: GpuToLLVMConversionPassBase(other) {}		: GpuToLLVMConversionPassBase(other) {}

// Run the dialect converter on the module.		// Run the dialect converter on the module.
void runOnOperation() override;		void runOnOperation() override;

private:		private:
Option<std::string> gpuBinaryAnnotation{		Option<std::string> gpuBinaryAnnotation{
*this, "gpu-binary-annotation",		*this, "gpu-binary-annotation",
llvm::cl::desc("Annotation attribute string for GPU binary"),		llvm::cl::desc("Annotation attribute string for GPU binary"),
llvm::cl::init(gpu::getDefaultGpuBinaryAnnotation())};		llvm::cl::init(gpu::getDefaultGpuBinaryAnnotation())};
		Option<bool> kernelBarePtrCallConv{
		*this, "use-bare-pointers-for-kernels",
		llvm::cl::desc("Use bare pointers to pass memref arguments to kernels. "
		"The kernel must use the same setting for this option."),
		llvm::cl::init(false)};
};		};

struct FunctionCallBuilder {		struct FunctionCallBuilder {
FunctionCallBuilder(StringRef functionName, Type returnType,		FunctionCallBuilder(StringRef functionName, Type returnType,
ArrayRef<Type> argumentTypes)		ArrayRef<Type> argumentTypes)
: functionName(functionName),		: functionName(functionName),
functionType(LLVM::LLVMFunctionType::get(returnType, argumentTypes)) {}		functionType(LLVM::LLVMFunctionType::get(returnType, argumentTypes)) {}
LLVM::CallOp create(Location loc, OpBuilder &builder,		LLVM::CallOp create(Location loc, OpBuilder &builder,
▲ Show 20 Lines • Show All 214 Lines • ▼ Show 20 Lines
/// * launchKernel -- launches the kernel on a stream		/// * launchKernel -- launches the kernel on a stream
/// * streamSynchronize -- waits for operations on the stream to finish		/// * streamSynchronize -- waits for operations on the stream to finish
///		///
/// Intermediate data structures are allocated on the stack.		/// Intermediate data structures are allocated on the stack.
class ConvertLaunchFuncOpToGpuRuntimeCallPattern		class ConvertLaunchFuncOpToGpuRuntimeCallPattern
: public ConvertOpToGpuRuntimeCallPattern<gpu::LaunchFuncOp> {		: public ConvertOpToGpuRuntimeCallPattern<gpu::LaunchFuncOp> {
public:		public:
ConvertLaunchFuncOpToGpuRuntimeCallPattern(LLVMTypeConverter &typeConverter,		ConvertLaunchFuncOpToGpuRuntimeCallPattern(LLVMTypeConverter &typeConverter,
StringRef gpuBinaryAnnotation)		StringRef gpuBinaryAnnotation,
		bool kernelBarePtrCallConv)
: ConvertOpToGpuRuntimeCallPattern<gpu::LaunchFuncOp>(typeConverter),		: ConvertOpToGpuRuntimeCallPattern<gpu::LaunchFuncOp>(typeConverter),
gpuBinaryAnnotation(gpuBinaryAnnotation) {}		gpuBinaryAnnotation(gpuBinaryAnnotation),
		kernelBarePtrCallConv(kernelBarePtrCallConv) {}

private:		private:
Value generateParamsArray(gpu::LaunchFuncOp launchOp, OpAdaptor adaptor,		Value generateParamsArray(gpu::LaunchFuncOp launchOp, OpAdaptor adaptor,
OpBuilder &builder) const;		OpBuilder &builder) const;
Value generateKernelNameConstant(StringRef moduleName, StringRef name,		Value generateKernelNameConstant(StringRef moduleName, StringRef name,
Location loc, OpBuilder &builder) const;		Location loc, OpBuilder &builder) const;

LogicalResult		LogicalResult
matchAndRewrite(gpu::LaunchFuncOp launchOp, OpAdaptor adaptor,		matchAndRewrite(gpu::LaunchFuncOp launchOp, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override;		ConversionPatternRewriter &rewriter) const override;

llvm::SmallString<32> gpuBinaryAnnotation;		llvm::SmallString<32> gpuBinaryAnnotation;
		bool kernelBarePtrCallConv;
};		};

class EraseGpuModuleOpPattern : public OpRewritePattern<gpu::GPUModuleOp> {		class EraseGpuModuleOpPattern : public OpRewritePattern<gpu::GPUModuleOp> {
using OpRewritePattern<gpu::GPUModuleOp>::OpRewritePattern;		using OpRewritePattern<gpu::GPUModuleOp>::OpRewritePattern;

LogicalResult matchAndRewrite(gpu::GPUModuleOp op,		LogicalResult matchAndRewrite(gpu::GPUModuleOp op,
PatternRewriter &rewriter) const override {		PatternRewriter &rewriter) const override {
// GPU kernel modules are no longer necessary since we have a global		// GPU kernel modules are no longer necessary since we have a global
▲ Show 20 Lines • Show All 56 Lines • ▼ Show 20 Lines	void GpuToLLVMConversionPass::runOnOperation() {

mlir::arith::populateArithmeticToLLVMConversionPatterns(converter, patterns);		mlir::arith::populateArithmeticToLLVMConversionPatterns(converter, patterns);
mlir::cf::populateControlFlowToLLVMConversionPatterns(converter, patterns);		mlir::cf::populateControlFlowToLLVMConversionPatterns(converter, patterns);
populateVectorToLLVMConversionPatterns(converter, patterns);		populateVectorToLLVMConversionPatterns(converter, patterns);
populateMemRefToLLVMConversionPatterns(converter, patterns);		populateMemRefToLLVMConversionPatterns(converter, patterns);
populateFuncToLLVMConversionPatterns(converter, patterns);		populateFuncToLLVMConversionPatterns(converter, patterns);
populateAsyncStructuralTypeConversionsAndLegality(converter, patterns,		populateAsyncStructuralTypeConversionsAndLegality(converter, patterns,
target);		target);
populateGpuToLLVMConversionPatterns(converter, patterns, gpuBinaryAnnotation);		populateGpuToLLVMConversionPatterns(converter, patterns, gpuBinaryAnnotation,
		kernelBarePtrCallConv);

if (failed(		if (failed(
applyPartialConversion(getOperation(), target, std::move(patterns))))		applyPartialConversion(getOperation(), target, std::move(patterns))))
signalPassFailure();		signalPassFailure();
}		}

LLVM::CallOp FunctionCallBuilder::create(Location loc, OpBuilder &builder,		LLVM::CallOp FunctionCallBuilder::create(Location loc, OpBuilder &builder,
ArrayRef<Value> arguments) const {		ArrayRef<Value> arguments) const {
▲ Show 20 Lines • Show All 241 Lines • ▼ Show 20 Lines
// llvm.store parameters[i], %fieldPtr		// llvm.store parameters[i], %fieldPtr
// %elementPtr = llvm.getelementptr %array[i]		// %elementPtr = llvm.getelementptr %array[i]
// llvm.store %fieldPtr, %elementPtr		// llvm.store %fieldPtr, %elementPtr
// return %array		// return %array
Value ConvertLaunchFuncOpToGpuRuntimeCallPattern::generateParamsArray(		Value ConvertLaunchFuncOpToGpuRuntimeCallPattern::generateParamsArray(
gpu::LaunchFuncOp launchOp, OpAdaptor adaptor, OpBuilder &builder) const {		gpu::LaunchFuncOp launchOp, OpAdaptor adaptor, OpBuilder &builder) const {
auto loc = launchOp.getLoc();		auto loc = launchOp.getLoc();
auto numKernelOperands = launchOp.getNumKernelOperands();		auto numKernelOperands = launchOp.getNumKernelOperands();
auto arguments = getTypeConverter()->promoteOperands(		SmallVector<Value, 4> arguments;
		if (kernelBarePtrCallConv) {
		// Hack the bare pointer value on just for the argument promotion
		LLVMTypeConverter *converter = getTypeConverter();
		LowerToLLVMOptions options = converter->getOptions();
		LowerToLLVMOptions overrideToMatchKernelOpts = options;
		overrideToMatchKernelOpts.useBarePtrCallConv = true;
		ftynse (Done): This will discard any customizations that could have been applied to `converter` (e.g., support for additional types). I suppose we actually want a pseudo-copy constructor for `LLVMTypeConverter` that allows to take different options, or just a mechanism to change options with big warning flags that it should preserve consistency.
		krzysz00 (Author, Done): Good call, changed
		converter->dangerousSetOptions(overrideToMatchKernelOpts);
		arguments = converter->promoteOperands(
		loc, launchOp.getOperands().take_back(numKernelOperands),
		adaptor.getOperands().take_back(numKernelOperands), builder);
		converter->dangerousSetOptions(options);
		} else {
		arguments = getTypeConverter()->promoteOperands(
loc, launchOp.getOperands().take_back(numKernelOperands),		loc, launchOp.getOperands().take_back(numKernelOperands),
adaptor.getOperands().take_back(numKernelOperands), builder);		adaptor.getOperands().take_back(numKernelOperands), builder);
		}

auto numArguments = arguments.size();		auto numArguments = arguments.size();
SmallVector<Type, 4> argumentTypes;		SmallVector<Type, 4> argumentTypes;
argumentTypes.reserve(numArguments);		argumentTypes.reserve(numArguments);
for (auto argument : arguments)		for (auto argument : arguments)
argumentTypes.push_back(argument.getType());		argumentTypes.push_back(argument.getType());
auto structType = LLVM::LLVMStructType::getNewIdentified(context, StringRef(),		auto structType = LLVM::LLVMStructType::getNewIdentified(context, StringRef(),
argumentTypes);		argumentTypes);
auto one = builder.create<LLVM::ConstantOp>(loc, llvmInt32Type,		auto one = builder.create<LLVM::ConstantOp>(loc, llvmInt32Type,
▲ Show 20 Lines • Show All 216 Lines • ▼ Show 20 Lines	LogicalResult ConvertSetDefaultDeviceOpToGpuRuntimeCallPattern::matchAndRewrite(
ConversionPatternRewriter &rewriter) const {		ConversionPatternRewriter &rewriter) const {
Location loc = op.getLoc();		Location loc = op.getLoc();
setDefaultDeviceCallBuilder.create(loc, rewriter, {adaptor.devIndex()});		setDefaultDeviceCallBuilder.create(loc, rewriter, {adaptor.devIndex()});
rewriter.replaceOp(op, {});		rewriter.replaceOp(op, {});
return success();		return success();
}		}

std::unique_ptr<mlir::OperationPass<mlir::ModuleOp>>		std::unique_ptr<mlir::OperationPass<mlir::ModuleOp>>
mlir::createGpuToLLVMConversionPass() {		mlir::createGpuToLLVMConversionPass(bool kernelBarePtrCallConv) {
return std::make_unique<GpuToLLVMConversionPass>();		return std::make_unique<GpuToLLVMConversionPass>(kernelBarePtrCallConv);
}		}

void mlir::populateGpuToLLVMConversionPatterns(LLVMTypeConverter &converter,		void mlir::populateGpuToLLVMConversionPatterns(LLVMTypeConverter &converter,
RewritePatternSet &patterns,		RewritePatternSet &patterns,
StringRef gpuBinaryAnnotation) {		StringRef gpuBinaryAnnotation,
		bool kernelBarePtrCallConv) {
converter.addConversion(		converter.addConversion(
[context = &converter.getContext()](gpu::AsyncTokenType type) -> Type {		[context = &converter.getContext()](gpu::AsyncTokenType type) -> Type {
return LLVM::LLVMPointerType::get(IntegerType::get(context, 8));		return LLVM::LLVMPointerType::get(IntegerType::get(context, 8));
});		});
patterns.add<ConvertAllocOpToGpuRuntimeCallPattern,		patterns.add<ConvertAllocOpToGpuRuntimeCallPattern,
ConvertDeallocOpToGpuRuntimeCallPattern,		ConvertDeallocOpToGpuRuntimeCallPattern,
ConvertHostRegisterOpToGpuRuntimeCallPattern,		ConvertHostRegisterOpToGpuRuntimeCallPattern,
ConvertMemcpyOpToGpuRuntimeCallPattern,		ConvertMemcpyOpToGpuRuntimeCallPattern,
ConvertMemsetOpToGpuRuntimeCallPattern,		ConvertMemsetOpToGpuRuntimeCallPattern,
ConvertSetDefaultDeviceOpToGpuRuntimeCallPattern,		ConvertSetDefaultDeviceOpToGpuRuntimeCallPattern,
ConvertWaitAsyncOpToGpuRuntimeCallPattern,		ConvertWaitAsyncOpToGpuRuntimeCallPattern,
ConvertWaitOpToGpuRuntimeCallPattern,		ConvertWaitOpToGpuRuntimeCallPattern,
ConvertAsyncYieldToGpuRuntimeCallPattern>(converter);		ConvertAsyncYieldToGpuRuntimeCallPattern>(converter);
patterns.add<ConvertLaunchFuncOpToGpuRuntimeCallPattern>(converter,		patterns.add<ConvertLaunchFuncOpToGpuRuntimeCallPattern>(
gpuBinaryAnnotation);		converter, gpuBinaryAnnotation, kernelBarePtrCallConv);
patterns.add<EraseGpuModuleOpPattern>(&converter.getContext());		patterns.add<EraseGpuModuleOpPattern>(&converter.getContext());
}		}

mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp

Show All 35 Lines

#include "../GPUCommon/GPUOpsLowering.h"		#include "../GPUCommon/GPUOpsLowering.h"
#include "../GPUCommon/IndexIntrinsicsOpLowering.h"		#include "../GPUCommon/IndexIntrinsicsOpLowering.h"
#include "../GPUCommon/OpToFuncCallLowering.h"		#include "../GPUCommon/OpToFuncCallLowering.h"
#include "../PassDetail.h"		#include "../PassDetail.h"

using namespace mlir;		using namespace mlir;

		/// Returns true if the given `gpu.func` can be safely called using the bare
		ftynse (Done): Please document top-level entities.
		/// pointer calling convention.
		static bool canBeCalledWithBarePointers(gpu::GPUFuncOp func) {
		bool canBeBare = true;
		for (Type type : func.getArgumentTypes())
		if (auto memrefTy = type.dyn_cast<BaseMemRefType>())
		ftynse (Done): Nit: `getLayout().getAffineMap()`, there is no guarantee the layout will be an `AffineMapAttr` in the longer run.
		canBeBare &= LLVMTypeConverter::canConvertToBarePtr(memrefTy);
		return canBeBare;
		}

namespace {		namespace {

/// Import the GPU Ops to ROCDL Patterns.		/// Import the GPU Ops to ROCDL Patterns.
#include "GPUToROCDL.cpp.inc"		#include "GPUToROCDL.cpp.inc"

// A pass that replaces all occurrences of GPU device operations with their		// A pass that replaces all occurrences of GPU device operations with their
// corresponding ROCDL equivalent.		// corresponding ROCDL equivalent.
//		//
// This pass only handles device code and is not meant to be run on GPU host		// This pass only handles device code and is not meant to be run on GPU host
// code.		// code.
struct LowerGpuOpsToROCDLOpsPass		struct LowerGpuOpsToROCDLOpsPass
: public ConvertGpuOpsToROCDLOpsBase<LowerGpuOpsToROCDLOpsPass> {		: public ConvertGpuOpsToROCDLOpsBase<LowerGpuOpsToROCDLOpsPass> {
LowerGpuOpsToROCDLOpsPass() = default;		LowerGpuOpsToROCDLOpsPass() = default;
LowerGpuOpsToROCDLOpsPass(const std::string &chipset, unsigned indexBitwidth,		LowerGpuOpsToROCDLOpsPass(const std::string &chipset, unsigned indexBitwidth,
		bool useBarePtrCallConv,
gpu::amd::Runtime runtime) {		gpu::amd::Runtime runtime) {
		if (this->chipset.getNumOccurrences() == 0)
this->chipset = chipset;		this->chipset = chipset;
		if (this->indexBitwidth.getNumOccurrences() == 0)
this->indexBitwidth = indexBitwidth;		this->indexBitwidth = indexBitwidth;
		if (this->useBarePtrCallConv.getNumOccurrences() == 0)
		this->useBarePtrCallConv = useBarePtrCallConv;
		if (this->runtime.getNumOccurrences() == 0)
this->runtime = runtime;		this->runtime = runtime;
}		}

void runOnOperation() override {		void runOnOperation() override {
gpu::GPUModuleOp m = getOperation();		gpu::GPUModuleOp m = getOperation();
MLIRContext *ctx = m.getContext();		MLIRContext *ctx = m.getContext();

// Request C wrapper emission.		// Request C wrapper emission.
for (auto func : m.getOps<func::FuncOp>()) {		for (auto func : m.getOps<func::FuncOp>()) {
func->setAttr(LLVM::LLVMDialect::getEmitCWrapperAttrName(),		func->setAttr(LLVM::LLVMDialect::getEmitCWrapperAttrName(),
UnitAttr::get(ctx));		UnitAttr::get(ctx));
}		}

FailureOr<amdgpu::Chipset> maybeChipset = amdgpu::Chipset::parse(chipset);		FailureOr<amdgpu::Chipset> maybeChipset = amdgpu::Chipset::parse(chipset);
if (failed(maybeChipset)) {		if (failed(maybeChipset)) {
emitError(UnknownLoc::get(ctx), "Invalid chipset name: " + chipset);		emitError(UnknownLoc::get(ctx), "Invalid chipset name: " + chipset);
return signalPassFailure();		return signalPassFailure();
}		}

/// Customize the bitwidth used for the device side index computations.		/// Customize the bitwidth used for the device side index computations.
LowerToLLVMOptions options(		LowerToLLVMOptions options(
ctx, DataLayout(cast<DataLayoutOpInterface>(m.getOperation())));		ctx, DataLayout(cast<DataLayoutOpInterface>(m.getOperation())));
if (indexBitwidth != kDeriveIndexBitwidthFromDataLayout)		if (indexBitwidth != kDeriveIndexBitwidthFromDataLayout)
options.overrideIndexBitwidth(indexBitwidth);		options.overrideIndexBitwidth(indexBitwidth);

		if (useBarePtrCallConv) {
		options.useBarePtrCallConv = true;
		WalkResult canUseBarePointers =
		m.walk([](gpu::GPUFuncOp func) -> WalkResult {
		if (canBeCalledWithBarePointers(func))
		return WalkResult::advance();
		return WalkResult::interrupt();
		});
		if (canUseBarePointers.wasInterrupted()) {
		emitError(UnknownLoc::get(ctx),
		"bare pointer calling convention requires all memrefs to "
		[Inline comment] ftynse: Nit: diagnostics should start with lower case.
		"have static shape and use the identity map");
		return signalPassFailure();
		}
		}

LLVMTypeConverter converter(ctx, options);		LLVMTypeConverter converter(ctx, options);

RewritePatternSet patterns(ctx);		RewritePatternSet patterns(ctx);
RewritePatternSet llvmPatterns(ctx);		RewritePatternSet llvmPatterns(ctx);

populateGpuRewritePatterns(patterns);		populateGpuRewritePatterns(patterns);
(void)applyPatternsAndFoldGreedily(m, std::move(patterns));		(void)applyPatternsAndFoldGreedily(m, std::move(patterns));

[91 lines not shown]
patterns.add<OpToFuncCallLowering<math::SqrtOp>>(converter, "__ocml_sqrt_f32",		patterns.add<OpToFuncCallLowering<math::SqrtOp>>(converter, "__ocml_sqrt_f32",
"__ocml_sqrt_f64");		"__ocml_sqrt_f64");
patterns.add<OpToFuncCallLowering<math::TanhOp>>(converter, "__ocml_tanh_f32",		patterns.add<OpToFuncCallLowering<math::TanhOp>>(converter, "__ocml_tanh_f32",
"__ocml_tanh_f64");		"__ocml_tanh_f64");
}		}

std::unique_ptr<OperationPass<gpu::GPUModuleOp>>		std::unique_ptr<OperationPass<gpu::GPUModuleOp>>
mlir::createLowerGpuOpsToROCDLOpsPass(const std::string &chipset,		mlir::createLowerGpuOpsToROCDLOpsPass(const std::string &chipset,
unsigned indexBitwidth,		unsigned indexBitwidth,
		bool useBarePtrCallConv,
gpu::amd::Runtime runtime) {		gpu::amd::Runtime runtime) {
return std::make_unique<LowerGpuOpsToROCDLOpsPass>(chipset, indexBitwidth,		return std::make_unique<LowerGpuOpsToROCDLOpsPass>(
runtime);		chipset, indexBitwidth, useBarePtrCallConv, runtime);
}		}
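
The pass change above rejects the bare pointer calling convention unless every kernel passes the canBeCalledWithBarePointers check, whose body is not shown in this part of the diff. Going by the diagnostic text ("static shape" and "identity map"), the condition can be sketched roughly as follows. This is a standalone illustration, not the MLIR implementation: MemRefInfo, kDynamic, canUseBarePointer, and allKernelArgsBareCompatible are hypothetical names invented for this sketch.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a memref type; this is NOT the MLIR API.
constexpr int64_t kDynamic = -1; // marks a dimension such as the `?` in memref<?xf32>

struct MemRefInfo {
  std::vector<int64_t> shape; // one entry per dimension
  bool identityLayout;        // true when no custom layout map is attached
};

// A memref qualifies only if every dimension is static and the layout is
// the identity map, mirroring the wording of the diagnostic in the diff.
bool canUseBarePointer(const MemRefInfo &m) {
  for (int64_t dim : m.shape)
    if (dim == kDynamic)
      return false;
  return m.identityLayout;
}

// A kernel qualifies only if all of its memref arguments qualify.
bool allKernelArgsBareCompatible(const std::vector<MemRefInfo> &args) {
  for (const MemRefInfo &arg : args)
    if (!canUseBarePointer(arg))
      return false;
  return true;
}
```

Under this sketch a memref<5xf32> argument with the default layout qualifies, while memref<?xf32> does not, which is consistent with the vecadd.mlir integration test switching to static shapes.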

mlir/test/Conversion/GPUToROCDL/memref.mlir

This file was added.

				// RUN: mlir-opt %s -convert-gpu-to-rocdl -split-input-file | FileCheck %s
				// RUN: mlir-opt %s \
				// RUN: -convert-gpu-to-rocdl=use-bare-ptr-memref-call-conv=true \
				// RUN: -split-input-file \
				// RUN: | FileCheck %s --check-prefix=BARE

				gpu.module @memref_conversions {
				// CHECK: llvm.func @kern
				// CHECK-SAME: (%{{.*}}: !llvm.ptr<f32>, %{{.*}}: !llvm.ptr<f32>, %{{.*}}: i64, %{{.*}}: i64, %{{.*}}: i64)
				// BARE: llvm.func @kern
				// BARE-SAME: (%{{.*}}: !llvm.ptr<f32>)
				gpu.func @kern(%arg0: memref<8xf32>) kernel {
				gpu.return
				}
				}
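
The two CHECK prefixes in this test differ in how many arguments the kernel receives. With the default convention, a ranked memref is unpacked into its descriptor fields (allocated pointer, aligned pointer, offset, then one size and one stride per dimension), whereas the bare pointer convention passes a single pointer. A small sketch of that arithmetic, with hypothetical helper names that do not exist in MLIR:

```cpp
#include <cstddef>

// Hypothetical helpers for illustration only; not part of MLIR.

// Default convention: allocated pointer + aligned pointer + offset,
// followed by `rank` sizes and `rank` strides.
constexpr std::size_t descriptorArgCount(std::size_t rank) {
  return 3 + 2 * rank;
}

// Bare pointer convention: a single pointer per memref argument.
constexpr std::size_t barePtrArgCount() { return 1; }

// memref<8xf32> has rank 1, so the CHECK-SAME line expects five arguments
// and the BARE-SAME line expects one.
static_assert(descriptorArgCount(1) == 5, "rank-1 descriptor expands to 5 args");
static_assert(barePtrArgCount() == 1, "bare pointer convention passes 1 arg");
```

The sizes and strides dropped here are exactly the information the summary describes as already known to the kernel when shapes are static.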

mlir/test/Integration/GPU/ROCM/vecadd.mlir

// RUN: mlir-opt %s \		// RUN: mlir-opt %s \
// RUN: -convert-scf-to-cf \		// RUN: -convert-scf-to-cf \
// RUN: -gpu-kernel-outlining \		// RUN: -gpu-kernel-outlining \
// RUN: -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-rocdl,gpu-to-hsaco{chip=%chip})' \		// RUN: -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-rocdl{use-bare-ptr-memref-call-conv=true},gpu-to-hsaco{chip=%chip})' \
// RUN: -gpu-to-llvm \		// RUN: -gpu-to-llvm=use-bare-pointers-for-kernels=true \
// RUN: | mlir-cpu-runner \		// RUN: | mlir-cpu-runner \
// RUN: --shared-libs=%linalg_test_lib_dir/libmlir_rocm_runtime%shlibext \		// RUN: --shared-libs=%linalg_test_lib_dir/libmlir_rocm_runtime%shlibext \
// RUN: --shared-libs=%linalg_test_lib_dir/libmlir_runner_utils%shlibext \		// RUN: --shared-libs=%linalg_test_lib_dir/libmlir_runner_utils%shlibext \
// RUN: --entry-point-result=void \		// RUN: --entry-point-result=void \
// RUN: | FileCheck %s		// RUN: | FileCheck %s

func.func @vecadd(%arg0 : memref<?xf32>, %arg1 : memref<?xf32>, %arg2 : memref<?xf32>) {		func.func @vecadd(%arg0 : memref<5xf32>, %arg1 : memref<5xf32>, %arg2 : memref<5xf32>) {
%c0 = arith.constant 0 : index		%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index		%c1 = arith.constant 1 : index
%block_dim = memref.dim %arg0, %c0 : memref<?xf32>		%block_dim = arith.constant 5 : index
gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)		gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
threads(%tx, %ty, %tz) in (%block_x = %block_dim, %block_y = %c1, %block_z = %c1) {		threads(%tx, %ty, %tz) in (%block_x = %block_dim, %block_y = %c1, %block_z = %c1) {
%a = memref.load %arg0[%tx] : memref<?xf32>		%a = memref.load %arg0[%tx] : memref<5xf32>
%b = memref.load %arg1[%tx] : memref<?xf32>		%b = memref.load %arg1[%tx] : memref<5xf32>
%c = arith.addf %a, %b : f32		%c = arith.addf %a, %b : f32
memref.store %c, %arg2[%tx] : memref<?xf32>		memref.store %c, %arg2[%tx] : memref<5xf32>
gpu.terminator		gpu.terminator
}		}
return		return
}		}

// CHECK: [2.46, 2.46, 2.46, 2.46, 2.46]		// CHECK: [2.46, 2.46, 2.46, 2.46, 2.46]
func.func @main() {		func.func @main() {
%c0 = arith.constant 0 : index		%c0 = arith.constant 0 : index
[14 lines not shown]
%7 = memref.cast %4 : memref<?xf32> to memref<*xf32>		%7 = memref.cast %4 : memref<?xf32> to memref<*xf32>
%8 = memref.cast %5 : memref<?xf32> to memref<*xf32>		%8 = memref.cast %5 : memref<?xf32> to memref<*xf32>
gpu.host_register %6 : memref<*xf32>		gpu.host_register %6 : memref<*xf32>
gpu.host_register %7 : memref<*xf32>		gpu.host_register %7 : memref<*xf32>
gpu.host_register %8 : memref<*xf32>		gpu.host_register %8 : memref<*xf32>
%9 = call @mgpuMemGetDeviceMemRef1dFloat(%3) : (memref<?xf32>) -> (memref<?xf32>)		%9 = call @mgpuMemGetDeviceMemRef1dFloat(%3) : (memref<?xf32>) -> (memref<?xf32>)
%10 = call @mgpuMemGetDeviceMemRef1dFloat(%4) : (memref<?xf32>) -> (memref<?xf32>)		%10 = call @mgpuMemGetDeviceMemRef1dFloat(%4) : (memref<?xf32>) -> (memref<?xf32>)
%11 = call @mgpuMemGetDeviceMemRef1dFloat(%5) : (memref<?xf32>) -> (memref<?xf32>)		%11 = call @mgpuMemGetDeviceMemRef1dFloat(%5) : (memref<?xf32>) -> (memref<?xf32>)
		%12 = memref.cast %9 : memref<?xf32> to memref<5xf32>
		%13 = memref.cast %10 : memref<?xf32> to memref<5xf32>
		%14 = memref.cast %11 : memref<?xf32> to memref<5xf32>

call @vecadd(%9, %10, %11) : (memref<?xf32>, memref<?xf32>, memref<?xf32>) -> ()		call @vecadd(%12, %13, %14) : (memref<5xf32>, memref<5xf32>, memref<5xf32>) -> ()
call @printMemrefF32(%8) : (memref<*xf32>) -> ()		call @printMemrefF32(%8) : (memref<*xf32>) -> ()
return		return
}		}

func.func private @mgpuMemGetDeviceMemRef1dFloat(%ptr : memref<?xf32>) -> (memref<?xf32>)		func.func private @mgpuMemGetDeviceMemRef1dFloat(%ptr : memref<?xf32>) -> (memref<?xf32>)
func.func private @printMemrefF32(%ptr : memref<*xf32>)		func.func private @printMemrefF32(%ptr : memref<*xf32>)

Revision Contents

Diff 449426