This is an archive of the discontinued LLVM Phabricator instance.

Differential D110800

[MLIR][GPU] Add GPU launch op support for dynamic shared memory
Closed, Public

Authored by bondhugula on Sep 29 2021, 11:34 PM.


Details

Reviewers

herhut
csigg
ThomasRaoux
navdeepkk
antiagainst
ftynse
nicolasvasilache
dcaballe

Commits

rG08b63db8bb3e: [MLIR][GPU] Add GPU launch op support for dynamic shared memory

Summary

Add support for dynamic shared memory for GPU launch ops: add an
optional operand to gpu.launch and gpu.launch_func ops to specify the
amount of "dynamic" shared memory to use. Update lowerings to connect
this operand to the GPU runtime.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

bondhugula created this revision. Sep 29 2021, 11:34 PM

Herald added a reviewer: antiagainst. Sep 29 2021, 11:34 PM

Herald added a reviewer: ftynse.

Herald added subscribers: Groverkss, wenzhicui, wrengr and 23 others.

bondhugula requested review of this revision. Sep 29 2021, 11:34 PM

Herald added a project: Restricted Project. Sep 29 2021, 11:34 PM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B126498: Diff 376113. Sep 29 2021, 11:46 PM

csigg accepted this revision. Sep 30 2021, 12:08 AM

csigg added inline comments.

mlir/include/mlir/Dialect/GPU/GPUOps.td
381–382	Would it make sense to provide a default?

This revision is now accepted and ready to land. Sep 30 2021, 12:08 AM

Add default value on builder for "dynamic memory size" for gpu.launch but not for gpu.launch_func.

mlir/include/mlir/Dialect/GPU/GPUOps.td
381–382	Makes sense, although we'll need to provide a default value for the last argument as well if we provide it for this one. I've done this for the `gpu.launch` op where it's more useful. However, for this one (launch_func), I'm concerned that a default would lead to runtime breakage for any downstream users since `kernelOperands` in pre-existing code would then silently map to this new argument and escape a build failure.
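
To make the concern above concrete, here is a minimal, self-contained sketch. `Value`, `ValueRange`, `LaunchArgs`, `buildOld`, and `buildWithDefaults` are hypothetical stand-ins written for this illustration, not the MLIR classes or the generated builders. The only property assumed is that a single `Value` converts implicitly to a `ValueRange`. Under that assumption, a call that used to pass one kernel operand still compiles once both trailing parameters have defaults, but the operand now binds to the shared-memory size parameter.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an SSA value handle; id < 0 plays the role of a
// null value.
struct Value {
  int id = -1;
  Value() = default;
  explicit Value(int id) : id(id) {}
  explicit operator bool() const { return id >= 0; }
};

// Hypothetical stand-in for a range of values. The implicit single-value
// constructor is the assumption this sketch depends on.
struct ValueRange {
  std::vector<Value> values;
  ValueRange() = default;
  ValueRange(Value v) : values(1, v) {}
};

// What a builder call ended up recording.
struct LaunchArgs {
  bool hasDynamicSharedMemorySize;
  std::size_t numKernelOperands;
};

// Shape of the builder before this change: the trailing argument is the
// kernel operand list.
LaunchArgs buildOld(ValueRange kernelOperands) {
  return {false, kernelOperands.values.size()};
}

// Hypothetical variant with defaults on both trailing arguments.
LaunchArgs buildWithDefaults(Value dynamicSharedMemorySize = Value(),
                             ValueRange kernelOperands = {}) {
  return {static_cast<bool>(dynamicSharedMemorySize),
          kernelOperands.values.size()};
}
```

With `Value arg(7)`, `buildOld(arg)` records one kernel operand, whereas `buildWithDefaults(arg)` records a shared-memory size and no kernel operands, and the compiler reports nothing in either case.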

Harbormaster completed remote builds in B126506: Diff 376127. Sep 30 2021, 1:10 AM

bondhugula marked an inline comment as done. Oct 1 2021, 4:21 AM

This revision was landed with ongoing or failed builds. Oct 1 2021, 4:21 AM

Closed by commit rG08b63db8bb3e: [MLIR][GPU] Add GPU launch op support for dynamic shared memory (authored by bondhugula).

This revision was automatically updated to reflect the committed changes.

bondhugula added a commit: rG08b63db8bb3e: [MLIR][GPU] Add GPU launch op support for dynamic shared memory.

nbpatel added a subscriber: nbpatel. May 15 2023, 11:18 AM

nbpatel added inline comments.

mlir/test/Conversion/GPUCommon/lower-launch-func-to-gpu-runtime-calls.mlir
27	is there an example of how this value will be passed to the outlined kernel?

Herald added a reviewer: nicolasvasilache. May 15 2023, 11:18 AM

Herald added a reviewer: dcaballe.

Herald added a project: Restricted Project.

Herald added subscribers: bviyer, Moerafaat, zero9178 and 3 others.

Revision Contents

Path    Size
mlir/include/mlir/Dialect/GPU/GPUOps.td    42 lines
mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp    16 lines
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp    34 lines
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp    5 lines
mlir/test/Conversion/GPUCommon/lower-launch-func-to-gpu-runtime-calls.mlir    8 lines
mlir/test/Conversion/GPUToSPIRV/builtins.mlir    2 lines
mlir/test/Dialect/GPU/invalid.mlir    4 lines
mlir/test/Dialect/GPU/ops.mlir    3 lines

Diff 376475

mlir/include/mlir/Dialect/GPU/GPUOps.td

[283 earlier lines not shown]
  }

  def GPU_LaunchFuncOp : GPU_Op<"launch_func",
                                [GPU_AsyncOpInterface, AttrSizedOperandSegments]>,
      Arguments<(ins Variadic<GPU_AsyncToken>:$asyncDependencies,
                 SymbolRefAttr:$kernel,
                 Index:$gridSizeX, Index:$gridSizeY, Index:$gridSizeZ,
                 Index:$blockSizeX, Index:$blockSizeY, Index:$blockSizeZ,
+                Optional<I32>:$dynamicSharedMemorySize,
                 Variadic<AnyType>:$operands)>,
      Results<(outs Optional<GPU_AsyncToken>:$asyncToken)> {
    let summary = "Launches a function as a GPU kernel";

    let description = [{
      Launch a kernel function on the specified grid of thread blocks.
      `gpu.launch` operations are lowered to `gpu.launch_func` operations by
      outlining the kernel body into a function in a dedicated module, which
[12 lines not shown; hunk context: `let description = [{`]
      completed. If the `async` keyword is present, the host does not block but
      instead a `!gpu.async.token` is returned. Other async GPU ops can take this
      token as dependency.

      The operation requires at least the grid and block sizes along the x,y,z
      dimensions as arguments. When a lower-dimensional kernel is required,
      unused sizes must be explicitly set to `1`.

-     The remaining operands are passed as arguments to the kernel function.
+     The remaining operands are optional. The first optional operand corresponds
+     to the amount of dynamic shared memory a kernel's workgroup should be
+     allocated; when this operand is not present, a zero size is assumed.
+
+     The remaining operands if present are passed as arguments to the kernel
+     function.

      Example:

      ```mlir
      module attributes {gpu.container_module} {

        // This module creates a separate compilation unit for the GPU compiler.
        gpu.module @kernels {
[26 lines not shown; hunk context: `module attributes {gpu.container_module} {`]
        %t0 = gpu.wait async
        gpu.launch_func
            async                           // (Optional) Don't block host, return token.
            [%t0]                           // (Optional) Execute only after %t0 has completed.
            @kernels::@kernel_1             // Kernel function.
            blocks in (%cst, %cst, %cst)    // Grid size.
            threads in (%cst, %cst, %cst)   // Block size.
+           dynamic_shared_memory_size %s   // (Optional) Amount of dynamic shared
+                                           // memory to allocate for a workgroup.
            args(%arg0 : f32,               // (Optional) Kernel arguments.
                 %arg1 : memref<?xf32, 1>)
      }
      ```
    }];

    let skipDefaultBuilders = 1;

    let builders = [
      OpBuilder<(ins "GPUFuncOp":$kernelFunc, "KernelDim3":$gridSize,
-       "KernelDim3":$blockSize, "ValueRange":$kernelOperands)>
+       "KernelDim3":$blockSize, "Value":$dynamicSharedMemorySize,
+       "ValueRange":$kernelOperands)>

Inline comment by csigg (marked done):

      OpBuilder<(ins "GPUFuncOp":$kernelFunc, "KernelDim3":$gridSize,
-       "KernelDim3":$blockSize, "Value":$dynamicSharedMemorySize,
+       "KernelDim3":$blockSize, CArg<"Value", "nullptr">:$dynamicSharedMemorySize,
        "ValueRange":$kernelOperands)>

Would it make sense to provide a default?

Inline reply by bondhugula (author, marked done):

Makes sense, although we'll need to provide a default value for the last argument as well if we provide it for this one. I've done this for the `gpu.launch` op where it's more useful. However, for this one (launch_func), I'm concerned that a default would lead to runtime breakage for any downstream users since `kernelOperands` in pre-existing code would then silently map to this new argument and escape a build failure.

    ];

    let extraClassDeclaration = [{
      /// The number of operands passed to the kernel function.
      unsigned getNumKernelOperands();

      /// The name of the kernel's containing module.
      StringAttr getKernelModuleName();
[24 lines not shown; hunk context: `def GPU_LaunchFuncOp : GPU_Op<"launch_func",`]
    }];

    let verifier = [{ return ::verify(*this); }];
    let assemblyFormat = [{
        custom<AsyncDependencies>(type($asyncToken), $asyncDependencies)
        $kernel
        `blocks` `in` ` ` `(`$gridSizeX`,` $gridSizeY`,` $gridSizeZ`)`
        `threads` `in` ` ` `(`$blockSizeX`,` $blockSizeY`,` $blockSizeZ`)`
-       custom<LaunchFuncOperands>($operands, type($operands))
-       attr-dict
+       (`dynamic_shared_memory_size` $dynamicSharedMemorySize^)?
+       custom<LaunchFuncOperands>($operands, type($operands)) attr-dict
    }];
  }

  def GPU_LaunchOp : GPU_Op<"launch">,
      Arguments<(ins Index:$gridSizeX, Index:$gridSizeY, Index:$gridSizeZ,
-                Index:$blockSizeX, Index:$blockSizeY, Index:$blockSizeZ)>,
+                Index:$blockSizeX, Index:$blockSizeY, Index:$blockSizeZ,
+                Optional<I32>:$dynamicSharedMemorySize)>,
      Results<(outs)> {
    let summary = "GPU kernel launch operation";

    let description = [{
      Launch a kernel on the specified grid of thread blocks. The body of the
      kernel is defined by the single region that this operation contains. The
-     operation takes six operands, with first three operands being grid sizes
-     along x,y,z dimensions and the following three arguments being block sizes
-     along x,y,z dimension. When a lower-dimensional kernel is required,
-     unused sizes must be explicitly set to `1`.
+     operation takes six operands followed by an optional operand: the first
+     three operands are grid sizes along the x,y,z dimensions and the following
+     three are block sizes along the x,y,z dimensions. The last operand is
+     optional and corresponds to the amount of dynamic shared memory a kernel's
+     workgroup should be allocated; when this operand is not present, a zero size
+     is assumed.
+
+     When a lower-dimensional kernel is required, unused sizes must
+     be explicitly set to `1`.

      The body region has _twelve_ arguments, grouped as follows:

      - three arguments that contain block identifiers along x,y,z dimensions;
      - three arguments that contain thread identifiers along x,y,z dimensions;
      - operands of the `gpu.launch` operation as is (i.e. the operands for
        grid and block sizes).

      Syntax:

      ```
      operation ::= `gpu.launch` `block` `(` ssa-id-list `)` `in` ssa-reassignment
                               `threads` `(` ssa-id-list `)` `in` ssa-reassignment
+                              (dynamic_shared_memory_size ssa-use)?
                                 region attr-dict?
      ssa-reassignment ::= `(` ssa-id `=` ssa-use (`,` ssa-id `=` ssa-use)* `)`
      ```

      Example:

      ```mlir
      gpu.launch blocks(%bx, %by, %bz) in (%sz_bx = %0, %sz_by = %1, %sz_bz = %2)
                 threads(%tx, %ty, %tz) in (%sz_tx = %3, %sz_ty = %4, %sz_tz = %5) {
[31 lines not shown; hunk context: `def GPU_LaunchOp : GPU_Op<"launch">,`]
    let regions = (region AnyRegion:$body);

    let skipDefaultBuilders = 1;

    let builders = [
      OpBuilder<(ins "Value":$gridSizeX, "Value":$gridSizeY,
        "Value":$gridSizeZ, "Value":$blockSizeX, "Value":$blockSizeY,
-       "Value":$blockSizeZ)>
+       "Value":$blockSizeZ,
+       CArg<"Value", "nullptr">:$dynamic_shared_memory_size)>
    ];

    let extraClassDeclaration = [{
      /// Get the SSA values corresponding to kernel block identifiers.
      KernelDim3 getBlockIds();
      /// Get the SSA values corresponding to kernel thread identifiers.
      KernelDim3 getThreadIds();
      /// Get the SSA values corresponding to kernel grid size.
      KernelDim3 getGridSize();
      /// Get the SSA values corresponding to kernel block size.
      KernelDim3 getBlockSize();

      /// Get the SSA values passed as operands to specify the grid size.
      KernelDim3 getGridSizeOperandValues();
      /// Get the SSA values passed as operands to specify the block size.
      KernelDim3 getBlockSizeOperandValues();

      static StringRef getBlocksKeyword() { return "blocks"; }
      static StringRef getThreadsKeyword() { return "threads"; }
+     static StringRef getDynamicSharedMemorySizeKeyword() {
+       return "dynamic_shared_memory_size";
+     }

      /// The number of launch configuration operands, placed at the leading
      /// positions of the operand list.
      static constexpr unsigned kNumConfigOperands = 6;

      /// The number of region attributes containing the launch configuration,
      /// placed in the leading positions of the argument list.
      static constexpr unsigned kNumConfigRegionAttributes = 12;
[584 further lines not shown]

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp

[739 earlier lines not shown; hunk context: `auto zero = rewriter.create<LLVM::ConstantOp>(loc, llvmInt32Type,`]
                                                  rewriter.getI32IntegerAttr(0));
    Value stream =
        adaptor.asyncDependencies().empty()
            ? streamCreateCallBuilder.create(loc, rewriter, {}).getResult(0)
            : adaptor.asyncDependencies().front();
    // Create array of pointers to kernel arguments.
    auto kernelParams = generateParamsArray(launchOp, adaptor, rewriter);
    auto nullpointer = rewriter.create<LLVM::NullOp>(loc, llvmPointerPointerType);
-   launchKernelCallBuilder.create(loc, rewriter,
-                                  {function.getResult(0), adaptor.gridSizeX(),
-                                   adaptor.gridSizeY(), adaptor.gridSizeZ(),
-                                   adaptor.blockSizeX(), adaptor.blockSizeY(),
-                                   adaptor.blockSizeZ(),
-                                   /*sharedMemBytes=*/zero, stream, kernelParams,
+   Value dynamicSharedMemorySize = launchOp.dynamicSharedMemorySize()
+                                       ? launchOp.dynamicSharedMemorySize()
+                                       : zero;
+   launchKernelCallBuilder.create(
+       loc, rewriter,
+       {function.getResult(0), adaptor.gridSizeX(), adaptor.gridSizeY(),
+        adaptor.gridSizeZ(), adaptor.blockSizeX(), adaptor.blockSizeY(),
+        adaptor.blockSizeZ(), dynamicSharedMemorySize, stream, kernelParams,
         /*extra=*/nullpointer});

    if (launchOp.asyncToken()) {
      // Async launch: make dependent ops use the same stream.
      rewriter.replaceOp(launchOp, {stream});
    } else {
      // Synchronize with host and destroy stream. This must be the stream created
      // above (with no other uses) because we check that the synchronous version
      // does not have any async dependencies.
[102 further lines not shown]
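
The selection made in this hunk (forward the operand when it is present, otherwise fall back to the zero constant) can be sketched outside MLIR as follows. This is only an illustration: `LaunchCall` and `lowerLaunch` are hypothetical names, and `std::optional` stands in for the optional SSA operand. It needs C++17.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical record of the arguments handed to a runtime launch call.
struct LaunchCall {
  std::int64_t grid[3];
  std::int64_t block[3];
  std::int32_t sharedMemBytes;
};

// Mirrors the choice made in the lowering: use the dynamic shared memory size
// when the launch carries one, and zero otherwise.
LaunchCall lowerLaunch(const std::int64_t (&grid)[3],
                       const std::int64_t (&block)[3],
                       std::optional<std::int32_t> dynamicSharedMemorySize) {
  LaunchCall call{};
  for (int i = 0; i < 3; ++i) {
    call.grid[i] = grid[i];
    call.block[i] = block[i];
  }
  call.sharedMemBytes = dynamicSharedMemorySize.value_or(0);
  return call;
}
```

Using the numbers from the test in this revision, an 8x8x8 launch with a size of 256 yields `sharedMemBytes == 256`, and the same launch without the operand yields `0`, which matches the behaviour before this change.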

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

[347 earlier lines not shown]
  }

  //===----------------------------------------------------------------------===//
  // LaunchOp
  //===----------------------------------------------------------------------===//

  void LaunchOp::build(OpBuilder &builder, OperationState &result,
                       Value gridSizeX, Value gridSizeY, Value gridSizeZ,
-                      Value blockSizeX, Value blockSizeY, Value blockSizeZ) {
+                      Value blockSizeX, Value blockSizeY, Value blockSizeZ,
+                      Value dynamicSharedMemorySize) {
    // Add grid and block sizes as op operands, followed by the data operands.
    result.addOperands(
        {gridSizeX, gridSizeY, gridSizeZ, blockSizeX, blockSizeY, blockSizeZ});
+   if (dynamicSharedMemorySize)
+     result.addOperands(dynamicSharedMemorySize);

    // Create a kernel body region with kNumConfigRegionAttributes + N arguments,
    // where the first kNumConfigRegionAttributes arguments have `index` type and
    // the rest have the same types as the data operands.
    Region *kernelRegion = result.addRegion();
    Block *body = new Block();
    body->addArguments(
        std::vector<Type>(kNumConfigRegionAttributes, builder.getIndexType()));
[33 lines not shown]
  }

  static LogicalResult verify(LaunchOp op) {
    // Kernel launch takes kNumConfigOperands leading operands for grid/block
    // sizes and transforms them into kNumConfigRegionAttributes region arguments
    // for block/thread identifiers and grid/block sizes.
    if (!op.body().empty()) {
      if (op.body().getNumArguments() !=
-         LaunchOp::kNumConfigOperands + op.getNumOperands())
+         LaunchOp::kNumConfigOperands + op.getNumOperands() -
+             (op.dynamicSharedMemorySize() ? 1 : 0))
        return op.emitOpError("unexpected number of region arguments");
    }

    // Block terminators without successors are expected to exit the kernel region
    // and must be `gpu.terminator`.
    for (Block &block : op.body()) {
      if (block.empty())
        continue;
[27 lines not shown]
  static void printLaunchOp(OpAsmPrinter &p, LaunchOp op) {
    // Print the launch configuration.
    p << ' ' << op.getBlocksKeyword();
    printSizeAssignment(p, op.getGridSize(), op.getGridSizeOperandValues(),
                        op.getBlockIds());
    p << ' ' << op.getThreadsKeyword();
    printSizeAssignment(p, op.getBlockSize(), op.getBlockSizeOperandValues(),
                        op.getThreadIds());
+   if (op.dynamicSharedMemorySize())
+     p << ' ' << op.getDynamicSharedMemorySizeKeyword() << ' '
+       << op.dynamicSharedMemorySize();

    p.printRegion(op.body(), /*printEntryBlockArgs=*/false);
    p.printOptionalAttrDict(op->getAttrs());
  }

  // Parse the size assignment blocks for blocks and threads. These have the form
  //   (%region_arg, %region_arg, %region_arg) in
  //   (%region_arg = %operand, %region_arg = %operand, %region_arg = %operand)
[55 lines not shown; hunk context: `if (parser.parseKeyword(LaunchOp::getBlocksKeyword().data()) ||`]
        parser.parseKeyword(LaunchOp::getThreadsKeyword().data()) ||
        parseSizeAssignment(parser, sizesRef.drop_front(3),
                            regionArgsRef.slice(9, 3),
                            regionArgsRef.slice(3, 3)) ||
        parser.resolveOperands(sizes, parser.getBuilder().getIndexType(),
                               result.operands))
      return failure();

+   OpAsmParser::OperandType dynamicSharedMemorySize;
+   if (!parser.parseOptionalKeyword(
+           LaunchOp::getDynamicSharedMemorySizeKeyword()))
+     if (parser.parseOperand(dynamicSharedMemorySize) ||
+         parser.resolveOperand(dynamicSharedMemorySize,
+                               parser.getBuilder().getI32Type(),
+                               result.operands))
+       return failure();

    // Introduce the body region and parse it. The region has
    // kNumConfigRegionAttributes arguments that correspond to
    // block/thread identifiers and grid/block sizes, all of the `index` type.
    Type index = parser.getBuilder().getIndexType();
    SmallVector<Type, LaunchOp::kNumConfigRegionAttributes> dataTypes(
        LaunchOp::kNumConfigRegionAttributes, index);
    Region *body = result.addRegion();
    return failure(parser.parseRegion(*body, regionArgs, dataTypes) ||
[40 lines not shown]
  }

  //===----------------------------------------------------------------------===//
  // LaunchFuncOp
  //===----------------------------------------------------------------------===//

  void LaunchFuncOp::build(OpBuilder &builder, OperationState &result,
                           GPUFuncOp kernelFunc, KernelDim3 gridSize,
-                          KernelDim3 blockSize, ValueRange kernelOperands) {
+                          KernelDim3 blockSize, Value dynamicSharedMemorySize,
+                          ValueRange kernelOperands) {
    // Add grid and block sizes as op operands, followed by the data operands.
    result.addOperands({gridSize.x, gridSize.y, gridSize.z, blockSize.x,
                        blockSize.y, blockSize.z});
+   if (dynamicSharedMemorySize)
+     result.addOperands(dynamicSharedMemorySize);
    result.addOperands(kernelOperands);
    auto kernelModule = kernelFunc->getParentOfType<GPUModuleOp>();
    auto kernelSymbol =
        SymbolRefAttr::get(kernelModule.getNameAttr(),
                           {SymbolRefAttr::get(kernelFunc.getNameAttr())});
    result.addAttribute(getKernelAttrName(), kernelSymbol);
-   SmallVector<int32_t, 8> segmentSizes(8, 1);
+   SmallVector<int32_t, 9> segmentSizes(9, 1);
    segmentSizes.front() = 0; // Initially no async dependencies.
+   segmentSizes[segmentSizes.size() - 2] = dynamicSharedMemorySize ? 1 : 0;
    segmentSizes.back() = static_cast<int32_t>(kernelOperands.size());
    result.addAttribute(getOperandSegmentSizeAttr(),
                        builder.getI32VectorAttr(segmentSizes));
  }

  unsigned LaunchFuncOp::getNumKernelOperands() {
-   return getNumOperands() - asyncDependencies().size() - kNumConfigOperands;
+   return getNumOperands() - asyncDependencies().size() - kNumConfigOperands -
+          (dynamicSharedMemorySize() ? 1 : 0);
  }

  StringAttr LaunchFuncOp::getKernelModuleName() {
    return kernel().getRootReference();
  }

  StringAttr LaunchFuncOp::getKernelName() { return kernel().getLeafReference(); }

  Value LaunchFuncOp::getKernelOperand(unsigned i) {
-   return getOperand(asyncDependencies().size() + kNumConfigOperands + i);
+   return getOperand(asyncDependencies().size() + kNumConfigOperands +
+                     (dynamicSharedMemorySize() ? 1 : 0) + i);
  }

  KernelDim3 LaunchFuncOp::getGridSizeOperandValues() {
    auto operands = getOperands().drop_front(asyncDependencies().size());
    return KernelDim3{operands[0], operands[1], operands[2]};
  }

  KernelDim3 LaunchFuncOp::getBlockSizeOperandValues() {
[553 further lines not shown]
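
The index arithmetic in the `LaunchFuncOp` changes follows from the operand layout: async dependencies first, then the six grid and block sizes, then the optional shared-memory size, then the kernel operands. The sketch below restates that arithmetic with plain integers. The free functions `makeSegmentSizes`, `numKernelOperands`, and `kernelOperandIndex` are hypothetical names chosen for this illustration and are not part of the dialect.

```cpp
#include <array>
#include <cstdint>

// Six launch-configuration operands: grid x/y/z and block x/y/z.
constexpr unsigned kNumConfigOperands = 6;

// Nine operand segments: async dependencies, six sizes, the optional dynamic
// shared memory size, and the kernel operands.
std::array<std::int32_t, 9> makeSegmentSizes(unsigned numAsyncDeps,
                                             bool hasDynamicSharedMemorySize,
                                             unsigned numKernelOperands) {
  std::array<std::int32_t, 9> sizes;
  sizes.fill(1);
  sizes.front() = static_cast<std::int32_t>(numAsyncDeps);
  sizes[sizes.size() - 2] = hasDynamicSharedMemorySize ? 1 : 0;
  sizes.back() = static_cast<std::int32_t>(numKernelOperands);
  return sizes;
}

// Kernel operand count: everything left after the leading segments.
unsigned numKernelOperands(unsigned numOperands, unsigned numAsyncDeps,
                           bool hasDynamicSharedMemorySize) {
  return numOperands - numAsyncDeps - kNumConfigOperands -
         (hasDynamicSharedMemorySize ? 1 : 0);
}

// Position of the i-th kernel operand in the flat operand list.
unsigned kernelOperandIndex(unsigned numAsyncDeps,
                            bool hasDynamicSharedMemorySize, unsigned i) {
  return numAsyncDeps + kNumConfigOperands +
         (hasDynamicSharedMemorySize ? 1 : 0) + i;
}
```

For the launch in the test file of this revision (no async dependencies, a shared-memory size, two kernel operands) the op has nine operands in total, and the first kernel operand sits at index 7 rather than 6.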

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp

[209 earlier lines not shown]

  /// Replace `gpu.launch` operations with an `gpu.launch_func` operation
  /// launching `kernelFunc`. The kernel func contains the body of the
  /// `gpu.launch` with constant region arguments inlined.
  static void convertToLaunchFuncOp(gpu::LaunchOp launchOp,
                                    gpu::GPUFuncOp kernelFunc,
                                    ValueRange operands) {
    OpBuilder builder(launchOp);
+   // The launch op has an optional dynamic shared memory size. If it doesn't
+   // exist, we use zero.
    builder.create<gpu::LaunchFuncOp>(
        launchOp.getLoc(), kernelFunc, launchOp.getGridSizeOperandValues(),
-       launchOp.getBlockSizeOperandValues(), operands);
+       launchOp.getBlockSizeOperandValues(), launchOp.dynamicSharedMemorySize(),
+       operands);
    launchOp.erase();
  }

  namespace {
  /// Pass that moves the kernel of each LaunchOp into its separate nested module.
  ///
  /// This pass moves the kernel code of each LaunchOp into a function created
  /// inside a nested module. It also creates an external function of the same
[89 further lines not shown]

mlir/test/Conversion/GPUCommon/lower-launch-func-to-gpu-runtime-calls.mlir

Show All 14 Lines	llvm.func @kernel(%arg0: i32, %arg1: !llvm.ptr<f32>,
%arg5: i64) attributes {gpu.kernel} {		%arg5: i64) attributes {gpu.kernel} {
llvm.return		llvm.return
}		}
}		}

func @foo(%buffer: memref<?xf32>) {		func @foo(%buffer: memref<?xf32>) {
%c8 = constant 8 : index		%c8 = constant 8 : index
%c32 = constant 32 : i32		%c32 = constant 32 : i32
		%c256 = constant 256 : i32
gpu.launch_func @kernel_module::@kernel		gpu.launch_func @kernel_module::@kernel
blocks in (%c8, %c8, %c8)		blocks in (%c8, %c8, %c8)
threads in (%c8, %c8, %c8)		threads in (%c8, %c8, %c8)
		dynamic_shared_memory_size %c256
		nbpatelUnsubmitted Not Done Reply Inline Actions is there an example of how this value will be passed to the outlined kernel? nbpatel: is there an example of how this value will be passed to the outlined kernel?
args(%c32 : i32, %buffer : memref<?xf32>)		args(%c32 : i32, %buffer : memref<?xf32>)
return		return
}		}

// CHECK: [[C8:%.*]] = llvm.mlir.constant(8 : index) : i64		// CHECK-DAG: [[C256:%.*]] = llvm.mlir.constant(256 : i32) : i32
		// CHECK-DAG: [[C8:%.*]] = llvm.mlir.constant(8 : index) : i64
// CHECK: [[ADDRESSOF:%.*]] = llvm.mlir.addressof @[[GLOBAL]]		// CHECK: [[ADDRESSOF:%.*]] = llvm.mlir.addressof @[[GLOBAL]]
// CHECK: [[C0:%.*]] = llvm.mlir.constant(0 : index)		// CHECK: [[C0:%.*]] = llvm.mlir.constant(0 : index)
// CHECK: [[BINARY:%.*]] = llvm.getelementptr [[ADDRESSOF]]{{\[}}[[C0]], [[C0]]]		// CHECK: [[BINARY:%.*]] = llvm.getelementptr [[ADDRESSOF]]{{\[}}[[C0]], [[C0]]]
// CHECK-SAME: -> !llvm.ptr<i8>		// CHECK-SAME: -> !llvm.ptr<i8>

// CHECK: [[MODULE:%.*]] = llvm.call @mgpuModuleLoad([[BINARY]])		// CHECK: [[MODULE:%.*]] = llvm.call @mgpuModuleLoad([[BINARY]])
// CHECK: [[FUNC:%.*]] = llvm.call @mgpuModuleGetFunction([[MODULE]], {{.*}})		// CHECK: [[FUNC:%.*]] = llvm.call @mgpuModuleGetFunction([[MODULE]], {{.*}})

// CHECK: [[C0_I32:%.*]] = llvm.mlir.constant(0 : i32)
// CHECK: [[STREAM:%.*]] = llvm.call @mgpuStreamCreate		// CHECK: [[STREAM:%.*]] = llvm.call @mgpuStreamCreate

// CHECK: [[NUM_PARAMS:%.*]] = llvm.mlir.constant(6 : i32) : i32		// CHECK: [[NUM_PARAMS:%.*]] = llvm.mlir.constant(6 : i32) : i32
// CHECK-NEXT: [[PARAMS:%.*]] = llvm.alloca [[NUM_PARAMS]] x !llvm.ptr<i8>		// CHECK-NEXT: [[PARAMS:%.*]] = llvm.alloca [[NUM_PARAMS]] x !llvm.ptr<i8>

// CHECK: [[EXTRA_PARAMS:%.*]] = llvm.mlir.null : !llvm.ptr<ptr<i8>>		// CHECK: [[EXTRA_PARAMS:%.*]] = llvm.mlir.null : !llvm.ptr<ptr<i8>>

// CHECK: llvm.call @mgpuLaunchKernel([[FUNC]], [[C8]], [[C8]], [[C8]],		// CHECK: llvm.call @mgpuLaunchKernel([[FUNC]], [[C8]], [[C8]], [[C8]],
// CHECK-SAME: [[C8]], [[C8]], [[C8]], [[C0_I32]], [[STREAM]],		// CHECK-SAME: [[C8]], [[C8]], [[C8]], [[C256]], [[STREAM]],
// CHECK-SAME: [[PARAMS]], [[EXTRA_PARAMS]])		// CHECK-SAME: [[PARAMS]], [[EXTRA_PARAMS]])
// CHECK: llvm.call @mgpuStreamSynchronize		// CHECK: llvm.call @mgpuStreamSynchronize
// CHECK: llvm.call @mgpuStreamDestroy		// CHECK: llvm.call @mgpuStreamDestroy
// CHECK: llvm.call @mgpuModuleUnload		// CHECK: llvm.call @mgpuModuleUnload
}		}
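As a rough illustration of the behaviour this test checks (not code from the patch): the shared memory argument passed to the runtime launch call is the `dynamic_shared_memory_size` operand when the op has one, and zero otherwise. The sketch below models that fallback in plain C++. `LaunchConfig` and `resolveDynamicSharedMemorySize` are hypothetical names chosen for illustration; they are not MLIR APIs.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for the sizes carried by a launch op.
// Not an MLIR type; it only models the optional operand.
struct LaunchConfig {
  int64_t gridX, gridY, gridZ;
  int64_t blockX, blockY, blockZ;
  // Empty when the op has no dynamic_shared_memory_size operand.
  std::optional<int32_t> dynamicSharedMemorySize;
};

// Returns the value to forward as the shared memory argument of the
// launch call: the operand if present, otherwise zero.
int32_t resolveDynamicSharedMemorySize(const LaunchConfig &config) {
  return config.dynamicSharedMemorySize.value_or(0);
}
```

With the test input above, a configuration carrying 256 would resolve to 256 (the `[[C256]]` argument), while one without the operand would resolve to 0 (the former `[[C0_I32]]` argument).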

mlir/test/Conversion/GPUToSPIRV/builtins.mlir

module attributes {gpu.container_module} {
}		}
}		}

// -----		// -----

module attributes {gpu.container_module} {		module attributes {gpu.container_module} {
func @builtin() {		func @builtin() {
%c0 = constant 1 : index		%c0 = constant 1 : index
		%c256 = constant 256 : i32
gpu.launch_func @kernels::@builtin_workgroup_id_y		gpu.launch_func @kernels::@builtin_workgroup_id_y
blocks in (%c0, %c0, %c0) threads in (%c0, %c0, %c0)		blocks in (%c0, %c0, %c0) threads in (%c0, %c0, %c0)
		dynamic_shared_memory_size %c256
return		return
}		}

// CHECK-LABEL: spv.module @{{.*}} Logical GLSL450		// CHECK-LABEL: spv.module @{{.*}} Logical GLSL450
// CHECK: spv.GlobalVariable [[WORKGROUPID:@.*]] built_in("WorkgroupId")		// CHECK: spv.GlobalVariable [[WORKGROUPID:@.*]] built_in("WorkgroupId")
gpu.module @kernels {		gpu.module @kernels {
gpu.func @builtin_workgroup_id_y() kernel		gpu.func @builtin_workgroup_id_y() kernel
attributes {spv.entry_point_abi = {local_size = dense<[16, 1, 1]>: vector<3xi32>}} {		attributes {spv.entry_point_abi = {local_size = dense<[16, 1, 1]>: vector<3xi32>}} {

mlir/test/Dialect/GPU/invalid.mlir

	// RUN: mlir-opt -split-input-file -verify-diagnostics %s			// RUN: mlir-opt -split-input-file -verify-diagnostics %s

	func @not_enough_sizes(%sz : index) {			func @not_enough_sizes(%sz : index) {
	// expected-error@+1 {{expected 6 operands, but found 5}}			// expected-error@+1 {{expected 6 or more operands, but found 5}}
	"gpu.launch"(%sz, %sz, %sz, %sz, %sz) ({			"gpu.launch"(%sz, %sz, %sz, %sz, %sz) ({
	gpu.return			gpu.return
	}) : (index, index, index, index, index) -> ()			}) : (index, index, index, index, index) -> ()
	return			return
	}			}

	// -----			// -----

	}			}

	// -----			// -----

	module attributes {gpu.container_module} {			module attributes {gpu.container_module} {
	func @launch_func_missing_callee_attribute(%sz : index) {			func @launch_func_missing_callee_attribute(%sz : index) {
	// expected-error@+1 {{'gpu.launch_func' op requires attribute 'kernel'}}			// expected-error@+1 {{'gpu.launch_func' op requires attribute 'kernel'}}
	"gpu.launch_func"(%sz, %sz, %sz, %sz, %sz, %sz)			"gpu.launch_func"(%sz, %sz, %sz, %sz, %sz, %sz)
	{operand_segment_sizes = dense<[0, 1, 1, 1, 1, 1, 1, 0]> : vector<8xi32>}			{operand_segment_sizes = dense<[0, 1, 1, 1, 1, 1, 1, 0, 0]> : vector<9xi32>}
	: (index, index, index, index, index, index) -> ()			: (index, index, index, index, index, index) -> ()
	return			return
	}			}
	}			}

	// -----			// -----

	module attributes {gpu.container_module} {			module attributes {gpu.container_module} {

mlir/test/Dialect/GPU/ops.mlir

gpu.module @kernels {
}		}
}		}

func @foo() {		func @foo() {
%0 = "op"() : () -> (f32)		%0 = "op"() : () -> (f32)
%1 = "op"() : () -> (memref<?xf32, 1>)		%1 = "op"() : () -> (memref<?xf32, 1>)
// CHECK: %{{.*}} = constant 8		// CHECK: %{{.*}} = constant 8
%cst = constant 8 : index		%cst = constant 8 : index
		%c0 = constant 0 : i32
%t0 = gpu.wait async		%t0 = gpu.wait async

// CHECK: gpu.launch_func @kernels::@kernel_1 blocks in (%{{.*}}, %{{.*}}, %{{.*}}) threads in (%{{.*}}, %{{.*}}, %{{.*}}) args(%{{.*}} : f32, %{{.*}} : memref<?xf32, 1>)		// CHECK: gpu.launch_func @kernels::@kernel_1 blocks in (%{{.*}}, %{{.*}}, %{{.*}}) threads in (%{{.*}}, %{{.*}}, %{{.*}}) args(%{{.*}} : f32, %{{.*}} : memref<?xf32, 1>)
gpu.launch_func @kernels::@kernel_1 blocks in (%cst, %cst, %cst) threads in (%cst, %cst, %cst) args(%0 : f32, %1 : memref<?xf32, 1>)		gpu.launch_func @kernels::@kernel_1 blocks in (%cst, %cst, %cst) threads in (%cst, %cst, %cst) args(%0 : f32, %1 : memref<?xf32, 1>)

		gpu.launch_func @kernels::@kernel_1 blocks in (%cst, %cst, %cst) threads in (%cst, %cst, %cst) dynamic_shared_memory_size %c0 args(%0 : f32, %1 : memref<?xf32, 1>)

// CHECK: gpu.launch_func @kernels::@kernel_2 blocks in (%{{.*}}, %{{.*}}, %{{.*}}) threads in (%{{.*}}, %{{.*}}, %{{.*}})		// CHECK: gpu.launch_func @kernels::@kernel_2 blocks in (%{{.*}}, %{{.*}}, %{{.*}}) threads in (%{{.*}}, %{{.*}}, %{{.*}})
gpu.launch_func @kernels::@kernel_2 blocks in (%cst, %cst, %cst) threads in (%cst, %cst, %cst)		gpu.launch_func @kernels::@kernel_2 blocks in (%cst, %cst, %cst) threads in (%cst, %cst, %cst)

// CHECK: %{{.*}} = gpu.launch_func async [%{{.*}}] @kernels::@kernel_2 blocks in (%{{.*}}, %{{.*}}, %{{.*}}) threads in (%{{.*}}, %{{.*}}, %{{.*}})		// CHECK: %{{.*}} = gpu.launch_func async [%{{.*}}] @kernels::@kernel_2 blocks in (%{{.*}}, %{{.*}}, %{{.*}}) threads in (%{{.*}}, %{{.*}}, %{{.*}})
%t1 = gpu.launch_func async [%t0] @kernels::@kernel_2 blocks in (%cst, %cst, %cst) threads in (%cst, %cst, %cst)		%t1 = gpu.launch_func async [%t0] @kernels::@kernel_2 blocks in (%cst, %cst, %cst) threads in (%cst, %cst, %cst)

return		return
}		}

Diff 376475