This is an archive of the discontinued LLVM Phabricator instance.

Differential D123499

Add async dependencies support for gpu.launch op
ClosedPublic

Authored by bondhugula on Apr 11 2022, 5:50 AM.

Details

Reviewers

csigg
herhut
ThomasRaoux
ftynse
mehdi_amini

Commits

rGf47a38f51724: Add async dependencies support for gpu.launch op

Summary

Add async dependencies support for gpu.launch op: this allows specifying
a list of async tokens ("streams") as dependencies for the launch.

Update the GPU kernel outlining pass lowering to propagate async
dependencies from gpu.launch to gpu.launch_func op. Previously, a new
stream was being created and destroyed for a kernel launch. The async
deps support allows the kernel launch to be serialized on an existing
stream.
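As a rough illustration of the propagation described in the summary, the following standalone C++ sketch models the two ops as plain structs. The names `ToyLaunch`, `ToyLaunchFunc`, and `outline` are hypothetical stand-ins and not MLIR classes; the sketch only assumes that the dependency list and the presence of a result token carry over unchanged from `gpu.launch` to `gpu.launch_func`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for gpu.launch: only the async-related parts.
struct ToyLaunch {
  bool hasAsyncToken;                  // true when `async` was specified
  std::vector<std::string> asyncDeps;  // names of the tokens waited on
};

// Hypothetical stand-in for gpu.launch_func.
struct ToyLaunchFunc {
  std::string kernel;                  // symbol of the outlined kernel
  bool hasAsyncToken;
  std::vector<std::string> asyncDeps;
};

// Models the outlining step: the kernel body moves elsewhere (not modeled),
// while the async token and the dependency list are copied over as-is.
ToyLaunchFunc outline(const ToyLaunch &launch, const std::string &kernel) {
  return ToyLaunchFunc{kernel, launch.hasAsyncToken, launch.asyncDeps};
}
```

With this model, a launch that waits on `%t0` and `%t1` outlines to a launch_func that waits on the same two tokens, so no fresh stream has to be introduced for the launch.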

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

bondhugula created this revision. Apr 11 2022, 5:50 AM

Herald added a reviewer: ThomasRaoux. Apr 11 2022, 5:50 AM

Herald added a project: Restricted Project.

Herald added subscribers: sdasgup3, wenzhicui, wrengr and 20 others.

bondhugula requested review of this revision. Apr 11 2022, 5:50 AM

Herald added a project: Restricted Project. Apr 11 2022, 5:50 AM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B158985: Diff 421890. Apr 11 2022, 6:11 AM

Not opposed to this change at all, but what's the motivation for allowing gpu-async-region to run before gpu-kernel-outlining?

Some cleanup.

bondhugula added a reviewer: ftynse. Apr 11 2022, 6:33 AM

Update commit summary.

In D123499#3442565, @csigg wrote:

Not opposed to this change at all, but what's the motivation for allowing gpu-async-region to run before gpu-kernel-outlining?

I didn't know that pass existed! Do let me know if that approach subsumes or is more canonical/systematic than this one. I guess if a user wishes to explicitly specify which streams the kernels should be part of and how they should be ordered/overlapped, they could go with not attaching async deps on gpu.launch and let gpu-async-region mark them async with deps. If they wanted full control, they would specify these. So, there is still motivation for running gpu-async-region before gpu-kernel-outlining in pipelines -- but could I know which pipeline/use case you were referring to?

On a separate note, at least with NVIDIA GPUs, aren't all kernel launches always asynchronous (w.r.t host)? What does it mean to not mark a gpu.launch_func async?! Does the GPU dialect leave that unspecified (target-dependent)?

bondhugula added a reviewer: mehdi_amini. Apr 11 2022, 6:52 AM

Harbormaster completed remote builds in B159001: Diff 421907. Apr 11 2022, 7:06 AM

The gpu-async-region pass simply chains together sequences of gpu ops, with the intention of using async.execute to separate independent work that runs on separate streams. For that case, gpu ops can be synchronous during lowering from higher dialects because the async.execute regions specify which gpu ops should run in sequence and which ones can run in parallel.

But you are right, users might want to set dependencies through tokens manually, and might want to do that before lowering from gpu.launch to gpu.launch_func.

On a separate note, at least with NVIDIA GPUs, aren't all kernel launches always asynchronous (w.r.t host)? What does it mean to not mark a gpu.launch_func async?! Does the GPU dialect leave that unspecified (target-dependent)?

A gpu.launch_func without async implies that it is synchronous (same for gpu.memset, gpu.memcpy etc). If the lowering to the gpu runtime wouldn't reject it, it would need to insert a mgpuStreamSynchronize call.
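The distinction drawn here can be sketched with a toy stream in standalone C++. Everything in the block (`ToyStream`, `launchAsync`, `launchSync`) is a hypothetical model rather than the GPU runtime API; it only captures the host-visible ordering, assuming that a launch without `async` behaves like an enqueue followed by a stream synchronize.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>

// Hypothetical toy stream: queued work only runs when the host synchronizes.
class ToyStream {
public:
  void enqueue(std::function<void()> work) { pending.push(std::move(work)); }

  // Runs all queued work in order; models the host blocking on the stream.
  void synchronize() {
    while (!pending.empty()) {
      pending.front()();
      pending.pop();
    }
  }

  std::size_t numPending() const { return pending.size(); }

private:
  std::queue<std::function<void()>> pending;
};

// Models an `async` launch: the host returns without waiting.
void launchAsync(ToyStream &stream, std::function<void()> kernel) {
  stream.enqueue(std::move(kernel));
}

// Models a launch without `async`: enqueue, then wait for completion.
void launchSync(ToyStream &stream, std::function<void()> kernel) {
  stream.enqueue(std::move(kernel));
  stream.synchronize();
}
```

In this model, work submitted through `launchAsync` stays pending until something synchronizes the stream, while `launchSync` leaves the stream empty by the time it returns.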

In D123499#3442878, @csigg wrote:

A gpu.launch_func without async implies that it is synchronous (same for gpu.memset, gpu.memcpy etc). If the lowering to the gpu runtime wouldn't reject it, it would need to insert a mgpuStreamSynchronize call.

I see -- it's been made "synchronous" as a result of the synchronize call inserted below. This makes sense to me - thanks. But in that case, async with 0 async dep tokens wouldn't appear to be a meaningful configuration for the op. The lowering does check for it and fails, but should it just be disallowed? For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print). It should perhaps be a parse error and similarly for the launch_func op?

Add AsyncOpInterface to gpu.launch and move common methods in GPUDialect.cpp up.

Sorted order.

Harbormaster completed remote builds in B159134: Diff 422090. Apr 11 2022, 6:38 PM

async with 0 async dep tokens wouldn't appear to be a meaningful configuration for the op. The lowering does check for it and fails, but should it just be disallowed?

I'm not sure. I would say the semantics are clear if an op uses async (the host does not wait for the op to complete) but no dependencies (it can run immediately without waiting for anything else), and it's OK for the current lowering to be limited in what it can handle and rely on gpu-async-region to bring it into lowering-compatible form. I kind of like the symmetry of these ops (including gpu.wait, where gpu.wait async [] needs to be valid).

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print). It should perhaps be a parse error and similarly for the launch_func op?

I don't think this is a good idea. The 'async' keyword stands for the !gpu.async.token return type and should really be independent of the token operands inside []. I would not infer the former from the latter, but fail verification (for all but gpu.wait) if we want that.

In D123499#3445734, @csigg wrote:

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print).

I don't understand what you're describing: you don't get to control the number of outputs of the operation; this is parsed by MLIR before your custom parser, and if you don't provide a result type matching the number of results, MLIR will fatal_error() anyway.

In D123499#3445734, @csigg wrote:

async with 0 async dep tokens wouldn't appear to be a meaningful configuration for the op. The lowering does check for it and fails, but should it just be disallowed?

I'm not sure. I would say the semantics are clear if an op uses async (the host does not wait for the op to complete) but no dependencies (it can run immediately without waiting for anything else), and it's OK for the current lowering to be limited in what it can handle and rely on gpu-async-region to bring it into lowering-compatible form. I kind of like the symmetry of these ops (including gpu.wait, where gpu.wait async [] needs to be valid).

This makes sense to me.

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print). It should perhaps be a parse error and similarly for the launch_func op?

I don't think this is a good idea. The 'async' keyword stands for the !gpu.async.token return type and should really be independent of the token operands inside []. I would not infer the former from the latter, but fail verification (for all but gpu.wait) if we want that.

This makes sense to me too -- follows from the previous para. I'll update the revision to make sure there is a meaning to the "async" keyword (regardless of the tokens) in that it returns a token indicating async execution.

In D123499#3447326, @mehdi_amini wrote:

In D123499#3445734, @csigg wrote:

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print).

I don't understand what you're describing: you don't get to control the number of outputs of the operation; this is parsed by MLIR before your custom parser, and if you don't provide a result type matching the number of results, MLIR will fatal_error() anyway.

I meant "adding a result type". A single result type will be added if there is an async keyword. Why would it be a fatal error instead of a parse error, which it should be if a user provided a number of results on the LHS that are inconsistent with the expected number?

In D123499#3447834, @bondhugula wrote:

In D123499#3447326, @mehdi_amini wrote:

In D123499#3445734, @csigg wrote:

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print).

I don't understand what you're describing: you don't get to control the number of outputs of the operation; this is parsed by MLIR before your custom parser, and if you don't provide a result type matching the number of results, MLIR will fatal_error() anyway.

I meant "adding a result type". A single result type will be added if there is an async keyword. Why would it be a fatal error instead of a parse error, which it should be if a user provided a number of results on the LHS that are inconsistent with the expected number?

Right, it'll be a parse error.

Fix semantics and syntax to allow async without any deps.
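The agreed behavior, where the `async` keyword decides whether a token is returned regardless of what appears inside `[]`, can be sketched as a small standalone parser. This is a toy over pre-split tokens, not the MLIR `OpAsmParser`; `AsyncPrefix` and `parseAsyncPrefix` are hypothetical names, and the only validation assumed is that dependency names start with `%`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Result of parsing the optional `async` keyword and dependency list.
struct AsyncPrefix {
  bool hasAsyncToken = false;     // set only by the `async` keyword
  std::vector<std::string> deps;  // set only by the `[...]` list
};

// Parses (`async`)? (`[` ssa-id-list `]`)? from `tokens` starting at `pos`.
// Returns false on a malformed list; `pos` ends after the consumed tokens.
bool parseAsyncPrefix(const std::vector<std::string> &tokens,
                      std::size_t &pos, AsyncPrefix &out) {
  if (pos < tokens.size() && tokens[pos] == "async") {
    out.hasAsyncToken = true;  // independent of the dependency list
    ++pos;
  }
  if (pos >= tokens.size() || tokens[pos] != "[")
    return true;  // no dependency list at all is valid
  ++pos;          // consume '['
  if (pos < tokens.size() && tokens[pos] == "]") {
    ++pos;        // empty list `[]`
    return true;
  }
  while (pos < tokens.size()) {
    if (tokens[pos].empty() || tokens[pos][0] != '%')
      return false;  // expected an SSA name
    out.deps.push_back(tokens[pos++]);
    if (pos < tokens.size() && tokens[pos] == "]") {
      ++pos;
      return true;
    }
    if (pos >= tokens.size() || tokens[pos] != ",")
      return false;  // expected ',' or ']'
    ++pos;
  }
  return false;  // ran out of tokens before ']'
}
```

Keeping the two pieces of state separate means `async` with no list still yields a token, which matches the symmetry argument made for `gpu.wait async []` in the thread.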

In D123499#3447805, @bondhugula wrote:

In D123499#3445734, @csigg wrote:

For the support in this PR, I'm not adding a result token to the op if no async deps have been specified (even if the async keyword is specified, it's dropped during the print). It should perhaps be a parse error and similarly for the launch_func op?

I don't think this is a good idea. The 'async' keyword stands for the !gpu.async.token return type and should really be independent of the token operands inside []. I would not infer the former from the latter, but fail verification (for all but gpu.wait) if we want that.

This makes sense to me too -- follows from the previous para. I'll update the revision to make sure there is a meaning to the "async" keyword (regardless of the tokens) in that it returns a token indicating async execution.

Done and addressed. I've updated the doc comment as well. @csigg, this is ready for review now.
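The operand layout that the updated description documents (async dependencies first, then the six grid and block sizes, then the optional dynamic shared memory size) can be illustrated with a standalone sketch of the segment-size bookkeeping. The functions below are hypothetical helpers using `std::vector` instead of MLIR types; they mirror the arithmetic in the revision's `LaunchOp::build` and size accessors but are not that code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sizes of the eight operand groups, in order:
// [async deps, gridX, gridY, gridZ, blockX, blockY, blockZ, dynamic shared mem].
std::vector<std::int32_t> operandSegmentSizes(std::size_t numAsyncDeps,
                                              bool hasDynamicSharedMemory) {
  std::vector<std::int32_t> sizes(8, 1);  // the six size operands are always 1
  sizes.front() = static_cast<std::int32_t>(numAsyncDeps);
  sizes.back() = hasDynamicSharedMemory ? 1 : 0;
  return sizes;
}

// Index of the first grid-size operand in the flat operand list: the async
// dependencies come first, so they have to be skipped.
std::size_t firstGridSizeIndex(std::size_t numAsyncDeps) {
  return numAsyncDeps;
}
```

For example, a launch with two dependencies and no dynamic shared memory gets segment sizes `{2, 1, 1, 1, 1, 1, 1, 0}`, and its grid sizes start at operand index 2 instead of 0.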

Harbormaster completed remote builds in B159586: Diff 422709. Apr 13 2022, 6:52 PM

Missed updates for gpu.launch -> gpu.launch_func.

Add a couple more test cases for the outlining pass.

Harbormaster completed remote builds in B159591: Diff 422714. Apr 13 2022, 7:18 PM

Any more comments here @csigg ?

Rebase.

Harbormaster completed remote builds in B160480: Diff 423951. Apr 20 2022, 10:39 AM

csigg accepted this revision. Apr 21 2022, 1:02 AM

csigg added inline comments.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
486	Isn't this checked elsewhere already? Also, the error message refers to the 'async dependencies' instead of the 'async keyword'.

This revision is now accepted and ready to land. Apr 21 2022, 1:02 AM

Adjust error message.

bondhugula marked an inline comment as done. Apr 21 2022, 3:55 AM

bondhugula added inline comments.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
486	Adjusted the error message to refer to 'async keyword'. This isn't checked anywhere else (not in any automatically generated verifier hooks); we check for this during (custom) parsing, but the verifier needs to separately check for it as well.

This revision was landed with ongoing or failed builds. Apr 21 2022, 3:56 AM

Closed by commit rGf47a38f51724: Add async dependencies support for gpu.launch op (authored by bondhugula).

This revision was automatically updated to reflect the committed changes.

bondhugula marked an inline comment as done.

bondhugula added a commit: rGf47a38f51724: Add async dependencies support for gpu.launch op.

Harbormaster completed remote builds in B160616: Diff 424138. Apr 21 2022, 4:02 AM

Revision Contents

Path                                                     Size
mlir/include/mlir/Dialect/GPU/GPUOps.td                  42 lines
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp                   141 lines
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp      7 lines
mlir/test/Dialect/GPU/invalid.mlir                       6 lines
mlir/test/Dialect/GPU/ops.mlir                           30 lines
mlir/test/Dialect/GPU/outlining.mlir                     20 lines

Diff 422714

mlir/include/mlir/Dialect/GPU/GPUOps.td

[414 lines not shown]	let description = [{
```		```
}];		}];

let skipDefaultBuilders = 1;		let skipDefaultBuilders = 1;

let builders = [		let builders = [
OpBuilder<(ins "GPUFuncOp":$kernelFunc, "KernelDim3":$gridSize,		OpBuilder<(ins "GPUFuncOp":$kernelFunc, "KernelDim3":$gridSize,
"KernelDim3":$blockSize, "Value":$dynamicSharedMemorySize,		"KernelDim3":$blockSize, "Value":$dynamicSharedMemorySize,
"ValueRange":$kernelOperands)>		"ValueRange":$kernelOperands,
		CArg<"Type", "nullptr">:$asyncTokenType,
		CArg<"ValueRange", "{}">:$asyncDependencies)>
];		];

let extraClassDeclaration = [{		let extraClassDeclaration = [{
/// The number of operands passed to the kernel function.		/// The number of operands passed to the kernel function.
unsigned getNumKernelOperands();		unsigned getNumKernelOperands();

/// The name of the kernel's containing module.		/// The name of the kernel's containing module.
StringAttr getKernelModuleName();		StringAttr getKernelModuleName();
[29 lines not shown]	let assemblyFormat = [{
`blocks` `in` ` ` `(`$gridSizeX`,` $gridSizeY`,` $gridSizeZ`)`		`blocks` `in` ` ` `(`$gridSizeX`,` $gridSizeY`,` $gridSizeZ`)`
`threads` `in` ` ` `(`$blockSizeX`,` $blockSizeY`,` $blockSizeZ`)`		`threads` `in` ` ` `(`$blockSizeX`,` $blockSizeY`,` $blockSizeZ`)`
(`dynamic_shared_memory_size` $dynamicSharedMemorySize^)?		(`dynamic_shared_memory_size` $dynamicSharedMemorySize^)?
custom<LaunchFuncOperands>($operands, type($operands)) attr-dict		custom<LaunchFuncOperands>($operands, type($operands)) attr-dict
}];		}];
let hasVerifier = 1;		let hasVerifier = 1;
}		}

def GPU_LaunchOp : GPU_Op<"launch", [AutomaticAllocationScope]>,		def GPU_LaunchOp : GPU_Op<"launch",
Arguments<(ins Index:$gridSizeX, Index:$gridSizeY, Index:$gridSizeZ,		[AutomaticAllocationScope, AttrSizedOperandSegments, GPU_AsyncOpInterface]>,
		Arguments<(ins Variadic<GPU_AsyncToken>:$asyncDependencies,
		Index:$gridSizeX, Index:$gridSizeY, Index:$gridSizeZ,
Index:$blockSizeX, Index:$blockSizeY, Index:$blockSizeZ,		Index:$blockSizeX, Index:$blockSizeY, Index:$blockSizeZ,
Optional<I32>:$dynamicSharedMemorySize)>,		Optional<I32>:$dynamicSharedMemorySize)>,
Results<(outs)> {		Results<(outs Optional<GPU_AsyncToken>:$asyncToken)> {
let summary = "GPU kernel launch operation";		let summary = "GPU kernel launch operation";

let description = [{		let description = [{
Launch a kernel on the specified grid of thread blocks. The body of the		Launch a kernel on the specified grid of thread blocks. The body of the
kernel is defined by the single region that this operation contains. The		kernel is defined by the single region that this operation contains. The
operation takes six operands followed by an optional operand: the first		operation takes an optional list of async dependencies followed by six
three operands are grid sizes along the x,y,z dimensions and the following		operands and an optional operand.
three are block sizes along the x,y,z dimensions. The last operand is
optional and corresponds to the amount of dynamic shared memory a kernel's
workgroup should be allocated; when this operand is not present, a zero size
is assumed.

When a lower-dimensional kernel is required, unused sizes must		The `async` keyword indicates the kernel should be launched asynchronously;
be explicitly set to `1`.		the operation returns a new !gpu.async.token when the keyword is specified.
		The kernel launched does not start executing until the ops producing its
		async dependencies (optional operands) have completed.

		The first three operands (following any async dependencies) are grid sizes
		along the x,y,z dimensions and the following three are block sizes along the
		x,y,z dimensions. When a lower-dimensional kernel is required, unused sizes
		must be explicitly set to `1`. The last operand is optional and corresponds
		to the amount of dynamic shared memory a kernel's workgroup should be
		allocated; when this operand is not present, a zero size is assumed.

The body region has _twelve_ arguments, grouped as follows:		The body region has _twelve_ arguments, grouped as follows:

- three arguments that contain block identifiers along x,y,z dimensions;		- three arguments that contain block identifiers along x,y,z dimensions;
- three arguments that contain thread identifiers along x,y,z dimensions;		- three arguments that contain thread identifiers along x,y,z dimensions;
- operands of the `gpu.launch` operation as is (i.e. the operands for		- operands of the `gpu.launch` operation as is (i.e. the operands for
grid and block sizes).		grid and block sizes).

Syntax:		Syntax:

```		```
operation ::= `gpu.launch` `block` `(` ssa-id-list `)` `in` ssa-reassignment		operation ::= `gpu.launch` (`async` (`[` ssa-id-list `]`)? )?
		`block` `(` ssa-id-list `)` `in` ssa-reassignment
`threads` `(` ssa-id-list `)` `in` ssa-reassignment		`threads` `(` ssa-id-list `)` `in` ssa-reassignment
(dynamic_shared_memory_size ssa-use)?		(dynamic_shared_memory_size ssa-use)?
region attr-dict?		region attr-dict?
ssa-reassignment ::= `(` ssa-id `=` ssa-use (`,` ssa-id `=` ssa-use)* `)`		ssa-reassignment ::= `(` ssa-id `=` ssa-use (`,` ssa-id `=` ssa-use)* `)`
```		```

Example:		Example:

[35 lines not shown]	def GPU_LaunchOp : GPU_Op<"launch",
let regions = (region AnyRegion:$body);		let regions = (region AnyRegion:$body);

let skipDefaultBuilders = 1;		let skipDefaultBuilders = 1;

let builders = [		let builders = [
OpBuilder<(ins "Value":$gridSizeX, "Value":$gridSizeY,		OpBuilder<(ins "Value":$gridSizeX, "Value":$gridSizeY,
"Value":$gridSizeZ, "Value":$blockSizeX, "Value":$blockSizeY,		"Value":$gridSizeZ, "Value":$blockSizeX, "Value":$blockSizeY,
"Value":$blockSizeZ,		"Value":$blockSizeZ,
CArg<"Value", "nullptr">:$dynamic_shared_memory_size)>		CArg<"Value", "nullptr">:$dynamicSharedMemorySize,
		CArg<"Type", "nullptr">:$asyncTokenType,
		CArg<"ValueRange", "{}">:$asyncDependencies)>
];		];

let extraClassDeclaration = [{		let extraClassDeclaration = [{
/// Get the SSA values corresponding to kernel block identifiers.		/// Get the SSA values corresponding to kernel block identifiers.
KernelDim3 getBlockIds();		KernelDim3 getBlockIds();
/// Get the SSA values corresponding to kernel thread identifiers.		/// Get the SSA values corresponding to kernel thread identifiers.
KernelDim3 getThreadIds();		KernelDim3 getThreadIds();
/// Get the SSA values corresponding to kernel grid size.		/// Get the SSA values corresponding to kernel grid size.
[865 lines not shown]

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp

[269 lines not shown]	auto walkResult = module.walk([&module](LaunchFuncOp launchOp) -> WalkResult {
}		}

return success();		return success();
});		});

return walkResult.wasInterrupted() ? failure() : success();		return walkResult.wasInterrupted() ? failure() : success();
}		}

		/// Parses an optional list of async operands.
		/// (`async` `[` ssa-id-list `]`)?
		///
		/// This method is used by the tablegen assembly format for async ops as well.
		static ParseResult parseAsyncDependencies(
		OpAsmParser &parser, Type &asyncTokenType,
		SmallVectorImpl<OpAsmParser::UnresolvedOperand> &asyncDependencies) {
		auto loc = parser.getCurrentLocation();
		if (succeeded(parser.parseOptionalKeyword("async"))) {
		if (parser.getNumResults() == 0)
		return parser.emitError(loc, "needs to be named when marked 'async'");
		asyncTokenType = parser.getBuilder().getType<AsyncTokenType>();
		}
		return parser.parseOperandList(asyncDependencies,
		OpAsmParser::Delimiter::OptionalSquare);
		}

		// Used by the tablegen assembly format for several async ops.
		static void printAsyncDependencies(OpAsmPrinter &printer, Operation *op,
		Type asyncTokenType,
		OperandRange asyncDependencies) {
		if (asyncTokenType)
		printer << "async ";
		if (asyncDependencies.empty())
		return;
		printer << '[';
		llvm::interleaveComma(asyncDependencies, printer);
		printer << ']';
		}

		//===----------------------------------------------------------------------===//
		// AllReduceOp
		//===----------------------------------------------------------------------===//

LogicalResult gpu::AllReduceOp::verifyRegions() {		LogicalResult gpu::AllReduceOp::verifyRegions() {
if (body().empty() != op().hasValue())		if (body().empty() != op().hasValue())
return emitError("expected either an op attribute or a non-empty body");		return emitError("expected either an op attribute or a non-empty body");
if (!body().empty()) {		if (!body().empty()) {
if (body().getNumArguments() != 2)		if (body().getNumArguments() != 2)
return emitError("expected two region arguments");		return emitError("expected two region arguments");
for (auto argument : body().getArguments()) {		for (auto argument : body().getArguments()) {
if (argument.getType() != getType())		if (argument.getType() != getType())
[67 lines not shown]

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// LaunchOp		// LaunchOp
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

void LaunchOp::build(OpBuilder &builder, OperationState &result,		void LaunchOp::build(OpBuilder &builder, OperationState &result,
Value gridSizeX, Value gridSizeY, Value gridSizeZ,		Value gridSizeX, Value gridSizeY, Value gridSizeZ,
Value blockSizeX, Value blockSizeY, Value blockSizeZ,		Value blockSizeX, Value blockSizeY, Value blockSizeZ,
Value dynamicSharedMemorySize) {		Value dynamicSharedMemorySize, Type asyncTokenType,
		ValueRange asyncDependencies) {
		result.addOperands(asyncDependencies);
		if (asyncTokenType)
		result.types.push_back(builder.getType<AsyncTokenType>());

// Add grid and block sizes as op operands, followed by the data operands.		// Add grid and block sizes as op operands, followed by the data operands.
result.addOperands(		result.addOperands(
{gridSizeX, gridSizeY, gridSizeZ, blockSizeX, blockSizeY, blockSizeZ});		{gridSizeX, gridSizeY, gridSizeZ, blockSizeX, blockSizeY, blockSizeZ});
if (dynamicSharedMemorySize)		if (dynamicSharedMemorySize)
result.addOperands(dynamicSharedMemorySize);		result.addOperands(dynamicSharedMemorySize);

// Create a kernel body region with kNumConfigRegionAttributes + N arguments,		// Create a kernel body region with kNumConfigRegionAttributes + N arguments,
// where the first kNumConfigRegionAttributes arguments have `index` type and		// where the first kNumConfigRegionAttributes arguments have `index` type and
// the rest have the same types as the data operands.		// the rest have the same types as the data operands.
Region *kernelRegion = result.addRegion();		Region *kernelRegion = result.addRegion();
Block *body = new Block();		Block *body = new Block();
for (unsigned i = 0; i < kNumConfigRegionAttributes; ++i)		for (unsigned i = 0; i < kNumConfigRegionAttributes; ++i)
body->addArgument(builder.getIndexType(), result.location);		body->addArgument(builder.getIndexType(), result.location);
kernelRegion->push_back(body);		kernelRegion->push_back(body);
		SmallVector<int32_t, 8> segmentSizes(8, 1);
		segmentSizes.front() = asyncDependencies.size();
		segmentSizes.back() = dynamicSharedMemorySize ? 1 : 0;
		result.addAttribute(getOperandSegmentSizeAttr(),
		builder.getI32VectorAttr(segmentSizes));
}		}

KernelDim3 LaunchOp::getBlockIds() {		KernelDim3 LaunchOp::getBlockIds() {
assert(!body().empty() && "LaunchOp body must not be empty.");		assert(!body().empty() && "LaunchOp body must not be empty.");
auto args = body().getArguments();		auto args = body().getArguments();
return KernelDim3{args[0], args[1], args[2]};		return KernelDim3{args[0], args[1], args[2]};
}		}

[11 lines not shown]

KernelDim3 LaunchOp::getBlockSize() {		KernelDim3 LaunchOp::getBlockSize() {
assert(!body().empty() && "LaunchOp body must not be empty.");		assert(!body().empty() && "LaunchOp body must not be empty.");
auto args = body().getArguments();		auto args = body().getArguments();
return KernelDim3{args[9], args[10], args[11]};		return KernelDim3{args[9], args[10], args[11]};
}		}

KernelDim3 LaunchOp::getGridSizeOperandValues() {		KernelDim3 LaunchOp::getGridSizeOperandValues() {
return KernelDim3{getOperand(0), getOperand(1), getOperand(2)};		auto operands = getOperands().drop_front(asyncDependencies().size());
		return KernelDim3{operands[0], operands[1], operands[2]};
}		}

KernelDim3 LaunchOp::getBlockSizeOperandValues() {		KernelDim3 LaunchOp::getBlockSizeOperandValues() {
return KernelDim3{getOperand(3), getOperand(4), getOperand(5)};		auto operands = getOperands().drop_front(asyncDependencies().size());
		return KernelDim3{operands[3], operands[4], operands[5]};
}		}

LogicalResult LaunchOp::verifyRegions() {		LogicalResult LaunchOp::verifyRegions() {
// Kernel launch takes kNumConfigOperands leading operands for grid/block		// Kernel launch takes kNumConfigOperands leading operands for grid/block
// sizes and transforms them into kNumConfigRegionAttributes region arguments		// sizes and transforms them into kNumConfigRegionAttributes region arguments
// for block/thread identifiers and grid/block sizes.		// for block/thread identifiers and grid/block sizes.
if (!body().empty()) {		if (!body().empty()) {
if (body().getNumArguments() != LaunchOp::kNumConfigOperands +		if (body().getNumArguments() !=
getNumOperands() -		LaunchOp::kNumConfigOperands + getNumOperands() -
(dynamicSharedMemorySize() ? 1 : 0))		(dynamicSharedMemorySize() ? 1 : 0) - asyncDependencies().size())
return emitOpError("unexpected number of region arguments");		return emitOpError("unexpected number of region arguments");
}		}

// Block terminators without successors are expected to exit the kernel region		// Block terminators without successors are expected to exit the kernel region
// and must be `gpu.terminator`.		// and must be `gpu.terminator`.
for (Block &block : body()) {		for (Block &block : body()) {
if (block.empty())		if (block.empty())
continue;		continue;
if (block.back().getNumSuccessors() != 0)		if (block.back().getNumSuccessors() != 0)
continue;		continue;
if (!isa<gpu::TerminatorOp>(&block.back())) {		if (!isa<gpu::TerminatorOp>(&block.back())) {
return block.back()		return block.back()
.emitError()		.emitError()
.append("expected '", gpu::TerminatorOp::getOperationName(),		.append("expected '", gpu::TerminatorOp::getOperationName(),
"' or a terminator with successors")		"' or a terminator with successors")
.attachNote(getLoc())		.attachNote(getLoc())
.append("in '", LaunchOp::getOperationName(), "' body region");		.append("in '", LaunchOp::getOperationName(), "' body region");
}		}
}		}

		if (getNumResults() == 0 && asyncToken())
		return emitOpError(
		"needs to be named when async dependencies are specified");

return success();		return success();
}		}

// Pretty-print the kernel grid/block size assignment as		// Pretty-print the kernel grid/block size assignment as
// (%iter-x, %iter-y, %iter-z) in		// (%iter-x, %iter-y, %iter-z) in
// (%size-x = %ssa-use, %size-y = %ssa-use, %size-z = %ssa-use)		// (%size-x = %ssa-use, %size-y = %ssa-use, %size-z = %ssa-use)
// where %size-* and %iter-* will correspond to the body region arguments.		// where %size-* and %iter-* will correspond to the body region arguments.
static void printSizeAssignment(OpAsmPrinter &p, KernelDim3 size,		static void printSizeAssignment(OpAsmPrinter &p, KernelDim3 size,
KernelDim3 operands, KernelDim3 ids) {		KernelDim3 operands, KernelDim3 ids) {
p << '(' << ids.x << ", " << ids.y << ", " << ids.z << ") in (";		p << '(' << ids.x << ", " << ids.y << ", " << ids.z << ") in (";
p << size.x << " = " << operands.x << ", ";		p << size.x << " = " << operands.x << ", ";
p << size.y << " = " << operands.y << ", ";		p << size.y << " = " << operands.y << ", ";
p << size.z << " = " << operands.z << ')';		p << size.z << " = " << operands.z << ')';
}		}

void LaunchOp::print(OpAsmPrinter &p) {		void LaunchOp::print(OpAsmPrinter &p) {
		if (asyncToken()) {
		p << " async";
		if (!asyncDependencies().empty())
		p << " [" << asyncDependencies() << ']';
		}
// Print the launch configuration.		// Print the launch configuration.
p << ' ' << getBlocksKeyword();		p << ' ' << getBlocksKeyword();
printSizeAssignment(p, getGridSize(), getGridSizeOperandValues(),		printSizeAssignment(p, getGridSize(), getGridSizeOperandValues(),
getBlockIds());		getBlockIds());
p << ' ' << getThreadsKeyword();		p << ' ' << getThreadsKeyword();
printSizeAssignment(p, getBlockSize(), getBlockSizeOperandValues(),		printSizeAssignment(p, getBlockSize(), getBlockSizeOperandValues(),
getThreadIds());		getThreadIds());
if (dynamicSharedMemorySize())		if (dynamicSharedMemorySize())
p << ' ' << getDynamicSharedMemorySizeKeyword() << ' '		p << ' ' << getDynamicSharedMemorySizeKeyword() << ' '
<< dynamicSharedMemorySize();		<< dynamicSharedMemorySize();

p << ' ';		p << ' ';
p.printRegion(body(), /*printEntryBlockArgs=*/false);		p.printRegion(body(), /*printEntryBlockArgs=*/false);
p.printOptionalAttrDict((*this)->getAttrs());		p.printOptionalAttrDict((*this)->getAttrs(), /*elidedAttrs=*/{
		LaunchOp::getOperandSegmentSizeAttr()});
}		}

// Parse the size assignment blocks for blocks and threads. These have the form		// Parse the size assignment blocks for blocks and threads. These have the form
// (%region_arg, %region_arg, %region_arg) in		// (%region_arg, %region_arg, %region_arg) in
// (%region_arg = %operand, %region_arg = %operand, %region_arg = %operand)		// (%region_arg = %operand, %region_arg = %operand, %region_arg = %operand)
// where %region_arg are percent-identifiers for the region arguments to be		// where %region_arg are percent-identifiers for the region arguments to be
// introduced further (SSA defs), and %operand are percent-identifiers for the		// introduced further (SSA defs), and %operand are percent-identifiers for the
// SSA value uses.		// SSA value uses.
[17 lines not shown]	if (parser.parseRegionArgument(regionSizes[i]) || parser.parseEqual() ||
parser.parseOperand(sizes[i]))		parser.parseOperand(sizes[i]))
return failure();		return failure();
}		}

return parser.parseRParen();		return parser.parseRParen();
}		}

/// Parses a Launch operation.		/// Parses a Launch operation.
/// operation ::= `gpu.launch` `blocks` `(` ssa-id-list `)` `in`		/// operation ::= `gpu.launch` (`async` `[` ssa-id-list `]`)?
/// ssa-reassignment		// `blocks` `(` ssa-id-list `)` `in` ssa-reassignment
/// `threads` `(` ssa-id-list `)` `in`		/// `threads` `(` ssa-id-list `)` `in` ssa-reassignment
/// ssa-reassignment
/// region attr-dict?		/// region attr-dict?
/// ssa-reassignment ::= `(` ssa-id `=` ssa-use (`,` ssa-id `=` ssa-use)* `)`		/// ssa-reassignment ::= `(` ssa-id `=` ssa-use (`,` ssa-id `=` ssa-use)* `)`
ParseResult LaunchOp::parse(OpAsmParser &parser, OperationState &result) {		ParseResult LaunchOp::parse(OpAsmParser &parser, OperationState &result) {
// Sizes of the grid and block.		// Sizes of the grid and block.
SmallVector<OpAsmParser::UnresolvedOperand, LaunchOp::kNumConfigOperands>		SmallVector<OpAsmParser::UnresolvedOperand, LaunchOp::kNumConfigOperands>
sizes(LaunchOp::kNumConfigOperands);		sizes(LaunchOp::kNumConfigOperands);
MutableArrayRef<OpAsmParser::UnresolvedOperand> sizesRef(sizes);		MutableArrayRef<OpAsmParser::UnresolvedOperand> sizesRef(sizes);

// Actual (data) operands passed to the kernel.		// Actual (data) operands passed to the kernel.
SmallVector<OpAsmParser::UnresolvedOperand, 4> dataOperands;		SmallVector<OpAsmParser::UnresolvedOperand, 4> dataOperands;

// Region arguments to be created.		// Region arguments to be created.
SmallVector<OpAsmParser::UnresolvedOperand, 16> regionArgs(		SmallVector<OpAsmParser::UnresolvedOperand, 16> regionArgs(
LaunchOp::kNumConfigRegionAttributes);		LaunchOp::kNumConfigRegionAttributes);
MutableArrayRef<OpAsmParser::UnresolvedOperand> regionArgsRef(regionArgs);		MutableArrayRef<OpAsmParser::UnresolvedOperand> regionArgsRef(regionArgs);

		// Parse optional async dependencies.
		SmallVector<OpAsmParser::UnresolvedOperand, 4> asyncDependencies;
		Type asyncTokenType;
		if (failed(
		parseAsyncDependencies(parser, asyncTokenType, asyncDependencies)) ||
		parser.resolveOperands(asyncDependencies, asyncTokenType,
		result.operands))
		return failure();
		if (parser.getNumResults() > 0)
		result.types.push_back(asyncTokenType);

// Parse the size assignment segments: the first segment assigns grid sizes		// Parse the size assignment segments: the first segment assigns grid sizes
// and defines values for block identifiers; the second segment assigns block		// and defines values for block identifiers; the second segment assigns block
// sizes and defines values for thread identifiers. In the region argument		// sizes and defines values for thread identifiers. In the region argument
// list, identifiers precede sizes, and block-related values precede		// list, identifiers precede sizes, and block-related values precede
// thread-related values.		// thread-related values.
if (parser.parseKeyword(LaunchOp::getBlocksKeyword().data()) ||		if (parser.parseKeyword(LaunchOp::getBlocksKeyword().data()) ||
parseSizeAssignment(parser, sizesRef.take_front(3),		parseSizeAssignment(parser, sizesRef.take_front(3),
regionArgsRef.slice(6, 3),		regionArgsRef.slice(6, 3),
regionArgsRef.slice(0, 3)) ||		regionArgsRef.slice(0, 3)) ||
parser.parseKeyword(LaunchOp::getThreadsKeyword().data()) ||		parser.parseKeyword(LaunchOp::getThreadsKeyword().data()) ||
parseSizeAssignment(parser, sizesRef.drop_front(3),		parseSizeAssignment(parser, sizesRef.drop_front(3),
regionArgsRef.slice(9, 3),		regionArgsRef.slice(9, 3),
regionArgsRef.slice(3, 3)) ||		regionArgsRef.slice(3, 3)) ||
parser.resolveOperands(sizes, parser.getBuilder().getIndexType(),		parser.resolveOperands(sizes, parser.getBuilder().getIndexType(),
result.operands))		result.operands))
return failure();		return failure();

OpAsmParser::UnresolvedOperand dynamicSharedMemorySize;		OpAsmParser::UnresolvedOperand dynamicSharedMemorySize;
		bool hasDynamicSharedMemorySize = false;
if (!parser.parseOptionalKeyword(		if (!parser.parseOptionalKeyword(
LaunchOp::getDynamicSharedMemorySizeKeyword()))		LaunchOp::getDynamicSharedMemorySizeKeyword())) {
		hasDynamicSharedMemorySize = true;
if (parser.parseOperand(dynamicSharedMemorySize) ||		if (parser.parseOperand(dynamicSharedMemorySize) ||
parser.resolveOperand(dynamicSharedMemorySize,		parser.resolveOperand(dynamicSharedMemorySize,
parser.getBuilder().getI32Type(),		parser.getBuilder().getI32Type(),
result.operands))		result.operands))
return failure();		return failure();
		}

// Introduce the body region and parse it. The region has		// Introduce the body region and parse it. The region has
// kNumConfigRegionAttributes arguments that correspond to		// kNumConfigRegionAttributes arguments that correspond to
// block/thread identifiers and grid/block sizes, all of the `index` type.		// block/thread identifiers and grid/block sizes, all of the `index` type.
Type index = parser.getBuilder().getIndexType();		Type index = parser.getBuilder().getIndexType();
SmallVector<Type, LaunchOp::kNumConfigRegionAttributes> dataTypes(		SmallVector<Type, LaunchOp::kNumConfigRegionAttributes> dataTypes(
LaunchOp::kNumConfigRegionAttributes, index);		LaunchOp::kNumConfigRegionAttributes, index);
Region *body = result.addRegion();		Region *body = result.addRegion();
return failure(parser.parseRegion(*body, regionArgs, dataTypes) ||		if (parser.parseRegion(*body, regionArgs, dataTypes) ||
parser.parseOptionalAttrDict(result.attributes));		parser.parseOptionalAttrDict(result.attributes))
		return failure();

		SmallVector<int32_t, 8> segmentSizes(8, 1);
		segmentSizes.front() = asyncDependencies.size();
		segmentSizes.back() = hasDynamicSharedMemorySize ? 1 : 0;
		result.addAttribute(LaunchOp::getOperandSegmentSizeAttr(),
		parser.getBuilder().getI32VectorAttr(segmentSizes));
		return success();
}		}

/// Simplify the gpu.launch when the range of a thread or block ID is		/// Simplify the gpu.launch when the range of a thread or block ID is
/// trivially known to be one.		/// trivially known to be one.
struct FoldLaunchArguments : public OpRewritePattern<LaunchOp> {		struct FoldLaunchArguments : public OpRewritePattern<LaunchOp> {
using OpRewritePattern<LaunchOp>::OpRewritePattern;		using OpRewritePattern<LaunchOp>::OpRewritePattern;
LogicalResult matchAndRewrite(LaunchOp op,		LogicalResult matchAndRewrite(LaunchOp op,
PatternRewriter &rewriter) const override {		PatternRewriter &rewriter) const override {
Show All 33 Lines

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// LaunchFuncOp		// LaunchFuncOp
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

void LaunchFuncOp::build(OpBuilder &builder, OperationState &result,		void LaunchFuncOp::build(OpBuilder &builder, OperationState &result,
GPUFuncOp kernelFunc, KernelDim3 gridSize,		GPUFuncOp kernelFunc, KernelDim3 gridSize,
KernelDim3 blockSize, Value dynamicSharedMemorySize,		KernelDim3 blockSize, Value dynamicSharedMemorySize,
ValueRange kernelOperands) {		ValueRange kernelOperands, Type asyncTokenType,
		ValueRange asyncDependencies) {
		result.addOperands(asyncDependencies);
		if (asyncTokenType)
		result.types.push_back(builder.getType<AsyncTokenType>());

// Add grid and block sizes as op operands, followed by the data operands.		// Add grid and block sizes as op operands, followed by the data operands.
result.addOperands({gridSize.x, gridSize.y, gridSize.z, blockSize.x,		result.addOperands({gridSize.x, gridSize.y, gridSize.z, blockSize.x,
blockSize.y, blockSize.z});		blockSize.y, blockSize.z});
if (dynamicSharedMemorySize)		if (dynamicSharedMemorySize)
result.addOperands(dynamicSharedMemorySize);		result.addOperands(dynamicSharedMemorySize);
result.addOperands(kernelOperands);		result.addOperands(kernelOperands);
auto kernelModule = kernelFunc->getParentOfType<GPUModuleOp>();		auto kernelModule = kernelFunc->getParentOfType<GPUModuleOp>();
auto kernelSymbol =		auto kernelSymbol =
SymbolRefAttr::get(kernelModule.getNameAttr(),		SymbolRefAttr::get(kernelModule.getNameAttr(),
{SymbolRefAttr::get(kernelFunc.getNameAttr())});		{SymbolRefAttr::get(kernelFunc.getNameAttr())});
result.addAttribute(getKernelAttrName(), kernelSymbol);		result.addAttribute(getKernelAttrName(), kernelSymbol);
SmallVector<int32_t, 9> segmentSizes(9, 1);		SmallVector<int32_t, 9> segmentSizes(9, 1);
segmentSizes.front() = 0; // Initially no async dependencies.		segmentSizes.front() = asyncDependencies.size();
segmentSizes[segmentSizes.size() - 2] = dynamicSharedMemorySize ? 1 : 0;		segmentSizes[segmentSizes.size() - 2] = dynamicSharedMemorySize ? 1 : 0;
segmentSizes.back() = static_cast<int32_t>(kernelOperands.size());		segmentSizes.back() = static_cast<int32_t>(kernelOperands.size());
result.addAttribute(getOperandSegmentSizeAttr(),		result.addAttribute(getOperandSegmentSizeAttr(),
builder.getI32VectorAttr(segmentSizes));		builder.getI32VectorAttr(segmentSizes));
}		}

unsigned LaunchFuncOp::getNumKernelOperands() {		unsigned LaunchFuncOp::getNumKernelOperands() {
return getNumOperands() - asyncDependencies().size() - kNumConfigOperands -		return getNumOperands() - asyncDependencies().size() - kNumConfigOperands -
▲ Show 20 Lines • Show All 407 Lines • ▼ Show 20 Lines	if (getElementTypeOrSelf(srcType) != getElementTypeOrSelf(dstType))
return emitOpError("arguments have incompatible element type");		return emitOpError("arguments have incompatible element type");

if (failed(verifyCompatibleShape(srcType, dstType)))		if (failed(verifyCompatibleShape(srcType, dstType)))
return emitOpError("arguments have incompatible shape");		return emitOpError("arguments have incompatible shape");

return success();		return success();
}		}

static ParseResult parseAsyncDependencies(
OpAsmParser &parser, Type &asyncTokenType,
SmallVectorImpl<OpAsmParser::UnresolvedOperand> &asyncDependencies) {
auto loc = parser.getCurrentLocation();
if (succeeded(parser.parseOptionalKeyword("async"))) {
if (parser.getNumResults() == 0)
return parser.emitError(loc, "needs to be named when marked 'async'");
asyncTokenType = parser.getBuilder().getType<AsyncTokenType>();
}
return parser.parseOperandList(asyncDependencies,
OpAsmParser::Delimiter::OptionalSquare);
}

static void printAsyncDependencies(OpAsmPrinter &printer, Operation *op,
Type asyncTokenType,
OperandRange asyncDependencies) {
if (asyncTokenType)
printer << "async ";
if (asyncDependencies.empty())
return;
printer << "[";
llvm::interleaveComma(asyncDependencies, printer);
printer << "]";
}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// GPU_SubgroupMmaLoadMatrixOp		// GPU_SubgroupMmaLoadMatrixOp
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

/// Return true if the last dimension of the MemRefType has unit stride. Also		/// Return true if the last dimension of the MemRefType has unit stride. Also
/// return true for memrefs with no strides.		/// return true for memrefs with no strides.
static bool isLastMemrefDimUnitStride(MemRefType type) {		static bool isLastMemrefDimUnitStride(MemRefType type) {
int64_t offset;		int64_t offset;
▲ Show 20 Lines • Show All 204 Lines • Show Last 20 Lines
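
Note on the optional async clause: the `parseAsyncDependencies` / `printAsyncDependencies` helpers shown above accept an optional `async` keyword followed by an optional square-bracketed operand list, and print the brackets only when the list is non-empty. The following is a minimal standalone sketch of that behaviour, not the MLIR API. `AsyncClause`, `printAsyncClause`, and `parseAsyncClause` are hypothetical names introduced only for illustration, and SSA operands are modelled as plain strings.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical model of the optional `async [deps]` clause (not MLIR API).
struct AsyncClause {
  bool isAsync = false;          // true when the op yields an async token
  std::vector<std::string> deps; // SSA names of the dependency tokens
};

// Print `async ` first, then a bracketed list only if there are dependencies.
std::string printAsyncClause(const AsyncClause &clause) {
  std::string out;
  if (clause.isAsync)
    out += "async ";
  if (clause.deps.empty())
    return out;
  out += "[";
  for (std::size_t i = 0; i < clause.deps.size(); ++i) {
    if (i != 0)
      out += ", ";
    out += clause.deps[i];
  }
  out += "]";
  return out;
}

// Parse an optional `async` keyword, then an optional `[a, b, ...]` list.
AsyncClause parseAsyncClause(const std::string &text) {
  AsyncClause clause;
  std::size_t pos = 0;
  auto skipSpaces = [&] {
    while (pos < text.size() && text[pos] == ' ')
      ++pos;
  };
  skipSpaces();
  // Accept `async` only as a whole word.
  if (text.compare(pos, 5, "async") == 0 &&
      (pos + 5 == text.size() || text[pos + 5] == ' ' ||
       text[pos + 5] == '[')) {
    clause.isAsync = true;
    pos += 5;
  }
  skipSpaces();
  if (pos < text.size() && text[pos] == '[') {
    std::size_t close = text.find(']', pos);
    assert(close != std::string::npos && "unterminated dependency list");
    std::istringstream list(text.substr(pos + 1, close - pos - 1));
    std::string item;
    while (std::getline(list, item, ',')) {
      std::size_t begin = item.find_first_not_of(' ');
      if (begin == std::string::npos)
        continue; // skip empty entries such as in `[]`
      std::size_t end = item.find_last_not_of(' ');
      clause.deps.push_back(item.substr(begin, end - begin + 1));
    }
  }
  return clause;
}
```

In this sketch an empty list such as `async []` parses to an async clause with no dependencies and prints back without brackets, which is the same normalisation the `launch_async_no_deps` test in ops.mlir checks for.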

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp

	Show First 20 Lines • Show All 219 Lines • ▼ Show 20 Lines
	/// launching `kernelFunc`. The kernel func contains the body of the			/// launching `kernelFunc`. The kernel func contains the body of the
	/// `gpu.launch` with constant region arguments inlined.			/// `gpu.launch` with constant region arguments inlined.
	static void convertToLaunchFuncOp(gpu::LaunchOp launchOp,			static void convertToLaunchFuncOp(gpu::LaunchOp launchOp,
	gpu::GPUFuncOp kernelFunc,			gpu::GPUFuncOp kernelFunc,
	ValueRange operands) {			ValueRange operands) {
	OpBuilder builder(launchOp);			OpBuilder builder(launchOp);
	// The launch op has an optional dynamic shared memory size. If it doesn't			// The launch op has an optional dynamic shared memory size. If it doesn't
	// exist, we use zero.			// exist, we use zero.
	builder.create<gpu::LaunchFuncOp>(			Value asyncToken = launchOp.asyncToken();
				auto launchFunc = builder.create<gpu::LaunchFuncOp>(
	launchOp.getLoc(), kernelFunc, launchOp.getGridSizeOperandValues(),			launchOp.getLoc(), kernelFunc, launchOp.getGridSizeOperandValues(),
	launchOp.getBlockSizeOperandValues(), launchOp.dynamicSharedMemorySize(),			launchOp.getBlockSizeOperandValues(), launchOp.dynamicSharedMemorySize(),
	operands);			operands, asyncToken ? asyncToken.getType() : nullptr,
				launchOp.asyncDependencies());
				launchOp.replaceAllUsesWith(launchFunc);
	launchOp.erase();			launchOp.erase();
	}			}

	namespace {			namespace {
	/// Pass that moves ops which are likely an index computation into gpu.launch			/// Pass that moves ops which are likely an index computation into gpu.launch
	/// body.			/// body.
	class GpuLaunchSinkIndexComputationsPass			class GpuLaunchSinkIndexComputationsPass
	: public GpuLaunchSinkIndexComputationsBase<			: public GpuLaunchSinkIndexComputationsBase<
	▲ Show 20 Lines • Show All 149 Lines • Show Last 20 Lines

mlir/test/Dialect/GPU/invalid.mlir

	// RUN: mlir-opt -split-input-file -verify-diagnostics %s			// RUN: mlir-opt -split-input-file -verify-diagnostics %s

	func @not_enough_sizes(%sz : index) {			func @not_enough_sizes(%sz : index) {
	// expected-error@+1 {{expected 6 or more operands, but found 5}}			// expected-error@+1 {{expected 6 or more operands, but found 5}}
	"gpu.launch"(%sz, %sz, %sz, %sz, %sz) ({			"gpu.launch"(%sz, %sz, %sz, %sz, %sz) ({
	gpu.return			gpu.return
	}) : (index, index, index, index, index) -> ()			}) {operand_segment_sizes = dense<[0, 1, 1, 1, 1, 1, 1, 0]> : vector<8xi32>} : (index, index, index, index, index) -> ()
	return			return
	}			}

	// -----			// -----

	func @no_region_attrs(%sz : index) {			func @no_region_attrs(%sz : index) {
	// expected-error@+1 {{unexpected number of region arguments}}			// expected-error@+1 {{unexpected number of region arguments}}
	"gpu.launch"(%sz, %sz, %sz, %sz, %sz, %sz) ({			"gpu.launch"(%sz, %sz, %sz, %sz, %sz, %sz) ({
	^bb1(%bx: index, %by: index, %bz: index,			^bb1(%bx: index, %by: index, %bz: index,
	%tx: index, %ty: index, %tz: index):			%tx: index, %ty: index, %tz: index):
	gpu.terminator			gpu.terminator
	}) : (index, index, index, index, index, index) -> ()			}) {operand_segment_sizes = dense<[0, 1, 1, 1, 1, 1, 1, 0]> : vector<8xi32>} : (index, index, index, index, index, index) -> ()
	return			return
	}			}

	// -----			// -----

	func @launch_requires_gpu_return(%sz : index) {			func @launch_requires_gpu_return(%sz : index) {
	// @expected-note@+1 {{in 'gpu.launch' body region}}			// @expected-note@+1 {{in 'gpu.launch' body region}}
	gpu.launch blocks(%bx, %by, %bz) in (%sbx = %sz, %sby = %sz, %sbz = %sz)			gpu.launch blocks(%bx, %by, %bz) in (%sbx = %sz, %sby = %sz, %sbz = %sz)
	▲ Show 20 Lines • Show All 626 Lines • Show Last 20 Lines
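
Note on the segment sizes: the generic-form tests above now carry an explicit `operand_segment_sizes` attribute because `gpu.launch` has two variable-length operand groups after this change. A minimal sketch of how the eight entries are derived follows. It assumes the segment order is async dependencies, three grid sizes, three block sizes, then the optional dynamic shared memory size, as the parser in GPUDialect.cpp suggests. `computeLaunchSegmentSizes` and `totalOperands` are hypothetical helpers, not part of MLIR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Assumed layout: [asyncDependencies, gridX, gridY, gridZ,
//                  blockX, blockY, blockZ, dynamicSharedMemorySize].
constexpr std::size_t kNumLaunchSegments = 8;

// Hypothetical helper mirroring the parser logic: every segment defaults to
// one operand, the first holds the async dependency count, and the last is
// 0 or 1 depending on whether a dynamic shared memory size was given.
std::array<int32_t, kNumLaunchSegments>
computeLaunchSegmentSizes(std::size_t numAsyncDeps,
                          bool hasDynamicSharedMemorySize) {
  assert(numAsyncDeps <=
         static_cast<std::size_t>(std::numeric_limits<int32_t>::max()));
  std::array<int32_t, kNumLaunchSegments> sizes;
  sizes.fill(1);
  sizes.front() = static_cast<int32_t>(numAsyncDeps);
  sizes.back() = hasDynamicSharedMemorySize ? 1 : 0;
  return sizes;
}

// Total number of operands implied by the segment sizes.
int32_t totalOperands(const std::array<int32_t, kNumLaunchSegments> &sizes) {
  int32_t total = 0;
  for (int32_t size : sizes)
    total += size;
  return total;
}
```

With no async dependencies and no dynamic shared memory size this yields `[0, 1, 1, 1, 1, 1, 1, 0]`, the value used in the tests above, which accounts for six operands in total.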

mlir/test/Dialect/GPU/ops.mlir

// RUN: mlir-opt -allow-unregistered-dialect %s | FileCheck %s		// RUN: mlir-opt -allow-unregistered-dialect %s | FileCheck %s
		// Verify the printed output can be parsed.
		// RUN: mlir-opt -allow-unregistered-dialect %s | mlir-opt -allow-unregistered-dialect | FileCheck %s
		// Verify the generic form can be parsed.
		// RUN: mlir-opt -allow-unregistered-dialect -mlir-print-op-generic %s | mlir-opt -allow-unregistered-dialect | FileCheck %s

module attributes {gpu.container_module} {		module attributes {gpu.container_module} {

// CHECK-LABEL:func @no_args(%{{.*}}: index)		// CHECK-LABEL:func @no_args(%{{.*}}: index)
func @no_args(%sz : index) {		func @no_args(%sz : index) {
// CHECK: gpu.launch blocks(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}) threads(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}})		// CHECK: gpu.launch blocks(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}) threads(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}})
gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %sz, %grid_y = %sz, %grid_z = %sz)		gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %sz, %grid_y = %sz, %grid_z = %sz)
threads(%tx, %ty, %tz) in (%block_x = %sz, %block_y = %sz, %block_z = %sz) {		threads(%tx, %ty, %tz) in (%block_x = %sz, %block_y = %sz, %block_z = %sz) {
Show All 11 Lines	gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %blk, %grid_y = %blk, %grid_z = %blk)
"use"(%float) : (f32) -> ()		"use"(%float) : (f32) -> ()
"use"(%data) : (memref<?xf32,1>) -> ()		"use"(%data) : (memref<?xf32,1>) -> ()
// CHECK: gpu.terminator		// CHECK: gpu.terminator
gpu.terminator		gpu.terminator
}		}
return		return
}		}

		// CHECK-LABEL:func @launch_async(%{{.*}}: index, %{{.*}}: index) {
		func @launch_async(%blk : index, %thrd : index) {
		// CHECK: gpu.launch async [%{{.+}}] blocks(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}) threads(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}})
		%t = gpu.wait async
		%name = gpu.launch async [%t] blocks(%arg0, %arg1, %arg2) in (%grid_x = %blk, %grid_y = %blk, %grid_z = %blk)
		threads(%arg3, %arg4, %arg5) in (%block_x = %thrd, %block_y = %thrd, %block_z = %thrd) {
		gpu.terminator
		}
		return
		}

		// CHECK-LABEL:func @launch_async_no_deps(%{{.*}}: index, %{{.*}}: index) {
		func @launch_async_no_deps(%blk : index, %thrd : index) {
		// CHECK: %{{.*}} = gpu.launch async blocks(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}) threads(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}})
		%t0 = gpu.launch async blocks(%arg0, %arg1, %arg2) in (%grid_x = %blk, %grid_y = %blk, %grid_z = %blk)
		threads(%arg3, %arg4, %arg5) in (%block_x = %thrd, %block_y = %thrd, %block_z = %thrd) {
		gpu.terminator
		}
		// CHECK: gpu.launch async blocks(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}) threads(%{{.*}}, %{{.*}}, %{{.*}}) in (%{{.*}} = %{{.*}}, %{{.*}} = %{{.*}}, %{{.*}} = %{{.*}})
		%t1 = gpu.launch async [] blocks(%arg0, %arg1, %arg2) in (%grid_x = %blk, %grid_y = %blk, %grid_z = %blk)
		threads(%arg3, %arg4, %arg5) in (%block_x = %thrd, %block_y = %thrd, %block_z = %thrd) {
		gpu.terminator
		}
		return
		}

gpu.module @kernels {		gpu.module @kernels {
gpu.func @kernel_1(%arg0 : f32, %arg1 : memref<?xf32, 1>) kernel {		gpu.func @kernel_1(%arg0 : f32, %arg1 : memref<?xf32, 1>) kernel {
%tIdX = gpu.thread_id x		%tIdX = gpu.thread_id x
%tIdY = gpu.thread_id y		%tIdY = gpu.thread_id y
%tIdZ = gpu.thread_id z		%tIdZ = gpu.thread_id z

%bDimX = gpu.block_dim x		%bDimX = gpu.block_dim x
%bDimY = gpu.block_dim y		%bDimY = gpu.block_dim y
▲ Show 20 Lines • Show All 232 Lines • Show Last 20 Lines

mlir/test/Dialect/GPU/outlining.mlir

Show First 20 Lines • Show All 74 Lines • ▼ Show 20 Lines	func @multiple_launches() {
}		}
// CHECK: gpu.launch_func @multiple_launches_kernel_0::@multiple_launches_kernel blocks in (%[[CST]], %[[CST]], %[[CST]]) threads in (%[[CST]], %[[CST]], %[[CST]])		// CHECK: gpu.launch_func @multiple_launches_kernel_0::@multiple_launches_kernel blocks in (%[[CST]], %[[CST]], %[[CST]]) threads in (%[[CST]], %[[CST]], %[[CST]])
gpu.launch blocks(%bx2, %by2, %bz2) in (%grid_x2 = %cst, %grid_y2 = %cst,		gpu.launch blocks(%bx2, %by2, %bz2) in (%grid_x2 = %cst, %grid_y2 = %cst,
%grid_z2 = %cst)		%grid_z2 = %cst)
threads(%tx2, %ty2, %tz2) in (%block_x2 = %cst, %block_y2 = %cst,		threads(%tx2, %ty2, %tz2) in (%block_x2 = %cst, %block_y2 = %cst,
%block_z2 = %cst) {		%block_z2 = %cst) {
gpu.terminator		gpu.terminator
}		}

		// With async and async deps.
		// CHECK: %[[TOKEN:.*]] = gpu.wait async
		// CHECK: gpu.launch_func async [%[[TOKEN]]] @multiple_launches_kernel_1::@multiple_launches_kernel blocks in (%[[CST]], %[[CST]], %[[CST]]) threads in (%[[CST]], %[[CST]], %[[CST]])
		%t = gpu.wait async
		%u = gpu.launch async [%t] blocks(%bx2, %by2, %bz2) in (%grid_x2 = %cst, %grid_y2 = %cst,
		%grid_z2 = %cst)
		threads(%tx2, %ty2, %tz2) in (%block_x2 = %cst, %block_y2 = %cst,
		%block_z2 = %cst) {
		gpu.terminator
		}

		// CHECK: gpu.launch_func async @multiple_launches_kernel_2::@multiple_launches_kernel blocks in (%[[CST]], %[[CST]], %[[CST]]) threads in (%[[CST]], %[[CST]], %[[CST]])
		%v = gpu.launch async blocks(%bx2, %by2, %bz2) in (%grid_x2 = %cst, %grid_y2 = %cst,
		%grid_z2 = %cst)
		threads(%tx2, %ty2, %tz2) in (%block_x2 = %cst, %block_y2 = %cst,
		%block_z2 = %cst) {
		gpu.terminator
		}

return		return
}		}

// CHECK-DL-LABEL: gpu.module @multiple_launches_kernel attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<index, 32 : i32>>}		// CHECK-DL-LABEL: gpu.module @multiple_launches_kernel attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<index, 32 : i32>>}
// CHECK-DL-LABEL: gpu.module @multiple_launches_kernel_0 attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<index, 32 : i32>>}		// CHECK-DL-LABEL: gpu.module @multiple_launches_kernel_0 attributes {dlti.dl_spec = #dlti.dl_spec<#dlti.dl_entry<index, 32 : i32>>}

// CHECK: gpu.module @multiple_launches_kernel		// CHECK: gpu.module @multiple_launches_kernel
// CHECK: func @multiple_launches_kernel		// CHECK: func @multiple_launches_kernel
▲ Show 20 Lines • Show All 183 Lines • Show Last 20 Lines

Revision Contents

Diff 422714