This is an archive of the discontinued LLVM Phabricator instance.

Differential D75287

[mlir][GPU] Expose the functionality to create a gpu.GPUFuncOp from a gpu.GPULaunchOp
ClosedPublic

Authored by mravishankar on Feb 27 2020, 11:56 AM.

Details

Reviewers

herhut
ftynse
mehdi_amini

Commits

rG3f44495dfd61: [mlir][GPU] Expose the functionality to create a GPUFuncOp from a LaunchOp

Summary

The current setup of the GPU dialect is to model both the host and
device side codegen. For some clients (like IREE) the host-side modeling
might not directly fit the use case, but device-side codegen is still
valuable. A first step in accessing just the device-side functionality
of the GPU dialect is to allow creating a gpu.func operation from
a gpu.launch operation. In addition, this change also "inlines"
operations into the gpu.func op at the time of creation instead of this
being a later step.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	30 ms	Extra Tools Unit Tests.clang-doc/_/ClangDocTests_exe::Unknown Unit Message ("")

Event Timeline

mravishankar created this revision.Feb 27 2020, 11:56 AM

Herald added a reviewer: herhut. Feb 27 2020, 11:56 AM

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, Joonsoo, liufengdb and 12 others.

mravishankar added a reviewer: ftynse.Feb 27 2020, 11:59 AM

mravishankar added a subscriber: hanchung.

Harbormaster failed remote builds in B47482: Diff 247058!Feb 27 2020, 1:47 PM

mehdi_amini added inline comments.Feb 27 2020, 10:26 PM

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
133	Please don't use "inline" for aspects other than the inliner. What about "sink"?
136	Nit: don't use auto when it does not improve the readability (line 93 below is explicit for instance)
138	This whole sinking transformation does not seem safe in general: this should check legality rather than "benefit". Also it isn't clear to me why this is done during the outlining and not as a pre-pass. The launch operation with the region abstraction seems perfectly suited to model this. I rather have this exposed in a separate API / as a separate step.

Please check the comments around the code you are modifying/moving, some of them no longer describe what the code does after your changes.

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
34–35	Please update the comment to describe the new API
138	> This whole sinking transformation does not seem safe in general: this should check legality rather than "benefit".

	The function just seems misnamed, should be something like `shouldSink` because it mixes validity and benefit. In practice, it only returns `true` for `constant` and `dim` operations that don't have side effects.
177	Will this work for blocks whose dominance relation is the inverse of their textual order? E.g.

	^entry:
	  br ^bb2
	^bb1:
	  "use"(%0) : (index) -> ()
	  return
	^bb2:
	  %0 = "def"() : () -> (index)
	  br ^bb1
178	This no longer removes the arguments, but rather updates the map.
190	Will this update the users of the inlinedOps? I don't see the map updated anywhere.

This revision now requires changes to proceed.Feb 28 2020, 1:45 AM

herhut added inline comments.Feb 28 2020, 7:41 AM

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
138	> This whole sinking transformation does not seem safe in general: this should check legality rather than "benefit".

	Well, it should check both. You do not want to move all legal operations either :)

	> Also it isn't clear to me why this is done during the outlining and not as a pre-pass. The launch operation with the region abstraction seems perfectly suited to model this. I rather have this exposed in a separate API / as a separate step

	This has purely historical reasons. Not long ago, the `gpu.launch` was closed from above, so this transformation was done when moving to function form. I have a separate pass for this in a local client, which I can send out next week. It just needs tests. It was implemented as a "post transformation" to the outlining and I would prefer if we do not mix it into the outlining transformation itself. When written separately, the transformations are trivial.
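The distinction drawn in this thread between legality and benefit can be sketched outside MLIR. The following is a minimal, hypothetical model: the `ToyOp` struct and the three function names are illustrative only and are not part of the patch or of MLIR. It assumes that "legal" simply means "has no side effects" and that "beneficial" means the op is a `constant` or `dim`, which is what the reviewed predicate accepts.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an operation; not an MLIR type.
struct ToyOp {
  std::string name;
  bool hasSideEffects;
};

// Assumption: an op may be moved into the launch region only if it has
// no side effects. Real legality checks would be more involved.
bool isLegalToSink(const ToyOp &op) { return !op.hasSideEffects; }

// Assumption: sinking pays off only for cheap ops that would otherwise
// become kernel arguments (here: constants and dims).
bool isBeneficialToSink(const ToyOp &op) {
  return op.name == "constant" || op.name == "dim";
}

// An op is sunk only when both conditions hold.
bool shouldSink(const ToyOp &op) {
  return isLegalToSink(op) && isBeneficialToSink(op);
}
```

Keeping the two predicates separate makes it visible which condition rejected an op, which is the concern raised about a single predicate named only after "benefit".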

mehdi_amini added inline comments.Feb 28 2020, 9:21 AM

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
138	> It was implemented as a "post transformation" to the outlining and

	Pre-outlining seems easier to manage because it works on a region rather than inter-procedurally (and it can also be kept a function pass).

	> I would prefer if we do not mix it into the outlining transformation itself. When written separately, the transformations are trivial.

	Seems like we're in agreement :)

Addressing comments and changing the way cloning of the region of
the gpu.launch operation is done.

Harbormaster failed remote builds in B47643: Diff 247372!Feb 28 2020, 3:21 PM

mravishankar marked 4 inline comments as done.Feb 28 2020, 3:29 PM

mravishankar added inline comments.

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
133	Changed to "sink" and updated all variables names.
138	A pre-pass is fine, but I think it would be better to leave it here. Eventually, it would be good if all transformations can be expressed as a pattern match and rewrite. This "outlining" is essentially converting a gpu.launchOp to a gpu.launchFuncOp. If you need to have a separate pass to sink the instructions, then it breaks the ability of going from loops -> GPU -> NVVM/SPIR-V. I am not saying anybody does this today (not doing this in IREE), but in general it seems like it would be beneficial to have transformations as patterns, and passes as just a light-weight wrapper around patterns.

	Re: being able to keep it as a function pass, is related to where the gpu.module is created. As set up right now it is put outside of the function that the gpu.launch operation lives in. That's a very specific choice, and it would be very useful to allow "clients" of the outlining to decide where to put the gpu.module.
177	Thanks for pointing this out. I updated the method to use cloneInto, which handles this case (see comment on changes to cloneInto; I can make the change here if you think that is reasonable).
190	It is done within the clone(map) operation. The results of the operation are added to the map as well.

Fixes minor typos

mravishankar added a child revision: D75391: [mlir][Linalg] Fix load/store operations generated while lower loops when output has zero rank..Feb 28 2020, 3:37 PM

antiagainst added inline comments.Feb 28 2020, 3:53 PM

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
182	We are only using the "set" part here right? Just use a set data type?

Harbormaster failed remote builds in B47649: Diff 247379!Feb 28 2020, 3:57 PM

Fix failing test

Harbormaster completed remote builds in B47654: Diff 247390.Feb 28 2020, 5:28 PM

mehdi_amini requested changes to this revision.Feb 28 2020, 10:21 PM

mehdi_amini added inline comments.

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
138	> If you need to have a separate pass to sink the instructions, then it breaks the ability of going from loops -> GPU -> NVVM/SPIR-V.

	I don't understand what you mean, can you elaborate?

	> but in general it seems like it would be beneficial to have transformations as patterns, and passes as just a light-weight wrapper around patterns.

	This is mixing an optimization within an unrelated transformation: this just does not belong here IMO.

	> Re: being able to keep it as a function pass, is related to where the gpu.module is created.

	I don't know what you mean or how it answers the point about the function pass right now.

This revision now requires changes to proceed.Feb 28 2020, 10:21 PM

mravishankar marked an inline comment as done.Mar 2 2020, 10:45 AM

mravishankar added inline comments.

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
138	Re: separate pass to sink instructions. The Dialect conversion framework is designed to go from A -> B -> C. If I want to target SPIR-V/NVVM from the Linalg dialect via the Loop and GPU dialects (i.e. Linalg -> Loops -> GPU -> SPIRV/NVVM), I can add all the patterns for the conversion into the dialect conversion framework. Currently Loops to GPU dialect is not exposed as a conversion pattern. GPU to SPIRV is. By adding extra steps as a "pre-condition" will limit the ability to the entire conversion being done using the dialect conversion framework (which is what it is built for). You could add the "sinking" as a canonicalization pattern, but it seems to me this sinking is useful only when the gpu.launch region is outlined to create a gpu.func operation. So doing the sinking during the conversion makes sense to me.

	Re: function pass vs module pass. The current setup of the gpu.launch to gpu.launch_func conversion creates a gpu.module that is inserted just after the function the gpu.launch is in. This makes it a module pass, and this behavior is only relevant for the CUDA/NVVM side of things. For IREE, we are only interested in the device side for now. So we can make this a function pass if we can control where the gpu.module is inserted. See this dependent PR in IREE that uses this change and makes the conversion of gpu.launch to gpu.func a function pass.

mravishankar marked an inline comment as done.Mar 2 2020, 1:35 PM

mravishankar added inline comments.

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
138	@mehdi_amini: Update on the function pass vs module pass. You were right that the outlining can only be a module pass, since the gpu.module also has a symbol and so needs to be added to a module (or an op with a symbol table). So I was wrong about that. I updated the PR shown above to be a module pass as well, but FYI there was no assert when I did it as a function pass.

	Just filling in some details about the discussion offline. It is true that the sinking could be done as a pre-pass. If so, then it is a separate "pre-processing" pass. It is unclear if sinking can be expressed as a pattern.

mehdi_amini added inline comments.Mar 2 2020, 8:13 PM

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
138	(typed the following this morning before your last comment, but didn't click send)

	> The Dialect conversion framework is designed to go from A -> B -> C.

	Yes, but if you just say this, you can shoehorn anything in there: you could use this mental model to go from Swift SIL to X86 assembly in a single "legalize()" call. I don't think this is a good use of the framework. We should use the lowering framework where it makes sense and where it is the way to solve a problem. If your problem fits into the pass pipeline, then why not start there? This is the most natural way of thinking about a chain of transformations.

	> the ability to the entire conversion being done using the dialect conversion framework (which is what it is built for).

	I disagree that this is what it is built for. I think this is a misconception of what the framework is intended to solve. If you can express a pass pipeline where you want to do A->C as a logical sequence of A->B and then B->C, where B is an "interesting level of abstraction", I believe this should be separate passes.

	If we take for example the `HLO -> SPIR-V` pipeline, we can likely identify logical stages like `HLO->Loops`, `Loops->GPU Kernel`, and `GPU Kernel -> SPIRV`. These stages are fully disjoint as far as I can tell, and there is no immediate benefit to combining them in a single lowering. On the contrary: if these stages are well separated, this provides the opportunity for passes to run on each intermediate level of abstraction (including some generic things like canonicalization or CSE), and it also allows more reusable blocks (`Loops->GPU Kernel` can be reused even when you don't come from HLO). It also forces testing at every level and helps compiler engineers keep a mental model where we can reason about these stages and how they compose independently.

Updating patch to separate the sinking of instructions into launch op
as a separate function.

mravishankar marked an inline comment as done.Mar 3 2020, 2:05 AM

mravishankar added inline comments.

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
138	Thanks @mehdi_amini for that overview. I think what you say makes sense and is a good rule of thumb to use (probably good to add it somewhere in the rationale).

	Going back to the change at hand. I modified the patch to expose the "sinking" transformation as a separate utility function exposed by the GPU dialect. PTAL, but to me this seems more complex. If this is along the lines of what is recommended here, I can work with it for my use case.
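The ordering problem that the separate sinking utility has to solve (clone definitions before their uses, and report an error if no progress can be made) can be modeled without MLIR. The sketch below is a simplified, hypothetical illustration: `ToyOp` and `orderForSinking` are made-up names, each op is assumed to produce exactly one uniquely named result, and values are plain strings rather than SSA values. It mirrors the fixed-point loop in the patch, not its actual API.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an operation: one result, named operands.
struct ToyOp {
  std::string result;
  std::vector<std::string> operands;
};

// Returns an order in which the ops in `sunkOps` can be cloned so that any
// operand that is itself being sunk is cloned first. Returns std::nullopt
// when a full sweep makes no progress, i.e. on a cyclic dependency.
std::optional<std::vector<std::string>>
orderForSinking(const std::vector<ToyOp> &sunkOps) {
  std::unordered_set<std::string> sunkValues;
  for (const ToyOp &op : sunkOps)
    sunkValues.insert(op.result);

  std::unordered_set<std::string> cloned;
  std::vector<std::string> order;
  while (order.size() != sunkOps.size()) {
    std::size_t startSize = order.size();
    for (const ToyOp &op : sunkOps) {
      if (cloned.count(op.result))
        continue;
      // An op is blocked while one of its operands is being sunk too but
      // has not been cloned yet.
      bool blocked = false;
      for (const std::string &operand : op.operands)
        if (sunkValues.count(operand) && !cloned.count(operand))
          blocked = true;
      if (blocked)
        continue;
      cloned.insert(op.result);
      order.push_back(op.result);
    }
    if (startSize == order.size())
      return std::nullopt; // No op became ready in this sweep.
  }
  return order;
}
```

Operands defined outside the set of sunk ops (such as a value that stays a kernel argument) never block an op, which matches the intent that sinking must not create new captured values.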

Harbormaster completed remote builds in B47883: Diff 247827.Mar 3 2020, 4:11 AM

mehdi_amini added inline comments.Mar 3 2020, 9:39 PM

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
234	Note: this is still not decoupled from this pass right now (i.e. not tested in isolation, etc.): we still have "outlining" and "sinking" part of the same pass, can't they be separated?

I don't have further objections other than a bunch of nits. This patch intends to expose _functions_ and can land as is. Refactoring those functions into separate (test) passes is okay for a follow-up IMO.

mlir/include/mlir/Dialect/GPU/Passes.h
21	`struct`, otherwise you'll break windows builds
53	Bikeshed nit: `sinkOperationsIntoLaunchOp`. "Instructions" aren't a thing in MLIR.
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
64	Bikeshed nit: `operands` confused me into thinking it referred to the _existing_ launchOp operands, not "values that might become operands if sinking is beneficial".
105	How about iterating over uses rather than users?

	for (auto use : result.value().getUses()) {
	  if (use.getUser().getParentOfType<gpu::LaunchOp>() == launchOp)
	    use.getOperand().set(replacement);
	}
112	Can this happen in a valid IR? If not, I would rather assert. Otherwise, please drop trivial braces
115	This comment looks outdated
232	Sinking already reports an error, no need to add another one IMO.
247	Nit: Please drop trivial braces
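The suggestion at line 105 above, iterating over uses rather than users, matters because a use names one specific operand slot, while a user is a whole operation that may consume the same value several times. The following is a minimal sketch of that idea in plain C++. `ToyOp`, `ToyUse`, `collectUses`, and `replaceUsesInsideLaunch` are hypothetical names for illustration and do not correspond to MLIR classes; the `insideLaunch` flag stands in for the "is this op nested in the launch op" check.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an operation; not an MLIR type.
struct ToyOp {
  std::string name;
  bool insideLaunch;                 // true if nested in the launch region
  std::vector<std::string> operands; // names of the values it consumes
};

// A use identifies one operand slot: which op, and which operand of it.
struct ToyUse {
  std::size_t owner;
  std::size_t operandNo;
};

// Snapshot all uses of `value` before anything is mutated.
std::vector<ToyUse> collectUses(const std::vector<ToyOp> &ops,
                                const std::string &value) {
  std::vector<ToyUse> uses;
  for (std::size_t i = 0; i < ops.size(); ++i)
    for (std::size_t j = 0; j < ops[i].operands.size(); ++j)
      if (ops[i].operands[j] == value)
        uses.push_back({i, j});
  return uses;
}

// Rewrites only the uses owned by ops inside the launch region and returns
// how many operand slots were updated.
std::size_t replaceUsesInsideLaunch(std::vector<ToyOp> &ops,
                                    const std::string &oldValue,
                                    const std::string &newValue) {
  std::size_t replaced = 0;
  for (const ToyUse &use : collectUses(ops, oldValue)) {
    if (!ops[use.owner].insideLaunch)
      continue;
    ops[use.owner].operands[use.operandNo] = newValue;
    ++replaced;
  }
  return replaced;
}
```

Taking a snapshot of the uses before rewriting plays the same role here as iterating with an early-increment range in the patch: the set being walked is not disturbed by the replacements.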

Addressing comments

mravishankar added inline comments.Mar 4 2020, 2:01 PM

mlir/include/mlir/Dialect/GPU/Passes.h
21	Thanks! Interesting that the build bot passed.
53	Good point. I have conflated the two mentally because of that.
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
112	I think so, but I am not sure. Will leave it as an error, and remove braces
234	They are separate functions. I have no visibility into the clients of the pass. So if any user of the pass is relying on sinking happening then removing the sinking would "potentially" break. One could argue that then it is incorrect usage since the gpu.launch_func op gets updated accordingly, but at this point I would rather keep this change as an NFC.

Harbormaster failed remote builds in B48110: Diff 248311!Mar 4 2020, 4:20 PM

mehdi_amini added inline comments.Mar 4 2020, 5:19 PM

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
234	Keeping it NFC is a very good point! (what would break here is an optimization and not correctness right? So we can still do it in the absolute?)

mehdi_amini accepted this revision.Mar 4 2020, 5:21 PM

mehdi_amini added inline comments.

mlir/include/mlir/Dialect/GPU/Passes.h
33	Nit: this is the only pass exposed in a header called `Passes.h`. Can you split this header in a `Utils.h`?

This revision is now accepted and ready to land.Mar 4 2020, 5:21 PM

Moving GPU utility functions into Utils.h

mravishankar added inline comments.Mar 4 2020, 10:35 PM

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp
234	Yes, it would indeed be an optimization thing and not a correctness thing.

Harbormaster failed remote builds in B48155: Diff 248398!Mar 4 2020, 11:31 PM

Harbormaster completed remote builds in B48155: Diff 248398.Mar 5 2020, 10:58 AM

Closed by commit rG3f44495dfd61: [mlir][GPU] Expose the functionality to create a GPUFuncOp from a LaunchOp (authored by mravishankar). Mar 5 2020, 11:32 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
mlir/include/mlir/Dialect/GPU/Passes.h	24 lines
mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp	193 lines
mlir/test/Dialect/GPU/outlining.mlir	24 lines

Diff 248311

mlir/include/mlir/Dialect/GPU/Passes.h

	//===- Passes.h - Pass Entrypoints ------------------------------*- C++ -*-===//			//===- Passes.h - Pass Entrypoints ------------------------------*- C++ -*-===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This header file defines prototypes that expose pass constructors.			// This header file defines prototypes that expose pass constructors.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef MLIR_DIALECT_GPU_PASSES_H_			#ifndef MLIR_DIALECT_GPU_PASSES_H_
	#define MLIR_DIALECT_GPU_PASSES_H_			#define MLIR_DIALECT_GPU_PASSES_H_

				#include "mlir/Support/LLVM.h"
	#include <memory>			#include <memory>

	namespace mlir {			namespace mlir {

				struct LogicalResult;
	class MLIRContext;			class MLIRContext;
	class ModuleOp;			class ModuleOp;
	template <typename T> class OpPassBase;			template <typename T> class OpPassBase;
	class OwningRewritePatternList;			class OwningRewritePatternList;
				class Value;

				namespace gpu {
				class GPUFuncOp;
				class LaunchOp;
				} // namespace gpu

	std::unique_ptr<OpPassBase<ModuleOp>> createGpuKernelOutliningPass();			std::unique_ptr<OpPassBase<ModuleOp>> createGpuKernelOutliningPass();

				/// Get a gpu.func created from outlining the region of a gpu.launch op with the
				/// given `kernelFnName`. The region of the `launchOp` can use values from
				/// above. These need to be captured and passed as arguments to the generated
				/// gpu.func. The generated function has arguments
				/// - corresponding to the values passed in as `operands`, in that order.
				/// - any additional values that might be used within the region of the
				/// `launchOp` and defined above it. These captured values are appended to the
				/// `operands` list.
				gpu::GPUFuncOp outlineKernelFunc(gpu::LaunchOp launchOp, StringRef kernelFnName,
				SmallVectorImpl<Value> &operands);

	/// Collect a set of patterns to rewrite ops within the GPU dialect.			/// Collect a set of patterns to rewrite ops within the GPU dialect.
	void populateGpuRewritePatterns(MLIRContext *context,			void populateGpuRewritePatterns(MLIRContext *context,
	OwningRewritePatternList &patterns);			OwningRewritePatternList &patterns);

				/// Sink operations into the `launchOp` to reduce the number of values that are
				/// used within the region of the operation, but defined outside of the
				/// region.
				LogicalResult sinkOperationsIntoLaunchOp(gpu::LaunchOp launchOp);

	} // namespace mlir			} // namespace mlir

	#endif // MLIR_DIALECT_GPU_PASSES_H_			#endif // MLIR_DIALECT_GPU_PASSES_H_

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp

static void createForAllDimensions(OpBuilder &builder, Location loc,		static void createForAllDimensions(OpBuilder &builder, Location loc,
SmallVectorImpl<Value> &values) {		SmallVectorImpl<Value> &values) {
for (StringRef dim : {"x", "y", "z"}) {		for (StringRef dim : {"x", "y", "z"}) {
Value v = builder.create<OpTy>(loc, builder.getIndexType(),		Value v = builder.create<OpTy>(loc, builder.getIndexType(),
builder.getStringAttr(dim));		builder.getStringAttr(dim));
values.push_back(v);		values.push_back(v);
}		}
}		}

// Add operations generating block/thread ids and grid/block dimensions at the		// Add operations generating block/thread ids and grid/block dimensions at the
// beginning of the `body` region and replace uses of the respective function		// beginning of the `launchFuncOpBody` region. Add mapping from argument in
// arguments.		// entry block of `launchOpBody`, to the corresponding result value of the added
static void injectGpuIndexOperations(Location loc, Region &body) {		// operations.
		static void injectGpuIndexOperations(Location loc, Region &launchFuncOpBody,
		Region &launchOpBody,
		BlockAndValueMapping &map) {
OpBuilder builder(loc->getContext());		OpBuilder builder(loc->getContext());
Block &firstBlock = body.front();		Block &firstBlock = launchOpBody.front();
builder.setInsertionPointToStart(&firstBlock);		builder.setInsertionPointToStart(&launchFuncOpBody.front());
SmallVector<Value, 12> indexOps;		SmallVector<Value, 12> indexOps;
createForAllDimensions<gpu::BlockIdOp>(builder, loc, indexOps);		createForAllDimensions<gpu::BlockIdOp>(builder, loc, indexOps);
createForAllDimensions<gpu::ThreadIdOp>(builder, loc, indexOps);		createForAllDimensions<gpu::ThreadIdOp>(builder, loc, indexOps);
createForAllDimensions<gpu::GridDimOp>(builder, loc, indexOps);		createForAllDimensions<gpu::GridDimOp>(builder, loc, indexOps);
createForAllDimensions<gpu::BlockDimOp>(builder, loc, indexOps);		createForAllDimensions<gpu::BlockDimOp>(builder, loc, indexOps);
// Replace the leading 12 function args with the respective thread/block index		// Replace the leading 12 function args with the respective thread/block index
// operations. Iterate backwards since args are erased and indices change.		// operations. Iterate backwards since args are erased and indices change.
for (int i = 11; i >= 0; --i) {		for (auto indexOp : enumerate(indexOps))
firstBlock.getArgument(i).replaceAllUsesWith(indexOps[i]);		map.map(firstBlock.getArgument(indexOp.index()), indexOp.value());
firstBlock.eraseArgument(i);
}
}		}

static bool isInliningBeneficiary(Operation *op) {		static bool isSinkingBeneficiary(Operation *op) {
return isa<ConstantOp>(op) || isa<DimOp>(op);		return isa<ConstantOp>(op) || isa<DimOp>(op);
}		}

// Move arguments of the given kernel function into the function if this reduces		LogicalResult mlir::sinkOperationsIntoLaunchOp(gpu::LaunchOp launchOp) {
// the number of kernel arguments.		Region &launchOpBody = launchOp.body();
static gpu::LaunchFuncOp inlineBeneficiaryOps(gpu::GPUFuncOp kernelFunc,
gpu::LaunchFuncOp launch) {		// Identify uses from values defined outside of the scope of the launch
OpBuilder kernelBuilder(kernelFunc.getBody());		// operation.
auto &firstBlock = kernelFunc.getBody().front();		llvm::SetVector<Value> sinkCandidates;
SmallVector<Value, 8> newLaunchArgs;		getUsedValuesDefinedAbove(launchOpBody, sinkCandidates);
BlockAndValueMapping map;
for (int i = 0, e = launch.getNumKernelOperands(); i < e; ++i) {		llvm::SetVector<Value> sunkValues;
map.map(launch.getKernelOperand(i), kernelFunc.getArgument(i));		llvm::SetVector<Operation *> sunkOperations;
}		for (Value operand : sinkCandidates) {
for (int i = launch.getNumKernelOperands() - 1; i >= 0; --i) {		Operation *operandOp = operand.getDefiningOp();
auto operandOp = launch.getKernelOperand(i).getDefiningOp();		if (!operandOp || !isSinkingBeneficiary(operandOp))
if (!operandOp || !isInliningBeneficiary(operandOp)) {
newLaunchArgs.push_back(launch.getKernelOperand(i));
continue;		continue;
		// Only sink operations that do not create new sinkCandidates.
		if (!llvm::all_of(operandOp->getOperands(), [&sinkCandidates](Value value) {
		return sinkCandidates.count(value);
		}))
		continue;
		sunkValues.insert(operand);
		sunkOperations.insert(operandOp);
}		}
// Only inline operations that do not create new arguments.
if (!llvm::all_of(operandOp->getOperands(),		// Insert operations so that the defs get cloned before uses.
[map](Value value) { return map.contains(value); })) {		BlockAndValueMapping map;
		OpBuilder builder(launchOpBody);
		DenseSet<Operation *> processed;
		SmallVector<Operation *, 2> clonedOps;
		while (processed.size() != sunkOperations.size()) {
		auto startSize = processed.size();
		for (Operation *sunkOperation : sunkOperations) {
		if (processed.count(sunkOperation))
continue;		continue;

		// Operation can't be cloned yet if any of its operands is also being sunk,
		// but isn't cloned yet.
		if (llvm::any_of(
		sunkOperation->getOperands(), [&sunkValues, &map](Value value) {
		return sunkValues.count(value) && !map.lookupOrNull(value);
		}))
		continue;

		Operation *clonedOp = builder.clone(*sunkOperation, map);
		// Only replace uses within the launch op.
		for (auto result : llvm::enumerate(sunkOperation->getResults())) {
		auto replacement = clonedOp->getResult(result.index());
		for (auto &use : llvm::make_early_inc_range(result.value().getUses()))
		if (use.getOwner()->getParentOfType<gpu::LaunchOp>() == launchOp)
		use.set(replacement);
		}
		processed.insert(sunkOperation);
		}
		if (startSize == processed.size())
		return launchOp.emitError(
		"found illegal cyclic dependency between operations while sinking");
}		}
auto clone = kernelBuilder.clone(*operandOp, map);		return success();
firstBlock.getArgument(i).replaceAllUsesWith(clone->getResult(0));
firstBlock.eraseArgument(i);
}
if (newLaunchArgs.size() == launch.getNumKernelOperands())
return launch;

std::reverse(newLaunchArgs.begin(), newLaunchArgs.end());
OpBuilder LaunchBuilder(launch);
SmallVector<Type, 8> newArgumentTypes;
newArgumentTypes.reserve(firstBlock.getNumArguments());
for (auto value : firstBlock.getArguments()) {
newArgumentTypes.push_back(value.getType());
}
kernelFunc.setType(LaunchBuilder.getFunctionType(newArgumentTypes, {}));
auto newLaunch = LaunchBuilder.create<gpu::LaunchFuncOp>(
launch.getLoc(), kernelFunc, launch.getGridSizeOperandValues(),
launch.getBlockSizeOperandValues(), newLaunchArgs);
launch.erase();
return newLaunch;
}		}

// Outline the `gpu.launch` operation body into a kernel function. Replace		// Outline the `gpu.launch` operation body into a kernel function. Replace
// `gpu.terminator` operations by `gpu.return` in the generated function.		// `gpu.terminator` operations by `gpu.return` in the generated function.
static gpu::GPUFuncOp outlineKernelFunc(gpu::LaunchOp launchOp,		static gpu::GPUFuncOp outlineKernelFuncImpl(gpu::LaunchOp launchOp,
		StringRef kernelFnName,
llvm::SetVector<Value> &operands) {		llvm::SetVector<Value> &operands) {
Location loc = launchOp.getLoc();		Location loc = launchOp.getLoc();
// Create a builder with no insertion point, insertion will happen separately		// Create a builder with no insertion point, insertion will happen separately
// due to symbol table manipulation.		// due to symbol table manipulation.
OpBuilder builder(launchOp.getContext());		OpBuilder builder(launchOp.getContext());
		Region &launchOpBody = launchOp.body();

// Identify uses from values defined outside of the scope of the launch		// Identify uses from values defined outside of the scope of the launch
// operation.		// operation.
getUsedValuesDefinedAbove(launchOp.body(), operands);		getUsedValuesDefinedAbove(launchOpBody, operands);

		// Create the gpu.func operation.
SmallVector<Type, 4> kernelOperandTypes;		SmallVector<Type, 4> kernelOperandTypes;
kernelOperandTypes.reserve(operands.size());		kernelOperandTypes.reserve(operands.size());
for (Value operand : operands) {		for (Value operand : operands) {
kernelOperandTypes.push_back(operand.getType());		kernelOperandTypes.push_back(operand.getType());
}		}
mehdi_amini: This whole sinking transformation does not seem safe in general: this should check legality rather than "benefit".
Also it isn't clear to me why this is done during the outlining and not as a pre-pass. The launch operation with the region abstraction seems perfectly suited to model this.
I'd rather have this exposed in a separate API / as a separate step.

ftynse:
> This whole sinking transformation does not seem safe in general: this should check legality rather than "benefit".
The function just seems misnamed; it should be something like `shouldSink` because it mixes validity and benefit. In practice, it only returns `true` for `constant` and `dim` operations that don't have side effects.

herhut:
> This whole sinking transformation does not seem safe in general: this should check legality rather than "benefit".
Well, it should check both. You do not want to move all legal operations either :)
> Also it isn't clear to me why this is done during the outlining and not as a pre-pass. The launch operation with the region abstraction seems perfectly suited to model this. I rather have this exposed in a separate API / as a separate step
This has purely historical reasons. Not long ago, the `gpu.launch` was closed from above, so this transformation was done when moving to function form.
I have a separate pass for this in a local client, which I can send out next week. It just needs tests. It was implemented as a "post transformation" to the outlining and I would prefer if we do not mix it into the outlining transformation itself. When written separately, the transformations are trivial.

mehdi_amini:
> It was implemented as a "post transformation" to the outlining and
Pre-outlining seems easier to manage because it is region vs inter-procedural (and it can also be kept a function pass).
> I would prefer if we do not mix it into the outlining transformation itself. When written separately, the transformations are trivial.
Seems like we're in agreement :)

mravishankar: A pre-pass is fine, but I think it would be better to leave it here. Eventually, it would be good if all transformations can be expressed as a pattern match and rewrite. This "outlining" is essentially converting a gpu.launchOp to a gpu.launchFuncOp. If you need to have a separate pass to sink the instructions, then it breaks the ability of going from loops -> GPU -> NVVM/SPIR-V. I am not saying anybody does this today (not doing this in IREE), but in general it seems like it would be beneficial to have transformations as patterns, and passes as just a light-weight wrapper around patterns.
Re: being able to keep it as a function pass: this is related to where the gpu.module is created. As set up right now it is put outside of the function that the gpu.launch operation lives in. That's a very specific choice, and it would be very useful to allow "clients" of the outlining to decide where to put the gpu.module.

mehdi_amini:
> If you need to have a separate pass to sink the instructions, then it breaks the ability of going from loops -> GPU -> NVVM/SPIR-V.
I don't understand what you mean, can you elaborate?
> but in general it seems like it would be beneficial to have transformations as patterns, and passes as just a light-weight wrapper around patterns.
This is mixing an optimization within an unrelated transformation: this just does not belong here IMO.
> Re: being able to keep it as a function pass, is related to where the gpu.module is created.
I don't know what you mean or how it answers the point about the function pass right now.

mravishankar: Re: separate pass to sink instructions.
The Dialect conversion framework is designed to go from A -> B -> C. If I want to target SPIR-V/NVVM from the Linalg dialect via the Loop and GPU dialects (i.e. Linalg -> Loops -> GPU -> SPIRV/NVVM), I can add all the patterns for the conversion into the dialect conversion framework. Currently Loops to GPU dialect is not exposed as a conversion pattern. GPU to SPIRV is. Adding extra steps as a "pre-condition" will limit the ability to do the entire conversion using the dialect conversion framework (which is what it is built for). You could add the "sinking" as a canonicalization pattern, but it seems to me this sinking is useful only when the gpu.launch region is outlined to create a gpu.func operation. So doing the sinking during the conversion makes sense to me.
Re: function pass vs module pass.
The current setup of the gpu.launch to gpu.launch_func conversion creates a gpu.module that is inserted just after the function the gpu.launch is in. This makes it a module pass, and this behavior is only relevant for the CUDA/NVVM side of things. For IREE, we are only interested in the device side for now. So we can make this a function pass if we can control where the gpu.module is inserted. See this dependent PR in IREE that uses this change and makes the conversion of gpu.launch to gpu.func a function pass.

mravishankar: @mehdi_amini: Update on the function pass vs module pass. You were right that the outlining can only be a module pass, since the gpu.module also has a symbol and so needs to be added to a module (or an op with a symbol table). So I was wrong about that. I updated the PR shown above to be a module pass as well, but FYI there was no assert when I did it as a function pass.
Just filling in some details about the discussion offline. It is true that the sinking could be done as a prepass. If so then it is a separate "pre-processing" pass. It is unclear if sinking can be expressed as a pattern.

mehdi_amini: (typed the following this morning before your last comment, but didn't click send)
> The Dialect conversion framework is designed to go from A -> B -> C.
Yes, but if you just say this, you can shoehorn anything in there: you could use this mental model to go from Swift SIL to X86 assembly in a single "legalize()" call. I don't think this is a good use of the framework. We should use the lowering framework where it makes sense and where it is the way to solve a problem.
If your problem fits into the pass pipeline, then why not start there? This is the most natural way of thinking about a chain of transformations.
> the ability to the entire conversion being done using the dialect conversion framework (which is what it is built for).
I disagree that this is what it is built for. I think this is a misconception of what the framework is intended to solve. If you can express a pass pipeline where you want to do A->C as a logical sequence of A->B and then B->C, where B is an "interesting level of abstraction", I believe this should be separate passes.
If we take for example the `HLO -> SPIR-V` pipeline, we can likely identify logical stages like `HLO->Loops`, `Loops->GPU Kernel`, and `GPU Kernel -> SPIRV`. These stages are fully disjoint as far as I can tell, and there is no immediate benefit to combining them in a single lowering. On the contrary: if these stages are well separated, this provides the opportunity for passes to run on each intermediate level of abstraction (including some generic things like canonicalization or CSE), and it also allows more reusable blocks (`Loops->GPU Kernel` can be reused even when you don't come from HLO). It also forces testing at every level and helps compiler engineers keep a mental model where we can reason about these stages and how they compose independently.

mravishankar: Thanks @mehdi_amini for that overview. I think what you say makes sense and is a good rule of thumb to use (probably good to add it somewhere in the rationale).
Going back to the change at hand. I modified the patch to expose the "sinking" transformation as a separate utility function exposed by the GPU dialect. PTAL, but to me this seems more complex. If this is along the lines of what is recommended here, I can work with it for my use case.
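
The legality-versus-benefit distinction raised in this thread can be sketched outside MLIR. The following is a minimal standalone C++ illustration, not the code in KernelOutlining.cpp: `ToyOpInfo`, `isLegalToSink`, and `isBeneficialToSink` are hypothetical names, and treating "no side effects" as the entire legality condition is a simplifying assumption. Only the name `shouldSink` and the `constant`/`dim` restriction are taken from the comments above.

```cpp
#include <cassert>
#include <string>

// Toy description of an operation; this is not an MLIR type.
struct ToyOpInfo {
  std::string name;
  bool hasSideEffects;
};

// Legality (assumed here): the op can be moved only if it has no side effects.
bool isLegalToSink(const ToyOpInfo &op) { return !op.hasSideEffects; }

// Benefit: only ops considered cheap enough to recompute inside the kernel.
// The two names follow ftynse's description of the existing check.
bool isBeneficialToSink(const ToyOpInfo &op) {
  return op.name == "constant" || op.name == "dim";
}

// An op is sunk only when both conditions hold, as herhut suggests.
bool shouldSink(const ToyOpInfo &op) {
  return isLegalToSink(op) && isBeneficialToSink(op);
}

// Small usage example.
void demoShouldSink() {
  assert(shouldSink(ToyOpInfo{"constant", false}));
  assert(!shouldSink(ToyOpInfo{"load", false}));  // legal, but not beneficial
}
```

Keeping the two predicates separate makes it possible to tighten legality without touching the cost heuristic, which is the separation the reviewers ask for.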
   FunctionType type =
       FunctionType::get(kernelOperandTypes, {}, launchOp.getContext());
-  std::string kernelFuncName =
-      Twine(launchOp.getParentOfType<FuncOp>().getName(), "_kernel").str();
-  auto outlinedFunc = builder.create<gpu::GPUFuncOp>(loc, kernelFuncName, type);
+  auto outlinedFunc = builder.create<gpu::GPUFuncOp>(loc, kernelFnName, type);
   outlinedFunc.setAttr(gpu::GPUDialect::getKernelFuncAttrName(),
                        builder.getUnitAttr());
-  outlinedFunc.body().takeBody(launchOp.body());
-  injectGpuIndexOperations(loc, outlinedFunc.body());
-  Block &entryBlock = outlinedFunc.body().front();
-  for (Value operand : operands) {
-    BlockArgument newArg = entryBlock.addArgument(operand.getType());
-    replaceAllUsesInRegionWith(operand, newArg, outlinedFunc.body());
-  }
+  BlockAndValueMapping map;
+
+  // Map the arguments corresponding to the launch parameters like blockIdx,
+  // threadIdx, etc.
+  Region &outlinedFuncBody = outlinedFunc.body();
+  injectGpuIndexOperations(loc, outlinedFuncBody, launchOpBody, map);
+
+  // Map arguments from gpu.launch region to the arguments of the gpu.func
+  // operation.
+  Block &entryBlock = outlinedFuncBody.front();
+  for (auto operand : enumerate(operands))
+    map.map(operand.value(), entryBlock.getArgument(operand.index()));
+
+  // Clone the region of the gpu.launch operation into the gpu.func operation.
+  // TODO(ravishankarm): If cloneInto can be modified such that if a mapping for
+  // a block exists, that block will be used to clone operations into (at the
+  // end of the block), instead of creating a new block, this would be much
+  // cleaner.
+  launchOpBody.cloneInto(&outlinedFuncBody, map);
+
+  // Branch from entry of the gpu.func operation to the block that is cloned from
+  // the entry block of the gpu.launch operation.
+  Block &launchOpEntry = launchOpBody.front();
+  Block *clonedLaunchOpEntry = map.lookup(&launchOpEntry);
+  builder.setInsertionPointToEnd(&entryBlock);
+  builder.create<BranchOp>(loc, clonedLaunchOpEntry);
+
   outlinedFunc.walk([](gpu::TerminatorOp op) {
     OpBuilder replacer(op);
     replacer.create<gpu::ReturnOp>(op.getLoc());
     op.erase();
   });

   return outlinedFunc;
 }
ftynse: Will this work for blocks whose dominance relation is the inverse of their textual order? E.g.
    ^entry:
      br ^bb2
    ^bb1:
      "use"(%0) : (index) -> ()
      return
    ^bb2:
      %0 = "def"() : () -> (index)
      br ^bb1
mravishankar: Thanks for pointing this out. I updated the method to use cloneInto, which handles this case (see comment on changes to cloneInto; I can make the change here if you think that is reasonable).
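
One way a clone can be made insensitive to textual block order is to clone every operation first and remap operands only afterwards. The sketch below is a toy model of that idea using hypothetical types (`ToyOp`, `ToyBlock`, `ToyRegion`, `cloneRegion`); it is not MLIR's `cloneInto` and is only meant to show why ftynse's example is unproblematic under a two-pass scheme.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Toy stand-ins for IR objects; none of these are MLIR types.
struct ToyOp {
  int id;                       // unique id, also names the op's single result
  std::vector<int> operandIds;  // ids of the ops whose results this op uses
};
using ToyBlock = std::vector<ToyOp>;
using ToyRegion = std::vector<ToyBlock>;  // blocks in textual order

// Clone `src`, giving each cloned op the id `old id + offset`.
// Pass 1 clones every op and records old id -> new id.
// Pass 2 rewrites operands through the map, so a use that textually precedes
// its definition is still remapped correctly.
ToyRegion cloneRegion(const ToyRegion &src, int offset) {
  std::unordered_map<int, int> valueMap;
  ToyRegion dst;
  for (const ToyBlock &block : src) {
    ToyBlock cloned;
    for (const ToyOp &op : block) {
      valueMap[op.id] = op.id + offset;
      cloned.push_back(ToyOp{op.id + offset, op.operandIds});  // operands still old
    }
    dst.push_back(cloned);
  }
  for (ToyBlock &block : dst)
    for (ToyOp &op : block)
      for (int &operand : op.operandIds) {
        auto it = valueMap.find(operand);
        if (it != valueMap.end())
          operand = it->second;  // values defined outside the region stay as-is
      }
  return dst;
}

// The example from the comment: ^bb1 uses %0, which is defined later in ^bb2.
ToyRegion makeInvertedExample() {
  ToyBlock entry = {};             // br ^bb2
  ToyBlock bb1 = {ToyOp{1, {0}}};  // "use"(%0)
  ToyBlock bb2 = {ToyOp{0, {}}};   // %0 = "def"()
  return {entry, bb1, bb2};
}

// Small usage example: the use in ^bb1 ends up pointing at the cloned def.
void demoInvertedClone() {
  ToyRegion out = cloneRegion(makeInvertedExample(), 100);
  assert(out[1][0].operandIds[0] == out[2][0].id);
}
```

A single-pass clone that remapped operands while visiting blocks in textual order would reach the use in `^bb1` before the definition in `^bb2` had an entry in the map, which is the failure mode the question is about.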

ftynse: This no longer removes the arguments, but rather updates the map.
+gpu::GPUFuncOp mlir::outlineKernelFunc(gpu::LaunchOp launchOp,
+                                       StringRef kernelFnName,
+                                       llvm::SmallVectorImpl<Value> &operands) {
+  DenseSet<Value> inputOperandSet;
antiagainst: We are only using the "set" part here, right? Just use a set data type?
+  inputOperandSet.insert(operands.begin(), operands.end());
+  llvm::SetVector<Value> operandSet(operands.begin(), operands.end());
+  auto funcOp = outlineKernelFuncImpl(launchOp, kernelFnName, operandSet);
+  for (auto operand : operandSet) {
+    if (!inputOperandSet.count(operand))
+      operands.push_back(operand);
+  }
+  return funcOp;
ftynse: Will this update the users of the inlinedOps? I don't see the map updated anywhere.
mravishankar: It is done within the clone(map) operation. The results of the operation are added to the map as well.
+}
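
The wrapper above leaves the caller's `operands` in their original order and appends only values that were not already present. A rough standalone sketch of that contract, using standard containers rather than LLVM's `DenseSet`/`SetVector`, is given below. `discoverCaptured` and `appendNewOperands` are hypothetical names, and strings stand in for `Value`; the patch itself keeps a separate snapshot set plus a `SetVector`, whereas this sketch folds both into a single `seen` set.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the step that discovers extra captured values.
// It reports values in discovery order and may repeat ones already known.
std::vector<std::string> discoverCaptured() {
  return {"%arg1", "%c2", "%arg0", "%c2"};
}

// Append to `operands` every discovered value that is not already present,
// keeping the caller's original order and the discovery order of new values.
void appendNewOperands(std::vector<std::string> &operands,
                       const std::vector<std::string> &discovered) {
  std::unordered_set<std::string> seen(operands.begin(), operands.end());
  for (const std::string &value : discovered)
    if (seen.insert(value).second)  // true only the first time a value is seen
      operands.push_back(value);
}

// Small usage example: only "%c2" is new, and it is appended exactly once.
void demoAppendNewOperands() {
  std::vector<std::string> operands = {"%arg0", "%arg1"};
  appendNewOperands(operands, discoverCaptured());
  assert(operands.size() == 3 && operands.back() == "%c2");
}
```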

 // Replace `gpu.launch` operations with an `gpu.launch_func` operation launching
 // `kernelFunc`. The kernel func contains the body of the `gpu.launch` with
 // constant region arguments inlined.
-static void convertToLaunchFuncOp(gpu::LaunchOp &launchOp,
+static void convertToLaunchFuncOp(gpu::LaunchOp launchOp,
                                   gpu::GPUFuncOp kernelFunc,
                                   ValueRange operands) {
   OpBuilder builder(launchOp);
-  auto launchFuncOp = builder.create<gpu::LaunchFuncOp>(
+  builder.create<gpu::LaunchFuncOp>(
       launchOp.getLoc(), kernelFunc, launchOp.getGridSizeOperandValues(),
       launchOp.getBlockSizeOperandValues(), operands);
-  inlineBeneficiaryOps(kernelFunc, launchFuncOp);
   launchOp.erase();
 }

 namespace {

 /// Pass that moves the kernel of each LaunchOp into its separate nested module.
 ///
 /// This pass moves the kernel code of each LaunchOp into a function created
 /// inside a nested module. It also creates an external function of the same
 /// name in the parent module.
 ///
 /// The gpu.modules are intended to be compiled to a cubin blob independently in
 /// a separate pass. The external functions can then be annotated with the
 /// symbol of the cubin accessor function.
 class GpuKernelOutliningPass : public ModulePass<GpuKernelOutliningPass> {
 public:
   void runOnModule() override {
     SymbolTable symbolTable(getModule());
     bool modified = false;
     for (auto func : getModule().getOps<FuncOp>()) {
       // Insert just after the function.
       Block::iterator insertPt(func.getOperation()->getNextNode());
-      func.walk([&](gpu::LaunchOp op) {
+      auto funcWalkResult = func.walk([&](gpu::LaunchOp op) {
         llvm::SetVector<Value> operands;
-        gpu::GPUFuncOp outlinedFunc = outlineKernelFunc(op, operands);
+        std::string kernelFnName =
+            Twine(op.getParentOfType<FuncOp>().getName(), "_kernel").str();
+
+        // Pull in instructions that can be sunk
+        if (failed(sinkOperationsIntoLaunchOp(op)))
+          return WalkResult::interrupt();
ftynse: Sinking already reports an error, no need to add another one IMO.
+        gpu::GPUFuncOp outlinedFunc =
+            outlineKernelFuncImpl(op, kernelFnName, operands);
mehdi_amini: Note: this is still not decoupled from this pass right now (i.e. not tested in isolation, etc.): we still have "outlining" and "sinking" as part of the same pass. Can't they be separated?
mravishankar: They are separate functions. I have no visibility into the clients of the pass. So if any user of the pass is relying on sinking happening, then removing the sinking would "potentially" break them. One could argue that this is then incorrect usage, since the gpu.launch_func op gets updated accordingly, but at this point I would rather keep this change as an NFC.
mehdi_amini: Keeping it NFC is a very good point! (What would break here is an optimization and not correctness, right? So we can still do it in the absolute?)
mravishankar: Yes, it would indeed be an optimization thing and not a correctness thing.

         // Create nested module and insert outlinedFunc. The module will
         // originally get the same name as the function, but may be renamed on
         // insertion into the parent module.
         auto kernelModule = createKernelModule(outlinedFunc, symbolTable);
         symbolTable.insert(kernelModule, insertPt);

         // Potentially changes signature, pulling in constants.
         convertToLaunchFuncOp(op, outlinedFunc, operands.getArrayRef());
         modified = true;
+        return WalkResult::advance();
       });
+      if (funcWalkResult.wasInterrupted())
ftynse: Nit: Please drop trivial braces
+        return signalPassFailure();
     }

     // If any new module was inserted in this module, annotate this module as
     // a container module.
     if (modified)
       getModule().setAttr(gpu::GPUDialect::getContainerModuleAttrName(),
                           UnitAttr::get(&getContext()));
   }
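
The control flow added to `runOnModule` above (stop the walk at the first failed sinking step, then signal pass failure) follows a general early-exit visitor pattern. The sketch below models it in standalone C++ with hypothetical names (`ToyWalkResult`, `walkUntilInterrupted`, `runToyPass`); integers stand in for launch ops, and a negative value is an assumed marker for "sinking failed". It does not use or reproduce MLIR's `WalkResult` API.

```cpp
#include <cassert>
#include <functional>
#include <vector>

enum class ToyWalkResult { Advance, Interrupt };

// Visit items in order; stop at the first callback that returns Interrupt.
// Returns true if the walk was cut short.
bool walkUntilInterrupted(const std::vector<int> &items,
                          const std::function<ToyWalkResult(int)> &callback) {
  for (int item : items)
    if (callback(item) == ToyWalkResult::Interrupt)
      return true;
  return false;
}

// Hypothetical driver: negative items stand for launch ops whose sinking step
// failed. Returns false (pass failure) and leaves later items unprocessed.
bool runToyPass(const std::vector<int> &items, std::vector<int> &processed) {
  bool interrupted = walkUntilInterrupted(items, [&](int item) {
    if (item < 0)
      return ToyWalkResult::Interrupt;
    processed.push_back(item);
    return ToyWalkResult::Advance;
  });
  return !interrupted;
}

// Small usage example: the third item fails, so only two are processed.
void demoRunToyPass() {
  std::vector<int> processed;
  bool ok = runToyPass({1, 2, -1, 3}, processed);
  assert(!ok && processed.size() == 2);
}
```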
(49 lines not shown)

mlir/test/Dialect/GPU/outlining.mlir

(45 lines not shown)
 // CHECK-NEXT: = "gpu.thread_id"() {dimension = "y"} : () -> index
 // CHECK-NEXT: = "gpu.thread_id"() {dimension = "z"} : () -> index
 // CHECK-NEXT: = "gpu.grid_dim"() {dimension = "x"} : () -> index
 // CHECK-NEXT: = "gpu.grid_dim"() {dimension = "y"} : () -> index
 // CHECK-NEXT: = "gpu.grid_dim"() {dimension = "z"} : () -> index
 // CHECK-NEXT: %[[BDIM:.*]] = "gpu.block_dim"() {dimension = "x"} : () -> index
 // CHECK-NEXT: = "gpu.block_dim"() {dimension = "y"} : () -> index
 // CHECK-NEXT: = "gpu.block_dim"() {dimension = "z"} : () -> index
+// CHECK-NEXT: br ^[[BLOCK:.*]]
+// CHECK-NEXT: ^[[BLOCK]]:
 // CHECK-NEXT: "use"(%[[KERNEL_ARG0]]) : (f32) -> ()
 // CHECK-NEXT: "some_op"(%[[BID]], %[[BDIM]]) : (index, index) -> ()
 // CHECK-NEXT: = load %[[KERNEL_ARG1]][%[[TID]]] : memref<?xf32, 1>

 // -----

 // CHECK: module attributes {gpu.container_module}

(41 lines not shown)
 }

 // CHECK-LABEL: func @extra_constants_kernel(%{{.*}}: memref<?xf32>)
 // CHECK: constant
 // CHECK: constant

 // -----

+func @multiple_uses(%arg0 : memref<?xf32>) {
+  %c1 = constant 1 : index
+  %c2 = constant 2 : index
+  // CHECK: gpu.func {{.*}} {
+  // CHECK: %[[C2:.*]] = constant 2 : index
+  // CHECK: "use1"(%[[C2]], %[[C2]])
+  // CHECK: "use2"(%[[C2]])
+  // CHECK: gpu.return
+  // CHECK: }
+  gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1,
+                                       %grid_z = %c1)
+             threads(%tx, %ty, %tz) in (%block_x = %c1, %block_y = %c1,
+                                        %block_z = %c1) {
+    "use1"(%c2, %c2) : (index, index) -> ()
+    "use2"(%c2) : (index) -> ()
+    gpu.terminator
+  }
+  return
+}
+
+// -----
+
 llvm.mlir.global internal @global(42 : i64) : !llvm.i64

 func @function_call(%arg0 : memref<?xf32>) {
   %cst = constant 8 : index
   gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %cst, %grid_y = %cst,
                                        %grid_z = %cst)
              threads(%tx, %ty, %tz) in (%block_x = %cst, %block_y = %cst,
                                         %block_z = %cst) {
(30 lines not shown)


Revision Contents

Diff 248311

mlir/include/mlir/Dialect/GPU/Passes.h

mlir/lib/Dialect/GPU/Transforms/KernelOutlining.cpp

mlir/test/Dialect/GPU/outlining.mlir
