This is an archive of the discontinued LLVM Phabricator instance.

Differential D73893

[MLIR][GPU] Implement initial mapping from loop.parallel to gpu.launch.
ClosedPublic

Authored by herhut on Feb 3 2020, 7:24 AM.


Details

Reviewers

nicolasvasilache
ftynse
mravishankar

Commits

rG715783d415fe: [MLIR][GPU] Implement initial mapping from loop.parallel to gpu.launch.

Summary

To unblock other work, this implements basic lowering based on mapping
attributes that have to be provided on all loop.parallel. The lowering
does not yet support reduce.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

herhut created this revision. Feb 3 2020, 7:24 AM

Herald added a reviewer: nicolasvasilache. Feb 3 2020, 7:24 AM

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, Joonsoo, liufengdb and 12 others.

I started implementing the lowering of parallel loops to gpu launch code. This code uses mapping annotations in the form of attributes. Currently, one can only map to a single block/thread id. Also, upper bounds for the launch can be defined based on the number of iterations of the mapped iteration of the parallel loop. This is probably not what we want in the long run but a good starting point to inform the design we take.

Feedback highly appreciated.
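
As a rough illustration of the scheme described above, here is a standalone sketch that models one mapping annotation outside of MLIR. The `Affine1D` type, the function names and the numbers in the usage note are hypothetical. The processor numbering (0-2 for block x/y/z, 3-5 for thread x/y/z, 6 for sequential) and the rule that the bound map is applied to `upperBound - lowerBound` follow the comments in this revision's LoopsToGPU.cpp.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a one-dimensional affine map: d0 -> scale * d0 + offset.
struct Affine1D {
  int64_t scale;
  int64_t offset;
  int64_t apply(int64_t d0) const { return scale * d0 + offset; }
};

// Processor ids as described in the lowering's comments.
enum Processor : unsigned {
  BlockX = 0, BlockY, BlockZ, ThreadX, ThreadY, ThreadZ, Sequential
};

// Plain C++ model of one entry of the "mapping" array attribute.
struct MappingAnnotation {
  unsigned processor;
  Affine1D indexMap;
  Affine1D boundMap;
};

// Loop index seen by the loop body for a given hardware id. The maps are
// zero-based, so the lower bound of the loop is added afterwards.
int64_t loopIndex(const MappingAnnotation &m, int64_t hwId, int64_t lowerBound) {
  assert(m.processor != Sequential && "sequential loops have no hardware id");
  return m.indexMap.apply(hwId) + lowerBound;
}

// Launch dimension derived from the iteration range of the mapped loop.
int64_t launchBound(const MappingAnnotation &m, int64_t lowerBound,
                    int64_t upperBound) {
  return m.boundMap.apply(upperBound - lowerBound);
}
```

With identity maps and a loop from 4 to 20 mapped to `ThreadX`, hardware id 3 corresponds to loop index 7 and the launch uses 16 threads in x.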

Unit tests: unknown.

clang-tidy: fail. clang-tidy found 2 errors and 2 warnings. 0 of them are added as review comments below.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt

Pre-merge checks is in beta. Report issue. Please join beta or enable it for your project.

Harbormaster failed remote builds in B45590: Diff 242069! Feb 3 2020, 7:43 AM

Thanks Stephan for pushing on this concretely!
I am very interested in this representation and I would welcome a (future) refactoring that will make this as reusable as possible so we can also use it directly in Linalg.
As you may or may not know, other people internally are using a very similar representation in a very successful fashion ;) ;)

Won't have time to dig into details but a few comments:

1. made a cursory glance, it's great to start with this and refactor as we go when others need this too
2. one of the problems you solve here is to go from type (the affine maps for mapping) to values (tidx, bidy, etc.). This is a recurrent transition that would be great to try and make retargetable (e.g. OpenMP or other runtimes), I would love to see this separated from GPU. This would also force separating concerns a bit more re outlining of kernel vs mapping of loops.
3. Along the lines of point 2., I would have hoped to see mapLoopToProcessorIds be templatized and retargeted to this use case, is this possible as a followup?
4. Depending on how much it is possible to go towards 3. or not, I'll comment that the Builder API is really underwhelming. I understand the context behind not using EDSC atm, I'll post something on discourse to reopen that discussion (was too swamped until now).

In any case, great stuff and thanks!

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
165	Could we use capture names such as `TIDX`, `BIDY` or something similar that would help "see" the mapping better?

I suppose you want high-level feedback on this, so I'm not nitpicking in the code.

I think this goes into the right direction, but you are right that this should not be the final design. Using pattern rewrites looks like the right thing to do. I would consider having a pattern and/or a utility function that a future driver can call and have a parallel loop mapped to GPU. We can have test passes where the application is controlled by attributes, but we can also have more automated approaches where we would, e.g., greedily map the outermost parallel loop to GPUs. My thinking is that we should eventually have a tree-like loop structure on which we can decide where to map to blocks/threads/do promotions. That being said, I am not 100% sure we actually want to materialize the decisions as attributes on the operations, or at least as attributes that operations know about.

One thing to keep in mind, nested parallel loops may be tricky, in particular you may need synchronizations inside the outermost parallel loop and make sure they are called by all threads in a "convergent" fashion. For the first take, I'd map one parallel loop construct or perfectly nested constructs and extend from there.

mlir/include/mlir/Dialect/LoopOps/LoopOps.td
16 ↗	(On Diff #242069)	It's a bit reversed to have loops depend on affine loops. Is this necessary for AffineMapAttr?

In D73893#1854949, @nicolasvasilache wrote:

one of the problems you solve here is to go from type (the affine maps for mapping) to values (tidx, bidy, etc.). This is a recurrent transition that would be great to try and make retargetable (e.g. OpenMP or other runtimes), I would love to see this separated from GPU. This would also force separating concerns a bit more re outlining of kernel vs mapping of loops.

I use attributes here instead of baking the mapping into the lowering part to enable reuse. My idea was that we can have different implementations/strategies for mapping that inform a single code generation. I'd be happy to get to a shared "mapping language" that can be reused but it has to start somewhere.

Along the lines of point 2., I would have hoped to see mapLoopToProcessorIds be templatized and retargeted to this use case, is this possible as a followup?

There is no code at all yet to produce the mapping attributes. But mapLoopToProcessorIds could indeed be one producer.

Remove attribute from td file and minor cleanup.

In D73893#1855737, @ftynse wrote:

I suppose you want high-level feedback on this, so I'm not nitpicking in the code.

Indeed. Thanks for the feedback.

I think this goes into the right direction, but you are right that this should not be the final design. Using pattern rewrites looks like the right thing to do. I would consider having a pattern and/or a utility function that a future driver can call and have a parallel loop mapped to GPU.

I have not written code yet to produce the mapping attributes but a greedy mapper will probably be the first thing I'll do.

We can have test passes where the application is controlled by attributes, but we can also have more automated approaches where we would, e.g., greedily map the outermost parallel loop to GPUs. My thinking is that we should eventually have a tree-like loop structure on which we can decide where to map to blocks/threads/do promotions.

I agree. To do any meaningful mapping one needs to see the whole tree. Also, the heuristic that does the tiling might already know how it wants things to be mapped. Likewise, if you do explicit loads into shared memory, for example, you would know that this has to be mapped to thread groups (like warps). So maybe one wants to specify the mapping at that point already, which is why I thought attributes are a good way to model this.

That being said, I am not 100% sure we actually want to materialize the decisions as attributes on the operations, or at least as attributes that operations know about.

I am strongly of the opinion that the mapping should be attributes so that the producer and consumer of the mapping decisions are decoupled. I agree that they should be invisible to the op and I just added them to the .td file temporarily for documentation. The attribute is already optional.

One thing to keep in mind, nested parallel loops may be tricky, in particular you may need synchronizations inside the outermost parallel loop and make sure they are called by all threads in a "convergent" fashion. For the first take, I'd map one parallel loop construct or perfectly nested constructs and extend from there.

There is a TODO in there noting that nesting is only supported if the instructions up to the innermost nest are side-effect free. Otherwise we will need predication to have only a single hardware thread materialize the side effects and, as you state, a barrier if other iterations of nested loops depend on that code. Let's start with the simple case, though.
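
To make the replication concern concrete, the following toy simulation counts how often a side-effecting operation placed between an outer (block-mapped) and an inner (thread-mapped) loop would execute. It is only a sketch in plain C++, not GPU code; `countStores` and its parameters are hypothetical names.

```cpp
#include <cassert>

// Simulates an outer loop mapped to blocks whose body contains a store
// followed by an inner loop mapped to threads. After lowering, every thread
// of a block executes the code that precedes the inner loop.
int countStores(int numBlocks, int numThreads, bool predicate) {
  assert(numBlocks >= 0 && numThreads >= 0);
  int stores = 0;
  for (int block = 0; block < numBlocks; ++block) {
    for (int thread = 0; thread < numThreads; ++thread) {
      // Code between the two loops: replicated per thread unless it is
      // predicated so that only one thread per block performs it.
      if (!predicate || thread == 0)
        ++stores;
      // The inner loop body would run here, once per (block, thread) pair.
    }
  }
  return stores;
}
```

Without predication the store happens once per thread rather than once per outer iteration, which is why the lowering requires that code to be side-effect free. The sketch does not model the barrier that dependent inner iterations would additionally need.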

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
165	Will do once this has stabilized a bit. For now, the tests are fully auto-generated.

Unit tests: pass. 62417 tests passed, 0 failed and 839 were skipped.

clang-tidy: fail. clang-tidy found 0 errors and 2 warnings. 0 of them are added as review comments below.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B45672: Diff 242294! Feb 4 2020, 5:19 AM

Also support linalg tiled loops

Herald added a reviewer: mravishankar. Feb 6 2020, 10:20 AM

Correct tests and some minor cleanup.

ftynse added inline comments. Feb 7 2020, 5:30 AM

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
503	I would suggest creating a struct with named fields here for better readability. And document the function plz
505	.cast here and below, otherwise you may be accessing a null pointer
507	Nit: can we factor out these names into constants?
517	Nit: mlir uses camelBack names
521	Should we use string attributes instead? E.g. having `mapping = ["thread-0", "thread-1", "thread-2", "block-0", "block-1", "block-2", "seq"]` ?
538	`dyn_cast_or_null` since `val` is not guaranteed to have a defining op
582	We should have patterns to fold constants into affine maps (in canonicalization), e.g. `%0 = constant 42 : index` followed by `%1 = affine.min affine_map<(d0,d1)->(d0,d1)>(%0, %arg0)` should fold into `%0 = affine.min affine_map<(d0)->(d0,42)>(%0)`. LMK if it's not the case.
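
The fold suggested in the last comment can be modelled in a few lines of plain C++. This is a sketch of the idea only, not MLIR's canonicalization: `Operand`, `FoldedMin` and `foldConstantsIntoMin` are hypothetical names, and operands are reduced to "known constant" or "dynamic value".

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical operand of a min: either a known constant or a dynamic value.
struct Operand {
  bool isConstant;
  int64_t value; // only meaningful if isConstant is true
};

// Result of the fold: how many dynamic operands remain and the single
// constant (if any) that replaces all constant operands.
struct FoldedMin {
  std::size_t numDynamic;
  bool hasConstant;
  int64_t constant;
};

// Folds all constant operands of a min into one constant term.
FoldedMin foldConstantsIntoMin(const std::vector<Operand> &operands) {
  FoldedMin result = {0, false, 0};
  for (const Operand &op : operands) {
    if (!op.isConstant) {
      ++result.numDynamic;
      continue;
    }
    result.constant =
        result.hasConstant ? std::min(result.constant, op.value) : op.value;
    result.hasConstant = true;
  }
  return result;
}
```

Because a min can never exceed any of its operands, the folded constant is also a valid static upper bound for the whole expression, which is the property the lowering relies on when it derives a launch bound from an `affine.min`.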

I'm missing some higher-level documentation on what each function is supposed to do.

rriddle added inline comments. Feb 7 2020, 10:21 AM

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
503	Generally only classes should be within anonymous namespaces. static functions should be in global scope and marked 'static'.
552	This function is quite large. Can you split it up a bit?

(Sorry for jumping on this late)

This is now adding three different lowerings of loops to GPU:

1. From perfectly nested loops to GPU
2. From imperfectly nested loops to GPU
3. Parallel loops to GPU

I confess I contributed to one of them (2). We should probably have a plan to consolidate these. This is too much duplication of functionality. We should probably deprecate (2) in favor of this change, and adapt all uses accordingly.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
503	Add some comments please.
504	If this is really a dictionary_attribute, this lookup might be simplified using a StructAttr
516	Please add some comments about this method, preconditions, etc.
522	I am just coming up to speed on the implementation of loop.parallel, but in the ODS I don't see any attribute for "mapping". When is this added?
524	nit: Please remove this line. Didn't make the immediate leap to reductions and parallelOp.getNumResults() != 0
671	This comment seems outdated now.
703	It would be useful to also expose a method to populate this pattern (and also other patterns that might be needed in the future) in a method like `populateParallelLoopToGPUPatterns` . Then a pass can just import these patterns without having to run a separate pass to do this lowering.

This revision now requires changes to proceed. Feb 8 2020, 10:52 AM

Comments, comments, comments :-)

Harbormaster completed remote builds in B46215: Diff 243835. Feb 11 2020, 5:53 AM

In D73893#1865719, @mravishankar wrote:

(Sorry for jumping on this late)

This is now adding three different lowerings of loops to GPU:

1. From perfectly nested loops to GPU
2. From imperfectly nested loops to GPU
3. Parallel loops to GPU

I confess I contributed to one of them (2). We should probably have a plan to consolidate these. This is too much duplication of functionality. We should probably deprecate (2) in favor of this change, and adapt all uses accordingly.

All but 3 should probably go away unless someone invests into the dependency analysis to make 1 and 2 safe.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
504	I had a struct attribute before but I did not want to specify the attribute on the `loop.parallel` operation, as it is GPU dialect specific. So I went with an unspecified optional attribute with custom accessor.
521	I wanted to stay with processor ids for now, as that is also what @nicolasvasilache was using.
522	This is added by producers of mappings that want to lower to GPU. I have a `GreedyMapper` (I will send that out shortly) that essentially implements what the existing helper functions did.
552	A lot of it is comments but I have moved some things out.
582	So you are saying that by running with -canonicalize I should see the affine.min change? I could not reproduce this.
671	No, it is unfortunately still true. If there are nested `loop.parallel` the code up to them has to be side-effect free, as we are essentially replicating it per thread. I will add some checking in a later version.

Thanks Stephan! I see overall where this is headed. This is nice to have indeed.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
504	You can add it as an attribute to GPU dialect. There is no harm in adding this attribute "while" lowering to GPU dialect, or a pre-pass before lowering to GPU dialect right?

Looks okay for the first take at it.

In general, I wouldn't bother with sequential loops here and implement a separate rewrite that we could run before this that splits the parallel loop into two nested parallel ops, and then use parallel-to-sequential transformation on the inner op.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
576	Nit: why not `nullptr` as a sentinel?
611	Readability nit: can we assign std::get<0>, etc. to variables with more descriptive names? The body of this loop is 3 screens long (consider refactoring in a follow-up :)) and it's hard to keep track of what was zip'ed.
627	Nit: you can use OpRewriter::InsertionGuard with a C++ scope
701	For a future revision, I'd consider having a transformation function that is controlled by actual arguments rather than by an attached attribute, and have the pattern read the attribute and call that function. This way, we can call the transformation programmatically without modifying the IR.
748	Can we just walk recursively and check for the no side effects trait, returning matchFailure if there are side effects?
750	The comment above says "it is only correct if there either is no further loop.parallel", which I cannot match with this code.
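
The recursive check suggested for line 748 could look roughly like the following. This is a standalone sketch over a hypothetical `Op` tree rather than MLIR operations, and it is deliberately conservative: any side effect next to a nested parallel loop rejects the body.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified operation: a side-effect flag, a marker for
// parallel loops, and the operations nested in its body.
struct Op {
  bool hasSideEffects;
  bool isParallelLoop;
  std::vector<Op> nested;
};

// True if `op` or anything nested inside it has side effects.
bool hasSideEffectsRecursively(const Op &op) {
  if (op.hasSideEffects)
    return true;
  for (const Op &child : op.nested)
    if (hasSideEffectsRecursively(child))
      return true;
  return false;
}

// A loop body may be replicated per thread if it contains no further
// parallel loop, or if everything outside the nested parallel loops is
// side-effect free.
bool isSafeToReplicate(const std::vector<Op> &body) {
  bool hasNestedParallel = false;
  for (const Op &op : body)
    hasNestedParallel = hasNestedParallel || op.isParallelLoop;
  if (!hasNestedParallel)
    return true; // innermost level: each iteration runs exactly once
  for (const Op &op : body) {
    if (op.isParallelLoop) {
      if (!isSafeToReplicate(op.nested))
        return false;
    } else if (hasSideEffectsRecursively(op)) {
      return false;
    }
  }
  return true;
}
```

A pattern built on such a check would return a match failure when `isSafeToReplicate` is false instead of producing a launch with duplicated side effects.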

This revision is now accepted and ready to land. Feb 12 2020, 12:12 PM

sri added a subscriber: sri. Feb 12 2020, 8:08 PM

Some more comments addressed.

herhut marked 3 inline comments as done. Feb 13 2020, 6:08 AM

herhut added inline comments.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
576	I dislike nullptr as sentinel because it too easily sneaks in and then produces hard to debug errors. The `gpu.launch` on the other hand definitely won't. I would have created an explicit sentinel but that seemed a bit overkill.
748	I added some testing code that is conservative but ensures correctness.
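
The worklist-with-sentinel idea discussed for line 576 can be sketched independently of MLIR. In the model below a hypothetical `Node` tree is flattened with an explicit worklist; the address of the root node plays the role of the `gpu.launch` sentinel, and an indentation depth stands in for the rewriter's insertion scope.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical nested operation: a name and an optional body.
struct Node {
  std::string name;
  std::vector<Node> body; // non-empty for operations that open a new scope
};

// Visits the tree in program order using a worklist. A sentinel entry marks
// the end of each nested scope so that the current depth can be popped.
std::vector<std::string> flatten(const Node &root) {
  std::vector<std::string> out;
  std::vector<const Node *> worklist;
  const Node *sentinel = &root; // never a child, so it cannot appear by accident
  int depth = 0;

  // Push a body in reverse so that pop_back() yields the original order.
  auto pushBody = [&worklist](const Node &n) {
    for (auto it = n.body.rbegin(); it != n.body.rend(); ++it)
      worklist.push_back(&*it);
  };

  pushBody(root);
  while (!worklist.empty()) {
    const Node *current = worklist.back();
    worklist.pop_back();
    if (current == sentinel) {
      --depth; // end of a nested scope: go one level up
      continue;
    }
    out.push_back(std::string(depth * 2, ' ') + current->name);
    if (!current->body.empty()) {
      ++depth;
      worklist.push_back(sentinel); // processed after the whole body
      pushBody(*current);
    }
  }
  assert(depth == 0 && "every opened scope must be closed");
  return out;
}
```

Using a distinguished non-null pointer rather than `nullptr` means a stray null in the worklist cannot be mistaken for an end-of-scope marker, which is the trade-off described in the reply above.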

Harbormaster failed remote builds in B46408: Diff 244413! Feb 13 2020, 6:26 AM

Closed by commit rG715783d415fe: [MLIR][GPU] Implement initial mapping from loop.parallel to gpu.launch. (authored by herhut). Feb 13 2020, 7:57 AM

This revision was automatically updated to reflect the committed changes.

herhut marked an inline comment as done.

rriddle added inline comments. Feb 13 2020, 11:25 AM

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
751	Remove this debugging.

Revision Contents

Path                                                    Size
mlir/include/mlir/Conversion/LoopsToGPU/LoopsToGPU.h    7 lines
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp           328 lines
mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir      326 lines

Diff 244433

mlir/include/mlir/Conversion/LoopsToGPU/LoopsToGPU.h

 //===- LoopsToGPU.h - Convert loop nests to GPU kernels ---------*- C++ -*-===//
 //
 // Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
 // See https://llvm.org/LICENSE.txt for license information.
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
 #ifndef MLIR_CONVERSION_LOOPSTOGPU_LOOPSTOGPU_H_
 #define MLIR_CONVERSION_LOOPSTOGPU_LOOPSTOGPU_H_

 #include "mlir/Support/LLVM.h"

 namespace mlir {
 class AffineForOp;
+class MLIRContext;
+class OwningRewritePatternList;
 struct LogicalResult;
 class Value;

 namespace loop {
 class ForOp;
 } // end namespace loop

 /// Convert a perfect affine loop nest with the outermost loop identified by
[... 44 lines not shown ...]
 /// workgroup. No checks are performed that this is indeed the case.
 /// TODO(ravishankarm) : Add checks that verify 2(b) above.
 /// The above conditions are assumed to be satisfied by the computation rooted
 /// at `forOp`.
 LogicalResult convertLoopToGPULaunch(loop::ForOp forOp,
                                      ArrayRef<Value> numWorkGroups,
                                      ArrayRef<Value> workGroupSizes);

+/// Adds the conversion pattern from `loop.parallel` to `gpu.launch` to the
+/// provided pattern list.
+void populateParallelLoopToGPUPatterns(OwningRewritePatternList &patterns,
+                                       MLIRContext *ctx);
+
 } // namespace mlir

 #endif // MLIR_CONVERSION_LOOPSTOGPU_LOOPSTOGPU_H_

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp

[... 14 lines not shown ...]
 #include "mlir/Conversion/LoopsToGPU/LoopsToGPU.h"

 #include "mlir/Conversion/AffineToStandard/AffineToStandard.h"
 #include "mlir/Dialect/AffineOps/AffineOps.h"
 #include "mlir/Dialect/GPU/GPUDialect.h"
 #include "mlir/Dialect/LoopOps/LoopOps.h"
 #include "mlir/Dialect/StandardOps/Ops.h"
 #include "mlir/IR/AffineExpr.h"
+#include "mlir/IR/BlockAndValueMapping.h"
 #include "mlir/IR/Builders.h"
+#include "mlir/Pass/Pass.h"
+#include "mlir/Transforms/DialectConversion.h"
 #include "mlir/Transforms/LoopUtils.h"
+#include "mlir/Transforms/Passes.h"
 #include "mlir/Transforms/RegionUtils.h"
 #include "llvm/ADT/Sequence.h"
 #include "llvm/Support/Debug.h"

 #define DEBUG_TYPE "loops-to-gpu"

 using namespace mlir;
 using namespace mlir::loop;
[... 449 lines not shown, resuming inside LogicalResult mlir::convertLoopNestToGPULaunch(ForOp forOp, ...]
   return ::convertLoopNestToGPULaunch(forOp, numBlockDims, numThreadDims);
 }

 LogicalResult mlir::convertLoopToGPULaunch(loop::ForOp forOp,
                                            ArrayRef<Value> numWorkGroups,
                                            ArrayRef<Value> workGroupSizes) {
   return ::convertLoopToGPULaunch(forOp, numWorkGroups, workGroupSizes);
 }

+namespace {
+struct ParallelToGpuLaunchLowering : public OpRewritePattern<ParallelOp> {
+  using OpRewritePattern<ParallelOp>::OpRewritePattern;
+
+  PatternMatchResult matchAndRewrite(ParallelOp parallelOp,
+                                     PatternRewriter &rewriter) const override;
+};
+
+struct MappingAnnotation {
+  unsigned processor;
+  AffineMap indexMap;
+  AffineMap boundMap;
+};
+
+} // namespace
+
+static constexpr const char *kProcessorEntryName = "processor";
+static constexpr const char *kIndexMapEntryName = "map";
+static constexpr const char *kBoundMapEntryName = "bound";

+/// Extracts the mapping annotations from the provided attribute. The attribute
+/// is expected to be of the form
+/// { processor = <unsigned>, map = <AffineMap>, bound = <AffineMap> }
+/// where the bound is optional.
+static MappingAnnotation extractMappingAnnotation(Attribute attribute) {
+  DictionaryAttr dict = attribute.cast<DictionaryAttr>();
+  unsigned processor = dict.get(kProcessorEntryName)
+                           .cast<IntegerAttr>()
+                           .getValue()
+                           .getSExtValue();
+  AffineMap map = dict.get(kIndexMapEntryName).cast<AffineMapAttr>().getValue();
+  AffineMapAttr boundAttr =
+      dict.get(kBoundMapEntryName).dyn_cast_or_null<AffineMapAttr>();
+  AffineMap bound;
+  if (boundAttr)
+    bound = boundAttr.getValue();
+  return {processor, map, bound};
+}

+/// Tries to derive a static upper bound from the defining operation of
+/// `upperBound`.
+static Value deriveStaticUpperBound(Value upperBound) {
+  Value constantBound = {};
+  if (AffineMinOp minOp =
+          dyn_cast_or_null<AffineMinOp>(upperBound.getDefiningOp())) {
+    auto map = minOp.map();
+    auto operands = minOp.operands();
+    for (int sub = 0, e = map.getNumResults(); sub < e; ++sub) {
+      AffineExpr expr = map.getResult(sub);
+      if (AffineDimExpr dimExpr = expr.dyn_cast<AffineDimExpr>()) {
+        auto dimOperand = operands[dimExpr.getPosition()];
+        auto defOp = dimOperand.getDefiningOp();
+        if (ConstantOp constOp = dyn_cast_or_null<ConstantOp>(defOp)) {
+          constantBound = constOp;
+          break;
+        }
+      }
+    }
+  }
+  return constantBound;
+}

+/// Modifies the current transformation state to capture the effect of the given
+/// `loop.parallel` operation on index substitutions and the operations to be
+/// inserted.
+/// Specifically, if a dimension of a parallel loop is mapped to a hardware id,
+/// this function will
+/// - compute the loop index based on the hardware id and affine map from the
+///   mapping and update `cloningMap` to substitute all uses.
+/// - derive a new upper bound for the hardware id and augment the provided
+///   `gpu.launch` operation accordingly.
+/// - if the upper bound is imprecise, insert a conditional in the `gpu.launch`
+///   and update the rewriter to insert into the conditional's body.
+/// If the dimension is mapped to sequential,
+/// - insert a for loop into the body and update the rewriter to insert into
+///   the for loop's body.
+/// - update the `cloningMap` to replace uses of the index with the index of
+///   the new for loop.
+/// In either case,
+/// - append the instructions from the loop's body to worklist, in reverse order.
+/// To note the end of the current scope in case a loop or conditional was
+/// inserted, a sentinel (the `gpu.launch` operation) is inserted into the
+/// worklist. This signals the processor of the worklist to pop the rewriter
+/// one scope-level up.
+static LogicalResult processParallelLoop(ParallelOp parallelOp,
+                                         gpu::LaunchOp launchOp,
+                                         BlockAndValueMapping &cloningMap,
+                                         SmallVectorImpl<Operation *> &worklist,
+                                         PatternRewriter &rewriter) {
		// TODO(herhut): Verify that this is a valid GPU mapping.
		// processor ids: 0-2 block [x/y/z], 3-5 -> thread [x/y/z], 6-> sequential
		ArrayAttr mapping = parallelOp.getAttrOfType<ArrayAttr>("mapping");

		// TODO(herhut): Support reductions.
		if (!mapping \|\| parallelOp.getNumResults() != 0)
		return failure();

		Location loc = parallelOp.getLoc();

		auto launchIndependent = [&launchOp](Value val) {
		return val.getParentRegion()->isAncestor(launchOp.getParentRegion());
		};

		auto ensureLaunchIndependent = [&launchOp, &rewriter,
		launchIndependent](Value val) -> Value {
		if (launchIndependent(val))
		return val;
		if (ConstantOp constOp = dyn_cast_or_null<ConstantOp>(val.getDefiningOp()))
		return rewriter.create<ConstantOp>(constOp.getLoc(), constOp.getValue());
		return {};
		};

		for (auto config : llvm::zip(mapping, parallelOp.getInductionVars(),
		parallelOp.lowerBound(), parallelOp.upperBound(),
		parallelOp.step())) {
		Attribute mappingAttribute;
		Value iv, lowerBound, upperBound, step;
		ftynseUnsubmitted Done Reply Inline Actions Readablity nit: can we assign std::get<0>, etc to variables with more explicative names? The body of this loop is 3 screens long (consider refactoring in a follow-up :)) and it's hard to keep track of what was zip'ed. ftynse: Readablity nit: can we assign std::get<0>, etc to variables with more explicative names? The…
		std::tie(mappingAttribute, iv, lowerBound, upperBound, step) = config;
		MappingAnnotation annotation = extractMappingAnnotation(mappingAttribute);
		Value newIndex;

		if (annotation.processor < gpu::LaunchOp::kNumConfigOperands) {
		// Use the corresponding thread/grid index as replacement for the loop iv.
		// TODO(herhut): Make the iv calculation depend on lower & upper bound.
		Value operand = launchOp.body().front().getArgument(annotation.processor);
		Value appliedMap =
		rewriter.create<AffineApplyOp>(loc, annotation.indexMap, operand);
		// Add the lower bound, as the maps are 0 based but the loop might not be.
		// TODO(herhut): Maybe move this explicitly into the maps?
		newIndex = rewriter.create<AddIOp>(
		loc, appliedMap, cloningMap.lookupOrDefault(lowerBound));
		// If there was also a bound, insert that, too.
		// TODO(herhut): Check that we do not assign bounds twice.
		ftynse (done): Nit: you can use OpRewriter::InsertionGuard with a C++ scope.
		if (annotation.boundMap) {
		// We pass as the single operand to the bound-map the number of
		// iterations, which is upperBound - lowerBound. To support inner loops
		// with dynamic upper bounds (as generated by e.g. tiling), try to
		// derive a max for the bounds. If the used bound for the hardware id is
		// imprecise, wrap the contained code into a conditional.
		// If the lower-bound is constant or defined before the launch, we can
		// use it in the launch bounds. Otherwise fail.
		if (!launchIndependent(lowerBound) &&
		!isa<ConstantOp>(lowerBound.getDefiningOp()))
		return failure();
		// If the upper-bound is constant or defined before the launch, we can
		// use it in the launch bounds directly. Otherwise try to derive a bound.
		bool boundIsPrecise = launchIndependent(upperBound) ||
		isa<ConstantOp>(upperBound.getDefiningOp());
		if (!boundIsPrecise) {
		upperBound = deriveStaticUpperBound(upperBound);
		if (!upperBound)
		return failure();
		}
		{
		PatternRewriter::InsertionGuard guard(rewriter);
		rewriter.setInsertionPoint(launchOp);

		Value iterations = rewriter.create<SubIOp>(
		loc,
		ensureLaunchIndependent(cloningMap.lookupOrDefault(upperBound)),
		ensureLaunchIndependent(cloningMap.lookupOrDefault(lowerBound)));
		Value launchBound = rewriter.create<AffineApplyOp>(
		loc, annotation.boundMap, iterations);
		launchOp.setOperand(annotation.processor, launchBound);
		}
		if (!boundIsPrecise) {
		// We are using an approximation, create a surrounding conditional.
		Value originalBound = std::get<3>(config);
		CmpIOp pred = rewriter.create<CmpIOp>(
		loc, CmpIPredicate::slt, newIndex,
		cloningMap.lookupOrDefault(originalBound));
		loop::IfOp ifOp = rewriter.create<loop::IfOp>(loc, pred, false);
		rewriter.setInsertionPointToStart(&ifOp.thenRegion().front());
		// Put a sentinel into the worklist so we know when to pop out of the
		// if body again. We use the launchOp here, as that cannot be part of
		// the body's instructions.
		worklist.push_back(launchOp.getOperation());
		mravishankar (done): This comment seems outdated now.
		herhut (author, done): No, it is unfortunately still true. If there are nested `loop.parallel` the code up to them has to be side-effect free, as we are essentially replicating it per thread. I will add some checking in a later version.
		}
		}
		} else {
		// Create a sequential for loop.
		auto loopOp = rewriter.create<loop::ForOp>(
		loc, cloningMap.lookupOrDefault(lowerBound),
		cloningMap.lookupOrDefault(upperBound),
		cloningMap.lookupOrDefault(step));
		newIndex = loopOp.getInductionVar();
		rewriter.setInsertionPointToStart(loopOp.getBody());
		// Put a sentinel into the worklist so we know when to pop out of the loop
		// body again. We use the launchOp here, as that cannot be part of the
		// body's instructions.
		worklist.push_back(launchOp.getOperation());
		}
		cloningMap.map(iv, newIndex);
		}
		Block *body = parallelOp.getBody();
		worklist.reserve(worklist.size() + body->getOperations().size());
		for (Operation &op : llvm::reverse(body->without_terminator()))
		worklist.push_back(&op);
		return success();
		}
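To illustrate the bound handling in `processParallelLoop` above, here is a minimal sketch in plain C++ with no MLIR dependencies. `Map1D`, `launchBound` and `visitedIndices` are hypothetical names introduced only for illustration; they are not part of the patch. As in this revision, the iteration count is taken as `upperBound - lowerBound` and the step is not considered.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for an affine map with one input and one result.
using Map1D = std::function<int64_t(int64_t)>;

// Launch size for one hardware id: apply `boundMap` to the iteration count,
// which the patch computes as upperBound - lowerBound.
int64_t launchBound(int64_t lowerBound, int64_t upperBound,
                    const Map1D &boundMap) {
  assert(upperBound >= lowerBound && "expected a non-negative range");
  return boundMap(upperBound - lowerBound);
}

// Indices visited by the hardware ids [0, launchSize). Each id is passed
// through `indexMap` and offset by the lower bound. Ids whose index reaches
// `actualUpperBound` are masked off, mirroring the generated `loop.if`.
std::vector<int64_t> visitedIndices(int64_t launchSize, int64_t lowerBound,
                                    int64_t actualUpperBound,
                                    const Map1D &indexMap) {
  std::vector<int64_t> result;
  for (int64_t id = 0; id < launchSize; ++id) {
    int64_t index = indexMap(id) + lowerBound;
    if (index < actualUpperBound) // corresponds to the cmpi "slt" guard
      result.push_back(index);
  }
  return result;
}
```

With an identity map, a loop over [0, 3) whose bound was over-approximated to 4 launches four ids, and the guard masks off the last one, so only indices 0, 1 and 2 reach the body.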

		/// Lower a `loop.parallel` operation into a corresponding `gpu.launch`
		/// operation.
		///
		/// This essentially transforms a loop nest into a corresponding SIMT function.
		/// The conversion is driven by mapping annotations on the `loop.parallel`
		/// operations. The mapping is provided via a `DictionaryAttribute` named
		ftynse (done): For a future revision, I'd consider having a transformation function that is controlled by actual arguments rather than by an attached attribute, and have the pattern read the attribute and call that function. This way, we can call the transformation programmatically without modifying the IR.
		/// `mapping`, which has three entries:
		/// - processor: the hardware id to map to. 0-2 are block dimensions, 3-5 are
		mravishankar (done): It would be useful to also expose a method to populate this pattern (and also other patterns that might be needed in the future) in a method like `populateParallelLoopToGPUPatterns`. Then a pass can just import these patterns without having to run a separate pass to do this lowering.
		/// thread dimensions and 6 is sequential.
		/// - map : An affine map that is used to pre-process hardware ids before
		/// substitution.
		/// - bound : An affine map that is used to compute the bound of the hardware
		/// id based on an upper bound of the number of iterations.
		/// If the `loop.parallel` contains nested `loop.parallel` operations, those
		/// need to be annotated, as well. Structurally, the transformation works by
		/// splicing all operations from nested `loop.parallel` operations into a single
		/// sequence. Indices mapped to hardware ids are substituted with those ids,
		/// whereas sequential mappings result in a sequential for-loop. To have more
		/// flexibility when mapping code to hardware ids, the transform supports two
		/// affine maps. The first `map` is used to compute the actual index for
		/// substitution from the hardware id. The second `bound` is used to compute the
		/// launch dimension for the hardware id from the number of iterations the
		/// mapped loop is performing. Note that the number of iterations might be
		/// imprecise if the corresponding loop-bounds are loop-dependent. In such case,
		/// the hardware id might iterate over additional indices. The transformation
		/// caters for this by predicating the created sequence of instructions on
		/// the actual loop bound. This only works if a static upper bound for the
		/// dynamic loop bound can be derived, currently via analyzing `affine.min`
		/// operations.
		PatternMatchResult
		ParallelToGpuLaunchLowering::matchAndRewrite(ParallelOp parallelOp,
		PatternRewriter &rewriter) const {
		// Create a launch operation. We start with bound one for all grid/block
		// sizes. Those will be refined later as we discover them from mappings.
		Location loc = parallelOp.getLoc();
		Value constantOne = rewriter.create<ConstantIndexOp>(parallelOp.getLoc(), 1);
		gpu::LaunchOp launchOp = rewriter.create<gpu::LaunchOp>(
		parallelOp.getLoc(), constantOne, constantOne, constantOne, constantOne,
		constantOne, constantOne);
		rewriter.setInsertionPointToEnd(&launchOp.body().front());
		rewriter.create<gpu::TerminatorOp>(loc);
		rewriter.setInsertionPointToStart(&launchOp.body().front());

		BlockAndValueMapping cloningMap;
		SmallVector<Operation *, 16> worklist;
		if (failed(processParallelLoop(parallelOp, launchOp, cloningMap, worklist,
		rewriter)))
		return matchFailure();

		// Whether we have seen any side-effects. Reset when leaving an inner scope.
		bool seenSideeffects = false;
		// Whether we have left a nesting scope (and hence are no longer innermost).
		bool leftNestingScope = false;
		ftynse (done): Can we just walk recursively and check for the no side effects trait, returning matchFailure if there are side effects?
		herhut (author, done): I added some testing code that is conservative but ensures correctness.
		while (!worklist.empty()) {
		Operation *op = worklist.pop_back_val();
		ftynse (done): The comment above says "it is only correct if there either is no further loop.parallel", which I cannot match with this code.
		launchOp.dump();
		rriddle (not done): Remove this debugging.

		// Now walk over the body and clone it.
		// TODO: This is only correct if there either is no further loop.parallel
		// nested or this code is side-effect free. Otherwise we might need
		// predication. We are overly conservative for now and only allow
		// side-effects in the innermost scope.
		if (auto nestedParallel = dyn_cast<ParallelOp>(op)) {
		// Before entering a nested scope, make sure there have been no
		// side effects until now.
		if (seenSideeffects)
		return matchFailure();
		// A nested loop.parallel needs insertion of code to compute indices.
		// Insert that now. This will also update the worklist with the loop's
		// body. Propagate failure if the nested loop cannot be mapped.
		if (failed(processParallelLoop(nestedParallel, launchOp, cloningMap,
		worklist, rewriter)))
		return matchFailure();
		} else if (op == launchOp.getOperation()) {
		// Found our sentinel value. We have finished the operations from one
		// nesting level, pop one level back up.
		auto parent = rewriter.getInsertionPoint()->getParentOp();
		rewriter.setInsertionPointAfter(parent);
		leftNestingScope = true;
		seenSideeffects = false;
		} else {
		// Otherwise we copy it over.
		Operation *clone = rewriter.clone(*op, cloningMap);
		cloningMap.map(op->getResults(), clone->getResults());
		// Check for side effects.
		seenSideeffects |= !clone->hasNoSideEffect();
		// If we are no longer in the innermost scope, side effects are disallowed.
		if (seenSideeffects && leftNestingScope)
		return matchFailure();
		}
		}

		rewriter.eraseOp(parallelOp);
		return matchSuccess();
		}
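The worklist-with-sentinel traversal in `matchAndRewrite` above can be modelled outside of MLIR. The following is a simplified sketch under stated assumptions: `Op`, `leaf`, `nest` and `flatten` are hypothetical names, every nest opens a scope and pushes a sentinel (in the patch only `loop.for` and `loop.if` do), and the conservative side-effect rule is the one described in the comments above.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature IR: an op is either a leaf or a nest with a body.
struct Op {
  std::string name;
  bool hasSideEffect;
  bool isNest;
  std::vector<Op> body;
};

Op leaf(std::string name, bool hasSideEffect) {
  return Op{std::move(name), hasSideEffect, false, {}};
}

Op nest(std::string name, std::vector<Op> body) {
  return Op{std::move(name), false, true, std::move(body)};
}

// Flattens `root` into a linear trace. Returns false if a side-effecting op
// appears before a nested scope or after a scope has been left.
bool flatten(const Op &root, std::vector<std::string> &trace) {
  static const Op sentinel{"<sentinel>", false, false, {}};
  std::vector<const Op *> worklist;
  // Push the sentinel first and the body in reverse, so that the body ops
  // are popped in program order and the sentinel is popped last.
  auto enter = [&](const Op &scope) {
    trace.push_back("enter " + scope.name);
    worklist.push_back(&sentinel);
    for (auto it = scope.body.rbegin(); it != scope.body.rend(); ++it)
      worklist.push_back(&*it);
  };
  enter(root);
  bool seenSideEffects = false;
  bool leftNestingScope = false;
  while (!worklist.empty()) {
    const Op *op = worklist.back();
    worklist.pop_back();
    if (op == &sentinel) {
      // Finished one nesting level.
      trace.push_back("leave");
      leftNestingScope = true;
      seenSideEffects = false;
    } else if (op->isNest) {
      if (seenSideEffects)
        return false;
      enter(*op);
    } else {
      trace.push_back(op->name);
      seenSideEffects |= op->hasSideEffect;
      if (seenSideEffects && leftNestingScope)
        return false;
    }
  }
  return true;
}
```

A side-effect-free `addi` followed by a nest that contains a `store` flattens successfully. Placing the `store` before the nest, or after the nest has been left, makes `flatten` return false.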

		namespace {
		struct ParallelLoopToGpuPass : public OperationPass<ParallelLoopToGpuPass> {
		void runOnOperation() override;
		};
		} // namespace

		void mlir::populateParallelLoopToGPUPatterns(OwningRewritePatternList &patterns,
		MLIRContext *ctx) {
		patterns.insert<ParallelToGpuLaunchLowering>(ctx);
		}

		void ParallelLoopToGpuPass::runOnOperation() {
		OwningRewritePatternList patterns;
		populateParallelLoopToGPUPatterns(patterns, &getContext());
		ConversionTarget target(getContext());
		target.addLegalDialect<StandardOpsDialect>();
		target.addLegalDialect<AffineOpsDialect>();
		target.addLegalDialect<gpu::GPUDialect>();
		target.addLegalDialect<loop::LoopOpsDialect>();
		target.addIllegalOp<loop::ParallelOp>();
		if (failed(applyPartialConversion(getOperation(), target, patterns)))
		signalPassFailure();
		}

		static PassRegistration<ParallelLoopToGpuPass>
		pass("convert-parallel-loops-to-gpu", "Convert mapped loop.parallel ops"
		" to gpu launch operations.");
		No newline at end of file
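The `processor` numbering used by the pattern above (0-2 for block ids x/y/z, 3-5 for thread ids x/y/z, 6 for sequential) determines which launch operand a loop dimension refines. A minimal sketch in plain C++, assuming identity `bound` maps and ignoring the step as this revision does; `LoopDim`, `launchSizes`, `kNumLaunchDims` and `kSequential` are hypothetical names, with `kNumLaunchDims` standing in for `gpu::LaunchOp::kNumConfigOperands`:

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed operand order: grid sizes x/y/z (0-2), then block sizes x/y/z (3-5).
constexpr int kNumLaunchDims = 6;
constexpr int kSequential = 6;

// One dimension of a mapped parallel loop (hypothetical helper type).
struct LoopDim {
  int processor;
  int64_t lowerBound;
  int64_t upperBound;
};

// Starts with size one for every launch dimension and overwrites the entry a
// loop dimension is mapped to. Sequential dimensions leave the launch alone.
std::array<int64_t, kNumLaunchDims>
launchSizes(const std::vector<LoopDim> &dims) {
  std::array<int64_t, kNumLaunchDims> sizes;
  sizes.fill(1);
  for (const LoopDim &dim : dims) {
    assert(dim.processor >= 0 && dim.processor <= kSequential);
    if (dim.processor < kNumLaunchDims)
      sizes[dim.processor] = dim.upperBound - dim.lowerBound;
  }
  return sizes;
}
```

For a 2-d loop mapped to processors 1 and 0 with ranges [0, 8) and [2, 5), this yields a grid of 3 x 8 x 1 blocks and leaves all thread sizes at one.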

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir

This file was added.

				// RUN: mlir-opt -convert-parallel-loops-to-gpu -split-input-file %s | FileCheck %s -dump-input-on-failure

				// 2-d parallel loop mapped to block.y and block.x

				func @parallel_loop_bidy_bidx(%arg0 : index, %arg1 : index, %arg2 : index,
				%arg3 : index, %arg4 : index,
				%buf : memref<?x?xf32>,
				%res : memref<?x?xf32>) {
				%step = constant 2 : index
				loop.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
				step (%arg4, %step) {
				%val = load %buf[%i0, %i1] : memref<?x?xf32>
				store %val, %res[%i1, %i0] : memref<?x?xf32>
				} { mapping = [{processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}, {processor = 0, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}] }
				return
				}

				// CHECK: #map0 = affine_map<(d0) -> (d0)>
				// CHECK: module {

				// CHECK-LABEL: func @parallel_loop_bidy_bidx(
				// CHECK-SAME: [[VAL_0:%.*]]: index, [[VAL_1:%.*]]: index, [[VAL_2:%.*]]: index, [[VAL_3:%.*]]: index, [[VAL_4:%.*]]: index, [[VAL_5:%.*]]: memref<?x?xf32>, [[VAL_6:%.*]]: memref<?x?xf32>) {
				// CHECK: [[VAL_7:%.*]] = constant 2 : index
				// CHECK: [[VAL_8:%.*]] = constant 1 : index
				// CHECK: [[VAL_9:%.*]] = subi [[VAL_2]], [[VAL_0]] : index
				// CHECK: [[VAL_10:%.*]] = affine.apply #map0([[VAL_9]])
				// CHECK: [[VAL_11:%.*]] = subi [[VAL_3]], [[VAL_1]] : index
				// CHECK: [[VAL_12:%.*]] = affine.apply #map0([[VAL_11]])
				// CHECK: gpu.launch blocks([[VAL_13:%.*]], [[VAL_14:%.*]], [[VAL_15:%.*]]) in ([[VAL_16:%.*]] = [[VAL_12]], [[VAL_17:%.*]] = [[VAL_10]], [[VAL_18:%.*]] = [[VAL_8]]) threads([[VAL_19:%.*]], [[VAL_20:%.*]], [[VAL_21:%.*]]) in ([[VAL_22:%.*]] = [[VAL_8]], [[VAL_23:%.*]] = [[VAL_8]], [[VAL_24:%.*]] = [[VAL_8]]) {
				// CHECK: [[VAL_25:%.*]] = affine.apply #map0([[VAL_14]])
				// CHECK: [[VAL_26:%.*]] = addi [[VAL_25]], [[VAL_0]] : index
				// CHECK: [[VAL_27:%.*]] = affine.apply #map0([[VAL_13]])
				// CHECK: [[VAL_28:%.*]] = addi [[VAL_27]], [[VAL_1]] : index
				// CHECK: [[VAL_29:%.*]] = load [[VAL_5]]{{\[}}[[VAL_26]], [[VAL_28]]] : memref<?x?xf32>
				// CHECK: store [[VAL_29]], [[VAL_6]]{{\[}}[[VAL_28]], [[VAL_26]]] : memref<?x?xf32>
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }
				// CHECK: }

				// -----

				// tiled 2-d parallel loop mapped to block.y and block.x and thread.y and thread.x.

				func @parallel_loop_tiled(%arg0 : index, %arg1 : index, %arg2 : index,
				%arg3 : index,
				%buf : memref<?x?xf32>,
				%res : memref<?x?xf32>) {
				%zero = constant 0 : index
				%one = constant 1 : index
				%four = constant 4 : index
				loop.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
				step (%four, %four) {
				loop.parallel (%si0, %si1) = (%zero, %zero) to (%four, %four)
				step (%one, %one) {
				%idx0 = addi %i0, %si0 : index
				%idx1 = addi %i1, %si1 : index
				%val = load %buf[%idx0, %idx1] : memref<?x?xf32>
				store %val, %res[%idx1, %idx0] : memref<?x?xf32>
				} { mapping = [
				{processor = 4, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 3, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				} { mapping = [
				{processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 0, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				return
				}

				// CHECK: #map0 = affine_map<(d0) -> (d0)>
				// CHECK: module {

				// CHECK-LABEL: func @parallel_loop_tiled(
				// CHECK-SAME: [[VAL_30:%.*]]: index, [[VAL_31:%.*]]: index, [[VAL_32:%.*]]: index, [[VAL_33:%.*]]: index, [[VAL_34:%.*]]: memref<?x?xf32>, [[VAL_35:%.*]]: memref<?x?xf32>) {
				// CHECK: [[VAL_36:%.*]] = constant 0 : index
				// CHECK: [[VAL_37:%.*]] = constant 1 : index
				// CHECK: [[VAL_38:%.*]] = constant 4 : index
				// CHECK: [[VAL_39:%.*]] = constant 1 : index
				// CHECK: [[VAL_40:%.*]] = subi [[VAL_32]], [[VAL_30]] : index
				// CHECK: [[VAL_41:%.*]] = affine.apply #map0([[VAL_40]])
				// CHECK: [[VAL_42:%.*]] = subi [[VAL_33]], [[VAL_31]] : index
				// CHECK: [[VAL_43:%.*]] = affine.apply #map0([[VAL_42]])
				// CHECK: [[VAL_44:%.*]] = subi [[VAL_38]], [[VAL_36]] : index
				// CHECK: [[VAL_45:%.*]] = affine.apply #map0([[VAL_44]])
				// CHECK: [[VAL_46:%.*]] = subi [[VAL_38]], [[VAL_36]] : index
				// CHECK: [[VAL_47:%.*]] = affine.apply #map0([[VAL_46]])
				// CHECK: gpu.launch blocks([[VAL_48:%.*]], [[VAL_49:%.*]], [[VAL_50:%.*]]) in ([[VAL_51:%.*]] = [[VAL_43]], [[VAL_52:%.*]] = [[VAL_41]], [[VAL_53:%.*]] = [[VAL_39]]) threads([[VAL_54:%.*]], [[VAL_55:%.*]], [[VAL_56:%.*]]) in ([[VAL_57:%.*]] = [[VAL_47]], [[VAL_58:%.*]] = [[VAL_45]], [[VAL_59:%.*]] = [[VAL_39]]) {
				// CHECK: [[VAL_60:%.*]] = affine.apply #map0([[VAL_49]])
				// CHECK: [[VAL_61:%.*]] = addi [[VAL_60]], [[VAL_30]] : index
				// CHECK: [[VAL_62:%.*]] = affine.apply #map0([[VAL_48]])
				// CHECK: [[VAL_63:%.*]] = addi [[VAL_62]], [[VAL_31]] : index
				// CHECK: [[VAL_64:%.*]] = affine.apply #map0([[VAL_55]])
				// CHECK: [[VAL_65:%.*]] = addi [[VAL_64]], [[VAL_36]] : index
				// CHECK: [[VAL_66:%.*]] = affine.apply #map0([[VAL_54]])
				// CHECK: [[VAL_67:%.*]] = addi [[VAL_66]], [[VAL_36]] : index
				// CHECK: [[VAL_68:%.*]] = addi [[VAL_61]], [[VAL_65]] : index
				// CHECK: [[VAL_69:%.*]] = addi [[VAL_63]], [[VAL_67]] : index
				// CHECK: [[VAL_70:%.*]] = load [[VAL_34]]{{\[}}[[VAL_68]], [[VAL_69]]] : memref<?x?xf32>
				// CHECK: store [[VAL_70]], [[VAL_35]]{{\[}}[[VAL_69]], [[VAL_68]]] : memref<?x?xf32>
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }
				// CHECK: }

				// -----

				// 2-d parallel loop mapped to block.y and sequential

				func @parallel_loop_bidy_seq(%arg0 : index, %arg1 : index, %arg2 : index,
				%arg3 : index, %arg4 : index,
				%buf : memref<?x?xf32>,
				%res : memref<?x?xf32>) {
				%step = constant 2 : index
				loop.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
				step (%arg4, %step) {
				%val = load %buf[%i0, %i1] : memref<?x?xf32>
				store %val, %res[%i1, %i0] : memref<?x?xf32>
				} { mapping = [
				{processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 6, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				return
				}

				// CHECK: #map0 = affine_map<(d0) -> (d0)>
				// CHECK: module {

				// CHECK-LABEL: func @parallel_loop_bidy_seq(
				// CHECK-SAME: [[VAL_71:%.*]]: index, [[VAL_72:%.*]]: index, [[VAL_73:%.*]]: index, [[VAL_74:%.*]]: index, [[VAL_75:%.*]]: index, [[VAL_76:%.*]]: memref<?x?xf32>, [[VAL_77:%.*]]: memref<?x?xf32>) {
				// CHECK: [[VAL_78:%.*]] = constant 2 : index
				// CHECK: [[VAL_79:%.*]] = constant 1 : index
				// CHECK: [[VAL_80:%.*]] = subi [[VAL_73]], [[VAL_71]] : index
				// CHECK: [[VAL_81:%.*]] = affine.apply #map0([[VAL_80]])
				// CHECK: gpu.launch blocks([[VAL_82:%.*]], [[VAL_83:%.*]], [[VAL_84:%.*]]) in ([[VAL_85:%.*]] = [[VAL_79]], [[VAL_86:%.*]] = [[VAL_81]], [[VAL_87:%.*]] = [[VAL_79]]) threads([[VAL_88:%.*]], [[VAL_89:%.*]], [[VAL_90:%.*]]) in ([[VAL_91:%.*]] = [[VAL_79]], [[VAL_92:%.*]] = [[VAL_79]], [[VAL_93:%.*]] = [[VAL_79]]) {
				// CHECK: [[VAL_94:%.*]] = affine.apply #map0([[VAL_83]])
				// CHECK: [[VAL_95:%.*]] = addi [[VAL_94]], [[VAL_71]] : index
				// CHECK: loop.for [[VAL_96:%.*]] = [[VAL_72]] to [[VAL_74]] step [[VAL_78]] {
				// CHECK: [[VAL_97:%.*]] = load [[VAL_76]]{{\[}}[[VAL_95]], [[VAL_96]]] : memref<?x?xf32>
				// CHECK: store [[VAL_97]], [[VAL_77]]{{\[}}[[VAL_96]], [[VAL_95]]] : memref<?x?xf32>
				// CHECK: }
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }
				// CHECK: }

				// -----

				// tiled 2-d parallel loop mapped to block.y and seq. and thread.y and seq.

				func @parallel_loop_tiled_seq(%arg0 : index, %arg1 : index, %arg2 : index,
				%arg3 : index,
				%buf : memref<?x?xf32>,
				%res : memref<?x?xf32>) {
				%zero = constant 0 : index
				%one = constant 1 : index
				%four = constant 4 : index
				loop.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
				step (%four, %four) {
				loop.parallel (%si0, %si1) = (%zero, %zero) to (%four, %four)
				step (%one, %one) {
				%idx0 = addi %i0, %si0 : index
				nicolasvasilache (not done): Could we use capture names such as `TIDX`, `BIDY` or something similar that would help "see" the mapping better?
				herhut (author, done): Will do once this has stabilized a bit. For now, the tests are fully auto-generated.
				%idx1 = addi %i1, %si1 : index
				%val = load %buf[%idx0, %idx1] : memref<?x?xf32>
				store %val, %res[%idx1, %idx0] : memref<?x?xf32>
				} { mapping = [
				{processor = 4, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 6, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				} { mapping = [
				{processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 6, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				return
				}

				// CHECK: #map0 = affine_map<(d0) -> (d0)>
				// CHECK: module {

				// CHECK-LABEL: func @parallel_loop_tiled_seq(
				// CHECK-SAME: [[VAL_98:%.*]]: index, [[VAL_99:%.*]]: index, [[VAL_100:%.*]]: index, [[VAL_101:%.*]]: index, [[VAL_102:%.*]]: memref<?x?xf32>, [[VAL_103:%.*]]: memref<?x?xf32>) {
				// CHECK: [[VAL_104:%.*]] = constant 0 : index
				// CHECK: [[VAL_105:%.*]] = constant 1 : index
				// CHECK: [[VAL_106:%.*]] = constant 4 : index
				// CHECK: [[VAL_107:%.*]] = constant 1 : index
				// CHECK: [[VAL_108:%.*]] = subi [[VAL_100]], [[VAL_98]] : index
				// CHECK: [[VAL_109:%.*]] = affine.apply #map0([[VAL_108]])
				// CHECK: [[VAL_110:%.*]] = subi [[VAL_106]], [[VAL_104]] : index
				// CHECK: [[VAL_111:%.*]] = affine.apply #map0([[VAL_110]])
				// CHECK: gpu.launch blocks([[VAL_112:%.*]], [[VAL_113:%.*]], [[VAL_114:%.*]]) in ([[VAL_115:%.*]] = [[VAL_107]], [[VAL_116:%.*]] = [[VAL_109]], [[VAL_117:%.*]] = [[VAL_107]]) threads([[VAL_118:%.*]], [[VAL_119:%.*]], [[VAL_120:%.*]]) in ([[VAL_121:%.*]] = [[VAL_107]], [[VAL_122:%.*]] = [[VAL_111]], [[VAL_123:%.*]] = [[VAL_107]]) {
				// CHECK: [[VAL_124:%.*]] = affine.apply #map0([[VAL_113]])
				// CHECK: [[VAL_125:%.*]] = addi [[VAL_124]], [[VAL_98]] : index
				// CHECK: loop.for [[VAL_126:%.*]] = [[VAL_99]] to [[VAL_101]] step [[VAL_106]] {
				// CHECK: [[VAL_127:%.*]] = affine.apply #map0([[VAL_119]])
				// CHECK: [[VAL_128:%.*]] = addi [[VAL_127]], [[VAL_104]] : index
				// CHECK: loop.for [[VAL_129:%.*]] = [[VAL_104]] to [[VAL_106]] step [[VAL_105]] {
				// CHECK: [[VAL_130:%.*]] = addi [[VAL_125]], [[VAL_128]] : index
				// CHECK: [[VAL_131:%.*]] = addi [[VAL_126]], [[VAL_129]] : index
				// CHECK: [[VAL_132:%.*]] = load [[VAL_102]]{{\[}}[[VAL_130]], [[VAL_131]]] : memref<?x?xf32>
				// CHECK: store [[VAL_132]], [[VAL_103]]{{\[}}[[VAL_131]], [[VAL_130]]] : memref<?x?xf32>
				// CHECK: }
				// CHECK: }
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }
				// CHECK: }

				// -----

				#map0 = affine_map<(d0, d1)[s0, s1] -> (d0 * s1 + s0 + d1)>
				#map1 = affine_map<(d0, d1, d2) -> (d0, d1 - d2)>
				#map2 = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>
				#map3 = affine_map<(d0) -> (d0)>

				module {
				func @sum(%arg0: memref<?x?xf32, #map0>, %arg1: memref<?x?xf32, #map0>, %arg2: memref<?x?xf32, #map0>) {
				%c1 = constant 1 : index
				%c0 = constant 0 : index
				%c3 = constant 3 : index
				%c2 = constant 2 : index
				%0 = dim %arg0, 0 : memref<?x?xf32, #map0>
				%1 = dim %arg0, 1 : memref<?x?xf32, #map0>
				loop.parallel (%arg3, %arg4) = (%c0, %c0) to (%0, %1) step (%c2, %c3) {
				%2 = dim %arg0, 0 : memref<?x?xf32, #map0>
				%3 = affine.min #map1(%c2, %2, %arg3)
				%4 = dim %arg0, 1 : memref<?x?xf32, #map0>
				%5 = affine.min #map1(%c3, %4, %arg4)
				%6 = std.subview %arg0[%arg3, %arg4][%3, %5][%c1, %c1] : memref<?x?xf32, #map0> to memref<?x?xf32, #map2>
				%7 = dim %arg1, 0 : memref<?x?xf32, #map0>
				%8 = affine.min #map1(%c2, %7, %arg3)
				%9 = dim %arg1, 1 : memref<?x?xf32, #map0>
				%10 = affine.min #map1(%c3, %9, %arg4)
				%11 = std.subview %arg1[%arg3, %arg4][%8, %10][%c1, %c1] : memref<?x?xf32, #map0> to memref<?x?xf32, #map2>
				%12 = dim %arg2, 0 : memref<?x?xf32, #map0>
				%13 = affine.min #map1(%c2, %12, %arg3)
				%14 = dim %arg2, 1 : memref<?x?xf32, #map0>
				%15 = affine.min #map1(%c3, %14, %arg4)
				%16 = std.subview %arg2[%arg3, %arg4][%13, %15][%c1, %c1] : memref<?x?xf32, #map0> to memref<?x?xf32, #map2>
				loop.parallel (%arg5, %arg6) = (%c0, %c0) to (%3, %5) step (%c1, %c1) {
				%17 = load %6[%arg5, %arg6] : memref<?x?xf32, #map2>
				%18 = load %11[%arg5, %arg6] : memref<?x?xf32, #map2>
				%19 = load %16[%arg5, %arg6] : memref<?x?xf32, #map2>
				%20 = addf %17, %18 : f32
				store %20, %16[%arg5, %arg6] : memref<?x?xf32, #map2>
				"loop.terminator"() : () -> ()
				} { mapping = [
				{processor = 3, map = #map3, bound = #map3},
				{processor = 4, map = #map3, bound = #map3}
				] }
				"loop.terminator"() : () -> ()
				} { mapping = [
				{processor = 0, map = #map3, bound = #map3},
				{processor = 1, map = #map3, bound = #map3}
				] }
				return
				}
				}

				// CHECK: #map0 = affine_map<(d0, d1)[s0, s1] -> (d0 * s1 + s0 + d1)>
				// CHECK: #map1 = affine_map<(d0) -> (d0)>
				// CHECK: #map2 = affine_map<(d0, d1, d2) -> (d0, d1 - d2)>
				// CHECK: #map3 = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>
				// CHECK: module {

				// CHECK-LABEL: func @sum(
				// CHECK-SAME: [[VAL_133:%.*]]: memref<?x?xf32, #map0>, [[VAL_134:%.*]]: memref<?x?xf32, #map0>, [[VAL_135:%.*]]: memref<?x?xf32, #map0>) {
				// CHECK: [[VAL_136:%.*]] = constant 1 : index
				// CHECK: [[VAL_137:%.*]] = constant 0 : index
				// CHECK: [[VAL_138:%.*]] = constant 3 : index
				// CHECK: [[VAL_139:%.*]] = constant 2 : index
				// CHECK: [[VAL_140:%.*]] = dim [[VAL_133]], 0 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_141:%.*]] = dim [[VAL_133]], 1 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_142:%.*]] = constant 1 : index
				// CHECK: [[VAL_143:%.*]] = subi [[VAL_140]], [[VAL_137]] : index
				// CHECK: [[VAL_144:%.*]] = affine.apply #map1([[VAL_143]])
				// CHECK: [[VAL_145:%.*]] = subi [[VAL_141]], [[VAL_137]] : index
				// CHECK: [[VAL_146:%.*]] = affine.apply #map1([[VAL_145]])
				// CHECK: [[VAL_148:%.*]] = subi [[VAL_139]], [[VAL_137]] : index
				// CHECK: [[VAL_149:%.*]] = affine.apply #map1([[VAL_148]])
				// CHECK: [[VAL_151:%.*]] = subi [[VAL_138]], [[VAL_137]] : index
				// CHECK: [[VAL_152:%.*]] = affine.apply #map1([[VAL_151]])
				// CHECK: gpu.launch blocks([[VAL_153:%.*]], [[VAL_154:%.*]], [[VAL_155:%.*]]) in ([[VAL_156:%.*]] = [[VAL_144]], [[VAL_157:%.*]] = [[VAL_146]], [[VAL_158:%.*]] = [[VAL_142]]) threads([[VAL_159:%.*]], [[VAL_160:%.*]], [[VAL_161:%.*]]) in ([[VAL_162:%.*]] = [[VAL_149]], [[VAL_163:%.*]] = [[VAL_152]], [[VAL_164:%.*]] = [[VAL_142]]) {
				// CHECK: [[VAL_165:%.*]] = affine.apply #map1([[VAL_153]])
				// CHECK: [[VAL_166:%.*]] = addi [[VAL_165]], [[VAL_137]] : index
				// CHECK: [[VAL_167:%.*]] = affine.apply #map1([[VAL_154]])
				// CHECK: [[VAL_168:%.*]] = addi [[VAL_167]], [[VAL_137]] : index
				// CHECK: [[VAL_169:%.*]] = dim [[VAL_133]], 0 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_170:%.*]] = affine.min #map2([[VAL_139]], [[VAL_169]], [[VAL_166]])
				// CHECK: [[VAL_171:%.*]] = dim [[VAL_133]], 1 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_172:%.*]] = affine.min #map2([[VAL_138]], [[VAL_171]], [[VAL_168]])
				// CHECK: [[VAL_173:%.*]] = std.subview [[VAL_133]]{{\[}}[[VAL_166]], [[VAL_168]]]{{\[}}[[VAL_170]], [[VAL_172]]]{{\[}}[[VAL_136]], [[VAL_136]]] : memref<?x?xf32, #map0> to memref<?x?xf32, #map3>
				// CHECK: [[VAL_174:%.*]] = dim [[VAL_134]], 0 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_175:%.*]] = affine.min #map2([[VAL_139]], [[VAL_174]], [[VAL_166]])
				// CHECK: [[VAL_176:%.*]] = dim [[VAL_134]], 1 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_177:%.*]] = affine.min #map2([[VAL_138]], [[VAL_176]], [[VAL_168]])
				// CHECK: [[VAL_178:%.*]] = std.subview [[VAL_134]]{{\[}}[[VAL_166]], [[VAL_168]]]{{\[}}[[VAL_175]], [[VAL_177]]]{{\[}}[[VAL_136]], [[VAL_136]]] : memref<?x?xf32, #map0> to memref<?x?xf32, #map3>
				// CHECK: [[VAL_179:%.*]] = dim [[VAL_135]], 0 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_180:%.*]] = affine.min #map2([[VAL_139]], [[VAL_179]], [[VAL_166]])
				// CHECK: [[VAL_181:%.*]] = dim [[VAL_135]], 1 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_182:%.*]] = affine.min #map2([[VAL_138]], [[VAL_181]], [[VAL_168]])
				// CHECK: [[VAL_183:%.*]] = std.subview [[VAL_135]]{{\[}}[[VAL_166]], [[VAL_168]]]{{\[}}[[VAL_180]], [[VAL_182]]]{{\[}}[[VAL_136]], [[VAL_136]]] : memref<?x?xf32, #map0> to memref<?x?xf32, #map3>
				// CHECK: [[VAL_184:%.*]] = affine.apply #map1([[VAL_159]])
				// CHECK: [[VAL_185:%.*]] = addi [[VAL_184]], [[VAL_137]] : index
				// CHECK: [[VAL_186:%.*]] = cmpi "slt", [[VAL_185]], [[VAL_170]] : index
				// CHECK: loop.if [[VAL_186]] {
				// CHECK: [[VAL_187:%.*]] = affine.apply #map1([[VAL_160]])
				// CHECK: [[VAL_188:%.*]] = addi [[VAL_187]], [[VAL_137]] : index
				// CHECK: [[VAL_189:%.*]] = cmpi "slt", [[VAL_188]], [[VAL_172]] : index
				// CHECK: loop.if [[VAL_189]] {
				// CHECK: [[VAL_190:%.*]] = load [[VAL_173]]{{\[}}[[VAL_185]], [[VAL_188]]] : memref<?x?xf32, #map3>
				// CHECK: [[VAL_191:%.*]] = load [[VAL_178]]{{\[}}[[VAL_185]], [[VAL_188]]] : memref<?x?xf32, #map3>
				// CHECK: [[VAL_192:%.*]] = load [[VAL_183]]{{\[}}[[VAL_185]], [[VAL_188]]] : memref<?x?xf32, #map3>
				// CHECK: [[VAL_193:%.*]] = addf [[VAL_190]], [[VAL_191]] : f32
				// CHECK: store [[VAL_193]], [[VAL_183]]{{\[}}[[VAL_185]], [[VAL_188]]] : memref<?x?xf32, #map3>
				// CHECK: }
				// CHECK: }
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }
				// CHECK: }
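In the `@sum` test above, the inner loop bounds come from `affine.min` over `(d0, d1 - d2)` with `d0` bound to a constant (2 or 3), and the expected thread launch sizes are computed from those constants while the body is guarded by `loop.if`. One way to read this, as an assumption about what `deriveStaticUpperBound` does (its body is not shown in this section), is that any constant operand of a min is a safe upper bound on the result. A minimal C++17 sketch with hypothetical names `MinOperand` and `staticUpperBound`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// One result expression of a min: either a known constant or dynamic.
using MinOperand = std::optional<int64_t>;

// min(e0, e1, ...) <= every ei, so the smallest constant among the operands
// is a safe static upper bound. Returns nullopt if all operands are dynamic,
// in which case no launch bound can be derived this way.
std::optional<int64_t>
staticUpperBound(const std::vector<MinOperand> &operands) {
  std::optional<int64_t> bound;
  for (const MinOperand &operand : operands)
    if (operand)
      bound = bound ? std::min(*bound, *operand) : *operand;
  return bound;
}
```

For `min(2, dim - iv)` the sketch returns 2, which matches the thread size `2 - 0` expected by the test. The real upper bound may be smaller at the edge of the memref, which is why the generated code keeps the `cmpi "slt"` guard.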

Revision Contents

Diff 244433
