This is an archive of the discontinued LLVM Phabricator instance.

Differential D73893

[MLIR][GPU] Implement initial mapping from loop.parallel to gpu.launch.
Closed, Public

Authored by herhut on Feb 3 2020, 7:24 AM.

Details

Reviewers

nicolasvasilache
ftynse
mravishankar

Commits

rG715783d415fe: [MLIR][GPU] Implement initial mapping from loop.parallel to gpu.launch.

Summary

To unblock other work, this implements basic lowering based on mapping
attributes that have to be provided on all loop.parallel. The lowering
does not yet support reduce.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

herhut created this revision. Feb 3 2020, 7:24 AM

Herald added a reviewer: nicolasvasilache. Feb 3 2020, 7:24 AM

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, Joonsoo, liufengdb and 12 others.

I started implementing the lowering of parallel loops to gpu launch code. This code uses mapping annotations in the form of attributes. Currently, one can only map to a single block/thread id. Also, upper bounds for the launch can be defined based on the number of iterations of the mapped iteration of the parallel loop. This is probably not what we want in the long run but a good starting point to inform the design we take.
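As a rough illustration of the scheme described above, the following plain C++ sketch (not MLIR code) mirrors the arithmetic the diff performs for a mapped loop dimension: the loop index is the mapped hardware id plus the lower bound, and the launch size is the bound map applied to the iteration count. The names `DimMapping`, `loopIndex`, `launchBound` and `identityMapping` are hypothetical, and the loop step is ignored here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical stand-in for one entry of the mapping attribute: which hardware
// id slot a loop dimension uses, how the id becomes a 0-based loop index, and
// how the iteration count becomes a launch size.
struct DimMapping {
  unsigned processor;                     // hardware id slot
  std::function<int64_t(int64_t)> map;    // hardware id -> 0-based index
  std::function<int64_t(int64_t)> bound;  // iteration count -> launch size
};

// Loop index seen by the loop body. The maps are 0-based but the loop might
// not be, so the lower bound is added afterwards.
int64_t loopIndex(const DimMapping &m, int64_t hardwareId, int64_t lowerBound) {
  return m.map(hardwareId) + lowerBound;
}

// Launch size for the mapped dimension, derived from the number of
// iterations (upperBound - lowerBound). The step is not modelled.
int64_t launchBound(const DimMapping &m, int64_t lowerBound,
                    int64_t upperBound) {
  assert(upperBound >= lowerBound && "inverted iteration range");
  return m.bound(upperBound - lowerBound);
}

// Simplest possible mapping: both maps are the identity.
DimMapping identityMapping(unsigned processor) {
  auto identity = [](int64_t v) { return v; };
  return DimMapping{processor, identity, identity};
}
```

With an identity mapping, a loop from 10 to 42 would launch 32 hardware ids, and hardware id 3 would execute loop index 13.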

Feedback highly appreciated.

Unit tests: unknown.

clang-tidy: fail. clang-tidy found 2 errors and 2 warnings. 0 of them are added as review comments below.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt

Harbormaster failed remote builds in B45590: Diff 242069! Feb 3 2020, 7:43 AM

Thanks Stephan for pushing on this concretely!
I am very interested in this representation and I would welcome a (future) refactoring that will make this as reusable as possible so we can also use it directly in Linalg.
As you may or may not know, other people internally are using a very similar representation in a very successful fashion ;) ;)

Won't have time to dig into details but a few comments:

1. Made a cursory glance, it's great to start with this and refactor as we go when others need this too.
2. One of the problems you solve here is to go from type (the affine maps for mapping) to values (tidx, bidy, etc.). This is a recurrent transition that would be great to try and make retargetable (e.g. OpenMP or other runtimes), I would love to see this separated from GPU. This would also force separating concerns a bit more re outlining of kernel vs mapping of loops.
3. Along the lines of point 2., I would have hoped to see mapLoopToProcessorIds be templatized and retargeted to this use case, is this possible as a followup?
4. Depending on how much it is possible to go towards 3. or not, I'll comment that the Builder API is really underwhelming. I understand the context behind not using EDSC atm, I'll post something on discourse to reopen that discussion (was too swamped until now).

In any case, great stuff and thanks!

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
165	Could we use capture names such as `TIDX`, `BIDY` or something similar that would help "see" the mapping better?

I suppose you want high-level feedback on this, so I'm not nitpicking in the code.

I think this goes into the right direction, but you are right that this should not be the final design. Using pattern rewrites looks like the right thing to do. I would consider having a pattern and/or a utility function that a future driver can call and have a parallel loop mapped to GPU. We can have test passes where the application is controlled by attributes, but we can also have more automated approaches where we would, e.g., greedily map the outermost parallel loop to GPUs. My thinking is that we should eventually have a tree-like loop structure on which we can decide where to map to blocks/threads/do promotions. That being said, I am not 100% sure we actually want to materialize the decisions as attributes on the operations, or at least as attributes that operations know about.

One thing to keep in mind, nested parallel loops may be tricky, in particular you may need synchronizations inside the outermost parallel loop and make sure they are called by all threads in a "convergent" fashion. For the first take, I'd map one parallel loop construct or perfectly nested constructs and extend from there.

mlir/include/mlir/Dialect/LoopOps/LoopOps.td
16 ↗	(On Diff #242069)	It's a bit reversed to have loops depend on affine loops. Is this necessary for AffineMapAttr?

In D73893#1854949, @nicolasvasilache wrote:

one of the problems you solve here is to go from type (the affine maps for mapping) to values (tidx, bidy, etc.). This is a recurrent transition that would be great to try and make retargetable (e.g. OpenMP or other runtimes), I would love to see this separated from GPU. This would also force separating concerns a bit more re outlining of kernel vs mapping of loops.

I use attributes here instead of baking the mapping into the lowering part to enable reuse. My idea was that we can have different implementations/strategies for mapping that inform a single code generation. I'd be happy to get to a shared "mapping language" that can be reused but it has to start somewhere.

Along the lines of point 2., I would have hoped to see mapLoopToProcessorIds be templatized and retargeted to this use case, is this possible as a followup?

There is no code at all yet to produce the mapping attributes. But mapLoopToProcessorIds could indeed be one producer.

Remove attribute from td file and minor cleanup.

In D73893#1855737, @ftynse wrote:

I suppose you want high-level feedback on this, so I'm not nitpicking in the code.

Indeed. Thanks for the feedback.

I think this goes into the right direction, but you are right that this should not be the final design. Using pattern rewrites looks like the right thing to do. I would consider having a pattern and/or a utility function that a future driver can call and have a parallel loop mapped to GPU.

I have not written code yet to produce the mapping attributes but a greedy mapper will probably be the first thing I'll do.

We can have test passes where the application is controlled by attributes, but we can also have more automated approaches where we would, e.g., greedily map the outermost parallel loop to GPUs. My thinking is that we should eventually have a tree-like loop structure on which we can decide where to map to blocks/threads/do promotions.

I agree. To do any meaningful mapping one needs to see the whole tree. Also, the heuristic that does the tiling might already know how it wants things to be mapped. Likewise, if you have explicit loads into shared memory, for example, you would know that this has to be mapped to thread groups (like warps). So maybe one wants to specify the mapping at that point already, which is why I thought attributes are a good way to model this.

That being said, I am not 100% sure we actually want to materialize the decisions as attributes on the operations, or at least as attributes that operations know about.

I am strongly of the opinion that the mapping should be attributes so that the producer and consumer of the mapping decisions are decoupled. I agree that they should be invisible to the op and I just added them to the .td file temporarily for documentation. The attribute is already optional.

One thing to keep in mind, nested parallel loops may be tricky, in particular you may need synchronizations inside the outermost parallel loop and make sure they are called by all threads in a "convergent" fashion. For the first take, I'd map one parallel loop construct or perfectly nested constructs and extend from there.

There is a TODO in there noting that it only supports nesting if the instructions up to the innermost nest are side-effect free. Otherwise we will need predication to have only a single hardware thread materialize the side effects and, as you state, a barrier if other iterations of nested loops depend on that code. Let's start with the simple case, though.

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
165	Will do once this has stabilized a bit. For now, the tests are fully auto-generated.

Unit tests: pass. 62417 tests passed, 0 failed and 839 were skipped.

clang-tidy: fail. clang-tidy found 0 errors and 2 warnings. 0 of them are added as review comments below.

clang-format: fail. Please format your changes with clang-format by running git-clang-format HEAD^ or applying this patch.

Build artifacts: diff.json, clang-tidy.txt, clang-format.patch, CMakeCache.txt, console-log.txt, test-results.xml

Harbormaster failed remote builds in B45672: Diff 242294! Feb 4 2020, 5:19 AM

Also support linalg tiled loops

Herald added a reviewer: mravishankar. Feb 6 2020, 10:20 AM

Correct tests and some minor cleanup.

ftynse added inline comments. Feb 7 2020, 5:30 AM

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
503	I would suggest creating a struct with named fields here for better readability. And document the function plz
505	.cast here and below, otherwise you may be accessing a null pointer
507	Nit: can we factor out these names into constants?
517	Nit: mlir uses camelBack names
521	Should we use string attributes instead? E.g. having `mapping = ["thread-0", "thread-1", "thread-2", "block-0", "block-1", "block-2", "seq"]` ?
538	`dyn_cast_or_null` since `val` is not guaranteed to have a defining op
582	We should have patterns to fold constants into affine maps (in canonicalization), e.g. `%0 = constant 42 : index`, `%1 = affine.min affine_map<(d0,d1)->(d0,d1)>(%0, %arg0)` should fold into `%0 = affine.min affine_map<(d0)->(d0,42)>(%0)`. LMK if it's not the case.

I'm missing some higher-level documentation on what each function is supposed to do.

rriddle added inline comments. Feb 7 2020, 10:21 AM

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
503	Generally only classes should be within anonymous namespaces. Static functions should be in global scope and marked 'static'.
552	This function is quite large. Can you split it up a bit?

(Sorry for jumping on this late)

This is now adding three different lowerings of loops to GPU:

1. From perfectly nested loops to GPU
2. From imperfectly nested loops to GPU
3. Parallel loops to GPU

I confess I contributed to one of them (2). We should probably have a plan to consolidate these. This is too much duplication of functionality. We should probably deprecate (2) in favor of this change, and adapt all uses accordingly.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
503	Add some comments please.
504	If this is really a dictionary_attribute, this lookup might be simplified using a StructAttr
516	Please add some comments about this method, preconditions, etc.
522	I am just coming up to speed on the implementation of loop.parallel, but in the ODS I don't see any attribute for "mapping". When is this added?
524	nit: Please remove this line. Didn't make the immediate leap to reductions and parallelOp.getNumResults() != 0
671	This comment seems outdated now.
703	It would be useful to also expose a method to populate this pattern (and also other patterns that might be needed in the future) in a method like `populateParallelLoopToGPUPatterns` . Then a pass can just import these patterns without having to run a separate pass to do this lowering.

This revision now requires changes to proceed. Feb 8 2020, 10:52 AM

Comments, comments, comments :-)

Harbormaster completed remote builds in B46215: Diff 243835. Feb 11 2020, 5:53 AM

In D73893#1865719, @mravishankar wrote:

(Sorry for jumping on this late)

This is now adding three different lowerings of loops to GPU:

1. From perfectly nested loops to GPU
2. From imperfectly nested loops to GPU
3. Parallel loops to GPU

I confess I contributed to one of them (2). We should probably have a plan to consolidate these. This is too much duplication of functionality. We should probably deprecate (2) in favor of this change, and adapt all uses accordingly.

All but 3 should probably go away unless someone invests into the dependency analysis to make 1 and 2 safe.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
504	I had a struct attribute before but I did not want to specify the attribute on the `loop.parallel` operation, as it is GPU dialect specific. So I went with an unspecified optional attribute with custom accessor.
521	I wanted to stay with processor ids for now, as that is also what @nicolasvasilache was using.
522	This is added by producers of mappings that want to lower to GPU. I have a `GreedyMapper` (I will send that out shortly) that essentially implements what the existing helper functions did.
552	A lot of it is comments but I have moved some things out.
582	So you are saying that by running with -canonicalize I should see the affine.min change? I could not reproduce this.
671	No, it is unfortunately still true. If there are nested `loop.parallel` the code up to them has to be side-effect free, as we are essentially replicating it per thread. I will add some checking in a later version.
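
For readers following the line 521 thread: the code comment in the diff gives the processor id convention as 0-2 for block x/y/z, 3-5 for thread x/y/z, and 6 for sequential. A minimal plain C++ sketch of that convention, with string names in the spirit of the reviewer's suggestion, could look as follows. `ProcessorKind`, `Processor`, `decodeProcessor` and `processorName` are hypothetical helpers, not part of the patch.

```cpp
#include <cassert>
#include <string>

// Hypothetical decoding of the integer processor ids used by the mapping
// attribute: 0-2 -> block x/y/z, 3-5 -> thread x/y/z, 6 -> sequential.
enum class ProcessorKind { Block, Thread, Sequential };

struct Processor {
  ProcessorKind kind;
  unsigned dim; // 0 = x, 1 = y, 2 = z; unused for Sequential
};

Processor decodeProcessor(unsigned id) {
  assert(id <= 6 && "ids above 6 are not covered by the convention");
  if (id < 3)
    return {ProcessorKind::Block, id};
  if (id < 6)
    return {ProcessorKind::Thread, id - 3};
  return {ProcessorKind::Sequential, 0};
}

// String form along the lines of "thread-0", "block-1", "seq".
std::string processorName(unsigned id) {
  Processor p = decodeProcessor(id);
  switch (p.kind) {
  case ProcessorKind::Block:
    return "block-" + std::to_string(p.dim);
  case ProcessorKind::Thread:
    return "thread-" + std::to_string(p.dim);
  case ProcessorKind::Sequential:
    break;
  }
  return "seq";
}
```

Either encoding carries the same information; the integer form is what the diff reads via the "processor" entry of the dictionary attribute.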

Thanks Stephan! I see overall where this is headed. This is nice to have indeed.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
504	You can add it as an attribute to GPU dialect. There is no harm in adding this attribute "while" lowering to GPU dialect, or a pre-pass before lowering to GPU dialect right?

Looks okay for the first take at it.

In general, I wouldn't bother with sequential loops here and implement a separate rewrite that we could run before this that splits the parallel loop into two nested parallel ops, and then use parallel-to-sequential transformation on the inner op.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
576	Nit: why not `nullptr` as a sentinel?
611	Readability nit: can we assign std::get<0>, etc. to variables with more explicative names? The body of this loop is 3 screens long (consider refactoring in a follow-up :)) and it's hard to keep track of what was zip'ed.
627	Nit: you can use OpRewriter::InsertionGuard with a C++ scope
701	For a future revision, I'd consider having a transformation function that is controlled by actual arguments rather than by an attached attribute, and have the pattern read the attribute and call that function. This way, we can call the transformation programmatically without modifying the IR.
748	Can we just walk recursively and check for the no side effects trait, returning matchFailure if there are side effects?
750	The comment above says "it is only correct if there either is no further loop.parallel", which I cannot match with this code.
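
The insertion-guard suggestion on line 627 refers to the usual RAII idiom: save the builder's insertion point in a constructor and restore it in the destructor, so that leaving a C++ scope undoes any temporary repositioning. The sketch below illustrates the idiom with toy types only. `ToyBuilder`, `ToyInsertionGuard` and `emitBeforeLaunch` are hypothetical and do not reproduce the MLIR class the reviewer mentions.

```cpp
#include <cassert>

// Toy builder: the "insertion point" is just an integer position.
struct ToyBuilder {
  int insertionPoint = 0;
};

// RAII guard: remembers the insertion point on construction and restores it
// on destruction, replacing a manual save/restore pair.
class ToyInsertionGuard {
public:
  explicit ToyInsertionGuard(ToyBuilder &b)
      : builder(b), saved(b.insertionPoint) {}
  ~ToyInsertionGuard() { builder.insertionPoint = saved; }

  // Copying a guard would restore the insertion point twice.
  ToyInsertionGuard(const ToyInsertionGuard &) = delete;
  ToyInsertionGuard &operator=(const ToyInsertionGuard &) = delete;

private:
  ToyBuilder &builder;
  int saved;
};

// Temporarily moves the insertion point (e.g. in front of the launch op) and
// returns the position at which new operations would have been created.
int emitBeforeLaunch(ToyBuilder &builder, int launchPosition) {
  ToyInsertionGuard guard(builder);
  builder.insertionPoint = launchPosition;
  return builder.insertionPoint;
} // guard's destructor restores the previous insertion point here
```

Compared with an explicit save and restore, the guard also restores the insertion point on early returns such as the `return failure();` paths in the diff.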

This revision is now accepted and ready to land. Feb 12 2020, 12:12 PM

sri added a subscriber: sri. Feb 12 2020, 8:08 PM

Some more comments addressed.

herhut marked 3 inline comments as done. Feb 13 2020, 6:08 AM

herhut added inline comments.

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
576	I dislike nullptr as sentinel because it too easily sneaks in and then produces hard to debug errors. The `gpu.launch` on the other hand definitely won't. I would have created an explicit sentinel but that seemed a bit overkill.
748	I added some testing code that is conservative but ensures correctness.
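
The sentinel discussed in the line 576 thread marks where a nesting level ends in the flat worklist: it is pushed before the body operations, which are pushed in reverse, so popping it means the whole body has been processed. The plain C++ sketch below shows the technique on a toy operation tree. `ToyOp` and `flatten` are hypothetical, and unlike the patch, which reuses the `gpu.launch` operation as its sentinel, this sketch uses a dedicated sentinel object.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy operation: a node with a non-empty body stands for a nested loop,
// a node with an empty body for an ordinary operation.
struct ToyOp {
  std::string name;
  std::vector<ToyOp> body;
};

// Walks the tree iteratively and prints it as "loop{op;op;}". A sentinel in
// the worklist signals that one nesting level is finished.
std::string flatten(const ToyOp &root) {
  static const ToyOp sentinel{"<sentinel>", {}};
  std::vector<const ToyOp *> worklist;
  std::string out;

  // Opens a nesting level: push the sentinel first, then the body in reverse
  // order so that popping from the back yields program order.
  auto enter = [&](const ToyOp &loop) {
    out += loop.name + "{";
    worklist.push_back(&sentinel);
    for (auto it = loop.body.rbegin(); it != loop.body.rend(); ++it)
      worklist.push_back(&*it);
  };

  enter(root);
  while (!worklist.empty()) {
    const ToyOp *op = worklist.back();
    worklist.pop_back();
    if (op == &sentinel)
      out += "}"; // finished one nesting level, pop back up
    else if (!op->body.empty())
      enter(*op); // nested loop: open a new level
    else
      out += op->name + ";"; // ordinary operation: "clone" it
  }
  return out;
}
```

A dedicated sentinel, like the launch operation in the patch, can never be confused with a real body operation, which is the property the author asks for when rejecting `nullptr`.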

Harbormaster failed remote builds in B46408: Diff 244413! Feb 13 2020, 6:26 AM

Closed by commit rG715783d415fe: [MLIR][GPU] Implement initial mapping from loop.parallel to gpu.launch. (authored by herhut). Feb 13 2020, 7:57 AM

This revision was automatically updated to reflect the committed changes.

herhut marked an inline comment as done.

rriddle added inline comments. Feb 13 2020, 11:25 AM

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
751	Remove this debugging.

Revision Contents

Path	Size
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp	214 lines
mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir	296 lines

Diff 242937

mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp

[14 lines not shown]
#include "mlir/Conversion/LoopsToGPU/LoopsToGPU.h"		#include "mlir/Conversion/LoopsToGPU/LoopsToGPU.h"

#include "mlir/Conversion/AffineToStandard/AffineToStandard.h"		#include "mlir/Conversion/AffineToStandard/AffineToStandard.h"
#include "mlir/Dialect/AffineOps/AffineOps.h"		#include "mlir/Dialect/AffineOps/AffineOps.h"
#include "mlir/Dialect/GPU/GPUDialect.h"		#include "mlir/Dialect/GPU/GPUDialect.h"
#include "mlir/Dialect/LoopOps/LoopOps.h"		#include "mlir/Dialect/LoopOps/LoopOps.h"
#include "mlir/Dialect/StandardOps/Ops.h"		#include "mlir/Dialect/StandardOps/Ops.h"
#include "mlir/IR/AffineExpr.h"		#include "mlir/IR/AffineExpr.h"
		#include "mlir/IR/BlockAndValueMapping.h"
#include "mlir/IR/Builders.h"		#include "mlir/IR/Builders.h"
		#include "mlir/Pass/Pass.h"
		#include "mlir/Transforms/DialectConversion.h"
#include "mlir/Transforms/LoopUtils.h"		#include "mlir/Transforms/LoopUtils.h"
		#include "mlir/Transforms/Passes.h"
#include "mlir/Transforms/RegionUtils.h"		#include "mlir/Transforms/RegionUtils.h"
#include "llvm/ADT/Sequence.h"		#include "llvm/ADT/Sequence.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"

#define DEBUG_TYPE "loops-to-gpu"		#define DEBUG_TYPE "loops-to-gpu"

using namespace mlir;		using namespace mlir;
using namespace mlir::loop;		using namespace mlir::loop;
[449 lines not shown]	LogicalResult mlir::convertLoopNestToGPULaunch(ForOp forOp,
return ::convertLoopNestToGPULaunch(forOp, numBlockDims, numThreadDims);		return ::convertLoopNestToGPULaunch(forOp, numBlockDims, numThreadDims);
}		}

LogicalResult mlir::convertLoopToGPULaunch(loop::ForOp forOp,		LogicalResult mlir::convertLoopToGPULaunch(loop::ForOp forOp,
ArrayRef<Value> numWorkGroups,		ArrayRef<Value> numWorkGroups,
ArrayRef<Value> workGroupSizes) {		ArrayRef<Value> workGroupSizes) {
return ::convertLoopToGPULaunch(forOp, numWorkGroups, workGroupSizes);		return ::convertLoopToGPULaunch(forOp, numWorkGroups, workGroupSizes);
}		}

		namespace {
		struct ParallelToGpuLaunchLowering : public OpRewritePattern<ParallelOp> {
		using OpRewritePattern<ParallelOp>::OpRewritePattern;

		PatternMatchResult matchAndRewrite(ParallelOp parallelOp,
		PatternRewriter &rewriter) const override;
		};

		std::tuple<unsigned, AffineMap, AffineMap>
		extractMapAndOperand(Attribute attribute) {
		DictionaryAttr dict = attribute.dyn_cast<DictionaryAttr>();
		unsigned processor = dict.get("processor").dyn_cast<IntegerAttr>().getValue().getSExtValue();
		AffineMap map = dict.get("map").dyn_cast<AffineMapAttr>().getValue();
		AffineMapAttr boundAttr = dict.get("bound").dyn_cast_or_null<AffineMapAttr>();
		AffineMap bound;
		if (boundAttr) bound = boundAttr.getValue();
		return {processor, map, bound};
		}

		LogicalResult processParallelLoop(ParallelOp parallelOp, gpu::LaunchOp launchOp,
		BlockAndValueMapping &cloning_map,
		SmallVectorImpl<Operation *> &worklist,
		PatternRewriter &rewriter) {
		// TODO(herhut): Verify that this is a valid GPU mapping.
		// processor ids: 0-2 block [x/y/z], 3-5 -> thread [x/y/z], 6-> sequential
		ArrayAttr mapping = parallelOp.getAttrOfType<ArrayAttr>("mapping");
		// TODO(herhut): Support reductions.
		if (!mapping || parallelOp.getNumResults() != 0)
		return failure();

		Location loc = parallelOp.getLoc();

		for (auto config : llvm::zip(mapping, parallelOp.getInductionVars(), parallelOp.lowerBound(), parallelOp.upperBound(), parallelOp.step())) {
		unsigned processor;
		AffineMap map;
		AffineMap bound;
		std::tie(processor, map, bound) = extractMapAndOperand(std::get<0>(config));
		Value newIndex;

		if (processor < gpu::LaunchOp::kNumConfigOperands) {
		// Use the corresponding thread/grid index as replacement for the loop iv.
		// TODO(herhut): Make the iv calculation depend on lower & upper bound.
		Value operand = launchOp.body().front().getArgument(processor);
		Value appliedMap = rewriter.create<AffineApplyOp>(loc, map, operand);
		// Add the lower bound, as the maps are 0 based but the loop might not be.
		// TODO(herhut): Maybe move this explicitly into the maps?
		newIndex = rewriter.create<AddIOp>(
		loc, appliedMap, cloning_map.lookupOrDefault(std::get<2>(config)));
		// If there was also a bound, insert that, too.
		// TODO(herhut): Check that we do not assign bounds twice.
		if (bound) {
		auto save = rewriter.saveInsertionPoint();
		rewriter.setInsertionPoint(launchOp);
		// We pass as the single operand to the bound-map the number of
		// iterations, which is upperBound - lowerBound. To support inner loops
		// with dynamic upper bounds (as generated by e.g. tiling), try to
		// derive a max for the bounds. If the used bound for the hardware id is
		// imprecise, wrap the contained code into a conditional.
		Value lowerBound = std::get<2>(config);
		Value upperBound = std::get<3>(config);
		if (!lowerBound.getParentRegion()->isAncestor(
		launchOp.getParentRegion()) &&
		!isa<ConstantOp>(lowerBound.getDefiningOp()))
		return failure();
		// If the upper-bound is constant or defined before the launch, we can
		// use it in the launch bounds directly.
		if (!upperBound.getParentRegion()->isAncestor(
		launchOp.getParentRegion()) &&
		!isa<ConstantOp>(upperBound.getDefiningOp())) {
		// TODO(herhut): Is there a helper in affine for this?
		if (mlir::AffineMinOp minOp =
		dyn_cast_or_null<AffineMinOp>(upperBound.getDefiningOp())) {
		auto map = minOp.map();
		auto operands = minOp.operands();
		upperBound = {};
		for (int sub = 0, e = map.getNumResults(); sub < e; ++sub) {
		mlir::AffineExpr expr = map.getResult(sub);
		if (AffineDimExpr dimExpr = expr.dyn_cast<AffineDimExpr>()) {
		auto dimOperand = operands[dimExpr.getPosition()];
		auto defOp = dimOperand.getDefiningOp();
		if (mlir::ConstantOp constOp =
		dyn_cast_or_null<ConstantOp>(defOp)) {
		upperBound = rewriter.create<ConstantOp>(constOp.getLoc(),
		constOp.getValue());
		break;
		}
		}
		}
		if (!upperBound)
		return failure();
		}
		}
		Value iterations = rewriter.create<SubIOp>(
		loc, cloning_map.lookupOrDefault(upperBound),
		cloning_map.lookupOrDefault(lowerBound));
		Value newBound = rewriter.create<AffineApplyOp>(loc, bound, iterations);
		launchOp.setOperand(processor, newBound);
		rewriter.restoreInsertionPoint(save);
		if (upperBound != std::get<3>(config)) {
		// We are using an approximation, create a surrounding conditional.
		CmpIOp pred = rewriter.create<CmpIOp>(
		loc, CmpIPredicate::slt, newIndex,
		cloning_map.lookupOrDefault(std::get<3>(config)));
		loop::IfOp ifOp = rewriter.create<loop::IfOp>(loc, pred, false);
		rewriter.setInsertionPointToStart(&ifOp.thenRegion().front());
		// Put a sentinel into the worklist so we know when to pop out of the
		// if body again. We use the launchOp here, as that cannot be part of
		// the body's instructions.
		worklist.push_back(launchOp.getOperation());
		}
		}
		} else {
		// Create a sequential for loop.
		auto loopOp = rewriter.create<loop::ForOp>(
		loc, cloning_map.lookupOrDefault(std::get<2>(config)),
		cloning_map.lookupOrDefault(std::get<3>(config)),
		cloning_map.lookupOrDefault(std::get<4>(config)));
		newIndex = loopOp.getInductionVar();
		rewriter.setInsertionPointToStart(loopOp.getBody());
		// Put a sentinel into the worklist so we know when to pop out of the loop
		// body again. We use the launchOp here, as that cannot be part of the
		// body's instructions.
		worklist.push_back(launchOp.getOperation());
		}
		cloning_map.map(std::get<1>(config), newIndex);
		}
		Block *body = parallelOp.getBody();
		worklist.reserve(worklist.size() + body->getOperations().size());
		for (Operation &op : llvm::reverse(body->without_terminator()))
		worklist.push_back(&op);
		return success();
		}

		} // namespace

		PatternMatchResult
		ParallelToGpuLaunchLowering::matchAndRewrite(ParallelOp parallelOp,
		PatternRewriter &rewriter) const {
		// Create a launch operation. We start with bound one for all grid/block
		// sizes. Those will be refined later as we discover them from mappings.
		Location loc = parallelOp.getLoc();
		Value constantOne = rewriter.create<ConstantIndexOp>(parallelOp.getLoc(), 1);
		gpu::LaunchOp launchOp = rewriter.create<gpu::LaunchOp>(
		parallelOp.getLoc(), constantOne, constantOne, constantOne, constantOne,
		constantOne, constantOne);
		rewriter.setInsertionPointToEnd(&launchOp.body().front());
		rewriter.create<gpu::TerminatorOp>(loc);
		rewriter.setInsertionPointToStart(&launchOp.body().front());

		BlockAndValueMapping cloning_map;
		SmallVector<Operation *, 16> worklist;
		if (failed(processParallelLoop(parallelOp, launchOp, cloning_map, worklist,
		rewriter)))
		return matchFailure();

		while (!worklist.empty()) {
		Operation *op = worklist.pop_back_val();

		// Now walk over the body and clone it.
		// TODO: This is only correct if there either is no further loop.parallel
		// nested or this code is side-effect free. Otherwise we might need
		// predication.
		if (auto nestedParallel = dyn_cast<ParallelOp>(op)) {
		// A nested loop.parallel needs insertion of code to compute indices.
		// Insert that now.
		processParallelLoop(nestedParallel, launchOp, cloning_map, worklist,
		rewriter);
		} else if (op == launchOp.getOperation()) {
		// Found our sentinel value. We have finished the operations from one
		// nesting level, pop one level back up.
		auto parent = rewriter.getInsertionPoint()->getParentOp();
		rewriter.setInsertionPointAfter(parent);
		} else {
		// Otherwise we copy it over.
		Operation *clone = rewriter.clone(*op, cloning_map);
		// TODO(herhut) Use generalized BlockAndValueMapping::map once landed.
		for (auto pair : llvm::zip(op->getResults(), clone->getResults()))
		cloning_map.map(std::get<0>(pair), std::get<1>(pair));
		}
		}

		rewriter.eraseOp(parallelOp);
		return matchSuccess();
		}

		namespace {
		struct ParallelLoopToGpuPass : public OperationPass<ParallelLoopToGpuPass> {
		void runOnOperation() override;
		};
		}

		void ParallelLoopToGpuPass::runOnOperation() {
		OwningRewritePatternList patterns;
		patterns.insert<ParallelToGpuLaunchLowering>(&getContext());
		ConversionTarget target(getContext());
		target.addLegalDialect<StandardOpsDialect>();
		target.addLegalDialect<AffineOpsDialect>();
		target.addLegalDialect<gpu::GPUDialect>();
		target.addLegalDialect<loop::LoopOpsDialect>();
		target.addIllegalOp<loop::ParallelOp>();
		if (failed(applyPartialConversion(getOperation(), target, patterns)))
		signalPassFailure();
		}

		static PassRegistration<ParallelLoopToGpuPass>
		ftynse: For a future revision, I'd consider having a transformation function that is controlled by actual arguments rather than by an attached attribute, and have the pattern read the attribute and call that function. This way, we can call the transformation programmatically without modifying the IR.
		pass("convert-parallel-loop-to-gpu", "Convert mapped loop.parallel ops to "
		"gpu launch operations.");
		mravishankar: It would be useful to also expose a method to populate this pattern (and also other patterns that might be needed in the future) in a method like `populateParallelLoopToGPUPatterns`. Then a pass can just import these patterns without having to run a separate pass to do this lowering.
		No newline at end of file
		ftynse: The comment above says "it is only correct if there either is no further loop.parallel", which I cannot match with this code.
		ftynse: Can we just walk recursively and check for the no side effects trait, returning matchFailure if there are side effects?
		herhut: I added some testing code that is conservative but ensures correctness.
		rriddle: Remove this debugging.
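The side-effect discussion above can be illustrated with a small standalone sketch. It assumes a conservative rule inferred from the comments: if a `loop.parallel` body contains a nested `loop.parallel`, every other operation in that body is replicated per thread and therefore must be side-effect free. `ToyOp` and `canLower` are hypothetical names; this is not the check that was added to the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy operation: either a plain op or a parallel loop with a body.
struct ToyOp {
  std::string name;
  bool hasSideEffects;
  bool isParallelLoop;
  std::vector<ToyOp> body; // Only meaningful for parallel loops.
};

// Returns true if the loop nest can be lowered under the conservative rule:
// ops that sit next to a nested parallel loop must not have side effects.
bool canLower(const ToyOp &loop) {
  const bool hasNestedParallel =
      std::any_of(loop.body.begin(), loop.body.end(),
                  [](const ToyOp &op) { return op.isParallelLoop; });

  for (const ToyOp &op : loop.body) {
    if (op.isParallelLoop) {
      // Recurse into nested parallel loops.
      if (!canLower(op))
        return false;
    } else if (hasNestedParallel && op.hasSideEffects) {
      // This op would be replicated per thread; reject the match.
      return false;
    }
  }
  return true;
}
```

In a rewrite pattern, a `false` result would correspond to returning `matchFailure()` before any IR is modified.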

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir

This file was added.

				// RUN: mlir-opt -convert-parallel-loop-to-gpu -split-input-file %s | FileCheck %s -dump-input-on-failure

				// 2-d parallel loop mapped to block.y and block.x

				func @parallel_loop(%arg0 : index, %arg1 : index, %arg2 : index,
				%arg3 : index, %arg4 : index,
				%buf : memref<?x?xf32>,
				%res : memref<?x?xf32>) {
				%step = constant 2 : index
				loop.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
				step (%arg4, %step) {
				%val = load %buf[%i0, %i1] : memref<?x?xf32>
				store %val, %res[%i1, %i0] : memref<?x?xf32>
				} { mapping = [{processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}, {processor = 0, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}] }
				return
				}

				// CHECK-LABEL: func @parallel_loop(
				// CHECK-SAME: [[VAL_0:%.*]]: index, [[VAL_1:%.*]]: index, [[VAL_2:%.*]]: index, [[VAL_3:%.*]]: index, [[VAL_4:%.*]]: index, [[VAL_5:%.*]]: memref<?x?xf32>, [[VAL_6:%.*]]: memref<?x?xf32>) {
				// CHECK: [[VAL_7:%.*]] = constant 2 : index
				// CHECK: [[VAL_8:%.*]] = constant 1 : index
				// CHECK: [[VAL_9:%.*]] = subi [[VAL_2]], [[VAL_0]] : index
				// CHECK: [[VAL_10:%.*]] = subi [[VAL_3]], [[VAL_1]] : index
				// CHECK: gpu.launch blocks([[VAL_11:%.*]], [[VAL_12:%.*]], [[VAL_13:%.*]]) in ([[VAL_14:%.*]] = [[VAL_10]], [[VAL_15:%.*]] = [[VAL_9]], [[VAL_16:%.*]] = [[VAL_8]]) threads([[VAL_17:%.*]], [[VAL_18:%.*]], [[VAL_19:%.*]]) in ([[VAL_20:%.*]] = [[VAL_8]], [[VAL_21:%.*]] = [[VAL_8]], [[VAL_22:%.*]] = [[VAL_8]]) {
				// CHECK: [[VAL_23:%.*]] = addi [[VAL_12]], [[VAL_0]] : index
				// CHECK: [[VAL_24:%.*]] = addi [[VAL_11]], [[VAL_1]] : index
				// CHECK: [[VAL_25:%.*]] = load [[VAL_5]]{{\[}}[[VAL_23]], [[VAL_24]]] : memref<?x?xf32>
				// CHECK: store [[VAL_25]], [[VAL_6]]{{\[}}[[VAL_24]], [[VAL_23]]] : memref<?x?xf32>
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }
				// CHECK: }

				// -----

				// tiled 2-d parallel loop mapped to block.y and block.x and thread.y and thread.x.

				func @parallel_loop(%arg0 : index, %arg1 : index, %arg2 : index,
				%arg3 : index,
				%buf : memref<?x?xf32>,
				%res : memref<?x?xf32>) {
				%zero = constant 0 : index
				%one = constant 1 : index
				%four = constant 4 : index
				loop.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
				step (%four, %four) {
				loop.parallel (%si0, %si1) = (%zero, %zero) to (%four, %four)
				step (%one, %one) {
				%idx0 = addi %i0, %si0 : index
				%idx1 = addi %i1, %si1 : index
				%val = load %buf[%idx0, %idx1] : memref<?x?xf32>
				store %val, %res[%idx1, %idx0] : memref<?x?xf32>
				} { mapping = [
				{processor = 4, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 3, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				} { mapping = [
				{processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 0, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				return
				}

				// CHECK-LABEL: func @parallel_loop(
				// CHECK-SAME: [[VAL_26:%.*]]: index, [[VAL_27:%.*]]: index, [[VAL_28:%.*]]: index, [[VAL_29:%.*]]: index, [[VAL_30:%.*]]: memref<?x?xf32>, [[VAL_31:%.*]]: memref<?x?xf32>) {
				// CHECK: [[VAL_32:%.*]] = constant 0 : index
				// CHECK: [[VAL_33:%.*]] = constant 1 : index
				// CHECK: [[VAL_34:%.*]] = constant 4 : index
				// CHECK: [[VAL_35:%.*]] = constant 1 : index
				// CHECK: [[VAL_36:%.*]] = subi [[VAL_28]], [[VAL_26]] : index
				// CHECK: [[VAL_37:%.*]] = subi [[VAL_29]], [[VAL_27]] : index
				// CHECK: [[VAL_38:%.*]] = subi [[VAL_34]], [[VAL_32]] : index
				// CHECK: [[VAL_39:%.*]] = subi [[VAL_34]], [[VAL_32]] : index
				// CHECK: gpu.launch blocks([[VAL_40:%.*]], [[VAL_41:%.*]], [[VAL_42:%.*]]) in ([[VAL_43:%.*]] = [[VAL_37]], [[VAL_44:%.*]] = [[VAL_36]], [[VAL_45:%.*]] = [[VAL_35]]) threads([[VAL_46:%.*]], [[VAL_47:%.*]], [[VAL_48:%.*]]) in ([[VAL_49:%.*]] = [[VAL_39]], [[VAL_50:%.*]] = [[VAL_38]], [[VAL_51:%.*]] = [[VAL_35]]) {
				// CHECK: [[VAL_52:%.*]] = addi [[VAL_41]], [[VAL_26]] : index
				// CHECK: [[VAL_53:%.*]] = addi [[VAL_40]], [[VAL_27]] : index
				// CHECK: [[VAL_54:%.*]] = addi [[VAL_47]], [[VAL_32]] : index
				// CHECK: [[VAL_55:%.*]] = addi [[VAL_46]], [[VAL_32]] : index
				// CHECK: [[VAL_56:%.*]] = addi [[VAL_52]], [[VAL_54]] : index
				// CHECK: [[VAL_57:%.*]] = addi [[VAL_53]], [[VAL_55]] : index
				// CHECK: [[VAL_58:%.*]] = load [[VAL_30]]{{\[}}[[VAL_56]], [[VAL_57]]] : memref<?x?xf32>
				// CHECK: store [[VAL_58]], [[VAL_31]]{{\[}}[[VAL_57]], [[VAL_56]]] : memref<?x?xf32>
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }

				// -----

				// 2-d parallel loop mapped to block.y and sequential

				func @parallel_loop(%arg0 : index, %arg1 : index, %arg2 : index,
				%arg3 : index, %arg4 : index,
				%buf : memref<?x?xf32>,
				%res : memref<?x?xf32>) {
				%step = constant 2 : index
				loop.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
				step (%arg4, %step) {
				%val = load %buf[%i0, %i1] : memref<?x?xf32>
				store %val, %res[%i1, %i0] : memref<?x?xf32>
				} { mapping = [
				{processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 6, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				return
				}

				// CHECK-LABEL: func @parallel_loop(
				// CHECK-SAME: [[VAL_59:%.*]]: index, [[VAL_60:%.*]]: index, [[VAL_61:%.*]]: index, [[VAL_62:%.*]]: index, [[VAL_63:%.*]]: index, [[VAL_64:%.*]]: memref<?x?xf32>, [[VAL_65:%.*]]: memref<?x?xf32>) {
				// CHECK: [[VAL_66:%.*]] = constant 2 : index
				// CHECK: [[VAL_67:%.*]] = constant 1 : index
				// CHECK: [[VAL_68:%.*]] = subi [[VAL_61]], [[VAL_59]] : index
				// CHECK: gpu.launch blocks([[VAL_69:%.*]], [[VAL_70:%.*]], [[VAL_71:%.*]]) in ([[VAL_72:%.*]] = [[VAL_67]], [[VAL_73:%.*]] = [[VAL_68]], [[VAL_74:%.*]] = [[VAL_67]]) threads([[VAL_75:%.*]], [[VAL_76:%.*]], [[VAL_77:%.*]]) in ([[VAL_78:%.*]] = [[VAL_67]], [[VAL_79:%.*]] = [[VAL_67]], [[VAL_80:%.*]] = [[VAL_67]]) {
				// CHECK: [[VAL_81:%.*]] = addi [[VAL_70]], [[VAL_59]] : index
				// CHECK: loop.for [[VAL_82:%.*]] = [[VAL_60]] to [[VAL_62]] step [[VAL_66]] {
				// CHECK: [[VAL_83:%.*]] = load [[VAL_64]]{{\[}}[[VAL_81]], [[VAL_82]]] : memref<?x?xf32>
				// CHECK: store [[VAL_83]], [[VAL_65]]{{\[}}[[VAL_82]], [[VAL_81]]] : memref<?x?xf32>
				// CHECK: }
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }

				// -----

				// tiled 2-d parallel loop mapped to block.y and seq. and thread.y and seq.

				func @parallel_loop(%arg0 : index, %arg1 : index, %arg2 : index,
				%arg3 : index,
				%buf : memref<?x?xf32>,
				%res : memref<?x?xf32>) {
				%zero = constant 0 : index
				%one = constant 1 : index
				%four = constant 4 : index
				loop.parallel (%i0, %i1) = (%arg0, %arg1) to (%arg2, %arg3)
				step (%four, %four) {
				loop.parallel (%si0, %si1) = (%zero, %zero) to (%four, %four)
				step (%one, %one) {
				%idx0 = addi %i0, %si0 : index
				%idx1 = addi %i1, %si1 : index
				%val = load %buf[%idx0, %idx1] : memref<?x?xf32>
				store %val, %res[%idx1, %idx0] : memref<?x?xf32>
				} { mapping = [
				{processor = 4, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 6, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				} { mapping = [
				{processor = 1, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>},
				{processor = 6, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}
				] }
				return
				}

				// CHECK-LABEL: func @parallel_loop(
				// CHECK-SAME: [[VAL_84:%.*]]: index, [[VAL_85:%.*]]: index, [[VAL_86:%.*]]: index, [[VAL_87:%.*]]: index, [[VAL_88:%.*]]: memref<?x?xf32>, [[VAL_89:%.*]]: memref<?x?xf32>) {
				// CHECK: [[VAL_90:%.*]] = constant 0 : index
				// CHECK: [[VAL_91:%.*]] = constant 1 : index
				// CHECK: [[VAL_92:%.*]] = constant 4 : index
				// CHECK: [[VAL_93:%.*]] = constant 1 : index
				// CHECK: [[VAL_94:%.*]] = subi [[VAL_86]], [[VAL_84]] : index
				// CHECK: [[VAL_95:%.*]] = subi [[VAL_92]], [[VAL_90]] : index
				// CHECK: gpu.launch blocks([[VAL_96:%.*]], [[VAL_97:%.*]], [[VAL_98:%.*]]) in ([[VAL_99:%.*]] = [[VAL_93]], [[VAL_100:%.*]] = [[VAL_94]], [[VAL_101:%.*]] = [[VAL_93]]) threads([[VAL_102:%.*]], [[VAL_103:%.*]], [[VAL_104:%.*]]) in ([[VAL_105:%.*]] = [[VAL_93]], [[VAL_106:%.*]] = [[VAL_95]], [[VAL_107:%.*]] = [[VAL_93]]) {
				// CHECK: [[VAL_108:%.*]] = addi [[VAL_97]], [[VAL_84]] : index
				// CHECK: loop.for [[VAL_109:%.*]] = [[VAL_85]] to [[VAL_87]] step [[VAL_92]] {
				nicolasvasilache: Could we use capture names such as `TIDX`, `BIDY` or something similar that would help "see" the mapping better?
				herhut: Will do once this has stabilized a bit. For now, the tests are fully auto-generated.
				// CHECK: [[VAL_110:%.*]] = addi [[VAL_103]], [[VAL_90]] : index
				// CHECK: loop.for [[VAL_111:%.*]] = [[VAL_90]] to [[VAL_92]] step [[VAL_91]] {
				// CHECK: [[VAL_112:%.*]] = addi [[VAL_108]], [[VAL_110]] : index
				// CHECK: [[VAL_113:%.*]] = addi [[VAL_109]], [[VAL_111]] : index
				// CHECK: [[VAL_114:%.*]] = load [[VAL_88]]{{\[}}[[VAL_112]], [[VAL_113]]] : memref<?x?xf32>
				// CHECK: store [[VAL_114]], [[VAL_89]]{{\[}}[[VAL_113]], [[VAL_112]]] : memref<?x?xf32>
				// CHECK: }
				// CHECK: }
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }

				// -----

				#map0 = affine_map<(d0, d1)[s0, s1] -> (d0 * s1 + s0 + d1)>
				#map1 = affine_map<(d0, d1, d2) -> (d0, d1 - d2)>
				#map2 = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>
				#map3 = affine_map<(d0) -> (d0)>

				module {
				func @sum(%arg0: memref<?x?xf32, #map0>, %arg1: memref<?x?xf32, #map0>, %arg2: memref<?x?xf32, #map0>) {
				%c1 = constant 1 : index
				%c0 = constant 0 : index
				%c3 = constant 3 : index
				%c2 = constant 2 : index
				%0 = dim %arg0, 0 : memref<?x?xf32, #map0>
				%1 = dim %arg0, 1 : memref<?x?xf32, #map0>
				loop.parallel (%arg3, %arg4) = (%c0, %c0) to (%0, %1) step (%c2, %c3) {
				%2 = dim %arg0, 0 : memref<?x?xf32, #map0>
				%3 = affine.min #map1(%c2, %2, %arg3)
				%4 = dim %arg0, 1 : memref<?x?xf32, #map0>
				%5 = affine.min #map1(%c3, %4, %arg4)
				%6 = std.subview %arg0[%arg3, %arg4][%3, %5][%c1, %c1] : memref<?x?xf32, #map0> to memref<?x?xf32, #map2>
				%7 = dim %arg1, 0 : memref<?x?xf32, #map0>
				%8 = affine.min #map1(%c2, %7, %arg3)
				%9 = dim %arg1, 1 : memref<?x?xf32, #map0>
				%10 = affine.min #map1(%c3, %9, %arg4)
				%11 = std.subview %arg1[%arg3, %arg4][%8, %10][%c1, %c1] : memref<?x?xf32, #map0> to memref<?x?xf32, #map2>
				%12 = dim %arg2, 0 : memref<?x?xf32, #map0>
				%13 = affine.min #map1(%c2, %12, %arg3)
				%14 = dim %arg2, 1 : memref<?x?xf32, #map0>
				%15 = affine.min #map1(%c3, %14, %arg4)
				%16 = std.subview %arg2[%arg3, %arg4][%13, %15][%c1, %c1] : memref<?x?xf32, #map0> to memref<?x?xf32, #map2>
				loop.parallel (%arg5, %arg6) = (%c0, %c0) to (%3, %5) step (%c1, %c1) {
				%17 = load %6[%arg5, %arg6] : memref<?x?xf32, #map2>
				%18 = load %11[%arg5, %arg6] : memref<?x?xf32, #map2>
				%19 = load %16[%arg5, %arg6] : memref<?x?xf32, #map2>
				%20 = addf %17, %18 : f32
				store %20, %16[%arg5, %arg6] : memref<?x?xf32, #map2>
				"loop.terminator"() : () -> ()
				} { mapping = [
				{processor = 3, map = #map3, bound = #map3},
				{processor = 4, map = #map3, bound = #map3}
				] }
				"loop.terminator"() : () -> ()
				} { mapping = [
				{processor = 0, map = #map3, bound = #map3},
				{processor = 1, map = #map3, bound = #map3}
				] }
				return
				}
				}

				// NOTE: Assertions have been autogenerated by utils/generate-test-checks.py
				// CHECK: #map0 = affine_map<(d0, d1)[s0, s1] -> (d0 * s1 + s0 + d1)>
				// CHECK: #map1 = affine_map<(d0) -> (d0)>
				// CHECK: #map2 = affine_map<(d0, d1, d2) -> (d0, d1 - d2)>
				// CHECK: #map3 = affine_map<(d0, d1)[s0, s1, s2] -> (d0 * s1 + s0 + d1 * s2)>
				// CHECK: module {

				// CHECK-LABEL: func @sum(
				// CHECK-SAME: [[VAL_0:%.*]]: memref<?x?xf32, #map0>, [[VAL_1:%.*]]: memref<?x?xf32, #map0>, [[VAL_2:%.*]]: memref<?x?xf32, #map0>) {
				// CHECK: [[VAL_3:%.*]] = constant 1 : index
				// CHECK: [[VAL_4:%.*]] = constant 0 : index
				// CHECK: [[VAL_5:%.*]] = constant 3 : index
				// CHECK: [[VAL_6:%.*]] = constant 2 : index
				// CHECK: [[VAL_7:%.*]] = dim [[VAL_0]], 0 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_8:%.*]] = dim [[VAL_0]], 1 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_9:%.*]] = constant 1 : index
				// CHECK: [[VAL_10:%.*]] = subi [[VAL_7]], [[VAL_4]] : index
				// CHECK: [[VAL_11:%.*]] = affine.apply #map1([[VAL_10]])
				// CHECK: [[VAL_12:%.*]] = subi [[VAL_8]], [[VAL_4]] : index
				// CHECK: [[VAL_13:%.*]] = affine.apply #map1([[VAL_12]])
				// CHECK: [[VAL_14:%.*]] = constant 2 : index
				// CHECK: [[VAL_15:%.*]] = subi [[VAL_14]], [[VAL_4]] : index
				// CHECK: [[VAL_16:%.*]] = affine.apply #map1([[VAL_15]])
				// CHECK: [[VAL_17:%.*]] = constant 3 : index
				// CHECK: [[VAL_18:%.*]] = subi [[VAL_17]], [[VAL_4]] : index
				// CHECK: [[VAL_19:%.*]] = affine.apply #map1([[VAL_18]])
				// CHECK: gpu.launch blocks([[VAL_20:%.*]], [[VAL_21:%.*]], [[VAL_22:%.*]]) in ([[VAL_23:%.*]] = [[VAL_13]], [[VAL_24:%.*]] = [[VAL_11]], [[VAL_25:%.*]] = [[VAL_9]]) threads([[VAL_26:%.*]], [[VAL_27:%.*]], [[VAL_28:%.*]]) in ([[VAL_29:%.*]] = [[VAL_16]], [[VAL_30:%.*]] = [[VAL_19]], [[VAL_31:%.*]] = [[VAL_9]]) {
				// CHECK: [[VAL_32:%.*]] = affine.apply #map1([[VAL_21]])
				// CHECK: [[VAL_33:%.*]] = addi [[VAL_32]], [[VAL_4]] : index
				// CHECK: [[VAL_34:%.*]] = affine.apply #map1([[VAL_20]])
				// CHECK: [[VAL_35:%.*]] = addi [[VAL_34]], [[VAL_4]] : index
				// CHECK: [[VAL_36:%.*]] = dim [[VAL_0]], 0 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_37:%.*]] = affine.min #map2([[VAL_6]], [[VAL_36]], [[VAL_33]])
				// CHECK: [[VAL_38:%.*]] = dim [[VAL_0]], 1 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_39:%.*]] = affine.min #map2([[VAL_5]], [[VAL_38]], [[VAL_35]])
				// CHECK: [[VAL_40:%.*]] = std.subview [[VAL_0]]{{\[}}[[VAL_33]], [[VAL_35]]]{{\[}}[[VAL_37]], [[VAL_39]]]{{\[}}[[VAL_3]], [[VAL_3]]] : memref<?x?xf32, #map0> to memref<?x?xf32, #map3>
				// CHECK: [[VAL_41:%.*]] = dim [[VAL_1]], 0 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_42:%.*]] = affine.min #map2([[VAL_6]], [[VAL_41]], [[VAL_33]])
				// CHECK: [[VAL_43:%.*]] = dim [[VAL_1]], 1 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_44:%.*]] = affine.min #map2([[VAL_5]], [[VAL_43]], [[VAL_35]])
				// CHECK: [[VAL_45:%.*]] = std.subview [[VAL_1]]{{\[}}[[VAL_33]], [[VAL_35]]]{{\[}}[[VAL_42]], [[VAL_44]]]{{\[}}[[VAL_3]], [[VAL_3]]] : memref<?x?xf32, #map0> to memref<?x?xf32, #map3>
				// CHECK: [[VAL_46:%.*]] = dim [[VAL_2]], 0 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_47:%.*]] = affine.min #map2([[VAL_6]], [[VAL_46]], [[VAL_33]])
				// CHECK: [[VAL_48:%.*]] = dim [[VAL_2]], 1 : memref<?x?xf32, #map0>
				// CHECK: [[VAL_49:%.*]] = affine.min #map2([[VAL_5]], [[VAL_48]], [[VAL_35]])
				// CHECK: [[VAL_50:%.*]] = std.subview [[VAL_2]]{{\[}}[[VAL_33]], [[VAL_35]]]{{\[}}[[VAL_47]], [[VAL_49]]]{{\[}}[[VAL_3]], [[VAL_3]]] : memref<?x?xf32, #map0> to memref<?x?xf32, #map3>
				// CHECK: [[VAL_51:%.*]] = affine.apply #map1([[VAL_26]])
				// CHECK: [[VAL_52:%.*]] = addi [[VAL_51]], [[VAL_4]] : index
				// CHECK: [[VAL_53:%.*]] = cmpi "slt", [[VAL_52]], [[VAL_37]] : index
				// CHECK: loop.if [[VAL_53]] {
				// CHECK: [[VAL_54:%.*]] = affine.apply #map1([[VAL_27]])
				// CHECK: [[VAL_55:%.*]] = addi [[VAL_54]], [[VAL_4]] : index
				// CHECK: [[VAL_56:%.*]] = cmpi "slt", [[VAL_55]], [[VAL_39]] : index
				// CHECK: loop.if [[VAL_56]] {
				// CHECK: [[VAL_57:%.*]] = load [[VAL_40]]{{\[}}[[VAL_52]], [[VAL_55]]] : memref<?x?xf32, #map3>
				// CHECK: [[VAL_58:%.*]] = load [[VAL_45]]{{\[}}[[VAL_52]], [[VAL_55]]] : memref<?x?xf32, #map3>
				// CHECK: [[VAL_59:%.*]] = load [[VAL_50]]{{\[}}[[VAL_52]], [[VAL_55]]] : memref<?x?xf32, #map3>
				// CHECK: [[VAL_60:%.*]] = addf [[VAL_57]], [[VAL_58]] : f32
				// CHECK: store [[VAL_60]], [[VAL_50]]{{\[}}[[VAL_52]], [[VAL_55]]] : memref<?x?xf32, #map3>
				// CHECK: }
				// CHECK: }
				// CHECK: gpu.terminator
				// CHECK: }
				// CHECK: return
				// CHECK: }
				// CHECK: }
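The CHECK lines above suggest a simple per-dimension scheme: the launch size is computed as upper bound minus lower bound (`subi`), the loop index is the hardware id plus the lower bound (`addi`), and in the `@sum` test each thread is additionally predicated with `cmpi "slt"` against the actual bound. The following standalone sketch simulates that arithmetic for one mapped dimension. It assumes identity `map` and `bound` affine maps and ignores the loop step, as the tests appear to do; the function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Launch size for one mapped dimension, mirroring `subi upper, lower`.
long launchBound(long lower, long upperEstimate) {
  return upperEstimate - lower;
}

// Loop index seen by one block/thread, mirroring `addi id, lower`
// (identity mapping assumed).
long loopIndex(long hardwareId, long lower) { return hardwareId + lower; }

// Simulates all hardware ids of one dimension and returns the loop indices
// whose body would execute. `upperEstimate` sizes the launch, while
// `actualUpper` is the bound checked by the `cmpi "slt"` predicate.
std::vector<long> simulateMappedDim(long lower, long upperEstimate,
                                    long actualUpper) {
  std::vector<long> executed;
  const long size = launchBound(lower, upperEstimate);
  for (long id = 0; id < size; ++id) {
    const long index = loopIndex(id, lower);
    if (index < actualUpper) // Predication guards partial tiles.
      executed.push_back(index);
  }
  return executed;
}
```

For example, a tile launched with three threads but an actual extent of two executes only indices 0 and 1; when the estimate equals the actual bound, the predicate never masks a thread.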