This is an archive of the discontinued LLVM Phabricator instance.

[MLIR][GPU] Properly model step in parallel loop to gpu conversion.
ClosedPublic

Authored by herhut on Feb 24 2020, 7:34 AM.

Details

Summary

The original patch had TODOs to add support for step computations,
which this commit addresses. The computations are expressed using
affine expressions so that the affine canonicalizers can simplify
the full bound and index computations.
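
For illustration only (a hypothetical sketch, not code taken from this patch; the map and value names are made up): with a non-unit step, the index of a mapped dimension can be expressed as lowerBound + hardwareId * step through an affine.apply, so constant bounds and steps fold away during canonicalization.

  #idx = affine_map<(d0)[s0, s1] -> (d0 * s0 + s1)>
  ...
  // %bid is a hardware id (e.g. the block id), %step the loop step and
  // %lb the lower bound of the original loop.parallel dimension.
  %iv = affine.apply #idx(%bid)[%step, %lb]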

Also cleans up the code a little and exposes the pass in the
header file.

Diff Detail

Event Timeline

herhut created this revision. Feb 24 2020, 7:34 AM
herhut updated this revision to Diff 246410. Feb 25 2020, 3:30 AM

Simplify affine constant matching.

bondhugula added inline comments.
mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
235

Side question: why aren't we using affine.load/store instead of load/store, and affine.parallel instead of loop.parallel here? With the former, you get things like store-to-load forwarding, redundant load elimination, composition of the ops supplying subscript values into the load/store itself, etc.; the infrastructure for all of this already exists and is there whenever you need it. All the mapping metadata should fit nicely into affine.parallel as well.
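
As a rough illustration only (made-up names, not taken from the test file), the difference looks roughly like this:

  // With plain std ops, the subscript is computed by a separate op:
  %idx = affine.apply affine_map<(d0) -> (d0 * 4 + 1)>(%i)
  %v = load %A[%idx] : memref<?xf32>

  // With affine.load, the subscript expression is part of the op itself,
  // which enables store-to-load forwarding, redundant load elimination, etc.:
  %v2 = affine.load %A[%i * 4 + 1] : memref<?xf32>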

ftynse accepted this revision. Feb 25 2020, 4:18 AM
ftynse added inline comments.
mlir/lib/Conversion/LoopsToGPU/LoopsToGPU.cpp
542

Nit: could you early-return instead?

669

Nit "we use affine apply" ?

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
18–19

Let's not match on attribute names here, i.e. prefer #[[MAP0:.*]] = affine_map<...
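
A hypothetical example of what such a capture could look like (not the actual test content):

  // CHECK: #[[MAP0:.*]] = affine_map<(d0)[s0, s1] -> (d0 * s0 + s1)>
  // CHECK: affine.apply #[[MAP0]](%{{.*}})[%{{.*}}, %{{.*}}]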

This revision is now accepted and ready to land. Feb 25 2020, 4:18 AM
herhut updated this revision to Diff 246414. Feb 25 2020, 4:36 AM
herhut marked 4 inline comments as done.

Review comments.

Thanks for the review!

mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
235

It is not pure coincidence that the mapping data fits :)

My hope is that this mapper will work equally well with affine.parallel. However, I do not want to restrict it to affine and currently the code we feed into this is not based on affine.parallel. I expect that we will generalize things in that direction eventually but would also be very happy if someone else looks into that.

This revision was automatically updated to reflect the committed changes.
bondhugula added inline comments. Feb 25 2020, 7:53 AM
mlir/test/Conversion/LoopsToGPU/parallel_loop.mlir
235

> However, I do not want to restrict it to affine and currently the code
> we feed into this is not based on affine.parallel. I expect that we will

"The code we feed into this": is the thing that's generating the loop.parallel's available somewhere or is it something that's planned for release in the future?

> generalize things in that direction eventually but would also be very
> happy if someone else looks into that.

But in order to do that, one would also have to look at the converter that generates the loop dialect ops and switch it to affine dialect ones. IMO, that would avoid duplicating a lot of infrastructure in less powerful ways. All of these examples can be represented and transformed (whether or not you need any analysis) with the affine dialect.