We shouldn't broadcast the original value when doing a reduction. Instead,
we compute the reduction and then combine it with the original value.
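To make the summary concrete, here is a minimal plain-Python sketch (not code from the patch; the function names are hypothetical) of why broadcasting the initial value into the vector before an add-reduction is wrong: the initial value gets accumulated once per lane instead of once overall.

```python
def broadcast_then_reduce(values, init):
    # Wrong: init is broadcast into every lane before reducing,
    # so an add-reduction accumulates it len(values) times.
    return sum(v + init for v in values)

def reduce_then_combine(values, init):
    # Right: reduce the vector alone, then combine with init once.
    return sum(values) + init

values = [1.0, 2.0, 3.0, 4.0]
init = 10.0
print(broadcast_then_reduce(values, init))  # 50.0 (init counted 4 times)
print(reduce_then_combine(values, init))    # 20.0 (init counted once)
```

The two results only coincide for a single-lane vector, which is why the combine has to happen after the reduction.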
LGTM, so I'm putting a stamp on it, but please be sure to actually do the refactoring I point out, with good names, so this is easy to come back to in the future.
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp

| Line | Comment |
|---|---|
| 259 | The new code above and below this line feels quite dry. |
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp

| Line | Comment |
|---|---|
| 184 | typo: two |
| 203 | This is highly non-trivial; please add a comment with a detailed example covering all the steps involved. |
| 212 | This is adding a new read + broadcast that didn't exist before, right? |
| 242 | typo: match |
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp

| Line | Comment |
|---|---|
| 263 | This should be `return vectorizeBinaryReductionOp(reduceOp)`, where this code gets hoisted. |
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp

| Line | Comment |
|---|---|
| 212 | Correct. The problem is that the value we get at this stage is already loaded, broadcast, and added (for a reduction add) to the input, so there is no way for me to extract back the original value. |
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp

| Line | Comment |
|---|---|
| 263 | I'm not sure I understand. Do you mean the whole creation of the multi-dim reduce + the extra op to combine with the initial value should go into `vectorizeBinaryReductionOp`, or just the code creating the last instruction? |