
[mlir][linalg] Support tiling and fusing linalg.pad_tensor
Needs RevisionPublic

Authored by antiagainst on May 27 2021, 5:17 AM.

Details

Summary

linalg.pad_tensor ops frequently arise from cases like
convolution padding, where we want to pad a few zeros at
the tensor's boundary. Handwritten kernels handle this
boundary case with pre- and post-if statements that load
the referenced scalars at the boundary and compose them
with zeros to form vectors, so the bulk of the original
tensor's content can still go through nicely vectorized
loads and stores.
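
For reference, a minimal sketch of the op in question (the
value names, shapes, and zero padding value are invented for
illustration): it pads a 4x4 tensor with one extra row and
column of zeros at the high boundary.

  %cst = constant 0.0 : f32
  // The region yields the value used for every out-of-bounds position.
  %padded = linalg.pad_tensor %input low[0, 0] high[1, 1] {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
  } : tensor<4x4xf32> to tensor<5x5xf32>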

Right now this is not possible with linalg, as the
linalg.pad_tensor op stands on its own without being
tiled and fused: when CodeGen'ing towards GPU, it forces
a separate kernel, which requires allocating a new buffer
and copying over the original tensor's content. This
buffer allocation and data copying is unnecessary and can
be a major source of latency.

This commit is a first step towards making linalg.pad_tensor
compose better with other linalg ops, to enable generating
more optimized code like the handwritten kernels described
above. linalg.pad_tensor isn't a structured linalg op, so it
needs specific handling.
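
To make the goal concrete, here is a hypothetical sketch of the
producer/consumer pair this patch targets (linalg.matmul is used
as a stand-in consumer; all names and shapes are invented).
Today the pad_tensor producer stays un-tiled, so its full result
must be materialized before the consumer runs; the aim is to
tile the consumer and pull the padding of each slice into the
tile loop.

  %cst = constant 0.0 : f32
  // Producer: pad the LHS up to the shape the consumer expects.
  %lhs_padded = linalg.pad_tensor %lhs low[0, 0] high[1, 1] {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
  } : tensor<4x4xf32> to tensor<5x5xf32>
  // Consumer: without fusion, tiling this op leaves the pad above
  // as a separate computation (and, on GPU, a separate kernel plus
  // a buffer allocation and copy).
  %res = linalg.matmul ins(%lhs_padded, %rhs : tensor<5x5xf32>, tensor<5x8xf32>)
                       outs(%init : tensor<5x8xf32>) -> tensor<5x8xf32>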

Diff Detail

Unit Tests: Failed

Time   | Test
320 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loc-const-value-1.ll
         Script: -- : 'RUN: at line 1'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loc-const-value-1.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loc-const-value-1.ll.tmp
330 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loc-const-value-2.ll
         Script: -- : 'RUN: at line 1'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loc-const-value-2.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loc-const-value-2.ll.tmp
320 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loc-split-range.ll
         Script: -- : 'RUN: at line 1'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loc-split-range.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loc-split-range.ll.tmp
380 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loc.ll
         Script: -- : 'RUN: at line 5'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loc.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loc.ll.tmp
390 ms | x64 windows > LLVM.DebugInfo/X86::basic-block-sections-debug-loclist-1.ll
         Script: -- : 'RUN: at line 1'; c:\ws\w6\llvm-project\premerge-checks\build\bin\llc.exe C:\ws\w6\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\basic-block-sections-debug-loclist-1.ll --dwarf-version=4 --basic-block-sections=none -filetype=obj -o C:\ws\w6\llvm-project\premerge-checks\build\test\DebugInfo\X86\Output\basic-block-sections-debug-loclist-1.ll.tmp

View Full Test Results (9 Failed)

Event Timeline

antiagainst created this revision.May 27 2021, 5:17 AM
antiagainst requested review of this revision.May 27 2021, 5:17 AM

Thanks for attacking this @antiagainst !
I haven't looked at the details yet but the low/hi padding composition looks like what I'd expect.

I have one question before digging deeper: it seems to me this is doing a few things at the same time.
In my mind there is a fundamental building block that rewrites subtensor(pad) into pad(subtensor) with the proper min/max computation gymnastics, and that is in turn used for fusion (and later for distribution and other places).
Then the imperfectly nested structure that you currently write yourself as a special case of fusion would happen automatically via loop hoisting.
Your fusion pattern would then just detect pad -> subtensor -> consumer and use the above building block.
I also expect similar subtensor + X patterns to appear with reshape and concat, so common infrastructure at that level would be useful too.

Have you tried this line of thought?
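
For illustration, a rough sketch of the subtensor(pad) ->
pad(subtensor) building block described above, with invented
shapes (%t is a tensor<6x6xf32> input, %cst the f32 padding
value) and with the slice chosen so the min/max gymnastics
degenerate to constants:

  // Before: take a slice out of the already padded tensor.
  %padded = linalg.pad_tensor %t low[0, 0] high[2, 2] {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
  } : tensor<6x6xf32> to tensor<8x8xf32>
  %slice = subtensor %padded[4, 4] [4, 4] [1, 1]
    : tensor<8x8xf32> to tensor<4x4xf32>

  // After: slice the original tensor first (clamped to its bounds),
  // then pad only that small slice up to the requested size.
  %small = subtensor %t[4, 4] [2, 2] [1, 1]
    : tensor<6x6xf32> to tensor<2x2xf32>
  %slice2 = linalg.pad_tensor %small low[0, 0] high[2, 2] {
  ^bb0(%i: index, %j: index):
    linalg.yield %cst : f32
  } : tensor<2x2xf32> to tensor<4x4xf32>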

Jing added a subscriber: Jing.Jun 1 2021, 12:00 PM
mravishankar requested changes to this revision.Jun 2 2021, 4:29 PM

Ok, a first skim of this patch. I actually didn't follow the core logic. Maybe we can chat offline about this.

mlir/lib/Dialect/Linalg/Transforms/Fusion.cpp
338

I would avoid doing this. Instead, it is probably better to just create the map so that it can use either the value or the constant directly. Creating a std.constant only for the affine.apply to then be canonicalized away is wasted work.
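
A sketch of the difference being pointed at, with invented values
(%iv stands for some loop induction variable): instead of
materializing a constant operand and relying on canonicalization
to clean it up, the known constant can be folded into the affine
map directly.

  // Wasteful: build a std.constant just so the affine.apply can
  // later be canonicalized away.
  %c4 = constant 4 : index
  %sum = affine.apply affine_map<(d0, d1) -> (d0 + d1)>(%iv, %c4)

  // Better: bake the known constant into the map up front.
  %sum_folded = affine.apply affine_map<(d0) -> (d0 + 4)>(%iv)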

365

I am having a hard time following the logic here. Maybe rephrase this?

This revision now requires changes to proceed.Jun 2 2021, 4:29 PM