This is an archive of the discontinued LLVM Phabricator instance.

Differential D108995

[mlir][linalg] Fusion on tensors.
AbandonedPublic

Authored by gysit on Aug 31 2021, 7:10 AM.

Details

Reviewers

nicolasvasilache

Summary

Add a new version of fusion on tensors with the following capabilities and limitations:

supports fusion on all levels of the tile loop nest
fuses a producer result passed in via tile loop iteration arguments (and updates the tile loop iteration arguments)
supports only linalg operations on tensors
supports only scf::for
cannot add an output to the tile loop nest

The LinalgFusionOnTensors pass tries to fuse all producers of tiled ops.
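
To make the idea of fusing a producer into a tile loop concrete, here is a minimal sketch in plain C++, not the MLIR implementation. It assumes one-dimensional arrays stand in for tensors and that both ops are elementwise; `produce`, `consume`, `runUnfused` and `runFused` are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy producer: an elementwise op standing in for a linalg producer on tensors.
inline int produce(int x) { return 2 * x; }
// Toy consumer: another elementwise op that reads the producer result.
inline int consume(int x) { return x + 1; }

// Unfused: materialize the whole producer result, then run the tiled consumer.
std::vector<int> runUnfused(const std::vector<int> &in, std::size_t tileSize) {
  assert(tileSize > 0 && "tile size must be positive");
  std::vector<int> tmp(in.size()), out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    tmp[i] = produce(in[i]);
  for (std::size_t iv = 0; iv < in.size(); iv += tileSize) {
    std::size_t end = std::min(iv + tileSize, in.size());
    for (std::size_t i = iv; i < end; ++i)
      out[i] = consume(tmp[i]);
  }
  return out;
}

// Fused: the producer is cloned into the tile loop and computes only the
// slice that the current consumer tile reads.
std::vector<int> runFused(const std::vector<int> &in, std::size_t tileSize) {
  assert(tileSize > 0 && "tile size must be positive");
  std::vector<int> out(in.size());
  for (std::size_t iv = 0; iv < in.size(); iv += tileSize) {
    std::size_t end = std::min(iv + tileSize, in.size());
    std::vector<int> tile(end - iv);
    for (std::size_t i = iv; i < end; ++i)
      tile[i - iv] = produce(in[i]);
    for (std::size_t i = iv; i < end; ++i)
      out[i] = consume(tile[i - iv]);
  }
  return out;
}
```

Both variants should compute the same result; the fused one only needs a tile-sized temporary instead of the full intermediate array.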

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

gysit created this revision. Aug 31 2021, 7:10 AM

Herald added subscribers: wrengr, Chia-hungDuan, dcaballe and 20 others. Aug 31 2021, 7:10 AM

gysit requested review of this revision. Aug 31 2021, 7:10 AM

Herald added a reviewer: nicolasvasilache. Aug 31 2021, 7:10 AM

Herald added a project: Restricted Project.

Herald added subscribers: limo1996, stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B121927: Diff 369693. Aug 31 2021, 7:46 AM

some early comments, will pick up again a bit later.

mlir/include/mlir/Dialect/Linalg/Passes.td
59	Can we keep the same name between the pass and the string? i.e. `def LinalgFuseTensorOps`.
60	Typo `tesnors`.
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
30	`consisting of exactly intermediate extract_slice and scf::for ops` ? Also, please mention that the use-def search stops as soon as a different op is found.
53	Add a `TODO` to support other types of enclosing ops (e.g. `scf.if`) and other types of loops ? Since we are in value land, we should be able to handle more interesting cases?
161	We usually just go for `T mlir::linalg::isProducerFusable(...) {...}`

Address comments.

gysit marked 5 inline comments as done. Aug 31 2021, 11:45 AM

Harbormaster completed remote builds in B121968: Diff 369757. Aug 31 2021, 12:12 PM

mravishankar added inline comments. Aug 31 2021, 10:06 PM

mlir/include/mlir/Dialect/Linalg/Utils/Utils.h
164	Curious why we need this and the producer itself. I don't think we can split the producer and "fuse" individual `OpResult` if we have ops with multiple results.
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
55	would this not be easier to get with the `getUseDefChain` above?
174	Nit: fold this with the check below. An error saying "expected linalg op" implies a block argument is not expected.
189	Third time traversing this. It's a style comment, so sort of a nit. You could compute the `sliceOps`, `loops` and `bbArgs` at the same time while traversing the use-def chain. The chain is small, so it's not an efficiency thing, but more cognitive overhead.
mlir/test/Dialect/Linalg/fusion-on-tensors.mlir
37	Am I reading this right? The fill is coming into the `k` loop?

@nicolasvasilache @mravishankar thanks for the feedback so far. I already addressed some comments and will continue to do so.

mlir/include/mlir/Dialect/Linalg/Utils/Utils.h
164	The current version of fusion supports fusing producers with multiple outputs but introduces redundant computation. Two follow-up changes could make sense: (1) a canonicalization pattern that removes unused outputs from a linalg op (probably limited to generic ops), and (2) adding the additional outputs of the cloned producer to the tile loop iteration arguments and replacing all uses after the tile loop nest.
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
55	I think if we fuse a consumer input that is not passed via iteration argument we still want to find the tile loops. They are needed to make the match between the tile loop dimensions and the extract slice ops.
189	I think one thing that introduces a bit of complexity here is that the code does not enforce an extract slice op for every tile loop (which helps make the unit tests more readable!). Also note that once we start fusing dependencies transitively, there start to be tile loops that are actually not tiling the fused operation... I thus currently believe the choice of getting the loops first and then trying to map the extract slice ops is better. But I will think about simplifying this code!
mlir/test/Dialect/Linalg/fusion-on-tensors.mlir
37	Yes, since the fill is consumed by an input operand, it is fine to fuse it into the k loop in this case. It is a trade-off though, and at some point we may want to control whether fusion may introduce redundant computation and, if so, how much.
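
The redundancy trade-off mentioned for the k loop can be estimated with a small counting sketch. This is an illustration under assumptions not stated in the review: a matmul-like consumer C(i, j) += A(i, k) * B(k, j), a tile loop order of i, j, k, and a producer of the input A that is fused into the innermost (k) loop. `Counts` and `countProducerEvaluations` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Counts {
  std::int64_t unfused; // producer elements computed once, outside the loops
  std::int64_t fused;   // producer elements computed inside the k loop
};

// Count how many elements of the producer of A(i, k) are evaluated when the
// producer is fused into the innermost loop of an (i, j, k) tile loop nest.
// The A tile does not depend on j, so it is recomputed for every j tile.
Counts countProducerEvaluations(std::int64_t M, std::int64_t N, std::int64_t K,
                                std::int64_t ti, std::int64_t tj,
                                std::int64_t tk) {
  assert(ti > 0 && tj > 0 && tk > 0 && "tile sizes must be positive");
  Counts counts{M * K, 0};
  for (std::int64_t i = 0; i < M; i += ti)
    for (std::int64_t j = 0; j < N; j += tj)
      for (std::int64_t k = 0; k < K; k += tk) {
        std::int64_t rows = std::min(ti, M - i);
        std::int64_t cols = std::min(tk, K - k);
        counts.fused += rows * cols; // one A tile per (i, j, k) iteration
      }
  return counts;
}
```

Under these assumptions the fused count is the unfused count multiplied by the number of j tiles, which is the kind of quantity a future control knob for redundant computation could bound.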

Address one more comment.

Harbormaster completed remote builds in B122044: Diff 369871. Sep 1 2021, 12:10 AM

gysit marked an inline comment as done. Sep 1 2021, 12:10 AM

Address comment and merge getTileLoops and getUseDefChain into one method.

Harbormaster completed remote builds in B122066: Diff 369903. Sep 1 2021, 5:32 AM

gysit marked 2 inline comments as done. Sep 1 2021, 5:34 AM

gysit added inline comments.

mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
189	I collect the tile loops and the use def chain now in one method getTileLoopNest and I think it is an improvement!

Added a few more comments.

mlir/include/mlir/Dialect/Linalg/Passes.td
64	Is this the consumer op tile sizes? I think it is confusing to talk about "loop dimensions" here as this is also dependent on the op semantics.
mlir/include/mlir/Dialect/Linalg/Utils/Utils.h
169	Add a note that this is currently limited to 1-1 output permutations to shape dimension but that this will be relaxed in the future?
182	I do not understand what tileLoopDims is and why it is needed. From your usage I only see that it is initialized to null everywhere?
mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp
38	I find the new API quite difficult to follow: there are hidden assumptions about when some entries are null but not others and how things are supposed to align that I have a hard time following. Can we have a single vector with struct with a proper name that contains `{scf::ForOp, tensor::ExtractSliceOp, BlockArgument}` with all the properties spelled out explicitly with helper functions? e.g. a strawman
	struct AGoodName {
	  void isPropertyX(int index) {
	    return tileLoops[index] && sliceOps[index] && !bbArgs[index];
	  }
	  ...
	  SmallVector<scf::ForOp> tileLoops;
	  SmallVector<tensor::ExtractSliceOp> sliceOps;
	  SmallVector<BlockArgument> bbArgs;
	}
50	LoopLikeInterface has a "isDefinedOutsideOfLoop" method for this.
214	I'd lift that as a helper in the LinalgInterfaces, seems generally useful and a good place to centralize as we generalize.
225	this type of bookkeeping should be internal to the struct I mention above.
228	This should be a standalone documented helper function.
233	This feels like an assumption that comes from tiling invariants and is used at a distance without guarantees? This isn't something we can generally rely on. This seems to require a "tile and fuse" approach to the problem?
247	This is also internal bookkeeping of the struct that should have an intuitive name.
254	I'd rephrase this with an example and isolate in a helper function; e.g.
	Consider the case of fusion of an output tensor:
	  %0 = producer(...)
	  %2 = consumer ins(...) outs(%0)
	When consumer is tiled, %0 appears as a loop iter_arg, i.e.:
	  %0 = producer(...)
	  %2 = scf.for ... iter_args(%0) .. (%bbarg) {
	    %t = tensor.extract_slice %bbarg[..]
	    %c = consumer ins(...) (%bbarg)
	    %r = tensor.insert_slice %t, %bbarg[...]
	  }
	Fusing %0 into the loop requires updating iter_args(%0).
	This transformation is only valid when xxx (the rest of your text + code).
	...
	TODO: instead of a check + failure, insert new iter_args each time a producer is fused into a consumer and fold away unused iter_args.
288	so, just `take_front(sliceOpDepth)` ?
291	This also seems like something that is only valid when you know the enclosing loop structure has been produced exactly by the tiling you expect and nothing has changed in the meantime?
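
One possible reading of the request for a named struct with explicit helper predicates is sketched below in plain C++. It is not the code that was later submitted; `ForLoop`, `ExtractSlice` and `IterArg` are hypothetical stand-ins for `scf::ForOp`, `tensor::ExtractSliceOp` and `BlockArgument`, and the struct and method names are made up for illustration. The innermost-first ordering of `levels` is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for scf::ForOp, tensor::ExtractSliceOp and
// BlockArgument; a real implementation would store the MLIR handles instead.
struct ForLoop {};
struct ExtractSlice {};
struct IterArg {};

// Everything known about one tile loop level, kept together so that the
// "which entries may be null" rules are spelled out as named predicates.
struct TileLoopLevel {
  const ForLoop *loop = nullptr;       // always set for a valid level
  const ExtractSlice *slice = nullptr; // null if no slice at this level
  const IterArg *iterArg = nullptr;    // null if not passed via iter_args

  bool hasSlice() const { return slice != nullptr; }
  bool hasIterArg() const { return iterArg != nullptr; }
  // The use-def chain skips this loop entirely.
  bool isPassThrough() const { return !hasSlice() && !hasIterArg(); }
};

struct TileLoopNest {
  std::vector<TileLoopLevel> levels; // innermost first (assumed ordering)

  std::size_t numSlices() const {
    std::size_t count = 0;
    for (const TileLoopLevel &level : levels)
      if (level.hasSlice())
        ++count;
    return count;
  }

  // Iteration argument of the outermost tile loop, or null if none exists.
  const IterArg *outermostIterArg() const {
    return levels.empty() ? nullptr : levels.back().iterArg;
  }
};
```

Keeping one entry per loop level removes the need to keep three parallel vectors aligned, which is the bookkeeping the comments on lines 225 and 247 ask to hide.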

gysit abandoned this revision. Sep 14 2021, 8:36 AM

Herald added a subscriber: wenzhicui. Sep 14 2021, 8:36 AM

gysit mentioned this in D109766: [mlir][linalg] Fusion on tensors. Sep 14 2021, 9:25 AM

Revision Contents

Path

Size

mlir/

include/

mlir/

Dialect/

Linalg/

Passes.h

2 lines

Passes.td

11 lines

Utils/

Utils.h

39 lines

lib/

Dialect/

Linalg/

Transforms/

CMakeLists.txt

1 line

FusionOnTensors.cpp

452 lines

Utils/

Utils.cpp

23 lines

test/

Dialect/

Linalg/

fusion-on-tensors.mlir

471 lines

Diff 369903

mlir/include/mlir/Dialect/Linalg/Passes.h

	[14 lines not shown]

	#include "mlir/Pass/Pass.h"			#include "mlir/Pass/Pass.h"

	namespace mlir {			namespace mlir {
	std::unique_ptr<OperationPass<FuncOp>> createConvertElementwiseToLinalgPass();			std::unique_ptr<OperationPass<FuncOp>> createConvertElementwiseToLinalgPass();

	std::unique_ptr<OperationPass<FuncOp>> createLinalgFoldUnitExtentDimsPass();			std::unique_ptr<OperationPass<FuncOp>> createLinalgFoldUnitExtentDimsPass();

				std::unique_ptr<OperationPass<FuncOp>> createLinalgFuseTensorOpsPass();

	std::unique_ptr<Pass> createLinalgElementwiseOpFusionPass();			std::unique_ptr<Pass> createLinalgElementwiseOpFusionPass();
	std::unique_ptr<Pass> createFoldReshapeOpsByLinearizationPass();			std::unique_ptr<Pass> createFoldReshapeOpsByLinearizationPass();

	std::unique_ptr<OperationPass<FuncOp>>			std::unique_ptr<OperationPass<FuncOp>>
	createLinalgTilingPass(ArrayRef<int64_t> tileSizes = {});			createLinalgTilingPass(ArrayRef<int64_t> tileSizes = {});

	std::unique_ptr<OperationPass<FuncOp>>			std::unique_ptr<OperationPass<FuncOp>>
	createLinalgTilingToParallelLoopsPass(ArrayRef<int64_t> tileSizes = {});			createLinalgTilingToParallelLoopsPass(ArrayRef<int64_t> tileSizes = {});
	[60 lines not shown]

mlir/include/mlir/Dialect/Linalg/Passes.td

[50 lines not shown]
Option<"foldOneTripLoopsOnly", "fold-one-trip-loops-only", "bool",
"Only folds the one-trip loops from Linalg ops on tensors "		"Only folds the one-trip loops from Linalg ops on tensors "
"(for testing purposes only)">		"(for testing purposes only)">
];		];
let dependentDialects = [		let dependentDialects = [
"linalg::LinalgDialect", "AffineDialect", "memref::MemRefDialect"		"linalg::LinalgDialect", "AffineDialect", "memref::MemRefDialect"
];		];
}		}

		def LinalgFuseTensorOps : FunctionPass<"linalg-fuse-tensor-ops"> {
		nicolasvasilache (Done): Can we keep the same name between the pass and the string? i.e. `def LinalgFuseTensorOps`.
		let summary = "Fuse the producers of tiled operations on tensors";
		nicolasvasilache (Done): Typo `tesnors`.
		let constructor = "mlir::createLinalgFuseTensorOpsPass()";
		let options = [
		ListOption<"tileLoops", "tile-loops", "int64_t",
		"The tiled consumer loop dimensions from outer to inner",
		nicolasvasilache (Not Done): Is this the consumer op tile sizes? I think it is confusing to talk about "loop dimensions" here as this is also dependent on the op semantics.
		"llvm::cl::ZeroOrMore, llvm::cl::MiscFlags::CommaSeparated">,
		];
		let dependentDialects = ["linalg::LinalgDialect", "scf::SCFDialect"];
		}

def LinalgElementwiseOpFusion : Pass<"linalg-fuse-elementwise-ops"> {		def LinalgElementwiseOpFusion : Pass<"linalg-fuse-elementwise-ops"> {
let summary = "Fuse elementwise operations on tensors";		let summary = "Fuse elementwise operations on tensors";
let constructor = "mlir::createLinalgElementwiseOpFusionPass()";		let constructor = "mlir::createLinalgElementwiseOpFusionPass()";
let options = [		let options = [
Option<"allowFoldingUnitDimReshapes", "allow-folding-unit-dim-reshapes",		Option<"allowFoldingUnitDimReshapes", "allow-folding-unit-dim-reshapes",
"bool", /*default=*/"false",		"bool", /*default=*/"false",
"Allow fusing linalg.tensor_reshape ops that performs unit "		"Allow fusing linalg.tensor_reshape ops that performs unit "
"dimension collapsing">		"dimension collapsing">
[177 lines not shown]

mlir/include/mlir/Dialect/Linalg/Utils/Utils.h

	[102 lines not shown]
	/// loop. The number of non-zero values in `tileSizes` should be equal to the			/// loop. The number of non-zero values in `tileSizes` should be equal to the
	/// number of values in `ivs`.			/// number of values in `ivs`.
	SmallVector<Value, 4> makeTiledShapes(OpBuilder &builder, Location loc,			SmallVector<Value, 4> makeTiledShapes(OpBuilder &builder, Location loc,
	LinalgOp linalgOp,			LinalgOp linalgOp,
	ArrayRef<Value> valuesToTile,			ArrayRef<Value> valuesToTile,
	ValueRange ivs, ValueRange tileSizes,			ValueRange ivs, ValueRange tileSizes,
	ArrayRef<Value> sizeBounds);			ArrayRef<Value> sizeBounds);

				/// Add the tile loop induction variables `ivs` to the index op results found in
				/// the body of the `tiledOp` to account for the tile offset.
				void addTileLoopIvsToIndexOpResults(OpBuilder &b, LinalgOp tiledOp,
				ArrayRef<Value> ivs);

	using FusableOpDependencesTy = llvm::MapVector<			using FusableOpDependencesTy = llvm::MapVector<
	Operation *,			Operation *,
	SmallVector<LinalgDependenceGraph::LinalgDependenceGraphElem, 1>>;			SmallVector<LinalgDependenceGraph::LinalgDependenceGraphElem, 1>>;
	FusableOpDependencesTy			FusableOpDependencesTy
	findAllFusableDependences(ArrayRef<LinalgOp> ops,			findAllFusableDependences(ArrayRef<LinalgOp> ops,
	const LinalgDependenceGraph &dependenceGraph);			const LinalgDependenceGraph &dependenceGraph);

	/// A struct containing the Linalg producer before and after fusion.			/// A struct containing the Linalg producer before and after fusion.
	[25 lines not shown]
	/// `extract_slice` op (generally obtained by applying the tiling			/// `extract_slice` op (generally obtained by applying the tiling
	/// transformation). Assumes `producerOfTensor` is a Linalg op that produces			/// transformation). Assumes `producerOfTensor` is a Linalg op that produces
	/// `consumerOpOperand`.			/// `consumerOpOperand`.
	Optional<FusionInfo> fuseProducerOfTensor(OpBuilder &b,			Optional<FusionInfo> fuseProducerOfTensor(OpBuilder &b,
	OpResult producerOpResult,			OpResult producerOpResult,
	OpOperand &consumerOpOperand);			OpOperand &consumerOpOperand);

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
				// Fusion on tensor utilities
				//===----------------------------------------------------------------------===//

				/// A struct storing the state computed by `isProducerFusable` and consumed by
				/// `fuseProducer`.
				struct FusionState {
				/// Result to fuse.
				OpResult producerOpResult;
				mravishankar (Not Done): Curious why we need this and the producer itself. I don't think we can split the producer and "fuse" individual `OpResult` if we have ops with multiple results.
				gysit (Author, Done): The current version of fusion supports fusing producers with multiple outputs but introduces redundant computation. Two follow-up changes could make sense: (1) a canonicalization pattern that removes unused outputs from a linalg op (probably limited to generic ops), and (2) adding the additional outputs of the cloned producer to the tile loop iteration arguments and replacing all uses after the tile loop nest.
				/// Iteration argument of the outermost tile loop if one exists.
				OpOperand *tileLoopIterArg;
				/// All fusable producer loop dimensions.
				SmallVector<int64_t> producerLoopsToFuse;
				/// A producer result shape dimension per fusable producer loop.
				nicolasvasilache (Not Done): Add a note that this is currently limited to 1-1 output permutations to shape dimension but that this will be relaxed in the future?
				SmallVector<int64_t> producerShapeDimsToFuse;
				/// An ExtractSliceOp per fusable producer loop if one exists.
				SmallVector<tensor::ExtractSliceOp> sliceOpsToFuse;
				};

				/// Verify if the producer of `consumerOpOperand` is fusable in place of an
				/// ExtractSliceOp part of the use-def chain connecting consumer and producer.
				/// Use `tileLoopDims` to map the tile loops from outer to inner to the tiled
				/// consumer loop dimension. In case of transitive fusion, tile loop dimensions
				/// may be set to none if the loop does not tile a previously fused operation.
				FailureOr<FusionState>
				isProducerFusable(OpOperand *consumerOpOperand,
				ArrayRef<Optional<int64_t>> tileLoopDims,
				nicolasvasilache (Not Done): I do not understand what tileLoopDims is and why it is needed. From your usage I only see that it is initialized to null everywhere?
				function_ref<void(StringRef)> notifyFailure);

				/// Fuse the producer in place of the ExtractSliceOp found by
				/// `isProducerFusable`. Clones the producer to replace the ExtractSliceOp and
				/// returns the cloned producer.
				LinalgOp fuseProducer(OpBuilder &b, FusionState &fusionState);

				//===----------------------------------------------------------------------===//
	// Distribution utilities			// Distribution utilities
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	/// Scheme used to distribute loops to processors.			/// Scheme used to distribute loops to processors.
	enum class DistributionMethod {			enum class DistributionMethod {
	/// Cyclic distribution where no assumption is made about the dynamic			/// Cyclic distribution where no assumption is made about the dynamic
	/// relationship between number of processors and number of iterations of the			/// relationship between number of processors and number of iterations of the
	/// distributed loop. Distributes the following loop			/// distributed loop. Distributes the following loop
	[122 lines not shown]

mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt

	add_mlir_dialect_library(MLIRLinalgTransforms			add_mlir_dialect_library(MLIRLinalgTransforms
	Bufferize.cpp			Bufferize.cpp
	CodegenStrategy.cpp			CodegenStrategy.cpp
	ComprehensiveBufferize.cpp			ComprehensiveBufferize.cpp
	Detensorize.cpp			Detensorize.cpp
	Distribution.cpp			Distribution.cpp
	DropUnitDims.cpp			DropUnitDims.cpp
	ElementwiseOpFusion.cpp			ElementwiseOpFusion.cpp
	ElementwiseToLinalg.cpp			ElementwiseToLinalg.cpp
	Fusion.cpp			Fusion.cpp
				FusionOnTensors.cpp
	Generalization.cpp			Generalization.cpp
	Hoisting.cpp			Hoisting.cpp
	InlineScalarOperands.cpp			InlineScalarOperands.cpp
	Interchange.cpp			Interchange.cpp
	Loops.cpp			Loops.cpp
	Promotion.cpp			Promotion.cpp
	Tiling.cpp			Tiling.cpp
	Transforms.cpp			Transforms.cpp
	[30 lines not shown]

mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp

This file was added.

				//===- Fusion.cpp - Implementation of linalg Fusion -----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file implements linalg fusion on tensors
				//
				//===----------------------------------------------------------------------===//

				#include "PassDetail.h"
				#include "mlir/Dialect/Affine/IR/AffineOps.h"
				#include "mlir/Dialect/Linalg/IR/LinalgOps.h"
				#include "mlir/Dialect/Linalg/IR/LinalgTypes.h"
				#include "mlir/Dialect/Linalg/Passes.h"
				#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
				#include "mlir/Dialect/Linalg/Utils/Utils.h"
				#include "mlir/Dialect/Tensor/IR/Tensor.h"
				#include "mlir/IR/AffineExpr.h"
				#include "mlir/IR/AffineMap.h"
				#include "mlir/Support/LLVM.h"
				#include "llvm/ADT/TypeSwitch.h"

				using namespace mlir;
				using namespace linalg;

				/// Walk the use-def chain starting from the consumer and collect the
				/// ExtractSliceOps and the ForOps found. Stop the search as soon as an operation
				nicolasvasilache (Done): `consisting of exactly intermediate extract_slice and scf::for ops` ? Also, please mention that the use-def search stops as soon as a different op is found.
				/// of a different kind is found. At the same time, collect the parent tile
				/// loops of type ForOp and add them to `tileLoops`. For every tile loop, add
				/// the ExtractSliceOp or ForOp found to `sliceOps` and `bbArgs`, respectively.
				/// Otherwise, set the `sliceOps` or `bbArgs` entries to nullptr. Finally,
				/// return the back of the use-def chain.
				// TODO: Support additional loop types and control flow operations.
				static Value getTileLoopNest(OpOperand *consumerOpOperand,
				SmallVectorImpl<scf::ForOp> &tileLoops,
				nicolasvasilache (Not Done): I find the new API quite difficult to follow: there are hidden assumptions about when some entries are null but not others and how things are supposed to align that I have a hard time following. Can we have a single vector with struct with a proper name that contains `{scf::ForOp, tensor::ExtractSliceOp, BlockArgument}` with all the properties spelled out explicitly with helper functions? e.g. a strawman struct AGoodName { void isPropertyX(int index) { return tileLoops[index] && sliceOps[index] && !bbArgs[index]; } ... SmallVector<scf::ForOp> tileLoops; SmallVector<tensor::ExtractSliceOp> sliceOps; SmallVector<BlockArgument> bbArgs; }
				SmallVectorImpl<tensor::ExtractSliceOp> &sliceOps,
				SmallVectorImpl<BlockArgument> &bbArgs) {
				// Get the initial value of the use-def chain and the innermost tile loop.
				Value current = consumerOpOperand->get();
				LinalgOp consumerOp = consumerOpOperand->getOwner();
				auto tileLoop = dyn_cast<scf::ForOp>(consumerOp->getParentOp());

				// Walk the tile loop nest.
				while (tileLoop) {
				// Advance to the next tile loop if the current value of the use-def chain
				// is defined outside of the loop.
				if (current.getParentBlock()->getParentOp() != tileLoop) {
				nicolasvasilache (Not Done): LoopLikeInterface has a "isDefinedOutsideOfLoop" method for this.
				tileLoops.push_back(tileLoop);
				sliceOps.push_back(nullptr);
				bbArgs.push_back(nullptr);
				nicolasvasilache (Done): Add a `TODO` to support other types of enclosing ops (e.g. `scf.if`) and other types of loops ? Since we are in value land, we should be able to handle more interesting cases?
				tileLoop = dyn_cast<scf::ForOp>(tileLoop->getParentOp());
				continue;
				mravishankar (Done): would this not be easier to get with the `getUseDefChain` above?
				gysit (Author, Done): I think if we fuse a consumer input that is not passed via iteration argument we still want to find the tile loops. They are needed to make the match between the tile loop dimensions and the extract slice ops.
				}
				// Search an ExtractSliceOp part of the tile loop level.
				auto sliceOp = current.getDefiningOp<tensor::ExtractSliceOp>();
				if (sliceOp && sliceOp->getParentOp() == tileLoop) {
				sliceOps.push_back(sliceOp);
				current = sliceOp.source();
				}
				// Search a ForOp part of the tile loop level.
				if (auto bbArg = current.dyn_cast<BlockArgument>()) {
				Operation *parentOp = bbArg.getParentBlock()->getParentOp();
				auto forOp = dyn_cast<scf::ForOp>(parentOp);
				if (forOp && forOp == tileLoop) {
				bbArgs.push_back(bbArg);
				current = forOp.getOpOperandForRegionIterArg(bbArg).get();
				}
				}
				// Exit if the current value is not an ExtractSliceOp or a ForOp and defined
				// inside the tile loop since we may have found the producer.
				if (sliceOps.size() == tileLoops.size() &&
				bbArgs.size() == tileLoops.size())
				return current;
				// Advance to the next tile loop level and append nullptr to the collections
				// if needed.
				tileLoops.push_back(tileLoop);
				sliceOps.resize(tileLoops.size(), nullptr);
				bbArgs.resize(tileLoops.size(), nullptr);
				tileLoop = dyn_cast<scf::ForOp>(tileLoop->getParentOp());
				}
				return current;
				}

				/// Relate the producer to the consumer loop iterations that access the same
				/// producer result element:
				/// consumerToProducerLoops =
				/// inverse(producerIndexingMap).compose(consumerIndexingMap).
				/// Return `consumerToProducerLoops` or none if the inversion fails.
				static Optional<AffineMap>
				getConsumerToProducerLoopsMap(AffineMap producerIndexingMap,
				AffineMap consumerIndexingMap) {
				assert(consumerIndexingMap.getNumResults() ==
				producerIndexingMap.getNumResults() &&
				"expect the number of indexing map results to match");
				// Ensure the producer indexing map is a projected permutation.
				if (!producerIndexingMap.isProjectedPermutation())
				return None;
				AffineMap inverseIndexingMap =
				inverseAndBroadcastProjectedPermuation(producerIndexingMap);
				return inverseIndexingMap.compose(consumerIndexingMap);
				}

				/// Return the producer loop dimension mapped to the given consumer tile loop
				/// dimension or none if no mapping exists.
				static Optional<int64_t> getProducerLoopDim(int64_t tileLoopDim,
				AffineMap consumerToProducerLoops) {
				assert(count_if(consumerToProducerLoops.getResults(),
				[&](AffineExpr expr) {
				return expr.isFunctionOfDim(tileLoopDim);
				}) <= 1 &&
				"expect the tile loop to tile at most one producer loop");
				for (auto en : enumerate(consumerToProducerLoops.getResults()))
				if (en.value().isFunctionOfDim(tileLoopDim))
				return en.index();
				return None;
				}

				/// Return the bound for the given producer loop dimension.
				static Value getLoopBound(OpBuilder &b, LinalgOp producerOp, int64_t dim) {
				Location loc = producerOp.getLoc();
				for (OpOperand *opOperand : producerOp.getInputAndOutputOperands()) {
				AffineMap indexingMap = producerOp.getTiedIndexingMap(opOperand);
				for (auto en : enumerate(indexingMap.getResults())) {
				auto dimExpr = en.value().dyn_cast<AffineDimExpr>();
				if (dimExpr && dim == static_cast<int64_t>(dimExpr.getPosition()))
				return createOrFoldDimOp(b, loc, opOperand->get(), en.index());
				}
				}
				return nullptr;
				}

				// Tile the producer operands given an ExtractSliceOp part of the use-def chain.
				static SmallVector<Value>
				getTiledOperands(OpBuilder &b, LinalgOp producerOp,
				tensor::ExtractSliceOp sliceOp, ArrayRef<Value> valuesToTile,
				ArrayRef<int64_t> producerLoopsToFuse,
				ArrayRef<int64_t> producerShapeDimsToFuse,
				SmallVectorImpl<Value> &sizeBounds,
				SmallVectorImpl<Value> &allIvs, FusionState &fusionState) {
				Location loc = producerOp.getLoc();
				OpBuilder::InsertionGuard guard(b);
				b.setInsertionPoint(sliceOp);

				// Get the offsets and sizes extracted by the ExtractSliceOp.
				SmallVector<Range> ranges = sliceOp.getOrCreateRanges(b, loc);

				// Get the induction variables and tile sizes for the fused producer loops.
				auto zero = b.create<ConstantIndexOp>(loc, 0);
				SmallVector<Value> tileIvs(producerOp.getNumLoops(), nullptr);
				SmallVector<Value> tileSizes(producerOp.getNumLoops(), zero);
				for (auto it : zip(producerLoopsToFuse, producerShapeDimsToFuse)) {
				int64_t loopDim, shapeDim;
				std::tie(loopDim, shapeDim) = it;
				tileIvs[loopDim] = ranges[shapeDim].offset;
				tileSizes[loopDim] = ranges[shapeDim].size;
				allIvs[loopDim] = tileIvs[loopDim];
				}
				erase_value(tileIvs, nullptr);
				nicolasvasilache (Done): We usually just go for `T mlir::linalg::isProducerFusable(...) {...}`

				// Tile the producer operands given the ivs and tile sizes / size bounds.
				SmallVector<Value> tiledOperands = makeTiledShapes(
				b, loc, producerOp, valuesToTile, tileIvs, tileSizes, sizeBounds);

				// Update the size bounds for the tiled dimensions.
				for (int64_t loopDim : producerLoopsToFuse)
				sizeBounds[loopDim] = tileSizes[loopDim];

				return tiledOperands;
				}

				FailureOr<FusionState>
				mravishankar (Done): Nit: fold this with the check below. An error saying "expected linalg op" implies a block argument is not expected.
				mlir::linalg::isProducerFusable(OpOperand *consumerOpOperand,
				ArrayRef<Optional<int64_t>> tileLoopDims,
				function_ref<void(StringRef)> notifyFailure) {
				// Call the notify failure callback and return failure.
				auto handleFailure = [&](StringRef message) {
				notifyFailure(message);
				return failure();
				};

				// Get the consumer and check it has tensor semantics.
				auto consumerOp = dyn_cast<LinalgOp>(consumerOpOperand->getOwner());
				if (!consumerOp || !consumerOp.hasTensorSemantics())
				return handleFailure("expect consumer to be a linalg op on tensors");

				// Collect the tile loops plus the ExtractSliceOps and BlockArguments for
				mravishankar (Done): Third time traversing this. It's a style comment, so sort of a nit. You could compute the `sliceOps`, `loops` and `bbArgs` at the same time while traversing the use-def chain. The chain is small, so it's not an efficiency thing, but more cognitive overhead.
				gysit (Author, Done): I think one thing that introduces a bit of complexity here is that the code does not enforce an extract slice op for every tile loop (which helps make the unit tests more readable!). Also note that once we start fusing dependencies transitively, there start to be tile loops that are actually not tiling the fused operation... I thus currently believe the choice of getting the loops first and then trying to map the extract slice ops is better. But I will think about simplifying this code!
				gysit (Author, Done): I collect the tile loops and the use def chain now in one method getTileLoopNest and I think it is an improvement!
				// every tile loop level or nullptr if they do not exists.
				SmallVector<scf::ForOp> tileLoops;
				SmallVector<tensor::ExtractSliceOp> sliceOps;
				SmallVector<BlockArgument> bbArgs;
				Value producerResult =
				getTileLoopNest(consumerOpOperand, tileLoops, sliceOps, bbArgs);

				// Check there are tile loops and the tile loop dimensions are known.
				if (tileLoops.empty() || tileLoops.size() > tileLoopDims.size())
				return handleFailure("expect >0 and <=tileLoopDims.size() tile loops");

				// Check the producer is a LinalgOp and has tensor semantics.
				auto producerOp = producerResult.getDefiningOp<LinalgOp>();
				if (!producerOp || !producerOp.hasTensorSemantics())
				return handleFailure("expect producer to be a linalg op on tensors");
				auto producerOpResult = producerResult.cast<OpResult>();

				// Check the parents of producer result and outermost tile loop match.
				scf::ForOp outermostTileLoop = tileLoops.back();
				if (producerResult.getParentBlock()->getParentOp() !=
				outermostTileLoop->getParentOp())
				return handleFailure("expect producer and tile loop parents to match");

				// Compute the consumer to producer loops mapping and exit on failure.
				AffineMap producerIndexinMap = producerOp.getTiedIndexingMap(
				nicolasvasilache (Not Done): I'd lift that as a helper in the LinalgInterfaces; it seems generally useful and a good place to centralize as we generalize.
				producerOp.getOutputOperand(producerOpResult.getResultNumber()));
				AffineMap consumerIndexinMap =
				consumerOp.getTiedIndexingMap(consumerOpOperand);
				Optional<AffineMap> consumerToProducerLoops =
				getConsumerToProducerLoopsMap(producerIndexinMap, consumerIndexinMap);
				if (!consumerToProducerLoops.hasValue())
				return handleFailure("cannot compute consumer to producer loop map");

				// Reverse `sliceOps` and `bbArgs` since they are processed from outer to
				// inner and to match the `tileLoopDims` order.
				std::reverse(sliceOps.begin(), sliceOps.end());
				nicolasvasilache (Not Done): This type of bookkeeping should be internal to the struct I mention above.
				std::reverse(bbArgs.begin(), bbArgs.end());

				// Search the innermost fusable ExtractSliceOp.
				nicolasvasilache (Not Done): This should be a standalone documented helper function.
				int64_t consumerOpDepth = tileLoops.size();
				int64_t sliceOpDepth = 0;
				for (auto en : enumerate(tileLoopDims.take_front(consumerOpDepth))) {
				// Stop fusion if an output operand is passed by BlockArgument into a
				// non-parallel tile loop.
				nicolasvasilache (Not Done): This feels like an assumption that comes from tiling invariants and is used at a distance without guarantees? This isn't something we can generally rely on. This seems to require a "tile and fuse" approach to the problem?
				if (bbArgs[en.index()] && en.value().hasValue()) {
				Attribute iteratorType =
				consumerOp.iterator_types().getValue()[en.value().getValue()];
				if (!isParallelIterator(iteratorType))
				break;
				}
				// Update depth to the innermost ExtractSliceOp found so far.
				if (sliceOps[en.index()])
				sliceOpDepth = en.index() + 1;
				}
				if (sliceOpDepth == 0)
				return handleFailure("expect to find a slice op to replace");

				// Keep only the fusable ExtractSliceOps and BlockArguments.
				nicolasvasilache (Not Done): This is also internal bookkeeping of the struct that should have an intuitive name.
				sliceOps.resize(sliceOpDepth);
				bbArgs.resize(sliceOpDepth);
				int64_t bbArgsCount = sliceOpDepth - count(bbArgs, nullptr);
				if (bbArgsCount != 0 && bbArgsCount != sliceOpDepth)
				return handleFailure("expect one block argument per tile loop or none");

				// If the producer of a consumer output is fused into a tile loop nest, fusion
				nicolasvasilache (Not Done): I'd rephrase this with an example and isolate in a helper function; e.g.
				    Consider the case of fusion of an output tensor:
				      %0 = producer(...)
				      %2 = consumer ins(...) outs(%0)
				    When consumer is tiled, %0 appears as a loop iter_arg, i.e.:
				      %0 = producer(...)
				      %2 = scf.for ... iter_args(%0) .. (%bbarg) {
				        %t = tensor.extract_slice %bbarg[..]
				        %c = consumer ins(...) (%bbarg)
				        %r = tensor.insert_slice %t, %bbarg[...]
				      }
				    Fusing %0 into the loop requires updating iter_args(%0).
				    This transformation is only valid when xxx (the rest of your text + code).
				    ...
				    TODO: instead of a check + failure, insert new iter_args each time a producer is fused into a consumer and fold away unused iter_args.
				// sets the iteration argument of the outermost tile loop to the producer
				// output instead of its result. This transformation is only valid if all
				// values along the use-def chain between the outermost tile loop iteration
				// argument and the ExtractSliceOp are solely used by the consumer. We thus
				// ensure these values either have one use or are used by ExtractSliceOp
				// InsertSliceOp pairs exclusively.
				if (bbArgsCount != 0) {
				// Ensure all slice ops except for the last have one use.
				for (auto sliceOp : sliceOps) {
				if (sliceOp && !(sliceOp->hasOneUse() || sliceOp == sliceOps.back()))
				return handleFailure("expect slice op to have one use");
				}
				// Ensure all block arguments have one use or are used by one ExtractSliceOp
				// InsertSliceOp pair except for possible dim accesses.
				for (auto bbArg : bbArgs) {
				int64_t defaultCount = 0, extractCount = 0, insertCount = 0;
				for (Operation *op : bbArg.getUsers()) {
				TypeSwitch<Operation *>(op)
				.Case<tensor::ExtractSliceOp>([&](auto) { extractCount++; })
				.Case<tensor::InsertSliceOp>([&](auto) { insertCount++; })
				.Case<tensor::DimOp>([&](auto) {})
				.Default([&](auto) { defaultCount++; });
				}
				if (!bbArg.hasOneUse() &&
				!(extractCount == 1 && insertCount == 1 && defaultCount == 0))
				return handleFailure("expect one use or one extract/insert pair");
				}
				}

				// Search the producer loops to fuse and exit if there are none.
				SmallVector<int64_t> producerLoopsToFuse;
				SmallVector<tensor::ExtractSliceOp> sliceOpsToFuse;
				for (auto en : enumerate(tileLoopDims.take_front(consumerOpDepth)
				.drop_back(consumerOpDepth - sliceOpDepth))) {
				nicolasvasilache (Not Done): So, just `take_front(sliceOpDepth)`?
				// Search the tiled producer loop and add it if one exists.
				if (en.value().hasValue()) {
				Optional<int64_t> producerTileLoopDim = getProducerLoopDim(
				nicolasvasilache (Not Done): This also seems like something that is only valid when you know the enclosing loop structure has been produced exactly by the tiling you expect and nothing has changed in the meantime?
				en.value().getValue(), consumerToProducerLoops.getValue());
				if (producerTileLoopDim.hasValue())
				producerLoopsToFuse.push_back(producerTileLoopDim.getValue());
				}
				// Add an ExtractSliceOp to tile if there are producer loops to fuse.
				int64_t numLoopsToTile = producerLoopsToFuse.size() - sliceOpsToFuse.size();
				if (sliceOps[en.index()] && numLoopsToTile != 0) {
				sliceOpsToFuse.insert(sliceOpsToFuse.end(), numLoopsToTile, nullptr);
				sliceOpsToFuse.back() = sliceOps[en.index()];
				}
				}
				if (producerLoopsToFuse.empty())
				return handleFailure("expect to find producer loops to fuse");

				// Search the producer shape dimensions to fuse.
				SmallVector<int64_t> producerShapeDimsToFuse(producerLoopsToFuse.size());
				for (auto en : enumerate(producerLoopsToFuse)) {
				auto *it = find_if(producerIndexinMap.getResults(), [&](AffineExpr expr) {
				AffineDimExpr dimExpr = expr.dyn_cast<AffineDimExpr>();
				return dimExpr && dimExpr.getPosition() == en.value();
				});
				assert(it != producerIndexinMap.getResults().end() &&
				"expect to find the loop in the indexing map");
				producerShapeDimsToFuse[en.index()] =
				std::distance(producerIndexinMap.getResults().begin(), it);
				}

				// Compute the tile loop iteration argument of the outermost tile loop.
				OpOperand *tileLoopIterArg = nullptr;
				if (bbArgsCount != 0)
				tileLoopIterArg =
				&outermostTileLoop.getOpOperandForRegionIterArg(bbArgs.front());

				return FusionState{producerOpResult, tileLoopIterArg, producerLoopsToFuse,
				producerShapeDimsToFuse, sliceOpsToFuse};
				}
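
The block-argument check above accepts a value that either has a single use or is used by exactly one ExtractSliceOp/InsertSliceOp pair, ignoring dim accesses. The rule can be modeled without MLIR. The following is a minimal sketch, assuming a hypothetical `UserKind` enum with one entry per use of the block argument; neither `UserKind` nor `hasFusableUses` is part of the patch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the kinds of operations using a block argument.
enum class UserKind { ExtractSlice, InsertSlice, Dim, Other };

// Returns true if the uses satisfy the rule from the check above: either a
// single use, or exactly one extract_slice and one insert_slice, with any
// number of dim accesses and no other users.
bool hasFusableUses(const std::vector<UserKind> &users) {
  if (users.size() == 1)
    return true;
  int extractCount = 0, insertCount = 0, otherCount = 0;
  for (UserKind kind : users) {
    switch (kind) {
    case UserKind::ExtractSlice:
      ++extractCount;
      break;
    case UserKind::InsertSlice:
      ++insertCount;
      break;
    case UserKind::Dim:
      break; // dim accesses do not count
    case UserKind::Other:
      ++otherCount;
      break;
    }
  }
  return extractCount == 1 && insertCount == 1 && otherCount == 0;
}
```

Under this model, an extract/insert pair plus a dim access is accepted, while a second extract_slice or any unrelated user is rejected.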

				LinalgOp mlir::linalg::fuseProducer(OpBuilder &b, FusionState &fusionState) {
				LinalgOp producerOp = fusionState.producerOpResult.getOwner();

				// Set the insertion point to the producer to compute its bounds.
				OpBuilder::InsertionGuard guard(b);
				b.setInsertionPointAfter(producerOp);

				// Compute the size bounds for all producer loop dimensions.
				SmallVector<Value> sizeBounds(producerOp.getNumLoops());
				for (auto &en : enumerate(sizeBounds)) {
				en.value() = getLoopBound(b, producerOp, en.index());
				assert(en.value() && "cannot compute producer loop bound");
				}

				// Tile the producer operands at every tile loop level associated to an
				// ExtractSliceOp. Tiling the producer operands level-by-level may unlock
				// additional fusion opportunities in case of transitive fusion.
				assert(fusionState.producerLoopsToFuse.size() ==
				fusionState.producerShapeDimsToFuse.size() &&
				"expect one shape dimension per producer loop dimension");
				assert(fusionState.producerLoopsToFuse.size() ==
				fusionState.sliceOpsToFuse.size() &&
				"expect one slice op entry per producer loop dimension");
				int64_t startIndex = 0;
				SmallVector<Value> allIvs(producerOp.getNumLoops(), nullptr);
				SmallVector<Value> tiledOperands = producerOp.getInputAndOutputOperands();
				for (auto en : enumerate(fusionState.sliceOpsToFuse)) {
				// Skip the tile loop if there is no ExtractSliceOp.
				if (!en.value())
				continue;

				// Get the producer loops and shapes tiled by the current tile loop level.
				SmallVector<int64_t> producerLoopsToFuse(
				fusionState.producerLoopsToFuse.begin() + startIndex,
				fusionState.producerLoopsToFuse.begin() + en.index() + 1);
				SmallVector<int64_t> producerShapeDimsToFuse(
				fusionState.producerShapeDimsToFuse.begin() + startIndex,
				fusionState.producerShapeDimsToFuse.begin() + en.index() + 1);
				startIndex = en.index() + 1;
				assert(!producerLoopsToFuse.empty() && !producerShapeDimsToFuse.empty() &&
				"expect the slice op tiles at least one producer loop");

				tiledOperands = getTiledOperands(
				b, producerOp, en.value(), tiledOperands, producerLoopsToFuse,
				producerShapeDimsToFuse, sizeBounds, allIvs, fusionState);
				}

				// Replace the producer result iteration argument of the outermost loop by
				// the tied producer output and use the existing ExtractSliceOp result instead
				// of the tiled producer output.
				if (fusionState.tileLoopIterArg) {
				OpOperand *outputOperand = producerOp.getOutputOperand(
				fusionState.producerOpResult.getResultNumber());
				fusionState.tileLoopIterArg->set(outputOperand->get());
				tiledOperands[outputOperand->getOperandNumber()] =
				fusionState.sliceOpsToFuse.back().getResult();
				}

				// Set the insertion point to after the ExtractSliceOp since the cloned
				// producer may access its result.
				b.setInsertionPointAfter(fusionState.sliceOpsToFuse.back());

				// Clone the producer using the tiled producer operands.
				Location loc = producerOp.getLoc();
				TypeRange resultTypes = ValueRange(tiledOperands)
				.take_back(producerOp.getNumOutputs())
				.getTypes();
				LinalgOp clonedOp = producerOp.clone(b, loc, resultTypes, tiledOperands);

				// Shift all index op results by the tile offset.
				addTileLoopIvsToIndexOpResults(b, clonedOp, allIvs);

				// Cast the cloned op result to bridge type mismatches before canonicalizations.
				Type operandType = fusionState.sliceOpsToFuse.back().getResult().getType();
				int64_t resultNumber = fusionState.producerOpResult.getResultNumber();
				Value result = clonedOp->getResult(resultNumber);
				if (result.getType() != operandType)
				result = b.create<tensor::CastOp>(loc, operandType, result).getResult();

				// Replace all ExtractSliceOp uses except for possible uses by cloned op.
				fusionState.sliceOpsToFuse.back().getResult().replaceAllUsesExcept(result,
				clonedOp);
				return clonedOp;
				}
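
The loop over `sliceOpsToFuse` in `fuseProducer` partitions the producer loops into consecutive groups, where each non-null slice op entry closes the group that started at `startIndex`. The following is a minimal sketch of that grouping, assuming a plain `std::vector<bool>` as a stand-in for the null/non-null slice op entries; `groupLoopsBySliceOp` is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Each entry states whether a slice op is present at that position (a
// stand-in for the non-null entries of `sliceOpsToFuse`). Returns half-open
// index ranges [begin, end) into the list of producer loops to fuse.
std::vector<std::pair<std::size_t, std::size_t>>
groupLoopsBySliceOp(const std::vector<bool> &hasSliceOp) {
  std::vector<std::pair<std::size_t, std::size_t>> groups;
  std::size_t startIndex = 0;
  for (std::size_t i = 0; i < hasSliceOp.size(); ++i) {
    if (!hasSliceOp[i])
      continue; // no slice op here: keep accumulating loops
    groups.emplace_back(startIndex, i + 1);
    startIndex = i + 1;
  }
  return groups;
}
```

For the entries `{false, true, true}`, the sketch yields the ranges `[0, 2)` and `[2, 3)`: the first slice op tiles two producer loops and the second tiles one.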

				namespace {

				struct LinalgFuseTensorOps
				: public LinalgFuseTensorOpsBase<LinalgFuseTensorOps> {

				void runOnFunction() override {
				FuncOp funcOp = getFunction();
				// Search all tiled ops.
				SmallVector<LinalgOp> tiledOps;
				funcOp.walk([&](LinalgOp linalgOp) {
				if (isa<scf::ForOp>(linalgOp->getParentOp()))
				tiledOps.push_back(linalgOp);
				});
				// Try to fuse all producers.
				OpBuilder b(funcOp.getContext());
				SmallVector<Optional<int64_t>> tileLoopDims(tileLoops.begin(),
				tileLoops.end());
				for (auto tiledOp : tiledOps) {
				for (OpOperand *consumerOpOperand : tiledOp.getInputAndOutputOperands()) {
				auto notifyFailure = [&](StringRef message) {
				llvm::errs() << " - LinalgFusionOnTensors (" << tiledOp->getName()
				<< " operand #" << consumerOpOperand->getOperandNumber()
				<< "): " << message << "\n";
				};
				FailureOr<FusionState> fusionState =
				isProducerFusable(consumerOpOperand, tileLoopDims, notifyFailure);
				if (failed(fusionState))
				continue;
				fuseProducer(b, fusionState.getValue());
				}
				}
				}
				};

				} // namespace

				std::unique_ptr<OperationPass<FuncOp>> mlir::createLinalgFuseTensorOpsPass() {
				return std::make_unique<LinalgFuseTensorOps>();
				}

mlir/lib/Dialect/Linalg/Utils/Utils.cpp

		for (OpOperand *opOperand : linalgOp.getInputAndOutputOperands()) {

		tiledShapes.push_back(
		makeTiledShape(b, loc, shapedOp, tileSizes, map, lbs, subShapeSizes));
		}

		return tiledShapes;
		}

		void addTileLoopIvsToIndexOpResults(OpBuilder &b, LinalgOp tiledOp,
		ArrayRef<Value> ivs) {
		if (tiledOp.hasIndexSemantics()) {
		assert(tiledOp->getNumRegions() == 1 &&
		tiledOp->getRegion(0).getBlocks().size() == 1 &&
		"expect producer to have one block.");
		// Shift all index op results by the tile offset.
		Block &block = tiledOp->getRegion(0).front();
		for (IndexOp indexOp : block.getOps<IndexOp>()) {
		if (ivs[indexOp.dim()] == nullptr)
		continue;
		OpBuilder::InsertionGuard g(b);
		b.setInsertionPointAfter(indexOp);
		AffineExpr index, offset;
		bindDims(b.getContext(), index, offset);
		AffineApplyOp applyOp = b.create<AffineApplyOp>(
		indexOp.getLoc(), index + offset,
		ValueRange{indexOp.getResult(), ivs[indexOp.dim()]});
		indexOp.getResult().replaceAllUsesExcept(applyOp, applyOp);
		}
		}
		}

		} // namespace linalg
		} // namespace mlir
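
`addTileLoopIvsToIndexOpResults` rewrites every `linalg.index` result to `index + offset`, where the offset is the tile loop induction variable for that dimension, and skips dimensions without an induction variable. The following is a minimal sketch of that arithmetic on plain integers, assuming `std::optional` as a stand-in for a possibly null `Value`; `shiftIndices` is a hypothetical helper, not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Shifts tile-local indices by the per-dimension tile offsets. A missing
// offset models a dimension that was not tiled, mirroring the nullptr check
// in the function above.
std::vector<int64_t>
shiftIndices(const std::vector<int64_t> &localIndices,
             const std::vector<std::optional<int64_t>> &tileOffsets) {
  std::vector<int64_t> globalIndices(localIndices);
  for (std::size_t dim = 0; dim < globalIndices.size(); ++dim)
    if (tileOffsets[dim].has_value())
      globalIndices[dim] += *tileOffsets[dim];
  return globalIndices;
}
```

With a tile offset of 8 in dimension 0 and no tiling in dimension 1, the local index `(1, 2)` maps back to the global index `(9, 2)`.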

mlir/test/Dialect/Linalg/fusion-on-tensors.mlir

This file was added.

				// RUN: mlir-opt %s -linalg-fuse-tensor-ops="tile-loops=0,1,2,2" -split-input-file --cse | FileCheck %s

				// CHECK-DAG: #[[MAP0:.*]] = affine_map<(d0) -> (4, -d0 + 24)>
				// CHECK-DAG: #[[MAP1:.*]] = affine_map<(d0) -> (4, -d0 + 12)>
				#map = affine_map<(d0) -> (4, -d0 + 25)>

				// CHECK: fuse_input
				// CHECK-SAME: %[[ARG0:[0-9a-zA-Z]*]]: tensor<24x12xf32>
				builtin.func @fuse_input(%arg0: tensor<24x12xf32>,
				%arg1: tensor<12x25xf32>,
				%arg2: tensor<24x25xf32>) -> tensor<24x25xf32> {
				%c0 = constant 0 : index
				%c12 = constant 12 : index
				%c25 = constant 25 : index
				%c24 = constant 24 : index
				%c4 = constant 4 : index
				%cst = constant 0.000000e+00 : f32
				%0 = linalg.fill(%cst, %arg0) : f32, tensor<24x12xf32> -> tensor<24x12xf32>

				// CHECK: scf.for %[[IV0:[0-9a-zA-Z]*]] =
				%1 = scf.for %arg3 = %c0 to %c24 step %c4 iter_args(%arg4 = %arg2) -> (tensor<24x25xf32>) {

				// CHECK: scf.for %[[IV1:[0-9a-zA-Z]*]] =
				%2 = scf.for %arg5 = %c0 to %c25 step %c4 iter_args(%arg6 = %arg4) -> (tensor<24x25xf32>) {
				%3 = affine.min #map(%arg5)

				// CHECK: scf.for %[[IV2:[0-9a-zA-Z]*]] =
				%4 = scf.for %arg7 = %c0 to %c12 step %c4 iter_args(%arg8 = %arg6) -> (tensor<24x25xf32>) {

				// Tile both fill output operand dimensions.
				// CHECK: %[[UB0:.*]] = affine.min #[[MAP0]](%[[IV0]])
				// CHECK: %[[UB1:.*]] = affine.min #[[MAP1]](%[[IV2]])
				// CHECK: %[[T0:.*]] = tensor.extract_slice %[[ARG0]]
				// CHECK-SAME: %[[IV0]], %[[IV2]]
				// CHECK-SAME: %[[UB0]], %[[UB1]]
				// CHECK: %[[T1:.*]] = linalg.fill(%{{.*}}, %[[T0]])
				// CHECK: %[[T2:.*]] = tensor.cast %[[T1]] : tensor<?x?xf32>
				mravishankar (Not Done): Am I reading this right? The fill is coming into the `k` loop?
				gysit (Author, Done): Yes, since the fill is consumed by an input operand, it is fine to fuse it into the k loop in this case. It is a trade-off though, and at some point we may want to control whether fusion may introduce redundant computation and, if so, how much.
				%5 = tensor.extract_slice %0[%arg3, %arg7] [4, 4] [1, 1] : tensor<24x12xf32> to tensor<4x4xf32>
				%6 = tensor.extract_slice %arg1[%arg7, %arg5] [4, %3] [1, 1] : tensor<12x25xf32> to tensor<4x?xf32>
				%7 = tensor.extract_slice %arg8[%arg3, %arg5] [4, %3] [1, 1] : tensor<24x25xf32> to tensor<4x?xf32>

				// CHECK: %{{.*}} = linalg.matmul ins(%[[T2]]
				%8 = linalg.matmul ins(%5, %6 : tensor<4x4xf32>, tensor<4x?xf32>) outs(%7 : tensor<4x?xf32>) -> tensor<4x?xf32>
				%9 = tensor.insert_slice %8 into %arg8[%arg3, %arg5] [4, %3] [1, 1] : tensor<4x?xf32> into tensor<24x25xf32>
				scf.yield %9 : tensor<24x25xf32>
				}
				scf.yield %4 : tensor<24x25xf32>
				}
				scf.yield %2 : tensor<24x25xf32>
				}
				return %1 : tensor<24x25xf32>
				}
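
The `affine.min` maps in these tests, such as `affine_map<(d0) -> (4, -d0 + 25)>`, clamp the tile size at the boundary when the extent is not a multiple of the step. The following is a minimal sketch of the same computation on plain integers; `tileSizes` is a hypothetical helper used only to illustrate the arithmetic.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Computes min(step, extent - iv) for every tile loop iteration, which is
// the value the affine.min map evaluates to at induction variable iv.
std::vector<int64_t> tileSizes(int64_t extent, int64_t step) {
  std::vector<int64_t> sizes;
  for (int64_t iv = 0; iv < extent; iv += step)
    sizes.push_back(std::min(step, extent - iv));
  return sizes;
}
```

For an extent of 25 and a step of 4, the loop runs seven times and the last tile has size 1, which is why the sliced operands along that dimension have a dynamic size (`tensor<4x?xf32>`). For an extent of 24, all six tiles have size 4 and the slice sizes stay static.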

				// -----

				// CHECK-DAG: #[[MAP0:.*]] = affine_map<(d0) -> (4, -d0 + 25)>
				// CHECK-DAG: #[[MAP1:.*]] = affine_map<(d0, d1) -> (d1, -d0 + 25)>
				#map = affine_map<(d0) -> (4, -d0 + 25)>

				// CHECK: fuse_input_2d_tiling
				// CHECK-SAME: %[[ARG1:[0-9a-zA-Z]*]]: tensor<12x25xf32>
				builtin.func @fuse_input_2d_tiling(%arg0: tensor<24x12xf32>,
				%arg1: tensor<12x25xf32>,
				%arg2: tensor<24x25xf32>) -> tensor<24x25xf32> {
				%c0 = constant 0 : index
				%c25 = constant 25 : index
				%c24 = constant 24 : index
				%c4 = constant 4 : index
				%cst = constant 0.000000e+00 : f32
				%0 = linalg.fill(%cst, %arg1) : f32, tensor<12x25xf32> -> tensor<12x25xf32>

				// CHECK: scf.for %[[IV0:[0-9a-zA-Z]*]] =
				%1 = scf.for %arg3 = %c0 to %c24 step %c4 iter_args(%arg4 = %arg2) -> (tensor<24x25xf32>) {
				%2 = tensor.extract_slice %arg0[%arg3, 0] [4, 12] [1, 1] : tensor<24x12xf32> to tensor<4x12xf32>

				// CHECK: scf.for %[[IV1:[0-9a-zA-Z]*]] =
				%3 = scf.for %arg5 = %c0 to %c25 step %c4 iter_args(%arg6 = %arg4) -> (tensor<24x25xf32>) {

				// CHECK: %[[TS1:.*]] = affine.min #[[MAP0]](%[[IV1]])
				%4 = affine.min #map(%arg5)

				// Tile the second fill output operand dimension taking into account the
				// domain size is not an integer multiple of the step.
				// CHECK: %[[UB1:.*]] = affine.min #[[MAP1]](%[[IV1]], %[[TS1]])
				// CHECK: %[[T0:.*]] = tensor.extract_slice %[[ARG1]]
				// CHECK-SAME: 0, %[[IV1]]
				// CHECK-SAME: 12, %[[UB1]]
				// CHECK: %[[T1:.*]] = linalg.fill(%{{.*}}, %[[T0]])
				%5 = tensor.extract_slice %0[0, %arg5] [12, %4] [1, 1] : tensor<12x25xf32> to tensor<12x?xf32>
				%6 = tensor.extract_slice %arg6[%arg3, %arg5] [4, %4] [1, 1] : tensor<24x25xf32> to tensor<4x?xf32>

				// CHECK: %{{.*}} = linalg.matmul ins({{.*}}, %[[T1]]
				%7 = linalg.matmul ins(%2, %5 : tensor<4x12xf32>, tensor<12x?xf32>) outs(%6 : tensor<4x?xf32>) -> tensor<4x?xf32>
				%8 = tensor.insert_slice %7 into %arg6[%arg3, %arg5] [4, %4] [1, 1] : tensor<4x?xf32> into tensor<24x25xf32>
				scf.yield %8 : tensor<24x25xf32>
				}
				scf.yield %3 : tensor<24x25xf32>
				}
				return %1 : tensor<24x25xf32>
				}

				// -----

				// CHECK-DAG: #[[MAP0:.*]] = affine_map<(d0) -> (4, -d0 + 25)>
				#map = affine_map<(d0) -> (4, -d0 + 25)>

				// CHECK: fuse_output
				// CHECK-SAME: %[[ARG2:[0-9a-zA-Z]*]]: tensor<24x25xf32>
				builtin.func @fuse_output(%arg0: tensor<24x12xf32>,
				%arg1: tensor<12x25xf32>,
				%arg2: tensor<24x25xf32>) -> tensor<24x25xf32> {
				%c0 = constant 0 : index
				%c12 = constant 12 : index
				%c25 = constant 25 : index
				%c24 = constant 24 : index
				%c4 = constant 4 : index
				%cst = constant 0.000000e+00 : f32
				%0 = linalg.fill(%cst, %arg2) : f32, tensor<24x25xf32> -> tensor<24x25xf32>

				// Verify the iteration argument is updated and the extract slice reused.
				// CHECK: scf.for %[[IV0:.*]] = {{.*}} iter_args(%{{.*}} = %[[ARG2]]
				%1 = scf.for %arg3 = %c0 to %c24 step %c4 iter_args(%arg4 = %0) -> (tensor<24x25xf32>) {

				// CHECK: scf.for %[[IV1:.*]] = {{.*}} iter_args(%[[ARG3:.*]] =
				%2 = scf.for %arg5 = %c0 to %c25 step %c4 iter_args(%arg6 = %arg4) -> (tensor<24x25xf32>) {

				// CHECK: %[[TS1:.*]] = affine.min #[[MAP0]](%[[IV1]])
				%3 = affine.min #map(%arg5)

				// CHECK: %[[T0:.*]] = tensor.extract_slice %[[ARG3]]
				// CHECK-SAME: %[[IV0]], %[[IV1]]
				// CHECK-SAME: 4, %[[TS1]]
				// CHECK: %[[T1:.*]] = linalg.fill(%{{.*}}, %[[T0]])
				%4 = tensor.extract_slice %arg6[%arg3, %arg5] [4, %3] [1, 1] : tensor<24x25xf32> to tensor<4x?xf32>

				// CHECK: scf.for %[[IV2:.*]] = {{.*}} iter_args(%[[ARG4:.*]] = %[[T1]]
				%5 = scf.for %arg7 = %c0 to %c12 step %c4 iter_args(%arg8 = %4) -> (tensor<4x?xf32>) {
				%7 = tensor.extract_slice %arg0[%arg3, %arg7] [4, 4] [1, 1] : tensor<24x12xf32> to tensor<4x4xf32>
				%8 = tensor.extract_slice %arg1[%arg7, %arg5] [4, %3] [1, 1] : tensor<12x25xf32> to tensor<4x?xf32>

				// CHECK: %{{.*}} = linalg.matmul {{.*}} outs(%[[ARG4]]
				%9 = linalg.matmul ins(%7, %8 : tensor<4x4xf32>, tensor<4x?xf32>) outs(%arg8 : tensor<4x?xf32>) -> tensor<4x?xf32>
				scf.yield %9 : tensor<4x?xf32>
				}
				%6 = tensor.insert_slice %5 into %arg6[%arg3, %arg5] [4, %3] [1, 1] : tensor<4x?xf32> into tensor<24x25xf32>
				scf.yield %6 : tensor<24x25xf32>
				}
				scf.yield %2 : tensor<24x25xf32>
				}
				return %1 : tensor<24x25xf32>
				}

				// -----

				// CHECK-DAG: #[[MAP0:.*]] = affine_map<(d0) -> (4, -d0 + 25)>
				// CHECK-DAG: #[[MAP1:.*]] = affine_map<(d0) -> (4, -d0 + 12)>
				// CHECK-DAG: #[[MAP2:.*]] = affine_map<(d0, d1) -> (d1, -d0 + 25)>
				#map0 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
				#map1 = affine_map<(d0, d1, d2) -> (d0, d2)>
				#map2 = affine_map<(d0) -> (4, -d0 + 25)>

				// CHECK: fuse_reduction
				// CHECK-SAME: %[[ARG0:[0-9a-zA-Z]*]]: tensor<24x12xf32>
				// CHECK-SAME: %[[ARG1:[0-9a-zA-Z]*]]: tensor<12x25xf32>
				// CHECK-SAME: %[[ARG2:[0-9a-zA-Z]*]]: tensor<24x25xf32>
				// CHECK-SAME: %[[ARG3:[0-9a-zA-Z]*]]: tensor<12x7x25xf32>
				builtin.func @fuse_reduction(%arg0: tensor<24x12xf32>, %arg1: tensor<12x25xf32>, %arg2: tensor<24x25xf32>, %arg3: tensor<12x7x25xf32>) -> tensor<24x25xf32> {
				%c0 = constant 0 : index
				%c12 = constant 12 : index
				%c25 = constant 25 : index
				%c24 = constant 24 : index
				%c4 = constant 4 : index
				%0 = linalg.generic {indexing_maps = [#map0, #map1], iterator_types = ["parallel", "reduction", "parallel"]} ins(%arg3 : tensor<12x7x25xf32>) outs(%arg1 : tensor<12x25xf32>) {
				^bb0(%arg4: f32, %arg5: f32): // no predecessors
				%2 = addf %arg4, %arg5 : f32
				linalg.yield %2 : f32
				} -> tensor<12x25xf32>

				// CHECK: scf.for %[[IV0:[0-9a-zA-Z]*]] =
				%1 = scf.for %arg4 = %c0 to %c24 step %c4 iter_args(%arg5 = %arg2) -> (tensor<24x25xf32>) {

				// CHECK: scf.for %[[IV1:[0-9a-zA-Z]*]] =
				%2 = scf.for %arg6 = %c0 to %c25 step %c4 iter_args(%arg7 = %arg5) -> (tensor<24x25xf32>) {

				// CHECK: %[[TS1:.*]] = affine.min #[[MAP0]](%[[IV1]])
				%3 = affine.min #map2(%arg6)

				// CHECK: scf.for %[[IV2:[0-9a-zA-Z]*]] =
				%4 = scf.for %arg8 = %c0 to %c12 step %c4 iter_args(%arg9 = %arg7) -> (tensor<24x25xf32>) {
				%5 = tensor.extract_slice %arg0[%arg4, %arg8] [4, 4] [1, 1] : tensor<24x12xf32> to tensor<4x4xf32>

				// CHECK: %[[UB0:.*]] = affine.min #[[MAP1]](%[[IV2]])
				// CHECK: %[[UB1:.*]] = affine.min #[[MAP2]](%[[IV1]], %[[TS1]])
				// CHECK: %[[T0:.*]] = tensor.extract_slice %[[ARG3]]
				// CHECK-SAME: %[[IV2]], 0, %[[IV1]]
				// CHECK-SAME: %[[UB0]], 7, %[[UB1]]
				// CHECK: %[[T1:.*]] = tensor.extract_slice %[[ARG1]]
				// CHECK-SAME: %[[IV2]], %[[IV1]]
				// CHECK-SAME: %[[UB0]], %[[UB1]]
				// CHECK: %[[T2:.*]] = linalg.generic {{.*}} ins(%[[T0]] {{.*}} outs(%[[T1]]
				// CHECK: %[[T3:.*]] = tensor.cast %[[T2]] : tensor<?x?xf32> to
				%6 = tensor.extract_slice %0[%arg8, %arg6] [4, %3] [1, 1] : tensor<12x25xf32> to tensor<4x?xf32>
				%7 = tensor.extract_slice %arg9[%arg4, %arg6] [4, %3] [1, 1] : tensor<24x25xf32> to tensor<4x?xf32>

				// CHECK: %{{.*}} = linalg.matmul ins(%{{.*}}, %[[T3]]
				%8 = linalg.matmul ins(%5, %6 : tensor<4x4xf32>, tensor<4x?xf32>) outs(%7 : tensor<4x?xf32>) -> tensor<4x?xf32>
				%9 = tensor.insert_slice %8 into %arg9[%arg4, %arg6] [4, %3] [1, 1] : tensor<4x?xf32> into tensor<24x25xf32>
				scf.yield %9 : tensor<24x25xf32>
				}
				scf.yield %4 : tensor<24x25xf32>
				}
				scf.yield %2 : tensor<24x25xf32>
				}
				return %1 : tensor<24x25xf32>
				}

				// -----

				#map = affine_map<(d0) -> (4, -d0 + 25)>

				// CHECK: fuse_output_not_fusable
				// CHECK-SAME: %[[ARG2:[0-9a-zA-Z]*]]: tensor<24x25xf32>
				builtin.func @fuse_output_not_fusable(%arg0: tensor<24x12xf32>,
				%arg1: tensor<12x25xf32>,
				%arg2: tensor<24x25xf32>) -> tensor<24x25xf32> {
				%c0 = constant 0 : index
				%c12 = constant 12 : index
				%c25 = constant 25 : index
				%c24 = constant 24 : index
				%c4 = constant 4 : index
				%cst = constant 0.0 : f32

				// CHECK: %[[T1:.*]] = linalg.fill(%{{.*}}, %[[ARG2]])
				%0 = linalg.fill(%cst, %arg2) : f32, tensor<24x25xf32> -> tensor<24x25xf32>

				// CHECK: scf.for {{.*}} iter_args(%{{.*}} = %[[T1]]
				%1 = scf.for %arg3 = %c0 to %c24 step %c4 iter_args(%arg4 = %0) -> (tensor<24x25xf32>) {

				// CHECK: scf.for
				%2 = scf.for %arg5 = %c0 to %c25 step %c4 iter_args(%arg6 = %arg4) -> (tensor<24x25xf32>) {
				%3 = affine.min #map(%arg5)

				// Cannot fuse producer in place of the slice op inside the reduction loop.
				// CHECK: scf.for
				// CHECK-NOT: linalg.fill
				%4 = scf.for %arg7 = %c0 to %c12 step %c4 iter_args(%arg8 = %arg6) -> (tensor<24x25xf32>) {
				%5 = tensor.extract_slice %arg0[%arg3, %arg7] [4, 4] [1, 1] : tensor<24x12xf32> to tensor<4x4xf32>
				%6 = tensor.extract_slice %arg1[%arg7, %arg5] [4, %3] [1, 1] : tensor<12x25xf32> to tensor<4x?xf32>
				%7 = tensor.extract_slice %arg8[%arg3, %arg5] [4, %3] [1, 1] : tensor<24x25xf32> to tensor<4x?xf32>
				%8 = linalg.matmul ins(%5, %6 : tensor<4x4xf32>, tensor<4x?xf32>) outs(%7 : tensor<4x?xf32>) -> tensor<4x?xf32>
				%9 = tensor.insert_slice %8 into %arg8[%arg3, %arg5] [4, %3] [1, 1] : tensor<4x?xf32> into tensor<24x25xf32>
				scf.yield %9 : tensor<24x25xf32>
				}
				scf.yield %4 : tensor<24x25xf32>
				}
				scf.yield %2 : tensor<24x25xf32>
				}
				return %1 : tensor<24x25xf32>
				}

				// -----

				// CHECK-DAG: #[[MAP0:.*]] = affine_map<(d0)[s0] -> (4, -d0 + s0)>
				// CHECK-DAG: #[[MAP1:.*]] = affine_map<(d0, d1)[s0] -> (d1, -d0 + s0)>
				#map0 = affine_map<(d0)[s0] -> (4, -d0 + s0)>
				#map1 = affine_map<(d0, d1) -> (4, d0 - d1)>

				// CHECK: fuse_twisted_dynamic
				// CHECK-SAME: %[[ARG0:[0-9a-zA-Z]*]]: tensor<?x?xf32>
				// CHECK-SAME: %[[ARG1:[0-9a-zA-Z]*]]: tensor<?x?xf32>
				// CHECK-SAME: %[[ARG2:[0-9a-zA-Z]*]]: tensor<?x?xf32>
				builtin.func @fuse_twisted_dynamic(%arg0: tensor<?x?xf32>,
				%arg1: tensor<?x?xf32>,
				%arg2: tensor<?x?xf32>) -> tensor<?x?xf32> {

				// CHECK-DAG: %[[C0:.*]] = constant 0
				// CHECK-DAG: %[[C1:.*]] = constant 1
				%c1 = constant 1 : index
				%c0 = constant 0 : index
				%c4 = constant 4 : index

				// CHECK: %[[T0:.*]] = linalg.generic
				// CHECK-DAG: %[[DIM0_T0:.*]] = tensor.dim %[[T0]], %[[C0]]
				// CHECK-DAG: %[[DIM1_T0:.*]] = tensor.dim %[[T0]], %[[C1]]
				%0 = linalg.generic {
				indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>],
				iterator_types = ["parallel", "parallel"]}
				outs(%arg1 : tensor<?x?xf32>) {
				^bb0(%arg3: f32): // no predecessors
				linalg.yield %arg3 : f32
				} -> tensor<?x?xf32>
				%1 = tensor.dim %arg0, %c0 : tensor<?x?xf32>
				%2 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
				%3 = tensor.dim %0, %c1 : tensor<?x?xf32>
				%4 = tensor.dim %0, %c0 : tensor<?x?xf32>

				// CHECK: scf.for %[[IV0:[0-9a-zA-Z]*]] =
				%5 = scf.for %arg3 = %c0 to %1 step %c4 iter_args(%arg4 = %arg2) -> (tensor<?x?xf32>) {
				%6 = affine.min #map0(%arg3)[%1]

				// CHECK: scf.for %[[IV1:[0-9a-zA-Z]*]] =
				%7 = scf.for %arg5 = %c0 to %3 step %c4 iter_args(%arg6 = %arg4) -> (tensor<?x?xf32>) {

				// CHECK: %[[TS1:.*]] = affine.min #[[MAP0]](%[[IV1]])[%[[DIM1_T0]]]
				%8 = affine.min #map0(%arg5)[%3]

				// CHECK: scf.for %[[IV2:[0-9a-zA-Z]*]] =
				%9 = scf.for %arg7 = %c0 to %2 step %c4 iter_args(%arg8 = %arg6) -> (tensor<?x?xf32>) {
				%10 = affine.min #map0(%arg7)[%2]
				%11 = tensor.extract_slice %arg0[%arg3, %arg7] [%6, %10] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>

				// CHECK: %[[TS2:.*]] = affine.min #[[MAP0]](%[[IV2]])[%[[DIM0_T0]]]
				%12 = affine.min #map0(%arg7)[%4]

				// Tile the producer output operand and take into account the domain
				// size is dynamic and may not be an integer multiple of the step.
				// CHECK: %[[DIM0_ARG1:.*]] = tensor.dim %[[ARG1]], %[[C0]]
				// CHECK: %[[UB0:.*]] = affine.min #[[MAP1]](%[[IV2]], %[[TS2]])[%[[DIM0_ARG1]]]
				// CHECK: %[[DIM1_ARG1:.*]] = tensor.dim %[[ARG1]], %[[C1]]
				// CHECK: %[[UB1:.*]] = affine.min #[[MAP1]](%[[IV1]], %[[TS1]])[%[[DIM1_ARG1]]]
				// CHECK: %[[T1:.*]] = tensor.extract_slice %[[ARG1]]
				// CHECK-SAME: %[[IV2]], %[[IV1]]
				// CHECK-SAME: %[[UB0]], %[[UB1]]
				// CHECK: %[[T2:.*]] = linalg.generic {{.*}} outs(%[[T1]]
				%13 = tensor.extract_slice %0[%arg7, %arg5] [%12, %8] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>
				%14 = tensor.dim %arg8, %c0 : tensor<?x?xf32>
				%15 = affine.min #map1(%14, %arg3)
				%16 = tensor.dim %arg8, %c1 : tensor<?x?xf32>
				%17 = affine.min #map1(%16, %arg5)
				%18 = tensor.extract_slice %arg8[%arg3, %arg5] [%15, %17] [1, 1] : tensor<?x?xf32> to tensor<?x?xf32>

				// CHECK: %{{.*}} = linalg.matmul ins(%{{.*}}, %[[T2]]
				%19 = linalg.matmul ins(%11, %13 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%18 : tensor<?x?xf32>) -> tensor<?x?xf32>
				%20 = tensor.insert_slice %19 into %arg8[%arg3, %arg5] [%15, %17] [1, 1] : tensor<?x?xf32> into tensor<?x?xf32>
				scf.yield %20 : tensor<?x?xf32>
				}
				scf.yield %9 : tensor<?x?xf32>
				}
				scf.yield %7 : tensor<?x?xf32>
				}
				return %5 : tensor<?x?xf32>
				}

				// -----

				// CHECK-DAG: #[[MAP0:.*]] = affine_map<(d0, d1) -> (d0 + d1)>
				#map0 = affine_map<(d0, d1) -> (d1, d0)>
				#map1 = affine_map<(d0)[s0] -> (4, -d0 + s0)>
				#map2 = affine_map<(d0, d1) -> (4, d0 - d1)>

				// CHECK: fuse_index_dynamic
				builtin.func @fuse_index_dynamic(%arg0: tensor<?x?xi32>,
				%arg1: tensor<?x?xi32>,
				%arg2: tensor<?x?xi32>) -> tensor<?x?xi32> {
				%c1 = constant 1 : index
				%c0 = constant 0 : index
				%c4 = constant 4 : index
				%0 = linalg.generic {
				indexing_maps = [#map0],
				iterator_types = ["parallel", "parallel"]}
				outs(%arg1 : tensor<?x?xi32>) {
				^bb0(%arg3: i32): // no predecessors
				%6 = linalg.index 0 : index
				%7 = linalg.index 1 : index
				%8 = addi %6, %7 : index
				%9 = index_cast %8 : index to i32
				linalg.yield %9 : i32
				} -> tensor<?x?xi32>
				%1 = tensor.dim %arg0, %c0 : tensor<?x?xi32>
				%2 = tensor.dim %0, %c1 : tensor<?x?xi32>
				%3 = tensor.dim %arg0, %c1 : tensor<?x?xi32>
				%4 = tensor.dim %0, %c0 : tensor<?x?xi32>

				// CHECK: scf.for %[[IV0:[0-9a-zA-Z]*]] =
				%5 = scf.for %arg3 = %c0 to %1 step %c4 iter_args(%arg4 = %arg2) -> (tensor<?x?xi32>) {
				%6 = affine.min #map1(%arg3)[%1]
				%7 = tensor.extract_slice %arg0[%arg3, 0] [%6, %3] [1, 1] : tensor<?x?xi32> to tensor<?x?xi32>

				// CHECK: scf.for %[[IV1:[0-9a-zA-Z]*]] =
				%8 = scf.for %arg5 = %c0 to %2 step %c4 iter_args(%arg6 = %arg4) -> (tensor<?x?xi32>) {
				%9 = affine.min #map1(%arg5)[%2]
				%10 = tensor.extract_slice %0[0, %arg5] [%4, %9] [1, 1] : tensor<?x?xi32> to tensor<?x?xi32>
				%11 = tensor.dim %arg6, %c0 : tensor<?x?xi32>
				%12 = affine.min #map2(%11, %arg3)
				%13 = tensor.dim %arg6, %c1 : tensor<?x?xi32>
				%14 = affine.min #map2(%13, %arg5)

				// Shift only index 0: only the second dimension of the producer output is
				// tiled, and the transposed indexing map (d0, d1) -> (d1, d0) maps that
				// output dimension to iteration dimension 0.
				// CHECK: linalg.generic
				// CHECK: %[[IDX0:.*]] = linalg.index 0
				// CHECK: %[[IDX0_SHIFTED:.*]] = affine.apply #[[MAP0]](%[[IDX0]], %[[IV1]])
				// CHECK: %[[IDX1:.*]] = linalg.index 1
				// CHECK: %{{.*}} = addi %[[IDX0_SHIFTED]], %[[IDX1]]
				%15 = tensor.extract_slice %arg6[%arg3, %arg5] [%12, %14] [1, 1] : tensor<?x?xi32> to tensor<?x?xi32>
				%16 = linalg.matmul ins(%7, %10 : tensor<?x?xi32>, tensor<?x?xi32>) outs(%15 : tensor<?x?xi32>) -> tensor<?x?xi32>
				%17 = tensor.insert_slice %16 into %arg6[%arg3, %arg5] [%12, %14] [1, 1] : tensor<?x?xi32> into tensor<?x?xi32>
				scf.yield %17 : tensor<?x?xi32>
				}
				scf.yield %8 : tensor<?x?xi32>
				}
				return %5 : tensor<?x?xi32>
				}
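
The index shift checked in `fuse_index_dynamic` can be modelled outside MLIR. The following is a minimal Python sketch, not part of the patch: the helper names `produce` and `fused_tile` and the concrete shape 5x10 are assumptions made for illustration. It mimics the producer, whose transposed indexing map `(d0, d1) -> (d1, d0)` writes `d0 + d1` to `out[d1][d0]`, and checks that a producer tile computed with `linalg.index 0` shifted by the tile loop induction variable matches the corresponding column slice of the untiled result.

```python
def produce(rows, cols, d0_offset=0):
    """Model of the producer linalg.generic (hypothetical helper).

    The indexing map (d0, d1) -> (d1, d0) means iteration point (d0, d1)
    writes output element [d1][d0]; the body yields index0 + index1.
    d0_offset models the affine.apply (d0 + d1) shift applied to index 0.
    """
    out = [[0] * cols for _ in range(rows)]
    for d0 in range(cols):
        for d1 in range(rows):
            out[d1][d0] = (d0 + d0_offset) + d1
    return out


def fused_tile(rows, iv1, tile_size):
    """Producer tile for output columns [iv1, iv1 + tile_size)."""
    # Only the second output dimension is tiled, so only index 0 is shifted.
    return produce(rows, tile_size, d0_offset=iv1)


ROWS, COLS, STEP = 5, 10, 4  # assumed sizes, chosen so the last tile is partial
full = produce(ROWS, COLS)
for iv1 in range(0, COLS, STEP):
    ts = min(STEP, COLS - iv1)  # same bound as affine_map<(d0)[s0] -> (4, -d0 + s0)>
    expected = [row[iv1:iv1 + ts] for row in full]
    assert fused_tile(ROWS, iv1, ts) == expected
```

Shifting index 1 instead would break the equality, because iteration dimension 1 indexes the untiled first output dimension.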

				// -----

				// CHECK-DAG: #[[MAP0:.*]] = affine_map<(d0) -> (4, -d0 + 24)>
				// CHECK-DAG: #[[MAP1:.*]] = affine_map<(d0) -> (4, -d0 + 12)>
				// CHECK-DAG: #[[MAP2:.*]] = affine_map<(d0, d1) -> (2, d0 - d1)>
				#map = affine_map<(d0) -> (4, -d0 + 25)>

				// CHECK: fuse_input_4d_tiling
				builtin.func @fuse_input_4d_tiling(%arg0: tensor<24x12xf32>,
				%arg1: tensor<12x25xf32>,
				%arg2: tensor<24x25xf32>) -> tensor<24x25xf32> {
				%c0 = constant 0 : index
				%c4 = constant 4 : index
				%c2 = constant 2 : index
				%cst = constant 0.000000e+00 : f32
				%c24 = constant 24 : index
				%c25 = constant 25 : index
				%c12 = constant 12 : index
				%0 = linalg.fill(%cst, %arg0) : f32, tensor<24x12xf32> -> tensor<24x12xf32>

				// CHECK: scf.for %[[IV0:[0-9a-zA-Z]*]] =
				%1 = scf.for %arg3 = %c0 to %c24 step %c4 iter_args(%arg4 = %arg2) -> (tensor<24x25xf32>) {

				// CHECK: scf.for %[[IV1:[0-9a-zA-Z]*]] =
				%2 = scf.for %arg5 = %c0 to %c25 step %c4 iter_args(%arg6 = %arg4) -> (tensor<24x25xf32>) {
				%3 = affine.min #map(%arg5)

				// CHECK: scf.for %[[IV2:[0-9a-zA-Z]*]] =
				%4 = scf.for %arg7 = %c0 to %c12 step %c4 iter_args(%arg8 = %arg6) -> (tensor<24x25xf32>) {

				// Tile both fill output operand dimensions.
				// CHECK: %[[UB0:.*]] = affine.min #[[MAP0]](%[[IV0]])
				// CHECK: %[[UB1:.*]] = affine.min #[[MAP1]](%[[IV2]])
				// CHECK: %[[T0:.*]] = tensor.extract_slice %[[ARG0]]
				// CHECK-SAME: %[[IV0]], %[[IV2]]
				// CHECK-SAME: %[[UB0]], %[[UB1]]
				%5 = tensor.extract_slice %0[%arg3, %arg7] [4, 4] [1, 1] : tensor<24x12xf32> to tensor<4x4xf32>
				%6 = tensor.extract_slice %arg1[%arg7, %arg5] [4, %3] [1, 1] : tensor<12x25xf32> to tensor<4x?xf32>
				%7 = tensor.extract_slice %arg8[%arg3, %arg5] [4, %3] [1, 1] : tensor<24x25xf32> to tensor<4x?xf32>

				// CHECK: scf.for %[[IV3:[0-9a-zA-Z]*]] =
				%8 = scf.for %arg9 = %c0 to %c4 step %c2 iter_args(%arg10 = %7) -> (tensor<4x?xf32>) {

				// Tile only the second fill output operand dimension.
				// CHECK: %[[UB3:.*]] = affine.min #[[MAP2]](%[[UB1]], %[[IV3]])

				// CHECK: %[[T1:.*]] = tensor.extract_slice %[[T0]]
				// CHECK-SAME: 0, %[[IV3]]
				// CHECK-SAME: %[[UB0]], %[[UB3]]
				// CHECK: %[[T2:.*]] = linalg.fill(%{{.*}}, %[[T1]])
				// CHECK: %[[T3:.*]] = tensor.cast %[[T2]] : tensor<?x?xf32> to
				%10 = tensor.extract_slice %5[0, %arg9] [4, 2] [1, 1] : tensor<4x4xf32> to tensor<4x2xf32>
				%11 = tensor.extract_slice %6[%arg9, 0] [2, %3] [1, 1] : tensor<4x?xf32> to tensor<2x?xf32>

				// CHECK: %{{.*}} = linalg.matmul ins(%[[T3]]
				%12 = linalg.matmul ins(%10, %11 : tensor<4x2xf32>, tensor<2x?xf32>) outs(%arg10 : tensor<4x?xf32>) -> tensor<4x?xf32>
				scf.yield %12 : tensor<4x?xf32>
				}
				%9 = tensor.insert_slice %8 into %arg8[%arg3, %arg5] [4, %3] [1, 1] : tensor<4x?xf32> into tensor<24x25xf32>
				scf.yield %9 : tensor<24x25xf32>
				}
				scf.yield %4 : tensor<24x25xf32>
				}
				scf.yield %2 : tensor<24x25xf32>
				}
				return %1 : tensor<24x25xf32>
				}
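
The `affine.min` maps in `fuse_input_4d_tiling` bound each tile by the remaining extent of the dimension. As a rough illustration (a Python sketch under the assumption that a loop runs from 0 to `upper` with a fixed `step`; the function name `tile_sizes` is hypothetical), the tile sizes produced by maps of the form `(d0) -> (4, -d0 + upper)` can be enumerated as follows:

```python
def tile_sizes(upper, step):
    """Tile size at each induction variable value of `for iv = 0 to upper step step`.

    Mirrors affine.min over (step, -iv + upper): a full tile, or whatever is
    left at the boundary when upper is not a multiple of step.
    """
    return [min(step, upper - iv) for iv in range(0, upper, step)]


# Dimension of size 25 tiled by 4 (the #map in the test input): partial last tile.
print(tile_sizes(25, 4))  # [4, 4, 4, 4, 4, 4, 1]
# Dimensions of size 24 and 12 tiled by 4 (MAP0 and MAP1 in the CHECK lines).
print(tile_sizes(24, 4))  # [4, 4, 4, 4, 4, 4]
print(tile_sizes(12, 4))  # [4, 4, 4]
```

For 24 and 12 the minimum always evaluates to 4, which is consistent with the test input using static `[4, 4]` slice sizes for those dimensions and a dynamic size only for the dimension of extent 25.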

Revision Contents

Diff 369903

mlir/include/mlir/Dialect/Linalg/Passes.h

mlir/include/mlir/Dialect/Linalg/Passes.td

mlir/include/mlir/Dialect/Linalg/Utils/Utils.h

mlir/lib/Dialect/Linalg/Transforms/CMakeLists.txt

mlir/lib/Dialect/Linalg/Transforms/FusionOnTensors.cpp

mlir/lib/Dialect/Linalg/Utils/Utils.cpp

mlir/test/Dialect/Linalg/fusion-on-tensors.mlir
