This is an archive of the discontinued LLVM Phabricator instance.

Differential D134307

[mlir][TilingInterface] Add callback to yield a produced value.
Abandoned · Public

Authored by mravishankar on Sep 20 2022, 1:25 PM.

Details

Reviewers

nicolasvasilache
pzread
chelini
hanchung

Summary

To support cases where a fused producer value is computed within the loop
and yielded, with the yielded value being used as the replacement of the fused
op, add a callback that gives the caller control over this.
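
A minimal, standalone sketch of the opt-in callback described above. The types `ProducerResult` and `SliceOp`, the struct `TileAndFuseOptions`, and the helper `collectYielded` are hypothetical stand-ins (the patch itself uses `OpResult` and `Operation *` on `SCFTileAndFuseOptions`); only the shape of the option, a `std::function` that defaults to returning false, is taken from the diff.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for mlir::OpResult and mlir::Operation *.
struct ProducerResult { std::string name; };
struct SliceOp { std::string name; };

struct TileAndFuseOptions {
  using ControlYieldFn =
      std::function<bool(const ProducerResult &, const SliceOp &)>;

  // Opt-in: by default no fused producer value is yielded.
  ControlYieldFn shouldYield =
      [](const ProducerResult &, const SliceOp &) { return false; };

  TileAndFuseOptions &setControlYieldFn(ControlYieldFn fn) {
    shouldYield = std::move(fn);
    return *this;
  }
};

// Hypothetical driver: returns the names of the producer results that the
// caller asked to have yielded from the loop.
std::vector<std::string>
collectYielded(const std::vector<ProducerResult> &producers,
               const SliceOp &slice, const TileAndFuseOptions &options) {
  std::vector<std::string> yielded;
  for (const ProducerResult &producer : producers)
    if (options.shouldYield(producer, slice))
      yielded.push_back(producer.name);
  return yielded;
}
```

With default options nothing is yielded; a caller that wants one producer value returned installs a predicate that selects it.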

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mravishankar created this revision. Sep 20 2022, 1:25 PM

Herald added a project: Restricted Project. Sep 20 2022, 1:25 PM

Herald added subscribers: bzcheeseman, sdasgup3, wenzhicui and 18 others.

mravishankar requested review of this revision. Sep 20 2022, 1:25 PM

Herald added a reviewer: nicolasvasilache. Sep 20 2022, 1:25 PM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

mravishankar added parent revisions: D134306: [mlir][Transforms] CSE of ops with a single block., D134144: [mlir][TilingInterface]NFC Refactor of tile and fuse using `TilingInterface`. Sep 20 2022, 1:25 PM

Harbormaster completed remote builds in B187822: Diff 461678. Sep 20 2022, 1:54 PM

Rebase.

Herald added a subscriber: zero9178. Sep 30 2022, 1:00 PM

Harbormaster completed remote builds in B189759: Diff 464374. Sep 30 2022, 1:01 PM

Set callback default to return false.

Harbormaster completed remote builds in B190259: Diff 465077. Oct 4 2022, 10:52 AM

Rebase.

Herald added subscribers: Moerafaat, hanchung. Nov 21 2022, 11:04 PM

mravishankar added reviewers: pzread, chelini, hanchung. Nov 21 2022, 11:04 PM

Harbormaster completed remote builds in B198906: Diff 477069. Nov 22 2022, 2:39 AM

hanchung added inline comments. Nov 22 2022, 6:09 PM

mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
674–684	Is it possible that it failed? If so, should we signal something? If not, can we add an assertion so that we don't need one level of nesting?
mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp
272–274	Should we add `memref::DimOp` to the list?

chelini added inline comments. Nov 23 2022, 1:47 AM

mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
93	Why `use` here? `SliceOp` is an `Operation*`.
mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp
208

Rebase and address comments.

Addressing minor comment that was missed.

mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h
93	Sorry, that was confusing. Fixed.
mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
674–684	Dropped the `FailureOr` from the called method. It always succeeds.
mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp
272–274	Tile and fuse doesn't work with `memref`s, and with memrefs there is no replacement to do.
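
The reply about dropping `FailureOr` can be illustrated with a small standalone sketch. It uses `std::optional` as a stand-in for `FailureOr`, and all function names here are hypothetical; the point is only that a helper which always succeeds lets callers lose a dead failure branch (the "one level of nesting" mentioned in the review).

```cpp
#include <optional>
#include <vector>

// Before: the helper advertises a failure mode it never uses, so every
// caller has to check for it. std::optional stands in for FailureOr here.
std::optional<std::vector<int>>
yieldValuesMayFail(const std::vector<int> &tiles) {
  return std::vector<int>(tiles.begin(), tiles.end());
}

// After: the helper returns its result directly because it always succeeds.
std::vector<int> yieldValues(const std::vector<int> &tiles) {
  return std::vector<int>(tiles.begin(), tiles.end());
}

// Caller written against the "before" signature.
std::optional<int> sumBefore(const std::vector<int> &tiles) {
  std::optional<std::vector<int>> replacements = yieldValuesMayFail(tiles);
  if (!replacements)
    return std::nullopt; // Dead branch: the helper never fails.
  int sum = 0;
  for (int value : *replacements)
    sum += value;
  return sum;
}

// Caller written against the "after" signature: no failure check needed.
int sumAfter(const std::vector<int> &tiles) {
  int sum = 0;
  for (int value : yieldValues(tiles))
    sum += value;
  return sum;
}
```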

Harbormaster completed remote builds in B199295: Diff 477613. Nov 23 2022, 3:00 PM

hanchung added inline comments. Nov 28 2022, 11:30 AM

mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp
674–684	do we need to capture the result of `yieldTiledValues`? It's not used, and I assume that it's only for generating IRs.

hanchung accepted this revision. Nov 30 2022, 1:19 PM

hanchung added inline comments.

mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp
36	should be consistent with other uses (see below)

This revision is now accepted and ready to land. Nov 30 2022, 1:19 PM

This feels extremely fishy, I do not understand the use case or the value but see significant complexity increase.

Putting a blocker until I can make sense of this.

This revision now requires changes to proceed. Nov 30 2022, 2:07 PM

In D134307#3961594, @nicolasvasilache wrote:

This feels extremely fishy, I do not understand the use case or the value but see significant complexity increase.

Putting a blocker until I can make sense of this.

I don't follow what is fishy here. This is the culmination of a months-long effort to fix https://github.com/llvm/llvm-project/issues/57205 (which has been blocking things downstream for as long). The changes here are the same as what is done in lines 394 - 407, which return the replacements for the tiled consumer. This now also returns the replacements for a fused producer when the caller determines that it is worth/valid to return them. It is too much state to track whether this value is returnable or not from the fused code, so this is left as a caller flag: it's completely opt-in and controlled by the caller. I don't see the complexity increase. Happy to discuss offline. This needs to be solved, and it was determined that the previous tile+fuse method was not structured enough to fix the original issue, so it was redone to make this issue solvable...

In reply to D134307#3962186 (@mravishankar):

I'll paste my comment from https://reviews.llvm.org/D138882 here.

Injecting lambdas from above allows inversion of control that is convenient in the very short term and has almost always proven to very quickly turn into technical debt.

Given recent offline discussions I had and other parts of the codebase I have seen, I am going to more seriously push back against this anti-pattern globally.
Injecting C++ callback control from above is a sign of missing abstractions and should almost always be disallowed.
The alternative is often to refactor multiple times until the right abstractions emerge.

In other words, the transformations we add must be functional-style and statically return the pieces of IR (existing, created, or updated) that make sense for that transform.
This is not something customizable: if you need more information, then statically return more information. No backchannels through lambdas.

If you need different behaviors, instead of injecting a dynamic mechanism through a lambda, write another transformation that takes more/different inputs and returns more/different outputs.
Refactor the reusable utility helpers to avoid copy-pasta.

These transformations can then be plugged into patterns and transform dialect ops that can be responsible for the switch between different static behaviors.

Hard stop on more lambdas from above, I am happy to iterate with you on the IREE side to a good solution.

As a side note, softmax and other complex things fuse properly into scf.foreach_thread thanks to a redesigned iterative algorithm and better sets of input / outputs.
Not sure if relevant to your specific case but my gut feeling is that this is not a case that qualifies for a pass.
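
As a rough illustration of the "statically return more information" style argued for in this comment, here is a standalone sketch. Every name in it (`FusedProducer`, `TileAndFuseResult`, `tileAndFuse`, `producersToYield`, the `hasUsesOutsideLoop` flag) is hypothetical and not part of MLIR; it only shows the transformation returning what it did, with the policy decision made by the caller afterwards rather than through an injected lambda.

```cpp
#include <string>
#include <vector>

// Hypothetical record describing one producer that was fused.
struct FusedProducer {
  std::string name;
  bool hasUsesOutsideLoop;
};

// Hypothetical result: the transformation reports everything it did.
struct TileAndFuseResult {
  std::vector<std::string> tiledOps;
  std::vector<FusedProducer> fusedProducers;
};

// The transformation takes no callback; it fuses and returns the details.
TileAndFuseResult tileAndFuse(const std::string &consumer,
                              const std::vector<FusedProducer> &producers) {
  TileAndFuseResult result;
  result.tiledOps.push_back(consumer + ".tiled");
  for (const FusedProducer &producer : producers) {
    result.tiledOps.push_back(producer.name + ".tiled");
    result.fusedProducers.push_back(producer);
  }
  return result;
}

// The policy lives with the caller and only inspects the returned data.
std::vector<std::string> producersToYield(const TileAndFuseResult &result) {
  std::vector<std::string> names;
  for (const FusedProducer &producer : result.fusedProducers)
    if (producer.hasUsesOutsideLoop)
      names.push_back(producer.name);
  return names;
}
```

The tradeoff raised later in the thread still applies: the result struct has to expose enough detail for the caller to act on it.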

This revision is now accepted and ready to land. Dec 1 2022, 2:29 AM

Sorry, didn't mean to resign.

This revision now requires changes to proceed. Dec 1 2022, 2:30 AM

In reply to D134307#3962939 (@nicolasvasilache):

If I have to internalize that comment, and the places you have blocked this, all the places you mentioned are worklist-based algorithms (either using pattern-rewrite-based fixed-point solutions, or the worklist algorithm used here). In both cases, the callbacks control the way the worklist is built. Callbacks allow control externally without leaking too much of the implementation details. I think that is useful if done in specific/controlled ways. I am not a fan of hard red lines. In this particular case, I can rewrite things to not use callbacks, but it will have to leak too much of the implementation details to allow callers to put things together again. That's the tradeoff IMO.

> As a side note, softmax and other complex things fuse properly into scf.foreach_thread thanks to a redesigned iterative algorithm and better sets of input / outputs.

Since you have put a hard stop to this, could you point me to where this is? Since there is a unilateral decision made to disallow callbacks, I'll try to conform to this. Also, btw, this is not related to softmax (that already works), and not related to scf.foreach_thread, because it is not related to distribution.
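
The point about callbacks controlling how a worklist is built can be sketched in isolation. This is a toy model, not the MLIR implementation: ops are plain strings, `Graph` maps an op to its producers, and `fuseGreedily` and `ShouldFuseFn` are hypothetical names. The callback decides which producers enter the worklist, while the traversal itself stays internal to the function.

```cpp
#include <deque>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model: maps each op to the ops that produce its operands.
using Graph = std::map<std::string, std::vector<std::string>>;
using ShouldFuseFn = std::function<bool(const std::string &producer)>;

// Walks producers breadth-first starting at `root`. The callback controls
// which producers are added to the worklist (and therefore fused).
std::vector<std::string> fuseGreedily(const Graph &producersOf,
                                      const std::string &root,
                                      const ShouldFuseFn &shouldFuse) {
  std::vector<std::string> fused;
  std::set<std::string> visited{root};
  std::deque<std::string> worklist{root};
  while (!worklist.empty()) {
    std::string op = worklist.front();
    worklist.pop_front();
    auto it = producersOf.find(op);
    if (it == producersOf.end())
      continue;
    for (const std::string &producer : it->second) {
      // Skip producers the caller rejects or that were already handled.
      if (!shouldFuse(producer) || !visited.insert(producer).second)
        continue;
      fused.push_back(producer);
      worklist.push_back(producer);
    }
  }
  return fused;
}
```

Rejecting a producer also prunes everything reachable only through it, which is the kind of implicit coupling between caller policy and traversal that the two sides of this thread weigh differently.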

In reply to D134307#3964457 (@mravishankar):

I apologize for the language. I realize upon rereading that it was too harsh; I didn't mean it this way, and I could have posted in less of a hurry and worded things better.
As we discussed offline, I would love to be able to come up with a characterization where we are very careful with C++ callbacks and avoid things that amount to inversion of control and breaking the functional contract that has worked very well for the implementation of most transformation logic.
I do think this example can be written in a way that avoids this injection, and I am happy to help iterate until we get there (it may be a long process, and I am sorry for the drag this has been causing for too long already).

The piece I was referring to was, I think, actually adapted from an earlier version of yours. It is only used inside the transform dialect for now, but if there is a need to extract it as a free function, we can:
https://sourcegraph.com/github.com/llvm/llvm-project/-/blob/mlir/lib/Dialect/Linalg/TransformOps/LinalgTransformOps.cpp?L504

My experience with this is that this can run in 2 modes: either use the list of ops to fuse incrementally and let the transformation figure things out, or call the transformation itself in a sequence controlled from outside, which allows more flexibility.
I could offer to try the example you have in mind through this and see whether something new appears that would force us to rethink the approach.

In reply to D134307#3964983 (@nicolasvasilache):

> I apologize for the language. I realize upon rereading that it was too harsh; I didn't mean it this way, and I could have posted in less of a hurry and worded things better.

Thanks for that!

Thanks. I think I understand. I will hold off on this patch, and send some NFC patches that basically do the following:

1. Have an API entry that does one step of the fusion, i.e. one producer -> tensor.extract_slice swap.
2. Refactor the existing scf::tileConsumerAndFuseProducersGreedily to use this, and then maybe have different API entry points that allow having a list of producers and tile + fuse all of them, etc., that use the entry point added in (1).

Might not get to this soon (I am on vacation for a few days). I'll send out the stack once I have something working.
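
The plan above can be sketched as a toy model: a single-step entry point plus two drivers built on it. Nothing here is the eventual MLIR API; `LoopNest`, `fuseOneProducer`, `fuseGreedily`, and `fuseProducers` are hypothetical names, and ops are modeled as strings.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

// Toy model of a tiled loop nest.
struct LoopNest {
  std::vector<std::string> body;       // Ops currently inside the loop.
  std::vector<std::string> candidates; // Producers reachable through a slice.
};

// Step (1): fuse exactly one producer. Returns the name of the tiled
// producer, or std::nullopt if `producer` is not a fusable candidate.
std::optional<std::string> fuseOneProducer(LoopNest &loops,
                                           const std::string &producer) {
  auto it =
      std::find(loops.candidates.begin(), loops.candidates.end(), producer);
  if (it == loops.candidates.end())
    return std::nullopt;
  std::string tiled = producer + ".tiled";
  loops.candidates.erase(it);
  loops.body.insert(loops.body.begin(), tiled);
  return tiled;
}

// Step (2), greedy driver: keep fusing until no candidates remain.
std::vector<std::string> fuseGreedily(LoopNest &loops) {
  std::vector<std::string> fused;
  while (!loops.candidates.empty()) {
    std::string next = loops.candidates.front(); // Copy before erasing.
    std::optional<std::string> tiled = fuseOneProducer(loops, next);
    if (!tiled)
      break;
    fused.push_back(*tiled);
  }
  return fused;
}

// Step (2), list driver: fuse only the producers the caller names.
std::vector<std::string>
fuseProducers(LoopNest &loops, const std::vector<std::string> &producers) {
  std::vector<std::string> fused;
  for (const std::string &producer : producers)
    if (std::optional<std::string> tiled = fuseOneProducer(loops, producer))
      fused.push_back(*tiled);
  return fused;
}
```

Because the caller can also invoke `fuseOneProducer` directly in whatever order it chooses, this layering gives external control over the sequence without a callback.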

mravishankar abandoned this revision. Jan 16 2023, 10:38 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h	14 lines
mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp	86 lines
mlir/test/Interfaces/TilingInterface/tile-and-fuse-using-interface.mlir	130 lines
mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp	57 lines

Diff 477613

mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h

(81 lines hidden)
 /// Options used to control tile + fuse.
 struct SCFTileAndFuseOptions {
   /// The tiling options used to control the tiling of the consumer.
   SCFTilingOptions tilingOptions;
   SCFTileAndFuseOptions &setTilingOptions(SCFTilingOptions options) {
     tilingOptions = options;
     return *this;
   }
+
+  /// Callback to check if a value is to be yielded.
+  /// Parameters are `producer` which is the result of the untiled op
+  /// that is being fused, and the `sliceOp` that represents the slice
+  /// being fused (through tile and fuse).
+  using ControlYieldFusedProducerResultFn =
+      std::function<bool(OpResult producer, Operation *sliceOp)>;
+  ControlYieldFusedProducerResultFn shouldYieldFusedProducerResult =
+      [](OpResult, Operation *) { return false; };
+  SCFTileAndFuseOptions &setControlYieldFusedProducerResultFn(
+      const ControlYieldFusedProducerResultFn &fn) {
+    shouldYieldFusedProducerResult = fn;
+    return *this;
+  }
 };

 /// Transformation information returned after tile and fuse.
 struct SCFTileAndFuseResult {
   /// List of untiled operations that were fused with the tiled consumer.
   llvm::SetVector<Operation *> fusedProducers;
   /// List of tiled and fused operations generated. The first one in this list
   /// is guaranteed to be the tiled operations generated during tiling of the
(87 lines hidden)

mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp

(194 lines hidden)
 ///  scf.for %iv0 = ... iter_args(%arg = %0) {
 ///    %1 = tensor.extract_slice %arg
 ///    %2 = tiled_op
 ///    %3 = tensor.insert_slice %2 into %arg
 ///    scf.yield %3
 ///  }
 /// ```
 /// TODO: This API can be cleaned up by using `SubsetExtractOpInterface`.
-static FailureOr<SmallVector<Value>>
+static SmallVector<Value>
 yieldTiledValues(RewriterBase &rewriter, ValueRange initValues,
                  ValueRange yieldedValues,
                  ArrayRef<SmallVector<OpFoldResult>> tileOffsetsList,
                  ArrayRef<SmallVector<OpFoldResult>> tileSizesList,
                  MutableArrayRef<scf::ForOp> loops) {
   NewYieldValueFn yieldValueFn =
       [&](OpBuilder &b, Location loc,
           ArrayRef<BlockArgument> newBBArgs) -> SmallVector<Value> {
(174 lines hidden)
     if (failed(op.getResultTilePosition(rewriter, result.index(), offsets,
                                         sizes,
                                         resultOffsetsList[result.index()],
                                         resultSizesList[result.index()]))) {
       return rewriter.notifyMatchFailure(
           op, "failed to get slice of result produced");
     }
   }

-  FailureOr<SmallVector<Value>> replacementOr = yieldTiledValues(
+  tilingResult.replacements = yieldTiledValues(
       rewriter, destinationTensors, tilingResult.tiledOps.back()->getResults(),
       resultOffsetsList, resultSizesList, tilingResult.loops);
-  if (failed(replacementOr))
-    return rewriter.notifyMatchFailure(op, "failed to yield replacement");

   if (auto dstOp =
           dyn_cast<DestinationStyleOpInterface>(tilingResult.tiledOps.back())) {
     auto innerMostLoop = tilingResult.loops.back();
     SmallVector<Value> destinationTensors = dstOp.getDpsInitOperands();
     assert(destinationTensors.size() ==
                innerMostLoop.getRegionIterArgs().size() &&
            "unexpected number of outputs");
     updateDestinationOperandsForTiledOp(rewriter, destinationTensors,
                                         innerMostLoop.getRegionIterArgs());
   }

-  tilingResult.replacements = replacementOr.value();
-
   LLVM_DEBUG({
     if (!tilingResult.loops.empty()) {
       llvm::dbgs() << "After tiled implementation :\n";
       tilingResult.loops.front().dump();
       llvm::dbgs() << "\n";
     }
   });
   return tilingResult;
(50 lines hidden)
   Operation *parallelOp =
       op.tileToPartialReduction(b, loc, identityTensor.value()->getResults(),
                                 offsets, sizes, reductionDim);

   SmallVector<OpFoldResult> resultSizesList;
   for (size_t i = 0; i < offsets.size(); i++)
     resultSizesList.push_back(
         b.createOrFold<tensor::DimOp>(loc, parallelOp->getResult(0), i));
   SmallVector<OpFoldResult> outOffsets(offsets.size(), b.getIndexAttr(0));
-  FailureOr<SmallVector<Value>> replacementOr = yieldTiledValues(
+  SmallVector<Value> replacements = yieldTiledValues(
       b, identityTensor.value()->getResults(), parallelOp->getResults(),
       outOffsets, resultSizesList, loops);
-  if (failed(replacementOr))
-    return b.notifyMatchFailure(op, "failed to yield replacement");

   auto dstOp = cast<DestinationStyleOpInterface>(parallelOp);
   auto innerMostLoop = loops.back();
   SmallVector<Value> destinationTensors = dstOp.getDpsInitOperands();
   assert(destinationTensors.size() ==
              innerMostLoop.getRegionIterArgs().size() &&
          "unexpected number of outputs");
   updateDestinationOperandsForTiledOp(b, destinationTensors,
                                       innerMostLoop.getRegionIterArgs());

   // 4. Apply the merge reduction to combine all the partial values.
   b.setInsertionPointAfter(*loops.begin());
-  Operation *mergeOp =
-      op.mergeReductions(b, loc, replacementOr.value(), reductionDim);
+  Operation *mergeOp = op.mergeReductions(b, loc, replacements, reductionDim);
   b.replaceOp(op, mergeOp->getResults());

   SCFReductionTilingResult results;
   results.initialOp = identityTensor.value();
   results.loops = std::move(loops);
   results.parallelTiledOp = parallelOp;
   results.mergeOp = mergeOp;
   return results;
[... 33 lines elided ...]	mlir::scf::tileConsumerAndFuseProducerGreedilyUsingSCFForOp(
// valid to use with operations that have memref operands).		// valid to use with operations that have memref operands).
if (!consumer->getNumResults()) {		if (!consumer->getNumResults()) {
return rewriter.notifyMatchFailure(		return rewriter.notifyMatchFailure(
consumer, "invalid pattern for op with no results");		consumer, "invalid pattern for op with no results");
}		}

// 1. First tile the consumer.		// 1. First tile the consumer.
scf::SCFTileAndFuseResult tileAndFuseResult;		scf::SCFTileAndFuseResult tileAndFuseResult;
llvm::SmallDenseMap<Value, int64_t> yieldedValueToResultNumber;		SmallVector<Value> toBeReturned;
{		{
FailureOr<scf::SCFTilingResult> tilingResult =		FailureOr<scf::SCFTilingResult> tilingResult =
tileUsingSCFForOp(rewriter, consumer, options.tilingOptions);		tileUsingSCFForOp(rewriter, consumer, options.tilingOptions);
if (failed(tilingResult))		if (failed(tilingResult))
return rewriter.notifyMatchFailure(consumer, "failed to tile consumer");		return rewriter.notifyMatchFailure(consumer, "failed to tile consumer");
for (auto *tiledOp : tilingResult->tiledOps)		for (auto *tiledOp : tilingResult->tiledOps)
tileAndFuseResult.tiledAndFusedOps.insert(tiledOp);		tileAndFuseResult.tiledAndFusedOps.insert(tiledOp);
tileAndFuseResult.loops = std::move(tilingResult->loops);		tileAndFuseResult.loops = std::move(tilingResult->loops);
for (const auto &result : llvm::enumerate(		toBeReturned = llvm::to_vector(llvm::map_range(
llvm::zip(consumer->getResults(), tilingResult->replacements))) {		consumer->getResults(), [](OpResult r) -> Value { return r; }));
tileAndFuseResult.replacements[std::get<0>(result.value())] =
std::get<1>(result.value());
yieldedValueToResultNumber[tilingResult->tiledOps.back()->getResult(
result.index())] = result.index();
}
}		}

// If there are no loops generated, fusion is immaterial.		// If there are no loops generated, fusion is immaterial.
if (tileAndFuseResult.loops.empty())		if (tileAndFuseResult.loops.empty())
return tileAndFuseResult;		return tileAndFuseResult;

// 2. Typically, the operands of the tiled operation are slices of the		// 2. Typically, the operands of the tiled operation are slices of the
// operands of the untiled operation. These are expressed in IR using		// operands of the untiled operation. These are expressed in IR using
[... 29 lines elided ...]	while (!candidates.empty()) {
rewriter.setInsertionPoint(candidateSliceOp);		rewriter.setInsertionPoint(candidateSliceOp);
FailureOr<Value> fusedProducerValue =		FailureOr<Value> fusedProducerValue =
tensor::replaceExtractSliceWithTiledProducer(rewriter, candidateSliceOp,		tensor::replaceExtractSliceWithTiledProducer(rewriter, candidateSliceOp,
fusableProducer);		fusableProducer);
if (failed(fusedProducerValue))		if (failed(fusedProducerValue))
continue;		continue;
rewriter.replaceOp(candidateSliceOp, fusedProducerValue.value());		rewriter.replaceOp(candidateSliceOp, fusedProducerValue.value());

// 2d. The operands of the fused producer might themselves be slices of		// 2d. If the slice is for a destination operand, for example,
// values produced by operations that implement the `TilingInterface`.
// Add these operations to the worklist.
Operation *fusedProducer = fusedProducerValue->getDefiningOp();
tileAndFuseResult.tiledAndFusedOps.insert(fusedProducer);
addCandidateSlices(fusedProducer, candidates);

// 2e. If the slice is for a destination operand, for example,
//		//
// ```mlir		// ```mlir
// %0 = linalg.init		// %0 = linalg.init
// %1 = linalg.fill .. outs(%0 : )		// %1 = linalg.fill .. outs(%0 : )
// %2 = scf.for .. iter_args(%arg0 = %1) {		// %2 = scf.for .. iter_args(%arg0 = %1) {
// %3 = scf.for .. iter_args(%arg1 = %arg0) {		// %3 = scf.for .. iter_args(%arg1 = %arg0) {
// %4 = tensor.extract_slice %arg1 [..]		// %4 = tensor.extract_slice %arg1 [..]
// .. = linalg.matmul .. outs(%4 : )		// .. = linalg.matmul .. outs(%4 : )
[... 54 lines elided ...]	if (iterArgNumber) {
if (auto dstOp = fusedProducerValue.value()		if (auto dstOp = fusedProducerValue.value()
.getDefiningOp<DestinationStyleOpInterface>()) {		.getDefiningOp<DestinationStyleOpInterface>()) {
scf::ForOp innerMostLoop = tileAndFuseResult.loops.back();		scf::ForOp innerMostLoop = tileAndFuseResult.loops.back();
updateDestinationOperandsForTiledOp(		updateDestinationOperandsForTiledOp(
rewriter, dstOp.getDpsInitOperand(resultNumber)->get(),		rewriter, dstOp.getDpsInitOperand(resultNumber)->get(),
innerMostLoop.getRegionIterArgs()[iterArgNumber.value()]);		innerMostLoop.getRegionIterArgs()[iterArgNumber.value()]);
}		}
}		}

		// 2e. Use the callback to yield the value of the fused producer as well.
		if (options.shouldYieldFusedProducerResult(fusableProducer,
		candidateSliceOp)) {
		SmallVector<Value> initValues;
		FailureOr<Value> initValue = tensor::getOrCreateDestination(
		rewriter, fusableProducer.getOwner()->getLoc(), fusableProducer);
		if (succeeded(initValue)) {
		SmallVector<OpFoldResult> resultOffsets =
		candidateSliceOp.getMixedOffsets();
		SmallVector<OpFoldResult> resultSizes =
		candidateSliceOp.getMixedSizes();
		SmallVector<Value> yieldedVals = yieldTiledValues(
		rewriter, initValue.value(), fusedProducerValue.value(),
		resultOffsets, resultSizes, tileAndFuseResult.loops);
		toBeReturned.push_back(fusableProducer);
		}
		if (auto dstStyleProducer =
		[Inline comment] hanchung (not done): Is it possible that it failed? If so, should we signal something? If not, can we add an assertion so that we don't need one level of nesting?
		[Inline comment] mravishankar (author, done): Dropped the `FailureOr` from the called method. It always succeeds.
		[Inline comment] hanchung (not done): Do we need to capture the result of `yieldTiledValues`? It's not used, and I assume that it's only for generating IR.
		fusedProducerValue.value()
		.getDefiningOp<DestinationStyleOpInterface>()) {
		Value dstValue =
		dstStyleProducer
		.getDpsInitOperand(fusableProducer.getResultNumber())
		->get();
		updateDestinationOperandsForTiledOp(
		rewriter, dstValue,
		tileAndFuseResult.loops.back().getRegionIterArgs().back());
		}
		}

		// 2f. The operands of the fused producer might themselves be slices of
		// values produced by operations that implement the `TilingInterface`.
		// Add these operations to the worklist.
		if (auto tiledAndFusedOp = fusedProducerValue.value().getDefiningOp()) {
		tileAndFuseResult.tiledAndFusedOps.insert(tiledAndFusedOp);
		addCandidateSlices(tiledAndFusedOp, candidates);
		}

		LLVM_DEBUG({
		if (!tileAndFuseResult.loops.empty()) {
		llvm::errs() << "After fusing producer: \n";
		tileAndFuseResult.loops.front().dump();
		llvm::errs() << "\n";
}		}
		});
		}

		for (auto returnedValue : llvm::enumerate(toBeReturned)) {
		tileAndFuseResult.replacements[returnedValue.value()] =
		tileAndFuseResult.loops.front().getResult(returnedValue.index());
		}

return tileAndFuseResult;		return tileAndFuseResult;
}		}
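
The loop at the end of the function pairs each entry of `toBeReturned` with the result of the outermost loop at the same position. A minimal sketch of that bookkeeping, using plain strings and indices as hypothetical stand-ins for MLIR `Value`s and loop results (`buildReplacements` is an illustrative name, not part of the patch):

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical model: values are named by strings, and result #i of the
// outermost generated loop is represented by its index i.
std::map<std::string, std::size_t>
buildReplacements(const std::vector<std::string> &toBeReturned) {
  std::map<std::string, std::size_t> replacements;
  // The i-th value queued for return is replaced by the i-th loop result.
  for (std::size_t i = 0; i < toBeReturned.size(); ++i)
    replacements[toBeReturned[i]] = i;
  return replacements;
}
```

Under this model, the consumer results come first (they are queued when the consumer is tiled) and each yielded fused producer is appended afterwards, which appears to match the `iter_args` order checked in the `gemm_gemm_fusion_yield_both` test.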

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// lowerToLoopsUsingSCFForOp implementation.		// lowerToLoopsUsingSCFForOp implementation.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

FailureOr<SmallVector<scf::ForOp>>		FailureOr<SmallVector<scf::ForOp>>
[... 30 lines elided ...]

mlir/test/Interfaces/TilingInterface/tile-and-fuse-using-interface.mlir

	// RUN: mlir-opt -test-tiling-interface=tile-consumer-and-fuse-producer-using-scf-for -split-input-file %s | FileCheck %s			// RUN: mlir-opt -test-tiling-interface=tile-consumer-and-fuse-producer-using-scf-for -cse -split-input-file %s | FileCheck %s

	func.func @gemm_fill_fusion(%arg0 : tensor<?x?xf32>, %arg1 : tensor<?x?xf32>) -> tensor<?x?xf32> {			func.func @gemm_fill_fusion(%arg0 : tensor<?x?xf32>, %arg1 : tensor<?x?xf32>) -> tensor<?x?xf32> {
	%c0 = arith.constant 0 : index			%c0 = arith.constant 0 : index
	%c1 = arith.constant 1 : index			%c1 = arith.constant 1 : index
	%cst = arith.constant 0.0 : f32			%cst = arith.constant 0.0 : f32
	%d0 = tensor.dim %arg0, %c0 : tensor<?x?xf32>			%d0 = tensor.dim %arg0, %c0 : tensor<?x?xf32>
	%d1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>			%d1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>
	%init = tensor.empty(%d0, %d1) : tensor<?x?xf32>			%init = tensor.empty(%d0, %d1) : tensor<?x?xf32>
	[... 256 lines elided ...]
	// CHECK-SAME: %[[ARG2:[a-zA-Z0-9_]+]]: tensor<?x?xf32>			// CHECK-SAME: %[[ARG2:[a-zA-Z0-9_]+]]: tensor<?x?xf32>
	// CHECK: %[[RESULT:.+]] = scf.for %[[IV0:[a-zA-Z0-9_]+]]			// CHECK: %[[RESULT:.+]] = scf.for %[[IV0:[a-zA-Z0-9_]+]]
	// CHECK-SAME: iter_args(%[[ARG4:.+]] = %{{[a-zA-Z0-9_]+}})			// CHECK-SAME: iter_args(%[[ARG4:.+]] = %{{[a-zA-Z0-9_]+}})
	// CHECK: %[[YIELD:.+]] = scf.for %[[IV1:[a-zA-Z0-9_]+]]			// CHECK: %[[YIELD:.+]] = scf.for %[[IV1:[a-zA-Z0-9_]+]]
	// CHECK-SAME: iter_args(%[[ARG6:.+]] = %[[ARG4]])			// CHECK-SAME: iter_args(%[[ARG6:.+]] = %[[ARG4]])
	// CHECK-DAG: %[[ST_ARG0:.+]] = tensor.extract_slice %[[ARG0]][%[[IV0]], 0]			// CHECK-DAG: %[[ST_ARG0:.+]] = tensor.extract_slice %[[ARG0]][%[[IV0]], 0]
	// CHECK-DAG: %[[ST_ARG1:.+]] = tensor.extract_slice %[[ARG1]][0, %[[IV1]]]			// CHECK-DAG: %[[ST_ARG1:.+]] = tensor.extract_slice %[[ARG1]][0, %[[IV1]]]
	// CHECK-DAG: %[[ST_ARG2:.+]] = tensor.extract_slice %[[ARG2]][%[[IV0]], %[[IV1]]]			// CHECK-DAG: %[[ST_ARG2:.+]] = tensor.extract_slice %[[ARG2]][%[[IV0]], %[[IV1]]]
	// CHECK: %[[LHS:.+]] = linalg.matmul			// CHECK: %[[MATMUL:.+]] = linalg.matmul
	// CHECK-SAME: ins(%[[ST_ARG0]], %[[ST_ARG1]] :			// CHECK-SAME: ins(%[[ST_ARG0]], %[[ST_ARG1]] :
	// CHECK-SAME: outs(%[[ST_ARG2]] :			// CHECK-SAME: outs(%[[ST_ARG2]] :
	// CHECK-DAG: %[[ST_ARG0_1:.+]] = tensor.extract_slice %[[ARG0]][%[[IV0]], 0]
	// CHECK-DAG: %[[ST_ARG1_1:.+]] = tensor.extract_slice %[[ARG1]][0, %[[IV1]]]
	// CHECK-DAG: %[[ST_ARG2_1:.+]] = tensor.extract_slice %[[ARG2]][%[[IV0]], %[[IV1]]]
	// CHECK: %[[RHS:.+]] = linalg.matmul
	// CHECK-SAME: ins(%[[ST_ARG0_1]], %[[ST_ARG1_1]] :
	// CHECK-SAME: outs(%[[ST_ARG2_1]] :
	// CHECK: %[[ST_ARG6:.+]] = tensor.extract_slice %[[ARG6]][%[[IV0]], %[[IV1]]]			// CHECK: %[[ST_ARG6:.+]] = tensor.extract_slice %[[ARG6]][%[[IV0]], %[[IV1]]]
	// CHECK: %[[ST_RESULT:.+]] = linalg.generic			// CHECK: %[[ST_RESULT:.+]] = linalg.generic
	// CHECK-SAME: ins(%[[LHS]], %[[RHS]] :			// CHECK-SAME: ins(%[[MATMUL]], %[[MATMUL]] :
	// CHECK-SAME: outs(%[[ST_ARG6]] :			// CHECK-SAME: outs(%[[ST_ARG6]] :
	// CHECK: %[[UPDATE:.+]] = tensor.insert_slice %[[ST_RESULT]]			// CHECK: %[[UPDATE:.+]] = tensor.insert_slice %[[ST_RESULT]]
	// CHECK-SAME: into %[[ARG6]][%[[IV0]], %[[IV1]]]			// CHECK-SAME: into %[[ARG6]][%[[IV0]], %[[IV1]]]
	// CHECK: scf.yield %[[UPDATE]]			// CHECK: scf.yield %[[UPDATE]]
	// CHECK: scf.yield %[[YIELD]]			// CHECK: scf.yield %[[YIELD]]
	// CHECK: return %[[RESULT]]			// CHECK: return %[[RESULT]]

	// -----			// -----
	[... 102 lines elided ...]
	// CHECK-SAME: outs(%[[SLICE_ARG4]] :			// CHECK-SAME: outs(%[[SLICE_ARG4]] :
	// CHECK-DAG: %[[SLICE_ARG5:.+]] = tensor.extract_slice %[[ARG5]][0, 0] [%[[N2]], %[[N3]]]			// CHECK-DAG: %[[SLICE_ARG5:.+]] = tensor.extract_slice %[[ARG5]][0, 0] [%[[N2]], %[[N3]]]
	// CHECK-DAG: %[[SLICE_ARG6:.+]] = tensor.extract_slice %[[ARG8]][%[[IV]], 0] [%[[TILE_M]], %[[N3]]]			// CHECK-DAG: %[[SLICE_ARG6:.+]] = tensor.extract_slice %[[ARG8]][%[[IV]], 0] [%[[TILE_M]], %[[N3]]]
	// CHECK-DAG: %[[TILE_GEMM3:.+]] = linalg.matmul			// CHECK-DAG: %[[TILE_GEMM3:.+]] = linalg.matmul
	// CHECK-SAME: ins(%[[TILE_GEMM2]], %[[SLICE_ARG5]] :			// CHECK-SAME: ins(%[[TILE_GEMM2]], %[[SLICE_ARG5]] :
	// CHECK-SAME: outs(%[[SLICE_ARG6]] :			// CHECK-SAME: outs(%[[SLICE_ARG6]] :
	// CHECK: %[[UPDATE:.+]] = tensor.insert_slice %[[TILE_GEMM3]] into %[[ARG8]][%[[IV]], 0] [%[[TILE_M]], %[[N3]]]			// CHECK: %[[UPDATE:.+]] = tensor.insert_slice %[[TILE_GEMM3]] into %[[ARG8]][%[[IV]], 0] [%[[TILE_M]], %[[N3]]]
	// CHECK: scf.yield %[[UPDATE]]			// CHECK: scf.yield %[[UPDATE]]

				// -----

				func.func @reduction_sequence(%arg0: tensor<30x3xf32>) -> tensor<30x3xf32> {
				%cst = arith.constant 0.000000e+00 : f32
				%cst_0 = arith.constant 0xFF800000 : f32
				%0 = tensor.empty() : tensor<30xf32>
				%1 = linalg.fill ins(%cst_0 : f32) outs(%0 : tensor<30xf32>) -> tensor<30xf32>
				%2 = linalg.generic {
				indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>],
				iterator_types = ["parallel", "reduction"]}
				ins(%arg0 : tensor<30x3xf32>) outs(%1 : tensor<30xf32>) {
				^bb0(%arg1: f32, %arg2: f32):
				%8 = arith.maxf %arg2, %arg1 : f32
				linalg.yield %8 : f32
				} -> tensor<30xf32>
				%3 = tensor.empty() : tensor<30x3xf32>
				%4 = linalg.fill ins(%cst : f32) outs(%0 : tensor<30xf32>) -> tensor<30xf32>
				%5:2 = linalg.generic {
				indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>,
				affine_map<(d0, d1) -> (d0)>, affine_map<(d0, d1) -> (d0, d1)>],
				iterator_types = ["parallel", "reduction"]}
				ins(%arg0, %2 : tensor<30x3xf32>, tensor<30xf32>) outs(%4, %3 : tensor<30xf32>, tensor<30x3xf32>) {
				^bb0(%arg1: f32, %arg2: f32, %arg3: f32, %arg4: f32):
				%8 = arith.subf %arg1, %arg2 : f32
				%9 = math.exp %8 : f32
				%10 = arith.addf %arg3, %9 : f32
				linalg.yield %10, %9 : f32, f32
				} -> (tensor<30xf32>, tensor<30x3xf32>)
				%6 = linalg.generic {
				__internal_linalg_transform__ = "reduction_sequence_fusion",
				indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0)>,
				affine_map<(d0, d1) -> (d0, d1)>],
				iterator_types = ["parallel", "parallel"]}
				ins(%5#1, %5#0 : tensor<30x3xf32>, tensor<30xf32>) outs(%3 : tensor<30x3xf32>) {
				^bb0(%arg1: f32, %arg2: f32, %arg3: f32):
				%8 = arith.divf %arg1, %arg2 : f32
				linalg.yield %8 : f32
				} -> tensor<30x3xf32>
				return %6 : tensor<30x3xf32>
				}
				// CHECK: func @reduction_sequence(%[[ARG0:.+]]: tensor<30x3xf32>)
				// CHECK-DAG: %[[INIT0:.+]] = tensor.empty() : tensor<30xf32>
				// CHECK-DAG: %[[INIT1:.+]] = tensor.empty() : tensor<30x3xf32>
				// CHECK: %[[RESULT:[a-zA-Z0-9]+]] = scf.for %[[IV:[a-zA-Z0-9]+]]
				// CHECK-SAME: iter_args(%[[ITERARG0:[a-zA-Z0-9]+]] = %[[INIT1]])
				// CHECK-DAG: %[[ARG0_SLICE:.+]] = tensor.extract_slice %[[ARG0]][%[[IV]], 0]
				// CHECK-DAG: %[[INIT0_SLICE:.+]] = tensor.extract_slice %[[INIT0]][%[[IV]]]
				// CHECK: %[[FILL0:.+]] = linalg.fill
				// CHECK-SAME: outs(%[[INIT0_SLICE]] :
				// CHECK: %[[GENERIC0:.+]] = linalg.generic
				// CHECK-SAME: ins(%[[ARG0_SLICE]] :
				// CHECK-SAME: outs(%[[FILL0]] :
				// CHECK: %[[FILL1:.+]] = linalg.fill
				// CHECK-SAME: outs(%[[INIT0_SLICE]] :
				// CHECK: %[[INIT1_SLICE:.+]] = tensor.extract_slice %[[INIT1]][%[[IV]], 0]
				// CHECK: %[[GENERIC1:.+]]:2 = linalg.generic
				// CHECK-SAME: ins(%[[ARG0_SLICE]], %[[GENERIC0]] :
				// CHECK-SAME: outs(%[[FILL1]], %[[INIT1_SLICE]] :
				// CHECK: %[[ITERARG0_SLICE:.+]] = tensor.extract_slice %[[ITERARG0]][%[[IV]], 0]
				// CHECK: %[[GENERIC2:.+]] = linalg.generic
				// CHECK-SAME: ins(%[[GENERIC1]]#1, %[[GENERIC1]]#0 :
				// CHECK-SAME: outs(%[[ITERARG0_SLICE]] :
				// CHECK-DAG: %[[INSERTSLICE:.+]] = tensor.insert_slice %[[GENERIC2]] into %[[ITERARG0]][%[[IV]], 0]
				// CHECK: scf.yield %[[INSERTSLICE]]
				// CHECK: return %[[RESULT]]

				// -----

				func.func @gemm_gemm_fusion_yield_both(%lhs0 : tensor<?x?xf32>, %rhs0 : tensor<?x?xf32>, %rhs1 : tensor<?x?xf32>)
				-> (tensor<?x?xf32>, tensor<?x?xf32>) {
				%c0 = arith.constant 0 : index
				%c1 = arith.constant 1 : index
				%cst = arith.constant 0.0 : f32
				%d0 = tensor.dim %lhs0, %c0 : tensor<?x?xf32>
				%d1 = tensor.dim %rhs0, %c1 : tensor<?x?xf32>
				%init0 = tensor.empty(%d0, %d1) : tensor<?x?xf32>
				%fill0 = linalg.fill ins(%cst : f32) outs(%init0 : tensor<?x?xf32>) -> tensor<?x?xf32>
				%gemm0 = linalg.matmul {__yield_result__ = 0}
				ins(%lhs0, %rhs0 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%fill0 : tensor<?x?xf32>) -> tensor<?x?xf32>
				%d2 = tensor.dim %rhs1, %c1 : tensor<?x?xf32>
				%init1 = tensor.empty(%d0, %d2) : tensor<?x?xf32>
				%fill1 = linalg.fill ins(%cst : f32) outs(%init1 : tensor<?x?xf32>) -> tensor<?x?xf32>
				%gemm1 = linalg.matmul {__internal_linalg_transform__ = "gemm_fusion"}
				ins(%gemm0, %rhs1 : tensor<?x?xf32>, tensor<?x?xf32>) outs(%fill1 : tensor<?x?xf32>) -> tensor<?x?xf32>
				return %gemm0, %gemm1 : tensor<?x?xf32>, tensor<?x?xf32>
				}
				// CHECK: func.func @gemm_gemm_fusion_yield_both(
				// CHECK-SAME: %[[LHS0:[a-zA-Z0-9]+]]: tensor<?x?xf32>
				// CHECK-SAME: %[[RHS0:[a-zA-Z0-9]+]]: tensor<?x?xf32>,
				// CHECK-SAME: %[[RHS1:[a-zA-Z0-9]+]]: tensor<?x?xf32>)
				// CHECK-DAG: %[[C0:.+]] = arith.constant 0 : index
				// CHECK-DAG: %[[C1:.+]] = arith.constant 1 : index
				// CHECK-DAG: %[[D0:.+]] = tensor.dim %[[LHS0]], %[[C0]]
				// CHECK-DAG: %[[D1:.+]] = tensor.dim %[[RHS0]], %[[C1]]
				// CHECK-DAG: %[[INIT0:.+]] = tensor.empty(%[[D0]], %[[D1]])
				// CHECK-DAG: %[[D2:.+]] = tensor.dim %[[RHS1]], %[[C1]]
				// CHECK: %[[INIT1:.+]] = tensor.empty(%[[D0]], %[[D2]])
				// CHECK: %[[RESULT:.+]]:2 = scf.for %[[IV:[a-zA-Z0-9]+]] =
				// CHECK-SAME: iter_args(%[[ITERARG0:[a-zA-Z0-9]+]] = %[[INIT1]], %[[ITERARG1:[a-zA-Z0-9]+]] = %[[INIT0]])
				// CHECK-DAG: %[[LHS0_TILE:.+]] = tensor.extract_slice %[[LHS0]][%[[IV]], 0]
				// CHECK-DAG: %[[RHS0_TILE:.+]] = tensor.extract_slice %[[RHS0]][0, 0]
				// CHECK-DAG: %[[INIT0_TILE:.+]] = tensor.extract_slice %[[ITERARG1]][%[[IV]], 0]
				// CHECK: %[[FILL0_TILE:.+]] = linalg.fill
				// CHECK-SAME: outs(%[[INIT0_TILE]] :
				// CHECK: %[[GEMM0_TILE:.+]] = linalg.matmul
				// CHECK-SAME: ins(%[[LHS0_TILE]], %[[RHS0_TILE]] :
				// CHECK-SAME: outs(%[[FILL0_TILE]] :
				// CHECK-DAG: %[[RHS1_TILE:.+]] = tensor.extract_slice %[[RHS1]][0, 0]
				// CHECK-DAG: %[[INIT1_TILE:.+]] = tensor.extract_slice %[[ITERARG0]][%[[IV]], 0]
				// CHECK: %[[FILL1_TILE:.+]] = linalg.fill
				// CHECK-SAME: outs(%[[INIT1_TILE]] :
				// CHECK: %[[GEMM1_TILE:.+]] = linalg.matmul
				// CHECK-SAME: ins(%[[GEMM0_TILE]], %[[RHS1_TILE]] :
				// CHECK-SAME: outs(%[[FILL1_TILE]] :
				// CHECK: %[[INSERT0:.+]] = tensor.insert_slice %[[GEMM1_TILE]] into %[[ITERARG0]][%[[IV]], 0]
				// CHECK: %[[INSERT1:.+]] = tensor.insert_slice %[[GEMM0_TILE]] into %[[ITERARG1]][%[[IV]], 0]
				// CHECK: scf.yield %[[INSERT0]], %[[INSERT1]]

mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp

[... 27 lines elided ...]

#include "mlir/Pass/PassManager.h" #include "mlir/Pass/PassManager.h"

#include "mlir/Transforms/GreedyPatternRewriteDriver.h" #include "mlir/Transforms/GreedyPatternRewriteDriver.h"

#include "llvm/ADT/TypeSwitch.h" #include "llvm/ADT/TypeSwitch.h"

using namespace mlir; using namespace mlir;

// TODO: this file should disappear and instead tests should make use of the // TODO: this file should disappear and instead tests should make use of the

// transform dialect. // transform dialect.

static constexpr char yieldMarker[] = "__yield_result__";

[Inline comment] hanchung (not done), suggested edit:

- static constexpr char yieldMarker[] = "__yield_result__";

+ const StringLiteral kYieldMarker = "__yield_result__";

Should be consistent with other uses (see below).

namespace { namespace {

/// Marker used as attribute name in generated Linalg rewriting transformations. /// Marker used as attribute name in generated Linalg rewriting transformations.

const StringLiteral kLinalgTransformMarker = "__internal_linalg_transform__"; const StringLiteral kLinalgTransformMarker = "__internal_linalg_transform__";

/// Helper class to control application of linalg transformation patterns. /// Helper class to control application of linalg transformation patterns.

/// Control comes in 2 forms: /// Control comes in 2 forms:

/// 1. attribute matching and setting behavior using the attribute named /// 1. attribute matching and setting behavior using the attribute named

[... 154 lines elided ...] LogicalResult matchAndRewrite(TilingInterface op,

return success(); return success();

} }

private: private:

scf::SCFTilingOptions options; scf::SCFTilingOptions options;

LinalgTransformationFilter filter; LinalgTransformationFilter filter;

}; };

/// Method to collect all potential fusable producer.

[Inline comment] chelini (done), suggested edit:

- /// Method to collect all potential fusable producer

+ /// Method to collect all potential fusable producer.

static llvm::SmallDenseSet<Operation *>

collectFusableProducers(TilingInterface op) {

llvm::SmallDenseSet<Operation *> producers;

producers.insert(op);

SmallVector<TilingInterface> worklist;

worklist.push_back(op);

while (!worklist.empty()) {

TilingInterface currOp = worklist.pop_back_val();

for (OpOperand &operand : currOp->getOpOperands()) {

auto producer = operand.get().getDefiningOp<TilingInterface>();

if (producer && !producers.count(producer)) {

worklist.push_back(producer);

producers.insert(producer);

}

}

}

return producers;

}
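
The helper above is a worklist traversal over operand definitions. A self-contained sketch of the same idea without MLIR types, where `ProducerGraph` and `collectProducers` are hypothetical names introduced only for illustration:

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for the def-use graph: maps an op name to the names
// of the ops that define its operands.
using ProducerGraph = std::map<std::string, std::vector<std::string>>;

// Collects `root` and every op reachable through operand definitions,
// pushing each op onto the worklist at most once.
std::set<std::string> collectProducers(const ProducerGraph &graph,
                                       const std::string &root) {
  std::set<std::string> producers{root};
  std::vector<std::string> worklist{root};
  while (!worklist.empty()) {
    std::string curr = worklist.back();
    worklist.pop_back();
    auto it = graph.find(curr);
    if (it == graph.end())
      continue; // No recorded operands for this op.
    for (const std::string &producer : it->second) {
      // insert().second is true only for ops not seen before.
      if (producers.insert(producer).second)
        worklist.push_back(producer);
    }
  }
  return producers;
}
```

Unlike the MLIR helper, this sketch does not filter on `TilingInterface`; it only illustrates the visited-set plus worklist structure.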

/// Pattern for testing `TileConsumerAndFuseProducersUsingSCFForOp` pattern /// Pattern for testing `TileConsumerAndFuseProducersUsingSCFForOp` pattern

/// (that tiles and fuses operations using the `TilingInterface` with `scf.for` /// (that tiles and fuses operations using the `TilingInterface` with `scf.for`

/// ops for iterating over the tiles) while using a `filter` to avoid recursive /// ops for iterating over the tiles) while using a `filter` to avoid recursive

/// application. /// application.

struct TestTileConsumerAndFuseProducersGreedilyUsingSCFForOp struct TestTileConsumerAndFuseProducersGreedilyUsingSCFForOp

: public OpInterfaceRewritePattern<TilingInterface> { : public OpInterfaceRewritePattern<TilingInterface> {

TestTileConsumerAndFuseProducersGreedilyUsingSCFForOp( TestTileConsumerAndFuseProducersGreedilyUsingSCFForOp(

MLIRContext *context, scf::SCFTileAndFuseOptions options, MLIRContext *context, scf::SCFTileAndFuseOptions options,

[... 11 lines elided ...] TestTileConsumerAndFuseProducersGreedilyUsingSCFForOp(

: OpInterfaceRewritePattern<TilingInterface>(context, benefit), : OpInterfaceRewritePattern<TilingInterface>(context, benefit),

options(std::move(options)), filter(std::move(filter)) {} options(std::move(options)), filter(std::move(filter)) {}

LogicalResult matchAndRewrite(TilingInterface op, LogicalResult matchAndRewrite(TilingInterface op,

PatternRewriter &rewriter) const override { PatternRewriter &rewriter) const override {

if (failed(filter.checkAndNotify(rewriter, op))) if (failed(filter.checkAndNotify(rewriter, op)))

return failure(); return failure();

llvm::SmallDenseSet<Operation *> producers = collectFusableProducers(op);

FailureOr<scf::SCFTileAndFuseResult> tileAndFuseResult = FailureOr<scf::SCFTileAndFuseResult> tileAndFuseResult =

scf::tileConsumerAndFuseProducerGreedilyUsingSCFForOp(rewriter, op, scf::tileConsumerAndFuseProducerGreedilyUsingSCFForOp(rewriter, op,

options); options);

if (failed(tileAndFuseResult)) { if (failed(tileAndFuseResult)) {

return failure(); return failure();

} }

// Replace the tiled op with replacements. // Replace the tiled op with replacements.

SmallVector<Value> replacements(op->getNumResults()); for (auto it : tileAndFuseResult->replacements) {

for (const auto &result : llvm::enumerate(op->getResults())) { it.first.replaceUsesWithIf(it.second, [&](OpOperand &use) {

replacements[result.index()] = Operation *user = use.getOwner();

tileAndFuseResult->replacements.lookup(result.value()); // Replace use if user is

// - Not one of the untiled producers

// - Not a dim op (these are resolved through use of

// `ReifyRankedShapedTypeOpInterface`)

// - is not the outer most loop generated by tile + fuse.

return !producers.count(user) && !isa<tensor::DimOp>(user) &&

(tileAndFuseResult->loops.empty() ||

!tileAndFuseResult->loops.front()->isAncestor(user));

[Inline comment] hanchung (done): Should we add `memref::DimOp` to the list?

[Inline comment] mravishankar (author, done): Tile and fuse doesn't work with `memref`s, and with memref there is no replacement to do.

});

} }

rewriter.replaceOp(op, replacements); rewriter.eraseOp(op);

filter.replaceLinalgTransformationFilter( filter.replaceLinalgTransformationFilter(

rewriter, tileAndFuseResult->tiledAndFusedOps.front()); rewriter, tileAndFuseResult->tiledAndFusedOps.front());

return success(); return success();

} }

private: private:

scf::SCFTileAndFuseOptions options; scf::SCFTileAndFuseOptions options;

[... 78 lines elided ...]

static void addPatternForTileAndFuse(MLIRContext *context, static void addPatternForTileAndFuse(MLIRContext *context,

RewritePatternSet &patterns, RewritePatternSet &patterns,

StringRef filterName, StringRef filterName,

ArrayRef<int64_t> tileSizes, ArrayRef<int64_t> tileSizes,

ArrayRef<int64_t> interchange = {}) { ArrayRef<int64_t> interchange = {}) {

scf::SCFTileAndFuseOptions tileAndFuseOptions; scf::SCFTileAndFuseOptions tileAndFuseOptions;

tileAndFuseOptions.tilingOptions.setTileSizes(tileSizes).setInterchange( tileAndFuseOptions.tilingOptions.setTileSizes(tileSizes).setInterchange(

interchange); interchange);

scf::SCFTileAndFuseOptions::ControlYieldFusedProducerResultFn fn =

[&](OpResult producer, Operation * /*sliceOp*/) -> bool {

Operation *producerOp = producer.getOwner();

auto yieldResult = producerOp->getAttrOfType<IntegerAttr>(yieldMarker);

if (!yieldResult || yieldResult.getInt() != producer.getResultNumber())

return false;

producerOp->removeAttr(yieldMarker);

return true;

};

tileAndFuseOptions.setControlYieldFusedProducerResultFn(fn);

LinalgTransformationFilter filter(StringAttr::get(context, filterName), LinalgTransformationFilter filter(StringAttr::get(context, filterName),

StringAttr::get(context, "tiled")); StringAttr::get(context, "tiled"));

patterns.add<TestTileConsumerAndFuseProducersGreedilyUsingSCFForOp>( patterns.add<TestTileConsumerAndFuseProducersGreedilyUsingSCFForOp>(

context, tileAndFuseOptions, filter); context, tileAndFuseOptions, filter);

} }
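
The function above installs a predicate that decides, per producer result, whether the fused value should also be yielded from the loop nest. The general shape of such an options-with-callback design can be sketched without MLIR. Everything in this block (`ProducerResult`, `TileAndFuseOptionsModel`, the default of never yielding) is an assumption made for illustration and is not taken from `TileUsingInterface.h`:

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for a producer result: an op id plus a result number.
struct ProducerResult {
  int opId;
  unsigned resultNumber;
};

// Simplified model of an options struct carrying a yield-control callback.
struct TileAndFuseOptionsModel {
  using ControlYieldFn = std::function<bool(const ProducerResult &)>;

  // Assumed default for this sketch: never yield fused producer results.
  ControlYieldFn shouldYield = [](const ProducerResult &) { return false; };

  // Chainable setter, mirroring the style of the setter used in the test pass.
  TileAndFuseOptionsModel &setControlYieldFn(ControlYieldFn fn) {
    shouldYield = std::move(fn);
    return *this;
  }
};
```

A caller would install a predicate such as "yield only result 0" and the driver would consult `shouldYield` once per fused producer, which is roughly what the `__yield_result__` attribute check in the test pass does.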

void TestTilingInterfacePass::addTestPatterns(MLIRContext *context, void TestTilingInterfacePass::addTestPatterns(MLIRContext *context,

RewritePatternSet &patterns) { RewritePatternSet &patterns) {

[... 32 lines elided ...] addPatternForTileAndFuse(context, patterns, "gemm_interchange_fusion",

{10, 20}, {1, 0}); {10, 20}, {1, 0});

// 4. Tile and fuse matmul + transpose(matmul). Will introduce redundant // 4. Tile and fuse matmul + transpose(matmul). Will introduce redundant

// computations. // computations.

addPatternForTileAndFuse(context, patterns, "gemm_plus_gemm_fusion", addPatternForTileAndFuse(context, patterns, "gemm_plus_gemm_fusion",

{10, 20}); {10, 20});

// 5. Tile and fuse a sequence of GEMMs by tiling and fusing only along M // 5. Tile and fuse a sequence of GEMMs by tiling and fusing only along M

// dimension. // dimension.

addPatternForTileAndFuse(context, patterns, "gemm_sequence_fusion", {10}); addPatternForTileAndFuse(context, patterns, "gemm_sequence_fusion", {10});

// 6. Fusion of back-to-back-reduction ops

addPatternForTileAndFuse(context, patterns, "reduction_sequence_fusion",

{10});

return; return;

} }

if (testLoweringToScalar) { if (testLoweringToScalar) {

patterns.add<LowerToLoopsUsingSCFForOp>(context); patterns.add<LowerToLoopsUsingSCFForOp>(context);

} }

void TestTilingInterfacePass::runOnOperation() { void TestTilingInterfacePass::runOnOperation() {

MLIRContext *context = &getContext(); MLIRContext *context = &getContext();

RewritePatternSet tilingPatterns(context); RewritePatternSet tilingPatterns(context);

addTestPatterns(context, tilingPatterns); addTestPatterns(context, tilingPatterns);

if (failed(applyPatternsAndFoldGreedily(getOperation(), if (failed(applyPatternsAndFoldGreedily(getOperation(),

std::move(tilingPatterns)))) std::move(tilingPatterns))))

return signalPassFailure(); return signalPassFailure();

getOperation().walk([&](Operation *op) { op->removeAttr(yieldMarker); });

} }

namespace mlir { namespace mlir {

namespace test { namespace test {

void registerTestTilingInterface() { void registerTestTilingInterface() {

PassRegistration<TestTilingInterfacePass>(); PassRegistration<TestTilingInterfacePass>();

} }

} // namespace test } // namespace test

} // namespace mlir } // namespace mlir


Revision Contents

Diff 477613

mlir/include/mlir/Dialect/SCF/Transforms/TileUsingInterface.h

mlir/lib/Dialect/SCF/Transforms/TileUsingInterface.cpp

mlir/test/Interfaces/TilingInterface/tile-and-fuse-using-interface.mlir

mlir/test/lib/Interfaces/TilingInterface/TestTilingInterface.cpp
