This is an archive of the discontinued LLVM Phabricator instance.

Differential D92876

[mlir][Affine] Add support for multi-store producer fusion
ClosedPublic

Authored by dcaballe on Dec 8 2020, 11:58 AM.

Details

Reviewers

andydavis1
bondhugula
ftynse

Commits

rGc8fc5c0385db: [mlir][Affine] Add support for multi-store producer fusion
rG7dd198852b4d: [mlir][Affine] Add support for multi-store producer fusion

Summary

This patch adds support for producer-consumer fusion scenarios with
multiple producer stores to the AffineLoopFusion pass. The patch
introduces some changes to the producer-consumer algorithm, including:

* For a given consumer loop, producer-consumer fusion iterates over its
  producer candidates until a fixed point is reached.
* Producer candidates are gathered beforehand for each iteration of the
  consumer loop and visited in reverse program order (not strictly guaranteed)
  to maximize the number of loops fused per iteration.

In general, these changes were needed to simplify the multi-store producer
support and remove some of the workarounds that were introduced in the past
to support more fusion cases under the single-store producer limitation.

This patch also preserves the existing functionality of AffineLoopFusion with
one minor change in behavior. Producer-consumer fusion didn't fuse scenarios
with escaping memrefs and multiple outgoing edges (from a single store).
Multi-store producer scenarios will usually (always?) have multiple outgoing
edges so we couldn't fuse any with escaping memrefs, which would greatly limit
the applicability of this new feature. Therefore, the patch enables fusion for
these scenarios. Please see the modified tests for specific details.
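
The fixed-point iteration described in the summary can be sketched roughly as follows. This is a simplified, self-contained illustration and not the code from the patch: `LoopGraph`, `gatherProducers`, and `fuseUntilFixedPoint` are hypothetical names, the graph is assumed to be acyclic, and node ids are assumed to follow program order.

```cpp
#include <functional>
#include <map>
#include <set>
#include <vector>

// Hypothetical toy dependence graph: maps a consumer loop id to the ids of
// the loops that produce data it reads. This is not the 'mdg' of the pass.
struct LoopGraph {
  std::map<unsigned, std::set<unsigned>> producers;
};

// Gathers the producer candidates of 'consumer' in descending id order
// (reverse program order, assuming ids follow program order).
std::vector<unsigned> gatherProducers(const LoopGraph &g, unsigned consumer) {
  std::vector<unsigned> result;
  auto it = g.producers.find(consumer);
  if (it != g.producers.end())
    result.assign(it->second.rbegin(), it->second.rend());
  return result;
}

// Repeatedly tries to fuse producers into 'consumer' until one full iteration
// fuses nothing (the fixed point). 'tryFuse' is a caller-provided stand-in for
// the legality and profitability checks. Returns the number of fused loops.
unsigned fuseUntilFixedPoint(LoopGraph &g, unsigned consumer,
                             const std::function<bool(unsigned)> &tryFuse) {
  unsigned numFused = 0;
  bool changed = true;
  while (changed) {
    changed = false;
    // Candidates are gathered up front for this iteration.
    for (unsigned producer : gatherProducers(g, consumer)) {
      if (!tryFuse(producer))
        continue;
      // Fusing removes the edge and makes the producer's own producers
      // candidates for the next iteration.
      g.producers[consumer].erase(producer);
      auto it = g.producers.find(producer);
      if (it != g.producers.end())
        g.producers[consumer].insert(it->second.begin(), it->second.end());
      ++numFused;
      changed = true;
    }
  }
  return numFused;
}
```

With a chain of loops 0 -> 1 -> 2, the first iteration fuses loop 1 into loop 2, which exposes loop 0 as a new candidate for the second iteration. A single pass over the initial candidates would have missed it.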

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dcaballe created this revision. Dec 8 2020, 11:58 AM

Herald added subscribers: teijeong, rdzhabarov, tatianashp and 14 others. Dec 8 2020, 11:58 AM

dcaballe requested review of this revision. Dec 8 2020, 11:58 AM

Herald added a project: Restricted Project. Dec 8 2020, 11:58 AM

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Some comments to facilitate the code review.

mlir/include/mlir/Transforms/LoopFusionUtils.h
100	I couldn't generalize the sibling fusion algorithm to be able to remove the 'memref' field and look at all the memrefs in the loops. The computation slices were significantly different when doing so, and I couldn't preserve the existing functionality (tests). I introduced these changes so that the memref can only be provided when using a sibling fusion strategy.
mlir/lib/Transforms/LoopFusion.cpp
318	This functionality is replaced with 'canRemoveSrcNodeAfterFusion' below. I'm adding it as a non-member in an attempt to minimize code that is specific to a fusion strategy so that we can eventually move a generic version of the 'mdg' to the loop fusion utilities (if that's what we want).
942	This code was heavily relying on having a single store so I couldn't reuse much. The 'hasNonAffineUsersOnThePath' check has been moved to the producer-consumer fusion main function. The memory region superset check is now covered by the slice::isMaximal check, which also works for an arbitrary number of stores.
1545	This check related to memref privatization is new and may be worth discussing. The remaining checks are the same as before but extended to support multiple stores.
1587	These workarounds are no longer needed (they couldn't be easily ported to the new scheme) since the algorithm runs until a fixed point is reached.
1619	Same here. The algorithm runs until a fixed point is reached so this is not needed.

Harbormaster completed remote builds in B81506: Diff 310308. Dec 8 2020, 12:21 PM

Thanks for this significant update! Some superficial comments to start with. It may be good to handle the pass documentation update as well, starting with an update to the summary. It's currently empty: https://mlir.llvm.org/docs/Passes/#-affine-loop-fusion-fuse-affine-loop-nests
You could also mention the two fusion strategies (producer-consumer and sibling). Here's the previous doc paragraph that disappeared (was overwritten) when the pass documentation was migrated to auto-generated stuff.

Performs fusion of loop nests using a slicing-based approach. The fused loop
nests, when possible, are rewritten to access significantly smaller local
buffers instead of the original memref's, and the latter are often
either completely optimized away or contracted. This transformation leads to
enhanced locality and lower memory footprint through the elimination or
contraction of temporaries / intermediate memref's. These benefits are sometimes
achieved at the expense of redundant computation through a cost model that
evaluates available choices such as the depth at which a source slice should be
materialized in the designation slice.

mlir/include/mlir/Transforms/LoopFusionUtils.h
53–54	Not related to this, but it's not clear how one gets both `ProducerConsumer` and `Sibling`. Does it have to be called twice? In that case, what about convergence?
mlir/lib/Analysis/AffineStructures.cpp
716–717	Interesting that the use case for this requires the same operands for upper bounds, lower bounds, and for IVs at all depths! I haven't looked at the other part of the code yet - so this is a superficial comment.
727	Any assertions on the number of operands?
mlir/lib/Transforms/LoopFusion.cpp
686	Nit: `memref.isa<BlockArgument>()`
692	auto -> Operation
774	Avoid `auto` here.

This revision now requires changes to proceed. Dec 10 2020, 12:55 AM

bondhugula added inline comments. Dec 10 2020, 1:03 AM

mlir/include/mlir/Analysis/AffineStructures.h
246	We could actually rename this to `addDomainFromSliceMaps` - it's good to keep it independent of loop IVs themselves, esp. given that you aren't passing `AffineForOp`s and that it's specifically dealing with the case where the operands are all the same for the maps (something that won't happen for different affine.for ops in the IR even if they are nested).
mlir/lib/Analysis/AffineStructures.cpp
751	auto -> `AffineForOp`
mlir/lib/Transforms/LoopFusion.cpp
577	can -> can be
578	Since 'dependences' are between two entities, I think we need to clarify what "has no dependences" means (e.g., outgoing deps, incoming deps, and intra-node deps are different things).

Addressing initial feedback.
Doc is WIP.

Thanks for working on this Diego!

mlir/lib/Analysis/Utils.cpp
162	It might be good to differentiate between error return cases and cases where there is no error, but isMaximal is false.
174	If possible, it might be good to have a fast case at the top of this function that checks for some simple cases (static upper/lower bounds), and then fall back to the integer set check otherwise...
mlir/lib/Transforms/LoopFusion.cpp
634	These kinds of simple, focused functions are great. Not in this change, but it feels like longer term, we could do more breaking up of long functions into focused utility functions.

Harbormaster completed remote builds in B81944: Diff 311073. Dec 10 2020, 7:01 PM

Adding one more comment...

andydavis1 added inline comments. Dec 16 2020, 10:57 AM

mlir/lib/Analysis/Utils.cpp
162	It might be good to differentiate between error return cases and cases where there is no error, but isMaximal is false.

Addressing all the comments. Thanks!

Sorry, I had some unsubmitted comments. Posting them all now.

Adding one more comment...

Hey Andy, I can't see any new comment. Did you forget to save it?

mlir/include/mlir/Analysis/AffineStructures.h
246	It sounds good. Thanks!
mlir/include/mlir/Transforms/LoopFusionUtils.h
53–54	Good point! If we wanted to fuse two loops, regardless of the strategy and outside the context of the AffineFusion pass, we should use the generic strategy. Two comments in this regard:
1. Note that the producer-consumer and sibling strategies are only used to specialize the loop fusion utilities with the assumptions made in the AffineLoopFusion pass so that we can preserve its existing functionality. These assumptions are very specific to the pass and mostly filter the memrefs that are considered in some steps of the analysis. Eventually, when we improve the utilities, we should be able to achieve the same results using the generic strategy and get rid of this class.
2. Note that the selection of loop candidates for fusion is something external to the utilities, at least for now. It's the client that would gather the candidate loops based on some properties (producer-consumer relationship, input reuse, etc.) and then use the utilities to determine if fusion is legal, implement a cost model, etc., for every pair of loops. IOW, the utilities are strategy-independent except for the aforementioned assumptions.
I wonder if it's worth restricting the ProducerConsumer and Sibling strategies to only the AffineFusion pass. We could make only the Generic strategy public and create a friend relationship between this class and the AffineFusion pass so that the ProducerConsumer and Sibling strategies can only be used from the pass. Just a random thought. I haven't thought too much about it.
mlir/lib/Analysis/AffineStructures.cpp
716–717	I had exactly the same question and found it super confusing! :) This is inspired by 'addSliceBounds' which has exactly the same requirement. It could be related to the way all the maps are created for the slice (?). When I was debugging this, I could see that all the maps had the same input symbols. Those not relevant for a particular map just were not part of the map output.
727	See below (729, 730)
mlir/lib/Analysis/Utils.cpp
174	Good point! Done and I verified that we have lit tests that cover all the scenarios: fast-check returns true/false, integer set diff returns true/false.
mlir/lib/Transforms/LoopFusion.cpp
634	Thanks! Yeah, we could probably add a similar one for sibling fusion candidates in the future. Any other examples?
774	'auto' should be justified here since the type is very verbose: std::pair<unsigned, Node>
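
The fast-check-then-fallback structure discussed for `isMaximal` (comments on Utils.cpp lines 162 and 174) can be sketched as below. This is an assumption-laden illustration rather than the patch's implementation: `Bound`, `LoopBounds`, and the callback-based `isMaximal` are hypothetical, `std::optional` stands in for `llvm::Optional`, and the slow path is only a callback where the patch uses an integer set difference.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a loop bound: a constant when statically known,
// empty otherwise.
using Bound = std::optional<int64_t>;

struct LoopBounds {
  Bound lb, ub;
};

// Fast check. Returns false if some pair of loops provably has different
// constant bounds, true if all pairs provably have the same constant bounds,
// and std::nullopt if neither can be proven.
std::optional<bool> isMaximalFastCheck(const std::vector<LoopBounds> &slice,
                                       const std::vector<LoopBounds> &src) {
  if (slice.size() != src.size())
    return std::nullopt;
  bool allKnown = true;
  for (std::size_t i = 0; i < slice.size(); ++i) {
    const bool known = slice[i].lb && slice[i].ub && src[i].lb && src[i].ub;
    if (!known) {
      allKnown = false;
      continue;
    }
    if (*slice[i].lb != *src[i].lb || *slice[i].ub != *src[i].ub)
      return false;
  }
  if (!allKnown)
    return std::nullopt;
  return true;
}

// Runs the fast check first and only falls back to the (presumably more
// expensive) 'slowCheck' when the fast check is inconclusive. The empty
// optional keeps "could not determine" distinct from "not maximal".
template <typename SlowCheck>
std::optional<bool> isMaximal(const std::vector<LoopBounds> &slice,
                              const std::vector<LoopBounds> &src,
                              SlowCheck slowCheck) {
  if (std::optional<bool> fast = isMaximalFastCheck(slice, src))
    return fast;
  return slowCheck();
}
```

Returning an optional rather than a plain `bool` is what lets callers tell an inconclusive analysis apart from a definite "not maximal" answer, which is the distinction requested in the comment on line 162.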

Ping :)

Thanks Diego! The changes look good to me. Please wait for Uday to have another look before submitting. Thanks!

bondhugula added inline comments. Jan 19 2021, 4:07 PM

mlir/test/Transforms/loop-fusion.mlir
407	This is an out-of-bounds access: `%m[100]` -> `%m[99]`. `-test-memref-bound-check` can actually catch these.

bondhugula added inline comments. Jan 19 2021, 4:17 PM

mlir/lib/Transforms/LoopFusion.cpp
637	granted -> guaranteed?
640	advance -> advanced
643	Do you want to just use an unordered set and then sort at the end in the right order? Right now, you are using a sorted set and then doing a `reverse`. You don't seem to need the sorted order in between. `srcIdCandidates` is a small set, and as you know, `std::set` will have high overhead. https://llvm.org/docs/ProgrammersManual.html#set
1404	Typo

This looks really great! A bunch of minor comments to address.

mlir/lib/Transforms/LoopFusion.cpp
1048	Looks like the method is already handling it, but not returning an accurate estimate?
1406	Likewise.
1616	Nit: ... after it is removed from `mdg`?
mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
628	ops in `srcOps` -> write ops in `srcOps`? ops in `dstOps` -> read ops in `dstOps`?
mlir/test/Transforms/loop-fusion.mlir
2376–2377	Not sure why this wasn't marked a TODO.
2695–2696	Looks like this has been fixed?

This revision is now accepted and ready to land. Jan 19 2021, 4:26 PM

Addressing Uday's feedback.

Herald added a subscriber: mgrang. Jan 19 2021, 7:42 PM

Thanks, Uday. I'll commit this tomorrow if no more comments.

mlir/lib/Transforms/LoopFusion.cpp
643	I returned the candidates in ascending order because I thought it would make more sense for external clients than the reverse order. However, I agree that we can return an unordered map and leave the order to the client. Not sure, though, if we would be gaining much in our use case since we have to copy the elements to another container and sort them...
1048	Some work is needed. We still have code that is looking at 'srcStoreOpInst'.
mlir/test/Transforms/loop-fusion.mlir
407	Good catch :)
2695–2696	Why? The size is still 128 and the ticket seems to be open.

Harbormaster completed remote builds in B85822: Diff 317760. Jan 19 2021, 8:10 PM

Closed by commit rG7dd198852b4d: [mlir][Affine] Add support for multi-store producer fusion (authored by dcaballe). Jan 20 2021, 9:13 AM

This revision was automatically updated to reflect the committed changes.

dcaballe added a commit: rG7dd198852b4d: [mlir][Affine] Add support for multi-store producer fusion.

bondhugula added inline comments. Jan 20 2021, 9:39 AM

mlir/lib/Transforms/LoopFusion.cpp
643	I think your comment is orthogonal here. You can return in whatever sorted order but you don't need to use an `std::set` here. You can use an `unordered_set` and then just copy + sort at the end and return a vector. A hash table plus a sorting algorithm is still far less complex (in time and, more importantly, in memory allocations and accesses) than a typical heavy red-black tree (or similar structure) backing `std::set`. You only have a handful of elements, so I assume the copy cost isn't much.
mlir/test/Transforms/loop-fusion.mlir
2695–2696	Looks like I was looking at the (wrong) test case further below.

I see an assert failure in loop-fusion.mlir.test in GreedyFusion::fuseProducerConsumerNodes

cast_retty<X, Y *>::ret_type llvm::cast(Y *) [X = mlir::AffineWriteOpInterface, Y = mlir::Operation]: isa<X>(Val) && "cast<Ty>() argument of incompatible type!"

dcaballe added inline comments. Jan 20 2021, 11:08 AM

mlir/lib/Transforms/LoopFusion.cpp
643	`unordered_set` is not recommended for similar reasons (https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options). I can change it to the "sorted vector" approach (https://llvm.org/docs/ProgrammersManual.html#a-sorted-vector) if that makes more sense to you.
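
The "sorted vector" idea mentioned above could look roughly like this. It is only a sketch under stated assumptions: the function names and the `rawIds` input are hypothetical, and plain `std::vector` is used so the example is self-contained (the LLVM Programmer's Manual discusses LLVM's own containers for this purpose).

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Collects candidate ids (possibly with duplicates), then sorts and
// deduplicates once at the end instead of maintaining a std::set.
// 'rawIds' is a hypothetical input; the pass gathers ids from its graph.
std::vector<unsigned> sortedUniqueCandidates(std::vector<unsigned> rawIds) {
  std::sort(rawIds.begin(), rawIds.end());
  rawIds.erase(std::unique(rawIds.begin(), rawIds.end()), rawIds.end());
  return rawIds;
}

// Returns the candidates in descending order, i.e. reverse program order
// when ids follow program order.
std::vector<unsigned> reverseProgramOrder(std::vector<unsigned> rawIds) {
  std::vector<unsigned> ids = sortedUniqueCandidates(std::move(rawIds));
  std::reverse(ids.begin(), ids.end());
  return ids;
}
```

For a handful of elements this keeps everything in one contiguous allocation, which is the trade-off both reviewers are pointing at when they argue against a node-based `std::set`.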

I see an assert failure in loop-fusion.mlir.test in GreedyFusion::fuseProducerConsumerNodes

Looking into it. Any buildbot failing? I didn't get any notification...

dcaballe added a reverting change: rG735a07f04785: Revert "[mlir][Affine] Add support for multi-store producer fusion". Jan 20 2021, 2:39 PM

dcaballe reopened this revision. Jan 20 2021, 4:45 PM

This revision is now accepted and ready to land. Jan 20 2021, 4:45 PM

Addressing Uday's comment and fixing ASAN issue.
We were gathering store ops and reusing them after 'createPrivateMemrefs',
which replaces some of those stores with new ones. Thanks @jpienaar!
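
The kind of bug described here (pointers gathered before a rewrite and used after it) can be illustrated with a toy model. Everything in this sketch is hypothetical: `Op`, `Block`, and `replaceStores` only imitate the situation and are not MLIR APIs or code from the patch.

```cpp
#include <memory>
#include <string>
#include <vector>

// Toy operation standing in for an MLIR operation.
struct Op {
  std::string name;
  bool isStore;
};

using Block = std::vector<std::unique_ptr<Op>>;

// Collects raw pointers to the store ops currently in 'block'.
std::vector<Op *> gatherStores(const Block &block) {
  std::vector<Op *> stores;
  for (const auto &op : block)
    if (op->isStore)
      stores.push_back(op.get());
  return stores;
}

// Stand-in for a rewrite such as memref privatization: every store is
// destroyed and replaced with a new op, so any pointer to a store that was
// gathered before this call dangles afterwards.
void replaceStores(Block &block) {
  for (auto &op : block)
    if (op->isStore)
      op = std::make_unique<Op>(Op{op->name + ".private", true});
}

// Safe pattern: gather the stores after the rewrite, not before it.
std::vector<Op *> rewriteThenGather(Block &block) {
  replaceStores(block);
  return gatherStores(block);
}
```

Calling `gatherStores` before `replaceStores` and dereferencing the result afterwards would be a use-after-free of the sort a sanitizer build reports. Re-gathering after the rewrite avoids it.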

Harbormaster completed remote builds in B86013: Diff 318064. Jan 20 2021, 9:47 PM

Closed by commit rGc8fc5c0385db: [mlir][Affine] Add support for multi-store producer fusion (authored by dcaballe). Jan 25 2021, 10:32 AM

This revision was automatically updated to reflect the committed changes.

dcaballe added a commit: rGc8fc5c0385db: [mlir][Affine] Add support for multi-store producer fusion.

pashu123 added a subscriber: pashu123. Dec 5 2021, 10:31 AM

Herald added subscribers: sdasgup3, wenzhicui, wrengr and 2 others. Dec 5 2021, 10:31 AM

Revision Contents

Path	Size
mlir/include/mlir/Analysis/AffineStructures.h	15 lines
mlir/include/mlir/Analysis/Utils.h	17 lines
mlir/include/mlir/Transforms/LoopFusionUtils.h	49 lines
mlir/include/mlir/Transforms/Passes.td	105 lines
mlir/lib/Analysis/AffineStructures.cpp	64 lines
mlir/lib/Analysis/Utils.cpp	124 lines
mlir/lib/Transforms/LoopFusion.cpp	699 lines
mlir/lib/Transforms/Utils/LoopFusionUtils.cpp	47 lines
mlir/test/Transforms/loop-fusion.mlir	225 lines

Diff 319056

mlir/include/mlir/Analysis/AffineStructures.h

[226 lines not shown] public:
   /// constraint system. Returns failure for the yet unimplemented/unsupported
   /// cases. Any new identifiers that are found in the bound operands of the
   /// 'affine.for' operation are added as trailing identifiers (either
   /// dimensional or symbolic depending on whether the operand is a valid
   /// symbol).
   // TODO: add support for non-unit strides.
   LogicalResult addAffineForOpDomain(AffineForOp forOp);

+  /// Adds constraints (lower and upper bounds) for each loop in the loop nest
+  /// described by the bound maps 'lbMaps' and 'ubMaps' of a computation slice.
+  /// Every pair ('lbMaps[i]', 'ubMaps[i]') describes the bounds of a loop in
+  /// the nest, sorted outer-to-inner. 'operands' contains the bound operands
+  /// for a single bound map. All the bound maps will use the same bound
+  /// operands. Note that some loops described by a computation slice might not
+  /// exist yet in the IR so the Value attached to those dimension identifiers
+  /// might be empty. For that reason, this method doesn't perform Value
+  /// look-ups to retrieve the dimension identifier positions. Instead, it
+  /// assumes the position of the dim identifiers in the constraint system is
+  /// the same as the position of the loop in the loop nest.
+  LogicalResult addDomainFromSliceMaps(ArrayRef<AffineMap> lbMaps,
+                                       ArrayRef<AffineMap> ubMaps,
+                                       ArrayRef<Value> operands);
+
   /// Adds constraints imposed by the `affine.if` operation. These constraints
   /// are collected from the IntegerSet attached to the given `affine.if`
   /// instance argument (`ifOp`). It is asserted that:
   /// 1) The IntegerSet of the given `affine.if` instance should not contain
   /// semi-affine expressions,
   /// 2) The columns of the constraint system created from `ifOp` should match
   /// the columns in the current one regarding numbers and values.
   void addAffineIfOpDomain(AffineIfOp ifOp);
[467 lines not shown]

mlir/include/mlir/Analysis/Utils.h

[77 lines not shown] struct ComputationSliceState {
   // Constraints are added for all loop IV bounds (dim or symbol), and
   // constraints are added for slice bounds in 'lbs'/'ubs'.
   // Returns failure if we cannot add loop bounds because of unsupported cases.
   LogicalResult getAsConstraints(FlatAffineConstraints *cst);

   // Clears all bounds and operands in slice state.
   void clearBounds();

-  /// Return true if the computation slice is empty.
+  /// Returns true if the computation slice is empty.
   bool isEmpty() const { return ivs.empty(); }

+  /// Returns true if the computation slice encloses all the iterations of the
+  /// sliced loop nest. Returns false if it does not. Returns llvm::None if it
+  /// cannot determine if the slice is maximal or not.
+  // TODO: Cache 'isMaximal' so that we don't recompute it when the slice
+  // information hasn't changed.
+  Optional<bool> isMaximal() const;
+
   void dump() const;
+
+private:
+  /// Fast check to determine if the computation slice is maximal. Returns true
+  /// if each slice dimension maps to an existing dst dimension and both the src
+  /// and the dst loops for those dimensions have the same bounds. Returns false
+  /// if both the src and the dst loops don't have the same bounds. Returns
+  /// llvm::None if none of the above can be proven.
+  Optional<bool> isSliceMaximalFastCheck() const;
 };

 /// Computes the computation slice loop bounds for one loop nest as affine maps
 /// of the other loop nest's IVs and symbols, using 'dependenceConstraints'
 /// computed between 'depSourceAccess' and 'depSinkAccess'.
 /// If 'isBackwardSlice' is true, a backwards slice is computed in which the
 /// slice bounds of loop nest surrounding 'depSourceAccess' are computed in
 /// terms of loop IVs and symbols of the loop nest surrounding 'depSinkAccess'
[227 lines not shown]

mlir/include/mlir/Transforms/LoopFusionUtils.h

[44 lines not shown]
 /// and sibling fusion, while sharing a single implementation. The latter
 /// strategies are also limited to scenarios where a single memref is involved
 /// in the producer-consume or sibling relationship between the candidate
 /// loops. We use 'memref' to keep track of such a memref.
 // TODO: Remove 'memref' when we support more generic scenarios.
 // TODO: Generalize utilities so that producer-consumer and sibling fusion
 // strategies can be used without the assumptions made in the AffineLoopFusion
 // pass.
-struct FusionStrategy {
+class FusionStrategy {
+public:
   enum StrategyEnum {
     // Generic loop fusion: Arbitrary loops are considered for fusion. No
     // assumptions about a specific fusion strategy from AffineLoopFusion pass
     // are made.
     // TODO: Generic fusion is not fully implemented by fusion utilities yet.
     // It should only be used for testing.
     Generic,
     // Producer-consumer fusion: Only loops with a producer-consumer
     // memref dependence are considered for fusion. Currently, assumptions from
     // the producer-consumer fusion implementation in AffineLoopFusion pass are
     // made. See pass for specific details.
     ProducerConsumer,
     // Sibling fusion: Only sibling loops with no producer-consumer memref
     // dependences are considered for fusion. Memref reuse is taken into account
     // for profitability. Currently, assumptions from the sibling fusion
     // implementation in AffineLoopFusion pass are made. See pass for specific
     // details.
     Sibling
-  } strategy;
+  };

-  // Target memref for this fusion transformation.
-  Value memref;
+  /// Construct a generic or producer-consumer fusion strategy.
+  FusionStrategy(StrategyEnum strategy) : strategy(strategy) {
+    assert(strategy != Sibling &&
+           "Sibling fusion strategy requires a specific memref");
+  }
+
+  /// Construct a sibling fusion strategy targeting 'memref'. This construct
+  /// should only be used for sibling fusion.
+  FusionStrategy(Value memref) : strategy(Sibling), memref(memref) {}
+
+  /// Returns the fusion strategy.
+  StrategyEnum getStrategy() const { return strategy; };
+
+  /// Returns the memref attached to this sibling fusion strategy.
+  Value getSiblingFusionMemRef() const {
+    assert(strategy == Sibling && "Memref is only valid for sibling fusion");
+    return memref;
+  }
+
+private:
+  /// Fusion strategy.
+  StrategyEnum strategy;

-  FusionStrategy(StrategyEnum strategy, Value memref)
-      : strategy(strategy), memref(memref) {}
+  /// Target memref for this fusion transformation. Only used for sibling
+  /// fusion.
+  Value memref;
 };

 /// Checks the feasibility of fusing the loop nest rooted at 'srcForOp' into the
 /// loop nest rooted at 'dstForOp' at 'dstLoopDepth'. Returns FusionResult
 /// 'Success' if fusion of the src/dst loop nests is feasible (i.e. they are
 /// in the same block and dependences would not be violated). Otherwise
 /// returns a FusionResult explaining why fusion is not feasible.
 /// NOTE: This function is not feature complete and should only be used in
 /// testing.
 /// TODO: Update comments when this function is fully implemented.
-FusionResult canFuseLoops(AffineForOp srcForOp, AffineForOp dstForOp,
-                          unsigned dstLoopDepth,
-                          ComputationSliceState *srcSlice,
-                          FusionStrategy fusionStrategy = {
-                              FusionStrategy::Generic, Value()});
+FusionResult
+canFuseLoops(AffineForOp srcForOp, AffineForOp dstForOp, unsigned dstLoopDepth,
+             ComputationSliceState *srcSlice,
+             FusionStrategy fusionStrategy = FusionStrategy::Generic);

 /// Fuses 'srcForOp' into 'dstForOp' with destination loop block insertion point
 /// and source slice loop bounds specified in 'srcSlice'.
 void fuseLoops(AffineForOp srcForOp, AffineForOp dstForOp,
                const ComputationSliceState &srcSlice);

 /// LoopNestStats aggregates various per-loop statistics (eg. loop trip count
 /// and operation count) for a loop nest up until (and including) the innermost
[27 lines not shown]
 /// the entire loop nest.
 /// Returns true on success, failure otherwise (e.g. non-constant trip counts).
 // TODO: Improve this cost model.
 bool getFusionComputeCost(AffineForOp srcForOp, LoopNestStats &srcStats,
                           AffineForOp dstForOp, LoopNestStats &dstStats,
                           const ComputationSliceState &slice,
                           int64_t *computeCost);

+/// Returns in 'producerConsumerMemrefs' the memrefs involved in a
+/// producer-consumer dependence between write ops in 'srcOps' and read ops in
+/// 'dstOps'.
+void gatherProducerConsumerMemrefs(ArrayRef<Operation *> srcOps,
+                                   ArrayRef<Operation *> dstOps,
+                                   DenseSet<Value> &producerConsumerMemrefs);
 } // end namespace mlir

 #endif // MLIR_TRANSFORMS_LOOP_FUSION_UTILS_H

mlir/include/mlir/Transforms/Passes.td

	Show All 11 Lines

	#ifndef MLIR_TRANSFORMS_PASSES			#ifndef MLIR_TRANSFORMS_PASSES
	#define MLIR_TRANSFORMS_PASSES			#define MLIR_TRANSFORMS_PASSES

	include "mlir/Pass/PassBase.td"			include "mlir/Pass/PassBase.td"

	def AffineLoopFusion : FunctionPass<"affine-loop-fusion"> {			def AffineLoopFusion : FunctionPass<"affine-loop-fusion"> {
	let summary = "Fuse affine loop nests";			let summary = "Fuse affine loop nests";
				let description = [{
				This pass performs fusion of loop nests using a slicing-based approach. It
				combines two fusion strategies: producer-consumer fusion and sibling fusion.
				Producer-consumer fusion is aimed at fusing pairs of loops where the first
				one writes to a memref that the second reads. Sibling fusion targets pairs
				of loops that share no dependences between them but that load from the same
				memref. The fused loop nests, when possible, are rewritten to access
				significantly smaller local buffers instead of the original memrefs, and
				the latter are often either completely optimized away or contracted. This
				transformation leads to enhanced locality and lower memory footprint through
				the elimination or contraction of temporaries/intermediate memrefs. These
				benefits are sometimes achieved at the expense of redundant computation
				through a cost model that evaluates available choices such as the depth at
				which a source slice should be materialized in the destination slice.

				Example 1: Producer-consumer fusion.
				Input:
				```mlir
				func @producer_consumer_fusion(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
				%0 = alloc() : memref<10xf32>
				%1 = alloc() : memref<10xf32>
				%cst = constant 0.000000e+00 : f32
				affine.for %arg2 = 0 to 10 {
				affine.store %cst, %0[%arg2] : memref<10xf32>
				affine.store %cst, %1[%arg2] : memref<10xf32>
				}
				affine.for %arg2 = 0 to 10 {
				%2 = affine.load %0[%arg2] : memref<10xf32>
				%3 = addf %2, %2 : f32
				affine.store %3, %arg0[%arg2] : memref<10xf32>
				}
				affine.for %arg2 = 0 to 10 {
				%2 = affine.load %1[%arg2] : memref<10xf32>
				%3 = mulf %2, %2 : f32
				affine.store %3, %arg1[%arg2] : memref<10xf32>
				}
				return
				}
				```
				Output:
				```mlir
				func @producer_consumer_fusion(%arg0: memref<10xf32>, %arg1: memref<10xf32>) {
				%0 = alloc() : memref<1xf32>
				%1 = alloc() : memref<1xf32>
				%cst = constant 0.000000e+00 : f32
				affine.for %arg2 = 0 to 10 {
				affine.store %cst, %0[0] : memref<1xf32>
				affine.store %cst, %1[0] : memref<1xf32>
				%2 = affine.load %1[0] : memref<1xf32>
				%3 = mulf %2, %2 : f32
				affine.store %3, %arg1[%arg2] : memref<10xf32>
				%4 = affine.load %0[0] : memref<1xf32>
				%5 = addf %4, %4 : f32
				affine.store %5, %arg0[%arg2] : memref<10xf32>
				}
				return
				}
				```

				Example 2: Sibling fusion.
				Input:
				```mlir
				func @sibling_fusion(%arg0: memref<10x10xf32>, %arg1: memref<10x10xf32>,
				%arg2: memref<10x10xf32>, %arg3: memref<10x10xf32>,
				%arg4: memref<10x10xf32>) {
				affine.for %arg5 = 0 to 3 {
				affine.for %arg6 = 0 to 3 {
				%0 = affine.load %arg0[%arg5, %arg6] : memref<10x10xf32>
				%1 = affine.load %arg1[%arg5, %arg6] : memref<10x10xf32>
				%2 = mulf %0, %1 : f32
				affine.store %2, %arg3[%arg5, %arg6] : memref<10x10xf32>
				}
				}
				affine.for %arg5 = 0 to 3 {
				affine.for %arg6 = 0 to 3 {
				%0 = affine.load %arg0[%arg5, %arg6] : memref<10x10xf32>
				%1 = affine.load %arg2[%arg5, %arg6] : memref<10x10xf32>
				%2 = addf %0, %1 : f32
				affine.store %2, %arg4[%arg5, %arg6] : memref<10x10xf32>
				}
				}
				return
				}
				```
				Output:
				```mlir
				func @sibling_fusion(%arg0: memref<10x10xf32>, %arg1: memref<10x10xf32>,
				%arg2: memref<10x10xf32>, %arg3: memref<10x10xf32>,
				%arg4: memref<10x10xf32>) {
				affine.for %arg5 = 0 to 3 {
				affine.for %arg6 = 0 to 3 {
				%0 = affine.load %arg0[%arg5, %arg6] : memref<10x10xf32>
				%1 = affine.load %arg1[%arg5, %arg6] : memref<10x10xf32>
				%2 = mulf %0, %1 : f32
				affine.store %2, %arg3[%arg5, %arg6] : memref<10x10xf32>
				%3 = affine.load %arg0[%arg5, %arg6] : memref<10x10xf32>
				%4 = affine.load %arg2[%arg5, %arg6] : memref<10x10xf32>
				%5 = addf %3, %4 : f32
				affine.store %5, %arg4[%arg5, %arg6] : memref<10x10xf32>
				}
				}
				return
				}
				```
				}];
	let constructor = "mlir::createLoopFusionPass()";			let constructor = "mlir::createLoopFusionPass()";
	let options = [			let options = [
	Option<"computeToleranceThreshold", "fusion-compute-tolerance", "double",			Option<"computeToleranceThreshold", "fusion-compute-tolerance", "double",
	/*default=*/"0.30f", "Fractional increase in additional computation "			/*default=*/"0.30f", "Fractional increase in additional computation "
	"tolerated while fusing">,			"tolerated while fusing">,
	Option<"fastMemorySpace", "fusion-fast-mem-space", "unsigned",			Option<"fastMemorySpace", "fusion-fast-mem-space", "unsigned",
	/*default=*/"0",			/*default=*/"0",
	"Faster memory space number to promote fusion buffers to">,			"Faster memory space number to promote fusion buffers to">,
	▲ Show 20 Lines • Show All 601 Lines • Show Last 20 Lines

mlir/lib/Analysis/AffineStructures.cpp

Show First 20 Lines • Show All 703 Lines • ▼ Show 20 Lines	if (forOp.hasConstantUpperBound()) {
return success();		return success();
}		}
// Non-constant upper bound case.		// Non-constant upper bound case.
return addLowerOrUpperBound(pos, forOp.getUpperBoundMap(),		return addLowerOrUpperBound(pos, forOp.getUpperBoundMap(),
forOp.getUpperBoundOperands(),		forOp.getUpperBoundOperands(),
/*eq=*/false, /*lower=*/false);		/*eq=*/false, /*lower=*/false);
}		}

		/// Adds constraints (lower and upper bounds) for each loop in the loop nest
		/// described by the bound maps 'lbMaps' and 'ubMaps' of a computation slice.
		/// Every pair ('lbMaps[i]', 'ubMaps[i]') describes the bounds of a loop in
		/// the nest, sorted outer-to-inner. 'operands' contains the bound operands
		/// for a single bound map. All the bound maps will use the same bound
		/// operands. Note that some loops described by a computation slice might not
		Inline comment by bondhugula (Not Done): Interesting that the use case for this requires the same operands for upper bounds, lower bounds, and for IVs at all depths! I haven't looked at the other part of the code yet - so this is a superficial comment.
		Inline comment by dcaballe (Author, Done): I had exactly the same question and found it super confusing! :) This is inspired by 'addSliceBounds' which has exactly the same requirement. It could be related to the way all the maps are created for the slice (?). When I was debugging this, I could see that all the maps had the same input symbols. Those not relevant for a particular map just were not part of the map output.
		/// exist yet in the IR so the Value attached to those dimension identifiers
		/// might be empty. For that reason, this method doesn't perform Value
		/// look-ups to retrieve the dimension identifier positions. Instead, it
		/// assumes the position of the dim identifiers in the constraint system is
		/// the same as the position of the loop in the loop nest.
		LogicalResult
		FlatAffineConstraints::addDomainFromSliceMaps(ArrayRef<AffineMap> lbMaps,
		ArrayRef<AffineMap> ubMaps,
		ArrayRef<Value> operands) {
		assert(lbMaps.size() == ubMaps.size());
		Inline comment by bondhugula (Not Done): Any assertions on the number of operands?
		Inline comment by dcaballe (Author, Done): See below (729, 730)
		assert(lbMaps.size() <= getNumDimIds());

		for (unsigned i = 0, e = lbMaps.size(); i < e; ++i) {
		AffineMap lbMap = lbMaps[i];
		AffineMap ubMap = ubMaps[i];
		assert(!lbMap || lbMap.getNumInputs() == operands.size());
		assert(!ubMap || ubMap.getNumInputs() == operands.size());

		// Check if this slice is just an equality along this dimension. If so,
		// retrieve the existing loop it equates to and add it to the system.
		if (lbMap && ubMap && lbMap.getNumResults() == 1 &&
		ubMap.getNumResults() == 1 &&
		lbMap.getResult(0) + 1 == ubMap.getResult(0) &&
		// The condition above will be true for maps describing a single
		// iteration (e.g., lbMap.getResult(0) = 0, ubMap.getResult(0) = 1).
		// Make sure we skip those cases by checking that the lb result is not
		// just a constant.
		!lbMap.getResult(0).isa<AffineConstantExpr>()) {
		// Limited support: we expect the lb result to be just a loop dimension.
		// Not supported otherwise for now.
		AffineDimExpr result = lbMap.getResult(0).dyn_cast<AffineDimExpr>();
		if (!result)
		return failure();

		Inline comment by bondhugula (Done): auto -> `AffineForOp`
		AffineForOp loop =
		getForInductionVarOwner(operands[result.getPosition()]);
		if (!loop)
		return failure();

		if (failed(addAffineForOpDomain(loop)))
		return failure();
		continue;
		}

		// This slice refers to a loop that doesn't exist in the IR yet. Add its
		// bounds to the system assuming its dimension identifier position is the
		// same as the position of the loop in the loop nest.
		if (lbMap && failed(addLowerOrUpperBound(i, lbMap, operands, /*eq=*/false,
		/*lower=*/true)))
		return failure();

		if (ubMap && failed(addLowerOrUpperBound(i, ubMap, operands, /*eq=*/false,
		/*lower=*/false)))
		return failure();
		}
		return success();
		}
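
The equality test used in 'addDomainFromSliceMaps' (ub == lb + 1 with a non-constant lb) can be illustrated outside of MLIR. The following is a minimal sketch under stated assumptions: 'SimpleExpr' and 'getEqualityDim' are hypothetical names introduced only for illustration, and an affine bound is reduced to an optional dimension plus a constant. Unlike the patch, which returns failure() for a unit-range slice whose lb is not a bare dimension, the sketch simply reports "no equality" in that case.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical one-term affine expression, not an MLIR type:
// 'd<dim> + constant' when 'dim' is set, plain 'constant' otherwise.
struct SimpleExpr {
  std::optional<unsigned> dim;
  int64_t constant;
};

// Returns the position of the existing loop dimension that a slice dimension
// is pinned to, or std::nullopt when the bounds are not such an equality.
std::optional<unsigned> getEqualityDim(const SimpleExpr &lb,
                                       const SimpleExpr &ub) {
  // The slice spans exactly one iteration when ub == lb + 1.
  bool isUnitRange = lb.dim == ub.dim && lb.constant + 1 == ub.constant;
  if (!isUnitRange)
    return std::nullopt;
  // A constant lb (e.g. [0, 1)) describes a single-iteration loop, not an
  // equality with an existing loop, so it is skipped.
  if (!lb.dim)
    return std::nullopt;
  // Limited support, as in the patch: the lb must be a bare dimension.
  if (lb.constant != 0)
    return std::nullopt;
  return lb.dim;
}
```

For example, bounds [d2, d2 + 1) are reported as an equality on dimension 2, while [0, 1) and [0, 10) are not.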

void FlatAffineConstraints::addAffineIfOpDomain(AffineIfOp ifOp) {		void FlatAffineConstraints::addAffineIfOpDomain(AffineIfOp ifOp) {
// Create the base constraints from the integer set attached to ifOp.		// Create the base constraints from the integer set attached to ifOp.
FlatAffineConstraints cst(ifOp.getIntegerSet());		FlatAffineConstraints cst(ifOp.getIntegerSet());

// Bind ids in the constraints to ifOp operands.		// Bind ids in the constraints to ifOp operands.
SmallVector<Value, 4> operands = ifOp.getOperands();		SmallVector<Value, 4> operands = ifOp.getOperands();
cst.setIdValues(0, cst.getNumDimAndSymbolIds(), operands);		cst.setIdValues(0, cst.getNumDimAndSymbolIds(), operands);

▲ Show 20 Lines • Show All 2,585 Lines • Show Last 20 Lines

mlir/lib/Analysis/Utils.cpp

//===- Utils.cpp ---- Misc utilities for analysis -------------------------===//		//===- Utils.cpp ---- Misc utilities for analysis -------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file implements miscellaneous analysis routines for non-loop IR		// This file implements miscellaneous analysis routines for non-loop IR
// structures.		// structures.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "mlir/Analysis/Utils.h"		#include "mlir/Analysis/Utils.h"

#include "mlir/Analysis/AffineAnalysis.h"		#include "mlir/Analysis/AffineAnalysis.h"
		#include "mlir/Analysis/PresburgerSet.h"
#include "mlir/Dialect/Affine/IR/AffineOps.h"		#include "mlir/Dialect/Affine/IR/AffineOps.h"
#include "mlir/Dialect/Affine/IR/AffineValueMap.h"		#include "mlir/Dialect/Affine/IR/AffineValueMap.h"
#include "mlir/Dialect/StandardOps/IR/Ops.h"		#include "mlir/Dialect/StandardOps/IR/Ops.h"
#include "mlir/IR/IntegerSet.h"		#include "mlir/IR/IntegerSet.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"

▲ Show 20 Lines • Show All 97 Lines • ▼ Show 20 Lines	void ComputationSliceState::dump() const {
for (auto &en : llvm::enumerate(ubs)) {		for (auto &en : llvm::enumerate(ubs)) {
llvm::errs() << "\t\t" << en.value() << "\n";		llvm::errs() << "\t\t" << en.value() << "\n";
llvm::errs() << "\t\tOperands:\n";		llvm::errs() << "\t\tOperands:\n";
for (Value ubOp : ubOperands[en.index()])		for (Value ubOp : ubOperands[en.index()])
llvm::errs() << "\t\t\t" << ubOp << "\n";		llvm::errs() << "\t\t\t" << ubOp << "\n";
}		}
}		}

		/// Fast check to determine if the computation slice is maximal. Returns true if
		/// each slice dimension maps to an existing dst dimension and both the src
		/// and the dst loops for those dimensions have the same bounds. Returns false
		/// if both the src and the dst loops don't have the same bounds. Returns
		/// llvm::None if none of the above can be proven.
		Optional<bool> ComputationSliceState::isSliceMaximalFastCheck() const {
		assert(lbs.size() == ubs.size() && lbs.size() && ivs.size() &&
		"Unexpected number of lbs, ubs and ivs in slice");

		for (unsigned i = 0, end = lbs.size(); i < end; ++i) {
		AffineMap lbMap = lbs[i];
		AffineMap ubMap = ubs[i];

		// Check if this slice is just an equality along this dimension.
		if (!lbMap || !ubMap || lbMap.getNumResults() != 1 ||
		ubMap.getNumResults() != 1 ||
		lbMap.getResult(0) + 1 != ubMap.getResult(0) ||
		// The condition above will be true for maps describing a single
		// iteration (e.g., lbMap.getResult(0) = 0, ubMap.getResult(0) = 1).
		// Make sure we skip those cases by checking that the lb result is not
		// just a constant.
		lbMap.getResult(0).isa<AffineConstantExpr>())
		return llvm::None;

		// Limited support: we expect the lb result to be just a loop dimension for
		// now.
		AffineDimExpr result = lbMap.getResult(0).dyn_cast<AffineDimExpr>();
		if (!result)
		return llvm::None;

		// Retrieve dst loop bounds.
		AffineForOp dstLoop =
		getForInductionVarOwner(lbOperands[i][result.getPosition()]);
		Inline comment by andydavis1 (Done): It might be good to differentiate between error return cases and cases where there is no error, but isMaximal is false.
		if (!dstLoop)
		return llvm::None;
		AffineMap dstLbMap = dstLoop.getLowerBoundMap();
		AffineMap dstUbMap = dstLoop.getUpperBoundMap();

		// Retrieve src loop bounds.
		AffineForOp srcLoop = getForInductionVarOwner(ivs[i]);
		assert(srcLoop && "Expected affine for");
		AffineMap srcLbMap = srcLoop.getLowerBoundMap();
		AffineMap srcUbMap = srcLoop.getUpperBoundMap();

		// Limited support: we expect simple src and dst loops with a single
		Inline comment by andydavis1 (Not Done): If possible, it might be good to have a fast-case at the top of this function, that checked for some simple cases (static upper/lower bounds). Then fallback to the integer set check otherwise...
		Inline comment by dcaballe (Author, Done): Good point! Done and I verified that we have lit tests that cover all the scenarios: fast-check returns true/false, integer set diff returns true/false.
		// constant component per bound for now.
		if (srcLbMap.getNumResults() != 1 || srcUbMap.getNumResults() != 1 ||
		dstLbMap.getNumResults() != 1 || dstUbMap.getNumResults() != 1)
		return llvm::None;

		AffineExpr srcLbResult = srcLbMap.getResult(0);
		AffineExpr dstLbResult = dstLbMap.getResult(0);
		AffineExpr srcUbResult = srcUbMap.getResult(0);
		AffineExpr dstUbResult = dstUbMap.getResult(0);
		if (!srcLbResult.isa<AffineConstantExpr>() ||
		!srcUbResult.isa<AffineConstantExpr>() ||
		!dstLbResult.isa<AffineConstantExpr>() ||
		!dstUbResult.isa<AffineConstantExpr>())
		return llvm::None;

		// Check if src and dst loop bounds are the same. If not, we can guarantee
		// that the slice is not maximal.
		if (srcLbResult != dstLbResult || srcUbResult != dstUbResult)
		return false;
		}

		return true;
		}
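
The three-valued result of the fast check (true, false, or inconclusive) can be sketched without MLIR types. This is only an illustration under assumptions: 'DimBounds' and its fields are hypothetical stand-ins for the per-dimension information that the patch reads from the slice maps and from the src/dst loops, and std::optional replaces llvm::Optional.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical summary of one slice dimension; not an MLIR type.
struct DimBounds {
  // Constant bounds of the src and dst loops, or std::nullopt when a bound
  // is not a single constant.
  std::optional<int64_t> srcLb, srcUb, dstLb, dstUb;
  // True if the slice bounds have the form [d, d + 1) for an existing dst
  // loop IV 'd', i.e. the slice is an equality along this dimension.
  bool isEqualityOnDstIv;
};

// Returns true if every dimension is provably maximal, false if some
// dimension provably is not, and std::nullopt if the check is inconclusive.
std::optional<bool>
isSliceMaximalFastCheck(const std::vector<DimBounds> &dims) {
  assert(!dims.empty() && "expected at least one slice dimension");
  for (const DimBounds &dim : dims) {
    // Not an equality on a dst loop: nothing can be concluded cheaply.
    if (!dim.isEqualityOnDstIv)
      return std::nullopt;
    // Limited support: all four bounds must be constants.
    if (!dim.srcLb || !dim.srcUb || !dim.dstLb || !dim.dstUb)
      return std::nullopt;
    // Different constant bounds prove that the slice is not maximal.
    if (*dim.srcLb != *dim.dstLb || *dim.srcUb != *dim.dstUb)
      return false;
  }
  return true;
}
```

Keeping "inconclusive" separate from "false" is what lets the caller fall back to the more expensive integer set analysis only when it is needed.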

		/// Returns true if the computation slice encloses all the iterations of the
		/// sliced loop nest. Returns false if it does not. Returns llvm::None if it
		/// cannot determine if the slice is maximal or not.
		Optional<bool> ComputationSliceState::isMaximal() const {
		// Fast check to determine if the computation slice is maximal. If the result
		// is inconclusive, we proceed with a more expensive analysis.
		Optional<bool> isMaximalFastCheck = isSliceMaximalFastCheck();
		if (isMaximalFastCheck.hasValue())
		return isMaximalFastCheck;

		// Create constraints for the src loop nest being sliced.
		FlatAffineConstraints srcConstraints;
		srcConstraints.reset(/*numDims=*/ivs.size(), /*numSymbols=*/0,
		/*numLocals=*/0, ivs);
		for (Value iv : ivs) {
		AffineForOp loop = getForInductionVarOwner(iv);
		assert(loop && "Expected affine for");
		if (failed(srcConstraints.addAffineForOpDomain(loop)))
		return llvm::None;
		}

		// Create constraints for the slice using the dst loop nest information. We
		// retrieve existing dst loops from the lbOperands.
		SmallVector<Value, 8> consumerIVs;
		for (Value lbOp : lbOperands[0])
		if (getForInductionVarOwner(lbOp))
		consumerIVs.push_back(lbOp);

		// Add empty IV Values for those new loops that are not equalities and,
		// therefore, are not yet materialized in the IR.
		for (int i = consumerIVs.size(), end = ivs.size(); i < end; ++i)
		consumerIVs.push_back(Value());

		FlatAffineConstraints sliceConstraints;
		sliceConstraints.reset(/*numDims=*/consumerIVs.size(), /*numSymbols=*/0,
		/*numLocals=*/0, consumerIVs);

		if (failed(sliceConstraints.addDomainFromSliceMaps(lbs, ubs, lbOperands[0])))
		return llvm::None;

		if (srcConstraints.getNumDimIds() != sliceConstraints.getNumDimIds())
		// Constraint dims are different. The integer set difference can't be
		// computed so we don't know if the slice is maximal.
		return llvm::None;

		// Compute the difference between the src loop nest and the slice integer
		// sets.
		PresburgerSet srcSet(srcConstraints);
		PresburgerSet sliceSet(sliceConstraints);
		PresburgerSet diffSet = srcSet.subtract(sliceSet);
		return diffSet.isIntegerEmpty();
		}
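
The slow path of 'isMaximal' reduces to one question: is the set difference "src iteration domain minus slice domain" empty? The sketch below illustrates that idea by brute-force enumeration of small constant 2-d boxes. It is an assumption-laden toy, not the PresburgerSet algorithm used by the patch; 'Point', 'makeBox' and 'isMaximalByDifference' are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <utility>

using Point = std::pair<int64_t, int64_t>;

// Enumerates the integer points of the 2-d box [lb0, ub0) x [lb1, ub1).
std::set<Point> makeBox(int64_t lb0, int64_t ub0, int64_t lb1, int64_t ub1) {
  assert(lb0 <= ub0 && lb1 <= ub1 && "expected well-formed bounds");
  std::set<Point> points;
  for (int64_t i = lb0; i < ub0; ++i)
    for (int64_t j = lb1; j < ub1; ++j)
      points.insert({i, j});
  return points;
}

// Returns true if 'slice' covers every iteration of 'src', i.e. the set
// difference src \ slice is empty.
bool isMaximalByDifference(const std::set<Point> &src,
                           const std::set<Point> &slice) {
  for (const Point &p : src)
    if (slice.count(p) == 0)
      return false; // Found a src iteration that the slice does not compute.
  return true;
}
```

A slice covering [0, 3) x [0, 2) of a 3x3 src nest leaves the column j = 2 uncovered, so it is not maximal and the src loop would still be needed.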

unsigned MemRefRegion::getRank() const {		unsigned MemRefRegion::getRank() const {
return memref.getType().cast<MemRefType>().getRank();		return memref.getType().cast<MemRefType>().getRank();
}		}

Optional<int64_t> MemRefRegion::getConstantBoundingSizeAndShape(		Optional<int64_t> MemRefRegion::getConstantBoundingSizeAndShape(
SmallVectorImpl<int64_t> shape, std::vector<SmallVector<int64_t, 4>> lbs,		SmallVectorImpl<int64_t> shape, std::vector<SmallVector<int64_t, 4>> lbs,
SmallVectorImpl<int64_t> *lbDivisors) const {		SmallVectorImpl<int64_t> *lbDivisors) const {
auto memRefType = memref.getType().cast<MemRefType>();		auto memRefType = memref.getType().cast<MemRefType>();
▲ Show 20 Lines • Show All 962 Lines • Show Last 20 Lines

mlir/lib/Transforms/LoopFusion.cpp

Show First 20 Lines • Show All 264 Lines • ▼ Show 20 Lines	for (auto *storeOpInst : node->stores) {
// Return true if any use of 'memref' escapes the function.		// Return true if any use of 'memref' escapes the function.
for (auto *user : memref.getUsers())		for (auto *user : memref.getUsers())
if (!isMemRefDereferencingOp(*user))		if (!isMemRefDereferencingOp(*user))
return true;		return true;
}		}
return false;		return false;
}		}

// Returns the unique AffineWriteOpInterface in `node` that meets all the
// following:
// *) store is the only one that writes to a function-local memref live out
// of `node`,
// *) store is not the source of a self-dependence on `node`.
// Otherwise, returns a null AffineWriteOpInterface.
AffineWriteOpInterface getUniqueOutgoingStore(Node *node) {
AffineWriteOpInterface uniqueStore;

// Return null if `node` doesn't have any outgoing edges.
auto outEdgeIt = outEdges.find(node->id);
if (outEdgeIt == outEdges.end())
return nullptr;

const auto &nodeOutEdges = outEdgeIt->second;
for (auto *op : node->stores) {
auto storeOp = cast<AffineWriteOpInterface>(op);
auto memref = storeOp.getMemRef();
// Skip this store if there are no dependences on its memref. This means
// that store either:
// *) writes to a memref that is only read within the same loop nest
// (self-dependence edges are not represented in graph at the moment),
// *) writes to a function live out memref (function parameter), or
// *) is dead.
if (llvm::all_of(nodeOutEdges, [=](const Edge &edge) {
return (edge.value != memref);
}))
continue;

if (uniqueStore)
// Found multiple stores to function-local live-out memrefs.
return nullptr;
// Found first store to function-local live-out memref.
uniqueStore = storeOp;
}

return uniqueStore;
}

// Returns true if node 'id' can be removed from the graph. Returns false
// otherwise. A node can be removed from the graph iff the following
// conditions are met:
// *) The node does not write to any memref which escapes (or is a
// function/block argument).
// *) The node has no successors in the dependence graph.
bool canRemoveNode(unsigned id) {
Inline comment by dcaballe (Author, Done): This functionality is replaced with 'canRemoveSrcNodeAfterFusion' below. I'm adding as a non-member in an attempt to minimize code that is specific to a fusion strategy so that we can eventually move a generic version of the 'mdg' to the loop fusion utilities in the future (if that's what we want).
if (writesToLiveInOrEscapingMemrefs(id))
return false;
Node *node = getNode(id);
for (auto *storeOpInst : node->stores) {
// Return false if there exist out edges from 'id' on 'memref'.
auto storeMemref = cast<AffineWriteOpInterface>(storeOpInst).getMemRef();
if (getOutEdgeCount(id, storeMemref) > 0)
return false;
}
return true;
}

// Returns true iff there is an edge from node 'srcId' to node 'dstId' which		// Returns true iff there is an edge from node 'srcId' to node 'dstId' which
// is for 'value' if non-null, or for any value otherwise. Returns false		// is for 'value' if non-null, or for any value otherwise. Returns false
// otherwise.		// otherwise.
bool hasEdge(unsigned srcId, unsigned dstId, Value value = nullptr) {		bool hasEdge(unsigned srcId, unsigned dstId, Value value = nullptr) {
if (outEdges.count(srcId) == 0 || inEdges.count(dstId) == 0) {		if (outEdges.count(srcId) == 0 || inEdges.count(dstId) == 0) {
return false;		return false;
}		}
bool hasOutEdge = llvm::any_of(outEdges[srcId], [=](Edge &edge) {		bool hasOutEdge = llvm::any_of(outEdges[srcId], [=](Edge &edge) {
▲ Show 20 Lines • Show All 151 Lines • ▼ Show 20 Lines	if (firstSrcDepPos.hasValue()) {
// Return the insertion point at 'firstSrcDepPos'.		// Return the insertion point at 'firstSrcDepPos'.
return depInsts[firstSrcDepPos.getValue()];		return depInsts[firstSrcDepPos.getValue()];
}		}
// No dependence targets in range (or only dst deps in range), return		// No dependence targets in range (or only dst deps in range), return
// 'dstNodeInst' insertion point.		// 'dstNodeInst' insertion point.
return dstNodeInst;		return dstNodeInst;
}		}

// Updates edge mappings from node 'srcId' to node 'dstId' after 'oldMemRef'		// Updates edge mappings from node 'srcId' to node 'dstId' after fusing them,
// has been replaced in node at 'dstId' by a private memref depending		// taking into account that:
// on the value of 'createPrivateMemRef'.		// *) if 'removeSrcId' is true, 'srcId' will be removed after fusion,
void updateEdges(unsigned srcId, unsigned dstId, Value oldMemRef,		// *) memrefs in 'privateMemRefs' have been replaced in node at 'dstId' by a
bool createPrivateMemRef) {		// private memref.
		void updateEdges(unsigned srcId, unsigned dstId,
		const DenseSet<Value> &privateMemRefs, bool removeSrcId) {
// For each edge in 'inEdges[srcId]': add new edge remapping to 'dstId'.		// For each edge in 'inEdges[srcId]': add new edge remapping to 'dstId'.
if (inEdges.count(srcId) > 0) {		if (inEdges.count(srcId) > 0) {
SmallVector<Edge, 2> oldInEdges = inEdges[srcId];		SmallVector<Edge, 2> oldInEdges = inEdges[srcId];
for (auto &inEdge : oldInEdges) {		for (auto &inEdge : oldInEdges) {
// Add edge from 'inEdge.id' to 'dstId' if not for 'oldMemRef'.		// Add edge from 'inEdge.id' to 'dstId' if it's not a private memref.
if (inEdge.value != oldMemRef)		if (privateMemRefs.count(inEdge.value) == 0)
addEdge(inEdge.id, dstId, inEdge.value);		addEdge(inEdge.id, dstId, inEdge.value);
}		}
}		}
// For each edge in 'outEdges[srcId]': remove edge from 'srcId' to 'dstId'.		// For each edge in 'outEdges[srcId]': remove edge from 'srcId' to 'dstId'.
		// If 'srcId' is going to be removed, remap all the out edges to 'dstId'.
if (outEdges.count(srcId) > 0) {		if (outEdges.count(srcId) > 0) {
SmallVector<Edge, 2> oldOutEdges = outEdges[srcId];		SmallVector<Edge, 2> oldOutEdges = outEdges[srcId];
for (auto &outEdge : oldOutEdges) {		for (auto &outEdge : oldOutEdges) {
// Remove any out edges from 'srcId' to 'dstId' across memrefs.		// Remove any out edges from 'srcId' to 'dstId' across memrefs.
if (outEdge.id == dstId)		if (outEdge.id == dstId)
removeEdge(srcId, outEdge.id, outEdge.value);		removeEdge(srcId, outEdge.id, outEdge.value);
		else if (removeSrcId) {
		addEdge(dstId, outEdge.id, outEdge.value);
		removeEdge(srcId, outEdge.id, outEdge.value);
		}
}		}
}		}
// Remove any edges in 'inEdges[dstId]' on 'oldMemRef' (which is being		// Remove any edges in 'inEdges[dstId]' on 'oldMemRef' (which is being
// replaced by a private memref). These edges could come from nodes		// replaced by a private memref). These edges could come from nodes
// other than 'srcId' which were removed in the previous step.		// other than 'srcId' which were removed in the previous step.
if (inEdges.count(dstId) > 0 && createPrivateMemRef) {		if (inEdges.count(dstId) > 0 && !privateMemRefs.empty()) {
SmallVector<Edge, 2> oldInEdges = inEdges[dstId];		SmallVector<Edge, 2> oldInEdges = inEdges[dstId];
for (auto &inEdge : oldInEdges)		for (auto &inEdge : oldInEdges)
if (inEdge.value == oldMemRef)		if (privateMemRefs.count(inEdge.value) > 0)
removeEdge(inEdge.id, dstId, inEdge.value);		removeEdge(inEdge.id, dstId, inEdge.value);
}		}
}		}
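
The edge remapping performed by the new 'updateEdges' overload can be traced on a toy graph. The following sketch is an assumption-based model, not the MemRefDependenceGraph implementation: edges are stored as (source id, destination id, memref name) tuples in a single std::set, and memrefs are plain strings.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <tuple>
#include <vector>

// (source node id, destination node id, memref name); hypothetical model.
using Edge = std::tuple<unsigned, unsigned, std::string>;

void updateEdges(std::set<Edge> &edges, unsigned srcId, unsigned dstId,
                 const std::set<std::string> &privateMemRefs,
                 bool removeSrcId) {
  // Iterate over a snapshot because 'edges' is mutated in the loops.
  const std::vector<Edge> snapshot(edges.begin(), edges.end());
  // In-edges of 'srcId' on non-private memrefs now also feed 'dstId'.
  for (const auto &[from, to, memref] : snapshot)
    if (to == srcId && privateMemRefs.count(memref) == 0)
      edges.insert(Edge{from, dstId, memref});
  // Out-edges of 'srcId': drop those into 'dstId'; if 'srcId' is going to
  // be removed, re-root the remaining ones at 'dstId'.
  for (const auto &[from, to, memref] : snapshot) {
    if (from != srcId)
      continue;
    if (to == dstId) {
      edges.erase(Edge{from, to, memref});
    } else if (removeSrcId) {
      edges.insert(Edge{dstId, to, memref});
      edges.erase(Edge{from, to, memref});
    }
  }
  // In-edges of 'dstId' on privatized memrefs no longer carry a dependence.
  const std::vector<Edge> remapped(edges.begin(), edges.end());
  for (const auto &[from, to, memref] : remapped)
    if (to == dstId && privateMemRefs.count(memref) > 0)
      edges.erase(Edge{from, to, memref});
}
```

With edges 0 -> 1 on "A", 1 -> 2 on "B" and 1 -> 3 on "C", fusing node 1 into node 2 with "B" privatized and 'removeSrcId' set leaves 0 -> 1 on "A", 0 -> 2 on "A" and 2 -> 3 on "C". As in the patch, the in-edges of the source node are not erased here; that is left to the removal of the node itself.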

// Update edge mappings for nodes 'sibId' and 'dstId' to reflect fusion		// Update edge mappings for nodes 'sibId' and 'dstId' to reflect fusion
// of sibling node 'sidId' into node 'dstId'.		// of sibling node 'sibId' into node 'dstId'.
void updateEdges(unsigned sibId, unsigned dstId) {		void updateEdges(unsigned sibId, unsigned dstId) {
// For each edge in 'inEdges[sibId]':		// For each edge in 'inEdges[sibId]':
// *) Add new edge from source node 'inEdge.id' to 'dstNode'.		// *) Add new edge from source node 'inEdge.id' to 'dstNode'.
// *) Remove edge from source node 'inEdge.id' to 'sibNode'.		// *) Remove edge from source node 'inEdge.id' to 'sibNode'.
if (inEdges.count(sibId) > 0) {		if (inEdges.count(sibId) > 0) {
SmallVector<Edge, 2> oldInEdges = inEdges[sibId];		SmallVector<Edge, 2> oldInEdges = inEdges[sibId];
for (auto &inEdge : oldInEdges) {		for (auto &inEdge : oldInEdges) {
addEdge(inEdge.id, dstId, inEdge.value);		addEdge(inEdge.id, dstId, inEdge.value);
▲ Show 20 Lines • Show All 77 Lines • ▼ Show 20 Lines	for (const auto &idAndNode : nodes) {
for (const auto &e : it->second)		for (const auto &e : it->second)
os << " OutEdge: " << e.id << " " << e.value << "\n";		os << " OutEdge: " << e.id << " " << e.value << "\n";
}		}
}		}
}		}
void dump() const { print(llvm::errs()); }		void dump() const { print(llvm::errs()); }
};		};

		/// Returns true if node 'srcId' can be removed after fusing it with node
		/// 'dstId'. The node can be removed if any of the following conditions are met:
		Inline comment by bondhugula (Done): can -> can be
		/// 1. 'srcId' has no output dependences after fusion and no escaping memrefs.
		Inline comment by bondhugula (Done): Since 'dependences' are between two entities, I think we need to clarify what "has no dependences" means (for eg. outgoing deps, incoming deps, intra-node deps are different things).
		/// 2. 'srcId' has no output dependences after fusion, has escaping memrefs
		/// and the fusion slice is maximal.
		/// 3. 'srcId' has output dependences after fusion, the fusion slice is
		/// maximal and the fusion insertion point dominates all the dependences.
		static bool canRemoveSrcNodeAfterFusion(
		unsigned srcId, unsigned dstId, const ComputationSliceState &fusionSlice,
		Operation *fusedLoopInsPoint, const DenseSet<Value> &escapingMemRefs,
		MemRefDependenceGraph *mdg) {

		Operation *dstNodeOp = mdg->getNode(dstId)->op;
		bool hasOutDepsAfterFusion = false;

		for (auto &outEdge : mdg->outEdges[srcId]) {
		Operation *depNodeOp = mdg->getNode(outEdge.id)->op;
		// Skip dependence with dstOp since it will be removed after fusion.
		if (depNodeOp == dstNodeOp)
		continue;

		// Only fusion within the same block is supported. Use domination analysis
		// when needed.
		if (depNodeOp->getBlock() != dstNodeOp->getBlock())
		return false;

		// Check if the insertion point of the fused loop dominates the dependence.
		// Otherwise, the src loop can't be removed.
		if (fusedLoopInsPoint != depNodeOp &&
		!fusedLoopInsPoint->isBeforeInBlock(depNodeOp)) {
		LLVM_DEBUG(llvm::dbgs() << "Src loop can't be removed: dst loop doesn't "
		"dominate dependence\n");
		return false;
		}

		hasOutDepsAfterFusion = true;
		}

		// If src loop has dependences after fusion or it writes to a live-out or
		// escaping memref, we can only remove it if the fusion slice is maximal so
		// that all the dependences are preserved.
		if (hasOutDepsAfterFusion || !escapingMemRefs.empty()) {
		Optional<bool> isMaximal = fusionSlice.isMaximal();
		if (!isMaximal.hasValue()) {
		LLVM_DEBUG(llvm::dbgs() << "Src loop can't be removed: can't determine "
		"if fusion is maximal\n");
		return false;
		}

		if (!isMaximal.getValue()) {
		LLVM_DEBUG(llvm::dbgs()
		<< "Src loop can't be removed: fusion is not maximal\n");
		return false;
		}
		}

		return true;
		}
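
The three removal conditions documented for 'canRemoveSrcNodeAfterFusion' boil down to a small decision function once the graph queries are abstracted away. The sketch below assumes the caller has already computed four facts; the function name and all parameter names are hypothetical, and std::optional stands in for llvm::Optional.

```cpp
#include <cassert>
#include <optional>

// Decides whether the src loop can be erased after fusion, given:
//  - hasOutDepsAfterFusion: src still has dependences on nodes other than dst,
//  - insertionPointDominatesDeps: the fused loop is placed before all of them,
//  - hasEscapingMemRefs: src writes to an escaping or live-out memref,
//  - sliceIsMaximal: true/false, or std::nullopt if it could not be proven.
bool canRemoveSrcAfterFusion(bool hasOutDepsAfterFusion,
                             bool insertionPointDominatesDeps,
                             bool hasEscapingMemRefs,
                             std::optional<bool> sliceIsMaximal) {
  // Remaining dependences must see the fused computation before they run.
  if (hasOutDepsAfterFusion && !insertionPointDominatesDeps)
    return false;
  // Anything still observable outside dst requires the fused slice to
  // compute every src iteration; an inconclusive answer counts as "no".
  if (hasOutDepsAfterFusion || hasEscapingMemRefs)
    return sliceIsMaximal.value_or(false);
  // Condition 1: no remaining dependences and no escaping memrefs.
  return true;
}
```

Treating an unknown maximality result as "cannot remove" is the conservative choice: the src loop is kept and the fused slice computes part of it redundantly.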

		Inline comment by andydavis1 (Not Done): These kind of simple, focused functions are great. Not in this change, but it feels like longer term, we could do more breaking up of long functions into focused utility functions.
		Inline comment by dcaballe (Author, Done): Thanks! Yeah, we could probably add a similar one for sibling fusion candidates in the future. Any other examples?
		/// Returns in 'srcIdCandidates' the producer fusion candidates for consumer
		/// 'dstId'. Candidates are sorted by node id order. This order corresponds to
		/// the program order when the 'mdg' is created. However, program order is not
		bondhugula (Done): granted -> guaranteed?
		/// guaranteed and must not be required by the client. Program order won't be
		/// held if the 'mdg' is reused from a previous fusion step or if the node
		/// creation order changes in the future to support more advanced cases.
		bondhugula (Done): advance -> advanced
		// TODO: Move this to a loop fusion utility once 'mdg' is also moved.
		static void getProducerCandidates(unsigned dstId, MemRefDependenceGraph *mdg,
		SmallVectorImpl<unsigned> &srcIdCandidates) {
		bondhugula (Done): Do you want to just use an unordered set and then sort at the end in the right order? Right now, you are using a sorted set and then doing a `reverse`. You don't seem to need the sorted order in between. `srcIdCandidates` is a small set, and as you know, `std::set` will have high overhead. https://llvm.org/docs/ProgrammersManual.html#set
		dcaballe (Author, Done): I returned the candidates in ascending order because I thought it would make more sense for external clients than the reverse order. However, I agree that we can return an unordered map and leave the order to the client. Not sure, though, if we would be gaining much in our use case since we have to copy the elements to another container and sort them...
		bondhugula (Done): I think your comment is orthogonal here. You can return in whatever sorted order but you don't need to use an `std::set` here. You can use an `unordered_set` and then just copy + sort at the end and return a vector. Hashtable + some sorting algo is still far less complexity (both time and more importantly memory allocations and access wise) than a typical heavy red-black tree (or similar structure) backed `std::set`. It's a handful of elements you have - so copy costs isn't much I assume.
		dcaballe (Author, Done): `unordered_set` is not recommended for similar reasons (https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options). I can change it to the "sorted vector" approach (https://llvm.org/docs/ProgrammersManual.html#a-sorted-vector) if that makes more sense to you.
		// Skip if no input edges along which to fuse.
		if (mdg->inEdges.count(dstId) == 0)
		return;

		// Gather memrefs from loads in 'dstId'.
		auto *dstNode = mdg->getNode(dstId);
		DenseSet<Value> consumedMemrefs;
		for (Operation *load : dstNode->loads)
		consumedMemrefs.insert(cast<AffineReadOpInterface>(load).getMemRef());

		// Traverse 'dstId' incoming edges and gather the nodes that contain a store
		// to one of the consumed memrefs.
		for (auto &srcEdge : mdg->inEdges[dstId]) {
		auto *srcNode = mdg->getNode(srcEdge.id);
		// Skip if 'srcNode' is not a loop nest.
		if (!isa<AffineForOp>(srcNode->op))
		continue;

		if (any_of(srcNode->stores, [&](Operation *op) {
		auto storeOp = cast<AffineWriteOpInterface>(op);
		return consumedMemrefs.count(storeOp.getMemRef()) > 0;
		}))
		srcIdCandidates.push_back(srcNode->id);
		}

		std::sort(srcIdCandidates.begin(), srcIdCandidates.end());
		srcIdCandidates.erase(
		std::unique(srcIdCandidates.begin(), srcIdCandidates.end()),
		srcIdCandidates.end());
		}
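The `std::sort` / `std::unique` / `erase` sequence that closes `getProducerCandidates` is the "sorted vector" idiom the reviewers settled on. A minimal standalone sketch follows, assuming plain `std::vector<unsigned>` in place of `SmallVectorImpl<unsigned>`; both helper names are hypothetical and only illustrate the idiom and the descending visit order that the caller obtains with `llvm::reverse`.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: sorts 'ids' in ascending order and drops duplicates,
// mirroring the sort + unique + erase sequence in the patch.
inline void sortAndUnique(std::vector<unsigned> &ids) {
  std::sort(ids.begin(), ids.end());
  ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
}

// Hypothetical helper: returns the deduplicated ids in descending order,
// i.e. the order in which the fusion loop would visit the candidates.
inline std::vector<unsigned> reverseVisitOrder(std::vector<unsigned> ids) {
  sortAndUnique(ids);
  std::reverse(ids.begin(), ids.end());
  return ids;
}
```

Duplicates can arise because a producer may reach the consumer through several incoming edges (one per memref), so the same node id may be pushed more than once before the final cleanup.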

		/// Returns in 'producerConsumerMemrefs' the memrefs involved in a
		/// producer-consumer dependence between 'srcId' and 'dstId'.
		static void
		gatherProducerConsumerMemrefs(unsigned srcId, unsigned dstId,
		MemRefDependenceGraph *mdg,
		DenseSet<Value> &producerConsumerMemrefs) {
		auto *dstNode = mdg->getNode(dstId);
		auto *srcNode = mdg->getNode(srcId);
		gatherProducerConsumerMemrefs(srcNode->stores, dstNode->loads,
		producerConsumerMemrefs);
		}

		bondhugula (Done): Nit: `memref.isa<BlockArgument>()`
		/// Returns in 'escapingMemRefs' the memrefs from affine store ops in node 'id'
		/// that escape the function. A memref escapes the function if either:
		/// 1. It's a function argument, or
		/// 2. It's used by a non-affine op (e.g., std load/store, std call, etc.)
		void gatherEscapingMemrefs(unsigned id, MemRefDependenceGraph *mdg,
		DenseSet<Value> &escapingMemRefs) {
		bondhugula (Done): auto -> Operation
		auto *node = mdg->getNode(id);
		for (auto *storeOpInst : node->stores) {
		auto memref = cast<AffineWriteOpInterface>(storeOpInst).getMemRef();
		if (escapingMemRefs.count(memref))
		continue;
		// Check if 'memref' escapes because it's a block argument.
		if (memref.isa<BlockArgument>()) {
		escapingMemRefs.insert(memref);
		continue;
		}
		// Check if 'memref' escapes through a non-affine op (e.g., std load/store,
		// call op, etc.).
		for (Operation *user : memref.getUsers())
		if (!isMemRefDereferencingOp(*user))
		escapingMemRefs.insert(memref);
		}
		}
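The two escape conditions listed in the doc comment of `gatherEscapingMemrefs` can be modelled without MLIR. This is a simplified sketch under stated assumptions: `MemRefModel` and `escapes` are hypothetical names, and each user is reduced to a single flag that plays the role of `isMemRefDereferencingOp(*user)`.

```cpp
#include <vector>

// Hypothetical, simplified model of a memref value; not an MLIR type.
struct MemRefModel {
  bool isBlockArgument;                // e.g. the memref is a function argument
  std::vector<bool> userIsAffineDeref; // one flag per user operation
};

// A memref escapes if it is a block argument or if at least one of its users
// is not an affine dereferencing op (e.g. a std load/store or a call).
inline bool escapes(const MemRefModel &memref) {
  if (memref.isBlockArgument)
    return true;
  for (bool isAffineDeref : memref.userIsAffineDeref)
    if (!isAffineDeref)
      return true;
  return false;
}
```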

} // end anonymous namespace		} // end anonymous namespace

// Initializes the data dependence graph by walking operations in 'f'.		// Initializes the data dependence graph by walking operations in 'f'.
// Assigns each node in the graph a node id based on program order in 'f'.		// Assigns each node in the graph a node id based on program order in 'f'.
// TODO: Add support for taking a Block arg to construct the		// TODO: Add support for taking a Block arg to construct the
// dependence graph at a different depth.		// dependence graph at a different depth.
bool MemRefDependenceGraph::init(FuncOp f) {		bool MemRefDependenceGraph::init(FuncOp f) {
		LLVM_DEBUG(llvm::dbgs() << "--- Initializing MDG ---\n");
DenseMap<Value, SetVector<unsigned>> memrefAccesses;		DenseMap<Value, SetVector<unsigned>> memrefAccesses;

// TODO: support multi-block functions.		// TODO: support multi-block functions.
if (!llvm::hasSingleElement(f))		if (!llvm::hasSingleElement(f))
return false;		return false;

DenseMap<Operation *, unsigned> forToNodeMap;		DenseMap<Operation *, unsigned> forToNodeMap;
for (auto &op : f.front()) {		for (auto &op : f.front()) {
[39 lines not shown]	for (auto &op : f.front()) {
} else if (op.getNumResults() > 0 && !op.use_empty()) {		} else if (op.getNumResults() > 0 && !op.use_empty()) {
// Create graph node for top-level producer of SSA values, which		// Create graph node for top-level producer of SSA values, which
// could be used by loop nest nodes.		// could be used by loop nest nodes.
Node node(nextNodeId++, &op);		Node node(nextNodeId++, &op);
nodes.insert({node.id, node});		nodes.insert({node.id, node});
}		}
}		}

		for (auto &idAndNode : nodes) {
		bondhugula (Not Done): Avoid `auto` here.
		dcaballe (Author, Done): 'auto' should be justified here since the type is very verbose: std::pair<unsigned, Node>
		LLVM_DEBUG(llvm::dbgs() << "Create node " << idAndNode.first << " for:\n"
		<< *(idAndNode.second.op) << "\n");
		(void)idAndNode;
		}

// Add dependence edges between nodes which produce SSA values and their		// Add dependence edges between nodes which produce SSA values and their
// users.		// users.
for (auto &idAndNode : nodes) {		for (auto &idAndNode : nodes) {
const Node &node = idAndNode.second;		const Node &node = idAndNode.second;
if (!node.loads.empty() || !node.stores.empty())		if (!node.loads.empty() || !node.stores.empty())
continue;		continue;
auto *opInst = node.op;		auto *opInst = node.op;
for (auto value : opInst->getResults()) {		for (auto value : opInst->getResults()) {
[23 lines not shown]	for (unsigned i = 0; i < n; ++i) {
if (srcHasStore || dstHasStore)		if (srcHasStore || dstHasStore)
addEdge(srcId, dstId, memrefAndList.first);		addEdge(srcId, dstId, memrefAndList.first);
}		}
}		}
}		}
return true;		return true;
}		}

// Removes load operations from 'srcLoads' which operate on 'memref', and
// adds them to 'dstLoads'.
static void moveLoadsAccessingMemrefTo(Value memref,
SmallVectorImpl<Operation *> *srcLoads,
SmallVectorImpl<Operation *> *dstLoads) {
dstLoads->clear();
SmallVector<Operation *, 4> srcLoadsToKeep;
for (auto *load : *srcLoads) {
if (cast<AffineReadOpInterface>(load).getMemRef() == memref)
dstLoads->push_back(load);
else
srcLoadsToKeep.push_back(load);
}
srcLoads->swap(srcLoadsToKeep);
}

// Sinks all sequential loops to the innermost levels (while preserving		// Sinks all sequential loops to the innermost levels (while preserving
// relative order among them) and moves all parallel loops to the		// relative order among them) and moves all parallel loops to the
// outermost (while again preserving relative order among them).		// outermost (while again preserving relative order among them).
// This can increase the loop depth at which we can fuse a slice, since we are		// This can increase the loop depth at which we can fuse a slice, since we are
// pushing loop carried dependence to a greater depth in the loop nest.		// pushing loop carried dependence to a greater depth in the loop nest.
static void sinkSequentialLoops(MemRefDependenceGraph::Node *node) {		static void sinkSequentialLoops(MemRefDependenceGraph::Node *node) {
assert(isa<AffineForOp>(node->op));		assert(isa<AffineForOp>(node->op));
AffineForOp newRootForOp = sinkSequentialLoops(cast<AffineForOp>(node->op));		AffineForOp newRootForOp = sinkSequentialLoops(cast<AffineForOp>(node->op));
[175 lines not shown]	static bool hasNonAffineUsersOnThePath(unsigned srcId, unsigned dstId,
});		});
// Looking for users between node 'srcId' and node 'dstId'.		// Looking for users between node 'srcId' and node 'dstId'.
for (Value memref : memRefValues)		for (Value memref : memRefValues)
if (hasNonAffineUsersOnThePath(srcId, dstId, memref, mdg))		if (hasNonAffineUsersOnThePath(srcId, dstId, memref, mdg))
return true;		return true;
return false;		return false;
}		}

// Checks if node 'srcId' can be safely fused into node 'dstId'. Node 'srcId'
// may write to multiple memrefs but it is required that only one of them,
// 'srcLiveOutStoreOp', has output edges.
// Returns true if 'dstNode's read/write region to 'memref' is a super set of
// 'srcNode's write region to 'memref' and 'srcId' has only one output edge.
// TODO: Generalize this to handle more live in/out cases.
static bool
canFuseSrcWhichWritesToLiveOut(unsigned srcId, unsigned dstId,
dcaballe (Author, Done): This code was heavily relying on having a single store so I couldn't reuse much. 'hasNonAffineUsersOnThePath' check has been moved to the producer-consumer fusion main function. The memory region superset check is now covered by slice::isMaximal check, which also works for an arbitrary number of stores.
AffineWriteOpInterface srcLiveOutStoreOp,
MemRefDependenceGraph *mdg) {
assert(srcLiveOutStoreOp && "Expected a valid store op");
auto *dstNode = mdg->getNode(dstId);
Value memref = srcLiveOutStoreOp.getMemRef();
// Return false if 'srcNode' has more than one output edge on 'memref'.
if (mdg->getOutEdgeCount(srcId, memref) > 1)
return false;

// Compute MemRefRegion 'srcWriteRegion' for 'srcStoreOp' on 'memref'.
MemRefRegion srcWriteRegion(srcLiveOutStoreOp.getLoc());
if (failed(srcWriteRegion.compute(srcLiveOutStoreOp, /*loopDepth=*/0))) {
LLVM_DEBUG(llvm::dbgs()
<< "Unable to compute MemRefRegion for source operation\n.");
return false;
}
SmallVector<int64_t, 4> srcShape;
// Query 'srcWriteRegion' for 'srcShape' and 'srcNumElements'.
// by 'srcStoreOp' at depth 'dstLoopDepth'.
Optional<int64_t> srcNumElements =
srcWriteRegion.getConstantBoundingSizeAndShape(&srcShape);
if (!srcNumElements.hasValue())
return false;

// Compute MemRefRegion 'dstRegion' for 'dstStore/LoadOpInst' on 'memref'.
// TODO: Compute 'unionboundingbox' of all write regions (one for
// each store op in 'dstStoreOps').
SmallVector<Operation *, 2> dstStoreOps;
dstNode->getStoreOpsForMemref(memref, &dstStoreOps);
SmallVector<Operation *, 2> dstLoadOps;
dstNode->getLoadOpsForMemref(memref, &dstLoadOps);

auto *dstOpInst = dstStoreOps.empty() ? dstLoadOps[0] : dstStoreOps[0];
MemRefRegion dstRegion(dstOpInst->getLoc());
if (failed(dstRegion.compute(dstOpInst, /*loopDepth=*/0))) {
LLVM_DEBUG(llvm::dbgs()
<< "Unable to compute MemRefRegion for dest operation\n.");
return false;
}
SmallVector<int64_t, 4> dstShape;
// Query 'dstRegion' for 'dstShape' and 'dstNumElements'.
// by 'dstOpInst' at depth 'dstLoopDepth'.
Optional<int64_t> dstNumElements =
dstRegion.getConstantBoundingSizeAndShape(&dstShape);
if (!dstNumElements.hasValue())
return false;

// Return false if write region is not a superset of 'srcNodes' write
// region to 'memref'.
// TODO: Check the shape and lower bounds here too.
if (srcNumElements != dstNumElements)
return false;

// Return false if 'memref' is used by a non-affine operation that is
// between node 'srcId' and node 'dstId'.
if (hasNonAffineUsersOnThePath(srcId, dstId, mdg))
return false;

return true;
}

// Checks the profitability of fusing a backwards slice of the loop nest		// Checks the profitability of fusing a backwards slice of the loop nest
// surrounding 'srcOpInst' into the loop nest surrounding 'dstLoadOpInsts'.		// surrounding 'srcOpInst' into the loop nest surrounding 'dstLoadOpInsts'.
// The argument 'srcStoreOpInst' is used to calculate the storage reduction on		// The argument 'srcStoreOpInst' is used to calculate the storage reduction on
// the memref being produced and consumed, which is an input to the cost model.		// the memref being produced and consumed, which is an input to the cost model.
// For producer-consumer fusion, 'srcStoreOpInst' will be the same as		// For producer-consumer fusion, 'srcStoreOpInst' will be the same as
// 'srcOpInst', as we are slicing w.r.t to that producer. For input-reuse		// 'srcOpInst', as we are slicing w.r.t to that producer. For input-reuse
// fusion, 'srcOpInst' will be the src loop nest LoadOp which reads from the		// fusion, 'srcOpInst' will be the src loop nest LoadOp which reads from the
// same memref as dst loop nest load ops, and 'srcStoreOpInst' will be the		// same memref as dst loop nest load ops, and 'srcStoreOpInst' will be the
[12 lines not shown]
// *) Computes the cost of unfused src/dst loop nests (currently the cost of a		// *) Computes the cost of unfused src/dst loop nests (currently the cost of a
// loop nest is the total number of dynamic operation instances in the loop		// loop nest is the total number of dynamic operation instances in the loop
// nest).		// nest).
// *) Computes the cost of fusing a slice of the src loop nest into the dst		// *) Computes the cost of fusing a slice of the src loop nest into the dst
// loop nest at various values of dst loop depth, attempting to fuse		// loop nest at various values of dst loop depth, attempting to fuse
// the largest computation slice at the maximal dst loop depth (closest to		// the largest computation slice at the maximal dst loop depth (closest to
// the load) to minimize reuse distance and potentially enable subsequent		// the load) to minimize reuse distance and potentially enable subsequent
// load/store forwarding.		// load/store forwarding.
// NOTE: If the dst loop nest includes multiple loads in 'dstLoadOpInsts' for
// the same memref as is written by 'srcOpInst', then the union of slice
// loop bounds is used to compute the slice and associated slice cost.
// NOTE: 'dstLoopDepth' refers to the loop depth within the destination loop		// NOTE: 'dstLoopDepth' refers to the loop depth within the destination loop
// nest, at which the src computation slice is inserted/fused.		// nest, at which the src computation slice is inserted/fused.
// NOTE: We attempt to maximize the dst loop depth, but there are cases		// NOTE: We attempt to maximize the dst loop depth, but there are cases
// where a particular setting for 'dstLoopNest' might fuse an unsliced		// where a particular setting for 'dstLoopNest' might fuse an unsliced
// loop (within the src computation slice) at a depth which results in		// loop (within the src computation slice) at a depth which results in
// excessive recomputation (see unit tests for examples).		// excessive recomputation (see unit tests for examples).
// *) Compares the total cost of the unfused loop nests to the min cost fused		// *) Compares the total cost of the unfused loop nests to the min cost fused
// loop nest computed in the previous step, and returns true if the latter		// loop nest computed in the previous step, and returns true if the latter
// is lower.		// is lower.
		// TODO: Extend profitability analysis to support scenarios with multiple
		// stores.
		bondhugula (Not Done): Looks like the method is already handling it, but not returning an accurate estimate?
		dcaballe (Author, Done): Some work is needed. We still have code that is looking at 'srcStoreOpInst'.
static bool isFusionProfitable(Operation *srcOpInst, Operation *srcStoreOpInst,		static bool isFusionProfitable(Operation *srcOpInst, Operation *srcStoreOpInst,
ArrayRef<Operation *> dstLoadOpInsts,		AffineForOp dstForOp,
ArrayRef<ComputationSliceState> depthSliceUnions,		ArrayRef<ComputationSliceState> depthSliceUnions,
unsigned maxLegalFusionDepth,		unsigned maxLegalFusionDepth,
unsigned *dstLoopDepth,		unsigned *dstLoopDepth,
double computeToleranceThreshold) {		double computeToleranceThreshold) {
LLVM_DEBUG({		LLVM_DEBUG({
llvm::dbgs() << "Checking whether fusion is profitable between src op:\n";		llvm::dbgs() << "Checking whether fusion is profitable between src op:\n";
llvm::dbgs() << ' ' << *srcOpInst << " and destination op(s)\n";		llvm::dbgs() << ' ' << *srcOpInst << " and destination loop:\n";
for (auto dstOpInst : dstLoadOpInsts) {		llvm::dbgs() << dstForOp << "\n";
llvm::dbgs() << " " << *dstOpInst << "\n";
};
});		});

if (maxLegalFusionDepth == 0) {		if (maxLegalFusionDepth == 0) {
LLVM_DEBUG(llvm::dbgs() << "Can't fuse: maxLegalFusionDepth == 0 .\n");		LLVM_DEBUG(llvm::dbgs() << "Can't fuse: maxLegalFusionDepth == 0 .\n");
return false;		return false;
}		}

// Compute cost of sliced and unsliced src loop nest.		// Compute cost of sliced and unsliced src loop nest.
SmallVector<AffineForOp, 4> srcLoopIVs;		SmallVector<AffineForOp, 4> srcLoopIVs;
getLoopIVs(*srcOpInst, &srcLoopIVs);		getLoopIVs(*srcOpInst, &srcLoopIVs);

// Walk src loop nest and collect stats.		// Walk src loop nest and collect stats.
LoopNestStats srcLoopNestStats;		LoopNestStats srcLoopNestStats;
if (!getLoopNestStats(srcLoopIVs[0], &srcLoopNestStats))		if (!getLoopNestStats(srcLoopIVs[0], &srcLoopNestStats))
return false;		return false;

// Compute cost of dst loop nest.		// Compute cost of dst loop nest.
SmallVector<AffineForOp, 4> dstLoopIVs;
getLoopIVs(*dstLoadOpInsts[0], &dstLoopIVs);

LoopNestStats dstLoopNestStats;		LoopNestStats dstLoopNestStats;
if (!getLoopNestStats(dstLoopIVs[0], &dstLoopNestStats))		if (!getLoopNestStats(dstForOp, &dstLoopNestStats))
return false;		return false;

// Search for min cost value for 'dstLoopDepth'. At each value of		// Search for min cost value for 'dstLoopDepth'. At each value of
// 'dstLoopDepth' from 'maxLegalLoopDepth' to '1', compute computation slice		// 'dstLoopDepth' from 'maxLegalLoopDepth' to '1', compute computation slice
// bounds between 'srcOpInst' and each op in 'dstOpinsts' (taking the union		// bounds between 'srcOpInst' and each op in 'dstOpinsts' (taking the union
// of these bounds). Next the union slice bounds are used to calculate		// of these bounds). Next the union slice bounds are used to calculate
// the cost of the slice and the cost of the slice inserted into the dst		// the cost of the slice and the cost of the slice inserted into the dst
// loop nest at 'dstLoopDepth'.		// loop nest at 'dstLoopDepth'.
[17 lines not shown]	static bool isFusionProfitable(Operation *srcOpInst, Operation *srcStoreOpInst,

Optional<int64_t> maybeSrcWriteRegionSizeBytes =		Optional<int64_t> maybeSrcWriteRegionSizeBytes =
srcWriteRegion.getRegionSize();		srcWriteRegion.getRegionSize();
if (!maybeSrcWriteRegionSizeBytes.hasValue())		if (!maybeSrcWriteRegionSizeBytes.hasValue())
return false;		return false;
int64_t srcWriteRegionSizeBytes = maybeSrcWriteRegionSizeBytes.getValue();		int64_t srcWriteRegionSizeBytes = maybeSrcWriteRegionSizeBytes.getValue();

// Compute op instance count for the src loop nest.		// Compute op instance count for the src loop nest.
uint64_t dstLoopNestCost = getComputeCost(dstLoopIVs[0], dstLoopNestStats);		uint64_t dstLoopNestCost = getComputeCost(dstForOp, dstLoopNestStats);

// Evaluate all depth choices for materializing the slice in the destination		// Evaluate all depth choices for materializing the slice in the destination
// loop nest.		// loop nest.
for (unsigned i = maxLegalFusionDepth; i >= 1; --i) {		for (unsigned i = maxLegalFusionDepth; i >= 1; --i) {
		const ComputationSliceState &slice = depthSliceUnions[i - 1];
// Skip slice union if it wasn't computed for this depth.		// Skip slice union if it wasn't computed for this depth.
if (depthSliceUnions[i - 1].isEmpty())		if (slice.isEmpty())
continue;		continue;

int64_t fusedLoopNestComputeCost;		int64_t fusedLoopNestComputeCost;
if (!getFusionComputeCost(srcLoopIVs[0], srcLoopNestStats, dstLoopIVs[0],		if (!getFusionComputeCost(srcLoopIVs[0], srcLoopNestStats, dstForOp,
dstLoopNestStats, depthSliceUnions[i - 1],		dstLoopNestStats, slice,
&fusedLoopNestComputeCost)) {		&fusedLoopNestComputeCost)) {
LLVM_DEBUG(llvm::dbgs() << "Unable to compute fusion compute cost.\n.");		LLVM_DEBUG(llvm::dbgs() << "Unable to compute fusion compute cost.\n.");
continue;		continue;
}		}

double additionalComputeFraction =		double additionalComputeFraction =
fusedLoopNestComputeCost /		fusedLoopNestComputeCost /
(static_cast<double>(srcLoopNestCost) + dstLoopNestCost) -		(static_cast<double>(srcLoopNestCost) + dstLoopNestCost) -
1;		1;

// Determine what the slice write MemRefRegion would be, if the src loop		// Determine what the slice write MemRefRegion would be, if the src loop
// nest slice 'depthSliceUnions[i - 1]' were to be inserted into the dst		// nest slice 'slice' were to be inserted into the dst loop nest at loop
// loop nest at loop depth 'i'.		// depth 'i'.
MemRefRegion sliceWriteRegion(srcStoreOpInst->getLoc());		MemRefRegion sliceWriteRegion(srcStoreOpInst->getLoc());
if (failed(sliceWriteRegion.compute(srcStoreOpInst, /*loopDepth=*/0,		if (failed(sliceWriteRegion.compute(srcStoreOpInst, /*loopDepth=*/0,
&depthSliceUnions[i - 1]))) {		&slice))) {
LLVM_DEBUG(llvm::dbgs()		LLVM_DEBUG(llvm::dbgs()
<< "Failed to compute slice write region at loopDepth: " << i		<< "Failed to compute slice write region at loopDepth: " << i
<< "\n");		<< "\n");
continue;		continue;
}		}

Optional<int64_t> maybeSliceWriteRegionSizeBytes =		Optional<int64_t> maybeSliceWriteRegionSizeBytes =
sliceWriteRegion.getRegionSize();		sliceWriteRegion.getRegionSize();
[66 lines not shown]	static bool isFusionProfitable(Operation *srcOpInst, Operation *srcStoreOpInst,
LLVM_DEBUG(		LLVM_DEBUG(
llvm::dbgs() << " LoopFusion fusion stats:"		llvm::dbgs() << " LoopFusion fusion stats:"
<< "\n best loop depth: " << bestDstLoopDepth		<< "\n best loop depth: " << bestDstLoopDepth
<< "\n src loop nest compute cost: " << srcLoopNestCost		<< "\n src loop nest compute cost: " << srcLoopNestCost
<< "\n dst loop nest compute cost: " << dstLoopNestCost		<< "\n dst loop nest compute cost: " << dstLoopNestCost
<< "\n fused loop nest compute cost: "		<< "\n fused loop nest compute cost: "
<< minFusedLoopNestComputeCost << "\n");		<< minFusedLoopNestComputeCost << "\n");

auto dstMemSize = getMemoryFootprintBytes(dstLoopIVs[0]);		auto dstMemSize = getMemoryFootprintBytes(dstForOp);
auto srcMemSize = getMemoryFootprintBytes(srcLoopIVs[0]);		auto srcMemSize = getMemoryFootprintBytes(srcLoopIVs[0]);

Optional<double> storageReduction = None;		Optional<double> storageReduction = None;

if (!dstMemSize.hasValue() || !srcMemSize.hasValue()) {		if (!dstMemSize.hasValue() || !srcMemSize.hasValue()) {
LLVM_DEBUG(llvm::dbgs()		LLVM_DEBUG(llvm::dbgs()
<< " fusion memory benefit cannot be evaluated; NOT fusing.\n");		<< " fusion memory benefit cannot be evaluated; NOT fusing.\n");
return false;		return false;
[87 lines not shown]
//		//
// TODO: Experiment with other fusion policies.		// TODO: Experiment with other fusion policies.
struct GreedyFusion {		struct GreedyFusion {
public:		public:
// The data dependence graph to traverse during fusion.		// The data dependence graph to traverse during fusion.
MemRefDependenceGraph *mdg;		MemRefDependenceGraph *mdg;
// Worklist of graph nodes visited during the fusion pass.		// Worklist of graph nodes visited during the fusion pass.
SmallVector<unsigned, 8> worklist;		SmallVector<unsigned, 8> worklist;
// Set of graph nodes which are present on the worklist.
llvm::SmallDenseSet<unsigned, 16> worklistSet;
// Parameter for local buffer size threshold.		// Parameter for local buffer size threshold.
unsigned localBufSizeThreshold;		unsigned localBufSizeThreshold;
// Parameter for fast memory space.		// Parameter for fast memory space.
Optional<unsigned> fastMemorySpace;		Optional<unsigned> fastMemorySpace;
// If true, ignore any additional (redundant) computation tolerance threshold		// If true, ignore any additional (redundant) computation tolerance threshold
// that would have prevented fusion.		// that would have prevented fusion.
bool maximalFusion;		bool maximalFusion;
// The amount of additional computation that is tolerated while fusing		// The amount of additional computation that is tolerated while fusing
// pair-wise as a fraction of the total computation.		// pair-wise as a fraction of the total computation.
double computeToleranceThreshold;		double computeToleranceThreshold;

using Node = MemRefDependenceGraph::Node;		using Node = MemRefDependenceGraph::Node;

GreedyFusion(MemRefDependenceGraph *mdg, unsigned localBufSizeThreshold,		GreedyFusion(MemRefDependenceGraph *mdg, unsigned localBufSizeThreshold,
Optional<unsigned> fastMemorySpace, bool maximalFusion,		Optional<unsigned> fastMemorySpace, bool maximalFusion,
double computeToleranceThreshold)		double computeToleranceThreshold)
: mdg(mdg), localBufSizeThreshold(localBufSizeThreshold),		: mdg(mdg), localBufSizeThreshold(localBufSizeThreshold),
fastMemorySpace(fastMemorySpace), maximalFusion(maximalFusion),		fastMemorySpace(fastMemorySpace), maximalFusion(maximalFusion),
computeToleranceThreshold(computeToleranceThreshold) {}		computeToleranceThreshold(computeToleranceThreshold) {}

// Initializes 'worklist' with nodes from 'mdg'		/// Initializes 'worklist' with nodes from 'mdg'.
void init() {		void init() {
// TODO: Add a priority queue for prioritizing nodes by different		// TODO: Add a priority queue for prioritizing nodes by different
// metrics (e.g. arithmetic intensity/flops-to-bytes ratio).		// metrics (e.g. arithmetic intensity/flops-to-bytes ratio).
worklist.clear();		worklist.clear();
worklistSet.clear();
for (auto &idAndNode : mdg->nodes) {		for (auto &idAndNode : mdg->nodes) {
const Node &node = idAndNode.second;		const Node &node = idAndNode.second;
worklist.push_back(node.id);		worklist.push_back(node.id);
worklistSet.insert(node.id);
}		}
}		}

// Run the GreedyFusion pass.		// Run the GreedyFusion pass.
// *) First pass through the nodes fuses single-use producer nodes into their		// *) First pass through the nodes fuses single-use producer nodes into their
// unique consumer.		// unique consumer.
// *) Second pass fuses sibling nodes which share no dependence edges.		// *) Second pass fuses sibling nodes which share no dependence edges.
// *) Third pass fuses any remaining producer nodes into their users.		// *) Third pass fuses any remaining producer nodes into their users.
void run() {		void run() {
// TODO: Run this repeatedly until a fixed-point is reached.		// TODO: Run this repeatedly until a fixed-point is reached.
fuseProducerConsumerNodes(/*maxSrcUserCount=*/1);		fuseProducerConsumerNodes(/*maxSrcUserCount=*/1);
fuseSiblingNodes();		fuseSiblingNodes();
fuseProducerConsumerNodes(		fuseProducerConsumerNodes(
/*maxSrcUserCount=*/std::numeric_limits<unsigned>::max());		/*maxSrcUserCount=*/std::numeric_limits<unsigned>::max());
eraseUnusedMemRefAllocations();		eraseUnusedMemRefAllocations();
}		}

void fuseProducerConsumerNodes(unsigned maxSrcUserCount) {		void fuseProducerConsumerNodes(unsigned maxSrcUserCount) {
		LLVM_DEBUG(llvm::dbgs() << "--- Producer/Consumer Fusion ---\n");
init();		init();
while (!worklist.empty()) {		while (!worklist.empty()) {
unsigned dstId = worklist.back();		unsigned dstId = worklist.back();
worklist.pop_back();		worklist.pop_back();
worklistSet.erase(dstId);

// Skip if this node was removed (fused into another node).		// Skip if this node was removed (fused into another node).
if (mdg->nodes.count(dstId) == 0)		if (mdg->nodes.count(dstId) == 0)
continue;		continue;
// Get 'dstNode' into which to attempt fusion.		// Get 'dstNode' into which to attempt fusion.
auto *dstNode = mdg->getNode(dstId);		auto *dstNode = mdg->getNode(dstId);
// Skip if 'dstNode' is not a loop nest.		// Skip if 'dstNode' is not a loop nest.
if (!isa<AffineForOp>(dstNode->op))		if (!isa<AffineForOp>(dstNode->op))
continue;		continue;

		LLVM_DEBUG(llvm::dbgs() << "Evaluating dst loop " << dstId << "\n");

// Sink sequential loops in 'dstNode' (and thus raise parallel loops)		// Sink sequential loops in 'dstNode' (and thus raise parallel loops)
// while preserving relative order. This can increase the maximum loop		// while preserving relative order. This can increase the maximum loop
// depth at which we can fuse a slice of a producer loop nest into a		// depth at which we can fuse a slice of a producer loop nest into a
// consumer loop nest.		// consumer loop nest.
sinkSequentialLoops(dstNode);		sinkSequentialLoops(dstNode);
		auto dstAffineForOp = cast<AffineForOp>(dstNode->op);

SmallVector<Operation *, 4> loads = dstNode->loads;		// Try to fuse 'dstNode' with candidate producer loops until a fixed point
SmallVector<Operation *, 4> dstLoadOpInsts;		// is reached. Fusing two loops may expose new fusion opportunities.
DenseSet<Value> visitedMemrefs;		bool dstNodeChanged;
while (!loads.empty()) {		do {
// Get memref of load on top of the stack.		// Gather src loop candidates for 'dstNode' and visit them in "quasi"
auto memref = cast<AffineReadOpInterface>(loads.back()).getMemRef();		// reverse program order to minimize the number of iterations needed to
if (visitedMemrefs.count(memref) > 0)		// reach the fixed point. Note that this is a best effort approach since
continue;		// 'getProducerCandidates' does not always guarantee program order
visitedMemrefs.insert(memref);		// in 'srcIdCandidates'.
		bondhugula (Done): Typo
// Move all loads in 'loads' accessing 'memref' to 'dstLoadOpInsts'.		dstNodeChanged = false;
moveLoadsAccessingMemrefTo(memref, &loads, &dstLoadOpInsts);		SmallVector<unsigned, 16> srcIdCandidates;
		bondhugula (Done): Likewise.
// Skip if no input edges along which to fuse.		getProducerCandidates(dstId, mdg, srcIdCandidates);
if (mdg->inEdges.count(dstId) == 0)
continue;		for (unsigned srcId : llvm::reverse(srcIdCandidates)) {
// Iterate through in-edges for 'dstId' and src node id for any
// edges on 'memref'.
SmallVector<unsigned, 2> srcNodeIds;
for (auto &srcEdge : mdg->inEdges[dstId]) {
// Skip 'srcEdge' if not for 'memref'.
if (srcEdge.value != memref)
continue;
srcNodeIds.push_back(srcEdge.id);
}
for (unsigned srcId : srcNodeIds) {
// Skip if this node was removed (fused into another node).
if (mdg->nodes.count(srcId) == 0)
continue;
// Get 'srcNode' from which to attempt fusion into 'dstNode'.		// Get 'srcNode' from which to attempt fusion into 'dstNode'.
auto *srcNode = mdg->getNode(srcId);		auto *srcNode = mdg->getNode(srcId);
// Skip if 'srcNode' is not a loop nest.		auto srcAffineForOp = cast<AffineForOp>(srcNode->op);
if (!isa<AffineForOp>(srcNode->op))		LLVM_DEBUG(llvm::dbgs() << "Evaluating src loop " << srcId
continue;		<< " for dst loop " << dstId << "\n");
// Skip if 'srcNode' has more than one live-out store to a
// function-local memref.
// TODO: Support more generic multi-output src loop nests
// fusion.
auto srcStoreOp = mdg->getUniqueOutgoingStore(srcNode);
if (!srcStoreOp) {
// Get the src store op at the deepest loop depth.
// We will use 'LoopFusionUtils::canFuseLoops' to check fusion
// feasibility for loops with multiple stores.
unsigned maxLoopDepth = 0;
for (auto *op : srcNode->stores) {
auto storeOp = cast<AffineWriteOpInterface>(op);
if (storeOp.getMemRef() != memref) {
srcStoreOp = nullptr;
break;
}
unsigned loopDepth = getNestingDepth(storeOp);
if (loopDepth > maxLoopDepth) {
maxLoopDepth = loopDepth;
srcStoreOp = storeOp;
}
}
if (!srcStoreOp)
continue;
}

// Unique outgoing store found must write to 'memref' since 'memref'
// is the one that established the producer-consumer relationship
// between 'srcNode' and 'dstNode'.
assert(srcStoreOp.getMemRef() == memref &&
"Found store to unexpected memref");

// Skip if 'srcNode' writes to any live in or escaping memrefs,		DenseSet<Value> producerConsumerMemrefs;
// and cannot be fused.		gatherProducerConsumerMemrefs(srcId, dstId, mdg,
bool writesToLiveInOrOut =		producerConsumerMemrefs);
mdg->writesToLiveInOrEscapingMemrefs(srcNode->id);
if (writesToLiveInOrOut &&		// Skip if 'srcNode' out edge count on any memref is greater than
!canFuseSrcWhichWritesToLiveOut(srcId, dstId, srcStoreOp, mdg))		// 'maxSrcUserCount'.
		if (any_of(producerConsumerMemrefs, [&](Value memref) {
		return mdg->getOutEdgeCount(srcNode->id, memref) >
		maxSrcUserCount;
		}))
continue;		continue;

// Don't create a private memref if 'writesToLiveInOrOut'.		// Gather memrefs in 'srcNode' that are written and escape to the
bool createPrivateMemref = !writesToLiveInOrOut;		// function (e.g., memref function arguments, returned memrefs,
// Don't create a private memref if 'srcNode' has in edges on		// memrefs passed to function calls, etc.).
// 'memref', or if 'dstNode' has out edges on 'memref'.		DenseSet<Value> srcEscapingMemRefs;
if (mdg->getIncomingMemRefAccesses(srcNode->id, memref) > 0 ||		gatherEscapingMemrefs(srcNode->id, mdg, srcEscapingMemRefs);
mdg->getOutEdgeCount(dstNode->id, memref) > 0) {
createPrivateMemref = false;		// Skip if there are non-affine operations in between the 'srcNode'
}		// and 'dstNode' using their memrefs. If so, we wouldn't be able to
		// compute a legal insertion point for now. 'srcNode' and 'dstNode'
// Skip if 'srcNode' out edge count on 'memref' > 'maxSrcUserCount'.		// memrefs with non-affine operation users would be considered
if (mdg->getOutEdgeCount(srcNode->id, memref) > maxSrcUserCount)		// escaping memrefs so we can limit this check to only scenarios with
		// escaping memrefs.
		if (!srcEscapingMemRefs.empty() &&
		hasNonAffineUsersOnThePath(srcId, dstId, mdg)) {
		LLVM_DEBUG(
		llvm::dbgs()
		<< "Can't fuse: non-affine users in between the loops.\n");
continue;		continue;
		}

// Compute an operation list insertion point for the fused loop		// Compute an operation list insertion point for the fused loop
// nest which preserves dependences.		// nest which preserves dependences.
Operation *insertPointInst =		Operation *fusedLoopInsPoint =
mdg->getFusedLoopNestInsertionPoint(srcNode->id, dstNode->id);		mdg->getFusedLoopNestInsertionPoint(srcNode->id, dstNode->id);
if (insertPointInst == nullptr)		if (fusedLoopInsPoint == nullptr)
continue;		continue;

auto srcAffineForOp = cast<AffineForOp>(srcNode->op);		// Compute the innermost common loop depth for dstNode
auto dstAffineForOp = cast<AffineForOp>(dstNode->op);		// producer-consumer loads/stores.

// Compute the innermost common loop depth for dstNode loads/stores.
SmallVector<Operation *, 2> dstMemrefOps;		SmallVector<Operation *, 2> dstMemrefOps;
for (Operation *op : dstNode->loads)		for (Operation *op : dstNode->loads)
if (cast<AffineReadOpInterface>(op).getMemRef() == memref)		if (producerConsumerMemrefs.count(
		cast<AffineReadOpInterface>(op).getMemRef()) > 0)
dstMemrefOps.push_back(op);		dstMemrefOps.push_back(op);
for (Operation *op : dstNode->stores)		for (Operation *op : dstNode->stores)
if (cast<AffineWriteOpInterface>(op).getMemRef() == memref)		if (producerConsumerMemrefs.count(
		cast<AffineWriteOpInterface>(op).getMemRef()))
dstMemrefOps.push_back(op);		dstMemrefOps.push_back(op);
unsigned dstLoopDepthTest = getInnermostCommonLoopDepth(dstMemrefOps);		unsigned dstLoopDepthTest = getInnermostCommonLoopDepth(dstMemrefOps);

// Check the feasibility of fusing src loop nest into dst loop nest		// Check the feasibility of fusing src loop nest into dst loop nest
// at loop depths in range [1, dstLoopDepthTest].		// at loop depths in range [1, dstLoopDepthTest].
unsigned maxLegalFusionDepth = 0;		unsigned maxLegalFusionDepth = 0;
SmallVector<ComputationSliceState, 8> depthSliceUnions;		SmallVector<ComputationSliceState, 8> depthSliceUnions;
depthSliceUnions.resize(dstLoopDepthTest);		depthSliceUnions.resize(dstLoopDepthTest);
FusionStrategy strategy(FusionStrategy::ProducerConsumer, memref);		FusionStrategy strategy(FusionStrategy::ProducerConsumer);
for (unsigned i = 1; i <= dstLoopDepthTest; ++i) {		for (unsigned i = 1; i <= dstLoopDepthTest; ++i) {
FusionResult result = mlir::canFuseLoops(		FusionResult result = mlir::canFuseLoops(
srcAffineForOp, dstAffineForOp,		srcAffineForOp, dstAffineForOp,
/*dstLoopDepth=*/i, &depthSliceUnions[i - 1], strategy);		/*dstLoopDepth=*/i, &depthSliceUnions[i - 1], strategy);

if (result.value == FusionResult::Success)		if (result.value == FusionResult::Success)
maxLegalFusionDepth = i;		maxLegalFusionDepth = i;
}		}

// Skip if fusion is not feasible at any loop depths.		if (maxLegalFusionDepth == 0) {
if (maxLegalFusionDepth == 0)		LLVM_DEBUG(llvm::dbgs()
		<< "Can't fuse: fusion is not legal at any depth\n");
continue;		continue;
		}

// Check if fusion would be profitable. We skip profitability analysis		// Check if fusion would be profitable. We skip profitability analysis
// for maximal fusion since we already know the maximal legal depth to		// for maximal fusion since we already know the maximal legal depth to
// fuse.		// fuse.
unsigned bestDstLoopDepth = maxLegalFusionDepth;		unsigned bestDstLoopDepth = maxLegalFusionDepth;
if (!maximalFusion &&		if (!maximalFusion) {
!isFusionProfitable(srcStoreOp, srcStoreOp, dstLoadOpInsts,		// Retrieve producer stores from the src loop.
depthSliceUnions, maxLegalFusionDepth,		SmallVector<Operation *, 2> producerStores;
&bestDstLoopDepth, computeToleranceThreshold))		for (Operation *op : srcNode->stores)
		if (producerConsumerMemrefs.count(
		cast<AffineWriteOpInterface>(op).getMemRef()))
		producerStores.push_back(op);

		// TODO: Support multiple producer stores in profitability
		// analysis. We limit profitability analysis to only scenarios with
		// a single producer store for now. Note that some multi-store
		// producer scenarios will still go through profitability analysis
		// if only one of the stores is involved in the producer-consumer
		// relationship of the candidate loops.
		assert(producerStores.size() > 0 && "Expected producer store");
		if (producerStores.size() > 1)
		LLVM_DEBUG(llvm::dbgs() << "Skipping profitability analysis. Not "
		"supported for this case\n");
		else if (!isFusionProfitable(producerStores[0], producerStores[0],
		dstAffineForOp, depthSliceUnions,
		maxLegalFusionDepth, &bestDstLoopDepth,
		computeToleranceThreshold))
continue;		continue;
		}

assert(bestDstLoopDepth > 0 && "Unexpected loop fusion depth");		assert(bestDstLoopDepth > 0 && "Unexpected loop fusion depth");
assert(!depthSliceUnions[bestDstLoopDepth - 1].isEmpty() &&		ComputationSliceState &bestSlice =
"Missing slice union for depth");		depthSliceUnions[bestDstLoopDepth - 1];
		assert(!bestSlice.isEmpty() && "Missing slice union for depth");

		// Determine if 'srcId' can be removed after fusion, taking into
		// account remaining dependences, escaping memrefs and the fusion
		// insertion point.
		bool removeSrcNode = canRemoveSrcNodeAfterFusion(
		srcId, dstId, bestSlice, fusedLoopInsPoint, srcEscapingMemRefs,
		mdg);

		DenseSet<Value> privateMemrefs;
		for (Value memref : producerConsumerMemrefs) {
		// Don't create a private memref if 'srcNode' writes to escaping
		// memrefs.
		if (srcEscapingMemRefs.count(memref) > 0)
		continue;

		// Don't create a private memref if 'srcNode' has in edges on
		// 'memref' or 'dstNode' has out edges on 'memref'.
		if (mdg->getIncomingMemRefAccesses(srcId, memref) > 0 ||
		mdg->getOutEdgeCount(dstId, memref) > 0)
		continue;

		// If 'srcNode' will be removed but it has out edges on 'memref' to
		// nodes other than 'dstNode', we have to preserve dependences and
		// cannot create a private memref.
		dcaballe (Author, Done): This check related to memref privatization is new and may be worth discussing. The remaining checks are the same as before but extended to support multiple stores.
		if (removeSrcNode &&
		any_of(mdg->outEdges[srcId], [&](const auto &edge) {
		return edge.value == memref && edge.id != dstId;
		}))
		continue;

		// Create a private version of this memref.
		privateMemrefs.insert(memref);
		}

// Fuse computation slice of 'srcLoopNest' into 'dstLoopNest'.		// Fuse computation slice of 'srcLoopNest' into 'dstLoopNest'.
fuseLoops(srcAffineForOp, dstAffineForOp,		fuseLoops(srcAffineForOp, dstAffineForOp, bestSlice);
depthSliceUnions[bestDstLoopDepth - 1]);		dstNodeChanged = true;

LLVM_DEBUG(llvm::dbgs()		LLVM_DEBUG(llvm::dbgs()
<< "Fused src loop " << srcId << " into dst loop " << dstId		<< "Fused src loop " << srcId << " into dst loop " << dstId
<< " at depth " << bestDstLoopDepth << ":\n"		<< " at depth " << bestDstLoopDepth << ":\n"
<< dstAffineForOp << "\n");		<< dstAffineForOp << "\n");

// Move 'dstAffineForOp' before 'insertPointInst' if needed.		// Move 'dstAffineForOp' before 'fusedLoopInsPoint' if needed.
if (insertPointInst != dstAffineForOp.getOperation())		if (fusedLoopInsPoint != dstAffineForOp.getOperation())
dstAffineForOp->moveBefore(insertPointInst);		dstAffineForOp.getOperation()->moveBefore(fusedLoopInsPoint);

// Update edges between 'srcNode' and 'dstNode'.		// Update edges between 'srcNode' and 'dstNode'.
mdg->updateEdges(srcNode->id, dstNode->id, memref,		mdg->updateEdges(srcNode->id, dstNode->id, privateMemrefs,
createPrivateMemref);		removeSrcNode);

// Collect slice loop stats.		// Create private memrefs.
LoopNestStateCollector dstForCollector;		if (!privateMemrefs.empty()) {
dstForCollector.collect(dstAffineForOp);		// Gather stores for all the private-to-be memrefs.
if (createPrivateMemref) {		DenseMap<Value, SmallVector<Operation *, 4>> privateMemRefToStores;
// Create private memref for 'memref' in 'dstAffineForOp'.		dstAffineForOp.walk([&](AffineWriteOpInterface storeOp) {
SmallVector<Operation *, 4> storesForMemref;		Value storeMemRef = storeOp.getMemRef();
for (auto *storeOpInst : dstForCollector.storeOpInsts) {		if (privateMemrefs.count(storeMemRef) > 0)
if (cast<AffineWriteOpInterface>(storeOpInst).getMemRef() ==		privateMemRefToStores[storeMemRef].push_back(
memref)		storeOp.getOperation());
storesForMemref.push_back(storeOpInst);		});
}
		// Replace original memrefs with private memrefs. Note that all the
		// loads and stores on these memrefs will be replaced with new
		// loads and stores. Any reference to the original ones becomes
		// invalid after this point.
		for (auto &memrefToStoresPair : privateMemRefToStores) {
// TODO: Use union of memref write regions to compute		// TODO: Use union of memref write regions to compute
// private memref footprint.		// private memref footprint.
auto newMemRef = createPrivateMemRef(		SmallVector<Operation *, 4> &storesForMemref =
		memrefToStoresPair.second;
		Value newMemRef = createPrivateMemRef(
dstAffineForOp, storesForMemref[0], bestDstLoopDepth,		dstAffineForOp, storesForMemref[0], bestDstLoopDepth,
fastMemorySpace, localBufSizeThreshold);		fastMemorySpace, localBufSizeThreshold);
visitedMemrefs.insert(newMemRef);
// Create new node in dependence graph for 'newMemRef' alloc op.		// Create new node in dependence graph for 'newMemRef' alloc op.
unsigned newMemRefNodeId = mdg->addNode(newMemRef.getDefiningOp());		unsigned newMemRefNodeId =
		mdg->addNode(newMemRef.getDefiningOp());
// Add edge from 'newMemRef' node to dstNode.		// Add edge from 'newMemRef' node to dstNode.
mdg->addEdge(newMemRefNodeId, dstId, newMemRef);		mdg->addEdge(newMemRefNodeId, dstId, newMemRef);
}		}
		}

// Collect dst loop stats after memref privatization transformation.		// Collect dst loop stats after memref privatization transformation.
LoopNestStateCollector dstLoopCollector;		LoopNestStateCollector dstLoopCollector;
dstLoopCollector.collect(dstAffineForOp.getOperation());		dstLoopCollector.collect(dstAffineForOp.getOperation());

// Add new load ops to current Node load op list 'loads' to continue
// fusing based on new operands.
for (auto *loadOpInst : dstLoopCollector.loadOpInsts) {
// NOTE: Change 'loads' to a hash set in case efficiency is an
// issue. We still use a vector since it's expected to be small.
if (!llvm::is_contained(loads, loadOpInst))
loads.push_back(loadOpInst);
}
// Clear visited memrefs after fusion so that previously visited src
// nodes are considered for fusion again in the context of the new
// fused node.
dcaballe (Author, Done): These workarounds are no longer needed (they couldn't be easily ported to the new scheme) since the algorithm runs until a fixed point is reached.
// TODO: This shouldn't be necessary if we visited candidates in the
// dependence graph in post-order or once we fully support multi-store
// producers. Currently, in a multi-store producer scenario such as
// A->B, A->C, B->C, we fail to fuse A+B due to the multiple outgoing
// edges. However, after fusing B+C, A has a single outgoing edge and
// can be fused if we revisit it in the context of the new fused B+C
// node.
visitedMemrefs.clear();

// Clear and add back loads and stores.		// Clear and add back loads and stores.
mdg->clearNodeLoadAndStores(dstNode->id);		mdg->clearNodeLoadAndStores(dstNode->id);
mdg->addToNode(dstId, dstLoopCollector.loadOpInsts,		mdg->addToNode(dstId, dstLoopCollector.loadOpInsts,
dstLoopCollector.storeOpInsts);		dstLoopCollector.storeOpInsts);
// Remove old src loop nest if it no longer has outgoing dependence
// edges, and if it does not write to a memref which escapes the		if (removeSrcNode) {
// function. If 'writesToLiveInOrOut' is true, then 'srcNode' has been		LLVM_DEBUG(llvm::dbgs()
// fused into 'dstNode' and write region of 'dstNode' covers the write		<< "Removing src loop " << srcId << " after fusion\n");
// region of 'srcNode', and 'srcNode' has no other users so it is safe		// srcNode is no longer valid after it is removed from mdg.
		bondhugula (Done): Nit: ... after it is removed from `mdg`?
// to remove.		srcAffineForOp.erase();
if (writesToLiveInOrOut || mdg->canRemoveNode(srcNode->id)) {		mdg->removeNode(srcId);
mdg->removeNode(srcNode->id);		srcNode = nullptr;
srcNode->op->erase();
} else {
// Add remaining users of 'oldMemRef' back on the worklist (if not
// already there), as its replacement with a local/private memref
// has reduced dependences on 'oldMemRef' which may have created new
// fusion opportunities.
if (mdg->outEdges.count(srcNode->id) > 0) {
SmallVector<MemRefDependenceGraph::Edge, 2> oldOutEdges =
mdg->outEdges[srcNode->id];
for (auto &outEdge : oldOutEdges) {
if (outEdge.value == memref &&
dcaballe (Author, Done): Same here. The algorithm runs until a fixed point is reached so this is not needed.
worklistSet.count(outEdge.id) == 0) {
worklist.push_back(outEdge.id);
worklistSet.insert(outEdge.id);
}
}
}
}
}		}
}		}
		} while (dstNodeChanged);
}		}
}		}

// Visits each node in the graph, and for each node, attempts to fuse it with		// Visits each node in the graph, and for each node, attempts to fuse it with
// its sibling nodes (nodes which share a parent, but no dependence edges).		// its sibling nodes (nodes which share a parent, but no dependence edges).
void fuseSiblingNodes() {		void fuseSiblingNodes() {
init();		init();
while (!worklist.empty()) {		while (!worklist.empty()) {
unsigned dstId = worklist.back();		unsigned dstId = worklist.back();
worklist.pop_back();		worklist.pop_back();
worklistSet.erase(dstId);

// Skip if this node was removed (fused into another node).		// Skip if this node was removed (fused into another node).
if (mdg->nodes.count(dstId) == 0)		if (mdg->nodes.count(dstId) == 0)
continue;		continue;
// Get 'dstNode' into which to attempt fusion.		// Get 'dstNode' into which to attempt fusion.
auto *dstNode = mdg->getNode(dstId);		auto *dstNode = mdg->getNode(dstId);
// Skip if 'dstNode' is not a loop nest.		// Skip if 'dstNode' is not a loop nest.
if (!isa<AffineForOp>(dstNode->op))		if (!isa<AffineForOp>(dstNode->op))
[45 lines not shown]	while (findSiblingNodeToFuse(dstNode, &visitedSibNodeIds, &idAndMemref)) {
getLoopIVs(*dstLoadOpInsts[0], &dstLoopIVs);		getLoopIVs(*dstLoadOpInsts[0], &dstLoopIVs);
unsigned dstLoopDepthTest = dstLoopIVs.size();		unsigned dstLoopDepthTest = dstLoopIVs.size();
auto sibAffineForOp = cast<AffineForOp>(sibNode->op);		auto sibAffineForOp = cast<AffineForOp>(sibNode->op);

// Compute loop depth and slice union for fusion.		// Compute loop depth and slice union for fusion.
SmallVector<ComputationSliceState, 8> depthSliceUnions;		SmallVector<ComputationSliceState, 8> depthSliceUnions;
depthSliceUnions.resize(dstLoopDepthTest);		depthSliceUnions.resize(dstLoopDepthTest);
unsigned maxLegalFusionDepth = 0;		unsigned maxLegalFusionDepth = 0;
FusionStrategy strategy(FusionStrategy::Sibling, memref);		FusionStrategy strategy(memref);
for (unsigned i = 1; i <= dstLoopDepthTest; ++i) {		for (unsigned i = 1; i <= dstLoopDepthTest; ++i) {
FusionResult result = mlir::canFuseLoops(		FusionResult result = mlir::canFuseLoops(
sibAffineForOp, dstAffineForOp,		sibAffineForOp, dstAffineForOp,
/*dstLoopDepth=*/i, &depthSliceUnions[i - 1], strategy);		/*dstLoopDepth=*/i, &depthSliceUnions[i - 1], strategy);

if (result.value == FusionResult::Success)		if (result.value == FusionResult::Success)
maxLegalFusionDepth = i;		maxLegalFusionDepth = i;
}		}

// Skip if fusion is not feasible at any loop depths.		// Skip if fusion is not feasible at any loop depths.
if (maxLegalFusionDepth == 0)		if (maxLegalFusionDepth == 0)
continue;		continue;

unsigned bestDstLoopDepth = dstLoopDepthTest;		unsigned bestDstLoopDepth = maxLegalFusionDepth;
if (!maximalFusion) {		if (!maximalFusion) {
// Check if fusion would be profitable.		// Check if fusion would be profitable.
if (!isFusionProfitable(sibLoadOpInst, sibStoreOpInst, dstLoadOpInsts,		if (!isFusionProfitable(sibLoadOpInst, sibStoreOpInst, dstAffineForOp,
depthSliceUnions, maxLegalFusionDepth,		depthSliceUnions, maxLegalFusionDepth,
&bestDstLoopDepth, computeToleranceThreshold))		&bestDstLoopDepth, computeToleranceThreshold))
continue;		continue;
}		}

assert(bestDstLoopDepth > 0 && "Unexpected loop fusion depth");		assert(bestDstLoopDepth > 0 && "Unexpected loop fusion depth");
assert(!depthSliceUnions[bestDstLoopDepth - 1].isEmpty() &&		assert(!depthSliceUnions[bestDstLoopDepth - 1].isEmpty() &&
"Fusion depth has no computed slice union");		"Fusion depth has no computed slice union");
[194 lines not shown]
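
The fixed-point scheme in the producer-consumer loop above, and the reason for visiting candidates in reverse program order, can be illustrated with a toy model. The sketch below is not the pass's code: `ToyGraph`, `canFuse`, `fuse`, `fuseProducersInto` and `makeTriangle` are hypothetical names, and the legality rule (a producer is fusable only when the consumer is its sole user) is an assumption chosen to reproduce the A->B, A->C, B->C scenario from the removed TODO comment. It only shows that re-gathering candidates until nothing changes picks up opportunities exposed by earlier fusions, and that reverse order can need fewer sweeps.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Toy dependence graph (hypothetical, not MemRefDependenceGraph).
// users[n] holds the ids of the nodes that consume values produced by n.
// Node ids double as program order.
struct ToyGraph {
  std::vector<std::set<unsigned>> users;
  std::vector<bool> alive;
};

// Toy legality rule (an assumption for illustration only): 'src' can be
// fused into 'dst' when 'dst' is its only remaining user.
static bool canFuse(const ToyGraph &g, unsigned src, unsigned dst) {
  return g.alive[src] && g.users[src].size() == 1 &&
         g.users[src].count(dst) == 1;
}

// Fuses 'src' into 'dst': 'src' disappears and every node that used to feed
// 'src' now feeds 'dst' instead.
static void fuse(ToyGraph &g, unsigned src, unsigned dst) {
  g.alive[src] = false;
  g.users[src].clear();
  for (std::set<unsigned> &u : g.users)
    if (u.erase(src) > 0)
      u.insert(dst);
}

// Fuses producers into 'dst' until a sweep makes no change. Returns the
// number of sweeps executed, including the final one that finds nothing.
static unsigned fuseProducersInto(ToyGraph &g, unsigned dst,
                                  bool reverseOrder) {
  unsigned sweeps = 0;
  bool changed;
  do {
    changed = false;
    ++sweeps;
    // Gather candidates afresh on every sweep, in program order.
    std::vector<unsigned> candidates;
    for (unsigned n = 0; n < g.users.size(); ++n)
      if (g.alive[n] && g.users[n].count(dst) > 0)
        candidates.push_back(n);
    if (reverseOrder)
      std::reverse(candidates.begin(), candidates.end());
    for (unsigned src : candidates) {
      if (!canFuse(g, src, dst))
        continue;
      fuse(g, src, dst);
      changed = true;
    }
  } while (changed);
  return sweeps;
}

// Builds the A->B, A->C, B->C example with A=0, B=1, C=2.
static ToyGraph makeTriangle() {
  ToyGraph g;
  g.users = {{1, 2}, {2}, {}};
  g.alive = {true, true, true};
  return g;
}
```

In this model, visiting B before A lets A be fused in the same sweep, because fusing B into C leaves C as the only user of A; visiting A first defers it to a second sweep.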

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp

[185 lines not shown]	gatherLoadsAndStores(AffineForOp forOp,
});		});
return !hasIfOp;		return !hasIfOp;
}		}

/// Returns the maximum loop depth at which we could fuse producer loop		/// Returns the maximum loop depth at which we could fuse producer loop
/// 'srcForOp' into consumer loop 'dstForOp' without violating data dependences.		/// 'srcForOp' into consumer loop 'dstForOp' without violating data dependences.
// TODO: Generalize this check for sibling and more generic fusion scenarios.		// TODO: Generalize this check for sibling and more generic fusion scenarios.
// TODO: Support forward slice fusion.		// TODO: Support forward slice fusion.
static unsigned getMaxLoopDepth(ArrayRef<Operation *> dstOps,		static unsigned getMaxLoopDepth(ArrayRef<Operation *> srcOps,
FusionStrategy fusionStrategy) {		ArrayRef<Operation *> dstOps) {
assert(fusionStrategy.strategy == FusionStrategy::ProducerConsumer &&
"Fusion strategy not supported");

if (dstOps.empty())		if (dstOps.empty())
// Expected at least one memory operation.		// Expected at least one memory operation.
// TODO: Revisit this case with a specific example.		// TODO: Revisit this case with a specific example.
return 0;		return 0;

// Filter out ops in 'dstOps' that do not use the producer-consumer memref so		// Filter out ops in 'dstOps' that do not use the producer-consumer memref so
// that they are not considered for analysis.		// that they are not considered for analysis.
// TODO: Currently, we pass the producer-consumer memref through		DenseSet<Value> producerConsumerMemrefs;
// fusionStrategy. We will retrieve the memrefs from 'srcOps' once we		gatherProducerConsumerMemrefs(srcOps, dstOps, producerConsumerMemrefs);
// generalize the algorithm.
SmallVector<Operation *, 4> targetDstOps;		SmallVector<Operation *, 4> targetDstOps;
for (Operation *dstOp : dstOps) {		for (Operation *dstOp : dstOps) {
auto loadOp = dyn_cast<AffineReadOpInterface>(dstOp);		auto loadOp = dyn_cast<AffineReadOpInterface>(dstOp);
Value memref = loadOp ? loadOp.getMemRef()		Value memref = loadOp ? loadOp.getMemRef()
: cast<AffineWriteOpInterface>(dstOp).getMemRef();		: cast<AffineWriteOpInterface>(dstOp).getMemRef();
if (memref == fusionStrategy.memref)		if (producerConsumerMemrefs.count(memref) > 0)
targetDstOps.push_back(dstOp);		targetDstOps.push_back(dstOp);
}		}

assert(!targetDstOps.empty() &&		assert(!targetDstOps.empty() &&
"No dependences between 'srcForOp' and 'dstForOp'?");		"No dependences between 'srcForOp' and 'dstForOp'?");

// Compute the innermost common loop depth for loads and stores.		// Compute the innermost common loop depth for loads and stores.
unsigned loopDepth = getInnermostCommonLoopDepth(targetDstOps);		unsigned loopDepth = getInnermostCommonLoopDepth(targetDstOps);
[80 lines not shown]	if (!gatherLoadsAndStores(forOpB, opsB)) {
LLVM_DEBUG(llvm::dbgs() << "Fusing loops with affine.if unsupported\n");		LLVM_DEBUG(llvm::dbgs() << "Fusing loops with affine.if unsupported\n");
return FusionResult::FailPrecondition;		return FusionResult::FailPrecondition;
}		}

// Return 'failure' if fusing loops at depth 'dstLoopDepth' wouldn't preserve		// Return 'failure' if fusing loops at depth 'dstLoopDepth' wouldn't preserve
// loop dependences.		// loop dependences.
// TODO: Enable this check for sibling and more generic loop fusion		// TODO: Enable this check for sibling and more generic loop fusion
// strategies.		// strategies.
if (fusionStrategy.strategy == FusionStrategy::ProducerConsumer) {		if (fusionStrategy.getStrategy() == FusionStrategy::ProducerConsumer) {
// TODO: 'getMaxLoopDepth' does not support forward slice fusion.		// TODO: 'getMaxLoopDepth' does not support forward slice fusion.
assert(isSrcForOpBeforeDstForOp && "Unexpected forward slice fusion");		assert(isSrcForOpBeforeDstForOp && "Unexpected forward slice fusion");
if (getMaxLoopDepth(opsB, fusionStrategy) < dstLoopDepth) {		if (getMaxLoopDepth(opsA, opsB) < dstLoopDepth) {
LLVM_DEBUG(llvm::dbgs() << "Fusion would violate loop dependences\n");		LLVM_DEBUG(llvm::dbgs() << "Fusion would violate loop dependences\n");
return FusionResult::FailFusionDependence;		return FusionResult::FailFusionDependence;
}		}
}		}

// Calculate the number of common loops surrounding 'srcForOp' and 'dstForOp'.		// Calculate the number of common loops surrounding 'srcForOp' and 'dstForOp'.
unsigned numCommonLoops = mlir::getNumCommonSurroundingLoops(		unsigned numCommonLoops = mlir::getNumCommonSurroundingLoops(
srcForOp.getOperation(), dstForOp.getOperation());		srcForOp.getOperation(), dstForOp.getOperation());

// Filter out ops in 'opsA' to compute the slice union based on the		// Filter out ops in 'opsA' to compute the slice union based on the
// assumptions made by the fusion strategy.		// assumptions made by the fusion strategy.
SmallVector<Operation *, 4> strategyOpsA;		SmallVector<Operation *, 4> strategyOpsA;
switch (fusionStrategy.strategy) {		switch (fusionStrategy.getStrategy()) {
case FusionStrategy::Generic:		case FusionStrategy::Generic:
// Generic fusion. Take into account all the memory operations to compute		// Generic fusion. Take into account all the memory operations to compute
// the slice union.		// the slice union.
strategyOpsA.append(opsA.begin(), opsA.end());		strategyOpsA.append(opsA.begin(), opsA.end());
break;		break;
case FusionStrategy::ProducerConsumer:		case FusionStrategy::ProducerConsumer:
// Producer-consumer fusion (AffineLoopFusion pass) only takes into		// Producer-consumer fusion (AffineLoopFusion pass) only takes into
// account stores to 'memref' in 'srcForOp' to compute the slice union.		// account stores in 'srcForOp' to compute the slice union.
for (Operation *op : opsA) {		for (Operation *op : opsA) {
auto store = dyn_cast<AffineWriteOpInterface>(op);		if (isa<AffineWriteOpInterface>(op))
if (store && store.getMemRef() == fusionStrategy.memref)
strategyOpsA.push_back(op);		strategyOpsA.push_back(op);
}		}
break;		break;
case FusionStrategy::Sibling:		case FusionStrategy::Sibling:
// Sibling fusion (AffineLoopFusion pass) only takes into account the loads		// Sibling fusion (AffineLoopFusion pass) only takes into account the loads
// to 'memref' in 'srcForOp' to compute the slice union.		// to 'memref' in 'srcForOp' to compute the slice union.
for (Operation *op : opsA) {		for (Operation *op : opsA) {
auto load = dyn_cast<AffineReadOpInterface>(op);		auto load = dyn_cast<AffineReadOpInterface>(op);
if (load && load.getMemRef() == fusionStrategy.memref)		if (load && load.getMemRef() == fusionStrategy.getSiblingFusionMemRef())
strategyOpsA.push_back(op);		strategyOpsA.push_back(op);
}		}
break;		break;
}		}

// Compute union of computation slices computed between all pairs of ops		// Compute union of computation slices computed between all pairs of ops
// from 'forOpA' and 'forOpB'.		// from 'forOpA' and 'forOpB'.
if (failed(mlir::computeSliceUnion(strategyOpsA, opsB, dstLoopDepth,		if (failed(mlir::computeSliceUnion(strategyOpsA, opsB, dstLoopDepth,
[267 lines not shown]	bool mlir::getFusionComputeCost(AffineForOp srcForOp, LoopNestStats &srcStats,
// Compute cost of fusion for this depth.		// Compute cost of fusion for this depth.
computeCostMap[insertPointParent] = sliceComputeCost;		computeCostMap[insertPointParent] = sliceComputeCost;

*computeCost =		*computeCost =
getComputeCostHelper(dstForOp.getOperation(), dstStats,		getComputeCostHelper(dstForOp.getOperation(), dstStats,
/*tripCountOverrideMap=*/nullptr, &computeCostMap);		/*tripCountOverrideMap=*/nullptr, &computeCostMap);
return true;		return true;
}		}

		/// Returns in 'producerConsumerMemrefs' the memrefs involved in a
		/// producer-consumer dependence between write ops in 'srcOps' and read ops in
		bondhugula (Done): ops in `srcOps` -> write ops in `srcOps`? ops in `dstOps` -> read ops in `dstOps`?
		/// 'dstOps'.
		void mlir::gatherProducerConsumerMemrefs(
		ArrayRef<Operation *> srcOps, ArrayRef<Operation *> dstOps,
		DenseSet<Value> &producerConsumerMemrefs) {
		// Gather memrefs from stores in 'srcOps'.
		DenseSet<Value> srcStoreMemRefs;
		for (Operation *op : srcOps)
		if (auto storeOp = dyn_cast<AffineWriteOpInterface>(op))
		srcStoreMemRefs.insert(storeOp.getMemRef());

		// Compute the intersection between memrefs from stores in 'srcOps' and
		// memrefs from loads in 'dstOps'.
		for (Operation *op : dstOps)
		if (auto loadOp = dyn_cast<AffineReadOpInterface>(op))
		if (srcStoreMemRefs.count(loadOp.getMemRef()) > 0)
		producerConsumerMemrefs.insert(loadOp.getMemRef());
		}
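
The set intersection performed by `gatherProducerConsumerMemrefs` can be tried outside MLIR with a small stand-in. In the sketch below, `ToyAccess` and `producerConsumerMemrefs` are hypothetical names and memrefs are modelled as plain strings; it is an illustration of the idea under those assumptions, not the MLIR implementation.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical stand-in for an affine load/store: whether the access is a
// store, and the name of the memref it touches.
struct ToyAccess {
  bool isStore;
  std::string memref;
};

// Returns the memrefs that are stored to by 'srcOps' and loaded from by
// 'dstOps', i.e. the memrefs carrying a producer-consumer relationship.
static std::set<std::string>
producerConsumerMemrefs(const std::vector<ToyAccess> &srcOps,
                        const std::vector<ToyAccess> &dstOps) {
  // Gather memrefs written by the source ops.
  std::set<std::string> srcStoreMemrefs;
  for (const ToyAccess &op : srcOps)
    if (op.isStore)
      srcStoreMemrefs.insert(op.memref);

  // Keep only those that the destination ops read.
  std::set<std::string> result;
  for (const ToyAccess &op : dstOps)
    if (!op.isStore && srcStoreMemrefs.count(op.memref) > 0)
      result.insert(op.memref);
  return result;
}
```

For a source that stores to `A` and `B` and a destination that loads `A` and `C` but only stores to `B`, the result contains just `A`: stores in the destination and loads in the source do not contribute.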

mlir/test/Transforms/loop-fusion.mlir

[358 lines not shown]	func @should_fuse_and_move_to_preserve_war_dep() {
// CHECK-NEXT: affine.store %{{.*}}, %{{.*}}[%{{.*}}] : memref<10xf32>		// CHECK-NEXT: affine.store %{{.*}}, %{{.*}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return
return		return
}		}

// -----		// -----

// CHECK-LABEL: func @should_fuse_with_private_memref_if_top_level_access() {		// CHECK-LABEL: func @should_fuse_if_top_level_access() {
func @should_fuse_with_private_memref_if_top_level_access() {		func @should_fuse_if_top_level_access() {
%m = alloc() : memref<10xf32>		%m = alloc() : memref<10xf32>
%cf7 = constant 7.0 : f32		%cf7 = constant 7.0 : f32

affine.for %i0 = 0 to 10 {		affine.for %i0 = 0 to 10 {
affine.store %cf7, %m[%i0] : memref<10xf32>		affine.store %cf7, %m[%i0] : memref<10xf32>
}		}
affine.for %i1 = 0 to 10 {		affine.for %i1 = 0 to 10 {
%v0 = affine.load %m[%i1] : memref<10xf32>		%v0 = affine.load %m[%i1] : memref<10xf32>
}		}

%c0 = constant 4 : index		%c0 = constant 4 : index
%v1 = affine.load %m[%c0] : memref<10xf32>		%v1 = affine.load %m[%c0] : memref<10xf32>
// Top-level load to '%{{.*}}' should prevent fusion.		// Top-level load to '%m' should prevent creating a private memref but
// CHECK: affine.for %{{.*}} = 0 to 10 {		// loop nests should be fused and '%i0' should be removed.
// CHECK-NEXT: affine.store %{{.*}}, %{{.*}}[%{{.*}}] : memref<10xf32>		// CHECK: %[[m:.*]] = alloc() : memref<10xf32>
		// CHECK-NOT: alloc

		// CHECK: affine.for %[[i1:.*]] = 0 to 10 {
		// CHECK-NEXT: affine.store %{{.*}}, %[[m]][%[[i1]]] : memref<10xf32>
		// CHECK-NEXT: affine.load %[[m]][%[[i1]]] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: affine.for %{{.*}} = 0 to 10 {		// CHECK: affine.load %[[m]][%{{.*}}] : memref<10xf32>
		return
		}

		// -----

		// CHECK-LABEL: func @should_fuse_but_not_remove_src() {
		func @should_fuse_but_not_remove_src() {
		%m = alloc() : memref<100xf32>
		%cf7 = constant 7.0 : f32

		affine.for %i0 = 0 to 100 {
		affine.store %cf7, %m[%i0] : memref<100xf32>
		}
		affine.for %i1 = 0 to 17 {
		%v0 = affine.load %m[%i1] : memref<100xf32>
		}
		%v1 = affine.load %m[99] : memref<100xf32>
		bondhugula: This is an out-of-bounds access: `%m[100] -> %m[99]`. `-test-memref-bound-check` can catch these, actually.
		dcaballe: Good catch :)

		// Loop '%i0' and '%i1' should be fused but '%i0' shouldn't be removed to
		// preserve the dependence with the top-level access.
		// CHECK: affine.for %{{.*}} = 0 to 100 {
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<100xf32>
		// CHECK-NEXT: }
		// CHECK-NEXT: affine.for %{{.*}} = 0 to 17 {
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[0] : memref<1xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[0] : memref<1xf32>
// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>		// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
		// CHECK-NEXT: affine.load %{{.*}}[99] : memref<100xf32>
		// CHECK-NEXT: return
return		return
}		}

// -----		// -----

// CHECK-LABEL: func @should_fuse_no_top_level_access() {		// CHECK-LABEL: func @should_fuse_no_top_level_access() {
func @should_fuse_no_top_level_access() {		func @should_fuse_no_top_level_access() {
%m = alloc() : memref<10xf32>		%m = alloc() : memref<10xf32>
... 708 lines hidden ...	func @should_fuse_with_private_memrefs_with_diff_shapes() {
// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>		// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return
return		return
}		}

// -----		// -----

// CHECK-LABEL: func @should_not_fuse_live_out_arg(%{{.*}}: memref<10xf32>) {		// CHECK-LABEL: func @should_fuse_live_out_arg_but_preserve_src_loop(%{{.*}}: memref<10xf32>) {
func @should_not_fuse_live_out_arg(%arg0: memref<10xf32>) {		func @should_fuse_live_out_arg_but_preserve_src_loop(%arg0: memref<10xf32>) {
%cf7 = constant 7.0 : f32		%cf7 = constant 7.0 : f32

affine.for %i0 = 0 to 10 {		affine.for %i0 = 0 to 10 {
affine.store %cf7, %arg0[%i0] : memref<10xf32>		affine.store %cf7, %arg0[%i0] : memref<10xf32>
}		}
affine.for %i1 = 0 to 9 {		affine.for %i1 = 0 to 9 {
%v0 = affine.load %arg0[%i1] : memref<10xf32>		%v0 = affine.load %arg0[%i1] : memref<10xf32>
}		}
// This tests that the loop nest '%i0' should not be removed after fusion		// This tests that the loop nest '%i0' should not be removed after fusion
// because it writes to memref argument '%arg0', and its read region		// because it writes to memref argument '%arg0', and its read region
// does not cover its write region (so fusion would shrink the write region		// does not cover its write region (so fusion would shrink the write region
// in the fused loop nest, so complete live out data region would not		// in the fused loop nest, so complete live out data region would not
// be written).		// be written).
// CHECK: affine.for %{{.*}} = 0 to 10 {		// CHECK: affine.for %{{.*}} = 0 to 10 {
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: affine.for %{{.*}} = 0 to 9 {		// CHECK-NEXT: affine.for %{{.*}} = 0 to 9 {
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return
return		return
}		}
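
The comment in the test above ties removal of the source loop to whether the consumer's read region covers the producer's write region. A minimal sketch of that check follows, assuming one-dimensional half-open index ranges in place of general affine memory regions. `Range`, `covers` and `canRemoveSrcLoopAfterFusion` are hypothetical names for illustration and do not reflect how the pass computes regions.

```cpp
#include <cassert>

// Half-open 1-D index range [lb, ub): a simplified stand-in for an affine
// memory region.
struct Range {
  long lb;
  long ub;
};

// True if 'outer' contains every index of 'inner'.
bool covers(const Range &outer, const Range &inner) {
  return outer.lb <= inner.lb && inner.ub <= outer.ub;
}

// Assumed rule for a live-out memref: the producer loop may be erased after
// fusion only if the consumer's read region covers the producer's write
// region. Otherwise part of the live-out data would never be written.
bool canRemoveSrcLoopAfterFusion(const Range &srcWrite, const Range &dstRead) {
  return covers(dstRead, srcWrite);
}
```

For the test above, the write region is [0, 10) and the read region is [0, 9), so the check fails and loop '%i0' is kept next to the fused loop.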

// -----		// -----

... 15 lines hidden ...	func @should_fuse_live_out_arg(%arg0: memref<10xf32>) {
// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return
return		return
}		}

// -----		// -----

// CHECK-LABEL: func @should_not_fuse_escaping_memref() -> memref<10xf32>		// CHECK-LABEL: func @should_fuse_escaping_memref_but_preserve_src_loop() -> memref<10xf32>
func @should_not_fuse_escaping_memref() -> memref<10xf32> {		func @should_fuse_escaping_memref_but_preserve_src_loop() -> memref<10xf32> {
%cf7 = constant 7.0 : f32		%cf7 = constant 7.0 : f32
%m = alloc() : memref<10xf32>		%m = alloc() : memref<10xf32>
affine.for %i0 = 0 to 10 {		affine.for %i0 = 0 to 10 {
affine.store %cf7, %m[%i0] : memref<10xf32>		affine.store %cf7, %m[%i0] : memref<10xf32>
}		}
affine.for %i1 = 0 to 9 {		affine.for %i1 = 0 to 9 {
%v0 = affine.load %m[%i1] : memref<10xf32>		%v0 = affine.load %m[%i1] : memref<10xf32>
}		}
// This tests that the loop nest '%{{.*}}' should not be removed after fusion		// This tests that the loop nest '%i0' should not be removed after fusion
// because it writes to memref '%{{.*}}' which is returned by the function.		// because it writes to memref '%m', which is returned by the function, and
		// the '%i1' memory region does not cover '%i0' memory region.

// CHECK-DAG: alloc() : memref<10xf32>		// CHECK-DAG: alloc() : memref<10xf32>
// CHECK: affine.for %{{.*}} = 0 to 10 {		// CHECK: affine.for %{{.*}} = 0 to 10 {
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: affine.for %{{.*}} = 0 to 9 {		// CHECK-NEXT: affine.for %{{.*}} = 0 to 9 {
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return %{{.*}} : memref<10xf32>		// CHECK-NEXT: return %{{.*}} : memref<10xf32>
return %m : memref<10xf32>		return %m : memref<10xf32>
}		}

// -----		// -----

// This should fuse with the %in becoming a 1x1x1.		// This should fuse with the %in becoming a 1x1x1.
func @R3_to_R2_reshape() {		func @R3_to_R2_reshape() {
%in = alloc() : memref<2x3x16xi32>		%in = alloc() : memref<2x3x16xi32>

%c0 = constant 0 : index		%c0 = constant 0 : index

... 31 lines hidden ...
// CHECK-NEXT: affine.apply [[$MAP2]](%{{.*}})		// CHECK-NEXT: affine.apply [[$MAP2]](%{{.*}})
// CHECK-NEXT: affine.load %{{.*}}[0, 0, 0] : memref<1x1x1xi32>		// CHECK-NEXT: affine.load %{{.*}}[0, 0, 0] : memref<1x1x1xi32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return

// -----		// -----

func @should_not_fuse_multi_output_producer() {		func @should_fuse_multi_output_producer() {
%a = alloc() : memref<10xf32>		%a = alloc() : memref<10xf32>
%b = alloc() : memref<10xf32>		%b = alloc() : memref<10xf32>

%cf7 = constant 7.0 : f32		%cf7 = constant 7.0 : f32

affine.for %i0 = 0 to 10 {		affine.for %i0 = 0 to 10 {
affine.store %cf7, %a[%i0] : memref<10xf32>		affine.store %cf7, %a[%i0] : memref<10xf32>
affine.store %cf7, %b[%i0] : memref<10xf32>		affine.store %cf7, %b[%i0] : memref<10xf32>
}		}
affine.for %i1 = 0 to 10 {		affine.for %i1 = 0 to 10 {
%v0 = affine.load %a[%i1] : memref<10xf32>		%v0 = affine.load %a[%i1] : memref<10xf32>
%v1 = affine.load %b[%i1] : memref<10xf32>		%v1 = affine.load %b[%i1] : memref<10xf32>
}		}

// CHECK: affine.for %{{.*}} = 0 to 10 {		// CHECK: affine.for %{{.*}} = 0 to 10 {
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[0] : memref<1xf32>
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[0] : memref<1xf32>
// CHECK-NEXT: }		// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>
// CHECK-NEXT: affine.for %{{.*}} = 0 to 10 {		// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>
// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return
return		return
}		}

// -----		// -----

// CHECK-LABEL: func @fusion_preventing_deps_on_middle_loop() {		// CHECK-LABEL: func @fusion_preventing_deps_on_middle_loop() {
... 236 lines hidden ...	func @should_fuse_at_depth_above_loop_carried_dependence(%arg0: memref<64x4xf32>, %arg1: memref<64x4xf32>) {
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return
return		return
}		}

// -----		// -----

// CHECK-LABEL: func @should_fuse_after_private_memref_creation() {		// CHECK-LABEL: func @should_fuse_only_two_loops_and_remove_producer() {
func @should_fuse_after_private_memref_creation() {		func @should_fuse_only_two_loops_and_remove_producer() {
%a = alloc() : memref<10xf32>		%a = alloc() : memref<10xf32>
%b = alloc() : memref<10xf32>		%b = alloc() : memref<10xf32>

%cf7 = constant 7.0 : f32		%cf7 = constant 7.0 : f32

affine.for %i0 = 0 to 10 {		affine.for %i0 = 0 to 10 {
affine.store %cf7, %a[%i0] : memref<10xf32>		affine.store %cf7, %a[%i0] : memref<10xf32>
}		}
affine.for %i1 = 0 to 10 {		affine.for %i1 = 0 to 10 {
%v0 = affine.load %a[%i1] : memref<10xf32>		%v0 = affine.load %a[%i1] : memref<10xf32>
affine.store %v0, %b[%i1] : memref<10xf32>		affine.store %v0, %b[%i1] : memref<10xf32>
}		}
affine.for %i2 = 0 to 10 {		affine.for %i2 = 0 to 10 {
%v1 = affine.load %a[%i2] : memref<10xf32>		%v1 = affine.load %a[%i2] : memref<10xf32>
affine.store %v1, %b[%i2] : memref<10xf32>		affine.store %v1, %b[%i2] : memref<10xf32>
}		}

// On the first visit to '%i2', the fusion algorithm can not fuse loop nest		// On the first visit to '%i2', the fusion algorithm can not fuse loop nest
// '%i0' into '%i2' because of the dependences '%i0' and '%i2' each have on		// '%i0' into '%i2' because of the dependences '%i0' and '%i2' each have on
// '%i1'. However, once the loop nest '%i0' is fused into '%i1' with a		// '%i1'. Then, '%i0' is fused into '%i1' and no private memref is created for
// private memref, the dependence between '%i0' and '%i1' on memref '%a' no		// memref '%a' to be able to remove '%i0' and still preserve the dependence on
// longer exists, so '%i0' can now be fused into '%i2'.		// '%a' with '%i2'.
		// TODO: Alternatively, we could fuse '%i0' into '%i1' with a private memref,
		// the dependence between '%i0' and '%i1' on memref '%a' would no longer exist,
		// and '%i0' could be fused into '%i2' as well. Note that this approach would
		// duplicate the computation in loop nest '%i0' to loop nests '%i1' and '%i2',
		// which would limit its profitability.
// CHECK: affine.for %{{.*}} = 0 to 10 {		// CHECK: affine.for %{{.*}} = 0 to 10 {
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[0] : memref<1xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: affine.for %{{.*}} = 0 to 10 {		// CHECK-NEXT: affine.for %{{.*}} = 0 to 10 {
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[0] : memref<1xf32>		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return
return		return
}		}

// -----		// -----

... 667 lines hidden ...	affine.for %i4 = 0 to 1024 {
%7 = mulf %6, %5 : f32		%7 = mulf %6, %5 : f32
%8 = affine.load %arg4[%i3, %i4] : memref<1024x1024xf32>		%8 = affine.load %arg4[%i3, %i4] : memref<1024x1024xf32>
%9 = addf %8, %7 : f32		%9 = addf %8, %7 : f32
affine.store %9, %arg4[%i3, %i4] : memref<1024x1024xf32>		affine.store %9, %arg4[%i3, %i4] : memref<1024x1024xf32>
}		}
}		}
}		}

// CHECK: affine.for %{{.*}} = 0 to 1024 {		// CHECK: affine.for %{{.*}} = 0 to 1024 {
// CHECK-NEXT: affine.for %{{.*}} = 0 to 1024 {		// CHECK-NEXT: affine.for %{{.*}} = 0 to 1024 {
// CHECK-NEXT: affine.for %{{.*}} = 0 to 1024 {		// CHECK-NEXT: affine.for %{{.*}} = 0 to 1024 {
// CHECK-NEXT: affine.load %{{.}}[%{{.}}, %{{.*}}] : memref<1024x1024xf32>		// CHECK-NEXT: affine.load %{{.}}[%{{.}}, %{{.*}}] : memref<1024x1024xf32>
// CHECK-NEXT: affine.load %{{.}}[%{{.}}, %{{.*}}] : memref<1024x1024xf32>		// CHECK-NEXT: affine.load %{{.}}[%{{.}}, %{{.*}}] : memref<1024x1024xf32>
// CHECK-NEXT: mulf %{{.}}, %{{.}} : f32		// CHECK-NEXT: mulf %{{.}}, %{{.}} : f32
// CHECK-NEXT: affine.load %{{.}}[%{{.}}, %{{.*}}] : memref<1024x1024xf32>		// CHECK-NEXT: affine.load %{{.}}[%{{.}}, %{{.*}}] : memref<1024x1024xf32>
// CHECK-NEXT: addf %{{.}}, %{{.}} : f32		// CHECK-NEXT: addf %{{.}}, %{{.}} : f32
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.}}, %{{.}}] : memref<1024x1024xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.}}, %{{.}}] : memref<1024x1024xf32>
... 74 lines hidden ...	affine.for %i0 = 0 to 10 {
affine.store %cf7, %live_in_out_m[%i0] : memref<10xf32>		affine.store %cf7, %live_in_out_m[%i0] : memref<10xf32>
affine.store %cf7, %m[%i0] : memref<10xf32>		affine.store %cf7, %m[%i0] : memref<10xf32>
}		}
affine.for %i1 = 0 to 10 {		affine.for %i1 = 0 to 10 {
%v0 = affine.load %m[%i1] : memref<10xf32>		%v0 = affine.load %m[%i1] : memref<10xf32>
}		}
// CHECK: affine.for %[[i0:.*]] = 0 to 10 {		// CHECK: affine.for %[[i0:.*]] = 0 to 10 {
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%[[i0]]] : memref<10xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%[[i0]]] : memref<10xf32>
// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%[[i0]]] : memref<10xf32>		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[0] : memref<1xf32>
// CHECK-NEXT: affine.load %{{.*}}[%[[i0]]] : memref<10xf32>		// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return
return		return
}		}

// -----		// -----

// Test case from github bug 777.		// Test case from github bug 777.
... 44 lines hidden ...	func @mul_add_0(%arg0: memref<3x4xf32>, %arg1: memref<4x3xf32>, %arg2: memref<3x3xf32>, %arg3: memref<3x3xf32>) {
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: }		// CHECK-NEXT: }
// CHECK-NEXT: return		// CHECK-NEXT: return
return		return
}		}

// -----		// -----

// Verify that 'fuseProducerConsumerNodes' doesn't fuse a producer loop with		// Verify that 'fuseProducerConsumerNodes' fuses a producer loop with a store
// a store that has multiple outgoing edges. Sibling loop fusion should not fuse		// that has multiple outgoing edges.
bondhugula: Not sure why this wasn't marked a TODO.
// any of these loops due to dependencies on external memref '%a'.

// CHECK-LABEL: func @should_not_fuse_multi_outgoing_edge_store_producer1		// CHECK-LABEL: func @should_fuse_multi_outgoing_edge_store_producer
func @should_not_fuse_multi_outgoing_edge_store_producer1(%a : memref<1xf32>) {		func @should_fuse_multi_outgoing_edge_store_producer(%a : memref<1xf32>) {
%cst = constant 0.000000e+00 : f32		%cst = constant 0.000000e+00 : f32
affine.for %arg0 = 0 to 1 {		affine.for %arg0 = 0 to 1 {
affine.store %cst, %a[%arg0] : memref<1xf32>		affine.store %cst, %a[%arg0] : memref<1xf32>
}		}

affine.for %arg0 = 0 to 1 {		affine.for %arg0 = 0 to 1 {
%0 = affine.load %a[%arg0] : memref<1xf32>		%0 = affine.load %a[%arg0] : memref<1xf32>
}		}

affine.for %arg0 = 0 to 1 {		affine.for %arg0 = 0 to 1 {
%0 = affine.load %a[%arg0] : memref<1xf32>		%0 = affine.load %a[%arg0] : memref<1xf32>
}		}
// CHECK: affine.for %{{.*}} = 0 to 1		// CHECK: affine.for %{{.*}} = 0 to 1 {
// CHECK: affine.for %{{.*}} = 0 to 1		// CHECK-NEXT: affine.store
// CHECK: affine.for %{{.*}} = 0 to 1		// CHECK-NEXT: affine.load
		// CHECK-NEXT: affine.load
		// CHECK-NEXT: }

return		return
}		}

// -----		// -----

// Verify that 'fuseProducerConsumerNodes' fuses a producer loop that: 1) has		// Verify that 'fuseProducerConsumerNodes' fuses a producer loop that: 1) has
// multiple outgoing edges, 2) producer store has a single outgoing edge.		// multiple outgoing edges, 2) producer store has a single outgoing edge.
// Sibling loop fusion should not fuse any of these loops due to		// Sibling loop fusion should not fuse any of these loops due to
... 245 lines hidden ...	affine.for %arg3 = 0 to 20 {
affine.for %arg4 = 0 to 512 {		affine.for %arg4 = 0 to 512 {
%ld = affine.load %tmp[%arg4 mod 128] : memref<128xf32>		%ld = affine.load %tmp[%arg4 mod 128] : memref<128xf32>
affine.store %ld, %out[%arg3, %arg4] : memref<20x512xf32>		affine.store %ld, %out[%arg3, %arg4] : memref<20x512xf32>
}		}
}		}

return		return
}		}

// TODO: The size of the private memref is not properly computed in the presence		// TODO: The size of the private memref is not properly computed in the presence
		bondhugula: Looks like this has been fixed?
		dcaballe: Why? The size is still 128 and the ticket seems to be open.
		bondhugula: Looks like I was looking at the (wrong) test case further below.
// of the 'mod' operation. It should be memref<1xf32> instead of		// of the 'mod' operation. It should be memref<1xf32> instead of
// memref<128xf32>: https://bugs.llvm.org/show_bug.cgi?id=46973		// memref<128xf32>: https://bugs.llvm.org/show_bug.cgi?id=46973
// MAXIMAL: alloc() : memref<128xf32>		// MAXIMAL: alloc() : memref<128xf32>
// MAXIMAL: affine.for		// MAXIMAL: affine.for
// MAXIMAL-NEXT: affine.for		// MAXIMAL-NEXT: affine.for
// MAXIMAL-NOT: affine.for		// MAXIMAL-NOT: affine.for
		// MAXIMAL: return

		// -----

		// CHECK-LABEL: func @should_fuse_multi_store_producer_and_privatize_memfefs
		func @should_fuse_multi_store_producer_and_privatize_memfefs() {
		%a = alloc() : memref<10xf32>
		%b = alloc() : memref<10xf32>
		%c = alloc() : memref<10xf32>
		%cst = constant 0.000000e+00 : f32
		affine.for %arg0 = 0 to 10 {
		affine.store %cst, %a[%arg0] : memref<10xf32>
		affine.store %cst, %b[%arg0] : memref<10xf32>
		affine.store %cst, %c[%arg0] : memref<10xf32>
		%0 = affine.load %c[%arg0] : memref<10xf32>
		}

		affine.for %arg0 = 0 to 10 {
		%0 = affine.load %a[%arg0] : memref<10xf32>
		}

		affine.for %arg0 = 0 to 10 {
		%0 = affine.load %b[%arg0] : memref<10xf32>
		}

		// All the memrefs should be privatized except '%c', which is not involved in
		// the producer-consumer fusion.
		// CHECK: affine.for %{{.*}} = 0 to 10 {
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[0] : memref<1xf32>
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[0] : memref<1xf32>
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
		// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>
		// CHECK-NEXT: affine.load %{{.*}}[0] : memref<1xf32>
		// CHECK-NEXT: }

		return
		}

		// -----

		func @should_fuse_multi_store_producer_with_scaping_memrefs_and_remove_src(
		%a : memref<10xf32>, %b : memref<10xf32>) {
		%cst = constant 0.000000e+00 : f32
		affine.for %i0 = 0 to 10 {
		affine.store %cst, %a[%i0] : memref<10xf32>
		affine.store %cst, %b[%i0] : memref<10xf32>
		}

		affine.for %i1 = 0 to 10 {
		%0 = affine.load %a[%i1] : memref<10xf32>
		}

		affine.for %i2 = 0 to 10 {
		%0 = affine.load %b[%i2] : memref<10xf32>
		}

		// Producer loop '%i0' should be removed after fusion since fusion is maximal.
		// No memref should be privatized since they escape the function.
		// CHECK: affine.for %{{.*}} = 0 to 10 {
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
		// CHECK-NEXT: }
		// CHECK-NOT: affine.for

		return
		}

		// -----

		func @should_fuse_multi_store_producer_with_scaping_memrefs_and_preserve_src(
		%a : memref<10xf32>, %b : memref<10xf32>) {
		%cst = constant 0.000000e+00 : f32
		affine.for %i0 = 0 to 10 {
		affine.store %cst, %a[%i0] : memref<10xf32>
		affine.store %cst, %b[%i0] : memref<10xf32>
		}

		affine.for %i1 = 0 to 5 {
		%0 = affine.load %a[%i1] : memref<10xf32>
		}

		affine.for %i2 = 0 to 10 {
		%0 = affine.load %b[%i2] : memref<10xf32>
		}

		// Loops '%i0' and '%i2' should be fused first and '%i0' should be removed
		// since fusion is maximal. Then the fused loop and '%i1' should be fused
		// and the fused loop shouldn't be removed since fusion is not maximal.
		// CHECK: affine.for %{{.*}} = 0 to 10 {
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
		// CHECK-NEXT: }
		// CHECK: affine.for %{{.*}} = 0 to 5 {
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
		// CHECK-NEXT: affine.store %{{.}}, %{{.}}[%{{.*}}] : memref<10xf32>
		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
		// CHECK-NEXT: affine.load %{{.}}[%{{.}}] : memref<10xf32>
		// CHECK-NEXT: }
		// CHECK-NOT: affine.for

		return
		}

Diff 319056