This is an archive of the discontinued LLVM Phabricator instance.

[mlir][Affine] Add support for multi-store producer fusion
ClosedPublic

Authored by dcaballe on Dec 8 2020, 11:58 AM.

Details

Summary

This patch adds support for producer-consumer fusion scenarios with
multiple producer stores to the AffineLoopFusion pass. The patch
introduces some changes to the producer-consumer algorithm, including:

  • For a given consumer loop, producer-consumer fusion iterates over its

producer candidates until a fixed point is reached.

  • Producer candidates are gathered beforehand for each iteration of the

consumer loop and visited in reverse program order (not strictly guaranteed)
to maximize the number of loops fused per iteration.

In general, these changes were needed to simplify the multi-store producer
support and remove some of the workarounds that were introduced in the past
to support more fusion cases under the single-store producer limitation.

This patch also preserves the existing functionality of AffineLoopFusion with
one minor change in behavior. Producer-consumer fusion didn't fuse scenarios
with escaping memrefs and multiple outgoing edges (from a single store).
Multi-store producer scenarios will usually (always?) have multiple outgoing
edges so we couldn't fuse any with escaping memrefs, which would greatly limit
the applicability of this new feature. Therefore, the patch enables fusion for
these scenarios. Please, see modified tests for specific details.

Diff Detail

Event Timeline

dcaballe created this revision.Dec 8 2020, 11:58 AM
dcaballe requested review of this revision.Dec 8 2020, 11:58 AM

Some comments to facilitate the code review.

mlir/include/mlir/Transforms/LoopFusionUtils.h
100

I couldn't generalize the sibling fusion algorithm to be able to remove the 'memref' field and look at all the memrefs in the loops. Computation slices when doing so where significantly different and I couldn't the preserve existing functionality (tests). I introduced these changes to that the memref can only be provided when using a sibling fusion strategy.

mlir/lib/Transforms/LoopFusion.cpp
318

This functionality is replaced with 'canRemoveSrcNodeAfterFusion' below. I'm adding as a non-member in an attempt to minimize code that is specific to a fusion strategy so that we can eventually move a generic version of the 'mdg' to the loop fusion utilities in the future (if that's what we want).

942

This code was heavily relying on having a single store so I couldn't reuse much. 'hasNonAffineUsersOnThePath' check has been moved to the producer-consumer fusion main function. The memory region superset check is now covered by slice::isMaximal check, which also works for an arbitrary number of stores.

1545

This check related to memref privatization is new and may be worth discussing it. The remaining checks are the same as before but extended to support multiple stores.

1587

These workarounds are no longer needed (they couldn't be easily ported to the new scheme) since the algorithm runs until a fixed point is reached.

1619

Same here. The algorithm runs until a fixed point is reached so this is not needed.

bondhugula requested changes to this revision.EditedDec 10 2020, 12:55 AM

Thanks for this significant update! Some superficial comments to start with. It may be good to handle the pass documentation update as well, starting with an update to the summary. It's currently emptly: https://mlir.llvm.org/docs/Passes/#-affine-loop-fusion-fuse-affine-loop-nests
You could mention about the fusion strategies producer/consumer and sibling ones as well. Here's the previous doc paragraph that disappeared (was overwritten) when pass documentation was migrated to auto-generated stuff.

Performs fusion of loop nests using a slicing-based approach. The fused loop
nests, when possible, are rewritten to access significantly smaller local
buffers instead of the original memref's, and the latter are often
either completely optimized away or contracted. This transformation leads to
enhanced locality and lower memory footprint through the elimination or
contraction of temporaries / intermediate memref's. These benefits are sometimes
achieved at the expense of redundant computation through a cost model that
evaluates available choices such as the depth at which a source slice should be
materialized in the designation slice.
mlir/include/mlir/Transforms/LoopFusionUtils.h
53–54

Not related to this, but it's not clear how one gets both ProducerConsumer and Sibling. Does it have to be called twice? In that case, what about convergence?

mlir/lib/Analysis/AffineStructures.cpp
716–717

Interesting that the use case for this requires the same operands for upper bounds, lower bounds, and for IVs at all depths! I haven't looked at the other part of the code yet - so this is a superficial comment.

727

Any assertions on the number of operands?

mlir/lib/Transforms/LoopFusion.cpp
686

Nit: memref.isa<BlockArgument>()

692

auto -> Operation

774

Avoid auto here.

This revision now requires changes to proceed.Dec 10 2020, 12:55 AM
bondhugula added inline comments.Dec 10 2020, 1:03 AM
mlir/include/mlir/Analysis/AffineStructures.h
246

We could actually rename this to addDomainFromSliceMaps - it's good to keep it independent of loop IVs themselves esp. given that you aren't passing AffineForOps and that it's specifically dealing with the case the operands are all the same for the maps (something that won't happen for different affine.for ops in the IR even if they are nested).

mlir/lib/Analysis/AffineStructures.cpp
751

auto -> AffineForOp

mlir/lib/Transforms/LoopFusion.cpp
577

can -> can be

578

Since 'dependences' are between two entities, I think we need to clarify what "has no dependences" means (for eg. outgoing deps, incoming deps, intra-node deps are different things).

dcaballe updated this revision to Diff 311073.Dec 10 2020, 5:19 PM
dcaballe marked 5 inline comments as done.

Addressing initial feedback.
Doc is WIP.

Thanks for working on this Diego!

mlir/lib/Analysis/Utils.cpp
162

It might be food to differentiate between error return cases and cases where there is no error, but isMaximal is false.

174

If possible, it might be good to have a fast-case at the top of this function, that checked for some simple cases (static upper/lower bounds). Then fallback to the integer set check otherwise...

mlir/lib/Transforms/LoopFusion.cpp
634

These kind of simple, focused functions are great. Not in this change, but it feels like longer term, we could do more breaking up of long functions into focused utility functions.

Adding one more comment...

andydavis1 added inline comments.Dec 16 2020, 10:57 AM
mlir/lib/Analysis/Utils.cpp
162

It might be food to differentiate between error return cases and cases where there is no error, but isMaximal is false.

dcaballe updated this revision to Diff 312336.Dec 16 2020, 5:01 PM
dcaballe marked an inline comment as done.

Addressing all the comments. Thanks!

Sorry, I had some unsubmitted comments. Posting them all now.

Adding one more comment...

Hey Andy, I can't see any new comment. Did you forget to save it?

mlir/include/mlir/Analysis/AffineStructures.h
246

It sounds good. Thanks!

mlir/include/mlir/Transforms/LoopFusionUtils.h
53–54

Good point! If we wanted to fuse two loops, regardless of the strategy and out of the context of the AffineFusion pass, we should use the generic strategy. Two comments on this regard:

  1. Note that producer-consumer and sibling strategies are only used to specialize the loop fusion utilities with the assumptions made in the AffineLoopFusion pass so that we can preserve its existing functionality. These assumptions are very specific to the pass and mostly filter the memrefs that are considered in some steps of the analysis. Eventually, when we improve the utilities, we should be able to achieve the same results using the generic strategy and get rid of this class.
  1. Note that the selection of loop candidates for fusion is something external to the utilities, at least for now. It's the client the one that would gather the candidate loops based on some properties (producer-consumer relationship, input reuse, etc.) and then use the utilities to determine if fusion is legal, implement a cost model, etc., for every pair of loops. IOW, the utilities are strategy-independent except for the aforementioned assumptions.

I wonder if it's worth restricting ProducerConsumer and Sibling strategies to only the AffineFusion pass. We could make only the Generic strategy public and create a friend relationship between this class and the AffineFusion pass to only be able to use the ProducerConsumer and Sibling strategies from the pass. Just a random thought. I haven't thought too much about it.

mlir/lib/Analysis/AffineStructures.cpp
716–717

I had exactly the same question and found it super confusing! :) This is inspired by 'addSliceBounds' which has exactly the same requirement. It could be related to the way all the maps are created for the slice (?). When I was debugging this, I could see that all the maps had the same input symbols. Those not relevant for a particular map just were not part of the map output.

727

See below (729, 730)

mlir/lib/Analysis/Utils.cpp
174

Good point! Done and I verified that we have lit tests that cover all the scenarios: fast-check returns true/false, integer set diff returns true/false.

mlir/lib/Transforms/LoopFusion.cpp
634

Thanks! Yeah, we could probably add a similar one for sibling fusion candidates in the future. Any other examples?

774

'auto' should be justified here since the type is very verbose: std::pair<unsigned, Node>

andydavis1 accepted this revision.Jan 12 2021, 2:57 PM

Thanks Diego! The changes look good to me. Please wait for Uday to have another look before submitting. Thanks!

bondhugula added inline comments.Jan 19 2021, 4:07 PM
mlir/test/Transforms/loop-fusion.mlir
407

This is an out of bounds access.

%m[100] -> %m[99]

-test-memref-bound-check can catch these actually.

bondhugula added inline comments.Jan 19 2021, 4:17 PM
mlir/lib/Transforms/LoopFusion.cpp
637

granted -> guaranteed?

640

advance -> advanced

643

Do you want to just use an unordered set and then sort at the end in the right order? Right now, you are using a sorted set and then doing a reverse. You don't seem to need the sorted order in between. srcIdCandidates is a small set, and as you know, std::set will have high overhead. https://llvm.org/docs/ProgrammersManual.html#set

1404

Typo

bondhugula accepted this revision.Jan 19 2021, 4:26 PM

This looks really great! A bunch of minor comments to address.

mlir/lib/Transforms/LoopFusion.cpp
1048

Looks like the method is already handling it, but not returning an accurate estimate?

1406

Likewise.

1616

Nit: ... after it is removed from mdg?

mlir/lib/Transforms/Utils/LoopFusionUtils.cpp
628

ops in`srcOps' -> write ops in srcOps ?
ops in dstOps -> read ops in dstOps?

mlir/test/Transforms/loop-fusion.mlir
2376–2377

Not sure why this wasn't marked a TODO.

2695–2696

Looks like this has been fixed?

This revision is now accepted and ready to land.Jan 19 2021, 4:26 PM
dcaballe updated this revision to Diff 317760.Jan 19 2021, 7:42 PM
dcaballe marked 7 inline comments as done.

Addressing Uday's feedback.

Thanks, Uday. I'll commit this tomorrow if no more comments.

mlir/lib/Transforms/LoopFusion.cpp
643

I returned the candidates in ascending order because I thought it would make more sense for external clients than the reverse order. However, I agree that we can return an unordered map and leave the order to the client. Not sure, though, if we would be gaining much in our use case since we have to copy the elements to another container and sort them...

1048

Some work is needed. We still have code that is looking at 'srcStoreOpInst'.

mlir/test/Transforms/loop-fusion.mlir
407

Good catch :)

2695–2696

Why? The size is still 128 and the ticket seems to be open.

bondhugula added inline comments.Jan 20 2021, 9:39 AM
mlir/lib/Transforms/LoopFusion.cpp
643

I think your comment is orthogonal here. You can return in whatever sorted order but you don't need to use an std::set here. You can use an unordered_set and then just copy + sort at the end and return a vector. Hashtable + some sorting algo is still far less complexity (both time and more importantly memory allocations and access wise) than a typical heavy red-black tree (or similar structure) backed std::set. It's a handful of elements you have - so copy costs isn't much I assume.

mlir/test/Transforms/loop-fusion.mlir
2695–2696

Looks like I was looking at the (wrong) test case further below.

I see an assert failure in loop-fusion.mlir.test in GreedyFusion::fuseProducerConsumerNodes

cast_retty<X, Y *>::ret_type llvm::cast(Y *) [X = mlir::AffineWriteOpInterface, Y = mlir::Operation]: isa<X>(Val) && "cast<Ty>() argument of incompatible type!"

dcaballe added inline comments.Jan 20 2021, 11:08 AM
mlir/lib/Transforms/LoopFusion.cpp
643

unordered_set is not recommended for similar reasons (https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options). I can change it to the "sorted vector" approach (https://llvm.org/docs/ProgrammersManual.html#a-sorted-vector) if that makes more sense to you.

I see an assert failure in loop-fusion.mlir.test in GreedyFusion::fuseProducerConsumerNodes

Looking into it. Any buildbot failing? I didn't get any notification...

dcaballe reopened this revision.Jan 20 2021, 4:45 PM
This revision is now accepted and ready to land.Jan 20 2021, 4:45 PM
dcaballe updated this revision to Diff 318064.Jan 20 2021, 4:50 PM

Addressing Uday's comment and fixing ASAN issue.
We were gathering store ops and reuse them after 'createPrivateMemrefs',
which is replacing some of those stores with new ones. Thanks @jpienaar!