This is an archive of the discontinued LLVM Phabricator instance.

Differential D55851

Implement basic loop fusion pass
ClosedPublic

Authored by kbarton on Dec 18 2018, 12:52 PM.

Details

Reviewers

jdoerfert
hfinkel
Meinersbur
nemanjai
greened
dmgreen

Commits

rG3cdf87940f05: Add basic loop fusion pass.
rL358607: Add basic loop fusion pass.

Summary

Loop fusion is an optimization/transformation that takes two (or more) loops and
fuses (or joins) them together to create a single loop.

This pass implements Basic Loop Fusion. It will fuse loops that conform to the
following 4 conditions:

1. Adjacent (no code in between the two loops)
2. Control flow equivalent (if one loop executes, the other loop executes)
3. Identical bounds (both loops execute the same number of iterations)
4. No negative distance dependencies between the loop bodies.

The intention is to initially focus on the safety checks and mechanics of loop
fusion. It makes no changes to the IR to create opportunities for fusion.
Instead, it checks the necessary conditions and if all conditions are met, it
fuses two loops together.
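
As a minimal illustration of the transformation (a standalone C++ sketch, not code from the patch; the function names `runSeparately` and `runFused` are hypothetical), the two loops below are adjacent, control flow equivalent, and have identical bounds. The only dependence between them has distance 0: iteration i of the second loop reads the element that iteration i of the first loop wrote.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two adjacent loops with identical bounds. The second loop reads a[i],
// which the first loop wrote in the same iteration i (dependence distance 0).
std::vector<int> runSeparately(const std::vector<int> &in) {
  std::vector<int> a(in.size()), b(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    a[i] = in[i] * 2;
  for (std::size_t i = 0; i < in.size(); ++i)
    b[i] = a[i] + 1;
  return b;
}

// The fused form: one loop containing both bodies, in the original order.
std::vector<int> runFused(const std::vector<int> &in) {
  std::vector<int> a(in.size()), b(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    a[i] = in[i] * 2;
    b[i] = a[i] + 1;
  }
  return b;
}
```

If the second loop read `a[i + 1]` instead, the fused loop would read an element before it had been written. That is a negative distance dependence, and fusing would change the result.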

Further information on loop fusion, and techniques to improve the opportunities
for loop fusion can be found at
https://webdocs.cs.ualberta.ca/~amaral/thesis/ChristopherBartonMSc.pdf

This patch intentionally does not add the loop fusion pass into the pass
pipeline. The location of loop fusion in the pass pipeline needs to be discussed
and I would prefer to have that discussion outside of this review.

This patch was done in collaboration with Johannes Doerfert.

Event Timeline

kbarton created this revision. Dec 18 2018, 12:52 PM

Herald added subscribers: hiraditya, mgorny. Dec 18 2018, 12:52 PM

syzaara added a subscriber: syzaara. Dec 18 2018, 1:09 PM

jsji added a subscriber: jsji. Dec 18 2018, 3:40 PM

jedilyn added a subscriber: jedilyn. Dec 18 2018, 5:22 PM

Thanks for bringing us loop fusion! Here are some review comments, but I did not check the functionality yet.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
100	Is this the right description for "-loop-fusion-dependence-analysis=scev"? It is the same as for the `da` option.
107–110	[serious] LLVM has an infrastructure for enabling debugging output: `-debug-only`. We should not introduce additional ones. At least this option should only be available in assert-builds.
124–141	[suggestion] Add doxygen comments for these fields?
143–144	[suggestion] IMHO it is good practice to not have much logic in object constructors. Reasons: (1) When exceptions are thrown, the object is partially initialized. (2) With class polymorphism, calling virtual functions will call the non-overridden method. (3) It is not possible to return anything, in this case, an error code. Points 1 and 2 do not apply to this code. Point 3 is handled using the `Valid` flag. Nonetheless, I think having lots of logic in a constructor is a bad pattern. Instead, I suggest having something that returns nullptr or `llvm::ErrorOr`.
155	[remark] In contrast to the other invalidate reason, this one does not have a statistic counter. Intentional?
205	[style] Use `LLVM_DUMP_METHOD` pattern for `dump()` methods (See `compiler.h`)
272	[remark] LoopSet is a vector?
277	[remark] Is there a reason to use `std::set`? Usually, the LLVM hash-based sets (`DenseSet`, `SmallPtrSet`, etc) are more performant.
283–285	[style] braces can be removed
306–308	[style] I'd prefer method description as a doxygen comment at the methods.
322	[style] Use doxygen comment?
346–348	[style] Descriptions of these would be useful.
351–356	This is only used inside `LLVM_DEBUG`, so the compiler will emit an unused-function warning when compiling in release mode. Guard with `#ifndef NDEBUG`?
427–428	[suggestion] Guard with `#ifndef NDEBUG` to avoid being called in non-assert builds?
441–442	[style] Start functions with a verb: `isControlFlowEquivalent`
456–459	[style] `return PDT.dominates(FC0.Preheader, FC1.Preheader)`
495–496	[style] This looks like a leftover review comment.
518	[style] Use `LLVM_PRETTY_FUNCTION` instead of `__FUNCTION__`. Maybe remove entirely? I have not seen function tracing anywhere in LLVM code.
592–595	[suggestion] If predication is not supported, why not use `ScalarEvolution::getBackedgeTakenCount()`?
712	[style] Use doxygen comment here?
729	[style] No reason to use `auto` here
746–747	[suggestion] Descriptions of these would be useful
750–751	[style] Use `\p` to refer to parameters (https://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments)
810–812	[style] The braces seem unnecessary
894	[style] Start method names with a verb: `isAdjacent`?
978–981	[style] Use `emplace_back`?
1060–1064	[suggestion] This looks expensive to do. However, the pass manager will do these verifications anyway between passes (if enabled), so it shouldn't be necessary to do here.
1074–1084	[style] I think it is more common in the LLVM code base to have private fields declared at the beginning of the class.
1095	[serious] I think LoopFuser is preserving some passes such as ScalarEvolution?
1134	[serious] If changes are made, the pass should not indicate that it preserves all analyses (including the ones it doesn't know about).
llvm/test/Transforms/LoopFusion/cannot_fuse.ll
2	[serious] If you are testing for `llvm::dbgs()` output, you need to add a `REQUIRES: asserts` line. Another problem is that `2>&1` mixes stdout and stderr output. stdout is buffered, stderr is not. How these are merged by `2>&1` is undefined. You can switch off stdout by using `opt -disable-output`. If possible, using `opt -analyze` is a better way to verify pass-specific output because adding another `LLVM_DEBUG` during debugging sessions can lead to test case failures. However, checking `-debug` output is common in the llvm tests.
75–77	[remark] I think these can be removed.
270–273	[remark] Are these necessary for the test?
llvm/test/Transforms/LoopFusion/loop_nest.ll
127–141	[remark] Can these be removed?
llvm/test/Transforms/LoopFusion/simple.ll
2	[suggestion] Remove `2>&1` (there is no stdout output here)

Responded to two of the comments. Most of the others will be addressed in the next revision, which I should hopefully be uploading soon.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
143–144	I understand your points here. I did not realize issue #2 (although I agree it doesn't apply here). However, if I understand your suggestion, I would need to add an initialize method (or something similar) to do the logic that is currently in the constructor and return a success/fail code. The problem is that before anything can be done with the object, you need to ensure that it has been initialized. This can be accomplished by adding a boolean flag (isInitialized) to the class and ensuring that all accessors check that before doing anything with it. However, I think that is more complicated than leaving the initialization logic in the constructor. That said, I'm not particularly tied to either approach. I can change the implementation if you feel strongly. Or, if there is an alternative that I'm not seeing, I'm happy to do that instead.
277	This is a good question, and I should have added comments here to articulate why I eventually settled on std::set. Ultimately, I was looking for a data structure that would keep the fusion candidates sorted automatically, and not invalidate the iterators as the set changes. In this implementation, either DenseSet or SmallPtrSet would be fine, but we have additional patches to extend fusion to move intervening code around and split up loops to create new opportunities. When this is done, the contents of the FusionCandidateSet will change, and having stable iterators simplifies the logic quite a bit (in my opinion). Similarly, always ensuring that the set is ordered properly is essential for this implementation, and that seems to be done reasonably well with std::set. I will add some comments in my next revision to indicate my line of reasoning here.
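
The two properties cited for `std::set` here (automatic ordering, and iterators that stay valid while other elements are inserted or erased) can be seen in a small standalone sketch. The `Candidate` struct and its ordering by a single integer are hypothetical stand-ins, not the patch's `FusionCandidate` or its comparator.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-in for a fusion candidate, ordered by program position.
struct Candidate {
  int Position;
  bool operator<(const Candidate &RHS) const { return Position < RHS.Position; }
};

// Modifies the set while holding an iterator into it. std::set invalidates
// only iterators that refer to erased elements.
std::vector<int> mutateWhileHoldingIterator() {
  std::set<Candidate> Set{{10}, {30}, {40}};
  auto It = Set.find(Candidate{30}); // iterator kept across the changes
  Set.insert(Candidate{20});         // does not invalidate It
  Set.erase(Candidate{10});          // erases a different element
  assert(It->Position == 30);        // It still refers to the same element

  std::vector<int> Order;
  for (const Candidate &C : Set)
    Order.push_back(C.Position);     // elements are visited in sorted order
  return Order;
}
```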

kbarton edited the summary of this revision. Dec 20 2018, 9:28 AM

Addressed all but two of the review comments.
The only outstanding comments are the use of std::set (which I tried to address with comments) and the logic inside the FusionCandidate constructor (for which I'm waiting on feedback in the review).

kbarton marked 31 inline comments as done and 2 inline comments as done. Dec 20 2018, 12:51 PM

kbarton added inline comments.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
107–110	I use this to have two different "levels" of debugging. The debugging available with -debug-only is not as verbose. If you feel strongly about it, I can remove this and make all debug output equivalent. In the meantime, I've guarded this use with NDEBUG.
978–981	There is no emplace_back for TreeUpdates (unless I misunderstood your suggestion).
1060–1064	I've kept this for now, as this isn't in the pass manager yet. However, I've guarded it under NDEBUG. Are you OK with that?
1134	I agree. The pass should not make changes and return true. I don't believe this is happening now.
llvm/test/Transforms/LoopFusion/cannot_fuse.ll
2	I don't see an (easy) way to use analyze to get this information. For now, I've added the REQUIRES: asserts. I'll think if there is a way to use analyze in the future to accomplish this.

dmgreen added a subscriber: dmgreen. Dec 22 2018, 2:27 PM

Great stuff. This looks like it will be very useful. I have a few thoughts/questions/nits.

llvm/include/llvm/Transforms/Scalar/LoopFuse.h
27	Two spaces.
llvm/lib/Transforms/Scalar/LoopFuse.cpp
613	second loop
720	This debug message is (essentially) repeated?
748	Commented code
859	I would imagine, although I'm not sure, that there would at least be a lot of bugs here. We are dealing with different loops, but we can say that they are very similar. What does it currently give? Anything useful? The SCEV checks you are doing above is obviously quite similar to what DA would be trying to do, but with the added loop rewrite. It would be a shame to duplicate all the effort but may, I guess, be necessary if DA doesn't do well enough.
967	thest -> test?
985	Theg
1004	It can sometimes be better to insert edges into the DT before deleting old ones. It keeps the tree more intact (especially PDTs), with fewer subtrees becoming unreachable and fewer recalculations needed. That can make the updates simpler and quicker.
1016	Is there anywhere that we check that these won't have phi operands from inside the first loop? i.e. that the phis can be moved into the first loop / before all of its instructions.
1088	You may want to add DominatorTree::VerificationLevel::Fast otherwise this might be quite slow.
1119	Looks like the code is preserving LI too (I'm not sure if you can/should preserve SE without preserving LI.)

I've addressed most of these comments (except the ones I have some questions about). I will be refreshing the patch momentarily.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
720	Good catch. Fixed.
859	At this point DA doesn't give anything useful - at least not for the test cases that I have tried. I have not had a chance to investigate why, or if there is a better/different way to do things where it can be useful (which is why I marked the TODO here). The one thing that DA is able to do, that SCEV currently cannot, is understand the restrict keyword and accurately identify no dependencies between the two loops in this case. Perhaps it would be better to try and teach SCEV about restrict, and then only rely on SCEV in the long run?
1004	OK, that's good to know, thanks. Is this specifically for inserting/deleting edges between similar blocks, or is this for all inserts/deletes in the entire tree? In other words, should I rearrange this code, and the code below (lines 1048-1051) that updates the latch blocks to do all the insertions before any deletions?
1016	This is a really good catch! We completely missed this. I've added a check in dependencesAllowFusion that walks through the PHIs in the header of FC1 and checks that, if the incoming block is in FC0, it is the header block. If it is not, then we do not fuse.
1088	Thanks for pointing this out. I've changed to Fast. Most (probably all) of the problems this has found were found by comparing to a newly constructed tree, so I think Fast should be sufficient.
1119	Yup, good call. I've added LoopInfoWrapperPass to the preserved list.

Addressed comments from dmgreen.

Initial comments from my side.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
8	We need to update the licence ;)
112	You should keep it this way. Other passes do it similarly.
143–144	It is not uncommon in LLVM to do it this way. I personally do not try to force a solution similar to `bool createXXX(...)` but I also do not necessarily object if there is an actual reason.
592–595	Because, as an example and not the only reason, we could easily generate statistics on the impact predication support might have.
849	This should probably be a conditional and a `return false` ;)
1016	I don't get this. Could one of you produce a problematic example for me? (also good as a test case)
1163	As above, preserve LoopAnalysis.

dmgreen added inline comments. Jan 23 2019, 8:41 AM

llvm/lib/Transforms/Scalar/LoopFuse.cpp
530	Is there reason in the current version to prevent fusion if "FC0" has instructions in its preheaders?
859	I remember DA having problems on 64-bit systems due to all the sexts that kept happening. Delinearising constants too, maybe? There are certainly problems in it that would be worth trying to fix (or find a replacement to do it better). The delinearisation that DA will attempt may also be helpful. It is nowadays all built on SCEV too, so should in theory (barring the facts that we know here about dealing with different but similar loops) be a more powerful version of the SCEV code here. That power might be confusing it at the moment though. I would expect that better DA is something llvm will need sooner or later as more loop optimisations are implemented. Perhaps this is something that we can improve in the future, relying on the SCEV code at the moment, but getting more help from DA as it is improved. The no-aliasing checks it can do are useful at least, it would seem.
918	I'm not sure if this is quite strong enough. Consider something like this, where the sum would be used in the second block, but not as phi: for(i = 0; i < n; i++) sum += a[i]; for(i = 0; i < n; i++) b[i] = a[i]/sum; These can be fused, but use the wrong "sum" on each iteration of the loop. Unroll and jam side-steps all this by requiring lcssa nodes (and, you know, not requiring generalised loop fusion :) )
1004	I think it's probably fine, just something I've run into in the past with PDT's. If it's something we run into, we can fix it then. And if it does become an issue, it may be better to teach the DTU to do this itself.
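
The `sum` example from the comment on line 918 can be reproduced outside LLVM. The following is a standalone C++ sketch (the function names are hypothetical, and this is not code from the patch) showing that fusing the two loops makes each iteration divide by a partial sum instead of the final one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original program: the second loop divides by the final value of sum.
std::vector<float> normalizeUnfused(const std::vector<float> &a) {
  std::vector<float> b(a.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i)
    sum += a[i];
  for (std::size_t i = 0; i < a.size(); ++i)
    b[i] = a[i] / sum;
  return b;
}

// Illegal fusion: each iteration divides by the sum accumulated so far.
std::vector<float> normalizeWronglyFused(const std::vector<float> &a) {
  std::vector<float> b(a.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    sum += a[i];
    b[i] = a[i] / sum;
  }
  return b;
}
```

For the input `{1, 1, 2}` the unfused version yields `{0.25, 0.25, 0.5}` while the fused version yields `{1, 0.5, 0.5}`, so a safety check has to reject this pair of loops.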

JonChesterfield added a subscriber: JonChesterfield. Feb 11 2019, 1:14 PM

Herald added a project: Restricted Project. Feb 11 2019, 1:14 PM

I've addressed almost all of the comments.
I will work on an update to the problem that dmgreen provided an example for and post a new version of the patch when that is done.
In the meantime, if my understanding of the comments is incorrect, someone please correct me :)

llvm/lib/Transforms/Scalar/LoopFuse.cpp
143–144	OK, so I'm still not sure what the preference is here: (1) Keep it as is. (2) Change to create an empty object and then require an Initialize method call on it. (3) Create a wrapper, createFusionCandidate(), that does the construction and initialization. (4) Some other alternative that I don't see. If I'm going to change, I prefer the createFusionCandidate route (which I hadn't thought of earlier), which I think is the cleanest, although I don't know how to enforce its usage.
441–442	I addressed this a while ago, I just forgot to check it off as done.
530	Strictly speaking, no, there is no reason to prevent fusion if FC0 has instructions in its preheader. This check was meant to prevent fusion if FC1 has instructions in the preheader, since we don't have any analysis to move them around yet. I've moved this check later, to where we are trying to fuse pairs of candidates. If FC1 has a non-empty preheader then fusion is skipped.
849	I've changed from the assert to a check and print message. After that, all the conditions just result in a return false, which cleans this up a lot.
859	I completely agree about improving DA and the need for it with other loop optimizations. I still haven't had a chance to look at it in detail, but would like to start looking at it once this patch is finalized and lands. For now are you OK with the current usage of DA here? I'm hoping that as we extend/improve it, we can modify this code to take advantage of it.
918	Yes, you're right. If I generate the ll for this example and massage it to make it eligible for fusion, we will fuse these. However, I'm confused by your statement: "These can be fused, but use the wrong "sum" on each iteration of the loop." I don't see how fusion is legal here. Are you saying that the current check will (incorrectly) still allow fusion? Or that there is a way to fuse these loops? For this example, lcssa will create a non-empty preheader for the second loop and the earlier checks preventing fusion will bail before we get to this test. If I manually sink the PHI from lcssa into the header of the second loop, I can create an empty preheader for the second loop to allow fusion to continue. At any rate, I think this check needs to be strengthened to prevent fusion from happening in this case. Do you agree?
1004	OK, sounds good. I will mark this comment as done then.
1016	dmgreen added an example above that illustrates the issue (after some manual modifications to the IR to get around other limitations in fusion).

dmgreen added inline comments. Feb 13 2019, 9:41 AM

llvm/lib/Transforms/Scalar/LoopFuse.cpp
859	Sounds good to me. The !DepResult check can help in some cases at least, and the rest we can add as DA is improved.
918	Yeah, sorry. I meant _will_ fuse, but shouldn't. "Can" was a bit misleading.

Addressed some more comments. I'll upload a new patch momentarily.

Addressed most of the review comments.

I think there are still two outstanding issues to resolve before committing this patch:

The current constructors for FusionCandidates - whether to leave them as is or whether to reimplement using a createFusionCandidate wrapper that removes most of the logic from the constructor.
New safety check that I added, to detect cross-loop dependencies that were previously missed. The new check is around line 918 of LoopFuse.cpp (inside dependencesAllowFusion method).

dmgreen added inline comments. Mar 3 2019, 6:12 AM

llvm/lib/Transforms/Scalar/LoopFuse.cpp
929	I think that you can just do something like if (Instruction *Def = dyn_cast<Instruction>(Op)) if (FC0.L->contains(Def->getParent())) return false; Providing there are no lcssa phis, this should rule out anything that depends on the first loop.
llvm/test/Transforms/LoopFusion/cannot_fuse.ll
283	You can run something like -loop-rotate to make this into a more simply structured loop. -instcombine will remove any lcssa nodes, and -simplify-cfg can clean things up too. I guess that will depend on what types of loops you want to test, ones that have been cleaned up or ones that are a little less structured. It's good to test both, but which is more important will depend on where in the pass pipeline this ends up.

kbarton marked 2 inline comments as done. Mar 22 2019, 2:07 PM

kbarton added inline comments.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
929	I just finished up testing this change, and then realized I think this is overly restrictive. I had originally restricted it to PHIs because I only wanted to include things that are (re)defined in the first loop. Basically, the first loop needs to complete in order to get the correct definition going into the second loop. This alternate sequence will also include stores, not just PHIs. The store can be loop invariant, and if it is not hoisted out of the loop prior, then it would also prevent fusion.
llvm/test/Transforms/LoopFusion/cannot_fuse.ll
283	I completely agree. In fact we have a subsequent patch to post once this one lands that restricts loop fusion to only run on rotated loops. This seems to be a common theme among many loop optimizations, and restricting to rotated loops simplifies the implementation in several places. I'm planning on discussing this at my presentation at EuroLLVM in a couple weeks :)

dmgreen added inline comments. Mar 24 2019, 7:39 AM

llvm/lib/Transforms/Scalar/LoopFuse.cpp

129

Perhaps name it ExitingBlock, if there is only one.

929

This is the kind of thing I was thinking of:

define float @test(float* nocapture %a, i32 %n) {
entry:
  %conv = zext i32 %n to i64
  %cmp32 = icmp eq i32 %n, 0
  br i1 %cmp32, label %for.cond.cleanup7, label %for.body

for.body:                                         ; preds = %for.body, %entry
  %i.034 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
  %sum1.033 = phi float [ %add, %for.body ], [ 0.000000e+00, %entry ]
  %idxprom = trunc i64 %i.034 to i32
  %arrayidx = getelementptr inbounds float, float* %a, i32 %idxprom
  %0 = load float, float* %arrayidx, align 4
  %add = fadd float %sum1.033, %0
  %inc = add nuw nsw i64 %i.034, 1
  %cmp = icmp ult i64 %inc, %conv
  br i1 %cmp, label %for.body, label %for.body8

for.body8:                                        ; preds = %for.body, %for.body8
  %i2.031 = phi i64 [ %inc14, %for.body8 ], [ 0, %for.body ]
  %idxprom9 = trunc i64 %i2.031 to i32
  %arrayidx10 = getelementptr inbounds float, float* %a, i32 %idxprom9
  %1 = load float, float* %arrayidx10, align 4
  %div = fdiv float %1, %add
  store float %div, float* %arrayidx10, align 4
  %inc14 = add nuw nsw i64 %i2.031, 1
  %cmp5 = icmp ult i64 %inc14, %conv
  br i1 %cmp5, label %for.body8, label %for.cond.cleanup7

for.cond.cleanup7:                                ; preds = %for.body8, %entry
  %sum1.0.lcssa36 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body8 ]
  ret float %sum1.0.lcssa36
}

The important part is %add, which is defined in the first loop and used in the second. I believe that using normal SSA def-use chains, anything that is def'd in the first loop (and used in the second) will need to use the value from the last iteration of the loop, so fusion will be illegal.

There may be cases where a def has the same value on every iteration of the loop, but I imagine those would already have been hoisted/sunk, or be quite rare.

1083

Whilst checking the example, I noticed the dom tree updater has become stricter about "imbalanced" tree operations. I think that if FC0.ExitingBlocks == FC0.Latch, we add a link to FC1.Header twice, which is maybe what it's complaining about (?)

Oh, forgot to mention, I am looking forward to getting this in tree, and think it's very close. If we can sort out the last couple of things here, I think it would be good to get it committed so we can have a play around and sort out some of the questions like where in the pass pipeline it fits.

kbarton marked 10 inline comments as done. Mar 25 2019, 8:18 PM

kbarton added inline comments.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
129	Good idea. Will fix in patch I'm going to post soon.
859	I'm going to mark this as done, since I think we are in agreement we want to work on improving the dependence analysis in the future. If this is incorrect, please let me know.
929	Yes, I see now. It's possible to have a def in addition to a PHI. If I wanted to catch this, I would also need to traverse all the inputs to any defs and make sure they don't come from PHIs within loop 0 as well. I'm OK with restricting this to defs, with the assumption that if there is any loop invariant defs they would have been hoisted out of the loop at this point.
1083	Yup, good catch. I found the same problem when I tried your test case. I've fixed the problem and added your test above to cannot_fuse.ll.

Address review comments as follows:

Strengthen dependence checks between any value used in Loop 1 to ensure it is not defined in loop 0.
Do not add a tree update to insert an edge between FC0.Latch and FC1.Header when FC0.ExitingBlock is the same as FC0.Latch. This edge has already been inserted into the tree updates earlier and will cause an assertion to fail while performing the updates.
Rename ExitingBlocks to ExitingBlock since we are only concerned with loops with a single exiting block.
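
The strengthened check in the first item (no value used in loop 1 may be defined in loop 0) can be sketched on a toy representation. This is a standalone C++ illustration under stated assumptions: `ToyInst` and `defsAllowFusion` are hypothetical names, and the string-based model is not LLVM's `Instruction` API or the code in `dependencesAllowFusion`.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model: an instruction is a name plus the names of the
// values it uses. This is not LLVM's Instruction class.
struct ToyInst {
  std::string Name;
  std::vector<std::string> Operands;
};

// Returns false if any instruction in Loop1 uses a value defined in Loop0.
bool defsAllowFusion(const std::vector<ToyInst> &Loop0,
                     const std::vector<ToyInst> &Loop1) {
  std::set<std::string> DefinedInLoop0;
  for (const ToyInst &I : Loop0)
    DefinedInLoop0.insert(I.Name);
  for (const ToyInst &I : Loop1)
    for (const std::string &Op : I.Operands)
      if (DefinedInLoop0.count(Op))
        return false; // Loop1 needs a value computed by Loop0
  return true;
}
```

Modelled this way, the `%div = fdiv float %1, %add` instruction from the example above is rejected because `%add` is defined in the first loop. As noted in the discussion, the check is conservative: it also rejects a use of a loop-invariant definition that was not hoisted out of loop 0.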

compnerd added a subscriber: compnerd. Mar 25 2019, 9:44 PM

Thanks. I had another look through and think this is in a really good state. I'm very happy for it to go in.

I'm not sure if there were any other remaining comments from @Meinersbur or @jdoerfert?

This revision is now accepted and ready to land. Mar 26 2019, 5:51 AM

Thanks @dmgreen!
The only outstanding issue was from @Meinersbur regarding the constructor for FusionCandidate. The concern was keeping the logic in the constructor, vs moving it out into a createFusionCandidate method to wrap the constructor and add all the logic to that method. I don't have a preference one way or the other, but will happily make the change if people feel strongly about it.

nickdesaulniers added a subscriber: nickdesaulniers. Apr 8 2019, 10:01 PM

Generally, LGTM; with some smaller suggestions.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
143–144	I was thinking the following pattern:

FusionCandidate *makeFusionCandidate(...) {
  BasicBlock *Preheader;
  BasicBlock *Header;
  BasicBlock *ExitingBlocks;
  BasicBlock *ExitBlock;
  BasicBlock *Latch;
  // The above are redundant since llvm::Loop provides accessors.
  // I suggest not caching them in other objects (they may require being
  // updated) unless it measurably improves performance.
  Loop *L;
  SmallVector<Instruction *, 16> MemReads;
  SmallVector<Instruction *, 16> MemWrites;

  if (!condition) {
    LLVM_DEBUG(dbgs() << "The reason\n");
    InvalidationReason++;
    return nullptr;
  }

  return new FusionCandidate(L, MemReads, MemWrites);
}

143–144	It is just a suggestion and does not correspond to an LLVM coding policy. Keep as-is, it is not a requirement for me to accept the patch.
205	[style] `dump()` methods are usually guarded by #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) as mentioned in the comment above `LLVM_DUMP_METHOD`
277	[suggestion] IMHO it is a good style to not depend on a particular implementation of ADTs. Code that advances iterators manually depending on inserted/removed elements can be confusing and may break easily in future changes. However, as with the previous [suggestion], I am ok with keeping it as-is.
293	[suggestion] This could be a range-based for-loop.
324	[style] Did you consider `LI.empty()`?
936	[suggestion] Consider using a method name that makes clear that it has no side-effects. `isEmptyPreheader`? `emptyPreheader` sounds like it would remove instructions from the preheader until it is empty.
978–981	Are you sure? There is `SmallVectorImpl<T>::emplace_back`.
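
The factory pattern discussed for lines 143–144 can be sketched in standalone C++. This is one possible shape under stated assumptions, not the patch's code: `ToyLoop` and `ToyCandidate` are hypothetical stand-ins, and `std::optional` is used here in place of the raw pointer or `llvm::ErrorOr` mentioned in the review. A private constructor is one way to enforce that callers go through the factory.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-in for the facts a candidate depends on.
struct ToyLoop {
  bool HasPreheader;
  bool HasSingleExitingBlock;
  std::vector<int> MemAccesses;
};

class ToyCandidate {
public:
  // All validation happens here, before any object exists, so there is no
  // partially initialized state and no Valid flag to check afterwards.
  static std::optional<ToyCandidate> create(const ToyLoop &L) {
    if (!L.HasPreheader || !L.HasSingleExitingBlock)
      return std::nullopt;
    return ToyCandidate(L);
  }

  std::size_t numAccesses() const { return Accesses.size(); }

private:
  // Private constructor: callers are forced to go through create().
  explicit ToyCandidate(const ToyLoop &L) : Accesses(L.MemAccesses) {}
  std::vector<int> Accesses;
};
```

The trade-off raised in the thread still applies: the factory removes the need for a validity flag, but it adds a second entry point that every caller has to know about.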

All comments have been addressed. I will be committing this soon.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
143–144	I will mark this conversation as done as I think we've agreed the existing implementation is OK. We can refactor/redo in the future if this becomes too cumbersome.
277	Thanks. I will mark this as done for now. This is also something that may need to get revisited as the code evolves.
324	I don't think so. I've changed it, as I think it makes it more readable.
978–981	You're right. The syntax I was using was wrong. I've fixed it and changed to using emplace_back.

Closed by commit rL358607: Add basic loop fusion pass. (authored by kbarton). Apr 17 2019, 11:51 AM

This revision was automatically updated to reflect the committed changes.

kbarton marked 4 inline comments as done.

rtereshin added a subscriber: rtereshin. Apr 18 2019, 4:27 PM

rtereshin added inline comments.

llvm/trunk/lib/Transforms/Scalar/LoopFuse.cpp
324 ↗	(On Diff #195612)	It appears that if both `NDEBUG` and `LLVM_ENABLE_DUMP` are defined, the function will be compiled in, but all `LLVM_DEBUG` macros will be eliminated. The resulting code,

static void printFusionCandidates(const FusionCandidateCollection &FusionCandidates) {
  for (const auto &CandidateSet : FusionCandidates) {
  }
}

produces an "unused variable" warning for `CandidateSet`. We probably don't need to wrap anything in `LLVM_DEBUG` here.

Meinersbur added inline comments. Apr 19 2019, 9:04 AM

llvm/trunk/lib/Transforms/Scalar/LoopFuse.cpp
324 ↗	(On Diff #195612)	"We probably don't need to wrap anything in LLVM_DEBUG here." I agree, but for a different reason: all calls to printFusionCandidates are already wrapped by LLVM_DEBUG. The macro checks whether anything should be printed by checking the `-debug`/`-debug-only` cl::opt.

jeroen.dobbelaere added a subscriber: jeroen.dobbelaere. Jul 12 2021, 3:38 AM

Revision Contents

Path	Size
llvm/include/llvm/InitializePasses.h	1 line
llvm/include/llvm/Transforms/Scalar.h	6 lines
llvm/include/llvm/Transforms/Scalar/LoopFuse.h	30 lines
llvm/lib/Passes/PassBuilder.cpp	1 line
llvm/lib/Passes/PassRegistry.def	1 line
llvm/lib/Transforms/Scalar/CMakeLists.txt	1 line
llvm/lib/Transforms/Scalar/LoopFuse.cpp	1203 lines
llvm/lib/Transforms/Scalar/Scalar.cpp	1 line
llvm/test/Transforms/LoopFusion/ (five files)	318, 136, 86, 120, and 317 lines

Diff 188928

llvm/include/llvm/InitializePasses.h

 	void initializeLocalStackSlotPassPass(PassRegistry&);
 	void initializeLocalizerPass(PassRegistry&);
 	void initializeLoopAccessLegacyAnalysisPass(PassRegistry&);
 	void initializeLoopDataPrefetchLegacyPassPass(PassRegistry&);
 	void initializeLoopDeletionLegacyPassPass(PassRegistry&);
 	void initializeLoopDistributeLegacyPass(PassRegistry&);
 	void initializeLoopExtractorPass(PassRegistry&);
 	void initializeLoopGuardWideningLegacyPassPass(PassRegistry&);
+	void initializeLoopFuseLegacyPass(PassRegistry&);
 	void initializeLoopIdiomRecognizeLegacyPassPass(PassRegistry&);
 	void initializeLoopInfoWrapperPassPass(PassRegistry&);
 	void initializeLoopInstSimplifyLegacyPassPass(PassRegistry&);
 	void initializeLoopInterchangePass(PassRegistry&);
 	void initializeLoopLoadEliminationPass(PassRegistry&);
 	void initializeLoopPassPass(PassRegistry&);
 	void initializeLoopPredicationLegacyPassPass(PassRegistry&);
 	void initializeLoopRerollPass(PassRegistry&);

llvm/include/llvm/Transforms/Scalar.h

	Show First 20 Lines • Show All 452 Lines • ▼ Show 20 Lines
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// LoopDistribute - Distribute loops.			// LoopDistribute - Distribute loops.
	//			//
	FunctionPass *createLoopDistributePass();			FunctionPass *createLoopDistributePass();

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
				// LoopFuse - Fuse loops.
				//
				FunctionPass *createLoopFusePass();

				//===----------------------------------------------------------------------===//
				//
	// LoopLoadElimination - Perform loop-aware load elimination.			// LoopLoadElimination - Perform loop-aware load elimination.
	//			//
	FunctionPass *createLoopLoadEliminationPass();			FunctionPass *createLoopLoadEliminationPass();

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// LoopVersioning - Perform loop multi-versioning.			// LoopVersioning - Perform loop multi-versioning.
	//			//

llvm/include/llvm/Transforms/Scalar/LoopFuse.h

This file was added.

				//===- LoopFuse.h - Loop Fusion Pass ------------------------*- C++ -*-===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				///
				/// \file
				/// This file implements the Loop Fusion pass.
				///
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_TRANSFORMS_SCALAR_LOOPFUSE_H
				#define LLVM_TRANSFORMS_SCALAR_LOOPFUSE_H

				#include "llvm/IR/PassManager.h"

				namespace llvm {

				class Function;

				class LoopFusePass : public PassInfoMixin<LoopFusePass> {
				public:
				PreservedAnalyses run(Function &F, FunctionAnalysisManager &AM);
				};

				dmgreen: Two spaces.
				} // end namespace llvm

				#endif // LLVM_TRANSFORMS_SCALAR_LOOPFUSE_H

llvm/lib/Passes/PassBuilder.cpp

	#include "llvm/Transforms/Scalar/InductiveRangeCheckElimination.h"			#include "llvm/Transforms/Scalar/InductiveRangeCheckElimination.h"
	#include "llvm/Transforms/Scalar/InstSimplifyPass.h"			#include "llvm/Transforms/Scalar/InstSimplifyPass.h"
	#include "llvm/Transforms/Scalar/JumpThreading.h"			#include "llvm/Transforms/Scalar/JumpThreading.h"
	#include "llvm/Transforms/Scalar/LICM.h"			#include "llvm/Transforms/Scalar/LICM.h"
	#include "llvm/Transforms/Scalar/LoopAccessAnalysisPrinter.h"			#include "llvm/Transforms/Scalar/LoopAccessAnalysisPrinter.h"
	#include "llvm/Transforms/Scalar/LoopDataPrefetch.h"			#include "llvm/Transforms/Scalar/LoopDataPrefetch.h"
	#include "llvm/Transforms/Scalar/LoopDeletion.h"			#include "llvm/Transforms/Scalar/LoopDeletion.h"
	#include "llvm/Transforms/Scalar/LoopDistribute.h"			#include "llvm/Transforms/Scalar/LoopDistribute.h"
				#include "llvm/Transforms/Scalar/LoopFuse.h"
	#include "llvm/Transforms/Scalar/LoopIdiomRecognize.h"			#include "llvm/Transforms/Scalar/LoopIdiomRecognize.h"
	#include "llvm/Transforms/Scalar/LoopInstSimplify.h"			#include "llvm/Transforms/Scalar/LoopInstSimplify.h"
	#include "llvm/Transforms/Scalar/LoopLoadElimination.h"			#include "llvm/Transforms/Scalar/LoopLoadElimination.h"
	#include "llvm/Transforms/Scalar/LoopPassManager.h"			#include "llvm/Transforms/Scalar/LoopPassManager.h"
	#include "llvm/Transforms/Scalar/LoopPredication.h"			#include "llvm/Transforms/Scalar/LoopPredication.h"
	#include "llvm/Transforms/Scalar/LoopRotation.h"			#include "llvm/Transforms/Scalar/LoopRotation.h"
	#include "llvm/Transforms/Scalar/LoopSimplifyCFG.h"			#include "llvm/Transforms/Scalar/LoopSimplifyCFG.h"
	#include "llvm/Transforms/Scalar/LoopSink.h"			#include "llvm/Transforms/Scalar/LoopSink.h"

llvm/lib/Passes/PassRegistry.def

	FUNCTION_PASS("mldst-motion", MergedLoadStoreMotionPass())			FUNCTION_PASS("mldst-motion", MergedLoadStoreMotionPass())
	FUNCTION_PASS("nary-reassociate", NaryReassociatePass())			FUNCTION_PASS("nary-reassociate", NaryReassociatePass())
	FUNCTION_PASS("newgvn", NewGVNPass())			FUNCTION_PASS("newgvn", NewGVNPass())
	FUNCTION_PASS("jump-threading", JumpThreadingPass())			FUNCTION_PASS("jump-threading", JumpThreadingPass())
	FUNCTION_PASS("partially-inline-libcalls", PartiallyInlineLibCallsPass())			FUNCTION_PASS("partially-inline-libcalls", PartiallyInlineLibCallsPass())
	FUNCTION_PASS("lcssa", LCSSAPass())			FUNCTION_PASS("lcssa", LCSSAPass())
	FUNCTION_PASS("loop-data-prefetch", LoopDataPrefetchPass())			FUNCTION_PASS("loop-data-prefetch", LoopDataPrefetchPass())
	FUNCTION_PASS("loop-load-elim", LoopLoadEliminationPass())			FUNCTION_PASS("loop-load-elim", LoopLoadEliminationPass())
				FUNCTION_PASS("loop-fuse", LoopFusePass())
	FUNCTION_PASS("loop-distribute", LoopDistributePass())			FUNCTION_PASS("loop-distribute", LoopDistributePass())
	FUNCTION_PASS("loop-vectorize", LoopVectorizePass())			FUNCTION_PASS("loop-vectorize", LoopVectorizePass())
	FUNCTION_PASS("pgo-memop-opt", PGOMemOPSizeOpt())			FUNCTION_PASS("pgo-memop-opt", PGOMemOPSizeOpt())
	FUNCTION_PASS("print", PrintFunctionPass(dbgs()))			FUNCTION_PASS("print", PrintFunctionPass(dbgs()))
	FUNCTION_PASS("print<assumptions>", AssumptionPrinterPass(dbgs()))			FUNCTION_PASS("print<assumptions>", AssumptionPrinterPass(dbgs()))
	FUNCTION_PASS("print<block-freq>", BlockFrequencyPrinterPass(dbgs()))			FUNCTION_PASS("print<block-freq>", BlockFrequencyPrinterPass(dbgs()))
	FUNCTION_PASS("print<branch-prob>", BranchProbabilityPrinterPass(dbgs()))			FUNCTION_PASS("print<branch-prob>", BranchProbabilityPrinterPass(dbgs()))
	FUNCTION_PASS("print<da>", DependenceAnalysisPrinterPass(dbgs()))			FUNCTION_PASS("print<da>", DependenceAnalysisPrinterPass(dbgs()))

llvm/lib/Transforms/Scalar/CMakeLists.txt

		add_llvm_library(LLVMScalarOpts
InstSimplifyPass.cpp		InstSimplifyPass.cpp
JumpThreading.cpp		JumpThreading.cpp
LICM.cpp		LICM.cpp
LoopAccessAnalysisPrinter.cpp		LoopAccessAnalysisPrinter.cpp
LoopSink.cpp		LoopSink.cpp
LoopDeletion.cpp		LoopDeletion.cpp
LoopDataPrefetch.cpp		LoopDataPrefetch.cpp
LoopDistribute.cpp		LoopDistribute.cpp
		LoopFuse.cpp
LoopIdiomRecognize.cpp		LoopIdiomRecognize.cpp
LoopInstSimplify.cpp		LoopInstSimplify.cpp
LoopInterchange.cpp		LoopInterchange.cpp
LoopLoadElimination.cpp		LoopLoadElimination.cpp
LoopPassManager.cpp		LoopPassManager.cpp
LoopPredication.cpp		LoopPredication.cpp
LoopRerollPass.cpp		LoopRerollPass.cpp
LoopRotation.cpp		LoopRotation.cpp

llvm/lib/Transforms/Scalar/LoopFuse.cpp

This file was added.

				//===- LoopFuse.cpp - Loop Fusion Pass ------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				///
				jdoerfert: We need to update the licence ;)
				/// \file
				/// This file implements the loop fusion pass.
				/// The implementation is largely based on the following document:
				///
				/// Code Transformations to Augment the Scope of Loop Fusion in a
				/// Production Compiler
				/// Christopher Mark Barton
				/// MSc Thesis
				/// https://webdocs.cs.ualberta.ca/~amaral/thesis/ChristopherBartonMSc.pdf
				///
				/// The general approach taken is to collect sets of control flow equivalent
				/// loops and test whether they can be fused. The necessary conditions for
				/// fusion are:
				/// 1. The loops must be adjacent (there cannot be any statements between
				/// the two loops).
				/// 2. The loops must be conforming (they must execute the same number of
				/// iterations).
				/// 3. The loops must be control flow equivalent (if one loop executes, the
				/// other is guaranteed to execute).
				/// 4. There cannot be any negative distance dependencies between the loops.
				/// If all of these conditions are satisfied, it is safe to fuse the loops.
				///
				/// This implementation creates FusionCandidates that represent the loop and the
				/// necessary information needed by fusion. It then operates on the fusion
				/// candidates, first confirming that the candidate is eligible for fusion. The
				/// candidates are then collected into control flow equivalent sets, sorted in
				/// dominance order. Each set of control flow equivalent candidates is then
				/// traversed, attempting to fuse pairs of candidates in the set. If all
				/// requirements for fusion are met, the two candidates are fused, creating a
				/// new (fused) candidate which is then added back into the set to consider for
				/// additional fusion.
				///
				/// This implementation currently does not make any modifications to remove
				/// conditions for fusion. Code transformations to make loops conform to each of
				/// the conditions for fusion are discussed in more detail in the document
				/// above. These can be added to the current implementation in the future.
				//===----------------------------------------------------------------------===//
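
The conditions listed in the file comment can be illustrated outside of LLVM with a small, self-contained sketch. It is not part of the patch: the function names below are hypothetical, and plain arrays stand in for IR. The first pair of functions shows two adjacent loops with identical bounds and a distance-0 dependence, where fusing preserves the result. The second pair shows a negative distance dependence (the second loop reads an element that the first loop writes one iteration later), where naive fusion changes the result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Legal case: both loops run A.size() iterations, nothing sits between them,
// and the second loop reads B[i], which the first loop wrote in iteration i.
std::vector<int> runSeparate(const std::vector<int> &A) {
  std::vector<int> B(A.size()), C(A.size());
  for (std::size_t i = 0; i < A.size(); ++i)
    B[i] = A[i] + 1;
  for (std::size_t i = 0; i < A.size(); ++i)
    C[i] = B[i] * 2;
  return C;
}

// The same computation after fusing the two loop bodies into one loop.
std::vector<int> runFused(const std::vector<int> &A) {
  std::vector<int> B(A.size()), C(A.size());
  for (std::size_t i = 0; i < A.size(); ++i) {
    B[i] = A[i] + 1;
    C[i] = B[i] * 2;
  }
  return C;
}

// Illegal case: the second loop reads B[i + 1], which the first loop only
// writes in iteration i + 1 (a negative distance dependence).
std::vector<int> runSeparateNegDist(const std::vector<int> &A) {
  std::vector<int> B(A.size() + 1, 0), C(A.size());
  for (std::size_t i = 0; i < A.size(); ++i)
    B[i] = A[i] + 1;
  for (std::size_t i = 0; i < A.size(); ++i)
    C[i] = B[i + 1];
  return C;
}

// Naively fusing the illegal case: B[i + 1] is read before it is written.
std::vector<int> runNaivelyFusedNegDist(const std::vector<int> &A) {
  std::vector<int> B(A.size() + 1, 0), C(A.size());
  for (std::size_t i = 0; i < A.size(); ++i) {
    B[i] = A[i] + 1;
    C[i] = B[i + 1];
  }
  return C;
}
```

Calling `runSeparate` and `runFused` on the same input yields equal vectors, while `runSeparateNegDist` and `runNaivelyFusedNegDist` disagree, which is why condition 4 has to be checked before fusing.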

				#include "llvm/Transforms/Scalar/LoopFuse.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/Analysis/DependenceAnalysis.h"
				#include "llvm/Analysis/DomTreeUpdater.h"
				#include "llvm/Analysis/LoopInfo.h"
				#include "llvm/Analysis/OptimizationRemarkEmitter.h"
				#include "llvm/Analysis/PostDominators.h"
				#include "llvm/Analysis/ScalarEvolution.h"
				#include "llvm/Analysis/ScalarEvolutionExpressions.h"
				#include "llvm/IR/Function.h"
				#include "llvm/IR/Verifier.h"
				#include "llvm/Pass.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/raw_ostream.h"
				#include "llvm/Transforms/Scalar.h"
				#include "llvm/Transforms/Utils.h"
				#include "llvm/Transforms/Utils/BasicBlockUtils.h"

				using namespace llvm;

				#define DEBUG_TYPE "loop-fusion"

				STATISTIC(FuseCounter, "Count number of loop fusions performed");
				STATISTIC(NumFusionCandidates, "Number of candidates for loop fusion");
				STATISTIC(InvalidPreheader, "Loop has invalid preheader");
				STATISTIC(InvalidHeader, "Loop has invalid header");
				STATISTIC(InvalidExitingBlocks, "Loop has invalid exiting blocks");
				STATISTIC(InvalidExitBlock, "Loop has invalid exit block");
				STATISTIC(InvalidLatch, "Loop has invalid latch");
				STATISTIC(InvalidLoop, "Loop is invalid");
				STATISTIC(AddressTakenBB, "Basic block has address taken");
				STATISTIC(MayThrowException, "Loop may throw an exception");
				STATISTIC(ContainsVolatileAccess, "Loop contains a volatile access");
				STATISTIC(NotSimplifiedForm, "Loop is not in simplified form");
				STATISTIC(InvalidDependencies, "Dependencies prevent fusion");
				STATISTIC(InvalidTripCount,
				"Loop does not have invariant backedge taken count");
				STATISTIC(UncomputableTripCount, "SCEV cannot compute trip count of loop");
				STATISTIC(NonEqualTripCount, "Candidate trip counts are not the same");
				STATISTIC(NonAdjacent, "Candidates are not adjacent");
				STATISTIC(NonEmptyPreheader, "Candidate has a non-empty preheader");

				enum FusionDependenceAnalysisChoice {
				FUSION_DEPENDENCE_ANALYSIS_SCEV,
				FUSION_DEPENDENCE_ANALYSIS_DA,
				FUSION_DEPENDENCE_ANALYSIS_ALL,
				};

				static cl::opt<FusionDependenceAnalysisChoice> FusionDependenceAnalysis(
				"loop-fusion-dependence-analysis",
				cl::desc("Which dependence analysis should loop fusion use?"),
				cl::values(clEnumValN(FUSION_DEPENDENCE_ANALYSIS_SCEV, "scev",
				"Use the scalar evolution interface"),
				clEnumValN(FUSION_DEPENDENCE_ANALYSIS_DA, "da",
				Meinersbur: Is this the right description for "-loop-fusion-dependence-analysis=scev"? It is the same as for the `da` option.
				"Use the dependence analysis interface"),
				clEnumValN(FUSION_DEPENDENCE_ANALYSIS_ALL, "all",
				"Use all available analyses")),
				cl::Hidden, cl::init(FUSION_DEPENDENCE_ANALYSIS_ALL), cl::ZeroOrMore);

				#ifndef NDEBUG
				static cl::opt<bool>
				VerboseFusionDebugging("loop-fusion-verbose-debug",
				cl::desc("Enable verbose debugging for Loop Fusion"),
				cl::Hidden, cl::init(false), cl::ZeroOrMore);
				Meinersbur: [serious] LLVM has an infrastructure for enabling debugging output: `-debug-only`. We should not introduce additional ones. At least this option should only be available in assert-builds.
				kbarton: I use this to have two different "levels" of debugging. The debugging available with -debug-only is not as verbose. If you feel strongly about it, I can remove this and make all debug output equivalent. In the meantime, I've guarded this use with NDEBUG.
				#endif

				jdoerfert: You should keep it this way. Other passes do it similarly.
				/// This class is used to represent a candidate for loop fusion. When it is
				/// constructed, it checks the conditions for loop fusion to ensure that it
				/// represents a valid candidate. It caches several parts of a loop that are
				/// used throughout loop fusion (e.g., loop preheader, loop header, etc) instead
				/// of continually querying the underlying Loop to retrieve these values. It is
				/// assumed these will not change throughout loop fusion.
				///
				/// The invalidate method should be used to indicate that the FusionCandidate is
				/// no longer a valid candidate for fusion. Similarly, the isValid() method can
				/// be used to ensure that the FusionCandidate is still valid for fusion.
				struct FusionCandidate {

				// Cache of parts of the loop used throughout loop fusion. These should not
				// need to change throughout the analysis and transformation.
				BasicBlock *Preheader;
				BasicBlock *Header;
				BasicBlock *ExitingBlocks;
				dmgreen: Perhaps name it ExitingBlock, if there is only one.
				kbarton: Good idea. Will fix in patch I'm going to post soon.
				BasicBlock *ExitBlock;
				BasicBlock *Latch;
				Loop *L;
				SmallVector<Instruction *, 16> MemReads;
				SmallVector<Instruction *, 16> MemWrites;
				bool Valid;

				// Dominator and PostDominator trees are needed for the FusionCandidateCompare
				// function, required by FusionCandidateSet to determine where the
				// FusionCandidate should be inserted into the set. These are used to
				// establish ordering of the FusionCandidates based on dominance.
				const DominatorTree *DT;
				Meinersbur: [suggestion] Add doxygen comments for these fields?
				const PostDominatorTree *PDT;

				FusionCandidate(Loop *L, const DominatorTree *DT,
				Meinersbur: [suggestion] IMHO it is good practice to not have much logic in object constructors. Reasons:
					1. When exceptions are thrown, the object is partially initialized.
					2. With class polymorphism, calling virtual functions will call the non-overridden method.
					3. It is not possible to return anything, in this case, an error code.
				Points 1. and 2. do not apply to this code. Point 3. is handled using the `Valid` flag. Nonetheless, I think having lots of logic in the constructor is a bad pattern. Instead, I suggest having something return nullptr or `llvm::ErrorOr`.
				kbarton: I understand your points here. I did not realize issue #2 (although I agree it doesn't apply here). However, if I understand your suggestion, I would need to add an initialize method (or something similar) to do the logic that is currently in the constructor and return a success/fail code. The problem is that before anything can be done with the object, you need to ensure that it has been initialized. This can be accomplished by adding a boolean flag (isInitialized) to the class and ensuring that all accessors check that before doing anything with it. However, I think that is more complicated than leaving the initialization logic in the constructor. That said, I'm not particularly tied to either approach. I can change the implementation if you feel strongly. Or, if there is an alternative that I'm not seeing, I'm happy to do that instead.
				Meinersbur: I was thinking the following pattern:
					FusionCandidate *makeFusionCandiate(...) {
					BasicBlock *Preheader;
					BasicBlock *Header;
					BasicBlock *ExitingBlocks;
					BasicBlock *ExitBlock;
					BasicBlock *Latch;
					// The above are redundant since llvm::Loop provides accessors
					// I suggest to not cache them in other objects (they may require being updated) unless it measurably improves performance
					Loop *L;
					SmallVector<Instruction *, 16> MemReads;
					SmallVector<Instruction *, 16> MemWrites;
					if (!condition) {
					LLVM_DEBUG(dbgs() << "The reason\n");
					InvalidationReason++;
					return nullptr;
					}
					return new FusionCandidate(L, MemReads, MemWrites);
					}
				jdoerfert: It is not uncommon in LLVM to do it this way. I personally do not try to force a solution similar to `bool createXXX(...)` but I also do not necessarily object if there is an actual reason.
				kbarton: OK, so I'm still not sure what the preference is here:
					1. Keep it as is.
					2. Change to create an empty object and then require an Initialize method call on it.
					3. Create a wrapper, createFusionCandidate(), that does the construction and initialization.
					4. Some other alternative that I don't see.
				If I'm going to change, I prefer the createFusionCandidate route (which I hadn't thought of earlier), which I think is the cleanest, although I don't know how to enforce its usage.
				Meinersbur: It is just a suggestion and does not correspond to an LLVM coding policy. Keep as-is, it is not a requirement for me to accept the patch.
				kbarton: I will mark this conversation as done as I think we've agreed the existing implementation is OK. We can refactor/redo in the future if this becomes too cumbersome.
				const PostDominatorTree *PDT)
				: Preheader(L->getLoopPreheader()), Header(L->getHeader()),
				ExitingBlocks(L->getExitingBlock()), ExitBlock(L->getExitBlock()),
				Latch(L->getLoopLatch()), L(L), Valid(true), DT(DT), PDT(PDT) {

				// Walk over all blocks in the loop and check for conditions that may
				// prevent fusion. For each block, walk over all instructions and collect
				// the memory reads and writes. If any instructions that prevent fusion are
				// found, invalidate this object and return.
				for (BasicBlock *BB : L->blocks()) {
				if (BB->hasAddressTaken()) {
				Meinersbur: [remark] In contrast to the other invalidate reason, this one does not have a statistic counter. Intentional?
				AddressTakenBB++;
				invalidate();
				return;
				}

				for (Instruction &I : *BB) {
				if (I.mayThrow()) {
				MayThrowException++;
				invalidate();
				return;
				}
				if (StoreInst *SI = dyn_cast<StoreInst>(&I)) {
				if (SI->isVolatile()) {
				ContainsVolatileAccess++;
				invalidate();
				return;
				}
				}
				if (LoadInst *LI = dyn_cast<LoadInst>(&I)) {
				if (LI->isVolatile()) {
				ContainsVolatileAccess++;
				invalidate();
				return;
				}
				}
				if (I.mayWriteToMemory())
				MemWrites.push_back(&I);
				if (I.mayReadFromMemory())
				MemReads.push_back(&I);
				}
				}
				}

				/// Check if all members of the class are valid.
				bool isValid() const {
				return Preheader && Header && ExitingBlocks && ExitBlock && Latch && L &&
				!L->isInvalid() && Valid;
				}

				/// Verify that all members are in sync with the Loop object.
				void verify() const {
				assert(isValid() && "Candidate is not valid!!");
				assert(!L->isInvalid() && "Loop is invalid!");
				assert(Preheader == L->getLoopPreheader() && "Preheader is out of sync");
				assert(Header == L->getHeader() && "Header is out of sync");
				assert(ExitingBlocks == L->getExitingBlock() &&
				"Exiting Blocks is out of sync");
				assert(ExitBlock == L->getExitBlock() && "Exit block is out of sync");
				assert(Latch == L->getLoopLatch() && "Latch is out of sync");
				}
				Meinersbur: [style] Use `LLVM_DUMP_METHOD` pattern for `dump()` methods (See `compiler.h`)
				Meinersbur: [style] `dump()` methods are usually guarded by `#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)` as mentioned in the comment above `LLVM_DUMP_METHOD`

				LLVM_DUMP_METHOD void dump() const {
				dbgs() << "\tPreheader: " << (Preheader ? Preheader->getName() : "nullptr")
				<< "\n"
				<< "\tHeader: " << (Header ? Header->getName() : "nullptr") << "\n"
				<< "\tExitingBB: "
				<< (ExitingBlocks ? ExitingBlocks->getName() : "nullptr") << "\n"
				<< "\tExitBB: " << (ExitBlock ? ExitBlock->getName() : "nullptr")
				<< "\n"
				<< "\tLatch: " << (Latch ? Latch->getName() : "nullptr") << "\n";
				}

				private:
				// This is only used internally for now, to clear the MemWrites and MemReads
				// list and setting Valid to false. I can't envision other uses of this right
				// now, since once FusionCandidates are put into the FusionCandidateSet they
				// are immutable. Thus, any time we need to change/update a FusionCandidate,
				// we must create a new one and insert it into the FusionCandidateSet to
				// ensure the FusionCandidateSet remains ordered correctly.
				void invalidate() {
				MemWrites.clear();
				MemReads.clear();
				Valid = false;
				}
				};
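
For readers weighing the constructor discussion in the inline comments above, here is a minimal standalone sketch of the factory-function alternative that was suggested. It is not what the patch does (the patch keeps the checks in the constructor together with the `Valid` flag), and every name in it (`ToyLoop`, `ToyCandidate`, `RejectionCounts`, `makeToyCandidate`) is hypothetical and independent of LLVM. The idea is that a candidate object only exists if all checks passed, so no validity flag is needed, and the rejection counters are bumped in the factory.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for a loop; not an LLVM type.
struct ToyLoop {
  std::string Preheader;
  bool HasVolatileAccess;
  bool MayThrow;
};

// Hypothetical candidate that is only ever constructed in a valid state.
struct ToyCandidate {
  std::string Preheader;
};

// Plain counters standing in for the STATISTIC counters used by the pass.
struct RejectionCounts {
  unsigned Volatile = 0;
  unsigned MayThrow = 0;
};

// Returns a candidate when the loop passes every check, or nullptr otherwise.
// The matching counter records why a loop was rejected.
std::unique_ptr<ToyCandidate> makeToyCandidate(const ToyLoop &L,
                                               RejectionCounts &Counts) {
  if (L.MayThrow) {
    ++Counts.MayThrow;
    return nullptr;
  }
  if (L.HasVolatileAccess) {
    ++Counts.Volatile;
    return nullptr;
  }
  return std::unique_ptr<ToyCandidate>(new ToyCandidate{L.Preheader});
}
```

A caller would test the returned pointer before inserting the candidate into a set. The trade-off raised in the review still applies: nothing prevents other code from constructing `ToyCandidate` directly unless its constructor is made private.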

				inline llvm::raw_ostream &operator<<(llvm::raw_ostream &OS,
				const FusionCandidate &FC) {
				if (FC.isValid())
				OS << FC.Preheader->getName();
				else
				OS << "<Invalid>";

				return OS;
				}

				struct FusionCandidateCompare {
				/// Comparison functor to sort two Control Flow Equivalent fusion candidates
				/// into dominance order.
				/// If LHS dominates RHS and RHS post-dominates LHS, return true;
				/// If RHS dominates LHS and LHS post-dominates RHS, return false;
				bool operator()(const FusionCandidate &LHS,
				const FusionCandidate &RHS) const {
				const DominatorTree *DT = LHS.DT;
				const PostDominatorTree *PDT = LHS.PDT;

				assert(DT && PDT && "Expecting valid dominator tree");

				if (DT->dominates(LHS.Preheader, RHS.Preheader)) {
				// Verify RHS Postdominates LHS
				assert(PDT->dominates(RHS.Preheader, LHS.Preheader));
				return true;
				}

				if (DT->dominates(RHS.Preheader, LHS.Preheader)) {
				// RHS dominates LHS
				// Verify LHS post-dominates RHS
				assert(PDT->dominates(LHS.Preheader, RHS.Preheader));
				return false;
				}
				// If LHS does not dominate RHS and RHS does not dominate LHS then there is
				// no dominance relationship between the two FusionCandidates. Thus, they
				// should not be in the same set together.
				llvm_unreachable(
				"No dominance relationship between these fusion candidates!");
				}
				};
				Meinersbur: [remark] LoopSet is a vector?

				namespace {
				using LoopVector = SmallVector<Loop *, 4>;

				// Set of Control Flow Equivalent (CFE) Fusion Candidates, sorted in dominance
				Meinersbur: [remark] Is there a reason to use `std::set`? Usually, the LLVM hash-based sets (`DenseSet`, `SmallPtrSet`, etc) are more performant.
				kbarton: This is a good question, and I should have added comments here to articulate why I eventually settled on std::set. Ultimately, I was looking for a data structure that would keep the fusion candidates sorted automatically, and not invalidate the iterators as the set changes. In this implementation, either DenseSet or SmallPtrSet would be fine, but we have additional patches to extend fusion to move intervening code around and split up loops to create new opportunities. When this is done, the contents of the FusionCandidateSet will change, and having stable iterators simplifies the logic quite a bit (in my opinion). Similarly, always ensuring that the set is ordered properly is essential for this implementation, and that seems to be done reasonably well with std::set. I will add some comments in my next revision to indicate my line of reasoning here.
				Meinersbur: [suggestion] IMHO it is a good style to not depend on a particular implementation of ADTs. Code that advances iterators manually depending on inserted/removed elements can be confusing and may break easily in future changes. However, as with the previous [suggestion], I am ok with keeping it as-is.
				kbarton: Thanks. I will mark this as done for now. This is also something that may need to get revisited as the code evolves.
				// order. Thus, if FC0 comes before FC1 in a FusionCandidateSet, then FC0
				// dominates FC1 and FC1 post-dominates FC0.
				// std::set was chosen because we want a sorted data structure with stable
				// iterators. A subsequent patch to loop fusion will enable fusing non-adjacent
				// loops by moving intervening code around. When this intervening code contains
				// loops, those loops will be moved also. The corresponding FusionCandidates
				// will also need to be moved accordingly. As this is done, having stable
				// iterators will simplify the logic. Similarly, having an efficient insert that
				Meinersbur: [style] braces can be removed
				// keeps the FusionCandidateSet sorted will also simplify the implementation.
				using FusionCandidateSet = std::set<FusionCandidate, FusionCandidateCompare>;
				using FusionCandidateCollection = SmallVector<FusionCandidateSet, 4>;
				} // namespace
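
The reasoning in the comment above (a sorted container whose iterators stay valid while elements are erased and a fused element is inserted back) can be sketched without LLVM. The code below is only an illustration under stated assumptions: `ToyCandidate`, `ToyCompare` and `fuseAdjacent` are hypothetical names, a candidate is modelled as a half-open range of positions in program order, and "adjacent" is reduced to one range ending where the next one begins. It relies on the standard guarantee that erasing from a `std::set` invalidates only iterators to the erased element.

```cpp
#include <cassert>
#include <iterator>
#include <set>

// Hypothetical candidate: a loop occupying positions [Begin, End).
struct ToyCandidate {
  int Begin;
  int End;
};

// Orders candidates by where they start, standing in for dominance order.
struct ToyCompare {
  bool operator()(const ToyCandidate &L, const ToyCandidate &R) const {
    return L.Begin < R.Begin;
  }
};

using ToyCandidateSet = std::set<ToyCandidate, ToyCompare>;

// Fuses neighbouring candidates that touch (End == next Begin) and inserts the
// fused candidate back so it can be fused again. Returns the fusion count.
unsigned fuseAdjacent(ToyCandidateSet &Set) {
  unsigned Fused = 0;
  auto It = Set.begin();
  while (It != Set.end()) {
    auto Next = std::next(It);
    if (Next == Set.end())
      break;
    if (It->End != Next->Begin) {
      It = Next; // Not adjacent: move on to the next pair.
      continue;
    }
    ToyCandidate Merged{It->Begin, Next->End};
    Set.erase(Next); // It stays valid: only Next is invalidated.
    Set.erase(It);
    It = Set.insert(Merged).first; // Reconsider the fused candidate.
    ++Fused;
  }
  return Fused;
}
```

With candidates `[0,10)`, `[10,20)`, `[20,30)` and `[40,50)`, the first three collapse into `[0,30)` after two fusions, and `[40,50)` is left alone because it is not adjacent.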

				inline llvm::raw_ostream &operator<<(llvm::raw_ostream &OS,
				const FusionCandidateSet &CandSet) {
				for (auto IT = CandSet.begin(); IT != CandSet.end(); ++IT)
				Meinersbur: [suggestion] This could be a range-based for-loop.
				OS << *IT << "\n";

				return OS;
				}

				static void
				printFusionCandidates(const FusionCandidateCollection &FusionCandidates) {
				LLVM_DEBUG(dbgs() << "Fusion Candidates: \n");
				for (const auto &CandidateSet : FusionCandidates) {
				LLVM_DEBUG({
				dbgs() << "* Fusion Candidate Set *\n";
				dbgs() << CandidateSet;
				dbgs() << "****************************\n";
				});
				}
				Meinersbur: [style] I'd prefer method description as a doxygen comment at the methods.
				}

				/// Collect all loops in function at the same nest level, starting at the
				/// outermost level.
				///
				/// This data structure collects all loops at the same nest level for a
				/// given function (specified by the LoopInfo object). It starts at the
				/// outermost level.
				struct LoopDepthTree {
				using LoopsOnLevelTy = SmallVector<LoopVector, 4>;
				using iterator = LoopsOnLevelTy::iterator;
				using const_iterator = LoopsOnLevelTy::const_iterator;

				LoopDepthTree(LoopInfo &LI) : Depth(1) {
				Meinersbur: [style] Use doxygen comment?
				if (LI.begin() != LI.end())
				LoopsOnLevel.emplace_back(LoopVector(LI.rbegin(), LI.rend()));
				Meinersbur: [style] Did you consider `LI.empty()`?
				kbarton: I don't think so. I've changed it, as I think it makes it more readable.
				}

				/// Test whether a given loop has been removed from the function, and thus is
				/// no longer valid.
				bool isRemovedLoop(const Loop *L) const { return RemovedLoops.count(L); }

				/// Record that a given loop has been removed from the function and is no
				/// longer valid.
				void removeLoop(const Loop *L) { RemovedLoops.insert(L); }

				/// Descend the tree to the next (inner) nesting level
				void descend() {
				LoopsOnLevelTy LoopsOnNextLevel;

				for (const LoopVector &LV : *this)
				for (Loop *L : LV)
				if (!isRemovedLoop(L) && L->begin() != L->end())
				LoopsOnNextLevel.emplace_back(LoopVector(L->begin(), L->end()));

				LoopsOnLevel = LoopsOnNextLevel;
				RemovedLoops.clear();
				Depth++;
				}

				Meinersbur: [style] Descriptions of these would be useful.
				bool empty() const { return size() == 0; }
				size_t size() const { return LoopsOnLevel.size() - RemovedLoops.size(); }
				unsigned getDepth() const { return Depth; }

				iterator begin() { return LoopsOnLevel.begin(); }
				iterator end() { return LoopsOnLevel.end(); }
				const_iterator begin() const { return LoopsOnLevel.begin(); }
				const_iterator end() const { return LoopsOnLevel.end(); }
				Meinersbur: This is only used inside `LLVM_DEBUG`, so the compiler will emit an unused-function warning when compiling in release mode. Guard with `#ifdef NDEBUG`?

				private:
				/// Set of loops that have been removed from the function and are no longer
				/// valid.
				SmallPtrSet<const Loop *, 8> RemovedLoops;

				/// Depth of the current level, starting at 1 (outermost loops).
				unsigned Depth;

				/// Vector of loops at the current depth level that have the same parent loop
				LoopsOnLevelTy LoopsOnLevel;
				};
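
The level-by-level traversal that `descend()` performs can be illustrated outside of LLVM. The following is a minimal sketch, assuming a hypothetical `ToyLoop` type that only records its sub-loops; `ToyDepthTree`, `numSets()` and `depth()` are illustrative names and not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for llvm::Loop: a node that only knows its sub-loops.
struct ToyLoop {
  std::vector<ToyLoop *> SubLoops;
};

using ToyLoopVector = std::vector<ToyLoop *>;

// Hypothetical miniature of the depth tree: one vector of sibling loops per
// parent loop on the current nesting level.
class ToyDepthTree {
public:
  explicit ToyDepthTree(const ToyLoopVector &TopLevel) {
    if (!TopLevel.empty())
      LoopsOnLevel.push_back(TopLevel);
  }

  // Mark a loop as no longer valid (for example, because it was fused away).
  void removeLoop(const ToyLoop *L) {
    assert(L && "Expecting a valid loop");
    Removed.insert(L);
  }

  // Replace the current level with the sibling sets of the next inner level,
  // skipping loops that were marked as removed.
  void descend() {
    std::vector<ToyLoopVector> Next;
    for (const ToyLoopVector &LV : LoopsOnLevel)
      for (ToyLoop *L : LV)
        if (!Removed.count(L) && !L->SubLoops.empty())
          Next.push_back(L->SubLoops);
    LoopsOnLevel = Next;
    Removed.clear();
    ++Depth;
  }

  bool empty() const { return LoopsOnLevel.empty(); }
  std::size_t numSets() const { return LoopsOnLevel.size(); }
  unsigned depth() const { return Depth; }

private:
  std::set<const ToyLoop *> Removed;
  unsigned Depth = 1;
  std::vector<ToyLoopVector> LoopsOnLevel;
};
```

Each call to `descend()` yields one sibling set per surviving parent, which matches the requirement that only loops sharing a parent are considered for fusion with each other.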

				#ifndef NDEBUG
				static void printLoopVector(const LoopVector &LV) {
				dbgs() << "****************************\n";
				for (auto L : LV)
				printLoop(*L, dbgs());
				dbgs() << "****************************\n";
				}
				#endif

				static void reportLoopFusion(const FusionCandidate &FC0,
				const FusionCandidate &FC1,
				OptimizationRemarkEmitter &ORE) {
				using namespace ore;
				ORE.emit(
				OptimizationRemark(DEBUG_TYPE, "LoopFusion", FC0.Preheader->getParent())
				<< "Fused " << NV("Cand1", StringRef(FC0.Preheader->getName()))
				<< " with " << NV("Cand2", StringRef(FC1.Preheader->getName())));
				}

				struct LoopFuser {
				private:
				// Sets of control flow equivalent fusion candidates for a given nest level.
				FusionCandidateCollection FusionCandidates;

				LoopDepthTree LDT;
				DomTreeUpdater DTU;

				LoopInfo &LI;
				DominatorTree &DT;
				DependenceInfo &DI;
				ScalarEvolution &SE;
				PostDominatorTree &PDT;
				OptimizationRemarkEmitter &ORE;

				public:
				LoopFuser(LoopInfo &LI, DominatorTree &DT, DependenceInfo &DI,
				ScalarEvolution &SE, PostDominatorTree &PDT,
				OptimizationRemarkEmitter &ORE, const DataLayout &DL)
				: LDT(LI), DTU(DT, PDT, DomTreeUpdater::UpdateStrategy::Lazy), LI(LI),
				DT(DT), DI(DI), SE(SE), PDT(PDT), ORE(ORE) {}

				/// This is the main entry point for loop fusion. It will traverse the
				/// specified function and collect candidate loops to fuse, starting at the
				/// outermost nesting level and working inwards.
				bool fuseLoops(Function &F) {
				#ifndef NDEBUG
				if (VerboseFusionDebugging) {
				LI.print(dbgs());
				}
				#endif

				LLVM_DEBUG(dbgs() << "Performing Loop Fusion on function " << F.getName()
				<< "\n");
				bool Changed = false;

				while (!LDT.empty()) {
				LLVM_DEBUG(dbgs() << "Got " << LDT.size() << " loop sets for depth "
				<< LDT.getDepth() << "\n";);

				Meinersbur: [suggestion] Guard with `#ifndef NDEBUG` to avoid being called in non-assert builds?
				for (const LoopVector &LV : LDT) {
				assert(LV.size() > 0 && "Empty loop set was built!");

				// Skip singleton loop sets as they do not offer fusion opportunities on
				// this level.
				if (LV.size() == 1)
				continue;
				#ifndef NDEBUG
				if (VerboseFusionDebugging) {
				LLVM_DEBUG({
				dbgs() << " Visit loop set (#" << LV.size() << "):\n";
				printLoopVector(LV);
				});
				}
				Meinersbur: [style] Start functions with a verb: `isControlFlowEquivalent`
				kbarton: I addressed this a while ago, I just forgot to check it off as done.
				#endif

				collectFusionCandidates(LV);
				Changed |= fuseCandidates();
				}

				// Finished analyzing candidates at this level.
				// Descend to the next level and clear all of the candidates currently
				// collected. Note that it will not be possible to fuse any of the
				// existing candidates with new candidates because the new candidates will
				// be at a different nest level and thus not be control flow equivalent
				// with all of the candidates collected so far.
				LLVM_DEBUG(dbgs() << "Descend one level!\n");
				LDT.descend();
				FusionCandidates.clear();
				}

				Meinersbur: [style] `return PDT.dominates(FC0.Preheader, FC1.Preheader)`
				if (Changed)
				LLVM_DEBUG(dbgs() << "Function after Loop Fusion: \n"; F.dump(););

				#ifndef NDEBUG
				assert(DT.verify());
				assert(PDT.verify());
				LI.verify(DT);
				SE.verify();
				#endif

				LLVM_DEBUG(dbgs() << "Loop Fusion complete\n");
				return Changed;
				}

				private:
				/// Determine if two fusion candidates are control flow equivalent.
				///
				/// Two fusion candidates are control flow equivalent if when one executes,
				/// the other is guaranteed to execute. This is determined using dominators
				/// and post-dominators: if A dominates B and B post-dominates A then A and B
				/// are control-flow equivalent.
				bool isControlFlowEquivalent(const FusionCandidate &FC0,
				const FusionCandidate &FC1) const {
				assert(FC0.Preheader && FC1.Preheader && "Expecting valid preheaders");

				if (DT.dominates(FC0.Preheader, FC1.Preheader))
				return PDT.dominates(FC1.Preheader, FC0.Preheader);

				if (DT.dominates(FC1.Preheader, FC0.Preheader))
				return PDT.dominates(FC0.Preheader, FC1.Preheader);

				return false;
				}
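
The dominance test described above can be tried on a toy control flow graph. The following is a minimal sketch, assuming a hypothetical adjacency-list CFG (`ToyCFG`) and a simple iterative dominator computation; it does not use LLVM's `DominatorTree` or `PostDominatorTree`, and every block is assumed to be reachable from the entry and to reach the exit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy CFG: Succs[B] lists the successors of block B.
using ToyCFG = std::vector<std::vector<int>>;
using DomSets = std::vector<std::vector<bool>>;

// Iterative dominator computation. Result[B][A] is true if A dominates B.
DomSets computeDominators(const ToyCFG &Succs, int Root) {
  const std::size_t N = Succs.size();
  assert(Root >= 0 && static_cast<std::size_t>(Root) < N && "Invalid root");

  std::vector<std::vector<std::size_t>> Preds(N);
  for (std::size_t B = 0; B < N; ++B)
    for (int S : Succs[B])
      Preds[static_cast<std::size_t>(S)].push_back(B);

  const std::size_t R = static_cast<std::size_t>(Root);
  DomSets Dom(N, std::vector<bool>(N, true));
  Dom[R].assign(N, false);
  Dom[R][R] = true;

  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (std::size_t B = 0; B < N; ++B) {
      if (B == R)
        continue;
      // Intersect the dominator sets of all predecessors, then add B itself.
      std::vector<bool> New(N, true);
      for (std::size_t P : Preds[B])
        for (std::size_t A = 0; A < N; ++A)
          New[A] = New[A] && Dom[P][A];
      New[B] = true;
      if (New != Dom[B]) {
        Dom[B] = New;
        Changed = true;
      }
    }
  }
  return Dom;
}

// Post-dominators are the dominators of the reversed CFG rooted at the exit.
DomSets computePostDominators(const ToyCFG &Succs, int Exit) {
  ToyCFG Reversed(Succs.size());
  for (std::size_t B = 0; B < Succs.size(); ++B)
    for (int S : Succs[B])
      Reversed[static_cast<std::size_t>(S)].push_back(static_cast<int>(B));
  return computeDominators(Reversed, Exit);
}

// Same shape as the check above: if A dominates B, require that B
// post-dominates A (and symmetrically with the roles swapped).
bool isControlFlowEquivalent(const ToyCFG &Succs, int Entry, int Exit, int A,
                             int B) {
  const DomSets Dom = computeDominators(Succs, Entry);
  const DomSets PDom = computePostDominators(Succs, Exit);
  const std::size_t IA = static_cast<std::size_t>(A);
  const std::size_t IB = static_cast<std::size_t>(B);
  if (Dom[IB][IA])
    return PDom[IA][IB];
  if (Dom[IA][IB])
    return PDom[IB][IA];
  return false;
}
```

On a diamond-shaped CFG the fork and the join are control flow equivalent, while the fork and either arm are not, because an arm does not post-dominate the fork.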

				/// Determine if a fusion candidate (representing a loop) is eligible for
				/// fusion. Note that this only checks whether a single loop can be fused - it
				/// does not check whether it is legal to fuse two loops together.
				Meinersbur: [style] This looks like a leftover review comment.
				bool eligibleForFusion(const FusionCandidate &FC) const {
				if (!FC.isValid()) {
				LLVM_DEBUG(dbgs() << "FC " << FC << " has invalid CFG requirements!\n");
				if (!FC.Preheader)
				InvalidPreheader++;
				if (!FC.Header)
				InvalidHeader++;
				if (!FC.ExitingBlocks)
				InvalidExitingBlocks++;
				if (!FC.ExitBlock)
				InvalidExitBlock++;
				if (!FC.Latch)
				InvalidLatch++;
				if (FC.L->isInvalid())
				InvalidLoop++;

				return false;
				}

				// Require ScalarEvolution to be able to determine a trip count.
				if (!SE.hasLoopInvariantBackedgeTakenCount(FC.L)) {
				LLVM_DEBUG(dbgs() << "Loop " << FC.L->getName()
				Meinersbur: [Style] Use `LLVM_PRETTY_FUNCTION` instead of `__FUNCTION__`. Maybe remove entirely? Nowhere I have seen function tracing in LLVM code.
				<< " trip count not computable!\n");
				InvalidTripCount++;
				return false;
				}

				if (!FC.L->isLoopSimplifyForm()) {
				LLVM_DEBUG(dbgs() << "Loop " << FC.L->getName()
				<< " is not in simplified form!\n");
				NotSimplifiedForm++;
				return false;
				}

				dmgreen: Is there reason in the current version to prevent fusion if "FC0" has instructions in its preheaders?
				kbarton: Strictly speaking, no, there is no reason to prevent fusion if FC0 has instructions in its preheader. This check was meant to prevent fusion if FC1 has instructions in the preheader, since we don't have any analysis to move them around yet. I've moved this check later, to when we are trying to fuse pairs of candidates. If FC1 has a non-empty preheader then fusion is skipped.
				return true;
				}

				/// Iterate over all loops in the given loop set and identify the loops that
				/// are eligible for fusion. Place all eligible fusion candidates into Control
				/// Flow Equivalent sets, sorted by dominance.
				void collectFusionCandidates(const LoopVector &LV) {
				for (Loop *L : LV) {
				FusionCandidate CurrCand(L, &DT, &PDT);
				if (!eligibleForFusion(CurrCand))
				continue;

				// Go through each list in FusionCandidates and determine if L is control
				// flow equivalent with the first loop in that list. If it is, append L.
				// If not, go to the next list.
				// If no suitable list is found, start another list and add it to
				// FusionCandidates.
				bool FoundSet = false;

				for (auto &CurrCandSet : FusionCandidates) {
				if (isControlFlowEquivalent(*CurrCandSet.begin(), CurrCand)) {
				CurrCandSet.insert(CurrCand);
				FoundSet = true;
				#ifndef NDEBUG
				if (VerboseFusionDebugging)
				LLVM_DEBUG(dbgs() << "Adding " << CurrCand
				<< " to existing candidate set\n");
				#endif
				break;
				}
				}
				if (!FoundSet) {
				// No set was found. Create a new set and add to FusionCandidates
				#ifndef NDEBUG
				if (VerboseFusionDebugging)
				LLVM_DEBUG(dbgs() << "Adding " << CurrCand << " to new set\n");
				#endif
				FusionCandidateSet NewCandSet;
				NewCandSet.insert(CurrCand);
				FusionCandidates.push_back(NewCandSet);
				}
				NumFusionCandidates++;
				}
				}
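
The grouping strategy used by `collectFusionCandidates` (compare each new item against the first member of every existing set, and open a new set if none matches) can be sketched generically. This is only an illustration under stated assumptions: `groupByRepresentative` is a hypothetical helper, it uses plain vectors rather than the dominance-sorted `FusionCandidateSet`, and it relies on the predicate behaving like an equivalence relation.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: partition Items into groups by testing each item
// against the first member (the representative) of every existing group.
template <typename T, typename Pred>
std::vector<std::vector<T>> groupByRepresentative(const std::vector<T> &Items,
                                                  Pred Equivalent) {
  std::vector<std::vector<T>> Groups;
  for (const T &Item : Items) {
    bool FoundGroup = false;
    for (std::vector<T> &Group : Groups) {
      assert(!Group.empty() && "Groups are never created empty");
      if (Equivalent(Group.front(), Item)) {
        Group.push_back(Item);
        FoundGroup = true;
        break;
      }
    }
    // No suitable group was found: start a new one with Item as its
    // representative.
    if (!FoundGroup)
      Groups.push_back(std::vector<T>{Item});
  }
  return Groups;
}
```

Testing only against the representative keeps the number of comparisons per item bounded by the number of groups, at the cost of assuming that equivalence with the representative implies equivalence with every other member.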

				/// Determine if it is beneficial to fuse two loops.
				///
				/// For now, this method simply returns true because we want to fuse as much
				/// as possible (primarily to test the pass). This method will evolve, over
				/// time, to add heuristics for profitability of fusion.
				bool isBeneficialFusion(const FusionCandidate &FC0,
				const FusionCandidate &FC1) {
				return true;
				}

				/// Determine if two fusion candidates have the same trip count (i.e., they
				/// execute the same number of iterations).
				///
				/// Note that for now this method simply returns a boolean value because there
				/// are no mechanisms in loop fusion to handle different trip counts. In the
				/// future, this behaviour can be extended to adjust one of the loops to make
				/// the trip counts equal (e.g., loop peeling). When this is added, this
				/// interface may need to change to return more information than just a
				/// boolean value.
				bool identicalTripCounts(const FusionCandidate &FC0,
				Meinersbur: [suggestion] If predication is not supported, why not using `ScalarEvolution::getBackedgeTakenCount()`?
				jdoerfert: Because, as an example and not the only reason, we could easily generate statistics on the impact predication support might have.
				const FusionCandidate &FC1) const {
				const SCEV *TripCount0 = SE.getBackedgeTakenCount(FC0.L);
				if (isa<SCEVCouldNotCompute>(TripCount0)) {
				UncomputableTripCount++;
				LLVM_DEBUG(dbgs() << "Trip count of first loop could not be computed!");
				return false;
				}

				const SCEV *TripCount1 = SE.getBackedgeTakenCount(FC1.L);
				if (isa<SCEVCouldNotCompute>(TripCount1)) {
				UncomputableTripCount++;
				LLVM_DEBUG(dbgs() << "Trip count of second loop could not be computed!");
				return false;
				}
				LLVM_DEBUG(dbgs() << "\tTrip counts: " << *TripCount0 << " & "
				<< *TripCount1 << " are "
				<< (TripCount0 == TripCount1 ? "identical" : "different")
				<< "\n");
				dmgreen: second loop

				return (TripCount0 == TripCount1);
				}

				/// Walk each set of control flow equivalent fusion candidates and attempt to
				/// fuse them. This does a single linear traversal of all candidates in the
				/// set. The conditions for legal fusion are checked at this point. If a pair
				/// of fusion candidates passes all legality checks, they are fused together
				/// and a new fusion candidate is created and added to the FusionCandidateSet.
				/// The original fusion candidates are then removed, as they are no longer
				/// valid.
				bool fuseCandidates() {
				bool Fused = false;
				LLVM_DEBUG(printFusionCandidates(FusionCandidates));
				for (auto &CandidateSet : FusionCandidates) {
				if (CandidateSet.size() < 2)
				continue;

				LLVM_DEBUG(dbgs() << "Attempting fusion on Candidate Set:\n"
				<< CandidateSet << "\n");

				for (auto FC0 = CandidateSet.begin(); FC0 != CandidateSet.end(); ++FC0) {
				assert(!LDT.isRemovedLoop(FC0->L) &&
				"Should not have removed loops in CandidateSet!");
				auto FC1 = FC0;
				for (++FC1; FC1 != CandidateSet.end(); ++FC1) {
				assert(!LDT.isRemovedLoop(FC1->L) &&
				"Should not have removed loops in CandidateSet!");

				LLVM_DEBUG(dbgs() << "Attempting to fuse candidate \n"; FC0->dump();
				dbgs() << " with\n"; FC1->dump(); dbgs() << "\n");

				FC0->verify();
				FC1->verify();

				if (!identicalTripCounts(*FC0, *FC1)) {
				LLVM_DEBUG(dbgs() << "Fusion candidates do not have identical trip "
				"counts. Not fusing.\n");
				NonEqualTripCount++;
				continue;
				}

				if (!isAdjacent(*FC0, *FC1)) {
				LLVM_DEBUG(dbgs()
				<< "Fusion candidates are not adjacent. Not fusing.\n");
				NonAdjacent++;
				continue;
				}

				// For now we skip fusing if the second candidate has any instructions
				// in the preheader. This is done because we currently do not have the
				// safety checks to determine if it is safe to move the preheader of
				// the second candidate past the body of the first candidate. Once
				// these checks are added, this condition can be removed.
				if (!emptyPreheader(*FC1)) {
				LLVM_DEBUG(dbgs() << "Fusion candidate does not have empty "
				"preheader. Not fusing.\n");
				NonEmptyPreheader++;
				continue;
				}

				if (!dependencesAllowFusion(*FC0, *FC1)) {
				LLVM_DEBUG(dbgs() << "Memory dependencies do not allow fusion!\n");
				continue;
				}

				bool BeneficialToFuse = isBeneficialFusion(*FC0, *FC1);
				LLVM_DEBUG(dbgs()
				<< "\tFusion appears to be "
				<< (BeneficialToFuse ? "" : "un") << "profitable!\n");
				if (!BeneficialToFuse)
				continue;

				// All analysis has completed and has determined that fusion is legal
				// and profitable. At this point, start transforming the code and
				// perform fusion.

				LLVM_DEBUG(dbgs() << "\tFusion is performed: " << *FC0 << " and "
				<< *FC1 << "\n");

				// Report fusion to the Optimization Remarks.
				// Note this needs to be done before performFusion because
				// performFusion will change the original loops, making it not
				// possible to identify them after fusion is complete.
				reportLoopFusion(*FC0, *FC1, ORE);

				FusionCandidate FusedCand(performFusion(*FC0, *FC1), &DT, &PDT);
				FusedCand.verify();
				assert(eligibleForFusion(FusedCand) &&
				"Fused candidate should be eligible for fusion!");

				// Notify the loop-depth-tree that these loops are not valid objects
				// anymore.
				LDT.removeLoop(FC1->L);

				CandidateSet.erase(FC0);
				CandidateSet.erase(FC1);

				auto InsertPos = CandidateSet.insert(FusedCand);
				Meinersbur: [style] Use doxygen comment here?

				assert(InsertPos.second &&
				"Unable to insert TargetCandidate in CandidateSet!");

				// Reset FC0 and FC1 to the new (fused) candidate. Subsequent iterations
				// of the FC1 loop will attempt to fuse the new (fused) loop with the
				// remaining candidates in the current candidate set.
				FC0 = FC1 = InsertPos.first;
				dmgreen: This debug message is (essentially) repeated?
				kbarton: Good catch. Fixed.

				LLVM_DEBUG(dbgs() << "Candidate Set (after fusion): " << CandidateSet
				<< "\n");

				Fused = true;
				}
				}
				}
				return Fused;
				Meinersbur: [style] No reason to use `auto` here
				}

				/// Rewrite all additive recurrences in a SCEV to use a new loop.
				class AddRecLoopReplacer : public SCEVRewriteVisitor<AddRecLoopReplacer> {
				public:
				AddRecLoopReplacer(ScalarEvolution &SE, const Loop &OldL, const Loop &NewL,
				bool UseMax = true)
				: SCEVRewriteVisitor(SE), Valid(true), UseMax(UseMax), OldL(OldL),
				NewL(NewL) {}

				const SCEV *visitAddRecExpr(const SCEVAddRecExpr *Expr) {
				const Loop *ExprL = Expr->getLoop();
				SmallVector<const SCEV *, 2> Operands;
				if (ExprL == &OldL) {
				Operands.append(Expr->op_begin(), Expr->op_end());
				return SE.getAddRecExpr(Operands, &NewL, Expr->getNoWrapFlags());
				}

				Meinersbur: [suggestion] Descriptions of these would be useful
				if (OldL.contains(ExprL)) {
				dmgreen: Commented code
				bool Pos = SE.isKnownPositive(Expr->getStepRecurrence(SE));
				if (!UseMax || !Pos || !Expr->isAffine()) {
				Valid = false;
				Meinersbur: [style] Use `\p` to refer to parameters (https://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments)
				return Expr;
				}
				return visit(Expr->getStart());
				}

				for (const SCEV *Op : Expr->operands())
				Operands.push_back(visit(Op));
				return SE.getAddRecExpr(Operands, ExprL, Expr->getNoWrapFlags());
				}

				bool wasValidSCEV() const { return Valid; }

				private:
				bool Valid, UseMax;
				const Loop &OldL, &NewL;
				};

				/// Return false if the access functions of \p I0 and \p I1 could cause
				/// a negative dependence.
				bool accessDiffIsPositive(const Loop &L0, const Loop &L1, Instruction &I0,
				Instruction &I1, bool EqualIsInvalid) {
				Value *Ptr0 = getLoadStorePointerOperand(&I0);
				Value *Ptr1 = getLoadStorePointerOperand(&I1);
				if (!Ptr0 || !Ptr1)
				return false;

				const SCEV *SCEVPtr0 = SE.getSCEVAtScope(Ptr0, &L0);
				const SCEV *SCEVPtr1 = SE.getSCEVAtScope(Ptr1, &L1);
				#ifndef NDEBUG
				if (VerboseFusionDebugging)
				LLVM_DEBUG(dbgs() << " Access function check: " << *SCEVPtr0 << " vs "
				<< *SCEVPtr1 << "\n");
				#endif
				AddRecLoopReplacer Rewriter(SE, L0, L1);
				SCEVPtr0 = Rewriter.visit(SCEVPtr0);
				#ifndef NDEBUG
				if (VerboseFusionDebugging)
				LLVM_DEBUG(dbgs() << " Access function after rewrite: " << *SCEVPtr0
				<< " [Valid: " << Rewriter.wasValidSCEV() << "]\n");
				#endif
				if (!Rewriter.wasValidSCEV())
				return false;

				// TODO: isKnownPredicate doesn't work well when one SCEV is loop carried (by
				// L0) and the other is not. We could check if it is monotone and test
				// the beginning and end value instead.

				BasicBlock *L0Header = L0.getHeader();
				auto HasNonLinearDominanceRelation = [&](const SCEV *S) {
				const SCEVAddRecExpr *AddRec = dyn_cast<SCEVAddRecExpr>(S);
				if (!AddRec)
				return false;
				return !DT.dominates(L0Header, AddRec->getLoop()->getHeader()) &&
				!DT.dominates(AddRec->getLoop()->getHeader(), L0Header);
				};
				if (SCEVExprContains(SCEVPtr1, HasNonLinearDominanceRelation))
				return false;

				ICmpInst::Predicate Pred =
				EqualIsInvalid ? ICmpInst::ICMP_SGT : ICmpInst::ICMP_SGE;
				bool IsAlwaysGE = SE.isKnownPredicate(Pred, SCEVPtr0, SCEVPtr1);
				Meinersbur: [style] The braces seem unnecessary
				#ifndef NDEBUG
				if (VerboseFusionDebugging)
				LLVM_DEBUG(dbgs() << " Relation: " << *SCEVPtr0
				<< (IsAlwaysGE ? " >= " : " may < ") << *SCEVPtr1
				<< "\n");
				#endif
				return IsAlwaysGE;
				}
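
A source-level illustration of why the relative order of the two access functions matters: in the sketch below, the first loop writes `A[I + 1]` and the second loop reads `A[I + 1 + Offset]`. The functions `runSeparate` and `runFused` and the `Offset` parameter are hypothetical and exist only for this example. With `Offset <= 0` the second loop reads a location the first loop has already written by the same iteration, so fusing preserves the result; with `Offset == 1` the fused loop reads a location before it is written (a negative distance dependence) and the result changes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two separate loops: the first writes A[I + 1], the second reads
// A[I + 1 + Offset]. Offset is restricted to -1, 0 or 1 so that all
// accesses stay inside the N + 2 element array.
std::vector<int> runSeparate(int N, int Offset) {
  assert(N >= 0 && Offset >= -1 && Offset <= 1);
  std::vector<int> A(static_cast<std::size_t>(N) + 2, 0);
  std::vector<int> B(static_cast<std::size_t>(N), 0);
  for (int I = 0; I < N; ++I)
    A[static_cast<std::size_t>(I + 1)] = I + 10;
  for (int I = 0; I < N; ++I)
    B[static_cast<std::size_t>(I)] = A[static_cast<std::size_t>(I + 1 + Offset)];
  return B;
}

// The same two loop bodies placed in a single loop, without any legality
// check. This is only equivalent to runSeparate when Offset <= 0.
std::vector<int> runFused(int N, int Offset) {
  assert(N >= 0 && Offset >= -1 && Offset <= 1);
  std::vector<int> A(static_cast<std::size_t>(N) + 2, 0);
  std::vector<int> B(static_cast<std::size_t>(N), 0);
  for (int I = 0; I < N; ++I) {
    A[static_cast<std::size_t>(I + 1)] = I + 10;
    B[static_cast<std::size_t>(I)] = A[static_cast<std::size_t>(I + 1 + Offset)];
  }
  return B;
}
```

In terms of the check above, the address written by the first loop is greater than or equal to the address read by the second loop exactly in the cases where the fused version stays correct.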

				/// Return true if the dependences between @p I0 (in @p L0) and @p I1 (in
				/// @p L1) allow loop fusion of @p L0 and @p L1. The dependence analyses
				/// specified by @p DepChoice are used to determine this.
				bool dependencesAllowFusion(const FusionCandidate &FC0,
				const FusionCandidate &FC1, Instruction &I0,
				Instruction &I1, bool AnyDep,
				FusionDependenceAnalysisChoice DepChoice) {
				#ifndef NDEBUG
				if (VerboseFusionDebugging) {
				LLVM_DEBUG(dbgs() << "Check dep: " << I0 << " vs " << I1 << " : "
				<< DepChoice << "\n");
				}
				#endif
				switch (DepChoice) {
				case FUSION_DEPENDENCE_ANALYSIS_SCEV:
				return accessDiffIsPositive(*FC0.L, *FC1.L, I0, I1, AnyDep);
				case FUSION_DEPENDENCE_ANALYSIS_DA: {
				auto DepResult = DI.depends(&I0, &I1, true);
				if (!DepResult)
				return true;
				#ifndef NDEBUG
				if (VerboseFusionDebugging) {
				LLVM_DEBUG(dbgs() << "DA res: "; DepResult->dump(dbgs());
				dbgs() << " [#l: " << DepResult->getLevels() << "][Ordered: "
				<< (DepResult->isOrdered() ? "true" : "false")
				<< "]\n");
				LLVM_DEBUG(dbgs() << "DepResult Levels: " << DepResult->getLevels()
				<< "\n");
				jdoerfert: This should probably be a conditional and a `return false` ;)
				kbarton: I've changed from the assert to a check and print message. After that, all the conditions just result in a return false, which cleans this up a lot.
				}
				#endif

				if (DepResult->getNextPredecessor() || DepResult->getNextSuccessor())
				LLVM_DEBUG(
				dbgs() << "TODO: Implement pred/succ dependence handling!\n");

				// TODO: Can we actually use the dependence info analysis here?
				return false;
				}
				dmgreen: I would imagine, although I'm not sure, that there would at least be a lot of bugs here. We are dealing with different loops, but we can say that they are very similar. What does it currently give? Anything useful? The SCEV checks you are doing above are obviously quite similar to what DA would be trying to do, but with the added loop rewrite. It would be a shame to duplicate all the effort but may, I guess, be necessary if DA doesn't do well enough.
				kbarton: At this point DA doesn't give anything useful - at least not for the test cases that I have tried. I have not had a chance to investigate why, or if there is a better/different way to do things where it can be useful (which is why I marked the TODO here). The one thing that DA is able to do, that SCEV currently cannot, is understand the restrict keyword and accurately identify no dependencies between the two loops in this case. Perhaps it would be better to try and teach SCEV about restrict, and then only rely on SCEV in the long run?
				dmgreen: I remember DA having problems on 64bit systems due to all the sexts that kept happening. Delinearising constants too, maybe? There are certainly problems in it that would be worth trying to fix (or find a replacement to do it better). The delinearisation that DA will attempt may also be helpful. It is nowadays all built on SCEVs too, so should in theory (barring the facts that we know here about dealing with different but similar loops) be a more powerful version of the SCEV code here. That power might be confusing it at the moment though. I would expect that better DA is something llvm will need sooner or later as more loop optimisations are implemented. Perhaps this is something that we can improve in the future, relying on the SCEV code at the moment, but getting more help from DA as it is improved. The no-aliasing checks it can do are useful at least, it would seem.
				kbarton: I completely agree about improving DA and the need for it with other loop optimizations. I still haven't had a chance to look at it in detail, but would like to start looking at it once this patch is finalized and lands. For now are you OK with the current usage of DA here? I'm hoping that as we extend/improve it, we can modify this code to take advantage of it.
				dmgreen: Sounds good to me. The !DepResult check can help in some cases at least, and the rest we can add as DA is improved.
				kbarton: I'm going to mark this as done, since I think we are in agreement we want to work on improving the dependence analysis in the future. If this is incorrect, please let me know.

				case FUSION_DEPENDENCE_ANALYSIS_ALL:
				return dependencesAllowFusion(FC0, FC1, I0, I1, AnyDep,
				FUSION_DEPENDENCE_ANALYSIS_SCEV) ||
				dependencesAllowFusion(FC0, FC1, I0, I1, AnyDep,
				FUSION_DEPENDENCE_ANALYSIS_DA);
				}

				llvm_unreachable("Unknown fusion dependence analysis choice!");
				}

				/// Perform a dependence check and return if @p FC0 and @p FC1 can be fused.
				bool dependencesAllowFusion(const FusionCandidate &FC0,
				const FusionCandidate &FC1) {
				LLVM_DEBUG(dbgs() << "Check if " << FC0 << " can be fused with " << FC1
				<< "\n");
				assert(FC0.L->getLoopDepth() == FC1.L->getLoopDepth());
				assert(DT.dominates(FC0.Preheader, FC1.Preheader));

				for (Instruction *WriteL0 : FC0.MemWrites) {
				for (Instruction *WriteL1 : FC1.MemWrites)
				if (!dependencesAllowFusion(FC0, FC1, *WriteL0, *WriteL1,
				/* AnyDep */ false,
				FusionDependenceAnalysis)) {
				InvalidDependencies++;
				return false;
				}
				for (Instruction *ReadL1 : FC1.MemReads)
				if (!dependencesAllowFusion(FC0, FC1, *WriteL0, *ReadL1,
				/* AnyDep */ false,
				FusionDependenceAnalysis)) {
				InvalidDependencies++;
				return false;
				}
				}
				Meinersbur: [style] Start method names with a verb: `isAdjacent`?

				for (Instruction *WriteL1 : FC1.MemWrites) {
				for (Instruction *WriteL0 : FC0.MemWrites)
				if (!dependencesAllowFusion(FC0, FC1, *WriteL0, *WriteL1,
				/* AnyDep */ false,
				FusionDependenceAnalysis)) {
				InvalidDependencies++;
				return false;
				}
				for (Instruction *ReadL0 : FC0.MemReads)
				if (!dependencesAllowFusion(FC0, FC1, *ReadL0, *WriteL1,
				/* AnyDep */ false,
				FusionDependenceAnalysis)) {
				InvalidDependencies++;
				return false;
				}
				}

				// Walk through all uses in FC1. For each use, find the reaching def. If the
				// def is a PHI node, and one of the inputs to the PHI is from FC0 then it
				// is not safe to fuse.

				for (BasicBlock *BB : FC1.L->blocks())
				for (Instruction &I : *BB)
				dmgreen: I'm not sure if this is quite strong enough. Consider something like this, where the sum would be used in the second block, but not as phi:
				  for(i = 0; i < n; i++) sum += a[i];
				  for(i = 0; i < n; i++) b[i] = a[i]/sum;
				These can be fused, but use the wrong "sum" on each iteration of the loop. Unroll and jam side-steps all this by requiring lcssa nodes (and, you know, not requiring generalised loop fusion :) )
				kbarton: Yes, you're right. If I generate the ll for this example and massage it to make it eligible for fusion, we will fuse these. However, I'm confused by your statement: "These can be fused, but use the wrong "sum" on each iteration of the loop." I don't see how fusion is legal here. Are you saying that the current check will (incorrectly) still allow fusion? Or there is a way to fuse these loops? For this example, lcssa will create a non-empty preheader for the second loop and the earlier checks preventing fusion will bail before we get to this test. If I manually sink the PHI from lcssa into the header of the second loop, I can create an empty preheader for the second loop to allow fusion to continue. At any rate, I think this check needs to be strengthened to prevent fusion from happening in this case. Do you agree?
				dmgreen: Yeah, sorry. I meant _will_ fuse, but shouldn't. "Can" was a bit misleading.
				for (auto &Op : I.operands()) {
				Value *Def = Op.get();

				// Check if it's a PHI and if it lives in Loop0. If so, then we
				// probably want to skip.
				// FIXME: Are we concerned about PHIs anywhere in FC0 or just in
				// FC0.Header (or FC0.ExitingBlock?)
				// Once we have the PHI, see if any of the incoming edges are from
				// FC0. Then we know that PHI is dependent on something that is
				// defined in that loop, and we likely don't want to disturb it.
				PHINode *DefPHI = dyn_cast<PHINode>(Def);
				dmgreen: I think that you can just do something like
				  if (Instruction *Def = dyn_cast<Instruction>(Op))
				    if (FC0.L->contains(Def->getParent()))
				      return false;
				Providing there are no lcssa phis, this should rule out anything that depends on the first loop.
				kbarton: I just finished up testing this change, and then realized I think this is overly restrictive. I had originally restricted it to PHIs because I only wanted to include things that are (re)defined in the first loop. Basically, the first loop needs to complete in order to get the correct definition going into the second loop. This alternate sequence will also include stores, not just PHIs. The store can be loop invariant, and if it is not hoisted out of the loop prior, then it would also prevent fusion.
				dmgreen: This is the kind of thing I was thinking of:
				```
				define float @test(float* nocapture %a, i32 %n) {
				entry:
				  %conv = zext i32 %n to i64
				  %cmp32 = icmp eq i32 %n, 0
				  br i1 %cmp32, label %for.cond.cleanup7, label %for.body

				for.body: ; preds = %for.body, %entry
				  %i.034 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
				  %sum1.033 = phi float [ %add, %for.body ], [ 0.000000e+00, %entry ]
				  %idxprom = trunc i64 %i.034 to i32
				  %arrayidx = getelementptr inbounds float, float* %a, i32 %idxprom
				  %0 = load float, float* %arrayidx, align 4
				  %add = fadd float %sum1.033, %0
				  %inc = add nuw nsw i64 %i.034, 1
				  %cmp = icmp ult i64 %inc, %conv
				  br i1 %cmp, label %for.body, label %for.body8

				for.body8: ; preds = %for.body, %for.body8
				  %i2.031 = phi i64 [ %inc14, %for.body8 ], [ 0, %for.body ]
				  %idxprom9 = trunc i64 %i2.031 to i32
				  %arrayidx10 = getelementptr inbounds float, float* %a, i32 %idxprom9
				  %1 = load float, float* %arrayidx10, align 4
				  %div = fdiv float %1, %add
				  store float %div, float* %arrayidx10, align 4
				  %inc14 = add nuw nsw i64 %i2.031, 1
				  %cmp5 = icmp ult i64 %inc14, %conv
				  br i1 %cmp5, label %for.body8, label %for.cond.cleanup7

				for.cond.cleanup7: ; preds = %for.body8, %entry
				  %sum1.0.lcssa36 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body8 ]
				  ret float %sum1.0.lcssa36
				}
				```
				The important part being %add, that is defined in the first loop and used in the second. I believe that using normal SSA def-use chains, anything that is def'd in the first loop (and used in the second) will need to use the value from the last iteration of the loop, so fusion will be illegal. There may be cases where a def has the same value on every iteration of the loop, but I imagine those would already have been hoisted/sunk, or be quite rare.
				kbarton: Yes, I see now. It's possible to have a def in addition to a PHI. If I wanted to catch this, I would also need to traverse all the inputs to any defs and make sure they don't come from PHIs within loop 0 as well. I'm OK with restricting this to defs, with the assumption that if there is any loop invariant defs they would have been hoisted out of the loop at this point.
				if (DefPHI)
				for (BasicBlock *Block : DefPHI->blocks())
				if (FC0.L->contains(Block)) {
				InvalidDependencies++;
				return false;
				}
				}
				Meinersbur: [suggestion] Consider using a method name that makes clear that it has no side-effects. `isEmptyPreheader`? `emptyPreheader` sounds like it would remove instructions from the preheader until it is empty.
				return true;
				}

				/// Determine if the exit block of \p FC0 is the preheader of \p FC1. In this
				/// case, there is no code in between the two fusion candidates, thus making
				/// them adjacent.
				bool isAdjacent(const FusionCandidate &FC0,
				const FusionCandidate &FC1) const {
				return FC0.ExitBlock == FC1.Preheader;
				}

				bool emptyPreheader(const FusionCandidate &FC) const {
				return FC.Preheader->size() == 1;
				}

				/// Fuse two fusion candidates, creating a new fused loop.
				///
				/// This method contains the mechanics of fusing two loops, represented by \p
				/// FC0 and \p FC1. It is assumed that \p FC0 dominates \p FC1 and \p FC1
				/// postdominates \p FC0 (making them control flow equivalent). It also
				/// assumes that the other conditions for fusion have been met: adjacent,
				/// identical trip counts, and no negative distance dependencies exist that
				/// would prevent fusion. Thus, there is no checking for these conditions in
				/// this method.
				///
				/// Fusion is performed by rewiring the CFG to update successor blocks of the
				/// components of the loop. Specifically, the following changes are done:
				///
				/// 1. The preheader of \p FC1 is removed as it is no longer necessary
				/// (because it is currently only a single statement block).
				/// 2. The latch of \p FC0 is modified to jump to the header of \p FC1.
				dmgreen: thest -> test?
				/// 3. The latch of \p FC1 is modified to jump to the header of \p FC0.
				/// 4. All blocks from \p FC1 are removed from FC1 and added to FC0.
				///
				/// All of these modifications are done with dominator tree updates, thus
				/// keeping the dominator (and post dominator) information up-to-date.
				///
				/// This can be improved in the future by actually merging blocks during
				/// fusion. For example, the preheader of \p FC1 can be merged with the
				/// preheader of \p FC0. This would allow loops with more than a single
				/// statement in the preheader to be fused. Similarly, the latch blocks of the
				/// two loops could also be fused into a single block. This will require
				/// analysis to prove it is safe to move the contents of the block past
				/// existing code, which currently has not been implemented.
				Loop *performFusion(const FusionCandidate &FC0, const FusionCandidate &FC1) {
				Meinersbur: [style] Use `emplace_back`?
				kbarton (Author): There is no emplace_back for TreeUpdates (unless I misunderstood your suggestion).
				Meinersbur: Are you sure? There is `SmallVectorImpl<T>::emplace_back`.
				kbarton (Author): You're right. The syntax I was using was wrong. I've fixed it and changed to using emplace_back.
				assert(FC0.isValid() && FC1.isValid() &&
				"Expecting valid fusion candidates");

				LLVM_DEBUG(dbgs() << "Fusion Candidate 0: \n"; FC0.dump();
				dmgreen: Theg
				dbgs() << "Fusion Candidate 1: \n"; FC1.dump(););

				assert(FC1.Preheader == FC0.ExitBlock);
				assert(FC1.Preheader->size() == 1 &&
				FC1.Preheader->getSingleSuccessor() == FC1.Header);

				// Remember the phi nodes originally in the header of FC0 in order to rewire
				// them later. However, this is only necessary if the new loop carried
				// values might not dominate the exiting branch. While we do not generally
				// test if this is the case but simply insert intermediate phi nodes, we
				// need to make sure these intermediate phi nodes have different
				// predecessors. To this end, we filter the special case where the exiting
				// block is the latch block of the first loop. Nothing needs to be done
				// anyway as all loop carried values dominate the latch and thereby also the
				// exiting branch.
				SmallVector<PHINode *, 8> OriginalFC0PHIs;
				if (FC0.ExitingBlocks != FC0.Latch)
				for (PHINode &PHI : FC0.Header->phis())
				OriginalFC0PHIs.push_back(&PHI);
				dmgreen: It can sometimes be better to insert edges into the DT before deleting old ones. It keeps the tree more intact (especially PDTs), with fewer subtrees becoming unreachable and fewer recalculations needed. It means the updates can be simpler and quicker.
				kbarton (Author): OK, that's good to know, thanks. Is this specifically for inserting/deleting edges between similar blocks, or is this for all inserts/deletes in the entire tree? In other words, should I rearrange this code, and the code below (lines 1048-1051) that updates the latch blocks to do all the insertions before any deletions?
				dmgreen: I think it's probably fine, just something I've run into in the past with PDTs. If it's something we run into, we can fix it then. And if it does become an issue, it may be better to teach the DTU to do this itself.
				kbarton (Author): OK, sounds good. I will mark this comment as done then.

				// Replace incoming blocks for header PHIs first.
				FC1.Preheader->replaceSuccessorsPhiUsesWith(FC0.Preheader);
				FC0.Latch->replaceSuccessorsPhiUsesWith(FC1.Latch);

				// Then modify the control flow and update DT and PDT.
				SmallVector<DominatorTree::UpdateType, 8> TreeUpdates;

				// The old exiting block of the first loop (FC0) has to jump to the header
				// of the second as we need to execute the code in the second header block
				// regardless of the trip count. That is, if the trip count is 0, so the
				// back edge is never taken, we still have to execute both loop headers,
				dmgreen: Is there anywhere that we check that these won't have phi operands from inside the first loop? i.e. that the phis can be moved into the first loop / before all of its instructions.
				kbarton (Author): This is a really good catch! We completely missed this. I've added a check in dependencesAllowFusion that walks through the PHIs in the header of FC1 and checks that any incoming block that is in FC0 is the header block. If it is not, then we do not fuse.
				jdoerfert: I don't get this. Could one of you produce a problematic example for me? (also good as a test case)
				kbarton (Author): dmgreen added an example above that illustrates the issue (after some manual modifications to the IR to get around other limitations in fusion).
				// especially (but not only!) if the second is a do-while style loop.
				// However, doing so might invalidate the phi nodes of the first loop as
				// the new values do only need to dominate their latch and not the exiting
				// predicate. To remedy this potential problem we always introduce phi
				// nodes in the header of the second loop later that select the loop carried
				// value, if the second header was reached through an old latch of the
				// first, or undef otherwise. This is sound as exiting the first implies the
				// second will exit too, __without__ taking the back-edge. (Their
				// trip-counts are equal, after all.)
				// KB: Would this sequence be simpler to just make FC0.ExitingBlocks go
				// to FC1.Header? I think this is basically what the three sequences are
				// trying to accomplish; however, doing this directly in the CFG may mean
				// the DT/PDT becomes invalid
				FC0.ExitingBlocks->getTerminator()->replaceUsesOfWith(FC1.Preheader,
				FC1.Header);
				TreeUpdates.push_back(
				{DominatorTree::Delete, FC0.ExitingBlocks, FC1.Preheader});
				TreeUpdates.push_back(
				{DominatorTree::Insert, FC0.ExitingBlocks, FC1.Header});

				// The pre-header of L1 is not necessary anymore.
				assert(pred_begin(FC1.Preheader) == pred_end(FC1.Preheader));
				FC1.Preheader->getTerminator()->eraseFromParent();
				new UnreachableInst(FC1.Preheader->getContext(), FC1.Preheader);
				TreeUpdates.push_back({DominatorTree::Delete, FC1.Preheader, FC1.Header});

				// Move the phi nodes from the second loop's header block to the first
				// loop's header block.
				while (PHINode *PHI = dyn_cast<PHINode>(&FC1.Header->front())) {
				if (SE.isSCEVable(PHI->getType()))
				SE.forgetValue(PHI);
				if (PHI->hasNUsesOrMore(1))
				PHI->moveBefore(&*FC0.Header->getFirstInsertionPt());
				else
				PHI->eraseFromParent();
				}

				// Introduce new phi nodes in the second loop header to ensure
				// exiting the first and jumping to the header of the second does not break
				// the SSA property of the phis originally in the first loop. See also the
				// comment above.
				Instruction *L1HeaderIP = &FC1.Header->front();
				for (PHINode *LCPHI : OriginalFC0PHIs) {
				int L1LatchBBIdx = LCPHI->getBasicBlockIndex(FC1.Latch);
				assert(L1LatchBBIdx >= 0 &&
				"Expected loop carried value to be rewired at this point!");

				Value *LCV = LCPHI->getIncomingValue(L1LatchBBIdx);

				Meinersbur: [suggestion] This looks expensive to do. However, the pass manager will do these verifications anyway between passes (if enabled), so it shouldn't be necessary to do here.
				kbarton (Author): I've kept this for now, as this isn't in the pass manager yet. However, I've guarded it under NDEBUG. Are you OK with that?
				PHINode *L1HeaderPHI = PHINode::Create(
				LCV->getType(), 2, LCPHI->getName() + ".afterFC0", L1HeaderIP);
				L1HeaderPHI->addIncoming(LCV, FC0.Latch);
				L1HeaderPHI->addIncoming(UndefValue::get(LCV->getType()),
				FC0.ExitingBlocks);

				LCPHI->setIncomingValue(L1LatchBBIdx, L1HeaderPHI);
				}

				// Replace latch terminator destinations.
				FC0.Latch->getTerminator()->replaceUsesOfWith(FC0.Header, FC1.Header);
				FC1.Latch->getTerminator()->replaceUsesOfWith(FC1.Header, FC0.Header);
				TreeUpdates.push_back({DominatorTree::Insert, FC0.Latch, FC1.Header});
				TreeUpdates.push_back({DominatorTree::Delete, FC0.Latch, FC0.Header});
				TreeUpdates.push_back({DominatorTree::Insert, FC1.Latch, FC0.Header});
				TreeUpdates.push_back({DominatorTree::Delete, FC1.Latch, FC1.Header});

				// Update DT/PDT
				DTU.applyUpdates(TreeUpdates);
				dmgreen: Whilst checking the example, I noticed the dom tree updater has become stricter about "imbalanced" tree operations. I think that if FC0.ExitingBlocks == FC0.Latch, we add a link to FC1.Header twice, which is maybe what it's complaining about (?)
				kbarton (Author): Yup, good catch. I found the same problem when I tried your test case. I've fixed the problem and added your test above to cannot_fuse.ll.

				Meinersbur: [style] I think it is more common in the LLVM code base to have private fields declared at the beginning of the class.
				LI.removeBlock(FC1.Preheader);
				DTU.deleteBB(FC1.Preheader);
				DTU.flush();

				dmgreen: You may want to add DominatorTree::VerificationLevel::Fast, otherwise this might be quite slow.
				kbarton (Author): Thanks for pointing this out. I've changed to Fast. Most (probably all) of the problems this has found have been found by comparing to a newly constructed tree, so I think Fast should be sufficient.
				// Is there a way to keep SE up-to-date so we don't need to forget the loops
				// and rebuild the information in subsequent passes of fusion?
				SE.forgetLoop(FC1.L);
				SE.forgetLoop(FC0.L);

				// Merge the loops.
				SmallVector<BasicBlock *, 8> Blocks(FC1.L->block_begin(),
				Meinersbur: [serious] I think LoopFuser is preserving some passes such as ScalarEvolution?
				FC1.L->block_end());
				for (BasicBlock *BB : Blocks) {
				FC0.L->addBlockEntry(BB);
				FC1.L->removeBlockFromLoop(BB);
				if (LI.getLoopFor(BB) != FC1.L)
				continue;
				LI.changeLoopFor(BB, FC0.L);
				}
				while (!FC1.L->empty()) {
				const auto &ChildLoopIt = FC1.L->begin();
				Loop *ChildLoop = *ChildLoopIt;
				FC1.L->removeChildLoop(ChildLoopIt);
				FC0.L->addChildLoop(ChildLoop);
				}

				// Delete the now empty loop L1.
				LI.erase(FC1.L);

				#ifndef NDEBUG
				assert(!verifyFunction(*FC0.Header->getParent(), &errs()));
				assert(DT.verify(DominatorTree::VerificationLevel::Fast));
				assert(PDT.verify());
				LI.verify(DT);
				SE.verify();
				dmgreen: Looks like the code is preserving LI too (I'm not sure if you can/should preserve SE without preserving LI.)
				kbarton (Author): Yup, good call. I've added LoopInfoWrapperPass to the preserved list.
				#endif

				FuseCounter++;

				LLVM_DEBUG(dbgs() << "Fusion done:\n");

				return FC0.L;
				}
				};

				struct LoopFuseLegacy : public FunctionPass {

				static char ID;

				LoopFuseLegacy() : FunctionPass(ID) {
				Meinersbur: [serious] If changes are made, the pass should not indicate that it preserves all analyses (including the ones it doesn't know about).
				kbarton (Author): I agree. The pass should not make changes and return true. I don't believe this is happening now.
				initializeLoopFuseLegacyPass(*PassRegistry::getPassRegistry());
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequiredID(LoopSimplifyID);
				AU.addRequired<ScalarEvolutionWrapperPass>();
				AU.addRequired<LoopInfoWrapperPass>();
				AU.addRequired<DominatorTreeWrapperPass>();
				AU.addRequired<PostDominatorTreeWrapperPass>();
				AU.addRequired<OptimizationRemarkEmitterWrapperPass>();
				AU.addRequired<DependenceAnalysisWrapperPass>();

				AU.addPreserved<ScalarEvolutionWrapperPass>();
				AU.addPreserved<LoopInfoWrapperPass>();
				AU.addPreserved<DominatorTreeWrapperPass>();
				AU.addPreserved<PostDominatorTreeWrapperPass>();
				}

				bool runOnFunction(Function &F) override {
				if (skipFunction(F))
				return false;
				auto &LI = getAnalysis<LoopInfoWrapperPass>().getLoopInfo();
				auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
				auto &DI = getAnalysis<DependenceAnalysisWrapperPass>().getDI();
				auto &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();
				auto &PDT = getAnalysis<PostDominatorTreeWrapperPass>().getPostDomTree();
				auto &ORE = getAnalysis<OptimizationRemarkEmitterWrapperPass>().getORE();

				const DataLayout &DL = F.getParent()->getDataLayout();
				jdoerfert: As above, preserve LoopAnalysis.
				LoopFuser LF(LI, DT, DI, SE, PDT, ORE, DL);
				return LF.fuseLoops(F);
				}
				};

				PreservedAnalyses LoopFusePass::run(Function &F, FunctionAnalysisManager &AM) {
				auto &LI = AM.getResult<LoopAnalysis>(F);
				auto &DT = AM.getResult<DominatorTreeAnalysis>(F);
				auto &DI = AM.getResult<DependenceAnalysis>(F);
				auto &SE = AM.getResult<ScalarEvolutionAnalysis>(F);
				auto &PDT = AM.getResult<PostDominatorTreeAnalysis>(F);
				auto &ORE = AM.getResult<OptimizationRemarkEmitterAnalysis>(F);

				const DataLayout &DL = F.getParent()->getDataLayout();
				LoopFuser LF(LI, DT, DI, SE, PDT, ORE, DL);
				bool Changed = LF.fuseLoops(F);
				if (!Changed)
				return PreservedAnalyses::all();

				PreservedAnalyses PA;
				PA.preserve<DominatorTreeAnalysis>();
				PA.preserve<PostDominatorTreeAnalysis>();
				PA.preserve<ScalarEvolutionAnalysis>();
				PA.preserve<LoopAnalysis>();
				return PA;
				}

				char LoopFuseLegacy::ID = 0;

				INITIALIZE_PASS_BEGIN(LoopFuseLegacy, "loop-fusion", "Loop Fusion", false,
				false)
				INITIALIZE_PASS_DEPENDENCY(PostDominatorTreeWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(ScalarEvolutionWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(DependenceAnalysisWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(OptimizationRemarkEmitterWrapperPass)
				INITIALIZE_PASS_END(LoopFuseLegacy, "loop-fusion", "Loop Fusion", false, false)

				FunctionPass *llvm::createLoopFusePass() { return new LoopFuseLegacy(); }
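
The CFG rewiring described in the `performFusion` comment can be tried out on a toy graph. The following is only a sketch under stated assumptions, not the patch's implementation: `CFG`, `Candidate`, `redirect`, `fuse`, `makeTwoLoops` and the block names are hypothetical and use no LLVM API. It models two adjacent, non-rotated loops whose header is also the exiting block, and it ignores PHI nodes, dominator tree updates and the `FC0.ExitingBlocks == FC0.Latch` case raised in the review.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy CFG (hypothetical model): block name -> ordered successor names.
using CFG = std::map<std::string, std::vector<std::string>>;

// The four blocks of a loop that the rewiring needs to know about.
struct Candidate {
  std::string Preheader, Header, Exiting, Latch;
};

// Replace every edge From->OldTo with an edge From->NewTo.
void redirect(CFG &G, const std::string &From, const std::string &OldTo,
              const std::string &NewTo) {
  std::vector<std::string> &Succs = G[From];
  std::replace(Succs.begin(), Succs.end(), OldTo, NewTo);
}

// Two adjacent loops: the exit block of loop 0 is the preheader of loop 1.
CFG makeTwoLoops() {
  return {
      {"pre0", {"h0"}},
      {"h0", {"body0", "pre1"}},
      {"body0", {"latch0"}},
      {"latch0", {"h0"}},
      {"pre1", {"h1"}},
      {"h1", {"body1", "exit"}},
      {"body1", {"latch1"}},
      {"latch1", {"h1"}},
      {"exit", {}},
  };
}

// Rewire the toy CFG so that both loop bodies run in one loop.
void fuse(CFG &G, const Candidate &FC0, const Candidate &FC1) {
  // The exiting block of FC0 now jumps straight to the header of FC1.
  redirect(G, FC0.Exiting, FC1.Preheader, FC1.Header);
  // 1. The preheader of FC1 has no predecessors left, so drop it.
  G.erase(FC1.Preheader);
  // 2. The latch of FC0 jumps to the header of FC1.
  redirect(G, FC0.Latch, FC0.Header, FC1.Header);
  // 3. The latch of FC1 jumps back to the header of FC0.
  redirect(G, FC1.Latch, FC1.Header, FC0.Header);
}
```

After calling `fuse` on the graph from `makeTwoLoops`, one trip around the loop visits `h0`, `body0`, `latch0`, `h1`, `body1`, `latch1` and returns to `h0`. The exit path runs from `h0` through `h1` to `exit`, which is only sound because both loops have the same trip count.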

llvm/lib/Transforms/Scalar/Scalar.cpp

[56 lines not shown]
  void llvm::initializeScalarOpts(PassRegistry &Registry) {
  initializeFlattenCFGPassPass(Registry);
  initializeIRCELegacyPassPass(Registry);
  initializeIndVarSimplifyLegacyPassPass(Registry);
  initializeInferAddressSpacesPass(Registry);
  initializeInstSimplifyLegacyPassPass(Registry);
  initializeJumpThreadingPass(Registry);
  initializeLegacyLICMPassPass(Registry);
  initializeLegacyLoopSinkPassPass(Registry);
+ initializeLoopFuseLegacyPass(Registry);
  initializeLoopDataPrefetchLegacyPassPass(Registry);
  initializeLoopDeletionLegacyPassPass(Registry);
  initializeLoopAccessLegacyAnalysisPass(Registry);
  initializeLoopInstSimplifyLegacyPassPass(Registry);
  initializeLoopInterchangePass(Registry);
  initializeLoopPredicationLegacyPassPass(Registry);
  initializeLoopRotateLegacyPassPass(Registry);
  initializeLoopStrengthReducePass(Registry);
[216 lines not shown]

llvm/test/Transforms/LoopFusion/cannot_fuse.ll

This file was added.

				; RUN: opt -S -loop-fusion -debug-only=loop-fusion -disable-output < %s 2>&1 | FileCheck %s
				; REQUIRES: asserts
				Meinersbur: [serious] If you are testing for `llvm::dbgs()` output, you need to add a `REQUIRES: asserts` line. Another problem is that `2>&1` mixes stdout and stderr output. stdout is buffered, stderr is not. How these are merged by `2>&1` is undefined. You can switch off stdout by using `opt -disable-output`. If possible, using `opt -analyze` is a better way to verify pass-specific output because adding another `LLVM_DEBUG` during debugging sessions can lead to test case failures. However, checking `-debug` output is common in the llvm tests.
				kbarton (Author): I don't see an (easy) way to use analyze to get this information. For now, I've added the REQUIRES: asserts. I'll think if there is a way to use analyze in the future to accomplish this.

				@B = common global [1024 x i32] zeroinitializer, align 16

				; CHECK that the two candidates for fusion are placed into separate candidate
				; sets because they are not control flow equivalent.

				; CHECK: Performing Loop Fusion on function non_cfe
				; CHECK: Fusion Candidates:
				; CHECK: * Fusion Candidate Set *
				; CHECK: bb
				; CHECK: ****************************
				; CHECK: * Fusion Candidate Set *
				; CHECK: bb20.preheader
				; CHECK: ****************************
				; CHECK: Loop Fusion complete
				define void @non_cfe(i32* noalias %arg) {
				bb:
				br label %bb5

				bb5: ; preds = %bb14, %bb
				%indvars.iv2 = phi i64 [ %indvars.iv.next3, %bb14 ], [ 0, %bb ]
				%.01 = phi i32 [ 0, %bb ], [ %tmp15, %bb14 ]
				%exitcond4 = icmp ne i64 %indvars.iv2, 100
				br i1 %exitcond4, label %bb7, label %bb16

				bb7: ; preds = %bb5
				%tmp = add nsw i32 %.01, -3
				%tmp8 = add nuw nsw i64 %indvars.iv2, 3
				%tmp9 = trunc i64 %tmp8 to i32
				%tmp10 = mul nsw i32 %tmp, %tmp9
				%tmp11 = trunc i64 %indvars.iv2 to i32
				%tmp12 = srem i32 %tmp10, %tmp11
				%tmp13 = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv2
				store i32 %tmp12, i32* %tmp13, align 4
				br label %bb14

				bb14: ; preds = %bb7
				%indvars.iv.next3 = add nuw nsw i64 %indvars.iv2, 1
				%tmp15 = add nuw nsw i32 %.01, 1
				br label %bb5

				bb16: ; preds = %bb5
				%tmp17 = load i32, i32* %arg, align 4
				%tmp18 = icmp slt i32 %tmp17, 0
				br i1 %tmp18, label %bb20, label %bb33

				bb20: ; preds = %bb30, %bb16
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb30 ], [ 0, %bb16 ]
				%.0 = phi i32 [ 0, %bb16 ], [ %tmp31, %bb30 ]
				%exitcond = icmp ne i64 %indvars.iv, 100
				br i1 %exitcond, label %bb22, label %bb33

				bb22: ; preds = %bb20
				%tmp23 = add nsw i32 %.0, -3
				%tmp24 = add nuw nsw i64 %indvars.iv, 3
				%tmp25 = trunc i64 %tmp24 to i32
				%tmp26 = mul nsw i32 %tmp23, %tmp25
				%tmp27 = trunc i64 %indvars.iv to i32
				%tmp28 = srem i32 %tmp26, %tmp27
				%tmp29 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %indvars.iv
				store i32 %tmp28, i32* %tmp29, align 4
				br label %bb30

				bb30: ; preds = %bb22
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%tmp31 = add nuw nsw i32 %.0, 1
				br label %bb20

				bb33: ; preds = %bb20, %bb16
				ret void
				}

				; Check that fusion detects the two candidates are not adjacent (the exit block
				; of the first candidate is not the preheader of the second candidate).

				Meinersbur: [remark] I think these can be removed.
				; CHECK: Performing Loop Fusion on function non_adjacent
				; CHECK: Fusion Candidates:
				; CHECK: * Fusion Candidate Set *
				; CHECK-NEXT: [[LOOP1PREHEADER:bb[0-9]*]]
				; CHECK-NEXT: [[LOOP2PREHEADER:bb[0-9]*]]
				; CHECK-NEXT: ****************************
				; CHECK: Attempting fusion on Candidate Set:
				; CHECK-NEXT: [[LOOP1PREHEADER]]
				; CHECK-NEXT: [[LOOP2PREHEADER]]
				; CHECK: Fusion candidates are not adjacent. Not fusing.
				; CHECK: Loop Fusion complete
				define void @non_adjacent(i32* noalias %arg) {
				bb:
				br label %bb3

				bb3: ; preds = %bb11, %bb
				%.01 = phi i64 [ 0, %bb ], [ %tmp12, %bb11 ]
				%exitcond2 = icmp ne i64 %.01, 100
				br i1 %exitcond2, label %bb5, label %bb4

				bb4: ; preds = %bb3
				br label %bb13

				bb5: ; preds = %bb3
				%tmp = add nsw i64 %.01, -3
				%tmp6 = add nuw nsw i64 %.01, 3
				%tmp7 = mul nsw i64 %tmp, %tmp6
				%tmp8 = srem i64 %tmp7, %.01
				%tmp9 = trunc i64 %tmp8 to i32
				%tmp10 = getelementptr inbounds i32, i32* %arg, i64 %.01
				store i32 %tmp9, i32* %tmp10, align 4
				br label %bb11

				bb11: ; preds = %bb5
				%tmp12 = add nuw nsw i64 %.01, 1
				br label %bb3

				bb13: ; preds = %bb4
				br label %bb14

				bb14: ; preds = %bb23, %bb13
				%.0 = phi i64 [ 0, %bb13 ], [ %tmp24, %bb23 ]
				%exitcond = icmp ne i64 %.0, 100
				br i1 %exitcond, label %bb16, label %bb15

				bb15: ; preds = %bb14
				br label %bb25

				bb16: ; preds = %bb14
				%tmp17 = add nsw i64 %.0, -3
				%tmp18 = add nuw nsw i64 %.0, 3
				%tmp19 = mul nsw i64 %tmp17, %tmp18
				%tmp20 = srem i64 %tmp19, %.0
				%tmp21 = trunc i64 %tmp20 to i32
				%tmp22 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %.0
				store i32 %tmp21, i32* %tmp22, align 4
				br label %bb23

				bb23: ; preds = %bb16
				%tmp24 = add nuw nsw i64 %.0, 1
				br label %bb14

				bb25: ; preds = %bb15
				ret void
				}

				; Check that the different bounds are detected and prevent fusion.

				; CHECK: Performing Loop Fusion on function different_bounds
				; CHECK: Fusion Candidates:
				; CHECK: * Fusion Candidate Set *
				; CHECK-NEXT: [[LOOP1PREHEADER:bb[0-9]*]]
				; CHECK-NEXT: [[LOOP2PREHEADER:bb[0-9]*]]
				; CHECK-NEXT: ****************************
				; CHECK: Attempting fusion on Candidate Set:
				; CHECK-NEXT: [[LOOP1PREHEADER]]
				; CHECK-NEXT: [[LOOP2PREHEADER]]
				; CHECK: Fusion candidates do not have identical trip counts. Not fusing.
				; CHECK: Loop Fusion complete
				define void @different_bounds(i32* noalias %arg) {
				bb:
				br label %bb3

				bb3: ; preds = %bb11, %bb
				%.01 = phi i64 [ 0, %bb ], [ %tmp12, %bb11 ]
				%exitcond2 = icmp ne i64 %.01, 100
				br i1 %exitcond2, label %bb5, label %bb4

				bb4: ; preds = %bb3
				br label %bb13

				bb5: ; preds = %bb3
				%tmp = add nsw i64 %.01, -3
				%tmp6 = add nuw nsw i64 %.01, 3
				%tmp7 = mul nsw i64 %tmp, %tmp6
				%tmp8 = srem i64 %tmp7, %.01
				%tmp9 = trunc i64 %tmp8 to i32
				%tmp10 = getelementptr inbounds i32, i32* %arg, i64 %.01
				store i32 %tmp9, i32* %tmp10, align 4
				br label %bb11

				bb11: ; preds = %bb5
				%tmp12 = add nuw nsw i64 %.01, 1
				br label %bb3

				bb13: ; preds = %bb4
				br label %bb14

				bb14: ; preds = %bb23, %bb13
				%.0 = phi i64 [ 0, %bb13 ], [ %tmp24, %bb23 ]
				%exitcond = icmp ne i64 %.0, 200
				br i1 %exitcond, label %bb16, label %bb15

				bb15: ; preds = %bb14
				br label %bb25

				bb16: ; preds = %bb14
				%tmp17 = add nsw i64 %.0, -3
				%tmp18 = add nuw nsw i64 %.0, 3
				%tmp19 = mul nsw i64 %tmp17, %tmp18
				%tmp20 = srem i64 %tmp19, %.0
				%tmp21 = trunc i64 %tmp20 to i32
				%tmp22 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %.0
				store i32 %tmp21, i32* %tmp22, align 4
				br label %bb23

				bb23: ; preds = %bb16
				%tmp24 = add nuw nsw i64 %.0, 1
				br label %bb14

				bb25: ; preds = %bb15
				ret void
				}

				; Check that the negative dependence between the two candidates is identified
				; and prevents fusion.

				; CHECK: Performing Loop Fusion on function negative_dependence
				; CHECK: Fusion Candidates:
				; CHECK: * Fusion Candidate Set *
				; CHECK-NEXT: [[LOOP1PREHEADER:bb[0-9]*]]
				; CHECK-NEXT: [[LOOP2PREHEADER:bb[0-9]*]]
				; CHECK-NEXT: ****************************
				; CHECK: Attempting fusion on Candidate Set:
				; CHECK-NEXT: [[LOOP1PREHEADER]]
				; CHECK-NEXT: [[LOOP2PREHEADER]]
				; CHECK: Memory dependencies do not allow fusion!
				; CHECK: Loop Fusion complete
				define void @negative_dependence(i32* noalias %arg) {
				bb:
				br label %bb5

				bb5: ; preds = %bb9, %bb
				%indvars.iv2 = phi i64 [ %indvars.iv.next3, %bb9 ], [ 0, %bb ]
				%exitcond4 = icmp ne i64 %indvars.iv2, 100
				br i1 %exitcond4, label %bb7, label %bb11

				bb7: ; preds = %bb5
				%tmp = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv2
				%tmp8 = trunc i64 %indvars.iv2 to i32
				store i32 %tmp8, i32* %tmp, align 4
				br label %bb9

				bb9: ; preds = %bb7
				%indvars.iv.next3 = add nuw nsw i64 %indvars.iv2, 1
				br label %bb5

				bb11: ; preds = %bb18, %bb5
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb18 ], [ 0, %bb5 ]
				%exitcond = icmp ne i64 %indvars.iv, 100
				br i1 %exitcond, label %bb13, label %bb19

				bb13: ; preds = %bb11
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%tmp14 = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv.next
				%tmp15 = load i32, i32* %tmp14, align 4
				%tmp16 = shl nsw i32 %tmp15, 1
				%tmp17 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %indvars.iv
				store i32 %tmp16, i32* %tmp17, align 4
				br label %bb18

				bb18: ; preds = %bb13
				br label %bb11

				bb19: ; preds = %bb11
				ret void
				}
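
To see why the `negative_dependence` case above has to be rejected, its two loops can be mirrored in plain C++. This is a hedged sketch, not part of the patch: the names `runSeparate` and `runFused`, the zero-initialized buffer and its one extra element are assumptions made for illustration. The first loop writes `A[i]` and the second reads `A[i + 1]`, so a naively fused iteration `i` would read an element that the first loop body only writes in iteration `i + 1`.

```cpp
#include <array>
#include <cassert>

constexpr int N = 100;
// One extra element so that A[i + 1] stays in bounds for i == N - 1.
using Buffer = std::array<int, N + 1>;

// Original order: the first loop completes before the second one starts.
std::array<int, N> runSeparate() {
  Buffer A{};
  std::array<int, N> B{};
  for (int i = 0; i < N; ++i)
    A[i] = i;
  for (int i = 0; i < N; ++i)
    B[i] = A[i + 1] * 2; // reads values already written by the first loop
  return B;
}

// Naively fused: both bodies share a single loop.
std::array<int, N> runFused() {
  Buffer A{};
  std::array<int, N> B{};
  for (int i = 0; i < N; ++i) {
    A[i] = i;
    B[i] = A[i + 1] * 2; // A[i + 1] has not been written yet
  }
  return B;
}
```

The two functions return different arrays, which is the behaviour change the negative distance dependence check is meant to prevent.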

				;;
				; CHECK: Performing Loop Fusion on function sumTest
				; CHECK: Fusion Candidates:
				; CHECK: * Fusion Candidate Set *
				; CHECK-NEXT: [[LOOP1PREHEADER:bb[0-9]*]]
				; CHECK-NEXT: [[LOOP2PREHEADER:bb[0-9]*]]
				; CHECK-NEXT: ****************************
				; CHECK: Attempting fusion on Candidate Set:
				Meinersbur: [remark] Are these necessary for the test?
				; CHECK-NEXT: [[LOOP1PREHEADER]]
				; CHECK-NEXT: [[LOOP2PREHEADER]]
				; CHECK: Memory dependencies do not allow fusion!
				; CHECK: Loop Fusion complete
				define i32 @sumTest(i32* noalias %arg) {
				bb:
				br label %bb6

				bb6: ; preds = %bb9, %bb
				%indvars.iv3 = phi i64 [ %indvars.iv.next4, %bb9 ], [ 0, %bb ]
				dmgreen: You can run something like -loop-rotate to make this into a more simply structured loop. -instcombine will remove any lcssa nodes, and -simplify-cfg can clean things up too. I guess that will depend on what types of loops you want to test, ones that have been cleaned up or ones that are a little less structured. It's good to test both, but which is more important will depend on where in the pass pipeline this ends up.
				kbarton (Author): I completely agree. In fact we have a subsequent patch to post once this one lands that restricts loop fusion to only run on rotated loops. This seems to be a common theme among many loop optimizations, and restricting to rotated loops simplifies the implementation in several places. I'm planning on discussing this at my presentation at EuroLLVM in a couple weeks :)
				%.01 = phi i32 [ 0, %bb ], [ %tmp11, %bb9 ]
				%exitcond5 = icmp ne i64 %indvars.iv3, 100
				br i1 %exitcond5, label %bb9, label %bb13

				bb9: ; preds = %bb6
				%tmp = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv3
				%tmp10 = load i32, i32* %tmp, align 4
				%tmp11 = add nsw i32 %.01, %tmp10
				%indvars.iv.next4 = add nuw nsw i64 %indvars.iv3, 1
				br label %bb6

				bb13: ; preds = %bb20, %bb6
				%.01.lcssa = phi i32 [ %.01, %bb6 ], [ %.01.lcssa, %bb20 ]
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb20 ], [ 0, %bb6 ]
				%exitcond = icmp ne i64 %indvars.iv, 100
				br i1 %exitcond, label %bb15, label %bb14

				bb14: ; preds = %bb13
				br label %bb21

				bb15: ; preds = %bb13
				%tmp16 = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv
				%tmp17 = load i32, i32* %tmp16, align 4
				%tmp18 = sdiv i32 %tmp17, %.01.lcssa
				%tmp19 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %indvars.iv
				store i32 %tmp18, i32* %tmp19, align 4
				br label %bb20

				bb20: ; preds = %bb15
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				br label %bb13

				bb21: ; preds = %bb14
				ret i32 %.01.lcssa
				}
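For readers following the IR, the `sumTest` case can be paraphrased in C. This is a hand-written reconstruction and not part of the patch; the function names `sum_test_unfused` and `sum_test_naively_fused` are illustrative assumptions. It sketches one way to see why the expected output is "Memory dependencies do not allow fusion!": the second loop divides by the final value of the reduction computed by the first loop, so merging the two bodies would divide by a partial sum instead.

```c
#include <stddef.h>

#define N 100

/* Two separate loops, mirroring the structure of @sumTest.
   Assumes the elements of arg sum to a non-zero value. */
int sum_test_unfused(const int *arg, int *b) {
    int sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += arg[i];
    for (size_t i = 0; i < N; i++)
        b[i] = arg[i] / sum;      /* uses the final sum */
    return sum;
}

/* Hypothetical, incorrect fusion: each iteration divides by the
   partial sum accumulated so far, which changes the stored values. */
int sum_test_naively_fused(const int *arg, int *b) {
    int sum = 0;
    for (size_t i = 0; i < N; i++) {
        sum += arg[i];
        b[i] = arg[i] / sum;      /* uses a partial sum */
    }
    return sum;
}
```

With an input of all ones, both versions return the same sum, but the first stored element differs (1/100 versus 1/1 in integer arithmetic), which is the kind of behavior change the legality check has to rule out.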

llvm/test/Transforms/LoopFusion/four_loops.ll

This file was added.

				; RUN: opt -S -loop-fusion < %s | FileCheck %s

				@A = common global [1024 x i32] zeroinitializer, align 16
				@B = common global [1024 x i32] zeroinitializer, align 16
				@C = common global [1024 x i32] zeroinitializer, align 16
				@D = common global [1024 x i32] zeroinitializer, align 16

				; CHECK: void @dep_free
				; CHECK-NEXT: bb:
				; CHECK-NEXT: br label %[[LOOP1HEADER:bb[0-9]+]]
				; CHECK: [[LOOP1HEADER]]
				; CHECK: br i1 %exitcond12, label %[[LOOP1BODY:bb[0-9]+]], label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP1BODY]]
				; CHECK: br label %[[LOOP1LATCH:bb[0-9]+]]
				; CHECK: [[LOOP1LATCH]]
				; CHECK: br label %[[LOOP2PREHEADER]]
				; CHECK: [[LOOP2PREHEADER]]
				; CHECK: br i1 %exitcond9, label %[[LOOP2HEADER:bb[0-9]+]], label %[[LOOP3PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP2HEADER]]
				; CHECK: br label %[[LOOP2LATCH:bb[0-9]+]]
				; CHECK: [[LOOP2LATCH]]
				; CHECK: br label %[[LOOP3PREHEADER]]
				; CHECK: [[LOOP3PREHEADER]]
				; CHECK: br i1 %exitcond6, label %[[LOOP3HEADER:bb[0-9]+]], label %[[LOOP4PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP3HEADER]]
				; CHECK: br label %[[LOOP3LATCH:bb[0-9]+]]
				; CHECK: [[LOOP3LATCH]]
				; CHECK: br label %[[LOOP4PREHEADER]]
				; CHECK: [[LOOP4PREHEADER]]
				; CHECK: br i1 %exitcond, label %[[LOOP4HEADER:bb[0-9]+]], label %[[LOOP4EXIT:bb[0-9]+]]
				; CHECK: [[LOOP4EXIT]]
				; CHECK: br label %[[FUNCEXIT:bb[0-9]+]]
				; CHECK: [[LOOP4HEADER]]
				; CHECK: br label %[[LOOP4LATCH:bb[0-9]+]]
				; CHECK: [[LOOP4LATCH]]
				; CHECK: br label %[[LOOP1HEADER]]
				; CHECK: [[FUNCEXIT]]
				; CHECK: ret void
				define void @dep_free() {
				bb:
				br label %bb13

				bb13: ; preds = %bb22, %bb
				%indvars.iv10 = phi i64 [ %indvars.iv.next11, %bb22 ], [ 0, %bb ]
				%.0 = phi i32 [ 0, %bb ], [ %tmp23, %bb22 ]
				%exitcond12 = icmp ne i64 %indvars.iv10, 100
				br i1 %exitcond12, label %bb15, label %bb25

				bb15: ; preds = %bb13
				%tmp = add nsw i32 %.0, -3
				%tmp16 = add nuw nsw i64 %indvars.iv10, 3
				%tmp17 = trunc i64 %tmp16 to i32
				%tmp18 = mul nsw i32 %tmp, %tmp17
				%tmp19 = trunc i64 %indvars.iv10 to i32
				%tmp20 = srem i32 %tmp18, %tmp19
				%tmp21 = getelementptr inbounds [1024 x i32], [1024 x i32]* @A, i64 0, i64 %indvars.iv10
				store i32 %tmp20, i32* %tmp21, align 4
				br label %bb22

				bb22: ; preds = %bb15
				%indvars.iv.next11 = add nuw nsw i64 %indvars.iv10, 1
				%tmp23 = add nuw nsw i32 %.0, 1
				br label %bb13

				bb25: ; preds = %bb35, %bb13
				%indvars.iv7 = phi i64 [ %indvars.iv.next8, %bb35 ], [ 0, %bb13 ]
				%.01 = phi i32 [ 0, %bb13 ], [ %tmp36, %bb35 ]
				%exitcond9 = icmp ne i64 %indvars.iv7, 100
				br i1 %exitcond9, label %bb27, label %bb38

				bb27: ; preds = %bb25
				%tmp28 = add nsw i32 %.01, -3
				%tmp29 = add nuw nsw i64 %indvars.iv7, 3
				%tmp30 = trunc i64 %tmp29 to i32
				%tmp31 = mul nsw i32 %tmp28, %tmp30
				%tmp32 = trunc i64 %indvars.iv7 to i32
				%tmp33 = srem i32 %tmp31, %tmp32
				%tmp34 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %indvars.iv7
				store i32 %tmp33, i32* %tmp34, align 4
				br label %bb35

				bb35: ; preds = %bb27
				%indvars.iv.next8 = add nuw nsw i64 %indvars.iv7, 1
				%tmp36 = add nuw nsw i32 %.01, 1
				br label %bb25

				bb38: ; preds = %bb48, %bb25
				%indvars.iv4 = phi i64 [ %indvars.iv.next5, %bb48 ], [ 0, %bb25 ]
				%.02 = phi i32 [ 0, %bb25 ], [ %tmp49, %bb48 ]
				%exitcond6 = icmp ne i64 %indvars.iv4, 100
				br i1 %exitcond6, label %bb40, label %bb51

				bb40: ; preds = %bb38
				%tmp41 = add nsw i32 %.02, -3
				%tmp42 = add nuw nsw i64 %indvars.iv4, 3
				%tmp43 = trunc i64 %tmp42 to i32
				%tmp44 = mul nsw i32 %tmp41, %tmp43
				%tmp45 = trunc i64 %indvars.iv4 to i32
				%tmp46 = srem i32 %tmp44, %tmp45
				%tmp47 = getelementptr inbounds [1024 x i32], [1024 x i32]* @C, i64 0, i64 %indvars.iv4
				store i32 %tmp46, i32* %tmp47, align 4
				br label %bb48

				bb48: ; preds = %bb40
				%indvars.iv.next5 = add nuw nsw i64 %indvars.iv4, 1
				%tmp49 = add nuw nsw i32 %.02, 1
				br label %bb38

				bb51: ; preds = %bb61, %bb38
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb61 ], [ 0, %bb38 ]
				%.03 = phi i32 [ 0, %bb38 ], [ %tmp62, %bb61 ]
				%exitcond = icmp ne i64 %indvars.iv, 100
				br i1 %exitcond, label %bb53, label %bb52

				bb52: ; preds = %bb51
				br label %bb63

				bb53: ; preds = %bb51
				%tmp54 = add nsw i32 %.03, -3
				%tmp55 = add nuw nsw i64 %indvars.iv, 3
				%tmp56 = trunc i64 %tmp55 to i32
				%tmp57 = mul nsw i32 %tmp54, %tmp56
				%tmp58 = trunc i64 %indvars.iv to i32
				%tmp59 = srem i32 %tmp57, %tmp58
				%tmp60 = getelementptr inbounds [1024 x i32], [1024 x i32]* @D, i64 0, i64 %indvars.iv
				store i32 %tmp59, i32* %tmp60, align 4
				br label %bb61

				bb61: ; preds = %bb53
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%tmp62 = add nuw nsw i32 %.03, 1
				br label %bb51

				bb63: ; preds = %bb52
				ret void
				}
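The CHECK lines in this file expect the latch of the fourth loop to branch back to the header of the first, i.e. all four loops end up as one. A rough source-level picture of that result is sketched below. It is an illustration only: the function names are assumptions, and `body` is a stand-in for the computation in the test (the `% i` step is left out so the sketch stays well-defined at `i == 0`).

```c
#define N 100

/* Stand-in for the per-iteration computation (assumption: the modulo
   used in the test is omitted here). */
static int body(int i) { return (i - 3) * (i + 3); }

/* Four adjacent loops with identical bounds and independent stores. */
void dep_free_unfused(int *a, int *b, int *c, int *d) {
    for (int i = 0; i < N; i++) a[i] = body(i);
    for (int i = 0; i < N; i++) b[i] = body(i);
    for (int i = 0; i < N; i++) c[i] = body(i);
    for (int i = 0; i < N; i++) d[i] = body(i);
}

/* Source-level equivalent of the fused result: one loop whose body
   runs the four original bodies in their original order. */
void dep_free_fused(int *a, int *b, int *c, int *d) {
    for (int i = 0; i < N; i++) {
        a[i] = body(i);
        b[i] = body(i);
        c[i] = body(i);
        d[i] = body(i);
    }
}
```

Because each loop writes a different array and reads none of them, both versions leave the arrays in the same state.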

llvm/test/Transforms/LoopFusion/inner_loops.ll

This file was added.

				; RUN: opt -S -loop-fusion < %s 2>&1 | FileCheck %s

				@A = common global [1024 x [1024 x i32]] zeroinitializer, align 16
				@B = common global [1024 x [1024 x i32]] zeroinitializer, align 16

				; CHECK: void @dep_free
				; CHECK-NEXT: bb:
				; CHECK-NEXT: br label %[[LOOP1HEADER:bb[0-9]*]]
				; CHECK: [[LOOP1HEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP1BODY:bb[0-9]*]], label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP1BODY]]
				; CHECK: br label %[[LOOP1LATCH:bb[0-9]*]]
				; CHECK: [[LOOP1LATCH]]
				; CHECK: br label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP2PREHEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP2BODY:bb[0-9]*]], label %[[LOOP2EXIT:bb[0-9]*]]
				; CHECK: [[LOOP2BODY]]
				; CHECK: br label %[[LOOP2LATCH:bb[0-9]+]]
				; CHECK: [[LOOP2LATCH]]
				; CHECK: br label %[[LOOP1HEADER]]
				; CHECK: ret void

				define void @dep_free() {
				bb:
				br label %bb9

				bb9: ; preds = %bb35, %bb
				%indvars.iv6 = phi i64 [ %indvars.iv.next7, %bb35 ], [ 0, %bb ]
				%.0 = phi i32 [ 0, %bb ], [ %tmp36, %bb35 ]
				%exitcond8 = icmp ne i64 %indvars.iv6, 100
				br i1 %exitcond8, label %bb11, label %bb10

				bb10: ; preds = %bb9
				br label %bb37

				bb11: ; preds = %bb9
				br label %bb12

				bb12: ; preds = %bb21, %bb11
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb21 ], [ 0, %bb11 ]
				%exitcond = icmp ne i64 %indvars.iv, 100
				br i1 %exitcond, label %bb14, label %bb23

				bb14: ; preds = %bb12
				%tmp = add nsw i32 %.0, -3
				%tmp15 = add nuw nsw i64 %indvars.iv6, 3
				%tmp16 = trunc i64 %tmp15 to i32
				%tmp17 = mul nsw i32 %tmp, %tmp16
				%tmp18 = trunc i64 %indvars.iv6 to i32
				%tmp19 = srem i32 %tmp17, %tmp18
				%tmp20 = getelementptr inbounds [1024 x [1024 x i32]], [1024 x [1024 x i32]]* @A, i64 0, i64 %indvars.iv6, i64 %indvars.iv
				store i32 %tmp19, i32* %tmp20, align 4
				br label %bb21

				bb21: ; preds = %bb14
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				br label %bb12

				bb23: ; preds = %bb33, %bb12
				%indvars.iv3 = phi i64 [ %indvars.iv.next4, %bb33 ], [ 0, %bb12 ]
				%exitcond5 = icmp ne i64 %indvars.iv3, 100
				br i1 %exitcond5, label %bb25, label %bb35

				bb25: ; preds = %bb23
				%tmp26 = add nsw i32 %.0, -3
				%tmp27 = add nuw nsw i64 %indvars.iv6, 3
				%tmp28 = trunc i64 %tmp27 to i32
				%tmp29 = mul nsw i32 %tmp26, %tmp28
				%tmp30 = trunc i64 %indvars.iv6 to i32
				%tmp31 = srem i32 %tmp29, %tmp30
				%tmp32 = getelementptr inbounds [1024 x [1024 x i32]], [1024 x [1024 x i32]]* @B, i64 0, i64 %indvars.iv6, i64 %indvars.iv3
				store i32 %tmp31, i32* %tmp32, align 4
				br label %bb33

				bb33: ; preds = %bb25
				%indvars.iv.next4 = add nuw nsw i64 %indvars.iv3, 1
				br label %bb23

				bb35: ; preds = %bb23
				%indvars.iv.next7 = add nuw nsw i64 %indvars.iv6, 1
				%tmp36 = add nuw nsw i32 %.0, 1
				br label %bb9

				bb37: ; preds = %bb10
				ret void
				}

llvm/test/Transforms/LoopFusion/loop_nest.ll

This file was added.

				; RUN: opt -S -loop-fusion < %s | FileCheck %s
				;
				; int A[1024][1024];
				; int B[1024][1024];
				;
				; #define EXPENSIVE_PURE_COMPUTATION(i) ((i - 3) * (i + 3) % i)
				;
				; void dep_free() {
				;
				;   for (int i = 0; i < 100; i++)
				;     for (int j = 0; j < 100; j++)
				;       A[i][j] = EXPENSIVE_PURE_COMPUTATION(i);
				;
				;   for (int i = 0; i < 100; i++)
				;     for (int j = 0; j < 100; j++)
				;       B[i][j] = EXPENSIVE_PURE_COMPUTATION(i);
				; }
				;
				@A = common global [1024 x [1024 x i32]] zeroinitializer, align 16
				@B = common global [1024 x [1024 x i32]] zeroinitializer, align 16

				; CHECK: void @dep_free
				; CHECK-NEXT: bb:
				; CHECK-NEXT: br label %[[LOOP1HEADER:bb[0-9]+]]
				; CHECK: [[LOOP1HEADER]]
				; CHECK: br i1 %exitcond12, label %[[LOOP3PREHEADER:bb[0-9]+.preheader]], label %[[LOOP2HEADER:bb[0-9]+]]
				; CHECK: [[LOOP3PREHEADER]]
				; CHECK: br label %[[LOOP3HEADER:bb[0-9]+]]
				; CHECK: [[LOOP3HEADER]]
				; CHECK: br i1 %exitcond9, label %[[LOOP3BODY:bb[0-9]+]], label %[[LOOP1LATCH:bb[0-9]+]]
				; CHECK: [[LOOP1LATCH]]
				; CHECK: br label %[[LOOP2HEADER:bb[0-9]+]]
				; CHECK: [[LOOP2HEADER]]
				; CHECK: br i1 %exitcond6, label %[[LOOP4PREHEADER:bb[0-9]+.preheader]], label %[[LOOP2EXITBLOCK:bb[0-9]+]]
				; CHECK: [[LOOP4PREHEADER]]
				; CHECK: br label %[[LOOP4HEADER:bb[0-9]+]]
				; CHECK: [[LOOP2EXITBLOCK]]
				; CHECK-NEXT: br label %[[FUNCEXIT:bb[0-9]+]]
				; CHECK: [[LOOP4HEADER]]
				; CHECK: br i1 %exitcond, label %[[LOOP4BODY:bb[0-9]+]], label %[[LOOP2LATCH:bb[0-9]+]]
				; CHECK: [[LOOP2LATCH]]
				; CHECK: br label %[[LOOP1HEADER:bb[0-9]+]]
				; CHECK: [[FUNCEXIT]]
				; CHECK: ret void

				; TODO: The current version of loop fusion does not allow the inner loops to be
				; fused because they are not control flow equivalent and adjacent. These are
				; limitations that can be addressed in future improvements to fusion.
				define void @dep_free() {
				bb:
				br label %bb13

				bb13: ; preds = %bb27, %bb
				%indvars.iv10 = phi i64 [ %indvars.iv.next11, %bb27 ], [ 0, %bb ]
				%.0 = phi i32 [ 0, %bb ], [ %tmp28, %bb27 ]
				%exitcond12 = icmp ne i64 %indvars.iv10, 100
				br i1 %exitcond12, label %bb16, label %bb30

				bb16: ; preds = %bb25, %bb13
				%indvars.iv7 = phi i64 [ %indvars.iv.next8, %bb25 ], [ 0, %bb13 ]
				%exitcond9 = icmp ne i64 %indvars.iv7, 100
				br i1 %exitcond9, label %bb18, label %bb27

				bb18: ; preds = %bb16
				%tmp = add nsw i32 %.0, -3
				%tmp19 = add nuw nsw i64 %indvars.iv10, 3
				%tmp20 = trunc i64 %tmp19 to i32
				%tmp21 = mul nsw i32 %tmp, %tmp20
				%tmp22 = trunc i64 %indvars.iv10 to i32
				%tmp23 = srem i32 %tmp21, %tmp22
				%tmp24 = getelementptr inbounds [1024 x [1024 x i32]], [1024 x [1024 x i32]]* @A, i64 0, i64 %indvars.iv10, i64 %indvars.iv7
				store i32 %tmp23, i32* %tmp24, align 4
				br label %bb25

				bb25: ; preds = %bb18
				%indvars.iv.next8 = add nuw nsw i64 %indvars.iv7, 1
				br label %bb16

				bb27: ; preds = %bb16
				%indvars.iv.next11 = add nuw nsw i64 %indvars.iv10, 1
				%tmp28 = add nuw nsw i32 %.0, 1
				br label %bb13

				bb30: ; preds = %bb45, %bb13
				%indvars.iv4 = phi i64 [ %indvars.iv.next5, %bb45 ], [ 0, %bb13 ]
				%.02 = phi i32 [ 0, %bb13 ], [ %tmp46, %bb45 ]
				%exitcond6 = icmp ne i64 %indvars.iv4, 100
				br i1 %exitcond6, label %bb33, label %bb31

				bb31: ; preds = %bb30
				br label %bb47

				bb33: ; preds = %bb43, %bb30
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb43 ], [ 0, %bb30 ]
				%exitcond = icmp ne i64 %indvars.iv, 100
				br i1 %exitcond, label %bb35, label %bb45

				bb35: ; preds = %bb33
				%tmp36 = add nsw i32 %.02, -3
				%tmp37 = add nuw nsw i64 %indvars.iv4, 3
				%tmp38 = trunc i64 %tmp37 to i32
				%tmp39 = mul nsw i32 %tmp36, %tmp38
				%tmp40 = trunc i64 %indvars.iv4 to i32
				%tmp41 = srem i32 %tmp39, %tmp40
				%tmp42 = getelementptr inbounds [1024 x [1024 x i32]], [1024 x [1024 x i32]]* @B, i64 0, i64 %indvars.iv4, i64 %indvars.iv
				store i32 %tmp41, i32* %tmp42, align 4
				br label %bb43

				bb43: ; preds = %bb35
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				br label %bb33

				bb45: ; preds = %bb33
				%indvars.iv.next5 = add nuw nsw i64 %indvars.iv4, 1
				%tmp46 = add nuw nsw i32 %.02, 1
				br label %bb30

				bb47: ; preds = %bb31
				ret void
				}
				Meinersbur: [remark] Can these be removed?
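The CHECK lines and the TODO in this test describe a partial result: the two outer loops are fused, while the two inner loops remain separate loops inside the fused body. A source-level sketch of that shape follows. The function names, the reduced array dimension `DIM`, and the stand-in `body` (which drops the `% i` step so that `i == 0` is well-defined) are assumptions made for illustration.

```c
#define N 100
#define DIM 128   /* assumption: smaller than the 1024 used in the test */

/* Stand-in for EXPENSIVE_PURE_COMPUTATION(i), without the modulo. */
static int body(int i) { return (i - 3) * (i + 3); }

/* Original shape: two separate loop nests. */
void nest_unfused(int a[DIM][DIM], int b[DIM][DIM]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = body(i);
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            b[i][j] = body(i);
}

/* Shape the test expects: the outer loops share one iteration space,
   the inner loops are still two loops. */
void nest_outer_fused(int a[DIM][DIM], int b[DIM][DIM]) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            a[i][j] = body(i);
        for (int j = 0; j < N; j++)
            b[i][j] = body(i);
    }
}
```

Fusing the inner loops as well would be the follow-up improvement the TODO refers to.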

llvm/test/Transforms/LoopFusion/simple.ll

This file was added.

				; RUN: opt -S -loop-fusion < %s | FileCheck %s

				Meinersbur: [suggestion] Remove `2>&1` (there is no stdout output here)
				@B = common global [1024 x i32] zeroinitializer, align 16

				; CHECK: void @dep_free
				; CHECK-NEXT: bb:
				; CHECK-NEXT: br label %[[LOOP1HEADER:bb[0-9]*]]
				; CHECK: [[LOOP1HEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP1BODY:bb[0-9]*]], label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP1BODY]]
				; CHECK: br label %[[LOOP1LATCH:bb[0-9]*]]
				; CHECK: [[LOOP1LATCH]]
				; CHECK: br label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP2PREHEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP2BODY:bb[0-9]*]], label %[[LOOP2EXIT:bb[0-9]*]]
				; CHECK: [[LOOP2BODY]]
				; CHECK: br label %[[LOOP2LATCH:bb[0-9]+]]
				; CHECK: [[LOOP2LATCH]]
				; CHECK: br label %[[LOOP1HEADER]]
				; CHECK: ret void
				define void @dep_free(i32* noalias %arg) {
				bb:
				br label %bb5

				bb5: ; preds = %bb14, %bb
				%indvars.iv2 = phi i64 [ %indvars.iv.next3, %bb14 ], [ 0, %bb ]
				%.01 = phi i32 [ 0, %bb ], [ %tmp15, %bb14 ]
				%exitcond4 = icmp ne i64 %indvars.iv2, 100
				br i1 %exitcond4, label %bb7, label %bb17

				bb7: ; preds = %bb5
				%tmp = add nsw i32 %.01, -3
				%tmp8 = add nuw nsw i64 %indvars.iv2, 3
				%tmp9 = trunc i64 %tmp8 to i32
				%tmp10 = mul nsw i32 %tmp, %tmp9
				%tmp11 = trunc i64 %indvars.iv2 to i32
				%tmp12 = srem i32 %tmp10, %tmp11
				%tmp13 = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv2
				store i32 %tmp12, i32* %tmp13, align 4
				br label %bb14

				bb14: ; preds = %bb7
				%indvars.iv.next3 = add nuw nsw i64 %indvars.iv2, 1
				%tmp15 = add nuw nsw i32 %.01, 1
				br label %bb5

				bb17: ; preds = %bb27, %bb5
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb27 ], [ 0, %bb5 ]
				%.0 = phi i32 [ 0, %bb5 ], [ %tmp28, %bb27 ]
				%exitcond = icmp ne i64 %indvars.iv, 100
				br i1 %exitcond, label %bb19, label %bb18

				bb18: ; preds = %bb17
				br label %bb29

				bb19: ; preds = %bb17
				%tmp20 = add nsw i32 %.0, -3
				%tmp21 = add nuw nsw i64 %indvars.iv, 3
				%tmp22 = trunc i64 %tmp21 to i32
				%tmp23 = mul nsw i32 %tmp20, %tmp22
				%tmp24 = trunc i64 %indvars.iv to i32
				%tmp25 = srem i32 %tmp23, %tmp24
				%tmp26 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %indvars.iv
				store i32 %tmp25, i32* %tmp26, align 4
				br label %bb27

				bb27: ; preds = %bb19
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%tmp28 = add nuw nsw i32 %.0, 1
				br label %bb17

				bb29: ; preds = %bb18
				ret void
				}

				; CHECK: void @dep_free_parametric
				; CHECK-NEXT: bb:
				; CHECK-NEXT: br label %[[LOOP1HEADER:bb[0-9]*]]
				; CHECK: [[LOOP1HEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP1BODY:bb[0-9]*]], label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP1BODY]]
				; CHECK: br label %[[LOOP1LATCH:bb[0-9]*]]
				; CHECK: [[LOOP1LATCH]]
				; CHECK: br label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP2PREHEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP2BODY:bb[0-9]*]], label %[[LOOP2EXIT:bb[0-9]*]]
				; CHECK: [[LOOP2BODY]]
				; CHECK: br label %[[LOOP2LATCH:bb[0-9]+]]
				; CHECK: [[LOOP2LATCH]]
				; CHECK: br label %[[LOOP1HEADER]]
				; CHECK: ret void
				define void @dep_free_parametric(i32* noalias %arg, i64 %arg2) {
				bb:
				br label %bb3

				bb3: ; preds = %bb12, %bb
				%.01 = phi i64 [ 0, %bb ], [ %tmp13, %bb12 ]
				%tmp = icmp slt i64 %.01, %arg2
				br i1 %tmp, label %bb5, label %bb15

				bb5: ; preds = %bb3
				%tmp6 = add nsw i64 %.01, -3
				%tmp7 = add nuw nsw i64 %.01, 3
				%tmp8 = mul nsw i64 %tmp6, %tmp7
				%tmp9 = srem i64 %tmp8, %.01
				%tmp10 = trunc i64 %tmp9 to i32
				%tmp11 = getelementptr inbounds i32, i32* %arg, i64 %.01
				store i32 %tmp10, i32* %tmp11, align 4
				br label %bb12

				bb12: ; preds = %bb5
				%tmp13 = add nuw nsw i64 %.01, 1
				br label %bb3

				bb15: ; preds = %bb25, %bb3
				%.0 = phi i64 [ 0, %bb3 ], [ %tmp26, %bb25 ]
				%tmp16 = icmp slt i64 %.0, %arg2
				br i1 %tmp16, label %bb18, label %bb17

				bb17: ; preds = %bb15
				br label %bb27

				bb18: ; preds = %bb15
				%tmp19 = add nsw i64 %.0, -3
				%tmp20 = add nuw nsw i64 %.0, 3
				%tmp21 = mul nsw i64 %tmp19, %tmp20
				%tmp22 = srem i64 %tmp21, %.0
				%tmp23 = trunc i64 %tmp22 to i32
				%tmp24 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %.0
				store i32 %tmp23, i32* %tmp24, align 4
				br label %bb25

				bb25: ; preds = %bb18
				%tmp26 = add nuw nsw i64 %.0, 1
				br label %bb15

				bb27: ; preds = %bb17
				ret void
				}

				; CHECK: void @raw_only
				; CHECK-NEXT: bb:
				; CHECK-NEXT: br label %[[LOOP1HEADER:bb[0-9]*]]
				; CHECK: [[LOOP1HEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP1BODY:bb[0-9]*]], label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP1BODY]]
				; CHECK: br label %[[LOOP1LATCH:bb[0-9]*]]
				; CHECK: [[LOOP1LATCH]]
				; CHECK: br label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP2PREHEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP2BODY:bb[0-9]*]], label %[[LOOP2EXIT:bb[0-9]*]]
				; CHECK: [[LOOP2BODY]]
				; CHECK: br label %[[LOOP2LATCH:bb[0-9]+]]
				; CHECK: [[LOOP2LATCH]]
				; CHECK: br label %[[LOOP1HEADER]]
				; CHECK: ret void
				define void @raw_only(i32* noalias %arg) {
				bb:
				br label %bb5

				bb5: ; preds = %bb9, %bb
				%indvars.iv2 = phi i64 [ %indvars.iv.next3, %bb9 ], [ 0, %bb ]
				%exitcond4 = icmp ne i64 %indvars.iv2, 100
				br i1 %exitcond4, label %bb7, label %bb11

				bb7: ; preds = %bb5
				%tmp = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv2
				%tmp8 = trunc i64 %indvars.iv2 to i32
				store i32 %tmp8, i32* %tmp, align 4
				br label %bb9

				bb9: ; preds = %bb7
				%indvars.iv.next3 = add nuw nsw i64 %indvars.iv2, 1
				br label %bb5

				bb11: ; preds = %bb18, %bb5
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb18 ], [ 0, %bb5 ]
				%exitcond = icmp ne i64 %indvars.iv, 100
				br i1 %exitcond, label %bb13, label %bb19

				bb13: ; preds = %bb11
				%tmp14 = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv
				%tmp15 = load i32, i32* %tmp14, align 4
				%tmp16 = shl nsw i32 %tmp15, 1
				%tmp17 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %indvars.iv
				store i32 %tmp16, i32* %tmp17, align 4
				br label %bb18

				bb18: ; preds = %bb13
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				br label %bb11

				bb19: ; preds = %bb11
				ret void
				}

				; CHECK: void @raw_only_parametric
				; CHECK-NEXT: bb:
				; CHECK: br label %[[LOOP1HEADER:bb[0-9]*]]
				; CHECK: [[LOOP1HEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP1BODY:bb[0-9]*]], label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP1BODY]]
				; CHECK: br label %[[LOOP1LATCH:bb[0-9]*]]
				; CHECK: [[LOOP1LATCH]]
				; CHECK: br label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP2PREHEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP2BODY:bb[0-9]*]], label %[[LOOP2EXIT:bb[0-9]*]]
				; CHECK: [[LOOP2BODY]]
				; CHECK: br label %[[LOOP2LATCH:bb[0-9]+]]
				; CHECK: [[LOOP2LATCH]]
				; CHECK: br label %[[LOOP1HEADER]]
				; CHECK: ret void
				define void @raw_only_parametric(i32* noalias %arg, i32 %arg4) {
				bb:
				br label %bb5

				bb5: ; preds = %bb11, %bb
				%indvars.iv2 = phi i64 [ %indvars.iv.next3, %bb11 ], [ 0, %bb ]
				%tmp = sext i32 %arg4 to i64
				%tmp6 = icmp slt i64 %indvars.iv2, %tmp
				br i1 %tmp6, label %bb8, label %bb14

				bb8: ; preds = %bb5
				%tmp9 = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv2
				%tmp10 = trunc i64 %indvars.iv2 to i32
				store i32 %tmp10, i32* %tmp9, align 4
				br label %bb11

				bb11: ; preds = %bb8
				%indvars.iv.next3 = add nuw nsw i64 %indvars.iv2, 1
				br label %bb5

				bb14: ; preds = %bb22, %bb5
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb22 ], [ 0, %bb5 ]
				%tmp13 = sext i32 %arg4 to i64
				%tmp15 = icmp slt i64 %indvars.iv, %tmp13
				br i1 %tmp15, label %bb17, label %bb23

				bb17: ; preds = %bb14
				%tmp18 = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv
				%tmp19 = load i32, i32* %tmp18, align 4
				%tmp20 = shl nsw i32 %tmp19, 1
				%tmp21 = getelementptr inbounds [1024 x i32], [1024 x i32]* @B, i64 0, i64 %indvars.iv
				store i32 %tmp20, i32* %tmp21, align 4
				br label %bb22

				bb22: ; preds = %bb17
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				br label %bb14

				bb23: ; preds = %bb14
				ret void
				}

				; CHECK: void @forward_dep
				; CHECK-NEXT: bb:
				; CHECK: br label %[[LOOP1HEADER:bb[0-9]*]]
				; CHECK: [[LOOP1HEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP1BODY:bb[0-9]*]], label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP1BODY]]
				; CHECK: br label %[[LOOP1LATCH:bb[0-9]*]]
				; CHECK: [[LOOP1LATCH]]
				; CHECK: br label %[[LOOP2PREHEADER:bb[0-9]+]]
				; CHECK: [[LOOP2PREHEADER]]
				; CHECK: br i1 %{{.*}}, label %[[LOOP2BODY:bb[0-9]*]], label %[[LOOP2EXIT:bb[0-9]*]]
				; CHECK: [[LOOP2BODY]]
				; CHECK: br label %[[LOOP2LATCH:bb[0-9]+]]
				; CHECK: [[LOOP2LATCH]]
				; CHECK: br label %[[LOOP1HEADER]]
				; CHECK: ret void
				define void @forward_dep(i32* noalias %arg) {
				bb:
				br label %bb5

				bb5: ; preds = %bb14, %bb
				%indvars.iv2 = phi i64 [ %indvars.iv.next3, %bb14 ], [ 0, %bb ]
				%.01 = phi i32 [ 0, %bb ], [ %tmp15, %bb14 ]
				%exitcond4 = icmp ne i64 %indvars.iv2, 100
				br i1 %exitcond4, label %bb7, label %bb17

				bb7: ; preds = %bb5
				%tmp = add nsw i32 %.01, -3
				%tmp8 = add nuw nsw i64 %indvars.iv2, 3
				%tmp9 = trunc i64 %tmp8 to i32
				%tmp10 = mul nsw i32 %tmp, %tmp9
				%tmp11 = trunc i64 %indvars.iv2 to i32
				%tmp12 = srem i32 %tmp10, %tmp11
				%tmp13 = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv2
				store i32 %tmp12, i32* %tmp13, align 4
				br label %bb14

				bb14: ; preds = %bb7
				%indvars.iv.next3 = add nuw nsw i64 %indvars.iv2, 1
				%tmp15 = add nuw nsw i32 %.01, 1
				br label %bb5

				bb17: ; preds = %bb25, %bb5
				%indvars.iv = phi i64 [ %indvars.iv.next, %bb25 ], [ 0, %bb5 ]
				%exitcond = icmp ne i64 %indvars.iv, 100
				br i1 %exitcond, label %bb19, label %bb26

				bb19: ; preds = %bb17
				%tmp20 = add nsw i64 %indvars.iv, -3
				%tmp21 = getelementptr inbounds i32, i32* %arg, i64 %tmp20
				%tmp22 = load i32, i32* %tmp21, align 4
				%tmp23 = mul nsw i32 %tmp22, 3
				%tmp24 = getelementptr inbounds i32, i32* %arg, i64 %indvars.iv
				store i32 %tmp23, i32* %tmp24, align 4
				br label %bb25

				bb25: ; preds = %bb19
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				br label %bb17

				bb26: ; preds = %bb17
				ret void
				}
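The `forward_dep` case exercises the "no negative distance dependencies" condition from the summary: the second loop reads `arg[i - 3]`, a location written three iterations earlier, so interleaving the two bodies preserves the values read. The C sketch below paraphrases that case and contrasts it with a hypothetical variant that reads `arg[i + 1]`, where fusion would read a location before the first loop has written it. All function names are assumptions, the first loop body is a stand-in for the computation in the test, and unsigned arithmetic is used so the repeated multiplication wraps in a defined way.

```c
#define N 100

/* Callers must pass a pointer with at least 3 valid elements before it. */
void forward_dep_unfused(unsigned *arg) {
    for (int i = 0; i < N; i++) arg[i] = (unsigned)i;      /* stand-in body */
    for (int i = 0; i < N; i++) arg[i] = arg[i - 3] * 3u;
}

/* Fused form: arg[i - 3] already holds its final value when it is read. */
void forward_dep_fused(unsigned *arg) {
    for (int i = 0; i < N; i++) {
        arg[i] = (unsigned)i;
        arg[i] = arg[i - 3] * 3u;
    }
}

/* Hypothetical negative-distance variant; arg needs N + 1 elements. */
void backward_dep_unfused(unsigned *arg, unsigned *b) {
    for (int i = 0; i < N; i++) arg[i] = (unsigned)i;
    for (int i = 0; i < N; i++) b[i] = arg[i + 1] << 1;
}

/* Incorrect fusion of the variant: arg[i + 1] is read one iteration
   before the first loop body would have written it. */
void backward_dep_naively_fused(unsigned *arg, unsigned *b) {
    for (int i = 0; i < N; i++) {
        arg[i] = (unsigned)i;
        b[i] = arg[i + 1] << 1;
    }
}
```

Running both forward versions on identical buffers leaves them in the same state, while the two backward versions disagree as soon as the initial contents of `arg` differ from what the first loop writes.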

Revision Contents

Diff 188928