Page MenuHomePhabricator

Implement basic loop fusion pass
Needs ReviewPublic

Authored by kbarton on Dec 18 2018, 12:52 PM.



Loop fusion is an optimization/transformation that takes two (or more) loops and
fuses (or joins) them together to create a single loop.

This pass implements Basic Loop Fusion. It will fuse loops that conform to the
following 4 conditions:

  1. Adjacent (no code in between the two loops)
  2. Control flow equivalent (if one loop executes, the other loop executes)
  3. Identical bounds (both loops iterate the same number of iterations)
  4. No negative distance dependencies between the loop bodies.

The intention is to initially focus on the safety checks and mechanics of loop
fusion. It makes no changes to the IR to create opportunities for fusion.
Instead, it checks the necessary conditions and if all conditions are met, it
fuses two loops together.

Further information on loop fusion, and techniques to improve the opportunities
for loop fusion can be found at

This patch intentionally does not add the loop fusion pass into the pass
pipeline. The location of loop fusion in the pass pipeline needs to be discussed
and I would prefer to have that discussion outside of this review.

This patch was done in collaboration with Johannes Doerfert.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Meinersbur added inline comments.Dec 18 2018, 10:05 PM

[serious] LLVM has an infrastructure for enabling debugging output: -debug-only. We should not introduce additional ones.

At least this option should only be available in assert-builds.


[suggestion] Add doxygen comments for these fields?


[remark] In contrast to the other invalidate reason, this one does not have a statistic counter. Intentional?


[style] Use LLVM_DUMP_METHOD pattern for dump() methods (See compiler.h)


[style] braces can be removed


[style] I'd prefer method description as a doxygen comment at the methods.


[style] Use doxygen comment?


[style] Descriptions of these would be useful.


[style] return PDT.dominates(FC0.Preheader, FC1.Preheader)


[style] This looks like a leftover review comment.


[Style] Use LLVM_PRETTY_FUNCTION instead of __FUNCTION__.

Maybe remove entirely? Nowhere I have seen function tracing in LLVM code.


[suggestion] If predication is not supported, why not using ScalarEvolution::getBackedgeTakenCount()?


[style] Use doxygen comment here?


[style] No reason to use auto here


[suggestion] Descriptions of these would be useful


[style] The braces seem unnecessary


[style] Start method names with a verb: isAdjacent?


[style] Use emplace_back?


[suggestion] This looks expensive to do. However, the pass manager will do these verifications anyway between passes (if enabled), so it shouldn't be necessary to do here.


[style] I think it is more common in the LLVM case base to have private fields declared at the beginning of the class.


[serious] I think LoopFuser is preserving some passes such as ScalarEvolution?


[serious] If changes are made, the pass should not indicate that it preserves all analyses (including the ones it doesn't know about).


[serious] If you are testing for llvm::dbgs() output, you need to add a REQUIRES: asserts line.

Another problem is that 2>&1 this mixes stdout and stderr output. stdout is buffered, stderr is not. How these are merged by 2>&1 is undefined. You can switch off stdout by using opt -disable-output.

If possible, using opt -analyze is a better way to verify pass-specific output because adding another LLVM_DEBUG during debugging sessions can lead to test case failures. However, checking -debug output is common in the llvm tests.


[remark] I think these can be removed.


[remark] Are these necessary for the test?


[remark] Can these be removed?


[suggestion] Remove 2>&1 (there is no stdout output here)

kbarton marked 2 inline comments as done.Dec 20 2018, 9:26 AM

Responded to two of the comments. Most of the others will be addressed in the next revision, which I should hopefully be uploading soon.


I understand your points here. I did not realize issue #2 (although I agree it doesn't apply here).

However, if I understand your suggestion, I would need to add an initialize method (or something similar) to do the logic that is currently in the constructor and return a success/fail code. The problem is that before anything can be done with the object, you need to ensure that it has been initialized. This can be accomplished by adding a boolean flag (isInitialized) to the class and ensuring that all accessors check that before doing anything with it. However, I think that is more complicated then leaving the initialization logic in the constructor.

That said, I'm not particularly tied to either approach. I can change the implementation if you feel strongly. Or, if there is an alternative that I'm not seeing, I'm happy to do that instead.


This is a good question, and I should have added comments here to articulate why I eventually settled on std::set.

Ultimately, I was looking for a data structure that would keep the fusion candidates sorted automatically, and not invalidate the iterators as the set changes. In this implementation, either DenseSet or SmallPtrSet would be fine, but we have additional patches to extend fusion to move intervening code around and split up loops to create new opportunities. When this is done, the contents of the FusionCandidateSet will change, and having stable iterators simplifies the logic quite a bit (in my opinion). Similarly, always ensuring that the set is ordered properly is essential for this implementation, and that seems to be done reasonably well with std::set.

I will add some comments in my next revision to indicate my line of reasoning here.

kbarton edited the summary of this revision. (Show Details)Dec 20 2018, 9:28 AM
kbarton updated this revision to Diff 179130.Dec 20 2018, 12:40 PM

Addressed all but two of the review comments.
The only outstanding comments are the use of std::set (which I tried to address with comments) and the logic inside the FusionCandidate constructor (which I'm waiting for feedback on the review).

kbarton marked 31 inline comments as done and 2 inline comments as done.Dec 20 2018, 12:51 PM
kbarton added inline comments.

I use this to have two different "levels" of debugging. The debugging available with -debug-only is not as verbose.
If you feel strongly about it, I can remove this and make all debug output equivalent.
In the meantime, I've guarded this use with NDEBUG.


There is no emplace_back for TreeUpdates (unless I misunderstood your suggestion).


I've kept this for now, as this isn't in the pass manager yet. However, I've guarded it under NDEBUG.
Are you OK with that?


I agree. The pass should not make changes and return true.
I don't believe this is happening now.


I don't see an (easy) way to use analyze to get this information.
For now, I've added the REQUIRES: asserts.
I'll think if there is a way to use analyze in the future to accomplish this.

Great stuff. This looks like it will be very useful. I have a few thoughts/questions/nits.


Two spaces.


second loop


This debug message is (essentially) repeated?


Commented code


I would imagine, although I'm not sure, that there would at least be a lot of bugs here. We are dealing with different loops, but we can say that they are very similar.

What does it currently give? Anything useful? The SCEV checks you are doing above is obviously quite similar to what DA would be trying to do, but with the added loop rewrite. It would be a shame to duplicate all the effort but may, I guess, be necessary if DA doesn't do well enough.


thest -> test?




It can sometimes be better to insert edges into the DT before deleting old ones. It keep the tree more intact (especially pdts), with less subtrees becoming unreachable and less recalculations needed. It means it can be simpler and quicker for the updates.


Is there anywhere that we check that these won't have phi operands from inside the first loop? i.e. that the phis can be moved into the first loop / before all of it's instructions.


You may want to add DominatorTree::VerificationLevel::Fast otherwise this might be quite slow.


Looks like the code is preserving LI too (I'm not sure if you can/should preserve SE without preserving LI.)

kbarton marked 15 inline comments as done.Jan 17 2019, 1:57 PM

I've addressed most of these comments (except the ones I have some questions about). I will be refreshing the patch momentarily.


Good catch. Fixed.


At this point DA doesn't give anything useful - at least not for the test cases that I have tried. I have not had a chance to investigate why, or if there is a better/different way to do things where it can be useful (which is why I marked the TODO here).

The one thing that DA is able to do, that SCEV currently cannot, is understand the restrict keyword and accurately identify no dependencies between the two loops in this case. Perhaps it would be better to try and teach SCEV about restrict, and then only rely on SCEV in the long run?


OK, that's good to know, thanks.

Is this specifically for inserting/deleting edges between similar blocks, or is this for all inserts/deletes in the entire tree? In other words, should I rearrange this code, and the code below (lines 1048-1051) that updates the latch blocks to do all the insertions before any deletions?


This is a really good catch!] We completely missed this.
I've added a check in dependencesAllowFusion that walks through the PHIs in the header of FC1 and check if the incoming block is in FC0 then it is the header block. If it is not, then we do not fuse.


Thanks for pointing this out.
I've changed to fast.
Most (probably all) of the problems this has found has been done by comparing to a newly constructed tree, so I think Fast should be sufficient.


Yup, good call. I've added LoopInfoWrapperPass to the preserved list.

kbarton updated this revision to Diff 182396.Jan 17 2019, 1:59 PM
kbarton marked 4 inline comments as done.

Addressed comments from dmgreen.

Initial comments from my side.


We need to update the licence ;)


You should keep it this way. Other passes do it similarly.


It is not uncommon in LLVM to do it this way. I personally do not try to force a solution similar to bool createXXX(...) but I also do not necessarily object if there is an actual reason.


Because, as an example and not the only reason, we could easily generate statistics on the impact predication support might have.


This should probably be a conditional and a return false ;)


I don't get this. Could one of you produce a problematic example for me? (also good as a test case)


As above, preserve LoopAnalysis.

dmgreen added inline comments.Jan 23 2019, 8:41 AM

Is there reason in the current version to prevent fusion if "FC0" has instructions in its preheaders?


I remember DA having problems on 64bit systems due to all the sexts that kept happening. Delinearising constant's too, maybe? There are certainly problems in it that would be worth trying to fix (or find a replacement to do it better).

The delinearisation that DA will attempt may also be helpful. It is nowadays all built on SCEV's too, so should in theory (baring the facts that we know here about dealing with different but similar loops) be a more powerful version of the SCEV code here. That power might be confusing it at the moment though.

I would expect that better DA is something llvm will need sooner or later as more loop optimisations are implemented. Perhaps this is something that we can improve in the future, relying on the SCEV code at the moment, but getting more help from DA as it is improved. The no-aliasing checks it can do are useful at least, it would seem.


I'm not sure if this is quite strong enough. Consider something like this, where the sum would be used in the second block, but not as phi:

for(i = 0; i < n; i++)
  sum += a[i];
for(i = 0; i < n; i++)
  b[i] = a[i]/sum;

These can be fused, but use the wrong "sum" on each iteration of the loop. Unroll and jam side-steps all this by requiring lcssa nodes (and, you know, not requiring generalised loop fusion :) )


I think it's probably fine, just something I've run into in the past with PDT's. If it's something we run into, we can fix it then. And if it does become an issue, it may be better to teach the DTU to do this itself.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 11 2019, 1:14 PM
kbarton marked 16 inline comments as done.Feb 12 2019, 9:58 AM

I've addressed almost all of the comments.
I will work on an update to the problem that dmgreen provided an example for and post a new version of the patch when that is done.
In the meantime, if my understanding of the comments is incorrect, someone please correct me :)


OK, so I'm still not sure what the preference is here:

  1. Keep it as is.
  2. Change to create an empty object and then require an Initialize method call on it.
  3. Create a wrapper, createFusionCandidate(), that does the construction and initialization.
  4. Some other alternative that I don't see.

If I'm going to change, I prefer the createFusionCandidate route (which I hadn't thought of earlier) which I think is the cleanest, although I don't know how to enforce it's usage.


I addressed this a while ago, I just forgot to check it off as done.


Strictly speaking, no, I there is no reason to prevent fusion if FC0 has instructions in its preheader.
This check was meant to prevent fusion if FC1 has instructions in the preheader since we don't have any analysis to move them around yet
I've moved this check later when we are trying to fuse pairs of candidates. If FC1 has a non-empty preheader then fusion is skipped.


I've changed from the assert to a check and print message.
After that, all the conditions just result in a return false, which cleans this up a lot.


I completely agree about improving DA and the need for it with other loop optimizations. I still haven't had a chance to look at it in detail, but would like to start looking at it once this patch is finalized and lands.

For now are you OK with the current usage of DA here? I'm hoping that as we extend/improve it, we can modify this code to take advantage of it.


Yes, you're right. If I generate the ll for this example and massage it to make it eligible for fusion, we will fuse these.
However, I'm confused by your statement:

These can be fused, but use the wrong "sum" on each iteration of the loop.

I don't see how fusion is legal here. Are you saying that the current check will (incorrectly) still allow fusion? Or there is a way to fuse these loops?

For this example, lcssa will create a non-empty preheader for the second loop and the earlier checks preventing fusion will bail before we get to this test. If I manually sink the PHI from lcssa into the header of the second loop, I can create an empty preheader for the second loop to allow fusion to continue.

At any rate, I think this check needs to be strengthened to prevent fusion from happening in this case. Do you agree?


OK, sounds good. I will mark this comment as done then.


dmgreen added an example above that illustrates the issue (after some manual modifications to the IR to get around other limitations in fusion).

dmgreen added inline comments.Feb 13 2019, 9:41 AM

Sounds good to me. The !DepResult check can help in some cases at least, and the rest we can add as DA is improved.


Yeah, sorry. I meant _will_ fuse, but shouldn't. "Can" was a bit misleading.

kbarton marked 9 inline comments as done and an inline comment as not done.Fri, Mar 1, 9:03 AM

Addressed some more comments. I'll upload a new patch momentarily.

kbarton updated this revision to Diff 188928.Fri, Mar 1, 9:59 AM

Addressed most of the review comments.

I think there are still two outstanding issues to resolve before committing this patch:

  1. The current constructors for FusionCandidates - whether to leave them as is or whether to reimplement using a createFusionCandidate wrapper that removes most of the logic from the constructor.
  2. New safety check that I added, to detect cross-loop dependencies that were previously missed. The new check is around line 918 of LoopFuse.cpp (inside dependencesAllowFusion method).
dmgreen added inline comments.Sun, Mar 3, 6:12 AM

I think that you can just do something like

if (Instruction *Def = dyn_cast<Instruction>(Op))
  if (FC0.L->contains(Def->getParent()))
    return false;

Providing there are no lcssa phis, this should rule out anything that depends on the first loop.


You can run something like -loop-rotate to make this into a more simply structured loop. -instcombine will remove any lcssa nodes, and -simplify-cfg can clean things up too.

I guess that will depend on what types of loops you want to test, ones that have been cleanup up or ones that are a little less structured. It's good to test both, but which is more important will depend on where in the pass pipeline this ends up.

kbarton marked 2 inline comments as done.Fri, Mar 22, 2:07 PM
kbarton added inline comments.

I just finished up testing this change, and then realized I think this is overly restrictive.
I had originally restricted it to PHIs because I only wanted to include things that are (re)defined in the first loop. Basically, the first loop needs to complete in order to get the correct definition going into the second loop.

This alternate sequence will also include stores, not just PHIs. The store can be loop invariant, and if it is not hoisted out of the loop prior, then it would also prevent fusion.


I completely agree.
In fact we have a subsequent patch to post once this one lands that restricts loop fusion to only run on rotated loops. This seems to be a common theme among many loop optimizations, and restricting to rotated loops simplifies the implementation in several places. I'm planning on discussing this at my presentation at EuroLLVM in a couple weeks :)