Page MenuHomePhabricator

Implement basic loop fusion pass

Authored by kbarton on Dec 18 2018, 12:52 PM.



Loop fusion is an optimization/transformation that takes two (or more) loops and
fuses (or joins) them together to create a single loop.

This pass implements Basic Loop Fusion. It will fuse loops that conform to the
following 4 conditions:

  1. Adjacent (no code in between the two loops)
  2. Control flow equivalent (if one loop executes, the other loop executes)
  3. Identical bounds (both loops iterate the same number of iterations)
  4. No negative distance dependencies between the loop bodies.

The intention is to initially focus on the safety checks and mechanics of loop
fusion. It makes no changes to the IR to create opportunities for fusion.
Instead, it checks the necessary conditions and if all conditions are met, it
fuses two loops together.

Further information on loop fusion, and techniques to improve the opportunities
for loop fusion can be found at

This patch intentionally does not add the loop fusion pass into the pass
pipeline. The location of loop fusion in the pass pipeline needs to be discussed
and I would prefer to have that discussion outside of this review.

This patch was done in collaboration with Johannes Doerfert.

Diff Detail


Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
kbarton added inline comments.Dec 20 2018, 12:51 PM
1 ↗(On Diff #178761)

I don't see an (easy) way to use analyze to get this information.
For now, I've added the REQUIRES: asserts.
I'll think if there is a way to use analyze in the future to accomplish this.

Great stuff. This looks like it will be very useful. I have a few thoughts/questions/nits.

26 ↗(On Diff #179130)

Two spaces.

612 ↗(On Diff #179130)

second loop

719 ↗(On Diff #179130)

This debug message is (essentially) repeated?

747 ↗(On Diff #179130)

Commented code

858 ↗(On Diff #179130)

I would imagine, although I'm not sure, that there would at least be a lot of bugs here. We are dealing with different loops, but we can say that they are very similar.

What does it currently give? Anything useful? The SCEV checks you are doing above is obviously quite similar to what DA would be trying to do, but with the added loop rewrite. It would be a shame to duplicate all the effort but may, I guess, be necessary if DA doesn't do well enough.

966 ↗(On Diff #179130)

thest -> test?

984 ↗(On Diff #179130)


1003 ↗(On Diff #179130)

It can sometimes be better to insert edges into the DT before deleting old ones. It keep the tree more intact (especially pdts), with less subtrees becoming unreachable and less recalculations needed. It means it can be simpler and quicker for the updates.

1015 ↗(On Diff #179130)

Is there anywhere that we check that these won't have phi operands from inside the first loop? i.e. that the phis can be moved into the first loop / before all of it's instructions.

1087 ↗(On Diff #179130)

You may want to add DominatorTree::VerificationLevel::Fast otherwise this might be quite slow.

1118 ↗(On Diff #179130)

Looks like the code is preserving LI too (I'm not sure if you can/should preserve SE without preserving LI.)

kbarton marked 15 inline comments as done.Jan 17 2019, 1:57 PM

I've addressed most of these comments (except the ones I have some questions about). I will be refreshing the patch momentarily.

719 ↗(On Diff #179130)

Good catch. Fixed.

858 ↗(On Diff #179130)

At this point DA doesn't give anything useful - at least not for the test cases that I have tried. I have not had a chance to investigate why, or if there is a better/different way to do things where it can be useful (which is why I marked the TODO here).

The one thing that DA is able to do, that SCEV currently cannot, is understand the restrict keyword and accurately identify no dependencies between the two loops in this case. Perhaps it would be better to try and teach SCEV about restrict, and then only rely on SCEV in the long run?

1003 ↗(On Diff #179130)

OK, that's good to know, thanks.

Is this specifically for inserting/deleting edges between similar blocks, or is this for all inserts/deletes in the entire tree? In other words, should I rearrange this code, and the code below (lines 1048-1051) that updates the latch blocks to do all the insertions before any deletions?

1015 ↗(On Diff #179130)

This is a really good catch!] We completely missed this.
I've added a check in dependencesAllowFusion that walks through the PHIs in the header of FC1 and check if the incoming block is in FC0 then it is the header block. If it is not, then we do not fuse.

1087 ↗(On Diff #179130)

Thanks for pointing this out.
I've changed to fast.
Most (probably all) of the problems this has found has been done by comparing to a newly constructed tree, so I think Fast should be sufficient.

1118 ↗(On Diff #179130)

Yup, good call. I've added LoopInfoWrapperPass to the preserved list.

kbarton updated this revision to Diff 182396.Jan 17 2019, 1:59 PM
kbarton marked 4 inline comments as done.

Addressed comments from dmgreen.

Initial comments from my side.

142–143 ↗(On Diff #178761)

It is not uncommon in LLVM to do it this way. I personally do not try to force a solution similar to bool createXXX(...) but I also do not necessarily object if there is an actual reason.

591–594 ↗(On Diff #178761)

Because, as an example and not the only reason, we could easily generate statistics on the impact predication support might have.

1015 ↗(On Diff #179130)

I don't get this. Could one of you produce a problematic example for me? (also good as a test case)

7 ↗(On Diff #182396)

We need to update the licence ;)

111 ↗(On Diff #182396)

You should keep it this way. Other passes do it similarly.

848 ↗(On Diff #182396)

This should probably be a conditional and a return false ;)

1162 ↗(On Diff #182396)

As above, preserve LoopAnalysis.

dmgreen added inline comments.Jan 23 2019, 8:41 AM
858 ↗(On Diff #179130)

I remember DA having problems on 64bit systems due to all the sexts that kept happening. Delinearising constant's too, maybe? There are certainly problems in it that would be worth trying to fix (or find a replacement to do it better).

The delinearisation that DA will attempt may also be helpful. It is nowadays all built on SCEV's too, so should in theory (baring the facts that we know here about dealing with different but similar loops) be a more powerful version of the SCEV code here. That power might be confusing it at the moment though.

I would expect that better DA is something llvm will need sooner or later as more loop optimisations are implemented. Perhaps this is something that we can improve in the future, relying on the SCEV code at the moment, but getting more help from DA as it is improved. The no-aliasing checks it can do are useful at least, it would seem.

1003 ↗(On Diff #179130)

I think it's probably fine, just something I've run into in the past with PDT's. If it's something we run into, we can fix it then. And if it does become an issue, it may be better to teach the DTU to do this itself.

529 ↗(On Diff #182396)

Is there reason in the current version to prevent fusion if "FC0" has instructions in its preheaders?

917 ↗(On Diff #182396)

I'm not sure if this is quite strong enough. Consider something like this, where the sum would be used in the second block, but not as phi:

for(i = 0; i < n; i++)
  sum += a[i];
for(i = 0; i < n; i++)
  b[i] = a[i]/sum;

These can be fused, but use the wrong "sum" on each iteration of the loop. Unroll and jam side-steps all this by requiring lcssa nodes (and, you know, not requiring generalised loop fusion :) )

Herald added a project: Restricted Project. · View Herald TranscriptFeb 11 2019, 1:14 PM
kbarton marked 16 inline comments as done.Feb 12 2019, 9:58 AM

I've addressed almost all of the comments.
I will work on an update to the problem that dmgreen provided an example for and post a new version of the patch when that is done.
In the meantime, if my understanding of the comments is incorrect, someone please correct me :)

142–143 ↗(On Diff #178761)

OK, so I'm still not sure what the preference is here:

  1. Keep it as is.
  2. Change to create an empty object and then require an Initialize method call on it.
  3. Create a wrapper, createFusionCandidate(), that does the construction and initialization.
  4. Some other alternative that I don't see.

If I'm going to change, I prefer the createFusionCandidate route (which I hadn't thought of earlier) which I think is the cleanest, although I don't know how to enforce it's usage.

440–441 ↗(On Diff #178761)

I addressed this a while ago, I just forgot to check it off as done.

858 ↗(On Diff #179130)

I completely agree about improving DA and the need for it with other loop optimizations. I still haven't had a chance to look at it in detail, but would like to start looking at it once this patch is finalized and lands.

For now are you OK with the current usage of DA here? I'm hoping that as we extend/improve it, we can modify this code to take advantage of it.

1003 ↗(On Diff #179130)

OK, sounds good. I will mark this comment as done then.

1015 ↗(On Diff #179130)

dmgreen added an example above that illustrates the issue (after some manual modifications to the IR to get around other limitations in fusion).

529 ↗(On Diff #182396)

Strictly speaking, no, I there is no reason to prevent fusion if FC0 has instructions in its preheader.
This check was meant to prevent fusion if FC1 has instructions in the preheader since we don't have any analysis to move them around yet
I've moved this check later when we are trying to fuse pairs of candidates. If FC1 has a non-empty preheader then fusion is skipped.

848 ↗(On Diff #182396)

I've changed from the assert to a check and print message.
After that, all the conditions just result in a return false, which cleans this up a lot.

917 ↗(On Diff #182396)

Yes, you're right. If I generate the ll for this example and massage it to make it eligible for fusion, we will fuse these.
However, I'm confused by your statement:

These can be fused, but use the wrong "sum" on each iteration of the loop.

I don't see how fusion is legal here. Are you saying that the current check will (incorrectly) still allow fusion? Or there is a way to fuse these loops?

For this example, lcssa will create a non-empty preheader for the second loop and the earlier checks preventing fusion will bail before we get to this test. If I manually sink the PHI from lcssa into the header of the second loop, I can create an empty preheader for the second loop to allow fusion to continue.

At any rate, I think this check needs to be strengthened to prevent fusion from happening in this case. Do you agree?

dmgreen added inline comments.Feb 13 2019, 9:41 AM
858 ↗(On Diff #179130)

Sounds good to me. The !DepResult check can help in some cases at least, and the rest we can add as DA is improved.

917 ↗(On Diff #182396)

Yeah, sorry. I meant _will_ fuse, but shouldn't. "Can" was a bit misleading.

kbarton marked 9 inline comments as done and an inline comment as not done.Mar 1 2019, 9:03 AM

Addressed some more comments. I'll upload a new patch momentarily.

kbarton updated this revision to Diff 188928.Mar 1 2019, 9:59 AM

Addressed most of the review comments.

I think there are still two outstanding issues to resolve before committing this patch:

  1. The current constructors for FusionCandidates - whether to leave them as is or whether to reimplement using a createFusionCandidate wrapper that removes most of the logic from the constructor.
  2. New safety check that I added, to detect cross-loop dependencies that were previously missed. The new check is around line 918 of LoopFuse.cpp (inside dependencesAllowFusion method).
dmgreen added inline comments.Mar 3 2019, 6:12 AM
929 ↗(On Diff #188928)

I think that you can just do something like

if (Instruction *Def = dyn_cast<Instruction>(Op))
  if (FC0.L->contains(Def->getParent()))
    return false;

Providing there are no lcssa phis, this should rule out anything that depends on the first loop.

283 ↗(On Diff #188928)

You can run something like -loop-rotate to make this into a more simply structured loop. -instcombine will remove any lcssa nodes, and -simplify-cfg can clean things up too.

I guess that will depend on what types of loops you want to test, ones that have been cleanup up or ones that are a little less structured. It's good to test both, but which is more important will depend on where in the pass pipeline this ends up.

kbarton marked 2 inline comments as done.Mar 22 2019, 2:07 PM
kbarton added inline comments.
929 ↗(On Diff #188928)

I just finished up testing this change, and then realized I think this is overly restrictive.
I had originally restricted it to PHIs because I only wanted to include things that are (re)defined in the first loop. Basically, the first loop needs to complete in order to get the correct definition going into the second loop.

This alternate sequence will also include stores, not just PHIs. The store can be loop invariant, and if it is not hoisted out of the loop prior, then it would also prevent fusion.

283 ↗(On Diff #188928)

I completely agree.
In fact we have a subsequent patch to post once this one lands that restricts loop fusion to only run on rotated loops. This seems to be a common theme among many loop optimizations, and restricting to rotated loops simplifies the implementation in several places. I'm planning on discussing this at my presentation at EuroLLVM in a couple weeks :)

dmgreen added inline comments.Mar 24 2019, 7:39 AM
129 ↗(On Diff #188928)

Perhaps name it ExitingBlock, if there is only one.

929 ↗(On Diff #188928)

This is the kind of thing I was thinking of:

define float @test(float* nocapture %a, i32 %n) {
  %conv = zext i32 %n to i64
  %cmp32 = icmp eq i32 %n, 0
  br i1 %cmp32, label %for.cond.cleanup7, label %for.body

for.body:                                         ; preds = %for.body, %entry
  %i.034 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
  %sum1.033 = phi float [ %add, %for.body ], [ 0.000000e+00, %entry ]
  %idxprom = trunc i64 %i.034 to i32
  %arrayidx = getelementptr inbounds float, float* %a, i32 %idxprom
  %0 = load float, float* %arrayidx, align 4
  %add = fadd float %sum1.033, %0
  %inc = add nuw nsw i64 %i.034, 1
  %cmp = icmp ult i64 %inc, %conv
  br i1 %cmp, label %for.body, label %for.body8

for.body8:                                        ; preds = %for.body, %for.body8
  %i2.031 = phi i64 [ %inc14, %for.body8 ], [ 0, %for.body ]
  %idxprom9 = trunc i64 %i2.031 to i32
  %arrayidx10 = getelementptr inbounds float, float* %a, i32 %idxprom9
  %1 = load float, float* %arrayidx10, align 4
  %div = fdiv float %1, %add
  store float %div, float* %arrayidx10, align 4
  %inc14 = add nuw nsw i64 %i2.031, 1
  %cmp5 = icmp ult i64 %inc14, %conv
  br i1 %cmp5, label %for.body8, label %for.cond.cleanup7

for.cond.cleanup7:                                ; preds = %for.body8, %entry
  %sum1.0.lcssa36 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body8 ]
  ret float %sum1.0.lcssa36

The importand part being %add, that is defined in the first loop and used in the second. I believe that using normal SSA def-use chains, anything that is def'd in the first loop (and used in the second) will need to use the value from the last iteration of the loop, so fusion will be illegal.

There may be cases where a def has the same value on every iteration of the loop, but I imagine those would already have been hoisted/sunk, or be quite rare.

1083 ↗(On Diff #188928)

Whilst checking the example, I noticed the dom tree updater has become stricter about "inbalanced" tree operations. I think that if FC0.ExitingBlocks == FC0.Latch, we add a link to FC1.Header twice, which is maybe what it's complaining about (?)

Oh, forgot to mention, I am looking forward to getting this in tree, and think it's very close. If we can sort out the last couple of things here, I think it would be good to get it committed so we can have a play around and sort out some of the questions like where in the pass pipeline it fits.

kbarton marked 10 inline comments as done.Mar 25 2019, 8:18 PM
kbarton added inline comments.
858 ↗(On Diff #179130)

I'm going to mark this as done, since I think we are in agreement we want to work on improving the dependence analysis in the future. If this is incorrect, please let me know.

129 ↗(On Diff #188928)

Good idea. Will fix in patch I'm going to post soon.

929 ↗(On Diff #188928)

Yes, I see now. It's possible to have a def in addition to a PHI.
If I wanted to catch this, I would also need to traverse all the inputs to any defs and make sure they don't come from PHIs within loop 0 as well.
I'm OK with restricting this to defs, with the assumption that if there is any loop invariant defs they would have been hoisted out of the loop at this point.

1083 ↗(On Diff #188928)

Yup, good catch. I found the same problem when I tried your test case.
I've fixed the problem and added your test above to cannot_fuse.ll.

kbarton updated this revision to Diff 192242.Mar 25 2019, 8:23 PM
kbarton marked 2 inline comments as done.

Address review comments as follows:

  1. Strengthen dependence checks between any value used in Loop 1 to ensure it is not defined in loop 0.
  2. Do not add a tree update to insert an edge between FC0.Latch and FC1.Header when FC0.ExitingBlock is the same as FC0.Latch. This edge has already been inserted into the tree updates earlier and will cause an assertion to fail while performing the updates.
  3. Rename ExitingBlocks to ExitingBlock since we are only concerned with loops with a single exiting block.
dmgreen accepted this revision.Mar 26 2019, 5:51 AM

Thanks. I had another look through and think this is in a really good state. I'm very happy for it to go in.

I'm not sure if there were any other remaining comments from @Meinersbur or @jdoerfert?

This revision is now accepted and ready to land.Mar 26 2019, 5:51 AM

Thanks @dmgreen!
The only outstanding issue was from @Meinersbur regarding the constructor for FusionCandidate. The concern was keeping the logic in the constructor, vs moving it out into a createFusionCandidate method to wrap the constructor and add all the logic to that method. I don't have a preference one way or the other, but will happily make the change if people feel strongly about it.

Meinersbur accepted this revision.Apr 11 2019, 12:00 PM

Generally, LGTM; with some smaller suggestions.

142–143 ↗(On Diff #178761)

I was thinking the following pattern:

FusionCandidate* makeFusionCandiate(...) {
  BasicBlock *Preheader;
  BasicBlock *Header;
  BasicBlock *ExitingBlocks;
  BasicBlock *ExitBlock;
  BasicBlock *Latch;
  // The above are redundant since llvm::Loop provides accessors  
  // I suggest to not cache them in other objects (they may require being updated) unless it measurable improves performance

  Loop *L;
  SmallVector<Instruction *, 16> MemReads;
  SmallVector<Instruction *, 16> MemWrites;
  if (!condition) {
    LLVM_DEBUG(dbgs() << "The reason\n");
    return nullptr;

  return new FusionCandidate(L, MemReads, MemWrites);
142–143 ↗(On Diff #178761)

It is just a suggestion and does not correspond to an LLVM coding policy. Keep as-is, it is not an requirement for me to accept the patch.

204 ↗(On Diff #178761)

[style] dump() methods are usually guard by

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)

as mentioned in the comment above LLVM_DUMP_METHOD

276 ↗(On Diff #178761)

[suggestion] IMHO it is a good style to not depend on a particular implementation of ADTs. Code that advances iterators manually depending on inserted/removed elements can be confusing and may break easily in future changes.

However, as with the previous [suggestion], I am ok with keeping it as-is.

977–980 ↗(On Diff #178761)

Are you sure? There is SmallVectorImpl<T>::emplace_back.

292 ↗(On Diff #179130)

[suggestion] This could be a range-based for-loop.

323 ↗(On Diff #192242)

[style] Did you consider LI.empty()?

935 ↗(On Diff #192242)

[suggestion] Consider using a method name that makes clear that is has no side-effects. isEmptyPreheader?
emptyPreheader sounds like it would remove instructions from the preheader until it is empty.

kbarton marked 24 inline comments as done.Apr 16 2019, 10:41 AM

All comments have been addressed. I will be committing this soon.

142–143 ↗(On Diff #178761)

I will mark this conversation as done as I think we've agreed the existing implementation is OK. We can refactor/redo in the future if this becomes too cumbersome.

276 ↗(On Diff #178761)

Thanks. I will mark this as done for now.
This is also something that may need to get revisited as the code evolves.

977–980 ↗(On Diff #178761)

You're right. The syntax I was using was wrong. I've fixed it and changed to using emplace_back.

323 ↗(On Diff #192242)

I don't think so.
I've changed it, as I think it makes it more readable.

This revision was automatically updated to reflect the committed changes.
kbarton marked 4 inline comments as done.
rtereshin added inline comments.

It appears that if both NDEBUG and LLVM_ENABLE_DUMP are defined, the function will be compiled in, but all LLVM_DEBUG macro eliminated. Resulting

static void
printFusionCandidates(const FusionCandidateCollection &FusionCandidates) {
  for (const auto &CandidateSet : FusionCandidates) {

produces an "unused variable" CandidateSet warning. We probably don't need to wrap anything in LLVM_DEBUG here.

Meinersbur added inline comments.Apr 19 2019, 9:04 AM

We probably don't need to wrap anything in LLVM_DEBUG here.

I agree, but for a different reason:
All calls to printFusionCandidates are already wrapped by LLVM_DEBUG. The macro checks whether anything should be printed by checking the -debug/-debug-only cl::opt.