This is an archive of the discontinued LLVM Phabricator instance.

Implement basic loop fusion pass
ClosedPublic

Authored by kbarton on Dec 18 2018, 12:52 PM.

Details

Summary

Loop fusion is an optimization/transformation that takes two (or more) loops and
fuses (or joins) them together to create a single loop.

This pass implements Basic Loop Fusion. It will fuse loops that conform to the
following 4 conditions:

  1. Adjacent (no code in between the two loops)
  2. Control flow equivalent (if one loop executes, the other loop executes)
  3. Identical bounds (both loops iterate the same number of iterations)
  4. No negative distance dependencies between the loop bodies.

The intention is to initially focus on the safety checks and mechanics of loop
fusion. It makes no changes to the IR to create opportunities for fusion.
Instead, it checks the necessary conditions and if all conditions are met, it
fuses two loops together.

Further information on loop fusion, and techniques to improve the opportunities
for loop fusion can be found at
https://webdocs.cs.ualberta.ca/~amaral/thesis/ChristopherBartonMSc.pdf

This patch intentionally does not add the loop fusion pass into the pass
pipeline. The location of loop fusion in the pass pipeline needs to be discussed
and I would prefer to have that discussion outside of this review.

This patch was done in collaboration with Johannes Doerfert.

Diff Detail

Repository
rL LLVM

Event Timeline

kbarton created this revision.Dec 18 2018, 12:52 PM
jsji added a subscriber: jsji.Dec 18 2018, 3:40 PM

Thanks for bringing us loop fusion! Here are some review comments, but I did not check the functionality yet.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
99 ↗(On Diff #178761)

Is this the right description for "-loop-fusion-dependence-analysis=scev"? It is the same as for the da option.

106–109 ↗(On Diff #178761)

[serious] LLVM has an infrastructure for enabling debugging output: -debug-only. We should not introduce additional ones.

At least this option should only be available in assert-builds.

123–140 ↗(On Diff #178761)

[suggestion] Add doxygen comments for these fields?

142–143 ↗(On Diff #178761)

[suggestion] IMHO it is good practice to not have much logic in object constructors. Reasons:

  1. When exceptions are thrown, the object is partially initialized
  2. With class polymorphism, calling virtual functions will call the non-overidden method.
  3. It is not possible to return anything, in this case, an error code.
  1. and 2. do not apply to this code. Point 3. is handled using the Valid flag.

Nonetheless, I think having lots of logic in constructor is a bad pattern. Instead, I suggest to have something return nullptr or llvm::ErrorOr.

154 ↗(On Diff #178761)

[remark] In contrast to the other invalidate reason, this one does not have a statistic counter. Intentional?

204 ↗(On Diff #178761)

[style] Use LLVM_DUMP_METHOD pattern for dump() methods (See compiler.h)

271 ↗(On Diff #178761)

[remark] LoopSet is a vector?

276 ↗(On Diff #178761)

[remark] Is there a reason to use std::set? Usually, the LLVM hash-based sets (DenseSet, SmallPtrSet, etc) are more performant.

282–284 ↗(On Diff #178761)

[style] braces can be removed

305–307 ↗(On Diff #178761)

[style] I'd prefer method description as a doxygen comment at the methods.

321 ↗(On Diff #178761)

[style] Use doxygen comment?

345–347 ↗(On Diff #178761)

[style] Descriptions of these would be useful.

350–355 ↗(On Diff #178761)

This is only used inside LLVM_DEBUG, so the compiler will emit an unused-function warning when compiling in release mode. Guard with #ifdef NDEBUG?

426–427 ↗(On Diff #178761)

[suggestion] Guard with #ifndef NDEBUG to avoid being called in non-assert builds?

440–441 ↗(On Diff #178761)

[style] Start functions with a verb: isControlFlowEquivalent

455–458 ↗(On Diff #178761)

[style] return PDT.dominates(FC0.Preheader, FC1.Preheader)

494–495 ↗(On Diff #178761)

[style] This looks like a leftover review comment.

517 ↗(On Diff #178761)

[Style] Use LLVM_PRETTY_FUNCTION instead of __FUNCTION__.

Maybe remove entirely? Nowhere I have seen function tracing in LLVM code.

591–594 ↗(On Diff #178761)

[suggestion] If predication is not supported, why not using ScalarEvolution::getBackedgeTakenCount()?

711 ↗(On Diff #178761)

[style] Use doxygen comment here?

728 ↗(On Diff #178761)

[style] No reason to use auto here

745–746 ↗(On Diff #178761)

[suggestion] Descriptions of these would be useful

749–750 ↗(On Diff #178761)
809–811 ↗(On Diff #178761)

[style] The braces seem unnecessary

893 ↗(On Diff #178761)

[style] Start method names with a verb: isAdjacent?

977–980 ↗(On Diff #178761)

[style] Use emplace_back?

1059–1063 ↗(On Diff #178761)

[suggestion] This looks expensive to do. However, the pass manager will do these verifications anyway between passes (if enabled), so it shouldn't be necessary to do here.

1073–1083 ↗(On Diff #178761)

[style] I think it is more common in the LLVM case base to have private fields declared at the beginning of the class.

1094 ↗(On Diff #178761)

[serious] I think LoopFuser is preserving some passes such as ScalarEvolution?

1133 ↗(On Diff #178761)

[serious] If changes are made, the pass should not indicate that it preserves all analyses (including the ones it doesn't know about).

llvm/test/Transforms/LoopFusion/cannot_fuse.ll
1 ↗(On Diff #178761)

[serious] If you are testing for llvm::dbgs() output, you need to add a REQUIRES: asserts line.

Another problem is that 2>&1 this mixes stdout and stderr output. stdout is buffered, stderr is not. How these are merged by 2>&1 is undefined. You can switch off stdout by using opt -disable-output.

If possible, using opt -analyze is a better way to verify pass-specific output because adding another LLVM_DEBUG during debugging sessions can lead to test case failures. However, checking -debug output is common in the llvm tests.

74–76 ↗(On Diff #178761)

[remark] I think these can be removed.

269–272 ↗(On Diff #178761)

[remark] Are these necessary for the test?

llvm/test/Transforms/LoopFusion/loop_nest.ll
126–140 ↗(On Diff #178761)

[remark] Can these be removed?

llvm/test/Transforms/LoopFusion/simple.ll
1 ↗(On Diff #178761)

[suggestion] Remove 2>&1 (there is no stdout output here)

kbarton marked 2 inline comments as done.Dec 20 2018, 9:26 AM

Responded to two of the comments. Most of the others will be addressed in the next revision, which I should hopefully be uploading soon.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
142–143 ↗(On Diff #178761)

I understand your points here. I did not realize issue #2 (although I agree it doesn't apply here).

However, if I understand your suggestion, I would need to add an initialize method (or something similar) to do the logic that is currently in the constructor and return a success/fail code. The problem is that before anything can be done with the object, you need to ensure that it has been initialized. This can be accomplished by adding a boolean flag (isInitialized) to the class and ensuring that all accessors check that before doing anything with it. However, I think that is more complicated then leaving the initialization logic in the constructor.

That said, I'm not particularly tied to either approach. I can change the implementation if you feel strongly. Or, if there is an alternative that I'm not seeing, I'm happy to do that instead.

276 ↗(On Diff #178761)

This is a good question, and I should have added comments here to articulate why I eventually settled on std::set.

Ultimately, I was looking for a data structure that would keep the fusion candidates sorted automatically, and not invalidate the iterators as the set changes. In this implementation, either DenseSet or SmallPtrSet would be fine, but we have additional patches to extend fusion to move intervening code around and split up loops to create new opportunities. When this is done, the contents of the FusionCandidateSet will change, and having stable iterators simplifies the logic quite a bit (in my opinion). Similarly, always ensuring that the set is ordered properly is essential for this implementation, and that seems to be done reasonably well with std::set.

I will add some comments in my next revision to indicate my line of reasoning here.

kbarton edited the summary of this revision. (Show Details)Dec 20 2018, 9:28 AM
kbarton updated this revision to Diff 179130.Dec 20 2018, 12:40 PM

Addressed all but two of the review comments.
The only outstanding comments are the use of std::set (which I tried to address with comments) and the logic inside the FusionCandidate constructor (which I'm waiting for feedback on the review).

kbarton marked 31 inline comments as done and 2 inline comments as done.Dec 20 2018, 12:51 PM
kbarton added inline comments.
llvm/lib/Transforms/Scalar/LoopFuse.cpp
106–109 ↗(On Diff #178761)

I use this to have two different "levels" of debugging. The debugging available with -debug-only is not as verbose.
If you feel strongly about it, I can remove this and make all debug output equivalent.
In the meantime, I've guarded this use with NDEBUG.

977–980 ↗(On Diff #178761)

There is no emplace_back for TreeUpdates (unless I misunderstood your suggestion).

1059–1063 ↗(On Diff #178761)

I've kept this for now, as this isn't in the pass manager yet. However, I've guarded it under NDEBUG.
Are you OK with that?

1133 ↗(On Diff #178761)

I agree. The pass should not make changes and return true.
I don't believe this is happening now.

llvm/test/Transforms/LoopFusion/cannot_fuse.ll
1 ↗(On Diff #178761)

I don't see an (easy) way to use analyze to get this information.
For now, I've added the REQUIRES: asserts.
I'll think if there is a way to use analyze in the future to accomplish this.

Great stuff. This looks like it will be very useful. I have a few thoughts/questions/nits.

llvm/include/llvm/Transforms/Scalar/LoopFuse.h
26 ↗(On Diff #179130)

Two spaces.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
612 ↗(On Diff #179130)

second loop

719 ↗(On Diff #179130)

This debug message is (essentially) repeated?

747 ↗(On Diff #179130)

Commented code

858 ↗(On Diff #179130)

I would imagine, although I'm not sure, that there would at least be a lot of bugs here. We are dealing with different loops, but we can say that they are very similar.

What does it currently give? Anything useful? The SCEV checks you are doing above is obviously quite similar to what DA would be trying to do, but with the added loop rewrite. It would be a shame to duplicate all the effort but may, I guess, be necessary if DA doesn't do well enough.

966 ↗(On Diff #179130)

thest -> test?

984 ↗(On Diff #179130)

Theg

1003 ↗(On Diff #179130)

It can sometimes be better to insert edges into the DT before deleting old ones. It keep the tree more intact (especially pdts), with less subtrees becoming unreachable and less recalculations needed. It means it can be simpler and quicker for the updates.

1015 ↗(On Diff #179130)

Is there anywhere that we check that these won't have phi operands from inside the first loop? i.e. that the phis can be moved into the first loop / before all of it's instructions.

1087 ↗(On Diff #179130)

You may want to add DominatorTree::VerificationLevel::Fast otherwise this might be quite slow.

1118 ↗(On Diff #179130)

Looks like the code is preserving LI too (I'm not sure if you can/should preserve SE without preserving LI.)

kbarton marked 15 inline comments as done.Jan 17 2019, 1:57 PM

I've addressed most of these comments (except the ones I have some questions about). I will be refreshing the patch momentarily.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
719 ↗(On Diff #179130)

Good catch. Fixed.

858 ↗(On Diff #179130)

At this point DA doesn't give anything useful - at least not for the test cases that I have tried. I have not had a chance to investigate why, or if there is a better/different way to do things where it can be useful (which is why I marked the TODO here).

The one thing that DA is able to do, that SCEV currently cannot, is understand the restrict keyword and accurately identify no dependencies between the two loops in this case. Perhaps it would be better to try and teach SCEV about restrict, and then only rely on SCEV in the long run?

1003 ↗(On Diff #179130)

OK, that's good to know, thanks.

Is this specifically for inserting/deleting edges between similar blocks, or is this for all inserts/deletes in the entire tree? In other words, should I rearrange this code, and the code below (lines 1048-1051) that updates the latch blocks to do all the insertions before any deletions?

1015 ↗(On Diff #179130)

This is a really good catch!] We completely missed this.
I've added a check in dependencesAllowFusion that walks through the PHIs in the header of FC1 and check if the incoming block is in FC0 then it is the header block. If it is not, then we do not fuse.

1087 ↗(On Diff #179130)

Thanks for pointing this out.
I've changed to fast.
Most (probably all) of the problems this has found has been done by comparing to a newly constructed tree, so I think Fast should be sufficient.

1118 ↗(On Diff #179130)

Yup, good call. I've added LoopInfoWrapperPass to the preserved list.

kbarton updated this revision to Diff 182396.Jan 17 2019, 1:59 PM
kbarton marked 4 inline comments as done.

Addressed comments from dmgreen.

Initial comments from my side.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
142–143 ↗(On Diff #178761)

It is not uncommon in LLVM to do it this way. I personally do not try to force a solution similar to bool createXXX(...) but I also do not necessarily object if there is an actual reason.

591–594 ↗(On Diff #178761)

Because, as an example and not the only reason, we could easily generate statistics on the impact predication support might have.

1015 ↗(On Diff #179130)

I don't get this. Could one of you produce a problematic example for me? (also good as a test case)

7 ↗(On Diff #182396)

We need to update the licence ;)

111 ↗(On Diff #182396)

You should keep it this way. Other passes do it similarly.

848 ↗(On Diff #182396)

This should probably be a conditional and a return false ;)

1162 ↗(On Diff #182396)

As above, preserve LoopAnalysis.

dmgreen added inline comments.Jan 23 2019, 8:41 AM
llvm/lib/Transforms/Scalar/LoopFuse.cpp
858 ↗(On Diff #179130)

I remember DA having problems on 64bit systems due to all the sexts that kept happening. Delinearising constant's too, maybe? There are certainly problems in it that would be worth trying to fix (or find a replacement to do it better).

The delinearisation that DA will attempt may also be helpful. It is nowadays all built on SCEV's too, so should in theory (baring the facts that we know here about dealing with different but similar loops) be a more powerful version of the SCEV code here. That power might be confusing it at the moment though.

I would expect that better DA is something llvm will need sooner or later as more loop optimisations are implemented. Perhaps this is something that we can improve in the future, relying on the SCEV code at the moment, but getting more help from DA as it is improved. The no-aliasing checks it can do are useful at least, it would seem.

1003 ↗(On Diff #179130)

I think it's probably fine, just something I've run into in the past with PDT's. If it's something we run into, we can fix it then. And if it does become an issue, it may be better to teach the DTU to do this itself.

529 ↗(On Diff #182396)

Is there reason in the current version to prevent fusion if "FC0" has instructions in its preheaders?

917 ↗(On Diff #182396)

I'm not sure if this is quite strong enough. Consider something like this, where the sum would be used in the second block, but not as phi:

for(i = 0; i < n; i++)
  sum += a[i];
for(i = 0; i < n; i++)
  b[i] = a[i]/sum;

These can be fused, but use the wrong "sum" on each iteration of the loop. Unroll and jam side-steps all this by requiring lcssa nodes (and, you know, not requiring generalised loop fusion :) )

Herald added a project: Restricted Project. · View Herald TranscriptFeb 11 2019, 1:14 PM
kbarton marked 16 inline comments as done.Feb 12 2019, 9:58 AM

I've addressed almost all of the comments.
I will work on an update to the problem that dmgreen provided an example for and post a new version of the patch when that is done.
In the meantime, if my understanding of the comments is incorrect, someone please correct me :)

llvm/lib/Transforms/Scalar/LoopFuse.cpp
142–143 ↗(On Diff #178761)

OK, so I'm still not sure what the preference is here:

  1. Keep it as is.
  2. Change to create an empty object and then require an Initialize method call on it.
  3. Create a wrapper, createFusionCandidate(), that does the construction and initialization.
  4. Some other alternative that I don't see.

If I'm going to change, I prefer the createFusionCandidate route (which I hadn't thought of earlier) which I think is the cleanest, although I don't know how to enforce it's usage.

440–441 ↗(On Diff #178761)

I addressed this a while ago, I just forgot to check it off as done.

858 ↗(On Diff #179130)

I completely agree about improving DA and the need for it with other loop optimizations. I still haven't had a chance to look at it in detail, but would like to start looking at it once this patch is finalized and lands.

For now are you OK with the current usage of DA here? I'm hoping that as we extend/improve it, we can modify this code to take advantage of it.

1003 ↗(On Diff #179130)

OK, sounds good. I will mark this comment as done then.

1015 ↗(On Diff #179130)

dmgreen added an example above that illustrates the issue (after some manual modifications to the IR to get around other limitations in fusion).

529 ↗(On Diff #182396)

Strictly speaking, no, I there is no reason to prevent fusion if FC0 has instructions in its preheader.
This check was meant to prevent fusion if FC1 has instructions in the preheader since we don't have any analysis to move them around yet
I've moved this check later when we are trying to fuse pairs of candidates. If FC1 has a non-empty preheader then fusion is skipped.

848 ↗(On Diff #182396)

I've changed from the assert to a check and print message.
After that, all the conditions just result in a return false, which cleans this up a lot.

917 ↗(On Diff #182396)

Yes, you're right. If I generate the ll for this example and massage it to make it eligible for fusion, we will fuse these.
However, I'm confused by your statement:

These can be fused, but use the wrong "sum" on each iteration of the loop.

I don't see how fusion is legal here. Are you saying that the current check will (incorrectly) still allow fusion? Or there is a way to fuse these loops?

For this example, lcssa will create a non-empty preheader for the second loop and the earlier checks preventing fusion will bail before we get to this test. If I manually sink the PHI from lcssa into the header of the second loop, I can create an empty preheader for the second loop to allow fusion to continue.

At any rate, I think this check needs to be strengthened to prevent fusion from happening in this case. Do you agree?

dmgreen added inline comments.Feb 13 2019, 9:41 AM
llvm/lib/Transforms/Scalar/LoopFuse.cpp
858 ↗(On Diff #179130)

Sounds good to me. The !DepResult check can help in some cases at least, and the rest we can add as DA is improved.

917 ↗(On Diff #182396)

Yeah, sorry. I meant _will_ fuse, but shouldn't. "Can" was a bit misleading.

kbarton marked 9 inline comments as done and an inline comment as not done.Mar 1 2019, 9:03 AM

Addressed some more comments. I'll upload a new patch momentarily.

kbarton updated this revision to Diff 188928.Mar 1 2019, 9:59 AM

Addressed most of the review comments.

I think there are still two outstanding issues to resolve before committing this patch:

  1. The current constructors for FusionCandidates - whether to leave them as is or whether to reimplement using a createFusionCandidate wrapper that removes most of the logic from the constructor.
  2. New safety check that I added, to detect cross-loop dependencies that were previously missed. The new check is around line 918 of LoopFuse.cpp (inside dependencesAllowFusion method).
dmgreen added inline comments.Mar 3 2019, 6:12 AM
llvm/lib/Transforms/Scalar/LoopFuse.cpp
929 ↗(On Diff #188928)

I think that you can just do something like

if (Instruction *Def = dyn_cast<Instruction>(Op))
  if (FC0.L->contains(Def->getParent()))
    return false;

Providing there are no lcssa phis, this should rule out anything that depends on the first loop.

llvm/test/Transforms/LoopFusion/cannot_fuse.ll
283 ↗(On Diff #188928)

You can run something like -loop-rotate to make this into a more simply structured loop. -instcombine will remove any lcssa nodes, and -simplify-cfg can clean things up too.

I guess that will depend on what types of loops you want to test, ones that have been cleanup up or ones that are a little less structured. It's good to test both, but which is more important will depend on where in the pass pipeline this ends up.

kbarton marked 2 inline comments as done.Mar 22 2019, 2:07 PM
kbarton added inline comments.
llvm/lib/Transforms/Scalar/LoopFuse.cpp
929 ↗(On Diff #188928)

I just finished up testing this change, and then realized I think this is overly restrictive.
I had originally restricted it to PHIs because I only wanted to include things that are (re)defined in the first loop. Basically, the first loop needs to complete in order to get the correct definition going into the second loop.

This alternate sequence will also include stores, not just PHIs. The store can be loop invariant, and if it is not hoisted out of the loop prior, then it would also prevent fusion.

llvm/test/Transforms/LoopFusion/cannot_fuse.ll
283 ↗(On Diff #188928)

I completely agree.
In fact we have a subsequent patch to post once this one lands that restricts loop fusion to only run on rotated loops. This seems to be a common theme among many loop optimizations, and restricting to rotated loops simplifies the implementation in several places. I'm planning on discussing this at my presentation at EuroLLVM in a couple weeks :)

dmgreen added inline comments.Mar 24 2019, 7:39 AM
llvm/lib/Transforms/Scalar/LoopFuse.cpp
129 ↗(On Diff #188928)

Perhaps name it ExitingBlock, if there is only one.

929 ↗(On Diff #188928)

This is the kind of thing I was thinking of:

define float @test(float* nocapture %a, i32 %n) {
entry:
  %conv = zext i32 %n to i64
  %cmp32 = icmp eq i32 %n, 0
  br i1 %cmp32, label %for.cond.cleanup7, label %for.body

for.body:                                         ; preds = %for.body, %entry
  %i.034 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
  %sum1.033 = phi float [ %add, %for.body ], [ 0.000000e+00, %entry ]
  %idxprom = trunc i64 %i.034 to i32
  %arrayidx = getelementptr inbounds float, float* %a, i32 %idxprom
  %0 = load float, float* %arrayidx, align 4
  %add = fadd float %sum1.033, %0
  %inc = add nuw nsw i64 %i.034, 1
  %cmp = icmp ult i64 %inc, %conv
  br i1 %cmp, label %for.body, label %for.body8

for.body8:                                        ; preds = %for.body, %for.body8
  %i2.031 = phi i64 [ %inc14, %for.body8 ], [ 0, %for.body ]
  %idxprom9 = trunc i64 %i2.031 to i32
  %arrayidx10 = getelementptr inbounds float, float* %a, i32 %idxprom9
  %1 = load float, float* %arrayidx10, align 4
  %div = fdiv float %1, %add
  store float %div, float* %arrayidx10, align 4
  %inc14 = add nuw nsw i64 %i2.031, 1
  %cmp5 = icmp ult i64 %inc14, %conv
  br i1 %cmp5, label %for.body8, label %for.cond.cleanup7

for.cond.cleanup7:                                ; preds = %for.body8, %entry
  %sum1.0.lcssa36 = phi float [ 0.000000e+00, %entry ], [ %add, %for.body8 ]
  ret float %sum1.0.lcssa36
}

The importand part being %add, that is defined in the first loop and used in the second. I believe that using normal SSA def-use chains, anything that is def'd in the first loop (and used in the second) will need to use the value from the last iteration of the loop, so fusion will be illegal.

There may be cases where a def has the same value on every iteration of the loop, but I imagine those would already have been hoisted/sunk, or be quite rare.

1083 ↗(On Diff #188928)

Whilst checking the example, I noticed the dom tree updater has become stricter about "inbalanced" tree operations. I think that if FC0.ExitingBlocks == FC0.Latch, we add a link to FC1.Header twice, which is maybe what it's complaining about (?)

Oh, forgot to mention, I am looking forward to getting this in tree, and think it's very close. If we can sort out the last couple of things here, I think it would be good to get it committed so we can have a play around and sort out some of the questions like where in the pass pipeline it fits.

kbarton marked 10 inline comments as done.Mar 25 2019, 8:18 PM
kbarton added inline comments.
llvm/lib/Transforms/Scalar/LoopFuse.cpp
858 ↗(On Diff #179130)

I'm going to mark this as done, since I think we are in agreement we want to work on improving the dependence analysis in the future. If this is incorrect, please let me know.

129 ↗(On Diff #188928)

Good idea. Will fix in patch I'm going to post soon.

929 ↗(On Diff #188928)

Yes, I see now. It's possible to have a def in addition to a PHI.
If I wanted to catch this, I would also need to traverse all the inputs to any defs and make sure they don't come from PHIs within loop 0 as well.
I'm OK with restricting this to defs, with the assumption that if there is any loop invariant defs they would have been hoisted out of the loop at this point.

1083 ↗(On Diff #188928)

Yup, good catch. I found the same problem when I tried your test case.
I've fixed the problem and added your test above to cannot_fuse.ll.

kbarton updated this revision to Diff 192242.Mar 25 2019, 8:23 PM
kbarton marked 2 inline comments as done.

Address review comments as follows:

  1. Strengthen dependence checks between any value used in Loop 1 to ensure it is not defined in loop 0.
  2. Do not add a tree update to insert an edge between FC0.Latch and FC1.Header when FC0.ExitingBlock is the same as FC0.Latch. This edge has already been inserted into the tree updates earlier and will cause an assertion to fail while performing the updates.
  3. Rename ExitingBlocks to ExitingBlock since we are only concerned with loops with a single exiting block.
dmgreen accepted this revision.Mar 26 2019, 5:51 AM

Thanks. I had another look through and think this is in a really good state. I'm very happy for it to go in.

I'm not sure if there were any other remaining comments from @Meinersbur or @jdoerfert?

This revision is now accepted and ready to land.Mar 26 2019, 5:51 AM

Thanks @dmgreen!
The only outstanding issue was from @Meinersbur regarding the constructor for FusionCandidate. The concern was keeping the logic in the constructor, vs moving it out into a createFusionCandidate method to wrap the constructor and add all the logic to that method. I don't have a preference one way or the other, but will happily make the change if people feel strongly about it.

Meinersbur accepted this revision.Apr 11 2019, 12:00 PM

Generally, LGTM; with some smaller suggestions.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
142–143 ↗(On Diff #178761)

I was thinking the following pattern:

FusionCandidate* makeFusionCandiate(...) {
  BasicBlock *Preheader;
  BasicBlock *Header;
  BasicBlock *ExitingBlocks;
  BasicBlock *ExitBlock;
  BasicBlock *Latch;
  // The above are redundant since llvm::Loop provides accessors  
  // I suggest to not cache them in other objects (they may require being updated) unless it measurable improves performance

  Loop *L;
  SmallVector<Instruction *, 16> MemReads;
  SmallVector<Instruction *, 16> MemWrites;
  
  if (!condition) {
    LLVM_DEBUG(dbgs() << "The reason\n");
    InvalidationReason++;
    return nullptr;
  }

  return new FusionCandidate(L, MemReads, MemWrites);
}
142–143 ↗(On Diff #178761)

It is just a suggestion and does not correspond to an LLVM coding policy. Keep as-is, it is not an requirement for me to accept the patch.

204 ↗(On Diff #178761)

[style] dump() methods are usually guard by

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)

as mentioned in the comment above LLVM_DUMP_METHOD

276 ↗(On Diff #178761)

[suggestion] IMHO it is a good style to not depend on a particular implementation of ADTs. Code that advances iterators manually depending on inserted/removed elements can be confusing and may break easily in future changes.

However, as with the previous [suggestion], I am ok with keeping it as-is.

977–980 ↗(On Diff #178761)

Are you sure? There is SmallVectorImpl<T>::emplace_back.

292 ↗(On Diff #179130)

[suggestion] This could be a range-based for-loop.

323 ↗(On Diff #192242)

[style] Did you consider LI.empty()?

935 ↗(On Diff #192242)

[suggestion] Consider using a method name that makes clear that is has no side-effects. isEmptyPreheader?
emptyPreheader sounds like it would remove instructions from the preheader until it is empty.

kbarton marked 24 inline comments as done.Apr 16 2019, 10:41 AM

All comments have been addressed. I will be committing this soon.

llvm/lib/Transforms/Scalar/LoopFuse.cpp
142–143 ↗(On Diff #178761)

I will mark this conversation as done as I think we've agreed the existing implementation is OK. We can refactor/redo in the future if this becomes too cumbersome.

276 ↗(On Diff #178761)

Thanks. I will mark this as done for now.
This is also something that may need to get revisited as the code evolves.

977–980 ↗(On Diff #178761)

You're right. The syntax I was using was wrong. I've fixed it and changed to using emplace_back.

323 ↗(On Diff #192242)

I don't think so.
I've changed it, as I think it makes it more readable.

This revision was automatically updated to reflect the committed changes.
kbarton marked 4 inline comments as done.
rtereshin added inline comments.
llvm/trunk/lib/Transforms/Scalar/LoopFuse.cpp
324

It appears that if both NDEBUG and LLVM_ENABLE_DUMP are defined, the function will be compiled in, but all LLVM_DEBUG macro eliminated. Resulting

static void
printFusionCandidates(const FusionCandidateCollection &FusionCandidates) {
  for (const auto &CandidateSet : FusionCandidates) {
  }
}

produces an "unused variable" CandidateSet warning. We probably don't need to wrap anything in LLVM_DEBUG here.

Meinersbur added inline comments.Apr 19 2019, 9:04 AM
llvm/trunk/lib/Transforms/Scalar/LoopFuse.cpp
324

We probably don't need to wrap anything in LLVM_DEBUG here.

I agree, but for a different reason:
All calls to printFusionCandidates are already wrapped by LLVM_DEBUG. The macro checks whether anything should be printed by checking the -debug/-debug-only cl::opt.