Page MenuHomePhabricator

[LICM] Make LICM able to hoist phis
ClosedPublic

Authored by john.brawn on Oct 3 2018, 5:06 AM.

Details

Summary

The general approach taken is to make note of loop invariant branches, then when we see something conditional on that branch, such as a phi, we create a copy of the branch and (empty versions of) its successors and hoist using that.

This has no impact by itself that I've been able to see, as LICM typically doesn't see such phis as they will have been converted into selects by the time LICM is run, but once we start doing phi-to-select conversion later it will be important.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Does this lose debug information? I do very little with optimization passes so I don't know, but it would be nice to keep it in mind.

john.brawn marked 8 inline comments as done.Oct 31 2018, 8:54 AM
john.brawn added inline comments.
lib/Transforms/Scalar/LICM.cpp
497

It is actually possible to have more than 1 here. I'll adjust it to have deterministic behaviour for that case (by picking whichever happens to be first in the function's block list).

702

It's actually possible to not find anything, in which case we fall back to picking the first element of the worklist. I've updated the comment to make this clear.

john.brawn marked 2 inline comments as done.Oct 31 2018, 9:15 AM

Does this lose debug information? I do very little with optimization passes so I don't know, but it would be nice to keep it in mind.

The hoisting of phis goes through the hoist function, so it's no worse than anything else i.e. it loses the debug information, see line 1404.

Update according to review comments.

mkazantsev requested changes to this revision.Oct 31 2018, 11:01 PM
mkazantsev added inline comments.
lib/Transforms/Scalar/LICM.cpp
501

The general algorithm for non-empty TrueDestSucc deals well with case of size == 1; do we really need separate processing for this case?

622

Maybe I am missing something, but where do we notify DomTree of existence of these new edges?

683

I guess you could use insert(Worklist.begin(), Worklist.end()), it's tad faster.

702

I think I now understand the algorithm you're trying to apply.

Imagine the graph A -> B -> C -> D -> E, and the blocks happened to be arranged as E, D, C, B, A in the Worklist. Maybe I am missing something, but it seems that the algorithm inside collectChildrenInLoop is BFS-based and cannot protect us from this situation. It could, for example, happen if all these blocks have a common parent and arranged in the order E, D, C, B, A in this parent.

The downside if your algorithm is that it will traverse over all worklist to find A, then over all remaining worklist to find B and so on. Each any_of query will work O(N), and you do N such queries, so overall algorithmic complexity is O(N^2).

The correct way to efficiently process all blocks before their successors is processing them in Reverse Post-Order. This arrangement by definition gives you what you want: it sorts the blocks in order that all successors go after their predecessors. If the elements of Worklist are arranged in RPO, you can just process them in order without extra checks, and it will be O(N).

You can see example how to build RPO in class LoopBlocksDFS, the methods you need are beginRPO(), endRPO().

And aside from all that above, erasing the last element from the Worklist is O(1) and of the first one is O(N). When you don't care what element to erase, you should do so to the last one. Erasing the first element is expensive. However once you process them in RPO order, you don't need to erase anything from the Worklist at all because you know that you've processed all predecessors before all successors.

This revision now requires changes to proceed.Oct 31 2018, 11:01 PM

I generally like what this patch is doing, but the size of code and its complexity overwhelms me. Is it possible to separate the patch into smaller pieces so that it would be easier to review them in isolation? If yes, I'd appreciate that a lot.

john.brawn added inline comments.Nov 6 2018, 8:39 AM
lib/Transforms/Scalar/LICM.cpp
501

We could have one case for size >= 1 but that would mean iterating through potentially all of the blocks in the function to find the one element of the set, which seems rather wasteful when we can just take that one element directly.

622

Inserting the edge is never needed. insertEdge only has an effect when the source and destination blocks have different immediate dominators (see InsertReachable in GenericDomTreeConstruction.h) and that never happens - HoistTrueDest and HoistFalseDest always have the same immediate dominator as HoistCommonSucc, and for the edge from HoistCommonSucc to TargetSucc it may be that HoistCommonSucc is now the immediate dominator of TargetSucc but that's handled in the loop at line 632.

702

Yes it does look like reverse post-order gives me what I want (though I have to adjust some of the tests as it picks a different but equally good order). I'll modify this to do that.

I generally like what this patch is doing, but the size of code and its complexity overwhelms me. Is it possible to separate the patch into smaller pieces so that it would be easier to review them in isolation? If yes, I'd appreciate that a lot.

I don't really think so. The only way I can think of slicing it up would be to add the control flow hoisting (which does nothing useful by itself) as one patch, and then the phi hoisting as a seprate patch, but the phi hoisting part is very small and simple so I don't think it would usefully reduce the size of the first part.

Use LoopBlocksRPO to generate the worklist.

mkazantsev added inline comments.Nov 6 2018, 7:55 PM
lib/Transforms/Scalar/LICM.cpp
468

Preheader by definition has only one successor which is the loop header. Therefore, OriginalPreheaderChildren always has one element which is header's DTNode. So it doesn't look like you even need this field, just take loop header (it doesn't change, right?)

Or am I missing something here?

And btw, if it needs to be a field and a vector, use SmallVector instead of std::vector.

557

This name is misleading, it doesn't look like something that can change the IR. How about getOrCreateHoistedBlock?

648

Please add verification of LI in debug mode just to make sure that we didn't mess up anything (maybe not here, but in some reasonable place).

682

Why not SmallVector and push_back? That wouldn't make any difference in terms of effect but less memory-consuming.

821

This is the part I don't understand. If the hoist destination doesn't dominate instruction's users then it also doesn't dominate the original instruction. Is LICM able to hoist to such locations? Is there an example of that?

832

Set Changed here.

833

Do we really need to change the hoist point? Why not just hoist of them before preheader's terminator? That would make this code simpler.

1409–1415

Braces not needed (here and in some other places like this).

test/Transforms/LICM/hoist-phi.ll
127

Please add TODO: to the tests you think we should cover in the future, that will make it easier to track if we decide to expand this transform to something more general.

mkazantsev requested changes to this revision.Nov 6 2018, 8:25 PM

Missing Changed update and LI sanitizing should be added.

This revision now requires changes to proceed.Nov 6 2018, 8:25 PM
john.brawn marked 6 inline comments as done.Nov 7 2018, 8:45 AM
john.brawn added inline comments.
lib/Transforms/Scalar/LICM.cpp
468

Yes, I think you're right. I'll do some testing to double-check then change this.

682

The iteration order when rehoisting will have to be changed (because we have to visit instructions before the instructions they use, i.e. later instructions before earlier instructions), but yes that will work.

821

We get this when we hoist an instruction but don't hoist all of its uses. An easy example of this is a phi where one operand is loop invariant and the other is not. The hoisted loop invariant operand will not dominate its use. In the tests I've added @conditional_use, @rehoist, and @diamond_with_extra_in_edge are all examples of where rehoisting happens.

833

If we have something like

%a = add i32 %loop_invariant, 1
%b = mul i32 %a, 2

then if %b does not dominate its uses we have to rehoist %b, at which point %a no longer dominates %b and has to be rehoisted to before %a. That's what HoistPoint is doing, it's making sure we rehoist instructions before rehoisted instructions that use them.

mkazantsev added inline comments.Nov 8 2018, 2:13 AM
lib/Transforms/Scalar/LICM.cpp
682

You can preserve the order of processing if you just iterate in reverse order.

mkazantsev added inline comments.Nov 8 2018, 2:17 AM
lib/Transforms/Scalar/LICM.cpp
821

Right, it makes sense when uses are Phis. Maybe it should be explicitly stated in the comment. :)

833

But if we process %a and then %b and insert each of them before the terminator, we don't have this problem. It seems that using a vector instead of a list will spare you of this problem.

john.brawn updated this revision to Diff 173148.Nov 8 2018, 5:21 AM
john.brawn marked an inline comment as done.

Use SmallVector for HoistedInstructions, rename getHoistedBlock, add LI and DT verification, other small changes.

I have some minor comment, only the one that is related to SafetyInfo worries me (yet it just needs using utility function instead of just moveBefore). All other things are non-functional nits.
I'll take some time running fuzz tests with this patch because it's big. :) Please wait few hours, I will either give you approval (under condition that my comments will be addressed) or give you a failing test example.

lib/Transforms/Scalar/LICM.cpp
503

Minor: please add explanatory messages to the assertions you make.

561

If you only expect BI to be the only element with the property you need, it would make sense to save some compile time by inserting break here. But on the other hand, the assertion you make also makes sense. How about this: BI = find_if(<cond>), and then assert that find_if(<cond> && el != BI) == HoistableBranches.end()? Just a suggestion.

832

Removal/insertion of instructions to blocks may also need SafetyInfo updates. We have a helper function eraseInstruction for removal that handles it properly, but there was no helper for moveBefore because of only 1 occurrence. I've just added it in the patch rL346472

I am pretty sure that the current code is accidentally correct because rehoisting happens outside of loop (and SafetyInfo doesn't really care), but please rebase on top of this patch and use this utility whenever you move instructions to keep things general.

840

It makes more sense to verify DT before LI since LI uses it.

1389–1390

After you've changed the semantics, Preheader is no longer the preheader of CurLoop (or is it)? If not, please rename it. If yes, why do you need it as parameter? You can just take loop's preheader as it was before.

In any case, it should be const BasicBlock *.

mkazantsev added inline comments.Nov 8 2018, 10:06 PM
lib/Transforms/Scalar/LICM.cpp
635

It can be useful to have staitstics on hoisted branches. I'm OK if it goes as a follow-up patch.

mkazantsev requested changes to this revision.Nov 9 2018, 3:30 AM

One of my fuzz tests has failed with the following assertion:

PHINode should have one entry for each predecessor of its parent basic block!
%.us-phi = phi i32 [ %res.i.peel, %cHeapLvb.exit181 ], [ %83, %bci_229.licm.split.us.licm ], [ %83, %bci_229.licm.split.us.licm ]

It will take me time to reduce the test to something reasonable, but maybe this info will help you find the bug in the code. So far I can say that the problematic situation happens when FalseDest == CommonSucc. Hope if helps you to diagnose the bug and construct a test. If not, I'll be able to provide the reduced test case early next week.

This revision now requires changes to proceed.Nov 9 2018, 3:30 AM

The patch fails the following test:

opt -licm -S test.ll

; ModuleID = './bugpoint-reduced-simplified.ll'
source_filename = "./bugpoint-reduced-simplified.ll"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128-ni:1"

define void @bar() {
bb:
  %tmp = call i32 @llvm.x86.sse2.cvttsd2si(<2 x double> undef)
  br label %bb3

bb1:                                              ; preds = %bb4
  %tmp2 = phi i32 [ %tmp5, %bb4 ]
  ret void

bb3:                                              ; preds = %bb4, %bb
  br i1 undef, label %bb6, label %bb4

bb4:                                              ; preds = %bb6, %bb6, %bb3
  %tmp5 = phi i32 [ %tmp, %bb3 ], [ undef, %bb6 ], [ undef, %bb6 ]
  br i1 undef, label %bb1, label %bb3

bb6:                                              ; preds = %bb3
  br i1 undef, label %bb4, label %bb4
}

; Function Attrs: nounwind readnone
declare i32 @llvm.x86.sse2.cvttsd2si(<2 x double>) #0

attributes #0 = { nounwind readnone }

The failure looks like

PHINode should have one entry for each predecessor of its parent basic block!
  %tmp5 = phi i32 [ %tmp, %bb ], [ undef, %bb6.licm ], [ undef, %bb6.licm ]
in function bar
LLVM ERROR: Broken function found, compilation aborted!

Likely some corner case related to conditional branch to the same block for true and false is not handled. Please make sure that it passes before we can proceed with the patch.

john.brawn marked 6 inline comments as done.Nov 14 2018, 5:47 AM

The patch fails the following test:

...

Likely some corner case related to conditional branch to the same block for true and false is not handled. Please make sure that it passes before we can proceed with the patch.

Yes, it's specifically that in such cases the phi in the destination block has the same incoming block appear more than once which I didn't expect. I think the easiest thing to do is to have canHoistPHI return false for such phis.

lib/Transforms/Scalar/LICM.cpp
635

I'll add this and also one for the number of created blocks, which seems like it may also be useful.

1389–1390

I left it as Preheader just because it was simpler not to change it, but yes it makes sense to rename it so I'll do so. It can't be const though, as we're hoisting I into it (i.e. modifying it).

john.brawn marked 2 inline comments as done.

Update according to review comments.

I don't see any obvious problems in code. Let me run one more round of fuzz testing, I'll give my approval if it passes.

This revision is now accepted and ready to land.Nov 18 2018, 8:25 PM
This revision was automatically updated to reflect the committed changes.

Looks like this was reverted in r347225.

The thing causing the crash was incorrect handling of instructions that need to be rehoisted. When we have hoisted a phi, then hoisted a use of that phi, then if have to rehoist that use we need to make sure that we rehoist it to _after_ the hoisted phi. We can do this by rehoisting to the immediate dominator instead of just rehoisting everything to the original preheader.

john.brawn reopened this revision.Nov 20 2018, 9:12 AM
This revision is now accepted and ready to land.Nov 20 2018, 9:12 AM
john.brawn requested review of this revision.Nov 20 2018, 9:13 AM

This rehoisting stuff has been wary for me before, and now I'm completely hesitant about what is going on there. Why do we end up hoisting to some block that does not dominate its users? Do we have a strong reason of doing so rather than finding the correct hoisting destination?

This rehoisting stuff has been wary for me before, and now I'm completely hesitant about what is going on there. Why do we end up hoisting to some block that does not dominate its users? Do we have a strong reason of doing so rather than finding the correct hoisting destination?

Let's look at the @phi_conditional_use test in hoist-phi.ll as an example:

define i64 @phi_conditional_use(i32 %f, i32* %g) {
entry:
  %cmp1 = icmp eq i32 %f, 1
  %cmp2 = icmp eq i32 %f, 0
  br label %loop

loop:
  br i1 %cmp1, label %if.end, label %if.then

if.then:
  br label %if.end

if.end:
  %phi = phi i64 [ 0, %loop ], [ 1, %if.then ]
  br i1 %cmp2, label %loop.backedge, label %if.then2

if.then2:
  %d = getelementptr inbounds i32, i32* %g, i64 %phi
  store i32 1, i32* %d, align 4
  br label %loop.backedge

loop.backedge:
  br label %loop
}

When looking at %loop we note that the branch uses a loop invariant condition. Then when looking at %if.end we see that %phi has loop invariant operands, and is the merge point of the branch we noted, so we copy the control flow and hoist the phi:

define i64 @phi_conditional_use(i32 %f, i32* %g) {
entry:
  %cmp1 = icmp eq i32 %f, 1
  %cmp2 = icmp eq i32 %f, 0
  br i1 %cmp1, label %if.end.licm, label %if.then.licm

if.then.licm:
  br label %if.end.licm

if.end.licm:
  %phi = phi i64 [ 0, %entry ], [ 1, %if.then.licm ]
  br label %loop

loop:
  br i1 %cmp1, label %if.end, label %if.then

if.then:
  br label %if.end

if.end:
  br i1 %cmp2, label %loop.backedge, label %if.then2

if.then2:
  %d = getelementptr inbounds i32, i32* %g, i64 %phi
  store i32 1, i32* %d, align 4
  br label %loop.backedge

loop.backedge:
  br label %loop
}

We then note that the branch at the end if %if.end has a loop invariant condition. We then look at %if.then2 and see that %d has loop invariant operands, and that it's in a block that's the target of a noted branch, so we duplicate control flow and hoist:

define i64 @phi_conditional_use(i32 %f, i32* %g) {
entry:
  %cmp1 = icmp eq i32 %f, 1
  %cmp2 = icmp eq i32 %f, 0
  br i1 %cmp1, label %if.end.licm, label %if.then.licm

if.then.licm:
  br label %if.end.licm

if.end.licm:
  %phi = phi i64 [ 0, %entry ], [ 1, %if.then.licm ]
  br i1 %cmp2, label %loop.backedge.licm, label %if.then2.licm

if.then2.licm:
  %d = getelementptr inbounds i32, i32* %g, i64 %phi
  br label %loop.backedge.licm

loop.backedge.licm:
  br label %loop

loop:
  br i1 %cmp1, label %if.end, label %if.then

if.then:
  br label %if.end

if.end:
  br i1 %cmp2, label %loop.backedge, label %if.then2

if.then2:
  store i32 1, i32* %d, align 4
  br label %loop.backedge

loop.backedge:
  br label %loop
}

Now we look at the store in %if.then2, which has loop invariant operands but can't be hoisted as it's not safe to hoist the store.

We're now done hoisting things, but we have the problem that %d does not dominate the store in %if.then2. So we have to move it somewhere where it does dominate %if.then2, but we also have to make sure it's after %phi, as it uses that as an operand. So it has to go in %if.end.licm after %phi.

Some comments on all of this:

At the time that we hoist %d we don't know if its uses are also going to get hoisted. We _could_ do something like rewrite LICM to be a two-step process where it first makes note of which instructions are going to get hoisted, then hoists them using knowledge of if their uses are going to get hoisted to figure out where they get hoisted. But that would be a lot more disruption than the current patch.

We could change things to never hoist things into conditional blocks i.e. %d would be hoisted directly to %if.end.licm. There's two reasons I didn't do this:

  • As explained in the comment at line 715, hoisting to conditional blocks could in the future be used as away to be able to hoist instructions that are unsafe to execute unconditionally, e.g. the store could actually be hoisted to %if.then2.licm as it would only get executed in the cases where the store in the loop would have been executed.
  • It seems a little strange to just speculatively execute instructions without limit because you're hoisting them outside of the loop. It's what LICM currently does, but it's at odds with the current behaviour elsewhere e.g. in SimplifyCFG which does the same thing (except hoisting instructions to their immediate predecessor block instead of outside a loop) but has a lot of (inconsistent) heuristics for deciding if it's a good idea or not. Doing it like this means we can leave it up to later passes to figure out if the instructions should be unconditional.
  • Additionally: doing this would means that "the hoisted version of a block X for the purpose of control flow and phis" and "which block to hoist instructions from block X into" could be different blocks whereas they're currently always the same block, which would need some extra code to handle (though of course at the same time the code to do rehoisting would get removed so it could turn out to be about the same complexity-wise).
mkazantsev requested changes to this revision.Nov 25 2018, 9:33 PM

I don't really have strong objections against that. Thanks for your explanation. It could've been a Phi node insertion instead of rehoisting in this case, but I don't really see why rehoisting should be bad (other than we may end up executing code that never executes, but it's a general LICM flaw and not specific to this patch). I also see no bug in what we have now. So I'm ok with the idea.

I am not comfortable with reverting it back and forth, so I will request some re-design of your patch. Please make an option to switch off/on CFG hoisting. We will need it if this transform will expose some unexpected bugs in the future. Reverting patch that big may become hard in time. Then, I'm OK if you check in the current patch with this option switch by default (in tests, it can be set to true).

Then, you can check in a patch that turns it on by default.

This revision now requires changes to proceed.Nov 25 2018, 9:33 PM

Added an option to enable control flow hoisting, turned off by default.

This revision is now accepted and ready to land.Nov 27 2018, 5:27 PM
This revision was automatically updated to reflect the committed changes.

This caused a significant regression in compile time for some sources, see https://bugs.llvm.org/show_bug.cgi?id=39836 for details. Building one source file went from 45 seconds to 140 seconds.

Reverted in SVN r347867.