This is an archive of the discontinued LLVM Phabricator instance.

[LICM] sink through non-trivially replicable PHI
ClosedPublic

Authored by junbuml on Aug 25 2017, 1:44 PM.

Details

Summary

The current LICM allows sinking an instruction only when it is exposed to exit
blocks through a trivially replacable PHI of which all incoming values are the
same instruction. This change enhance LICM to sink a sinkable instruction
through non-trivially replacable PHIs by spliting predecessors of loop
exits.

Diff Detail

Repository
rL LLVM

Event Timeline

junbuml created this revision.Aug 25 2017, 1:44 PM
efriedma added inline comments.Aug 25 2017, 2:43 PM
lib/Transforms/Scalar/LICM.cpp
803 ↗(On Diff #112738)

Do you need to do something to keep SunkCopies up to date?

858 ↗(On Diff #112738)

The indexing here is a little weird; can you first compute all the blocks to split, then split them?

junbuml updated this revision to Diff 113103.Aug 29 2017, 9:23 AM

Addressed Eli's comment.

lib/Transforms/Scalar/LICM.cpp
803 ↗(On Diff #112738)

This is to clone only one instruction per exit block and I couldn't think of a case where the same instruction is cloned several times in the same exit block. Please let me know if you can see any case where we need to update SunkCopies.

efriedma added inline comments.Aug 29 2017, 11:59 AM
lib/Transforms/Scalar/LICM.cpp
803 ↗(On Diff #112738)

The reason the SunkCopies map exists is that there could be multiple PHI nodes which use the same instruction in a loop exit block.

Maybe it would be more straightforward if you would make all the necessary calls to splitPredecessorsOfLoopExit before you actually start sinking instructions?

junbuml updated this revision to Diff 113266.Aug 30 2017, 8:30 AM

Addressed Eli's comment.

junbuml added inline comments.Aug 30 2017, 8:33 AM
lib/Transforms/Scalar/LICM.cpp
803 ↗(On Diff #112738)

Now, I split non-trivially replaceable PHIs before sinking in sink(). Please take a look and let me know any comment. Thanks Eli for the review.

efriedma edited edge metadata.Aug 30 2017, 11:39 AM

I'd like more eyes on this to figure out if there are potential problems with this transform (maybe there's a performance cost for splitting edges in some cases?).

A few more testcases would be nice; maybe a loop with multiple exit blocks, a loop with more than two exits pointing to the same exit block, and a loop where multiple exit PHI nodes use the same sinkable instruction.

lib/Transforms/Scalar/LICM.cpp
850 ↗(On Diff #113266)

Iterating over a SmallPtrSet is non-deterministic.

879 ↗(On Diff #113266)

Maybe put a comment here describing what the first loop is doing?

925 ↗(On Diff #113266)

Unnecessary cast.

Overall, i'm unclear on why you can't compute which ones will actually be sunk, and then split the preds.
Also, the examples does not even require splitting preds to effect the transform.

These are simply phi of ops form, and there is an equivalent op of phis form you could use directly in the exit block.
For example, you have, in the last test (i converted the checks to the code):

Out12.split.loop.exit:
lcssaphi1 = phi i32 [ %N_addr.0.pn, %ContLoop ]
mulres  = mul i32 %N, %lcssaphi1
subres1 = sub %mulres, %N2
br label %Out12

Out12.split.loop.exit1:
lcssaphi2 = phi i32 [ %N_addr.0.pn, %Loop ]
mulres2 =  mul i32 %N, %lcssaphi2
subres2 = sub i32, %mulres2, %N
br label %Out12

Out12:
something = phi i32 [%subres, %Out12.split.loop.exit ], [ %subres2, %Out12.split.loop.exit1 ]

This is the same as

Out12: 
phires = phi i32 [%N, %Loop], [%N2, %ContLoop]
mulres = mul i32 %N, %N_addr.0.pn
; something here has the same value as the phi named something above
something = sub i32 %mulres, %phires

It seems to me if you can compute which can be sunk before choosing to split the preds, you can do it without splitting the preds.
What am i missing?

lib/Transforms/Scalar/LICM.cpp
911 ↗(On Diff #113266)

Is there a reason not to just put all the users in a smallvector of WeakVH first?
While it's true splitPred may invalidate the iterators, none of the answers to what this loop does can change (that i can see) as a result of that (nor will this loop do anything for any new values).

junbuml marked 2 inline comments as done.Aug 31 2017, 9:41 AM

I agree that the test doesn't really require splitting since the operations of incoming values are the same (sub/mul) and all operands used in the sinkable chain (%N_addr.0.pn, %N, and %N2) dominate the PHI, so we can directly add PHIs for operands and share the same operation in the exit block like you comment. For me, it seems that the test case is a special case in which splitting is not required. In most case, I think splitting is required to sink through a non-trivially replicable PHI. For example, if operations of incoming values are different or any operands in the sinkable chain do not dominate the PHI, we should split preds.

For me, SimplifyCFG should clean up such foldable case. The test case seems to be handled properly by SimplifyCFG :

Out12:
  %N_addr.0.pn.lcssa.sink = phi i32 [ %N_addr.0.pn, %Loop ], [ %N_addr.0.pn, %ContLoop ]
  %N.sink = phi i32 [ %N, %Loop ], [ %N2, %ContLoop ]
  %tmp.6.le = mul i32 %N, %N_addr.0.pn.lcssa.sink
  %tmp.7.le = sub i32 %tmp.6.le, %N.sink

Would it make sense that LICM split preds for non-trivially replicable PHIs in general and let the SimplyCFG fold such case. If there is any case where SimplyCFG cannot fold, we can improve SimplyCFG.

I agree that the test doesn't really require splitting since the operations of incoming values are the same (sub/mul) and all operands used in the sinkable chain (%N_addr.0.pn, %N, and %N2) dominate the PHI, so we can directly add PHIs for operands and share the same operation in the exit block like you comment. For me, it seems that the test case is a special case in which splitting is not required. In most case, I think splitting is required to sink through a non-trivially replicable PHI. For example, if operations of incoming values are different or any operands in the sinkable chain do not dominate the PHI, we should split preds.

The case where you can't place it somewhere safe should be the case where the edge is critical.
Otherwise, there should always be somewhere to place the computation safely and correctly.
Now, i admit it may be tricky to come up with such an algorithm, but i'd also be surprised if it doesn't exist in literature already.

For me, SimplifyCFG should clean up such foldable case. The test case seems to be handled properly by SimplifyCFG :

Out12:
  %N_addr.0.pn.lcssa.sink = phi i32 [ %N_addr.0.pn, %Loop ], [ %N_addr.0.pn, %ContLoop ]
  %N.sink = phi i32 [ %N, %Loop ], [ %N2, %ContLoop ]
  %tmp.6.le = mul i32 %N, %N_addr.0.pn.lcssa.sink
  %tmp.7.le = sub i32 %tmp.6.le, %N.sink

Would it make sense that LICM split preds for non-trivially replicable PHIs in general and let the SimplyCFG fold such case. If there is any case where SimplyCFG cannot fold, we can improve SimplyCFG.

If we can avoid creating the mess in the first place, i'd rather see us do that.
Changing the CFG like this costs something (incremental dominator updates, etc).
If we don't need to do it, we generally shouldn't do it.

If the algorithm turns out to be wildly complex, sure, let's do the easy thing and split preds and let something clean it up.
But i'd at least like to see us *try*.

junbuml added a comment.EditedSep 1 2017, 8:31 AM

Of course, it would be good to avoid unnecessary splitting in the fist place if it's not very costly. Let me try to find a reasonable approach for this.

The case where you can't place it somewhere safe should be the case where the edge is critical.
Otherwise, there should always be somewhere to place the computation safely and correctly.

I'm not perfectly clear about your above comment. Did you mean that we cannot safely sink a sinkable instruction through a critical edge? So, in such case we should split the critical edge to sink?

Of course, it would be good to avoid unnecessary splitting in the fist place if it's not very costly. Let me try to find a reasonable approach for this.

If you need help, let me know.
If you can't find one, awesome, let's go with splitting :)

The case where you can't place it somewhere safe should be the case where the edge is critical.
Otherwise, there should always be somewhere to place the computation safely and correctly.

I'm not perfectly clear about your above comment. Did you mean that we cannot safely sink a sinkable instruction through a critical edge? So, in such case we should split the critical edge to sink?

Yes.
Instructions are really placed on edges in most cases.
That is, you are hoisting or sinking it out of a block, and want it to appear first/last in another block.
This is really edge placement.
But most compilers don't allow code on edges, so you actually have to put it in the block.
In the normal, non-critical edge case, this is safe to do. That is, it's possible to do it and have the instruction execute under only the same conditions it did before.

In the critical edge case, you may need to split the edge in order to have a place to put it.

Let me give you a hoisting example just because people seem to find it easier when i explain it that direction:

a      b
| \  /
c  d

We have the same computation in B and D.
So you want to do PRE on it.
To eliminate the extra computation, you need to place a computation on the edge between A and D. That is the only way to ensure that the computation is still available on all paths to D.

If you could place code on that edge, it would be easy, no muss, no fuss!
You'd be guaranteed it only executes on the a->d path.
LLVM (and most other compilers) can't place code on edges, only in the blocks.

But if you place the computation in block A, it may also be computed on the path to C, and when D is never executed.
That's no good, now your computation may execute when it didn't before (think of a load, for example, or any trapping instruction).

This happens because a->d is a critical edge. It's an edge from a block with multiple successors to a block with multiple predecessors.

The only safe way to place the computation on this edge is to split this edge so it looks like:

a
| \         b
c   t     /
      \  /
        d

and then place the computation in t.

Sinking has the same problem (imagine you want to sink something from a to d. if you place it at the beginning of d, now it also occurs on the b->d path, where it may never have occurred before. The only way to make it occur just on the a->d path is to split the edge as we did above)
The edges to loop exits may often be critical edges, so you may (but not always) have to split them to place computations safely.
You can actually compute which edges are "blocking" your transformation and split them.
(you could also pre-split all critical edges, at some cost. There is a lot of literature on this, and the TL;DR is that you can prove that a number of dataflow problems do not get optimal answers in the presence of critical edges. LLVM has not chosen this path so far, however, because the cost has outweighed the practical benefit).

junbuml updated this revision to Diff 114198.Sep 7 2017, 10:20 AM
junbuml marked an inline comment as done.

I looked at if we can add an extra check to avoid unnecessary splitting, but for me this seems to be a special case for which I don't want to increase complexity in this change. Since all incoming edges in non-trivially replacable PHI in LCSSA is critical, splitting them by default doesn't seems unreasonable to me.

The splitting is not required only when all incoming value are sinkable and the operation of them is same in the non-trivially replicable PHIs. In such case, we can sink the same sinkable operation and modify the PHI to pass the operands to the shared operation like the IR you mentioned above. Handling such case pretty much special case which I believe we can let other pass handle or we can extend it in a separate patch if we need to handle it in LICM.

Added more test cases and made minor updates.

junbuml added inline comments.Sep 7 2017, 10:22 AM
lib/Transforms/Scalar/LICM.cpp
911 ↗(On Diff #113266)

After splitPredecessorsOfLoopExit(), user PHIs will be modified and could no longer be an user and also could be not in the exit block anymore. We also need to get Use from user_iterator for the unreachability check for Phi's incoming block in this loop. Instead of holding the original users and iterating them over, I think restarting the iterators for valid users might be earlier to read. To avoid performing the same check for the same users, I cached visited users.

If you see the test16() in the test case, the same phi is used multiple times as an user. After splitPredecessorsOfLoopExit, the PHI will no longer be an user of the sinkable instruction, so we may not want to revisit this no-longer-user phi.

Okay.
One reason we don't split critical edges by default in most passes in LLVM is that edges coming out of switches that lead to the same block are pretty much always critical.
Doing so by default for large switches leads to *huge* code blowup.

IE it goes from 1 block to N blocks, where N is the number of switch cases.

If there are multiple layers of switches, it's obviously even worse.

Have you tested on this on any switch heavy code to ensure it does not blow up code size and compile time?

Okay.
One reason we don't split critical edges by default in most passes in LLVM is that edges coming out of switches that lead to the same block are pretty much always critical.
Doing so by default for large switches leads to *huge* code blowup.
IE it goes from 1 block to N blocks, where N is the number of switch cases.
If there are multiple layers of switches, it's obviously even worse.
Have you tested on this on any switch heavy code to ensure it does not blow up code size and compile time?

I tried to build spec2000/spec2006/spec2017 with/without this change for aarch64. This change was applied widely and as far as I know perlbench use lots of big switches, but I didn't see any significant changes in code size and compile time across all benchmarks.
However, as you point out, it could possibly increase code size even in some case where splitting is required for sinking. Just to be safe what about checking if the number of predecessors is larger than some threshold from cl::opt before splitting?

It's possible to get quadratic code growth from LICM sinking. For example:

int foo();
int unconditional_break(int n, int *p) {
int r = 0;
for (int i = 0; i < n; ++i) {
  int x = foo();
  int a = *p;
  int b = a * a;
  int c = b * b;
  int d = c * c;
  int e = d * d;
  int f = e * e;
  if (x == 1) { r = a; break; }
  if (x == 2) { r = b; break; }
  if (x == 3) { r = c; break; }
  if (x == 4) { r = d; break; }
  if (x == 5) { r = e; break; }
  if (x == 6) { r = f; break; }
}
return r;
}

int foo();
int conditional_break(int n, int *p) {
int r = 0;
for (int i = 0; i < n; ++i) {
  int x = foo();
  int a = *p;
  int b = a * a;
  int c = b * b;
  int d = c * c;
  int e = d * d;
  int f = e * e;
  if (x == 1) { r = a; if (foo()) break; }
  if (x == 2) { r = b; if (foo()) break; }
  if (x == 3) { r = c; if (foo()) break; }
  if (x == 4) { r = d; if (foo()) break; }
  if (x == 5) { r = e; if (foo()) break; }
  if (x == 6) { r = f; if (foo()) break; }
}
return r;
}

The source contains 5 multiplies, but the optimized IR has 15 due to sinking. For the first function, sinking happens on trunk. For the second function, this patch will allow sinking, I think.

That said, codesize bloat from hasn't ever shown up as a problem in practice; I doubt it will cause problems here.

Okay.
One reason we don't split critical edges by default in most passes in LLVM is that edges coming out of switches that lead to the same block are pretty much always critical.
Doing so by default for large switches leads to *huge* code blowup.
IE it goes from 1 block to N blocks, where N is the number of switch cases.
If there are multiple layers of switches, it's obviously even worse.
Have you tested on this on any switch heavy code to ensure it does not blow up code size and compile time?

I tried to build spec2000/spec2006/spec2017 with/without this change for aarch64. This change was applied widely and as far as I know perlbench use lots of big switches, but I didn't see any significant changes in code size and compile time across all benchmarks.
However, as you point out, it could possibly increase code size even in some case where splitting is required for sinking. Just to be safe what about checking if the number of predecessors is larger than some threshold from cl::opt before splitting?

If you haven't seen any real code size regressions, i think you should just be prepared to address it in the future, and let's leave it at that for now :)
Thanks for working through all this.

Kindly ping. Please let me know any comment.

Ran tests for spec2000/2006/2017 on aarch64, and there was no clear change in performance and size. However, observed 4% performance gain in spec2006/xalancbmk when applied together with my another LICM patch (https://reviews.llvm.org/D37076).

junbuml updated this revision to Diff 116676.Sep 26 2017, 9:44 AM

Minor update in comments and fixed a test case failure in subreg-postra-2.ll by preventing the select instruction from being sunk.

Kindly ping.

Kindly ping one more time. Please let me know any comment.

This revision is now accepted and ready to land.Oct 20 2017, 1:41 PM
junbuml updated this revision to Diff 120267.Oct 25 2017, 8:48 AM
junbuml added a reviewer: jtony.

Tony,
Can you please confirm if the change in subreg-postra-2.ll doesn't break the original intention of the test?
Thanks,
Jun

I'm still not clear if the change in subreg-postra-2.ll is okay. Can anyone who know about this PowerPC test please review the change in the test.
Thanks,
Jun

hfinkel added inline comments.Oct 31 2017, 4:15 PM
test/CodeGen/PowerPC/subreg-postra-2.ll
7 ↗(On Diff #120267)

Do you need to change this test at all? Would using the flag -ppc-gep-opt=0 enable you to leave the test as is?

junbuml updated this revision to Diff 121127.Nov 1 2017, 8:16 AM

Added "-ppc-gep-opt=0". No change in the test with/without this change. The change in the second RUN is because of using -ppc-gep-opt=0, but this doesn't seem to break the original intention of this test.

junbuml updated this revision to Diff 121354.Nov 2 2017, 1:28 PM

Rebased. I will commit this if the change in PowerPC test (subreg-postra-2.ll) is okay.
Thanks,
Jun

hfinkel edited edge metadata.Nov 2 2017, 1:32 PM

Rebased. I will commit this if the change in PowerPC test (subreg-postra-2.ll) is okay.

Yes, looks fine to me.

Thanks,
Jun

This revision was automatically updated to reflect the committed changes.