The current LICM allows sinking an instruction only when it is exposed to exit
blocks through a trivially replacable PHI of which all incoming values are the
same instruction. This change enhance LICM to sink a sinkable instruction
through non-trivially replacable PHIs by spliting predecessors of loop
exits.
Details
Diff Detail
Event Timeline
Addressed Eli's comment.
| lib/Transforms/Scalar/LICM.cpp | ||
|---|---|---|
| 808 | This is to clone only one instruction per exit block and I couldn't think of a case where the same instruction is cloned several times in the same exit block. Please let me know if you can see any case where we need to update SunkCopies. | |
| lib/Transforms/Scalar/LICM.cpp | ||
|---|---|---|
| 808 | The reason the SunkCopies map exists is that there could be multiple PHI nodes which use the same instruction in a loop exit block. Maybe it would be more straightforward if you would make all the necessary calls to splitPredecessorsOfLoopExit before you actually start sinking instructions? | |
| lib/Transforms/Scalar/LICM.cpp | ||
|---|---|---|
| 808 | Now, I split non-trivially replaceable PHIs before sinking in sink(). Please take a look and let me know any comment. Thanks Eli for the review. | |
I'd like more eyes on this to figure out if there are potential problems with this transform (maybe there's a performance cost for splitting edges in some cases?).
A few more testcases would be nice; maybe a loop with multiple exit blocks, a loop with more than two exits pointing to the same exit block, and a loop where multiple exit PHI nodes use the same sinkable instruction.
| lib/Transforms/Scalar/LICM.cpp | ||
|---|---|---|
| 855 | Iterating over a SmallPtrSet is non-deterministic. | |
| 907–908 | Maybe put a comment here describing what the first loop is doing? | |
| 948 | Unnecessary cast. | |
Overall, i'm unclear  on why you can't compute which ones will actually be sunk, and then split the preds.
Also, the examples does not even require splitting preds to effect the transform.
These are simply phi of ops form, and there is an equivalent op of phis form you could use directly in the exit block.
For example, you have, in the last test (i converted the checks to the code):
Out12.split.loop.exit: lcssaphi1 = phi i32 [ %N_addr.0.pn, %ContLoop ] mulres = mul i32 %N, %lcssaphi1 subres1 = sub %mulres, %N2 br label %Out12 Out12.split.loop.exit1: lcssaphi2 = phi i32 [ %N_addr.0.pn, %Loop ] mulres2 = mul i32 %N, %lcssaphi2 subres2 = sub i32, %mulres2, %N br label %Out12 Out12: something = phi i32 [%subres, %Out12.split.loop.exit ], [ %subres2, %Out12.split.loop.exit1 ]
This is the same as
Out12: phires = phi i32 [%N, %Loop], [%N2, %ContLoop] mulres = mul i32 %N, %N_addr.0.pn ; something here has the same value as the phi named something above something = sub i32 %mulres, %phires
It seems to me if you can compute which can be sunk before choosing to split the preds, you can do it without splitting the preds.
What am i missing?
| lib/Transforms/Scalar/LICM.cpp | ||
|---|---|---|
| 937 | Is there a reason not to just put all the users in a smallvector of WeakVH first? | |
I agree that the test doesn't really require splitting since the operations of incoming values are the same (sub/mul) and all operands used in the sinkable chain (%N_addr.0.pn, %N, and %N2) dominate the PHI, so we can directly add PHIs for operands and share the same operation in the exit block like you comment. For me, it seems that the test case is a special case in which splitting is not required. In most case, I think splitting is required to sink through a non-trivially replicable PHI. For example, if operations of incoming values are different or any operands in the sinkable chain do not dominate the PHI, we should split preds.
For me, SimplifyCFG should clean up such foldable case. The test case seems to be handled properly by SimplifyCFG :
Out12: %N_addr.0.pn.lcssa.sink = phi i32 [ %N_addr.0.pn, %Loop ], [ %N_addr.0.pn, %ContLoop ] %N.sink = phi i32 [ %N, %Loop ], [ %N2, %ContLoop ] %tmp.6.le = mul i32 %N, %N_addr.0.pn.lcssa.sink %tmp.7.le = sub i32 %tmp.6.le, %N.sink
Would it make sense that LICM split preds for non-trivially replicable PHIs in general and let the SimplyCFG fold such case. If there is any case where SimplyCFG cannot fold, we can improve SimplyCFG.
The case where you can't place it somewhere safe should be the case where the edge is critical.
Otherwise, there should always be somewhere to place the computation safely and correctly.
Now, i admit it may be tricky to come up with such an algorithm, but i'd also be surprised if it doesn't exist in literature already.
For me, SimplifyCFG should clean up such foldable case. The test case seems to be handled properly by SimplifyCFG :
Out12: %N_addr.0.pn.lcssa.sink = phi i32 [ %N_addr.0.pn, %Loop ], [ %N_addr.0.pn, %ContLoop ] %N.sink = phi i32 [ %N, %Loop ], [ %N2, %ContLoop ] %tmp.6.le = mul i32 %N, %N_addr.0.pn.lcssa.sink %tmp.7.le = sub i32 %tmp.6.le, %N.sinkWould it make sense that LICM split preds for non-trivially replicable PHIs in general and let the SimplyCFG fold such case. If there is any case where SimplyCFG cannot fold, we can improve SimplyCFG.
If we can avoid creating the mess in the first place, i'd rather see us do that.
Changing the CFG like this costs something (incremental dominator updates, etc).  
If we don't need to do it, we generally shouldn't do it.
If the algorithm turns out to be wildly complex, sure, let's do the easy thing and split preds and let something clean it up.
But i'd at least like to see us *try*.
Of course, it would be good to avoid unnecessary splitting in the fist place if it's not very costly. Let me try to find a reasonable approach for this.
The case where you can't place it somewhere safe should be the case where the edge is critical.
Otherwise, there should always be somewhere to place the computation safely and correctly.
I'm not perfectly clear about your above comment. Did you mean that we cannot safely sink a sinkable instruction through a critical edge? So, in such case we should split the critical edge to sink?
If you need help, let me know.
If you can't find one, awesome, let's go with splitting :)
The case where you can't place it somewhere safe should be the case where the edge is critical.
Otherwise, there should always be somewhere to place the computation safely and correctly.I'm not perfectly clear about your above comment. Did you mean that we cannot safely sink a sinkable instruction through a critical edge? So, in such case we should split the critical edge to sink?
Yes.
Instructions are really placed on edges in most cases.
That is, you are hoisting or sinking it out of a block, and want it to appear first/last in another block.
This is really edge placement.
But most compilers don't allow code on edges, so you actually have to put it in the block.
In the normal, non-critical edge case, this is safe to do.  That is, it's possible to do it and have the instruction execute under only the same conditions it did before.
In the critical edge case, you may need to split the edge in order to have a place to put it.
Let me give you a hoisting example just because people seem to find it easier when i explain it that direction:
a b | \ / c d
We have the same computation in B and D.
So you want to do PRE on it.
To eliminate the extra computation, you need to place a computation on the edge between A and D.   That is the only way to ensure that the computation is still available on all paths to D.
If you could place code on that edge, it would be easy, no muss, no fuss!
You'd be guaranteed it only executes on the a->d path.
LLVM (and most other compilers) can't place code on edges, only in the blocks.
But if you place the computation in block A, it may also be computed on the path to C, and when D is never executed.
That's no good, now your computation may execute when it didn't before (think of a load, for example, or any trapping instruction).
This happens because a->d is a critical edge. It's an edge from a block with multiple successors to a block with multiple predecessors.
The only safe way to place the computation on this edge is to split this edge so it looks like:
a
| \         b
c   t     /
      \  /
        dand then place the computation in t.
Sinking has the same problem (imagine you want to sink something from a to d.  if you place it at the beginning of d, now it also occurs on the b->d path, where it may never have occurred before. The only way to make it occur just on the a->d path is to split the edge as we did above)
The edges to loop exits may often be critical edges, so you may (but not always)  have to split them to place computations safely.
You can actually compute which edges are "blocking" your transformation and split them.
(you could also pre-split all critical edges, at some cost. There is a lot of literature on this, and the TL;DR is that you can prove that a number of dataflow problems do not get optimal answers in the presence of critical edges.  LLVM has not chosen this path so far, however, because the cost has outweighed the practical benefit).
I looked at if we can add an extra check to avoid unnecessary splitting, but for me this seems to be a special case for which I don't want to increase complexity in this change. Since all incoming edges in non-trivially replacable PHI in LCSSA is critical, splitting them by default doesn't seems unreasonable to me.
The splitting is not required only when all incoming value are sinkable and the operation of them is same in the non-trivially replicable PHIs. In such case, we can sink the same sinkable operation and modify the PHI to pass the operands to the shared operation like the IR you mentioned above. Handling such case pretty much special case which I believe we can let other pass handle or we can extend it in a separate patch if we need to handle it in LICM.
Added more test cases and made minor updates.
| lib/Transforms/Scalar/LICM.cpp | ||
|---|---|---|
| 937 | After splitPredecessorsOfLoopExit(), user PHIs will be modified and could no longer be an user and also could be not in the exit block anymore. We also need to get Use from user_iterator for the unreachability check for Phi's incoming block in this loop. Instead of holding the original users and iterating them over, I think restarting the iterators for valid users might be earlier to read. To avoid performing the same check for the same users, I cached visited users. If you see the test16() in the test case, the same phi is used multiple times as an user. After splitPredecessorsOfLoopExit, the PHI will no longer be an user of the sinkable instruction, so we may not want to revisit this no-longer-user phi. | |
Okay.
One reason we don't split critical edges by default in most passes in LLVM is that edges coming out of switches that lead to the same block are pretty much always critical.
Doing so by default for large switches leads to *huge* code blowup.
IE it goes from 1 block to N blocks, where N is the number of switch cases.
If there are multiple layers of switches, it's obviously even worse.
Have you tested on this on any switch heavy code to ensure it does not blow up code size and compile time?
Okay.
One reason we don't split critical edges by default in most passes in LLVM is that edges coming out of switches that lead to the same block are pretty much always critical.
Doing so by default for large switches leads to *huge* code blowup.
IE it goes from 1 block to N blocks, where N is the number of switch cases.
If there are multiple layers of switches, it's obviously even worse.
Have you tested on this on any switch heavy code to ensure it does not blow up code size and compile time?
I tried to build spec2000/spec2006/spec2017 with/without this change for aarch64. This change was applied widely and as far as I know perlbench use lots of big switches, but I didn't see any significant changes in code size and compile time across all benchmarks.  
However, as you point out, it could possibly increase code size even in some case where splitting is required for sinking. Just to be safe what about checking if the number of predecessors is larger than some threshold from cl::opt before splitting?
It's possible to get quadratic code growth from LICM sinking. For example:
int foo();
int unconditional_break(int n, int *p) {
int r = 0;
for (int i = 0; i < n; ++i) {
  int x = foo();
  int a = *p;
  int b = a * a;
  int c = b * b;
  int d = c * c;
  int e = d * d;
  int f = e * e;
  if (x == 1) { r = a; break; }
  if (x == 2) { r = b; break; }
  if (x == 3) { r = c; break; }
  if (x == 4) { r = d; break; }
  if (x == 5) { r = e; break; }
  if (x == 6) { r = f; break; }
}
return r;
}
int foo();
int conditional_break(int n, int *p) {
int r = 0;
for (int i = 0; i < n; ++i) {
  int x = foo();
  int a = *p;
  int b = a * a;
  int c = b * b;
  int d = c * c;
  int e = d * d;
  int f = e * e;
  if (x == 1) { r = a; if (foo()) break; }
  if (x == 2) { r = b; if (foo()) break; }
  if (x == 3) { r = c; if (foo()) break; }
  if (x == 4) { r = d; if (foo()) break; }
  if (x == 5) { r = e; if (foo()) break; }
  if (x == 6) { r = f; if (foo()) break; }
}
return r;
}The source contains 5 multiplies, but the optimized IR has 15 due to sinking. For the first function, sinking happens on trunk. For the second function, this patch will allow sinking, I think.
That said, codesize bloat from hasn't ever shown up as a problem in practice; I doubt it will cause problems here.
If you haven't seen any real code size regressions, i think you should just be prepared to address it in the future, and let's leave it at that for now :)
Thanks for working through all this.
Kindly ping. Please let me know any comment.
Ran tests for spec2000/2006/2017 on aarch64, and there was no clear change in performance and size. However, observed 4% performance gain in spec2006/xalancbmk when applied together with my another LICM patch (https://reviews.llvm.org/D37076).
Minor update in comments and fixed a test case failure in subreg-postra-2.ll by preventing the select instruction from being sunk.
Tony, 
Can you please confirm if the change in subreg-postra-2.ll  doesn't break the original intention of the test? 
Thanks,
Jun
I'm still not clear if the change in subreg-postra-2.ll is okay.  Can anyone who know about this PowerPC test please review the change in the test.
Thanks,
Jun
| test/CodeGen/PowerPC/subreg-postra-2.ll | ||
|---|---|---|
| 7 | Do you need to change this test at all? Would using the flag -ppc-gep-opt=0 enable you to leave the test as is? | |
Added "-ppc-gep-opt=0". No change in the test with/without this change. The change in the second RUN is because of using -ppc-gep-opt=0, but this doesn't seem to break the original intention of this test.
Rebased.  I will commit this if the change in PowerPC test (subreg-postra-2.ll) is okay.
Thanks,
Jun
Do you need to do something to keep SunkCopies up to date?