This is an archive of the discontinued LLVM Phabricator instance.

[Taildup] Don't tail-duplicate loop header with multiple successors as its latches
ClosedPublic

Authored by junparser on Sep 28 2021, 2:11 AM.

Details

Summary

When TailDup hits a loop with multiple latches like:

//    1 -> 2 <-> 3                 |
//          \  <-> 4               |
//           \   <-> 5             |
//            \---> rest           |

it may transform this loop into multiple loops by duplicating the loop header.
However, this change often has little benefit while making the CFG much more complex.
In some uncommon cases, it causes a large compile-time regression (reported by
@alexfh in D106056).

This patch disables tail duplication in such cases.
PS: It checks for more than 2 latches, since we want to leave the simple conditional-branch case alone.
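
For illustration only, here is a minimal sketch of the kind of check this implies (not the actual patch code; the helper name, its parameters, and the default threshold are assumptions made for this example):

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineLoopInfo.h"
using namespace llvm;

// Hypothetical helper: returns true when TailBB is a loop header whose loop
// already has more than LatchLimit latches, i.e. more than LatchLimit in-loop
// predecessors feeding back edges into it. Duplicating such a header splits
// the loop and adds many extra edges for little benefit.
static bool isMultiLatchLoopHeader(const MachineBasicBlock *TailBB,
                                   const MachineLoopInfo &MLI,
                                   unsigned LatchLimit = 2) {
  const MachineLoop *L = MLI.getLoopFor(TailBB);
  if (!L || L->getHeader() != TailBB)
    return false; // Only loop headers are of interest here.

  // Every in-loop predecessor of the header is a latch (source of a back edge).
  unsigned NumLatches = 0;
  for (const MachineBasicBlock *Pred : TailBB->predecessors())
    if (L->contains(Pred))
      ++NumLatches;

  // More than two latches usually indicates a switch/jump-table loop rather
  // than a simple conditional back edge.
  return NumLatches > LatchLimit;
}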

TestPlan: check-llvm

Diff Detail

Event Timeline

junparser created this revision.Sep 28 2021, 2:11 AM
junparser requested review of this revision.Sep 28 2021, 2:11 AM
Herald added a project: Restricted Project.Sep 28 2021, 2:11 AM
lkail added a subscriber: lkail.Sep 28 2021, 2:18 AM

Reduced .bc file attached.

$ time llc-after --filetype=asm -O3 huge-switch.ll -o huge-switch.s

real    0m0.454s
user    0m0.436s
sys     0m0.012s

$ time llc-before -O3 --filetype=asm huge-switch.ll -o huge-switch-1.s

real    0m14.714s
user    0m14.297s
sys     0m0.134s

The backend spends a lot of time building the CFG and loop chains, and it also generates worse code.

lkail added a comment.Sep 28 2021, 4:10 AM

IIUC, switching this pattern off might lose some optimization commonly seen in interpreter, e.g., https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables. Do you have any benchmark numbers?

junparser updated this revision to Diff 375769.Sep 28 2021, 8:02 PM

Add a test case showing the diff from this change.

IIUC, switching this pattern off might lose some optimization commonly seen in interpreter, e.g., https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables. Do you have any benchmark numbers?

Hi, thanks for the reminder!
I tested the change with SPEC CPU2017, and there was no performance change. I also tested the case in https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables and added it as a test case as well. This patch has no effect on most interpreter switch loops.

However, I tested the case provided by @alexfh on top of D106056; without this change it shows about a 1x performance gain. The reduced case is large_loop_switch in the test file.

The root cause is that, before D106056, the LLVM optimizer usually does not turn an unused default case into unreachable. The interpreter switch loop also keeps this branch when it is transformed into a jump table, so the loop takes the form:

//                  
//              / <---------/ / /
//          0-> 1->  2  -> 3  / /            
//              \      \  -> 4 /              
//                \      \ -> 5             
//                  \---> default

The default BB cannot be tail-duplicated into the loop header. With D106056, the unreachable default BB is removed, and tail duplication can then transform the switch loop into a dispatch table.
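
As a made-up illustration (not the large_loop_switch test case itself), such an interpreter-style loop looks roughly like the snippet below; the opcode values are masked so the default case can never execute, yet before D106056 a default BB is still kept when the switch is lowered:

// Hypothetical interpreter loop. Before D106056 the unused default case
// survives into the jump-table lowering, giving the loop the extra exit edge
// shown in the diagram above, so the header cannot be tail-duplicated into
// every latch. Once D106056 proves the default unreachable and removes it,
// tail duplication can rewrite the loop as a per-opcode dispatch table.
int run(const unsigned char *ops, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i) {
    switch (ops[i] & 0x3) { // only values 0..3 can occur
    case 0: acc += 1; break;
    case 1: acc -= 1; break;
    case 2: acc *= 3; break;
    case 3: acc ^= i; break;
    }
  }
  return acc;
}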

So this change blocks the optimization from D106056. @alexfh, would you like to work around it with -mllvm -tail-dup-indirect-size=4?

Or should we add one more parameter such as -tail-dup-jmptable-size, with a default value of maybe 100?
Any suggestions?

alexfh added a comment.Oct 1 2021, 4:34 PM

So this change blocks the optimization from D106056. @alexfh, would you like to work around it with -mllvm -tail-dup-indirect-size=4?

I found a single huge protobuf generated source that was pushed over the time limit. However, there are likely many others that will just take more time to compile wasting resources and potentially increasing build latency. Having to hunt down all of them and to sprinkle -mllvm -tail-dup-indirect-size=4 all around the build configuration files would be unfortunate.

Or should we add one more parameter such as -tail-dup-jmptable-size, with a default value of maybe 100?
Any suggestions?

Would that stop this particular transformation for large switch cases? Should this always be done?

junparser updated this revision to Diff 378416.Oct 8 2021, 11:32 PM

Add one parameter. @alexfh, could you give this patch a try?

So this change blocks the optimization from D106056. @alexfh, would you like to work around it with -mllvm -tail-dup-indirect-size=4?

I found a single huge protobuf generated source that was pushed over the time limit. However, there are likely many others that will just take more time to compile wasting resources and potentially increasing build latency. Having to hunt down all of them and to sprinkle -mllvm -tail-dup-indirect-size=4 all around the build configuration files would be unfortunate.

Or should we add one more parameter such as -tail-dup-jmptable-size, with a default value of maybe 100?
Any suggestions?

Would that stop this particular transformation for large switch cases? Should this always be done?

Sorry for the late reply. I believe this should stop tail duplication of the large jump-table loops generated from switches.


@lkail @MatzeB, any suggestions on the parameter value of 128?

lkail added a comment.EditedOct 10 2021, 9:37 PM

@lkail @MatzeB, any suggestions on the parameter value of 128?

From the perf data pasted by @alexfh in https://reviews.llvm.org/D106056, the compile-time regression occurs in MachineBlockPlacement when the CFG becomes complex. I think we should fix the issue there, i.e., fix the O(n^2) iteration with something like the following (there are multiple such usages; one is listed here):

diff --git a/llvm/lib/CodeGen/MachineBlockPlacement.cpp b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
index 8a1b4031642d..56c3d3106a19 100644
--- a/llvm/lib/CodeGen/MachineBlockPlacement.cpp
+++ b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
@@ -1905,8 +1905,11 @@ MachineBlockPlacement::TopFallThroughFreq(
       // Check if Top is the best successor of Pred.
       auto TopProb = MBPI->getEdgeProbability(Pred, Top);
       bool TopOK = true;
-      for (MachineBasicBlock *Succ : Pred->successors()) {
-        auto SuccProb = MBPI->getEdgeProbability(Pred, Succ);
+      for (MachineBasicBlock::succ_iterator SI = Pred->succ_begin(),
+                                          SE = Pred->succ_end();
+         SI != SE; ++SI) {
+        MachineBasicBlock *Succ = *SI;
+        auto SuccProb = MBPI->getEdgeProbability(Pred, SI);
         BlockChain *SuccChain = BlockToChain[Succ];
         // Check if Succ can be placed after Pred.
         // Succ should not be in any chain, or it is the head of some chain.

This might give acceptable compile time.
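
For context on the O(n^2) claim: the block-pointer overload of getEdgeProbability has to locate the successor first, roughly as in the sketch below (an assumption about its behavior written from memory, not code copied from this review):

#include "llvm/ADT/STLExtras.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineBranchProbabilityInfo.h"
using namespace llvm;

// Sketch: the pointer-based query effectively scans Src's successor list to
// find Dst before it can look up the probability, so one call is
// O(#successors) and querying every successor of Src, as in the loop being
// patched, is O(#successors^2). The iterator-based overload already encodes
// the edge's position, which is what the suggested diff relies on.
static BranchProbability
edgeProbabilityViaPointer(const MachineBranchProbabilityInfo &MBPI,
                          const MachineBasicBlock *Src,
                          const MachineBasicBlock *Dst) {
  return MBPI.getEdgeProbability(Src, find(Src->successors(), Dst));
}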

@lkail @MatzeB, any suggestions on the parameter value of 128?

From the perf data pasted by @alexfh in https://reviews.llvm.org/D106056, the compile-time regression occurs in MachineBlockPlacement when the CFG becomes complex. I think we should fix the issue there, i.e., fix the O(n^2) iteration.

It's not only the MachineBlockPlacement pass; we saw the Eliminate PHI and Control Flow Optimizer passes also cost much more time. With the IR file I attached above, the pass timings show:

6.1973 ( 46.3%)   0.0000 (  0.0%)   6.1973 ( 45.8%)   6.2801 ( 45.8%)  Eliminate PHI nodes for register allocation
2.3774 ( 17.7%)   0.0000 (  0.0%)   2.3774 ( 17.6%)   2.4110 ( 17.6%)  Control Flow Optimizer
0.9603 (  7.2%)   0.0000 (  0.0%)   0.9603 (  7.1%)   0.9740 (  7.1%)  Branch Probability Basic Block Placement
0.4236 (  3.2%)   0.0680 ( 50.6%)   0.4916 (  3.6%)   0.4978 (  3.6%)  Early Tail Duplication
0.4374 (  3.3%)   0.0000 (  0.0%)   0.4374 (  3.2%)   0.4455 (  3.2%)  Machine code sinking

Also, I have tested this change, and it does not work. The time is not consumed by the find in getEdgeProbability.

lkail added a comment.Oct 11 2021, 3:29 AM

I agree we should add a limitation for such CFG patterns. Otherwise, a quadratic number of edges is added to the CFG, which hurts compile time, especially when there are 1K+ cases. The setting of tail-dup-jmptable-loop-size might be related to the BTB or BTAC, of which I'm not an expert. But for now, we don't want to regress protobuf's compile time; the value of 128 needs @alexfh's confirmation.

llvm/lib/CodeGen/TailDuplicator.cpp
560

nit: Rename A to MBB or TailBB.

572

Please add a comment to explain why we don't tail-duplicate this CFG pattern.

llvm/test/CodeGen/X86/tail-dup-multiple-latch-loop.ll
1

This test should be pre-committed to show the diff.

lkail edited reviewers, added: fhahn; removed: Florian.
junparser updated this revision to Diff 378860.Oct 11 2021, 7:59 PM

address comments.

junparser added inline comments.Oct 11 2021, 7:59 PM
llvm/test/CodeGen/X86/tail-dup-multiple-latch-loop.ll
1

This has been pre-committed.

I agree we should add a limitation for such CFG patterns. Otherwise, a quadratic number of edges is added to the CFG, which hurts compile time, especially when there are 1K+ cases. The setting of tail-dup-jmptable-loop-size might be related to the BTB or BTAC, of which I'm not an expert. But for now, we don't want to regress protobuf's compile time; the value of 128 needs @alexfh's confirmation.

@alexfh, would you give this patch a try?

Hi @lkail, since the value of 128 fixes the test case, can we land this first and then change the value later if we have to? Is that OK?

lkail accepted this revision.Oct 28 2021, 9:15 PM

LGTM, let's land it first to unblock https://reviews.llvm.org/D106056. But one thing to note: the number of Python opcodes is larger than 128. If this patch affects Python's perf, I think we should seek another adequate default value.

This revision is now accepted and ready to land.Oct 28 2021, 9:15 PM

LGTM, let's land it first to unblock https://reviews.llvm.org/D106056. But one thing to note: the number of Python opcodes is larger than 128. If this patch affects Python's perf, I think we should seek another adequate default value.

Do we have any benchmarks for Python? I think I can run a perf test with them before landing this.

lkail added a comment.Oct 31 2021, 8:24 PM

Do we have any benchmarks for Python? I think I can run a perf test with them before landing this.

We don't have it in test-suite yet. In test-suite, we have SingleSource/Benchmarks/Misc/evalloop.c and MultiSource/Benchmarks/TSVC for computed-goto.

Do we have any benchmarks for Python? I think I can run a perf test with them before landing this.

We don't have it in test-suite yet. In test-suite, we have SingleSource/Benchmarks/Misc/evalloop.c and MultiSource/Benchmarks/TSVC for computed-goto.

I've tested Python under pyperformance with USE_COMPUTED_GOTOS disabled, and it shows no performance difference. So let's keep 128 as the default value for now.

This revision was landed with ongoing or failed builds.Nov 1 2021, 12:33 AM
This revision was automatically updated to reflect the committed changes.
lkail added a comment.EditedNov 22 2021, 5:49 PM

As reported in https://reviews.llvm.org/rGc93f93b2e3f28997f794265089fb8138dd5b5f13, we should implement a more general way to avoid adding a quadratic number of edges to CFGs.

As reported in https://reviews.llvm.org/rGc93f93b2e3f28997f794265089fb8138dd5b5f13, we should implement a more general way to avoid adding a quadratic number of edges to CFGs.

We may need loop info or the dominator tree to check the loop header and latches here. I'll revert this as well.