Test thanks to Michael Kuklinski from #llvm: https://godbolt.org/z/bdrah5Goo
originally inspired by Daniel Lemire's https://lemire.me/blog/2021/10/26/in-c-is-empty-faster-than-comparing-the-size-with-zero/
We manage to deduce that the answer does not require looping,
but we do that after the last LoopDeletion pass run,
so we end up being stuck with a dead loop.
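
For illustration, here is a minimal sketch of the kind of source that shows the problem. It is not the exact test from the godbolt link; `Node`, `List`, and `is_empty_via_size` are hypothetical names chosen for this example. Whether the list is empty depends only on whether `head` is null, so the counting loop does not need to run once `size()` is inlined into the comparison.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical singly linked list, assumed for illustration only.
struct Node {
  Node *next;
};

struct List {
  Node *head = nullptr;

  // Counts the nodes by walking the whole list.
  std::size_t size() const {
    std::size_t n = 0;
    for (const Node *p = head; p != nullptr; p = p->next)
      ++n;
    return n;
  }
};

// The result only depends on whether the loop body runs at least once,
// i.e. on `head == nullptr`, so the loop itself is not needed to compute it.
bool is_empty_via_size(const List &l) { return l.size() == 0; }
```

Calling `is_empty_via_size` on a default-constructed `List` returns `true`, and on a list with at least one node it returns `false`; ideally the optimized code performs a single null check in both cases.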
Now, as with all things SCEV, this comes with an entirely expected compile-time regression of about +0.12%:
https://llvm-compile-time-tracker.com/compare.php?from=0ae7bf124a9bca76dd9a91b2f7379168ff13f562&to=c2ae57c9b961aeb4a28c747266949340613a6d84&stat=instructions
(for comparison, doing that in the function simplification pipeline
would have been a ~+0.5% compile-time regression, see D112840)
Looking at the transformation stats over the vanilla test-suite, I think it's rather expected:
| statistic name                                   |  baseline |  proposed |     Δ |      % |  \|%\| |
|--------------------------------------------------|----------:|----------:|------:|-------:|-------:|
| scalar-evolution.NumBruteForceTripCountsComputed |       789 |       888 |    99 | 12.55% | 12.55% |
| scalar-evolution.NumTripCountsNotComputed        |    105592 |    117900 | 12308 | 11.66% | 11.66% |
| loop-delete.NumBackedgesBroken                   |       542 |       559 |    17 |  3.14% |  3.14% |
| regalloc.numExtends                              |        81 |        79 |    -2 | -2.47% |  2.47% |
| indvars.NumFoldedUser                            |       408 |       400 |    -8 | -1.96% |  1.96% |
| indvars.NumElimCmp                               |      3831 |      3758 |   -73 | -1.91% |  1.91% |
| scalar-evolution.NumTripCountsComputed           |    299759 |    304278 |  4519 |  1.51% |  1.51% |
| loop-delete.NumDeleted                           |      8055 |      8128 |    73 |  0.91% |  0.91% |
| machine-cse.NumCommutes                          |       111 |       110 |    -1 | -0.90% |  0.90% |
| globaldce.NumFunctions                           |      1187 |      1192 |     5 |  0.42% |  0.42% |
| codegenprepare.NumSelectsExpanded                |       277 |       278 |     1 |  0.36% |  0.36% |
| loop-unroll.NumRuntimeUnrolled                   |     13841 |     13791 |   -50 | -0.36% |  0.36% |
| machinelicm.NumPostRAHoisted                     |      1168 |      1172 |     4 |  0.34% |  0.34% |
| phi-node-elimination.NumCriticalEdgesSplit       |     83054 |     82879 |  -175 | -0.21% |  0.21% |
| machine-cse.NumPREs                              |      3085 |      3079 |    -6 | -0.19% |  0.19% |
| branch-folder.NumBranchOpts                      |    108122 |    107942 |  -180 | -0.17% |  0.17% |
| loop-unroll.NumUnrolled                          |     40136 |     40067 |   -69 | -0.17% |  0.17% |
| branch-folder.NumDeadBlocks                      |    130818 |    130607 |  -211 | -0.16% |  0.16% |
| codegenprepare.NumBlocksElim                     |     92856 |     92714 |  -142 | -0.15% |  0.15% |
| instsimplify.NumSimplified                       |    103263 |    103129 |  -134 | -0.13% |  0.13% |
| instcombine.NumConstProp                         |     26070 |     26102 |    32 |  0.12% |  0.12% |
| instsimplify.NumExpand                           |      1716 |      1718 |     2 |  0.12% |  0.12% |
| loop-unroll.NumCompletelyUnrolled                |      9236 |      9225 |   -11 | -0.12% |  0.12% |
| branch-folder.NumHoist                           |      2773 |      2770 |    -3 | -0.11% |  0.11% |
| regalloc.NumReloadsRemoved                       |     10822 |     10834 |    12 |  0.11% |  0.11% |
| regalloc.NumSnippets                             |     11394 |     11406 |    12 |  0.11% |  0.11% |
| machine-cse.NumCrossBBCSEs                       |      1052 |      1053 |     1 |  0.10% |  0.10% |
| machinelicm.NumCSEed                             |     99887 |     99784 |  -103 | -0.10% |  0.10% |
| branch-folder.NumTailMerge                       |     72501 |     72435 |   -66 | -0.09% |  0.09% |
| codegenprepare.NumExtUses                        |     22007 |     21987 |   -20 | -0.09% |  0.09% |
| local.NumRemoved                                 |     68232 |     68294 |    62 |  0.09% |  0.09% |
| loop-vectorize.LoopsAnalyzed                     |     75483 |     75413 |   -70 | -0.09% |  0.09% |
Note that I'm only changing the current PM, and not touching the obsolete PM.
This is an alternative to the function simplification pipeline variant of the same change, D112840.
It has less compile time impact (since the number of additional SCEV trip count calculations
is far lower than with D112840), and it is also much more powerful/impactful (almost 2x more loops deleted).
I have checked, and doing this after loop rotation is favorable (more loops deleted).