This is an archive of the discontinued LLVM Phabricator instance.

[PassManager] `buildModuleOptimizationPipeline()`: schedule `LoopDeletion` pass run before vectorization passes
ClosedPublic

Authored by lebedev.ri on Oct 29 2021, 3:40 PM.

Details

Summary

Test thanks to Michael Kuklinski from #llvm: https://godbolt.org/z/bdrah5Goo
originally inspired by Daniel Lemire's https://lemire.me/blog/2021/10/26/in-c-is-empty-faster-than-comparing-the-size-with-zero/

We manage to deduce that the answer does not require looping,
but we do that after the last LoopDeletion pass run,
so we end up being stuck with a dead loop.

Now, as with all things SCEV, this has a very expected ~+0.12% compile time performance regression:
https://llvm-compile-time-tracker.com/compare.php?from=0ae7bf124a9bca76dd9a91b2f7379168ff13f562&to=c2ae57c9b961aeb4a28c747266949340613a6d84&stat=instructions
(for comparison, doing that in function simplification pipeline
would have been ~+0.5 compile time performance regression, D112840)

Looking at the transformation stats over vanilla test-suite, i think it's rather expected:

| statistic name                                   |  baseline |  proposed |     Δ |      % |    |%| |
|--------------------------------------------------|----------:|----------:|------:|-------:|-------:|
| scalar-evolution.NumBruteForceTripCountsComputed |       789 |       888 |    99 | 12.55% | 12.55% |
| scalar-evolution.NumTripCountsNotComputed        |    105592 |    117900 | 12308 | 11.66% | 11.66% |
| loop-delete.NumBackedgesBroken                   |       542 |       559 |    17 |  3.14% |  3.14% |
| regalloc.numExtends                              |        81 |        79 |    -2 | -2.47% |  2.47% |
| indvars.NumFoldedUser                            |       408 |       400 |    -8 | -1.96% |  1.96% |
| indvars.NumElimCmp                               |      3831 |      3758 |   -73 | -1.91% |  1.91% |
| scalar-evolution.NumTripCountsComputed           |    299759 |    304278 |  4519 |  1.51% |  1.51% |
| loop-delete.NumDeleted                           |      8055 |      8128 |    73 |  0.91% |  0.91% |
| machine-cse.NumCommutes                          |       111 |       110 |    -1 | -0.90% |  0.90% |
| globaldce.NumFunctions                           |      1187 |      1192 |     5 |  0.42% |  0.42% |
| codegenprepare.NumSelectsExpanded                |       277 |       278 |     1 |  0.36% |  0.36% |
| loop-unroll.NumRuntimeUnrolled                   |     13841 |     13791 |   -50 | -0.36% |  0.36% |
| machinelicm.NumPostRAHoisted                     |      1168 |      1172 |     4 |  0.34% |  0.34% |
| phi-node-elimination.NumCriticalEdgesSplit       |     83054 |     82879 |  -175 | -0.21% |  0.21% |
| machine-cse.NumPREs                              |      3085 |      3079 |    -6 | -0.19% |  0.19% |
| branch-folder.NumBranchOpts                      |    108122 |    107942 |  -180 | -0.17% |  0.17% |
| loop-unroll.NumUnrolled                          |     40136 |     40067 |   -69 | -0.17% |  0.17% |
| branch-folder.NumDeadBlocks                      |    130818 |    130607 |  -211 | -0.16% |  0.16% |
| codegenprepare.NumBlocksElim                     |     92856 |     92714 |  -142 | -0.15% |  0.15% |
| instsimplify.NumSimplified                       |    103263 |    103129 |  -134 | -0.13% |  0.13% |
| instcombine.NumConstProp                         |     26070 |     26102 |    32 |  0.12% |  0.12% |
| instsimplify.NumExpand                           |      1716 |      1718 |     2 |  0.12% |  0.12% |
| loop-unroll.NumCompletelyUnrolled                |      9236 |      9225 |   -11 | -0.12% |  0.12% |
| branch-folder.NumHoist                           |      2773 |      2770 |    -3 | -0.11% |  0.11% |
| regalloc.NumReloadsRemoved                       |     10822 |     10834 |    12 |  0.11% |  0.11% |
| regalloc.NumSnippets                             |     11394 |     11406 |    12 |  0.11% |  0.11% |
| machine-cse.NumCrossBBCSEs                       |      1052 |      1053 |     1 |  0.10% |  0.10% |
| machinelicm.NumCSEed                             |     99887 |     99784 |  -103 | -0.10% |  0.10% |
| branch-folder.NumTailMerge                       |     72501 |     72435 |   -66 | -0.09% |  0.09% |
| codegenprepare.NumExtUses                        |     22007 |     21987 |   -20 | -0.09% |  0.09% |
| local.NumRemoved                                 |     68232 |     68294 |    62 |  0.09% |  0.09% |
| loop-vectorize.LoopsAnalyzed                     |     75483 |     75413 |   -70 | -0.09% |  0.09% |

Note that i'm only changing current PM, and not touching obsolete PM.

This is an alternative to the function simplification pipeline variant of the same change, D112840.
It has both less compile time impact (since the additional number of SCEV trip count calculations
is way lass less than with the D112840), and it is much more powerful/impactful (almost 2x more loops deleted).
I have checked, and doing this after loop rotation is favorable (more loops deleted).

Diff Detail

Event Timeline

lebedev.ri created this revision.Oct 29 2021, 3:40 PM
lebedev.ri edited the summary of this revision. (Show Details)Oct 29 2021, 3:45 PM
mkazantsev accepted this revision.Nov 1 2021, 12:19 AM

Deletion of loops before they are vectorized absolutely makes sense to me. Please give it couple more days just in case if someone has concerns regarding the regressions, but making vectorizer run on empty doesn't sound. Please give it 2-3 days to hear others' concerns.

This revision is now accepted and ready to land.Nov 1 2021, 12:19 AM

Deletion of loops before they are vectorized absolutely makes sense to me. Please give it couple more days just in case if someone has concerns regarding the regressions, but making vectorizer run on empty doesn't sound. Please give it 2-3 days to hear others' concerns.

Indeed, even if/when what @aeubanks suggested in https://reviews.llvm.org/D112840#3097891 is implemented,
somehow i don't think it will alleviate the need for *this*-late LoopDeletion.

Thank you for the review, i'll wait a bit on @aeubanks / @asbirlea / @dmgreen,
though given just how little of a compile-time regression this has become,
i can't imagine this will be blocked.

Yeah, Seems OK to me.

(Copied from the other change)
I looked into @is_not_empty_variant1(). It seems like GVN + an instcombine cleanup is what's allowing the loop to be deleted. Currently GVN runs after the loop passes which is the issue.
I tried replacing the EarlyCSEPass at the beginning of the function simplification pipeline with a NewGVNPass and that fixes @is_not_empty_variant1. I think something along those lines is more principled than adding an LoopDeletionPass somewhere.
(Of course, if we want to turn on NewGVN specifically we need to iron out remaining issues)

Deletion of loops before they are vectorized absolutely makes sense to me. Please give it couple more days just in case if someone has concerns regarding the regressions, but making vectorizer run on empty doesn't sound. Please give it 2-3 days to hear others' concerns.

Indeed, even if/when what @aeubanks suggested in https://reviews.llvm.org/D112840#3097891 is implemented,
somehow i don't think it will alleviate the need for *this*-late LoopDeletion.

Can you explain why *this* late LoopDeletion would still be necessary even with the GVN phase ordering changes?
It makes sense to me that specifically GVN can optimize enough to make some loops deletable. Ideally we'd already delete all deletable loops in the function simplification pipeline rather than late in the pipeline. Anything to do with deleting unused code should trigger in the simplification pipeline, not the late optimization pipeline, since deleting unnecessary instructions is a canonicalization.
And I'm sure there are many other loop passes that could benefit from having GVN run first.

Take all of this with a grain of salt because running an extra GVN pass is very expensive. We run EarlyCSE at the beginning of the function simplification pipeline as a mini-GVN, but replacing that with NewGVN, which is supposed to be significantly faster than GVN, still results in major compile time regressions. https://llvm-compile-time-tracker.com/compare.php?from=1c05c52de2177a328b7d2d07b697af67eb9f8122&to=755e184ed0216ef39d68621a9b19a3dd34d677f2&stat=instructions

So overall I think this is a hack to fix a fairly specific use case, but a proper fix is nowhere in the near future. If you really think this snippet of code is important to optimize then I'm not opposed to this change until we have a more proper fix in the future.
We can always tack on passes to the end of the pipeline to handle missed optimizations due to phase ordering, it doesn't mean it's principled. (of course if you can explain why an extra loop deletion pass on top of the one in the simplification pipeline is principled I'm all ears)

Thank you for the review, i'll wait a bit on @aeubanks / @asbirlea / @dmgreen,
though given just how little of a compile-time regression this has become,
i can't imagine this will be blocked.

(Copied from the other change)
I looked into @is_not_empty_variant1(). It seems like GVN + an instcombine cleanup is what's allowing the loop to be deleted. Currently GVN runs after the loop passes which is the issue.
I tried replacing the EarlyCSEPass at the beginning of the function simplification pipeline with a NewGVNPass and that fixes @is_not_empty_variant1. I think something along those lines is more principled than adding an LoopDeletionPass somewhere.
(Of course, if we want to turn on NewGVN specifically we need to iron out remaining issues)

Deletion of loops before they are vectorized absolutely makes sense to me. Please give it couple more days just in case if someone has concerns regarding the regressions, but making vectorizer run on empty doesn't sound. Please give it 2-3 days to hear others' concerns.

Indeed, even if/when what @aeubanks suggested in https://reviews.llvm.org/D112840#3097891 is implemented,
somehow i don't think it will alleviate the need for *this*-late LoopDeletion.

Can you explain why *this* late LoopDeletion would still be necessary even with the GVN phase ordering changes?

Well, i'm not actually saying that, i'm only speculating as much.
Let's suppose GVN phase ordering change happens, and D112840 would become unnecessary,
But, i posted this change (replacing D112840) because while D112840 (on vanilla test-suite) caused

| statistic name                                   |  baseline |  proposed |      Δ |      % |    |%| |
|--------------------------------------------------|----------:|----------:|-------:|-------:|-------:|
| loop-delete.NumBackedgesBroken                   |       542 |       557 |     15 |  2.77% |  2.77% |
| loop-delete.NumDeleted                           |      8055 |      8096 |     41 |  0.51% |  0.51% |

while this variant causes

| statistic name                                   |  baseline |  proposed |     Δ |      % |    |%| |
|--------------------------------------------------|----------:|----------:|------:|-------:|-------:|
| loop-delete.NumBackedgesBroken                   |       542 |       559 |    17 |  3.14% |  3.14% |
| loop-delete.NumDeleted                           |      8055 |      8128 |    73 |  0.91% |  0.91% |

Aka, as compared to baseline, this deletes ~80% more loops than D112840 would.
Notably, this new LoopDelete is placed after LoopRotatePass, because i have checked, and that
also causes an increase in num loops deleted, as compared with placing it before LoopRotatePass.
So right now, i would not expect that GVN phase ordering improvement will make this late LoopDeletion unnecessary,
but if i'm proven wrong and it does, then great and this could be reverted.

It makes sense to me that specifically GVN can optimize enough to make some loops deletable. Ideally we'd already delete all deletable loops in the function simplification pipeline rather than late in the pipeline. Anything to do with deleting unused code should trigger in the simplification pipeline, not the late optimization pipeline, since deleting unnecessary instructions is a canonicalization.
And I'm sure there are many other loop passes that could benefit from having GVN run first.

Sure, i'm not arguing that we *should't* succeed with this in simplification pipeline,
i'm only saying that at least currently, even more loops become dead *after* the simplification pipeline, in optimization pipeline.

Take all of this with a grain of salt because running an extra GVN pass is very expensive. We run EarlyCSE at the beginning of the function simplification pipeline as a mini-GVN, but replacing that with NewGVN, which is supposed to be significantly faster than GVN, still results in major compile time regressions. https://llvm-compile-time-tracker.com/compare.php?from=1c05c52de2177a328b7d2d07b697af67eb9f8122&to=755e184ed0216ef39d68621a9b19a3dd34d677f2&stat=instructions

Right.

So overall I think this is a hack to fix a fairly specific use case, but a proper fix is nowhere in the near future. If you really think this snippet of code is important to optimize then I'm not opposed to this change until we have a more proper fix in the future.
We can always tack on passes to the end of the pipeline to handle missed optimizations due to phase ordering, it doesn't mean it's principled. (of course if you can explain why an extra loop deletion pass on top of the one in the simplification pipeline is principled I'm all ears)

Yay, i love ugly hacks.

Thank you for the review, i'll wait a bit on @aeubanks / @asbirlea / @dmgreen,
though given just how little of a compile-time regression this has become,
i can't imagine this will be blocked.

I think unless there will be concerns raised by then, i'll land this friday.
After all, this isn't setting things in stone, once the GVN issue is resolved this can be revisited.

I think unless there will be concerns raised by then, i'll land this friday.
After all, this isn't setting things in stone, once the GVN issue is resolved this can be revisited.

Seems fine to me, but could you add a comment, perhaps a link to this discussion?

lebedev.ri updated this revision to Diff 384486.Nov 3 2021, 9:19 AM

Adding comment.

This revision was landed with ongoing or failed builds.Nov 3 2021, 9:25 AM
This revision was automatically updated to reflect the committed changes.