This is an archive of the discontinued LLVM Phabricator instance.

[PassManager] `buildFunctionSimplificationPipeline()`: schedule another `LoopDeletion` pass run before last `LICM`
AbandonedPublic

Authored by lebedev.ri on Oct 29 2021, 12:17 PM.

Details

Summary

Test thanks to Michael Kuklinski from #llvm: https://godbolt.org/z/bdrah5Goo
originally inspired by Daniel Lemire's https://lemire.me/blog/2021/10/26/in-c-is-empty-faster-than-comparing-the-size-with-zero/

We manage to deduce that the answer does not require looping,
but we do that after the last LoopDeletion pass run,
so we end up being stuck with a dead loop.

While perhaps we are (clearly) also missing LoopDeletion
in buildModuleOptimizationPipeline() part of the pipeline,
we *should* add this one here, to reduce the function size
before inlining. I believe we want to have a new loop pass manager,
so that LoopSimplifyPass/LCSSAPass are guaranteed to be computed once?

Now, as with all things SCEV, this has a very expected ~+0.5% compile time performance regression:
https://llvm-compile-time-tracker.com/compare.php?from=e5df0a5a6f412965eb1be495d2672b4164c6c3d5&to=4af31d7accb368597bc8b4fc8f125d7c2fdbae6a&stat=instructions

Looking at the transformation stats over vanilla test-suite, i think it's rather expected:

| statistic name                                   |  baseline |  proposed |      Δ |      % |    |%| |
|--------------------------------------------------|----------:|----------:|-------:|-------:|-------:|
| scalar-evolution.NumTripCountsNotComputed        |    105592 |    137458 |  31866 | 30.18% | 30.18% |
| scalar-evolution.NumBruteForceTripCountsComputed |       789 |       955 |    166 | 21.04% | 21.04% |
| scalar-evolution.NumTripCountsComputed           |    299759 |    348901 |  49142 | 16.39% | 16.39% |
| loop-delete.NumBackedgesBroken                   |       542 |       557 |     15 |  2.77% |  2.77% |
| regalloc.numExtends                              |        81 |        79 |     -2 | -2.47% |  2.47% |
| licm.NumSunk                                     |     12167 |     11875 |   -292 | -2.40% |  2.40% |
| indvars.NumFoldedUser                            |       408 |       400 |     -8 | -1.96% |  1.96% |
| indvars.NumElimCmp                               |      3831 |      3757 |    -74 | -1.93% |  1.93% |
| loop-delete.NumDeleted                           |      8055 |      8096 |     41 |  0.51% |  0.51% |
| codegenprepare.NumSelectsExpanded                |       277 |       278 |      1 |  0.36% |  0.36% |
| assume-queries.NumAssumeQueries                  | 120910554 | 121324983 | 414429 |  0.34% |  0.34% |
| loop-unroll.NumRuntimeUnrolled                   |     13841 |     13796 |    -45 | -0.33% |  0.33% |
| phi-node-elimination.NumCriticalEdgesSplit       |     83054 |     82804 |   -250 | -0.30% |  0.30% |
| branch-folder.NumBranchOpts                      |    108122 |    107827 |   -295 | -0.27% |  0.27% |
| correlated-value-propagation.NumCmps             |      1461 |      1465 |      4 |  0.27% |  0.27% |
| branch-folder.NumDeadBlocks                      |    130818 |    130490 |   -328 | -0.25% |  0.25% |
| branch-folder.NumTailMerge                       |     72501 |     72347 |   -154 | -0.21% |  0.21% |
| correlated-value-propagation.NumAShrs            |       514 |       515 |      1 |  0.19% |  0.19% |
| loop-unroll.NumUnrolled                          |     40136 |     40069 |    -67 | -0.17% |  0.17% |
| licm.NumHoisted                                  |    388827 |    388221 |   -606 | -0.16% |  0.16% |
| machine-cse.NumPREs                              |      3085 |      3080 |     -5 | -0.16% |  0.16% |
| loop-unroll.NumCompletelyUnrolled                |      9236 |      9222 |    -14 | -0.15% |  0.15% |

Note that i'm only changing current PM, and not touching obsolete PM.

Diff Detail

Event Timeline

lebedev.ri created this revision.Oct 29 2021, 12:17 PM
lebedev.ri requested review of this revision.Oct 29 2021, 12:17 PM

Also partially solves PR52354?

Maybe loop unrolling + some cleanup pass could also solve your motivating case and also all motivation cases from PR52354?

lebedev.ri edited the summary of this revision. (Show Details)Oct 29 2021, 12:29 PM
lebedev.ri added a reviewer: aeubanks.
lebedev.ri edited the summary of this revision. (Show Details)Oct 29 2021, 12:36 PM

Also partially solves PR52354?

I don't believe so.

Maybe loop unrolling + some cleanup pass could also solve your motivating case and also all motivation cases from PR52354?

I don't have an answer to that question.

lebedev.ri edited the summary of this revision. (Show Details)Oct 29 2021, 2:28 PM

any numbers for actual performance gains? not sure that it's worth addressing some specifically bad code in a blog post for 0.5% compile time, but I'll run this patch on some internal benchmarks

do you know after which pass the loop becomes deletable? presumably some pass between LPM2 and LPM3 causes the loop to be deletable since LPM2 contains a LoopDeletionPass

any numbers for actual performance gains? not sure that it's worth addressing some specifically bad code in a blog post for 0.5% compile time, but I'll run this patch on some internal benchmarks

do you know after which pass the loop becomes deletable? presumably some pass between LPM2 and LPM3 causes the loop to be deletable since LPM2 contains a LoopDeletionPass

Answered inline.

llvm/lib/Passes/PassBuilderPipelines.cpp
527

do you know after which pass the loop becomes deletable? presumably some pass between LPM2 and LPM3 causes the loop to be deletable since LPM2 contains a LoopDeletionPass

In that particular case after this one.

lebedev.ri planned changes to this revision.Oct 29 2021, 3:00 PM

Hmm, actually, i looked, and we catch a lot more cases if we do this in buildModuleOptimizationPipeline(), so let me do that instead. BRB.

Hmm, actually, i looked, and we catch a lot more cases if we do this in buildModuleOptimizationPipeline(), so let me do that instead. BRB.

it'd be nice to make this work in the function simplification pipeline so that more passes have a chance to take advantage of deleted loops, e.g. the inliner
putting passes at the end of the pipeline doesn't seem ideal unless there's a specific reason for it

Hmm, actually, i looked, and we catch a lot more cases if we do this in buildModuleOptimizationPipeline(), so let me do that instead. BRB.

it'd be nice to make this work in the function simplification pipeline so that more passes have a chance to take advantage of deleted loops, e.g. the inliner
putting passes at the end of the pipeline doesn't seem ideal unless there's a specific reason for it

From where i stand, we roughly have 3 obvious choices:

  1. keep dead loops :)
  2. the current diff, delete dead loops at the end of function simplification pipeline
  3. upcoming diff, delete dead loops before vectorization in module optimization pipeline. catches more cases, but perhaps pessimizes inliner.
  4. 2 & 3, slowest of them all.

What do you have in mind instead?

Hmm, actually, i looked, and we catch a lot more cases if we do this in buildModuleOptimizationPipeline(), so let me do that instead. BRB.

it'd be nice to make this work in the function simplification pipeline so that more passes have a chance to take advantage of deleted loops, e.g. the inliner
putting passes at the end of the pipeline doesn't seem ideal unless there's a specific reason for it

From where i stand, we roughly have 3 obvious choices:

  1. keep dead loops :)
  2. the current diff, delete dead loops at the end of function simplification pipeline
  3. upcoming diff, delete dead loops before vectorization in module optimization pipeline. catches more cases, but perhaps pessimizes inliner.
  4. 2 & 3, slowest of them all.

What do you have in mind instead?

I looked into @is_not_empty_variant1(). It seems like GVN + an instcombine cleanup is what's allowing the loop to be deleted. Currently GVN runs after the loop passes which is the issue.
I tried replacing the EarlyCSEPass at the beginning of the function simplification pipeline with a NewGVNPass and that fixes @is_not_empty_variant1. I think something along those lines is more principled than adding an LoopDeletionPass somewhere.
(Of course, if we want to turn on NewGVN specifically we need to iron out remaining issues)

lebedev.ri abandoned this revision.Oct 29 2021, 3:40 PM

Tentatively abandoning in favor of buildModuleOptimizationPipeline() variant D112851.
It is both more powerful, and TBD compile time wise.

Tentatively abandoning in favor of buildModuleOptimizationPipeline() variant D112851.
It is both more powerful, and TBD compile time wise.

And https://llvm-compile-time-tracker.com/compare.php?from=0ae7bf124a9bca76dd9a91b2f7379168ff13f562&to=c2ae57c9b961aeb4a28c747266949340613a6d84&stat=instructions is in,
it's only a ~+0.12% geomean regression.