This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] PHI-of-extractvalues -> extractvalue-of-PHI, aka invokes are bad
ClosedPublic

Authored by lebedev.ri on Aug 25 2020, 4:51 AM.

Details

Summary

While since D86306 we do it's sibling fold for insertvalue,
we should also do this for extractvalue's.

And unlike that one, the results here are, quite honestly, shocking,
as it can be observed here on vanilla llvm test-suite + RawSpeed results:

| statistic name                                     | baseline  | proposed  |       Δ |       % |    |%| |
|----------------------------------------------------|-----------|-----------|--------:|--------:|-------:|
| asm-printer.EmittedInsts                           | 7945095   | 7942507   |   -2588 |  -0.03% |  0.03% |
| assembler.ObjectBytes                              | 273209920 | 273069800 | -140120 |  -0.05% |  0.05% |
| early-cse.NumCSE                                   | 2183363   | 2183398   |      35 |   0.00% |  0.00% |
| early-cse.NumSimplify                              | 541847    | 550017    |    8170 |   1.51% |  1.51% |
| instcombine.NumAggregateReconstructionsSimplified  | 2139      | 108       |   -2031 | -94.95% | 94.95% |
| instcombine.NumCombined                            | 3601364   | 3635448   |   34084 |   0.95% |  0.95% |
| instcombine.NumConstProp                           | 27153     | 27157     |       4 |   0.01% |  0.01% |
| instcombine.NumDeadInst                            | 1694521   | 1765022   |   70501 |   4.16% |  4.16% |
| instcombine.NumPHIsOfExtractValues                 | 0         | 37546     |   37546 |   0.00% |  0.00% |
| instcombine.NumSunkInst                            | 63158     | 63686     |     528 |   0.84% |  0.84% |
| instcount.NumBrInst                                | 874304    | 871857    |   -2447 |  -0.28% |  0.28% |
| instcount.NumCallInst                              | 1757657   | 1758402   |     745 |   0.04% |  0.04% |
| instcount.NumExtractValueInst                      | 45623     | 11483     |  -34140 | -74.83% | 74.83% |
| instcount.NumInsertValueInst                       | 4983      | 580       |   -4403 | -88.36% | 88.36% |
| instcount.NumInvokeInst                            | 61018     | 59478     |   -1540 |  -2.52% |  2.52% |
| instcount.NumLandingPadInst                        | 35334     | 34215     |   -1119 |  -3.17% |  3.17% |
| instcount.NumPHIInst                               | 344428    | 331116    |  -13312 |  -3.86% |  3.86% |
| instcount.NumRetInst                               | 100773    | 100772    |      -1 |   0.00% |  0.00% |
| instcount.TotalBlocks                              | 1081154   | 1077166   |   -3988 |  -0.37% |  0.37% |
| instcount.TotalFuncs                               | 101443    | 101442    |      -1 |   0.00% |  0.00% |
| instcount.TotalInsts                               | 8890201   | 8833747   |  -56454 |  -0.64% |  0.64% |
| instsimplify.NumSimplified                         | 75822     | 75707     |    -115 |  -0.15% |  0.15% |
| simplifycfg.NumHoistCommonCode                     | 24203     | 24197     |      -6 |  -0.02% |  0.02% |
| simplifycfg.NumHoistCommonInstrs                   | 48201     | 48195     |      -6 |  -0.01% |  0.01% |
| simplifycfg.NumInvokes                             | 2785      | 4298      |    1513 |  54.33% | 54.33% |
| simplifycfg.NumSimpl                               | 997332    | 1018189   |   20857 |   2.09% |  2.09% |
| simplifycfg.NumSinkCommonCode                      | 7088      | 6464      |    -624 |  -8.80% |  8.80% |
| simplifycfg.NumSinkCommonInstrs                    | 15117     | 14021     |   -1096 |  -7.25% |  7.25% |

... which tells us that this new fold fires whopping 38k times,
increasing the amount of SimplifyCFG's invoke->call transforms by +54% (+1513) (again, D85787 did that last time),
decreasing total instruction count by -0.64% (-56454),
and sharply decreasing count of insertvalue's (-88.36%, i.e. 9 times less)
and extractvalue's (-74.83%, i.e. four times less).

This causes geomean -0.01% binary size decrease
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=size-text
and, ignoring O0-g, is a geomean -0.01%..-0.05% compile-time improvement
http://llvm-compile-time-tracker.com/compare.php?from=4d5ca22b8adfb6643466e4e9f48ba14bb48938bc&to=97dacca0111cb2ae678204e52a3cee00e3a69208&stat=instructions

The other thing that tells is, is that while this is a massive win for invoke->call transform
InstCombinerImpl::foldAggregateConstructionIntoAggregateReuse() fold,
which is supposed to be dealing with such aggregate reconstructions,
fires a lot less now. There are two reasons why:

  1. After this fold, as it can be seen in tests, we may (will) end up with trivially redundant PHI nodes. We don't CSE them in InstCombine presently, which means that EarlyCSE needs to run and then InstCombine rerun.
  2. But then, EarlyCSE not only manages to fold such redundant PHI's, it also sees that the extract-insert chain recreates the original aggregate, and replaces it with the original aggregate.

The take-aways are

  1. We maybe should do most trivial, same-BB PHI CSE in InstCombine
  2. I need to check if what other patterns remain, and how they can be resolved. (i.e. i wonder if foldAggregateConstructionIntoAggregateReuse() might go away)

Diff Detail

Event Timeline

lebedev.ri created this revision.Aug 25 2020, 4:51 AM
lebedev.ri requested review of this revision.Aug 25 2020, 4:51 AM
lebedev.ri retitled this revision from [InstCombine] PHI-of-insertvalues -> insertvalue-of-PHI's, aka invokes are bad to [InstCombine] PHI-of-extractvalues -> extractvalue-of-PHI, aka invokes are bad.Aug 25 2020, 6:27 AM
lebedev.ri edited the summary of this revision. (Show Details)Aug 25 2020, 9:08 AM
lebedev.ri edited the summary of this revision. (Show Details)Aug 25 2020, 12:03 PM
nikic added a subscriber: nikic.Aug 25 2020, 12:30 PM

Nice results :)

spatel accepted this revision.Aug 25 2020, 2:32 PM

Seems like a logical extension of the earlier patch, and overall results are good. LGTM.

This revision is now accepted and ready to land.Aug 25 2020, 2:32 PM

Seems like a logical extension of the earlier patch, and overall results are good. LGTM.

Thank you for the review.

This revision was landed with ongoing or failed builds.Aug 25 2020, 11:09 PM
This revision was automatically updated to reflect the committed changes.

@spatel

Ok, so

There are two reasons why:

  1. After this fold, as it can be seen in tests, we may (will) end up with trivially redundant PHI nodes. We don't CSE them in InstCombine presently, which means that EarlyCSE needs to run and then InstCombine rerun.
  2. But then, EarlyCSE not only manages to fold such redundant PHI's, it also sees that the extract-insert chain recreates the original aggregate, and replaces it with the original aggregate.

The take-aways are

  1. We maybe should do most trivial, same-BB PHI CSE in InstCombine
  2. I need to check if what other patterns remain, and how they can be resolved. (i.e. i wonder if foldAggregateConstructionIntoAggregateReuse() might go away)

I looked at remaining interesting patterns, and i believe this sinking won't be able to deal with everything.
there may be genuine extra uses of extracts, namely __cxa_begin_catch().

So before i look into foldAggregateConstructionIntoAggregateReuse() further,
what are your thoughts about teaching InstSimplify to do most basic CSE for PHI nodes?
(purely by comparing that the incoming values match)

This will ease some phase ordering issues, since then we won't need to run earlycse and rerun instcombine.

Effect on vanilla test-suite + RawSpeed:

| statistic name                                     | baseline  | proposed  |      Δ |        % |    \|%\| |
|----------------------------------------------------|-----------|-----------|-------:|---------:|---------:|
| asm-printer.EmittedInsts                           | 7942505   | 7942558   |     53 |    0.00% |    0.00% |
| assembler.ObjectBytes                              | 273069800 | 273085376 |  15576 |    0.01% |    0.01% |
| correlated-value-propagation.NumPhis               | 18825     | 18958     |    133 |    0.71% |    0.71% |
| early-cse.NumCSE                                   | 2183398   | 2183342   |    -56 |    0.00% |    0.00% |
| early-cse.NumCSELoad                               | 317796    | 317801    |      5 |    0.00% |    0.00% |
| early-cse.NumSimplify                              | 550017    | 542090    |  -7927 |   -1.44% |    1.44% |
| instcombine.NumAggregateReconstructionsSimplified  | 108       | 4502      |   4394 | 4068.52% | 4068.52% |
| instcombine.NumCombined                            | 3635533   | 3659879   |  24346 |    0.67% |    0.67% |
| instcombine.NumDeadInst                            | 1765202   | 1770173   |   4971 |    0.28% |    0.28% |
| instcombine.NumPHIsOfExtractValues                 | 37546     | 37521     |    -25 |   -0.07% |    0.07% |
| instcount.NumBrInst                                | 871857    | 871838    |    -19 |    0.00% |    0.00% |
| instcount.NumCallInst                              | 1758402   | 1758818   |    416 |    0.02% |    0.02% |
| instcount.NumExtractValueInst                      | 11483     | 11477     |     -6 |   -0.05% |    0.05% |
| instcount.NumInsertValueInst                       | 580       | 578       |     -2 |   -0.34% |    0.34% |
| instcount.NumInvokeInst                            | 59478     | 59502     |     24 |    0.04% |    0.04% |
| instcount.NumLandingPadInst                        | 34215     | 34214     |     -1 |    0.00% |    0.00% |
| instcount.NumPHIInst                               | 331116    | 331086    |    -30 |   -0.01% |    0.01% |
| instcount.NumResumeInst                            | 8062      | 8061      |     -1 |   -0.01% |    0.01% |
| instcount.NumRetInst                               | 100772    | 100770    |     -2 |    0.00% |    0.00% |
| instcount.TotalBlocks                              | 1077166   | 1077168   |      2 |    0.00% |    0.00% |
| instcount.TotalFuncs                               | 101442    | 101441    |     -1 |    0.00% |    0.00% |
| instcount.TotalInsts                               | 8833575   | 8833896   |    321 |    0.00% |    0.00% |
| simplifycfg.NumInvokes                             | 4298      | 4406      |    108 |    2.51% |    2.51% |
| simplifycfg.NumSimpl                               | 1018189   | 998050    | -20139 |   -1.98% |    1.98% |

... so it again results in improvements in invoke->call fold.

So before i look into foldAggregateConstructionIntoAggregateReuse() further,
what are your thoughts about teaching InstSimplify to do most basic CSE for PHI nodes?
(purely by comparing that the incoming values match)

Instruction::isIdenticalTo()? I see one use of that via visitStoreInst, so there's possible precedent. Start a llvm-dev thread to get other opinions?

So before i look into foldAggregateConstructionIntoAggregateReuse() further,
what are your thoughts about teaching InstSimplify to do most basic CSE for PHI nodes?
(purely by comparing that the incoming values match)

Instruction::isIdenticalTo()?

Hm, i guess, although that one is broken QoI-wise for PHI's - it assumes identical incoming block order, which isn't something that should be depended upon.

I see one use of that via visitStoreInst, so there's possible precedent. Start a llvm-dev thread to get other opinions?

So before i look into foldAggregateConstructionIntoAggregateReuse() further,
what are your thoughts about teaching InstSimplify to do most basic CSE for PHI nodes?
(purely by comparing that the incoming values match)

Instruction::isIdenticalTo()?

Hm, i guess, although that one is broken QoI-wise for PHI's - it assumes identical incoming block order, which isn't something that should be depended upon.

I see one use of that via visitStoreInst, so there's possible precedent. Start a llvm-dev thread to get other opinions?

Actually, apparently we don't even do PHI CSE in EarlyCSE,
so i consider my question to be dumb,
and directly proceeded with the fix, rG6102310d814ad73eab60a88b21dd70874f7a056f.

lebedev.ri added a comment.EditedAug 31 2020, 10:05 AM

So before i look into foldAggregateConstructionIntoAggregateReuse() further,
what are your thoughts about teaching InstSimplify to do most basic CSE for PHI nodes?
(purely by comparing that the incoming values match)

Instruction::isIdenticalTo()?

Hm, i guess, although that one is broken QoI-wise for PHI's - it assumes identical incoming block order, which isn't something that should be depended upon.

I see one use of that via visitStoreInst, so there's possible precedent. Start a llvm-dev thread to get other opinions?

Actually, apparently we don't even do PHI CSE in EarlyCSE,
so i consider my question to be dumb,
and directly proceeded with the fix, rG6102310d814ad73eab60a88b21dd70874f7a056f.

Yikes, that could've/should've gone smoother.

But with that done, i've taken some time to analyze next step.
As of right now, in vanilla llvm test-suite + RawSpeed, there are 41 motivational patterns remain
(resume of something that isn't either a landingpad or a PHI of landingpads).

All those cases require multiple levels of PHI's,
and in all of those cases at least one extractvalue has an extra use,
which goes against the current legality checking for these transforms.

But even if we relax it to "all PHIs must go away, and at least one extractvalue must go away",
while that does finally catch all the remaining motivational cases,
it causes geomean +0.10% compile time increase (regression),
mainly because of, i believe, use-count checking, and it's "greedy" approach
D86882, http://llvm-compile-time-tracker.com/compare.php?from=1d01fc100bb5bef5f5eaf92520b2e52f64ee1d6e&to=d07a30a216f640c26c08771e2f7ecba783f5e44e&stat=instructions

So it would appear i will indeed have to enhance foldAggregateConstructionIntoAggregateReuse() :/