HomePhabricator

[InstCombine] Take 2: Perform trivial PHI CSE

Authored by lebedev.ri on Aug 29 2020, 12:42 AM.

Description

[InstCombine] Take 2: Perform trivial PHI CSE

The original take was 6102310d814ad73eab60a88b21dd70874f7a056f,
which taught InstSimplify to do that, which seemed better at time,
since we got EarlyCSE support for free.

However, it was proven that we can not do that there,
the simplified-to PHI would not be reachable from the original PHI,
and that is not something InstSimplify is allowed to do,
as noted in the commit ed90f15efb40d26b5d3ead3bb8e9e284218e0186
that reverted it :

It appears to cause compilation non-determinism and caused stage3 mismatches.

However InstCombine already does many different optimizations,
so it should be a safe place to do it here.

Note that we still can't just compare incoming values ranges,
because there is no guarantee that these PHI's we'd simplify to
were already re-visited and sorted.
However coming up with a test is problematic.

Effects on vanilla llvm test-suite + RawSpeed:

| statistic name                                     | baseline  | proposed  |      Δ |        % |      |%| |
|----------------------------------------------------|-----------|-----------|-------:|---------:|---------:|
| instcombine.NumPHICSEs                             | 0         | 22228     |  22228 |    0.00% |    0.00% |
| asm-printer.EmittedInsts                           | 7942329   | 7942456   |    127 |    0.00% |    0.00% |
| assembler.ObjectBytes                              | 254295632 | 254313792 |  18160 |    0.01% |    0.01% |
| early-cse.NumCSE                                   | 2183283   | 2183272   |    -11 |    0.00% |    0.00% |
| early-cse.NumSimplify                              | 550105    | 541842    |  -8263 |   -1.50% |    1.50% |
| instcombine.NumAggregateReconstructionsSimplified  | 73        | 4506      |   4433 | 6072.60% | 6072.60% |
| instcombine.NumCombined                            | 3640311   | 3666911   |  26600 |    0.73% |    0.73% |
| instcombine.NumDeadInst                            | 1778204   | 1783318   |   5114 |    0.29% |    0.29% |
| instcount.NumCallInst                              | 1758395   | 1758804   |    409 |    0.02% |    0.02% |
| instcount.NumInvokeInst                            | 59478     | 59502     |     24 |    0.04% |    0.04% |
| instcount.NumPHIInst                               | 330557    | 330549    |     -8 |    0.00% |    0.00% |
| instcount.TotalBlocks                              | 1077138   | 1077221   |     83 |    0.01% |    0.01% |
| instcount.TotalFuncs                               | 101442    | 101441    |     -1 |    0.00% |    0.00% |
| instcount.TotalInsts                               | 8831946   | 8832611   |    665 |    0.01% |    0.01% |
| simplifycfg.NumInvokes                             | 4300      | 4410      |    110 |    2.56% |    2.56% |
| simplifycfg.NumSimpl                               | 1019813   | 999740    | -20073 |   -1.97% |    1.97% |

So it fires ~22k times, which is less than ~24k the take 1 did.
It allows foldAggregateConstructionIntoAggregateReuse() to actually work
after PHI-of-extractvalue folds did their thing. Previously SimplifyCFG
would have done this PHI CSE, of all places. Additionally, allows some
more invoke->call folds to happen (+110, +2.56%).

All in all, expectedly, this catches less things overall,
but all the motivational cases are still caught, so all good.

Event Timeline

nikic added a subscriber: nikic.Aug 29 2020, 3:55 AM
nikic added inline comments.
/llvm/lib/IR/Instruction.cpp
489

s/nessesairly/necessarily

/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
1443

As we're in InstCombine here, this needs to use replaceInstUsesWith() rather than RAUW. Maybe that also avoids the threshold increase in the test?

nikic added inline comments.Aug 29 2020, 4:00 AM
/llvm/test/Transforms/InstCombine/select.ll
1906

This TODO is now done :)

xbolva00 added a subscriber: xbolva00.EditedAug 29 2020, 4:05 AM

No review?

Given that “take 1” was post commented by @nikic and reverted due to breakages, “take 2” should be properly reviewed.

Atleast @spatel @efriedma should really look at it..

lebedev.ri added inline comments.Aug 29 2020, 4:39 AM
/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
1443

Ugh, thanks, this pitfail will now follow us always :) Not great.
TBN there are 9 replaceAllUsesWith() calls in instcombine,
so i think there are a few more instances of this.

That being said, it doesn't help with that test.

nikic added inline comments.Aug 29 2020, 6:02 AM
/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
1443

What seems to be happening there is that we first insert the %storemerge phi, then the %storemerge1 phi before it, and then only CSE the %storemerge phi on the next iteration, as we are only considering phis before the current one. As such we cannot CSE new phis inserted at the start of the block (where they typically are inserted) without an extra iteration.

Maybe this is one of the cases where we need to go against the conventional InstCombine wisdom and scan the phis after the current one, instead of the ones before it?

lebedev.ri added inline comments.Aug 29 2020, 6:20 AM
/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
1443

(PHI's can reverence preceeding PHI's, right? i guess we would then need to swap the replacement direction)

What seems to be happening there is that we first insert the %storemerge phi, then the %storemerge1 phi before it, and then only CSE the %storemerge phi on the next iteration, as we are only considering phis before the current one. As such we cannot CSE new phis inserted at the start of the block (where they typically are inserted) without an extra iteration.

That sounds like a likely explanation of the problem, yes.

Maybe this is one of the cases where we need to go against the conventional InstCombine wisdom and scan the phis after the current one, instead of the ones before it?

I'm not sure "(where they typically are inserted)" is a sufficient justification to *only*
scan after the PHI node in question. I wonder if we need to scan them all.

Also, even with InstCombine-based approach, there seems to be stage2-stage3 mismatch issue:

http://lab.llvm.org:8011/builders/clang-with-thin-lto-ubuntu/builds/24095

i'm not sure if that tells us whether or not the earlier, instsimplify-based attempt isn't flawed though.

I've reverted the instcombine::visitphi() change for the moment,
but left Instruction::isIdenticalToWhenDefined() change in.
Once the bot goes back green, i'll retry with scanning all phi's.

lebedev.ri added inline comments.Aug 29 2020, 9:17 AM
/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
1443

And the build didn't go green after reverting *just* the InstCombine changes,
http://lab.llvm.org:8011/builders/clang-with-thin-lto-ubuntu/builds/24097
which means we are at least partially looking at the wrong thing,
and the Instruction::isIdenticalToWhenDefined() change must be at least partially the problem..

efriedma added inline comments.Aug 31 2020, 12:39 PM
/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
1443

isIdenticalToWhenDefined() is used by EarlyCSE; that's likely the source of the issue. (The equality comparison has to be consistent with the hash function.)

lebedev.ri added inline comments.Aug 31 2020, 12:43 PM
/llvm/lib/Transforms/InstCombine/InstCombinePHI.cpp
1443

isIdenticalToWhenDefined() is used by EarlyCSE; that's likely the source of the issue. (The equality comparison has to be consistent with the hash function.)

Yep, indeed, me and @nikic arrived at that conclusion separately almost at the same time https://reviews.llvm.org/rGbf21ce7b908e#inline-4637