Page MenuHomePhabricator

[LVI][CVP] Use block value when simplifying icmps
ClosedPublic

Authored by nikic on Oct 31 2019, 2:38 PM.

Details

Summary

Add a flag to getPredicateAt() that allows making use of the block value. This allows us to take into account range information from the current block, rather than only information that is threaded over edges, making the icmp simplification in CVP a lot more useful.

I'm not changing getPredicateAt() to use the block value unconditionally to avoid any impact on the JumpThreading pass, which is somewhat picky about LVI query order.

Most test changes here are just icmps that now get dropped (while previously only a result used in a return was replaced). The two tests in icmp.ll show two representative improvements. Once this lands, I will also clean up CVP tests to drop lots of dummy block splits that had to be added to work around the previous weakness.

Diff Detail

Event Timeline

nikic created this revision.Oct 31 2019, 2:38 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 31 2019, 2:38 PM
nikic marked 2 inline comments as done.Oct 31 2019, 2:45 PM
nikic added inline comments.
llvm/test/Transforms/JumpThreading/bb-unreachable-from-entry.ll
13 ↗(On Diff #227351)

This change is caused because we now successfully evaluate the icmp to a constant, and jump threading does not clean up after that (simplifycfg will). Previously this used some of the threading logic, which follows the chain to the end.

llvm/test/Transforms/JumpThreading/header-succ.ll
113 ↗(On Diff #227351)

It took me a while to understand what is going here... what happens is that we have a branch on undef and previously this branch was folded in one direction (to an exit), while now it goes into the other (into an infinite loop). Both are valid. The reason is again that we evaluate the icmp to a constant.

Ooh, finally, nice!
What's going on with JumpThreading tests?

nikic added a comment.Nov 1 2019, 9:58 AM

@lebedev.ri I added a couple of comments on the JT tests, probably after you opened the tab ^^ The reason for the diffs is that we now evaluate the condition to a constant, while previously these were phi threaded. It has a bit of an odd effect here, but I would expect that in practice folding a branch completely is generally preferable.

@lebedev.ri I added a couple of comments on the JT tests, probably after you opened the tab ^^

Yes.

The diff seems okay to me but i'm not deeply familiar with LVI.
I wonder who is the current code owner of LVI/CVP.

What correctness testing was done, test-suite?

alex added a subscriber: alex.Nov 1 2019, 2:37 PM
nikic added a comment.Nov 2 2019, 8:00 AM

What correctness testing was done, test-suite?

I've done a test-suite run now, it passes.

lebedev.ri accepted this revision.Nov 2 2019, 8:55 AM

Okay, LG, thank you for looking into it.
@spatel / @reames ?

This revision is now accepted and ready to land.Nov 2 2019, 8:55 AM

Interestingly, we seriously regress in NSW deductions:

| statistic                                 |     old |     new | delta |  % change |
| correlated-value-propagation.NumAShrs     |     209 |     438 |   229 | 109.5694% |
| correlated-value-propagation.NumAddNSW    |    4972 |    4237 |  -735 | -14.7828% |
| correlated-value-propagation.NumAddNUW    |    7141 |    7157 |    16 |   0.2241% |
| correlated-value-propagation.NumAddNW     |   12113 |   11394 |  -719 |  -5.9358% |
| correlated-value-propagation.NumAnd       |     442 |     441 |    -1 |  -0.2262% |
| correlated-value-propagation.NumCmps      |    1194 |    1183 |   -11 |  -0.9213% |
| correlated-value-propagation.NumDeadCases |     111 |     115 |     4 |   3.6036% |
| correlated-value-propagation.NumMulNSW    |     278 |     276 |    -2 |  -0.7194% |
| correlated-value-propagation.NumMulNUW    |    1326 |    1324 |    -2 |  -0.1508% |
| correlated-value-propagation.NumMulNW     |    1604 |    1600 |    -4 |  -0.2494% |
| correlated-value-propagation.NumNSW       |    7160 |    6415 |  -745 | -10.4050% |
| correlated-value-propagation.NumNUW       |   13306 |   13315 |     9 |   0.0676% |
| correlated-value-propagation.NumNW        |   20466 |   19730 |  -736 |  -3.5962% |
| correlated-value-propagation.NumOverflows |       7 |       4 |    -3 | -42.8571% |
| correlated-value-propagation.NumPhiCommon |     397 |     401 |     4 |   1.0076% |
| correlated-value-propagation.NumPhis      |   15360 |   15435 |    75 |   0.4883% |
| correlated-value-propagation.NumSDivs     |     207 |     304 |    97 |  46.8599% |
| correlated-value-propagation.NumSExt      |    6278 |    7226 |   948 |  15.1004% |
| correlated-value-propagation.NumSRems     |      28 |      49 |    21 |  75.0000% |
| correlated-value-propagation.NumShlNSW    |    1171 |    1160 |   -11 |  -0.9394% |
| correlated-value-propagation.NumShlNUW    |    2793 |    2783 |   -10 |  -0.3580% |
| correlated-value-propagation.NumShlNW     |    3964 |    3943 |   -21 |  -0.5298% |
| correlated-value-propagation.NumSubNSW    |     739 |     742 |     3 |   0.4060% |
| correlated-value-propagation.NumSubNUW    |    2046 |    2051 |     5 |   0.2444% |
| correlated-value-propagation.NumSubNW     |    2785 |    2793 |     8 |   0.2873% |
| correlated-value-propagation.NumUDivs     |     353 |     375 |    22 |   6.2323% |
| instcount.NumAShrInst                     |   13761 |   13588 |  -173 |  -1.2572% |
| instcount.NumAddInst                      |  277538 |  277585 |    47 |   0.0169% |
| instcount.NumAllocaInst                   |   28040 |   28043 |     3 |   0.0107% |
| instcount.NumAndInst                      |   66082 |   66163 |    81 |   0.1226% |
| instcount.NumBitCastInst                  |  675473 |  675587 |   114 |   0.0169% |
| instcount.NumBrInst                       |  712987 |  712797 |  -190 |  -0.0266% |
| instcount.NumCallInst                     |  528453 |  528461 |     8 |   0.0015% |
| instcount.NumExtractElementInst           |   19418 |   19420 |     2 |   0.0103% |
| instcount.NumFCmpInst                     |   11039 |   11037 |    -2 |  -0.0181% |
| instcount.NumFNegInst                     |    1270 |    1272 |     2 |   0.1575% |
| instcount.NumFSubInst                     |   90911 |   90915 |     4 |   0.0044% |
| instcount.NumGetElementPtrInst            | 1229297 | 1229518 |   221 |   0.0180% |
| instcount.NumICmpInst                     |  487467 |  487413 |   -54 |  -0.0111% |
| instcount.NumInvokeInst                   |   21780 |   21779 |    -1 |  -0.0046% |
| instcount.NumLShrInst                     |   27409 |   27587 |   178 |   0.6494% |
| instcount.NumLoadInst                     |  890831 |  890834 |     3 |   0.0003% |
| instcount.NumMulInst                      |   43841 |   43832 |    -9 |  -0.0205% |
| instcount.NumOrInst                       |  102648 |  102652 |     4 |   0.0039% |
| instcount.NumPHIInst                      |  317805 |  317726 |   -79 |  -0.0249% |
| instcount.NumPtrToIntInst                 |   16114 |   16116 |     2 |   0.0124% |
| instcount.NumRetInst                      |   88782 |   88777 |    -5 |  -0.0056% |
| instcount.NumSDivInst                     |    8732 |    8670 |   -62 |  -0.7100% |
| instcount.NumSExtInst                     |   79306 |   78745 |  -561 |  -0.7074% |
| instcount.NumSRemInst                     |    1679 |    1678 |    -1 |  -0.0596% |
| instcount.NumSelectInst                   |   46181 |   46251 |    70 |   0.1516% |
| instcount.NumShlInst                      |   40640 |   40575 |   -65 |  -0.1599% |
| instcount.NumShuffleVectorInst            |  100326 |  100317 |    -9 |  -0.0090% |
| instcount.NumStoreInst                    |  814080 |  814211 |   131 |   0.0161% |
| instcount.NumSubInst                      |   61973 |   61976 |     3 |   0.0048% |
| instcount.NumTruncInst                    |   62132 |   62166 |    34 |   0.0547% |
| instcount.NumUDivInst                     |    2526 |    2535 |     9 |   0.3563% |
| instcount.NumURemInst                     |    1589 |    1598 |     9 |   0.5664% |
| instcount.NumUnreachableInst              |   13486 |   13467 |   -19 |  -0.1409% |
| instcount.NumXorInst                      |   10632 |   10643 |    11 |   0.1035% |
| instcount.NumZExtInst                     |   68227 |   69001 |   774 |   1.1344% |
| instcount.TotalBlocks                     |  847787 |  847572 |  -215 |  -0.0254% |
| instcount.TotalFuncs                      |   88847 |   88843 |    -4 |  -0.0045% |
| instcount.TotalInsts                      | 7411666 | 7412146 |   480 |   0.0065% |
fhahn added a subscriber: fhahn.Nov 4 2019, 1:55 AM
lebedev.ri added inline comments.Nov 4 2019, 6:31 AM
llvm/lib/Analysis/LazyValueInfo.cpp
1766–1768

Should we be getting both, and combining their knowledge somehow?

reames added a comment.Nov 4 2019, 9:36 AM

Can you give some context on the problem you're trying to solve? This doesn't look quite right, but maybe with some context I can make a suggestion as to how to approach cleanly?

llvm/lib/Analysis/LazyValueInfo.cpp
1762

Wait, no, please don't do this. Please don't bake in assumption about the semantics of the function based on the type of the argument. If this difference exists, we should fix/remove it.

I've long thought we needed to have a getValueAtBegin(BB) variant. Would that solve your use case?

lebedev.ri requested changes to this revision.Nov 5 2019, 12:56 AM

(there is no "revoke review" so marking as "changes needed" instead)

This revision now requires changes to proceed.Nov 5 2019, 12:56 AM
nikic marked an inline comment as done.Nov 5 2019, 2:37 AM

@lebedev.ri Weirdly that does not match the results I get:

correlated-value-propagation.NumAShrs | 199 | 425
correlated-value-propagation.NumAddNSW | 1975 | 2056
correlated-value-propagation.NumAddNUW | 4156 | 4173
correlated-value-propagation.NumAddNW | 6131 | 6229
correlated-value-propagation.NumAnd | 194 | 204
correlated-value-propagation.NumCmps | 809 | 933
correlated-value-propagation.NumDeadCases | 110 | 114
correlated-value-propagation.NumMulNSW | 131 | 129
correlated-value-propagation.NumMulNUW | 832 | 830
correlated-value-propagation.NumMulNW | 963 | 959
correlated-value-propagation.NumNSW | 3781 | 3850
correlated-value-propagation.NumNUW | 7737 | 7745
correlated-value-propagation.NumNW | 11518 | 11595
correlated-value-propagation.NumOverflows | 7 | 4
correlated-value-propagation.NumPhiCommon | 393 | 393
correlated-value-propagation.NumPhis | 11375 | 11442
correlated-value-propagation.NumSDivs | 201 | 295
correlated-value-propagation.NumSExt | 3667 | 4602
correlated-value-propagation.NumSRems | 28 | 47
correlated-value-propagation.NumSelects | 25 | 25
correlated-value-propagation.NumShlNSW | 1082 | 1069
correlated-value-propagation.NumShlNUW | 2262 | 2251
correlated-value-propagation.NumShlNW | 3344 | 3320
correlated-value-propagation.NumSubNSW | 593 | 596
correlated-value-propagation.NumSubNUW | 487 | 491
correlated-value-propagation.NumSubNW | 1080 | 1087
correlated-value-propagation.NumUDivs | 189 | 211

I ran this together with a second patch to CVP which actually computes the icmp result for non-local values, which I was planning to submit as a followup. I don't think that part should impact the NSW deductions though.

llvm/lib/Analysis/LazyValueInfo.cpp
1766–1768

getValueInBlock() is strictly more powerful, so that should not be necessary.

nikic marked an inline comment as done.Nov 5 2019, 2:49 AM

Can you give some context on the problem you're trying to solve? This doesn't look quite right, but maybe with some context I can make a suggestion as to how to approach cleanly?

I want to determine the result of icmps based on range information within the same basic block (see the two tests for examples). LVI has all the necessary machinery to do that, but right now that getPredicateAt() method which is used to evaluate icmps in JT and CVP does not actually make use of it. getPrediateAt() is (mostly) weaker than manually evaluating the predicate based on the getConstantRange() result, and CVP actually does that for some of the "non-negative" optimizations.

See also D44252, where you suggested (I think, if I understood correctly) to do what I'm trying to do here.

llvm/lib/Analysis/LazyValueInfo.cpp
1762

I agree that this is pretty weird. I can take a look at fixing the behavior of pointers, which I believe are in the wrong here (the integer behavior is already heavily embedded in how CVP works).

nikic updated this revision to Diff 228504.Nov 8 2019, 11:50 AM

Rebase over changes to pointer handling.

There are some concerning changes to lvi-after-jumpthreading.ll which I probably missed before.

nikic marked an inline comment as done.Nov 8 2019, 12:54 PM
nikic added inline comments.
llvm/test/Analysis/LazyValueAnalysis/lvi-after-jumpthreading.ll
25 ↗(On Diff #228504)

There are two issues here. The first is that JumpThreading marks LVI as preserved, but only seems to preserve it in the sense that the result isn't incorrect -- but not necessarily the same as rerunning from scratch.

The second is that we now end up computing block values in a different order, and this has an impact on the results. Here is the debug output of running JumpThreading on this function:

Before this patch:

LVI Getting value   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] at ''
  Result = overdefined
LVI Getting edge value i32 0 from 'entry' to 'loop'
  Result = constantrange<0, 1>
LVI Getting edge value   %iv.next = add nsw i32 %iv, 1 from 'backedge' to 'loop'
PUSH:   %iv.next = add nsw i32 %iv, 1 in backedge
PUSH:   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] in backedge
PUSH:   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] in loop
POP   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] in loop = constantrange<-2147483648, 400>
POP   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] in backedge = constantrange<0, 400>
POP   %iv.next = add nsw i32 %iv, 1 in backedge = constantrange<1, 401>
  Result = constantrange<1, 400>
LVI Getting value   %iv.next = add nsw i32 %iv, 1 at ''
  Result = overdefined
LVI Getting value   %iv.next = add nsw i32 %iv, 1 at ''
  Result = overdefined

After this patch:

LVI Getting block end value   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] at 'loop'
PUSH:   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] in loop
PUSH:   %iv.next = add nsw i32 %iv, 1 in backedge
PUSH:   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] in backedge
POP   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] in backedge = constantrange<0, -2147483648>
POP   %iv.next = add nsw i32 %iv, 1 in backedge = constantrange<1, -2147483648>
POP   %iv = phi i32 [ 0, %entry ], [ %iv.next, %backedge ] in loop = constantrange<0, 400>
  Result = constantrange<0, 400>
LVI Getting block end value   %iv.next = add nsw i32 %iv, 1 at 'backedge'
  Result = constantrange<1, -2147483648>
LVI Getting block end value   %iv.next = add nsw i32 %iv, 1 at 'backedge'
  Result = constantrange<1, -2147483648>

Previously we ended up computing %iv in loop first, then %iv in backedge, then %iv.next in backedge. Now we compute %iv in backedge first, then %iv.next in backedge`, then %iv in loop. This eventually results in different results being calculated.

This seems pretty concerning regarding the general design of LVI, and I'm not really sure what I should do about this.

Please can you specify the order of these LVI patches, which one should go after which one, where one would start reviewing?

lebedev.ri resigned from this revision.Apr 8 2020, 11:46 PM

No idea who is comfortable/qualified doing LVI reviews, but that isn't me.

nikic updated this revision to Diff 288855.Aug 30 2020, 8:22 AM
nikic retitled this revision from [LVI][CVP] Use block value in getPredicateAt() to [LVI][CVP] Use block value when simplifying icmps.
nikic edited the summary of this revision. (Show Details)
nikic added a reviewer: fhahn.
nikic set the repository for this revision to rG LLVM Github Monorepo.

Add flag to getPredicateOf() that allows using the block value, and enable it in CVP.

nikic added a comment.Aug 30 2020, 8:32 AM
This comment was removed by nikic.

Oops, I looked at the wrong revision in my previous comment, that was for D69914.

Now looking at the right one: https://llvm-compile-time-tracker.com/compare.php?from=cb392c870d12eb520f84c8b7eb4f57e37483baed&to=d06337c311c36ce4fd0e939bbcef65f6d53c9d47&stat=instructions Compile-time impact is pretty minimal (< 0.1%).

Relevant stats on test-suite:

correlated-value-propagation.NumCmps | 816 | 1004
sccp.NumDeadBlocks | 7420 | 7413
sccp.NumInstRemoved | 18200 | 18168

I think that the addition of range support in SCCP has already covered many of the cases I wanted this for. Still think it makes sense to do this in CVP as well, as we already have the needed information, and LVI and SCCP have somewhat different strengths.

nikic updated this revision to Diff 293027.Sep 20 2020, 1:16 PM

Add additional test case @test_br_cmp_with_offset. Found in Rust code and not folded by IPSCCP (or anything else in LLVM).

fhahn added a comment.Sep 27 2020, 7:24 AM

LGTM as long as this is not too expensive in terms of compile-time, thanks!

As mentioned in the comments and description, this catches cases we cannot really catch in (IP)SCCP or other places at the moment. In particular, the way conditional information is handled in IPSCCP does not allow this kind of simplifications without major changes.

llvm/lib/Analysis/LazyValueInfo.cpp
1763

Currently the comment for getValueInBlock says it returns the value at the end of the block. IIUC this only refers to the fact that all assumes in the block are used. But if CxtI is passed, only assumes dominating CxtI are passed, so effectively we get the value at CxtI? Might be good to clarify this in the docs.

nikic added inline comments.Sep 27 2020, 9:25 AM
llvm/lib/Analysis/LazyValueInfo.cpp
1763

Right, the comment there is outdated. Prior to D69914 the value was valid either at the start of the block (for integers) or the end of the block (for pointers), now it is always valid at the start (or at the context instruction if given). I've pushed a clarification to those doc comments in https://github.com/llvm/llvm-project/commit/709d03f8af4da4204849a70f01798e7cebba2e32.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 27 2020, 11:33 AM
This revision was automatically updated to reflect the committed changes.

Hello.

We've ran into an issues with a case like this:
https://godbolt.org/z/KnvGvo

This now manages to prove that the "max" in a saturating the max(min(X, 32767), -32768) is always false. For some cpu's that's fantastic, and give a decent speed increase. But others have a ssat instruction that can do that saturation quickly, that match from a min(max(..)).
Same goes for MVE instructions where there are a lot of saturating instructions we were previously picking (even if we were not doing it optimally before).

Any ideas of a good way of re-proving that the lower bounds don't need to be checked, but in the backend during ISel?

(In this case the _only_ value that can saturate is a -32768*-32768 multiply, but the code remains and the performance change can be quite substantial. )

nikic added a comment.Sep 28 2020, 1:28 PM

Hello.

We've ran into an issues with a case like this:
https://godbolt.org/z/KnvGvo

This now manages to prove that the "max" in a saturating the max(min(X, 32767), -32768) is always false. For some cpu's that's fantastic, and give a decent speed increase. But others have a ssat instruction that can do that saturation quickly, that match from a min(max(..)).
Same goes for MVE instructions where there are a lot of saturating instructions we were previously picking (even if we were not doing it optimally before).

Any ideas of a good way of re-proving that the lower bounds don't need to be checked, but in the backend during ISel?

(In this case the _only_ value that can saturate is a -32768*-32768 multiply, but the code remains and the performance change can be quite substantial. )

I don't have a good suggestion here. This doesn't seem to be provable from known bits or known sign bits reasoning (...right?) and I don't think there's any range based reasoning available at the SDAG level right now. Maybe that would be the motivation to introduce it...

I'm pretty surprised this managed to survive this long. I guess IPSCCP did not fold this yet because it doesn't handle SPF.