This is an archive of the discontinued LLVM Phabricator instance.

[SCEV] Re-enable "Use nw flag and symbolic iteration count to sharpen ranges of AddRecs", attempt 3
ClosedPublic

Authored by mkazantsev on Oct 14 2020, 4:05 AM.

Details

Summary

We can sharpen the range of a AddRec if we know that it does not
self-wrap and know the symbolic iteration count in the loop. If we can
evaluate the value of AddRec on the last iteration and prove that at least
one its intermediate value lies between start and end, then no-wrap flag
allows us to conclude that all of them also lie between start and end. So
the estimate of range can be improved to union of ranges of start and end.

Switched off by default, can be turned on by flag.

Diff Detail

Event Timeline

mkazantsev created this revision.Oct 14 2020, 4:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 14 2020, 4:05 AM
mkazantsev requested review of this revision.Oct 14 2020, 4:05 AM
mkazantsev added inline comments.Oct 14 2020, 5:10 AM
llvm/test/Analysis/ScalarEvolution/no-wrap-symbolic-becount.ll
10

Unsigned range here can also be made [0,0,4294967296)...

xbolva00 added inline comments.
llvm/test/Transforms/IndVarSimplify/widen-loop-comp.ll
548

remove todo

mkazantsev planned changes to this revision.Oct 14 2020, 5:19 AM
mkazantsev added inline comments.
llvm/test/Transforms/IndVarSimplify/widen-loop-comp.ll
548

Not until the zext is removed.

Sharpened unsigend flag too.

mkazantsev added inline comments.Oct 14 2020, 5:46 AM
llvm/test/Transforms/IndVarSimplify/widen-loop-comp.ll
548

Ideally, we should be able to widen everything to i64. Currently IV gives up because zext(iv) is not an addrec after the range is sharpened. It's weird but fixable I think.

efriedma added inline comments.
llvm/lib/Analysis/ScalarEvolution.cpp
5705

I'm concerned that evaluating at iteration MaxBECount won't do what you want. In particular, if MaxBECount is larger than the number of steps required to violate nw, then the way you're comparing Start and End doesn't work.

The actual backedge-taken count at runtime can't be that large, of course, but MaxBECount is an approximation in general; I don't think we make any relevant guarantees. In particular, if the step isn't constant, we can't get this right.

I think you could explicitly compute the largest number of non-wrapping iterations, and then take the minimum of that and MaxBECount to compute the "real" maximum count.

5744

At first I wasn't sure what isKnownPredicate(LEPred, Start, End)) was checking, but then I figured out it's accounting for the possibility of crossing INT_MAX/UINT_MAX, since nw doesn't exclude that. I think an alternate way of satisfying this condition would be to check for nsw/nuw.

fhahn added inline comments.Oct 14 2020, 10:59 AM
llvm/lib/Analysis/ScalarEvolution.cpp
5705

I think you could explicitly compute the largest number of non-wrapping iterations, and then take the minimum of that and MaxBECount to compute the "real" maximum count.

Unrelated to the patch directly, but I suppose we could also use this to clamp down the computed MaxBECount directly, if we do not already do so? But we still won't have a reliable way to indicate that the computed count can be used directly I think.

mkazantsev added inline comments.Oct 14 2020, 10:18 PM
llvm/lib/Analysis/ScalarEvolution.cpp
5705

I don't think this situation is possible, but I don't have strong enough arguments to prove that. We can add an extra check that step * max num iterations must be known less than UINT_MAX for this type.

mkazantsev added inline comments.Oct 14 2020, 10:24 PM
llvm/lib/Analysis/ScalarEvolution.cpp
5705

I'll think about adding clamp into MaxBECount computation. I'm concerned that SCEV may not always be able to derive required facts from clamped expressions. Let's not do it in this patch and just add a check against this situation.

mkazantsev added inline comments.Oct 14 2020, 10:50 PM
llvm/lib/Analysis/ScalarEvolution.cpp
5744

nuw/nsw checks should be, in different ways, already incorporated inside of isKnownPredicate.

mkazantsev marked an inline comment as done.

Added check to address Eli's comment. Now we make sure that we can *prove* nw.

This revision is now accepted and ready to land.Oct 15 2020, 11:18 AM
mkazantsev added inline comments.Oct 15 2020, 8:21 PM
llvm/lib/Analysis/ScalarEvolution.cpp
5757

Actually we do not need to prove anything about V1. Knowing start < end && step is positive is enough, as well as start > end is negative.

mkazantsev planned changes to this revision.Oct 15 2020, 8:23 PM

Removed V1 check which is actually not needed.

This revision is now accepted and ready to land.Oct 15 2020, 9:26 PM
nikic added a subscriber: nikic.Oct 16 2020, 1:04 AM

I've reverted this change because it causes large compile-time regressions without a commensurate change in codegen: https://llvm-compile-time-tracker.com/compare.php?from=cc175c2cc8e638462bab74e0781e06f9b6eb5017&to=905101c36025fe1c8ecdf9a20cd59db036676073&stat=instructions For example, we have a 4% regression on mafft without any codegen change. This either needs a more efficient implementation or a stronger motivation.

I'll try using less powerful proof methods here; isKnownPredicate might be an overkill.

fhahn added inline comments.Oct 16 2020, 12:40 PM
llvm/lib/Analysis/ScalarEvolution.cpp
5705

I started working on using this info for MaxBECount: D89586. Still needs some further work though.

nikic added a comment.Oct 16 2020, 1:03 PM

I've reverted this change again for a couple of reasons:

First, the compile-time impact is now better, but still larger than the change justifies (at least without knowing the broader context): https://llvm-compile-time-tracker.com/compare.php?from=fbd62fe60fb2281ca33da35dc25ca3c87ec0bb51&to=32b72c3165bf65cca2e8e6197b59eb4c4b60392a&stat=instructions The mafft regression is no longer 4%, but it's still 3%.

Less importantly, I don't think the logic here is quite correct. The first issue is the use of the unionWith() API, which probably doesn't do what you want it to do. The way it is used here, it will return the smallest union of Start and End, which is not necessarily the range [Start, End], it can also be Start], [End, if we simplify this down to single-element ranges. Here is a test-case that I think illustrates the issue, if I didn't mess it up:

define i32 @test_02(i32 %start, i32* %p, i32* %q) {
; CHECK-LABEL: 'test_02'
; CHECK-NEXT:  Classifying expressions for: @test_02
; CHECK-NEXT:    %zext = zext i32 %start to i64
; CHECK-NEXT:    --> (zext i32 %start to i64) U: [0,4294967296) S: [0,4294967296)
; CHECK-NEXT:    %shl = shl i64 %zext, 31
; CHECK-NEXT:    --> (2147483648 * (zext i32 %start to i64))<nuw><nsw> U: [0,9223372034707292161) S: [0,9223372034707292161)
; CHECK-NEXT:    %indvars.iv = phi i64 [ %indvars.iv.next, %loop ], [ %shl, %entry ]
; CHECK-NEXT:    --> {(2147483648 * (zext i32 %start to i64))<nuw><nsw>,+,-1}<nsw><%loop> U: [-9223372036854775808,9223372034707292161) S: [-9223372036854775808,9223372034707292161) Exits: -9223372036854775806 LoopDispositions: { %loop: Computable }
; CHECK-NEXT:    %indvars.iv.next = add nsw i64 %indvars.iv, -1
; CHECK-NEXT:    --> {(-1 + (2147483648 * (zext i32 %start to i64))<nuw><nsw>)<nsw>,+,-1}<nw><%loop> U: full-set S: [-1,-9223372036854775806) Exits: -9223372036854775807 LoopDispositions: { %loop: Computable }
; CHECK-NEXT:  Determining loop execution counts for: @test_02
; CHECK-NEXT:  Loop %loop: backedge-taken count is (9223372036854775806 + (2147483648 * (zext i32 %start to i64))<nuw><nsw>)<nuw>
; CHECK-NEXT:  Loop %loop: max backedge-taken count is -2147483650
; CHECK-NEXT:  Loop %loop: Predicated backedge-taken count is (9223372036854775806 + (2147483648 * (zext i32 %start to i64))<nuw><nsw>)<nuw>
; CHECK-NEXT:   Predicates:
; CHECK:       Loop %loop: Trip multiple is 1
;
entry:
  %zext = zext i32 %start to i64
  %shl = shl i64 %zext, 31
  br label %loop

loop:                                             ; preds = %backedge, %entry
  %indvars.iv = phi i64 [ %indvars.iv.next, %loop ], [ %shl, %entry ]
  %cond = icmp eq i64 %indvars.iv, -9223372036854775806
  %indvars.iv.next = add nsw i64 %indvars.iv, -1
  br i1 %cond, label %exit, label %loop

exit:                                             ; preds = %loop
  ret i32 0
}

Note that the signed range of %indvars.iv.next is [-1,-9223372036854775806), while it should be [-9223372036854775807,9223372034707292161), give or take a few off by one errors.

This could be partially avoided by passing a sign preference to unionWith(). However, even with a sign preference, this may return a smaller wrapping range, if that avoid returning a full set. Similarly, the signed/unsigned ranges you request for start/end are also only sign preferences, you may still get back a wrapping range. I haven't followed it through in detail, but I don't think the logic is correct for wrapping ranges.

That's a good point, thanks! I'll see what I can do.

@nikic the output on your test is same with and without the patch. The point of checking of range's start and end is valid, but not in this case.

Scratch that, I used wrong revision.

mkazantsev retitled this revision from [SCEV] Use nw flag and symbolic iteration count to sharpen ranges of AddRecs to [SCEV] Re-enable "Use nw flag and symbolic iteration count to sharpen ranges of AddRecs", attempt 2.
mkazantsev edited the summary of this revision. (Show Details)
mkazantsev added a reviewer: nikic.

@nikic I've checked in your test separately and added handling for it here, to make sure that its behavior does not change. Proof methods is reduced to isKnownPredicateViaConstantRanges which does not create new SCEVs and therefore should not lead to recursive behavior. Could you please confirm that the compile time is OK with this version?

Unfortunately the compile-time issue still exists and is not materially affected by the switch to isKnownPredicateViaConstantRanges(). I've looked at callgrind profiles for partSalignmm.c, which has a 5% regression with this change, and found that the additional 166M instructions are spread roughly as follows:

  • 72M: evaluateAtIteration
  • 35M: isKnownPredicate for the TODO check
  • 32M: Various SCEV expression construction for the TODO check
  • 9M: computeMaxBackedgeTakenCount

Resolving the TODO would be a significant optimization, but unfortunately the largest part of the cost is inside evaluateAtIteration(), which seems like a more fundamental part of this calculation.

Is this patch being reverted again? It also broke the Polly buildbots:

Failed Tests (1):
  Polly :: ScopInfo/NonAffine/non-affine-loop-condition-dependent-access_3.ll
MaskRay reopened this revision.Oct 20 2020, 9:09 PM
MaskRay added 1 blocking reviewer(s): nikic.
MaskRay added a subscriber: MaskRay.

Is this patch being reverted again? It also broke the Polly buildbots:

Failed Tests (1):
  Polly :: ScopInfo/NonAffine/non-affine-loop-condition-dependent-access_3.ll

Verified this did break the Polly test. Reverted. Changed nikic to a blocking reviewer.

Thanks, The buildbot is green again: http://lab.llvm.org:8014/#/builders/5/builds/5

I would work on a fix (an integer range changed), but it would cause failure again if this patch was reverted.

How come that polly buildbot breach is a reason for a revert?

Hi @nikic, evaluateAtIteration is supposed to be completely free in this case, returning simply a+b. Do you have any info regarding what types of start/step are causing this?

MaskRay added a comment.EditedOct 20 2020, 11:26 PM

How come that polly buildbot breach is a reason for a revert?

The patch has been reverted multiple times (it is likely there are other unhandled issues) and the polly test demonstrates an internal difference which did not seem cosmetic. Plus nikic had several unaddressed comments. Hence the revert.

I think if we want to gatekeep new improvements based on their compile time impact, we should document that first in an appropriate documentation.

mkazantsev added a comment.EditedOct 20 2020, 11:52 PM

All comments have been addressed, except for the compile time issue, and we are trying to figure out what case we are dealing with. I have no info about what, how and in which config is being regressed. https://llvm-compile-time-tracker.com/ gives no info in this. This is not a part of LLVM test suite, and the change is not introducing a functional bug.

I would expect that, if the compile time impact is treated as a problem, this problem would be filed on bugs.llvm.com. I will gladly help to resolve this issue once I have info about what, how, in which circumstances and which environment has regressed. I don't think that reverting patches without providing info about problematic tests is a constructive approach that will move us anywhere.

As for Polly, I'm pretty sure it wouldn't be a big problem to update the test. It's not the part of core LLVM, and this patch can only sharpen the analyzes results. If the polly test suffers from more precise SCEV ranges, it's clearly not a SCEV problem.

Fixed Polly test. Looks like a clear improvement to me.

Fixed Polly test. Looks like a clear improvement to me.

I agree, with this patch Polly is able to deduce that the range of k is between zero and 1024*1024 (instead of MAX_INT)

mkazantsev retitled this revision from [SCEV] Re-enable "Use nw flag and symbolic iteration count to sharpen ranges of AddRecs", attempt 2 to [SCEV] Re-enable "Use nw flag and symbolic iteration count to sharpen ranges of AddRecs", attempt 3.

Now isKnownPredicateViaConstantRanges is used for TODO check too, and Step is limited to the constant. If it still is not helping, then I have no slightest clue what it can be and need a reproducer for the problem.

fhahn added a comment.Oct 21 2020, 4:50 AM

Now isKnownPredicateViaConstantRanges is used for TODO check too, and Step is limited to the constant. If it still is not helping, then I have no slightest clue what it can be and need a reproducer for the problem.

From what I understand the revert commit specifically mentioned a large (+3%) regression in compile-time for mafft from CTMark, which is part of llvm-test-suite, built with -O3. And in particular mentioned a larger regression in a specific file of mafft (partSalignmm.c). IIUC it should be possible to reproduce the regression by building that file with -O3 for X86 with release builds (no assertions) with and without the change?

@nikic is there a way to measure the CT impact of this version without actually merging it? If it's still too bad, please provide me something that I can analyze (either C++ test + run instructions or IR test).

Now isKnownPredicateViaConstantRanges is used for TODO check too, and Step is limited to the constant. If it still is not helping, then I have no slightest clue what it can be and need a reproducer for the problem.

From what I understand the revert commit specifically mentioned a large (+3%) regression in compile-time for mafft from CTMark, which is part of llvm-test-suite, built with -O3. And in particular mentioned a larger regression in a specific file of mafft (partSalignmm.c). IIUC it should be possible to reproduce the regression by building that file with -O3 for X86 with release builds (no assertions) with and without the change?

I don't have it in my tree; is it https://github.com/llvm/llvm-test-suite/tree/master/CTMark what you are referencing? I can try this one...

fhahn added a comment.Oct 21 2020, 5:03 AM

Now isKnownPredicateViaConstantRanges is used for TODO check too, and Step is limited to the constant. If it still is not helping, then I have no slightest clue what it can be and need a reproducer for the problem.

From what I understand the revert commit specifically mentioned a large (+3%) regression in compile-time for mafft from CTMark, which is part of llvm-test-suite, built with -O3. And in particular mentioned a larger regression in a specific file of mafft (partSalignmm.c). IIUC it should be possible to reproduce the regression by building that file with -O3 for X86 with release builds (no assertions) with and without the change?

I don't have it in my tree; is it https://github.com/llvm/llvm-test-suite/tree/master/CTMark what you are referencing? I can try this one...

Yes, it's part of the separate llvm-test-suite repo, the sources for mafft specifically are here https://github.com/llvm/llvm-test-suite/tree/master/MultiSource/Benchmarks/mafft. The CTMark directory only contains links to the actual sources, because it is only a small subset targeted at tracking compile-time.

fhahn added a comment.Oct 21 2020, 5:40 AM

It would probably be helpful to add a short snippet about how to build & run CTMark to reproduce the setup on the compile-time tracker website (clone, cmake, build and llvm-lit commands to run). And maybe linking to https://llvm.org/docs/TestSuiteGuide.html (and improving this page :))

nikic added a comment.Oct 21 2020, 3:07 PM

Thank you @fhahn for providing some context. I have updated https://llvm-compile-time-tracker.com/about.php to add some basic reproducing instructions as well.

For reference, here are the results for the current patch, which still has roughly the same characteristics: https://llvm-compile-time-tracker.com/compare.php?from=4b7dafd9046f0ceaadacaafe0ea4a1fb00cf70a5&to=c713c1ef75612a3dae1ef4b99f766e0c7784381f&stat=instructions

Hi @nikic, evaluateAtIteration is supposed to be completely free in this case, returning simply a+b. Do you have any info regarding what types of start/step are causing this?

I added some debug output and compiled the previously mentioned partSalignmm.c test case, with the following output: https://gist.github.com/nikic/40d676b9eeacee3ed38cb896a111e9f1 I couldn't say which ones of these are particularly slow though, and nothing seems particularly out of the ordinary there.

As you say, evaluateAtIteration() is simply a+b in this case, but this is not necessarily free, because potentially expensive expression folding may be involved. That expression folding is uncached, unless the expression doesn't fold.

I think the core issue here is that this optimization is a bit of a layering violation: All the existing getRangeRef() logic just recursively combines ranges of existing SCEV expressions. It does not create any new SCEV expressions. (Apart from calculating the constant max BE count, which is at least a cached calculation.)

I was going to suggest that we could use getSCEVAtScope on the parent loop to at least make this a cached addrec exit value, but that would not cover your motivating example (though probably would cover most other test changes). Unfortunately I'm not familiar enough with SCEV to make a good suggestion here.

What's the impact on O1 and O2 modes?

This comment was removed by mkazantsev.

Let's try to cache symbolic max BE count, it's the only thing that still can be improved here.

Rebased using caching version of BE count computation. Should be cheaper now.

@nikic @fhahn thanks for help with the reproduction! I think I can reproduce the problem locally now. Investigating where the ticks are spent and why.

mkazantsev planned changes to this revision.Oct 23 2020, 12:18 AM

Need to investigate why End computation is so expensive.

mkazantsev added a comment.EditedOct 23 2020, 12:55 AM

The common pattern where the CT drops that I ovserve is following: Start is a constant (in most cases zero), Step is also a constant and MaxBECount is a pretty big expression. The time is spent in attempts to prove nuw/nsw flags on End computation in this case.

mkazantsev edited the summary of this revision. (Show Details)

Here's the study that I made on the example of rdopt_coding_state. Units below are Ir (hopefully it's "instructions retired") on compilation with clang O3.

Almost queries are made for start = 0, constant step & BE count expression that varies from small to really huge expressions.
IR without patch: 220370413
IR with previous version of patch (no SCEV size limits): 224114149 (2% regression).
When I started limiting the size of IRs involved in transform, the dynamics was following:
Limit = 2: 220517261
Limit = 3: 221761742
Limit = 4-6: 222841240

So starting size 4, we have 1% regression on this test (but wide picture might be different). In the patch I set the limit of 6 because it's the minimum value needed to pass all tests in the way as they passed without limits. @nikic could you please evaluate this one?

nikic added a comment.Oct 23 2020, 8:48 AM

Here are the numbers for the new patch: https://llvm-compile-time-tracker.com/compare.php?from=13edfcc97d29574d1d38ded4fa9c2af6e6519472&to=45962d4244d46b1df0618701686a8756c6fb6b78&stat=instructions The geomean regression is now at 0.6% and the mafft regression is at a bit over 2%.

mkazantsev added a comment.EditedOct 25 2020, 9:31 PM

In that case, I see no other way than merge it with threshold 0 for clang. Frankly, I don't care if we lose 2% CT on some corner case tests if we get better SCEV ranges from it. Limiting SCEV sizes is a dirty and unstable hack, so it's better be an "all or nothing" flag.

mkazantsev edited the summary of this revision. (Show Details)

It would be great to get some benchmark data to see runtime performance / improvements as well.

mkazantsev added a comment.EditedOct 26 2020, 8:59 PM

This patch alone does not give any improvements itself. However, it's an important enabling step for fighting incomplete IV widening (see dependent patches), which is a source of performance problems in many various cases, mostly with decrementing loops. I only ran Java benchmarks and see up to 30% boosts in some cases.

@nikic @MaskRay any reasons to keep this out? It's effectively an NFC now. If we want to further measure its impact on compile time & performance of clang benchmarks, it can be a separate change switching the flag value.

lebedev.ri accepted this revision.Oct 27 2020, 5:49 AM

@nikic any reason for this compile time re-measuring dance instead of explaining the onboarding process?
(I.e. telling https://github.com/max-azul to make a github fork off https://github.com/llvm/llvm-project
into https://github.com/max-azul/llvm-project and adding that as a tracked remote for http://llvm-compile-time-tracker.com/ ?
(then, push branch to be measured as perf/<...>, and it will get measured))

Current numbers: (+0.6% geomean)
https://llvm-compile-time-tracker.com/compare.php?from=0ac56e8eaaeb41b59a952fa465872f647c930347&to=b36e80eaf4b4e35c74d814babe722160ef164f9b&stat=instructions

@nikic @MaskRay any reasons to keep this out? It's effectively an NFC now. If we want to further measure its impact on compile time & performance of clang benchmarks, it can be a separate change switching the flag value.

I don't see any reasons, especially if this doesn't change the default just yet.

nikic accepted this revision.Oct 27 2020, 10:23 AM

LG for landing under the flag.

I already landed a couple of SCEV improvements on Sunday that improved compile-time (while also improving nowrap inference), and am still pursuing some more leads, so I hope we can drop the flag in the future.

llvm/lib/Analysis/ScalarEvolution.cpp
5709

You can use the recently added getMinusOne() here.

5738

It might make more sense to use getRangeRef(Start, SignHint) and getRangeRef(End, SignHint) here, rather than going SignHint -> IsSigned -> SignHint.

5747

You can use isWrappedSet() and isSignWrappedSet() here (there are some special cases due to the half-open range).

This revision is now accepted and ready to land.Oct 27 2020, 10:23 AM
mkazantsev added inline comments.Oct 27 2020, 10:06 PM
llvm/lib/Analysis/ScalarEvolution.cpp
5738

Good point!

5747

Will do.