This is an archive of the discontinued LLVM Phabricator instance.

[ScalarEvolution] Fix overflow in computeBECount.
ClosedPublic

Authored by efriedma on Jun 30 2021, 11:17 AM.

Details

Summary

The current implementation of computeBECount doesn't account for the possibility that adding "Stride - 1" to Delta might overflow. For almost all loops, it doesn't, but it's not actually proven anywhere.

To deal with this, use a variety of tricks to try to prove that the addition doesn't overflow. If the proof is impossible, use an alternate sequence which never overflows.

Diff Detail

Event Timeline

efriedma created this revision.Jun 30 2021, 11:17 AM
efriedma requested review of this revision.Jun 30 2021, 11:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 30 2021, 11:17 AM

I'm working on a very similar patch as we speak, so you found a knowledgeable reviewer. Your test diff is a lot smaller than my current one, so maybe you're on to something. Unfortunately, I'm about to leave for the day, so I won't be able to get you detailed feedback until at least tomorrow.

A couple of quick comments.

The code does disallow stride == 0 on the howManyLessThans path. The logic is explained in the !PositiveStride path, but the comments are really confusing, so let me explain again. For a non-positive exit, we require that this be the sole exit and the current condition control that exit. (Indirectly through NoWrap) As a result, if the stride was zero, then the loop must either a) be infinite, or b) exit on the first iteration. The !loopIsFiniteByAssumption check requires a zero stride to exit on the first iteration. (Ah, I think I see the issue you did now. The stride can be zero iff the exit is taken on first iteration. If this was your chain of thought, can you clarify submission comment?)

llvm/lib/Analysis/ScalarEvolution.cpp
11630

Style comment: I'd suggest pulling out a function getUDivCeiling(Delta,Stride,MayOverflow). I found (in my WIP patch) that this really helps readability. Once you have that, you can use early return via tail call to helper for most of the above.

reames added inline comments.Jun 30 2021, 12:10 PM
llvm/lib/Analysis/ScalarEvolution.cpp
11639

While true, this is the same set of facts which allowed you to avoid the max for the start expression. As such, you've already gotten most of the benefit.

11763–11764

This check needs removed. The problem is that this isn't doing what the comment says it is.

The actual condition to prove the backedge is taken on the first iteration is OrigStart Cond OrigRHS.

I have a test case which this check causes us to generate wrong code because Start-Stride < RHS, but Start < RHS does not hold. (Actually, what happened to that test case? I thought I'd committed that, but I don't see it in my commit history. Need to find that.)

When I tried removing this without also fixing the overflow issue, I got bad results. I don't fully remember why atm, will try to reconstruct that and share it.

Your test diff is a lot smaller than my current one, so maybe you're on to something.

I found that I needed a wide variety of heuristics to handle various cases in the regression tests. This is the source of most of the complexity in this patch.

(Ah, I think I see the issue you did now. The stride can be zero iff the exit is taken on first iteration. If this was your chain of thought, can you clarify submission comment?)

I'll clarify the commit message, sure.

llvm/lib/Analysis/ScalarEvolution.cpp
11630

Hmm...

Maybe I'll just make the computation of MayOverflow a lambda.

11763–11764

Okay. We can deal with this separately, I think?

efriedma edited the summary of this revision. (Show Details)Jun 30 2021, 1:06 PM
efriedma edited the summary of this revision. (Show Details)Jul 2 2021, 3:57 PM
efriedma updated this revision to Diff 356305.Jul 2 2021, 5:14 PM

Address review comments.

How much worse does it become if we simply check that start <= MAX_INT - step and bail if we can't prove it?

llvm/include/llvm/Analysis/ScalarEvolution.h
2053

I feel it a bit confusing to pass end before start. Reverse order maybe?

How much worse does it become if we simply check that start <= MAX_INT - step and bail if we can't prove it?

I assume that isn't the expression you meant to write?

Practically, I expect the most important cases to recognize are:

  1. Power-of-two stride.
  2. Post-increment loops: Start == Stride. (This is a special case of Start >= Stride - 1).

I'm not sure how well isLoopEntryGuardedByCond/isKnownViaNonRecursiveReasoning would do if I just throw Start >= Stride - 1 for unsigned, or the signed equivalent, at it. I can try, I guess.

Detailed comments inline. The two common issues running through are:

  • The use of IsSigned confuses me. We should be computing an unsigned expression at this point, so why does the signedness of the end max expression matter? I think I finally figured it out but a change to the code structure and comments would really help. My current understanding is that IsSigned == true implies (End >=s Start) and !IsSigned implies (End >=u Start) as preconditions of the method. I'd just state that explicitly in comment form.
  • We can have an exit which is dynamically dead, produces poison, and overflows in the computation. See the long explanation in the first comment. As far as I can tell, this only applies to the first case.

Sorry this took so long.

llvm/include/llvm/Analysis/ScalarEvolution.h
2041

... if *this* exit is taken

llvm/lib/Analysis/ScalarEvolution.cpp
11548

I'm not sure this holds, let me walk you through why.

There is no guarantee that this exit must be taken. Instead, some other exit could be taken.

The cornercase here is when we have an exit (which won't be taken) whose no-wrap IV does in fact wrap on the last iteration (producing poison), but whose value we don't branch on (e.g. no UB) because an earlier exit is taken instead.

I believe this means the minimum BTC we can report for the (untaken) exit is 2^BW/Stride. (e.g. so it must be >= the exit count of the taken exit)

I believe that invalidates your inequality, and thus your result.

We can patch this in a couple of ways:

  • If ControlsExit is true, then we know this condition is branched on by the sole exit. Thus, no other exit, and logic holds.
  • If we know this condition is branched on by any latch dominating exit, then we know that the execution of iteration 2^BW/Stride must execute UB if no earlier exit is taken. Thus, either this exit is taken on a previous iteration, or another exit is taken up to that iteration. (I got stuck here, but it really seems like we should be able to do case analysis?)

Aside: If we already know the other exit count is small, it's a real shame we can't use that here... Maybe we can do a two step analysis and simplify using min approach? Just a thought.

11563

Aside: None of the logic below appears to rely on no-wrap facts on the IV. Note sure if it's worth it, but should we reflect that in code structure somehow?

11568

Code structure wise, I request you pass in the BaseStart explicitly. This parsing of expressions is confusing.

11582

The last line of this comment confuses me. Why is this relevant?

Otherwise, the reasoning in this case does appear to hold.

11586

According to my brute force program, your reasoning does hold. I find the isSigned usage very very confusing. Please write out the implied predicates.

11601

This appears to hold, though I didn't follow the logic TBH. I resorted to writing a small program to brute force the i8 space. :)

Please state the !IsSigned precondition in the comment in End >=u Start form.

11607

Again, please state the precondition that IsSigned implies. Your code is correct, but the comment is very hard to follow.

efriedma added inline comments.Jul 7 2021, 3:15 PM
llvm/lib/Analysis/ScalarEvolution.cpp
11548

If ControlsExit is false, and stride isn't one, it's hard to get here. I think the only way is if canIVOverflowOnLT returns false, which I think is enough to prevent any trouble here? Maybe worth going into in the comment, though.

11601

It's basically just proving "end + (stride-1)" doesn't overflow. I'll try to clarify the comment.

efriedma updated this revision to Diff 357123.Jul 7 2021, 8:24 PM

Spent some more time on this.

I discovered that some of the checks were not really necessary. I think I must have added a check, then didn't realize it became unnecessary as I added other checks. So it's down to three relatively straightforward checks. We can re-add the other checks if we discover practical cases where they're necessary.

I rewrote the power-of-two proof; hopefully makes it more clear that it's actually correct. And I clarified the signed vs. unsigned on the other checks.

Separated out getUDivCeilSCEV() as a separate function. Makes this patch a little easier to read, and hopefully the general utility is useful for other cases in the future.

reames added inline comments.Jul 8 2021, 9:46 AM
llvm/lib/Analysis/ScalarEvolution.cpp
11548

I believe you are correct, though I'll note I really dislike code whose correctness depends on understanding a non-trivial invariant through a large batch of complicated code. Please try to comment this clearly!

reames accepted this revision.Jul 8 2021, 9:55 AM

LGTM

Thank you for pushing this through.

I'll have a couple of small followups for review to tweak comments and code structure, but what you've got is a huge improvement and I want to get this in. It unblocks several other changes which improve opt quality.

This revision is now accepted and ready to land.Jul 8 2021, 9:55 AM
This revision was landed with ongoing or failed builds.Jul 8 2021, 10:10 AM
This revision was automatically updated to reflect the committed changes.

I’ve bisected a miscompilation on aarch64 down to this commit. I’ll follow up with a repro later…

uabelho added a subscriber: uabelho.Jul 9 2021, 1:06 AM
shchenz added a subscriber: shchenz.Jul 9 2021, 3:34 AM

Yeah, we also found issue on PowerPC target. Reducing a small case:

__attribute__((noinline))int foo(int *arr)
{
  int prime = 0, k = 0;

  for(int i = 0; i < 8191; i++) {
    prime = i + i + 1; 
    k = i + prime;
    for (int j = k; j < 8191; j += prime)
      arr[j] = 0;
  }
  return 0;
}
int main(void)
{
  int arr[8191];

  foo (arr);
  return 0;
}
$ clang -O2 t.cpp ; ./a.out
Segmentation fault (core dumped)

-fno-vectorize can make above case pass. Initial investigation shows that after this patch, the loop execution count for the vectorized loop now can be wrap, so generate unexpected behaviour.

Hi,

We see a miscompile as well.
It goes wrong for my out of tree target so I don't have a reproducer for an in tree target but maybe some info may help anyway.
I have a function containing the following two nested loops:

void foo()
{
    #define MAX 2048

    static unsigned a[MAX];

    unsigned count = 2;

    for (unsigned i = 0; i < MAX; ++i)
    {
        if (a[i] == 0)
        {
            unsigned p = i + i + 3;
            count++;
            for (unsigned j = i; j < MAX; j += p)
            {
                a[j] = 1;
            }
        }
    }
    assert(count == 565);
}

With some debug printouts from the compiler I see that before we got the following from SCEV for the inner loop:

Exit count: ({2047,+,-1}<%for.body> /u {3,+,2}<nuw><nsw><%for.body>)
Trip count: (1 + ({2047,+,-1}<%for.body> /u {3,+,2}<nuw><nsw><%for.body>))<nuw><nsw>
Trip count bound: 16 bits, [1,684)

and with the patch we get

Exit count: ((((-1 * (1 umin {2045,+,-3}<%for.body>))<nuw><nsw> + {2045,+,-3}<%for.body>) /u {3,+,2}<nuw><nsw><%for.body>) + (1 umin {2045,+,-3}<%for.body>))
Trip count: (1 + (((-1 * (1 umin {2045,+,-3}<%for.body>))<nuw><nsw> + {2045,+,-3}<%for.body>) /u {3,+,2}<nuw><nsw><%for.body>) + (1 umin {2045,+,-3}<%for.body>))
Trip count bound: 16 bits, [1,21848)

With some printouts in the outer and inner loop I also see that when i is 682, then something goes wrong for the inner loop and it executes for the following (way too many) j values:

outer. i: 682
  inner. j: 682
  inner. j: 2049
  inner. j: 3416
  inner. j: 4783
  inner. j: 6150
  [...]
  inner. j: 63564
  inner. j: 64931
  inner. j: 762
outer. i: 683

The case I'm observing is within this translation unit: https://martin.st/temp/gsmdec-preproc.c

There's some difference visible in the assembly generated by clang -target aarch64-linux-gnu -c -O3 gsmdec-preproc.c, I haven't exactly pinpointed what part of it is that breaks.

If you have access to aarch64 linux, the issue can be reproduced at runtime with these commands:

git clone git://source.ffmpeg.org/ffmpeg
cd ffmpeg
./configure --cc=clang --samples=/path/to/empty/dir
make fate-rsync # sync data samples for running tests, into the path specified above
make -j$(nproc) fate-g723_1-dec-1

The error in this case is in libavformat/gsmdec.o. There's also a couple other miscompilations in there, visible in make fate-sub-vplayer and make fate-wmavoice-7k, but I haven't pinpointed which translation unit those errors stem from.

lkail added a subscriber: lkail.Jul 9 2021, 5:12 AM

I think the miscompile is the isLoopEntryGuardedByCond issue noted in https://reviews.llvm.org/D105216#2851010 . Looking.

efriedma reopened this revision.Jul 9 2021, 12:15 PM
This revision is now accepted and ready to land.Jul 9 2021, 12:15 PM
efriedma updated this revision to Diff 357646.Jul 9 2021, 2:56 PM
efriedma edited the summary of this revision. (Show Details)

Updated to fix the miscompile. This unfortunately involves yet another complicated proof...

efriedma updated this revision to Diff 357674.Jul 9 2021, 5:17 PM

More refactoring. Unbreak howManyGreaterThans.

efriedma requested review of this revision.Jul 12 2021, 1:14 PM

@reames Please take another look before I merge; I ended up substantially refactoring the patch to deal with the miscompile.

I've glanced at this, and the refactor/new code seems very complicated to me. My LGTM no longer applies. I will try to take a closer look at the reasoning in the newly added parts, but I *strongly* request you look for ways to factor the code so that the reasoning is modular. If we can split this into two or more patches, that would be my strong preference.

but I *strongly* request you look for ways to factor the code so that the reasoning is modular

I really don't see any way to unify the two different proof strategies. The proofs are trying to reason about overflow on completely different operations.

If we can split this into two or more patches, that would be my strong preference.

I can split off the introduction of getUDivCeilSCEV() into a separate patch, and I can introduce the new isLoopEntryGuardedByCond(L, CondGT, Start, StartMinusStride) check in a separate patch. I'm not sure the result is easier to review, but I guess it would make it easier to figure out regressions.

Posted D105865 with just cleanup. Then I'll post a followup with the missing isLoopEntryGuardedByCond() checks. Then I'll rebase this (so at that point, we'll just be adding the MayAddOverflow checks). Let me know if that makes sense.

but I *strongly* request you look for ways to factor the code so that the reasoning is modular

I really don't see any way to unify the two different proof strategies. The proofs are trying to reason about overflow on completely different operations.

So, I think this might be part of our problem. The code and comments to date have not made your last sentence here obvious to me as the reviewer.

To be clear, I have not seen a concise description of what the bug in the original patch *was*. That makes it hard to review a patch which is supposed to fix it.

If we can split this into two or more patches, that would be my strong preference.

I can split off the introduction of getUDivCeilSCEV() into a separate patch, and I can introduce the new isLoopEntryGuardedByCond(L, CondGT, Start, StartMinusStride) check in a separate patch. I'm not sure the result is easier to review, but I guess it would make it easier to figure out regressions.

Splitting patches to isolate regressions is good!

I went ahead and landed e4b43973. This handles only the constant bounds cases using the new ceiling operation. This is an easy sub-case because a) stride is known positive, b) it'll constant fold thus not changing the result at all, and c) it doesn't appear to be influenced by your most recent proof. If you rebase over that, it should help a bit to reduce complexity.

but I *strongly* request you look for ways to factor the code so that the reasoning is modular

I really don't see any way to unify the two different proof strategies. The proofs are trying to reason about overflow on completely different operations.

So, I think this might be part of our problem. The code and comments to date have not made your last sentence here obvious to me as the reviewer.

To be clear, I have not seen a concise description of what the bug in the original patch *was*. That makes it hard to review a patch which is supposed to fix it.

Oh, sorry, I thought you understood the issue since you sort of pointed it out originally. The issue is the case where End < Start.

If we prove RHS >= Start, this is impossible. If we use End = max(RHS, Start), this is also impossible. If we only proved RHS > Start - Stride, it's possible that Start - Stride < End < Start. In this case, the backedge-taken count should be zero. But RHS - Start overflows, so getUDivCeilSCEV treats it as a very large positive number.

So we never want to use getUDivCeilSCEV if we're optimizing an expression by proving RHS > Start - Stride. As I outline in the proof in this patch, in this case, we can show that the backedge-taken count is actually ((RHS - 1) - (Start - Stride)) /u Stride, which is equivalent to the expression we would produce without this patch.

So we never want to use getUDivCeilSCEV if we're optimizing an expression by proving RHS > Start - Stride. As I outline in the proof in this patch, in this case, we can show that the backedge-taken count is actually ((RHS - 1) - (Start - Stride)) /u Stride, which is equivalent to the expression we would produce without this patch.

It's this last bit which is non-obvious from the patch. You're using an entirely different formula for the BTC in this case, you should say so in the comments!

Eli, I think I just had an "aha!" moment. Check my reasoning here, but I think we can greatly simplify the patch structure.

The most recent case you added, appears to simply be a specialization of ceiling(max(A,B)-A, C) where max(A,B) > A - C is known. I don't think we actually need to know anything about where A, B, and C come from for your lowering to be valid. I think we can also generalize this slightly into two pieces:

  1. A general ceiling lowering when ceiling(X-A, C) where X > A - C
  2. A pushdown rule for max(X,Y) > Z as X > Z && Y > Z ==> max(X,Y) > Z

This really tempts me to go back to the first version of your patch, and add another specialization of how to generate a faster ceiling.

What do you think?

llvm/lib/Analysis/ScalarEvolution.cpp
11845

I split off this part of the change into 087310c71e with a bit of restructuring.

The general formula for the backedge-taken count can be written, in arbitrary-precision arithmetic, as something like floor(max(End - Start + Stride - 1, 0) / Stride), I think. We could define a function to compute that, I guess, separate from howManyLessThans, and optimize based on that.

Granted, that's basically just taking this patch, and shoving the code into a different function. But it might be easier to reason about future changes.

Eli,

I went ahead and posted https://reviews.llvm.org/D105942 which splits this one step further, but in the process, I also convinced myself of the correctness of your revised patch. I'd be fine either landing the one I posted, then rebasing this into a smaller patch, or simply landing this one after rebase, and then posting a cleanup or two afterwards. Up to you.

I still like the idea of having all of this inside getUDivCeiling if possible, but doing that as cleanup once this lands seems reasonable.

Eli, I think we're out of things which can be reasonable split here. If you want to rebase your patch - preferably the form close to what you committed before - I think we're ready for the last big piece to go in. We probably should wait a day or two to make sure the previous bits stick, but assuming they do, I don't see any reason to delay this further.

efriedma updated this revision to Diff 359105.Jul 15 2021, 1:22 PM
efriedma edited the summary of this revision. (Show Details)

Rebased.

reames accepted this revision.Jul 15 2021, 2:17 PM

LGTM

And thank you!

This revision is now accepted and ready to land.Jul 15 2021, 2:17 PM
This revision was landed with ongoing or failed builds.Jul 16 2021, 4:15 PM
This revision was automatically updated to reflect the committed changes.