
[LoopDeletion] Break backedge if we can prove that the loop is exited on 1st iteration (try 3)
ClosedPublic

Authored by mkazantsev on May 17 2021, 4:55 AM.

Details

Summary

This patch handles one particular case of one-iteration loops for which SCEV
cannot straightforwardly prove BECount = 1. The idea of the optimization is to
symbolically execute conditional branches on the 1st iteration, moving in topological
order and only visiting blocks that may be reached on the first iteration. If we find out
that we never reach the header via the latch, then the backedge can be broken.
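The summary above can be sketched as a toy model (this is a hypothetical, simplified illustration, not the actual LoopDeletion implementation; all block names and the `KnownCondOnFirstIter` field are invented for the example):

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model of the transform: walk loop blocks in topological order,
// visiting only blocks reachable on the first iteration, and follow only
// the known-taken edge of branches whose condition is known on that
// iteration. If the latch->header edge is never marked live, the
// backedge is provably not taken on the first iteration.
struct Block {
  std::vector<std::string> Succs; // successor block names
  // -1 = unknown condition, 0 = branch to Succs[1], 1 = branch to Succs[0]
  int KnownCondOnFirstIter = -1;
};

bool backedgeDeadOnFirstIteration(const std::map<std::string, Block> &CFG,
                                  const std::string &Header,
                                  const std::string &Latch) {
  std::set<std::string> Live{Header};
  std::set<std::pair<std::string, std::string>> LiveEdges;
  // For this toy, block names are chosen so that the map's lexicographic
  // iteration order coincides with topological order.
  for (const auto &[Name, B] : CFG) {
    if (!Live.count(Name))
      continue; // not reachable on the first iteration
    if (B.KnownCondOnFirstIter == -1) {
      // Condition unknown: all successors may be reached.
      for (const auto &S : B.Succs) {
        Live.insert(S);
        LiveEdges.insert({Name, S});
      }
    } else {
      // Condition known on iteration 1: only one edge is live.
      const std::string &S = B.Succs[B.KnownCondOnFirstIter == 1 ? 0 : 1];
      Live.insert(S);
      LiveEdges.insert({Name, S});
    }
  }
  return !LiveEdges.count({Latch, Header});
}
```

If the latch's condition is known to take the exit edge on iteration 1, the function reports the backedge dead; if the condition is unknown, it conservatively keeps the backedge.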

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.

Added one more early bailout:

commit 7d418dadf6b1e6fd9bcccf7c5b5e1db74992ee70 (HEAD -> main, origin/main)
Author: Max Kazantsev <mkazantsev@azul.com>
Date:   Thu May 27 15:18:30 2021 +0700

    [NFCI][LoopDeletion] Do not call complex analysis for known non-zero BTC

@nikic could you please run this one too? I think it can be useful in real cases with countable loops.

@mkazantsev
This change is breaking a couple of PowerPC buildbots.
https://lab.llvm.org/buildbot/#/builders/105
https://lab.llvm.org/buildbot/#/builders/100
Since the bots have been broken for more than 24 hours I ask that you please revert the change and then recommit once the issues have been addressed.
If you have questions about reproducing the issue or need help with it please let me know.

(snip) I ask that you please revert the change and then recommit once the issues have been addressed.

I second this request. I also would like you to re-review your changes before landing them again.

This commit also introduced a regression in the s390x-lnt buildbot due to an apparent miscompilation of the test-suite::gs.test test case:
https://lab.llvm.org/buildbot/#/builders/45/builds/3264

@mkazantsev
I have reverted these changes (and subsequent related NFC patches) with the following patches:

commit 0159652058ca555b05db6a209fe9cf660c3bf8e6 (HEAD -> main, origin/main, origin/HEAD)
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:35:25 2021 -0500

    Revert "Return "[LoopDeletion] Break backedge if we can prove that the loop is exited on 1st iteration" (try 2)"
    
    This reverts commit be1a23203b1de655b8c7dac7549818d975a0cbbf.

commit 24bd657202379595b514ee17241f019294e86bc9
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:35:12 2021 -0500

    Revert "[NFCI][LoopDeletion] Only query SCEV about loop successor if another successor is also in loop"
    
    This reverts commit b0b2bf3b5da950679db1431aae431a6dedea2245.

commit fd553312031c7d8085fa6ee0755a957796eadf05
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:34:02 2021 -0500

    Revert "[NFC] Formatting fix"
    
    This reverts commit 59d938e649e62db0cef4903d495e838fbc6a6eb8.

commit 807fc7cdc97fc172b4967707a7718e7333351bff
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:33:45 2021 -0500

    Revert "[NFC] Reuse existing variables instead of re-requesting successors"
    
    This reverts commit c467585682dcdda75e645ef3ab47c8b48440db12.

commit dd226803c220f02a5987f0ee9f9ac3ffe2b35c09
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:17:46 2021 -0500

    Revert "[NFCI][LoopDeletion] Do not call complex analysis for known non-zero BTC"
    
    This reverts commit 7d418dadf6b1e6fd9bcccf7c5b5e1db74992ee70.

When you are ready to commit I would be happy to test this patch on Power PC if you would like. It may help to avoid more noise on the buildbots.
Also, if you need help reproducing or investigating failures on Power PC I can help with that as well.

Thanks for the revert! I'll look into what's happening.

Reviewing the failures, I see several issues that might be relevant:

https://lab.llvm.org/buildbot/#/builders/105

FAIL: MultiSource/Benchmarks/MallocBench/gs/gs.execution_time: is this about compile time again? I don't see any indication of a miscompile. If so, I see no alternative to putting this transform under an option.

https://lab.llvm.org/buildbot/#/builders/100

Same.

https://lab.llvm.org/buildbot/#/builders/45/builds/3264

That might be interesting. Will try to reproduce that.

mkazantsev reopened this revision. Sun, May 30, 9:23 PM
This revision is now accepted and ready to land. Sun, May 30, 9:23 PM
mkazantsev planned changes to this revision. Sun, May 30, 9:24 PM

The last failure also looks like a timeout; I don't see any indication of a miscompile:

FAIL: test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test (73 of 2021)
******************** TEST 'test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test' FAILED ********************

/home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/tools/timeit-target --limit-core 0 --limit-cpu 7200 --timeout 7200 --limit-file-size 104857600 --limit-rss-size 838860800 --append-exitstatus --redirect-output /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/Output/gs.test.out --redirect-input /dev/null --chdir /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs --summary /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/Output/gs.test.time /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs -DNODISPLAY INPUT/large.ps
cd /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs ; /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/tools/fpcmp-target /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/Output/gs.test.out gs.reference_output

/home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/tools/fpcmp-target: FP Comparison failed, not a numeric difference between '
' and 'g'

@uweigand could you please confirm that output like this could have been a result of a timeout?

So far, it looks like we still have a compile time issue. I see no other solution than making this transform optional.

So far it seems that all problems are caused by the compile time increase. There is no room left for shrinking the scope, so I'm returning this with an option (turned off by default).

This revision is now accepted and ready to land. Sun, May 30, 11:10 PM
mkazantsev requested review of this revision. Sun, May 30, 11:10 PM

What are the latest compile time data for this patch?

lebedev.ri requested changes to this revision. Mon, May 31, 12:13 AM

So far it seems that all problems are caused by the compile time increase. There is no room left for shrinking the scope, so I'm returning this with an option (turned off by default).

I don't believe that
https://lab.llvm.org/buildbot/#/builders/105/builds/10339
(FAIL: MultiSource/Benchmarks/MallocBench/gs/gs.execution_time) is a compile-time issue.

From https://lab.llvm.org/buildbot/api/v2/logs/5749632/raw :

---------------------------------------------------------------
>>> ========= '/home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs' Program
---------------------------------------------------------------

TEST-PASS: compile /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs
TEST-RESULT-compile-success: pass
TEST-RESULT-compile-hash: 71676ae82ccfc3ce19d1a06b5c542f59
TEST-RESULT-compile-time: user 21.431000
TEST-RESULT-compile-real-time: real 30.486300

TEST-FAIL: exec /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs
TEST-RESULT-exec-time: user       0.0000
TEST-RESULT-exec-real-time: real       0.0204

Compare that with a recent run:
https://lab.llvm.org/buildbot/#/builders/105/builds/10409

---------------------------------------------------------------
>>> ========= '/home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs' Program
---------------------------------------------------------------

TEST-PASS: compile /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs
TEST-RESULT-compile-success: pass
TEST-RESULT-compile-hash: 414dbc57ddbb1e8c60a37884db3f706c
TEST-RESULT-compile-time: user 40.363300
TEST-RESULT-compile-real-time: real 64.390900

TEST-PASS: exec /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs
TEST-RESULT-exec-success: pass
TEST-RESULT-exec-time: user       0.0761
TEST-RESULT-exec-real-time: real       0.1529

It really does look like that program was miscompiled (or UB in it was exploited), and it produced an unexpected result.

This revision now requires changes to proceed. Mon, May 31, 12:13 AM

The last failure also looks like a timeout; I don't see any indication of a miscompile:

FAIL: test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test (73 of 2021)
******************** TEST 'test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test' FAILED ********************

/home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/tools/timeit-target --limit-core 0 --limit-cpu 7200 --timeout 7200 --limit-file-size 104857600 --limit-rss-size 838860800 --append-exitstatus --redirect-output /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/Output/gs.test.out --redirect-input /dev/null --chdir /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs --summary /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/Output/gs.test.time /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs -DNODISPLAY INPUT/large.ps
cd /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs ; /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/tools/fpcmp-target /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/Output/gs.test.out gs.reference_output

/home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/tools/fpcmp-target: FP Comparison failed, not a numeric difference between '
' and 'g'

@uweigand could you please confirm that output like this could have been a result of a timeout?

No, there is no timeout - execution of the test binary fails with an internal error:

Reading ghost.ps... 
Interp returns -16
ostack:
0x103e848: 0x02 dict -ewr------ 0x0000 0x01045b58
estack:
0x103de38: 0x09 oper x--------- 0x0000 0x010069c0
0x103de48: 0x03 file xe-r------ 0x0003 0x0105f708
exit 1

In the "good" case (with the patch reverted) we instead get the expected output:

Reading ghost.ps... ghost.ps read.
exit 0

Looking at the assembly, only a single function shows any difference between the two versions: "interp" from MultiSource/Benchmarks/MallocBench/gs/interp.c.

mkazantsev planned changes to this revision. Mon, May 31, 8:28 AM

Oh I see... That's nasty. In this case, could you please help me reproduce it? I wasn't able to find the sources of this test in the LLVM repository. An IR dump before Loop Deletion would work too.

What are the latest compile time data for this patch?

There have been several separate follow-ups to address the compile time, but I believe we never measured it all together.

Oh I see... That's nasty. In this case, could you please help me reproduce it? I wasn't able to find the sources of this test in the LLVM repository. An IR dump before Loop Deletion would work too.

This is part of the llvm-test-suite repository. The code for the interp function is here:
https://github.com/llvm/llvm-test-suite/blob/main/MultiSource/Benchmarks/MallocBench/gs/interp.c#L180
I notice that this has a weird "loop" constructed from a switch statement and various gotos, maybe this isn't handled correctly?

Thanks, I was able to reproduce the failure locally. Investigating.

mkazantsev retitled this revision from [LoopDeletion] Break backedge if we can prove that the loop is exited on 1st iteration to [LoopDeletion] Break backedge if we can prove that the loop is exited on 1st iteration (try 3).

Root cause: another case of irreducible CFG that I didn't filter out. As a result, RPOT makes no sense. Checked in a failing unit test as https://reviews.llvm.org/rG4ef47eaed952. Added an assert detecting this.

Turning off the transform for irreducible CFG entirely. The potential win isn't worth the problems it is causing. I should have followed David's advice in the first place.

I still want to have an option (on by default now) to be able to switch off this transform if required.

What are the latest compile time data for this patch?

There have been several separate follow-ups to address the compile time, but I believe we never measured it all together.

Could we get new data? @nikic? I am not sure whether the follow-ups were reverted as well, so if the follow-up patches can be recommitted as NFC(I) patches, that would be great.

Can this patch optimize code from comment #4 in https://bugs.llvm.org/show_bug.cgi?id=47491 ?

Btw, any comment?

What are the latest compile time data for this patch?

There have been several separate follow-ups to address the compile time, but I believe we never measured it all together.

Could we get new data? @nikic? I am not sure whether the follow-ups were reverted as well, so if the follow-up patches can be recommitted as NFC(I) patches, that would be great.

The followups were all reverted and integrated into this one. I see no point in merging them separately, unless we want more pain with reverts (if it is needed again). They are just part of the algorithm.

Can this patch optimize code from comment #4 in https://bugs.llvm.org/show_bug.cgi?id=47491 ?

Btw, any comment?

Let me check.

What are the latest compile time data for this patch?

There have been several separate follow-ups to address the compile time, but I believe we never measured it all together.

Could we get new data? @nikic? I am not sure whether the follow-ups were reverted as well, so if the follow-up patches can be recommitted as NFC(I) patches, that would be great.

The followups were all reverted and integrated into this one. I see no point in merging them separately, unless we want more pain with reverts (if it is needed again). They are just part of the algorithm.

Nice, so only this patch needs to be tested for compile time. If it's less than +0.5%, it should be acceptable. There have been many compile-time improvements, so I do think there is some room for less cheap optimizations.

Can this patch optimize code from comment #4 in https://bugs.llvm.org/show_bug.cgi?id=47491 ?

Btw, any comment?

Let me check.

It doesn't, and it shouldn't. There are no one-iteration loops here, and the existing loop gets fully unrolled.

What are the latest compile time data for this patch?

There have been several separate follow-ups to address the compile time, but I believe we never measured it all together.

Could we get new data? @nikic? I am not sure whether the follow-ups were reverted as well, so if the follow-up patches can be recommitted as NFC(I) patches, that would be great.

The followups were all reverted and integrated into this one. I see no point in merging them separately, unless we want more pain with reverts (if it is needed again). They are just part of the algorithm.

Nice, so only this patch needs to be tested for compile time. If it's less than +0.5%, it should be acceptable. There have been many compile-time improvements, so I do think there is some room for less cheap optimizations.

Yes, only this patch. The NFC(I)s were basically this:

auto *BTC = SE.getBackedgeTakenCount(L);
if (!isa<SCEVCouldNotCompute>(BTC) && SE.isKnownNonZero(BTC))
  return LoopDeletionResult::Unmodified;
if (!BTC->isZero() && !canProveExitOnFirstIteration(L, DT, SE, LI))
  return LoopDeletionResult::Unmodified;

and

if (L->contains(IfFalse) && SE.isKnownPredicate(Pred, LHSS, RHSS))  // contains check is NFCI
  MarkLiveEdge(BB, IfTrue);
else if (L->contains(IfTrue) &&  // contains check is NFCI
         SE.isKnownPredicate(ICmpInst::getInversePredicate(Pred), LHSS,
                             RHSS))
  MarkLiveEdge(BB, IfFalse);
else
  MarkAllSuccessorsLive(BB);

No need to tear them off, they fit well.

All failures so far were related to irreducible CFGs, which made RPOT meaningless; that case is now completely ruled out. @stefanp could you please check that this version works OK with PowerPC?

nikic added a comment. Sun, Jun 6, 3:02 PM

Here are compile-time numbers for the latest version of the patch: https://llvm-compile-time-tracker.com/compare.php?from=1ffa6499ea3b7fde687666a45e89909fbf72a450&to=d51dd5be5d0e67015240cfb6eedc14a2f3dfc3a7&stat=instructions Unfortunately I didn't have a chance to profile where the remaining cost is yet.

The new results look fine.

mkazantsev added a comment (edited). Sun, Jun 6, 8:35 PM

Can we go ahead and merge it, or do you want to further reduce the CT impact?
AFAIK the failures on PowerPC are now gone (correct me if I'm wrong).

nikic added a comment. Mon, Jun 7, 1:31 PM

After looking through the tests, why does this need to use SCEV at all? As far as I can tell, simply using InstSimplify would be sufficient to prove all these cases (or just ConstFold for that matter). Without isKnownPredicateAt(), where symbolic reasoning is necessary, SCEV doesn't seem to provide much value here, or at least it's not obvious from the tests. (Of course, there will be cases where it can do more, but the converse also holds -- in particular, InstSimplify/ConstFold support arbitrary instructions, no need to special-case a limited set of them.)

SCEV is much more powerful in terms of simplification. Imagine a situation where some phi takes the value a + b + c on the 1st iteration (and unknown values on all others), then gets subtracted by a, then by b, and then by c, and then the result is compared against zero. If we want InstSimplify to do something like this, we will need to re-implement recursive expression aggregation and simplification; in summary, we will need to re-invent SCEV.
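To make that cancellation concrete (a toy sketch with invented names, not SCEV's actual representation): once the phi's first-iteration value is tracked symbolically as a sum of named terms, the three subtractions cancel it to zero regardless of the runtime values, which is the kind of multi-instruction reasoning a local pattern matcher cannot do:

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy symbolic expression: a sum of named variables with integer
// coefficients. Very loosely mimics how SCEV cancels terms without
// knowing runtime values, even when the adds and subtracts are spread
// across several instructions.
struct SymExpr {
  std::map<std::string, int> Coeffs;
  SymExpr add(const std::string &V) const {
    SymExpr R = *this;
    if (++R.Coeffs[V] == 0)
      R.Coeffs.erase(V); // drop cancelled terms
    return R;
  }
  SymExpr sub(const std::string &V) const {
    SymExpr R = *this;
    if (--R.Coeffs[V] == 0)
      R.Coeffs.erase(V); // drop cancelled terms
    return R;
  }
  // An expression is provably zero iff no terms remain.
  bool isZero() const { return Coeffs.empty(); }
};
```

With this, the first-iteration value `a + b + c` minus `a`, `b`, and `c` folds to zero symbolically, so a comparison against zero on the first iteration is decidable.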

I'm also sceptical about the approach we are following (crippling the opts to save compile time). I'd prefer it to be isKnownPredicateAt, maybe under a flag, to use as much power as it can.

If SCEV's CT impact is still unacceptable, we can cripple it even further to use isKnownViaNonRecursiveReasoning, but at that point we can basically just go with -O0 and stay happy about the results.

nikic added a comment. Sat, Jun 12, 3:12 AM

SCEV is much more powerful in terms of simplification. Imagine a situation where some phi takes the value a + b + c on the 1st iteration (and unknown values on all others), then gets subtracted by a, then by b, and then by c, and then the result is compared against zero. If we want InstSimplify to do something like this, we will need to re-implement recursive expression aggregation and simplification; in summary, we will need to re-invent SCEV.

I'm also sceptical about the approach we are following (crippling the opts to save compile time). I'd prefer it to be isKnownPredicateAt, maybe under a flag, to use as much power as it can.

If SCEV's CT impact is still unacceptable, we can cripple it even further to use isKnownViaNonRecursiveReasoning, but at that point we can basically just go with -O0 and stay happy about the results.

Let me start with a bit of philosophy: This patch falls into the very unfortunate category where a) it solves some specific problem you have, b) outside that, it applies very rarely, but c) has a compile-time impact on everyone. It's easy to justify an average-case compile-time impact that shows an average-case improvement. It's not so easy to justify an average-case compile-time impact that only shows an exceptional-case improvement.

To provide some numbers: After adding a loop-delete.NumBackedgesBroken stat and running test-suite at O3, the stat increases from 94 to 96. That is, in the entirety of test-suite, we break two additional back edges. Of course, test-suite is skewed in the kind of IR it contains (being C/C++), but this gives us a general idea.

I do think it's okay to solve specific optimization problems that do not have any wider applicability, but we should try to minimize the impact they have on everyone else. I do understand that making this SCEV based can in principle handle more cases, but as always, there are tradeoffs. I gave an InstSimplify based implementation a try (quick hack: https://github.com/llvm/llvm-project/commit/5aa9bbc50a1fed1364caaa7edc9273fb38de990d), and this had the same practical optimization power on test-suite (2 additional backedges broken) as the SCEV-based implementation. Only difference is that the SCEV-based implementation caused codegen changes on two additional test cases without breaking a backedge, due to some kind of SCEV caching difference. (Compile-time for that version is https://llvm-compile-time-tracker.com/compare.php?from=8fdd7c2ff16da370e28ef1b22e400d57a541484f&to=5aa9bbc50a1fed1364caaa7edc9273fb38de990d&stat=instructions)

Taking a step back here from the technical details, I want to make an observation: The ability to break a not-taken loop backedge is a core property of "sparse conditional" style algorithms. They solve this problem as a special case of a more general data flow problem. I checked, and both SCCP and NewGVN are capable of breaking the backedge on all cases in eval_first_iteration.ll that get broken by this implementation. Neither handles the "loop guard" cases (which needed isKnownPredicateAt), but I believe both could be extended to handle them -- they have the necessary ingredients to do so efficiently via PredicateInfo.

Assuming that those test cases are actually representative, could it be that we're trying to solve a phase ordering problem by introducing a new optimization? For example, there's this FIXME that suggests that SCCP should be performed before rather than after loop optimizations. Maybe just doing that move would already address the original motivation? It would be helpful to have the unreduced motivating test-case, as it's hard to evaluate phase ordering otherwise.

Thanks Nikita, I think I have some arguments for this philosophical debate. :)

This patch falls into the very unfortunate category where a) it solves some specific problem you have, b) outside that, it applies very rarely

I'd argue with that. Just from how many failures we've seen when irreducible CFG wasn't handled correctly, I conclude that there have been many attempts to apply this transform even in these weird cases. Provided how rare those CFGs are, I can speculate that there have also been many cases where it did apply on regular CFGs too. Even in the test suite, there were 2 loops of this kind, and I'd have expected it to not have such patterns at all.

Either way, "rare or frequent" is a very speculative category. You can say "meh, it's just 2 of them in the test suite" or "wow, there are 2 of them somewhere I didn't expect!" For us, this pattern comes from a very common abstract I/O algorithm, which has the logic of "read until you've read enough", and some of its implementations know that they will always read the whole lot at once. I'd not expect the test suite to be a place that heavily exercises things like this, and if the pattern was found there anyway - cool.

As for SCEV impact, some points.

  1. Yes, I agree that we should try to reduce impact on compile time when we can;
  2. We already have passes (IV simplify, IRCE, LSR...) where we already use SCEV. I'm absolutely certain that in reality, most things they do could be handled with InstSimplify-like techniques. But we don't mind the toll that SCEV takes there (and it should be HUGE in IV simplify) just because we started tracking compile time *after* these things were already there.
  3. It's pretty clear to me that the problem is not that loop deletion is slow with SCEV. The problem is that SCEV is slow. And it's much slower than the 0.2% that we see here. Holding back future optimizations just because SCEV is, and always has been, slow doesn't seem right to me.

Yes, we can in theory reduce this thing to InstSimplify. We save 0.2% CT. And we still keep wasting tons of CT in IV Simplify, for example (I don't know how much, but I'm pretty sure it's well above 10%).

I think we all agree that the slowness here comes just from SCEV, not from the change I made. So my suggestion is that we should fight the root cause, which is SCEV's inefficiency or slowness, and not try to fight accidental manifestations of this inefficiency. I can commit to spending some time deliberately looking for places where we can save time in SCEV.

Reducing my whole suggestion to absurdity: if I win 0.2% CT somewhere else in SCEV, can we go with this?

mkazantsev added a comment (edited). Tue, Jun 15, 1:14 AM

P.S. As for SCCP - that's a good point I didn't think of. I need to do some experiments to see if it helps.
UPD: it helps in some of the cases that are interesting to us. In some it doesn't, but it's not clear whether that's a conceptual limitation or something that can be fixed easily.

reames added a comment (edited). Tue, Jun 15, 6:43 PM

I want to chime in on the philosophy bits raised in the last few comments.

@nikic I understand and generally agree with your point about not implementing ad hoc optimizations for particular patterns which happen to show up in particular workloads. However, I don't think it really applies here.

Leaving aside your point about whether this can be done without SCEV (a reasonable question I'll touch on later), I see the class of loops which provably don't execute their backedge based on symbolic reasoning of the first iteration as being an interesting class. In fact, SCEV *already* does a variant of this in the exit computation logic when the exit condition is not analyzable. If anything, from a philosophy perspective, I'd prefer that this patch didn't exist because SCEV did this internally. However, that doesn't appear to be pragmatic from a compile time perspective.

I also want to call out a methodology issue. If we use the test-suite as evidence of (lack of) profit, and the test suite contains only historical C and C++ code, we will never implement any transform which either a) shows up for other languages, or b) is idiomatic in newer code (e.g. new language versions for C/C++). I'm not saying you're wrong to collect this data, but we need to be careful about how we interpret it.

From a purely pragmatic perspective, I don't want to get in the habit of breaking forward progress for perfection. @nikic, I see you as coming dangerously close to that here. I do think that if Max can show a savings elsewhere, we should allow this patch.

@mkazantsev As a pragmatic means to address the raised review concern, have you investigated why using SCEV here is more expensive? One thing I'm wondering about is whether we're invalidating SCEV too aggressively in one of the existing transforms. I don't have strong evidence of this, but computing SCEVs for a loop isn't *that* expensive, and given a properly structured loop pipeline it should happen approximately once. I'd be tempted to instrument the various "forgetX" calls and run one of the worst compile-time offenders. Does all of the invalidation make sense? Just based on a quick skim of the code, I see a couple of call sites in LoopSimplifyCFG and IndVarSimplify which look likely to invalidate far more than is required. In particular, if we were invalidating multiple times in a loop nest, that could cause degenerate compile time.

(Edit - In particular, the couple of calls to forgetTopmostLoop look to be a huge hammer for a problem which could be solved more narrowly. If any of those show up in profiling, we've got our cause.)

I also second @nikic's request for a less reduced reproducer. It's hard to tell whether there's another approach here without seeing the original motivating example. The fact that all of the test cases are handled by InstSimplify (which, btw, unrolling already has most of the code for) raises questions here. It'd be nice to put those to bed once and for all.

nikic added a comment. Wed, Jun 16, 2:24 PM

As for SCEV impact, some points.

  1. Yes, I agree that we should try to reduce impact on compile time when we can;
  2. We already have passes (IV simplify, IRCE, LSR...) where we already use SCEV. I'm absolutely certain that in reality, most things they do could be handled with InstSimplify-like techniques. But we don't mind the toll that SCEV takes there (and it should be HUGE in IV simplify) just because we started tracking compile time *after* these things were already there.
  3. It's pretty clear to me that the problem is not that loop deletion is slow with SCEV. The problem is that SCEV is slow. And it's much slower than the 0.2% that we see here. Holding back future optimizations just because SCEV is, and always has been, slow doesn't seem right to me.

I think we need to keep two things separate here: Passes like IndVars, IRCE and LSR perform actual loop reasoning. The entire purpose of SCEV is to reason about add recurrences, and that's exactly what these passes need. Using anything but SCEV wouldn't really make sense in this context. InstSimplify knows nothing (or at least very little) about recurrences.

The case here is different: This optimization only has a nominal relation to loops. There are no add recurrences involved, there is no "scalar evolution" going on -- rather, SCEV is being used as a generic symbolic reasoning framework, which is not a use-case it is intended or optimized for.

You are right that SCEV has serious compile-time issues in general, and "make SCEV faster" is certainly preferred over "don't use SCEV". However, in this particular instance I would argue that even if SCEV were blazing fast, it still wouldn't be the right tool for the job. Of course, there is the little caveat that we don't have any other symbolic reasoning framework in LLVM that has the same level of power, so I can certainly appreciate why it is being used here. I just want to highlight that this is not a case where SCEV is the obviously correct choice, as it is for passes like IndVars or LSR.

P.S. As for SCCP - that's a good point I didn't think of. I need to do some experiments to see if it helps.
UPD: it helps in some of the cases that are interesting to us. In some it doesn't, but it's not clear whether that's a conceptual limitation or something that can be fixed easily.

Worth mentioning that currently SCCP doesn't use PredicateInfo, which means it doesn't perform folding based on dominating branches. This is currently only enabled in the IPSCCP pass.

I want to chime in on the philosophy bits raised in the last few comments.

@nikic I understand and generally agree with your point about not implementing ad hoc optimizations for particular patterns which happen to show up in particular workloads. However, I don't think it really applies here.

Leaving aside your point about whether this can be done without SCEV (a reasonable question I'll touch on later), I see the class of loops which provably don't execute their backedge based on symbolic reasoning of the first iteration as being an interesting class. In fact, SCEV *already* does a variant of this in the exit computation logic when the exit condition is not analyzable. If anything, from a philosophy perspective, I'd prefer that this patch didn't exist because SCEV did this internally. However, that doesn't appear to be pragmatic from a compile time perspective.

I also want to call out a methodology issue. If we use the test-suite as evidence of (lack of) profit, and the test suite contains only historical C and C++ code, we will never implement any transform which either a) shows up for other languages, or b) is idiomatic in newer code (e.g. new language versions for C/C++). I'm not saying you're wrong to collect this data, but we need to be careful about how we interpret it.

From a purely pragmatic perspective, I don't want to get in the habit of breaking forward progress for perfection. @nikic, I see you as coming dangerously close to that here. I do think that if Max can show a savings elsewhere, we should allow this patch.

Yeah, I agree that using test-suite for evaluation is dangerous in that it is a very biased collection of code. I wouldn't be surprised if most of the optimizations I implemented have zero hits on test-suite, because they are targeted at a different frontend :) What test-suite mainly tells us is that there exists some large category of code for which an optimization is not relevant. It doesn't tell us whether there exists some other large category where it is.

I do like your framing in terms of this being an "interesting class" of loops, and agree that they are worth optimizing -- the question is mostly just how hard we should try :)

In any case, I don't want to block this patch. If you think this is the right way to approach the problem, I'm fine with it going in as-is.

llvm/lib/Transforms/Scalar/LoopDeletion.cpp
178

Legit clang-tidy warning: If you have a long chain of adds, this is going to blow the stack. Should probably use worklist?
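The usual fix for that clang-tidy finding is to replace recursion over operands with an explicit heap-allocated worklist. A generic standalone sketch (on an invented expression-tree type, not the patch's actual code):

```cpp
#include <stack>
#include <vector>

// A long left-leaning chain of binary "add" nodes. A naive recursive
// evaluator recurses once per node and can overflow the call stack on
// chains of ~100k nodes; the iterative version below keeps its pending
// work in an explicit stack on the heap instead.
struct Node {
  int Value = 0;                 // used when Lhs/Rhs are null (leaf)
  const Node *Lhs = nullptr;
  const Node *Rhs = nullptr;
};

int evaluateIterative(const Node *Root) {
  int Sum = 0;
  std::stack<const Node *> Worklist;
  Worklist.push(Root);
  while (!Worklist.empty()) {
    const Node *Cur = Worklist.top();
    Worklist.pop();
    if (!Cur->Lhs) {             // leaf: accumulate its value
      Sum += Cur->Value;
      continue;
    }
    Worklist.push(Cur->Lhs);     // interior "add": visit both operands
    Worklist.push(Cur->Rhs);
  }
  return Sum;
}
```

The worklist version uses O(1) call-stack depth regardless of chain length, which is exactly what the review comment asks for.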

@reames

One thing I'm wondering about is whether we're invalidating SCEV too aggressively in one of the existing transforms. I don't have strong evidence of this, but computing SCEVs for a loop isn't *that* expensive, and given a properly structured loop pipeline it should happen approximately once. I'd be tempted to instrument the various "forgetX" calls and rerun one of the worst compile-time offenders. Does all of that invalidation make sense? Just based on a quick skim of the code, I see a couple call sites in LoopSimplifyCFG and IndVarSimplify which look likely to invalidate far more than is required. In particular, if we were invalidating multiple times in a loop nest, that could cause degenerate compile time.

Even if so, it's not a problem with this patch, right? I can spend some time trying to investigate this if we all agree that it can go with SCEV provided we can win some time elsewhere. We don't have that agreement so far. Two concerns I have:

  1. I don't believe that 0.2% time is something that can be reliably measured by any tool available to me, even if I can somehow get the environment to measure clang;
  2. SCEV invalidation would be the last place to find this "elsewhere", because bugs caused by un-invalidated SCEV are *extremely* (5 times underlined) nasty. I don't want to commit to investigating and fixing all of them, unless tortured, or hung above a volcano. :)

It's hard to tell whether there's another approach here without seeing the original motivating example. The fact that all of the test cases are handled by SimplifyInst (which, btw, unrolling already has most of the code for) raises questions here. It'd be nice to put those to bed once and for all.

Currently scheming a prototype to see if it works. My speculation is InstSimplify might be enough for my motivating example. However, this will require massive refactoring of InstSimplify itself (some of its static functions will need to be made publicly available). It's a lot of work just to duplicate a part of what SCEV does out of the box.

@nikic

The case here is different: This optimization only has a nominal relation to loops.

I disagree. If we think something is a loop, we will run all our loop passes (including IndVars and LSR) on it. However, if loop deletion proves it's not a loop, we may save time on costly opts in other places. But again, whether this will or will not happen frequently, or whether the overall impact is positive, is very speculative ground on which neither of us has (or can have) enough evidence.

There are no add recurrences involved, there is no "scalar evolution" going on -- rather, SCEV is being used as a generic symbolic reasoning framework, which is not a use-case it is intended or optimized for.

And LLVM is not "low level" and not a "virtual machine", right? :) It's just wordplay.
Just because the initial motivation for SCEV was loop passes (maybe IndVars), I don't see any reason why it cannot be used as a generic symbolic framework. It has all the power of one, it has tons of simplifications that have nothing to do with loops, and its implication engine does a lot to prove non-recurrent facts. Today we even use it for pointers. I see no reason why we should stick to what was relevant in the past and hold back further expansion of its usage.

Of course, there is the little caveat that we don't have any other symbolic reasoning framework in LLVM that has the same level of power, so I can certainly appreciate why it is being used here. I just want to highlight that this is not a case where SCEV is the obviously correct choice, as it is for passes like IndVars or LSR.

I think we should not have multiple frameworks that do the same thing, some of them more and some less powerful. What you are saying makes sense, but to me it just means that we need a way to make SCEV more lightweight. Maybe configure its proof engine to limit it to the simplest ways of proving things.

I'll give a chance to a version that only uses the simplest ways of proving things through non-recursive reasoning, and try to write a version with InstSimplify as a PoC.

nikic added a comment.Thu, Jun 17, 2:46 AM

Currently scheming a prototype to see if it works. My speculation is InstSimplify might be enough for my motivating example. However, this will require massive refactoring of InstSimplify itself (some of its static functions will need to be made publicly available). It's a lot of work just to duplicate a part of what SCEV does out of the box.

This is surprising to me -- why does this require changes to InstSimplify? I would have expected that using it is simpler, not more complicated, as the SCEV use effectively requires you to reimplement the whole createSCEV() logic (which this patch only does to a small degree).

@nikic

The case here is different: This optimization only has a nominal relation to loops.

I disagree. If we think something is a loop, we will run all our loop passes (including IndVars and LSR) on it. However, if loop deletion proves it's not a loop, we may save time on costly opts in other places. But again, whether this will or will not happen frequently, or whether the overall impact is positive, is very speculative ground on which neither of us has (or can have) enough evidence.

I think my point may have been lost here: Yes, there is a loop structure here, but you don't make use of it, because you're specifically trying to prove that it is not actually a loop. You are looking at only one iteration, in which case it is straight-line code. Any loop-based reasoning SCEV does is actively unwanted here.

There are no add recurrences involved, there is no "scalar evolution" going on -- rather, SCEV is being used as a generic symbolic reasoning framework, which is not a use-case it is intended or optimized for.

And LLVM is not "low level" and not a "virtual machine", right? :) It's just wordplay.
Just because the initial motivation for SCEV was loop passes (maybe IndVars), I don't see any reason why it cannot be used as a generic symbolic framework. It has all the power of one, it has tons of simplifications that have nothing to do with loops, and its implication engine does a lot to prove non-recurrent facts. Today we even use it for pointers. I see no reason why we should stick to what was relevant in the past and hold back further expansion of its usage.

Okay, if we consider SCEV as a generic symbolic reasoning framework that is intended for non-loop analysis, then the usage here makes more sense to me. This isn't how I've been thinking about SCEV.

As said before, I'm okay with this going in as-is if you think it's the right approach -- I think I made this a bigger issue than it really deserved, because it ended up being a kind of proxy issue for the general problem of improvements to SCEV or new uses of SCEV leading to inevitable compile-time regressions. I don't actually think the regression in this patch is particularly problematic, my concern is more cumulative in nature, as this is not the first (or second, or third) time this happened... There's always talk that we should make SCEV faster, but there has been little actual movement in that direction.

(If we can solve motivating cases via SCCP or some improvements / phase ordering changes to it, that would still be my first preference, as I think that pass is perfectly suited to this problem from an algorithmic perspective.)

Cut down to non-recursive reasoning. @nikic could you please check the CT impact? If it's gone now, then it's all about the complex recursive proof rules, and this is where we can seek to simplify SCEV's reasoning.

This is surprising to me -- why does this require changes to InstSimplify?

InstSimplify does not provide public API for this:

const SCEV *LHSS = getSCEVOnFirstIteration(LHS, L, SE, FirstIterSCEV);
const SCEV *RHSS = getSCEVOnFirstIteration(RHS, L, SE, FirstIterSCEV);
S = SE.getAddExpr(LHSS, RHSS);

It has SimplifyAddInst that works with custom arguments, but it's currently static within the cpp file. I'd need to expose all of them. At least this, maybe more.

nikic added a comment.Thu, Jun 17, 3:05 AM

This is surprising to me -- why does this require changes to InstSimplify?

InstSimplify does not provide public API for this:

const SCEV *LHSS = getSCEVOnFirstIteration(LHS, L, SE, FirstIterSCEV);
const SCEV *RHSS = getSCEVOnFirstIteration(RHS, L, SE, FirstIterSCEV);
S = SE.getAddExpr(LHSS, RHSS);

It has SimplifyAddInst that works with custom arguments, but it's currently static within the cpp file. I'd need to expose all of them. At least this, maybe more.

The public API is SimplifyBinOp(), which works for all binary operators. See https://github.com/llvm/llvm-project/commit/5aa9bbc50a1fed1364caaa7edc9273fb38de990d for usage in this context. InstSimplify lacks an API that works generically overall instructions (with operands provided separately), but it exposes a couple of these high-level functions.

Oh, thanks! I didn't notice that. I'll now try to prototype an alternative solution with InstSimplify. But honestly, I'd be super surprised if SCEV's non-recursive proof is more expensive than this (and clearly SCEV's canonicalization is more powerful).

nikic added a comment.Thu, Jun 17, 3:25 AM

Cut down to non-recursive reasoning. @nikic could you please check the CT impact? If it's gone now, then it's all about the complex recursive proof rules, and this is where we can seek to simplify SCEV's reasoning.

New results: https://llvm-compile-time-tracker.com/compare.php?from=b5c4fc0f232b6368bbd7cd8681a6931f2c30ad02&to=580918ee9cdb3aa6dae1c12d0633a7da5124881b&stat=instructions I think these are approximately the same as before, so it doesn't look like limiting to non-recursive reasoning is worthwhile here.

mkazantsev planned changes to this revision.Thu, Jun 17, 3:43 AM

Then trying InstSimplify...

Reworked using InstSimplify. It's good enough for my case. Let's hope that at least this has acceptable CT overhead.

nikic added a comment.Thu, Jun 17, 5:39 AM

It looks like the new implementation causes a segfault while building 7zip using the ReleaseLTO-g configuration: https://llvm-compile-time-tracker.com/show_error.php?commit=77e65ca9acb9db4de64a1ff8002f1d53470a0de3

Restored the integer type check of branch conditions, removed test TODOs, and removed the null TTI with non-null DT, which looked extremely fishy.

nikic added a comment.Thu, Jun 17, 2:54 PM

Still crashing, here's the backtrace:

ld: /home/nikic/llvm-project/llvm/include/llvm/Support/Casting.h:104: static bool llvm::isa_impl_cl<To, const From*>::doit(const From*) [with To = llvm::Constant; From = llvm::Value]: Assertion `Val && "isa<> used on a null pointer"' failed.

Thread 2.1 "ld" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff7c2e100 (LWP 388814)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7c70859 in __GI_abort () at abort.c:79
#2  0x00007ffff7c70729 in __assert_fail_base (
    fmt=0x7ffff7e06588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
    assertion=0x7ffff49afba0 "Val && \"isa<> used on a null pointer\"", 
    file=0x7ffff49afaa0 "/home/nikic/llvm-project/llvm/include/llvm/Support/Casting.h", line=104, function=<optimized out>) at assert.c:92
#3  0x00007ffff7c81f36 in __GI___assert_fail (
    assertion=0x7ffff49afba0 "Val && \"isa<> used on a null pointer\"", 
    file=0x7ffff49afaa0 "/home/nikic/llvm-project/llvm/include/llvm/Support/Casting.h", line=104, 
    function=0x7ffff49afb28 "static bool llvm::isa_impl_cl<To, const From*>::doit(const From*) [with To = llvm::Constant; From = llvm::Value]") at assert.c:101
#4  0x00007ffff1669db7 in llvm::cast_retty<llvm::Constant, llvm::Value*>::ret_type llvm::dyn_cast<llvm::Constant, llvm::Value>(llvm::Value*) ()
   from /home/nikic/llvm-project/build/bin/../lib/LLVMgold.so
#5  0x00007ffff43120d9 in SimplifyICmpInst(unsigned int, llvm::Value*, llvm::Value*, llvm::SimplifyQuery const&, unsigned int) ()
   from /home/nikic/llvm-project/build/bin/../lib/LLVMgold.so
#6  0x00007ffff3a563d3 in canProveExitOnFirstIteration(llvm::Loop*, llvm::DominatorTree&, llvm::LoopInfo&) ()
   from /home/nikic/llvm-project/build/bin/../lib/LLVMgold.so

It looks like the LHS argument to SimplifyICmpInst is null. I suspect this may be due to reference invalidation, commented inline.

llvm/lib/Transforms/Scalar/LoopDeletion.cpp
302

This acquires a reference to FirstIterValue[&PN], then calls getValueOnFirstIteration(), which may invalidate it, and then writes through it.
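The general shape of that bug and its fix can be shown on a standalone memoization example (names are mine; `std::unordered_map` stands in for LLVM's DenseMap here, and unlike DenseMap it happens to keep element references stable, but the safe pattern is the same):

```cpp
#include <unordered_map>

// Memoized recursive computation. The buggy pattern, in containers like
// LLVM's DenseMap that reallocate on growth, would be:
//
//   auto &Slot = Memo[N];    // reference into the map
//   Slot = compute(N, Memo); // compute() may insert and reallocate,
//                            // leaving Slot dangling
//
// The fix: finish the computation into a local first, then store it, so
// the map is only touched after all recursive inserts are done.
long fib(int N, std::unordered_map<int, long> &Memo) {
  auto It = Memo.find(N);
  if (It != Memo.end())
    return It->second;
  long Result = N < 2 ? N : fib(N - 1, Memo) + fib(N - 2, Memo);
  Memo[N] = Result;          // map mutated only after compute is done
  return Result;
}
```

The same discipline (never hold a reference into a DenseMap-style container across a call that may insert into it) is what the fixed revision applies to FirstIterValue.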

mkazantsev added inline comments.Thu, Jun 17, 11:04 PM
llvm/lib/Transforms/Scalar/LoopDeletion.cpp
302

Wow, that's fishy! Thanks!

Compile-time with the reference invalidation issue fixed: https://llvm-compile-time-tracker.com/compare.php?from=b5c4fc0f232b6368bbd7cd8681a6931f2c30ad02&to=9bb9b602fd58e6fdd353cdfedec56980a40230c1&stat=instructions Impact is small and without outliers. I think this is as cheap as it gets :)

Fixed potential ref invalidation issue.
Removed TODOs from tests.

I was able to catch this bug on our Fuzzer tests, and the fix in the last revision helped.

So, can we go with it now? There is one more CT opportunity here: replacing the eager propagation of phi values with lazy propagation. But last time I tried that it crashed, so I'd prefer it to be a separate patch.

nikic accepted this revision.Fri, Jun 18, 1:38 AM

LGTM

llvm/lib/Transforms/Scalar/LoopDeletion.cpp
190

A plain dyn_cast<BinaryOperator> seems simpler than going through PatternMatch here?

LGTM

llvm/lib/Transforms/Scalar/LoopDeletion.cpp
190

True.

This revision was not accepted when it landed; it landed in state Needs Review.Fri, Jun 18, 3:32 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.