This patch handles one particular case of one-iteration loops for which SCEV cannot straightforwardly prove a trip count of 1. The idea of the optimization is to symbolically execute conditional branches on the 1st iteration, moving in topological order and visiting only blocks that may be reached on the first iteration. If we find out that we never reach the header via the latch, then the backedge can be broken.
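The approach can be sketched in Python on a toy CFG (block names, the `taken_on_iter1` map, and helper names are all illustrative — this is not the patch's actual code, and it assumes the loop body has no cycles other than the backedge):

```python
def topo_order(succ, header):
    """Reverse post-order over the loop body; edges back to the already-seen
    header are naturally skipped, so the body is traversed as a DAG."""
    seen, post = set(), []
    def dfs(n):
        seen.add(n)
        for s in succ.get(n, []):
            if s not in seen:
                dfs(s)
        post.append(n)
    dfs(header)
    return post[::-1]

def backedge_dead_on_first_iteration(succ, taken_on_iter1, header, latch):
    """succ: block -> successor list; taken_on_iter1: block -> index of the
    successor known to be taken on the 1st iteration (absent if unknown).
    Returns True if the latch -> header backedge is provably not taken."""
    live = {header}
    for bb in topo_order(succ, header):
        if bb not in live:
            continue  # block cannot execute on the 1st iteration
        idx = taken_on_iter1.get(bb)
        targets = succ[bb] if idx is None else [succ[bb][idx]]
        for t in targets:
            if (bb, t) == (latch, header):
                return False  # the backedge may be taken
            if t != header:
                live.add(t)
    return True

# Loop: header -> bb1; bb1 branches to latch or exit; latch -> header.
cfg = {'header': ['bb1'], 'bb1': ['latch', 'exit'],
       'latch': ['header'], 'exit': []}
# If bb1 provably takes successor 1 ('exit') on the 1st iteration,
# the latch is never reached and the backedge can be broken.
assert backedge_dead_on_first_iteration(cfg, {'bb1': 1}, 'header', 'latch')
# With the branch unknown, the backedge must be kept.
assert not backedge_dead_on_first_iteration(cfg, {}, 'header', 'latch')
```

The discussion below (irreducible CFGs, RPOT ordering bugs) is precisely about the cases where the "body is a DAG" assumption of this sketch breaks down.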
Reviewers: fhahn, reames, adpatel6, lebedev.ri, nikic

Commits:
- rGde92287cf8d1: [LoopDeletion] Break backedge if we can prove that the loop is exited on 1st…
- rGbe1a23203b1d: Return "[LoopDeletion] Break backedge if we can prove that the loop is exited…
- rG43d2e51c2e86: Return "[LoopDeletion] Break backedge if we can prove that the loop is exited…
- rG2531fd70d19a: [LoopDeletion] Break backedge if we can prove that the loop is exited on 1st…
I'm confused why this is worthwhile. LoopUnrolling already knows how to symbolically execute a loop. Shouldn't LoopUnroll fully unroll this loop by one - thus deleting the loop backedge in question?
See analyzeLoopUnrollCost in LoopUnrollPass.cpp.
Hm, glancing at that code, it looks like it symbolically evaluates only phis and branches. I don't see any principled reason we couldn't generalize that.
In general, I would be much more comfortable posing this as symbolic execution of the first iteration. In the process, we're proving the exit is taken, and loop deletion can happen naturally later.
Full unrolling is only possible when we know the trip count, which is not the case here. There is a subtle difference between computing the trip count and proving that the trip count has a particular value. This patch tries to do the latter in cases where the trip count is not computable.
By definition, a loop backedge can be broken when we can prove that the BTC is 0. So this is what we are doing here: if we could not compute the BTC by normal means, let's try to prove it's equal to some particular known value (zero).
The only good alternative I am seeing here is that this could be done in SCEV's computeExitCount. However, that would introduce a lot of compile-time overhead just to prove that loops have a zero backedge-taken count, which might not be the best use of compile time.
Based on my conversation with Philip, I've reframed it not as a backward traversal, but as a topological walk over the blocks live on the 1st iteration.
As a follow-up, after we've proven 1 iteration, we can also break all non-taken edges.
LGTM w/minor comments.
llvm/lib/Transforms/Scalar/LoopDeletion.cpp
- Line 242: This check should probably be lifted to the caller, and assert !Header here. It confuses the interface of this lambda. You could also consider renaming the lambda to getSolePredecessorOnFirstIteration.
- Line 248: Note that predecessors can be listed multiple times due to switches. You should check that OnlyPred is non-null and not equal to this block.
llvm/lib/Transforms/Scalar/LoopDeletion.cpp
- Line 242: But why lift it out? Semantically, the sole predecessor of the header on the 1st iteration is a loop predecessor. This allows us to process all loop blocks uniformly. Why do you think the header should be a special case outside this function?
- Line 248: Good point, I'll add a test for this.
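The switch hazard from the inline comments can be illustrated with a small sketch (the helper name `sole_predecessor` is made up): several cases of one switch jumping to the same block list that predecessor multiple times, and a self-loop must not count as a usable sole predecessor.

```python
def sole_predecessor(block, preds):
    """Return the unique predecessor of `block`, or None.
    `preds` may contain duplicates (e.g. several cases of one switch
    targeting the same block), so dedupe before counting; a block whose
    only predecessor is itself is also rejected."""
    unique = set(preds)
    if len(unique) != 1:
        return None
    (p,) = unique
    return None if p == block else p

# A switch with cases 0 and 1 both jumping to 'bb' lists 'sw' twice:
assert sole_predecessor('bb', ['sw', 'sw']) == 'sw'
assert sole_predecessor('bb', ['a', 'b']) is None
assert sole_predecessor('bb', ['bb']) is None  # self-loop
```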
Can this patch optimize code from comment #4 in https://bugs.llvm.org/show_bug.cgi?id=47491 ?
As a heads-up, it looks like this change has a non-trivial compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=ce245246043d3c4f12515b2c773ed6c9174345b5&to=2531fd70d19aa5d61feb533bbdeee7717a4129eb&stat=instructions I haven't looked any closer yet.
This patch appears to have caused a performance regression on the buildbots: https://lab.llvm.org/buildbot/#/builders/105/builds/10180
Please fix or revert.
FYI, given timezones, Max is asleep. Feel free to revert in the meantime.
You should also plan to help with reducing a reproducer for the performance problem. A quick glance at the buildbot log doesn't show anything glaringly obvious, and I would not expect loop deletion becoming more effective to worsen performance.
Just looked into this. libclamav_wwunpack.c from test-suite is a good test case, small file with about 10% compile-time regression. Callgrind says that most of the additional time is spent inside isKnownPredicateAt(), where in turn most of the time is spent in isImpliedCond(). That's a familiar issue...
Is the "at" part an important part of this optimization? If no, if this used just isKnownPredicate(), I expect it would be fine. If yes, then I think that this (and really, any new optimization that uses isKnownPredicateAt) is probably not viable unless and until someone makes a serious effort towards improving its performance.
Just a heads up, I think I'm seeing a miscompile with this patch.
I'll dig further and try to extract a reproducer.
Reproduce the miscompile with:
```
opt -loop-deletion loop-deletion.ll -S -o -
```
The bb12 -> bb2 back edge is replaced with a branch to unreachable which I think is wrong in this case.
(I made some comments in the input file showing how I think the back edge may be taken.)
Great! (It's reverted already due to the compile-time regression I think. https://reviews.llvm.org/rG832c99f727723bab2baf5a477bda3d91fed56f5d )
@uabelho Looking into your example, I don't think there is enough evidence that it's a miscompile. Your test has a non-side-effecting infinite loop, which is undefined behavior. So it's legal to make the replacement. Can you construct the example without an infinite loop?
@nikic:
> Is the "at" part an important part of this optimization?

Unfortunately yes; without it, the motivating test won't pass. I'll simply add an option to enable/disable this transform.
Hm, is it really? For input 1, 1, 0 I think it terminates and returns the value 20 after executing this path:

entry -> if.end -> bb2 -> bb4 -> bb12 -> bb2 -> bb4 -> bb13 -> ret
Is it worth it? It should not be committed at all if it's off by default - you can keep it in your fork, no? It would just add code complexity for future developers, plus bit rot.
@nikic I already regret that I spent time improving implication engine of SCEV... :)
Actually, thinking a bit more about it, the "at" part wasn't needed for my initial motivating example. So I can rework the tests in a way that doesn't need "at".
However, this tendency is worrisome. We intentionally give up optimization power trying to sustain compile time, instead of solving the problem with SCEV caching once and for all. Maybe this is what we should think about. If even LoopDeletion can cause a compile-time regression, I can't imagine what couldn't. :)
@uabelho you are right. The problem is that the loop RPOT processes the blocks in the order (bb2, bb12, bb4), which is not correct. The expected order is (bb2, bb4, bb12). No clue why this happened; investigating.
Thanks!
I tried testing the transform in Alive2, but I think it doesn't like the recursive call in the input. So I simplified the input and removed the call, and then Alive2 says the transformation done by loop-deletion is correct:
https://alive2.llvm.org/ce/z/K8ZJzz
But I still don't agree that it is (e.g. for input 1, 1 I think it's incorrect), so I'm not sure what is going on. It's the first time I've tried to use Alive2, though, so I might very well have messed something up.
(I've looked at this function so much now I'm not quite sure about anything anymore :)
Silly me. I've accounted for inner loops but not for irreducible CFG cycles, which have exactly the same problems but are not detectable the same way. I need to exclude loops with irreducible CFG.
Great! The description of my failing testcase says "irreducible flow and constant propagation", so it sounds like you're on the right track then.
Yes, the point is, RPOT (or any other single-pass traversal) is ill-defined for any cycles, whether they are inner loops or contained non-loop cycles. I think it's best not to deal with them at all. It never ended well for any opt. :)
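For reference, the textbook characterization of irreducibility that such a filter relies on can be sketched as follows (standard definitions, not LLVM's `containsIrreducibleCFG` implementation): a CFG is reducible iff every retreating edge in a reverse post-order targets a block that dominates the edge's source.

```python
def rpo(succ, entry):
    """Reverse post-order of the reachable blocks."""
    seen, post = set(), []
    def dfs(n):
        seen.add(n)
        for s in succ.get(n, ()):
            if s not in seen:
                dfs(s)
        post.append(n)
    dfs(entry)
    return post[::-1]

def dominators(succ, entry):
    """Classic iterative data-flow dominator computation."""
    nodes = rpo(succ, entry)
    preds = {n: [] for n in nodes}
    for n in nodes:
        for s in succ.get(n, ()):
            preds[s].append(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes[1:]:  # entry is first in RPO
            new = set(nodes)
            for p in preds[n]:
                new &= dom[p]
            new |= {n}
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

def is_irreducible(succ, entry):
    """Reducible iff every retreating edge (target not strictly later in
    RPO) enters a block that dominates the edge's source."""
    index = {n: i for i, n in enumerate(rpo(succ, entry))}
    dom = dominators(succ, entry)
    return any(index[v] <= index[u] and v not in dom[u]
               for u in index for v in succ.get(u, ()))

# A natural loop (h <-> b) is reducible; two blocks jumping into each
# other's "loop" without a single header are not.
natural = {'entry': ['h'], 'h': ['b', 'x'], 'b': ['h'], 'x': []}
tangled = {'entry': ['a', 'b'], 'a': ['b'], 'b': ['a']}
assert not is_irreducible(natural, 'entry')
assert is_irreducible(tangled, 'entry')
```

On the `tangled` graph no visitation order lets a single pass see all predecessors of a block before the block itself, which is exactly why the RPOT-based reasoning above breaks.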
New compile-time impact: https://llvm-compile-time-tracker.com/compare.php?from=01120fe5b39837f87e6fa34a5227b8f8634d7b01&to=be1a23203b1de655b8c7dac7549818d975a0cbbf&stat=instructions
Getting rid of isKnownPredicateAt() has improved things, but it's still fairly expensive :/ The test case I looked at had a particularly high contribution from isKnownPredicateAt(), and that one did improve a lot (from 10% to below 1%).
The best solution I can come up with (and it is something I even started to do at some point) is to replace the two queries to isKnownPredicate with one that returns Optional<bool>; that should give up to a 2x CT improvement. I'll think about whether we can get some minor CT wins here.
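The tri-state idea can be sketched on integer ranges (the names and the range model are illustrative; the real change would live inside ScalarEvolution): instead of asking "is P known?" and then "is the inverse of P known?" in two separate walks, one evaluation answers both.

```python
from typing import Optional, Tuple

Range = Tuple[int, int]  # inclusive [lo, hi] bounds for a value

def evaluate_slt(lhs: Range, rhs: Range) -> Optional[bool]:
    """Tri-state answer to 'lhs < rhs' computed in one pass:
    True  -> the predicate always holds,
    False -> its inverse always holds,
    None  -> neither is provable."""
    if lhs[1] < rhs[0]:        # max(lhs) < min(rhs)
        return True
    if lhs[0] >= rhs[1]:       # min(lhs) >= max(rhs)
        return False
    return None

# One call replaces isKnownPredicate(SLT) followed by isKnownPredicate(SGE):
assert evaluate_slt((0, 5), (6, 10)) is True
assert evaluate_slt((7, 9), (1, 7)) is False
assert evaluate_slt((0, 5), (3, 10)) is None
```

The caller then marks the true edge live, the false edge live, or both, from a single query.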
llvm/lib/Transforms/Scalar/LoopDeletion.cpp
- Line 315: To save CT, we can give up querying for non-loop successors.
One step to improve the situation:
```
commit b0b2bf3b5da950679db1431aae431a6dedea2245 (HEAD -> main, origin/main)
Author: Max Kazantsev <mkazantsev@azul.com>
Date:   Thu May 27 11:47:30 2021 +0700

    [NFCI][LoopDeletion] Only query SCEV about loop successor if another
    successor is also in loop
```
llvm/lib/Transforms/Scalar/LoopDeletion.cpp
- Line 220:
```
if (containsIrreducibleCFG<const BasicBlock *>(RPOT, LI))
  return false;
```
This should also help:
```
commit 51d334a845a082338735b0fdfc620a4b15fa26fe (HEAD -> main, origin/main)
Author: Max Kazantsev <mkazantsev@azul.com>
Date:   Thu May 27 13:20:57 2021 +0700

    [NFCI] Lazily evaluate SCEVs of PHIs

    Eager evaluation has cost of compile time. Only query them if they are
    required for proving predicates.
```
llvm/lib/Transforms/Scalar/LoopDeletion.cpp
- Line 220: No, we don't care about irreducible CFG in code that is unreachable on 1st iteration.
@nikic could you please evaluate the impact of the 2 patches above? They should go a long way toward decreasing SCEV usage.
@mkazantsev Results for the two patches are https://llvm-compile-time-tracker.com/compare.php?from=d82f2a123f9c443911fc40009d2017915b850758&to=b0b2bf3b5da950679db1431aae431a6dedea2245&stat=instructions and https://llvm-compile-time-tracker.com/compare.php?from=59d938e649e62db0cef4903d495e838fbc6a6eb8&to=51d334a845a082338735b0fdfc620a4b15fa26fe&stat=instructions. They make for minor improvements, though close to the noise floor.
Thanks! I'll see what else can be done. Though, I think the right solution is to speed up SCEV queries in general and not try to cripple particular opts.
Added one more early bailout:
```
commit 7d418dadf6b1e6fd9bcccf7c5b5e1db74992ee70 (HEAD -> main, origin/main)
Author: Max Kazantsev <mkazantsev@azul.com>
Date:   Thu May 27 15:18:30 2021 +0700

    [NFCI][LoopDeletion] Do not call complex analysis for known non-zero BTC
```
@nikic could you please run this one too? I think it can be useful in real cases with countable loops.
@mkazantsev
This change is breaking a couple of PowerPC buildbots.
https://lab.llvm.org/buildbot/#/builders/105
https://lab.llvm.org/buildbot/#/builders/100
Since the bots have been broken for more than 24 hours I ask that you please revert the change and then recommit once the issues have been addressed.
If you have questions about reproducing the issue or need help with it please let me know.
I second this request. I also would like you to re-review your changes before landing them again.
This commit also introduced a regression in the s390x-lnt buildbot due to an apparent miscompilation of the test-suite::gs.test test case:
https://lab.llvm.org/buildbot/#/builders/45/builds/3264
@mkazantsev
I have reverted these changes (and subsequent related NFC patches) with the following patches:
```
commit 0159652058ca555b05db6a209fe9cf660c3bf8e6 (HEAD -> main, origin/main, origin/HEAD)
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:35:25 2021 -0500

    Revert "Return "[LoopDeletion] Break backedge if we can prove that the loop is exited on 1st iteration" (try 2)"

    This reverts commit be1a23203b1de655b8c7dac7549818d975a0cbbf.

commit 24bd657202379595b514ee17241f019294e86bc9
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:35:12 2021 -0500

    Revert "[NFCI][LoopDeletion] Only query SCEV about loop successor if another successor is also in loop"

    This reverts commit b0b2bf3b5da950679db1431aae431a6dedea2245.

commit fd553312031c7d8085fa6ee0755a957796eadf05
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:34:02 2021 -0500

    Revert "[NFC] Formatting fix"

    This reverts commit 59d938e649e62db0cef4903d495e838fbc6a6eb8.

commit 807fc7cdc97fc172b4967707a7718e7333351bff
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:33:45 2021 -0500

    Revert "[NFC] Reuse existing variables instead of re-requesting successors"

    This reverts commit c467585682dcdda75e645ef3ab47c8b48440db12.

commit dd226803c220f02a5987f0ee9f9ac3ffe2b35c09
Author: Stefan Pintilie <stefanp@ca.ibm.com>
Date:   Fri May 28 11:17:46 2021 -0500

    Revert "[NFCI][LoopDeletion] Do not call complex analysis for known non-zero BTC"

    This reverts commit 7d418dadf6b1e6fd9bcccf7c5b5e1db74992ee70.
```
When you are ready to commit, I would be happy to test this patch on PowerPC if you would like. It may help to avoid more noise on the buildbots.
Also, if you need help reproducing or investigating failures on PowerPC, I can help with that as well.
Reviewing the failures, I see several issues that might be relevant:
FAIL: MultiSource/Benchmarks/MallocBench/gs/gs.execution_time - is it about compile time again? I don't see any indication of a miscompile. If so, I see no alternative to making this transform enabled under an option.
Same.
That might be interesting. Will try to reproduce that.
The last failure also looks like a timeout, I don't see any indication of miscompile:
```
FAIL: test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test (73 of 2021)
******************** TEST 'test-suite :: MultiSource/Benchmarks/MallocBench/gs/gs.test' FAILED ********************
/home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/tools/timeit-target --limit-core 0 --limit-cpu 7200 --timeout 7200 --limit-file-size 104857600 --limit-rss-size 838860800 --append-exitstatus --redirect-output /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/Output/gs.test.out --redirect-input /dev/null --chdir /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs --summary /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/Output/gs.test.time /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs -DNODISPLAY INPUT/large.ps
cd /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs ; /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/tools/fpcmp-target /home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/Output/gs.test.out gs.reference_output
/home/uweigand/sandbox/buildbot/clang-s390x-linux-lnt/test/sandbox/build/tools/fpcmp-target: FP Comparison failed, not a numeric difference between ' ' and 'g'
```
@uweigand could you please confirm that output like this could have been the result of a timeout?
So far, it looks like we still have a compile-time issue. No solution other than making this transform optional.
So far it seems that all problems are caused by the compile-time increase. There is no more room for shrinking the scope, so I'm returning this with an option (turned off by default).
I don't believe that
https://lab.llvm.org/buildbot/#/builders/105/builds/10339
FAIL: MultiSource/Benchmarks/MallocBench/gs/gs.execution_time
is a compile-time issue. From https://lab.llvm.org/buildbot/api/v2/logs/5749632/raw :
```
---------------------------------------------------------------
>>> ========= '/home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs' Program
---------------------------------------------------------------
TEST-PASS: compile /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs
TEST-RESULT-compile-success: pass
TEST-RESULT-compile-hash: 71676ae82ccfc3ce19d1a06b5c542f59
TEST-RESULT-compile-time: user 21.431000
TEST-RESULT-compile-real-time: real 30.486300
TEST-FAIL: exec /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs
TEST-RESULT-exec-time: user 0.0000
TEST-RESULT-exec-real-time: real 0.0204
```
Compare that with a recent run:
https://lab.llvm.org/buildbot/#/builders/105/builds/10409
```
---------------------------------------------------------------
>>> ========= '/home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs' Program
---------------------------------------------------------------
TEST-PASS: compile /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs
TEST-RESULT-compile-success: pass
TEST-RESULT-compile-hash: 414dbc57ddbb1e8c60a37884db3f706c
TEST-RESULT-compile-time: user 40.363300
TEST-RESULT-compile-real-time: real 64.390900
TEST-PASS: exec /home/buildbots/ppc64le-clang-lnt-test/clang-ppc64le-lnt/test/sandbox/build/MultiSource/Benchmarks/MallocBench/gs/gs
TEST-RESULT-exec-success: pass
TEST-RESULT-exec-time: user 0.0761
TEST-RESULT-exec-real-time: real 0.1529
```
It really does look like that program was miscompiled (or UB in it was exploited), and it produced an unexpected result.
No, there is no timeout - execution of the test binary fails with an internal error:
```
Reading ghost.ps...
Interp returns -16
ostack:
0x103e848: 0x02 dict -ewr------ 0x0000 0x01045b58
estack:
0x103de38: 0x09 oper x--------- 0x0000 0x010069c0
0x103de48: 0x03 file xe-r------ 0x0003 0x0105f708
exit 1
```
In the "good" case (with the patch reverted) we instead get the expected output:
```
Reading ghost.ps... ghost.ps read.
exit 0
```
Looking at the assembly, only a single function shows any difference between the two versions: "interp" from MultiSource/Benchmarks/MallocBench/gs/interp.c.
Oh, I see... That's nasty. In this case, could you please help me reproduce it? I wasn't able to find the sources of this test in the LLVM repository. An IR dump before Loop Deletion would work too.
There have been several separate follow-ups to address the compile time, but I believe we never measured them all together.
This is part of the llvm-test-suite repository. The code for the interp function is here:
https://github.com/llvm/llvm-test-suite/blob/main/MultiSource/Benchmarks/MallocBench/gs/interp.c#L180
I notice that this has a weird "loop" constructed from a switch statement and various gotos, maybe this isn't handled correctly?
Root cause: another case of irreducible CFG I didn't filter out. As a result, RPOT makes no sense. Checked in the failing unit test as https://reviews.llvm.org/rG4ef47eaed952. Added an assert detecting this.
Turning off the transform for irreducible CFG entirely. The potential win isn't worth the problems it is causing. I should have followed David's advice in the first place.
I still want to have an option (on by default now) to be able to switch off this transform if required.
Could we get new data? @nikic? I am not sure whether the follow-ups were reverted as well; if the follow-up patches can be recommitted as NFC(I) patches, that would be great.
The follow-ups were all reverted and integrated into this one. I see no point in merging them separately, unless we want more pain with reverts (if that is needed again). They are just part of the algorithm.
Nice, so only this patch needs to be tested for compile time. If it's less than +0.5%, it should be acceptable. There have been many compile-time improvements lately, so I do think there is some room for less cheap optimizations.
It doesn't, and it shouldn't. There are no 1-iteration loops there, and the existing loop gets fully unrolled.
Yes, only this patch. The NFC(I)s were basically this
```
auto *BTC = SE.getBackedgeTakenCount(L);
if (!isa<SCEVCouldNotCompute>(BTC) && SE.isKnownNonZero(BTC))
  return LoopDeletionResult::Unmodified;
if (!BTC->isZero() && !canProveExitOnFirstIteration(L, DT, SE, LI))
  return LoopDeletionResult::Unmodified;
```
and
```
if (L->contains(IfFalse) && SE.isKnownPredicate(Pred, LHSS, RHSS)) // contains check is NFCI
  MarkLiveEdge(BB, IfTrue);
else if (L->contains(IfTrue) && // contains check is NFCI
         SE.isKnownPredicate(ICmpInst::getInversePredicate(Pred), LHSS, RHSS))
  MarkLiveEdge(BB, IfFalse);
else
  MarkAllSuccessorsLive(BB);
```
No need to tear them off, they fit well.
All failures so far were related to irreducible CFG (which made RPOT meaningless), which is now completely ruled out. @stefanp could you please check that this version works OK on PowerPC?
Here are compile-time numbers for the latest version of the patch: https://llvm-compile-time-tracker.com/compare.php?from=1ffa6499ea3b7fde687666a45e89909fbf72a450&to=d51dd5be5d0e67015240cfb6eedc14a2f3dfc3a7&stat=instructions Unfortunately I didn't have a chance to profile where the remaining cost is yet.
Can we go ahead and merge it, or do you want to further reduce the CT impact?
AFAIK failures on PowerPC are now gone (correct me if I'm wrong).
After looking through the tests, why does this need to use SCEV at all? As far as I can tell, simply using InstSimplify would be sufficient to prove all these cases (or just ConstFold for that matter). Without isKnownPredicateAt(), where symbolic reasoning is necessary, SCEV doesn't seem to provide much value here, or at least it's not obvious from the tests. (Of course, there will be cases where it can do more, but the converse also holds -- in particular, InstSimplify/ConstFold support arbitrary instructions, no need to special-case a limited set of them.)
SCEV is much more powerful in terms of simplification. Imagine the situation where some Phi takes the value a + b + c on the 1st iteration (and an unknown value on all others), then gets subtracted by a, then by b, and then by c, and then the result is compared against zero. If we want InstSimplify to do something like this, we will need to re-implement recursive expression aggregation and simplification; in summary, we will need to re-invent SCEV.
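The folding described here can be mimicked with a toy symbolic-sum representation — a crude stand-in for SCEV's add-expression canonicalization, which operates across many instructions at once rather than one at a time:

```python
from collections import Counter

def combine(a, b, sign=1):
    """Add (sign=+1) or subtract (sign=-1) two symbolic sums, folding like
    terms and dropping zero coefficients -- a toy analogue of the canonical
    form SCEV builds for add expressions."""
    out = Counter(a)
    for sym, coeff in b.items():
        out[sym] += sign * coeff
    return Counter({sym: c for sym, c in out.items() if c})

# Phi takes the value a + b + c on the 1st iteration...
phi = Counter({'a': 1, 'b': 1, 'c': 1})
# ...then gets subtracted by a, then by b, then by c:
res = combine(combine(combine(phi, {'a': 1}, -1), {'b': 1}, -1), {'c': 1}, -1)
# In canonical form, comparing the result against zero is trivial.
assert res == Counter()
```

A per-instruction simplifier only ever sees one subtraction at a time and cannot conclude the chain cancels, which is the point being made.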
I'm also sceptical about the approach we are following (crippling the opts to save compile time). I'd rather prefer it to be isKnownPredicateAt, maybe under a flag, so it uses as much power as it can.
If SCEV's CT impact is still unacceptable, we can cripple it even further to use isKnownViaNonRecursiveReasoning, but at that point we might as well just go with -O0 and stay happy about the results.
Let me start with a bit of philosophy: This patch falls into the very unfortunate category where a) it solves some specific problem you have, b) outside that, it applies very rarely, but c) has a compile-time impact on everyone. It's easy to justify an average-case compile-time impact that shows an average-case improvement. It's not so easy to justify an average-case compile-time impact that only shows an exceptional-case improvement.
To provide some numbers: After adding a loop-delete.NumBackedgesBroken stat and running test-suite at O3, the stat increases from 94 to 96. That is, in the entirety of test-suite, we break two additional back edges. Of course, test-suite is skewed in the kind of IR it contains (being C/C++), but this gives us a general idea.
I do think it's okay to solve specific optimization problems that do not have any wider applicability, but we should try to minimize the impact they have on everyone else. I do understand that making this SCEV-based can in principle handle more cases, but as always, there are tradeoffs. I gave an InstSimplify-based implementation a try (quick hack: https://github.com/llvm/llvm-project/commit/5aa9bbc50a1fed1364caaa7edc9273fb38de990d), and it had the same practical optimization power on test-suite (2 additional backedges broken) as the SCEV-based implementation. The only difference is that the SCEV-based implementation caused codegen changes on two additional test cases without breaking a backedge, due to some kind of SCEV caching difference. (Compile-time for that version: https://llvm-compile-time-tracker.com/compare.php?from=8fdd7c2ff16da370e28ef1b22e400d57a541484f&to=5aa9bbc50a1fed1364caaa7edc9273fb38de990d&stat=instructions)
Taking a step back here from the technical details, I want to make an observation: The ability to break a not-taken loop backedge is a core property of "sparse conditional" style algorithms. They solve this problem as a special case of a more general data-flow problem. I checked, and both SCCP and NewGVN are capable of breaking the backedge in all cases in eval_first_iteration.ll that get broken by this implementation. Neither handles the "loop guard" cases (which needed isKnownPredicateAt), but I believe both could be extended to handle them -- they have the necessary ingredients to do so efficiently via PredicateInfo.
Assuming that those test cases are actually representative, could it be that we're trying to solve a phase ordering problem by introducing a new optimization? For example, there's this FIXME that suggests that SCCP should be performed before rather than after loop optimizations. Maybe just doing that move would already address the original motivation? It would be helpful to have the unreduced motivating test-case, as it's hard to evaluate phase ordering otherwise.
Thanks Nikita, I think I have some arguments for this philosophical debate. :)
This patch falls into the very unfortunate category where a) it solves some specific problem you have, b) outside that, it applies very rarely
I'd argue against that. Just from how many failures we've seen when irreducible CFG wasn't handled correctly, I conclude that there have been many attempts to apply this transform even in those weird cases. Given how rare such CFGs are, I can speculate that there have also been many cases where it applied on regular CFGs too. Even in the test suite there were 2 loops of this kind, and I'd have expected it to contain no such patterns at all.
Either way, "rare or frequent" is a very speculative category. You can say "meh, it's just 2 of them in the test suite" or "wow, there are 2 of them somewhere I didn't expect!" For us, this pattern comes from a very common abstract I/O algorithm, which has "read until you've read enough" logic, and some of its implementations know that they will always read the whole lot at once. I'd not expect the test suite to heavily exercise things like this, and if it was found there - cool.
As for SCEV impact, some points.
- Yes, I agree that we should try to reduce impact on compile time when we can;
- We already have passes (IV simplify, IRCE, LSR...) where we already use SCEV. I'm absolutely certain that in reality, most things they do could be handled with InstSimplify-like instruments. But for some reason, we don't mind the toll that SCEV takes there (and it should be HUGE in IV simplify) just because we started tracking compile time *after* these things were already in place.
- It's pretty clear to me that the problem is not that loop deletion is slow with SCEV. The problem is that SCEV is slow. And it's much slower than the 0.2% that we see here. Holding back future optimizations just because SCEV is, and always has been, slow doesn't seem right to me.
Yes, we can in theory switch this thing to InstSimplify. We save 0.2% CT. And we still keep wasting tons of CT in IV Simplify, for example (I don't know how much, but I'm pretty sure it's well above 10%).
I think we all agree that the slowness here comes from SCEV, not from the change I made. So my suggestion is that we should be fighting the root cause, which is SCEV's inefficiency, and not its accidental manifestations. I can commit to spending some time deliberately looking for places where we can save time in SCEV.
Reducing my whole suggestion to absurdity: if I win 0.2% CT somewhere else in SCEV, can we go with this?
P.S. and as for SCCP - that's a good point I didn't think of. Need to make some experiments to see if it helps.
UPD: it helps in some of the cases that are interesting to us. In some of them it doesn't, but it's not clear whether that's a conceptual limitation or something that can be fixed easily.
I want to chime in on the philosophy bits raised in the last few comments.
@nikic I understand and generally agree with your point about not implementing ad-hoc optimizations for particular patterns which happen to break particular workloads. However, I don't think it really applies here.
Leaving aside your point about whether this can be done without SCEV (a reasonable question I'll touch on later), I see the class of loops which provably don't execute their backedge based on symbolic reasoning of the first iteration as being an interesting class. In fact, SCEV *already* does a variant of this in the exit computation logic when the exit condition is not analyzable. If anything, from a philosophy perspective, I'd prefer that this patch didn't exist because SCEV did this internally. However, that doesn't appear to be pragmatic from a compile time perspective.
I also want to call out a methodology issue. If we use the test-suite as evidence of (lack of) profit, and the test suite contains only historical C and C++ code, we will never implement any transform which either a) shows up for other languages, or b) is idiomatic in newer code (e.g. new language versions for C/C++). I'm not saying you're wrong to collect this data, but we need to be careful about how we interpret it.
From a purely pragmatic perspective, I don't want to get in the habit of breaking forward progress for perfection. @nikic, I see you as coming dangerously close to that here. I do think that if Max can show a savings elsewhere, we should allow this patch.
@mkazantsev As a pragmatic means to address the raised review concern, have you investigated why using SCEV here is more expensive? One thing I'm wondering about is whether we're invalidating SCEV too aggressively in one of the existing transforms. I don't have strong evidence of this, but computing SCEVs for a loop isn't *that* expensive, and given a properly structured loop pipeline should happen approximately once. I'd be tempted to instrument the various "forgetX" calls and run one of the worst offenders compile time wise. Do all of the invalidation make sense? Just based on a quick skim of the code, I see a couple call sites in LoopSimplifyCFG and IndVarSimplify which look likely to invalidate far more than is required. In particular, if we were invalidating multiple times in a loop nest, that could cause degenerate compile time.
(Edit - In particular, the couple of calls to forgetTopmostLoop look to be a huge hammer for a problem which could be solved more narrowly. If any of those show up in profiling, we've got our cause.)
I also second @nikic request for a less reduced reproducer. It's hard to tell whether there's another approach here without seeing the original motivating example. The fact that all of the test cases are handled by SimplifyInst (which, btw, unrolling already has most of the code for) raises questions here. It'd be nice to put those to bed once and for all.
I think we need to keep two things separate here: Passes like IndVars, IRCE and LSR perform actual loop reasoning. The entire purpose of SCEV is to reason about add recurrences, and that's exactly what these passes need. Using anything but SCEV wouldn't really make sense in this context. InstSimplify knows nothing (or at least very little) about recurrences.
The case here is different: This optimization only has a nominal relation to loops. There are no add recurrences involved, there is no "scalar evolution" going on -- rather, SCEV is being used as a generic symbolic reasoning framework, which is not a use-case it is intended or optimized for.
You are right that SCEV has serious compile-time issues in general, and "make SCEV faster" is certainly preferred over "don't use SCEV". However, in this particular instance I would argue that even if SCEV were blazing fast, it still wouldn't be the right tool for the job. Of course, there is the little caveat that we don't have any other symbolic reasoning framework in LLVM that has the same level of power, so I can certainly appreciate why it is being used here. I just want to highlight that this is not a case where SCEV is the obviously correct choice, as it is for passes like IndVars or LSR.
Worth mentioning that currently SCCP doesn't use PredicateInfo, which means it doesn't perform folding based on dominating branches. This is currently only enabled in the IPSCCP pass.
Yeah, I agree that using test-suite for evaluation is dangerous in that it is a very biased collection of code. I wouldn't be surprised if most of the optimizations I implemented have zero hits on test-suite, because they are targeted at a different frontend :) What test-suite mainly tells us is that there exists some large category of code for which an optimization is not relevant. It doesn't tell us whether there exists some other large category where it is.
I do like your framing in terms of this being an "interesting class" of loops, and agree that they are worth optimizing -- the question is mostly just how hard we should try :)
In any case, I don't want to block this patch. If you think this is the right way to approach the problem, I'm fine with it going in as-is.
llvm/lib/Transforms/Scalar/LoopDeletion.cpp:178
Legit clang-tidy warning: If you have a long chain of adds, this is going to blow the stack. Should probably use a worklist?
One thing I'm wondering about is whether we're invalidating SCEV too aggressively in one of the existing transforms. I don't have strong evidence of this, but computing SCEVs for a loop isn't *that* expensive, and given a properly structured loop pipeline it should happen approximately once. I'd be tempted to instrument the various "forgetX" calls and profile one of the worst compile-time offenders. Does all of the invalidation make sense? Just based on a quick skim of the code, I see a couple of call sites in LoopSimplifyCFG and IndVarSimplify which look likely to invalidate far more than is required. In particular, if we were invalidating multiple times in a loop nest, that could cause degenerate compile time.
Even if so, it's not a problem of this patch, right? I can spend some time trying to investigate this if we all agree that it can go with SCEV provided we win some time elsewhere. We don't have this agreement so far. Two concerns I have:
- I don't believe that 0.2% time is something that can be reliably measured by any tool available to me, even if I can somehow get the environment to measure clang;
- SCEV invalidation would be the last place to find this "elsewhere", because bugs caused by un-invalidated SCEV are *extremely* (5 times underlined) nasty. I don't want to commit to investigating and fixing all of them, unless tortured, or hung above a volcano. :)
It's hard to tell whether there's another approach here without seeing the original motivating example. The fact that all of the test cases are handled by SimplifyInst (which, btw, unrolling already has most of the code for) raises questions here. It'd be nice to put those to bed once and for all.
Currently sketching a prototype to see if it works. My speculation is that InstSimplify might be enough for my motivating example. However, this will require massive refactoring of InstSimplify itself (some of its static functions will need to be made publicly available). It's a lot of work just to duplicate a part of what SCEV does out of the box.
The case here is different: This optimization only has a nominal relation to loops.
I disagree. If we think something is a loop, we will run all our loop passes (including IndVars and LSR) on it. However, if loop deletion proves it's not a loop, we may save time on costly opts in other places. But again, whether this will or will not happen frequently, or whether the overall impact is positive, is very speculative ground on which neither of us has (or can have) enough evidence.
There are no add recurrences involved, there is no "scalar evolution" going on -- rather, SCEV is being used as a generic symbolic reasoning framework, which is not a use-case it is intended or optimized for.
And LLVM is not "low level" and not a "virtual machine", right? :) It's just wordplay.
Just because the initial motivation for SCEV was loop passes (maybe IndVars), I don't see any reason why it cannot be used as a generic symbolic framework. It has all the power of one, it has tons of simplifications that have nothing to do with loops, and its implication engine does a lot to prove non-recurrent facts. Today we even use it for pointers. I see no reason why we should stick to what was relevant in the past and hold back further expansion of its usage.
Of course, there is the little caveat that we don't have any other symbolic reasoning framework in LLVM that has the same level of power, so I can certainly appreciate why it is being used here. I just want to highlight that this is not a case where SCEV is the obviously correct choice, as it is for passes like IndVars or LSR.
I think that we should not have multiple frameworks that do the same thing, some of them more and some less powerful. What you are saying makes sense, but to me it just means that we need a way to make SCEV more lightweight. Maybe configure its proof engine to limit itself to the simplest ways of proving things.
I'll give a chance to a version that only uses the simplest ways of proving things through non-recursive reasoning, and try to write a version with InstSimplify as a PoC.
This is surprising to me -- why does this require changes to InstSimplify? I would have expected that using it is simpler, not more complicated, as the SCEV use effectively requires you to reimplement the whole createSCEV() logic (which this patch only does to a small degree).
The case here is different: This optimization only has a nominal relation to loops.
I disagree. If we think something is a loop, we will run all our loop passes (including IndVars and LSR) on it. However, if loop deletion proves it's not a loop, we may save time on costly opts in other places. But again, whether this will or will not happen frequently, or whether the overall impact is positive, is very speculative ground on which neither of us has (or can have) enough evidence.
I think my point may have been lost here: Yes, there is a loop structure here, but you don't make use of it, because you're specifically trying to prove that it is not actually a loop. You are looking at only one iteration, in which case it is straight-line code. Any loop-based reasoning SCEV does is actively unwanted here.
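The "straight-line code" framing can be sketched without SCEV at all. Below is a toy reachability walk (made-up names, plain C++, with the branch outcomes on iteration 1 assumed to be precomputed by some folding step) that mirrors the shape of the patch's canProveExitOnFirstIteration: visit only blocks reachable on the first iteration, and check whether the header can be re-entered via the latch:

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Toy CFG block: up to two successors, plus (when known) which of the
// two successors the terminating branch takes on the first iteration.
struct Block {
  std::vector<std::string> Succs;
  int CondKnown = -1; // -1: both successors possible; 0/1: branch folds
};

// Walk blocks reachable from the header on the first iteration; if the
// header is never re-entered via the latch, the backedge is provably dead.
bool exitsOnFirstIteration(const std::map<std::string, Block> &CFG,
                           const std::string &Header,
                           const std::string &Latch) {
  std::set<std::string> Live{Header};
  std::queue<std::string> Work;
  Work.push(Header);
  while (!Work.empty()) {
    std::string B = Work.front();
    Work.pop();
    const Block &Blk = CFG.at(B);
    std::vector<std::string> Taken;
    if (Blk.CondKnown >= 0 && Blk.Succs.size() == 2)
      Taken = {Blk.Succs[Blk.CondKnown]}; // branch folds on iteration 1
    else
      Taken = Blk.Succs;                  // both edges may be taken
    for (const std::string &S : Taken) {
      if (B == Latch && S == Header)
        return false; // backedge may be taken; can't break it
      if (S != Header && Live.insert(S).second)
        Work.push(S);
    }
  }
  return true; // header never re-entered: backedge is dead
}
```

The interesting work is of course in computing CondKnown for each branch, which is exactly where the SCEV-vs-InstSimplify question of this thread comes in.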
There are no add recurrences involved, there is no "scalar evolution" going on -- rather, SCEV is being used as a generic symbolic reasoning framework, which is not a use-case it is intended or optimized for.
And LLVM is not "low level" and not a "virtual machine", right? :) It's just wordplay.
Just because the initial motivation for SCEV was loop passes (maybe IndVars), I don't see any reason why it cannot be used as a generic symbolic framework. It has all the power of one, it has tons of simplifications that have nothing to do with loops, and its implication engine does a lot to prove non-recurrent facts. Today we even use it for pointers. I see no reason why we should stick to what was relevant in the past and hold back further expansion of its usage.
Okay, if we consider SCEV as a generic symbolic reasoning framework that is intended for non-loop analysis, then the usage here makes more sense to me. This isn't how I've been thinking about SCEV.
As said before, I'm okay with this going in as-is if you think it's the right approach -- I think I made this a bigger issue than it really deserved, because it ended up being a kind of proxy issue for the general problem of improvements to SCEV or new uses of SCEV leading to inevitable compile-time regressions. I don't actually think the regression in this patch is particularly problematic, my concern is more cumulative in nature, as this is not the first (or second, or third) time this happened... There's always talk that we should make SCEV faster, but there has been little actual movement in that direction.
(If we can solve motivating cases via SCCP or some improvements / phase ordering changes to it, that would still be my first preference, as I think that pass is perfectly suited to this problem from an algorithmic perspective.)
Cut down to non-recursive reasoning. @nikic could you please check the CT impact? If it's lower now, then it's all about complex recursive rules of proof, and this is where we can seek to simplify SCEV's reasoning.
This is surprising to me -- why does this require changes to InstSimplify?
InstSimplify does not provide a public API for this:

const SCEV *LHSS = getSCEVOnFirstIteration(LHS, L, SE, FirstIterSCEV);
const SCEV *RHSS = getSCEVOnFirstIteration(RHS, L, SE, FirstIterSCEV);
S = SE.getAddExpr(LHSS, RHSS);

It has SimplifyAddInst that works with custom arguments, but it's currently static within the cpp file. I'd need to expose all of them. At least this one, maybe more.
The public API is SimplifyBinOp(), which works for all binary operators. See https://github.com/llvm/llvm-project/commit/5aa9bbc50a1fed1364caaa7edc9273fb38de990d for usage in this context. InstSimplify lacks an API that works generically over all instructions (with operands provided separately), but it exposes a couple of these high-level functions.
Oh, thanks! Didn't notice that. I'll now try to prototype an alternative solution with InstSimplify. But honestly, I'd be super surprised if SCEV's non-recursive proving is more expensive than this (and clearly SCEV's canonicalization is more powerful).
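For intuition about why the operands-provided-separately shape matters here: the caller wants to ask "what does this instruction fold to if its operands take their first-iteration values?" without mutating the IR. A toy analog of that interface (made-up names and a tiny fold set, not LLVM's API):

```cpp
#include <optional>
#include <string>

// Try to fold a binary operation given the operand values on the first
// iteration, when they are known. Returns nullopt if no fold applies.
std::optional<long> simplifyBinOpFirstIter(const std::string &Opcode,
                                           std::optional<long> LHS,
                                           std::optional<long> RHS) {
  // Some folds work even with one unknown operand.
  if (Opcode == "mul" && ((LHS && *LHS == 0) || (RHS && *RHS == 0)))
    return 0; // x * 0 == 0 regardless of x
  if (!LHS || !RHS)
    return std::nullopt; // can't fold: an operand is not known
  if (Opcode == "add") return *LHS + *RHS;
  if (Opcode == "sub") return *LHS - *RHS;
  if (Opcode == "mul") return *LHS * *RHS;
  if (Opcode == "icmp eq") return *LHS == *RHS ? 1 : 0;
  return std::nullopt;
}
```

The key property, shared with SimplifyBinOp(), is that the operands are parameters rather than read from the instruction, so substituted first-iteration values can be passed in directly.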
New results: https://llvm-compile-time-tracker.com/compare.php?from=b5c4fc0f232b6368bbd7cd8681a6931f2c30ad02&to=580918ee9cdb3aa6dae1c12d0633a7da5124881b&stat=instructions
I think these are approximately the same as before, so it doesn't look like limiting to non-recursive reasoning is worthwhile here.
Reworked using InstSimplify. It's good enough for my case. Let's hope that at least this has acceptable CT overhead.
It looks like the new implementation causes a segfault while building 7zip using the ReleaseLTO-g configuration: https://llvm-compile-time-tracker.com/show_error.php?commit=77e65ca9acb9db4de64a1ff8002f1d53470a0de3
Brought back the integer type check of branch conditions, removed test TODOs, and removed the null TTI with non-null DT, which looked extremely fishy.
Still crashing, here's the backtrace:
ld: /home/nikic/llvm-project/llvm/include/llvm/Support/Casting.h:104: static bool llvm::isa_impl_cl<To, const From*>::doit(const From*) [with To = llvm::Constant; From = llvm::Value]: Assertion `Val && "isa<> used on a null pointer"' failed.
Thread 2.1 "ld" received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff7c2e100 (LWP 388814)]
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7c70859 in __GI_abort () at abort.c:79
#2  0x00007ffff7c70729 in __assert_fail_base (fmt=0x7ffff7e06588 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=0x7ffff49afba0 "Val && \"isa<> used on a null pointer\"", file=0x7ffff49afaa0 "/home/nikic/llvm-project/llvm/include/llvm/Support/Casting.h", line=104, function=<optimized out>) at assert.c:92
#3  0x00007ffff7c81f36 in __GI___assert_fail (assertion=0x7ffff49afba0 "Val && \"isa<> used on a null pointer\"", file=0x7ffff49afaa0 "/home/nikic/llvm-project/llvm/include/llvm/Support/Casting.h", line=104, function=0x7ffff49afb28 "static bool llvm::isa_impl_cl<To, const From*>::doit(const From*) [with To = llvm::Constant; From = llvm::Value]") at assert.c:101
#4  0x00007ffff1669db7 in llvm::cast_retty<llvm::Constant, llvm::Value*>::ret_type llvm::dyn_cast<llvm::Constant, llvm::Value>(llvm::Value*) () from /home/nikic/llvm-project/build/bin/../lib/LLVMgold.so
#5  0x00007ffff43120d9 in SimplifyICmpInst(unsigned int, llvm::Value*, llvm::Value*, llvm::SimplifyQuery const&, unsigned int) () from /home/nikic/llvm-project/build/bin/../lib/LLVMgold.so
#6  0x00007ffff3a563d3 in canProveExitOnFirstIteration(llvm::Loop*, llvm::DominatorTree&, llvm::LoopInfo&) () from /home/nikic/llvm-project/build/bin/../lib/LLVMgold.so
It looks like the LHS argument to SimplifyICmpInst is null. I suspect this may be due to reference invalidation, commented inline.
llvm/lib/Transforms/Scalar/LoopDeletion.cpp:302
This acquires a reference to FirstIterValue[&PN], then calls getValueOnFirstIteration(), which may invalidate it, and then writes to it.
llvm/lib/Transforms/Scalar/LoopDeletion.cpp:302
Wow, that's fishy! Thanks!
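The bug identified above is a general reference-invalidation hazard with growable maps: a reference obtained from the map dangles if a later lookup makes the map grow before the write. A minimal sketch with a toy reallocating map (not DenseMap itself, but the same failure mode) showing the safe compute-then-store shape:

```cpp
#include <string>
#include <utility>
#include <vector>

// Toy map whose operator[] may reallocate its backing storage,
// invalidating previously obtained references -- like DenseMap growth.
struct ToyMap {
  std::vector<std::pair<std::string, int>> Entries;
  int &operator[](const std::string &K) {
    for (auto &E : Entries)
      if (E.first == K)
        return E.second;
    Entries.push_back({K, 0}); // may reallocate: old references dangle
    return Entries.back().second;
  }
};

// Buggy shape (do NOT do this): the reference is taken before a call
// that may insert into the same map.
//   int &Slot = Map[Key];     // reference into Map's storage
//   Slot = compute(Map, ...); // compute() inserts -> Slot now dangles
//
// Safe shape: fully compute the value first, then index the map once.
int computeAndStore(ToyMap &Map, const std::string &Key, int Dep) {
  int V = Map["dep" + std::to_string(Dep)] + 1; // may insert/reallocate
  Map[Key] = V; // fresh lookup after all insertions: no dangling write
  return V;
}
```

The fix in the patch follows the same shape: finish getValueOnFirstIteration() before touching FirstIterValue[&PN].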
Compile-time with the reference invalidation issue fixed: https://llvm-compile-time-tracker.com/compare.php?from=b5c4fc0f232b6368bbd7cd8681a6931f2c30ad02&to=9bb9b602fd58e6fdd353cdfedec56980a40230c1&stat=instructions Impact is small and without outliers. I think this is as cheap as it gets :)
I was able to catch this bug on our Fuzzer tests, and the fix in the last revision helped.
So, can we go with it now? There is one more CT opportunity here, which is replacing the eager propagation of phi values with lazy propagation. But the last time I tried it, it crashed, so I'd prefer that to be a separate patch.
LGTM
llvm/lib/Transforms/Scalar/LoopDeletion.cpp:190
Simple dyn_cast<BinaryOperator> seems simpler than going through PatternMatch here?