This implements the approach disscussed in D104870:
instead of simply alaways tail-merging all unreachable blocks,
first try to group the calls that precede unreachable,
and only merge the ones where grouping succeeded.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Numbers are in:
D104870 was: (as of https://reviews.llvm.org/D104870#inline-998899)
https://llvm-compile-time-tracker.com/compare.php?from=1f169a774cb865659cefe085e70a56a884e3711e&to=fc54bb9a8ef85bd76dd9e934b2546f4beadc5b5e&stat=instructions
This now is: https://llvm-compile-time-tracker.com/compare.php?from=2353e1c87b09c20e75f0f3ceb05fa4a4261fe3dd&to=bed7b8df4565f4503889a19235e853b985ca3481&stat=instructions
So slightly better compile-time-wise, basically the same size impact.
llvm/test/Transforms/PhaseOrdering/AArch64/peel-multiple-unreachable-exits-for-vectorization.ll | ||
---|---|---|
39 | It looks to me like tail merging unreachable blocks is preventing vectorization in this test case and the next, which seems like a blocking issue. The test was added here, if that helps understand why it no longer works: |
I think one big reason why this is indeed better is because now it will be trivial to introduce
(and pass here) bool SkipProfitabilityChecks option to SinkCommonCodeFromPredecessors().
Maybe others have ideas for how to make it nicer.
Looking at it, it's actually not *that* ugly as it seemed when i was writing the code.
I guess i could wrap this tail-merging/undo into a class, not sure if that would help.
What do other folks (@nikic @aeubanks @asbirlea) think about this approach?
llvm/test/Transforms/PhaseOrdering/AArch64/peel-multiple-unreachable-exits-for-vectorization.ll | ||
---|---|---|
39 |
llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp | ||
---|---|---|
373 | Fix comment |
llvm/test/Transforms/PhaseOrdering/AArch64/peel-multiple-unreachable-exits-for-vectorization.ll | ||
---|---|---|
39 | I'm not actually sure that we can solve this within LV itself, I'm pretty sure LV does expand the backedge taken count, As i see it, the options are:
|
llvm/test/Transforms/PhaseOrdering/AArch64/peel-multiple-unreachable-exits-for-vectorization.ll | ||
---|---|---|
39 | I think the proper way to address this (and other poison-safety issues in SCEV) is to add umin variants in both IR and SCEV that don't propagate op2 poison if op1 is zero. |
It looks like the new version still has the large code size regressions (9% on mafft, 3% on 7zip). I understand that some code size increase is expected (and intended), but I don't think a particularly good case for the tradeoff has been made yet (in terms of where / how much performance this is buying for more code size). Though maybe I missed this in previous discussion threads.
I should probably test how this impacts rust code (which has a lot of unreachable terminators in release builds due to bounds checks), though that requires applying this patch on top of LLVM 13.
How do we know that whatever compile time benchmark we see regresses is a reliable indicator in this regard?
I think this is yet another irresolvable clash between the optimization and compilation time/size.
-O3 does not mean "please quickly give me minimal code", there's -Os/-Oz for that.
IOW if you indent to block a patch, could you please actually do so, not just waguley imply so?
I should probably test how this impacts rust code (which has a lot of unreachable terminators in release builds due to bounds checks), though that requires applying this patch on top of LLVM 13.
Were you able to to so?
As somebody without much context for this patch reading the patch summary, there's no "why" for this patch. The llvm-compile-time-tracker numbers are all negative in various aspects. I think nikic's question is do you have metrics showing that this actually helps performance on any code?
Sorry for the delay. I tested this together with your recent commit removing sinking limitations for unreachable blocks. The result was close to no change in either compile-time or run-time (where "run-time" here is non-LLVM compile-time, so a fairly narrow workload). All sub-1% and mostly below the significance threshold. Unfortunately I wasn't able to get code size information because the necessary infrastructure was broken at the time. I did look at some rustc artifacts as a sanity check and the only larger increase I spotted was rustdoc by +0.4%.
I played with the patch a bit, and found that this approach has one major limitation as far as rust code is concerned: It only works if you have a single assert/panic/whatever function. All unreachable terminators are merged together and we can then only sink if the predecessors have the same call. Rust has a bunch of different panic functions depending on the situation. So if you have a bunch of array accesses that all call panic_bounds_check and then add a single assert, then the tail merging stops working. I expect that significantly limits the cases where the optimization applies.
Since asking that back then, i've come up with D117805, which indeed handles said more generic case,
and after that patch is accepted i'll rewrite this patch to use that new infra.
Now that the 'merge compatible invokes' is effectively done, let's revisit this.
I've reimplemented this using the approach innovated/invented there,
so now not only we don't just tail-merge everything,
not only do we only do that when sinking succeeds,
we now also handle multiple sets of mergeable calls.
I've tested the newest version against rust, and unfortunately it came back as a universal regression, both in terms of compile-time and run-time. Compile-time regressions up to 4% and run-time regressions in the ~1% range (not by much, but very consistently regressing).
Fix comment