This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Tail-merging all blocks with `unreachable` terminator, final take
Abandoned · Public

Authored by lebedev.ri on Jan 5 2022, 1:13 PM.

Details

Reviewers
rnk
reames
nikic
Summary

This implements the approach discussed in D104870:
instead of simply always tail-merging all unreachable blocks,
first try to group the calls that precede the unreachable,
and only merge the ones where grouping succeeded.
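To illustrate the intended shape, here is a minimal, made-up LLVM IR sketch (function names are hypothetical, not taken from the patch or its tests): two predecessors that each call the same no-return function before unreachable can be grouped, and their common tail sunk into a single merged block.

```llvm
; Before: two separate trap blocks with identical tails.
define void @f(i1 %c0, i1 %c1) {
entry:
  br i1 %c0, label %trap0, label %next
next:
  br i1 %c1, label %trap1, label %exit
trap0:
  call void @abort()
  unreachable
trap1:
  call void @abort()
  unreachable
exit:
  ret void
}
declare void @abort()

; After grouping succeeds, both branches can instead target one block:
;   trap.merged:
;     call void @abort()
;     unreachable
```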

https://llvm-compile-time-tracker.com/compare.php?from=3564551400224cd24dd8650dc2ace19174833af7&to=a4fe8518811df84d14a21071f3516ec3841cd369&stat=instructions

Diff Detail

Unit Tests: Failed

Event Timeline

lebedev.ri created this revision. Jan 5 2022, 1:13 PM
lebedev.ri requested review of this revision. Jan 5 2022, 1:13 PM
lebedev.ri edited the summary of this revision. Jan 5 2022, 1:36 PM

I think I prefer this version over the previous. This may be slightly ugly, but IMO it is functionally better (fewer analysis reruns and pass updates). Maybe others have ideas for how to make it nicer.

What do other folks (@nikic @aeubanks @asbirlea) think about this approach?

llvm/test/Transforms/PhaseOrdering/AArch64/peel-multiple-unreachable-exits-for-vectorization.ll
39

It looks to me like tail-merging unreachable blocks is preventing vectorization in this test case and the next, which seems like a blocking issue. The test was added here, in case that helps explain why it no longer works:
https://reviews.llvm.org/rG39cc0b8c68b8d316954ecfac0d1f8498ea42866c
@fhahn

I think I prefer this version over the previous. This may be slightly ugly, but IMO it is functionally better (fewer analysis reruns and pass updates).

I think one big reason why this is indeed better is that it will now be trivial to introduce
(and pass here) a bool SkipProfitabilityChecks option to SinkCommonCodeFromPredecessors().

Maybe others have ideas for how to make it nicer.

Looking at it, it's actually not as ugly as it seemed when I was writing the code.
I guess I could wrap this tail-merging/undo logic into a class; not sure if that would help.

What do other folks (@nikic @aeubanks @asbirlea) think about this approach?

lebedev.ri added inline comments. Jan 5 2022, 2:54 PM
llvm/test/Transforms/PhaseOrdering/AArch64/peel-multiple-unreachable-exits-for-vectorization.ll
39
xbolva00 added inline comments.
llvm/lib/Transforms/Scalar/SimplifyCFGPass.cpp
239

Fix comment

lebedev.ri added inline comments.
llvm/test/Transforms/PhaseOrdering/AArch64/peel-multiple-unreachable-exits-for-vectorization.ll
39

CC @aqjune @nikic

I'm not actually sure that we can solve this within LV itself,
but I would love to be proven wrong here.

I'm pretty sure LV does expand the backedge-taken count,
so I suppose merely not being allowed to expand the BTC wouldn't help here.

As I see it, the options are:

  • ignore this failure
  • adjust the test to mask the failure (I would hope adding noundef attributes should help?), potentially coupled with:
  • are there some missing reasoning bits in impliesPoison() and friends that could prevent this regression?
  • Introduce a UB-safe mode for SCEVExpander, lifting the backedge-taken-count poison-safety restriction
  • Prevent simplifycfg from merging conditions like that (as in, iff plain and/or isn't going to be used); see the IR sketch below
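For context on that last option, a minimal sketch of the difference between a plain or of the two exit conditions and the poison-safe, select-based "logical or" (made-up values, not taken from the test):

```llvm
define void @g(i1 %c0, i1 %c1) {
entry:
  ; a plain 'or' would propagate poison from %c1 even when %c0 is true:
  ;   %cond = or i1 %c0, %c1
  ; the select-based "logical or" never observes %c1 when %c0 is true:
  %cond = select i1 %c0, i1 true, i1 %c1
  br i1 %cond, label %trap, label %cont
trap:
  call void @abort()
  unreachable
cont:
  ret void
}
declare void @abort()
```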
nikic added inline comments. Jan 6 2022, 7:37 AM
llvm/test/Transforms/PhaseOrdering/AArch64/peel-multiple-unreachable-exits-for-vectorization.ll
39

I think the proper way to address this (and other poison-safety issues in SCEV) is to add umin variants in both IR and SCEV that don't propagate op2 poison if op1 is zero.
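Not the proposed variant itself, but as a sketch of the described semantics using existing IR: guarding an ordinary @llvm.umin with a select already gives the "op2 poison does not propagate when op1 is zero" behaviour, because select does not observe the unselected operand.

```llvm
define i64 @safe_umin(i64 %n, i64 %m) {
entry:
  %n.is.zero = icmp eq i64 %n, 0
  %min = call i64 @llvm.umin.i64(i64 %n, i64 %m)
  ; if %n is 0, the select picks the constant 0, so poison in %m (and
  ; hence in %min) does not propagate -- unlike a bare @llvm.umin call
  %res = select i1 %n.is.zero, i64 0, i64 %min
  ret i64 %res
}
declare i64 @llvm.umin.i64(i64, i64)
```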

lebedev.ri added inline comments. Jan 6 2022, 2:50 PM
llvm/test/Transforms/PhaseOrdering/AArch64/peel-multiple-unreachable-exits-for-vectorization.ll
39

Posted D116766

lebedev.ri marked an inline comment as done.

Ok, the regression has been dealt with :)

It looks like the new version still has the large code size regressions (9% on mafft, 3% on 7zip). I understand that some code size increase is expected (and intended), but I don't think a particularly good case for the tradeoff has been made yet (in terms of where / how much performance this is buying for more code size). Though maybe I missed this in previous discussion threads.

I should probably test how this impacts rust code (which has a lot of unreachable terminators in release builds due to bounds checks), though that requires applying this patch on top of LLVM 13.

It looks like the new version still has the large code size regressions (9% on mafft, 3% on 7zip). I understand that some code size increase is expected (and intended), but I don't think a particularly good case for the tradeoff has been made yet (in terms of where / how much performance this is buying for more code size). Though maybe I missed this in previous discussion threads.

How do we know that whatever compile-time benchmark we see regress is a reliable indicator in this regard?
I think this is yet another irresolvable clash between optimization and compilation time/size.
-O3 does not mean "please quickly give me minimal code"; there's -Os/-Oz for that.
IOW, if you intend to block a patch, could you please actually do so, not just vaguely imply it?

I should probably test how this impacts rust code (which has a lot of unreachable terminators in release builds due to bounds checks), though that requires applying this patch on top of LLVM 13.

Were you able to do so?

It looks like the new version still has the large code size regressions (9% on mafft, 3% on 7zip). I understand that some code size increase is expected (and intended), but I don't think a particularly good case for the tradeoff has been made yet (in terms of where / how much performance this is buying for more code size). Though maybe I missed this in previous discussion threads.

How do we know that whatever compile-time benchmark we see regress is a reliable indicator in this regard?
I think this is yet another irresolvable clash between optimization and compilation time/size.
-O3 does not mean "please quickly give me minimal code"; there's -Os/-Oz for that.
IOW, if you intend to block a patch, could you please actually do so, not just vaguely imply it?

As somebody without much context for this patch, reading the summary I don't see a "why" for it. The llvm-compile-time-tracker numbers are all negative in various aspects. I think nikic's question is: do you have metrics showing that this actually helps performance on any code?

nikic added a comment. Jan 24 2022, 2:20 AM

I should probably test how this impacts rust code (which has a lot of unreachable terminators in release builds due to bounds checks), though that requires applying this patch on top of LLVM 13.

Were you able to do so?

Sorry for the delay. I tested this together with your recent commit removing sinking limitations for unreachable blocks. The result was close to no change in either compile-time or run-time (where "run-time" here is non-LLVM compile-time, so a fairly narrow workload). All sub-1% and mostly below the significance threshold. Unfortunately I wasn't able to get code size information because the necessary infrastructure was broken at the time. I did look at some rustc artifacts as a sanity check and the only larger increase I spotted was rustdoc by +0.4%.

I played with the patch a bit, and found that this approach has one major limitation as far as rust code is concerned: It only works if you have a single assert/panic/whatever function. All unreachable terminators are merged together and we can then only sink if the predecessors have the same call. Rust has a bunch of different panic functions depending on the situation. So if you have a bunch of array accesses that all call panic_bounds_check and then add a single assert, then the tail merging stops working. I expect that significantly limits the cases where the optimization applies.
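To illustrate the limitation (hypothetical function names, not Rust's actual symbols): once all unreachable blocks are funnelled into one, sinking requires every predecessor to end in the same call, which fails as soon as two different panic helpers are involved.

```llvm
define void @h(i1 %c0, i1 %c1) {
entry:
  br i1 %c0, label %oob, label %next
next:
  br i1 %c1, label %assert, label %exit
oob:
  call void @panic_bounds_check()   ; one panic helper...
  unreachable
assert:
  call void @assert_failed()        ; ...a different one, so the calls
  unreachable                       ; cannot be sunk into a merged block
exit:
  ret void
}
declare void @panic_bounds_check()
declare void @assert_failed()
```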

lebedev.ri added a comment (edited). Jan 24 2022, 3:12 AM

I should probably test how this impacts rust code (which has a lot of unreachable terminators in release builds due to bounds checks), though that requires applying this patch on top of LLVM 13.

Were you able to do so?

Sorry for the delay. I tested this together with your recent commit removing sinking limitations for unreachable blocks.
The result was close to no change in either compile-time or run-time (where "run-time" here is non-LLVM compile-time,
so a fairly narrow workload). All sub-1% and mostly below the significance threshold. Unfortunately I wasn't able to
get code size information because the necessary infrastructure was broken at the time.
I did look at some rustc artifacts as a sanity check and the only larger increase I spotted was rustdoc by +0.4%.

I played with the patch a bit, and found that this approach has one major limitation as far as rust code is concerned:
It only works if you have a single assert/panic/whatever function.
All unreachable terminators are merged together and we can then only sink if the predecessors have the same call.
Rust has a bunch of different panic functions depending on the situation.
So if you have a bunch of array accesses that all call panic_bounds_check and then add a single assert,
then the tail merging stops working. I expect that significantly limits the cases where the optimization applies.

Since asking that back then, I've come up with D117805, which indeed handles that more generic case,
and once that patch is accepted I'll rewrite this one to use the new infrastructure.

lebedev.ri updated this revision to Diff 407118. Feb 9 2022, 4:26 AM
lebedev.ri edited the summary of this revision.

Now that 'merge compatible invokes' is effectively done, let's revisit this.

I've reimplemented this using the approach invented there,
so now not only do we avoid blindly tail-merging everything,
and not only do we do it solely when sinking succeeds,
we also handle multiple sets of mergeable calls.
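A made-up sketch of the "multiple sets" case (callee names hypothetical): grouping by callee yields two independently mergeable sets instead of one all-or-nothing merge.

```llvm
define void @multi(i1 %c0, i1 %c1, i1 %c2, i1 %c3) {
entry:
  br i1 %c0, label %trap.a0, label %bb1
bb1:
  br i1 %c1, label %trap.a1, label %bb2
bb2:
  br i1 %c2, label %trap.b0, label %bb3
bb3:
  br i1 %c3, label %trap.b1, label %exit
trap.a0:                       ; group A: callers of @panic_a
  call void @panic_a()
  unreachable
trap.a1:
  call void @panic_a()
  unreachable
trap.b0:                       ; group B: callers of @panic_b
  call void @panic_b()
  unreachable
trap.b1:
  call void @panic_b()
  unreachable
exit:
  ret void
}
declare void @panic_a()
declare void @panic_b()

; Group A merges into one block calling @panic_a, group B into another
; calling @panic_b; neither group blocks the other.
```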

nikic added a reviewer: nikic. Feb 21 2022, 1:20 AM

I've tested the newest version against rust, and unfortunately it came back as a universal regression, both in terms of compile-time and run-time. Compile-time regressions up to 4% and run-time regressions in the ~1% range (not by much, but very consistently regressing).

lebedev.ri abandoned this revision. Oct 18 2022, 5:46 PM
Herald added a project: Restricted Project. Oct 18 2022, 5:46 PM