Page MenuHomePhabricator

[SimplifyCFG] Tail-merging all blocks with `unreachable` terminator
Needs ReviewPublic

Authored by lebedev.ri on Jun 24 2021, 11:42 AM.

Details

Summary

Unlike ret/resume, for unreachable currently no such tail-merging
is being performed by clang. There has been at least one previous attempt
at this - D29428.

@rnk noted in D104445, this may be somewhat problematic for certain backends
that "form regions later in the backend", but as per the feedback in D104870
this appears to be a non-problem.

As for the motivation for this, the goal basically is to de-pessimize(*)
the source code that either uses exceptions, and/or is built with assertions,
and/or with ASAN/UBSan instrumentation in warnings-as-errors more.

In all of these cases there's a few function-terminating IR patterns,
but they are repeated over and over again, bloating the function size,
which naturally counts towards inlining cost, which naturally lowers
the chances of the functions compiled under such conditions from being inlined.
(* the insufficient inlining is the pessimization)

As discussed in D101231, while we could technically teach inliner to exempt
such function-terminating blocks from the cost calculation,
we don't really want to do that, because we'd end with even more cold code.
But instead we can achieve much the same goal by decreasing the inline cost
via tail-merging & sinking common code.

All that being said, this is expected to result in *bigger* codesizes,
both because the inliner actually did something now,
and because ultimately the tail duplication in backend might have
undone what we have done here..

Diff Detail

Event Timeline

lebedev.ri created this revision.Jun 24 2021, 11:42 AM
lebedev.ri requested review of this revision.Jun 24 2021, 11:42 AM

Yes, AMDGPU has to deal with this anyway. We have the AMDGPUUnifyDivergentExitNodes patch which essentially does the same thing to merge unreachables

Yes, AMDGPU has to deal with this anyway. We have the AMDGPUUnifyDivergentExitNodes patch which essentially does the same thing to merge unreachables

@arsenm thank you for commenting!

This change deserves some discussion. To address the issues with regions, we shouldn't rely on my hazily remembered understandings, we should consult the experts.

To those I just added to the review, the question is, will tail merging calls to noreturn functions in IR (BranchFolding already does it in codegen) be an issue for the subsystems you contribute to.

llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

I expected your code to fire on this test case. Can you explain why this example isn't getting tail merged?

Consider this example: https://gcc.godbolt.org/z/ox16a9P1z

[[noreturn]] void abort1();
[[noreturn]] void abort2();
[[noreturn]] void abort3();
bool cond();
void doAsserts() {
    if (cond()) abort1();
    if (cond()) abort2();
    if (cond()) abort3();
}

I think it is more canonical to leave these unreachable terminators in place after the calls to noreturn functions, rather than merging the unreachables together.

I just want to make sure your transform isn't firing, creating BBs, and then a later part of simplifycfg rolls the unreachables back up into place after the calls.

lebedev.ri marked an inline comment as done.Jun 24 2021, 1:29 PM
lebedev.ri added inline comments.
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

I expected your code to fire on this test case. Can you explain why this example isn't getting tail merged?

It fired, we didn't sink anything, and SimplifyCFGOpt::simplifyUnreachable() decided to undo it.

rnk added inline comments.Jun 24 2021, 2:08 PM
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

Got it, and we want to avoid that because otherwise it will make the overall pass return true to indicate that it changed something, which will make the parent pass manager re-run more passes.

lebedev.ri marked 2 inline comments as done.Jun 24 2021, 2:19 PM
lebedev.ri added inline comments.
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

True.

As far as i'm aware that only results in potentially invalidating analysises,
i'm not aware of that triggering another optimization pass runs.

IIRC that part of SimplifyCFGOpt::simplifyUnreachable() is a pretty important canonicalization,
because e.g. instcombine can't modify cfg.

Given that tail-merging already happens in some scenarios, I don't see any additional debug-info difficulties in doing it for Unreachable, over anything else. (For variable locations and source locations at least).

This change deserves some discussion. To address the issues with regions, we shouldn't rely on my hazily remembered understandings, we should consult the experts.

To those I just added to the review, the question is, will tail merging calls to noreturn functions in IR (BranchFolding already does it in codegen) be an issue for the subsystems you contribute to.

Thanks for letting me know! I guess WinEH is fine because WinEH makes sure to clone blocks that belong to multiple funclets, right? I think Wasm should be OK as well as WinEH works, because we use WinEHPrepare, so we benefit from this transformation as well.

lebedev.ri marked an inline comment as done.Jun 25 2021, 4:31 PM

Thank you @arsenm, @jmorse & @aheejin!
Looks like, as i predicted, all affected parties already have to deal with such IR, so this isn't an issue.

rnk added a subscriber: aeubanks.Jun 29 2021, 12:11 PM

I think all my high level concerns are addressed then. I think it was worth letting all those folks know about this change. Users will also probably notice this change: it will affect the source location of lots of noreturn calls.

llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

Even if it only invalidates analyses, I think this is worth addressing before landing this. Ideally this code would directly call the heuristic that "sink from common predecessors" uses, but if that isn't available, I think you could approximate it by not merging unreachable terminators when the previous non-debug instruction is a noreturn call with distinct callees. We know that is unprofitable, and that accounts for most blocks ending in unreachable. It saves compile time from IR churn too.

@aeubanks, what are the consequences of passes indicating that they changed the IR when they actually didn't?

lebedev.ri marked an inline comment as done.Jun 29 2021, 12:20 PM

Thank you for taking a look.

llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

This is impossible to address.

I think you could approximate it by not merging unreachable terminators when the previous non-debug instruction is a noreturn call with distinct callees.

I can not, because fixing lack of sinking in such cases is basically the very next step here.

rnk added inline comments.Jun 29 2021, 12:58 PM
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

I can not, because fixing lack of sinking in such cases is basically the very next step here.

This is the transform I'm talking about avoiding, and I don't think we plan to do this in the next step:

bb1:
  call void @abort1()
  unreachable
bb2:
  call void @abort2()
  unreachable

->

bb1:
  br label %common
bb2:
  br label %common
common:
  %callee = phi ... @abort1 ... @abort2
  call void ... %callee()
  unreachable

Right? This would make direct calls indirect, which is less canonical. I doubt this is going to change soon.

This is impossible to address.

I guess what you are saying is that this isn't possible to implement with the current code and data structures. We'd need to incorporate the structure of the instruction before unreachable into the map. This makes me think maybe it would be better to extend the SinkFromCommonPredecessors logic to consider blocks ending in unreachable. The current code is essentially restructuring the CFG in a way that is convenient for that function. I think transforms should avoid changing the IR before they know if transformation is really profitable, and it seems like the profitability heuristic is over there.

lebedev.ri marked an inline comment as done.Jun 29 2021, 1:16 PM
lebedev.ri added inline comments.
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

Right? This would make direct calls indirect, which is less canonical. I doubt this is going to change soon.

Right. This isn't going to change.
However, consider

bb1:
  call void @abort1()
  unreachable
bb2:
  call void @abort2()
  unreachable
bb3:
  call void @abort2()
  unreachable

->

bb1:
  call void @abort1()
  unreachable
bb2:
  br label %bb2.bb3.common
bb2.bb3.common:
  call void @abort2()
  unreachable

Also, consider:

bb1:
  call void @abort1()
  unreachable
bb2:
  call void @abort1()
  br label %bb3
bb3:
  unreachable

My point being, we can't realistically say that we will/won't succeed in sinking stuff.

aeubanks added inline comments.Jun 29 2021, 7:33 PM
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

there's no correctness issue with saying that we modified IR if we didn't actually, it'll just invalidate analyses, causing more work when they are recomputed in later passes

might be worth putting this through http://llvm-compile-time-tracker.com/

lebedev.ri marked an inline comment as done.Jun 30 2021, 12:19 AM

@aeubanks thank you for taking a look!

llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

Since you asked, sure: https://llvm-compile-time-tracker.com/compare.php?from=1f169a774cb865659cefe085e70a56a884e3711e&to=fc54bb9a8ef85bd76dd9e934b2546f4beadc5b5e&stat=instructions
I'm not sure what this tells us here.
Since the instruction stat correlates with the size changes, i guess we could say that it lead to more inlining,
and more IR to chew through. Which is pretty much the expected outcome.

nikic added inline comments.Jun 30 2021, 12:36 AM
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

This shows a 10% increase in code size on mafft with LTO and a few others also increase by multiple percent points. Did you rerun @rnk's test on clang Release+Assert code size with this patch?

It looks like large code size increases are still the blocker for this patch, as they were back then.

lebedev.ri marked 2 inline comments as done.Jun 30 2021, 12:58 AM
lebedev.ri added inline comments.
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

This shows a 10% increase in code size on mafft with LTO and a few others also increase by multiple percent points.

Yep.

Did you rerun @rnk's test on clang Release+Assert code size with this patch?

I have not because there is no reason to expect that the outcome is different.
(It will be somewhat different, because the approach is somewhat different)

It looks like large code size increases are still the blocker for this patch, as they were back then.

I'm not quite sure how we arrive at this conclusion.

Let me make a comparison: when one tries to paint something,
it is expected that not only said something will get colored,
but the paint amount will use up.

What i'm saying is that the effect this has is not unexpected, on the contrary, it is expected.
We successfully decrease the amount of IR bloat by assertion blocks,
decreasing the size of the functions they are in,
and naturally that makes some of them more eligible for inlining.
which happens, and increases codesize.

I would like to also call-back to the disscussion in D101468,
where we had very much the same disscussion, and actually i argued that it was bad,
but @nikic argued that said change is good since we no longer overestimate the inlining cost,
and i if that leads to an overestimation, then the problem is in inliner.
I'm not sure how in this patch the views changed to diametrically opposite ones :)

lebedev.ri marked an inline comment as done.Jun 30 2021, 2:00 AM

(@dmgreen you might be interested in also looking at the size numbers on the benchmarks you track)

llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

Forgot to mention: "There are three kinds of lies: lies, damned lies, and statistics."
The n% increase in code size is a pretty meaningless number,
and i'm sad to see it being used in such a harsh blocking manner.
What we should at least do, is look at how it compares with assert-less code.

I don't yet have clang numbers, but here's some for RawSpeed:

$ stat --printf="%s %n\n"  build-release-*/src/utilities/rsbench/rsbench | sort
17188264 build-release-new/src/utilities/rsbench/rsbench
17234640 build-release-old/src/utilities/rsbench/rsbench
17464336 build-release-with-asserts-new/src/utilities/rsbench/rsbench
17508840 build-release-with-asserts-old/src/utilities/rsbench/rsbench

I.e. -DNDEBUG->-UNDEBUG is +1.6% increase,
while old->new (i.e. this patch) causes -0.25% decrease.

Let me get these numbers for clang...

lebedev.ri added inline comments.Jun 30 2021, 10:50 AM
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

And here's clang numbers.

$ stat --printf="%s %n\n" build-release-*/bin/clang-13 | sort
103032240 build-release-old/bin/clang-13
103046136 build-release-new/bin/clang-13
123732704 build-release-with-asserts-old/bin/clang-13
123882984 build-release-with-asserts-new/bin/clang-13

I.e. -DNDEBUG->-UNDEBUG is +20% increase,
while old->new (i.e. this patch) causes +0.1% increase for assert-ful build, and +0.01 for assert-less one.

But what this really tells us is that the numbers will vary depending on the underlying libc implementation.
The thing is, glibc's __assert_fail() has 4 arguments (the stringified assertion, filename, line, function),
and in worst-case scenario we'll need a PHI for each one of them, yet currently the profitability check
only allows a single PHI.

So to reproduce @rnk's numbers, he'd have to redo the test on whatever platform used originally.

Another way to spell this, the regression will appear later when profitability check is tuned :)

(@dmgreen you might be interested in also looking at the size numbers on the benchmarks you track)

There are some small changes at -Oz, but smaller enough to not worry about and not reliably better or worse. The codesize examples I'm running may not have a lot of unreachable terminators, being bare-metal programs. You don't tend to abort when you have no OS to do anything sensible with the abort.

rnk added inline comments.Jun 30 2021, 4:52 PM
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

My reading of the llvm compile time tracker results is that this patch may result in a modest (~1% or less) compile time increase. That could be in the noise. The cost may mostly come from the analysis invalidation, which could be avoided if this were implemented in SinkCodeFromCommonPredecessors. I don't consider this a hard blocker if that is onerous.

The n% increase in code size is a pretty meaningless number,
and i'm sad to see it being used in such a harsh blocking manner.

I think it is more constructive to think of the engagement here as early feedback. People will notice code size increases and provide feedback, it's just a question of when. Reviewers are trying to be helpful.

W.r.t. __assert_fail, consider that CodeGen tail duplication may ultimately re-duplicate all the calls to __assert_fail. If that's the case, we shouldn't do this transform: it would throw away source location information for no gain. The more phis we create, the more likely it is to trigger tail duplication in the backend.

However, there are many common calls to __assert_fail from things like the llvm::cast template. These are typically inlined and can be tail merged with one phi for the message, the file, line, and function should all be the same.

Anyway, are you set on this approach, or would you consider the proposed alternative?

lebedev.ri marked an inline comment as done.Jul 1 2021, 2:57 AM
lebedev.ri added inline comments.
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

(Note that the compile time numbers aren't final, because this lacks
further profitability checks relaxation in sinking logic.)

My reading of the llvm compile time tracker results is that this patch may result in a modest (~1% or less) compile time increase. That could be in the noise.

FWIW i do agree that this will obviously affect the compile time.

The cost may mostly come from the analysis invalidation,

The *avoidable* portion of the cost.
The cost that comes from succeeding in the transformation,
and enabling whatever next transformation will remain.

I think i should mention that i believe that for any non-trivial function
the simplifycfg will likely already report that it changed things,
and for trivial things it shouldn't be too costly to recompute analysis.
This may be a faulty view, but that is what i think.

The n% increase in code size is a pretty meaningless number,
and i'm sad to see it being used in such a harsh blocking manner.

I think it is more constructive to think of the engagement here as early feedback. People will notice code size > increases and provide feedback, it's just a question of when. Reviewers are trying to be helpful.

Right, i agree, though it didn't quite read as such to me,
but it may be again just that the tone doesn't quite always roundtrip perfectly through translation.

which could be avoided if this were implemented in SinkCodeFromCommonPredecessors. I don't consider this a hard blocker if that is onerous.
<...>
Anyway, are you set on this approach, or would you consider the proposed alternative?

Hmm, wait, i lost the thought. What were implemented in where? The profitability check before tail-merging?
Or not doing tail-merging eagerly/early, but instead when visiting a function-terminating block,
scan the function for all the blocks with the same terminator and only tail-merge iff we can actually sink?
The latter sounds like it will result in some quadratic behavior.

the description doesn't state why this is desirable
are there performance improvements? I would initially think that this is for code size, but the conversation indicates otherwise

nikic added inline comments.Jul 1 2021, 9:52 AM
llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

What i'm saying is that the effect this has is not unexpected, on the contrary, it is expected.
We successfully decrease the amount of IR bloat by assertion blocks,
decreasing the size of the functions they are in,
and naturally that makes some of them more eligible for inlining.
which happens, and increases codesize.

Just to double check: Has someone confirmed that the code size increases we see are indeed caused (or caused primarily) by the inlining interaction?

I would like to also call-back to the disscussion in D101468,
where we had very much the same disscussion, and actually i argued that it was bad,
but @nikic argued that said change is good since we no longer overestimate the inlining cost,
and i if that leads to an overestimation, then the problem is in inliner.
I'm not sure how in this patch the views changed to diametrically opposite ones :)

My view here hasn't really changed in principle, but the practical aspects here are quite different. D101468 had some code size wins, some losses, and an overall 0.1% regression. Here we see many regressions in the 1-10% range. D101468 was all of "the right thing to do", had a clear motivation for vectorization and had fairly limited code size impact.

For this patch, it seems like "the right thing to do" on an abstract level (in the sense that we do the same thing for other terminators), but it also has a non-trivial code size impact and the overall motivation isn't clear to me. Or at least, the patch summary doesn't say what the larger motivation here is. I would have guessed that the motivation is to reduce code size by tail merging and sinking, but if in practice the reverse happens due to inlining interaction, then I'm not sure that really makes sense. Maybe all I'm really looking for is clarification on what the motivation / bigger picture here is?

My reading of the llvm compile time tracker results is that this patch may result in a modest (~1% or less) compile time increase. That could be in the noise.

@rnk: The way to read the numbers is: Anything colored is likely not noise. For most benchmarks, the noise floor is <0.05%.

lebedev.ri edited the summary of this revision. (Show Details)Jul 1 2021, 12:00 PM
lebedev.ri added a reviewer: davidxl.
lebedev.ri marked an inline comment as done.

Tried to update the description to talk about the motivation.

llvm/test/Transforms/SimplifyCFG/tail-merge-noreturn.ll
129–130

What i'm saying is that the effect this has is not unexpected, on the contrary, it is expected.
We successfully decrease the amount of IR bloat by assertion blocks,
decreasing the size of the functions they are in,
and naturally that makes some of them more eligible for inlining.
which happens, and increases codesize.

Just to double check: Has someone confirmed that the code size increases we see are indeed caused (or caused primarily) by the inlining interaction?

I have just compared the stats for the vanilla test-suite as a whole, and 7zip specifically:
while there is an increase in assembly instruction count, and increase in tail duplication,
there is an increase in the count of IR instructions and decrease of the function/block count
at the end of middle-end pipeline, and finally more inline.NumInlined.
So i'm going to go with "yes".

are there any benchmarks (public or not) that show benefits with this patch?