This is an archive of the discontinued LLVM Phabricator instance.

[NewPM][Inliner] Mark inlined calls to functions in same SCC as callee as noinline
AbandonedPublic

Authored by aeubanks on Feb 25 2022, 11:41 AM.

Details

Summary

When processing the list of calls created by inlining, check each call
to see if the new call's callee is in the same SCC as the original
callee's. If so, mark the call as noinline.

This is an alternative to D98481, which forbids inlining functions in a
non-trivial SCC at all. This allows us to do one level of inlining, plus
inlining of any other calls to functions outside the SCC.

Hopefully fixes PR45253.

Diff Detail

Event Timeline

aeubanks created this revision.Feb 25 2022, 11:41 AM
aeubanks requested review of this revision.Feb 25 2022, 11:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 25 2022, 11:41 AM

will fix the two failing tests if we like this patch

this does cause fewer instances of inlining on llvm-test-suite

I did some performance testing on some internal benchmarks, seeing mixed results

aeubanks edited reviewers, added: nikic, mtrofin, davidxl, hoy, wenlei; removed: ctetreau.Feb 28 2022, 4:55 PM
nikic added inline comments.Mar 1 2022, 5:44 AM
llvm/lib/Transforms/IPO/Inliner.cpp
942

Shouldn't we also check && CG.lookup(CG.get(*NewCallee)) != C here? We do want to allow this if caller, the callee and the new callee are all in the same SCC, right? If everything is one SCC, then the inlining is size-limited through the usual mechanism.

nikic added a comment.Mar 1 2022, 7:17 AM

Testing this with rust, while D98481 only had minor effect on performance (regressions in the sub-1% range that are more than compensated by noinline removal it enables), this one has pretty big impact (filter by "check" on https://perf.rust-lang.org/compare.html?start=4a56cbec59903a830a5fc06c5c81956de4199584&end=a6d1b910a312aa142487362987a3ce22c6da7684) with regressions up to 12%.

This *might* be related to the missing check I pointed out above.

will fix the two failing tests if we like this patch

this does cause fewer instances of inlining on llvm-test-suite

I did some performance testing on some internal benchmarks, seeing mixed results

IIUC, we sometimes want to allow even recursive function inlining, up to a point; and same for mutually-recursive ones. For instance, if A<->B, both A and B may be worth inlining in B and A, respectively, for their respective base cases.

Since there are regressions we've observed, sounds like we can isolate 1-2 of those and better understand the problem?

mtrofin added a reviewer: kazu.Mar 1 2022, 7:23 AM
aeubanks updated this revision to Diff 412153.Mar 1 2022, 10:11 AM

skip noinline if in caller's SCC

going to rerun benchmarks with new revision

nikic accepted this revision.Mar 2 2022, 12:43 AM

Updated patch is performance-neutral on our side. Based on some spot checks, it does resolve the catastrophic inlining issue as well. The general approach also looks reasonable.

llvm/lib/Transforms/IPO/Inliner.cpp
942

Add a comment here for why the restriction exists.

This revision is now accepted and ready to land.Mar 2 2022, 12:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 12:43 AM
aeubanks updated this revision to Diff 412529.Mar 2 2022, 1:31 PM

add comment

still investigating one regressing internal benchmark

davidxl added inline comments.Mar 3 2022, 11:09 PM
llvm/lib/Transforms/IPO/Inliner.cpp
951

This feels like very strict. Is it possible to allow some level (under an option) of iterative inlining into the same SCC? This is because not all cases of inlining into a non-trivial SCC will result in compile time issue due to existing threshold based stopping mechanism (i.e. the new callee in SCC is too large).

llvm/test/Transforms/Inline/mut-rec-scc.ll
41

can you add a callsite to a function test3() which is outside the SCC, where the test3's inlining is deferred until test2 is inlined. This is to test that test3's inlining won't be suppressed.

nikic added inline comments.Mar 4 2022, 12:45 AM
llvm/lib/Transforms/IPO/Inliner.cpp
941–949

Just a suggestion, based on my understanding...

951

This does not restrict inlining into the same SCC. What it prevents is continuously peeling off more and more calls from a *different* non-trivial SCC.

From what I can tell, this is a rare case, only two programs in llvm-test-suite show any codegen changes with this patch.

davidxl added inline comments.Mar 4 2022, 10:49 AM
llvm/lib/Transforms/IPO/Inliner.cpp
951

To clarify, my understanding of the scenario is like the following

foo --> (a, b, c) where foo calls 'a'.

where a, b, and c forms anon trivial SCC.

Without the fix, we can end up with inlining of foo->a, foo->b, foo->c, foo->a, foo->b, .... until foo blows up (we don't have a caller limit set in the inliner).

With this patch, once foo-> a is inlined, the new callsite foo->b is marked as no inline.

Is that the right?

nikic added inline comments.Mar 4 2022, 11:22 AM
llvm/lib/Transforms/IPO/Inliner.cpp
951

Yeah, that sounds about right. I believe local inlining history will actually make sure that we only inline foo->a, foo->b, foo->c, which is what prevents this from simply inlining infinitely. But it still ends up flattening the SCC, which is particularly problematic if there are multi-edges in the SCC, in which case flattening is exponential.

D98481 forbids the foo->a inline in the first place, while this patch allows inlining foo->a, but forbids the foo->b inline. There are some cases where this kind of single-level inline is probably useful, e.g. if inlining at this call-site can prune away the recursive parts entirely.

davidxl added inline comments.Mar 4 2022, 1:48 PM
llvm/lib/Transforms/IPO/Inliner.cpp
951

What I am thinking is a light weight mechanism to fix this with more flexibility. Assuming the number of callsites (from outside of SCC) into non-trivial SCC is small, a map from the <caller, non-trivialSCC> pairs to inlined count can be used to track the number of iterative inlining happening between the caller and SCC. If it exceeds the limit, mark the new callsite noInline.

wenlei added inline comments.Mar 5 2022, 11:53 AM
llvm/lib/Transforms/IPO/Inliner.cpp
951

This change is indeed better than D98481, but I'm still not sure if it is good enough from performance perspective.

I hear the argument about build speed and the earlier comments on strict bottom-up inlining, but different people have different balance between build speed and performance. Great if we can achieve both; otherwise, trying to minimize perf impact and giving user flexibility is probably better.

Recursive inlining is weak for llvm even without this change, and this area is somewhat perf sensitive. We have observed perf gap between gcc and llvm due to less recursive inlining. We also have made changes to allow more recursive inlining with PGO outside of cgscc inliner where there's less constraints for this kind of issues. D109104 is example of more aggressive recursive inlining with PGO leading to visible perf movement even on spec.

We can try to measure this change with our internal benchmarks. But overall I agree with David that a solution that gives people flexibility would be better (either an optional inliner size/growth cap, or cross scc inline history counter like David suggested).

nikic added inline comments.Mar 6 2022, 8:23 AM
llvm/lib/Transforms/IPO/Inliner.cpp
951

Would it be sufficient to add an option for disabling the check to get this landed? Does that add the desired flexibility?

To be clear, this is really not about a compile-time / performance tradeoff -- the exponential nature of the issue means that builds just don't finish in any reasonable timeframe anymore. It's a correctness problem, and as such should have been fixed months ago, even if there were a performance impact.

At this point I cannot justify delaying a fix for this critical bug anymore. If upstream is unwilling to accept a "good enough" fix, then we will make this a required patch for downstream LLVM distributors instead. I had hoped to avoid this, because it diverges upstream LLVM from distro-provided LLVM and causes work for each distributor, but at the same time I also can't have an LLVM 14.0.0 release with this issue unfixed, which is where we're headed now.

PS: I am happy to discuss more invasive ways to address this general class of problems after an immediate fix has landed on the release branch. Something I have been toying with is assigning an inlining penalty to inlined call-sites, based on the cost of the callee it was inlined from. This basically allows those call sites to use up remaining inlining budget from the parent function, but not more than that. This is principled, in that it avoids a cost-model bypass if inlining happens in a non-bottom-up fashion, but I suspect that it will fare worse than ad-hoc solutions in practice, because it would also penalize inlining of promoted or devirtualized calls. But this is just a side-note, as this (as well as the other alternatives we have previously discussed) are not suitable as immediate, minimally invasive, low-impact bug fixes.

xbolva00 added inline comments.
llvm/lib/Transforms/IPO/Inliner.cpp
951

Your fix looks good enough; better fix could be added to point release 14.0.1

Adding an option is a way to to forward, but I think it should be off by default. If it is on by default, it is very likely to cause some performance regressions.

As long as the option is there for rustc to turn on, we can agree to disagree...

Adding an option is a way to to forward, but I think it should be off by default. If it is on by default, it is very likely to cause some performance regressions.

Or.. many bug reports about infinite compile times. Bad trade off.

The number one goal of every compiler should be correct and working compilation process.

I would not object turning it on by default when the more elaborate solution is there.

If we turn the current solution on by default, we run the risk of turning one bug (compile time) into another (performance) -- depends on the users, it is not a 100% good tradeoff to make.

Can somebody provide any perf data to justify “default off”?

LLVM user base is large, the only way to get an answer to that is to turn it on and let user report -- but according to my experience, that is a very disruptive process and user may spend long time figuring out the source of regressions. From the nature of the change, the chances of that happening is quite high.

It is reasonable to expect either 1) a proper/complete fix; or 2) a stop gap solution which user can use to combat the compile time issue. The compile time issue is there since forever and not a recent regression. 2) is one step in the right direction and the final goal is to achieve 1).

nikic added a comment.Mar 6 2022, 3:28 PM

There's a couple of reasons why a default-disabled option is not great:

  1. It's not compatible with linker plugin LTO. People would have to be experts on LLVM internals and know that they need to add something like -Wl,-mllvm,-enable-unbounded-cross-scc-inlining=0 to their build system if they are linking rust object files.
  2. This problem is not Rust-specific. According to https://discourse.llvm.org/t/rust-newpm-blocker-catastrophic-inlining/6171/2, Apple uses the earlier version of this patch, though not sure for which toolchain(s). IIRC Mozilla previously encountered this with C++ code.
  3. While I agree that correctness and performance is just another tradeoff, we generally always trade off in favor of correctness, unless we expect a widespread performance impact, in which case may temporarily trade off in favor of performance. There is no evidence that this patch has any widespread impact, but there is evidence against it: For rustc performance tests, this patch is entirely performance neutral (while the previous version had minor negative impact) and the entirety of llvm-test-suite has only two programs even showing codegen changes with this patch. There might be an impact in some isolated cases (actually, this is pretty much guaranteed with any inliner change), but all evidence points towards no widespread impact, and as such I don't think there is justification for favoring performance over correctness.

It is reasonable to expect either 1) a proper/complete fix;

I do think this is a proper fix. There are some alternative ways to fix the issue which may well be better for some cases, but frankly I don't understand how you would even evaluate that without knowing about specific instances that this patch would regress. Knowing specific regressions, we could check whether an alternative patch avoids them, but without that, isn't this just guesswork?

The compile time issue is there since forever and not a recent regression.

This is true, but the new pass manager has exacerbated this issue, which is a recent regression. We did encounter a few instances of the same issue with the legacy pass manager as well, but the issue has become much more wide-spread and pressing with the new pass manager.

All that being said, I would take a default-disabled option over nothing at all. We would enable it by default in rustc, and I would give a recommendation to enable it by default in Fedora/RHEL LLVM, but at least we would not be forcing all distros to adopt the patch.

There's a couple of reasons why a default-disabled option is not great:

  1. It's not compatible with linker plugin LTO. People would have to be experts on LLVM internals and know that they need to add something like -Wl,-mllvm,-enable-unbounded-cross-scc-inlining=0 to their build system if they are linking rust object files.
  2. This problem is not Rust-specific. According to https://discourse.llvm.org/t/rust-newpm-blocker-catastrophic-inlining/6171/2, Apple uses the earlier version of this patch, though not sure for which toolchain(s). IIRC Mozilla previously encountered this with C++ code.
  3. While I agree that correctness and performance is just another tradeoff, we generally always trade off in favor of correctness, unless we expect a widespread performance impact, in which case may temporarily trade off in favor of performance. There is no evidence that this patch has any widespread impact, but there is evidence against it: For rustc performance tests, this patch is entirely performance neutral (while the previous version had minor negative impact) and the entirety of llvm-test-suite has only two programs even showing codegen changes with this patch. There might be an impact in some isolated cases (actually, this is pretty much guaranteed with any inliner change), but all evidence points towards no widespread impact, and as such I don't think there is justification for favoring performance over correctness.

It is reasonable to expect either 1) a proper/complete fix;

I do think this is a proper fix.

There is a reason to believe this is not the proper fix: there are callsites that are perfectly ok to be inlined but will get disabled due to the way the fix is done. It is considered proper if the fix actually detects the compile time budget will be reached -- not at the first inlining.

I guess we need to agree to disagree here.

There are some alternative ways to fix the issue which may well be better for some cases, but frankly I don't understand how you would even evaluate that without knowing about specific instances that this patch would regress. Knowing specific regressions, we could check whether an alternative patch avoids them, but without that, isn't this just guesswork?

The compile time issue is there since forever and not a recent regression.

This is true, but the new pass manager has exacerbated this issue, which is a recent regression. We did encounter a few instances of the same issue with the legacy pass manager as well, but the issue has become much more wide-spread and pressing with the new pass manager.

All that being said, I would take a default-disabled option over nothing at all. We would enable it by default in rustc, and I would give a recommendation to enable it by default in Fedora/RHEL LLVM, but at least we would not be forcing all distros to adopt the patch.

Agree -- let's make this step happen first to unblock the affected users.

aeubanks added inline comments.Mar 7 2022, 12:26 AM
llvm/lib/Transforms/IPO/Inliner.cpp
951

https://reviews.llvm.org/D121084 is a similar alternative to this where instead of marking call sites noinline, we make the inline cost of these sorts of calls exponentially more expensive. it's definitely not as good at preventing bloat as this patch based on testing variants of mut-rec-scc.ll, but perhaps it's good enough for real world examples?

xbolva00 added inline comments.Mar 7 2022, 3:09 AM
llvm/lib/Transforms/IPO/Inliner.cpp
953

What if ICB is always_inline?

I think we need to land this patch. It's fixing a major regression from the new pass manager which is blocking some users from even upgrading to LLVM 13. I understand that there are concerns that it may regress performance in some cases, but I think that fixing the regression is more important especially since we have not actually identified any real world cases that would regress. It also sounds like there is at least one downsteam user working around this regression in their own tree, and given the impact of the problem, we would also have to fix this downstream in Fedora as well.

I'm fairly certain @davidxl isn't interested in LLVM release branches, so there's the temporary option of landing this only in the release branch, with the promise of a better solution in a reasonable timeframe on ToT.

If this is doable, it feels like a winning strategy to me.

aeubanks updated this revision to Diff 413531.Mar 7 2022, 9:45 AM

add cl::opt

actually given that multiple people have complained about this issue, perhaps it's better with the cl::opt to just submit this, pass -mllvm where performance may be impacted, and have impacted people investigate this and come up with a potentially better solution

llvm/test/Transforms/Inline/mut-rec-scc.ll
41

do you mean add a call inside test3 to some other new function (e.g. test4)?

davidxl added inline comments.Mar 7 2022, 10:11 AM
llvm/lib/Transforms/IPO/Inliner.cpp
96

As I stated, I prefer this off by default, or the alternate patch:https://reviews.llvm.org/D121084

llvm/test/Transforms/Inline/mut-rec-scc.ll
41

ok to leave this as it is for now.

nikic added a comment.Mar 7 2022, 10:36 AM

actually given that multiple people have complained about this issue, perhaps it's better with the cl::opt to just submit this, pass -mllvm where performance may be impacted, and have impacted people investigate this and come up with a potentially better solution

Right. I'd like to emphasize again that if impacted cases get reported, this allows us to address them in a data-driven way. For example, we now have two very similar solutions here and in D121084, but we don't have any way to evaluate whether the latter is better in any practical sense. If we had an impacted case, we could check whether or not it would be affected by D121084, with which cost multiplier, etc. Without having an affected case, we can only guess.

If this is doable, it feels like a winning strategy to me.

Yes, we could commit this patch directly to the release branch, and it seems like we may have to do this to get it in in time for the release, but I don't want that to slow down the effort to fix this in main. I think we still need to come to some consensus on how to fix this in a timely manner in the main branch.

llvm/lib/Transforms/IPO/Inliner.cpp
96

I'm a little confused by all negatives in the option, but I think the default state should be whatever state will fix PR45253. This at least will get us closer to the old-pass-manager behavior and will unblock users. I don't think the option is particularly useful if the default state doesn't fix PR45253.

As I mentioned, the cost of investigating performance regressions is high. However if y'all think it is better to shift the cost, I won't stay in the way.

I've also approved the other patch.

wenlei requested changes to this revision.Mar 10 2022, 2:19 PM

I know that we have settled on D121084, but just to provide a data point to support the earlier concern that David and I raised about perf impact.

I measured this on a large internal workload (at Meta), and there's a consistent 0.5% CPU regression with this patch.

This revision now requires changes to proceed.Mar 10 2022, 2:19 PM
aeubanks abandoned this revision.Mar 14 2022, 4:41 PM