This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Transforms/IPO/
-
Transforms/
-
IPO/
2/5
Inliner.cpp
-
test/Transforms/Inline/
-
Transforms/
-
Inline/
-
inline-remark.ll
-
recursion.ll

Differential D98481

[Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.
AbandonedPublic

Authored by hoy on Mar 11 2021, 11:44 PM.

Download Raw Diff

Details

Reviewers

davidxl
wenlei
kazu
aeubanks

Summary

This change avoids inlining a function involved in a mutual-recursive cycle. Due to the lack of a global inline history, such inlining may cause an exponential size growth as the inlining extends up along the call graph.

Consider the following example,

void A() { B(); B();}
void B() { A(); }
void C() { B(); }
void D() { C(); }

When processing C, inlining B will result in

void C() { B(); B(); }

When processing D, inlining C and further the two B callsites will result in

void D() { B(); B(); B(); B(); }

We've also seen that one of our internal services suffered from the inliner hanging for 6 hours.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit TestsFailed

	Time	Test
	630 ms	x64 windows > Clang.Modules::preprocess-build-diamond.m

Event Timeline

hoy created this revision.Mar 11 2021, 11:44 PM

Herald added subscribers: wenlei, hiraditya, eraman. · View Herald TranscriptMar 11 2021, 11:44 PM

hoy requested review of this revision.Mar 11 2021, 11:44 PM

Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2021, 11:44 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

hoy edited the summary of this revision. (Show Details)Mar 11 2021, 11:45 PM

hoy added reviewers: davidxl, wenlei.

Harbormaster completed remote builds in B93429: Diff 330157.Mar 12 2021, 2:37 AM

Updating D98481: [Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.

Harbormaster completed remote builds in B93519: Diff 330271.Mar 12 2021, 10:18 AM

Thanks for looking into this. This fix is more restrictive than I originally thought. And I'm also not sure if it does what we needed (see inline comment for this part).

IIUC, if we have A->B, B->C, C->B, even if there's no inlining between B and C, with this change, A will never inline B. Is that correct? I imagine that would throttle inliner very visibly - we can gauge the effect by looking at the number of inlining w/ and w/o this change for spec and larger programs.

Ideally, we don't want to limit all inlining across SCC if the inlinee is part of a recursion cycle. The problem here is the repeated inlining due to recursion. We have InlineHistory for single iteration SCC run, and InlinedInternalEdges for cross iteration checks against repeated recursive inlining in the same SCC, but they don't restrict all recursive inlining in general. It would be good to have something that only limits the extra, repeated inlining for this case too.

I think InlinedInternalEdges is close to what is needed here, except that here we're dealing with different SCC when moving up the SCC graph. Wondering if we can somehow transfer the InlinedInternalEdges of previously processed SCC so later SCC is also aware of what would be repeated recursive inlining, though as @chandlerc called out InlinedInternalEdges itself is already a hack that breaks layering.

+@chandlerc, and @wmi who improved InlinedInternalEdges in https://reviews.llvm.org/D52915.

llvm/lib/Transforms/IPO/Inliner.cpp
833	If the goal is to prevent repeated recursive inlining when processing caller SCC, this check may not always work either depending on what state we are with (A,B) SCC. If we stop at A inlined B (so A only calls A, but not B), then the (A,B) SCC is broken into (A) and (B), in which case the CalleeSCC size check against 1 may pass when we look at C or D. So what happens could be - we pass this check at caller C, doing more inlining, then fail this check at D, bouncing back and a forth as we move up.

In D98481#2624371, @wenlei wrote:

Thanks for looking into this. This fix is more restrictive than I originally thought. And I'm also not sure if it does what we needed (see inline comment for this part).

Right, it doesn't do exactly what we need, which is a global inline history to prevent a recursive callsites being inlined repeated It aggressively prevents inlining recursive callsites out of their defining SCCs.

IIUC, if we have A->B, B->C, C->B, even if there's no inlining between B and C, with this change, A will never inline B. Is that correct? I imagine that would throttle inliner very visibly - we can gauge the effect by looking at the number of inlining w/ and w/o this change for spec and larger programs.

Correct, with this change B and C will never be inlined into any function outside their enclosing SCC. I agree this is aggressive. The effect could be propagated as the bottom-up walk goes.

Ideally, we don't want to limit all inlining across SCC if the inlinee is part of a recursion cycle. The problem here is the repeated inlining due to recursion. We have InlineHistory for single iteration SCC run, and InlinedInternalEdges for cross iteration checks against repeated recursive inlining in the same SCC, but they don't restrict all recursive inlining in general. It would be good to have something that only limits the extra, repeated inlining for this case too.

I think InlinedInternalEdges is close to what is needed here, except that here we're dealing with different SCC when moving up the SCC graph. Wondering if we can somehow transfer the InlinedInternalEdges of previously processed SCC so later SCC is also aware of what would be repeated recursive inlining, though as @chandlerc called out InlinedInternalEdges itself is already a hack that breaks layering.

+@chandlerc, and @wmi who improved InlinedInternalEdges in https://reviews.llvm.org/D52915.

A global inline history across also SCCs sounds good fit there, however, building such a history is a bit challenging it that how to select the hash keys. It would be good to know if InlinedInternalEdges can be extended to handle repeatedly inlining a recursive callsite across SCCs.

llvm/lib/Transforms/IPO/Inliner.cpp
833	Do you mean that if `A` turns into self-recursive, `C` then can inline it because `CalleeSCC->size() == 1`?

In D98481#2625282, @hoy wrote:

In D98481#2624371, @wenlei wrote:

Thanks for looking into this. This fix is more restrictive than I originally thought. And I'm also not sure if it does what we needed (see inline comment for this part).

Right, it doesn't do exactly what we need, which is a global inline history to prevent a recursive callsites being inlined repeated It aggressively prevents inlining recursive callsites out of their defining SCCs.

IIUC, if we have A->B, B->C, C->B, even if there's no inlining between B and C, with this change, A will never inline B. Is that correct? I imagine that would throttle inliner very visibly - we can gauge the effect by looking at the number of inlining w/ and w/o this change for spec and larger programs.

Correct, with this change B and C will never be inlined into any function outside their enclosing SCC. I agree this is aggressive. The effect could be propagated as the bottom-up walk goes.

Practically, it means this change can potentially have negative performance impact on many workloads, which by itself would definitely be a deal breaker. And if that's not the case, we need to prove it through testing with this change. (Even if we don't observe any negative impact on a few workloads, it's still not bullet proof, but it's at least a data point).

Ideally, we don't want to limit all inlining across SCC if the inlinee is part of a recursion cycle. The problem here is the repeated inlining due to recursion. We have InlineHistory for single iteration SCC run, and InlinedInternalEdges for cross iteration checks against repeated recursive inlining in the same SCC, but they don't restrict all recursive inlining in general. It would be good to have something that only limits the extra, repeated inlining for this case too.

I think InlinedInternalEdges is close to what is needed here, except that here we're dealing with different SCC when moving up the SCC graph. Wondering if we can somehow transfer the InlinedInternalEdges of previously processed SCC so later SCC is also aware of what would be repeated recursive inlining, though as @chandlerc called out InlinedInternalEdges itself is already a hack that breaks layering.

+@chandlerc, and @wmi who improved InlinedInternalEdges in https://reviews.llvm.org/D52915.

A global inline history across also SCCs sounds good fit there, however, building such a history is a bit challenging it that how to select the hash keys. It would be good to know if InlinedInternalEdges can be extended to handle repeatedly inlining a recursive callsite across SCCs.

llvm/lib/Transforms/IPO/Inliner.cpp
833	Right.

In D98481#2626313, @wenlei wrote:

In D98481#2625282, @hoy wrote:

In D98481#2624371, @wenlei wrote:

Thanks for looking into this. This fix is more restrictive than I originally thought. And I'm also not sure if it does what we needed (see inline comment for this part).

Right, it doesn't do exactly what we need, which is a global inline history to prevent a recursive callsites being inlined repeated It aggressively prevents inlining recursive callsites out of their defining SCCs.

IIUC, if we have A->B, B->C, C->B, even if there's no inlining between B and C, with this change, A will never inline B. Is that correct? I imagine that would throttle inliner very visibly - we can gauge the effect by looking at the number of inlining w/ and w/o this change for spec and larger programs.

Correct, with this change B and C will never be inlined into any function outside their enclosing SCC. I agree this is aggressive. The effect could be propagated as the bottom-up walk goes.

Practically, it means this change can potentially have negative performance impact on many workloads, which by itself would definitely be a deal breaker. And if that's not the case, we need to prove it through testing with this change. (Even if we don't observe any negative impact on a few workloads, it's still not bullet proof, but it's at least a data point).

Good point. I can get some numbers out of SPEC. BTW, inlining self-recursive functions is also disabled and I'm not sure what was the motivation and whether we have data points for that. After inlining is done for an SCC, there is a good chance that quite some functions in the original SCC become self-recursive and they are prevented from further inlining.

Ideally, we don't want to limit all inlining across SCC if the inlinee is part of a recursion cycle. The problem here is the repeated inlining due to recursion. We have InlineHistory for single iteration SCC run, and InlinedInternalEdges for cross iteration checks against repeated recursive inlining in the same SCC, but they don't restrict all recursive inlining in general. It would be good to have something that only limits the extra, repeated inlining for this case too.

I think InlinedInternalEdges is close to what is needed here, except that here we're dealing with different SCC when moving up the SCC graph. Wondering if we can somehow transfer the InlinedInternalEdges of previously processed SCC so later SCC is also aware of what would be repeated recursive inlining, though as @chandlerc called out InlinedInternalEdges itself is already a hack that breaks layering.

+@chandlerc, and @wmi who improved InlinedInternalEdges in https://reviews.llvm.org/D52915.

A global inline history across also SCCs sounds good fit there, however, building such a history is a bit challenging it that how to select the hash keys. It would be good to know if InlinedInternalEdges can be extended to handle repeatedly inlining a recursive callsite across SCCs.

llvm/lib/Transforms/IPO/Inliner.cpp
833	Inlining self-recursion is already disabled by inline cost analyzer. Not sure why it is done though. So `C` will not inline `A` once `A` becomes self-recursive.

I got some numbers for SPEC2017 with AutoFDO + LTO. Timing wise no noticeable difference but there is difference in code size for some benchmarks which indicates less inlining with this patch. The difference is pretty big for perlbench_s. I guess we don't see perf gap because those recursive functions are run a lot more times than the inlined times.

	             After/Before (Code Size)
   508.namd_r    	100.00%
   510.parest_r  	100.00%
   511.povray_r  	97.89%
   600.perlbench_s	89.68%
   602.gcc_s     	98.79%
   605.mcf_s     	100.00%
   620.omnetpp_s 	100.00%
   623.xalancbmk_s	99.91%
   625.x264_s    	100.00%
   631.deepsjeng_s	100.00%
   638.imagick_s 	99.55%
   641.leela_s   	100.00%
   644.nab_s     	100.00%
   657.xz_s      	100.00%

In D98481#2628047, @hoy wrote:
I got some numbers for SPEC2017 with AutoFDO + LTO. Timing wise no noticeable difference but there is difference in code size for some benchmarks which indicates less inlining with this patch. The difference is pretty big for perlbench_s I guess we don't see perf gap because recursive those functions are run a lot more times than the inlined times.
	             After/Before (Code Size)
   508.namd_r    	100.00%
   510.parest_r  	100.00%
   511.povray_r  	97.89%
   600.perlbench_s	89.68%
   602.gcc_s     	98.79%
   605.mcf_s     	100.00%
   620.omnetpp_s 	100.00%
   623.xalancbmk_s	99.91%
   625.x264_s    	100.00%
   631.deepsjeng_s	100.00%
   638.imagick_s 	99.55%
   641.leela_s   	100.00%
   644.nab_s     	100.00%
   657.xz_s      	100.00%

Do you have numbers on how many times the new check blocks inlining for each (if you move it to be the last check)? The code size number seems to be indicate that the change indeed impacts inlining quite a bit.

When the inline history was introduced in https://reviews.llvm.org/D36188, it was verified that it didn't block any inlining for spec.

In D98481#2628055, @wenlei wrote:
In D98481#2628047, @hoy wrote:
I got some numbers for SPEC2017 with AutoFDO + LTO. Timing wise no noticeable difference but there is difference in code size for some benchmarks which indicates less inlining with this patch. The difference is pretty big for perlbench_s I guess we don't see perf gap because recursive those functions are run a lot more times than the inlined times.
	             After/Before (Code Size)
   508.namd_r    	100.00%
   510.parest_r  	100.00%
   511.povray_r  	97.89%
   600.perlbench_s	89.68%
   602.gcc_s     	98.79%
   605.mcf_s     	100.00%
   620.omnetpp_s 	100.00%
   623.xalancbmk_s	99.91%
   625.x264_s    	100.00%
   631.deepsjeng_s	100.00%
   638.imagick_s 	99.55%
   641.leela_s   	100.00%
   644.nab_s     	100.00%
   657.xz_s      	100.00%
Do you have numbers on how many times the new check blocks inlining for each (if you move it to be the last check)? The code size number seems to be indicate that the change indeed impacts inlining quite a bit.

When the inline history was introduced in https://reviews.llvm.org/D36188, it was verified that it didn't block any inlining for spec.

Callsite inlining are good counters. For those benchmarks affected, inlined callsites vs non-inlined callsites due to this change are:

               inlined callsites        recursive callsites    
511.povray_r        3178                   603
600.perlbench_s     4577                   2139
602.gcc_s          46125                   726
623.xalancbmk_s    15573                   79
638.imagick_s       7719                   61

Apparently, perlbench_s has a lot less inlining.

davidxl added a reviewer: kazu.Mar 15 2021, 11:02 PM

Without guarding the caller size, this (over topdown inlining) can also potentially happen without recursion.

This patch seems too aggressive. One possible way to handle it is to restrict inlining to a node N in non trivial SCC when node N has >1 fan-out (edges) to other nodes in the same SCC -- the large fan-out is the reason for the excessive growth.

llvm/lib/Transforms/IPO/Inliner.cpp
809	if (CalleeSCC == C) &&

In D98481#2628143, @davidxl wrote:

Without guarding the caller size, this (over topdown inlining) can also potentially happen without recursion.

This patch seems too aggressive. One possible way to handle it is to restrict inlining to a node N in non trivial SCC when node N has >1 fan-out (edges) to other nodes in the same SCC -- the large fan-out is the reason for the excessive growth.

Thanks for the suggestion. Exactly, the excessive growth boils down to the fan-out. An additional check for the fan-out factor should effectively make the restriction less aggressive.

In D98481#2629511, @hoy wrote:

In D98481#2628143, @davidxl wrote:

Without guarding the caller size, this (over topdown inlining) can also potentially happen without recursion.

This patch seems too aggressive. One possible way to handle it is to restrict inlining to a node N in non trivial SCC when node N has >1 fan-out (edges) to other nodes in the same SCC -- the large fan-out is the reason for the excessive growth.

Thanks for the suggestion. Exactly, the excessive growth boils down to the fan-out. An additional check for the fan-out factor should effectively make the restriction less aggressive.

Fan-out is one way to control the growth, I'm not sure if we can make it narrow and selective enough though.

I also wanted to point out that this case is related to the use of a partial profile, in which case the hot threshold may have triggered more aggressive inlining.

Actually we have a long list of issues where the CGSCC inliner explodes (w/ or w/o profile). It seem to me that the way CGSCC inliner works relies heavily on a "good" threshold, otherwise there's this risk of exponential growth of size and build time regardless of whether we have recursion or not just like David pointed. We indeed have seen cases where the default inline threshold leads to excessive inlining when there's no recursion.

While in theory if we have a reasonable threshold, CGSCC inliner would finish in reasonable amount of time (unless we have bugs that cause SCC mutation to not converge), it's hard for a preset threshold to fit all builds, which is why we ran into such cases again and again. I know that there's past discussion about introducing size/growth cap for CGSCC inliner and it didn't go through, but I'm wondering if we should revisit that given that preset threshold lead to explosive inlining in quite a few cases. We can find way to get around it by tuning threshold, but it's questionable to me whether the time spent is worthwhile to chase down these cases one by one and ending up tuning threshold without finding fundamental issues other than CGGSCC inliner is unbounded and rely on threshold.

If we cap size, we would need to inline more important call sites first. At the minimum, can we add a switch (off by default) to cap size so when people run into this, there would offer a very targeted solution - currently we sometimes have to tune down threshold globally, which negatively affects other inline paths too. We can also add a warning when inliner is throttled due to cap.

Having cap on the target context (where a site is inlined into) is useful for performance purpose but it is based on working set analysis which we are working on. This can be 'overloaded' to control compile time growth too.

Regarding option, I think it is useful to have an internal option to workaround problem like this. The easiest is to have a cap on the max number callsites that can iteratively inlined for a given callsite.

In D98481#2630554, @davidxl wrote:

Having cap on the target context (where a site is inlined into) is useful for performance purpose but it is based on working set analysis which we are working on. This can be 'overloaded' to control compile time growth too.

Good to know. It's kind of a separate optimization topic, but would like to learn more about this effort. How do you perform working set analysis, is any of this available in upstream now?

On the other hand, even if the working set based heuristic isn't ready yet, the scaffolding would be the same if you need a cap. And when you put a cap, you also need to be selective and prioritize inline candidate, right? If these are available in upstream, a switch to cap pathological cases would become very simple by sharing the scaffolding.

Regarding option, I think it is useful to have an internal option to workaround problem like this. The easiest is to have a cap on the max number callsites that can iteratively inlined for a given callsite.

Sounds good. So let's put in a switch for a cap, default to off, also emit a warning when the switch is on and cap is hit? It would still be nice if we inline important call site first when cap is hit (and share working set heuristic scaffolding for that).

Hiroshi is working on the implementation, and I expect upstreaming happening soon. In a nutshell, the working set analysis is based on building interprocedural loop graph (ILG) and working set propagation.

Yes what you describe about the option sounds reasonable, though I am not sure about warning. Some -Rpass-analysis or -Rpass-missed message or some debug output seem better.

In D98481#2630691, @davidxl wrote:

Hiroshi is working on the implementation, and I expect upstreaming happening soon. In a nutshell, the working set analysis is based on building interprocedural loop graph (ILG) and working set propagation.

Looking forward to it. Interesting to see how accurately we can model working set.

Yes what you describe about the option sounds reasonable, though I am not sure about warning. Some -Rpass-analysis or -Rpass-missed message or some debug output seem better.

The warning is to make people aware this is happening. Service developers don't use pass remarks, and pass remarks can also be a bit noisy.

A cap sounds good to me, since that's something we've been looking for for a long time. Regarding the cap, what kind of cap do you think will be appropriate? A cap on function size growth or a cap on how many times an original callsite can be inlined iteratively? Regarding the later, I was wondering what's a good way to track a callsite across SCC and function passes.

From: Xinliang David Li <davidxl@google.com>
Sent: Wednesday, March 17, 2021 9:02 AM
To: Hongtao Yu <reviews+D98481+public+2d824aa1c1081e47@reviews.llvm.org>
Cc: Hongtao Yu <hoy@fb.com>; Wenlei He <wenlei@fb.com>; Kazu Hirata <kazu@google.com>; Chandler Carruth <chandlerc@gmail.com>; Wei Mi <wmi@google.com>; Easwaran Raman <eraman@google.com>; llvm-commits <llvm-commits@lists.llvm.org>; bhuvanendra.kumarn@amd.com <bhuvanendra.kumarn@amd.com>; 88888yl@gmail.com <88888yl@gmail.com>; dougpuob@gmail.com <dougpuob@gmail.com>; David Green <david.green@arm.com>
Subject: Re: [PATCH] D98481: [Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.

From: Xinliang David Li <davidxl@google.com>
Date: Wednesday, March 17, 2021 at 9:02 AM
To: Hongtao Yu <reviews+D98481+public+2d824aa1c1081e47@reviews.llvm.org>
Cc: Hongtao Yu <hoy@fb.com>, Wenlei He <wenlei@fb.com>, Kazu Hirata <kazu@google.com>, Chandler Carruth <chandlerc@gmail.com>, Wei Mi <wmi@google.com>, Easwaran Raman <eraman@google.com>, llvm-commits <llvm-commits@lists.llvm.org>, bhuvanendra.kumarn@amd.com <bhuvanendra.kumarn@amd.com>, 88888yl@gmail.com <88888yl@gmail.com>, dougpuob@gmail.com <dougpuob@gmail.com>, David Green <david.green@arm.com>
Subject: Re: [PATCH] D98481: [Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.

I was thinking about capping on function size so that small callees such as getters/setters can always be inlined. What do you think?

Thanks for the suggestion. I was thinking about capping on function size so that small callees such as getters/setters can always be inlined. What do you think?

that is fine if it does not add visible compile time overhead (e.g, to
compute and track size).

David

I think size cap might be better than call site counts, and query inliner instr size should not have cost.

2nd thought on the warning, how about this. The goal is help save time for everyone.

One switch for -inline-size-limit, default to huge value (could be INT_MAX). When the limit is hit, we emit a warning, but don’t stop inlining. This make it obvious to everyone when cgscc inlining is causing extremely long build time. With default to huge value, it should not fire at all so nothing would break due to warning as error. But people may choose to set a lower value (e.g. we may use a lower value internally, it may break build, but the hope is to capture extremely slow builds and bloated codegen quickly)
Another switch to -stop-inline-for-size-limit, default to off. When this is enabled, cgscc inline will stop when the above size limit is hit, no warning is issued since this is asked by user explicitly. This serves as a possible mitigation for bloated inlining without changing global threshold, and it's also cohesive with #1.

We've encountered variants of this issue as well. I think this is a rather critical problem that needs to be addressed in a simple, release-backportable way, even if that fix causes performance regressions in some cases. A slower build is still better than no build at all. We can't even bootstrap rustc on some platforms anymore due to this problem.

The approach in this patch seems pretty sensible as a simple starting point and was previously suggested by @aeubanks in https://bugs.llvm.org/show_bug.cgi?id=45253 as well.

Herald added subscribers: ormris, modimo. · View Herald TranscriptOct 13 2021, 1:37 PM

cuviper added a subscriber: cuviper.Oct 13 2021, 2:23 PM

In D98481#3062526, @nikic wrote:

We've encountered variants of this issue as well. I think this is a rather critical problem that needs to be addressed in a simple, release-backportable way, even if that fix causes performance regressions in some cases. A slower build is still better than no build at all. We can't even bootstrap rustc on some platforms anymore due to this problem.

The approach in this patch seems pretty sensible as a simple starting point and was previously suggested by @aeubanks in https://bugs.llvm.org/show_bug.cgi?id=45253 as well.

This patch as is would be too aggressive as discussed above. There had been some discussion off patch. We converged on the proposal above to introduce -inline-size-limit and -stop-inline-for-size-limit - that doesn't change default behavior so it won't suddenly throttle performance for others, but offers users the ability to get around pathological cases when needed. Would that solve the problem you have at hand?

In D98481#3062704, @wenlei wrote:

In D98481#3062526, @nikic wrote:

We've encountered variants of this issue as well. I think this is a rather critical problem that needs to be addressed in a simple, release-backportable way, even if that fix causes performance regressions in some cases. A slower build is still better than no build at all. We can't even bootstrap rustc on some platforms anymore due to this problem.

The approach in this patch seems pretty sensible as a simple starting point and was previously suggested by @aeubanks in https://bugs.llvm.org/show_bug.cgi?id=45253 as well.

This patch as is would be too aggressive as discussed above. There had been some discussion off patch. We converged on the proposal above to introduce -inline-size-limit and -stop-inline-for-size-limit - that doesn't change default behavior so it won't suddenly throttle performance for others, but offers users the ability to get around pathological cases when needed. Would that solve the problem you have at hand?

I don't think this is a solution to this problem, for two reasons: First, having builds finish in reasonable time should not be a feature that is hidden behind an option. It's okay to have an option to opt-in to longer (potentially pathologically longer) compile-time for more optimization, but it should not be the default. If this is solved by an inlining size limit, it should be enabled by default.

Second, while an inlining size limit does mitigate this issue somewhat, it doesn't really solve it. The core problem is that the inliner is broken in its handling of recursive calls, and a fix for that issue needs to involve either the SCC structure or the inlining history. For example, we are seeing this issue in the context of recursive destructors, and it's perfectly plausible for the same destructor to be invoked from many functions. If the inlining size limit prevents these functions from blowing up to hundreds of thousands of instructions and instead leaves them all at "only" 20k (or whatever the size limit would be), that's of course already a win and will mitigate the worst problems, but it's still not the ideal outcome and does not address the fundamental issue. The ideal outcome would be that this inlining doesn't happen and all the functions stay small. A per-function inlining size limit is still not "global enough" in its consideration.

I do want to mention that this can also help with performance (apart from the obvious benefit of not blowing up code size beyond all reasonable bounds and completely trashing the instruction cache). For example, we're currently forced to mark all destructors in landingpads as noinline, because the pathological inlining problem would be even worse otherwise. However, this is problematic because it prevents inlining of trivial destructors as well, so we miss opportunities to simplify or entirely optimize away landingpads. We would not be forced to do this if the inliner exhibited sensible behavior.

Now, I'm not saying this patch is perfect in it's current form -- it seems worthwhile to at least try to implement the additional fan-out restriction that has been suggested above. But the core idea here seems like the right direction...

Dushistov added a subscriber: Dushistov.Nov 24 2021, 1:17 PM

slanterns added a subscriber: slanterns.Nov 26 2021, 11:32 PM

I've been thinking about this some more, and I think something important to highlight is the interaction with simplification in the inliner pipeline: Normally, inlining inside an SCC stops either because the SCC has been split into trivial cycles (which are explicitly forbidden from further inlining) or because the functions in the SCC have increased in size to the point that no further inlining is possible. In both cases, the end result is that the SCC will not be inlined into a caller anymore. Now, due to simplication, it may be that in the second situation, we stop inlining because the functions become too large, but after simplication they are small enough for further inlining. Once we are in that situation, there is no longer anything blocking cross-SCC inlining, and we get the exponential inlining described here.

This makes me think that not inlining across (cyclic) SCCs is intended behavior of the bottom-up inliner, that currently breaks down when the inliner is combined with a simplification pipeline in the same CGSCC pass manager. This patch enforced this limitation in a more direct and robust way.

One could make an argument that we should be trying to inline again after functions inside an SCC have been simplified. This would then maximally inline within the SCC before considering the callers. But I suspect doing something like this would cause many other complications.

I found this example useful to consider, because it wasn't immediately obvious to me why we don't see issues in simpler cases: https://gist.github.com/nikic/5f389a8868483631861a6ec0931d968a

ormris removed a subscriber: ormris.Jan 18 2022, 10:15 AM

nikic mentioned this in D118313: [Driver] Remove -fno-experimental-new-pass-manager.Jan 27 2022, 12:34 AM

shayba added a subscriber: shayba.Jan 31 2022, 8:39 PM

aeubanks mentioned this in D120584: [NewPM][Inliner] Mark inlined calls to functions in same SCC as callee as noinline.Feb 25 2022, 11:41 AM

one more alternative: https://reviews.llvm.org/D120584

hoy abandoned this revision.Mar 14 2022, 4:38 PM

Herald added a project: Restricted Project. · View Herald TranscriptMar 14 2022, 4:38 PM

davide mentioned this in D138602: [WIP] Alwaysinliner time explosion with new pass manager.Nov 23 2022, 1:14 PM

Revision Contents

Path

Size

llvm/

lib/

Transforms/

IPO/

Inliner.cpp

24 lines

test/

Transforms/

Inline/

inline-remark.ll

5 lines

recursion.ll

101 lines

Diff 330271

llvm/lib/Transforms/IPO/Inliner.cpp

Show First 20 Lines • Show All 797 Lines • ▼ Show 20 Lines	for (; I < (int)Calls.size() && Calls[I].first->getCaller() == &F; ++I) {
inlineHistoryIncludes(&Callee, InlineHistoryID, InlineHistory)) {		inlineHistoryIncludes(&Callee, InlineHistoryID, InlineHistory)) {
setInlineRemark(*CB, "recursive");		setInlineRemark(*CB, "recursive");
continue;		continue;
}		}

// Check if this inlining may repeat breaking an SCC apart that has		// Check if this inlining may repeat breaking an SCC apart that has
// already been split once before. In that case, inlining here may		// already been split once before. In that case, inlining here may
// trigger infinite inlining, much like is prevented within the inliner		// trigger infinite inlining, much like is prevented within the inliner
// itself by the InlineHistory above, but spread across CGSCC iterations		// itself by the InlineHistory above, but spread across CGSCC iterations
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: 'auto CalleeSCC' can be declared as 'auto CalleeSCC' [llvm-qualified-auto] not useful Lint: Pre-merge checks:* clang-tidy: warning: 'auto CalleeSCC' can be declared as 'auto *CalleeSCC' [llvm-qualified…
// and thus hidden from the full inline history.		// and thus hidden from the full inline history.
		auto CalleeSCC = CG.lookupSCC(*CG.lookup(Callee));
if (CG.lookupSCC(*CG.lookup(Callee)) == C &&		if (CG.lookupSCC(*CG.lookup(Callee)) == C &&
		davidxlUnsubmitted Not Done Reply Inline Actions if (CalleeSCC == C) && davidxl: if (CalleeSCC == C) &&
UR.InlinedInternalEdges.count({&N, C})) {		UR.InlinedInternalEdges.count({&N, C})) {
LLVM_DEBUG(dbgs() << "Skipping inlining internal SCC edge from a node "		LLVM_DEBUG(dbgs() << "Skipping inlining internal SCC edge from a node "
"previously split out of this SCC by inlining: "		"previously split out of this SCC by inlining: "
<< F.getName() << " -> " << Callee.getName() << "\n");		<< F.getName() << " -> " << Callee.getName() << "\n");
setInlineRemark(*CB, "recursive SCC split");		setInlineRemark(*CB, "recursive SCC split");
continue;		continue;
}		}

		// Do not inline a function involved in a mutual-recursive cycle. Due to
		// the lack of a global inline history, such inlining may cause an
		// exponential size growth as the inlining extends up along the call
		// graph. Consider the following example,
		// void A() { B(); B();}
		// void B() { A(); }
		// void C() { B(); }
		// void D() { C(); }
		// When processing C, inlining B will result in
		// void C() { B(); B(); }
		// When processing D, inlining C and futher the two B callsites will
		// result in
		// void D() { B(); B(); B(); B(); }
		// Note that we are only checking mutual-recursion here. Self-recurision
		// is handled by inline cost analyzer.
		if (CalleeSCC != C && CalleeSCC->size() > 1) {
		wenleiUnsubmitted Not Done Reply Inline Actions If the goal is to prevent repeated recursive inlining when processing caller SCC, this check may not always work either depending on what state we are with (A,B) SCC. If we stop at A inlined B (so A only calls A, but not B), then the (A,B) SCC is broken into (A) and (B), in which case the CalleeSCC size check against 1 may pass when we look at C or D. So what happens could be - we pass this check at caller C, doing more inlining, then fail this check at D, bouncing back and a forth as we move up. wenlei: If the goal is to prevent repeated recursive inlining when processing caller SCC, this check…
		hoyAuthorUnsubmitted Done Reply Inline Actions Do you mean that if `A` turns into self-recursive, `C` then can inline it because `CalleeSCC->size() == 1`? hoy: Do you mean that if `A` turns into self-recursive, `C` then can inline it because `CalleeSCC…
		wenleiUnsubmitted Not Done Reply Inline Actions Right. wenlei: Right.
		hoyAuthorUnsubmitted Done Reply Inline Actions Inlining self-recursion is already disabled by inline cost analyzer. Not sure why it is done though. So `C` will not inline `A` once `A` becomes self-recursive. hoy: Inlining self-recursion is already disabled by inline cost analyzer. Not sure why it is done…
		LLVM_DEBUG(dbgs() << "Skipping inlining a node that is involved in a "
		"mutual-recursive SCC"
		<< F.getName() << " -> " << Callee.getName() << "\n");
		setInlineRemark(*CB, "recursive");
		continue;
		}

auto Advice = Advisor.getAdvice(*CB, OnlyMandatory);		auto Advice = Advisor.getAdvice(*CB, OnlyMandatory);
// Check whether we want to inline this callsite.		// Check whether we want to inline this callsite.
if (!Advice->isInliningRecommended()) {		if (!Advice->isInliningRecommended()) {
Advice->recordUnattemptedInlining();		Advice->recordUnattemptedInlining();
continue;		continue;
}		}

// Setup the data structure used to plumb customization into the		// Setup the data structure used to plumb customization into the
▲ Show 20 Lines • Show All 219 Lines • Show Last 20 Lines

llvm/test/Transforms/Inline/inline-remark.ll

Show All 20 Lines	bb2:
call void @bar(i1 true)		call void @bar(i1 true)
ret void		ret void
}		}

;; Test 1 - Add different inline remarks to similar callsites.		;; Test 1 - Add different inline remarks to similar callsites.
define void @test1() {		define void @test1() {
; CHECK-LABEL: @test1		; CHECK-LABEL: @test1
; CHECK-NEXT: call void @bar(i1 true) [[ATTR1:#[0-9]+]]		; CHECK-NEXT: call void @bar(i1 true) [[ATTR1:#[0-9]+]]
; CHECK-NEXT: call void @bar(i1 false) [[ATTR2:#[0-9]+]]		; CHECK-NEXT: call void @bar(i1 false) [[ATTR1]]
call void @bar(i1 true)		call void @bar(i1 true)
call void @bar(i1 false)		call void @bar(i1 false)
ret void		ret void
}		}

define void @noop() {		define void @noop() {
ret void		ret void
}		}
Show All 12 Lines
define void @test3() {		define void @test3() {
; CHECK-LABEL: @test3		; CHECK-LABEL: @test3
; CHECK-NEXT: call void @test3() [[ATTR4:#[0-9]+]]		; CHECK-NEXT: call void @test3() [[ATTR4:#[0-9]+]]
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
call void @test3() alwaysinline		call void @test3() alwaysinline
ret void		ret void
}		}

; CHECK: attributes [[ATTR1]] = { "inline-remark"="(cost=25, threshold=0)" }		; CHECK: attributes [[ATTR1]] = { "inline-remark"="recursive" }
; CHECK: attributes [[ATTR2]] = { "inline-remark"="(cost=never): recursive" }
; CHECK: attributes [[ATTR3]] = { "inline-remark"="unsupported operand bundle; (cost={{.}}, threshold={{.}})" }		; CHECK: attributes [[ATTR3]] = { "inline-remark"="unsupported operand bundle; (cost={{.}}, threshold={{.}})" }
; CHECK: attributes [[ATTR4]] = { alwaysinline "inline-remark"="(cost=never): recursive call" }		; CHECK: attributes [[ATTR4]] = { alwaysinline "inline-remark"="(cost=never): recursive call" }

llvm/test/Transforms/Inline/recursion.ll

This file was added.

				; RUN: opt -O3 -inline-threshold=50 -S < %s \| FileCheck %s
				;; Check function fib is not over-inlined into A, B and C.

				define dso_local i32 @fib0(i32 %n) #0 {
				entry:
				%retval = alloca i32, align 4
				%n.addr = alloca i32, align 4
				store i32 %n, i32* %n.addr, align 4
				%0 = load i32, i32* %n.addr, align 4
				%cmp = icmp sle i32 %0, 1
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%1 = load i32, i32* %n.addr, align 4
				store i32 %1, i32* %retval, align 4
				br label %return

				if.end: ; preds = %entry
				%2 = load i32, i32* %n.addr, align 4
				%sub = sub nsw i32 %2, 1
				%call = call i32 @fib(i32 %sub)
				%3 = load i32, i32* %n.addr, align 4
				%sub1 = sub nsw i32 %3, 2
				%call2 = call i32 @fib(i32 %sub1)
				%add = add nsw i32 %call, %call2
				store i32 %add, i32* %retval, align 4
				br label %return

				return: ; preds = %if.end, %if.then
				%4 = load i32, i32* %retval, align 4
				ret i32 %4
				}

				define dso_local i32 @fib(i32 %n) #0 {
				entry:
				%retval = alloca i32, align 4
				%n.addr = alloca i32, align 4
				store i32 %n, i32* %n.addr, align 4
				%0 = load i32, i32* %n.addr, align 4
				%cmp = icmp sle i32 %0, 1
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%1 = load i32, i32* %n.addr, align 4
				store i32 %1, i32* %retval, align 4
				br label %return

				if.end: ; preds = %entry
				%2 = load i32, i32* %n.addr, align 4
				%sub = sub nsw i32 %2, 1
				%call = call i32 @fib0(i32 %sub)
				%3 = load i32, i32* %n.addr, align 4
				%sub1 = sub nsw i32 %3, 2
				%call2 = call i32 @fib0(i32 %sub1)
				%add = add nsw i32 %call, %call2
				store i32 %add, i32* %retval, align 4
				br label %return

				return: ; preds = %if.end, %if.then
				%4 = load i32, i32* %retval, align 4
				ret i32 %4
				}

				; CHECK: @A(
				; CHECK: call i32 @fib(i32 100)
				; CHECK: call i32 @fib(i32 200)
				define dso_local i32 @A(i32 %n) #1 {
				entry:
				%n.addr = alloca i32, align 4
				store i32 %n, i32* %n.addr, align 4
				%call = call i32 @fib(i32 100)
				%call1 = call i32 @fib(i32 200)
				%add = add nsw i32 %call, %call1
				ret i32 %add
				}

				; CHECK: @B(
				; CHECK: call i32 @fib(i32 100)
				; CHECK: call i32 @fib(i32 200)
				define dso_local i32 @B(i32 %n) #1 {
				entry:
				%n.addr = alloca i32, align 4
				store i32 %n, i32* %n.addr, align 4
				%call = call i32 @A(i32 8)
				ret i32 %call
				}

				; CHECK: @C(
				; CHECK: call i32 @fib(i32 100)
				; CHECK: call i32 @fib(i32 200)
				define dso_local i32 @C(i32 %n) #0 {
				entry:
				%n.addr = alloca i32, align 4
				store i32 %n, i32* %n.addr, align 4
				%0 = load i32, i32* %n.addr, align 4
				%call = call i32 @B(i32 %0)
				ret i32 %call
				}

				attributes #0 = { nounwind uwtable "disable-tail-calls"="false" "frame-pointer"="none" "min-legal-vector-width"="0" "no-jump-tables"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
				attributes #1 = { alwaysinline nounwind uwtable "disable-tail-calls"="false" "frame-pointer"="none" "min-legal-vector-width"="0" "no-jump-tables"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }