Page MenuHomePhabricator

[Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.
Needs ReviewPublic

Authored by hoy on Mar 11 2021, 11:44 PM.

Details

Summary

This change avoids inlining a function involved in a mutual-recursive cycle. Due to the lack of a global inline history, such inlining may cause an exponential size growth as the inlining extends up along the call graph.

Consider the following example,

void A() { B(); B();}
void B() { A(); }
void C() { B(); }
void D() { C(); }

When processing C, inlining B will result in

void C() { B(); B(); }

When processing D, inlining C and further the two B callsites will result in

void D() { B(); B(); B(); B(); }

We've also seen that one of our internal services suffered from the inliner hanging for 6 hours.

Diff Detail

Unit TestsFailed

TimeTest
630 msx64 windows > Clang.Modules::preprocess-build-diamond.m
Script: -- : 'RUN: at line 1'; rm -rf C:\ws\w16c2-1\llvm-project\premerge-checks\build\tools\clang\test\Modules\Output\preprocess-build-diamond.m.tmp

Event Timeline

hoy created this revision.Mar 11 2021, 11:44 PM
hoy requested review of this revision.Mar 11 2021, 11:44 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2021, 11:44 PM
hoy edited the summary of this revision. (Show Details)Mar 11 2021, 11:45 PM
hoy added reviewers: davidxl, wenlei.
hoy updated this revision to Diff 330271.Mar 12 2021, 9:22 AM

Updating D98481: [Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.

Thanks for looking into this. This fix is more restrictive than I originally thought. And I'm also not sure if it does what we needed (see inline comment for this part).

IIUC, if we have A->B, B->C, C->B, even if there's no inlining between B and C, with this change, A will never inline B. Is that correct? I imagine that would throttle inliner very visibly - we can gauge the effect by looking at the number of inlining w/ and w/o this change for spec and larger programs.

Ideally, we don't want to limit all inlining across SCC if the inlinee is part of a recursion cycle. The problem here is the repeated inlining due to recursion. We have InlineHistory for single iteration SCC run, and InlinedInternalEdges for cross iteration checks against repeated recursive inlining in the same SCC, but they don't restrict all recursive inlining in general. It would be good to have something that only limits the extra, repeated inlining for this case too.

I think InlinedInternalEdges is close to what is needed here, except that here we're dealing with different SCC when moving up the SCC graph. Wondering if we can somehow transfer the InlinedInternalEdges of previously processed SCC so later SCC is also aware of what would be repeated recursive inlining, though as @chandlerc called out InlinedInternalEdges itself is already a hack that breaks layering.

+@chandlerc, and @wmi who improved InlinedInternalEdges in https://reviews.llvm.org/D52915.

llvm/lib/Transforms/IPO/Inliner.cpp
833

If the goal is to prevent repeated recursive inlining when processing caller SCC, this check may not always work either depending on what state we are with (A,B) SCC.

If we stop at A inlined B (so A only calls A, but not B), then the (A,B) SCC is broken into (A) and (B), in which case the CalleeSCC size check against 1 may pass when we look at C or D. So what happens could be - we pass this check at caller C, doing more inlining, then fail this check at D, bouncing back and a forth as we move up.

hoy added a comment.Mar 14 2021, 10:02 PM

Thanks for looking into this. This fix is more restrictive than I originally thought. And I'm also not sure if it does what we needed (see inline comment for this part).

Right, it doesn't do exactly what we need, which is a global inline history to prevent a recursive callsites being inlined repeated It aggressively prevents inlining recursive callsites out of their defining SCCs.

IIUC, if we have A->B, B->C, C->B, even if there's no inlining between B and C, with this change, A will never inline B. Is that correct? I imagine that would throttle inliner very visibly - we can gauge the effect by looking at the number of inlining w/ and w/o this change for spec and larger programs.

Correct, with this change B and C will never be inlined into any function outside their enclosing SCC. I agree this is aggressive. The effect could be propagated as the bottom-up walk goes.

Ideally, we don't want to limit all inlining across SCC if the inlinee is part of a recursion cycle. The problem here is the repeated inlining due to recursion. We have InlineHistory for single iteration SCC run, and InlinedInternalEdges for cross iteration checks against repeated recursive inlining in the same SCC, but they don't restrict all recursive inlining in general. It would be good to have something that only limits the extra, repeated inlining for this case too.

I think InlinedInternalEdges is close to what is needed here, except that here we're dealing with different SCC when moving up the SCC graph. Wondering if we can somehow transfer the InlinedInternalEdges of previously processed SCC so later SCC is also aware of what would be repeated recursive inlining, though as @chandlerc called out InlinedInternalEdges itself is already a hack that breaks layering.

+@chandlerc, and @wmi who improved InlinedInternalEdges in https://reviews.llvm.org/D52915.

A global inline history across also SCCs sounds good fit there, however, building such a history is a bit challenging it that how to select the hash keys. It would be good to know if InlinedInternalEdges can be extended to handle repeatedly inlining a recursive callsite across SCCs.

llvm/lib/Transforms/IPO/Inliner.cpp
833

Do you mean that if A turns into self-recursive, C then can inline it because CalleeSCC->size() == 1?

In D98481#2625282, @hoy wrote:

Thanks for looking into this. This fix is more restrictive than I originally thought. And I'm also not sure if it does what we needed (see inline comment for this part).

Right, it doesn't do exactly what we need, which is a global inline history to prevent a recursive callsites being inlined repeated It aggressively prevents inlining recursive callsites out of their defining SCCs.

IIUC, if we have A->B, B->C, C->B, even if there's no inlining between B and C, with this change, A will never inline B. Is that correct? I imagine that would throttle inliner very visibly - we can gauge the effect by looking at the number of inlining w/ and w/o this change for spec and larger programs.

Correct, with this change B and C will never be inlined into any function outside their enclosing SCC. I agree this is aggressive. The effect could be propagated as the bottom-up walk goes.

Practically, it means this change can potentially have negative performance impact on many workloads, which by itself would definitely be a deal breaker. And if that's not the case, we need to prove it through testing with this change. (Even if we don't observe any negative impact on a few workloads, it's still not bullet proof, but it's at least a data point).

Ideally, we don't want to limit all inlining across SCC if the inlinee is part of a recursion cycle. The problem here is the repeated inlining due to recursion. We have InlineHistory for single iteration SCC run, and InlinedInternalEdges for cross iteration checks against repeated recursive inlining in the same SCC, but they don't restrict all recursive inlining in general. It would be good to have something that only limits the extra, repeated inlining for this case too.

I think InlinedInternalEdges is close to what is needed here, except that here we're dealing with different SCC when moving up the SCC graph. Wondering if we can somehow transfer the InlinedInternalEdges of previously processed SCC so later SCC is also aware of what would be repeated recursive inlining, though as @chandlerc called out InlinedInternalEdges itself is already a hack that breaks layering.

+@chandlerc, and @wmi who improved InlinedInternalEdges in https://reviews.llvm.org/D52915.

A global inline history across also SCCs sounds good fit there, however, building such a history is a bit challenging it that how to select the hash keys. It would be good to know if InlinedInternalEdges can be extended to handle repeatedly inlining a recursive callsite across SCCs.

llvm/lib/Transforms/IPO/Inliner.cpp
833

Right.

hoy added a comment.Mar 15 2021, 9:05 AM
In D98481#2625282, @hoy wrote:

Thanks for looking into this. This fix is more restrictive than I originally thought. And I'm also not sure if it does what we needed (see inline comment for this part).

Right, it doesn't do exactly what we need, which is a global inline history to prevent a recursive callsites being inlined repeated It aggressively prevents inlining recursive callsites out of their defining SCCs.

IIUC, if we have A->B, B->C, C->B, even if there's no inlining between B and C, with this change, A will never inline B. Is that correct? I imagine that would throttle inliner very visibly - we can gauge the effect by looking at the number of inlining w/ and w/o this change for spec and larger programs.

Correct, with this change B and C will never be inlined into any function outside their enclosing SCC. I agree this is aggressive. The effect could be propagated as the bottom-up walk goes.

Practically, it means this change can potentially have negative performance impact on many workloads, which by itself would definitely be a deal breaker. And if that's not the case, we need to prove it through testing with this change. (Even if we don't observe any negative impact on a few workloads, it's still not bullet proof, but it's at least a data point).

Good point. I can get some numbers out of SPEC. BTW, inlining self-recursive functions is also disabled and I'm not sure what was the motivation and whether we have data points for that. After inlining is done for an SCC, there is a good chance that quite some functions in the original SCC become self-recursive and they are prevented from further inlining.

Ideally, we don't want to limit all inlining across SCC if the inlinee is part of a recursion cycle. The problem here is the repeated inlining due to recursion. We have InlineHistory for single iteration SCC run, and InlinedInternalEdges for cross iteration checks against repeated recursive inlining in the same SCC, but they don't restrict all recursive inlining in general. It would be good to have something that only limits the extra, repeated inlining for this case too.

I think InlinedInternalEdges is close to what is needed here, except that here we're dealing with different SCC when moving up the SCC graph. Wondering if we can somehow transfer the InlinedInternalEdges of previously processed SCC so later SCC is also aware of what would be repeated recursive inlining, though as @chandlerc called out InlinedInternalEdges itself is already a hack that breaks layering.

+@chandlerc, and @wmi who improved InlinedInternalEdges in https://reviews.llvm.org/D52915.

A global inline history across also SCCs sounds good fit there, however, building such a history is a bit challenging it that how to select the hash keys. It would be good to know if InlinedInternalEdges can be extended to handle repeatedly inlining a recursive callsite across SCCs.

llvm/lib/Transforms/IPO/Inliner.cpp
833

Inlining self-recursion is already disabled by inline cost analyzer. Not sure why it is done though. So C will not inline A once A becomes self-recursive.

hoy added a comment.EditedMar 15 2021, 9:38 PM

I got some numbers for SPEC2017 with AutoFDO + LTO. Timing wise no noticeable difference but there is difference in code size for some benchmarks which indicates less inlining with this patch. The difference is pretty big for perlbench_s. I guess we don't see perf gap because those recursive functions are run a lot more times than the inlined times.

	             After/Before (Code Size)
   508.namd_r    	100.00%
   510.parest_r  	100.00%
   511.povray_r  	97.89%
   600.perlbench_s	89.68%
   602.gcc_s     	98.79%
   605.mcf_s     	100.00%
   620.omnetpp_s 	100.00%
   623.xalancbmk_s	99.91%
   625.x264_s    	100.00%
   631.deepsjeng_s	100.00%
   638.imagick_s 	99.55%
   641.leela_s   	100.00%
   644.nab_s     	100.00%
   657.xz_s      	100.00%
In D98481#2628047, @hoy wrote:

I got some numbers for SPEC2017 with AutoFDO + LTO. Timing wise no noticeable difference but there is difference in code size for some benchmarks which indicates less inlining with this patch. The difference is pretty big for perlbench_s I guess we don't see perf gap because recursive those functions are run a lot more times than the inlined times.

	             After/Before (Code Size)
   508.namd_r    	100.00%
   510.parest_r  	100.00%
   511.povray_r  	97.89%
   600.perlbench_s	89.68%
   602.gcc_s     	98.79%
   605.mcf_s     	100.00%
   620.omnetpp_s 	100.00%
   623.xalancbmk_s	99.91%
   625.x264_s    	100.00%
   631.deepsjeng_s	100.00%
   638.imagick_s 	99.55%
   641.leela_s   	100.00%
   644.nab_s     	100.00%
   657.xz_s      	100.00%

Do you have numbers on how many times the new check blocks inlining for each (if you move it to be the last check)? The code size number seems to be indicate that the change indeed impacts inlining quite a bit.

When the inline history was introduced in https://reviews.llvm.org/D36188, it was verified that it didn't block any inlining for spec.

hoy added a comment.Mar 15 2021, 10:44 PM
In D98481#2628047, @hoy wrote:

I got some numbers for SPEC2017 with AutoFDO + LTO. Timing wise no noticeable difference but there is difference in code size for some benchmarks which indicates less inlining with this patch. The difference is pretty big for perlbench_s I guess we don't see perf gap because recursive those functions are run a lot more times than the inlined times.

	             After/Before (Code Size)
   508.namd_r    	100.00%
   510.parest_r  	100.00%
   511.povray_r  	97.89%
   600.perlbench_s	89.68%
   602.gcc_s     	98.79%
   605.mcf_s     	100.00%
   620.omnetpp_s 	100.00%
   623.xalancbmk_s	99.91%
   625.x264_s    	100.00%
   631.deepsjeng_s	100.00%
   638.imagick_s 	99.55%
   641.leela_s   	100.00%
   644.nab_s     	100.00%
   657.xz_s      	100.00%

Do you have numbers on how many times the new check blocks inlining for each (if you move it to be the last check)? The code size number seems to be indicate that the change indeed impacts inlining quite a bit.

When the inline history was introduced in https://reviews.llvm.org/D36188, it was verified that it didn't block any inlining for spec.

Callsite inlining are good counters. For those benchmarks affected, inlined callsites vs non-inlined callsites due to this change are:

               inlined callsites        recursive callsites    
511.povray_r        3178                   603
600.perlbench_s     4577                   2139
602.gcc_s          46125                   726
623.xalancbmk_s    15573                   79
638.imagick_s       7719                   61

Apparently, perlbench_s has a lot less inlining.

Without guarding the caller size, this (over topdown inlining) can also potentially happen without recursion.

This patch seems too aggressive. One possible way to handle it is to restrict inlining to a node N in non trivial SCC when node N has >1 fan-out (edges) to other nodes in the same SCC -- the large fan-out is the reason for the excessive growth.

llvm/lib/Transforms/IPO/Inliner.cpp
809

if (CalleeSCC == C) &&

hoy added a comment.Mar 16 2021, 10:07 AM

Without guarding the caller size, this (over topdown inlining) can also potentially happen without recursion.

This patch seems too aggressive. One possible way to handle it is to restrict inlining to a node N in non trivial SCC when node N has >1 fan-out (edges) to other nodes in the same SCC -- the large fan-out is the reason for the excessive growth.

Thanks for the suggestion. Exactly, the excessive growth boils down to the fan-out. An additional check for the fan-out factor should effectively make the restriction less aggressive.

In D98481#2629511, @hoy wrote:

Without guarding the caller size, this (over topdown inlining) can also potentially happen without recursion.

This patch seems too aggressive. One possible way to handle it is to restrict inlining to a node N in non trivial SCC when node N has >1 fan-out (edges) to other nodes in the same SCC -- the large fan-out is the reason for the excessive growth.

Thanks for the suggestion. Exactly, the excessive growth boils down to the fan-out. An additional check for the fan-out factor should effectively make the restriction less aggressive.

Fan-out is one way to control the growth, I'm not sure if we can make it narrow and selective enough though.

I also wanted to point out that this case is related to the use of a partial profile, in which case the hot threshold may have triggered more aggressive inlining.

Actually we have a long list of issues where the CGSCC inliner explodes (w/ or w/o profile). It seem to me that the way CGSCC inliner works relies heavily on a "good" threshold, otherwise there's this risk of exponential growth of size and build time regardless of whether we have recursion or not just like David pointed. We indeed have seen cases where the default inline threshold leads to excessive inlining when there's no recursion.

While in theory if we have a reasonable threshold, CGSCC inliner would finish in reasonable amount of time (unless we have bugs that cause SCC mutation to not converge), it's hard for a preset threshold to fit all builds, which is why we ran into such cases again and again. I know that there's past discussion about introducing size/growth cap for CGSCC inliner and it didn't go through, but I'm wondering if we should revisit that given that preset threshold lead to explosive inlining in quite a few cases. We can find way to get around it by tuning threshold, but it's questionable to me whether the time spent is worthwhile to chase down these cases one by one and ending up tuning threshold without finding fundamental issues other than CGGSCC inliner is unbounded and rely on threshold.

If we cap size, we would need to inline more important call sites first. At the minimum, can we add a switch (off by default) to cap size so when people run into this, there would offer a very targeted solution - currently we sometimes have to tune down threshold globally, which negatively affects other inline paths too. We can also add a warning when inliner is throttled due to cap.

Having cap on the target context (where a site is inlined into) is useful for performance purpose but it is based on working set analysis which we are working on. This can be 'overloaded' to control compile time growth too.

Regarding option, I think it is useful to have an internal option to workaround problem like this. The easiest is to have a cap on the max number callsites that can iteratively inlined for a given callsite.

Having cap on the target context (where a site is inlined into) is useful for performance purpose but it is based on working set analysis which we are working on. This can be 'overloaded' to control compile time growth too.

Good to know. It's kind of a separate optimization topic, but would like to learn more about this effort. How do you perform working set analysis, is any of this available in upstream now?

On the other hand, even if the working set based heuristic isn't ready yet, the scaffolding would be the same if you need a cap. And when you put a cap, you also need to be selective and prioritize inline candidate, right? If these are available in upstream, a switch to cap pathological cases would become very simple by sharing the scaffolding.

Regarding option, I think it is useful to have an internal option to workaround problem like this. The easiest is to have a cap on the max number callsites that can iteratively inlined for a given callsite.

Sounds good. So let's put in a switch for a cap, default to off, also emit a warning when the switch is on and cap is hit? It would still be nice if we inline important call site first when cap is hit (and share working set heuristic scaffolding for that).

Hiroshi is working on the implementation, and I expect upstreaming happening soon. In a nutshell, the working set analysis is based on building interprocedural loop graph (ILG) and working set propagation.

Yes what you describe about the option sounds reasonable, though I am not sure about warning. Some -Rpass-analysis or -Rpass-missed message or some debug output seem better.

Hiroshi is working on the implementation, and I expect upstreaming happening soon. In a nutshell, the working set analysis is based on building interprocedural loop graph (ILG) and working set propagation.

Looking forward to it. Interesting to see how accurately we can model working set.

Yes what you describe about the option sounds reasonable, though I am not sure about warning. Some -Rpass-analysis or -Rpass-missed message or some debug output seem better.

The warning is to make people aware this is happening. Service developers don't use pass remarks, and pass remarks can also be a bit noisy.

hoy added a comment.Mar 16 2021, 11:42 PM

A cap sounds good to me, since that's something we've been looking for for a long time. Regarding the cap, what kind of cap do you think will be appropriate? A cap on function size growth or a cap on how many times an original callsite can be inlined iteratively? Regarding the later, I was wondering what's a good way to track a callsite across SCC and function passes.


From: Xinliang David Li <davidxl@google.com>
Sent: Wednesday, March 17, 2021 9:02 AM
To: Hongtao Yu <reviews+D98481+public+2d824aa1c1081e47@reviews.llvm.org>
Cc: Hongtao Yu <hoy@fb.com>; Wenlei He <wenlei@fb.com>; Kazu Hirata <kazu@google.com>; Chandler Carruth <chandlerc@gmail.com>; Wei Mi <wmi@google.com>; Easwaran Raman <eraman@google.com>; llvm-commits <llvm-commits@lists.llvm.org>; bhuvanendra.kumarn@amd.com <bhuvanendra.kumarn@amd.com>; 88888yl@gmail.com <88888yl@gmail.com>; dougpuob@gmail.com <dougpuob@gmail.com>; David Green <david.green@arm.com>
Subject: Re: [PATCH] D98481: [Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.

hoy added a comment.Mar 17 2021, 11:43 AM

From: Xinliang David Li <davidxl@google.com>
Date: Wednesday, March 17, 2021 at 9:02 AM
To: Hongtao Yu <reviews+D98481+public+2d824aa1c1081e47@reviews.llvm.org>
Cc: Hongtao Yu <hoy@fb.com>, Wenlei He <wenlei@fb.com>, Kazu Hirata <kazu@google.com>, Chandler Carruth <chandlerc@gmail.com>, Wei Mi <wmi@google.com>, Easwaran Raman <eraman@google.com>, llvm-commits <llvm-commits@lists.llvm.org>, bhuvanendra.kumarn@amd.com <bhuvanendra.kumarn@amd.com>, 88888yl@gmail.com <88888yl@gmail.com>, dougpuob@gmail.com <dougpuob@gmail.com>, David Green <david.green@arm.com>
Subject: Re: [PATCH] D98481: [Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.

hoy added a comment.Mar 17 2021, 11:44 AM

I was thinking about capping on function size so that small callees such as getters/setters can always be inlined. What do you think?

From: Xinliang David Li <davidxl@google.com>
Date: Wednesday, March 17, 2021 at 9:02 AM
To: Hongtao Yu <reviews+D98481+public+2d824aa1c1081e47@reviews.llvm.org>
Cc: Hongtao Yu <hoy@fb.com>, Wenlei He <wenlei@fb.com>, Kazu Hirata <kazu@google.com>, Chandler Carruth <chandlerc@gmail.com>, Wei Mi <wmi@google.com>, Easwaran Raman <eraman@google.com>, llvm-commits <llvm-commits@lists.llvm.org>, bhuvanendra.kumarn@amd.com <bhuvanendra.kumarn@amd.com>, 88888yl@gmail.com <88888yl@gmail.com>, dougpuob@gmail.com <dougpuob@gmail.com>, David Green <david.green@arm.com>
Subject: Re: [PATCH] D98481: [Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.

hoy added a comment.Mar 17 2021, 11:46 AM

Thanks for the suggestion. I was thinking about capping on function size so that small callees such as getters/setters can always be inlined. What do you think?

From: Xinliang David Li <davidxl@google.com>
Date: Wednesday, March 17, 2021 at 9:02 AM
To: Hongtao Yu <reviews+D98481+public+2d824aa1c1081e47@reviews.llvm.org>
Cc: Hongtao Yu <hoy@fb.com>, Wenlei He <wenlei@fb.com>, Kazu Hirata <kazu@google.com>, Chandler Carruth <chandlerc@gmail.com>, Wei Mi <wmi@google.com>, Easwaran Raman <eraman@google.com>, llvm-commits <llvm-commits@lists.llvm.org>, bhuvanendra.kumarn@amd.com <bhuvanendra.kumarn@amd.com>, 88888yl@gmail.com <88888yl@gmail.com>, dougpuob@gmail.com <dougpuob@gmail.com>, David Green <david.green@arm.com>
Subject: Re: [PATCH] D98481: [Inliner] Do not inline mutual-recursive functions to avoid exponential size growth.

davidxl added a subscriber: hoy.Mar 17 2021, 11:49 AM

that is fine if it does not add visible compile time overhead (e.g, to
compute and track size).

David

I think size cap might be better than call site counts, and query inliner instr size should not have cost.

2nd thought on the warning, how about this. The goal is help save time for everyone.

  • One switch for -inline-size-limit, default to huge value (could be INT_MAX). When the limit is hit, we emit a warning, but don’t stop inlining. This make it obvious to everyone when cgscc inlining is causing extremely long build time. With default to huge value, it should not fire at all so nothing would break due to warning as error. But people may choose to set a lower value (e.g. we may use a lower value internally, it may break build, but the hope is to capture extremely slow builds and bloated codegen quickly)
  • Another switch to -stop-inline-for-size-limit, default to off. When this is enabled, cgscc inline will stop when the above size limit is hit, no warning is issued since this is asked by user explicitly. This serves as a possible mitigation for bloated inlining without changing global threshold, and it's also cohesive with #1.