This is an archive of the discontinued LLVM Phabricator instance.

[Inliner] Avoid excessive inlining through devirtualised calls
AbandonedPublic

Authored by jmorse on Mar 7 2023, 8:51 AM.

Details

Summary

This is a fix for GH60773 [0], an "inliner inlines exponentially" problem similar to [1]. The inliner isn't my area of expertise, so be forewarned that I might be using terminology incorrectly below.

In GH44598 [2] a pattern was discovered where the inliner will inline large stretches of an SCC before it realises it's inlining a whole cycle, growing function sizes significantly and increasing compile times. The fix in [3] was to annotate intra-SCC calls with an inline-cost multiplier when they're part of such behaviour, to discourage further inlining.

We've found a scenario [0] where this same behaviour occurs, but where we start inlining an outer RefSCC if a function call is devirtualised. Observe in the attached test case: longname forms one RefSCC, while the chain from rec_call_to_longname to e forms an outer RefSCC:

RefSCC with 1 call SCCs:
  SCC with 1 functions:
    longname

RefSCC with 4 call SCCs:
  SCC with 1 functions:
    rec_call_to_longname
  SCC with 1 functions:
    a
  SCC with 1 functions:
    c
  SCC with 1 functions:
    e

During inlining of functions a, c, e, both rec_call_to_longname and longname are always inlined, the latter because we give a large bonus to any function with calls that can be devirtualised. And because there are two calls in longname that get devirtualised, the number of function calls to rec_call_to_longname doubles each time we inline. This makes function e grow exponentially in the size of the outer RefSCC. This showed up in an LTO link of a large C++ codebase, adding at least an order of magnitude to compile times (it didn't complete). I think this behaviour occurs rarely, and it may depend on PGO too, but the impact is large when it does occur.
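As a back-of-the-envelope illustration of the doubling, here is a hypothetical Python model I've written (not anything in LLVM): each time longname is inlined, its two devirtualised calls each leave behind a fresh call to rec_call_to_longname, so the number of such call sites doubles with every function in the chain that gets inlined.

```python
# Toy model (not LLVM code) of the call-site growth described above;
# the function names mirror the test case but the model is hypothetical.

def inline_level(call_sites: int) -> int:
    """Inlining rec_call_to_longname, then longname, at every call site:
    each inlined longname body contains two devirtualised calls back to
    rec_call_to_longname, so every call site becomes two."""
    return call_sites * 2

def call_sites_after(levels: int) -> int:
    """Calls to rec_call_to_longname remaining after inlining through
    `levels` functions of the outer RefSCC chain (e ... a)."""
    sites = 1  # the single original call in the outermost function
    for _ in range(levels):
        sites = inline_level(sites)
    return sites

print(call_sites_after(4))   # 16: exponential in the chain length
```

With a chain of even modest depth the duplicated call sites, and the code inlined into each of them, dominate the function body, which matches the observed blow-up.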

As a fix, in this patch I've extended the original fix from [3] to detect newly devirtualised calls and add the "don't repeatedly inline this" multiplier attribute. The principle is that managing to devirtualise a function call is already a great triumph, but starting to inline from a parent RefSCC is very likely to form a loop. The exact check ("any devirtualised call to outside of the SCC we were already inlining") might be too strong; it could be relaxed, or made to examine the RefSCC structure. I'm not extremely confident on this, so I've defaulted to being conservative.

The compile time impact is good [4]: there's some noise and also some improvements in compile time. I ran a SPEC2017 intspeed "estimate" with this change and saw some unremarkable differences that were all within the noise. Performance measurement is another area I'm unfamiliar with so I don't want to make any claims, but I don't believe this patch causes a runtime performance regression.

Note the branch-on-undef with profile metadata in the test case: that's what came out of reducing the original long-running sample, seemingly it's part of convincing the inliner to repeatedly inline. I understand we're supposed to be avoiding adding more tests that branch on undef, so I'll try and cook up an alternative, but the rest of the patch is ready for review.

[0] https://github.com/llvm/llvm-project/issues/60773
[1] https://github.com/llvm/llvm-project/issues/44598
[2] https://github.com/llvm/llvm-project/issues/44598#issuecomment-981025691
[3] https://reviews.llvm.org/D121084
[4] http://llvm-compile-time-tracker.com/compare.php?from=2399497c9d68c946764e3089fb0b4c8d29166d67&to=160528d0c9ac9b6f179818664030110faf13ef0b&stat=instructions%3Au

Diff Detail

Event Timeline

jmorse created this revision. Mar 7 2023, 8:51 AM
Herald added a project: Restricted Project. · View Herald Transcript · Mar 7 2023, 8:51 AM
jmorse requested review of this revision. Mar 7 2023, 8:51 AM

sorry for the slow response

looking at the test case, this looks more like an issue with inlining heuristics. the branch_weights say that we almost never call longname in rec_call_to_longname, but we still end up inlining it, presumably due to the devirtualization bonus as you've said? removing the !prof from the branch makes this test case ok, which is odd

jmorse added a comment. Edited. Mar 29 2023, 5:30 AM

Interesting -- I've dug into the cost modelling, and I think you're correct that there's an issue, but it's even less enjoyable. In the first instance, the branch weights successfully suppress inlining of longname into rec_call_to_longname because it's a cold call site. However, once rec_call_to_longname is inlined into a, the same computations round all the frequencies to zero, and inlining starts to happen repeatedly through both rec_call_to_longname and longname, giving the exponential behaviour. Without the branch weights, inlining longname into rec_call_to_longname creates a directly recursive function that never gets inlined again.

My understanding of inlining is that if we've rejected inlining at a call site because of the cost, it would be very unexpected for inlining that call site into somewhere else to lower the cost. Thus the cost modelling changing its mind is the most surprising behaviour and is most likely to be wrong. It's another area I'm unfamiliar with that I'll try digging into; the relevant function is InlineCostCallAnalyzer::isColdCallSite [0]. I'll probably be able to work it out, but here's a dump of the relevant state just in case someone else knows a graceful solution -- in the first instance, on the linked line,

  • ColdProb is a fixed-point representation of 0.02f,
  • CallSiteFreq is integer value 8,
  • CallerEntryFreq is integer value 450,
  • Return-value of CallerEntryFreq * ColdProb is 9

I assume others can interpret the following branch probability summary:

block-frequency: rec_call_to_longname
=====================================
reverse-post-order-traversal
 - 0: 
 - 1: 
 - 2: 
loop-detection
compute-mass-in-function
 - node: 
  => [ local  ] weight = 38171822, succ = 
  => [ local  ] weight = 2109311826, succ = 
  => mass:  ffffffffffffffff
  => assign 048ce95bffffffff (fb7316a400000000) to 
  => assign fb7316a400000000 (0000000000000000) to 
 - node: 
  => [ local  ] weight = 2147483648, succ = 
  => mass:  048ce95bffffffff
  => assign 048ce95bffffffff (0000000000000000) to 
 - node: 
  => mass:  ffffffffffffffff
float-to-int: min = 0.01777513977, max = 1.0, factor = 450.0667844
 - : float = 1.0, scaled = 450.0667844, int = 450
 - : float = 0.01777513977, scaled = 8.0, int = 8
 - : float = 1.0, scaled = 450.0667844, int = 450
block-frequency-info: rec_call_to_longname
 - : float = 1.0, int = 450
 - : float = 0.017775, int = 8
 - : float = 1.0, int = 450

Whereas in the second instance, the function looks like this:

define internal void @a() {
  store ptr @extern, ptr @e, align 8
  br i1 undef, label %1, label %rec_call_to_longname.exit, !prof !0

1:                                                ; preds = %0
  call void @longname(ptr @rec_call_to_longname, ptr undef, ptr undef)
  br label %rec_call_to_longname.exit

rec_call_to_longname.exit:                        ; preds = %0, %1
  ret void
}

The branch-frequency dump is:

block-frequency: a
==================
reverse-post-order-traversal
 - 0:
loop-detection
compute-mass-in-function
 - node:
  => mass:  ffffffffffffffff
float-to-int: min = 1.0, max = 1.0, factor = 8.0
 - : float = 1.0, scaled = 8.0, int = 8
block-frequency-info: a
 - : float = 1.0, int = 8

The variable valuations:

  • ColdProb is a fixed-point representation of 0.02f,
  • CallSiteFreq is integer value 0,
  • CallerEntryFreq is integer value 8,
  • Return-value of CallerEntryFreq * ColdProb is 0

And since zero is not less than zero, the call site isn't judged to be cold.
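To make the arithmetic concrete, here's a toy Python model of that comparison (my own sketch, not LLVM's code; the percentage constant and integer truncation stand in for the fixed-point BranchProbability scaling in the real implementation at [0]):

```python
# Toy model of the isColdCallSite comparison: a call site is "cold"
# when its frequency is below ColdProb (2%) of the caller's entry
# frequency.  Integer truncation mimics the fixed-point rounding.

COLD_PROB_PERCENT = 2  # stands in for ColdProb = 0.02

def is_cold_call_site(call_site_freq: int, caller_entry_freq: int) -> bool:
    cold_threshold = caller_entry_freq * COLD_PROB_PERCENT // 100
    return call_site_freq < cold_threshold

# First instance: threshold is 450 * 0.02 = 9, and 8 < 9, so cold.
print(is_cold_call_site(8, 450))   # True
# After inlining into @a: threshold is 8 * 0.02, truncated to 0,
# and 0 < 0 is false, so the call site is no longer judged cold.
print(is_cold_call_site(0, 8))     # False
```

The sketch shows how small entry frequencies make the cold threshold round to zero, flipping the earlier decision.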

[0] https://github.com/llvm/llvm-project/blob/96118f1b0ab7a18999a4f2199ba1ecd546c68cb8/llvm/lib/Analysis/InlineCost.cpp#L1795

(Oh look, Phab posts when you hit control+enter; I've edited more data into the earlier comment)

jmorse updated this revision to Diff 509647. Edited. Mar 30 2023, 7:03 AM

I've changed my mind: there actually are scenarios where being inlined into a different context should enable inlining that was previously skipped, for example where an unlikely branch can statically be found to be always true after inlining, justifying inlining whatever's down that branch. This has exposed the deeper issue, and I've now got an updated test that does not rely on profile data.

I believe that today, mutually recursive functions will be inlined until they become directly recursive, after which inlining is prohibited. The trouble comes when the mutual recursion is hidden by a prior decision not to inline a call site; that call site then becomes viable in a new, different context, which lets InlinerPass::run explore a cycle. The non-profile-data example below is very similar to before, but hides the cycle by making the longname <=> rec_call_to_longname interaction dependent on a function pointer passed in to rec_call_to_longname. Neither can be inlined into the other by itself; but in the context of the a function, all the function calls get devirtualised and the cycle explored, exponentially in the chain from e to a. Attaching the "function-inline-cost-multiplier" attribute suppresses this. I also patch part of the inline-cost computation that can accidentally send the cost negative, thus defeating multipliers.

This is still rare because:

  • Passing a function a pointer to itself isn't a common pattern,
  • I'm effectively forcing the chain of functions from e to a to be inlined by marking them internal, which gives them an inlining threshold boost,

But there could be other scenarios where context-dependent inlining choices disable and then enable a cycle.

Gentle ping here; this still might not be the right solution, but given the non-inline-cost based reproducer would you agree this is more of a design-of-the-inliner issue than just a costing problem? I suspect it's very rare, but it does cause a large downstream codebase to take near-infinite compile-time on an LTO + PGO configuration.

apologies, was out on vacation for a while

llvm/lib/Transforms/IPO/Inliner.cpp
431–432

IIUC, there's never actually a cycle (ref or call) in the IR. The problem is that after inlining, we see the same sort of issue as with inlining through a child SCC, because the indirect calls end up ping-ponging between @longname and @rec_call_to_longname.

433–437

I don't understand this comment. I tried the following diff and it adds the multiplier attributes as expected:

@@ -418,8 +420,15 @@ PreservedAnalyses InlinerPass::run(LazyCallGraph::SCC &InitialC,
               // inline through the SCC the function will end up being
               // self-recursive which the inliner bails out on, and inlining
               // within an SCC is necessary for performance.
-              if (CalleeSCC != C &&
-                  CalleeSCC == CG.lookupSCC(CG.get(*NewCallee))) {
+              auto WasCallEdge = [&CG](Function &Callee, Function &NewCallee) {
+                LazyCallGraph::Node &CalleeN = CG.get(Callee);
+                LazyCallGraph::Node &NewCalleeN = CG.get(NewCallee);
+                LazyCallGraph::Edge *E = CalleeN->lookup(NewCalleeN);
+                return E && E->isCall();
+              };
+              if ((CalleeSCC != C &&
+                   CalleeSCC == CG.lookupSCC(CG.get(*NewCallee))) ||
+                  !WasCallEdge(Callee, *NewCallee)) {
                 Attribute NewCBCostMult = Attribute::get(
                     M.getContext(),
                     InlineConstants::FunctionInlineCostMultiplierAttributeName,
446

what's the 10000 for, can we reuse the existing code?

llvm/test/Transforms/Inline/devirtualize-7.ll
2

slightly reduced test case

target triple = "x86_64-unknown-unknown"

declare void @extern1()

define internal void @rec_call_to_longname(ptr %longname) {
  call void %longname(ptr @rec_call_to_longname)
  ret void
}

define internal void @longname(ptr %0) {
  call void @extern1()
  call void @extern1()
  call void @extern1()
  call void %0(ptr @longname)  ; Exponential growth occurs between these two devirt calls.
  call void %0(ptr @longname)
  ret void
}

define void @e() {
  call void @c()
  call void @c()
  ret void
}

define internal void @c() {
  call void @b()
  ret void
}

define internal void @b() {
  call void @a()
  ret void
}

define internal void @a() {
  call void @rec_call_to_longname(ptr @longname)
  ret void
}
jmorse updated this revision to Diff 514652. Apr 18 2023, 8:49 AM

No worries; thanks for the review. I've adopted the revised test and the filter for adding the attribute.

(The extra code in the previous review turns out to be unneeded. I'd managed to confuse myself while backporting rG53e5e586709 to an old release to see if it affected the issue we were experiencing: I had made the attribute a function attribute rather than a call-site attribute. Whoops. The shorter filter works just fine.)

For the test checks, I'm now checking for both the attribute and the inlined function body, just in case the attribute ceases to have the same effect on inlining decisions.

aeubanks added inline comments. Apr 18 2023, 10:06 AM
llvm/lib/Analysis/InlineCost.cpp
683

is this change intended?

jmorse added inline comments. Apr 18 2023, 11:05 AM
llvm/lib/Analysis/InlineCost.cpp
683

It is; it puts a ceiling on the outer subtraction so that the cost is never reduced below zero. While the comment says that negative costs have been considered, I think that only applies to the inner subtraction, as I've seen negative numbers produced here during testing.

I can split this out into a different patch if needs be, but I'm not sure how to write a test for it. As far as I understand it, the cost can only be observed by passing -debug to LLVM, and I'm under the impression that that's generally discouraged in tests.

aeubanks added inline comments. Apr 18 2023, 11:58 AM
llvm/lib/Analysis/InlineCost.cpp
683

-passes=print<inline-cost> should be what you want for testing

but negative costs are very common for inlining because it's cheaper to not set up a call, so I don't think this change makes sense

Splitting out the inline comments,

but negative costs are very common for inlining because it's cheaper to not setup a call, so I don't think this change makes sense

Hrrmmm -- removing the floor on Cost causes the too-much-inlining behaviour in the added test. It looks like, at the point where the function-inline-cost-multiplier is added:

  • With the floor added, Cost is 5, and is eventually multiplied to be larger than the threshold,
  • Without the floor, Cost is -50; the multiplier can't stop it being negative, so the inlining always succeeds.

This strays even further outside of my comfort zone, but it seems that:

  • If the cost is permitted to be negative at any point then there'll inevitably be an input where the cost-multiplier isn't able to suppress inlining,
  • The cost needs to be transiently negative sometimes to correctly sum the costs and benefits of different parts of a function.

Perhaps the solution is a fail-safe upper bound on the function-inline-cost-multiplier past which inlining is disallowed, i.e. "we've inlined through this SCC ten times now in this context, quit inlining it"?
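The failure mode can be illustrated numerically (my own sketch, with a made-up threshold; LLVM's real cost accounting is more involved): a multiplier never changes the sign of a negative cost, so inlining is accepted at every depth, whereas any positive floor eventually crosses the threshold.

```python
# Sketch (not LLVM code) of why a pure multiplier cannot suppress
# inlining once the cost has gone negative.  The threshold is made up.

THRESHOLD = 45

def inlining_allowed(cost: int, multiplier: int) -> bool:
    return cost * multiplier < THRESHOLD

def first_rejected_depth(cost: int, max_depth: int = 32):
    """Depth at which the doubling multiplier first rejects inlining,
    or None if it never does within max_depth levels."""
    multiplier = 1
    for depth in range(max_depth):
        if not inlining_allowed(cost, multiplier):
            return depth
        multiplier *= 2
    return None

print(first_rejected_depth(-50))  # None: a negative cost always inlines
print(first_rejected_depth(5))    # 4: 5 * 2**4 = 80, which exceeds 45
```

With the floor, the multiplied cost grows geometrically and the chain is cut off after a few levels; without it, no multiplier value can ever make a negative product exceed the threshold.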

ah that makes sense. what about, on top of multiplying by the multiplier, we also add a fixed cost each time: function-inline-cost-addend/-extra (or some better name)?

jmorse updated this revision to Diff 515274. Apr 20 2023, 3:58 AM

Sounds good; how about simply using the multiplier as the addend? See the revised patch. I imagine any addend would correlate and grow with the multiplier, and the only thing it needs to do is eventually make the cost positive; once that occurs, the multiplier should be so large that no further inlining is permitted.

The downside is the attribute name being slightly misleading as it's no longer a multiplier. That doesn't especially bother me, but it could be renamed to function-inline-cost-extra if needs be ("extra" is vague enough).

Doing this will slightly increase the frequency with which function-inline-cost-multiplier causes inlining to be suppressed, as we're effectively making it a larger number (see other inline cost test change). As this facility exists to catch rare and undesirable behaviours, I don't think this is an especially bad thing.

the inline cost multiplier looks good (could even be split out)

but I just realized that this will apply to all devirtualization, which can't be good for performance. we only want to penalize devirtualization that can cause mutually recursive inlining. it's pretty tricky figuring out what conditions this inlining explosion can arise from

I believe something along the lines of "F has a ref edge to NewCallee and NewCallee has a ref edge to itself (or maybe is in a non-trivial RefSCC?)", then this can occur? we want to be as precise as possible limiting when we add the inline cost penalty

I believe something along the lines of "F has a ref edge to NewCallee and NewCallee has a ref edge to itself (or maybe is in a non-trivial RefSCC?)", then this can occur? we want to be as precise as possible limiting when we add the inline cost penalty

I think that can be defeated by adding even more indirection to the calls, such as the below (although a solution to some cases is better than none). Consider:

define internal void @rec_call_to_longname(ptr %longname, ptr %other) {
  call void %longname(ptr %other, ptr %longname)
  ret void
}

define internal void @longname(ptr %0, ptr %1) {
  call void @extern1()
  call void @extern1()
  call void @extern1()
  call void %0(ptr %1, ptr %0)  ; Exponential growth occurs between these two devirt calls.
  call void %0(ptr %1, ptr %0)
  ret void
}

define internal void @a() {
  call void @rec_call_to_longname(ptr @longname, ptr @rec_call_to_longname)
  ret void
}

[... rest of reproducer...]

This too is over-inlined and gets the multiplier added with the patch as it is.

I'm starting to get the feeling that we can't actually identify / distinguish these recursive cases, because LLVM currently tracks whether an edge is a Call or a Ref, but not both. Specifically:

  1. Every function vulnerable to the recursion must contain a ref to the recursive function; otherwise the recursion happens at some other stage of inlining,
  2. The recursive-function must be devirtualised, directly called and inlined,
  3. But leave a reference to the recursive-function in the vulnerable function.

Retaining the reference to the recursive function is necessary for the function to be vulnerable, so that more calls to the recursive function can be generated, devirtualised and inlined. Whereas general devirtualisation Should (TM) eliminate the reference. Here's the callgraph for the reproducer:

Edges in function: extern1

Edges in function: rec_call_to_longname
  ref  -> rec_call_to_longname

Edges in function: longname
  ref  -> longname

Edges in function: e
  call -> d

Edges in function: d
  call -> c

Edges in function: c
  call -> b

Edges in function: b
  call -> a

Edges in function: a
  call -> rec_call_to_longname
  ref  -> longname

RefSCC with 1 call SCCs:
  SCC with 1 functions:
    rec_call_to_longname

RefSCC with 1 call SCCs:
  SCC with 1 functions:
    longname

RefSCC with 1 call SCCs:
  SCC with 1 functions:
    a

[repeated similar RefSCC the same for b c d e]

If you compute a new callgraph while inlining is in progress and print it after every inline, there don't appear to be any RefSCCs with more than a single SCC. Instead, each function [a-e) switches from having a call to rec_call_to_longname and a ref to longname, to having calls to both because the ref gets devirtualised, and finally the function gets fully inlined into its caller. I think in an ideal world I'd like to identify the above scenario and then annotate further calls in the devirtualised-and-inlined function with a cost multiplier.

I'm not quite sure how to implement that right now, but does the above rationale make sense? (I can dig further if it's sound).

actually I think the inline history feature is relevant here. normally we'll detect this sort of issue with the inline history and bail out when we try to inline rec_call_to_longname again from longname, and in fact -debug shows a lot of Skipping inlining due to history. but that's per-inline invocation, and due to the multiple callers, that gets forgotten every time we go up to the next caller. I think marking the call with noinline should allow us to remember this across inline invocations.

diff --git a/llvm/lib/Transforms/IPO/Inliner.cpp b/llvm/lib/Transforms/IPO/Inliner.cpp
index 01808b3d14fe..a50147f07f68 100644
--- a/llvm/lib/Transforms/IPO/Inliner.cpp
+++ b/llvm/lib/Transforms/IPO/Inliner.cpp
@@ -333,6 +333,7 @@ PreservedAnalyses InlinerPass::run(LazyCallGraph::SCC &InitialC,
         LLVM_DEBUG(dbgs() << "Skipping inlining due to history: " << F.getName()
                           << " -> " << Callee.getName() << "\n");
         setInlineRemark(*CB, "recursive");
+        CB->setIsNoInline();
         continue;
       }

this is fairly similar to the "inlining through a child SCC" problem that was mitigated with function-inline-cost-multiplier. I think this noinline patch will take care of many cases that function-inline-cost-multiplier initially set out to fix (the first iteration of that patch was to add noinline instead of a cost multiplier), but function-inline-cost-multiplier isn't completely redundant: this noinline only kicks in once we see that we've revisited a callee, while function-inline-cost-multiplier will still work for inlining through larger SCCs where we don't revisit the first callee quickly

https://reviews.llvm.org/D150989

does that solve the issue you're seeing?

Thanks for cooking up an alternative; that certainly sounds like a more precise fix -- I'll test D150989 on the reproducer we have to confirm that it fixes the problem.

jmorse abandoned this revision. May 25 2023, 3:41 AM

Abandoning in favour of D150989; many thanks for exploring the difficulties here. I'll raise a different patch for dealing with the negative-cost-defeats-the-multiplier issue.