Page MenuHomePhabricator

[Inline] Disable deferred inlining
ClosedPublic

Authored by nikic on Dec 10 2021, 1:13 AM.

Details

Summary

After the switch to the new pass manager, we have observed multiple instances of catastrophic inlining, where the inliner produces huge functions with many hundreds of thousands of instructions from small input IR. We were forced to back out the switch to the new pass manager for this reason. This patch fixes at least one of the root cause issues.

LLVM uses a bottom-up inliner, and the fact that functions are processed bottom-up is not just a question of optimality -- it is a critically imporant requirement to prevent runaway inlining. The premise of the current inlining approach and cost model is that after all calls inside a function have been inlined, it may get large enough that inlining it into its callers is no longer considered profitable. This safeguard does not exist if inlining doesn't happen bottom-up, as inlining the callees, and their callees, and their callees etc. will always seem individually profitable, and the inliner can easily flatten the whole call tree.

There are instances where we necessarily have to deviate from bottom-up inlining: When inlining in an SCC there is no natural "bottom", so inlining effectively happens top-down. This requires special care, and the inliner avoids exponential blowup by ensuring that functions in the SCC grow in a balanced way and will eventually hit the threshold.

However, there is one instance where the inlining advisor explicitly violates the bottom-up principle: Deferred inlining tries to "defer" inlining a call if it determines that inlining the caller into all its call-sites would be more profitable. Something very important to understand about deferred inlining is that it doesn't make one inlining choice in place of another -- it effectively chooses to do both. If we have a call chain A -> B -> C and cost modelling tells us that inlining B -> C is profitable, but we defer this and instead inline A -> B first, then we'll now have a call A -> C, and the cost model will (a few special cases notwithstanding) still tell us that this is profitable. So the end result is that we inlined *both* B and C, even though under the usual cost model function B would have been too large to further inline after C has been integrated into it.

Because deferred inlining violates the bottom-up invariant of the inliner, it can result in exponential inlining. The exponential-deferred-inlining.ll test case illustrates this on a simple example (see https://gist.github.com/nikic/1262b5f7d27278e1b34a190ae10947f5 for a much more catastrophic case with about 5000x size blowup). If the call chain A -> B -> C is not a chain but a tree of calls, then we end up deferring inlining across the tree and end up flattening everything into the root node.

This patch proposes to address this by disabling deferred inlining entirely (currently still behind an option). Beyond the issue of exponential inlining, I don't think that the whole concept makes sense, at least as long as deferred inlining still ends up inlining both call edges.

I believe the motivation for having deferred inlining in the first place is that you might have a small wrapper function with local linkage that could be eliminated if inlined. This would automatically happen if there was a single caller, due to the large "last call to local" bonus. However, this bonus is not extended if there are multiple callers, even if we would eventually end up inlining into all of them (if the bonus were extended).

Now, unlike the normal inlining cost model, the deferred inlining cost model does look at all callers, and will extend the "last call to local" bonus if it determines that we could inline all of them as long as we defer the current inlining decision. This makes very little sense. The "last call to local" bonus doesn't really cost model anything. It's basically an "infinite" bonus that ensures we always inline the last call to a local. The fact that it's not literally infinite just prevents inlining of huge functions, which can easily result in scalability issues. I very much doubt that it was an intentional cost-modelling choice to say that getting rid of a small local function is worth adding 15000 instructions elsewhere, yet this is exactly how this value is getting used here.

The main alternative I see to complete removal is to change deferred inlining to an actual either/or decision. That is, to mark deferred calls as noinline so we're actually trading off one inlining decision against another, and not just adding a side-channel to the cost model to do both.

Apart from fixing the catastrophic inlining case, the effect on rustc is a modest compile-time improvement on average (up to 8% for a parsing-type crate, where tree-like calls are expected) and pretty neutral where run-time performance is concerned (mix of small wins and losses, usually in the sub-1% category).

Diff Detail

Event Timeline

nikic created this revision.Dec 10 2021, 1:13 AM
nikic requested review of this revision.Dec 10 2021, 1:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 10 2021, 1:13 AM
nikic edited the summary of this revision. (Show Details)Dec 10 2021, 1:18 AM

I believe part of the problem is that cost estimation doesn't take much of the caller into account. If it did, in the A->B->C case, A->B being too large could make A->C unprofitable.

One other thing to consider about deferral, though: say A->B, C->B, B->D. Say not deferring B->D leads to B not being inlinable in either A nor C. If we deferred, then suppose B gets inlined into A, and in this context, so does D; but suppose B does not inlined into C and then nor does D in this second context (we don't revisit callsites if inlining does not occur).

I think presence of profiles may make supporting this worthwhile, as call site hotness plays a role in the cost calculation.

The heuristic is old - I think the original change is ~12 years ago (3059924bddca79e0fea5f1473d9a66985e1fda4a) and it saw very little churn since. Notably, though, this makes some performance remarks.

For prudence, should we perhaps as a first step allow disabling it (and maybe the Rust frontend can then disable it by default?), but leave it on by default? We can then see if disabling it hurts performance of any real-world, PGO-ed binaries.

First some high level comments:

  1. There is a a conflict between maximize runtime performance and compile time, so this should not be disabled without extensive benchmarking from downstream users
  2. It is OK to introduce an option to make it true (not changing behavior for now)
  3. Some form of (limited) topdown inlining is beneficial, and there should be a general way to deal with runaway inlining

More inline comments below:

After the switch to the new pass manager, we have observed multiple instances of catastrophic inlining, where the inliner produces huge functions with many hundreds of thousands of instructions from small input IR. We were forced to back out the switch to the new pass manager for this reason. This patch fixes at least one of the root cause issues.

Why does it only happen with the new PM?

There are instances where we necessarily have to deviate from bottom-up inlining: When inlining in an SCC there is no natural "bottom", so inlining effectively happens top-down. This requires special care, and the inliner avoids exponential blowup by ensuring that functions in the SCC grow in a balanced way and will eventually hit the threshold.

This is not precise description: in the current bottom up inlining, after a calllsite gets inlined into to the caller, there will be an iterative inlining happening after that in top-down fashion. This has triggered compile time issue with recursive calls before.

The autoFDO's sample loader also has an top-down inlining pass.

However, there is one instance where the inlining advisor explicitly violates the bottom-up principle: Deferred inlining tries to "defer" inlining a call if it determines that inlining the caller into all its call-sites would be more profitable. Something very important to understand about deferred inlining is that it doesn't make one inlining choice in place of another -- it effectively chooses to do both. If we have a call chain A -> B -> C and cost modelling tells us that inlining B -> C is profitable, but we defer this and instead inline A -> B first, then we'll now have a call A -> C, and the cost model will (a few special cases notwithstanding) still tell us that this is profitable. So the end result is that we inlined *both* B and C, even though under the usual cost model function B would have been too large to further inline after C has been integrated into it.

I think flattening A->B->C for most of the cases is working by design.

Because deferred inlining violates the bottom-up invariant of the inliner, it can result in exponential inlining. The exponential-deferred-inlining.ll test case illustrates this on a simple example (see https://gist.github.com/nikic/1262b5f7d27278e1b34a190ae10947f5 for a much more catastrophic case with about 5000x size blowup). If the call chain A -> B -> C is not a chain but a tree of calls, then we end up deferring inlining across the tree and end up flattening everything into the root node.

This should be dealt with, but disabling the mechanism may be a too big a hammer without extensive testing,

This patch proposes to address this by disabling deferred inlining entirely (currently still behind an option). Beyond the issue of exponential inlining, I don't think that the whole concept makes sense, at least as long as deferred inlining still ends up inlining both call edges.

I think introducing an option without changing behavior is the right first step to go.

Now, unlike the normal inlining cost model, the deferred inlining cost model does look at all callers, and will extend the "last call to local" bonus if it determines that we could inline all of them as long as we defer the current inlining decision. This makes very little sense. The "last call to local" bonus doesn't really cost model anything. It's basically an "infinite" bonus that ensures we always inline the last call to a local. The fact that it's not literally infinite just prevents inlining of huge functions, which can easily result in scalability issues. I very much doubt that it was an intentional cost-modelling choice to say that getting rid of a small local function is worth adding 15000 instructions elsewhere, yet this is exactly how this value is getting used here.

The last call bonus of 15000 is certainly tunable. It models the DFE opportunity which should be callee size dependent.

The main alternative I see to complete removal is to change deferred inlining to an actual either/or decision. That is, to mark deferred calls as noinline so we're actually trading off one inlining decision against another, and not just adding a side-channel to the cost model to do both.

No that is not the intention of the deferred inlining.

David

kazu added a comment.Dec 10 2021, 10:12 AM

Can you achieve the same effect with -mllvm -inline-deferral-scale=0? If so, I would be inclined to keeping the mechanism as is for now.

As Mircea suggests, I am OK with disabling it by default after some benchmarking. We could even remove it altogether if nobody complains for a while.

Regarding additional benchmarks, could I ask one of you to trigger a run of Google's internal benchmarks with this patch? Those are probably pretty representative for both PGO and non-PGO workloads.

Regarding additional benchmarks, could I ask one of you to trigger a run of Google's internal benchmarks with this patch? Those are probably pretty representative for both PGO and non-PGO workloads.

I'll take care of that over the weekend.

wenlei added a subscriber: wenlei.Dec 13 2021, 12:13 AM

Are you using newpm with PGO? FWIW, we've also ran into bloated inlining multiple times after switching to npm a while back, but most of that turned out to be due to a combination of npm's better use of BFI for inline and some kind of profiling issue leading to unusual flat profile which makes inliner super aggressive (flat profile has a relative lower hot threshold so more callsites are considered hot).

We haven't found problem with inline deferral so far, though we also haven't measured its perf benefit..

Regarding additional benchmarks, could I ask one of you to trigger a run of Google's internal benchmarks with this patch? Those are probably pretty representative for both PGO and non-PGO workloads.

I'll take care of that over the weekend.

TL;DR; no significant performance effect

I ran 2 things:

  • intenal microbenchmarks we use for compiler release, which are using google benchmark. I used the perf counter support in that library, and measured INSTRUCTIONS and CYCLES for a subset that I pre-verified has very low INSTRUCTIONS variation (i.e. around 0.2%). The geomean is flat. Per-benchmark, I saw variations sub-2% for either of the 2 counters.
  • an internal benchmark for a real application - no significant change
nikic updated this revision to Diff 394240.Dec 14 2021, 6:50 AM

Rebase after splitting off the addition of the option -- this now only flips the default value.

mtrofin accepted this revision.Dec 14 2021, 7:40 AM

LGTM, and a question to reviewers: how should we proceed with understanding if we should actually disable/remove deferral (because, judging just on my datapoints, it's basically a no-op)

This revision is now accepted and ready to land.Dec 14 2021, 7:40 AM
nikic added a comment.Dec 14 2021, 7:53 AM

I believe part of the problem is that cost estimation doesn't take much of the caller into account. If it did, in the A->B->C case, A->B being too large could make A->C unprofitable.

Right. Taking the caller size into account would help to at least mitigate this general class of problems, though I think that we should avoid getting into the situation in the first place if we can. I think one area where this can happen and is not avoidable is de-indirection during inlining. One can probably construct a case where inlining each level of calls makes the next level direct -- a caller size limit, or a limit on reprocessing of new calls, is likely the only way to combat infinite growth in that case.

One other thing to consider about deferral, though: say A->B, C->B, B->D. Say not deferring B->D leads to B not being inlinable in either A nor C. If we deferred, then suppose B gets inlined into A, and in this context, so does D; but suppose B does not inlined into C and then nor does D in this second context (we don't revisit callsites if inlining does not occur).

Yeah, something like that can happen, though I'm not sure that's a good outcome: The C->B caller that did not get inlined would probably benefit from the B->D edge being inlined -- but that doesn't happen to enable inlining on an unrelated edge.

For prudence, should we perhaps as a first step allow disabling it (and maybe the Rust frontend can then disable it by default?), but leave it on by default? We can then see if disabling it hurts performance of any real-world, PGO-ed binaries.

I went ahead and landed the option in https://github.com/llvm/llvm-project/commit/7abf299fedff2f5b22ea8180311b44feeac26a9f, so this patch is now only about flipping the default. Explicitly setting it on the rustc side would address our immediate problem, but isn't a great long-term solution and is not reliable in LTO scenarios.

TL;DR; no significant performance effect

I ran 2 things:

  • intenal microbenchmarks we use for compiler release, which are using google benchmark. I used the perf counter support in that library, and measured INSTRUCTIONS and CYCLES for a subset that I pre-verified has very low INSTRUCTIONS variation (i.e. around 0.2%). The geomean is flat. Per-benchmark, I saw variations sub-2% for either of the 2 counters.
  • an internal benchmark for a real application - no significant change

Thanks! That looks promising and about matches what I saw.

Can you achieve the same effect with -mllvm -inline-deferral-scale=0? If so, I would be inclined to keeping the mechanism as is for now.

As Mircea suggests, I am OK with disabling it by default after some benchmarking. We could even remove it altogether if nobody complains for a while.

-inline-deferral-scale=0 reduces the amount of deferred inlining, but doesn't disable it entirely, and doesn't help in the case I'm looking at. This is because the last call bonus can result in a large negative cost for deferred inlining, which will then end up being preferred independently of the deferral scale.

Are you using newpm with PGO? FWIW, we've also ran into bloated inlining multiple times after switching to npm a while back, but most of that turned out to be due to a combination of npm's better use of BFI for inline and some kind of profiling issue leading to unusual flat profile which makes inliner super aggressive (flat profile has a relative lower hot threshold so more callsites are considered hot).

We haven't found problem with inline deferral so far, though we also haven't measured its perf benefit..

All reports I saw were for non-PGO builds. Though there does seem to be a relation with (static) BFI at least in my motivating case, because precision loss in frequencies results in incorrect cold callsite estimation. This ends up being the catalyst for the issue, but I don't think it's the root problem. But this is likely the reason why the motivating test case worked fine under the legacy pass manager, which does not use BFI during inlining. (There is likely a selection effect here: These problems are treated as blocking regressions because previously working code breaks -- the legacy PM inliner also had instances of catastrophic inlining, but those were largely ignored.)

  1. There is a a conflict between maximize runtime performance and compile time, so this should not be disabled without extensive benchmarking from downstream users

Generally true. However, when compile-time tends towards infinity it stops being an optimization tradeoff and starts being a correctness issue. The compiler is no longer able to produce a working binary.

Why does it only happen with the new PM?

See above -- in my motivating case, I believe this is due to BFI use in the new PM, but ultimately the new PM is incidental here. It just perturbs inlining heuristics/behavior enough that previously working code breaks. The reduced test case in this patch causes catastrophic inlining in both the new and legacy PMs.

However, there is one instance where the inlining advisor explicitly violates the bottom-up principle: Deferred inlining tries to "defer" inlining a call if it determines that inlining the caller into all its call-sites would be more profitable. Something very important to understand about deferred inlining is that it doesn't make one inlining choice in place of another -- it effectively chooses to do both. If we have a call chain A -> B -> C and cost modelling tells us that inlining B -> C is profitable, but we defer this and instead inline A -> B first, then we'll now have a call A -> C, and the cost model will (a few special cases notwithstanding) still tell us that this is profitable. So the end result is that we inlined *both* B and C, even though under the usual cost model function B would have been too large to further inline after C has been integrated into it.

I think flattening A->B->C for most of the cases is working by design.

This is working by design if the cost model tells us that this is profitable. Deferred inlining is used in cases where the cost model tells us that inlining both levels is not profitable -- and then we go ahead and do so anyway by changing the inlining order.

The last call bonus of 15000 is certainly tunable. It models the DFE opportunity which should be callee size dependent.

Yes, this use of the last call bonus would be correct if the last call bonus is roughly equal to the callee size. It's just the current use of a hardcoded and very large last call bonus that does not make sense in this context, because it vastly overestimates the benefit of removing small functions. (The hardcoded last call bonus does make sense for its primary use in the cost model.)

The main alternative I see to complete removal is to change deferred inlining to an actual either/or decision. That is, to mark deferred calls as noinline so we're actually trading off one inlining decision against another, and not just adding a side-channel to the cost model to do both.

No that is not the intention of the deferred inlining.

At least from the original commit message (https://github.com/llvm/llvm-project/commit/3059924bddca79e0fea5f1473d9a66985e1fda4a) and the motivating issue (https://github.com/llvm/llvm-project/issues/3345) the original idea was to inline the caller "instead", not "in addition to". But clearly this is not what the implementation ended up doing in practice.

LGTM, and a question to reviewers: how should we proceed with understanding if we should actually disable/remove deferral (because, judging just on my datapoints, it's basically a no-op)

My plan would be to land this, and if no major regressions are reported follow up after the LLVM 14 release to drop the option and implementation. If major regressions are reported, it would be good to add reduced test cases for them, as this functionality is currently untested (apart from one incidental CGSCC invalidation test).

This revision was landed with ongoing or failed builds.Dec 16 2021, 1:00 AM
This revision was automatically updated to reflect the committed changes.

Hi, @nikic

This change caused the following regression on the EXPENSIVE_CHECKS=on green dragon CI, where clang is unable to compile compiler-rt for armv7:

*** Bad machine code: Using an undefined physical register ***
- function:    _ZN11__sanitizer7BVGraphINS_17TwoLevelBitVectorILm1ENS_14BasicBitVectorImEEEEE8addEdgesERKS4_mPmm
- basic block: %bb.2 _ZN11__sanitizer14BasicBitVectorImE19getAndClearFirstOneEv.exit.i.i (0x134185288)
- instruction: BUNDLE implicit-def dead $itstate, implicit-def $r2, implicit killed $cpsr, implicit $sp, implicit killed $r6, implicit $r5, debug-location !3667; compiler-rt/lib/sanitizer_common/sanitizer_bitvector.h:0 @[ compiler-rt/lib/sanitizer_common/sanitizer_bitvector.h:204:23 @[ compiler-rt/lib/sanitizer_common/sanitizer_bitvector.h:251:5 @[ compiler-rt/lib/sanitizer_common/sanitizer_bvgraph.h:52:8 ] ] ]
- operand 4:   implicit killed $r6
fatal error: error in backend: Found 1 machine code errors.

Here's the bot:
https://green.lab.llvm.org/green/job/clang-stage1-cmake-RA-expensive/
And the set of changes:
https://green.lab.llvm.org/green/job/clang-stage1-cmake-RA-expensive/21216/changes#detail

I have attached two reproducer files for this issue:

Could you please see if that's something you can resolve?

nikic added a comment.Dec 21 2021, 1:12 AM

I can at least offer this reduction (run through llc -verifymachineinstrs): https://gist.github.com/nikic/0b5ff77ab7248466eb525d934e98bbeb

I don't have any familiarity with backend scheduling passes though.

nikic added a comment.Dec 21 2021, 1:45 AM

I've opened https://github.com/llvm/llvm-project/issues/52817 for the machine verification issue with some additional investigation.

fhahn added a subscriber: fhahn.Jan 7 2022, 4:34 AM

It looks like unfortunately building sanitizers with EXPENSIVE_CHECKS is still failing https://green.lab.llvm.org/green/job/clang-stage1-cmake-RA-expensive/21520/console . I think it might be good to revert the commit to bring the bot back to green until https://github.com/llvm/llvm-project/issues/52817 is addressed.

This seems to cause a huge regression on SystemZ/imagick. If I use "-mllvm -inline-deferral" with trunk, I see a 25% improvement!

There is a slight regression on xalanc with deferred inlining, but otherwise this doesn't really seem to hurt much on this architecture.

My question then is if we should re-enable it on SystemZ, or if you would recommend some alternative that perhaps could give us performance back on imagick while having a sounder algorithm?

This seems to cause a huge regression on SystemZ/imagick. If I use "-mllvm -inline-deferral" with trunk, I see a 25% improvement!

There is a slight regression on xalanc with deferred inlining, but otherwise this doesn't really seem to hurt much on this architecture.

My question then is if we should re-enable it on SystemZ, or if you would recommend some alternative that perhaps could give us performance back on imagick while having a sounder algorithm?

This doesn't really seem like something that should be controlled by target options. It's hard to discuss alternatives without knowing more details about the issue affecting imagick -- there are some variations on the "deferred inlining" concept we could implement, but I don't know what would help imagick in particular. Changes to inlining heuristics tend to be very hit and miss.

I believe the main inlining-related peculiarity of the SystemZ target is that it increases all inlining thresholds by a factor of three (https://github.com/llvm/llvm-project/blob/c0671e2c9b5c70fbcda277dcd5321d052ca2a2ee/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h#L39). No other non-GPU target does this, and I imagine that it can make inlining behavior on SystemZ very different from other targets. It might be worthwhile to double check whether that threshold multiplier is really needed.

jonpa added a comment.Jan 14 2022, 9:13 AM

This seems to cause a huge regression on SystemZ/imagick. If I use "-mllvm -inline-deferral" with trunk, I see a 25% improvement!

There is a slight regression on xalanc with deferred inlining, but otherwise this doesn't really seem to hurt much on this architecture.

My question then is if we should re-enable it on SystemZ, or if you would recommend some alternative that perhaps could give us performance back on imagick while having a sounder algorithm?

This doesn't really seem like something that should be controlled by target options. It's hard to discuss alternatives without knowing more details about the issue affecting imagick -- there are some variations on the "deferred inlining" concept we could implement, but I don't know what would help imagick in particular. Changes to inlining heuristics tend to be very hit and miss.

I believe the main inlining-related peculiarity of the SystemZ target is that it increases all inlining thresholds by a factor of three (https://github.com/llvm/llvm-project/blob/c0671e2c9b5c70fbcda277dcd5321d052ca2a2ee/llvm/lib/Target/SystemZ/SystemZTargetTransformInfo.h#L39). No other non-GPU target does this, and I imagine that it can make inlining behavior on SystemZ very different from other targets. It might be worthwhile to double check whether that threshold multiplier is really needed.

Yes, that's a good point - we found earlier that this was the best setting looking at the benchmarks. Now that the overall inlining algorithm has changed, we should revisit this, which I did: It seems that '3' is still better than '1' and '2', but what's more is that imagick seems to look good again with '4', and overall on benchmarks '5' is even better... So readjusting this value looks right now as the way to go.

On imagick I see just one function that got inlined with 'inline-deferral', @SetPixelCacheNexusPixels, and I am guessing this is the key difference. This function has the 'hot' and 'internal' attributes. It does not have any loops but several calls (@PLT). I saw also for some reason that if not inlined it was called from 12 different functions, but if inlined the file in total had less instructions... The debug output from the inliner:

trunk (threshold multiplier = 3):

Inlining calls in: SetPixelCacheNexusPixels
Inlining calls in: SetPixelCacheNexusPixels
Updated inlining SCC: (SetPixelCacheNexusPixels)
    Inlining (cost=685, threshold=750), Call:   %call62 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %cache_info, %struct._RectangleInfo* nonnull %region, i32 zeroext 1, %struct._NexusInfo* %27, %struct._ExceptionInfo* %exception) #22
    Inlining (cost=685, threshold=750), Call:   %call76 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %clone_info, %struct._RectangleInfo* nonnull %region, i32 zeroext 1, %struct._NexusInfo* %59, %struct._ExceptionInfo* %exception) #23
    Inlining (cost=685, threshold=750), Call:   %call130 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %cache_info, %struct._RectangleInfo* nonnull %region115, i32 zeroext 1, %struct._NexusInfo* %108, %struct._ExceptionInfo* %exception) #23
    Inlining (cost=685, threshold=750), Call:   %call144 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %clone_info, %struct._RectangleInfo* nonnull %region115, i32 zeroext 1, %struct._NexusInfo* %140, %struct._ExceptionInfo* %exception) #23
    NOT Inlining (cost=900, threshold=750), Call:   %call37 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %1, %struct._RectangleInfo* nonnull %region, i32 zeroext %8, %struct._NexusInfo* %nexus_info, %struct._ExceptionInfo* %exception) #20
    NOT Inlining (cost=900, threshold=750), Call:   %call37.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %1, %struct._RectangleInfo* nonnull %region.i, i32 zeroext %8, %struct._NexusInfo* %nexus_info, %struct._ExceptionInfo* %exception) #20
    NOT Inlining (cost=900, threshold=750), Call:   %call37.i.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %6, %struct._RectangleInfo* nonnull %region.i.i, i32 zeroext %13, %struct._NexusInfo* %4, %struct._ExceptionInfo* %exception) #20
    NOT Inlining (cost=900, threshold=750), Call:   %call37.i.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %5, %struct._RectangleInfo* nonnull %region.i.i, i32 zeroext %12, %struct._NexusInfo* %3, %struct._ExceptionInfo* %exception) #21
    NOT Inlining (cost=900, threshold=750), Call:   %call37.i.i.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %11, %struct._RectangleInfo* nonnull %region.i.i.i, i32 zeroext %18, %struct._NexusInfo* %9, %struct._ExceptionInfo* %exception) #21
    NOT Inlining (cost=900, threshold=750), Call:   %call = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %4, %struct._RectangleInfo* nonnull %region, i32 zeroext %cond, %struct._NexusInfo* %nexus_info, %struct._ExceptionInfo* %exception) #22
    NOT Inlining (cost=135, threshold=135), Call:   %call37.i.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %10, %struct._RectangleInfo* nonnull %region.i.i, i32 zeroext %17, %struct._NexusInfo* %8, %struct._ExceptionInfo* %exception) #22
    NOT Inlining (cost=135, threshold=135), Call:   %call37.i.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %12, %struct._RectangleInfo* nonnull %region.i.i, i32 zeroext %19, %struct._NexusInfo* %10, %struct._ExceptionInfo* %exception) #23
    NOT Inlining (cost=135, threshold=135), Call:   %call37.i.i.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %15, %struct._RectangleInfo* nonnull %region.i.i.i, i32 zeroext %22, %struct._NexusInfo* %13, %struct._ExceptionInfo* %exception) #21
    NOT Inlining (cost=135, threshold=135), Call:   %call37.i.i.i88 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %83, %struct._RectangleInfo* nonnull %region.i.i.i21, i32 zeroext %90, %struct._NexusInfo* %81, %struct._ExceptionInfo* %exception) #21
    NOT Inlining (cost=900, threshold=750), Call:   %call37.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %5, %struct._RectangleInfo* nonnull %region.i, i32 zeroext %12, %struct._NexusInfo* %3, %struct._ExceptionInfo* %exception) #21
    NOT Inlining (cost=900, threshold=750), Call:   %call37.i.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %8, %struct._RectangleInfo* nonnull %region.i.i, i32 zeroext %15, %struct._NexusInfo* %6, %struct._ExceptionInfo* %exception) #21
    NOT Inlining (cost=900, threshold=750), Call:   %call37.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %1, %struct._RectangleInfo* nonnull %region.i, i32 zeroext %8, %struct._NexusInfo* %nexus_info, %struct._ExceptionInfo* %exception) #21
    NOT Inlining (cost=900, threshold=750), Call:   %call37.i = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %6, %struct._RectangleInfo* nonnull %region.i, i32 zeroext %13, %struct._NexusInfo* %4, %struct._ExceptionInfo* %exception) #21

With -inline-deferral:

Inlining calls in: SetPixelCacheNexusPixels
Inlining calls in: SetPixelCacheNexusPixels
Updated inlining SCC: (SetPixelCacheNexusPixels)
    Inlining (cost=395, threshold=750), Call:   %call62 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %cache_info, %struct._RectangleInfo* nonnull %region, i32 zeroext 1, %struct._NexusInfo* %27, %struct._ExceptionInfo* %exception) #22
    Inlining (cost=395, threshold=750), Call:   %call76 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %clone_info, %struct._RectangleInfo* nonnull %region, i32 zeroext 1, %struct._NexusInfo* %51, %struct._ExceptionInfo* %exception) #22
    Inlining (cost=395, threshold=750), Call:   %call130 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %cache_info, %struct._RectangleInfo* nonnull %region115, i32 zeroext 1, %struct._NexusInfo* %92, %struct._ExceptionInfo* %exception) #22
    Inlining (cost=395, threshold=750), Call:   %call144 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %clone_info, %struct._RectangleInfo* nonnull %region115, i32 zeroext 1, %struct._NexusInfo* %116, %struct._ExceptionInfo* %exception) #22
    Inlining (cost=610, threshold=750), Call:   %call37 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %1, %struct._RectangleInfo* nonnull %region, i32 zeroext %8, %struct._NexusInfo* %nexus_info, %struct._ExceptionInfo* %exception) #20
    Inlining (cost=-14390, threshold=750), Call:   %call = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %4, %struct._RectangleInfo* nonnull %region, i32 zeroext %cond, %struct._NexusInfo* %nexus_info, %struct._ExceptionInfo* %exception) #22

With threshold multiplier = 4:

Inlining calls in: SetPixelCacheNexusPixels
Inlining calls in: SetPixelCacheNexusPixels
Updated inlining SCC: (SetPixelCacheNexusPixels)
    Inlining (cost=685, threshold=1000), Call:   %call62 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %cache_info, %struct._RectangleInfo* nonnull %region, i32 zeroext 1, %struct._NexusInfo* %41, %struct._ExceptionInfo* %exception) #23
    Inlining (cost=685, threshold=1000), Call:   %call76 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %clone_info, %struct._RectangleInfo* nonnull %region, i32 zeroext 1, %struct._NexusInfo* %73, %struct._ExceptionInfo* %exception) #23
    Inlining (cost=685, threshold=1000), Call:   %call130 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %cache_info, %struct._RectangleInfo* nonnull %region115, i32 zeroext 1, %struct._NexusInfo* %122, %struct._ExceptionInfo* %exception) #23
    Inlining (cost=685, threshold=1000), Call:   %call144 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %clone_info, %struct._RectangleInfo* nonnull %region115, i32 zeroext 1, %struct._NexusInfo* %154, %struct._ExceptionInfo* %exception) #23
    Inlining (cost=900, threshold=1000), Call:   %call37 = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* nonnull %1, %struct._RectangleInfo* nonnull %region, i32 zeroext %8, %struct._NexusInfo* %nexus_info, %struct._ExceptionInfo* %exception) #20
    Inlining (cost=-14100, threshold=1000), Call:   %call = call fastcc %struct._PixelPacket* @SetPixelCacheNexusPixels(%struct._CacheInfo* %4, %struct._RectangleInfo* nonnull %region, i32 zeroext %cond, %struct._NexusInfo* %nexus_info, %struct._ExceptionInfo* %exception) #22

I will do some more measurements to see which value for the multiplier looks best. Do you see any reason not to use this as a tuning parameter, or any other preferred way? (I have been assuming that it *is* the target-independent variable to play with...)

jonpa added a comment.Jan 16 2022, 2:36 PM

Unfortunately it seems there is a problem with inlining with higher multiplier factors, which I have reported at https://github.com/llvm/llvm-project/issues/53236.

After this patch, another perf regression was reported :

https://github.com/llvm/llvm-project/issues/53555

jonpa added a comment.Feb 8 2022, 4:27 PM

It seems that the imagick regression on SystemZ per above where inlining of @SetPixelCacheNexusPixels is no longer done involves the memcpy performed directly after entering the function. A 32-byte struct is built by the caller and passed to this function which then only does a memcpy of that struct as a source.

I wonder if perhaps "partial inlining" could be used for this to just inline the memcpy part into caller? It seems this was enabled by default back in 2017, but now that does not appear to be the case. Has this changed with the move to the new pass manager somehow? (I have to enable it with '-mllvm -enable-partial-inlining' and I don't see any target using it...)

If it was enabled in old pm, but not in new pm, then just oversight? Patch welcome.

jonpa added a comment.Feb 8 2022, 4:42 PM

If it was enabled in old pm, but not in new pm, then just oversight? Patch welcome.

I was assuming the inliner (and outliner?) heuristics may have undergone a reworking with the new pass-manager, but I don't really know...

If it was enabled in old pm, but not in new pm, then just oversight? Patch welcome.

I was assuming the inliner (and outliner?) heuristics may have undergone a reworking with the new pass-manager, but I don't really know...

cc @aeubanks

If it was enabled in old pm, but not in new pm, then just oversight? Patch welcome.

partial inlining looks like it's still in the new PM, search RunPartialInlining in the LLVM sources

I was assuming the inliner (and outliner?) heuristics may have undergone a reworking with the new pass-manager, but I don't really know...

cc @aeubanks

I don't think the general inlining heuristics have been changed, although the new PM has access to some analyses that the old PM doesn't when inlining that may cause inlining decisions to be different, namely BlockFrequencyInfo. The inlining order may also be different.