This is an archive of the discontinued LLVM Phabricator instance.

[InlineCost] Relax bonus restrictions on uninlinable functions
Needs ReviewPublic

Authored by george.burgess.iv on Feb 4 2020, 1:58 PM.

Details

Summary

We currently ignore inlining bonuses sometimes (importantly to me, LastCallToStaticBonus) in order to make the caller of a function more inlinable. This appears suboptimal if the caller can't be inlined in the first place.

This is an attempt to fix that suboptimality (which showed up for me most prominently with the new PM), though it's not quite finished. Before I plumb everything through, I'd like to know if anyone's strongly against a change like this.

Importantly, I can see the following downsides:

  • This patch makes our inline cost depend on state of the caller, which is generally discouraged, since we try to cache these things.
  • This patch basically causes the inliner to inline differently depending on whether we're LTO'ing or not, which AFAICT is new.

The last bullet is the case because, in a traditional build, I don't think we care about how inlineable exportedFunction is below, assuming it has no users in its TU:

static void helper() {
  // large body
}

void exportedFunction(bool i) {
  if (UNLIKELY(i))
    helper();
}

OTOH, with ThinLTO or LTO enabled, we might discover uses of exportedFunction during our link, so the inlinability of it starts to matter.

Thoughts appreciated. :)

Diff Detail

Event Timeline

On the first bullet, I'm trying to find where InlineCost values are cached. Are we doing that?

On the first bullet, I'm trying to find where InlineCost values are cached. Are we doing that?

I don't actually know myself; I got that from above the comment for CandidateCall: "The candidate callsite being analyzed. Please do not use this to do analysis in the caller function; we want the inline cost query to be easily cacheable. Instead, use the cover function paramHasAttr."

This is sort of extension of the 'shouldBeDeferred' check in Inliner.cpp, except that here the caller may not actually be inlined. Blindly eliminating bonus may preventing the callee from being inlined while not actually enabling the caller to be inlined.

On the first bullet, I'm trying to find where InlineCost values are cached. Are we doing that?

I don't actually know myself; I got that from above the comment for CandidateCall: "The candidate callsite being analyzed. Please do not use this to do analysis in the caller function; we want the inline cost query to be easily cacheable. Instead, use the cover function paramHasAttr."

Yup - I also don't know of any location caching the cost. I wonder if this was a nerver-realized goal, and whether it is abandoned. May be worth checking with the author of the comment?

Yup - I also don't know of any location caching the cost. I wonder if this was a nerver-realized goal, and whether it is abandoned. May be worth checking with the author of the comment?

Good call -- looks like the comment was added in 9b5c9580e384 around 5 years ago. The commit message says "We're not currently [caching inline costs], but initial results on LTO indicate this will quickly become important," so this sounds like a nice-to-have that was never had.

Removed the cacheable bit of that comment, since I think it's a bit misleading if there's no motion on it.

This is sort of extension of the 'shouldBeDeferred' check in Inliner.cpp, except that here the caller may not actually be inlined. Blindly eliminating bonus may preventing the callee from being inlined while not actually enabling the caller to be inlined.

Oof, yeah, these look similar, but it's not clear to me if there's a good way to unify them as a part of this patch (or lift this out to there). Lemme poke it a bit and see :)

DisallowAllBonuses is called only when the callsite or the callee entry is cold. The only reason to make it conditional on caller's inlinability would be for code size improvement. Among the 3 bonuses disabled in DisallowAllBonuses, only the LastCallToStaticBonus is useful to allow for code size reduction. You don't want to give a single BB bonus or a vector bonus for a cold callee (or a callee at a cold callsite) whose caller is unlikely to be inlined.

Updated to plumb through the "are we doing ThinLTO || prelink LTO" bit in a way that makes sense to me. Suggestions for potentially better approaches is welcome :)

Among the 3 bonuses disabled in DisallowAllBonuses, only the LastCallToStaticBonus is useful to allow for code size reduction [...].

Fixed; thanks!

Also, RE

Oof, yeah, these look similar, but it's not clear to me if there's a good way to unify them as a part of this patch (or lift this out to there). Lemme poke it a bit and see :)

Unfortunately, I couldn't figure out a clean way to do this, so I added a FIXME for now

I have concerns on so much plumbing changes to make the small heuristic change.

On the other hand, can you first extract the pass manager builder changes (i.e, passing Phase kind) into a separate patch? That one seems independent.

Especially for inliner changes I suggest a) sharing performance numbers on standard benchmarks with O2 & O2 (thin) LTO, and b) offering a flag to toggle individual changes. Performance here includes run-time performance, compile-time, and code size. A simple change can have detrimental impact on any of key metrics folks might be watching, and that that impact could be different depending on the targets.

I have concerns on so much plumbing changes to make the small heuristic change.

Agreed that that's discouraging. Lemme take numbers and see what this does for us. If this still looks promising, I'll split as requested.

sharing performance numbers on standard benchmarks with O2 & O2 (thin) LTO,

Thanks for the feedback! I'm not sure what benchmarks are considered standard. Do we have documentation on this, or might you have specific ones in mind I can try this on?

Friendly ping Gerolf (or anyone else who has opinions on standardish benchmarks) :)

SPEC2006/2017 will be a good public benchmarks. Clang performance/size (with self build with and without PGO) is another one. Internal benchmark results are also meaningful.