This is an archive of the discontinued LLVM Phabricator instance.

[RFC][InlineCost] Don't count the cost of truly exceptionally unlikely blocks (PR50099)
Needs Review, Public

Authored by lebedev.ri on Apr 24 2021, 6:53 AM.

Details

Summary

Cost-benefit analysis already does something like this, but differently:
it is off by default, and it needs a profile (as opposed to being content with static weights).

In PR50099 I reported that the NewPM switch introduced a rather large regression in one benchmark.
That happens because a certain destructor call is determined to be cold,
so it is given a smaller inlining budget (45), while its inlining cost measures at ~55.

We could either raise budgets or lower costs.
D101228, and potentially D101229, does the former.
Here I propose to investigate one approach for the latter.

A large portion of that destructor is exception handling,
and as per block frequency it is *exceptionally* unlikely to execute.
So I propose to introduce an impossible-code-rel-freq option,
defaulting to 2 parts per million (the smallest value that does the job)
of the function's entry frequency, and to say that if a block's frequency
is less than that, then the block is impossible to execute,
and the costs of the instructions in said block are not added.
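
For concreteness, here is a minimal sketch of the check being described, assuming a standalone helper; the helper name, the constant, and how it would be wired into CallAnalyzer are illustrative, not the actual patch:

#include "llvm/Analysis/BlockFrequencyInfo.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/Support/BlockFrequency.h"
#include "llvm/Support/BranchProbability.h"
#include <cstdint>

// Hypothetical knob: 2 parts-per-million of the entry frequency,
// per the summary above.
static constexpr uint64_t ImpossibleCodeRelFreq = 2;

// Returns true if BB's static frequency is below ImpossibleCodeRelFreq
// parts-per-million of the function's entry frequency, i.e. the block is
// treated as impossible to execute and its instructions would not be costed.
static bool isImpossibleToExecute(const llvm::BasicBlock &BB,
                                  const llvm::BlockFrequencyInfo &BFI) {
  llvm::BlockFrequency Threshold =
      llvm::BlockFrequency(BFI.getEntryFreq()) *
      llvm::BranchProbability::getBranchProbability(ImpossibleCodeRelFreq,
                                                    1000000);
  return BFI.getBlockFreq(&BB) < Threshold;
}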

Does this sound completely insane? Thoughts?

Diff Detail

Event Timeline

lebedev.ri created this revision. Apr 24 2021, 6:53 AM
lebedev.ri requested review of this revision. Apr 24 2021, 6:53 AM
lebedev.ri retitled this revision from [RFC][InlineCost] Don't count the cost of trully exceptionally unlikely blocks (PR50099) to [RFC][InlineCost] Don't count the cost of truly exceptionally unlikely blocks (PR50099).
nikic added a subscriber: nikic. Apr 24 2021, 7:18 AM

Possibly I'm misunderstanding what is proposed here, but aren't these kinds of very rarely executed blocks exactly what we do not want to inline? This would make sense in the context of something like partial inlining, where we could inline the hot parts without the cold parts (and thus not cost the cold parts), but for regular inlining this means we duplicate cold code, increasing code size and icache pressure.

lebedev.ri added a comment (edited). Apr 24 2021, 7:22 AM

Possibly I'm misunderstanding what is proposed here, but aren't these kinds of very rarely executed blocks exactly what we do not want to inline? This would make sense in the context of something like partial inlining, where we could inline the hot parts without the cold parts (and thus not cost the cold parts), but for regular inlining this means we duplicate cold code, increasing code size and icache pressure.

Is that not what I asked in the description?

Right now I'm simply enumerating possible solutions in the hope that at least one will fit.

FWIW, what I'm doing here is apparently already suggested in a FIXME:

bool CallAnalyzer::visitUnreachableInst(UnreachableInst &I) {
  // FIXME: It might be reasonably to discount the cost of instructions leading
  // to unreachable as they have the lowest possible impact on both runtime and
  // code size.
  return true; // No actual code is needed for unreachable.
}

from rL197215 @ https://github.com/llvm/llvm-project/blob/269b335bd7332cd0d13451260d408dc9fcbcb5b1/llvm/lib/Analysis/InlineCost.cpp#L2066-L2071

Do we know why the destructor call is determined to be cold in the first place? Is there something that can be done to improve static branch prediction?

I believe this patch itself will likely help improve performance of programs with EH (without PGO) at the cost of increased code size in general.

Do we know why the destructor call is determined to be cold in the first place? Is there something that can be done to improve static branch prediction?

I've posted the full unoptimized repro IR at https://bugs.llvm.org/show_bug.cgi?id=50099.
I believe that determination is correct. Said destructor call is only reachable through landingpads,
which are obviously modelled as being quite cold.
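
To illustrate the general shape of such code (a made-up example, not the repro from the bug report): a local with a non-trivial destructor, where the cleanup on the unwind path is only reachable through a landingpad, and static heuristics treat landingpad-reachable blocks as very cold.

struct Guard {
  ~Guard(); // non-trivial, out-of-line destructor
};

void mayThrow();

void caller() {
  Guard G;
  mayThrow(); // emitted as an invoke; if it unwinds, control reaches a
              // cleanup block via a landingpad, and that block contains
              // a call to ~Guard() that static BFI marks as very cold
}

The normal-path destructor call at the end of caller() is a separate call site; only the landingpad-reachable one gets the cold treatment.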

I believe this patch itself will likely help improve performance of programs with EH (without PGO) at the cost of increased code size in general.

That's the idea, yes. I see two alternatives: just bump the threshold (D101229),
or give a bonus for each alloca (suspected to be SROA-able) that is passed as an argument to the callee function.
There is a problem with the latter approach: such cold call sites currently disable *all* bonuses.
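
For illustration, a rough sketch of what the second alternative might count; countAllocaArgs is a hypothetical standalone helper, not an existing InlineCost function, and the actual bonus computation is left out:

#include "llvm/IR/InstrTypes.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/Casting.h"

// Counts call arguments that are directly backed by allocas in the caller,
// i.e. candidates for SROA if the callee gets inlined. A per-alloca bonus
// could then be derived from this count.
static unsigned countAllocaArgs(const llvm::CallBase &Call) {
  unsigned NumAllocaArgs = 0;
  for (const llvm::Value *Arg : Call.args())
    if (llvm::isa<llvm::AllocaInst>(Arg->stripPointerCasts()))
      ++NumAllocaArgs;
  return NumAllocaArgs;
}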

If there are other alternatives, I'd love to hear about them.