This is an archive of the discontinued LLVM Phabricator instance.

[NewPM/Inliner] Reduce threshold for cold callsites in the non-PGO case
ClosedPublic

Authored by eraman on Jun 16 2017, 6:09 PM.

Details

Summary

This uses inline-cold-callsite-threshold to callsites that are cold locally (within the function) in the absence of profile information. Callsite's coldness is determined based on it's BFI relative to the caller's entry BFI. The default value chosen is callsite's frequency being <= 1/100th of the caller's entry frequency. In general this is a small size win. For example, the llvm test-suite sees a mean text size reduction of 0.03%, but there are some nice wins in large benchmarks there (sqlite sees 23% reduction and 7zip sees 4% reduction). There are some regressions, but those are all on benchmarks with smaller code size (<4K). This also improves an internal compression benchmark by around 8% by preventing a cold callee from being inlined and thereby allowing the caller to be inlined into its caller. The 1% threshold allows a callsite guarded by _builtin_expect(cond, 0) inside a singly-nested loop to be considered cold.

I am working on doing similar identification of hot callsites without profile information, but tuning that is proving to be harder and so I want to start with this first.

Diff Detail

Event Timeline

eraman created this revision.Jun 16 2017, 6:09 PM
davidxl added inline comments.Jun 16 2017, 10:40 PM
lib/Analysis/InlineCost.cpp
691

Can the static/local call site check and the global cold site check be moved into a helper function such as CallAnalyzer::isColdCallsite ?

test/Transforms/Inline/inline-cold-callsite.ll
35

change the label name to 'cold:'.

eraman updated this revision to Diff 102943.Jun 17 2017, 12:52 PM
eraman edited the summary of this revision. (Show Details)

Address David's comments.

eraman marked 2 inline comments as done.Jun 17 2017, 12:54 PM
chandlerc added inline comments.Jun 26 2017, 12:00 PM
lib/Analysis/InlineCost.cpp
72–76

Did clang format do this? yuck. You may get better results by mergeing the string literal into a single line string literal and letting clang-format re-break it from there.

653–656

Rather than digging the integers out (and risking overflow) I would expect to use BlockFrequency and BranchProbability directly?

For example, in MBP we do something along the lines of...

const BranchProbability ColdProb(1, 5); // 20%
auto CallSiteFreq = CallerBFI->getBlockFreq(CallSiteBB);
auto EntryFreq = CallerBFI->getEntryFreq();
return CallSiteFreq < EntryFreq * ColdProb;

(Clearly the coldness should be different here than it is in MBP, just seems like if the way we scale BlockFrequency is with a BranchProbability, we should keep doing that here.)

It also might be good to try and cache the scaled entry frequency so that we don't re-compute it for each call site?

eraman updated this revision to Diff 104191.Jun 27 2017, 9:21 AM

Address Chandler's comments.

eraman marked an inline comment as done.Jun 27 2017, 9:27 AM
eraman marked an inline comment as done.

I used BranchProbability as you suggested but due to BranchProbability's implementation I need to adjust the test case (essentitally BranchProbability(1,100) is < 1/100). This is not a big deal, but it does not treat a callsite guarded by a builtin_expect inside a loop as cold anymore.

chandlerc edited edge metadata.Jun 27 2017, 2:28 PM

I used BranchProbability as you suggested but due to BranchProbability's implementation I need to adjust the test case (essentitally BranchProbability(1,100) is < 1/100). This is not a big deal, but it does not treat a callsite guarded by a builtin_expect inside a loop as cold anymore.

I'd be happy to increase this to 2/100 to start with if that makes more sense to you. I agree that a callsite guarded by builtin_expect inside a loop seems like a good baseline for "is this threshold sane". (Naturally, we can tweak it further based on experience using it in the wild, but good to get a sane initial starting point.)

chandlerc accepted this revision.Jun 27 2017, 2:30 PM

Also LGTM generally. Feel free to submit with updated (or current) threshold based on what makes the most sense to you. Also a minor comment below, lemme know if you want another look at the code if caching that ends up useful but weird/complex.

lib/Analysis/InlineCost.cpp
658

As mentioned earlier, is it possible to cache this so that we don't re-compute the scaled entry frequency for each callsite? Maybe there isn't a good place to cache it...

This revision is now accepted and ready to land.Jun 27 2017, 2:30 PM
eraman updated this revision to Diff 104303.Jun 27 2017, 4:04 PM

Increase the threshold to 2, fix a bug and add comments about caching the caller entry threshold.

Also LGTM generally. Feel free to submit with updated (or current) threshold based on what makes the most sense to you. Also a minor comment below, lemme know if you want another look at the code if caching that ends up useful but weird/complex.

I have increased the threshold to 2. I've also fixed a thinko in constructing the ColdProb object (I wrote ColdProb(1, 100 * ColdCallSiteRelFreq), but it really should be ColdProb(ColdCallSiteRelFreq, 100). Will submit with these changes.

lib/Analysis/InlineCost.cpp
658

Discussed this offline with Chandler and we agreed that the effort to cache this not worth it unless this starts showing up in the profiles. I have added a comment reflecting this.

This revision was automatically updated to reflect the committed changes.