This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO][Preinliner] Use linear threshold to drive inline decision.
ClosedPublic

Authored by hoy on May 5 2022, 10:10 AM.

Details

Summary

The per-callsite size threshold used today to drive preinline decision is based on hotness/coldness cutoff. The default setup is for callsites with a sample count above the hotness cutoff (99%), a 1500 size threshold is used. Any callsite below 99.99% coldness cutoff uses a zero threshold. This has a couple issues:

  1. While both cutoffs and size thoresholds are configurable, different applications may need different setups, making a universal setup impractical.
  1. The callsites between hotness cutoff and coldness cutoff are not considered as inline candidates, which could be a missing opportunity.
  1. Hot callsites always use the same threshold. In reality we may want a bigger threshold for hotter callsites.

In this change we are introducing a linear threshold regardless of hot/cold cutoffs. Given a sample space, a threshold is computed for a callsite based on the position of that callsite sample in the whole space. With that we no longer need to define what's hot or cold. Callsites with different hotness will get a different threshold. This should overcome the above three issues.

I have seen good results with a universal default setup for two of our internal services.

For one service, 0.2% to 0.5% perf improvement over a baseline with a previous default setup, on-par code size.
For the second service, 0.5% to 0.8% perf improvement over a baseline with a previous default setup, 0.2% code size increase; on-par performance and code size with a baseline that is with a carefully tuned cutoff to cover enough hot functions.

Diff Detail

Event Timeline

hoy created this revision.May 5 2022, 10:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2022, 10:10 AM
Herald added subscribers: modimo, wenlei. · View Herald Transcript
hoy requested review of this revision.May 5 2022, 10:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2022, 10:10 AM
hoy edited the summary of this revision. (Show Details)May 5 2022, 10:13 AM
hoy added reviewers: wenlei, wlei.

In general, we should include code size and perf numbers for benchmarks for such changes.

hoy edited the summary of this revision. (Show Details)May 5 2022, 11:37 AM
wenlei edited the summary of this revision. (Show Details)May 5 2022, 3:16 PM
wenlei added inline comments.May 5 2022, 10:12 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
163

One concern about using getMaxCount -- that will have more instability and variation run to run. How about using HotCountThreshold (or N * HotCountThresold, or a set percentile) to stabilize?

The NormalizedHotness doesn't have to be between 0 and 1. Or if we want it to be between 0 and 1, we can also do min(NormalizedHotness, 1).

164–165

Perhaps rename Position as NormalizedHotness?

166–169

How about we simplify the knobs, i.e. remove flag preinline-hot-callsite-threshold-multiplier and preinline-hot-callsite-threshold-constant.

Make Multiplier a constant, and we can still effectively tune the multiplier via setting sample-profile-hot-inline-threshold.
Make PreInlineHotCallsiteThresholdConstant a constant and we can still effectively tune it via sample-profile-cold-inline-threshold. We could also consider setting SampleColdCallSiteThreshold = 1 in PreInliner (current we set it to 0).

With the above, I think if we can simply switches (avoid adding new ones) without scarifying flexibility for tuning.

hoy added inline comments.May 5 2022, 10:58 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
163

(or N * HotCountThresold, or a set percentile) to stabilize?

You mean something like Position = N* HotCountThresold ? BTW, what is N?

Using a percentile should have a better stability. Position is similar to (1-percentile).

164–165

Sounds good.

166–169

Make Multiplier a constant, and we can still effectively tune the multiplier via setting sample-profile-hot-inline-threshold.

This should work since SampleHotCallSiteThreshold is only used here. One thing is the hot threshold may look very different from what the compiler uses, because of the multiplier. We could use the same heuristics on the compiler as well.

Make PreInlineHotCallsiteThresholdConstant a constant and we can still effectively tune it via sample-profile-cold-inline-threshold

This may also affect cold contexts. Enlarging sample-profile-cold-inline-threshold , say from 0 to 5, may result in a lot more cold contexts in the profile. This is because we currently treat cold functions as having the same hotness, and only use the cold threshold for them instead of computing a linear threshold.

wenlei added inline comments.May 5 2022, 11:14 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
163

BTW, what is N?

N can be arbitrary constant, depending on the range we want to normalize. Some random example: (3 * 99% count), or (2 * 95% count)

You mean something like Position = N* HotCountThresold ?

No, just replace getMaxCount with HotCountThresold or a precentile count.

Using a percentile should have a better stability. Position is similar to (1-percentile).

If you use (3 * 99% count) as the upper bound of range, then for count = (5 * 99% count), normalized hotness will be larger than 1, unless we manually cap it.

166–169

One thing is the hot threshold may look very different from what the compiler uses, because of the multiplier. We could use the same heuristics on the compiler as well.

I think it's ok for compiler and preinliner to have different thresholds, because one is based on IR, and the other is based on different proxy.

This may also affect cold contexts. Enlarging sample-profile-cold-inline-threshold , say from 0 to 5, may result in a lot more cold contexts in the profile. This is because we currently treat cold functions as having the same hotness, and only use the cold threshold for them instead of computing a linear threshold.

Maybe you didn't understand what I said. What I was suggesting is basically this + SampleColdCallSiteThreshold + PreInlineHotCallsiteThresholdConstant -> + SampleColdCallSiteThreshold. Then we tune SampleColdCallSiteThreshold alone, which is equivalent. I didn't find other use of SampleColdCallSiteThreshold in llvm-profgen.

hoy added inline comments.May 5 2022, 11:31 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
163

I see. Yeah, choosing a value smaller than getMaxCount may need caping to avoid threshold going ridiculous. But capping means values above the cap won't be treated linearly.

Also positioning based on HotCountThresold makes it service-dependent again. The value at a given cutoff could be very different.

166–169

SampleColdCallSiteThreshold is also used around line 170:

if (Candidate.CallsiteCount <= ColdCountThreshold)
   SampleThreshold = SampleColdCallSiteThreshold;

I didn't choose to go linearly for every sample. Cold samples share the same threshold for now. I did this because cold samples are many and size-sensitive, changing SampleColdCallSiteThreshold from 0 to 1 can cause profile grow a lot. Linear threshold for cold samples probably doesn't make a lot sense.

wenlei added inline comments.May 5 2022, 11:58 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
163

But capping means values above the cap won't be treated linearly.

That's why we can use a factor N, or simply use a different percentile, say 20%, which should be very large, and still much more stable than max.

Also positioning based on HotCountThresold makes it service-dependent again. The value at a given cutoff could be very different.

How is getMaxCount not service dependent then?

166–169

Ah, I see. I missed line 170. But the way you have it setup now creates a cliff for the heuristic:

  • for Candidate.CallsiteCount == ColdCountThreshold, threshold is SampleColdCallSiteThreshold
  • for Candidate.CallsiteCount == ColdCountThreshold + 1, threshold is SampleColdCallSiteThreshold + PreInlineHotCallsiteThresholdConstant (can be large number from flag)

We don't need to have linear threshold for cold samples, but it makes more sense for the heuristic to be consecutive without cliffs.

If all we need here is to avoid changing SampleColdCallSiteThreshold to1, perhaps making PreInlineHotCallsiteThresholdConstant a constant 1 is good enough -- it doesn't need to be tunable. Using 1 also makes the heuristic consecutive.

I was hoping to avoid explosion of tuning knobs and only expose those that are really necessary.

hoy added inline comments.May 6 2022, 10:13 AM
llvm/tools/llvm-profgen/CSPreInliner.cpp
163

I see, making it very large makes sense. 20% or 10% or should be close to getMaxCount while be more stable.

As long as the value is big enough to cover most samples, it's not service-dependent. I was concerning about the samples that are not covered or not getting a linear threshold.

166–169

Making PreInlineHotCallsiteThresholdConstant sounds good. Actually I never tried other values during my tuning.

hoy updated this revision to Diff 427673.May 6 2022, 10:28 AM
hoy edited the summary of this revision. (Show Details)

Addressing comments.

wenlei added inline comments.May 6 2022, 10:43 AM
llvm/tools/llvm-profgen/CSPreInliner.cpp
164

Use the count at 10% cutff to cap the threshold.

nit: people should be able to figure out what this is all about, but this comment can be confusing. maybe let's just make it more clear that "we normalize hotness to be [0,1], then linearly adjust threshold based on normalized hotness. "

That comment couples with more explicit code for readability.

NormalizationUpperBound = ProfileSummaryBuilder::getEntryForPercentile(...)
NormalizationLowerBound = ColdCountThreshold (or ProfileSummaryBuilder::getEntryForPercentile(...) )

NormalizedHotness = ...
SampleThreshold = ...
172

Perhaps worth a comment for this +1 (so later on others don't think it's non-critical and attempt to clean it up)

hoy added inline comments.May 6 2022, 10:58 AM
llvm/tools/llvm-profgen/CSPreInliner.cpp
164

Sounds good. Also added comment for the 10% cutoff.

172

Done.

hoy updated this revision to Diff 427680.May 6 2022, 10:59 AM

Updating D125023: [CSSPGO][Preinliner] Use linear threshold to drive inline decision.

wenlei accepted this revision.May 6 2022, 11:01 AM

thanks, lgtm assuming performance is still good with the final version.

This revision is now accepted and ready to land.May 6 2022, 11:01 AM
hoy added a comment.May 8 2022, 10:07 PM

thanks, lgtm assuming performance is still good with the final version.

Perf results came back good. Landing.

This revision was landed with ongoing or failed builds.May 8 2022, 10:23 PM
This revision was automatically updated to reflect the committed changes.