This is an archive of the discontinued LLVM Phabricator instance.

[SampleFDO] Don't let inliner treat warm callsite with inline instance in the profile as cold
ClosedPublic

Authored by wmi on Apr 6 2018, 9:16 AM.

Details

Summary

We found current sampleFDO had a performance issue when triaging a regression. For a callsite with inline instance in the profile, even if hot callsite inliner cannot inline it, it may still execute enough times and should not be treated as cold in regular inliner later. However, currently if such callsite is not inlined by hot callsite inliner, and the BB where the callsite locates doesn't get samples from other instructions inside of it, the callsite will have no profile metadata annotated. In regular inliner cost analysis, if the callsite has no profile annotated and its caller has profile information, it will be treated as cold.

The fix is for such warm callsites without profile because they are inlined in the profile, still keep them without profile metadata annotated. For other callsites whose parent BBs don't get any sample, explicitly annotate them with 0 profile count (Don't omit profile metadata). In regular inliner cost analysis, if a callsite has no profile annotated, we won't treat it as cold anymore -- we treat callsites as cold only when they profile count exists and is less than cold cutoff value.

It fixes a 5% regression in the target application. I also evaluate it on two server benchmarks and find no performance difference there, but one server benchmark gets 2% reduction in size.

I also evaluate other alternative to fix the issue, like relax the criterial of hotness checking in hot callsite inliner, but the result is not as good as this strategy probably because regular inliner has more information about whether we should inline a warm callsite with medium/small size callee.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi created this revision.Apr 6 2018, 9:16 AM
wmi updated this revision to Diff 141366.Apr 6 2018, 9:22 AM

Update a comment in the code.

davidxl added inline comments.Apr 11 2018, 9:32 AM
lib/Analysis/ProfileSummaryInfo.cpp
252 ↗(On Diff #141366)

This can cause problem if the caller function is newly added and there is no profile associated with it -- all callsites there will be marked as cold.

lib/Transforms/IPO/SampleProfile.cpp
1300 ↗(On Diff #141366)

Instead of skipping it, is it better to annotate it with a 'warm' profile count?

wmi added inline comments.Apr 11 2018, 10:00 AM
lib/Analysis/ProfileSummaryInfo.cpp
252 ↗(On Diff #141366)

You are right. That is a problem. Need to figure out how to avoid it.

lib/Transforms/IPO/SampleProfile.cpp
1300 ↗(On Diff #141366)

I considered that solution but I was worried that by annotating the callsite with warm profile count, it will be treated as warm but the callsites inside of the callee will still be treated as cold after the current callsite is inlined.

Definitely the issue here I am worried about is minor than the missing profile issue of new functions introduced by source change. I think this is still a good solution if only the testing is fine, or if we can come up with better solution.

Thanks for the fix!

I think maybe a preferred way to fix this is to change SampleProfileLoader::inlineHotFunctions to inline these "warm" inlined callsites early. The current algorithm uses callsiteIsHot, which compares inline instance's total count to the caller's total count, which could be misleading if the caller is super large/hot. A better algorithm should compare inline instance's total count to PSI to get a global hotness. In this way, if the profile annotator thinks a callsite is not hot, the later inliner should *not* even try to inline it. This makes the design cleaner and more stable. WDYT?

wmi added a comment.Apr 16 2018, 9:10 AM

Thanks for the fix!

I think maybe a preferred way to fix this is to change SampleProfileLoader::inlineHotFunctions to inline these "warm" inlined callsites early. The current algorithm uses callsiteIsHot, which compares inline instance's total count to the caller's total count, which could be misleading if the caller is super large/hot. A better algorithm should compare inline instance's total count to PSI to get a global hotness. In this way, if the profile annotator thinks a callsite is not hot, the later inliner should *not* even try to inline it. This makes the design cleaner and more stable. WDYT?

I tried the idea to compute the inline instance's total count divided by its bb count, and compare the division result to PSI hot threshold. That improved the regression benchmark but did not recover the whole regression. That is why I choosed to keep the current callsiteIsHot check in early inliner unchanged because I guessed regular inliner may have a better position to decide whether to inline such warm/medium size callsite.

wmi updated this revision to Diff 142643.Apr 16 2018, 9:23 AM

Tried David's suggestion and found the tests were good. The original regression for the target benchmark was recovered and we even got a little improvement. Another two server benchmarks had no performance change.

Patch was updated accordingly.

In D45377#1068853, @wmi wrote:

Thanks for the fix!

I think maybe a preferred way to fix this is to change SampleProfileLoader::inlineHotFunctions to inline these "warm" inlined callsites early. The current algorithm uses callsiteIsHot, which compares inline instance's total count to the caller's total count, which could be misleading if the caller is super large/hot. A better algorithm should compare inline instance's total count to PSI to get a global hotness. In this way, if the profile annotator thinks a callsite is not hot, the later inliner should *not* even try to inline it. This makes the design cleaner and more stable. WDYT?

I tried the idea to compute the inline instance's total count divided by its bb count, and compare the division result to PSI hot threshold. That improved the regression benchmark but did not recover the whole regression. That is why I choosed to keep the current callsiteIsHot check in early inliner unchanged because I guessed regular inliner may have a better position to decide whether to inline such warm/medium size callsite.

I suppose the regression comes from iterative-AutoFDO?

The problem of letting regular inliner to handle warm callsites is that the callee may have profile missing if it is fully inlined. Maybe instead of comparing total_count/num_calle_bb to hot threshold, just compare total_count to hot threshold? I agree this may increase code size a little, but it should not be worst than the previous afdo binary?

wmi added a comment.Apr 16 2018, 10:50 AM
In D45377#1068853, @wmi wrote:

Thanks for the fix!

I think maybe a preferred way to fix this is to change SampleProfileLoader::inlineHotFunctions to inline these "warm" inlined callsites early. The current algorithm uses callsiteIsHot, which compares inline instance's total count to the caller's total count, which could be misleading if the caller is super large/hot. A better algorithm should compare inline instance's total count to PSI to get a global hotness. In this way, if the profile annotator thinks a callsite is not hot, the later inliner should *not* even try to inline it. This makes the design cleaner and more stable. WDYT?

I tried the idea to compute the inline instance's total count divided by its bb count, and compare the division result to PSI hot threshold. That improved the regression benchmark but did not recover the whole regression. That is why I choosed to keep the current callsiteIsHot check in early inliner unchanged because I guessed regular inliner may have a better position to decide whether to inline such warm/medium size callsite.

I suppose the regression comes from iterative-AutoFDO?

It is possible. Because it is only about 1% regression uncovered by the change, I don't have a good way to measure exactly where it comes from. And 1% is within the fluctuation range the target benchmarks allows.

The problem of letting regular inliner to handle warm callsites is that the callee may have profile missing if it is fully inlined. Maybe instead of comparing total_count/num_calle_bb to hot threshold, just compare total_count to hot threshold? I agree this may increase code size a little, but it should not be worst than the previous afdo binary?

Yes, that is the same concern I have in my reply to David's suggestion, but the result seems fine. I can measure your suggested way and see how it looks like.

wmi added a comment.Apr 18 2018, 8:42 AM

The problem of letting regular inliner to handle warm callsites is that the callee may have profile missing if it is fully inlined. Maybe instead of comparing total_count/num_calle_bb to hot
threshold, just compare total_count to hot threshold? I agree this may increase code size a little, but it should not be worst than the previous afdo binary?

Yes, that is the same concern I have in my reply to David's suggestion, but the result seems fine. I can measure your suggested way and see how it looks like.

I tested the solution of comparing total_count to hot threshold, for the two server benchmarks the performance had no change. But for the regressed benchmark, it is a little worse than the solution of comparing total_count/num_callee_bb to hot threshold -- in my three runs there were two runs for which the regression was larger than the fluctuation range the target benchmarks allows. I know it is possible there is other side-effect taking place here, but for now I don't have detail perf profile for me to find out.

wmi added a comment.Apr 30 2018, 10:51 AM
In D45377#1071031, @wmi wrote:

The problem of letting regular inliner to handle warm callsites is that the callee may have profile missing if it is fully inlined. Maybe instead of comparing total_count/num_calle_bb to hot
threshold, just compare total_count to hot threshold? I agree this may increase code size a little, but it should not be worst than the previous afdo binary?

Yes, that is the same concern I have in my reply to David's suggestion, but the result seems fine. I can measure your suggested way and see how it looks like.

I tested the solution of comparing total_count to hot threshold, for the two server benchmarks the performance had no change. But for the regressed benchmark, it is a little worse than the solution of comparing total_count/num_callee_bb to hot threshold -- in my three runs there were two runs for which the regression was larger than the fluctuation range the target benchmarks allows. I know it is possible there is other side-effect taking place here, but for now I don't have detail perf profile for me to find out.

Ok, I find out the reason why comparing total_count to hot threshold didn't recover the regression. It is indeed caused by side-effect. The different inline disabled a jumpthreading and in turn disabled a block of code from being sunk into cold region in machine sinking. This lead to the regression. The patch in https://reviews.llvm.org/D46275 can fix the issue in jumpthreading. With D46275 installed, the solution of comparing total_count to hot threshold recover all the regression and even bring small improvement for the benchmark.

I will update the patch using the solution of comparing total_count to hot threshold.

wmi updated this revision to Diff 144647.Apr 30 2018, 4:18 PM

Update the patch to use the solution of comparing total count to hot cutoff threshold.

danielcdh added inline comments.Apr 30 2018, 6:41 PM
lib/Transforms/IPO/SampleProfile.cpp
368 ↗(On Diff #144647)

In what situations will PSI be nullptr? If not, then please assert it instead.

Also, I think this will overwrite the later (PercentSamples >= SampleProfileHotThreshold) heuristic, and we should remove that flag.

wmi added inline comments.Apr 30 2018, 8:48 PM
lib/Transforms/IPO/SampleProfile.cpp
368 ↗(On Diff #144647)

PSI will not be nullptr. Will add an assertion.

It is possible that for a callsite its CallsiteTotalSamples is less than hot cutoff threshold but still have a PententSamples larger than SampleProfileHotThreshold. My original plan is if a callsite is inlined currently, the new heuristic will still keep it.

But I check where SampleProfileHotThreshold is used and find it is also used to populate the InlinedGUIDs set. To make that simple and consistent, like you suggest, I may remove SampleProfileHotThreshold and related heuristic.

wmi updated this revision to Diff 144899.May 2 2018, 10:16 AM

remove SampleProfileHotThreshold. The benchmarks showed no regressions. I am now testing the iterative AFDO result.

wmi added a comment.May 3 2018, 9:41 AM

Iterative AFDO result is comparable with AFDO result.

danielcdh accepted this revision.May 10 2018, 1:09 PM
This revision is now accepted and ready to land.May 10 2018, 1:09 PM
This revision was automatically updated to reflect the committed changes.