This is an archive of the discontinued LLVM Phabricator instance.

[SampleProfile] Check entry count instead of total count to decide if inlined callsite is hot.
Needs Review · Public

Authored by twoh on Apr 1 2019, 11:55 AM.

Details

Reviewers
danielcdh
wmi
Summary

As we're interested in the "callsite" hotness, not the hotness of the inlined function itself, it makes more sense to check the entry sample count rather than the total sample count. Also, considering that the PSI histogram is based on the execution count of each individual line, I think PSI->isHotCount should be invoked with a single instruction/line profile count, not with an aggregated count for a function.
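
A minimal sketch of the direction the patch proposes, assuming the callsiteIsHot helper in SampleProfile.cpp and the FunctionSamples API of the time (exact names and signatures may differ from the actual diff):

#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/ProfileData/SampleProf.h"

using namespace llvm;
using namespace llvm::sampleprof;

// Sketch: decide callsite hotness from the entry samples of the inline
// instance rather than from the aggregate over its whole body.
static bool callsiteIsHot(const FunctionSamples *CallsiteFS,
                          ProfileSummaryInfo *PSI) {
  if (!CallsiteFS)
    return false; // The callsite was never sampled.
  if (!PSI)
    return true;  // No profile summary available; keep legacy behavior.

  // Before: return PSI->isHotCount(CallsiteFS->getTotalSamples());
  // After: a single per-callsite count is compared against the PSI
  // histogram, which is itself built from per-line counts.
  return PSI->isHotCount(CallsiteFS->getEntrySamples());
}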

Event Timeline

twoh created this revision. Apr 1 2019, 11:55 AM
wmi added a comment. Apr 1 2019, 4:29 PM

Theoretically, as you said, checking the entry sample count rather than the total sample count makes more sense. However, the entry sample count of a callsite is not as precise as the function entry sample count, which is obtained directly from the LBR. A callsite can get a very wrong entry sample count because of debug information lost during optimization. That is why we chose the total sample count instead of the entry sample count when evaluating the hotness of a callsite.

twoh added a comment. Apr 1 2019, 5:05 PM

@wmi Thanks for the reply! I can totally understand that the entry count is not as precise as the total count, but I still don't think the current implementation is the right way to address the issue. As I mentioned in the summary, it compares two different things (an instruction-level counter vs. a function-level counter), opens up the possibility of optimizing the wrong function (e.g. a long and cold function), and makes it hard to find the root cause of a performance issue.

If we can't have a precise entry count, the right way to address the issue would be not to use a PSI-based heuristic but to use a heuristic that actually considers the total count of the function.

wmi added a comment. Apr 2 2019, 10:19 AM

> @wmi Thanks for the reply! I can totally understand that the entry count is not as precise as the total count, but I still don't think the current implementation is the right way to address the issue. As I mentioned in the summary, it compares two different things (an instruction-level counter vs. a function-level counter), opens up the possibility of optimizing the wrong function (e.g. a long and cold function), and makes it hard to find the root cause of a performance issue.
>
> If we can't have a precise entry count, the right way to address the issue would be not to use a PSI-based heuristic but to use a heuristic that actually considers the total count of the function.

Thanks Taewook. I understand your concern and I want to add some extra explanation to my last reply. In addition to the imprecision of the entry count of an inline instance, another reason we chose the total sample count instead of the entry count is that it is the more conservative way to avoid missing inlining in the sample profile loader phase. The total sample count is always equal to or larger than the entry count, which means we are less likely to miss a function that had better be inlined but is not. I want to emphasize that the main purpose of inlining inside AutoFDO is not to reduce call overhead or expose other optimization opportunities, but to enable context-sensitive profile annotation. Suppose there is an inline instance with a cold entry but a hot spot inside it: although we will not reduce call overhead by inlining it, inlining it is still important so the hot spot can get a context-sensitive profile during profile annotation.

As you said, one case we want to prevent is when the inline instance is long and doesn't have any hot spot inside it, and the only reason the total sample count looks hot is that the body is large. For such a case, we may be able to prevent it by adding a simple heuristic in callsiteIsHot: in addition to the total sample count being larger than the hot threshold, the max sample count inside the inline instance must also be larger than a certain threshold, i.e., there must be at least one hot spot inside for the inline instance to be considered hot.
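
A minimal sketch of that mitigation, again assuming the FunctionSamples API of the time (getTotalSamples, plus getBodySamples iterating per-line SampleRecords); this illustrates the idea and is not the actual patch:

#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/ProfileData/SampleProf.h"
#include <algorithm>
#include <cstdint>

using namespace llvm;
using namespace llvm::sampleprof;

// Keep the existing total-count test, but additionally require at least
// one hot line inside the inline instance.
static bool callsiteIsHot(const FunctionSamples *CallsiteFS,
                          ProfileSummaryInfo *PSI) {
  if (!CallsiteFS)
    return false; // The callsite was never sampled.
  if (!PSI)
    return true;  // No profile summary available; keep legacy behavior.

  // Existing test: aggregate count over the whole inline instance.
  if (!PSI->isHotCount(CallsiteFS->getTotalSamples()))
    return false;

  // Suggested addition: the hottest single line inside the instance must
  // also clear the hot threshold, so a long but uniformly cold body no
  // longer qualifies as hot.
  uint64_t MaxBodyCount = 0;
  for (const auto &I : CallsiteFS->getBodySamples())
    MaxBodyCount = std::max(MaxBodyCount, I.second.getSamples());
  return PSI->isHotCount(MaxBodyCount);
}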

What do you think?

twoh added a comment. Apr 2 2019, 12:02 PM

@wmi Thank you for the detailed explanation!

> I want to emphasize that the main purpose of inlining inside AutoFDO is not to reduce call overhead or expose other optimization opportunities, but to enable context-sensitive profile annotation. Suppose there is an inline instance with a cold entry but a hot spot inside it: although we will not reduce call overhead by inlining it, inlining it is still important so the hot spot can get a context-sensitive profile during profile annotation.

I can understand the value of context-sensitive profile information, but wouldn't it only be valuable if the callsite is actually worth inlining? IIUC, LLVM doesn't propagate the context-sensitive profile information collected from the inlined callsite back to the original standalone function. To me it seems that the current implementation drives a wrong inlining decision in order to collect information that is useful only under that wrong decision.

> As you said, one case we want to prevent is when the inline instance is long and doesn't have any hot spot inside it, and the only reason the total sample count looks hot is that the body is large. For such a case, we may be able to prevent it by adding a simple heuristic in callsiteIsHot: in addition to the total sample count being larger than the hot threshold, the max sample count inside the inline instance must also be larger than a certain threshold, i.e., there must be at least one hot spot inside for the inline instance to be considered hot.

I think this will mitigate the problem, but I still don't think it is the right way to solve it. My biggest concern with the current implementation is that we're comparing a function's total count against PSI threshold values. If we really cannot trust the function entry count in sample profile data, we may try a heuristic based on the function's total count, but comparing the total count against a PSI threshold, which is computed from the per-instruction count histogram, still seems wrong to me: for example, a 1000-line function where every line has a count of 10 accumulates a total of 10,000, which can clear a hot threshold derived from per-line counts even though no single line in it is hot.

wmi added a comment. Apr 2 2019, 3:08 PM

> I can understand the value of context-sensitive profile information, but wouldn't it only be valuable if the callsite is actually worth inlining?

I think an example may help illustrate why inlining for better profile annotation is important even when the inlining doesn't remove call overhead because the entry is cold.

foo() {
  for (i = 0; i < N; i++) {
    if (condition = getcond()) {
      ...
    } else {
      ...
    }
  }
}

goo() {
  ...
  foo();  // condition is true in 99% of the loop iterations in foo.
  ...
}

hoo() {
  ...
  foo();  // condition is false in 99% of the loop iterations in foo.
  ...
}

Suppose foo is only called twice, once in goo and once in hoo, so it is very cold at the function entry, but the loop inside foo is very hot (N is very large). Suppose that in the non-FDO build both callsites of foo are inlined, even though they are cold, because the size cost is lower than the threshold (inlining foo reduces size). Then we will have context-sensitive profiles for foo in the profile generated from the non-FDO binary.

If we inline the two callsites in the sample profile loader, then after profile annotation the branch probability of the if statement inlined into goo will be 99% vs. 1%, while the branch probability in hoo will be 1% vs. 99%.

If we don't inline those two callsites in the sample profile loader, then either we don't get any profile for foo at all (because all calls to foo were inlined in the non-FDO binary), or we get a profile that is not context sensitive, so the branch probability of the if statement in foo will be regarded as 50% vs. 50%.
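
To make this concrete, here is roughly what a context-sensitive profile for goo could look like in the text sample-profile format (the counts and line offsets are invented for illustration):

goo:10001:1
 2: foo:10000
  1: 9900
  2: 9801
  3: 99

The nested block is the inline instance of foo at line offset 2 of goo; offsets 2 and 3 inside it stand for the two branches of the if, so goo's context keeps its own 9801-vs-99 split, while hoo's inline instance would carry the mirrored counts. A context-free profile for foo would merge the two, leaving the branches roughly balanced.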

twoh added a comment. Apr 2 2019, 3:59 PM

@wmi Thank you for the concrete example! I think what we need for your example is context-sensitive profiling and function specialization, not inlining. Admittedly, we don't have infrastructure in LLVM to support context-sensitive profiling for the non-inlined case, and we don't perform context-sensitive function specialization...

Again, my bigger concern is with using PSI->isHotCount to check the function hotness. If we want to stay with a heuristic based on the function's total count, wouldn't it make more sense to have something like ProfileSummaryInfo::isFunctionHotInCallGraph, which actually checks the hotness of the function itself, and use that?

wmi added a comment. Apr 2 2019, 4:18 PM

> @wmi Thank you for the concrete example! I think what we need for your example is context-sensitive profiling and function specialization, not inlining. Admittedly, we don't have infrastructure in LLVM to support context-sensitive profiling for the non-inlined case, and we don't perform context-sensitive function specialization...

> Again, my bigger concern is with using PSI->isHotCount to check the function hotness. If we want to stay with a heuristic based on the function's total count, wouldn't it make more sense to have something like ProfileSummaryInfo::isFunctionHotInCallGraph, which actually checks the hotness of the function itself, and use that?

I agree with that. We can create a ProfileSummaryInfo::isFunctionHotInCallGraph-style check instead of simply using PSI->isHotCount; comparing against PSI->isHotCount is actually the simplest possible implementation of such a check. The mitigating strategy mentioned above could also be added there, and you can definitely come up with a better way to model the global hotness of an inline instance. Another benefit of splitting out the logic that compares global hotness in different scenarios is that it may make performance tuning easier.
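
A hypothetical sketch of what a FunctionSamples-based analogue could look like (the name inlineInstanceIsHotInCallGraph and the exact combination of tests are invented here; the real ProfileSummaryInfo::isFunctionHotInCallGraph takes a Function plus BlockFrequencyInfo, which the sample loader doesn't have for a not-yet-inlined instance):

#include "llvm/Analysis/ProfileSummaryInfo.h"
#include "llvm/ProfileData/SampleProf.h"

using namespace llvm;
using namespace llvm::sampleprof;

// Hypothetical: mirror the shape of isFunctionHotInCallGraph on
// FunctionSamples. An inline instance is hot if its entry is hot or if
// any single line inside it is hot.
static bool inlineInstanceIsHotInCallGraph(const FunctionSamples *FS,
                                           ProfileSummaryInfo *PSI) {
  if (!FS || !PSI)
    return false;
  // Hot entry, analogous to a hot function entry count.
  if (PSI->isHotCount(FS->getEntrySamples()))
    return true;
  // Any hot line inside, analogous to a hot basic block.
  for (const auto &I : FS->getBodySamples())
    if (PSI->isHotCount(I.second.getSamples()))
      return true;
  return false;
}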

wenlei added a subscriber: wenlei. Apr 3 2019, 5:59 PM