While triaging a regression, we found a performance issue in the current SampleFDO. A callsite that has an inline instance in the profile may still execute often enough that the regular inliner should not later treat it as cold, even if the hot callsite inliner cannot inline it. Currently, however, if such a callsite is not inlined by the hot callsite inliner, and the BB containing the callsite gets no samples from its other instructions, the callsite ends up with no profile metadata annotated. In the regular inliner's cost analysis, a callsite with no profile annotated whose caller has profile information is treated as cold.
The fix is to keep such warm callsites, which lack a profile because they are inlined in the profile, without profile metadata annotated. Other callsites whose parent BBs get no samples are explicitly annotated with a profile count of 0 (the profile metadata is not omitted). In the regular inliner's cost analysis, a callsite with no profile annotated is no longer treated as cold: a callsite is treated as cold only when its profile count exists and is less than the cold cutoff value.
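The change in the coldness check can be sketched as follows. This is a minimal illustration, not the actual inliner code: `ProfileCount`, `isColdOld`, `isColdNew`, and the cutoff value in the usage note are hypothetical names chosen for this example, with an empty `std::optional` standing in for "no profile metadata annotated".

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a callsite's annotated profile count.
// std::nullopt models a callsite with no profile metadata annotated.
using ProfileCount = std::optional<std::uint64_t>;

// Old rule (sketch): a callsite without a count is cold whenever its
// caller has profile information.
constexpr bool isColdOld(ProfileCount count, bool callerHasProfile,
                         std::uint64_t coldCutoff) {
  if (!count.has_value())
    return callerHasProfile;
  return *count < coldCutoff;
}

// New rule (sketch): a callsite is cold only when a count exists and is
// below the cold cutoff. A missing count is never treated as cold.
constexpr bool isColdNew(ProfileCount count, std::uint64_t coldCutoff) {
  return count.has_value() && *count < coldCutoff;
}
```

With an assumed cutoff of 10, a warm callsite left without metadata (`std::nullopt`) is cold under `isColdOld` when the caller has a profile but not under `isColdNew`, while a callsite explicitly annotated with a count of 0 is cold under both.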
It fixes a 5% regression in the target application. I also evaluated it on two server benchmarks and found no performance difference there, but one server benchmark got a 2% reduction in size.
I also evaluated other alternatives for fixing the issue, such as relaxing the hotness criteria in the hot callsite inliner, but the results were not as good as with this strategy, probably because the regular inliner has more information about whether a warm callsite with a medium or small callee should be inlined.
This can cause a problem if the caller function is newly added and has no profile associated with it: all callsites in it will be marked as cold.