This is an archive of the discontinued LLVM Phabricator instance.

[Inliner] Tweak the condition for determining cold callsites.
Needs ReviewPublic

Authored by eraman on Aug 21 2017, 12:19 PM.

Details

Reviewers
chandlerc
Summary

This addresses the following issue. Consider a call chain A->B->C, where
the B->C callsite is guarded by an if with builtin_expect (cond, 0).
When BFI is computed for B, it essentially gives a non-zero value to the
coldest reachable block and scales all blocks including the entry based
on that. The inliner then skips B-> callsite since it is cold and then,
let us assume, inlines B into A. Now, we incrementally updates A's BFI
by including the newly cloned blocks. This incremental update doesn't
update the entry BFI of A (since we only want to update the frequencies
of newly cloned blocks). As a result, we have A's entry having some low
block frequency and B->C clone has a frequency of 0. This results in
CallSiteFreq < CallerEntryFreq * ColdProb to evaluate to false since
both LHS and RHS are 0. This tweak ensures that the cloned callsite is
still 0. An alterantive is to explicitly check if CallSiteFreq is 0.

Event Timeline

eraman created this revision.Aug 21 2017, 12:19 PM

Thinking about this a little more, this will end up treating non-cold callsites as cold. As an exmple, consider bar gets inlined to foo. Let entry_freq(foo) = 8, entry_freq(bar) = 8, callsite_freq(foo->bar) = 7. Now, any block in bar with freq == 1 will end up with a frequency of 0 after cloning and will be treated as cold even it's frequency relative to foo's entry is 1/7 which is not cold for the default value of cold-callsite-rel-freq (1/50). Rouding the freuenies while scaling will mitigate this a little bit, but still not sufficient. Need to think more.

chandlerc edited edge metadata.Aug 23 2017, 12:01 PM

Thinking about this a little more, this will end up treating non-cold callsites as cold. As an exmple, consider bar gets inlined to foo. Let entry_freq(foo) = 8, entry_freq(bar) = 8, callsite_freq(foo->bar) = 7. Now, any block in bar with freq == 1 will end up with a frequency of 0 after cloning and will be treated as cold even it's frequency relative to foo's entry is 1/7 which is not cold for the default value of cold-callsite-rel-freq (1/50). Rouding the freuenies while scaling will mitigate this a little bit, but still not sufficient. Need to think more.

I think you'll need to saturate at 1, and if necessary to preserve precision scale the caller up.

Thinking about this a little more, this will end up treating non-cold callsites as cold. As an exmple, consider bar gets inlined to foo. Let entry_freq(foo) = 8, entry_freq(bar) = 8, callsite_freq(foo->bar) = 7. Now, any block in bar with freq == 1 will end up with a frequency of 0 after cloning and will be treated as cold even it's frequency relative to foo's entry is 1/7 which is not cold for the default value of cold-callsite-rel-freq (1/50). Rouding the freuenies while scaling will mitigate this a little bit, but still not sufficient. Need to think more.

I think you'll need to saturate at 1, and if necessary to preserve precision scale the caller up.

Scaling the caller up pretty much defeats the purpose of incremental BFI updates and we no longer have the property that cost of a single inlining is constant.

Thinking about this a little more, this will end up treating non-cold callsites as cold. As an exmple, consider bar gets inlined to foo. Let entry_freq(foo) = 8, entry_freq(bar) = 8, callsite_freq(foo->bar) = 7. Now, any block in bar with freq == 1 will end up with a frequency of 0 after cloning and will be treated as cold even it's frequency relative to foo's entry is 1/7 which is not cold for the default value of cold-callsite-rel-freq (1/50). Rouding the freuenies while scaling will mitigate this a little bit, but still not sufficient. Need to think more.

I think you'll need to saturate at 1, and if necessary to preserve precision scale the caller up.

Scaling the caller up pretty much defeats the purpose of incremental BFI updates and we no longer have the property that cost of a single inlining is constant.

We only need the block frequencies of blocks containing callsites. We already maintain the list of callsites including those obtained by inlining. We could keep track of the relative (to the entry of the caller) block frequencies as a ScaledNumber and use this in InlineCost and in InlineFunction.cpp (to update the profile counts). Thoughts?