The priority-based inliner currently uses block count combined with callee entry count to drive call site inlining. This doesn't work well with LTO, where post-link inlining is driven by the prelink-annotated block count, which could be based on a merge of all context profiles. I'm fixing it by using only the callee profile's entry count, which should be context-sensitive.
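To illustrate the intended ordering, here is a minimal standalone sketch, not the actual patch. `InlineCandidate`, `ByCalleeEntryCount`, and `inlineOrder` are hypothetical names made up for this example. It only shows candidates being prioritized by callee entry count while the block count is ignored.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Hypothetical candidate record for illustration; not an LLVM data structure.
struct InlineCandidate {
  std::string Callee;
  uint64_t BlockCount;       // prelink-annotated count, possibly merged across contexts
  uint64_t CalleeEntryCount; // entry count taken from the callee's own profile
};

// Comparator for a max-heap keyed on callee entry count only.
// BlockCount is deliberately not consulted.
struct ByCalleeEntryCount {
  bool operator()(const InlineCandidate &A, const InlineCandidate &B) const {
    return A.CalleeEntryCount < B.CalleeEntryCount;
  }
};

// Returns callee names in the order the candidates would be considered.
std::vector<std::string>
inlineOrder(const std::vector<InlineCandidate> &Candidates) {
  std::priority_queue<InlineCandidate, std::vector<InlineCandidate>,
                      ByCalleeEntryCount>
      Queue(Candidates.begin(), Candidates.end());
  std::vector<std::string> Order;
  while (!Queue.empty()) {
    Order.push_back(Queue.top().Callee);
    Queue.pop();
  }
  return Order;
}
```

With this comparator, a candidate with a large merged block count but a small callee entry count is considered after one with a small block count and a large callee entry count.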
I'm seeing a 0.2% perf improvement on one of our internal large benchmarks with a probe-based non-CS profile.
When we don't have callee samples, should we fall back to call site block counts?
In practice we would also need to tolerate some source changes, for example a call site that didn't exist in the pass1 build.
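One possible shape for the fallback discussed in these comments, as a sketch only and not a decision: prefer the callee entry count, fall back to the call site block count when the callee has no samples, and treat a call site with no count at all (such as one that did not exist in the pass1 build) as priority zero. `candidatePriority` is a hypothetical helper, and the zero default is an assumption.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper for illustration; not part of the patch.
// An empty optional means the corresponding count is unavailable.
uint64_t candidatePriority(std::optional<uint64_t> CalleeEntryCount,
                           std::optional<uint64_t> CallsiteBlockCount) {
  // Preferred source: the callee's own profile entry count.
  if (CalleeEntryCount)
    return *CalleeEntryCount;
  // Fallback: the call site block count when the callee has no samples.
  if (CallsiteBlockCount)
    return *CallsiteBlockCount;
  // Assumption: no information at all, e.g. a call site added after the
  // pass1 build, gets the lowest priority instead of being an error.
  return 0;
}
```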