This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO][llvm-profgen] Context-sensitive global pre-inliner
ClosedPublic

Authored by wenlei on Mar 22 2021, 11:06 PM.

Details

Summary

This change sets up a framework in llvm-profgen to estimate inline decision and adjust context-sensitive profile based on that. We call it a global pre-inliner in llvm-profgen.

It will serve two purposes:

  1. Since context profile for not inlined context will be merged into base profile, if we estimate a context will not be inlined, we can merge the context profile in the output to save profile size.
  2. For thinLTO, when a context involving functions from different modules is not inined, we can't merge functions profiles across modules, leading to suboptimal post-inline count quality. By estimating some inline decisions, we would be able to adjust/merge context profiles beforehand as a mitigation.

Compiler inline heuristic uses inline cost which is not available in llvm-profgen. But since inline cost is closely related to size, we could get an estimate through function size from debug info. Because the size we have in llvm-profgen is the final size, it could also be more accurate than the inline cost estimation in the compiler.

This change only has the framework, with a few TODOs left for follow up patches for a complete implementation:

  1. We need to retrieve size for funciton//inlinee from debug info for inlining estimation. Currently we use number of samples in a profile as place holder for size estimation.
  2. Currently the thresholds are using the values used by sample loader inliner. But they need to be tuned since the size here is fully optimized machine code size, instead of inline cost based on not yet fully optimized IR.

Diff Detail

Event Timeline

wenlei created this revision.Mar 22 2021, 11:06 PM
wenlei requested review of this revision.Mar 22 2021, 11:06 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 22 2021, 11:06 PM
hoy added inline comments.Mar 24 2021, 11:08 AM
llvm/tools/llvm-profgen/CSPreInliner.cpp
106

nit: a function call

151

Should this be >=?

169

This currently only reflects the number of live/hot lines. Might be extended to using static size from dwarf/probe decoding or disassembling. Can you leave a TODO for this?

llvm/tools/llvm-profgen/ProfileGenerator.cpp
470

Nit: context

llvm/tools/llvm-profgen/ProfiledCallGraph.h
29 ↗(On Diff #332537)

Nit: ProfiledCallGraphNodeComparer

wenlei added inline comments.Mar 24 2021, 7:13 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
106

updated.

151

> is consistent with SampleProfileLoader::shouldInlineCandidate, though practically I don't think it matters.

169

Todo added.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
470

updated.

llvm/tools/llvm-profgen/ProfiledCallGraph.h
29 ↗(On Diff #332537)

Good catch, changed.

wenlei updated this revision to Diff 333190.Mar 24 2021, 7:15 PM

Address Hongtao's comments. Also moving ProfiledCallGraph.h into IPO so we can use it in compiler too (there's a need to make it available in compiler and we will follow up with separate patch).

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase. One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase.

This is the exact problem we are trying to mitigate. We also considered doing this in ThinLink but adjusting profiles for thin-backends and communicating inline decisions to thin-backends would add quite a bit of complexity, which could also slow down ThinLink. With CSSPGO, doing it in profile generation and use adjusted profile to convey inline estimation/suggestion is much simpler and cheaper.

One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

Yes, this is a challenge. We don't have data yet, but I hope with some tuning we can get them to be close. One problem with doing pre-inlining is we don't have a lot of information that compiler can see from IR, though if needed some of that can be embedded into binary (some metadata in probe descriptor, etc.) for preinliner. I hope a more accurate view on machine code byte size for inline cost can offset some of the disadvantages due to lack of IR.

We'll be working on tuning the preinliner to get it to be close to compiler inliner. This is similar to the effort of transferring more inlining from cgscc inliner to sample loader inliner in that we may not see immediate results, but over time, as the new component matures, we hope to reap benefits later.

wmi added a comment.Mar 24 2021, 11:00 PM

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase.

This is the exact problem we are trying to mitigate. We also considered doing this in ThinLink but adjusting profiles for thin-backends and communicating inline decisions to thin-backends would add quite a bit of complexity, which could also slow down ThinLink. With CSSPGO, doing it in profile generation and use adjusted profile to convey inline estimation/suggestion is much simpler and cheaper.

One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

Yes, this is a challenge. We don't have data yet, but I hope with some tuning we can get them to be close. One problem with doing pre-inlining is we don't have a lot of information that compiler can see from IR, though if needed some of that can be embedded into binary (some metadata in probe descriptor, etc.) for preinliner. I hope a more accurate view on machine code byte size for inline cost can offset some of the disadvantages due to lack of IR.

It is a good idea to have an non-intrusive way to predict cross-module inlining decision and update the profile beforehand.

To mitigate ThinLTO profile update issue, either apparent inline or no-inline decisions can be made. From the patch description, seems currently it only considers the case that no-inline decision is made and profile can be merged back. Have you considered the case that inline is apparently beneficial and profile without context can be split?

We'll be working on tuning the preinliner to get it to be close to compiler inliner. This is similar to the effort of transferring more inlining from cgscc inliner to sample loader inliner in that we may not see immediate results, but over time, as the new component matures, we hope to reap benefits later.

In D99146#2649707, @wmi wrote:

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase.

This is the exact problem we are trying to mitigate. We also considered doing this in ThinLink but adjusting profiles for thin-backends and communicating inline decisions to thin-backends would add quite a bit of complexity, which could also slow down ThinLink. With CSSPGO, doing it in profile generation and use adjusted profile to convey inline estimation/suggestion is much simpler and cheaper.

One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

Yes, this is a challenge. We don't have data yet, but I hope with some tuning we can get them to be close. One problem with doing pre-inlining is we don't have a lot of information that compiler can see from IR, though if needed some of that can be embedded into binary (some metadata in probe descriptor, etc.) for preinliner. I hope a more accurate view on machine code byte size for inline cost can offset some of the disadvantages due to lack of IR.

It is a good idea to have an non-intrusive way to predict cross-module inlining decision and update the profile beforehand.

To mitigate ThinLTO profile update issue, either apparent inline or no-inline decisions can be made. From the patch description, seems currently it only considers the case that no-inline decision is made and profile can be merged back. Have you considered the case that inline is apparently beneficial and profile without context can be split?

If we don't have context profile from raw input profile, the split is going to be a simple scaling based on call site counts, right? In that case, doing it in profile generation won't improve profile quality because the scaling won't be very different from the scaling done by cgscc inliner. Though if we split the profiles to synthesize context profile, sample loader would be able to inline more, but if we want we could allow sample loader inlining to do scaling.

We'll be working on tuning the preinliner to get it to be close to compiler inliner. This is similar to the effort of transferring more inlining from cgscc inliner to sample loader inliner in that we may not see immediate results, but over time, as the new component matures, we hope to reap benefits later.

wenlei updated this revision to Diff 333362.Mar 25 2021, 11:10 AM

Update header comments and macro for ProfiledCallGraph.

hoy added inline comments.Mar 25 2021, 12:31 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
64

I'm wondering in the future if functions without profile should be considered so that a broader inline decisions can be made regardless of callsite hotness.

llvm/tools/llvm-profgen/ProfileGenerator.cpp
470

Trying to understand how keys can change. Do we only remove the profiles from ProfileMap once they are merged into base profiles? Are base profiles from the reuse of first non-inlined profiles?

wenlei added inline comments.Mar 25 2021, 4:33 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
64

For pre-inlining, we need to have callee profile otherwise there's no profile to be adjusted regardless of whether we inline or not.

For caller profile, we currently requires it to trigger pre-inline, but we could do pre-inline without caller profile. Is that what you meant? Currently compiler's inlining also requires caller profile, so what we have here aligns with compiler.

(Not sure if I get the question, it's building top-down order here, not actual pre-inlining.. )

llvm/tools/llvm-profgen/ProfileGenerator.cpp
470

Right, base profile can be from the first promoted context profile. We don't remove profiles from ProfileMap during context promotion and merging. ProfileMap owns the function profiles, and context promotion is done on context trie only, which also updates context for the profiles. The key of that map does not change, but the context for profile in the map changes, so key is no longer accurate. Fortunately for profile writing, we don't look at the keys, so as long we fix this spot to avoid using the key, we are good. We could also update the map to keep keys accurate, but it involves moving profiles around which has some cost.

hoy added inline comments.Mar 25 2021, 5:51 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
64

Sorry for the confusion. I actually meant to explore non-profiled callees (such as getters/setters) here and in getInlineCandidates to mimid the situation that both hot callees of those getters/setters and themselves are inlined into the current caller. The getter/setter inlining, though their callsite are not hot, are likely done in prelink CGSCC inlining. I was thinking about simulating the prelink inlining if possible. There's no such need if prelink inlining is disabled.

wenlei added inline comments.Mar 25 2021, 9:28 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
64

What do you mean by non-profiled callees? If function does not have profile, there's no profile to be adjusted, then doing pre-inline or not doesn't matter. CGSCC inline in prelink can happen but since there's no profile for the callee, no profile need to be adjusted and there's no count quality issue.

Or are you suggesting considering looking at multiple levels of callees when evaluating a call site? That is orthogonal to whether a function has profile.

hoy added inline comments.Mar 25 2021, 10:02 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
64

Yeah, even if the first level callee doesn't have profile, the callee's callee could have a hot profile. Skipping such first-level callees also skips next-level callees. It's fine if first-level callees are not inlined in prelink, which will aslo not be inlined in postlink sample loader. If such callees are inlined by prelink cgscc, then next level callees will likely be inlined by postlink FDO, which is a discrepancy from llvm-profgen preliner.

Currently if all functions have profiles, multi-level is naturally supported with the priority-based BFS processing by tweaking calliste costs. If any function in one call chain doesn't have a profile, BFS will stop at that level.

wenlei added inline comments.Mar 25 2021, 10:09 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
64

We don't support multiple levels even with priority-based BFS inliner in the sense that we never look ahead to see if there's anything hot underneath a cold callee. Callee without profile is just one example of cold callee.

hoy added inline comments.Mar 25 2021, 10:22 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
64

I see. Yes, callees without profile is a case of cold callees. They shouldn't be an issue without cgscc inlining sitting between the preliner and the targeted postlink FDO inliner. With the cgscc inlining, we might need sort of simulation for that, which might be quite different with the current top-down simulator.

wenlei added inline comments.Mar 25 2021, 10:36 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
64

Yeah, I hope that cgscc inline will mostly only deal with cold/small function inlining for csspgo, in which case profile adjustment is less important, hence preinline is less important (extreme case of cold inlining is cases where we don't have profile at all, then pre-inline doesn't matter). If we actually find cgscc inlining covering some hot inlining, I think it'd be worth looking to see why sample loader doesn't handle such cases. So I hope we don't have to do much for preinline estimation for cgscc inliner.

hoy accepted this revision.Mar 25 2021, 10:55 PM
hoy added inline comments.
llvm/tools/llvm-profgen/CSPreInliner.cpp
64

Sounds good. Current implementation looks good to me. Would like to see this moving forward and evolving close to the real inliner.

This revision is now accepted and ready to land.Mar 25 2021, 10:55 PM
wmi added inline comments.Mar 26 2021, 10:04 AM
llvm/tools/llvm-profgen/CMakeLists.txt
8

Besides the flags, is there anything else needed for the patch to add IPO as a dependency for llvm-profgen? IPO include many unrelated things.

wmi added inline comments.Mar 26 2021, 10:56 AM
llvm/tools/llvm-profgen/CSPreInliner.cpp
182

The definition of ShouldInline is only used in LLVM_DEBUG. It may trigger warning in release mode.

182–190

I didn't find if the candidate should not be inlined, where the context profile is merged into the base profile. Could you show me?

200

Included in LLVM_DEBUG.

222

printProfileNames is only used in LLVM_DEBUG. Need to include it in #ifndef NDEBUG.

wenlei added inline comments.Mar 26 2021, 10:57 AM
llvm/tools/llvm-profgen/CMakeLists.txt
8

That is for reusing context tracker and the profiled call graph added in IPO. These two are shared between sample loader and llvm-profgen, which hopefully helps to make preinliner close to compiler inline. I moved profiled call graph into IPO so compiler can use it too (we now use it in https://reviews.llvm.org/D99351).

wenlei added inline comments.Mar 26 2021, 11:01 AM
llvm/tools/llvm-profgen/CSPreInliner.cpp
182

Good point, let me check and adjust.

182–190

The merge is done within getBaseSamplesFor on-demand, same as how it's done in compiler.

200

This is all in LLVM_DEBUG on line 197. Do you mean we need a separate LLVM_DEBUG?

222

Good point, will do.

wmi added inline comments.Mar 26 2021, 12:28 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
182–190

Ah, I see. markContextSamplesInlined mark those contexts which will be inlined. For the rest, they are non-inlined and will be merged when getBaseSamplesFor is called.

200

Ah, I missed that. Never mind.

wenlei added inline comments.Mar 26 2021, 4:19 PM
llvm/tools/llvm-profgen/CSPreInliner.cpp
182

Seem the assignment is treated as a use, and we don't have warning in release. But we do have one for printProfileNames, now wrapped in macro.

wenlei updated this revision to Diff 333652.Mar 26 2021, 4:21 PM

Addressed Wei's comment, also moved profile call graph building into ProfiledCallGraph.

wmi accepted this revision.Mar 26 2021, 5:08 PM

LGTM. Thanks.

This revision was landed with ongoing or failed builds.Mar 29 2021, 9:54 AM
This revision was automatically updated to reflect the committed changes.

ThinLTO is known to have issues related to profile update (cross module), so we were thinking something similar in ThinLink phase.

This is the exact problem we are trying to mitigate. We also considered doing this in ThinLink but adjusting profiles for thin-backends and communicating inline decisions to thin-backends would add quite a bit of complexity, which could also slow down ThinLink. With CSSPGO, doing it in profile generation and use adjusted profile to convey inline estimation/suggestion is much simpler and cheaper.

One of the issues is that the pre-inlining needs to make similar decisions as the compiler. How well is the preinliner doing in this regard?

Yes, this is a challenge. We don't have data yet, but I hope with some tuning we can get them to be close. One problem with doing pre-inlining is we don't have a lot of information that compiler can see from IR, though if needed some of that can be embedded into binary (some metadata in probe descriptor, etc.) for preinliner. I hope a more accurate view on machine code byte size for inline cost can offset some of the disadvantages due to lack of IR.

We'll be working on tuning the preinliner to get it to be close to compiler inliner. This is similar to the effort of transferring more inlining from cgscc inliner to sample loader inliner in that we may not see immediate results, but over time, as the new component matures, we hope to reap benefits later.

@davidxl @wmi just to report back now that we've done more work along this path, and got some results. We observed global preinliner boosting performance for a large internal workload by ~1% (baseline is CSSPGO without global preinliner). Hopefully with some work we can rely more on preinliner for AutoFDO too.