This is an archive of the discontinued LLVM Phabricator instance.

[ProfSampleLoader] When disable-sample-loader-inlining is true, merge profiles of inlined instances into their outlined versions.
ClosedPublic

Authored by mingmingl on Mar 16 2022, 2:50 PM.

Details

Summary

When --disable-sample-loader-inlining is true, skip the inline transformation, but merge profiles of inlined instances into their outlined versions.
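For illustration, here is a minimal sketch of the behavior being described (not the actual diff; the helper and variable names are hypothetical, though llvm::sampleprof::FunctionSamples::merge is the real merging primitive):

#include "llvm/ProfileData/SampleProf.h"
using llvm::sampleprof::FunctionSamples;

// Hypothetical helper: when sample loader inlining is disabled, skip the
// inline transformation but fold the inlined instance's samples back into
// the profile of the outlined copy, so downstream passes see counts that
// are consistent with the call never having been inlined.
static void mergeInlineeIntoOutlined(FunctionSamples &OutlinedProfile,
                                     const FunctionSamples &InlineeProfile) {
  OutlinedProfile.merge(InlineeProfile);
}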

Diff Detail

Event Timeline

mingmingl created this revision.Mar 16 2022, 2:50 PM
Herald added a project: Restricted Project. Mar 16 2022, 2:50 PM
mingmingl published this revision for review.Mar 16 2022, 2:50 PM
mingmingl added reviewers: xur, kazu.
mingmingl added a subscriber: davidxl.
Herald added a project: Restricted Project. Mar 16 2022, 2:51 PM
mingmingl updated this revision to Diff 415995.Mar 16 2022, 3:00 PM

fix typo "inlinee" in comment

mingmingl updated this revision to Diff 415999.Mar 16 2022, 3:15 PM

polish comment around option --sample-profile-merge-inlinee to minimize the potential ambiguity.

For inlineHotFunctionsWithPriority, do not add functions to LocalNotInlinedCallSites if ContextTracker is valid (not nullptr).

This keeps the original semantics: LocalNotInlinedCallSites stays empty when ContextTracker is not nullptr.
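Roughly, the guard looks like this (a sketch using the approximate names from the discussion, not the exact diff):

// Inside inlineHotFunctionsWithPriority: remember a not-inlined call site
// for later profile merging only when CSSPGO context tracking is unused.
if (!ContextTracker)
  LocalNotInlinedCallSites.insert({CB, FS});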

wenlei added a subscriber: hoy.Mar 17 2022, 9:15 AM

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped. This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped. This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

Good point about the post-link sample loader inlining. The intention of this internal flag is actually to disable both pre-link and post-link sample loader inlining for performance experiment purposes (e.g., exercising more cost/benefit analysis in the regular inliner) -- a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

Curious to hear more about this. Is tuning down sample loader inlining for performance or something else (e.g., compile time)? The SCC inliner being bottom-up makes it less selective, which is a disadvantage compared to early top-down inlining, so we were actually hoping to move more inlining from SCC to the sample loader. Or are you shifting inlining to the new ModuleInliner?

a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

Curious to hear more about this. Is tuning down sample loader inlining for performance or something else (e.g., compile time)? The SCC inliner being bottom-up makes it less selective, which is a disadvantage compared to early top-down inlining, so we were actually hoping to move more inlining from SCC to the sample loader. Or are you shifting inlining to the new ModuleInliner?

Re SCC inlining, yes -- kazu's prologue/epilogue-based cost analysis won't work well with it either -- so eventually more sophisticated analyses will rely on the module inliner. The working-set-based analysis I mentioned a while back is another candidate.

hoy added a comment.Mar 17 2022, 10:12 AM

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped. This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

Good point about the post-link sample loader inlining. The intention of this internal flag is actually to disable both pre-link and post-link sample loader inlining for performance experiment purposes (e.g., exercising more cost/benefit analysis in the regular inliner) -- a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

Can you shed more light on the context similarity analysis? Thanks.

so eventually more sophisticated analyses will rely on the module inliner.

Do you have plans to eventually merge the module inliner and the sample loader inliner? How would they be positioned when a sample profile is available?

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped. This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

Good point about the post-link sample loader inlining. The intention of this internal flag is actually to disable both pre-link and post-link sample loader inlining for performance experiment purposes (e.g., exercising more cost/benefit analysis in the regular inliner) -- a related effort is kazu@'s analysis of context similarity, to try to tune down the sample loader inlining.

Can you shed more light on the context similarity analysis? Thanks.

@kazu can help elaborate on it.

so eventually more sophisticated analyses will rely on the module inliner.

Do you have plans to eventually merge the module inliner and the sample loader inliner? How would they be positioned when a sample profile is available?

The sample loader inliner will still be needed, but tuned down to where it is needed for profile quality. We hope to shift more decisions to the later inliner with more analysis.

ormris removed a subscriber: ormris.Mar 17 2022, 10:56 AM
mingmingl updated this revision to Diff 416361.Mar 17 2022, 4:49 PM

Update the comment for option --disable-sample-loader-inlining to call out that it doesn't skip profile merging (as the patch currently stands).

I'll follow up on the discussion (regarding whether this is the desired behavior) in this patch.

mingmingl added a comment.EditedMar 17 2022, 5:13 PM

What's your actual use case that motivated this change?

When this was originally added, we used it to disable only pre-link sample loader inlining, in which case we don't want to merge the profiles even when inlining there is skipped.

If the profiles of inlined instances are not merged and the instances are not inlined, I'm wondering whether the profiles will be inaccurate for other optimization passes (not only the inliner itself).

I think the answer is yes, so I decided to make this patch.

This is because during post-link, sample loader inlining will still happen, so merging profiles during pre-link as if the relevant inlining will never happen may cause over-optimization (it was measurable in code size). +@hoy

Thanks for pointing this out!

To check my understanding: how does over-optimization or a measurable code size change happen when profiles are merged and the pre-link inliner skips the IR transformation? Does it happen in the following scenario?

  • "merging without inlining" cause some functions (on the edge) to become hot functions; these hot functions are imported in post-link sample loader, and cause code size increase

To check my understanding: how does over-optimization or a measurable code size change happen when profiles are merged and the pre-link inliner skips the IR transformation? Does it happen in the following scenario?

  • "merging without inlining" cause some functions (on the edge) to become hot functions; these hot functions are imported in post-link sample loader, and cause code size increase

If we skip inlining at a particular call site during pre-link but still inline that site during post-link, merging causes pre-link passes (including the pre-link SCC inliner for the callee) to be unnecessarily aggressive based on the merged (larger) counts, because those counts don't reflect post-link inlining.
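(A made-up numeric illustration: suppose outlined foo carries 50 samples and its inlined instance inside a hot caller carries 10,000. Merging at pre-link makes foo look hot at 10,050 samples, so pre-link passes optimize it aggressively; but post-link sample loader inlining then inlines that call site anyway, so the outlined foo never runs hot and the extra pre-link aggressiveness only costs size.)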

To check my understanding: how does over-optimization or a measurable code size change happen when profiles are merged and the pre-link inliner skips the IR transformation? Does it happen in the following scenario?

  • "merging without inlining" cause some functions (on the edge) to become hot functions; these hot functions are imported in post-link sample loader, and cause code size increase

If we skip inlining at a particular call site during pre-link but still inline that site during post-link, merging causes pre-link passes (including the pre-link SCC inliner for the callee) to be unnecessarily aggressive based on the merged (larger) counts, because those counts don't reflect post-link inlining.

Since --disable-sample-loader-inlining applies to both pre-link and post-link sample loader inlining in one clang invocation, I'm assuming the (undesirably aggressive) inlining happens in the feedback-driven inliner (i.e., InlinerPass::run, pointed to by [1]).

Is my understanding correct that the current implementation [2] of --disable-sample-loader-inlining would (undesirably) inline a function in the post-link stage via the feedback-driven inliner, if the function satisfies the following conditions:

  1. The function is hot within the module,
    • so inlining is desired in the pre-link stage.
  2. The function is cold across modules,
    • so inlining is not desired in the post-link stage, but it gets inlined (which could be counterproductive) by the feedback-driven inliner.

If the scenario above is where the problem happens, I'm planning to:

  1. Call out this undesired side effect in the usage of --disable-sample-loader-inlining.
  2. Call out that the option is used to evaluate the AutoFDO inliner, and recommend using it together with hacks [3] in the feedback-driven inliner for evaluation purposes.
  3. Proceed with this change, as the current implementation could be confusing (inaccurate profiles that affect more than sample loader inlining).

May I get some feedback on the plans above? Thanks!

[1] https://github.com/llvm/llvm-project/blob/f4b794427e8037a4e952cacdfe7201e961f31a6f/llvm/lib/Transforms/IPO/Inliner.cpp#L745
[2] The current implementation is:

  1. skipping sample loader inlining transformations artificially,
  2. merging profiles of each inlined instance into its outlined version.

[3] The hack is to still use static information in calculating the "threshold" but ignore profiles (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/InlineCost.cpp#L1869-L1902 is where profiles are used to calculate the threshold).
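As a rough sketch of what the hack in [3] would look like (a hypothetical edit for the experiment, not actual upstream code; Params.HotCallSiteThreshold and PSI->isHotCallSite are the real knob and query in that area of InlineCost.cpp):

// In the threshold computation: keep the static, size-based logic but skip
// the profile-driven bump, so hot sample counts no longer raise the threshold.
bool UseProfileForThreshold = false; // hypothetical switch for the experiment
if (UseProfileForThreshold && PSI->isHotCallSite(Call, CallerBFI))
  Threshold = Params.HotCallSiteThreshold.getValue();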

mingmingl added a comment.EditedMar 21 2022, 4:27 PM
In D121862#3398140, @luna wrote:

To check my understanding: how does over-optimization or a measurable code size change happen when profiles are merged and the pre-link inliner skips the IR transformation? Does it happen in the following scenario?

  • "merging without inlining" cause some functions (on the edge) to become hot functions; these hot functions are imported in post-link sample loader, and cause code size increase

If we skip inlining at a particular call site during pre-link but still inline that site during post-link, merging causes pre-link passes (including the pre-link SCC inliner for the callee) to be unnecessarily aggressive based on the merged (larger) counts, because those counts don't reflect post-link inlining.

Since --disable-sample-loader-inlining applies to both pre-link and post-link sample loader inlining in one clang invocation, I'm assuming the (undesirably aggressive) inlining happens in the feedback-driven inliner (i.e., InlinerPass::run, pointed to by [1]).

Is my understanding correct that the current implementation [2] of --disable-sample-loader-inlining would (undesirably) inline a function in the post-link stage via the feedback-driven inliner, if the function satisfies the following conditions:

  1. The function is hot within the module,
    • so inlining is desired in the pre-link stage.
  2. The function is cold across modules,
    • so inlining is not desired in the post-link stage, but it gets inlined (which could be counterproductive) by the feedback-driven inliner.

If the scenario above is where the problem happens, I'm planning to:

  1. Call out this undesired side effect in the usage of --disable-sample-loader-inlining.
  2. Call out that the option is used to evaluate the AutoFDO inliner, and recommend using it together with hacks [3] in the feedback-driven inliner for evaluation purposes.
  3. Proceed with this change, as the current implementation could be confusing (inaccurate profiles that affect more than sample loader inlining).

May I get some feedback on the plans above? Thanks!

[1] https://github.com/llvm/llvm-project/blob/f4b794427e8037a4e952cacdfe7201e961f31a6f/llvm/lib/Transforms/IPO/Inliner.cpp#L745
[2] The current implementation is:

  1. skipping sample loader inlining transformations artificially,
  2. merging profiles of each inlined instance into its outlined version.

[3] The hack is to still use static information in calculating the "threshold" but ignore profiles (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Analysis/InlineCost.cpp#L1869-L1902 is where profiles are used to calculate the threshold).

(An addendum to this comment)

With or without profile merging, artificially skipping inline transformations changes the profile and could cause unexpected behavior (even just for evaluation purposes).

To elaborate:

  1. With profile merging, functions that are cold across modules but hot within a module become hot across modules (toy numbers below).
  2. Without profile merging, the profiles are inaccurate (e.g., an outlined function could have been, and should be, considered hot if the profiles of its inlined instances were accounted for).
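Toy numbers for point 1 (made up; assume a hot threshold of 1,000 samples): an outlined function with 100 samples of its own plus inlined instances carrying 300 samples in each of four modules ends up with 1,300 samples after merging, so it looks hot everywhere even though no single copy was hot.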

When the option was introduced in https://reviews.llvm.org/D120344, I think the intention was to use it to disable both sample loader inlinings (pre- and post-link), so perhaps we can just tighten the comment for the option and document that the profiles are merged back.

If there is a need for pre-link-only disabling in the future, we can introduce a new option for it (one that does not merge profiles back).

mingmingl updated this revision to Diff 417133.Mar 21 2022, 4:55 PM

Update the option description of --disable-sample-loader-inlining. Mention the side effects of profile merging when inline transformations are skipped, and point to D121862 (this patch) for more details.

When the option was introduced in https://reviews.llvm.org/D120344, I think the intention was to use it to disable both sample loader inlinings (pre- and post-link), so perhaps we can just tighten the comment for the option and document that the profiles are merged back.

If there is a need for pre-link-only disabling in the future, we can introduce a new option for it (one that does not merge profiles back).

Done.

I modified the option description. It becomes (unavoidably) verbose now, but since the option description is a centralized place to convey the caveats, I decided to keep it this way so it's not easily missed.

mingmingl updated this revision to Diff 417137.Mar 21 2022, 4:59 PM

Minor polish of option description. No code change.

When the option was introduced in https://reviews.llvm.org/D120344, I think the intention was to use it to disable both sample loader inlinings (pre- and post-link), so perhaps we can just tighten the comment for the option and document that the profiles are merged back.

If there is a need for pre-link-only disabling in the future, we can introduce a new option for it (one that does not merge profiles back).

Sounds good.

When this was originally added, we used it to disable only pre-link sample loader inlining

Just realized this was your change. My comment above was referring to an internal change (and I didn't realize it was internal).

llvm/lib/Transforms/IPO/SampleProfile.cpp
1202–1203

Move the check into tryPromoteAndInlineCandidate/tryInlineCandidate, so we don't need to spray it everywhere?

mingmingl updated this revision to Diff 417326.Mar 22 2022, 9:50 AM
  1. Move DisableSampleLoaderInlining into {tryPromoteAndInlineCandidate,tryInlineCandidate} as suggested.
  2. Update the option description of --disable-sample-loader-inlining.
mingmingl updated this revision to Diff 417329.Mar 22 2022, 9:55 AM

Only option description is changed. No code change.

Auto-formatting gave short lines (by splitting one long line into two). Manually moved the next line up for cosmetic reasons.

wenlei accepted this revision.Mar 22 2022, 4:29 PM

lgtm with some nits.

llvm/lib/Transforms/IPO/SampleProfile.cpp
183–184

This seems too verbose - it's the longest message among all flags here. But I don't have a strong opinion.

191

remove

1495–1496

Remove this comment as well.

1510

nit: remove this white space line and avoid unrelated changes.

This revision is now accepted and ready to land.Mar 22 2022, 4:29 PM
mingmingl updated this revision to Diff 417673.Mar 23 2022, 9:55 AM
mingmingl marked 4 inline comments as done.

Address comments (diffs are described in comment reply).

thanks for the reviews!

llvm/lib/Transforms/IPO/SampleProfile.cpp
183–184

I agree.

Took the liberty to move the side-effects part into a "//" comment.

--sample-profile-remapping-file (in the same cpp file) has a comment [1] alongside its option description, so hopefully this is fine for code style.

[1] https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/IPO/SampleProfile.cpp#L123-L125

191

thanks for the catch! Done.

mingmingl updated this revision to Diff 417674.Mar 23 2022, 9:56 AM

Remove quote in code comment.

wenlei added inline comments.Mar 23 2022, 10:00 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
1495

I'd avoid random whitespace changes in general, unless you're intentionally fixing some format issues.

mingmingl marked 2 inline comments as done.

remove blank lines around untouched lines as suggested.

thanks for the reviews!

llvm/lib/Transforms/IPO/SampleProfile.cpp
1495

Removed the added blank lines here and in three other places.

unless you're intentionally fixing some format issues.

(This is opinion-based, so just to explain why I added so many blank lines.)

Blank lines were added where I didn't see curly braces around a one-line if-else, even though going braceless is the code style (https://llvm.org/docs/CodingStandards.html#don-t-use-braces-on-simple-single-statement-bodies-of-if-else-loop-statements).

I'll avoid adding blank lines around untouched lines from now on.

This revision was landed with ongoing or failed builds.Mar 23 2022, 1:07 PM
This revision was automatically updated to reflect the committed changes.
kazu added a comment.Jun 6 2022, 3:27 PM

@hoy, @wenlei Sorry for the extremely delayed reply to your questions. Here are some thoughts between @davidxl and myself:

Quick background

I enabled the cost-benefit analysis in early 2021 for instrumentation FDO. It gives us performance gains by inlining big but very hot call sites that would be rejected by the simple size-based threshold. At the same time, we keep the combined size of .text.hot and .text largely the same by rejecting small but marginally hot call sites. In the end, we reduce the call instruction frequency -- the number of call instructions executed per 1000 retired instructions.

Analyzing our large internal benchmark reveals several problems of AutoFDO relative to instrumentation FDO:

  • The AutoFDO binary performs worse.
  • The combined size of .text.hot and .text of the AutoFDO executable is 8 times as big as that of the FDO executable (even with the machine function splitting turned off).
  • The AutoFDO binary is more frontend bound and causes more i-cache and iTLB misses on x86. That is, the backend is sitting idle, waiting for decoded instructions.
  • The AutoFDO binary invokes call instructions more often than the FDO binary (even with the cost-benefit analysis disabled).

So, all in all, it's clear that we are inlining a less-than-ideal set of call sites.

Context similarity analysis

I'm exploring the opposite of what you are exploring -- shifting some inlining from the sample loader inliner to the SCC inliner with the cost-benefit analysis enabled (it is currently disabled for AutoFDO).

In my experiments, I don't get consistent performance wins from simply tuning down the sample loader inliner with increased thresholds on sample counts and enabling the cost-benefit analysis for the SCC inliner. So, I am wondering if we could intelligently tune down the sample loader inliner -- inlining context-sensitive callees in the sample loader inliner and leaving the rest to the SCC inliner. Note that the sample loader inliner is the only place where we can take advantage of context sensitivity. Once we flatten the profile of a given callee, we lose information on context sensitivity.

I did some analysis on context sensitivity with clang (as an application) and our large internal benchmark. It turns out that inlining a context-sensitive function into its immediate caller will allow us to take advantage of most of the context sensitivity. Specifically, in clang, only 33% of functions have a single behavior. Now, given A->B, if we hypothetically created a copy of B and named it B_called_by_A for every callee at the source code level, 91% of functions would have a single behavior. Our large internal benchmark is similar to clang in this aspect.

Module inliner

I am planning to explore the possibility of (largely) replacing the SCC inliner with the module inliner. Basically, we would inline call sites in descending order of profitability -- most likely the ratio of cycle savings to size cost. A sketch follows.
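A minimal sketch of that priority-queue loop (illustrative only, not the actual ModuleInliner implementation; score and inlineAndReturnNewCallSites are hypothetical helpers):

#include "llvm/IR/InstrTypes.h" // llvm::CallBase
#include <queue>
#include <vector>

struct Candidate {
  llvm::CallBase *CB;
  double Ratio; // estimated cycle savings divided by size cost
};
struct LowerRatio {
  bool operator()(const Candidate &A, const Candidate &B) const {
    return A.Ratio < B.Ratio; // max-heap: most profitable call site on top
  }
};

// Hypothetical helpers assumed by the sketch.
double score(llvm::CallBase *CB);
std::vector<llvm::CallBase *> inlineAndReturnNewCallSites(llvm::CallBase *CB);

void runPriorityInlining(const std::vector<llvm::CallBase *> &CallSites,
                         double ProfitabilityCutoff) {
  std::priority_queue<Candidate, std::vector<Candidate>, LowerRatio> Queue;
  for (llvm::CallBase *CB : CallSites)
    Queue.push({CB, score(CB)});
  while (!Queue.empty()) {
    Candidate C = Queue.top();
    Queue.pop();
    double Fresh = score(C.CB); // earlier inlining may have changed the caller
    if (Fresh != C.Ratio) {
      Queue.push({C.CB, Fresh}); // stale entry: re-queue with the fresh score
      continue;
    }
    // Assuming scores only decrease as inlining proceeds, the top entry is an
    // upper bound on every remaining candidate, so we can stop at the cutoff.
    if (C.Ratio < ProfitabilityCutoff)
      break;
    for (llvm::CallBase *NewCB : inlineAndReturnNewCallSites(C.CB))
      Queue.push({NewCB, score(NewCB)}); // newly exposed call sites join the queue
  }
}

Note that unprofitable call sites simply wait in the queue rather than being discarded, which is what makes the prologue/epilogue idea below workable.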

My plan is to try it out with instrumentation FDO first, as we have very accurate (but context-insensitive) profile information.

Prologue/epilogue analysis

@davidxl mentioned this, so I might as well expand on it a little here. We do spend cycles on prologues and epilogues, but we do not take that into account in inlining. Specifically, given A->B->C, inlining C into B could make B's prologue/epilogue bigger because of increased register usage. If B doesn't call C often enough, then B's bigger prologue/epilogue could slow things down when A calls B.

There is room for improvement in this area, but it's hard to capture in the SCC inliner. If we avoid inlining C into B because of prologue/epilogue size concerns, but B later gets inlined into A, then we worried about the prologue/epilogue size in vain. We need an inliner that doesn't simply discard call sites that don't currently look profitable.

I'm hoping that the module inliner fits the bill here. A call site that does not look profitable currently simply stays in the priority queue. If B->C does not look profitable enough now, we might inline A->B first, which may already make A's prologue/epilogue big enough; at that point, inlining C into A (with B inlined into it) may cause no additional harm to A's prologue/epilogue.

Sample loader inliner and module inliner

This combination is pretty far into the future in my current plan, but the core idea will probably stay the same. Let the sample loader inliner inline profitable context-sensitive call sites, leaving the rest to the module inliner (as opposed to the SCC inliner).

Thanks for sharing your insights @kazu.

We actually tried cost-benefit inlining for both IRPGO and Sample PGO. We didn't observe a measurable gain with IRPGO on the HHVM workload, but .text size was 5% smaller with the same perf.

For sample PGO, we tried annotating call site weights first and then invoking the call analyzer for cost-benefit-based inlining, all within the sample loader, but also didn't get better results.

We still believe cost-benefit analysis is the way to go, but we just haven't looked deeply enough into it to understand why it didn't bring a benefit for our workload.

For the module inliner vs. the sample loader inliner, I understand that it'd be more accurate to estimate the size and simplification effects of inlining when done later in the pipeline, but from the current analysis, it looks like we won't miss much if we do cost-benefit analysis in the sample loader. Have you considered that? This way you get to leverage context sensitivity and also benefit from cost-benefit analysis.

About context similarity, IIUC your experiments indicate that the majority of function profiles are actually not context sensitive? That is interesting. For CSSPGO on one internal workload, we see noticeable perf degradation if we try to cap context profile depth -- that indicates there's enough context sensitivity on the critical path for it to matter. Maybe this is workload-dependent.

The prologue/epilogue analysis would need to be done a bit late. It'd be interesting to see how accurately we can model them. Maybe you have seen pathological cases where it's important, but I wonder how much it helps the general case.

The combined size of .text.hot and .text of the AutoFDO executable is 8 times as big as that of the FDO executable (even with the machine function splitting turned off).

An 8x total .text difference is very surprising. For us, on the HHVM workload, AutoFDO/CSSPGO total .text size is only a few % larger than IRPGO using the same training. For your case, did you use the same training for both, or was IRPGO using benchmark/canary while AutoFDO was on a fleet-wide profile?

For sampling PGO in general, we found there's a correlation between perf improvement and size increase, even when we were tweaking selectiveness rather than aggressiveness. Similarly, we found that tuning inlining to be more aggressive than the default often leads to better perf. Wondering if you've observed something similar.

kazu added a comment.Jun 8 2022, 2:48 PM

@wenlei, thank *you* for sharing your insights.

For module inliner vs sample loader inliner, I understand that it'd be more accurate to estimate size and simplification effect of inlining when done later in the pipeline, but from the current analysis, it looks like we won't miss much if we do cost-benefit analysis in sample loader, have you considered that? This way you get to leverage context sensitivity and also benefit from cost-benefit analysis.

I haven't considered doing cost-benefit analysis in the sample loader inliner, but I have observed something consistent with your comment about accuracy. Specifically, the cycle savings are dominated by the call site cost (that is, the cost of the call and return instructions) computed in InlineCostCallAnalyzer::costBenefitAnalysis in InlineCost.cpp:

// Compute the total savings for the call site.
auto *CallerBB = CandidateCall.getParent();
BlockFrequencyInfo *CallerBFI = &(GetBFI(*(CallerBB->getParent())));
// Add the cost of the call and return instructions themselves to the savings.
CycleSavings += getCallsiteCost(this->CandidateCall, DL);
// Scale by how many times the caller's block executes per the profile.
CycleSavings *= CallerBFI->getBlockProfileCount(CallerBB).getValue();

I do compute the savings from conditionals being folded and such (look for the comment "Count a conditional branch as savings if it becomes unconditional."), but that's a relatively minor addition. For large enough applications, hot call sites are easily a factor of 100x or 1000x apart in terms of their sample counts. Even if our size estimate is off by a factor of, say, 2 because we forget to take folded branches into account, the slightly imprecise cost-benefit ratio just affects those call sites around the cost-benefit ratio threshold, without significantly impacting the set of important call sites to be inlined.
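(Made-up numbers for concreteness: a call site with 1,000,000 samples and an estimated size cost of 100 has a benefit ratio of 10,000, while one with 1,000 samples and the same size has a ratio of 10. Doubling every size estimate halves the ratios to 5,000 and 5; no decision flips unless the acceptance threshold happens to sit between a site's old and new ratio, which is only the case for borderline sites.)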

My observation above is also consistent with the following: small changes to the cost-benefit threshold again affect only those call sites around the cost-benefit ratio threshold, without significantly impacting the set of important call sites to be inlined. Requiring more benefit would produce a smaller executable with comparable performance. In concrete terms, you might set InlineSavingsMultiplier to 4 to 6 for instrumentation FDO applications.

About context similarity, IIUC you experiments indicates majority of function profiles are actually not context sensitive? That is interesting. For CSSPGO on one internal workload, we see noticeable perf degradation if we try to cap context profile depth -- that indicates there's enough context sensitivity on critical path for it to matter. Maybe this is workload dependent.

No, the majority of functions (about 67% for clang) are context sensitive. I did say that the hypothetical renaming trick (that is, one level of inlining) would take advantage of most of the context sensitivity, but that's for the entire AutoFDO application. If we focus on the top 90% of samples or something, the picture may look quite different. It's totally possible that our applications also require certain depths of inlining to reveal the context sensitivity.

The prolog/epilog analysis would need to be done a bit late. It'd be interesting to see to how accurately we can model them.

IIRC, if I do the traditional iterative live-variable analysis on the LLVM IR (not the machine IR) and take into account the x86-specific calling conventions (but treat all variables as integer/pointer variables for simplicity), I can accurately estimate the number of PUSH instructions 67% of the time, and I am off by no more than one 93% of the time. Of course, if the inline "root" keeps changing during inlining, the fairly accurate estimate is of no use. I'll revisit this project when I start playing with the module inliner, where the inline "root" will hopefully tend to stick around, and less important call sites can wait in the priority queue instead of being discarded.

8x total .text difference is very surprising. For us, on HHVM workload, AutoFDO/CSSPGO total .text size is only a few % larger than IRPGO using the same training. For your case, did you use the same training for both, or was IRPGO using benchmark/canary while AutoFDO was on fleet-wide profile?

The latter. The instrumentation FDO executable gets trained with a representative workload, while the AutoFDO executable is built with the profile collected from the instrumentation FDO executable processing the real workload. I don't know how good an idea it is to build an AutoFDO executable this way; at least, it's not a typical way to build AutoFDO executables.

For sampling PGO in general, we found there's a correlation between perf improvement and size increase, even when we were tweaking selectiveness rather than aggressiveness. Similarly, we found that tuning inlining to be more aggressive than the default often leads to better perf. Wondering if you've observed something similar.

No, I haven't played with the AutoFDO parameters in isolation. That said, I suspect (but haven't verified) that our AutoFDO executable is well past your correlation between perf improvement and size increase. Knowing that the instrumentation FDO executable performs a few percent better than the AutoFDO counterpart, along with other goodies (smaller .text/.text.hot footprint, reduced i-cache/iTLB misses, reduced call instruction frequency, etc.), I'm hoping to put the AutoFDO executable on another slope to a higher maximum.

hoy added a comment.Jun 8 2022, 3:28 PM

Thanks for sharing more about the context and the background, @kazu

I enabled the cost-benefit analysis in early 2021 for instrumentation FDO. It gives us performance gains by inlining big but very hot call sites that would be rejected by the simple size-based threshold.

How about the build time? We've encountered dramatic build time increases due to giant functions resulting from aggressive AutoFDO inlining.

So, I am wondering if we could intelligently tune down the sample loader inliner -- inlining context-sensitive callees in the sample loader inliner and leaving the rest to the SCC inliner.

I'm trying to find the theory behind this. The two inliners behave differently due to their styles, i.e., top-down vs. bottom-up. Regardless of context sensitivity, shifting inlining from one to the other may create different inlining patterns, and I haven't found an easy way to tell which is better. Please let me know your insight on this. We are trying to make the whole inlining process perform uniformly (for now by shifting as much inlining as possible towards the top-down sample loader inliner) so we can focus on tuning an inlining model for one inliner. We found it more complicated in LTO mode, where prelink CGSCC inlining could affect post-link sample loader inlining in unexpected ways, i.e., by increasing candidate size and breaking the pre-existing calling contexts in the profile. This makes it harder to tune the heuristics for the sample loader. We tried turning off the prelink CGSCC inliner and saw some size wins but neutral perf. We will likely do more investigation in this direction.