This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO] Call site prioritized inlining for sample PGO
ClosedPublic

Authored by wenlei on Jan 3 2021, 4:45 PM.

Details

Summary

This change implemented call site prioritized BFS profile guided inlining for sample profile loader. The new inlining strategy maximize the benefit of context-sensitive profile as mentioned in the follow up discussion of CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). The change will not affect today's AutoFDO as it's opt-in. CSSPGO now defaults to the new FDO inliner, but can fall back to today's replay inliner using a switch (-sample-profile-prioritized-inline=0).

Motivation

With baseline AutoFDO, the inliner in sample profile loader only replays previous inlining, and the use of profile is only for pruning previous inlining that turned out to be cold. Due to the nature of replay, the FDO inliner is simple with hotness being the only decision factor. It has the following limitations that we're improving now for CSSPGO.

  • It doesn't take inline candidate size (and saving) into account. Since it's doing replay, the size growth is bounded by previous CGSCC inlining. With context-sensitive profile, FDO inliner is no longer limited by previous inlining, so we need to take size into account to avoid significant size bloat.
  • The way it looks at hotness is not accurate. It uses total samples in an inlinee as proxy for hotness, while what really matters for an inline decision is the call site count. This is an unfortunate fall back because call site count and callee entry count are not reliable due to dwarf based correlation, especially for inlinees. Now paired with pseudo-probe, we have accurate call site count and callee's entry count, so we can use that to gauge hotness more accurately.
  • It treats all call sites from a block as hot as long as there's one call site considered hot. This is normally true, but since total samples is used as hotness proxy, this transitiveness within block magnifies the inaccurate hotness heuristic. With pseudo-probe and the change above, this is no longer an issue for CSSPGO.

New FDO Inliner

Putting all the requirement for CSSPGO together, we need a top-down call site prioritized BFS inliner. Here're reasons why each component is needed.

  • Top-down: We need a top-down inliner to better leverage context-sensitive profile, so inlining is driven by accurate context profile, and post-inline is also accurate. This is already implemented in https://reviews.llvm.org/D70655.
  • Size/growth Cap: For top-down inliner, taking function size into account for inline decision alone isn't sufficient to control size growth. We also need to explicitly cap size growth because with top-down inlining, we can grow inliner size significantly with large number of smaller inlinees even if each individually passes the cost/size check.
  • Prioritize call sites: With size cap, inlining order also becomes important, because if we stop inlining due to size budget limit, we'd want to use budget towards the most beneficial call sites.
  • BFS inline: Same as call site prioritization, if we stop inlining due to size budget limit, we want a balanced inline tree, rather than going deep on one call path.

Note that the new inliner also avoids repeatedly evaluating same set of call sites, so it should help with compile time too. For this reason, we could transition today's FDO inliner to use a queue with equal priority to avoid wasted reevaluation of same call site (TODO).

Speculative indirect call promotion and inlining is also supported now with CSSPGO just like baseline AutoFDO.

Tunings and knobs

I created tuning knobs for size growth/cap control, and for hot threshold separate from CGSCC inliner. The default values are selected based on initial tuning with CSSPGO.

  • sample-profile-inline-growth-limit: Limit the size growth ratio, default to 12 - inliner can't grow size to 12x of pre-inline size.
  • sample-profile-inline-limit-min: The lower bound of size growth limit, defaults to 100.
  • sample-profile-inline-limit-max: The upper bound of size growth limit, defaults to 10000.
  • sample-profile-hot-inline-threshold: Inline cost threshold for hot call sites, defaults to 3000, same as CGSCC inliner.

Results

Evaluated with an internal LLVM fork couple months ago, plus another change to adjust hot-threshold cutoff for context profile (will send up after this one), the new inliner show ~1% geomean perf win on spec2006 with CSSPGO, while reducing code size too. The measurement was done using train-train setup, MonoLTO w/ new pass manager and pseudo-probe. Note that this is just a starting point - we hope that the new inliner will open up more opportunity with CSSPGO, but it will certainly take more time and effort to make it fully calibrated and mature for bigger workloads (we're working on it).

Diff Detail

Event Timeline

wenlei created this revision.Jan 3 2021, 4:45 PM
wenlei requested review of this revision.Jan 3 2021, 4:45 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 3 2021, 4:45 PM
wenlei edited the summary of this revision. (Show Details)Jan 3 2021, 4:54 PM
wenlei updated this revision to Diff 314319.Jan 3 2021, 11:50 PM

Fix test and linter.

wenlei added a subscriber: wlei.

A few high level questions:

  1. Can the size bloat problem be handled in llvm-profgen time ? Basically using hotness information to prune/merge profiles properly?
  2. What is the intuition behind BFS order?
  3. How often does the size limit get triggered with the new inliner?
  4. what is the largest improvement in spec06? Any internal benchmark data?

A few high level questions:

  1. Can the size bloat problem be handled in llvm-profgen time ? Basically using hotness information to prune/merge profiles properly?

Yes to some degree. llvm-profgen can do two things: 1) prune and merge profile using hotness without knowledge of inlining. We are doing this already for baseline without this change, but not enough to limit inlining, since this has to be conservative as the pruning is not selective enough. 2) In best case scenario, if we can predict all inline decision, llvm-profgen can prepare (promote and merge) the profile in a way that only those context profile needed for inlining is kept, in which case the inlining will be bounded by the profile output from llvm-profgen. However, llvm-profgen don't have inline cost, so it's hard to prepare context profile perfectly. It's best to leave it up to the sample profile inliner to decide what context profile is useful, and what to be promoted and merged.

We do plan to implement some top down global inlining estimation and adjust profiles accordingly in llvm-profgen though, for ThinLTO. This is because LTO sample profile inliner is not global for ThinLTO, and letting llvm-profgen do some preparation would help with cross-module context profile adjustment. (We could also do something in ThinLink, but that adds more complexity and may also hurt compile time)

  1. What is the intuition behind BFS order?

This is to have a more balanced inlining. BFS with priority queue will always pick the most beneficial call site to inline within several levels of call graph from current function; while DFS may go deep on a particular call path without knowing whether we have more beneficial candidate on other paths.

  1. How often does the size limit get triggered with the new inliner?

Most functions in SPEC didn't hit the limit. There's one outlier, gobmk, without the cap it's disastrous for perf and code size - its call graph is very dense with small functions, and unbounded profile guided inliner would go wild. I can also double check on other workload.

  1. what is the largest improvement in spec06? Any internal benchmark data?

The largest improvements came from povray (11%), gobmk (10%), followed by a few others in low single digit.

When we made this change, we haven't tried internal workload, and later when we try internal workload, it's group with other changes together. So unfortunately, I don't have data for internal workload for this change alone. We can give it a try to on some internal benchmarks, though we will have to use internal fork to do that which has other stuff not yet upstreamed.

wmi added a comment.Jan 12 2021, 3:38 PM

Sorry for the delay in review. It makes a lot of sense to have a priority based inliner for CSSPGO since its profile annotation doesn't rely on replaying the inlining. But I don't understand why we rely on BFS/DFS strategy to expose the hottest callsite for priority based inliner. In my mind, CSSPGO can know the hotness of each callsite with full context information. Can we just sort all the callsites based on CSSPGO profile then try to inline the hottest callsite under a specific context first in a top down manner? We may actually need to inline some parent callsites before we can inline the hottest callsite, but it is all driven by the target to inline the hottest callsite first. If we worry some callsite is too deep and we need to inline through a deep path before we can inline the specific callsite, we may add the depth into priority computation. What do you think?

llvm/lib/Transforms/IPO/SampleProfile.cpp
432

Since there is another Struct named as InlineCandidate, can we rename it to doInline or executeInline?

In D94001#2494594, @wmi wrote:

Sorry for the delay in review. It makes a lot of sense to have a priority based inliner for CSSPGO since its profile annotation doesn't rely on replaying the inlining. But I don't understand why we rely on BFS/DFS strategy to expose the hottest callsite for priority based inliner. In my mind, CSSPGO can know the hotness of each callsite with full context information. Can we just sort all the callsites based on CSSPGO profile then try to inline the hottest callsite under a specific context first in a top down manner? We may actually need to inline some parent callsites before we can inline the hottest callsite, but it is all driven by the target to inline the hottest callsite first. If we worry some callsite is too deep and we need to inline through a deep path before we can inline the specific callsite, we may add the depth into priority computation. What do you think?

Yeah, that's one step further than what I have in this patch. The key here is the prioritized work list, and BFS is just a natural by product of using the priority queue. I called out BFS to emphasize the more balanced inlining, but it's not super accurate because it is only a true BFS when priorities are all identical. What you suggested is essentially increasing the scope of call sites to prioritize - currently it's the immediate callees. I agreed that increasing the scope to multiple levels or the entire sub call tree may be beneficial. But I don't think it conflicts the approach in the patch, instead it can be added leveraging this implementation.

With context profiles organized in a trie, it's relatively easy to get the hottest call sites in sub call tree (assuming no stale profile). I think we can assign total order/priority for these hot call sites as well as the call sites in the middle leading to those hot ones (call sites that lead to hottest call site need to have their priority bumped up), then the priority queue will lead us to these hottest call sites first even if they're several levels away.. I can give it a try to increase the scope of call site for prioritization as a follow up.

We need to be able to priority call site before its IR is available though. Originally I was thinking about also taking cost/size into account in priority in addition to call site hotness (don't have that implemented yet), to do that we need the actual call site IR in addition to profile. I was also hoping to eventually unify the early inliner for AFDO and CSSPGO. This implementation didn't diverge from AFDO too far, and we should be able to merge the two in the future with just different priority assignment (equal priority for AFDO). But I don't think these two are critical.

llvm/lib/Transforms/IPO/SampleProfile.cpp
432

Good point, renamed to tryInlineCandidate to be consistent with tryPromoteIndirectCall and shouldInlineCandidate. This function may decided not to inline.

wenlei updated this revision to Diff 316328.Jan 12 2021, 11:01 PM

retitle, rename inlineCandidate.

wenlei retitled this revision from [CSSPGO] Call site prioritized BFS inlining for sample PGO to [CSSPGO] Call site prioritized inlining for sample PGO.Jan 12 2021, 11:01 PM
wmi added a comment.Jan 13 2021, 9:57 PM
In D94001#2494594, @wmi wrote:

Sorry for the delay in review. It makes a lot of sense to have a priority based inliner for CSSPGO since its profile annotation doesn't rely on replaying the inlining. But I don't understand why we rely on BFS/DFS strategy to expose the hottest callsite for priority based inliner. In my mind, CSSPGO can know the hotness of each callsite with full context information. Can we just sort all the callsites based on CSSPGO profile then try to inline the hottest callsite under a specific context first in a top down manner? We may actually need to inline some parent callsites before we can inline the hottest callsite, but it is all driven by the target to inline the hottest callsite first. If we worry some callsite is too deep and we need to inline through a deep path before we can inline the specific callsite, we may add the depth into priority computation. What do you think?

Yeah, that's one step further than what I have in this patch. The key here is the prioritized work list, and BFS is just a natural by product of using the priority queue. I called out BFS to emphasize the more balanced inlining, but it's not super accurate because it is only a true BFS when priorities are all identical. What you suggested is essentially increasing the scope of call sites to prioritize - currently it's the immediate callees. I agreed that increasing the scope to multiple levels or the entire sub call tree may be beneficial. But I don't think it conflicts the approach in the patch, instead it can be added leveraging this implementation.

With context profiles organized in a trie, it's relatively easy to get the hottest call sites in sub call tree (assuming no stale profile). I think we can assign total order/priority for these hot call sites as well as the call sites in the middle leading to those hot ones (call sites that lead to hottest call site need to have their priority bumped up), then the priority queue will lead us to these hottest call sites first even if they're several levels away.. I can give it a try to increase the scope of call site for prioritization as a follow up.

Ok, thanks. About the prioritized work list, I was also expecting it will consider all the possible candidates in the module, not just in a Function, so it will be disruptive in terms of the order of profile annotation and inlining. I agree it can be a followup since it needs more experiments.

llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
100

Do we expect DIL to be the debug location of the indirect call?

llvm/lib/Transforms/IPO/SampleContextTracker.cpp
43

How about getMaxCountChildContext?

223–225

Is it possible for getContextFor to return nullptr for DIL, and do you need to handle such case?

llvm/lib/Transforms/IPO/SampleProfile.cpp
186

It will only be applied to priority based sample profile loader inlining, right? Same for the description of sample-profile-inline-limit-min/sample-profile-inline-limit-max

368

Considering the case CallsiteCounts are equal, does every InlineCandidate have non-null CalleeSamples? If that is true, FunctionSamples::GUID can be used to make the comparison more stable.

1000

Add assertions to check L and R are not nullptr?

1396

CallsiteCount is 0 definitely before this line, so no need to use std::max.
CallsiteCount = CalleeSamples->getEntrySamples();

1494

Several parts in this loop looks quite similar as the counterparts in inlineHotFunctions. Is it possible to make them shared?

wenlei marked an inline comment as done.Jan 19 2021, 5:59 PM
wenlei added inline comments.
llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
100

Right, discriminator for line-based profile and call site probe id for csspgo profile should differentiate call sites so DIL uniquely identifies a call site. Updated the comment to be explicit.

llvm/lib/Transforms/IPO/SampleContextTracker.cpp
43

Good point, changed to getHottestChildContext.

223–225

That shouldn't happen because everything we inlined during sample loader must have a context profile, hence getContextFor for any location must have a containing context profile.

llvm/lib/Transforms/IPO/SampleProfile.cpp
186

Yeah, updated the descriptions. thanks for pointing out.

368

Yeah, every candidate should have non-null CalleeSamples. Added tie breaker using GUID. Thanks for the suggestion.

1396

Good catch, this is a mistaken when taken out the changes for upstream. These are meant to be separate ifs. We found that taking the max works better. Fixed.

1494

Yeah, I thought about that too. I hoisted out part of ICP into tryPromoteIndirectCall to be shared. Looking more at this, I think if we let inlineHotFunctions uses InlineCandidate (with dummy hotness), we should be able to reuse tryInlineCandidate for inlineHotFunctions, and that may enable more shared code for the two loops. What about let me try this with an NFC patch on top of this one, so this patch doesn't change existing inlining code too much?

wenlei updated this revision to Diff 317737.Jan 19 2021, 6:01 PM

Address Wei's feedback. Clean up ICPCallee from InlineCandidate now that we don't check InlineCost before ICP.

wenlei updated this revision to Diff 317784.Jan 19 2021, 11:46 PM

default CallsitePrioritizedInline to true only for CSSPGO

wenlei added inline comments.Jan 20 2021, 12:03 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
1494

Sent D95024. The two loops are shorter now. inlineCallInstruction (AFDO) similar to tryInlineCandidate (CSSPGO) is now merged and removed. The ICP+Inline code all lifted into new common helper tryPromoteAndInlineCandidate.

wenlei updated this revision to Diff 320490.Feb 1 2021, 8:58 AM

rebase, update test case after merge.

wenlei added a comment.Feb 1 2021, 9:33 AM

Some more data points:

  • Inline size/growth limit hits:
    • Among spec2017, gcc has ~100 hits where inlining stopped due to limit. parest has ~40. Other benchmarks mostly has around 0-20 hits.
    • For server side Hermes (internal version of https://github.com/facebook/hermes), inliner hit limit for ~500 times.
    • For cpython (internal version of https://github.com/python/cpython), inliner hit limit for ~100 times.
  • Perf and code size:
    • I don't have numbers comparing just w/ and w/o the new inliner for bigger workloads. But here's what I got on spec2017 comparing -sample-profile-prioritized-inline=1 vs -sample-profile-prioritized-inline=1 - ~1.5% geomean perf boost, and it also controls the size growth (30% reduction) as intended. This is using pseudo-probe, and some small tweaks that we will upstream after this change are also included in the test run.

SPEC2017 run time .text size
508.namd_r -0.62% -10.73%
510.parest_r 0.23% -1.68%
511.povray_r -3.79% -50.93%
526.blender_r 1.06% -7.78%
600.perlbench_s -0.62% -41.87%
602.gcc_s -1.58% -60.28%
605.mcf_s -0.64% -39.95%
620.omnetpp_s 0.34% -4.08%
623.xalancbmk_s -5.24% -8.20%
625.x264_s -10.36% -75.58%
631.deepsjeng_s -1.87% -55.70%
638.imagick_s 0.14% 0.00%
641.leela_s -0.87% -13.85%
644.nab_s 0.79% -6.19%
657.xz_s 0.21% -12.84%
geomean -1.57% -31.16%

wenlei added a comment.Feb 1 2021, 9:38 AM

@wmi, @davidxl We collected some data hoping that gave some insight to the questions asked earlier. I think I've addressed all comments here (with some in D95024), let me know if I missed anything or if you have any other comments.

wmi added a comment.Feb 1 2021, 1:37 PM

Thanks for the data. It shows priority based inliner does significantly better than current early inliner in sample loader for CSSPGO both on performance and code size! It will interested to know how the importance to performance is distributed between priority based early inliner and CGSCC inliner now. I imagaine CGSCC inliner now plays a very minor role in CSSPGO now?

llvm/lib/Transforms/IPO/SampleProfile.cpp
1404

Add assert message. There are some other places missing the messages.

1527

According to https://bugs.llvm.org/show_bug.cgi?id=18962, PR18962 has been fixed in 2014. Is it the right bug to refer to?

1536–1538

Should the comment block be hoisted?

wenlei marked 3 inline comments as done.Feb 1 2021, 6:04 PM
wenlei added inline comments.
llvm/lib/Transforms/IPO/SampleProfile.cpp
1404

Message added for all instances.

1527

The bug was fixed, but we generate different types for the same definition for the case in that bug, and call analyzer can't handle that now. Updated the comment to include more details.

For code below, whether fn1 is present affects the type name of A. Then we could see two different types from the same definition, which makes CallAnalyzer choke as it's expecting matching parameter type on both caller and callee side.

class A {
  // append has to have the same prototype as fn1 to tickle the bug.
  void (*append)(A *);
};
void fn1(A *p1) {
}
1536–1538

Updated the 1st part of the comment, and removed the 2nd part which is out dated.

wenlei updated this revision to Diff 320655.Feb 1 2021, 6:06 PM
wenlei marked 3 inline comments as done.

Address Wei's comments.

wenlei added a comment.Feb 1 2021, 6:14 PM
In D94001#2534785, @wmi wrote:

Thanks for the data. It shows priority based inliner does significantly better than current early inliner in sample loader for CSSPGO both on performance and code size! It will interested to know how the importance to performance is distributed between priority based early inliner and CGSCC inliner now. I imagaine CGSCC inliner now plays a very minor role in CSSPGO now?

CGSCC inliner is still quite important for CSSPGO for now. We have tried to skip pre-LTO inlining, and that seems to be fine, however if we skip LTO time CGSCC inlining, there's noticeable regression. This is something we want to dig more, and gradually shift more inlining from CGSCC to sample loader's top down inlining.

wmi accepted this revision.Feb 1 2021, 10:37 PM

LGTM.

llvm/lib/Transforms/IPO/SampleProfile.cpp
1527

I see, thanks for clarifying it. It is better to add a TODO here.

This revision is now accepted and ready to land.Feb 1 2021, 10:37 PM
wenlei marked an inline comment as done.Feb 1 2021, 11:45 PM
wenlei added inline comments.
llvm/lib/Transforms/IPO/SampleProfile.cpp
1527

Updated comment with TODO.

wenlei updated this revision to Diff 320690.Feb 1 2021, 11:45 PM
wenlei marked an inline comment as done.

Update comment.

This revision was landed with ongoing or failed builds.Feb 1 2021, 11:47 PM
This revision was automatically updated to reflect the committed changes.