This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO] Infrastructure for context-sensitive Sample PGO and Inlining
Closed, Public

Authored by wenlei on Oct 25 2020, 1:02 PM.

Details

Summary

This change adds the context-sensitive sample PGO infrastructure described in the CSSPGO RFC (https://groups.google.com/g/llvm-dev/c/1p1rdYbL93s). It introduces an abstraction between the input profile and the profile loader, which queries the input profile for functions. Specifically, there is now the notion of base profile and context profile, and they are managed by the new SampleContextTracker, which adjusts and merges profiles based on inline decisions. It works with the top-down profile-guided inliner in the profile loader (https://reviews.llvm.org/D70655) for better inlining with specialization and better post-inline profile fidelity. In the future, we can also expose this infrastructure to the CGSCC inliner so that it can take advantage of context-sensitive profiles. This change is the consumption part of context-sensitive profiles (the generation part is in this stack: https://reviews.llvm.org/D89707). We've seen good results internally in conjunction with Pseudo-probe (https://reviews.llvm.org/D86193). Patches for integration with Pseudo-probe are coming up soon.

Currently the new infrastructure kicks in only when the input profile contains context-sensitive profiles; otherwise it's a no-op and does not affect existing AutoFDO.

Interface

There are two sets of interfaces, for query and for tracking respectively, exposed from SampleContextTracker. For query, instead of simply getting a profile from the input for a function, we can now explicitly query the base profile or a context profile for a given call path of a function. For tracking, there are separate APIs for marking a context profile as inlined, and for promoting and merging context profiles that were not inlined. A usage sketch follows the list below.

  • Query base profile (getBaseSamplesFor)

Base profile is the merged synthetic profile of a function's CFG profile from all outstanding (not inlined) contexts. We can query the base profile by function.

  • Query context profile (getContextSamplesFor)

Context profile is a function's CFG profile for a given calling context. We can query context profile by context string.

  • Track inlined context profile (markContextSamplesInlined)

When a function is inlined for a given calling context, we need to mark the context profile for that context as inlined. This makes sure we don't include the inlined context profile when synthesizing the base profile for that inlined function.

  • Track not-inlined context profile (promoteMergeContextSamplesTree)

When a function is not inlined for a given calling context, we need to promote the context profile tree so that the not-inlined context becomes a top-level context. This preserves the sub-contexts under that function, so later inline decisions for that not-inlined function will still have context profiles for its call tree. Note that profiles are merged as needed when promoting a context profile tree, if any node already exists at its promoted destination.
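To make the intended usage concrete, here is a minimal sketch of how a top-down inliner in the sample loader would drive these four APIs. Signatures are simplified assumptions, and the helpers shouldInline and inlineCallSite are hypothetical placeholders, not actual LLVM functions.

#include "llvm/IR/DebugInfoMetadata.h"               // DILocation
#include "llvm/IR/InstrTypes.h"                      // CallBase
#include "llvm/ProfileData/SampleProf.h"             // FunctionSamples
#include "llvm/Transforms/IPO/SampleContextTracker.h"

using namespace llvm;
using namespace llvm::sampleprof;

// Hypothetical helpers, for illustration only.
bool shouldInline(CallBase &CB, const FunctionSamples &FS);
void inlineCallSite(CallBase &CB);

// Sketch: handle one call site and keep the context tracker in sync.
void processCallSite(SampleContextTracker &Tracker, CallBase &CB,
                     const DILocation *DIL, StringRef CalleeName) {
  // Query the context profile for this particular calling context.
  FunctionSamples *ContextFS = Tracker.getContextSamplesFor(DIL);
  if (ContextFS && shouldInline(CB, *ContextFS)) {
    // Inlined: exclude this context when the callee's base profile is
    // synthesized later.
    Tracker.markContextSamplesInlined(ContextFS);
    inlineCallSite(CB);
  } else {
    // Not inlined: promote the callee's context subtree to top level,
    // merging into any existing top-level profile for the callee.
    Tracker.promoteMergeContextSamplesTree(CB, CalleeName);
  }
}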

Implementation

Implementation-wise, SampleContext is created as an abstraction for context - currently it uses a string of the call path as its internal representation. Each SampleContext also has a ContextState indicating whether it is a raw context profile from the input, whether it has been inlined or merged, and whether it is a synthetic profile created by the compiler. Each FunctionSamples now has a SampleContext that tells whether it is a base profile or a context profile, and, for a context profile, what the context and state are.
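As a rough illustration of this abstraction, here is a simplified sketch; apart from the SampleContext and ContextState names taken from the description above, the state values and fields are assumptions rather than the exact definitions in SampleProf.h.

#include <cstdint>
#include "llvm/ADT/StringRef.h"

// Assumed state mask for illustration; states are not mutually exclusive.
enum ContextStateMask : uint32_t {
  UnknownContext   = 0x0, // profile without context (regular AutoFDO)
  RawContext       = 0x1, // raw context profile read from the input
  SyntheticContext = 0x2, // compiler-created context (e.g. after promotion)
  InlinedContext   = 0x4, // context profile that has been inlined
  MergedContext    = 0x8  // profile merged from multiple contexts
};

class SampleContext {
  llvm::StringRef FullContext; // e.g. "main:3 @ _Z5funcAi:1 @ _Z8funcLeafi"
  uint32_t State;              // bitmask of ContextStateMask values
public:
  bool hasContext() const { return State != UnknownContext; }
  llvm::StringRef getNameWithContext() const { return FullContext; }
};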

On top of the above context representation, a custom trie is implemented to track and manage context profiles. Specifically, SampleContextTracker encapsulates a trie with ContextTrieNode as the node type. Each node of the trie represents a frame in a calling context, so the path from the root to a node represents a valid calling context. We also track FunctionSamples for each node, so the trie can serve efficient queries for context profiles. Accordingly, context profile tree promotion becomes moving a subtree to be directly under the root of the entire tree, merging nodes of the subtree when the move encounters existing nodes.
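A minimal sketch of what such a trie node could look like follows; the field names and the child-map key here are assumptions for illustration, not the exact upstream layout.

#include <cstdint>
#include <map>
#include "llvm/ADT/StringRef.h"
#include "llvm/ProfileData/SampleProf.h" // FunctionSamples

// Illustrative trie node: one frame of a calling context. The path from the
// root to this node spells out the full context, and promotion is moving a
// node (with its subtree) to become a direct child of the root.
struct ContextTrieNode {
  llvm::StringRef FuncName;                                 // function for this frame
  llvm::sampleprof::FunctionSamples *FuncSamples = nullptr; // context profile, if any
  ContextTrieNode *Parent = nullptr;
  std::map<uint64_t, ContextTrieNode> Children;             // keyed by a call-site hash
};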

Integration

SampleContextTracker is now also integrated with AutoFDO, SampleProfileReader and SampleProfileLoader. When we detect that the input profile contains context-sensitive profiles, SampleContextTracker will be used to track profiles, and all profile queries will automatically go to SampleContextTracker instead of SampleProfileReader. The tracking APIs are called automatically for each inline decision from SampleProfileLoader.
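For example, the query path can be pictured roughly as follows; this is a hedged sketch with assumed signatures and an assumed ProfileIsCS flag, not the actual wiring in SampleProfile.cpp.

#include "llvm/IR/Function.h"
#include "llvm/ProfileData/SampleProf.h"
#include "llvm/ProfileData/SampleProfReader.h"
#include "llvm/Transforms/IPO/SampleContextTracker.h"

using namespace llvm;
using namespace llvm::sampleprof;

// Sketch: dispatch a per-function profile query to the context tracker when
// context-sensitive profile was detected, otherwise fall back to the reader.
FunctionSamples *querySamples(const Function &F, bool ProfileIsCS,
                              SampleProfileReader &Reader,
                              SampleContextTracker &Tracker) {
  if (ProfileIsCS)
    return Tracker.getBaseSamplesFor(F); // merged base profile across contexts
  return Reader.getSamplesFor(F);        // existing AutoFDO path
}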

Diff Detail

Unit Tests: Failed

Event Timeline

wenlei created this revision.Oct 25 2020, 1:02 PM
Herald added a project: Restricted Project.Oct 25 2020, 1:02 PM
wenlei requested review of this revision.Oct 25 2020, 1:02 PM
wenlei edited the summary of this revision.Oct 25 2020, 1:07 PM
wenlei added reviewers: wmi, hoy, davidxl.
wenlei added a subscriber: wlei.
wenlei updated this revision to Diff 300554.Oct 25 2020, 1:15 PM

format, separate change in ControlHeightReduction, remove remaining internal markers.

davidxl added inline comments.Oct 29 2020, 2:33 PM
llvm/include/llvm/ProfileData/SampleProf.h
347

Since these states are not mutually exclusive, perhaps name it ContextStateMask?

364

document this method.

377

Document the context string format here.

409

what is the input format? document it here.

527

CSFDO ==> CSSPGO

llvm/lib/ProfileData/SampleProfReader.cpp
225–233

so FName is also the context String?

davidxl added inline comments.Oct 29 2020, 2:33 PM
llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
95

Document the public APIs.

wmi added a comment.Oct 29 2020, 6:13 PM

Thanks for the very helpful description!

llvm/include/llvm/ProfileData/SampleProf.h
527–529

It means CSSPGO will treat all the new lines as cold, even if some of them may be inferred from other parts of the profile. How much extra size is needed if zero is emitted?

llvm/lib/Transforms/IPO/SampleContextTracker.cpp
247

I don't understand what the top level means here. Better document it.

Do we cache the base profile somewhere, or do we merge it every time?

369

Do we need to call getCanonicalFnName here to make the name in inline stack canonical so we can match the name in inline stack with the name in context?

llvm/lib/Transforms/IPO/SampleProfile.cpp
1907–1911

Does this mean no profile loading at all, or just no CS profile? The ThinLTO thin-link phase needs to know which functions are hot so it can import them, so profile information is needed in the ThinLTO prelink.

llvm/test/Transforms/SampleProfile/Inputs/profile-context-tracker.prof
28

Here "main" doesn't show up in the context. Is it a problem of unwinding or debug info?

llvm/test/Transforms/SampleProfile/profile-context-tracker-debug.ll
2–4

Is there anything which cannot be tested in profile-context-tracker.ll? The debug message is usually used as a last resort if something cannot be fully tested by just checking the IR.

wmi added a comment.Oct 30 2020, 4:22 PM

Another question. Have you ever evaluated the performance by comparing this patch ("CSSPGO profile working with the existing AFDO pipeline") with the default SPGO? I understand CSSPGO's benefit cannot be fully realized without the CGSCC inliner change. The intention of the comparison is to understand how well the existing AFDO pipeline works with CSSPGO profiles. It may expose problems in the existing SPGO profile or in CSSPGO.

wenlei updated this revision to Diff 302294.Nov 2 2020, 8:38 AM

address feedback from Wei and David

wenlei marked 2 inline comments as done.Nov 2 2020, 10:02 AM

Thanks for the quick review! I've updated the patch; also see replies inline.

llvm/include/llvm/ProfileData/SampleProf.h
347

Renamed.

364

Done.

377

Added header comment to SampleContext.

409

Added comment.

527–529

Knowing that CS profiles will be much bigger, we started with trimming zero counts to save as much size as we can. But I don't actually have the data at hand. Let me see if I can get some data on this.

New lines will be less of a problem for pseudo-probe if they don't change CFG.

llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
95

Done.

llvm/lib/ProfileData/SampleProfReader.cpp
225–233

Yes, SampleContext can take both a full context string (wrapped with []) and a context-less function name, and it will set its internal state accordingly. I've updated the header comment for SampleContext with details.

// Example of full context string (note the wrapping `[]`):
//    `[main:3 @ _Z5funcAi:1 @ _Z8funcLeafi]`
// Example of context-less function name (same as AutoFDO):
//    `_Z8funcLeafi`
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
247

Top-level means it's directly under the root node - the path from root to the node is empty, hence no context. I added a comment here as well as in the header comment of SampleContextTracker.

If getBaseSamplesFor is called for the same function again, we'll retrieve the existing top-level node from the last call (it must exist, with context profiles already merged into it), then iterate over FuncToCtxtProfileSet[Name], but it won't do anything since all context profiles have been merged (see the check on line 265). So it's somewhat like caching. Currently getBaseSamplesFor is only called once for each function from the sample profile loader.

369

Good catch, I think we need to canonicalize CalleeName, which is the leaf. (The names of the middle inline frames should be fine as they're from debug metadata, which is not modified when suffixes are appended for symbol promotion, etc.)

I guess we need to add getCanonicalFnName for SampleProfileLoader::findCalleeFunctionSamples. IIUC, we need it there for today's FDO too?

llvm/lib/Transforms/IPO/SampleProfile.cpp
1907–1911

Oops, this change shouldn't be here anymore - I forgot to remove it when upstreaming. And you're right, we need to load profiles for ThinLTO so thin-link importing can be profile guided. Thanks for catching this.

llvm/test/Transforms/SampleProfile/Inputs/profile-context-tracker.prof
28

This is an artificial context to simulate the case where funcB is also called from functions external to the current module (the compile-time profile loader's case), to check that we merge contexts involving external callers correctly. A real profile for this case doesn't have a problem capturing the correct context.

llvm/test/Transforms/SampleProfile/profile-context-tracker-debug.ll
2–4

I added this hoping to make it easier to reason about the internal operations/state of the context tracker, and also to catch any unintended subtle change in context tracking. But if we look at the end result in IR, the non-debug test should cover it just as well. I can remove this one if you think that's better.

wenlei updated this revision to Diff 302327.Nov 2 2020, 10:04 AM
wenlei marked an inline comment as done.

Fix typo and linter, add getCanonicalFnName.

In D90125#2365948, @wmi wrote:

Another question. Have you ever evaluated the performance by comparing this patch ("CSSPGO profile working with the existing AFDO pipeline") with the default SPGO? I understand CSSPGO's benefit cannot be fully realized without the CGSCC inliner change. The intention of the comparison is to understand how well the existing AFDO pipeline works with CSSPGO profiles. It may expose problems in the existing SPGO profile or in CSSPGO.

Yeah, I've tried that initially on SPEC. It showed some perf win; however, the problem with the inliner (actually the sample loader inliner, not the CGSCC inliner) is very visible in a few cases. It's mostly because today's sample loader inliner is a replay inliner, so it won't be more aggressive than the CGSCC inliner from the previous build, hence a simple hotness heuristic works. But with CSSPGO profiles, it's unbounded on the hot path (as long as it's hot, there's no check on the inlinee's size or inline cost), which can lead to size bloat and perf regressions in some cases.

I have an upcoming change to make the sample loader inliner a priority-based inliner with size and cost checks. And it should work for today's AFDO as well - if we treat all call sites to inline with equal priority and set the inline limit to infinite, it should be a no-op change for AFDO, but perhaps it can be tuned to benefit AFDO later too.

wmi added a comment.Nov 3 2020, 9:50 AM
In D90125#2365948, @wmi wrote:

Another question. Have you ever evaluated the performance by comparing this patch ("CSSPGO profile working with the existing AFDO pipeline") with the default SPGO? I understand CSSPGO's benefit cannot be fully realized without the CGSCC inliner change. The intention of the comparison is to understand how well the existing AFDO pipeline works with CSSPGO profiles. It may expose problems in the existing SPGO profile or in CSSPGO.

Yeah, I've tried that initially on SPEC. It showed some perf win; however, the problem with the inliner (actually the sample loader inliner, not the CGSCC inliner) is very visible in a few cases. It's mostly because today's sample loader inliner is a replay inliner, so it won't be more aggressive than the CGSCC inliner from the previous build, hence a simple hotness heuristic works. But with CSSPGO profiles, it's unbounded on the hot path (as long as it's hot, there's no check on the inlinee's size or inline cost), which can lead to size bloat and perf regressions in some cases.

I see. Thanks. CSSPGO profiles are currently oblivious to inlining/outlining, unlike the current SPGO profile, which has the concept of inline/outline instances. So CSSPGO profiles cannot replace SPGO to drive the current early inliner (which is oblivious to inline size and used mainly to maximize profile matching).

I have an upcoming change to make the sample loader inliner a priority-based inliner with size and cost checks. And it should work for today's AFDO as well - if we treat all call sites to inline with equal priority and set the inline limit to infinite, it should be a no-op change for AFDO, but perhaps it can be tuned to benefit AFDO later too.

If CSSPGO has the priority-based inliner, does it mean it will do most of the inlining work and the CGSCC inliner will mostly be used as an iterative clean-up pass?

If CSSPGO has the priority-based inliner, does it mean it will do most of the inlining work and the CGSCC inliner will mostly be used as an iterative clean-up pass?

Yeah, we want the early inliner to take over more inlining, for the full benefit of top-down specialization as well as accurate post-inline profiles. CSSPGO + priority-based top-down inline now leads to more inlining being shifted from CGSCC inline to early inline, but currently most of the inlining is still done by CGSCC inline (by number of inline sites). This is subject to tuning and something we're looking into. We want to make sure at least all hot inlining is covered by the early top-down inliner.

wmi added inline comments.Nov 3 2020, 10:13 AM
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
369

In the context, there are also levels contributed by stack unwinding. Those frames should have the same names as ELF symbols. To be consistent, do we want to apply getCanonicalFnName to all the context levels?

I guess we need to add getCanonicalFnName for SampleProfileLoader::findCalleeFunctionSamples. IIUC, we need it there for today's FDO too?

Agree. Today it may not be needed because most suffixes are appended after inlining, so like you said, the names of the inline frames from debug metadata don't contain the suffixes. But there are now suffixes being added before inlining (https://reviews.llvm.org/D89617) and there may be others in the future. It is good to always apply getCanonicalFnName.

llvm/test/Transforms/SampleProfile/profile-context-tracker-debug.ll
2–4

If there could be unintended subtle change which cannot be caught by the non-debug test, we can keep it. Just make sure the debug messages used in CHECK are all necessary in terms of ensuring the result we are expecting to see.

wenlei added inline comments.Nov 4 2020, 11:47 PM
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
369

In the context, there are also levels contributed by stack unwinding. Those frames should have the same names as ELF symbols. To be consistent, do we want to apply getCanonicalFnName to all the context levels?

Good point. In this case, I think it's better to canonicalize all names during profile generation, though. IIRC AutoFDO gets names from DWARF, hence it does not have the suffixes (as if they were canonicalized). So doing canonicalization during profile generation would make it consistent with AutoFDO.

davidxl added inline comments.Nov 10 2020, 12:56 PM
llvm/include/llvm/ProfileData/SampleProf.h
361

The syntax of the context string can probably be made more consistent, like:

SampleContext  : LeafContext
               : ParentContext LeafContext
LeafContext    : function_name
ParentContext  : [ParentFrames]
ParentFrames   : OneParentFrame
               : ParentFrames OneParentFrame
OneParentFrame : function_name:line @

So in your example, the full context string should look like:

[main:3 @_Z5funcAi:1 @] _Z8funcLeafi

wenlei added inline comments.Nov 10 2020, 1:10 PM
llvm/include/llvm/ProfileData/SampleProf.h
361

Agreed, that's indeed more consistent and I like that better. Thanks for the suggestion. Will make the change (llvm-profgen change will follow too).

wenlei added inline comments.Nov 10 2020, 6:17 PM
llvm/include/llvm/ProfileData/SampleProf.h
361

Actually, there are other implications if we were to change the context string to be [context] leaf. With the current syntax, when we promote a context, the new context string is a substring of the original one. So we can just create a StringRef wrapper for context promotion without creating new strings.

E.g. when main:3 @ _Z5funcAi:1 @ _Z8funcLeafi is promoted, it becomes _Z5funcAi:1 @ _Z8funcLeafi, which is a substring of the original one (a StringRef that reuses the underlying string).

If we use [main:3 @_Z5funcAi:1 @] _Z8funcLeafi, promotion will lead to something like [_Z5funcAi:1 @] _Z8funcLeafi, which is no longer a substring, so a new string needs to be created. We use two StringRefs to represent the context part and the leaf part today, but we also need a consistent string representation for the full context (getNameWithContext).

So practically, current syntax is more efficient for context promotion.
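(For illustration only - a minimal sketch, with assumed names, of why promotion is just a substring operation under the current syntax:)

#include "llvm/ADT/StringRef.h"

// Drop the top caller frame; the result aliases the original storage.
llvm::StringRef promoteContext(llvm::StringRef FullContext) {
  size_t Sep = FullContext.find(" @ ");
  if (Sep == llvm::StringRef::npos)
    return FullContext; // already a top-level context, nothing to drop
  // e.g. "main:3 @ _Z5funcAi:1 @ _Z8funcLeafi" -> "_Z5funcAi:1 @ _Z8funcLeafi"
  return FullContext.substr(Sep + 3);
}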

Additionally, with the proposed syntax, top level context would look like this [] main to differentiate from context-less header. In this case, [main] is probably better.

What do you think?

davidxl added inline comments.Nov 10 2020, 6:57 PM
llvm/include/llvm/ProfileData/SampleProf.h
361

Actually, there are other implications if we were to change the context string to be [context] leaf. With the current syntax, when we promote a context, the new context string is a substring of the original one. So we can just create a StringRef wrapper for context promotion without creating new strings.

E.g. when main:3 @ _Z5funcAi:1 @ _Z8funcLeafi is promoted, it becomes _Z5funcAi:1 @ _Z8funcLeafi, which is a substring of the original one (a StringRef that reuses the underlying string).

In this case, where does the bracket go?

If we use [main:3 @_Z5funcAi:1 @] _Z8funcLeafi, promotion will lead to something like [_Z5funcAi:1 @] _Z8funcLeafi, which is no longer a substring, so a new string needs to be created. We use two StringRefs to represent the context part and the leaf part today, but we also need a consistent string representation for the full context (getNameWithContext).

So practically, current syntax is more efficient for context promotion.

Additionally, with the proposed syntax, top level context would look like this [] main to differentiate from context-less header. In this case, [main] is probably better.

what is the difference between top level context vs context less header?

What do you think?

wenlei added inline comments.Nov 11 2020, 9:08 AM
llvm/include/llvm/ProfileData/SampleProf.h
361

In this case, where does the bracket go?

Internally, we don't use the brackets, and since they're the first and last characters, we just create a StringRef with the brackets removed for SampleContext, without creating a new string.

So in the profile file, we have [main:3 @ _Z5funcAi:1 @ _Z8funcLeafi], while in the SampleContext used by LLVM and tools, we have main:3 @ _Z5funcAi:1 @ _Z8funcLeafi. The form used by SampleContext is easier for context promotion, and the conversion from the [] form to SampleContext is just a StringRef wrapper.

what is the difference between top level context vs context less header?

They have different meanings. Consider a DSO: a context-less profile foo means we just don't know the calling context, while a context profile [foo] means foo is called directly from an external function.

In practice, we don't have context-less profiles and context profiles in a single profile file now, so it's also about consistency among context profiles and enough differentiation between context profiles and context-less profiles (i.e. reserving the form main *only* for context-less profiles in today's AFDO).

davidxl added inline comments.Nov 11 2020, 12:17 PM
llvm/include/llvm/ProfileData/SampleProf.h
361

In this case, where does the bracket go?

Internally, we don't use the brackets, and since they're the first and last characters, we just create a StringRef with the brackets removed for SampleContext, without creating a new string.

So in the profile file, we have [main:3 @ _Z5funcAi:1 @ _Z8funcLeafi], while in the SampleContext used by LLVM and tools, we have main:3 @ _Z5funcAi:1 @ _Z8funcLeafi. The form used by SampleContext is easier for context promotion, and the conversion from the [] form to SampleContext is just a StringRef wrapper.

That is what I was thinking -- for external format, we need to make it well defined and as consistent as possible, while for internal representation, any format is fine. Here is the question:

  1. Is there a need to share the external and internal rep? Once the external strings are parsed they can be discarded.
  2. For the internal format, is there a more compact form? Is there a need to use string?

what is the difference between top level context vs context less header?

They have different meanings. Consider a DSO: a context-less profile foo means we just don't know the calling context, while a context profile [foo] means foo is called directly from an external function.

In practice, we don't have context-less profiles and context profiles in a single profile file now, so it's also about consistency among context profiles and enough differentiation between context profiles and context-less profiles (i.e. reserving the form main *only* for context-less profiles in today's AFDO).

wenlei added inline comments.Nov 11 2020, 12:58 PM
llvm/include/llvm/ProfileData/SampleProf.h
361

for internal format, is there more compact form ? Is there a need to use string ?

This is something we thought about too. We were thinking about something along the lines of a rolling hash (an integer encoding that is friendly to the context promotion operation) to eliminate StringRef. But actually, with the current implementation, it's quite compact already because it's always a StringRef and we never need to create any new string (the order of the context syntax, root to leaf, was also intentional, to make a promoted context a substring of the original context). So we're happy with it.

is there a need to share external and internal rep? Once the external strings are parsed they can be discarded

Sharing between internal and external format isn't must have, but I think it's nice to have if cost is minimal.

External strings can be discarded but that would require non-trivial framework changes that affect AFDO too. Currently, for AFDO (and CSSPGO) we keep the profile file in a memory buffer, and all external strings are StringRefs, used throughout SampleProfileReader and hence SampleProfileLoader, assuming the underlying strings are always available. E.g. FunctionSamples::Name is a StringRef wrapped around an external string from the memory buffer of the input profile. CSSPGO uses the same framework, and all of that would need to be changed if we want to discard external strings (essentially freeing the memory buffer after loading the profile).

This (freeing the memory buffer after loading profile for both AFDO and CSSPGO) feels like more of an optimization rather than part of the CSSPGO framework, and if we do that optimization later, we could change internal representation accordingly. What do you think?

davidxl added inline comments.Nov 11 2020, 2:15 PM
llvm/include/llvm/ProfileData/SampleProf.h
361

for internal format, is there more compact form ? Is there a need to use string ?

This is something we thought about too. We were thinking about something along the lines of a rolling hash (an integer encoding that is friendly to the context promotion operation) to eliminate StringRef. But actually, with the current implementation, it's quite compact already because it's always a StringRef and we never need to create any new string (the order of the context syntax, root to leaf, was also intentional, to make a promoted context a substring of the original context). So we're happy with it.

A related question: how large is the memory consumption when the raw string is used throughout? By changing the internal format, can it bring down memory usage further for a large profile?

is there a need to share external and internal rep? Once the external strings are parsed they can be discarded

Sharing between internal and external format isn't must have, but I think it's nice to have if cost is minimal.

External strings can be discarded but that would require non-trivial framework changes that affect AFDO too. Currently, for AFDO (and CSSPGO) we keep the profile file in a memory buffer, and all external strings are StringRefs, used throughout SampleProfileReader and hence SampleProfileLoader, assuming the underlying strings are always available. E.g. FunctionSamples::Name is a StringRef wrapped around an external string from the memory buffer of the input profile. CSSPGO uses the same framework, and all of that would need to be changed if we want to discard external strings (essentially freeing the memory buffer after loading the profile).

This (freeing the memory buffer after loading profile for both AFDO and CSSPGO) feels like more of an optimization rather than part of the CSSPGO framework, and if we do that optimization later, we could change internal representation accordingly. What do you think?

Wei can probably chime in -- he has plans to reduce cross binary FDO string table size.

If we tie our implementation too much to string sharing, it makes it less flexible to change in the future. If we can decouple the external/internal representations, it allows us to get the external format right from the beginning.

wenlei added inline comments.Nov 11 2020, 3:30 PM
llvm/include/llvm/ProfileData/SampleProf.h
361

A related question: how large is the memory consumption when the raw string is used throughout? By changing the internal format, can it bring down memory usage further for a large profile?

The extra memory consumption now is just a fixed 12 bytes (a StringRef) per context, not a new string per context. There's of course memory consumption for the memory buffer holding the input profile (and that is the size of the input profile).

If we tie our implementation too much to string sharing, it makes it less flexible to change in the future. If we can decouple the external/internal representations, it allows us to get the external format right from the beginning.

I see your point. But I think since this is only about the internal representation, we always have the flexibility to change it in the future (there's no compatibility issue, and the coupling isn't invasive, neither is the change). With the current framework, we thought a string representation with StringRef fits nicely. If in the future we want to free the memory buffer after profile loading, I'd be happy to take care of this part.

What we're doing here is consistent with AFDO today, which also shares strings from the memory buffer for internal FunctionSamples. In that sense, I don't think this needs to be treated differently from AFDO. It's also not easy to decouple cleanly now, with AFDO and common things like FunctionSamples still using strings from the memory buffer.

Given Wei's feedback, I am ok with the current design now, including the external sample format.

wenlei added inline comments.Nov 19 2020, 2:59 PM
llvm/include/llvm/ProfileData/SampleProf.h
527–529

I took a look at the current profile generation tool. It requires some extra work to fill in zeros for unsampled lines for CSSPGO. I collected AutoFDO profiles for mysql with and without zeros filled; here's the size difference. CSSPGO is likely to see a similar or bigger relative size difference (profiles for contexts can be more sparse).

w/ zero filled for unsampled lines: 3.9M
w/o zero filled for unsampled lines: 1.4M

wenlei marked 16 inline comments as done.Nov 20 2020, 9:53 AM

Thanks for reviewing and for the feedback. I think I've addressed all of the current comments. Please take a look, thanks!

llvm/lib/Transforms/IPO/SampleContextTracker.cpp
369

Name canonicalization is now done in llvm-profgen (https://reviews.llvm.org/D89723).

llvm/test/Transforms/SampleProfile/profile-context-tracker-debug.ll
2–4

Ok, I'll keep it then. We want to make sure context tracker is doing exactly what it has to do (and checking on inlining alone may not be strong enough).

wmi added inline comments.Nov 30 2020, 11:25 PM
llvm/include/llvm/Transforms/IPO/SampleContextTracker.h
71

Rename it to ChildContexts or AllChildContext?

llvm/lib/Transforms/IPO/SampleContextTracker.cpp
62–66

Make param AllowCreate default to true so we don't need this wrapper?

72

Add an assert message.

270

Add an assertion message.

284

Add an assertion message.

358

Add an assertion message.

374

A question: the context in SampleContextTracker includes not only the inline stack but also the call stack. The S vector below only contains the inline stack at the DIL location. How can it match the full stack starting from RootContext?

438

Add an assertion message.

447

Add an assertion message.

454

Add assertion message.

463

Use OldCallSiteLoc instead?

474–491

It will be slightly easier to read if the block can be extracted to a function.

wenlei marked 11 inline comments as done.Dec 1 2020, 11:34 PM
wenlei added inline comments.
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
62–66

Good catch, changed.

374

When we decide not to inline a call site, the context profile will be promoted to root, so what remains in the context tracker should reflect the accurate remainder context profile. E.g. if we start with A->(call) B->(inline) C in the context tracker, and at some point we're looking at B->C from the DIL, there are two scenarios:

  1. If A inlined B, we wouldn't be able to match B->C from the DIL to anything in the context tracker. But this is intentional and desired, because the remainder/base profile for B, or the context profile B->C, shouldn't have anything if the A->B inline happened.
  2. If A did not inline B, then B->C should be moved/promoted from being a child of A to be under root. Then we would be able to match B->C from the DIL to B->C (under root) in the context tracker.
wenlei updated this revision to Diff 308881.Dec 1 2020, 11:40 PM
wenlei marked an inline comment as done.

Address Wei's feedback.

wmi added inline comments.Dec 2 2020, 12:34 PM
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
374

I see, thanks. After the compiler decides it won't inline at some call site, the profile for the call site will be promoted and some context information will be lost. This seems to assume the inlining happens in top-down order and happens only once. I remember the CSSPGO profile will be used to drive the CGSCC inliner in the future. The CGSCC inliner needs to do the inlining iteratively, so how is it supposed to work with profile promotion?

wenlei added inline comments.Dec 2 2020, 1:38 PM
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
374

You're right that it currently assumes top-down order - that is the best way to leverage context sensitive profile. If we try to use CSSPGO profile to drive SCC inline, bottom-up order and iterative nature are two key differences.

The bottom-up inlining means we can't promote context profiles by moving them to be under root; instead, we will need to copy (and merge) context profiles into the base profile under root. For the same example A->B->C, with SCC inline, we could end up processing B before A. When processing B, we promote the not-inlined context profiles of B to be under root (B->C) and merge them together into a base profile of B. However, we still need to keep the original context profile tree (A->B->C) so later when we process A, we will still see B and C under A.

Actually the promotion happens when we try to access a function's base profile (getBaseSamplesFor calls promoteMergeContextSamplesTree for each not inlined context profile), so the difference between top-down and bottom-up inline is more about accuracy - with bottom-up inline, when getting base profile for B, we'd assume none of B's call sites is inlined even if later A inlines B.

For iterative inlining, we can call getBaseSamplesFor every time we process a function again to redo the promotion and merge based on the up-to-date inline decisions. E.g. if we process B then A (which inlines B), then B again, the second time we process B, we would not merge the B under A into B's current base profile, which makes the profile more accurate than in the first pass over B. (But it's still not as good as top-down inline because even if we can unmerge context profiles, we can't undo inlining.)

wmi added inline comments.Dec 2 2020, 3:09 PM
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
374

Thanks for the detailed explanation. That makes sense to me.

Talking about the profile promotion and merging: if a function is still hot after the inlining of all its call sites has been decided, and if it still has different profiles under different contexts, it may be interesting to clone the function so we can still apply the context-sensitive profiles per group. It will be interesting to have some support to compare profiles under different contexts and split them into groups. I feel the full context-sensitive profile opens up some new opportunities we can explore by maximizing its usage in the future.

wmi accepted this revision.Dec 2 2020, 3:09 PM

LGTM.

This revision is now accepted and ready to land.Dec 2 2020, 3:09 PM
wenlei added inline comments.Dec 2 2020, 6:53 PM
llvm/lib/Transforms/IPO/SampleContextTracker.cpp
374

Yeah, that's a good point. We're also thinking about cloning, as it's an area where clang is still behind gcc.

I think it will take some time before we fully leverage the new opportunities. I will send another patch for priority-based top-down inlining with CSSPGO; with that, more inlining will be done during the early top-down inline, but it will take more effort to rebalance inlining between the sample loader and the CGSCC inliner.

This revision was landed with ongoing or failed builds.Dec 6 2020, 12:12 PM
This revision was automatically updated to reflect the committed changes.