This is an archive of the discontinued LLVM Phabricator instance.

[CSSPGO][llvm-profgen] Missing frame inference.
ClosedPublic

Authored by hoy on Dec 5 2022, 1:56 PM.

Details

Summary

This change introduces a missing frame inferrer that aims to recover missing frames. It currently only handles frames missing due to compiler tail call elimination (TCE), but it could be extended to support other scenarios such as frame pointer omission. When a tail-called function is sampled, the caller frame is missing from the call chain because it has been reused for the callee frame. While TCE benefits both performance and stack usage, the workaround in this change tries to recover as many of the missing frames as possible.

The idea behind this work is to build a dynamic call graph consisting only of tail call edges constructed from LBR samples, and to DFS-search the graph for a unique path between a given source frame and target frame. The unique path is then used to fill in the missing frames between the source and the target. Note that only a unique path counts: multiple paths are treated as unreachable, since we don't want to overcount any particular possible path.
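For illustration, a minimal sketch of such a unique-path search (hypothetical names and simplified data structures; not the actual llvm-profgen implementation):

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Dynamic call graph built from LBR samples: each edge is an observed
// tail call, keyed by the source address with a set of target addresses.
using TailCallGraph =
    std::unordered_map<uint64_t, std::unordered_set<uint64_t>>;

// DFS from `From` towards `To`, counting distinct paths but bailing out as
// soon as more than one is found. `Path` accumulates the frames of the
// unique path (ending with `To`) when exactly one exists.
static int countPaths(const TailCallGraph &G, uint64_t From, uint64_t To,
                      std::unordered_set<uint64_t> &Visiting,
                      std::vector<uint64_t> &Path) {
  if (From == To)
    return 1;
  if (!Visiting.insert(From).second)
    return 0; // Cycle on the current search stack; treat as no path.
  int NumPaths = 0;
  if (auto It = G.find(From); It != G.end()) {
    for (uint64_t Next : It->second) {
      Path.push_back(Next);
      int Sub = countPaths(G, Next, To, Visiting, Path);
      NumPaths += Sub;
      if (NumPaths > 1)
        break; // Ambiguous; the caller discards Path anyway.
      if (Sub == 0)
        Path.pop_back(); // Dead end; undo.
    }
  }
  Visiting.erase(From);
  return NumPaths;
}

// Returns true and fills `Out` with the inferred frames only when exactly
// one tail-call path connects Caller to Callee.
static bool inferMissingFrames(const TailCallGraph &G, uint64_t Caller,
                               uint64_t Callee, std::vector<uint64_t> &Out) {
  std::unordered_set<uint64_t> Visiting;
  std::vector<uint64_t> Path;
  if (countPaths(G, Caller, Callee, Visiting, Path) != 1)
    return false;
  if (!Path.empty())
    Path.pop_back(); // Drop the callee itself; keep only the missing frames.
  Out = std::move(Path);
  return true;
}
```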

A switch --infer-missing-frame is introduced and defaults to on.
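Such a switch is typically declared through LLVM's command-line library roughly as below (an illustrative sketch; the exact description string and placement in the patch may differ):

```cpp
#include "llvm/Support/CommandLine.h"

// Hypothetical declaration of the new switch; it defaults to on.
static llvm::cl::opt<bool> InferMissingFrames(
    "infer-missing-frame", llvm::cl::init(true),
    llvm::cl::desc(
        "Infer missing call frames removed by tail call elimination."));
```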

Some testing results:

  • 0.4% perf win according to three internal benchmarks.
  • About 2/3 of the missing tail call frames can be recovered, according to an internal benchmark.
  • 10% more profile generation time.

Diff Detail

Event Timeline

hoy created this revision. Dec 5 2022, 1:56 PM
Herald added a project: Restricted Project. Dec 5 2022, 1:56 PM
Herald added subscribers: modimo, wenlei.
hoy requested review of this revision. Dec 5 2022, 1:56 PM
Herald added a project: Restricted Project. Dec 5 2022, 1:56 PM
hoy edited the summary of this revision. Dec 5 2022, 1:57 PM
hoy added reviewers: wenlei, wlei.
wenlei added inline comments. Dec 7 2022, 12:56 AM
llvm/tools/llvm-profgen/ProfileGenerator.cpp
17–18

I didn't see new instances of these containers. Also didn't see these includes being removed from other headers.

781

why is this only done for the probe case?

819

The way this is done not only turns unknown indirect call targets into known ones, but also removes/filters some unsampled indirect calls. I'm wondering whether doing such filtering on direct calls would also help narrow down the search space?

828

either a known a zero

typo? I also don't understand this message.

848

Maybe just my pet peeve, but I always feel a plain, simple description is much less confusing than those ambiguous terms. Through static analysis, indirect call targets are unknown, but with samples, these targets become known. Can we just simply say known/unknown, and avoid materialized/unmaterialized?

871

NDEBUG

llvm/tools/llvm-profgen/ProfiledBinary.cpp
552

Can we assert that Target is always available from evaluateBranch for direct branch?

565

Is MCDesc.isBarrier() there to filter for *unconditional* branches? Make that explicit in the comments below.

llvm/tools/llvm-profgen/ProfiledBinary.h
251–255
  1. Did you choose unordered_multimap<uint64_t, uint64_t> over unordered_map<uint64_t, unordered_set<uint64_t>> to optimize for mostly one target per call? A multimap has multiple copies of the keys too, but a map doesn't. The latter feels a bit cleaner (see the sketch after this list).
  2. The name is a bit confusing because this contains both call sources and targets.
  3. I haven't thought this through carefully, but these data structures seem quite ad-hoc in that they somewhat overlap with both CallAddressSet above and the data in the new tracker. I know that each serves a slightly different purpose, but it just feels a bit unorganized and bolted on the way it is now.
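For reference, the two layouts being compared in point 1 look roughly like this (illustrative declarations only; names are placeholders):

```cpp
#include <cstdint>
#include <unordered_map>
#include <unordered_set>

// Option A: one entry per observed (source, target) edge. The key is
// duplicated when a call site has several targets, but the common
// single-target case avoids allocating an inner container.
std::unordered_multimap<uint64_t, uint64_t> TailCallEdgesA;

// Option B: one entry per call site with a set of targets. No duplicate
// keys or edges, at the cost of an inner set even for single targets.
std::unordered_map<uint64_t, std::unordered_set<uint64_t>> TailCallEdgesB;
```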
480

nit: type& get.. -> type &get..

llvm/tools/llvm-profgen/TailCallTracker.cpp
42–43 ↗(On Diff #480232)

I found the names a bit confusing, how about making these tweaks:
TailCallTargets -> TailCallToTargetMap
TailCallReachableFuncs -> TailCallTargetFuncs
FuncTailCalls -> FuncToTailCallMap

50 ↗(On Diff #480232)

Why use unordered_multimap to allow duplication in the first place? With unordered_map<key_t, unordered_set<value_t>> for both TailCallTargets and FuncTailCalls, there won't be duplicates.

61–63 ↗(On Diff #480232)

FuncTailCalls contains all functions and their associated tail calls, and it's not a per-function map. Why does size 1 mean a single tail call function?

86 ↗(On Diff #480232)

The no-debug macro used should be NDEBUG

115 ↗(On Diff #480232)

The return value can be simplified to use a boolean instead of an int, because we only care whether there's a unique path?

We may need to check for cycles at the caller side, so we avoid the need to return 0.

115 ↗(On Diff #480232)

Do we want to limit the level of recursive search? Practically, it's probably very rare to have 3+ back to back tail calls, and three consecutive missing frames due to that. I'm not sure how much this can help though.

If we open this up to general missing frame inference beyond tail call, such pruning might be useful.

154 ↗(On Diff #480232)

Reachable via multiple paths is considered unreachable

This is just a non-unique-path situation. Why do we need to introduce a notion of "unreachable" when it is actually reachable? This is probably a pure terminology issue, but I don't think I understand the rationale behind "unreachable".

186 ↗(On Diff #480232)

It doesn't seem like we need to return the number of paths here. We bail early when NumPaths > 1, so the number isn't accurate anyway. At most, a boolean indicating whether we found a unique path should be enough.

229 ↗(On Diff #480232)

typo: broken

232 ↗(On Diff #480232)

NDEBUG

llvm/tools/llvm-profgen/TailCallTracker.h
23 ↗(On Diff #480232)

The core improvement is to find a unique path between a caller frame and a callee frame, so we can infer missing frames. You identified cases of missing frames caused by tail calls, so you centered the implementation around tail calls. But recovering missing frames for FPO is actually not that different either.

So what I'm thinking is to have a way to not limit this to just handling tail calls: instead of TailCallContextTracker, make it a generic MissingFrameInferrer. The trade-off is that the search is going to be more expensive and more likely to yield multiple paths. But implementation-wise it should be fairly similar and could be turned on/off via a switch.

We don't have to cover everything now, but I think MissingFrameInferrer could be a better name than TailCallContextTracker, even if we decide to limit it to tail calls for now. This really is an inferrer rather than a tracker. WDYT?

Correspondingly, computeUniqueTailCallPath would become computeUniqueCallPath even if it only deals with tail calls now.

83–85 ↗(On Diff #480232)
  1. Add comments for the two data structures - what the vector and the uint64_t value are.
  2. I'd avoid introducing terms when a more intuitive and plain expression is available. How about just using CallerCalleePair instead of FrameTransition? This also goes better with PairHash.
  3. For NonUniquePaths, do we really care how many paths there are? Would an unordered_set be good enough?
  4. Can we merge UniquePaths and NonUniquePaths, so we don't need two separate lookups? We could use an empty vector to represent a non-unique path (see the sketch after this list).
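A possible consolidated layout for point 4 could look like this (a rough sketch with placeholder names; PairHash here is just one simple way to hash the pair):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <utility>
#include <vector>

// Key: (caller address, callee address) of a sampled transition that may
// have frames missing in between.
using CallerCalleePair = std::pair<uint64_t, uint64_t>;

struct PairHash {
  std::size_t operator()(const CallerCalleePair &P) const {
    return std::hash<uint64_t>()(P.first) ^
           (std::hash<uint64_t>()(P.second) << 1);
  }
};

// Single cache for both outcomes: a non-empty vector holds the unique
// sequence of inferred frame addresses; an empty vector records that the
// pair was already found to have no unique path.
std::unordered_map<CallerCalleePair, std::vector<uint64_t>, PairHash>
    InferredPaths;
```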
89 ↗(On Diff #480232)

NDEBUG

92 ↗(On Diff #480232)

Would be nice to unify terms; you used unique vs. non-unique elsewhere.

hoy marked 5 inline comments as done. Dec 7 2022, 11:00 AM
hoy added inline comments.
llvm/tools/llvm-profgen/ProfileGenerator.cpp
17–18

Good catch, removed.

781

Because the current tracker implementation only applies to the pseudo probe case, where AddrBasedCtxKey is used. The line-number-based case uses StringBasedCtxKey, so there's no need to do refineTailCallTargets for the line number case for now.

819

Good point. I think it should be helpful. Will give it a try.

828

should be "a known or a zero"

848

Yeah, known/unknown sounds good.

871

fixed.

llvm/tools/llvm-profgen/ProfiledBinary.cpp
552

Done.

565

It's to filter out conditional branches. A tail call is an unconditional branch. Comment updated.
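In other words, the check keeps control transfers that cannot fall through. A hedged sketch of how such a classification might look at the MC layer (the actual checks in the patch may differ):

```cpp
#include "llvm/MC/MCInstrDesc.h"

// A candidate tail call: an unconditional branch (a barrier, so no
// fall-through) that is neither a call nor a return, and whose resolved
// target is the entry of another function.
static bool looksLikeTailCall(const llvm::MCInstrDesc &MCDesc,
                              bool TargetIsFuncEntry) {
  return MCDesc.isBranch() && MCDesc.isBarrier() && !MCDesc.isCall() &&
         !MCDesc.isReturn() && TargetIsFuncEntry;
}
```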

llvm/tools/llvm-profgen/ProfiledBinary.h
251–255

Did you choose unordered_multimap<uint64_t, uint64_t> over unordered_map<uint64_t, unordered_set<uint64_t>> to optimize for mostly one target per call? A multimap has multiple copies of the keys too, but a map doesn't. The latter feels a bit cleaner.

Yeah, that's the point. Single targets account for 95%, and with a multimap we don't have to construct a container for each of those targets. Duplications are handled in refineTailCallTargets.

The name is a bit confusing because this contains both call sources and targets.

How about CallPairs or just CallEdges?

I haven't thought this through carefully, but these data structures seem quite ad-hoc in that they somewhat overlap with both CallAddressSet above and the data in the new tracker. I know that each serves a slightly different purpose, but it just feels a bit unorganized and bolted on the way it is now.

Yeah, they are separated mostly because of efficiency. An unordered_map<uint64_t, unordered_set<uint64_t>> would unify them, but at the expense of efficiency.

llvm/tools/llvm-profgen/TailCallTracker.cpp
42–43 ↗(On Diff #480232)

Sounds good.

50 ↗(On Diff #480232)

It's about efficiency. 95% of tail calls have a single target, so using an unordered_set for them seems unnecessary. WDYT?

61–63 ↗(On Diff #480232)

Size 1 means there is only one tail call in the program. That in turn means there's only one function with a single tail call.

115 ↗(On Diff #480232)

The return value can be simplified to use a boolean instead of an int, because we only care whether there's a unique path?

Hmm, it currently returns three kinds of values: 0, 1 and multiple. I'm not seeing how a boolean could make the logic simpler.

Do we want to limit the level of recursive search? Practically, it's probably very rare to have 3+ back to back tail calls, and three consecutive missing frames due to that. I'm not sure how much this can help though.

Having a limit on the search level makes sense, but I'm not sure what a good value there is. It's not uncommon to have a tail call chain with more than 3 frames, for example a chain consisting of one-liner getter helpers. I can get some numbers and see.

154 ↗(On Diff #480232)

The comment is confusing. "Unreachable" has its literal meaning in the algorithm; being reachable via multiple paths has a different meaning. I just simplified the comment to "Stop analyzing the remaining if we are already seeing more than one reachable paths."

186 ↗(On Diff #480232)

Not sure a boolean is enough. E.g., it wouldn't differentiate the following two cases when inferring missing frames between A -> F:

  • Case 1 with actual paths:
A -> B -> C -> F

A -> B -> D -> F

A -> E -> F

The integer return value can tell us to stop analyzing the path A -> E -> F because there are already multiple available paths. Returning 0 or false would not tell this apart from the case below.

  • Case 2 with actual paths:
A -> B -> C 

A -> E -> F
llvm/tools/llvm-profgen/TailCallTracker.h
23 ↗(On Diff #480232)

Yeah, renaming TailCallContextTracker as MissingFrameInferrer sounds good.

Correspondingly, computeUniqueTailCallPath would become computeUniqueCallPath even if it only deals with tail calls now.

I still like computeUniqueTailCallPath since what it really does is compute a unique tail call path. Moving forward, we could extend inferMissingFrames to make an extra call to another function, say computeUniqueFPOPath or something?

83–85 ↗(On Diff #480232)

CallerCalleePair sounds good.

For NonUniquePaths, do we really care how many paths there are? Would an unordered_set be good enough?

We don't really care. unordered_set should be good enough.

Can we merge UniquePaths and NonUniquePaths, so we don't need two separate lookups? We could use an empty vector to represent a non-unique path?

Constructing an empty vector also takes extra memory and time, but is probably better than an extra hash lookup?

hoy updated this revision to Diff 480985. Dec 7 2022, 11:00 AM
hoy edited the summary of this revision.

Addressing feedback.

Thanks for the changes and addressing the feedback. Can you also rename all the instances of "tail call tracker" accordingly, in the change description as well as the title? And to be consistent, use simple terms instead of "materialized" in the change description too?

llvm/tools/llvm-profgen/ProfileGenerator.cpp
828

maybe "a known or a zero (unknown)"?

llvm/tools/llvm-profgen/ProfiledBinary.h
251–255

There's also duplication between ProfiledBinary members and MissingFrameInferrer members.

llvm/tools/llvm-profgen/TailCallTracker.cpp
50 ↗(On Diff #480232)

How much does it help performance? The dedup also has a cost. I think the use of multimap has a readability cost in quite a few places (makes things more complicated); one example is the sliding window you used in order to count tail calls. If the perf cost isn't too much, I'd favor simplicity and readability.

60 ↗(On Diff #480232)

typo: mutiple -> multiple

61–63 ↗(On Diff #480232)

Ok, I think the confusion comes from the multimap, where one element means not only one function, but also one tail call in the program. And this special case is needed for the sliding window below to work.

How much does using a multimap help performance? I think readability will improve if you use a map of sets instead.

115 ↗(On Diff #480232)

Ok, I thought that returning 0 was only for cycles. But now I see that 0 is also possible for unreachable cases. Yeah, we need a tri-state here. Maybe use a proper tri-state return value? The thing is, the returned value also isn't really NumPaths because of the early bail on NumPaths > 1.

I'm fine with what it is now too.
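If a named tri-state were preferred, it could be as simple as the following (illustrative only; not part of the patch):

```cpp
// Hypothetical tri-state result for the path search between two frames.
enum class PathSearchResult {
  Unreachable, // No sampled tail-call path connects the two frames.
  Unique,      // Exactly one path; safe to fill in the missing frames.
  Ambiguous    // More than one path; do not infer anything.
};
```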

154 ↗(On Diff #480232)

The updated comment looks good to me. And I think I understand what you're doing here, but I'm curious - what is the literal meaning of unreachable in the algorithm you mentioned? I understand that we want to stop processing upon multiple paths, but stopping processing is different from unreachable, even though unreachable will lead to stopped processing for sure.

186 ↗(On Diff #480232)

I see the need for a tri-state, but this one is a bit different from the return value of computeUniqueTailCallPath. Here, what I really meant is that you didn't use the return value of inferMissingFrames anywhere, or did I miss anything? I think that having a boolean to indicate whether we successfully inferred something could be useful even if we aren't using it now.

llvm/tools/llvm-profgen/TailCallTracker.h
23 ↗(On Diff #480232)

If we extend it in the future, I think it won't be a separate function, because the tweak is going to be about what is considered a viable edge -- currently it's tail calls only, but later it can be more. I don't think we need a completely separate DFS to accommodate other edges in the future. That's why I think having one general DFS function is reasonable, and the name can reflect that. But this isn't critical.

83–85 ↗(On Diff #480232)

Constructing an empty vector also takes extra memory and time, but is probably better than an extra hash lookup?

I'm not sure if this is perf critical. But I was thinking more about better structure, consolidation and readability.

hoy marked 2 inline comments as done. Dec 7 2022, 1:18 PM
hoy added inline comments.
llvm/tools/llvm-profgen/ProfiledBinary.h
251–255

The MissingFrameInferrer members are mostly BinaryFunction-based while the ProfiledBinary members are address-based. The separation is to reduce disassembling time by avoiding the address-to-BinaryFunction lookup there.

llvm/tools/llvm-profgen/TailCallTracker.cpp
50 ↗(On Diff #480232)

It's hard to measure without implementing the non-multimap version. But I can imagine that besides the extra constructor call and the extra memory that the container takes, there's also an extra cost in accessing a single element of the container.

one example is the sliding window you used in order to count tail calls. If the perf cost isn't too much,

This is debug code. I can rewrite it by converting the multimap to non-multimap, if it helps readability.

115 ↗(On Diff #480232)

Having a limit on the search level makes sense, but I'm not sure what a good value there is. It's not uncommon to have a tail call chain with more than 3 frames, for example a chain consisting of one-liner getter helpers. I can get some numbers and see.

An update on this: the longest tail call path for a medium-sized benchmark is 16 frames, and 5% of paths are longer than 3 frames.

154 ↗(On Diff #480232)

By the literal meaning of unreachable, I mean there isn't a path available on the dynamic call graph, probably because the call edges were not LBR-sampled.

186 ↗(On Diff #480232)

Oh yeah, the return value of inferMissingFrames is not used anywhere. Changed it to boolean type.

hoy updated this revision to Diff 481035. Dec 7 2022, 1:18 PM

Addressing feedback.

wenlei added a comment. Dec 7 2022, 3:31 PM

Thanks for the changes and addressing the feedback. Can you also rename all the instances of "tail call tracker" accordingly, in the change description as well as the title? And to be consistent, use simple terms instead of "materialized" in the change description too?

I will take another look later, but I think the title and description etc. still need updating per the comment above.

wlei added inline comments. Dec 7 2022, 3:37 PM
llvm/tools/llvm-profgen/ProfiledBinary.cpp
471–472

Nit: findFuncRange is also called inside setIsFuncEntry; perhaps we could hoist and reuse that, which should save some running time.

hoy retitled this revision from [CSSPGO][llvm-profgen] A tail call tracker to infer missing tail call frames. to [CSSPGO][llvm-profgen] Missing frame inference.. Dec 7 2022, 4:22 PM
hoy edited the summary of this revision.
hoy added inline comments.
llvm/tools/llvm-profgen/ProfiledBinary.cpp
471–472

Good point, fixed.

hoy updated this revision to Diff 481103. Dec 7 2022, 4:24 PM

Bounding the DFS search with a specific size (defaulting to INT32_MAX). Will run more experiments to settle on a good limit.

hoy updated this revision to Diff 481107. Dec 7 2022, 4:36 PM

Updating D139367: [CSSPGO][llvm-profgen] Missing frame inference.

wenlei added inline comments. Dec 15 2022, 1:08 PM
llvm/tools/llvm-profgen/MissingFrameInferrer.cpp
78

As mentioned earlier, this can be very straightforward code if we're not using a multimap.

Also, the dedup above won't be needed.

Correct me if I'm wrong, but I doubt setup/initialization time is dominant compared to the actual query/infer time.

llvm/tools/llvm-profgen/MissingFrameInferrer.h
2–3

nit: fix the comment to be on a single line.

97

nit: ReachableViaMultiPaths

llvm/tools/llvm-profgen/ProfileGenerator.cpp
1228

Update "tail call tracker"

llvm/tools/llvm-profgen/ProfiledBinary.h
251–255

Yeah, that's the point. Single targets account for 95%, and with a multimap we don't have to construct a container for each of those targets. Duplications are handled in refineTailCallTargets.

You mentioned efficiency a few times here. I agree that unifying may have a (slight) cost. But I feel that the actual cost of inferring missing frames is in the "query" or "infer" time, not the "setup" time here.

These things are hard to estimate exactly without measuring, but I would guess that changing the data structure to favor readability, simplicity and structure would not impact performance visibly as long as it doesn't hurt the query path. For that reason, I feel that the optimization here could be premature at the expense of readability, and I'm still not convinced it is needed.

How about CallPairs or just CallEdges?

CallEdges sounds good.

The MissingFrameInferrer members are mostly BinaryFunction-based while the ProfiledBinary members are address-based. The separation is to reduce disassembling time by avoiding the address-to-BinaryFunction lookup there.

The lookup should be quick, right? And shifting the lookup cost earlier (as opposed to later in the inferrer) doesn't mean we're doing more work?

One other thing to consider is to move the data structures only needed for tail calls into the inferrer, as these are technically owned by the inferrer. Ideally, we want to reduce duplication, but if that's not possible, we can move the address-based map into the inferrer as well and have it populated at disassembly time. This way this seemingly similar data is at least centralized together.

hoy updated this revision to Diff 483409. Dec 15 2022, 6:26 PM
hoy marked 3 inline comments as done.

Switching to using single maps.

llvm/tools/llvm-profgen/ProfiledBinary.h
251–255

OK, switched to using single maps, which appears to be 3% slower than using multimaps. This pushes the existing 8% overhead up to 11%. Do we think it's a worthwhile trade-off for readability?

wenlei added inline comments. Dec 15 2022, 7:16 PM
llvm/tools/llvm-profgen/ProfiledBinary.h
251–255

Thanks for experimenting. 3% extra overhead is a bit surprising. Could you do one more thing to confirm where this is coming from -- whether it's initialization or query/infer time overhead?

I thought this change would mostly affect the initialization path, which I don't expect to be dominant in cost. Is that not the case? What's the cost split between initialization and query/infer?

If there's indeed a solid, repeatable 3% perf hit, I don't think it's worth the overhead for readability.

hoy added inline comments. Dec 16 2022, 8:22 AM
llvm/tools/llvm-profgen/ProfiledBinary.h
251–255

Looks like the slowdown is mostly from the initialization path, where all functions (sampled or not) are disassembled.

Interestingly, I remeasured three times and the average slowdown is about 1.3%. The actual wall time is about 28min50sec vs 29min12sec. I guess it isn't a big deal? The single-map implementation is indeed simpler.

wenlei accepted this revision. Dec 16 2022, 8:31 AM

Lgtm, thanks.

llvm/tools/llvm-profgen/ProfiledBinary.h
251–255

Sounds good. Thanks for working through this patiently!

This revision is now accepted and ready to land. Dec 16 2022, 8:31 AM
This revision was landed with ongoing or failed builds. Dec 16 2022, 8:45 AM
This revision was automatically updated to reflect the committed changes.

Fix (09e79659bf2aeb0a5bd8ad6a9a40734b42caaf8a) for a minor build break this patch caused: stats should be gated by LLVM_ENABLE_STATS, not by NDEBUG.