This is an archive of the discontinued LLVM Phabricator instance.

[SamplePGO] Stale profile matching(part 2)
ClosedPublic

Authored by wlei on Apr 4 2023, 9:57 AM.

Details

Summary

Part 2 of https://reviews.llvm.org/D147456

Call target name anchor based profile fuzzy matching

Use callee name on IR as an anchor to match the call target/inlinee name in the profile. The advantages of this in particular:

  • Different from the traditional way of encoding hash signatures to every block that would affect binary/profile size and build speed, it doesn't require any additional information for this, all the data is already in the IR and profiles.
  • Effective for current nested profile layout in which once a callsite is mismatched all the inlinee's profiles are dropped.

The input of the algorithm:

  • IR locations: the anchor is the callee name of direct callsite.
  • Profile locations: the anchor is the call target name for BodySamples or inlinee's profile name for CallsiteSamples.

The two lists are populated by parsing the IR and profile and both can be generalized as a sequence of locations with an optional anchor.
For example: say location 1.2(foo) refers to a callsite at 1.2 with callee name foo and 1.3 refers to a non-directcall location 1.3.

// The current build source code:
   int main() {
1.     ...
2.     foo();
3.     ...
4      ...
5.     ...
6.     bar();
7.     ...
   }

IR locations are populated and simplified as: [1, 2(foo), 3, 5, 6(bar), 7].

; The "stale" profile:
main:350:1
 1: 1
 2: 3
 3: 100 foo:100
 4: 2
 7: 2
 8: 200 bar:200
 9: 30

Profile locations are populated and simplified as [1, 2, 3(foo), 4, 7, 8(bar), 9]
Matching heuristic:

  • Match all the anchors in lexical order first.
  • Match non-anchors evenly between two anchors: Split the non-anchor range, the first half is matched based on the start anchor, the second half is matched based on the end anchor.

So the example above is matched like:

[1,    2(foo), 3,  5,  6(bar), 7]
 |     |       |   |     |     |
[1, 2, 3(foo), 4,  7,  8(bar), 9]

3 -> 4 matching is based on anchor foo, 5 -> 7 matching is based on anchor bar.
The output mapping of matching is [2->3, 3->4, 5->7, 6->8, 7->9].

For the implementation, the anchors are saved in a map for fast look-up. The result mapping is saved into IRToProfileLocationMap(see https://reviews.llvm.org/D147456) and distributed to all FunctionSamples(distributeIRToProfileLocationMap)

Performance evaluation:

Clang-self build benchmark:
Current build version: clang-10
The profiled version: clang-9
Results compared to a refresh profile(collected profile on clang-10) and to be fair, we invalidated new functions' profiles(both refresh and stale profile use the same profile list).

  1. Regression to using refresh profile with this off : -3.93%
  2. Regression to using refresh profile with this on : -1.1%

So this algorithm can recover ~72% of the regression.
Internal(Meta) large-scale services.
we saw one real instance of a 3 week stale profile., it delivered a ~1.8% win.

Notes or future work:

  • Classic AutoFDO support: the current version only supports pseudo-probe, but I believe it's not hard to extend to classic line-number based AutoFDO since pseudo-probe and line-number are shared the LineLocation structure.
  • The fuzzy matching is an open-ended area and there could be more heuristics to try out, but since the current version already recovers a reasonable percentage of regression(with some pseudo probe order change, it can recover close to 90%), I'm submitting the patch for review and we will try more heuristics in future.
  • Profile call target name are only available when the call is hit by samples, the missing anchor might mislead the matching, this can be mitigated in llvm-profgen to generate the call target for the zero samples.
  • This doesn't handle function name mismatch, we plan to solve it in future.

Diff Detail

Event Timeline

wlei created this revision.Apr 4 2023, 9:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2023, 9:57 AM
wlei requested review of this revision.Apr 4 2023, 9:57 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2023, 9:57 AM
wlei edited the summary of this revision. (Show Details)Apr 5 2023, 2:11 PM
wlei edited the summary of this revision. (Show Details)Apr 5 2023, 2:26 PM
wlei updated this revision to Diff 511242.Apr 5 2023, 5:26 PM
wlei edited the summary of this revision. (Show Details)

Updating D147545: [CSSPGO] Stale profile matching(part 2)

wlei updated this revision to Diff 511243.Apr 5 2023, 5:41 PM

Updating D147545: [CSSPGO] Stale profile matching(part 2)

wlei edited the summary of this revision. (Show Details)Apr 5 2023, 6:05 PM
wlei added reviewers: hoy, wenlei, xur, davidxl, mingmingl, mtrofin.
wenlei added inline comments.Apr 5 2023, 6:12 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
1841

Unrelated to this change, but it may be useful to have a way for users to print out functions with stale profile without needing debug compiler.

Often times users ask this: is this change going to invalidate profile for this function? having a switch may help them self-diagnose to get an answer.

1847

As mentioned below, I think we should still throw away profiles that can't be matched due to missing call site anchors.

2286–2288

Wondering if we can simplify by merging the two containers, i.e use std::map<LineLocation, StringRef>, and empty StringRef value for non-direct-callsite, then we avoid the 2nd map, and also avoid the map look up in runStaleProfileMatching. It may use slightly more memory, but feels a bit cleaner.

Or even something like std::map<LineLocation, std::pair<StringRef, bool>> to have MatchedCallsiteLocs merged in as well. These containers feel a bit duplicated.

2347–2353

This doesn't seem to do what you intended to achieve, or at least what the comment says

non-anchor location match is based on the offset to the last matched anchor.

If last matched anchor doesn't exist, should we still match anything?

Consider a case where a function is completely changed, and there's no matched callsite at call, in which case we should probably still throw away its profile instead of matching everything with LocationDelta == 0?

wlei added inline comments.Apr 5 2023, 11:01 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2347–2353

Sorry if my description is unclear.

My implementation assumes a special non-callsite anchor: 0, i.e. the beginning of the function. Without the fuzzy matching, it's just this only one anchor matched, and all the others(non-anchor) use the LocationDelta = 0 to do the matching, which means the match source is equal to the target.

Consider a case where a function is completely changed, and there's no matched callsite at call, in which case we should probably still throw away its profile instead of matching everything with LocationDelta == 0?

Say:

IR locations:      [1...100 101(foo)]
Profile locations: [1...100 101(bar)]

So my intuition is even without any callsite anchor, matching the leading location would be better than dropping them. (remembering current pseudo-probe implementation, the bb ids all come first, it's not uncommon)

wenlei added inline comments.Apr 6 2023, 8:30 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2347–2353

Hmm.. I thought if there's no matching call site, that would mean the function has completely changed, so we can't match probes/blocks in a reasonable way. I can see what you have can help matching leaf functions without call sites.

But there needs to be something to make sure when the change is too big, we bail out instead of doing low confidence matching. Maybe we should bail out on zero matched call site + different number of probes/blocks?

hoy added inline comments.Apr 6 2023, 10:46 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2278

Please add a comment for this. Basically we are filtering out possible indirect calls.

2293

Should this be conditioned under FunctionSamples::ProfileIsProbeBased if the same is going to be used for non-CS profile?

Also if profile does match, there's no need to populate AllLocations?

2294

nit: use emplace

2303

May be worth mentioning the matching is based on sequential lexical order. "First come" means the anchors with lexically early locations.

llvm/test/Transforms/SampleProfile/Inputs/pseudo-probe-stale-profile-matching.prof
8

Add a case for multiple call targets at same location?

davidxl added inline comments.Apr 6 2023, 12:06 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
447

nit: location -> source location

1844–1845

nit: probably rename the option to be 'resync-stale-profile' to make the meaning more accurate.

2272

The mapping from callee to location is 1 to many. Should that be handled with sequential id (per callee)?

2278

indirect call sites are good anchors too. Why not use sequential id based anchoring?

wenlei added inline comments.Apr 6 2023, 2:57 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2278

The call site matching is based on callee names, indirect call from IR doesn't have callee name, and we opt to be conservative here (using name, instead of sequential id).

wlei added inline comments.Apr 6 2023, 3:41 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
1844–1845

Thanks for the name suggestion. so "resync the profile according to current build source code", it sounds good to me, @wenlei see if you are good for this name since you suggested the salvage-stale-profile

2272

The mapping from callee to location is 1 to many. Should that be handled with sequential id (per callee)?

Here is also 1 to many map, see StringMap<std::set<LineLocation>> &CalleeToCallsitesMap which is a callee name string to a set of location.
So with no callsite changes, this works the same as using sequential id, but with callsite change, this should be more resilient. Say if we add one or delete the first callsite, all following callsite would be mismatched if using sequential id match, but for using this callsite name match, all other callsite with different name won't be affected.

2347–2353

Good point, I will add some check(like a threshold) to make sure the very bad match to be bailed out and drop the profiles.

consider this scenario:

{

S1
foo();
S2

}

{

S3
bar();
S2

}

The first block including foo() call is shifted in source code, but the bar() block does not.

With this remapping algorithm, A won't find sample, bar() and S2 is fine; but C's location will be wrongly remapped.

I wonder if there is a need to do two way anchoring to handle this.

llvm/lib/Transforms/IPO/SampleProfile.cpp
2331

What does RA stand for?

hoy added a comment.Apr 10 2023, 10:16 AM

consider this scenario:

{

S1
foo();
S2

}

{

S3
bar();
S2

}

The first block including foo() call is shifted in source code, but the bar() block does not.

With this remapping algorithm, A won't find sample, bar() and S2 is fine; but C's location will be wrongly remapped.

I wonder if there is a need to do two way anchoring to handle this.

Can you please point out what A and C are? Do they refer to S1 and S3, respectively?

I think the intuition of the current matching algorithm focuses on getting the callsites right as the first step, as we found out that brining back missing inlining (due to profile mismatch) is quite helpful. Getting the block weights right is also important, and there will be upcoming improvements on a better numbering of callsite ids and probe ids (mostly for CSSPGO). That said, the current heuristics is never perfect as it does not try to reason about user semantics changes. Our observation is that for many cases users don't change hot function calls often, so we currently rely on the calls as anchors.

consider this scenario:

{

S1
foo();
S2

}

{

S3
bar();
S2

}

The first block including foo() call is shifted in source code, but the bar() block does not.

With this remapping algorithm, A won't find sample, bar() and S2 is fine; but C's location will be wrongly remapped.

I wonder if there is a need to do two way anchoring to handle this.

Can you please point out what A and C are? Do they refer to S1 and S3, respectively?

Right -- sorry about the confusion.

I think the intuition of the current matching algorithm focuses on getting the callsites right as the first step, as we found out that brining back missing inlining (due to profile mismatch) is quite helpful. Getting the block weights right is also important, and there will be upcoming improvements on a better numbering of callsite ids and probe ids (mostly for CSSPGO). That said, the current heuristics is never perfect as it does not try to reason about user semantics changes. Our observation is that for many cases users don't change hot function calls often, so we currently rely on the calls as anchors.

Understand. I also believe this patch works for most of the cases. The thing that remains concerning is if there are cases that applying the source remapping lead to worse performance. For instances, statements in S3 may get wrong profile data after remapping.

hoy added a comment.Apr 10 2023, 10:48 AM

consider this scenario:

{

S1
foo();
S2

}

{

S3
bar();
S2

}

The first block including foo() call is shifted in source code, but the bar() block does not.

With this remapping algorithm, A won't find sample, bar() and S2 is fine; but C's location will be wrongly remapped.

I wonder if there is a need to do two way anchoring to handle this.

Can you please point out what A and C are? Do they refer to S1 and S3, respectively?

Right -- sorry about the confusion.

I think the intuition of the current matching algorithm focuses on getting the callsites right as the first step, as we found out that brining back missing inlining (due to profile mismatch) is quite helpful. Getting the block weights right is also important, and there will be upcoming improvements on a better numbering of callsite ids and probe ids (mostly for CSSPGO). That said, the current heuristics is never perfect as it does not try to reason about user semantics changes. Our observation is that for many cases users don't change hot function calls often, so we currently rely on the calls as anchors.

Understand. I also believe this patch works for most of the cases. The thing that remains concerning is if there are cases that applying the source remapping lead to worse performance. For instances, statements in S3 may get wrong profile data after remapping.

That's a valid concern. I think yes, there will be blocks getting wrong profile data if no call anchors are matched or the user changes the semantics that makes our assumption completed invalid. For non-CS AutoFDO, I think something similar is already happening as the current line-offset based profile loading has no knowledge about how source semantics is changed. For CSSPGO, the wrong profile data may not be completely bad since at least the function would not be considered as cold as it was previously, because of the mismatch that caused the profile completely dropped.

I wonder if there is a need to do two way anchoring to handle this.

@davidxl can you elaborate on two way anchoring? is that also consider location of non-call site as anchor? It's trickier to use non-callsite as anchor since we would need a "signature" for block that is unique enough.

Also since this is all in the territory of heuristic and tuning, maybe we can get the initial version in while keep improving it. We got to this version after fair amount of internal evaluation, and it showed good results already. We still plan to keep tuning it though it will take some time to experiment.

wlei added a comment.Apr 10 2023, 11:13 AM

I wonder if there is a need to do two way anchoring to handle this.

@davidxl can you elaborate on two way anchoring? is that also consider location of non-call site as anchor? It's trickier to use non-callsite as anchor since we would need a "signature" for block that is unique enough.

Also since this is all in the territory of heuristic and tuning, maybe we can get the initial version in while keep improving it. We got to this version after fair amount of internal evaluation, and it showed good results already. We still plan to keep tuning it though it will take some time to experiment.

I guess what @davidxl meant for two way matching is to spilt the non-anchor range. some matches are based on the start anchor, some are based on the end anchor.

foo()
inst1
inst2
inst3
inst4
bar()

if foo is mismatched and bar is matched, the current version, all inst1-4 will be mismatched.
and if matching in two way, say split evenly, inst3, inst4 will be matched based on bar().

I wonder if there is a need to do two way anchoring to handle this.

@davidxl can you elaborate on two way anchoring? is that also consider location of non-call site as anchor? It's trickier to use non-callsite as anchor since we would need a "signature" for block that is unique enough.

Also since this is all in the territory of heuristic and tuning, maybe we can get the initial version in while keep improving it. We got to this version after fair amount of internal evaluation, and it showed good results already. We still plan to keep tuning it though it will take some time to experiment.

I guess what @davidxl meant for two way matching is to spilt the non-anchor range. some matches are based on the start anchor, some are based on the end anchor.

foo()
inst1
inst2
inst3
inst4
bar()

if foo is mismatched and bar is matched, the current version, all inst1-4 will be mismatched.
and if matching in two way, say split evenly, inst3, inst4 will be matched based on bar().

This one way. Another way is to do backward scanning of the anchors and create another loc mapping. The heuristic can be used to merge two location mappings (i.e., prefer the one that matches the current IR loc. If both differs, perhaps keep both - which may complicate the profile annotation).

I wonder if there is a need to do two way anchoring to handle this.

@davidxl can you elaborate on two way anchoring? is that also consider location of non-call site as anchor? It's trickier to use non-callsite as anchor since we would need a "signature" for block that is unique enough.

Also since this is all in the territory of heuristic and tuning, maybe we can get the initial version in while keep improving it. We got to this version after fair amount of internal evaluation, and it showed good results already. We still plan to keep tuning it though it will take some time to experiment.

I guess what @davidxl meant for two way matching is to spilt the non-anchor range. some matches are based on the start anchor, some are based on the end anchor.

foo()
inst1
inst2
inst3
inst4
bar()

if foo is mismatched and bar is matched, the current version, all inst1-4 will be mismatched.
and if matching in two way, say split evenly, inst3, inst4 will be matched based on bar().

Ok, then for both heuristics, we can always have cases that yield correct matching with one heuristic, but wrong result with the other.

I wonder if there is a need to do two way anchoring to handle this.

@davidxl can you elaborate on two way anchoring? is that also consider location of non-call site as anchor? It's trickier to use non-callsite as anchor since we would need a "signature" for block that is unique enough.

Also since this is all in the territory of heuristic and tuning, maybe we can get the initial version in while keep improving it. We got to this version after fair amount of internal evaluation, and it showed good results already. We still plan to keep tuning it though it will take some time to experiment.

I guess what @davidxl meant for two way matching is to spilt the non-anchor range. some matches are based on the start anchor, some are based on the end anchor.

foo()
inst1
inst2
inst3
inst4
bar()

if foo is mismatched and bar is matched, the current version, all inst1-4 will be mismatched.
and if matching in two way, say split evenly, inst3, inst4 will be matched based on bar().

This one way. Another way is to do backward scanning of the anchors and create another loc mapping.

Ok, that sounds reasonable. We also discussed backward matching in the past. I'm fine if we want to quickly test backward matching (or a mix of forward and backward matching and compare result), or just get the current version in and iterate on that later.

The heuristic can be used to merge two location mappings (i.e., prefer the one that matches the current IR loc. If both differs, perhaps keep both - which may complicate the profile annotation).

Keep both will add complexity -- we will need to see meaningfully better result to justify it. Inclined to keep it simple.

wlei updated this revision to Diff 514315.Apr 17 2023, 10:53 AM

addressing reviewers' feedback
added to match backwards(split evenly) for non-anchor locations.

wlei added inline comments.Apr 17 2023, 10:54 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
447

Done.

1841

Sounds good, I will add a switch in a separate diff.

2278

Commnets added

2286–2288

Good point! going to remove the MatchedCallsiteLocs , so changed to use std::map<LineLocation, StringRef>

2293

Should this be conditioned under FunctionSamples::ProfileIsProbeBased if the same is going to be used for non-CS profile?

Yes, added the FunctionSamples::ProfileIsProbeBased check.

Also if profile does match, there's no need to populate AllLocations?

Yes, going the remove the MatchedCallsiteLocs , so here only leave one structure IRLocations.

2294

Done.

2303

Thanks for the suggestion, refined the comments

2331

I wanted to say "Result of the Anchor", changed to a more meaningful name ProfileAnchors

2347–2353

I found to do this we need to get some metrics to define how big the change is, also need to test the perf number and tune the threshold, I will handle this in a separate diff.

llvm/test/Transforms/SampleProfile/Inputs/pseudo-probe-stale-profile-matching.prof
8

Sounds good, added a line including multiple call targets

wlei added a comment.Apr 17 2023, 11:18 AM

I wonder if there is a need to do two way anchoring to handle this.

@davidxl can you elaborate on two way anchoring? is that also consider location of non-call site as anchor? It's trickier to use non-callsite as anchor since we would need a "signature" for block that is unique enough.

Also since this is all in the territory of heuristic and tuning, maybe we can get the initial version in while keep improving it. We got to this version after fair amount of internal evaluation, and it showed good results already. We still plan to keep tuning it though it will take some time to experiment.

I guess what @davidxl meant for two way matching is to spilt the non-anchor range. some matches are based on the start anchor, some are based on the end anchor.

foo()
inst1
inst2
inst3
inst4
bar()

if foo is mismatched and bar is matched, the current version, all inst1-4 will be mismatched.
and if matching in two way, say split evenly, inst3, inst4 will be matched based on bar().

This one way. Another way is to do backward scanning of the anchors and create another loc mapping. The heuristic can be used to merge two location mappings (i.e., prefer the one that matches the current IR loc. If both differs, perhaps keep both - which may complicate the profile annotation).

@davidxl For the non-anchor locations, I added the support to split the range evenly and match backwards for half of it. I tested, it showed a sight win like from ~1.1 regression to ~1.0 regression, maybe also noise, but I checked several instances, some backwards matchings can help.

For the anchor locations, yes, keeping both candidates will complicate the profile annotation. "prefer the one that matches the current IR loc." or search the closest loc are also good heuristic, we plan to explore more heuristic later.

Moveover, I did a statistical analysis, it appears that 86% of the anchors in the profile are unique anchor for the clang-self build, that might explain the current way can cover most of the cases. So I'm also leaning to make it simple at this moment.

I wonder if there is a need to do two way anchoring to handle this.

@davidxl can you elaborate on two way anchoring? is that also consider location of non-call site as anchor? It's trickier to use non-callsite as anchor since we would need a "signature" for block that is unique enough.

Also since this is all in the territory of heuristic and tuning, maybe we can get the initial version in while keep improving it. We got to this version after fair amount of internal evaluation, and it showed good results already. We still plan to keep tuning it though it will take some time to experiment.

I guess what @davidxl meant for two way matching is to spilt the non-anchor range. some matches are based on the start anchor, some are based on the end anchor.

foo()
inst1
inst2
inst3
inst4
bar()

if foo is mismatched and bar is matched, the current version, all inst1-4 will be mismatched.
and if matching in two way, say split evenly, inst3, inst4 will be matched based on bar().

This one way. Another way is to do backward scanning of the anchors and create another loc mapping. The heuristic can be used to merge two location mappings (i.e., prefer the one that matches the current IR loc. If both differs, perhaps keep both - which may complicate the profile annotation).

@davidxl For the non-anchor locations, I added the support to split the range evenly and match backwards for half of it. I tested, it showed a sight win like from ~1.1 regression to ~1.0 regression, maybe also noise, but I checked several instances, some backwards matchings can help.

For the anchor locations, yes, keeping both candidates will complicate the profile annotation. "prefer the one that matches the current IR loc." or search the closest loc are also good heuristic, we plan to explore more heuristic later.

Moveover, I did a statistical analysis, it appears that 86% of the anchors in the profile are unique anchor for the clang-self build, that might explain the current way can cover most of the cases. So I'm also leaning to make it simple at this moment.

I am ok with keeping it simple for now -- assuming this feature is off by default.

wenlei added inline comments.Apr 17 2023, 8:09 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
1844–1845

not a big deal but maybe recover-stale-profile is a middle ground? the intention was to use a name that tells users what the switch does on a high level (salvage/recover), rather than how it's implemented (match/sync).

2286–2288

Looks like MatchedCallsiteLocs is still there?

2347–2353

sounds good.

wlei added inline comments.Apr 18 2023, 10:13 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2286–2288

Sorry for confusion, it will also be in a separate diff, same diff with checking big profile change. I plan to use the IRLocations to directly match profile locations, there will be no this MatchedCallsiteLocs and use it to guid if the mismatch is big to drop or not.

hoy added inline comments.Apr 18 2023, 10:27 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2295

I guess IRLocations should also be populated for non-CS. Maybe add a TODO?

2311

typo: wrote -> written

BTW, have you seen such collision? With pseudo probes this shouldn't happen. It should not happen with dwarf discriminators either.

2345

BTW, wondering if you've ever seen mismatched callsites when function hash matches. The hash counts number of callsites but not their orders.

wlei updated this revision to Diff 514734.Apr 18 2023, 1:57 PM

addressing Hongtao's feedback.

llvm/lib/Transforms/IPO/SampleProfile.cpp
2295

TODO added, Yes, currently only support pseudo probe.

2311

Before I didn't add isa<IntrinsicInst>(&I) condition to (std::optional<PseudoProbe> Probe = extractProbe(I))
I saw some callee name become empty due to the overwriting by the above emplace.

So I was thinking this is to record all the anchors, the anchor should always be higher priority than non-anchor, so I changed like here to force the writing. Also in case any changes when we support AutoFDO or post-link time matching. Or we can use the assertion here.

2345

Good question. Yes, there are many mismatched callsites even hash is matched, current work only support when a checksum mismatch is detected.

There is a general issue whether we can turn it on for all the functions. That is whether the matching algorithm can handle perfectly with the non-stale profile. the current heuristic is "first come first match", but not all the functions are in the profile(supposing there are functions doesn't hit any samples), it could give inconsistent anchors for the non-stale profile then cause a mismatch.

In order to solve it, I think we can try:

  • Use a more strict checksum, like also count the orders.
  • find a threshold from the mismatch metrics to control it.
  • Use a different heuristic, like search the closest location which can handle well with non-stale profile, but need more measuring for the mismatched function.

This is also an issue blocking AutoFDO, since AutoFDO doesn't have the checksum.

wenlei added inline comments.Apr 18 2023, 5:07 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2345

it might be useful to have a debug print, or STATISTIC so we know when and how often this happens. It's essentially hash collision.

hoy added inline comments.Apr 19 2023, 11:06 AM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2311

I see. Yeah, extractProbe on call instruction gets the call probe. If you want block probes only, you can change the check isa<IntrinsicInst> to isa<PseudoProbeInst>.

2345

Good point. Counting how many direct callsites having mismatched targets in the profile and on the IR would be helpful.

wenlei added inline comments.Apr 19 2023, 2:42 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2286–2288

Sounds good to deal with it in a separate patch. For IRLocations, would be good to have a comment mentioning the use of empty StringRef non-direct-call site.

2295

Looks like we're not populating non-call locations for AutoFDO? In that case, should we assert on CSSPGO to make sure we don't accidentally run this for AutoFDO before its support is complete?

2311

For non-call location, the name should be empty. Maybe assert that we never overwrite non-empty name?

hoy added inline comments.Apr 19 2023, 3:55 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2311

An assert sounds good for pseudo probes. The overwrite is possible for autofdo without enabling dwarf discriminator.

2345

Also as discussed offline, since IR callsites are matched with profile callsites sequentially, it may be interesting to see how many same-named callsites are not able to get a match. If it's non-trivial, adding zero-count callsites in the profile may allow for a more precise match. Of course that's going to increase profile size, but just curious if it is worth.

wlei updated this revision to Diff 515420.Apr 20 2023, 12:02 PM

addressing feedback.

llvm/lib/Transforms/IPO/SampleProfile.cpp
2286–2288

comment added

2295

Yes, now it's not supported for AutoFDO. It's already under FunctionSamples::ProfileIsProbeBased, so this code won't be run in AutoFDO.

2311

Sounds good, changed to use `isa<PseudoProbeInst>. and added the assert for pseudo probe.

2345

it might be useful to have a debug print, or STATISTIC so we know when and how often this happens. It's essentially hash collision.

We already have the global statistic report for the mismatched callsite(the TotalProfiledCallsites under ReportProfileStaleness), added a debug print for the function level debugging.

Also as discussed offline, since IR callsites are matched with profile callsites sequentially, it may be interesting to see how many same-named callsites are not able to get a match. If it's non-trivial, adding zero-count callsites in the profile may allow for a more precise match. Of course that's going to increase profile size, but just curious if it is worth.

Yeah, we could do a offline analysis for how many callsites whose name can be found in the profile but still remain mismatched after matching. We could do add the zero-count calliste in llvm-profgen, only for top-level is enough, I guess that size should be ok with extbinary.

Alternatively, we could use "search closest location" heuristic so that for a fresh profile where the profile and the IR should be the same location, the IR location can always be matched to the same location in the Profile.

wlei updated this revision to Diff 515422.Apr 20 2023, 12:05 PM

Updating D147545: [CSSPGO] Stale profile matching(part 2)

wenlei added inline comments.Apr 20 2023, 12:52 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2343

A bit confused by the use of _TotalProfiledCallsites and _NumMismatchedCallsites, what are you trying to do with the two additional variables?

wlei added inline comments.Apr 20 2023, 12:55 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
2343

Because the TotalProfiledCallsites and NumMismatchedCallsites is the sum for the whole module functions not one function, so here for function level debug print, use a location variable to do it.

hoy added a comment.Apr 21 2023, 9:48 AM

How about retitle it as "[AutoFDO] Stale profile matching (part 2)" to align with the first patch? It's mentioned in the summary that only supports CSSPGO for now, so it should be good.

llvm/lib/Transforms/IPO/SampleProfile.cpp
2311

still seeing if (FunctionSamples::ProfileIsProbeBased && isa<IntrinsicInst>(&I)) at line 2290.

LGTM otherwise .

wlei retitled this revision from [CSSPGO] Stale profile matching(part 2) to [AutoFDO] Stale profile matching(part 2).Apr 21 2023, 1:18 PM
wlei edited the summary of this revision. (Show Details)
wlei added a comment.Apr 21 2023, 1:21 PM

How about retitle it as "[AutoFDO] Stale profile matching (part 2)" to align with the first patch? It's mentioned in the summary that only supports CSSPGO for now, so it should be good.

Sounds good.

llvm/lib/Transforms/IPO/SampleProfile.cpp
2311

Oops, I missed that.

wlei updated this revision to Diff 515886.Apr 21 2023, 1:21 PM

Updating D147545: [AutoFDO] Stale profile matching(part 2)

hoy accepted this revision.Apr 21 2023, 1:39 PM
This revision is now accepted and ready to land.Apr 21 2023, 1:39 PM
wenlei added inline comments.Apr 25 2023, 4:09 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
449

nit: FuncToMatchingsMap->FuncMappings?

2320

Can/should we assert IRToProfileLocationMap is empty to begin with? i.e. we shouldn't be populating the same map multiple times?

2341

If you provide erase with an iterator instead of an element, it would save a find..

2343

Ok, now I see what's going on. The code as is can be confusing (both the naming of variables, and the subtraction involved to get per-function stats).

I suggest we make countProfileMismatches take two int& input through parameters, so it doesn't change global state as a way to output, which is inconsistent with the way it takes func as input through parameters.

Then use FuncMismatchedCallsites, FuncProfiledCallsites to pass into countProfileMismatches to get its output. At caller side, we accumulate FuncMismatchedCallsites, FuncProfiledCallsites on to TotalProfiledCallsites, TotalProfiledCallsites.

2364

I'm wondering if we should just create a map here, and change the getOrCreateIRToProfileLocationMap API into a simple getter?

There seems to be very clear separation as to when we need to create and when we need to get, since there's no case for on-demand creation. Hence, simple/separate API might be cleaner, and less error-prone.

2368

It looks to me that IsAnchorMatched can be replaced with !IRToProfileLocationMap.empty()?

wlei added inline comments.Apr 25 2023, 8:20 PM
llvm/lib/Transforms/IPO/SampleProfile.cpp
449

done.

2320

Good point, done.

2341

done.

2343

yeah, it's clearer this way, done.

2364

Yes, here the matching should run only once per function, makes sense to just create a map.

2368

This is inside the for loop :

for (auto &IR : IRLocations) { 
   bool IsAnchorMatched = false;
   ....
 }

For each location, it needs to check if it's an anchor then if it's matched, maybe the name is confusing, changed to IsMatchedAnchor.

wlei updated this revision to Diff 517034.Apr 25 2023, 8:20 PM

addressing Wenlei's feedback

wenlei accepted this revision.Apr 25 2023, 8:58 PM

lgtm, thanks!

As we evolve SampleProfileMatcher, it might be worth considering moving the matcher into its own file similar to SampleProfileProbe.cpp, to avoid bloating SampleProfile.cpp indefinitely. But this can be handled later in separate patch.

How about retitle it as "[AutoFDO] Stale profile matching (part 2)" to align with the first patch? It's mentioned in the summary that only supports CSSPGO for now, so it should be good.

I found this confusing, and the patch really isn't about AutoFDO, but CSSPGO. Maybe rename the first patch as [CSSPGO] instead for consistency. Or just say SamplePGO to avoid differentiating the two.

wlei added a comment.Apr 26 2023, 10:04 AM

lgtm, thanks!

As we evolve SampleProfileMatcher, it might be worth considering moving the matcher into its own file similar to SampleProfileProbe.cpp, to avoid bloating SampleProfile.cpp indefinitely. But this can be handled later in separate patch.

Good point, will do in a separate patch.

How about retitle it as "[AutoFDO] Stale profile matching (part 2)" to align with the first patch? It's mentioned in the summary that only supports CSSPGO for now, so it should be good.

I found this confusing, and the patch really isn't about AutoFDO, but CSSPGO. Maybe rename the first patch as [CSSPGO] instead for consistency. Or just say SamplePGO to avoid differentiating the two.

The infra might be useful for AutoFDO, changed to SamplePGO.

wlei retitled this revision from [AutoFDO] Stale profile matching(part 2) to [SamplePGO] Stale profile matching(part 2).Apr 26 2023, 10:05 AM
wlei edited the summary of this revision. (Show Details)Apr 28 2023, 12:37 PM
This revision was landed with ongoing or failed builds.Apr 28 2023, 1:13 PM
This revision was automatically updated to reflect the committed changes.