This is an archive of the discontinued LLVM Phabricator instance.

[PGO] Supplement PGO profile with Sample profile
ClosedPublic

Authored by wmi on Jun 16 2020, 4:42 PM.

Details

Summary

PGO profile is usually more precise than sample profile. However, PGO profile needs to be collected from loadtest and loadtest may not be representative enough to the production workload. Sample profile collected from production can be used as a supplement -- especially for functions cold in loadtest but warm/hot in production, we can use function entry count in sample profile to scale up the related function in PGO profile.

The implementation contains changes in compiler side and llvm-profdata side. In compiler side, during PGO instrumentation and profile-use phase, the patch will guarantee there is a counter in entry block for each function and the counter will be at the first entry in the counter vector. We will use llvm-profdata to merge PGO profile and sample profile, and the output will be a new PGO profile with some counters scaled up. If a function is never executed in PGO profile but hot in sample profile, llvm-profdata will reset the entry count using the related entry count in sample profile multiplied by a scalefactor, at the same time leaving the rest of the counters as zero. If a function has non-zero/cold entry count, but is hot in sample profile, all the counters inside of the function will be scaled up equally.

In the long run, it may be useful to let compiler support using PGO profile and sample profile at the same time, but that requires more careful design and more substantial changes to make two profiles work seamlessly. The patch here serves as a simple intermediate solution.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi created this revision.Jun 16 2020, 4:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 16 2020, 4:42 PM

I think it is good to have an entry counter always, so that the profile dump is more readable. Do you have data showing the instrumentation overhead and profile size impact (clang and some large app)?

wmi added a comment.Jun 17 2020, 2:37 PM

I think it is good to have an entry counter always, so that the profile dump is more readable. Do you have data showing the instrumentation overhead and profile size impact (clang and some large app)?

Yes, I tried clang. The instrumentation runtime overhead increases by about 0.8%. The raw profile size increases by 1.8%. The zipped profile size increases by 0.15%.
Right now in the patch, inserting entry counter is guarded by a flag with default value being false.

Why is the profile size increase? I expect the number of instrumented blocks remain mostly unchanged.

The reason for the question is that if the overhead is low, I think we should make the default to be true.

wmi added a comment.Jun 17 2020, 4:21 PM

Why is the profile size increase? I expect the number of instrumented blocks remain mostly unchanged.

The reason for the question is that if the overhead is low, I think we should make the default to be true.

For function entry bb which have multiple successors, the existing algorithm in FuncPGOInstrumentation<Edge, BBInfo>::getInstrBB will insert the counter in all its successors. My current implementation simply adds a counter in entry block so in that case, it introduces redundent counter.

I can improve it by selecting a successor to not insert counter for it since it can be inferred from the counters surrounding it. With that implemented, I expect the profile size will be unchanged.

hoyFB added a comment.Jun 17 2020, 5:01 PM

It's an interesting idea to improve the PGO profile quality with sample profiles. Thanks for working on this!

In D81981#2099580, @wmi wrote:

Why is the profile size increase? I expect the number of instrumented blocks remain mostly unchanged.

The reason for the question is that if the overhead is low, I think we should make the default to be true.

For function entry bb which have multiple successors, the existing algorithm in FuncPGOInstrumentation<Edge, BBInfo>::getInstrBB will insert the counter in all its successors. My current implementation simply adds a counter in entry block so in that case, it introduces redundent counter.

I can improve it by selecting a successor to not insert counter for it since it can be inferred from the counters surrounding it. With that implemented, I expect the profile size will be unchanged.

The current PGO instrumentation is based on MST. Changing the instrumentation may require changes in how the counts of non-MST-edges are calculated (in PGOUseFunc::setInstrumentedCounts). So maybe adjust the MST to remove the sibling edge ?

llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
768

Remove the lambda and use the newly added static function instead?

yes -- see Hongtao's reply.

If how we select BB to instrument depends on a switch, we would need instrumentation build and optimizing build to have consistent switch, otherwise counters could mismatch even if CFG checksum matches? I guess that's one reason why it'd be good to avoid different ways of selecting BBs.

Would it be possible to tweak/cheat the edge weights just for MST so entry BB is pinned to be non-MST node hence guaranteed to be instrumented directly?

There should not be an option which makes things complicated as Wenlei described. Instead, once this change is done, there would be a version bump (both raw and index). The index format needs to be backward compatible, so there needs to be some version specific handling there (can be removed later).

wmi added a comment.Jun 17 2020, 7:05 PM

If how we select BB to instrument depends on a switch, we would need instrumentation build and optimizing build to have consistent switch, otherwise counters could mismatch even if CFG checksum matches? I guess that's one reason why it'd be good to avoid different ways of selecting BBs.

PGO has already had a common practice to ensure profile-gen and profile-use to have the same flags. So the flag to enable inserting counter in entry block won't cause too much trouble.

Would it be possible to tweak/cheat the edge weights just for MST so entry BB is pinned to be non-MST node hence guaranteed to be instrumented directly?

I consider that but I don't know how to do that. From what I currently understand, MST is to select some edges with highest frequencies and pruning those edges won't affect the inference of the profile of all the edges. We can prune edge when selecting MST but we cannot guarantee a node is selected as an instrumented BB during that phase. Deciding which BB to instrument is done in getInstrBB -- by choosing whether to instrument src node or dst node for each edge.

I am new to this part so if you know there is way to do that, please let me know. That is very appreciated.

wmi added a comment.Jun 17 2020, 7:31 PM

The index format needs to be backward compatible, so there needs to be some version specific handling there (can be removed later).

I don't understand this part. Could you elaborate it -- why index format is different from raw format in backward compatiblity, and what is the version specific handling?

For function entry bb which have multiple successors, the existing algorithm in FuncPGOInstrumentation<Edge, BBInfo>::getInstrBB will insert the counter in all its successors. My current implementation simply adds a counter in entry block so in that case, it introduces redundent counter.

I can improve it by selecting a successor to not insert counter for it since it can be inferred from the counters surrounding it. With that implemented, I expect the profile size will be unchanged.

The current PGO instrumentation is based on MST.

Yes, it is based on two parts, selecting MST is one part and selecting src/dst node of each non-MST edge to instrument is another part.

Changing the instrumentation may require changes in how the counts of non-MST-edges are calculated (in PGOUseFunc::setInstrumentedCounts). So maybe adjust the MST to remove the sibling edge ?

The change to add entry BB as an instrumented BB is in function getInstrumentBBs which is shared by profile-gen and profile-use, so it will be consistent between profile-gen and profile-use. About adjusting MST to remove sibling edge, I feel it is inconsistent with current goal of MST selection. The goal of selecting MST is to avoid instrumenting the most frequent edges so we can minimize the cost. Removing a successor edge of the entry block is a different goal. Mixing these two goals will make things complicated. I feel it is simpler to add the change in the second part -- selecting between src and dst which node to instrument.

There is a use case that user check in indexed format profile for sources that do not change much (e.g. library code), thus the indexed format profile needs to be backward compatible. Raw profile has not such requirement.

For IR PGO, the compatibility requirement is nice to have, but it is probably not a hard requirement as there are other ways to easily break it -- for instance any early inliner changes or CFG cleanup pass changes can make the old profile unusable.

Also is it suffice to just never select the fake edge to entry in MST?

In D81981#2099825, @wmi wrote:

The index format needs to be backward compatible, so there needs to be some version specific handling there (can be removed later).

I don't understand this part. Could you elaborate it -- why index format is different from raw format in backward compatiblity, and what is the version specific handling?

For function entry bb which have multiple successors, the existing algorithm in FuncPGOInstrumentation<Edge, BBInfo>::getInstrBB will insert the counter in all its successors. My current implementation simply adds a counter in entry block so in that case, it introduces redundent counter.

I can improve it by selecting a successor to not insert counter for it since it can be inferred from the counters surrounding it. With that implemented, I expect the profile size will be unchanged.

The current PGO instrumentation is based on MST.

Yes, it is based on two parts, selecting MST is one part and selecting src/dst node of each non-MST edge to instrument is another part.

Changing the instrumentation may require changes in how the counts of non-MST-edges are calculated (in PGOUseFunc::setInstrumentedCounts). So maybe adjust the MST to remove the sibling edge ?

The change to add entry BB as an instrumented BB is in function getInstrumentBBs which is shared by profile-gen and profile-use, so it will be consistent between profile-gen and profile-use. About adjusting MST to remove sibling edge, I feel it is inconsistent with current goal of MST selection. The goal of selecting MST is to avoid instrumenting the most frequent edges so we can minimize the cost. Removing a successor edge of the entry block is a different goal. Mixing these two goals will make things complicated. I feel it is simpler to add the change in the second part -- selecting between src and dst which node to instrument.

I see. Yes, it's reasonable to avoid changing MST edges, instead, to change which block to instrument for a given edge. I was thinking special logic may be needed for the edge count calculation as well, since it's related to where the instrumentation happens. This should work if we change both places.

Since only non-MST edges are instrumented, I was wondering alternatively we can remove an edge related to the entry block from MST to force the entry instrumented. I think removing the fake entry edge as David suggested is better than removing an outgoing sibling edge from the entry. Removing the fake entry edge from MST will result in one of the outgoing sibling edges added to MST, which in turn will cause the corresponding successor of the entry not instrumented.

wmi updated this revision to Diff 273545.Jun 25 2020, 4:29 PM

Remove the compiler part since that part will be done in https://reviews.llvm.org/D82123.

Add an option -base-scale-function so people doesn't have to always compute the scale factor for PGO/SampleFDO profiles themselves. If user knows for some function its counter value is proportional to the total count of the execution, by specifying the function through -base-scale-function, llvm-profdata will compute the scale factor based on the counter values of the function.

wmi added a comment.Jul 8 2020, 9:49 AM

https://reviews.llvm.org/D82123 to always instrument function entry BB has been committed guarded by a flag. https://reviews.llvm.org/D83024 to enable the flag by default is under review.

Can you take another look at the patch?

Can you first split the NFC part (refactoring part such as GetEntryForPercentile) out ?

llvm/include/llvm/ProfileData/InstrProf.h
681–682

document the parameters.

wmi marked an inline comment as done.Jul 8 2020, 3:34 PM

refactor GetEntryForPercentile out in https://reviews.llvm.org/D83439

wmi updated this revision to Diff 276586.Jul 8 2020, 3:46 PM

Address David's comment.
Adjust comments, function names and flag names.

wmi updated this revision to Diff 276588.Jul 8 2020, 3:54 PM

Fix a wrong flag name in test.

davidxl added inline comments.Jul 9 2020, 11:03 AM
llvm/tools/llvm-profdata/llvm-profdata.cpp
294

this refactoring can also be committed independently

514

make sample file path as the part of the option, so there is no need to handle the ordering.

529

Are these two weights comparable?

853

Is this flag tested?

wmi marked 5 inline comments as done.Jul 10 2020, 7:20 PM
wmi added inline comments.
llvm/tools/llvm-profdata/llvm-profdata.cpp
294
514

Indeed that will save the ordering handle logic, but I want to use weighted_input to scale the count in sample profile to be roughtly the same as the count in instr profile. To support -supplement-instr-with-sample=<weight>,<filename> will be a little weird and increase complexity.

529

Yes, given "-weighted-input=2, instr_profile -weighted-input=3, sample_profile", that means we want to scale the count in sample profile by 3/2 before update the entry in instr profile.

853

Good point, add tests for this flag and the flag early-inline-size-threshold

wmi updated this revision to Diff 277201.Jul 10 2020, 7:20 PM
wmi marked an inline comment as done.

Address David's comments.

wmi added a comment.Jul 16 2020, 2:46 PM

I also plan to dump the functions cold in instr profile and hot in sample profile, and sort the output according to hotness in sample profile. That can be used to guide PGO users if they want to improve the representativeness of their loadtest. I leave that part in a separate patch for easier review.

I think this feature should be decoupled from the version change -- since this is an approximate anyway.

One way to do this is to use max count or total count as a reference point and compute the scale factor.

llvm/tools/llvm-profdata/llvm-profdata.cpp
446

when there is no scaling, setting instr count with sample count does not make sense. Perhaps just set it to be above cold threshold.

wmi marked an inline comment as done.Jul 17 2020, 10:25 AM

I think this feature should be decoupled from the version change -- since this is an approximate anyway. One way to do this is to use max count or total count as a reference point and compute the scale factor.

If it is uncoupled from the version change, for function with counter values not being 0 in instr profile, it is ok to scale all the counter values by a scale factor based on max count or total count. For function with all counter values being 0, we cannot uniformly scale up all the counter values because that will mess up the branch probability inside of the function. We want to set the entry BB counter to a hot value only so compiler can use static heuristic to compute the branch probability inside of the function. That is why entry BB counter is needed in this feature.

llvm/tools/llvm-profdata/llvm-profdata.cpp
446

Maybe I can make scalefactor an option, and requires user to provide either -scalefactor or -base-scale-function, so that user won't accidentally leave scalefactor to be 1.

In this way, I can make sample profile to be the input of the option -supplement-instr-with-sample=, so I can remove the input profile ordering logic.

wmi updated this revision to Diff 279577.Jul 21 2020, 10:49 AM

Address David's comment

If we don't need to handle all zero cases using scaling, we can remove the dependency on the always entry patch.

llvm/tools/llvm-profdata/llvm-profdata.cpp
400

handling this case (all zero case) in this way won't help much -- The branch Probablity pass will set all branch weights to 1 making all branch unbiased -- it is worse than using static heuristic based.

The right way I think is to remove their profile entries completely. I assume the compiler pass later will treat such functions as unknown and not put into .text.unlikely.

wmi marked an inline comment as done.Jul 22 2020, 2:42 PM
wmi added inline comments.
llvm/tools/llvm-profdata/llvm-profdata.cpp
400

But I think for a large part of the functions missing in loadtest, they have all zero counter values, so it is better to have a way to handle them.

Is it possible for PGO to handle the functions with only entry counter not being zero and with other counters all being zero in a special way -- for those functions, just set the entry count and skip the metadata setting inside of the function? So that those functions can use static profiling inside.

The PGOUse pass can choose not to annotate any branches with total weights == 0. Now the question becomes how do we tell PGOUse pass whether the entry should be set to 0 or leave it not set. There are two ways to do it (to signal it is not really cold, but unknown):

  1. Remove the function from the indexed format profile;
  2. set all counts to some sentinel value such as -1.

Inlining won't be helped unless there is a hot callsite to the all-zero count function -- but this should not exist. I think the major performance hit comes from 1) text.unlikely which may not be mlocked; and 2) all unbiased branches due to zero weights. So doing this depending it on entry count existence is fine, but we still to teach PGOUse to drop the body. I think a simpler design would be

At llvm_profdata side:

  1. if the instrumentation cold function has enough internal counts, just scale up the max internal counts to be a multiple of hot threshold
  1. if the cold function has all zero counts or we believe all their internal counts are not trustworthy (basically ignore step 1) with an option), we can simply discard the function entry completely (to signal this function is actually hot, but we don't know internal counts)

At PGOUse side:

if we don't find counters for a function, set the function's entry value to be above hot threshold (a function statically linked in should always have counts. If there are not counts, it means it is corrected by llvm-profdata).

wmi updated this revision to Diff 280621.Jul 24 2020, 4:53 PM

Address David's comments.

The major change is to remove the dependence on always having entry counter in the profile. For function with all zero instr profile or most of zero instr profile, its counters will be set to all -1. All -1 counters indicates the internal profile for the function is unaccountable and also indicates the function is hot. PGO profile-use will drop all the internal counters while set the function entry count to be several times above hot threshold.

I choose to set all counters to all -1 instead of dropping the profile to express the indications above because I am afraid in rare case, PGO profile may be accidently used when building an unrelated target. If we set the functions to be hot when their profiles cannot be found, we may treat all the functions to be hot and that may bloat up the code and trigger compile-time issue.

davidxl added inline comments.Jul 27 2020, 9:22 AM
llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1234

document the variable.

1239

Is it possible to have some blocks -1?

1647

oh, just move this comment to the variable decl.

llvm/test/tools/llvm-profdata/Inputs/mix_instr.proftext
11

Do we have a test case for all zero case and below the threshold case (considiered all zero)?

llvm/tools/llvm-profdata/llvm-profdata.cpp
424

Is it possible to delete the instprof record for the function from the profile?

wmi marked 4 inline comments as done.Jul 27 2020, 10:09 AM
wmi added inline comments.
llvm/lib/Transforms/Instrumentation/PGOInstrumentation.cpp
1239

I feel having all blocks -1 to indicate unpresentative profile for an actually hot function is simpler than having some blocks -1. That is because when we compute profile summary, we want to strip those unpresentative profile. If we change some blocks to -1 but keep the rest unchanged, those counters will still be used for computing profile summary.

1647

Ok, will do.

llvm/test/tools/llvm-profdata/Inputs/mix_instr.proftext
11

Yes, foo is intentionally created for that. The line 22 and line 42 in suppl-instr-with-sample.test test the cases where ratio of zero counter in foo is above or lower the threshold.

llvm/tools/llvm-profdata/llvm-profdata.cpp
424

One reason that I use all -1 as the indication that the function is hot and its profile is unpresentative is: user may build unrelated target together with the PGO optimized target in the same command. I know a lot of SampleFDO user does that to simplify their release. I imagine there could be PGO user doing that too. Another possiblity is to build test targets using the profile. If we delete the instprof record and treat all the functions without instprof to be hot during prof-use, we may accidently treat a lot of cold functions to be hot if the profile is applied on some unrelated targets or tests (tests may be partially related targets), and that may cause compile-time issue.

This revision is now accepted and ready to land.Jul 27 2020, 10:59 AM
This revision was automatically updated to reflect the committed changes.