Current PGO profile counts are not context sensitive. The branch probabilities for an inlined function are kept the same across all call sites, and they might be very different from the actual branch probabilities at any one call site. This suboptimal profile can greatly affect some downstream optimizations, in particular machine basic block placement.
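To make the problem concrete, here is a minimal sketch in C. The counters are hypothetical stand-ins for instrumentation counts, not what the PGO runtime actually emits; `clamp_nonneg`, `merged`, and `per_site` are names invented for this illustration. The helper is called from two sites with opposite inputs, so the per-function (merged) branch probability matches neither call site.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_SITES 2

/* Hypothetical stand-in for a pair of branch counters. */
struct branch_counts { unsigned long taken, not_taken; };

static struct branch_counts merged;              /* one set per function  */
static struct branch_counts per_site[NUM_SITES]; /* one set per call site */

/* Small helper that an optimizer would likely inline at both call sites. */
int clamp_nonneg(int x, int site) {
    assert(site >= 0 && site < NUM_SITES);
    if (x < 0) {
        merged.taken++;
        per_site[site].taken++;
        return 0;
    }
    merged.not_taken++;
    per_site[site].not_taken++;
    return x;
}

/* Probability that the `x < 0` branch is taken, given a counter pair. */
double taken_prob(struct branch_counts c) {
    unsigned long total = c.taken + c.not_taken;
    return total ? (double)c.taken / (double)total : 0.0;
}

/* Drive both call sites with opposite inputs. */
long run_workload(int iterations) {
    long sum = 0;
    for (int i = 1; i <= iterations; i++) {
        sum += clamp_nonneg(-i, 0); /* call site 0: branch always taken */
        sum += clamp_nonneg(i, 1);  /* call site 1: branch never taken  */
    }
    return sum;
}

/* Print the merged probability next to the per-call-site probabilities. */
void print_report(void) {
    printf("merged: %.2f\n", taken_prob(merged));
    for (int s = 0; s < NUM_SITES; s++)
        printf("site %d: %.2f\n", s, taken_prob(per_site[s]));
}
```

Calling `run_workload(1000)` and then `print_report()` from a `main` shows the mismatch: after inlining, a context-insensitive profile gives both copies of the branch the merged probability, even though each call site is fully biased in one direction.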
In this patch, we propose a post-inline PGO instrumentation/use pass, which we call Context Sensitive PGO (CSPGO). Users who want the best possible performance can perform a second round of PGO instrumentation/use on top of the regular PGO. They will then have two sets of profile counts. The first-pass profile will be mainly for inlining, indirect-call promotion, and CGSCC simplification pass optimizations. The second-pass profile is for post-inline optimizations and code-gen optimizations.
The typical usage models are:
// Regular PGO instrumentation and generate pass1 profile.
> clang -O2 -fprofile-generate source.c -o gen
> ./gen
> llvm-profdata merge default.*profraw -o pass1.profdata
// CSPGO instrumentation.
> clang -O2 -fprofile-use=pass1.profdata -fcs-profile-generate source.c -o gen2
> ./gen2
// Merge two sets of profiles
> llvm-profdata merge default.*profraw pass1.profdata -o profile.profdata
// Use the combined profile. Pass manager will invoke two PGO use passes.
> clang -O2 -fprofile-use=profile.profdata source.c -o use
The major changes are the following:
- bump the indexed profile version to indicate whether there are context-sensitive counts.
- change the FuncHash and reserve bits for context-sensitive counts.
- change the ProfileSummary interface, as we might have two sets of profile summaries.
- change the pass manager (legacy and new).
- change llvm-profdata to handle two kinds of profiles.
- change instrprofiling lowering for more aggressive register promotion for CSPGO.
Note that, other than the FuncHash change, there is no effect on the existing PGO passes.
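As an illustration of why reserving bits in FuncHash touches the existing PGO pass, here is a minimal sketch in C. The layout is an assumption made for this example (top 4 bits of a 64-bit hash reserved, with bit 60 used as the context-sensitive flag), and `make_func_hash` and `has_cs_flag` are hypothetical helpers, not functions from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, for illustration only: the top 4 bits of the 64-bit hash
   are reserved, and the lowest reserved bit marks context-sensitive records. */
#define RESERVED_BITS 4
#define CS_FLAG_BIT   60
#define HASH_MASK     ((UINT64_C(1) << (64 - RESERVED_BITS)) - 1)

/* Build a function hash: clear the reserved bits, then set the CS flag for
   records produced by the post-inline (context-sensitive) instrumentation. */
uint64_t make_func_hash(uint64_t cfg_hash, int is_cs) {
    assert(CS_FLAG_BIT >= 64 - RESERVED_BITS && CS_FLAG_BIT < 64);
    uint64_t h = cfg_hash & HASH_MASK;
    if (is_cs)
        h |= UINT64_C(1) << CS_FLAG_BIT;
    return h;
}

/* Return 1 if the hash belongs to a context-sensitive record, else 0. */
int has_cs_flag(uint64_t func_hash) {
    return (int)((func_hash >> CS_FLAG_BIT) & 1);
}
```

Under this assumed scheme, a regular record and a context-sensitive record for the same function get distinct hashes, so the two sets of counts can sit in one merged profile. Masking the reserved bits also changes the hash of any regular record whose original hash had those bits set, which is one way a FuncHash change can be visible to the existing PGO pass.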
This patch was functionally tested using SPEC2006, with various combinations of the following:
- legacy-pass manager
- new pass manager
- thin-lto (through gold plugin)
- thin-lto (through driver command line)
- lto (through plugin)
- non-lto
We performance-tested this patch with a few large Google benchmarks and saw good performance improvements.
I did not include any tests in this patch. I will add tests once the interfaces are decided in the code review.
It doesn't seem necessary to add a flag to computeSummary or to add recomputeSummary to PSI's API.
Why not call ProfileSummaryInfo::invalidate when attaching CSPGO data to a Module? invalidate() could be modified to recompute the summary. That keeps the APIs a bit simpler.