Part 2 of https://reviews.llvm.org/D147456
Call target name anchor based profile fuzzy matching
Use callee name on IR as an anchor to match the call target/inlinee name in the profile. The advantages of this in particular:
- Different from the traditional way of encoding hash signatures to every block that would affect binary/profile size and build speed, it doesn't require any additional information for this, all the data is already in the IR and profiles.
- Effective for current nested profile layout in which once a callsite is mismatched all the inlinee's profiles are dropped.
The input of the algorithm:
- IR locations: the anchor is the callee name of direct callsite.
- Profile locations: the anchor is the call target name for BodySamples or inlinee's profile name for CallsiteSamples.
The two lists are populated by parsing the IR and profile and both can be generalized as a sequence of locations with an optional anchor.
For example: say location 1.2(foo) refers to a callsite at 1.2 with callee name foo and 1.3 refers to a non-directcall location 1.3.
// The current build source code: int main() { 1. ... 2. foo(); 3. ... 4 ... 5. ... 6. bar(); 7. ... }
IR locations are populated and simplified as: [1, 2(foo), 3, 5, 6(bar), 7].
; The "stale" profile: main:350:1 1: 1 2: 3 3: 100 foo:100 4: 2 7: 2 8: 200 bar:200 9: 30
Profile locations are populated and simplified as [1, 2, 3(foo), 4, 7, 8(bar), 9]
Matching heuristic:
- Match all the anchors in lexical order first.
- Match non-anchors evenly between two anchors: Split the non-anchor range, the first half is matched based on the start anchor, the second half is matched based on the end anchor.
So the example above is matched like:
[1, 2(foo), 3, 5, 6(bar), 7] | | | | | | [1, 2, 3(foo), 4, 7, 8(bar), 9]
3 -> 4 matching is based on anchor foo, 5 -> 7 matching is based on anchor bar.
The output mapping of matching is [2->3, 3->4, 5->7, 6->8, 7->9].
For the implementation, the anchors are saved in a map for fast look-up. The result mapping is saved into IRToProfileLocationMap(see https://reviews.llvm.org/D147456) and distributed to all FunctionSamples(distributeIRToProfileLocationMap)
Performance evaluation:
Clang-self build benchmark:
Current build version: clang-10
The profiled version: clang-9
Results compared to a refresh profile(collected profile on clang-10) and to be fair, we invalidated new functions' profiles(both refresh and stale profile use the same profile list).
- Regression to using refresh profile with this off : -3.93%
- Regression to using refresh profile with this on : -1.1%
So this algorithm can recover ~72% of the regression.
Internal(Meta) large-scale services.
we saw one real instance of a 3 week stale profile., it delivered a ~1.8% win.
Notes or future work:
- Classic AutoFDO support: the current version only supports pseudo-probe, but I believe it's not hard to extend to classic line-number based AutoFDO since pseudo-probe and line-number are shared the LineLocation structure.
- The fuzzy matching is an open-ended area and there could be more heuristics to try out, but since the current version already recovers a reasonable percentage of regression(with some pseudo probe order change, it can recover close to 90%), I'm submitting the patch for review and we will try more heuristics in future.
- Profile call target name are only available when the call is hit by samples, the missing anchor might mislead the matching, this can be mitigated in llvm-profgen to generate the call target for the zero samples.
- This doesn't handle function name mismatch, we plan to solve it in future.
nit: location -> source location