This is an archive of the discontinued LLVM Phabricator instance.

[BOLT] stale profile matching [part 1 out of 2]
ClosedPublic

Authored by spupyrev on Feb 21 2023, 8:38 AM.

Details

Summary

BOLT often has to deal with profiles collected on binaries built from several
revisions behind release. As a result, a certain percentage of functions is
considered stale and not optimized. This diff adds an ability to match profile
to functions that are not 100% binary identical, which increases the
optimization coverage and boosts the performance of applications.

The algorithm consists of two phases: matching and inference:

  • At the matching phase, we try to "guess" as many block and jump counts from the stale profile as possible. To this end, the content of each basic block is hashed and stored in the (yaml) profile. When BOLT optimizes a binary, it computes block hashes and identifies the corresponding entries in the stale profile. It yields a partial profile for every CFG in the binary.
  • At the inference phase, we employ a network flow-based algorithm (profi) to reconstruct "realistic" block and jump counts from the partial profile generated at the first stage. In practice, we don't always produce proper profile data but the majority (e.g., >90%) of CFGs get the correct counts.

This is a first part of the change; the next stacked diff extends the block hashing
and provides perf evaluation numbers.

Diff Detail

Event Timeline

spupyrev created this revision.Feb 21 2023, 8:38 AM
Herald added a reviewer: Amir. · View Herald Transcript
Herald added a reviewer: maksfb. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
spupyrev requested review of this revision.Feb 21 2023, 8:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 21 2023, 8:38 AM
yota9 added a comment.Feb 21 2023, 8:48 AM

Wow, would definitely take a look to this patch, thank you!

spupyrev edited the summary of this revision. (Show Details)Feb 21 2023, 11:22 AM
spupyrev updated this revision to Diff 499320.Feb 21 2023, 4:10 PM

extended comments

spupyrev edited the summary of this revision. (Show Details)Feb 21 2023, 4:10 PM
spupyrev updated this revision to Diff 499621.Feb 22 2023, 12:24 PM

Added a test and adjusted logging

spupyrev edited the summary of this revision. (Show Details)Feb 22 2023, 12:27 PM
spupyrev updated this revision to Diff 499647.Feb 22 2023, 2:02 PM

fixing a typo

wlei added a subscriber: wlei.Feb 27 2023, 3:38 PM
spupyrev updated this revision to Diff 519885.May 5 2023, 8:53 AM

Adjusting tests. Should be ready for review now

@maksfb this is ready for review

Amir added inline comments.May 10 2023, 1:17 PM
llvm/lib/Transforms/Utils/SampleProfileInference.cpp
1325 ↗(On Diff #519886)

There's an issue in shared build mode:

ld.lld: error: undefined symbol: llvm::applyFlowInference(llvm::ProfiParams const&, llvm::FlowFunction&)
>>> referenced by StaleProfileMatching.cpp
>>>               tools/bolt/lib/Profile/CMakeFiles/LLVMBOLTProfile.dir/StaleProfileMatching.cpp.o:(llvm::bolt::applyInference(llvm::FlowFunction&))
spupyrev updated this revision to Diff 521394.May 11 2023, 11:45 AM

fixed the build and rebased

Amir added a comment.May 11 2023, 6:01 PM

I'd rather move changes to llvm/lib/Transforms/Utils/SampleProfileInference.cpp to a separate diff to test with existing users and land separately as the changes don't seem to be specific to BOLT.

bolt/lib/Profile/StaleProfileMatching.cpp
139

Can we return FlowFunction object from this function?

147

Do we really need this? since we're calling updateLayoutIndices, we have BB->getLayoutIndex().

286–288
290–292
313

Can we use llvm/ADT/GraphTraits + BreadthFirstIterator to avoid reimplementing this?

384–387

Please use llvm::any_of

506

It's just removing CTCMispredCount annotation if it exists.

spupyrev marked 6 inline comments as done.May 12 2023, 4:16 PM
spupyrev added inline comments.
bolt/lib/Profile/StaleProfileMatching.cpp
147

great, thanks

313

If I understand correctly, that would require adding some boilerplate code for iterators over the nodes and their neighbors? In addition, I need to run an "inverse" BFS starting from all reachable sinks... It certainly can be done, but I'm not sure if the implementation would be any simpler. Is there a good example of how to implement something similar with built-in types?

506

which is exactly what we're trying to accomplish, no?

spupyrev updated this revision to Diff 521839.May 12 2023, 4:33 PM
spupyrev marked an inline comment as done.

comments

Amir added inline comments.May 15 2023, 1:29 PM
bolt/lib/Profile/StaleProfileMatching.cpp
313

If I understand correctly, that would require adding some boilerplate code for iterators over the nodes and their neighbors? In addition, I need to run an "inverse" BFS starting from all reachable sinks... It certainly can be done, but I'm not sure if the implementation would be any simpler. Is there a good example of how to implement something similar with built-in types?

No, I don't have any good examples of using ADT algorithms with built-in types. LLVM classes use them throughout, with examples in llvm/unittests/ADT/. I tried using them with our BinaryBasicBlock but we implement GraphTraits interfaces. I guess it's fine to leave the BFS implementation as-is if FlowFunction doesn't already use any of ADT stuff, but in general it's a good idea to reuse LLVM's algorithms if you end up reimplementing them in several places.

506

I think it's better to be explicit and use removeAnnotation directly.

spupyrev updated this revision to Diff 522635.May 16 2023, 8:15 AM

removeAnnotation

Thanks for this patch, Sergey! Now that it's no longer a draft, can you update the title and summary? Do you plan to turn the option on by default in a separate patch?

bolt/lib/Profile/StaleProfileMatching.cpp
198

Some multi-entry functions will have secondary entry points reachable from the "main" entry point. In that case, their InDegree will be non-zero. Do you want to capture those here as well? The proper way to check if the block is reachable from outside the function is to check isEntryPoint().

349

nit:

352

nit:

422

Would assignProfile() describe the function more accurately?

520

Does the algorithm depend on the layout? I.e. will inferred profile be different if we supply a different order of basic blocks?

spupyrev marked 4 inline comments as done.May 24 2023, 2:22 PM

Now that it's no longer a draft, can you update the title and summary? Do you plan to turn the option on by default in a separate patch?

For context, this diff alone isn't enough for the feature; we still have https://reviews.llvm.org/D146661, which provides an actual hash computation. I'll adjust the summary to highlight that.
Once both diffs are reviewed, we plan to turn this on for a couple of services and monitor the results. If there are no issues, we can make this on by default.

bolt/lib/Profile/StaleProfileMatching.cpp
198

These dummy entry blocks are designed exactly to capture extra entry points. From my tests on the clang binary, they do _not_ have jumps from the main entry point, and thus, their InDegree is zero. Perhaps, there are some other cases that I don't see on clang?

Anyhow, for blocks with InDegree>0, we don't need to make special adjustments; they will be handled naturally.

520

In theory, the order shouldn't matter much. Here we updateLayoutIndices only to be able to index the blocks.

In practice however, there are "almost identical" basic blocks, especially with aggressive inlining. Such blocks have identical bodies, neighbors with identical bodies etc. There is no way to locally determine which blocks belong to which profile item. In order to break ties in such cases, I decided to simply use block addresses (that is, offsets) as a tie breaking rule. It doesn't happen too often but without the tie breaking rule the inference quality is slightly worse. (Using block indices in the layout also works here but provides a tiny regression in comparison to addresses)

spupyrev updated this revision to Diff 525335.May 24 2023, 2:26 PM
spupyrev marked an inline comment as done.

review

spupyrev retitled this revision from [BOLT] initial version of stale profile matching to [BOLT] stale profile matching [part 1 out of 2].May 24 2023, 2:27 PM
spupyrev edited the summary of this revision. (Show Details)
maksfb added inline comments.May 26 2023, 4:13 PM
bolt/lib/Profile/StaleProfileMatching.cpp
198

Secondary entry points normally come from assembly code, perhaps from some Fortran programs as well. HHVM has memcpy written like that: https://github.com/facebook/hhvm/blob/master/hphp/util/memcpy-x64.S

263–264

nit:

520

I see. So then BlockOrder is used only for mapping blocks in BinaryFunction to FlowFunction and back?

spupyrev marked 2 inline comments as done.May 31 2023, 4:03 PM
spupyrev added inline comments.
bolt/lib/Profile/StaleProfileMatching.cpp
198

You're right, i checked the HHVM binary and modified the condition here. Surprisingly, there are also instances where !BB->isEntryPoint() && InDegree[I] == 0, so we need both conditions here

520

Correct

spupyrev updated this revision to Diff 527225.May 31 2023, 4:15 PM

making code compile

maksfb accepted this revision.Jun 2 2023, 1:58 PM

LGTM. See comments with mostly nits. We can probably be more graceful while mapping basic blocks, but this issue can be addressed separately.

bolt/lib/Profile/StaleProfileMatching.cpp
201–202

nit: remove those variables

214–215

likewise

333
466–468

nit:

489–490

nit:

525

Will this work?

This revision is now accepted and ready to land.Jun 2 2023, 1:58 PM
Amir added a comment.Jun 5 2023, 11:14 AM

I'm trying to see if we can reuse SampleProfileInference code as much as possible given it's already templated and used with LLVM IR/MIR (Flow-Sensitive Sample AutoFDO and Context-Sensitive Sample PGO respectively).

bolt/lib/Profile/StaleProfileMatching.cpp
140

What's the primary motivation behind adding a BinaryFunction-specific version instead of using SampleProfileInference<BT>::initFunction?
Is it the fact that BF can have multiple entry points hence we need a dummy source node, and/or custom handling of EH control flow?

162

What outdegree is used for?

313

Does it make a difference to use BFS instead of DFS if the goal is collecting (un)reachable blocks?

There's an overlap in functionality with SampleProfileInference<BT>::apply which finds nodes reachable from source and sink using depth_first_ext and inverse_depth_first_ext.

spupyrev updated this revision to Diff 528867.Jun 6 2023, 7:57 AM
spupyrev marked 9 inline comments as done.

suggested (minor) edits

This revision was automatically updated to reflect the committed changes.