This is an archive of the discontinued LLVM Phabricator instance.

Differential D144500

[BOLT] stale profile matching [part 1 out of 2]
ClosedPublic

Authored by spupyrev on Feb 21 2023, 8:38 AM.

Details

Reviewers

rafauler
Amir
maksfb

Commits

rG44268271f61e: [BOLT] stale profile matching [part 1 out of 2]

Summary

BOLT often has to deal with profiles collected on binaries built from several
revisions behind release. As a result, a certain percentage of functions is
considered stale and not optimized. This diff adds an ability to match profile
to functions that are not 100% binary identical, which increases the
optimization coverage and boosts the performance of applications.

The algorithm consists of two phases, matching and inference:

- At the matching phase, we try to "guess" as many block and jump counts from the stale profile as possible. To this end, the content of each basic block is hashed and stored in the (yaml) profile. When BOLT optimizes a binary, it computes block hashes and identifies the corresponding entries in the stale profile. It yields a partial profile for every CFG in the binary.
- At the inference phase, we employ a network flow-based algorithm (profi) to reconstruct "realistic" block and jump counts from the partial profile generated at the first stage. In practice, we don't always produce proper profile data but the majority (e.g., >90%) of CFGs get the correct counts.

This is the first part of the change; the next stacked diff extends the block hashing
and provides perf evaluation numbers.
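
As a rough illustration of the matching phase described in the summary, the sketch below copies counts from stale profile entries to blocks of the new binary whose content hash is unchanged. The types `ProfiledBlock` and `BinaryBlock` and the function `matchByHash` are hypothetical stand-ins invented for this example, not BOLT's data structures, and the "first entry wins" handling of duplicate hashes is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins; these are not BOLT's actual types.
struct ProfiledBlock {
  uint64_t Hash;  // hash of the block contents stored in the stale profile
  uint64_t Count; // execution count stored in the stale profile
};

struct BinaryBlock {
  uint64_t Hash;                 // hash recomputed on the new binary
  std::optional<uint64_t> Count; // empty while the count is unknown
};

// Copy counts from profile entries to blocks with an equal hash and return
// the number of matched blocks. Unmatched blocks keep an empty count, which
// yields the "partial profile" handed over to the inference phase.
size_t matchByHash(const std::vector<ProfiledBlock> &Profile,
                   std::vector<BinaryBlock> &Blocks) {
  std::unordered_map<uint64_t, uint64_t> HashToCount;
  for (const ProfiledBlock &PB : Profile)
    HashToCount.emplace(PB.Hash, PB.Count); // assumption: first entry wins

  size_t NumMatched = 0;
  for (BinaryBlock &BB : Blocks) {
    auto It = HashToCount.find(BB.Hash);
    if (It == HashToCount.end())
      continue; // the block changed or is new
    BB.Count = It->second;
    ++NumMatched;
  }
  assert(NumMatched <= Blocks.size());
  return NumMatched;
}
```

Blocks left without a count are exactly the ones whose counts the flow-based inference has to reconstruct.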

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spupyrev created this revision. · Feb 21 2023, 8:38 AM

Herald added a reviewer: rafauler. · Feb 21 2023, 8:38 AM

Herald added a reviewer: Amir.

Herald added a reviewer: maksfb.

Herald added a project: Restricted Project.

Herald added subscribers: treapster, ayermolo, wenlei, hiraditya.

spupyrev requested review of this revision. · Feb 21 2023, 8:38 AM

Herald added a project: Restricted Project. · Feb 21 2023, 8:38 AM

Herald added subscribers: llvm-commits, yota9.

spupyrev added a parent revision: D144306: [BOLT][NFC] Add hash computation for basic blocks. · Feb 21 2023, 8:39 AM

Wow, I will definitely take a look at this patch, thank you!

Harbormaster completed remote builds in B215039: Diff 499194. · Feb 21 2023, 10:25 AM

spupyrev edited the summary of this revision. · Feb 21 2023, 11:22 AM

extended comments

spupyrev edited the summary of this revision. · Feb 21 2023, 4:10 PM

Harbormaster completed remote builds in B215125: Diff 499320. · Feb 21 2023, 5:36 PM

Added a test and adjusted logging

spupyrev edited the summary of this revision. · Feb 22 2023, 12:27 PM

fixing a typo

Harbormaster completed remote builds in B215360: Diff 499647. · Feb 22 2023, 4:45 PM

wlei added a subscriber: wlei. · Feb 27 2023, 3:38 PM

spupyrev added a child revision: D146661: [BOLT] stale profile matching [part 2 out of 2]. · Mar 22 2023, 2:27 PM

spupyrev removed a parent revision: D144306: [BOLT][NFC] Add hash computation for basic blocks. · May 5 2023, 7:12 AM

Herald added a subscriber: hoy. · May 5 2023, 7:12 AM

rebase

Harbormaster completed remote builds in B230238: Diff 519851. · May 5 2023, 8:37 AM

Adjusting tests. Should be ready for review now

nit

Harbormaster completed remote builds in B230262: Diff 519886. · May 5 2023, 9:40 AM

@maksfb this is ready for review

Amir added inline comments. · May 10 2023, 1:17 PM

llvm/lib/Transforms/Utils/SampleProfileInference.cpp

1325 ↗

(On Diff #519886)

There's an issue in shared build mode:

ld.lld: error: undefined symbol: llvm::applyFlowInference(llvm::ProfiParams const&, llvm::FlowFunction&)
>>> referenced by StaleProfileMatching.cpp
>>>               tools/bolt/lib/Profile/CMakeFiles/LLVMBOLTProfile.dir/StaleProfileMatching.cpp.o:(llvm::bolt::applyInference(llvm::FlowFunction&))

fixed the build and rebased

Harbormaster completed remote builds in B231405: Diff 521394. · May 11 2023, 1:04 PM

I'd rather move changes to llvm/lib/Transforms/Utils/SampleProfileInference.cpp to a separate diff to test with existing users and land separately as the changes don't seem to be specific to BOLT.

bolt/lib/Profile/StaleProfileMatching.cpp
139	Can we return FlowFunction object from this function?
147	Do we really need this? since we're calling updateLayoutIndices, we have BB->getLayoutIndex().
286–288
290–292
313	Can we use llvm/ADT/GraphTraits + BreadthFirstIterator to avoid reimplementing this?
384–387	Please use `llvm::any_of`
506	It's just removing CTCMispredCount annotation if it exists.

spupyrev mentioned this in D150466: profilie inference changes for stale profile matching. · May 12 2023, 10:52 AM

spupyrev mentioned this in rG2ee4ddae0439: profilie inference changes for stale profile matching. · May 12 2023, 11:58 AM

spupyrev marked 6 inline comments as done. · May 12 2023, 4:16 PM

spupyrev added inline comments.

bolt/lib/Profile/StaleProfileMatching.cpp
147	great, thanks
313	If I understand correctly, that would require adding some boilerplate code for iterators over the nodes and their neighbors? In addition, I need to run an "inverse" BFS starting from all reachable sinks... It certainly can be done, but I'm not sure if the implementation would be any simpler. Is there a good example of how to implement something similar with built-in types?
506	which is exactly what we're trying to accomplish, no?

comments

Harbormaster completed remote builds in B231739: Diff 521839. · May 12 2023, 4:40 PM

Amir added inline comments. · May 15 2023, 1:29 PM

bolt/lib/Profile/StaleProfileMatching.cpp
313	> If I understand correctly, that would require adding some boilerplate code for iterators over the nodes and their neighbors? In addition, I need to run an "inverse" BFS starting from all reachable sinks... It certainly can be done, but I'm not sure if the implementation would be any simpler. Is there a good example of how to implement something similar with built-in types?
	No, I don't have any good examples of using ADT algorithms with built-in types. LLVM classes use them throughout, with examples in `llvm/unittests/ADT/`. I tried using them with our BinaryBasicBlock but we implement GraphTraits interfaces. I guess it's fine to leave the BFS implementation as-is if FlowFunction doesn't already use any of the ADT stuff, but in general it's a good idea to reuse LLVM's algorithms if you end up reimplementing them in several places.
506	I think it's better to be explicit and use `removeAnnotation` directly.
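
For readers following the BFS discussion above, here is a minimal sketch of a forward BFS from the entry combined with an "inverse" BFS from the reachable sinks, written over plain adjacency lists. It is not the code from the diff: the names `bfs` and `blocksOnEntryToExitPaths` are hypothetical, node 0 is assumed to be the entry, and a sink is assumed to be any node without successors.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Plain BFS over adjacency lists; returns one visited flag per node.
static std::vector<bool> bfs(const std::vector<std::vector<size_t>> &Adj,
                             const std::vector<size_t> &Starts) {
  std::vector<bool> Visited(Adj.size(), false);
  std::queue<size_t> Queue;
  for (size_t S : Starts) {
    assert(S < Adj.size() && "start node out of range");
    if (!Visited[S]) {
      Visited[S] = true;
      Queue.push(S);
    }
  }
  while (!Queue.empty()) {
    size_t U = Queue.front();
    Queue.pop();
    for (size_t V : Adj[U]) {
      if (!Visited[V]) {
        Visited[V] = true;
        Queue.push(V);
      }
    }
  }
  return Visited;
}

// Succs[U] lists the successors of node U; node 0 is assumed to be the entry.
// A node is flagged if it is reachable from the entry and can itself reach a
// sink (a node without successors) that is reachable from the entry.
std::vector<bool>
blocksOnEntryToExitPaths(const std::vector<std::vector<size_t>> &Succs) {
  const size_t N = Succs.size();
  if (N == 0)
    return {};

  // Reverse every edge so that the second BFS walks predecessors.
  std::vector<std::vector<size_t>> Preds(N);
  for (size_t U = 0; U < N; ++U)
    for (size_t V : Succs[U]) {
      assert(V < N && "successor out of range");
      Preds[V].push_back(U);
    }

  const std::vector<bool> Forward = bfs(Succs, {0});
  std::vector<size_t> ReachableSinks;
  for (size_t U = 0; U < N; ++U)
    if (Forward[U] && Succs[U].empty())
      ReachableSinks.push_back(U);
  const std::vector<bool> Backward = bfs(Preds, ReachableSinks);

  std::vector<bool> Result(N, false);
  for (size_t U = 0; U < N; ++U)
    Result[U] = Forward[U] && Backward[U];
  return Result;
}
```

The same result could be obtained with a DFS, since only reachability matters here and not the visiting order.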

removeAnnotation

Harbormaster completed remote builds in B232328: Diff 522635. · May 16 2023, 8:22 AM

Thanks for this patch, Sergey! Now that it's no longer a draft, can you update the title and summary? Do you plan to turn the option on by default in a separate patch?

bolt/lib/Profile/StaleProfileMatching.cpp
198	Some multi-entry functions will have secondary entry points reachable from the "main" entry point. In that case, their `InDegree` will be non-zero. Do you want to capture those here as well? The proper way to check if the block is reachable from outside the function is to check `isEntryPoint()`.
349	nit:
352	nit:
422	Would `assignProfile()` describe the function more accurately?
520	Does the algorithm depend on the layout? I.e. will inferred profile be different if we supply a different order of basic blocks?

> Now that it's no longer a draft, can you update the title and summary? Do you plan to turn the option on by default in a separate patch?

For context, this diff alone isn't enough for the feature; we still have https://reviews.llvm.org/D146661, which provides the actual hash computation. I'll adjust the summary to highlight that.
Once both diffs are reviewed, we plan to turn this on for a couple of services and monitor the results. If there are no issues, we can turn this on by default.

bolt/lib/Profile/StaleProfileMatching.cpp
198	These dummy entry blocks are designed exactly to capture extra entry points. From my tests on the clang binary, they do _not_ have jumps from the main entry point, and thus, their InDegree is zero. Perhaps, there are some other cases that I don't see on clang? Anyhow, for blocks with InDegree>0, we don't need to make special adjustments; they will be handled naturally.
520	In theory, the order shouldn't matter much. Here we `updateLayoutIndices` only to be able to index the blocks. In practice however, there are "almost identical" basic blocks, especially with aggressive inlining. Such blocks have identical bodies, neighbors with identical bodies etc. There is no way to locally determine which blocks belong to which profile item. In order to break ties in such cases, I decided to simply use block addresses (that is, offsets) as a tie breaking rule. It doesn't happen too often but without the tie breaking rule the inference quality is slightly worse. (Using block indices in the layout also works here but provides a tiny regression in comparison to addresses)
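
The tie-breaking idea from the comment on line 520 can be sketched as follows. This is only an illustration under stated assumptions: `HashedBlock` and `findBestMatch` are hypothetical names, and "closest offset wins" is one possible reading of using block offsets as the tie-breaking rule, not necessarily the rule implemented in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of a basic block in the new binary.
struct HashedBlock {
  uint64_t Hash;   // content hash of the block
  uint64_t Offset; // offset of the block within its function
};

// Among the blocks whose hash equals the profiled hash, pick the one whose
// offset is closest to the offset stored in the profile. Returns the index
// of the chosen block, or std::nullopt if no block has a matching hash.
std::optional<size_t> findBestMatch(const std::vector<HashedBlock> &Blocks,
                                    uint64_t ProfHash, uint64_t ProfOffset) {
  std::optional<size_t> Best;
  uint64_t BestDist = 0;
  for (size_t I = 0; I < Blocks.size(); ++I) {
    if (Blocks[I].Hash != ProfHash)
      continue;
    const uint64_t Dist = Blocks[I].Offset > ProfOffset
                              ? Blocks[I].Offset - ProfOffset
                              : ProfOffset - Blocks[I].Offset;
    // Strict comparison: on equal distance the earlier block is kept.
    if (!Best || Dist < BestDist) {
      Best = I;
      BestDist = Dist;
    }
  }
  assert(!Best || Blocks[*Best].Hash == ProfHash);
  return Best;
}
```

With aggressively inlined code, several blocks may share one hash, and a rule of this kind makes the choice deterministic.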

review

spupyrev retitled this revision from [BOLT] initial version of stale profile matching to [BOLT] stale profile matching [part 1 out of 2]. · May 24 2023, 2:27 PM

spupyrev edited the summary of this revision.

Harbormaster completed remote builds in B234322: Diff 525335. · May 24 2023, 2:38 PM

maksfb added inline comments. · May 26 2023, 4:13 PM

bolt/lib/Profile/StaleProfileMatching.cpp
198	Secondary entry points normally come from assembly code, perhaps from some Fortran programs as well. HHVM has memcpy written like that: https://github.com/facebook/hhvm/blob/master/hphp/util/memcpy-x64.S
263–264	nit:
520	I see. So then `BlockOrder` is used only for mapping blocks in `BinaryFunction` to `FlowFunction` and back?

comments

spupyrev marked 2 inline comments as done. · May 31 2023, 4:03 PM

spupyrev added inline comments.

bolt/lib/Profile/StaleProfileMatching.cpp
198	You're right, I checked the HHVM binary and modified the condition here. Surprisingly, there are also instances where `!BB->isEntryPoint() && InDegree[I] == 0`, so we need both conditions here.
520	Correct
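
The combined condition discussed for line 198 can be illustrated with a small self-contained sketch. It mirrors the shape of the code in the diff (a dummy source at index 0, layout index plus one for every real block, parallel edges ignored), but `SimpleBlock`, `SimpleJump` and `buildJumps` are hypothetical simplified types and names, not BOLT's or profi's.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical simplified block: successors are layout indices.
struct SimpleBlock {
  bool IsEntryPoint;
  std::vector<size_t> Succs;
};

struct SimpleJump {
  size_t Source;
  size_t Target;
};

// Build the jump list of a flow graph whose node 0 is a dummy source and
// whose node I + 1 corresponds to Layout[I].
std::vector<SimpleJump> buildJumps(const std::vector<SimpleBlock> &Layout) {
  std::vector<SimpleJump> Jumps;
  std::vector<uint64_t> InDegree(Layout.size() + 1, 0);

  for (size_t I = 0; I < Layout.size(); ++I) {
    std::unordered_set<size_t> UniqueSuccs;
    for (size_t S : Layout[I].Succs) {
      assert(S < Layout.size() && "successor out of range");
      if (!UniqueSuccs.insert(S).second)
        continue; // ignore parallel edges
      Jumps.push_back({I + 1, S + 1});
      ++InDegree[S + 1];
    }
  }

  // Connect the dummy source to every entry point and to every block that
  // has no predecessors; both conditions are needed, as noted in the review.
  for (size_t I = 1; I <= Layout.size(); ++I)
    if (Layout[I - 1].IsEntryPoint || InDegree[I] == 0)
      Jumps.push_back({0, I});
  return Jumps;
}
```

A secondary entry point that is also reachable from the main entry has a non-zero in-degree, so checking the in-degree alone would miss it; checking only the entry-point flag would miss blocks without predecessors.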

making code compile

Harbormaster completed remote builds in B235694: Diff 527225. · May 31 2023, 4:25 PM

LGTM. See comments with mostly nits. We can probably be more graceful while mapping basic blocks, but this issue can be addressed separately.

bolt/lib/Profile/StaleProfileMatching.cpp
201–202	nit: remove those variables
214–215	likewise
333
466–468	nit:
489–490	nit:
525	Will this work?

This revision is now accepted and ready to land. · Jun 2 2023, 1:58 PM

Amir added a child revision: D152041: [profi] Move out base definitions into SampleProfileInferenceBase header. · Jun 2 2023, 3:32 PM

I'm trying to see if we can reuse SampleProfileInference code as much as possible given it's already templated and used with LLVM IR/MIR (Flow-Sensitive Sample AutoFDO and Context-Sensitive Sample PGO respectively).

bolt/lib/Profile/StaleProfileMatching.cpp
140	What's the primary motivation behind adding a BinaryFunction-specific version instead of using `SampleProfileInference<BT>::initFunction`? Is it the fact that BF can have multiple entry points hence we need a dummy source node, and/or custom handling of EH control flow?
162	What outdegree is used for?
313	Does it make a difference to use BFS instead of DFS if the goal is collecting (un)reachable blocks? There's an overlap in functionality with `SampleProfileInference<BT>::apply` which finds nodes reachable from source and sink using `depth_first_ext` and `inverse_depth_first_ext`.

Amir mentioned this in D152187: [profi][NFC] Get rid of afdo_detail::TypeMap. · Jun 5 2023, 1:10 PM

Amir mentioned this in D152216: [profi][NFC] Refactor SampleProfileInference::initFunction. · Jun 5 2023, 5:26 PM

Amir mentioned this in D152217: [profi][NFC] Refactor SampleProfileInference::apply. · Jun 5 2023, 5:29 PM

suggested (minor) edits

Harbormaster completed remote builds in B236953: Diff 528867. · Jun 6 2023, 8:06 AM

Closed by commit rG44268271f61e: [BOLT] stale profile matching [part 1 out of 2] (authored by spupyrev). · Jun 6 2023, 12:14 PM

This revision was automatically updated to reflect the committed changes.

spupyrev added a commit: rG44268271f61e: [BOLT] stale profile matching [part 1 out of 2].

Amir mentioned this in rGb244a4c4c921: [profi][NFC] Get rid of afdo_detail::TypeMap. · Jun 6 2023, 1:48 PM

Amir mentioned this in rGc6d8b51ba5da: [profi][NFC] Refactor SampleProfileInference::initFunction. · Jun 9 2023, 12:41 PM

Revision Contents

Path

Size

bolt/

include/

bolt/

Core/

BinaryFunction.h

10 lines

Profile/

YAMLProfileReader.h

4 lines

lib/

Passes/

BinaryPasses.cpp

19 lines

Profile/

CMakeLists.txt

2 lines

StaleProfileMatching.cpp

546 lines

YAMLProfileReader.cpp

11 lines

test/

X86/

Inputs/

blarge_profile_stale.yaml

51 lines

reader-stale-yaml.test

38 lines

Diff 528984

bolt/include/bolt/Core/BinaryFunction.h

[378 lines not shown; context: private:]
   float ProfileMatchRatio{0.0f};

   /// Raw branch count for this function in the profile
   uint64_t RawBranchCount{0};

   /// Indicates the type of profile the function is using.
   uint16_t ProfileFlags{PF_NONE};

+  /// True if the function's input profile data has been inaccurate but has
+  /// been adjusted by the profile inference algorithm.
+  bool HasInferredProfile{false};

   /// For functions with mismatched profile we store all call profile
   /// information at a function level (as opposed to tying it to
   /// specific call sites).
   IndirectCallSiteProfile AllCallSites;

   /// Score of the function (estimated number of instructions executed,
   /// according to profile data). -1 if the score has not been calculated yet.
   mutable int64_t FunctionScore{-1};
[1,166 lines not shown; context: if (ExecutionCount == COUNT_NO_PROFILE)]
       ExecutionCount = 0;
     ProfileFlags = Flags;
     ProfileMatchRatio = 1.0f;
   }

   /// Return flags describing a profile for this function.
   uint16_t getProfileFlags() const { return ProfileFlags; }

+  /// Return true if the function's input profile data has been inaccurate but
+  /// has been corrected by the profile inference algorithm.
+  bool hasInferredProfile() const { return HasInferredProfile; }
+
+  void setHasInferredProfile(bool Inferred) { HasInferredProfile = Inferred; }

   void addCFIInstruction(uint64_t Offset, MCCFIInstruction &&Inst) {
     assert(!Instructions.empty());

     // Fix CFI instructions skipping NOPs. We need to fix this because changing
     // CFI state after a NOP, besides being wrong and inaccurate, makes it
     // harder for us to recover this information, since we can create empty BBs
     // with NOPs and then reorder it away.
     // We fix this by moving the CFI instruction just before any NOPs.
[789 lines not shown]

bolt/include/bolt/Profile/YAMLProfileReader.h

[64 lines not shown; context: private:]

   /// Strict matching of a name in a profile to its contents.
   StringMap<yaml::bolt::BinaryFunctionProfile *> ProfileNameToProfile;

   /// Populate \p Function profile with the one supplied in YAML format.
   bool parseFunctionProfile(BinaryFunction &Function,
                             const yaml::bolt::BinaryFunctionProfile &YamlBF);

+  /// Infer function profile from stale data (collected on older binaries).
+  bool inferStaleProfile(BinaryFunction &Function,
+                         const yaml::bolt::BinaryFunctionProfile &YamlBF);

   /// Initialize maps for profile matching.
   void buildNameMaps(std::map<uint64_t, BinaryFunction> &Functions);

   /// Update matched YAML -> BinaryFunction pair.
   void matchProfileToFunction(yaml::bolt::BinaryFunctionProfile &YamlBF,
                               BinaryFunction &BF) {
     if (YamlBF.Id >= YamlProfileToFunction.size())
       YamlProfileToFunction.resize(YamlBF.Id + 1);
[16 lines not shown]

bolt/lib/Passes/BinaryPasses.cpp

[1,335 lines not shown; context: outs() << "Worst average bias observed in " << WorstBiasFunc->getPrintName()]
            << "\n";
     LLVM_DEBUG(WorstBiasFunc->dump());
   }
 }

 void PrintProgramStats::runOnFunctions(BinaryContext &BC) {
   uint64_t NumRegularFunctions = 0;
   uint64_t NumStaleProfileFunctions = 0;
+  uint64_t NumAllStaleFunctions = 0;
+  uint64_t NumInferredFunctions = 0;
   uint64_t NumNonSimpleProfiledFunctions = 0;
   uint64_t NumUnknownControlFlowFunctions = 0;
   uint64_t TotalSampleCount = 0;
   uint64_t StaleSampleCount = 0;
+  uint64_t InferredSampleCount = 0;
   std::vector<const BinaryFunction *> ProfiledFunctions;
   const char *StaleFuncsHeader = "BOLT-INFO: Functions with stale profile:\n";
   for (auto &BFI : BC.getBinaryFunctions()) {
     const BinaryFunction &Function = BFI.second;

     // Ignore PLT functions for stats.
     if (Function.isPLTFunction())
       continue;
[18 lines not shown; context: for (auto &BFI : BC.getBinaryFunctions()) {]
     if (!Function.hasProfile())
       continue;

     uint64_t SampleCount = Function.getRawBranchCount();
     TotalSampleCount += SampleCount;

     if (Function.hasValidProfile()) {
       ProfiledFunctions.push_back(&Function);
+      if (Function.hasInferredProfile()) {
+        ++NumInferredFunctions;
+        InferredSampleCount += SampleCount;
+        ++NumAllStaleFunctions;
+      }
     } else {
       if (opts::ReportStaleFuncs) {
         outs() << StaleFuncsHeader;
         StaleFuncsHeader = "";
         outs() << " " << Function << '\n';
       }
       ++NumStaleProfileFunctions;
       StaleSampleCount += SampleCount;
+      ++NumAllStaleFunctions;
     }
   }
   BC.NumProfiledFuncs = ProfiledFunctions.size();
   BC.NumStaleProfileFuncs = NumStaleProfileFunctions;

   const size_t NumAllProfiledFunctions =
       ProfiledFunctions.size() + NumStaleProfileFunctions;
   outs() << "BOLT-INFO: " << NumAllProfiledFunctions << " out of "
[30 lines not shown; context: if (TotalSampleCount > 0) {]
                 " (possibly stale) profile.\n";
     }
     if (PctStale > opts::StaleThreshold) {
       errs() << "BOLT-ERROR: stale functions exceed specified threshold of "
              << opts::StaleThreshold << "%. Exiting.\n";
       exit(1);
     }
   }
+  if (NumInferredFunctions) {
+    outs() << format("BOLT-INFO: inferred profile for %d (%.2f%% of profiled, "
+                     "%.2f%% of stale) functions responsible for %.2f%% samples"
+                     " (%zu out of %zu)\n",
+                     NumInferredFunctions,
+                     100.0 * NumInferredFunctions / NumAllProfiledFunctions,
+                     100.0 * NumInferredFunctions / NumAllStaleFunctions,
+                     100.0 * InferredSampleCount / TotalSampleCount,
+                     InferredSampleCount, TotalSampleCount);
+  }

   if (const uint64_t NumUnusedObjects = BC.getNumUnusedProfiledObjects()) {
     outs() << "BOLT-INFO: profile for " << NumUnusedObjects
            << " objects was ignored\n";
   }

   if (ProfiledFunctions.size() > 10) {
     if (opts::Verbosity >= 1) {
[408 lines not shown]

bolt/lib/Profile/CMakeLists.txt

 add_llvm_library(LLVMBOLTProfile
   BoltAddressTranslation.cpp
   DataAggregator.cpp
   DataReader.cpp
   Heatmap.cpp
   ProfileReaderBase.cpp
+  StaleProfileMatching.cpp
   YAMLProfileReader.cpp
   YAMLProfileWriter.cpp

   DISABLE_LLVM_LINK_LLVM_DYLIB

   LINK_COMPONENTS
   Support
+  TransformUtils
   )

 target_link_libraries(LLVMBOLTProfile
   PRIVATE
   LLVMBOLTCore
   LLVMBOLTPasses
   LLVMBOLTUtils
   )

bolt/lib/Profile/StaleProfileMatching.cpp

This file was added.

//===- bolt/Profile/StaleProfileMatching.cpp - Profile data matching ----===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// BOLT often has to deal with profiles collected on binaries built from several
// revisions behind release. As a result, a certain percentage of functions is
// considered stale and not optimized. This file implements an ability to match
// profile to functions that are not 100% binary identical, and thus, increasing
// the optimization coverage and boost the performance of applications.
//
// The algorithm consists of two phases: matching and inference:
// - At the matching phase, we try to "guess" as many block and jump counts from
//   the stale profile as possible. To this end, the content of each basic block
//   is hashed and stored in the (yaml) profile. When BOLT optimizes a binary,
//   it computes block hashes and identifies the corresponding entries in the
//   stale profile. It yields a partial profile for every CFG in the binary.
// - At the inference phase, we employ a network flow-based algorithm (profi) to
//   reconstruct "realistic" block and jump counts from the partial profile
//   generated at the first stage. In practice, we don't always produce proper
//   profile data but the majority (e.g., >90%) of CFGs get the correct counts.
//
//===----------------------------------------------------------------------===//

#include "bolt/Core/HashUtilities.h"
#include "bolt/Profile/YAMLProfileReader.h"
#include "llvm/ADT/Hashing.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Transforms/Utils/SampleProfileInference.h"

#include <queue>

using namespace llvm;

namespace opts {

extern cl::OptionCategory BoltOptCategory;

cl::opt<bool>
    InferStaleProfile("infer-stale-profile",
                      cl::desc("Infer counts from stale profile data."),
                      cl::init(false), cl::Hidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingMaxFuncSize(
    "stale-matching-max-func-size",
    cl::desc("The maximum size of a function to consider for inference."),
    cl::init(10000), cl::Hidden, cl::cat(BoltOptCategory));

// Parameters of the profile inference algorithm. The default values are tuned
// on several benchmarks.

cl::opt<bool> StaleMatchingEvenFlowDistribution(
    "stale-matching-even-flow-distribution",
    cl::desc("Try to evenly distribute flow when there are multiple equally "
             "likely options."),
    cl::init(true), cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<bool> StaleMatchingRebalanceUnknown(
    "stale-matching-rebalance-unknown",
    cl::desc("Evenly re-distribute flow among unknown subgraphs."),
    cl::init(false), cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<bool> StaleMatchingJoinIslands(
    "stale-matching-join-islands",
    cl::desc("Join isolated components having positive flow."), cl::init(true),
    cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostBlockInc(
    "stale-matching-cost-block-inc",
    cl::desc("The cost of increasing a block's count by one."), cl::init(110),
    cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostBlockDec(
    "stale-matching-cost-block-dec",
    cl::desc("The cost of decreasing a block's count by one."), cl::init(100),
    cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostBlockEntryInc(
    "stale-matching-cost-block-entry-inc",
    cl::desc("The cost of increasing the entry block's count by one."),
    cl::init(110), cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostBlockEntryDec(
    "stale-matching-cost-block-entry-dec",
    cl::desc("The cost of decreasing the entry block's count by one."),
    cl::init(100), cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostBlockZeroInc(
    "stale-matching-cost-block-zero-inc",
    cl::desc("The cost of increasing a count of zero-weight block by one."),
    cl::init(10), cl::Hidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostBlockUnknownInc(
    "stale-matching-cost-block-unknown-inc",
    cl::desc("The cost of increasing an unknown block's count by one."),
    cl::init(10), cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostJumpInc(
    "stale-matching-cost-jump-inc",
    cl::desc("The cost of increasing a jump's count by one."), cl::init(100),
    cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostJumpFTInc(
    "stale-matching-cost-jump-ft-inc",
    cl::desc("The cost of increasing a fall-through jump's count by one."),
    cl::init(100), cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostJumpDec(
    "stale-matching-cost-jump-dec",
    cl::desc("The cost of decreasing a jump's count by one."), cl::init(110),
    cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostJumpFTDec(
    "stale-matching-cost-jump-ft-dec",
    cl::desc("The cost of decreasing a fall-through jump's count by one."),
    cl::init(110), cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostJumpUnknownInc(
    "stale-matching-cost-jump-unknown-inc",
    cl::desc("The cost of increasing an unknown jump's count by one."),
    cl::init(50), cl::ReallyHidden, cl::cat(BoltOptCategory));

cl::opt<unsigned> StaleMatchingCostJumpUnknownFTInc(
    "stale-matching-cost-jump-unknown-ft-inc",
    cl::desc(
        "The cost of increasing an unknown fall-through jump's count by one."),
    cl::init(5), cl::ReallyHidden, cl::cat(BoltOptCategory));

} // namespace opts

namespace llvm {

namespace bolt {

/// Create a wrapper flow function to use with the profile inference algorithm,

/// and initialize its jumps and metadata.

FlowFunction

createFlowFunction(const BinaryFunction::BasicBlockOrderType &BlockOrder) {

AmirUnsubmitted

Done

Can we return FlowFunction object from this function?

Amir: Can we return FlowFunction object from this function?

FlowFunction Func;

AmirUnsubmitted

Not Done

What's the primary motivation behind adding a BinaryFunction-specific version instead of using SampleProfileInference<BT>::initFunction?
Is it the fact that BF can have multiple entry points hence we need a dummy source node, and/or custom handling of EH control flow?

Amir: What's the primary motivation behind adding a BinaryFunction-specific version instead of using…

// Add a special "dummy" source so that there is always a unique entry point.

// Because of the extra source, for all other blocks in FlowFunction it holds

// that Block.Index == BB->getLayoutIndex() + 1

FlowBlock EntryBlock;

EntryBlock.Index = 0;

Func.Blocks.push_back(EntryBlock);

AmirUnsubmitted

Done

Do we really need this? since we're calling updateLayoutIndices, we have BB->getLayoutIndex().

Amir: Do we really need this? since we're calling updateLayoutIndices, we have BB->getLayoutIndex().

spupyrevAuthorUnsubmitted

Done

great, thanks

spupyrev: great, thanks

// Create FlowBlock for every basic block in the binary function

for (const BinaryBasicBlock *BB : BlockOrder) {

Func.Blocks.emplace_back();

FlowBlock &Block = Func.Blocks.back();

Block.Index = Func.Blocks.size() - 1;

(void)BB;

assert(Block.Index == BB->getLayoutIndex() + 1 &&

"incorrectly assigned basic block index");

}

// Create FlowJump for each jump between basic blocks in the binary function

std::vector<uint64_t> InDegree(Func.Blocks.size(), 0);

for (const BinaryBasicBlock *SrcBB : BlockOrder) {

std::unordered_set<const BinaryBasicBlock *> UniqueSuccs;

AmirUnsubmitted

Done

What outdegree is used for?

Amir: What outdegree is used for?

// Collect regular jumps

for (const BinaryBasicBlock *DstBB : SrcBB->successors()) {

// Ignoring parallel edges

if (UniqueSuccs.find(DstBB) != UniqueSuccs.end())

continue;

Func.Jumps.emplace_back();

FlowJump &Jump = Func.Jumps.back();

Jump.Source = SrcBB->getLayoutIndex() + 1;

Jump.Target = DstBB->getLayoutIndex() + 1;

InDegree[Jump.Target]++;

UniqueSuccs.insert(DstBB);

}

// Collect jumps to landing pads

for (const BinaryBasicBlock *DstBB : SrcBB->landing_pads()) {

// Ignoring parallel edges

if (UniqueSuccs.find(DstBB) != UniqueSuccs.end())

continue;

Func.Jumps.emplace_back();

FlowJump &Jump = Func.Jumps.back();

Jump.Source = SrcBB->getLayoutIndex() + 1;

Jump.Target = DstBB->getLayoutIndex() + 1;

InDegree[Jump.Target]++;

UniqueSuccs.insert(DstBB);

}

// Add dummy edges to the extra sources. If there are multiple entry blocks,

// add an unlikely edge from 0 to the subsequent ones

assert(InDegree[0] == 0 && "dummy entry blocks shouldn't have predecessors");

for (uint64_t I = 1; I < Func.Blocks.size(); I++) {

const BinaryBasicBlock *BB = BlockOrder[I - 1];

if (BB->isEntryPoint() || InDegree[I] == 0) {

Func.Jumps.emplace_back();

FlowJump &Jump = Func.Jumps.back();

maksfbUnsubmitted

Done

Some multi-entry functions will have secondary entry points reachable from the "main" entry point. In that case, their InDegree will be non-zero. Do you want to capture those here as well? The proper way to check if the block is reachable from outside the function is to check isEntryPoint().


spupyrevAuthorUnsubmitted

Done

These dummy entry blocks are designed exactly to capture extra entry points. From my tests on the clang binary, they do _not_ have jumps from the main entry point, and thus, their InDegree is zero. Perhaps, there are some other cases that I don't see on clang?

Anyhow, for blocks with InDegree>0, we don't need to make special adjustments; they will be handled naturally.


maksfbUnsubmitted

Not Done

Secondary entry points normally come from assembly code, perhaps from some Fortran programs as well. HHVM has memcpy written like that: https://github.com/facebook/hhvm/blob/master/hphp/util/memcpy-x64.S


spupyrevAuthorUnsubmitted

Done

You're right, I checked the HHVM binary and modified the condition here. Surprisingly, there are also instances where !BB->isEntryPoint() && InDegree[I] == 0, so we need both conditions here.

Jump.Source = 0;

Jump.Target = I;

if (!BB->isEntryPoint())

Jump.IsUnlikely = true;

maksfbUnsubmitted

Done

nit: remove those variables


}

}

// Create necessary metadata for the flow function

for (FlowJump &Jump : Func.Jumps) {

Func.Blocks.at(Jump.Source).SuccJumps.push_back(&Jump);

Func.Blocks.at(Jump.Target).PredJumps.push_back(&Jump);

}

return Func;

}
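The graph construction above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the BOLT implementation: `SketchJump`, `SketchFlowGraph`, and `buildSketchFlowGraph` are hypothetical names, blocks are represented only by their layout indices, and landing pads are folded into the ordinary successor lists. It shows the two ideas discussed in the thread: every real block is shifted by one to make room for the dummy source at index 0, and a dummy edge is added to each entry point and to each block without predecessors (marked unlikely when the block is not an entry point).

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, simplified stand-in for a jump in the flow graph.
struct SketchJump {
  uint64_t Source;
  uint64_t Target;
  bool IsUnlikely;
};

// Hypothetical, simplified stand-in for the flow function.
struct SketchFlowGraph {
  uint64_t NumBlocks = 0; // includes the dummy source at index 0
  std::vector<SketchJump> Jumps;
};

// Succs[I] lists the layout indices of the successors of block I;
// IsEntry[I] tells whether block I is an entry point of the function.
SketchFlowGraph
buildSketchFlowGraph(const std::vector<std::vector<uint64_t>> &Succs,
                     const std::vector<bool> &IsEntry) {
  SketchFlowGraph G;
  G.NumBlocks = Succs.size() + 1;
  std::vector<uint64_t> InDegree(G.NumBlocks, 0);

  // Real jumps: shift every index by one and ignore parallel edges.
  for (uint64_t Src = 0; Src < Succs.size(); ++Src) {
    std::set<uint64_t> UniqueSuccs;
    for (uint64_t Dst : Succs[Src]) {
      if (!UniqueSuccs.insert(Dst).second)
        continue;
      G.Jumps.push_back({Src + 1, Dst + 1, false});
      ++InDegree[Dst + 1];
    }
  }

  // Dummy jumps from the source (index 0) to entry points and to blocks
  // without predecessors; the latter are marked unlikely.
  for (uint64_t I = 1; I < G.NumBlocks; ++I) {
    if (IsEntry[I - 1] || InDegree[I] == 0)
      G.Jumps.push_back({0, I, !IsEntry[I - 1]});
  }
  return G;
}
```

For a function with blocks 0 -> 1 (twice), 2 -> 1, and a single entry point at block 0, the sketch produces two real jumps and two dummy jumps; the dummy jump to the predecessor-less block 2 is the unlikely one.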

/// Assign initial block/jump weights based on the stale profile data. The goal

/// is to extract as much information from the stale profile as possible. Here

maksfbUnsubmitted

Done

likewise


/// we assume that each basic block is specified via a hash value computed from

/// its content and the hashes of the unchanged basic blocks stay the same

/// across different revisions of the binary.

/// Whenever there is a count in the profile with the hash corresponding to one

/// of the basic blocks in the binary, the count is "matched" to the block.

/// Similarly, if both the source and the target of a count in the profile are

/// matched to a jump in the binary, the count is recorded in CFG.

void matchWeightsByHashes(const BinaryFunction::BasicBlockOrderType &BlockOrder,

const yaml::bolt::BinaryFunctionProfile &YamlBF,

FlowFunction &Func) {

assert(Func.Blocks.size() == BlockOrder.size() + 1);

// Initialize stale matcher

DenseMap<uint64_t, std::vector<FlowBlock *>> HashToBlocks;

for (uint64_t I = 0; I < BlockOrder.size(); I++) {

const BinaryBasicBlock *BB = BlockOrder[I];

assert(BB->getHash() != 0 && "empty hash of BinaryBasicBlock");

HashToBlocks[BB->getHash()].push_back(&Func.Blocks[I + 1]);

}

// Index in yaml profile => corresponding (matched) block

DenseMap<uint64_t, const FlowBlock *> MatchedBlocks;

// Match blocks from the profile to the blocks in CFG

for (const yaml::bolt::BinaryBasicBlockProfile &YamlBB : YamlBF.Blocks) {

assert(YamlBB.Hash != 0 && "empty hash of BinaryBasicBlockProfile");

auto It = HashToBlocks.find(YamlBB.Hash);

if (It != HashToBlocks.end()) {

const FlowBlock *MatchedBlock = It->second.front();

MatchedBlocks[YamlBB.Index] = MatchedBlock;

}

}

// Match jumps from the profile to the jumps from CFG

std::vector<uint64_t> OutWeight(Func.Blocks.size(), 0);

std::vector<uint64_t> InWeight(Func.Blocks.size(), 0);

for (const yaml::bolt::BinaryBasicBlockProfile &YamlBB : YamlBF.Blocks) {

for (const yaml::bolt::SuccessorInfo &YamlSI : YamlBB.Successors) {

if (YamlSI.Count == 0)

continue;

// Try to find the jump for a given (src, dst) pair from the profile and

// assign the jump weight based on the profile count

const uint64_t SrcIndex = YamlBB.Index;

const uint64_t DstIndex = YamlSI.Index;

const FlowBlock *MatchedSrcBlock =

MatchedBlocks.find(SrcIndex) != MatchedBlocks.end()

? MatchedBlocks[SrcIndex]

: nullptr;

const FlowBlock *MatchedDstBlock =

maksfbUnsubmitted

Done

// assign the jump weight based on the profile count

- uint64_t SrcIndex = YamlBB.Index;

- uint64_t DstIndex = YamlSI.Index;

+ const uint64_t SrcIndex = YamlBB.Index;

+ const uint64_t DstIndex = YamlSI.Index;

const FlowBlock *MatchedSrcBlock =

nit:


MatchedBlocks.find(DstIndex) != MatchedBlocks.end()

? MatchedBlocks[DstIndex]

: nullptr;

if (MatchedSrcBlock != nullptr && MatchedDstBlock != nullptr) {

// Find a jump between the two blocks

FlowJump *Jump = nullptr;

for (FlowJump *SuccJump : MatchedSrcBlock->SuccJumps) {

if (SuccJump->Target == MatchedDstBlock->Index) {

Jump = SuccJump;

break;

}

}

// Assign the weight, if the corresponding jump is found

if (Jump != nullptr) {

Jump->Weight = YamlSI.Count;

Jump->HasUnknownWeight = false;

}

}

// Assign the weight for the src block, if it is found

if (MatchedSrcBlock != nullptr)

OutWeight[MatchedSrcBlock->Index] += YamlSI.Count;

// Assign the weight for the dst block, if it is found

if (MatchedDstBlock != nullptr)

AmirUnsubmitted

Done

// Assign the weight for the src block, if it is found

- if (MatchedSrcBlock != nullptr) {

+ if (MatchedSrcBlock != nullptr)

OutWeight[MatchedSrcBlock->Index] += YamlSI.Count;

- }

// Assign the weight for the dst block, if it is found


InWeight[MatchedDstBlock->Index] += YamlSI.Count;

}

}

AmirUnsubmitted

Done

// Assign the weight for the dst block, if it is found

- if (MatchedDstBlock != nullptr) {

+ if (MatchedDstBlock != nullptr)

InWeight[MatchedDstBlock->Index] += YamlSI.Count;

- }

+ }


// Assign block counts based on in-/out- jumps

for (FlowBlock &Block : Func.Blocks) {

if (OutWeight[Block.Index] == 0 && InWeight[Block.Index] == 0) {

assert(Block.HasUnknownWeight && "unmatched block with positive count");

continue;

}

Block.HasUnknownWeight = false;

Block.Weight = std::max(OutWeight[Block.Index], InWeight[Block.Index]);

}

}
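The matching step can be illustrated with a small standalone sketch. It is a simplification under stated assumptions: `ProfileEdge` and `matchByHash` are hypothetical names, profile jumps are keyed directly by the hashes of their endpoints rather than by profile block indices, and per-jump weights are omitted. As in the code above, a hash that occurs several times resolves to the first block carrying it, and a block's weight is the maximum of its summed incoming and outgoing counts. Unmatched blocks simply keep weight 0 here, whereas the real code leaves them flagged as having unknown weight.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified profile record: one jump count keyed by the
// hashes of its source and destination blocks.
struct ProfileEdge {
  uint64_t SrcHash;
  uint64_t DstHash;
  uint64_t Count;
};

// Returns one weight per block of the current binary, indexed like
// BlockHashes.
std::vector<uint64_t> matchByHash(const std::vector<uint64_t> &BlockHashes,
                                  const std::vector<ProfileEdge> &Edges) {
  // Hash => first block with that hash (emplace keeps the first entry).
  std::unordered_map<uint64_t, std::size_t> HashToBlock;
  for (std::size_t I = 0; I < BlockHashes.size(); ++I)
    HashToBlock.emplace(BlockHashes[I], I);

  // Accumulate counts on whichever endpoints could be matched.
  std::vector<uint64_t> OutWeight(BlockHashes.size(), 0);
  std::vector<uint64_t> InWeight(BlockHashes.size(), 0);
  for (const ProfileEdge &E : Edges) {
    auto Src = HashToBlock.find(E.SrcHash);
    auto Dst = HashToBlock.find(E.DstHash);
    if (Src != HashToBlock.end())
      OutWeight[Src->second] += E.Count;
    if (Dst != HashToBlock.end())
      InWeight[Dst->second] += E.Count;
  }

  // A block's weight is the larger of its in- and out-weights.
  std::vector<uint64_t> Weight(BlockHashes.size(), 0);
  for (std::size_t I = 0; I < Weight.size(); ++I)
    Weight[I] = std::max(OutWeight[I], InWeight[I]);
  return Weight;
}
```

Note that an endpoint contributes to a block weight even when the other endpoint of the jump is unknown in the current binary, which is how a partially stale profile still yields a partial set of counts.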

/// The function finds all blocks that are (i) reachable from the Entry block

/// and (ii) do not have a path to an exit, and marks all such blocks 'cold'

/// so that profi does not send any flow to such blocks.

void preprocessUnreachableBlocks(FlowFunction &Func) {

const uint64_t NumBlocks = Func.Blocks.size();

// Start bfs from the source

std::queue<uint64_t> Queue;

std::vector<bool> VisitedEntry(NumBlocks, false);

for (uint64_t I = 0; I < NumBlocks; I++) {

AmirUnsubmitted

Not Done

Can we use llvm/ADT/GraphTraits + BreadthFirstIterator to avoid reimplementing this?


spupyrevAuthorUnsubmitted

Done

If I understand correctly, that would require adding some boilerplate code for iterators over the nodes and their neighbors? In addition, I need to run an "inverse" BFS starting from all reachable sinks... It certainly can be done, but I'm not sure if the implementation would be any simpler. Is there a good example of how to implement something similar with built-in types?


AmirUnsubmitted

Not Done

> If I understand correctly, that would require adding some boilerplate code for iterators over the nodes and their neighbors? In addition, I need to run an "inverse" BFS starting from all reachable sinks... It certainly can be done, but I'm not sure if the implementation would be any simpler. Is there a good example of how to implement something similar with built-in types?

No, I don't have any good examples of using ADT algorithms with built-in types. LLVM classes use them throughout, with examples in llvm/unittests/ADT/. I tried using them with our BinaryBasicBlock but we implement GraphTraits interfaces. I guess it's fine to leave the BFS implementation as-is if FlowFunction doesn't already use any of ADT stuff, but in general it's a good idea to reuse LLVM's algorithms if you end up reimplementing them in several places.


AmirUnsubmitted

Not Done

Does it make a difference to use BFS instead of DFS if the goal is collecting (un)reachable blocks?

There's an overlap in functionality with SampleProfileInference<BT>::apply which finds nodes reachable from source and sink using depth_first_ext and inverse_depth_first_ext.


FlowBlock &Block = Func.Blocks[I];

if (Block.isEntry()) {

Queue.push(I);

VisitedEntry[I] = true;

break;

}

}

while (!Queue.empty()) {

const uint64_t Src = Queue.front();

Queue.pop();

for (FlowJump *Jump : Func.Blocks[Src].SuccJumps) {

const uint64_t Dst = Jump->Target;

if (!VisitedEntry[Dst]) {

Queue.push(Dst);

VisitedEntry[Dst] = true;

}

}

}

// Start bfs from all sinks

maksfbUnsubmitted

Done

for (FlowJump *Jump : Func.Blocks[Src].SuccJumps) {

- uint64_t Dst = Jump->Target;

+ const uint64_t Dst = Jump->Target;

if (!VisitedEntry[Dst]) {


std::vector<bool> VisitedExit(NumBlocks, false);

for (uint64_t I = 0; I < NumBlocks; I++) {

FlowBlock &Block = Func.Blocks[I];

if (Block.isExit() && VisitedEntry[I]) {

Queue.push(I);

VisitedExit[I] = true;

}

}

while (!Queue.empty()) {

const uint64_t Src = Queue.front();

Queue.pop();

for (FlowJump *Jump : Func.Blocks[Src].PredJumps) {

const uint64_t Dst = Jump->Source;

if (!VisitedExit[Dst]) {

Queue.push(Dst);

VisitedExit[Dst] = true;

maksfbUnsubmitted

Done

while (!Queue.empty()) {

- uint64_t Src = Queue.front();

+ const uint64_t Src = Queue.front();

Queue.pop();

nit:


}

}

}

maksfbUnsubmitted

Done

for (FlowJump *Jump : Func.Blocks[Src].PredJumps) {

- uint64_t Dst = Jump->Source;

+ const uint64_t Dst = Jump->Source;

if (!VisitedExit[Dst]) {

nit:


// Make all blocks of zero weight so that flow is not sent

for (uint64_t I = 0; I < NumBlocks; I++) {

FlowBlock &Block = Func.Blocks[I];

if (Block.Weight == 0)

continue;

if (!VisitedEntry[I] || !VisitedExit[I]) {

Block.Weight = 0;

Block.HasUnknownWeight = true;

Block.IsUnlikely = true;

for (FlowJump *Jump : Block.SuccJumps) {

if (Jump->Source == Block.Index && Jump->Target == Block.Index) {

Jump->Weight = 0;

Jump->HasUnknownWeight = true;

Jump->IsUnlikely = true;

}

}

}

}

}
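The two traversals above (forward from the entry, then backward from the reachable exits) can be sketched on a plain adjacency list. This is an illustration under stated assumptions, not the reviewed code: `bfsFrom` and `onEntryToExitPath` are hypothetical names, block 0 is taken to be the only entry, and an exit is taken to be any block without successors. As discussed in the thread, BFS is used here, but a DFS would collect the same set of blocks.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Multi-source BFS over an adjacency list; returns the visited flags.
std::vector<bool> bfsFrom(const std::vector<std::vector<uint64_t>> &Adj,
                          const std::vector<uint64_t> &Starts) {
  std::vector<bool> Visited(Adj.size(), false);
  std::queue<uint64_t> Queue;
  for (uint64_t S : Starts) {
    if (!Visited[S]) {
      Visited[S] = true;
      Queue.push(S);
    }
  }
  while (!Queue.empty()) {
    const uint64_t Src = Queue.front();
    Queue.pop();
    for (uint64_t Dst : Adj[Src]) {
      if (!Visited[Dst]) {
        Visited[Dst] = true;
        Queue.push(Dst);
      }
    }
  }
  return Visited;
}

// Marks the blocks that lie on some path from the entry (block 0) to an exit.
std::vector<bool>
onEntryToExitPath(const std::vector<std::vector<uint64_t>> &Succs) {
  const uint64_t N = Succs.size();
  if (N == 0)
    return {};

  // Build the reversed graph for the backward traversal.
  std::vector<std::vector<uint64_t>> Preds(N);
  for (uint64_t Src = 0; Src < N; ++Src)
    for (uint64_t Dst : Succs[Src])
      Preds[Dst].push_back(Src);

  // Forward pass from the entry.
  const std::vector<uint64_t> Entry = {0};
  const std::vector<bool> FromEntry = bfsFrom(Succs, Entry);

  // Backward pass from every exit that the entry can reach.
  std::vector<uint64_t> Exits;
  for (uint64_t I = 0; I < N; ++I)
    if (Succs[I].empty() && FromEntry[I])
      Exits.push_back(I);
  const std::vector<bool> ToExit = bfsFrom(Preds, Exits);

  std::vector<bool> Live(N, false);
  for (uint64_t I = 0; I < N; ++I)
    Live[I] = FromEntry[I] && ToExit[I];
  return Live;
}
```

Blocks for which the result is false correspond to the ones the function above resets to zero weight and marks unlikely, for example a block that only loops to itself and never reaches an exit.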

/// Decide if stale profile matching can be applied for a given function.

/// Currently we skip inference for (very) large instances and for instances

/// having "unexpected" control flow (e.g., having no sink basic blocks).

bool canApplyInference(const FlowFunction &Func) {

if (Func.Blocks.size() > opts::StaleMatchingMaxFuncSize)

return false;

bool HasExitBlocks = llvm::any_of(

Func.Blocks, [&](const FlowBlock &Block) { return Block.isExit(); });

if (!HasExitBlocks)

return false;

return true;

}

AmirUnsubmitted

Done

Please use llvm::any_of


/// Apply the profile inference algorithm for a given flow function.

void applyInference(FlowFunction &Func) {

ProfiParams Params;

// Set the params from the command-line flags.

Params.EvenFlowDistribution = opts::StaleMatchingEvenFlowDistribution;

Params.RebalanceUnknown = opts::StaleMatchingRebalanceUnknown;

Params.JoinIslands = opts::StaleMatchingJoinIslands;

Params.CostBlockInc = opts::StaleMatchingCostBlockInc;

Params.CostBlockDec = opts::StaleMatchingCostBlockDec;

Params.CostBlockEntryInc = opts::StaleMatchingCostBlockEntryInc;

Params.CostBlockEntryDec = opts::StaleMatchingCostBlockEntryDec;

Params.CostBlockZeroInc = opts::StaleMatchingCostBlockZeroInc;

Params.CostBlockUnknownInc = opts::StaleMatchingCostBlockUnknownInc;

Params.CostJumpInc = opts::StaleMatchingCostJumpInc;

Params.CostJumpFTInc = opts::StaleMatchingCostJumpFTInc;

Params.CostJumpDec = opts::StaleMatchingCostJumpDec;

Params.CostJumpFTDec = opts::StaleMatchingCostJumpFTDec;

Params.CostJumpUnknownInc = opts::StaleMatchingCostJumpUnknownInc;

Params.CostJumpUnknownFTInc = opts::StaleMatchingCostJumpUnknownFTInc;

applyFlowInference(Params, Func);

}

/// Collect inferred counts from the flow function and update annotations in

/// the binary function.

void assignProfile(BinaryFunction &BF,

const BinaryFunction::BasicBlockOrderType &BlockOrder,

FlowFunction &Func) {

BinaryContext &BC = BF.getBinaryContext();

assert(Func.Blocks.size() == BlockOrder.size() + 1);

for (uint64_t I = 0; I < BlockOrder.size(); I++) {

maksfbUnsubmitted

Done

Would assignProfile() describe the function more accurately?


FlowBlock &Block = Func.Blocks[I + 1];

BinaryBasicBlock *BB = BlockOrder[I];

// Update block's count

BB->setExecutionCount(Block.Flow);

// Update jump counts: (i) clean existing counts and then (ii) set new ones

auto BI = BB->branch_info_begin();

for (const BinaryBasicBlock *DstBB : BB->successors()) {

(void)DstBB;

BI->Count = 0;

BI->MispredictedCount = 0;

++BI;

}

for (FlowJump *Jump : Block.SuccJumps) {

if (Jump->IsUnlikely)

continue;

if (Jump->Flow == 0)

continue;

BinaryBasicBlock &SuccBB = *BlockOrder[Jump->Target - 1];

// Check if the edge corresponds to a regular jump or a landing pad

if (BB->getSuccessor(SuccBB.getLabel())) {

BinaryBasicBlock::BinaryBranchInfo &BI = BB->getBranchInfo(SuccBB);

BI.Count += Jump->Flow;

} else {

BinaryBasicBlock *LP = BB->getLandingPad(SuccBB.getLabel());

if (LP && LP->getKnownExecutionCount() < Jump->Flow)

LP->setExecutionCount(Jump->Flow);

}

}

// Update call-site annotations

auto setOrUpdateAnnotation = [&](MCInst &Instr, StringRef Name,

uint64_t Count) {

if (BC.MIB->hasAnnotation(Instr, Name))

BC.MIB->removeAnnotation(Instr, Name);

// Do not add zero-count annotations

if (Count == 0)

return;

BC.MIB->addAnnotation(Instr, Name, Count);

};

for (MCInst &Instr : *BB) {

// Ignore pseudo instructions

if (BC.MIB->isPseudo(Instr))

maksfbUnsubmitted

Done

uint64_t Count) {

- if (BC.MIB->hasAnnotation(Instr, Name)) {

+ if (BC.MIB->hasAnnotation(Instr, Name))

BC.MIB->removeAnnotation(Instr, Name);

- }

// Do not add zero-count annotations

nit:


continue;

// Ignore jump tables

const MCInst *LastInstr = BB->getLastNonPseudoInstr();

if (BC.MIB->getJumpTable(*LastInstr) && LastInstr == &Instr)

continue;

if (BC.MIB->isIndirectCall(Instr) || BC.MIB->isIndirectBranch(Instr)) {

auto &ICSP = BC.MIB->getOrCreateAnnotationAs<IndirectCallSiteProfile>(

Instr, "CallProfile");

if (!ICSP.empty()) {

// Try to evenly distribute the counts among the call sites

const uint64_t TotalCount = Block.Flow;

const uint64_t NumSites = ICSP.size();

for (uint64_t Idx = 0; Idx < ICSP.size(); Idx++) {

IndirectCallProfile &CSP = ICSP[Idx];

uint64_t CountPerSite = TotalCount / NumSites;

// When counts cannot be exactly distributed, increase by 1 the

// counts of the first (TotalCount % NumSites) call sites

if (Idx < TotalCount % NumSites)

CountPerSite++;

CSP.Count = CountPerSite;

}

maksfbUnsubmitted

Done

// Try to evenly distribute the counts among the call sites

- uint64_t TotalCount = Block.Flow;

- uint64_t NumSites = ICSP.size();

+ const uint64_t TotalCount = Block.Flow;

+ const uint64_t NumSites = ICSP.size();

for (uint64_t Idx = 0; Idx < ICSP.size(); Idx++) {

nit:


} else {

ICSP.emplace_back(nullptr, Block.Flow, 0);

}

} else if (BC.MIB->getConditionalTailCall(Instr)) {

// We don't know exactly the number of times the conditional tail call

// is executed; conservatively, setting it to the count of the block

setOrUpdateAnnotation(Instr, "CTCTakenCount", Block.Flow);

BC.MIB->removeAnnotation(Instr, "CTCMispredCount");

} else if (BC.MIB->isCall(Instr)) {

setOrUpdateAnnotation(Instr, "Count", Block.Flow);

}

}

}

// Update function's execution count and mark the function inferred.

BF.setExecutionCount(Func.Blocks[0].Flow);

AmirUnsubmitted

Done

It's just removing CTCMispredCount annotation if it exists.


spupyrevAuthorUnsubmitted

Done

which is exactly what we're trying to accomplish, no?


AmirUnsubmitted

Done

I think it's better to be explicit and use removeAnnotation directly.


BF.setHasInferredProfile(true);

}
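The even distribution of a block count over the entries of an indirect call profile, as done in the loop above, comes down to integer division with the remainder handed out one unit at a time. A minimal sketch, where `splitEvenly` is a hypothetical helper name and not part of BOLT:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Splits TotalCount into NumSites parts that differ by at most one; the
// first (TotalCount % NumSites) parts receive the extra unit.
std::vector<uint64_t> splitEvenly(uint64_t TotalCount, uint64_t NumSites) {
  std::vector<uint64_t> Counts(NumSites, 0);
  if (NumSites == 0)
    return Counts; // nothing to distribute to; avoids division by zero
  for (uint64_t Idx = 0; Idx < NumSites; ++Idx) {
    Counts[Idx] = TotalCount / NumSites;
    if (Idx < TotalCount % NumSites)
      ++Counts[Idx];
  }
  return Counts;
}
```

For example, a block count of 10 split over 3 entries gives 4, 3, 3, so the parts always sum back to the block count.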

bool YAMLProfileReader::inferStaleProfile(

BinaryFunction &BF, const yaml::bolt::BinaryFunctionProfile &YamlBF) {

// Make sure that block indices and hashes are up to date

BF.getLayout().updateLayoutIndices();

BF.computeBlockHashes();

const BinaryFunction::BasicBlockOrderType BlockOrder(

BF.getLayout().block_begin(), BF.getLayout().block_end());

// Create a wrapper flow function to use with the profile inference algorithm

FlowFunction Func = createFlowFunction(BlockOrder);

maksfbUnsubmitted

Done

Does the algorithm depend on the layout? I.e. will inferred profile be different if we supply a different order of basic blocks?


spupyrevAuthorUnsubmitted

Done

In theory, the order shouldn't matter much. Here we updateLayoutIndices only to be able to index the blocks.

In practice however, there are "almost identical" basic blocks, especially with aggressive inlining. Such blocks have identical bodies, neighbors with identical bodies etc. There is no way to locally determine which blocks belong to which profile item. In order to break ties in such cases, I decided to simply use block addresses (that is, offsets) as a tie breaking rule. It doesn't happen too often but without the tie breaking rule the inference quality is slightly worse. (Using block indices in the layout also works here but provides a tiny regression in comparison to addresses)


maksfbUnsubmitted

Not Done

I see. So then BlockOrder is used only for mapping blocks in BinaryFunction to FlowFunction and back?


spupyrevAuthorUnsubmitted

Done

Correct


// Match as many block/jump counts from the stale profile as possible

matchWeightsByHashes(BlockOrder, YamlBF, Func);

// Adjust the flow function by marking unreachable blocks Unlikely so that

maksfbUnsubmitted

Done

BF.computeBlockHashes();

- BinaryFunction::BasicBlockOrderType BlockOrder(BF.getLayout().block_begin(),

+ const BinaryFunction::BasicBlockOrderType BlockOrder(BF.getLayout().block_begin(),

BF.getLayout().block_end());

Will this work?


// they don't get any counts assigned

preprocessUnreachableBlocks(Func);

// Check if profile inference can be applied for the instance

if (!canApplyInference(Func))

return false;

// Apply the profile inference algorithm

applyInference(Func);

// Collect inferred counts and update function annotations

assignProfile(BF, BlockOrder, Func);

// As of now, we always mark the binary function having "correct" profile.

// In the future, we may discard the results for instances with poor inference

// metrics and keep such functions un-optimized.

return true;

}

} // end namespace bolt

} // end namespace llvm

bolt/lib/Profile/YAMLProfileReader.cpp

[...]
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"

using namespace llvm;		using namespace llvm;

namespace opts {		namespace opts {

extern cl::opt<unsigned> Verbosity;		extern cl::opt<unsigned> Verbosity;
extern cl::OptionCategory BoltOptCategory;		extern cl::OptionCategory BoltOptCategory;
		extern cl::opt<bool> InferStaleProfile;

static llvm::cl::opt<bool>		static llvm::cl::opt<bool>
IgnoreHash("profile-ignore-hash",		IgnoreHash("profile-ignore-hash",
cl::desc("ignore hash while reading function profile"),		cl::desc("ignore hash while reading function profile"),
cl::Hidden, cl::cat(BoltOptCategory));		cl::Hidden, cl::cat(BoltOptCategory));
}		}

namespace llvm {		namespace llvm {
[...]		[...] bool YAMLProfileReader::parseFunctionProfile(
if (ProfileMatched)		if (ProfileMatched)
BF.markProfiled(YamlBP.Header.Flags);		BF.markProfiled(YamlBP.Header.Flags);

if (!ProfileMatched && opts::Verbosity >= 1)		if (!ProfileMatched && opts::Verbosity >= 1)
errs() << "BOLT-WARNING: " << MismatchedBlocks << " blocks, "		errs() << "BOLT-WARNING: " << MismatchedBlocks << " blocks, "
<< MismatchedCalls << " calls, and " << MismatchedEdges		<< MismatchedCalls << " calls, and " << MismatchedEdges
<< " edges in profile did not match function " << BF << '\n';		<< " edges in profile did not match function " << BF << '\n';

		if (!ProfileMatched && opts::InferStaleProfile) {
		if (opts::Verbosity >= 1)
		outs() << "BOLT-INFO: applying profile inference for "
		<< "\"" << BF.getPrintName() << "\"\n";
		if (inferStaleProfile(BF, YamlBF)) {
		ProfileMatched = true;
		BF.markProfiled(YamlBP.Header.Flags);
		}
		}

return ProfileMatched;		return ProfileMatched;
}		}

Error YAMLProfileReader::preprocessProfile(BinaryContext &BC) {		Error YAMLProfileReader::preprocessProfile(BinaryContext &BC) {
ErrorOr<std::unique_ptr<MemoryBuffer>> MB =		ErrorOr<std::unique_ptr<MemoryBuffer>> MB =
MemoryBuffer::getFileOrSTDIN(Filename);		MemoryBuffer::getFileOrSTDIN(Filename);
if (std::error_code EC = MB.getError()) {		if (std::error_code EC = MB.getError()) {
errs() << "ERROR: cannot open " << Filename << ": " << EC.message() << "\n";		errs() << "ERROR: cannot open " << Filename << ": " << EC.message() << "\n";
[...]

bolt/test/X86/Inputs/blarge_profile_stale.yaml

This file was added.

---
header:
  profile-version: 1
  binary-name: 'reader-yaml.test.tmp.exe'
  binary-build-id: '<unknown>'
  profile-flags: [ lbr ]
  profile-origin: branch profile reader
  profile-events: ''
functions:
  - name: SolveCubic
    fid: 6
    hash: 0xC6E9098E973BBE19
    exec: 151
    nblocks: 18
    blocks:
      - bid: 0
        insns: 43
        hash: 0xD2411AC186118199
        exec: 151
        succ: [ { bid: 1, cnt: 4, mis: 2 }, { bid: 11, cnt: 0 } ]
      - bid: 1
        insns: 7
        hash: 0xDF0C9CC1FEAA70C3
        succ: [ { bid: 10, cnt: 0 }, { bid: 2, cnt: 0 } ]
      - bid: 13
        insns: 26
        hash: 0xF05DC5524E99E56F
        succ: [ { bid: 15, cnt: 89 }, { bid: 14, cnt: 0 } ]
      - bid: 15
        insns: 9
        hash: 0xB2E8338276A9834E
  - name: usqrt
    fid: 7
    hash: 0x8B62B1F9AD81EA35
    exec: 20
    nblocks: 6
    blocks:
      - bid: 0
        insns: 4
        hash: 0xE3FEB842A6548CCF
        exec: 20
        succ: [ { bid: 1, cnt: 0 } ]
      - bid: 1
        insns: 9
        hash: 0x85948FF2924613B7
        succ: [ { bid: 3, cnt: 320, mis: 171 }, { bid: 2, cnt: 0 } ]
      - bid: 3
        insns: 2
        hash: 0x41D8DB2D2B01F411
        succ: [ { bid: 1, cnt: 300, mis: 33 }, { bid: 4, cnt: 20 } ]
...

bolt/test/X86/reader-stale-yaml.test

This file was added.

				# This script checks that YamlProfileReader in llvm-bolt is reading data
				# correctly and stale data is corrected.

				RUN: yaml2obj %p/Inputs/blarge.yaml &> %t.exe
				RUN: llvm-bolt %t.exe -o /dev/null --b %p/Inputs/blarge_profile_stale.yaml --print-cfg --print-only=usqrt --infer-stale-profile=1 --profile-ignore-hash=1 \
				RUN: 2>&1 | FileCheck %s -check-prefix=CHECK

				# Verify that yaml reader works as expected.
				CHECK: pre-processing profile using YAML profile reader

				# Verify the inferred counts of "usqrt" that has stale profile:
				# - the function has nblocks=6 in the profile, which makes it stale
				# - block with bid=0 has an incorrect (missing) count, which is inferred
				CHECK: Binary Function "usqrt" after building cfg {
				CHECK: State : CFG constructed
				CHECK: Address : 0x401170
				CHECK: Size : 0x43
				CHECK: Section : .text
				CHECK: IsSimple : 1
				CHECK: BB Count : 5
				CHECK: Exec Count : 20
				CHECK: Branch Count: 640
				CHECK: }

				# Verify block counts.
				CHECK: .LBB01 (4 instructions, align : 1)
				CHECK: Successors: .Ltmp[[#BB13:]] (mispreds: 0, count: 20)
				CHECK: .Ltmp[[#BB13:]] (9 instructions, align : 1)
				CHECK: Successors: .Ltmp[[#BB12:]] (mispreds: 0, count: 320), .LFT[[#BB0:]] (mispreds: 0, count: 0)
				CHECK: .LFT[[#BB0:]] (2 instructions, align : 1)
				CHECK: Successors: .Ltmp[[#BB12:]] (mispreds: 0, count: 0)
				CHECK: .Ltmp[[#BB12:]] (2 instructions, align : 1)
				CHECK: Successors: .Ltmp[[#BB13:]] (mispreds: 0, count: 300), .LFT[[#BB1:]] (mispreds: 0, count: 20)
				CHECK: .LFT[[#BB1:]] (2 instructions, align : 1)

				# Check the overall inference stats.
				CHECK: 2 out of 7 functions in the binary (28.6%) have non-empty execution profile
				CHECK: inferred profile for 1 (50.00% of profiled, 100.00% of stale) functions responsible for 87.31% samples (640 out of 733)


Revision Contents

Diff 528984

bolt/include/bolt/Core/BinaryFunction.h

bolt/include/bolt/Profile/YAMLProfileReader.h

bolt/lib/Passes/BinaryPasses.cpp

bolt/lib/Profile/CMakeLists.txt

bolt/lib/Profile/StaleProfileMatching.cpp

bolt/lib/Profile/YAMLProfileReader.cpp

bolt/test/X86/Inputs/blarge_profile_stale.yaml

bolt/test/X86/reader-stale-yaml.test
