This is an archive of the discontinued LLVM Phabricator instance.

Differential D146661

[BOLT] stale profile matching [part 2 out of 2]
ClosedPublic

Authored by spupyrev on Mar 22 2023, 2:26 PM.

Details

Reviewers

rafauler
Amir
maksfb

Commits

rG2316a10fe59a: [BOLT] stale profile matching [part 2 out of 2]

Summary

This is the first "serious" version of stale profile matching in BOLT. This diff
extends the hash computation for basic blocks so that we can apply fuzzy
hash-based matching. The idea is to compute several "versions" of a hash value
for a basic block. A loose version of the hash (computed by ignoring instruction
operands) allows matching blocks in functions whose content has changed,
while stricter hash values (based on instruction opcodes with operands, and
even on hashes of the block's successors/predecessors) allow resolving
collisions. To save space and build time, the individual hash components
are blended into a single uint64_t.
There are likely numerous ways to improve the hash computation, but even this
simple variant already provides significant perf benefits.
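
To make the "blended into a single uint64_t" idea concrete, here is a minimal standalone sketch. It assumes the same layout as combineHashes/parseHashes in this diff (Offset in the lowest 16 bits, then OpcodeHash, InstrHash, and NeighborHash in the highest 16 bits). The type and function names (BlendedHashSketch, pack, unpack) are hypothetical and are not part of BOLT.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical illustration type; the field names follow the summary above.
struct BlendedHashSketch {
  uint16_t Offset = 0;       // offset of the block from the function start
  uint16_t OpcodeHash = 0;   // loose hash: opcodes only
  uint16_t InstrHash = 0;    // strict hash: opcodes and operands
  uint16_t NeighborHash = 0; // loose hash mixed with neighbors' loose hashes
};

// Pack the four 16-bit components into one 64-bit value.
// Offset occupies bits 0..15, NeighborHash occupies bits 48..63.
inline uint64_t pack(const BlendedHashSketch &H) {
  return uint64_t(H.Offset) | (uint64_t(H.OpcodeHash) << 16) |
         (uint64_t(H.InstrHash) << 32) | (uint64_t(H.NeighborHash) << 48);
}

// Split a 64-bit blended value back into its four components.
inline BlendedHashSketch unpack(uint64_t V) {
  BlendedHashSketch H;
  H.Offset = uint16_t(V & 0xffff);
  H.OpcodeHash = uint16_t((V >> 16) & 0xffff);
  H.InstrHash = uint16_t((V >> 32) & 0xffff);
  H.NeighborHash = uint16_t((V >> 48) & 0xffff);
  return H;
}
```

Under this assumed layout, a block hash such as 0x587e93788b970010 (one of the values in the updated test input of this diff) decodes to Offset 0x0010, OpcodeHash 0x8b97, InstrHash 0x9378, and NeighborHash 0x587e.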

Perf testing on the clang binary: we collect a profile on clang-10 and use it
to optimize clang-11 (with ~1 year of commits in between). We then compare:

stale_clang (clang-11 optimized with a profile collected on clang-10, with infer-stale-profile=0)
opt_clang (clang-11 optimized with a profile collected on clang-11)
infer_clang (clang-11 optimized with a profile collected on clang-10, with infer-stale-profile=1)

LTO-only mode:
stale_clang vs opt_clang: task-clock [delta(%): 9.4252 ± 1.6582, p-value: 0.000002]
(That is, there is a ~9.4% perf regression)
infer_clang vs opt_clang: task-clock [delta(%): 2.1834 ± 1.8158, p-value: 0.040702]
(That is, the regression is reduced to ~2%)
Related BOLT logs:

BOLT-INFO: identified 2114 (18.61%) stale functions responsible for 30.96% samples
BOLT-INFO: inferred profile for 2101 (18.52% of all profiled) functions responsible for 30.95% samples

LTO+AutoFDO mode:
stale_clang vs opt_clang: task-clock [delta(%): 19.1293 ± 1.4131, p-value: 0.000002]
infer_clang vs opt_clang: task-clock [delta(%): 7.4364 ± 1.3343, p-value: 0.000002]
Related BOLT logs:

BOLT-INFO: identified 5452 (50.27%) stale functions responsible for 85.34% samples
BOLT-INFO: inferred profile for 5442 (50.23% of all profiled) functions responsible for 85.33% samples
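
The matching strategy described in the summary (find candidates by the loose opcode hash, then resolve collisions with the stricter components) can be sketched as follows. This is a simplified standalone illustration, not the BOLT implementation: Candidate, distance, and matchBlock are hypothetical names, and the sketch uses a linear scan where the diff uses a map keyed by the opcode hash. The distance follows the ordering used in the diff: a neighbor-hash mismatch outweighs an instruction-hash mismatch, which outweighs any offset difference.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical stand-in for a block together with its hash components.
struct Candidate {
  int BlockId;
  uint16_t Offset, OpcodeHash, InstrHash, NeighborHash;
};

// Smaller is more similar; zero means all compared components agree.
// Only meaningful for candidates whose OpcodeHash already matches.
inline uint64_t distance(const Candidate &A, const Candidate &B) {
  uint64_t Dist = (A.NeighborHash == B.NeighborHash) ? 0 : 1;
  Dist <<= 16;
  Dist += (A.InstrHash == B.InstrHash) ? 0 : 1;
  Dist <<= 16;
  Dist += A.Offset >= B.Offset ? uint64_t(A.Offset - B.Offset)
                               : uint64_t(B.Offset - A.Offset);
  return Dist;
}

// Return the id of the closest block with the same loose (opcode) hash,
// or -1 if no block shares that hash.
inline int matchBlock(const std::vector<Candidate> &Blocks,
                      const Candidate &Query) {
  int Best = -1;
  uint64_t BestDist = std::numeric_limits<uint64_t>::max();
  for (const Candidate &C : Blocks) {
    if (C.OpcodeHash != Query.OpcodeHash)
      continue; // the loose hash must match exactly
    uint64_t D = distance(C, Query);
    if (Best == -1 || D < BestDist) {
      Best = C.BlockId;
      BestDist = D;
    }
  }
  return Best;
}
```

With this ordering, a stale block whose operands changed can still be matched through its opcode hash, and when several blocks share that hash, a candidate with identical instructions is preferred over one that merely sits at a closer offset.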

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spupyrev created this revision. Mar 22 2023, 2:26 PM

Herald added a reviewer: rafauler. Mar 22 2023, 2:26 PM

Herald added a reviewer: Amir.

Herald added a reviewer: maksfb.

Herald added a project: Restricted Project.

Herald added subscribers: treapster, ayermolo.

spupyrev added a parent revision: D144500: [BOLT] stale profile matching [part 1 out of 2]. Mar 22 2023, 2:27 PM

Harbormaster completed remote builds in B221138: Diff 507509. Mar 22 2023, 2:27 PM

spupyrev edited the summary of this revision. Mar 22 2023, 2:43 PM

Herald added a subscriber: wenlei. Mar 22 2023, 2:43 PM

spupyrev published this revision for review. Mar 22 2023, 2:45 PM

Herald added a project: Restricted Project. Mar 22 2023, 2:45 PM

Herald added subscribers: llvm-commits, yota9.

spupyrev mentioned this in D144500: [BOLT] stale profile matching [part 1 out of 2]. May 24 2023, 2:22 PM

spupyrev retitled this revision from [BOLT] v1 stale profile matching to [BOLT] stale profile matching [part 2 out of 2]. May 24 2023, 2:26 PM

Herald added a subscriber: wlei. May 24 2023, 2:26 PM

Please address a couple of nits. Will test internally, otherwise LG.
@maksfb – can you please take a look?

bolt/lib/Profile/StaleProfileMatching.cpp
219	Remove or replace?
293–295	nit
422–424

comments + rebase

Harbormaster completed remote builds in B234599: Diff 525732. May 25 2023, 12:20 PM

X86/reader-stale-yaml.test is failing in testing, please update it.

rebasing & fixing the test & adding debug logging

Harbormaster completed remote builds in B237562: Diff 529688. Jun 8 2023, 11:56 AM

Amir accepted this revision. Jun 8 2023, 12:18 PM

This revision is now accepted and ready to land. Jun 8 2023, 12:18 PM

Closed by commit rG2316a10fe59a: [BOLT] stale profile matching [part 2 out of 2] (authored by spupyrev). Jun 8 2023, 2:43 PM

This revision was automatically updated to reflect the committed changes.

spupyrev added a commit: rG2316a10fe59a: [BOLT] stale profile matching [part 2 out of 2].

Revision Contents

Path	Size
bolt/lib/Core/BinaryFunction.cpp	8 lines
bolt/lib/Profile/StaleProfileMatching.cpp	200 lines
bolt/test/X86/Inputs/blarge_profile_stale.yaml	6 lines

Diff 529734

bolt/lib/Core/BinaryFunction.cpp

[3,605 lines not shown] size_t BinaryFunction::computeHash(bool UseDFS,
    // possibly their operands and then hashing that string with std::hash.
    std::string HashString;
    for (const BinaryBasicBlock *BB : Order)
      HashString.append(hashBlock(BC, *BB, OperandHashFunc));

    return Hash = std::hash<std::string>{}(HashString);
  }

- void BinaryFunction::computeBlockHashes() const {
-   for (const BinaryBasicBlock *BB : BasicBlocks) {
-     std::string Hash =
-         hashBlock(BC, *BB, [](const MCOperand &Op) { return std::string(); });
-     BB->setHash(std::hash<std::string>{}(Hash));
-   }
- }
-
  void BinaryFunction::insertBasicBlocks(
      BinaryBasicBlock *Start,
      std::vector<std::unique_ptr<BinaryBasicBlock>> &&NewBBs,
      const bool UpdateLayout, const bool UpdateCFIState,
      const bool RecomputeLandingPads) {
    const int64_t StartIndex = Start ? getIndex(Start) : -1LL;
    const size_t NumNewBlocks = NewBBs.size();

[893 lines not shown]

bolt/lib/Profile/StaleProfileMatching.cpp

[27 lines not shown]

#include "bolt/Core/HashUtilities.h"
#include "bolt/Profile/YAMLProfileReader.h"
#include "llvm/ADT/Hashing.h"
#include "llvm/Support/CommandLine.h"
#include "llvm/Transforms/Utils/SampleProfileInference.h"

#include <queue>

#undef DEBUG_TYPE
#define DEBUG_TYPE "bolt-prof"

using namespace llvm;

namespace opts {

extern cl::OptionCategory BoltOptCategory;

cl::opt<bool>
    InferStaleProfile("infer-stale-profile",

[84 lines not shown]

    cl::desc(
        "The cost of increasing an unknown fall-through jump's count by one."),
    cl::init(5), cl::ReallyHidden, cl::cat(BoltOptCategory));

} // namespace opts

namespace llvm {
namespace bolt {

/// An object wrapping several components of a basic block hash. The combined
/// (blended) hash is represented and stored as one uint64_t, while individual
/// components are of smaller size (e.g., uint16_t or uint8_t).
struct BlendedBlockHash {
private:
  static uint64_t combineHashes(uint16_t Hash1, uint16_t Hash2, uint16_t Hash3,
                                uint16_t Hash4) {
    uint64_t Hash = 0;

    Hash |= uint64_t(Hash4);
    Hash <<= 16;

    Hash |= uint64_t(Hash3);
    Hash <<= 16;

    Hash |= uint64_t(Hash2);
    Hash <<= 16;

    Hash |= uint64_t(Hash1);

    return Hash;
  }

  static void parseHashes(uint64_t Hash, uint16_t &Hash1, uint16_t &Hash2,
                          uint16_t &Hash3, uint16_t &Hash4) {
    Hash1 = Hash & 0xffff;
    Hash >>= 16;

    Hash2 = Hash & 0xffff;
    Hash >>= 16;

    Hash3 = Hash & 0xffff;
    Hash >>= 16;

    Hash4 = Hash & 0xffff;
    Hash >>= 16;
  }

public:
  explicit BlendedBlockHash() {}

  explicit BlendedBlockHash(uint64_t CombinedHash) {
    parseHashes(CombinedHash, Offset, OpcodeHash, InstrHash, NeighborHash);
  }

  /// Combine the blended hash into uint64_t.
  uint64_t combine() const {
    return combineHashes(Offset, OpcodeHash, InstrHash, NeighborHash);
  }

  /// Compute a distance between two given blended hashes. The smaller the
  /// distance, the more similar two blocks are. For identical basic blocks,
  /// the distance is zero.
  uint64_t distance(const BlendedBlockHash &BBH) const {
    assert(OpcodeHash == BBH.OpcodeHash &&
           "incorrect blended hash distance computation");
    uint64_t Dist = 0;
    // Account for NeighborHash
    Dist += NeighborHash == BBH.NeighborHash ? 0 : 1;
    Dist <<= 16;
    // Account for InstrHash
    Dist += InstrHash == BBH.InstrHash ? 0 : 1;
    Dist <<= 16;
    // Account for Offset
    Dist += (Offset >= BBH.Offset ? Offset - BBH.Offset : BBH.Offset - Offset);
    return Dist;
  }

  /// The offset of the basic block from the function start.
  uint16_t Offset{0};
  /// (Loose) Hash of the basic block instructions, excluding operands.
  uint16_t OpcodeHash{0};
  /// (Strong) Hash of the basic block instructions, including opcodes and
  /// operands.
  uint16_t InstrHash{0};
  /// Hash of the (loose) basic block together with (loose) hashes of its
  /// successors and predecessors.
  uint16_t NeighborHash{0};
};

Inline comment on line 219 (Amir, marked Done): Remove or replace?

/// The object is used to identify and match basic blocks in a BinaryFunction
/// given their hashes computed on a binary built from several revisions behind
/// release.
class StaleMatcher {
public:
  /// Initialize stale matcher.
  void init(const std::vector<FlowBlock *> &Blocks,
            const std::vector<BlendedBlockHash> &Hashes) {
    assert(Blocks.size() == Hashes.size() &&
           "incorrect matcher initialization");
    for (size_t I = 0; I < Blocks.size(); I++) {
      FlowBlock *Block = Blocks[I];
      uint16_t OpHash = Hashes[I].OpcodeHash;
      OpHashToBlocks[OpHash].push_back(std::make_pair(Hashes[I], Block));
    }
  }

  /// Find the most similar block for a given hash.
  const FlowBlock *matchBlock(BlendedBlockHash BlendedHash) const {
    auto BlockIt = OpHashToBlocks.find(BlendedHash.OpcodeHash);
    if (BlockIt == OpHashToBlocks.end()) {
      return nullptr;
    }
    FlowBlock *BestBlock = nullptr;
    uint64_t BestDist = std::numeric_limits<uint64_t>::max();
    for (auto It : BlockIt->second) {
      FlowBlock *Block = It.second;
      BlendedBlockHash Hash = It.first;
      uint64_t Dist = Hash.distance(BlendedHash);
      if (BestBlock == nullptr || Dist < BestDist) {
        BestDist = Dist;
        BestBlock = Block;
      }
    }
    return BestBlock;
  }

private:
  using HashBlockPairType = std::pair<BlendedBlockHash, FlowBlock *>;
  std::unordered_map<uint16_t, std::vector<HashBlockPairType>> OpHashToBlocks;
};

void BinaryFunction::computeBlockHashes() const {
  if (size() == 0)
    return;

  assert(hasCFG() && "the function is expected to have CFG");

  std::vector<BlendedBlockHash> BlendedHashes(BasicBlocks.size());
  std::vector<uint64_t> OpcodeHashes(BasicBlocks.size());
  // Initialize hash components
  for (size_t I = 0; I < BasicBlocks.size(); I++) {
    const BinaryBasicBlock *BB = BasicBlocks[I];
    assert(BB->getIndex() == I && "incorrect block index");
    BlendedHashes[I].Offset = BB->getOffset();
    // Hashing complete instructions
    std::string InstrHashStr = hashBlock(
        BC, *BB, [&](const MCOperand &Op) { return hashInstOperand(BC, Op); });
    uint64_t InstrHash = std::hash<std::string>{}(InstrHashStr);
    BlendedHashes[I].InstrHash = hash_64_to_16(InstrHash);
    // Hashing opcodes
    std::string OpcodeHashStr =
        hashBlock(BC, *BB, [](const MCOperand &Op) { return std::string(); });
    OpcodeHashes[I] = std::hash<std::string>{}(OpcodeHashStr);
    BlendedHashes[I].OpcodeHash = hash_64_to_16(OpcodeHashes[I]);
  }

  // Initialize neighbor hash
  for (size_t I = 0; I < BasicBlocks.size(); I++) {
    const BinaryBasicBlock *BB = BasicBlocks[I];
    uint64_t Hash = OpcodeHashes[I];
    // Append hashes of successors
    for (BinaryBasicBlock *SuccBB : BB->successors()) {
      uint64_t SuccHash = OpcodeHashes[SuccBB->getIndex()];
      Hash = hashing::detail::hash_16_bytes(Hash, SuccHash);
    }
    // Append hashes of predecessors

Inline comment on lines 293–295 (Amir, marked Done): nit
Suggested change:
    // Append hashes of predecessors
  - for (BinaryBasicBlock *PredBB : BB->predecessors()) {
  + for (BinaryBasicBlock *PredBB : BB->predecessors())
      Hash = hash_128_to_64(Hash, OpcodeHashes[PredBB->getIndex()]);
  - }
    BlendedHashes[I].NeighborHash = hash_64_to_16(Hash);

    for (BinaryBasicBlock *PredBB : BB->predecessors()) {
      uint64_t PredHash = OpcodeHashes[PredBB->getIndex()];
      Hash = hashing::detail::hash_16_bytes(Hash, PredHash);
    }
    BlendedHashes[I].NeighborHash = hash_64_to_16(Hash);
  }

  // Assign hashes
  for (size_t I = 0; I < BasicBlocks.size(); I++) {
    const BinaryBasicBlock *BB = BasicBlocks[I];
    BB->setHash(BlendedHashes[I].combine());
  }
}

/// Create a wrapper flow function to use with the profile inference algorithm,
/// and initialize its jumps and metadata.
FlowFunction
createFlowFunction(const BinaryFunction::BasicBlockOrderType &BlockOrder) {
  FlowFunction Func;

  // Add a special "dummy" source so that there is always a unique entry point.
  // Because of the extra source, for all other blocks in FlowFunction it holds

[75 lines not shown]

  /// Whenever there is a count in the profile with the hash corresponding to one
  /// of the basic blocks in the binary, the count is "matched" to the block.
  /// Similarly, if both the source and the target of a count in the profile are
  /// matched to a jump in the binary, the count is recorded in CFG.
  void matchWeightsByHashes(const BinaryFunction::BasicBlockOrderType &BlockOrder,
                            const yaml::bolt::BinaryFunctionProfile &YamlBF,
                            FlowFunction &Func) {
    assert(Func.Blocks.size() == BlockOrder.size() + 1);

    // Initialize stale matcher
-   DenseMap<uint64_t, std::vector<FlowBlock *>> HashToBlocks;
+   std::vector<FlowBlock *> Blocks;
+   std::vector<BlendedBlockHash> BlendedHashes;
    for (uint64_t I = 0; I < BlockOrder.size(); I++) {
      const BinaryBasicBlock *BB = BlockOrder[I];
      assert(BB->getHash() != 0 && "empty hash of BinaryBasicBlock");
-     HashToBlocks[BB->getHash()].push_back(&Func.Blocks[I + 1]);
+     Blocks.push_back(&Func.Blocks[I + 1]);
+     BlendedBlockHash BlendedHash(BB->getHash());
+     BlendedHashes.push_back(BlendedHash);
+     LLVM_DEBUG(dbgs() << "BB with index " << I << " has hash = "
+                       << Twine::utohexstr(BB->getHash()) << "\n");
    }
+   StaleMatcher Matcher;
+   Matcher.init(Blocks, BlendedHashes);

    // Index in yaml profile => corresponding (matched) block
    DenseMap<uint64_t, const FlowBlock *> MatchedBlocks;
    // Match blocks from the profile to the blocks in CFG
    for (const yaml::bolt::BinaryBasicBlockProfile &YamlBB : YamlBF.Blocks) {
      assert(YamlBB.Hash != 0 && "empty hash of BinaryBasicBlockProfile");
-     auto It = HashToBlocks.find(YamlBB.Hash);
-     if (It != HashToBlocks.end()) {
-       const FlowBlock *MatchedBlock = It->second.front();
+     BlendedBlockHash BlendedHash(YamlBB.Hash);
+     const FlowBlock *MatchedBlock = Matcher.matchBlock(BlendedHash);
+     if (MatchedBlock != nullptr) {
        MatchedBlocks[YamlBB.Index] = MatchedBlock;
+       LLVM_DEBUG(dbgs() << "Matched yaml block with bid = " << YamlBB.Index
+                         << " and hash = " << Twine::utohexstr(YamlBB.Hash)
+                         << " to BB with index = " << MatchedBlock->Index - 1
+                         << "\n");
+     } else {
+       LLVM_DEBUG(
+           dbgs() << "Couldn't match yaml block with bid = " << YamlBB.Index
+                  << " and hash = " << Twine::utohexstr(YamlBB.Hash) << "\n");
      }
    }

Inline comment on lines 422–424 (Amir, marked Done, no comment text)
Suggested change:
    const FlowBlock *MatchedBlock = Matcher.matchBlock(BlendedHash);
  - if (MatchedBlock != nullptr) {
  + if (MatchedBlock != nullptr)
    MatchedBlocks[YamlBB.Index] = MatchedBlock;
  - }
  + }
    // Match jumps from the profile to the jumps from CFG

    // Match jumps from the profile to the jumps from CFG
    std::vector<uint64_t> OutWeight(Func.Blocks.size(), 0);
    std::vector<uint64_t> InWeight(Func.Blocks.size(), 0);
    for (const yaml::bolt::BinaryBasicBlockProfile &YamlBB : YamlBF.Blocks) {
      for (const yaml::bolt::SuccessorInfo &YamlSI : YamlBB.Successors) {

[295 lines not shown]

bolt/test/X86/Inputs/blarge_profile_stale.yaml

[31 lines not shown] functions:
    - name: usqrt
      fid: 7
      hash: 0x8B62B1F9AD81EA35
      exec: 20
      nblocks: 6
      blocks:
        - bid: 0
          insns: 4
-         hash: 0xE3FEB842A6548CCF
+         hash: 0xb1e5b76571270000
          exec: 20
          succ: [ { bid: 1, cnt: 0 } ]
        - bid: 1
          insns: 9
-         hash: 0x85948FF2924613B7
+         hash: 0x587e93788b970010
          succ: [ { bid: 3, cnt: 320, mis: 171 }, { bid: 2, cnt: 0 } ]
        - bid: 3
          insns: 2
-         hash: 0x41D8DB2D2B01F411
+         hash: 0x20e605d745e50039
          succ: [ { bid: 1, cnt: 300, mis: 33 }, { bid: 4, cnt: 20 } ]
  ...
