This is an archive of the discontinued LLVM Phabricator instance.

Differential D123050

[BOLT] Cache-Aware Tail Duplication
Closed · Public

Authored by spupyrev on Apr 4 2022, 10:28 AM.

Details

Reviewers

rafauler
Amir
maksfb

Commits

rG5904836b8a71: [BOLT] Cache-Aware Tail Duplication

Summary

A new "cache-aware" strategy for tail duplication.

There are two competing goals in tail duplication: (a) on one hand, we want to
duplicate basic blocks in order to shorten the blocks and reduce the number of
executed instructions; (b) on the other hand, we do not want to duplicate
too many blocks, as it would pollute the i-cache and negatively affect performance.
The new strategy finds a tradeoff via a proxy objective, called the cache score,
which is designed to quantify the impact of duplication on the i-cache. In the
algorithm, blocks are duplicated only if doing so improves the cache score.
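
The decision rule can be illustrated with a minimal standalone sketch. It follows the formula of `cacheScore` in bolt/lib/Passes/TailDuplication.cpp from this diff, but it is not the BOLT code itself: the constants stand in for the option defaults (256 and 0.5), and the `Jump` struct and `totalScore` helper are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative constants mirroring the option defaults in this diff.
constexpr uint64_t MaxCacheDistance = 256; // -tail-duplication-max-cache-distance
constexpr double BackwardWeight = 0.5;     // -tail-duplication-cache-backward-weight

// Score of one jump (Src -> Dst) taken Count times, per the formula in the diff.
double cacheScore(uint64_t SrcAddr, uint64_t SrcSize, uint64_t DstAddr,
                  uint64_t DstSize, uint64_t Count) {
  // The diff clamps estimated block sizes to at least one byte.
  assert(SrcSize > 0 && DstSize > 0 && "blocks are assumed non-empty");
  const bool IsForwardJump = SrcAddr <= DstAddr;
  // The "distance" spans both blocks, so it also reflects their sizes.
  const uint64_t JumpDistance = IsForwardJump ? (DstAddr + DstSize) - SrcAddr
                                              : (SrcAddr + SrcSize) - DstAddr;
  if (JumpDistance >= MaxCacheDistance)
    return 0.0;
  const double Prob =
      1.0 - static_cast<double>(JumpDistance) / MaxCacheDistance;
  return (IsForwardJump ? 1.0 : BackwardWeight) * Prob * Count;
}

// Hypothetical helper type: one jump in a toy layout.
struct Jump {
  uint64_t SrcAddr, SrcSize, DstAddr, DstSize, Count;
};

// Hypothetical helper: sum of the scores of all jumps in a layout.
double totalScore(const std::vector<Jump> &Jumps) {
  double Score = 0.0;
  for (const Jump &J : Jumps)
    Score += cacheScore(J.SrcAddr, J.SrcSize, J.DstAddr, J.DstSize, J.Count);
  return Score;
}
```

As a worked example with made-up numbers: take Entry at address 0 (10 bytes), Pred at 10 (16 bytes), Next at 26 (20 bytes), and an 8-byte Tail at 400. The jump Pred -> Tail (count 100) spans 398 bytes and scores 0, while Entry -> Next (count 64) scores 52.5. Placing a copy of Tail right after Pred shortens the first jump to a 24-byte span (score 90.625) and pushes Next to address 34, which lowers the second score to 50.5. The total rises from 52.5 to 141.125, so under this rule the tail would be duplicated.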

As a part of the diff, we have also polished the existing implementation,
adjusted logging, fixed some minor issues etc.

Applying the optimization:

llvm-bolt ... -relocs -split-functions=3 -split-all-cold -icf=1 -lite=1 \
-update-debug-sections=false -split-eh -use-gnu-stack -jump-tables=move \
-reorder-functions=hfsort -reorder-blocks=ext-tsp -tail-duplication=cache

Perf impact is measured on different revisions of the clang binary:
release_7

benchmark1: -0.4006 ± 0.3207 (win)
benchmark2: -0.5010 ± 0.2159 (win)
benchmark3:  0.0686 ± 0.2403
benchmark4: -0.0713 ± 0.2702

release_10

benchmark1:  0.1915 ± 0.3361
benchmark2: -0.1900 ± 0.1977
benchmark3: -0.0330 ± 0.1959
benchmark4: -0.2428 ± 0.1956 (win)

release_12

benchmark1:  0.3636 ± 0.3040 (regression)
benchmark2: -0.3389 ± 0.1707 (win)
benchmark3:  0.0583 ± 0.1801
benchmark4: -0.2919 ± 0.1798 (win)
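
The (win) and (regression) labels above are consistent with a simple rule: a result is labeled only when the interval mean ± margin excludes zero, with negative values counted as improvements. The summary does not state this rule or the units, so the sketch below is an assumption, and `classify` is a hypothetical helper rather than part of any benchmarking tool.

```cpp
#include <cmath>
#include <string>

// Assumed rule: label a result only if |Mean| exceeds the reported margin.
// Negative means are treated as improvements (lower is better).
std::string classify(double Mean, double Margin) {
  if (std::fabs(Mean) <= Margin)
    return "neutral"; // the interval contains zero
  return Mean < 0 ? "win" : "regression";
}
```

For instance, release_12 benchmark1 (0.3636 ± 0.3040) falls outside its margin on the positive side, whereas release_10 benchmark2 (-0.1900 ± 0.1977) stays within its margin and is left unlabeled.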

The wins are from the 0.05%-0.1% reduction in the instruction count; branch
misses, i-cache/i-TLB misses also sometimes change in a fairly unpredictable
manner (possibly due to different code alignment and function ordering that
depends on function sizes).

This is also tested on two large real-world services. For the first one, we
observe 0.1%-0.2% perf win (0.1% fewer instructions); for the second one, the
performance is flat.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spupyrev created this revision. Apr 4 2022, 10:28 AM

Herald added a reviewer: rafauler. Apr 4 2022, 10:28 AM

Herald added a reviewer: Amir.

Herald added a reviewer: maksfb.

Herald added a project: Restricted Project.

Herald added a subscriber: ayermolo.

spupyrev requested review of this revision. Apr 4 2022, 10:28 AM

Herald added a project: Restricted Project. Apr 4 2022, 10:28 AM

Herald added subscribers: llvm-commits, yota9.

Harbormaster completed remote builds in B157763: Diff 420228. Apr 4 2022, 10:34 AM

spupyrev edited the summary of this revision. Apr 4 2022, 10:40 AM

wenlei added a subscriber: wenlei. Apr 4 2022, 10:49 AM

There was a *very* brief discussion about system models on llvm-dev. Did you consider using something similar in bolt resp. llvm?
https://lists.llvm.org/pipermail/llvm-dev/2018-October/127319.html

In D123050#3454138, @tschuett wrote:

There was a *very* brief discussion about system models on llvm-dev. Did you consider using something similar in bolt resp. llvm?
https://lists.llvm.org/pipermail/llvm-dev/2018-October/127319.html

Interesting, thanks for the pointer. Is there a discussion somewhere, i can only see a single message with RFC.

In D123050#3454577, @spupyrev wrote:

In D123050#3454138, @tschuett wrote:

There was a *very* brief discussion about system models on llvm-dev. Did you consider using something similar in bolt resp. llvm?
https://lists.llvm.org/pipermail/llvm-dev/2018-October/127319.html

Interesting, thanks for the pointer. Is there a discussion somewhere, i can only see a single message with RFC.

I found a phab diff, but that is all I could find.
https://reviews.llvm.org/D58736

rebase + a small adjustment of the default optimization flags

Harbormaster completed remote builds in B161651: Diff 425586. Apr 27 2022, 12:12 PM

LGTM with a few comments below

bolt/lib/Passes/TailDuplication.cpp
393	By reading "jump distance" I would expect to see JumpDistance = DstAddr - (SrcAddr + SrcSize) as an approximation of the jump distance in a forward jump. E.g.

    BlockA:  <src address>
      nop
      nop
      jmp BlockB
    BlockB:  <src address + src size> also <dstaddr>
      nop
      nop
      nop
             <dstaddr + dst size>

In the example above, because jmp BlockB is a fall-through, it should calculate 0 as the distance. Using the definition I provided above, it is zero, but not in the source code. In the source code of this diff, JumpDistance would be "src size + dst size". Is this to calculate how much cache space is used by using these two blocks? What's the idea? Can you make more explicit the reasoning in the comments? From reading the comments, I got the impression that a FT should be calculated as zero distance.
402	Why are we always returning 1.0 in case of a forward jump? Is this correct? From what I read in the comments of this function, I would expect to see Prob * Count here.

This revision is now accepted and ready to land. May 31 2022, 3:02 PM

spupyrev added inline comments. May 31 2022, 3:27 PM

bolt/lib/Passes/TailDuplication.cpp
393	Thanks for reading the code so carefully! I think that `distance` is not the best name here and causes the confusion. The idea is to quantify the impact of the block placement on the instruction cache. In particular, we want to distinguish between the case when `BlockB` is "large" and the case when it is "small" (e.g., a few bytes). It feels that the latter is more i-cache friendly than the former; thus, we'd want to duplicate `BlockB` only when it is "small". Notice that this effect won't be achieved when we compute the "correct" distance as you describe (which is 0 independently of the size of the block). The above is of course just an unproved intuition that performs reasonably well in practice. I am happy to experiment with alternatives and/or extensions, if you have any. Also let me know if you see a better name for `distance`.
402	Your assumption is correct, as well as the implementation. Likely my formatting makes the code harder to read. Do you think the following equivalent would be more readable?

    if (IsForwardJump)
      return Prob * Count;
    else
      return opts::TailDuplicationCacheBackwardWeight * Prob * Count;

rafauler added inline comments. May 31 2022, 4:12 PM

bolt/lib/Passes/TailDuplication.cpp
393	I think the current implementation makes sense (to quantify size and not just the jump distance) and I don't have any suggestions on improving that. I would probably try something similar. I was just a bit confused because the comment in line 105 of the header file mentions that a fall-through jump will map to a 1.0 score. Maybe update the comment? For the name, I would probably use "JumpScore", but I don't have any strong opinions on it, so if you want to keep JumpDistance, that's fine too.
402	Ohhh sorry, now I see it. You can keep the current formatting, that's fine. Whichever you prefer.
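
The difference between the two notions of distance discussed for line 393 can be made concrete with a small sketch. Both function names are hypothetical, the block sizes are made up, and only forward jumps are covered.

```cpp
#include <cstdint>

// Distance as suggested in the review: the gap between the end of the source
// block and the start of the destination block.
uint64_t gapDistance(uint64_t SrcAddr, uint64_t SrcSize, uint64_t DstAddr) {
  return DstAddr - (SrcAddr + SrcSize);
}

// Distance as computed in the diff for a forward jump: the span from the
// start of the source block to the end of the destination block.
uint64_t spanDistance(uint64_t SrcAddr, uint64_t DstAddr, uint64_t DstSize) {
  return (DstAddr + DstSize) - SrcAddr;
}
```

For a fall-through from a 4-byte BlockA at address 0 into BlockB at address 4, the gap is 0 whatever the size of BlockB, while the span is 7 for a 3-byte BlockB and 204 for a 200-byte one. Only the second measure separates "small" tails from "large" ones, which is the effect the author describes.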

@spupyrev I don't remember if you have commit access to the LLVM repo. If you don't, let me know, and I'll commit this once you are done with modifications and give me a green signal.

updated comments per review

Harbormaster completed remote builds in B167723: Diff 434035. Jun 3 2022, 8:40 AM

Closed by commit rG5904836b8a71: [BOLT] Cache-Aware Tail Duplication (authored by spupyrev). Jun 3 2022, 9:11 AM

This revision was automatically updated to reflect the committed changes.

spupyrev added a commit: rG5904836b8a71: [BOLT] Cache-Aware Tail Duplication.

Revision Contents

Path                                                      Size
bolt/include/bolt/Passes/TailDuplication.h                65 lines
bolt/lib/Passes/TailDuplication.cpp                       367 lines
bolt/lib/Rewrite/BinaryPassManager.cpp                    8 lines
bolt/test/X86/tail-duplication-cache.s                    59 lines
bolt/test/X86/tail-duplication-cacheline.s                2 lines
bolt/test/X86/tail-duplication-complex.s                  4 lines
bolt/test/X86/tail-duplication-jt.s                       4 lines
bolt/test/X86/tail-duplication-pass.s                     4 lines
bolt/test/X86/tail-duplication-prop-bug.s                 2 lines
bolt/test/runtime/X86/tail-duplication-constant-prop.s    6 lines

Diff 434045

bolt/include/bolt/Passes/TailDuplication.h

Show All 40 Lines
#include "bolt/Passes/BinaryPasses.h"		#include "bolt/Passes/BinaryPasses.h"

namespace llvm {		namespace llvm {
namespace bolt {		namespace bolt {

/// Pass for duplicating blocks that would require a jump.		/// Pass for duplicating blocks that would require a jump.
class TailDuplication : public BinaryFunctionPass {		class TailDuplication : public BinaryFunctionPass {
/// Record how many possible tail duplications there can be.		/// Record how many possible tail duplications there can be.
uint64_t PossibleDuplications = 0;		uint64_t ModifiedFunctions = 0;

/// Record how many times these duplications would get used.		/// The number of duplicated basic blocks.
uint64_t PossibleDuplicationsDynamicCount = 0;		uint64_t DuplicatedBlockCount = 0;

		/// The size (in bytes) of duplicated basic blocks.
		uint64_t DuplicatedByteCount = 0;

/// Record the execution count of all unconditional branches.		/// Record how many times these duplications would get used.
uint64_t UnconditionalBranchDynamicCount = 0;		uint64_t DuplicationsDynamicCount = 0;

/// Record the execution count of all blocks.		/// Record the execution count of all blocks.
uint64_t AllBlocksDynamicCount = 0;		uint64_t AllDynamicCount = 0;

/// Record the number of instructions deleted because of propagation		/// Record the number of instructions deleted because of propagation
uint64_t StaticInstructionDeletionCount = 0;		uint64_t StaticInstructionDeletionCount = 0;

/// Record the number of instructions deleted because of propagation		/// Record the number of instructions deleted because of propagation
uint64_t DynamicInstructionDeletionCount = 0;		uint64_t DynamicInstructionDeletionCount = 0;

/// Sets Regs with the caller saved registers		/// Sets Regs with the caller saved registers
Show All 15 Lines	class TailDuplication : public BinaryFunctionPass {
bool isOverwrittenBeforeUsed(BinaryBasicBlock &StartBB, unsigned Reg) const;		bool isOverwrittenBeforeUsed(BinaryBasicBlock &StartBB, unsigned Reg) const;

/// Constant and Copy Propagate for the block formed by OriginalBB and		/// Constant and Copy Propagate for the block formed by OriginalBB and
/// BlocksToPropagate		/// BlocksToPropagate
void		void
constantAndCopyPropagate(BinaryBasicBlock &OriginalBB,		constantAndCopyPropagate(BinaryBasicBlock &OriginalBB,
std::vector<BinaryBasicBlock *> &BlocksToPropagate);		std::vector<BinaryBasicBlock *> &BlocksToPropagate);

/// True if Succ is in the same cache line as BB (approximately)		/// True if Tail is in the same cache line as BB (approximately)
bool isInCacheLine(const BinaryBasicBlock &BB,		bool isInCacheLine(const BinaryBasicBlock &BB,
const BinaryBasicBlock &Succ) const;		const BinaryBasicBlock &Tail) const;

/// Duplicates BlocksToDuplicate and places them after BB.		/// Duplicates BlocksToDuplicate and places them after BB.
std::vector<BinaryBasicBlock *>		std::vector<BinaryBasicBlock *> duplicateBlocks(
tailDuplicate(BinaryBasicBlock &BB,		BinaryBasicBlock &BB,
const std::vector<BinaryBasicBlock *> &BlocksToDuplicate) const;		const std::vector<BinaryBasicBlock *> &BlocksToDuplicate) const;

		/// Decide whether the tail basic blocks should be duplicated after BB.
		bool shouldDuplicate(BinaryBasicBlock *BB, BinaryBasicBlock *Tail) const;

		/// Compute the cache score for a jump (Src, Dst) with frequency Count.
		/// The value is in the range [0..1] and quantifies how "cache-friendly"
		/// the jump is. The score is close to 1 for "short" forward jumps and
		/// it is 0 for "long" jumps exceeding a specified threshold; between the
		/// bounds, the value decreases linearly. For backward jumps, the value is
		/// scaled by a specified factor.
		double cacheScore(uint64_t SrcAddr, uint64_t SrcSize, uint64_t DstAddr,
		uint64_t DstSize, uint64_t Count) const;

		/// Decide whether the cache score has been improved after duplication.
		bool cacheScoreImproved(const MCCodeEmitter *Emitter, BinaryFunction &BF,
		BinaryBasicBlock *Pred, BinaryBasicBlock *Tail) const;

		/// A moderate strategy for tail duplication.
/// Returns a vector of BinaryBasicBlock to copy after BB. If it's empty,		/// Returns a vector of BinaryBasicBlock to copy after BB. If it's empty,
/// nothing should be duplicated		/// nothing should be duplicated.
std::vector<BinaryBasicBlock *>		std::vector<BinaryBasicBlock *>
moderateCodeToDuplicate(BinaryBasicBlock &BB) const;		moderateDuplicate(BinaryBasicBlock &BB, BinaryBasicBlock &Tail) const;

		/// An aggressive strategy for tail duplication.
std::vector<BinaryBasicBlock *>		std::vector<BinaryBasicBlock *>
aggressiveCodeToDuplicate(BinaryBasicBlock &BB) const;		aggressiveDuplicate(BinaryBasicBlock &BB, BinaryBasicBlock &Tail) const;

		/// A cache-aware strategy for tail duplication.
		std::vector<BinaryBasicBlock *> cacheDuplicate(const MCCodeEmitter *Emitter,
		BinaryFunction &BF,
		BinaryBasicBlock *BB,
		BinaryBasicBlock *Tail) const;

void runOnFunction(BinaryFunction &Function);		void runOnFunction(BinaryFunction &Function);

public:		public:
		enum DuplicationMode : char {
		TD_NONE = 0,
		TD_AGGRESSIVE,
		TD_MODERATE,
		TD_CACHE
		};

explicit TailDuplication() : BinaryFunctionPass(false) {}		explicit TailDuplication() : BinaryFunctionPass(false) {}

const char *getName() const override { return "tail duplication"; }		const char *getName() const override { return "tail duplication"; }

void runOnFunctions(BinaryContext &BC) override;		void runOnFunctions(BinaryContext &BC) override;
};		};

} // namespace bolt		} // namespace bolt
} // namespace llvm		} // namespace llvm

#endif		#endif

bolt/lib/Passes/TailDuplication.cpp

//===- bolt/Passes/TailDuplication.cpp ------------------------------------===//		//===- bolt/Passes/TailDuplication.cpp ------------------------------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// This file implements the TailDuplication class.		// This file implements the TailDuplication class.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "bolt/Passes/TailDuplication.h"		#include "bolt/Passes/TailDuplication.h"
		#include "llvm/ADT/DenseMap.h"
#include "llvm/MC/MCRegisterInfo.h"		#include "llvm/MC/MCRegisterInfo.h"

#include <numeric>		#include <numeric>

#define DEBUG_TYPE "taildup"		#define DEBUG_TYPE "taildup"

using namespace llvm;		using namespace llvm;

namespace opts {		namespace opts {

extern cl::OptionCategory BoltOptCategory;		extern cl::OptionCategory BoltOptCategory;
		extern cl::opt<bool> NoThreads;

static cl::opt<bool> TailDuplicationAggressive(		cl::opt<bolt::TailDuplication::DuplicationMode> TailDuplicationMode(
"tail-duplication-aggressive",		"tail-duplication",
cl::desc("tail duplication should act aggressively in duplicating multiple "		cl::desc("duplicate unconditional branches that cross a cache line"),
"blocks per tail"),		cl::init(bolt::TailDuplication::TD_NONE),
cl::ZeroOrMore, cl::ReallyHidden, cl::init(false),		cl::values(clEnumValN(bolt::TailDuplication::TD_NONE, "none",
cl::cat(BoltOptCategory));		"do not apply"),
		clEnumValN(bolt::TailDuplication::TD_AGGRESSIVE, "aggressive",
		"aggressive strategy"),
		clEnumValN(bolt::TailDuplication::TD_MODERATE, "moderate",
		"moderate strategy"),
		clEnumValN(bolt::TailDuplication::TD_CACHE, "cache",
		"cache-aware duplication strategy")),
		cl::ZeroOrMore, cl::Hidden, cl::cat(BoltOptCategory));

static cl::opt<unsigned>		static cl::opt<unsigned>
TailDuplicationMinimumOffset("tail-duplication-minimum-offset",		TailDuplicationMinimumOffset("tail-duplication-minimum-offset",
cl::desc("minimum offset needed between block "		cl::desc("minimum offset needed between block "
"and successor to allow duplication"),		"and successor to allow duplication"),
cl::ZeroOrMore, cl::ReallyHidden, cl::init(64),		cl::ZeroOrMore, cl::ReallyHidden, cl::init(64),
cl::cat(BoltOptCategory));		cl::cat(BoltOptCategory));

static cl::opt<unsigned> TailDuplicationMaximumDuplication(		static cl::opt<unsigned> TailDuplicationMaximumDuplication(
"tail-duplication-maximum-duplication",		"tail-duplication-maximum-duplication",
cl::desc("maximum size of duplicated blocks (in bytes)"), cl::ZeroOrMore,		cl::desc("tail blocks whose size (in bytes) exceeds the value are never "
cl::ReallyHidden, cl::init(64), cl::cat(BoltOptCategory));		"duplicated"),
		cl::ZeroOrMore, cl::ReallyHidden, cl::init(24), cl::cat(BoltOptCategory));

		static cl::opt<unsigned> TailDuplicationMinimumDuplication(
		"tail-duplication-minimum-duplication",
		cl::desc("tail blocks with size (in bytes) not exceeding the value are "
		"always duplicated"),
		cl::ZeroOrMore, cl::ReallyHidden, cl::init(2), cl::cat(BoltOptCategory));

static cl::opt<bool> TailDuplicationConstCopyPropagation(		static cl::opt<bool> TailDuplicationConstCopyPropagation(
"tail-duplication-const-copy-propagation",		"tail-duplication-const-copy-propagation",
cl::desc("enable const and copy propagation after tail duplication"),		cl::desc("enable const and copy propagation after tail duplication"),
cl::ReallyHidden, cl::init(false), cl::cat(BoltOptCategory));		cl::ReallyHidden, cl::init(false), cl::cat(BoltOptCategory));

		static cl::opt<unsigned> TailDuplicationMaxCacheDistance(
		"tail-duplication-max-cache-distance",
		cl::desc("The weight of backward jumps for ExtTSP value"), cl::init(256),
		cl::ReallyHidden, cl::ZeroOrMore, cl::cat(BoltOptCategory));

		static cl::opt<double> TailDuplicationCacheBackwardWeight(
		"tail-duplication-cache-backward-weight",
		cl::desc(
		"The maximum distance (in bytes) of backward jumps for ExtTSP value"),
		cl::init(0.5), cl::ReallyHidden, cl::ZeroOrMore, cl::cat(BoltOptCategory));

} // namespace opts		} // namespace opts

namespace llvm {		namespace llvm {
namespace bolt {		namespace bolt {

void TailDuplication::getCallerSavedRegs(const MCInst &Inst, BitVector &Regs,		void TailDuplication::getCallerSavedRegs(const MCInst &Inst, BitVector &Regs,
BinaryContext &BC) const {		BinaryContext &BC) const {
if (!BC.MIB->isCall(Inst))		if (!BC.MIB->isCall(Inst))
return;		return;
BitVector CallRegs = BitVector(BC.MRI->getNumRegs(), false);		BitVector CallRegs = BitVector(BC.MRI->getNumRegs(), false);
BC.MIB->getCalleeSavedRegs(CallRegs);		BC.MIB->getCalleeSavedRegs(CallRegs);
CallRegs.flip();		CallRegs.flip();
Regs |= CallRegs;		Regs |= CallRegs;
▲ Show 20 Lines • Show All 176 Lines • ▼ Show 20 Lines	for (unsigned I = BB.getLayoutIndex() + Direction; I != Succ.getLayoutIndex();
Distance += BlockLayout[I]->getOriginalSize();		Distance += BlockLayout[I]->getOriginalSize();
if (Distance > opts::TailDuplicationMinimumOffset)		if (Distance > opts::TailDuplicationMinimumOffset)
return false;		return false;
}		}
return true;		return true;
}		}

std::vector<BinaryBasicBlock *>		std::vector<BinaryBasicBlock *>
TailDuplication::moderateCodeToDuplicate(BinaryBasicBlock &BB) const {		TailDuplication::moderateDuplicate(BinaryBasicBlock &BB,
		BinaryBasicBlock &Tail) const {
std::vector<BinaryBasicBlock *> BlocksToDuplicate;		std::vector<BinaryBasicBlock *> BlocksToDuplicate;
if (BB.hasJumpTable())		// The block must be hot
		if (BB.getKnownExecutionCount() == 0)
		return BlocksToDuplicate;
		// and its sucessor is not already in the same cache line
		if (isInCacheLine(BB, Tail))
return BlocksToDuplicate;		return BlocksToDuplicate;
if (BB.getOriginalSize() > opts::TailDuplicationMaximumDuplication)		// and its size do not exceed the maximum allowed size
		if (Tail.getOriginalSize() > opts::TailDuplicationMaximumDuplication)
return BlocksToDuplicate;		return BlocksToDuplicate;
for (auto Itr = BB.succ_begin(); Itr != BB.succ_end(); ++Itr) {
if ((*Itr)->getLayoutIndex() == BB.getLayoutIndex() + 1)
// If duplicating would introduce a new branch, don't duplicate		// If duplicating would introduce a new branch, don't duplicate
		for (auto Itr = Tail.succ_begin(); Itr != Tail.succ_end(); ++Itr) {
		if ((*Itr)->getLayoutIndex() == Tail.getLayoutIndex() + 1)
return BlocksToDuplicate;		return BlocksToDuplicate;
}		}
BlocksToDuplicate.push_back(&BB);
		BlocksToDuplicate.push_back(&Tail);
return BlocksToDuplicate;		return BlocksToDuplicate;
}		}

std::vector<BinaryBasicBlock *>		std::vector<BinaryBasicBlock *>
TailDuplication::aggressiveCodeToDuplicate(BinaryBasicBlock &BB) const {		TailDuplication::aggressiveDuplicate(BinaryBasicBlock &BB,
		BinaryBasicBlock &Tail) const {
std::vector<BinaryBasicBlock *> BlocksToDuplicate;		std::vector<BinaryBasicBlock *> BlocksToDuplicate;
		// The block must be hot
		if (BB.getKnownExecutionCount() == 0)
		return BlocksToDuplicate;
		// and its sucessor is not already in the same cache line
		if (isInCacheLine(BB, Tail))
		return BlocksToDuplicate;

BinaryBasicBlock *CurrBB = &BB;		BinaryBasicBlock *CurrBB = &BB;
while (CurrBB) {		while (CurrBB) {
LLVM_DEBUG(dbgs() << "Aggressive tail duplication: adding "		LLVM_DEBUG(dbgs() << "Aggressive tail duplication: adding "
<< CurrBB->getName() << " to duplication list\n";);		<< CurrBB->getName() << " to duplication list\n";);
BlocksToDuplicate.push_back(CurrBB);		BlocksToDuplicate.push_back(CurrBB);

if (CurrBB->hasJumpTable()) {		if (CurrBB->hasJumpTable()) {
LLVM_DEBUG(dbgs() << "Aggressive tail duplication: clearing duplication "		LLVM_DEBUG(dbgs() << "Aggressive tail duplication: clearing duplication "
Show All 21 Lines	if (CurrBB->succ_size() >= 2) {
<< CurrBB->getName() << '\n';);		<< CurrBB->getName() << '\n';);
BlocksToDuplicate.clear();		BlocksToDuplicate.clear();
}		}
break;		break;
}		}

// With one successor, if its a jump, we should duplicate all blocks in		// With one successor, if its a jump, we should duplicate all blocks in
// BlocksToDuplicate. Otherwise, we should keep going		// BlocksToDuplicate. Otherwise, we should keep going
BinaryBasicBlock *Succ = CurrBB->getSuccessor();		BinaryBasicBlock *SuccBB = CurrBB->getSuccessor();
if (Succ->getLayoutIndex() != CurrBB->getLayoutIndex() + 1)		if (SuccBB->getLayoutIndex() != CurrBB->getLayoutIndex() + 1)
break;		break;
CurrBB = Succ;		CurrBB = SuccBB;
}		}
// Don't duplicate if its too much code		// Don't duplicate if its too much code
unsigned DuplicationByteCount = std::accumulate(		unsigned DuplicationByteCount = std::accumulate(
std::begin(BlocksToDuplicate), std::end(BlocksToDuplicate), 0,		std::begin(BlocksToDuplicate), std::end(BlocksToDuplicate), 0,
[](int value, BinaryBasicBlock *p) {		[](int value, BinaryBasicBlock *p) {
return value + p->getOriginalSize();		return value + p->getOriginalSize();
});		});
if (DuplicationByteCount > opts::TailDuplicationMaximumDuplication) {		if (DuplicationByteCount > opts::TailDuplicationMaximumDuplication) {
LLVM_DEBUG(dbgs() << "Aggressive tail duplication: duplication byte count ("		LLVM_DEBUG(dbgs() << "Aggressive tail duplication: duplication byte count ("
<< DuplicationByteCount << ") exceeds maximum "		<< DuplicationByteCount << ") exceeds maximum "
<< opts::TailDuplicationMaximumDuplication << '\n';);		<< opts::TailDuplicationMaximumDuplication << '\n';);
BlocksToDuplicate.clear();		BlocksToDuplicate.clear();
}		}
LLVM_DEBUG(dbgs() << "Aggressive tail duplication: found "		LLVM_DEBUG(dbgs() << "Aggressive tail duplication: found "
<< BlocksToDuplicate.size() << " blocks to duplicate\n";);		<< BlocksToDuplicate.size() << " blocks to duplicate\n";);
return BlocksToDuplicate;		return BlocksToDuplicate;
}		}

std::vector<BinaryBasicBlock *> TailDuplication::tailDuplicate(		bool TailDuplication::shouldDuplicate(BinaryBasicBlock *Pred,
		BinaryBasicBlock *Tail) const {
		if (Pred == Tail)
		return false;
		// Cannot duplicate non-tail blocks
		if (Tail->succ_size() != 0)
		return false;
		// The blocks are already in the order
		if (Pred->getLayoutIndex() + 1 == Tail->getLayoutIndex())
		return false;
		// No tail duplication for blocks with jump tables
		if (Pred->hasJumpTable())
		return false;
		if (Tail->hasJumpTable())
		return false;

		return true;
		}

		double TailDuplication::cacheScore(uint64_t SrcAddr, uint64_t SrcSize,
		uint64_t DstAddr, uint64_t DstSize,
		uint64_t Count) const {
		assert(Count != BinaryBasicBlock::COUNT_NO_PROFILE);

		bool IsForwardJump = SrcAddr <= DstAddr;
		uint64_t JumpDistance = 0;
		// Computing the length of the jump so that it takes the sizes of the two
		// blocks into consideration
		if (IsForwardJump) {
		JumpDistance = (DstAddr + DstSize) - (SrcAddr);
		} else {
		JumpDistance = (SrcAddr + SrcSize) - (DstAddr);
		}

		if (JumpDistance >= opts::TailDuplicationMaxCacheDistance)
		return 0;
		double Prob = 1.0 - static_cast<double>(JumpDistance) /
		opts::TailDuplicationMaxCacheDistance;
		return (IsForwardJump ? 1.0 : opts::TailDuplicationCacheBackwardWeight) *
		Prob * Count;
		}

		bool TailDuplication::cacheScoreImproved(const MCCodeEmitter *Emitter,
		BinaryFunction &BF,
		BinaryBasicBlock *Pred,
		BinaryBasicBlock *Tail) const {
		// Collect (estimated) basic block sizes
		DenseMap<BinaryBasicBlock *, uint64_t> BBSize;
		for (BinaryBasicBlock *BB : BF.layout()) {
		BBSize[BB] = std::max<uint64_t>(BB->estimateSize(Emitter), 1);
		}

		// Build current addresses of basic blocks starting at the entry block
		DenseMap<BinaryBasicBlock *, uint64_t> CurAddr;
		uint64_t Addr = 0;
		for (BinaryBasicBlock *SrcBB : BF.layout()) {
		CurAddr[SrcBB] = Addr;
		Addr += BBSize[SrcBB];
		}

		// Build new addresses (after duplication) starting at the entry block
		DenseMap<BinaryBasicBlock *, uint64_t> NewAddr;
		Addr = 0;
		for (BinaryBasicBlock *SrcBB : BF.layout()) {
		NewAddr[SrcBB] = Addr;
		Addr += BBSize[SrcBB];
		if (SrcBB == Pred)
		Addr += BBSize[Tail];
		}

		// Compute the cache score for the existing layout of basic blocks
		double CurScore = 0;
		for (BinaryBasicBlock *SrcBB : BF.layout()) {
		auto BI = SrcBB->branch_info_begin();
		for (BinaryBasicBlock *DstBB : SrcBB->successors()) {
		if (SrcBB != DstBB) {
		CurScore += cacheScore(CurAddr[SrcBB], BBSize[SrcBB], CurAddr[DstBB],
		BBSize[DstBB], BI->Count);
		}
		++BI;
		}
		}

		// Compute the cache score for the layout of blocks after tail duplication
		double NewScore = 0;
		for (BinaryBasicBlock *SrcBB : BF.layout()) {
		auto BI = SrcBB->branch_info_begin();
		for (BinaryBasicBlock *DstBB : SrcBB->successors()) {
		if (SrcBB != DstBB) {
		if (SrcBB == Pred && DstBB == Tail) {
		NewScore += cacheScore(NewAddr[SrcBB], BBSize[SrcBB],
		NewAddr[SrcBB] + BBSize[SrcBB], BBSize[DstBB],
		BI->Count);
		} else {
		NewScore += cacheScore(NewAddr[SrcBB], BBSize[SrcBB], NewAddr[DstBB],
		BBSize[DstBB], BI->Count);
		}
		}
		++BI;
		}
		}

		return NewScore > CurScore;
		}

		std::vector<BinaryBasicBlock *>
		TailDuplication::cacheDuplicate(const MCCodeEmitter *Emitter,
		BinaryFunction &BF, BinaryBasicBlock *Pred,
		BinaryBasicBlock *Tail) const {
		std::vector<BinaryBasicBlock *> BlocksToDuplicate;

		// No need to duplicate cold basic blocks
		if (Pred->isCold() || Tail->isCold()) {
		return BlocksToDuplicate;
		}
		// Always duplicate "small" tail basic blocks, which might be beneficial for
		// code size, since a jump instruction is eliminated
		if (Tail->estimateSize(Emitter) <= opts::TailDuplicationMinimumDuplication) {
		BlocksToDuplicate.push_back(Tail);
		return BlocksToDuplicate;
		}
		// Never duplicate "large" tail basic blocks
		if (Tail->estimateSize(Emitter) > opts::TailDuplicationMaximumDuplication) {
		return BlocksToDuplicate;
		}
		// Do not append basic blocks after the last hot block in the current layout
		auto NextBlock = BF.getBasicBlockAfter(Pred);
		if (NextBlock == nullptr || (!Pred->isCold() && NextBlock->isCold())) {
		return BlocksToDuplicate;
		}

		// Duplicate the tail only if it improves the cache score
		if (cacheScoreImproved(Emitter, BF, Pred, Tail)) {
		BlocksToDuplicate.push_back(Tail);
		}

		return BlocksToDuplicate;
		}

		std::vector<BinaryBasicBlock *> TailDuplication::duplicateBlocks(
BinaryBasicBlock &BB,		BinaryBasicBlock &BB,
const std::vector<BinaryBasicBlock *> &BlocksToDuplicate) const {		const std::vector<BinaryBasicBlock *> &BlocksToDuplicate) const {
BinaryFunction *BF = BB.getFunction();		BinaryFunction *BF = BB.getFunction();
BinaryContext &BC = BF->getBinaryContext();		BinaryContext &BC = BF->getBinaryContext();

// Ratio of this new branches execution count to the total size of the		// Ratio of this new branches execution count to the total size of the
// successor's execution count. Used to set this new branches execution count		// successor's execution count. Used to set this new branches execution count
// and lower the old successor's execution count		// and lower the old successor's execution count
double ExecutionCountRatio =		double ExecutionCountRatio =
BB.getExecutionCount() > BB.getSuccessor()->getExecutionCount()		BB.getExecutionCount() >= BB.getSuccessor()->getExecutionCount()
? 1.0		? 1.0
: (double)BB.getExecutionCount() /		: (double)BB.getExecutionCount() /
BB.getSuccessor()->getExecutionCount();		BB.getSuccessor()->getExecutionCount();

// Use the last branch info when adding a successor to LastBB		// Use the last branch info when adding a successor to LastBB
BinaryBasicBlock::BinaryBranchInfo &LastBI =		BinaryBasicBlock::BinaryBranchInfo &LastBI =
BB.getBranchInfo(*(BB.getSuccessor()));		BB.getBranchInfo(*(BB.getSuccessor()));

BinaryBasicBlock *LastOriginalBB = &BB;		BinaryBasicBlock *LastOriginalBB = &BB;
BinaryBasicBlock *LastDuplicatedBB = &BB;		BinaryBasicBlock *LastDuplicatedBB = &BB;
assert(LastDuplicatedBB->succ_size() == 1 &&		assert(LastDuplicatedBB->succ_size() == 1 &&
"tail duplication cannot act on a block with more than 1 successor");		"tail duplication cannot act on a block with more than 1 successor");
LastDuplicatedBB->removeSuccessor(LastDuplicatedBB->getSuccessor());		LastDuplicatedBB->removeSuccessor(LastDuplicatedBB->getSuccessor());

std::vector<std::unique_ptr<BinaryBasicBlock>> DuplicatedBlocks;		std::vector<std::unique_ptr<BinaryBasicBlock>> DuplicatedBlocks;
std::vector<BinaryBasicBlock *> DuplicatedBlocksToReturn;		std::vector<BinaryBasicBlock *> DuplicatedBlocksToReturn;

for (BinaryBasicBlock *CurrBB : BlocksToDuplicate) {		for (BinaryBasicBlock *CurBB : BlocksToDuplicate) {
DuplicatedBlocks.emplace_back(		DuplicatedBlocks.emplace_back(
BF->createBasicBlock(0, (BC.Ctx)->createNamedTempSymbol("tail-dup")));		BF->createBasicBlock(0, (BC.Ctx)->createNamedTempSymbol("tail-dup")));
BinaryBasicBlock *NewBB = DuplicatedBlocks.back().get();		BinaryBasicBlock *NewBB = DuplicatedBlocks.back().get();

NewBB->addInstructions(CurrBB->begin(), CurrBB->end());		NewBB->addInstructions(CurBB->begin(), CurBB->end());
// Set execution count as if it was just a copy of the original		// Set execution count as if it was just a copy of the original
NewBB->setExecutionCount(		NewBB->setExecutionCount(CurBB->getExecutionCount());
std::max((uint64_t)1, CurrBB->getExecutionCount()));		NewBB->setIsCold(CurBB->isCold());
LastDuplicatedBB->addSuccessor(NewBB, LastBI);		LastDuplicatedBB->addSuccessor(NewBB, LastBI);

DuplicatedBlocksToReturn.push_back(NewBB);		DuplicatedBlocksToReturn.push_back(NewBB);

// As long as its not the first block, adjust both original and duplicated		// As long as its not the first block, adjust both original and duplicated
// to what they should be		// to what they should be
if (LastDuplicatedBB != &BB) {		if (LastDuplicatedBB != &BB) {
LastOriginalBB->adjustExecutionCount(1.0 - ExecutionCountRatio);		LastOriginalBB->adjustExecutionCount(1.0 - ExecutionCountRatio);
LastDuplicatedBB->adjustExecutionCount(ExecutionCountRatio);		LastDuplicatedBB->adjustExecutionCount(ExecutionCountRatio);
}		}

if (CurrBB->succ_size() == 1)		if (CurBB->succ_size() == 1)
LastBI = CurrBB->getBranchInfo(*(CurrBB->getSuccessor()));		LastBI = CurBB->getBranchInfo(*(CurBB->getSuccessor()));

LastOriginalBB = CurrBB;		LastOriginalBB = CurBB;
LastDuplicatedBB = NewBB;		LastDuplicatedBB = NewBB;
}		}

LastDuplicatedBB->addSuccessors(		LastDuplicatedBB->addSuccessors(
LastOriginalBB->succ_begin(), LastOriginalBB->succ_end(),		LastOriginalBB->succ_begin(), LastOriginalBB->succ_end(),
LastOriginalBB->branch_info_begin(), LastOriginalBB->branch_info_end());		LastOriginalBB->branch_info_begin(), LastOriginalBB->branch_info_end());

LastOriginalBB->adjustExecutionCount(1.0 - ExecutionCountRatio);		LastOriginalBB->adjustExecutionCount(1.0 - ExecutionCountRatio);
LastDuplicatedBB->adjustExecutionCount(ExecutionCountRatio);		LastDuplicatedBB->adjustExecutionCount(ExecutionCountRatio);

BF->insertBasicBlocks(&BB, std::move(DuplicatedBlocks));		BF->insertBasicBlocks(&BB, std::move(DuplicatedBlocks));

return DuplicatedBlocksToReturn;		return DuplicatedBlocksToReturn;
}		}

void TailDuplication::runOnFunction(BinaryFunction &Function) {		void TailDuplication::runOnFunction(BinaryFunction &Function) {
		// Create a separate MCCodeEmitter to allow lock-free execution
		BinaryContext::IndependentCodeEmitter Emitter;
		if (!opts::NoThreads) {
		Emitter = Function.getBinaryContext().createIndependentMCCodeEmitter();
		}

		Function.updateLayoutIndices();

// New blocks will be added and layout will change,		// New blocks will be added and layout will change,
// so make a copy here to iterate over the original layout		// so make a copy here to iterate over the original layout
BinaryFunction::BasicBlockOrderType BlockLayout = Function.getLayout();		BinaryFunction::BasicBlockOrderType BlockLayout = Function.getLayout();
		bool ModifiedFunction = false;
for (BinaryBasicBlock *BB : BlockLayout) {		for (BinaryBasicBlock *BB : BlockLayout) {
if (BB->succ_size() == 1 &&		AllDynamicCount += BB->getKnownExecutionCount();
BB->getSuccessor()->getLayoutIndex() != BB->getLayoutIndex() + 1)
UnconditionalBranchDynamicCount += BB->getExecutionCount();
if (BB->succ_size() == 2 &&
BB->getFallthrough()->getLayoutIndex() != BB->getLayoutIndex() + 1)
UnconditionalBranchDynamicCount += BB->getFallthroughBranchInfo().Count;
AllBlocksDynamicCount += BB->getExecutionCount();

// The block must be hot		// The block must be with one successor
if (BB->getExecutionCount() == 0)
continue;
// with one successor
if (BB->succ_size() != 1)		if (BB->succ_size() != 1)
continue;		continue;
		BinaryBasicBlock *Tail = BB->getSuccessor();
// no jump table		// Verify that the tail should be duplicated
if (BB->hasJumpTable())		if (!shouldDuplicate(BB, Tail))
continue;

// Skip not-in-layout, i.e. unreachable, blocks.
if (BB->getLayoutIndex() >= BlockLayout.size())
continue;		continue;

// and we are estimating that this successor is not already in the same cache
// line
BinaryBasicBlock *Succ = BB->getSuccessor();
if (isInCacheLine(BB, Succ))
continue;
std::vector<BinaryBasicBlock *> BlocksToDuplicate;		std::vector<BinaryBasicBlock *> BlocksToDuplicate;
if (opts::TailDuplicationAggressive)		if (opts::TailDuplicationMode == TailDuplication::TD_AGGRESSIVE) {
BlocksToDuplicate = aggressiveCodeToDuplicate(*Succ);		BlocksToDuplicate = aggressiveDuplicate(BB, Tail);
else		} else if (opts::TailDuplicationMode == TailDuplication::TD_MODERATE) {
BlocksToDuplicate = moderateCodeToDuplicate(*Succ);		BlocksToDuplicate = moderateDuplicate(BB, Tail);
		} else if (opts::TailDuplicationMode == TailDuplication::TD_CACHE) {
		BlocksToDuplicate = cacheDuplicate(Emitter.MCE.get(), Function, BB, Tail);
		} else {
		llvm_unreachable("unknown tail duplication mode");
		}

if (BlocksToDuplicate.size() == 0)		if (BlocksToDuplicate.empty())
continue;
PossibleDuplications++;
PossibleDuplicationsDynamicCount += BB->getExecutionCount();
std::vector<BinaryBasicBlock *> DuplicatedBlocks =
tailDuplicate(*BB, BlocksToDuplicate);
if (!opts::TailDuplicationConstCopyPropagation)
continue;		continue;

		// Apply the duplication
		ModifiedFunction = true;
		DuplicationsDynamicCount += BB->getExecutionCount();
		auto DuplicatedBlocks = duplicateBlocks(*BB, BlocksToDuplicate);
		for (BinaryBasicBlock *BB : DuplicatedBlocks) {
		DuplicatedBlockCount++;
		DuplicatedByteCount += BB->estimateSize(Emitter.MCE.get());
		}

		if (opts::TailDuplicationConstCopyPropagation) {
constantAndCopyPropagate(*BB, DuplicatedBlocks);		constantAndCopyPropagate(*BB, DuplicatedBlocks);
BinaryBasicBlock *FirstBB = BlocksToDuplicate[0];		BinaryBasicBlock *FirstBB = BlocksToDuplicate[0];
if (FirstBB->pred_size() == 1) {		if (FirstBB->pred_size() == 1) {
BinaryBasicBlock *PredBB = *FirstBB->pred_begin();		BinaryBasicBlock *PredBB = *FirstBB->pred_begin();
if (PredBB->succ_size() == 1)		if (PredBB->succ_size() == 1)
constantAndCopyPropagate(*PredBB, BlocksToDuplicate);		constantAndCopyPropagate(*PredBB, BlocksToDuplicate);
}		}
}		}

		// Layout indices might be stale after duplication
		Function.updateLayoutIndices();
		}
		if (ModifiedFunction)
		ModifiedFunctions++;
}		}

void TailDuplication::runOnFunctions(BinaryContext &BC) {		void TailDuplication::runOnFunctions(BinaryContext &BC) {
		if (opts::TailDuplicationMode == TailDuplication::TD_NONE)
		return;

for (auto &It : BC.getBinaryFunctions()) {		for (auto &It : BC.getBinaryFunctions()) {
BinaryFunction &Function = It.second;		BinaryFunction &Function = It.second;
if (!shouldOptimize(Function))		if (!shouldOptimize(Function))
continue;		continue;
runOnFunction(Function);		runOnFunction(Function);
}		}

outs() << "BOLT-INFO: tail duplication possible duplications: "		outs() << "BOLT-INFO: tail duplication"
<< PossibleDuplications << "\n";		<< format(" modified %zu (%.2f%%) functions;", ModifiedFunctions,
outs() << "BOLT-INFO: tail duplication possible dynamic reductions: "		100.0 * ModifiedFunctions / BC.getBinaryFunctions().size())
<< PossibleDuplicationsDynamicCount << "\n";		<< format(" duplicated %zu blocks (%zu bytes) responsible for",
outs() << "BOLT-INFO: tail duplication possible dynamic reductions to "		DuplicatedBlockCount, DuplicatedByteCount)
"unconditional branch execution : "		<< format(" %zu dynamic executions (%.2f%% of all block executions)",
<< format("%.1f", ((float)PossibleDuplicationsDynamicCount * 100.0f) /		DuplicationsDynamicCount,
UnconditionalBranchDynamicCount)		100.0 * DuplicationsDynamicCount / AllDynamicCount)
<< "%\n";		<< "\n";
outs() << "BOLT-INFO: tail duplication possible dynamic reductions to all "
"blocks execution : "		if (opts::TailDuplicationConstCopyPropagation) {
<< format("%.1f", ((float)PossibleDuplicationsDynamicCount * 100.0f) /		outs() << "BOLT-INFO: tail duplication "
AllBlocksDynamicCount)		<< format("applied %zu static and %zu dynamic propagation deletions",
<< "%\n";		StaticInstructionDeletionCount,
outs() << "BOLT-INFO: tail duplication static propagation deletions: "		DynamicInstructionDeletionCount)
<< StaticInstructionDeletionCount << "\n";		<< "\n";
outs() << "BOLT-INFO: tail duplication dynamic propagation deletions: "		}
<< DynamicInstructionDeletionCount << "\n"; //
}		}

} // end namespace bolt		} // end namespace bolt
} // end namespace llvm		} // end namespace llvm
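
For readers following the cache-score comparison in cacheScoreImproved above, the decision can be reproduced on a toy layout. This is only a sketch under stated assumptions: `toyCacheScore`, `Edge`, `layoutAddresses`, and `toyScoreImproved` are hypothetical names that do not exist in BOLT, blocks are plain indices into a size vector, and the scoring function (a linear decay with jump distance, capped at an assumed 64-byte window) is a stand-in, not the pass's actual cacheScore. What it mirrors from the diff is the address bookkeeping: the new layout shifts everything after Pred by the size of Tail, and the Pred->Tail edge is scored against the copy placed right after Pred.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the pass's cacheScore(): the benefit of a jump
// decays linearly with its distance and vanishes at MaxDistance bytes.
// The real scoring function may differ; this is an assumption for illustration.
double toyCacheScore(uint64_t SrcAddr, uint64_t SrcSize, uint64_t DstAddr,
                     uint64_t DstSize, uint64_t Count,
                     uint64_t MaxDistance = 64) {
  uint64_t Distance = SrcAddr <= DstAddr ? (DstAddr + DstSize) - SrcAddr
                                         : (SrcAddr + SrcSize) - DstAddr;
  if (Distance >= MaxDistance)
    return 0.0;
  return (1.0 - static_cast<double>(Distance) / MaxDistance) * Count;
}

// A profiled CFG edge between two blocks, identified by layout index.
struct Edge {
  size_t Src;
  size_t Dst;
  uint64_t Count;
};

// Start address of every block in layout order. When Duplicate is set, room
// for a copy of block Tail is reserved right after block Pred.
std::vector<uint64_t> layoutAddresses(const std::vector<uint64_t> &Sizes,
                                      bool Duplicate, size_t Pred,
                                      size_t Tail) {
  assert(!Duplicate || (Pred < Sizes.size() && Tail < Sizes.size()));
  std::vector<uint64_t> Addr(Sizes.size());
  uint64_t Cur = 0;
  for (size_t I = 0; I < Sizes.size(); ++I) {
    Addr[I] = Cur;
    Cur += Sizes[I];
    if (Duplicate && I == Pred)
      Cur += Sizes[Tail];
  }
  return Addr;
}

// True if placing a copy of Tail after Pred raises the total toy score.
bool toyScoreImproved(const std::vector<uint64_t> &Sizes,
                      const std::vector<Edge> &Edges, size_t Pred,
                      size_t Tail) {
  std::vector<uint64_t> CurAddr = layoutAddresses(Sizes, false, Pred, Tail);
  std::vector<uint64_t> NewAddr = layoutAddresses(Sizes, true, Pred, Tail);
  double CurScore = 0, NewScore = 0;
  for (const Edge &E : Edges) {
    if (E.Src == E.Dst)
      continue; // self-loops are skipped, as in the diff
    CurScore += toyCacheScore(CurAddr[E.Src], Sizes[E.Src], CurAddr[E.Dst],
                              Sizes[E.Dst], E.Count);
    // The Pred->Tail edge now lands on the copy placed right after Pred.
    uint64_t DstAddr = (E.Src == Pred && E.Dst == Tail)
                           ? NewAddr[E.Src] + Sizes[E.Src]
                           : NewAddr[E.Dst];
    NewScore += toyCacheScore(NewAddr[E.Src], Sizes[E.Src], DstAddr,
                              Sizes[E.Dst], E.Count);
  }
  return NewScore > CurScore;
}
```

With block sizes {8, 20, 20, 8}, Pred = 0 and Tail = 3, a hot 0->3 edge makes the duplication worthwhile in this toy model, while a cold 0->3 edge combined with a hot back edge 2->0 does not, because the inserted copy pushes the loop body further from its header.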

bolt/lib/Rewrite/BinaryPassManager.cpp

	StripRepRet("strip-rep-ret",
cl::init(true), cl::ZeroOrMore, cl::cat(BoltOptCategory));		cl::init(true), cl::ZeroOrMore, cl::cat(BoltOptCategory));

static cl::opt<bool>		static cl::opt<bool>
VerifyCFG("verify-cfg",		VerifyCFG("verify-cfg",
cl::desc("verify the CFG after every pass"),		cl::desc("verify the CFG after every pass"),
cl::init(false), cl::Hidden, cl::ZeroOrMore, cl::cat(BoltOptCategory));		cl::init(false), cl::Hidden, cl::ZeroOrMore, cl::cat(BoltOptCategory));

static cl::opt<bool>		static cl::opt<bool>
TailDuplicationFlag("tail-duplication",
cl::desc("duplicate unconditional branches that cross a cache line"),
cl::ZeroOrMore, cl::ReallyHidden, cl::cat(BoltOptCategory));

static cl::opt<bool>
ThreeWayBranchFlag("three-way-branch",		ThreeWayBranchFlag("three-way-branch",
cl::desc("reorder three way branches"),		cl::desc("reorder three way branches"),
cl::ZeroOrMore, cl::ReallyHidden, cl::cat(BoltOptCategory));		cl::ZeroOrMore, cl::ReallyHidden, cl::cat(BoltOptCategory));

static cl::opt<bool> CMOVConversionFlag("cmov-conversion",		static cl::opt<bool> CMOVConversionFlag("cmov-conversion",
cl::desc("fold jcc+mov into cmov"),		cl::desc("fold jcc+mov into cmov"),
cl::ZeroOrMore, cl::ReallyHidden,		cl::ZeroOrMore, cl::ReallyHidden,
cl::cat(BoltOptCategory));		cl::cat(BoltOptCategory));
	void BinaryFunctionPassManager::runAllPasses(BinaryContext &BC) {

Manager.registerPass(std::make_unique<EliminateUnreachableBlocks>(PrintUCE),		Manager.registerPass(std::make_unique<EliminateUnreachableBlocks>(PrintUCE),
opts::EliminateUnreachable);		opts::EliminateUnreachable);

Manager.registerPass(std::make_unique<SplitFunctions>(PrintSplit));		Manager.registerPass(std::make_unique<SplitFunctions>(PrintSplit));

Manager.registerPass(std::make_unique<LoopInversionPass>());		Manager.registerPass(std::make_unique<LoopInversionPass>());

Manager.registerPass(std::make_unique<TailDuplication>(),		Manager.registerPass(std::make_unique<TailDuplication>());
opts::TailDuplicationFlag);

Manager.registerPass(std::make_unique<CMOVConversion>(),		Manager.registerPass(std::make_unique<CMOVConversion>(),
opts::CMOVConversionFlag);		opts::CMOVConversionFlag);

// This pass syncs local branches with CFG. If any of the following		// This pass syncs local branches with CFG. If any of the following
// passes breaks the sync - they either need to re-run the pass or		// passes breaks the sync - they either need to re-run the pass or
// fix branches consistency internally.		// fix branches consistency internally.
Manager.registerPass(std::make_unique<FixupBranches>(PrintAfterBranchFixup));		Manager.registerPass(std::make_unique<FixupBranches>(PrintAfterBranchFixup));

bolt/test/X86/tail-duplication-cache.s

This file was added.

				# REQUIRES: system-linux

				# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \
				# RUN: %s -o %t.o
				# RUN: link_fdata %s %t.o %t.fdata
				# RUN: link_fdata %s %t.o %t.fdata2 "FDATA2"
				# RUN: %clang %cflags %t.o -o %t.exe -Wl,-q
				# RUN: llvm-bolt %t.exe -data %t.fdata -reorder-blocks=none -print-finalized \
				# RUN: -tail-duplication=cache -o %t.out | FileCheck %s
				# RUN: llvm-bolt %t.exe -data %t.fdata2 -reorder-blocks=none -print-finalized \
				# RUN: -tail-duplication=cache -o %t.out2 | FileCheck --check-prefix="CHECK2" %s

				# A test where the tail is duplicated to eliminate an unconditional jump
				# FDATA: 1 main #.BB0_br# 1 main #.BB4# 0 100
				# FDATA: 1 main #.BB0_br# 1 main #.BB1# 0 100
				# FDATA: 1 main #.BB1_br# 1 main #.BB3# 0 50
				# FDATA: 1 main #.BB1_br# 1 main #.BB2# 0 50
				# FDATA: 1 main #.BB3_br# 1 main #.BB2# 0 50
				# CHECK: BOLT-INFO: tail duplication modified 1 ({{.*}}%) functions; duplicated 1 blocks (13 bytes) responsible for 50 dynamic executions ({{.*}}% of all block executions)
				# CHECK: BB Layout : .LBB00, .Ltmp0, .Ltmp1, .Ltmp2, .Ltmp3, .Ltmp4, .Ltmp5, .Ltail-dup0, .Ltmp6

				# A test where the tail is not duplicated due to the cache score
				# FDATA2: 1 main #.BB0_br# 1 main #.BB4# 0 100
				# FDATA2: 1 main #.BB0_br# 1 main #.BB1# 0 2
				# FDATA2: 1 main #.BB1_br# 1 main #.BB3# 0 1
				# FDATA2: 1 main #.BB1_br# 1 main #.BB2# 0 1
				# FDATA2: 1 main #.BB3_br# 1 main #.BB2# 0 1
				# CHECK2: BOLT-INFO: tail duplication modified 0 (0.00%) functions; duplicated 0 blocks (0 bytes) responsible for 0 dynamic executions (0.00% of all block executions)
				# CHECK2: BB Layout : .LBB00, .Ltmp0, .Ltmp1, .Ltmp2, .Ltmp3, .Ltmp4, .Ltmp5, .Ltmp6

				.text
				.globl main
				.type main, %function
				.size main, .Lend-main
				main:
				.BB0:
				xor %eax, %eax
				cmpl %eax, %ebx
				.BB0_br:
				je .BB4
				.BB1:
				inc %rax
				.BB1_br:
				je .BB3
				.BB2:
				inc %rax
				inc %rax
				inc %rax
				inc %rax
				retq
				.BB3:
				inc %rax
				.BB3_br:
				jmp .BB2
				.BB4:
				retq
				# For relocations against .text
				call exit
				.Lend:

bolt/test/X86/tail-duplication-cacheline.s

	# This reproduces a bug in TailDuplication::isInCacheLine			# This reproduces a bug in TailDuplication::isInCacheLine
	# with accessing BlockLayout past bounds (unreachable blocks).			# with accessing BlockLayout past bounds (unreachable blocks).

	# REQUIRES: system-linux			# REQUIRES: system-linux

	# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \			# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \
	# RUN: %s -o %t.o			# RUN: %s -o %t.o
	# RUN: link_fdata %s %t.o %t.fdata			# RUN: link_fdata %s %t.o %t.fdata
	# RUN: llvm-strip --strip-unneeded %t.o			# RUN: llvm-strip --strip-unneeded %t.o
	# RUN: %clang %cflags -no-pie %t.o -o %t.exe -Wl,-q -nostdlib			# RUN: %clang %cflags -no-pie %t.o -o %t.exe -Wl,-q -nostdlib
	# RUN: llvm-bolt %t.exe -o %t.out -data %t.fdata -relocs \			# RUN: llvm-bolt %t.exe -o %t.out -data %t.fdata -relocs \
	# RUN: -tail-duplication=1 -tail-duplication-aggressive=1			# RUN: -tail-duplication=aggressive
	.globl _start			.globl _start
	_start:			_start:
	jmp d			jmp d
	je _start			je _start
	movl %esi, %edi			movl %esi, %edi
	d:			d:
	jmpq *JT0(,%rcx,8)			jmpq *JT0(,%rcx,8)
	# FDATA: 1 _start #d# 1 _start #e# 1 3			# FDATA: 1 _start #d# 1 _start #e# 1 3

bolt/test/X86/tail-duplication-complex.s

	# REQUIRES: system-linux			# REQUIRES: system-linux

	# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \			# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \
	# RUN: %s -o %t.o			# RUN: %s -o %t.o
	# RUN: link_fdata %s %t.o %t.fdata			# RUN: link_fdata %s %t.o %t.fdata
	# RUN: %clang %cflags %t.o -o %t.exe -Wl,-q			# RUN: %clang %cflags %t.o -o %t.exe -Wl,-q

	# RUN: llvm-bolt %t.exe -data %t.fdata -print-finalized \			# RUN: llvm-bolt %t.exe -data %t.fdata -print-finalized \
	# RUN: -tail-duplication -tail-duplication-minimum-offset 1 -o %t.out | FileCheck %s			# RUN: -tail-duplication=moderate -tail-duplication-minimum-offset 1 -o %t.out | FileCheck %s

	# FDATA: 1 main f 1 main 19 0 10			# FDATA: 1 main f 1 main 19 0 10
	# FDATA: 1 main f 1 main 11 0 13			# FDATA: 1 main f 1 main 11 0 13
	# FDATA: 1 main 17 1 main 3c 0 10			# FDATA: 1 main 17 1 main 3c 0 10
	# FDATA: 1 main 39 1 main 3c 0 10			# FDATA: 1 main 39 1 main 3c 0 10

	# CHECK: tail duplication possible duplications: 1			# CHECK: tail duplication modified 1 ({{.*}}%) functions; duplicated 1 blocks ({{.*}} bytes) responsible for {{.*}} dynamic executions ({{.*}} of all block executions)
	# CHECK: BB Layout : .LBB00, .Ltmp0, .Ltail-dup0, .Ltmp1, .Ltmp2			# CHECK: BB Layout : .LBB00, .Ltmp0, .Ltail-dup0, .Ltmp1, .Ltmp2

	# This is the C++ code fed to Clang			# This is the C++ code fed to Clang
	# int fib(int term) {			# int fib(int term) {
	# if (term <= 1)			# if (term <= 1)
	# return term;			# return term;
	# return fib(term-1) + fib(term-2);			# return fib(term-1) + fib(term-2);
	# }			# }

bolt/test/X86/tail-duplication-jt.s

	# This reproduces a bug in tail duplication when aggressiveCodeToDuplicate			# This reproduces a bug in tail duplication when aggressiveCodeToDuplicate
	# fails to handle a block with a jump table.			# fails to handle a block with a jump table.

	# REQUIRES: system-linux			# REQUIRES: system-linux

	# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \			# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \
	# RUN: %s -o %t.o			# RUN: %s -o %t.o
	# RUN: link_fdata %s %t.o %t.fdata			# RUN: link_fdata %s %t.o %t.fdata
	# RUN: llvm-strip --strip-unneeded %t.o			# RUN: llvm-strip --strip-unneeded %t.o
	# RUN: %clangxx %cflags -no-pie %t.o -o %t.exe -Wl,-q			# RUN: %clangxx %cflags -no-pie %t.o -o %t.exe -Wl,-q
	# RUN: llvm-bolt %t.exe -o %t.out -data %t.fdata -relocs \			# RUN: llvm-bolt %t.exe -o %t.out -data %t.fdata -relocs \
	# RUN: -tail-duplication=1 -tail-duplication-aggressive=1 \			# RUN: -tail-duplication=aggressive \
	# RUN: -print-cfg | FileCheck %s			# RUN: -print-cfg | FileCheck %s
	# CHECK: Jump table {{.*}} for function a at {{.*}} with a total count of 3			# CHECK: Jump table {{.*}} for function a at {{.*}} with a total count of 3
	.globl main			.globl main
	main:			main:
	.globl a			.globl a
	.type a, %function			.type a, %function
	a:			a:
	.cfi_startproc			.cfi_startproc

bolt/test/X86/tail-duplication-pass.s

	# REQUIRES: system-linux			# REQUIRES: system-linux

	# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \			# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \
	# RUN: %s -o %t.o			# RUN: %s -o %t.o
	# RUN: link_fdata %s %t.o %t.fdata			# RUN: link_fdata %s %t.o %t.fdata
	# RUN: %clang %cflags %t.o -o %t.exe -Wl,-q			# RUN: %clang %cflags %t.o -o %t.exe -Wl,-q
	# RUN: llvm-bolt %t.exe -data %t.fdata -reorder-blocks=ext-tsp -print-finalized \			# RUN: llvm-bolt %t.exe -data %t.fdata -reorder-blocks=ext-tsp -print-finalized \
	# RUN: -tail-duplication -tail-duplication-minimum-offset 1 -o %t.out | FileCheck %s			# RUN: -tail-duplication=moderate -tail-duplication-minimum-offset=1 -o %t.out | FileCheck %s

	# FDATA: 1 main 2 1 main #.BB2# 0 10			# FDATA: 1 main 2 1 main #.BB2# 0 10
	# FDATA: 1 main 4 1 main #.BB2# 0 20			# FDATA: 1 main 4 1 main #.BB2# 0 20
	# CHECK: tail duplication possible duplications: 1			# CHECK: BOLT-INFO: tail duplication modified 1 ({{.*}}%) functions; duplicated 1 blocks (1 bytes) responsible for {{.*}} dynamic executions ({{.*}}% of all block executions)
	# CHECK: BB Layout : .LBB00, .Ltail-dup0, .Ltmp0, .Ltmp1			# CHECK: BB Layout : .LBB00, .Ltail-dup0, .Ltmp0, .Ltmp1

	.text			.text
	.globl main			.globl main
	.type main, %function			.type main, %function
	.size main, .Lend-main			.size main, .Lend-main
	main:			main:
	xor %eax, %eax			xor %eax, %eax
	jmp .BB2			jmp .BB2
	.BB1:			.BB1:
	inc %rax			inc %rax
	.BB2:			.BB2:
	retq			retq
	# For relocations against .text			# For relocations against .text
	call exit			call exit
	.Lend:			.Lend:

bolt/test/X86/tail-duplication-prop-bug.s

	# This reproduces a bug in aggressive tail duplication/copy propagation.			# This reproduces a bug in aggressive tail duplication/copy propagation.

	# REQUIRES: system-linux			# REQUIRES: system-linux
	# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s -o %t.o			# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown %s -o %t.o
	# RUN: link_fdata %s %t.o %t.fdata			# RUN: link_fdata %s %t.o %t.fdata
	# RUN: llvm-strip --strip-unneeded %t.o			# RUN: llvm-strip --strip-unneeded %t.o
	# RUN: ld.lld %t.o -o %t.exe -q -nostdlib			# RUN: ld.lld %t.o -o %t.exe -q -nostdlib
	# RUN: llvm-bolt %t.exe -o %t.out -data %t.fdata -relocs \			# RUN: llvm-bolt %t.exe -o %t.out -data %t.fdata -relocs \
	# RUN: -tail-duplication=1 -tail-duplication-aggressive=1 \			# RUN: -tail-duplication=aggressive \
	# RUN: -tail-duplication-const-copy-propagation=1			# RUN: -tail-duplication-const-copy-propagation=1

	.text			.text
	.type a, %function			.type a, %function
	.globl a			.globl a
	a:			a:
	.cfi_startproc			.cfi_startproc
	jmpq *JT(,%rcx,8)			jmpq *JT(,%rcx,8)

bolt/test/runtime/X86/tail-duplication-constant-prop.s

	# REQUIRES: system-linux			# REQUIRES: system-linux

	# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \			# RUN: llvm-mc -filetype=obj -triple x86_64-unknown-unknown \
	# RUN: %s -o %t.o			# RUN: %s -o %t.o
	# RUN: link_fdata %s %t.o %t.fdata			# RUN: link_fdata %s %t.o %t.fdata
	# RUN: %clang %cflags %t.o -o %t.exe -Wl,-q			# RUN: %clang %cflags %t.o -o %t.exe -Wl,-q
	# RUN: llvm-bolt %t.exe -data %t.fdata -reorder-blocks=ext-tsp -print-finalized \			# RUN: llvm-bolt %t.exe -data %t.fdata -reorder-blocks=ext-tsp -print-finalized \
	# RUN: -tail-duplication -tail-duplication-minimum-offset 1 -o %t.out | FileCheck %s			# RUN: -tail-duplication=moderate -tail-duplication-minimum-offset=1 -tail-duplication-const-copy-propagation=1 -o %t.out | FileCheck %s
	# RUN: %t.exe; echo $?			# RUN: %t.exe; echo $?
	# RUN: %t.out; echo $?			# RUN: %t.out; echo $?

	# FDATA: 1 main 14 1 main #.BB2# 0 10			# FDATA: 1 main 14 1 main #.BB2# 0 10
	# FDATA: 1 main 16 1 main #.BB2# 0 20			# FDATA: 1 main 16 1 main #.BB2# 0 20
	# CHECK: tail duplication possible duplications: 1			# CHECK: BOLT-INFO: tail duplication modified 1 ({{.*}}%) functions; duplicated 1 blocks ({{.*}} bytes) responsible for {{.*}} dynamic executions ({{.*}}% of all block executions)
	# CHECK: BB Layout : .LBB00, .Ltail-dup0, .Ltmp0, .Ltmp1			# CHECK: BB Layout : .LBB00, .Ltail-dup0, .Ltmp0, .Ltmp1
	# CHECK-NOT: mov $0x2, %rbx			# CHECK-NOT: mov $0x2, %rbx

	.text			.text
	.globl main			.globl main
	.type main, %function			.type main, %function
	.size main, .Lend-main			.size main, .Lend-main
	main:			main:
	mov $0x2, %rbx			mov $0x2, %rbx
	mov $0x1, %rdi			mov $0x1, %rdi
	inc %rdi			inc %rdi
	mov %rdi, %rsi			mov %rdi, %rsi
	jmp .BB2			jmp .BB2
	.BB1:			.BB1:
	mov $0x9, %rbx			mov $0x9, %rbx
	.BB2:			.BB2:
	mov %rbx, %rax			mov %rbx, %rax
	mov $0x5, %rbx			mov $0x5, %rbx
	add %rsi, %rax			add %rsi, %rax
	jmp .BB4			retq
	.BB3:			.BB3:
	mov $0x9, %rbx			mov $0x9, %rbx
	.BB4:			.BB4:
	mov $0xa, %rsi			mov $0xa, %rsi
	add %rbx, %rax			add %rbx, %rax
	add %rsi, %rax			add %rsi, %rax
	.BB5:			.BB5:
	retq			retq
	.Lend:			.Lend:
