
Details

Reviewers

chandlerc
djasper
dexonsmith

Commits

rGb90b9e053109: In MachineBlockPlacement, filter cold blocks off the loop chain when profile…
rL251833: In MachineBlockPlacement, filter cold blocks off the loop chain when profile…

Summary

In the current BB placement algorithm, a loop chain always contains all loop blocks. This has a drawback that cold blocks in the loop may be inserted on a hot function path, hence increasing branch cost and also reducing icache locality.

Consider a simple example shown below:

A
|
B⇆C
|
D

When B->C is quite cold, the best BB-layout should be A,B,D,C. But the current implementation produces A,C,B,D.

This patch filters those cold blocks off from the loop chain by comparing the ratio:

LoopBBFreq / LoopFreq

to 20%: if it is less than 20%, we don't include this BB in the loop chain. Here LoopFreq is the frequency of the loop when we reduce the loop into a single node. In general, a loop with few iterations has more cold blocks, and a loop with many iterations has fewer.
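The filter described above can be sketched in isolation. This is a minimal illustration that uses plain integers instead of LLVM's BlockFrequency type; the names isColdInLoop and filterLoopBlocks are hypothetical, the frequencies are invented, and the 20% threshold is written as the integer ratio 5.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: a block is cold in a loop when
// BlockFreq / LoopFreq < 1 / Ratio (Ratio = 5 means "less than 20%").
// Assumes BlockFreq * Ratio does not overflow uint64_t.
bool isColdInLoop(uint64_t BlockFreq, uint64_t LoopFreq, uint64_t Ratio = 5) {
  return BlockFreq * Ratio < LoopFreq;
}

// Returns the names of the blocks that stay in the loop chain.
std::vector<std::string>
filterLoopBlocks(const std::vector<std::pair<std::string, uint64_t>> &Blocks,
                 uint64_t LoopFreq, uint64_t Ratio = 5) {
  std::vector<std::string> Kept;
  for (const auto &Block : Blocks)
    if (!isColdInLoop(Block.second, LoopFreq, Ratio))
      Kept.push_back(Block.first);
  return Kept;
}
```

For the A, B⇆C, D example, assume the loop {B, C} is entered 100 times, B runs 101 times and C runs once. Then filterLoopBlocks({{"B", 101}, {"C", 1}}, 100) keeps only B, so C no longer has to sit between A and B.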

Diff Detail

Event Timeline

congh updated this revision to Diff 31073. Jul 30 2015, 2:14 PM

congh retitled this revision to "Filter cold blocks off the loop chain when profile data is available."

congh updated this object.

congh added reviewers: chandlerc, dexonsmith.

congh added subscribers: llvm-commits, davidxl.

How do you come up with the 5/1 ratio? Why not make it 1/1? Here are the reasons:

If the loop has a high trip count, a block that is executed less than once per loop does not have any cache reuse.
If the loop has a very short trip count, it is likely that the loop won't be rotated. Splitting the cold block out increases the chances of connecting the loop with the exit BB via a fallthrough.

lib/CodeGen/MachineBlockPlacement.cpp
1036	Overloading BranchProbability for the purpose of computing ratios is confusing IMO. Why not just const unsigned LoopToColdBBRatio = 5;
1041	if (Freq * LoopToColdBBRatio < LoopFreq.getFrequency()) ..
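The multiplication form suggested here and the division form that appears in the committed diff (Diff 38968, further down) are not equivalent under integer arithmetic. The sketch below uses hypothetical function names and made-up values to show where they differ: integer division truncates, so the division form keeps some blocks that the multiplication form would filter out.

```cpp
#include <cassert>
#include <cstdint>

// Multiplication form, as in the inline comment above.
// Assumes Freq * Ratio does not overflow uint64_t.
bool coldByMultiply(uint64_t Freq, uint64_t LoopFreq, uint64_t Ratio) {
  return Freq * Ratio < LoopFreq;
}

// Division form; integer division truncates toward zero, and the
// Freq == 0 check avoids dividing by zero.
bool coldByDivide(uint64_t Freq, uint64_t LoopFreq, uint64_t Ratio) {
  return Freq == 0 || LoopFreq / Freq > Ratio;
}
```

With Freq = 2, LoopFreq = 11 and Ratio = 5, the multiplication form reports the block as cold (10 < 11), while the division form does not (11 / 2 truncates to 5, which is not greater than 5). Far from the threshold both forms agree.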

In D11662#215634, @davidxl wrote:

How do you come up with the 5/1 ratio? Why not make it 1/1? Here are the reasons:

It complies with the ratio that is used in selectBestSuccessor(): when the loop has only one iteration, this leads to a result similar to treating the loop as a non-loop by removing back edges.

If the loop has a high trip count, a block that is executed less than once per loop does not have any cache reuse.

The cold blocks that are not filtered off will be placed at the end of the loop chain (or at the beginning after loop rotation). So for the loop itself the icache locality is still good, as all hot blocks are still put together. When connecting those unfiltered cold blocks to blocks outside of the loop, we can use logic similar to that used in selectBestSuccessor(): we follow CFG constraints when the block is not too cold (>20%).

If the loop has a very short trip count, it is likely that the loop won't be rotated. Splitting the cold block out increases the chances of connecting the loop with the exit BB via a fallthrough.

In an extreme situation, if the loop has only one iteration, and if we use 1/1 as the threshold, it seems that too many blocks in the loop will be filtered off.

lib/CodeGen/MachineBlockPlacement.cpp
1036	At first I thought this might be more flexible, as we could choose other thresholds like 2/5. But now I prefer your suggestion. This can also prevent the potential overflow when converting uint64 to uint32.

Update the patch by using LoopToColdBBRatio instead of a probability according to David's comment.

Ok -- the change is made to be consistent with the existing trace formation heuristic to improve locality. It outlines the cold blocks and allows them to be pushed further away so that a trace can be formed from the current loop with the loop exit blocks.

lib/CodeGen/MachineBlockPlacement.cpp
1037	Try to avoid hard-coding parameters like this.

congh added inline comments. Jul 31 2015, 2:41 PM

lib/CodeGen/MachineBlockPlacement.cpp
1037	Should I make it a global constant variable or an option? The other 20% probability used in selectBestSuccessor() is also hard-coded and we could use the same variable to replace them.

I suggest making it an option. The other usage is slightly different; we can leave it out for now (it can be changed later if needed).

David

Update the patch by making the loop-to-cold-block-ratio an option.

Ping?

congh added a reviewer: djasper. Sep 10 2015, 1:18 PM

Why only do this if profile data is available? Are we scared that the statically derived probabilities are off by too much?

How does this affect nested loops? Is the 'outlined' block still in the outer loop? I am not sure which one would be better, but maybe we should add a comment and a test to document the behavior?

Should there be some minimum size for the block to be outlined? A very short block will still increase branch count, but not really affect cache locality. For those, the trade-off might be different.

In D11662#261683, @djasper wrote:

Why only do this if profile data is available? Are we scared that the statically derived probabilities are off by too much?

Without precise profile data, it is safer to put all blocks of the same loop together: if hot blocks are scattered across several places at runtime, we will get very poor cache locality.

How does this affect nested loops? Is the 'outlined' block still in the outer loop? I am not sure which one would be better, but maybe we should add a comment and a test to document the behavior?

You can treat each loop as a function: for an inner loop, if a block is cold, it will probably be put together with blocks of its outer loop. But this is also determined by whether that block is cold or not in the outer loop (e.g., whether its frequency is more than 20% of the frequency of the outer loop). That is, we keep passing this cold block to outer loops until it is not cold in that outer loop anymore. I have updated the comment and explained this. I also updated the test case with a nested loop for which a block is cold for the inner loop but not cold for the outer loop.
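The "keep passing the block outward" behaviour can be pictured with a small sketch. It is an illustration only, not the pass's control flow (the pass builds chains for inner loops first and filters each loop's block set on its own); firstLoopKeepingBlock is a hypothetical name and the frequencies are invented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index (0 = innermost) of the first enclosing loop whose chain
// would keep the block, or -1 if every enclosing loop treats it as cold.
// Assumes BlockFreq * Ratio does not overflow uint64_t.
int firstLoopKeepingBlock(uint64_t BlockFreq,
                          const std::vector<uint64_t> &LoopFreqsInnerToOuter,
                          uint64_t Ratio = 5) {
  for (std::size_t I = 0; I < LoopFreqsInnerToOuter.size(); ++I) {
    // Not cold when BlockFreq / LoopFreq >= 1 / Ratio.
    if (BlockFreq * Ratio >= LoopFreqsInnerToOuter[I])
      return static_cast<int>(I);
  }
  return -1;
}
```

For example, a block with frequency 10 inside an inner loop of frequency 1000 and an outer loop of frequency 20 is cold for the inner loop (50 < 1000) but not for the outer loop (50 >= 20), so firstLoopKeepingBlock(10, {1000, 20}) returns 1. A result of -1 corresponds to the block being left for the function-level chain.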

Should there be some minimum size for the block to be outlined? A very short block will still increase branch count, but not really affect cache locality. For those, the trade-off might be different.

If we decide not to outline cold but very short blocks, we should do so uniformly in other cases too. Take a diamond branch for instance:

A; if (...) B; else C; D;

Suppose B is very cold but very short: should we lay out this branch as ACBD..., or ACD...B?

One optimization I can think of is to put B as close as possible to D without affecting the overall branch cost.
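To make the two layouts concrete, here is a rough cost sketch. It assumes a deliberately simplified model in which every CFG edge that is not a fallthrough costs its weight; the Edge struct, the takenBranchWeight function and the weights are all hypothetical, and block size and cache effects are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

struct Edge {
  std::string From, To;
  uint64_t Weight; // how often the edge is executed
};

// Sum of the weights of edges that are not fallthroughs in the layout,
// i.e. whose target is not placed immediately after their source.
uint64_t takenBranchWeight(const std::vector<std::string> &Layout,
                           const std::vector<Edge> &Edges) {
  uint64_t Cost = 0;
  for (const Edge &E : Edges) {
    bool Fallthrough = false;
    for (std::size_t I = 0; I + 1 < Layout.size(); ++I)
      if (Layout[I] == E.From && Layout[I + 1] == E.To)
        Fallthrough = true;
    if (!Fallthrough)
      Cost += E.Weight;
  }
  return Cost;
}
```

With A->B and B->D executed once and A->C and C->D executed 99 times, the layout A,C,B,D has a taken-branch weight of 100 (A->B and C->D are taken), while A,C,D,B has a weight of 2 (A->B and B->D are taken). Under this model, moving the cold block B out of the hot path is cheaper regardless of how short B is; the open question in the comment above is whether cache locality changes that trade-off.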

Update the patch according to Daniel's comments.

Ping?

Sorry for the delay -- I will try to find some time to review it.

David

Please put the block set computation code into a helper function -- LGTM with that. Also how do you come up with the 5/1 ratio?

Ping?

In D11662#279170, @davidxl wrote:

Please put the block set computation code into a helper function -- LGTM with that. Also how do you come up with the 5/1 ratio?

OK. The 5/1 ratio complies with the ratio that is used in selectBestSuccessor(): when the loop has only one iteration, this leads to a result similar to treating the loop as a non-loop by removing back edges.

Update the patch according to David's comment.

Closed by commit rL251833: In MachineBlockPlacement, filter cold blocks off the loop chain when profile… (authored by conghou). Nov 2 2015, 1:26 PM

This revision was automatically updated to reflect the committed changes.

Diff 38968

lib/CodeGen/MachineBlockPlacement.cpp

@@ ... @@ static cl::opt<bool> OutlineOptionalBranches(
     cl::init(false), cl::Hidden);

 static cl::opt<unsigned> OutlineOptionalThreshold(
     "outline-optional-threshold",
     cl::desc("Don't outline optional branches that are a single block with an "
              "instruction count below this threshold"),
     cl::init(4), cl::Hidden);

+static cl::opt<unsigned> LoopToColdBlockRatio(
+    "loop-to-cold-block-ratio",
+    cl::desc("Outline loop blocks from loop chain if (frequency of loop) / "
+             "(frequency of block) is greater than this ratio"),
+    cl::init(5), cl::Hidden);
+
 static cl::opt<bool>
     PreciseRotationCost("precise-rotation-cost",
                         cl::desc("Model the cost of loop rotation more "
                                  "precisely by using profile data."),
                         cl::init(false), cl::Hidden);

 static cl::opt<unsigned> MisfetchCost(
     "misfetch-cost",
@@ ... @@ getFirstUnplacedBlock(MachineFunction &F, const BlockChain &PlacedChain,
                         const BlockFilterSet *BlockFilter);
   void buildChain(MachineBasicBlock *BB, BlockChain &Chain,
                   SmallVectorImpl<MachineBasicBlock *> &BlockWorkList,
                   const BlockFilterSet *BlockFilter = nullptr);
   MachineBasicBlock *findBestLoopTop(MachineLoop &L,
                                      const BlockFilterSet &LoopBlockSet);
   MachineBasicBlock *findBestLoopExit(MachineFunction &F, MachineLoop &L,
                                       const BlockFilterSet &LoopBlockSet);
+  BlockFilterSet collectLoopBlockSet(MachineFunction &F, MachineLoop &L);
   void buildLoopChains(MachineFunction &F, MachineLoop &L);
   void rotateLoop(BlockChain &LoopChain, MachineBasicBlock *ExitingBB,
                   const BlockFilterSet &LoopBlockSet);
   void rotateLoopWithProfile(BlockChain &LoopChain, MachineLoop &L,
                              const BlockFilterSet &LoopBlockSet);
   void buildCFGChains(MachineFunction &F);

 public:
@@ ... @@ void MachineBlockPlacement::rotateLoopWithProfile(

   if (RotationPos != LoopChain.end()) {
     DEBUG(dbgs() << "Rotate loop by making " << getBlockNum(*RotationPos)
                  << " to the top\n");
     std::rotate(LoopChain.begin(), RotationPos, LoopChain.end());
   }
 }

+/// \brief Collect blocks in the given loop that are to be placed.
+///
+/// When profile data is available, exclude cold blocks from the returned set;
+/// otherwise, collect all blocks in the loop.
+MachineBlockPlacement::BlockFilterSet
+MachineBlockPlacement::collectLoopBlockSet(MachineFunction &F, MachineLoop &L) {
+  BlockFilterSet LoopBlockSet;
+
+  // Filter cold blocks off from LoopBlockSet when profile data is available.
+  // Collect the sum of frequencies of incoming edges to the loop header from
+  // outside. If we treat the loop as a super block, this is the frequency of
+  // the loop. Then for each block in the loop, we calculate the ratio between
+  // its frequency and the frequency of the loop block. When it is too small,
+  // don't add it to the loop chain. If there are outer loops, then this block
+  // will be merged into the first outer loop chain for which this block is not
+  // cold anymore. This needs precise profile data and we only do this when
+  // profile data is available.
+  if (F.getFunction()->getEntryCount()) {
+    BlockFrequency LoopFreq(0);
+    for (auto LoopPred : L.getHeader()->predecessors())
+      if (!L.contains(LoopPred))
+        LoopFreq += MBFI->getBlockFreq(LoopPred) *
+                    MBPI->getEdgeProbability(LoopPred, L.getHeader());
+
+    for (MachineBasicBlock *LoopBB : L.getBlocks()) {
+      auto Freq = MBFI->getBlockFreq(LoopBB).getFrequency();
+      if (Freq == 0 || LoopFreq.getFrequency() / Freq > LoopToColdBlockRatio)
+        continue;
+      LoopBlockSet.insert(LoopBB);
+    }
+  } else
+    LoopBlockSet.insert(L.block_begin(), L.block_end());
+
+  return LoopBlockSet;
+}
+
 /// \brief Forms basic block chains from the natural loop structures.
 ///
 /// These chains are designed to preserve the existing structure of the code
 /// as much as possible. We can then stitch the chains together in a way which
 /// both preserves the topological structure and minimizes taken conditional
 /// branches.
 void MachineBlockPlacement::buildLoopChains(MachineFunction &F,
                                             MachineLoop &L) {
   // First recurse through any nested loops, building chains for those inner
   // loops.
   for (MachineLoop *InnerLoop : L)
     buildLoopChains(F, *InnerLoop);

   SmallVector<MachineBasicBlock *, 16> BlockWorkList;
-  BlockFilterSet LoopBlockSet(L.block_begin(), L.block_end());
+  BlockFilterSet LoopBlockSet = collectLoopBlockSet(F, L);

   // Check if we have profile data for this function. If yes, we will rotate
   // this loop by modeling costs more precisely which requires the profile data
   // for better layout.
   bool RotateLoopWithProfile =
       PreciseRotationCost && F.getFunction()->getEntryCount();

   // First check to see if there is an obviously preferable top block for the
   // loop. This will default to the header, but may end up as one of the
   // predecessors to the header if there is one which will result in strictly
   // fewer branches in the loop body.
   // When we use profile data to rotate the loop, this is unnecessary.
   MachineBasicBlock *LoopTop =
       RotateLoopWithProfile ? L.getHeader() : findBestLoopTop(L, LoopBlockSet);

   // If we selected just the header for the loop top, look for a potentially
   // profitable exit block in the event that rotating the loop can eliminate
   // branches by placing an exit edge at the bottom.
   MachineBasicBlock *ExitingBB = nullptr;
   if (!RotateLoopWithProfile && LoopTop == L.getHeader())
     ExitingBB = findBestLoopExit(F, L, LoopBlockSet);

   BlockChain &LoopChain = *BlockToChain[LoopTop];

   // FIXME: This is a really lame way of walking the chains in the loop: we
   // walk the blocks, and use a set to prevent visiting a particular chain
   // twice.
   SmallPtrSet<BlockChain *, 4> UpdatedPreds;
   assert(LoopChain.LoopPredecessors == 0);
   UpdatedPreds.insert(&LoopChain);
-  for (MachineBasicBlock *LoopBB : L.getBlocks()) {
+  for (MachineBasicBlock *LoopBB : LoopBlockSet) {
     BlockChain &Chain = *BlockToChain[LoopBB];
     if (!UpdatedPreds.insert(&Chain).second)
       continue;

     assert(Chain.LoopPredecessors == 0);
     for (MachineBasicBlock *ChainBB : Chain) {
       assert(BlockToChain[ChainBB] == &Chain);
       for (MachineBasicBlock *Pred : ChainBB->predecessors()) {
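The loop-frequency computation in collectLoopBlockSet above sums, over the header's predecessors outside the loop, the predecessor's frequency times the probability of the edge into the header. A stand-alone sketch of that sum follows. It is a simplification: the PredEdge struct and loopEntryFrequency function are hypothetical, probabilities are plain doubles rather than LLVM's BranchProbability, and blocks are identified by name.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

struct PredEdge {
  std::string Pred;   // a predecessor of the loop header
  uint64_t PredFreq;  // frequency of that predecessor
  double Probability; // probability of the edge Pred -> header
};

// Frequency of the loop when it is viewed as a single node: only edges
// entering the header from outside the loop contribute.
double loopEntryFrequency(const std::vector<PredEdge> &HeaderPreds,
                          const std::set<std::string> &LoopBlocks) {
  double LoopFreq = 0.0;
  for (const PredEdge &E : HeaderPreds)
    if (LoopBlocks.count(E.Pred) == 0) // skip back edges from inside the loop
      LoopFreq += static_cast<double>(E.PredFreq) * E.Probability;
  return LoopFreq;
}
```

For a loop {B, C} with header B, an entering edge from A (frequency 100, probability 1.0) and a back edge from C, the result is 100: the back edge is ignored because C is inside the loop.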

test/CodeGen/X86/code_placement_cold_loop_blocks.ll

This file was added.

				; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux < %s | FileCheck %s -check-prefix=CHECK

				define void @foo() !prof !1 {
				; Test if a cold block in a loop will be placed at the end of the function
				; chain.
				;
				; CHECK-LABEL: foo:
				; CHECK: callq b
				; CHECK: callq c
				; CHECK: callq e
				; CHECK: callq f
				; CHECK: callq d

				entry:
				br label %header

				header:
				call void @b()
				%call = call zeroext i1 @a()
				br i1 %call, label %if.then, label %if.else, !prof !4

				if.then:
				call void @c()
				br label %if.end

				if.else:
				call void @d()
				br label %if.end

				if.end:
				call void @e()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %header, label %end, !prof !5

				end:
				call void @f()
				ret void
				}

				define void @nested_loop_0() !prof !1 {
				; Test if a block that is cold in the inner loop but not cold in the outer loop
				; will be merged into the outer loop chain.
				;
				; CHECK-LABEL: nested_loop_0:
				; CHECK: callq c
				; CHECK: callq d
				; CHECK: callq e
				; CHECK: callq b
				; CHECK: callq f

				entry:
				br label %header

				header:
				call void @b()
				%call4 = call zeroext i1 @a()
				br i1 %call4, label %header2, label %end

				header2:
				call void @c()
				%call = call zeroext i1 @a()
				br i1 %call, label %if.then, label %if.else, !prof !2

				if.then:
				call void @d()
				%call3 = call zeroext i1 @a()
				br i1 %call3, label %header2, label %header, !prof !3

				if.else:
				call void @e()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %header2, label %header, !prof !3

				end:
				call void @f()
				ret void
				}

				define void @nested_loop_1() !prof !1 {
				; Test if a cold block in an inner loop will be placed at the end of the
				; function chain.
				;
				; CHECK-LABEL: nested_loop_1:
				; CHECK: callq b
				; CHECK: callq c
				; CHECK: callq e
				; CHECK: callq d

				entry:
				br label %header

				header:
				call void @b()
				br label %header2

				header2:
				call void @c()
				%call = call zeroext i1 @a()
				br i1 %call, label %end, label %if.else, !prof !4

				if.else:
				call void @d()
				%call2 = call zeroext i1 @a()
				br i1 %call2, label %header2, label %header, !prof !5

				end:
				call void @e()
				ret void
				}

				declare zeroext i1 @a()
				declare void @b()
				declare void @c()
				declare void @d()
				declare void @e()
				declare void @f()

				!1 = !{!"function_entry_count", i64 1}
				!2 = !{!"branch_weights", i32 100, i32 1}
				!3 = !{!"branch_weights", i32 1, i32 10}
				!4 = !{!"branch_weights", i32 1000, i32 1}
				!5 = !{!"branch_weights", i32 100, i32 1}
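As a reading aid for the branch_weights metadata above: the weights are, roughly, relative counts for each successor, so a two-way branch with weights (w0, w1) takes its first successor with probability about w0 / (w0 + w1). The helper below is a hypothetical sketch of that arithmetic, not LLVM's own normalization code.

```cpp
#include <cassert>
#include <cstdint>

// Approximate probability of taking the first successor of a two-way branch
// given its two branch weights.
double firstSuccessorProbability(uint32_t FirstWeight, uint32_t SecondWeight) {
  double Total = static_cast<double>(FirstWeight) +
                 static_cast<double>(SecondWeight);
  assert(Total > 0.0 && "need at least one nonzero weight");
  return static_cast<double>(FirstWeight) / Total;
}
```

With !4 (weights 1000 and 1) on the branch in header of @foo, if.else is reached with probability about 1/1001 per iteration, far below the 20% threshold, which is consistent with the CHECK lines expecting callq d last.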

This is an archive of the discontinued LLVM Phabricator instance.

Filter cold blocks off the loop chain when profile data is available.
ClosedPublic
