This is an archive of the discontinued LLVM Phabricator instance.

[MBP] Enhance cost based branch prob threshold computation to handle general control flows
Needs Review, Public

Authored by davidxl on Jun 23 2016, 3:14 PM.

Details

Reviewers

danielcdh
congh

Summary

See comments in the code for details.

Also added more extensive testing in various scenarios to make sure the optimal layout (with minimal branch cost or maximal fall through) is picked.

Diff Detail

Event Timeline

davidxl updated this revision to Diff 61730. Jun 23 2016, 3:14 PM

davidxl retitled this revision to [MBP] Enhance cost based branch prob threshold computation to handle general control flows.

davidxl updated this object.

davidxl added reviewers: danielcdh, congh.

davidxl added a subscriber: llvm-commits.

danielcdh added inline comments. Jun 23 2016, 7:52 PM

lib/CodeGen/MachineBlockPlacement.cpp
610	You mean in (C), right?
624	This is a little confusing. I think the condition should be: F1 > max(2*F3, F2). So when checking Pred1, the backward threshold should be 0.66, and when checking Pred2, the backward threshold should be 0.5? But with the current implementation, I tried all the test cases, plus some other test cases with different triangle-diamond combinations, and it always gives me the optimal solution.

davidxl added inline comments. Jun 23 2016, 8:38 PM

lib/CodeGen/MachineBlockPlacement.cpp
610	yes.
624	It is F1 > F2 + F3. Basically, we are comparing selecting BB->Succ as the fall-through vs. selecting BB->Pred1 and Pred2->Succ as fall-throughs. For BB->Succ to win, F1 must reach at least min(F1) = F2 + F3. When computing the backward probability, only the BB->Succ and Pred2->Succ edges are considered, so the probability threshold is T = min(F1)/(min(F1) + F2) = (F2 + F3)/(2*F2 + F3). I will update the comments.

For case C, if F1=50, F2=40, F3=20, it seems it's most beneficial to choose
BB->Succ as fall-through than Pred1->Succ or Pred2->Succ, or am I missing
something?

davidxl added inline comments. Jun 23 2016, 10:16 PM

lib/CodeGen/MachineBlockPlacement.cpp
624	Dehao's question: For case C, if F1=50, F2=40, F3=20, it seems it's most beneficial to choose BB->Succ as fall-through than Pred1->Succ or Pred2->Succ, or am I missing something?
The answer is that, in this case, it is better not to choose BB->Succ as the fall-through. Let's take a look at an example: suppose BB and Pred2 in example C have a common predecessor 'S'. With your choice, the layout is S, BB, Succ, Pred2, Pred1 with cost F2 + F3 + F2 + F3. The optimal layout is in fact S, BB, Pred1, Pred2, Succ with cost F1 + F2 + F3.
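
As a sanity check on the numbers in this exchange, here is a minimal sketch that plugs F1=50, F2=40, F3=20 into the two layout costs and the threshold test quoted above. The struct and function names are hypothetical and exist only for this illustration; frequencies are plain integers rather than BlockFrequency values.

```cpp
#include <cstdint>

// Hypothetical container for the three edge frequencies of case (C).
struct CaseC {
  uint64_t F1; // Freq(BB->Succ)
  uint64_t F2; // Freq(Pred2->Succ)
  uint64_t F3; // Freq(BB->Pred1)
};

// Cost quoted in the thread for layout <S, BB, Succ, Pred2, Pred1>.
uint64_t costSuccAfterBB(const CaseC &C) { return 2 * (C.F2 + C.F3); }

// Cost quoted in the thread for layout <S, BB, Pred1, Pred2, Succ>.
uint64_t costPredsFirst(const CaseC &C) { return C.F1 + C.F2 + C.F3; }

// Backward-probability test F1/(F1+F2) > (F2+F3)/(2*F2+F3), written with
// cross-multiplication to stay in integers. It reduces to F1 > F2 + F3.
bool succPassesThreshold(const CaseC &C) {
  return C.F1 * (2 * C.F2 + C.F3) > (C.F2 + C.F3) * (C.F1 + C.F2);
}
```

For CaseC{50, 40, 20} the two costs are 120 and 110, and succPassesThreshold returns false because 50 is not greater than 40 + 20.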

danielcdh added inline comments. Jun 23 2016, 10:50 PM

lib/CodeGen/MachineBlockPlacement.cpp
624	I see. Thanks for the explanation. It may be worth mentioning in the comment that the threshold is F1 > max(2*F3, F2+F3); because F2 > F3, this reduces to F1 > F2+F3.
664	Maybe combine the two scenarios into something like: F2 = std::max(MBFI->getBlockFreq(BB) * SuccProb.getCompl(), MaxPredEdgeFreq); F3 = std::min(MBFI->getBlockFreq(BB) * SuccProb.getCompl(), MaxPredEdgeFreq);
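
A rough sketch of what the suggested max/min form could look like, using plain 64-bit integers in place of BlockFrequency. The names pickF2F3 and thresholdPercent are made up for this example. Note that this is not behavior-identical to the diff as posted: scenario (A) there clamps F2 up to F3, and scenario (C) never swaps the two values.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>

// Returns {F2, F3}: the larger of the two candidate frequencies becomes F2
// and the smaller becomes F3, as proposed in the review comment.
// BBOtherFreq stands in for MBFI->getBlockFreq(BB) * SuccProb.getCompl().
std::pair<uint64_t, uint64_t> pickF2F3(uint64_t BBOtherFreq,
                                       uint64_t MaxPredEdgeFreq) {
  return {std::max(BBOtherFreq, MaxPredEdgeFreq),
          std::min(BBOtherFreq, MaxPredEdgeFreq)};
}

// Unbiased threshold T = (F2 + F3) / (2*F2 + F3), expressed in percent.
// Assumes F2 > 0 so the denominator is non-zero.
double thresholdPercent(uint64_t F2, uint64_t F3) {
  return 100.0 * static_cast<double>(F2 + F3) /
         static_cast<double>(2 * F2 + F3);
}
```

With candidate frequencies 10 and 20 this picks F2 = 20, F3 = 10 and yields a threshold of 60%, matching the 30/50 example in the code comment.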

Revision Contents

Path	Size
lib/CodeGen/MachineBlockPlacement.cpp	210 lines
test/CodeGen/X86/code_placement_pgo_cost.ll	475 lines

Diff 61730

lib/CodeGen/MachineBlockPlacement.cpp

[286 lines hidden]	class MachineBlockPlacement : public MachineFunctionPass {
collectViableSuccessors(MachineBasicBlock *BB, BlockChain &Chain,		collectViableSuccessors(MachineBasicBlock *BB, BlockChain &Chain,
const BlockFilterSet *BlockFilter,		const BlockFilterSet *BlockFilter,
SmallVector<MachineBasicBlock *, 4> &Successors);		SmallVector<MachineBasicBlock *, 4> &Successors);
bool shouldPredBlockBeOutlined(MachineBasicBlock *BB, MachineBasicBlock *Succ,		bool shouldPredBlockBeOutlined(MachineBasicBlock *BB, MachineBasicBlock *Succ,
BlockChain &Chain,		BlockChain &Chain,
const BlockFilterSet *BlockFilter,		const BlockFilterSet *BlockFilter,
BranchProbability SuccProb,		BranchProbability SuccProb,
BranchProbability HotProb);		BranchProbability HotProb);
		BranchProbability
		getLayoutSuccessorProbThreshold(MachineBasicBlock *BB,
		MachineBasicBlock *Succ,
		BranchProbability SuccProb,
		BlockChain &SuccChain,
		BlockChain &Chain,
		const BlockFilterSet *BlockFilter);
bool		bool
hasBetterLayoutPredecessor(MachineBasicBlock *BB, MachineBasicBlock *Succ,		hasBetterLayoutPredecessor(MachineBasicBlock *BB, MachineBasicBlock *Succ,
BlockChain &SuccChain, BranchProbability SuccProb,		BlockChain &SuccChain, BranchProbability SuccProb,
BranchProbability RealSuccProb, BlockChain &Chain,		BranchProbability RealSuccProb, BlockChain &Chain,
const BlockFilterSet *BlockFilter);		const BlockFilterSet *BlockFilter);
MachineBasicBlock *selectBestSuccessor(MachineBasicBlock *BB,		MachineBasicBlock *selectBestSuccessor(MachineBasicBlock *BB,
BlockChain &Chain,		BlockChain &Chain,
const BlockFilterSet *BlockFilter);		const BlockFilterSet *BlockFilter);
[213 lines hidden]	for (MachineBasicBlock *Pred : Succ->predecessors()) {
if (Pred->size() < OutlineOptionalThreshold)		if (Pred->size() < OutlineOptionalThreshold)
return false;		return false;
}		}
return true;		return true;
} else		} else
return false;		return false;
}		}

// When profile is not present, return the StaticLikelyProb.		// Computes the branch probability threshold that is applied to edge
// When profile is available, we need to handle the triangle-shape CFG.		// BB->Succ when Succ still has unscheduled predecessor(s). When
static BranchProbability getLayoutSuccessorProbThreshold(		// the probability of BB->Succ is greater than the threshold, it is
MachineBasicBlock *BB) {		// definitely beneficial to lay out Succ as BB's layout successor; otherwise
		// it is better to connect Succ with one of its predecessors instead. The
		// computed probability threshold is either a forward branch probability (i.e.
		// the probability to branch out), or a backward probability (i.e. the probability
		// of an edge merging into the destination block) -- depending on the
		// scenarios. For details see algorithm description below.
		BranchProbability MachineBlockPlacement::getLayoutSuccessorProbThreshold(
		MachineBasicBlock *BB, MachineBasicBlock *Succ, BranchProbability SuccProb,
		BlockChain &SuccChain, BlockChain &Chain, const BlockFilterSet *BlockFilter) {

		// No profile, just use the pre-set threshold.
if (!BB->getParent()->getFunction()->getEntryCount())		if (!BB->getParent()->getFunction()->getEntryCount())
return BranchProbability(StaticLikelyProb, 100);		return BranchProbability(StaticLikelyProb, 100);
if (BB->succ_size() == 2) {
const MachineBasicBlock *Succ1 = *BB->succ_begin();		// When profile data is available:
const MachineBasicBlock *Succ2 = *(BB->succ_begin() + 1);
if (Succ1->isSuccessor(Succ2) || Succ2->isSuccessor(Succ1)) {		// (We don't yet handle cases with fan-out > 2)
/* See case 1 below for the cost analysis. For BB->Succ to		if (BB->succ_size() <= 2) {
* be taken with smaller cost, the following needs to hold:		//
* Prob(BB->Succ) > 2* Prob(BB->Pred)		// BB BB
* So the threshold T		// \| \ \| \
* T = 2 * (1-Prob(BB->Pred). Since T + Prob(BB->Pred) == 1,		// \| \ \| \
* We have T + T/2 = 1, i.e. T = 2/3. Also adding user specified		// \| Pred0 \| Pred
* branch bias, we have		// \| / \ \| /
* T = (2/3)*(ProfileLikelyProb/50)		// \| / PredK \| /
* = (2*ProfileLikelyProb)/150)		// \| / / \ Succ
*/		// \| \| / \
return BranchProbability(2 * ProfileLikelyProb, 150);		// \ \| / / ....
		// Succ
		//
		// (A) (B)
		//
		// (A) is a general form of triangular CFG (B). In (A), Let
		//
		// P1 = Prob(BB->Succ),
		// P2 = Prob(BB->Pred0),
		// F2 = Freq(BB->Pred0),
		// F3 = Freq(PredK, Succ), and
		// P3 = F3/F2, where
		// PredK is the unscheduled predecessor of Succ
		// with the largest edge frequency from PredK to Succ.
		//
		// The objective is to find the probability threshold of
		// BB->Succ edge to make laying out BB->Succ profitable.
		// Assuming the frequency of BB is 1, the fall-through
		// benefit of picking BB->Succ is P1 (i.e., the taken branches
		// incurred if it is not selected). If Pred0 is selected as
		// BB's layout successor, the layout benefit will be
		// P2 + P2 * P3. Note that the internal edges between Pred*
		// blocks do not count because the benefit of laying them
		// out sequentially will be materialized independent of the
		// decision here. To compute the threshold T of P1,
		// T = P2 + P2 * P3, i.e., P2 = T/(1 + P3)
		// Since P1 + P2 = 1, substituting T for P1, we get
		// T = (1 + P3)/(2 + P3)
		// = (F2 + F3)/(2*F2 + F3)
		// Finally, adding user specified branch bias, we have
		// T = (F2 + F3) * ProfileLikelyProb/((2 * F2 + F3) * 50)
		// Example:
		// 1. For (B), F2 == F3, and P3 is 1, so T = 67%.
		// 2. For (A), When P3 is 0.5, T = 60%.
		//
		// The same formula also applies to the backward probability
		// threshold computation when the 'heaviest' predecessor is not
		// dominated by the source block BB.
		//
		//          BB    Pred2
		//         / |    /
		//        /  |   |             BB    Pred2
		//    Pred1  |   |             |      |
		//        \  |   |             \      /
		//         \ |  /               \    /
		//          \| /                 \  /
		//          Succ                 Succ
		//
		//          (C)                   (D)
		// (D) is a degenerated case of (C) where Pred1 does not exist. In (D),
		// Let
		// F1 = Freq(BB->Succ),
		// F2 = Freq(Pred2->Succ), and
		// F3 = Freq(BB->Pred1)
		//
		// Here we assume F2 > F3. Otherwise, the cost model will be the same as
		// scenarios (A)/(B) above.
		//
		// The savings (eliminated taken branches) of laying out Succ after BB
		// is F1, and the cost (introduced taken branches) consists of two parts:
		// one is the branch from source block BB to Pred1 (F3), and the other
		// part is the branch from Pred2 to the destination block Succ (F2). In
		// order for BB->Succ to be selected, we must have F1 > F2 + F3. So the
		// backward probability threshold is (backward checking ignores other
		// predecessors other than Pred2):
		// T = (F2 + F3)/(F2 + F3 + F2)
		// = (F2 + F3)/(2*F2 + F3)
		//
		// QED
		//
		// Examples:
		// 1. For (D), F3 == 0, so we have T = 50%
		// 2. For (C) where F2 = 20, F3 = 10, we have T = 30/50 = 60%. It matches
		// the expectation: Freq(BB->Succ) needs to be at least 30 in order to be
		// profitable to lay out Succ after BB.
		//
		BlockFrequency MaxPredEdgeFreq = 0;
		MachineBasicBlock *PredWithMaxFreq = nullptr;
		for (MachineBasicBlock *Pred : Succ->predecessors()) {
		// This filters out the loop backedge predecessor
		// (e.g., loops are laid out first; later, when
		// connecting the pre-header (BB) with the loop header (Succ), we need
		// to skip the Pred which is already in Succ's chain).
		if (BlockToChain[Pred] == &SuccChain ||
		Succ == Pred ||
		// Skip if either already placed or filtered out due to outlining
		(BlockFilter && !BlockFilter->count(Pred)) ||
		BlockToChain[Pred] == &Chain)
		continue;

		// Find one unscheduled Pred:
		BlockFrequency PredEdgeFreq =
		MBFI->getBlockFreq(Pred) * MBPI->getEdgeProbability(Pred, Succ);
		if (PredEdgeFreq > MaxPredEdgeFreq) {
		MaxPredEdgeFreq = PredEdgeFreq;
		PredWithMaxFreq = Pred;
		}
}		}

		if (PredWithMaxFreq == nullptr || MaxPredEdgeFreq.getFrequency() == 0)
		return BranchProbability(ProfileLikelyProb, 100);

		BlockFrequency F2, F3;
		if (MDT->properlyDominates(BB, PredWithMaxFreq)) {
		// Scenario (A)
		F2 = MBFI->getBlockFreq(BB) * SuccProb.getCompl();
		F3 = MaxPredEdgeFreq;
		if (F2 < F3) {
		// handle rounding errors in block frequency propagation
		F2 = F3;
}		}
		} else {
		// Scenario (C)
		F2 = MaxPredEdgeFreq;
		F3 = MBFI->getBlockFreq(BB) * SuccProb.getCompl();
		}

		APInt ProbBias(128, ProfileLikelyProb);
		// N = (F2 + F3) * ProfileLikelyProb
		APInt NV(128, F2.getFrequency());
		APInt F3V(128, F3.getFrequency());
		NV += F3V;
		NV *= ProbBias;

		// D = (2 * F2 + F3) * 50
		APInt DV(128, F2.getFrequency());
		DV *= APInt(128, 2);
		DV += F3V;
		DV *= APInt(128, 50);

		if (DV.ult(NV))
		DV = NV;

		APInt L(128, APInt::getMaxValue(64).getLimitedValue());
		if (DV.ugt(L)) {
		APInt Scale = DV.udiv(L);
		DV = DV.udiv(Scale);
		NV = NV.udiv(Scale);
		}

		return BranchProbability::getBranchProbability(NV.getLimitedValue(), DV.getLimitedValue());
		}

		// For other shapes, return the default.
return BranchProbability(ProfileLikelyProb, 100);		return BranchProbability(ProfileLikelyProb, 100);
}		}

/// Checks to see if the layout candidate block \p Succ has a better layout		/// Checks to see if the layout candidate block \p Succ has a better layout
/// predecessor than \c BB. If yes, returns true.		/// predecessor than \c BB. If yes, returns true.
bool MachineBlockPlacement::hasBetterLayoutPredecessor(		bool MachineBlockPlacement::hasBetterLayoutPredecessor(
MachineBasicBlock BB, MachineBasicBlock Succ, BlockChain &SuccChain,		MachineBasicBlock BB, MachineBasicBlock Succ, BlockChain &SuccChain,
BranchProbability SuccProb, BranchProbability RealSuccProb,		BranchProbability SuccProb, BranchProbability RealSuccProb,
BlockChain &Chain, const BlockFilterSet *BlockFilter) {		BlockChain &Chain, const BlockFilterSet *BlockFilter) {

// This is no global conflict, just return false.		// There is no global conflict, just return false.
if (SuccChain.UnscheduledPredecessors == 0)		if (SuccChain.UnscheduledPredecessors == 0)
return false;		return false;

// There are two basic scenarios here:		// Descriptions of two basic scenarios.
		// For more general cases and detailed description of the cost analysis
		// algorithm, see getLayoutSuccessorProbThreshold method
// -------------------------------------		// -------------------------------------
// Case 1: triangular shape CFG:		// Case 1: triangular shape CFG:
//        BB		//        BB
//        | \		//        | \
//        |  \		//        |  \
//        |   Pred		//        |   Pred
//        |  /		//        |  /
//        Succ		//        Succ
[10 lines hidden]	bool MachineBlockPlacement::hasBetterLayoutPredecessor(
// 2 * freq(BB->Pred) * taken_branch_cost + unconditional_jump_cost		// 2 * freq(BB->Pred) * taken_branch_cost + unconditional_jump_cost
// < freq(BB->Succ) * taken_branch_cost.		// < freq(BB->Succ) * taken_branch_cost.
// Ignoring unconditional jump cost, we get		// Ignoring unconditional jump cost, we get
// freq(BB->Succ) > 2 * freq(BB->Pred), i.e.,		// freq(BB->Succ) > 2 * freq(BB->Pred), i.e.,
// prob(BB->Succ) > 2 * prob(BB->Pred)		// prob(BB->Succ) > 2 * prob(BB->Pred)
//		//
// When real profile data is available, we can precisely compute the		// When real profile data is available, we can precisely compute the
// probability threshold that is needed for edge BB->Succ to be considered.		// probability threshold that is needed for edge BB->Succ to be considered.
		// See getLayoutSuccessorProbThreshold for a description of the cost analysis of
		// the general triangular shape.
// Without profile data, the heuristic requires the branch bias to be		// Without profile data, the heuristic requires the branch bias to be
// a lot larger to make sure the signal is very strong (e.g. 80% default).		// a lot larger to make sure the signal is very strong (e.g. 80% default).
// -----------------------------------------------------------------		// -----------------------------------------------------------------
// Case 2: diamond like CFG:		// Case 2: diamond like CFG:
//          S		//          S
//        /   \		//        /   \
//       |     \		//       |     \
//       BB    Pred		//       BB    Pred
[10 lines hidden]	bool MachineBlockPlacement::hasBetterLayoutPredecessor(
// When profile data is not available, however, we need to be more		// When profile data is not available, however, we need to be more
// conservative. If the branch prediction is wrong, breaking the topo-order		// conservative. If the branch prediction is wrong, breaking the topo-order
// will actually yield a layout with large cost. For this reason, we need		// will actually yield a layout with large cost. For this reason, we need
// a strongly biased branch at block S with Prob(S->BB) in order to select		// a strongly biased branch at block S with Prob(S->BB) in order to select
// BB->Succ. This is equivalent to looking at the CFG backward with backward		// BB->Succ. This is equivalent to looking at the CFG backward with backward
// edge: Prob(Succ->BB) needs to >= HotProb in order to be selected (without		// edge: Prob(Succ->BB) needs to >= HotProb in order to be selected (without
// profile data).		// profile data).

BranchProbability HotProb = getLayoutSuccessorProbThreshold(BB);		BranchProbability HotProb =
		getLayoutSuccessorProbThreshold(BB, Succ, RealSuccProb, SuccChain,
		Chain, BlockFilter);

// Forward checking. For case 2, SuccProb will be 1.		// Forward checking. For case 2, SuccProb will be 1.
if (SuccProb < HotProb) {		if (SuccProb < HotProb) {
DEBUG(dbgs() << " " << getBlockName(Succ) << " -> " << SuccProb		DEBUG(dbgs() << " " << getBlockName(Succ) << " -> " << SuccProb
<< " (prob) (CFG conflict)\n");		<< " (prob) (CFG conflict)\n");
return true;		return true;
}		}

// Make sure that a hot successor doesn't have a globally more		// Make sure that a hot successor doesn't have a globally more
// important predecessor.		// important predecessor.
BlockFrequency CandidateEdgeFreq = MBFI->getBlockFreq(BB) * RealSuccProb;		BlockFrequency CandidateEdgeFreq = MBFI->getBlockFreq(BB) * RealSuccProb;
bool BadCFGConflict = false;		bool BadCFGConflict = false;

for (MachineBasicBlock *Pred : Succ->predecessors()) {		for (MachineBasicBlock *Pred : Succ->predecessors()) {
if (Pred == Succ || BlockToChain[Pred] == &SuccChain ||		if (Pred == Succ || BlockToChain[Pred] == &SuccChain ||
(BlockFilter && !BlockFilter->count(Pred)) ||		(BlockFilter && !BlockFilter->count(Pred)) ||
BlockToChain[Pred] == &Chain)		BlockToChain[Pred] == &Chain)
continue;		continue;
// Do backward checking. For case 1, it is actually redundant check. For		// Now do backward checking. For case 1, it is actually redundant check. For
// case 2 above, we need a backward checking to filter out edges that are		// case 2 above, we need a backward checking to filter out edges that are
// not 'strongly' biased. With profile data available, the check is mostly		// not 'strongly' biased. With profile data available, the check is mostly
// redundant too (when threshold prob is set at 50%) unless S has more than		// redundant too (when threshold prob is set at 50%) unless S has more than
// two successors.		// two successors.
//  BB   Pred		//  BB   Pred
//   \   /		//   \   /
//    Succ		//    Succ
// We select edge BB->Succ if		// We select edge BB->Succ if
[991 lines hidden]
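
The fixed-point arithmetic at the end of getLayoutSuccessorProbThreshold can be tried outside LLVM with a small stand-in. This sketch assumes a compiler that supports unsigned __int128 (a GCC/Clang extension) in place of llvm::APInt, and biasedThreshold is a hypothetical name. As a small deviation from the diff, it rounds the scale factor up so that the scaled denominator always fits in 64 bits, and it falls back to BiasPercent/100 when both frequencies are zero.

```cpp
#include <cstdint>
#include <utility>

// Stand-in for the 128-bit APInt values used in the diff (compiler extension).
using U128 = unsigned __int128;

// Returns {numerator, denominator} of
//   T = (F2 + F3) * BiasPercent / ((2 * F2 + F3) * 50),
// scaled down if needed so that both parts fit in 64 bits.
std::pair<uint64_t, uint64_t> biasedThreshold(uint64_t F2, uint64_t F3,
                                              unsigned BiasPercent) {
  U128 N = (U128(F2) + F3) * BiasPercent;
  U128 D = (U128(2) * F2 + F3) * 50;
  if (D == 0)
    return {BiasPercent, 100}; // no frequency data: plain bias (assumption)
  if (D < N)
    D = N; // a probability cannot exceed 1
  const U128 Limit = UINT64_MAX;
  if (D > Limit) {
    U128 Scale = D / Limit + 1; // rounded up so D / Scale fits below Limit
    D /= Scale;
    N /= Scale;
  }
  return {static_cast<uint64_t>(N), static_cast<uint64_t>(D)};
}
```

With F2 = 20, F3 = 10 and a bias of 50, the result is 1500/2500, i.e. the unbiased 60% from the comment's example.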

test/CodeGen/X86/code_placement_pgo_cost.ll

				; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux < %s | FileCheck %s

				declare void @a()
				declare void @b()
				declare void @c()
				declare void @d()
				declare void @e()

				;; Simple double triangles
				;;        a
				;;        | \ 20
				;;        |  b
				;;     70 |10/\ 10
				;;        | /  c
				;;        | | / 10
				;;         \|/
				;;          d
				;; The optimal layout is <a, d, b, c> with cost = 20 + 10 + 10 = 40.
				;; The default layout (without profile) is <a, b, c, d> with cost 80.
				; CHECK-LABEL: test_double_triangles_1:
				; CHECK: callq a
				; CHECK: callq d
				; CHECK: callq b
				; CHECK: callq c

				define void @test_double_triangles_1(i32 %t1, i32 %t2) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %merge, label %else, !prof !8

				else:
				call void @b()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %else2, !prof !9

				else2:
				call void @c()
				br label %merge

				merge:
				call void @d()
				ret void
				}

				;;        a
				;;        | \ 0
				;;        |  b
				;;     70 |  /\
				;;        | /  c
				;;        | | /
				;;         \|/
				;;          d
				;; Same as above with more extreme probability.
				; CHECK-LABEL: test_double_triangles_2:
				; CHECK: callq a
				; CHECK: callq d
				; CHECK: callq b
				; CHECK: callq c
				define void @test_double_triangles_2(i32 %t1, i32 %t2) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %merge, label %else, !prof !10

				else:
				call void @b()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %else2

				else2:
				call void @c()
				br label %merge

				merge:
				call void @d()
				ret void
				}

				;;        a
				;;        | \ 40
				;;        |  b
				;;     70 |25/\ 15
				;;        | /  c
				;;        | | / 15
				;;         \|/
				;;          d
				;; Similar to above -- the optimal layout is still <a,d,b,c> with cost
				;; 40 + 15 + 25 = 80. The default layout has cost 95.
				; CHECK-LABEL: test_double_triangles_3:
				; CHECK: callq a
				; CHECK: callq d
				; CHECK: callq b
				; CHECK: callq c
				define void @test_double_triangles_3(i32 %t1, i32 %t2) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %merge, label %else, !prof !11

				else:
				call void @b()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %else2, !prof !12

				else2:
				call void @c()
				br label %merge

				merge:
				call void @d()
				ret void
				}

				;;        a
				;;        | \ 45
				;;        |  b
				;;     60 |35/\ 10
				;;        | /  c
				;;        | | / 10
				;;         \|/
				;;          d
				;; The optimal layout in this case is <a,b,d,c>. It has cost
				;; 60 + 10 + 10 = 80. The default layout <a,b,c,d> has cost = 95.
				;; Layout <a,d,b,c> has cost = 45 + 35 + 10 = 90
				; CHECK-LABEL: test_double_triangles_4:
				; CHECK: callq a
				; CHECK: callq b
				; CHECK: callq d
				; CHECK: callq c
				;
				define void @test_double_triangles_4(i32 %t1, i32 %t2) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %merge, label %else, !prof !13

				else:
				call void @b()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %else2, !prof !14

				else2:
				call void @c()
				br label %merge

				merge:
				call void @d()
				ret void
				}

				;;        a
				;;        | \ 40
				;;        |  b
				;;     60 |10/\ 30
				;;        | /  c
				;;        | | / 30
				;;         \|/
				;;          d
				;; Optimal layout is <a,b,c,d> with cost = 60 + 10 = 70
				;; Alternate layout <a,d,b,c> has cost = 40 + 10 + 30 = 80
				;; a b d c has cost = 120
				; CHECK-LABEL: test_double_triangles_5:
				; CHECK: callq a
				; CHECK: callq b
				; CHECK: callq c
				; CHECK: callq d

				define void @test_double_triangles_5(i32 %t1, i32 %t2) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %merge, label %else, !prof !15

				else:
				call void @b()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %else2, !prof !16

				else2:
				call void @c()
				br label %merge

				merge:
				call void @d()
				ret void
				}

				;; The following CFG has a shape with double triangle nested in a
				;; diamond.
				;; For the following CFG, the optimal layout is <a, d, e, b, c> which has
				;; cost of 80. If using the default layout strategy (without PGO), the
				;; order is <a, b, c, d, e> which has cost of 90.
				;;          a
				;;          | \ 40
				;;       60 |  \
				;;          |   b
				;;          | 0/ \40
				;;          | /   c
				;;          | /  / \
				;;          \|  /10 | 30
				;;           d      |
				;;         70 \    /
				;;             \  /
				;;              e
				; CHECK-LABEL: test_multi_merge_1:
				; CHECK: callq a
				; CHECK: callq d
				; CHECK: callq e
				; CHECK: callq b
				; CHECK: callq c
				define void @test_multi_merge_1(i32 %t1, i32 %t2, i32 %t3) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %merge, label %then, !prof !2

				then:
				call void @b()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %then2, !prof !3

				then2:
				call void @c()
				%cond3 = icmp eq i32 %t3, 0
				br i1 %cond3, label %merge, label %exit, !prof !4

				merge:
				call void @d()
				br label %exit
				exit:
				call void @e()
				ret void
				}

				;; Similar to the above, with the same optimal layout (with cost 80).
				;; The default, non-optimal layout has a cost of 100.
				;;          a
				;;          | \ 40
				;;       60 |  \
				;;          |   b
				;;          | 0/ \40
				;;          | /   c
				;;          | /  / \
				;;          \|  / 0 | 40
				;;           d      |
				;;         60 \    /
				;;             \  /
				;;              e
				; CHECK-LABEL: test_multi_merge_2:
				; CHECK: callq a
				; CHECK: callq d
				; CHECK: callq e
				; CHECK: callq b
				; CHECK: callq c
				define void @test_multi_merge_2(i32 %t1, i32 %t2, i32 %t3) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %merge, label %then, !prof !2

				then:
				call void @b()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %then2, !prof !3

				then2:
				call void @c()
				%cond3 = icmp eq i32 %t3, 0
				br i1 %cond3, label %merge, label %exit, !prof !5

				merge:
				call void @d()
				br label %exit
				exit:
				call void @e()
				ret void
				}

				;; The CFG is the same, but the count distribution changes here. It is no
				;; longer optimal to lay out d after a. The optimal layout is
				;; <a, b, d, e, c>
				;;          a
				;;          | \ 40
				;;       60 |  \
				;;          |   b
				;;          | 40/ \ 0
				;;          | /   c
				;;          | /  / \
				;;          \|  / 0 | 0
				;;           d      |
				;;        100 \    /
				;;             \  /
				;;              e
				; CHECK-LABEL: test_multi_merge_3:
				; CHECK: callq a
				; CHECK: callq b
				; CHECK: callq d
				; CHECK: callq e
				; CHECK: callq c
				define void @test_multi_merge_3(i32 %t1, i32 %t2, i32 %t3) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %merge, label %then, !prof !2

				then:
				call void @b()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %then2, !prof !6

				then2:
				call void @c()
				%cond3 = icmp eq i32 %t3, 0
				br i1 %cond3, label %merge, label %exit, !prof !7

				merge:
				call void @d()
				br label %exit
				exit:
				call void @e()
				ret void
				}

				;; Single triangle inside (lower half) of a diamond
				;;
				;;             a
				;;        45 /   \ 55
				;;          /     \
				;;         b       c
				;;          \   40| \ 15
				;;        45 \    |  d
				;;            \   | / 15
				;;             \  |/
				;;                e
				;; Optimal layout is <a,c,d,b,e> with cost 45 + 15 + 40 = 100
				; CHECK-LABEL: test_triangle_in_diamond_1
				; CHECK: callq a
				; CHECK: callq c
				; CHECK: callq d
				; CHECK: callq b
				; CHECK: callq e
				define void @test_triangle_in_diamond_1(i32 %t1, i32 %t2) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %then, label %else, !prof !17

				then:
				call void @b()
				br label %merge

				else:
				call void @c()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %else2, !prof !18

				else2:
				call void @d()
				br label %merge

				merge:
				call void @e()
				ret void
				}

				;;             a
				;;        25 /   \ 75
				;;          /     \
				;;         b       c
				;;          \   40| \ 35
				;;        25 \    |  d
				;;            \   | / 35
				;;             \  |/
				;;                e
				;; Optimal layout <a,c,d,e,b> with cost = 25 + 25 + 40 = 90
				;; The default layout <a,c,d,b,e>'s cost is 40 + 35 + 25 = 100
				;;
				; CHECK-LABEL: test_triangle_in_diamond_2
				; CHECK: callq a
				; CHECK: callq c
				; CHECK: callq d
				; CHECK: callq e
				; CHECK: callq b
				define void @test_triangle_in_diamond_2(i32 %t1, i32 %t2) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %then, label %else, !prof !19

				then:
				call void @b()
				br label %merge

				else:
				call void @c()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %else2, !prof !20

				else2:
				call void @d()
				br label %merge

				merge:
				call void @e()
				ret void
				}

				;;             a
				;;        25 /   \ 75
				;;          /     \
				;;         b       c
				;;          \   55| \ 20
				;;        25 \    |  d
				;;            \   | / 20
				;;             \  |/
				;;                e
				;; The optimal layout is <a, c, e, b, d> with cost
				;; 25 + 25 + 20 + 20 = 90
				;; Default layout <a,c,d,b,e> has cost 25 + 55 + 20 = 100

				; CHECK-LABEL: test_triangle_in_diamond_3
				; CHECK: callq a
				; CHECK: callq c
				; CHECK: callq e
				; CHECK: callq b
				; CHECK: callq d

				define void @test_triangle_in_diamond_3(i32 %t1, i32 %t2) !prof !1 {
				entry:
				call void @a()
				%cond1 = icmp eq i32 %t1, 0
				br i1 %cond1, label %then, label %else, !prof !19

				then:
				call void @b()
				br label %merge

				else:
				call void @c()
				%cond2 = icmp eq i32 %t2, 0
				br i1 %cond2, label %merge, label %else2, !prof !21

				else2:
				call void @d()
				br label %merge

				merge:
				call void @e()
				ret void
				}

				;; Double triangle inside (lower half) of a diamond

				!1 = !{!"function_entry_count", i32 10}
				!2 = !{!"branch_weights", i32 60, i32 40}
				!3 = !{!"branch_weights", i32 0, i32 40}
				!4 = !{!"branch_weights", i32 10, i32 30}
				!5 = !{!"branch_weights", i32 0, i32 40}
				!6 = !{!"branch_weights", i32 40, i32 0}
				!7 = !{!"branch_weights", i32 0, i32 0}
				!8 = !{!"branch_weights", i32 70, i32 20}
				!9 = !{!"branch_weights", i32 10, i32 10}
				!10 = !{!"branch_weights", i32 70, i32 0}
				!11 = !{!"branch_weights", i32 70, i32 40}
				!12 = !{!"branch_weights", i32 25, i32 15}
				!13 = !{!"branch_weights", i32 65, i32 45}
				!14 = !{!"branch_weights", i32 35, i32 10}
				!15 = !{!"branch_weights", i32 60, i32 40}
				!16 = !{!"branch_weights", i32 10, i32 30}
				!17 = !{!"branch_weights", i32 45, i32 55}
				!18 = !{!"branch_weights", i32 40, i32 15}
				!19 = !{!"branch_weights", i32 25, i32 75}
				!20 = !{!"branch_weights", i32 40, i32 35}
				!21 = !{!"branch_weights", i32 55, i32 20}
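
The costs quoted in the test comments can be re-derived mechanically. The sketch below assumes the cost model the comments appear to use: every CFG edge whose destination is not the immediate layout successor of its source contributes its frequency to the total. Edge and layoutCost are hypothetical names and are not part of LLVM.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One CFG edge with its profile frequency (illustrative type).
struct Edge {
  std::string From;
  std::string To;
  uint64_t Freq;
};

// Sums the frequency of every edge that is not a fall-through in Order,
// i.e. whose destination does not directly follow its source in the layout.
uint64_t layoutCost(const std::vector<std::string> &Order,
                    const std::vector<Edge> &Edges) {
  uint64_t Cost = 0;
  for (const Edge &E : Edges) {
    bool FallThrough = false;
    for (std::size_t I = 0; I + 1 < Order.size(); ++I)
      if (Order[I] == E.From && Order[I + 1] == E.To)
        FallThrough = true;
    if (!FallThrough)
      Cost += E.Freq;
  }
  return Cost;
}
```

For the edges of test_double_triangles_1 (a->d 70, a->b 20, b->d 10, b->c 10, c->d 10), layoutCost gives 40 for <a, d, b, c> and 80 for <a, b, c, d>, in line with the comment on that test.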