This is an archive of the discontinued LLVM Phabricator instance.

[PGO] Hoist hot case statement from switches
Abandoned · Public

Authored by mcrosier on Oct 14 2014, 2:44 PM.

Details

Summary

This patch identifies hot cases based on profile information and inserts the conditional logic necessary to jump to the hottest case statement before entering the general switch lowering logic.
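
Roughly, at the source level the transformation looks like the following sketch (the function names, case values, and the choice of case 7 as the hot case are all invented for illustration; the pass itself operates on LLVM IR):

    void handle1(); void handle7(); void handle9(); void handle12();

    // Before: profile data says 'case 7' is taken more than 80% of the time.
    void before(int x) {
      switch (x) {
      case 1:  handle1();  break;
      case 7:  handle7();  break;  // hot case
      case 9:  handle9();  break;
      case 12: handle12(); break;
      }
    }

    // After: a single compare-and-branch reaches the hot case immediately;
    // the remaining cases fall through to the usual switch lowering.
    void after(int x) {
      if (x == 7) {
        handle7();
        return;
      }
      switch (x) {
      case 1:  handle1();  break;
      case 9:  handle9();  break;
      case 12: handle12(); break;
      }
    }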

A few comments:

  1. I'm still working on getting performance numbers, so feel free to grab the patch and test yourself.
  2. The 80% "hot" threshold is entirely arbitrary and likely needs tuning. By default, we don't consider switches with fewer than 4 cases, so 80% seems reasonable, IMHO.
  3. I would like to add additional tests, so suggestions are welcome. Also, I'm not sure how to verify profile information is being propagated correctly; the logic looks correct, but I'd like to have a test in place to ensure no future regressions.

The ideal solution would be a balanced binary tree, but this seems to be a reasonable first step.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 14892. Oct 14 2014, 2:44 PM
mcrosier retitled this revision from to [PGO] Hoist hot case statement from switches.
mcrosier updated this object.
mcrosier edited the test plan for this revision.
silvas added a subscriber: silvas. Oct 15 2014, 4:56 PM

The ideal solution to "given I have branch weight metadata giving me the relative probabilities of each case label, in what order should I do two-way comparisons to minimize the expected number of comparisons needed to reach the cases at run-time" is not a balanced binary tree; it is a Huffman tree (where the case probabilities are what is usually called "symbol probabilities" when talking about Huffman trees.)

I don't think this patch is the right long-term approach. I do think this patch is useful for identifying the possible performance win of using branch weight info to guide switch lowering. If the current patch doesn't result in a measurable speedup across a variety of benchmarks, it is unlikely to be worth much effort in improving our lowering by using branch weight metadata. You mentioned you ran spec2000/spec2006; what were the performance benefits?

If you aren't familiar with Huffman trees, they are basically based on a couple simple observations (imagine you have a binary tree, where each branching point represents a test, and the leaves represent case labels):

  • the tree is full (i.e. all non-leaf nodes have two children). This means that if a case label is one of the "deepest" labels (longest path from the root), then its sibling must be as well.
  • if case A has a greater probability than case B, then the leaf containing case A must be closer to the root than (or the same distance as) the leaf containing case B.
  • there is an optimal tree where the two least likely cases are siblings.

That's it. The algorithm is then to start with all case labels in a min-heap by probability, and to repeatedly pop off the two least likely cases, merge them into a fragment of the final tree, which notionally has a probability equal to the sum of their probabilities, and push this merged tree fragment back on the heap. Eventually, the heap contains only one element, which is your final tree.
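
In code, the construction is short. A minimal C++ sketch (the Node layout and all names here are invented for illustration, not taken from any patch; leaves carry case labels, interior nodes represent the two-way tests):

    #include <memory>
    #include <queue>
    #include <utility>
    #include <vector>

    struct Node {
      double Prob = 0;      // case probability, or the sum for a merged fragment
      int CaseLabel = -1;   // meaningful only for leaves
      std::unique_ptr<Node> Left, Right;
    };

    struct HigherProb {
      bool operator()(const Node *A, const Node *B) const {
        return A->Prob > B->Prob;  // makes the priority_queue a min-heap on Prob
      }
    };

    // Cases: (label, probability) pairs; assumed non-empty.
    std::unique_ptr<Node>
    buildHuffmanTree(const std::vector<std::pair<int, double>> &Cases) {
      std::priority_queue<Node *, std::vector<Node *>, HigherProb> Heap;
      for (const auto &C : Cases) {
        Node *Leaf = new Node;
        Leaf->Prob = C.second;
        Leaf->CaseLabel = C.first;
        Heap.push(Leaf);
      }
      while (Heap.size() > 1) {
        Node *A = Heap.top(); Heap.pop();  // the two least likely fragments...
        Node *B = Heap.top(); Heap.pop();
        Node *Merged = new Node;           // ...become siblings under a new node
        Merged->Prob = A->Prob + B->Prob;
        Merged->Left.reset(A);
        Merged->Right.reset(B);
        Heap.push(Merged);
      }
      return std::unique_ptr<Node>(Heap.top());  // the last fragment is the tree
    }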

Honestly, having branch probabilities is really a game-changer for switch lowering and probably merits a different approach (or reconsidering the current approach). You should probably take a look (if you haven't already) at the paper "A Superoptimizer Analysis of Multiway Branch Code Generation" in https://gcc.gnu.org/wiki/HomePage?action=AttachFile&do=get&target=gcc-2008-proceedings.pdf

Hi Sean,
Thanks for your input. Comments below.

In D5786#4, @silvas wrote:

The ideal solution to "given I have branch weight metadata giving me the relative probabilities of each case label, in what order should I do two-way comparisons to minimize the expected number of comparisons needed to reach the cases at run-time" is not a balanced binary tree; it is a Huffman tree (where the case probabilities are what is usually called "symbol probabilities" when talking about Huffman trees.)

To clarify, I didn't mean "balanced" in the traditional sense, but balanced based on probabilities. I do recall our conversation (+Chandler, +Hans) on IRC about Huffman trees.

I don't think this patch is the right long-term approach.

I believe I said that. LLVM prefers incremental development; I believe this is a reasonable first step. I understand the Huffman tree is an entirely different implementation, but the proposed solution is on the order of 50 lines of real code, straightforward, and should require little or no maintenance. It *should* also provide a fairly good ROI, but I understand that remains to be proven.

I do think this patch is useful for identifying the possible performance win of using branch weight info to guide switch lowering. If the current patch doesn't result in a measurable speedup across a variety of benchmarks, it is unlikely to be worth much effort in improving our lowering by using branch weight metadata. You mentioned you ran spec2000/spec2006; what were the performance benefits?

I mentioned that I was working on getting numbers, but I've now been sidetracked by an internal release and the upcoming Dev Meeting. I don't know when I'll have the bandwidth to generate the numbers, so I suggested others give it a try if they wish. Feel free to do so yourself.

If you aren't familiar with Huffman trees, they are basically based on a couple simple observations (imagine you have a binary tree, where each branching point represents a test, and the leaves represent case labels):

  • the tree is full (i.e. all non-leaf nodes have two children). This means that if a case label is one of the "deepest" labels (longest path from the root), then its sibling must be as well.
  • if case A has a greater probability than case B, then the leaf containing case A must be closer to the root than (or the same distance as) the leaf containing case B.
  • there is an optimal tree where the two least likely cases are siblings.

That's it. The algorithm is then to start with all case labels in a min-heap by probability, and to repeatedly pop off the two least likely cases, merge them into a fragment of the final tree, which notionally has a probability equal to the sum of their probabilities, and push this merged tree fragment back on the heap. Eventually, the heap contains only one element, which is your final tree.

Honestly, having branch probabilities is really a game-changer for switch lowering and probably merits a different approach (or reconsidering the current approach). You should probably take a look (if you haven't already) at the paper "A Superoptimizer Analysis of Multiway Branch Code Generation" in https://gcc.gnu.org/wiki/HomePage?action=AttachFile&do=get&target=gcc-2008-proceedings.pdf

I'm aware of the paper. In fact, this approach is based on results from the 2006 Summit paper titled "Switch Statement Case Reordering FDO." That work shows that hoisting the two hottest cases in spec2000/perlbmk improved performance by 17%. I don't expect to see nearly that much gain, but I do expect to see some.

I understand your concern, but I don't agree that this patch is a step in the wrong direction (or that it should be rejected in favor of the ideal). If the patch were more pervasive, then I would proceed with caution. However, it is not, and it should have a fair ROI. If and when someone comes along to implement the Huffman approach, it will be trivial to revert this change.

Chad
chandlerc resigned from this revision. Mar 29 2015, 2:02 PM
chandlerc edited reviewers, added: hansw, djasper; removed: chandlerc.

I'd really like to get this in, but I suspect others could review this better than I could. Adding Hans and Daniel as they've looked at this stuff recently.

djasper edited edge metadata. Mar 30 2015, 10:00 AM

Hans is probably best-suited to review this properly. However, also note that he is undertaking a massive rewrite of the entire switch lowering logic (http://reviews.llvm.org/D8649). You should probably sync.

As for the approach, I am not entirely sure it is good. Obviously, we want to make use of profile numbers to improve switch lowering. However, it is unclear to me whether, e.g., we'd want to handle the hot case separately even if it could be part of a jump table instead. Jump tables are quite fast, and pulling the hot case out is entirely additional code.

hans added a subscriber: hans. Mar 30 2015, 10:37 AM

I too am a little concerned about the approach of peeling off a hot case. As djasper pointed out, it might have been possible to include the case in a jump table (even worse, peeling the case might prevent a jump table from being built at all, because the density of the remaining cases changes). More generally, peeling a hot case penalizes the non-hot cases by putting an extra branch in front of them, which means we'd only want to do this when the hot case completely dominates.
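
To make the density concern concrete, here is a sketch with an invented 40% density cut-off (LLVM's real jump-table heuristic is more involved; the numbers are purely illustrative):

    #include <algorithm>
    #include <vector>

    // Density = number of cases / size of the case value range.
    double density(const std::vector<int> &CaseValues) {  // assumed non-empty
      auto MinMax = std::minmax_element(CaseValues.begin(), CaseValues.end());
      return double(CaseValues.size()) /
             (double(*MinMax.second) - double(*MinMax.first) + 1.0);
    }
    // cases {0,1,2,3,10}: 5/11 ~ 45% -> dense enough for a jump table
    // peel hot case 1:    4/11 ~ 36% -> below the 40% cut-off, no jump table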

If we do want to peel a hot case, my patch would allow doing that without disrupting jump tables or bit tests: we'd extract those first, and then peel the hot CaseCluster, which might be a single case, range, jump table or bit test.

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.
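
A sketch of what choosing such a split might look like (my illustration under assumed names, not code from D8649): pick the split point that best balances the profile weight on the two sides, then recurse on each side, so heavy cases end up near the root.

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Weights[i] is the profile weight of the i-th case, with cases sorted
    // by value. Returns the index at which to split [First, Last), chosen so
    // the total weight on each side is as balanced as possible.
    // Requires Last - First >= 2.
    size_t chooseSplit(const std::vector<uint64_t> &Weights,
                       size_t First, size_t Last) {
      uint64_t Total = 0;
      for (size_t I = First; I != Last; ++I)
        Total += Weights[I];
      uint64_t LeftSum = 0, BestImbalance = Total;
      size_t BestIdx = First + 1;
      for (size_t I = First + 1; I != Last; ++I) {
        LeftSum += Weights[I - 1];            // left side is [First, I)
        uint64_t RightSum = Total - LeftSum;  // right side is [I, Last)
        uint64_t Imbalance =
            LeftSum > RightSum ? LeftSum - RightSum : RightSum - LeftSum;
        if (Imbalance < BestImbalance) {
          BestImbalance = Imbalance;
          BestIdx = I;
        }
      }
      return BestIdx;  // the node tests "x < value of case BestIdx"
    }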

In D5786#149001, @hans wrote:

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.

Sort of random, but in case it saves you a bunch of time, I believe this exact problem is covered in Knuth (either volume 1 or 3, I forget). The keyword is "optimal binary search tree".

hans added a comment. Mar 30 2015, 7:26 PM
In D5786#149001, @hans wrote:

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.

Sort of random, but in case it saves you a bunch of time, I believe this exact problem is covered in Knuth (either volume 1 or 3, I forget). The keyword is "optimal binary search tree".

Thanks! It sounds like what I'm proposing is this approach: http://en.wikipedia.org/wiki/Optimal_binary_search_tree#Mehlhorn.27s_approximation_algorithm

I guess it was bold of me to claim my idea guarantees the minimum expected value, but this makes me think it's still a good idea.

In D5786#149403, @hans wrote:
In D5786#149001, @hans wrote:

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.

Sort of random, but in case it saves you a bunch of time, I believe this exact problem is covered in Knuth (either volume 1 or 3, I forget). The keyword is "optimal binary search tree".

Thanks! It sounds like what I'm proposing is this approach: http://en.wikipedia.org/wiki/Optimal_binary_search_tree#Mehlhorn.27s_approximation_algorithm

I guess it was bold of me to claim my idea guarantees the minimum expected value, but this makes me think it's still a good idea.

Ultimately it is based on your model for performance.

As a possibly realistic example, we may not want to optimize expected tree depth of the binary tree, but rather want to optimize predictability of the branches along each path from the root to the pieces of code in the leaves (weighted appropriately etc.). In this case, what we want to infer from the data is not "how likely are we to go to this case", but rather "how likely is a particular choice of branching scheme to result in well-predicted branches". In this case the choice of tree structure could be something related to minimizing entropy of each branch (as a rough approximation to branch prediction cost).

It's all about your model for the resulting performance. It seems like your current approach is based on a performance model where the tree depth is the sole determining factor in the resulting performance once we are doing a binary search tree (i.e. branches take constant time). That sounds reasonable, but it might be worth revisiting in light of modern microarchitectures. Perhaps the entropy of each branch should be taken into account. Or maybe there is something fancier than entropy we can use here (such as a Markov model; this would require improving the profiling tooling, I think, so that it can collect conditional probabilities).
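
As an illustrative sketch of the entropy idea (not anything in LLVM): the binary entropy of a branch is 0 when the branch always goes one way (trivially predictable) and peaks at 1 bit when it is 50/50 (hardest to predict).

    #include <cmath>

    // Entropy of a two-way branch taken with probability P.
    double branchEntropy(double P) {
      if (P <= 0.0 || P >= 1.0)
        return 0.0;  // perfectly biased, perfectly predictable
      return -P * std::log2(P) - (1.0 - P) * std::log2(1.0 - P);
    }
    // A tree builder along these lines would prefer splits whose branches
    // have low entropy, weighted by how often each node is reached.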

The extraction of switch tables and bit tests is an example of where we are modeling the performance of handling particular situations.

Ok... done braindumping now...

hans added a comment. Mar 31 2015, 11:20 AM
In D5786#149403, @hans wrote:
In D5786#149001, @hans wrote:

The direction I'd like to explore with my patch however, is to balance the binary tree based on profile info rather than node count. That would put hotter cases closer to the root. If f(x) is the number of branches needed to reach case x, I think this approach would minimize the expected value of f(x) when x is a random variable with a distribution matching the profile info.

Sort of random, but in case it saves you a bunch of time, I believe this exact problem is covered in Knuth (either volume 1 or 3, I forget). The keyword is "optimal binary search tree".

Thanks! It sounds like what I'm proposing is this approach: http://en.wikipedia.org/wiki/Optimal_binary_search_tree#Mehlhorn.27s_approximation_algorithm

I guess it was bold of me to claim my idea guarantees the minimum expected value, but this makes me think it's still a good idea.

Ultimately it is based on your model for performance.

As a possibly realistic example, we may not want to optimize expected tree depth of the binary tree, but rather want to optimize predictability of the branches along each path from the root to the pieces of code in the leaves (weighted appropriately etc.). In this case, what we want to infer from the data is not "how likely are we to go to this case", but rather "how likely is a particular choice of branching scheme to result in well-predicted branches". In this case the choice of tree structure could be something related to minimizing entropy of each branch (as a rough approximation to branch prediction cost).

It's all about your model for the resulting performance. It seems like your current approach is based on a performance model where the tree depth is the sole determining factor in the resulting performance once we are doing a binary search tree (i.e. branches take constant time). That sounds reasonable, but it might be worth revisiting in light of modern microarchitectures. Perhaps the entropy of each branch should be taken into account. Or maybe there is something fancier than entropy we can use here (such as a Markov model; this would require improving the profiling tooling, I think, so that it can collect conditional probabilities).

The extraction of switch tables and bit tests is an example of where we are modeling the performance of handling particular situations.

Ok... done braindumping now...

It's an interesting idea, but I'm not sure it's practical to try to model this. For now, I think minimizing the number of branches is a good heuristic, and then I put my faith in the hardware to predict the branches we do generate. For example, http://dl.acm.org/citation.cfm?id=2738614 suggests that indirect branches are well predicted on modern architectures.

mcrosier abandoned this revision. Aug 5 2015, 11:17 AM