This is an archive of the discontinued LLVM Phabricator instance.

BlockPlacement: Consider hotness of blocks relative to a loop iteration rather than relative to the loop as a whole
Needs Review · Public

Authored by rsmith on Aug 4 2017, 1:56 PM.


Details

Reviewers

congh
chandlerc
iteratee
davidxl

Summary

This patch tweaks the coldness metric for blocks in loops, which was introduced in https://reviews.llvm.org/D11662, so that it considers the hotness of a block relative to a single loop iteration rather than relative to the loop as a whole.

The threshold for coldness is changed from <= 1/5 executions of the block per execution of the enclosing loop to <= 1/160 executions of the block per iteration of the enclosing loop. This preserves the existing threshold for a loop with a backedge probability of 31/32, which is the value that BranchProbabilityInfo invents in the absence of profiling data.
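
The arithmetic behind the new default can be checked with a small sketch. This is not code from the patch: the helper names are hypothetical, and it assumes a simple geometric model in which a loop whose backedge is taken with probability p runs 1 / (1 - p) iterations per entry.

```cpp
#include <cassert>

// Hypothetical helper (not part of LLVM): expected iterations per entry into
// a loop whose backedge is taken with probability backedgeProb.
// Assumption: geometric model, trips = 1 / (1 - p).
double expectedTripCount(double backedgeProb) {
  assert(backedgeProb >= 0.0 && backedgeProb < 1.0 && "loop must terminate");
  return 1.0 / (1.0 - backedgeProb);
}

// Hypothetical helper: convert a ratio threshold expressed per execution of
// the loop into the equivalent threshold expressed per loop iteration.
double perIterationRatio(double perLoopRatio, double backedgeProb) {
  return perLoopRatio * expectedTripCount(backedgeProb);
}
```

Under that assumption, a backedge probability of 31/32 gives 32 expected iterations, so the old ratio of 5 per loop execution corresponds to 5 * 32 = 160 per iteration, which matches the new default.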

Here are a couple of illustrative special cases to demonstrate why I think the threshold should be hotness relative to a loop iteration rather than relative to executions of the enclosing loop:

If the loop trip count is very high (say, 1 million) and the loop contains a cold block (with a probability of, say, 1 in 100,000 per iteration), we currently force that block into the loop, potentially imposing a cost on almost all loop iterations for minimal gain.

If the loop trip count is very low (say, we expect 1 trip) and the loop contains a block with probability 0.1 relative to the loop iteration, we currently force that block out of the loop (potentially placing it a very long way away, if the loop contains many blocks) despite it not being particularly cold.

If we instead considered blocks to be cold based on their probability relative to an iteration of the loop, we would do the right thing in both of the above extrema. But perhaps there's some reason that coldness relative to the loop as a whole is a better metric that would justify the current behavior?
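
The two extremes above can be modeled with a short sketch. This is a simplified illustration rather than the pass's actual logic: the function names are hypothetical, frequencies are modeled as floating-point expected counts per loop entry (the real code divides integer block frequencies), and the default ratios are the old and new values of loop-to-cold-block-ratio.

```cpp
#include <cassert>

// Hypothetical model of the old metric: the block is cold if
// (executions of the loop) / (executions of the block) exceeds the ratio.
// tripCount is the expected number of iterations per loop entry, and
// blockProbPerIter is the probability that the block runs in one iteration.
bool coldRelativeToLoop(double tripCount, double blockProbPerIter,
                        double ratio = 5.0) {
  double blockPerLoop = tripCount * blockProbPerIter;
  return blockPerLoop == 0.0 || 1.0 / blockPerLoop > ratio;
}

// Hypothetical model of the proposed metric: the block is cold if
// (iterations of the loop) / (executions of the block) exceeds the ratio.
// The trip count cancels out, so only the per-iteration probability matters.
bool coldRelativeToIteration(double blockProbPerIter, double ratio = 160.0) {
  return blockProbPerIter == 0.0 || 1.0 / blockProbPerIter > ratio;
}
```

In this model, the high-trip-count case (1 million trips, probability 1 in 100,000) is kept in the loop by the old metric and evicted by the new one, while the low-trip-count case (1 trip, probability 0.1) is evicted by the old metric and kept by the new one.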

(Part of the added testcase also depends on D36296.)

Diff Detail

Repository: rL LLVM

Event Timeline

rsmith created this revision. Aug 4 2017, 1:56 PM

Herald added a subscriber: sanjoy. Aug 4 2017, 1:56 PM

Should we adjust the ratio if it's not specified explicitly and no profile data is available?

This looks fine to me. Is there any performance data with PGO?

In D36338#832522, @iteratee wrote:

Should we adjust the ratio if it's not specified explicitly and no profile data is available?

Some tuning seems like a good idea as a next step, but I'd like to keep the behavior in the no-profile-data case the same for this change.

In D36338#832543, @davidxl wrote:

This looks fine to me. Is there any performance data with PGO?

I've done some measurements, and found that while some benchmarks improve significantly, others regress significantly. I'd like to take a look at the regressions before pushing forward more with this.

Revision Contents

Path                                                   Size
lib/CodeGen/MachineBlockPlacement.cpp                  27 lines
test/CodeGen/X86/code_placement_cold_loop_blocks.ll    103 lines

Diff 109816

lib/CodeGen/MachineBlockPlacement.cpp

Context not available.
     "loop-to-cold-block-ratio",
     cl::desc("Outline loop blocks from loop chain if (frequency of loop) / "
              "(frequency of block) is greater than this ratio"),
-    cl::init(5), cl::Hidden);
+    cl::init(160), cl::Hidden);

 static cl::opt<bool>
     PreciseRotationCost("precise-rotation-cost",
Context not available.
   BlockFilterSet LoopBlockSet;

   // Filter cold blocks off from LoopBlockSet when profile data is available.
-  // Collect the sum of frequencies of incoming edges to the loop header from
-  // outside. If we treat the loop as a super block, this is the frequency of
-  // the loop. Then for each block in the loop, we calculate the ratio between
-  // its frequency and the frequency of the loop block. When it is too small,
+  // For each block in the loop, we calculate the estimated number of times
+  // it will execute for each iteration of the loop. When it is too small,
   // don't add it to the loop chain. If there are outer loops, then this block
   // will be merged into the first outer loop chain for which this block is not
-  // cold anymore. This needs precise profile data and we only do this when
-  // profile data is available.
+  // cold anymore. By default, we only do this when profile data is available.
   if (F->getFunction()->getEntryCount()) {
-    BlockFrequency LoopFreq(0);
-    for (auto LoopPred : L.getHeader()->predecessors())
-      if (!L.contains(LoopPred))
-        LoopFreq += MBFI->getBlockFreq(LoopPred) *
-                    MBPI->getEdgeProbability(LoopPred, L.getHeader());
+    BlockFrequency LoopFreq = MBFI->getBlockFreq(L.getHeader());
+    DEBUG(dbgs() << "Finding loop blocks for loop with frequency "
+                 << LoopFreq.getFrequency() << "\n");

     for (MachineBasicBlock *LoopBB : L.getBlocks()) {
       auto Freq = MBFI->getBlockFreq(LoopBB).getFrequency();
-      if (Freq == 0 || LoopFreq.getFrequency() / Freq > LoopToColdBlockRatio)
+      if (Freq == 0 || LoopFreq.getFrequency() / Freq > LoopToColdBlockRatio) {
+        DEBUG(dbgs() << "Excluding " << getBlockName(LoopBB)
+                     << " with frequency " << Freq
+                     << " from loop with frequency " << LoopFreq.getFrequency()
+                     << "\n");
         continue;
+      }
       LoopBlockSet.insert(LoopBB);
     }
   } else
Context not available.

test/CodeGen/X86/code_placement_cold_loop_blocks.ll

-; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux < %s | FileCheck %s
+; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux < %s | FileCheck %s --check-prefix=CHECK --check-prefix=NOPRECISE
+; RUN: llc -mcpu=corei7 -mtriple=x86_64-linux -force-evict-cold-blocks-from-loops -force-loop-cold-block < %s | FileCheck %s --check-prefix=CHECK --check-prefix=PRECISE

 define void @foo() !prof !1 {
 ; Test if a cold block in a loop will be placed at the end of the function
Context not available.
 ; will merged to the outer loop chain.
 ;
 ; CHECK-LABEL: nested_loop_0:
+; PRECISE: callq b
 ; CHECK: callq c
 ; CHECK: callq d
 ; CHECK: callq e
-; CHECK: callq b
+; NOPRECISE: callq b
 ; CHECK: callq f

 entry:
Context not available.
   ret void
 }

+define void @cold_block_in_hot_loop() !prof !1 {
+; Test that a cold block gets moved out of a high-trip-count loop.
+;
+; CHECK-LABEL: cold_block_in_hot_loop
+; NOPRECISE: callq b
+; NOPRECISE: callq d
+; NOPRECISE: callq e
+; NOPRECISE: callq c
+; With precise rotation cost, the %header -> %cold edge (10 occurrences)
+; is preferred over the %entry -> %header edge (1 occurrence).
+; PRECISE: callq d
+; PRECISE: callq b
+; PRECISE: callq c
+; PRECISE: callq e
+entry:
+  br label %header
+
+header:
+  call void @b()
+  %call = call i1 @a()
+  br i1 %call, label %cold, label %after.cold, !prof !7
+
+cold:
+  call void @c()
+  br label %after.cold
+
+after.cold:
+  call void @d()
+  %cont = call i1 @a()
+  br i1 %cont, label %header, label %done, !prof !8
+
+done:
+  call void @e()
+  ret void
+}
+
+define void @no_profile_data() {
+; Test that we sensibly place a cold block within a loop even when no
+; overall profile data is available, when precise loop rotation cost
+; and cold loop block eviction are enabled.
+;
+; CHECK-LABEL: no_profile_data:
+; NOPRECISE: callq a
+; NOPRECISE: callq d
+; NOPRECISE: callq g
+; NOPRECISE: callq b
+; NOPRECISE: callq c
+; PRECISE: callq a
+; PRECISE: callq c
+; PRECISE: callq g
+; PRECISE: callq b
+; PRECISE: callq d
+entry:
+  br label %for.outer.body
+
+for.outer.body:
+  %i = phi i32 [ 0, %entry ], [ %i.inc, %for.outer.inc ]
+  %br = call i1 @a()
+  br label %for.inner.body
+
+for.inner.body:
+  %i.inner = phi i32 [ 0, %for.outer.body ], [ %i.inc, %for.inner.inc ]
+  br i1 %br, label %if.then, label %if.else
+
+if.then:
+  %br.unlikely = call i1 @g()
+  br i1 %br.unlikely, label %unlikely, label %if.end, !prof !6
+
+unlikely:
+  call void @d()
+  br label %if.end
+
+if.end:
+  call void @b()
+  br label %for.inner.inc
+
+if.else:
+  call void @c()
+  br label %for.inner.inc
+
+for.inner.inc:
+  %i.inc = add i32 %i.inner, 1
+  %done.inner = icmp eq i32 %i.inc, 1000
+  br i1 %done.inner, label %for.outer.inc, label %for.inner.body
+
+for.outer.inc:
+  %done.outer = icmp eq i32 %i.inc, 1000000
+  br i1 %done.outer, label %ret, label %for.outer.body
+
+ret:
+  ret void
+}
+
 declare zeroext i1 @a()
 declare void @b()
 declare void @c()
Context not available.
 declare void @d()
 declare void @e()
 declare void @f()
+declare zeroext i1 @g()

 !1 = !{!"function_entry_count", i64 1}
 !2 = !{!"branch_weights", i32 100, i32 1}
Context not available.
 !3 = !{!"branch_weights", i32 1, i32 10}
 !4 = !{!"branch_weights", i32 1000, i32 1}
 !5 = !{!"branch_weights", i32 100, i32 1}
+!6 = !{!"branch_weights", i32 1, i32 2000}
+!7 = !{!"branch_weights", i32 1, i32 100000}
+!8 = !{!"branch_weights", i32 1000000, i32 1}
Context not available.