This is an archive of the discontinued LLVM Phabricator instance.

CodeGen: Placement: Apply triangle heuristic more aggressively at O3.
Abandoned · Public

Authored by iteratee on Mar 7 2017, 5:15 PM.

Details

Reviewers

Summary

The triangle tail duplication heuristic can improve performance for small chains
of triangles, but it can also make poorly generated code perform worse. For
example, one of the MIPS shift lowerings generates code that looks like this:

if x
  do something that doesn't modify x
if !x
  do something

Requiring 3 triangles in a row avoids this case. Requiring only 2 in a row is
likely overly aggressive for O2, at least for now, so this change lowers the
threshold to 2 only at O3.
I have benchmark data showing this is profitable in the cases where it applies:

No significant performance changes to llvm test-suite. Tiny size increases: 0.027% on the 5 affected benchmarks.

This improves an important internal Google benchmark (protocol buffer serialization) by 2%, with no significant effect on other internal benchmarks.

Event Timeline

iteratee created this revision on Mar 7 2017, 5:15 PM.

Herald added a subscriber: nemanjai, on Mar 7 2017, 5:15 PM.

iteratee edited the summary of this revision on Mar 7 2017, 5:15 PM.

So, this didn't go to the list. I'd abandon it and create a fresh revision with llvm-commits CC'ed directly.

Also, it'd be good to include at least a summary of the benchmark data (especially the LLVM test suite data). You should even be able to include some of our internal benchmark data by looking at how Dehao did it here: http://lists.llvm.org/pipermail/llvm-dev/2017-February/109802.html

Alternatively, you can try cc llvm-commits, then update the description.

iteratee added a subscriber: llvm-commits, on Mar 14 2017, 2:55 PM.

iteratee edited the summary of this revision on Mar 14 2017, 5:17 PM.

Should the mips problem be fixed? I don't see a fundamental reason this needs to be opt level specific.

In D30728#701847, @davidxl wrote:

Should the mips problem be fixed? I don't see a fundamental reason this needs to be opt level specific.

We looked into the mips issue. It is only a problem on mips2 and mips3, which are now 25 years old. The lowering for those targets could be better, but I'll file a bug and let it be.

The issue also goes away with branch coalescing if the mips branch targets are correctly marked as not clobbering the AT register.

In D30728#702250, @iteratee wrote:

In D30728#701847, @davidxl wrote:

Should the mips problem be fixed? I don't see a fundamental reason this needs to be opt level specific.

We looked into the mips issue. It is only a problem on mips2 and mips3, which are now 25 years old. The lowering for those targets could be better, but I'll file a bug and let it be.

The issue also goes away with branch coalescing if the mips branch targets are correctly marked as not clobbering the AT register.

Agreed, I'd very much like to file a bug and punt here. I don't think it's worth the yak shave of fixing the mips port to avoid treating all conditional branches as macro instructions, and the branch coalescing patch will be turned on by default soon.
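
To make the pattern under discussion concrete, here is a minimal C++ sketch of the shape quoted in the summary: two back-to-back tests of the same unmodified value, next to the single if/else they are equivalent to. The function names and trace strings are hypothetical and exist only for illustration. This is not the MIPS lowering itself, and merging the two tests by hand only approximates what a branch coalescing pass would do.

```cpp
#include <cassert>
#include <string>

// Two triangles in a row: the second test is the complement of the first,
// and the first body does not modify x.
std::string twoTriangles(bool x) {
  std::string trace;
  if (x)
    trace += "A"; // do something that doesn't modify x
  if (!x)
    trace += "B"; // do something
  return trace;
}

// Equivalent single if/else: one conditional branch instead of two.
std::string coalesced(bool x) {
  std::string trace;
  if (x)
    trace += "A";
  else
    trace += "B";
  return trace;
}
```

Both functions produce the same trace for either value of `x`, which is why the second test is redundant once the branches are merged.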

iteratee abandoned this revision on May 5 2017, 9:58 AM.

Revision Contents

Path                                       Size
lib/CodeGen/MachineBlockPlacement.cpp      19 lines
test/CodeGen/PowerPC/tail-dup-layout.ll    80 lines

Diff 90969

lib/CodeGen/MachineBlockPlacement.cpp

@@ ... @@ static cl::opt<unsigned> TailDupPlacementPenalty(
     cl::desc("Cost penalty for blocks that can avoid breaking CFG by copying. "
              "Copying can increase fallthrough, but it also increases icache "
              "pressure. This parameter controls the penalty to account for that. "
              "Percent as integer."),
     cl::init(2),
     cl::Hidden);

 // Heuristic for triangle chains.
-static cl::opt<unsigned> TriangleChainCount(
+static cl::opt<unsigned> TriangleChainCountOpt(
     "triangle-chain-count",
     cl::desc("Number of triangle-shaped-CFG's that need to be in a row for the "
              "triangle tail duplication heuristic to kick in. 0 to disable."),
     cl::init(3),
     cl::Hidden);

 extern cl::opt<unsigned> StaticLikelyProb;
 extern cl::opt<unsigned> ProfileLikelyProb;
@@ ... @@ class MachineBlockPlacement : public MachineFunctionPass {

   /// \brief Duplicator used to duplicate tails during placement.
   ///
   /// Placement decisions can open up new tail duplication opportunities, but
   /// since tail duplication affects placement decisions of later blocks, it
   /// must be done inline.
   TailDuplicator TailDup;

+  /// \brief Number of triangle-shaped-CFG's that need to be in a row for the
+  /// triangle tail duplication heuristic to kick in. Defaults to the value in
+  /// TriangleChainCountOpt, unless -O3 is specified, and then it is reduced by
+  /// 1.
+  unsigned TriangleChainCount;
+
   /// \brief Allocator and owner of BlockChain structures.
   ///
   /// We build BlockChains lazily while processing the loop structure of
   /// a function. To reduce malloc traffic, we allocate them using this
   /// slab-like allocator, and destroy them after the pass completes. An
   /// important guarantee is that this allocator produces stable pointers to
   /// the chains.
   SpecificBumpPtrAllocator<BlockChain> ChainAllocator;
@@ ... @@ if (TailDupPlacement) {
     MPDT = &getAnalysis<MachinePostDominatorTree>();
     unsigned TailDupSize = TailDupPlacementThreshold;
     if (MF.getFunction()->optForSize())
       TailDupSize = 1;
     TailDup.initMF(MF, MBPI, /* LayoutMode */ true, TailDupSize);
     precomputeTriangleChains();
   }

+  TargetPassConfig *PassConfig = &getAnalysis<TargetPassConfig>();
+  // For agressive optimization, we can adjust some thresholds to be less
+  // conservative.
+  TriangleChainCount = TriangleChainCountOpt;
+  if (PassConfig->getOptLevel() >= CodeGenOpt::Aggressive) {
+    // Apply the triangle heuristic to 1 fewer triangle.
+    if (TriangleChainCountOpt.getNumOccurrences() == 0)
+      TriangleChainCount -= 1;
+  }
+
   assert(BlockToChain.empty());

   buildCFGChains();

   // Changing the layout can create new tail merging opportunities.
-  TargetPassConfig *PassConfig = &getAnalysis<TargetPassConfig>();
   // TailMerge can create jump into if branches that make CFG irreducible for
   // HW that requires structured CFG.
   bool EnableTailMerge = !MF.getTarget().requiresStructuredCFG() &&
                          PassConfig->getEnableTailMerge() &&
                          BranchFoldPlacement;
   // No tail merging opportunities if the block number is less than four.
   if (MF.size() > 3 && EnableTailMerge) {
     unsigned TailMergeSize = TailDupPlacementThreshold + 1;
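
The threshold selection this patch adds to MachineBlockPlacement.cpp can be modeled outside LLVM. The sketch below is a simplified stand-in, not the pass's code: `OptLevel` and `effectiveTriangleChainCount` are hypothetical names, `optionValue` plays the role of `TriangleChainCountOpt`, and `explicitlySet` plays the role of `TriangleChainCountOpt.getNumOccurrences() != 0`.

```cpp
#include <cassert>

// Hypothetical stand-in for LLVM's CodeGenOpt::Level; only the ordering matters.
enum class OptLevel { None, Less, Default, Aggressive };

// Returns how many consecutive triangles are required before the heuristic
// applies. At the aggressive level the requirement drops by one, but only
// when the user did not pass the option explicitly.
unsigned effectiveTriangleChainCount(unsigned optionValue, bool explicitlySet,
                                     OptLevel level) {
  unsigned count = optionValue;
  if (level >= OptLevel::Aggressive && !explicitlySet)
    count -= 1;
  return count;
}
```

With the patch's default of 3, this yields 3 at O2 and 2 at O3, while an explicit `-triangle-chain-count` value is used unchanged at every level. As in the patch, an explicit 0 still disables the heuristic.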

test/CodeGen/PowerPC/tail-dup-layout.ll

@@ ... @@
-; RUN: llc -O2 < %s | FileCheck %s
+; RUN: llc -O2 < %s | FileCheck --check-prefix=CHECK --check-prefix=CHECK-O2 %s
+; RUN: llc -O3 < %s | FileCheck --check-prefix=CHECK --check-prefix=CHECK-O3 %s
 target datalayout = "e-m:e-i64:64-n32:64"
 target triple = "powerpc64le-grtev4-linux-gnu"

 ; Intended layout:
 ; The chain-based outlining produces the layout
 ; test1
 ; test2
 ; test3
@@ ... @@ optional4:
   call void @d()
   call void @d()
   br label %exit
 exit:
   ret void
 }

 ; Intended layout:
-; The chain-of-triangles based duplicating produces the layout
+; At -O3, The chain-of-triangles based duplicating produces the layout
 ; test1
 ; test2
 ; test3
-; test4
 ; optional1
 ; optional2
 ; optional3
-; optional4
 ; exit
 ; even for 50/50 branches.
+; At -O2, The chain-of-triangles heuristic should not apply, producing the
+; layout:
+; test1
+; optional1
+; test2
+; optional2
+; test3
+; optional3
+; exit
 ; Tail duplication puts test n+1 at the end of optional n
 ; so optional1 includes a copy of test2 at the end, and branches
 ; to test3 (at the top) or falls through to optional 2.
 ; The CHECK statements check for the whole string of tests
 ; and then check that the correct test has been duplicated into the end of
 ; the optional blocks and that the optional blocks are in the correct order.
 ;CHECK-LABEL: straight_test_50:
 ; test1 may have been merged with entry
 ;CHECK: mr [[TAGREG:[0-9]+]], 3
 ;CHECK: andi. {{[0-9]+}}, [[TAGREG]], 1
-;CHECK-NEXT: bc 12, 1, .[[OPT1LABEL:[_0-9A-Za-z]+]]
-;CHECK-NEXT: # %test2
-;CHECK-NEXT: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 30, 30
-;CHECK-NEXT: bne 0, .[[OPT2LABEL:[_0-9A-Za-z]+]]
-;CHECK-NEXT: .[[TEST3LABEL:[_0-9A-Za-z]+]]: # %test3
-;CHECK-NEXT: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 29, 29
-;CHECK-NEXT: bne 0, .[[OPT3LABEL:[_0-9A-Za-z]+]]
-;CHECK-NEXT: .[[TEST4LABEL:[_0-9A-Za-z]+]]: # %test4
-;CHECK-NEXT: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 28, 28
-;CHECK-NEXT: bne 0, .[[OPT4LABEL:[_0-9A-Za-z]+]]
-;CHECK-NEXT: .[[EXITLABEL:[_0-9A-Za-z]+]]: # %exit
-;CHECK: blr
-;CHECK-NEXT: .[[OPT1LABEL]]:
-;CHECK: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 30, 30
-;CHECK-NEXT: beq 0, .[[TEST3LABEL]]
-;CHECK-NEXT: .[[OPT2LABEL]]:
-;CHECK: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 29, 29
-;CHECK-NEXT: beq 0, .[[TEST4LABEL]]
-;CHECK-NEXT: .[[OPT3LABEL]]:
-;CHECK: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 28, 28
-;CHECK-NEXT: beq 0, .[[EXITLABEL]]
-;CHECK-NEXT: .[[OPT4LABEL]]:
-;CHECK: b .[[EXITLABEL]]
+;CHECK-O3-NEXT: bc 12, 1, .[[OPT1LABEL:[_0-9A-Za-z]+]]
+;CHECK-O3-NEXT: # %test2
+;CHECK-O3-NEXT: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 30, 30
+;CHECK-O3-NEXT: bne 0, .[[OPT2LABEL:[_0-9A-Za-z]+]]
+;CHECK-O3-NEXT: .[[TEST3LABEL:[_0-9A-Za-z]+]]: # %test3
+;CHECK-O3-NEXT: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 29, 29
+;CHECK-O3-NEXT: bne 0, .[[OPT3LABEL:[_0-9A-Za-z]+]]
+;CHECK-O3-NEXT: .[[EXITLABEL:[_0-9A-Za-z]+]]: # %exit
+;CHECK-O3: blr
+;CHECK-O3-NEXT: .[[OPT1LABEL]]:
+;CHECK-O3: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 30, 30
+;CHECK-O3-NEXT: beq 0, .[[TEST3LABEL]]
+;CHECK-O3-NEXT: .[[OPT2LABEL]]:
+;CHECK-O3: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 29, 29
+;CHECK-O3-NEXT: beq 0, .[[EXITLABEL]]
+;CHECK-O3-NEXT: .[[OPT3LABEL]]:
+;CHECK-O3: b .[[EXITLABEL]]
+
+;CHECK-O2-NEXT: bc 4, 1, .[[TEST2LABEL:[_0-9A-Za-z]+]]
+;CHECK-O2: bl a
+;CHECK-O2-NOT: rlwinm
+;CHECK-O2: .[[TEST2LABEL]]: # %test2
+;CHECK-O2-NEXT: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 30, 30
+;CHECK-O2-NEXT: beq 0, .[[TEST3LABEL:[_0-9A-Za-z]+]]
+;CHECK-O2: bl b
+;CHECK-O2-NOT: rlwinm
+;CHECK-O2: .[[TEST3LABEL]]: # %test3
+;CHECK-O2-NEXT: rlwinm. {{[0-9]+}}, [[TAGREG]], 0, 29, 29
+;CHECK-O2-NEXT: beq 0, .[[EXITLABEL:[_0-9A-Za-z]+]]
+;CHECK-O2: bl c
+;CHECK-O2: .[[EXITLABEL:[_0-9A-Za-z]+]]: # %exit
+;CHECK-O2: blr

 define void @straight_test_50(i32 %tag) {
 entry:
   br label %test1
 test1:
   %tagbit1 = and i32 %tag, 1
   %tagbit1eq0 = icmp eq i32 %tagbit1, 0
   br i1 %tagbit1eq0, label %test2, label %optional1, !prof !2
 optional1:
   call void @a()
   br label %test2
 test2:
   %tagbit2 = and i32 %tag, 2
   %tagbit2eq0 = icmp eq i32 %tagbit2, 0
   br i1 %tagbit2eq0, label %test3, label %optional2, !prof !2
 optional2:
   call void @b()
   br label %test3
 test3:
   %tagbit3 = and i32 %tag, 4
   %tagbit3eq0 = icmp eq i32 %tagbit3, 0
-  br i1 %tagbit3eq0, label %test4, label %optional3, !prof !2
+  br i1 %tagbit3eq0, label %exit, label %optional3, !prof !1
 optional3:
   call void @c()
-  br label %test4
-test4:
-  %tagbit4 = and i32 %tag, 8
-  %tagbit4eq0 = icmp eq i32 %tagbit4, 0
-  br i1 %tagbit4eq0, label %exit, label %optional4, !prof !1
-optional4:
-  call void @d()
   br label %exit
 exit:
   ret void
 }

 ; Intended layout:
 ; The chain-based outlining produces the layout
 ; entry
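
The two block orders listed in the comments for `straight_test_50` can be generated mechanically, which makes the difference between them easier to see. The sketch below is illustrative only: `triangleChainLayout` and `interleavedLayout` are hypothetical helpers that reproduce the orders as written in the test's comments. They do not model the placement algorithm, and the CHECK lines for the emitted code may order the exit block differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Order listed for -O3 in the test comments: every test first, then every
// optional block, then exit.
std::vector<std::string> triangleChainLayout(unsigned n) {
  std::vector<std::string> order;
  for (unsigned i = 1; i <= n; ++i)
    order.push_back("test" + std::to_string(i));
  for (unsigned i = 1; i <= n; ++i)
    order.push_back("optional" + std::to_string(i));
  order.push_back("exit");
  return order;
}

// Order listed for -O2 in the test comments: each test is followed directly
// by its optional block, then exit.
std::vector<std::string> interleavedLayout(unsigned n) {
  std::vector<std::string> order;
  for (unsigned i = 1; i <= n; ++i) {
    order.push_back("test" + std::to_string(i));
    order.push_back("optional" + std::to_string(i));
  }
  order.push_back("exit");
  return order;
}
```

Calling either helper with `n = 3` reproduces the corresponding seven-entry list from the comments.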