This is an archive of the discontinued LLVM Phabricator instance.

[RegAlloc] Skip global splitting if the live range is huge and its spill is trivially rematerializable
ClosedPublic

Authored by wmi on Jul 15 2018, 1:51 PM.

Details

Summary

We ran into a case where MachineLICM hoists a large number of live ranges out of a big loop because it considers those live ranges trivially rematerializable. In regalloc, global splitting is tried first for those live ranges before they are spilled and rematerialized. Because the global splitting algorithm is quadratic, the large number of global splitting candidates caused a huge compile time increase (50s to 1400s on my local machine).

However, for live ranges which are very large and trivially rematerializable, we think it is better to skip global splitting entirely: this saves compile time with little chance of sacrificing performance. We use the number of segments in the live range as an indirect measure of whether global splitting would be expensive, and expose an option as a knob to adjust the size threshold.
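In rough terms, the check is an early exit in front of region (global) splitting. A minimal sketch of that shape is below, assuming an illustrative option name and default, and taking the rematerializability decision as an input parameter; it is not necessarily the exact committed code:

#include "llvm/CodeGen/LiveInterval.h"
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Illustrative knob; the real option name and default may differ.
static cl::opt<unsigned> HugeSizeForSplit(
    "huge-size-for-split", cl::Hidden, cl::init(5000),
    cl::desc("Skip global splitting when a trivially rematerializable "
             "live range has more segments than this threshold"));

// Shape of the early exit before region splitting is attempted.
// IsTriviallyRemat stands in for the spiller/LiveRangeEdit query the
// allocator already performs; it is a parameter here only to keep the
// sketch self-contained.
static bool shouldSkipGlobalSplit(const LiveInterval &VirtReg,
                                  bool IsTriviallyRemat) {
  // LiveInterval::size() is the number of segments, a cheap proxy for how
  // much work the quadratic region-splitting algorithm would have to do.
  return IsTriviallyRemat && VirtReg.size() > HugeSizeForSplit;
}

Making the threshold a cl::opt keeps it easy to retune if whatever default is chosen turns out to be too aggressive or too lax.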

Performance evaluation is ongoing.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi created this revision.Jul 15 2018, 1:51 PM

Given that this only kicks in for an edge case, I think this is fine to go into the tree. It solves a very real problem for us, and we can iterate on it upstream if there are subsequent comments.

For context for others: this fixes an issue we've hit with flex-generated code combined with -fno-jump-tables. Sadly, we're using that option a lot more now because retpolines need it, which means we end up hitting some weird edge cases like this.

chandlerc accepted this revision.Jul 15 2018, 4:37 PM

Meant to actually say "LGTM" and feel free to submit. Minor tweaks seem fine to happen post-commit.

This revision is now accepted and ready to land.Jul 15 2018, 4:37 PM
This revision was automatically updated to reflect the committed changes.

For the record: the check is based on LiveInterval::size(), which gives you the number of segments. So I assume what is "huge" here is the number of basic blocks?

llvm/trunk/lib/CodeGen/RegAllocGreedy.cpp
130

This help string is wrong.

wmi added a comment.Jul 16 2018, 11:48 AM

For the record: the check is based on LiveInterval::size(), which gives you the number of segments. So I assume what is "huge" here is the number of basic blocks?

One segment can span multiple basic blocks. I am not sure whether one basic block can theoretically contain multiple segments, but it is uncommon. So empirically a large number of segments means a large number of basic blocks, and therefore a large number of edge bundle nodes and a high cost for the Hopfield neural network algorithm.
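To make that relationship concrete, here is a rough sketch of counting the blocks a live interval touches; the helper is illustrative only (it is not an existing LLVM function, and not how the allocator estimates cost):

#include "llvm/CodeGen/LiveInterval.h"
#include "llvm/CodeGen/SlotIndexes.h"
using namespace llvm;

// Illustrative helper: roughly count the basic blocks a live interval
// covers by walking each segment block by block. Segments that share a
// block are counted twice, so this is an upper bound, not an exact count.
static unsigned approxLiveBlocks(const LiveInterval &LI,
                                 const SlotIndexes &Indexes) {
  unsigned Blocks = 0;
  for (const LiveRange::Segment &S : LI) {
    // Each segment is a half-open [start, end) range of slot indexes and
    // may stretch across many basic blocks.
    SlotIndex Pos = S.start;
    while (Pos < S.end) {
      ++Blocks;
      Pos = Indexes.getMBBEndIdx(Indexes.getMBBFromIndex(Pos));
    }
  }
  return Blocks;
}

The patch avoids any such per-block walk and simply compares the segment count against the threshold, which is why it is cheap.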

In D49353#1163951, @wmi wrote:

For the record: the check is based on LiveInterval::size(), which gives you the number of segments. So I assume what is "huge" here is the number of basic blocks?

One segment can span multiple basic blocks. I am not sure whether one basic block can theoretically contain multiple segments, but it is uncommon. So empirically a large number of segments means a large number of basic blocks, and therefore a large number of edge bundle nodes and a high cost for the Hopfield neural network algorithm.

Yes, that's what I was getting at (just trying to understand the context); the description speaks about a live range being "huge", but just because you have a huge function or a huge basic block does not necessarily trigger this condition. I think right now, with the connected-component rule in place, we can have no more than two segments of the same virtual register inside a basic block (a value live in, and a value created there that is possibly live out); any other situation must have been split up into multiple vregs previously. I had the same experience that the bad situations for register allocation compile time (or copy coalescing, for that matter) are automatically generated lexer or parser code with a big number of basic blocks. That usually leads to new value numbers being created at join points, increasing the number of live range segments...

Anyway so far this change looks fine to me, but I'm still waiting for our internal systems to come back with numbers on this change (will probably be 1 or 2 more days until everything has cycled through).

wmi added a comment.Jul 16 2018, 3:25 PM
In D49353#1163951, @wmi wrote:

For the record: the check is based on LiveInterval::size(), which gives you the number of segments. So I assume what is "huge" here is the number of basic blocks?

One segment can span multiple basic blocks. I am not sure whether one basic block can theoretically contain multiple segments, but it is uncommon. So empirically a large number of segments means a large number of basic blocks, and therefore a large number of edge bundle nodes and a high cost for the Hopfield neural network algorithm.

Yes, that's what I was getting at (just trying to understand the context); the description speaks about a live range being "huge", but just because you have a huge function or a huge basic block does not necessarily trigger this condition. I think right now, with the connected-component rule in place, we can have no more than two segments of the same virtual register inside a basic block (a value live in, and a value created there that is possibly live out); any other situation must have been split up into multiple vregs previously. I had the same experience that the bad situations for register allocation compile time (or copy coalescing, for that matter) are automatically generated lexer or parser code with a big number of basic blocks. That usually leads to new value numbers being created at join points, increasing the number of live range segments...

Exactly as you are saying, copy coalescing is another problem. The patch here only solves the problem partially, and we are still facing the compile time problem of copy coalescing. A trivially rematerializable def instruction with thousands of uses is hoisted outside of a loop in MachineLICM. That instruction is then rematerialized thousands of times during copy coalescing, once for each use, and the live interval update for each rematerialization is very costly because of the large live interval.

I wonder how much extra benefit we get from MachineLICM by hoisting so many trivially rematerializable instructions outside of a big loop. I believe most loop-invariant loads, stores, and computations should already have been hoisted during the LICM phase, so the rematerializable instructions hoisted here shouldn't enable many other loop-invariant loads, stores, or computations. But the extra compile time cost may be significant.

Anyway so far this change looks fine to me, but I'm still waiting for our internal systems to come back with numbers on this change (will probably be 1 or 2 more days until everything has cycled through).