This is an archive of the discontinued LLVM Phabricator instance.

Differential D44814

[CodeGenPrepare] Split huge basic blocks for faster compilation.
Needs ReviewPublic

Authored by mzolotukhin on Mar 22 2018, 6:50 PM.

Details

Reviewers

ab
davide
• dberlin

Summary

Many passes struggle with huge basic blocks. While we definitely should address
the issues in the passes in the first place, often we discover such places too
late when the compiler 'hangs'. This patch adds a 'fuse' to prevent this from
happening by splitting huge basic blocks. It shouldn't happen in usual
scenarios, but should help us in corner cases that occur here and there. This
does not apply for -O3, as at -O3 we're usually willing to sacrifice compile
time for the sake of any extra performance.

Diff Detail

Repository

rL LLVM

Build Status

Buildable 16410
Build 16410: arc lint + arc unit

Event Timeline

mzolotukhin created this revision.Mar 22 2018, 6:50 PM

Herald added a subscriber: hiraditya. · View Herald TranscriptMar 22 2018, 6:50 PM

mzolotukhin added a reviewer: davide.Mar 22 2018, 6:51 PM

Harbormaster completed remote builds in B16371: Diff 139549.Mar 22 2018, 6:51 PM

BB->size() has multiple problems: one, it's linear time, and two, it doesn't ignore debug info.

CodeGenPrepare doesn't run at -O0.

BB->size() has multiple problems: one, it's linear time, and two, it doesn't ignore debug info.

What would be a better way to get BB size?

CodeGenPrepare doesn't run at -O0.

Correct, in the current shape the patch aims only at -Os. Do you think it would be better to have a separate pass for it and schedule it at O0 too?

Thanks,
Michael

junbuml added a subscriber: junbuml.Mar 23 2018, 10:57 AM

What would be a better way to get BB size?

Given your algorithm, I don't understand why you need to get the BB size in the first place; you can just iterate over every block, and split when you hit the threshold.
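The suggestion above can be sketched outside of LLVM. This is a minimal, hypothetical model, not LLVM API: a block is just a vector of instruction names, and `splitAtThreshold` is an invented helper. It makes one forward pass and cuts whenever the running count reaches the threshold, so the total block size is never queried. PHI nodes, EH pads and terminators are deliberately ignored here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a basic block: just a list of instruction names.
using Block = std::vector<std::string>;

// Walk the instructions once and open a new block every time the running
// count reaches MaxSize. The total size is never computed up front.
// MaxSize == 0 is treated as "never split" (an assumption of this sketch).
std::vector<Block> splitAtThreshold(const Block &BB, std::size_t MaxSize) {
  std::vector<Block> Result(1);
  for (const std::string &I : BB) {
    if (MaxSize != 0 && Result.back().size() >= MaxSize)
      Result.emplace_back(); // threshold hit: start the next block
    Result.back().push_back(I);
  }
  return Result;
}
```

Because the cut happens as soon as the counter reaches the limit, the work per block is proportional to the number of instructions visited, with no separate size computation.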

Correct, in the current shape the patch aims only at -Os. Do you think it would be better to have a separate pass for it and schedule it at O0 too?

The key question is whether we're using SelectionDAG ISel, since the blocks will be un-split after ISel anyway. We don't use SelectionDAG ISel for most blocks at -O0, but we do for a few. Maybe not worth worrying about.

llvm/lib/CodeGen/CodeGenPrepare.cpp
491	You probably need to skip over all PHI nodes/landingpads/etc. so you don't try to split a block in an impossible place. And I think some Windows exception-handling blocks can't be split? (It would be unusual to have a block with over 1000 PHI nodes, but not impossible.) And like I mentioned before, you need to skip debug info intrinsics. This should have a testcase to exercise this logic (you can mess with the threshold to keep the testcase small).

Rewrite the algorithm to avoid using BB->size().
Skip PHI-nodes and EH-pads.
Don't split on terminators.
Add tests.

Harbormaster completed remote builds in B16410: Diff 139684.Mar 23 2018, 5:37 PM

Hi Eli,

Thanks for the input, please take a look at the updated patch. For now I haven't added logic to skip debug intrinsics - is it needed from a correctness point of view? I've seen test cases with thousands of llvm.dbg.value intrinsics that would also benefit from splitting (provided it's legal). Also, I wonder if we should actually do this only at -Os - maybe it would make sense to do it at -O3 too, with a significantly higher threshold. What do you think?

Thanks,
Michael

Ignore DbgInfo intrinsics.

mzolotukhin added a subscriber: aprantl.Mar 23 2018, 6:09 PM

Harbormaster completed remote builds in B16411: Diff 139689.Mar 23 2018, 6:10 PM

Adrian convinced me that we need to properly ignore debug-info intrinsics to guarantee the same code generation with and without debug info. I've updated the patch.
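To illustrate why ignoring debug-info intrinsics matters, here is a small hypothetical model (the `Inst` struct and `splitPoints` helper are invented for this sketch and are not LLVM API). Debug instructions never advance the counter, so the same non-debug instructions start new blocks whether or not debug info is present.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical instruction model: a name plus a flag for debug-info intrinsics.
struct Inst {
  std::string Name;
  bool IsDebug;
};

// Return the names of the non-debug instructions that begin a new block.
// Debug instructions are skipped entirely, so they never advance the count.
std::vector<std::string> splitPoints(const std::vector<Inst> &BB,
                                     std::size_t MaxSize) {
  std::vector<std::string> Points;
  std::size_t Count = 0;
  for (const Inst &I : BB) {
    if (I.IsDebug)
      continue; // invisible to the threshold
    if (MaxSize != 0 && Count >= MaxSize) {
      Points.push_back(I.Name); // this instruction starts the next block
      Count = 0;
    }
    ++Count;
  }
  return Points;
}
```

If the debug instructions were counted instead, a build with -g could split at different places than a build without it, and the generated code could differ.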

Thanks,
Michael

Ping!

davide added a reviewer: • dberlin.Apr 2 2018, 12:02 PM

I don't know how I feel about this patch. It looks like it's papering over a huge problem, which is basically the fact that some passes down the road are either quadratic or linear with large constant factor.
If we have examples of the broken passes, maybe we should consider whether it's feasible to fix them instead of applying this hack here?

llvm/lib/CodeGen/CodeGenPrepare.cpp
218–221	How was this picked?

That said, the worklist algorithm you picked seems correct.
Another question that I have for you is: how is this going to work when GlobalIsel will become the standard?
My understanding is that cross-BB codegen will remove the need for CodegenPrepare, so if we start relying on BBs to be split we need to preserve this functionality somehow.

I don't know how I feel about this patch. It looks like it's papering over a huge problem, which is basically the fact that some passes down the road are either quadratic or linear with large constant factor.
If we have examples of the broken passes, maybe we should consider whether it's feasible to fix them instead of applying this hack here?

We do have examples of such passes, and we should fix them, I agree. However, I don't consider this patch as a fix for them. The purpose of this patch is to prevent compiler hangs in future - even when we fix the known issues, there is no guarantee that there are no more places like them. When we specifically want to look for such problematic spots, we can always set the option to -1, but by default it will just save us from "compiler hangs" (at least, from those coming from backend).

How was this picked?

The threshold was picked more or less arbitrarily, with the intention of making it high enough that it doesn't affect the usual cases while still letting the compiler finish in a reasonable time in the bad cases. Also, when a "bad" situation occurs we should still be able to identify the problematic pass by the unusually high relative time it consumes.

Another question that I have for you is: how is this going to work when GlobalIsel will become the standard?
My understanding is that cross-BB codegen will remove the need for CodegenPrepare, so if we start relying on BBs to be split we need to preserve this functionality somehow.

It probably wouldn't work for GlobalISel (at least, to my understanding), but since GlobalISel operates on a wider scope, I expect that non-linearities would be discovered and fixed much quicker.

Thanks,
Michael

this does not feel like the right solution, but even if it is, this will hurt us; we *commonly* have shaders with single basic blocks exceeding 1000 instructions, and splitting them could sabotage scheduling quite badly. in the worst case, it could guarantee spilling, by splitting the block at a point that creates too many live values between the top and bottom.
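The spilling concern can be made concrete with a simplified, hypothetical model (the `Uses` representation and `liveAcrossSplit` are invented for this sketch; real register pressure also depends on scheduling and the target's register file). It counts the values defined before a split point that are still read after it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical straight-line block: instruction i defines value i and reads
// the values listed in Uses[i] (indices of earlier instructions).
using Uses = std::vector<std::vector<std::size_t>>;

// Count the values defined before SplitAt that are still read at or after it.
// Each such value is live across the new block boundary.
std::size_t liveAcrossSplit(const Uses &BB, std::size_t SplitAt) {
  std::vector<bool> Live(BB.size(), false);
  for (std::size_t I = SplitAt; I < BB.size(); ++I)
    for (std::size_t Def : BB[I])
      if (Def < SplitAt)
        Live[Def] = true;
  std::size_t Count = 0;
  for (bool L : Live)
    Count += L ? 1 : 0;
  return Count;
}
```

For a block shaped like five address computations followed by five stores that each use one of them, splitting in the middle leaves all five addresses live across the boundary, while a later split point leaves fewer.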

Revision Contents

Path	Size
llvm/lib/CodeGen/CodeGenPrepare.cpp	39 lines
llvm/test/CodeGen/X86/codegen-prepare-split.ll	116 lines

Diff 139684

llvm/lib/CodeGen/CodeGenPrepare.cpp

[...]
 static cl::opt<bool> AddrSinkCombineBaseOffs(
     "addr-sink-combine-base-offs", cl::Hidden, cl::init(true),
     cl::desc("Allow combining of BaseOffs field in Address sinking."));

 static cl::opt<bool> AddrSinkCombineScaledReg(
     "addr-sink-combine-scaled-reg", cl::Hidden, cl::init(true),
     cl::desc("Allow combining of ScaledReg field in Address sinking."));

+static cl::opt<unsigned> BasicBlockMaxSize(
+    "bb-max-size", cl::Hidden, cl::init(1000),
+    cl::desc("Split basic blocks bigger than this size when compiling for Os"));
+
 namespace {

 using SetOfInstrs = SmallPtrSet<Instruction *, 16>;
 using TypeIsSExt = PointerIntPair<Type *, 1, bool>;
 using InstrToOrigTy = DenseMap<Instruction *, TypeIsSExt>;
 using SExts = SmallVector<Instruction *, 16>;
 using ValueToSExts = DenseMap<Value *, SExts>;

[...]
   if (!DisableBranchOpts) {
     // Merge pairs of basic blocks with unconditional branches, connected by
     // a single edge.
     if (EverMadeChange || MadeChange)
       MadeChange |= eliminateFallThrough(F);

     EverMadeChange |= MadeChange;
   }

+  // Split big basic blocks. We're doing it to save compile time, which is not
+  // a concern on O3.
+  if (OptSize && BasicBlockMaxSize != 0) {
+    SmallVector<BasicBlock *, 4> Worklist;
+    for (BasicBlock &BB : F)
+      Worklist.push_back(&BB);
+
+    MadeChange = false;
+    while (!Worklist.empty()) {
+      BasicBlock *BB = Worklist.pop_back_val();
+      unsigned n = 0;
+      for (auto It = BB->begin(); It != BB->end(); It++, n++) {
+        Instruction &I = *It;
+        // Skip instructions that we can not split the block on.
+        if (isa<PHINode>(&I) || I.isEHPad())
+          continue;
+
+        // If we've reached terminator, break so that we don't split the block
+        // on it.
+        if (I.isTerminator())
+          break;
+
+        // If the block is too big, split it and put the remainder to the
+        // worklist.
+        if (n >= BasicBlockMaxSize) {
+          BB = SplitBlock(BB, &I);
+          Worklist.push_back(BB);
+          MadeChange = true;
+          break;
+        }
+      }
+    }
+    EverMadeChange |= MadeChange;
+  }
+
   if (!DisableGCOpts) {
     SmallVector<Instruction *, 2> Statepoints;
     for (BasicBlock &BB : F)
       for (Instruction &I : BB)
         if (isStatepoint(I))
           Statepoints.push_back(&I);
     for (auto &I : Statepoints)
       EverMadeChange |= simplifyOffsetableRelocate(*I);
[...]
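
The worklist loop in this hunk can be modelled without LLVM to see which blocks it would split. The sketch below is a hypothetical stand-in (the `Kind` enum, `Block` alias and `splitFunction` are invented here): a block is a vector of instruction kinds, and a split moves the tail into a new block and ends the head with a branch, which is what SplitBlock does in LLVM. As in the hunk, the counter advances for every instruction, including PHI nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical instruction kinds; only what the loop distinguishes.
enum class Kind { PHI, EHPad, Normal, Terminator };
using Block = std::vector<Kind>;

// Model of the worklist loop: PHIs and EH pads are never chosen as split
// points, the scan stops at the terminator, and the tail of a split block
// goes back on the worklist so it can be split again.
std::vector<Block> splitFunction(std::vector<Block> F, std::size_t MaxSize) {
  if (MaxSize == 0)
    return F; // splitting disabled
  std::vector<std::size_t> Worklist;
  for (std::size_t B = 0; B < F.size(); ++B)
    Worklist.push_back(B);
  while (!Worklist.empty()) {
    std::size_t B = Worklist.back();
    Worklist.pop_back();
    for (std::size_t N = 0; N < F[B].size(); ++N) {
      Kind K = F[B][N];
      if (K == Kind::PHI || K == Kind::EHPad)
        continue; // cannot split here
      if (K == Kind::Terminator)
        break; // never split on the terminator
      if (N >= MaxSize) {
        Block Tail(F[B].begin() + N, F[B].end());
        F[B].resize(N);
        F[B].push_back(Kind::Terminator); // branch to the new block
        F.push_back(Tail);
        Worklist.push_back(F.size() - 1);
        break;
      }
    }
  }
  return F;
}
```

With a threshold of 5, a block of ten ordinary instructions and a return is split once, a block of five PHI nodes and a return is left alone, and a block of five PHI nodes followed by two ordinary instructions and a return is split right after the PHI nodes.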

llvm/test/CodeGen/X86/codegen-prepare-split.ll

This file was added.

; RUN: opt < %s -S -codegenprepare -bb-max-size=5 | FileCheck %s
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"

; CHECK-LABEL: @foo1
define void @foo1(i64 *%base) optsize {
bb:
  %a0 = getelementptr inbounds i64, i64* %base, i64 8
  %a1 = getelementptr inbounds i64, i64* %base, i64 16
  %a2 = getelementptr inbounds i64, i64* %base, i64 24
  %a3 = getelementptr inbounds i64, i64* %base, i64 32
  %a4 = getelementptr inbounds i64, i64* %base, i64 40
; CHECK: %a4 = getelementptr
; CHECK-NEXT: br label
  store i64 0, i64* %a0, align 4
  store i64 1, i64* %a1, align 4
  store i64 2, i64* %a2, align 4
  store i64 3, i64* %a3, align 4
  store i64 4, i64* %a4, align 4
  ret void
}

; CHECK-LABEL: @foo2
define i64 @foo2(i64 *%base, i1 %p) optsize {
bb:
  br i1 %p, label %bb_true, label %bb_false

bb_true:
  br label %bb1

bb_false:
  br label %bb1

bb1:
  %b0 = phi i64 [0, %bb_true], [1, %bb_false]
  %b1 = phi i64 [2, %bb_true], [1, %bb_false]
  %b2 = phi i64 [4, %bb_true], [2, %bb_false]
  %b3 = phi i64 [9, %bb_true], [3, %bb_false]
  %b4 = phi i64 [8, %bb_true], [5, %bb_false]
; CHECK: %b4 = phi i64
; CHECK-NOT: br label
; CHECK-NEXT: ret i64
  ret i64 %b0
}

; CHECK-LABEL: @foo3
define i64 @foo3(i64 *%base, i1 %p) optsize {
bb:
  br i1 %p, label %bb_true, label %bb_false

bb_true:
  br label %bb1

bb_false:
  br label %bb1

bb1:
  %b0 = phi i64 [0, %bb_true], [1, %bb_false]
  %b1 = phi i64 [2, %bb_true], [1, %bb_false]
  %b2 = phi i64 [4, %bb_true], [2, %bb_false]
  %b3 = phi i64 [9, %bb_true], [3, %bb_false]
  %b4 = phi i64 [8, %bb_true], [5, %bb_false]
; CHECK: %b4 = phi i64
; CHECK-NEXT: br label
  %a = getelementptr inbounds i64, i64* %base, i64 100
  store i64 %b0, i64* %a, align 4
  ret i64 %b0
}

; CHECK-LABEL: @foo4
define void @foo4() optsize {
bb:
  %a0 = alloca i32, align 4
  %a1 = alloca i32, align 4
  %a2 = alloca i32, align 4
  %a3 = alloca i32, align 4
  %a4 = alloca i32, align 4
; CHECK: %a4 = alloca
; CHECK-NEXT: br label
  %a5 = alloca i32, align 4
  %a6 = alloca i32, align 4
  %a7 = alloca i32, align 4
  %a8 = alloca i32, align 4
  %a9 = alloca i32, align 4
; CHECK: %a9 = alloca
; CHECK-NEXT: br label
  call void @llvm.dbg.declare(metadata i32* %a0, metadata !6, metadata !DIExpression()), !dbg !2
  call void @llvm.dbg.declare(metadata i32* %a1, metadata !6, metadata !DIExpression()), !dbg !2
  call void @llvm.dbg.declare(metadata i32* %a2, metadata !6, metadata !DIExpression()), !dbg !2
  call void @llvm.dbg.declare(metadata i32* %a3, metadata !6, metadata !DIExpression()), !dbg !2
  call void @llvm.dbg.declare(metadata i32* %a4, metadata !6, metadata !DIExpression()), !dbg !2
; CHECK: call void @llvm.dbg.declare(metadata i32* %a4
; CHECK-NEXT: br label
  call void @llvm.dbg.declare(metadata i32* %a5, metadata !6, metadata !DIExpression()), !dbg !2
  call void @llvm.dbg.declare(metadata i32* %a6, metadata !6, metadata !DIExpression()), !dbg !2
  call void @llvm.dbg.declare(metadata i32* %a7, metadata !6, metadata !DIExpression()), !dbg !2
  call void @llvm.dbg.declare(metadata i32* %a8, metadata !6, metadata !DIExpression()), !dbg !2
  call void @llvm.dbg.declare(metadata i32* %a9, metadata !6, metadata !DIExpression()), !dbg !2
; CHECK: call void @llvm.dbg.declare(metadata i32* %a9
; CHECK-NEXT: ret void
  ret void
}

declare void @llvm.dbg.declare(metadata, metadata, metadata) #0

attributes #0 = { nounwind readnone speculatable }

!llvm.module.flags = !{!0, !1}

!0 = !{i32 2, !"Dwarf Version", i32 4}
!1 = !{i32 2, !"Debug Info Version", i32 3}
!2 = !DILocation(line: 1, column: 1, scope: !3)
!3 = distinct !DISubprogram(scope: null, isLocal: false, isDefinition: true, isOptimized: false, unit: !4)
!4 = distinct !DICompileUnit(language: DW_LANG_C99, file: !5, isOptimized: false)
!5 = !DIFile(filename: "foo.c", directory: "/path/to/file")
!6 = !DILocalVariable(name: "a", arg: 1, scope: !3, file: !5, line: 1, type: !7)
!7 = !DIBasicType(name: "int", size: 32, encoding: DW_ATE_signed)
