This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/Transforms/Utils/
-
Transforms/
-
Utils/
1/1
SimplifyCFG.cpp
-
test/CodeGen/
-
CodeGen/
-
AArch64/
-
arm64-jumptable.ll
-
ARM/
-
avoid-cpsr-rmw.ll

Differential D41361

[SimplifyCFG] Avoid quadratic on a predecessors number behavior in instruction sinking.
ClosedPublic

Authored by mzolotukhin on Dec 18 2017, 10:47 AM.

Download Raw Diff

Details

Reviewers

jmolloy
spatel
efriedma

Summary

If a block has N predecessors, then the current algorithm will try to
sink common code to this block N times (whenever we visit a
predecessor). Every attempt to sink the common code includes going
through all predecessors, so the complexity of the algorithm becomes
O(N^2).
With this patch we try to sink common code only when we visit the block
itself. With this, the complexity goes down to O(N).
As a side effect, the moment the code is sunk is slightly different than
before (the order of simplifications has been changed), that's why I had
to adjust two tests (note that neither of the tests is supposed to test
SimplifyCFG):

test/CodeGen/AArch64/arm64-jumptable.ll - changes in this test mimic

the changes that previous implementation of SimplifyCFG would do.

test/CodeGen/ARM/avoid-cpsr-rmw.ll - in this test I disabled common

code sinking by a command line flag.

Diff Detail

Build Status

Buildable 13347
Build 13347: arc lint + arc unit

Event Timeline

mzolotukhin created this revision.Dec 18 2017, 10:47 AM

Herald added subscribers: kristof.beyls, javed.absar, aemerson. · View Herald TranscriptDec 18 2017, 10:47 AM

This also fixes a big compile time regression we found that was caused by rL279460 (and consequent follow-ups). In the problematic test-case we have a huge function, where one of the blocks have more than 1000 predecessors.

davide added a reviewer: efriedma.Dec 18 2017, 1:56 PM

test/CodeGen/AArch64/arm64-jumptable.ll - changes in this test mimic the changes that previous implementation of SimplifyCFG would do.

With your patch, what transform do we perform here instead of sinking? As far as I can tell, SimplifyCFG doesn't make any other changes.

lib/Transforms/Utils/SimplifyCFG.cpp
1657–1661	*predecessors

In D41361#960373, @efriedma wrote:

test/CodeGen/AArch64/arm64-jumptable.ll - changes in this test mimic the changes that previous implementation of SimplifyCFG would do.

With your patch, what transform do we perform here instead of sinking? As far as I can tell, SimplifyCFG doesn't make any other changes.

I don't think SimplifyCFG makes any other changes, but the test isn't for SimplifyCFG - as far as I understand it, it's for instructions we generate in the backend. The test was just written to trigger some specific situation, and with this change it stopped triggering it.

Michael

I don't think SimplifyCFG makes any other changes, but the test isn't for SimplifyCFG - as far as I understand it, it's for instructions we generate in the backend. The test was just written to trigger some specific situation, and with this change it stopped triggering it.

Yes, I'm not really worried about that aspect; I'm more concerned that the transform somehow isn't triggering in cases where it should.

I just tried digging a bit; it looks like the generated code is worse, particularly for avoid-cpsr-rmw.ll. I'm not sure that's really the fault of this patch, exactly... it looks like the sinking does in fact happen, but the end result is different because of related transforms. Looks like it's exposing a bad heuristic elsewhere?

Yes, I'm not really worried about that aspect; I'm more concerned that the transform somehow isn't triggering in cases where it should.

I just tried digging a bit; it looks like the generated code is worse, particularly for avoid-cpsr-rmw.ll. I'm not sure that's really the fault of this patch, exactly... it looks like the sinking does in fact happen, but the end result is different because of related transforms. Looks like it's exposing a bad heuristic elsewhere?

Yes, it does behave differently after the patch. However, from code sinking point of view the new behavior seems better to me.

Here is how the behavior changed for arm64-jumptable.ll. Previously SimplifyCFG produced

define void @sum(i32 %a, i32* %to, i32 %c) {
entry:
  switch i32 %a, label %exit [
    i32 1, label %bb1
    i32 2, label %exit.sink.split
    i32 3, label %bb3
    i32 4, label %bb4
  ]
bb1:
  %b = add i32 %c, 1
  br label %exit.sink.split
bb3:
  br label %exit.sink.split
bb4:
  br label %exit.sink.split
exit.sink.split:
  %.sink = phi i32 [ 5, %bb4 ], [ %b, %bb1 ], [ 3, %bb3 ], [ %a, %entry ]
  store i32 %.sink, i32* %to
  br label %exit
exit:
  ret void
}

Now we're producing:

define void @sum(i32 %a, i32* %to, i32 %c) {
entry:
  switch i32 %a, label %exit [
    i32 1, label %bb1
    i32 2, label %exit.sink.split
    i32 3, label %exit.sink.split
    i32 4, label %bb4
  ]
bb1:                                              ; preds = %entry
  %b = add i32 %c, 1
  br label %exit.sink.split
bb4:                                              ; preds = %entry
  br label %exit.sink.split
exit.sink.split:                                  ; preds = %entry, %entry, %bb1, %bb4
  %.sink = phi i32 [ 5, %bb4 ], [ %b, %bb1 ], [ %a, %entry ], [ %a, %entry ]
  store i32 %.sink, i32* %to
  br label %exit
exit:                                             ; preds = %exit.sink.split, %entry
  ret void
}

So, we merged edges entry->bb3->exit.sink.split. Except for helping backend, I don't see why we would want to keep these edges separate. Helping backend might be a sufficient reason, but I think it's generally better not to rely on this kind of stuff in the backend.

In avoid-cpsr-rmw.ll we previously generated:

define void @t4(i32* nocapture %p, double* nocapture %q) #3 {
entry:
  %0 = load double, double* %q, align 4
  %cmp = fcmp olt double %0, 1.000000e+01
  %incdec.ptr1 = getelementptr inbounds i32, i32* %p, i32 1
  br i1 %cmp, label %if.then, label %if.end
if.then:                                          ; preds = %entry
  br label %if.end
if.end:                                           ; preds = %entry, %if.then
  %.sink3 = phi i32 [ 7, %if.then ], [ 3, %entry ]
  %.sink2 = phi i32 [ 2, %if.then ], [ 3, %entry ]
  %.sink1 = phi i32 [ 8, %if.then ], [ 5, %entry ]
  %.sink = phi i32 [ 9, %if.then ], [ 6, %entry ]
  store i32 %.sink3, i32* %p, align 4
  %incdec.ptr5 = getelementptr inbounds i32, i32* %p, i32 %.sink2
  store i32 %.sink1, i32* %incdec.ptr1, align 4
  store i32 %.sink, i32* %incdec.ptr5, align 4
  ret void
}

And now we generate:

define void @t4(i32* nocapture %p, double* nocapture %q) #3 {
entry:
  %0 = load double, double* %q, align 4
  %cmp = fcmp olt double %0, 1.000000e+01
  %incdec.ptr1 = getelementptr inbounds i32, i32* %p, i32 1
  %. = select i1 %cmp, i32 7, i32 3
  %.4 = select i1 %cmp, i32 2, i32 3
  %.5 = select i1 %cmp, i32 8, i32 5
  %.6 = select i1 %cmp, i32 9, i32 6
  store i32 %., i32* %p, align 4
  %incdec.ptr5 = getelementptr inbounds i32, i32* %p, i32 %.4
  store i32 %.5, i32* %incdec.ptr1, align 4
  store i32 %.6, i32* %incdec.ptr5, align 4
  ret void
}

Since we're not changing logic for generating selects and we're ending up with them now, I assume that it's also a desirable form of IR in this case.

If that's not what we want to do in SimplifyCFG, I can investigate it further and try to fix.

Thanks,
Michael

The change to arm64-jumptable.ll looks fine. I think something is going wrong in avoid-cpsr-rmw.ll.

Since we're not changing logic for generating selects and we're ending up with them now, I assume that it's also a desirable form of IR in this case.

We have thresholds in various places to avoid transforming control flow into straight-line computation if it gets too expensive; they're scattered all over the place, though, so we might not handle it consistently.

efriedma added a subscriber: hfinkel.Dec 20 2017, 3:05 PM

In D41361#961498, @efriedma wrote:

The change to arm64-jumptable.ll looks fine. I think something is going wrong in avoid-cpsr-rmw.ll.

Since we're not changing logic for generating selects and we're ending up with them now, I assume that it's also a desirable form of IR in this case.

We have thresholds in various places to avoid transforming control flow into straight-line computation if it gets too expensive; they're scattered all over the place, though, so we might not handle it consistently.

Exactly. Also, @Carrot has been working on this problem recently (see D39352). Our heuristics in this regard should soon get significantly smarter (although it's not clear that it will, or should, change the avoid-cpsr-rmw.ll case). @efriedma, based on the IR above, it does not seem that what's happening is wrong (the branchy code does not seem obviously superior to me because it's not clear we're missing opportunities for speculative execution). Can you elaborate?

The question is how you materialize all the immediates. On ARM, at least, the "select" version generates two instructions per PHI node, with a sequence like "mov imm; mowmi imm; mov imm; movwmi imm" etc. while the branchy version generates a branch, where each block has one instruction per PHI node. AArch64 is similar (except that we use "csel" there). x86 is similar; we generate a "mov+mov+cmov" sequence, except for some special cases which get lowered to "lea".

So clearly there's some threshold where the select-based lowering is worse, simply because the processor is executing more instructions. Of course, the exact threshold where the branch is faster probably depends on the target, the length of the critical path, and how predictable the branch is.

In D41361#961594, @efriedma wrote:

The question is how you materialize all the immediates. On ARM, at least, the "select" version generates two instructions per PHI node, with a sequence like "mov imm; mowmi imm; mov imm; movwmi imm" etc. while the branchy version generates a branch, where each block has one instruction per PHI node. AArch64 is similar (except that we use "csel" there). x86 is similar; we generate a "mov+mov+cmov" sequence, except for some special cases which get lowered to "lea".

So clearly there's some threshold where the select-based lowering is worse, simply because the processor is executing more instructions. Of course, the exact threshold where the branch is faster probably depends on the target, the length of the critical path, and how predictable the branch is.

Good point. I don't know if we have a heuristic that deals with that specifically, but it seems like we should. Should we move forward with this patch, file a bug, and then work on this in follow-up? Fixing the O(N^2) behavior here seems like something we should keep as separate as possible from changing the heuristics.

OK, fixing the heuristic doesn't need to block this fix. Sinking itself appears to be working as expected, and I don't think this particular case of a bunch of PHI nodes is likely to show up often in practice.

LGTM.

This revision is now accepted and ready to land.Dec 20 2017, 4:54 PM

Rebase.

Harbormaster completed remote builds in B13347: Diff 127804.Dec 20 2017, 5:03 PM

Thanks for review!
Phabricator didn't catch the review thread. The patch was committed in rL321236.

Michael

Revision Contents

Path

Size

lib/

Transforms/

Utils/

SimplifyCFG.cpp

24 lines

test/

CodeGen/

AArch64/

arm64-jumptable.ll

16 lines

ARM/

avoid-cpsr-rmw.ll

4 lines

Diff 127804

lib/Transforms/Utils/SimplifyCFG.cpp

Show First 20 Lines • Show All 1,648 Lines • ▼ Show 20 Lines	public:

ArrayRef<Instruction> operator () const {		ArrayRef<Instruction> operator () const {
return Insts;		return Insts;
}		}
};		};

} // end anonymous namespace		} // end anonymous namespace

/// Given an unconditional branch that goes to BBEnd,		/// Check whether BB's predecessors end with unconditional branches. If it is
/// check whether BBEnd has only two predecessors and the other predecessor		/// true, sink any common code from the predecessors to BB.
/// ends with an unconditional branch. If it is true, sink any common code		/// We also allow one predecessor to end with conditional branch (but no more
/// in the two predecessors to BBEnd.		/// than one).
static bool SinkThenElseCodeToEnd(BranchInst *BI1) {		static bool SinkCommonCodeFromPredecessors(BasicBlock *BB) {
		efriedmaUnsubmitted Done Reply Inline Actions predecessors efriedma:* *predecessors
assert(BI1->isUnconditional());
BasicBlock *BBEnd = BI1->getSuccessor(0);

// We support two situations:		// We support two situations:
// (1) all incoming arcs are unconditional		// (1) all incoming arcs are unconditional
// (2) one incoming arc is conditional		// (2) one incoming arc is conditional
//		//
// (2) is very common in switch defaults and		// (2) is very common in switch defaults and
// else-if patterns;		// else-if patterns;
//		//
// if (a) f(1);		// if (a) f(1);
Show All 27 Lines	static bool SinkCommonCodeFromPredecessors(BasicBlock *BB) {
// \| [x(2)] \|		// \| [x(2)] \|
// \ / \|		// \ / \|
// [sink.split] \|		// [sink.split] \|
// \ /		// \ /
// [ end ]		// [ end ]
//		//
SmallVector<BasicBlock*,4> UnconditionalPreds;		SmallVector<BasicBlock*,4> UnconditionalPreds;
Instruction *Cond = nullptr;		Instruction *Cond = nullptr;
for (auto *B : predecessors(BBEnd)) {		for (auto *B : predecessors(BB)) {
auto *T = B->getTerminator();		auto *T = B->getTerminator();
if (isa<BranchInst>(T) && cast<BranchInst>(T)->isUnconditional())		if (isa<BranchInst>(T) && cast<BranchInst>(T)->isUnconditional())
UnconditionalPreds.push_back(B);		UnconditionalPreds.push_back(B);
else if ((isa<BranchInst>(T) \|\| isa<SwitchInst>(T)) && !Cond)		else if ((isa<BranchInst>(T) \|\| isa<SwitchInst>(T)) && !Cond)
Cond = T;		Cond = T;
else		else
return false;		return false;
}		}
▲ Show 20 Lines • Show All 51 Lines • ▼ Show 20 Lines	while (ProfitableToSinkInstruction(LRI) && Idx < ScanIdx) {
++Idx;		++Idx;
}		}
if (!Profitable)		if (!Profitable)
return false;		return false;

DEBUG(dbgs() << "SINK: Splitting edge\n");		DEBUG(dbgs() << "SINK: Splitting edge\n");
// We have a conditional edge and we're going to sink some instructions.		// We have a conditional edge and we're going to sink some instructions.
// Insert a new block postdominating all blocks we're going to sink from.		// Insert a new block postdominating all blocks we're going to sink from.
if (!SplitBlockPredecessors(BI1->getSuccessor(0), UnconditionalPreds,		if (!SplitBlockPredecessors(BB, UnconditionalPreds, ".sink.split"))
".sink.split"))
// Edges couldn't be split.		// Edges couldn't be split.
return false;		return false;
Changed = true;		Changed = true;
}		}

// Now that we've analyzed all potential sinking candidates, perform the		// Now that we've analyzed all potential sinking candidates, perform the
// actual sink. We iteratively sink the last non-terminator of the source		// actual sink. We iteratively sink the last non-terminator of the source
// blocks into their common successor unless doing so would require too		// blocks into their common successor unless doing so would require too
▲ Show 20 Lines • Show All 3,937 Lines • ▼ Show 20 Lines	static bool TryToMergeLandingPad(LandingPadInst LPad, BranchInst BI,
return false;		return false;
}		}

bool SimplifyCFGOpt::SimplifyUncondBranch(BranchInst *BI,		bool SimplifyCFGOpt::SimplifyUncondBranch(BranchInst *BI,
IRBuilder<> &Builder) {		IRBuilder<> &Builder) {
BasicBlock *BB = BI->getParent();		BasicBlock *BB = BI->getParent();
BasicBlock *Succ = BI->getSuccessor(0);		BasicBlock *Succ = BI->getSuccessor(0);

if (SinkCommon && Options.SinkCommonInsts && SinkThenElseCodeToEnd(BI))
return true;

// If the Terminator is the only non-phi instruction, simplify the block.		// If the Terminator is the only non-phi instruction, simplify the block.
// If LoopHeader is provided, check if the block or its successor is a loop		// If LoopHeader is provided, check if the block or its successor is a loop
// header. (This is for early invocations before loop simplify and		// header. (This is for early invocations before loop simplify and
// vectorization to keep canonical loop forms for nested loops. These blocks		// vectorization to keep canonical loop forms for nested loops. These blocks
// can be eliminated when the pass is invoked later in the back-end.)		// can be eliminated when the pass is invoked later in the back-end.)
bool NeedCanonicalLoop =		bool NeedCanonicalLoop =
Options.NeedCanonicalLoop &&		Options.NeedCanonicalLoop &&
(LoopHeaders && (LoopHeaders->count(BB) \|\| LoopHeaders->count(Succ)));		(LoopHeaders && (LoopHeaders->count(BB) \|\| LoopHeaders->count(Succ)));
▲ Show 20 Lines • Show All 261 Lines • ▼ Show 20 Lines	bool SimplifyCFGOpt::run(BasicBlock *BB) {
Changed \|= removeUndefIntroducingPredecessor(BB);		Changed \|= removeUndefIntroducingPredecessor(BB);

// Merge basic blocks into their predecessor if there is only one distinct		// Merge basic blocks into their predecessor if there is only one distinct
// pred, and if there is only one distinct successor of the predecessor, and		// pred, and if there is only one distinct successor of the predecessor, and
// if there are no PHI nodes.		// if there are no PHI nodes.
if (MergeBlockIntoPredecessor(BB))		if (MergeBlockIntoPredecessor(BB))
return true;		return true;

		if (SinkCommon && Options.SinkCommonInsts)
		Changed \|= SinkCommonCodeFromPredecessors(BB);

IRBuilder<> Builder(BB);		IRBuilder<> Builder(BB);

// If there is a trivial two-entry PHI node in this basic block, and we can		// If there is a trivial two-entry PHI node in this basic block, and we can
// eliminate it, do so now.		// eliminate it, do so now.
if (auto *PN = dyn_cast<PHINode>(BB->begin()))		if (auto *PN = dyn_cast<PHINode>(BB->begin()))
if (PN->getNumIncomingValues() == 2)		if (PN->getNumIncomingValues() == 2)
Changed \|= FoldTwoEntryPHINode(PN, TTI, DL);		Changed \|= FoldTwoEntryPHINode(PN, TTI, DL);

Show All 39 Lines

test/CodeGen/AArch64/arm64-jumptable.ll

	; RUN: llc -mtriple=arm64-apple-ios < %s \| FileCheck %s			; RUN: llc -mtriple=arm64-apple-ios < %s \| FileCheck %s
	; RUN: llc -mtriple=arm64-linux-gnu < %s \| FileCheck %s --check-prefix=CHECK-LINUX			; RUN: llc -mtriple=arm64-linux-gnu < %s \| FileCheck %s --check-prefix=CHECK-LINUX
	; <rdar://11417675>			; <rdar://11417675>

	define void @sum(i32 %a, i32* %to, i32 %c) {			define void @sum(i32 %a, i32* %to, i32 %c) {
	entry:			entry:
	switch i32 %a, label %exit [			switch i32 %a, label %exit [
	i32 1, label %bb1			i32 1, label %bb1
	i32 2, label %bb2			i32 2, label %exit.sink.split
	i32 3, label %bb3			i32 3, label %bb3
	i32 4, label %bb4			i32 4, label %bb4
	]			]
	bb1:			bb1:
	%b = add i32 %c, 1			%b = add i32 %c, 1
	store i32 %b, i32* %to			br label %exit.sink.split
	br label %exit
	bb2:
	store i32 2, i32* %to
	br label %exit
	bb3:			bb3:
	store i32 3, i32* %to			br label %exit.sink.split
	br label %exit
	bb4:			bb4:
	store i32 5, i32* %to			br label %exit.sink.split
				exit.sink.split:
				%.sink = phi i32 [ 5, %bb4 ], [ %b, %bb1 ], [ 3, %bb3 ], [ %a, %entry ]
				store i32 %.sink, i32* %to
	br label %exit			br label %exit
	exit:			exit:
	ret void			ret void
	}			}

	; CHECK-LABEL: sum:			; CHECK-LABEL: sum:
	; CHECK: adrp {{x[0-9]+}}, LJTI0_0@PAGE			; CHECK: adrp {{x[0-9]+}}, LJTI0_0@PAGE
	; CHECK: add {{x[0-9]+}}, {{x[0-9]+}}, LJTI0_0@PAGEOFF			; CHECK: add {{x[0-9]+}}, {{x[0-9]+}}, LJTI0_0@PAGEOFF

	; CHECK-LINUX-LABEL: sum:			; CHECK-LINUX-LABEL: sum:
	; CHECK-LINUX: adrp {{x[0-9]+}}, .LJTI0_0			; CHECK-LINUX: adrp {{x[0-9]+}}, .LJTI0_0
	; CHECK-LINUX: add {{x[0-9]+}}, {{x[0-9]+}}, :lo12:.LJTI0_0			; CHECK-LINUX: add {{x[0-9]+}}, {{x[0-9]+}}, :lo12:.LJTI0_0

test/CodeGen/ARM/avoid-cpsr-rmw.ll

	; RUN: llc < %s -mtriple=thumbv7-apple-darwin -mcpu=cortex-a9 \| FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-CORTEX			; RUN: llc < %s -mtriple=thumbv7-apple-darwin -mcpu=cortex-a9 -simplifycfg-sink-common=false \| FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-CORTEX
	; RUN: llc < %s -mtriple=thumbv7-apple-darwin -mcpu=swift \| FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-SWIFT			; RUN: llc < %s -mtriple=thumbv7-apple-darwin -mcpu=swift -simplifycfg-sink-common=false \| FileCheck %s --check-prefix=CHECK --check-prefix=CHECK-SWIFT
	; Avoid some 's' 16-bit instruction which partially update CPSR (and add false			; Avoid some 's' 16-bit instruction which partially update CPSR (and add false
	; dependency) when it isn't dependent on last CPSR defining instruction.			; dependency) when it isn't dependent on last CPSR defining instruction.
	; rdar://8928208			; rdar://8928208

	define i32 @t1(i32 %a, i32 %b, i32 %c, i32 %d) nounwind readnone {			define i32 @t1(i32 %a, i32 %b, i32 %c, i32 %d) nounwind readnone {
	entry:			entry:
	; CHECK-LABEL: t1:			; CHECK-LABEL: t1:
	; CHECK-CORTEX: muls [[REG:(r[0-9]+)]], r3, r2			; CHECK-CORTEX: muls [[REG:(r[0-9]+)]], r3, r2
	▲ Show 20 Lines • Show All 106 Lines • Show Last 20 Lines