This is an archive of the discontinued LLVM Phabricator instance.

Disable jump threading into loop headers
ClosedPublic

Authored by kparzysz on Aug 7 2017, 8:52 AM.

Details

Reviewers

davide
craig.topper
hfinkel

Commits

rG1dc313727ec5: Disable jump threading into loop headers
rL312664: Disable jump threading into loop headers

Summary

Consider this type of loop:

for (...) {
  ...
  if (...) continue;
  ...
}

Normally, the "continue" would branch to the loop control code that checks whether the loop should continue iterating and which contains the (often) unique loop latch branch. In certain cases jump threading can "thread" the inner branch directly to the loop header, creating a second loop latch. Loop canonicalization would then transform this loop into a loop nest. The problem with this is that in such a loop nest neither loop is countable even if the original loop was. This may inhibit subsequent loop optimizations and be detrimental to performance.
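To make the shape change concrete, here is a minimal source-level sketch. Both function names and the data are hypothetical and do not come from the patch, and the second function is a hand-written analogue of the resulting loop nest, not actual compiler output. In the first form the trip count is known before the loop starts; in the second, the trip counts of both the outer and the inner loop depend on the data.

```cpp
#include <cstddef>
#include <vector>

// Original shape: a single latch (++i, then i < size). The trip count is
// data.size(), so the loop is countable.
int sumPositives(const std::vector<int> &data) {
  int sum = 0;
  for (std::size_t i = 0; i < data.size(); ++i) {
    if (data[i] <= 0)
      continue; // branches to the loop control code (the latch)
    sum += data[i];
  }
  return sum;
}

// Hand-written analogue of the loop nest that appears once there are two
// back edges to the header. Neither loop has a trip count that can be
// computed up front: both depend on the contents of data.
int sumPositivesNested(const std::vector<int> &data) {
  int sum = 0;
  std::size_t i = 0;
  while (i < data.size()) {
    // Inner loop: the "continue" path, skipping non-positive elements.
    while (i < data.size() && data[i] <= 0)
      ++i;
    if (i == data.size())
      break;
    // Outer loop body: the fall-through path.
    sum += data[i];
    ++i;
  }
  return sum;
}
```

Both functions compute the same result; only the loop structure differs, which is what matters to later loop optimizations that need a countable loop.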

Diff Detail

Repository: rL LLVM

Event Timeline

kparzysz created this revision. Aug 7 2017, 8:52 AM

Herald added a subscriber: mehdi_amini. Aug 7 2017, 8:52 AM

@bmakam: Isn't this very similar to some of your recent work?

In D36404#834053, @mcrosier wrote:

@bmakam: Isn't this very similar to some of your recent work?

This is similar to r308422, but that change was limited to threading unconditional jumps in almost-empty blocks. It seems there are more places in jumpthreading that can turn a loop into a non-canonical form. We also saw in D29572 that jumpthreading can turn a loop into an irreducible loop, but that was performance neutral.

Do you have numbers/examples of performance delta(s) with this patch applied?

On our internal benchmarks it's mostly neutral, but the benchmark that motivated it improved by 5.3%. We have some extra code, though, that is supposed to deal with similar cases, so some of the impact may be reduced. By default it keeps the behavior unchanged, so for all other architectures it should have no effect.

DavidKreitzer added a subscriber: DavidKreitzer. Aug 10 2017, 2:39 PM

Ping.

Any thoughts on this?

In D36404#852556, @kparzysz wrote:

Any thoughts on this?

I think that we should just disable this for all targets. JumpThreading is run early in the pipeline where we are supposed to be simplifying the IR, but importantly, canonicalizing it. Our canonical form must be chosen to enable subsequent optimizations, and if it's not, then we should fix that. The whole reason that LoopHeaders exists in the first place is to avoid this kind of thing. The comment above FindLoopHeaders reads:

/// FindLoopHeaders - We do not want jump threading to turn proper loop
/// structures into irreducible loops.  Doing this breaks up the loop nesting
/// hierarchy and pessimizes later transformations.  To prevent this from
/// happening, we first have to find the loop headers. ...

If this optimization is a good thing to do, then we should have a special run of JumpThreading late in the pipeline where we allow it to put loops into efficient-but-hard-to-analyze forms. That can certainly be follow-up work if it is found to be helpful.
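As a rough illustration of the idea described in that comment, the sketch below marks the targets of depth-first-search back edges as loop headers on a toy CFG. The `Cfg` alias, the `Visit` enum, and the `walk` and `findLoopHeaders` functions are hypothetical stand-ins written for this example; they are not LLVM's types or its implementation.

```cpp
#include <set>
#include <vector>

// Toy CFG (hypothetical, not an LLVM type): blocks are indices and
// G[B] lists the successors of block B.
using Cfg = std::vector<std::vector<int>>;

enum class Visit { Unseen, OnStack, Done };

// Depth-first walk. An edge into a block that is still on the DFS stack is
// a back edge, and its target is recorded as a loop header.
static void walk(const Cfg &G, int B, std::vector<Visit> &State,
                 std::set<int> &Headers) {
  State[B] = Visit::OnStack;
  for (int S : G[B]) {
    if (State[S] == Visit::OnStack)
      Headers.insert(S);
    else if (State[S] == Visit::Unseen)
      walk(G, S, State, Headers);
  }
  State[B] = Visit::Done;
}

// Returns the set of blocks that are targets of at least one back edge
// reachable from Entry.
std::set<int> findLoopHeaders(const Cfg &G, int Entry) {
  std::vector<Visit> State(G.size(), Visit::Unseen);
  std::set<int> Headers;
  if (!G.empty())
    walk(G, Entry, State, Headers);
  return Headers;
}
```

For a CFG with edges 0->1, 1->2, 1->3 and 2->1, the only back edge is 2->1, so block 1 is the only header reported.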

Changed the code to simply disable jump threading into loop headers.

In D36404#862414, @kparzysz wrote:

Changed the code to simply disable jump threading into loop headers.

LGTM.

This revision is now accepted and ready to land. Sep 6 2017, 11:44 AM

Closed by commit rL312664: Disable jump threading into loop headers (authored by kparzysz). Sep 6 2017, 12:38 PM

This revision was automatically updated to reflect the committed changes.

In D36404#835401, @kparzysz wrote:

On our internal benchmarks it's mostly neutral, but the benchmark that motivated it improved by 5.3%. We have some extra code, though, that is supposed to deal with similar cases, so some of the impact may be reduced. By default it keeps the behavior unchanged, so for all other architectures it should have no effect.

@kparzysz, This change regressed spec2017/perlbench by 3% on Falkor. Perhaps we need a late jump-threading pass like Hal suggested?

Have you tried it? Changing jump threading to have early/late versions is trivial, the question is when should the late one run, and if that would help in the first place.

In D36404#869657, @kparzysz wrote:

Have you tried it? Changing jump threading to have early/late versions is trivial, the question is when should the late one run, and if that would help in the first place.

I reverted this change and it recovered the regression. I was going to try to re-run the whole jumpthreading pass after latesimplifycfg to see if it would help, but I don't know if it is a good place because LSR that runs after, will not have the loop in canonical form and loopsimplify will again turn it into irreducible loop. Another place I was thinking was to try before CGP.

I'll come up with a late jump threading pass that you can try plugging in, in addition to having the existing one.

Late jump threading is in D37816.

In D36404#869695, @bmakam wrote:

I reverted this change and it recovered the regression. I was going to try to re-run the whole jumpthreading pass after latesimplifycfg to see if it would help, but I don't know if it is a good place because LSR that runs after, will not have the loop in canonical form and loopsimplify will again turn it into irreducible loop. Another place I was thinking was to try before CGP.

If it'll work right before CGP, that's probably best. That way targets can run IR-level passes that want to look at loops before that.

In D36404#870092, @hfinkel wrote:

If it'll work right before CGP, that's probably best. That way targets can run IR-level passes that want to look at loops before that.

I tried right after latesimplifycfg and it almost recovers the regression. However, running it right before CGP made performance of spec2017/perlbench even worse. It seems to be due to some kind of bad interaction with LSR; I haven't dug deeper. Would running right before LSR be reasonable?

In D36404#870132, @bmakam wrote:

I tried right after latesimplifycfg and it almost recovers the regression. However, running it right before CGP made performance of spec2017/perlbench even worse. It seems to be due to some kind of bad interaction with LSR; I haven't dug deeper. Would running right before LSR be reasonable?

We need to figure out if LSR is doing something reasonable. It's possible that LSR is actually causing the regression, and old jumpthreading is just serving to obscure the loop structure (thus preventing LSR from doing whatever bad thing it's doing).

On another note, this patch was meant to prevent adding extra branches to the header from within the loop. Branches to the header that come from outside of the loop should be fine. In other words, it should be ok to refine the check to only disable jump threading when any of the PredBBs is dominated by the header (SuccBB).
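The refinement suggested here can be sketched on a toy CFG: compute dominators, then refuse to thread into a header only when some predecessor is dominated by that header, which means the new edge would come from inside the loop. This is only an illustration under stated assumptions: `Cfg`, `computeDominators`, and `anyPredDominatedByHeader` are hypothetical names, the dominator computation is the simple iterative data-flow version, and every block is assumed to be reachable from the entry. It is not the code used in LLVM.

```cpp
#include <cstddef>
#include <vector>

// Toy CFG (hypothetical): G[B] lists the successors of block B.
using Cfg = std::vector<std::vector<int>>;

// Returns Dom, where Dom[B][D] is true when block D dominates block B.
// Assumes every block is reachable from Entry.
std::vector<std::vector<bool>> computeDominators(const Cfg &G, int Entry) {
  const std::size_t N = G.size();
  std::vector<std::vector<int>> Preds(N);
  for (std::size_t B = 0; B < N; ++B)
    for (int S : G[B])
      Preds[S].push_back(static_cast<int>(B));

  // Start from "everything dominates everything" except at the entry.
  std::vector<std::vector<bool>> Dom(N, std::vector<bool>(N, true));
  Dom[Entry].assign(N, false);
  Dom[Entry][Entry] = true;

  bool Changed = true;
  while (Changed) {
    Changed = false;
    for (std::size_t B = 0; B < N; ++B) {
      if (static_cast<int>(B) == Entry)
        continue;
      // Dom(B) = {B} united with the intersection of Dom(P) over preds P.
      std::vector<bool> New(N, true);
      for (int P : Preds[B])
        for (std::size_t D = 0; D < N; ++D)
          New[D] = New[D] && Dom[P][D];
      New[B] = true;
      if (New != Dom[B]) {
        Dom[B] = New;
        Changed = true;
      }
    }
  }
  return Dom;
}

// True when threading these predecessors into Header would add a branch to
// the header from inside the loop (some predecessor is dominated by Header).
bool anyPredDominatedByHeader(const std::vector<std::vector<bool>> &Dom,
                              const std::vector<int> &PredBBs, int Header) {
  for (int P : PredBBs)
    if (Dom[P][Header])
      return true;
  return false;
}
```

With edges 0->1, 1->2, 1->3 and 2->1, block 1 is the header. A predecessor set containing block 2 is dominated by the header and would be rejected, while a set containing only block 0 (outside the loop) would still be allowed.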

We need to figure out if LSR is doing something reasonable. It's possible that LSR is actually causing the regression, and old jumpthreading is just serving to obscure the loop structure (thus preventing LSR from doing whatever bad thing it's doing).

It seems like LSR is doing the right thing; the difference in codegen is due to loopsimplify, which LSR requires. If LoopSimplify is required in CGP, then running LateJumpThreading right before CGP (D37866) completely recovers the regression.

loladiro mentioned this in D42260: [JumpThreading] Don't select an edge that we know we can't thread. Jan 18 2018, 12:38 PM

haicheng mentioned this in rL328798: [JumpThreading] Don't select an edge that we know we can't thread. Mar 29 2018, 9:04 AM

tstellar mentioned this in rL333577: Merging r328798:. May 30 2018, 11:54 AM

Revision Contents

Path                                                          Size
llvm/trunk/lib/Transforms/Scalar/JumpThreading.cpp            13 lines
llvm/trunk/test/Transforms/JumpThreading/static-profile.ll    37 lines

Diff 114052

llvm/trunk/lib/Transforms/Scalar/JumpThreading.cpp

bool JumpThreadingPass::ThreadEdge(BasicBlock *BB,
   if (SuccBB == BB) {
     DEBUG(dbgs() << " Not threading across BB '" << BB->getName()
           << "' - would thread to self!\n");
     return false;
   }

   // If threading this would thread across a loop header, don't thread the edge.
   // See the comments above FindLoopHeaders for justifications and caveats.
-  if (LoopHeaders.count(BB)) {
-    DEBUG(dbgs() << " Not threading across loop header BB '" << BB->getName()
-          << "' to dest BB '" << SuccBB->getName()
-          << "' - it might create an irreducible loop!\n");
+  if (LoopHeaders.count(BB) || LoopHeaders.count(SuccBB)) {
+    DEBUG({
+      bool BBIsHeader = LoopHeaders.count(BB);
+      bool SuccIsHeader = LoopHeaders.count(SuccBB);
+      dbgs() << " Not threading across "
+             << (BBIsHeader ? "loop header BB '" : "block BB '") << BB->getName()
+             << "' to dest " << (SuccIsHeader ? "loop header BB '" : "block BB '")
+             << SuccBB->getName() << "' - it might create an irreducible loop!\n";
+    });
     return false;
   }

   unsigned JumpThreadCost =
       getJumpThreadDuplicationCost(BB, BB->getTerminator(), BBDupThreshold);
   if (JumpThreadCost > BBDupThreshold) {
     DEBUG(dbgs() << " Not threading BB '" << BB->getName()
           << "' - Cost is too high: " << JumpThreadCost << "\n");
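The behavioral effect of the condition change in ThreadEdge can be seen in isolation with a small stand-alone sketch. Blocks are modeled as integer IDs and the header set as a `std::set<int>`; the names `BlockSet`, `refusedBeforePatch`, and `refusedAfterPatch` are hypothetical and exist only for this illustration.

```cpp
#include <set>

// Hypothetical stand-ins: blocks are integer IDs, and LoopHeaders is the set
// of IDs that were identified as loop headers.
using BlockSet = std::set<int>;

// Before the patch: threading is refused only when the block being threaded
// across (BB) is itself a loop header.
bool refusedBeforePatch(const BlockSet &LoopHeaders, int BB, int /*SuccBB*/) {
  return LoopHeaders.count(BB) != 0;
}

// After the patch: threading is also refused when the destination (SuccBB)
// is a loop header, so no new edge into a header is created.
bool refusedAfterPatch(const BlockSet &LoopHeaders, int BB, int SuccBB) {
  return LoopHeaders.count(BB) != 0 || LoopHeaders.count(SuccBB) != 0;
}
```

With block 1 as the only header, threading across block 2 into block 1 was allowed by the old condition and is refused by the new one; threading across block 1 itself is refused in both cases.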

llvm/trunk/test/Transforms/JumpThreading/static-profile.ll

(first 17 lines of the file not shown)
 ; check_2 (16)
 ; (8) / |
 ; eq_2 | (8)
 ; \ |
 ; check_3 (16)
 ; (1) / |
 ; (loop exit) | (15)
 ; |
+; latch
+; |
 ; (back edge)
 ;
 ; First we thread eq_1->check_2 to check_3. Frequencies are updated to remove
 ; the frequency of eq_1 from check_2 and then the false edge leaving check_2
 ; (changed frequencies are highlighted with * *):
 ;
 ; check_1 (16)
 ; (8) / |
 ; eq_1~ | (8)
 ; / |
 ; / check_2 (8)
 ; / (8) / |
 ; \ eq_2 | (0)
 ; \ \ |
 ; ` --- check_3 (16)
 ; (1) / |
 ; (loop exit) | (15)
 ; |
+; latch
+; |
 ; (back edge)
 ;
-; Next we thread eq_1->check_3 and eq_2->check_3 to check_1 as new back edges.
-; Frequencies are updated to remove the frequency of eq_1 and eq_3 from
-; check_3 and then the false edge leaving check_3 (changed frequencies are
-; highlighted with * *):
+; Next we thread eq_1->check_3 and eq_2->check_3 to check_1 as new edges to
+; the loop latch. Frequencies are updated to remove the frequency of eq_1
+; and eq_3 from check_3 and then the false edge leaving check_3 (changed
+; frequencies are highlighted with * *):
 ;
 ; check_1 (16)
 ; (8) / |
 ; eq_1~ | (8)
 ; / |
 ; / check_2 (8)
 ; / (8) / |
 ; /-- eq_2~ | (0)
-; (back edge) |
-; check_3 (0)
-; (0) / |
-; (loop exit) | (0)
+; / |
+; / check_3 (0)
+; / (0) / |
+; | (loop exit) | (0)
+; \ |
+; `--------- latch
 ; |
 ; (back edge)
 ;
 ; As a result, the loop exit edge ends up with 0 frequency which in turn makes
 ; the loop header to have maximum frequency.

 declare void @bar()

 define void @foo(i32 *%p, i32 %n) !prof !0 {
 entry:
   %enter_loop = icmp eq i32 %n, 0
   br i1 %enter_loop, label %exit, label %check_1, !prof !1
 ; CHECK: br i1 %enter_loop, label %exit, label %check_1, !prof !1

 check_1:
   %v = load i32, i32* %p
   %cond1 = icmp eq i32 %v, 1
   br i1 %cond1, label %eq_1, label %check_2
 ; No metadata:
 ; CHECK: br i1 %cond1, label %check_2.thread, label %check_2{{$}}

 eq_1:
   call void @bar()
   br label %check_2
-; Verify the new backedge:
+; Verify the new edge:
 ; CHECK: check_2.thread:
 ; CHECK-NEXT: call void @bar()
-; CHECK-NEXT: br label %check_3.thread
+; CHECK-NEXT: br label %latch

 check_2:
   %cond2 = icmp eq i32 %v, 2
   br i1 %cond2, label %eq_2, label %check_3
 ; No metadata:
 ; CHECK: br i1 %cond2, label %eq_2, label %check_3{{$}}

 eq_2:
   call void @bar()
   br label %check_3
-; Verify the new backedge:
+; Verify the new edge:
 ; CHECK: eq_2:
 ; CHECK-NEXT: call void @bar()
-; CHECK-NEXT: br label %check_3.thread
+; CHECK-NEXT: br label %latch

 check_3:
   %condE = icmp eq i32 %v, 3
-  br i1 %condE, label %exit, label %check_1
+  br i1 %condE, label %exit, label %latch
 ; No metadata:
-; CHECK: br i1 %condE, label %exit, label %check_1{{$}}
+; CHECK: br i1 %condE, label %exit, label %latch{{$}}

+latch:
+  br label %check_1
+
 exit:
   ret void
 }

 !0 = !{!"function_entry_count", i64 120}
 ; CHECK-NOT: branch_weights
 !1 = !{!"branch_weights", i32 119, i32 1}
 ; CHECK: !1 = !{!"branch_weights", i32 119, i32 1}
 ; CHECK-NOT: branch_weights