This is an archive of the discontinued LLVM Phabricator instance.

Differential D94378

[LoopDeletion] Handle inner loops w/untaken backedges
ClosedPublic

Authored by reames on Jan 10 2021, 5:08 PM.

Details

Reviewers

fhahn
jdoerfert
jonpa
atmnpatel
bollu

Commits

rGef51eed37b7e: [LoopDeletion] Handle inner loops w/untaken backedges

Summary

This builds on the restricted form of D93906 that landed after the initial revert, and adds back support for breaking the backedges of inner loops. It turns out the original invalidation logic wasn't quite right, specifically around the handling of LCSSA.

When breaking the backedge of an inner loop, we can cause blocks which were in the outer loop only because they were also included in a sub-loop to be removed from both loops. This results in the exit block set for our original parent loop changing, and thus a need for new LCSSA phi nodes.

This case happens when the inner loop has an exit block which is also an exit block of the parent, and there's a block in the child which reaches an exit to said block without also reaching an exit to the parent loop.

(I'm describing this in terms of the immediate parent, but the problem is general for any transitive parent in the nest.)
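
A toy model of the situation described above may help. This is not LLVM code: `CFG`, `loopBlocks`, `exitBlocks`, `removeEdge` and `loopNest` are hypothetical helpers, the block names are borrowed from the loop_nest_lcssa test, and deleting the edge is a stand-in for what `breakLoopBackedge` does by routing the backedge into an unreachable block.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using Block = std::string;
using CFG = std::map<Block, std::vector<Block>>; // block -> successors

// Natural loop for the backedge Latch -> Header: the header plus every block
// that can reach the latch without passing through the header.
std::set<Block> loopBlocks(const CFG &G, const Block &Header,
                           const Block &Latch) {
  std::set<Block> Body{Header};
  std::vector<Block> Worklist{Latch};
  while (!Worklist.empty()) {
    Block B = Worklist.back();
    Worklist.pop_back();
    if (!Body.insert(B).second)
      continue; // Already visited, or B is the header.
    for (const auto &Entry : G) // Queue every predecessor of B.
      if (std::count(Entry.second.begin(), Entry.second.end(), B))
        Worklist.push_back(Entry.first);
  }
  return Body;
}

// Exit blocks: successors of loop blocks that are not themselves in the loop.
std::set<Block> exitBlocks(const CFG &G, const std::set<Block> &Body) {
  std::set<Block> Exits;
  for (const Block &B : Body) {
    auto It = G.find(B);
    if (It == G.end())
      continue;
    for (const Block &S : It->second)
      if (!Body.count(S))
        Exits.insert(S);
  }
  return Exits;
}

// Drop the edge From -> To (toy stand-in for making the backedge unreachable).
void removeEdge(CFG &G, const Block &From, const Block &To) {
  auto It = G.find(From);
  assert(It != G.end() && "unknown source block");
  auto &Succs = It->second;
  Succs.erase(std::remove(Succs.begin(), Succs.end(), To), Succs.end());
}

// Two-deep nest shaped like the loop_nest_lcssa test (assumed shape).
CFG loopNest() {
  return {{"outer_header", {"inner_header"}},
          {"inner_header", {"inner_latch", "outer_latch"}},
          {"inner_latch", {"inner_header", "loopexit"}},
          {"outer_latch", {"outer_header"}},
          {"loopexit", {}}};
}
```

In this model the outer loop initially contains `inner_latch` and its only exit block is `loopexit`. After `removeEdge(G, "inner_latch", "inner_header")`, `inner_latch` can no longer reach `outer_latch`, so it drops out of the outer loop and becomes that loop's only exit block, which is where a new LCSSA phi would be needed.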

At a high level, we seem to have two choices. Either a) rebuild lcssa if needed, or b) restrict the transformation such that an lcssa rebuild isn't needed.

The LCSSA rebuild approach is potentially expensive in the worst case. Each rebuild is potentially O(N^2) in the number of instructions in the loop being rebuilt. At worst, we could have N sub-loops (since each must contain at least one instruction), resulting in a worst case example of a whole forest of loops w/zero btc resulting in O(N^3). We have lots of precedent for this being an acceptable expense, but I want to explicitly raise the issue for review. Thoughts?
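
Spelled out, and taking the estimates in the paragraph above at face value (at most N sub-loops with a zero backedge-taken count, one LCSSA rebuild per broken backedge, each rebuild quadratic in the instruction count N), the bound is just the product:

```latex
\[
  \underbrace{N}_{\text{broken sub-loops}}
  \times
  \underbrace{O(N^2)}_{\text{cost of one LCSSA rebuild}}
  = O(N^3)
\]
```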

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

reames created this revision. Jan 10 2021, 5:08 PM

Herald added subscribers: dantrushin, bollu, hiraditya, mcrosier. Jan 10 2021, 5:08 PM

reames requested review of this revision. Jan 10 2021, 5:08 PM

Herald added a project: Restricted Project. Jan 10 2021, 5:08 PM

Harbormaster completed remote builds in B84619: Diff 315681. Jan 10 2021, 5:58 PM

ping

The logic seems sound. I have no idea about the practical impact. Should we do some compile time testing (@fhahn @nikic) ?

Here are the compile-time numbers for the patch: http://llvm-compile-time-tracker.com/compare.php?from=e377c8eeb4aa2eb239a651f1fe12c27fc77deda3&to=8a40070dec65fe8b716cbfbf134f405fc3d26f8a&stat=instructions

The only notable thing is mafft: 52035M vs. 52077M (+0.08%). Perhaps some noise, as there are no binary changes and IIUC it should only trigger if the transformation fires. Or perhaps a different optimization later on gives the same result.

fhahn added inline comments. Jan 20 2021, 10:42 AM

llvm/lib/Transforms/Utils/LoopUtils.cpp
804	From the arguments, it seems like `changeToUnreachable` at least tries to preserve LCSSA, but clearly misses the case at hand. I've not looked at the details of what exactly needs updating. Do you think it would be possible to directly fix the values broken after removing the block?

reames added inline comments. Jan 20 2021, 2:51 PM

llvm/lib/Transforms/Utils/LoopUtils.cpp
804	We're introducing a whole new set of exit blocks which used to be inside the loop. We either have to scan all defs in the loop (which this does), or all uses in the blocks removed from the loop. Is the latter any better than the former? The former has the benefit that we can reuse existing code. (Remember we have to do this for all loop levels, so it's slightly tricky.)
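
As an illustration of the "scan all defs in the loop" option, here is a minimal toy sketch. It is not LLVM's implementation: `Use`, `Def` and `defsBreakingLCSSA` are hypothetical names, and values are tracked only by the block they live in. A def inside the loop is fine under LCSSA if every use is either inside the loop or is a phi in one of the loop's exit blocks.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using Block = std::string;

struct Use {
  Block Where; // block containing the using instruction
  bool IsPhi;  // true if the user is a phi node
};

struct Def {
  std::string Name;
  Block Where; // block containing the definition
  std::vector<Use> Uses;
};

// Scan every def inside the loop and report those with a use that is neither
// inside the loop nor a phi in one of the loop's exit blocks.
std::vector<std::string> defsBreakingLCSSA(const std::vector<Def> &Defs,
                                           const std::set<Block> &Body,
                                           const std::set<Block> &Exits) {
  assert(!Body.empty() && "a loop contains at least its header");
  std::vector<std::string> Broken;
  for (const Def &D : Defs) {
    if (!Body.count(D.Where))
      continue; // Defs outside the loop never need an LCSSA phi for it.
    for (const Use &U : D.Uses) {
      bool Inside = Body.count(U.Where) != 0;
      bool ExitPhi = U.IsPhi && Exits.count(U.Where) != 0;
      if (!Inside && !ExitPhi) {
        Broken.push_back(D.Name);
        break; // One bad use is enough to flag this def.
      }
    }
  }
  return Broken;
}
```

With a hypothetical value `v` defined in `outer_header` and used by a phi in `loopexit`, the scan reports nothing while `loopexit` is an exit block of the loop. Once `inner_latch` has dropped out of the loop and become the exit block instead, the same phi is flagged, because it no longer sits in an exit block.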

LGTM, thanks!

llvm/lib/Transforms/Utils/LoopUtils.cpp
804	> Is the latter any better than the former?
I have no idea. Let's use the existing infrastructure and take another look if any issues are reported.

This revision is now accepted and ready to land. Jan 22 2021, 2:08 PM

Herald added a reviewer: bollu. Jan 22 2021, 2:08 PM

Closed by commit rGef51eed37b7e: [LoopDeletion] Handle inner loops w/untaken backedges (authored by reames). Jan 22 2021, 4:44 PM

This revision was automatically updated to reflect the committed changes.

reames added a commit: rGef51eed37b7e: [LoopDeletion] Handle inner loops w/untaken backedges.

Hi,

I wrote
https://bugs.llvm.org/show_bug.cgi?id=49802
about a problem that started happening with this commit.

Since there are already two somewhat similar PRs, perhaps this patch just exposes an already existing problem, but I thought I'd give a heads-up here anyway.

Revision Contents

Path	Size
llvm/lib/Transforms/Scalar/LoopDeletion.cpp	8 lines
llvm/lib/Transforms/Utils/LoopUtils.cpp	17 lines
llvm/test/Transforms/IndVarSimplify/X86/pr45360.ll	6 lines
llvm/test/Transforms/LoopDeletion/zero-btc.ll	11 lines

Diff 318683

llvm/lib/Transforms/Scalar/LoopDeletion.cpp

@@ breakBackedgeIfNotTaken(Loop *L, DominatorTree &DT, ScalarEvolution &SE,
   if (!L->getLoopLatch())
     return LoopDeletionResult::Unmodified;

   auto *BTC = SE.getBackedgeTakenCount(L);
   if (!BTC->isZero())
     return LoopDeletionResult::Unmodified;

-  // For non-outermost loops, the tricky case is that we can drop blocks
-  // out of both inner and outer loops at the same time. This results in
-  // new exiting block for the outer loop appearing, and possibly needing
-  // an lcssa phi inserted. (See loop_nest_lcssa test case in zero-btc.ll)
-  // TODO: We can handle a bunch of cases here without much work, revisit.
-  if (!L->isOutermost())
-    return LoopDeletionResult::Unmodified;
-
   breakLoopBackedge(L, DT, SE, LI, MSSA);
   return LoopDeletionResult::Deleted;
 }

 /// Remove a loop if it is dead.
 ///
 /// A loop is considered dead either if it does not impact the observable
 /// behavior of the program other than finite running time, or if it is

llvm/lib/Transforms/Utils/LoopUtils.cpp

@@ if (Loop *ParentLoop = L->getParentLoop()) {
       Loop::iterator I = find(*LI, L);
       assert(I != LI->end() && "Couldn't find loop");
       LI->removeLoop(I);
     }
     LI->destroy(L);
   }
 }

+static Loop *getOutermostLoop(Loop *L) {
+  while (Loop *Parent = L->getParentLoop())
+    L = Parent;
+  return L;
+}
+
 void llvm::breakLoopBackedge(Loop *L, DominatorTree &DT, ScalarEvolution &SE,
                              LoopInfo &LI, MemorySSA *MSSA) {

-  assert(L->isOutermost() && "Can't yet preserve LCSSA for this case");
   auto *Latch = L->getLoopLatch();
   assert(Latch && "multiple latches not yet supported");
   auto *Header = L->getHeader();
+  Loop *OutermostLoop = getOutermostLoop(L);

   SE.forgetLoop(L);

   // Note: By splitting the backedge, and then explicitly making it unreachable
   // we gracefully handle corner cases such as non-bottom tested loops and the
   // like. We also have the benefit of being able to reuse existing well tested
   // code. It might be worth special casing the common bottom tested case at
   // some point to avoid code churn.

   std::unique_ptr<MemorySSAUpdater> MSSAU;
   if (MSSA)
     MSSAU = std::make_unique<MemorySSAUpdater>(MSSA);

   auto *BackedgeBB = SplitEdge(Latch, Header, &DT, &LI, MSSAU.get());

   DomTreeUpdater DTU(&DT, DomTreeUpdater::UpdateStrategy::Eager);
   (void)changeToUnreachable(BackedgeBB->getTerminator(), /*UseTrap*/false,
                             /*PreserveLCSSA*/true, &DTU, MSSAU.get());

   // Erase (and destroy) this loop instance. Handles relinking sub-loops
   // and blocks within the loop as needed.
   LI.erase(L);
+
+  // If the loop we broke had a parent, then changeToUnreachable might have
+  // caused a block to be removed from the parent loop (see loop_nest_lcssa
+  // test case in zero-btc.ll for an example), thus changing the parent's
+  // exit blocks. If that happened, we need to rebuild LCSSA on the outermost
+  // loop which might have a had a block removed.
+  if (OutermostLoop != L)
+    formLCSSARecursively(*OutermostLoop, DT, &LI, &SE);
 }


 /// Checks if \p L has single exit through latch block except possibly
 /// "deoptimizing" exits. Returns branch instruction terminating the loop
 /// latch if above check is successful, nullptr otherwise.
 static BranchInst *getExpectedExitLoopLatchBranch(Loop *L) {
   BasicBlock *Latch = L->getLoopLatch();

llvm/test/Transforms/IndVarSimplify/X86/pr45360.ll

 @e = dso_local global i32 0, align 4

 define dso_local i32 @main() {
 ; CHECK-LABEL: @main(
 ; CHECK-NEXT: bb:
 ; CHECK-NEXT: [[I6:%.*]] = load i32, i32* @a, align 4
 ; CHECK-NEXT: [[I24:%.*]] = load i32, i32* @b, align 4
 ; CHECK-NEXT: [[D_PROMOTED9:%.*]] = load i32, i32* @d, align 4
-; CHECK-NEXT: br label [[BB1:%.*]]
-; CHECK: bb1:
+; CHECK-NEXT: br label [[BB13_PREHEADER:%.*]]
+; CHECK: bb13.preheader:
 ; CHECK-NEXT: [[I8_LCSSA10:%.*]] = phi i32 [ [[D_PROMOTED9]], [[BB:%.*]] ], [ [[I8:%.*]], [[BB19_PREHEADER:%.*]] ]
 ; CHECK-NEXT: [[I8]] = and i32 [[I8_LCSSA10]], [[I6]]
 ; CHECK-NEXT: [[I21:%.*]] = icmp eq i32 [[I8]], 0
 ; CHECK-NEXT: br i1 [[I21]], label [[BB13_PREHEADER_BB27_THREAD_SPLIT_CRIT_EDGE:%.*]], label [[BB19_PREHEADER]]
 ; CHECK: bb19.preheader:
 ; CHECK-NEXT: [[I26:%.*]] = urem i32 [[I24]], [[I8]]
 ; CHECK-NEXT: store i32 [[I26]], i32* @e, align 4
 ; CHECK-NEXT: [[I30_NOT:%.*]] = icmp eq i32 [[I26]], 0
-; CHECK-NEXT: br i1 [[I30_NOT]], label [[BB32_LOOPEXIT:%.*]], label [[BB1]]
+; CHECK-NEXT: br i1 [[I30_NOT]], label [[BB32_LOOPEXIT:%.*]], label [[BB13_PREHEADER]]
 ; CHECK: bb13.preheader.bb27.thread.split_crit_edge:
 ; CHECK-NEXT: store i32 -1, i32* @f, align 4
 ; CHECK-NEXT: store i32 0, i32* @d, align 4
 ; CHECK-NEXT: store i32 0, i32* @c, align 4
 ; CHECK-NEXT: br label [[BB32:%.*]]
 ; CHECK: bb32.loopexit:
 ; CHECK-NEXT: store i32 -1, i32* @f, align 4
 ; CHECK-NEXT: store i32 [[I8]], i32* @d, align 4

llvm/test/Transforms/LoopDeletion/zero-btc.ll

@@
 ; CHECK-LABEL: @test_live_outer(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: br label [[LOOP:%.*]]
 ; CHECK: loop:
 ; CHECK-NEXT: [[IV:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[IV_INC:%.*]], [[LATCH:%.*]] ]
 ; CHECK-NEXT: br label [[INNER:%.*]]
 ; CHECK: inner:
 ; CHECK-NEXT: store i32 0, i32* @G, align 4
-; CHECK-NEXT: br i1 false, label [[INNER]], label [[LATCH]]
+; CHECK-NEXT: br i1 false, label [[INNER_INNER_CRIT_EDGE:%.*]], label [[LATCH]]
+; CHECK: inner.inner_crit_edge:
+; CHECK-NEXT: unreachable
 ; CHECK: latch:
 ; CHECK-NEXT: store i32 [[IV]], i32* @G, align 4
 ; CHECK-NEXT: [[IV_INC]] = add i32 [[IV]], 1
 ; CHECK-NEXT: [[CND:%.*]] = icmp ult i32 [[IV_INC]], 200
 ; CHECK-NEXT: br i1 [[CND]], label [[LOOP]], label [[EXIT:%.*]]
 ; CHECK: exit:
 ; CHECK-NEXT: ret void
 ;
@@
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: [[TMP0:%.*]] = add i32 1, 2
 ; CHECK-NEXT: br label [[OUTER_HEADER:%.*]]
 ; CHECK: outer_header:
 ; CHECK-NEXT: br label [[INNER_HEADER:%.*]]
 ; CHECK: inner_header:
 ; CHECK-NEXT: br i1 false, label [[INNER_LATCH:%.*]], label [[OUTER_LATCH:%.*]]
 ; CHECK: inner_latch:
-; CHECK-NEXT: br i1 false, label [[INNER_HEADER]], label [[LOOPEXIT:%.*]]
+; CHECK-NEXT: [[DOTLCSSA:%.*]] = phi i32 [ [[TMP0]], [[INNER_HEADER]] ]
+; CHECK-NEXT: br i1 false, label [[INNER_LATCH_INNER_HEADER_CRIT_EDGE:%.*]], label [[LOOPEXIT:%.*]]
+; CHECK: inner_latch.inner_header_crit_edge:
+; CHECK-NEXT: unreachable
 ; CHECK: outer_latch:
 ; CHECK-NEXT: br label [[OUTER_HEADER]]
 ; CHECK: loopexit:
-; CHECK-NEXT: [[DOTLCSSA32:%.*]] = phi i32 [ [[TMP0]], [[INNER_LATCH]] ]
+; CHECK-NEXT: [[DOTLCSSA32:%.*]] = phi i32 [ [[DOTLCSSA]], [[INNER_LATCH]] ]
 ; CHECK-NEXT: unreachable
 ;
 entry:
   br label %outer_header

 outer_header:
   %0 = add i32 1, 2
   br label %inner_header