This is an archive of the discontinued LLVM Phabricator instance.

Differential D17473

[LoopUnroll] Avoid unnecessary DT recomputation.
Closed · Public

Authored by mzolotukhin on Feb 19 2016, 7:04 PM.

Details

Reviewers

chandlerc
jmolloy
sanjoy
escha
hfinkel

Commits

rGde19ed1eb1ac: [LoopUnroll] Avoid unnecessary DT recomputation.
rL261595: [LoopUnroll] Avoid unnecessary DT recomputation.

Summary

When we completely unroll a loop, it's pretty easy to update DT in-place and
thus avoid rebuilding it. DT recalculation is one of the most time-consuming
tasks in loop-unroll, so avoiding it at least in case of full unroll should be
beneficial.

On some extreme (but still real-world) tests this patch improves compile time by
~2x.
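
As a rough illustration of why the in-place update is cheap for full unrolling, the sketch below models the rule the patch applies to cloned blocks: the clone of the header is dominated by the latch of the previous iteration, and every other clone is dominated by the clone of its original immediate dominator. The `IDomMap` type and `addUnrolledCopy` helper are hypothetical stand-ins written for this note, not part of `llvm::DominatorTree`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Toy dominator tree: block name -> name of its immediate dominator.
// Hypothetical stand-in for llvm::DominatorTree, not its real API.
using IDomMap = std::map<std::string, std::string>;

// Registers one unrolled copy of the loop body in the tree. Body lists the
// original loop blocks, Header and Latch name the loop header and latch, and
// PrevLatch is the latch of the previous iteration. Clone names are the
// original names with Suffix appended. Returns the name of the new latch.
std::string addUnrolledCopy(IDomMap &IDom, const std::vector<std::string> &Body,
                            const std::string &Header, const std::string &Latch,
                            const std::string &PrevLatch,
                            const std::string &Suffix) {
  for (const std::string &BB : Body) {
    if (BB == Header)
      IDom[BB + Suffix] = PrevLatch;            // header clone hangs off the previous latch
    else
      IDom[BB + Suffix] = IDom.at(BB) + Suffix; // clone of the original idom
  }
  return Latch + Suffix;
}
```

Each clone costs one lookup and one insertion, so the work is linear in the number of cloned blocks, whereas a full recalculation walks the whole function.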

Event Timeline

mzolotukhin updated this revision to Diff 48569. Feb 19 2016, 7:04 PM

mzolotukhin retitled this revision to [LoopUnroll] Avoid unnecessary DT recomputation.

mzolotukhin updated this object.

mzolotukhin added reviewers: chandlerc, escha, jmolloy, hfinkel, sanjoy.

mzolotukhin added a subscriber: llvm-commits.

Herald added a subscriber: sanjoy. · Feb 19 2016, 7:04 PM

s/UniqueExit/Exit/ - the exit block doesn't need to be unique.

Only high-level question is whether all of these cases have to be handled when the loop is in simplified form.

lib/Transforms/Utils/LoopUnroll.cpp
115–117	Range based for-loop?
533–534	There is a range based function as well for this.
536	I'm surprised this much work is necessary even when the loop is in simplified form?
562–568	Merge the two ifs? And update the comment?

Thanks for the feedback, I'll update the patch soon.

Michael

lib/Transforms/Utils/LoopUnroll.cpp
536	The reason we're doing this is that the dominator of an exit block could change after unrolling. Consider a diamond-like loop body with header `H`, side blocks `A` and `B`, and latch block `L`. Suppose `B` is exiting to `E`. If `B` is the immediate dominator of `E` before unrolling, that is no longer the case after unrolling: we'll have several blocks exiting to `E`, so we have to actually find their common dominator. There is probably a more efficient way of doing this, but even in this form it's a pure win over what we have now.

chandlerc added inline comments. Feb 19 2016, 7:42 PM

lib/Transforms/Utils/LoopUnroll.cpp
536	Ahh, I see. So, the reason that this seemed odd to me is that all of these blocks that now branch to E come from unrolled copies of the loop, and so they should all have the same IDom -- the IDom of B from the first copy of the loop (which I think pretty much has to be the header, but I've not thought very hard about that). But have we done any CFG simplification during unrolling at this point? (I know we talked about that, not sure any of it landed...) If so, that would of course potentially invalidate the idea of basing this purely on the loop structure and structural nature of unrolling. It's not so much that this is ever going to be expensive at runtime (the domtree should make this pattern quite fast), it was just that I wanted to understand the complexity. I think explaining some of the context of how to think about this code in comments would be very useful here.
562–568	Also, what do you think about running verifyDomTree here to help flush out any bugs?

Also LGTM with the comments addressed. Also happy to chat further.

This revision is now accepted and ready to land. Feb 19 2016, 7:43 PM

Hi Chandler,

Some replies from me are inline. I'm heading off for the weekend now, probably update the patch and add some comments on Monday. Thanks for your feedback!

Michael

lib/Transforms/Utils/LoopUnroll.cpp
536	> all of these blocks that now branch to E come from unrolled copies of the loop, and so they should all have the same IDom -- the IDom of B from the first copy of the loop

I don't think it's correct. IDom of B might be H, but doesn't have to be (you can imagine a diamond in diamond structure to prove it). To explain it better I'll try to use ASCII mad skills here. Here is our (slightly modified) original loop body:

      (H)
       |
       v
      (I)
     /   \
    v     v
  (A)     (B) --> (E)
     \   /
       v
      (L)

Here IDom(B) = I, IDom(E) = B. After unrolling we'll have:

      (H)
       |
       v
      (I)
     /   \
    v     v
  (A)     (B) ------
     \   /          \
       v            |
      (L)           |
       |            |
       v            |
      (H')          |
       |            |
       v            |
      (I')          |
     /   \          |
    v     v         v
  (A')   (B') ->   (E)
     \   /
      vv
      (L')

In the unrolled loop IDom(B) = I, IDom(B') = I', IDom(E) = NearestCommonDominator(B', B) = I. Please note that it doesn't have to be the header. That said, I see what you meant by using structural nature of unrolling - we do exploit it when we assign dominators for cloned blocks. As for the CFG simplification - we perform some folding right after this, in `FoldBlockIntoPredecessor`, which I also updated in this patch.
562–568	I did have `else DEBUG(DT->verifyDomTree());` here (and I ran tests with it), but if I keep it in, it'll regress compile time for Asserts=On builds. If that's fine, I can restore it.
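
The exit-block argument in the 536 comment above can be checked mechanically. The following is a minimal sketch on a toy dominator tree for the unrolled example (blocks `H`, `I`, `A`, `B`, `L` and their primed copies); `IDomMap`, `nearestCommonDominator`, `exitIDom` and `unrolledExample` are hypothetical helpers written for this note and only mimic what `DominatorTree::findNearestCommonDominator` is used for in the patch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy dominator tree: block name -> immediate dominator ("" for the root).
using IDomMap = std::map<std::string, std::string>;

// Collects every dominator of A, then walks up from B until it reaches one.
std::string nearestCommonDominator(const IDomMap &IDom, std::string A,
                                   std::string B) {
  std::set<std::string> Ancestors;
  for (; !A.empty(); A = IDom.at(A))
    Ancestors.insert(A);
  for (; !B.empty(); B = IDom.at(B))
    if (Ancestors.count(B))
      return B;
  return "";
}

// Folds nearestCommonDominator over all predecessors of an exit block.
std::string exitIDom(const IDomMap &IDom,
                     const std::vector<std::string> &Preds) {
  std::string Result;
  for (const std::string &P : Preds)
    Result = Result.empty() ? P : nearestCommonDominator(IDom, Result, P);
  return Result;
}

// Dominator tree of the twice-unrolled loop body from the comment above,
// before the exit block E is updated.
IDomMap unrolledExample() {
  return {{"H", ""},   {"I", "H"},   {"A", "I"},   {"B", "I"},   {"L", "I"},
          {"H'", "L"}, {"I'", "H'"}, {"A'", "I'"}, {"B'", "I'"}, {"L'", "I'"}};
}
```

With predecessors `B` and `B'`, `exitIDom` yields `I`, matching IDom(E) = NearestCommonDominator(B', B) = I; with the single predecessor `B` of the original loop it yields `B`.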

mehdi_amini added a subscriber: mehdi_amini. Feb 19 2016, 9:03 PM

mehdi_amini added inline comments.

lib/Transforms/Utils/LoopUnroll.cpp
536	It is not necessarily the header, but it is still `IDom(B)`, do you have an example where `IDom(E) != IDom(B)`? (Yeah, I love your asciiart-fu ;))

mzolotukhin added inline comments. Feb 19 2016, 9:20 PM

lib/Transforms/Utils/LoopUnroll.cpp
536	One example is that `IDom(E)` might be `B` itself. Also, if you construct an example with nested diamonds, you'll probably be able to get `IDom(E)` != `IDom(B)`. But after thinking about it, I think `IDom(E)` = `NearestCommonDominator(B, H')`. It looks obvious to me now, but it's too late on Friday to actually prove it accurately :)

Rebase on TOT.
Optimize dom-info update for exit blocks.
Add DT verification.
Add more comments.

Hi Mehdi, Chandler,

I rewrote the part about updating dominators for exit block. Could you please take a look one more time?

Thanks,
Michael

Awesome, thanks Michael. And yea, I'm very happy to have the verify in here until someone complains about it. This is too easy to get wrong, and the only way we'll find the really awesome test cases are with the verify left in place.

mehdi_amini added inline comments. Feb 22 2016, 4:33 PM

lib/Transforms/Utils/LoopUnroll.cpp
115–117	The above comment on range based for-loop still applies I think?

Closed by commit rL261595: [LoopUnroll] Avoid unnecessary DT recomputation. (authored by mzolotukhin). · Feb 22 2016, 4:35 PM

This revision was automatically updated to reflect the committed changes.

Ouch, I missed that, I'll commit this as a follow-up. Thanks!

Michael

mehdi_amini added inline comments. Feb 22 2016, 4:43 PM

lib/Transforms/Utils/LoopUnroll.cpp
121	Random thought: couldn't the pattern above be implemented more efficiently inside the DT (by breaking the invariant during the process)? There would be no need for a temporary vector, and the DT could reserve space in the new dominator (it could just directly append the source children vector to the new one). This pattern seems to appear at multiple places in the codebase (looked very quickly).
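
For context, the pattern under discussion moves every dominator-tree child of the folded block to its predecessor. A minimal sketch on a toy tree follows; `Node`, `changeImmediateDominator` and `foldIntoPredecessor` are hypothetical stand-ins for this note, not the `llvm::DomTreeNode` API. It shows why the temporary vector exists: re-parenting a child edits the very child list being walked.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy dominator tree node; a hypothetical stand-in for llvm::DomTreeNode.
struct Node {
  Node *IDom = nullptr;
  std::vector<Node *> Children;
};

// Removes N from the child list of its current immediate dominator, if any.
void unlinkFromParent(Node *N) {
  if (!N->IDom)
    return;
  auto &Siblings = N->IDom->Children;
  Siblings.erase(std::remove(Siblings.begin(), Siblings.end(), N),
                 Siblings.end());
  N->IDom = nullptr;
}

// Re-parents N under NewIDom, keeping both child lists consistent.
void changeImmediateDominator(Node *N, Node *NewIDom) {
  unlinkFromParent(N);
  N->IDom = NewIDom;
  NewIDom->Children.push_back(N);
}

// Moves every child of BB under Pred, then detaches BB from the tree.
void foldIntoPredecessor(Node *BB, Node *Pred) {
  // Copy first: changeImmediateDominator mutates BB->Children as we go.
  std::vector<Node *> Children(BB->Children.begin(), BB->Children.end());
  for (Node *Child : Children) // range-based loop, as the reviewers suggested
    changeImmediateDominator(Child, Pred);
  unlinkFromParent(BB); // BB is a leaf now and can be dropped
}
```

The suggestion in the comment amounts to doing this inside the tree itself, where the child list could be appended to the new parent in one step without the copy.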

mehdi_amini added inline comments. May 31 2016, 3:24 PM

llvm/trunk/lib/Transforms/Utils/LoopUnroll.cpp
579 ↗	(On Diff #48755)	If DT is null, this will always break (the test will be false, the else branch taken, and the nullptr dereferenced). Clang can be smart and eliminate your null check as well, deducing that it is not possible for it to be null.

mzolotukhin added inline comments. Jun 6 2016, 12:45 PM

llvm/trunk/lib/Transforms/Utils/LoopUnroll.cpp
579 ↗	(On Diff #48755)	Good catch! Thanks, Mehdi!
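
The exact code of Diff #48755 is not reproduced on this page, so the following is only a sketch of the hazard as described in the comment, with one possible guarded form. `ToyDomTree` and `finishUnroll` are hypothetical names used for illustration.

```cpp
#include <cassert>

// Toy stand-in for a dominator tree; counts calls instead of doing work.
struct ToyDomTree {
  int Recalculations = 0;
  int Verifications = 0;
  void recalculate() { ++Recalculations; }
  void verify() { ++Verifications; }
};

// Hazardous shape described in the review (shown only as a comment):
//
//   if (DT && !CompletelyUnroll)
//     DT->recalculate();
//   else
//     DT->verify();   // reached with DT == nullptr when no tree is available
//
// Guarded form: every use of DT sits behind a single null check.
void finishUnroll(ToyDomTree *DT, bool CompletelyUnroll) {
  if (!DT)
    return;
  if (!CompletelyUnroll)
    DT->recalculate(); // partial unroll: rebuild from scratch
  else
    DT->verify();      // full unroll: tree was updated in place, check it
}
```

Keeping the null check as the outermost condition also avoids the situation the comment warns about, where the compiler sees an unconditional dereference and concludes the pointer can never be null.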

Revision Contents

Path	Size
lib/Transforms/Utils/LoopUnroll.cpp	56 lines

Diff 48570

lib/Transforms/Utils/LoopUnroll.cpp

[... 67 lines not shown ...]
/// only has one predecessor, and that predecessor only has one successor.		/// only has one predecessor, and that predecessor only has one successor.
/// The LoopInfo Analysis that is passed will be kept consistent. If folding is		/// The LoopInfo Analysis that is passed will be kept consistent. If folding is
/// successful references to the containing loop must be removed from		/// successful references to the containing loop must be removed from
/// ScalarEvolution by calling ScalarEvolution::forgetLoop because SE may have		/// ScalarEvolution by calling ScalarEvolution::forgetLoop because SE may have
/// references to the eliminated BB. The argument ForgottenLoops contains a set		/// references to the eliminated BB. The argument ForgottenLoops contains a set
/// of loops that have already been forgotten to prevent redundant, expensive		/// of loops that have already been forgotten to prevent redundant, expensive
/// calls to ScalarEvolution::forgetLoop. Returns the new combined block.		/// calls to ScalarEvolution::forgetLoop. Returns the new combined block.
static BasicBlock *		static BasicBlock *
FoldBlockIntoPredecessor(BasicBlock *BB, LoopInfo *LI, ScalarEvolution *SE,		FoldBlockIntoPredecessor(BasicBlock *BB, LoopInfo *LI, ScalarEvolution *SE,
SmallPtrSetImpl<Loop *> &ForgottenLoops) {		SmallPtrSetImpl<Loop *> &ForgottenLoops,
		DominatorTree *DT) {
// Merge basic blocks into their predecessor if there is only one distinct		// Merge basic blocks into their predecessor if there is only one distinct
// pred, and if there is only one distinct successor of the predecessor, and		// pred, and if there is only one distinct successor of the predecessor, and
// if there are no PHI nodes.		// if there are no PHI nodes.
BasicBlock *OnlyPred = BB->getSinglePredecessor();		BasicBlock *OnlyPred = BB->getSinglePredecessor();
if (!OnlyPred) return nullptr;		if (!OnlyPred) return nullptr;

if (OnlyPred->getTerminator()->getNumSuccessors() != 1)		if (OnlyPred->getTerminator()->getNumSuccessors() != 1)
return nullptr;		return nullptr;
[... 15 lines not shown, inside FoldBlockIntoPredecessor ...]
BB->replaceAllUsesWith(OnlyPred);		BB->replaceAllUsesWith(OnlyPred);

// Move all definitions in the successor to the predecessor...		// Move all definitions in the successor to the predecessor...
OnlyPred->getInstList().splice(OnlyPred->end(), BB->getInstList());		OnlyPred->getInstList().splice(OnlyPred->end(), BB->getInstList());

// OldName will be valid until erased.		// OldName will be valid until erased.
StringRef OldName = BB->getName();		StringRef OldName = BB->getName();

// Erase basic block from the function...		// Erase the old block and update dominator info.
		if (DT)
		if (DomTreeNode *DTN = DT->getNode(BB)) {
		DomTreeNode *PredDTN = DT->getNode(OnlyPred);
		SmallVector<DomTreeNode *, 8> Children(DTN->begin(), DTN->end());
		for (SmallVectorImpl<DomTreeNode *>::iterator DI = Children.begin(),
		DE = Children.end();
		DI != DE; ++DI)
		DT->changeImmediateDominator(*DI, PredDTN);

		DT->eraseNode(BB);
		}

// ScalarEvolution holds references to loop exit blocks.		// ScalarEvolution holds references to loop exit blocks.
if (SE) {		if (SE) {
if (Loop *L = LI->getLoopFor(BB)) {		if (Loop *L = LI->getLoopFor(BB)) {
if (ForgottenLoops.insert(L).second)		if (ForgottenLoops.insert(L).second)
SE->forgetLoop(L);		SE->forgetLoop(L);
}		}
}		}
[... 297 lines not shown; context: for (LoopBlocksDFS::RPOIterator BB = BlockBegin; BB != BlockEnd; ++BB) { ...]
// we can insert the proper branches later.		// we can insert the proper branches later.
if (*BB == Header)		if (*BB == Header)
Headers.push_back(New);		Headers.push_back(New);
if (*BB == LatchBlock)		if (*BB == LatchBlock)
Latches.push_back(New);		Latches.push_back(New);

NewBlocks.push_back(New);		NewBlocks.push_back(New);
UnrolledLoopBlocks.push_back(New);		UnrolledLoopBlocks.push_back(New);

		// Update DomTree
		if (DT) {
		if (*BB == Header)
		DT->addNewBlock(New, Latches[It - 1]);
		else {
		auto BBDomNode = DT->getNode(*BB);
		auto BBIDom = BBDomNode->getIDom();
		BasicBlock *OriginalBBIDom = BBIDom->getBlock();
		DT->addNewBlock(
		New, cast<BasicBlock>(LastValueMap[cast<Value>(OriginalBBIDom)]));
		}
		}
}		}

// Remap all instructions in the most recent iteration		// Remap all instructions in the most recent iteration
for (unsigned i = 0; i < NewBlocks.size(); ++i)		for (unsigned i = 0; i < NewBlocks.size(); ++i)
for (BasicBlock::iterator I = NewBlocks[i]->begin(),		for (BasicBlock::iterator I = NewBlocks[i]->begin(),
E = NewBlocks[i]->end(); I != E; ++I)		E = NewBlocks[i]->end(); I != E; ++I)
::RemapInstruction(&*I, LastValueMap);		::RemapInstruction(&*I, LastValueMap);
}		}
[... 65 lines not shown; context: if (NeedConditional) { ...]
}		}
}		}
}		}
// Replace the conditional branch with an unconditional one.		// Replace the conditional branch with an unconditional one.
BranchInst::Create(Dest, Term);		BranchInst::Create(Dest, Term);
Term->eraseFromParent();		Term->eraseFromParent();
}		}
}		}
		// Update dominators of loop exit blocks
		if (DT) {
		for (auto Exit : ExitBlocks) {
		BasicBlock *ExitIDom = nullptr;
		for (auto BI = pred_begin(Exit), BE = pred_end(Exit); BI != BE; BI++) {
		ExitIDom =
		ExitIDom ? DT->findNearestCommonDominator(ExitIDom, *BI) : *BI;
		}
		DT->changeImmediateDominator(Exit, ExitIDom);
		}
		}

// Merge adjacent basic blocks, if possible.		// Merge adjacent basic blocks, if possible.
SmallPtrSet<Loop *, 4> ForgottenLoops;		SmallPtrSet<Loop *, 4> ForgottenLoops;
for (unsigned i = 0, e = Latches.size(); i != e; ++i) {		for (unsigned i = 0, e = Latches.size(); i != e; ++i) {
BranchInst *Term = cast<BranchInst>(Latches[i]->getTerminator());		BranchInst *Term = cast<BranchInst>(Latches[i]->getTerminator());
if (Term->isUnconditional()) {		if (Term->isUnconditional()) {
BasicBlock *Dest = Term->getSuccessor(0);		BasicBlock *Dest = Term->getSuccessor(0);
if (BasicBlock *Fold = FoldBlockIntoPredecessor(Dest, LI, SE,		if (BasicBlock *Fold =
ForgottenLoops)) {		FoldBlockIntoPredecessor(Dest, LI, SE, ForgottenLoops, DT)) {
// Dest has been folded into Fold. Update our worklists accordingly.		// Dest has been folded into Fold. Update our worklists accordingly.
std::replace(Latches.begin(), Latches.end(), Dest, Fold);		std::replace(Latches.begin(), Latches.end(), Dest, Fold);
UnrolledLoopBlocks.erase(std::remove(UnrolledLoopBlocks.begin(),		UnrolledLoopBlocks.erase(std::remove(UnrolledLoopBlocks.begin(),
UnrolledLoopBlocks.end(), Dest),		UnrolledLoopBlocks.end(), Dest),
UnrolledLoopBlocks.end());		UnrolledLoopBlocks.end());
}		}
}		}
}		}

// FIXME: We could register any cloned assumptions instead of clearing the		// FIXME: We could register any cloned assumptions instead of clearing the
// whole function's cache.		// whole function's cache.
AC->clear();		AC->clear();

		if (DT) {
// FIXME: Reconstruct dom info, because it is not preserved properly.		// FIXME: Reconstruct dom info, because it is not preserved properly.
// Incrementally updating domtree after loop unrolling would be easy.		// Incrementally updating domtree after loop unrolling would be easy.
if (DT)		if (!CompletelyUnroll)
DT->recalculate(*L->getHeader()->getParent());		DT->recalculate(*L->getHeader()->getParent());
		}

// Simplify any new induction variables in the partially unrolled loop.		// Simplify any new induction variables in the partially unrolled loop.
if (SE && !CompletelyUnroll) {		if (SE && !CompletelyUnroll) {
SmallVector<WeakVH, 16> DeadInsts;		SmallVector<WeakVH, 16> DeadInsts;
simplifyLoopIVs(L, SE, DT, LI, DeadInsts);		simplifyLoopIVs(L, SE, DT, LI, DeadInsts);

// Aggressively clean up dead instructions that simplifyLoopIVs already		// Aggressively clean up dead instructions that simplifyLoopIVs already
// identified. Any remaining should be cleaned up below.		// identified. Any remaining should be cleaned up below.
while (!DeadInsts.empty())		while (!DeadInsts.empty())
[... 94 lines not shown ...]