This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/trunk/lib/Transforms/Utils/
-
trunk/
-
lib/
-
Transforms/
-
Utils/
-
LCSSA.cpp

Differential D31843

[LCSSA] Try to not walk the dominator tree more than necessary
ClosedPublic

Authored by davide on Apr 8 2017, 1:54 AM.

Download Raw Diff

Details

Reviewers

mzolotukhin
• dberlin

Commits

rGaf36d02430d6: [LCSSA] Efficiently compute blocks dominating at least one exit.
rL300255: [LCSSA] Efficiently compute blocks dominating at least one exit.

Summary

PR31851 (although I hit the same issue on some other, internal testcases) shows LCSSA formation taking a long time.
This patch is an idea discussed on IRC and proposed by Danny to speed up the computation.
It seems a large chunk of time is spent walking the Dominator tree while walking over the basic blocks forming a loop. We do it to skip blocks not dominating any of the loop exits, as if this condition is true, no SSA name defined in the block can be used outside of it.
Turns out, and that's at the same time funny and sad, performing the check itself is more expensive than performing the work.

This patch speeds up the checking doing the following:

Walk the domtree in breadth-first order, keeping tracks of levels and numbering nodes according to the BF order.
Walk loops in domtree order, performing the dominance checks only on exits above the current BB.

(Danny suggests we could do better just looking at the subtrees, but that's probably an improvement for later).

The patch needs cleanup, and needs to be slightly reworked as the domtree computation is performed in runOnFunction but there are many loop passes which call formLCSSARecursively() as an utility and could benefit from this speedup (although they currently don't). So, consider it a WIP, but, goes without saying, comments welcome.

Figures: opt -O2 on the test attached to PR31851

Baseline:

real    1m45.731s
user    1m45.533s
sys     0m0.198s

Patch:

real    1m35.519s
user    1m35.318s
sys     0m0.201s

Unless I measured this wrong (which is possible, given it's 2AM), it's a net 10% reduction in compile time.

Diff Detail

Repository: rL LLVM

Event Timeline

davide created this revision.Apr 8 2017, 1:54 AM

This is not quite right, I'll update a new revision soon.

Updated version, which is correct but not as fast :( (just 2 seconds faster than the unoptimized version).
Apparently we still spend too much time walking the dominator. I'll take a look.

I realized part of the time spent by blockDominatesAnExit is doing *redundant* map lookups (and not doing the dominance check proper), which I think should always be constant time in this case as the dominator DFS numbering doesn't change. Precomputing the domtree nodes in advance saves a bit of computation and makes thing faster (at least on this testcase), but I'm unsure whether this would be a win in general (time goes down from 1:45m to 1:40).

Still a little disturbing that skipping the check entirely is *much* faster.

--- a/lib/Transforms/Utils/LCSSA.cpp
+++ b/lib/Transforms/Utils/LCSSA.cpp
@@ -242,10 +242,10 @@ bool llvm::formLCSSAForInstructions(SmallVectorImpl<Instruction *> &Worklist,
 static bool
 blockDominatesAnExit(BasicBlock *BB,
                      DominatorTree &DT,
-                     const SmallVectorImpl<BasicBlock *> &ExitBlocks) {
+                     const SmallVectorImpl<DomTreeNode *> &ExitBlocks) {
   DomTreeNode *DomNode = DT.getNode(BB);
-  return any_of(ExitBlocks, [&](BasicBlock *EB) {
-    return DT.dominates(DomNode, DT.getNode(EB));
+  return any_of(ExitBlocks, [&](DomTreeNode *EB) {
+    return DT.dominates(DomNode, EB);
   });
 }

@@ -260,6 +260,10 @@ bool llvm::formLCSSA(Loop &L, DominatorTree &DT, LoopInfo *LI,
   if (ExitBlocks.empty())
     return false;

+  SmallVector<DomTreeNode *, 8> DomTreeNodeBlocks;
+  for (auto *EBB : ExitBlocks)
+    DomTreeNodeBlocks.push_back(DT.getNode(EBB));
+
   SmallVector<Instruction *, 8> Worklist;

   // Look at all the instructions in the loop, checking to see if they have uses
@@ -268,7 +272,7 @@ bool llvm::formLCSSA(Loop &L, DominatorTree &DT, LoopInfo *LI,
     // For large loops, avoid use-scanning by using dominance information:  In
     // particular, if a block does not dominate any of the loop exits, then none
     // of the values defined in the block could be used outside the loop.
-    if (!blockDominatesAnExit(BB, DT, ExitBlocks))
+    if (!blockDominatesAnExit(BB, DT, DomTreeNodeBlocks))
       continue;

     for (Instruction &I : *BB) {

• dberlin added inline comments.Apr 10 2017, 12:10 PM

lib/Transforms/Utils/LCSSA.cpp
285 ↗	(On Diff #94626)	Just to record ideas (i'll review the code in a sec) You could actually do a little better here: If you sort looptrav in level order, you can just remove the exit above you from the set once you get to a greater level number, instead of trying to skip it. You can actually do even better than that if you want - you can do what we do with valuedfs's: You can sort the exits and the loop traversal into dom-tree DFS number order, in the same list. Walk through them both at once with a stack. Push loop traversal blocks onto the stack, pop them when they no longer dominate the thing you are looking at (either exit or loop traversal) Every exit you come across must be processed for all loop blocks still on the stack after popping to get to the right dominance scope. (dominance is transitive, so if a dominates b, and b dominates c, a dominates c)

If the slow part is checking if the block dominates any exit, how about we precompute set of blocks meeting this requirement in advance?

E.g. if we start from the set of exiting blocks and go up the dom-tree until we hit the loop header, we should end up with a set of blocks dominating at least one exit. Would it be faster than what we have now?

In D31843#722830, @mzolotukhin wrote:

If the slow part is checking if the block dominates any exit, how about we precompute set of blocks meeting this requirement in advance?

E.g. if we start from the set of exiting blocks and go up the dom-tree until we hit the loop header, we should end up with a set of blocks dominating at least one exit. Would it be faster than what we have now?

This is precisely equivalent to the stack algorithm i outlined.
;)

This is precisely equivalent to the stack algorithm i outlined.

I didn't refresh the page before posting, sorry :)

Regarding your description: why do we need to sort exits and then check if the current block dominates an exit/other blocks from the set? Don't we get that for free just from the way we're building the set? I.e. we started from the set of exiting blocks - they all dominate at least one exit by definition. We put them to worklist/stack. We pop a block from the worklist, and if it's not the loop header, we add its immediate dominator to the worklist. By construction and from transitiveness of dominance property, the added block also dominates at least one exit. We continue until the worklist isn't empty. The algorithm is guaranteed to finish if the loop is well-formed (i.e. has one entry, which is the loop header).

All the blocks that we ever added to the worklist are the blocks that dominate at least one exit and the expensive check is replaced with a lookup in this set.

Sorry if it's again the same algorithm as you suggested - but at least we'll figure that out :)

Michael

In D31843#722830, @mzolotukhin wrote:

If the slow part is checking if the block dominates any exit, how about we precompute set of blocks meeting this requirement in advance?

E.g. if we start from the set of exiting blocks and go up the dom-tree until we hit the loop header, we should end up with a set of blocks dominating at least one exit. Would it be faster than what we have now?

That's only one part of the slowdown. The other bit is that apparently doing repeated lookups in the maps BasicBlock * -> DomTreeNode *, i.e. getNode() is expensive. The patch in my previous comment tries to work-around with an hack precomputing the set once and reusing. If I understand the newly proposed algorithm, it should reduce the number of redundant Dom lookups as a side effect, so we should be fine.

In D31843#722874, @mzolotukhin wrote:

This is precisely equivalent to the stack algorithm i outlined.

I didn't refresh the page before posting, sorry :)

Regarding your description: why do we need to sort exits and then check if the current block dominates an exit/other blocks from the set?
Don't we get that for free just from the way we're building the set? I.e. we started from the set of exiting blocks - they all dominate at least one exit by definition.

Yes, i didn't pay attention to the details of the incoming set.
Your algorithm and mine will generate the same set.
They just enumerate them in opposite directions :)
(mine top down, yours bottom up)

SGTM if yours is easier :)

I assume it is not possible for an exit to dominate another exit given the way we define loop exits
(IE walking up, you will never hit another exit)

Thank you for the braindump/ideas suggestion.
The new patch is here and it's 25% faster than the unpatched version on the testcase included in the PR

real    1m22.833s
user    1m22.663s
sys     0m0.170s

davide added inline comments.Apr 12 2017, 6:03 PM

lib/Transforms/Utils/LCSSA.cpp
90 ↗	(On Diff #95068)	Please ignore this assert, it helped me to find bugs during the development, I think it can committed separately (with a proper message)
274–277 ↗	(On Diff #95068)	I'm not entirely sure if this one is needed. I'll try to find examples where this can happen && remove and see if that triggers failures in the testsuite.

I convinced myself that the check is required.
Consider the following IR:

define void @tinkywinky() {
entry:
  br label %preheader
preheader:                                        ; preds = %loop, %entry
  br label %loop
loop:                                             ; preds = %stuff, %preheader
  indirectbr i8* undef, [label %preheader, label %stuff]
stuff:                                            ; preds = %loop
  br label %loop
}

We have one loop composed by two blocks [loop (header), stuffs] and one exit block [preheader].
preheader immediate dominator is entry, which doesn't belong to the loop, so I think that shouldn't be included in the set we're building (as we're not interested), but without the check it would. Does it make sense?

The new patch is here and it's 25% faster than the unpatched version on the testcase included in the PR

Nice!

The patch looks good to me (not surprising probably, since I suggested the same :) ). One nitpick to consider: maybe factor out the function to compute BlocksDominatingExits to keep the logic of formLCSSA clearer.

Michael

In D31843#725624, @mzolotukhin wrote:

The new patch is here and it's 25% faster than the unpatched version on the testcase included in the PR

Nice!

The patch looks good to me (not surprising probably, since I suggested the same :) ). One nitpick to consider: maybe factor out the function to compute BlocksDominatingExits to keep the logic of formLCSSA clearer.

Michael

I'll wait for Dan if he has additional comments, but other than that I'd like to point out that LCSSA is still very slow ;)
I think the next thing I'll try will be hooking the new SSAUpdater as we spend a lot of time there.

Closed by commit rL300255: [LCSSA] Efficiently compute blocks dominating at least one exit. (authored by davide). · Explain WhyApr 13 2017, 1:49 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

trunk/

lib/

Transforms/

Utils/

LCSSA.cpp

73 lines

Diff 95201

llvm/trunk/lib/Transforms/Utils/LCSSA.cpp

Show First 20 Lines • Show All 232 Lines • ▼ Show 20 Lines	bool llvm::formLCSSAForInstructions(SmallVectorImpl<Instruction *> &Worklist,
// Remove PHI nodes that did not have any uses rewritten.		// Remove PHI nodes that did not have any uses rewritten.
for (PHINode *PN : PHIsToRemove) {		for (PHINode *PN : PHIsToRemove) {
assert (PN->use_empty() && "Trying to remove a phi with uses.");		assert (PN->use_empty() && "Trying to remove a phi with uses.");
PN->eraseFromParent();		PN->eraseFromParent();
}		}
return Changed;		return Changed;
}		}

/// Return true if the specified block dominates at least		// Compute the set of BasicBlocks in the loop `L` dominating at least one exit.
/// one of the blocks in the specified list.		static void computeBlocksDominatingExits(
static bool		Loop &L, DominatorTree &DT, SmallVector<BasicBlock *, 8> &ExitBlocks,
blockDominatesAnExit(BasicBlock *BB,		SmallPtrSet<BasicBlock *, 8> &BlocksDominatingExits) {
DominatorTree &DT,		SmallVector<BasicBlock *, 8> BBWorklist;
const SmallVectorImpl<BasicBlock *> &ExitBlocks) {
DomTreeNode *DomNode = DT.getNode(BB);		// We start from the exit blocks, as every block trivially dominates itself
return any_of(ExitBlocks, [&](BasicBlock *EB) {		// (not strictly).
return DT.dominates(DomNode, DT.getNode(EB));		for (BasicBlock *BB : ExitBlocks)
});		BBWorklist.push_back(BB);

		while (!BBWorklist.empty()) {
		BasicBlock *BB = BBWorklist.pop_back_val();

		// Check if this is a loop header. If this is the case, we're done.
		if (L.getHeader() == BB)
		continue;

		// Otherwise, add its immediate predecessor in the dominator tree to the
		// worklist, unless we visited it already.
		BasicBlock *IDomBB = DT.getNode(BB)->getIDom()->getBlock();

		// Exit blocks can have an immediate dominator not beloinging to the
		// loop. For an exit block to be immediately dominated by another block
		// outside the loop, it implies not all paths from that dominator, to the
		// exit block, go through the loop.
		// Example:
		//
		// \|---- A
		// \| \|
		// \| B<--
		// \| \| \|
		// \|---> C --
		// \|
		// D
		//
		// C is the exit block of the loop and it's immediately dominated by A,
		// which doesn't belong to the loop.
		if (!L.contains(IDomBB))
		continue;

		if (BlocksDominatingExits.insert(IDomBB).second)
		BBWorklist.push_back(IDomBB);
		}
}		}

bool llvm::formLCSSA(Loop &L, DominatorTree &DT, LoopInfo *LI,		bool llvm::formLCSSA(Loop &L, DominatorTree &DT, LoopInfo *LI,
ScalarEvolution *SE) {		ScalarEvolution *SE) {
bool Changed = false;		bool Changed = false;

// Get the set of exiting blocks.
SmallVector<BasicBlock *, 8> ExitBlocks;		SmallVector<BasicBlock *, 8> ExitBlocks;
L.getExitBlocks(ExitBlocks);		L.getExitBlocks(ExitBlocks);

if (ExitBlocks.empty())		if (ExitBlocks.empty())
return false;		return false;

		SmallPtrSet<BasicBlock *, 8> BlocksDominatingExits;

		// We want to avoid use-scanning leveraging dominance informations.
		// If a block doesn't dominate any of the loop exits, the none of the values
		// defined in the loop can be used outside.
		// We compute the set of blocks fullfilling the conditions in advance
		// walking the dominator tree upwards until we hit a loop header.
		computeBlocksDominatingExits(L, DT, ExitBlocks, BlocksDominatingExits);

SmallVector<Instruction *, 8> Worklist;		SmallVector<Instruction *, 8> Worklist;

// Look at all the instructions in the loop, checking to see if they have uses		// Look at all the instructions in the loop, checking to see if they have uses
// outside the loop. If so, put them into the worklist to rewrite those uses.		// outside the loop. If so, put them into the worklist to rewrite those uses.
for (BasicBlock *BB : L.blocks()) {		for (BasicBlock *BB : BlocksDominatingExits) {
// For large loops, avoid use-scanning by using dominance information: In
// particular, if a block does not dominate any of the loop exits, then none
// of the values defined in the block could be used outside the loop.
if (!blockDominatesAnExit(BB, DT, ExitBlocks))
continue;

for (Instruction &I : *BB) {		for (Instruction &I : *BB) {
// Reject two common cases fast: instructions with no uses (like stores)		// Reject two common cases fast: instructions with no uses (like stores)
// and instructions with one use that is in the same block as this.		// and instructions with one use that is in the same block as this.
if (I.use_empty() \|\|		if (I.use_empty() \|\|
(I.hasOneUse() && I.user_back()->getParent() == BB &&		(I.hasOneUse() && I.user_back()->getParent() == BB &&
!isa<PHINode>(I.user_back())))		!isa<PHINode>(I.user_back())))
continue;		continue;

▲ Show 20 Lines • Show All 124 Lines • Show Last 20 Lines