This is an archive of the discontinued LLVM Phabricator instance.

[LCSSA] Try to not walk the dominator tree more than necessary
ClosedPublic

Authored by davide on Apr 8 2017, 1:54 AM.

Details

Summary

PR31851 (although I hit the same issue on some other, internal testcases) shows LCSSA formation taking a long time.
This patch is an idea discussed on IRC and proposed by Danny to speed up the computation.
It seems a large chunk of time is spent walking the Dominator tree while walking over the basic blocks forming a loop. We do it to skip blocks not dominating any of the loop exits, as if this condition is true, no SSA name defined in the block can be used outside of it.
Turns out, and that's at the same time funny and sad, performing the check itself is more expensive than performing the work.

This patch speeds up the checking doing the following:

  1. Walk the domtree in breadth-first order, keeping tracks of levels and numbering nodes according to the BF order.
  2. Walk loops in domtree order, performing the dominance checks only on exits above the current BB.

(Danny suggests we could do better just looking at the subtrees, but that's probably an improvement for later).

The patch needs cleanup, and needs to be slightly reworked as the domtree computation is performed in runOnFunction but there are many loop passes which call formLCSSARecursively() as an utility and could benefit from this speedup (although they currently don't). So, consider it a WIP, but, goes without saying, comments welcome.

Figures: opt -O2 on the test attached to PR31851

Baseline:

real    1m45.731s
user    1m45.533s
sys     0m0.198s

Patch:

real    1m35.519s
user    1m35.318s
sys     0m0.201s

Unless I measured this wrong (which is possible, given it's 2AM), it's a net 10% reduction in compile time.

Diff Detail

Repository
rL LLVM

Event Timeline

davide created this revision.Apr 8 2017, 1:54 AM
davide added a comment.Apr 9 2017, 4:22 PM

This is not quite right, I'll update a new revision soon.

davide updated this revision to Diff 94626.Apr 9 2017, 6:01 PM

Updated version, which is correct but not as fast :( (just 2 seconds faster than the unoptimized version).
Apparently we still spend too much time walking the dominator. I'll take a look.

davide added a comment.Apr 9 2017, 7:27 PM

I realized part of the time spent by blockDominatesAnExit is doing *redundant* map lookups (and not doing the dominance check proper), which I think should always be constant time in this case as the dominator DFS numbering doesn't change. Precomputing the domtree nodes in advance saves a bit of computation and makes thing faster (at least on this testcase), but I'm unsure whether this would be a win in general (time goes down from 1:45m to 1:40).

Still a little disturbing that skipping the check entirely is *much* faster.

--- a/lib/Transforms/Utils/LCSSA.cpp
+++ b/lib/Transforms/Utils/LCSSA.cpp
@@ -242,10 +242,10 @@ bool llvm::formLCSSAForInstructions(SmallVectorImpl<Instruction *> &Worklist,
 static bool
 blockDominatesAnExit(BasicBlock *BB,
                      DominatorTree &DT,
-                     const SmallVectorImpl<BasicBlock *> &ExitBlocks) {
+                     const SmallVectorImpl<DomTreeNode *> &ExitBlocks) {
   DomTreeNode *DomNode = DT.getNode(BB);
-  return any_of(ExitBlocks, [&](BasicBlock *EB) {
-    return DT.dominates(DomNode, DT.getNode(EB));
+  return any_of(ExitBlocks, [&](DomTreeNode *EB) {
+    return DT.dominates(DomNode, EB);
   });
 }

@@ -260,6 +260,10 @@ bool llvm::formLCSSA(Loop &L, DominatorTree &DT, LoopInfo *LI,
   if (ExitBlocks.empty())
     return false;

+  SmallVector<DomTreeNode *, 8> DomTreeNodeBlocks;
+  for (auto *EBB : ExitBlocks)
+    DomTreeNodeBlocks.push_back(DT.getNode(EBB));
+
   SmallVector<Instruction *, 8> Worklist;

   // Look at all the instructions in the loop, checking to see if they have uses
@@ -268,7 +272,7 @@ bool llvm::formLCSSA(Loop &L, DominatorTree &DT, LoopInfo *LI,
     // For large loops, avoid use-scanning by using dominance information:  In
     // particular, if a block does not dominate any of the loop exits, then none
     // of the values defined in the block could be used outside the loop.
-    if (!blockDominatesAnExit(BB, DT, ExitBlocks))
+    if (!blockDominatesAnExit(BB, DT, DomTreeNodeBlocks))
       continue;

     for (Instruction &I : *BB) {
dberlin added inline comments.Apr 10 2017, 12:10 PM
lib/Transforms/Utils/LCSSA.cpp
285 ↗(On Diff #94626)

Just to record ideas (i'll review the code in a sec)
You could actually do a little better here:
If you sort looptrav in *level* order, you can just remove the exit above you from the set once you get to a greater level number, instead of trying to skip it.

You can actually do even better than that if you want - you can do what we do with valuedfs's:
You can sort the exits and the loop traversal into dom-tree DFS number order, in the same list.
Walk through them both at once with a stack.
Push loop traversal blocks onto the stack, pop them when they no longer dominate the thing you are looking at (either exit or loop traversal)
Every exit you come across must be processed for all loop blocks still on the stack after popping to get to the right dominance scope.

(dominance is transitive, so if a dominates b, and b dominates c, a dominates c)

mzolotukhin edited edge metadata.Apr 10 2017, 12:36 PM

If the slow part is checking if the block dominates any exit, how about we precompute set of blocks meeting this requirement in advance?

E.g. if we start from the set of exiting blocks and go up the dom-tree until we hit the loop header, we should end up with a set of blocks dominating at least one exit. Would it be faster than what we have now?

dberlin edited edge metadata.Apr 10 2017, 12:41 PM

If the slow part is checking if the block dominates any exit, how about we precompute set of blocks meeting this requirement in advance?

E.g. if we start from the set of exiting blocks and go up the dom-tree until we hit the loop header, we should end up with a set of blocks dominating at least one exit. Would it be faster than what we have now?

This is precisely equivalent to the stack algorithm i outlined.
;)

This is precisely equivalent to the stack algorithm i outlined.

I didn't refresh the page before posting, sorry :)

Regarding your description: why do we need to sort exits and then check if the current block dominates an exit/other blocks from the set? Don't we get that for free just from the way we're building the set? I.e. we started from the set of exiting blocks - they all dominate at least one exit by definition. We put them to worklist/stack. We pop a block from the worklist, and if it's not the loop header, we add its immediate dominator to the worklist. By construction and from transitiveness of dominance property, the added block also dominates at least one exit. We continue until the worklist isn't empty. The algorithm is guaranteed to finish if the loop is well-formed (i.e. has one entry, which is the loop header).

All the blocks that we ever added to the worklist are the blocks that dominate at least one exit and the expensive check is replaced with a lookup in this set.

Sorry if it's again the same algorithm as you suggested - but at least we'll figure that out :)

Michael

If the slow part is checking if the block dominates any exit, how about we precompute set of blocks meeting this requirement in advance?

E.g. if we start from the set of exiting blocks and go up the dom-tree until we hit the loop header, we should end up with a set of blocks dominating at least one exit. Would it be faster than what we have now?

That's only one part of the slowdown. The other bit is that apparently doing repeated lookups in the maps BasicBlock * -> DomTreeNode *, i.e. getNode() is expensive. The patch in my previous comment tries to work-around with an hack precomputing the set once and reusing. If I understand the newly proposed algorithm, it should reduce the number of redundant Dom lookups as a side effect, so we should be fine.

This is precisely equivalent to the stack algorithm i outlined.

I didn't refresh the page before posting, sorry :)

Regarding your description: why do we need to sort exits and then check if the current block dominates an exit/other blocks from the set?
Don't we get that for free just from the way we're building the set? I.e. we started from the set of exiting blocks - they all dominate at least one exit by definition.

Yes, i didn't pay attention to the details of the incoming set.
Your algorithm and mine will generate the same set.
They just enumerate them in opposite directions :)
(mine top down, yours bottom up)

SGTM if yours is easier :)

I assume it is not possible for an exit to dominate another exit given the way we define loop exits
(IE walking up, you will never hit another exit)

davide updated this revision to Diff 95068.Apr 12 2017, 6:01 PM

Thank you for the braindump/ideas suggestion.
The new patch is here and it's 25% faster than the unpatched version on the testcase included in the PR

real    1m22.833s
user    1m22.663s
sys     0m0.170s
davide added inline comments.Apr 12 2017, 6:03 PM
lib/Transforms/Utils/LCSSA.cpp
90 ↗(On Diff #95068)

Please ignore this assert, it helped me to find bugs during the development, I think it can committed separately (with a proper message)

274–277 ↗(On Diff #95068)

I'm not entirely sure if this one is needed. I'll try to find examples where this can happen && remove and see if that triggers failures in the testsuite.

I convinced myself that the check is required.
Consider the following IR:

define void @tinkywinky() {
entry:
  br label %preheader
preheader:                                        ; preds = %loop, %entry
  br label %loop
loop:                                             ; preds = %stuff, %preheader
  indirectbr i8* undef, [label %preheader, label %stuff]
stuff:                                            ; preds = %loop
  br label %loop
}

We have one loop composed by two blocks [loop (header), stuffs] and one exit block [preheader].
preheader immediate dominator is entry, which doesn't belong to the loop, so I think that shouldn't be included in the set we're building (as we're not interested), but without the check it would. Does it make sense?

The new patch is here and it's 25% faster than the unpatched version on the testcase included in the PR

Nice!

The patch looks good to me (not surprising probably, since I suggested the same :) ). One nitpick to consider: maybe factor out the function to compute BlocksDominatingExits to keep the logic of formLCSSA clearer.

Michael

The new patch is here and it's 25% faster than the unpatched version on the testcase included in the PR

Nice!

The patch looks good to me (not surprising probably, since I suggested the same :) ). One nitpick to consider: maybe factor out the function to compute BlocksDominatingExits to keep the logic of formLCSSA clearer.

Michael

I'll wait for Dan if he has additional comments, but other than that I'd like to point out that LCSSA is still very slow ;)
I think the next thing I'll try will be hooking the new SSAUpdater as we spend a lot of time there.

This revision was automatically updated to reflect the committed changes.