This is an archive of the discontinued LLVM Phabricator instance.

[PM/LoopUnswitch] Introduce a new, simpler loop unswitch pass.
ClosedPublic

Authored by chandlerc on Apr 23 2017, 3:50 PM.

Details

Summary

Currently, this pass only focuses on *trivial* loop unswitching. At that
reduced problem it remains dramatically better than the current loop
unswitch is.

  • Old pass is worse than cubic complexity. New pass is (I think) linear.
  • New pass is much simpler in its design by focusing on full unswitching. (See below for details on this).
  • New pass doesn't carry state for thresholds between pass iterations.
  • New pass doesn't carry state for correctness (both miscompile and infloop) between pass iterations.
  • New pass produces substantially better code after unswitching.
  • New pass can handle more trivial unswitch cases.
  • New pass doesn't recompute the dominator tree for the entire function and instead incrementally updates it.

I've ported all of the trivial unswitching test cases from the old pass
to the new one to make sure that major functionality isn't lost in the
process. For several of the test cases I've worked to improve the
precision and rigor of the CHECKs, but for many I've just updated them
to handle the new IR produced.

My initial motivation was the fact that the old pass carried state in
very unreliable ways between pass iterations, and these mechansims were
incompatible with the new pass manager. However, I discovered many more
improvements to make along the way.

This pass makes two very significant assumptions that enable most of these
improvements:

  1. Focus on *full* unswitching -- that is, completely removing whatever control flow construct is being unswitched from the loop. In the case of trivial unswitching, this means removing the trivial (exiting) edge. In non-trivial unswitching, this means removing the branch or switch itself. This is in opposition to *partial* unswitching where some part of the unswitched control flow remains in the loop. Partial unswitching only really applies to switches and to folded branches. These are very similar to full unrolling and partial unrolling. The full form is an effective canonicalization, the partial form needs a complex cost model, cannot be iterated, isn't canonicalizing, and should be a separate pass that runs very late (much like unrolling).
  1. Leverage LLVM's Loop machinery to the fullest. The original unswitch dates from a time when a great deal of LLVM's loop infrastructure was missing, ineffective, and/or unreliable. As a consequence, a lot of complexity was added which we no longer need.

With these two overarching principles, I think we can build a fast and
effective unswitcher that fits in well in the new PM and in the
canonicalization pipeline. Some of the remaining functionality around
partial unswitching may not be relevant today (not many test cases or
benchmarks I can find) but if they are I'd like to add support for them
as a separate layer that runs very late in the pipeline.

Purely to make reviewing and introducing this code more manageable, I've
split this into first a trivial-unswitch-only pass and in the next patch
I'll add support for full non-trivial unswitching against a *fixed*
threshold, exactly like full unrolling. I even plan to re-use the
unrolling thresholds, as these are incredibly similar cost tradeoffs:
we're cloning a loop body in order to end up with simplified control
flow. We should only do that when the total growth is reasonably small.

One of the biggest changes with this pass compared to the previous one
is that previously, each individual trivial exiting edge from a switch
was unswitched separately as a branch. Now, we unswitch the entire
switch at once, with cases going to the various destinations. This lets
us unswitch multiple exiting edges in a single operation and also avoids
numerous extremely bad behaviors, where we would introduce 1000s of
branches to test for thousands of possible values, all of which would
take the exact same exit path bypassing the loop. Now we will use
a switch with 1000s of cases that can be efficiently lowered into
a jumptable. This avoids relying on somehow forming a switch out of the
branches or getting horrible code if that fails for any reason.

Another significant change is that this pass actively updates the CFG
based on unswitching. For trivial unswitching, this is actually very
easy because of the definition of loop simplified form. Doing this makes
the code coming out of loop unswitch dramatically more friendly. We
still should run loop-simplifycfg (at the least) after this to clean up,
but it will have to do a lot less work.

Finally, this pass makes much fewer attempts to simplify instructions
based on the unswitch. Something like loop-instsimplify, instcombine, or
GVN can be used to do increasingly powerful simplifications based on the
now dominating predicate. The old simplifications are things that
something like loop-instsimplify should get today or a very, very basic
loop-instcombine could get. Keeping that logic separate is a big
simplifying technique.

Most of the code in this pass that isn't in the old one has to do with
achieving specific goals:

  • Updating teh dominator tree as we go
  • Unswitching all cases in a switch in a single step.

I think it is still shorter than just the trivial unswitching code in the old
pass despite having this functionality.

Diff Detail

Repository
rL LLVM

Event Timeline

chandlerc created this revision.Apr 23 2017, 3:50 PM
zzheng added a subscriber: zzheng.Apr 23 2017, 4:42 PM
sanjoy edited edge metadata.Apr 25 2017, 12:48 AM

Looks good overall, mostly minor comments inline.

lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
71 ↗(On Diff #96331)

I'd prefer calling this function just updateLoopExitIDom, since nothing in it is specific to unswitching.

75 ↗(On Diff #96331)

s/unswitch1/unswitch!/

80 ↗(On Diff #96331)

s/necesasrily/necessarily/

122 ↗(On Diff #96331)

Instead of special casing here, why not just

SmallSetVector<BasicBlock *, 4> Worklist;
Worklist.insert(UnswitchedNode);

and let nature run its course?

136 ↗(On Diff #96331)

s/reach1/reach!/

200 ↗(On Diff #96331)

This is very minor, but I'm used to seeing 1 - LoopExitSuccIdx in cases like these.

205 ↗(On Diff #96331)

Why is a PHI on the exit block a problem? Are you saying this because the PHI can require you to copy some computation into OldPH? If so, perhaps we can be a bit more aggressive here, and allow PHIs that have constant incoming values for the edge we're unswitching? It would be fine to do in a later patch, but please add a quick TODO or explanation if that's the reason.

223 ↗(On Diff #96331)

I found this bit of code a bit convoluted to understand. I'd have written this as (if I understood the intent correctly):

if (LoopExitBB->getSinglePredecessor()) {
  assert(LoopExitBB->getSinglePredecessor() == BI.getParent());
  UnswitchedBB = LoopExitBB
} else {
  // We know something other than BI is branching to LoopExitBB and since the 
  // loop is simplified, we know that all those other things are from within
  // the loop.  Hence the unswitched branch needs to branch to the a split out
  // block.
  UnswitchedBB = SplitBlock(LoopExitBB, &LoopExitBB->front(), &DT, &LI);
}
238 ↗(On Diff #96331)

Same suggestion here regarding 1 - LoopExitSuccIdx.

261 ↗(On Diff #96331)

s/switcih/switch/

365 ↗(On Diff #96331)

Minor, but s/conditional branch/switch/

445 ↗(On Diff #96331)

s/cans/scans/

453 ↗(On Diff #96331)

What does this routine change in the cases where nothing is unswitched?

475 ↗(On Diff #96331)

The Check the header first. comment seems out of place. I'd also be tempted to use llvm::any_of here.

485 ↗(On Diff #96331)

Are handling these cases here necessary? I'd imagine LoopSimplifyCFG to be a better place for these (especially the second one, which is just an unconditional br written as a switch).

518 ↗(On Diff #96331)

Should this be return Changed;?

davide edited edge metadata.Apr 25 2017, 2:32 PM

I focused on the logic to update the dominator and I think it's correct. Some comments below.
Also, recently there have been some changes wrt divergent conditions. Do you plan to integrate them in the new pass?

lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
17 ↗(On Diff #96331)

I'm under the impression some of these headers are dead (just copied from the old loop unswitch which had PGO support). Not a big problem, but if you can cleanup them it would be better.

50 ↗(On Diff #96331)

Do we really need NumTrivial? From what I see here it's just NumBranches + NumSwitches

122 ↗(On Diff #96331)

+1

179–184 ↗(On Diff #96331)

Instead of having getExitBlocks() taking a SmallVector<> can we have it or a variant of it which takes a SmallPtrSet ?
I could use it in LCSSA to compute the dominance chain as well, FWIW.

527 ↗(On Diff #96331)

Minor: auto as the type is obvious.

643 ↗(On Diff #96331)

Commented out code.

davide requested changes to this revision.Apr 25 2017, 2:33 PM
This revision now requires changes to proceed.Apr 25 2017, 2:33 PM

I focused on the logic to update the dominator and I think it's correct. Some comments below.
Also, recently there have been some changes wrt divergent conditions. Do you plan to integrate them in the new pass?

The reason why I'm asking is that I'm not an expert about DivergenceAnalysis so I can't really reason about all the implications there.

chandlerc updated this revision to Diff 96883.Apr 27 2017, 1:44 AM
chandlerc edited edge metadata.
chandlerc marked 15 inline comments as done.

Updates to address code review comments.

Thanks for the reviews! Updated patch and responses inline.

I focused on the logic to update the dominator and I think it's correct. Some comments below.
Also, recently there have been some changes wrt divergent conditions. Do you plan to integrate them in the new pass?

The reason why I'm asking is that I'm not an expert about DivergenceAnalysis so I can't really reason about all the implications there.

I do plan to integrate them. I'm just starting with what seemed like a reasonably useful to review subset. Does this seem like a reasonable thing to do in a follow-up?

lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
50 ↗(On Diff #96331)

Well, in the next patch we start doing non-trivial ones... I can wait until then if you prefer, I don't have a strong opinion.

122 ↗(On Diff #96331)

I mean, I still need the comment on that line, and it is no shorter, and it avoids a pointless query to the DomChain set?

I can make the change if you want, just seems odd.

179–184 ↗(On Diff #96331)

Sure, could I add that in a follow-up patch?

205 ↗(On Diff #96331)

It isn't at all. I just left this in from the old pass. I'd like to fix this in a subsequent patch though, added a comment.

223 ↗(On Diff #96331)

Oooo, I like getSinglePredecessor. Done!

453 ↗(On Diff #96331)

This comment predates me making it *not* change anything. Updated comment.

485 ↗(On Diff #96331)

No, this is a holdover from when we restarted this entire thing on each unswitch. Thanks, updated.

518 ↗(On Diff #96331)

Wow, good catch.

sanjoy accepted this revision.Apr 27 2017, 9:49 AM

This lgtm.

lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
122 ↗(On Diff #96331)

Okay what you have now lgtm.

davide accepted this revision.Apr 27 2017, 9:58 AM

LGTM. I think the divergent bits can be handled in a follow-up, just wanted to make sure this wasn't an oversight.

lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
50 ↗(On Diff #96331)

You can leave it if you plan to use for non-trivial.

179–184 ↗(On Diff #96331)

Yes, no problem.

This revision is now accepted and ready to land.Apr 27 2017, 9:58 AM
This revision was automatically updated to reflect the committed changes.

Thanks for the reviews, working on follow ups. Some are obvious and I'll just submit but let me know if you see anything weird.