Page MenuHomePhabricator

[PM/unswitch] Teach SimpleLoopUnswitch to do non-trivial unswitching, making it no longer even remotely simple.
ClosedPublic

Authored by chandlerc on Jun 14 2017, 3:25 AM.

Details

Summary

The pass will now be more of a "full loop unswitching" pass rather than
anything substantively simpler than any other approach. I plan to rename
it accordingly once the dust settles.

The key ideas of the new loop unswitcher are carried over for
non-trivial unswitching:

  1. Fully unswitch a branch or switch instruction from inside of a loop to outside of it.
  2. Update the CFG and IR. This avoids needing to "remember" the unswitched branches as well as avoiding excessively cloning and reliance on complex parts of simplify-cfg to cleanup the cfg.
  3. Update the analyses rather than just blowing them away or relying on something else updating them.

Sadly, #3 is somewhat compromised here as the dominator tree updates
were too complex for me to want to reason about. I may take another stab
at it, but it may be best to wait for something like the newly proposed
dynamic dominators. However, we do adhere to #3 w.r.t. LoopInfo.

This approach also an some important principles specific to non-trivial
unswitching: not *all* of the loop will be duplicated when unswitching.
This fact allows us to compute the cost in terms of how much *duplicate*
code is inserted rather than just on raw size. Unswitching conditions
which essentialy partition loops will work regardless of how large the
loop is in reality.

Unfortunately, there is a *lot* of code to implement all of this. And
I'm still not really happy with all of it. I think there is more
factoring and cleanup to be done here, but I wanted to at least get it
out where others can see and comment on it.

Some high level outstanding things that I'd like to defer to subsequent
patches:

  • We could be much more clever about not cloning things that will be deleted. In fact, we should be able to delete *nothing* and do a minimal number of clones.
  • There are many more interesting selection criteria for which branch to unswitch that we might want to look at. One that I'm interested in particularly are a set of conditions which all exit the loop and which can be merged into a single unswitched test of them.

Anyways, even with somewhat rough code, hopefuly folks can start chewing
on this and giving feedback.

Depends on D34049.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sanjoy added inline comments.Jul 12 2017, 10:16 PM
lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
1907 ↗(On Diff #102518)

Question though, is it useful to keep the call to actually do the unswitch separate?

No strong opinion on that.

95 ↗(On Diff #104372)

This bit looks separable, all of the call sites to this function is in pre-existing code.

186 ↗(On Diff #104372)

This bit looks separable to me -- no new call sites were introduced in this patch.

1133 ↗(On Diff #104372)

Document the return value.

1267 ↗(On Diff #104372)

I'd s/UnloopedSet/UnloopedBlocks/ (I think the Blocks suffix adds more information than the Set suffix).

1357 ↗(On Diff #104372)

Again, I can't see when !L.contains(BBL) will kick in.

1589 ↗(On Diff #104372)

Define the direction of "conservative" here -- does it mean "A conservative-dominates B implies A actually-dominates B" or the converse?

1640 ↗(On Diff #104372)

Why is DontDeleteUselessPHIs false?

1696 ↗(On Diff #104372)

When you say "cloned sibling loop", do you include all of the loops that used to be subloops of L and have now been reparented? I can't think of a case in which the exit block set for those will change.

1719 ↗(On Diff #104372)

As I said above, I'm not convinced that we need this bit.

1726 ↗(On Diff #104372)

Instead of commenting on what's at the top of the stack, how about phrasing the comment as on the order in which we will process the loops?

s/ALso/Also/

1740 ↗(On Diff #104372)

s/enclusing/enclosing/

1761 ↗(On Diff #104372)

s/entiry/entire/

This revision now requires changes to proceed.Jul 12 2017, 10:16 PM
chandlerc updated this revision to Diff 107260.Jul 19 2017, 1:28 AM
chandlerc edited edge metadata.
chandlerc marked 18 inline comments as done.

Updated based on most of the code review comments from Sanjoy. Still some pending.

Most of the comments addressed or responded to here... Still working on the actual refactoring ,sorry for the delay there.

lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
767 ↗(On Diff #104372)

I used to use addBasicBlockToLoop but it proved very error prone. It added the blocks to too many parent loops and changed the *order* in which the blocks were added to parent loops. I'd much rather keep this unless there is a functional issue here.

867 ↗(On Diff #104372)

Unreachable code within the loop that gets cloned, but when we do our backwards walk is not found and associated with the loop.

This is a somewhat hypothetical (I don't have a test case) but it seemed principled.

891 ↗(On Diff #104372)

Same answer as above, tried to add rationale. Happy to update this if I'm missing something, I've not thought deeply here.

896 ↗(On Diff #104372)

Not trivially. We don't always clone a loop and we return the cloned loop if any. Suggestions?

898 ↗(On Diff #104372)

Because this seemed like a more precise and definitive result?

Essentially, *if* we establish a loop here, we can reason about the header and preheader using that information alone. This in turn should prune the UnloopedBlockSet logic below. I'm not sure why sinking this would be more clear?

907 ↗(On Diff #104372)

I tried to clarify this comment, not sure how successful I was.

924 ↗(On Diff #104372)

Continuing discussion there.

989 ↗(On Diff #104372)

No?

The UnloopedBlockSet might have the ClonedPH in it, and if it does, we could erase it once, and put it into the worklist. We'd then visit it and not want to walk the predecessors of that cloned PH.

1002 ↗(On Diff #104372)

So, LoopInfo goes to some trouble to avoid its block order relying on use-list order via the predecessors order. Regardless of whether that's important or not, I think it is reasonable to try and continue here as unswitch isn't the thing that should walk that back.

I also personally agree with that call. I don't like use-list order dependencies anywhere we can avoid them. I understand that they're nearly impossible to avoid today, but I think we would benefit greatly from moving away from that any and everywhere we can.

1263 ↗(On Diff #104372)

Because the whole idea is that the loop blocks list should have a stable ordering, and so I don't want to permute it while partitioning? Lots of transforms' order depends on the loop block order.

1280 ↗(On Diff #104372)

Same answer as the rest of these... My default is stable. If there is a reason why the order of this *cannot* impact the order of IR data structures that impact the order of transforms downstream, that's worth optimizing, but until then we should preserve incoming order of things IMO.

1287 ↗(On Diff #104372)

There really is no good name here. But the capture is just confusing. I think this can and should be a function w/o any capture that just operates on its arguments, removing the shadowing. Does taht work for you?

1314 ↗(On Diff #104372)

See above about preheader. I think the same logic applies to both routines.

1343 ↗(On Diff #104372)

I'm not following ... exit blocks of the original L are exactly what would be identified by this check? My brain is fuzzy on this, so maybe I'm just misunderstanding the point you're making.

1357 ↗(On Diff #104372)

Hoping that these two issues are the same.

1589 ↗(On Diff #104372)

This whole thing is kinda garbage.

Anyways, tried to rephrase it. But not sure this really addresses the concern. Maybe I should nuke all usage of domtree blewo this to avoid any concerns.

1640 ↗(On Diff #104372)

Because they could be LCSSA PHIs?

1696 ↗(On Diff #104372)

Yes, I do include those.

We can change which loop an exit block fo a cloned sibling belongs to. Not sure if we can actually change the exit set. But is there a meaningful distinction here?

1719 ↗(On Diff #104372)

Which bit? i'm not sure I'm following you here.

1726 ↗(On Diff #104372)

I don't think there is an order that I can reasonably document other than the first one?

sanjoy added inline comments.Jul 28 2017, 2:19 PM
lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
767 ↗(On Diff #104372)

SGTM

867 ↗(On Diff #104372)

It looks like LoopInfo filters out unreachable blocks (and does not insert them into llvm::Loops): https://github.com/llvm-mirror/llvm/blob/master/include/llvm/Analysis/LoopInfoImpl.h#L360 so IIUC that situation exactly won't happen. Are there any other such cases?

896 ↗(On Diff #104372)

Never mind then. :)

898 ↗(On Diff #104372)

I was thinking about this as -- the preheader is just another block that did not get included into ClonedL so it should be handled by the generic logic that handles such blocks.

907 ↗(On Diff #104372)

I might use "not arbitrary" instead of "stable". "stable" seems to imply "deterministic" which is slightly confusing since predecesor ordering should be deterministic as well.

989 ↗(On Diff #104372)

Sorry, I forgot the case where the loop disappears because we delete all of the backedges. However, I think we should be able to assert BlocksInClonedLoop.empty() under if (BB == ClonedPH) (since the header of the cloned loop should still dominate the exit, and at least that header won't be in UnloopedBlockSet?

1002 ↗(On Diff #104372)

In that case, the problem is that not that the use list order is not deterministic but that it is nonsensical, so I'd change the comment to reflect that.

1280 ↗(On Diff #104372)

That ("prefer stable unless you can prove it does not matter") SGTM.

1287 ↗(On Diff #104372)

SGTM.

1314 ↗(On Diff #104372)

I've replied above on this.

1343 ↗(On Diff #104372)

I meant to say:

  • ExitLoopSet contains ExitBB and a set of blocks that were internal to L.
  • The loop for ExitBB is already ExitL so we don't need to touch it.
  • For the blocks that were internal to L, L.contains(BBL) will never never be false except for the case when BBL == &L.
1357 ↗(On Diff #104372)

Replied above.

1589 ↗(On Diff #104372)

Maybe I should nuke all usage of domtree blewo this to avoid any concerns.

If that's possible, I'd certainly prefer that.

I'd phrase the comment as "we know that the split point definitely dominates the merge block" or something like that.

1640 ↗(On Diff #104372)

I see -- can DontDeleteUselessPHIs be !L.contains(ContinueSuccBB) in that case?

1719 ↗(On Diff #104372)

Updating LCSSA for NonChildClonedLoops and HoistedLoops. Since we didn't break apart these loops, we should not have new uses of values defined inside these loops outside these loops (as opposed to the L, ClonedL and their parents, which we did split up and thus uses for interior values that were interior before may be exterior now).

1726 ↗(On Diff #104372)

I meant that you start with L and ClonedL (if they still exist) and move up the loop tree till OuterExitL.

Rebase onto the new LoopInfo API and get everything working. Also rebase into
a monorepo checkout.

Things are now working again and this patch can be experimented with, but
I still need to do some refactoring before it is ready for further review (that
is the last significant outstanding review comment AFAIK).

chandlerc marked 38 inline comments as done.Oct 16 2017, 12:52 AM

Updates and/or responses. Largely addressed all the comments with the next patch upload.

lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
867 ↗(On Diff #104372)

I've turned this into an assert.

907 ↗(On Diff #104372)

Reworded to be even less confusing I think by just directly stating the constraint: don't rely on predecessor order.

1343 ↗(On Diff #104372)

Sorry, yes, this is trivially true. Sorry for not seeing this the first time around.

1589 ↗(On Diff #104372)

I think the comment is better now, but let me know if it still doesn't make you happy.

I've checked, and we *only* use the dominator tree on blocks in the original loop, so these nodes aren't important. I've also documented what is going on there.

I think the only realistic solution in the end will be to teach this to accurately update the dominator tree, but that should be a separate patch.

1640 ↗(On Diff #104372)

Couldn't there still be PHIs that are necessary for LCSSA form of some nested loop?

Generally, this seems risky and low value. Why waste time proving that it could be false?

1719 ↗(On Diff #104372)

Maybe I'm being overly conservative here, but I'm having a very hard time convincing myself that we have a constructive reason why LCSSA will be correct here.

We are fundamentally changing the CFG when cloning. I feel like we will keep finding devilish cases where that CFG change subtly changes the LCSSA solution set. Maybe not, but I'm quite paranoid here and not able to really convince myself through either a proof or testing that this is correct without this. Maybe you can? Maybe this should be a follow-up patch?

chandlerc marked 4 inline comments as done.Oct 16 2017, 1:02 AM
chandlerc added inline comments.
lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
1343 ↗(On Diff #104372)

Err, I actually have a test case that checks this. ;] Without this, we orphan basic blocks. Added this and the check below back.

chandlerc updated this revision to Diff 119114.Oct 16 2017, 1:03 AM
chandlerc marked an inline comment as done.

Update addressing the remaining comments from Sanjoy.

Notably, this includes the majority of work to address the factoring comments.
I've now factored several parts of this into separate functions to make things
a bit more consumable.

I think this is all ready for review again.

sanjoy accepted this revision.Nov 13 2017, 9:59 AM

I skimmed through the changes, and they lgtm. I didn't try to follow the logic closely since I assumed most of the actual mechanics stay the same (but please let me know if that's not true and there are parts of the change that I should review again).

Also, looks like the issue of unstructured control flow was not handled (unswitching a loop with unstructured control flow can create child loops that weren't present before)? That's fine to address in a separate change.

This revision is now accepted and ready to land.Nov 13 2017, 9:59 AM
davide edited edge metadata.Nov 13 2017, 3:47 PM

I have few comments. As Sanjoy already approved, feel free to update this and check in without another round trip.
Most of them can be addressed in follow-ups, but I thought they're worth mentioning regardless. Thank you for pushing forward this.
It's been a long time coming :)

llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
1139 ↗(On Diff #119114)

Can you add a comment here explaining with a non-stable sort is not enough?

1245 ↗(On Diff #119114)

This doesn't (yet) use the new dominator incremental update API, from what I can tell.
I think it can be done in a separate change, but I think it's worth doing instead of rolling its own update logic.

1789 ↗(On Diff #119114)

same here.

1809 ↗(On Diff #119114)

JFYI, rebuilding LCSSA in LLVM is currently a sledgehammer for some use cases.
It's of course, not your fault, as the algorithm is O(N^3) [I can provide cases that are tight :)], but let's be prepared to fix it in case this turns out to be a compile time sink.

1858 ↗(On Diff #119114)

Want to add also verifyisRecursivelyLCSSA here? or whatever is called?

chandlerc marked 3 inline comments as done.Nov 15 2017, 11:06 PM

Thanks Davide, I think I've got these covered either by adjustments or future work. Give a shout about anything you'd still like addresed in follow-up patches.

llvm/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
1139 ↗(On Diff #119114)

At this point, I don't think it actually matters because I ended up using a stable order below, so I've changed this. Thanks!

1245 ↗(On Diff #119114)

Yep. I've got that in the change description as one of the obvious follow-ups.

1809 ↗(On Diff #119114)

Yep. Sadly, I wasn't able to easily come up with a better way, but I can believe there is one.

This at least doesn't do the recursive formulation...

1858 ↗(On Diff #119114)

We already verify all child loops of updated loops in UpdateLCSSA. So the only way this would help is if the LCSSA formation itself is busted.

davide accepted this revision.Nov 16 2017, 4:40 PM

LGTM. Thanks Chandler.

This revision was automatically updated to reflect the committed changes.
chandlerc marked 2 inline comments as done.
wuzish added a subscriber: wuzish.Jul 25 2019, 1:16 AM
wuzish added inline comments.
llvm/trunk/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
61

I have a question about the default value false of "enable-nontrivial-unswitch". Could it be changed to true because it seems that it can bring much improvement of bmk when enable it. Or is there any reason to stop it?

Herald added a project: Restricted Project. · View Herald TranscriptJul 25 2019, 1:16 AM
Herald added a subscriber: mgrang. · View Herald Transcript
wuzish added inline comments.Jul 26 2019, 12:16 AM
llvm/trunk/lib/Transforms/Scalar/SimpleLoopUnswitch.cpp
61

@chandlerc @sanjoy @davide Could you please have a look at the comment?