This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Avoid unnecessary DT recomputation.
ClosedPublic

Authored by mzolotukhin on Feb 19 2016, 7:04 PM.

Details

Summary

When we completely unroll a loop, it's pretty easy to update DT in-place and
thus avoid rebuilding it. DT recalculation is one of the most time-consuming
tasks in loop-unroll, so avoiding it at least in case of full unroll should be
beneficial.

On some extreme (but still real-world) tests this patch improves compile time by
~2x.

Diff Detail

Event Timeline

mzolotukhin retitled this revision from to [LoopUnroll] Avoid unnecessary DT recomputation..
mzolotukhin updated this object.
mzolotukhin added a subscriber: llvm-commits.
  • s/UniqueExit/Exit/ - the exit block doesn't need to be unique.
chandlerc edited edge metadata.Feb 19 2016, 7:16 PM

Only high-level question is whether all of these cases have to be handled when the loop is in simplified form.

lib/Transforms/Utils/LoopUnroll.cpp
115–117

Range based for-loop?

533–534

There is a range based function as well for this.

536

I'm surprised this much work is necessary even when the loop is in simplified form?

562–568

Merge the two ifs? And update the comment?

Thanks for the feedback, I'll update the patch soon.

Michael

lib/Transforms/Utils/LoopUnroll.cpp
536

The reason we're doing this is that dominator for an exit block could change after unrolling. Consider a diamond-like loop body with header H, side blocks A and B, and latchblock L. Suppose B is exiting to E.

If B is the immediate dominator of E before unrolling, it's not the case after unrolling - we'll have several blocks exiting to E, so we have to actually find their common dominator.

Probably, there is a more efficient way of doing this, but even in this form it's a pure win over what we have now.

chandlerc added inline comments.Feb 19 2016, 7:42 PM
lib/Transforms/Utils/LoopUnroll.cpp
536

Ahh, I see.

So, the reason that this seemed odd to me is that all of these blocks that now branch to E come from unrolled copies of the loop, and so they should all have the same IDom -- the IDom of B from the first copy of the loop (which I think pretty much has to be the header, but I've not thought very hard about that).

But have we done any CFG simplification during unrolling at this point? (I know we talked about that, not sure any of it landed...) If so, that would of course potentially invalidate the idea of basing this purely on the loop structure and structural nature of unrolling.

It's not so much that this is ever going to be expensive at runtime (the domtree should make this pattern quite fast), it was just that I wanted to understand the complexity.

I think explaining some of the context of how to think about this code in comments would be very useful here.

562–568

Also, what do you think about running verifyDomTree here to help flush out any bugs?

chandlerc accepted this revision.Feb 19 2016, 7:43 PM
chandlerc edited edge metadata.

Also LGTM with the comments addressed. Also happy to chat further.

This revision is now accepted and ready to land.Feb 19 2016, 7:43 PM

Hi Chandler,

Some replies from me are inline. I'm heading off for the weekend now, probably update the patch and add some comments on Monday. Thanks for your feedback!

Michael

lib/Transforms/Utils/LoopUnroll.cpp
536

all of these blocks that now branch to E come from unrolled copies of the loop, and so they should all have the same IDom -- the IDom of B from the first copy of the loop

I don't think it's correct. IDom of B might be H, but doesn't have to be (you can imagine a diamond in diamond structure to prove it).

To explain it better I'll try to use ASCII mad skills here. Here is our (slightly modified) original loop body:

  (H)
   |
   v
  (I)
  / \
 v   v
(A) (B) --> (E)
  \ /
   v
  (L)

Here IDom(B) = I, IDom(E) = B.

After unrolling we'll have:

  (H)
   |
   v
  (I)
  / \
 v   v
(A) (B) ------
  \ /         \
   v          |
  (L)         |
   |          |
   v          |
  (H')        |
   |          |
   v          |
  (I')        |
  /  \        |
 v    v       v
(A') (B') -> (E)
  \  /
   vv
  (L')

In the unrolled loop IDom(B) = I, IDom(B') = I', IDom(E) = NearestCommonDominator(B', B) = I. Pleas note, that it doesn't have to be the header.

That said, I see what you meant by using structural nature of unrolling - we do exploit it when we assign dominators for cloned blocks.

As for the CFG simplification - we perform some folding right after this, in FoldBlockIntoPredecessor, which I also updated in this patch.

562–568

I did have

else 
  DEBUG(DT->verifyDomTree());

here (and I run tests with it), but if I keep it in, it'll regress compile time for Asserts=On builds. If that's fine, I can restore it.

mehdi_amini added inline comments.
lib/Transforms/Utils/LoopUnroll.cpp
536

It is not necessarily the header, but it is still IDom(B), do you have an example where IDom(E) != IDom(B)? (Yeah, I love your asciiart-fu ;))

mzolotukhin added inline comments.Feb 19 2016, 9:20 PM
lib/Transforms/Utils/LoopUnroll.cpp
536

One example is that IDom(E) might be B itself. Also, if you construct an example with nested diamonds, you'll probably be able to get IDom(E) != IDom(B).

But after thinking about it, I think IDom(E) = NearestCommonDominator(B, H'). It looks like obvious to me now, but it's too late on Friday to actually accurately prove it:)

mzolotukhin edited edge metadata.
  • Rebase on TOT.
  • Optimize dom-info update for exit blocks.
  • Add DT verification.
  • Add more comments.

Hi Mehdi, Chandler,

I rewrote the part about updating dominators for exit block. Could you please take a look one more time?

Thanks,
Michael

Awesome, thanks Michael. And yea, I'm very happy to have the verify in here until someone complains about it. This is too easy to get wrong, and the only way we'll find the really awesome test cases are with the verify left in place.

mehdi_amini added inline comments.Feb 22 2016, 4:33 PM
lib/Transforms/Utils/LoopUnroll.cpp
115–117

The above comment on range based for-loop still applies I think?

This revision was automatically updated to reflect the committed changes.

Ouch, I missed that, I'll commit this as a follow-up. Thanks!

Michael

mehdi_amini added inline comments.Feb 22 2016, 4:43 PM
lib/Transforms/Utils/LoopUnroll.cpp
121

Random thought: couldn't the pattern above being implemented more efficiently inside the DT (by breaking the invariant during the process): no need for a temporary vector and the ability to reserve the space in the new dominator (could just directly append the source children vector to the new one).
This pattern seems to appear at multiple places in the codebase (looked very quickly).

mehdi_amini added inline comments.May 31 2016, 3:24 PM
llvm/trunk/lib/Transforms/Utils/LoopUnroll.cpp
579 ↗(On Diff #48755)

If DT is null, this will always break (the test will be false, the else branch taken, and the nullptr dereferenced).
Clang can be smart and eliminate your null check as well, deducing that it is not possible for it to be null.

mzolotukhin added inline comments.Jun 6 2016, 12:45 PM
llvm/trunk/lib/Transforms/Utils/LoopUnroll.cpp
579 ↗(On Diff #48755)

Good catch! Thank, Mehdi!