Page MenuHomePhabricator

[LoopUnroll] Use smallest exact trip count from any exit
ClosedPublic

Authored by nikic on May 23 2021, 3:48 AM.

Details

Summary

This is a more general alternative/extension to D102635. Rather than handling the special case of "header exit with non-exiting latch", this unrolls against the smallest constant exact trip count from any (latch-dominating) exit, regardless of whether the latch is also exiting or not.

The motivating case is in full-unroll-one-unpredictable-exit.ll. Here the header exit is an IV-based exit, while the latch exit is a data comparison. This kind of loop does not get rotated, because the latch is already exiting, and loop rotation doesn't try to distinguish between IV-based/SCEV-able latches.

I believe that unrolling should be treating this kind of loop the same regardless of whether the IV-based comparison happens to be on the latch or some other block.

Diff Detail

Event Timeline

nikic created this revision.May 23 2021, 3:48 AM
nikic requested review of this revision.May 23 2021, 3:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 23 2021, 3:48 AM
nikic planned changes to this revision.May 23 2021, 1:33 PM

Looks like the PreserveCondBr flag is currently not respected when partial unrolling is performed, and the latch still gets simplified in that case.

nikic updated this revision to Diff 347408.May 24 2021, 8:40 AM

Rebase over D103026.

nikic edited the summary of this revision. (Show Details)May 24 2021, 8:43 AM
reames added inline comments.May 24 2021, 2:50 PM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
1133

I would strongly prefer this logic be sunk into the appropriate SCEV accessors. I'm fine with you doing that in a follow up provided you commit to doing so. I'll leave that decision up to you.

nikic added inline comments.May 24 2021, 3:31 PM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
1133

Could you please clarify which SCEV accessors you have in mind here? Do you mean sinking this into the getSmallConstantTripCount() variant that only accepts a Loop? I would have a couple of concerns with doing that:

  • We also need to know which exit the trip count refers to.
  • This is not really the trip count of the loop (just a loop exit), and I'm pretty sure changing that would break other users of the API.
  • The limitation to latch-dominating exits here is not fundamental, and mainly there due to unclear profitability.

Maybe I misunderstood the suggestion though.

reames added inline comments.May 24 2021, 3:38 PM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
1133

I was referring specifically to the small constant trip count and small constant multiple versions which take Loop parameters.

Your point about needing the exit block is true for the way the code is currently phrased. I'd missed that. I think the need for that can and should be removed (see my comment on the patch this one is based on), but if that's logistically complicated, I'm fine with us moving forward with this structure and then revisiting in the future.

Any exit count for an exit which dominates the latch must be a (potentially conservative) exit count for the loop. So, I'm not quite sure what you mean with the rest of your comments.

nikic added inline comments.May 25 2021, 12:50 AM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
1133

Your point about needing the exit block is true for the way the code is currently phrased. I'd missed that. I think the need for that can and should be removed (see my comment on the patch this one is based on), but if that's logistically complicated, I'm fine with us moving forward with this structure and then revisiting in the future.

Going to respond here to keep it in one place: I think what you're suggesting is effectively to use getSmallConstantMaxTripCount() and pass down that information only. However, the important distinction here is that the max trip count only tells you that branches after that trip count must be taken, not that branches before it cannot be taken (for that specific exit). Unrolling handles a number of different exits (exact, exact-or-zero, max, multiple-of) and this (unless I'm misunderstanding again) would only handle the max case, where we can only fold the final branch.

Any exit count for an exit which dominates the latch must be a (potentially conservative) exit count for the loop. So, I'm not quite sure what you mean with the rest of your comments.

The getSmallConstantTripCount() currently returns an exact trip count for a single-exit loop, while getSmallConstantMaxTripCount() returns a max trip count for any loop. As mentioned before, both provide different guarantees. I think to stick with the spirit of getSmallConstantTripCount(), it should only be returning a value if all the exit counts in a loop match, which doesn't seem terribly useful in practice.

reames added a comment.EditedMay 26 2021, 9:27 AM

@nikic We're talking past each other here.

Let me work up from definitions. I think this will make my point easier to follow.

An exact exit count for an loop exit is the iteration on which that loop exit *must* be taken. We know that no previous iteration can exit via that exit. A maximum exit count is any upper bound on an exact exit count. The only requirement is that it be a valid upper bound. In practice, we only consider constant upper bounds, but that's an implementation detail. See getExitCount(BB) and computeExitLimit for details. Note that we only compute exit counts for exits which dominate latch as we don't want to have to reason about exits being skipped on the potentially exiting iteration.

An exit count for a loop (either exact or max) is simply a umin of all the exit counts which dominate the latch. See getBackedgeTakenCount and computeBackedgeTakenCount.

A trip count is simply an exit (e.g. backedge taken) count plus one. The value 0 is used for "don't know". (We really should have used an option there instead, because the resulting code is confusing.)

A "small" exit count is simply one which evaluates to a small constant. A valid implementation of any of the "small constant trip count" methods is to compute the exact symbolic exit count, dyn_cast to a constant, and add one.

For historical reasons, the small constant trip count routines have never been updated to do that. Instead, they're still specialized for the case of a single exiting block. However, that simply means they produce "don't know" more than required.

My whole point has been - aside from the need to know which exit has the smallest trip count - that you could simply use the generic implementation which works for multiple exit loops. There's no need for SCEV to continue to use the form restricted to single exit loops.

For further clarity, here's the implementation I'm suggesting.
unsigned ScalarEvolution::getSmallConstantTripCount(const Loop *L) {

auto *ExitCount = dyn_cast<SCEVConstant>(getBackedgeTakenCount(L, Exact));
return getConstantTripCount(ExitCount);

}

(This passes make check btw.)

The rest of the patch would be removing the now-stale header comment, and reviewing callers to ensure that they can actually handle multiple exit loops to return a value other than "don't know" and if not, adding a guard before the call.

I just went ahead and finished the SCEV patch I was suggesting since I'd done most of it already.
See https://reviews.llvm.org/D103182

This comment was removed by nikic.

@nikic Computing an exact exit count for a multiple exit loop using umin is correct, and is pretty much one of the key purposes of SCEV. I've tried to explain this multiple times, and we don't seem to be getting anywhere here. Can we *please* move this to spoke conversation?

@reames Sorry, I completely confused myself here. Ignore everything I said in my last comment!

@reames Thanks for the patient explanation and me being so dense. No idea what I was thinking here.

To loop back around to the unrolling use-case, D103182 would now provide us with an exact trip count for the loop. However, I think we still need to use the approach from this patch, or something similar to it, for two reasons:

  • Without the knowledge which exit the trip count is for, we can fold all exits before the TripCount to "not taken", but we don't know which exit is the taken one on the last iteration.
  • For trip multiples, the "loop trip multiple" would be the minimum of all trip multiples (please correct me if I'm wrong on that!) which I believe is not what is desired for unrolling. If we have one unpredictable exit and one multiple-4 exit, we'd want to unroll against that exit and save use the intermediate branches, but the "loop trip multiple" would be 1 in that case, as we can exit from the loop on every iteration.

@reames Thanks for the patient explanation and me being so dense. No idea what I was thinking here.

To loop back around to the unrolling use-case, D103182 would now provide us with an exact trip count for the loop. However, I think we still need to use the approach from this patch, or something similar to it, for two reasons:

  • Without the knowledge which exit the trip count is for, we can fold all exits before the TripCount to "not taken", but we don't know which exit is the taken one on the last iteration.

Maybe I'm now being dense, but *why* do we need to know which block was taken? I'm searching for uses of ExitingBlock in the code, and I can't find it used after the bit of code computing TripCount and TripMultiple.

  • For trip multiples, the "loop trip multiple" would be the minimum of all trip multiples (please correct me if I'm wrong on that!) which I believe is not what is desired for unrolling. If we have one unpredictable exit and one multiple-4 exit, we'd want to unroll against that exit and save use the intermediate branches, but the "loop trip multiple" would be 1 in that case, as we can exit from the loop on every iteration.

So, not minimum of all trip multiples, GCD of the same. See D103189.

With GCD, I think unrolling gets a huge boost for multiple exit loops. We can still go further, but we'd need to actual expose multiple multiples to the cost model. Simply "lying" about the multiple to the costing code seems highly suspect. (Though, looking at the existing cost modeling code, it looks like it already fudges the multiple. I have no idea what that code is doing.)

I would request you stage this. Start with the easy GCD case which is not going to break any assumptions made later in the code (as it is a trip multiple for the loop), then come back to it if you have a motivating example and we can complicate the costing.

I just realized what I was confused about in the first place: getBackedgeTakenCount() requires that all exits have an exact exit count. If you have one unpredicate exit and one exit with an exact exit count, then you'll get back an unpredictable backedge taken count. Only if all exits have an exact exit count will the umin be taken. (This is the isComplete() condition in BackedgeTakenInfo::getExact().)

So yes, the exact loop trip count is well-defined for multi-exit loops, but is not available if there is at least one unpredictable exit. However, a loop with one unpredictable exit and one exact exit is exactly the kind of loop I'm interested in here. Loop unrolling occurs against the exact exit, leaving us with TripCount unrolled checks of the unpredictable exit. (This is already supported if the exact trip count is on the latch, while other exits are unpredictable, but not if it's on a non-latch exit.)

@reames Thanks for the patient explanation and me being so dense. No idea what I was thinking here.

To loop back around to the unrolling use-case, D103182 would now provide us with an exact trip count for the loop. However, I think we still need to use the approach from this patch, or something similar to it, for two reasons:

  • Without the knowledge which exit the trip count is for, we can fold all exits before the TripCount to "not taken", but we don't know which exit is the taken one on the last iteration.

Maybe I'm now being dense, but *why* do we need to know which block was taken? I'm searching for uses of ExitingBlock in the code, and I can't find it used after the bit of code computing TripCount and TripMultiple.

You're right: If we actually have an exact loop trip count, then we don't. But see my comment above: Here, we actually don't have an exact loop trip count. We have an exact trip count for one exit, while other exits might be unpredictable, and as such cannot be folded at all.

  • For trip multiples, the "loop trip multiple" would be the minimum of all trip multiples (please correct me if I'm wrong on that!) which I believe is not what is desired for unrolling. If we have one unpredictable exit and one multiple-4 exit, we'd want to unroll against that exit and save use the intermediate branches, but the "loop trip multiple" would be 1 in that case, as we can exit from the loop on every iteration.

So, not minimum of all trip multiples, GCD of the same. See D103189.

With GCD, I think unrolling gets a huge boost for multiple exit loops. We can still go further, but we'd need to actual expose multiple multiples to the cost model. Simply "lying" about the multiple to the costing code seems highly suspect. (Though, looking at the existing cost modeling code, it looks like it already fudges the multiple. I have no idea what that code is doing.)

I would request you stage this. Start with the easy GCD case which is not going to break any assumptions made later in the code (as it is a trip multiple for the loop), then come back to it if you have a motivating example and we can complicate the costing.

I think I'm going to adjust this to not change the trip multiple handling at all in this patch, i.e. only use trip multiple for either a single exit or a latch exit, as before. We can deal with that case separately. I don't think we have good test coverage for this area right now.

I just realized what I was confused about in the first place: getBackedgeTakenCount() requires that all exits have an exact exit count. If you have one unpredicate exit and one exit with an exact exit count, then you'll get back an unpredictable backedge taken count. Only if all exits have an exact exit count will the umin be taken. (This is the isComplete() condition in BackedgeTakenInfo::getExact().)

I think you've returned to an earlier confusion. I don't remember if this was from this review, or another recent one.

What unrolling appears to want here is a *maximum* trip count. What it's actually computing is an *exact* trip count. The variable names are highly confusing. I had previously suggested you rename all the TripCount variables to MaxTripCount and use the appropriate means to query that upper bound.

Specifically:
SE.getTripCountFromExitCount(SE.getConstantMaxBackedgeTakenCount(L))

I repeat that suggestion.

Could you please mark it as "changes planned" until the underlying patch is ready or merged?

nikic planned changes to this revision.Jun 2 2021, 12:17 AM
nikic updated this revision to Diff 353181.Jun 19 2021, 1:46 AM
nikic edited the summary of this revision. (Show Details)

Rebase, and only consider TripCount for all exits, leave TripMultiple alone for now to minimize impact.

Would be LGTM, but I'd like to understand the called out test change first. Everything else looks as expected.

llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
1124

Your comment here is wrong. The code is correct, but the comment isn't. :)

If we unroll by the exact trip count of *any* exit, we're guaranteed to break the backedge. As such, there might be conditional exits left in *earlier* iterations, but there will be nothing in *later* iterations which is what your comment appears to say.

It might also be worth stating explicitly that this is an upper bound on the actual trip count of the loop (since an earlier conditional exit we can't analyze might be taken), and draw the distinction with a maximum count (conservatism in analyzing each exit.) Separately, I really think we should be allowing max trip counts here, but that's a separate step.

llvm/test/Transforms/LoopUnroll/runtime-loop-known-exit.ll
9

This test change looks concerning. Have you explored why this is happening? On the surface, this looks like a bad interaction between runtime unrolling and finding a more precise trip count for full unrolling we may need to explore.

nikic added inline comments.Jun 19 2021, 11:46 AM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
1124

What I really wanted to say here is that an unroll by this trip count eliminates all branches relating to one exit, but branches relating to other exits may have to be kept. This is opposed to the max trip count case where we're only guaranteed to break the backedge, but may not be able to remove any other branches.

The unroll code already handles max trip count, but it's only used if no exact trip count is known, and is controlled by a target option (which is disabled on X86...)

llvm/test/Transforms/LoopUnroll/runtime-loop-known-exit.ll
9

When a trip count is known, we don't perform runtime unrolling and perform partial unrolling instead. In this case it doesn't happen because I specified -unroll-runtime but not -unroll-allow-partial so we get no unrolling. If we do allow partial unrolling, the result looks like this: https://gist.github.com/nikic/0541f7937f6db4867ef7fd4d7673a2b1 We don't perform the runtime unroll transform to enforce a certain trip multiple on the latch exit, and instead make use of the known trip count on the other exit to only check it once per iteration. I'd say the new result is better (same reduction in branches without the need to remainder loop), with the caveat that it's more aggressive, because it doesn't use the default runtime unroll count.

nikic added inline comments.Jun 19 2021, 12:29 PM
llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp
1124

I guess it's worth discussing the larger context here. It probably doesn't come as a surprise that the modelling is rather odd and doesn't seem particularly principled. https://github.com/llvm/llvm-project/blob/59d90fe817b5f1feae1a1406bd487e6552b9928d/llvm/lib/Transforms/Scalar/LoopUnrollPass.cpp#L835-L846 lays out the reason why this is behind a target option, which is that it will result in more branches, which may be problematic for constrained branch predicators. What this doesn't take into account is that even a full unroll may replicate branches, either for other exits or just control flow within the loop.

What would make more sense to me is to have some kind of "branch penalty" that applies for each newly introduced branch -- this could be due to inner control flow, a remaining unpredictable exit, or an exit that only has an upper bound.

reames accepted this revision.Jun 19 2021, 5:08 PM

Response explained the test case interaction. LGTM w/a tweaked comment.

This revision is now accepted and ready to land.Jun 19 2021, 5:08 PM
nikic updated this revision to Diff 353233.Jun 20 2021, 7:49 AM

Update comment.

Comment much improved, thanks!

This revision was landed with ongoing or failed builds.Jun 20 2021, 12:01 PM
This revision was automatically updated to reflect the committed changes.