This is an archive of the discontinued LLVM Phabricator instance.

Inductive unrolling pass
AbandonedPublic

Authored by mkazantsev on Feb 23 2021, 9:54 PM.

Details

Summary

This patch demonstrates an optimization LLVM is currently missing. It is
related to elimination of checks for which we know exact symbolic exit
count via unrolling. Imagine the loop where we have an exit with known
exit count:

do {
  ...
  if (cond()) break; // if iteration № EC is reached, we must break here.
  ...
} while (true);

Here is what happens if we unroll it with a factor of UF

do {
  ...
  if (cond()) break; // Corresponds to iterations №№ N * UF + 0 in the initial loop.
  ...
  if (cond()) break; // Corresponds to iterations №№ N * UF + 1 in the initial loop.
  ...
  if (cond()) break; // Corresponds to iterations №№ N * UF + 2 in the initial loop.
  <...>
  if (cond()) break; // Corresponds to iterations №№ N * UF + UF - 1 in the initial loop.
  ...
} while (true);

Important fact: let Rem be EC urem UF. Then, if we know its value (for example, Rem = 2),
then we know that only the exit in the unrolled loop which corresponds to N * UF + 2 may
possibly exit, and no other exit corresponds to the exiting iteration of the original loop, so
they never exit the loop. So we can remove UF - 1 checks leaving only one of them. So overall,
we save up to EC / UF * (UF - 1) checks over all loop iterations.

The problem is that we do not actually know value of Rem statically in most cases. But we always
know it in runtime if we know the exit count and selected the UF. We can insert a preloop which
will make sure that we enter the unrolled version of the loop with the iteration number corresponding
to Rem (of the original loop). So we always know that only the first cloned check in the unrolled
loop needs to be executed. The resulting solution would be:

for (i = 0; i < Rem; i++) {
  ...
  if (cond()) goto exit;
  ...
}
do {
  ...
  if (cond()) break; // Corresponds to iterations №№ N * UF + Rem + 0 in the initial loop.
  ...
  if (false) break; // Corresponds to iterations №№ N * UF + Rem + 1 in the initial loop and may never happen.
  ...
  if (false) break; // Corresponds to iterations №№ N * UF + Rem + 2 in the initial loop and may never happen.
  <...>
  if (false) break; // Corresponds to iterations №№ N * UF + Rem + UF - 1 in the initial loop and may never happen.
  ...
} while (true);
exit:
  ...

This patch adds implementation for this pass for experimental use.

Diff Detail

Event Timeline

mkazantsev created this revision.Feb 23 2021, 9:54 PM
mkazantsev requested review of this revision.Feb 23 2021, 9:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 23 2021, 9:54 PM
mkazantsev updated this revision to Diff 327354.Mar 1 2021, 9:33 PM
mkazantsev updated this revision to Diff 327399.Mar 2 2021, 3:09 AM
mkazantsev updated this revision to Diff 328009.Mar 3 2021, 8:21 PM
mkazantsev updated this revision to Diff 329197.Mar 8 2021, 8:29 PM
mkazantsev updated this revision to Diff 329198.Mar 8 2021, 8:39 PM
mkazantsev retitled this revision from [WIP][Not ready for review] Inductive unrolling pass to [WIP] Inductive unrolling (PoC). Frames as a separate pass so far.
mkazantsev edited the summary of this revision. (Show Details)

Hi everyone,

Here is the idea of optimization we are currently missing, but unfortunately I could not find a really good place in any of the existing passes to integrate it into. The loop unrolling looks like a (more or less) proper place but not sure if it's where it truly belongs. Unrolling here is just a tool that we use to eliminate something which is (in most cases) an inductive check. So maybe it may be split between unrolling and IndVarSimplify.

Anyways, I'd appreciate any ideas and hints on where should we integrate this. As a fallback, we can always leave it as a separate pass until we find a place for it.

mkazantsev retitled this revision from [WIP] Inductive unrolling (PoC). Frames as a separate pass so far to [WIP] Inductive unrolling (PoC). Framed as a separate pass so far.Mar 12 2021, 12:12 AM
mkazantsev retitled this revision from [WIP] Inductive unrolling (PoC). Framed as a separate pass so far to Inductive unrolling pass.Mar 24 2021, 4:20 AM
mkazantsev edited the summary of this revision. (Show Details)
reames added a comment.Apr 8 2021, 8:40 AM

Wrapping my head around this, just sharing some initial thoughts. Don't take anything here to be stronger than me thinking aloud.

The sub-piece of this which doesn't require a prologue loop looks like a natural extension to the existing loop unrolling. I think - but you should experimentally confirm - that existing runtime unrolling + indvars should prune all but one of the exits when the exiting lane is statically known. The missing piece would be to update the cost model. I'd had another idea for a change in the same area, you could look at my https://reviews.llvm.org/D91481 for ideas on how to possibly structure that.

That same sub-piece is also interesting in the loop vectorizer. If we know the exiting lane is zero, and that no control dependent potentially faulting operations are above said exit, we can widen the exit condition (as a simple scale check of lane 0) and vectorize the loop (if other exits meet the current conditions.)

Continuing to think about the vectorizer, a generalization of that may also be useful in that we can cheaply form a predicate mask for the whole loop. If we know lane X exits (and said exit dominates latch), then we can widen loads before said exit with a predicate mask which disables all lanes X+1 onward. We can do the same thing for multiple analyzeable loops (though we instead fold together the exit conditions), but this is interesting in allowing one unanalyzeable exit in the set.

Out of the above, the unrolling support seems easy to motivate, and not terrible invasive. The case of trip count being known TC mod UF == C for some C doesn't seem uncommon, and the costing checks should be cheap,

The more general transformation - which uses a prologue loop to ensure C == 0 for otherwise arbitrary TC - seems trickier. I really like the power, but we don't have other cases where the unroller needs a preload right now. (I don't think?) If we could find other cases which motivate a preloop and the costing thereof, having this in the unroller (as an alternative to peeling) seems reasonable. There's a bunch of code structure pieces - we really need this well abstracted, generic, and tested - but we can return to the details later.

I'm also hesitant about how to handle the case of multiple exits which have computable exit lanes C, and C2, but where C != C2. (E.g. exit 0 might exit on lane 1, exit 1 might exit on lane 2). Note that C1 and C2 in this case aren't known until runtime, so we can't tell exit 2 is dead. Figuring out something for the multiple exit case seems worthwhile, and needs more thought.

It seems like we might be able to adjust the trip count of the preloop to the minimum of the required rotation for the two exits, but that needs a bit more thought to confirm there's not some cornercase there.

Again, interesting, but a few additional moving parts. I'd probably start with the first sub-case costing changes in the unroller, get that worked through and enabled, then return to the second case.

Mostly for my own sanity later, I believe the costing code for this should end up as simple as:
auto *EC = SE->getExitCount(L, ExitingBlock);
if (!isa<SCEVUnknown>(EC)) {

// get TC from EC
auto *Rem = SE->getURemExpr(TC, UF as SCEVConstant);
if (isa<SCEVConstant>(Rem)) {
  // discount cost of exit by (UF-1)/UF
}

}

mkazantsev planned changes to this revision.Aug 6 2021, 2:27 AM

To be merged into Loop Unroll.

mkazantsev abandoned this revision.Feb 13 2022, 11:02 PM

I have no plans to enable this, anyone who cares feel free to take over.