This is an archive of the discontinued LLVM Phabricator instance.

[LoopUnroll] Use tripcount from exiting header, if latch not exiting.
AbandonedPublic

Authored by fhahn on May 17 2021, 9:26 AM.

Details

Summary

Currently we use the trip count from the exiting latch or from the
unique exit block. If the loop is not rotated, the latch won't exit but
the header will.

This patch falls back to using the trip count from the header, if the
header is exiting and the latch is not.

This fixes cases where we failed to unroll with -Oz, like
https://godbolt.org/z/fP6sna8qK

bool foo(int *ptr, int limit) {
    #pragma clang loop unroll(full)
    for (unsigned int i = 0; i < 4; i++) {
        if (ptr[i] > limit)
        return false;
        ptr[i]++;
    }
    return true;
}

Diff Detail

Event Timeline

fhahn created this revision.May 17 2021, 9:26 AM
fhahn requested review of this revision.May 17 2021, 9:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 17 2021, 9:26 AM
Meinersbur accepted this revision.May 17 2021, 9:54 AM

LGTM

However, I wonder why it is not using the overloads without BB arguments. At the end, unrolling would be only interested in some max trip count imposed by any exit, not of a specific one.

This revision is now accepted and ready to land.May 17 2021, 9:54 AM
nikic added a subscriber: nikic.May 17 2021, 10:09 AM

However, I wonder why it is not using the overloads without BB arguments. At the end, unrolling would be only interested in some max trip count imposed by any exit, not of a specific one.

Full unrolling can operate in a number of different modes. Max tripcount is one of them, but is actually disabled on most targets (maybe it shouldn't be...) The one this targets is an exact tripcount on one exit, which is generally enabled. The exact tripcount case will always eliminate branches for one exit (though there may be others), while max tripcount will effectively retain all branches.

Unfortunately the loop unrolling implementation currently has some implicit contracts about what kinds of loops it can unroll -- e.g. it would be great if this worked even if the latch were exiting (but had no exact exit count itself), but that does wouldn't work out of the box.

reames requested changes to this revision.May 17 2021, 1:22 PM

This is semantically incorrect, and can not land. There is an important distinction between a maximum trip count (which this code would be correct for), and an exact trip count (which is what this code appears to need). Consider the case where the header exits after 20 iterations, and another non-latch exit exits after 10. Fully unrolling by 20 would be incorrect. (Since control flow is elided and we'd execute too many iterations.)

If you want to achieve the same effect correctly, you should update ScalarEvolution::getSmallConstantTripCount(L) to do what the block variant does, but with the getExitCount(L) version. The change is straight forward, but you will need to audit callers to ensure they're not relying on the single exit block post-condition.

This revision now requires changes to proceed.May 17 2021, 1:22 PM

This is semantically incorrect, and can not land. There is an important distinction between a maximum trip count (which this code would be correct for), and an exact trip count (which is what this code appears to need). Consider the case where the header exits after 20 iterations, and another non-latch exit exits after 10. Fully unrolling by 20 would be incorrect. (Since control flow is elided and we'd execute too many iterations.)

Does this mean only the branch in the latch is elided by (full) unrolling? The situation could also be that the latch exits after 20 iterations. I was not expecting that there is something special about the latch when exiting.

This is semantically incorrect, and can not land. There is an important distinction between a maximum trip count (which this code would be correct for), and an exact trip count (which is what this code appears to need). Consider the case where the header exits after 20 iterations, and another non-latch exit exits after 10. Fully unrolling by 20 would be incorrect. (Since control flow is elided and we'd execute too many iterations.)

Does this mean only the branch in the latch is elided by (full) unrolling? The situation could also be that the latch exits after 20 iterations. I was not expecting that there is something special about the latch when exiting.

When we full unroll, we don't duplicate exit blocks. (I haven't checked how we handle single vs multiple exits, but this patch presupposes we handle multiple.) If we unroll by 20 *without* duplicating the earlier exit control flow, we have a miscompile.

fhahn updated this revision to Diff 346001.May 17 2021, 3:12 PM

This is semantically incorrect, and can not land. There is an important distinction between a maximum trip count (which this code would be correct for), and an exact trip count (which is what this code appears to need). Consider the case where the header exits after 20 iterations, and another non-latch exit exits after 10. Fully unrolling by 20 would be incorrect. (Since control flow is elided and we'd execute too many iterations.)

Does this mean only the branch in the latch is elided by (full) unrolling? The situation could also be that the latch exits after 20 iterations. I was not expecting that there is something special about the latch when exiting.

Thanks for taking a look! LoopUnroll expects TripCount to be an upper bound on the number of times the terminator we simplify during unrolling executes. At the moment, that's either the latch block terminator, if it exits or the unique exiting block's termaintor. Otherwise the exiting branches should be retained. I've also added another test that has an early exit depending on the IV as well. It gets simplified, but only due to simplifications after unrolling.

If there's a single exit block or the latch is exiting, we use TripCount to decide whether the loop is fully unrolled. If that's the case, we simplify the branches in the latch/unique exiting block. If there the latch is not exiting and there's no unique exiting block, we won't simplify the exiting branches to unconditional branches (the code to do so is around https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/LoopUnroll.cpp#L787, but ExitingBlocks will be empty, if there's no single exiting branch instruction).

I also had another look at some of the comments and it seems like the one for llvm::UnrollLoop was not updated to reflect the fact that loops with multiple exiting blocks are also supported. I tried to update the comment to reflect the current requirement. I might have missed something (there' been a few other changes since I added support for unrolling header-exiting loops and quite a few cases are handled in the code), but I think it should more accurately reflect what's going on. I think it might make sense to split off the comment change, but I included it here to start with as it is directly related to the discussion here.

@fhahn I'm willing to defer to you here. If you're really sure the TripCount is always interpreted as an upper bound, you can land and move forward. From your description, it really sounds like this code is duplicating a lot of logic already existing in SCEV. As a cleanup, it would be nice if we named the variables appropriately and used scev's existing distinction between exact and max trip counts.

Just curious: how come that the loop wasn't rotated properly before unrolling? Looks like it's a pass ordering problem.

fhahn added a comment.May 18 2021, 3:04 PM

@fhahn I'm willing to defer to you here. If you're really sure the TripCount is always interpreted as an upper bound, you can land and move forward. From your description, it really sounds like this code is duplicating a lot of logic already existing in SCEV. As a cleanup, it would be nice if we named the variables appropriately and used scev's existing distinction between exact and max trip counts.

After taking a closer look, I think we might be able to indeed re-use parts of SCEV in some parts. The variable names & co could also be clarified & improved. I'll look into some follow-up patches.

Just curious: how come that the loop wasn't rotated properly before unrolling? Looks like it's a pass ordering problem.

The case I am looking at uses -Oz, where loop-rotate is very conservative when it comes to duplicating code and does not rotate the loop (e.g. https://godbolt.org/z/fP6sna8qK). With all other optimization levels, the loop gets rotated.

mkazantsev accepted this revision.Jun 1 2021, 11:46 PM

Looks ok by me, but please make sure you address Philip's concerns.

fhahn abandoned this revision.Jul 1 2021, 9:57 AM

More general refactoring/improvements mean that this patch is no longer needed. Thanks!