This is an archive of the discontinued LLVM Phabricator instance.

[LV] Vectorize (some) early and multiple exit loops
ClosedPublic

Authored by reames on Dec 15 2020, 10:38 AM.

Details

Summary

This patch is a major step towards supporting multiple exit loops in the vectorizer. This patch on it's own extends the loop forms allowed in two ways:

  1. single exit loops which are not bottom tested
  2. multiple exit loops w/ a single exit block reached from all exits and no phis in the exit block (because of LCSSA this implies no values defined in the loop used later)

The restrictions on multiple exit loop structures will be removed in follow up patches; disallowing cases for now makes the code changes smaller and more obvious. As before, we can only handle loops with entirely analyzable exits. Removing that restriction is much harder, and is not part of currently planned efforts.

The basic idea here is that we can force the last iteration to run in the scalar epilogue loop (if we have one). From the definition of SCEV's backedge taken count, we know that no earlier iteration can exit the vector body. As such, we can leave the decision on which exit to be taken to the scalar code and generate a bottom tested vector loop which runs all but the last iteration.

The existing code already had the notion of requiring one iteration in the scalar epilogue, this patch is mainly about generalizing that support slightly, making sure we don't try to use this mechanism when tail folding, and updating the code to reflect the difference between a single exit block and a unique exit block (very mechanical).

Diff Detail

Event Timeline

reames created this revision.Dec 15 2020, 10:38 AM
reames requested review of this revision.Dec 15 2020, 10:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 15 2020, 10:38 AM
reames updated this revision to Diff 311967.Dec 15 2020, 11:01 AM

rebase over landed tests

reames edited the summary of this revision. (Show Details)Dec 15 2020, 11:02 AM
reames updated this revision to Diff 311986.EditedDec 15 2020, 12:06 PM

Update tests to cover cases where we can't vectorize due to either a) size, or b) predication.

Doing this revealed that the handling of the predicate don't vectorize option is broken in the patch. (We correctly vectorize with a scalar epilogue where the user intent was not to vectorize.) Fix forthcoming.

(Note - comment edited as I'd originally misunderstood scope of work to address tail fold case above)

reames updated this revision to Diff 312000.Dec 15 2020, 12:45 PM

Rebase over a81db8b31. While that change is NFC, the code structure makes it much easier to disable vectorization when predication is demanded and we can't provide it.

reames updated this revision to Diff 312005.Dec 15 2020, 12:55 PM

One further simplification enabled by previous rebase.

Ayal added a comment.Dec 21 2020, 4:53 PM

Nice leverage of requiresScalarEpilogue!
Looks good to me, adding some minor comments.

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1098–1106

To avoid confusing exit/ing block terms:
// We currently must have a single "exit block" after the loop. Note that multiple "exiting blocks" inside the loop are allowed, provided they all reach the single exit block.

1105

nit: a[n] unique

1121

Clarify failure reason, e.g., "The loop must have no live-out values if it has more than one exiting block" ?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1559

Checking that the only exit is the latch can be done (alternatively, literally) by
if (TheLoop->getExitingBlock() == TheLoop->getLoopLatch())

3018

To avoid confusing exit/ing block/loop terms:
That is, if the loop contains multiple exiting blocks, or a single exiting block which is not the latch.

5507–5524
// We can't vectorize anything but a bottom tested loop without a scalar epilogue.

Perhaps better stated as:
// The only loops we can vectorize without a scalar epilogue, are loops with a bottom-test and a single exiting block.

5509

nit: i[t]eration

5512

If predication is preferred over a scalar epilog, but the latter is not forbidden (i.e., the CM_ScalarEpilogueNotNeededUsePredicate case), we could "fallback to a vectorization with a scalar epilogue" here, instead of bailing out, as done below?

Can test if (TheLoop->getExitingBlock() != TheLoop->getLatchBlock())

llvm/test/Transforms/LoopVectorize/loop-legality-checks.ll
24

Would it be useful to keep this (single exiting, double latched(?)) test?

fhahn requested changes to this revision.Dec 22 2020, 8:56 AM

Thank you very much for putting up this patch! This looks like a good start.

I think there are some remaining code-gen issue, e.g. something like the example below leads to a verifier failure, when built with opt -loop-vectorize -force-vector-width=2. I didn't have time to take a closer look at what might cause the failure yet.

define void @test(float* %addr) {
entry:
  br label %loop.header

loop.header:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %loop.latch ]
  %gep = getelementptr float, float* %addr, i64 %iv
  %exitcond.not = icmp eq i64 %iv, 200
  br i1 %exitcond.not, label %exit, label %loop.body

loop.body:
  %0 = load float, float* %gep, align 4
  br i1 undef, label %loop.latch, label %then

then:
  store float 10.0, float* %gep, align 4
  br label %loop.latch

loop.latch:
  %iv.next = add nuw nsw i64 %iv, 1
  br label %loop.header

exit:
  ret void
}
This revision now requires changes to proceed.Dec 22 2020, 8:56 AM
reames marked 8 inline comments as done.Dec 22 2020, 9:24 AM

Ayal, thanks for all the great wordsmith comments!

Florian, I can confirm the crash with your test case, will investigate.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1559

I went ahead and switched since you seem to have a preference, but in general, these are not the same. Consider an infinite loop with two latch blocks. Such a loop can't reach here, but it requires context to know that. The form of the check I used is context free.

Doesn't really matter here, so I'll go with your preference.

llvm/test/Transforms/LoopVectorize/loop-legality-checks.ll
24

Oh, I'd missed the fact this was a double latch, not just a non exiting latch. I'll add a test for that case in loop-form.ll along with the others and rebase.

reames updated this revision to Diff 313373.Dec 22 2020, 9:35 AM
reames marked an inline comment as done.

Incorporate wording/style comments from Ayal, test rebase pending.

JFYI, multiple latch tests added in f106b2, but they don't impact the diff as SCEV can't prove exit counts and thus we don't vectorize.

Now to track down the problem Florian found.

Ok, I see what's going on w/Florian's example. It's an interaction with block predication and early exits. Essentially, block predication expects to be able to add uses of any condition in the loop, and the dead code elimination has eliminated the exit condition in the vector loop. I can restrict the legality slightly to avoid this interaction easily enough, but I want to think a bit first if there's a clean integration.

reames updated this revision to Diff 313402.Dec 22 2020, 11:49 AM
reames edited the summary of this revision. (Show Details)

Incorporate a fix for the issue Florian found. Essentially, we can't both treat single use exit conditions as dead, and allow predication to use them in a mask. Since we know they're evaluate to true for the entire vector body, we can simply ignore them when forming edge predicates.

Florian, any other edge cases you can think of? I'd completely missed that one. Thank you for finding it!

Ayal accepted this revision.Dec 22 2020, 2:56 PM

This looks fine to me, thanks; would be good to get @fhahn approval too.

Wonder if a similar optimization may be applied to the loop unroller as well - discarding all exiting edges (keeping one in the latch only) from a single-latched "countable" loop, whose last iteration (or more) is peeled.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
1559

Agreed, this form assumes there is a single latch, which is also implied by the comment, and can be asserted.

7956

Good catch and fix!

Incorporate a fix for the issue Florian found. Essentially, we can't both treat single use exit conditions as dead, and allow predication to use them in a mask. Since we know they're evaluate to true for the entire vector body, we can simply ignore them when forming edge predicates.

Florian, any other edge cases you can think of? I'd completely missed that one. Thank you for finding it!

I think there are some scenarios when we break some LCSSA PHIs that use PHIs created during SCEV expansion when we generate the runtime checks. After a first glance, it appears like an issue after we add the conditional branch from the middle block. The example below should cause a verifier failure with opt -loop-vectorize -force-vector-width=4. I tried to reduce it a bit, but unfortunately it is still quite ugly.

define void @widget(i16** %arg, i64 %N) local_unnamed_addr {
bb:
  br label %bb1

bb1:                                              ; preds = %bb16, %bb1, %bb
  %tmp = load i16*, i16** %arg, align 8
  %tmp2 = load i16*, i16** %arg, align 8
  br i1 undef, label %bb1, label %bb3

bb3:                                              ; preds = %bb15, %bb1
  br i1 undef, label %bb16, label %bb4

bb4:                                              ; preds = %bb3
  br i1 undef, label %bb5, label %bb15

bb5:                                              ; preds = %bb9, %bb4
  %tmp6 = phi i64 [ %tmp7, %bb9 ], [ %N, %bb4 ]
  %tmp7 = add nsw i64 %tmp6, -1
  %tmp8 = icmp sgt i64 %tmp6, 1
  br i1 %tmp8, label %bb9, label %bb13

bb9:                                              ; preds = %bb5
  %tmp10 = getelementptr inbounds i16, i16* %tmp2, i64 %tmp7
  %tmp11 = load i16, i16* %tmp10, align 2
  %tmp12 = getelementptr inbounds i16, i16* %tmp, i64 0
  store i16 %tmp11, i16* %tmp12, align 2
  br label %bb5

bb13:                                             ; preds = %bb15, %bb5
  %tmp14 = load i16, i16* %tmp, align 2
  ret void

bb15:                                             ; preds = %bb4
  br i1 undef, label %bb13, label %bb3

bb16:                                             ; preds = %bb3
  br label %bb1
}

Florian, any other edge cases you can think of? I'd completely missed that one. Thank you for finding it!

I think there are some scenarios when we break some LCSSA PHIs that use PHIs created during SCEV expansion when we generate the runtime checks. After a first glance, it appears like an issue after we add the conditional branch from the middle block. The example below should cause a verifier failure with opt -loop-vectorize -force-vector-width=4. I tried to reduce it a bit, but unfortunately it is still quite ugly.

I can confirm the failure and will debug. Thank you again for finding a cornercase.

Florian, do you mind if I landed this under an off by default flag? I realize we have correctness issues outstanding, but it would be a lot easier to test this, and highlight the fixes one by one if I was working off checked in code. Once we'd worked through everything, I'd enable and remove the flag.

Ayal added a comment.Dec 24 2020, 5:15 AM

I think there are some scenarios when we break some LCSSA PHIs that use PHIs created during SCEV expansion when we generate the runtime checks. After a first glance, it appears like an issue after we add the conditional branch from the middle block. The example below should cause a verifier failure with opt -loop-vectorize -force-vector-width=4. I tried to reduce it a bit, but unfortunately it is still quite ugly.

Hmm, if it is reduced a bit further by fusing bb9 into bb5, making the loop bottom-tested, same failure still occurs, regardless of this patch?

I.e., replacing:

bb5:                                              ; preds = %bb9, %bb4
  %tmp6 = phi i64 [ %tmp7, %bb9 ], [ %N, %bb4 ]
  %tmp7 = add nsw i64 %tmp6, -1
  %tmp8 = icmp sgt i64 %tmp6, 1
  br i1 %tmp8, label %bb9, label %bb13

bb9:                                              ; preds = %bb5
  %tmp10 = getelementptr inbounds i16, i16* %tmp2, i64 %tmp7
  %tmp11 = load i16, i16* %tmp10, align 2
  %tmp12 = getelementptr inbounds i16, i16* %tmp, i64 0
  store i16 %tmp11, i16* %tmp12, align 2
  br label %bb5

with:

bb5:                                              ; preds = %bb5, %bb4
  %tmp6 = phi i64 [ %tmp7, %bb5 ], [ %N, %bb4 ]
  %tmp7 = add nsw i64 %tmp6, -1
  %tmp8 = icmp sgt i64 %tmp6, 1
  %tmp10 = getelementptr inbounds i16, i16* %tmp2, i64 %tmp7
  %tmp11 = load i16, i16* %tmp10, align 2
  %tmp12 = getelementptr inbounds i16, i16* %tmp, i64 0
  store i16 %tmp11, i16* %tmp12, align 2
  br i1 %tmp8, label %bb5, label %bb13
fhahn added a comment.Dec 24 2020, 9:15 AM

I can confirm the failure and will debug. Thank you again for finding a cornercase.

Florian, do you mind if I landed this under an off by default flag? I realize we have correctness issues outstanding, but it would be a lot easier to test this, and highlight the fixes one by one if I was working off checked in code. Once we'd worked through everything, I'd enable and remove the flag.

IMO it would be preferable to land this without a flag, so we ensure this gets wide testing and we can flush out & fix the remaining issues early. I think it is almost there. I'll look into the issue over the next few days and I suggest to circle back after the holiday weekend and either land the patch together with a fix or behind a flag.

Hmm, if it is reduced a bit further by fusing bb9 into bb5, making the loop bottom-tested, same failure still occurs, regardless of this patch?

Thanks Ayal! I had a suspicion that this crash only got surfaced when using the patch, as the code related to setting up the middle block & co is not really touched by this patch. I'll try to take a look over the next few days.

fhahn accepted this revision.Dec 27 2020, 10:25 AM

LGTM, thanks! I pushed a small fix for the crash discussed earlier in 0ea3749b3cde. It should be a very safe/straight-forward fix, but I would appreciate taking a look post-review.

I also did some additional testing with loop rotation disabled, which should stress test parts of the multi-exit support, on SPEC2006 & MultiSource. That didn't surface any further issues, so I think it should be fine to land this without a flag for now.

There appears to be some training whitespace in the diff, it would be good to re-format before landing.

This revision is now accepted and ready to land.Dec 27 2020, 10:25 AM
This revision was automatically updated to reflect the committed changes.

Ayal, Florian, thank you for the efforts on this review. I really appreciate the active review.

As an aside: I realized that supporting tail folding is actually a lot easier than I thought. I didn't do it in this patch, but a future patch just needs to a) not clamp the iteration space if tail folding, and b) use the exit conditions to form predicate masks. Amusingly, it would have been easier to start with tail folding from the beginning if I'd realized that. :)

The next patch in this series is D93725.

aeubanks added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1120
fhahn added inline comments.Jan 1 2021, 6:02 AM
llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
1120

I missed this during the initial review, but we might derefence a nullptr here, because when DoExtraAnalysis is true, ExitBB may be null. I pushed d9f306aa52fe to only execute the checks in the else branch of the !ExitBB check

Ayal added a comment.Jan 5 2021, 8:35 AM

LGTM, thanks! I pushed a small fix for the crash discussed earlier in 0ea3749b3cde. It should be a very safe/straight-forward fix, but I would appreciate taking a look post-review.

Thanks, good catch! Added a couple of post-commit nits.