This is an archive of the discontinued LLVM Phabricator instance.

[LV] Test once if vector trip count is zero, instead of twice
ClosedPublic

Authored by Ayal on Jun 13 2017, 9:00 AM.

Details

Summary

Before reaching the header block of the vectorized loop, the Loop Vectorizer generates two conditions to make sure the vectorized loop will iterate at-least once: first emitVectorLoopEnteredCheck() compares the Scalar Trip Count to VF*UF, and if the former is not less than the latter, emitMinimumIterationCountCheck() compares the Vector Trip Count to zero.

In general, VTC = STC mod VF*UF, unless this does not comply with an optional constraint to execute at-least a single scalar iteration in the epilog, aka requiresScalarEpilogue(). In this case a single (the last) vector iteration is peeled and replaced with VF*UF scalar iterations (instead of none), reducing VTC by 1.

This patch replaces the above two comparisons for VTC == 0 with a single comparison:

  • STC < VF*UF, when requiresScalarEpilogue() does not hold; or
  • STC <= VF*UF, when requiresScalarEpilogue() does hold,

effectively removing the basic-block originally named "min.iters.checked".

The original documentation of emitMinimumIterationCountCheck() claiming it checks for overflow seems obsolete (right?).

Original observation and initial patch by Evgeny Stupachenko.

Diff Detail

Repository
rL LLVM

Event Timeline

Ayal created this revision.Jun 13 2017, 9:00 AM
mkuper added a reviewer: wmi.Jul 7 2017, 4:21 PM
mkuper edited edge metadata.

Re overflow - the point is that getOrCreateTripCount() returns, basically, PSE.getBackedgeTakenCount() + 1, and that may overflow, so the "trip count" may end up being 0 if the backedge taken count is 0. I don't think this is outdated, and this is behavior we want to preserve. But this patch should preserve this behavior IIUC. Can you make sure there's a test for this?

Anyway, the logic seems sound, but I'd like wmi to look at it - he added the minimum iterations check in r245952, and combined it with the overflow check, instead of the VTC==0 check.
Wei, is there some case we're missing here?

One more thing about the description: "In this case a single (the last) vector iteration is peeled and replaced with VF*UF scalar iterations (instead of none), reducing VTC by 1." <-- this is confusing. That only happens if STC == 0 (mod VF * UF). So the logic is correct, since this is indeed what happens when STC == VF * UF, but I would phrase it differently.

wmi edited edge metadata.Jul 11 2017, 9:49 AM

Re overflow - the point is that getOrCreateTripCount() returns, basically, PSE.getBackedgeTakenCount() + 1, and that may overflow, so the "trip count" may end up being 0 if the backedge taken count is 0. I don't think this is outdated, and this is behavior we want to preserve. But this patch should preserve this behavior IIUC. Can you make sure there's a test for this?

max_i32_backedgetaken in test/Transforms/LoopVectorize/induction.ll is the orginal test for overflow case. The patch generates correct code for it.

Anyway, the logic seems sound, but I'd like wmi to look at it - he added the minimum iterations check in r245952, and combined it with the overflow check, instead of the VTC==0 check.
Wei, is there some case we're missing here?

The logic looks good to me. It is a nice improvement.

One more thing about the description: "In this case a single (the last) vector iteration is peeled and replaced with VF*UF scalar iterations (instead of none), reducing VTC by 1." <-- this is confusing. That only happens if STC == 0 (mod VF * UF). So the logic is correct, since this is indeed what happens when STC == VF * UF, but I would phrase it differently.

mkuper accepted this revision.Jul 11 2017, 10:39 AM

Thanks, Wei.
LGTM.

This revision is now accepted and ready to land.Jul 11 2017, 10:39 AM
Ayal added a subscriber: anemet.Jul 11 2017, 4:20 PM

Re overflow - the point is that getOrCreateTripCount() returns, basically, PSE.getBackedgeTakenCount() + 1, and that may overflow, so the "trip count" may end up being 0 if the backedge taken count is 0. I don't think this is outdated, and this is behavior we want to preserve. But this patch should preserve this behavior IIUC.

Ahh, right. This patch is indeed intended to preserve the existing behavior. In D12107 @anemet specifically asked not to remove the explanation for the need of checking overflow. Will make sure this explanation remains in place.

In D34150#805377, @wmi wrote:

Re overflow - the point is that getOrCreateTripCount() returns, basically, PSE.getBackedgeTakenCount() + 1, and that may overflow, so the "trip count" may end up being 0 if the backedge taken count is 0. I don't think this is outdated, and this is behavior we want to preserve. But this patch should preserve this behavior IIUC. Can you make sure there's a test for this?

max_i32_backedgetaken in test/Transforms/LoopVectorize/induction.ll is the orginal test for overflow case. The patch generates correct code for it.

Also miniters.ll test added back in r245952 checks for the right min.iters.check, so that seems to be covered. However, a devised test shows that vector.scevcheck may generate its own checks to guard against potential overflow of the trip count, presumably unaware that min.iters.check took care of it(?); so there may be a third test involved in checking if the vector trip count is zero... Will open a PR, as this deserves a separate fix.

BTW, a scalar loop whose trip count is uintxx_max+1 may actually (theoretically?) be a good candidate to vectorize; provided we compute its vector trip count correctly. Such loops however are expected to appear seldom in practice, if at all(?)

In D34150#805377, @wmi wrote:

Anyway, the logic seems sound, but I'd like wmi to look at it - he added the minimum iterations check in r245952, and combined it with the overflow check, instead of the VTC==0 check.
Wei, is there some case we're missing here?

The logic looks good to me. It is a nice improvement.

Thanks :-). Perhaps this is what @jmolloy referred to originally back in D12107:

We do actually already emit such a check - we just emit it late. This new check now makes the later check redundant, so we should remove it.

One more thing about the description: "In this case a single (the last) vector iteration is peeled and replaced with VF*UF scalar iterations (instead of none), reducing VTC by 1." <-- this is confusing. That only happens if STC == 0 (mod VF * UF). So the logic is correct, since this is indeed what happens when STC == VF * UF, but I would phrase it differently.

Right, sorry for the confusion.
Another, hopefully less confusing way to phrase this: when requiresScalarEpilogue() holds, at-least one iteration must remain scalar; the rest can be used to form vector iterations. So VTC = (STC-1) mod VF*UF, and so VTC is zero iff STC-1 < VF*UF, which we convert to STC <= VF*UF.

Ayal updated this revision to Diff 106320.Jul 12 2017, 3:15 PM

Added the following comment following review, and updated a testcase that was recently modified (if-conversion-nest.ll)

// Now, compare the new count to zero. If it is zero skip the vector loop and
// jump to the scalar loop. This check also covers the case where the
// backedge-taken count is unit##_max: adding one to it will overflow leading
// to an incorrect trip count of zero. In this case we will also jump to the
// scalar loop, instead of trying to vectorize with a correct trip count, as
// such loops are expected to be extremely rare.
emitMinimumIterationCountCheck(Lp, ScalarPH);
This revision was automatically updated to reflect the committed changes.