This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Interleaved-mem-accesses analysis and getPtrStride
ClosedPublic

Authored by dorit on Oct 5 2016, 6:05 AM.

Details

Summary

Following D20789 getPtrStride has a new argument ShouldCheckWrap;
This patch makes collectConstStridedAccesses (part of interleaved-mem-accesses analysis)
call getPtrStride with [Assume=true and ShouldCheckWrap=false],
(instead of the default [Assume=false, ShouldCheckWrap=true]),
matching the way getPtrStride is called from isConsecutivePtr.

It is actually not entirely clear to me when we want to permit Assume and/or require ShouldCheckWrap... Looks like there may be some inconsistency now with how getPtrStride is called; For example, hasConsecutiveLikePtrOperand under the convers calls getPtrStride once with [Assume/CheckWrap=true,false] to identify consecutive accesses, and once with [Assume/CheckWrap=false,true] to identify strided accesses. Is this intentional?

In the attached testcase we have missed vectorization when interleaved-accesses are not identified because vectorization with gathers is too costly. If getPtrStride is called with Assume=false and ShouldCheckWrap=true (current behavior) it cannot resolve the stride because the SCEV could wrap and Stride!=1. If getPtrStride is called with Assume=true or ShouldCheckWrap=false it is able to return a stride of 2, and in turn the interleaved-mem-accesses analysis is able to form an interleaving group.

Diff Detail

Event Timeline

dorit updated this revision to Diff 73631.Oct 5 2016, 6:05 AM
dorit retitled this revision from to [LoopVectorizer] Interleaved-mem-accesses analysis and getPtrStride .
dorit updated this object.
dorit added reviewers: hfinkel, sbaranga.
dorit added subscribers: delena, Ayal, mkuper.
mkuper added inline comments.Oct 6 2016, 4:52 PM
llvm/test/Transforms/LoopVectorize/interleaved-accesses-1.ll
4

Drive-by comment: please either remove the triple, or put it in the Transforms/LoopVectorize/X86 subdir.
(Yes, we have ~70 tests that have a triple in the main directory. This looks broken, and I'm going to clean it up.)

ABataev added a subscriber: ABataev.Oct 7 2016, 1:45 AM
ABataev added inline comments.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5617

, true, false -> , /*Assume=*/true, /*ShouldCheckWrap=*/false

sbaranga edited edge metadata.Oct 8 2016, 6:20 AM

Hi,

ShouldCheckWrap indicates if we need to check if the pointer is wrapping in order to identify this as a strided pointer. This is mainly needed for dependence analysis.

Assume=true will version the loop (if possible) such that we can identify this a strided pointer. For example if the SCEV expression of the pointer is (sext i32 {%a,+2} to i64) + 2 we add the necessary checks to convert it to {(sext i32 %a to i64) + 2,+,2}, and we will get a stride of 2. If ShouldCheckWrap is true we will also get a further run-time check to guarantee that the pointer is not wrapping. The downside of this is that we have a limited budget for the number of checks, so ideally we shouldn't add any when we don't need to.

One problem here is that we are lowering the cost of the vectorized loop while increasing the number of runtime checks (which might end up causing a loop to not be vectorized at all).

Therefore I think this at least needs some benchmarking.

We could also take into account the maximum number of checks allowed (after we've determined that it's legal to vectorize the loop), and make sure that we don't exceed it.

Thanks,
Silviu

llvm/test/Transforms/LoopVectorize/interleaved-accesses-1.ll
88

Do we need all the attributes/meta-data?

gilr added a subscriber: gilr.Oct 8 2016, 11:22 AM
dorit updated this revision to Diff 74065.Oct 9 2016, 3:58 AM
dorit edited edge metadata.

Uploaded a version with all the fixes requested so far.

Thanks to all,
Dorit

dorit added a comment.Oct 9 2016, 4:13 AM

Thanks Silviu for the detailed explanation.

ShouldCheckWrap indicates if we need to check if the pointer is wrapping in order to identify this as a strided pointer. This is mainly needed for dependence analysis.

So I take it we are fine with ShouldCheckWrap=false here (being past the dependence testing stage).

One problem here is that we are lowering the cost of the vectorized loop while increasing the number of runtime checks (which might end up causing a loop to not be vectorized at all).

Therefore I think this at least needs some benchmarking.

You mean we might increase the number of runtime checks because we're passing Assume=true? I don't mind passing Assume=false if there's a concern. Although with ShouldCheckWrap=false I don't see how the call to getPtrStride here can result in creating more runtime tests...?

Having said that, I did do some benchmarking and saw no performance degradations.

thanks,
Dorit

We could also take into account the maximum number of checks allowed (after we've determined that it's legal to vectorize the loop), and make sure that we don't exceed it.

Thanks,
Silviu

Thanks Silviu for the detailed explanation.

ShouldCheckWrap indicates if we need to check if the pointer is wrapping in order to identify this as a strided pointer. This is mainly needed for dependence analysis.

So I take it we are fine with ShouldCheckWrap=false here (being past the dependence testing stage).

I initially thought ignoring pointer wrapping would be fine, but it might be more complicated. We're doing here a specialized transformation which complicates things: (emitting a wide load or store). I don't know what would happen if that load or store wraps around the address space (for example doing a 48-byte load at 0xffffff8 in a 32-bit address space). This is a very narrow corner case...

For full groups this should be ok (which includes your example), since if we would wrap around the address space we would do a memory access at nullptr even without the transformation. This means that we don't have to check if stores wrap (all store interleaved groups are full), but we do for loads.

I think deferring these pointer wrapping checks until after we've formed the interleaved groups should work.

One problem here is that we are lowering the cost of the vectorized loop while increasing the number of runtime checks (which might end up causing a loop to not be vectorized at all).

Therefore I think this at least needs some benchmarking.

You mean we might increase the number of runtime checks because we're passing Assume=true? I don't mind passing Assume=false if there's a concern. Although with ShouldCheckWrap=false I don't see how the call to getPtrStride here can result in creating more runtime tests...?

It can happen for example with the following loop

int i = 0;
for (unsinged char index = x; index < y; index +=2, ++i)

b[i] = a[index] * 2;

when alias analysis can figure out that accesses to b and a don't alias. Without Assume = true (and further run-time checks) we wouldn't be able to recognize a[index] as a strided pointer (because it isn't when y < x). LAA doesn't need to add any checks for this (dependences are safe).

-Silviu

dorit added a comment.Oct 11 2016, 4:08 AM

I think deferring these pointer wrapping checks until after we've formed the interleaved groups should work.

Makes sense. So, to summarize:

  1. during collectConstStrideAccesses we want to call getPtrStride with Assume=false (*), ShouldCheckWrap=false (optimistic).
  2. towards the end of analyzeInterleaving we will revisit groups (of loads) with gaps and for those call getPtrStride with Assume=false (**), shouldCheckWrap=true.

(*) As far as I can see when ShouldCheckWrap=false then Assume is pretty much a don't-care in terms of runtime checks; no new runtime checks will be added by getPtrStride. See below.

(**) I would set Assume=false initially, but I think in the long run we may want to allow runtime checks here if the budget allows (as you noted earlier, so that we don't end up failing vectorization altogether due to these extra checks). That can be done in a followup patch.

Agreed?

It can happen for example with the following loop

Thanks for the example!

Without Assume = true (and further run-time checks) we wouldn't be able to recognize a[index] as a strided pointer (because it isn't when y < x).

Yes; but my point was that when ShouldCheckWrap=false we never get to consider adding new runtime checks in getPtrStride, even if Assume=true. Unless I'm missing something?

thanks,
Dorit

Agreed?

SGTM

Yes; but my point was that when ShouldCheckWrap=false we never get to consider adding new runtime checks in getPtrStride, even if Assume=true. Unless I'm missing something?

Assume=true can add more run-time checks than just for what ShouldCheckWrap checks, so it is possible for this to happen. ShouldCheckWrap will check wrapping for something that has an linear expression, while Assume=true can add checks that would make the expression of the pointer linear (this would happen in the previous example). If you look at the implementation of getPtrStride, the PSE.getAsAddRec(Ptr) convert the expression of the pointer to something linear (strided) using runtime checks, while PSE.setNoOverflow will only check the pointer wrapping with runtime checks.

-Silviu

dorit added a comment.Oct 18 2016, 8:48 AM

Assume=true can add more run-time checks than just for what ShouldCheckWrap checks, so it is possible for this to happen. ShouldCheckWrap will check wrapping for something that has an linear expression, while Assume=true can add checks that would make the expression of the pointer linear (this would happen in the previous example). If you look at the implementation of getPtrStride, the PSE.getAsAddRec(Ptr) convert the expression of the pointer to something linear (strided) using runtime checks, while PSE.setNoOverflow will only check the pointer wrapping with runtime checks.

I see. Thanks.

So in this case, don't we have a problem of potentially adding runtime SCEV assumptions without ever checking if we exceed the threshold? getPtrStride is called with Assume=true several times from the cost-model stage and from the actual vectorization transformation stage, via the call to isConsecutivePtr. Looks like these call sites are past the point when we check whether we exceeded the threshold for runtime SCEV assumptions...?

thanks,
Dorit

I see. Thanks.

So in this case, don't we have a problem of potentially adding runtime SCEV assumptions without ever checking if we exceed the threshold? getPtrStride is called with Assume=true several times from the cost-model stage and from the actual vectorization transformation stage, via the call to isConsecutivePtr. Looks like these call sites are past the point when we check whether we exceeded the threshold for runtime SCEV assumptions...?

Yes, I think that would be the case. I think for now this isn't really a problem, as these are mostly edge cases. If it turns out to be a problem, we have several options:

  • allowing Predicated Scalar Evolution interface to only add these checks if we don't exceed a certain threshold
  • I've found that we can often rewrite the code to only add these checks if we really need them

We should also move the check for the number of SCEV assumptions after the legality stage of the vectorizer (I thought we already did that).

Thanks,
Silviu

dorit updated this revision to Diff 75112.Oct 19 2016, 12:55 AM

Updated as agreed above: use Assume=false,ShouldCheckWrap=false initially; revisit groups with gaps later with Assume=false,ShouldCheckWrap=true;
Added an xfailing testcase where we have gaps and need Assume=true in order to prove that the stride is 2.

I see. Thanks.

So in this case, don't we have a problem of potentially adding runtime SCEV assumptions without ever checking if we exceed the threshold? getPtrStride is called with Assume=true several times from the cost-model stage and from the actual vectorization transformation stage, via the call to isConsecutivePtr. Looks like these call sites are past the point when we check whether we exceeded the threshold for runtime SCEV assumptions...?

Yes, I think that would be the case.

I will open a PR to capture all the SCEV threshold issues that came up here, including the xfailing test I just added.

We should also move the check for the number of SCEV assumptions after the legality stage of the vectorizer (I thought we already did that).

so what exactly do you mean? It's already the very last thing we do in canVectorize...?

thanks,
Dorit

Thanks,
Silviu

mssimpso edited edge metadata.Oct 19 2016, 7:50 AM

Hi Dorit/Silviu,

Intuitively it seems like we would want to use getPtrStride in the interleaved access analysis the same way we do in isConsecutivePtr. Is the issue here that we might add additional run-time checks that we aren't currently? We already call isConsecutivePtr for the pointer operand of every load/store in the loop. Would the additional checks then be duplicates?

Intuitively it seems like we would want to use getPtrStride in the interleaved access analysis the same way we do in isConsecutivePtr. Is the issue here that we might add additional run-time checks that we aren't currently? We already call isConsecutivePtr for the pointer operand of every load/store in the loop. Would the additional checks then be duplicates?

isConsecutivePtr won't check for pointer wrapping (unless it's called from LAA), while there is a corner case here that would require us to check the pointer wrapping here (we can start emitting wrapping wide loads/stores otherwise).

so what exactly do you mean? It's already the very last thing we do in canVectorize...?

Sorry, I misunderstood your previous statement (I thought this analysis happened after the check for the number of predicates). We shouldn't be adding new predicates in the transformation stage, this should be done during the legality phase. But there we are adding predicates that will always be useful (we should get better codegen for AddRecExprs).
Maybe the best solution would be to try to make all pointers AddRecExprs in the legality stage for now?

-Silviu

dorit added a comment.Oct 20 2016, 3:25 AM

Hi Dorit/Silviu,

Intuitively it seems like we would want to use getPtrStride in the interleaved access analysis the same way we do in isConsecutivePtr. Is the issue here that we might add additional run-time checks that we aren't currently?

That was my initial intuition too (the first version of the patch I proposed did exactly that).

AFAIU Silviu's concern here was that if we use Assume=true during interleaving analysis we may be adding "nice to have" runtime checks (loop can be vectorized without it) which may prevent us from adding necessary runtime checks later (due to exceeding the SCEV-runtime-checks-threshold) which would prevent vectorization altogether.

In any case, we have two orthogonal issues here:
(1) the general SCEV-runtime-checks threshold issues -- straightening that up all over.
(2) interleaving analysis being overly conservative w.r.t stride analysis.
In this patch we want to improve (2), but without making (1) worse.
I was going to open a separate PR for (1).

Are we ok with the patch (for addressing (2))?

Thanks,
Dorit

If I understand things correctly now, we're trying to optimize the case where we have a full interleave group where the stride may wrap. So yes I see that we may want ShouldCheckWrap=false, at least initially. I don't think there is any harm in Assume=true for the initial call, though, because we later in Legality (collectLoopUniforms) call isConsecutivePtr for the pointer operand of every memory instruction (with Assume=true). If we don't use Assume=true for a pointer in the interleaved accesses anaysis, we will do so later in the uniformity analysis, so I think we want to do the same thing in both places.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5624–5626

I think this would be a little cleaner if we added a may wrap flag to StrideDescriptor. We would check the wrapping condition initially here and save that in the StrideDescriptor (and in the interleave groups as we build them). Then for removing groups, we could simply check that it is not full and may wrap.

llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗(On Diff #75112)

I would prefer that we not XFAIL a new test. Unless Silviu disagrees with letting Assume=true, I think it would be better to check the actual behavior while keeping the FIXME.

Would we be missing a test though? If Assume=true here, I don't think we will be checking the case that we actually remove a group that is may wrap?

sbaranga added inline comments.Oct 20 2016, 6:42 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5839

I think there is another case here when we don't need to check the wrapping: if the group has the first and last pointers, and we know they both don't wrap.

We should be able to apply this for versioning later and have at most two no-wrap checks per interleaved group.

llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗(On Diff #75112)

I think having Assume=true for the initial check should be fine.

If I understand things correctly now, we're trying to optimize the case where we have a full interleave group where the stride may wrap. So yes I see that we may want ShouldCheckWrap=false, at least initially. I don't think there is any harm in Assume=true for the initial call, though, because we later in Legality (collectLoopUniforms) call isConsecutivePtr for the pointer operand of every memory instruction (with Assume=true). If we don't use Assume=true for a pointer in the interleaved accesses anaysis, we will do so later in the uniformity analysis, so I think we want to do the same thing in both places.

I think we are all in agreement on that!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5624–5626

OTOH, I think there’s also something clean about initially avoiding the wrapping question all together…

Also, how aggressive in terms of allowing runtime checks do we want to be when we check the wrapping condition initially ?
If we allow it, that would be the equivalent of calling getPtrStride with Assume=true, ShouldCheckWrap=true; That might mean adding a lot of tests…
If we don’t allow it, we are too conservative.

Maybe I misunderstood exactly what you are proposing, but I think in any case like the approach of light-weight optimistic initial pass over all pointers, and then selectively checking wrapping only where needed.

5839

So, trying to summarize:

  1. Initial check with Assume=true, ShouldCheckWrap=false;
  2. Revisit only groups with gaps as follows:
  3. if first and last elements are accessed: skip further checking.
  4. otherwise – check the first and last pointers of the group with Assume=false (for now), ShouldCheckWrap=true;

Silviu, is that what you meant?

llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗(On Diff #75112)

Right;

Silviu, the example you gave earlier for a strided access that requires Assume=true: How would you turn it into a test that doesn't fail vectorization on "SCEV could not compute the loop exit count"...?

dorit added inline comments.Oct 20 2016, 11:27 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5839

3 and 4 above were meant to be sub-bullets of item 2...:

1.Initial check with Assume=true, ShouldCheckWrap=false;
2.Revisit only groups with gaps as follows:

  • if first and last elements are accessed: skip further checking.
  • otherwise – check the first and last pointers of the group with Assume=false (for now), ShouldCheckWrap=true;
sbaranga added inline comments.Oct 21 2016, 4:05 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5839

I agree with point 1)
For point 2) I meant something slightly different:

  • if first and last elements are accessed *and* we know they don't wrap, skip further checking (because it implies that all the pointers in the group don't wrap).
  • if the group is full, skip further checking
  • otherwise (as you said) check the first and last pointers of the group with Assume=false (for now), ShouldCheckWrap=true;

I think optimizing this is a bit open-ended, but we could definitely go further (feel free to omit this for this commit):

If we are peeling the loop and we know that the first pointer doesn't wrap then we can deduce that all pointers in the group don't wrap.

This means that we can forcefully peel the loop in order to only have to check the first pointer for no-wrap. When
we'll change to use Assume=true,ShouldCheckWrap=true we'll only need at most one runtime check per interleaved group (I think this is probably optimal).

llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗(On Diff #75112)

I've tried the following function, and you do have to have assume=true for it in order to form interleaved groups

void func(unsigned * restrict a, unsigned * restrict b, unsigned char x, unsigned char y) {

  int i = 0;
  for (unsigned char index = x; i < y; index +=2, ++i)
    b[i] = a[index] * 2;

}
dorit added inline comments.Oct 23 2016, 2:45 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5839

about “*and* we know they don't wrap”….:

  1. exactly what do you mean by “know they don’t wrap”? that calling getPtrStride with Assume=false, ShouldCheckWrap=true is successful?
  1. I would think that “group is full” and “first and last elements are accessed” are equivalent in terms of the effect of the transformation on creating new potentially-wrapping accesses (in both cases the transformation will not make things worse than the original code in that respect); we shouldn’t care if the accesses to the first/last elements wrap or not because the wide loads that we’ll create will have the same wrap-around behavior as the original code, no? (in short, why the extra "*and* we know they don't wrap"...?)

(this is assuming that potential wrapping that may be caused by extra accesses in the vector code created due to misalignment are accounted for else-where).

llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗(On Diff #75112)

perfect. thanks!!

sbaranga added inline comments.Oct 24 2016, 2:45 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5839
  1. Correct, in this case I mean Assume=false, ShouldCheckWrap=true. Although when switching to using Assume = true it is preferable to call getPtrStride on just the two accesses instead of on all the pointers.

2). Here is my current reasoning:

The justification for why the "group is full" case doesn't require checking is that if the wide load does wrap we would do an access to nullptr in the non-transformed code (and therefore have undefined behaviour). This reasoning doesn't hold when the group is not full (since we could be skipping the access to nullptr).

I didn't fully understand the part about alignment, but AFAICT considering here that the accesses all the accesses are aligned to the element type should be enough.

dorit updated this revision to Diff 75796.Oct 25 2016, 2:55 PM
dorit edited edge metadata.

The updated version adds the following:

  • avoid XFAIL in a new test
  • use Assume=true in collectConstStrideAccesses + add a testcase that requires that and that tests group invalidation in that scenario
  • check only the first and last members of the group instead of checking all the members
  • further optimization ideas that came up here were added as a TODO comment.
dorit added a comment.Oct 25 2016, 2:58 PM

1.Initial check with Assume=true, ShouldCheckWrap=false; 2.Revisit
only groups with gaps as follows:

  • if first and last elements are accessed: skip further checking.
  • otherwise – check the first and last pointers of the group

with Assume=false (for now), ShouldCheckWrap=true;

I agree with point 1)
For point 2) I meant something slightly different:
(a) - if first and last elements are accessed *and* we know they don't wrap, skip further checking (because it implies that all the pointers in the group don't wrap).
(b) - if the group is full, skip further checking
(c)- otherwise (as you said) check the first and last pointers of the
group with Assume=false (for now), ShouldCheckWrap=true;

...

  1. Correct, in this case I mean Assume=false, ShouldCheckWrap=true.

Although when switching to using Assume = true it is preferable to call getPtrStride on just the two accesses instead of on all the pointers.

We might as well check just these two accesses already, even while we are using Assume=false, no? (it's less critical because it won't affect runtime, but it will make the followup change of upgrading to Assume=true smaller).
This is what I did in the updated version I just uploaded.
I only included cases (b) and (c) from the above; In the end I am not sure what case (a) checks beyond that case (c) already covers...?

thanks,
Dorit

dorit added inline comments.Oct 25 2016, 8:50 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5862

meant to put a continue's here instead of break's... will upload a new version.

dorit updated this revision to Diff 75840.Oct 26 2016, 2:22 AM

break --> continue

Yes, now that I think about it skipping case c) makes sense.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5846

FirstMemberPtr should always be getPointerOperand(Group->getMember(0)) (it's guaranteed to exist).

Now that I look at the code it would be simpler to go with a the more optimized version. It would be simpler to just go here with

FirstMemberPtr = getPointerOperand(Group->getMember(0))
LastMemberPtr = getPointerOperand(Group->getMember(Group->getFactor()-1))

and remove the loop. If we get null for LastMemberPtr we can just ignore it (since we know that in this case we will always peel the loop, meaning that we only need to check the first member). The code will look like:

if (!getPtrStride(PSE, FirstMemberPtr, TheLoop, Strides,

                            /*Assume=*/false, /*ShouldCheckWrap=*/true)) {
releaseGroup(Group);
continue;

}

if (LastMemberPtr)

if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, 
                            /*Assume=*/false, /*ShouldCheckWrap=*/true))) {
releaseGroup();
continue;

}

dorit updated this revision to Diff 75855.Oct 26 2016, 5:20 AM

Thanks, Silviu! Relying on the peeling does come in handy here!

The new version looks at members 0 and factor-1 instead of scanning the group to find the actual first and last members.

sbaranga accepted this revision.Oct 26 2016, 5:29 AM
sbaranga edited edge metadata.

Thanks for making the changes! LGTM

This revision is now accepted and ready to land.Oct 26 2016, 5:29 AM
This revision was automatically updated to reflect the committed changes.