This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/trunk/
-
trunk/
-
lib/Transforms/Vectorize/
-
Transforms/
-
Vectorize/
-
LoopVectorize.cpp
-
test/Transforms/LoopVectorize/
-
Transforms/
-
LoopVectorize/
-
AArch64/
-
gather-cost.ll
-
ARM/
-
gather-cost.ll
-
X86/
-
gather-cost.ll
-
interleaved-accesses-1.ll
-
interleaved-accesses-2.ll
-
interleaved-accesses-3.ll

Differential D25276

[LoopVectorizer] Interleaved-mem-accesses analysis and getPtrStride
ClosedPublic

Authored by dorit on Oct 5 2016, 6:05 AM.

Download Raw Diff

Details

Reviewers

anemet
mssimpso
hfinkel
sbaranga

Commits

rG3c1c658f2488: [LoopVectorize] Make interleaved-accesses analysis less conservative about…
rL285517: [LoopVectorize] Make interleaved-accesses analysis less conservative about

Summary

Following D20789 getPtrStride has a new argument ShouldCheckWrap;
This patch makes collectConstStridedAccesses (part of interleaved-mem-accesses analysis)
call getPtrStride with [Assume=true and ShouldCheckWrap=false],
(instead of the default [Assume=false, ShouldCheckWrap=true]),
matching the way getPtrStride is called from isConsecutivePtr.

It is actually not entirely clear to me when we want to permit Assume and/or require ShouldCheckWrap... Looks like there may be some inconsistency now with how getPtrStride is called; For example, hasConsecutiveLikePtrOperand under the convers calls getPtrStride once with [Assume/CheckWrap=true,false] to identify consecutive accesses, and once with [Assume/CheckWrap=false,true] to identify strided accesses. Is this intentional?

In the attached testcase we have missed vectorization when interleaved-accesses are not identified because vectorization with gathers is too costly. If getPtrStride is called with Assume=false and ShouldCheckWrap=true (current behavior) it cannot resolve the stride because the SCEV could wrap and Stride!=1. If getPtrStride is called with Assume=true or ShouldCheckWrap=false it is able to return a stride of 2, and in turn the interleaved-mem-accesses analysis is able to form an interleaving group.

Diff Detail

Repository: rL LLVM

Event Timeline

dorit updated this revision to Diff 73631.Oct 5 2016, 6:05 AM

dorit retitled this revision from to [LoopVectorizer] Interleaved-mem-accesses analysis and getPtrStride .

dorit updated this object.

dorit added reviewers: hfinkel, sbaranga.

dorit added subscribers: delena, Ayal, mkuper.

Herald added a subscriber: mzolotukhin. · View Herald TranscriptOct 5 2016, 6:05 AM

mkuper added inline comments.Oct 6 2016, 4:52 PM

llvm/test/Transforms/LoopVectorize/interleaved-accesses-1.ll
4 ↗	(On Diff #73631)	Drive-by comment: please either remove the triple, or put it in the Transforms/LoopVectorize/X86 subdir. (Yes, we have ~70 tests that have a triple in the main directory. This looks broken, and I'm going to clean it up.)

ABataev added a subscriber: ABataev.Oct 7 2016, 1:45 AM

ABataev added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5617 ↗	(On Diff #73631)	`, true, false` -> `, /Assume=/true, /ShouldCheckWrap=/false`

Hi,

ShouldCheckWrap indicates if we need to check if the pointer is wrapping in order to identify this as a strided pointer. This is mainly needed for dependence analysis.

Assume=true will version the loop (if possible) such that we can identify this a strided pointer. For example if the SCEV expression of the pointer is (sext i32 {%a,+2} to i64) + 2 we add the necessary checks to convert it to {(sext i32 %a to i64) + 2,+,2}, and we will get a stride of 2. If ShouldCheckWrap is true we will also get a further run-time check to guarantee that the pointer is not wrapping. The downside of this is that we have a limited budget for the number of checks, so ideally we shouldn't add any when we don't need to.

One problem here is that we are lowering the cost of the vectorized loop while increasing the number of runtime checks (which might end up causing a loop to not be vectorized at all).

Therefore I think this at least needs some benchmarking.

We could also take into account the maximum number of checks allowed (after we've determined that it's legal to vectorize the loop), and make sure that we don't exceed it.

Thanks,
Silviu

llvm/test/Transforms/LoopVectorize/interleaved-accesses-1.ll
88 ↗	(On Diff #73631)	Do we need all the attributes/meta-data?

gilr added a subscriber: gilr.Oct 8 2016, 11:22 AM

Uploaded a version with all the fixes requested so far.

Thanks to all,
Dorit

In D25276#565321, @sbaranga wrote:

Thanks Silviu for the detailed explanation.

ShouldCheckWrap indicates if we need to check if the pointer is wrapping in order to identify this as a strided pointer. This is mainly needed for dependence analysis.

So I take it we are fine with ShouldCheckWrap=false here (being past the dependence testing stage).

One problem here is that we are lowering the cost of the vectorized loop while increasing the number of runtime checks (which might end up causing a loop to not be vectorized at all).

Therefore I think this at least needs some benchmarking.

You mean we might increase the number of runtime checks because we're passing Assume=true? I don't mind passing Assume=false if there's a concern. Although with ShouldCheckWrap=false I don't see how the call to getPtrStride here can result in creating more runtime tests...?

Having said that, I did do some benchmarking and saw no performance degradations.

thanks,
Dorit

We could also take into account the maximum number of checks allowed (after we've determined that it's legal to vectorize the loop), and make sure that we don't exceed it.

Thanks,
Silviu

In D25276#565553, @dorit wrote:

In D25276#565321, @sbaranga wrote:

Thanks Silviu for the detailed explanation.

ShouldCheckWrap indicates if we need to check if the pointer is wrapping in order to identify this as a strided pointer. This is mainly needed for dependence analysis.

So I take it we are fine with ShouldCheckWrap=false here (being past the dependence testing stage).

I initially thought ignoring pointer wrapping would be fine, but it might be more complicated. We're doing here a specialized transformation which complicates things: (emitting a wide load or store). I don't know what would happen if that load or store wraps around the address space (for example doing a 48-byte load at 0xffffff8 in a 32-bit address space). This is a very narrow corner case...

For full groups this should be ok (which includes your example), since if we would wrap around the address space we would do a memory access at nullptr even without the transformation. This means that we don't have to check if stores wrap (all store interleaved groups are full), but we do for loads.

I think deferring these pointer wrapping checks until after we've formed the interleaved groups should work.

One problem here is that we are lowering the cost of the vectorized loop while increasing the number of runtime checks (which might end up causing a loop to not be vectorized at all).

Therefore I think this at least needs some benchmarking.

You mean we might increase the number of runtime checks because we're passing Assume=true? I don't mind passing Assume=false if there's a concern. Although with ShouldCheckWrap=false I don't see how the call to getPtrStride here can result in creating more runtime tests...?

It can happen for example with the following loop

int i = 0;
for (unsinged char index = x; index < y; index +=2, ++i)

b[i] = a[index] * 2;

when alias analysis can figure out that accesses to b and a don't alias. Without Assume = true (and further run-time checks) we wouldn't be able to recognize a[index] as a strided pointer (because it isn't when y < x). LAA doesn't need to add any checks for this (dependences are safe).

-Silviu

sbaranga added reviewers: anemet, mssimpso.Oct 10 2016, 6:37 AM

sbaranga added a subscriber: llvm-commits.Oct 10 2016, 8:10 AM

I think deferring these pointer wrapping checks until after we've formed the interleaved groups should work.

Makes sense. So, to summarize:

during collectConstStrideAccesses we want to call getPtrStride with Assume=false (*), ShouldCheckWrap=false (optimistic).
towards the end of analyzeInterleaving we will revisit groups (of loads) with gaps and for those call getPtrStride with Assume=false (**), shouldCheckWrap=true.

(*) As far as I can see when ShouldCheckWrap=false then Assume is pretty much a don't-care in terms of runtime checks; no new runtime checks will be added by getPtrStride. See below.

(**) I would set Assume=false initially, but I think in the long run we may want to allow runtime checks here if the budget allows (as you noted earlier, so that we don't end up failing vectorization altogether due to these extra checks). That can be done in a followup patch.

Agreed?

It can happen for example with the following loop

Thanks for the example!

Without Assume = true (and further run-time checks) we wouldn't be able to recognize a[index] as a strided pointer (because it isn't when y < x).

Yes; but my point was that when ShouldCheckWrap=false we never get to consider adding new runtime checks in getPtrStride, even if Assume=true. Unless I'm missing something?

thanks,
Dorit

In D25276#567145, @dorit wrote:

Agreed?

SGTM

Yes; but my point was that when ShouldCheckWrap=false we never get to consider adding new runtime checks in getPtrStride, even if Assume=true. Unless I'm missing something?

Assume=true can add more run-time checks than just for what ShouldCheckWrap checks, so it is possible for this to happen. ShouldCheckWrap will check wrapping for something that has an linear expression, while Assume=true can add checks that would make the expression of the pointer linear (this would happen in the previous example). If you look at the implementation of getPtrStride, the PSE.getAsAddRec(Ptr) convert the expression of the pointer to something linear (strided) using runtime checks, while PSE.setNoOverflow will only check the pointer wrapping with runtime checks.

-Silviu

In D25276#567411, @sbaranga wrote:

Assume=true can add more run-time checks than just for what ShouldCheckWrap checks, so it is possible for this to happen. ShouldCheckWrap will check wrapping for something that has an linear expression, while Assume=true can add checks that would make the expression of the pointer linear (this would happen in the previous example). If you look at the implementation of getPtrStride, the PSE.getAsAddRec(Ptr) convert the expression of the pointer to something linear (strided) using runtime checks, while PSE.setNoOverflow will only check the pointer wrapping with runtime checks.

I see. Thanks.

So in this case, don't we have a problem of potentially adding runtime SCEV assumptions without ever checking if we exceed the threshold? getPtrStride is called with Assume=true several times from the cost-model stage and from the actual vectorization transformation stage, via the call to isConsecutivePtr. Looks like these call sites are past the point when we check whether we exceeded the threshold for runtime SCEV assumptions...?

thanks,
Dorit

In D25276#573001, @dorit wrote:

I see. Thanks.

So in this case, don't we have a problem of potentially adding runtime SCEV assumptions without ever checking if we exceed the threshold? getPtrStride is called with Assume=true several times from the cost-model stage and from the actual vectorization transformation stage, via the call to isConsecutivePtr. Looks like these call sites are past the point when we check whether we exceeded the threshold for runtime SCEV assumptions...?

Yes, I think that would be the case. I think for now this isn't really a problem, as these are mostly edge cases. If it turns out to be a problem, we have several options:

allowing Predicated Scalar Evolution interface to only add these checks if we don't exceed a certain threshold
I've found that we can often rewrite the code to only add these checks if we really need them

We should also move the check for the number of SCEV assumptions after the legality stage of the vectorizer (I thought we already did that).

Thanks,
Silviu

Updated as agreed above: use Assume=false,ShouldCheckWrap=false initially; revisit groups with gaps later with Assume=false,ShouldCheckWrap=true;
Added an xfailing testcase where we have gaps and need Assume=true in order to prove that the stride is 2.

In D25276#573182, @sbaranga wrote:

In D25276#573001, @dorit wrote:

I see. Thanks.

So in this case, don't we have a problem of potentially adding runtime SCEV assumptions without ever checking if we exceed the threshold? getPtrStride is called with Assume=true several times from the cost-model stage and from the actual vectorization transformation stage, via the call to isConsecutivePtr. Looks like these call sites are past the point when we check whether we exceeded the threshold for runtime SCEV assumptions...?

Yes, I think that would be the case.

I will open a PR to capture all the SCEV threshold issues that came up here, including the xfailing test I just added.

We should also move the check for the number of SCEV assumptions after the legality stage of the vectorizer (I thought we already did that).

so what exactly do you mean? It's already the very last thing we do in canVectorize...?

thanks,
Dorit

Thanks,
Silviu

Hi Dorit/Silviu,

Intuitively it seems like we would want to use getPtrStride in the interleaved access analysis the same way we do in isConsecutivePtr. Is the issue here that we might add additional run-time checks that we aren't currently? We already call isConsecutivePtr for the pointer operand of every load/store in the loop. Would the additional checks then be duplicates?

In D25276#574171, @mssimpso wrote:

Intuitively it seems like we would want to use getPtrStride in the interleaved access analysis the same way we do in isConsecutivePtr. Is the issue here that we might add additional run-time checks that we aren't currently? We already call isConsecutivePtr for the pointer operand of every load/store in the loop. Would the additional checks then be duplicates?

isConsecutivePtr won't check for pointer wrapping (unless it's called from LAA), while there is a corner case here that would require us to check the pointer wrapping here (we can start emitting wrapping wide loads/stores otherwise).

In D25276#573854, @dorit wrote:

so what exactly do you mean? It's already the very last thing we do in canVectorize...?

Sorry, I misunderstood your previous statement (I thought this analysis happened after the check for the number of predicates). We shouldn't be adding new predicates in the transformation stage, this should be done during the legality phase. But there we are adding predicates that will always be useful (we should get better codegen for AddRecExprs).
Maybe the best solution would be to try to make all pointers AddRecExprs in the legality stage for now?

-Silviu

In D25276#574171, @mssimpso wrote:

Hi Dorit/Silviu,

Intuitively it seems like we would want to use getPtrStride in the interleaved access analysis the same way we do in isConsecutivePtr. Is the issue here that we might add additional run-time checks that we aren't currently?

That was my initial intuition too (the first version of the patch I proposed did exactly that).

AFAIU Silviu's concern here was that if we use Assume=true during interleaving analysis we may be adding "nice to have" runtime checks (loop can be vectorized without it) which may prevent us from adding necessary runtime checks later (due to exceeding the SCEV-runtime-checks-threshold) which would prevent vectorization altogether.

In any case, we have two orthogonal issues here:
(1) the general SCEV-runtime-checks threshold issues -- straightening that up all over.
(2) interleaving analysis being overly conservative w.r.t stride analysis.
In this patch we want to improve (2), but without making (1) worse.
I was going to open a separate PR for (1).

Are we ok with the patch (for addressing (2))?

Thanks,
Dorit

If I understand things correctly now, we're trying to optimize the case where we have a full interleave group where the stride may wrap. So yes I see that we may want ShouldCheckWrap=false, at least initially. I don't think there is any harm in Assume=true for the initial call, though, because we later in Legality (collectLoopUniforms) call isConsecutivePtr for the pointer operand of every memory instruction (with Assume=true). If we don't use Assume=true for a pointer in the interleaved accesses anaysis, we will do so later in the uniformity analysis, so I think we want to do the same thing in both places.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5648–5650 ↗	(On Diff #75112)	I think this would be a little cleaner if we added a may wrap flag to StrideDescriptor. We would check the wrapping condition initially here and save that in the StrideDescriptor (and in the interleave groups as we build them). Then for removing groups, we could simply check that it is not full and may wrap.
llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗	(On Diff #75112)	I would prefer that we not XFAIL a new test. Unless Silviu disagrees with letting Assume=true, I think it would be better to check the actual behavior while keeping the FIXME. Would we be missing a test though? If Assume=true here, I don't think we will be checking the case that we actually remove a group that is may wrap?

sbaranga added inline comments.Oct 20 2016, 6:42 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5872 ↗	(On Diff #75112)	I think there is another case here when we don't need to check the wrapping: if the group has the first and last pointers, and we know they both don't wrap. We should be able to apply this for versioning later and have at most two no-wrap checks per interleaved group.
llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗	(On Diff #75112)	I think having Assume=true for the initial check should be fine.

In D25276#575290, @mssimpso wrote:

If I understand things correctly now, we're trying to optimize the case where we have a full interleave group where the stride may wrap. So yes I see that we may want ShouldCheckWrap=false, at least initially. I don't think there is any harm in Assume=true for the initial call, though, because we later in Legality (collectLoopUniforms) call isConsecutivePtr for the pointer operand of every memory instruction (with Assume=true). If we don't use Assume=true for a pointer in the interleaved accesses anaysis, we will do so later in the uniformity analysis, so I think we want to do the same thing in both places.

I think we are all in agreement on that!

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5648–5650 ↗	(On Diff #75112)	OTOH, I think there’s also something clean about initially avoiding the wrapping question all together… Also, how aggressive in terms of allowing runtime checks do we want to be when we check the wrapping condition initially ? If we allow it, that would be the equivalent of calling getPtrStride with Assume=true, ShouldCheckWrap=true; That might mean adding a lot of tests… If we don’t allow it, we are too conservative. Maybe I misunderstood exactly what you are proposing, but I think in any case like the approach of light-weight optimistic initial pass over all pointers, and then selectively checking wrapping only where needed.
5872 ↗	(On Diff #75112)	So, trying to summarize: Initial check with Assume=true, ShouldCheckWrap=false; Revisit only groups with gaps as follows: if first and last elements are accessed: skip further checking. otherwise – check the first and last pointers of the group with Assume=false (for now), ShouldCheckWrap=true; Silviu, is that what you meant?
llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗	(On Diff #75112)	Right; Silviu, the example you gave earlier for a strided access that requires Assume=true: How would you turn it into a test that doesn't fail vectorization on "SCEV could not compute the loop exit count"...?

dorit added inline comments.Oct 20 2016, 11:27 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5872 ↗	(On Diff #75112)	3 and 4 above were meant to be sub-bullets of item 2...: 1.Initial check with Assume=true, ShouldCheckWrap=false; 2.Revisit only groups with gaps as follows: if first and last elements are accessed: skip further checking. otherwise – check the first and last pointers of the group with Assume=false (for now), ShouldCheckWrap=true;

sbaranga added inline comments.Oct 21 2016, 4:05 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5872 ↗	(On Diff #75112)	I agree with point 1) For point 2) I meant something slightly different: if first and last elements are accessed and we know they don't wrap, skip further checking (because it implies that all the pointers in the group don't wrap). if the group is full, skip further checking otherwise (as you said) check the first and last pointers of the group with Assume=false (for now), ShouldCheckWrap=true; I think optimizing this is a bit open-ended, but we could definitely go further (feel free to omit this for this commit): If we are peeling the loop and we know that the first pointer doesn't wrap then we can deduce that all pointers in the group don't wrap. This means that we can forcefully peel the loop in order to only have to check the first pointer for no-wrap. When we'll change to use Assume=true,ShouldCheckWrap=true we'll only need at most one runtime check per interleaved group (I think this is probably optimal).
llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗	(On Diff #75112)	I've tried the following function, and you do have to have assume=true for it in order to form interleaved groups void func(unsigned * restrict a, unsigned * restrict b, unsigned char x, unsigned char y) { int i = 0; for (unsigned char index = x; i < y; index +=2, ++i) b[i] = a[index] * 2; }

dorit added inline comments.Oct 23 2016, 2:45 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5872 ↗	(On Diff #75112)	about “and we know they don't wrap”….: exactly what do you mean by “know they don’t wrap”? that calling getPtrStride with Assume=false, ShouldCheckWrap=true is successful? I would think that “group is full” and “first and last elements are accessed” are equivalent in terms of the effect of the transformation on creating new potentially-wrapping accesses (in both cases the transformation will not make things worse than the original code in that respect); we shouldn’t care if the accesses to the first/last elements wrap or not because the wide loads that we’ll create will have the same wrap-around behavior as the original code, no? (in short, why the extra "and we know they don't wrap"...?) (this is assuming that potential wrapping that may be caused by extra accesses in the vector code created due to misalignment are accounted for else-where).
llvm/test/Transforms/LoopVectorize/interleaved-accesses-2.ll
16 ↗	(On Diff #75112)	perfect. thanks!!

sbaranga added inline comments.Oct 24 2016, 2:45 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5872 ↗	(On Diff #75112)	Correct, in this case I mean Assume=false, ShouldCheckWrap=true. Although when switching to using Assume = true it is preferable to call getPtrStride on just the two accesses instead of on all the pointers. 2). Here is my current reasoning: The justification for why the "group is full" case doesn't require checking is that if the wide load does wrap we would do an access to nullptr in the non-transformed code (and therefore have undefined behaviour). This reasoning doesn't hold when the group is not full (since we could be skipping the access to nullptr). I didn't fully understand the part about alignment, but AFAICT considering here that the accesses all the accesses are aligned to the element type should be enough.

The updated version adds the following:

avoid XFAIL in a new test
use Assume=true in collectConstStrideAccesses + add a testcase that requires that and that tests group invalidation in that scenario
check only the first and last members of the group instead of checking all the members
further optimization ideas that came up here were added as a TODO comment.

1.Initial check with Assume=true, ShouldCheckWrap=false; 2.Revisit
only groups with gaps as follows:

if first and last elements are accessed: skip further checking.

otherwise – check the first and last pointers of the group

with Assume=false (for now), ShouldCheckWrap=true;

I agree with point 1)
For point 2) I meant something slightly different:
(a) - if first and last elements are accessed *and* we know they don't wrap, skip further checking (because it implies that all the pointers in the group don't wrap).
(b) - if the group is full, skip further checking
(c)- otherwise (as you said) check the first and last pointers of the
group with Assume=false (for now), ShouldCheckWrap=true;

...

Correct, in this case I mean Assume=false, ShouldCheckWrap=true.

Although when switching to using Assume = true it is preferable to call getPtrStride on just the two accesses instead of on all the pointers.

We might as well check just these two accesses already, even while we are using Assume=false, no? (it's less critical because it won't affect runtime, but it will make the followup change of upgrading to Assume=true smaller).
This is what I did in the updated version I just uploaded.
I only included cases (b) and (c) from the above; In the end I am not sure what case (a) checks beyond that case (c) already covers...?

thanks,
Dorit

dorit added inline comments.Oct 25 2016, 8:50 PM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5895 ↗	(On Diff #75796)	meant to put a continue's here instead of break's... will upload a new version.

break --> continue

Herald added a subscriber: anna. · View Herald TranscriptOct 26 2016, 2:22 AM

Yes, now that I think about it skipping case c) makes sense.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5879 ↗	(On Diff #75840)	FirstMemberPtr should always be getPointerOperand(Group->getMember(0)) (it's guaranteed to exist). Now that I look at the code it would be simpler to go with a the more optimized version. It would be simpler to just go here with FirstMemberPtr = getPointerOperand(Group->getMember(0)) LastMemberPtr = getPointerOperand(Group->getMember(Group->getFactor()-1)) and remove the loop. If we get null for LastMemberPtr we can just ignore it (since we know that in this case we will always peel the loop, meaning that we only need to check the first member). The code will look like: if (!getPtrStride(PSE, FirstMemberPtr, TheLoop, Strides, /Assume=/false, /ShouldCheckWrap=/true)) { releaseGroup(Group); continue; } if (LastMemberPtr) if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, /Assume=/false, /ShouldCheckWrap=/true))) { releaseGroup(); continue; }

Thanks, Silviu! Relying on the peeling does come in handy here!

The new version looks at members 0 and factor-1 instead of scanning the group to find the actual first and last members.

Thanks for making the changes! LGTM

This revision is now accepted and ready to land.Oct 26 2016, 5:29 AM

Closed by commit rL285517: [LoopVectorize] Make interleaved-accesses analysis less conservative about (authored by dorit). · Explain WhyOct 30 2016, 5:32 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

llvm/

trunk/

lib/

Transforms/

Vectorize/

LoopVectorize.cpp

60 lines

test/

Transforms/

LoopVectorize/

AArch64/

gather-cost.ll

2 lines

ARM/

gather-cost.ll

2 lines

X86/

gather-cost.ll

2 lines

interleaved-accesses-1.ll

78 lines

interleaved-accesses-2.ll

58 lines

interleaved-accesses-3.ll

57 lines

Diff 76334

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 5,728 Lines • ▼ Show 20 Lines	void InterleavedAccessInfo::collectConstStrideAccesses(
for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO()))		for (BasicBlock *BB : make_range(DFS.beginRPO(), DFS.endRPO()))
for (auto &I : *BB) {		for (auto &I : *BB) {
auto *LI = dyn_cast<LoadInst>(&I);		auto *LI = dyn_cast<LoadInst>(&I);
auto *SI = dyn_cast<StoreInst>(&I);		auto *SI = dyn_cast<StoreInst>(&I);
if (!LI && !SI)		if (!LI && !SI)
continue;		continue;

Value *Ptr = getPointerOperand(&I);		Value *Ptr = getPointerOperand(&I);
int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides);		// We don't check wrapping here because we don't know yet if Ptr will be
		// part of a full group or a group with gaps. Checking wrapping for all
		// pointers (even those that end up in groups with no gaps) will be overly
		// conservative. For full groups, wrapping should be ok since if we would
		// wrap around the address space we would do a memory access at nullptr
		// even without the transformation. The wrapping checks are therefore
		// deferred until after we've formed the interleaved groups.
		int64_t Stride = getPtrStride(PSE, Ptr, TheLoop, Strides,
		/Assume=/true, /ShouldCheckWrap=/false);

const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr);		const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr);
PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());		PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());
uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType());		uint64_t Size = DL.getTypeAllocSize(PtrTy->getElementType());

// An alignment of 0 means target ABI alignment.		// An alignment of 0 means target ABI alignment.
unsigned Align = LI ? LI->getAlignment() : SI->getAlignment();		unsigned Align = LI ? LI->getAlignment() : SI->getAlignment();
if (!Align)		if (!Align)
▲ Show 20 Lines • Show All 187 Lines • ▼ Show 20 Lines	for (auto BI = AccessStrideInfo.rbegin(), E = AccessStrideInfo.rend();
} // Iteration over A accesses.		} // Iteration over A accesses.
} // Iteration over B accesses.		} // Iteration over B accesses.

// Remove interleaved store groups with gaps.		// Remove interleaved store groups with gaps.
for (InterleaveGroup *Group : StoreGroups)		for (InterleaveGroup *Group : StoreGroups)
if (Group->getNumMembers() != Group->getFactor())		if (Group->getNumMembers() != Group->getFactor())
releaseGroup(Group);		releaseGroup(Group);

		// Remove interleaved groups with gaps (currently only loads) whose memory
		// accesses may wrap around. We have to revisit the getPtrStride analysis,
		// this time with ShouldCheckWrap=true, since collectConstStrideAccesses does
		// not check wrapping (see documentation there).
		// FORNOW we use Assume=false;
		// TODO: Change to Assume=true but making sure we don't exceed the threshold
		// of runtime SCEV assumptions checks (thereby potentially failing to
		// vectorize altogether).
		// Additional optional optimizations:
		// TODO: If we are peeling the loop and we know that the first pointer doesn't
		// wrap then we can deduce that all pointers in the group don't wrap.
		// This means that we can forcefully peel the loop in order to only have to
		// check the first pointer for no-wrap. When we'll change to use Assume=true
		// we'll only need at most one runtime check per interleaved group.
		//
		for (InterleaveGroup *Group : LoadGroups) {

		// Case 1: A full group. Can Skip the checks; For full groups, if the wide
		// load would wrap around the address space we would do a memory access at
		// nullptr even without the transformation.
		if (Group->getNumMembers() == Group->getFactor())
		continue;

		// Case 2: If first and last members of the group don't wrap this implies
		// that all the pointers in the group don't wrap.
		// So we check only group member 0 (which is always guaranteed to exist),
		// and group member Factor-1 (if it doesn't exist we can just ignore it
		// since we know that in this case we will always peel the loop, in which
		// case we only need to check the first member).
		Value *FirstMemberPtr = getPointerOperand(Group->getMember(0));
		if (!getPtrStride(PSE, FirstMemberPtr, TheLoop, Strides, /Assume=/false,
		/ShouldCheckWrap=/true)) {
		DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "
		"potential pointer wrapping.\n");
		releaseGroup(Group);
		continue;
		}

		if (Instruction *LastMember = Group->getMember(Group->getFactor() - 1)) {
		Value *LastMemberPtr = getPointerOperand(LastMember);
		if (!getPtrStride(PSE, LastMemberPtr, TheLoop, Strides, /Assume=/false,
		/ShouldCheckWrap=/true)) {
		DEBUG(dbgs() << "LV: Invalidate candidate interleaved group due to "
		"potential pointer wrapping.\n");
		releaseGroup(Group);
		continue;
		}
		}
		}

// If there is a non-reversed interleaved load group with gaps, we will need		// If there is a non-reversed interleaved load group with gaps, we will need
// to execute at least one scalar epilogue iteration. This will ensure that		// to execute at least one scalar epilogue iteration. This will ensure that
// we don't speculatively access memory out-of-bounds. Note that we only need		// we don't speculatively access memory out-of-bounds. Note that we only need
// to look for a member at index factor - 1, since every group must have a		// to look for a member at index factor - 1, since every group must have a
// member at index zero.		// member at index zero.
for (InterleaveGroup *Group : LoadGroups)		for (InterleaveGroup *Group : LoadGroups)
if (!Group->getMember(Group->getFactor() - 1)) {		if (!Group->getMember(Group->getFactor() - 1)) {
if (Group->isReverse()) {		if (Group->isReverse()) {
▲ Show 20 Lines • Show All 1,462 Lines • Show Last 20 Lines

llvm/trunk/test/Transforms/LoopVectorize/AArch64/gather-cost.ll

	; RUN: opt -loop-vectorize -mtriple=arm64-apple-ios -S -mcpu=cyclone < %s \| FileCheck %s			; RUN: opt -loop-vectorize -mtriple=arm64-apple-ios -S -mcpu=cyclone -enable-interleaved-mem-accesses=false < %s \| FileCheck %s
	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32:64-S128"			target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-n32:64-S128"

	@kernel = global [512 x float] zeroinitializer, align 16			@kernel = global [512 x float] zeroinitializer, align 16
	@kernel2 = global [512 x float] zeroinitializer, align 16			@kernel2 = global [512 x float] zeroinitializer, align 16
	@kernel3 = global [512 x float] zeroinitializer, align 16			@kernel3 = global [512 x float] zeroinitializer, align 16
	@kernel4 = global [512 x float] zeroinitializer, align 16			@kernel4 = global [512 x float] zeroinitializer, align 16
	@src_data = global [1536 x float] zeroinitializer, align 16			@src_data = global [1536 x float] zeroinitializer, align 16
	@r_ = global i8 0, align 1			@r_ = global i8 0, align 1
	▲ Show 20 Lines • Show All 76 Lines • Show Last 20 Lines

llvm/trunk/test/Transforms/LoopVectorize/ARM/gather-cost.ll

	; RUN: opt -loop-vectorize -mtriple=thumbv7s-apple-ios6.0.0 -S < %s \| FileCheck %s			; RUN: opt -loop-vectorize -mtriple=thumbv7s-apple-ios6.0.0 -S -enable-interleaved-mem-accesses=false < %s \| FileCheck %s

	target datalayout = "e-p:32:32:32-i1:8:32-i8:8:32-i16:16:32-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:32:64-v128:32:128-a0:0:32-n32-S32"			target datalayout = "e-p:32:32:32-i1:8:32-i8:8:32-i16:16:32-i32:32:32-i64:32:64-f32:32:32-f64:32:64-v64:32:64-v128:32:128-a0:0:32-n32-S32"

	@kernel = global [512 x float] zeroinitializer, align 4			@kernel = global [512 x float] zeroinitializer, align 4
	@kernel2 = global [512 x float] zeroinitializer, align 4			@kernel2 = global [512 x float] zeroinitializer, align 4
	@kernel3 = global [512 x float] zeroinitializer, align 4			@kernel3 = global [512 x float] zeroinitializer, align 4
	@kernel4 = global [512 x float] zeroinitializer, align 4			@kernel4 = global [512 x float] zeroinitializer, align 4
	@src_data = global [1536 x float] zeroinitializer, align 4			@src_data = global [1536 x float] zeroinitializer, align 4
	▲ Show 20 Lines • Show All 79 Lines • Show Last 20 Lines

llvm/trunk/test/Transforms/LoopVectorize/X86/gather-cost.ll

	; RUN: opt -loop-vectorize -mtriple=x86_64-apple-macosx -S -mcpu=corei7-avx < %s \| FileCheck %s			; RUN: opt -loop-vectorize -mtriple=x86_64-apple-macosx -S -mcpu=corei7-avx -enable-interleaved-mem-accesses=false < %s \| FileCheck %s
	target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"			target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"

	@kernel = global [512 x float] zeroinitializer, align 16			@kernel = global [512 x float] zeroinitializer, align 16
	@kernel2 = global [512 x float] zeroinitializer, align 16			@kernel2 = global [512 x float] zeroinitializer, align 16
	@kernel3 = global [512 x float] zeroinitializer, align 16			@kernel3 = global [512 x float] zeroinitializer, align 16
	@kernel4 = global [512 x float] zeroinitializer, align 16			@kernel4 = global [512 x float] zeroinitializer, align 16
	@src_data = global [1536 x float] zeroinitializer, align 16			@src_data = global [1536 x float] zeroinitializer, align 16
	@r_ = global i8 0, align 1			@r_ = global i8 0, align 1
	▲ Show 20 Lines • Show All 77 Lines • Show Last 20 Lines

llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-1.ll

				; RUN: opt -S -loop-vectorize -instcombine -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true < %s \| FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				; Check that the interleaved-mem-access analysis identifies the access
				; to array 'in' as interleaved, despite the possibly wrapping unsigned
				; 'out_ix' index.
				;
				; In this test the interleave-groups are full (have no gaps), so no wrapping
				; checks are necessary. We can call getPtrStride with Assume=false and
				; ShouldCheckWrap=false to safely figure out that the stride is 2.

				; #include <stdlib.h>
				; class Complex {
				; private:
				; float real_;
				; float imaginary_;
				;
				;public:
				; Complex() : real_(0), imaginary_(0) { }
				; Complex(float real, float imaginary) : real_(real), imaginary_(imaginary) { }
				; Complex(const Complex &rhs) : real_(rhs.real()), imaginary_(rhs.imaginary()) { }
				;
				; inline float real() const { return real_; }
				; inline float imaginary() const { return imaginary_; }
				;};
				;
				;void test(Complex * __restrict__ out, Complex * __restrict__ in, size_t out_start, size_t size)
				;{
				; for (size_t out_offset = 0; out_offset < size; ++out_offset)
				; {
				; size_t out_ix = out_start + out_offset;
				; Complex t0 = in[out_ix];
				; out[out_ix] = t0;
				; }
				;}

				; CHECK: vector.body:
				; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* {{.*}}, align 4
				; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>

				%class.Complex = type { float, float }

				define void @_Z4testP7ComplexS0_mm(%class.Complex* noalias nocapture %out, %class.Complex* noalias nocapture readonly %in, i64 %out_start, i64 %size) local_unnamed_addr {
				entry:
				%cmp9 = icmp eq i64 %size, 0
				br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader:
				br label %for.body

				for.cond.cleanup.loopexit:
				br label %for.cond.cleanup

				for.cond.cleanup:
				ret void

				for.body:
				%out_offset.010 = phi i64 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%add = add i64 %out_offset.010, %out_start
				%arrayidx = getelementptr inbounds %class.Complex, %class.Complex* %in, i64 %add
				%0 = bitcast %class.Complex* %arrayidx to i32*
				%1 = load i32, i32* %0, align 4
				%imaginary_.i.i = getelementptr inbounds %class.Complex, %class.Complex* %in, i64 %add, i32 1
				%2 = bitcast float* %imaginary_.i.i to i32*
				%3 = load i32, i32* %2, align 4
				%arrayidx1 = getelementptr inbounds %class.Complex, %class.Complex* %out, i64 %add
				%4 = bitcast %class.Complex* %arrayidx1 to i64*
				%t0.sroa.4.0.insert.ext = zext i32 %3 to i64
				%t0.sroa.4.0.insert.shift = shl nuw i64 %t0.sroa.4.0.insert.ext, 32
				%t0.sroa.0.0.insert.ext = zext i32 %1 to i64
				%t0.sroa.0.0.insert.insert = or i64 %t0.sroa.4.0.insert.shift, %t0.sroa.0.0.insert.ext
				store i64 %t0.sroa.0.0.insert.insert, i64* %4, align 4
				%inc = add nuw i64 %out_offset.010, 1
				%exitcond = icmp eq i64 %inc, %size
				br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
				}

llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-2.ll

				; RUN: opt -S -loop-vectorize -instcombine -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true < %s \| FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				; Check that the interleaved-mem-access analysis currently does not create an
				; interleave group for the access to array 'in' due to the possibly wrapping
				; unsigned 'out_ix' index.
				;
				; In this test the interleave-group of the loads is not full (has gaps), so
				; the wrapping checks are necessary. Here this cannot be done statically so
				; runtime checks are needed, but with Assume=false getPtrStride cannot add
				; runtime checks and as a result we can't create the interleave-group.
				;
				; FIXME: This is currently a missed optimization until we can use Assume=true
				; with proper threshold checks. Once we do that the candidate interleave-group
				; will not be invalidated by the wrapping checks.

				; #include <stdlib.h>
				; void test(float * __restrict__ out, float * __restrict__ in, size_t size)
				; {
				; for (size_t out_offset = 0; out_offset < size; ++out_offset)
				; {
				; float t0 = in[2*out_offset];
				; out[out_offset] = t0;
				; }
				; }

				; CHECK: vector.body:
				; CHECK-NOT: %wide.vec = load <8 x i32>, <8 x i32>* {{.*}}, align 4
				; CHECK-NOT: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>

				define void @_Z4testPfS_m(float* noalias nocapture %out, float* noalias nocapture readonly %in, i64 %size) local_unnamed_addr {
				entry:
				%cmp7 = icmp eq i64 %size, 0
				br i1 %cmp7, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader:
				br label %for.body

				for.cond.cleanup.loopexit:
				br label %for.cond.cleanup

				for.cond.cleanup:
				ret void

				for.body:
				%out_offset.08 = phi i64 [ %inc, %for.body ], [ 0, %for.body.preheader ]
				%mul = shl i64 %out_offset.08, 1
				%arrayidx = getelementptr inbounds float, float* %in, i64 %mul
				%0 = bitcast float* %arrayidx to i32*
				%1 = load i32, i32* %0, align 4
				%arrayidx1 = getelementptr inbounds float, float* %out, i64 %out_offset.08
				%2 = bitcast float* %arrayidx1 to i32*
				store i32 %1, i32* %2, align 4
				%inc = add nuw i64 %out_offset.08, 1
				%exitcond = icmp eq i64 %inc, %size
				br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
				}

llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-3.ll

				; RUN: opt -S -loop-vectorize -instcombine -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true < %s \| FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				; Check that the interleaved-mem-access analysis currently does not create an
				; interleave group for access 'a' due to the possible pointer wrap-around.
				;
				; To begin with, in this test the candidate interleave group can be created
				; only when getPtrStride is called with Assume=true. Next, because
				; the interleave-group of the loads is not full (has gaps), we also need to check
				; for possible pointer wrapping. Here we currently use Assume=false and as a
				; result cannot prove the transformation is safe and therefore invalidate the
				; candidate interleave group.
				;
				; FIXME: This is a missed optimization. Once we use Assume=true here, we will
				; not have to invalidate the group.

				; void func(unsigned * __restrict a, unsigned * __restrict b, unsigned char x, unsigned char y) {
				; int i = 0;
				; for (unsigned char index = x; i < y; index +=2, ++i)
				; b[i] = a[index] * 2;
				;
				; }

				; CHECK: vector.body:
				; CHECK-NOT: %wide.vec = load <8 x i32>, <8 x i32>* {{.*}}, align 4
				; CHECK-NOT: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>

				define void @_Z4funcPjS_hh(i32* noalias nocapture readonly %a, i32* noalias nocapture %b, i8 zeroext %x, i8 zeroext %y) local_unnamed_addr {
				entry:
				%cmp9 = icmp eq i8 %y, 0
				br i1 %cmp9, label %for.cond.cleanup, label %for.body.preheader

				for.body.preheader:
				%wide.trip.count = zext i8 %y to i64
				br label %for.body

				for.cond.cleanup.loopexit:
				br label %for.cond.cleanup

				for.cond.cleanup:
				ret void

				for.body:
				%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %for.body.preheader ]
				%index.011 = phi i8 [ %add, %for.body ], [ %x, %for.body.preheader ]
				%idxprom = zext i8 %index.011 to i64
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %idxprom
				%0 = load i32, i32* %arrayidx, align 4
				%mul = shl i32 %0, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %b, i64 %indvars.iv
				store i32 %mul, i32* %arrayidx2, align 4
				%add = add i8 %index.011, 2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, %wide.trip.count
				br i1 %exitcond, label %for.cond.cleanup.loopexit, label %for.body
				}

This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorizer] Interleaved-mem-accesses analysis and getPtrStride ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 76334

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/trunk/test/Transforms/LoopVectorize/AArch64/gather-cost.ll

llvm/trunk/test/Transforms/LoopVectorize/ARM/gather-cost.ll

llvm/trunk/test/Transforms/LoopVectorize/X86/gather-cost.ll

llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-1.ll

llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-2.ll

llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses-3.ll

[LoopVectorizer] Interleaved-mem-accesses analysis and getPtrStride
ClosedPublic