This is an archive of the discontinued LLVM Phabricator instance.

Consecutive memory access in Loop Vectorizer - fixed and simplified
ClosedPublic

Authored by delena on May 30 2016, 5:28 AM.

Details

Summary

I fixed consecutive memory access detection in Loop Vectorizer.
It did not handle correctly cases without GEP.

I started from this loop that wasn't vectorized.

for (int i=0; i<len; i++)

*to++ = *from++;

Probably this loop should be resolved with memcpy, but it is another pass. Current loop vectorizer does not recognize consecutive memory access, because the generated code performs load/store without GEP. I use SCEV to detect consecutive access.

Diff Detail

Repository
rL LLVM

Event Timeline

delena updated this revision to Diff 58947.May 30 2016, 5:28 AM
delena retitled this revision from to Consecutive memory access in Loop Vectorizer - fixed and simplified.
delena updated this object.
delena added reviewers: hfinkel, anemet, mzolotukhin, Ayal.
delena set the repository for this revision to rL LLVM.
delena added a subscriber: llvm-commits.
Ayal edited edge metadata.Jun 5 2016, 4:56 PM

(1) Indeed *p++ should be treated just as consecutively as p[i++] by all means.

(2) Could we simply use getPtrStride(), like so:

int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides);
if (Stride == 1 || Stride == -1)
  return Stride;
return 0;

making it work with specialization for unit-stride (vectorizing the load in version-mem-access.ll)?

(3) If you test for

*to++ == *from--

you'll also catch reverse consecutive, and won't compete with -loop-idiom's folding to memcpy; albeit cost will be higher.

hfinkel edited edge metadata.Jun 5 2016, 6:53 PM
In D20789#449385, @Ayal wrote:

(1) Indeed *p++ should be treated just as consecutively as p[i++] by all means.

(2) Could we simply use getPtrStride(), like so:

int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides);
if (Stride == 1 || Stride == -1)
  return Stride;
return 0;

making it work with specialization for unit-stride (vectorizing the load in version-mem-access.ll)?

(3) If you test for

*to++ == *from--

you'll also catch reverse consecutive, and won't compete with -loop-idiom's folding to memcpy; albeit cost will be higher.

You can also do *to++ = 1 + *from++; which will avoid the memcpy idiom issue.

delena updated this revision to Diff 59716.Jun 6 2016, 6:54 AM
delena edited edge metadata.
delena added a subscriber: roman.shirokiy.

I used getPtrStride() as Ayal suggested. I also put extracting stride from GEP to a separate function.

mkuper added inline comments.Jun 6 2016, 11:26 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
2299–2300

Are you planning to call this with more complex GEPs?
If not, I'm not sure I understand why we need to call getGEPInductionOperand(Gep) here. Isn't it guaranteed to return 1?

2314

The formatting here looks off.

mzolotukhin edited edge metadata.Jun 6 2016, 11:38 AM

Please find some remarks from me inline.

Tanks,
Michael

../lib/Transforms/Vectorize/LoopVectorize.cpp
2314–2315

Formatting looks weird here.

../test/Transforms/LoopVectorize/consec_no_gep.ll
7

Could you please elaborate on what do you mean by "without preceding GEP instruction"? The test below has two GEP-instructions before load and store. Is the problem in the bitcast between GEP and load/store?

mzolotukhin added inline comments.Jun 6 2016, 11:38 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
2300–2301

Can we relax this constraint? Looks like we didn't have such limitation before.

Ayal added inline comments.Jun 6 2016, 11:57 PM
../lib/Transforms/Vectorize/LoopVectorize.cpp
2298–2299

It's admittedly not your original code, but while you're refactoring this "gep with all indices being invariant except last that's inductive with a step that's not originally a compile time constant but gets replaced by one due to specialization, skipping over z/sext/bitcast" stride 1 special case, should it belong to LAA's getPtrStride() rather than LVL's isConsecutivePtr()?

In any case, please add documentation explaining what this is about.

2299–2300

Again, admittedly not your original code, just commenting that if Step is any constant, not only +-1, we could use it to return the desired stride instead of zero. Or perhaps we'd only be specializing for the final stride(?). Otherwise you may call this isConsecutiveGEPIndex() rather than getStrideFromGEPIndex().

delena added inline comments.Jun 7 2016, 12:17 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
2299–2300

It was original code. I assume that I have to check two things - (1) The last index in "induction". (2) All other are loop-invariant.

2299–2300

we could use it to return the desired stride instead of zero

I don't want to complicate the code and make it more general since I don't see other usage for it.
I can put this function back to isConsecutivePtr() if you prefer this solution.

2300–2301

You are right, but this index should be the last. Otherwise step 1 will not mean consecutive GEP.
I'll fix.

../test/Transforms/LoopVectorize/consec_no_gep.ll
7

The bitcast is between "load" and "Phi". GEP does not feed load in this case.

delena updated this revision to Diff 59851.Jun 7 2016, 3:29 AM
delena edited edge metadata.

I moved analysis of GEP with runtime index to llvm::getPtrStride().

Ayal added a comment.Jun 7 2016, 4:50 AM

This looks good to me (especially isConsecutivePrt() ;-)

../lib/Analysis/LoopAccessAnalysis.cpp
902–903

getGEPInductionOperand() returns the last relevant index to check; all indices following it do not affect the address; so InductionOperand should work even when it's not NumOperands - 1.

../lib/Transforms/Vectorize/LoopVectorize.cpp
2298–2301

somewhat redundant, as it's the first check in getPtrStride() as well

delena updated this revision to Diff 59866.Jun 7 2016, 5:01 AM

Fixed according to Ayal's comment.

anemet added inline comments.Jun 7 2016, 6:27 AM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

Sorry for jumping in late but I was on vacation. Elena, does your testcase actually exercise this part of the code or it's handled as a regular AddRec above? I am assuming not since the Ptr is not a GEP.

Do we actually have coverage exercising this part?

../test/Transforms/LoopVectorize/consec_no_gep.ll
8

The C code and the IR do not match here. You have some casts here. Can this testcase be simplified to remove the casts?

delena added inline comments.Jun 7 2016, 6:44 AM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

The test that exercises this code is in version-mem-access.ll. This is the existing test that fails without this extension.
My test case does not have GEP (but it has AddRec ) and this case wan't handled properly.

../test/Transforms/LoopVectorize/consec_no_gep.ll
8

I ran clang and printed IR before loop vectorizer. So, the code is exact here.

anemet added inline comments.Jun 7 2016, 9:04 AM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

OK and what is the SCEV that fails to become an AddRec after replaceSymbolicStrideSCEV?

I am wondering if there is a fundamental reason we need getConsecutiveAccessFromGEPIndex. My guess is that it's this in getConsecutiveAccessFromGEPIndex:

// Because of the multiplication by a stride we can have a s/zext cast.
// We are going to replace this stride by 1 so the cast is safe to ignore.

which does not seem fundamental.

delena added inline comments.Jun 7 2016, 12:20 PM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

About this specific case with GEP where we need to call getConsecutiveAccessFromGEPIndex():

GEP itself is AddExpr, not AddRecExpr as expected. AddExpr is (base + MulExpr).
The MulExpr is (AddRec_of_induction * sext(Stride)).
We have Stride is equal to 1, but it is not compile time constant, but runtime constant.
something like this:
if (Stride ==1) then

execute vector-loop

else

execute scalar-loop

After call to replaceSymbolicStrideSCEV it is converted to sext(AddRecExpr_with_step_1).

anemet added inline comments.Jun 7 2016, 2:33 PM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

The sext is actually around the AddRec:

(lldb) p PtrScev->dump()
((8 * (sext i32 {0,+,1}<nuw><%for.body> to i64)) + %x)

@sbaranga, why can't predicated SCEV turn this into i64 {%x,+,8}?

sbaranga added inline comments.Jun 7 2016, 2:58 PM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

@anemet: It should. If it doesn't I would consider it a bug. What is the input where this is triggered? Is it the added test case?

@delena: the expression for that GEP should get folded into an AddRecExpr by SCEV.

I'll have a look at this tomorrow. We should avoid versioning on the stride if possible, since that removes a lot of the input space.

anemet added inline comments.Jun 7 2016, 3:07 PM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

@sbaranga, no. I applied the patch and then ran the old testcase of Transforms/LoopVectorize/version-mem-access.ll.

delena added inline comments.Jun 7 2016, 11:41 PM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

@sbaranga, this GEP may be folded to AddRecExpr by PSE.

I can call to getPtrStride() with Assume=true and PSE will be invoked, but in this case 2 ARM tests fail:

LLVM :: Transforms/LoopVectorize/AArch64/gather-cost.ll
LLVM :: Transforms/LoopVectorize/ARM/gather-cost.ll
anemet added inline comments.Jun 8 2016, 3:27 AM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

@delena, great. Can you please look why those are failing?

sbaranga added inline comments.Jun 8 2016, 3:35 AM
../lib/Analysis/LoopAccessAnalysis.cpp
908–912

Thanks! If those tests are a problem we can fix them.

We probably need to use Assume=true here anyway, otherwise this might not be correct: since the expression for the pointer is ((8 * (sext i32 {0,+,1}<nuw><%for.body> to i64)) + %x), this means that this pointer is not consecutive for trip counts greater than MAX_INT (the index would wrap). Either that or SCEV is dropping some flags.

delena updated this revision to Diff 60501.Jun 13 2016, 3:48 AM

I changed interface of getPtrStride(). Instead of one "Assume" parameter which means "additional run-time assumptions", I put 2 parameters - "Assume memory access versioning in run-time" and "Assume no-wrap in address calculation".
isConsecutivePtr() sets to "true" only the first assumption.
Now all tests work correctly.

anemet edited edge metadata.Jun 13 2016, 5:50 AM

I changed interface of getPtrStride(). Instead of one "Assume" parameter which means "additional run-time assumptions", I put 2 parameters - "Assume memory access versioning in run-time" and "Assume no-wrap in address calculation".
isConsecutivePtr() sets to "true" only the first assumption.

Why? What's wrong with calling getPtrStirde with Assume=true?

I changed interface of getPtrStride(). Instead of one "Assume" parameter which means "additional run-time assumptions", I put 2 parameters - "Assume memory access versioning in run-time" and "Assume no-wrap in address calculation".
isConsecutivePtr() sets to "true" only the first assumption.

Why? What's wrong with calling getPtrStirde with Assume=true?

2 tests were failing. "Assume-versioning"=true is ok for isConsecutivePtr(), but Assume-no-wrap should be "false".

I changed interface of getPtrStride(). Instead of one "Assume" parameter which means "additional run-time assumptions", I put 2 parameters - "Assume memory access versioning in run-time" and "Assume no-wrap in address calculation".
isConsecutivePtr() sets to "true" only the first assumption.

Why? What's wrong with calling getPtrStirde with Assume=true?

2 tests were failing. "Assume-versioning"=true is ok for isConsecutivePtr(), but Assume-no-wrap should be "false".

Why were the tests failing? You need to provide some explanation rather than just communicate the conclusion of your investigation.

As far as I can see both of these just add run-time checks from things that cannot be proven at compile time.

Hi Elena,

It looks like we don't care if the pointer is wrapping in isConsecutivePtr, but getPtrStride is testing it since this is required for LAA?

In that case it might be better to either factor out the wrapping checks from getPtrStride or change the interface so that getPtrStride takes an Assume parameter and a ShouldCheckWrap parameter?

-Silviu

Why were the tests failing?

When I turn on "Assume" parameter in the original version, I actually do two different things together:

  1. Force check the predicated access
  2. Disable wrap check in address calculation

When I do (1), I fix the original bug. But the two tests (we talked earlier) fail due to (2).
I separated (1) from (2) in the current patch and everything works fine.

Why were the tests failing?

When I turn on "Assume" parameter in the original version, I actually do two different things together:

  1. Force check the predicated access
  2. Disable wrap check in address calculation

When I do (1), I fix the original bug. But the two tests (we talked earlier) fail due to (2).
I separated (1) from (2) in the current patch and everything works fine.

This is still insufficient information. You're just restating the conclusion again (I already understood this part) without explaining why the test fails when run-time non-wrapping checks are allowed.

Also, in your list, #2 is not to disable wrap checks but to enable them.

Looking into one of the tests, it should fail if we end up vectorizing even though gathers for indirect pointer accesses are required. These are not available on ARM so the cost should be prohibitively high.

One reason could be that we now consider the indirect pointer access as consecutive (i.e. not requiring a gather) which is obviously wrong. I am wondering if there is a bug lurking somewhere in getPtrStride or in the test. Can you please look into it more and provide details.

In general, I think that @sbaranga's approach is likely to be the way forward (i.e. your patch is pretty much good) but the failures are still unexpected so let's first understand this part.

Thanks,
Adam

delena added inline comments.Jun 16 2016, 6:31 AM
../lib/Analysis/LoopAccessAnalysis.cpp
987

If I comment-out this line the two tests work without splitting the "Assume" parameter. Looks like the flag is written here for the pointer and is used in another place.
But commenting out this line causes failure of another test.

I ran ARM/gather-cost.ll with debugger. It receives stride 3 for one of the memory accesses. I assume (I don't know ARM arch), that vectorization is possible with stride=3 under no-wrap flag for GEP's AddRec.
When we run isConsecutivePtr(.., Assume=true), we receive the same result (0), but the no-wrap flag for this pointer is being changed.
This no-wrap flag is checked later and the vectorization decision is different.

I ran ARM/gather-cost.ll with debugger. It receives stride 3 for one of the memory accesses. I assume (I don't know ARM arch), that vectorization is possible with stride=3 under no-wrap flag for GEP's AddRec.

Yes, we do have instructions on ARM that can handle strides {2,3,4}. The fix should be changing the stride to something that would require a gather operation (let's say 5).

When we run isConsecutivePtr(.., Assume=true), we receive the same result (0), but the no-wrap flag for this pointer is being changed.

Interesting, if Assume=true it should return 3? Do you why this is happening? Just from looking at the original code, is Assume=true it should return Stride not 0.

When we run isConsecutivePtr(.., Assume=true), we receive the same result (0), but the no-wrap flag for this pointer is being changed.

Interesting, if Assume=true it should return 3? Do you why this is happening? Just from looking at the original code, is Assume=true it should return Stride not 0.

Sorry, I was looking at isStridedPtr instead of isConsecutivePtr. This makes sense now.

What solution do you suggest? I see 2 options
(1) Change stride in the failing ARM tests and pass one parameter Assume=true to getPtrStride()
(2) Separate assumptions of getPtrStride(), the current version.

This is a bug-fix and I'd like to complete it.
Thank you.

What solution do you suggest? I see 2 options

I was suggesting doing the following:

In that case it might be better to either factor out the wrapping checks from getPtrStride or change the interface so that getPtrStride takes an Assume parameter and a ShouldCheckWrap parameter?

To expand on this:

My preference would be to have just one Assume=true argument. However, we don't need to prove non-wrapping for isConsecutivePtr, so we should avoid the logic in getPtrStride that tries to prove non-wrapping:

  • only have one Assume argument to getPtrStride, and pass in a ShouldCheckWrap argument
  • In getPtrStride, change the logic for IsNoWrapAddRec to be:

    bool IsNoWrapAddRec = !ShouldCheckWrap || PSE.hasNoOverflow(Ptr, SCEVWrapPredicate::IncrementNUSW) || isNoWrapAddRec(Ptr, AR, PSE, Lp);

and invoke getPtrStride with Assume=true and ShouldCheckWrap=false.

FWIW this doesn't look like a bug fix.

I ran ARM/gather-cost.ll with debugger. It receives stride 3 for one of the memory accesses. I assume (I don't know ARM arch), that vectorization is possible with stride=3 under no-wrap flag for GEP's AddRec.
When we run isConsecutivePtr(.., Assume=true), we receive the same result (0), but the no-wrap flag for this pointer is being changed.
This no-wrap flag is checked later and the vectorization decision is different.

OK, thanks for the explanation but something is still not clear. If the stride is 3 how does that convert into isConsecutivePtr to return true? We are checking for Stride == 1 || Stride == -1 in the new version of isConsecutivePtr.

OK, thanks for the explanation but something is still not clear. If the stride is 3 how does that convert into isConsecutivePtr to return true? We are checking for Stride == 1 || Stride == -1 in the new version of isConsecutivePtr.

isConsecutivePtr() still returns 0. But it has a side effect. This line: PSE.setNoOverflow(Ptr, SCEVWrapPredicate::IncrementNUSW);
This NoOverlflow flag is set for the pointer (Ptr) and checked later, in another function.

OK, thanks for the explanation but something is still not clear. If the stride is 3 how does that convert into isConsecutivePtr to return true? We are checking for Stride == 1 || Stride == -1 in the new version of isConsecutivePtr.

isConsecutivePtr() still returns 0. But it has a side effect. This line: PSE.setNoOverflow(Ptr, SCEVWrapPredicate::IncrementNUSW);
This NoOverlflow flag is set for the pointer (Ptr) and checked later, in another function.

Ah, makes sense now. Thanks!

delena updated this revision to Diff 61261.Jun 20 2016, 9:42 AM
delena edited edge metadata.

I've build the new patch according to Silviu's comments.
While my patch was under review, the isConsecutivePtr() function was changed, but the bug, I'm trying to fix wasn't resolved.
Please let me know if the call to getStridePtr() is still ok within the new code.

anemet added inline comments.Jun 20 2016, 9:55 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
2298–2299

Also you didn't resolve the merge conflict properly. isConsecutivePtr was just forwarding to getPtrStride in the earlier version as it should but now the entire original body has snuck back in.

2300–2302

No, this should be something like:

const ValueToValueMap &Strides = getSymbolicStrides() ? *getSymbolicStrides() : ValueToValueMap();

int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides, true, false);

../test/Transforms/LoopVectorize/consec_no_gep.ll
28–36

As I said earlier, the float cast is non-essiential here. Please simplify this just to use integer pointers.

delena added inline comments.Jun 20 2016, 10:20 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
2298–2299

You are right. I thought that getPtrStrides() will not work with empty strides set. But it works perfect..
Thanks.

delena updated this revision to Diff 61264.Jun 20 2016, 10:22 AM
anemet added inline comments.Jun 20 2016, 10:57 AM
../test/Transforms/LoopVectorize/consec_no_gep.ll
29–37

Elena, are you planning to address this comment as well?

delena updated this revision to Diff 61272.Jun 20 2016, 11:15 AM

Simplified the test case.

anemet accepted this revision.Jun 20 2016, 2:08 PM
anemet edited edge metadata.

LGTM with the formatting change below.

../test/Transforms/LoopVectorize/consec_no_gep.ll
2

Space between ; and RUN

This revision is now accepted and ready to land.Jun 20 2016, 2:08 PM

Thanks! LGTM as well.

This revision was automatically updated to reflect the committed changes.
delena reopened this revision.Jul 31 2016, 4:05 AM

This revision was reverted. I'm re-opening and continue working on on it.

This revision is now accepted and ready to land.Jul 31 2016, 4:05 AM

Hi, I have a question about LoopAccessAnalysis.

The current patch enables vectorization of this loop:

void IncrementalCopyFastPath(const char* src, char* op, int len) {
  while (len > 0) {
    *(reinterpret_cast<long long*>(op)) = *(reinterpret_cast<const long long*>(src));
    src += 8;
    op += 8;
    len -= 8;
  }
}

And the result is incorrect because the distance between "src" and "op" is not big enough.
And there is no run-time check between load and store pointers.
This is the loop before vectorization:

while.body:                                       ; preds = %while.body.preheader, %while.body
  %len.addr.07 = phi i32 [ %sub, %while.body ], [ %len, %while.body.preheader ]
  %op.addr.06 = phi i8* [ %add.ptr1, %while.body ], [ %op, %while.body.preheader ]
  %src.addr.05 = phi i8* [ %add.ptr, %while.body ], [ %src, %while.body.preheader ]
  %0 = bitcast i8* %src.addr.05 to i64*
  %1 = load i64, i64* %0, align 8
  %2 = bitcast i8* %op.addr.06 to i64*
  store i64 %1, i64* %2, align 8
  %add.ptr = getelementptr inbounds i8, i8* %src.addr.05, i64 8
  %add.ptr1 = getelementptr inbounds i8, i8* %op.addr.06, i64 8
  %sub = add nsw i32 %len.addr.07, -8
  %cmp = icmp sgt i32 %len.addr.07, 8
  br i1 %cmp, label %while.body, label %while.end.loopexit

  while.end.loopexit:                               ; preds = %while.body

I looked at MemoryDepChecker::areDepsSafe(). It checks pointers with same "Leader", like A[i] and A[i+2].
What function should check distance between A and B in order to say that A[i] = B[i] are safe for vectorization?
Thank you.

Hi Elena,

I looked at MemoryDepChecker::areDepsSafe(). It checks pointers with same "Leader", like A[i] and A[i+2].
What function should check distance between A and B in order to say that A[i] = B[i] are safe for vectorization?
Thank you.

Is that function inlined? Otherwise, the two pointers have different underlying objects. For accesses with different underlying objects, we guarantee no dependence using either alias analysis or a runtime no alias check.

This is done in AccessAnalysis::canCheckPtrAtRT (and implicitly in AccessAnalysis::processMemAccesses).

-Silviu

delena added a comment.Aug 1 2016, 5:56 AM

This is done in AccessAnalysis::canCheckPtrAtRT (and implicitly in AccessAnalysis::processMemAccesses).

AccessAnalysis::canCheckPtrAtRT() returns true and  DependentAccesses has these 2 accesses.

CanVecMem = DepChecker->areDepsSafe() returns true without checking anything

This is the code:

bool CanDoRTIfNeeded = Accesses.canCheckPtrAtRT(*PtrRtChecking, PSE->getSE(),
                                                TheLoop, SymbolicStrides);
if (!CanDoRTIfNeeded) {
  emitAnalysis(LoopAccessReport() << "cannot identify array bounds");
  DEBUG(dbgs() << "LAA: We can't vectorize because we can't find "
               << "the array bounds.\n");
  CanVecMem = false;
  return;
}

DEBUG(dbgs() << "LAA: We can perform a memory runtime check if needed.\n");

CanVecMem = true;
if (Accesses.isDependencyCheckNeeded()) {
  DEBUG(dbgs() << "LAA: Checking memory dependencies\n");
  CanVecMem = DepChecker->areDepsSafe(
      DependentAccesses, Accesses.getDependenciesToCheck(), SymbolicStrides);
wmi added a subscriber: wmi.Aug 1 2016, 8:32 PM
delena added a comment.Aug 2 2016, 4:46 AM

This is the output of LoopAccessAnalysis:

W:\ver2>build\Debug\bin\opt.exe -O2 test.ll -S -o test_opt.ll --debug-only=loop-accesses
LAA: Found a loop in _Z23IncrementalCopyFastPathPKcPci: while.body
LAA: Processing memory accesses...

AST: Alias Set Tracker: 1 alias sets for 2 pointer values.
AliasSet[0x24f30a0, 2] may alias, No access Pointers: (i64* %2, 18446744073709551615), (i64* %0, 18446744073709551615)

LAA: Accesses(2):

%2 = bitcast i8* %op.addr.06 to i64* (write)
%0 = bitcast i8* %src.addr.05 to i64* (read-only)

Underlying objects for pointer %2 = bitcast i8* %op.addr.06 to i64*

i8* %op

Underlying objects for pointer %0 = bitcast i8* %src.addr.05 to i64*

i8* %src

LAA: Found a runtime check ptr: %2 = bitcast i8* %op.addr.06 to i64*
LAA: Found a runtime check ptr: %0 = bitcast i8* %src.addr.05 to i64*
LAA: We need to do 1 pointer comparisons.
LAA: We can perform a memory runtime check if needed.
LAA: Checking memory dependencies
Total Dependences: 0
LAA: No unsafe dependent memory operations in loop. We need runtime memory checks.
LAA: Adding RT check for range:
Start: %op End: ((8 * (zext i32 ((8 + (-9 smax (-1 + (-1 * %len))) + %len) /u 8) to i64)) + %op)
LAA: Adding RT check for range:
Start: %src End: ((8 * (zext i32 ((8 + (-9 smax (-1 + (-1 * %len))) + %len) /u 8) to i64)) + %src)
LAA: Found a loop in _Z23IncrementalCopyFastPathPKcPci: while.body
LAA: Processing memory accesses...

AST: Alias Set Tracker: 1 alias sets for 2 pointer values.
AliasSet[0x24f3eb0, 2] may alias, No access Pointers: (i64* %231, 18446744073709551615), (i64* %229, 18446744073709551615)

LAA: Accesses(2):

%231 = bitcast i8* %op.addr.06 to i64* (write)
%229 = bitcast i8* %src.addr.05 to i64* (read-only)

Underlying objects for pointer %231 = bitcast i8* %op.addr.06 to i64*

i8* %op

Underlying objects for pointer %229 = bitcast i8* %src.addr.05 to i64*

i8* %src

LAA: Found a runtime check ptr: %231 = bitcast i8* %op.addr.06 to i64*
LAA: Found a runtime check ptr: %229 = bitcast i8* %src.addr.05 to i64*
LAA: We need to do 1 pointer comparisons.
LAA: We can perform a memory runtime check if needed.
LAA: Checking memory dependencies
Total Dependences: 0
LAA: No unsafe dependent memory operations in loop. We need runtime memory checks.
LAA: Found a loop in _Z23IncrementalCopyFastPathPKcPci: vector.body
LAA: SCEV could not compute the loop exit count.

This is the output of LoopAccessAnalysis:

W:\ver2>build\Debug\bin\opt.exe -O2 test.ll -S -o test_opt.ll --debug-only=loop-accesses
LAA: Found a loop in _Z23IncrementalCopyFastPathPKcPci: while.body
LAA: Processing memory accesses...

AST: Alias Set Tracker: 1 alias sets for 2 pointer values.
AliasSet[0x24f30a0, 2] may alias, No access Pointers: (i64* %2, 18446744073709551615), (i64* %0, 18446744073709551615)

LAA: Accesses(2):

%2 = bitcast i8* %op.addr.06 to i64* (write)
%0 = bitcast i8* %src.addr.05 to i64* (read-only)

Underlying objects for pointer %2 = bitcast i8* %op.addr.06 to i64*

i8* %op

Underlying objects for pointer %0 = bitcast i8* %src.addr.05 to i64*

i8* %src

LAA: Found a runtime check ptr: %2 = bitcast i8* %op.addr.06 to i64*
LAA: Found a runtime check ptr: %0 = bitcast i8* %src.addr.05 to i64*
LAA: We need to do 1 pointer comparisons.

LAA says we need a run-time alias check. Didn't we generate one?

LAA: We can perform a memory runtime check if needed.
LAA: Checking memory dependencies
Total Dependences: 0
LAA: No unsafe dependent memory operations in loop. We need runtime memory checks.
LAA: Adding RT check for range:
Start: %op End: ((8 * (zext i32 ((8 + (-9 smax (-1 + (-1 * %len))) + %len) /u 8) to i64)) + %op)
LAA: Adding RT check for range:
Start: %src End: ((8 * (zext i32 ((8 + (-9 smax (-1 + (-1 * %len))) + %len) /u 8) to i64)) + %src)
LAA: Found a loop in _Z23IncrementalCopyFastPathPKcPci: while.body
LAA: Processing memory accesses...

AST: Alias Set Tracker: 1 alias sets for 2 pointer values.
AliasSet[0x24f3eb0, 2] may alias, No access Pointers: (i64* %231, 18446744073709551615), (i64* %229, 18446744073709551615)

LAA: Accesses(2):

%231 = bitcast i8* %op.addr.06 to i64* (write)
%229 = bitcast i8* %src.addr.05 to i64* (read-only)

Underlying objects for pointer %231 = bitcast i8* %op.addr.06 to i64*

i8* %op

Underlying objects for pointer %229 = bitcast i8* %src.addr.05 to i64*

i8* %src

LAA: Found a runtime check ptr: %231 = bitcast i8* %op.addr.06 to i64*
LAA: Found a runtime check ptr: %229 = bitcast i8* %src.addr.05 to i64*
LAA: We need to do 1 pointer comparisons.
LAA: We can perform a memory runtime check if needed.
LAA: Checking memory dependencies
Total Dependences: 0
LAA: No unsafe dependent memory operations in loop. We need runtime memory checks.
LAA: Found a loop in _Z23IncrementalCopyFastPathPKcPci: vector.body
LAA: SCEV could not compute the loop exit count.

This is the output of LoopAccessAnalysis:

W:\ver2>build\Debug\bin\opt.exe -O2 test.ll -S -o test_opt.ll --debug-only=loop-accesses
LAA: Found a loop in _Z23IncrementalCopyFastPathPKcPci: while.body
LAA: Processing memory accesses...

AST: Alias Set Tracker: 1 alias sets for 2 pointer values.
AliasSet[0x24f30a0, 2] may alias, No access Pointers: (i64* %2, 18446744073709551615), (i64* %0, 18446744073709551615)

LAA: Accesses(2):

%2 = bitcast i8* %op.addr.06 to i64* (write)
%0 = bitcast i8* %src.addr.05 to i64* (read-only)

Underlying objects for pointer %2 = bitcast i8* %op.addr.06 to i64*

i8* %op

Underlying objects for pointer %0 = bitcast i8* %src.addr.05 to i64*

i8* %src

LAA: Found a runtime check ptr: %2 = bitcast i8* %op.addr.06 to i64*
LAA: Found a runtime check ptr: %0 = bitcast i8* %src.addr.05 to i64*
LAA: We need to do 1 pointer comparisons.

LAA says we need a run-time alias check. Didn't we generate one?

LAA: We can perform a memory runtime check if needed.
LAA: Checking memory dependencies
Total Dependences: 0

No. It says " Total Dependences: 0" because it checks only pointers with the same leader, like A[i] and A[i+5].
IMO, this is a bug that should be fixed. In a separate patch, of course.

LAA: No unsafe dependent memory operations in loop. We need runtime memory checks.
LAA: Adding RT check for range:
Start: %op End: ((8 * (zext i32 ((8 + (-9 smax (-1 + (-1 * %len))) + %len) /u 8) to i64)) + %op)
LAA: Adding RT check for range:
Start: %src End: ((8 * (zext i32 ((8 + (-9 smax (-1 + (-1 * %len))) + %len) /u 8) to i64)) + %src)
LAA: Found a loop in _Z23IncrementalCopyFastPathPKcPci: while.body
LAA: Processing memory accesses...

AST: Alias Set Tracker: 1 alias sets for 2 pointer values.
AliasSet[0x24f3eb0, 2] may alias, No access Pointers: (i64* %231, 18446744073709551615), (i64* %229, 18446744073709551615)

LAA: Accesses(2):

%231 = bitcast i8* %op.addr.06 to i64* (write)
%229 = bitcast i8* %src.addr.05 to i64* (read-only)

Underlying objects for pointer %231 = bitcast i8* %op.addr.06 to i64*

i8* %op

Underlying objects for pointer %229 = bitcast i8* %src.addr.05 to i64*

i8* %src

LAA: Found a runtime check ptr: %231 = bitcast i8* %op.addr.06 to i64*
LAA: Found a runtime check ptr: %229 = bitcast i8* %src.addr.05 to i64*
LAA: We need to do 1 pointer comparisons.
LAA: We can perform a memory runtime check if needed.
LAA: Checking memory dependencies
Total Dependences: 0
LAA: No unsafe dependent memory operations in loop. We need runtime memory checks.
LAA: Found a loop in _Z23IncrementalCopyFastPathPKcPci: vector.body
LAA: SCEV could not compute the loop exit count.

No. It says " Total Dependences: 0" because it checks only pointers with the same leader, like A[i] and A[i+5].
IMO, this is a bug that should be fixed. In a separate patch, of course.

You may want to reread Silviu's comment. 'Total dependences' covers dependences that can be determined *statically* by the compiler (i.e. same underlying object). For anything else that may alias we need to add run-time alias checks to ensure no-dependence. This latter is what should have happened in this case.

mkuper edited edge metadata.Aug 2 2016, 10:53 AM

I tried building with r274114 (before this was reverted), and we did generate a runtime alias check:

vector.memcheck:                                  ; preds = %min.iters.checked
  %7 = sub i32 -1, %len
  %8 = icmp sgt i32 %7, -9
  %smax9 = select i1 %8, i32 %7, i32 -9
  %9 = add i32 %len, %smax9
  %10 = add i32 %9, 8
  %11 = lshr i32 %10, 3
  %12 = zext i32 %11 to i64
  %13 = shl i64 %12, 3
  %scevgep = getelementptr i8, i8* %op, i64 %13
  %scevgep10 = getelementptr i8, i8* %src, i64 %13
  %bound0 = icmp ule i8* %op, %scevgep10
  %bound1 = icmp ule i8* %src, %scevgep
  %found.conflict = and i1 %bound0, %bound1
  %memcheck.conflict = and i1 %found.conflict, true
  %cast.crd = trunc i64 %n.vec to i32
  %14 = shl i32 %cast.crd, 3
  %ind.end = sub i32 %len, %14
  %15 = shl i64 %n.vec, 3
  %ind.end12 = getelementptr i8, i8* %op, i64 %15
  %ind.end14 = getelementptr i8, i8* %src, i64 %15
  br i1 %memcheck.conflict, label %scalar.ph, label %vector.ph

I'm not sure this runtime check is correct, though.

delena updated this revision to Diff 69669.Aug 30 2016, 6:34 AM
delena edited edge metadata.

I plan to re-commit this patch, but I'm asking to review two more lines.
(I'll explain in the next mail)

delena requested a review of this revision.Aug 30 2016, 6:44 AM
delena edited edge metadata.
delena added inline comments.
../lib/Transforms/Vectorize/LoopVectorize.cpp
2711–2712

This change I made after buildbot failed at assertion. The main problem that I wasn't able to reproduce this failure. (llvm self build under sanitizer ..).
I fixed the failure by changing the first "if" from "if-induction" to "if-not-loop-invariant" and the buildbot passed all tests.
Do you think that this fix is correct?
May be you can tell me how to reproduce the failure?
Thanks.

mssimpso added inline comments.Aug 30 2016, 8:11 AM
../lib/Transforms/Vectorize/LoopVectorize.cpp
2711–2712

Hi Elena,

I'm looking at this for the first time, but it does look a little strange to me. Before your change, it looks like this case was specifically for pointer induction variables, but now it's not? If the GEP's pointer operand is not an induction variable, then for the GEP to be consecutive, the pointer operand should have to be loop invariant. We check this in isConsecutivePtr. Could this be a case of llvm::getGEPInductionOperand not agreeing with Legal->isInductionVariable?

Matt.

delena added inline comments.Aug 30 2016, 1:11 PM
../lib/Transforms/Vectorize/LoopVectorize.cpp
2711–2712

getelementptr %Ptr, %Ind1

Let me consider 2 cases (A) and (B):

(A)
Let assume that %Ptr is not induction and not loop-invariant. It, probably, means that Ptr is not changed from iteration to iteration, but defined inside the loop. LICM can't hoist it due to some reason. %Ind1 should be induction in this case.

(B)
Another case, Ptr is not induction, but it changes from iteration to iteration providing stride=1 for the memory access.
Ptr = induction * X, for example, and memory versioning takes the case when X == 1. %Ind1 is loop invariant.

Case B seems more relevant than A.

dorit added a subscriber: dorit.Sep 3 2016, 3:11 PM

...

Is that function inlined? Otherwise, the two pointers have different underlying objects. For accesses with different underlying objects, we guarantee no dependence using either alias analysis or a runtime no alias check.

This is done in AccessAnalysis::canCheckPtrAtRT (and implicitly in AccessAnalysis::processMemAccesses).

I have a question about this (unrelated to Elena's patch; sorry to digress a little):

Looks like for the pointers that we can't anti-alias we call canCheckPtrAtRT with ShouldCheckWrap=false :

// Find pointers with computable bounds. We are going to use this information
// to place a runtime bound check.
bool CanDoRTIfNeeded = Accesses.canCheckPtrAtRT(*PtrRtChecking, PSE->getSE(),
                                                TheLoop, SymbolicStrides);

And if there is any unknown dependence distance we call canCheckPtrAtRT with ShouldCheckWrap=true:

if (!CanVecMem && DepChecker->shouldRetryWithRuntimeCheck()) {
    DEBUG(dbgs() << "LAA: Retrying with memory checks\n");
    …
    CanDoRTIfNeeded = Accesses.canCheckPtrAtRT(*PtrRtChecking, SE, TheLoop,
                                               SymbolicStrides, true);

Both cases will end up generating similar runtime pointer-range comparison… why are we not concerned with wrapping in the first?

And when ShouldCheckWrap is true: I assume the unit stride restriction in isNoWrap comes from the concern not to refer to an address beyond the location of the first element past the array boundary? Is that because we compute the pointer range as "ptr+Stride*TripCount", which might refer to "Stride" locations past the array boundary (and therefore may wrap around the address space)?

thanks!
Dorit

delena updated this revision to Diff 70288.Sep 4 2016, 4:16 AM
delena edited edge metadata.

I simplified GEP cloning in vectorizeMemoryInstruction(). I assume that all necessary analysis is already done in isConsecutivePtr().

delena updated this revision to Diff 71748.Sep 18 2016, 6:59 AM

Updated diff before commit.

This revision was automatically updated to reflect the committed changes.