This is an archive of the discontinued LLVM Phabricator instance.

[LoopVectorize]Teach Loop Vectorizer about interleaved memory access
ClosedPublic

Authored by HaoLiu on Apr 30 2015, 4:50 AM.

Details

Summary

Hi,

Early in this month, I added a patch to teach Loop Vectorizer about interleaved data access in D8820. According to the code review comments. I've made a lot of changes. This new patch is attached. It will identify and vectorize interleaved Accesses into "Loads/Stores + ShuffleVectors".
E.g. It can translate following interleaved loads (If vector factor is 4):

for (i = 0; i < N; i+=3) {
    R = Pic[i];       // Load R color elements
    G = Pic[i+1];     // Load G color elements
    B = Pic[i+2];     // Load B color elements
    ... // do something to R, G, B
}

Into

%wide.vec = load <12 x i32>, <12 x i32>* %ptr               ; load for R,G,B
%R.vec = shufflevector %wide.vec, undef, <0, 3, 6, 9>    ; mask for R load
%G.vec = shufflevector %wide.vec, undef, <1, 4, 7, 10>  ; mask for G load
%B.vec = shufflevector %wide.vec, undef, <2, 5, 8, 11>  ; mask for B load

Or it can translate following interleaved stores (If vector factor is 4):

for (i = 0; i < N; i+=3) {
    ... do something to R, G, B
    Pic[i] = R;     // Store R color elements
    Pic[i+1] = G;     // Store G color elements
    Pic[i+2] = B;     // Store B color elements
}

Into

%RG.vec = shufflevector %R.vec, %G.vec, <0, 1, 2, ..., 7>
%BU.vec = shufflevector %B.vec, undef, <0, 1, 2, 3, u, u, u, u>
%interleaved.vec = shufflevector %RG.vec, %BU.vec,
            <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>  ; mask for interleaved store
store <12 x i32> %interleaved.vec, <12 x i32>* %ptr         ; write for R,G,B

This patch mainly does:

(1) Identify interleaved access. (As some situation can not be covered corrently, I've added a TODO.)
(2) Transfer the indentified interleaved access to ShuffleVectors and Load/Store.
(3) Add a new pass in AArch64 backend to match the interleaved load/store with stride 2,3,4 to ldN/stN intrinsics.

I also added a new target hook to calculate the cost. (As I don't know too much about other targets, I just estimated it roughly.) It can be improved to be more accurate.

For the correctness, I've tested on AArch64 target with LNT, EEMBC, SPEC2000, SPEC2006, which can all pass.
For the performance, as there are other issues could forbid many vectorization opportunities, I don't see obvious improvements. But some benchmarks like EEMBC.RGBcmy and EEMBC.RGByiq are expected to have huge improvements (6 time

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
HaoLiu added inline comments.May 18 2015, 2:49 AM
include/llvm/Analysis/LoopAccessAnalysis.h
370–476

Done.

rengolin added inline comments.May 18 2015, 5:56 AM
include/llvm/Analysis/LoopAccessAnalysis.h
398

Ouch, no!

If you do this, than a perfectly valid sequence like this:

group.eraseFromMap(Map);
group.getDelta(); // there's no way to know that this won't work

will segfault.

A quick fix would be to leave the deletion to whomever is calling the function, so it's clear on the lifetime of the object:

group.eraseFromMap(Map);
group.getDelta();
delete group;
group.getReverse(); // this is obviously wrong

But again, it seems to me that such logic is spread out too much.

So a better fix would be to coalesce all of it into a class in itself. One that contains the maps, the groups, and the control over the lifetime of all its groups in a consistent way, without requiring the caller to know about it.

I'm ok with the temporary solution, as long as you add a FIXME to that effect.

anemet added a subscriber: anemet.May 18 2015, 9:30 AM
anemet added inline comments.
include/llvm/Analysis/LoopAccessAnalysis.h
634–635

I've only started looking at this patch but I have a quick initial, more fundamental comment.

You are modifying an analysis pass (LoopAccessAnalysis). Please add support for printing the new information gathered by the analysis. Please also add tests under test/Analysis/LoopAccessAnalysis to verify the new information.

You may even want separate out the patch for the analysis part (LAA + tests) from the transformation parts (LV + tests + etc).

HaoLiu updated this revision to Diff 26044.May 19 2015, 3:36 AM

I've updated a new patch according to the comments from Renato and Adam. This patch adds a new class InterleavedAccessInfo to handle the analysis about interleaved accesses.

Review please.

Thanks,
-Hao

include/llvm/Analysis/LoopAccessAnalysis.h
398

That's reasonable.

I've added a new class InterleavedAccessInfo to manage this.

634–635

Hi Adam,

I've undated the patch to support printing the new information. I added more test cases as well.

I didn't separate patch as I think there are already a lot of history comments on both parts. Also I think the bound between analysis part and transformation part is clear.

Anyway, I can still separate this patch if you insist.

Hi Hao,

Please find some comments from me inline. I'll get back to review a bit later.

Also, I think it's a good idea to separate the part for LoopAccessAnalysis part into a separate patch. I think it's fine to keep it in one patch during review (to keep the history), but I'd suggest committing it separately when they are ready. It'll help tracking down any potential issues that could be exposed later.

Thanks,
Michael

include/llvm/Analysis/LoopAccessAnalysis.h
320

We require Align!=0 here, but in the constructor below we explicitly set it to 0. Is one of these unused and could be removed?

490–491

Please fix the indentation here.

include/llvm/Analysis/TargetTransformInfo.h
448

s/vecotor/vector/

449–451

This comment seems to be outdated - there are no arguments SubTy, Index, or Gap in the declaration below.

include/llvm/Analysis/TargetTransformInfoImpl.h
306

Why the cost is Delta?

include/llvm/CodeGen/BasicTTIImpl.h
534

s/divied/divided/

546–547

Am I getting it correctly, that VecTy is the type of 'combined' vector?
I.e. if we have a stride=2, VF=4, and ScalarTy=i32, then VecTy would be <8 x i32>, and SubVT would be <4 x i32>? I'm not sure I've completely understood this part, so please correct me if I'm wrong.

If that's true, then the cost computation doesn't seem correct. Extract 1st or 2nd element from the wide vector isn't the same as extracting odds or even elements.

551

Was it supposed to be +=?

558–564

What's the difference between computing costs for stores and loads?

Also, I think it's a good idea to separate the part for LoopAccessAnalysis part into a separate patch. I think it's fine to keep it in one patch during review (to keep the history), but I'd suggest committing it separately when they are ready. It'll help tracking down any potential issues that could be exposed later.

Michael, I agree with you and I'll separate the patch when committing.

I've refactored the patch according to comments from Michael.

Review please.

Thanks,
-Hao

include/llvm/Analysis/LoopAccessAnalysis.h
320

Removed the unused constructor.

490–491

Fixed.

include/llvm/Analysis/TargetTransformInfo.h
448

Done.

449–451

Fixed.

include/llvm/Analysis/TargetTransformInfoImpl.h
306

Previously I though there are 'Delta' instructions each costs 1.
Regarding all the default cost in other getXXXCost is 1, I've changed it to be "1".

include/llvm/CodeGen/BasicTTIImpl.h
546–547

Yes, you are right.

The cost is not only "Extract" cost, it also plus the "Insert" cost.

Say we interleaved load two vectors:

%vec = load <8 x i32>, <8 x i32>* %ptr
%v0 = shuffle %vec, undef, <0, 2, 4, 6>
%v1 = shuffle %vec, undef, <1, 3, 5, 7>

The cost consist of 2 parts:

(1) load of <8 x i32>
(2) extract %v0 and %v1 from %vec.

The cost of (2) is estimated as 2 parts:

(a) extract elements at: 0, 1, 2, 3, 4, 5, 6, 7
(b) insert elements 0, 2, 4, 6 into %v0 and insert elements 1, 3, 5, 7 into %v1

The interleaved store is similar.

551

Fixed.

558–564

Say we interleaved store two vectors:

%v0_v1 = shuffle %v0, %v1, <0, 4, 1, 5, 2, 6, 3, 7>
store <8 x i32> %v0_v1, <8 x i32>*

It has two main differences:
(1) The direction is different. It firstly extract elements from two sub vectors and then insert into a wide vector (interleaved Load firstly extract elements from a wide vector and then insert into sub vectors)
(2) The interleaved store doesn't allow gaps, which means we must have member of index 0 (with even elements) and member of index 1 (with odd elements). But interleaved load allows gaps, it could only load even elements such as:

%vec = load <8 x i32>, <8 x i32>* %ptr
%v0 = shuffle <8 x i32> %vec, undef, <0, 2, 4, 6>

This is still beneficial and cheaper than scalar loads.

I've refactored this function to add a new parameter of indices for interleaved load.

Michael, I agree with you and I'll separate the patch when committing.

Unless you can guarantee that trunk will work with just the first patch (ie. you have tested both equally), I don't think it's a good idea to split the patch. The analysis does nothing without the actual vectorisation, and vice versa.

It's a big patch, yes, but also this is a big change. I don't see what splitting the patches would gain us.

The patch now looks more like what we had planned in the beginning, so as long as all comments are addressed, and all tests pass, I'm happy.

Thank you for the effort of bringing this up again.

cheers,
--renato

Michael, I agree with you and I'll separate the patch when committing.

Unless you can guarantee that trunk will work with just the first patch (ie. you have tested both equally), I don't think it's a good idea to split the patch. The analysis does nothing without the actual vectorisation, and vice versa.

It's a big patch, yes, but also this is a big change. I don't see what splitting the patches would gain us.

The patch now looks more like what we had planned in the beginning, so as long as all comments are addressed, and all tests pass, I'm happy.

Thank you for the effort of bringing this up again.

cheers,
--renato

Renato, I also agree with you. I think to commit separately as two patches or commit as a whole patch are similar and have no big differences.

Thanks,
-Hao

Hi Hao,

Thanks for updating the patch, another bunch of comments from me inline.

Also, sorry for that I give the feedback in parts - it's hard to find enough time to review the entire patch all at once. I know that it might be a bit frustrating sometimes, but in fact I really appreciate your work and see how things converge:)

Thanks,
Michael

include/llvm/Analysis/TargetTransformInfo.h
447–457

Nitpick: Alignment and AddressSpace aren't documented.

lib/Analysis/LoopAccessAnalysis.cpp
908–916

Please add some comments on this struct and its members.

951–952

Using empty() is more efficient than size()==0.

1041–1042

Could we use a range loop here?

1045

Range-based loop?

1049–1050

Why not to return right after we discovered the block needs predication?

1065–1066

What's the rationale behind this? It might be inefficient to vectorize a group with a huge gap, but isn't it a question for cost-model rather than for the analysis?

1108–1116

Could we just check if PosA dominates PosB?

HaoLiu updated this revision to Diff 26208.May 21 2015, 2:08 AM

Hi Hao,

Thanks for updating the patch, another bunch of comments from me inline.

Also, sorry for that I give the feedback in parts - it's hard to find enough time to review the entire patch all at once. I know that it might be a bit frustrating sometimes, but in fact I really appreciate your work and see how things converge:)

Thanks,
Michael

Hi Michael,

Thanks for your comments, which really help me a lot. I've updated a new patch accordingly.

Review please.

Thanks,
-Hao

include/llvm/Analysis/TargetTransformInfo.h
447–457

Done.

lib/Analysis/LoopAccessAnalysis.cpp
908–916

Done.

951–952

Good idea!

1041–1042

Done.

1045

Done.

1049–1050

It's slightly different.
We still support a predicted block if it doesn't have load/store. Because the vectorization of interleaved loads/stores won't break any dependences.
But if a predicted block has loads/stores, we need to handle an interleave group contains mixed normal loads/stores and predicted loads/stores.

1065–1066

This is just a conservative way. I think generally if a group missing more than half members, it is not very beneficial to do vectorization. Also it sounds not reasonable to still call it "interleaved" group.

But I agree with you that it's cost-model's work to decide whether it is beneficial.
The new patch keeps all load groups.
There is also a new situation that: the cost of a load group with a huge gap may even be expensive than the scalar operations. Then I think we need to replace with scalar operations. I've added a FIXME in the new patch as I haven't found an efficient way to fix this.

1108–1116

A good idea!

rengolin added inline comments.May 21 2015, 2:18 AM
lib/Analysis/LoopAccessAnalysis.cpp
1049–1050

This might be a case for indexed loads on AVX, but for now, let's keep it simple. :)

1065–1066

I also agree that the cost model should get this right, but currently, there is no way it can do that.

However, since the interleaved vectorisation is not enabled by default, I think keeping all the groups and fixing the cost model later is the right thing to do.

rengolin accepted this revision.May 21 2015, 2:22 AM
rengolin edited edge metadata.

Thanks Hao,

I think we're good to go. Michael, any more concerns?

Once we land this in, it'll be easier to test it and fix the issues created by this patch without affecting the rest.

cheers,
--renato

This revision is now accepted and ready to land.May 21 2015, 2:22 AM

I've updated a new patch according to the comments from Renato and Adam. This patch adds a new class InterleavedAccessInfo to handle the analysis about interleaved accesses.

Thanks, Hao. I'll look at this today.

Adam

Hi Hao,

Please find more remarks from me inline, andI think this is the last bunch of comments from me. With them properly addressed I'm fine with committing the patch. However, I'd also like to hear from Adam regarding the LoopAccessAnalysis part, since he's been working actively on it.

And again, thanks for working on this, it's a much-needed feature!

Michael

include/llvm/Analysis/LoopAccessAnalysis.h
403

Should it be unsigned?

lib/Analysis/LoopAccessAnalysis.cpp
978

I think you missed my question earlier on this part.

This algorithm is clearly quadratic, which is dangerous for compile time. We need to either make it linear, or add some limits (e.g. if number of accesses > 100, give up and don't even try to group them).

Also, if we separate it into two phases: 1) grouping accesses, 2) finding write-after-write pairs, I think we can make it more efficient. Since it's sufficient to compare against any access from a group, we'll be quadratic on the number of groups, not the number of accesses. And when we look for WaW pairs, we can only look only inside a group, not across all accesses in the loop. Does it sound reasonable, or am I missing something?

lib/Transforms/Vectorize/LoopVectorize.cpp
703–709

This change is unnecessary.

1673

Nitpick: s/I.E/I.e./

1753–1754

Maybe it's worth mentioning here that though ConcatenateTwoVectors could extend a vector with UNDEFs, that only could happens with the last vector in the set.

Currently implementation guarantees that, but with the code evolving in future it can be easily violated if it's not stated clearly. Maybe an assertion could be even better here (like if it's not the last pair of vectors, then their sizes should be the same), but it might be an overkill.

1761

Could we just assign TmpList to VecList?

1820

Should we iterate to VF instead of UF here? PtrParts contains VF elements, right?

1832–1833

I'm not sure it's correct if we vectorize *and* unroll the loop. Don't we need to adjust the pointer according to which unrolling-part we are at (instead of using 0)?

1854

Name CallI is very misleading.

1854–1855

Name CallI is very misleading.

1901–1903

Could you please provide an example when this triggers? I'm a bit concerned about legality of this cast.

1916–1917

The name is misleading.

4864

s/even expensive/even more expensive/

test/Transforms/LoopVectorize/interleaved-accesses.ll
1 ↗(On Diff #26208)

Is it possible to use noalias attributes for arguments and drop -runtime-memory-check-threshold=24?

Hi Hao,

Thinking about this a bit more, why are we collecting the InterLeaveGroups in LAA? LAA provides loop dependence information and computing this additional, unrelated information comes at a cost. Clients other than LV would pay for this without using it. I think that the InterLeaveGroups analysis should probably live in LV.

What do you think? Let me know if I am missing something from an earlier discussion.

Adam

HaoLiu edited edge metadata.

I've updated a new patch refactored according to Michael's comments.

Review please.

Thanks,
-Hao

Hi Hao,

Please find more remarks from me inline, andI think this is the last bunch of comments from me. With them properly addressed I'm fine with committing the patch. However, I'd also like to hear from Adam regarding the LoopAccessAnalysis part, since he's been working actively on it.

And again, thanks for working on this, it's a much-needed feature!

Michael

lib/Analysis/LoopAccessAnalysis.cpp
978

For the first question. Previously, I thought you meant to improve the insert member method.
Anyway, I think the algorithm is quadratic, the paper says:

This relies on the pivoting order in which the pairs of accesses are processed and  
reduces the complexity of the grouping algorithm: we iterate over O(n^2 ) pairs of
loads and stores (where n is the number of loads and stores)

But I think we don't need to worry about compile time:

(1) The "n" is only the number of stride loads/stores (|Stride| > 1)
(2) For a case with a lot of loads and stores, it possiblely can not pass previous dependence check, because it could have a lot of runtime memory check or have memory conflict. So it returns early.

On the other hand, if a loop indeed can pass dependence check. say if it has 100 loads, we should analyze the interleave accesses as vectorization on the possible interleave groups can really give us a lot of benefit.

978

For the second question. I think it is slightly different. We should not only consider about Group and Group, a group and a single store may break the dependence.
Say, A[i] = a; // (1)

B[i] = b;          // (2)
A[i+1] = c;      // (3)

The combine of (1) and (3) may break the dependence on "A[i] - B[i]" pair even though B[i] is not in a group.
So I think we should search all the write-write pairs. If we separate into two phases, we'll have to calculate the distance again. I think it may increase the compile time.

lib/Transforms/Vectorize/LoopVectorize.cpp
1820

Slightly different.
'PtrParts' contains UF vectors, and each vector has VF elements.
Here, firstly we want the pointer vector in current unroll part, I.e. 'PtrParts[Part]'.
Then for each pointer vector ''PtrParts[Part]', we want the pointer in lane 0.

E.g.

for (unsigned i = 0; i < 1024; i+=2) {
     A[i] = i;
}

If VF is 4 and UF is 2. Then we'll have two unroll parts:
part 1:

A[0] = 0;
A[2] = 2;
A[4] = 4;
A[6] = 6;

part 2:

A[8] = 8;
...

For this case, 'PtrParts' contains two vectors:

Vector 1 consists pointers to: A[0], A[2], A[4], A[6]
Vector 2  consists pointers to: A[8], A[10], A[12], A[14]

Then we extract lane 0 from Vector 1 to get a pointer pointing to A[0], so that we can use it to load A[0-7] and after that, we can extract A[0, 2, 4, 6] from A[0-7].

If the loads/stores are reverse, we need the pointer in last lane. E.g. the access sequences will be "A[1022], A[1020], A[1018], A[1016]", the last lane is the pointer to A[1016]". We can load A[1016-1023] to extract A[1016, 1018, 1020, 1022].

1832–1833

Yes, we need.
As the above comment says, we get vectors consist of pointers in all unroll parts.

1901–1903

A test case called @int_float_struct tests the interleaved load group:

struct IntFloat {
    int a;
    float b;
 };

 void int_float_struct(struct IntFloat *A) {
    int SumA;
    float SumB;
    for (unsigned i = 0; i < 1024; i++)  {
      SumA += A[i].a;
      SumB += A[i].b;
    }
    SA = SumA;
    SB = SumB;
  }

You can find this case in "interleaved-accesses.ll". The int-float can be long-double, or any pointer types, as long as the structure has elements of the same size.

Actually I have better case:

struct IntFloat {
  int a;
  float b;
};

void int_float_struct(struct IntFloat *A, struct IntFloat *B) {
    for (unsigned i = 0; i < 1024; i++)  {
      B[i].a = A[i].a + 1;
      B[i].b = A[i].b + 1.0;
    }
  }

Unfortunately, this case can not be vectorized as it can not pass dependence check. As we have following code in isDependent() (LoopAccessAnalysis.c):

if (ATy != BTy) {
  DEBUG(dbgs() <<
        "LAA: ReadWrite-Write positive dependency with different types\n");
  return Dependence::Unknown;
}

I think this is a bug. We don't need to check types as we will check dependences based on the memory object size. by removing such code I don't see any failures in benchmarks.
Anyway, as such code, I can not give a test for interleaved write group.

Thinking about this a bit more, why are we collecting the InterLeaveGroups in LAA? LAA provides loop dependence information and computing this additional, unrelated information comes at a cost. Clients other than LV would pay for this without using it. I think that the InterLeaveGroups analysis should probably live in LV.

What do you think? Let me know if I am missing something from an earlier discussion.

Adam

Hi Adam,

It's easy to move such analysis to LV, but I think it reasonable to analyze interleaved accesses in LoopAccessAnalysis:

(1) Regarding the name "LoopAccessAnalysis", which is responsible for analyzing accesses in a loop. The interleaved access group is a special kind of access in loop.
(2) For clients other than LV, it could disable or enable such analysis. Or we could disable it by default and LV calls function like "analyzeInterleaving()" to do analysis. To achieve this is easy.
(3) For other clients, I think we may also need to modify the memory dependence analysis, which is now dedicated to analyze dependences for loop vectorizer.

Thanks,
-Hao

HaoLiu added inline comments.May 22 2015, 12:47 AM
test/Transforms/LoopVectorize/interleaved-accesses.ll
1 ↗(On Diff #26208)

Hi Michael,

Do you mean by adding "-scoped-noalias". I tried but it doesn't work. I think maybe the runtime check is different from normal alias analysis.

It's easy to move such analysis to LV, but I think it reasonable to analyze interleaved accesses in LoopAccessAnalysis:

(1) Regarding the name "LoopAccessAnalysis", which is responsible for analyzing accesses in a loop. The interleaved access group is a special kind of access in loop.

True, I named it LAA because it’s a parallel dependence analysis framework to DependenceAnalysis but the focus is the same: dependence analysis plus run-time alias check generation for may-alias accesses.

I don’t have a problem adding things to the analysis that are free to compute even if it’s only used by a single client of the analysis pass (consider Ashutosh’s recent StoreToInvariantAddress changes). Here however we’re adding a potentially costly new analysis only required by a single client.

(2) For clients other than LV, it could disable or enable such analysis. Or we could disable it by default and LV calls function like "analyzeInterleaving()" to do analysis. To achieve this is easy.

Not really. This is an analysis pass, transform passes may depend on it and the pipeline manager will run this pass depending on transform pass requirements and possibly rerun it if it got invalidated by prior transformations.

As an example, consider this scenario. LAA is performed because of Loop Distribution. We end up not distributing the loop so the result of the analysis is intact and won’t be rerun. Then comes LV but the analysis is lacking the interleaved access analysis part.

Admittedly, we currently have a hack in LAA with regard to symbolic strides which would be similar to what you’re proposing. That however is more of shortcoming of how LAA was split out from LV rather than an example to follow.

(3) For other clients, I think we may also need to modify the memory dependence analysis, which is now dedicated to analyze dependences for loop vectorizer.

I am not sure I understand this point. Can you please elaborate?

Thanks,
Adam

Thanks,
-Hao

mzolotukhin added inline comments.May 22 2015, 9:27 AM
test/Transforms/LoopVectorize/interleaved-accesses.ll
1 ↗(On Diff #26208)

No, I meant changing declaration from this:

define void @foo(%struct.ST2* nocapture readonly %A, %struct.ST2* nocapture %B)

to

define void @foo(%struct.ST2* noalias nocapture readonly %A, %struct.ST2* noalias nocapture %B)

It'll tell that A and B don't alias, and thus we won't need runtime checks for them.

mzolotukhin added inline comments.May 22 2015, 11:40 AM
lib/Analysis/LoopAccessAnalysis.cpp
978

Hm.. I'm reading the paper from here:
http://domino.research.ibm.com/library/cyberdig.nsf/papers/EFD6206D4D9A9186852570D2005D9373/$File/H-0235.pdf
And it says:

This relies on the pivoting order in which pairs of accesses are processed, and implies that the complexity of the grouping algorithm is linear in the number of loads and stores.

I'm not sure if it's safe to assume that the dependence check won't be passed. E.g. we might have the following case:

struct A {
int x,y;
};
struct A points[100];
for (int iter = 0; iter < 10000; iter++) { // not to be unrolled
  for (int i = 0; i < 100; i++) { // this loop will be completely unrolled
    points[i].x++;
  }
  // some other computations on the points
}

When the innermost loop is unrolled, we get 100 interleaving accesses, which will require 10^4 pairwise comparisons. I agree that we theoretically can optimize it better if we are not limited on time. However, if a programmer decides to increase number of points by a factor of 10, the compile time might slow down 100x, which won't be acceptable.
I think we have a lot of optimizations that could theoretically find the best solution if they are allowed to consume infinite number of time, however, we usually have some thresholds for them (such thresholds often don't prevent one from catching the cases the optimization is aimed for, but they guard against rare but very nasty cases when the compiler might work for hours).

For the second question, my understanding is that we should have two groups: {A[i], A[i+1]} and {B[i]}. Then we check if an access from the first group can alias with an access from the second group - I think we don't have to check every access in a group for this kind of reasoning. I.e. if leaders of the groups might alias, then any member of the first group can alias with any member of the second group (and if the leaders don't alias, then we can probably assume that the groups are totally independent of each other). Am I missing something here?

lib/Transforms/Vectorize/LoopVectorize.cpp
1820

Thanks for the explanation, sounds good.

1901–1903

It makes sense, thanks for the explanation!

It's easy to move such analysis to LV, but I think it reasonable to analyze interleaved accesses in LoopAccessAnalysis:

(1) Regarding the name "LoopAccessAnalysis", which is responsible for analyzing accesses in a loop. The interleaved access group is a special kind of access in loop.

True, I named it LAA because it’s a parallel dependence analysis framework to DependenceAnalysis but the focus is the same: dependence analysis plus run-time alias check generation for may-alias accesses.

I don’t have a problem adding things to the analysis that are free to compute even if it’s only used by a single client of the analysis pass (consider Ashutosh’s recent StoreToInvariantAddress changes). Here however we’re adding a potentially costly new analysis only required by a single client.

(2) For clients other than LV, it could disable or enable such analysis. Or we could disable it by default and LV calls function like "analyzeInterleaving()" to do analysis. To achieve this is easy.

Not really. This is an analysis pass, transform passes may depend on it and the pipeline manager will run this pass depending on transform pass requirements and possibly rerun it if it got invalidated by prior transformations.

As an example, consider this scenario. LAA is performed because of Loop Distribution. We end up not distributing the loop so the result of the analysis is intact and won’t be rerun. Then comes LV but the analysis is lacking the interleaved access analysis part.

Admittedly, we currently have a hack in LAA with regard to symbolic strides which would be similar to what you’re proposing. That however is more of shortcoming of how LAA was split out from LV rather than an example to follow.

(3) For other clients, I think we may also need to modify the memory dependence analysis, which is now dedicated to analyze dependences for loop vectorizer.

I am not sure I understand this point. Can you please elaborate?

I previously thought LoopAccessAnalysis is specific for LV, because there are a lot of LV specific data like C

It's easy to move such analysis to LV, but I think it reasonable to analyze interleaved accesses in LoopAccessAnalysis:

(1) Regarding the name "LoopAccessAnalysis", which is responsible for analyzing accesses in a loop. The interleaved access group is a special kind of access in loop.

True, I named it LAA because it’s a parallel dependence analysis framework to DependenceAnalysis but the focus is the same: dependence analysis plus run-time alias check generation for may-alias accesses.

I don’t have a problem adding things to the analysis that are free to compute even if it’s only used by a single client of the analysis pass (consider Ashutosh’s recent StoreToInvariantAddress changes). Here however we’re adding a potentially costly new analysis only required by a single client.

(2) For clients other than LV, it could disable or enable such analysis. Or we could disable it by default and LV calls function like "analyzeInterleaving()" to do analysis. To achieve this is easy.

Not really. This is an analysis pass, transform passes may depend on it and the pipeline manager will run this pass depending on transform pass requirements and possibly rerun it if it got invalidated by prior transformations.

As an example, consider this scenario. LAA is performed because of Loop Distribution. We end up not distributing the loop so the result of the analysis is intact and won’t be rerun. Then comes LV but the analysis is lacking the interleaved access analysis part.

Admittedly, we currently have a hack in LAA with regard to symbolic strides which would be similar to what you’re proposing. That however is more of shortcoming of how LAA was split out from LV rather than an example to follow.

(3) For other clients, I think we may also need to modify the memory dependence analysis, which is now dedicated to analyze dependences for loop vectorizer.

I am not sure I understand this point. Can you please elaborate?

Hi Adam,

I think LoopAccessAnalysis is a specific analysis for LV, so it's reasonable to put interleaved access analysis in an analysis specific for LV. If LoopAccessAnalysis is not specific for LV, I agree that we should not put it here. Currently I don't have idea about how other clients could use such interleave information.

But I still wonder how "other" clients use this analysis, as it has a lot of LV specific code like "CanVecMem, force-vector-width, force-vector-interleave", .... I wonder how "other" clients could use such analysis. Adam, do you have any example about other clients using LoopAccessAnalysis?

Thanks
-Hao

HaoLiu added inline comments.May 24 2015, 7:43 AM
lib/Analysis/LoopAccessAnalysis.cpp
978

I agree with you. I could also find many similar thresholds. I'll add such limitation to the patch.

I think you are right for the second question. I'll try to implement this in the next patch.

I think LoopAccessAnalysis is a specific analysis for LV, so it's reasonable to put interleaved access analysis in an analysis specific for LV. If LoopAccessAnalysis is not specific for LV, I agree that we should not put it here. Currently I don't have idea about how other clients could use such interleave information.

But I still wonder how "other" clients use this analysis, as it has a lot of LV specific code like "CanVecMem, force-vector-width, force-vector-interleave", .... I wonder how "other" clients could use such analysis. Adam, do you have any example about other clients using LoopAccessAnalysis?

Sure, I already mentioned both clients: Transforms/Scalar/LoopDistribute.cpp and the still pending Ashutosh's Loop versioning pass in D9151.

Regarding your first specific question on CanVecMem, the basic design principle for splitting out LAA fron LV was to continue to provide the high-level LV-specific answers (i.e. CanVecMem) but also provide more lower-level/generic information about the dependences and run-time checks so that other passes could also use them (see APIs like: getInterestingDependences, getMemoryInstructions, getRuntimeChecks and getInstructionsForAccess).

Your other question was about force-vector-width, etc. These influence what dependences are acceptable for LV for its specific view of the dependence information. The solution here was to provide subtypes like *Vectorizable (see Dependence::DepType). These "subtypes" provide further classification of the dependence. LV uses these to allow vectorization of certain Backward dependences.

Loop Distribution does not use this classification since it only cares about forward vs. backward dependences.

Thus the difference is exposed to the clients and they can chose to treat these types differently or uniformly.

Adam

I think LoopAccessAnalysis is a specific analysis for LV, so it's reasonable to put interleaved access analysis in an analysis specific for LV. If LoopAccessAnalysis is not specific for LV, I agree that we should not put it here. Currently I don't have idea about how other clients could use such interleave information.

But I still wonder how "other" clients use this analysis, as it has a lot of LV specific code like "CanVecMem, force-vector-width, force-vector-interleave", .... I wonder how "other" clients could use such analysis. Adam, do you have any example about other clients using LoopAccessAnalysis?

Sure, I already mentioned both clients: Transforms/Scalar/LoopDistribute.cpp and the still pending Ashutosh's Loop versioning pass in D9151.

Regarding your first specific question on CanVecMem, the basic design principle for splitting out LAA fron LV was to continue to provide the high-level LV-specific answers (i.e. CanVecMem) but also provide more lower-level/generic information about the dependences and run-time checks so that other passes could also use them (see APIs like: getInterestingDependences, getMemoryInstructions, getRuntimeChecks and getInstructionsForAccess).

Your other question was about force-vector-width, etc. These influence what dependences are acceptable for LV for its specific view of the dependence information. The solution here was to provide subtypes like *Vectorizable (see Dependence::DepType). These "subtypes" provide further classification of the dependence. LV uses these to allow vectorization of certain Backward dependences.

Loop Distribution does not use this classification since it only cares about forward vs. backward dependences.

Thus the difference is exposed to the clients and they can chose to treat these types differently or uniformly.

Adam

I see. That make sense.

Thanks,
-Hao

HaoLiu updated this revision to Diff 26419.May 25 2015, 3:23 AM

Updated a new patch refactored according to Michael and Adam.
There are two main changes:

(1) Add a threshold MaxInterleaveStride to avoid analyzing accesses with too large stride.
(2) Move the code about analyzing interleaved accesses in LoopAccessAnalysis to LoopVectorizer.

Review please.

Thanks
-Hao

test/Transforms/LoopVectorize/interleaved-accesses.ll
1 ↗(On Diff #26208)

I tried this, but it still can not work for some cases like @test_struct_store4. It is a known problem with our runtime memory checks. It will check the dependences in the same array.
E.g. for (i = 0; i < n; i+=4) {

  A[i] = a;
  A[i+1] = b;
  A[i+2] = c;
  A[i+3] = d;
}

A lot of runtime checks with be generated for pairs in {A[i], A[i+1], A[i+2], A[i+3]} and will break the threshold.

HaoLiu updated this object.May 26 2015, 12:05 AM
HaoLiu edited the test plan for this revision. (Show Details)
HaoLiu changed the visibility from "Public (No Login Required)" to "Unknown Object (????) (You do not have permission to view policy details.)".
HaoLiu changed the edit policy from "All Users" to "Unknown Object (????) (You do not have permission to view policy details.)".
HaoLiu removed subscribers: anemet, mzolotukhin, aemerson, Unknown Object (MLST).
This revision now requires review to proceed.May 26 2015, 12:05 AM
HaoLiu changed the visibility from "Unknown Object (????) (You do not have permission to view policy details.)" to "All Users".
HaoLiu changed the edit policy from "Unknown Object (????) (You do not have permission to view policy details.)" to "All Users".

I met a bug in Phabricator. It automatically removed all reviewers when I uploading a new patch. Now I've added reviewers back.

Thanks,
-Hao

HaoLiu added a subscriber: Unknown Object (MLST).May 26 2015, 12:10 AM
anemet added inline comments.May 26 2015, 4:28 PM
lib/Analysis/LoopAccessAnalysis.cpp
784–785

Please split out of this part of the change -- a cleanup, and commit it separately (and upate this patch for easier read)

788–790

I think that these two cases are quite different. The first is essentially proving that accesses like A[2*i] and A[2*i + 1] are independent. This is true in general not just for vectorization. Thus we should be returning NoDep rather than BackwardVectorizable.

Please also add more comments/examples, it took me a long time to figure this out (hopefully I did at the end). Please also add LAA tests.

798–806

I am not sure I follow why you change the structure of the MaxSafeDepDistBytes logic here. Why aren't you simply comparing MaxSafeDepDistBytes with TypeByteSize * NumIter * Stride?

Looks like this will modify the existing behavior for stride=1 which is probably not what you intend.

HaoLiu updated this revision to Diff 26584.May 27 2015, 3:52 AM
HaoLiu changed the visibility from "All Users" to "Public (No Login Required)".

Updated a new patch refactored the LoopAccessAnalysis about dependence check on strided accesses.

It's difficult to understand the logic about dependences, so I added many comments to describe it.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
784–785

Hi Adma,

I Agree. I'll do that when commiting the patch.

788–790

Hi Adam,

You are right. I've updated a new patch refactored this logic. I added a lot of new test cases as well.

798–806

I think preivous logic is not correct. If "Distance > MaxSafeDepDistBytes", it is not safe to do vectorization.

HaoLiu updated this revision to Diff 26585.May 27 2015, 4:10 AM

Updated a new patch with slight modifications.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
798–806

Sorry, previously I misunderstood the MaxSafeDepDistBytes. The new patch follows the meaning of MaxSafeDepDistBytes, it also has the same logic for stride = 1.

I also found a problem with MaxSafeDepDistBytes. It cannot handdle cases with different kinds of types. like:

void foo(int *A, char *B) {
    for (unsigned i = 0; i < 1024; i++) {
        A[i+2] = A[i] + 1;
        B[i+2] = B[i] + 1;
    }
}

I think we should use MaxFactor, which stands for the maximum number of iterations to be vectorized & unrolled. I've added a FIXME in the patch.

anemet added inline comments.May 29 2015, 12:53 PM
lib/Analysis/LoopAccessAnalysis.cpp
687–708

I don't think this is GCD. Consider the values Stride=4 and ScaledDistance=6:

4 * i = 4 * j + 6 can not be valid for any integer values of i and j.

I think the Diophantine equation is this: ScaledDistance/Stride = (i - j) where i - j has to be integer.

So the two accesses can only reference the same element if ScaledDistance is a multiple of Stride.

716

areStridedAccessIndependent is the conforming name (funtions are lower-case). Also not the the 'd' at the end of Strided. Same for the previous function.

725–743

If you want to handle this case I prefer we come back to it later as a separate LAA-only patch. Let's focus on the basics here that's needed for LV to handle the basic cases.

787

vectorized is misspelled

790

MinDistanceNeeded is probably a better name.

790–815

I think this is correct but I wonder if the example was less contrived if you used:

for (i = 0; i < 1024; i+= 3)
  A[i + 4] = A[i] + 1

MinSizeNeeded is 4 * 3 * (2 - 1) + 4 = 16 which is equal to the distance.

Also a nit: most or all of this comment is explaining why MinSizeNeeded is computed the way it is so the comment should be before the computation.

test/Analysis/LoopAccessAnalysis/analyze-stride-access-depedences.ll
216–228 ↗(On Diff #26663)

As a follow-on patch, it would probably be a good idea to print MaxSafeDepDistance in LAI::analyze and check its value here.

224–228 ↗(On Diff #26663)

Use CHECK-NEXTs where you can. Same later.

331 ↗(On Diff #26663)

should *look* like?

HaoLiu accepted this revision.May 31 2015, 11:55 PM
HaoLiu added a reviewer: HaoLiu.

Refactored the patch according to Adam's comments.

I also found a bug that write after load to the same location like:

A[i] = a;
b = A[i];

is not looked as store-load forwarding currently. As this could affect the correctness, I fixed this with slight modification. 2 new test cases are added.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
687–708

Ah, not GCD.
You are right!

716

Agree.

725–743

Agree. Also this is rare case. I think it won't affect about the performance.

787

Fixed.

790

Fixed

790–815

This case is vectorizable as MinDistanceNeeded is exactly equal to the distance. If the distance is smaller (even 1 byte smaller), it can not be vectorizable.

Actually we already has a similar case in memdep.ll:

for (i = 0; i < 1024; ++i)
    A[i+2] = A[i] + 1;

The MinDistanceNeeded is 2*4 = 8. The distance is also 8.

test/Analysis/LoopAccessAnalysis/analyze-stride-access-depedences.ll
216–228 ↗(On Diff #26663)

MaxSafeDepDistance can be printed out in "-debug-only". I don't know how to print it properly. Also as my comments on MaxSafeDepDistance say, I think we should refactor MaxSafeDepDistance as it could prevent some vectorization opportunities.

This revision is now accepted and ready to land.May 31 2015, 11:55 PM
HaoLiu requested a review of this revision.Jun 1 2015, 3:16 AM
HaoLiu edited edge metadata.
HaoLiu removed a reviewer: HaoLiu.
anemet edited edge metadata.Jun 1 2015, 9:07 PM

I also found a bug that write after load to the same location like:

A[i] = a;
b = A[i];

is not looked as store-load forwarding currently. As this could affect the correctness, I fixed this with slight modification. 2 new test cases are added.

Why do you think that this is bug? In this case because the vectorized loads and stores are aligned with each other, there should be no problem for the memory unit to figure out store-to-load forwarding. See the big comment in MemoryDepChecker::couldPreventStoreLoadForward.

I also found a bug that write after load to the same location like:

A[i] = a;
b = A[i];

is not looked as store-load forwarding currently. As this could affect the correctness, I fixed this with slight modification. 2 new test cases are added.

Why do you think that this is bug? In this case because the vectorized loads and stores are aligned with each other, there should be no problem for the memory unit to figure out store-to-load forwarding. See the big comment in MemoryDepChecker::couldPreventStoreLoadForward.

To my understanding, if a load has to wait until the store is committed, it is store-load forwarding.
For the case:

A[i] = a;     (1)
b = A[i];     (2)

(2) has to waite (1) to be committed. It is store-load forwarding. After vectoring such case, it is still store-load forwarding.

From comment in couldPreventStoreLoadForward, I think it means to prevent generating store-load forwarding by vectorization. If that is ture. The difference for this case is that it is original store-load forwarding. Do you mean that we could vectorize the cases that are original store-load forwarding?

Thanks,
_hao

HaoLiu updated this revision to Diff 26955.Jun 2 2015, 12:31 AM
HaoLiu edited edge metadata.

Hi Adam,

I updated a new patch by removing the modification about StoreLoadForwarding.

Thanks,
-Hao

To my understanding, if a load has to wait until the store is committed, it is store-load forwarding.
For the case:

A[i] = a;     (1)
b = A[i];     (2)

(2) has to waite (1) to be committed. It is store-load forwarding. After vectoring such case, it is still store-load forwarding.

From comment in couldPreventStoreLoadForward, I think it means to prevent generating store-load forwarding by vectorization. If that is ture. The difference for this case is that it is original store-load forwarding. Do you mean that we could vectorize the cases that are original store-load forwarding?

The problem is not whether we can vectorize or not but because we can. It's whether the resulting vector store and load will be subject to the processor's store-to-load forwarding optimization. In this case the store does not have to get fully retired before the load could start executing (assuming an OOO core) but the memory unit will forward the value of the store to the load (and check whether there are intervening stores that make this invalid).

This is easy for the processor to figure out if both operations use the same address. However if they are only partially overlapping it's hard.

To my understanding, if a load has to wait until the store is committed, it is store-load forwarding.
For the case:

A[i] = a;     (1)
b = A[i];     (2)

(2) has to waite (1) to be committed. It is store-load forwarding. After vectoring such case, it is still store-load forwarding.

From comment in couldPreventStoreLoadForward, I think it means to prevent generating store-load forwarding by vectorization. If that is ture. The difference for this case is that it is original store-load forwarding. Do you mean that we could vectorize the cases that are original store-load forwarding?

The problem is not whether we can vectorize or not but because we can. It's whether the resulting vector store and load will be subject to the processor's store-to-load forwarding optimization. In this case the store does not have to get fully retired before the load could start executing (assuming an OOO core) but the memory unit will forward the value of the store to the load (and check whether there are intervening stores that make this invalid).

This is easy for the processor to figure out if both operations use the same address. However if they are only partially overlapping it's hard.

Sounds reasonable.

Anyway, I've removed the modification in last patch. Review please.

Thanks,
-Hao

Hi Adam,

Do you have any other concerns? If not, I'll commit this patch separately as you said.

I think it still has some potential problems, but we can improve it continously in the future. The commit of this patch is the first step. Then I can continue following work.

Thanks,
-Hao

anemet added inline comments.Jun 4 2015, 12:59 AM
lib/Analysis/LoopAccessAnalysis.cpp
696–717

Now, that I look at this again, the second covers the first case as well (i.e. superset). There is room to simplify it.

790–815

I didn't quite understand what you were saying here but looks like you didn't change the comment, so I guess you disagree that my example is better. Your example is vectorizable if miniter is 2 and so is mine, so I don't understand your reply.

test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll
343–344 ↗(On Diff #26955)

"It should be vectorizable."

Hi Adam,

I updated a new patch according to your comments. Also see my comments are inline.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
696–717

You are right, only need "ScaledDist % Stride".

790–815

Previously, I thought you just asked me a question about that case.

Sorry about the misleading.
Actually, your case is "indep", which returns early with "Dependence::NoDep". So it could not reach at this place.

That's why I use an example with dependences.

test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll
343–344 ↗(On Diff #26955)

Agree. Thanks.

rengolin edited edge metadata.Jun 4 2015, 1:40 AM

Hao, Adam, Michael,

I think this review has gone a bit too far for what it is. I agree there are still many points to cover and many issues to fix, but we should do them later, after the initial implementation is in. For what it is, a prototype for strided vectorization, this patch fits the bill. It's not on by default, it's still experimental and won't be used at -O3 or anything, so we're still safe in case it does miss-compile anything.

I'd like to thank Adam and Michael for the extensive comments, it's not easy to get passionate people to do code review. I'd also like to than Hao for his extreme patience with so many and so drastic code reviews! Finally thanks to Arnold and Nadav for the valuable input and discussions to get the first prototype working previously.

But right now, I think it's time we land this patch and continue the review on further patches.

Adam, if you're happy that Hao's last changes address your concerns, please approve the patch and let's start round 2 on a new review.

cheers,
--renato

HaoLiu updated this revision to Diff 27098.Jun 4 2015, 1:48 AM
HaoLiu edited edge metadata.

I think this review has gone a bit too far for what it is. I agree there are still many points to cover and many issues to fix, but we should do them later, after the initial implementation is in. For what it is, a prototype for strided vectorization, this patch fits the bill. It's not on by default, it's still experimental and won't be used at -O3 or anything, so we're still safe in case it does miss-compile anything.

I am very surprised by this sentiment. I would think that if reviews provide you with bugs, you'd be happy to fix them before the code going in. (We effectively performed the bugfixing for you.)

More importantly though, you're incorrect with your facts. Existing paths in LAA were also modified by this patch which affect things like the LoopVectorizer which is on by default.

You also seem to suggest that the changes requested were mostly minor or cosmetic. Clearly the placement of a rather expensive analysis is not a cosmetic change but a basic design decision that should be debated pre-commit.

There is extensive work happening on LAA, so wrong starts like this can be pretty disruptive for other people working on the same module.

Also as I am sure Hao can attest, DependenceAnalysis is a pretty complex topic so it takes time to actually think through the analysis and it's better to do this while we're all focused on the problem at hand. We caught multiple issues in this category and there were probably only one or two cosmetic comments.

I agree that the review lasted longer than usual but I don't see the main reason being our "drastic code reviews".

It would have been much easier if the patch was split up as I originally requested. The existing modularity allowed for this but the patch didn't take advantage of it. Also as we were finding correctness/design issues with the patch, Hao started improving the code (in actually very nice ways, thanks!!) which then needed new reviews again, etc.

We're also in different time zones which didn't help either but I feel like the community should be able to put up with this much slack.

Adam, if you're happy that Hao's last changes address your concerns, please approve the patch and let's start round 2 on a new review.

Yeah, the LAA parts look good now.

LAA parts LGTM.

lib/Analysis/LoopAccessAnalysis.cpp
790–815

Makes sense.

rengolin accepted this revision.Jun 4 2015, 1:25 PM
rengolin edited edge metadata.

I am very surprised by this sentiment. I would think that if reviews provide you with bugs, you'd be happy to fix them before the code going in. (We effectively performed the bugfixing for you.)

I think you got me wrong. I *am* very happy that you are reviewing with such attention to detail...

More importantly though, you're incorrect with your facts. Existing paths in LAA were also modified by this patch which affect things like the LoopVectorizer which is on by default.

...but I was also (wrongly) assuming that the patch was still self contained. I stand corrected.

I agree with you that in the LAA case we need to make sure the behaviour outside of this patch should go unmodified.

I agree that the review lasted longer than usual but I don't see the main reason being our "drastic code reviews".

I meant it in the good way.

It would have been much easier if the patch was split up as I originally requested. The existing modularity allowed for this but the patch didn't take advantage of it. Also as we were finding correctness/design issues with the patch, Hao started improving the code (in actually very nice ways, thanks!!) which then needed new reviews again, etc. We're also in different time zones which didn't help either but I feel like the community should be able to put up with this much slack.

I'm fully aware of that and none of it was the reason why I asked for a breakpoint. As you said, Hao could have split the patch earlier, but once it was still small and self contained, the other bits would have been lost. This patch was re-written dozens of times and almost none of the original assumptions were kept in the final version. With such a game changer, it's hard to predict the future.

I just interpreted the last reviews as specific to the stride vectorizer, when (as you say), they're generic to LAA. My sincere apologies.

Yeah, the LAA parts look good now.

Great! Round 2! :)

cheers,
--renato

This revision is now accepted and ready to land.Jun 4 2015, 1:25 PM
anemet added a comment.Jun 4 2015, 2:55 PM

I just interpreted the last reviews as specific to the stride vectorizer, when (as you say), they're generic to LAA. My sincere apologies.

OK, apologies accepted :), thanks for clarifying.

I think that part of the problem is that LAA as an independent pass still has some rough edges. Any help (Hao’s here and Silviu’s in the other thread) improving this situation is much appreciated.

Adam

mzolotukhin edited edge metadata.Jun 4 2015, 3:54 PM

Hi,

I’m glad we’re on the same page now. I’m really looking forward to seeing this feature, but I didn’t want to rush with it because it strongly defines directions for future development of the vectorizer. I believe that this code would be a base for many other optimizations (e.g. supporting masks), and that’s why it’s really important for it to be thoroughly reviewed. That’s also a reason for a lot of cosmetic changes I requested - when I see a place that’s harder to understand than it should be, I have to spend additional time on it, and that means that in future many other developers would have to spend their time as well. Also, such places disrupt from the main idea, which makes the review harder.

Thanks,
Michael

Hi Hao,

The patch LGTM with some minor notes. Thank you for the efforts, your work is very appreciated!

Michael

lib/Analysis/LoopAccessAnalysis.cpp
800–801

Something went wrong with the indentation here.

lib/Transforms/Vectorize/LoopVectorize.cpp
658

Could we rename Delta to InterleaveFactor or something like this. When roaming through the code it'd be much easier to understand what it stands for - especially when the code is in-tree and one doesn't see the context of the current patch.

681

s/Call this class/Use this class/

4221

static_cast isn't needed here.

4318–4322

What is the previous check you are referring to? And if it's guaranteed, could this if be replaced with an assertion?

4326

static_cast isn't necessary here (and if we want to cast everything getIndex should also be casted from unsigned).

test/Transforms/LoopVectorize/interleaved-accesses.ll
1 ↗(On Diff #27098)

Which of the tests needs -runtime-memory-check-threshold=24? Could we get rid of this flag by adding noalias attribute, or aliasing metadata?

39 ↗(On Diff #27098)

Please run the tests through opt -instnamer to replace %[0-9]+ names - they are really hard to deal with when one need to modify the test.

HaoLiu updated this revision to Diff 27175.Jun 4 2015, 10:44 PM
HaoLiu edited edge metadata.

I've updated the patch according to Michael's comments. My comments are inline.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
800–801

No, The indentation is intended here to show the distance and overlap between the accesses to array A and B.

lib/Transforms/Vectorize/LoopVectorize.cpp
658

Reasonable. Renamed.

681

Fixed.

4221

std::abs() returns a signed int. Comparing signed to unsigned will cause a warning from the compiler.

To avoid static_cast, I use a temperaral value for "std::abs(Stride)" in the new patch.

4318–4322

It can not be assertion. This comment is hard to be understood. I've replaced it with another comment, which describe the thing we do right here.

A case @interleaved_stores in stride-access-dependence.ll checks for this situation. So no need to clarify this again.

4326

No, the division between signed and unsigned can not be casted.
I tested with a small case:

int a = -4;
unsigned b = 4;
int c = a / b;

The result c is a very large numer (1073741823). Modulo is simialr, I once fixed a bug about "signed % unsigned" (didn't cast unsigned data) in SeparateConstOffsetFromGEP.cpp.

This is different from ADD/SUB.

test/Transforms/LoopVectorize/interleaved-accesses.ll
1 ↗(On Diff #27098)

We couldn't. Still too many runtime checks for accesses related to one pointer (Even though they are noalis).
E.g.

for( i = 0; i < 1024; i+=3) {
  A[i] += a;
  A[i+1] += b;
  A[i+2] += c
}

It has 3 loads and 3 stores, which need 11 runtime checks (greater than the current threshold 8). The 3 stores will be compared with each other (3 checks), and each load will be compared with the 3 stores (3 * 3 checks).

This could be improved in the future. No need to compare accesses in the same group. But currently, this is the work around to test.

39 ↗(On Diff #27098)

Renamed anonymous names. This is really a helpful command.

This revision was automatically updated to reflect the committed changes.

Committed in r239285 (Code about LoopAccessAnalysis) and r239291 (Code about LoopVectorize).

Thanks a lot about the all review work. Especially Renato, Michael and Adam, you are very kind and patient for such a long period of review.

For the following work, my colleagues are working on improving the run time memory check and type promotion problem in Loop Vectorizor. For myself, I have patches for AArch64 and ARM backend to match the generated IRs into ldN/stN instructions. I'll send out later.

Thanks,
-Hao