This is an archive of the discontinued LLVM Phabricator instance.

Differential D9368

[LoopVectorize] Teach Loop Vectorizer about interleaved memory access
ClosedPublic

Authored by • HaoLiu on Apr 30 2015, 4:50 AM.

Details

Reviewers

rengolin
anemet
mzolotukhin

Commits

rG751004a67de4: [LoopAccessAnalysis] Teach LAA to check the memory dependence between strided…
rL239285: [LoopAccessAnalysis] Teach LAA to check the memory dependence between strided…

Summary

Hi,

Earlier this month, I posted a patch to teach the Loop Vectorizer about interleaved data access in D8820. Following the code review comments, I've made a lot of changes. This new patch is attached. It identifies interleaved accesses and vectorizes them into "Loads/Stores + ShuffleVectors".
E.g. it can translate the following interleaved loads (if the vectorization factor is 4):

for (i = 0; i < N; i+=3) {
    R = Pic[i];       // Load R color elements
    G = Pic[i+1];     // Load G color elements
    B = Pic[i+2];     // Load B color elements
    ... // do something to R, G, B
}

Into

%wide.vec = load <12 x i32>, <12 x i32>* %ptr               ; load for R,G,B
%R.vec = shufflevector %wide.vec, undef, <0, 3, 6, 9>    ; mask for R load
%G.vec = shufflevector %wide.vec, undef, <1, 4, 7, 10>  ; mask for G load
%B.vec = shufflevector %wide.vec, undef, <2, 5, 8, 11>  ; mask for B load
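
The three load masks above follow one pattern: member `Index` of a group with interleave factor `Factor` takes elements `Index`, `Index + Factor`, `Index + 2 * Factor`, and so on. A minimal sketch of that pattern, assuming a hypothetical helper `createStridedMask` (an illustration only, not a function from the patch or from LLVM):

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, for illustration only.
// Builds the shuffle mask that selects member `Index` of an interleave
// group from the wide vector. `Factor` is the number of interleaved
// members (3 for R, G, B) and `VF` is the vectorization factor.
std::vector<unsigned> createStridedMask(unsigned Index, unsigned Factor,
                                        unsigned VF) {
  assert(Index < Factor && "member index must lie inside the group");
  std::vector<unsigned> Mask;
  Mask.reserve(VF);
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    Mask.push_back(Index + Lane * Factor);
  return Mask;
}
```

With `Factor = 3` and `VF = 4`, indices 0, 1 and 2 give the R, G and B masks shown in the IR.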

Or it can translate the following interleaved stores (if the vectorization factor is 4):

for (i = 0; i < N; i+=3) {
    ... // do something to R, G, B
    Pic[i] = R;       // Store R color elements
    Pic[i+1] = G;     // Store G color elements
    Pic[i+2] = B;     // Store B color elements
}

Into

%RG.vec = shufflevector %R.vec, %G.vec, <0, 1, 2, ..., 7>
%BU.vec = shufflevector %B.vec, undef, <0, 1, 2, 3, u, u, u, u>
%interleaved.vec = shufflevector %RG.vec, %BU.vec,
            <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>  ; mask for interleaved store
store <12 x i32> %interleaved.vec, <12 x i32>* %ptr         ; write for R,G,B
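
The interleaved-store mask can be derived the same way. After the member vectors are concatenated, lane `L` of member `M` sits at position `M * VF + L`, and the store wants all members of lane 0 first, then all members of lane 1, and so on. A small sketch, assuming a hypothetical helper `createInterleaveMask` (not taken from the patch):

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, for illustration only.
// Builds the mask that interleaves `Factor` member vectors of `VF` lanes
// each, assuming the members were concatenated in member order first.
std::vector<unsigned> createInterleaveMask(unsigned VF, unsigned Factor) {
  assert(VF > 0 && Factor > 0 && "need at least one lane and one member");
  std::vector<unsigned> Mask;
  Mask.reserve(VF * Factor);
  for (unsigned Lane = 0; Lane < VF; ++Lane)
    for (unsigned Member = 0; Member < Factor; ++Member)
      Mask.push_back(Member * VF + Lane);
  return Mask;
}
```

For `VF = 4` and `Factor = 3` this reproduces the mask in the IR above.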

This patch mainly does the following:

(1) Identify interleaved accesses. (As some situations cannot be covered currently, I've added a TODO.)
(2) Transform the identified interleaved accesses into ShuffleVectors and Loads/Stores.
(3) Add a new pass in the AArch64 backend to match interleaved loads/stores with stride 2, 3, or 4 to ldN/stN intrinsics.

I also added a new target hook to calculate the cost. (As I don't know much about other targets, I only estimated it roughly.) It can be improved to be more accurate.

For correctness, I've tested on the AArch64 target with LNT, EEMBC, SPEC2000, and SPEC2006, which all pass.
For performance, as there are other issues that could block many vectorization opportunities, I don't see obvious improvements. But some benchmarks like EEMBC.RGBcmy and EEMBC.RGByiq are expected to have huge improvements (6 time

Event Timeline

There are a very large number of changes, so older changes are hidden.

• HaoLiu added inline comments. May 18 2015, 2:49 AM

include/llvm/Analysis/LoopAccessAnalysis.h
372	Done.

rengolin added inline comments. May 18 2015, 5:56 AM

include/llvm/Analysis/LoopAccessAnalysis.h
400	Ouch, no! If you do this, then a perfectly valid sequence like this:

    group.eraseFromMap(Map);
    group.getDelta(); // there's no way to know that this won't work

will segfault. A quick fix would be to leave the deletion to whoever is calling the function, so the lifetime of the object is clear:

    group.eraseFromMap(Map);
    group.getDelta();
    delete group;
    group.getReverse(); // this is obviously wrong

But again, it seems to me that such logic is spread out too much. So a better fix would be to coalesce all of it into a class of its own, one that contains the maps, the groups, and the control over the lifetime of all its groups in a consistent way, without requiring the caller to know about it. I'm OK with the temporary solution, as long as you add a FIXME to that effect.
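
The "class of its own" idea can be sketched roughly as follows. This is only an illustration under assumed names (`Instr`, `Group`, `GroupRegistry` are hypothetical stand-ins, not the classes from the patch): the registry owns every group through `std::unique_ptr`, so callers never delete a group themselves.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <utility>

// Hypothetical stand-ins for the real instruction and group types.
struct Instr { int Id; };

class Group {
public:
  explicit Group(int Delta) : Delta(Delta) {}
  int getDelta() const { return Delta; }
private:
  int Delta;
};

// Owns all groups; a group lives exactly as long as its map entry.
class GroupRegistry {
public:
  // Creates a group keyed by its leader and returns a non-owning pointer.
  Group *createGroup(const Instr *Leader, int Delta) {
    assert(Leader && "a group needs a leader instruction");
    std::unique_ptr<Group> Owned(new Group(Delta));
    Group *Raw = Owned.get();
    Groups[Leader] = std::move(Owned);
    return Raw;
  }
  // Returns the group led by I, or nullptr if there is none.
  Group *lookup(const Instr *I) const {
    auto It = Groups.find(I);
    return It == Groups.end() ? nullptr : It->second.get();
  }
  // Removes the map entry and destroys the group in one step.
  void release(const Instr *Leader) { Groups.erase(Leader); }
  std::size_t size() const { return Groups.size(); }
private:
  std::map<const Instr *, std::unique_ptr<Group>> Groups;
};
```

Because erasing and destroying happen together inside `release`, the "erase from the map, then keep using the group" sequence is no longer something a caller can write by accident through the registry's interface.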

anemet added a subscriber: anemet. May 18 2015, 9:30 AM

anemet added inline comments.

include/llvm/Analysis/LoopAccessAnalysis.h
491–492	I've only started looking at this patch but I have a quick initial, more fundamental comment. You are modifying an analysis pass (LoopAccessAnalysis). Please add support for printing the new information gathered by the analysis. Please also add tests under test/Analysis/LoopAccessAnalysis to verify the new information. You may even want to separate out the patch for the analysis part (LAA + tests) from the transformation parts (LV + tests + etc).

• HaoLiu updated this revision to Diff 26044. May 19 2015, 3:36 AM

I've updated a new patch according to the comments from Renato and Adam. This patch adds a new class InterleavedAccessInfo to handle the analysis about interleaved accesses.

Review please.

Thanks,
-Hao

include/llvm/Analysis/LoopAccessAnalysis.h
400	That's reasonable. I've added a new class InterleavedAccessInfo to manage this.
491–492	Hi Adam, I've updated the patch to support printing the new information. I added more test cases as well. I didn't separate the patch, as there is already a lot of comment history on both parts. Also, I think the boundary between the analysis part and the transformation part is clear. Anyway, I can still separate this patch if you insist.

Hi Hao,

Please find some comments from me inline. I'll get back to review a bit later.

Also, I think it's a good idea to separate the part for LoopAccessAnalysis part into a separate patch. I think it's fine to keep it in one patch during review (to keep the history), but I'd suggest committing it separately when they are ready. It'll help tracking down any potential issues that could be exposed later.

Thanks,
Michael

include/llvm/Analysis/LoopAccessAnalysis.h
322	We require `Align!=0` here, but in the constructor below we explicitly set it to 0. Is one of these unused and could be removed?
492–493	Please fix the indentation here.
include/llvm/Analysis/TargetTransformInfo.h
448	s/vecotor/vector/
449–451	This comment seems to be outdated - there are no arguments `SubTy`, `Index`, or `Gap` in the declaration below.
include/llvm/Analysis/TargetTransformInfoImpl.h
306	Why is the cost `Delta`?
include/llvm/CodeGen/BasicTTIImpl.h
534	s/divied/divided/
546–547	Am I getting it correctly, that `VecTy` is the type of 'combined' vector? I.e. if we have a stride=2, VF=4, and ScalarTy=i32, then `VecTy` would be `<8 x i32>`, and `SubVT` would be `<4 x i32>`? I'm not sure I've completely understood this part, so please correct me if I'm wrong. If that's true, then the cost computation doesn't seem correct. Extracting the 1st or 2nd element from the wide vector isn't the same as extracting the odd or even elements.
551	Was it supposed to be `+=`?
558–564	What's the difference between computing costs for stores and loads?

• HaoLiu updated this revision to Diff 26123. May 19 2015, 11:16 PM

• HaoLiu updated this revision to Diff 26125. May 19 2015, 11:50 PM

Also, I think it's a good idea to separate the part for LoopAccessAnalysis part into a separate patch. I think it's fine to keep it in one patch during review (to keep the history), but I'd suggest committing it separately when they are ready. It'll help tracking down any potential issues that could be exposed later.

Michael, I agree with you and I'll separate the patch when committing.

I've refactored the patch according to comments from Michael.

Review please.

Thanks,
-Hao

include/llvm/Analysis/LoopAccessAnalysis.h
322	Removed the unused constructor.
492–493	Fixed.
include/llvm/Analysis/TargetTransformInfo.h
448	Done.
449–451	Fixed.
include/llvm/Analysis/TargetTransformInfoImpl.h
306	Previously I thought there were 'Delta' instructions, each costing 1. Since the default cost in all the other getXXXCost functions is 1, I've changed it to "1".
include/llvm/CodeGen/BasicTTIImpl.h
546–547	Yes, you are right. The cost is not only the "Extract" cost; it also includes the "Insert" cost. Say we do an interleaved load of two vectors:

    %vec = load <8 x i32>, <8 x i32>* %ptr
    %v0 = shuffle %vec, undef, <0, 2, 4, 6>
    %v1 = shuffle %vec, undef, <1, 3, 5, 7>

The cost consists of 2 parts:
(1) load of <8 x i32>
(2) extract %v0 and %v1 from %vec.
The cost of (2) is estimated as 2 parts:
(a) extract elements at: 0, 1, 2, 3, 4, 5, 6, 7
(b) insert elements 0, 2, 4, 6 into %v0 and insert elements 1, 3, 5, 7 into %v1
The interleaved store is similar.
551	Fixed.
558–564	Say we do an interleaved store of two vectors:

    %v0_v1 = shuffle %v0, %v1, <0, 4, 1, 5, 2, 6, 3, 7>
    store <8 x i32> %v0_v1, <8 x i32>* %ptr

It has two main differences:
(1) The direction is different. It first extracts elements from the two sub vectors and then inserts them into a wide vector (an interleaved load first extracts elements from a wide vector and then inserts them into sub vectors).
(2) The interleaved store doesn't allow gaps, which means we must have the member of index 0 (with even elements) and the member of index 1 (with odd elements). But an interleaved load allows gaps; it could load only the even elements, such as:

    %vec = load <8 x i32>, <8 x i32>* %ptr
    %v0 = shuffle <8 x i32> %vec, undef, <0, 2, 4, 6>

This is still beneficial and cheaper than scalar loads. I've refactored this function to add a new parameter of indices for interleaved load.
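
The cost estimate described in these replies (one wide memory operation plus an extract and an insert per moved element) can be sketched as below. Everything here is an assumption made for illustration: `UnitCosts`, `interleavedLoadCost` and `interleavedStoreCost` are hypothetical names, the unit costs are placeholders rather than real target numbers, and charging only for the members that are actually used in a gapped load is one possible reading of the "indices" parameter.

```cpp
#include <cassert>
#include <vector>

// Placeholder per-operation costs; real targets would supply their own.
struct UnitCosts {
  unsigned WideMemOp; // cost of the single wide load or store
  unsigned Extract;   // cost of extracting one element
  unsigned Insert;    // cost of inserting one element
};

// Load: one wide load, then for every used member (listed in Indices)
// extract VF elements from the wide vector and insert them into a sub vector.
unsigned interleavedLoadCost(unsigned VF, const std::vector<unsigned> &Indices,
                             const UnitCosts &C) {
  assert(!Indices.empty() && "a load group needs at least one member");
  unsigned NumElts = VF * static_cast<unsigned>(Indices.size());
  return C.WideMemOp + NumElts * (C.Extract + C.Insert);
}

// Store: no gaps are allowed, so all Factor members contribute VF elements,
// each extracted from a sub vector and inserted into the wide vector.
unsigned interleavedStoreCost(unsigned VF, unsigned Factor,
                              const UnitCosts &C) {
  assert(Factor > 0 && "a store group needs at least one member");
  unsigned NumElts = VF * Factor;
  return C.WideMemOp + NumElts * (C.Extract + C.Insert);
}
```

With all unit costs set to 1 and `VF = 4`, a full two-member load costs 17 under this sketch, while a load that only uses the even elements (indices `{0}`) costs 9.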

In D9368#175501, @HaoLiu wrote:

Michael, I agree with you and I'll separate the patch when committing.

Unless you can guarantee that trunk will work with just the first patch (ie. you have tested both equally), I don't think it's a good idea to split the patch. The analysis does nothing without the actual vectorisation, and vice versa.

It's a big patch, yes, but also this is a big change. I don't see what splitting the patches would gain us.

The patch now looks more like what we had planned in the beginning, so as long as all comments are addressed, and all tests pass, I'm happy.

Thank you for the effort of bringing this up again.

cheers,
--renato

In D9368#175547, @rengolin wrote:

In D9368#175501, @HaoLiu wrote:

Michael, I agree with you and I'll separate the patch when committing.

Unless you can guarantee that trunk will work with just the first patch (ie. you have tested both equally), I don't think it's a good idea to split the patch. The analysis does nothing without the actual vectorisation, and vice versa.

It's a big patch, yes, but also this is a big change. I don't see what splitting the patches would gain us.

The patch now looks more like what we had planned in the beginning, so as long as all comments are addressed, and all tests pass, I'm happy.

Thank you for the effort of bringing this up again.

cheers,
--renato

Renato, I also agree with you. I think committing separately as two patches and committing as a whole patch are similar, with no big difference.

Thanks,
-Hao

Hi Hao,

Thanks for updating the patch, another bunch of comments from me inline.

Also, sorry that I give the feedback in parts - it's hard to find enough time to review the entire patch all at once. I know that it might be a bit frustrating sometimes, but in fact I really appreciate your work and see how things converge:)

Thanks,
Michael

include/llvm/Analysis/TargetTransformInfo.h
447–457	Nitpick: `Alignment` and `AddressSpace` aren't documented.
lib/Analysis/LoopAccessAnalysis.cpp
999–1007	Please add some comments on this struct and its members.
1042–1043	Using `empty()` is more efficient than `size()==0`.
1132–1133	Could we use a range loop here?
1136	Range-based loop?
1140–1141	Why not return right after we discovered the block needs predication?
1156–1157	What's the rationale behind this? It might be inefficient to vectorize a group with a huge gap, but isn't it a question for cost-model rather than for the analysis?
1199–1207	Could we just check if `PosA` dominates `PosB`?

• HaoLiu updated this revision to Diff 26206. May 21 2015, 12:54 AM

In D9368#176086, @mzolotukhin wrote:

Hi Hao,

Thanks for updating the patch, another bunch of comments from me inline.

Also, sorry that I give the feedback in parts - it's hard to find enough time to review the entire patch all at once. I know that it might be a bit frustrating sometimes, but in fact I really appreciate your work and see how things converge:)

Thanks,
Michael

Hi Michael,

Thanks for your comments, which really help me a lot. I've updated a new patch accordingly.

Review please.

Thanks,
-Hao

include/llvm/Analysis/TargetTransformInfo.h
447–457	Done.
lib/Analysis/LoopAccessAnalysis.cpp
999–1007	Done.
1042–1043	Good idea!
1132–1133	Done.
1136	Done.
1140–1141	It's slightly different. We still support a predicated block if it doesn't have loads/stores, because the vectorization of interleaved loads/stores won't break any dependences. But if a predicated block has loads/stores, we would need to handle an interleave group that contains a mix of normal and predicated loads/stores.
1156–1157	This is just a conservative approach. I think that generally, if a group is missing more than half of its members, it is not very beneficial to vectorize it. It also doesn't sound reasonable to still call it an "interleaved" group. But I agree with you that it's the cost model's job to decide whether it is beneficial. The new patch keeps all load groups. There is also a new situation: the cost of a load group with a huge gap may be even more expensive than the scalar operations. In that case I think we need to replace it with scalar operations. I've added a FIXME in the new patch as I haven't found an efficient way to fix this.
1199–1207	A good idea!

rengolin added inline comments. May 21 2015, 2:18 AM

lib/Analysis/LoopAccessAnalysis.cpp
1140–1141	This might be a case for indexed loads on AVX, but for now, let's keep it simple. :)
1156–1157	I also agree that the cost model should get this right, but currently, there is no way it can do that. However, since the interleaved vectorisation is not enabled by default, I think keeping all the groups and fixing the cost model later is the right thing to do.

Thanks Hao,

I think we're good to go. Michael, any more concerns?

Once we land this in, it'll be easier to test it and fix the issues created by this patch without affecting the rest.

cheers,
--renato

This revision is now accepted and ready to land. May 21 2015, 2:22 AM

In D9368#174895, @HaoLiu wrote:

I've updated a new patch according to the comments from Renato and Adam. This patch adds a new class InterleavedAccessInfo to handle the analysis about interleaved accesses.

Thanks, Hao. I'll look at this today.

Adam

Hi Hao,

Please find more remarks from me inline, and I think this is the last bunch of comments from me. With them properly addressed I'm fine with committing the patch. However, I'd also like to hear from Adam regarding the LoopAccessAnalysis part, since he's been working actively on it.

And again, thanks for working on this, it's a much-needed feature!

Michael

include/llvm/Analysis/LoopAccessAnalysis.h
405	Should it be `unsigned`?
lib/Analysis/LoopAccessAnalysis.cpp
1069	I think you missed my question earlier on this part. This algorithm is clearly quadratic, which is dangerous for compile time. We need to either make it linear, or add some limits (e.g. if number of accesses > 100, give up and don't even try to group them). Also, if we separate it into two phases: 1) grouping accesses, 2) finding write-after-write pairs, I think we can make it more efficient. Since it's sufficient to compare against any access from a group, we'll be quadratic in the number of groups, not the number of accesses. And when we look for WaW pairs, we can look only inside a group, not across all accesses in the loop. Does it sound reasonable, or am I missing something?
lib/Transforms/Vectorize/LoopVectorize.cpp
926	This change is unnecessary.
1904	Nitpick: s/I.E/I.e./
1984–1985	Maybe it's worth mentioning here that though `ConcatenateTwoVectors` could extend a vector with UNDEFs, that can only happen with the last vector in the set. The current implementation guarantees that, but as the code evolves in the future it could easily be violated if it's not stated clearly. Maybe an assertion would be even better here (e.g. if it's not the last pair of vectors, then their sizes should be the same), but it might be overkill.
1992	Could we just assign `TmpList` to `VecList`?
2051	Should we iterate to `VF` instead of `UF` here? `PtrParts` contains `VF` elements, right?
2063–2064	I'm not sure it's correct if we vectorize and unroll the loop. Don't we need to adjust the pointer according to which unrolling-part we are at (instead of using `0`)?
2085	Name `CallI` is very misleading.
2085–2086	Name `CallI` is very misleading.
2132–2134	Could you please provide an example when this triggers? I'm a bit concerned about legality of this cast.
2147–2148	The name is misleading.
5257	s/even expensive/even more expensive/
test/Transforms/LoopVectorize/interleaved-accesses.ll
2	Is it possible to use `noalias` attributes for arguments and drop `-runtime-memory-check-threshold=24`?

Hi Hao,

Thinking about this a bit more, why are we collecting the InterLeaveGroups in LAA? LAA provides loop dependence information and computing this additional, unrelated information comes at a cost. Clients other than LV would pay for this without using it. I think that the InterLeaveGroups analysis should probably live in LV.

What do you think? Let me know if I am missing something from an earlier discussion.

Adam

• HaoLiu updated this revision to Diff 26308. May 21 2015, 11:20 PM

• HaoLiu edited edge metadata.

I've updated a new patch refactored according to Michael's comments.

Review please.

Thanks,
-Hao

In D9368#176830, @mzolotukhin wrote:

Hi Hao,

Please find more remarks from me inline, and I think this is the last bunch of comments from me. With them properly addressed I'm fine with committing the patch. However, I'd also like to hear from Adam regarding the LoopAccessAnalysis part, since he's been working actively on it.

And again, thanks for working on this, it's a much-needed feature!

Michael

lib/Analysis/LoopAccessAnalysis.cpp
1069	For the first question: previously, I thought you meant improving the insert member method. Anyway, I think the algorithm is quadratic; the paper says:

    This relies on the pivoting order in which the pairs of accesses are processed and reduces the complexity of the grouping algorithm: we iterate over O(n^2) pairs of loads and stores (where n is the number of loads and stores)

But I think we don't need to worry about compile time:
(1) The "n" is only the number of strided loads/stores (|Stride| > 1).
(2) A case with a lot of loads and stores possibly cannot pass the previous dependence check, because it could have a lot of runtime memory checks or a memory conflict. So it returns early.
On the other hand, if a loop does pass the dependence check, say with 100 loads, we should analyze the interleaved accesses, as vectorizing the possible interleave groups can really give us a lot of benefit.
1069	For the second question: I think it is slightly different. We should not only consider group against group; a group and a single store may also break a dependence. Say,

    A[i] = a;   // (1)
    B[i] = b;   // (2)
    A[i+1] = c; // (3)

Combining (1) and (3) may break the dependence on the "A[i] - B[i]" pair even though B[i] is not in a group. So I think we should search all the write-write pairs. If we separate into two phases, we'll have to calculate the distance again, which I think may increase the compile time.
lib/Transforms/Vectorize/LoopVectorize.cpp
2051	Slightly different. 'PtrParts' contains UF vectors, and each vector has VF elements. Here, we first want the pointer vector of the current unroll part, i.e. 'PtrParts[Part]'. Then for each pointer vector 'PtrParts[Part]', we want the pointer in lane 0. E.g.

    for (unsigned i = 0; i < 1024; i+=2) {
        A[i] = i;
    }

If VF is 4 and UF is 2, we'll have two unroll parts:

    part 1: A[0] = 0; A[2] = 2; A[4] = 4; A[6] = 6;
    part 2: A[8] = 8; ...

For this case, 'PtrParts' contains two vectors:

    Vector 1 consists of pointers to: A[0], A[2], A[4], A[6]
    Vector 2 consists of pointers to: A[8], A[10], A[12], A[14]

Then we extract lane 0 from Vector 1 to get a pointer to A[0], so that we can use it to load A[0-7], and after that we can extract A[0, 2, 4, 6] from A[0-7]. If the loads/stores are reversed, we need the pointer in the last lane. E.g. the access sequence will be "A[1022], A[1020], A[1018], A[1016]", and the last lane is the pointer to A[1016]. We can load A[1016-1023] to extract A[1016, 1018, 1020, 1022].
2063–2064	Yes, we do. As the above comment says, we get vectors consisting of pointers for all unroll parts.
2132–2134	A test case called @int_float_struct tests the interleaved load group:

    struct IntFloat {
        int a;
        float b;
    };
    void int_float_struct(struct IntFloat *A) {
        int SumA = 0;
        float SumB = 0;
        for (unsigned i = 0; i < 1024; i++) {
            SumA += A[i].a;
            SumB += A[i].b;
        }
        SA = SumA;
        SB = SumB;
    }

You can find this case in "interleaved-accesses.ll". The int-float can be long-double, or any pointer types, as long as the structure has elements of the same size. Actually I have a better case:

    struct IntFloat {
        int a;
        float b;
    };
    void int_float_struct(struct IntFloat *A, struct IntFloat *B) {
        for (unsigned i = 0; i < 1024; i++) {
            B[i].a = A[i].a + 1;
            B[i].b = A[i].b + 1.0;
        }
    }

Unfortunately, this case cannot be vectorized as it cannot pass the dependence check, because we have the following code in isDependent() (LoopAccessAnalysis.cpp):

    if (ATy != BTy) {
        DEBUG(dbgs() << "LAA: ReadWrite-Write positive dependency with different types\n");
        return Dependence::Unknown;
    }

I think this is a bug. We don't need to check types, as we check dependences based on the memory object size. After removing this code I don't see any failures in benchmarks. Anyway, because of this code, I cannot give a test for an interleaved write group.
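
The lane-0 explanation given for line 2051 can be restated in terms of element indices instead of pointers. The sketch below is an illustration only; `wideAccessBase` is a hypothetical helper, and it assumes the accesses of the loop start at index `Start` and move by `Stride` elements per iteration (downwards when `Reverse` is set).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, for illustration only.
// Returns the element index at which the wide access for unroll part `Part`
// begins. Forward loops use lane 0 of the part; reverse loops use the last
// lane, because that lane holds the lowest address.
std::int64_t wideAccessBase(std::int64_t Start, std::int64_t Stride,
                            unsigned VF, unsigned Part, bool Reverse) {
  assert(VF > 0 && Stride > 0 && "expects a positive VF and stride");
  std::int64_t FirstLane =
      static_cast<std::int64_t>(Part) * static_cast<std::int64_t>(VF);
  if (!Reverse)
    return Start + FirstLane * Stride;
  std::int64_t LastLane = FirstLane + static_cast<std::int64_t>(VF) - 1;
  return Start - LastLane * Stride;
}
```

For the forward example (stride 2, VF 4) the two unroll parts start at A[0] and A[8]; for the reverse example starting at A[1022], the first part starts at A[1016].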

Thinking about this a bit more, why are we collecting the InterLeaveGroups in LAA? LAA provides loop dependence information and computing this additional, unrelated information comes at a cost. Clients other than LV would pay for this without using it. I think that the InterLeaveGroups analysis should probably live in LV.

What do you think? Let me know if I am missing something from an earlier discussion.

Adam

Hi Adam,

It's easy to move such analysis to LV, but I think it reasonable to analyze interleaved accesses in LoopAccessAnalysis:

(1) Regarding the name "LoopAccessAnalysis", which is responsible for analyzing accesses in a loop. The interleaved access group is a special kind of access in loop.
(2) For clients other than LV, it could disable or enable such analysis. Or we could disable it by default and LV calls function like "analyzeInterleaving()" to do analysis. To achieve this is easy.
(3) For other clients, I think we may also need to modify the memory dependence analysis, which is now dedicated to analyze dependences for loop vectorizer.

Thanks,
-Hao

• HaoLiu added inline comments. May 22 2015, 12:47 AM

test/Transforms/LoopVectorize/interleaved-accesses.ll
2	Hi Michael, do you mean adding "-scoped-noalias"? I tried it but it doesn't work. I think maybe the runtime check is different from normal alias analysis.

In D9368#176907, @HaoLiu wrote:
It's easy to move such analysis to LV, but I think it reasonable to analyze interleaved accesses in LoopAccessAnalysis:
(1) Regarding the name "LoopAccessAnalysis", which is responsible for analyzing accesses in a loop. The interleaved access group is a special kind of access in loop.

True, I named it LAA because it’s a parallel dependence analysis framework to DependenceAnalysis but the focus is the same: dependence analysis plus run-time alias check generation for may-alias accesses.

I don’t have a problem adding things to the analysis that are free to compute even if it’s only used by a single client of the analysis pass (consider Ashutosh’s recent StoreToInvariantAddress changes). Here however we’re adding a potentially costly new analysis only required by a single client.

(2) For clients other than LV, it could disable or enable such analysis. Or we could disable it by default and LV calls function like "analyzeInterleaving()" to do analysis. To achieve this is easy.

Not really. This is an analysis pass, transform passes may depend on it and the pipeline manager will run this pass depending on transform pass requirements and possibly rerun it if it got invalidated by prior transformations.

As an example, consider this scenario. LAA is performed because of Loop Distribution. We end up not distributing the loop so the result of the analysis is intact and won’t be rerun. Then comes LV but the analysis is lacking the interleaved access analysis part.

Admittedly, we currently have a hack in LAA with regard to symbolic strides which would be similar to what you’re proposing. That however is more of shortcoming of how LAA was split out from LV rather than an example to follow.

(3) For other clients, I think we may also need to modify the memory dependence analysis, which is now dedicated to analyze dependences for loop vectorizer.

I am not sure I understand this point. Can you please elaborate?

Thanks,
Adam

Thanks,
-Hao

mzolotukhin added inline comments. May 22 2015, 9:27 AM

test/Transforms/LoopVectorize/interleaved-accesses.ll
2	No, I meant changing the declaration from this:

    define void @foo(%struct.ST2* nocapture readonly %A, %struct.ST2* nocapture %B)

to

    define void @foo(%struct.ST2* noalias nocapture readonly %A, %struct.ST2* noalias nocapture %B)

It'll tell that `A` and `B` don't alias, and thus we won't need runtime checks for them.

mzolotukhin added inline comments. May 22 2015, 11:40 AM

lib/Analysis/LoopAccessAnalysis.cpp
1069	Hm.. I'm reading the paper from here: http://domino.research.ibm.com/library/cyberdig.nsf/papers/EFD6206D4D9A9186852570D2005D9373/$File/H-0235.pdf And it says:

    This relies on the pivoting order in which pairs of accesses are processed, and implies that the complexity of the grouping algorithm is linear in the number of loads and stores.

I'm not sure if it's safe to assume that the dependence check won't be passed. E.g. we might have the following case:

    struct A {
        int x, y;
    };
    struct A points[100];
    for (int iter = 0; iter < 10000; iter++) { // not to be unrolled
        for (int i = 0; i < 100; i++) { // this loop will be completely unrolled
            points[i].x++;
        }
        // some other computations on the points
    }

When the innermost loop is unrolled, we get 100 interleaving accesses, which will require 10^4 pairwise comparisons. I agree that we theoretically can optimize it better if we are not limited on time. However, if a programmer decides to increase the number of points by a factor of 10, the compile time might slow down 100x, which won't be acceptable. I think we have a lot of optimizations that could theoretically find the best solution if they were allowed to consume an unlimited amount of time; however, we usually have some thresholds for them (such thresholds often don't prevent one from catching the cases the optimization is aimed at, but they guard against rare but very nasty cases when the compiler might work for hours).

For the second question, my understanding is that we should have two groups: {A[i], A[i+1]} and {B[i]}. Then we check if an access from the first group can alias with an access from the second group - I think we don't have to check every access in a group for this kind of reasoning. I.e. if the leaders of the groups might alias, then any member of the first group can alias with any member of the second group (and if the leaders don't alias, then we can probably assume that the groups are totally independent of each other). Am I missing something here?
lib/Transforms/Vectorize/LoopVectorize.cpp
2051	Thanks for the explanation, sounds good.
2132–2134	It makes sense, thanks for the explanation!
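
The write-after-write concern raised for line 1069 (stores (1) A[i], (2) B[i], (3) A[i+1]) can be made concrete with a tiny simulation. It assumes, purely for illustration, that the combined store for (1) and (3) is emitted at the position of (3), i.e. after (2); the function names and the `Stored` struct are hypothetical.

```cpp
#include <cassert>

// Final contents of A[i] and A[i+1] after the three stores.
struct Stored { int A0; int A1; };

// Original program order: (1) A[i] = a; (2) B[i] = b; (3) A[i+1] = c.
Stored runOriginalOrder(bool BAliasesA, int a, int b, int c) {
  int StorageA[2] = {0, 0};
  int StorageB[2] = {0, 0};
  int *A = StorageA;
  int *B = BAliasesA ? StorageA : StorageB;
  A[0] = a; // (1)
  B[0] = b; // (2)
  A[1] = c; // (3)
  return {A[0], A[1]};
}

// Grouped order: (1) and (3) become one combined store placed after (2).
Stored runGroupedOrder(bool BAliasesA, int a, int b, int c) {
  int StorageA[2] = {0, 0};
  int StorageB[2] = {0, 0};
  int *A = StorageA;
  int *B = BAliasesA ? StorageA : StorageB;
  B[0] = b; // (2)
  A[0] = a; // (1), now executed after (2)
  A[1] = c; // (3)
  return {A[0], A[1]};
}
```

When `B` does not alias `A`, both orders leave the same values in `A`. When `B[i]` and `A[i]` are the same location, the original order ends with `b` in `A[i]` while the grouped order ends with `a`, which is the broken dependence the thread is discussing.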

In D9368#177057, @anemet wrote:
In D9368#176907, @HaoLiu wrote:
It's easy to move such analysis to LV, but I think it reasonable to analyze interleaved accesses in LoopAccessAnalysis:
(1) Regarding the name "LoopAccessAnalysis", which is responsible for analyzing accesses in a loop. The interleaved access group is a special kind of access in loop.
True, I named it LAA because it’s a parallel dependence analysis framework to DependenceAnalysis but the focus is the same: dependence analysis plus run-time alias check generation for may-alias accesses.

I don’t have a problem adding things to the analysis that are free to compute even if it’s only used by a single client of the analysis pass (consider Ashutosh’s recent StoreToInvariantAddress changes). Here however we’re adding a potentially costly new analysis only required by a single client.
(2) For clients other than LV, it could disable or enable such analysis. Or we could disable it by default and LV calls function like "analyzeInterleaving()" to do analysis. To achieve this is easy.
Not really. This is an analysis pass, transform passes may depend on it and the pipeline manager will run this pass depending on transform pass requirements and possibly rerun it if it got invalidated by prior transformations.

As an example, consider this scenario. LAA is performed because of Loop Distribution. We end up not distributing the loop so the result of the analysis is intact and won’t be rerun. Then comes LV but the analysis is lacking the interleaved access analysis part.

Admittedly, we currently have a hack in LAA with regard to symbolic strides which would be similar to what you’re proposing. That however is more of shortcoming of how LAA was split out from LV rather than an example to follow.
(3) For other clients, I think we may also need to modify the memory dependence analysis, which is now dedicated to analyze dependences for loop vectorizer.
I am not sure I understand this point. Can you please elaborate?

I previously thought LoopAccessAnalysis is specific for LV, because there are a lot of LV specific data like C

Hi Adam,

I think LoopAccessAnalysis is a specific analysis for LV, so it's reasonable to put interleaved access analysis in an analysis specific for LV. If LoopAccessAnalysis is not specific for LV, I agree that we should not put it here. Currently I have no idea how other clients could use such interleave information.

But I still wonder how "other" clients use this analysis, as it has a lot of LV-specific code like "CanVecMem, force-vector-width, force-vector-interleave", .... Adam, do you have any example of other clients using LoopAccessAnalysis?

Thanks
-Hao

• HaoLiu added inline comments.May 24 2015, 7:43 AM

lib/Analysis/LoopAccessAnalysis.cpp
1069	I agree with you. I could also find many similar thresholds. I'll add such limitation to the patch. I think you are right for the second question. I'll try to implement this in the next patch.

In D9368#177506, @HaoLiu wrote:

I think LoopAccessAnalysis is a specific analysis for LV, so it's reasonable to put interleaved access analysis in an analysis specific for LV. If LoopAccessAnalysis is not specific for LV, I agree that we should not put it here. Currently I don't have idea about how other clients could use such interleave information.

But I still wonder how "other" clients use this analysis, as it has a lot of LV specific code like "CanVecMem, force-vector-width, force-vector-interleave", .... I wonder how "other" clients could use such analysis. Adam, do you have any example about other clients using LoopAccessAnalysis?

Sure, I already mentioned both clients: Transforms/Scalar/LoopDistribute.cpp and the still pending Ashutosh's Loop versioning pass in D9151.

Regarding your first specific question on CanVecMem, the basic design principle for splitting out LAA from LV was to continue to provide the high-level LV-specific answers (i.e. CanVecMem) but also provide more lower-level/generic information about the dependences and run-time checks so that other passes could also use them (see APIs like: getInterestingDependences, getMemoryInstructions, getRuntimeChecks and getInstructionsForAccess).

Your other question was about force-vector-width, etc. These influence what dependences are acceptable for LV for its specific view of the dependence information. The solution here was to provide subtypes like *Vectorizable (see Dependence::DepType). These "subtypes" provide further classification of the dependence. LV uses these to allow vectorization of certain Backward dependences.

Loop Distribution does not use this classification since it only cares about forward vs. backward dependences.

Thus the difference is exposed to the clients and they can choose to treat these types differently or uniformly.

Adam
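
To make the difference between the two clients concrete, here is a minimal stand-alone sketch. It is not the LLVM source: the enumerator list is an assumption modeled on the Dependence::DepType classification mentioned above, and treating Unknown as a backward dependence is a conservative assumption as well. Both helper functions are illustrative only.

```cpp
// Simplified model of a dependence classification.
// The enumerator names are assumed for illustration.
enum class DepType {
  NoDep,
  Unknown,
  Forward,
  ForwardButPreventsForwarding,
  Backward,
  BackwardVectorizable,
  BackwardVectorizableButPreventsForwarding
};

// Vectorizer-style view: the *Vectorizable subtype lets a client accept
// some backward dependences that would otherwise block vectorization.
constexpr bool isSafeForVectorization(DepType T) {
  return T == DepType::NoDep || T == DepType::Forward ||
         T == DepType::BackwardVectorizable;
}

// Distribution-style view: only the direction matters, so every subtype
// that is not known to be forward (or absent) is treated as backward.
constexpr bool isBackward(DepType T) {
  return !(T == DepType::NoDep || T == DepType::Forward ||
           T == DepType::ForwardButPreventsForwarding);
}
```

With this model, BackwardVectorizable is acceptable to the first client while the second client still sees it as a plain backward dependence, which is the "differently or uniformly" choice described above.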

I see. That makes sense.

Thanks,
-Hao

• HaoLiu updated this revision to Diff 26419.May 25 2015, 3:23 AM

Uploaded a new patch, refactored according to Michael's and Adam's comments.
There are two main changes:

(1) Add a threshold MaxInterleaveStride to avoid analyzing accesses with too large stride.
(2) Move the code about analyzing interleaved accesses in LoopAccessAnalysis to LoopVectorizer.

Review please.

Thanks
-Hao

test/Transforms/LoopVectorize/interleaved-accesses.ll
2	I tried this, but it still does not work for some cases like @test_struct_store4. It is a known problem with our runtime memory checks: they also check the dependences within the same array. E.g. for (i = 0; i < n; i+=4) { A[i] = a; A[i+1] = b; A[i+2] = c; A[i+3] = d; } A lot of runtime checks will be generated for pairs in {A[i], A[i+1], A[i+2], A[i+3]}, which exceeds the threshold.

• HaoLiu updated this object.May 26 2015, 12:05 AM

• HaoLiu edited the test plan for this revision.

• HaoLiu removed reviewers: rengolin, ab, t.p.northover, delena, hfinkel, aschwaighofer.

• HaoLiu updated this revision to Diff 26502.May 26 2015, 12:05 AM

• HaoLiu changed the visibility from "Public (No Login Required)" to "Unknown Object (????) (You do not have permission to view policy details.)".

• HaoLiu changed the edit policy from "All Users" to "Unknown Object (????) (You do not have permission to view policy details.)".

• HaoLiu removed subscribers: anemet, mzolotukhin, aemerson, Unknown Object (MLST).

This revision now requires review to proceed.May 26 2015, 12:05 AM

I hit a bug in Phabricator: it automatically removed all reviewers when I uploaded a new patch. I've now added the reviewers back.

Thanks,
-Hao

• HaoLiu added a subscriber: Unknown Object (MLST).May 26 2015, 12:10 AM

anemet added inline comments.May 26 2015, 4:28 PM

lib/Analysis/LoopAccessAnalysis.cpp
834–835	Please split this part of the change out, since it is a cleanup, and commit it separately (and update this patch so it is easier to read).
838–840	I think that these two cases are quite different. The first is essentially proving that accesses like A[2i] and A[2i + 1] are independent. This is true in general, not just for vectorization. Thus we should be returning NoDep rather than BackwardVectorizable. Please also add more comments/examples; it took me a long time to figure this out (hopefully I did in the end). Please also add LAA tests.
871–896	I am not sure I follow why you change the structure of the MaxSafeDepDistBytes logic here. Why aren't you simply comparing MaxSafeDepDistBytes with TypeByteSize * NumIter * Stride? Looks like this will modify the existing behavior for stride=1 which is probably not what you intend.

• HaoLiu updated this revision to Diff 26584.May 27 2015, 3:52 AM

• HaoLiu changed the visibility from "All Users" to "Public (No Login Required)".

Uploaded a new patch that refactors the dependence check on strided accesses in LoopAccessAnalysis.

It's difficult to understand the logic about dependences, so I added many comments to describe it.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
834–835	Hi Adam, I agree. I'll do that when committing the patch.
838–840	Hi Adam, you are right. I've uploaded a new patch that refactors this logic. I added a lot of new test cases as well.
871–896	I think the previous logic is not correct. If "Distance > MaxSafeDepDistBytes", it is not safe to do vectorization.

• HaoLiu updated this revision to Diff 26585.May 27 2015, 4:10 AM

• HaoLiu updated this revision to Diff 26663.May 27 2015, 11:13 PM

Uploaded a new patch with slight modifications.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
871–896	Sorry, I previously misunderstood MaxSafeDepDistBytes. The new patch follows the meaning of MaxSafeDepDistBytes and keeps the same logic for stride = 1. I also found a problem with MaxSafeDepDistBytes: it cannot handle loops that access different element types, like: void foo(int *A, char *B) { for (unsigned i = 0; i < 1024; i++) { A[i+2] = A[i] + 1; B[i+2] = B[i] + 1; } } I think we should use MaxFactor, which stands for the maximum number of iterations to be vectorized & unrolled. I've added a FIXME in the patch.
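
One way to read the FIXME above is sketched below. This is an interpretation, not code from the patch: the `Dep` record and both helper functions are hypothetical names, and the element sizes (4 bytes for int, 1 byte for char) are assumptions. The point is that a single bound kept in bytes mixes units once element sizes differ, whereas a bound kept in iterations does not.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical record describing one loop-carried dependence.
struct Dep {
  unsigned DistanceBytes; // dependence distance in bytes
  unsigned TypeByteSize;  // size of the accessed element in bytes
};

// Byte-based bound: the smallest distance in bytes over all dependences,
// which is how a single MaxSafeDepDistBytes-style value behaves.
inline unsigned maxSafeDistBytes(Dep A, Dep B) {
  return std::min(A.DistanceBytes, B.DistanceBytes);
}

// Iteration-based bound (the "MaxFactor" idea): each distance is first
// converted to a number of iterations, then the minimum is taken.
inline unsigned maxFactor(Dep A, Dep B) {
  assert(A.TypeByteSize > 0 && B.TypeByteSize > 0 && "element size must be non-zero");
  return std::min(A.DistanceBytes / A.TypeByteSize,
                  B.DistanceBytes / B.TypeByteSize);
}
```

For the loop in the comment, A[i+2] = A[i] + 1 gives a distance of 8 bytes with 4-byte elements and B[i+2] = B[i] + 1 gives 2 bytes with 1-byte elements. The byte bound is 2, which is less than one int element, while the iteration bound is 2 for both arrays.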

anemet added inline comments.May 29 2015, 12:53 PM

lib/Analysis/LoopAccessAnalysis.cpp
684–705	I don't think this is GCD. Consider the values Stride=4 and ScaledDistance=6: 4 * i = 4 * j + 6 can not be valid for any integer values of i and j. I think the Diophantine equation is this: ScaledDistance/Stride = (i - j) where i - j has to be integer. So the two accesses can only reference the same element if ScaledDistance is a multiple of Stride.
713	areStridedAccessIndependent is the conforming name (functions are lower-case). Also note the 'd' at the end of Strided. Same for the previous function.
722–740	If you want to handle this case I prefer we come back to it later as a separate LAA-only patch. Let's focus on the basics here that's needed for LV to handle the basic cases.
837	vectorized is misspelled
840	MinDistanceNeeded is probably a better name.
840–865	I think this is correct, but I wonder whether the example would be less contrived if you used: for (i = 0; i < 1024; i+= 3) A[i + 4] = A[i] + 1 MinSizeNeeded is 4 * 3 * (2 - 1) + 4 = 16 which is equal to the distance. Also a nit: most or all of this comment explains why MinSizeNeeded is computed the way it is, so the comment should come before the computation.
test/Analysis/LoopAccessAnalysis/analyze-stride-access-depedences.ll
216–228 ↗	(On Diff #26663)	As a follow-on patch, it would probably be a good idea to print MaxSafeDepDistance in LAI::analyze and check its value here.
224–228 ↗	(On Diff #26663)	Use CHECK-NEXTs where you can. Same later.
331 ↗	(On Diff #26663)	should look like?
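
The arithmetic in the 840–865 comment can be written as a small helper. This is a sketch under stated assumptions, not the patch itself: the function name follows the suggested rename to MinDistanceNeeded, and the parameter roles are inferred from the worked numbers (4 = element size in bytes, 3 = stride, 2 = minimum number of iterations to vectorize).

```cpp
#include <cassert>

// Minimum dependence distance, in bytes, needed to run MinNumIter
// iterations together: (MinNumIter - 1) full strides between the first
// and last iteration, plus the element touched by the last iteration.
inline unsigned minDistanceNeeded(unsigned TypeByteSize, unsigned Stride,
                                  unsigned MinNumIter) {
  assert(TypeByteSize > 0 && Stride > 0 && MinNumIter > 0 &&
         "all inputs must be non-zero");
  return TypeByteSize * Stride * (MinNumIter - 1) + TypeByteSize;
}
```

For example, minDistanceNeeded(4, 3, 2) evaluates to 4 * 3 * (2 - 1) + 4 = 16, matching the number quoted in the comment.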

• HaoLiu updated this revision to Diff 26876.May 31 2015, 11:52 PM

Refactored the patch according to Adam's comments.

I also found a bug: a load after a store to the same location, such as:

A[i] = a;
b = A[i];

is not currently treated as store-load forwarding. As this could affect correctness, I fixed it with a slight modification. Two new test cases are added.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
684–705	Ah, not GCD. You are right!
713	Agree.
722–740	Agree. Also, this is a rare case. I think it won't affect performance.
837	Fixed.
840	Fixed
840–865	This case is vectorizable as MinDistanceNeeded is exactly equal to the distance. If the distance is smaller (even 1 byte smaller), it is not vectorizable. Actually we already have a similar case in memdep.ll: for (i = 0; i < 1024; ++i) A[i+2] = A[i] + 1; The MinDistanceNeeded is 2*4 = 8. The distance is also 8.
test/Analysis/LoopAccessAnalysis/analyze-stride-access-depedences.ll
216–228 ↗	(On Diff #26663)	MaxSafeDepDistance can be printed out in "-debug-only". I don't know how to print it properly. Also as my comments on MaxSafeDepDistance say, I think we should refactor MaxSafeDepDistance as it could prevent some vectorization opportunities.

This revision is now accepted and ready to land.May 31 2015, 11:55 PM

• HaoLiu requested a review of this revision.Jun 1 2015, 3:16 AM

• HaoLiu edited edge metadata.

• HaoLiu removed a reviewer: • HaoLiu.

In D9368#181632, @HaoLiu wrote:
I also found a bug: a load after a store to the same location, such as:
A[i] = a;
b = A[i];
is not currently treated as store-load forwarding. As this could affect correctness, I fixed it with a slight modification. Two new test cases are added.

Why do you think that this is a bug? In this case because the vectorized loads and stores are aligned with each other, there should be no problem for the memory unit to figure out store-to-load forwarding. See the big comment in MemoryDepChecker::couldPreventStoreLoadForward.

To my understanding, if a load has to wait until the store is committed, it is store-load forwarding.
For the case:

A[i] = a;     (1)
b = A[i];     (2)

(2) has to wait for (1) to be committed. It is store-load forwarding. After vectorizing such a case, it is still store-load forwarding.

From the comment in couldPreventStoreLoadForward, I think it is meant to prevent vectorization from creating store-load forwarding. If that is true, the difference for this case is that the store-load forwarding is already present in the original loop. Do you mean that we can vectorize cases where store-load forwarding is already present in the original code?

Thanks,
-Hao

Hi Adam,

I uploaded a new patch that removes the StoreLoadForwarding modification.

Thanks,
-Hao

In D9368#182273, @HaoLiu wrote:
To my understanding, if a load has to wait until the store is committed, it is store-load forwarding.
For the case:
A[i] = a;     (1)
b = A[i];     (2)
(2) has to wait for (1) to be committed. It is store-load forwarding. After vectorizing such a case, it is still store-load forwarding.

From the comment in couldPreventStoreLoadForward, I think it is meant to prevent vectorization from creating store-load forwarding. If that is true, the difference for this case is that the store-load forwarding is already present in the original loop. Do you mean that we can vectorize cases where store-load forwarding is already present in the original code?

The problem is not whether we can vectorize; we can. The question is whether the resulting vector store and load will be subject to the processor's store-to-load forwarding optimization. In this case the store does not have to be fully retired before the load can start executing (assuming an OOO core); instead the memory unit forwards the value of the store to the load (and checks whether there are intervening stores that make this invalid).

This is easy for the processor to figure out if both operations use the same address. However if they are only partially overlapping it's hard.

Sounds reasonable.

Anyway, I've removed the modification in last patch. Review please.

Thanks,
-Hao

Hi Adam,

Do you have any other concerns? If not, I'll commit this patch separately as you said.

I think it still has some potential problems, but we can improve it continuously in the future. Committing this patch is the first step. Then I can continue with the follow-up work.

Thanks,
-Hao

anemet added inline comments.Jun 4 2015, 12:59 AM

lib/Analysis/LoopAccessAnalysis.cpp
693–714	Now that I look at this again, the second case covers the first case as well (i.e. it is a superset). There is room to simplify it.
840–865	I didn't quite understand what you were saying here but looks like you didn't change the comment, so I guess you disagree that my example is better. Your example is vectorizable if miniter is 2 and so is mine, so I don't understand your reply.
test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll
344–345	"It should be vectorizable."

Hi Adam,

I uploaded a new patch according to your comments. My comments are inline.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
693–714	You are right, only need "ScaledDist % Stride".
840–865	Previously, I thought you were just asking me a question about that case. Sorry for the confusion. Actually, your case is "indep", which returns early with "Dependence::NoDep", so it never reaches this place. That's why I use an example with dependences.
test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll
344–345	Agree. Thanks.
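
The simplified independence test agreed on above ("ScaledDist % Stride") can be sketched as follows. This is a minimal illustration rather than the committed code: the function name is the one suggested earlier in the review, and the parameter names and the conservative handling of distances that are not a whole number of elements are assumptions.

```cpp
#include <cassert>

// Two accesses with the same stride can touch the same element only if
// the distance between them, measured in elements, is a multiple of the
// stride. Any other distance means the accesses never overlap.
inline bool areStridedAccessIndependent(unsigned DistanceBytes,
                                        unsigned Stride,
                                        unsigned TypeByteSize) {
  assert(Stride > 1 && "the stride must be greater than 1");
  assert(TypeByteSize > 0 && "the element size must be non-zero");
  assert(DistanceBytes > 0 && "the distance must be non-zero");

  // Assumed conservative choice: a distance that is not a whole number
  // of elements is not proven independent.
  if (DistanceBytes % TypeByteSize != 0)
    return false;

  unsigned ScaledDist = DistanceBytes / TypeByteSize;
  return ScaledDist % Stride != 0;
}
```

With 4-byte elements, the Stride=4, ScaledDistance=6 case from the review is independent (6 % 4 != 0), and so is the i += 3, A[i + 4] = A[i] + 1 loop (4 % 3 != 0), which is why that loop returns early as "indep".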

Hao, Adam, Michael,

I think this review has gone a bit too far for what it is. I agree there are still many points to cover and many issues to fix, but we should do them later, after the initial implementation is in. For what it is, a prototype for strided vectorization, this patch fits the bill. It's not on by default, it's still experimental and won't be used at -O3 or anything, so we're still safe in case it does miscompile anything.

I'd like to thank Adam and Michael for the extensive comments, it's not easy to get passionate people to do code review. I'd also like to thank Hao for his extreme patience with so many and so drastic code reviews! Finally thanks to Arnold and Nadav for the valuable input and discussions to get the first prototype working previously.

But right now, I think it's time we land this patch and continue the review on further patches.

Adam, if you're happy that Hao's last changes address your concerns, please approve the patch and let's start round 2 on a new review.

cheers,
--renato

• HaoLiu updated this revision to Diff 27098.Jun 4 2015, 1:48 AM

• HaoLiu edited edge metadata.

In D9368#183336, @rengolin wrote:

I think this review has gone a bit too far for what it is. I agree there are still many points to cover and many issues to fix, but we should do them later, after the initial implementation is in. For what it is, a prototype for strided vectorization, this patch fits the bill. It's not on by default, it's still experimental and won't be used at -O3 or anything, so we're still safe in case it does miscompile anything.

I am very surprised by this sentiment. I would think that if reviews provide you with bugs, you'd be happy to fix them before the code going in. (We effectively performed the bugfixing for you.)

More importantly though, you're incorrect with your facts. Existing paths in LAA were also modified by this patch which affect things like the LoopVectorizer which is on by default.

You also seem to suggest that the changes requested were mostly minor or cosmetic. Clearly the placement of a rather expensive analysis is not a cosmetic change but a basic design decision that should be debated pre-commit.

There is extensive work happening on LAA, so wrong starts like this can be pretty disruptive for other people working on the same module.

Also as I am sure Hao can attest, DependenceAnalysis is a pretty complex topic so it takes time to actually think through the analysis and it's better to do this while we're all focused on the problem at hand. We caught multiple issues in this category and there were probably only one or two cosmetic comments.

I agree that the review lasted longer than usual but I don't see the main reason being our "drastic code reviews".

It would have been much easier if the patch was split up as I originally requested. The existing modularity allowed for this but the patch didn't take advantage of it. Also as we were finding correctness/design issues with the patch, Hao started improving the code (in actually very nice ways, thanks!!) which then needed new reviews again, etc.

We're also in different time zones which didn't help either but I feel like the community should be able to put up with this much slack.

Adam, if you're happy that Hao's last changes address your concerns, please approve the patch and let's start round 2 on a new review.

Yeah, the LAA parts look good now.

LAA parts LGTM.

lib/Analysis/LoopAccessAnalysis.cpp
840–865	Makes sense.

In D9368#183650, @anemet wrote:

I am very surprised by this sentiment. I would think that if reviews provide you with bugs, you'd be happy to fix them before the code going in. (We effectively performed the bugfixing for you.)

I think you got me wrong. I *am* very happy that you are reviewing with such attention to detail...

More importantly though, you're incorrect with your facts. Existing paths in LAA were also modified by this patch which affect things like the LoopVectorizer which is on by default.

...but I was also (wrongly) assuming that the patch was still self contained. I stand corrected.

I agree with you that in the LAA case we need to make sure the behaviour outside of this patch stays unmodified.

I agree that the review lasted longer than usual but I don't see the main reason being our "drastic code reviews".

I meant it in the good way.

It would have been much easier if the patch was split up as I originally requested. The existing modularity allowed for this but the patch didn't take advantage of it. Also as we were finding correctness/design issues with the patch, Hao started improving the code (in actually very nice ways, thanks!!) which then needed new reviews again, etc. We're also in different time zones which didn't help either but I feel like the community should be able to put up with this much slack.

I'm fully aware of that and none of it was the reason why I asked for a breakpoint. As you said, Hao could have split the patch earlier, but once it was still small and self contained, the other bits would have been lost. This patch was re-written dozens of times and almost none of the original assumptions were kept in the final version. With such a game changer, it's hard to predict the future.

I just interpreted the last reviews as specific to the stride vectorizer, when (as you say), they're generic to LAA. My sincere apologies.

Yeah, the LAA parts look good now.

Great! Round 2! :)

cheers,
--renato

This revision is now accepted and ready to land.Jun 4 2015, 1:25 PM

In D9368#183726, @rengolin wrote:

I just interpreted the last reviews as specific to the stride vectorizer, when (as you say), they're generic to LAA. My sincere apologies.

OK, apologies accepted :), thanks for clarifying.

I think that part of the problem is that LAA as an independent pass still has some rough edges. Any help (Hao’s here and Silviu’s in the other thread) improving this situation is much appreciated.

Adam

Hi,

I’m glad we’re on the same page now. I’m really looking forward to seeing this feature, but I didn’t want to rush with it because it strongly defines directions for future development of the vectorizer. I believe that this code would be a base for many other optimizations (e.g. supporting masks), and that’s why it’s really important for it to be thoroughly reviewed. That’s also a reason for a lot of cosmetic changes I requested - when I see a place that’s harder to understand than it should be, I have to spend additional time on it, and that means that in future many other developers would have to spend their time as well. Also, such places distract from the main idea, which makes the review harder.

Thanks,
Michael

Hi Hao,

The patch LGTM with some minor notes. Thank you for the efforts, your work is much appreciated!

Michael

lib/Analysis/LoopAccessAnalysis.cpp
850–851	Something went wrong with the indentation here.
lib/Transforms/Vectorize/LoopVectorize.cpp
667	Could we rename `Delta` to `InterleaveFactor` or something like this? When roaming through the code it'd be much easier to understand what it stands for - especially when the code is in-tree and one doesn't see the context of the current patch.
690	s/Call this class/Use this class/
4455	static_cast isn't needed here.
4552–4556	What is the previous check you are referring to? And if it's guaranteed, could this `if` be replaced with an assertion?
4560	`static_cast` isn't necessary here (and if we want to cast everything `getIndex` should also be cast from `unsigned`).
test/Transforms/LoopVectorize/interleaved-accesses.ll
2	Which of the tests needs `-runtime-memory-check-threshold=24`? Could we get rid of this flag by adding `noalias` attribute, or aliasing metadata?
40	Please run the tests through `opt -instnamer` to replace %[0-9]+ names - they are really hard to deal with when one needs to modify the test.

• HaoLiu updated this revision to Diff 27175.Jun 4 2015, 10:44 PM

• HaoLiu edited edge metadata.

I've updated the patch according to Michael's comments. My comments are inline.

Review please.

Thanks,
-Hao

lib/Analysis/LoopAccessAnalysis.cpp
850–851	No, The indentation is intended here to show the distance and overlap between the accesses to array A and B.
lib/Transforms/Vectorize/LoopVectorize.cpp
667	Reasonable. Renamed.
690	Fixed.
4455	std::abs() returns a signed int. Comparing signed to unsigned will cause a warning from the compiler. To avoid static_cast, I use a temporary variable for "std::abs(Stride)" in the new patch.
4552–4556	It cannot be an assertion. The old comment was hard to understand, so I've replaced it with another comment that describes what we do right here. The test case @interleaved_stores in stride-access-dependence.ll checks for this situation, so there is no need to explain it again.
4560	No, the cast is needed: a division that mixes signed and unsigned operands does not give the signed result. I tested with a small case: int a = -4; unsigned b = 4; int c = a / b; The result c is a very large number (1073741823). Modulo is similar; I once fixed a bug about "signed % unsigned" (the unsigned operand was not cast) in SeparateConstOffsetFromGEP.cpp. This is different from ADD/SUB.
test/Transforms/LoopVectorize/interleaved-accesses.ll
2	We couldn't. There are still too many runtime checks for accesses related to one pointer (even though they are noalias). E.g. for (i = 0; i < 1024; i+=3) { A[i] += a; A[i+1] += b; A[i+2] += c; } It has 3 loads and 3 stores, which need 11 runtime checks (greater than the current threshold 8). The 3 stores will be compared with each other (3 checks), and each load will be compared with the 3 stores (3 * 3 checks). This could be improved in the future: there is no need to compare accesses in the same group. But currently, this is the workaround for testing.
40	Renamed anonymous names. This is really a helpful command.
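
The signed/unsigned division pitfall from the 4560 comment can be reproduced in isolation. The sketch below is illustrative only and assumes 32-bit `int`; the two function names are hypothetical and do not appear in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Mixed division: the usual arithmetic conversions turn the signed
// operand into an unsigned value before dividing, so a negative
// dividend becomes a very large positive number.
inline int32_t divideMixed(int32_t A, uint32_t B) {
  assert(B != 0 && "division by zero");
  return static_cast<int32_t>(A / B);
}

// Casting the unsigned operand first keeps the division signed.
inline int32_t divideSigned(int32_t A, uint32_t B) {
  assert(B != 0 && "division by zero");
  return A / static_cast<int32_t>(B);
}
```

With A = -4 and B = 4, the mixed form yields 1073741823 (4294967292 / 4) while the cast form yields -1. Addition and subtraction do not show the problem in the same way because wrap-around arithmetic gives the same bit pattern for both interpretations.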

Closed by commit rL239285: [LoopAccessAnalysis] Teach LAA to check the memory dependence between strided… (authored by • HaoLiu). Jun 7 2015, 9:52 PM

This revision was automatically updated to reflect the committed changes.

Committed in r239285 (Code about LoopAccessAnalysis) and r239291 (Code about LoopVectorize).

Thanks a lot for all the review work. Renato, Michael and Adam especially, you have been very kind and patient over such a long review.

As for follow-up work, my colleagues are working on improving the run-time memory check and the type promotion problem in the Loop Vectorizer. As for myself, I have patches for the AArch64 and ARM backends to match the generated IR into ldN/stN instructions. I'll send them out later.

Thanks,
-Hao

Revision Contents

Path	Size
include/llvm/Analysis/LoopAccessAnalysis.h	5 lines
include/llvm/Analysis/TargetTransformInfo.h	27 lines
include/llvm/Analysis/TargetTransformInfoImpl.h	8 lines
include/llvm/CodeGen/BasicTTIImpl.h	67 lines
lib/Analysis/LoopAccessAnalysis.cpp	136 lines
lib/Analysis/TargetTransformInfo.cpp	7 lines
lib/Transforms/Vectorize/LoopVectorize.cpp	699 lines
test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll	540 lines
test/Analysis/LoopAccessAnalysis/zero-distance-dependence.ll	75 lines
test/Transforms/LoopVectorize/AArch64/arbitrary-induction-step.ll	37 lines
test/Transforms/LoopVectorize/interleaved-accesses.ll	422 lines

Diff 26876

include/llvm/Analysis/LoopAccessAnalysis.h

Show First 20 Lines • Show All 313 Lines • ▼ Show 20 Lines	struct RuntimePointerCheck {
RuntimePointerCheck() : Need(false) {}		RuntimePointerCheck() : Need(false) {}

/// Reset the state of the pointer runtime information.		/// Reset the state of the pointer runtime information.
void reset() {		void reset() {
Need = false;		Need = false;
Pointers.clear();		Pointers.clear();
Starts.clear();		Starts.clear();
Ends.clear();		Ends.clear();
IsWritePtr.clear();		IsWritePtr.clear();
		mzolotukhinUnsubmitted Not Done Reply Inline Actions We require `Align!=0` here, but in the constructor below we explicitly set it to 0. Is one of these unused and could be removed? mzolotukhin: We require `Align!=0` here, but in the constructor below we explicitly set it to 0. Is one of…
		HaoLiuAuthorUnsubmitted Not Done Reply Inline Actions Removed the unused constructor. HaoLiu: Removed the unused constructor.
DependencySetId.clear();		DependencySetId.clear();
AliasSetId.clear();		AliasSetId.clear();
}		}

/// Insert a pointer and calculate the start and end SCEVs.		/// Insert a pointer and calculate the start and end SCEVs.
void insert(ScalarEvolution SE, Loop Lp, Value *Ptr, bool WritePtr,		void insert(ScalarEvolution SE, Loop Lp, Value *Ptr, bool WritePtr,
unsigned DepSetId, unsigned ASId,		unsigned DepSetId, unsigned ASId,
const ValueToValueMap &Strides);		const ValueToValueMap &Strides);
Show All 33 Lines	struct RuntimePointerCheck {
/// Holds the information if this pointer is used for writing to memory.		/// Holds the information if this pointer is used for writing to memory.
SmallVector<bool, 2> IsWritePtr;		SmallVector<bool, 2> IsWritePtr;
/// Holds the id of the set of pointers that could be dependent because of a		/// Holds the id of the set of pointers that could be dependent because of a
/// shared underlying object.		/// shared underlying object.
SmallVector<unsigned, 2> DependencySetId;		SmallVector<unsigned, 2> DependencySetId;
/// Holds the id of the disjoint alias set to which this pointer belongs.		/// Holds the id of the disjoint alias set to which this pointer belongs.
SmallVector<unsigned, 2> AliasSetId;		SmallVector<unsigned, 2> AliasSetId;
};		};

		rengolinUnsubmitted Not Done Reply Inline Actions Please add some high level description of this class rengolin: Please add some high level description of this class
		HaoLiuAuthorUnsubmitted Not Done Reply Inline Actions Done. HaoLiu: Done.
LoopAccessInfo(Loop L, ScalarEvolution SE, const DataLayout &DL,		LoopAccessInfo(Loop L, ScalarEvolution SE, const DataLayout &DL,
		rengolinUnsubmitted Not Done Reply Inline Actions nitpick: I'd use class with public section rather than a struct with a private section. Just to keep the C++ness of things. rengolin: nitpick: I'd use class with public section rather than a struct with a private section. Just to…
		HaoLiuAuthorUnsubmitted Not Done Reply Inline Actions Done. HaoLiu: Done.
const TargetLibraryInfo *TLI, AliasAnalysis *AA,		const TargetLibraryInfo *TLI, AliasAnalysis *AA,
DominatorTree *DT, LoopInfo *LI,		DominatorTree *DT, LoopInfo *LI,
const ValueToValueMap &Strides);		const ValueToValueMap &Strides);

/// Return true we can analyze the memory accesses in the loop and there are		/// Return true we can analyze the memory accesses in the loop and there are
/// no memory dependence cycles.		/// no memory dependence cycles.
bool canVectorizeMemory() const { return CanVecMem; }		bool canVectorizeMemory() const { return CanVecMem; }

const RuntimePointerCheck *getRuntimePointerCheck() const {		const RuntimePointerCheck *getRuntimePointerCheck() const {
return &PtrRtCheck;		return &PtrRtCheck;
}		}

/// \brief Number of memchecks required to prove independence of otherwise		/// \brief Number of memchecks required to prove independence of otherwise
		mzolotukhin: s/};/}/
/// may-alias pointers.		/// may-alias pointers.
unsigned getNumRuntimePointerChecks() const { return NumComparisons; }		unsigned getNumRuntimePointerChecks() const { return NumComparisons; }

/// Return true if the block BB needs to be predicated in order for the loop		/// Return true if the block BB needs to be predicated in order for the loop
/// to be vectorized.		/// to be vectorized.
static bool blockNeedsPredication(BasicBlock *BB, Loop *TheLoop,		static bool blockNeedsPredication(BasicBlock *BB, Loop *TheLoop,
DominatorTree *DT);		DominatorTree *DT);

/// Returns true if the value V is uniform within the loop.		/// Returns true if the value V is uniform within the loop.
bool isUniform(Value *V) const;		bool isUniform(Value *V) const;

		rengolin: A better comment would be "returns false if the instruction doesn't belong to the group".
		HaoLiu: Done.
unsigned getMaxSafeDepDistBytes() const { return MaxSafeDepDistBytes; }		unsigned getMaxSafeDepDistBytes() const { return MaxSafeDepDistBytes; }
unsigned getNumStores() const { return NumStores; }		unsigned getNumStores() const { return NumStores; }
unsigned getNumLoads() const { return NumLoads;}		unsigned getNumLoads() const { return NumLoads;}
		rengolin: Ouch, no! If you do this, then a perfectly valid sequence like this:
		    group.eraseFromMap(Map);
		    group.getDelta(); // there's no way to know that this won't work
		will segfault. A quick fix would be to leave the deletion to whoever is calling the function, so that the lifetime of the object is clear:
		    group.eraseFromMap(Map);
		    group.getDelta();
		    delete group;
		    group.getReverse(); // this is obviously wrong
		But again, it seems to me that such logic is spread out too much. So a better fix would be to coalesce all of it into a class of its own, one that contains the maps, the groups, and the control over the lifetime of all its groups in a consistent way, without requiring the caller to know about it. I'm OK with the temporary solution, as long as you add a FIXME to that effect.
		HaoLiu: That's reasonable. I've added a new class InterleavedAccessInfo to manage this.
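
The ownership arrangement suggested in this thread can be sketched roughly as follows. This is only an illustration, not the `InterleavedAccessInfo` class from the patch: `InterleaveGroupSketch`, `GroupOwnerSketch`, and the use of plain `int` member ids are all hypothetical names and simplifications chosen for the example. The point is that one object owns every group and the lookup map, so a caller never holds a pointer whose lifetime it has to guess.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>
#include <memory>
#include <vector>

// Hypothetical stand-in for an interleave group; only the factor is kept.
struct InterleaveGroupSketch {
  int Delta;
  explicit InterleaveGroupSketch(int D) : Delta(D) {}
};

// Hypothetical owner: holds the groups and the member-to-group map together,
// so erasing from the map and destroying the group happen in one place.
class GroupOwnerSketch {
  std::map<int, InterleaveGroupSketch *> MemberToGroup;       // non-owning
  std::vector<std::unique_ptr<InterleaveGroupSketch>> Groups; // owning

public:
  InterleaveGroupSketch *createGroup(int Delta,
                                     const std::vector<int> &Members) {
    assert(Delta > 1 && "an interleave group needs a factor above 1");
    Groups.push_back(std::make_unique<InterleaveGroupSketch>(Delta));
    InterleaveGroupSketch *G = Groups.back().get();
    for (int M : Members)
      MemberToGroup[M] = G;
    return G;
  }

  // Returns nullptr if the member does not belong to any group.
  InterleaveGroupSketch *getGroup(int Member) const {
    auto It = MemberToGroup.find(Member);
    return It == MemberToGroup.end() ? nullptr : It->second;
  }

  // Drops every map entry for G, then destroys G itself.
  void releaseGroup(InterleaveGroupSketch *G) {
    for (auto It = MemberToGroup.begin(); It != MemberToGroup.end();)
      It = (It->second == G) ? MemberToGroup.erase(It) : std::next(It);
    Groups.erase(
        std::remove_if(Groups.begin(), Groups.end(),
                       [G](const std::unique_ptr<InterleaveGroupSketch> &P) {
                         return P.get() == G;
                       }),
        Groups.end());
  }

  std::size_t numGroups() const { return Groups.size(); }
};
```

With this shape, the problematic sequence from the comment cannot be written by accident: after `releaseGroup(G)` the owner returns `nullptr` for every former member, and no caller-side `delete` exists.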

		mzolotukhin: Could you please convert the comments to doxygen format? (See http://llvm.org/docs/CodingStandards.html#doxygen-use-in-documentation-comments )
/// \brief Add code that checks at runtime if the accessed arrays overlap.		/// \brief Add code that checks at runtime if the accessed arrays overlap.
///		///
/// Returns a pair of instructions where the first element is the first		/// Returns a pair of instructions where the first element is the first
/// instruction generated in possibly a sequence of instructions and the		/// instruction generated in possibly a sequence of instructions and the
		mzolotukhin: Should it be `unsigned`?
/// second value is the final comparator value or NULL if no check is needed.		/// second value is the final comparator value or NULL if no check is needed.
///		///
/// If \p PtrPartition is set, it contains the partition number for pointers		/// If \p PtrPartition is set, it contains the partition number for pointers
/// (-1 if the pointer belongs to multiple partitions). In this case omit		/// (-1 if the pointer belongs to multiple partitions). In this case omit
/// checks between pointers belonging to the same partition.		/// checks between pointers belonging to the same partition.
std::pair<Instruction *, Instruction *>		std::pair<Instruction *, Instruction *>
addRuntimeCheck(Instruction *Loc,		addRuntimeCheck(Instruction *Loc,
const SmallVectorImpl<int> *PtrPartition = nullptr) const;		const SmallVectorImpl<int> *PtrPartition = nullptr) const;

/// \brief The diagnostics report generated for the analysis. E.g. why we		/// \brief The diagnostics report generated for the analysis. E.g. why we
/// couldn't analyze the loop.		/// couldn't analyze the loop.
		mzolotukhin: A couple of comments regarding this code:
		1) Let's consistently use either `Idx` or `Index`.
		2) Do we need `AbsIdx` at all? Under the if we can use `-Index` instead of it, after the if - just `Index`.
		3) In the paper you mentioned it was stated that the grouping algorithm is linear in the number of accesses. In your current implementation it's quadratic, and the complexity seems to come from two factors: (a) we recalculate indexes after adding a new member, (b) we explicitly check if the group already contains a member with the same index. Could we do anything about that? E.g. I think that recalculating isn't necessary at all, and the check for duplicates can be done faster if we use a hash-table for it.
		HaoLiu: Reasonable.
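
One way to read the hash-table suggestion above is sketched below. It is an assumption-laden illustration rather than the patch's data structure: `IndexedGroupSketch`, the integer member ids, and the convention that the first member is inserted with key 0 are all hypothetical. Members are stored under a signed key relative to the first member, so inserting a member at a smaller address only updates `SmallestKey`; existing entries are never renumbered, and the duplicate check is a single hash lookup.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical group keyed by a signed offset relative to the first member.
class IndexedGroupSketch {
  std::unordered_map<int, int> Members; // relative key -> member id
  int SmallestKey = 0;                  // assumes the first insert uses key 0

public:
  // Returns false if a member with the same key already exists.
  // Expected O(1); no existing entry is touched.
  bool insertMember(int MemberId, int Key) {
    if (!Members.emplace(Key, MemberId).second)
      return false;
    if (Key < SmallestKey)
      SmallestKey = Key;
    return true;
  }

  // Normalized (non-negative) index of a key, computed on demand.
  unsigned getIndex(int Key) const {
    return static_cast<unsigned>(Key - SmallestKey);
  }

  // Member id at a normalized index, or -1 if that slot is a gap.
  int getMember(unsigned Index) const {
    auto It = Members.find(SmallestKey + static_cast<int>(Index));
    return It == Members.end() ? -1 : It->second;
  }
};
```

Because the normalized index is derived from `SmallestKey` only when queried, adding N members costs expected O(N) overall under these assumptions.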
const Optional<LoopAccessReport> &getReport() const { return Report; }		const Optional<LoopAccessReport> &getReport() const { return Report; }

/// \brief the Memory Dependence Checker which can determine the		/// \brief the Memory Dependence Checker which can determine the
/// loop-independent and loop-carried dependences between memory accesses.		/// loop-independent and loop-carried dependences between memory accesses.
const MemoryDepChecker &getDepChecker() const { return DepChecker; }		const MemoryDepChecker &getDepChecker() const { return DepChecker; }

/// \brief Return the list of instructions that use \p Ptr to read or write		/// \brief Return the list of instructions that use \p Ptr to read or write
/// memory.		/// memory.
SmallVector<Instruction *, 4> getInstructionsForAccess(Value *Ptr,		SmallVector<Instruction *, 4> getInstructionsForAccess(Value *Ptr,
bool isWrite) const {		bool isWrite) const {
return DepChecker.getInstructionsForAccess(Ptr, isWrite);		return DepChecker.getInstructionsForAccess(Ptr, isWrite);
}		}

/// \brief Print the information about the memory accesses in the loop.		/// \brief Print the information about the memory accesses in the loop.
void print(raw_ostream &OS, unsigned Depth = 0) const;		void print(raw_ostream &OS, unsigned Depth = 0) const;

/// \brief Used to ensure that if the analysis was run with speculating the		/// \brief Used to ensure that if the analysis was run with speculating the
		mzolotukhin: Not a good naming - from the names it's absolutely unclear what's the difference between `Align` and `Alignment`, and what they stand for.
/// value of symbolic strides, the client queries it with the same assumption.		/// value of symbolic strides, the client queries it with the same assumption.
		rengolin: Add "InterleaveGroup" to "contains...". A little step for the developer, a giant leap for the debugger. :)
		HaoLiu: Done.
/// Only used in DEBUG build but we don't want NDEBUG-dependent ABI.		/// Only used in DEBUG build but we don't want NDEBUG-dependent ABI.
unsigned NumSymbolicStrides;		unsigned NumSymbolicStrides;

/// \brief Checks existence of store to invariant address inside loop.		/// \brief Checks existence of store to invariant address inside loop.
		mzolotukhin: Use range loop?
/// If the loop has any store to invariant address, then it returns true,		/// If the loop has any store to invariant address, then it returns true,
/// else returns false.		/// else returns false.
bool hasStoreToLoopInvariantAddress() const {		bool hasStoreToLoopInvariantAddress() const {
return StoreToLoopInvariantAddress;		return StoreToLoopInvariantAddress;
}		}

private:		private:
/// \brief Analyze the loop. Substitute symbolic strides using Strides.		/// \brief Analyze the loop. Substitute symbolic strides using Strides.
void analyzeLoop(const ValueToValueMap &Strides);		void analyzeLoop(const ValueToValueMap &Strides);

		rengolin: You shouldn't expose a private member like that. Otherwise, there's no point in making it private. In removeInterleaveGroup() you erase all components and delete the group, completely bypassing the private keyword. Why would you need to delete the Members' list and not the rest of the InterleaveGroup's members? Why not just delete the whole object and let the destructor in InterleaveGroup free Members privately?
		HaoLiu: Done.
/// \brief Check if the structure of the loop allows it to be analyzed by this		/// \brief Check if the structure of the loop allows it to be analyzed by this
/// pass.		/// pass.
bool canAnalyzeLoop();		bool canAnalyzeLoop();

void emitAnalysis(LoopAccessReport &Message);		void emitAnalysis(LoopAccessReport &Message);

/// We need to check that all of the pointers in this list are disjoint		/// We need to check that all of the pointers in this list are disjoint
/// at runtime.		/// at runtime.
RuntimePointerCheck PtrRtCheck;		RuntimePointerCheck PtrRtCheck;

/// \brief the Memory Dependence Checker which can determine the		/// \brief the Memory Dependence Checker which can determine the
/// loop-independent and loop-carried dependences between memory accesses.		/// loop-independent and loop-carried dependences between memory accesses.
MemoryDepChecker DepChecker;		MemoryDepChecker DepChecker;
		rengolin: This should move inside InterleaveGroup's destructor.
		HaoLiu: Done.

/// \brief Number of memchecks required to prove independence of otherwise		/// \brief Number of memchecks required to prove independence of otherwise
/// may-alias pointers		/// may-alias pointers
unsigned NumComparisons;		unsigned NumComparisons;

Loop *TheLoop;		Loop *TheLoop;
ScalarEvolution *SE;		ScalarEvolution *SE;
const DataLayout &DL;		const DataLayout &DL;
Show All 13 Lines	private:
/// \brief Indicator for storing to uniform addresses.		/// \brief Indicator for storing to uniform addresses.
/// If a loop has write to a loop invariant address then it should be true.		/// If a loop has write to a loop invariant address then it should be true.
bool StoreToLoopInvariantAddress;		bool StoreToLoopInvariantAddress;

/// \brief The diagnostics report generated for the analysis. E.g. why we		/// \brief The diagnostics report generated for the analysis. E.g. why we
/// couldn't analyze the loop.		/// couldn't analyze the loop.
Optional<LoopAccessReport> Report;		Optional<LoopAccessReport> Report;
};		};

Value *stripIntegerCast(Value *V);		Value *stripIntegerCast(Value *V);
		anemet: I've only started looking at this patch but I have a quick initial, more fundamental comment. You are modifying an analysis pass (LoopAccessAnalysis). Please add support for printing the new information gathered by the analysis. Please also add tests under test/Analysis/LoopAccessAnalysis to verify the new information. You may even want to separate out the patch for the analysis part (LAA + tests) from the transformation parts (LV + tests + etc).
		HaoLiu: Hi Adam, I've updated the patch to support printing the new information. I added more test cases as well. I didn't separate the patch, as there are already a lot of history comments on both parts. Also, I think the boundary between the analysis part and the transformation part is clear. Anyway, I can still separate this patch if you insist.

		mzolotukhin: Please fix the indentation here.
		HaoLiu: Fixed.
///\brief Return the SCEV corresponding to a pointer with the symbolic stride		///\brief Return the SCEV corresponding to a pointer with the symbolic stride
///replaced with constant one.		///replaced with constant one.
///		///
/// If \p OrigPtr is not null, use it to look up the stride value instead of \p		/// If \p OrigPtr is not null, use it to look up the stride value instead of \p
/// Ptr. \p PtrToStride provides the mapping between the pointer value and its		/// Ptr. \p PtrToStride provides the mapping between the pointer value and its
/// stride as collected by LoopVectorizationLegality::collectStridedAccess.		/// stride as collected by LoopVectorizationLegality::collectStridedAccess.
const SCEV *replaceSymbolicStrideSCEV(ScalarEvolution *SE,		const SCEV *replaceSymbolicStrideSCEV(ScalarEvolution *SE,
const ValueToValueMap &PtrToStride,		const ValueToValueMap &PtrToStride,
Value *Ptr, Value *OrigPtr = nullptr);		Value *Ptr, Value *OrigPtr = nullptr);

		/// \brief Check the stride of the pointer and ensure that it does not wrap in
		/// the address space.
		int isStridedPtr(ScalarEvolution *SE, Value *Ptr, const Loop *Lp,
		const ValueToValueMap &StridesMap);

/// \brief This analysis provides dependence information for the memory accesses		/// \brief This analysis provides dependence information for the memory accesses
/// of a loop.		/// of a loop.
///		///
/// It runs the analysis for a loop on demand. This can be initiated by		/// It runs the analysis for a loop on demand. This can be initiated by
/// querying the loop access info via LAA::getInfo. getInfo return a		/// querying the loop access info via LAA::getInfo. getInfo return a
/// LoopAccessInfo object. See this class for the specifics of what information		/// LoopAccessInfo object. See this class for the specifics of what information
/// is provided.		/// is provided.
class LoopAccessAnalysis : public FunctionPass {		class LoopAccessAnalysis : public FunctionPass {
▲ Show 20 Lines • Show All 41 Lines • Show Last 20 Lines

include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 438 Lines • ▼ Show 20 Lines	public:
/// \return The cost of Load and Store instructions.		/// \return The cost of Load and Store instructions.
unsigned getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,		unsigned getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
unsigned AddressSpace) const;		unsigned AddressSpace) const;

/// \return The cost of masked Load and Store instructions.		/// \return The cost of masked Load and Store instructions.
unsigned getMaskedMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,		unsigned getMaskedMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
unsigned AddressSpace) const;		unsigned AddressSpace) const;

		/// \return The cost of the interleaved memory operation.
		/// \p Opcode is the memory operation code
		mzolotukhin: s/vecotor/vector/
		HaoLiu: Done.
		/// \p VecTy is the vector type of the interleaved access.
		/// \p Delta is the interleave factor
		/// \p Indices is the indices for interleaved load members (as interleaved
		mzolotukhin: This comment seems to be outdated - there are no arguments `SubTy`, `Index`, or `Gap` in the declaration below.
		HaoLiu: Fixed.
		/// load allows gaps)
		/// \p Alignment is the alignment of the memory operation
		/// \p AddressSpace is address space of the pointer.
		unsigned getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
		unsigned Delta,
		ArrayRef<unsigned> Indices,
		mzolotukhin: Nitpick: `Alignment` and `AddressSpace` aren't documented.
		HaoLiu: Done.
		unsigned Alignment,
		unsigned AddressSpace) const;
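
For orientation, the `Delta` and `Indices` parameters documented above can be related to the shuffle masks shown in the review summary. The following is a minimal sketch, not code from the patch: `stridedMask` and `interleaveMask` are hypothetical helper names, and `VF` (lanes per member) is an assumed parameter. Each entry of `Indices` corresponds to one strided mask over the wide loaded vector; an interleaved store uses a single mask over the concatenated member vectors.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: shuffle mask selecting member `Index` of an interleave
// group with factor `Delta`, producing `VF` lanes from the wide vector.
std::vector<unsigned> stridedMask(unsigned Index, unsigned Delta, unsigned VF) {
  assert(Delta > 1 && Index < Delta && "Invalid index for interleaved access");
  std::vector<unsigned> Mask;
  for (unsigned i = 0; i < VF; ++i)
    Mask.push_back(Index + i * Delta);
  return Mask;
}

// Hypothetical helper: shuffle mask that interleaves `Delta` member vectors of
// `VF` lanes each, assuming they are concatenated in member order.
std::vector<unsigned> interleaveMask(unsigned Delta, unsigned VF) {
  std::vector<unsigned> Mask;
  for (unsigned i = 0; i < VF; ++i)
    for (unsigned j = 0; j < Delta; ++j)
      Mask.push_back(j * VF + i);
  return Mask;
}
```

For example, `stridedMask(1, 3, 4)` yields the G-channel mask `<1, 4, 7, 10>` from the summary, and `interleaveMask(2, 4)` yields `<0, 4, 1, 5, 2, 6, 3, 7>`. A load group with gaps simply passes fewer entries in `Indices`.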

/// \brief Calculate the cost of performing a vector reduction.		/// \brief Calculate the cost of performing a vector reduction.
///		///
/// This is the cost of reducing the vector value of type \p Ty to a scalar		/// This is the cost of reducing the vector value of type \p Ty to a scalar
/// value using the operation denoted by \p Opcode. The form of the reduction		/// value using the operation denoted by \p Opcode. The form of the reduction
/// can either be a pairwise reduction or a reduction that splits the vector		/// can either be a pairwise reduction or a reduction that splits the vector
/// at every reduction level.		/// at every reduction level.
///		///
/// Pairwise:		/// Pairwise:
▲ Show 20 Lines • Show All 122 Lines • ▼ Show 20 Lines	public:
virtual unsigned getVectorInstrCost(unsigned Opcode, Type *Val,		virtual unsigned getVectorInstrCost(unsigned Opcode, Type *Val,
unsigned Index) = 0;		unsigned Index) = 0;
virtual unsigned getMemoryOpCost(unsigned Opcode, Type *Src,		virtual unsigned getMemoryOpCost(unsigned Opcode, Type *Src,
unsigned Alignment,		unsigned Alignment,
unsigned AddressSpace) = 0;		unsigned AddressSpace) = 0;
virtual unsigned getMaskedMemoryOpCost(unsigned Opcode, Type *Src,		virtual unsigned getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
unsigned Alignment,		unsigned Alignment,
unsigned AddressSpace) = 0;		unsigned AddressSpace) = 0;
		virtual unsigned getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
		unsigned Delta,
		ArrayRef<unsigned> Indices,
		unsigned Alignment,
		unsigned AddressSpace) = 0;
virtual unsigned getReductionCost(unsigned Opcode, Type *Ty,		virtual unsigned getReductionCost(unsigned Opcode, Type *Ty,
bool IsPairwiseForm) = 0;		bool IsPairwiseForm) = 0;
virtual unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,		virtual unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
ArrayRef<Type *> Tys) = 0;		ArrayRef<Type *> Tys) = 0;
virtual unsigned getCallInstrCost(Function *F, Type *RetTy,		virtual unsigned getCallInstrCost(Function *F, Type *RetTy,
ArrayRef<Type *> Tys) = 0;		ArrayRef<Type *> Tys) = 0;
virtual unsigned getNumberOfParts(Type *Tp) = 0;		virtual unsigned getNumberOfParts(Type *Tp) = 0;
virtual unsigned getAddressComputationCost(Type *Ty, bool IsComplex) = 0;		virtual unsigned getAddressComputationCost(Type *Ty, bool IsComplex) = 0;
▲ Show 20 Lines • Show All 142 Lines • ▼ Show 20 Lines	public:
unsigned getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,		unsigned getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
unsigned AddressSpace) override {		unsigned AddressSpace) override {
return Impl.getMemoryOpCost(Opcode, Src, Alignment, AddressSpace);		return Impl.getMemoryOpCost(Opcode, Src, Alignment, AddressSpace);
}		}
unsigned getMaskedMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,		unsigned getMaskedMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
unsigned AddressSpace) override {		unsigned AddressSpace) override {
return Impl.getMaskedMemoryOpCost(Opcode, Src, Alignment, AddressSpace);		return Impl.getMaskedMemoryOpCost(Opcode, Src, Alignment, AddressSpace);
}		}
		unsigned getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
		unsigned Delta,
		ArrayRef<unsigned> Indices,
		unsigned Alignment,
		unsigned AddressSpace) override {
		return Impl.getInterleavedMemoryOpCost(Opcode, VecTy, Delta, Indices,
		Alignment, AddressSpace);
		}
unsigned getReductionCost(unsigned Opcode, Type *Ty,		unsigned getReductionCost(unsigned Opcode, Type *Ty,
bool IsPairwiseForm) override {		bool IsPairwiseForm) override {
return Impl.getReductionCost(Opcode, Ty, IsPairwiseForm);		return Impl.getReductionCost(Opcode, Ty, IsPairwiseForm);
}		}
unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,		unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
ArrayRef<Type *> Tys) override {		ArrayRef<Type *> Tys) override {
return Impl.getIntrinsicInstrCost(ID, RetTy, Tys);		return Impl.getIntrinsicInstrCost(ID, RetTy, Tys);
}		}
▲ Show 20 Lines • Show All 128 Lines • Show Last 20 Lines

include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 294 Lines • ▼ Show 20 Lines	unsigned getMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
return 1;		return 1;
}		}

unsigned getMaskedMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,		unsigned getMaskedMemoryOpCost(unsigned Opcode, Type *Src, unsigned Alignment,
unsigned AddressSpace) {		unsigned AddressSpace) {
return 1;		return 1;
}		}

		unsigned getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
		unsigned Delta,
		ArrayRef<unsigned> Indices,
		unsigned Alignment,
		mzolotukhin: Why is the cost `Delta`?
		HaoLiu: Previously I thought there were 'Delta' instructions, each costing 1. Since the default cost in all the other getXXXCost functions is 1, I've changed it to "1".
		unsigned AddressSpace) {
		return 1;
		}

unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,		unsigned getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
ArrayRef<Type *> Tys) {		ArrayRef<Type *> Tys) {
return 1;		return 1;
		hfinkel: This is not really the right place for this default, and not the right way to generate it. Just return 1 here. The default should be in include/llvm/CodeGen/BasicTTIImpl.h, and you should call getScalarizationOverhead, likely, to generate the default cost.
		HaoLiu: I've removed such code. As I'm not familiar with the cost model, I'll add the cost later.
}		}

unsigned getCallInstrCost(Function *F, Type *RetTy, ArrayRef<Type *> Tys) {		unsigned getCallInstrCost(Function *F, Type *RetTy, ArrayRef<Type *> Tys) {
return 1;		return 1;
}		}

unsigned getNumberOfParts(Type *Tp) { return 0; }		unsigned getNumberOfParts(Type *Tp) { return 0; }

▲ Show 20 Lines • Show All 128 Lines • Show Last 20 Lines

include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 516 Lines • ▼ Show 20 Lines	if (Src->isVectorTy() &&
Cost += getScalarizationOverhead(Src, Opcode != Instruction::Store,		Cost += getScalarizationOverhead(Src, Opcode != Instruction::Store,
Opcode == Instruction::Store);		Opcode == Instruction::Store);
}		}
}		}

return Cost;		return Cost;
}		}

		unsigned getInterleavedMemoryOpCost(unsigned Opcode, Type *VecTy,
		unsigned Delta,
		ArrayRef<unsigned> Indices,
		unsigned Alignment,
		unsigned AddressSpace) {
		VectorType *VT = dyn_cast<VectorType>(VecTy);
		assert(VT && "Expect a vector type for interleaved memory op");

		unsigned NumElts = VT->getNumElements();
		assert(Delta > 1 && NumElts % Delta == 0 && "Invalid Delta");
		mzolotukhin: s/divied/divided/

		unsigned NumSubElts = NumElts / Delta;
		VectorType *SubVT = VectorType::get(VT->getElementType(), NumSubElts);

		// Firstly, the cost of load/store operation.
		unsigned Cost = getMemoryOpCost(Opcode, VecTy, Alignment, AddressSpace);

		// Then plus the cost of interleave operation.
		if (Opcode == Instruction::Load) {
		// The interleave cost is similar to extract sub vectors' elements
		// from the wide vector, and insert them into sub vectors.
		//
		// E.g. Interleaved load of Delta 2 to 1 sub vectors (%v0):
		mzolotukhin: Am I getting it correctly, that `VecTy` is the type of the 'combined' vector? I.e. if we have a stride=2, VF=4, and ScalarTy=i32, then `VecTy` would be `<8 x i32>`, and `SubVT` would be `<4 x i32>`? I'm not sure I've completely understood this part, so please correct me if I'm wrong. If that's true, then the cost computation doesn't seem correct. Extracting the 1st or 2nd element from the wide vector isn't the same as extracting the odd or even elements.
		HaoLiu: Yes, you are right. The cost is not only the "Extract" cost, it also includes the "Insert" cost. Say we interleaved load two vectors:
		    %vec = load <8 x i32>, <8 x i32>* %ptr
		    %v0 = shuffle %vec, undef, <0, 2, 4, 6>
		    %v1 = shuffle %vec, undef, <1, 3, 5, 7>
		The cost consists of 2 parts:
		(1) load of <8 x i32>
		(2) extract %v0 and %v1 from %vec.
		The cost of (2) is estimated as 2 parts:
		(a) extract elements at: 0, 1, 2, 3, 4, 5, 6, 7
		(b) insert elements 0, 2, 4, 6 into %v0 and insert elements 1, 3, 5, 7 into %v1
		The interleaved store is similar.
		// %vec = load <8 x i32>, <8 x i32>* %ptr
		// %v0 = shuffle %vec, undef, <0, 2, 4, 6> ; Index 0
		// The cost is estimated as extract elements at 0, 2, 4, 6 from the
		// <8 x i32> vector and insert them into a <4 x i32> vector.
		mzolotukhin: Was it supposed to be `+=`?
		HaoLiu: Fixed.

		assert(Indices.size() <= Delta &&
		"Interleaved memory op has too many members");
		for (unsigned Index : Indices) {
		assert(Index < Delta && "Invalid index for interleaved memory op");

		// Extract elements from loaded vector for each sub vector.
		for (unsigned i = 0; i < NumSubElts; i++)
		Cost += getVectorInstrCost(Instruction::ExtractElement, VT,
		Index + i * Delta);
		}

		unsigned InsSubCost = 0;
		mzolotukhin: What's the difference between computing costs for stores and loads?
		HaoLiu: Say we interleaved store two vectors:
		    %v0_v1 = shuffle %v0, %v1, <0, 4, 1, 5, 2, 6, 3, 7>
		    store <8 x i32> %v0_v1, <8 x i32>*
		It has two main differences:
		(1) The direction is different. It first extracts elements from two sub vectors and then inserts them into a wide vector (an interleaved load first extracts elements from a wide vector and then inserts them into sub vectors).
		(2) The interleaved store doesn't allow gaps, which means we must have the member of index 0 (with even elements) and the member of index 1 (with odd elements). But an interleaved load allows gaps; it could load only the even elements, such as:
		    %vec = load <8 x i32>, <8 x i32>* %ptr
		    %v0 = shuffle <8 x i32> %vec, undef, <0, 2, 4, 6>
		This is still beneficial and cheaper than scalar loads.
		I've refactored this function to add a new parameter of indices for interleaved load.
		for (unsigned i = 0; i < NumSubElts; i++)
		InsSubCost += getVectorInstrCost(Instruction::InsertElement, SubVT, i);

		Cost += Indices.size() * InsSubCost;
		} else {
		// The interleave cost is extract all elements from sub vectors, and
		// insert them into the wide vector.
		//
		// E.g. Interleaved store of Delta 2 (For vector %v0, %v1):
		// %v0_v1 = shuffle %v0, %v1, <0, 4, 1, 5, 2, 6, 3, 7>
		// store <8 x i32> %interleaved.vec, <8 x i32>* %ptr
		// The cost is estimated as extract all elements from both <4 x i32>
		// vectors and insert into the <8 x i32> vector.

		unsigned ExtSubCost = 0;
		for (unsigned i = 0; i < NumSubElts; i++)
		ExtSubCost += getVectorInstrCost(Instruction::ExtractElement, SubVT, i);

		Cost += Delta * ExtSubCost;

		for (unsigned i = 0; i < NumElts; i++)
		Cost += getVectorInstrCost(Instruction::InsertElement, VT, i);
		}

		return Cost;
		}
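
As a rough worked example of the default cost computed by the function above, the sketch below mirrors its structure with every primitive cost fixed to 1. That unit-cost choice is an assumption made only for illustration (real targets return their own values from `getMemoryOpCost` and `getVectorInstrCost`), and `interleavedCostSketch` is a hypothetical name, not part of the patch.

```cpp
#include <cassert>
#include <vector>

// Assumed unit costs; in LLVM these come from the target's TTI hooks.
constexpr unsigned MemOpCost = 1;
constexpr unsigned ExtractCost = 1;
constexpr unsigned InsertCost = 1;

// Hypothetical mirror of the default interleaved cost under unit costs.
// NumElts is the lane count of the wide vector, Delta the interleave factor.
unsigned interleavedCostSketch(bool IsLoad, unsigned NumElts, unsigned Delta,
                               const std::vector<unsigned> &Indices) {
  assert(Delta > 1 && NumElts % Delta == 0 && "Invalid Delta");
  unsigned NumSubElts = NumElts / Delta;

  // One wide load or store.
  unsigned Cost = MemOpCost;

  if (IsLoad) {
    assert(Indices.size() <= Delta &&
           "Interleaved memory op has too many members");
    // Per requested member: extract NumSubElts lanes from the wide vector,
    // then insert them into the sub vector.
    Cost += Indices.size() * NumSubElts * (ExtractCost + InsertCost);
  } else {
    // Stores have no gaps: extract every lane of all Delta sub vectors,
    // then insert all NumElts lanes into the wide vector.
    Cost += Delta * NumSubElts * ExtractCost;
    Cost += NumElts * InsertCost;
  }
  return Cost;
}
```

Under these assumptions, loading only member 0 of a factor-2 group from an `<8 x i32>` vector costs 1 + 4 + 4 = 9, loading both members costs 17, and the matching factor-2 store costs 1 + 8 + 8 = 17.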

unsigned getIntrinsicInstrCost(Intrinsic::ID IID, Type *RetTy,		unsigned getIntrinsicInstrCost(Intrinsic::ID IID, Type *RetTy,
ArrayRef<Type *> Tys) {		ArrayRef<Type *> Tys) {
unsigned ISD = 0;		unsigned ISD = 0;
switch (IID) {		switch (IID) {
default: {		default: {
// Assume that we need to scalarize this intrinsic.		// Assume that we need to scalarize this intrinsic.
unsigned ScalarizationCost = 0;		unsigned ScalarizationCost = 0;
unsigned ScalarCalls = 1;		unsigned ScalarCalls = 1;
▲ Show 20 Lines • Show All 235 Lines • Show Last 20 Lines

lib/Analysis/LoopAccessAnalysis.cpp

Show First 20 Lines • Show All 283 Lines • ▼ Show 20 Lines	static bool hasComputableBounds(ScalarEvolution *SE,
const SCEV *PtrScev = replaceSymbolicStrideSCEV(SE, Strides, Ptr);		const SCEV *PtrScev = replaceSymbolicStrideSCEV(SE, Strides, Ptr);
const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(PtrScev);		const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(PtrScev);
if (!AR)		if (!AR)
return false;		return false;

return AR->isAffine();		return AR->isAffine();
}		}

/// \brief Check the stride of the pointer and ensure that it does not wrap in
/// the address space.
static int isStridedPtr(ScalarEvolution *SE, Value *Ptr, const Loop *Lp,
const ValueToValueMap &StridesMap);

bool AccessAnalysis::canCheckPtrAtRT(		bool AccessAnalysis::canCheckPtrAtRT(
LoopAccessInfo::RuntimePointerCheck &RtCheck, unsigned &NumComparisons,		LoopAccessInfo::RuntimePointerCheck &RtCheck, unsigned &NumComparisons,
ScalarEvolution *SE, Loop *TheLoop, const ValueToValueMap &StridesMap,		ScalarEvolution *SE, Loop *TheLoop, const ValueToValueMap &StridesMap,
bool ShouldCheckStride) {		bool ShouldCheckStride) {
// Find pointers with computable bounds. We are going to use this information		// Find pointers with computable bounds. We are going to use this information
// to place a runtime bound check.		// to place a runtime bound check.
bool CanDoRT = true;		bool CanDoRT = true;

[... 200 lines not shown ...]

static bool isInBoundsGep(Value *Ptr) {		static bool isInBoundsGep(Value *Ptr) {
if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(Ptr))		if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(Ptr))
return GEP->isInBounds();		return GEP->isInBounds();
return false;		return false;
}		}

/// \brief Check whether the access through \p Ptr has a constant stride.		/// \brief Check whether the access through \p Ptr has a constant stride.
static int isStridedPtr(ScalarEvolution *SE, Value *Ptr, const Loop *Lp,		int llvm::isStridedPtr(ScalarEvolution *SE, Value *Ptr, const Loop *Lp,
const ValueToValueMap &StridesMap) {		const ValueToValueMap &StridesMap) {
const Type *Ty = Ptr->getType();		const Type *Ty = Ptr->getType();
assert(Ty->isPointerTy() && "Unexpected non-ptr");		assert(Ty->isPointerTy() && "Unexpected non-ptr");

// Make sure that the pointer does not point to aggregate types.		// Make sure that the pointer does not point to aggregate types.
const PointerType *PtrTy = cast<PointerType>(Ty);		const PointerType *PtrTy = cast<PointerType>(Ty);
if (PtrTy->getElementType()->isAggregateType()) {		if (PtrTy->getElementType()->isAggregateType()) {
DEBUG(dbgs() << "LAA: Bad stride - Not a pointer to a scalar type"		DEBUG(dbgs() << "LAA: Bad stride - Not a pointer to a scalar type"
<< *Ptr << "\n");		<< *Ptr << "\n");
return 0;		return 0;
}		}

const SCEV *PtrScev = replaceSymbolicStrideSCEV(SE, StridesMap, Ptr);		const SCEV *PtrScev = replaceSymbolicStrideSCEV(SE, StridesMap, Ptr);

const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(PtrScev);		const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(PtrScev);
if (!AR) {		if (!AR) {
DEBUG(dbgs() << "LAA: Bad stride - Not an AddRecExpr pointer "		DEBUG(dbgs() << "LAA: Bad stride - Not an AddRecExpr pointer "
<< *Ptr << " SCEV: " << *PtrScev << "\n");		<< *Ptr << " SCEV: " << *PtrScev << "\n");
return 0;		return 0;
}		}

// The accesss function must stride over the innermost loop.		// The accesss function must stride over the innermost loop.
if (Lp != AR->getLoop()) {		if (Lp != AR->getLoop()) {
DEBUG(dbgs() << "LAA: Bad stride - Not striding over innermost loop " <<		DEBUG(dbgs() << "LAA: Bad stride - Not striding over innermost loop " <<
*Ptr << " SCEV: " << *PtrScev << "\n");		*Ptr << " SCEV: " << *PtrScev << "\n");
}		}
		hfinkel: Is this just a drive-by bug fix?
		HaoLiu: Yes. I just came across this minor bug. As it's quite obvious and simple, should I just commit this fix?
		rengolin: Yes, if this is an unrelated fix, please submit a separate diff with a full description of what it is and a test.
		HaoLiu: OK. I'll do that in another patch.

// The address calculation must not wrap. Otherwise, a dependence could be		// The address calculation must not wrap. Otherwise, a dependence could be
// inverted.		// inverted.
// An inbounds getelementptr that is a AddRec with a unit stride		// An inbounds getelementptr that is a AddRec with a unit stride
// cannot wrap per definition. The unit stride requirement is checked later.		// cannot wrap per definition. The unit stride requirement is checked later.
// An getelementptr without an inbounds attribute and unit stride would have		// An getelementptr without an inbounds attribute and unit stride would have
// to access the pointer value "0" which is undefined behavior in address		// to access the pointer value "0" which is undefined behavior in address
// space 0, therefore we can also vectorize this case.		// space 0, therefore we can also vectorize this case.
[... 125 lines not shown ...]	bool MemoryDepChecker::couldPreventStoreLoadForward(unsigned Distance,

if (MaxVFWithoutSLForwardIssues < MaxSafeDepDistBytes &&		if (MaxVFWithoutSLForwardIssues < MaxSafeDepDistBytes &&
MaxVFWithoutSLForwardIssues !=		MaxVFWithoutSLForwardIssues !=
VectorizerParams::MaxVectorWidth * TypeByteSize)		VectorizerParams::MaxVectorWidth * TypeByteSize)
MaxSafeDepDistBytes = MaxVFWithoutSLForwardIssues;		MaxSafeDepDistBytes = MaxVFWithoutSLForwardIssues;
return false;		return false;
}		}

		/// \brief Check the dependence for two accesses with the same stride \p Stride.
		/// \p Distance is the positive distance and \p TypeByteSize is type size in
		/// bytes.
		///
		/// \returns true if they are independent.
		static bool areStridedAccessesIndependent(unsigned Distance, unsigned Stride,
		unsigned TypeByteSize) {
		assert(Stride > 1 && "The stride must be greater than 1");
		assert(TypeByteSize > 0 && "The type size in byte must be non-zero");
		assert(Distance > 0 && "The distance must be non-zero");

		// Skip if the distance is not multiple of type byte size.
		if (Distance % TypeByteSize)
		return false;

		unsigned ScaledDist = Distance / TypeByteSize;

		// (1) If the scaled distance is less than the stride.
		// E.g.
		// for (i = 0; i < 1024 ; i += 4)
		// A[i+2] = A[i] + 1;
		//
		// Two accesses in memory (scaled distance is 2, stride is 4):
		// | A[0] |      |      |      | A[4] |      |      |      |
		// |      |      | A[2] |      |      |      | A[6] |      |
		//
		// (2) Otherwise, no dependence if the scaled distance is not multiple of
		// the stride.
		// E.g.
		// for (i = 0; i < 1024 ; i += 3)
		anemet: I don't think this is GCD. Consider the values Stride=4 and ScaledDistance=6: 4 * i = 4 * j + 6 cannot be valid for any integer values of i and j. I think the Diophantine equation is this: ScaledDistance/Stride = (i - j) where i - j has to be an integer. So the two accesses can only reference the same element if ScaledDistance is a multiple of Stride.
		HaoLiu: Ah, not GCD. You are right!
		// A[i+4] = A[i] + 1;
		//
		// Two accesses in memory (scaled distance is 4, stride is 3):
		// | A[0] |      |      | A[3] |      |      | A[6] |      |      |
		// |      |      |      |      | A[4] |      |      | A[7] |      |
		if (ScaledDist < Stride)
		return true;
		else
		anemet: areStridedAccessIndependent is the conforming name (functions are lower-case). Also note the 'd' at the end of Strided. Same for the previous function.
		HaoLiu: Agree.
		return ScaledDist % Stride;
		anemet: Now that I look at this again, the second case covers the first one as well (i.e. it is a superset). There is room to simplify it.
		HaoLiu: You are right, only need "ScaledDist % Stride".
		}
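
The simplification agreed on in the comments above can be sketched as a standalone function. This is only an illustration of the reasoning in this thread, not the code that was committed; the name `stridedAccessesIndependentSketch` is hypothetical.

```cpp
#include <cassert>

// Hypothetical standalone helper (not the function from the patch).
// Two accesses with the same stride can only touch the same element when the
// distance between them, measured in elements, is a multiple of the stride.
bool stridedAccessesIndependentSketch(unsigned Distance, unsigned Stride,
                                      unsigned TypeByteSize) {
  assert(Stride > 1 && "The stride must be greater than 1");
  assert(TypeByteSize > 0 && "The type size in bytes must be non-zero");
  assert(Distance > 0 && "The distance must be non-zero");

  // A distance that is not a whole number of elements means the accesses may
  // partially overlap, so independence is not proven.
  if (Distance % TypeByteSize)
    return false;

  unsigned ScaledDist = Distance / TypeByteSize;

  // This single test also covers ScaledDist < Stride, because in that case
  // ScaledDist % Stride == ScaledDist, which is non-zero.
  return ScaledDist % Stride != 0;
}
```

With 4-byte elements, the two examples from the code comment (scaled distance 2 with stride 4, and scaled distance 4 with stride 3) are both reported as independent, as is the Stride=4, ScaledDistance=6 case raised in the review, while a scaled distance of 8 with stride 4 is not.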

MemoryDepChecker::Dependence::DepType		MemoryDepChecker::Dependence::DepType
MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,		MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
const MemAccessInfo &B, unsigned BIdx,		const MemAccessInfo &B, unsigned BIdx,
const ValueToValueMap &Strides) {		const ValueToValueMap &Strides) {
assert (AIdx < BIdx && "Must pass arguments in program order");		assert (AIdx < BIdx && "Must pass arguments in program order");

Value *APtr = A.getPointer();		Value *APtr = A.getPointer();
Value *BPtr = B.getPointer();		Value *BPtr = B.getPointer();
bool AIsWrite = A.getInt();		bool AIsWrite = A.getInt();
bool BIsWrite = B.getInt();		bool BIsWrite = B.getInt();

// Two reads are independent.		// Two reads are independent.
if (!AIsWrite && !BIsWrite)		if (!AIsWrite && !BIsWrite)
return Dependence::NoDep;		return Dependence::NoDep;

// We cannot check pointers in different address spaces.		// We cannot check pointers in different address spaces.
if (APtr->getType()->getPointerAddressSpace() !=		if (APtr->getType()->getPointerAddressSpace() !=
BPtr->getType()->getPointerAddressSpace())		BPtr->getType()->getPointerAddressSpace())
return Dependence::Unknown;		return Dependence::Unknown;

const SCEV *AScev = replaceSymbolicStrideSCEV(SE, Strides, APtr);		const SCEV *AScev = replaceSymbolicStrideSCEV(SE, Strides, APtr);
const SCEV *BScev = replaceSymbolicStrideSCEV(SE, Strides, BPtr);		const SCEV *BScev = replaceSymbolicStrideSCEV(SE, Strides, BPtr);

int StrideAPtr = isStridedPtr(SE, APtr, InnermostLoop, Strides);		int StrideAPtr = isStridedPtr(SE, APtr, InnermostLoop, Strides);
		anemet: If you want to handle this case I prefer we come back to it later as a separate LAA-only patch. Let's focus here on what's needed for LV to handle the basic cases.
		HaoLiu: Agree. This is also a rare case, so I don't think it will affect performance.
int StrideBPtr = isStridedPtr(SE, BPtr, InnermostLoop, Strides);		int StrideBPtr = isStridedPtr(SE, BPtr, InnermostLoop, Strides);

const SCEV *Src = AScev;		const SCEV *Src = AScev;
const SCEV *Sink = BScev;		const SCEV *Sink = BScev;

// If the induction step is negative we have to invert source and sink of the		// If the induction step is negative we have to invert source and sink of the
// dependence.		// dependence.
if (StrideAPtr < 0) {		if (StrideAPtr < 0) {
[... 44 lines not shown ...]	if (Val.isNegative()) {

DEBUG(dbgs() << "LAA: Dependence is negative: NoDep\n");		DEBUG(dbgs() << "LAA: Dependence is negative: NoDep\n");
return Dependence::Forward;		return Dependence::Forward;
}		}

// Write to the same location with the same size.		// Write to the same location with the same size.
// Could be improved to assert type sizes are the same (i32 == float, etc).		// Could be improved to assert type sizes are the same (i32 == float, etc).
if (Val == 0) {		if (Val == 0) {
if (ATy == BTy)		if (ATy == BTy) {
		// If it is a store-load pair with distance 0, it is store-load forwarding
		// and should not be vectorized.
		if (AIsWrite && !BIsWrite)
		return Dependence::ForwardButPreventsForwarding;

return Dependence::NoDep;		return Dependence::NoDep;
		}

DEBUG(dbgs() << "LAA: Zero dependence difference but different types\n");		DEBUG(dbgs() << "LAA: Zero dependence difference but different types\n");
return Dependence::Unknown;		return Dependence::Unknown;
}		}

assert(Val.isStrictlyPositive() && "Expect a positive value");		assert(Val.isStrictlyPositive() && "Expect a positive value");

if (ATy != BTy) {		if (ATy != BTy) {
DEBUG(dbgs() <<		DEBUG(dbgs() <<
"LAA: ReadWrite-Write positive dependency with different types\n");		"LAA: ReadWrite-Write positive dependency with different types\n");
return Dependence::Unknown;		return Dependence::Unknown;
}		}

unsigned Distance = (unsigned) Val.getZExtValue();		unsigned Distance = (unsigned) Val.getZExtValue();

		unsigned Stride = std::abs(StrideAPtr);
		if (Stride > 1 &&
		areStridedAccessesIndependent(Distance, Stride, TypeByteSize))
		return Dependence::NoDep;

// Bail out early if passed-in parameters make vectorization not feasible.		// Bail out early if passed-in parameters make vectorization not feasible.
unsigned ForcedFactor = (VectorizerParams::VectorizationFactor ?		unsigned ForcedFactor = (VectorizerParams::VectorizationFactor ?
VectorizerParams::VectorizationFactor : 1);		VectorizerParams::VectorizationFactor : 1);
unsigned ForcedUnroll = (VectorizerParams::VectorizationInterleave ?		unsigned ForcedUnroll = (VectorizerParams::VectorizationInterleave ?
VectorizerParams::VectorizationInterleave : 1);		VectorizerParams::VectorizationInterleave : 1);
		// The minimum number of iterations for a vectorized/unrolled version.
		unsigned MinNumIter = std::max(ForcedFactor * ForcedUnroll, 2U);
		anemet: Please split this part out of the change, since it is a cleanup, and commit it separately (and update this patch so it is easier to read).
		HaoLiu: Hi Adam, I agree. I'll do that when committing the patch.

		// It's not vectorizable if the distance is smaller than the minimum distance
		rengolin: I think you made this code more complicated for no reason. Just adding... unsigned NumIter = std::max(ForcedFactor * ForcedUnroll, 2); after the original lines would do what you want.
		HaoLiu: You are right. I get your point. Thanks!
		anemet: vectorized is misspelled
		HaoLiu: Fixed.
		// needed for a vectroized/unrolled version. Vectorizing one iteration in
		// front needs TypeByteSize * Stride. Vectorizing the last iteration needs
		// TypeByteSize (No need to plus the last gap distance).
		anemet: I think that these two cases are quite different. The first is essentially proving that accesses like A[2i] and A[2i + 1] are independent. This is true in general, not just for vectorization. Thus we should be returning NoDep rather than BackwardVectorizable. Please also add more comments/examples; it took me a long time to figure this out (hopefully I did in the end). Please also add LAA tests.
		HaoLiu: Hi Adam, You are right. I've uploaded a new patch that refactors this logic. I added a lot of new test cases as well.
		anemet: MinDistanceNeeded is probably a better name.
		HaoLiu: Fixed
		//
		// E.g. Assume one char is 1 byte in memory and one int is 4 bytes.
		// foo(int *A) {
		// int *B = (int *)((char *)A + 14);
		// for (i = 0 ; i < 1024 ; i += 2)
		// B[i] = A[i] + 1;
		// }
		//
		// Two accesses in memory (stride is 2):
		// | A[0] |      | A[2] |      | A[4] |      | A[6] |      |
		//                         | B[0] |      | B[2] |      | B[4] |
		mzolotukhin: Something went wrong with the indentation here.
		HaoLiu: No, the indentation is intended here to show the distance and overlap between the accesses to arrays A and B.
		//
		// Distance needs for vectorizing iterations except the last iteration:
		// 4 * 2 * (MinNumIter - 1). Distance needs for the last iteration: 4.
		// So the minimum distance needed is: 4 * 2 * (MinNumIter - 1) + 4.
		//
		// If MinNumIter is 2, it is vectorizable as the minimum distance needed is
		// 12, which is less than distance.
		//
		// If MinNumIter is 4 (Say if a user forces the vectorization factor to be 4),
		// the minimum distance needed is 28, which is greater than distance. It is
		// not safe to do vectorization.
		unsigned MinDistanceNeeded =
		TypeByteSize * Stride * (MinNumIter - 1) + TypeByteSize;
		if (MinDistanceNeeded > Distance) {
		anemet: I think this is correct but I wonder if the example would be less contrived if you used: for (i = 0; i < 1024; i+= 3) A[i + 4] = A[i] + 1 MinSizeNeeded is 4 * 3 * (2 - 1) + 4 = 16 which is equal to the distance. Also a nit: most or all of this comment is explaining why MinSizeNeeded is computed the way it is, so the comment should be before the computation.
		HaoLiu: This case is vectorizable as MinDistanceNeeded is exactly equal to the distance. If the distance is smaller (even 1 byte smaller), it is not vectorizable. Actually we already have a similar case in memdep.ll: for (i = 0; i < 1024; ++i) A[i+2] = A[i] + 1; The MinDistanceNeeded is 2 * 4 = 8. The distance is also 8.
		anemet: I didn't quite understand what you were saying here but it looks like you didn't change the comment, so I guess you disagree that my example is better. Your example is vectorizable if miniter is 2 and so is mine, so I don't understand your reply.
		HaoLiu: Previously, I thought you were just asking me a question about that case. Sorry for the confusion. Actually, your case is "indep", which returns early with "Dependence::NoDep", so it cannot reach this place. That's why I use an example with dependences.
		anemet: Makes sense.
		DEBUG(dbgs() << "LAA: Failure because of positive distance " << Distance
		<< '\n');
		return Dependence::Backward;
		}
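
The arithmetic in the comment above can be checked with a small standalone sketch. The helper name `minDistanceNeededSketch` is hypothetical, and its `ForcedFactor` and `ForcedUnroll` parameters are assumed to have already been mapped from 0 (not forced) to 1, as the surrounding code does.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper mirroring the MinDistanceNeeded computation in the diff.
// ForcedFactor and ForcedUnroll are assumed to be at least 1.
unsigned minDistanceNeededSketch(unsigned TypeByteSize, unsigned Stride,
                                 unsigned ForcedFactor, unsigned ForcedUnroll) {
  assert(ForcedFactor >= 1 && ForcedUnroll >= 1);

  // A vectorized/unrolled loop body always covers at least two iterations.
  unsigned MinNumIter = std::max(ForcedFactor * ForcedUnroll, 2U);

  // Every iteration except the last needs a full stride of elements; the
  // last iteration only needs the element it accesses.
  return TypeByteSize * Stride * (MinNumIter - 1) + TypeByteSize;
}
```

For the example in the comment (4-byte int, stride 2, distance 14 bytes), this gives 12 when nothing is forced, which fits within the distance, and 28 when the vectorization factor is forced to 4, which does not.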

// The distance must be bigger than the size needed for a vectorized version		// Unsafe if the minimum size needed is greater than the max safe distance.
		rengolin: why not?
		HaoLiu: Sorry that my comment was confusing. I mean that "ForcedFactor * ForcedUnroll" will never be 1. (1) If VectorizationFactor and VectorizationInterleave are both set to 1 by the user, that is the same as disabling the loop vectorizer, so execution never reaches this point. (2) If either factor is forced, the number of iterations will also never be 1: if the user sets "-force-vector-interleave=1", the vector factor will be at least 2, and if the user sets "-force-vector-width=1", the unroll factor will be at least 2. So I set the number of iterations to 2 if either factor is forced. This simplifies the distance checks that follow. I've changed the comments to make this clearer.
		rengolin: If one of the variables must be > 1 at this point, I don't see why you need the check below.
		mzolotukhin: From your and Renato's comments I think it's more appropriate to have an assert here. Otherwise, this is very misleading.
		HaoLiu: If a parameter is not forced, it is initialized to 0. If either parameter is forced, we need to set it properly. Anyway, I've changed the logic again to make it clearer.
		HaoLiu: Please have a look at the new code. The product could be 0 or 1 if either parameter is not forced. In that case, we need to assign a minimum of 2.
// of the operation and the size of the vectorized operation must not be		if (MinDistanceNeeded > MaxSafeDepDistBytes) {
		rengolin: you can use "2U" instead of the static_cast.
		HaoLiu: Done.
// bigger than the currrent maximum size.		DEBUG(dbgs() << "LAA: Failure because it needs at least "
if (Distance < 2*TypeByteSize ||		<< MinDistanceNeeded << " size in bytes");
2*TypeByteSize > MaxSafeDepDistBytes ||
Distance < TypeByteSize * ForcedUnroll * ForcedFactor) {
DEBUG(dbgs() << "LAA: Failure because of Positive distance "
<< Val.getSExtValue() << '\n');
return Dependence::Backward;		return Dependence::Backward;
}		}

// Positive distance bigger than max vectorization factor.		// Positive distance bigger than max vectorization factor.
MaxSafeDepDistBytes = Distance < MaxSafeDepDistBytes ?		// FIXME: Should use max factor instead of max distance in bytes, which could
		rengolin: This comment is confusing. You mean to say that 3 always has to be true, while 1 or 2 should be true. I'd rephrase this as: // Positive distance is safe when: // Distance <= MaxSafeDepDistBytes // And either one below is true: // * Distance < Stride * TypeByteSize // * Distance >= TypeByteSize * NumIterations * Stride
		HaoLiu: Done.
Distance : MaxSafeDepDistBytes;		// not handle different types.
		// E.g. Assume one char is 1 byte in memory and one int is 4 bytes.
		// void foo (int *A, char *B) {
		rengolin: I think the: Distance > MaxSafeDepDistBytes part deserves a separate debug message.
		HaoLiu: Done.
		// for (unsigned i = 0; i < 1024; i++) {
		// A[i+2] = A[i] + 1;
		// B[i+2] = B[i] + 1;
		// }
		// }
		//
		// This case is currently unsafe according to the max safe distance. If we
		// analyze the two accesses on array B, the max safe dependence distance
		// is 2. Then we analyze the accesses on array A, the minimum distance needed
		// is 8, which is less than 2 and forbidden vectorization, But actually
		// both A and B could be vectorized by 2 iterations.
		MaxSafeDepDistBytes =
		Distance < MaxSafeDepDistBytes ? Distance : MaxSafeDepDistBytes;

		aschwaighofer: MaxSafeDepDistBytes records the maximum distance that is still safe. Say we saw a maximum safe distance of 4 bytes before. Now we process a new pair of accesses and the distance would be 8 bytes, which is still greater than or equal to "TypeByteSize * NumIterations * Stride". It is not correct to say our MaxSafeDepDistBytes is 8 bytes now.
		HaoLiu: Hi Arnold, Your example does not apply. We've already checked "Distance <= MaxSafeDepDistBytes", so the Distance is always less than or equal to 4 bytes in your case. I've modified the comments and code to make it clearer: // If Distance < Stride * TypeByteSize, it is always safe, just keep current // MaxSafeDepDistBytes. // Otherwise need to update the MaxSafeDepDistBytes to be Distance. (We've // already checked that "Distance <= MaxSafeDepDistBytes").
		anemet: I am not sure I follow why you change the structure of the MaxSafeDepDistBytes logic here. Why aren't you simply comparing MaxSafeDepDistBytes with TypeByteSize * NumIter * Stride? Looks like this will modify the existing behavior for stride=1, which is probably not what you intend.
		HaoLiu: I think the previous logic is not correct. If "Distance > MaxSafeDepDistBytes", it is not safe to do vectorization.
		HaoLiu: Sorry, previously I misunderstood MaxSafeDepDistBytes. The new patch follows the meaning of MaxSafeDepDistBytes, and it also has the same logic for stride = 1. I also found a problem with MaxSafeDepDistBytes. It cannot handle cases with different kinds of types, like: void foo(int *A, char *B) { for (unsigned i = 0; i < 1024; i++) { A[i+2] = A[i] + 1; B[i+2] = B[i] + 1; } } I think we should use MaxFactor, which stands for the maximum number of iterations to be vectorized & unrolled. I've added a FIXME in the patch.
bool IsTrueDataDependence = (!AIsWrite && BIsWrite);		bool IsTrueDataDependence = (!AIsWrite && BIsWrite);
if (IsTrueDataDependence &&		if (IsTrueDataDependence &&
couldPreventStoreLoadForward(Distance, TypeByteSize))		couldPreventStoreLoadForward(Distance, TypeByteSize))
return Dependence::BackwardVectorizableButPreventsForwarding;		return Dependence::BackwardVectorizableButPreventsForwarding;

DEBUG(dbgs() << "LAA: Positive distance " << Val.getSExtValue() <<		DEBUG(dbgs() << "LAA: Positive distance " << Val.getSExtValue()
" with max VF = " << MaxSafeDepDistBytes / TypeByteSize << '\n');		<< " with max VF = "
		<< MaxSafeDepDistBytes / (TypeByteSize * Stride) << '\n');

return Dependence::BackwardVectorizable;		return Dependence::BackwardVectorizable;
}		}
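
The limitation described in the FIXME above (a byte-based maximum safe distance shared by accesses of different element sizes) can be reproduced with a minimal sketch. The struct `DistanceCheckSketch` and its `check` method are hypothetical; they only model the two distance comparisons and the update of `MaxSafeDepDistBytes` shown in this diff, not the rest of `isDependent`.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>

// Hypothetical model of the two distance checks shown in the diff.
struct DistanceCheckSketch {
  // Assumed to start at "no limit", tracked in bytes across all pairs.
  unsigned MaxSafeDepDistBytes = UINT_MAX;

  // Returns true if the access pair passes both distance checks.
  bool check(unsigned Distance, unsigned TypeByteSize, unsigned Stride,
             unsigned MinNumIter) {
    assert(MinNumIter >= 2 && "At least two iterations are vectorized");
    unsigned MinDistanceNeeded =
        TypeByteSize * Stride * (MinNumIter - 1) + TypeByteSize;
    if (MinDistanceNeeded > Distance)
      return false; // Distance too small for this pair.
    if (MinDistanceNeeded > MaxSafeDepDistBytes)
      return false; // Exceeds the limit recorded by an earlier pair.
    MaxSafeDepDistBytes = std::min(Distance, MaxSafeDepDistBytes);
    return true;
  }
};
```

Analyzing the char accesses B[i+2] and B[i] first (distance 2 bytes) records a limit of 2 bytes. The int accesses A[i+2] and A[i] then need 8 bytes and are rejected, even though both pairs are two iterations apart.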

bool MemoryDepChecker::areDepsSafe(DepCandidates &AccessSets,		bool MemoryDepChecker::areDepsSafe(DepCandidates &AccessSets,
MemAccessInfoSet &CheckDeps,		MemAccessInfoSet &CheckDeps,
const ValueToValueMap &Strides) {		const ValueToValueMap &Strides) {

[... 78 lines not shown ...]
void MemoryDepChecker::Dependence::print(		void MemoryDepChecker::Dependence::print(
raw_ostream &OS, unsigned Depth,		raw_ostream &OS, unsigned Depth,
const SmallVectorImpl<Instruction *> &Instrs) const {		const SmallVectorImpl<Instruction *> &Instrs) const {
OS.indent(Depth) << DepName[Type] << ":\n";		OS.indent(Depth) << DepName[Type] << ":\n";
OS.indent(Depth + 2) << *Instrs[Source] << " -> \n";		OS.indent(Depth + 2) << *Instrs[Source] << " -> \n";
OS.indent(Depth + 2) << *Instrs[Destination] << "\n";		OS.indent(Depth + 2) << *Instrs[Destination] << "\n";
}		}

bool LoopAccessInfo::canAnalyzeLoop() {		bool LoopAccessInfo::canAnalyzeLoop() {
// We need to have a loop header.		// We need to have a loop header.
DEBUG(dbgs() << "LAA: Found a loop: " <<		DEBUG(dbgs() << "LAA: Found a loop: " <<
TheLoop->getHeader()->getName() << '\n');		TheLoop->getHeader()->getName() << '\n');

// We can only analyze innermost loops.		// We can only analyze innermost loops.
if (!TheLoop->empty()) {		if (!TheLoop->empty()) {
DEBUG(dbgs() << "LAA: loop is not the innermost loop\n");		DEBUG(dbgs() << "LAA: loop is not the innermost loop\n");
emitAnalysis(LoopAccessReport() << "loop is not the innermost loop");		emitAnalysis(LoopAccessReport() << "loop is not the innermost loop");
		mzolotukhin: Please add some comments on this struct and its members.
		HaoLiu: Done.
return false;		return false;
}		}

// We must have a single backedge.		// We must have a single backedge.
if (TheLoop->getNumBackEdges() != 1) {		if (TheLoop->getNumBackEdges() != 1) {
DEBUG(dbgs() << "LAA: loop control flow is not understood by analyzer\n");		DEBUG(dbgs() << "LAA: loop control flow is not understood by analyzer\n");
emitAnalysis(		emitAnalysis(
LoopAccessReport() <<		LoopAccessReport() <<
[... 18 lines not shown ...]	if (TheLoop->getExitingBlock() != TheLoop->getLoopLatch()) {
emitAnalysis(		emitAnalysis(
LoopAccessReport() <<		LoopAccessReport() <<
"loop control flow is not understood by analyzer");		"loop control flow is not understood by analyzer");
return false;		return false;
}		}

// ScalarEvolution needs to be able to find the exit count.		// ScalarEvolution needs to be able to find the exit count.
const SCEV *ExitCount = SE->getBackedgeTakenCount(TheLoop);		const SCEV *ExitCount = SE->getBackedgeTakenCount(TheLoop);
if (ExitCount == SE->getCouldNotCompute()) {		if (ExitCount == SE->getCouldNotCompute()) {
emitAnalysis(LoopAccessReport() <<		emitAnalysis(LoopAccessReport() <<
		mzolotukhin: Using `empty()` is more efficient than `size()==0`.
		HaoLiu: Good idea!
"could not determine number of loop iterations");		"could not determine number of loop iterations");
DEBUG(dbgs() << "LAA: SCEV could not compute the loop exit count.\n");		DEBUG(dbgs() << "LAA: SCEV could not compute the loop exit count.\n");
return false;		return false;
}		}

return true;		return true;
}		}

[... 9 lines not shown ...]	void LoopAccessInfo::analyzeLoop(const ValueToValueMap &Strides) {
// Holds all the different accesses in the loop.		// Holds all the different accesses in the loop.
unsigned NumReads = 0;		unsigned NumReads = 0;
unsigned NumReadWrites = 0;		unsigned NumReadWrites = 0;

PtrRtCheck.Pointers.clear();		PtrRtCheck.Pointers.clear();
PtrRtCheck.Need = false;		PtrRtCheck.Need = false;

const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel();		const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel();

		mzolotukhinUnsubmitted Not Done Reply Inline Actions I think you missed my question earlier on this part. This algorithm is clearly quadratic, which is dangerous for compile time. We need to either make it linear, or add some limits (e.g. if number of accesses > 100, give up and don't even try to group them). Also, if we separate it into two phases: 1) grouping accesses, 2) finding write-after-write pairs, I think we can make it more efficient. Since it's sufficient to compare against any access from a group, we'll be quadratic on the number of groups, not the number of accesses. And when we look for WaW pairs, we can only look only inside a group, not across all accesses in the loop. Does it sound reasonable, or am I missing something? mzolotukhin: I think you missed my question earlier on this part. This algorithm is clearly quadratic…
		HaoLiuAuthorUnsubmitted Not Done Reply Inline Actions For the first question. Previously, I thought you meant to improve the insert member method. Anyway, I think the algorithm is quadratic, the paper says: This relies on the pivoting order in which the pairs of accesses are processed and reduces the complexity of the grouping algorithm: we iterate over O(n^2 ) pairs of loads and stores (where n is the number of loads and stores) But I think we don't need to worry about compile time: (1) The "n" is only the number of stride loads/stores (\|Stride\| > 1) (2) For a case with a lot of loads and stores, it possiblely can not pass previous dependence check, because it could have a lot of runtime memory check or have memory conflict. So it returns early. On the other hand, if a loop indeed can pass dependence check. say if it has 100 loads, we should analyze the interleave accesses as vectorization on the possible interleave groups can really give us a lot of benefit. HaoLiu: For the first question. Previously, I thought you meant to improve the insert member method.
HaoLiu: For the second question, I think it is slightly different. We should not only consider group-versus-group pairs; a group and a single store may also break a dependence. Say:

    A[i] = a;     // (1)
    B[i] = b;     // (2)
    A[i+1] = c;   // (3)

Combining (1) and (3) may break the dependence on the "A[i] - B[i]" pair even though B[i] is not in a group. So I think we should search all the write-write pairs. If we separate this into two phases, we'll have to calculate the distance again, which I think may increase the compile time.
mzolotukhin: Hm.. I'm reading the paper from here: http://domino.research.ibm.com/library/cyberdig.nsf/papers/EFD6206D4D9A9186852570D2005D9373/$File/H-0235.pdf And it says: "This relies on the pivoting order in which pairs of accesses are processed, and implies that the complexity of the grouping algorithm is linear in the number of loads and stores." I'm not sure it's safe to assume that the dependence check won't be passed. E.g. we might have the following case:

    struct A { int x,y; };
    struct A points[100];
    for (int iter = 0; iter < 10000; iter++) { // not to be unrolled
      for (int i = 0; i < 100; i++) { // this loop will be completely unrolled
        points[i].x++;
      }
      // some other computations on the points
    }

When the innermost loop is unrolled, we get 100 interleaving accesses, which will require 10^4 pairwise comparisons. I agree that we can theoretically optimize it better if we are not limited on time. However, if a programmer decides to increase the number of points by a factor of 10, the compile time might slow down 100x, which won't be acceptable. We have a lot of optimizations that could theoretically find the best solution if they were allowed to consume an unbounded amount of time, but we usually have thresholds for them (such thresholds often don't prevent one from catching the cases the optimization is aimed at, but they guard against rare but very nasty cases when the compiler might work for hours).

For the second question, my understanding is that we should have two groups: {A[i], A[i+1]} and {B[i]}. Then we check whether an access from the first group can alias with an access from the second group. I think we don't have to check every access in a group for this kind of reasoning. I.e. if the leaders of the groups might alias, then any member of the first group can alias with any member of the second group (and if the leaders don't alias, then we can probably assume that the groups are totally independent of each other). Am I missing something here?
HaoLiu: I agree with you. I could also find many similar thresholds. I'll add such a limitation to the patch. I think you are right about the second question. I'll try to implement this in the next patch.
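The two ideas agreed on in this thread (a hard cap on the number of accesses, and comparing a new access only against each group's leader) can be sketched as follows. This is a standalone illustration, not code from the patch: `Access`, `Group`, `groupAccesses` and the `MaxAccessesToGroup` value are hypothetical names, and the sketch assumes accesses to the same array are visited in increasing offset order so that the leader holds the lowest offset of its group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a strided load/store: an array id plus a
// constant element offset relative to the induction variable.
struct Access {
  int Base;
  int64_t Offset;
};

// A group remembers its array and the indices of its member accesses.
// Members.front() is the leader.
struct Group {
  int Base;
  std::vector<std::size_t> Members;
};

// Assumed cut-off, echoing the "more than 100 accesses, give up" suggestion.
constexpr std::size_t MaxAccessesToGroup = 100;

// Returns false (leaving Groups empty) when the loop has too many accesses.
// Otherwise the cost is O(accesses * groups), not O(accesses^2).
bool groupAccesses(const std::vector<Access> &Accesses, int64_t Stride,
                   std::vector<Group> &Groups) {
  Groups.clear();
  if (Accesses.size() > MaxAccessesToGroup)
    return false;
  for (std::size_t I = 0; I < Accesses.size(); ++I) {
    bool Placed = false;
    for (Group &G : Groups) {
      // Compare against the leader only, never against every member.
      const Access &Leader = Accesses[G.Members.front()];
      int64_t Dist = Accesses[I].Offset - Leader.Offset;
      if (Accesses[I].Base == G.Base && Dist >= 0 && Dist < Stride) {
        G.Members.push_back(I);
        Placed = true;
        break;
      }
    }
    if (!Placed)
      Groups.push_back(Group{Accesses[I].Base, {I}});
  }
  return true;
}
```

For the `A[i]`, `B[i]`, `A[i+1]` example with stride 2, this yields the two groups `{A[i], A[i+1]}` and `{B[i]}` described above.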
// For each block.		// For each block.
for (Loop::block_iterator bb = TheLoop->block_begin(),		for (Loop::block_iterator bb = TheLoop->block_begin(),
be = TheLoop->block_end(); bb != be; ++bb) {		be = TheLoop->block_end(); bb != be; ++bb) {

// Scan the BB and collect legal loads and stores.		// Scan the BB and collect legal loads and stores.
for (BasicBlock::iterator it = (*bb)->begin(), e = (*bb)->end(); it != e;		for (BasicBlock::iterator it = (*bb)->begin(), e = (*bb)->end(); it != e;
++it) {		++it) {

[46 lines not shown]	for (BasicBlock::iterator it = (*bb)->begin(), e = (*bb)->end(); it != e;
}		}
NumStores++;		NumStores++;
Stores.push_back(St);		Stores.push_back(St);
DepChecker.addAccess(St);		DepChecker.addAccess(St);
}		}
} // Next instr.		} // Next instr.
} // Next block.		} // Next block.

// Now we have two lists that hold the loads and the stores.		// Now we have two lists that hold the loads and the stores.
// Next, we find the pointers that they use.		// Next, we find the pointers that they use.
mzolotukhin: Factoring out this function could go in a separate patch.
mzolotukhin: Could we use a range loop here?
HaoLiu: Done.

// Check if we see any stores. If there are no stores, then we don't		// Check if we see any stores. If there are no stores, then we don't
// care if the pointers are restrict.		// care if the pointers are restrict.
mzolotukhin: Range-based loop?
HaoLiu: Done.
if (!Stores.size()) {		if (!Stores.size()) {
DEBUG(dbgs() << "LAA: Found a read-only loop!\n");		DEBUG(dbgs() << "LAA: Found a read-only loop!\n");
CanVecMem = true;		CanVecMem = true;
return;		return;
}		}
mzolotukhin: Why not return right after we discover that the block needs predication?
HaoLiu: It's slightly different. We still support a predicated block if it doesn't have loads/stores, because in that case vectorizing the interleaved loads/stores won't break any dependences. But if a predicated block has loads/stores, we would need to handle an interleave group that mixes normal and predicated loads/stores.
rengolin: This might be a case for indexed loads on AVX, but for now, let's keep it simple. :)

MemoryDepChecker::DepCandidates DependentAccesses;		MemoryDepChecker::DepCandidates DependentAccesses;
AccessAnalysis Accesses(TheLoop->getHeader()->getModule()->getDataLayout(),		AccessAnalysis Accesses(TheLoop->getHeader()->getModule()->getDataLayout(),
AA, LI, DependentAccesses);		AA, LI, DependentAccesses);

// Holds the analyzed pointers. We don't want to call GetUnderlyingObjects		// Holds the analyzed pointers. We don't want to call GetUnderlyingObjects
// multiple times on the same object. If the ptr is accessed twice, once		// multiple times on the same object. If the ptr is accessed twice, once
// for read and once for write, it will only appear once (on the write		// for read and once for write, it will only appear once (on the write
// list). This is okay, since we are going to check for conflicts between		// list). This is okay, since we are going to check for conflicts between
// writes and between reads and writes, but not between reads and reads.		// writes and between reads and writes, but not between reads and reads.
ValueSet Seen;		ValueSet Seen;

ValueVector::iterator I, IE;		ValueVector::iterator I, IE;
for (I = Stores.begin(), IE = Stores.end(); I != IE; ++I) {		for (I = Stores.begin(), IE = Stores.end(); I != IE; ++I) {
StoreInst *ST = cast<StoreInst>(*I);		StoreInst *ST = cast<StoreInst>(*I);
Value* Ptr = ST->getPointerOperand();		Value* Ptr = ST->getPointerOperand();
mzolotukhin: This change can go in a separate patch.
mzolotukhin: What's the rationale behind this? It might be inefficient to vectorize a group with a huge gap, but isn't that a question for the cost model rather than for the analysis?
HaoLiu: This is just a conservative approach. I think that, generally, if a group is missing more than half of its members, vectorizing it is not very beneficial. It also doesn't sound reasonable to still call it an "interleaved" group. But I agree with you that it's the cost model's job to decide whether it is beneficial. The new patch keeps all load groups. There is also a new situation: the cost of a load group with a huge gap may even be higher than that of the scalar operations, in which case I think we need to fall back to scalar operations. I've added a FIXME in the new patch, as I haven't found an efficient way to fix this.
rengolin: I also agree that the cost model should get this right, but currently there is no way it can do that. However, since the interleaved vectorisation is not enabled by default, I think keeping all the groups and fixing the cost model later is the right thing to do.
// Check for store to loop invariant address.		// Check for store to loop invariant address.
StoreToLoopInvariantAddress |= isUniform(Ptr);		StoreToLoopInvariantAddress |= isUniform(Ptr);
// If we did not see this pointer before, insert it to the read-write		// If we did not see this pointer before, insert it to the read-write
// list. At this phase it is only a 'write' list.		// list. At this phase it is only a 'write' list.
if (Seen.insert(Ptr).second) {		if (Seen.insert(Ptr).second) {
++NumReadWrites;		++NumReadWrites;

AliasAnalysis::Location Loc = AA->getLocation(ST);		AliasAnalysis::Location Loc = AA->getLocation(ST);
[11 lines not shown]	if (IsAnnotatedParallel) {
DEBUG(dbgs()		DEBUG(dbgs()
<< "LAA: A loop annotated parallel, ignore memory dependency "		<< "LAA: A loop annotated parallel, ignore memory dependency "
<< "checks.\n");		<< "checks.\n");
CanVecMem = true;		CanVecMem = true;
return;		return;
}		}

for (I = Loads.begin(), IE = Loads.end(); I != IE; ++I) {		for (I = Loads.begin(), IE = Loads.end(); I != IE; ++I) {
LoadInst *LD = cast<LoadInst>(*I);		LoadInst *LD = cast<LoadInst>(*I);
Value* Ptr = LD->getPointerOperand();		Value* Ptr = LD->getPointerOperand();
mzolotukhin: And this change can go in a separate patch.
// If we did not see this pointer before, insert it to the		// If we did not see this pointer before, insert it to the
// read list. If we did see it before, then it is already in		// read list. If we did see it before, then it is already in
// the read-write list. This allows us to vectorize expressions		// the read-write list. This allows us to vectorize expressions
// such as A[i] += x; Because the address of A[i] is a read-write		// such as A[i] += x; Because the address of A[i] is a read-write
// pointer. This only works if the index of A[i] is consecutive.		// pointer. This only works if the index of A[i] is consecutive.
// If the address of i is unknown (for example A[B[i]]) then we may		// If the address of i is unknown (for example A[B[i]]) then we may
// read a few words, modify, and write a few words, and some of the		// read a few words, modify, and write a few words, and some of the
// words may be written to the same address.		// words may be written to the same address.
bool IsReadOnlyPtr = false;		bool IsReadOnlyPtr = false;
if (Seen.insert(Ptr).second || !isStridedPtr(SE, Ptr, TheLoop, Strides)) {		if (Seen.insert(Ptr).second || !isStridedPtr(SE, Ptr, TheLoop, Strides)) {
++NumReads;		++NumReads;
IsReadOnlyPtr = true;		IsReadOnlyPtr = true;
}		}

AliasAnalysis::Location Loc = AA->getLocation(LD);		AliasAnalysis::Location Loc = AA->getLocation(LD);
// The TBAA metadata could have a control dependency on the predication		// The TBAA metadata could have a control dependency on the predication
// condition, so we cannot rely on it when determining whether or not we		// condition, so we cannot rely on it when determining whether or not we
// need runtime pointer checks.		// need runtime pointer checks.
if (blockNeedsPredication(LD->getParent(), TheLoop, DT))		if (blockNeedsPredication(LD->getParent(), TheLoop, DT))
Loc.AATags.TBAA = nullptr;		Loc.AATags.TBAA = nullptr;

mzolotukhin: Could we just check if `PosA` dominates `PosB`?
HaoLiu: A good idea!
Accesses.addLoad(Loc, IsReadOnlyPtr);		Accesses.addLoad(Loc, IsReadOnlyPtr);
}		}

// If we write (or read-write) to a single destination and there are no		// If we write (or read-write) to a single destination and there are no
// other reads in this loop then is it safe to vectorize.		// other reads in this loop then is it safe to vectorize.
if (NumReadWrites == 1 && NumReads == 0) {		if (NumReadWrites == 1 && NumReads == 0) {
DEBUG(dbgs() << "LAA: Found a write-only loop!\n");		DEBUG(dbgs() << "LAA: Found a write-only loop!\n");
CanVecMem = true;		CanVecMem = true;
[85 lines not shown]	bool LoopAccessInfo::blockNeedsPredication(BasicBlock *BB, Loop *TheLoop,
// Blocks that do not dominate the latch need predication.		// Blocks that do not dominate the latch need predication.
BasicBlock* Latch = TheLoop->getLoopLatch();		BasicBlock* Latch = TheLoop->getLoopLatch();
return !DT->dominates(BB, Latch);		return !DT->dominates(BB, Latch);
}		}

void LoopAccessInfo::emitAnalysis(LoopAccessReport &Message) {		void LoopAccessInfo::emitAnalysis(LoopAccessReport &Message) {
assert(!Report && "Multiple reports generated");		assert(!Report && "Multiple reports generated");
Report = Message;		Report = Message;
}		}
rengolin: Move this to InterleaveGroup class.
HaoLiu: Done.

bool LoopAccessInfo::isUniform(Value *V) const {		bool LoopAccessInfo::isUniform(Value *V) const {
return (SE->isLoopInvariant(SE->getSCEV(V), TheLoop));		return (SE->isLoopInvariant(SE->getSCEV(V), TheLoop));
}		}

// FIXME: this function is currently a duplicate of the one in		// FIXME: this function is currently a duplicate of the one in
// LoopVectorize.cpp.		// LoopVectorize.cpp.
static Instruction *getFirstInst(Instruction *FirstInst, Value *V,		static Instruction *getFirstInst(Instruction *FirstInst, Value *V,
Instruction *Loc) {		Instruction *Loc) {
if (FirstInst)		if (FirstInst)
return FirstInst;		return FirstInst;
if (Instruction *I = dyn_cast<Instruction>(V))		if (Instruction *I = dyn_cast<Instruction>(V))
return I->getParent() == Loc->getParent() ? I : nullptr;		return I->getParent() == Loc->getParent() ? I : nullptr;
return nullptr;		return nullptr;
}		}

std::pair<Instruction *, Instruction *> LoopAccessInfo::addRuntimeCheck(		std::pair<Instruction *, Instruction *> LoopAccessInfo::addRuntimeCheck(
Instruction *Loc, const SmallVectorImpl<int> *PtrPartition) const {		Instruction *Loc, const SmallVectorImpl<int> *PtrPartition) const {
if (!PtrRtCheck.Need)		if (!PtrRtCheck.Need)
return std::make_pair(nullptr, nullptr);		return std::make_pair(nullptr, nullptr);

unsigned NumPointers = PtrRtCheck.Pointers.size();		unsigned NumPointers = PtrRtCheck.Pointers.size();
SmallVector<TrackingVH<Value> , 2> Starts;		SmallVector<TrackingVH<Value> , 2> Starts;
mzolotukhin: s/abi/ABI/
SmallVector<TrackingVH<Value> , 2> Ends;		SmallVector<TrackingVH<Value> , 2> Ends;

LLVMContext &Ctx = Loc->getContext();		LLVMContext &Ctx = Loc->getContext();
SCEVExpander Exp(*SE, DL, "induction");		SCEVExpander Exp(*SE, DL, "induction");
Instruction *FirstInst = nullptr;		Instruction *FirstInst = nullptr;

for (unsigned i = 0; i < NumPointers; ++i) {		for (unsigned i = 0; i < NumPointers; ++i) {
Value *Ptr = PtrRtCheck.Pointers[i];		Value *Ptr = PtrRtCheck.Pointers[i];
const SCEV *Sc = SE->getSCEV(Ptr);		const SCEV *Sc = SE->getSCEV(Ptr);

if (SE->isLoopInvariant(Sc, TheLoop)) {		if (SE->isLoopInvariant(Sc, TheLoop)) {
DEBUG(dbgs() << "LAA: Adding RT check for a loop invariant ptr:" <<		DEBUG(dbgs() << "LAA: Adding RT check for a loop invariant ptr:" <<
mzolotukhin: s/has/have/
*Ptr <<"\n");		*Ptr <<"\n");
Starts.push_back(Ptr);		Starts.push_back(Ptr);
Ends.push_back(Ptr);		Ends.push_back(Ptr);
} else {		} else {
DEBUG(dbgs() << "LAA: Adding RT check for range:" << *Ptr << '\n');		DEBUG(dbgs() << "LAA: Adding RT check for range:" << *Ptr << '\n');
unsigned AS = Ptr->getType()->getPointerAddressSpace();		unsigned AS = Ptr->getType()->getPointerAddressSpace();

mzolotukhin: s/base. There/base there/ and s/dependeces/dependences/
// Use this type for pointer arithmetic.		// Use this type for pointer arithmetic.
Type *PtrArithTy = Type::getInt8PtrTy(Ctx, AS);		Type *PtrArithTy = Type::getInt8PtrTy(Ctx, AS);

Value *Start = Exp.expandCodeFor(PtrRtCheck.Starts[i], PtrArithTy, Loc);		Value *Start = Exp.expandCodeFor(PtrRtCheck.Starts[i], PtrArithTy, Loc);
Value *End = Exp.expandCodeFor(PtrRtCheck.Ends[i], PtrArithTy, Loc);		Value *End = Exp.expandCodeFor(PtrRtCheck.Ends[i], PtrArithTy, Loc);
Starts.push_back(Start);		Starts.push_back(Start);
Ends.push_back(End);		Ends.push_back(End);
}		}
}		}

IRBuilder<> ChkBuilder(Loc);		IRBuilder<> ChkBuilder(Loc);
mzolotukhin: s/gurantees/guarantees/
// Our instructions might fold to a constant.		// Our instructions might fold to a constant.
Value *MemoryRuntimeCheck = nullptr;		Value *MemoryRuntimeCheck = nullptr;
for (unsigned i = 0; i < NumPointers; ++i) {		for (unsigned i = 0; i < NumPointers; ++i) {
for (unsigned j = i+1; j < NumPointers; ++j) {		for (unsigned j = i+1; j < NumPointers; ++j) {
if (!PtrRtCheck.needsChecking(i, j, PtrPartition))		if (!PtrRtCheck.needsChecking(i, j, PtrPartition))
rengolin: An early exit here would be good: if (!StridedAccess.size()) return;
HaoLiu: Done.
continue;		continue;
mzolotukhin: s/gurantee/guarantee/

unsigned AS0 = Starts[i]->getType()->getPointerAddressSpace();		unsigned AS0 = Starts[i]->getType()->getPointerAddressSpace();
unsigned AS1 = Starts[j]->getType()->getPointerAddressSpace();		unsigned AS1 = Starts[j]->getType()->getPointerAddressSpace();

assert((AS0 == Ends[j]->getType()->getPointerAddressSpace()) &&		assert((AS0 == Ends[j]->getType()->getPointerAddressSpace()) &&
(AS1 == Ends[i]->getType()->getPointerAddressSpace()) &&		(AS1 == Ends[i]->getType()->getPointerAddressSpace()) &&
"Trying to bounds check pointers with different address spaces");		"Trying to bounds check pointers with different address spaces");

Type *PtrArithTy0 = Type::getInt8PtrTy(Ctx, AS0);		Type *PtrArithTy0 = Type::getInt8PtrTy(Ctx, AS0);
Type *PtrArithTy1 = Type::getInt8PtrTy(Ctx, AS1);		Type *PtrArithTy1 = Type::getInt8PtrTy(Ctx, AS1);

Value *Start0 = ChkBuilder.CreateBitCast(Starts[i], PtrArithTy0, "bc");		Value *Start0 = ChkBuilder.CreateBitCast(Starts[i], PtrArithTy0, "bc");
Value *Start1 = ChkBuilder.CreateBitCast(Starts[j], PtrArithTy1, "bc");		Value *Start1 = ChkBuilder.CreateBitCast(Starts[j], PtrArithTy1, "bc");
Value *End0 = ChkBuilder.CreateBitCast(Ends[i], PtrArithTy1, "bc");		Value *End0 = ChkBuilder.CreateBitCast(Ends[i], PtrArithTy1, "bc");
Value *End1 = ChkBuilder.CreateBitCast(Ends[j], PtrArithTy0, "bc");		Value *End1 = ChkBuilder.CreateBitCast(Ends[j], PtrArithTy0, "bc");

rengolin: This is a bit of a waste. I know the DenseMap lookup is quick, but going through all the instructions again is O(N log M), with N being the number of instructions and M being the number of memory instructions in the map. We're not expecting the number of memory instructions to be that great, so if you keep an array/vector of them in insertion order (guaranteed by the previous loop), you can just iterate through them in reverse order.
HaoLiu: Very useful suggestion. I've refactored this.
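A minimal sketch of the suggestion above, under the assumption that memory instructions are recorded once during the forward scan and later visited bottom-up. `Inst`, `collectMemoryInsts` and `namesBottomUp` are hypothetical names used only for illustration; they are not the patch's types.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical instruction record: a name and whether it touches memory.
struct Inst {
  std::string Name;
  bool IsMemory;
};

// Forward scan: remember memory instructions in program (insertion) order.
std::vector<const Inst *> collectMemoryInsts(const std::vector<Inst> &Block) {
  std::vector<const Inst *> Mem;
  for (const Inst &I : Block)
    if (I.IsMemory)
      Mem.push_back(&I);
  return Mem;
}

// Bottom-up walk over the recorded list only. No second pass over every
// instruction and no per-instruction map lookup is needed.
std::vector<std::string> namesBottomUp(const std::vector<const Inst *> &Mem) {
  std::vector<std::string> Out;
  for (auto It = Mem.rbegin(); It != Mem.rend(); ++It)
    Out.push_back((*It)->Name);
  return Out;
}
```

The second walk is then proportional to the number of memory instructions rather than to the number of instructions in the loop.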
Value *Cmp0 = ChkBuilder.CreateICmpULE(Start0, End1, "bound0");		Value *Cmp0 = ChkBuilder.CreateICmpULE(Start0, End1, "bound0");
FirstInst = getFirstInst(FirstInst, Cmp0, Loc);		FirstInst = getFirstInst(FirstInst, Cmp0, Loc);
mzolotukhin: I think I haven't completely understood how the paper suggests doing the grouping in linear time, but the given implementation looks quadratic. If we are OK with keeping it that way, we need to at least add some limits so that we won't blow up compile time for huge loop bodies with masses of interleaved accesses.
Value *Cmp1 = ChkBuilder.CreateICmpULE(Start1, End0, "bound1");		Value *Cmp1 = ChkBuilder.CreateICmpULE(Start1, End0, "bound1");
FirstInst = getFirstInst(FirstInst, Cmp1, Loc);		FirstInst = getFirstInst(FirstInst, Cmp1, Loc);
Value *IsConflict = ChkBuilder.CreateAnd(Cmp0, Cmp1, "found.conflict");		Value *IsConflict = ChkBuilder.CreateAnd(Cmp0, Cmp1, "found.conflict");
FirstInst = getFirstInst(FirstInst, IsConflict, Loc);		FirstInst = getFirstInst(FirstInst, IsConflict, Loc);
if (MemoryRuntimeCheck) {		if (MemoryRuntimeCheck) {
rengolin: In my prototype, I created a Groups class, which would encapsulate this logic. I did this because I don't like the unwritten dependency that isAccessInterleaved(Inst) has with InterleaveGroupMap[Inst]. If you want to avoid creating that extra class, I suggest you create a method getInterleavedAccessGroup(Inst) which looks up and returns the element if present, or nullptr if not. This is different from DenseMap's [] operator / lookup() method, and is what you need here.

    InterleaveGroup *Group = getInterleavedAccessGroup(A);
    if (!Group) {
      Group = createInterleavedAccessGroup(A, DesA.Stride, DesA.Align);
    }

with

    InterleaveGroup *getInterleavedAccessGroup(Inst *I) {
      if (InterleaveGroupMap.count(I))
        return InterleaveGroupMap[I];
      return nullptr;
    }

and

    InterleaveGroup *createInterleavedAccessGroup(Inst *I, Stride, Align) {
      assert(InterleaveGroupMap.count(I) == 0);
      InterleaveGroupMap[I] = new InterleaveGroup(I, Stride, Align);
      return InterleaveGroupMap[I];
    }

Though, this is basically an object in itself.
HaoLiu: Very useful suggestion. I've refactored this.
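One way to turn the lookup-or-null idea above into a small owning class is sketched below. It uses the standard library rather than LLVM's `DenseMap` so that it stands alone; `Inst`, `InterleaveGroup`, `InterleaveGroups`, `getGroup` and `createGroup` are illustrative names and are not taken from the patch.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

struct Inst {}; // hypothetical placeholder for an instruction

struct InterleaveGroup {
  const Inst *Leader;
  int Stride;
  unsigned Align;
};

// Owns all groups and hides the map, so callers cannot accidentally create
// an empty entry the way operator[] on a map would.
class InterleaveGroups {
  std::unordered_map<const Inst *, std::unique_ptr<InterleaveGroup>> Map;

public:
  // Returns the group for I, or nullptr if I is not in any group.
  InterleaveGroup *getGroup(const Inst *I) const {
    auto It = Map.find(I);
    return It == Map.end() ? nullptr : It->second.get();
  }

  // Creates a new group led by I. I must not already have a group.
  InterleaveGroup *createGroup(const Inst *I, int Stride, unsigned Align) {
    assert(!getGroup(I) && "instruction already has a group");
    std::unique_ptr<InterleaveGroup> &Slot = Map[I];
    Slot.reset(new InterleaveGroup{I, Stride, Align});
    return Slot.get();
  }
};
```

A caller would first try `getGroup(A)` and fall back to `createGroup(A, Stride, Align)` only when it returns `nullptr`, which mirrors the call sequence suggested in the comment.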
IsConflict = ChkBuilder.CreateOr(MemoryRuntimeCheck, IsConflict,		IsConflict = ChkBuilder.CreateOr(MemoryRuntimeCheck, IsConflict,
"conflict.rdx");		"conflict.rdx");
FirstInst = getFirstInst(FirstInst, IsConflict, Loc);		FirstInst = getFirstInst(FirstInst, IsConflict, Loc);
}		}
MemoryRuntimeCheck = IsConflict;		MemoryRuntimeCheck = IsConflict;
}		}
}		}

if (!MemoryRuntimeCheck)		if (!MemoryRuntimeCheck)
return std::make_pair(nullptr, nullptr);		return std::make_pair(nullptr, nullptr);

// We have to do this trickery because the IRBuilder might fold the check to a		// We have to do this trickery because the IRBuilder might fold the check to a
rengolin: The list of accesses I proposed above would also get rid of these two extra checks per iteration.
// constant expression in which case there is no Instruction anchored in a		// constant expression in which case there is no Instruction anchored in a
// the block.		// the block.
Instruction *Check = BinaryOperator::CreateAnd(MemoryRuntimeCheck,		Instruction *Check = BinaryOperator::CreateAnd(MemoryRuntimeCheck,
mzolotukhin: Shouldn't it be `getSExtValue` since the distance can be negative?
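The concern above is that reading a fixed-width constant as unsigned turns a negative distance into a huge positive one. The standalone sketch below shows the difference between zero extension and sign extension on plain integers; `zeroExtend` and `signExtend` are hypothetical helpers written for this illustration and are not LLVM APIs, and the sketch assumes bit widths between 1 and 63.

```cpp
#include <cassert>
#include <cstdint>

// Interpret the low `Bits` bits of Raw as an unsigned value.
uint64_t zeroExtend(uint64_t Raw, unsigned Bits) {
  assert(Bits >= 1 && Bits < 64 && "sketch only handles widths 1..63");
  return Raw & ((uint64_t(1) << Bits) - 1);
}

// Interpret the low `Bits` bits of Raw as a two's complement signed value.
int64_t signExtend(uint64_t Raw, unsigned Bits) {
  assert(Bits >= 1 && Bits < 64 && "sketch only handles widths 1..63");
  uint64_t Val = Raw & ((uint64_t(1) << Bits) - 1);
  uint64_t SignBit = uint64_t(1) << (Bits - 1);
  // Flipping the sign bit and subtracting it again maps the upper half of
  // the unsigned range onto the negative numbers.
  return static_cast<int64_t>(Val ^ SignBit) - static_cast<int64_t>(SignBit);
}
```

With a 32-bit distance of -8 (bit pattern `0xFFFFFFF8`), the sign-extending read gives -8 while the zero-extending read gives 4294967288, which a later comparison would treat as a very large forward distance.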
ConstantInt::getTrue(Ctx));		ConstantInt::getTrue(Ctx));
ChkBuilder.Insert(Check, "memcheck.conflict");		ChkBuilder.Insert(Check, "memcheck.conflict");
FirstInst = getFirstInst(FirstInst, Check, Loc);		FirstInst = getFirstInst(FirstInst, Check, Loc);
return std::make_pair(FirstInst, Check);		return std::make_pair(FirstInst, Check);
}		}
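The check that `addRuntimeCheck` emits as IR can be restated on plain integers. The sketch below is a simplified model, assuming each pointer is summarized by an inclusive `[Start, End]` byte range; `Range`, `rangesConflict` and `anyConflict` are hypothetical names, and the partition filtering done by `needsChecking` is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical summary of the addresses one pointer touches in the loop.
struct Range {
  uint64_t Start;
  uint64_t End; // inclusive
};

// Mirrors bound0 (Start0 <= End1), bound1 (Start1 <= End0) and their AND,
// which the IR names "found.conflict".
bool rangesConflict(const Range &A, const Range &B) {
  bool Bound0 = A.Start <= B.End;
  bool Bound1 = B.Start <= A.End;
  return Bound0 && Bound1;
}

// OR-reduces the pairwise results, like the "conflict.rdx" chain.
bool anyConflict(const std::vector<Range> &Ranges) {
  bool Conflict = false;
  for (std::size_t I = 0; I < Ranges.size(); ++I)
    for (std::size_t J = I + 1; J < Ranges.size(); ++J)
      Conflict = Conflict || rangesConflict(Ranges[I], Ranges[J]);
  return Conflict;
}
```

Two ranges are reported as conflicting exactly when they share at least one address, so adjacent but disjoint ranges pass the check.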

LoopAccessInfo::LoopAccessInfo(Loop *L, ScalarEvolution *SE,		LoopAccessInfo::LoopAccessInfo(Loop *L, ScalarEvolution *SE,
const DataLayout &DL,		const DataLayout &DL,
const TargetLibraryInfo *TLI, AliasAnalysis *AA,		const TargetLibraryInfo *TLI, AliasAnalysis *AA,
DominatorTree *DT, LoopInfo *LI,		DominatorTree *DT, LoopInfo *LI,
const ValueToValueMap &Strides)		const ValueToValueMap &Strides)
: DepChecker(SE, L), NumComparisons(0), TheLoop(L), SE(SE), DL(DL),		: DepChecker(SE, L), NumComparisons(0), TheLoop(L), SE(SE), DL(DL),
TLI(TLI), AA(AA), DT(DT), LI(LI), NumLoads(0), NumStores(0),		TLI(TLI), AA(AA), DT(DT), LI(LI), NumLoads(0), NumStores(0),
MaxSafeDepDistBytes(-1U), CanVecMem(false),		MaxSafeDepDistBytes(-1U), CanVecMem(false),
StoreToLoopInvariantAddress(false) {		StoreToLoopInvariantAddress(false) {
mzolotukhin: This is redundant.
if (canAnalyzeLoop())		if (canAnalyzeLoop())
analyzeLoop(Strides);		analyzeLoop(Strides);
}		}

void LoopAccessInfo::print(raw_ostream &OS, unsigned Depth) const {		void LoopAccessInfo::print(raw_ostream &OS, unsigned Depth) const {
if (CanVecMem) {		if (CanVecMem) {
if (PtrRtCheck.Need)		if (PtrRtCheck.Need)
OS.indent(Depth) << "Memory dependences are safe with run-time checks\n";		OS.indent(Depth) << "Memory dependences are safe with run-time checks\n";
[57 lines not shown]

bool LoopAccessAnalysis::runOnFunction(Function &F) {		bool LoopAccessAnalysis::runOnFunction(Function &F) {
SE = &getAnalysis<ScalarEvolution>();		SE = &getAnalysis<ScalarEvolution>();
auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();		auto *TLIP = getAnalysisIfAvailable<TargetLibraryInfoWrapperPass>();
TLI = TLIP ? &TLIP->getTLI() : nullptr;		TLI = TLIP ? &TLIP->getTLI() : nullptr;
AA = &getAnalysis<AliasAnalysis>();		AA = &getAnalysis<AliasAnalysis>();
DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();		DT = &getAnalysis<DominatorTreeWrapperPass>().getDomTree();
LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();		LI = &getAnalysis<LoopInfoWrapperPass>().getLoopInfo();

rengolin: Yet another candidate to be in a Groups class.
return false;		return false;
}		}

void LoopAccessAnalysis::getAnalysisUsage(AnalysisUsage &AU) const {		void LoopAccessAnalysis::getAnalysisUsage(AnalysisUsage &AU) const {
AU.addRequired<ScalarEvolution>();		AU.addRequired<ScalarEvolution>();
AU.addRequired<AliasAnalysis>();		AU.addRequired<AliasAnalysis>();
AU.addRequired<DominatorTreeWrapperPass>();		AU.addRequired<DominatorTreeWrapperPass>();
AU.addRequired<LoopInfoWrapperPass>();		AU.addRequired<LoopInfoWrapperPass>();

AU.setPreservesAll();		AU.setPreservesAll();
}		}

char LoopAccessAnalysis::ID = 0;		char LoopAccessAnalysis::ID = 0;
static const char laa_name[] = "Loop Access Analysis";		static const char laa_name[] = "Loop Access Analysis";
rengolin: More code that would be simplified by having a getInterleaveGroup() method, or a separate class.
#define LAA_NAME "loop-accesses"		#define LAA_NAME "loop-accesses"

INITIALIZE_PASS_BEGIN(LoopAccessAnalysis, LAA_NAME, laa_name, false, true)		INITIALIZE_PASS_BEGIN(LoopAccessAnalysis, LAA_NAME, laa_name, false, true)
INITIALIZE_AG_DEPENDENCY(AliasAnalysis)		INITIALIZE_AG_DEPENDENCY(AliasAnalysis)
INITIALIZE_PASS_DEPENDENCY(ScalarEvolution)		INITIALIZE_PASS_DEPENDENCY(ScalarEvolution)
INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)		INITIALIZE_PASS_DEPENDENCY(LoopInfoWrapperPass)
INITIALIZE_PASS_END(LoopAccessAnalysis, LAA_NAME, laa_name, false, true)		INITIALIZE_PASS_END(LoopAccessAnalysis, LAA_NAME, laa_name, false, true)

namespace llvm {		namespace llvm {
Pass *createLAAPass() {		Pass *createLAAPass() {
return new LoopAccessAnalysis();		return new LoopAccessAnalysis();
}		}
}		}

lib/Analysis/TargetTransformInfo.cpp

	[229 lines not shown]

	unsigned			unsigned
	TargetTransformInfo::getMaskedMemoryOpCost(unsigned Opcode, Type *Src,			TargetTransformInfo::getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
	unsigned Alignment,			unsigned Alignment,
	unsigned AddressSpace) const {			unsigned AddressSpace) const {
	return TTIImpl->getMaskedMemoryOpCost(Opcode, Src, Alignment, AddressSpace);			return TTIImpl->getMaskedMemoryOpCost(Opcode, Src, Alignment, AddressSpace);
	}			}

				unsigned TargetTransformInfo::getInterleavedMemoryOpCost(
				unsigned Opcode, Type *VecTy, unsigned Delta, ArrayRef<unsigned> Indices,
				unsigned Alignment, unsigned AddressSpace) const {
				return TTIImpl->getInterleavedMemoryOpCost(Opcode, VecTy, Delta, Indices,
				Alignment, AddressSpace);
				}

	unsigned			unsigned
	TargetTransformInfo::getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,			TargetTransformInfo::getIntrinsicInstrCost(Intrinsic::ID ID, Type *RetTy,
	ArrayRef<Type *> Tys) const {			ArrayRef<Type *> Tys) const {
	return TTIImpl->getIntrinsicInstrCost(ID, RetTy, Tys);			return TTIImpl->getIntrinsicInstrCost(ID, RetTy, Tys);
	}			}

	unsigned TargetTransformInfo::getCallInstrCost(Function *F, Type *RetTy,			unsigned TargetTransformInfo::getCallInstrCost(Function *F, Type *RetTy,
	ArrayRef<Type *> Tys) const {			ArrayRef<Type *> Tys) const {
	[79 lines not shown]

lib/Transforms/Vectorize/LoopVectorize.cpp

[28 lines not shown]
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// The reduction-variable vectorization is based on the paper:		// The reduction-variable vectorization is based on the paper:
// D. Nuzman and R. Henderson. Multi-platform Auto-vectorization.		// D. Nuzman and R. Henderson. Multi-platform Auto-vectorization.
//		//
// Variable uniformity checks are inspired by:		// Variable uniformity checks are inspired by:
// Karrenberg, R. and Hack, S. Whole Function Vectorization.		// Karrenberg, R. and Hack, S. Whole Function Vectorization.
//		//
		// The interleaved access vectorization is based on the paper:
		// Dorit Nuzman, Ira Rosen and Ayal Zaks. Auto-Vectorization of Interleaved
		// Data for SIMD
		//
// Other ideas/concepts are from:		// Other ideas/concepts are from:
// A. Zaks and D. Nuzman. Autovectorization in GCC-two years later.		// A. Zaks and D. Nuzman. Autovectorization in GCC-two years later.
//		//
// S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua. An Evaluation of		// S. Maleki, Y. Gao, M. Garzaran, T. Wong and D. Padua. An Evaluation of
// Vectorizing Compilers.		// Vectorizing Compilers.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

[... 84 lines not shown ...]
/// for (i = 0; i < N; i+=4)		/// for (i = 0; i < N; i+=4)
/// A[i:i+3] += ...		/// A[i:i+3] += ...
/// } else		/// } else
/// ...		/// ...
static cl::opt<bool> EnableMemAccessVersioning(		static cl::opt<bool> EnableMemAccessVersioning(
"enable-mem-access-versioning", cl::init(true), cl::Hidden,		"enable-mem-access-versioning", cl::init(true), cl::Hidden,
cl::desc("Enable symblic stride memory access versioning"));		cl::desc("Enable symblic stride memory access versioning"));

		static cl::opt<bool> EnableInterleavedMemAccesses(
		"enable-interleaved-mem-accesses", cl::init(false), cl::Hidden,
		cl::desc("Enable vectorization on interleaved memory accesses in a loop"));

		/// Maximum stride for an interleaved memory access.
		static cl::opt<unsigned> MaxInterleaveStride(
		"max-interleave-access-stride", cl::Hidden,
		cl::desc("Maximum interleaved access stride (default = 8)"), cl::init(8));

/// We don't unroll loops with a known constant trip count below this number.		/// We don't unroll loops with a known constant trip count below this number.
static const unsigned TinyTripCountUnrollThreshold = 128;		static const unsigned TinyTripCountUnrollThreshold = 128;

static cl::opt<unsigned> ForceTargetNumScalarRegs(		static cl::opt<unsigned> ForceTargetNumScalarRegs(
"force-target-num-scalar-regs", cl::init(0), cl::Hidden,		"force-target-num-scalar-regs", cl::init(0), cl::Hidden,
cl::desc("A flag that overrides the target's number of scalar registers."));		cl::desc("A flag that overrides the target's number of scalar registers."));

static cl::opt<unsigned> ForceTargetNumVectorRegs(		static cl::opt<unsigned> ForceTargetNumVectorRegs(
"force-target-num-vector-regs", cl::init(0), cl::Hidden,		"force-target-num-vector-regs", cl::init(0), cl::Hidden,
cl::desc("A flag that overrides the target's number of vector registers."));		cl::desc("A flag that overrides the target's number of vector registers."));

/// Maximum vectorization interleave count.		/// Maximum vectorization interleave count.
static const unsigned MaxInterleaveFactor = 16;		static const unsigned MaxInterleaveFactor = 16;

static cl::opt<unsigned> ForceTargetMaxScalarInterleaveFactor(		static cl::opt<unsigned> ForceTargetMaxScalarInterleaveFactor(
"force-target-max-scalar-interleave", cl::init(0), cl::Hidden,		"force-target-max-scalar-interleave", cl::init(0), cl::Hidden,
cl::desc("A flag that overrides the target's max interleave factor for "		cl::desc("A flag that overrides the target's max interleave factor for "
"scalar loops."));		"scalar loops."));

		rengolin: If we need a flag for the moment, it's to make it false by default, so we can turn it on when we want, not the other way around. Later, if the pass has proven correct, we should make it on by default. This transformation won't change unit stride loops anyway, so I don't see why we would need to disable it even temporarily. Another note, I don't think we need the extensive comment here, just one line like the others will be fine. Nor should we document how we vectorize, since if that changes, the comment will be automatically obsolete, and probably never changed.
		hfinkel: I disagree, the comment is good and we definitely should have it. However, it should not be here, but with the implementation (where it is also much less likely to get out of sync). Luckily, it seems that we already have such comments below, so this might be just removed. However, I don't want us to give the impression that extensive explanatory comments are undesirable in general.
		rengolin: We should have that kind of documentation, yes, but in the right place. Here, just one line would do. On other methods, just the part that relates to them. In the entry point, a more top view would be good, possibly with the C->IR part.
		HaoLiu: Done.
static cl::opt<unsigned> ForceTargetMaxVectorInterleaveFactor(		static cl::opt<unsigned> ForceTargetMaxVectorInterleaveFactor(
"force-target-max-vector-interleave", cl::init(0), cl::Hidden,		"force-target-max-vector-interleave", cl::init(0), cl::Hidden,
cl::desc("A flag that overrides the target's max interleave factor for "		cl::desc("A flag that overrides the target's max interleave factor for "
"vectorized loops."));		"vectorized loops."));

static cl::opt<unsigned> ForceTargetInstructionCost(		static cl::opt<unsigned> ForceTargetInstructionCost(
"force-target-instruction-cost", cl::init(0), cl::Hidden,		"force-target-instruction-cost", cl::init(0), cl::Hidden,
cl::desc("A flag that overrides the target's expected cost for "		cl::desc("A flag that overrides the target's expected cost for "
[... 182 lines not shown ...]	protected:

/// When we go over instructions in the basic block we rely on previous		/// When we go over instructions in the basic block we rely on previous
/// values within the current basic block or on loop invariant values.		/// values within the current basic block or on loop invariant values.
/// When we widen (vectorize) values we place them in the map. If the values		/// When we widen (vectorize) values we place them in the map. If the values
/// are not within the map, they have to be loop invariant, so we simply		/// are not within the map, they have to be loop invariant, so we simply
/// broadcast them into a vector.		/// broadcast them into a vector.
VectorParts &getVectorValue(Value *V);		VectorParts &getVectorValue(Value *V);

		/// Try to vectorize the interleaved access group that \p Instr belongs to.
		void vectorizeInterleaveGroup(Instruction *Instr);

/// Generate a shuffle sequence that will reverse the vector Vec.		/// Generate a shuffle sequence that will reverse the vector Vec.
virtual Value *reverseVector(Value *Vec);		virtual Value *reverseVector(Value *Vec);

/// This is a helper class that holds the vectorizer state. It maps scalar		/// This is a helper class that holds the vectorizer state. It maps scalar
/// instructions to vector instructions. When the code is 'unrolled' then		/// instructions to vector instructions. When the code is 'unrolled' then
/// then a single scalar value is mapped to multiple vector parts. The parts		/// then a single scalar value is mapped to multiple vector parts. The parts
/// are stored in the VectorPart type.		/// are stored in the VectorPart type.
struct ValueMap {		struct ValueMap {
[... 178 lines not shown ...]

/// \brief Propagate known metadata from one instruction to a vector of others.		/// \brief Propagate known metadata from one instruction to a vector of others.
static void propagateMetadata(SmallVectorImpl<Value *> &To, const Instruction *From) {		static void propagateMetadata(SmallVectorImpl<Value *> &To, const Instruction *From) {
for (Value *V : To)		for (Value *V : To)
if (Instruction *I = dyn_cast<Instruction>(V))		if (Instruction *I = dyn_cast<Instruction>(V))
propagateMetadata(I, From);		propagateMetadata(I, From);
}		}

		/// \brief The group of interleaved loads/stores sharing the same stride and
		/// close to each other.
		///
		/// Each member in this group has an index starting from 0, and the largest
		/// index should be less than Delta (interleaved factor), which is the absolute
		/// value of the access stride.
		///
		/// E.g. An interleaved load group of Delta 4:
		/// for (unsigned i = 0; i < 1024; i+=4) {
		/// a = A[i]; // Member of index 0
		/// b = A[i+1]; // Member of index 1
		/// d = A[i+3]; // Member of index 3
		/// ...
		/// }
		///
		/// An interleaved store group of Delta 4:
		/// for (unsigned i = 0; i < 1024; i+=4) {
		/// ...
		/// A[i] = a; // Member of index 0
		/// A[i+1] = b; // Member of index 1
		/// A[i+2] = c; // Member of index 2
		/// A[i+3] = d; // Member of index 3
		/// }
		///
		/// Note: the interleaved load group could have gap (missing members), but
		/// the interleaved store group doesn't allow gap.
		class InterleaveGroup {
		public:
		InterleaveGroup(Instruction *Instr, int Stride, unsigned Align)
		: Align(Align), SmallestKey(0), LargestKey(0), InsertPos(Instr) {
		assert(Align && "The alignment should be non-zero");

		Delta = std::abs(Stride);
		assert(Delta > 1 && "Invalid interleave factor");

		Reverse = Stride < 0;
		Members[0] = Instr;
		}

		bool isReverse() const { return Reverse; }
		unsigned getDelta() const { return Delta; }
		unsigned getAlignment() const { return Align; }
		unsigned getNumMembers() const { return Members.size(); }

		/// \brief Try to insert a new member \p Instr with index \p Index and
		/// alignment \p NewAlign. The index is related to the leader and it could be
		/// negative if it is the new leader.
		///
		/// \returns false if the instruction doesn't belong to the group.
		bool insertMember(Instruction *Instr, int Index, unsigned NewAlign) {
		assert(NewAlign && "The new member's alignment should be non-zero");

		int Key = Index + SmallestKey;

		// Skip if there is already a member with the same index.
		if (Members.count(Key))
		return false;

		if (Key > LargestKey) {
		// The largest index is always less than Delta.
		if (Index >= static_cast<int>(Delta))
		return false;

		LargestKey = Key;
		} else if (Key < SmallestKey) {
		// The largest index is always less than Delta.
		if (LargestKey - Key >= static_cast<int>(Delta))
		return false;

		SmallestKey = Key;
		}

		// It's always safe to select the minimum alignment.
		Align = std::min(Align, NewAlign);
		Members[Key] = Instr;
		return true;
		}

		/// \brief Get the member with the given index \p Index
		///
		/// \returns nullptr if contains no such member.
		Instruction *getMember(unsigned Index) const {
		int Key = SmallestKey + Index;
		if (!Members.count(Key))
		return nullptr;

		return Members.find(Key)->second;
		}

		/// \brief Get the index for the given member. Unlike the key in the member
		/// map, the index starts from 0.
		unsigned getIndex(Instruction *Instr) const {
		for (auto I : Members)
		if (I.second == Instr)
		return I.first - SmallestKey;

		llvm_unreachable("InterleaveGroup contains no such member");
		}

		Instruction *getInsertPos() const { return InsertPos; }
		void setInsertPos(Instruction *Inst) { InsertPos = Inst; }

		private:
		unsigned Delta; // Interleave Factor.
		mzolotukhin: Could we rename `Delta` to `InterleaveFactor` or something like this? When roaming through the code it'd be much easier to understand what it stands for, especially when the code is in-tree and one doesn't see the context of the current patch.
		HaoLiu: Reasonable. Renamed.
		bool Reverse;
		unsigned Align;
		DenseMap<int, Instruction *> Members;
		int SmallestKey;
		int LargestKey;

		// To avoid breaking dependences, vectorized instructions of an interleave
		// group should be inserted at either the first load or the last store in
		// program order.
		//
		// E.g. %even = load i32 // Insert Position
		// %add = add i32 %even // Use of %even
		// %odd = load i32
		//
		// store i32 %even
		// %odd = add i32 // Def of %odd
		// store i32 %odd // Insert Position
		Instruction *InsertPos;
		};
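
The key/index bookkeeping in `insertMember` and `getIndex` above can be sketched outside LLVM. The following is a minimal model only, not the patch's code: `MiniInterleaveGroup` is a hypothetical name, members are plain integer ids instead of `Instruction *`, `std::map` stands in for `DenseMap`, alignment tracking is dropped, and `getIndex` returns -1 instead of calling `llvm_unreachable`.

```cpp
#include <cassert>
#include <cstdlib>
#include <map>

// Hypothetical stand-in for InterleaveGroup (illustration only).
class MiniInterleaveGroup {
public:
  MiniInterleaveGroup(int LeaderId, int Stride)
      : Factor(std::abs(Stride)), SmallestKey(0), LargestKey(0) {
    assert(Factor > 1 && "Invalid interleave factor");
    Members[0] = LeaderId; // The first member always gets key 0.
  }

  // Index is relative to the current leader (the member at SmallestKey).
  // A negative Index makes the new member the leader.
  bool insertMember(int Id, int Index) {
    int Key = Index + SmallestKey;
    if (Members.count(Key))
      return false; // Slot already taken.
    if (Key > LargestKey) {
      if (Index >= Factor)
        return false; // Largest index must stay below the factor.
      LargestKey = Key;
    } else if (Key < SmallestKey) {
      if (LargestKey - Key >= Factor)
        return false; // Span of the group would exceed the factor.
      SmallestKey = Key;
    }
    Members[Key] = Id;
    return true;
  }

  // Zero-based index of Id within the group, or -1 if Id is not a member.
  int getIndex(int Id) const {
    for (const auto &KV : Members)
      if (KV.second == Id)
        return KV.first - SmallestKey;
    return -1;
  }

private:
  int Factor;
  int SmallestKey;
  int LargestKey;
  std::map<int, int> Members; // Key -> member id.
};
```

Keys are fixed once assigned, so inserting a new leader only moves `SmallestKey`; the indices reported for existing members shift accordingly without touching the map.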

		/// \brief Drive the analysis of interleaved memory accesses in the loop.
		///
		/// Call this class to analyze interleaved accesses only when we can vectorize
		mzolotukhin: s/Call this class/Use this class/
		HaoLiu: Fixed.
		/// a loop. Otherwise it's meaningless to do analysis as the vectorization
		/// on interleaved accesses is unsafe.
		///
		/// The analysis collects interleave groups and records the relationships
		/// between the member and the group in a map.
		class InterleavedAccessInfo {
		public:
		InterleavedAccessInfo(ScalarEvolution *SE, Loop *L, DominatorTree *DT)
		: SE(SE), TheLoop(L), DT(DT) {}

		~InterleavedAccessInfo() {
		SmallSet<InterleaveGroup *, 4> DelSet;
		// Avoid releasing a pointer twice.
		for (auto &I : InterleaveGroupMap)
		DelSet.insert(I.second);
		for (auto *Ptr : DelSet)
		delete Ptr;
		}

		/// \brief Analyze the interleaved accesses and collect them in interleave
		/// groups. Substitute symbolic strides using \p Strides.
		void analyzeInterleaving(const ValueToValueMap &Strides);

		/// \brief Check if \p Instr belongs to any interleave group.
		bool isInterleaved(Instruction *Instr) const {
		return InterleaveGroupMap.count(Instr);
		}

		/// \brief Get the interleave group that \p Instr belongs to.
		///
		/// \returns nullptr if doesn't have such group.
		InterleaveGroup *getInterleaveGroup(Instruction *Instr) const {
		if (InterleaveGroupMap.count(Instr))
		return InterleaveGroupMap.find(Instr)->second;
		return nullptr;
		}

		private:
		ScalarEvolution *SE;
		Loop *TheLoop;
		DominatorTree *DT;

		/// Holds the relationships between the members and the interleave group.
		DenseMap<Instruction *, InterleaveGroup *> InterleaveGroupMap;

		/// \brief The descriptor for a strided memory access.
		struct StrideDescriptor {
		StrideDescriptor(int Stride, const SCEV *Scev, unsigned Size,
		unsigned Align)
		: Stride(Stride), Scev(Scev), Size(Size), Align(Align) {}

		StrideDescriptor() : Stride(0), Scev(nullptr), Size(0), Align(0) {}

		int Stride; // The access's stride. It is negative for a reverse access.
		const SCEV *Scev; // The scalar expression of this access
		unsigned Size; // The size of the memory object.
		unsigned Align; // The alignment of this access.
		};

		/// \brief Create a new interleave group with the given instruction \p Instr,
		/// stride \p Stride and alignment \p Align.
		///
		/// \returns the newly created interleave group.
		InterleaveGroup *createInterleaveGroup(Instruction *Instr, int Stride,
		unsigned Align) {
		assert(!InterleaveGroupMap.count(Instr) &&
		"Already in an interleaved access group");
		InterleaveGroupMap[Instr] = new InterleaveGroup(Instr, Stride, Align);
		return InterleaveGroupMap[Instr];
		}

		/// \brief Release the group and remove all the relationships.
		void releaseGroup(InterleaveGroup *Group) {
		for (unsigned i = 0; i < Group->getDelta(); i++)
		if (Instruction *Member = Group->getMember(i))
		InterleaveGroupMap.erase(Member);

		delete Group;
		}

		/// \brief Collect all the accesses with a constant stride in program order.
		void collectConstStridedAccesses(
		MapVector<Instruction *, StrideDescriptor> &StrideAccesses,
		const ValueToValueMap &Strides);
		};

/// LoopVectorizationLegality checks if it is legal to vectorize a loop, and		/// LoopVectorizationLegality checks if it is legal to vectorize a loop, and
/// to what vectorization factor.		/// to what vectorization factor.
/// This class does not look at the profitability of vectorization, only the		/// This class does not look at the profitability of vectorization, only the
/// legality. This class has two main kinds of checks:		/// legality. This class has two main kinds of checks:
/// * Memory checks - The code in canVectorizeMemory checks if vectorization		/// * Memory checks - The code in canVectorizeMemory checks if vectorization
/// will change the order of memory accesses in a way that will change the		/// will change the order of memory accesses in a way that will change the
/// correctness of the program.		/// correctness of the program.
/// * Scalars checks - The code in canVectorizeInstrs and canVectorizeMemory		/// * Scalars checks - The code in canVectorizeInstrs and canVectorizeMemory
/// checks for a number of different conditions, such as the availability of a		/// checks for a number of different conditions, such as the availability of a
/// single induction variable, that all types are supported and vectorize-able,		/// single induction variable, that all types are supported and vectorize-able,
/// etc. This code reflects the capabilities of InnerLoopVectorizer.		/// etc. This code reflects the capabilities of InnerLoopVectorizer.
/// This class is also used by InnerLoopVectorizer for identifying		/// This class is also used by InnerLoopVectorizer for identifying
/// induction variable and the different reduction variables.		/// induction variable and the different reduction variables.
class LoopVectorizationLegality {		class LoopVectorizationLegality {
public:		public:
LoopVectorizationLegality(Loop *L, ScalarEvolution *SE, DominatorTree *DT,		LoopVectorizationLegality(Loop *L, ScalarEvolution *SE, DominatorTree *DT,
TargetLibraryInfo *TLI, AliasAnalysis *AA,		TargetLibraryInfo *TLI, AliasAnalysis *AA,
Function *F, const TargetTransformInfo *TTI,		Function *F, const TargetTransformInfo *TTI,
LoopAccessAnalysis *LAA)		LoopAccessAnalysis *LAA)
: NumPredStores(0), TheLoop(L), SE(SE), TLI(TLI), TheFunction(F),		: NumPredStores(0), TheLoop(L), SE(SE), TLI(TLI), TheFunction(F),
TTI(TTI), DT(DT), LAA(LAA), LAI(nullptr), Induction(nullptr),		TTI(TTI), DT(DT), LAA(LAA), LAI(nullptr), InterleaveInfo(SE, L, DT),
WidestIndTy(nullptr), HasFunNoNaNAttr(false) {}		Induction(nullptr), WidestIndTy(nullptr), HasFunNoNaNAttr(false) {}

/// This enum represents the kinds of inductions that we support.		/// This enum represents the kinds of inductions that we support.
enum InductionKind {		enum InductionKind {
IK_NoInduction, ///< Not an induction variable.		IK_NoInduction, ///< Not an induction variable.
		mzolotukhin: Nitpick: s/twice,/twice./
		HaoLiu: Fixed.
IK_IntInduction, ///< Integer induction variable. Step = C.		IK_IntInduction, ///< Integer induction variable. Step = C.
IK_PtrInduction ///< Pointer induction var. Step = C / sizeof(elem).		IK_PtrInduction ///< Pointer induction var. Step = C / sizeof(elem).
};		};

/// A struct for saving information about induction variables.		/// A struct for saving information about induction variables.
struct InductionInfo {		struct InductionInfo {
InductionInfo(Value *Start, InductionKind K, ConstantInt *Step)		InductionInfo(Value *Start, InductionKind K, ConstantInt *Step)
: StartValue(Start), IK(K), StepValue(Step) {		: StartValue(Start), IK(K), StepValue(Step) {
[... 72 lines not shown ...]	public:
bool canVectorize();		bool canVectorize();

/// Returns the Induction variable.		/// Returns the Induction variable.
PHINode *getInduction() { return Induction; }		PHINode *getInduction() { return Induction; }

/// Returns the reduction variables found in the loop.		/// Returns the reduction variables found in the loop.
ReductionList *getReductionVars() { return &Reductions; }		ReductionList *getReductionVars() { return &Reductions; }

/// Returns the induction variables found in the loop.		/// Returns the induction variables found in the loop.
		rengolin: You're already making sure the item is there. I think you should have an llvm_unreachable() here.
		HaoLiu: Done.
InductionList *getInductionVars() { return &Inductions; }		InductionList *getInductionVars() { return &Inductions; }

/// Returns the widest induction type.		/// Returns the widest induction type.
Type *getWidestInductionType() { return WidestIndTy; }		Type *getWidestInductionType() { return WidestIndTy; }

/// Returns True if V is an induction variable in this loop.		/// Returns True if V is an induction variable in this loop.
		rengolin: Since this is a struct, there's no way to guarantee that any of these members will be used after initialization. Please modify the constructors to initialise all members with their default values (0, 1, nullptr, etc).
		HaoLiu: Agree.
bool isInductionVariable(const Value *V);		bool isInductionVariable(const Value *V);

/// Return true if the block BB needs to be predicated in order for the loop		/// Return true if the block BB needs to be predicated in order for the loop
/// to be vectorized.		/// to be vectorized.
bool blockNeedsPredication(BasicBlock *BB);		bool blockNeedsPredication(BasicBlock *BB);

/// Check if this pointer is consecutive when vectorizing. This happens		/// Check if this pointer is consecutive when vectorizing. This happens
/// when the last index of the GEP is the induction variable, or that the		/// when the last index of the GEP is the induction variable, or that the
[... 12 lines not shown ...]	public:
bool isUniformAfterVectorization(Instruction* I) { return Uniforms.count(I); }		bool isUniformAfterVectorization(Instruction* I) { return Uniforms.count(I); }

/// Returns the information that we collected about runtime memory check.		/// Returns the information that we collected about runtime memory check.
const LoopAccessInfo::RuntimePointerCheck *getRuntimePointerCheck() const {		const LoopAccessInfo::RuntimePointerCheck *getRuntimePointerCheck() const {
return LAI->getRuntimePointerCheck();		return LAI->getRuntimePointerCheck();
}		}

const LoopAccessInfo *getLAI() const {		const LoopAccessInfo *getLAI() const {
return LAI;		return LAI;
		mzolotukhin: This change is unnecessary.
}		}

		/// \brief Check if \p Instr belongs to any interleaved access group.
		bool isAccessInterleaved(Instruction *Instr) {
		return InterleaveInfo.isInterleaved(Instr);
		}

		/// \brief Get the interleaved access group that \p Instr belongs to.
		const InterleaveGroup *getInterleavedAccessGroup(Instruction *Instr) {
		return InterleaveInfo.getInterleaveGroup(Instr);
		}

unsigned getMaxSafeDepDistBytes() { return LAI->getMaxSafeDepDistBytes(); }		unsigned getMaxSafeDepDistBytes() { return LAI->getMaxSafeDepDistBytes(); }

bool hasStride(Value *V) { return StrideSet.count(V); }		bool hasStride(Value *V) { return StrideSet.count(V); }
bool mustCheckStrides() { return !StrideSet.empty(); }		bool mustCheckStrides() { return !StrideSet.empty(); }
SmallPtrSet<Value *, 8>::iterator strides_begin() {		SmallPtrSet<Value *, 8>::iterator strides_begin() {
return StrideSet.begin();		return StrideSet.begin();
}		}
SmallPtrSet<Value *, 8>::iterator strides_end() { return StrideSet.end(); }		SmallPtrSet<Value *, 8>::iterator strides_end() { return StrideSet.end(); }
[... 79 lines not shown ...]	private:
/// Dominator Tree.		/// Dominator Tree.
DominatorTree *DT;		DominatorTree *DT;
// LoopAccess analysis.		// LoopAccess analysis.
LoopAccessAnalysis *LAA;		LoopAccessAnalysis *LAA;
// And the loop-accesses info corresponding to this loop. This pointer is		// And the loop-accesses info corresponding to this loop. This pointer is
// null until canVectorizeMemory sets it up.		// null until canVectorizeMemory sets it up.
const LoopAccessInfo *LAI;		const LoopAccessInfo *LAI;

		/// The interleave access information contains groups of interleaved accesses
		/// with the same stride and close to each other.
		InterleavedAccessInfo InterleaveInfo;

// --- vectorization state --- //		// --- vectorization state --- //

/// Holds the integer induction variable. This is the counter of the		/// Holds the integer induction variable. This is the counter of the
/// loop.		/// loop.
PHINode *Induction;		PHINode *Induction;
/// Holds the reduction variables.		/// Holds the reduction variables.
ReductionList Reductions;		ReductionList Reductions;
/// Holds all of the induction variables that we found in the loop.		/// Holds all of the induction variables that we found in the loop.
[... 849 lines not shown ...]	Value *InnerLoopVectorizer::reverseVector(Value *Vec) {
for (unsigned i = 0; i < VF; ++i)		for (unsigned i = 0; i < VF; ++i)
ShuffleMask.push_back(Builder.getInt32(VF - i - 1));		ShuffleMask.push_back(Builder.getInt32(VF - i - 1));

return Builder.CreateShuffleVector(Vec, UndefValue::get(Vec->getType()),		return Builder.CreateShuffleVector(Vec, UndefValue::get(Vec->getType()),
ConstantVector::get(ShuffleMask),		ConstantVector::get(ShuffleMask),
"reverse");		"reverse");
}		}

		// Get a mask to interleave \p NumVec vectors into a wide vector.
		// I.e. <0, VF, VF*2, ..., VF*(NumVec-1), 1, VF+1, VF*2+1, ...>
		mzolotukhin: Nitpick: s/I.E/I.e./
		// E.g. For 2 interleaved vectors, if VF is 4, the mask is:
		// <0, 4, 1, 5, 2, 6, 3, 7>
		static Constant *getInterleavedMask(IRBuilder<> &Builder, unsigned VF,
		unsigned NumVec) {
		rengolin: Make it VF*(NumVec-1) to be more clear in: // <0, VF, VF*2, ..., VF*(NumVec-1), 1, VF+1, VF*2+1, ...>
		SmallVector<Constant *, 16> Mask;
		for (unsigned i = 0; i < VF; i++)
		for (unsigned j = 0; j < NumVec; j++)
		Mask.push_back(Builder.getInt32(j * VF + i));

		return ConstantVector::get(Mask);
		}

		// Get the strided mask starting from index \p Start.
		// I.e. <Start, Start + Stride, ..., Start + Stride*(VF-1)>
		static Constant *getStridedMask(IRBuilder<> &Builder, unsigned Start,
		unsigned Stride, unsigned VF) {
		SmallVector<Constant *, 16> Mask;
		for (unsigned i = 0; i < VF; i++)
		Mask.push_back(Builder.getInt32(Start + i * Stride));

		return ConstantVector::get(Mask);
		}
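
To make the two mask shapes concrete, here is a minimal standalone sketch that reproduces only the index arithmetic of `getInterleavedMask` and `getStridedMask`. It assumes plain `std::vector<unsigned>` results instead of LLVM `Constant` vectors, and the function names `interleavedMaskIndices` and `stridedMaskIndices` are hypothetical, not part of the patch.

```cpp
#include <cassert>
#include <vector>

// Index pattern of getInterleavedMask:
// <0, VF, VF*2, ..., VF*(NumVec-1), 1, VF+1, VF*2+1, ...>
std::vector<unsigned> interleavedMaskIndices(unsigned VF, unsigned NumVec) {
  assert(VF > 0 && NumVec > 0 && "Expect a non-empty mask");
  std::vector<unsigned> Mask;
  for (unsigned i = 0; i < VF; i++)       // Position within each vector.
    for (unsigned j = 0; j < NumVec; j++) // Which input vector.
      Mask.push_back(j * VF + i);
  return Mask;
}

// Index pattern of getStridedMask:
// <Start, Start + Stride, ..., Start + Stride*(VF-1)>
std::vector<unsigned> stridedMaskIndices(unsigned Start, unsigned Stride,
                                         unsigned VF) {
  assert(VF > 0 && "Expect a non-empty mask");
  std::vector<unsigned> Mask;
  for (unsigned i = 0; i < VF; i++)
    Mask.push_back(Start + i * Stride);
  return Mask;
}
```

With VF = 4, the strided mask for start 1 and stride 3 is the G-element mask `<1, 4, 7, 10>` from the summary, and the interleaved mask for 3 vectors is the `<0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>` store mask.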

		// Get a mask of two parts: The first part consists of sequential integers
		// starting from 0, The second part consists of UNDEFs.
		// I.e. <0, 1, 2, ..., NumInt - 1, undef, ..., undef>
		static Constant *getSequentialMask(IRBuilder<> &Builder, unsigned NumInt,
		unsigned NumUndef) {
		SmallVector<Constant *, 16> Mask;
		for (unsigned i = 0; i < NumInt; i++)
		Mask.push_back(Builder.getInt32(i));

		Constant *Undef = UndefValue::get(Builder.getInt32Ty());
		for (unsigned i = 0; i < NumUndef; i++)
		Mask.push_back(Undef);

		return ConstantVector::get(Mask);
		}

		// Concatenate two vectors with the same element type. The 2nd vector should
		// not have more elements than the 1st vector. If the 2nd vector has less
		// elements, extend it with UNDEFs.
		static Value *ConcatenateTwoVectors(IRBuilder<> &Builder, Value *V1,
		Value *V2) {
		VectorType *VecTy1 = dyn_cast<VectorType>(V1->getType());
		VectorType *VecTy2 = dyn_cast<VectorType>(V2->getType());
		assert(VecTy1 && VecTy2 &&
		VecTy1->getScalarType() == VecTy2->getScalarType() &&
		"Expect two vectors with the same element type");

		unsigned NumElts1 = VecTy1->getNumElements();
		unsigned NumElts2 = VecTy2->getNumElements();
		rengolin: s/Concate/Concatenate/ I thought we had this kind of functionality... How is Elena doing the same for her indexed loads?
		HaoLiu: Done.
		assert(NumElts1 >= NumElts2 && "Unexpect the first vector has less elements");
		rengolin: I think this function could be a lot simpler if you just extended Vec2 first, if smaller, then concatenated both at the end, keeping all the asserts to make sure it's safe.

		if (NumElts1 > NumElts2) {
		// Extend with UNDEFs.
		Constant *ExtMask =
		getSequentialMask(Builder, NumElts2, NumElts1 - NumElts2);
		V2 = Builder.CreateShuffleVector(V2, UndefValue::get(VecTy2), ExtMask);
		}

		Constant *Mask = getSequentialMask(Builder, NumElts1 + NumElts2, 0);
		return Builder.CreateShuffleVector(V1, V2, Mask);
		}

		// Concatenate vectors in the given list. All vectors have the same type.
		static Value *ConcatenateVectors(IRBuilder<> &Builder,
		ArrayRef<Value *> InputList) {
		unsigned NumVec = InputList.size();
		assert(NumVec > 1 && "Should be at least two vectors");

		SmallVector<Value *, 8> ResList;
		ResList.append(InputList.begin(), InputList.end());
		do {
		SmallVector<Value *, 8> TmpList;
		for (unsigned i = 0; i < NumVec - 1; i += 2) {
		Value *V0 = ResList[i], *V1 = ResList[i + 1];
		assert((V0->getType() == V1->getType() || i == NumVec - 2) &&
		rengolin: s/Concate/Concatenate/
		"Only the last vector may have a different type");

		rengolin: While clever, this method is quite heavy in that it requires function calls to return a single list, so V0 and V1 below will always be constructed from a function call to return one of the elements. Since this will normally be a list of around 2/4 elements, it'll always be too heavy. I imagine you did this because your ConcatenateTwoVectors adds undef to the tail of the smaller vectors because shufflevector needs them to be of the same size, but transforming this into a loop wouldn't be too hard.
		TmpList.push_back(ConcatenateTwoVectors(Builder, V0, V1));
		mzolotukhin: Maybe it's worth mentioning here that though `ConcatenateTwoVectors` could extend a vector with UNDEFs, that could only happen with the last vector in the set. The current implementation guarantees that, but as the code evolves it can easily be violated if it's not stated clearly. Maybe an assertion would be even better here (e.g. if it's not the last pair of vectors, their sizes should be the same), but it might be overkill.
		}

		// Push the last vector if the total number of vectors is odd.
		if (NumVec % 2 != 0)
		TmpList.push_back(ResList[NumVec - 1]);

		ResList = TmpList;
		rengolin: Move the comment before the declaration of PartNum.
		mzolotukhin: Could we just assign `TmpList` to `VecList`?
		NumVec = ResList.size();
		} while (NumVec > 1);

		return ResList[0];
		}
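
The pairwise reduction in `ConcatenateVectors` can be illustrated with ordinary containers. This is a sketch under simplifying assumptions, not the patch's implementation: `concatTwo` and `concatAll` are hypothetical names, vectors are `std::vector<int>`, and because plain containers can be appended directly there is no UNDEF padding step as in `ConcatenateTwoVectors`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Hypothetical stand-in for ConcatenateTwoVectors: appends B to a copy of A.
Vec concatTwo(const Vec &A, const Vec &B) {
  Vec R(A);
  R.insert(R.end(), B.begin(), B.end());
  return R;
}

// Concatenate neighbouring pairs repeatedly until one vector is left.
// When the count is odd, the last vector is carried over to the next round.
Vec concatAll(std::vector<Vec> List) {
  assert(List.size() > 1 && "Should be at least two vectors");
  do {
    std::vector<Vec> Tmp;
    std::size_t N = List.size();
    for (std::size_t i = 0; i + 1 < N; i += 2)
      Tmp.push_back(concatTwo(List[i], List[i + 1]));
    if (N % 2 != 0)
      Tmp.push_back(List[N - 1]);
    List = Tmp;
  } while (List.size() > 1);
  return List[0];
}
```

For three inputs, the first round yields one joined pair plus the carried-over third vector, and the second round joins those two. In the patch, that second round is the case where the shorter operand is first widened with UNDEFs, since `shufflevector` takes two operands of the same type.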

		// Try to vectorize the interleave group that \p Instr belongs to.
		//
		// E.g. Translate following interleaved load group (Delta is 3):
		// for (i = 0; i < N; i+=3) {
		// R = Pic[i]; // Member of index 0
		mzolotukhin: I'd suggest starting the comment with describing what the function does, not from an example. Also, please add some comments about what is passed in the arguments. For instance, it's not obvious what is `Ptr` and what is `Instr` from the first glance. Also, if `Ptr` is always `Instr->getPointerOperand()` (`Instr` being `LoadInst` or `StoreInst`), is there any sense in passing it along with the `Instr`? The same comment actually relates to `VecTy` - I'd rather compute it one more time than introduce a new argument to the function.
		HaoLiu: Reasonable.
		// G = Pic[i+1]; // Member of index 1
		// B = Pic[i+2]; // Member of index 2
		// ... // do something to R, G, B
		// }
		// To:
		// %wide.vec = load <12 x i32> ; Read 4 tuples of R,G,B
		// %R.vec = shuffle %wide.vec, undef, <0, 3, 6, 9> ; R elements
		// %G.vec = shuffle %wide.vec, undef, <1, 4, 7, 10> ; G elements
		// %B.vec = shuffle %wide.vec, undef, <2, 5, 8, 11> ; B elements
		mzolotukhin: It's not 'read R,G,B', it's 'read 4 tuples of R,G,B'. I realize that it might be clear from the code though.
		HaoLiu: Yes, that's clearer.
		//
		// Or translate following interleaved store group (Delta is 3):
		// for (i = 0; i < N; i+=3) {
		// ... do something to R, G, B
		// Pic[i] = R; // Member of index 0
		// Pic[i+1] = G; // Member of index 1
		// Pic[i+2] = B; // Member of index 2
		// }
		// To:
		// %R_G.vec = shuffle %R.vec, %G.vec, <0, 1, 2, ..., 7>
		// %B_U.vec = shuffle %B.vec, undef, <0, 1, 2, 3, u, u, u, u>
		// %interleaved.vec = shuffle %R_G.vec, %B_U.vec,
		// <0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11> ; Interleave R,G,B elements
		// store <12 x i32> %interleaved.vec ; Write 4 tuples of R,G,B
		void InnerLoopVectorizer::vectorizeInterleaveGroup(Instruction *Instr) {
		const InterleaveGroup *Group = Legal->getInterleavedAccessGroup(Instr);
		assert(Group && "Fail to get an interleaved access group.");

		// Skip if current instruction is not the insert position.
		if (Instr != Group->getInsertPos())
		return;

		LoadInst *LI = dyn_cast<LoadInst>(Instr);
		StoreInst *SI = dyn_cast<StoreInst>(Instr);
		rengolin: IIRC, this function should only be called with the first load and last store. With the assert, you'll probably force that logic up the chain. I'm just guessing, but if you returned instead of asserting, wouldn't that work the same way, without needing the conditionals in the caller?
		Value *Ptr = LI ? LI->getPointerOperand() : SI->getPointerOperand();

		// Prepare for the vector type of the interleaved load/store.
		Type *ScalarTy = LI ? LI->getType() : SI->getValueOperand()->getType();
		unsigned Delta = Group->getDelta();
		rengolin: Do you need all this caching & copying?
		Type *VecTy = VectorType::get(ScalarTy, Delta * VF);
		Type *PtrTy = VecTy->getPointerTo(Ptr->getType()->getPointerAddressSpace());
		rengolin: Potential use of uninitialised Align member. This condition could easily fail for the wrong reasons.

		// Prepare for the new pointers.
		rengolin: This could be in the constructor, if no alignment is passed, no?
		setDebugLocFromInst(Builder, Ptr);
		VectorParts &PtrParts = getVectorValue(Ptr);
		SmallVector<Value *, 2> NewPtrs;
		unsigned Index = Group->getIndex(Instr);
		for (unsigned Part = 0; Part < UF; Part++) {
		// Extract the pointer for current instruction from the pointer vector. A
		mzolotukhin: Should we iterate to `VF` instead of `UF` here? `PtrParts` contains `VF` elements, right?
		HaoLiu: Slightly different. `PtrParts` contains UF vectors, and each vector has VF elements. Here we first want the pointer vector of the current unroll part, i.e. `PtrParts[Part]`. Then, for each pointer vector `PtrParts[Part]`, we want the pointer in lane 0. E.g.
		    for (unsigned i = 0; i < 1024; i+=2) { A[i] = i; }
		If VF is 4 and UF is 2, we'll have two unroll parts:
		    part 1: A[0] = 0; A[2] = 2; A[4] = 4; A[6] = 6;
		    part 2: A[8] = 8; ...
		In this case `PtrParts` contains two vectors:
		    Vector 1 consists of pointers to: A[0], A[2], A[4], A[6]
		    Vector 2 consists of pointers to: A[8], A[10], A[12], A[14]
		We extract lane 0 from Vector 1 to get a pointer to A[0], use it to load A[0-7], and then extract A[0, 2, 4, 6] from A[0-7]. If the loads/stores are reversed, we need the pointer in the last lane. E.g. the access sequence will be "A[1022], A[1020], A[1018], A[1016]", and the last lane is the pointer to A[1016]. We can load A[1016-1023] to extract A[1016, 1018, 1020, 1022].
		mzolotukhin: Thanks for the explanation, sounds good.
		// reverse access uses the pointer in the last lane.
		Value *NewPtr = Builder.CreateExtractElement(
		PtrParts[Part],
		Group->isReverse() ? Builder.getInt32(VF - 1) : Builder.getInt32(0));

		// Notice current instruction could be any index. Need to adjust the address
		// to the member of index 0.
		//
		// E.g. a = A[i+1]; // Member of index 1 (Current instruction)
		rengolin: Comment before code, pls.
		// b = A[i]; // Member of index 0
		// Current pointer is pointed to A[i+1], adjust it to A[i].
		rengolin: Why -Idx?
		HaoLiu: The wide vector load/store uses the address of the access with index 0. E.g. if we have two interleaved loads:
		    load A[i+1] // index 1 (insert position)
		    load A[i]   // index 0
		we need to use the address of A[i] to load {A[i], A[i+1], A[i+2], A[i+3], ...}. So the current pointer for A[i+1] needs to be decremented by 1.
		mzolotukhin: What if we have
		    load A[i]
		    load A[i+1]
		or
		    load A[i+2]
		    load A[i+1]
		    load A[i]
		?
		HaoLiu: For
		    load A[i]
		    load A[i+1]
		the insert position (i.e. the first load) has index 0, so there is no need to adjust. For
		    load A[i+2]
		    load A[i+1]
		    load A[i]
		the insert position is "load A[i+2]" (which has index 2), so we need to adjust to the address of "A[i]" (i.e. use an offset of "-2"). I'll add more comments here.
		//
		// E.g. A[i+1] = a; // Member of index 1
		mzolotukhin: I'm not sure it's correct if we vectorize and unroll the loop. Don't we need to adjust the pointer according to which unrolling part we are at (instead of using `0`)?
		HaoLiu: Yes, we do. As the comment above says, we get vectors consisting of pointers for all unroll parts.
		// A[i] = b; // Member of index 0
		// A[i+2] = c; // Member of index 2 (Current instruction)
		// Current pointer is pointed to A[i+2], adjust it to A[i].
		NewPtr = Builder.CreateGEP(NewPtr, Builder.getInt32(-Index));

		// Cast to the vector pointer type.
		NewPtrs.push_back(Builder.CreateBitCast(NewPtr, PtrTy));
		}
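
The lane selection and the `-Index` adjustment discussed in the threads above can be checked with a small standalone model. This is a sketch only: addresses are modelled as element indices into the array, and `wideAccessBases` with all of its parameters is a hypothetical helper, not code from the patch.

```cpp
#include <cassert>
#include <vector>

// Returns, for each unroll part, the element index at which the wide
// load/store would start.
//   Start   - element accessed by the insert-position instruction in the
//             first scalar iteration (illustrative parameter)
//   Stride  - elements advanced per scalar iteration
//   Index   - index of the insert-position instruction within its group
//   Reverse - whether the loop walks the array backwards
static std::vector<int> wideAccessBases(int Start, int Stride, int VF, int UF,
                                        int Index, bool Reverse) {
  assert(VF > 0 && UF > 0 && "need at least one lane and one part");
  std::vector<int> Bases;
  for (int Part = 0; Part < UF; ++Part) {
    // Model of PtrParts[Part]: one address per lane of this unroll part.
    std::vector<int> Lanes;
    for (int Lane = 0; Lane < VF; ++Lane) {
      int Iter = Part * VF + Lane;
      Lanes.push_back(Reverse ? Start - Iter * Stride : Start + Iter * Stride);
    }
    // Forward accesses use lane 0; reverse accesses use the last lane, which
    // holds the lowest address of the part.
    int Ptr = Reverse ? Lanes[VF - 1] : Lanes[0];
    // Step back to the member of index 0.
    Bases.push_back(Ptr - Index);
  }
  return Bases;
}
```

For the forward example from the thread (stride 2, VF 4, UF 2) this gives bases A[0] and A[8]. For the reverse example starting at A[1022], the first part's base is A[1016].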

		setDebugLocFromInst(Builder, Instr);
		Value *UndefVec = UndefValue::get(VecTy);

		rengolin: nitpick: better to have code+space before a new "section" (loads), instead of space+unrelated-code. ie: setDebugLocFromInst(Builder, Instr); . instead of . setDebugLocFromInst(Builder, Instr);
		// Vectorize the interleaved load group.
		if (LI) {
		for (unsigned Part = 0; Part < UF; Part++) {
		Instruction *NewLoadInstr = Builder.CreateAlignedLoad(
		NewPtrs[Part], Group->getAlignment(), "wide.vec");

		for (unsigned i = 0; i < Delta; i++) {
		Instruction *Member = Group->getMember(i);

		rengolin: This variable is loop invariant and load/store invariant, so could be moved out of the "if(IsLoad)" and used for both load and store blocks.
		mzolotukhin: Name `CallI` is very misleading.
		// Skip the gaps in the group.
		mzolotukhin: Name `CallI` is very misleading.
		if (!Member)
		continue;

		Constant *StrideMask = getStridedMask(Builder, i, Delta, VF);
		Value *StridedVec = Builder.CreateShuffleVector(
		NewLoadInstr, UndefVec, StrideMask, "strided.vec");

		// If this member has different type, cast the result type.
		if (Member->getType() != ScalarTy) {
		VectorType *OtherVTy = VectorType::get(Member->getType(), VF);
		StridedVec = Builder.CreateBitOrPointerCast(StridedVec, OtherVTy);
		}

		VectorParts &Entry = WidenMap.get(Member);
		Entry[Part] =
		Group->isReverse() ? reverseVector(StridedVec) : StridedVec;
		}

		propagateMetadata(NewLoadInstr, Instr);
		}
		return;
		}

		// The sub vector type for current instruction.
		VectorType *SubVT = VectorType::get(ScalarTy, VF);

		// Vectorize the interleaved store group.
		for (unsigned Part = 0; Part < UF; Part++) {
		rengolin: On loads, it's called "strided.vec", on stores "interleaved.vec". Any reason to call them differently?
		HaoLiu: I think the loaded vectors are separate:
		    V0: A[i], A[i+2], A[i+4], ...
		    V1: A[i+1], A[i+3], A[i+5], ...
		Calling V0/V1 "interleaved.vec" doesn't sound right; each one is a "strided" vector. But the vector to be stored is actually interleaved from several vectors:
		    V0[0], V1[0], V0[1], V1[1], ...
		So I think it's reasonable to call it an "interleaved" vector.
		// Collect the stored vector from each member.
		SmallVector<Value *, 4> StoredVecs;
		for (unsigned i = 0; i < Delta; i++) {
		// Interleaved store group doesn't allow a gap, so each index has a member
		Instruction *Member = Group->getMember(i);
		assert(Member && "Fail to get a member from an interleaved store group");
		mzolotukhin: This looks redundant.
		HaoLiu: Removed.

		Value *StoredVec =
		getVectorValue(dyn_cast<StoreInst>(Member)->getValueOperand())[Part];
		if (Group->isReverse())
		StoredVec = reverseVector(StoredVec);

		// If this member has different type, cast it to an unified type.
		if (StoredVec->getType() != SubVT)
		StoredVec = Builder.CreateBitOrPointerCast(StoredVec, SubVT);

		StoredVecs.push_back(StoredVec);
		}

		// Concatenate all vectors into a wide vector.
		mzolotukhin: Could you please provide an example when this triggers? I'm a bit concerned about the legality of this cast.
		HaoLiu: A test case called @int_float_struct tests the interleaved load group:
		    struct IntFloat { int a; float b; };
		    void int_float_struct(struct IntFloat *A) {
		      int SumA; float SumB;
		      for (unsigned i = 0; i < 1024; i++) { SumA += A[i].a; SumB += A[i].b; }
		      SA = SumA; SB = SumB;
		    }
		You can find this case in "interleaved-accesses.ll". The int-float pair could also be long-double, or any pointer types, as long as the structure has elements of the same size. Actually I have a better case:
		    struct IntFloat { int a; float b; };
		    void int_float_struct(struct IntFloat *A, struct IntFloat *B) {
		      for (unsigned i = 0; i < 1024; i++) { B[i].a = A[i].a + 1; B[i].b = A[i].b + 1.0; }
		    }
		Unfortunately, this case cannot be vectorized as it does not pass the dependence check, because we have the following code in isDependent() (LoopAccessAnalysis.cpp):
		    if (ATy != BTy) {
		      DEBUG(dbgs() << "LAA: ReadWrite-Write positive dependency with different types\n");
		      return Dependence::Unknown;
		    }
		I think this is a bug. We don't need to check types, as we will check dependences based on the memory object size. After removing this code I don't see any failures in benchmarks. Anyway, because of this code, I cannot give a test for an interleaved write group.
		mzolotukhin: It makes sense, thanks for the explanation!
		Value *WideVec = ConcatenateVectors(Builder, StoredVecs);

		// Interleave the elements in the wide vector.
		Constant *IMask = getInterleavedMask(Builder, VF, Delta);
		Value *IVec = Builder.CreateShuffleVector(WideVec, UndefVec, IMask,
		"interleaved.vec");

		Instruction *NewStoreInstr =
		Builder.CreateAlignedStore(IVec, NewPtrs[Part], Group->getAlignment());
		propagateMetadata(NewStoreInstr, Instr);
		}
		}
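
The two shuffle masks used by this function can be reproduced with plain integers, which also makes the "strided" versus "interleaved" naming above concrete. This is a minimal sketch: `stridedMask` and `interleavedMask` are hypothetical standalone helpers returning lane indices. They are not the patch's `getStridedMask`/`getInterleavedMask`, which build IR constants and take a builder argument.

```cpp
#include <cassert>
#include <vector>

// Lane indices that pick member `Start` out of VF consecutive tuples, each
// tuple being Delta elements wide. Used on the load side.
static std::vector<unsigned> stridedMask(unsigned Start, unsigned Delta,
                                         unsigned VF) {
  assert(Start < Delta && "member index must lie inside the tuple");
  std::vector<unsigned> Mask;
  for (unsigned I = 0; I < VF; ++I)
    Mask.push_back(Start + I * Delta);
  return Mask;
}

// Lane indices that interleave NumVec concatenated vectors of VF elements
// each: lane 0 of every vector, then lane 1 of every vector, and so on.
// Used on the store side.
static std::vector<unsigned> interleavedMask(unsigned VF, unsigned NumVec) {
  std::vector<unsigned> Mask;
  for (unsigned I = 0; I < VF; ++I)
    for (unsigned J = 0; J < NumVec; ++J)
      Mask.push_back(J * VF + I);
  return Mask;
}
```

For the R,G,B example in the function comment (Delta 3, VF 4), `stridedMask(1, 3, 4)` gives the G mask `<1, 4, 7, 10>` and `interleavedMask(4, 3)` gives `<0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11>`.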

void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) {		void InnerLoopVectorizer::vectorizeMemoryInstruction(Instruction *Instr) {
		mzolotukhin: The name is misleading.
// Attempt to issue a wide load.		// Attempt to issue a wide load.
LoadInst *LI = dyn_cast<LoadInst>(Instr);		LoadInst *LI = dyn_cast<LoadInst>(Instr);
StoreInst *SI = dyn_cast<StoreInst>(Instr);		StoreInst *SI = dyn_cast<StoreInst>(Instr);

assert((LI || SI) && "Invalid Load/Store instruction");		assert((LI || SI) && "Invalid Load/Store instruction");

		// Try to vectorize the interleave group if this access is interleaved.
		if (Legal->isAccessInterleaved(Instr))
		return vectorizeInterleaveGroup(Instr);

Type *ScalarDataTy = LI ? LI->getType() : SI->getValueOperand()->getType();		Type *ScalarDataTy = LI ? LI->getType() : SI->getValueOperand()->getType();
Type *DataTy = VectorType::get(ScalarDataTy, VF);		Type *DataTy = VectorType::get(ScalarDataTy, VF);
Value *Ptr = LI ? LI->getPointerOperand() : SI->getPointerOperand();		Value *Ptr = LI ? LI->getPointerOperand() : SI->getPointerOperand();
unsigned Alignment = LI ? LI->getAlignment() : SI->getAlignment();		unsigned Alignment = LI ? LI->getAlignment() : SI->getAlignment();
// An alignment of 0 means target abi alignment. We need to use the scalar's		// An alignment of 0 means target abi alignment. We need to use the scalar's
// target abi alignment in such a case.		// target abi alignment in such a case.
const DataLayout &DL = Instr->getModule()->getDataLayout();		const DataLayout &DL = Instr->getModule()->getDataLayout();
if (!Alignment)		if (!Alignment)
... (1,728 lines not shown) ...	bool LoopVectorizationLegality::canVectorize() {
// Collect all of the variables that remain uniform after vectorization.		// Collect all of the variables that remain uniform after vectorization.
collectLoopUniforms();		collectLoopUniforms();

DEBUG(dbgs() << "LV: We can vectorize this loop" <<		DEBUG(dbgs() << "LV: We can vectorize this loop" <<
(LAI->getRuntimePointerCheck()->Need ? " (with a runtime bound check)" :		(LAI->getRuntimePointerCheck()->Need ? " (with a runtime bound check)" :
"")		"")
<<"!\n");		<<"!\n");

		// Analyze interleaved memory accesses.
		if (EnableInterleavedMemAccesses)
		InterleaveInfo.analyzeInterleaving(Strides);

// Okay! We can vectorize. At this point we don't have any other mem analysis		// Okay! We can vectorize. At this point we don't have any other mem analysis
// which may limit our maximum vectorization factor, so just return true with		// which may limit our maximum vectorization factor, so just return true with
// no restrictions.		// no restrictions.
return true;		return true;
}		}

static Type *convertPointerToIntegerType(const DataLayout &DL, Type *Ty) {		static Type *convertPointerToIntegerType(const DataLayout &DL, Type *Ty) {
		rengolin: Verbatim copy is not right. This is too generic to not be exposed by some GEP or Builder class.
		HaoLiu: Done.
if (Ty->isPointerTy())		if (Ty->isPointerTy())
return DL.getIntPtrType(Ty);		return DL.getIntPtrType(Ty);

// It is possible that char's or short's overflow when we ask for the loop's		// It is possible that char's or short's overflow when we ask for the loop's
// trip count, work around this by changing the type size.		// trip count, work around this by changing the type size.
if (Ty->getScalarSizeInBits() < 32)		if (Ty->getScalarSizeInBits() < 32)
return Type::getInt32Ty(Ty->getContext());		return Type::getInt32Ty(Ty->getContext());
		rengolin: Non-verbatim copy is even worse. Please make sure that you can use lib/Analysis' isStridedPtr.
		HaoLiu: Done.

return Ty;		return Ty;
}		}

static Type* getWiderType(const DataLayout &DL, Type *Ty0, Type *Ty1) {		static Type* getWiderType(const DataLayout &DL, Type *Ty0, Type *Ty1) {
Ty0 = convertPointerToIntegerType(DL, Ty0);		Ty0 = convertPointerToIntegerType(DL, Ty0);
Ty1 = convertPointerToIntegerType(DL, Ty1);		Ty1 = convertPointerToIntegerType(DL, Ty1);
if (Ty0->getScalarSizeInBits() > Ty1->getScalarSizeInBits())		if (Ty0->getScalarSizeInBits() > Ty1->getScalarSizeInBits())
return Ty0;		return Ty0;
return Ty1;		return Ty1;
}		}

/// \brief Check that the instruction has outside loop users and is not an		/// \brief Check that the instruction has outside loop users and is not an
/// identified reduction variable.		/// identified reduction variable.
static bool hasOutsideLoopUser(const Loop *TheLoop, Instruction *Inst,		static bool hasOutsideLoopUser(const Loop *TheLoop, Instruction *Inst,
SmallPtrSetImpl<Value *> &Reductions) {		SmallPtrSetImpl<Value *> &Reductions) {
		mzolotukhin: This code is a dupe of the one from the SLP vectorizer. Could we do anything about it?
		HaoLiu: As Renato suggested, I've moved this code into Transforms/Utils/VectorUtils, which contains utilities for the vectorizers.
// Reduction instructions are allowed to have exit users. All other		// Reduction instructions are allowed to have exit users. All other
// instructions must not have external users.		// instructions must not have external users.
if (!Reductions.count(Inst))		if (!Reductions.count(Inst))
//Check that all of the users of the loop are inside the BB.		//Check that all of the users of the loop are inside the BB.
for (User *U : Inst->users()) {		for (User *U : Inst->users()) {
Instruction *UI = cast<Instruction>(U);		Instruction *UI = cast<Instruction>(U);
		rengolin: An explanation of what heads and tails are would be good. If I read correctly, it's not just the first and last accesses, but every first/last access in the list of consecutive pairs. So, in your example: A[i] A[i+1] A[i+2], Heads: A[i], A[i+1]; respective tails: A[i+1], A[i+2], right?
		HaoLiu: Yes, exactly.
// This user may be a reduction exit value.		// This user may be a reduction exit value.
if (!TheLoop->contains(UI)) {		if (!TheLoop->contains(UI)) {
DEBUG(dbgs() << "LV: Found an outside user for : " << *UI << '\n');		DEBUG(dbgs() << "LV: Found an outside user for : " << *UI << '\n');
return true;		return true;
}		}
}		}
return false;		return false;
}		}

bool LoopVectorizationLegality::canVectorizeInstrs() {		bool LoopVectorizationLegality::canVectorizeInstrs() {
BasicBlock *PreHeader = TheLoop->getLoopPreheader();		BasicBlock *PreHeader = TheLoop->getLoopPreheader();
BasicBlock *Header = TheLoop->getHeader();		BasicBlock *Header = TheLoop->getHeader();

// Look for the attribute signaling the absence of NaNs.		// Look for the attribute signaling the absence of NaNs.
Function &F = *Header->getParent();		Function &F = *Header->getParent();
const DataLayout &DL = F.getParent()->getDataLayout();		const DataLayout &DL = F.getParent()->getDataLayout();
if (F.hasFnAttribute("no-nans-fp-math"))		if (F.hasFnAttribute("no-nans-fp-math"))
HasFunNoNaNAttr =		HasFunNoNaNAttr =
		rengolin: You probably mean: // A[i]+A[i+1] (1)(3)
F.getFnAttribute("no-nans-fp-math").getValueAsString() == "true";		F.getFnAttribute("no-nans-fp-math").getValueAsString() == "true";

// For each block in the loop.		// For each block in the loop.
for (Loop::block_iterator bb = TheLoop->block_begin(),		for (Loop::block_iterator bb = TheLoop->block_begin(),
be = TheLoop->block_end(); bb != be; ++bb) {		be = TheLoop->block_end(); bb != be; ++bb) {

// Scan the instructions in the block and look for hazards.		// Scan the instructions in the block and look for hazards.
for (BasicBlock::iterator it = (*bb)->begin(), e = (*bb)->end(); it != e;		for (BasicBlock::iterator it = (*bb)->begin(), e = (*bb)->end(); it != e;
++it) {		++it) {

if (PHINode *Phi = dyn_cast<PHINode>(it)) {		if (PHINode *Phi = dyn_cast<PHINode>(it)) {
		rengolin: I'd call this auto variable "Head", so you can use I for the load/store loops at the end.
		HaoLiu: Reasonable.
Type *PhiTy = Phi->getType();		Type *PhiTy = Phi->getType();
		rengolin: Haven't you checked for this already up there?
		HaoLiu: Not checked yet. A Head may also be a Tail of another Head, in which case it is not the first in the consecutive chain. This is to find the real Head at the top of the chain.
// Check that this PHI type is allowed.		// Check that this PHI type is allowed.
if (!PhiTy->isIntegerTy() &&		if (!PhiTy->isIntegerTy() &&
!PhiTy->isFloatingPointTy() &&		!PhiTy->isFloatingPointTy() &&
!PhiTy->isPointerTy()) {		!PhiTy->isPointerTy()) {
emitAnalysis(VectorizationReport(it)		emitAnalysis(VectorizationReport(it)
		rengolin: You don't seem to be using I or Inst any more, you could declare Inst inside the loop, like: for (auto Inst = I; Tails.count(Inst) || Heads.count(Inst); ) { ... } However, I don't understand why you didn't just: for (auto Inst = I; ConsecutivePairs.count(Inst); ) { ... }
		HaoLiu: Reasonable.
<< "loop control flow is not understood by vectorizer");		<< "loop control flow is not understood by vectorizer");
DEBUG(dbgs() << "LV: Found an non-int non-pointer PHI.\n");		DEBUG(dbgs() << "LV: Found an non-int non-pointer PHI.\n");
return false;		return false;
}		}

// If this PHINode is not in the header block, then we know that we		// If this PHINode is not in the header block, then we know that we
// can convert it to select during if-conversion. No need to check if		// can convert it to select during if-conversion. No need to check if
// the PHIs in this block are induction or reduction variables.		// the PHIs in this block are induction or reduction variables.
if (*bb != Header) {		if (*bb != Header) {
// Check that this instruction has no outside users or is an		// Check that this instruction has no outside users or is an
		rengolin: Shouldn't we guarantee that: ChainLen % IALen == 0 ?
		HaoLiu: I think it is unnecessary. The trailing nodes will be ignored, as they are missing some members needed to form an interleaved access.
// identified reduction value with an outside user.		// identified reduction value with an outside user.
if (!hasOutsideLoopUser(TheLoop, it, AllowedExit))		if (!hasOutsideLoopUser(TheLoop, it, AllowedExit))
continue;		continue;
emitAnalysis(VectorizationReport(it) <<		emitAnalysis(VectorizationReport(it) <<
"value could not be identified as "		"value could not be identified as "
"an induction or reduction variable");		"an induction or reduction variable");
return false;		return false;
}		}

// We only allow if-converted PHIs with exactly two incoming values.		// We only allow if-converted PHIs with exactly two incoming values.
if (Phi->getNumIncomingValues() != 2) {		if (Phi->getNumIncomingValues() != 2) {
emitAnalysis(VectorizationReport(it)		emitAnalysis(VectorizationReport(it)
<< "control flow not understood by vectorizer");		<< "control flow not understood by vectorizer");
DEBUG(dbgs() << "LV: Found an invalid PHI.\n");		DEBUG(dbgs() << "LV: Found an invalid PHI.\n");
return false;		return false;
}		}

// This is the value coming from the preheader.		// This is the value coming from the preheader.
Value *StartValue = Phi->getIncomingValueForBlock(PreHeader);		Value *StartValue = Phi->getIncomingValueForBlock(PreHeader);
ConstantInt *StepValue = nullptr;		ConstantInt *StepValue = nullptr;
		mzolotukhin: It'd better be `if (Queue.size() < 2) return;` since `QSize` isn't used anywhere else.
		HaoLiu: Done.
// Check if this is an induction variable.		// Check if this is an induction variable.
InductionKind IK = isInductionVariable(Phi, StepValue);		InductionKind IK = isInductionVariable(Phi, StepValue);

		rengolin: std::min()?
if (IK_NoInduction != IK) {		if (IK_NoInduction != IK) {
// Get the widest type.		// Get the widest type.
if (!WidestIndTy)		if (!WidestIndTy)
WidestIndTy = convertPointerToIntegerType(DL, PhiTy);		WidestIndTy = convertPointerToIntegerType(DL, PhiTy);
else		else
WidestIndTy = getWiderType(DL, PhiTy, WidestIndTy);		WidestIndTy = getWiderType(DL, PhiTy, WidestIndTy);

// Int inductions are special because we only allow one IV.		// Int inductions are special because we only allow one IV.
if (IK == IK_IntInduction && StepValue->isOne()) {		if (IK == IK_IntInduction && StepValue->isOne()) {
// Use the phi node with the widest type as induction. Use the last		// Use the phi node with the widest type as induction. Use the last
// one if there are multiple (no good reason for doing this other		// one if there are multiple (no good reason for doing this other
// than it is expedient).		// than it is expedient).
if (!Induction || PhiTy == WidestIndTy)		if (!Induction || PhiTy == WidestIndTy)
Induction = Phi;		Induction = Phi;
		mzolotukhin: This is very similar to the code from SLP. Any chance to reuse it?
		HaoLiu: Yes, but there are still some differences. If we want to reuse it, we'd also need to modify SLP a lot. I think it's better to clean this up in a separate patch.
}		}

DEBUG(dbgs() << "LV: Found an induction variable.\n");		DEBUG(dbgs() << "LV: Found an induction variable.\n");
Inductions[Phi] = InductionInfo(StartValue, IK, StepValue);		Inductions[Phi] = InductionInfo(StartValue, IK, StepValue);

// Until we explicitly handle the case of an induction variable with		// Until we explicitly handle the case of an induction variable with
// an outside loop user we have to give up vectorizing this loop.		// an outside loop user we have to give up vectorizing this loop.
if (hasOutsideLoopUser(TheLoop, it, AllowedExit)) {		if (hasOutsideLoopUser(TheLoop, it, AllowedExit)) {
emitAnalysis(VectorizationReport(it) <<		emitAnalysis(VectorizationReport(it) <<
"use of induction value outside of the "		"use of induction value outside of the "
"loop is not handled by vectorizer");		"loop is not handled by vectorizer");
return false;		return false;
}		}

continue;		continue;
		rengolin: This seems too generic to be in LoopVectorizer.cpp
}		}
		mzolotukhin: This name is misleading.
		HaoLiu: Renamed both "Size" and "Num".

if (ReductionDescriptor::isReductionPHI(Phi, TheLoop,		if (ReductionDescriptor::isReductionPHI(Phi, TheLoop,
Reductions[Phi])) {		Reductions[Phi])) {
AllowedExit.insert(Reductions[Phi].getLoopExitInstr());		AllowedExit.insert(Reductions[Phi].getLoopExitInstr());
continue;		continue;
}		}

emitAnalysis(VectorizationReport(it) <<		emitAnalysis(VectorizationReport(it) <<
"value that could not be identified as "		"value that could not be identified as "
"reduction is used outside the loop");		"reduction is used outside the loop");
DEBUG(dbgs() << "LV: Found an unidentified PHI."<< *Phi <<"\n");		DEBUG(dbgs() << "LV: Found an unidentified PHI."<< *Phi <<"\n");
return false;		return false;
}// end of PHI handling		}// end of PHI handling

// We handle calls that:		// We handle calls that:
// * Are debug info intrinsics.		// * Are debug info intrinsics.
// * Have a mapping to an IR intrinsic.		// * Have a mapping to an IR intrinsic.
// * Have a vector version available.		// * Have a vector version available.
CallInst *CI = dyn_cast<CallInst>(it);		CallInst *CI = dyn_cast<CallInst>(it);
if (CI && !getIntrinsicIDForCall(CI, TLI) && !isa<DbgInfoIntrinsic>(CI) &&		if (CI && !getIntrinsicIDForCall(CI, TLI) && !isa<DbgInfoIntrinsic>(CI) &&
!(CI->getCalledFunction() && TLI &&		!(CI->getCalledFunction() && TLI &&
TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) {		TLI->isFunctionVectorizable(CI->getCalledFunction()->getName()))) {
emitAnalysis(VectorizationReport(it) <<		emitAnalysis(VectorizationReport(it) <<
		mzolotukhin: Shouldn't it be minimal alignment instead?
		HaoLiu (author): Ah, right. I've changed this logic to be min align.
"call instruction cannot be vectorized");		"call instruction cannot be vectorized");
DEBUG(dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n");		DEBUG(dbgs() << "LV: Found a non-intrinsic, non-libfunc callsite.\n");
		delena: Why is a predicated block not considered?
		for (i = 0; i < N; i+=2) { if (B[i]) { A[i] += 1; A[i+1] += 2; } }
		rengolin: I think for the sake of simplicity in the first implementation. Maybe a TODO/FIXME comment would clarify better.
		HaoLiu (author): Agree.
return false;		return false;
}		}

// Intrinsics such as powi,cttz and ctlz are legal to vectorize if the		// Intrinsics such as powi,cttz and ctlz are legal to vectorize if the
// second argument is the same (i.e. loop invariant)		// second argument is the same (i.e. loop invariant)
if (CI &&		if (CI &&
hasVectorInstrinsicScalarOpd(getIntrinsicIDForCall(CI, TLI), 1)) {		hasVectorInstrinsicScalarOpd(getIntrinsicIDForCall(CI, TLI), 1)) {
if (!SE->isLoopInvariant(SE->getSCEV(CI->getOperand(1)), TheLoop)) {		if (!SE->isLoopInvariant(SE->getSCEV(CI->getOperand(1)), TheLoop)) {
emitAnalysis(VectorizationReport(it)		emitAnalysis(VectorizationReport(it)
<< "intrinsic instruction cannot be vectorized");		<< "intrinsic instruction cannot be vectorized");
DEBUG(dbgs() << "LV: Found unvectorizable intrinsic " << *CI << "\n");		DEBUG(dbgs() << "LV: Found unvectorizable intrinsic " << *CI << "\n");
return false;		return false;
}		}
}		}

		delena: You can put Stride, Base and IsLoad in one structure.
		HaoLiu (author): Agree. Thanks.
// Check that the instruction return type is vectorizable.		// Check that the instruction return type is vectorizable.
// Also, we can't vectorize extractelement instructions.		// Also, we can't vectorize extractelement instructions.
if ((!VectorType::isValidElementType(it->getType()) &&		if ((!VectorType::isValidElementType(it->getType()) &&
!it->getType()->isVoidTy()) || isa<ExtractElementInst>(it)) {		!it->getType()->isVoidTy()) || isa<ExtractElementInst>(it)) {
emitAnalysis(VectorizationReport(it)		emitAnalysis(VectorizationReport(it)
<< "instruction return type cannot be vectorized");		<< "instruction return type cannot be vectorized");
DEBUG(dbgs() << "LV: Found unvectorizable type.\n");		DEBUG(dbgs() << "LV: Found unvectorizable type.\n");
return false;		return false;
}		}

// Check that the stored type is vectorizable.		// Check that the stored type is vectorizable.
if (StoreInst *ST = dyn_cast<StoreInst>(it)) {		if (StoreInst *ST = dyn_cast<StoreInst>(it)) {
Type *T = ST->getValueOperand()->getType();		Type *T = ST->getValueOperand()->getType();
if (!VectorType::isValidElementType(T)) {		if (!VectorType::isValidElementType(T)) {
		mzolotukhin: This is redundant.
		HaoLiu (author): Removed.
emitAnalysis(VectorizationReport(ST) <<		emitAnalysis(VectorizationReport(ST) <<
		delena: You assume that you have only one strided access in the BB. Why?
		HaoLiu (author): No. After analyzing the current Queue, it clears it and keeps on analyzing in the BB.
"store instruction cannot be vectorized");		"store instruction cannot be vectorized");
		hfinkel: You can use auto here for the iterator type; a range-based for would be better.
		HaoLiu (author): Done.
return false;		return false;
}		}
if (EnableMemAccessVersioning)		if (EnableMemAccessVersioning)
collectStridedAccess(ST);		collectStridedAccess(ST);
}		}

if (EnableMemAccessVersioning)		if (EnableMemAccessVersioning)
		aschwaighofer: What does this code do if you have two interleaved accesses to the same location? (I know we should not see this in optimized IR, but we can't rely on this fact.)
		a[2i] = ; (1)
		a[2i] = ; (2)
		a[2i+1] = ; (3)
		a[2i+1] = ; (4)
		Are we guaranteed to get the pairs (1)(3) and (2)(4)? It seems to me that we will get the last consecutive location, since we process 'Queue' front to back in the loop above that populates ConsecutiveChain; in my example we would get (1), (4).
		HaoLiu (author): Arnold, you are right, we cannot get the pairs (1)(3) and (2)(4). Actually it gets two pairs, (1)(4) and (2)(4), as the later node overrides the former node. Previously, I never considered redundant loads/stores. Now it seems error-prone, as the newly generated instruction may break the dependence. I'm still thinking about how to fix this.
if (LoadInst *LI = dyn_cast<LoadInst>(it))		if (LoadInst *LI = dyn_cast<LoadInst>(it))
collectStridedAccess(LI);		collectStridedAccess(LI);

// Reduction instructions are allowed to have exit users.		// Reduction instructions are allowed to have exit users.
// All other instructions must not have external users.		// All other instructions must not have external users.
if (hasOutsideLoopUser(TheLoop, it, AllowedExit)) {		if (hasOutsideLoopUser(TheLoop, it, AllowedExit)) {
emitAnalysis(VectorizationReport(it) <<		emitAnalysis(VectorizationReport(it) <<
"value cannot be used outside the loop");		"value cannot be used outside the loop");
return false;		return false;
}		}

} // next instr.		} // next instr.

}		}
		mzolotukhin: Auto + range loops could be used here.
		HaoLiu (author): Refactored.

if (!Induction) {		if (!Induction) {
DEBUG(dbgs() << "LV: Did not find one integer induction var.\n");		DEBUG(dbgs() << "LV: Did not find one integer induction var.\n");
if (Inductions.empty()) {		if (Inductions.empty()) {
emitAnalysis(VectorizationReport()		emitAnalysis(VectorizationReport()
<< "loop induction variable could not be identified");		<< "loop induction variable could not be identified");
		mzolotukhin: When I first read this code, this name raised the question "Merge with what?" for me.
		HaoLiu (author): Removed.
return false;		return false;
}		}
}		}

return true;		return true;
		mzolotukhin: Same here.
		HaoLiu (author): Done.
}		}

///\brief Remove GEPs whose indices but the last one are loop invariant and		///\brief Remove GEPs whose indices but the last one are loop invariant and
/// return the induction operand of the gep pointer.		/// return the induction operand of the gep pointer.
static Value *stripGetElementPtr(Value *Ptr, ScalarEvolution *SE, Loop *Lp) {		static Value *stripGetElementPtr(Value *Ptr, ScalarEvolution *SE, Loop *Lp) {
GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(Ptr);		GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(Ptr);
if (!GEP)		if (!GEP)
		mzolotukhin: This loop is much more complicated than it should be, I think. If we can combine `if (!StridedAccesses.count(it))` with `if (CurBase != Base || CurStride != Stride || IsLoad != Read)`, we'll be able to get rid of `TryMerge` and of all these jumps to the beginning of the loop.
		HaoLiu (author): Done with the new patch. I changed the layout of this loop; now we only call the "collect" function once in the loop. But after the loop, we still need to call it once to handle the tail.
return Ptr;		return Ptr;

unsigned InductionOperand = getGEPInductionOperand(GEP);		unsigned InductionOperand = getGEPInductionOperand(GEP);

// Check that all of the gep indices are uniform except for our induction		// Check that all of the gep indices are uniform except for our induction
// operand.		// operand.
for (unsigned i = 0, e = GEP->getNumOperands(); i != e; ++i)		for (unsigned i = 0, e = GEP->getNumOperands(); i != e; ++i)
if (i != InductionOperand &&		if (i != InductionOperand &&
!SE->isLoopInvariant(SE->getSCEV(GEP->getOperand(i)), Lp))		!SE->isLoopInvariant(SE->getSCEV(GEP->getOperand(i)), Lp))
return Ptr;		return Ptr;
return GEP->getOperand(InductionOperand);		return GEP->getOperand(InductionOperand);
}		}

///\brief Look for a cast use of the passed value.		///\brief Look for a cast use of the passed value.
static Value *getUniqueCastUse(Value *Ptr, Loop *Lp, Type *Ty) {		static Value *getUniqueCastUse(Value *Ptr, Loop *Lp, Type *Ty) {
Value *UniqueCast = nullptr;		Value *UniqueCast = nullptr;
for (User *U : Ptr->users()) {		for (User *U : Ptr->users()) {
CastInst *CI = dyn_cast<CastInst>(U);		CastInst *CI = dyn_cast<CastInst>(U);
if (CI && CI->getType() == Ty) {		if (CI && CI->getType() == Ty) {
if (!UniqueCast)		if (!UniqueCast)
UniqueCast = CI;		UniqueCast = CI;
else		else
return nullptr;		return nullptr;
}		}
}		}
		mzolotukhin: s/instructure/instruction/
		HaoLiu (author): Fixed.
return UniqueCast;		return UniqueCast;
}		}

///\brief Get the stride of a pointer access in a loop.		///\brief Get the stride of a pointer access in a loop.
/// Looks for symbolic strides "a[i*stride]". Returns the symbolic stride as a		/// Looks for symbolic strides "a[i*stride]". Returns the symbolic stride as a
/// pointer to the Value, or null otherwise.		/// pointer to the Value, or null otherwise.
static Value *getStrideFromPointer(Value *Ptr, ScalarEvolution *SE, Loop *Lp) {		static Value *getStrideFromPointer(Value *Ptr, ScalarEvolution *SE, Loop *Lp) {
const PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());		const PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());
if (!PtrTy || PtrTy->isAggregateType())		if (!PtrTy || PtrTy->isAggregateType())
return nullptr;		return nullptr;

// Try to remove a gep instruction to make the pointer (actually index at this		// Try to remove a gep instruction to make the pointer (actually index at this
// point) easier analyzable. If OrigPtr is equal to Ptr we are analzying the		// point) easier analyzable. If OrigPtr is equal to Ptr we are analzying the
// pointer, otherwise, we are analyzing the index.		// pointer, otherwise, we are analyzing the index.
Value *OrigPtr = Ptr;		Value *OrigPtr = Ptr;

// The size of the pointer access.		// The size of the pointer access.
int64_t PtrAccessSize = 1;		int64_t PtrAccessSize = 1;
		mzolotukhin: This seems to be redundant.
		HaoLiu (author): Removed.

Ptr = stripGetElementPtr(Ptr, SE, Lp);		Ptr = stripGetElementPtr(Ptr, SE, Lp);
const SCEV *V = SE->getSCEV(Ptr);		const SCEV *V = SE->getSCEV(Ptr);

if (Ptr != OrigPtr)		if (Ptr != OrigPtr)
// Strip off casts.		// Strip off casts.
while (const SCEVCastExpr *C = dyn_cast<SCEVCastExpr>(V))		while (const SCEVCastExpr *C = dyn_cast<SCEVCastExpr>(V))
V = C->getOperand();		V = C->getOperand();

const SCEVAddRecExpr *S = dyn_cast<SCEVAddRecExpr>(V);		const SCEVAddRecExpr *S = dyn_cast<SCEVAddRecExpr>(V);
if (!S)		if (!S)
return nullptr;		return nullptr;

V = S->getStepRecurrence(*SE);		V = S->getStepRecurrence(*SE);
if (!V)		if (!V)
return nullptr;		return nullptr;

// Strip off the size of access multiplication if we are still analyzing the		// Strip off the size of access multiplication if we are still analyzing the
// pointer.		// pointer.
if (OrigPtr == Ptr) {		if (OrigPtr == Ptr) {
const DataLayout &DL = Lp->getHeader()->getModule()->getDataLayout();		const DataLayout &DL = Lp->getHeader()->getModule()->getDataLayout();
DL.getTypeAllocSize(PtrTy->getElementType());		DL.getTypeAllocSize(PtrTy->getElementType());
if (const SCEVMulExpr *M = dyn_cast<SCEVMulExpr>(V)) {		if (const SCEVMulExpr *M = dyn_cast<SCEVMulExpr>(V)) {
		rengolin: Potential uninitialized use of CurStride.
		HaoLiu (author): I've initialized CurStride to be 1.
if (M->getOperand(0)->getSCEVType() != scConstant)		if (M->getOperand(0)->getSCEVType() != scConstant)
return nullptr;		return nullptr;

const APInt &APStepVal =		const APInt &APStepVal =
cast<SCEVConstant>(M->getOperand(0))->getValue()->getValue();		cast<SCEVConstant>(M->getOperand(0))->getValue()->getValue();

// Huge step value - give up.		// Huge step value - give up.
if (APStepVal.getBitWidth() > 64)		if (APStepVal.getBitWidth() > 64)
Show All 25 Lines	static Value getStrideFromPointer(Value Ptr, ScalarEvolution SE, Loop Lp) {
// If we have stripped off the recurrence cast we have to make sure that we		// If we have stripped off the recurrence cast we have to make sure that we
// return the value that is used in this loop so that we can replace it later.		// return the value that is used in this loop so that we can replace it later.
if (StripedOffRecurrenceCast)		if (StripedOffRecurrenceCast)
Stride = getUniqueCastUse(Stride, Lp, StripedOffRecurrenceCast);		Stride = getUniqueCastUse(Stride, Lp, StripedOffRecurrenceCast);

return Stride;		return Stride;
}		}

void LoopVectorizationLegality::collectStridedAccess(Value *MemAccess) {		void LoopVectorizationLegality::collectStridedAccess(Value *MemAccess) {
		rengolin: You just need if (CurBase) {
		HaoLiu (author): Done.
Value *Ptr = nullptr;		Value *Ptr = nullptr;
if (LoadInst *LI = dyn_cast<LoadInst>(MemAccess))		if (LoadInst *LI = dyn_cast<LoadInst>(MemAccess))
Ptr = LI->getPointerOperand();		Ptr = LI->getPointerOperand();
else if (StoreInst *SI = dyn_cast<StoreInst>(MemAccess))		else if (StoreInst *SI = dyn_cast<StoreInst>(MemAccess))
Ptr = SI->getPointerOperand();		Ptr = SI->getPointerOperand();
else		else
return;		return;

▲ Show 20 Lines • Show All 162 Lines • ▼ Show 20 Lines	for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {
case Instruction::SRem:		case Instruction::SRem:
return false;		return false;
}		}
}		}

return true;		return true;
}		}

		void InterleavedAccessInfo::collectConstStridedAccesses(
		MapVector<Instruction *, StrideDescriptor> &StrideAccesses,
		const ValueToValueMap &Strides) {
		// Holds load/store instructions in program order.
		SmallVector<Instruction *, 16> AccessList;

		for (auto *BB : TheLoop->getBlocks()) {
		bool IsPred = LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT);

		for (auto &I : *BB) {
		if (!isa<LoadInst>(&I) && !isa<StoreInst>(&I))
		continue;
		// FIXME: Currently we can't handle mixed accesses and predicated accesses
		if (IsPred)
		return;

		AccessList.push_back(&I);
		}
		}

		if (AccessList.empty())
		return;

		auto &DL = TheLoop->getHeader()->getModule()->getDataLayout();
		for (auto I : AccessList) {
		LoadInst *LI = dyn_cast<LoadInst>(I);
		StoreInst *SI = dyn_cast<StoreInst>(I);

		Value *Ptr = LI ? LI->getPointerOperand() : SI->getPointerOperand();
		int Stride = isStridedPtr(SE, Ptr, TheLoop, Strides);

		// Ignore non-stride, unit strides and too large strides.
		if (std::abs(Stride) < 2 ||
		std::abs(Stride) > static_cast<int>(MaxInterleaveStride))
		mzolotukhin: static_cast isn't needed here.
		HaoLiu (author): std::abs() returns a signed int. Comparing signed to unsigned will cause a warning from the compiler. To avoid the static_cast, I use a temporary value for "std::abs(Stride)" in the new patch.
		continue;

		const SCEV *Scev = replaceSymbolicStrideSCEV(SE, Strides, Ptr);
		PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());
		unsigned Size = DL.getTypeAllocSize(PtrTy->getElementType());

		// An alignment of 0 means target ABI alignment.
		unsigned Align = LI ? LI->getAlignment() : SI->getAlignment();
		if (!Align)
		Align = DL.getABITypeAlignment(PtrTy->getElementType());

		StrideAccesses[I] = StrideDescriptor(Stride, Scev, Size, Align);
		}
		}

		// Analyze interleaved accesses and collect them into interleave groups.
		//
		// Notice that the vectorization on interleaved groups will change instruction
		// orders and may break dependences. But the memory dependence check guarantees
		// that there is no overlap between two pointers of different strides, element
		// sizes or underlying bases.
		//
		// For pointers sharing the same stride, element size and underlying base, no
		// need to worry about Read-After-Write dependences and Write-After-Read
		// dependences.
		//
		// E.g. The RAW dependence: A[i] = a;
		// b = A[i];
		// This won't exist as it is a store-load forwarding conflict, which has
		// already been checked and forbidden in the dependence check.
		//
		// E.g. The WAR dependence: a = A[i]; // (1)
		// A[i] = b; // (2)
		// The store group of (2) is always inserted at or below (2), and the load group
		// of (1) is always inserted at or above (1). The dependence is safe.
		void InterleavedAccessInfo::analyzeInterleaving(
		const ValueToValueMap &Strides) {
		DEBUG(dbgs() << "LV: Analyzing interleaved accesses...\n");

		// Holds all the stride accesses.
		MapVector<Instruction *, StrideDescriptor> StrideAccesses;
		collectConstStridedAccesses(StrideAccesses, Strides);

		if (StrideAccesses.empty())
		return;

		// Holds all interleaved store groups temporarily.
		SmallSetVector<InterleaveGroup *, 4> StoreGroups;

		// Search the load-load/write-write pair B-A in bottom-up order and try to
		// insert B into the interleave group of A according to 3 rules:
		// 1. A and B have the same stride.
		// 2. A and B have the same memory object size.
		// 3. B belongs to the group according to the distance.
		//
		// The bottom-up order can avoid breaking the Write-After-Write dependences
		// between two pointers of the same base.
		// E.g. A[i] = a; (1)
		// A[i] = b; (2)
		// A[i+1] = c (3)
		// We form the group (2)+(3) in front, so (1) has to form groups with accesses
		// above (1), which guarantees that (1) is always above (2).
		for (auto I = StrideAccesses.rbegin(), E = StrideAccesses.rend(); I != E;
		++I) {
		Instruction *A = I->first;
		StrideDescriptor DesA = I->second;

		InterleaveGroup *Group = getInterleaveGroup(A);
		if (!Group) {
		DEBUG(dbgs() << "LV: Creating an interleave group with:" << *A << '\n');
		Group = createInterleaveGroup(A, DesA.Stride, DesA.Align);
		}

		if (A->mayWriteToMemory())
		StoreGroups.insert(Group);

		for (auto II = std::next(I); II != E; ++II) {
		Instruction *B = II->first;
		StrideDescriptor DesB = II->second;

		// Ignore if B is already in a group or B is a different memory operation.
		if (isInterleaved(B) || A->mayReadFromMemory() != B->mayReadFromMemory())
		continue;

		// Check the rule 1 and 2.
		if (DesB.Stride != DesA.Stride || DesB.Size != DesA.Size)
		continue;

		// Calculate the distance and prepare for the rule 3.
		const SCEVConstant *DistToA =
		dyn_cast<SCEVConstant>(SE->getMinusSCEV(DesB.Scev, DesA.Scev));
		if (!DistToA)
		continue;

		int DistanceToA = DistToA->getValue()->getValue().getSExtValue();

		// Previous dependence check guarantees no dependence between two accesses
		// if the distance is not multiple of the size. So just ignore such cases
		// and won't worry about dependences.
		if (DistanceToA % static_cast<int>(DesA.Size))
		continue;
		mzolotukhin: What is the previous check you are referring to? And if it's guaranteed, could this `if` be replaced with an assertion?
		HaoLiu (author): It cannot be an assertion. The comment was hard to understand, so I've replaced it with another one that describes what we do right here. The @interleaved_stores case in stride-access-dependence.ll checks for this situation, so there is no need to clarify it again.

		// The index of B is the index of A plus the related index to A.
		int IndexB =
		Group->getIndex(A) + DistanceToA / static_cast<int>(DesA.Size);
		mzolotukhin: `static_cast` isn't necessary here (and if we want to cast everything, `getIndex` should also be cast from `unsigned`).
		HaoLiu (author): No, the cast cannot be dropped for a division between signed and unsigned. I tested with a small case: int a = -4; unsigned b = 4; int c = a / b; The result c is a very large number (1073741823). Modulo is similar; I once fixed a bug about "signed % unsigned" (the unsigned data wasn't cast) in SeparateConstOffsetFromGEP.cpp. This is different from ADD/SUB.
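The pitfall described in the reply above can be reproduced in isolation. The following is a minimal sketch, assuming a 32-bit `int` and `unsigned`; the helper names `mixedDivide` and `signedDivide` are hypothetical and do not appear in the patch.

```cpp
#include <cassert>

static_assert(sizeof(int) == 4, "the sketch assumes a 32-bit int");

// Hypothetical helper: divides a signed value by an unsigned one with no cast.
// The usual arithmetic conversions turn `a` into unsigned before the division,
// so a negative `a` becomes a very large positive value first.
int mixedDivide(int a, unsigned b) { return a / b; }

// Hypothetical helper: casts the divisor first, so the division stays signed.
int signedDivide(int a, unsigned b) { return a / static_cast<int>(b); }
```

With `a = -4` and `b = 4`, `mixedDivide` yields 1073741823 (the value reported in the reply), while `signedDivide` yields -1, which is the result the index computation needs.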

		// Try to insert B into the group.
		if (Group->insertMember(B, IndexB, DesB.Align)) {
		DEBUG(dbgs() << "LV: Inserted:" << *B << '\n'
		<< " into the interleave group with" << *A << '\n');
		InterleaveGroupMap[B] = Group;

		// Set the first load in program order as the insert position.
		if (B->mayReadFromMemory())
		Group->setInsertPos(B);
		}
		} // Iteration on instruction B
		} // Iteration on instruction A

		// Remove interleaved store groups with gaps.
		for (InterleaveGroup *Group : StoreGroups)
		if (Group->getNumMembers() != Group->getDelta())
		releaseGroup(Group);
		}
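The three grouping rules used in `analyzeInterleaving` can be sketched outside LLVM with plain integers. This is an illustration only: `AccessDesc` and `indexInGroup` are hypothetical stand-ins for the patch's `StrideDescriptor` and the inline logic, offsets are assumed to be constant byte distances from a common base, and the range check that the patch leaves to `insertMember` is omitted.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the patch's StrideDescriptor.
struct AccessDesc {
  int Stride;      // stride in elements per iteration
  int64_t Offset;  // constant byte offset from a common base
  unsigned Size;   // element size in bytes
  bool IsLoad;     // true for a load, false for a store
};

// Computes the index B would take in A's group, given A's index.
// Returns false when B is a different kind of access, when rule 1 or 2
// fails, or when the byte distance is not a multiple of the element size.
bool indexInGroup(const AccessDesc &A, int IndexA, const AccessDesc &B,
                  int &IndexB) {
  if (A.IsLoad != B.IsLoad)
    return false;
  if (A.Stride != B.Stride || A.Size != B.Size)
    return false;
  const int64_t Distance = B.Offset - A.Offset;
  const int64_t Size = static_cast<int64_t>(A.Size); // keep the math signed
  if (Distance % Size != 0)
    return false;
  IndexB = IndexA + static_cast<int>(Distance / Size);
  return true;
}
```

For two 4-byte stores of stride 2 whose addresses differ by 4 bytes, the second store lands one slot after the first; a negative distance produces a smaller index, which is why the arithmetic has to stay signed.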

LoopVectorizationCostModel::VectorizationFactor		LoopVectorizationCostModel::VectorizationFactor
LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) {		LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) {
// Width 1 means no vectorize		// Width 1 means no vectorize
VectorizationFactor Factor = { 1U, 0U };		VectorizationFactor Factor = { 1U, 0U };
if (OptForSize && Legal->getRuntimePointerCheck()->Need) {		if (OptForSize && Legal->getRuntimePointerCheck()->Need) {
emitAnalysis(VectorizationReport() <<		emitAnalysis(VectorizationReport() <<
"runtime pointer checks needed. Enable vectorization of this "		"runtime pointer checks needed. Enable vectorization of this "
"loop with '#pragma clang loop vectorize(enable)' when "		"loop with '#pragma clang loop vectorize(enable)' when "
▲ Show 20 Lines • Show All 636 Lines • ▼ Show 20 Lines	case Instruction::Load: {
Value *Ptr = SI ? SI->getPointerOperand() : LI->getPointerOperand();		Value *Ptr = SI ? SI->getPointerOperand() : LI->getPointerOperand();
// We add the cost of address computation here instead of with the gep		// We add the cost of address computation here instead of with the gep
// instruction because only here we know whether the operation is		// instruction because only here we know whether the operation is
// scalarized.		// scalarized.
if (VF == 1)		if (VF == 1)
return TTI.getAddressComputationCost(VectorTy) +		return TTI.getAddressComputationCost(VectorTy) +
TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);		TTI.getMemoryOpCost(I->getOpcode(), VectorTy, Alignment, AS);

		// For an interleaved access, calculate the total cost of the whole
		// interleave group.
		if (Legal->isAccessInterleaved(I)) {
		mzolotukhin: I'd add an assert that `IA` != `nullptr` here.
		auto Group = Legal->getInterleavedAccessGroup(I);
		assert(Group && "Fail to get an interleaved access group.");

		// Only calculate the cost once at the insert position.
		if (Group->getInsertPos() != I)
		return 0;

		unsigned Delta = Group->getDelta();
		mzolotukhin: Do we need to add the cost once per interleaving group, or per interleaving group member?
		Type *WideVecTy =
		VectorType::get(VectorTy->getVectorElementType(),
		VectorTy->getVectorNumElements() * Delta);

		// Holds the indices of existing members in an interleaved load group.
		// An interleaved store group doesn't need this as it dones't allow gaps.
		SmallVector<unsigned, 4> Indices;
		if (LI) {
		for (unsigned i = 0; i < Delta; i++)
		if (Group->getMember(i))
		Indices.push_back(i);
		}

		// Calculate the cost of the whole interleaved group.
		mzolotukhin: s/even expensive/even more expensive/
		unsigned Cost = TTI.getInterleavedMemoryOpCost(I->getOpcode(), WideVecTy,
		Group->getDelta(), Indices,
		Group->getAlignment(), AS);

		if (Group->isReverse())
		Cost +=
		Group->getNumMembers() *
		TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);

		// FIXME: The interleaved load group with a huge gap could be even more
		// expensive than scalar operations. Then we could ignore such group and
		// use scalar operations instead.
		return Cost;
		}

// Scalarized loads/stores.		// Scalarized loads/stores.
int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);		int ConsecutiveStride = Legal->isConsecutivePtr(Ptr);
bool Reverse = ConsecutiveStride < 0;		bool Reverse = ConsecutiveStride < 0;
const DataLayout &DL = I->getModule()->getDataLayout();		const DataLayout &DL = I->getModule()->getDataLayout();
unsigned ScalarAllocatedSize = DL.getTypeAllocSize(ValTy);		unsigned ScalarAllocatedSize = DL.getTypeAllocSize(ValTy);
unsigned VectorElementSize = DL.getTypeStoreSize(VectorTy) / VF;		unsigned VectorElementSize = DL.getTypeStoreSize(VectorTy) / VF;
if (!ConsecutiveStride || ScalarAllocatedSize != VectorElementSize) {		if (!ConsecutiveStride || ScalarAllocatedSize != VectorElementSize) {
bool IsComplexComputation =		bool IsComplexComputation =
▲ Show 20 Lines • Show All 257 Lines • Show Last 20 Lines
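The quantities fed to the interleaved cost query above can be illustrated with plain containers. This is a rough sketch under stated assumptions, not the patch's code: `wideVectorNumElements`, `presentIndices` and `stridedMask` are hypothetical helpers, and a `std::vector<bool>` stands in for the group's member slots.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element count of the wide vector that covers one interleave group:
// VF elements for each of the Delta slots, whether or not a slot is used.
unsigned wideVectorNumElements(unsigned VF, unsigned Delta) {
  return VF * Delta;
}

// Hypothetical stand-in for the Group->getMember(i) loop: collects the
// slot indices that actually hold a member (load groups may have gaps).
std::vector<unsigned> presentIndices(const std::vector<bool> &HasMember) {
  std::vector<unsigned> Indices;
  for (std::size_t I = 0; I < HasMember.size(); ++I)
    if (HasMember[I])
      Indices.push_back(static_cast<unsigned>(I));
  return Indices;
}

// Shuffle mask that extracts the member at slot Index from the wide vector:
// Index, Index + Delta, Index + 2 * Delta, ...
std::vector<unsigned> stridedMask(unsigned Index, unsigned Delta, unsigned VF) {
  std::vector<unsigned> Mask;
  for (unsigned I = 0; I < VF; ++I)
    Mask.push_back(Index + I * Delta);
  return Mask;
}
```

With a vectorization factor of 4 and three members, the wide vector has 12 elements and the mask for slot 1 is <1, 4, 7, 10>, matching the G load in the summary's example.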

test/Analysis/LoopAccessAnalysis/stride-access-dependence.ll

This file was added.

				; RUN: opt -loop-accesses -analyze < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

				; Following cases are no dependence.

				; void nodep_Read_Write(int *A) {
				; int *B = A + 1;
				; for (unsigned i = 0; i < 1024; i+=3)
				; B[i] = A[i] + 1;
				; }

				; CHECK: function 'nodep_Read_Write':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Memory dependences are safe
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: Run-time memory checks:

				define void @nodep_Read_Write(i32* nocapture %A) {
				entry:
				%add.ptr = getelementptr inbounds i32, i32* %A, i64 1
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%add = add nsw i32 %0, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %add.ptr, i64 %indvars.iv
				store i32 %add, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 3
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; int nodep_Write_Read(int *A) {
				; int sum = 0;
				; for (unsigned i = 0; i < 1024; i+=4) {
				; A[i] = i;
				; sum += A[i+3];
				; }
				;
				; return sum;
				; }

				; CHECK: function 'nodep_Write_Read':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Memory dependences are safe
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: Run-time memory checks:

				define i32 @nodep_Write_Read(i32* nocapture %A) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret i32 %add3

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%sum.013 = phi i32 [ 0, %entry ], [ %add3, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = trunc i64 %indvars.iv to i32
				store i32 %0, i32* %arrayidx, align 4
				%1 = or i64 %indvars.iv, 3
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 %1
				%2 = load i32, i32* %arrayidx2, align 4
				%add3 = add nsw i32 %2, %sum.013
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 4
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; void nodep_Write_Write(int *A) {
				; for (unsigned i = 0; i < 1024; i+=2) {
				; A[i] = i;
				; A[i+1] = i+1;
				; }
				; }

				; CHECK: function 'nodep_Write_Write':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Memory dependences are safe
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: Run-time memory checks:

				define void @nodep_Write_Write(i32* nocapture %A) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = trunc i64 %indvars.iv to i32
				store i32 %0, i32* %arrayidx, align 4
				%1 = or i64 %indvars.iv, 1
				%arrayidx3 = getelementptr inbounds i32, i32* %A, i64 %1
				%2 = trunc i64 %1 to i32
				store i32 %2, i32* %arrayidx3, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}
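The intuition behind the cases in this test file can be captured with a simplified model. It assumes two accesses to the same base with the same constant stride and element size, measured in elements; the real LoopAccessAnalysis works on SCEV expressions and byte distances, so the hypothetical helper `mayTouchSameElement` is a sketch of the idea, not of the implementation.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model: access 1 touches elements k * Stride and access 2 touches
// k * Stride + Distance (k = 0, 1, 2, ...). The two sequences can share an
// element only if Distance is a multiple of Stride. Stride must be nonzero.
bool mayTouchSameElement(int64_t Stride, int64_t Distance) {
  return Distance % Stride == 0;
}
```

Under this model the three functions above are independent (stride 3 with distance 1, stride 4 with distance 3, stride 2 with distance 1), whereas the cases that follow use a distance equal to the stride and are reported as unsafe.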

				; Following cases are unsafe depdences and are not vectorizable.

				; void unsafe_Read_Write(int *A) {
				; for (unsigned i = 0; i < 1024; i+=3)
				; A[i+3] = A[i] + 1;
				; }

				; CHECK: function 'unsafe_Read_Write':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Report: unsafe dependent memory operations in loop
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: Backward:
				; CHECK-NEXT: %0 = load i32, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: store i32 %add, i32* %arrayidx3, align 4

				define void @unsafe_Read_Write(i32* nocapture %A) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%i.010 = phi i32 [ 0, %entry ], [ %add1, %for.body ]
				%idxprom = zext i32 %i.010 to i64
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %idxprom
				%0 = load i32, i32* %arrayidx, align 4
				%add = add nsw i32 %0, 1
				%add1 = add i32 %i.010, 3
				%idxprom2 = zext i32 %add1 to i64
				%arrayidx3 = getelementptr inbounds i32, i32* %A, i64 %idxprom2
				store i32 %add, i32* %arrayidx3, align 4
				%cmp = icmp ult i32 %add1, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; int unsafe_Write_Read(int *A) {
				; int sum = 0;
				; for (unsigned i = 0; i < 1024; i+=4) {
				; A[i] = i;
				; sum += A[i+4];
				; }
				;
				; return sum;
				; }

				; CHECK: function 'unsafe_Write_Read':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Report: unsafe dependent memory operations in loop
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: Backward:
				; CHECK-NEXT: store i32 %0, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: %1 = load i32, i32* %arrayidx2, align 4

				define i32 @unsafe_Write_Read(i32* nocapture %A) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret i32 %add3

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%sum.013 = phi i32 [ 0, %entry ], [ %add3, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = trunc i64 %indvars.iv to i32
				store i32 %0, i32* %arrayidx, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 4
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv.next
				%1 = load i32, i32* %arrayidx2, align 4
				%add3 = add nsw i32 %1, %sum.013
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; void unsafe_Write_Write(int *A) {
				; for (unsigned i = 0; i < 1024; i+=2) {
				; A[i] = i;
				; A[i+2] = i+1;
				; }
				; }

				; CHECK: function 'unsafe_Write_Write':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Report: unsafe dependent memory operations in loop
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: Backward:
				; CHECK-NEXT: store i32 %0, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: store i32 %2, i32* %arrayidx3, align 4

				define void @unsafe_Write_Write(i32* nocapture %A) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = trunc i64 %indvars.iv to i32
				store i32 %0, i32* %arrayidx, align 4
				%1 = or i64 %indvars.iv, 1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%arrayidx3 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv.next
				%2 = trunc i64 %1 to i32
				store i32 %2, i32* %arrayidx3, align 4
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; Following cases check that strided accesses can be vectorized.

				; void vectorizable_Read_Write(int *A) {
				; int *B = A + 4;
				; for (unsigned i = 0; i < 1024; i+=2)
				; B[i] = A[i] + 1;
				; }

				; CHECK: function 'vectorizable_Read_Write':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Memory dependences are safe
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: BackwardVectorizable:
				; CHECK-NEXT: %0 = load i32, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: store i32 %add, i32* %arrayidx2, align 4

				define void @vectorizable_Read_Write(i32* nocapture %A) {
				entry:
				%add.ptr = getelementptr inbounds i32, i32* %A, i64 4
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%add = add nsw i32 %0, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %add.ptr, i64 %indvars.iv
				store i32 %add, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; int vectorizable_Write_Read(int *A) {
				; int *B = A + 4;
				; int sum = 0;
				; for (unsigned i = 0; i < 1024; i+=2) {
				; A[i] = i;
				; sum += B[i];
				; }
				;
				; return sum;
				; }

				; CHECK: function 'vectorizable_Write_Read':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Memory dependences are safe
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: BackwardVectorizable:
				; CHECK-NEXT: store i32 %0, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: %1 = load i32, i32* %arrayidx2, align 4

				define i32 @vectorizable_Write_Read(i32* nocapture %A) {
				entry:
				%add.ptr = getelementptr inbounds i32, i32* %A, i64 4
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret i32 %add

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%sum.013 = phi i32 [ 0, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = trunc i64 %indvars.iv to i32
				store i32 %0, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %add.ptr, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%add = add nsw i32 %1, %sum.013
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; void vectorizable_Write_Write(int *A) {
				; int *B = A + 4;
				; for (unsigned i = 0; i < 1024; i+=2) {
				; A[i] = i;
				; B[i] = i+1;
				; }
				; }

				; CHECK: function 'vectorizable_Write_Write':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Memory dependences are safe
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: BackwardVectorizable:
				; CHECK-NEXT: store i32 %0, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: store i32 %2, i32* %arrayidx2, align 4

				define void @vectorizable_Write_Write(i32* nocapture %A) {
				entry:
				%add.ptr = getelementptr inbounds i32, i32* %A, i64 4
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = trunc i64 %indvars.iv to i32
				store i32 %0, i32* %arrayidx, align 4
				%1 = or i64 %indvars.iv, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %add.ptr, i64 %indvars.iv
				%2 = trunc i64 %1 to i32
				store i32 %2, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; void vectorizable_unscaled_Read_Write(int *A) {
				; int *B = (int *)((char *)A + 14);
				; for (unsigned i = 0; i < 1024; i+=2)
				; B[i] = A[i] + 1;
				; }

				; FIXME: This case looks like previous case @vectorizable_Read_Write. It sould
				; to be vectorizable.

				anemet: "It should be vectorizable."
				HaoLiu: Agree. Thanks.
				; CHECK: function 'vectorizable_unscaled_Read_Write':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Report: unsafe dependent memory operations in loop
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: BackwardVectorizableButPreventsForwarding:
				; CHECK-NEXT: %2 = load i32, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: store i32 %add, i32* %arrayidx2, align 4

				define void @vectorizable_unscaled_Read_Write(i32* nocapture %A) {
				entry:
				%0 = bitcast i32* %A to i8*
				%add.ptr = getelementptr inbounds i8, i8* %0, i64 14
				%1 = bitcast i8* %add.ptr to i32*
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%2 = load i32, i32* %arrayidx, align 4
				%add = add nsw i32 %2, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %1, i64 %indvars.iv
				store i32 %add, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; int vectorizable_unscaled_Write_Read(int *A) {
				; int *B = (int *)((char *)A + 17);
				; int sum = 0;
				; for (unsigned i = 0; i < 1024; i+=2) {
				; A[i] = i;
				; sum += B[i];
				; }
				;
				; return sum;
				; }

				; CHECK: for function 'vectorizable_unscaled_Write_Read':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Memory dependences are safe
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: BackwardVectorizable:
				; CHECK-NEXT: store i32 %2, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: %3 = load i32, i32* %arrayidx2, align 4

				define i32 @vectorizable_unscaled_Write_Read(i32* nocapture %A) {
				entry:
				%0 = bitcast i32* %A to i8*
				%add.ptr = getelementptr inbounds i8, i8* %0, i64 17
				%1 = bitcast i8* %add.ptr to i32*
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret i32 %add

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%sum.013 = phi i32 [ 0, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%2 = trunc i64 %indvars.iv to i32
				store i32 %2, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %1, i64 %indvars.iv
				%3 = load i32, i32* %arrayidx2, align 4
				%add = add nsw i32 %3, %sum.013
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; void unsafe_unscaled_Read_Write(int *A) {
				; int *B = (int *)((char *)A + 11);
				; for (unsigned i = 0; i < 1024; i+=2)
				; B[i] = A[i] + 1;
				; }

				; CHECK: function 'unsafe_unscaled_Read_Write':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Report: unsafe dependent memory operations in loop
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: Backward:
				; CHECK-NEXT: %2 = load i32, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: store i32 %add, i32* %arrayidx2, align 4

				define void @unsafe_unscaled_Read_Write(i32* nocapture %A) {
				entry:
				%0 = bitcast i32* %A to i8*
				%add.ptr = getelementptr inbounds i8, i8* %0, i64 11
				%1 = bitcast i8* %add.ptr to i32*
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%2 = load i32, i32* %arrayidx, align 4
				%add = add nsw i32 %2, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %1, i64 %indvars.iv
				store i32 %add, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; CHECK: function 'unsafe_unscaled_Read_Write2':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Report: unsafe dependent memory operations in loop
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: Backward:
				; CHECK-NEXT: %2 = load i32, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: store i32 %add, i32* %arrayidx2, align 4

				; void unsafe_unscaled_Read_Write2(int *A) {
				; int *B = (int *)((char *)A + 1);
				; for (unsigned i = 0; i < 1024; i+=2)
				; B[i] = A[i] + 1;
				; }

				define void @unsafe_unscaled_Read_Write2(i32* nocapture %A) {
				entry:
				%0 = bitcast i32* %A to i8*
				%add.ptr = getelementptr inbounds i8, i8* %0, i64 1
				%1 = bitcast i8* %add.ptr to i32*
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%2 = load i32, i32* %arrayidx, align 4
				%add = add nsw i32 %2, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %1, i64 %indvars.iv
				store i32 %add, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; The following case checks that interleaved stores that depend on another
				; store do not pass the dependence check.

				; void interleaved_stores(int *A) {
				; int *B = (int *) ((char *)A + 1);
				; for(int i = 0; i < 1024; i+=2) {
				; B[i] = i; // (1)
				; A[i+1] = i + 1; // (2)
				; B[i+1] = i + 1; // (3)
				; }
				; }
				;
				; Access (2) overlaps with (1) and (3).

				; CHECK: function 'interleaved_stores':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Report: unsafe dependent memory operations in loop
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: Backward:
				; CHECK-NEXT: store i32 %4, i32* %arrayidx5, align 4 ->
				; CHECK-NEXT: store i32 %4, i32* %arrayidx9, align 4
				; CHECK: Backward:
				; CHECK-NEXT: store i32 %2, i32* %arrayidx2, align 4 ->
				; CHECK-NEXT: store i32 %4, i32* %arrayidx5, align 4

				define void @interleaved_stores(i32* nocapture %A) {
				entry:
				%0 = bitcast i32* %A to i8*
				%incdec.ptr = getelementptr inbounds i8, i8* %0, i64 1
				%1 = bitcast i8* %incdec.ptr to i32*
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%2 = trunc i64 %indvars.iv to i32
				%arrayidx2 = getelementptr inbounds i32, i32* %1, i64 %indvars.iv
				store i32 %2, i32* %arrayidx2, align 4
				%3 = or i64 %indvars.iv, 1
				%arrayidx5 = getelementptr inbounds i32, i32* %A, i64 %3
				%4 = trunc i64 %3 to i32
				store i32 %4, i32* %arrayidx5, align 4
				%arrayidx9 = getelementptr inbounds i32, i32* %1, i64 %3
				store i32 %4, i32* %arrayidx9, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp slt i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}
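As a rough illustration of why the cases in this file are classified differently, the following Python sketch models a loop of the form `A[i + offset] = A[i] + 1` with a given step. It is not part of the patch and does not reproduce LoopAccessAnalysis; the names `run_scalar` and `run_vectorized` are hypothetical. Vectorization with factor `vf` is approximated by performing the loads of `vf` consecutive iterations before any of their stores. In this model, `unsafe_Read_Write` corresponds to step 3 and offset 3 (each iteration reads what the previous one wrote), and `vectorizable_Read_Write` to step 2 and offset 4 (each iteration reads what was written two iterations earlier).

```python
def run_scalar(n, step, offset):
    """Reference semantics: one load and one store per iteration, in order."""
    a = list(range(n + offset))
    for i in range(0, n, step):
        a[i + offset] = a[i] + 1
    return a


def run_vectorized(n, step, offset, vf):
    """Simplified vector model: vf iterations load together, then store together."""
    a = list(range(n + offset))
    indices = list(range(0, n, step))
    for start in range(0, len(indices), vf):
        chunk = indices[start:start + vf]
        loaded = [a[i] for i in chunk]          # all loads of the chunk first
        for i, value in zip(chunk, loaded):     # then all stores of the chunk
            a[i + offset] = value + 1
    return a


# unsafe_Read_Write: A[i+3] = A[i] + 1 with i += 3
unsafe_matches = run_scalar(24, 3, 3) == run_vectorized(24, 3, 3, vf=2)

# vectorizable_Read_Write: B = A + 4; B[i] = A[i] + 1 with i += 2
safe_at_vf2 = run_scalar(24, 2, 4) == run_vectorized(24, 2, 4, vf=2)
safe_at_vf4 = run_scalar(24, 2, 4) == run_vectorized(24, 2, 4, vf=4)
```

Under these assumptions the step-3 loop already diverges from the scalar result at `vf=2`, while the `B = A + 4` loop matches at `vf=2` and diverges at `vf=4`. That is in line with the `Backward` versus `BackwardVectorizable` reports above, where the latter indicates a dependence that is tolerable up to a bounded vector width.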

test/Analysis/LoopAccessAnalysis/zero-distance-dependence.ll

This file was added.

				; RUN: opt -loop-accesses -analyze < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

				; This case checks store-load forwarding in a store-load pair with distance 0.

				; int dist_zero_true_dep (int *A) {
				; int *B = A;
				; int sum = 0;
				; for (unsigned i = 0; i < 1024; i+=2) {
				; B[i] = i;
				; sum += A[i];
				; }
				; return sum;
				; }

				; CHECK: function 'dist_zero_true_dep':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Report: unsafe dependent memory operations in loop
				; CHECK-NEXT: Interesting Dependences:
				; CHECK-NEXT: ForwardButPreventsForwarding:
				; CHECK-NEXT: store i32 %0, i32* %arrayidx, align 4 ->
				; CHECK-NEXT: %1 = load i32, i32* %arrayidx2, align 4

				define i32 @dist_zero_true_dep(i32* nocapture %A) {
				entry:
				%B = getelementptr inbounds i32, i32* %A, i64 0
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret i32 %add

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%sum.010 = phi i32 [ 0, %entry ], [ %add, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
				%0 = trunc i64 %indvars.iv to i32
				store i32 %0, i32* %arrayidx, align 4
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%1 = load i32, i32* %arrayidx2, align 4
				%add = add nsw i32 %1, %sum.010
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 1024
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; This case checks that two accesses with distance 0 have no dependence.

				; int dist_zero_not_true_dep (int *A) {
				; for (unsigned i = 0; i < 1024; i+=2)
				; A[i] += i;
				; }

				; CHECK: function 'dist_zero_not_true_dep':
				; CHECK-NEXT: for.body:
				; CHECK-NEXT: Memory dependences are safe

				define i32 @dist_zero_not_true_dep(i32* nocapture %A) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret i32 undef

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%1 = trunc i64 %indvars.iv to i32
				%add = add i32 %0, %1
				store i32 %add, i32* %arrayidx, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

test/Transforms/LoopVectorize/AArch64/arbitrary-induction-step.ll

	; RUN: opt -S < %s -loop-vectorize 2>&1 | FileCheck %s			; RUN: opt -S < %s -loop-vectorize -force-vector-interleave=2 -force-vector-width=4 -enable-interleaved-mem-accesses=true | FileCheck %s
	; RUN: opt -S < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=2 | FileCheck %s --check-prefix=FORCE-VEC			; RUN: opt -S < %s -loop-vectorize -force-vector-interleave=1 -force-vector-width=2 -enable-interleaved-mem-accesses=true | FileCheck %s --check-prefix=FORCE-VEC

	target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
	target triple = "aarch64--linux-gnueabi"			target triple = "aarch64--linux-gnueabi"

	; Test integer induction variable of step 2:			; Test integer induction variable of step 2:
	; for (int i = 0; i < 1024; i+=2) {			; for (int i = 0; i < 1024; i+=2) {
	; int tmp = *A++;			; int tmp = *A++;
	; sum += i * tmp;			; sum += i * tmp;
	[86 lines not shown]
	; vectorization is not enforced, LV will only do interleave.			; vectorization is not enforced, LV will only do interleave.
	; for (int i = 0; i < 1024; i++) {			; for (int i = 0; i < 1024; i++) {
	; int tmp0 = *A++;			; int tmp0 = *A++;
	; int tmp1 = *A++;			; int tmp1 = *A++;
	; sum += tmp0 * tmp1;			; sum += tmp0 * tmp1;
	; }			; }

	; CHECK-LABEL: @ptr_ind_plus2(			; CHECK-LABEL: @ptr_ind_plus2(
	; CHECK: load i32, i32*			; CHECK: %[[V0:.*]] = load <8 x i32>
	; CHECK: load i32, i32*			; CHECK: shufflevector <8 x i32> %[[V0]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK: load i32, i32*			; CHECK: shufflevector <8 x i32> %[[V0]], <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; CHECK: load i32, i32*			; CHECK: %[[V1:.*]] = load <8 x i32>
	; CHECK: mul nsw i32			; CHECK: shufflevector <8 x i32> %[[V1]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	; CHECK: mul nsw i32			; CHECK: shufflevector <8 x i32> %[[V1]], <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
	; CHECK: add nsw i32			; CHECK: mul nsw <4 x i32>
	; CHECK: add nsw i32			; CHECK: mul nsw <4 x i32>
	; CHECK: %index.next = add i64 %index, 2			; CHECK: add nsw <4 x i32>
	; CHECK: %21 = icmp eq i64 %index.next, 1024			; CHECK: add nsw <4 x i32>
				; CHECK: %index.next = add i64 %index, 8
				; CHECK: icmp eq i64 %index.next, 1024

	; FORCE-VEC-LABEL: @ptr_ind_plus2(			; FORCE-VEC-LABEL: @ptr_ind_plus2(
	; FORCE-VEC: load i32, i32*			; FORCE-VEC: %[[V:.*]] = load <4 x i32>
	; FORCE-VEC: insertelement <2 x i32>			; FORCE-VEC: shufflevector <4 x i32> %[[V]], <4 x i32> undef, <2 x i32> <i32 0, i32 2>
	; FORCE-VEC: load i32, i32*			; FORCE-VEC: shufflevector <4 x i32> %[[V]], <4 x i32> undef, <2 x i32> <i32 1, i32 3>
	; FORCE-VEC: insertelement <2 x i32>
	; FORCE-VEC: load i32, i32*
	; FORCE-VEC: insertelement <2 x i32>
	; FORCE-VEC: load i32, i32*
	; FORCE-VEC: insertelement <2 x i32>
	; FORCE-VEC: mul nsw <2 x i32>			; FORCE-VEC: mul nsw <2 x i32>
	; FORCE-VEC: add nsw <2 x i32>			; FORCE-VEC: add nsw <2 x i32>
	; FORCE-VEC: %index.next = add i64 %index, 2			; FORCE-VEC: %index.next = add i64 %index, 2
	; FORCE-VEC: icmp eq i64 %index.next, 1024			; FORCE-VEC: icmp eq i64 %index.next, 1024
	define i32 @ptr_ind_plus2(i32* %A) {			define i32 @ptr_ind_plus2(i32* %A) {
	entry:			entry:
	br label %for.body			br label %for.body

	[18 lines not shown]

test/Transforms/LoopVectorize/interleaved-accesses.ll

This file was added.

				; RUN: opt -S -loop-vectorize -instcombine -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true -runtime-memory-check-threshold=24 < %s | FileCheck %s

				mzolotukhin: Is it possible to use `noalias` attributes for arguments and drop `-runtime-memory-check-threshold=24`?
				HaoLiu: Hi Michael, do you mean adding "-scoped-noalias"? I tried, but it doesn't work. I think the runtime check may be different from normal alias analysis.
				mzolotukhin: No, I meant changing the declaration from this:
				    define void @foo(%struct.ST2* nocapture readonly %A, %struct.ST2* nocapture %B)
				to
				    define void @foo(%struct.ST2* noalias nocapture readonly %A, %struct.ST2* noalias nocapture %B)
				It'll tell that `A` and `B` don't alias, and thus we won't need runtime checks for them.
				HaoLiu: I tried this, but it still does not work for some cases like @test_struct_store4. It is a known problem with our runtime memory checks: they also check the dependences within the same array. E.g.
				    for (i = 0; i < n; i+=4) { A[i] = a; A[i+1] = b; A[i+2] = c; A[i+3] = d; }
				A lot of runtime checks will be generated for pairs in {A[i], A[i+1], A[i+2], A[i+3]} and will break the threshold.
				mzolotukhin: Which of the tests needs `-runtime-memory-check-threshold=24`? Could we get rid of this flag by adding the `noalias` attribute, or aliasing metadata?
				HaoLiu: We couldn't. There are still too many runtime checks for accesses related to one pointer (even though they are noalias). E.g.
				    for (i = 0; i < 1024; i+=3) { A[i] += a; A[i+1] += b; A[i+2] += c; }
				It has 3 loads and 3 stores, which need 11 runtime checks (greater than the current threshold 8). The 3 stores will be compared with each other (3 checks), and each load will be compared with the 3 stores (3 * 3 checks). This could be improved in the future: there is no need to compare accesses in the same group. But currently, this is the workaround for testing.
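The pair counting described in the last comment can be sketched as below. This is an illustrative Python model only, not LoopAccessAnalysis code. The function name `count_pairwise_checks` is hypothetical, and the rule (every store against every other store, every load against every store, loads never against loads) is taken from the comment as stated. Counted literally this way, the 3-load/3-store loop yields 3 + 9 = 12 pairs, one more than the 11 quoted in the comment; either figure is above the threshold of 8.

```python
from itertools import combinations, product


def count_pairwise_checks(num_loads, num_stores):
    """Count access pairs that would need a check under the stated rule."""
    # Every unordered pair of distinct stores is compared once.
    store_store = len(list(combinations(range(num_stores), 2)))
    # Every load is compared against every store.
    load_store = len(list(product(range(num_loads), range(num_stores))))
    return store_store + load_store


# A[i] += a; A[i+1] += b; A[i+2] += c  ->  3 loads, 3 stores
checks_for_stride3 = count_pairwise_checks(3, 3)

# One load feeding four stores, as in the shape of @test_struct_store4
checks_for_store4 = count_pairwise_checks(1, 4)
```

Both example counts exceed 8 in this model, which is consistent with the need to raise `-runtime-memory-check-threshold` for these tests.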
				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

				; Check vectorization on an interleaved load group of Delta 2 and an interleaved
				; store group of Delta 2.

				; int AB[1024];
				; int CD[1024];
				; void test_array_load2_store2(int C, int D) {
				; for (int i = 0; i < 1024; i+=2) {
				; int A = AB[i];
				; int B = AB[i+1];
				; CD[i] = A + C;
				; CD[i+1] = B * D;
				; }
				; }

				; CHECK-LABEL: @test_array_load2_store2(
				; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* %{{.*}}, align 4
				; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				; CHECK: add nsw <4 x i32>
				; CHECK: mul nsw <4 x i32>
				; CHECK: %interleaved.vec = shufflevector <4 x i32> {{.*}}, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				; CHECK: store <8 x i32> %interleaved.vec, <8 x i32>* %{{.*}}, align 4

				%struct.ST2 = type { i32, i32 }
				@AB = common global [1024 x i32] zeroinitializer, align 4
				@CD = common global [1024 x i32] zeroinitializer, align 4

				define void @test_array_load2_store2(i32 %C, i32 %D) {
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx0 = getelementptr inbounds [1024 x i32], [1024 x i32]* @AB, i64 0, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx0, align 4
				%1 = or i64 %indvars.iv, 1
				mzolotukhin: Please run the tests through `opt -instnamer` to replace %[0-9]+ names - they are really hard to deal with when one needs to modify the test.
				HaoLiu: Renamed anonymous names. This is really a helpful command.
				%arrayidx1 = getelementptr inbounds [1024 x i32], [1024 x i32]* @AB, i64 0, i64 %1
				%2 = load i32, i32* %arrayidx1, align 4
				%add = add nsw i32 %0, %C
				%mul = mul nsw i32 %2, %D
				%arrayidx2 = getelementptr inbounds [1024 x i32], [1024 x i32]* @CD, i64 0, i64 %indvars.iv
				store i32 %add, i32* %arrayidx2, align 4
				%arrayidx3 = getelementptr inbounds [1024 x i32], [1024 x i32]* @CD, i64 0, i64 %1
				store i32 %mul, i32* %arrayidx3, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp slt i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}
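The shuffle masks expected by the CHECK lines of this test follow a regular pattern, which the Python sketch below reproduces. It is only an illustration of the index arithmetic, not the vectorizer's implementation; the helper names `strided_mask`, `interleave_mask` and `shuffle` are hypothetical. `factor` is the number of members in the interleaved group and `vf` the vector factor (2 and 4 for this test).

```python
def strided_mask(member, factor, vf):
    """Indices that pick one member of each group out of a wide load."""
    return [member + factor * lane for lane in range(vf)]


def interleave_mask(factor, vf):
    """Indices that interleave `factor` concatenated vectors of length vf."""
    return [lane + vf * member for lane in range(vf) for member in range(factor)]


def shuffle(vec, mask):
    """Single-operand shuffle: result[k] = vec[mask[k]]."""
    return [vec[i] for i in mask]


wide = list(range(100, 108))                  # stands in for AB[i] .. AB[i+7]
a_vec = shuffle(wide, strided_mask(0, 2, 4))  # even elements
b_vec = shuffle(wide, strided_mask(1, 2, 4))  # odd elements

# Concatenating the two member vectors and interleaving restores memory order.
restored = shuffle(a_vec + b_vec, interleave_mask(2, 4))
```

With `factor=2` and `vf=4` the helpers give the masks `<0, 2, 4, 6>`, `<1, 3, 5, 7>` and `<0, 4, 1, 5, 2, 6, 3, 7>` that appear in the CHECK lines above; the same formulas also match the factor 3 and factor 4 masks in the later tests of this file.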

				; Check vectorization on an interleaved load group of Delta 3 and an interleaved
				; store group of Delta 3.

				; int A[3072];
				; struct ST3 S[1024];
				; void test_struct_array_load3_store3() {
				; int *ptr = A;
				; for (int i = 0; i < 1024; i++) {
				; int X1 = *ptr++;
				; int X2 = *ptr++;
				; int X3 = *ptr++;
				; S[i].x = X1 + 1;
				; S[i].y = X2 + 2;
				; S[i].z = X3 + 3;
				; }
				; }

				; CHECK-LABEL: @test_struct_array_load3_store3(
				; CHECK: %wide.vec = load <12 x i32>, <12 x i32>* {{.*}}, align 4
				; CHECK: shufflevector <12 x i32> %wide.vec, <12 x i32> undef, <4 x i32> <i32 0, i32 3, i32 6, i32 9>
				; CHECK: shufflevector <12 x i32> %wide.vec, <12 x i32> undef, <4 x i32> <i32 1, i32 4, i32 7, i32 10>
				; CHECK: shufflevector <12 x i32> %wide.vec, <12 x i32> undef, <4 x i32> <i32 2, i32 5, i32 8, i32 11>
				; CHECK: add nsw <4 x i32> {{.*}}, <i32 1, i32 1, i32 1, i32 1>
				; CHECK: add nsw <4 x i32> {{.*}}, <i32 2, i32 2, i32 2, i32 2>
				; CHECK: add nsw <4 x i32> {{.*}}, <i32 3, i32 3, i32 3, i32 3>
				; CHECK: shufflevector <4 x i32> {{.*}}, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; CHECK: shufflevector <4 x i32> {{.*}}, <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
				; CHECK: %interleaved.vec = shufflevector <8 x i32> {{.*}}, <12 x i32> <i32 0, i32 4, i32 8, i32 1, i32 5, i32 9, i32 2, i32 6, i32 10, i32 3, i32 7, i32 11>
				; CHECK: store <12 x i32> %interleaved.vec, <12 x i32>* {{.*}}, align 4

				%struct.ST3 = type { i32, i32, i32 }
				@A = common global [3072 x i32] zeroinitializer, align 4
				@S = common global [1024 x %struct.ST3] zeroinitializer, align 4

				define void @test_struct_array_load3_store3() {
				entry:
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%ptr.016 = phi i32* [ getelementptr inbounds ([3072 x i32], [3072 x i32]* @A, i64 0, i64 0), %entry ], [ %incdec.ptr2, %for.body ]
				%incdec.ptr = getelementptr inbounds i32, i32* %ptr.016, i64 1
				%0 = load i32, i32* %ptr.016, align 4
				%incdec.ptr1 = getelementptr inbounds i32, i32* %ptr.016, i64 2
				%1 = load i32, i32* %incdec.ptr, align 4
				%incdec.ptr2 = getelementptr inbounds i32, i32* %ptr.016, i64 3
				%2 = load i32, i32* %incdec.ptr1, align 4
				%add = add nsw i32 %0, 1
				%x = getelementptr inbounds [1024 x %struct.ST3], [1024 x %struct.ST3]* @S, i64 0, i64 %indvars.iv, i32 0
				store i32 %add, i32* %x, align 4
				%add3 = add nsw i32 %1, 2
				%y = getelementptr inbounds [1024 x %struct.ST3], [1024 x %struct.ST3]* @S, i64 0, i64 %indvars.iv, i32 1
				store i32 %add3, i32* %y, align 4
				%add6 = add nsw i32 %2, 3
				%z = getelementptr inbounds [1024 x %struct.ST3], [1024 x %struct.ST3]* @S, i64 0, i64 %indvars.iv, i32 2
				store i32 %add6, i32* %z, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 1024
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret void
				}
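For the store group of three members, the CHECK lines above expect two preparatory shuffles before the final interleave: two member vectors are concatenated, and the third is widened with undef lanes so that both operands of the last shuffle have the same length. The Python sketch below models that sequence with a two-operand shuffle. It is an illustration under simplifying assumptions, not the patch's code; `shufflevector` here is a hypothetical helper mimicking the IR instruction, and `UNDEF` is modelled as `None`.

```python
UNDEF = None


def shufflevector(a, b, mask):
    """Two-operand shuffle: indices address the concatenation of a and b."""
    both = a + b
    return [UNDEF if i is UNDEF else both[i] for i in mask]


x = [10, 11, 12, 13]   # values for S[i].x in four iterations
y = [20, 21, 22, 23]   # values for S[i].y
z = [30, 31, 32, 33]   # values for S[i].z

# Step 1: concatenate x and y into one 8-element vector.
xy = shufflevector(x, y, [0, 1, 2, 3, 4, 5, 6, 7])

# Step 2: widen z to 8 elements; the upper four lanes are undef.
z_wide = shufflevector(z, [UNDEF] * 4,
                       [0, 1, 2, 3, UNDEF, UNDEF, UNDEF, UNDEF])

# Step 3: interleave into memory order x0 y0 z0 x1 y1 z1 ...
interleaved = shufflevector(xy, z_wide,
                            [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11])
```

The final mask never selects indices 12 to 15, so the undef lanes introduced in step 2 do not reach the stored `<12 x i32>` value.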

				; Check vectorization on an interleaved load group of Delta 4.

				; struct ST4{
				; int x;
				; int y;
				; int z;
				; int w;
				; };
				; int test_struct_load4(struct ST4 *S) {
				; int r = 0;
				; for (int i = 0; i < 1024; i++) {
				; r += S[i].x;
				; r -= S[i].y;
				; r += S[i].z;
				; r -= S[i].w;
				; }
				; return r;
				; }

				; CHECK-LABEL: @test_struct_load4(
				; CHECK: %wide.vec = load <16 x i32>, <16 x i32>* {{.*}}, align 4
				; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 0, i32 4, i32 8, i32 12>
				; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 1, i32 5, i32 9, i32 13>
				; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 2, i32 6, i32 10, i32 14>
				; CHECK: shufflevector <16 x i32> %wide.vec, <16 x i32> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>
				; CHECK: add nsw <4 x i32>
				; CHECK: sub <4 x i32>
				; CHECK: add nsw <4 x i32>
				; CHECK: sub <4 x i32>

				%struct.ST4 = type { i32, i32, i32, i32 }

				define i32 @test_struct_load4(%struct.ST4* nocapture readonly %S) {
				entry:
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%r.022 = phi i32 [ 0, %entry ], [ %sub8, %for.body ]
				%x = getelementptr inbounds %struct.ST4, %struct.ST4* %S, i64 %indvars.iv, i32 0
				%0 = load i32, i32* %x, align 4
				%add = add nsw i32 %0, %r.022
				%y = getelementptr inbounds %struct.ST4, %struct.ST4* %S, i64 %indvars.iv, i32 1
				%1 = load i32, i32* %y, align 4
				%sub = sub i32 %add, %1
				%z = getelementptr inbounds %struct.ST4, %struct.ST4* %S, i64 %indvars.iv, i32 2
				%2 = load i32, i32* %z, align 4
				%add5 = add nsw i32 %sub, %2
				%w = getelementptr inbounds %struct.ST4, %struct.ST4* %S, i64 %indvars.iv, i32 3
				%3 = load i32, i32* %w, align 4
				%sub8 = sub i32 %add5, %3
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 1024
				br i1 %exitcond, label %for.end, label %for.body

				for.end: ; preds = %for.body
				ret i32 %sub8
				}

				; Check vectorization on an interleaved store group of Delta 4.

				; void test_struct_store4(int *A, struct ST4 *B) {
				; int *ptr = A;
				; for (int i = 0; i < 1024; i++) {
				; int X = *ptr++;
				; B[i].x = X + 1;
				; B[i].y = X * 2;
				; B[i].z = X + 3;
				; B[i].w = X + 4;
				; }
				; }

				; CHECK-LABEL: @test_struct_store4(
				; CHECK: %[[LD:.*]] = load <4 x i32>, <4 x i32>*
				; CHECK: add nsw <4 x i32> %[[LD]], <i32 1, i32 1, i32 1, i32 1>
				; CHECK: shl nsw <4 x i32> %[[LD]], <i32 1, i32 1, i32 1, i32 1>
				; CHECK: add nsw <4 x i32> %[[LD]], <i32 3, i32 3, i32 3, i32 3>
				; CHECK: add nsw <4 x i32> %[[LD]], <i32 4, i32 4, i32 4, i32 4>
				; CHECK: shufflevector <4 x i32> {{.*}}, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; CHECK: shufflevector <4 x i32> {{.*}}, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; CHECK: %interleaved.vec = shufflevector <8 x i32> {{.*}}, <16 x i32> <i32 0, i32 4, i32 8, i32 12, i32 1, i32 5, i32 9, i32 13, i32 2, i32 6, i32 10, i32 14, i32 3, i32 7, i32 11, i32 15>
				; CHECK: store <16 x i32> %interleaved.vec, <16 x i32>* {{.*}}, align 4

				define void @test_struct_store4(i32* noalias nocapture readonly %A, %struct.ST4* noalias nocapture %B) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%ptr.024 = phi i32* [ %A, %entry ], [ %incdec.ptr, %for.body ]
				%incdec.ptr = getelementptr inbounds i32, i32* %ptr.024, i64 1
				%0 = load i32, i32* %ptr.024, align 4
				%add = add nsw i32 %0, 1
				%x = getelementptr inbounds %struct.ST4, %struct.ST4* %B, i64 %indvars.iv, i32 0
				store i32 %add, i32* %x, align 4
				%mul = shl nsw i32 %0, 1
				%y = getelementptr inbounds %struct.ST4, %struct.ST4* %B, i64 %indvars.iv, i32 1
				store i32 %mul, i32* %y, align 4
				%add3 = add nsw i32 %0, 3
				%z = getelementptr inbounds %struct.ST4, %struct.ST4* %B, i64 %indvars.iv, i32 2
				store i32 %add3, i32* %z, align 4
				%add6 = add nsw i32 %0, 4
				%w = getelementptr inbounds %struct.ST4, %struct.ST4* %B, i64 %indvars.iv, i32 3
				store i32 %add6, i32* %w, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 1024
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; Check vectorization on a reverse interleaved load group of Delta 2 and
				; a reverse interleaved store group of Delta 2.

				; struct ST2 {
				; int x;
				; int y;
				; };
				;
				; void test_reversed_load2_store2(struct ST2 *A, struct ST2 *B) {
				; for (int i = 1023; i >= 0; i--) {
				; int a = A[i].x + i; // interleaved load of index 0
				; int b = A[i].y - i; // interleaved load of index 1
				; B[i].x = a; // interleaved store of index 0
				; B[i].y = b; // interleaved store of index 1
				; }
				; }

				; CHECK-LABEL: @test_reversed_load2_store2(
				; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* {{.*}}, align 4
				; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				; CHECK: shufflevector <4 x i32> {{.*}}, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
				; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				; CHECK: shufflevector <4 x i32> {{.*}}, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
				; CHECK: add nsw <4 x i32>
				; CHECK: sub nsw <4 x i32>
				; CHECK: shufflevector <4 x i32> {{.*}}, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
				; CHECK: shufflevector <4 x i32> {{.*}}, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
				; CHECK: %interleaved.vec = shufflevector <4 x i32> {{.*}}, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				; CHECK: store <8 x i32> %interleaved.vec, <8 x i32>* %{{.*}}, align 4

				define void @test_reversed_load2_store2(%struct.ST2* noalias nocapture readonly %A, %struct.ST2* noalias nocapture %B) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %for.body ]
				%x = getelementptr inbounds %struct.ST2, %struct.ST2* %A, i64 %indvars.iv, i32 0
				%0 = load i32, i32* %x, align 4
				%1 = trunc i64 %indvars.iv to i32
				%add = add nsw i32 %0, %1
				%y = getelementptr inbounds %struct.ST2, %struct.ST2* %A, i64 %indvars.iv, i32 1
				%2 = load i32, i32* %y, align 4
				%sub = sub nsw i32 %2, %1
				%x5 = getelementptr inbounds %struct.ST2, %struct.ST2* %B, i64 %indvars.iv, i32 0
				store i32 %add, i32* %x5, align 4
				%y8 = getelementptr inbounds %struct.ST2, %struct.ST2* %B, i64 %indvars.iv, i32 1
				store i32 %sub, i32* %y8, align 4
				%indvars.iv.next = add nsw i64 %indvars.iv, -1
				%cmp = icmp sgt i64 %indvars.iv, 0
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; Check vectorization on an interleaved load group of Delta 2 with 1 gap
				; (missing the load of odd elements).

				; void even_load(int *A, int *B) {
				; for (unsigned i = 0; i < 1024; i+=2)
				; B[i/2] = A[i] * 2;
				; }

				; CHECK-LABEL: @even_load(
				; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* %{{.*}}, align 4
				; CHECK: %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				; CHECK-NOT: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				; CHECK: shl nsw <4 x i32> %strided.vec, <i32 1, i32 1, i32 1, i32 1>
				define void @even_load(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%mul = shl nsw i32 %0, 1
				%1 = lshr exact i64 %indvars.iv, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %B, i64 %1
				store i32 %mul, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; Check vectorization on interleaved access groups identified from mixed
				; loads/stores.
				; void mixed_load_store(int *A, int *B) {
				; for (unsigned i = 0; i < 1024; i+=2) {
				; B[i] = A[i] * A[i+1];
				; B[i+1] = A[i] + A[i+1];
				; }
				; }

				; CHECK-LABEL: @mixed_load_store(
				; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* {{.*}}, align 4
				; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				; CHECK: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				; CHECK: %interleaved.vec = shufflevector <4 x i32> %{{.*}}, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
				; CHECK: store <8 x i32> %interleaved.vec
				define void @mixed_load_store(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%1 = or i64 %indvars.iv, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 %1
				%2 = load i32, i32* %arrayidx2, align 4
				%mul = mul nsw i32 %2, %0
				%arrayidx4 = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
				store i32 %mul, i32* %arrayidx4, align 4
				%3 = load i32, i32* %arrayidx, align 4
				%4 = load i32, i32* %arrayidx2, align 4
				%add10 = add nsw i32 %4, %3
				%arrayidx13 = getelementptr inbounds i32, i32* %B, i64 %1
				store i32 %add10, i32* %arrayidx13, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; Check vectorization on interleaved access groups whose members have
				; different types.

				; struct IntFloat {
				; int a;
				; float b;
				; };
				;
				; int SA;
				; float SB;
				;
				; void int_float_struct(struct IntFloat *A) {
				; int SumA;
				; float SumB;
				; for (unsigned i = 0; i < 1024; i++) {
				; SumA += A[i].a;
				; SumB += A[i].b;
				; }
				; SA = SumA;
				; SB = SumB;
				; }

				; CHECK-LABEL: @int_float_struct(
				; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* %{{.*}}, align 4
				; CHECK: %[[V0:.*]] = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				; CHECK: %[[V1:.*]] = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				; CHECK: bitcast <4 x i32> %[[V1]] to <4 x float>
				; CHECK: add nsw <4 x i32>
				; CHECK: fadd fast <4 x float>

				%struct.IntFloat = type { i32, float }

				@SA = common global i32 0, align 4
				@SB = common global float 0.000000e+00, align 4

				define void @int_float_struct(%struct.IntFloat* nocapture readonly %A) #0 {
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				store i32 %add, i32* @SA, align 4
				store float %add3, float* @SB, align 4
				ret void

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%SumB.014 = phi float [ undef, %entry ], [ %add3, %for.body ]
				%SumA.013 = phi i32 [ undef, %entry ], [ %add, %for.body ]
				%a = getelementptr inbounds %struct.IntFloat, %struct.IntFloat* %A, i64 %indvars.iv, i32 0
				%0 = load i32, i32* %a, align 4
				%add = add nsw i32 %0, %SumA.013
				%b = getelementptr inbounds %struct.IntFloat, %struct.IntFloat* %A, i64 %indvars.iv, i32 1
				%1 = load float, float* %b, align 4
				%add3 = fadd fast float %SumB.014, %1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 1024
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				attributes #0 = { "unsafe-fp-math"="true" }
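
The shuffle masks checked by the tests above follow a regular pattern. The following is a minimal sketch of that pattern in Python, not the vectorizer's actual code: the helper names (strided_mask, interleave_mask, reverse_mask, shuffle) are hypothetical and chosen only for illustration. It assumes a vector factor VF and an interleave factor equal to the number of struct members.

```python
def strided_mask(start, stride, vf):
    """Lanes that extract member `start` from a wide load of vf * stride elements."""
    return [start + i * stride for i in range(vf)]


def interleave_mask(vf, num_vecs):
    """Lanes that interleave `num_vecs` concatenated vectors of `vf` elements each."""
    return [j * vf + i for i in range(vf) for j in range(num_vecs)]


def reverse_mask(vf):
    """Lanes that reverse a vector of `vf` elements (used for reversed groups)."""
    return list(range(vf - 1, -1, -1))


def shuffle(vec, mask):
    """Model of a single-input shufflevector: pick vec[m] for each mask lane m."""
    return [vec[m] for m in mask]


# Model the wide load in test_struct_load4: 4 structs of 4 members each.
VF, FACTOR = 4, 4
wide = list(range(100, 100 + VF * FACTOR))

# De-interleave into one vector per member (x, y, z, w).
members = [shuffle(wide, strided_mask(k, FACTOR, VF)) for k in range(FACTOR)]

# Concatenate the member vectors and re-interleave, as in test_struct_store4.
concatenated = [lane for member in members for lane in member]
round_trip = shuffle(concatenated, interleave_mask(VF, FACTOR))
```

With VF = 4, strided_mask(0, 4, 4) gives the first mask checked in test_struct_load4, interleave_mask(4, 4) gives the %interleaved.vec mask in test_struct_store4, and interleave_mask(4, 2) gives the mask used in the Delta 2 tests. Re-interleaving the de-interleaved members reproduces the original wide vector.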


Diff 26876
