This is an archive of the discontinued LLVM Phabricator instance.

Differential D19984

[LV] Preserve order of dependences in interleaved accesses analysis
ClosedPublic

Authored by mssimpso on May 5 2016, 10:48 AM.

Details

Reviewers

anemet
hfinkel
sbaranga

Commits

rGe794678404ab: [LV] Preserve order of dependences in interleaved accesses analysis
rL273687: [LV] Preserve order of dependences in interleaved accesses analysis

Summary

The interleaved access analysis currently assumes that the inserted run-time pointer aliasing checks ensure the absence of dependences that would prevent its instruction reordering. However, this is not the case.

Issues can arise from how code generation is performed for interleaved groups. For a load group, all loads in the group are essentially moved to the location of the first load in program order, and for a store group, all stores in the group are moved to the location of the last store. For groups having members involved in a dependence relation with any other instruction in the loop, this reordering can violate the dependence.
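To make the hazard concrete, here is a minimal C++ sketch. It is not code from the patch. It models each access by its position in program order and assumes, as the paragraph above describes, that a load group is emitted at its first member and a store group at its last member. The names Group, Dep, insertPosition, and reorderingBreaksDependence are hypothetical, and groups are assumed to be non-empty.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model: an access is identified by its position in program order.
struct Group {
  bool isStore;
  std::vector<int> members; // positions of the group's members (non-empty)
};

// A dependence from an earlier access (src) to a later one (dst).
struct Dep {
  int src;
  int dst;
};

// Loads are emitted at the first member, stores at the last member.
int insertPosition(const Group &g) {
  return g.isStore ? *std::max_element(g.members.begin(), g.members.end())
                   : *std::min_element(g.members.begin(), g.members.end());
}

// True if moving every member to the insert position carries a member across
// an access it has a dependence with.
bool reorderingBreaksDependence(const Group &g, const std::vector<Dep> &deps) {
  const int pos = insertPosition(g);
  for (int m : g.members) {
    for (const Dep &d : deps) {
      // A member hoisted above the access it depends on.
      if (d.dst == m && pos < d.src && d.src < m)
        return true;
      // A member sunk below an access that depends on it.
      if (d.src == m && m < d.dst && d.dst < pos)
        return true;
    }
  }
  return false;
}
```

For example, with a load at position 0, a store at position 1, and a load at position 2, a load group {0, 2} together with a dependence from 1 to 2 is reported as a violation, because the second load would be hoisted above the store it reads from.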

This patch teaches the interleaved access analysis how to avoid breaking such dependences, and should fix PR27626.

An assumption of the original analysis was that the accesses had been collected in "program order". The analysis was then simplified by visiting the accesses bottom-up. However, this ordering was never guaranteed for anything other than single basic block loops. Thus, this patch also enforces the desired ordering.

Event Timeline

mssimpso updated this revision to Diff 56312. May 5 2016, 10:48 AM

mssimpso retitled this revision from to [LV] Handle RAW dependences in interleaved access analysis.

mssimpso updated this object.

mssimpso added reviewers: sbaranga, anemet, hfinkel.

mssimpso added subscribers: mcrosier, llvm-commits.

Herald added a subscriber: mzolotukhin. May 5 2016, 10:48 AM

mssimpso mentioned this in D19694: [LV] Allow interleaved accesses in loops with predicated blocks. May 5 2016, 2:00 PM

mssimpso added a child revision: D19694: [LV] Allow interleaved accesses in loops with predicated blocks.

I think non-zero dependences would have already been rejected by LAA. Would this be the reason why it is correct to only look at the zero distance ones?

Thanks,
Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
972	I guess we can check this with DT because the loads/stores are not predicated? (same in a bunch of other places).
5406	This sentence seems to be unfinished?
5418	Wouldn't we need to add B to LoopIndependentRAWStores even if we add it to StoresToRemove?

Hi Silviu, thanks for the comments.

I think non-zero dependences would have already been rejected by LAA. Would this be the reason why it is correct to only look at the zero distance ones?

Yes, that should be the case!

lib/Transforms/Vectorize/LoopVectorize.cpp
972	I intended the dominance check to work for the predicated accesses as well. The check says that if the read does not dominate the write, then we have to conservatively assume the write may happen first, leading to a read-after-write dependence.
5406	Thanks for catching that. I'll submit an update
5418	StoresToRemove holds stores whose groups will definitely be removed. LoopIndependentRAWStores holds stores that we don't yet have enough information about to determine if they need to be removed or not. In this case, if the current store (B) is already in a group, it means that it will be re-ordered by sinking it to the insert location of the group, violating the dependence. If it's not yet in a group (and thus, won't be re-ordered), we don't yet know that the load (A) won't be hoisted, which would also violate the dependence. Once we know that another load has been added to A's group, we know that A will be re-ordered. When this happens, we move all the stores in LoopIndependentRAWStores to StoresToRemove, to mark them for definite removal.

Updated comments.

Overall I think this looks good, but it would be better if someone else would also have a look before committing.

Thanks,
Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
972	OK, this makes sense.

Thanks very much Silviu! I'll wait for Adam or Hal to provide additional feedback.

In D19984#424731, @mssimpso wrote:

Hi Silviu, thanks for the comments.

I think non-zero dependences would have already been rejected by LAA. Would this be the reason why it is correct to only look at the zero distance ones?

Yes, that should be the case!

Why are non-zero forward deps rejected by LAA? Because of the HW store-to-load forwarding case? I think that that is only a performance consideration and it's only on conditionally.

I think that we should probably take the time and review the soundness of these code motions with respect to interleaved access vectorization {RAW, WAR, WAW} x {loop-independent, loop-carried}. I've only looked at the LAA aspects of this feature so far but I am a bit worried about this code now.

Matt, do you think you can do this?

I am also thinking if we should disable this feature in the meantime?!

Hi Adam,

Why are non-zero forward deps rejected by LAA? Because of the HW store-to-load forwarding case? I think that that is only a performance consideration and it's only on conditionally.

Silviu was referring to a comment I had made in the source. But I think the idea was that LAA had already ensured the absence of the dependences that would have prevented vectorization. So the interleaved access analysis didn't need to worry about them. Regarding the store-to-load forwarding case, yes, that is the assumption the original analysis made, which was incorrect. The current patch attempts to fix that.

I think that we should probably take the time and review the soundness of these code motions with respect to interleaved access vectorization {RAW, WAR, WAW} x {loop-independent, loop-carried}. I've only looked at the LAA aspects of this feature so far but I am a bit worried about this code now.

Matt, do you think you can do this?

I'm happy to give the analysis a very careful second look. I'll do that today and report back.

I am also thinking if we should disable this feature in the meantime?!

We should definitely disable interleaved access vectorization if we can't correct the bug (PR27626) before release time. Doing so would have performance implications for ARM/AArch64, so I'm hoping we can fix it before then. I wouldn't be strongly opposed to disabling it in the meantime. But to my knowledge, the bug isn't currently blocking anyone, and it's been around since the original implementation. Silviu and I only discovered it in passing, and it wasn't something we encountered in the wild. What do you think?

Just to expand on the point above:

From the algorithm of constructing interleaved groups, we should be able to exclude both WAW (from the algorithm, see the comments) and WAR (we're interleaving and then moving stores down and loads up, so we cannot break these) - for both loop-carried and loop independent dependences.

So the problem should only be RAW. If it is loop carried, then it is a forward dependence and we cannot vectorize, so this should be safe.
And this should handle the loop independent case.

Also we've probably not seen this until now because other optimizations would simply remove the load before this got to the vectorizer?

Cheers,
Silviu

In D19984#428293, @mssimpso wrote:

Hi Adam,

Why are non-zero forward deps rejected by LAA? Because of the HW store-to-load forwarding case? I think that that is only a performance consideration and it's only on conditionally.

Silviu was referring to a comment I had made in the source. But I think the idea was that LAA had already ensured the absence of the dependences that would have prevented vectorization. So the interleaved access analysis didn't need to worry about them. Regarding the store-to-load forwarding case, yes, that is the assumption the original analysis made, which was incorrect. The current patch attempts to fix that.

No, I was asking about *non-zero* distance deps specifically. The current patch only handles zero-distance deps. So my question was whether for non-zero distance we still rely on the store-to-load forwarding detection code to make the dep unsafe for vectorization.

Hi Silviu,

In D19984#428379, @sbaranga wrote:

Just to expand on the point above:

From the algorithm of constructing interleaved groups, we should be able to exclude both WAW (from the algorithm, see the comments) and WAR (we're interleaving and then moving stores down and loads up, so we cannot break these) - for both loop-carried and loop independent dependences.

What about moving elements of an interleaved group over other dependent accesses not in the same group?

Adam

test/Transforms/LoopVectorize/interleaved-accesses.ll
574–579	You mean p[i].x etc in the loop.

Hi Guys,

I've been thinking through these dependence issues more carefully. I'm not yet finished with all the permutations, but I've come across another somewhat unrelated issue. I thought I would post an update about that and respond to some of Adam's questions in the meantime. First the questions:

No, I was asking about *non-zero* distance deps specifically. The current patch only handles zero-distance deps. So my question was whether for non-zero distance we still rely on the store-to-load forwarding detection code to make the dep unsafe for vectorization.

For positive dependences, LAA prevents vectorization if the distance is less than some minimum (which determines the maximum safe VF). For positive dependences between strided accesses, LAA tries to prove independence (the accesses are independent if the distance is not a multiple of the stride). Non-positive dependences are allowed, and LAA checks for store-to-load forwarding conflicts for RAWs. The original analysis assumed the store-to-load detection would guarantee the absence of all RAWs, which was incorrect. So we need to consider the non-positive RAWs (as Silviu said, we should be able to exclude the WAR and WAW cases).
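The parenthetical rule about strided accesses can be sketched as a stand-alone C++ function. This is an illustration only, not LAA's actual code: the function name and parameters are assumptions, the distance is taken in bytes, and anything the function cannot reason about is conservatively treated as dependent.

```cpp
// Returns true when two accesses with the same stride provably never touch
// the same element. distanceBytes is the constant distance between them,
// stride is counted in elements, and elemBytes is the element size.
bool stridedAccessesIndependent(unsigned distanceBytes, unsigned stride,
                                unsigned elemBytes) {
  // Outside the modeled case: be conservative and report "dependent".
  if (stride < 2 || elemBytes == 0 || distanceBytes == 0)
    return false;
  // A distance that is not a whole number of elements could overlap partially.
  if (distanceBytes % elemBytes != 0)
    return false;
  // Independent if the distance in elements is not a multiple of the stride.
  return (distanceBytes / elemBytes) % stride != 0;
}
```

With i32 elements and a stride of 2, a distance of 4 bytes (one element) is independent, while a distance of 8 bytes (two elements) is not.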

What about moving elements of an interleaved group over other dependent accesses not in the same group?

This would be the RAW case. For example, S1-L2 is a RAW dependence, L1-L2 form a group:

L1: load
S1: store
L2: load  // L2 would be hoisted above S1.

Or alternatively, S1-L1 is a RAW dependence, S1-S2 form a group:

S1: store // S1 would be sunk below L1.
L1: load
S2: store

The other issue I've uncovered is related to the maximum safe VF. We set this based on the positive dependence distance LAA computes. But for the interleaved accesses, the actual VF used during vectorization is VF * IF, where IF is the interleave factor. The idea is that each component of the wide vector would have VF elements after they are shuffled out. However, this could be greater than the maximum safe VF. Here's an example.

; for (int i = 1; i < 1000; ++i) {
;   p[i + 2].x = p[i].x;
;   p[i + 2].y = p[i].y;
; }
%struct.pair = type { i32, i32 }
for.body:
  %indvars.iv = phi i64 [ 1, %entry ], [ %indvars.iv.next, %for.body ]
  %x1 = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %indvars.iv, i32 0
  %0 = load i32, i32* %x1, align 4
  %1 = add nuw nsw i64 %indvars.iv, 2
  %x4 = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %1, i32 0
  store i32 %0, i32* %x4, align 4
  %y = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %indvars.iv, i32 1
  %2 = load i32, i32* %y, align 4
  %y10 = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %1, i32 1
  store i32 %2, i32* %y10, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1000
  br i1 %exitcond, label %for.cond.cleanup, label %for.body
}

We currently generate <8 x i32> loads and stores, but the maximum safe dependence distance is only 16 bytes, so I think we are generating incorrect code here. I think we should probably check for interleaved accesses when selecting the VF.
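The arithmetic behind this observation can be written down as a short C++ sketch. It assumes a simplified criterion that is not taken from the patch: the wide access should not span more bytes than the safe dependence distance. The function names are hypothetical. An <8 x i32> access with IF = 2 corresponds to VF = 4.

```cpp
#include <cstdint>

// Bytes touched by one wide interleaved access: VF lanes for each of the
// IF members of the group.
std::uint64_t wideAccessBytes(unsigned vf, unsigned interleaveFactor,
                              unsigned elemBytes) {
  return static_cast<std::uint64_t>(vf) * interleaveFactor * elemBytes;
}

// Hypothetical, simplified check: the wide access must not span more bytes
// than the maximum safe dependence distance.
bool wideAccessFitsSafeDistance(unsigned vf, unsigned interleaveFactor,
                                unsigned elemBytes,
                                std::uint64_t maxSafeDistBytes) {
  return wideAccessBytes(vf, interleaveFactor, elemBytes) <= maxSafeDistBytes;
}
```

For the example above, wideAccessBytes(4, 2, 4) is 32, which exceeds the 16-byte distance between p[i] and p[i + 2], while VF = 2 would give a 16-byte access.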

test/Transforms/LoopVectorize/interleaved-accesses.ll
574–579	Yes, thanks for catching that!

I submitted D20241 to correct the issue with the vectorization factor mentioned above.

Matt: thanks for doing this analysis!

What about moving elements of an interleaved group over other dependent accesses not in the same group?

Adam

We shouldn't be doing that. We should be preserving dependences, and as long as we do that this should be correct.

It does look like things are more complicated than what I previously stated, at least with regard to RAW. And we do need to consider non-zero distance dependences (because we are essentially moving these loads/stores after interleaving). Here is an example:

for (...) {
  a[i].a = a[i + 2].a + 1;
  tmp = a[i - 1].a * 2;
  a[i].b = tmp;
  a[i].c = tmp;
  a[i].d = tmp;
}

The problem here is that by moving the store to a[i].a we're breaking the loop-carried dependence from a[i].a to a[i-1].a (which doesn't stop vectorization).

Cheers,
Silviu
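Silviu's example above can be reproduced with a small stand-alone simulation. This is only a sketch under assumptions that are not in the thread: VF = 2, an input whose trip count is a multiple of two, and a vectorizer that sinks the store to a[i].a down to the position of the last store in the group. All names are hypothetical.

```cpp
#include <vector>

struct Elem {
  int a, b, c, d;
};
using Array = std::vector<Elem>;

// Input with distinct values in field a so that stale reads are visible.
Array makeInput(int n) {
  Array v(n);
  for (int i = 0; i < n; ++i)
    v[i] = Elem{i * 10, 0, 0, 0};
  return v;
}

// Scalar reference: the loop exactly as written in the example.
Array runScalar(Array v) {
  const int n = static_cast<int>(v.size());
  for (int i = 1; i + 2 < n; ++i) {
    v[i].a = v[i + 2].a + 1;
    const int tmp = v[i - 1].a * 2;
    v[i].b = tmp;
    v[i].c = tmp;
    v[i].d = tmp;
  }
  return v;
}

// Two iterations at a time (VF = 2, no remainder loop). When sinkStoreA is
// true, the store to field a is delayed until the other stores of the group.
Array runTwoWide(Array v, bool sinkStoreA) {
  const int n = static_cast<int>(v.size());
  for (int i = 1; i + 3 < n; i += 2) {
    const int newA[2] = {v[i + 2].a + 1, v[i + 3].a + 1};
    if (!sinkStoreA) {
      v[i].a = newA[0];
      v[i + 1].a = newA[1];
    }
    // Lane 1 reads a[i].a, which lane 0 is supposed to have written already.
    const int tmp[2] = {v[i - 1].a * 2, v[i].a * 2};
    for (int lane = 0; lane < 2; ++lane) {
      if (sinkStoreA)
        v[i + lane].a = newA[lane];
      v[i + lane].b = tmp[lane];
      v[i + lane].c = tmp[lane];
      v[i + lane].d = tmp[lane];
    }
  }
  return v;
}
```

With makeInput(9), the scalar loop and the unsunk two-wide version agree, while the sunk version computes a[2].b from the old value of a[1].a.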

I agree, Silviu. That's basically what I've been thinking. I'm working on an updated patch. Thanks for all the feedback!

Updated the analysis according to feedback from Adam and Silviu.

Sorry for the delay in getting a new version of this patch ready. This update is notably different from the previous version in the following ways:
(1) We collect all memory accesses in collectConstStrideAccesses instead of only the stride-greater-than-one accesses. We have to collect all accesses to check for dependences between interleaved and non-interleaved accesses. The non-strided accesses are ignored when actually creating the groups.
(2) We only ignore the WAR case. I think the code generation strategy can only ensure we won't violate write-after-reads. Everything else is checked.
(3) For strided accesses, we try to prove independence as is done in LAA. All other constant-distance accesses are considered dependent, and we preserve their order. The dependences are checked in canReorderMemAccs.
(4) I've added another test case, and updated existing comments about the algorithm.

I have just a few quick remarks (and most of them were inherited from the existing code).

Cheers,
Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
5264	We should improve this at some point to check for NoDep (which is what we're really looking for here) and is implied by this test.
5270	This makes assumptions on how LAA works. It might be technically true at the moment, but if the dependence analysis would be improved it would possibly invalidate this. I think the case where we memcheck each pointer is an edge case? I'm not sure it's worth doing the interleaved access vectorization in this case. The more common case could be if two pointers are in different alias sets or have different underlying objects? But that should be expressed as a NoDep.
5286	Same here. Ideally we would know we have NoDep from LAA. Some of the checks here seem to be specifically related to forming the interleaved groups, and less about reordering accesses.

Hi Silviu,

It sounds like you're suggesting that we should just query LAA to determine if any dependence exists between the two candidate instructions, and if so, prevent reordering. Does that sound right?

lib/Transforms/Vectorize/LoopVectorize.cpp
5286	If you're referring to the type and factor checks, those are required before we can call areStridedAccessesIndependent.

Addressed Silviu's comments.

Hi Silviu,

I've updated the patch to use LAA for checking for the presence of dependences like you suggested. This is actually much simpler. Thanks! I had to make one change to LAA: We now try to prove strided accesses independent before handling the negative distance cases.

Matt.

In D19984#433539, @mssimpso wrote:

I had to make one change to LAA: We now try to prove strided accesses independent before handling the negative distance cases.

Yes, I've seen this recently too. Can you please split this out (and commit) with an appropriate LAA testcase.

Thanks for making the changes. I think this looks much better.

Cheers,
Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
5265	We should be able to do this faster than O(n) - maybe with some pre-processing of dependences.

mssimpso mentioned this in rL270072: [LAA] Check independence of strided accesses before forward case. May 19 2016, 8:43 AM

In D19984#433930, @anemet wrote:

Yes, I've seen this recently too. Can you please split this out (and commit) with an appropriate LAA testcase.

Sure. I committed the LAA portion in rL270072. I'll post a rebased version of this patch and address Silviu's latest comments.

I am not done reviewing this but I just want to say that I am having a much better feeling about this analysis after your changes. As I said I had major issues with its soundness. Thanks for your patience for working through all this!

I am also sending my initial comments but there may be more coming.

lib/Transforms/Vectorize/LoopVectorize.cpp
962–964	Somewhere this should state the criteria for reordering. In theory you can reorder backward dependences but looks like you don't allow any (which is fine).
5245	I am wondering if it's time to change the name of this. After your change it no longer contains only "interesting" strided accesses. How about like AccessStrideInfo?
5249–5262	I think that we've been using Accesses pretty consistently without any abbreviation.
5280–5282	I would drop the part about memchecks. This is already pretty long and that part is trivial.

Addressed comments from Silviu and Adam.

Thanks very much for all the feedback! I agree, this is definitely looking better. In this update, I pre-process the dependences to enable constant-time queries like Silviu suggested, and I address Adam's latest comments.
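One way to get such constant-time queries is to index the dependences by their source once, up front. The C++ sketch below is a generic illustration with hypothetical names (AccessId, DependenceIndex, depends). It uses standard hash containers, so lookups are constant time on average, and it does not claim to match the data structure used in the patch.

```cpp
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

using AccessId = int; // hypothetical stand-in for an instruction pointer

class DependenceIndex {
public:
  // Pre-process a list of (source, sink) pairs once.
  explicit DependenceIndex(
      const std::vector<std::pair<AccessId, AccessId>> &deps) {
    for (const auto &d : deps)
      sinksOf[d.first].insert(d.second);
  }

  // Average constant-time query: is there a dependence from src to sink?
  bool depends(AccessId src, AccessId sink) const {
    auto it = sinksOf.find(src);
    return it != sinksOf.end() && it->second.count(sink) != 0;
  }

private:
  std::unordered_map<AccessId, std::unordered_set<AccessId>> sinksOf;
};
```

The index is direction-sensitive: a dependence recorded from access 0 to access 1 does not answer true for the query (1, 0).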

lib/Transforms/Vectorize/LoopVectorize.cpp
5245	Sounds good to me. I'll rename StridedAccesses prior to committing the current patch.

anemet added inline comments. May 19 2016, 1:55 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5379–5383	Here I think we're validating the correctness of:
(1) Potentially moving 'A', an interleaved load, before any store 'B', or
(2) Potentially moving 'B', an interleaved store, after any load/store 'A'.

If I am right, I don't think either the comment or the code is tight enough to reflect this. It would also be good to add testcases for this. It would be also great to add a testcase to the earlier problem that only candidate accesses were dependency-checked. What do you think?

mssimpso added inline comments. May 19 2016, 2:39 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5379–5383	Adam, I think you're mostly right in the description of what we're validating the correctness of. When you have a chance, would you mind elaborating a bit more as to why you think we're not satisfying both of the cases you mention?

The algorithm works bottom-up, so B will always precede A in program order, and we always investigate B's that are closest to A first (some dependences do temporarily slip by but are later checked, see below). Also note that we don't yet handle loops with predicated blocks. I'm working on that in D19694.

For case (1) the bottom-up ordering implies that if we can move a load A before a store B, we know that we can also move A before any store that B precedes.

Case (2) is a bit more subtle and is probably confusing. Say we have the case below:

S1: Store // depends on L1
L1: Load  // depends on S1
S2: Store

When A is S2 and B is S1, we might say that these are independent, and S1 could be sunk to S2's location. We will add S1 to S2's group. But when A is L1 and B is S1, we will notice the dependence. When we do, we check to see if S1 is already in a group, and if so, invalidate it. This prevents us from moving S1.

Please let me know if I'm still missing something here. We could probably improve the comment and/or rename "canReorderMemAccesses", since this doesn't quite capture what's happening with the instruction ordering. And yes, we can definitely add more tests.
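The S1, L1, S2 scenario described above can be modeled with a heavily simplified C++ sketch of a bottom-up grouping pass. This is an assumption-laden toy, not the patch: accesses carry only a kind and an element index, dependences are given as (earlier, later) position pairs, dependences whose source is a load (WAR) are tolerated, and stride and type checks are omitted. The names Access and formGroups are hypothetical.

```cpp
#include <algorithm>
#include <map>
#include <set>
#include <utility>
#include <vector>

struct Access {
  bool isStore;
  int index; // element index relative to the induction variable
};

// Returns, for each access in program order, the id of its group or -1.
std::vector<int> formGroups(const std::vector<Access> &accs,
                            const std::set<std::pair<int, int>> &deps,
                            int factor) {
  const int n = static_cast<int>(accs.size());
  std::vector<int> groupOf(n, -1);
  std::map<int, std::set<int>> slots; // group id -> element indices in use

  for (int a = n - 1; a >= 0; --a) { // bottom-up over program order
    if (groupOf[a] < 0) {
      groupOf[a] = a; // A seeds a new group named after its position
      slots[a].insert(accs[a].index);
    }
    for (int b = a - 1; b >= 0; --b) { // B precedes A
      // Dependences whose source is a load (WAR) are tolerated in this model.
      if (accs[b].isStore && deps.count(std::make_pair(b, a)) != 0) {
        const int g = groupOf[b];
        if (g >= 0) { // B was going to be sunk: release its whole group
          for (int &x : groupOf)
            if (x == g)
              x = -1;
          slots.erase(g);
        }
        break; // nothing above B may join A's group
      }
      if (groupOf[b] >= 0 || accs[a].isStore != accs[b].isStore)
        continue;
      std::set<int> &s = slots[groupOf[a]];
      const int lo = std::min(*s.begin(), accs[b].index);
      const int hi = std::max(*s.rbegin(), accs[b].index);
      if (s.count(accs[b].index) != 0 || hi - lo >= factor)
        continue; // slot taken or B is too far away
      s.insert(accs[b].index);
      groupOf[b] = groupOf[a];
    }
  }
  return groupOf;
}
```

In this model, with the accesses S1, L1, S2 and a dependence from S1 to L1, S1 first joins S2's group and the group is then released when the outer loop reaches L1, so S1 and S2 do not end up in the same group. Without the dependence they stay grouped.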

anemet added inline comments. May 19 2016, 4:57 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5379–5383	I think you're mostly right in the description of what we're validating the correctness of. When you have a chance, would you mind elaborating a bit more as to why you think we're not satisfying both of the cases you mention?

I said "not tight enough" not that it's incorrect, sorry if that was unclear. I think I would like to include the above two cases in the comment (or an improved version) and then have the checks in the code reflect that. I.e. right now we don't check that at least one of 'A' or 'B' is interleaved.

The algorithm works bottom-up, so B will always precede A in program order, and we always investigate B's that are closest to A first (some dependences do temporarily slip by but are later checked, see below). Also note that we don't yet handle loops with predicated blocks. I'm working on that in D19694. For case (1) the bottom-up ordering implies that if we can move a load A before a store B, we know that we can also move A before any store that B precedes. Case (2) is a bit more subtle and is probably confusing. Say we have the case below:

S1: Store // depends on L1
L1: Load  // depends on S1
S2: Store

When A is S2 and B is S1, we might say that these are independent, and S1 could be sunk to S2's location. We will add S1 to S2's group. But when A is L1 and B is S1, we will notice the dependence. When we do, we check to see if S1 is already in a group, and if so, invalidate it. This prevents us from moving S1.

Makes sense. I am wondering now if there is a simpler way to formulate this analysis with the same result. Aren't we simply saying that we don't consider an interleaved access for merging if it's either a source or the destination of a dependence? I.e. something like this in the outer loop:

if (isDepSource(A) && A->mayWriteToMemory() || isDepDestination(A))
  break;

and then no need for the inner loop?

Please let me know if I'm still missing something here. We could probably improve the comment and/or rename "canReorderMemAccesses", since this doesn't quite capture what's happening with the instruction ordering. And yes, we can definitely add more tests.

Thanks!

mssimpso added inline comments. May 20 2016, 6:42 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
5379–5383	I think I would like to include the above two cases in the comment (or an improved version) and then have the checks in the code reflect that. I.e. right now we don't check that at least one of 'A' or 'B' is interleaved.

I see, yes this makes sense.

I am wondering now if there is a simpler way to formulate this analysis with the same result. Aren't we simply saying that we don't consider an interleaved access for merging if it's either a source or the destination of a dependence?

This sounds right to me. Let me think it over before posting another update. Thanks again for all the feedback, Adam!

Addressed Adam's latest round of comments.

Adam,

I've updated the comments and code to be more precise as you suggested. I've also added a couple of new test cases so that we are checking the conditions you enumerated in your last review. I'll reply to your other points inline. Thanks!

Matt.

mssimpso marked 2 inline comments as done. May 24 2016, 12:10 PM

mssimpso added inline comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
5379–5383	It would be also great to add a testcase to the earlier problem that only candidate accesses were dependency-checked.

I've tried constructing a testcase that would generate incorrect code for this, but I think the current limitations of LAA might actually make this unrealizable at the moment. After the memory checks, two accesses can be dependent only at constant distances. Thus, if one access is strided, a dependent access at a constant distance would have to be strided as well.

This sounds right to me. Let me think it over before posting another update.

After thinking about this more carefully, I think we will miss cases with a simplification like the one you suggested. For example, if we have a set of interleaved loads followed by a set of interleaved stores that access the same location (or vice versa), giving up in the outer loop would prevent the two groups from being created.

This looks sensible to me. LGTM! (you should wait for further comments from Adam before committing)

Cheers,
Silviu

Hi Matt,

Sorry about the delay. As I said in my earlier heads-up I was on vacation.

My main comment is the reply to your testcase remark. The other ones are just nitpicks but I haven't finished going through the patch. I am just curious what you think about the idea regarding the testcase.

Adam

lib/Transforms/Vectorize/LoopVectorize.cpp
973–974	Both A and B are not necessarily strided here.
990–991	This requirement is part of the API so it should be in the function comment. Also a nit, can you please flip the order. I think that A preceding B is the more intuitive. Let me know if you disagree.
5379–5383	I've tried constructing a testcase that would generate incorrect code for this, but I think the current limitations of LAA might actually make this unrealizable at the moment. After the memory checks, two accesses can be dependent only at constant distances. Thus, if one access is strided, a dependent access at a constant distance would have to be strided as well.

Can't we use larger than stride offsets/indices to emulate non-interleaved accesses? I believe that these are currently ignored by interleaved analysis. I mean something like:

for (i = 0; i < n; i+=3) {
  ... = A[i]
  A[i+4] = ...
  ... = A[i+1]
}

And then A[i+1] shouldn't be moved across A[i+4]. I haven't tried this, it's just an idea....

Adam,

Welcome back! Thanks for following up. I replied to your comments inline.

Matt.

lib/Transforms/Vectorize/LoopVectorize.cpp
973–974	True. I will rename this and update the comments.
990–991	Sounds good.
5379–5383	This was my first thought as well. In this case, A[i+4] is still a stride-greater-than-one access, so it would've been collected in StrideAccesses and sent through the original analysis. If its stride is equal to anything other than that of the other accesses (e.g., one or some value greater than MaxInterleaveGroupFactor), the distance between it and the other accesses will be non-constant (some SCEV expression). If the distance isn't constant LAA will report Dependence::Unknown, and we will create the memory checks.

Addressed Adam's comments.

anemet added inline comments. Jun 8 2016, 12:25 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5379–5383	Matt, I am not sure I follow, which of these accesses is not constant distance in my example? I just tried this example:

void f(char *a, char *__restrict c, int n) {
  for (int i = 0; i < n; i+=3) {
    c[i] = a[i];
    a[i+4] = c[i+4];
    c[i+1] = a[i+1];
  }
}

If you compile this on x86_64 with:

-O3 -mllvm -enable-load-pre=0 -mllvm -store-to-load-forwarding-conflict-detection=0 -mllvm -enable-interleaved-mem-accesses -mllvm -force-vector-width=2

Then the loads from 'a' will be merged which breaks the forward dependence between the store of a[i+4] and the load of a[i+1]. No?

mssimpso added inline comments. Jun 8 2016, 12:54 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5379–5383	Adam, we may be talking past each other a bit. I think your original request (correct me if I'm wrong) was for a test case showing that the existing analysis doesn't consider checking memory accesses that are not "candidates" for interleaved groups. Candidate accesses are the ones collected in StrideAccesses prior to the analysis.

However, every access in your example actually is a candidate access because they are all strided. They are not ignored. Yes, we currently do the wrong thing here, but that's because the current analysis isn't correctly checking dependences between the actual candidate accesses, not because it isn't considering a dependence between a candidate access and a non-candidate access. Are you asking for something different?

anemet added inline comments. Jun 8 2016, 3:15 PM

lib/Transforms/Vectorize/LoopVectorize.cpp
5379–5383	Ah, the disconnect was that I didn't realize that accesses with offsets larger than the stride were considered as candidates. The idea was to have these mimic non-strided accesses. I guess it makes sense to consider these candidates as well because the offset is not really an offset but rather the distance between two accesses. So I guess we can't have a testcase for this case at the moment... I'll continue reviewing the patch.

This is very close; I just want to take extra care to update/improve the comments while this is all fresh in our memory.

lib/Transforms/Vectorize/LoopVectorize.cpp
973–974	Actually sorry, let's make this even more precise. At this point the function is called with any accesses. (The function later checks that if both accesses are non-strided we don't bother checking deps since we would never reorder those.)
977	We should probably have some qualification in the name, something like canReorderAccessesForInterleavedGroups or something like that. This is not a general canReorder predicate but takes into consideration the actual code motion strategy.
5200–5207	Is my understanding correct, that you removed this because we no longer support this case? I.e. the dep 1->2 does not currently allow for this. If this is true, we should probably add a FIXME. Also, I thought the WAW case was the only reason for the bottom-up ordering. That is a bit confusing now too.
5379–5382	It would be nice to use A and B related iterators in this part because that is what you mention in the comment. E.g. AI, BI or IA, IB.

mssimpso added inline comments. Jun 15 2016, 7:26 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
977	Sounds good. I'll update this and the comments.
5200–5207	No, we do still support this case, for essentially the reason the existing comment states. When the outer loop of the analysis is on (3) and we visit (2) and (1) in the inner loop, we will form the (3,2) group. (1) can't be added to (3,2) because a member already exists in group (3,2) with the same offset. When the outer loop is on (2), we will again visit (1) in the inner loop. This time we will see the dependence between (1) and (2) and give up trying to add additional accesses to (2)'s group. (1) is not yet in a group, so the outer loop moves on to (1). Thus, (1) must form a group with accesses that precede it and won't be sunk, as the existing comment says.

I removed this comment because I thought it was somewhat misleading. The bottom-up ordering is not sufficient to prevent us from breaking WAW dependencies. We still have to explicitly check for them. For example, if we had something like the following for a factor 2 group:

A[i] = x   (1)
A[i+2] = y (2)
A[i+1] = z (3)

The analysis would proceed as before. When the outer loop is on (3): this time (2) is not added to (3)'s group because its index is too large. (1) is temporarily added to (3)'s group, creating a (3,1) group. When the outer loop is on (2): we see the dependence between (1) and (2). As before, we stop trying to add additional instructions to (2)'s group. But now since (1) is already in a group, and we now know it shouldn't be reordered, we release the (3,1) group.

So the bottom-up ordering isn't sufficient as we still have to check for the dependence. It's probably worth adding a comment that clarifies why the analysis is bottom-up.
5379–5382	I agree. I think we should also swap A and B in this function to match what we did to the canReorderMemAccesses function. There, A precedes B, but here B precedes A. They should be consistent.

mssimpso added inline comments. Jun 15 2016, 10:03 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
5200–5207	I got the second example here slightly wrong, but the idea is the same. I was intending to describe something like:

A[i-1] = x (1)
A[i-3] = y (2)
A[i] = z   (3)

Now, the dependence between (1) and (2) will cause the (3,1) group to be released, as I originally described. I'm adding test cases for both of these WAW examples.
5379–5382	Actually, swapping A with B in this function will make the review very difficult. I think I'd rather swap A and B in canReorderMemAccesses to make things consistent. This will mean B precedes A in both places. We can then swap them both in an NFC patch if we want A to precede B.
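The two WAW walkthroughs above can be replayed with a small model of the bottom-up grouping. This is a simplified sketch, not the patch's implementation: it assumes every access is a store to the same array with the same stride and size, identifies each access by its element offset relative to i, and takes the dependences as an input set of (source, sink) positions. The names `formStoreGroups`, `Offsets`, `Deps`, and `GroupOf` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>
#include <vector>

// Returns, for each access (in program order), the position of the access
// that founded its group, or -1 if the access ended up in no group.
std::vector<int> formStoreGroups(const std::vector<int> &Offsets, int Factor,
                                 const std::set<std::pair<int, int>> &Deps) {
  assert(Factor >= 2 && "an interleave factor below 2 is not a group");
  const int N = static_cast<int>(Offsets.size());
  std::vector<int> GroupOf(N, -1);

  // Dissolves the group founded at position G.
  auto release = [&](int G) {
    for (int &Owner : GroupOf)
      if (Owner == G)
        Owner = -1;
  };

  // B fits into group G if its offset is unused and the group would still
  // span fewer than Factor elements.
  auto fits = [&](int G, int B) {
    int Lo = Offsets[B], Hi = Offsets[B];
    for (int M = 0; M < N; ++M) {
      if (GroupOf[M] != G)
        continue;
      if (Offsets[M] == Offsets[B])
        return false;
      Lo = std::min(Lo, Offsets[M]);
      Hi = std::max(Hi, Offsets[M]);
    }
    return Hi - Lo < Factor;
  };

  for (int A = N - 1; A >= 0; --A) {    // bottom-up outer loop
    if (GroupOf[A] == -1)
      GroupOf[A] = A;                   // A founds a new group
    const int G = GroupOf[A];
    for (int B = A - 1; B >= 0; --B) {  // accesses preceding A
      if (Deps.count({B, A})) {
        // B must not be sunk below A: dissolve B's group, if any, and stop
        // extending A's group upwards.
        if (GroupOf[B] != -1)
          release(GroupOf[B]);
        break;
      }
      if (GroupOf[B] == -1 && fits(G, B))
        GroupOf[B] = G;
    }
  }
  return GroupOf;
}
```

With offsets {0, 0, 1}, factor 2, and a dependence from position 0 to position 1, the sketch returns {0, 2, 2}: (2) and (3) share a group and (1) is left on its own. With the corrected offsets {-1, -3, 0} and the same dependence it returns {0, 1, -1}: the temporary (3,1) group is released, so (3) ends up in no group.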

Addressed Adam's comments.

Renamed canReorderMemAccesses and updated comments
Swapped A and B in canReorderMemAccesses to be consistent with the rest of the analysis (i.e., B precedes A). We can swap the entire analysis in a follow-on.
Updated iterators to be more descriptive (i.e., AI and BI).
Commented about the bottom-up ordering.
Added test cases for the two WAW examples mentioned in my last update.

Adam,

Do you have any additional comments for this patch? Thanks!

Matt.

Matt,

All agreed, I just have a few more suggestions to improve comments for this complex piece.

Let me know what you think.

Adam

lib/Transforms/Vectorize/LoopVectorize.cpp
5200–5207	Wow, this is very subtle too. This and the previous case of why we need to remove elements from a group need a high-level comment somewhere around the call to canReorder... See my comments on the particular lines.
5334–5343	I don't think the comment is sufficient to cover all the subtleties here. I would prefer to say something like: "We can't have dependences between accesses in a group and other accesses that are located between the first and the last element of the group." Probably before the break statement we should also say: "It's OK to have dependences between accesses in a group and other accesses before the first instruction; we just can't extend the group beyond these." I am also wondering if we need a picture to further visualize this, i.e. a case where there is an intervening access and one where it falls outside the range of the group.
5379–5382	OK.
test/Transforms/LoopVectorize/interleaved-accesses.ll
769–770	... but exclude a[i] = x
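One possible way to picture the rule suggested for lines 5334–5343 is sketched below. It is only an illustration under simplifying assumptions: accesses are reduced to positions in program order, a group is reduced to the span between its first and last member, and `dependenceBlocksGroup` is a hypothetical helper that does not exist in LoopVectorize.cpp.

```cpp
#include <cassert>

// Positions are indices in program order; the group spans [First, Last].
//
//   position:   0      1       2      3      4
//   access:     X    First     Y    Last     Z
//
// A dependence between a group member and Y (strictly inside the span) is
// the intervening case and blocks the group. A dependence with X or Z
// (outside the span) does not block it, but the group can't be extended
// past such an access.
bool dependenceBlocksGroup(int First, int Last, int Other) {
  assert(First <= Last && "group span must be well-formed");
  return First < Other && Other < Last;
}
```

For the span [1, 3] in the picture, position 2 is reported as blocking while positions 0 and 4 are not.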

Adam,

Thanks very much for the suggestions. They all look good to me! I will update the patch.

Matt.

Sharpened comments according to Adam's feedback.

Looks great to me! Thanks for the improvements and of course initially taking this on!

This revision is now accepted and ready to land. Jun 23 2016, 1:51 PM

Thanks Adam and Silviu for all the detailed feedback! Getting these dependences right can be tricky, so I appreciate the attention.

Closed by commit rL273687: [LV] Preserve order of dependences in interleaved accesses analysis (authored by mssimpso). Jun 24 2016, 8:40 AM

This revision was automatically updated to reflect the committed changes.

mssimpso mentioned this in rL275473: [LV] Rename StrideAccesses to AccessStrideInfo (NFC). Jul 14 2016, 2:12 PM

mssimpso mentioned this in rL275567: [LV] Swap A and B in interleaved access analysis (NFC). Jul 15 2016, 8:30 AM

Revision Contents

Path	Size
lib/Transforms/Vectorize/LoopVectorize.cpp	259 lines
test/Transforms/LoopVectorize/interleaved-accesses.ll	305 lines

Diff 61708

lib/Transforms/Vectorize/LoopVectorize.cpp


[... 841 lines not shown ...]
/// a loop. Otherwise it's meaningless to do analysis as the vectorization		/// a loop. Otherwise it's meaningless to do analysis as the vectorization
/// on interleaved accesses is unsafe.		/// on interleaved accesses is unsafe.
///		///
/// The analysis collects interleave groups and records the relationships		/// The analysis collects interleave groups and records the relationships
/// between the member and the group in a map.		/// between the member and the group in a map.
class InterleavedAccessInfo {		class InterleavedAccessInfo {
public:		public:
InterleavedAccessInfo(PredicatedScalarEvolution &PSE, Loop *L,		InterleavedAccessInfo(PredicatedScalarEvolution &PSE, Loop *L,
DominatorTree *DT)		DominatorTree *DT, LoopInfo *LI)
: PSE(PSE), TheLoop(L), DT(DT), RequiresScalarEpilogue(false) {}		: PSE(PSE), TheLoop(L), DT(DT), LI(LI), LAI(nullptr),
		RequiresScalarEpilogue(false) {}

~InterleavedAccessInfo() {		~InterleavedAccessInfo() {
SmallSet<InterleaveGroup *, 4> DelSet;		SmallSet<InterleaveGroup *, 4> DelSet;
// Avoid releasing a pointer twice.		// Avoid releasing a pointer twice.
for (auto &I : InterleaveGroupMap)		for (auto &I : InterleaveGroupMap)
DelSet.insert(I.second);		DelSet.insert(I.second);
for (auto *Ptr : DelSet)		for (auto *Ptr : DelSet)
delete Ptr;		delete Ptr;
[... 24 lines not shown ...]	if (InterleaveGroupMap.count(Instr))
return InterleaveGroupMap.find(Instr)->second;		return InterleaveGroupMap.find(Instr)->second;
return nullptr;		return nullptr;
}		}

/// \brief Returns true if an interleaved group that may access memory		/// \brief Returns true if an interleaved group that may access memory
/// out-of-bounds requires a scalar epilogue iteration for correctness.		/// out-of-bounds requires a scalar epilogue iteration for correctness.
bool requiresScalarEpilogue() const { return RequiresScalarEpilogue; }		bool requiresScalarEpilogue() const { return RequiresScalarEpilogue; }

		/// \brief Initialize the LoopAccessInfo used for dependence checking.
		void setLAI(const LoopAccessInfo *Info) { LAI = Info; }

private:		private:
/// A wrapper around ScalarEvolution, used to add runtime SCEV checks.		/// A wrapper around ScalarEvolution, used to add runtime SCEV checks.
/// Simplifies SCEV expressions in the context of existing SCEV assumptions.		/// Simplifies SCEV expressions in the context of existing SCEV assumptions.
/// The interleaved access analysis can also add new predicates (for example		/// The interleaved access analysis can also add new predicates (for example
/// by versioning strides of pointers).		/// by versioning strides of pointers).
PredicatedScalarEvolution &PSE;		PredicatedScalarEvolution &PSE;
Loop *TheLoop;		Loop *TheLoop;
DominatorTree *DT;		DominatorTree *DT;
		LoopInfo *LI;
		const LoopAccessInfo *LAI;

/// True if the loop may contain non-reversed interleaved groups with		/// True if the loop may contain non-reversed interleaved groups with
/// out-of-bounds accesses. We ensure we don't speculatively access memory		/// out-of-bounds accesses. We ensure we don't speculatively access memory
/// out-of-bounds by executing at least one scalar epilogue iteration.		/// out-of-bounds by executing at least one scalar epilogue iteration.
bool RequiresScalarEpilogue;		bool RequiresScalarEpilogue;

/// Holds the relationships between the members and the interleave group.		/// Holds the relationships between the members and the interleave group.
DenseMap<Instruction *, InterleaveGroup *> InterleaveGroupMap;		DenseMap<Instruction *, InterleaveGroup *> InterleaveGroupMap;

		/// Holds dependences among the memory accesses in the loop. It maps a source
		/// access to a set of dependent sink accesses.
		DenseMap<Instruction *, SmallPtrSet<Instruction *, 2>> Dependences;

/// \brief The descriptor for a strided memory access.		/// \brief The descriptor for a strided memory access.
struct StrideDescriptor {		struct StrideDescriptor {
StrideDescriptor(int Stride, const SCEV *Scev, unsigned Size,		StrideDescriptor(int Stride, const SCEV *Scev, unsigned Size,
unsigned Align)		unsigned Align)
: Stride(Stride), Scev(Scev), Size(Size), Align(Align) {}		: Stride(Stride), Scev(Scev), Size(Size), Align(Align) {}

StrideDescriptor() : Stride(0), Scev(nullptr), Size(0), Align(0) {}		StrideDescriptor() : Stride(0), Scev(nullptr), Size(0), Align(0) {}

int Stride; // The access's stride. It is negative for a reverse access.		int Stride; // The access's stride. It is negative for a reverse access.
const SCEV *Scev; // The scalar expression of this access		const SCEV *Scev; // The scalar expression of this access
unsigned Size; // The size of the memory object.		unsigned Size; // The size of the memory object.
unsigned Align; // The alignment of this access.		unsigned Align; // The alignment of this access.
};		};

		/// \brief A type for holding instructions and their stride descriptors.
		typedef std::pair<Instruction *, StrideDescriptor> StrideEntry;

/// \brief Create a new interleave group with the given instruction \p Instr,		/// \brief Create a new interleave group with the given instruction \p Instr,
/// stride \p Stride and alignment \p Align.		/// stride \p Stride and alignment \p Align.
///		///
/// \returns the newly created interleave group.		/// \returns the newly created interleave group.
InterleaveGroup *createInterleaveGroup(Instruction *Instr, int Stride,		InterleaveGroup *createInterleaveGroup(Instruction *Instr, int Stride,
unsigned Align) {		unsigned Align) {
assert(!InterleaveGroupMap.count(Instr) &&		assert(!InterleaveGroupMap.count(Instr) &&
"Already in an interleaved access group");		"Already in an interleaved access group");
[... 9 lines not shown ...]	void releaseGroup(InterleaveGroup *Group) {

delete Group;		delete Group;
}		}

/// \brief Collect all the accesses with a constant stride in program order.		/// \brief Collect all the accesses with a constant stride in program order.
void collectConstStridedAccesses(		void collectConstStridedAccesses(
MapVector<Instruction *, StrideDescriptor> &StrideAccesses,		MapVector<Instruction *, StrideDescriptor> &StrideAccesses,
const ValueToValueMap &Strides);		const ValueToValueMap &Strides);

		/// \brief Returns true if \p Stride is allowed in an interleaved group.
		static bool isStrided(int Stride) {
		unsigned Factor = std::abs(Stride);
		anemet: Somewhere this should state the criteria for reordering. In theory you can reorder backward dependences but looks like you don't allow any (which is fine).
		return Factor >= 2 && Factor <= MaxInterleaveGroupFactor;
		}

		/// \brief Returns true if LoopAccessInfo can be used for dependence queries.
		bool areDependencesValid() const {
		return LAI && LAI->getDepChecker().getDependences();
		}

		sbaranga: I guess we can check this with DT because the loads/stores are not predicated? (same in a bunch of other places).
		mssimpso: I intended the dominance check to work for the predicated accesses as well. The check says that if the read does not dominate the write, then we have to conservatively assume the write may happen first, leading to a read-after-write dependence.
		sbaranga: OK, this makes sense.
		/// \brief Returns true if memory accesses \p B and \p A can be reordered, if
		/// necessary, when constructing interleaved groups.
		anemet: Both A and B are not necessarily strided here.
		mssimpso: True. I will rename this and update the comments.
		anemet: Actually sorry, let's make this even more precise. At this point the function is called with any accesses. (The function later checks that if both accesses are non-strided we don't bother checking deps since we would never reorder those.)
		///
		/// \p B must precede \p A in program order. We return false if reordering is
		/// not necessary or is prevented because \p B and \p A may be dependent.
		anemet: We should probably have some qualification in the name, something like canReorderAccessesForInterleavedGroups or something like that. This is not a general canReorder predicate but takes into consideration the actual code motion strategy.
		mssimpso: Sounds good. I'll update this and the comments.
		bool canReorderMemAccessesForInterleavedGroups(StrideEntry *B,
		StrideEntry *A) const {

		// Code motion for interleaved accesses can potentially hoist strided loads
		// and sink strided stores. The code below checks the legality of the
		// following two conditions:
		//
		// 1. Potentially moving a strided load (A) before any store (B) that
		// precedes A, or
		//
		// 2. Potentially moving a strided store (B) after any load or store (A)
		// that B precedes.
		//
		// It's legal to reorder B and A if we know there isn't a dependence from B
		anemet: This requirement is part of the API so it should be in the function comment. Also a nit, can you please flip the order. I think that A preceding B is the more intuitive. Let me know if you disagree.
		mssimpso: Sounds good.
		// to A. Note that this determination is conservative since some
		// dependences could potentially be reordered safely.

		// B is potentially the source of a dependence.
		auto *Src = B->first;
		auto SrcDes = B->second;

		// A is potentially the sink of a dependence.
		auto *Sink = A->first;
		auto SinkDes = A->second;

		// Code motion for interleaved accesses can't violate WAR dependences.
		// Thus, reordering is legal if the source isn't a write.
		if (!Src->mayWriteToMemory())
		return true;

		// At least one of the accesses must be strided.
		if (!isStrided(SrcDes.Stride) && !isStrided(SinkDes.Stride))
		return true;

		// If dependence information is not available from LoopAccessInfo,
		// conservatively assume the instructions can't be reordered.
		if (!areDependencesValid())
		return false;

		// If we know there is a dependence from source to sink, assume the
		// instructions can't be reordered. Otherwise, reordering is legal.
		return !Dependences.count(Src) || !Dependences.lookup(Src).count(Sink);
		}

		/// \brief Collect the dependences from LoopAccessInfo.
		///
		/// We process the dependences once during the interleaved access analysis to
		/// enable constant-time dependence queries.
		void collectDependences() {
		if (!areDependencesValid())
		return;
		auto *Deps = LAI->getDepChecker().getDependences();
		for (auto Dep : *Deps)
		Dependences[Dep.getSource(LAI)].insert(Dep.getDestination(LAI));
		}
};		};

/// Utility class for getting and setting loop vectorizer hints in the form		/// Utility class for getting and setting loop vectorizer hints in the form
/// of loop metadata.		/// of loop metadata.
/// This class keeps a number of loop annotations locally (as member variables)		/// This class keeps a number of loop annotations locally (as member variables)
/// and can, upon request, write them back as metadata on the loop. It will		/// and can, upon request, write them back as metadata on the loop. It will
/// initially scan the loop for existing metadata, and will update the local		/// initially scan the loop for existing metadata, and will update the local
/// values based on information in the loop.		/// values based on information in the loop.
[... 314 lines not shown ...]
/// This class is also used by InnerLoopVectorizer for identifying		/// This class is also used by InnerLoopVectorizer for identifying
/// induction variable and the different reduction variables.		/// induction variable and the different reduction variables.
class LoopVectorizationLegality {		class LoopVectorizationLegality {
public:		public:
LoopVectorizationLegality(Loop *L, PredicatedScalarEvolution &PSE,		LoopVectorizationLegality(Loop *L, PredicatedScalarEvolution &PSE,
DominatorTree *DT, TargetLibraryInfo *TLI,		DominatorTree *DT, TargetLibraryInfo *TLI,
AliasAnalysis *AA, Function *F,		AliasAnalysis *AA, Function *F,
const TargetTransformInfo *TTI,		const TargetTransformInfo *TTI,
LoopAccessAnalysis *LAA,		LoopAccessAnalysis *LAA, LoopInfo *LI,
LoopVectorizationRequirements *R,		LoopVectorizationRequirements *R,
LoopVectorizeHints *H)		LoopVectorizeHints *H)
: NumPredStores(0), TheLoop(L), PSE(PSE), TLI(TLI), TheFunction(F),		: NumPredStores(0), TheLoop(L), PSE(PSE), TLI(TLI), TheFunction(F),
TTI(TTI), DT(DT), LAA(LAA), LAI(nullptr), InterleaveInfo(PSE, L, DT),		TTI(TTI), DT(DT), LAA(LAA), LAI(nullptr),
Induction(nullptr), WidestIndTy(nullptr), HasFunNoNaNAttr(false),		InterleaveInfo(PSE, L, DT, LI), Induction(nullptr),
Requirements(R), Hints(H) {}		WidestIndTy(nullptr), HasFunNoNaNAttr(false), Requirements(R),
		Hints(H) {}

/// ReductionList contains the reduction descriptors for all		/// ReductionList contains the reduction descriptors for all
/// of the reductions that were found in the loop.		/// of the reductions that were found in the loop.
typedef DenseMap<PHINode *, RecurrenceDescriptor> ReductionList;		typedef DenseMap<PHINode *, RecurrenceDescriptor> ReductionList;

/// InductionList saves induction variables and maps them to the		/// InductionList saves induction variables and maps them to the
/// induction descriptor.		/// induction descriptor.
typedef MapVector<PHINode *, InductionDescriptor> InductionList;		typedef MapVector<PHINode *, InductionDescriptor> InductionList;
[... 590 lines not shown ...]	if (TC > 0u && TC < TinyTripCountVectorThreshold) {
return false;		return false;
}		}
}		}

PredicatedScalarEvolution PSE(SE, L);		PredicatedScalarEvolution PSE(SE, L);

// Check if it is legal to vectorize the loop.		// Check if it is legal to vectorize the loop.
LoopVectorizationRequirements Requirements;		LoopVectorizationRequirements Requirements;
LoopVectorizationLegality LVL(L, PSE, DT, TLI, AA, F, TTI, LAA,		LoopVectorizationLegality LVL(L, PSE, DT, TLI, AA, F, TTI, LAA, LI,
&Requirements, &Hints);		&Requirements, &Hints);
if (!LVL.canVectorize()) {		if (!LVL.canVectorize()) {
DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");		DEBUG(dbgs() << "LV: Not vectorizing: Cannot prove legality.\n");
emitMissedWarning(F, L, Hints);		emitMissedWarning(F, L, Hints);
return false;		return false;
}		}

// Use the cost model.		// Use the cost model.
[... 3,092 lines not shown ...]	while (!Worklist.empty()) {

// Insert all operands.		// Insert all operands.
Worklist.insert(Worklist.end(), I->op_begin(), I->op_end());		Worklist.insert(Worklist.end(), I->op_begin(), I->op_end());
}		}
}		}

bool LoopVectorizationLegality::canVectorizeMemory() {		bool LoopVectorizationLegality::canVectorizeMemory() {
LAI = &LAA->getInfo(TheLoop, getSymbolicStrides());		LAI = &LAA->getInfo(TheLoop, getSymbolicStrides());
		InterleaveInfo.setLAI(LAI);
auto &OptionalReport = LAI->getReport();		auto &OptionalReport = LAI->getReport();
if (OptionalReport)		if (OptionalReport)
emitAnalysis(VectorizationReport(*OptionalReport));		emitAnalysis(VectorizationReport(*OptionalReport));
if (!LAI->canVectorizeMemory())		if (!LAI->canVectorizeMemory())
return false;		return false;

if (LAI->hasStoreToLoopInvariantAddress()) {		if (LAI->hasStoreToLoopInvariantAddress()) {
emitAnalysis(		emitAnalysis(
[... 98 lines not shown ...]
}		}

void InterleavedAccessInfo::collectConstStridedAccesses(		void InterleavedAccessInfo::collectConstStridedAccesses(
MapVector<Instruction *, StrideDescriptor> &StrideAccesses,		MapVector<Instruction *, StrideDescriptor> &StrideAccesses,
const ValueToValueMap &Strides) {		const ValueToValueMap &Strides) {
// Holds load/store instructions in program order.		// Holds load/store instructions in program order.
SmallVector<Instruction *, 16> AccessList;		SmallVector<Instruction *, 16> AccessList;

for (auto *BB : TheLoop->getBlocks()) {		// Since it's desired that the load/store instructions be maintained in
		// "program order" for the interleaved access analysis, we have to visit the
		// blocks in the loop in reverse postorder (i.e., in a topological order).
		// Such an ordering will ensure that any load/store that may be executed
		// before a second load/store will precede the second load/store in the
		// AccessList.
		LoopBlocksDFS DFS(TheLoop);
		DFS.perform(LI);
		for (LoopBlocksDFS::RPOIterator I = DFS.beginRPO(), E = DFS.endRPO(); I != E;
		++I) {
		BasicBlock *BB = *I;
bool IsPred = LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT);		bool IsPred = LoopAccessInfo::blockNeedsPredication(BB, TheLoop, DT);

for (auto &I : *BB) {		for (auto &I : *BB) {
if (!isa<LoadInst>(&I) && !isa<StoreInst>(&I))		if (!isa<LoadInst>(&I) && !isa<StoreInst>(&I))
continue;		continue;
// FIXME: Currently we can't handle mixed accesses and predicated accesses		// FIXME: Currently we can't handle mixed accesses and predicated accesses
if (IsPred)		if (IsPred)
return;		return;

AccessList.push_back(&I);		AccessList.push_back(&I);
}		}
}		}

if (AccessList.empty())		if (AccessList.empty())
return;		return;

auto &DL = TheLoop->getHeader()->getModule()->getDataLayout();		auto &DL = TheLoop->getHeader()->getModule()->getDataLayout();
for (auto I : AccessList) {		for (auto I : AccessList) {
LoadInst *LI = dyn_cast<LoadInst>(I);		LoadInst *LI = dyn_cast<LoadInst>(I);
StoreInst *SI = dyn_cast<StoreInst>(I);		StoreInst *SI = dyn_cast<StoreInst>(I);

Value *Ptr = LI ? LI->getPointerOperand() : SI->getPointerOperand();		Value *Ptr = LI ? LI->getPointerOperand() : SI->getPointerOperand();
int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides);		int Stride = getPtrStride(PSE, Ptr, TheLoop, Strides);

// The factor of the corresponding interleave group.
unsigned Factor = std::abs(Stride);

// Ignore the access if the factor is too small or too large.
if (Factor < 2 || Factor > MaxInterleaveGroupFactor)
continue;

const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr);		const SCEV *Scev = replaceSymbolicStrideSCEV(PSE, Strides, Ptr);
PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());		PointerType *PtrTy = dyn_cast<PointerType>(Ptr->getType());
unsigned Size = DL.getTypeAllocSize(PtrTy->getElementType());		unsigned Size = DL.getTypeAllocSize(PtrTy->getElementType());

// An alignment of 0 means target ABI alignment.		// An alignment of 0 means target ABI alignment.
unsigned Align = LI ? LI->getAlignment() : SI->getAlignment();		unsigned Align = LI ? LI->getAlignment() : SI->getAlignment();
if (!Align)		if (!Align)
Align = DL.getABITypeAlignment(PtrTy->getElementType());		Align = DL.getABITypeAlignment(PtrTy->getElementType());

StrideAccesses[I] = StrideDescriptor(Stride, Scev, Size, Align);		StrideAccesses[I] = StrideDescriptor(Stride, Scev, Size, Align);
		anemet: I am wondering if it's time to change the name of this. After your change it no longer contains only "interesting" strided accesses. How about something like AccessStrideInfo?
		mssimpso: Sounds good to me. I'll rename StrideAccesses prior to committing the current patch.
}		}
}		}

// Analyze interleaved accesses and collect them into interleave groups.		// Analyze interleaved accesses and collect them into interleaved load and
		// store groups.
//		//
// Notice that the vectorization on interleaved groups will change instruction		// When generating code for an interleaved load group, we effectively hoist all
// orders and may break dependences. But the memory dependence check guarantees		// loads in the group to the location of the first load in program order. When
// that there is no overlap between two pointers of different strides, element		// generating code for an interleaved store group, we sink all stores to the
// sizes or underlying bases.		// location of the last store. This code motion can change the order of load
		// and store instructions and may break dependences.
//		//
// For pointers sharing the same stride, element size and underlying base, no		// The code generation strategy mentioned above ensures that we won't violate
// need to worry about Read-After-Write dependences and Write-After-Read		// any write-after-read (WAR) dependences.
// dependences.		//
		// E.g., for the WAR dependence: a = A[i]; // (1)
		// A[i] = b; // (2)
		anemet: I think that we've been using Accesses pretty consistently without any abbreviation.
		//
		// The store group of (2) is always inserted at or below (2), and the load
		sbaranga: We should improve this at some point to check for NoDep (which is what we're really looking for here) and is implied by this test.
		// group of (1) is always inserted at or above (1). Thus, the instructions will
		sbaranga: We should be able to do this faster than O(n) - maybe with some pre-processing of dependences.
		// never be reordered. All other dependences are checked to ensure the
		// correctness of the instruction reordering.
		//
		// The algorithm visits all memory accesses in the loop in bottom-up program
		// order. Program order is established by traversing the blocks in the loop in
		sbaranga: This makes assumptions on how LAA works. It might be technically true at the moment, but if the dependence analysis would be improved it would possibly invalidate this. I think the case where we memcheck each pointer is an edge case? I'm not sure it's worth doing the interleaved access vectorization in this case. The more common case could be if two pointers are in different alias sets or have different underlying objects? But that should be expressed as a NoDep.
		// reverse postorder when collecting the accesses.
//		//
// E.g. The RAW dependence: A[i] = a;		// We visit the memory accesses in bottom-up order because it can simplify the
// b = A[i];		// construction of store groups in the presence of write-after-write (WAW)
// This won't exist as it is a store-load forwarding conflict, which has		// dependences.
// already been checked and forbidden in the dependence check.
//		//
// E.g. The WAR dependence: a = A[i]; // (1)		// E.g., for the WAW dependence: A[i] = a; // (1)
// A[i] = b; // (2)		// A[i] = b; // (2)
// The store group of (2) is always inserted at or below (2), and the load group		// A[i + 1] = c; // (3)
// of (1) is always inserted at or above (1). The dependence is safe.		//
		// We will first create a store group with (3) and (2). (1) can't be added to
		// this group because it and (2) are dependent. However, (1) can be grouped
		anemet: I would drop the part about memchecks. This is already pretty long and that part is trivial.
		// with other accesses that may precede it in program order. Note that a
		// bottom-up order does not imply that WAW dependences should not be checked.
void InterleavedAccessInfo::analyzeInterleaving(		void InterleavedAccessInfo::analyzeInterleaving(
const ValueToValueMap &Strides) {		const ValueToValueMap &Strides) {
		sbaranga: Same here. Ideally we would know we have NoDep from LAA. Some of the checks here seem to be specifically related to forming the interleaved groups, and less about reordering accesses.
		mssimpso: If you're referring to the type and factor checks, those are required before we can call areStridedAccessesIndependent.
DEBUG(dbgs() << "LV: Analyzing interleaved accesses...\n");		DEBUG(dbgs() << "LV: Analyzing interleaved accesses...\n");

// Holds all the stride accesses.		// Holds all the stride accesses.
MapVector<Instruction *, StrideDescriptor> StrideAccesses;		MapVector<Instruction *, StrideDescriptor> StrideAccesses;
collectConstStridedAccesses(StrideAccesses, Strides);		collectConstStridedAccesses(StrideAccesses, Strides);

if (StrideAccesses.empty())		if (StrideAccesses.empty())
return;		return;

		// Collect the dependences in the loop.
		collectDependences();

// Holds all interleaved store groups temporarily.		// Holds all interleaved store groups temporarily.
SmallSetVector<InterleaveGroup *, 4> StoreGroups;		SmallSetVector<InterleaveGroup *, 4> StoreGroups;
// Holds all interleaved load groups temporarily.		// Holds all interleaved load groups temporarily.
SmallSetVector<InterleaveGroup *, 4> LoadGroups;		SmallSetVector<InterleaveGroup *, 4> LoadGroups;

// Search the load-load/write-write pair B-A in bottom-up order and try to		// Search the load-load/write-write pair B-A in bottom-up order and try to
// insert B into the interleave group of A according to 3 rules:		// insert B into the interleave group of A according to 3 rules:
// 1. A and B have the same stride.		// 1. A and B have the same stride.
// 2. A and B have the same memory object size.		// 2. A and B have the same memory object size.
// 3. B belongs to the group according to the distance.		// 3. B belongs to the group according to the distance.
//		for (auto AI = StrideAccesses.rbegin(), E = StrideAccesses.rend(); AI != E;
// The bottom-up order can avoid breaking the Write-After-Write dependences		++AI) {
// between two pointers of the same base.		Instruction *A = AI->first;
// E.g. A[i] = a; (1)		StrideDescriptor DesA = AI->second;
// A[i] = b; (2)
// A[i+1] = c (3)		// Initialize a group for A if it has an allowable stride. Even if we don't
// We form the group (2)+(3) in front, so (1) has to form groups with accesses		// create a group for A, we continue with the bottom-up algorithm to ensure
// above (1), which guarantees that (1) is always above (2).		// we don't break any of A's dependences.
anemet: Is my understanding correct, that you removed this because we no longer support this case? I.e. the dep 1->2 does not currently allow for this. If this is true, we should probably add a FIXME. Also, I thought the WAW case was the only reason for the bottom-up ordering. That is a bit confusing now too.
mssimpsoAuthorUnsubmitted Done Reply Inline Actions No, we do still support this case, for essentially the reason the existing comment states. When the outer loop of the analysis is on (3) and we visit (2) and (1) in the inner loop, we will form the (3,2) group. (1) can't be added to (3,2) because a member already exists in group (3,2) with the same offset. When the outer loop is on (2), we will again visit (1) in the inner loop. This time we will see the dependence between (1) and (2) and give up trying to add additional accesses to (2)'s group. (1) is not yet in a group, so the outer loop moves on to (1). Thus, (1) must form a group with accesses that precede it and won't be sunk, as the existing comment says. I removed this comment because I thought it was somewhat misleading. The bottom-up ordering is not sufficient to prevent us from breaking WAW dependencies. We still have to explicitly check for them. For example, if we had something like the following for a factor 2 group: A[i] = x (1) A[i+2] = y (2) A[i+1] = z (3) The analysis would proceed as before. When the outer loop is on (3): this time (2) is not added to (3)'s group because its index is too large. (1) is temporarily added to (3)'s group, creating a (3,1) group. When the outer loop is on (2): we see the dependence between (1) and (2). As before, we stop trying to add additional instructions to (2)'s group. But now since (1) is already in a group, and we now know it shouldn't be reordered, we release the (3,1) group. So the bottom-up ordering isn't sufficient as we still have to check for the dependence. It's probably worth adding a comment that clarifies why the analysis is bottom-up. mssimpso: No, we do still support this case, for essentially the reason the existing comment states. When…
mssimpso: I got the second example here slightly wrong, but the idea is the same. I was intending to describe something like:

    A[i-1] = x  (1)
    A[i-3] = y  (2)
    A[i] = z    (3)

Now, the dependence between (1) and (2) will cause the (3,1) group to be released, as I originally described. I'm adding test cases for both of these WAW examples.
anemet: Wow, this is very subtle too. This, and the previous case of why we need to remove elements from a group, need a high-level comment somewhere around the call to canReorder... See my comments on the particular lines.
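The bottom-up walk and the group release described in this thread can be replayed on a toy model. The following is a minimal sketch, not code from the patch: `Access`, `formGroups`, and the hand-written `Deps` set are hypothetical names introduced for illustration, all accesses are assumed to share one stride and element size, and the alignment and insert-position bookkeeping of the real analysis is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical toy model of a strided access: element offset relative to the
// induction variable, and whether the access is a store.
struct Access {
  int Offset;
  bool IsStore;
};

// Returns, per access in program order, the id of its interleave group, or -1
// if it stays ungrouped. Deps lists (source, sink) index pairs by hand; the
// real analysis takes them from LoopAccessInfo.
std::vector<int> formGroups(const std::vector<Access> &Accs,
                            const std::set<std::pair<int, int>> &Deps,
                            int Factor) {
  assert(Factor >= 2 && "an interleave group needs a factor of at least 2");
  const int N = static_cast<int>(Accs.size());
  std::vector<int> GroupOf(N, -1);
  std::map<int, std::map<int, int>> Groups; // id -> (offset -> access index)
  int NextId = 0;

  auto Release = [&](int Id) {
    for (const auto &Member : Groups[Id])
      GroupOf[Member.second] = -1;
    Groups.erase(Id);
  };

  for (int A = N - 1; A >= 0; --A) { // outer loop: bottom-up
    if (GroupOf[A] < 0) {
      const int Id = NextId++;
      GroupOf[A] = Id;
      Groups[Id][Accs[A].Offset] = A;
    }
    const int G = GroupOf[A];
    for (int B = A - 1; B >= 0; --B) { // inner loop: B precedes A
      if (Accs[B].IsStore && Deps.count({B, A})) {
        // Sinking B below A would break the dependence, so drop any group B
        // already joined and stop extending A's group upward.
        if (GroupOf[B] >= 0)
          Release(GroupOf[B]);
        break;
      }
      if (GroupOf[B] >= 0 || Accs[B].IsStore != Accs[A].IsStore)
        continue;
      std::map<int, int> &Members = Groups[G];
      const int Off = Accs[B].Offset;
      const int Lo = std::min(Off, Members.begin()->first);
      const int Hi = std::max(Off, Members.rbegin()->first);
      if (Members.count(Off) || Hi - Lo + 1 > Factor)
        continue; // offset already taken, or index out of range
      Members[Off] = B;
      GroupOf[B] = G;
    }
  }

  // Store groups with gaps can't be emitted as a single wide store.
  std::vector<int> Gappy;
  for (const auto &G : Groups)
    if (Accs[G.second.begin()->second].IsStore &&
        static_cast<int>(G.second.size()) != Factor)
      Gappy.push_back(G.first);
  for (int Id : Gappy)
    Release(Id);
  return GroupOf;
}
```

With offsets {0, 0, 1}, all stores, a dependence from (1) to (2), and factor 2, the sketch groups (2) with (3) and leaves (1) ungrouped. With offsets {-1, -3, 0} and the same dependence, the (3,1) group is formed and then released when the outer loop reaches (2), so no access ends up in a group.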
for (auto I = StrideAccesses.rbegin(), E = StrideAccesses.rend(); I != E;		InterleaveGroup *Group = nullptr;
++I) {		if (isStrided(DesA.Stride)) {
Instruction *A = I->first;		Group = getInterleaveGroup(A);
StrideDescriptor DesA = I->second;

InterleaveGroup *Group = getInterleaveGroup(A);
if (!Group) {		if (!Group) {
DEBUG(dbgs() << "LV: Creating an interleave group with:" << *A << '\n');		DEBUG(dbgs() << "LV: Creating an interleave group with:" << *A << '\n');
Group = createInterleaveGroup(A, DesA.Stride, DesA.Align);		Group = createInterleaveGroup(A, DesA.Stride, DesA.Align);
}		}

if (A->mayWriteToMemory())		if (A->mayWriteToMemory())
StoreGroups.insert(Group);		StoreGroups.insert(Group);
else		else
LoadGroups.insert(Group);		LoadGroups.insert(Group);
		}

for (auto II = std::next(I); II != E; ++II) {		for (auto BI = std::next(AI); BI != E; ++BI) {
Instruction *B = II->first;		Instruction *B = BI->first;
StrideDescriptor DesB = II->second;		StrideDescriptor DesB = BI->second;

		// Our code motion strategy implies that we can't have dependences
		// between accesses in an interleaved group and other accesses located
		// between the first and last member of the group. Note that this also
		// means that a group can't have more than one member at a given offset.
		// The accesses in a group can have dependences with other accesses, but
		// we must ensure we don't extend the boundaries of the group such that
		// we encompass those dependent accesses.
		//
		// For example, assume we have the sequence of accesses shown below in a
		// stride-2 loop:
anemet: I don't think the comment is sufficient to cover all the subtleties here. I would prefer to say something like: "We can't have dependences between accesses in a group and other accesses that are located between the first and the last element of the group." Probably before the break statement we should also say that: "It's OK to have dependences between accesses in a group and other accesses before the first instruction; we just can't extend the group beyond these." I am also wondering if we need a picture to further visualize this, i.e. a case where there is an intervening access and one where it falls outside the range of the group.
		//
		// (1, 2) is a group | A[i]   = a;  // (1)
		//                   | A[i-1] = b;  // (2) |
		//                     A[i-3] = c;  // (3)
		//                     A[i]   = d;  // (4) | (2, 4) is not a group
		//
		// Because accesses (2) and (3) are dependent, we can group (2) with (1)
		// but not with (4). If we did, the dependent access (3) would be within
		// the boundaries of the (2, 4) group.
		if (!canReorderMemAccessesForInterleavedGroups(&BI, &AI)) {

		// If a dependence exists and B is already in a group, we know that B
		// must be a store since B precedes A and WAR dependences are allowed.
		// Thus, B would be sunk below A. We release B's group to prevent this
		// illegal code motion. B will then be free to form another group with
		// instructions that precede it.
		if (isInterleaved(B)) {
		InterleaveGroup *StoreGroup = getInterleaveGroup(B);
		StoreGroups.remove(StoreGroup);
		releaseGroup(StoreGroup);
		}

		// If a dependence exists and B is not already in a group (or it was
		// and we just released it), A might be hoisted above B (if A is a
		// load) or another store might be sunk below B (if A is a store). In
		// either case, we can't add additional instructions to A's group. A
		// will only form a group with instructions that it precedes.
		break;
		}

		// At this point, we've checked for illegal code motion. If either A or B
		// isn't strided, there's nothing left to do.
		if (!isStrided(DesA.Stride) || !isStrided(DesB.Stride))
		continue;

// Ignore if B is already in a group or B is a different memory operation.		// Ignore if B is already in a group or B is a different memory operation.
if (isInterleaved(B) || A->mayReadFromMemory() != B->mayReadFromMemory())		if (isInterleaved(B) || A->mayReadFromMemory() != B->mayReadFromMemory())
continue;		continue;

anemet: It would be nice to use A and B related iterators in this part because that is what you mention in the comment. E.g. AI, BI or IA, IB.
mssimpso: I agree. I think we should also swap A and B in this function to match what we did to the canReorderMemAccesses function. There, A precedes B, but here B precedes A. They should be consistent.
mssimpso: Actually, swapping A with B in this function will make the review very difficult. I think I'd rather swap A and B in canReorderMemAccesses to make things consistent. This will mean B precedes A in both places. We can then swap them both in an NFC patch if we want A to precede B.
anemet: OK.
// Check the rule 1 and 2.		// Check the rule 1 and 2.
anemet: Here I think we're validating the correctness of:
1. Potentially moving 'A', an interleaved load, before any store 'B', or
2. Potentially moving 'B', an interleaved store, after any load/store 'A'.
If I am right, I don't think either the comment or the code is tight enough to reflect this. It would also be good to add testcases for this. It would also be great to add a testcase for the earlier problem that only candidate accesses were dependency-checked. What do you think?
mssimpso: Adam, I think you're mostly right in the description of what we're validating the correctness of. When you have a chance, would you mind elaborating a bit more as to why you think we're not satisfying both of the cases you mention?

The algorithm works bottom-up, so B will always precede A in program order, and we always investigate B's that are closest to A first (some dependences do temporarily slip by but are later checked, see below). Also note that we don't yet handle loops with predicated blocks. I'm working on that in D19694.

For case (1) the bottom-up ordering implies that if we can move a load A before a store B, we know that we can also move A before any store that B precedes. Case (2) is a bit more subtle and is probably confusing. Say we have the case below:

    S1: Store  // depends on L1
    L1: Load   // depends on S1
    S2: Store

When A is S2 and B is S1, we might say that these are independent, and S1 could be sunk to S2's location. We will add S1 to S2's group. But when A is L1 and B is S1, we will notice the dependence. When we do, we check to see if S1 is already in a group, and if so, invalidate it. This prevents us from moving S1.

Please let me know if I'm still missing something here. We could probably improve the comment and/or rename "canReorderMemAccesses", since this doesn't quite capture what's happening with the instruction ordering. And yes, we can definitely add more tests.
anemet:
> I think you're mostly right in the description of what we're validating the correctness of. When you have a chance, would you mind elaborating a bit more as to why you think we're not satisfying both of the cases you mention?

I said "not tight enough", not that it's incorrect; sorry if that was unclear. I think I would like to include the above two cases in the comment (or an improved version) and then have the checks in the code reflect that. I.e. right now we don't check that at least one of 'A' or 'B' is interleaved.

> The algorithm works bottom-up, so B will always precede A in program order, and we always investigate B's that are closest to A first (some dependences do temporarily slip by but are later checked, see below). Also note that we don't yet handle loops with predicated blocks. I'm working on that in D19694. For case (1) the bottom-up ordering implies that if we can move a load A before a store B, we know that we can also move A before any store that B precedes. Case (2) is a bit more subtle and is probably confusing. Say we have the case below: S1: Store depends on L1, L1: Load depends on S1, S2: Store. When A is S2 and B is S1, we might say that these are independent, and S1 could be sunk to S2's location. We will add S1 to S2's group. But when A is L1 and B is S1, we will notice the dependence. When we do, we check to see if S1 is already in a group, and if so, invalidate it. This prevents us from moving S1.

Makes sense. I am wondering now if there is a simpler way to formulate this analysis with the same result. Aren't we simply saying that we don't consider an interleaved access for merging if it's either a source or the destination of a dependence? I.e. something like this in the outer loop:

    if (isDepSource(A) && A->mayWriteToMemory() || isDepDestination(A))
      break;

and then no need for the inner loop?

> Please let me know if I'm still missing something here. We could probably improve the comment and/or rename "canReorderMemAccesses", since this doesn't quite capture what's happening with the instruction ordering. And yes, we can definitely add more tests.

Thanks!
mssimpso:
> I think I would like to include the above two cases in the comment (or an improved version) and then have the checks in the code reflect that. I.e. right now we don't check that at least one of 'A' or 'B' is interleaved.

I see, yes this makes sense.

> I am wondering now if there is a simpler way to formulate this analysis with the same result. Aren't we simply saying that we don't consider an interleaved access for merging if it's either a source or the destination of a dependence?

This sounds right to me. Let me think it over before posting another update. Thanks again for all the feedback, Adam!
mssimpso:
> It would be also great to add a testcase to the earlier problem that only candidate accesses were dependency-checked.

I've tried constructing a testcase that would generate incorrect code for this, but I think the current limitations of LAA might actually make this unrealizable at the moment. After the memory checks, two accesses can be dependent only at constant distances. Thus, if one access is strided, a dependent access at a constant distance would have to be strided as well.

> This sounds right to me. Let me think it over before posting another update.

After thinking about this more carefully, I think we will miss cases with a simplification like the one you suggested. For example, if we have a set of interleaved loads followed by a set of interleaved stores that access the same location (or vice versa), giving up in the outer loop would prevent the two groups from being created.
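As an illustration of the "interleaved loads followed by interleaved stores to the same location" case, here is a minimal sketch. The pair-swap loop and the function names are assumptions chosen for this example and do not appear in the patch. The loads are the source of a WAR dependence and the stores are its destination, yet hoisting the loads to the first load and sinking the stores to the last store keeps every load ahead of every store, so both groups can legally be formed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar reference: a factor-2 load group (A[i], A[i+1]) followed by a
// factor-2 store group that writes the same two locations.
void swapPairsScalar(std::vector<int> &A) {
  for (std::size_t i = 0; i + 1 < A.size(); i += 2) {
    int x = A[i];
    int y = A[i + 1];
    A[i] = y;
    A[i + 1] = x;
  }
}

// VF = 2 with both groups formed: one wide load at the position of the first
// load and one wide store at the position of the last store. The loads still
// execute before the stores, so the WAR dependences are preserved.
void swapPairsGrouped(std::vector<int> &A) {
  assert(A.size() % 4 == 0 && "sketch handles whole vector iterations only");
  for (std::size_t i = 0; i < A.size(); i += 4) {
    const int Wide[4] = {A[i], A[i + 1], A[i + 2], A[i + 3]}; // load group
    // Store group: re-interleave the two lanes with x and y exchanged.
    A[i] = Wide[1];
    A[i + 1] = Wide[0];
    A[i + 2] = Wide[3];
    A[i + 3] = Wide[2];
  }
}
```

Both functions produce the same array contents. A rule that skips every access that is the source or destination of any dependence would reject both groups here, which is the missed opportunity described above.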
anemet:
> I've tried constructing a testcase that would generate incorrect code for this, but I think the current limitations of LAA might actually make this unrealizable at the moment. After the memory checks, two access can be dependent only at constant distances. Thus, if one access is strided, a dependent access at a constant distance would have to be strided as well.

Can't we use larger than stride offsets/indices to emulate non-interleaved accesses? I believe that these are currently ignored by interleaved analysis. I mean something like:

    for (i = 0; i < n; i+=3) {
      ... = A[i]
      A[i+4] = ...
      ... = A[i+1]
    }

And then A[i+1] shouldn't be moved across A[i+4]. I haven't tried this, it's just an idea....
mssimpso: This was my first thought as well. In this case, A[i+4] is still a stride-greater-than-one access, so it would've been collected in StrideAccesses and sent through the original analysis. If its stride is equal to anything other than that of the other accesses (e.g., one or some value greater than MaxInterleaveGroupFactor), the distance between it and the other accesses will be non-constant (some SCEV expression). If the distance isn't constant, LAA will report Dependence::Unknown, and we will create the memory checks.
anemet: Matt, I am not sure I follow, which of these accesses is not constant distance in my example? I just tried this example:

    void f(char *a, char *__restrict c, int n) {
      for (int i = 0; i < n; i+=3) {
        c[i] = a[i];
        a[i+4] = c[i+4];
        c[i+1] = a[i+1];
      }
    }

If you compile this on x86_64 with:

    -O3 -mllvm -enable-load-pre=0 -mllvm -store-to-load-forwarding-conflict-detection=0 -mllvm -enable-interleaved-mem-accesses -mllvm -force-vector-width=2

then the loads from 'a' will be merged, which breaks the forward dependence between the store of a[i+4] and the load of a[i+1]. No?
mssimpso: Adam, we may be talking past each other a bit. I think your original request (correct me if I'm wrong) was for a test case showing that the existing analysis doesn't consider checking memory accesses that are not "candidates" for interleaved groups. Candidate accesses are the ones collected in StrideAccesses prior to the analysis. However, every access in your example actually is a candidate access because they are all strided. They are not ignored. Yes, we currently do the wrong thing here, but that's because the current analysis isn't correctly checking dependences between the actual candidate accesses, not because it isn't considering a dependence between a candidate access and a non-candidate access. Are you asking for something different?
anemet: Ah, the disconnect was that I didn't realize that accesses with offsets larger than the stride were considered as candidates. The idea was to have these mimic non-strided accesses. I guess it makes sense to consider these candidates as well because the offset is not really an offset but rather the distance between two accesses. So I guess we can't have a testcase for this case at the moment... I'll continue reviewing the patch.
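To make the forward dependence in the `f` example from this thread concrete, the following sketch simulates it by hand. It is not output of the vectorizer: `scalar` and `withLoadGroup` are hypothetical names, `std::vector<char>` stands in for the two non-aliasing pointers, and the grouped version models a vector width of 2 with both loads from `a` emitted at the position of the first load, ignoring any remainder iterations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Buf = std::vector<char>;

// Scalar reference: the loop body from the example, one iteration at a time.
void scalar(Buf &a, Buf &c, int n) {
  assert(a.size() >= static_cast<std::size_t>(n + 4) && c.size() >= a.size());
  for (int i = 0; i < n; i += 3) {
    c[i] = a[i];
    a[i + 4] = c[i + 4];
    c[i + 1] = a[i + 1];
  }
}

// Two iterations per step (lanes i and i + 3). The loads of a[i] and a[i + 1]
// form one group placed at the first load, above the store to a[i + 4].
// Assumes an even number of scalar iterations; no remainder loop.
void withLoadGroup(Buf &a, Buf &c, int n) {
  assert(a.size() >= static_cast<std::size_t>(n + 4) && c.size() >= a.size());
  for (int i = 0; i + 3 < n; i += 6) {
    const int Lane[2] = {i, i + 3};
    char A0[2], A1[2];
    for (int k = 0; k < 2; ++k) {
      A0[k] = a[Lane[k]];
      A1[k] = a[Lane[k] + 1]; // lane 1 reads a[i + 4] before lane 0 stores it
    }
    for (int k = 0; k < 2; ++k)
      c[Lane[k]] = A0[k];
    for (int k = 0; k < 2; ++k)
      a[Lane[k] + 4] = c[Lane[k] + 4];
    for (int k = 0; k < 2; ++k)
      c[Lane[k] + 1] = A1[k];
  }
}
```

Starting from a[j] = j and c[j] = 100 + j with n = 12, the scalar loop leaves c[4] equal to 104 (the value stored to a[4] in the first iteration), while the grouped version leaves c[4] equal to 4, the stale value read before the store.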
if (DesB.Stride != DesA.Stride || DesB.Size != DesA.Size)		if (DesB.Stride != DesA.Stride || DesB.Size != DesA.Size)
continue;		continue;

// Calculate the distance and prepare for the rule 3.		// Calculate the distance and prepare for the rule 3.
const SCEVConstant *DistToA = dyn_cast<SCEVConstant>(		const SCEVConstant *DistToA = dyn_cast<SCEVConstant>(
PSE.getSE()->getMinusSCEV(DesB.Scev, DesA.Scev));		PSE.getSE()->getMinusSCEV(DesB.Scev, DesA.Scev));
if (!DistToA)		if (!DistToA)
continue;		continue;

int DistanceToA = DistToA->getAPInt().getSExtValue();		int DistanceToA = DistToA->getAPInt().getSExtValue();

// Skip if the distance is not multiple of size as they are not in the		// Skip if the distance is not multiple of size as they are not in the
// same group.		// same group.
if (DistanceToA % static_cast<int>(DesA.Size))		if (DistanceToA % static_cast<int>(DesA.Size))
continue;		continue;

// The index of B is the index of A plus the related index to A.		// The index of B is the index of A plus the related index to A.
int IndexB =		int IndexB =
Group->getIndex(A) + DistanceToA / static_cast<int>(DesA.Size);		Group->getIndex(A) + DistanceToA / static_cast<int>(DesA.Size);

// Try to insert B into the group.		// Try to insert B into the group.
if (Group->insertMember(B, IndexB, DesB.Align)) {		if (Group->insertMember(B, IndexB, DesB.Align)) {
DEBUG(dbgs() << "LV: Inserted:" << *B << '\n'		DEBUG(dbgs() << "LV: Inserted:" << *B << '\n'
sbaranga: This sentence seems to be unfinished?
mssimpso: Thanks for catching that. I'll submit an update.
<< " into the interleave group with" << *A << '\n');		<< " into the interleave group with" << *A << '\n');
InterleaveGroupMap[B] = Group;		InterleaveGroupMap[B] = Group;

// Set the first load in program order as the insert position.		// Set the first load in program order as the insert position.
if (B->mayReadFromMemory())		if (B->mayReadFromMemory())
Group->setInsertPos(B);		Group->setInsertPos(B);
}		}
} // Iteration on instruction B		} // Iteration on instruction B
} // Iteration on instruction A		} // Iteration on instruction A

// Remove interleaved store groups with gaps.		// Remove interleaved store groups with gaps.
for (InterleaveGroup *Group : StoreGroups)		for (InterleaveGroup *Group : StoreGroups)
sbaranga: Wouldn't we need to add B to LoopIndependentRAWStores even if we add it to StoresToRemove?
mssimpso: StoresToRemove holds stores whose groups will definitely be removed. LoopIndependentRAWStores holds stores that we don't yet have enough information about to determine if they need to be removed or not. In this case, if the current store (B) is already in a group, it means that it will be re-ordered by sinking it to the insert location of the group, violating the dependence. If it's not yet in a group (and thus, won't be re-ordered), we don't yet know that the load (A) won't be hoisted, which would also violate the dependence. Once we know that another load has been added to A's group, we know that A will be re-ordered. When this happens, we move all the stores in LoopIndependentRAWStores to StoresToRemove, to mark them for definite removal.
if (Group->getNumMembers() != Group->getFactor())		if (Group->getNumMembers() != Group->getFactor())
releaseGroup(Group);		releaseGroup(Group);

// If there is a non-reversed interleaved load group with gaps, we will need		// If there is a non-reversed interleaved load group with gaps, we will need
// to execute at least one scalar epilogue iteration. This will ensure that		// to execute at least one scalar epilogue iteration. This will ensure that
// we don't speculatively access memory out-of-bounds. Note that we only need		// we don't speculatively access memory out-of-bounds. Note that we only need
// to look for a member at index factor - 1, since every group must have a		// to look for a member at index factor - 1, since every group must have a
// member at index zero.		// member at index zero.

test/Transforms/LoopVectorize/interleaved-accesses.ll

		for.body: ; preds = %for.body, %entry
%b = getelementptr inbounds %struct.IntFloat, %struct.IntFloat* %A, i64 %indvars.iv, i32 1		%b = getelementptr inbounds %struct.IntFloat, %struct.IntFloat* %A, i64 %indvars.iv, i32 1
%tmp1 = load float, float* %b, align 4		%tmp1 = load float, float* %b, align 4
%add3 = fadd fast float %SumB.014, %tmp1		%add3 = fadd fast float %SumB.014, %tmp1
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%exitcond = icmp eq i64 %indvars.iv.next, 1024		%exitcond = icmp eq i64 %indvars.iv.next, 1024
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body
}		}

		; Check vectorization of interleaved access groups in the presence of
		; dependences (PR27626). The following tests check that we don't reorder
		; dependent loads and stores when generating code for interleaved access
		; groups. Stores should be scalarized because the required code motion would
		; break dependences, and the remaining interleaved load groups should have
		; gaps.

		; PR27626_0: Ensure a strided store is not moved after a dependent (zero
		; distance) strided load.

		; void PR27626_0(struct pair *p, int z, int n) {
		; for (int i = 0; i < n; i++) {
		; p[i].x = z;
		; p[i].y = p[i].x;
		; }
		; }

		; CHECK-LABEL: @PR27626_0(
		; CHECK: min.iters.checked:
		; CHECK: %n.mod.vf = and i64 %[[N:.+]], 3
		; CHECK: %[[IsZero:[a-zA-Z0-9]+]] = icmp eq i64 %n.mod.vf, 0
		; CHECK: %[[R:[a-zA-Z0-9]+]] = select i1 %[[IsZero]], i64 4, i64 %n.mod.vf
anemet: You mean p[i].x etc in the loop.
mssimpso: Yes, thanks for catching that!
		; CHECK: %n.vec = sub i64 %[[N]], %[[R]]
		; CHECK: vector.body:
		; CHECK: %[[L1:.+]] = load <8 x i32>, <8 x i32>* {{.*}}
		; CHECK: %[[X1:.+]] = extractelement <8 x i32> %[[L1]], i32 0
		; CHECK: store i32 %[[X1]], {{.*}}
		; CHECK: %[[X2:.+]] = extractelement <8 x i32> %[[L1]], i32 2
		; CHECK: store i32 %[[X2]], {{.*}}
		; CHECK: %[[X3:.+]] = extractelement <8 x i32> %[[L1]], i32 4
		; CHECK: store i32 %[[X3]], {{.*}}
		; CHECK: %[[X4:.+]] = extractelement <8 x i32> %[[L1]], i32 6
		; CHECK: store i32 %[[X4]], {{.*}}

		%pair.i32 = type { i32, i32 }
		define void @PR27626_0(%pair.i32 *%p, i32 %z, i64 %n) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
		%p_i.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 0
		%p_i.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 1
		store i32 %z, i32* %p_i.x, align 4
		%0 = load i32, i32* %p_i.x, align 4
		store i32 %0, i32 *%p_i.y, align 4
		%i.next = add nuw nsw i64 %i, 1
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		ret void
		}

		; PR27626_1: Ensure a strided load is not moved before a dependent (zero
		; distance) strided store.

		; void PR27626_1(struct pair *p, int n) {
		; int s = 0;
		; for (int i = 0; i < n; i++) {
		; p[i].y = p[i].x;
		; s += p[i].y;
		; }
		; }

		; CHECK-LABEL: @PR27626_1(
		; CHECK: min.iters.checked:
		; CHECK: %n.mod.vf = and i64 %[[N:.+]], 3
		; CHECK: %[[IsZero:[a-zA-Z0-9]+]] = icmp eq i64 %n.mod.vf, 0
		; CHECK: %[[R:[a-zA-Z0-9]+]] = select i1 %[[IsZero]], i64 4, i64 %n.mod.vf
		; CHECK: %n.vec = sub i64 %[[N]], %[[R]]
		; CHECK: vector.body:
		; CHECK: %[[Phi:.+]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ {{.*}}, %vector.body ]
		; CHECK: %[[L1:.+]] = load <8 x i32>, <8 x i32>* {{.*}}
		; CHECK: %[[X1:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 0
		; CHECK: store i32 %[[X1:.+]], {{.*}}
		; CHECK: %[[X2:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 2
		; CHECK: store i32 %[[X2:.+]], {{.*}}
		; CHECK: %[[X3:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 4
		; CHECK: store i32 %[[X3:.+]], {{.*}}
		; CHECK: %[[X4:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 6
		; CHECK: store i32 %[[X4:.+]], {{.*}}
		; CHECK: %[[L2:.+]] = load <8 x i32>, <8 x i32>* {{.*}}
		; CHECK: %[[S1:.+]] = shufflevector <8 x i32> %[[L2]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
		; CHECK: add nsw <4 x i32> %[[S1]], %[[Phi]]

		define i32 @PR27626_1(%pair.i32 *%p, i64 %n) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
		%s = phi i32 [ %2, %for.body ], [ 0, %entry ]
		%p_i.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 0
		%p_i.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 1
		%0 = load i32, i32* %p_i.x, align 4
		store i32 %0, i32* %p_i.y, align 4
		%1 = load i32, i32* %p_i.y, align 4
		%2 = add nsw i32 %1, %s
		%i.next = add nuw nsw i64 %i, 1
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		%3 = phi i32 [ %2, %for.body ]
		ret i32 %3
		}

		; PR27626_2: Ensure a strided store is not moved after a dependent (negative
		; distance) strided load.

		; void PR27626_2(struct pair *p, int z, int n) {
		; for (int i = 0; i < n; i++) {
		; p[i].x = z;
		; p[i].y = p[i - 1].x;
		; }
		; }

		; CHECK-LABEL: @PR27626_2(
		; CHECK: min.iters.checked:
		; CHECK: %n.mod.vf = and i64 %[[N:.+]], 3
		; CHECK: %[[IsZero:[a-zA-Z0-9]+]] = icmp eq i64 %n.mod.vf, 0
		; CHECK: %[[R:[a-zA-Z0-9]+]] = select i1 %[[IsZero]], i64 4, i64 %n.mod.vf
		; CHECK: %n.vec = sub i64 %[[N]], %[[R]]
		; CHECK: vector.body:
		; CHECK: %[[L1:.+]] = load <8 x i32>, <8 x i32>* {{.*}}
		; CHECK: %[[X1:.+]] = extractelement <8 x i32> %[[L1]], i32 0
		; CHECK: store i32 %[[X1]], {{.*}}
		; CHECK: %[[X2:.+]] = extractelement <8 x i32> %[[L1]], i32 2
		; CHECK: store i32 %[[X2]], {{.*}}
		; CHECK: %[[X3:.+]] = extractelement <8 x i32> %[[L1]], i32 4
		; CHECK: store i32 %[[X3]], {{.*}}
		; CHECK: %[[X4:.+]] = extractelement <8 x i32> %[[L1]], i32 6
		; CHECK: store i32 %[[X4]], {{.*}}

		define void @PR27626_2(%pair.i32 *%p, i64 %n, i32 %z) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
		%i_minus_1 = add nuw nsw i64 %i, -1
		%p_i.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 0
		%p_i_minus_1.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i_minus_1, i32 0
		%p_i.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 1
		store i32 %z, i32* %p_i.x, align 4
		%0 = load i32, i32* %p_i_minus_1.x, align 4
		store i32 %0, i32 *%p_i.y, align 4
		%i.next = add nuw nsw i64 %i, 1
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		ret void
		}

		; PR27626_3: Ensure a strided load is not moved before a dependent (negative
		; distance) strided store.

		; void PR27626_3(struct pair *p, int z, int n) {
		; for (int i = 0; i < n; i++) {
		; p[i + 1].y = p[i].x;
		; s += p[i].y;
		; }
		; }

		; CHECK-LABEL: @PR27626_3(
		; CHECK: min.iters.checked:
		; CHECK: %n.mod.vf = and i64 %[[N:.+]], 3
		; CHECK: %[[IsZero:[a-zA-Z0-9]+]] = icmp eq i64 %n.mod.vf, 0
		; CHECK: %[[R:[a-zA-Z0-9]+]] = select i1 %[[IsZero]], i64 4, i64 %n.mod.vf
		; CHECK: %n.vec = sub i64 %[[N]], %[[R]]
		; CHECK: vector.body:
		; CHECK: %[[Phi:.+]] = phi <4 x i32> [ zeroinitializer, %vector.ph ], [ {{.*}}, %vector.body ]
		; CHECK: %[[L1:.+]] = load <8 x i32>, <8 x i32>* {{.*}}
		; CHECK: %[[X1:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 0
		; CHECK: store i32 %[[X1:.+]], {{.*}}
		; CHECK: %[[X2:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 2
		; CHECK: store i32 %[[X2:.+]], {{.*}}
		; CHECK: %[[X3:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 4
		; CHECK: store i32 %[[X3:.+]], {{.*}}
		; CHECK: %[[X4:.+]] = extractelement <8 x i32> %[[L1:.+]], i32 6
		; CHECK: store i32 %[[X4:.+]], {{.*}}
		; CHECK: %[[L2:.+]] = load <8 x i32>, <8 x i32>* {{.*}}
		; CHECK: %[[S1:.+]] = shufflevector <8 x i32> %[[L2]], <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
		; CHECK: add nsw <4 x i32> %[[S1]], %[[Phi]]

		define i32 @PR27626_3(%pair.i32 *%p, i64 %n, i32 %z) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
		%s = phi i32 [ %2, %for.body ], [ 0, %entry ]
		%i_plus_1 = add nuw nsw i64 %i, 1
		%p_i.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 0
		%p_i.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 1
		%p_i_plus_1.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i_plus_1, i32 1
		%0 = load i32, i32* %p_i.x, align 4
		store i32 %0, i32* %p_i_plus_1.y, align 4
		%1 = load i32, i32* %p_i.y, align 4
		%2 = add nsw i32 %1, %s
		%i.next = add nuw nsw i64 %i, 1
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		%3 = phi i32 [ %2, %for.body ]
		ret i32 %3
		}

		; PR27626_4: Ensure we form an interleaved group for strided stores in the
		; presence of a write-after-write dependence. We create a group for
anemet: ... but exclude a[i] = x
		; (2) and (3) while excluding (1).

		; void PR27626_4(int *a, int x, int y, int z, int n) {
		; for (int i = 0; i < n; i += 2) {
		; a[i] = x; // (1)
		; a[i] = y; // (2)
		; a[i + 1] = z; // (3)
		; }
		; }

		; CHECK-LABEL: @PR27626_4(
		; CHECK: vector.ph:
		; CHECK: %[[INS_Y:.+]] = insertelement <4 x i32> undef, i32 %y, i32 0
		; CHECK: %[[SPLAT_Y:.+]] = shufflevector <4 x i32> %[[INS_Y]], <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK: %[[INS_Z:.+]] = insertelement <4 x i32> undef, i32 %z, i32 0
		; CHECK: %[[SPLAT_Z:.+]] = shufflevector <4 x i32> %[[INS_Z]], <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK: vector.body:
		; CHECK: store i32 %x, {{.*}}
		; CHECK: store i32 %x, {{.*}}
		; CHECK: store i32 %x, {{.*}}
		; CHECK: store i32 %x, {{.*}}
		; CHECK: %[[VEC:.+]] = shufflevector <4 x i32> %[[SPLAT_Y]], <4 x i32> %[[SPLAT_Z]], <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
		; CHECK: store <8 x i32> %[[VEC]], {{.*}}

		define void @PR27626_4(i32 *%a, i32 %x, i32 %y, i32 %z, i64 %n) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
		%i_plus_1 = add i64 %i, 1
		%a_i = getelementptr inbounds i32, i32* %a, i64 %i
		%a_i_plus_1 = getelementptr inbounds i32, i32* %a, i64 %i_plus_1
		store i32 %x, i32* %a_i, align 4
		store i32 %y, i32* %a_i, align 4
		store i32 %z, i32* %a_i_plus_1, align 4
		%i.next = add nuw nsw i64 %i, 2
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		ret void
		}

		; PR27626_5: Ensure we do not form an interleaved group for strided stores in
		; the presence of a write-after-write dependence.

		; void PR27626_5(int *a, int x, int y, int z, int n) {
		; for (int i = 3; i < n; i += 2) {
		; a[i - 1] = x;
		; a[i - 3] = y;
		; a[i] = z;
		; }
		; }

		; CHECK-LABEL: @PR27626_5(
		; CHECK: vector.body:
		; CHECK: store i32 %x, {{.*}}
		; CHECK: store i32 %x, {{.*}}
		; CHECK: store i32 %x, {{.*}}
		; CHECK: store i32 %x, {{.*}}
		; CHECK: store i32 %y, {{.*}}
		; CHECK: store i32 %y, {{.*}}
		; CHECK: store i32 %y, {{.*}}
		; CHECK: store i32 %y, {{.*}}
		; CHECK: store i32 %z, {{.*}}
		; CHECK: store i32 %z, {{.*}}
		; CHECK: store i32 %z, {{.*}}
		; CHECK: store i32 %z, {{.*}}

		define void @PR27626_5(i32 *%a, i32 %x, i32 %y, i32 %z, i64 %n) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 3, %entry ]
		%i_minus_1 = sub i64 %i, 1
		%i_minus_3 = sub i64 %i_minus_1, 2
		%a_i = getelementptr inbounds i32, i32* %a, i64 %i
		%a_i_minus_1 = getelementptr inbounds i32, i32* %a, i64 %i_minus_1
		%a_i_minus_3 = getelementptr inbounds i32, i32* %a, i64 %i_minus_3
		store i32 %x, i32* %a_i_minus_1, align 4
		store i32 %y, i32* %a_i_minus_3, align 4
		store i32 %z, i32* %a_i, align 4
		%i.next = add nuw nsw i64 %i, 2
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		ret void
		}

attributes #0 = { "unsafe-fp-math"="true" }		attributes #0 = { "unsafe-fp-math"="true" }

Revision Contents

Diff 61708
