Page MenuHomePhabricator

[LV] Preserve order of dependences in interleaved accesses analysis
ClosedPublic

Authored by mssimpso on May 5 2016, 10:48 AM.

Details

Summary

The interleaved access analysis currently assumes that the inserted run-time pointer aliasing checks ensure the absence of dependences that would prevent its instruction reordering. However, this is not the case.

Issues can arise from how code generation is performed for interleaved groups. For a load group, all loads in the group are essentially moved to the location of the first load in program order, and for a store group, all stores in the group are moved to the location of the last store. For groups having members involved in a dependence relation with any other instruction in the loop, this reordering can violate the dependence.

This patch teaches the interleaved access analysis how to avoid breaking such dependences, and should fix PR27626.

An assumption of the original analysis was that the accesses had been collected in "program order". The analysis was then simplified by visiting the accesses bottom-up. However, this ordering was never guaranteed for anything other than single basic block loops. Thus, this patch also enforces the desired ordering.

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
anemet edited edge metadata.May 11 2016, 3:21 PM

Hi Silviu, thanks for the comments.

I think non-zero dependences would have already been rejected by LAA. Would this this be the reason why it is correct to only look at the zero distance ones?

Yes, that should be the case!

Why are non-zero forward deps rejected by LAA? Because of the HW store-to-load forwarding case? I think that that is only a performance consideration and it's only on conditionally.

I think that we should probably take the time and review the soundness of these code motions with respect to interleaved access vectorization {RAW, WAR, WAW} x {loop-independent, loop-carried}. I've only looked at the LAA aspects of this feature so far but I am a bit worried about this code now.

Matt, do you think you can do this?

I am also thinking if we should disable this feature in the meantime?!

Hi Adam,

Why are non-zero forward deps rejected by LAA? Because of the HW store-to-load forwarding case? I think that that is only a performance consideration and it's only on conditionally.

Siviu was referring to a comment I had made in the source. But I think the idea was that LAA had already ensured the absence of the dependences that would have prevented vectorization. So the interleaved access analysis didn't need to worry about them. Regarding the store-to-load forwarding case, yes, that is the assumption the original analysis made, which was incorrect. The current patch attempts to fix that.

I think that we should probably take the time and review the soundness of these code motions with respect to interleaved access vectorization {RAW, WAR, WAW} x {loop-independent, loop-carried}. I've only looked at the LAA aspects of this feature so far but I am a bit worried about this code now.

Matt, do you think you can do this?

I'm happy to give the analysis a very careful second look. I'll do that today and report back.

I am also thinking if we should disable this feature in the meantime?!

We should definitely disable interleaved access vectorization if we can't correct the bug (PR27626) before release time. Doing so would have performance implications for ARM/AArch64, so I'm hoping we can fix it before then. I wouldn't be strongly opposed to disabling it in the meantime. But to my knowledge, the bug isn't currently blocking anyone, and it's been around since the original implementation. Silviu and I only discovered it in passing, and it wasn't something we encountered in the wild. What do you think?

Just to expand on the point above:

From the algorithm of constructing interleaved groups, we should be able to exclude both WAW (from the algorithm, see the comments) and WAR (we're interleaving and then moving stores down and loads up, so we cannot break these) - for both loop-carried and loop independent dependences.

So the problem should only be RAW. If it is loop carried, then it is a forward dependence and we cannot vectorize, so this should be safe.
And this should handle the loop independent case.

Also we've probably not seen this until now because other optimizations would simply remove the load before this got to the vectorizer?

Cheers,
Silviu

Hi Adam,

Why are non-zero forward deps rejected by LAA? Because of the HW store-to-load forwarding case? I think that that is only a performance consideration and it's only on conditionally.

Siviu was referring to a comment I had made in the source. But I think the idea was that LAA had already ensured the absence of the dependences that would have prevented vectorization. So the interleaved access analysis didn't need to worry about them. Regarding the store-to-load forwarding case, yes, that is the assumption the original analysis made, which was incorrect. The current patch attempts to fix that.

No, I was asking about *non-zero* distance deps specifically. The current patch only handles zero-distance deps. So my question was whether for non-zero distance we still rely on the store-to-load forwarding detection code to make the dep unsafe for vectorization.

Hi Silviu,

Just to expand on the point above:

From the algorithm of constructing interleaved groups, we should be able to exclude both WAW (from the algorithm, see the comments) and WAR (we're interleaving and then moving stores down and loads up, so we cannot break these) - for both loop-carried and loop independent dependences.

What about moving elements of an interleaved group over other dependent accesses not in the same group?

Adam

test/Transforms/LoopVectorize/interleaved-accesses.ll
574–579 ↗(On Diff #56589)

You mean p[i].x etc in the loop.

Hi Guys,

I've been thinking through these dependence issues more carefully. I'm not yet finished with all the permutations, but I've come across another somewhat unrelated issue. I thought I would post an update about that and respond to some of Adam's questions in the meantime. First the questions:

No, I was asking about *non-zero* distance deps specifically. The current patch only handles zero-distance deps. So my question was whether for non-zero distance we still rely on the store-to-load forwarding detection code to make the dep unsafe for vectorization.

For positive dependences, LAA prevents vectorization if the distance is less than some minimum (which determines the maximum safe VF). For positive dependences between strided accesses, LAA tries to prove independence (the accesses are independent if the distance is not a multiple of the stride). Non-positive dependences are allowed, and LAA checks for store-to-load forwarding conflicts for RAWs. The original analysis assumed the store-to-load detection would guarantee the absence of all RAWs, which was incorrect. So we need to consider the non-positive RAWs (as Silviu said, we should be able to exclude the WAR and WAW cases).

What about moving elements of an interleaved group over other dependent accesses not in the same group?

This would be the RAW case. For example, S1-L2 is a RAW dependence, L1-L2 form a group:

L1: load
S1: store
L2: load  // L2 would be hoisted above S1.

Or alternatively, S1-L1 is a RAW dependence, S1-S2 form a group:

S1: store // S1 would be sunk below L1.
L1: load
S2: store

The other issue I've uncovered is related to the maximum safe VF. We set this based on the positive dependence distance LAA computes. But for the interleaved accesses, the actual VF used during vectorization is VF * IF, where IF is the interleave factor. The idea is that each component of the wide vector would have VF elements after they are shuffled out. However, this could be greater than the maximum safe VF. Here's an example.

; for (int i = 1; i < 1000; ++i) {
;   p[i + 2].x = p[i].x;
;   p[i + 2].y = p[i].y;
; }
%struct.pair = type { i32, i32 }
for.body:
  %indvars.iv = phi i64 [ 1, %entry ], [ %indvars.iv.next, %for.body ]
  %x1 = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %indvars.iv, i32 0
  %0 = load i32, i32* %x1, align 4
  %1 = add nuw nsw i64 %indvars.iv, 2
  %x4 = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %1, i32 0
  store i32 %0, i32* %x4, align 4
  %y = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %indvars.iv, i32 1
  %2 = load i32, i32* %y, align 4
  %y10 = getelementptr inbounds %struct.pair, %struct.pair* %p, i64 %1, i32 1
  store i32 %2, i32* %y10, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp eq i64 %indvars.iv.next, 1000
  br i1 %exitcond, label %for.cond.cleanup, label %for.body
}

We currently generate <8 x i32> loads and stores, but the maximum safe dependence distance is only 16 bytes, so I think we are generating incorrect code here. I think we should probably check for interleaved accesses when selecting the VF.

test/Transforms/LoopVectorize/interleaved-accesses.ll
574–579 ↗(On Diff #56589)

Yes, thanks for catching that!

I submitted D20241 to correct the issue with the vectorization factor mentioned above.

Matt: thanks for doing this analysis!

What about moving elements of an interleaved group over other dependent accesses not in the same group?

Adam

We shouldn't be doing that. We should be preserving dependences, and as long as we do that this should be correct.

It does look like things are more complicated then what I previously stated, at least with regard to RAW. And we do need to consider non-zero distance dependences (because we are essentially moving this loads/stores after interleaving). Here is an example:

for (...) {

a[i].a = a[i + 2].a + 1
tmp = a[i - 1].a * 2
a[i].b = tmp
a[i].c = tmp
a[i].d = tmp

}

The problem here is that by moving the store to a[i].a we're breaking the loop-carried dependence from a[i].a to a[i-1].a (which doesn't stop vectorization).

Cheers,
Silviu

I agree, Silviu. That's basically what I've been thinking. I'm working on an updated patch. Thanks for all the feedback!

mssimpso updated this revision to Diff 57508.May 17 2016, 1:02 PM
mssimpso retitled this revision from [LV] Handle RAW dependences in interleaved access analysis to [LV] Preserve order of dependences in interleaved accesses analysis.
mssimpso updated this object.
mssimpso edited edge metadata.

Updated the analysis according to feedback from Adam and Silviu.

Sorry for the delay in getting a new version of this patch ready. This update is notably different from the previous version in the following ways: (1) We collect all memory accesses in collectConstStrideAccesses instead of only the stride-greater-than-one accesses. We have to collect all accesses to check for dependences between interleaved and non-interleaved accesses. The non-strided accesses are ignored when actually creating the groups. (2) We only ignore the WAR case. I think the code generation strategy can only ensure we won't violate write-after-reads. Everything else is checked. (3) For strided accesses, we try and prove independence like is done in LAA. All other constant-distance accesses are considered dependent, and we preserve their order. The dependences are checked in canReorderMemAccs. (4) I've added another test case, and updated existing comments about the algorithm.

mssimpso marked an inline comment as done.May 17 2016, 1:03 PM

I have just a few quick remarks (and most of them were inherited from the existing code).

Cheers,
Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
5071 ↗(On Diff #57508)

We should improve this at some point to check for NoDep (which is what we're really looking for here) and is implied by this test.

5077 ↗(On Diff #57508)

This makes assumptions on how LAA works. It might be technically true at the moment, but if the dependence analysis would be improved it would possibly invalidate this.

I think the case where we memcheck each pointer is an edge case? I'm not sure it's worth doing the interleaved access vectorization in this case.

The more common case could be if two pointers are in different alias sets or have different underlying objects? But that should be expressed as a NoDep.

5093 ↗(On Diff #57508)

Same here. Ideally we would know we have NoDep from LAA.

Some of the checks here seem to be specifically related to forming the interleaved groups, and less about reordering accesses.

Hi Silviu,

It sounds like you're suggesting that we should just query LAA to determine if any dependence exists between the two candidate instructions, and if so, prevent reordering. Does that sound right?

lib/Transforms/Vectorize/LoopVectorize.cpp
5093 ↗(On Diff #57508)

If you're referring to the type and factor checks, those are required before we can call areStridedAccessesIndependent.

mssimpso updated this revision to Diff 57671.May 18 2016, 1:57 PM

Addressed Silviu's comments.

Hi Silviu,

I've updated the patch to use LAA for checking for the presence of dependences like you suggested. This is actually much simpler. Thanks! I had to make one change to LAA: We now try to prove strided accesses independent before handling the negative distance cases.

Matt.

mssimpso marked 3 inline comments as done.May 18 2016, 1:58 PM

I had to make one change to LAA: We now try to prove strided accesses independent before handling the negative distance cases.

Yes, I've seen this recently too. Can you please split this out (and commit) with an appropriate LAA testcase.

Thanks for making the changes. I think this looks much better.

Cheers,
Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
5078 ↗(On Diff #57671)

We should be able to do this faster than O(n) - maybe with some pre-processing of dependences.

Yes, I've seen this recently too. Can you please split this out (and commit) with an appropriate LAA testcase.

Sure. I committed the LAA portion in rL270072. I'll post a rebased version of this patch and address Silviu's latest comments.

I am not done reviewing this but I just want to say that I am having a much better feeling about this analysis after your changes. As I said I had major issues with its soundness. Thanks for your patience for working through all this!

I am also sending my initial comments but there may be more coming.

lib/Transforms/Vectorize/LoopVectorize.cpp
947–949 ↗(On Diff #57671)

Somewhere this should state the criteria for reordering. In theory you can reorder backward dependences but looks like you don't allow any (which is fine).

5058 ↗(On Diff #57671)

I am wondering if it's time to change the name of this. After your change it no longer contains only "interesting" strided accesses. How about like AccessStrideInfo?

5062 ↗(On Diff #57671)

I think that we've been using Accesses pretty consistently without any abbreviation.

5093–5095 ↗(On Diff #57671)

I would drop the part about memchecks. This is already pretty long and that part is trivial.

mssimpso updated this revision to Diff 57821.May 19 2016, 10:54 AM
mssimpso marked 4 inline comments as done.

Addressed comments from Silviu and Adam.

Thanks very much for all the feedback! I agree, this is definitely looking better. In this update, I pre-process the dependences to enable constant-time queries like Silviu suggested, and I address Adam's latest comments.

lib/Transforms/Vectorize/LoopVectorize.cpp
5058 ↗(On Diff #57671)

Sounds good to me. I'll rename StridedAccesses prior to committing the current patch.

anemet added inline comments.May 19 2016, 1:55 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5162–5166 ↗(On Diff #57821)

Here I think we're validating the correctness of:

  1. Potentially moving 'A' an interleaved load before any store 'B' or
  2. Potentially moving 'B' an interleaved store after any load/store 'A'.

If I am right, I don't think either the comment or the code is tight enough to reflect this.

It would also be good to add testcases for this.

It would be also great to add a testcase to the earlier problem that only candidate accesses were dependency-checked.

What do you think?

mssimpso added inline comments.May 19 2016, 2:39 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5162–5166 ↗(On Diff #57821)

Adam,

I think you're mostly right in the description of what we're validating the correctness of. When you have a chance, would you mind elaborating a bit more as to why you think we're not satisfying both of the cases you mention? The algorithm works bottom-up, so B will always precede A in program order, and we always investigate B's that are closest to A first (some dependences do temporarily slip by but are later checked, see below). Also note that we don't yet handle loops with predicated blocks. I'm working on that in D19694.

For case (1) the bottom-up ordering implies that if we can move a load A before a store B, we know that we can also move A before any store that B precedes. Case (2) is a bit more subtle and is probalby confusing. Say we have the case below:

S1: Store  // depends on L1
L1: Load   // depends on S1
S2: Store

When A is S2 and B is S1, we might say that these are independent, and S1 could be sunk to S2's location. We will add S1 to S2's group. But when A is L1 and B is S1, we will notice the dependence. When we do, we check to see if S1 is already in a group, and if so, invalidate it. This prevents us from moving S1.

Please let me know if I'm still missing something here. We could probably improve the comment and/or rename "canReorderMemAccesses", since this doesn't quite capture what's happening with the instruction ordering.

And yes, we can definitely add more tests.

anemet added inline comments.May 19 2016, 4:57 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5162–5166 ↗(On Diff #57821)

I think you're mostly right in the description of what we're validating the correctness of. When you have a chance, would you mind elaborating a bit more as to why you think we're not satisfying both of the cases you mention?

I said "not tight enough" not that it's incorrect, sorry if that was unclear.

I think I would like to include the above two cases in the comment (or an improved version) and then have the checks in the code reflect that. I.e. right now we don't check that at least one of 'A' or 'B' is interleaved.

The algorithm works bottom-up, so B will always precede A in program order, and we always investigate B's that are closest to A first (some dependences do temporarily slip by but are later checked, see below). Also note that we don't yet handle loops with predicated blocks. I'm working on that in D19694.

For case (1) the bottom-up ordering implies that if we can move a load A before a store B, we know that we can also move A before any store that B precedes. Case (2) is a bit more subtle and is probalby confusing. Say we have the case below:

S1: Store depends on L1
L1: Load
depends on S1
S2: Store
When A is S2 and B is S1, we might say that these are independent, and S1 could be sunk to S2's location. We will add S1 to S2's group. But when A is L1 and B is S1, we will notice the dependence. When we do, we check to see if S1 is already in a group, and if so, invalidate it. This prevents us from moving S1.

Make sense.

I am wondering now if there is a simpler way to formulate this analysis with the same result. Aren't we simply saying that we don't consider an interleaved access for merging if it's either a source or the destination of a dependence? I.e. something like this in the outer loop:

if (isDepSource(A) && A->mayWriteToMemory() || isDepDestination(A))
  break:

and then no need for the inner loop?

Please let me know if I'm still missing something here. We could probably improve the comment and/or rename "canReorderMemAccesses", since this doesn't quite capture what's happening with the instruction ordering.

And yes, we can definitely add more tests.

Thanks!

mssimpso added inline comments.May 20 2016, 6:42 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5162–5166 ↗(On Diff #57821)

I think I would like to include the above two cases in the comment (or an improved version) and then have the checks in the code reflect that. I.e. right now we don't check that at least one of 'A' or 'B' is interleaved.

I see, yes this makes sense.

I am wondering now if there is a simpler way to formulate this analysis with the same result. Aren't we simply saying that we don't consider an interleaved access for merging if it's either a source or the destination of a dependence?

This sounds right to me. Let me think it over before posting another update. Thanks again for all the feedback, Adam!

mssimpso updated this revision to Diff 58286.May 24 2016, 12:08 PM

Addressed Adam's latest round of comments.

Adam,

I've updated the comments and code to be more precise as you suggested. I've also added a couple of new test cases so that we are checking the conditions you enumerated in your last review. I'll reply to your other points inline. Thanks!

Matt.

mssimpso marked 2 inline comments as done.May 24 2016, 12:10 PM
mssimpso added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
5196–5200 ↗(On Diff #58286)

It would be also great to add a testcase to the earlier problem that only candidate accesses were dependency-checked.

I've tried constructing a testcase that would generate incorrect code for this, but I think the current limitations of LAA might actually make this unrealizable at the moment. After the memory checks, two access can be dependent only at constant distances. Thus, if one access is strided, a dependent access at a constant distance would have to be strided as well.

This sounds right to me. Let me think it over before posting another update.

After thinking about this more carefully, I think we will miss cases with a simplification like the one you suggested. For example, if we have a set of interleaved loads followed by a set of interleaved stores that access the same location (or vice versa), giving up in the outer loop would prevent the two groups from being created.

This looks sensible to me. LGTM! (you should wait for further comments from Adam before committing)

Cheers,
Silviu

anemet added a comment.Jun 7 2016, 2:49 PM

Hi Matt,

Sorry about the delay. As I said in my earlier heads-up I was on vacation.

My main comment is the reply to your testcase remark. The other ones are just nitpicks but I haven't finished going through the patch. I am just curious what you think about the idea regarding the testcase.

Adam

lib/Transforms/Vectorize/LoopVectorize.cpp
959–960 ↗(On Diff #58286)

Both A and B are not necessarily strided here.

976–977 ↗(On Diff #58286)

This requirement is part of the API so it should be in the function comment. Also a nit, can you please flip the order. I think that A preceding B is the more intuitive. Let me know if you disagree.

5196–5200 ↗(On Diff #58286)

I've tried constructing a testcase that would generate incorrect code for this, but I think the current limitations of LAA might actually make this unrealizable at the moment. After the memory checks, two access can be dependent only at constant distances. Thus, if one access is strided, a dependent access at a constant distance would have to be strided as well.

Can't we use larger than stride offsets/indices to emulate non-interleaved accesses? I believe that these are currently ignored by interleaved analysis. I mean something like:

for (i = 0; i < n; i+=3) {
   ... = A[i]
   A[i+4] = ...
   ... = A[i+1]
}

And then A[i+1] shouldn't be moved across A[i+4]

I haven't tried this, it's just an idea....

Adam,

Welcome back! Thanks for following up. I replied to your comments inline.

Matt.

lib/Transforms/Vectorize/LoopVectorize.cpp
959–960 ↗(On Diff #58286)

True. I will rename this and update the comments.

976–977 ↗(On Diff #58286)

Sounds good.

5196–5200 ↗(On Diff #58286)

This was my first thought as well. In this case, A[i+4] is still a stride-greater-than-one access, so it would've been collected in SrideAccesses and sent through the original analysis. If its stride is equal to anything other than that of the other accesses (e.g, one or some value greater than MaxInterleaveGroupFactor), the distance between it and the other accesses will be non-constant (some SCEV expression). If the distance isn't constant LAA will report Dependence::Unknown, and we will create the memory checks.

mssimpso updated this revision to Diff 60069.Jun 8 2016, 11:10 AM
mssimpso marked 2 inline comments as done.

Addressed Adam's comments.

anemet added inline comments.Jun 8 2016, 12:25 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5254–5258 ↗(On Diff #60069)

Matt, I am not sure I follow, which of these accesses is not constant distance in my example?

I just tried this example:

void
f(char *a, char * __restrict c, int n) {
  for (int i = 0; i < n; i+=3) {
    c[i] = a[i];
    a[i+4] = c[i+4];
    c[i+1] = a[i+1];
  }
}

If you compile this on x86_64 with:

-O3 -mllvm -enable-load-pre=0 -mllvm -store-to-load-forwarding-conflict-detection=0 -mllvm -enable-interleaved-mem-accesses -mllvm -force-vector-width=2

Then the loads from 'a' will be merged which breaks the forward dependence between the store of a[i+4] and the load of a[i+1].

No?

mssimpso added inline comments.Jun 8 2016, 12:54 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5254–5258 ↗(On Diff #60069)

Adam,

We may be talking past each other a bit. I think your original request (correct me if I'm wrong) was for a test case showing that the existing analysis doesn't consider checking memory accesses that are not "candidates" for interleaved groups. Candidate accesses are the ones collected in StrideAccesses prior to the analysis. However, every access in your example actually is a candidate access because they are all strided. They are not ignored. Yes, we currently do the wrong thing here, but that's because the current analysis isn't correctly checking dependences between the actual candidate accesses, not because it isn't considering a dependence between a candidate access and a non-candidate access.

Are you asking for something different?

anemet added inline comments.Jun 8 2016, 3:15 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
5254–5258 ↗(On Diff #60069)

Ah, the disconnect was that I didn't realize that accesses with offsets larger than the stride were considered as candidates. The idea was to have these mimic non-strided accesses.

I guess it makes sense to consider these candidates as well because the offset is not really an offset but rather the distance between two accesses.

So I guess we can't have a testcase for this case at the moment...

I'll continue reviewing the patch.

This is very close I just want to take extra care to update/improve the comments while this is all fresh in our memory.

lib/Transforms/Vectorize/LoopVectorize.cpp
967–968 ↗(On Diff #60069)

Actually sorry, let's make this even more precise. At this point the function is called with *any* accesses. (The function later checks that if both accesses are non-strided we don't bother checking deps since we would never reorder those.)

971 ↗(On Diff #60069)

We should probably have some qualification in the name, something like canReorderAccessesForInterleavedGroups or something like that. This is not a general canReorder predicate but takes into consideration the actual code motion strategy.

5135–5142 ↗(On Diff #60069)

Is my understanding correct, that you removed this because we no longer support this case? I.e. the dep 1->2 does not currently allow for this.

If this is true, we should probably add a FIXME.

Also, I thought the WAW case was the only reason for the bottom-up ordering. That is a bit confusing now too.

5254–5257 ↗(On Diff #60069)

It would be nice to use A and B related iterators in this part because that is what you mention in the comment. E.g. AI, BI or IA, IB.

mssimpso added inline comments.Jun 15 2016, 7:26 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
971 ↗(On Diff #60069)

Sounds good. I'll update this and the comments.

5135–5142 ↗(On Diff #60069)

No, we do still support this case, for essentially the reason the existing comment states. When the outer loop of the analysis is on (3) and we visit (2) and (1) in the inner loop, we will form the (3,2) group. (1) can't be added to (3,2) because a member already exists in group (3,2) with the same offset. When the outer loop is on (2), we will again visit (1) in the inner loop. This time we will see the dependence between (1) and (2) and give up trying to add additional accesses to (2)'s group. (1) is not yet in a group, so the outer loop moves on to (1). Thus, (1) must form a group with accesses that precede it and won't be sunk, as the existing comment says.

I removed this comment because I thought it was somewhat misleading. The bottom-up ordering is not sufficient to prevent us from breaking WAW dependencies. We still have to explicitly check for them. For example, if we had something like the following for a factor 2 group:

A[i]   = x  (1)
A[i+2] = y  (2)
A[i+1] = z  (3)

The analysis would proceed as before. When the outer loop is on (3): this time (2) is not added to (3)'s group because its index is too large. (1) is temporarily added to (3)'s group, creating a (3,1) group. When the outer loop is on (2): we see the dependence between (1) and (2). As before, we stop trying to add additional instructions to (2)'s group. But now since (1) is already in a group, and we now know it shouldn't be reordered, we release the (3,1) group. So the bottom-up ordering isn't sufficient as we still have to check for the dependence.

It's probably worth adding a comment that clarifies why the analysis is bottom-up.

5254–5257 ↗(On Diff #60069)

I agree. I think we should also swap A and B in this function to match what we did to the canReorderMemAccesses function. There, A precedes B, but here B precedes A. They should be consistent.

mssimpso added inline comments.Jun 15 2016, 10:03 AM
lib/Transforms/Vectorize/LoopVectorize.cpp
5135–5142 ↗(On Diff #60069)

I got the second example here slightly wrong, but the idea is the same. I was intending to describe something like:

A[i-1] = x  (1)
A[i-3] = y  (2)
A[i]   = z  (3)

Now, the dependence between (1) and (2) will cause the (3,1) group to be released, as I originally described. I'm adding test cases for both of these WAW examples.

5254–5257 ↗(On Diff #60069)

Actually, swapping A with B in this function will make the review very difficult. I think I'd rather swap A and B in canReorderMemAccesses to make things consistent. This will mean B precedes A both places. We can then swap them both in an NFC patch if we want A to precede B.

mssimpso updated this revision to Diff 60862.Jun 15 2016, 10:48 AM
mssimpso marked 6 inline comments as done.

Addressed Adam's comments.

  • Renamed canReorderMemAccesses and updated comments
  • Swapped A and B in canReorderMemAccesses to be consistent with the reset of the analysis (i.e., B precedes A). We can swap the entire analysis in a follow-on.
  • Updated iterators to be more descriptive (i.e., AI and BI).
  • Commented about the bottom-up ordering.
  • Added test cases for the two WAW examples mentioned in my last update.

Adam,

Do you have any additional comments for this patch? Thanks!

Matt.

Matt,

All agreed, I just have a few more suggestions to improve comments for this complex piece.

Let me know what you think.

Adam

lib/Transforms/Vectorize/LoopVectorize.cpp
5200–5207 ↗(On Diff #60862)

Wow, this is very subtle too. This and the previous case of why we need to remove elements from a group needs a high-level comment somewhere around the call to canReorder... See my comments on the particular lines.

5334–5343 ↗(On Diff #60862)

I don't think the comment is sufficient to cover all the subtleties here. I would prefer to say something like:

We can't have dependences between accesses in a group and other accesses that are located between the first and the last element of group.

Probably before the break statement we should also say that:

It's OK to have dependences between accesses in a group and other accesses before the first instruction we just can't extend the group beyond these.

I am also wondering if we need a picture to further visualize this, i.e. a case where there is an intervening access and one where falls outside the range of the group.

5351–5354 ↗(On Diff #60862)

OK.

test/Transforms/LoopVectorize/interleaved-accesses.ll
769–770 ↗(On Diff #60862)

... but exclude a[i] = x

Adam,

Thanks very much for the suggestions. They all look good to me! I will update the patch.

Matt.

mssimpso updated this revision to Diff 61708.Jun 23 2016, 12:05 PM

Sharpened comments according to Adam's feedback.

anemet accepted this revision.Jun 23 2016, 1:51 PM
anemet edited edge metadata.

Looks great to me! Thanks for the improvements and of course initially taking this on!

This revision is now accepted and ready to land.Jun 23 2016, 1:51 PM

Thanks Adam and Silviu for all the detailed feedback! Getting these dependences right can be tricky, so I appreciate the attention.

This revision was automatically updated to reflect the committed changes.