This is an archive of the discontinued LLVM Phabricator instance.

[LAA] Update MaxSafeDepDistBytes when non-unit stride
AbandonedPublic

Authored by michaelmaitland on May 16 2023, 12:13 PM.

Details

Summary
MaxSafeVectorWidthInBits must be updated according to the stride
of the loop, otherwise it is possible that the value is
not conservative enough. For example:

for (int k = 0; k < len; k+=3) {
    a[k] = a[k+4];
      a[k+2] = a[k+6];
}
has MinDepDistBytes=24 and loop stride of 3. If we do not use the stride
in the calculation, we incorrectly compute MaxSafeVectorWidthInBits to
be 192 bits, when it really should be 192 / 3 = 64 bits.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptMay 16 2023, 12:14 PM
michaelmaitland requested review of this revision.May 16 2023, 12:14 PM
Herald added a project: Restricted Project. · View Herald TranscriptMay 16 2023, 12:14 PM
ABataev added inline comments.May 16 2023, 1:42 PM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
2001–2002

Looks like variable name does not match its semantics anymore

llvm/lib/Analysis/LoopAccessAnalysis.cpp
2001–2002

I'm not sure I agree. I think that the value was simply incorrectly computed previously. Can you please elaborate on what you mean? Do you have any suggestions on a better name?

The getter for this variable is documented as The maximum number of bytes of a vector register we can vectorize the accesses safely with. The variable itself is documented as We can access this many bytes in parallel safely. I think that both of these statements hold under the changes made in this patch.

Fix AArch64 target triple

llvm/test/Transforms/LoopVectorize/AArch64/max-vf-for-interleaved.ll
4 ↗(On Diff #522828)

This change does not work. I ran the lit test on the wrong file, thinking it was this file. So when it passed, I thought I had fixed the problem.

The test case gives the C program:

struct pair {
  int x;
  int y;
};

void max_vf(struct pair *restrict p) {
  for (int i = 0; i < 1000; i++) {
    p[i + 2].x = p[i].x
    p[i + 2].y = p[i].y
  }
}

The IR is updated to match this IR so that the test case behaves as expected.

I'm not following why you think the existing computation of the MaxSafeDepDistBytes is wrong. Reading through the code, this appears to be simply an early out. We're doing an N^2 comparison, and if we've already found a distance of X, any later pair with distance > X can't be safe as part of the set. We could instead return the fact this pair has no dependence - it may not - but there's no point to this because there must be some other pair with an unsafe dependence which will prevent vectorization of the whole set.

That's honestly a weird optimization, but it does look to be only a compile time optimization. It does seem to screw up the dependency reporting, but that appears to only drive debugging output.

(Note, the code structure doesn't make this obvious, but MaxVF is computed using "Distance" given the previous return and min clause.)

The legality influencing bit should be in the computation of MaxVF. It does look to me like the computation of MaxVF hasn't been updated to match MinDistanceNeeded (i.e. doesn't special case last iteration), but that seems to be a missed optimization, not a miscompile.

Can you clarify whether you think this is a miscompile or a missed optimization? Your description seems to indicate miscompile, but from what I can tell, having an unnecessarily low MaxVF should just decrease the VF used, and thus maybe lead to missed optimization. I don't see how you get from that to miscompile.

Account for loop induction variable stride instead of pointer stride.

I'm not following why you think the existing computation of the MaxSafeDepDistBytes is wrong. Reading through the code, this appears to be simply an early out. We're doing an N^2 comparison, and if we've already found a distance of X, any later pair with distance > X can't be safe as part of the set. We could instead return the fact this pair has no dependence - it may not - but there's no point to this because there must be some other pair with an unsafe dependence which will prevent vectorization of the whole set.

That's honestly a weird optimization, but it does look to be only a compile time optimization. It does seem to screw up the dependency reporting, but that appears to only drive debugging output.

(Note, the code structure doesn't make this obvious, but MaxVF is computed using "Distance" given the previous return and min clause.)

The legality influencing bit should be in the computation of MaxVF. It does look to me like the computation of MaxVF hasn't been updated to match MinDistanceNeeded (i.e. doesn't special case last iteration), but that seems to be a missed optimization, not a miscompile.

Can you clarify whether you think this is a miscompile or a missed optimization? Your description seems to indicate miscompile, but from what I can tell, having an unnecessarily low MaxVF should just decrease the VF used, and thus maybe lead to missed optimization. I don't see how you get from that to miscompile.

The test case I have added shows how the MaxSafeDepDistBytes was incorrectly calculated prior to accounting for loop stride. The comment in the code and in the commit message explain it in more detail. It is possible to vectorize that loop with a vector loop stride that abides by the MaxSafeDepDistBytes. But if this value is incorrect coming from LAA, then it will lead to a miscompile.

Undo changes from first revision.

Rebase once more. This patch is ready for review.

ABataev added inline comments.Jul 17 2023, 9:51 AM
llvm/lib/Analysis/LoopAccessAnalysis.cpp
2012

Expand auto to actual type.

Does this fix a miscompile in loop-vectorize (or elsewhere)?

AFAICT MaxSafeDepDistBytes is simply the distance in bytes between dependent accesses and it's not directly the number of elements/iterations that can be performed in parallel without conflict. Loop-vectorize uses getMaxSafeVectorWidthInBits to limit the max VF which is the number of elements that can be processed in parallel.

It might be helpful to also print MaxSafeVectorWidthInBits

Does this fix a miscompile in loop-vectorize (or elsewhere)?

AFAICT MaxSafeDepDistBytes is simply the distance in bytes between dependent accesses and it's not directly the number of elements/iterations that can be performed in parallel without conflict. Loop-vectorize uses getMaxSafeVectorWidthInBits to limit the max VF which is the number of elements that can be processed in parallel.

MaxSafeVectorWidthInBits value depends on MaxVFInBits, whose value depends on MaxVF, whose value depends on MaxSafeDepDistBytes. MaxSafeDepDistBytes is calculated incorrectly prior to the changes introduces in this patch as you can see by the test case diff in this patch. In order for us to avoid a miscompile that is elicited by the example in this patch, we should fix the problem of incorrectly calculating MaxSafeDepDistBytes.

MaxSafeVectorWidthInBits value depends on MaxVFInBits, whose value depends on MaxVF, whose value depends on MaxSafeDepDistBytes. MaxSafeDepDistBytes is calculated incorrectly prior to the changes introduces in this patch as you can see by the test case diff in this patch. In order for us to avoid a miscompile that is elicited by the example in this patch, we should fix the problem of incorrectly calculating MaxSafeDepDistBytes.

But does LV miscompile the case? I think MaxVFInBits should be 2, and it refuses to vectorize with anything larger than VF = 2: https://llvm.godbolt.org/z/6KezshEdW

michaelmaitland marked an inline comment as done.Jul 17 2023, 11:06 AM

It might be helpful to also print MaxSafeVectorWidthInBits

I added this print out and checked it with the test. Before this patch, we reported The maximum number of bits that are safe to operate on in parallel is 64, which is incorrect.

Add printout and expand type

But does LV miscompile the case?

No, but I believe that the MaxSafeVectorWidthInBits is incorrect prior to this patch, and we just happen to not vectorize based on incorrect values.

Fix tests that are impacted by this change.

But does LV miscompile the case?

No, but I believe that the MaxSafeVectorWidthInBits is incorrect prior to this patch, and we just happen to not vectorize based on incorrect values.

Looking at. https://llvm.godbolt.org/z/hK8xnbM6G, LV 'vectorizes' with VF=2 (it scalarizes all loads & stores, but still reorders them), but not with VF=4 due to memory conflicts.

I think with respect to MaxSafeDepDistBytes it is still not clear to me that the current value is incorrect, as it is just the distance computed between 2 dependencies and in the test the distance between %arrayidx2 and %%arrayidx5 is 24 bytes AFAICT.

It might be helpful to also print MaxSafeVectorWidthInBits

I added this print out and checked it with the test. Before this patch, we reported The maximum number of bits that are safe to operate on in parallel is 64, which is incorrect.

Thanks! The raw distance between the accesses is 24 bytes, the stride per iteration is 12, so we should be able to execute 2 iterations in parallel unless I am missing something. We are accessing i32 types, so in 2 iterations we would access 64 bits.

I think with respect to MaxSafeDepDistBytes it is still not clear to me that the current value is incorrect, as it is just the distance computed between 2 dependencies and in the test the distance between %arrayidx2 and %%arrayidx5 is 24 bytes AFAICT.

It is not just the distance computed between 2 dependencies. It is the number of bytes we can access in parallel safely, according to the docstring of MaxSafeDepDistBytes. I do not think that it would be safe to vectorize this access 24 bytes at a time. I only think it is safe to vectorize this access 8 bytes at a time.

We are accessing i32 types, so in 2 iterations we would access 64 bits.

8 bytes is 64 bits, so I think this change is correctly calculating what we need. Otherwise, 24 bytes is 192 bits, which I think is unsafe.

fhahn added a comment.Jul 21 2023, 1:08 PM

I think with respect to MaxSafeDepDistBytes it is still not clear to me that the current value is incorrect, as it is just the distance computed between 2 dependencies and in the test the distance between %arrayidx2 and %%arrayidx5 is 24 bytes AFAICT.

It is not just the distance computed between 2 dependencies. It is the number of bytes we can access in parallel safely, according to the docstring of MaxSafeDepDistBytes. I do not think that it would be safe to vectorize this access 24 bytes at a time. I only think it is safe to vectorize this access 8 bytes at a time.

So I think the doc string should be updated in that case. For LV legality, MaxSafeVectorWidthInBits is used which is computed & used correctly AFACIT. I replaced LV's use of MaxSafeDepDistBytes in 25d34215bb80459dd328d6f8eb86c43684375d88 and I think there's no need to expose MaxSafeDepDistBytes outside of LAA, as users should use MaxSafeVectorWidthInBits WDYT?

llvm/test/Analysis/LoopAccessAnalysis/max_safe_dep_dist_non_unit_stride.ll
14

This is now more pessimistic than necessary I think

I think there's no need to expose MaxSafeDepDistBytes outside of LAA, as users should use MaxSafeVectorWidthInBits WDYT?

I agree with this statement.

I think that the name of MaxSafeDepDistBytes should then to change to something such as MinDepDistBytes which represents the smallest dependence distance in bytes in the loop. In my opinion MaxSafeDepDistBytes according to its name and the semantics of its docstring *is* wrong without this patch. We could make the changes you and I discuss above, and abandon this patch. WDYT?

michaelmaitland edited the summary of this revision. (Show Details)Aug 1 2023, 9:20 AM
michaelmaitland planned changes to this revision.Aug 1 2023, 9:31 AM
michaelmaitland abandoned this revision.Aug 1 2023, 1:00 PM

Abandoning because https://reviews.llvm.org/D156158 clarification of semantics makes this a non-issue.