MaxSafeVectorWidthInBits must be updated according to the stride of the loop, otherwise it is possible that the value is not conservative enough. For example: for (int k = 0; k < len; k+=3) { a[k] = a[k+4]; a[k+2] = a[k+6]; } has MinDepDistBytes=24 and loop stride of 3. If we do not use the stride in the calculation, we incorrectly compute MaxSafeVectorWidthInBits to be 192 bits, when it really should be 192 / 3 = 64 bits.
Details
- Reviewers
reames fhahn Ayal ABataev nikolaypanchenko
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1994–1995 | Looks like variable name does not match its semantics anymore |
llvm/lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
1994–1995 | I'm not sure I agree. I think that the value was simply incorrectly computed previously. Can you please elaborate on what you mean? Do you have any suggestions on a better name? The getter for this variable is documented as The maximum number of bytes of a vector register we can vectorize the accesses safely with. The variable itself is documented as We can access this many bytes in parallel safely. I think that both of these statements hold under the changes made in this patch. |
llvm/test/Transforms/LoopVectorize/AArch64/max-vf-for-interleaved.ll | ||
---|---|---|
4 ↗ | (On Diff #522828) | This change does not work. I ran the lit test on the wrong file, thinking it was this file. So when it passed, I thought I had fixed the problem. |
The test case gives the C program:
struct pair { int x; int y; }; void max_vf(struct pair *restrict p) { for (int i = 0; i < 1000; i++) { p[i + 2].x = p[i].x p[i + 2].y = p[i].y } }
The IR is updated to match this IR so that the test case behaves as expected.
I'm not following why you think the existing computation of the MaxSafeDepDistBytes is wrong. Reading through the code, this appears to be simply an early out. We're doing an N^2 comparison, and if we've already found a distance of X, any later pair with distance > X can't be safe as part of the set. We could instead return the fact this pair has no dependence - it may not - but there's no point to this because there must be some other pair with an unsafe dependence which will prevent vectorization of the whole set.
That's honestly a weird optimization, but it does look to be only a compile time optimization. It does seem to screw up the dependency reporting, but that appears to only drive debugging output.
(Note, the code structure doesn't make this obvious, but MaxVF is computed using "Distance" given the previous return and min clause.)
The legality influencing bit should be in the computation of MaxVF. It does look to me like the computation of MaxVF hasn't been updated to match MinDistanceNeeded (i.e. doesn't special case last iteration), but that seems to be a missed optimization, not a miscompile.
Can you clarify whether you think this is a miscompile or a missed optimization? Your description seems to indicate miscompile, but from what I can tell, having an unnecessarily low MaxVF should just decrease the VF used, and thus maybe lead to missed optimization. I don't see how you get from that to miscompile.
The test case I have added shows how the MaxSafeDepDistBytes was incorrectly calculated prior to accounting for loop stride. The comment in the code and in the commit message explain it in more detail. It is possible to vectorize that loop with a vector loop stride that abides by the MaxSafeDepDistBytes. But if this value is incorrect coming from LAA, then it will lead to a miscompile.
llvm/lib/Analysis/LoopAccessAnalysis.cpp | ||
---|---|---|
2005 | Expand auto to actual type. |
Does this fix a miscompile in loop-vectorize (or elsewhere)?
AFAICT MaxSafeDepDistBytes is simply the distance in bytes between dependent accesses and it's not directly the number of elements/iterations that can be performed in parallel without conflict. Loop-vectorize uses getMaxSafeVectorWidthInBits to limit the max VF which is the number of elements that can be processed in parallel.
MaxSafeVectorWidthInBits value depends on MaxVFInBits, whose value depends on MaxVF, whose value depends on MaxSafeDepDistBytes. MaxSafeDepDistBytes is calculated incorrectly prior to the changes introduces in this patch as you can see by the test case diff in this patch. In order for us to avoid a miscompile that is elicited by the example in this patch, we should fix the problem of incorrectly calculating MaxSafeDepDistBytes.
But does LV miscompile the case? I think MaxVFInBits should be 2, and it refuses to vectorize with anything larger than VF = 2: https://llvm.godbolt.org/z/6KezshEdW
I added this print out and checked it with the test. Before this patch, we reported The maximum number of bits that are safe to operate on in parallel is 64, which is incorrect.
But does LV miscompile the case?
No, but I believe that the MaxSafeVectorWidthInBits is incorrect prior to this patch, and we just happen to not vectorize based on incorrect values.
Looking at. https://llvm.godbolt.org/z/hK8xnbM6G, LV 'vectorizes' with VF=2 (it scalarizes all loads & stores, but still reorders them), but not with VF=4 due to memory conflicts.
I think with respect to MaxSafeDepDistBytes it is still not clear to me that the current value is incorrect, as it is just the distance computed between 2 dependencies and in the test the distance between %arrayidx2 and %%arrayidx5 is 24 bytes AFAICT.
Thanks! The raw distance between the accesses is 24 bytes, the stride per iteration is 12, so we should be able to execute 2 iterations in parallel unless I am missing something. We are accessing i32 types, so in 2 iterations we would access 64 bits.
I think with respect to MaxSafeDepDistBytes it is still not clear to me that the current value is incorrect, as it is just the distance computed between 2 dependencies and in the test the distance between %arrayidx2 and %%arrayidx5 is 24 bytes AFAICT.
It is not just the distance computed between 2 dependencies. It is the number of bytes we can access in parallel safely, according to the docstring of MaxSafeDepDistBytes. I do not think that it would be safe to vectorize this access 24 bytes at a time. I only think it is safe to vectorize this access 8 bytes at a time.
We are accessing i32 types, so in 2 iterations we would access 64 bits.
8 bytes is 64 bits, so I think this change is correctly calculating what we need. Otherwise, 24 bytes is 192 bits, which I think is unsafe.
So I think the doc string should be updated in that case. For LV legality, MaxSafeVectorWidthInBits is used which is computed & used correctly AFACIT. I replaced LV's use of MaxSafeDepDistBytes in 25d34215bb80459dd328d6f8eb86c43684375d88 and I think there's no need to expose MaxSafeDepDistBytes outside of LAA, as users should use MaxSafeVectorWidthInBits WDYT?
llvm/test/Analysis/LoopAccessAnalysis/max_safe_dep_dist_non_unit_stride.ll | ||
---|---|---|
14 | This is now more pessimistic than necessary I think |
I think there's no need to expose MaxSafeDepDistBytes outside of LAA, as users should use MaxSafeVectorWidthInBits WDYT?
I agree with this statement.
I think that the name of MaxSafeDepDistBytes should then to change to something such as MinDepDistBytes which represents the smallest dependence distance in bytes in the loop. In my opinion MaxSafeDepDistBytes according to its name and the semantics of its docstring *is* wrong without this patch. We could make the changes you and I discuss above, and abandon this patch. WDYT?
Abandoning because https://reviews.llvm.org/D156158 clarification of semantics makes this a non-issue.
Looks like variable name does not match its semantics anymore