Details

Reviewers

reames
fhahn
Ayal
ABataev
nikolaypanchenko

Summary

MaxSafeVectorWidthInBits must be updated according to the stride
of the loop, otherwise it is possible that the value is
not conservative enough. For example:

for (int k = 0; k < len; k+=3) {
    a[k] = a[k+4];
    a[k+2] = a[k+6];
}
has MinDepDistBytes=24 and loop stride of 3. If we do not use the stride
in the calculation, we incorrectly compute MaxSafeVectorWidthInBits to
be 192 bits, when it really should be 192 / 3 = 64 bits.
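
A minimal sketch of why the example loop tolerates two iterations in parallel but not three or more. It emulates vectorization by a factor `vf` by performing, for each statement, all loads of a block before that statement's stores, and compares the result with the scalar loop. The helper names (`runScalar`, `runVectorized`, `isSafeVF`), the array length, and the initial contents are assumptions made for illustration; this is not LAA or LoopVectorize code.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Scalar reference: the loop from the summary, executed in program order.
std::vector<int> runScalar(std::vector<int> a, int len) {
  for (int k = 0; k < len; k += 3) {
    a[k] = a[k + 4];
    a[k + 2] = a[k + 6];
  }
  return a;
}

// Emulated vectorization (assumes vf >= 1): for each block of vf iterations,
// each statement loads all of its lanes first and then stores all of them.
std::vector<int> runVectorized(std::vector<int> a, int len, int vf) {
  std::vector<int> tmp(vf);
  int k = 0;
  // Full blocks: every lane k + 3*l must still satisfy the loop condition.
  for (; k + 3 * (vf - 1) < len; k += 3 * vf) {
    for (int l = 0; l < vf; ++l) tmp[l] = a[k + 3 * l + 4];
    for (int l = 0; l < vf; ++l) a[k + 3 * l] = tmp[l];
    for (int l = 0; l < vf; ++l) tmp[l] = a[k + 3 * l + 6];
    for (int l = 0; l < vf; ++l) a[k + 3 * l + 2] = tmp[l];
  }
  // Scalar remainder for the iterations that do not fill a block.
  for (; k < len; k += 3) {
    a[k] = a[k + 4];
    a[k + 2] = a[k + 6];
  }
  return a;
}

// True when the emulated vector loop produces the same memory state as the
// scalar loop for one fixed input (distinct values, so reordering is visible).
bool isSafeVF(int vf) {
  const int len = 48;            // illustrative trip range, not from the patch
  std::vector<int> a(len + 8);   // room for the a[k + 6] accesses
  std::iota(a.begin(), a.end(), 0);
  return runScalar(a, len) == runVectorized(a, len, vf);
}
```

Under these assumptions `isSafeVF(2)` holds while `isSafeVF(3)` and `isSafeVF(4)` do not: with three or more lanes, the store to `a[k + 6]` issued by a later lane lands before the first lane loads `a[k + 6]`. Two iterations of `int` accesses correspond to 64 bits.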

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

michaelmaitland created this revision. May 16 2023, 12:13 PM

Herald added a project: Restricted Project. May 16 2023, 12:14 PM

Herald added subscribers: StephenFan, hiraditya.

michaelmaitland requested review of this revision. May 16 2023, 12:14 PM

Herald added a project: Restricted Project. May 16 2023, 12:14 PM

Herald added a subscriber: llvm-commits.

ABataev added inline comments. May 16 2023, 1:42 PM

llvm/lib/Analysis/LoopAccessAnalysis.cpp
1994–1995	Looks like variable name does not match its semantics anymore

michaelmaitland added inline comments. May 16 2023, 1:48 PM

llvm/lib/Analysis/LoopAccessAnalysis.cpp
1994–1995	I'm not sure I agree. I think that the value was simply incorrectly computed previously. Can you please elaborate on what you mean? Do you have any suggestions on a better name? The getter for this variable is documented as `The maximum number of bytes of a vector register we can vectorize the accesses safely with`. The variable itself is documented as `We can access this many bytes in parallel safely`. I think that both of these statements hold under the changes made in this patch.

Harbormaster completed remote builds in B232395: Diff 522737. May 16 2023, 3:16 PM

Fix AArch64 target triple

Herald added a subscriber: pcwang-thead. May 16 2023, 4:37 PM

michaelmaitland added inline comments. May 16 2023, 5:00 PM

llvm/test/Transforms/LoopVectorize/AArch64/max-vf-for-interleaved.ll
4	(On Diff #522828)	This change does not work. I ran the lit test on the wrong file, thinking it was this file. So when it passed, I thought I had fixed the problem.

Harbormaster completed remote builds in B232458: Diff 522828. May 16 2023, 5:31 PM

The test case gives the C program:

struct pair {
  int x;
  int y;
};

void max_vf(struct pair *restrict p) {
  for (int i = 0; i < 1000; i++) {
    p[i + 2].x = p[i].x;
    p[i + 2].y = p[i].y;
  }
}

The IR is updated to match this IR so that the test case behaves as expected.

Harbormaster completed remote builds in B232477: Diff 522866. May 16 2023, 6:57 PM

I'm not following why you think the existing computation of the MaxSafeDepDistBytes is wrong. Reading through the code, this appears to be simply an early out. We're doing an N^2 comparison, and if we've already found a distance of X, any later pair with distance > X can't be safe as part of the set. We could instead return the fact this pair has no dependence - it may not - but there's no point to this because there must be some other pair with an unsafe dependence which will prevent vectorization of the whole set.

That's honestly a weird optimization, but it does look to be only a compile time optimization. It does seem to screw up the dependency reporting, but that appears to only drive debugging output.

(Note, the code structure doesn't make this obvious, but MaxVF is computed using "Distance" given the previous return and min clause.)

The legality influencing bit should be in the computation of MaxVF. It does look to me like the computation of MaxVF hasn't been updated to match MinDistanceNeeded (i.e. doesn't special case last iteration), but that seems to be a missed optimization, not a miscompile.

Can you clarify whether you think this is a miscompile or a missed optimization? Your description seems to indicate miscompile, but from what I can tell, having an unnecessarily low MaxVF should just decrease the VF used, and thus maybe lead to missed optimization. I don't see how you get from that to miscompile.

Account for loop induction variable stride instead of pointer stride.

In D150706#4354633, @reames wrote:

I'm not following why you think the existing computation of the MaxSafeDepDistBytes is wrong. Reading through the code, this appears to be simply an early out. We're doing an N^2 comparison, and if we've already found a distance of X, any later pair with distance > X can't be safe as part of the set. We could instead return the fact this pair has no dependence - it may not - but there's no point to this because there must be some other pair with an unsafe dependence which will prevent vectorization of the whole set.

That's honestly a weird optimization, but it does look to be only a compile time optimization. It does seem to screw up the dependency reporting, but that appears to only drive debugging output.

(Note, the code structure doesn't make this obvious, but MaxVF is computed using "Distance" given the previous return and min clause.)

The legality influencing bit should be in the computation of MaxVF. It does look to me like the computation of MaxVF hasn't been updated to match MinDistanceNeeded (i.e. doesn't special case last iteration), but that seems to be a missed optimization, not a miscompile.

Can you clarify whether you think this is a miscompile or a missed optimization? Your description seems to indicate miscompile, but from what I can tell, having an unnecessarily low MaxVF should just decrease the VF used, and thus maybe lead to missed optimization. I don't see how you get from that to miscompile.

The test case I have added shows how the MaxSafeDepDistBytes was incorrectly calculated prior to accounting for loop stride. The comment in the code and in the commit message explain it in more detail. It is possible to vectorize that loop with a vector loop stride that abides by the MaxSafeDepDistBytes. But if this value is incorrect coming from LAA, then it will lead to a miscompile.

Undo changes from first revision.

Harbormaster completed remote builds in B237018: Diff 528962. Jun 6 2023, 2:03 PM

ping

michaelmaitland added a parent revision: D154173: [LAA] Add test that shows MaxSafeDepDistBytes is incorrect. NFC. Jun 29 2023, 5:45 PM

Rebase on top of D154173

Harbormaster completed remote builds in B242301: Diff 536075. Jun 29 2023, 10:21 PM

Rebase on top of D154173

Harbormaster completed remote builds in B242415: Diff 536236. Jun 30 2023, 8:19 AM

Rebase once more. This patch is ready for review.

Harbormaster completed remote builds in B244171: Diff 538682. Jul 10 2023, 10:28 AM

Ping

ABataev added inline comments. Jul 17 2023, 9:51 AM

llvm/lib/Analysis/LoopAccessAnalysis.cpp
2005	Expand `auto` to actual type.

Does this fix a miscompile in loop-vectorize (or elsewhere)?

AFAICT MaxSafeDepDistBytes is simply the distance in bytes between dependent accesses and it's not directly the number of elements/iterations that can be performed in parallel without conflict. Loop-vectorize uses getMaxSafeVectorWidthInBits to limit the max VF which is the number of elements that can be processed in parallel.

It might be helpful to also print MaxSafeVectorWidthInBits

In D150706#4506956, @fhahn wrote:

Does this fix a miscompile in loop-vectorize (or elsewhere)?

AFAICT MaxSafeDepDistBytes is simply the distance in bytes between dependent accesses and it's not directly the number of elements/iterations that can be performed in parallel without conflict. Loop-vectorize uses getMaxSafeVectorWidthInBits to limit the max VF which is the number of elements that can be processed in parallel.

The value of MaxSafeVectorWidthInBits depends on MaxVFInBits, whose value depends on MaxVF, whose value depends on MaxSafeDepDistBytes. MaxSafeDepDistBytes is calculated incorrectly prior to the changes introduced in this patch, as you can see from the test case diff in this patch. To avoid a miscompile that is elicited by the example in this patch, we should fix the incorrect calculation of MaxSafeDepDistBytes.

In D150706#4507037, @michaelmaitland wrote:

The value of MaxSafeVectorWidthInBits depends on MaxVFInBits, whose value depends on MaxVF, whose value depends on MaxSafeDepDistBytes. MaxSafeDepDistBytes is calculated incorrectly prior to the changes introduced in this patch, as you can see from the test case diff in this patch. To avoid a miscompile that is elicited by the example in this patch, we should fix the incorrect calculation of MaxSafeDepDistBytes.

But does LV miscompile the case? I think MaxVFInBits should be 2, and it refuses to vectorize with anything larger than VF = 2: https://llvm.godbolt.org/z/6KezshEdW

In D150706#4506975, @fhahn wrote:

It might be helpful to also print MaxSafeVectorWidthInBits

I added this printout and checked it with the test. Before this patch, we reported "The maximum number of bits that are safe to operate on in parallel is 64", which is incorrect.

Add printout and expand type

But does LV miscompile the case?

No, but I believe that the MaxSafeVectorWidthInBits is incorrect prior to this patch, and we just happen to not vectorize based on incorrect values.

Rebase.

Fix tests that are impacted by this change.

Harbormaster completed remote builds in B245955: Diff 541163. Jul 17 2023, 5:01 PM

In D150706#4507201, @michaelmaitland wrote:

But does LV miscompile the case?

No, but I believe that the MaxSafeVectorWidthInBits is incorrect prior to this patch, and we just happen to not vectorize based on incorrect values.

Looking at https://llvm.godbolt.org/z/hK8xnbM6G, LV 'vectorizes' with VF=2 (it scalarizes all loads & stores, but still reorders them), but not with VF=4 due to memory conflicts.

With respect to MaxSafeDepDistBytes, it is still not clear to me that the current value is incorrect, as it is just the distance computed between 2 dependencies, and in the test the distance between %arrayidx2 and %arrayidx5 is 24 bytes AFAICT.

In D150706#4507183, @michaelmaitland wrote:

In D150706#4506975, @fhahn wrote:

It might be helpful to also print MaxSafeVectorWidthInBits

I added this printout and checked it with the test. Before this patch, we reported "The maximum number of bits that are safe to operate on in parallel is 64", which is incorrect.

Thanks! The raw distance between the accesses is 24 bytes, the stride per iteration is 12, so we should be able to execute 2 iterations in parallel unless I am missing something. We are accessing i32 types, so in 2 iterations we would access 64 bits.
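
The arithmetic in this comment can be written out as a small sketch. The function names `maxVF` and `maxVFInBits` are hypothetical; the formulas follow the `MaxVF = MaxSafeDepDistBytes / (TypeByteSize * Stride)` line in the diff and the dependency chain described earlier in the thread, with the numbers assumed from the test case (24-byte distance, 4-byte `i32`, stride of 3 elements).

```cpp
#include <cassert>
#include <cstdint>

// Number of iterations that fit inside the dependence distance.
// TypeByteSize * Stride is the number of bytes the pointer advances per
// iteration (4 * 3 = 12 in the test case).
constexpr std::uint64_t maxVF(std::uint64_t depDistBytes,
                              std::uint64_t typeByteSize,
                              std::uint64_t stride) {
  return depDistBytes / (typeByteSize * stride);
}

// The same limit expressed as a vector width in bits: one element of
// typeByteSize bytes per iteration.
constexpr std::uint64_t maxVFInBits(std::uint64_t depDistBytes,
                                    std::uint64_t typeByteSize,
                                    std::uint64_t stride) {
  return maxVF(depDistBytes, typeByteSize, stride) * typeByteSize * 8;
}

// Unpatched value from the test: 24 bytes between the dependent accesses.
static_assert(maxVF(24, 4, 3) == 2, "two iterations fit in 24 bytes");
static_assert(maxVFInBits(24, 4, 3) == 64, "2 x i32 = 64 bits");
```

With the raw 24-byte distance, the stride is already accounted for once in the `MaxVF` division, which is how the 64-bit figure arises.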

With respect to MaxSafeDepDistBytes, it is still not clear to me that the current value is incorrect, as it is just the distance computed between 2 dependencies, and in the test the distance between %arrayidx2 and %arrayidx5 is 24 bytes AFAICT.

It is not just the distance computed between 2 dependencies. It is the number of bytes we can access in parallel safely, according to the docstring of MaxSafeDepDistBytes. I do not think that it would be safe to vectorize this access 24 bytes at a time. I only think it is safe to vectorize this access 8 bytes at a time.

We are accessing i32 types, so in 2 iterations we would access 64 bits.

8 bytes is 64 bits, so I think this change is correctly calculating what we need. Otherwise, 24 bytes is 192 bits, which I think is unsafe.

In D150706#4515123, @michaelmaitland wrote:

With respect to MaxSafeDepDistBytes, it is still not clear to me that the current value is incorrect, as it is just the distance computed between 2 dependencies, and in the test the distance between %arrayidx2 and %arrayidx5 is 24 bytes AFAICT.

It is not just the distance computed between 2 dependencies. It is the number of bytes we can access in parallel safely, according to the docstring of MaxSafeDepDistBytes. I do not think that it would be safe to vectorize this access 24 bytes at a time. I only think it is safe to vectorize this access 8 bytes at a time.

So I think the docstring should be updated in that case. For LV legality, MaxSafeVectorWidthInBits is used, which is computed & used correctly AFAICT. I replaced LV's use of MaxSafeDepDistBytes in 25d34215bb80459dd328d6f8eb86c43684375d88 and I think there's no need to expose MaxSafeDepDistBytes outside of LAA, as users should use MaxSafeVectorWidthInBits. WDYT?

llvm/test/Analysis/LoopAccessAnalysis/max_safe_dep_dist_non_unit_stride.ll
14	This is now more pessimistic than necessary I think
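
A sketch of where the pessimistic value appears to come from, assuming the numbers of the test case (24-byte distance, loop induction variable step of 3, 4-byte `i32`, pointer stride of 3 elements). The helper names are hypothetical; the two divisions mirror the patched `Distance / LoopIVStride` assignment and the unchanged `MaxVF` computation shown in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Patched computation: the distance is first divided by the step of the
// loop induction variable.
constexpr std::uint64_t patchedDepDistBytes(std::uint64_t distance,
                                            std::uint64_t loopIVStride) {
  return distance / loopIVStride;
}

// Unchanged downstream computation: divide again by the bytes advanced per
// iteration, then convert to bits.
constexpr std::uint64_t maxVF(std::uint64_t depDistBytes,
                              std::uint64_t typeByteSize,
                              std::uint64_t stride) {
  return depDistBytes / (typeByteSize * stride);
}

constexpr std::uint64_t maxVFInBits(std::uint64_t depDistBytes,
                                    std::uint64_t typeByteSize,
                                    std::uint64_t stride) {
  return maxVF(depDistBytes, typeByteSize, stride) * typeByteSize * 8;
}

static_assert(patchedDepDistBytes(24, 3) == 8, "24 / 3 = 8 bytes");
static_assert(maxVF(8, 4, 3) == 0, "8 / 12 rounds down to 0 iterations");
static_assert(maxVFInBits(8, 4, 3) == 0, "reported width becomes 0 bits");
```

Because the stride is divided out twice, the reported width drops from 64 bits to 0, which matches the new CHECK line in the test.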

I think there's no need to expose MaxSafeDepDistBytes outside of LAA, as users should use MaxSafeVectorWidthInBits. WDYT?

I agree with this statement.

I think that the name of MaxSafeDepDistBytes should then change to something such as MinDepDistBytes, which represents the smallest dependence distance in bytes in the loop. In my opinion, MaxSafeDepDistBytes, according to its name and the semantics of its docstring, *is* wrong without this patch. We could make the changes you and I discuss above, and abandon this patch. WDYT?

michaelmaitland mentioned this in D156034: [LAA] Make MaxSafeDepDistBytes private in LoopAccessAnalysis. Jul 22 2023, 2:06 PM

michaelmaitland mentioned this in D156165: [LAA] MaxSafeVectorWidthBits depends on changes to MinDepDist. Aug 1 2023, 9:11 AM

michaelmaitland added a parent revision: D156158: [LAA] Rename and fix semantics of MaxSafeDepDistBytes to MinDepDistBytes. Aug 1 2023, 9:16 AM

michaelmaitland edited the summary of this revision. Aug 1 2023, 9:20 AM

michaelmaitland planned changes to this revision. Aug 1 2023, 9:31 AM

Abandoning because https://reviews.llvm.org/D156158 clarification of semantics makes this a non-issue.

Diff 541152

llvm/lib/Analysis/LoopAccessAnalysis.cpp

@@ MemoryDepChecker::isDependent(const MemAccessInfo &A, unsigned AIdx,
   // not handle different types.
   // E.g. Assume one char is 1 byte in memory and one int is 4 bytes.
   //      void foo (int *A, char *B) {
   //        for (unsigned i = 0; i < 1024; i++) {
   //          A[i+2] = A[i] + 1;
   //          B[i+2] = B[i] + 1;
   //        }
   //      }
   //
   // This case is currently unsafe according to the max safe distance. If we
   // analyze the two accesses on array B, the max safe dependence distance
   // is 2. Then we analyze the accesses on array A, the minimum distance needed
   // is 8, which is less than 2 and forbidden vectorization, But actually
   // both A and B could be vectorized by 2 iterations.
-  MaxSafeDepDistBytes =
-      std::min(static_cast<uint64_t>(Distance), MaxSafeDepDistBytes);
+  //
+  // Distance must be reduced by a factor of the stride of the loop induction
+  // variable, otherwise it is possible that MaxSafeDepDistBytes is too
+  // large. For example,
+  //   for (int k = 0; k < len; k+=3) {
+  //     a[k] = a[k + 4];
+  //     a[k+2] = a[k+6];
+  //   }
+  // without accounting for loop stride has MaxSafeDepDist=24 when it must be
+  // 8.
+  std::optional<Loop::LoopBounds> Bounds = InnermostLoop->getBounds(SE);
+  if (!Bounds) {
+    LLVM_DEBUG(dbgs() << "LAA: Could not determine bounds of loop induction "
+                         "variable, so the MaxSafeDepDistBytes is unknown");
+    MaxSafeDepDistBytes = 0;
+    return Dependence::Unknown;
+  }
+  const SCEV *StepVal = SE.getSCEV(Bounds->getStepValue());
+  const SCEVConstant *StepValC = dyn_cast<SCEVConstant>(StepVal);
+  if (!StepValC) {
+    LLVM_DEBUG(dbgs() << "LAA: Could not determine step value of loop induction "
+                         "variable, so the MaxSafeDepDistBytes is unknown");
+    MaxSafeDepDistBytes = 0;
+    return Dependence::Unknown;
+  }
+
+  const APInt &LoopIVStrideAP = StepValC->getAPInt().abs();
+  uint64_t LoopIVStride = LoopIVStrideAP.getZExtValue();
+
+  MaxSafeDepDistBytes = std::min(static_cast<uint64_t>(Distance / LoopIVStride),
+                                 MaxSafeDepDistBytes);
 
   bool IsTrueDataDependence = (!AIsWrite && BIsWrite);
   if (IsTrueDataDependence && EnableForwardingConflictDetection &&
       couldPreventStoreLoadForward(Distance, TypeByteSize))
     return Dependence::BackwardVectorizableButPreventsForwarding;
 
   uint64_t MaxVF = MaxSafeDepDistBytes / (TypeByteSize * Stride);
   LLVM_DEBUG(dbgs() << "LAA: Positive distance " << Val.getSExtValue()

@@ if (CanVecMem) {
     if (MaxSafeDepDistBytes != -1ULL)
       OS << " with a maximum dependence distance of " << MaxSafeDepDistBytes
          << " bytes";
     if (PtrRtChecking->Need)
       OS << " with run-time checks";
     OS << "\n";
   }
 
+  if (DepChecker->getMaxSafeVectorWidthInBits() != -1ULL)
+    OS.indent(Depth) << "The maximum number of bits that are safe to operate "
+                        "on in parallel is "
+                     << DepChecker->getMaxSafeVectorWidthInBits() << "\n";
+
   if (HasConvergentOp)
     OS.indent(Depth) << "Has convergent operation in loop\n";
 
   if (Report)
     OS.indent(Depth) << "Report: " << Report->getMsg() << "\n";
 
   if (auto *Dependences = DepChecker->getDependences()) {
     OS.indent(Depth) << "Dependences:\n";

llvm/test/Analysis/LoopAccessAnalysis/max_safe_dep_dist_non_unit_stride.ll

 ; RUN: opt -S -disable-output -passes='print<access-info>' < %s 2>&1 | FileCheck %s
 
 ; Generated from following C program:
 ; void foo(int len, int *a) {
 ;   for (int k = 0; k < len; k+=3) {
 ;     a[k] = a[k + 4];
 ;     a[k+2] = a[k+6];
 ;   }
 ; }
 define void @foo(i64 %len, ptr %a) {
 ; CHECK-LABEL: Loop access info in function 'foo':
 ; CHECK-NEXT: loop:
-; CHECK-NEXT: Memory dependences are safe with a maximum dependence distance of 24 bytes
+; CHECK-NEXT: Memory dependences are safe with a maximum dependence distance of 8 bytes
+; CHECK-NEXT: The maximum number of bits that are safe to operate on in parallel is 0
 ; CHECK-NEXT: Dependences:
 ; CHECK-NEXT: BackwardVectorizable:
 ; CHECK-NEXT: store i32 %0, ptr %arrayidx2, align 4 ->
 ; CHECK-NEXT: %1 = load i32, ptr %arrayidx5, align 4
 ; CHECK-EMPTY:
 ; CHECK-NEXT: Run-time memory checks:
 ; CHECK-NEXT: Grouped accesses:
 ; CHECK-EMPTY:

This is an archive of the discontinued LLVM Phabricator instance.

[LAA] Update MaxSafeDepDistBytes when non-unit stride
Abandoned · Public
