This is an archive of the discontinued LLVM Phabricator instance.

[LV] Fix PR26600: avoid out of bounds loads for interleaved access vectorization
ClosedPublic

Authored by sbaranga on Feb 17 2016, 5:12 AM.

Details

Reviewers

rengolin
hfinkel

Commits

rGad1dafb2c3ba: [LV] Fix PR26600: avoid out of bounds loads for interleaved access vectorization
rL261331: [LV] Fix PR26600: avoid out of bounds loads for interleaved access vectorization

Summary

If we don't have the first and last access of an interleaved load group,
the first and last wide load in the loop can do an out of bounds
access. Even though we discard results from speculative loads,
this can cause problems, since it can technically generate page faults
(or worse).

We now discard interleaved load groups that don't have the first and
last load in the group.
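
To make the failure mode in the summary concrete, here is a minimal C++ model. It is not LLVM code: `LoadGroupModel` and both helper functions are hypothetical names, and the model assumes each wide load covers every member slot of the iterations it handles, starting at member 0. Under that assumption, a missing first member means the first wide load reads elements before the first scalar access, and a missing last member means the last wide load reads elements after the last scalar access.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model (not LLVM code) of one interleaved load group.
struct LoadGroupModel {
  std::size_t Factor;      // elements spanned by one scalar iteration
  std::size_t FirstMember; // lowest member index the scalar loop reads
  std::size_t LastMember;  // highest member index the scalar loop reads
};

// Elements the first wide load touches before the first scalar access,
// assuming the wide load starts at member 0.
std::size_t elementsBeforeFirstAccess(const LoadGroupModel &G) {
  assert(G.FirstMember <= G.LastMember && G.LastMember < G.Factor);
  return G.FirstMember;
}

// Elements the last wide load touches after the last scalar access,
// assuming the wide load ends at member Factor - 1.
std::size_t elementsAfterLastAccess(const LoadGroupModel &G) {
  assert(G.FirstMember <= G.LastMember && G.LastMember < G.Factor);
  return G.Factor - 1 - G.LastMember;
}
```

For a loop that reads only even elements (factor 2, member 0 only), the model reports one element read past the last scalar access; for a loop that reads only the second field of a two-field struct, it reports one element read before the first scalar access.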

Diff Detail

Repository: rL LLVM

Event Timeline

sbaranga updated this revision to Diff 48174. Feb 17 2016, 5:12 AM

sbaranga retitled this revision to [LV] Fix PR26600: avoid out of bounds loads for interleaved access vectorization.

sbaranga updated this object.

sbaranga added a reviewer: hfinkel.

sbaranga added subscribers: anemet, mzolotukhin, llvm-commits.

How complicated would it be, instead of bailing out when we have a group without the last member, to peel off the last vector iteration instead (i.e. jump to the scalar tail loop one vector-loop iteration "early")?

It seems like that would be a better solution (although, if you agree, but it seems too complicated to implement for the release branch, I'm fine with taking this (and pulling it into the release branch), and then implementing the better solution in trunk only).
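
The peeling idea above can be sketched as an iteration-count split. This is only an illustration under stated assumptions: `LoopSplit`, `splitIterations`, and the `NeedsScalarTail` flag are hypothetical names, not the vectorizer's API, and the sketch covers only the case where the group is missing its last member. When the trip count is an exact multiple of the vectorization factor, the vector loop would normally run every iteration; forcing the final `VF` iterations into the scalar loop keeps the last wide load away from the end of the accessed range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: how many vector-loop iterations run, and how many
// original (scalar) iterations are left for the tail loop.
struct LoopSplit {
  std::uint64_t VectorIters;
  std::uint64_t ScalarIters;
};

// Split TripCount original iterations between a vector loop of width VF and a
// scalar tail. If NeedsScalarTail is set, the tail is never empty when the
// vector loop runs at all.
LoopSplit splitIterations(std::uint64_t TripCount, std::uint64_t VF,
                          bool NeedsScalarTail) {
  assert(VF > 0);
  std::uint64_t Remainder = TripCount % VF;
  // Jump to the scalar tail one vector iteration "early".
  if (NeedsScalarTail && Remainder == 0 && TripCount >= VF)
    Remainder = VF;
  return {(TripCount - Remainder) / VF, Remainder};
}
```

With a trip count of 1000 and `VF` of 4, the split is 250 vector iterations and no tail by default, and 249 vector iterations plus 4 scalar iterations when the tail is required.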

In D17332#356693, @hfinkel wrote:

How complicated would it be, instead of bailing out when we have a group without the last member, to peel off the last vector iteration instead (i.e. jump to the scalar tail loop one vector-loop iteration "early")?

It seems like that would be a better solution (although, if you agree, but it seems too complicated to implement for the release branch, I'm fine with taking this (and pulling it into the release branch), and then implementing the better solution in trunk only).

That would work (and I agree it would be a better solution), but I don't have a good estimate of how much work that would be to implement (there are also probably a lot of edge cases there).

I can also think of another way to fix this in some cases which wouldn't involve the scalar remainder. For example, in the following code (similar to the example from the PR):

struct A {
  int a;
  int b;
};

for (int i = 0; i < 1000; ++i) {
  a[i].b *= 2;
}

instead of doing the wide load from a[i].b we could do it from a[i].a and move the problem to knowing that the start is ok to load from (which we do know is safe), and it avoids the need for a scalar remainder. This might also work well if the loop was previously modified by LoopRotate (if it strips the first iteration, we can use the fact that we have dominating loads to the start of the memory location to deduce that the transformation would be safe).

Would it be ok to commit this as is for now?

Thanks,
Silviu
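
The alternative described in the comment above (starting the wide access at a[i].a rather than a[i].b) can be modeled in plain C++. This is a sketch, not what the vectorizer emits: `doubleB` and the fixed `VF` are assumptions made for the illustration, and the array here is a `std::vector<A>`, so a[0].a is known to be part of the allocation, which is exactly the property that would have to be proven in the general case.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct A {
  int a;
  int b;
};

constexpr std::size_t VF = 4; // assumed vectorization factor

// Doubles every b field. The "wide load" copies VF whole structs starting at
// the a field, so it never reads outside [&Arr[I], &Arr[I + VF]).
void doubleB(std::vector<A> &Arr) {
  static_assert(sizeof(A) == 2 * sizeof(int), "model assumes no padding in A");
  std::size_t I = 0;
  for (; I + VF <= Arr.size(); I += VF) {
    std::array<int, 2 * VF> Wide;
    std::memcpy(Wide.data(), &Arr[I], Wide.size() * sizeof(int));
    // Odd lanes hold the b fields; only those are written back.
    for (std::size_t Lane = 0; Lane < VF; ++Lane)
      Arr[I + Lane].b = Wide[2 * Lane + 1] * 2;
  }
  // Leftover iterations when the size is not a multiple of VF.
  for (; I < Arr.size(); ++I)
    Arr[I].b *= 2;
}
```

When the element count is a multiple of `VF`, as with the 1000 iterations in the example, the second loop does no work, which matches the point that no scalar remainder is needed for safety.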

Hal,

We need this fix in for 3.8.0, and I think we can discuss how we deal with the problem later. For now, landing this patch on trunk and back-porting to 3.8 is a priority.

Silviu,

The patch looks good to me as it is. Let's work on the improvement later. Can you open a bug with the description, an example and the expected outcome, please?

Thanks!
--renato

This revision is now accepted and ready to land. Feb 19 2016, 7:30 AM

sbaranga closed this revision. Feb 19 2016, 7:50 AM

Closed by commit rL261331: [LV] Fix PR26600: avoid out of bounds loads for interleaved access vectorization (authored by sbaranga). Feb 19 2016, 7:50 AM

This revision was automatically updated to reflect the committed changes.

Thanks Renato, I've committed this in r261331.

-Silviu

In D17332#356879, @sbaranga wrote:
In D17332#356693, @hfinkel wrote:

How complicated would it be, instead of bailing out when we have a group without the last member, to peel off the last vector iteration instead (i.e. jump to the scalar tail loop one vector-loop iteration "early")?

It seems like that would be a better solution (although, if you agree, but it seems too complicated to implement for the release branch, I'm fine with taking this (and pulling it into the release branch), and then implementing the better solution in trunk only).

That would work (and I agree it would be a better solution), but I don't have a good estimate of how much work that would be to implement (there are also probably a lot of edge cases there).

I can also think of another way to fix this in some cases which wouldn't involve the scalar remainder. For example, in the following code (similar to the example from the PR):
struct A {
  int a;
  int b;
};

for (int i = 0; i < 1000; ++i) {
  a[i].b *= 2;
}
instead of doing the wide load from a[i].b we could do it from a[i].a and move the problem to knowing that the start is ok to load from (which we do know is safe), and it avoids the need for a scalar remainder. This might also work well if the loop was previously modified by LoopRotate (if it strips the first iteration, we can use the fact that we have dominating loads to the start of the memory location to deduce that the transformation would be safe).

I think that, when we can prove that's safe, that is also a good solution. I'm not sure we can always prove this, however. The underlying allocation in this case, for example, need not include a[0].a (although maybe with some higher-level TBAA-like information, we could know that it must).

Would it be ok to commit this as is for now?

As previously stated, sure :-)

Thanks,
Silviu

mssimpso mentioned this in D19487: [LV] Reallow interleaved load groups with gaps. Apr 25 2016, 10:27 AM

dewen added a subscriber: dewen. Mar 14 2023, 12:46 AM

dewen added inline comments.

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp
Line 4803: Hi, @sbaranga. I have some doubts about this part of the code, whether it is too conservative, and I have not found a case to verify your statement.

Herald added projects: Restricted Project, Restricted Project. Mar 14 2023, 12:46 AM

Herald added subscribers: pcwang-thead, nemanjai.

Revision Contents

Path                                                                      Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp                     10 lines
llvm/trunk/test/Transforms/LoopVectorize/PowerPC/stride-vectorization.ll  8 lines
llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses.ll          6 lines

Diff 48495

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

[4,713 lines not shown]
@@ void InterleavedAccessInfo::analyzeInterleaving(
   MapVector<Instruction *, StrideDescriptor> StrideAccesses;
   collectConstStridedAccesses(StrideAccesses, Strides);

   if (StrideAccesses.empty())
     return;

   // Holds all interleaved store groups temporarily.
   SmallSetVector<InterleaveGroup *, 4> StoreGroups;
+  // Holds all interleaved load groups temporarily.
+  SmallSetVector<InterleaveGroup *, 4> LoadGroups;

   // Search the load-load/write-write pair B-A in bottom-up order and try to
   // insert B into the interleave group of A according to 3 rules:
   //   1. A and B have the same stride.
   //   2. A and B have the same memory object size.
   //   3. B belongs to the group according to the distance.
   //
   // The bottom-up order can avoid breaking the Write-After-Write dependences
[11 lines not shown]
@@ for (auto I = StrideAccesses.rbegin(), E = StrideAccesses.rend(); I != E;
     InterleaveGroup *Group = getInterleaveGroup(A);
     if (!Group) {
       DEBUG(dbgs() << "LV: Creating an interleave group with:" << *A << '\n');
       Group = createInterleaveGroup(A, DesA.Stride, DesA.Align);
     }

     if (A->mayWriteToMemory())
       StoreGroups.insert(Group);
+    else
+      LoadGroups.insert(Group);

     for (auto II = std::next(I); II != E; ++II) {
       Instruction *B = II->first;
       StrideDescriptor DesB = II->second;

       // Ignore if B is already in a group or B is a different memory operation.
       if (isInterleaved(B) || A->mayReadFromMemory() != B->mayReadFromMemory())
         continue;
[31 lines not shown]
@@ for (auto II = std::next(I); II != E; ++II) {
       }
     } // Iteration on instruction B
   } // Iteration on instruction A

   // Remove interleaved store groups with gaps.
   for (InterleaveGroup *Group : StoreGroups)
     if (Group->getNumMembers() != Group->getFactor())
       releaseGroup(Group);
+
+  // Remove interleaved load groups that don't have the first and last member.
+  // This guarantees that we won't do speculative out of bounds loads.
+  for (InterleaveGroup *Group : LoadGroups)
+    if (!Group->getMember(0) || !Group->getMember(Group->getFactor() - 1))
+      releaseGroup(Group);
 }

 LoopVectorizationCostModel::VectorizationFactor
 LoopVectorizationCostModel::selectVectorizationFactor(bool OptForSize) {
   // Width 1 means no vectorize
   VectorizationFactor Factor = { 1U, 0U };
   if (OptForSize && Legal->getRuntimePointerChecking()->Need) {
     emitAnalysis(VectorizationReport() <<
[1,127 lines not shown]
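
The load-group filter added in this diff can be read as a predicate over member slots. The following is a small stand-alone model, assuming a hypothetical `GroupModel` type in place of `InterleaveGroup`: `Present[i]` is true when the loop has an access at member index `i`, and `Present.size()` plays the role of `getFactor()`. It mirrors only the condition in the patch, not the surrounding analysis.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for InterleaveGroup (not LLVM code).
struct GroupModel {
  std::vector<bool> Present; // one slot per member index
};

// Mirrors the patch's condition: a load group is kept only if both the first
// and the last member slot are occupied. Gaps in the middle are still allowed.
bool hasFirstAndLastMember(const GroupModel &G) {
  assert(!G.Present.empty());
  return G.Present.front() && G.Present.back();
}

// Returns the groups that survive the filter, in their original order.
std::vector<GroupModel> filterLoadGroups(const std::vector<GroupModel> &Groups) {
  std::vector<GroupModel> Kept;
  for (const GroupModel &G : Groups)
    if (hasFirstAndLastMember(G))
      Kept.push_back(G);
  return Kept;
}
```

In this model, a group that loads only even elements (`{true, false}`) is released, which is consistent with the `even_load` test switching from `CHECK` to `CHECK-NOT`, while a factor-3 group with only its middle member missing (`{true, false, true}`) is kept.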

llvm/trunk/test/Transforms/LoopVectorize/PowerPC/stride-vectorization.ll

[10 lines not shown]
 ; CHECK: <2 x double>

 for.cond.cleanup: ; preds = %for.body
   ret void

 for.body: ; preds = %for.body, %entry
   %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
   %0 = shl nsw i64 %indvars.iv, 1
+  %odd.idx = add nsw i64 %0, 1
+
   %arrayidx = getelementptr inbounds double, double* %b, i64 %0
+  %arrayidx.odd = getelementptr inbounds double, double* %b, i64 %odd.idx
+
   %1 = load double, double* %arrayidx, align 8
-  %add = fadd double %1, 1.000000e+00
+  %2 = load double, double* %arrayidx.odd, align 8
+
+  %add = fadd double %1, %2
   %arrayidx2 = getelementptr inbounds double, double* %a, i64 %indvars.iv
   store double %add, double* %arrayidx2, align 8
   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
   %exitcond = icmp eq i64 %indvars.iv.next, 1600
   br i1 %exitcond, label %for.cond.cleanup, label %for.body
 }

 attributes #0 = { nounwind "target-cpu"="pwr8" }

llvm/trunk/test/Transforms/LoopVectorize/interleaved-accesses.ll

[286 lines not shown]
 ; (missing the load of odd elements).

 ; void even_load(int *A, int *B) {
 ;   for (unsigned i = 0; i < 1024; i+=2)
 ;     B[i/2] = A[i] * 2;
 ; }

 ; CHECK-LABEL: @even_load(
-; CHECK: %wide.vec = load <8 x i32>, <8 x i32>* %{{.*}}, align 4
-; CHECK: %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
-; CHECK-NOT: shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
-; CHECK: shl nsw <4 x i32> %strided.vec, <i32 1, i32 1, i32 1, i32 1>
+; CHECK-NOT: %wide.vec = load <8 x i32>, <8 x i32>* %{{.*}}, align 4
+; CHECK-NOT: %strided.vec = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>

 define void @even_load(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) {
 entry:
   br label %for.body

 for.cond.cleanup: ; preds = %for.body
   ret void

[161 lines not shown]