This is an archive of the discontinued LLVM Phabricator instance.

[LV] Process pointer IVs with PHINodes in collectLoopUniforms
ClosedPublic

Authored by mssimpso on Sep 13 2016, 8:40 AM.

Details

Reviewers

Commits

rGb25e87fca572: [LV] Process pointer IVs with PHINodes in collectLoopUniforms
rL281485: [LV] Process pointer IVs with PHINodes in collectLoopUniforms

Summary

This patch moves the processing of pointer induction variables in collectLoopUniforms from the consecutive pointer phase of the analysis to the phi node phase. Previously, if a pointer induction variable was used by both a scalarized non-memory instruction as well as a vectorized memory instruction, we would incorrectly identify the pointer as uniform. Pointer induction variables should be treated the same as other phi nodes. That is, they are uniform if all users of the induction variable and induction variable update are uniform.
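
The uniformity rule stated above can be sketched outside of LLVM. The following is only a toy model under stated assumptions: `Inst`, `usersAreUniform`, `ivRemainsUniform`, and `summaryScenario` are hypothetical names, not LLVM APIs, and a memory access that stays vectorized is simply placed in the set of uniform users. It shows that an induction variable stops being uniform as soon as one in-loop user is not.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Hypothetical stand-in for an IR instruction; this is not an LLVM type.
struct Inst {
  bool inLoop = true;
  std::vector<const Inst *> users;
};

// True if every user of `self` is `partner`, lies outside the loop, or is
// already known to be uniform.
static bool usersAreUniform(const Inst &self, const Inst &partner,
                            const std::set<const Inst *> &uniform) {
  return std::all_of(self.users.begin(), self.users.end(),
                     [&](const Inst *u) {
                       return u == &partner || !u->inLoop ||
                              uniform.count(u) != 0;
                     });
}

// An induction phi and its update stay uniform only if both pass the check.
bool ivRemainsUniform(const Inst &ind, const Inst &update,
                      const std::set<const Inst *> &uniform) {
  return usersAreUniform(ind, update, uniform) &&
         usersAreUniform(update, ind, uniform);
}

// Scenario from the summary: the pointer IV feeds a memory access that stays
// vectorized (modeled here as a uniform user) plus one non-memory instruction.
bool summaryScenario(bool otherUserIsUniform) {
  Inst ind, update, memAccess, other;
  ind.users = {&update, &memAccess, &other};
  update.users = {&ind};
  std::set<const Inst *> uniform = {&memAccess};
  if (otherUserIsUniform)
    uniform.insert(&other);
  return ivRemainsUniform(ind, update, uniform);
}
```

Calling `summaryScenario(false)` models the case the patch fixes: the scalarized non-memory user keeps the pointer induction variable from being treated as uniform, even though the memory access itself is vectorized.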

Diff Detail

Repository: rL LLVM

Event Timeline

mssimpso updated this revision to Diff 71178. Sep 13 2016, 8:40 AM

mssimpso retitled this revision to [LV] Process pointer IVs with PHINodes in collectLoopUniforms.

mssimpso updated this object.

mssimpso added a reviewer: mkuper.

mssimpso added subscribers: llvm-commits, mcrosier.

Herald added a subscriber: mzolotukhin. Sep 13 2016, 8:40 AM

mssimpso added a child revision: D24275: [LV] Don't emit unused scalars for uniform instructions. Sep 13 2016, 8:40 AM

mkuper added inline comments. Sep 13 2016, 9:43 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
5397 ↗	(On Diff #71178)	Do we care about bitcasts / GEP chains? (I'm ok with being conservative with this, just making sure it's by choice.)
5398 ↗	(On Diff #71178)	Do we need to check that if U is a store, then Ptr is actually the pointer operand of the store, not the value? I'm thinking about:
	store i32* %p, i32** %q
where both %p and %q are consecutive.
5466 ↗	(On Diff #71178)	The only thing that changed here, functionally, is the addition of isVectorizedMemAccessUse right? The rest is cleanup?

Thanks for the quick feedback, Michael!

lib/Transforms/Vectorize/LoopVectorize.cpp
5397 ↗	(On Diff #71178)	This is intentionally conservative, and we can probably relax it eventually. This patch was NFC when I tested it on SPEC, by the way. But note that Ptr should be the actual pointer operand of a memory access (see comment below). This covers the case where we have a GEP that is bitcast, and the bitcast is then used by the memory access. We look through bitcasts in isConsecutivePointer, which calls getGEPInstruction. For chains of instructions, we have to prove the user uniform first. This already happens to some degree in the expansion phase of the analysis. If GEP1 is only used by GEP2, and GEP2 remains uniform, GEP1 will be marked uniform as well. The same is true for the GEP in the bitcast case above.
5398 ↗	(On Diff #71178)	Yes, nice catch! I'll update the patch. We want all users of the pointer to be memory accesses, where the pointer is the pointer operand. This is all we consider when checking for scalarization. We already know Ptr is the pointer operand of one memory access, but it could be used as the value operand of another. Something like:
	%x = load i32* %p
	store i32* %p, i32** %q
%p can't remain uniform because we need to store all its corresponding values to memory. %q remains uniform if the other conditions are met (the store is not scalarized, etc.).
5466 ↗	(On Diff #71178)	That's right. I'm happy to save the cleanup for a separate patch if you prefer.
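
The operand-position check discussed for line 5398 can be illustrated with a small standalone model. This is a sketch under assumptions, not the patch itself: `Node`, `Kind`, `pointerOperandOf`, and `checkFragment` are hypothetical names, and only `usersAreMemAccesses` mirrors the shape of the lambda in the diff. It models the fragment with a load from %p and a store of %p to %q.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical miniature IR; none of these types are LLVM classes.
enum class Kind { Load, Store, Other };

struct Node {
  Kind kind = Kind::Other;
  const Node *pointerOp = nullptr; // address operand of a load or store
  const Node *valueOp = nullptr;   // stored value (stores only)
  std::vector<const Node *> users;
};

// Returns the address operand of a load or store, or nullptr otherwise.
const Node *pointerOperandOf(const Node *n) {
  bool isMemAccess = n->kind == Kind::Load || n->kind == Kind::Store;
  return isMemAccess ? n->pointerOp : nullptr;
}

// True if every user of `ptr` is a memory access that uses `ptr` as its
// address, never as the value being stored.
bool usersAreMemAccesses(const Node *ptr) {
  return std::all_of(ptr->users.begin(), ptr->users.end(),
                     [&](const Node *u) { return pointerOperandOf(u) == ptr; });
}

// Models the fragment: a load from %p, then a store of %p to %q.
// Returns the check result for {%p, %q}.
std::pair<bool, bool> checkFragment() {
  Node p, q, load, store;
  load.kind = Kind::Load;
  load.pointerOp = &p;
  store.kind = Kind::Store;
  store.pointerOp = &q;
  store.valueOp = &p;
  p.users = {&load, &store};
  q.users = {&store};
  return {usersAreMemAccesses(&p), usersAreMemAccesses(&q)};
}
```

In this model the check fails for %p, because the store uses it as a value, and passes for %q, which matches the outcome described in the reply above.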

mkuper added inline comments. Sep 13 2016, 10:43 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
5398 ↗	(On Diff #71178)	Yes, this is exactly what I meant.
5466 ↗	(On Diff #71178)	My personal preference is to do it the other way around - NFC cleanups go in first, and then you can put up the functional patch (post-cleanup) for review. This kills two birds with one stone:
	We don't have to revert cleanups if the functional patch turns out to be broken.
	Reviewing the functional patch is easier. :-)
(If you think the cleanup itself is complex enough to deserve review, then I'd prefer two separate reviews, but, again, cleanup first.)

mssimpso added inline comments. Sep 13 2016, 10:48 AM

lib/Transforms/Vectorize/LoopVectorize.cpp
5398 ↗	(On Diff #71178)	Great, I'll add another test case for this.
5466 ↗	(On Diff #71178)	Sounds good. I'll go ahead with the cleanup before updating the patch here.

Addressed Michael's comments.

Updated UsersAreMemAccesses to check that uses are pointer operands.
Added a new test case for the discussed code fragment.
Rebased on top of the NFC clean up.

LGTM

This revision is now accepted and ready to land. Sep 13 2016, 2:26 PM

Thanks!

Closed by commit rL281485: [LV] Process pointer IVs with PHINodes in collectLoopUniforms (authored by mssimpso). Sep 14 2016, 7:56 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	26 lines
llvm/trunk/test/Transforms/LoopVectorize/consecutive-ptr-uniforms.ll	169 lines

Diff 71361

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ ... @@ void LoopVectorizationLegality::collectLoopUniforms() {
   for (auto *BB : TheLoop->blocks())
     for (auto &I : *BB) {

       // If there's no pointer operand, there's nothing to do.
       auto *Ptr = dyn_cast_or_null<Instruction>(getPointerOperand(&I));
       if (!Ptr)
         continue;

+      // True if all users of Ptr are memory accesses that have Ptr as their
+      // pointer operand.
+      auto UsersAreMemAccesses = all_of(Ptr->users(), [&](User *U) -> bool {
+        return getPointerOperand(U) == Ptr;
+      });
+
       // Ensure the memory instruction will not be scalarized, making its
-      // pointer operand non-uniform.
-      if (memoryInstructionMustBeScalarized(&I))
+      // pointer operand non-uniform. If the pointer operand is used by some
+      // instruction other than a memory access, we're not going to check if
+      // that other instruction may be scalarized here. Thus, conservatively
+      // assume the pointer operand may be non-uniform.
+      if (!UsersAreMemAccesses || memoryInstructionMustBeScalarized(&I))
         PossibleNonUniformPtrs.insert(Ptr);

       // If the memory instruction will be vectorized and its pointer operand
       // is consecutive-like, the pointer operand should remain uniform.
       else if (hasConsecutiveLikePtrOperand(&I))
         ConsecutiveLikePtrs.insert(Ptr);
     }

@@ ... @@ for (auto OV : I->operand_values()) {
             return isOutOfScope(U) || Worklist.count(cast<Instruction>(U));
           })) {
         Worklist.insert(OI);
         DEBUG(dbgs() << "LV: Found uniform instruction: " << *OI << "\n");
       }
     }
   }

+  // Returns true if Ptr is the pointer operand of a memory access instruction
+  // I, and I is known to not require scalarization.
+  auto isVectorizedMemAccessUse = [&](Instruction *I, Value *Ptr) -> bool {
+    return getPointerOperand(I) == Ptr && !memoryInstructionMustBeScalarized(I);
+  };
+
   // For an instruction to be added into Worklist above, all its users inside
   // the loop should also be in Worklist. However, this condition cannot be
   // true for phi nodes that form a cyclic dependence. We must process phi
   // nodes separately. An induction variable will remain uniform if all users
   // of the induction variable and induction variable update remain uniform.
+  // The code below handles both pointer and non-pointer induction variables.
   for (auto &Induction : Inductions) {
     auto *Ind = Induction.first;
     auto *IndUpdate = cast<Instruction>(Ind->getIncomingValueForBlock(Latch));

     // Determine if all users of the induction variable are uniform after
     // vectorization.
     auto UniformInd = all_of(Ind->users(), [&](User *U) -> bool {
       auto *I = cast<Instruction>(U);
-      return I == IndUpdate || !TheLoop->contains(I) || Worklist.count(I);
+      return I == IndUpdate || !TheLoop->contains(I) || Worklist.count(I) ||
+             isVectorizedMemAccessUse(I, Ind);
     });
     if (!UniformInd)
       continue;

     // Determine if all users of the induction variable update instruction are
     // uniform after vectorization.
     auto UniformIndUpdate = all_of(IndUpdate->users(), [&](User *U) -> bool {
       auto *I = cast<Instruction>(U);
-      return I == Ind || !TheLoop->contains(I) || Worklist.count(I);
+      return I == Ind || !TheLoop->contains(I) || Worklist.count(I) ||
+             isVectorizedMemAccessUse(I, IndUpdate);
     });
     if (!UniformIndUpdate)
       continue;

     // The induction variable and its update instruction will remain uniform.
     Worklist.insert(Ind);
     Worklist.insert(IndUpdate);
     DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n");

llvm/trunk/test/Transforms/LoopVectorize/consecutive-ptr-uniforms.ll

@@ ... @@ for.body:
		store x86_fp80 %tmp0, x86_fp80* %tmp1, align 16
		%i.next = add i64 %i, 1
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		ret void
		}

		; CHECK-LABEL: pointer_iv_uniform
		;
		; Check that a pointer induction variable is recognized as uniform and remains
		; uniform after vectorization.
		;
		; CHECK: LV: Found uniform instruction: %p = phi i32* [ %tmp03, %for.body ], [ %a, %entry ]
		; CHECK: vector.body
		; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
		; CHECK-NOT: getelementptr
		; CHECK: %next.gep = getelementptr i32, i32* %a, i64 %index
		; CHECK-NOT: getelementptr
		; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body
		;
		define void @pointer_iv_uniform(i32* %a, i32 %x, i64 %n) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
		%p = phi i32* [ %tmp03, %for.body ], [ %a, %entry ]
		store i32 %x, i32* %p, align 8
		%tmp03 = getelementptr inbounds i32, i32* %p, i32 1
		%i.next = add nuw nsw i64 %i, 1
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		ret void
		}

		; INTER-LABEL: pointer_iv_non_uniform_0
		;
		; Check that a pointer induction variable with a non-uniform user is not
		; recognized as uniform and is not uniform after vectorization. The pointer
		; induction variable is used by getelementptr instructions that are non-uniform
		; due to scalarization of the stores.
		;
		; INTER-NOT: LV: Found uniform instruction: %p = phi i32* [ %tmp03, %for.body ], [ %a, %entry ]
		; INTER: vector.body
		; INTER: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
		; INTER: %[[I0:.+]] = shl i64 %index, 2
		; INTER: %next.gep = getelementptr i32, i32* %a, i64 %[[I0]]
		; INTER: %[[S1:.+]] = shl i64 %index, 2
		; INTER: %[[I1:.+]] = or i64 %[[S1]], 4
		; INTER: %next.gep2 = getelementptr i32, i32* %a, i64 %[[I1]]
		; INTER: %[[S2:.+]] = shl i64 %index, 2
		; INTER: %[[I2:.+]] = or i64 %[[S2]], 8
		; INTER: %next.gep3 = getelementptr i32, i32* %a, i64 %[[I2]]
		; INTER: %[[S3:.+]] = shl i64 %index, 2
		; INTER: %[[I3:.+]] = or i64 %[[S3]], 12
		; INTER: %next.gep4 = getelementptr i32, i32* %a, i64 %[[I3]]
		; INTER: br i1 {{.*}}, label %middle.block, label %vector.body
		;
		define void @pointer_iv_non_uniform_0(i32* %a, i64 %n) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
		%p = phi i32* [ %tmp03, %for.body ], [ %a, %entry ]
		%tmp00 = load i32, i32* %p, align 8
		%tmp01 = getelementptr inbounds i32, i32* %p, i32 1
		%tmp02 = load i32, i32* %tmp01, align 8
		%tmp03 = getelementptr inbounds i32, i32* %p, i32 4
		%tmp04 = load i32, i32* %tmp03, align 8
		%tmp05 = getelementptr inbounds i32, i32* %p, i32 5
		%tmp06 = load i32, i32* %tmp05, align 8
		%tmp07 = sub i32 %tmp04, %tmp00
		%tmp08 = sub i32 %tmp02, %tmp02
		%tmp09 = getelementptr inbounds i32, i32* %p, i32 2
		store i32 %tmp07, i32* %tmp09, align 8
		%tmp10 = getelementptr inbounds i32, i32* %p, i32 3
		store i32 %tmp08, i32* %tmp10, align 8
		%i.next = add nuw nsw i64 %i, 1
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		ret void
		}

		; CHECK-LABEL: pointer_iv_non_uniform_1
		;
		; Check that a pointer induction variable with a non-uniform user is not
		; recognized as uniform and is not uniform after vectorization. The pointer
		; induction variable is used by a store that will be scalarized.
		;
		; CHECK-NOT: LV: Found uniform instruction: %p = phi x86_fp80* [%tmp1, %for.body], [%a, %entry]
		; CHECK: vector.body
		; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
		; CHECK: %next.gep = getelementptr x86_fp80, x86_fp80* %a, i64 %index
		; CHECK: %[[I1:.+]] = or i64 %index, 1
		; CHECK: %next.gep2 = getelementptr x86_fp80, x86_fp80* %a, i64 %[[I1]]
		; CHECK: %[[I2:.+]] = or i64 %index, 2
		; CHECK: %next.gep3 = getelementptr x86_fp80, x86_fp80* %a, i64 %[[I2]]
		; CHECK: %[[I3:.+]] = or i64 %index, 3
		; CHECK: %next.gep4 = getelementptr x86_fp80, x86_fp80* %a, i64 %[[I3]]
		; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body
		;
		define void @pointer_iv_non_uniform_1(x86_fp80* %a, i64 %n) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
		%p = phi x86_fp80* [%tmp1, %for.body], [%a, %entry]
		%tmp0 = sitofp i32 1 to x86_fp80
		store x86_fp80 %tmp0, x86_fp80* %p, align 16
		%tmp1 = getelementptr inbounds x86_fp80, x86_fp80* %p, i32 1
		%i.next = add i64 %i, 1
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		ret void
		}

		; CHECK-LABEL: pointer_iv_mixed
		;
		; Check multiple pointer induction variables where only one is recognized as
		; uniform and remains uniform after vectorization. The other pointer induction
		; variable is not recognized as uniform and is not uniform after vectorization
		; because it is stored to memory.
		;
		; CHECK-NOT: LV: Found uniform instruction: %p = phi i32* [ %tmp3, %for.body ], [ %a, %entry ]
		; CHECK: LV: Found uniform instruction: %q = phi i32** [ %tmp4, %for.body ], [ %b, %entry ]
		; CHECK: vector.body
		; CHECK: %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
		; CHECK: %next.gep = getelementptr i32, i32* %a, i64 %index
		; CHECK: %[[I1:.+]] = or i64 %index, 1
		; CHECK: %next.gep10 = getelementptr i32, i32* %a, i64 %[[I1]]
		; CHECK: %[[I2:.+]] = or i64 %index, 2
		; CHECK: %next.gep11 = getelementptr i32, i32* %a, i64 %[[I2]]
		; CHECK: %[[I3:.+]] = or i64 %index, 3
		; CHECK: %next.gep12 = getelementptr i32, i32* %a, i64 %[[I3]]
		; CHECK: %[[V0:.+]] = insertelement <4 x i32*> undef, i32* %next.gep, i32 0
		; CHECK: %[[V1:.+]] = insertelement <4 x i32*> %[[V0]], i32* %next.gep10, i32 1
		; CHECK: %[[V2:.+]] = insertelement <4 x i32*> %[[V1]], i32* %next.gep11, i32 2
		; CHECK: %[[V3:.+]] = insertelement <4 x i32*> %[[V2]], i32* %next.gep12, i32 3
		; CHECK-NOT: getelementptr
		; CHECK: %next.gep13 = getelementptr i32*, i32** %b, i64 %index
		; CHECK-NOT: getelementptr
		; CHECK: %[[B0:.+]] = bitcast i32** %next.gep13 to <4 x i32*>*
		; CHECK: store <4 x i32*> %[[V3]], <4 x i32*>* %[[B0]], align 8
		; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body
		;
		define i32 @pointer_iv_mixed(i32* %a, i32** %b, i64 %n) {
		entry:
		br label %for.body

		for.body:
		%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
		%p = phi i32* [ %tmp3, %for.body ], [ %a, %entry ]
		%q = phi i32** [ %tmp4, %for.body ], [ %b, %entry ]
		%tmp0 = phi i32 [ %tmp2, %for.body ], [ 0, %entry ]
		%tmp1 = load i32, i32* %p, align 8
		%tmp2 = add i32 %tmp1, %tmp0
		store i32* %p, i32** %q, align 8
		%tmp3 = getelementptr inbounds i32, i32* %p, i32 1
		%tmp4 = getelementptr inbounds i32*, i32** %q, i32 1
		%i.next = add nuw nsw i64 %i, 1
		%cond = icmp slt i64 %i.next, %n
		br i1 %cond, label %for.body, label %for.end

		for.end:
		%tmp5 = phi i32 [ %tmp2, %for.body ]
		ret i32 %tmp5
		}