This is an archive of the discontinued LLVM Phabricator instance.

[LV] Avoid considering scalar-with-predication instructions as also uniform-after-vectorization, fix PR40816
Closed · Public

Authored by Ayal on Nov 15 2019, 2:19 AM.

Details

Reviewers

fhahn
hsaito
rengolin
dcaballe
gilr

Commits

rG6ed9cef25f91: [LV] Scalar with predication must not be uniform

Summary

Instructions identified as "scalar with predication" will be "vectorized" using a replicating region. If such instructions are also optimized as "uniform after vectorization", namely when only the first of VF lanes is used, such a replicating region becomes erroneous - only the first instance of the region can and should be formed. Fix such cases by not considering such instructions as "uniform after vectorization".
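The effect of the fix can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyInstr`, `makeToyLoop` and `collectToyUniforms` are hypothetical names invented for illustration and are not LLVM APIs, and the user/operand rule is far simpler than what `collectLoopUniforms` actually does. The model seeds a worklist with the latch compare and grows it through operands, while a guard in the spirit of `addToWorklistIfAllowed` refuses anything marked scalar-with-predication.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction; NOT llvm::Instruction.
struct ToyInstr {
  std::string Name;
  bool ScalarWithPredication; // would be emitted inside a replicating region
  std::vector<int> Operands;  // indices of in-loop operand instructions
  std::vector<int> Users;     // indices of in-loop user instructions
};

// Toy loop: gep (0) feeds load (1), which feeds the latch compare (2).
std::vector<ToyInstr> makeToyLoop(bool LoadIsPredicated) {
  return {
      {"gep", false, {}, {1}},
      {"load", LoadIsPredicated, {0}, {2}},
      {"cmp", false, {1}, {}},
  };
}

// Returns the names considered uniform, in the order they were added.
std::vector<std::string> collectToyUniforms(const std::vector<ToyInstr> &Loop,
                                            int Seed) {
  std::vector<int> Worklist;
  auto inWorklist = [&](int Idx) {
    return std::find(Worklist.begin(), Worklist.end(), Idx) != Worklist.end();
  };
  // The guard: a predicated instruction is never treated as uniform.
  auto addIfAllowed = [&](int Idx) {
    if (Loop[Idx].ScalarWithPredication || inWorklist(Idx))
      return;
    Worklist.push_back(Idx);
  };

  addIfAllowed(Seed);
  // Grow the worklist: an operand joins only if all its users already did.
  for (std::size_t I = 0; I != Worklist.size(); ++I) {
    const ToyInstr &Cur = Loop[Worklist[I]];
    for (int Op : Cur.Operands) {
      const std::vector<int> &Users = Loop[Op].Users;
      if (std::all_of(Users.begin(), Users.end(), inWorklist))
        addIfAllowed(Op);
    }
  }

  std::vector<std::string> Names;
  for (int Idx : Worklist)
    Names.push_back(Loop[Idx].Name);
  return Names;
}
```

Calling `collectToyUniforms(makeToyLoop(true), 2)` leaves only the compare in the worklist: the predicated load is rejected, so its address computation is never reached through it either.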

A TODO is left, because such cases could be optimized by implementing single-instance regions, although they are rare. The specific case of PR40816 should be optimized by not vectorizing such instructions at all, instead recognizing them as DeadInstructions or employing indvars to get rid of them before LV, as discussed in https://reviews.llvm.org/D68577#1742745.

The added test case is a simplification of the original one reported in the PR.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Ayal created this revision. Nov 15 2019, 2:19 AM

Herald added a project: Restricted Project. Nov 15 2019, 2:19 AM

Herald added subscribers: llvm-commits, rkruppe, hiraditya.

Nice! This indeed seems to solve the problem we saw in PR40816.

I don't know the code but I can at least provide some additional testing by including this patch in my weekend testing to see nothing unexpected pops up there.

In D70298#1747644, @uabelho wrote:

Nice! This indeed seems to solve the problem we saw in PR40816.

I don't know the code but I can at least provide some additional testing by including this patch in my weekend testing to see nothing unexpected pops up there.

Testing went well!

In D70298#1749366, @uabelho wrote:

Testing went well!

Thanks Mikael, good to know!

Inline comment on llvm/lib/Transforms/Vectorize/LoopVectorize.cpp, line 4682:
LLVM_DEBUG is redundant under #ifndef NDEBUG, will remove.

bjope added a subscriber: bjope. Nov 19 2019, 1:05 AM

Ping.

Cleaned up the #ifndef/LLVM_DEBUG, renamed the lambda, and rebased.

In the test case we predicate due to low trip count right? It might be worth calling that out.

Also, it would be great if the test could be simplified a bit more (e.g. make the phi and related values i16, get rid of the truncation and the multiply). It seems like the test is not the most robust, as the load feeding the compare is not used at all after vectorisation, but I’m not sure if there’s another easy way to mark it as uniform at the moment.

LGTM thanks, any test improvements would be a nice bonus :) I agree that in most cases it’s unlikely that replicating a few uniform instructions will have a big impact.

This revision is now accepted and ready to land. Nov 28 2019, 9:52 AM

fhahn mentioned this in D68831: [LV] Mark instructions with loop invariant arguments as uniform. (WIP). Nov 29 2019, 6:06 AM

It looks like this also fixes https://bugs.llvm.org/show_bug.cgi?id=43951. The reproducer there might be a good source for a test case as well.

Simplified the test as suggested.

In D70298#1763255, @fhahn wrote:

In the test case we predicate due to low trip count right? It might be worth calling that out.

Right, sure, called out.

Also, it would be great if the test could be simplified a bit more (e.g. make the phi and related values i16, get rid of the truncation and the multiply). It seems like the test is not the most robust, as the load feeding the compare is not used at all after vectorisation, but I’m not sure if there’s another easy way to mark it as uniform at the moment.

Test has been simplified.
Long version: if the phi is turned into an i16, LV can no longer vectorize the loop even if forced, because LoopVectorizationLegality's convertPointerToIntegerType() prevents it (and thus any induction) from being a Primary Induction. So to simplify and get rid of the truncation while still vectorizing with fold-tail, all i16's are turned into i32's instead. To make sure the load is scalarized with predication, instead of forming an interleave group (once we get rid of the multiply) or a masked gather, the target is changed from knl to core2. If a load feeding the latch-compare (thereby becoming Uniform-After-Vectorization) is live, the loop would probably not have a countable trip count, i.e., it would not be vectorizable. The other option for a load to become UAV is if it feeds a GEP known to be consecutive, or strided; but for that the load must be known to produce an arithmetic progression...
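To make the fold-tail predication concrete, here is a small sketch of how the lane mask works out for the test's trip count of 3 at VF=2. It only mirrors the `icmp ule <i, i+1>, <2, 2>` compare and the exit at index 4 seen in the test's FORCE checks; `foldTailMasks` is a hypothetical helper written for illustration, not part of LLVM.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: for each vector iteration, report which lanes are
// active when a loop with TripCount iterations is tail-folded at width VF.
std::vector<std::vector<bool>> foldTailMasks(unsigned TripCount, unsigned VF) {
  std::vector<std::vector<bool>> Masks;
  if (TripCount == 0 || VF == 0)
    return Masks;

  // Last valid induction value: 2 for a trip count of 3.
  unsigned BackedgeTakenCount = TripCount - 1;
  // The vector loop runs until the index reaches the trip count rounded up
  // to a multiple of VF: 4 for a trip count of 3 and VF of 2.
  unsigned RoundedUp = ((TripCount + VF - 1) / VF) * VF;

  for (unsigned Index = 0; Index != RoundedUp; Index += VF) {
    std::vector<bool> Lanes;
    for (unsigned Lane = 0; Lane != VF; ++Lane)
      Lanes.push_back(Index + Lane <= BackedgeTakenCount); // unsigned "ule"
    Masks.push_back(Lanes);
  }
  return Masks;
}
```

For `foldTailMasks(3, 2)` the first vector iteration has both lanes active and the second has only lane 0 active. That is why the store and the load in the test each get one predicated block per lane.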

LGTM thanks, any test improvements would be a nice bonus :) I agree that in most cases it’s unlikely that replicating a few uniform instructions will have a big impact.

In D70298#1763967, @fhahn wrote:

It looks like this also fixes https://bugs.llvm.org/show_bug.cgi?id=43951. The reproducer there might be a good source for a test case as well.

Agreed, thanks for pointing out! PR43951 exhibits a similar pattern of a load-from-constant-array used for computing the (tiny) trip count, eventually recognized as both scalar-with-predication and uniform-after-vectorization, and hence fixed by this patch. PR43951 has a (redundant) OR reduction instead of PR40816's store, but this distinction seems irrelevant in terms of test coverage.
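For readers who prefer source over IR, the pattern described here can be sketched in scalar C++. This mirrors the PR40816 test function from this revision (the array contents 7, 7, 0 and the store of the induction value to `b`); it does not reproduce the PR43951 reproducer, and the names `A`, `B` and `runScalarLoop` are chosen only for illustration.

```cpp
#include <cassert>

// Mirrors @a in the test: the zero in the last slot ends the loop.
static const int A[3] = {7, 7, 0};
// Stands in for the external global @b.
int B = -1;

// Runs the scalar loop and returns how many iterations executed.
int runScalarLoop() {
  int I = 0;
  int Iterations = 0;
  while (true) {
    B = I;             // store of the induction value
    ++Iterations;
    int Loaded = A[I]; // load from the constant array
    ++I;
    if (Loaded == 0)   // the loaded value alone decides the exit
      break;
  }
  return Iterations;
}
```

Because the array is constant, the trip count is known to be 3, which is small enough to trigger tail folding. The load's only use is the exit compare, which is how it came to be classified as both scalar-with-predication and uniform-after-vectorization before this patch.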

Closed by commit rG6ed9cef25f91: [LV] Scalar with predication must not be uniform (authored by Ayal). Dec 3 2019, 9:56 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path (Size)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (39 lines)
llvm/test/Transforms/LoopVectorize/X86/consecutive-ptr-uniforms.ll (83 lines)

Diff 231939

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ [4,662 lines not shown] @@ void LoopVectorizationCostModel::collectLoopUniforms(unsigned VF) {
   auto isOutOfScope = [&](Value *V) -> bool {
     Instruction *I = dyn_cast<Instruction>(V);
     return (!I || !TheLoop->contains(I));
   };

   SetVector<Instruction *> Worklist;
   BasicBlock *Latch = TheLoop->getLoopLatch();

+  // Instructions that are scalar with predication must not be considered
+  // uniform after vectorization, because that would create an erroneous
+  // replicating region where only a single instance out of VF should be formed.
+  // TODO: optimize such seldom cases if found important, see PR40816.
+  auto addToWorklistIfAllowed = [&](Instruction *I) -> void {
+    if (isScalarWithPredication(I, VF)) {
+      LLVM_DEBUG(dbgs() << "LV: Found not uniform being ScalarWithPredication: "
+                        << *I << "\n");
+      return;
+    }
+    LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *I << "\n");
+    Worklist.insert(I);
+  };
+
   // Start with the conditional branch. If the branch condition is an
   // instruction contained in the loop that is only used by the branch, it is
   // uniform.
   auto *Cmp = dyn_cast<Instruction>(Latch->getTerminator()->getOperand(0));
-  if (Cmp && TheLoop->contains(Cmp) && Cmp->hasOneUse()) {
-    Worklist.insert(Cmp);
-    LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *Cmp << "\n");
-  }
+  if (Cmp && TheLoop->contains(Cmp) && Cmp->hasOneUse())
+    addToWorklistIfAllowed(Cmp);

   // Holds consecutive and consecutive-like pointers. Consecutive-like pointers
   // are pointers that are treated like consecutive pointers during
   // vectorization. The pointer operands of interleaved accesses are an
   // example.
   SmallSetVector<Instruction *, 8> ConsecutiveLikePtrs;

   // Holds pointer operands of instructions that are possibly non-uniform.
@@ [42 lines not shown] @@ for (auto &I : *BB) {
       // remain uniform.
       else
         ConsecutiveLikePtrs.insert(Ptr);
     }

   // Add to the Worklist all consecutive and consecutive-like pointers that
   // aren't also identified as possibly non-uniform.
   for (auto *V : ConsecutiveLikePtrs)
-    if (PossibleNonUniformPtrs.find(V) == PossibleNonUniformPtrs.end()) {
-      LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *V << "\n");
-      Worklist.insert(V);
-    }
+    if (PossibleNonUniformPtrs.find(V) == PossibleNonUniformPtrs.end())
+      addToWorklistIfAllowed(V);

   // Expand Worklist in topological order: whenever a new instruction
   // is added , its users should be already inside Worklist. It ensures
   // a uniform instruction will only be used by uniform instructions.
   unsigned idx = 0;
   while (idx != Worklist.size()) {
     Instruction *I = Worklist[idx++];

@@ [9 lines not shown] @@ for (auto OV : I->operand_values()) {
       // If all the users of the operand are uniform, then add the
       // operand into the uniform worklist.
       auto *OI = cast<Instruction>(OV);
       if (llvm::all_of(OI->users(), [&](User *U) -> bool {
             auto *J = cast<Instruction>(U);
             return Worklist.count(J) ||
                    (OI == getLoadStorePointerOperand(J) &&
                     isUniformDecision(J, VF));
-          })) {
-        Worklist.insert(OI);
-        LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *OI << "\n");
-      }
+          }))
+        addToWorklistIfAllowed(OI);
     }
   }

   // Returns true if Ptr is the pointer operand of a memory access instruction
   // I, and I is known to not require scalarization.
   auto isVectorizedMemAccessUse = [&](Instruction *I, Value *Ptr) -> bool {
     return getLoadStorePointerOperand(I) == Ptr && isUniformDecision(I, VF);
   };
@@ [25 lines not shown] @@ auto UniformIndUpdate =
           auto *I = cast<Instruction>(U);
           return I == Ind || !TheLoop->contains(I) || Worklist.count(I) ||
                  isVectorizedMemAccessUse(I, IndUpdate);
         });
     if (!UniformIndUpdate)
       continue;

     // The induction variable and its update instruction will remain uniform.
-    Worklist.insert(Ind);
-    Worklist.insert(IndUpdate);
-    LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *Ind << "\n");
-    LLVM_DEBUG(dbgs() << "LV: Found uniform instruction: " << *IndUpdate
-                      << "\n");
+    addToWorklistIfAllowed(Ind);
+    addToWorklistIfAllowed(IndUpdate);
   }

   Uniforms[VF].insert(Worklist.begin(), Worklist.end());
 }

 bool LoopVectorizationCostModel::runtimeChecksRequired() {
   LLVM_DEBUG(dbgs() << "LV: Performing code size checks.\n");

@@ [3,087 lines not shown] @@

llvm/test/Transforms/LoopVectorize/X86/consecutive-ptr-uniforms.ll

 ; REQUIRES: asserts
 ; RUN: opt < %s -loop-vectorize -instcombine -S -debug-only=loop-vectorize -disable-output -print-after=instcombine 2>&1 | FileCheck %s
+; RUN: opt < %s -loop-vectorize -force-vector-width=2 -S | FileCheck %s -check-prefix=FORCE

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 ; CHECK-LABEL: PR31671
 ;
 ; Check a pointer in which one of its uses is consecutive-like and another of
 ; its uses is non-consecutive-like. In the test case below, %tmp3 is the
@@ [49 lines not shown] @@ for.body:
   %cond = icmp slt i64 %i.next, 32000
   br i1 %cond, label %for.body, label %for.end

 for.end:
   ret void
 }

 attributes #0 = { "target-cpu"="knl" }

+; CHECK-LABEL: PR40816
+;
+; Check that scalar with predication instructions are not considered uniform
+; after vectorization, because that results in replicating a region instead of
+; having a single instance (out of VF). The predication stems from a tiny count
+; of 3 leading to folding the tail by masking using icmp ule <i, i+1> <= <2, 2>.
+;
+; CHECK: LV: Found trip count: 3
+; CHECK: LV: Found uniform instruction: {{%.*}} = icmp eq i32 {{%.*}}, 0
+; CHECK-NOT: LV: Found uniform instruction: {{%.*}} = load i32, i32* {{%.*}}, align 1
+; CHECK: LV: Found not uniform being ScalarWithPredication: {{%.*}} = load i32, i32* {{%.*}}, align 1
+; CHECK: LV: Found scalar instruction: {{%.*}} = getelementptr inbounds [3 x i32], [3 x i32]* @a, i32 0, i32 {{%.*}}
+;
+; FORCE-LABEL: @PR40816(
+; FORCE-NEXT: entry:
+; FORCE-NEXT: br i1 false, label {{%.*}}, label [[VECTOR_PH:%.*]]
+; FORCE: vector.ph:
+; FORCE-NEXT: br label [[VECTOR_BODY:%.*]]
+; FORCE: vector.body:
+; FORCE-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_LOAD_CONTINUE4:%.*]] ]
+; FORCE-NEXT: [[VEC_IND:%.*]] = phi <2 x i32> [ <i32 0, i32 1>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_LOAD_CONTINUE4]] ]
+; FORCE-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
+; FORCE-NEXT: [[TMP1:%.*]] = add i32 [[INDEX]], 1
+; FORCE-NEXT: [[TMP2:%.*]] = icmp ule <2 x i32> [[VEC_IND]], <i32 2, i32 2>
+; FORCE-NEXT: [[TMP3:%.*]] = extractelement <2 x i1> [[TMP2]], i32 0
+; FORCE-NEXT: br i1 [[TMP3]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
+; FORCE: pred.store.if:
+; FORCE-NEXT: store i32 [[TMP0]], i32* @b, align 1
+; FORCE-NEXT: br label [[PRED_STORE_CONTINUE]]
+; FORCE: pred.store.continue:
+; FORCE-NEXT: [[TMP4:%.*]] = extractelement <2 x i1> [[TMP2]], i32 1
+; FORCE-NEXT: br i1 [[TMP4]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2:%.*]]
+; FORCE: pred.store.if1:
+; FORCE-NEXT: store i32 [[TMP1]], i32* @b, align 1
+; FORCE-NEXT: br label [[PRED_STORE_CONTINUE2]]
+; FORCE: pred.store.continue2:
+; FORCE-NEXT: [[TMP5:%.*]] = extractelement <2 x i1> [[TMP2]], i32 0
+; FORCE-NEXT: br i1 [[TMP5]], label [[PRED_LOAD_IF:%.*]], label [[PRED_LOAD_CONTINUE:%.*]]
+; FORCE: pred.load.if:
+; FORCE-NEXT: [[TMP6:%.*]] = getelementptr inbounds [3 x i32], [3 x i32]* @a, i32 0, i32 [[TMP0]]
+; FORCE-NEXT: [[TMP7:%.*]] = load i32, i32* [[TMP6]], align 1
+; FORCE-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> undef, i32 [[TMP7]], i32 0
+; FORCE-NEXT: br label [[PRED_LOAD_CONTINUE]]
+; FORCE: pred.load.continue:
+; FORCE-NEXT: [[TMP9:%.*]] = phi <2 x i32> [ undef, [[PRED_STORE_CONTINUE2]] ], [ [[TMP8]], [[PRED_LOAD_IF]] ]
+; FORCE-NEXT: [[TMP10:%.*]] = extractelement <2 x i1> [[TMP2]], i32 1
+; FORCE-NEXT: br i1 [[TMP10]], label [[PRED_LOAD_IF3:%.*]], label [[PRED_LOAD_CONTINUE4]]
+; FORCE: pred.load.if3:
+; FORCE-NEXT: [[TMP11:%.*]] = getelementptr inbounds [3 x i32], [3 x i32]* @a, i32 0, i32 [[TMP1]]
+; FORCE-NEXT: [[TMP12:%.*]] = load i32, i32* [[TMP11]], align 1
+; FORCE-NEXT: [[TMP13:%.*]] = insertelement <2 x i32> [[TMP9]], i32 [[TMP12]], i32 1
+; FORCE-NEXT: br label [[PRED_LOAD_CONTINUE4]]
+; FORCE: pred.load.continue4:
+; FORCE-NEXT: [[TMP14:%.*]] = phi <2 x i32> [ [[TMP9]], [[PRED_LOAD_CONTINUE]] ], [ [[TMP13]], [[PRED_LOAD_IF3]] ]
+; FORCE-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 2
+; FORCE-NEXT: [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], <i32 2, i32 2>
+; FORCE-NEXT: [[TMP15:%.*]] = icmp eq i32 [[INDEX_NEXT]], 4
+; FORCE-NEXT: br i1 [[TMP15]], label {{%.*}}, label [[VECTOR_BODY]]
+;
+@a = internal constant [3 x i32] [i32 7, i32 7, i32 0], align 1
+@b = external global i32, align 1
+
+define void @PR40816() #1 {
+
+entry:
+  br label %for.body
+
+for.body: ; preds = %for.body, %entry
+  %0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
+  store i32 %0, i32* @b, align 1
+  %arrayidx1 = getelementptr inbounds [3 x i32], [3 x i32]* @a, i32 0, i32 %0
+  %1 = load i32, i32* %arrayidx1, align 1
+  %cmp2 = icmp eq i32 %1, 0
+  %inc = add nuw nsw i32 %0, 1
+  br i1 %cmp2, label %return, label %for.body
+
+return: ; preds = %for.body
+  ret void
+}
+
+attributes #1 = { "target-cpu"="core2" }