This is an archive of the discontinued LLVM Phabricator instance.


Differential D38785

[LV/LAA] Avoid specializing a loop for stride=1 when this predicate implies a single-iteration loop
Closed · Public

Authored by dorit on Oct 10 2017, 11:25 PM.


Details

Reviewers

Ayal
hfinkel
silviu.baranga

Commits

rGeb13dd3eac77: [LV/LAA] Avoid specializing a loop for stride=1 when this predicate implies a…
rL317438

Summary

This fixes PR34681, where vectorization specializes the loop for the case where the unknown stride N equals 1 (the stride happens to be equal to the loop iteration count N).
However, the vectorized loop is executed only if the iteration count N >= VF, and since VF > 1 in the vectorized loop, this implies that N > 1.
The two conditions cannot both hold, so the vectorized loop body becomes dead code. (Eventually this dead code is identified and removed, but until then it can have side effects on the optimization passes that run in between.)

Instead, this patch avoids specialization of an unknown stride if we know that the "stride==1" predicate is going to contradict the loop-minimum-iteration-count guard. This gives the vectorizer a chance to try to vectorize the loop with the unknown stride, instead of falling directly to the scalar version.

While the motivation for this patch came from this vectorization scenario, it is relevant for any loop optimization that wants to take advantage of the Stride==1 specialization when Stride is equal to the loop count, because optimizing a loop that executes at most one iteration probably doesn't make much sense, so we might as well avoid this runtime guard.
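The contradiction described in the summary can be checked on concrete numbers. The following is a minimal sketch, not LLVM code: the function names are hypothetical, and it assumes the PR34681 scenario in which the symbolic stride and the trip count are the same value N.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the two runtime guards in front of the vectorized
// loop. None of these functions is an LLVM API.

// The stride-versioning predicate: the loop was specialized for Stride == 1.
bool passesStridePredicate(uint64_t stride) { return stride == 1; }

// The minimum-iteration-count guard that the vectorizer adds.
bool passesMinIterationGuard(uint64_t tripCount, unsigned vf) {
  assert(vf > 1 && "a vectorized loop has VF > 1");
  return tripCount >= vf;
}

// The vector body runs only if both guards pass. In the PR34681 scenario
// the stride and the trip count are both N.
bool vectorBodyReachable(uint64_t n, unsigned vf) {
  return passesStridePredicate(n) && passesMinIterationGuard(n, vf);
}

// Returns true if some N in [0, limit] reaches the vector body.
bool anyTripCountReachesVectorBody(uint64_t limit, unsigned vf) {
  for (uint64_t n = 0; n <= limit; ++n)
    if (vectorBodyReachable(n, vf))
      return true;
  return false;
}
```

For any VF > 1 the search finds no N that satisfies both guards, which is why the specialized vector body is dead code in this scenario.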


Event Timeline

dorit created this revision. Oct 10 2017, 11:25 PM

Herald added a subscriber: hiraditya. Oct 10 2017, 11:25 PM

dorit added reviewers: Ayal, hfinkel, silviu.baranga. Oct 10 2017, 11:27 PM

dorit added a subscriber: llvm-commits.

It sounds like this is a profitability issue and therefore the LAA users should handle this? Maybe the users would also want to be more general (I think a loop with an iteration count of 2 is probably not worth vectorizing).

Another issue is that if LV would somehow add a stride == 1 predicate, LAA wouldn't see it so the problem wouldn't go away.

Do you happen to know what transformation we're blocking here?

Hi Silviu,

It sounds like this is a profitability issue and therefore the LAA users should handle this? Maybe the users would also want to be more general (I think a loop with an iteration count of 2 is probably not worth vectorizing).

First of all, yes, that's true. This is why I went only for the obvious no-brainer case: a loop that executes only a single iteration is not really a loop and shouldn't go through any loop-level analysis and transformation. Anything beyond that is indeed a profitability decision that each user needs to make.

(In fact, regardless of this corner case, the user should be given a chance to consider the specialization to stride=1 as just one alternative. At least the vectorizer is making its first steps towards a model (VPlan) that could allow it to consider different vectorization alternatives. )

Another issue is that if LV would somehow add a stride == 1 predicate, LAA wouldn't see it so the problem wouldn't go away.

Yes, of course, the vectorizer has a much bigger problem of not being aware of the predicates it operates under. There are the memcheck predicates, the SCEV predicates, the LAA Stride==1 predicate, and then the loop-iteration-count predicates that the vectorizer itself adds (I think these are the only ones that no other analysis pass adds under the covers?). These predicates are never reasoned about together; there is no attempt to check whether they contradict one another or can be optimized. Indeed this patch is not attempting to address this larger problem…

Do you happen to know what transformation we're blocking here?

The main transformation we are blocking is vectorization with unknown stride (using gathers/scalarized loads). We specialize for an impossible case, and lose the opportunity to consider an actual viable alternative.

Beyond that, the side effect that I've seen is that removing the dead vectorized loop body left some of the runtime guards that the vectorizer inserted before the vectorized loop in the pre-header of the scalar loop that followed the (now removed) vectorized loop. This affected the decisions of the loop unroller, which in turn affected the behavior of the SLP vectorizer… (all because of IR we inserted that shouldn't have been added in the first place…).

But the stronger motivation to me is to avoid inserting an impossible predicate and wasting time on the analysis and transformation that this predicate enables (while also leaving behind a bunch of garbage). (And all this where we could have done something better…).

In short, Silviu, I absolutely agree with your observations, but I think that this patch has (some) value nonetheless…

Thanks,
dorit

dorit retitled this revision from [LV/LAA] Avoid secializing a loop for stride=1 when this predicate implies a single-iteration loop to [LV/LAA] Avoid specializing a loop for stride=1 when this predicate implies a single-iteration loop. Oct 18 2017, 1:37 AM

Indeed, a loop with an iteration count smaller than VF is definitely not worth vectorizing. An interesting profitability issue is to decide how many iterations past VF suffice to amortize vectorization overheads. In any case, this single/no iteration case looks like a no-brainer and realistic case - traversing a column of an NxN matrix.
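The column traversal mentioned above is the shape of loop that triggers the issue: walking one column of a row-major NxN matrix uses N both as the access stride and as the trip count. A minimal sketch, in which the function names and the helper that builds a test matrix are illustrative assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sums column j of an N x N row-major matrix. The access B[k * N + j] has
// stride N, and the loop also runs N times, so "stride == 1" would imply
// a trip count of at most 1.
int sumColumn(const std::vector<int16_t> &B, std::size_t N, std::size_t j) {
  assert(B.size() == N * N && (N == 0 || j < N));
  int tmp = 0;
  for (std::size_t k = 0; k < N; ++k)
    tmp += B[k * N + j];
  return tmp;
}

// Hypothetical helper: builds an N x N matrix holding 1, 2, ..., N*N in
// row-major order.
std::vector<int16_t> makeCountingMatrix(std::size_t N) {
  std::vector<int16_t> B(N * N);
  for (std::size_t i = 0; i < B.size(); ++i)
    B[i] = static_cast<int16_t>(i + 1);
  return B;
}
```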

llvm/lib/Analysis/LoopAccessAnalysis.cpp
2153	Simplify the above explanation. Suffice to say something like the following: Avoid adding the "Stride == 1" predicate when we know that Stride >= Trip-Count. Such a predicate will effectively optimize a single or no iteration loop, as Trip-Count <= Stride == 1.
2177	`Minus` >> `StrideMinusBETaken`?
2184	Can report success here, as in the original message above: DEBUG(dbgs() << "LAA: Found a strided access that we can version");
llvm/test/Transforms/LoopVectorize/pr34681.ll
13	Would tmp += B[k*N]; suffice? I.e., the cast to `int` and offset of `j` seem redundant, albeit do no harm.
16	"runtine" >> "runtime" Suggest to emphasize: "is not generated"
llvm/test/Transforms/LoopVectorize/version-mem-access.ll
63	Just noting for completeness that this test-case originally used the symbolic stride also as the trip count. Separating them below in order to continue to vectorize the loop.
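The condition from the inline comment on LoopAccessAnalysis.cpp (skip the predicate when Stride >= Trip-Count) can be modelled on concrete integers. This is only a sketch under stated assumptions: the patch asks ScalarEvolution whether a symbolic difference is known to be positive, whereas the hypothetical functions below work on known values, with a 32-bit stride and a 32-bit backedge-taken count chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Direct form of the condition: Stride >= TC, where TC = BETakenCount + 1.
bool strideAtLeastTripCount(int32_t stride, uint32_t backedgeTakenCount) {
  return static_cast<int64_t>(stride) >=
         static_cast<int64_t>(backedgeTakenCount) + 1;
}

// Rewritten form used in the patch comment: Stride - BETakenCount > 0.
// The stride may be negative, so it is sign-extended; the backedge-taken
// count is non-negative, so it is zero-extended. Both are widened to
// 64 bits here so the subtraction cannot overflow.
bool shouldSkipStrideVersioning(int32_t stride, uint32_t backedgeTakenCount) {
  int64_t wideStride = static_cast<int64_t>(stride);
  int64_t wideBECount = static_cast<int64_t>(backedgeTakenCount);
  bool skip = wideStride - wideBECount > 0;
  // Both forms describe the same condition.
  assert(skip == strideAtLeastTripCount(stride, backedgeTakenCount));
  return skip;
}
```

With stride 8 and a backedge-taken count of 7 (trip count 8) the function reports that versioning should be skipped; with stride 1 and the same count it does not.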

I agree that this is a strict improvement.

However it seems like stride versioning might be the bigger problem. Here it doesn't help us with the dependence analysis and it adds a run-time check with a very high probability of not being validated. Probably a cost analysis that takes into account the probability of execution of the versioned loop wouldn't find stride versioning profitable, even if it did pass the added condition. In addition it causes issues such as this one. Maybe we should only use it if it somehow helps with the dependence analysis, otherwise using scatter/gather might always be better.

Hi Silviu,

However it seems like stride versioning might be the bigger problem.

Yes, we have a problem with our decision making on when to do stride versioning. We need to evaluate the cost of the runtime test and the benefit of the various possible stride specializations, and then make an informed decision about which versioning to perform (which stride(s) to specialize for), if any. That's a big change that needs to be designed and carefully tuned. For example:

Maybe we should only use it if it somehow helps with the dependence analysis, otherwise using scatter/gather might always be better.

If the unknown stride happens to be 1, then using gathers would have worse performance than the versioned loop with regular wide loads. So maybe we'd want more than one specialized version, or maybe some other solution, but in any case something that needs to be designed and tuned.
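The trade-off between the two versions can be pictured in source form. The sketch below is an illustration only: the names are hypothetical, and it shows the shape of a loop versioned on a runtime stride check, not the IR the vectorizer emits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reduction over n elements of B taken at a runtime stride, versioned on
// "stride == 1" before entering the loop.
int sumStridedVersioned(const std::vector<int> &B, std::size_t n,
                        std::size_t stride) {
  assert(n == 0 || (n - 1) * stride < B.size());
  int sum = 0;
  if (stride == 1) {
    // Specialized version: consecutive accesses, which a vectorizer can
    // turn into regular wide loads.
    for (std::size_t k = 0; k < n; ++k)
      sum += B[k];
  } else {
    // Generic version: the index depends on the runtime stride, so a
    // vectorizer would need gathers or scalarized loads here.
    for (std::size_t k = 0; k < n; ++k)
      sum += B[k * stride];
  }
  return sum;
}

// Hypothetical helper: builds the vector 0, 1, ..., count - 1.
std::vector<int> makeRange(std::size_t count) {
  std::vector<int> v(count);
  for (std::size_t i = 0; i < count; ++i)
    v[i] = static_cast<int>(i);
  return v;
}
```

Both branches compute the same result; the versioning only pays off when the stride really is 1 often enough to amortize the runtime check.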

In short, I think we are all in agreement that:

This patch is a (small) improvement, * regardless of the users and their specific cost considerations *.
The cost-model aspect of deciding when to specialize for a certain stride needs to be improved. The users (LV especially) are currently not making informed decisions.

…Right?

thanks,
Dorit

In D38785#913785, @dorit wrote:

In short, I think we are all in agreement that:

This patch is a (small) improvement, * regardless of the users and their specific cost considerations *.

The cost-model aspect of deciding when to specialize for a certain stride needs to be improved. The users (LV especially) are currently not making informed decisions.

…Right?

I agree,

This patch looks good to me; my comments below were very minor.
Add a TODO capturing these concerns/opportunities to further restrict unknown stride specialization. Not sure what probability may be assigned to an unknown stride being one, but another option is to optimize each gather/scatter with its unknown stride guard inside the loop, rather than loop-unswitching it into a preheader predicate.

Is this ok with you too @silviu.baranga ?

Yes, I agree as well.

My point was that a more general solution might be to turn off stride versioning, but that requires some more discussions and benchmarking.

LGTM

Closed by commit rL317438 (authored by dorit). Nov 5 2017, 8:53 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Analysis/LoopAccessAnalysis.cpp	46 lines
llvm/test/Transforms/LoopVectorize/pr34681.ll	122 lines
llvm/test/Transforms/LoopVectorize/version-mem-access.ll	6 lines

Diff 118544

llvm/lib/Analysis/LoopAccessAnalysis.cpp

else if (StoreInst *SI = dyn_cast<StoreInst>(MemAccess))		else if (StoreInst *SI = dyn_cast<StoreInst>(MemAccess))
Ptr = SI->getPointerOperand();		Ptr = SI->getPointerOperand();
else		else
return;		return;

Value *Stride = getStrideFromPointer(Ptr, PSE->getSE(), TheLoop);		Value *Stride = getStrideFromPointer(Ptr, PSE->getSE(), TheLoop);
if (!Stride)		if (!Stride)
return;		return;

DEBUG(dbgs() << "LAA: Found a strided access that we can version");		DEBUG(dbgs() << "LAA: Found a strided access that is a candidate for "
		"versioning:");
DEBUG(dbgs() << " Ptr: " << *Ptr << " Stride: " << *Stride << "\n");		DEBUG(dbgs() << " Ptr: " << *Ptr << " Stride: " << *Stride << "\n");

		// We don't want to add the Stride==1 predicate when Stride happens to be
		// equal to (or greater than) the loop trip count (TC); in such a scenario
		// we are adding a predicate that will allow executing the loop only if the
		// TC<=1, which will either (probably) not happen, or if does happen, no loop
		// optimization could gain much from the extra predicated-analysis on a loop
		// that executes a single iteration (or less):
		//
		// (1) "Stride==1" (the predicate that we are considering to add)
		// (2) "Stride>=TC" (the Stride may be >= the loop trip count)
		// If (1) and (2) coexist, it means that 1>=TC, in which case we avoid
		// adding the predicate and bail out.
		//
		// Since TC = BackEdgeTakenCount + 1, we can actually check the following:
		// Stride >= TC ==>
		// Stride >= BETakenCount + 1 ==>
		// Stride > BETakenCount ==>
		// Stride - BETakenCount > 0

		const SCEV *StrideExpr = PSE->getSCEV(Stride);
		const SCEV *BETakenCount = PSE->getBackedgeTakenCount();

		// Match the types so we can compare the stride and the BETakenCount.
		// The Stride can be positive/negative, so we sign extend Stride;
		// The backdgeTakenCount is non-negative, so we zero extend BETakenCount.
		const DataLayout &DL = TheLoop->getHeader()->getModule()->getDataLayout();
		uint64_t StrideTypeSize = DL.getTypeAllocSize(StrideExpr->getType());
		uint64_t BETypeSize = DL.getTypeAllocSize(BETakenCount->getType());
		const SCEV *CastedStride = StrideExpr;
		const SCEV *CastedBECount = BETakenCount;
		ScalarEvolution *SE = PSE->getSE();
		if (BETypeSize >= StrideTypeSize)
		CastedStride = SE->getNoopOrSignExtend(StrideExpr, BETakenCount->getType());
		else
		CastedBECount = SE->getZeroExtendExpr(BETakenCount, StrideExpr->getType());
		const SCEV *Minus = SE->getMinusSCEV(CastedStride, CastedBECount);
		if (SE->isKnownPositive(Minus)) {
		DEBUG(dbgs() << "LAA: Stride>=LoopCount; No point in versioning as the "
		"Stride==1 predicate will imply that the loop executes "
		"at most once.\n");
		return;
		}

SymbolicStrides[Ptr] = Stride;		SymbolicStrides[Ptr] = Stride;
StrideSet.insert(Stride);		StrideSet.insert(Stride);
}		}

LoopAccessInfo::LoopAccessInfo(Loop *L, ScalarEvolution *SE,		LoopAccessInfo::LoopAccessInfo(Loop *L, ScalarEvolution *SE,
const TargetLibraryInfo *TLI, AliasAnalysis *AA,		const TargetLibraryInfo *TLI, AliasAnalysis *AA,
DominatorTree *DT, LoopInfo *LI)		DominatorTree *DT, LoopInfo *LI)
: PSE(llvm::make_unique<PredicatedScalarEvolution>(*SE, *L)),		: PSE(llvm::make_unique<PredicatedScalarEvolution>(*SE, *L)),

llvm/test/Transforms/LoopVectorize/pr34681.ll

				; RUN: opt -S -loop-vectorize -force-vector-width=4 -force-vector-interleave=1 < %s | FileCheck %s

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

				; Check the scenario where we have an unknown Stride, which happens to also be
				; the loop iteration count, so if we specialize the loop for the Stride==1 case,
				; this also implies that the loop will iterate no more than a single iteration,
				; as in the following example:
				;
				; unsigned int N;
				; int tmp = 0;
				; for(unsigned int k=0;k<N;k++) {
				; tmp+=(int)B[k*N+j];
				; }
				;
				; We check here that the following runtine scev guard for Stride==1 is not generated:
				; vector.scevcheck:
				; %ident.check = icmp ne i32 %N, 1
				; %0 = or i1 false, %ident.check
				; br i1 %0, label %scalar.ph, label %vector.ph
				; Instead the loop is vectorized with an unknown stride.

				; CHECK-LABEL: @foo1
				; CHECK: for.body.lr.ph
				; CHECK-NOT: %ident.check = icmp ne i32 %N, 1
				; CHECK-NOT: %[[TEST:[0-9]+]] = or i1 false, %ident.check
				; CHECK-NOT: br i1 %[[TEST]], label %scalar.ph, label %vector.ph
				; CHECK: vector.ph
				; CHECK: vector.body
				; CHECK: <4 x i32>
				; CHECK: middle.block
				; CHECK: scalar.ph


				define i32 @foo1(i32 %N, i16* nocapture readnone %A, i16* nocapture readonly %B, i32 %i, i32 %j) {
				entry:
				%cmp8 = icmp eq i32 %N, 0
				br i1 %cmp8, label %for.end, label %for.body.lr.ph

				for.body.lr.ph:
				br label %for.body

				for.body:
				%tmp.010 = phi i32 [ 0, %for.body.lr.ph ], [ %add1, %for.body ]
				%k.09 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
				%mul = mul i32 %k.09, %N
				%add = add i32 %mul, %j
				%arrayidx = getelementptr inbounds i16, i16* %B, i32 %add
				%0 = load i16, i16* %arrayidx, align 2
				%conv = sext i16 %0 to i32
				%add1 = add nsw i32 %tmp.010, %conv
				%inc = add nuw i32 %k.09, 1
				%exitcond = icmp eq i32 %inc, %N
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit:
				%add1.lcssa = phi i32 [ %add1, %for.body ]
				br label %for.end

				for.end:
				%tmp.0.lcssa = phi i32 [ 0, %entry ], [ %add1.lcssa, %for.end.loopexit ]
				ret i32 %tmp.0.lcssa
				}


				; Check the same, but also where the Stride and the loop iteration count
				; are not of the same data type.
				;
				; unsigned short N;
				; int tmp = 0;
				; for(unsigned int k=0;k<N;k++) {
				; tmp+=(int)B[k*N+j];
				; }
				;
				; We check here that the following runtine scev guard for Stride==1 is not generated:
				; vector.scevcheck:
				; %ident.check = icmp ne i16 %N, 1
				; %0 = or i1 false, %ident.check
				; br i1 %0, label %scalar.ph, label %vector.ph


				; CHECK-LABEL: @foo2
				; CHECK: for.body.lr.ph
				; CHECK-NOT: %ident.check = icmp ne i16 %N, 1
				; CHECK-NOT: %[[TEST:[0-9]+]] = or i1 false, %ident.check
				; CHECK-NOT: br i1 %[[TEST]], label %scalar.ph, label %vector.ph
				; CHECK: vector.ph
				; CHECK: vector.body
				; CHECK: <4 x i32>
				; CHECK: middle.block
				; CHECK: scalar.ph

				define i32 @foo2(i16 zeroext %N, i16* nocapture readnone %A, i16* nocapture readonly %B, i32 %i, i32 %j) {
				entry:
				%conv = zext i16 %N to i32
				%cmp11 = icmp eq i16 %N, 0
				br i1 %cmp11, label %for.end, label %for.body.lr.ph

				for.body.lr.ph:
				br label %for.body

				for.body:
				%tmp.013 = phi i32 [ 0, %for.body.lr.ph ], [ %add4, %for.body ]
				%k.012 = phi i32 [ 0, %for.body.lr.ph ], [ %inc, %for.body ]
				%mul = mul nuw i32 %k.012, %conv
				%add = add i32 %mul, %j
				%arrayidx = getelementptr inbounds i16, i16* %B, i32 %add
				%0 = load i16, i16* %arrayidx, align 2
				%conv3 = sext i16 %0 to i32
				%add4 = add nsw i32 %tmp.013, %conv3
				%inc = add nuw nsw i32 %k.012, 1
				%exitcond = icmp eq i32 %inc, %conv
				br i1 %exitcond, label %for.end.loopexit, label %for.body

				for.end.loopexit:
				%add4.lcssa = phi i32 [ %add4, %for.body ]
				br label %for.end

				for.end:
				%tmp.0.lcssa = phi i32 [ 0, %entry ], [ %add4.lcssa, %for.end.loopexit ]
				ret i32 %tmp.0.lcssa
				}

llvm/test/Transforms/LoopVectorize/version-mem-access.ll

for.end:		for.end:
ret void		ret void
}		}

; We used to crash on this function because we removed the fptosi cast when		; We used to crash on this function because we removed the fptosi cast when
; replacing the symbolic stride '%conv'.		; replacing the symbolic stride '%conv'.
; PR18480		; PR18480

; CHECK-LABEL: fn1		; CHECK-LABEL: fn1
; CHECK: load <2 x double>		; CHECK: load <2 x double>

define void @fn1(double* noalias %x, double* noalias %c, double %a) {		define void @fn1(double* noalias %x, double* noalias %c, double %a, i32 %n) {
entry:		entry:
%conv = fptosi double %a to i32		%conv = fptosi double %a to i32
%cmp8 = icmp sgt i32 %conv, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %for.body.preheader, label %for.end		br i1 %cmp8, label %for.body.preheader, label %for.end

for.body.preheader:		for.body.preheader:
br label %for.body		br label %for.body

for.body:		for.body:
%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %for.body.preheader ]		%indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %for.body.preheader ]
%0 = trunc i64 %indvars.iv to i32		%0 = trunc i64 %indvars.iv to i32
%mul = mul nsw i32 %0, %conv		%mul = mul nsw i32 %0, %conv
%idxprom = sext i32 %mul to i64		%idxprom = sext i32 %mul to i64
%arrayidx = getelementptr inbounds double, double* %x, i64 %idxprom		%arrayidx = getelementptr inbounds double, double* %x, i64 %idxprom
%1 = load double, double* %arrayidx, align 8		%1 = load double, double* %arrayidx, align 8
%arrayidx3 = getelementptr inbounds double, double* %c, i64 %indvars.iv		%arrayidx3 = getelementptr inbounds double, double* %c, i64 %indvars.iv
store double %1, double* %arrayidx3, align 8		store double %1, double* %arrayidx3, align 8
%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1		%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
%lftr.wideiv = trunc i64 %indvars.iv.next to i32		%lftr.wideiv = trunc i64 %indvars.iv.next to i32
%exitcond = icmp eq i32 %lftr.wideiv, %conv		%exitcond = icmp eq i32 %lftr.wideiv, %n
br i1 %exitcond, label %for.end.loopexit, label %for.body		br i1 %exitcond, label %for.end.loopexit, label %for.body

for.end.loopexit:		for.end.loopexit:
br label %for.end		br label %for.end

for.end:		for.end:
ret void		ret void
}		}