This is an archive of the discontinued LLVM Phabricator instance.

Differential D53612

[LV] Avoid vectorizing loops under opt for size that involve SCEV checks
ClosedPublic

Authored by Ayal on Oct 23 2018, 2:18 PM.

Details

Reviewers

dorit
hsaito
dcaballe
fhahn
mkuper

Commits

rG45a3ca7be7f2: [LV] Avoid vectorizing loops under opt for size that involve SCEV checks
rL345959: [LV] Avoid vectorizing loops under opt for size that involve SCEV checks

Summary

The loop vectorizer may generate runtime SCEV checks for overflow and stride==1 cases, which require keeping the original scalar loop as the fallback that runs when a check fails. That scalar loop should be eliminated when optimizing for size. This patch fixes the behavior by preventing vectorization in such cases.

Reported by @uabelho in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20181022/596443.html
The issue also occurs without the "fold tail" patch, as demonstrated by the additional tests.
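To make the code-size concern concrete, here is a minimal hand-written sketch of what versioning a loop on a runtime stride==1 check amounts to. It is not the vectorizer's output; the function name `copyStrided` and its parameters are hypothetical, and the "vector" version is written as a plain loop over consecutive elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration of versioning a loop on a runtime stride check.
// Returns true when the stride==1 (vectorizable) version ran.
bool copyStrided(std::vector<int> &a, const std::vector<int> &b,
                 std::size_t n, std::size_t k) {
  assert(a.size() >= n && "destination too small");
  if (k == 1) {
    // Runtime check passed: b[i * k] is just b[i], so the accesses are
    // consecutive and a vectorizer could use wide loads and stores here.
    assert(b.size() >= n && "source too small");
    for (std::size_t i = 0; i < n; ++i)
      a[i] = b[i];
    return true;
  }
  // Runtime check failed: fall back to the original scalar loop. Keeping this
  // second copy of the loop is the size cost that matters under -Os/-Oz.
  assert((n == 0 || b.size() > (n - 1) * k) && "source too small");
  for (std::size_t i = 0; i < n; ++i)
    a[i] = b[i * k];
  return false;
}
```

Both loop bodies have to be present in the binary because `k` is only known at run time, which is why bailing out is the size-friendly choice.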

Diff Detail

Repository: rL LLVM

Event Timeline

Ayal created this revision. Oct 23 2018, 2:18 PM

Herald added subscribers: llvm-commits, rkruppe, javed.absar. Oct 23 2018, 2:18 PM

A PR for the problem reported by @uabelho can be found here: https://bugs.llvm.org/show_bug.cgi?id=39417

Just minor comments on the tests.
LGTM.

test/Transforms/LoopVectorize/pr30654-phiscev-sext-trunc.ll
245 ↗	(On Diff #170750)	"with tiny trip count / under opt for size" --> "with tiny trip count (which implies opt for size).".
250 ↗	(On Diff #170750)	(can mention now that this is pr39417)
254 ↗	(On Diff #170750)	I was once asked to remove the "; preds = ..." comments if they are not needed. (I see a lot of tests including these comments, but this specific file doesn't).
267 ↗	(On Diff #170750)	Now that there's a PR opened for this scenario, you may want to consider moving these two loops to a separate file (pr39417-optsize-and-scevchecks.ll ?); I know that there's a desire to not open separate files for each testcase, but this specific file deals with sext-trunc overflow tests (as the name of the test implies), so may be less appropriate to include the stride==1 testcase here?

This revision is now accepted and ready to land. Oct 24 2018, 1:56 PM

The patch will avoid the assert seen in PR39417 so that is great.
(It is hard to tell whether the assert can still be hit; PR39417 was triggered during fuzz testing, using csmith to generate the input.)

One thing I noticed is that if I use the test case from PR39417 and add -vectorizer-min-trip-count=3, to avoid the detection of a "very small trip count", the loop will be vectorized with VF=16. That is also what happened when we triggered the assert (without this patch). Shouldn't the VF be clamped to the trip count?
It seems like the vectorizer detects that the trip count is tiny (trip count is 3), but it still vectorizes using VF=16, and then the vectorized loop is skipped since we emit br i1 true, label %scalar.ph, label %vector.scevcheck. So all the hard work of vectorizing the loop is just a waste of time, or could it be beneficial to have VF > tripcount in some cases?

If the actual problem is that VF should be clamped to the trip count, then maybe this patch just hides that problem in certain cases (when having OptForSize).

In D53612#1275437, @bjope wrote:

The patch will avoid the assert seen in PR39417 so that is great.
(It is hard to tell whether the assert can still be hit; PR39417 was triggered during fuzz testing, using csmith to generate the input.)

One thing I noticed is that if I use the test case from PR39417 and add -vectorizer-min-trip-count=3, to avoid the detection of a "very small trip count", the loop will be vectorized with VF=16. That is also what happened when we triggered the assert (without this patch). Shouldn't the VF be clamped to the trip count?
It seems like the vectorizer detects that the trip count is tiny (trip count is 3), but it still vectorizes using VF=16, and then the vectorized loop is skipped since we emit br i1 true, label %scalar.ph, label %vector.scevcheck. So all the hard work of vectorizing the loop is just a waste of time, or could it be beneficial to have VF > tripcount in some cases?

If the actual problem is that VF should be clamped to the trip count, then maybe this patch just hides that problem in certain cases (when having OptForSize).

VF doesn't have to be clamped to the trip count, but it should be clamped to a reasonable multiple of the natural vector size (e.g., executing a 3-iteration loop using a 4-way vector could be better than using a 2-way vector if the 2-way vector is less than a full vector). We also need to ensure we'll hit the vector code if the vectorizer intentionally used a VF that is greater than the constant trip count. Please file a separate ticket.

Addressed comments.
Added another test case derived from PR39497.
Updated to trunk before committing.

Ayal marked 4 inline comments as done. Nov 1 2018, 5:13 PM

Closed by commit rL345959: [LV] Avoid vectorizing loops under opt for size that involve SCEV checks (authored by ayalz). Nov 2 2018, 2:18 AM

This revision was automatically updated to reflect the committed changes.

In D53612#1275437, @bjope wrote:

...
One thing I noticed is that if I use the test case from PR39417 and add -vectorizer-min-trip-count=3, to avoid the detection of a "very small trip count", the loop will be vectorized with VF=16. That is also what happened when we triggered the assert (without this patch). Shouldn't the VF be clamped to the trip count?
It seems like the vectorizer detects that the trip count is tiny (trip count is 3), but it still vectorizes using VF=16, and then the vectorized loop is skipped since we emit br i1 true, label %scalar.ph, label %vector.scevcheck. So all the hard work of vectorizing the loop is just a waste of time, or could it be beneficial to have VF > tripcount in some cases?

If the actual problem is that VF should be clamped to the trip count, then maybe this patch just hides that problem in certain cases (when having OptForSize).

Interesting :-)

If the tail is folded by masking, then VFs greater than the trip count are potentially relevant.
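As a rough illustration of why a VF larger than the trip count can still do useful work under tail folding, the following sketch models a masked vector loop in scalar C++. The function `addOneMasked` and the "add one" body are hypothetical stand-ins; real masked code would use predicated vector instructions rather than an inner lane loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a tail-folded vector loop: each "vector iteration"
// covers vf lanes, and a per-lane mask disables the lanes that fall at or
// beyond the trip count. Returns the number of vector iterations executed.
std::size_t addOneMasked(std::vector<int> &data, std::size_t vf) {
  assert(vf > 0 && "vectorization factor must be positive");
  const std::size_t tripCount = data.size();
  std::size_t vectorIterations = 0;
  for (std::size_t base = 0; base < tripCount; base += vf) {
    for (std::size_t lane = 0; lane < vf; ++lane) {
      const bool maskBit = base + lane < tripCount; // lane is active
      if (maskBit)
        data[base + lane] += 1;
    }
    ++vectorIterations;
  }
  return vectorIterations;
}
```

With a trip count of 3, this model needs one vector iteration at VF=4 or VF=16 and two at VF=2; no scalar remainder loop is needed in any of those cases.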

If tail is not folded by masking, it's indeed futile to use any VF(*UF) greater than the trip count. LV doesn't really notice this; IRBuilder does, when emitMinimumIterationCountCheck() asks it to

CheckMinIters = Builder.CreateICmp(
    P, Count, ConstantInt::get(Count->getType(), VF * UF),
    "min.iters.check");

it simply sets CheckMinIters to 'true'.
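The folding described here can be modelled without LLVM. The sketch below is an assumption-laden stand-in: it treats the predicate P as unsigned less-than (the real predicate may differ), and `foldMinItersCheck` and the `MinItersCheck` enum are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Possible outcomes of building the "min.iters.check" comparison.
enum class MinItersCheck { AlwaysTrue, AlwaysFalse, Runtime };

// Hypothetical model of the comparison Count < VF * UF. When the trip count
// is a compile-time constant the comparison folds to a constant; otherwise a
// runtime compare instruction has to be emitted.
MinItersCheck foldMinItersCheck(bool countIsConstant, std::uint64_t count,
                                std::uint64_t vf, std::uint64_t uf) {
  assert(vf > 0 && uf > 0 && "VF and UF must be positive");
  if (!countIsConstant)
    return MinItersCheck::Runtime;
  // AlwaysTrue means the branch to the scalar loop is unconditional, so the
  // vector loop is never entered.
  return count < vf * uf ? MinItersCheck::AlwaysTrue
                         : MinItersCheck::AlwaysFalse;
}
```

For the case discussed above (constant trip count 3, VF=16, UF=1) the model yields `AlwaysTrue`, which corresponds to the `br i1 true, label %scalar.ph, ...` branch quoted earlier.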

Another aspect related to known trip counts (smaller than VF): they should be used when comparing VectorCost to ScalarCost, instead of taking

float VectorCost = C.first / (float)i;
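One possible reading of this suggestion, sketched under the assumption that the tail is folded by masking so a single vector iteration covers the whole loop: divide the vector body cost by the number of lanes that do useful work rather than by VF. The helper `perIterationVectorCost` and the convention that a trip count of 0 means "unknown" are hypothetical, not part of the cost model.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-scalar-iteration cost of a vector loop body.
// vectorBodyCost plays the role of C.first and vf the role of i in the
// snippet above. knownTripCount == 0 means the trip count is unknown.
float perIterationVectorCost(float vectorBodyCost, unsigned vf,
                             unsigned knownTripCount) {
  assert(vf > 0 && "vectorization factor must be positive");
  unsigned usefulLanes = vf;
  if (knownTripCount != 0)
    usefulLanes = std::min(vf, knownTripCount); // idle lanes do no work
  return vectorBodyCost / static_cast<float>(usefulLanes);
}
```

With a body cost of 16 and VF=16, the plain division reports a cost of 1 per iteration, while a known trip count of 4 raises that to 4, which makes the wide VF look much less attractive against the scalar cost.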

In any case, this patch deals with trip counts only indirectly, as they trigger OptForSize when small. Indeed worthy of a separate ticket/patch.

Revision Contents

Path	Size
llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp	26 lines
llvm/trunk/test/Transforms/LoopVectorize/X86/optsize.ll	60 lines
llvm/trunk/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll	54 lines

Diff 172325

llvm/trunk/lib/Transforms/Vectorize/LoopVectorize.cpp

[2,551 lines not shown]
   SCEVExpander Exp(*PSE.getSE(), Bypass->getModule()->getDataLayout(),
                    "scev.check");
   Value *SCEVCheck =
       Exp.expandCodeForPredicate(&PSE.getUnionPredicate(), BB->getTerminator());

   if (auto *C = dyn_cast<ConstantInt>(SCEVCheck))
     if (C->isZero())
       return;

-  assert(!Cost->foldTailByMasking() && "Cannot check stride when folding tail");
+  assert(!Cost->foldTailByMasking() &&
+         "Cannot SCEV check stride or overflow when folding tail");
   // Create a new block containing the stride check.
   BB->setName("vector.scevcheck");
   auto *NewBB = BB->splitBasicBlock(BB->getTerminator(), "vector.ph");
   // Update dominator tree immediately if the generated block is a
   // LoopBypassBlock because SCEV expansions to generate loop bypass
   // checks may query it before the current function is finished.
   DT->addNewBlock(NewBB, BB);
   if (L->getParentLoop())
[2,063 lines not shown]
     ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
                  "loop with '#pragma clang loop vectorize(enable)' when "
                  "compiling with -Os/-Oz");
     LLVM_DEBUG(
         dbgs()
         << "LV: Aborting. Runtime ptr check is required with -Os/-Oz.\n");
     return None;
   }

+  if (!PSE.getUnionPredicate().getPredicates().empty()) {
+    ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
+              << "runtime SCEV checks needed. Enable vectorization of this "
+                 "loop with '#pragma clang loop vectorize(enable)' when "
+                 "compiling with -Os/-Oz");
+    LLVM_DEBUG(
+        dbgs()
+        << "LV: Aborting. Runtime SCEV check is required with -Os/-Oz.\n");
+    return None;
+  }
+
+  // FIXME: Avoid specializing for stride==1 instead of bailing out.
+  if (!Legal->getLAI()->getSymbolicStrides().empty()) {
+    ORE->emit(createMissedAnalysis("CantVersionLoopWithOptForSize")
+              << "runtime stride == 1 checks needed. Enable vectorization of "
+                 "this loop with '#pragma clang loop vectorize(enable)' when "
+                 "compiling with -Os/-Oz");
+    LLVM_DEBUG(
+        dbgs()
+        << "LV: Aborting. Runtime stride check is required with -Os/-Oz.\n");
+    return None;
+  }
+
   // If we optimize the program for size, avoid creating the tail loop.
   LLVM_DEBUG(dbgs() << "LV: Found trip count: " << TC << '\n');

   if (TC == 1) {
     ORE->emit(createMissedAnalysis("SingleIterationLoop")
               << "loop trip count is one, irrelevant for vectorization");
     LLVM_DEBUG(dbgs() << "LV: Aborting, single iteration (non) loop.\n");
     return None;
[2,856 lines not shown]

llvm/trunk/test/Transforms/LoopVectorize/X86/optsize.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; This test verifies that the loop vectorizer will NOT vectorize loops that
 ; will produce a tail loop with the optimize for size or the minimize size
 ; attributes. This is a target-dependent version of the test.
 ; RUN: opt < %s -loop-vectorize -force-vector-width=64 -S -mtriple=x86_64-unknown-linux -mcpu=skx | FileCheck %s
+; RUN: opt < %s -loop-vectorize -S -mtriple=x86_64-unknown-linux -mcpu=skx | FileCheck %s --check-prefix AUTOVF

 target datalayout = "E-m:e-p:32:32-i64:32-f64:32:64-a:0:32-n32-S128"

 @tab = common global [32 x i8] zeroinitializer, align 1

 define i32 @foo_optsize() #0 {
 ; CHECK-LABEL: @foo_optsize(
 ; CHECK-NEXT: entry:
[117 lines not shown]
 for.body: ; preds = %for.body, %entry
   br i1 %exitcond, label %for.end, label %for.body

 for.end: ; preds = %for.body
   ret i32 0
 }

 attributes #1 = { minsize }
+
+
+; We can't vectorize this one because we version for stride==1; even having TC
+; a multiple of VF.
+; CHECK-LABEL: @scev4stride1
+; CHECK-NOT: vector.scevcheck
+; CHECK-NOT: vector.body:
+; CHECK-LABEL: for.body:
+; AUTOVF-LABEL: @scev4stride1
+; AUTOVF-NOT: vector.scevcheck
+; AUTOVF-NOT: vector.body:
+; AUTOVF-LABEL: for.body:
+define void @scev4stride1(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %k) #2 {
+for.body.preheader:
+  br label %for.body
+
+for.body: ; preds = %for.body.preheader, %for.body
+  %i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
+  %mul = mul nsw i32 %i.07, %k
+  %arrayidx = getelementptr inbounds i32, i32* %b, i32 %mul
+  %0 = load i32, i32* %arrayidx, align 4
+  %arrayidx1 = getelementptr inbounds i32, i32* %a, i32 %i.07
+  store i32 %0, i32* %arrayidx1, align 4
+  %inc = add nuw nsw i32 %i.07, 1
+  %exitcond = icmp eq i32 %inc, 256
+  br i1 %exitcond, label %for.end.loopexit, label %for.body
+
+for.end.loopexit: ; preds = %for.body
+  ret void
+}
+
+attributes #2 = { optsize }
+
+
+; PR39497
+; We can't vectorize this one because we version for overflow check and tiny
+; trip count leads to opt-for-size (which otherwise could fold the tail by
+; masking).
+; CHECK-LABEL: @main
+; CHECK-NOT: vector.scevcheck
+; CHECK-NOT: vector.body:
+; CHECK-LABEL: for.cond:
+; AUTOVF-LABEL: @main
+; AUTOVF-NOT: vector.scevcheck
+; AUTOVF-NOT: vector.body:
+; AUTOVF-LABEL: for.cond:
+define i32 @main() local_unnamed_addr {
+while.cond:
+  br label %for.cond
+
+for.cond:
+  %d.0 = phi i32 [ 0, %while.cond ], [ %add, %for.cond ]
+  %conv = and i32 %d.0, 65535
+  %cmp = icmp ult i32 %conv, 4
+  %add = add nuw nsw i32 %conv, 1
+  br i1 %cmp, label %for.cond, label %while.cond.loopexit
+
+while.cond.loopexit:
+  ret i32 0
+}

llvm/trunk/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll

+; RUN: opt -S -loop-vectorize -force-vector-width=4 -force-vector-interleave=1 < %s | FileCheck %s
+
+target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
+
+; PR39417
+; Check that the need for overflow check prevents vectorizing a loop with tiny
+; trip count (which implies opt for size).
+; CHECK-LABEL: @func_34
+; CHECK-NOT: vector.scevcheck
+; CHECK-NOT: vector.body:
+; CHECK-LABEL: bb67:
+define void @func_34() {
+bb1:
+  br label %bb67
+
+bb67:
+  %storemerge2 = phi i32 [ 0, %bb1 ], [ %_tmp2300, %bb67 ]
+  %sext = shl i32 %storemerge2, 16
+  %_tmp2299 = ashr exact i32 %sext, 16
+  %_tmp2300 = add nsw i32 %_tmp2299, 1
+  %_tmp2310 = trunc i32 %_tmp2300 to i16
+  %_tmp2312 = icmp slt i16 %_tmp2310, 3
+  br i1 %_tmp2312, label %bb67, label %bb68
+
+bb68:
+  ret void
+}
+
+; Check that the need for stride==1 check prevents vectorizing a loop under opt
+; for size.
+; CHECK-LABEL: @scev4stride1
+; CHECK-NOT: vector.scevcheck
+; CHECK-NOT: vector.body:
+; CHECK-LABEL: for.body:
+define void @scev4stride1(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %k) #0 {
+for.body.preheader:
+  br label %for.body
+
+for.body:
+  %i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
+  %mul = mul nsw i32 %i.07, %k
+  %arrayidx = getelementptr inbounds i32, i32* %b, i32 %mul
+  %0 = load i32, i32* %arrayidx, align 4
+  %arrayidx1 = getelementptr inbounds i32, i32* %a, i32 %i.07
+  store i32 %0, i32* %arrayidx1, align 4
+  %inc = add nuw nsw i32 %i.07, 1
+  %exitcond = icmp eq i32 %inc, 1024
+  br i1 %exitcond, label %for.end.loopexit, label %for.body
+
+for.end.loopexit:
+  ret void
+}
+
+attributes #0 = { optsize }