This is an archive of the discontinued LLVM Phabricator instance.

Differential D81345

[LV] Vectorize without versioning-for-unit-stride under -Os/-Oz
ClosedPublic

Authored by Ayal on Jun 7 2020, 9:05 AM.

Details

Reviewers

SjoerdMeijer
fhahn
gilr

Commits

rG7bf299c8d8d5: [LV] Vectorize without versioning-for-unit-stride under -Os/-Oz

Summary

If a loop is in a function marked OptSize, Loop Access Analysis should refrain from generating runtime checks for unit strides that will version the loop.

If a loop is in a function marked OptSize and its vectorization is enabled, it should be vectorized w/o any versioning.

Fixes PR46228.
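
To make "versioning for unit stride" concrete, here is a minimal sketch in plain C++ (not LLVM code). The function names are hypothetical, and the 256-iteration trip count and the `b[i * k]` access pattern are borrowed from the `scev4stride1` test as an illustration only. The versioned form keeps two copies of the loop behind a runtime `k == 1` check, which is the code growth that -Os/-Oz wants to avoid; the single-copy form keeps one loop that is valid for any stride.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a loop with a symbolic stride k: a[i] = b[i * k].
constexpr std::size_t kTripCount = 256;

// One copy of the loop, valid for any stride k >= 1. A vectorizer that keeps
// only this copy has to treat b[i * k] as a strided (non-consecutive) access.
void copyStridedSingleCopy(int *a, const int *b, std::size_t k) {
  for (std::size_t i = 0; i < kTripCount; ++i)
    a[i] = b[i * k];
}

// Versioned for stride == 1: a runtime check picks between a unit-stride
// fast path and the original loop, so the loop body exists twice.
void copyStridedVersioned(int *a, const int *b, std::size_t k) {
  if (k == 1) {
    for (std::size_t i = 0; i < kTripCount; ++i)
      a[i] = b[i]; // consecutive loads, cheap to vectorize
  } else {
    for (std::size_t i = 0; i < kTripCount; ++i)
      a[i] = b[i * k];
  }
}

// Runs both variants on the same input (k must be at least 1) and reports
// whether they produce identical output.
bool sameResult(std::size_t k) {
  std::vector<int> b(kTripCount * k);
  for (std::size_t i = 0; i < b.size(); ++i)
    b[i] = static_cast<int>(i * 7 + 3);
  std::vector<int> a1(kTripCount), a2(kTripCount);
  copyStridedSingleCopy(a1.data(), b.data(), k);
  copyStridedVersioned(a2.data(), b.data(), k);
  return a1 == a2;
}
```

Both variants compute the same result for every stride; the difference is only in code size and in how cheaply the `k == 1` case can be vectorized.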

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Ayal created this revision. Jun 7 2020, 9:05 AM

Herald added a project: Restricted Project. Jun 7 2020, 9:06 AM

Herald added subscribers: llvm-commits, rkruppe, hiraditya.

Harbormaster completed remote builds in B59395: Diff 269061. Jun 7 2020, 10:38 AM

Thanks for fixing. This looks like a good and straightforward fix to me.

This revision is now accepted and ready to land. Jun 9 2020, 3:25 AM

fhahn added inline comments. Jun 9 2020, 4:28 AM

llvm/lib/Analysis/LoopAccessAnalysis.cpp
1838	I think it might be slightly preferable to let LV drive the decision whether to version or not based on cost estimates (LAA is also used by other passes, which might have different requirements). Did you consider disabling the code generation (I think it happens in `emitSCEVChecks`) conditionally on optsize? IIUC we should only need to generate code for predicates if we either need runtime checks or have symbolic strides, and runtime checks are already rejected in `runtimeChecksRequired`.

Ayal marked an inline comment as done. Jun 13 2020, 10:59 PM

Ayal added inline comments.

llvm/lib/Analysis/LoopAccessAnalysis.cpp
1838	Would it be better if LV passes an "only one copy of the loop is allowed" flag to analyzeLoop(), via the constructor of LAI, instead of the latter checking for OptSize? This requirement of OptSize is common to all other passes, right? Suppressing emitSCEVChecks would appease the assert, but LAI should also refrain from collecting strided accesses, so that getPtrStride() treats the related accesses as non-unit-strided. It already does so under a cl::opt flag, for all its users.

Herald added a subscriber: bmahjour. Jun 13 2020, 10:59 PM

bmahjour removed a subscriber: bmahjour. Jun 15 2020, 6:38 AM

fhahn added inline comments. Jun 24 2020, 2:36 PM

llvm/lib/Analysis/LoopAccessAnalysis.cpp
1838	> Would it be better if LV passes a "only one copy of the loop is allowed" flag to analyzeLoop(), via the constructor of LAI, instead of the latter checking for OptSize? This requirement of OptSize is common to all other passes, right?

(Sorry for the long delay!) I think something like that would be preferable, as it makes the interaction between LV and LAI explicit. But I am not sure that is possible at the moment, because we get LAI as an analysis. I am not sure there is a feasible alternative without too much refactoring, and it is probably not worth blocking the change on that, especially given that there is already precedent for using opt flags in a similar way.
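
The two designs discussed in this thread can be contrasted with a small sketch. Everything below is a hypothetical, simplified stand-in: `MockFunction`, `MockAccess` and `MockLoopAccessInfo` are not LLVM classes, and the real LoopAccessInfo does far more. The sketch only shows where the "may this loop be versioned?" decision lives: inside the analysis (as the patch does, by looking at the function's OptSize attribute) or in the client, which passes a flag through the constructor.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are real LLVM types.
struct MockFunction {
  bool HasOptSize;
};

struct MockAccess {
  std::string Name;
  bool HasSymbolicStride;
};

class MockLoopAccessInfo {
public:
  // Design B (the alternative raised in review): the client decides whether
  // versioning is allowed and passes that decision in explicitly.
  MockLoopAccessInfo(bool AllowVersioning,
                     const std::vector<MockAccess> &Accesses) {
    for (const MockAccess &A : Accesses)
      if (AllowVersioning && A.HasSymbolicStride)
        SymbolicStrides.push_back(A.Name);
  }

  // Design A (modelled on the patch): the analysis itself inspects the
  // enclosing function and disables versioning when it is marked OptSize.
  static MockLoopAccessInfo
  fromFunction(const MockFunction &F, const std::vector<MockAccess> &Accesses) {
    return MockLoopAccessInfo(!F.HasOptSize, Accesses);
  }

  // Accesses for which a stride == 1 version of the loop would be created.
  const std::vector<std::string> &getSymbolicStrides() const {
    return SymbolicStrides;
  }

private:
  std::vector<std::string> SymbolicStrides;
};
```

In either design, an OptSize function ends up with no collected symbolic strides, so a client has nothing to version on.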

Closed by commit rG7bf299c8d8d5: [LV] Vectorize without versioning-for-unit-stride under -Os/-Oz (authored by Ayal). Jul 7 2020, 5:05 AM

This revision was automatically updated to reflect the committed changes.

Ayal marked an inline comment as done. Jul 7 2020, 5:10 AM

Ayal added inline comments.

llvm/lib/Analysis/LoopAccessAnalysis.cpp
1838	OK, thanks, unblocked the change :-)

Hi,

I started seeing a crash with this patch:
https://bugs.llvm.org/show_bug.cgi?id=46652

Ayal mentioned this in D83470: [LV] Fix versioning-for-unit-stide of loops with small trip count. Jul 9 2020, 3:29 AM

Ayal mentioned this in rG82a5157ff165: [LV] Fixing versioning-for-unit-stide of loops with small trip count. Jul 12 2020, 10:42 AM

hjyamauchi mentioned this in D85784: [PGO][PGSO][LV] Fix loop not vectorized issue under profile guided size opts. Aug 11 2020, 2:22 PM

hjyamauchi mentioned this in rGab401a8c8a9c: [PGO][PGSO][LV] Fix loop not vectorized issue under profile guided size opts. Aug 19 2020, 12:34 PM

Revision Contents

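Path                                                              Size
llvm/lib/Analysis/LoopAccessAnalysis.cpp                          8 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp                   13 lines
llvm/test/Transforms/LoopVectorize/X86/optsize.ll                 45 lines
llvm/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll       187 lines
llvm/test/Transforms/LoopVectorize/optsize.ll                     69 lines
llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll  56 lines
llvm/test/Transforms/LoopVectorize/runtime-check.ll               4 lines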

Diff 276002

llvm/lib/Analysis/LoopAccessAnalysis.cpp

@@ [1,829 lines not shown] void LoopAccessInfo::analyzeLoop(AAResults *AA, LoopInfo *LI,
   // A runtime check is only legal to insert if there are no convergent calls.
   HasConvergentOp = false;

   PtrRtChecking->Pointers.clear();
   PtrRtChecking->Need = false;

   const bool IsAnnotatedParallel = TheLoop->isAnnotatedParallel();

+  const bool EnableMemAccessVersioningOfLoop =
+      EnableMemAccessVersioning &&
+      !TheLoop->getHeader()->getParent()->hasOptSize();
+
   // For each block.
   for (BasicBlock *BB : TheLoop->blocks()) {
     // Scan the BB and collect legal loads and stores. Also detect any
     // convergent instructions.
     for (Instruction &I : *BB) {
       if (auto *Call = dyn_cast<CallBase>(&I)) {
         if (Call->isConvergent())
           HasConvergentOp = true;
@@ [39 lines not shown] for (Instruction &I : *BB) {
               << "read with atomic ordering or volatile read";
           LLVM_DEBUG(dbgs() << "LAA: Found a non-simple load.\n");
           HasComplexMemInst = true;
           continue;
         }
         NumLoads++;
         Loads.push_back(Ld);
         DepChecker->addAccess(Ld);
-        if (EnableMemAccessVersioning)
+        if (EnableMemAccessVersioningOfLoop)
           collectStridedAccess(Ld);
         continue;
       }

       // Save 'store' instructions. Abort if other instructions write to memory.
       if (I.mayWriteToMemory()) {
         auto *St = dyn_cast<StoreInst>(&I);
         if (!St) {
           recordAnalysis("CantVectorizeInstruction", St)
               << "instruction cannot be vectorized";
           HasComplexMemInst = true;
           continue;
         }
         if (!St->isSimple() && !IsAnnotatedParallel) {
           recordAnalysis("NonSimpleStore", St)
               << "write with atomic ordering or volatile write";
           LLVM_DEBUG(dbgs() << "LAA: Found a non-simple store.\n");
           HasComplexMemInst = true;
           continue;
         }
         NumStores++;
         Stores.push_back(St);
         DepChecker->addAccess(St);
-        if (EnableMemAccessVersioning)
+        if (EnableMemAccessVersioningOfLoop)
           collectStridedAccess(St);
       }
     } // Next instr.
   } // Next block.

   if (HasComplexMemInst) {
     CanVecMem = false;
     return;

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ [4,931 lines not shown] if (!PSE.getUnionPredicate().getPredicates().empty()) {
     reportVectorizationFailure("Runtime SCEV check is required with -Os/-Oz",
         "runtime SCEV checks needed. Enable vectorization of this "
         "loop with '#pragma clang loop vectorize(enable)' when "
         "compiling with -Os/-Oz",
         "CantVersionLoopWithOptForSize", ORE, TheLoop);
     return true;
   }

-  // FIXME: Avoid specializing for stride==1 instead of bailing out.
-  if (!Legal->getLAI()->getSymbolicStrides().empty()) {
-    reportVectorizationFailure("Runtime stride check is required with -Os/-Oz",
-        "runtime stride == 1 checks needed. Enable vectorization of "
-        "this loop with '#pragma clang loop vectorize(enable)' when "
-        "compiling with -Os/-Oz",
-        "CantVersionLoopWithOptForSize", ORE, TheLoop);
-    return true;
-  }
+  assert(Legal->getLAI()->getSymbolicStrides().empty() &&
+         "Specializing for stride == 1 under -Os/-Oz");

   return false;
 }

 Optional<unsigned> LoopVectorizationCostModel::computeMaxVF(unsigned UserVF,
                                                             unsigned UserIC) {
   if (Legal->getRuntimePointerChecking()->Need && TTI.hasBranchDivergence()) {
     // TODO: It may by useful to do since it's still likely to be dynamically
@@ [2,649 lines not shown] static ScalarEpilogueLowering getScalarEpilogueLowering(
     BlockFrequencyInfo *BFI, TargetTransformInfo *TTI, TargetLibraryInfo *TLI,
     AssumptionCache *AC, LoopInfo *LI, ScalarEvolution *SE, DominatorTree *DT,
     LoopVectorizationLegality &LVL) {
   bool OptSize =
       F->hasOptSize() || llvm::shouldOptimizeForSize(L->getHeader(), PSI, BFI,
                                                      PGSOQueryType::IRPass);
   // 1) OptSize takes precedence over all other options, i.e. if this is set,
   // don't look at hints or options, and don't request a scalar epilogue.
-  if (OptSize && Hints.getForce() != LoopVectorizeHints::FK_Enabled)
+  if (OptSize)
     return CM_ScalarEpilogueNotAllowedOptSize;

   bool PredicateOptDisabled = PreferPredicateOverEpilog.getNumOccurrences() &&
                               !PreferPredicateOverEpilog;

   // 2) Next, if disabling predication is requested on the command line, honour
   // this and request a scalar epilogue.
   if (PredicateOptDisabled)
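
The change to rule 1) of `getScalarEpilogueLowering` can be isolated in a small standalone sketch. The enums and function names below are hypothetical simplifications, not the LLVM definitions: `ForceKind` stands in for the loop hint's force state, and `DecidedByLaterRules` stands in for "fall through to rule 2) and onwards", which the real function goes on to evaluate. Under those assumptions, the sketch shows that a loop with forced vectorization in an OptSize function used to skip rule 1) and now no longer does.

```cpp
// Hypothetical simplifications of the hint state and the lowering result.
enum class ForceKind { Undefined, Disabled, Enabled };
enum class EpilogueLowering { NotAllowedOptSize, DecidedByLaterRules };

// Rule 1) before the patch: forced vectorization bypassed the OptSize rule.
EpilogueLowering loweringBefore(bool OptSize, ForceKind Force) {
  if (OptSize && Force != ForceKind::Enabled)
    return EpilogueLowering::NotAllowedOptSize;
  return EpilogueLowering::DecidedByLaterRules;
}

// Rule 1) after the patch: OptSize alone rules out a scalar epilogue,
// regardless of the force hint.
EpilogueLowering loweringAfter(bool OptSize, ForceKind /*Force*/) {
  if (OptSize)
    return EpilogueLowering::NotAllowedOptSize;
  return EpilogueLowering::DecidedByLaterRules;
}
```

The only input on which the two versions differ is `OptSize == true` together with `ForceKind::Enabled`, which matches the summary's statement that an enabled loop in an OptSize function should be vectorized without any versioning.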

llvm/test/Transforms/LoopVectorize/X86/optsize.ll

@@ [212 lines not shown]

 for.end: ; preds = %for.body
   ret i32 0
 }

 attributes #1 = { minsize }


-; We can't vectorize this one because we version for stride==1; even having TC
-; a multiple of VF.
+; We can vectorize this one by refraining from versioning for stride==1.
 define void @scev4stride1(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %k) #2 {
 ; CHECK-LABEL: @scev4stride1(
 ; CHECK-NEXT: for.body.preheader:
-; CHECK-NEXT: br label [[FOR_BODY:%.*]]
-; CHECK: for.body:
-; CHECK-NEXT: [[I_07:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER:%.*]] ]
-; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[I_07]], [[K:%.*]]
-; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i32 [[MUL]]
-; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
-; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 [[I_07]]
-; CHECK-NEXT: store i32 [[TMP0]], i32* [[ARRAYIDX1]], align 4
-; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_07]], 1
-; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], 256
-; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_END_LOOPEXIT:%.*]], label [[FOR_BODY]]
+; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <64 x i32> undef, i32 [[K:%.*]], i32 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <64 x i32> [[BROADCAST_SPLATINSERT]], <64 x i32> undef, <64 x i32> zeroinitializer
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK: middle.block:
+; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 256, 256
+; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
+; CHECK: scalar.ph:
 ; CHECK: for.end.loopexit:
 ; CHECK-NEXT: ret void
 ;
 ; AUTOVF-LABEL: @scev4stride1(
 ; AUTOVF-NEXT: for.body.preheader:
-; AUTOVF-NEXT: br label [[FOR_BODY:%.*]]
-; AUTOVF: for.body:
-; AUTOVF-NEXT: [[I_07:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[FOR_BODY_PREHEADER:%.*]] ]
-; AUTOVF-NEXT: [[MUL:%.*]] = mul nsw i32 [[I_07]], [[K:%.*]]
-; AUTOVF-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i32 [[MUL]]
-; AUTOVF-NEXT: [[TMP0:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
-; AUTOVF-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 [[I_07]]
-; AUTOVF-NEXT: store i32 [[TMP0]], i32* [[ARRAYIDX1]], align 4
-; AUTOVF-NEXT: [[INC]] = add nuw nsw i32 [[I_07]], 1
-; AUTOVF-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], 256
-; AUTOVF-NEXT: br i1 [[EXITCOND]], label [[FOR_END_LOOPEXIT:%.*]], label [[FOR_BODY]]
+; AUTOVF-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; AUTOVF: vector.ph:
+; AUTOVF-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i32> undef, i32 [[K:%.*]], i32 0
+; AUTOVF-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i32> [[BROADCAST_SPLATINSERT]], <8 x i32> undef, <8 x i32> zeroinitializer
+; AUTOVF-NEXT: br label [[VECTOR_BODY:%.*]]
+; AUTOVF: vector.body:
+; AUTOVF: middle.block:
+; AUTOVF-NEXT: [[CMP_N:%.*]] = icmp eq i32 256, 256
+; AUTOVF-NEXT: br i1 [[CMP_N]], label [[FOR_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
+; AUTOVF: scalar.ph:
 ; AUTOVF: for.end.loopexit:
 ; AUTOVF-NEXT: ret void
 ;
 for.body.preheader:
   br label %for.body

 for.body: ; preds = %for.body.preheader, %for.body
   %i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
@@ [58 lines not shown]

llvm/test/Transforms/LoopVectorize/X86/tail_loop_folding.ll

 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
-; RUN: opt < %s -loop-vectorize -S | FileCheck %s --check-prefixes=CHECK,DEFAULT
-; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilog -S | FileCheck %s --check-prefixes=CHECK,PREDFLAG
+; RUN: opt < %s -loop-vectorize -S | FileCheck %s
+; RUN: opt < %s -loop-vectorize -prefer-predicate-over-epilog -S | FileCheck %s

 target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
 target triple = "x86_64-unknown-linux-gnu"

 define dso_local void @tail_folding_enabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
 ; CHECK-LABEL: @tail_folding_enabled(
 ; CHECK-NEXT: entry:
 ; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
@@ [57 lines not shown] for.body:
   %add = add nsw i32 %1, %0
   %arrayidx4 = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
   store i32 %add, i32* %arrayidx4, align 4
   %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
   %exitcond = icmp eq i64 %indvars.iv.next, 430
   br i1 %exitcond, label %for.cond.cleanup, label %for.body, !llvm.loop !6
 }

+; Marking function as optsize turns tail folding on, as if explicit tail folding
+; flag was enabled.
 define dso_local void @tail_folding_disabled(i32* noalias nocapture %A, i32* noalias nocapture readonly %B, i32* noalias nocapture readonly %C) local_unnamed_addr #0 {
+; CHECK-LABEL: @tail_folding_disabled(
+; CHECK-NEXT: entry:
+; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
+; CHECK: vector.ph:
+; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
+; CHECK: vector.body:
+; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
+; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i64> undef, i64 [[INDEX]], i32 0
+; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT]], <8 x i64> undef, <8 x i32> zeroinitializer
+; CHECK-NEXT: [[INDUCTION:%.*]] = add <8 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
+; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
+; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 [[TMP0]]
+; CHECK-NEXT: [[TMP2:%.*]] = icmp ule <8 x i64> [[INDUCTION]], <i64 429, i64 429, i64 429, i64 429, i64 429, i64 429, i64 429, i64 429>
+; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i32 0
+; CHECK-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP3]] to <8 x i32>*
+; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* [[TMP4]], i32 4, <8 x i1> [[TMP2]], <8 x i32> undef)
+; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 [[TMP0]]
+; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP5]], i32 0
+; CHECK-NEXT: [[TMP7:%.*]] = bitcast i32* [[TMP6]] to <8 x i32>*
+; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* [[TMP7]], i32 4, <8 x i1> [[TMP2]], <8 x i32> undef)
+; CHECK-NEXT: [[TMP8:%.*]] = add nsw <8 x i32> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]
+; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 [[TMP0]]
+; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[TMP9]], i32 0
+; CHECK-NEXT: [[TMP11:%.*]] = bitcast i32* [[TMP10]] to <8 x i32>*
+; CHECK-NEXT: call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> [[TMP8]], <8 x i32>* [[TMP11]], i32 4, <8 x i1> [[TMP2]])
+; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 8
+; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], 432
+; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !4
+; CHECK: middle.block:
+; CHECK-NEXT: br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
+; CHECK: scalar.ph:
+; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 432, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
+; CHECK-NEXT: br label [[FOR_BODY:%.*]]
+; CHECK: for.cond.cleanup:
+; CHECK-NEXT: ret void
+; CHECK: for.body:
+; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
+; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP13:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
+; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 [[INDVARS_IV]]
+; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[ARRAYIDX2]], align 4
+; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP14]], [[TMP13]]
+; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
+; CHECK-NEXT: store i32 [[ADD]], i32* [[ARRAYIDX4]], align 4
+; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
+; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 430
+; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop !5
-; DEFAULT-LABEL: @tail_folding_disabled(
-; DEFAULT-NEXT: entry:
-; DEFAULT-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
-; DEFAULT: vector.ph:
-; DEFAULT-NEXT: br label [[VECTOR_BODY:%.*]]
-; DEFAULT: vector.body:
-; DEFAULT-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
-; DEFAULT-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
-; DEFAULT-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 8
-; DEFAULT-NEXT: [[TMP2:%.*]] = add i64 [[INDEX]], 16
-; DEFAULT-NEXT: [[TMP3:%.*]] = add i64 [[INDEX]], 24
-; DEFAULT-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 [[TMP0]]
-; DEFAULT-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP1]]
-; DEFAULT-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP2]]
-; DEFAULT-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[TMP3]]
-; DEFAULT-NEXT: [[TMP8:%.*]] = getelementptr inbounds i32, i32* [[TMP4]], i32 0
-; DEFAULT-NEXT: [[TMP9:%.*]] = bitcast i32* [[TMP8]] to <8 x i32>*
-; DEFAULT-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i32>, <8 x i32>* [[TMP9]], align 4
-; DEFAULT-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[TMP4]], i32 8
-; DEFAULT-NEXT: [[TMP11:%.*]] = bitcast i32* [[TMP10]] to <8 x i32>*
-; DEFAULT-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x i32>, <8 x i32>* [[TMP11]], align 4
-; DEFAULT-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, i32* [[TMP4]], i32 16
-; DEFAULT-NEXT: [[TMP13:%.*]] = bitcast i32* [[TMP12]] to <8 x i32>*
-; DEFAULT-NEXT: [[WIDE_LOAD2:%.*]] = load <8 x i32>, <8 x i32>* [[TMP13]], align 4
-; DEFAULT-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, i32* [[TMP4]], i32 24
-; DEFAULT-NEXT: [[TMP15:%.*]] = bitcast i32* [[TMP14]] to <8 x i32>*
-; DEFAULT-NEXT: [[WIDE_LOAD3:%.*]] = load <8 x i32>, <8 x i32>* [[TMP15]], align 4
-; DEFAULT-NEXT: [[TMP16:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 [[TMP0]]
-; DEFAULT-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 [[TMP1]]
-; DEFAULT-NEXT: [[TMP18:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 [[TMP2]]
-; DEFAULT-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 [[TMP3]]
-; DEFAULT-NEXT: [[TMP20:%.*]] = getelementptr inbounds i32, i32* [[TMP16]], i32 0
-; DEFAULT-NEXT: [[TMP21:%.*]] = bitcast i32* [[TMP20]] to <8 x i32>*
-; DEFAULT-NEXT: [[WIDE_LOAD4:%.*]] = load <8 x i32>, <8 x i32>* [[TMP21]], align 4
-; DEFAULT-NEXT: [[TMP22:%.*]] = getelementptr inbounds i32, i32* [[TMP16]], i32 8
-; DEFAULT-NEXT: [[TMP23:%.*]] = bitcast i32* [[TMP22]] to <8 x i32>*
-; DEFAULT-NEXT: [[WIDE_LOAD5:%.*]] = load <8 x i32>, <8 x i32>* [[TMP23]], align 4
-; DEFAULT-NEXT: [[TMP24:%.*]] = getelementptr inbounds i32, i32* [[TMP16]], i32 16
-; DEFAULT-NEXT: [[TMP25:%.*]] = bitcast i32* [[TMP24]] to <8 x i32>*
-; DEFAULT-NEXT: [[WIDE_LOAD6:%.*]] = load <8 x i32>, <8 x i32>* [[TMP25]], align 4
-; DEFAULT-NEXT: [[TMP26:%.*]] = getelementptr inbounds i32, i32* [[TMP16]], i32 24
-; DEFAULT-NEXT: [[TMP27:%.*]] = bitcast i32* [[TMP26]] to <8 x i32>*
-; DEFAULT-NEXT: [[WIDE_LOAD7:%.*]] = load <8 x i32>, <8 x i32>* [[TMP27]], align 4
-; DEFAULT-NEXT: [[TMP28:%.*]] = add nsw <8 x i32> [[WIDE_LOAD4]], [[WIDE_LOAD]]
-; DEFAULT-NEXT: [[TMP29:%.*]] = add nsw <8 x i32> [[WIDE_LOAD5]], [[WIDE_LOAD1]]
-; DEFAULT-NEXT: [[TMP30:%.*]] = add nsw <8 x i32> [[WIDE_LOAD6]], [[WIDE_LOAD2]]
-; DEFAULT-NEXT: [[TMP31:%.*]] = add nsw <8 x i32> [[WIDE_LOAD7]], [[WIDE_LOAD3]]
; DEFAULT-NEXT: [[TMP32:%.]] = getelementptr inbounds i32, i32 [[A:%.*]], i64 [[TMP0]]
; DEFAULT-NEXT: [[TMP33:%.]] = getelementptr inbounds i32, i32 [[A]], i64 [[TMP1]]
; DEFAULT-NEXT: [[TMP34:%.]] = getelementptr inbounds i32, i32 [[A]], i64 [[TMP2]]
; DEFAULT-NEXT: [[TMP35:%.]] = getelementptr inbounds i32, i32 [[A]], i64 [[TMP3]]
; DEFAULT-NEXT: [[TMP36:%.]] = getelementptr inbounds i32, i32 [[TMP32]], i32 0
; DEFAULT-NEXT: [[TMP37:%.]] = bitcast i32 [[TMP36]] to <8 x i32>*
; DEFAULT-NEXT: store <8 x i32> [[TMP28]], <8 x i32>* [[TMP37]], align 4
; DEFAULT-NEXT: [[TMP38:%.]] = getelementptr inbounds i32, i32 [[TMP32]], i32 8
; DEFAULT-NEXT: [[TMP39:%.]] = bitcast i32 [[TMP38]] to <8 x i32>*
; DEFAULT-NEXT: store <8 x i32> [[TMP29]], <8 x i32>* [[TMP39]], align 4
; DEFAULT-NEXT: [[TMP40:%.]] = getelementptr inbounds i32, i32 [[TMP32]], i32 16
; DEFAULT-NEXT: [[TMP41:%.]] = bitcast i32 [[TMP40]] to <8 x i32>*
; DEFAULT-NEXT: store <8 x i32> [[TMP30]], <8 x i32>* [[TMP41]], align 4
; DEFAULT-NEXT: [[TMP42:%.]] = getelementptr inbounds i32, i32 [[TMP32]], i32 24
; DEFAULT-NEXT: [[TMP43:%.]] = bitcast i32 [[TMP42]] to <8 x i32>*
; DEFAULT-NEXT: store <8 x i32> [[TMP31]], <8 x i32>* [[TMP43]], align 4
; DEFAULT-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 32
; DEFAULT-NEXT: [[TMP44:%.*]] = icmp eq i64 [[INDEX_NEXT]], 416
; DEFAULT-NEXT: br i1 [[TMP44]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !4
; DEFAULT: middle.block:
; DEFAULT-NEXT: [[CMP_N:%.*]] = icmp eq i64 430, 416
; DEFAULT-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
; DEFAULT: scalar.ph:
; DEFAULT-NEXT: [[BC_RESUME_VAL:%.]] = phi i64 [ 416, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.]] ]
; DEFAULT-NEXT: br label [[FOR_BODY:%.*]]
; DEFAULT: for.cond.cleanup:
; DEFAULT-NEXT: ret void
; DEFAULT: for.body:
; DEFAULT-NEXT: [[INDVARS_IV:%.]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.]], [[FOR_BODY]] ]
; DEFAULT-NEXT: [[ARRAYIDX:%.]] = getelementptr inbounds i32, i32 [[B]], i64 [[INDVARS_IV]]
; DEFAULT-NEXT: [[TMP45:%.]] = load i32, i32 [[ARRAYIDX]], align 4
; DEFAULT-NEXT: [[ARRAYIDX2:%.]] = getelementptr inbounds i32, i32 [[C]], i64 [[INDVARS_IV]]
; DEFAULT-NEXT: [[TMP46:%.]] = load i32, i32 [[ARRAYIDX2]], align 4
; DEFAULT-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP46]], [[TMP45]]
; DEFAULT-NEXT: [[ARRAYIDX4:%.]] = getelementptr inbounds i32, i32 [[A]], i64 [[INDVARS_IV]]
; DEFAULT-NEXT: store i32 [[ADD]], i32* [[ARRAYIDX4]], align 4
; DEFAULT-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
; DEFAULT-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 430
; DEFAULT-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop !5
;
; PREDFLAG-LABEL: @tail_folding_disabled(
; PREDFLAG-NEXT: entry:
; PREDFLAG-NEXT: br i1 false, label [[SCALAR_PH:%.]], label [[VECTOR_PH:%.]]
; PREDFLAG: vector.ph:
; PREDFLAG-NEXT: br label [[VECTOR_BODY:%.*]]
; PREDFLAG: vector.body:
; PREDFLAG-NEXT: [[INDEX:%.]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
; PREDFLAG-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <8 x i64> undef, i64 [[INDEX]], i32 0
; PREDFLAG-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <8 x i64> [[BROADCAST_SPLATINSERT]], <8 x i64> undef, <8 x i32> zeroinitializer
; PREDFLAG-NEXT: [[INDUCTION:%.*]] = add <8 x i64> [[BROADCAST_SPLAT]], <i64 0, i64 1, i64 2, i64 3, i64 4, i64 5, i64 6, i64 7>
; PREDFLAG-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; PREDFLAG-NEXT: [[TMP1:%.]] = getelementptr inbounds i32, i32 [[B:%.*]], i64 [[TMP0]]
; PREDFLAG-NEXT: [[TMP2:%.*]] = icmp ule <8 x i64> [[INDUCTION]], <i64 429, i64 429, i64 429, i64 429, i64 429, i64 429, i64 429, i64 429>
; PREDFLAG-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i32 0
; PREDFLAG-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP3]] to <8 x i32>*
; PREDFLAG-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* [[TMP4]], i32 4, <8 x i1> [[TMP2]], <8 x i32> undef)
; PREDFLAG-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 [[TMP0]]
; PREDFLAG-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP5]], i32 0
; PREDFLAG-NEXT: [[TMP7:%.*]] = bitcast i32* [[TMP6]] to <8 x i32>*
; PREDFLAG-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <8 x i32> @llvm.masked.load.v8i32.p0v8i32(<8 x i32>* [[TMP7]], i32 4, <8 x i1> [[TMP2]], <8 x i32> undef)
; PREDFLAG-NEXT: [[TMP8:%.*]] = add nsw <8 x i32> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]
; PREDFLAG-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 [[TMP0]]
; PREDFLAG-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[TMP9]], i32 0
; PREDFLAG-NEXT: [[TMP11:%.*]] = bitcast i32* [[TMP10]] to <8 x i32>*
; PREDFLAG-NEXT: call void @llvm.masked.store.v8i32.p0v8i32(<8 x i32> [[TMP8]], <8 x i32>* [[TMP11]], i32 4, <8 x i1> [[TMP2]])
; PREDFLAG-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], 8
; PREDFLAG-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], 432
; PREDFLAG-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !4
; PREDFLAG: middle.block:
; PREDFLAG-NEXT: br i1 true, label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
; PREDFLAG: scalar.ph:
; PREDFLAG-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 432, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; PREDFLAG-NEXT: br label [[FOR_BODY:%.*]]
; PREDFLAG: for.cond.cleanup:
; PREDFLAG-NEXT: ret void
; PREDFLAG: for.body:
; PREDFLAG-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
; PREDFLAG-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 [[INDVARS_IV]]
; PREDFLAG-NEXT: [[TMP13:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
; PREDFLAG-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 [[INDVARS_IV]]
; PREDFLAG-NEXT: [[TMP14:%.*]] = load i32, i32* [[ARRAYIDX2]], align 4
; PREDFLAG-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP14]], [[TMP13]]
; PREDFLAG-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 [[INDVARS_IV]]
; PREDFLAG-NEXT: store i32 [[ADD]], i32* [[ARRAYIDX4]], align 4
; PREDFLAG-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
; PREDFLAG-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 430
; PREDFLAG-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop !5
;		;
entry:		entry:
br label %for.body		br label %for.body

for.cond.cleanup:		for.cond.cleanup:
ret void		ret void

for.body:		for.body:

llvm/test/Transforms/LoopVectorize/optsize.ll

Show First 20 Lines • Show All 148 Lines • ▼ Show 20 Lines	loop:
%pivPlus1 = add nuw nsw i32 %piv, 1		%pivPlus1 = add nuw nsw i32 %piv, 1
%cond = icmp ult i32 %piv, 510		%cond = icmp ult i32 %piv, 510
br i1 %cond, label %loop, label %exit		br i1 %cond, label %loop, label %exit

exit:		exit:
ret i32 %for		ret i32 %for
}		}

		; PR46228: Vectorize w/o versioning for unit stride under optsize and enabled
		; vectorization.

		; NOTE: Some assertions have been autogenerated by utils/update_test_checks.py
		define void @stride1(i16* noalias %B, i32 %BStride) optsize {
		; CHECK-LABEL: @stride1(
		; CHECK-NEXT: entry:
		; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <2 x i32> undef, i32 [[BSTRIDE:%.*]], i32 0
		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <2 x i32> [[BROADCAST_SPLATINSERT]], <2 x i32> undef, <2 x i32> zeroinitializer
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[PRED_STORE_CONTINUE2:%.*]] ]
		; CHECK-NEXT: [[VEC_IND:%.*]] = phi <2 x i32> [ <i32 0, i32 1>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[PRED_STORE_CONTINUE2]] ]
		; CHECK-NEXT: [[TMP0:%.*]] = mul nsw <2 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
		; CHECK-NEXT: [[TMP1:%.*]] = icmp ule <2 x i32> [[VEC_IND]], <i32 1024, i32 1024>
		; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i1> [[TMP1]], i32 0
		; CHECK-NEXT: br i1 [[TMP2]], label [[PRED_STORE_IF:%.*]], label [[PRED_STORE_CONTINUE:%.*]]
		; CHECK: pred.store.if:
		; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP0]], i32 0
		; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i16, i16* [[B:%.*]], i32 [[TMP3]]
		; CHECK-NEXT: store i16 42, i16* [[TMP4]], align 4
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE]]
		; CHECK: pred.store.continue:
		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x i1> [[TMP1]], i32 1
		; CHECK-NEXT: br i1 [[TMP5]], label [[PRED_STORE_IF1:%.*]], label [[PRED_STORE_CONTINUE2]]
		; CHECK: pred.store.if1:
		; CHECK-NEXT: [[TMP6:%.*]] = extractelement <2 x i32> [[TMP0]], i32 1
		; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, i16* [[B]], i32 [[TMP6]]
		; CHECK-NEXT: store i16 42, i16* [[TMP7]], align 4
		; CHECK-NEXT: br label [[PRED_STORE_CONTINUE2]]
		; CHECK: pred.store.continue2:
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 2
		; CHECK-NEXT: [[VEC_IND_NEXT]] = add <2 x i32> [[VEC_IND]], <i32 2, i32 2>
		; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1026
		; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !19
		; CHECK: middle.block:
		; CHECK-NEXT: br i1 true, label [[FOR_END:%.*]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK: for.end:
		; CHECK-NEXT: ret void
		;
		; PGSO-LABEL: @stride1(
		; PGSO-NEXT: entry:
		; PGSO-NEXT: br i1 false, label %scalar.ph, label %vector.ph
		;
		; NPGSO-LABEL: @stride1(
		; NPGSO-NEXT: entry:
		; NPGSO-NEXT: br i1 false, label %scalar.ph, label %vector.ph

		entry:
		br label %for.body

		for.body:
		%iv = phi i32 [ %iv.next, %for.body ], [ 0, %entry ]
		%mulB = mul nsw i32 %iv, %BStride
		%gepOfB = getelementptr inbounds i16, i16* %B, i32 %mulB
		store i16 42, i16* %gepOfB, align 4
		%iv.next = add nuw nsw i32 %iv, 1
		%exitcond = icmp eq i32 %iv.next, 1025
		br i1 %exitcond, label %for.end, label %for.body, !llvm.loop !15

		for.end:
		ret void
		}

!llvm.module.flags = !{!0}		!llvm.module.flags = !{!0}
!0 = !{i32 1, !"ProfileSummary", !1}		!0 = !{i32 1, !"ProfileSummary", !1}
!1 = !{!2, !3, !4, !5, !6, !7, !8, !9}		!1 = !{!2, !3, !4, !5, !6, !7, !8, !9}
!2 = !{!"ProfileFormat", !"InstrProf"}		!2 = !{!"ProfileFormat", !"InstrProf"}
!3 = !{!"TotalCount", i64 10000}		!3 = !{!"TotalCount", i64 10000}
!4 = !{!"MaxCount", i64 10}		!4 = !{!"MaxCount", i64 10}
!5 = !{!"MaxInternalCount", i64 1}		!5 = !{!"MaxInternalCount", i64 1}
!6 = !{!"MaxFunctionCount", i64 1000}		!6 = !{!"MaxFunctionCount", i64 1000}
!7 = !{!"NumCounts", i64 3}		!7 = !{!"NumCounts", i64 3}
!8 = !{!"NumFunctions", i64 3}		!8 = !{!"NumFunctions", i64 3}
!9 = !{!"DetailedSummary", !10}		!9 = !{!"DetailedSummary", !10}
!10 = !{!11, !12, !13}		!10 = !{!11, !12, !13}
!11 = !{i32 10000, i64 100, i32 1}		!11 = !{i32 10000, i64 100, i32 1}
!12 = !{i32 999000, i64 100, i32 1}		!12 = !{i32 999000, i64 100, i32 1}
!13 = !{i32 999999, i64 1, i32 2}		!13 = !{i32 999999, i64 1, i32 2}
!14 = !{!"function_entry_count", i64 0}		!14 = !{!"function_entry_count", i64 0}
		!15 = distinct !{!15, !16}
		!16 = !{!"llvm.loop.vectorize.enable", i1 true}

llvm/test/Transforms/LoopVectorize/pr39417-optsize-scevchecks.ll

Show All 20 Lines	bb67:
%_tmp2310 = trunc i32 %_tmp2300 to i16		%_tmp2310 = trunc i32 %_tmp2300 to i16
%_tmp2312 = icmp slt i16 %_tmp2310, 3		%_tmp2312 = icmp slt i16 %_tmp2310, 3
br i1 %_tmp2312, label %bb67, label %bb68		br i1 %_tmp2312, label %bb67, label %bb68

bb68:		bb68:
ret void		ret void
}		}

; Check that the need for stride==1 check prevents vectorizing a loop under opt		; Check that a loop under opt-for-size is vectorized, w/o checking for
; for size.		; stride==1.
; CHECK-LABEL: @scev4stride1		; NOTE: Some assertions have been autogenerated by utils/update_test_checks.py
; CHECK-NOT: vector.scevcheck
; CHECK-NOT: vector.body:
; CHECK-LABEL: for.body:
define void @scev4stride1(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %k) #0 {		define void @scev4stride1(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32 %k) #0 {
		; CHECK-LABEL: @scev4stride1(
		; CHECK-NEXT: for.body.preheader:
		; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> undef, i32 [[K:%.*]], i32 0
		; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> undef, <4 x i32> zeroinitializer
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[VEC_IND:%.*]] = phi <4 x i32> [ <i32 0, i32 1, i32 2, i32 3>, [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
		; CHECK-NEXT: [[TMP1:%.*]] = add i32 [[INDEX]], 1
		; CHECK-NEXT: [[TMP2:%.*]] = add i32 [[INDEX]], 2
		; CHECK-NEXT: [[TMP3:%.*]] = add i32 [[INDEX]], 3
		; CHECK-NEXT: [[TMP4:%.*]] = mul nsw <4 x i32> [[VEC_IND]], [[BROADCAST_SPLAT]]
		; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
		; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i32 [[TMP5]]
		; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
		; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i32, i32* [[B]], i32 [[TMP7]]
		; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
		; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[B]], i32 [[TMP9]]
		; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
		; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, i32* [[B]], i32 [[TMP11]]
		; CHECK-NEXT: [[TMP13:%.*]] = load i32, i32* [[TMP6]], align 4
		; CHECK-NEXT: [[TMP14:%.*]] = load i32, i32* [[TMP8]], align 4
		; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* [[TMP10]], align 4
		; CHECK-NEXT: [[TMP16:%.*]] = load i32, i32* [[TMP12]], align 4
		; CHECK-NEXT: [[TMP17:%.*]] = insertelement <4 x i32> undef, i32 [[TMP13]], i32 0
		; CHECK-NEXT: [[TMP18:%.*]] = insertelement <4 x i32> [[TMP17]], i32 [[TMP14]], i32 1
		; CHECK-NEXT: [[TMP19:%.*]] = insertelement <4 x i32> [[TMP18]], i32 [[TMP15]], i32 2
		; CHECK-NEXT: [[TMP20:%.*]] = insertelement <4 x i32> [[TMP19]], i32 [[TMP16]], i32 3
		; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i32 [[TMP0]]
		; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i32, i32* [[TMP21]], i32 0
		; CHECK-NEXT: [[TMP23:%.*]] = bitcast i32* [[TMP22]] to <4 x i32>*
		; CHECK-NEXT: store <4 x i32> [[TMP20]], <4 x i32>* [[TMP23]], align 4
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
		; CHECK-NEXT: [[VEC_IND_NEXT]] = add <4 x i32> [[VEC_IND]], <i32 4, i32 4, i32 4, i32 4>
		; CHECK-NEXT: [[TMP24:%.*]] = icmp eq i32 [[INDEX_NEXT]], 1024
		; CHECK-NEXT: br i1 [[TMP24]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop !0
		; CHECK: middle.block:
		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 1024, 1024
		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END_LOOPEXIT:%.*]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK: for.body:
		; CHECK: for.end.loopexit:
		; CHECK-NEXT: ret void
		;
for.body.preheader:		for.body.preheader:
br label %for.body		br label %for.body

for.body:		for.body:
%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]		%i.07 = phi i32 [ %inc, %for.body ], [ 0, %for.body.preheader ]
%mul = mul nsw i32 %i.07, %k		%mul = mul nsw i32 %i.07, %k
%arrayidx = getelementptr inbounds i32, i32* %b, i32 %mul		%arrayidx = getelementptr inbounds i32, i32* %b, i32 %mul
%0 = load i32, i32* %arrayidx, align 4		%0 = load i32, i32* %arrayidx, align 4

llvm/test/Transforms/LoopVectorize/runtime-check.ll

	loopexit:			loopexit:
	ret void			ret void
	}			}

	; CHECK: !9 = !DILocation(line: 101, column: 1, scope: !{{.*}})			; CHECK: !9 = !DILocation(line: 101, column: 1, scope: !{{.*}})

	define dso_local void @forced_optsize(i64* noalias nocapture readonly %x_p, i64* noalias nocapture readonly %y_p, i64* noalias nocapture %z_p) minsize optsize {			define dso_local void @forced_optsize(i64* noalias nocapture readonly %x_p, i64* noalias nocapture readonly %y_p, i64* noalias nocapture %z_p) minsize optsize {
	;			;
	; FORCED_OPTSIZE: remark: <unknown>:0:0: Code-size may be reduced by not forcing vectorization, or by source-code modifications eliminating the need for runtime checks (e.g., adding 'restrict').			; FORCED_OPTSIZE: remark: <unknown>:0:0: loop not vectorized: runtime pointer checks needed. Enable vectorization of this loop with '#pragma clang loop vectorize(enable)' when compiling with -Os/-Oz
	; FORCED_OPTSIZE-LABEL: @forced_optsize(			; FORCED_OPTSIZE-LABEL: @forced_optsize(
	; FORCED_OPTSIZE: vector.body:			; FORCED_OPTSIZE-NOT: vector.body:
	;			;
	entry:			entry:
	br label %for.body			br label %for.body

	for.cond.cleanup:			for.cond.cleanup:
	ret void			ret void

	for.body:			for.body: