This is an archive of the discontinued LLVM Phabricator instance.

[LV] Use VScaleForTuning to allow wider epilogue VFs.
ClosedPublic

Authored by sdesmalen on Feb 1 2022, 8:50 AM.

Download Raw Diff

Details

Reviewers

david-arm
kmclaughlin
rogfer01
craig.topper
paulwalker-arm

Commits

rGeaee477edafe: [LV] Use VScaleForTuning to allow wider epilogue VFs.

Summary

When the main loop is e.g. VF=vscale x 1 and the epilogue VF cannot
be any smaller, the vectorizer should try to estimate how many lanes are
executed at runtime and allow a suitable fixed-width VF to be chosen. It
can use VScaleForTuning to figure out what a suitable fixed-width VF could
be. For the case where the main loop VF is VF=vscale x 1, and VScaleForTuning=8,
it could still choose an epilogue VF upto VF=4.

This was a bit tricky to test, so this patch also introduces a wrapper
function to get 'VScaleForTuning' by also considering vscale_range.
If min and max are equal, then that will be the vscale we compile for.
It makes little sense to tune for a different width if the code
will not be portable for other widths.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

sdesmalen created this revision.Feb 1 2022, 8:50 AM

Herald added a subscriber: hiraditya. · View Herald TranscriptFeb 1 2022, 8:50 AM

sdesmalen requested review of this revision.Feb 1 2022, 8:50 AM

Herald added a project: Restricted Project. · View Herald TranscriptFeb 1 2022, 8:50 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

Harbormaster completed remote builds in B146922: Diff 404955.Feb 1 2022, 8:50 AM

sdesmalen added reviewers: david-arm, kmclaughlin, rogfer01, craig.topper, paulwalker-arm.Feb 1 2022, 8:50 AM

LGTM! Seems like a sensible change.

This revision is now accepted and ready to land.Feb 2 2022, 6:59 AM

This revision was landed with ongoing or failed builds.Feb 3 2022, 7:40 AM

Closed by commit rGeaee477edafe: [LV] Use VScaleForTuning to allow wider epilogue VFs. (authored by sdesmalen). · Explain Why

This revision was automatically updated to reflect the committed changes.

sdesmalen added a commit: rGeaee477edafe: [LV] Use VScaleForTuning to allow wider epilogue VFs..

Revision Contents

Path

Size

llvm/

lib/

Transforms/

Vectorize/

LoopVectorize.cpp

35 lines

test/

Transforms/

LoopVectorize/

AArch64/

sve-epilog-vect.ll

193 lines

Diff 405640

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 1,695 Lines • ▼ Show 20 Lines	void invalidateCostModelingDecisions() {
WideningDecisions.clear();		WideningDecisions.clear();
Uniforms.clear();		Uniforms.clear();
Scalars.clear();		Scalars.clear();
}		}

private:		private:
unsigned NumPredStores = 0;		unsigned NumPredStores = 0;

		/// Convenience function that returns the value of vscale_range iff
		/// vscale_range.min == vscale_range.max or otherwise returns the value
		/// returned by the corresponding TLI method.
		Optional<unsigned> getVScaleForTuning() const;

/// \return An upper bound for the vectorization factors for both		/// \return An upper bound for the vectorization factors for both
/// fixed and scalable vectorization, where the minimum-known number of		/// fixed and scalable vectorization, where the minimum-known number of
/// elements is a power-of-2 larger than zero. If scalable vectorization is		/// elements is a power-of-2 larger than zero. If scalable vectorization is
/// disabled or unsupported, then the scalable part will be equal to		/// disabled or unsupported, then the scalable part will be equal to
/// ElementCount::getScalable(0).		/// ElementCount::getScalable(0).
FixedScalableVFPair computeFeasibleMaxVF(unsigned ConstTripCount,		FixedScalableVFPair computeFeasibleMaxVF(unsigned ConstTripCount,
ElementCount UserVF,		ElementCount UserVF,
bool FoldTailByMasking);		bool FoldTailByMasking);
▲ Show 20 Lines • Show All 3,883 Lines • ▼ Show 20 Lines	if (ElementCount MinVF =
<< ") with target's minimum: " << MinVF << '\n');		<< ") with target's minimum: " << MinVF << '\n');
MaxVF = MinVF;		MaxVF = MinVF;
}		}
}		}
}		}
return MaxVF;		return MaxVF;
}		}

		Optional<unsigned> LoopVectorizationCostModel::getVScaleForTuning() const {
		if (TheFunction->hasFnAttribute(Attribute::VScaleRange)) {
		auto Attr = TheFunction->getFnAttribute(Attribute::VScaleRange);
		auto Min = Attr.getVScaleRangeMin();
		auto Max = Attr.getVScaleRangeMax();
		if (Max && Min == Max)
		return Max;
		}

		return TTI.getVScaleForTuning();
		}

bool LoopVectorizationCostModel::isMoreProfitable(		bool LoopVectorizationCostModel::isMoreProfitable(
const VectorizationFactor &A, const VectorizationFactor &B) const {		const VectorizationFactor &A, const VectorizationFactor &B) const {
InstructionCost CostA = A.Cost;		InstructionCost CostA = A.Cost;
InstructionCost CostB = B.Cost;		InstructionCost CostB = B.Cost;

unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(TheLoop);		unsigned MaxTripCount = PSE.getSE()->getSmallConstantMaxTripCount(TheLoop);

if (!A.Width.isScalable() && !B.Width.isScalable() && FoldTailByMasking &&		if (!A.Width.isScalable() && !B.Width.isScalable() && FoldTailByMasking &&
MaxTripCount) {		MaxTripCount) {
// If we are folding the tail and the trip count is a known (possibly small)		// If we are folding the tail and the trip count is a known (possibly small)
// constant, the trip count will be rounded up to an integer number of		// constant, the trip count will be rounded up to an integer number of
// iterations. The total cost will be PerIterationCost*ceil(TripCount/VF),		// iterations. The total cost will be PerIterationCost*ceil(TripCount/VF),
// which we compare directly. When not folding the tail, the total cost will		// which we compare directly. When not folding the tail, the total cost will
// be PerIterationCost*floor(TC/VF) + Scalar remainder cost, and so is		// be PerIterationCost*floor(TC/VF) + Scalar remainder cost, and so is
// approximated with the per-lane cost below instead of using the tripcount		// approximated with the per-lane cost below instead of using the tripcount
// as here.		// as here.
auto RTCostA = CostA * divideCeil(MaxTripCount, A.Width.getFixedValue());		auto RTCostA = CostA * divideCeil(MaxTripCount, A.Width.getFixedValue());
auto RTCostB = CostB * divideCeil(MaxTripCount, B.Width.getFixedValue());		auto RTCostB = CostB * divideCeil(MaxTripCount, B.Width.getFixedValue());
return RTCostA < RTCostB;		return RTCostA < RTCostB;
}		}

// Improve estimate for the vector width if it is scalable.		// Improve estimate for the vector width if it is scalable.
unsigned EstimatedWidthA = A.Width.getKnownMinValue();		unsigned EstimatedWidthA = A.Width.getKnownMinValue();
unsigned EstimatedWidthB = B.Width.getKnownMinValue();		unsigned EstimatedWidthB = B.Width.getKnownMinValue();
if (Optional<unsigned> VScale = TTI.getVScaleForTuning()) {		if (Optional<unsigned> VScale = getVScaleForTuning()) {
if (A.Width.isScalable())		if (A.Width.isScalable())
EstimatedWidthA *= VScale.getValue();		EstimatedWidthA *= VScale.getValue();
if (B.Width.isScalable())		if (B.Width.isScalable())
EstimatedWidthB *= VScale.getValue();		EstimatedWidthB *= VScale.getValue();
}		}

// Assume vscale may be larger than 1 (or the value being tuned for),		// Assume vscale may be larger than 1 (or the value being tuned for),
// so that scalable vectorization is slightly favorable over fixed-width		// so that scalable vectorization is slightly favorable over fixed-width
Show All 32 Lines	for (const auto &i : VFCandidates) {
if (i.isScalar())		if (i.isScalar())
continue;		continue;

VectorizationCostTy C = expectedCost(i, &InvalidCosts);		VectorizationCostTy C = expectedCost(i, &InvalidCosts);
VectorizationFactor Candidate(i, C.first);		VectorizationFactor Candidate(i, C.first);

#ifndef NDEBUG		#ifndef NDEBUG
unsigned AssumedMinimumVscale = 1;		unsigned AssumedMinimumVscale = 1;
if (Optional<unsigned> VScale = TTI.getVScaleForTuning())		if (Optional<unsigned> VScale = getVScaleForTuning())
AssumedMinimumVscale = VScale.getValue();		AssumedMinimumVscale = VScale.getValue();
unsigned Width =		unsigned Width =
Candidate.Width.isScalable()		Candidate.Width.isScalable()
? Candidate.Width.getKnownMinValue() * AssumedMinimumVscale		? Candidate.Width.getKnownMinValue() * AssumedMinimumVscale
: Candidate.Width.getFixedValue();		: Candidate.Width.getFixedValue();
LLVM_DEBUG(dbgs() << "LV: Vector loop of width " << i		LLVM_DEBUG(dbgs() << "LV: Vector loop of width " << i
<< " costs: " << (Candidate.Cost / Width));		<< " costs: " << (Candidate.Cost / Width));
if (i.isScalable())		if (i.isScalable())
▲ Show 20 Lines • Show All 195 Lines • ▼ Show 20 Lines	LoopVectorizationCostModel::selectEpilogueVectorizationFactor(
}		}

if (!isEpilogueVectorizationProfitable(MainLoopVF)) {		if (!isEpilogueVectorizationProfitable(MainLoopVF)) {
LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is not profitable for "		LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is not profitable for "
"this loop\n");		"this loop\n");
return Result;		return Result;
}		}

		// If MainLoopVF = vscale x 2, and vscale is expected to be 4, then we know
		// the main loop handles 8 lanes per iteration. We could still benefit from
		// vectorizing the epilogue loop with VF=4.
		ElementCount EstimatedRuntimeVF = MainLoopVF;
		if (MainLoopVF.isScalable()) {
		EstimatedRuntimeVF = ElementCount::getFixed(MainLoopVF.getKnownMinValue());
		if (Optional<unsigned> VScale = getVScaleForTuning())
		EstimatedRuntimeVF *= VScale.getValue();
		}

for (auto &NextVF : ProfitableVFs)		for (auto &NextVF : ProfitableVFs)
if (ElementCount::isKnownLT(NextVF.Width, MainLoopVF) &&		if (((!NextVF.Width.isScalable() && MainLoopVF.isScalable() &&
		ElementCount::isKnownLT(NextVF.Width, EstimatedRuntimeVF)) \|\|
		ElementCount::isKnownLT(NextVF.Width, MainLoopVF)) &&
(Result.Width.isScalar() \|\| isMoreProfitable(NextVF, Result)) &&		(Result.Width.isScalar() \|\| isMoreProfitable(NextVF, Result)) &&
LVP.hasPlanWithVF(NextVF.Width))		LVP.hasPlanWithVF(NextVF.Width))
Result = NextVF;		Result = NextVF;

if (Result != VectorizationFactor::Disabled())		if (Result != VectorizationFactor::Disabled())
LLVM_DEBUG(dbgs() << "LEV: Vectorizing epilogue loop with VF = "		LLVM_DEBUG(dbgs() << "LEV: Vectorizing epilogue loop with VF = "
<< Result.Width << "\n";);		<< Result.Width << "\n";);
return Result;		return Result;
▲ Show 20 Lines • Show All 4,884 Lines • Show Last 20 Lines

llvm/test/Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; REQUIRES: asserts		; REQUIRES: asserts
; RUN: opt < %s -loop-vectorize -force-vector-interleave=2 -epilogue-vectorization-minimum-VF=0 --debug-only=loop-vectorize -force-target-instruction-cost=1 -S 2>%t \| FileCheck %s --check-prefix=CHECK		; RUN: opt < %s -loop-vectorize -force-vector-interleave=2 -epilogue-vectorization-minimum-VF=0 --debug-only=loop-vectorize -force-target-instruction-cost=1 -S 2>%t \| FileCheck %s --check-prefix=CHECK
; RUN: cat %t \| FileCheck %s --check-prefix=DEBUG		; RUN: cat %t \| FileCheck %s --check-prefix=DEBUG
; RUN: opt < %s -loop-vectorize -force-vector-interleave=2 -epilogue-vectorization-minimum-VF=8 --debug-only=loop-vectorize -S 2>%t \| FileCheck %s --check-prefix=CHECK
; RUN: cat %t \| FileCheck %s --check-prefix=DEBUG
; RUN: opt < %s -loop-vectorize -force-vector-interleave=2 -epilogue-vectorization-force-VF=8 --debug-only=loop-vectorize -S 2>%t \| FileCheck %s --check-prefix=CHECK-VF8		; RUN: opt < %s -loop-vectorize -force-vector-interleave=2 -epilogue-vectorization-force-VF=8 --debug-only=loop-vectorize -S 2>%t \| FileCheck %s --check-prefix=CHECK-VF8
; RUN: cat %t \| FileCheck %s --check-prefix=DEBUG-FORCED		; RUN: cat %t \| FileCheck %s --check-prefix=DEBUG-FORCED

target triple = "aarch64-linux-gnu"		target triple = "aarch64-linux-gnu"

; DEBUG: LV: Checking a loop in "f1"		; DEBUG: LV: Checking a loop in "main_vf_vscale_x_16"
; DEBUG: Create Skeleton for epilogue vectorized loop (first pass)		; DEBUG: Create Skeleton for epilogue vectorized loop (first pass)
; DEBUG: Main Loop VF:vscale x 16, Main Loop UF:2, Epilogue Loop VF:vscale x 8, Epilogue Loop UF:1		; DEBUG: Main Loop VF:vscale x 16, Main Loop UF:2, Epilogue Loop VF:vscale x 8, Epilogue Loop UF:1

; DEBUG-FORCED: LV: Checking a loop in "f1"		; DEBUG-FORCED: LV: Checking a loop in "main_vf_vscale_x_16"
; DEBUG-FORCED: LEV: Epilogue vectorization factor is forced.		; DEBUG-FORCED: LEV: Epilogue vectorization factor is forced.
; DEBUG-FORCED: Create Skeleton for epilogue vectorized loop (first pass)		; DEBUG-FORCED: Create Skeleton for epilogue vectorized loop (first pass)
; DEBUG-FORCED: Main Loop VF:vscale x 16, Main Loop UF:2, Epilogue Loop VF:8, Epilogue Loop UF:1		; DEBUG-FORCED: Main Loop VF:vscale x 16, Main Loop UF:2, Epilogue Loop VF:8, Epilogue Loop UF:1

define void @f1(i8* %A) #0 {		define void @main_vf_vscale_x_16(i8* %A) #0 {
; CHECK-LABEL: @f1(		; CHECK-LABEL: @main_vf_vscale_x_16(
; CHECK-NEXT: iter.check:		; CHECK-NEXT: iter.check:
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()		; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 8		; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 8
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]
; CHECK: vector.main.loop.iter.check:		; CHECK: vector.main.loop.iter.check:
; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()		; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 32		; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 32
▲ Show 20 Lines • Show All 69 Lines • ▼ Show 20 Lines
; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1		; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
; CHECK-NEXT: [[EXITCOND:%.*]] = icmp ne i64 [[IV_NEXT]], 1024		; CHECK-NEXT: [[EXITCOND:%.*]] = icmp ne i64 [[IV_NEXT]], 1024
; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_BODY]], label [[EXIT_LOOPEXIT]], !llvm.loop [[LOOP4:![0-9]+]]		; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_BODY]], label [[EXIT_LOOPEXIT]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: exit.loopexit:		; CHECK: exit.loopexit:
; CHECK-NEXT: br label [[EXIT]]		; CHECK-NEXT: br label [[EXIT]]
; CHECK: exit:		; CHECK: exit:
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
; CHECK-VF8-LABEL: @f1(		; CHECK-VF8-LABEL: @main_vf_vscale_x_16(
; CHECK-VF8-NEXT: iter.check:		; CHECK-VF8-NEXT: iter.check:
; CHECK-VF8-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]		; CHECK-VF8-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]
; CHECK-VF8: vector.main.loop.iter.check:		; CHECK-VF8: vector.main.loop.iter.check:
; CHECK-VF8-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()		; CHECK-VF8-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-VF8-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 32		; CHECK-VF8-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 32
; CHECK-VF8-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]		; CHECK-VF8-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]
; CHECK-VF8-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_PH:%.]], label [[VECTOR_PH:%.]]		; CHECK-VF8-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_PH:%.]], label [[VECTOR_PH:%.]]
; CHECK-VF8: vector.ph:		; CHECK-VF8: vector.ph:
▲ Show 20 Lines • Show All 73 Lines • ▼ Show 20 Lines	for.body:
%iv.next = add nuw nsw i64 %iv, 1		%iv.next = add nuw nsw i64 %iv, 1
%exitcond = icmp ne i64 %iv.next, 1024		%exitcond = icmp ne i64 %iv.next, 1024
br i1 %exitcond, label %for.body, label %exit		br i1 %exitcond, label %for.body, label %exit

exit:		exit:
ret void		ret void
}		}


		; DEBUG: LV: Checking a loop in "main_vf_vscale_x_2"
		; DEBUG: Create Skeleton for epilogue vectorized loop (first pass)
		; DEBUG: Main Loop VF:vscale x 2, Main Loop UF:2, Epilogue Loop VF:8, Epilogue Loop UF:1

		; DEBUG-FORCED: LV: Checking a loop in "main_vf_vscale_x_2"
		; DEBUG-FORCED: LEV: Epilogue vectorization factor is forced.
		; DEBUG-FORCED: Create Skeleton for epilogue vectorized loop (first pass)
		; DEBUG-FORCED: Main Loop VF:vscale x 2, Main Loop UF:2, Epilogue Loop VF:8, Epilogue Loop UF:1

		; When the vector.body uses VF=vscale x 1 (or VF=vscale x 2 because
		; that's the minimum supported VF by SVE), we could still use a wide
		; fixed-width VF=8 for the epilogue if the vectors are known to be
		; sufficiently wide. This information can be deduced from vscale_range or
		; VScaleForTuning (set by mcpu/mtune).
		define void @main_vf_vscale_x_2(i64* %A) #0 vscale_range(8, 8) {
		; CHECK-LABEL: @main_vf_vscale_x_2(
		; CHECK-NEXT: iter.check:
		; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]
		; CHECK: vector.main.loop.iter.check:
		; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
		; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]
		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_PH:%.]], label [[VECTOR_PH:%.]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
		; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
		; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP3]]
		; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0
		; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
		; CHECK-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 2
		; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[TMP6]], 0
		; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 1
		; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], [[TMP8]]
		; CHECK-NEXT: [[TMP10:%.]] = getelementptr inbounds i64, i64 [[A:%.*]], i64 [[TMP4]]
		; CHECK-NEXT: [[TMP11:%.]] = getelementptr inbounds i64, i64 [[A]], i64 [[TMP9]]
		; CHECK-NEXT: [[TMP12:%.]] = getelementptr inbounds i64, i64 [[TMP10]], i32 0
		; CHECK-NEXT: [[TMP13:%.]] = bitcast i64 [[TMP12]] to <vscale x 2 x i64>*
		; CHECK-NEXT: store <vscale x 2 x i64> shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i32 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i64>* [[TMP13]], align 1
		; CHECK-NEXT: [[TMP14:%.*]] = call i32 @llvm.vscale.i32()
		; CHECK-NEXT: [[TMP15:%.*]] = mul i32 [[TMP14]], 2
		; CHECK-NEXT: [[TMP16:%.]] = getelementptr inbounds i64, i64 [[TMP10]], i32 [[TMP15]]
		; CHECK-NEXT: [[TMP17:%.]] = bitcast i64 [[TMP16]] to <vscale x 2 x i64>*
		; CHECK-NEXT: store <vscale x 2 x i64> shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i32 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i64>* [[TMP17]], align 1
		; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
		; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 4
		; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP19]]
		; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
		; CHECK: middle.block:
		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
		; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.]], label [[VEC_EPILOG_ITER_CHECK:%.]]
		; CHECK: vec.epilog.iter.check:
		; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 1024, [[N_VEC]]
		; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
		; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
		; CHECK: vec.epilog.ph:
		; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
		; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
		; CHECK: vec.epilog.vector.body:
		; CHECK-NEXT: [[INDEX2:%.]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT3:%.]], [[VEC_EPILOG_VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP21:%.*]] = add i64 [[INDEX2]], 0
		; CHECK-NEXT: [[TMP22:%.]] = getelementptr inbounds i64, i64 [[A]], i64 [[TMP21]]
		; CHECK-NEXT: [[TMP23:%.]] = getelementptr inbounds i64, i64 [[TMP22]], i32 0
		; CHECK-NEXT: [[TMP24:%.]] = bitcast i64 [[TMP23]] to <8 x i64>*
		; CHECK-NEXT: store <8 x i64> <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>, <8 x i64>* [[TMP24]], align 1
		; CHECK-NEXT: [[INDEX_NEXT3]] = add nuw i64 [[INDEX2]], 8
		; CHECK-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT3]], 1024
		; CHECK-NEXT: br i1 [[TMP25]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
		; CHECK: vec.epilog.middle.block:
		; CHECK-NEXT: [[CMP_N1:%.*]] = icmp eq i64 1024, 1024
		; CHECK-NEXT: br i1 [[CMP_N1]], label [[EXIT_LOOPEXIT:%.*]], label [[VEC_EPILOG_SCALAR_PH]]
		; CHECK: vec.epilog.scalar.ph:
		; CHECK-NEXT: [[BC_RESUME_VAL:%.]] = phi i64 [ 1024, [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK:%.]] ]
		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
		; CHECK: for.body:
		; CHECK-NEXT: [[IV:%.]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[IV_NEXT:%.]], [[FOR_BODY]] ]
		; CHECK-NEXT: [[ARRAYIDX:%.]] = getelementptr inbounds i64, i64 [[A]], i64 [[IV]]
		; CHECK-NEXT: store i64 1, i64* [[ARRAYIDX]], align 1
		; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
		; CHECK-NEXT: [[EXITCOND:%.*]] = icmp ne i64 [[IV_NEXT]], 1024
		; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_BODY]], label [[EXIT_LOOPEXIT]], !llvm.loop [[LOOP7:![0-9]+]]
		; CHECK: exit.loopexit:
		; CHECK-NEXT: br label [[EXIT]]
		; CHECK: exit:
		; CHECK-NEXT: ret void
		;
		; CHECK-VF8-LABEL: @main_vf_vscale_x_2(
		; CHECK-VF8-NEXT: iter.check:
		; CHECK-VF8-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.]]
		; CHECK-VF8: vector.main.loop.iter.check:
		; CHECK-VF8-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
		; CHECK-VF8-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
		; CHECK-VF8-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]
		; CHECK-VF8-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_PH:%.]], label [[VECTOR_PH:%.]]
		; CHECK-VF8: vector.ph:
		; CHECK-VF8-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
		; CHECK-VF8-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
		; CHECK-VF8-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP3]]
		; CHECK-VF8-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]
		; CHECK-VF8-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK-VF8: vector.body:
		; CHECK-VF8-NEXT: [[INDEX:%.]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.]], [[VECTOR_BODY]] ]
		; CHECK-VF8-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0
		; CHECK-VF8-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
		; CHECK-VF8-NEXT: [[TMP6:%.*]] = mul i64 [[TMP5]], 2
		; CHECK-VF8-NEXT: [[TMP7:%.*]] = add i64 [[TMP6]], 0
		; CHECK-VF8-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 1
		; CHECK-VF8-NEXT: [[TMP9:%.*]] = add i64 [[INDEX]], [[TMP8]]
		; CHECK-VF8-NEXT: [[TMP10:%.]] = getelementptr inbounds i64, i64 [[A:%.*]], i64 [[TMP4]]
		; CHECK-VF8-NEXT: [[TMP11:%.]] = getelementptr inbounds i64, i64 [[A]], i64 [[TMP9]]
		; CHECK-VF8-NEXT: [[TMP12:%.]] = getelementptr inbounds i64, i64 [[TMP10]], i32 0
		; CHECK-VF8-NEXT: [[TMP13:%.]] = bitcast i64 [[TMP12]] to <vscale x 2 x i64>*
		; CHECK-VF8-NEXT: store <vscale x 2 x i64> shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i32 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i64>* [[TMP13]], align 1
		; CHECK-VF8-NEXT: [[TMP14:%.*]] = call i32 @llvm.vscale.i32()
		; CHECK-VF8-NEXT: [[TMP15:%.*]] = mul i32 [[TMP14]], 2
		; CHECK-VF8-NEXT: [[TMP16:%.]] = getelementptr inbounds i64, i64 [[TMP10]], i32 [[TMP15]]
		; CHECK-VF8-NEXT: [[TMP17:%.]] = bitcast i64 [[TMP16]] to <vscale x 2 x i64>*
		; CHECK-VF8-NEXT: store <vscale x 2 x i64> shufflevector (<vscale x 2 x i64> insertelement (<vscale x 2 x i64> poison, i64 1, i32 0), <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i64>* [[TMP17]], align 1
		; CHECK-VF8-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
		; CHECK-VF8-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 4
		; CHECK-VF8-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP19]]
		; CHECK-VF8-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-VF8-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
		; CHECK-VF8: middle.block:
		; CHECK-VF8-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
		; CHECK-VF8-NEXT: br i1 [[CMP_N]], label [[EXIT:%.]], label [[VEC_EPILOG_ITER_CHECK:%.]]
		; CHECK-VF8: vec.epilog.iter.check:
		; CHECK-VF8-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 1024, [[N_VEC]]
		; CHECK-VF8-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
		; CHECK-VF8-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
		; CHECK-VF8: vec.epilog.ph:
		; CHECK-VF8-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
		; CHECK-VF8-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
		; CHECK-VF8: vec.epilog.vector.body:
		; CHECK-VF8-NEXT: [[INDEX2:%.]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT3:%.]], [[VEC_EPILOG_VECTOR_BODY]] ]
		; CHECK-VF8-NEXT: [[TMP21:%.*]] = add i64 [[INDEX2]], 0
		; CHECK-VF8-NEXT: [[TMP22:%.]] = getelementptr inbounds i64, i64 [[A]], i64 [[TMP21]]
		; CHECK-VF8-NEXT: [[TMP23:%.]] = getelementptr inbounds i64, i64 [[TMP22]], i32 0
		; CHECK-VF8-NEXT: [[TMP24:%.]] = bitcast i64 [[TMP23]] to <8 x i64>*
		; CHECK-VF8-NEXT: store <8 x i64> <i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1, i64 1>, <8 x i64>* [[TMP24]], align 1
		; CHECK-VF8-NEXT: [[INDEX_NEXT3]] = add nuw i64 [[INDEX2]], 8
		; CHECK-VF8-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT3]], 1024
		; CHECK-VF8-NEXT: br i1 [[TMP25]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
		; CHECK-VF8: vec.epilog.middle.block:
		; CHECK-VF8-NEXT: [[CMP_N1:%.*]] = icmp eq i64 1024, 1024
		; CHECK-VF8-NEXT: br i1 [[CMP_N1]], label [[EXIT_LOOPEXIT:%.*]], label [[VEC_EPILOG_SCALAR_PH]]
		; CHECK-VF8: vec.epilog.scalar.ph:
		; CHECK-VF8-NEXT: [[BC_RESUME_VAL:%.]] = phi i64 [ 1024, [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK:%.]] ]
		; CHECK-VF8-NEXT: br label [[FOR_BODY:%.*]]
		; CHECK-VF8: for.body:
		; CHECK-VF8-NEXT: [[IV:%.]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[IV_NEXT:%.]], [[FOR_BODY]] ]
		; CHECK-VF8-NEXT: [[ARRAYIDX:%.]] = getelementptr inbounds i64, i64 [[A]], i64 [[IV]]
		; CHECK-VF8-NEXT: store i64 1, i64* [[ARRAYIDX]], align 1
		; CHECK-VF8-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
		; CHECK-VF8-NEXT: [[EXITCOND:%.*]] = icmp ne i64 [[IV_NEXT]], 1024
		; CHECK-VF8-NEXT: br i1 [[EXITCOND]], label [[FOR_BODY]], label [[EXIT_LOOPEXIT]], !llvm.loop [[LOOP7:![0-9]+]]
		; CHECK-VF8: exit.loopexit:
		; CHECK-VF8-NEXT: br label [[EXIT]]
		; CHECK-VF8: exit:
		; CHECK-VF8-NEXT: ret void
		;
		entry:
		br label %for.body

		for.body:
		%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
		%arrayidx = getelementptr inbounds i64, i64* %A, i64 %iv
		store i64 1, i64* %arrayidx, align 1
		%iv.next = add nuw nsw i64 %iv, 1
		%exitcond = icmp ne i64 %iv.next, 1024
		br i1 %exitcond, label %for.body, label %exit

		exit:
		ret void
		}

attributes #0 = { "target-features"="+sve" }		attributes #0 = { "target-features"="+sve" }