This is an archive of the discontinued LLVM Phabricator instance.

Differential D113003

[LoopVectorize] Add support for tail folding using scalable vectors
ClosedPublic

Authored by david-arm on Nov 2 2021, 4:24 AM.

Details

Reviewers

sdesmalen
kmclaughlin
CarolineConcatto
SjoerdMeijer
fhahn
Ayal
dmgreen
reames

Commits

rGe3c84fb94818: [LoopVectorize] Add support for tail folding using scalable vectors

Summary

This patch fixes up an issue with InnerLoopVectorizer::getOrCreateVectorTripCount
whereby we weren't correctly generating the runtime trip count
for scalable vectors when tail-folding.
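
The rounding that the fix performs can be sketched in plain integer arithmetic. This is a minimal model, not the LLVM code: `runtimeStep` and `vectorTripCountFolded` are hypothetical helper names, and the local variable names simply echo the `n.rnd.up`, `n.mod.vf` and `n.vec` values that appear in the tests.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: lanes processed per vector iteration. For a scalable
// VF this is vscale * (known minimum VF) * UF and is only known at run time.
constexpr std::uint64_t runtimeStep(std::uint64_t vscale, std::uint64_t minVF,
                                    std::uint64_t uf) {
  return vscale * minVF * uf;
}

// Hypothetical helper: vector trip count when the tail is folded by masking.
// The scalar trip count is rounded up to a multiple of step.
constexpr std::uint64_t vectorTripCountFolded(std::uint64_t tc,
                                              std::uint64_t step) {
  std::uint64_t rndUp = tc + (step - 1); // corresponds to n.rnd.up
  std::uint64_t rem = rndUp % step;      // corresponds to n.mod.vf
  return rndUp - rem;                    // corresponds to n.vec
}
```

Assuming `vscale = 2`, a VF of `vscale x 4` and UF = 1, the step is 8 and a scalar trip count of 10 rounds up to 16. Rounding with only the known minimum of 4 lanes would give 12, which is not a multiple of the real step, so the add has to use the runtime lane count.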

It also removes some asserts in the tail-folding path for cases when
the VF is not scalable.

In this patch I have only permitted tail-folding to be enabled
explicitly for scalable vectors when the user has specified one
of the following flags:

-prefer-predicate-over-epilogue=predicate-dont-vectorize
-prefer-predicate-over-epilogue=predicate-else-scalar-epilogue

For now it's best not to enable tail-folding with scalable vectors for
low trip counts or when optimising for code size, since there has been
no analysis on whether this is worth it.
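
As a rough sketch of that policy, the check below keeps a scalable VF for tail folding only when predication was explicitly requested. The enum is a local stand-in that reuses names from LoopVectorize.cpp for readability, and `keepScalableVFWhenFoldingTail` is a hypothetical helper, not a function in LLVM.

```cpp
#include <cassert>

// Local stand-in for the cost model's scalar-epilogue states. The names
// follow LoopVectorize.cpp, but this is not the LLVM definition.
enum ScalarEpilogueLowering {
  CM_ScalarEpilogueAllowed,
  CM_ScalarEpilogueNotAllowedOptSize,
  CM_ScalarEpilogueNotAllowedLowTripLoop,
  CM_ScalarEpilogueNotNeededUsePredicate,
  CM_ScalarEpilogueNotAllowedUsePredicate
};

// Hypothetical helper: when tail folding is being considered, keep the
// scalable VF only for the two "use predicate" states. For any other state
// (for example low trip count or optimising for size) the patch resets the
// scalable VF, leaving fixed-width vectorisation available.
bool keepScalableVFWhenFoldingTail(ScalarEpilogueLowering status) {
  return status == CM_ScalarEpilogueNotNeededUsePredicate ||
         status == CM_ScalarEpilogueNotAllowedUsePredicate;
}
```

The design choice is deliberately conservative: the implicit routes into tail folding stay fixed-width only until the scalable path has been stabilised and its profitability analysed.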

Various tests have been added here:

Transforms/LoopVectorize/AArch64/sve-tail-folding.ll
Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll

The tests cannot be target independent because they require masked
load/store support, i.e. TTI.isLegalMaskedLoad and TTI.isLegalMaskedStore
need to return true.
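
To illustrate why masked stores are required, here is a scalar model of the tail-folded `simple_memset` loop. It is only a sketch: `foldedMemset` is a hypothetical function, `lanes` stands in for the runtime value `vscale * 4`, and the mask test mirrors the `icmp ule` of the induction vector against the splat of the trip count minus one seen in the CHECK lines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a tail-folded memset. Requires n >= 1 and buf.size() >= n.
// Returns the number of lane stores that the mask suppressed.
std::size_t foldedMemset(std::vector<std::int32_t> &buf, std::size_t n,
                         std::int32_t val, std::size_t lanes) {
  std::size_t maskedOff = 0;
  // Round the trip count up to a whole number of vector iterations.
  std::size_t nVec = ((n + lanes - 1) / lanes) * lanes;
  for (std::size_t index = 0; index != nVec; index += lanes) {
    for (std::size_t lane = 0; lane < lanes; ++lane) {
      // Lane is active only while its element index is <= n - 1.
      bool active = index + lane <= n - 1;
      if (active)
        buf[index + lane] = val; // the masked store writes this lane
      else
        ++maskedOff;             // the masked store skips this lane
    }
  }
  return maskedOff;
}
```

With `n = 10` and `lanes = 8` the loop runs two vector iterations and the mask disables six lanes in the second one. Without a legal masked store those six lanes would write past the end of the buffer, which is why the tests need a target where masked memory operations are legal.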

Diff Detail

Unit Tests: Failed

	Time	Test
	2,560 ms	x64 debian > libomp.ompt/parallel::nested.c

Event Timeline

david-arm created this revision. Nov 2 2021, 4:24 AM

Herald added subscribers: ctetreau, hiraditya, kristof.beyls. Nov 2 2021, 4:24 AM

david-arm requested review of this revision. Nov 2 2021, 4:24 AM

Herald added a project: Restricted Project. Nov 2 2021, 4:24 AM

Herald added a subscriber: llvm-commits.

david-arm added a parent revision: D112552: [LoopVectorize] When tail-folding, don't always predicate uniform loads. Nov 2 2021, 4:24 AM

Harbormaster completed remote builds in B131936: Diff 384038. Nov 2 2021, 4:53 AM

Since we are (also) talking about profitability of the transform:

> For now it's best not to enable tail-folding with scalable vectors for low trip counts or when optimising for code size, since there has been no analysis on whether this is worth it.

I was expecting we would need to implement target hook preferPredicateOverEpilogue for AArch64/SVE that controls this, and add some of this logic there?

SjoerdMeijer added reviewers: Ayal, dmgreen. Nov 2 2021, 5:39 AM

In D113003#3102731, @SjoerdMeijer wrote:

> Since we are (also) talking about profitability of the transform:
>
>> For now it's best not to enable tail-folding with scalable vectors for low trip counts or when optimising for code size, since there has been no analysis on whether this is worth it.
>
> I was expecting we would need to implement target hook preferPredicateOverEpilogue for AArch64/SVE that controls this, and add some of this logic there?

Hi @SjoerdMeijer, at the moment we're not really worried about profitability since we don't intend to enable tail-folding by default anytime soon. I'm happy to look into the TTI hook (preferPredicateOverEpilogue), but I'm not sure what benefit it gives us at the moment? The main reason for supporting tail-folding for scalable vectors for AArch64 is to ensure we have the technical capability and nothing crashes with scalable vectors. In future we'd like to use predication for vector epilogues and avoid the scalar tail, but maintain an unpredicated main vector loop.

Hi @SjoerdMeijer, I realised the title on the patch was misleading and I've changed it now to indicate we're not enabling this for AArch64, but just adding theoretical support for scalable vectors!

SjoerdMeijer added inline comments. Nov 2 2021, 6:40 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5881	Hi @david-arm, thanks for clarifying so far, that has helped. Moving the question about profitability to here, the place where it is probably relevant. Also note that I am asking a bunch of questions just to reload into memory how things work here (and I am new to the SVE/scalable angle).

Before we discuss the actual change, I don't think I understand this original comment + code here, to be honest. The first sentence:

> For scalable vectors, don't use tail folding as this is currently not yet supported.

suggests to me that we want to return `CM_ScalarEpilogueAllowed` in `getScalarEpilogueLowering`. And I don't know what the effect is of setting `MaxFactors.ScalableVF` to 0 below, but I guess it somehow has that effect?

To me it feels like we make a profitability call here "in the middle of something", which should be moved to a different place. I can also imagine that this is highly target dependent/specific, further supporting this. I do remember, though, that there is a bit of an ordering problem: the decision to tail-fold or not is taken quite early, while not all the information that we would ideally like to have is available yet. Not sure that is the case here. But I guess that for now the decision "if scalable vector, then don't tail-fold" could be taken quite early.

But yeah, I might be missing something here, so please enlighten me. :)

david-arm added inline comments. Nov 2 2021, 6:58 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5881	So when this code was originally added, tail-folding just crashed for scalable vectors because a lot of the code related to tail-folding assumed fixed-width vectors. We've been trying to fix a lot of these issues and now we've got to the point where it is at least possible to fold the tail for most loops. However, it is still experimental for scalable vectors: scalable vectorisation has not yet been enabled by default for the vectoriser. Even after this patch lands there is still at least one known issue that we intend to fix related to tail-folding + scalable vectors. This is the reason why tail-folding is still hidden behind a flag for scalable vectors. If we start to allow tail-folding to be automatically enabled for low trip counts or when optimising for code size, then we will be knowingly exposing users to these issues.

I do take your point about profitability and realise the original comment also suggested something related to profitability, but the aim of this patch is simply to add functionality that is not enabled by default. Perhaps I can change the comment to explain that we want to make this feature stable first for scalable vectors, then move on to questions of profitability?

To answer your other question about `MaxFactors.ScalableVF`: it basically disables scalable vectorisation entirely if `computeMaxVF` has decided that we should use tail-folding for low trip counts or for reducing code size. However, we are still able to use fixed-width vectorisation.

SjoerdMeijer added inline comments. Nov 2 2021, 7:53 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5881	> I do take your point about profitability and realise the original comment also suggested something related to profitability, but the aim of this patch is simply to add functionality that is not enabled by default. Perhaps I can change the comment to explain that we want to make this feature stable first for scalable vectors, then move on to questions of profitability?

Sounds like a good approach to me.

Matt added a subscriber: Matt. Nov 2 2021, 11:05 AM

david-arm added a child revision: D113180: [LoopVectorize] Make VPWidenCanonicalIVRecipe::execute work for scalable vectors. Nov 4 2021, 4:02 AM

reames resigned from this revision. Nov 30 2021, 9:54 AM

Rebase, which required fixing up some tests.

Harbormaster completed remote builds in B137682: Diff 392092. Dec 6 2021, 10:16 AM

Rebase, which required tweaking the tests slightly due to cost model changes.

Harbormaster completed remote builds in B141506: Diff 397317. Jan 4 2022, 9:31 AM

liaolucy added a subscriber: liaolucy. Jan 5 2022, 5:47 AM

Seems like quite the straightforward change, LGTM.

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll
2	nit: this can be removed now that scalable vectorization is enabled by default for aarch64 when SVE is available.

This revision is now accepted and ready to land. Jan 7 2022, 6:57 AM

This revision was landed with ongoing or failed builds. Jan 10 2022, 2:55 AM

Closed by commit rGe3c84fb94818: [LoopVectorize] Add support for tail folding using scalable vectors (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGe3c84fb94818: [LoopVectorize] Add support for tail folding using scalable vectors.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	16 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll	70 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll	546 lines

Diff 384038

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ auto CreateScalarIV = [&](Value *&Step) -> Value * { @@
     return ScalarIV;
   };

   // Create the vector values from the scalar IV, in the absence of creating a
   // vector IV.
   auto CreateSplatIV = [&](Value *ScalarIV, Value *Step) {
     Value *Broadcasted = getBroadcastInstrs(ScalarIV);
     for (unsigned Part = 0; Part < UF; ++Part) {
-      assert(!VF.isScalable() && "scalable vectors not yet supported.");
       Value *StartIdx;
       if (Step->getType()->isFloatingPointTy())
         StartIdx = getRuntimeVFAsFloat(Builder, Step->getType(), VF * Part);
       else
         StartIdx = getRuntimeVF(Builder, Step->getType(), VF * Part);

       Value *EntryPart =
           getStepVector(Broadcasted, StartIdx, Step, ID.getInductionOpcode());
@@ Value *InnerLoopVectorizer::getOrCreateVectorTripCount(Loop *L) { @@
   // up to a multiple of Step instead of rounding down. This is done by first
   // adding Step-1 and then rounding down. Note that it's ok if this addition
   // overflows: the vector induction variable will eventually wrap to zero given
   // that it starts at zero and its Step is a power of two; the loop will then
   // exit, with the last early-exit vector comparison also producing all-true.
   if (Cost->foldTailByMasking()) {
     assert(isPowerOf2_32(VF.getKnownMinValue() * UF) &&
            "VF*UF must be a power of 2 when folding tail by masking");
-    assert(!VF.isScalable() &&
-           "Tail folding not yet supported for scalable vectors");
+    Value *NumLanes = getRuntimeVF(Builder, Ty, VF * UF);
     TC = Builder.CreateAdd(
-        TC, ConstantInt::get(Ty, VF.getKnownMinValue() * UF - 1), "n.rnd.up");
+        TC, Builder.CreateSub(NumLanes, ConstantInt::get(Ty, 1)), "n.rnd.up");
   }

   // Now we need to generate the expression for the part of the loop that the
   // vectorized body will execute. This is equal to N - (N % Step) if scalar
   // iterations are not required for correctness, or N - Step, otherwise. Step
   // is equal to the vectorization factor (number of SIMD elements) times the
   // unroll factor (number of SIMD instructions).
   Value *R = Builder.CreateURem(TC, Step, "n.mod.vf");
@@ const SCEV *Rem = SE->getURemExpr( @@
         SE->getConstant(BackedgeTakenCount->getType(), MaxVFtimesIC));
     if (Rem->isZero()) {
       // Accept MaxFixedVF if we do not have a tail.
       LLVM_DEBUG(dbgs() << "LV: No tail will remain for any chosen VF.\n");
       return MaxFactors;
     }
   }

-  // For scalable vectors, don't use tail folding as this is currently not yet
-  // supported. The code is likely to have ended up here if the tripcount is
-  // low, in which case it makes sense not to use scalable vectors.
-  if (MaxFactors.ScalableVF.isVector())
+  // For scalable vectors don't use tail folding for low trip counts or
+  // optimizing for code size. We only permit this if the user has explicitly
+  // requested it.
+  if (ScalarEpilogueStatus != CM_ScalarEpilogueNotNeededUsePredicate &&
+      ScalarEpilogueStatus != CM_ScalarEpilogueNotAllowedUsePredicate
+      && MaxFactors.ScalableVF.isVector())
     MaxFactors.ScalableVF = ElementCount::getScalable(0);

   // If we don't know the precise trip count, or if the trip count that we
   // found modulo the vectorization factor is not zero, try to fold the tail
   // by masking.
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
   if (Legal->prepareToFoldTailByMasking()) {
     FoldTailByMasking = true;

Lint (pre-merge checks): clang-format: please reformat the code
-      ScalarEpilogueStatus != CM_ScalarEpilogueNotAllowedUsePredicate
-      && MaxFactors.ScalableVF.isVector())
+      ScalarEpilogueStatus != CM_ScalarEpilogueNotAllowedUsePredicate &&
+      MaxFactors.ScalableVF.isVector())

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding-forced.ll

This file was added.

				; RUN: opt -S -loop-vectorize -scalable-vectorization=preferred < %s | FileCheck %s

				; These tests ensure that tail-folding is enabled when the predicate.enable
				; loop attribute is set to true.

				target triple = "aarch64-unknown-linux-gnu"


				define void @simple_memset(i32 %val, i32* %ptr, i64 %n) #0 {
				; CHECK-LABEL: @simple_memset(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
				; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP4]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[UMAX]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT5:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT6:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT5]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label %vector.body
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT2:%.*]], %vector.body ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i64> [[TMP5]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i64> [[TMP6]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT4]], [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
				; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT6]], <vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[TMP9]])
				; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 4
				; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP14]]
				; CHECK-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP15]], label %middle.block, label %vector.body
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label %while.end.loopexit, label %scalar.ph
				;
				entry:
				br label %while.body

				while.body: ; preds = %while.body, %entry
				%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
				%gep = getelementptr i32, i32* %ptr, i64 %index
				store i32 %val, i32* %gep
				%index.next = add nsw i64 %index, 1
				%cmp10 = icmp ult i64 %index.next, %n
				br i1 %cmp10, label %while.body, label %while.end.loopexit, !llvm.loop !0

				while.end.loopexit: ; preds = %while.body
				ret void
				}


				attributes #0 = { "target-features"="+sve" }

				!0 = distinct !{!0, !1}
				!1 = !{!"llvm.loop.vectorize.predicate.enable", i1 true}

llvm/test/Transforms/LoopVectorize/AArch64/sve-tail-folding.ll

	; RUN: opt -S -loop-vectorize -scalable-vectorization=preferred -prefer-predicate-over-epilogue=predicate-dont-vectorize < %s | FileCheck %s			; RUN: opt -S -loop-vectorize -scalable-vectorization=preferred -prefer-predicate-over-epilogue=predicate-dont-vectorize < %s | FileCheck %s
				; RUN: opt -S -loop-vectorize -scalable-vectorization=preferred -prefer-predicate-over-epilogue=predicate-else-scalar-epilogue < %s | FileCheck %s
	; CHECK-NOT: vector.body:

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

	define void @tail_predication(i32 %init, i32* %ptr, i32 %val) #0 {
				define void @simple_memset(i32 %val, i32* %ptr, i64 %n) #0 {
				; CHECK-LABEL: @simple_memset(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
				; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP4]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[UMAX]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT5:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[VAL:%.*]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT6:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT5]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label %vector.body
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT2:%.*]], %vector.body ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i64> [[TMP5]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i64> [[TMP6]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT4]], [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
				; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT6]], <vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[TMP9]])
				; CHECK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP14:%.*]] = mul i64 [[TMP13]], 4
				; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP14]]
				; CHECK-NEXT: [[TMP15:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP15]], label %middle.block, label %vector.body
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label %while.end.loopexit, label %scalar.ph
				;
	entry:			entry:
	br label %while.body			br label %while.body

	while.body: ; preds = %while.body, %entry			while.body: ; preds = %while.body, %entry
	%index = phi i32 [ %index.dec, %while.body ], [ %init, %entry ]			%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
	%gep = getelementptr i32, i32* %ptr, i32 %index			%gep = getelementptr i32, i32* %ptr, i64 %index
	store i32 %val, i32* %gep			store i32 %val, i32* %gep
	%index.dec = add nsw i32 %index, -1			%index.next = add nsw i64 %index, 1
	%cmp10 = icmp sgt i32 %index, 0			%cmp10 = icmp ult i64 %index.next, %n
				br i1 %cmp10, label %while.body, label %while.end.loopexit

				while.end.loopexit: ; preds = %while.body
				ret void
				}


				define void @simple_memcpy(i32* noalias %dst, i32* noalias %src, i64 %n) #0 {
				; CHECK-LABEL: @simple_memcpy(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
				; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP4]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[UMAX]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label %vector.body
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT2:%.*]], %vector.body ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i64> [[TMP5]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i64> [[TMP6]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT4]], [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[SRC:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[TMP9]], <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i32, i32* [[DST:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i32, i32* [[TMP13]], i32 0
				; CHECK-NEXT: [[TMP15:%.*]] = bitcast i32* [[TMP14]] to <vscale x 4 x i32>*
				; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[WIDE_MASKED_LOAD]], <vscale x 4 x i32>* [[TMP15]], i32 4, <vscale x 4 x i1> [[TMP9]])
				; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP17:%.*]] = mul i64 [[TMP16]], 4
				; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP17]]
				; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP18]], label %middle.block, label %vector.body, !llvm.loop [[LOOP4:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label %while.end.loopexit, label %scalar.ph
				;
				entry:
				br label %while.body

				while.body: ; preds = %while.body, %entry
				%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
				%gep1 = getelementptr i32, i32* %src, i64 %index
				%val = load i32, i32* %gep1
				%gep2 = getelementptr i32, i32* %dst, i64 %index
				store i32 %val, i32* %gep2
				%index.next = add nsw i64 %index, 1
				%cmp10 = icmp ult i64 %index.next, %n
				br i1 %cmp10, label %while.body, label %while.end.loopexit

				while.end.loopexit: ; preds = %while.body
				ret void
				}


				define void @simple_gather_scatter(i32* noalias %dst, i32* noalias %src, i32* noalias %ind, i64 %n) #0 {
				; CHECK-LABEL: @simple_gather_scatter(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
				; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP4]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[UMAX]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT2:%.*]], %vector.body ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i64> [[TMP5]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i64> [[TMP6]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT4]], [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[IND:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[TMP9]], <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i32, i32* [[SRC:%.*]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD]]
				; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*> [[TMP13]], i32 4, <vscale x 4 x i1> [[TMP9]], <vscale x 4 x i32> undef)
				; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i32, i32* [[DST:%.*]], <vscale x 4 x i32> [[WIDE_MASKED_LOAD]]
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0i32(<vscale x 4 x i32> [[WIDE_MASKED_GATHER]], <vscale x 4 x i32*> [[TMP14]], i32 4, <vscale x 4 x i1> [[TMP9]])
				; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4
				; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP16]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP17]], label %middle.block, label %vector.body
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label %while.end.loopexit, label %scalar.ph
				;
				entry:
				br label %while.body

				while.body: ; preds = %while.body, %entry
				%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
				%gep1 = getelementptr i32, i32* %ind, i64 %index
				%ind_val = load i32, i32* %gep1
				%gep2 = getelementptr i32, i32* %src, i32 %ind_val
				%val = load i32, i32* %gep2
				%gep3 = getelementptr i32, i32* %dst, i32 %ind_val
				store i32 %val, i32* %gep3
				%index.next = add nsw i64 %index, 1
				%cmp10 = icmp ult i64 %index.next, %n
				br i1 %cmp10, label %while.body, label %while.end.loopexit

				while.end.loopexit: ; preds = %while.body
				ret void
				}


				; The original loop had an unconditional uniform load. Let's make sure
				; we don't artificially create new predicated blocks for the load.
				define void @uniform_load(i32* noalias %dst, i32* noalias readonly %src, i64 %n) #0 {
				; CHECK-LABEL: @uniform_load(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
				; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N:%.*]], [[TMP4]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[N]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label %vector.body
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT:%.*]], %vector.body ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i64> [[TMP5]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i64> [[TMP6]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT2]], [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[SRC:%.*]], align 4
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[TMP10]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[DST:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, i32* [[TMP11]], i32 0
				; CHECK-NEXT: [[TMP13:%.*]] = bitcast i32* [[TMP12]] to <vscale x 4 x i32>*
				; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[BROADCAST_SPLAT4]], <vscale x 4 x i32>* [[TMP13]], i32 4, <vscale x 4 x i1> [[TMP9]])
				; CHECK-NEXT: [[TMP14:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP15:%.*]] = mul i64 [[TMP14]], 4
				; CHECK-NEXT: [[INDEX_NEXT]] = add i64 [[INDEX]], [[TMP15]]
				; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP16]], label %middle.block, label %vector.body, !llvm.loop [[LOOP6:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label %for.end, label %scalar.ph
				;

				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%val = load i32, i32* %src, align 4
				%arrayidx = getelementptr inbounds i32, i32* %dst, i64 %indvars.iv
				store i32 %val, i32* %arrayidx, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end: ; preds = %for.body, %entry
				ret void
				}


				; The original loop had a conditional uniform load. In this case we actually
				; do need to perform conditional loads and so we end up using a gather instead.
				; However, we at least ensure the mask is the overlap of the loop predicate
				; and the original condition.
				define void @cond_uniform_load(i32* noalias %dst, i32* noalias readonly %src, i32* noalias readonly %cond, i64 %n) #0 {
				; CHECK-LABEL: @cond_uniform_load(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
				; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[N:%.*]], [[TMP4]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[N]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT5:%.*]] = insertelement <vscale x 4 x i32*> poison, i32* [[SRC:%.*]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT6:%.*]] = shufflevector <vscale x 4 x i32*> [[BROADCAST_SPLATINSERT5]], <vscale x 4 x i32*> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label %vector.body
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT2:%.*]], %vector.body ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i64> [[TMP5]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i64> [[TMP6]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT4]], [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, i32* [[COND:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds i32, i32* [[TMP10]], i32 0
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[TMP9]], <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP13:%.*]] = icmp eq <vscale x 4 x i32> [[WIDE_MASKED_LOAD]], shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 0, i32 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP14:%.*]] = xor <vscale x 4 x i1> [[TMP13]], shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP15:%.*]] = select <vscale x 4 x i1> [[TMP9]], <vscale x 4 x i1> [[TMP14]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 false, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0i32(<vscale x 4 x i32*> [[BROADCAST_SPLAT6]], i32 4, <vscale x 4 x i1> [[TMP15]], <vscale x 4 x i32> undef)
				; CHECK-NEXT: [[TMP16:%.*]] = select <vscale x 4 x i1> [[TMP9]], <vscale x 4 x i1> [[TMP13]], <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 false, i32 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[PREDPHI:%.*]] = select <vscale x 4 x i1> [[TMP16]], <vscale x 4 x i32> shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 0, i32 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i32> [[WIDE_MASKED_GATHER]]
				; CHECK-NEXT: [[TMP17:%.*]] = getelementptr inbounds i32, i32* [[DST:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP18:%.*]] = or <vscale x 4 x i1> [[TMP15]], [[TMP16]]
				; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds i32, i32* [[TMP17]], i32 0
				; CHECK-NEXT: [[TMP20:%.*]] = bitcast i32* [[TMP19]] to <vscale x 4 x i32>*
				; CHECK-NEXT: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> [[PREDPHI]], <vscale x 4 x i32>* [[TMP20]], i32 4, <vscale x 4 x i1> [[TMP18]])
				; CHECK-NEXT: [[TMP21:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP22:%.*]] = mul i64 [[TMP21]], 4
				; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP22]]
				; CHECK-NEXT: [[TMP23:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP23]], label %middle.block, label %vector.body
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label %for.end, label %scalar.ph
				;

				entry:
				br label %for.body

				for.body: ; preds = %entry, %if.end
				%index = phi i64 [ %index.next, %if.end ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds i32, i32* %cond, i64 %index
				%0 = load i32, i32* %arrayidx, align 4
				%tobool.not = icmp eq i32 %0, 0
				br i1 %tobool.not, label %if.end, label %if.then

				if.then: ; preds = %for.body
				%1 = load i32, i32* %src, align 4
				br label %if.end

				if.end: ; preds = %if.then, %for.body
				%val.0 = phi i32 [ %1, %if.then ], [ 0, %for.body ]
				%arrayidx1 = getelementptr inbounds i32, i32* %dst, i64 %index
				store i32 %val.0, i32* %arrayidx1, align 4
				%index.next = add nuw i64 %index, 1
				%exitcond.not = icmp eq i64 %index.next, %n
				br i1 %exitcond.not, label %for.end, label %for.body

				for.end: ; preds = %for.inc, %entry
				ret void
				}


				define void @simple_fdiv(float* noalias %dst, float* noalias %src, i64 %n) #0 {
				; CHECK-LABEL: @simple_fdiv(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
				; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP4]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[UMAX]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label %vector.body
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT2:%.*]], %vector.body ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i64> [[TMP5]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i64> [[TMP6]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT4]], [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr float, float* [[SRC:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr float, float* [[DST:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP12:%.*]] = getelementptr float, float* [[TMP10]], i32 0
				; CHECK-NEXT: [[TMP13:%.*]] = bitcast float* [[TMP12]] to <vscale x 4 x float>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* [[TMP13]], i32 4, <vscale x 4 x i1> [[TMP9]], <vscale x 4 x float> poison)
				; CHECK-NEXT: [[TMP14:%.*]] = getelementptr float, float* [[TMP11]], i32 0
				; CHECK-NEXT: [[TMP15:%.*]] = bitcast float* [[TMP14]] to <vscale x 4 x float>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD5:%.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* [[TMP15]], i32 4, <vscale x 4 x i1> [[TMP9]], <vscale x 4 x float> poison)
				; CHECK-NEXT: [[TMP16:%.*]] = fdiv <vscale x 4 x float> [[WIDE_MASKED_LOAD]], [[WIDE_MASKED_LOAD5]]
				; CHECK-NEXT: [[TMP17:%.*]] = bitcast float* [[TMP14]] to <vscale x 4 x float>*
				; CHECK-NEXT: call void @llvm.masked.store.nxv4f32.p0nxv4f32(<vscale x 4 x float> [[TMP16]], <vscale x 4 x float>* [[TMP17]], i32 4, <vscale x 4 x i1> [[TMP9]])
				; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP19:%.*]] = mul i64 [[TMP18]], 4
				; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP19]]
				; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP20]], label %middle.block, label %vector.body, !llvm.loop [[LOOP18:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label %while.end.loopexit, label %scalar.ph
				;
				entry:
				br label %while.body

				while.body: ; preds = %while.body, %entry
				%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
				%gep1 = getelementptr float, float* %src, i64 %index
				%gep2 = getelementptr float, float* %dst, i64 %index
				%val1 = load float, float* %gep1
				%val2 = load float, float* %gep2
				%res = fdiv float %val1, %val2
				store float %res, float* %gep2
				%index.next = add nsw i64 %index, 1
				%cmp10 = icmp ult i64 %index.next, %n
				br i1 %cmp10, label %while.body, label %while.end.loopexit

				while.end.loopexit: ; preds = %while.body
				ret void
				}


				define i32 @add_reduction_i32(i32* %ptr, i64 %n) #0 {
				; CHECK-LABEL: @add_reduction_i32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
				; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP4]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[UMAX]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label %vector.body
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT2:%.*]], %vector.body ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ insertelement (<vscale x 4 x i32> shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 0, i32 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer), i32 0, i32 0), %vector.ph ], [ [[TMP13:%.*]], %vector.body ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i64> [[TMP5]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i64> [[TMP6]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT4]], [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr i32, i32* [[PTR:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP10]], i32 0
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <vscale x 4 x i32>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* [[TMP12]], i32 4, <vscale x 4 x i1> [[TMP9]], <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP13]] = add <vscale x 4 x i32> [[VEC_PHI]], [[WIDE_MASKED_LOAD]]
				; CHECK-NEXT: [[TMP14:%.*]] = select <vscale x 4 x i1> [[TMP9]], <vscale x 4 x i32> [[TMP13]], <vscale x 4 x i32> [[VEC_PHI]]
				; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4
				; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP16]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP17]], label %middle.block, label %vector.body, !llvm.loop [[LOOP20:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[TMP18:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP14]])
				; CHECK-NEXT: br i1 true, label %while.end.loopexit, label %scalar.ph
				;
				entry:
				br label %while.body

				while.body: ; preds = %while.body, %entry
				%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
				%red = phi i32 [ %red.next, %while.body ], [ 0, %entry ]
				%gep = getelementptr i32, i32* %ptr, i64 %index
				%val = load i32, i32* %gep
				%red.next = add i32 %red, %val
				%index.next = add nsw i64 %index, 1
				%cmp10 = icmp ult i64 %index.next, %n
				br i1 %cmp10, label %while.body, label %while.end.loopexit

				while.end.loopexit: ; preds = %while.body
				ret i32 %red.next
				}


				define float @add_reduction_f32(float* %ptr, i64 %n) #0 {
				; CHECK-LABEL: @add_reduction_f32(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: br i1 false, label %scalar.ph, label %vector.ph
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 4
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 4
				; CHECK-NEXT: [[TMP4:%.*]] = sub i64 [[TMP3]], 1
				; CHECK-NEXT: [[N_RND_UP:%.*]] = add i64 [[UMAX]], [[TMP4]]
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[N_RND_UP]], [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[N_RND_UP]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TRIP_COUNT_MINUS_1:%.*]] = sub i64 [[UMAX]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TRIP_COUNT_MINUS_1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label %vector.body
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ 0, %vector.ph ], [ [[INDEX_NEXT2:%.*]], %vector.body ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi float [ 0.000000e+00, %vector.ph ], [ [[TMP14:%.*]], %vector.body ]
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[INDEX1]], i32 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[TMP5:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP6:%.*]] = add <vscale x 4 x i64> [[TMP5]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 0, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP7:%.*]] = mul <vscale x 4 x i64> [[TMP6]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i32 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[BROADCAST_SPLAT4]], [[TMP7]]
				; CHECK-NEXT: [[TMP8:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = icmp ule <vscale x 4 x i64> [[INDUCTION]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr float, float* [[PTR:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr float, float* [[TMP10]], i32 0
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast float* [[TMP11]] to <vscale x 4 x float>*
				; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0nxv4f32(<vscale x 4 x float>* [[TMP12]], i32 4, <vscale x 4 x i1> [[TMP9]], <vscale x 4 x float> poison)
				; CHECK-NEXT: [[TMP13:%.*]] = select <vscale x 4 x i1> [[TMP9]], <vscale x 4 x float> [[WIDE_MASKED_LOAD]], <vscale x 4 x float> shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float -0.000000e+00, i32 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP14]] = call float @llvm.vector.reduce.fadd.nxv4f32(float [[VEC_PHI]], <vscale x 4 x float> [[TMP13]])
				; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP16:%.*]] = mul i64 [[TMP15]], 4
				; CHECK-NEXT: [[INDEX_NEXT2]] = add i64 [[INDEX1]], [[TMP16]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT2]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP17]], label %middle.block, label %vector.body, !llvm.loop [[LOOP22:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br i1 true, label %while.end.loopexit, label %scalar.ph
				;
				entry:
				br label %while.body

				while.body: ; preds = %while.body, %entry
				%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
				%red = phi float [ %red.next, %while.body ], [ 0.000000, %entry ]
				%gep = getelementptr float, float* %ptr, i64 %index
				%val = load float, float* %gep
				%red.next = fadd float %red, %val
				%index.next = add nsw i64 %index, 1
				%cmp10 = icmp ult i64 %index.next, %n
				br i1 %cmp10, label %while.body, label %while.end.loopexit

				while.end.loopexit: ; preds = %while.body
				ret float %red.next
				}


				; Negative tests where we don't expect tail-folding

				; Integer divides can throw exceptions and since we can't scalarize conditional
				; divides for scalable vectors we just don't bother vectorizing.
				define void @simple_idiv(i32* noalias %dst, i32* noalias %src, i64 %n) #0 {
				; CHECK-LABEL: @simple_idiv(
				; CHECK-NOT: vector.body
				;
				entry:
				br label %while.body

				while.body: ; preds = %while.body, %entry
				%index = phi i64 [ %index.next, %while.body ], [ 0, %entry ]
				%gep1 = getelementptr i32, i32* %src, i64 %index
				%gep2 = getelementptr i32, i32* %dst, i64 %index
				%val1 = load i32, i32* %gep1
				%val2 = load i32, i32* %gep2
				%res = udiv i32 %val1, %val2
				store i32 %res, i32* %gep2
				%index.next = add nsw i64 %index, 1
				%cmp10 = icmp ult i64 %index.next, %n
				br i1 %cmp10, label %while.body, label %while.end.loopexit

				while.end.loopexit: ; preds = %while.body
				ret void
				}


				attributes #0 = { "target-features"="+sve" }
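
The trip-count and mask arithmetic that the CHECK lines in this test verify can be sketched in scalar C++. This is an illustrative sketch only, not code from the patch: `vectorTripCount`, `laneMask` and `foldedCopy` are hypothetical names, a fixed `vf` stands in for the runtime value `vscale * 4`, and overflow of the rounded-up trip count is ignored.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: the trip count rounded up to a multiple of vf.
// Mirrors N_RND_UP, N_MOD_VF and N_VEC in the CHECK lines.
// Overflow of tripCount + (vf - 1) is not handled in this sketch.
std::uint64_t vectorTripCount(std::uint64_t tripCount, std::uint64_t vf) {
  assert(vf > 0 && "vf stands in for vscale * 4 and must be non-zero");
  std::uint64_t rndUp = tripCount + (vf - 1);  // N_RND_UP
  std::uint64_t modVF = rndUp % vf;            // N_MOD_VF
  return rndUp - modVF;                        // N_VEC
}

// Hypothetical helper: per-lane predicate for the vector iteration that
// starts at `index`. Lane i is active when index + i <= tripCount - 1,
// matching `icmp ule INDUCTION, splat(TRIP_COUNT_MINUS_1)`.
std::vector<bool> laneMask(std::uint64_t index, std::uint64_t tripCount,
                           std::uint64_t vf) {
  assert(tripCount >= 1 && "the tests clamp the trip count to at least 1");
  std::vector<bool> mask(vf);
  for (std::uint64_t lane = 0; lane < vf; ++lane)
    mask[lane] = (index + lane) <= (tripCount - 1);
  return mask;
}

// Scalar emulation of a tail-folded copy loop: every vector iteration
// handles vf lanes, and inactive lanes are skipped, which is the role the
// masked load and masked store play in the vectorized IR.
std::vector<int> foldedCopy(const std::vector<int> &src, std::uint64_t vf) {
  std::vector<int> dst(src.size(), 0);
  std::uint64_t n = src.size();
  std::uint64_t nVec = vectorTripCount(n, vf);
  for (std::uint64_t index = 0; index != nVec; index += vf) {
    std::vector<bool> mask = laneMask(index, n, vf);
    for (std::uint64_t lane = 0; lane < vf; ++lane)
      if (mask[lane])
        dst[index + lane] = src[index + lane];
  }
  return dst;
}
```

With a trip count of 10 and a `vf` of 4, the rounded-up count is 12, so the emulated vector loop runs three times and the last iteration enables only its first two lanes. No scalar remainder loop is needed, which is consistent with the `br i1 true` to the exit block in `middle.block`.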