This is an archive of the discontinued LLVM Phabricator instance.

Differential D98715

[LoopVectorize] Add support for scalable vectorization of induction variables
Closed, Public

Authored by david-arm on Mar 16 2021, 8:45 AM.

Details

Reviewers

sdesmalen
kmclaughlin
CarolineConcatto
fhahn
ctetreau
peterwaller-arm

Commits

rGa08c7736a771: [LoopVectorize] Add support for scalable vectorization of induction variables

Summary

This patch adds support for the vectorization of induction variables when
using scalable vectors, which required the following changes:

Removed assert from InnerLoopVectorizer::getStepVector.
Modified InnerLoopVectorizer::createVectorIntOrFpInductionPHI to use a runtime determined value for VF and removed an assert.
Modified InnerLoopVectorizer::buildScalarSteps to work for scalable vectors. I did this by calculating the full vector value for each Part of the unroll factor (UF) and caching this in the VP state. This means that we are always able to extract an arbitrary element from the vector if necessary. In addition to this, I also permitted the caching of the individual lane values themselves for the known minimum number of elements in the same way we do for fixed width vectors. This is a further optimisation that improves the code quality since it avoids unnecessary extractelement operations when extracting the first lane.
Added an assert to InnerLoopVectorizer::widenPHIInstruction, since while testing some code paths I noticed this is currently broken for scalable vectors.
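The per-part vector value described for buildScalarSteps can be modelled outside LLVM. What follows is a minimal sketch for an integer induction, assuming a concrete runtime vscale; the function scalarStepsForPart and its parameters are hypothetical names for illustration, not LLVM APIs. For each lane it computes ScalarIV + (Part * VF + lane) * Step.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of one unroll part of buildScalarSteps for an integer
// induction. KnownMinVF and VScale stand in for the ElementCount and the
// value llvm.vscale would return at run time.
std::vector<int64_t> scalarStepsForPart(int64_t ScalarIV, int64_t Step,
                                        unsigned Part, unsigned KnownMinVF,
                                        unsigned VScale) {
  const unsigned RuntimeVF = KnownMinVF * VScale;
  // Plays the role of createStepForVF: index of the first lane of this part.
  const int64_t StartIdx0 = static_cast<int64_t>(Part) * RuntimeVF;
  std::vector<int64_t> Lanes(RuntimeVF);
  for (unsigned Lane = 0; Lane < RuntimeVF; ++Lane) {
    // (splat(StartIdx0) + stepvector) * splat(Step) + splat(ScalarIV)
    Lanes[Lane] = ScalarIV + (StartIdx0 + Lane) * Step;
  }
  return Lanes;
}
```

With VF = vscale x 2 and vscale = 2, part 1 of an induction starting at 10 with step 3 holds 22, 25, 28, 31. The first KnownMinVF lanes of that vector are the values the patch additionally caches as scalars.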

Various tests to support different cases have been added here:

Transforms/LoopVectorize/AArch64/sve-inductions.ll

Event Timeline

david-arm created this revision. Mar 16 2021, 8:45 AM

Herald added subscribers: hiraditya, kristof.beyls. Mar 16 2021, 8:45 AM

david-arm requested review of this revision. Mar 16 2021, 8:45 AM

Herald added a project: Restricted Project. Mar 16 2021, 8:45 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B94063: Diff 330999. Mar 16 2021, 8:46 AM

david-arm added a parent revision: D97861: [LoopVectorize][NFC] Refactor code to use IRBuilder::CreateStepVector. Mar 16 2021, 8:47 AM

ctetreau added inline comments. Mar 16 2021, 9:23 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2536	what happens in this case now? I see a bunch of branches that handle the case where Lanes is greater than 1, but nothing for when it equals 1.

david-arm added inline comments. Mar 16 2021, 9:27 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2536	When this equals 1 due to being uniform the loop below works fine I think - we just don't bother creating a whole vector for each Part. We only create the first lane instead, which works for both fixed-width and scalable vectors. Are you specifically referring to the possibility of something like <vscale x 1 x Ty>? If so, perhaps you're right and we should still generate a whole vector for each part.

Fixed buildScalarSteps so that it generates the full vector part for VF=(1,scalable).

david-arm marked an inline comment as done. Mar 19 2021, 6:53 AM

david-arm added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2536	Hi @ctetreau, I've changed the code to now check for: !IsUniform && VF.isScalable() so that this is guaranteed to do something sensible for <vscale x 1 x ElTy> vectors as well.
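The difference between the two conditions in this thread shows up only for a non-uniform value with VF = vscale x 1. A minimal sketch, assuming the conditions are exactly `Lanes > 1 && VF.isScalable()` (first diff) and `!IsUniform && VF.isScalable()` (follow-up comment); the helper names are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for the Lanes computation: 1 when the value is
// uniform, otherwise the known minimum element count of the VF.
unsigned lanesFor(bool IsUniform, unsigned KnownMinVF) {
  return IsUniform ? 1u : KnownMinVF;
}

// Condition used in the first version of the diff.
bool buildsVectorOld(bool IsUniform, unsigned KnownMinVF, bool IsScalable) {
  return lanesFor(IsUniform, KnownMinVF) > 1 && IsScalable;
}

// Condition described in the follow-up comment; the element count no longer
// matters, only uniformity and scalability.
bool buildsVectorNew(bool IsUniform, unsigned /*KnownMinVF*/, bool IsScalable) {
  return !IsUniform && IsScalable;
}
```

For <vscale x 1 x ElTy> the old condition skips the full vector even though the runtime vector may hold more than one element, while the new one builds it. For any KnownMinVF above 1 the two agree.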

Harbormaster completed remote builds in B94692: Diff 331857. Mar 19 2021, 8:29 AM

ctetreau added inline comments. Mar 19 2021, 9:47 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2536	Thanks. Unfortunately, I'm quite busy with internal stuff right now, so I probably won't have time to review this closely. If things calm down and this is still up, I'll try to take a closer look. However, if you get LGTM from somebody else feel free to just merge it.

david-arm added a child revision: D99192: [NFC] Add tests for scalable vectorization of loops with large stride acesses. Mar 23 2021, 9:47 AM

david-arm added a reviewer: peterwaller-arm. Mar 23 2021, 9:50 AM

david-arm added a child revision: D99254: [SVE][LoopVectorize] Fix crash in InnerLoopVectorizer::widenPHIInstruction. Mar 24 2021, 4:26 AM

I think this looks good; I just have a couple of minor comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2254	`buildScalarSteps` seems to use `CreateSIToFP`, should we also be using this here?
llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions.ll
113	nit: I think this could be checking for `%[[VSCALE]]` instead of %6?
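On the CreateSIToFP question above: signed and unsigned integer-to-float conversions give the same result whenever the sign bit is clear, and differ only when the top bit is set. The sketch below models the two conversions in plain C++ (function names are hypothetical); it does not settle which one the patch should use. Note that the narrowing cast to int32_t is implementation-defined before C++20, though mainstream compilers wrap.

```cpp
#include <cassert>
#include <cstdint>

// Models sitofp i32 -> float: the bit pattern is read as a signed integer.
float signedToFloat(uint32_t Bits) {
  return static_cast<float>(static_cast<int32_t>(Bits));
}

// Models uitofp i32 -> float: the bit pattern is read as an unsigned integer.
float unsignedToFloat(uint32_t Bits) {
  return static_cast<float>(Bits);
}
```

A small positive value such as a runtime VF of 8 converts identically either way, whereas the all-ones pattern becomes -1.0 when signed and 4294967296.0 when unsigned.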

This revision is now accepted and ready to land. Mar 26 2021, 11:41 AM

Closed by commit rGa08c7736a771: [LoopVectorize] Add support for scalable vectorization of induction variables (authored by david-arm). Mar 30 2021, 3:13 AM

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGa08c7736a771: [LoopVectorize] Add support for scalable vectorization of induction variables.

Ayal mentioned this in D113183: [LV] Patch up induction phis after VPlan execution. Dec 15 2021, 2:38 PM

Revision Contents

Path: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
Size: 59 lines

Path: llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions.ll
Size: 189 lines

Diff 330999

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[2,239 unchanged lines hidden]
   if (Step->getType()->isIntegerTy()) {
     MulOp = Instruction::Mul;
   } else {
     AddOp = II.getInductionOpcode();
     MulOp = Instruction::FMul;
   }

   // Multiply the vectorization factor by the step using integer or
   // floating-point arithmetic as appropriate.
-  Value *ConstVF =
-      getSignedIntOrFpConstant(Step->getType(), VF.getKnownMinValue());
-  Value *Mul = Builder.CreateBinOp(MulOp, Step, ConstVF);
+  Type *StepType = Step->getType();
+  if (Step->getType()->isFloatingPointTy())
+    StepType = IntegerType::get(StepType->getContext(),
+                                StepType->getScalarSizeInBits());
+  Value *RuntimeVF = getRuntimeVF(Builder, StepType, VF);
+  if (Step->getType()->isFloatingPointTy())
+    RuntimeVF = Builder.CreateUIToFP(RuntimeVF, Step->getType());
+  Value *Mul = Builder.CreateBinOp(MulOp, Step, RuntimeVF);

   // Create a vector splat to use in the induction update.
   //
   // FIXME: If the step is non-constant, we create the vector splat with
   //        IRBuilder. IRBuilder can constant-fold the multiply, but it doesn't
   //        handle a constant vector splat.
-  assert(!VF.isScalable() && "scalable vectors not yet supported.");
   Value *SplatVF = isa<Constant>(Mul)
                        ? ConstantVector::getSplat(VF, cast<Constant>(Mul))
                        : Builder.CreateVectorSplat(VF, Mul);
   Builder.restoreIP(CurrIP);

   // We may need to add the step a number of times, depending on the unroll
   // factor. The last of those goes into the PHI.
   PHINode *VecInd = PHINode::Create(SteppedStart->getType(), 2, "vec.ind",
[186 unchanged lines hidden] void InnerLoopVectorizer::widenIntOrFpInduction(PHINode *IV, Value *Start,
   if (!Cost->isScalarEpilogueAllowed())
     CreateSplatIV(ScalarIV, Step);
   buildScalarSteps(ScalarIV, Step, EntryVal, ID, Def, CastDef, State);
 }

 Value *InnerLoopVectorizer::getStepVector(Value *Val, int StartIdx, Value *Step,
                                           Instruction::BinaryOps BinOp) {
   // Create and check the types.
-  assert(isa<FixedVectorType>(Val->getType()) &&
-         "Creation of scalable step vector not yet supported");
   auto *ValVTy = cast<VectorType>(Val->getType());
   ElementCount VLen = ValVTy->getElementCount();

   Type *STy = Val->getType()->getScalarType();
   assert((STy->isIntegerTy() || STy->isFloatingPointTy()) &&
          "Induction Step must be an integer or FP");
   assert(Step->getType() == STy && "Step has wrong type");

[58 unchanged lines hidden] void InnerLoopVectorizer::buildScalarSteps(Value *ScalarIV, Value *Step,

   // Determine the number of scalars we need to generate for each unroll
   // iteration. If EntryVal is uniform, we only need to generate the first
   // lane. Otherwise, we generate all VF values.
   unsigned Lanes =
       Cost->isUniformAfterVectorization(cast<Instruction>(EntryVal), VF)
           ? 1
           : VF.getKnownMinValue();
-  assert((!VF.isScalable() || Lanes == 1) &&
-         "Should never scalarize a scalable vector");
   // Compute the scalar steps and save the results in State.
-  for (unsigned Part = 0; Part < UF; ++Part) {
-    for (unsigned Lane = 0; Lane < Lanes; ++Lane) {
-      auto *IntStepTy = IntegerType::get(ScalarIVTy->getContext(),
-                                         ScalarIVTy->getScalarSizeInBits());
-      Value *StartIdx =
-          createStepForVF(Builder, ConstantInt::get(IntStepTy, Part), VF);
-      if (ScalarIVTy->isFloatingPointTy())
-        StartIdx = Builder.CreateSIToFP(StartIdx, ScalarIVTy);
-      StartIdx = Builder.CreateBinOp(
-          AddOp, StartIdx, getSignedIntOrFpConstant(ScalarIVTy, Lane));
+  Type *IntStepTy = IntegerType::get(ScalarIVTy->getContext(),
+                                     ScalarIVTy->getScalarSizeInBits());
+  Type *VecIVTy = nullptr;
+  Value *UnitStepVec = nullptr, *SplatStep = nullptr, *SplatIV = nullptr;
+  if (Lanes > 1 && VF.isScalable()) {
+    VecIVTy = VectorType::get(ScalarIVTy, VF);
+    UnitStepVec = Builder.CreateStepVector(VectorType::get(IntStepTy, VF));
+    SplatStep = Builder.CreateVectorSplat(VF, Step);
+    SplatIV = Builder.CreateVectorSplat(VF, ScalarIV);
+  }
+
+  for (unsigned Part = 0; Part < UF; ++Part) {
+    Value *StartIdx0 =
+        createStepForVF(Builder, ConstantInt::get(IntStepTy, Part), VF);
+
+    if (Lanes > 1 && VF.isScalable()) {
+      auto *SplatStartIdx = Builder.CreateVectorSplat(VF, StartIdx0);
+      auto *InitVec = Builder.CreateAdd(SplatStartIdx, UnitStepVec);
+      if (ScalarIVTy->isFloatingPointTy())
+        InitVec = Builder.CreateSIToFP(InitVec, VecIVTy);
+      auto *Mul = Builder.CreateBinOp(MulOp, InitVec, SplatStep);
+      auto *Add = Builder.CreateBinOp(AddOp, SplatIV, Mul);
+      State.set(Def, Add, Part);
+      recordVectorLoopValueForInductionCast(ID, EntryVal, Add, CastDef, State,
+                                            Part);
+      // It's useful to record the lane values too for the known minimum number
+      // of elements so we do those below. This improves the code quality when
+      // trying to extract the first element, for example.
+    }
+
+    if (ScalarIVTy->isFloatingPointTy())
+      StartIdx0 = Builder.CreateSIToFP(StartIdx0, ScalarIVTy);
+
+    for (unsigned Lane = 0; Lane < Lanes; ++Lane) {
+      Value *StartIdx = Builder.CreateBinOp(
+          AddOp, StartIdx0, getSignedIntOrFpConstant(ScalarIVTy, Lane));
       // The step returned by `createStepForVF` is a runtime-evaluated value
       // when VF is scalable. Otherwise, it should be folded into a Constant.
       assert((VF.isScalable() || isa<Constant>(StartIdx)) &&
              "Expected StartIdx to be folded to a constant when VF is not "
              "scalable");
       auto *Mul = Builder.CreateBinOp(MulOp, StartIdx, Step);
       auto *Add = Builder.CreateBinOp(AddOp, ScalarIV, Mul);
       State.set(Def, Add, VPIteration(Part, Lane));
[2,159 unchanged lines hidden] void InnerLoopVectorizer::widenPHIInstruction(Instruction *PN,
   case InductionDescriptor::IK_NoInduction:
     llvm_unreachable("Unknown induction");
   case InductionDescriptor::IK_IntInduction:
   case InductionDescriptor::IK_FpInduction:
     llvm_unreachable("Integer/fp induction is handled elsewhere.");
   case InductionDescriptor::IK_PtrInduction: {
     // Handle the pointer induction variable case.
     assert(P->getType()->isPointerTy() && "Unexpected type.");
+    assert(!VF.isScalable() && "Currently unsupported for scalable vectors");

     if (Cost->isScalarAfterVectorization(P, State.VF)) {
       // This is the normalized GEP that starts counting at zero.
       Value *PtrInd =
           Builder.CreateSExtOrTrunc(Induction, II.getStep()->getType());
       // Determine the number of scalars we need to generate for each unroll
       // iteration. If the instruction is uniform, we only need to generate the
       // first lane. Otherwise, we generate all VF values.
[5,197 unchanged lines hidden]
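The createVectorIntOrFpInductionPHI change replaces the constant VF with a runtime value, so the per-iteration increment of the vector induction becomes Step * (known minimum VF * vscale). A minimal sketch of that arithmetic for an integer induction, assuming a concrete vscale; inductionIncrement and vecIndAt are hypothetical names, not LLVM functions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model, not LLVM code. KnownMinVF and VScale stand in for the
// ElementCount and the run-time result of llvm.vscale.
int32_t inductionIncrement(int32_t Step, unsigned KnownMinVF, unsigned VScale) {
  // Mirrors Mul = Step * RuntimeVF, which the patch then splats.
  return Step * static_cast<int32_t>(KnownMinVF * VScale);
}

// Value held by the vec.ind PHI on the given vector iteration (0-based).
std::vector<int32_t> vecIndAt(int32_t Start, int32_t Step, unsigned KnownMinVF,
                              unsigned VScale, unsigned Iteration) {
  const unsigned RuntimeVF = KnownMinVF * VScale;
  const int32_t Inc = inductionIncrement(Step, KnownMinVF, VScale);
  std::vector<int32_t> Lanes(RuntimeVF);
  for (unsigned Lane = 0; Lane < RuntimeVF; ++Lane)
    Lanes[Lane] = Start + static_cast<int32_t>(Lane) * Step +
                  static_cast<int32_t>(Iteration) * Inc;
  return Lanes;
}
```

With Step = 2 and VF = vscale x 4 the increment is vscale * 8, which corresponds to the `shl i32 ..., 3` of the vscale value checked in the add_unique_ind32 test.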

llvm/test/Transforms/LoopVectorize/AArch64/sve-inductions.ll

This file was added.

				; RUN: opt -mtriple aarch64-linux-gnu -mattr=+sve -loop-vectorize -dce -instcombine < %s -S 2>%t | FileCheck %s

				; RUN: FileCheck --check-prefix=WARN --allow-empty %s <%t

				; If this check fails please read test/CodeGen/AArch64/README for instructions on how to resolve it.
				; WARN-NOT: warning

				; Test that we can add on the induction variable
				; for (long long i = 0; i < n; i++) {
				; a[i] = b[i] + i;
				; }
				; with an unroll factor (interleave count) of 2.

				define void @add_ind64_unrolled(i64* noalias nocapture %a, i64* noalias nocapture readonly %b, i64 %n) {
				; CHECK-LABEL: @add_ind64_unrolled(
				; CHECK-NEXT: entry:
				; CHECK: vector.body:
				; CHECK-NEXT: %[[INDEX:.*]] = phi i64 [ 0, %vector.ph ], [ %{{.*}}, %vector.body ]
				; CHECK-NEXT: %[[STEPVEC:.*]] = call <vscale x 2 x i64> @llvm.experimental.stepvector.nxv2i64()
				; CHECK-NEXT: %[[TMP1:.*]] = insertelement <vscale x 2 x i64> poison, i64 %[[INDEX]], i32 0
				; CHECK-NEXT: %[[IDXSPLT:.*]] = shufflevector <vscale x 2 x i64> %[[TMP1]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
				; CHECK-NEXT: %[[VECIND1:.*]] = add <vscale x 2 x i64> %[[IDXSPLT]], %[[STEPVEC]]
				; CHECK-NEXT: %[[VSCALE:.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: %[[EC:.*]] = shl i64 %[[VSCALE]], 1
				; CHECK-NEXT: %[[TMP2:.*]] = insertelement <vscale x 2 x i64> poison, i64 %[[EC]], i32 0
				; CHECK-NEXT: %[[ECSPLT:.*]] = shufflevector <vscale x 2 x i64> %[[TMP2]], <vscale x 2 x i64> poison, <vscale x 2 x i32> zeroinitializer
				; CHECK-NEXT: %[[TMP3:.*]] = add <vscale x 2 x i64> %[[ECSPLT]], %[[STEPVEC]]
				; CHECK-NEXT: %[[VECIND2:.*]] = add <vscale x 2 x i64> %[[IDXSPLT]], %[[TMP3]]
				; CHECK: %[[LOAD1:.*]] = load <vscale x 2 x i64>
				; CHECK: %[[LOAD2:.*]] = load <vscale x 2 x i64>
				; CHECK: %[[STOREVAL1:.*]] = add nsw <vscale x 2 x i64> %[[LOAD1]], %[[VECIND1]]
				; CHECK: %[[STOREVAL2:.*]] = add nsw <vscale x 2 x i64> %[[LOAD2]], %[[VECIND2]]
				; CHECK: store <vscale x 2 x i64> %[[STOREVAL1]]
				; CHECK: store <vscale x 2 x i64> %[[STOREVAL2]]

				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%i.08 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds i64, i64* %b, i64 %i.08
				%0 = load i64, i64* %arrayidx, align 8
				%add = add nsw i64 %0, %i.08
				%arrayidx1 = getelementptr inbounds i64, i64* %a, i64 %i.08
				store i64 %add, i64* %arrayidx1, align 8
				%inc = add nuw nsw i64 %i.08, 1
				%exitcond.not = icmp eq i64 %inc, %n
				br i1 %exitcond.not, label %exit, label %for.body, !llvm.loop !0

				exit: ; preds = %for.body
				ret void
				}


				; Test that we can vectorize a separate induction variable (not used for the branch)
				; int r = 0;
				; for (long long i = 0; i < n; i++) {
				; a[i] = r;
				; r += 2;
				; }
				; with an unroll factor (interleave count) of 1.


				define void @add_unique_ind32(i32* noalias nocapture %a, i64 %n) {
				; CHECK-LABEL: @add_unique_ind32(
				; CHECK: vector.ph:
				; CHECK: %[[STEPVEC:.*]] = call <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()
				; CHECK-NEXT: %[[INDINIT:.*]] = shl <vscale x 4 x i32> %[[STEPVEC]], shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> undef, i32 1, i32 0), <vscale x 4 x i32> undef, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: %[[VSCALE:.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: %[[INC:.*]] = shl i32 %6, 3
				; CHECK-NEXT: %[[TMP:.*]] = insertelement <vscale x 4 x i32> poison, i32 %[[INC]], i32 0
				; CHECK-NEXT: %[[VECINC:.*]] = shufflevector <vscale x 4 x i32> %[[TMP]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK: vector.body:
				; CHECK: %[[VECIND:.*]] = phi <vscale x 4 x i32> [ %[[INDINIT]], %vector.ph ], [ %[[VECINDNXT:.*]], %vector.body ]
				; CHECK: store <vscale x 4 x i32> %[[VECIND]]
				; CHECK: %[[VECINDNXT]] = add <vscale x 4 x i32> %[[VECIND]], %[[VECINC]]
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%i.08 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
				%r.07 = phi i32 [ %add, %for.body ], [ 0, %entry ]
				%arrayidx = getelementptr inbounds i32, i32* %a, i64 %i.08
				store i32 %r.07, i32* %arrayidx, align 4
				%add = add nuw nsw i32 %r.07, 2
				%inc = add nuw nsw i64 %i.08, 1
				%exitcond.not = icmp eq i64 %inc, %n
				br i1 %exitcond.not, label %exit, label %for.body, !llvm.loop !6

				exit: ; preds = %for.body
				ret void
				}


				; Test that we can vectorize a separate FP induction variable (not used for the branch)
				; float r = 0;
				; for (long long i = 0; i < n; i++) {
				; a[i] = r;
				; r += 2;
				; }
				; with an unroll factor (interleave count) of 1.

				define void @add_unique_indf32(float* noalias nocapture %a, i64 %n) {
				; CHECK-LABEL: @add_unique_indf32(
				; CHECK: vector.ph:
				; CHECK: %[[STEPVEC:.*]] = call <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()
				; CHECK-NEXT: %[[TMP1:.*]] = uitofp <vscale x 4 x i32> %[[STEPVEC]] to <vscale x 4 x float>
				; CHECK-NEXT: %[[TMP2:.*]] = fmul <vscale x 4 x float> %[[TMP1]], shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float 2.000000e+00, i32 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: %[[INDINIT:.*]] = fadd <vscale x 4 x float> %[[TMP2]], shufflevector (<vscale x 4 x float> insertelement (<vscale x 4 x float> poison, float 0.000000e+00, i32 0), <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: %[[VSCALE:.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: %[[TMP3:.*]] = shl i32 %8, 2
				; CHECK-NEXT: %[[TMP4:.*]] = uitofp i32 %[[TMP3]] to float
				; CHECK-NEXT: %[[INC:.*]] = fmul float %[[TMP4]], 2.000000e+00
				; CHECK-NEXT: %[[TMP5:.*]] = insertelement <vscale x 4 x float> poison, float %[[INC]], i32 0
				; CHECK-NEXT: %[[VECINC:.*]] = shufflevector <vscale x 4 x float> %[[TMP5]], <vscale x 4 x float> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK: vector.body:
				; CHECK: %[[VECIND:.*]] = phi <vscale x 4 x float> [ %[[INDINIT]], %vector.ph ], [ %[[VECINDNXT:.*]], %vector.body ]
				; CHECK: store <vscale x 4 x float> %[[VECIND]]
				; CHECK: %[[VECINDNXT]] = fadd <vscale x 4 x float> %[[VECIND]], %[[VECINC]]

				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%i.08 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
				%r.07 = phi float [ %add, %for.body ], [ 0.000000e+00, %entry ]
				%arrayidx = getelementptr inbounds float, float* %a, i64 %i.08
				store float %r.07, float* %arrayidx, align 4
				%add = fadd float %r.07, 2.000000e+00
				%inc = add nuw nsw i64 %i.08, 1
				%exitcond.not = icmp eq i64 %inc, %n
				br i1 %exitcond.not, label %exit, label %for.body, !llvm.loop !6

				exit: ; preds = %for.body
				ret void
				}

				; Test a case where the vectorised induction variable is used to
				; generate a mask:
				; for (long long i = 0; i < n; i++) {
				; if (i & 0x1)
				; a[i] = b[i];
				; }

				define void @cond_ind64(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i64 %n) {
				; CHECK-LABEL: @cond_ind64(
				; CHECK: vector.body:
				; CHECK-NEXT: %[[INDEX:.*]] = phi i64 [ 0, %vector.ph ], [ %{{.*}}, %vector.body ]
				; CHECK: %[[STEPVEC:.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: %[[TMP1:.*]] = insertelement <vscale x 4 x i64> poison, i64 %[[INDEX]], i32 0
				; CHECK-NEXT: %[[IDXSPLT:.*]] = shufflevector <vscale x 4 x i64> %[[TMP1]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: %[[VECIND:.*]] = add <vscale x 4 x i64> %[[IDXSPLT]], %[[STEPVEC]]
				; CHECK-NEXT: %[[MASK:.*]] = trunc <vscale x 4 x i64> %[[VECIND]] to <vscale x 4 x i1>
				; CHECK: %[[LOAD:.*]] = call <vscale x 4 x i32> @llvm.masked.load.nxv4i32.p0nxv4i32(<vscale x 4 x i32>* %{{.*}}, i32 4, <vscale x 4 x i1> %[[MASK]], <vscale x 4 x i32> poison)
				; CHECK: call void @llvm.masked.store.nxv4i32.p0nxv4i32(<vscale x 4 x i32> %[[LOAD]], <vscale x 4 x i32>* %{{.*}}, i32 4, <vscale x 4 x i1> %[[MASK]])
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.inc
				%i.08 = phi i64 [ %inc, %for.inc ], [ 0, %entry ]
				%and = and i64 %i.08, 1
				%tobool.not = icmp eq i64 %and, 0
				br i1 %tobool.not, label %for.inc, label %if.then

				if.then: ; preds = %for.body
				%arrayidx = getelementptr inbounds i32, i32* %b, i64 %i.08
				%0 = load i32, i32* %arrayidx, align 4
				%arrayidx1 = getelementptr inbounds i32, i32* %a, i64 %i.08
				store i32 %0, i32* %arrayidx1, align 4
				br label %for.inc

				for.inc: ; preds = %for.body, %if.then
				%inc = add nuw nsw i64 %i.08, 1
				%exitcond.not = icmp eq i64 %inc, %n
				br i1 %exitcond.not, label %exit, label %for.body, !llvm.loop !6

				exit: ; preds = %for.inc
				ret void
				}

				!0 = distinct !{!0, !1, !2, !3, !4, !5}
				!1 = !{!"llvm.loop.mustprogress"}
				!2 = !{!"llvm.loop.vectorize.width", i32 2}
				!3 = !{!"llvm.loop.vectorize.scalable.enable", i1 true}
				!4 = !{!"llvm.loop.interleave.count", i32 2}
				!5 = !{!"llvm.loop.vectorize.enable", i1 true}
				!6 = distinct !{!6, !1, !7, !3, !8, !5}
				!7 = !{!"llvm.loop.vectorize.width", i32 4}
				!8 = !{!"llvm.loop.interleave.count", i32 1}
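In the cond_ind64 test the mask is produced by truncating the vector induction to i1, which keeps only the low bit of each element and so matches the source condition `i & 0x1`. A minimal sketch of that equivalence, assuming a concrete runtime vector length; maskFor is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Models trunc <N x i64> to <N x i1> applied to splat(Index) + stepvector,
// where N is the runtime vector length.
std::vector<bool> maskFor(int64_t Index, unsigned RuntimeVF) {
  std::vector<bool> Mask(RuntimeVF);
  for (unsigned Lane = 0; Lane < RuntimeVF; ++Lane) {
    const uint64_t Elem = static_cast<uint64_t>(Index) + Lane;
    Mask[Lane] = (Elem & 1u) != 0; // trunc to i1 keeps only the low bit
  }
  return Mask;
}
```

Starting at index 4 with four lanes, the active lanes are those for i = 5 and i = 7, the odd iterations on which the scalar loop performs the copy.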