This is an archive of the discontinued LLVM Phabricator instance.

Differential D109432

[LoopVectorize] Permit fixed-width epilogue loops for scalable vector bodies
ClosedPublic

Authored by david-arm on Sep 8 2021, 4:14 AM.

Details

Reviewers

sdesmalen
CarolineConcatto
bmahjour
kmclaughlin
fhahn

Commits

rGc42bb30b9e29: [LoopVectorize] Permit fixed-width epilogue loops for scalable vector bodies

Summary

At the moment in LoopVectorizationCostModel::selectEpilogueVectorizationFactor
we bail out if the main vector loop uses a scalable VF. This patch adds
support for generating epilogue vector loops using a fixed-width VF when the
main vector loop uses a scalable VF.

I've changed LoopVectorizationCostModel::selectEpilogueVectorizationFactor
so that we convert the scalable VF into a fixed-width VF and do profitability
checks on that instead. In addition, since the scalable and fixed-width VFs
live in different VPlans that means I had to change the calls to
LVP.hasPlanWithVFs so that we only pass in the fixed-width VF.
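The conversion described above can be illustrated with a small stand-alone sketch. `SimpleElementCount` is a hypothetical stand-in for LLVM's `ElementCount` (it is not the real class) and `toFixedWidth` is an illustrative helper name; only the idea of reusing the known minimum lane count as a fixed-width VF is taken from the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for llvm::ElementCount: a scalable count means
// "KnownMin x vscale" lanes, a fixed count means exactly KnownMin lanes.
struct SimpleElementCount {
  unsigned KnownMin;
  bool Scalable;

  static SimpleElementCount getFixed(unsigned N) { return {N, false}; }
  static SimpleElementCount getScalable(unsigned N) { return {N, true}; }
};

// Illustrative helper (the name is an assumption): drop the vscale
// multiplier and keep the known minimum, mirroring
// ElementCount::getFixed(MainLoopVF.getKnownMinValue()) in the patch.
inline SimpleElementCount toFixedWidth(SimpleElementCount VF) {
  assert(VF.KnownMin > 0 && "a VF must have at least one lane");
  return SimpleElementCount::getFixed(VF.KnownMin);
}
```

Under this model, a main loop at "vscale x 16" is treated as VF=16 for the epilogue profitability checks, and a VF that is already fixed-width passes through unchanged.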

New tests added here:

Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

david-arm created this revision. Sep 8 2021, 4:14 AM

Herald added subscribers: ctetreau, rogfer01, bollu and 2 others. Sep 8 2021, 4:14 AM

david-arm requested review of this revision. Sep 8 2021, 4:14 AM

Herald added a project: Restricted Project. Sep 8 2021, 4:14 AM

Herald added subscribers: llvm-commits, vkmr.

Harbormaster completed remote builds in B123032: Diff 371301. Sep 8 2021, 4:15 AM

david-arm added a parent revision: D109364: [NFC] Replace unsigned VF with ElementCount in EpilogueLoopVectorizationInfo. Sep 8 2021, 4:15 AM

junparser added a subscriber: junparser. Sep 9 2021, 1:42 AM

Different plans may have different recipes, so if we don't find a single plan that supports both the main VF and the epilogue VF, I worry we may get unintended codegen. Furthermore, what guarantees that a vplan with a fixed-width conversion from the scalable VF actually exists? Would it be possible to build scalable plans that include the corresponding fixed width (and some lower powers of 2) VFs?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6266–6275	"EpilogueVF" is confusing here, since the rest of the logic expects main-loop VF. Suggest renaming to "KnownMainLoopVF" or "FixedMainLoopVF" or "FixedWidthMainLoopVF".
6270	include the fixed-width value in the debug message?

In D109432#2995562, @bmahjour wrote:

Different plans may have different recipes, so if we don't find a single plan that supports both the main VF and the epilogue VF, I worry we may get unintended codegen. Furthermore, what guarantees that a vplan with a fixed-width conversion from the scalable VF actually exists? Would it be possible to build scalable plans that include the corresponding fixed width (and some lower powers of 2) VFs?

Hi @bmahjour thanks for taking a look at the patch and leaving some comments!

So in LoopVectorizationPlanner::plan we do try to build vplans with equivalent ranges of VFs, i.e.

buildVPlansWithVPRecipes(ElementCount::getFixed(1), MaxFactors.FixedVF);
buildVPlansWithVPRecipes(ElementCount::getScalable(1), MaxFactors.ScalableVF);

Even if for some reason we are missing the fixed-width equivalent from the vplan the worst thing that can happen is we just won't produce an epilogue loop I think? In future we do intend to also add support for using scalable vectors in the epilogue loop as well, but I imagine that this is less likely to be profitable because "VF=vscale x 2" will likely cover more scalar iterations than "VF=2". This means there is a lower chance of us ever entering the epilogue loop.
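To make the profitability argument concrete, here is a rough sketch that counts how many scalar iterations are left for the epilogue and whether a vector epilogue of a given width could run at all. The helper names, the trip count and the `vscale` value are made-up illustration values, and the entry condition is a deliberate simplification of the real minimum-iteration check.

```cpp
#include <cassert>
#include <cstdint>

// Scalar iterations left over once the main vector loop has consumed as many
// full steps of (LanesPerIter * UF) as fit into TripCount.
inline std::uint64_t remainingIterations(std::uint64_t TripCount,
                                         std::uint64_t LanesPerIter,
                                         std::uint64_t UF) {
  assert(LanesPerIter > 0 && UF > 0 && "step must be non-zero");
  return TripCount % (LanesPerIter * UF);
}

// Simplified entry condition (an assumption of this sketch): the vector
// epilogue only runs if at least one full epilogue vector of work remains.
inline bool entersVectorEpilogue(std::uint64_t Remaining,
                                 std::uint64_t EpilogueLanes) {
  return Remaining >= EpilogueLanes;
}
```

For example, with a trip count of 100 and a main loop at "vscale x 4" on hardware where vscale is 4 (16 lanes, UF=1), 4 iterations remain. A fixed-width epilogue at VF=2 can run, whereas an epilogue at "vscale x 2" (8 lanes) cannot.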

However, you make a good point about the possibility of having different recipes for the same loop depending upon whether it's scalable or fixed-width. I wasn't sure if this would actually be a problem or not because the vectorised loops are independent. At the moment I can't think of a scenario where the recipes would be different, but perhaps I can add code to bail out if the recipes are different?

Renamed EpilogueVF -> FixedMainLoopVF
Improved new LLVM_DEBUG message to include FixedMainLoopVF

david-arm marked 2 inline comments as done. Sep 13 2021, 8:41 AM

Harbormaster completed remote builds in B123683: Diff 372266. Sep 13 2021, 9:38 AM

However, you make a good point about the possibility of having different recipes for the same loop depending upon whether it's scalable or fixed-width.

Even for non-scalable targets, this patch may cause the epilogue plan to be different from that of the main loop. One could argue that it may be desirable to have different plans (recipes) for the two in some cases, but at least the selection criteria should not be arbitrary (as it would be if we just look for *a* plan that matches the VF).

I'm still wondering why we currently create two separate plans one for the scalable VFs and one for the fixed-width VFs, instead of one plan that includes a union of fixed-width VFs and scalable VFs? If we had such a single plan, then the workarounds in this patch would not be necessary.

Matt added a subscriber: Matt. Sep 16 2021, 1:22 PM

Fixed a bug in createEpilogueVectorizedLoopSkeleton where we created the wrong Step value.

I have run LNT and SPEC2006 on this patch and all tests pass with these changes.

Harbormaster completed remote builds in B126096: Diff 375560. Sep 28 2021, 7:33 AM

I found no overall change in performance with SPEC2006 when building with scalable vectorisation on an A64FX machine.

Hi @david-arm, thanks for working on this. I don't see any fundamental issues with having a different VPlan for the epilogue loop and the main vector body. In fact, it makes sense to me to pick the most suitable and cost-effective (VF, Plan) pair for the epilogue. If we want to use scalable vectors for the epilogue loop at some point, then we'll also want to use predication, so I could see that requiring a different VPlan. It doesn't seem to me that this patch makes arbitrary decisions about which VPlan to choose, it aims to pick the VF with the lowest cost, although it currently falls back (consistently) to a fixed-width VPlan when the main loop has a scalable VF.

Thanks for testing that there are no functional regressions when having different VPlans for the main- and epilogue vector loops.

I made some suggestions to simplify the implementation, hope my comments make sense.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
6248	Can this change be committed as a separate patch? This seems like a sensible bugfix to me (although I have no idea how to test for it).
6267	nit: Can you rewrite this as:
if (MainLoopVF.isScalable())
  LLVM_DEBUG(...);
auto FixedMainLoopVF = ElementCount::getFixed(MainLoopVF.getKnownMinValue());
`FixedMainLoopVF.isScalable()` reads like a contradictory expression that should never happen.
8229	I see that you're just building on top of the current mechanism, but I'd rather see `setBestPlan` replaced by `getBestPlanFor`, which returns a pointer to the VPlan that can be passed to `LoopVectorizationPlanner::executePlan`. That way, you don't need to do all this odd trickery with removing vplans and requiring `BackupPlans` to repopulate the set for a second call to `setBestPlan`. Then the code becomes a bit simpler to follow:
auto Plan = getBestPlanFor(VF, UF);
LVP.executePlan(Plan, VF, UF, ILV, DT);
I don't know whether the type of `auto Plan` can be `VPlan*` or whether it needs to be an instance of `std::unique_ptr<VPlan>`. It would be convenient if the LoopVectorizationPlanner can keep ownership of all VPlans until the end, where `LoopVectorizationPlanner::executePlan` just invokes the relevant VPlan to execute.
8414	Not sure I fully understand it, but is the VF at this point always guaranteed to be a fixed-width VF? If so, can we avoid making this change here (and instead s/getKnownMinValue/getFixedValue/)? I'm sure we'll want this change at some point when we make the epilogue VF scalable, but perhaps this patch is not the one to change it in.
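The ownership question raised in the 8229 comment can be sketched with toy types. `ToyPlan` and `ToyPlanner` are hypothetical stand-ins rather than the real `VPlan` and `LoopVectorizationPlanner`, and VFs are modelled as plain integers; the sketch only shows one way a planner might keep ownership of every plan while handing out non-owning references, under those assumptions.

```cpp
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified stand-in for a VPlan.
struct ToyPlan {
  std::vector<unsigned> VFs; // vectorization factors this plan supports

  bool hasVF(unsigned VF) const {
    for (unsigned V : VFs)
      if (V == VF)
        return true;
    return false;
  }
};

// Hypothetical planner that owns all of its plans for its whole lifetime.
class ToyPlanner {
  std::vector<std::unique_ptr<ToyPlan>> Plans;

public:
  void addPlan(std::vector<unsigned> VFs) {
    Plans.push_back(std::make_unique<ToyPlan>(ToyPlan{std::move(VFs)}));
  }

  // Return a non-owning reference to the first plan that supports VF.
  // Ownership stays with the planner, so the call can be repeated.
  ToyPlan &getBestPlanFor(unsigned VF) {
    for (auto &Plan : Plans)
      if (Plan->hasVF(VF))
        return *Plan;
    throw std::out_of_range("no plan found for the requested VF");
  }
};
```

Because nothing is moved out of the planner, a caller can look up one plan for the main loop and another for the epilogue without any backup-and-restore step between the two lookups.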

In D109432#3002639, @bmahjour wrote:

I'm still wondering why we currently create two separate plans one for the scalable VFs and one for the fixed-width VFs, instead of one plan that includes a union of fixed-width VFs and scalable VFs? If we had such a single plan, then the workarounds in this patch would not be necessary.

Scalable VFs require different VPlans because parts of the plan that are legal for fixed-width VFs may not work for scalable VFs, or may not be profitable. For example, scalable vectors may use gather/scatter instructions, whereas for fixed-width vectors it may resort to scalarised loads/stores.
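Since fixed-width and scalable VFs live in separate plans, a query that demands one plan containing both the main-loop VF and the epilogue VF cannot succeed once the main loop is scalable. The sketch below models that with toy types: `ToyVF` and `ToyVFList` are hypothetical, and the two functions only imitate the shape of the old `hasPlanWithVFs` and the new `hasPlanWithVF` queries.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Hypothetical model: a VF is (known minimum lanes, is scalable) and a plan
// is just the list of VFs it covers.
using ToyVF = std::pair<unsigned, bool>;
using ToyVFList = std::vector<ToyVF>;

// Old-style query: is there ONE plan containing every VF in Wanted?
inline bool hasPlanWithVFs(const std::vector<ToyVFList> &Plans,
                           const ToyVFList &Wanted) {
  return std::any_of(Plans.begin(), Plans.end(), [&](const ToyVFList &Plan) {
    return std::all_of(Wanted.begin(), Wanted.end(), [&](const ToyVF &VF) {
      return std::find(Plan.begin(), Plan.end(), VF) != Plan.end();
    });
  });
}

// New-style query: is there ANY plan containing this single VF?
inline bool hasPlanWithVF(const std::vector<ToyVFList> &Plans, ToyVF VF) {
  return std::any_of(Plans.begin(), Plans.end(), [&](const ToyVFList &Plan) {
    return std::find(Plan.begin(), Plan.end(), VF) != Plan.end();
  });
}
```

With one plan covering fixed VFs and another covering scalable VFs, asking for "vscale x 4" together with fixed VF=2 fails under the old query, while asking for fixed VF=2 alone succeeds.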

In D109432#3033037, @david-arm wrote:

I found no overall change in performance with SPEC2006 when building with scalable vectorisation on an A64FX machine.

Are you sure that you actually enabled/used tail vectorization? I would have expected differences in performance if this was used.

Yeah I'm sure. I'm happy to run some more tests though on a different machine! This simply points to a couple of things I think:

It may suggest that not much hot C/C++ code in SPEC2006 is currently vectorisable. We might see more difference if we could test Fortran benchmarks.
If there are any vectorised loops then the main body trip count is possibly far larger than the tail trip count. I imagine tail vectorisation has the largest impact on smaller loops?

david-arm added inline comments. Oct 4 2021, 1:04 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8414	This fix is needed for correctness in the tests below, i.e. see ; CHECK: [[IDX_NXT]] = add nuw i64 [[IDX]], [[VEC_ITS1]] Without this fix we end up incrementing the induction variable by the wrong VF, which leads to subsequent crashes in a later pass too.
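The "wrong VF" problem in the 8414 comment can be shown with plain arithmetic. In the sketch below, `ToyScalableVF` and both helpers are hypothetical; `runtimeStep` merely stands in for what `getRuntimeVF(B, IdxTy, VF * UF)` is expected to compute, namely a step that includes the `vscale` multiplier for scalable VFs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a VF: KnownMin lanes, multiplied by the hardware's
// vscale at run time when Scalable is true.
struct ToyScalableVF {
  std::uint64_t KnownMin;
  bool Scalable;
};

// Step derived from the known minimum only, as in the replaced
// ConstantInt::get(IdxTy, VF.getKnownMinValue() * UF) line.
inline std::uint64_t stepFromKnownMin(ToyScalableVF VF, std::uint64_t UF) {
  return VF.KnownMin * UF;
}

// Step that also accounts for vscale when the VF is scalable.
inline std::uint64_t runtimeStep(ToyScalableVF VF, std::uint64_t UF,
                                 std::uint64_t VScale) {
  assert(VScale > 0 && "vscale is at least 1");
  return VF.KnownMin * UF * (VF.Scalable ? VScale : 1);
}
```

For a main loop at "vscale x 16" with UF=2 on hardware where vscale is 2, the known-minimum step is 32 while each vector iteration actually covers 64 elements. For a fixed-width VF the two helpers agree.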

Hi @sdesmalen, also I guess we may still not be vectorising much on A64FX due to the cost model. Perhaps I can test this out in conjunction with my other patches to tune the cost model?

david-arm added inline comments. Oct 4 2021, 8:53 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8229	Hi @sdesmalen, I'm happy to take a look at this. I think in order to keep the VPlans in LoopVectorizationPlanner and return a pointer to the best vplan from `getBestPlanFor` I'd need to change VPlans to use `VPlan*` instead of `std::unique_ptr<VPlan>`. Not sure if @fhahn or @bmahjour have any thoughts on this?

david-arm added inline comments. Oct 4 2021, 9:20 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
8229	Actually, creating function prototypes like this seems to work:
VPlanPtr &getBestPlanFor(ElementCount VF, unsigned UF);
void executePlan(VPlanPtr &BestPlan, InnerLoopVectorizer &LB, DominatorTree *DT);

Spun-off patch D111125 to remove setBestPlan in favour of getBestPlanFor and using that in this patch.
Addressed other review comments.

david-arm edited the summary of this revision. Oct 5 2021, 1:59 AM

david-arm added a parent revision: D111125: [NFC][LoopVectorize] Remove setBestPlan in favour of getBestPlanFor.

Harbormaster completed remote builds in B126995: Diff 377117. Oct 5 2021, 1:59 AM

I observed no regressions when running SPEC2006 on a Neoverse-N1 machine. In fact, overall I saw a 0.9% performance improvement using the geometric mean of all results. Biggest outliers:

471.omnetpp: 2.7% faster
429.mcf: 2.2% faster
483.xalancbmk: 1.7% faster

Rebase.

Harbormaster completed remote builds in B127737: Diff 378179. Oct 8 2021, 6:00 AM

Rebase.

Harbormaster completed remote builds in B130890: Diff 382569. Oct 27 2021, 2:30 AM

sdesmalen added inline comments. Oct 27 2021, 2:38 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
10458	nit: please move the definition of `BestEpiPlan` closer to its use (above line 10438)
llvm/test/Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll
43	Can you add a few more check lines here? e.g. I'm not sure if the interleaving is disabled for the epilogue loop (if not, then it would need a check for a second store), or if this is actually a loop with a back-edge to vec.epilog.vector.body, and what the increment would be. I assume it's `8` given `<8 x i8>`, but it would be good to have a CHECK line for it.

david-arm updated this revision to Diff 383764. Nov 1 2021, 3:55 AM

Harbormaster completed remote builds in B131717: Diff 383764. Nov 1 2021, 3:55 AM

david-arm marked 2 inline comments as done. Nov 1 2021, 3:56 AM

david-arm added inline comments.

llvm/test/Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll
43	Given the structure is much more complicated with epilogues with the extra checks and so on, I decided it's easiest just to autogenerate the CHECK lines with utils/update_test_checks.py! It's probably useful to show the whole control flow anyway for at least one test.

sdesmalen added inline comments. Nov 5 2021, 5:09 AM

llvm/test/Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll
4	Is it worth fixing the target-instruction-cost to 1, so that this test doesn't fail when someone updates the cost-model?

Added -force-target-instruction-cost=1 to RUN line

david-arm marked an inline comment as done. Nov 5 2021, 6:25 AM

Harbormaster completed remote builds in B132673: Diff 385052. Nov 5 2021, 6:25 AM

Thanks @david-arm. LGTM!

This revision is now accepted and ready to land. Nov 5 2021, 7:01 AM

This revision was landed with ongoing or failed builds. Nov 8 2021, 1:41 AM

Closed by commit rGc42bb30b9e29: [LoopVectorize] Permit fixed-width epilogue loops for scalable vector bodies (authored by david-arm).

This revision was automatically updated to reflect the committed changes.

david-arm added a commit: rGc42bb30b9e29: [LoopVectorize] Permit fixed-width epilogue loops for scalable vector bodies.

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h	9 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	44 lines
llvm/test/Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll	113 lines
llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-scalable.ll	5 lines

Diff 385425

llvm/lib/Transforms/Vectorize/LoopVectorizationPlanner.h

[... 297 lines not shown ...] void executePlan(ElementCount VF, unsigned UF, VPlan &BestPlan,
                    InnerLoopVectorizer &LB, DominatorTree *DT);

 #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
   void printPlans(raw_ostream &O);
 #endif

   /// Look through the existing plans and return true if we have one with all
   /// the vectorization factors in question.
-  bool hasPlanWithVFs(const ArrayRef<ElementCount> VFs) const {
-    return any_of(VPlans, [&](const VPlanPtr &Plan) {
-      return all_of(VFs, [&](const ElementCount &VF) {
-        return Plan->hasVF(VF);
-      });
-    });
+  bool hasPlanWithVF(ElementCount VF) const {
+    return any_of(VPlans,
+                  [&](const VPlanPtr &Plan) { return Plan->hasVF(VF); });
   }

   /// Test a \p Predicate on a \p Range of VF's. Return the value of applying
   /// \p Predicate on Range.Start, possibly decreasing Range.End such that the
   /// returned value holds for the entire \p Range.
   static bool
   getDecisionAndClampRange(const std::function<bool(ElementCount)> &Predicate,
                            VFRange &Range);
[... 42 lines not shown ...]

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

[... 6,227 lines not shown ...] LoopVectorizationCostModel::selectEpilogueVectorizationFactor(

   if (!isScalarEpilogueAllowed()) {
     LLVM_DEBUG(
         dbgs() << "LEV: Unable to vectorize epilogue because no epilogue is "
                   "allowed.\n";);
     return Result;
   }

-  // FIXME: This can be fixed for scalable vectors later, because at this stage
-  // the LoopVectorizer will only consider vectorizing a loop with scalable
-  // vectors when the loop has a hint to enable vectorization for a given VF.
-  if (MainLoopVF.isScalable()) {
-    LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization for scalable vectors not "
-                         "yet supported.\n");
-    return Result;
-  }
-
   // Not really a cost consideration, but check for unsupported cases here to
   // simplify the logic.
   if (!isCandidateForEpilogueVectorization(*TheLoop, MainLoopVF)) {
     LLVM_DEBUG(
         dbgs() << "LEV: Unable to vectorize epilogue because the loop is "
                   "not a supported candidate.\n";);
     return Result;
   }

   if (EpilogueVectorizationForceVF > 1) {
     LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization factor is forced.\n";);
     ElementCount ForcedEC = ElementCount::getFixed(EpilogueVectorizationForceVF);
-    if (LVP.hasPlanWithVFs({MainLoopVF, ForcedEC}))
+    if (LVP.hasPlanWithVF(ForcedEC))
       return {ForcedEC, 0};
     else {
       LLVM_DEBUG(
           dbgs()
           << "LEV: Epilogue vectorization forced factor is not viable.\n";);
       return Result;
     }
   }

   if (TheLoop->getHeader()->getParent()->hasOptSize() ||
       TheLoop->getHeader()->getParent()->hasMinSize()) {
     LLVM_DEBUG(
         dbgs()
         << "LEV: Epilogue vectorization skipped due to opt for size.\n";);
     return Result;
   }

-  if (!isEpilogueVectorizationProfitable(MainLoopVF))
+  auto FixedMainLoopVF = ElementCount::getFixed(MainLoopVF.getKnownMinValue());
+  if (MainLoopVF.isScalable())
+    LLVM_DEBUG(
+        dbgs() << "LEV: Epilogue vectorization using scalable vectors not "
+                  "yet supported. Converting to fixed-width (VF="
+               << FixedMainLoopVF << ") instead\n");
+
+  if (!isEpilogueVectorizationProfitable(FixedMainLoopVF)) {
+    LLVM_DEBUG(dbgs() << "LEV: Epilogue vectorization is not profitable for "
+                         "this loop\n");
     return Result;
+  }

   for (auto &NextVF : ProfitableVFs)
-    if (ElementCount::isKnownLT(NextVF.Width, MainLoopVF) &&
+    if (ElementCount::isKnownLT(NextVF.Width, FixedMainLoopVF) &&
         (Result.Width.getFixedValue() == 1 ||
          isMoreProfitable(NextVF, Result)) &&
-        LVP.hasPlanWithVFs({MainLoopVF, NextVF.Width}))
+        LVP.hasPlanWithVF(NextVF.Width))
       Result = NextVF;

   if (Result != VectorizationFactor::Disabled())
     LLVM_DEBUG(dbgs() << "LEV: Vectorizing epilogue loop with VF = "
                       << Result.Width.getFixedValue() << "\n";);
   return Result;
 }

[... 1,929 lines not shown ...] void LoopVectorizationPlanner::executePlan(ElementCount BestVF, unsigned BestUF,
                                            InnerLoopVectorizer &ILV,
                                            DominatorTree *DT) {
   LLVM_DEBUG(dbgs() << "Executing best plan with VF=" << BestVF << ", UF=" << BestUF
                     << '\n');

   // Perform the actual loop transformation.

   // 1. Create a new empty loop. Unlink the old loop and connect the new one.
   VPTransformState State{BestVF, BestUF, LI, DT, ILV.Builder, &ILV, &BestVPlan};
   State.CFG.PrevBB = ILV.createVectorizedLoopSkeleton();
   State.TripCount = ILV.getOrCreateTripCount(nullptr);
   State.CanonicalIV = ILV.Induction;

   ILV.printDebugTracesAtStart();

   //===------------------------------------------------===//
   //
[... 166 lines not shown ...] BasicBlock *EpilogueVectorizerMainLoop::createEpilogueVectorizedLoopSkeleton() {
   // the epilogue.
   EPI.MainLoopIterationCountCheck =
       emitMinimumIterationCountCheck(Lp, LoopScalarPreHeader, false);

   // Generate the induction variable.
   OldInduction = Legal->getPrimaryInduction();
   Type *IdxTy = Legal->getWidestInductionType();
   Value *StartIdx = ConstantInt::get(IdxTy, 0);
-  Constant *Step = ConstantInt::get(IdxTy, VF.getKnownMinValue() * UF);
+  IRBuilder<> B(&*Lp->getLoopPreheader()->getFirstInsertionPt());
+  Value *Step = getRuntimeVF(B, IdxTy, VF * UF);
   Value *CountRoundDown = getOrCreateVectorTripCount(Lp);
   EPI.VectorTripCount = CountRoundDown;
   Induction =
       createInductionVariable(Lp, StartIdx, CountRoundDown, Step,
                               getDebugLocFromInstOrOperands(OldInduction));

   // Skip induction resume value creation here because they will be created in
   // the second pass. If we created them here, they wouldn't be used anyway,
▲ Show 20 Lines • Show All 1,993 Lines • ▼ Show 20 Lines	#endif /* NDEBUG */
{		{
// Optimistically generate runtime checks. Drop them if they turn out to not		// Optimistically generate runtime checks. Drop them if they turn out to not
// be profitable. Limit the scope of Checks, so the cleanup happens		// be profitable. Limit the scope of Checks, so the cleanup happens
// immediately after vector codegeneration is done.		// immediately after vector codegeneration is done.
GeneratedRTChecks Checks(*PSE.getSE(), DT, LI,		GeneratedRTChecks Checks(*PSE.getSE(), DT, LI,
F->getParent()->getDataLayout());		F->getParent()->getDataLayout());
if (!VF.Width.isScalar() \|\| IC > 1)		if (!VF.Width.isScalar() \|\| IC > 1)
Checks.Create(L, *LVL.getLAI(), PSE.getUnionPredicate());		Checks.Create(L, *LVL.getLAI(), PSE.getUnionPredicate());
VPlan &BestPlan = LVP.getBestPlanFor(VF.Width);

using namespace ore;		using namespace ore;
if (!VectorizeLoop) {		if (!VectorizeLoop) {
assert(IC > 1 && "interleave count should not be 1 or 0");		assert(IC > 1 && "interleave count should not be 1 or 0");
// If we decided that it is not legal to vectorize the loop, then		// If we decided that it is not legal to vectorize the loop, then
// interleave it.		// interleave it.
InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,		InnerLoopUnroller Unroller(L, PSE, LI, DT, TLI, TTI, AC, ORE, IC, &LVL,
&CM, BFI, PSI, Checks);		&CM, BFI, PSI, Checks);

		VPlan &BestPlan = LVP.getBestPlanFor(VF.Width);
LVP.executePlan(VF.Width, IC, BestPlan, Unroller, DT);		LVP.executePlan(VF.Width, IC, BestPlan, Unroller, DT);

ORE->emit([&]() {		ORE->emit([&]() {
return OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),		return OptimizationRemark(LV_NAME, "Interleaved", L->getStartLoc(),
L->getHeader())		L->getHeader())
<< "interleaved loop (interleaved count: "		<< "interleaved loop (interleaved count: "
<< NV("InterleaveCount", IC) << ")";		<< NV("InterleaveCount", IC) << ")";
});		});
} else {		} else {
// If we decided that it is legal to vectorize the loop, then do it.		// If we decided that it is legal to vectorize the loop, then do it.

// Consider vectorizing the epilogue too if it's profitable.		// Consider vectorizing the epilogue too if it's profitable.
VectorizationFactor EpilogueVF =		VectorizationFactor EpilogueVF =
CM.selectEpilogueVectorizationFactor(VF.Width, LVP);		CM.selectEpilogueVectorizationFactor(VF.Width, LVP);
if (EpilogueVF.Width.isVector()) {		if (EpilogueVF.Width.isVector()) {

// The first pass vectorizes the main loop and creates a scalar epilogue		// The first pass vectorizes the main loop and creates a scalar epilogue
// to be vectorized by executing the plan (potentially with a different		// to be vectorized by executing the plan (potentially with a different
// factor) again shortly afterwards.		// factor) again shortly afterwards.
EpilogueLoopVectorizationInfo EPI(VF.Width, IC, EpilogueVF.Width, 1);		EpilogueLoopVectorizationInfo EPI(VF.Width, IC, EpilogueVF.Width, 1);
EpilogueVectorizerMainLoop MainILV(L, PSE, LI, DT, TLI, TTI, AC, ORE,		EpilogueVectorizerMainLoop MainILV(L, PSE, LI, DT, TLI, TTI, AC, ORE,
EPI, &LVL, &CM, BFI, PSI, Checks);		EPI, &LVL, &CM, BFI, PSI, Checks);

LVP.executePlan(EPI.MainLoopVF, EPI.MainLoopUF, BestPlan, MainILV, DT);		VPlan &BestMainPlan = LVP.getBestPlanFor(EPI.MainLoopVF);
		LVP.executePlan(EPI.MainLoopVF, EPI.MainLoopUF, BestMainPlan, MainILV,
		sdesmalenUnsubmitted Done Reply Inline Actions nit: please move the definition of `BestEpiPlan` closer to its use (above line 10438) sdesmalen: nit: please move the definition of `BestEpiPlan` closer to its use (above line 10438)
		DT);
++LoopsVectorized;		++LoopsVectorized;

simplifyLoop(L, DT, LI, SE, AC, nullptr, false /* PreserveLCSSA */);		simplifyLoop(L, DT, LI, SE, AC, nullptr, false /* PreserveLCSSA */);
formLCSSARecursively(L, DT, LI, SE);		formLCSSARecursively(L, DT, LI, SE);

// Second pass vectorizes the epilogue and adjusts the control flow		// Second pass vectorizes the epilogue and adjusts the control flow
// edges from the first pass.		// edges from the first pass.
EPI.MainLoopVF = EPI.EpilogueVF;		EPI.MainLoopVF = EPI.EpilogueVF;
EPI.MainLoopUF = EPI.EpilogueUF;		EPI.MainLoopUF = EPI.EpilogueUF;
EpilogueVectorizerEpilogueLoop EpilogILV(L, PSE, LI, DT, TLI, TTI, AC,		EpilogueVectorizerEpilogueLoop EpilogILV(L, PSE, LI, DT, TLI, TTI, AC,
ORE, EPI, &LVL, &CM, BFI, PSI,		ORE, EPI, &LVL, &CM, BFI, PSI,
Checks);		Checks);
LVP.executePlan(EPI.EpilogueVF, EPI.EpilogueUF, BestPlan, EpilogILV,
		VPlan &BestEpiPlan = LVP.getBestPlanFor(EPI.EpilogueVF);
		LVP.executePlan(EPI.EpilogueVF, EPI.EpilogueUF, BestEpiPlan, EpilogILV,
DT);		DT);
++LoopsEpilogueVectorized;		++LoopsEpilogueVectorized;

if (!MainILV.areSafetyChecksAdded())		if (!MainILV.areSafetyChecksAdded())
DisableRuntimeUnroll = true;		DisableRuntimeUnroll = true;
} else {		} else {
InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,		InnerLoopVectorizer LB(L, PSE, LI, DT, TLI, TTI, AC, ORE, VF.Width, IC,
&LVL, &CM, BFI, PSI, Checks);		&LVL, &CM, BFI, PSI, Checks);

		VPlan &BestPlan = LVP.getBestPlanFor(VF.Width);
LVP.executePlan(VF.Width, IC, BestPlan, LB, DT);		LVP.executePlan(VF.Width, IC, BestPlan, LB, DT);
++LoopsVectorized;		++LoopsVectorized;

// Add metadata to disable runtime unrolling a scalar loop when there		// Add metadata to disable runtime unrolling a scalar loop when there
// are no runtime checks about strides and memory. A scalar loop that is		// are no runtime checks about strides and memory. A scalar loop that is
// rarely used is not worth unrolling.		// rarely used is not worth unrolling.
if (!LB.areSafetyChecksAdded())		if (!LB.areSafetyChecksAdded())
DisableRuntimeUnroll = true;		DisableRuntimeUnroll = true;

llvm/test/Transforms/LoopVectorize/AArch64/sve-epilog-vect.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; REQUIRES: asserts
				; RUN: opt < %s -loop-vectorize -force-vector-interleave=2 -epilogue-vectorization-minimum-VF=0 --debug-only=loop-vectorize -force-target-instruction-cost=1 -S -scalable-vectorization=preferred 2>%t | FileCheck %s
				; RUN: cat %t | FileCheck %s --check-prefix=DEBUG
				sdesmalen: Is it worth fixing the target-instruction-cost to 1, so that this test doesn't fail when someone updates the cost-model?
				; RUN: opt < %s -loop-vectorize -force-vector-interleave=2 -epilogue-vectorization-minimum-VF=8 --debug-only=loop-vectorize -S -scalable-vectorization=preferred 2>%t | FileCheck %s
				; RUN: cat %t | FileCheck %s --check-prefix=DEBUG
				; RUN: opt < %s -loop-vectorize -force-vector-interleave=2 -epilogue-vectorization-force-VF=8 --debug-only=loop-vectorize -S -scalable-vectorization=preferred 2>%t | FileCheck %s
				; RUN: cat %t | FileCheck %s --check-prefix=DEBUG-FORCED

				target triple = "aarch64-linux-gnu"

				; DEBUG: LV: Checking a loop in "f1"
				; DEBUG: LEV: Epilogue vectorization using scalable vectors not yet supported. Converting to fixed-width (VF=16) instead
				; DEBUG: Create Skeleton for epilogue vectorized loop (first pass)
				; DEBUG: Main Loop VF:vscale x 16, Main Loop UF:2, Epilogue Loop VF:8, Epilogue Loop UF:1

				; DEBUG-FORCED: LV: Checking a loop in "f1"
				; DEBUG-FORCED: LEV: Epilogue vectorization factor is forced.
				; DEBUG-FORCED: Create Skeleton for epilogue vectorized loop (first pass)
				; DEBUG-FORCED: Main Loop VF:vscale x 16, Main Loop UF:2, Epilogue Loop VF:8, Epilogue Loop UF:1

				define void @f1(i8* %A) #0 {
				; CHECK-LABEL: @f1(
				; CHECK-NEXT: iter.check:
				; CHECK-NEXT: br i1 false, label [[VEC_EPILOG_SCALAR_PH:%.*]], label [[VECTOR_MAIN_LOOP_ITER_CHECK:%.*]]
				; CHECK: vector.main.loop.iter.check:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 32
				; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[VEC_EPILOG_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 32
				; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP5:%.*]] = mul i64 [[TMP4]], 32
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP5]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[INDEX]], 0
				; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP8:%.*]] = mul i64 [[TMP7]], 16
				sdesmalen: Can you add a few more check lines here? e.g. I'm not sure if the interleaving is disabled for the epilogue loop (if not, then it would need a check for a second store), or if this is actually a loop with a back-edge to vec.epilog.vector.body, and what the increment would be. I assume it's `8` given `<8 x i8>`, but it would be good to have a CHECK line for it.
				david-arm (author): Given the structure is much more complicated with epilogues with the extra checks and so on, I decided it's easiest just to autogenerate the CHECK lines with utils/update_test_checks.py! It's probably useful to show the whole control flow anyway for at least one test.
				; CHECK-NEXT: [[TMP9:%.*]] = add i64 [[TMP8]], 0
				; CHECK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 1
				; CHECK-NEXT: [[TMP11:%.*]] = add i64 [[INDEX]], [[TMP10]]
				; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i8, i8* [[A:%.*]], i64 [[TMP6]]
				; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds i8, i8* [[A]], i64 [[TMP11]]
				; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i8, i8* [[TMP12]], i32 0
				; CHECK-NEXT: [[TMP15:%.*]] = bitcast i8* [[TMP14]] to <vscale x 16 x i8>*
				; CHECK-NEXT: store <vscale x 16 x i8> shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 1, i32 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer), <vscale x 16 x i8>* [[TMP15]], align 1
				; CHECK-NEXT: [[TMP16:%.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: [[TMP17:%.*]] = mul i32 [[TMP16]], 16
				; CHECK-NEXT: [[TMP18:%.*]] = getelementptr inbounds i8, i8* [[TMP12]], i32 [[TMP17]]
				; CHECK-NEXT: [[TMP19:%.*]] = bitcast i8* [[TMP18]] to <vscale x 16 x i8>*
				; CHECK-NEXT: store <vscale x 16 x i8> shufflevector (<vscale x 16 x i8> insertelement (<vscale x 16 x i8> poison, i8 1, i32 0), <vscale x 16 x i8> poison, <vscale x 16 x i32> zeroinitializer), <vscale x 16 x i8>* [[TMP19]], align 1
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP3]]
				; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
				; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT:%.*]], label [[VEC_EPILOG_ITER_CHECK:%.*]]
				; CHECK: vec.epilog.iter.check:
				; CHECK-NEXT: [[N_VEC_REMAINING:%.*]] = sub i64 1024, [[N_VEC]]
				; CHECK-NEXT: [[MIN_EPILOG_ITERS_CHECK:%.*]] = icmp ult i64 [[N_VEC_REMAINING]], 8
				; CHECK-NEXT: br i1 [[MIN_EPILOG_ITERS_CHECK]], label [[VEC_EPILOG_SCALAR_PH]], label [[VEC_EPILOG_PH]]
				; CHECK: vec.epilog.ph:
				; CHECK-NEXT: [[VEC_EPILOG_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[VECTOR_MAIN_LOOP_ITER_CHECK]] ]
				; CHECK-NEXT: br label [[VEC_EPILOG_VECTOR_BODY:%.*]]
				; CHECK: vec.epilog.vector.body:
				; CHECK-NEXT: [[INDEX1:%.*]] = phi i64 [ [[VEC_EPILOG_RESUME_VAL]], [[VEC_EPILOG_PH]] ], [ [[INDEX_NEXT2:%.*]], [[VEC_EPILOG_VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP21:%.*]] = add i64 [[INDEX1]], 0
				; CHECK-NEXT: [[TMP22:%.*]] = getelementptr inbounds i8, i8* [[A]], i64 [[TMP21]]
				; CHECK-NEXT: [[TMP23:%.*]] = getelementptr inbounds i8, i8* [[TMP22]], i32 0
				; CHECK-NEXT: [[TMP24:%.*]] = bitcast i8* [[TMP23]] to <8 x i8>*
				; CHECK-NEXT: store <8 x i8> <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>, <8 x i8>* [[TMP24]], align 1
				; CHECK-NEXT: [[INDEX_NEXT2]] = add nuw i64 [[INDEX1]], 8
				; CHECK-NEXT: [[TMP25:%.*]] = icmp eq i64 [[INDEX_NEXT2]], 1024
				; CHECK-NEXT: br i1 [[TMP25]], label [[VEC_EPILOG_MIDDLE_BLOCK:%.*]], label [[VEC_EPILOG_VECTOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]
				; CHECK: vec.epilog.middle.block:
				; CHECK-NEXT: [[CMP_N3:%.*]] = icmp eq i64 1024, 1024
				; CHECK-NEXT: br i1 [[CMP_N3]], label [[EXIT_LOOPEXIT:%.*]], label [[VEC_EPILOG_SCALAR_PH]]
				; CHECK: vec.epilog.scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 1024, [[VEC_EPILOG_MIDDLE_BLOCK]] ], [ [[N_VEC]], [[VEC_EPILOG_ITER_CHECK]] ], [ 0, [[ITER_CHECK:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[VEC_EPILOG_SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, i8* [[A]], i64 [[IV]]
				; CHECK-NEXT: store i8 1, i8* [[ARRAYIDX]], align 1
				; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp ne i64 [[IV_NEXT]], 1024
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_BODY]], label [[EXIT_LOOPEXIT]], !llvm.loop [[LOOP4:![0-9]+]]
				; CHECK: exit.loopexit:
				; CHECK-NEXT: br label [[EXIT]]
				; CHECK: exit:
				; CHECK-NEXT: ret void

				entry:
				br label %for.body

				for.body:
				%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i8, i8* %A, i64 %iv
				store i8 1, i8* %arrayidx, align 1
				%iv.next = add nuw nsw i64 %iv, 1
				%exitcond = icmp ne i64 %iv.next, 1024
				br i1 %exitcond, label %for.body, label %exit

				exit:
				ret void
				}

				attributes #0 = { "target-features"="+sve" }
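
For readers following the autogenerated checks in sve-epilog-vect.ll, the way the iterations of `@f1` are divided between the scalable main loop, the fixed-width epilogue and the scalar remainder can be sketched in a few lines of C++. This is only an illustration of the arithmetic visible in the CHECK lines (`N_VEC`, `N_VEC_REMAINING`, and the `icmp ult ... 8` guard), assuming the VF and UF values printed in the DEBUG output (main loop VF = vscale x 16 with UF = 2, epilogue VF = 8 with UF = 1). `IterationSplit` and `splitIterations` are hypothetical names, not LLVM APIs.

```cpp
#include <cstdint>

// Hypothetical helper types for illustration only; not part of LLVM.
struct IterationSplit {
  uint64_t mainLoop;  // iterations handled by the scalable main vector loop
  uint64_t epilogue;  // iterations handled by the fixed-width epilogue loop
  uint64_t scalar;    // iterations left for the scalar remainder loop
};

// Mirrors the guards in the test's CHECK lines, assuming
// main VF = vscale x 16, main UF = 2, epilogue VF = 8, epilogue UF = 1.
IterationSplit splitIterations(uint64_t tripCount, uint64_t vscale) {
  const uint64_t mainStep = vscale * 16 * 2; // the "mul i64 vscale, 32" value
  const uint64_t epiStep = 8;                // the epilogue increment

  // vector.main.loop.iter.check: skip the main loop if there are too few
  // iterations; otherwise n.vec = tripCount - tripCount % mainStep.
  uint64_t mainEnd = 0;
  if (tripCount >= mainStep)
    mainEnd = tripCount - tripCount % mainStep;

  // vec.epilog.iter.check: only enter the epilogue vector loop if at least
  // one full fixed-width vector of iterations remains.
  uint64_t epiEnd = mainEnd;
  if (tripCount - mainEnd >= epiStep)
    epiEnd = tripCount - tripCount % epiStep;

  return {mainEnd, epiEnd - mainEnd, tripCount - epiEnd};
}
```

Under these assumptions, `splitIterations(1024, 3)` gives 960 main-loop iterations, 64 epilogue iterations and no scalar remainder, whereas for any power-of-two vscale up to 16 the main loop covers all 1024 iterations and the epilogue is skipped at run time.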

llvm/test/Transforms/LoopVectorize/optimal-epilog-vectorization-scalable.ll

	; REQUIRES: asserts			; REQUIRES: asserts
	; RUN: opt < %s -passes='loop-vectorize' -force-vector-width=2 -force-target-supports-scalable-vectors=true -enable-epilogue-vectorization -epilogue-vectorization-force-VF=2 --debug-only=loop-vectorize -S -scalable-vectorization=on 2>&1 | FileCheck %s			; RUN: opt < %s -passes='loop-vectorize' -force-target-supports-scalable-vectors=true -enable-epilogue-vectorization -epilogue-vectorization-force-VF=2 --debug-only=loop-vectorize -S -scalable-vectorization=on 2>&1 | FileCheck %s

	target datalayout = "e-m:e-i64:64-n32:64-v256:256:256-v512:512:512"			target datalayout = "e-m:e-i64:64-n32:64-v256:256:256-v512:512:512"

	; Currently we cannot handle scalable vectorization factors.			; Currently we cannot handle scalable vectorization factors.
	; CHECK: LV: Checking a loop in "f1"			; CHECK: LV: Checking a loop in "f1"
	; CHECK: LEV: Epilogue vectorization for scalable vectors not yet supported.			; CHECK: LEV: Epilogue vectorization factor is forced.
				; CHECK: Epilogue Loop VF:2, Epilogue Loop UF:1

	define void @f1(i8* %A) {			define void @f1(i8* %A) {
	entry:			entry:
	br label %for.body			br label %for.body

	for.body:			for.body:
	%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]			%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
	%arrayidx = getelementptr inbounds i8, i8* %A, i64 %iv			%arrayidx = getelementptr inbounds i8, i8* %A, i64 %iv