This is an archive of the discontinued LLVM Phabricator instance.


Differential D154157

[LV] Cost model for out-of-loop reductions
Accepted, Public

Authored by anna on Jun 29 2023, 2:47 PM.


Details

Reviewers

fhahn
dmgreen
ebrevnov

Summary

When we have small trip count loops, the cost of out-of-loop reductions
becomes significant. The loop vectorizer does not currently consider the
cost of out-of-loop reductions (in-loop reductions are already handled in
the cost model).

The cost model already computes the minimum trip count at which runtime checks become profitable. This patch extends that logic and reuses the same idea for out-of-loop reductions.
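
The bound described above can be sketched roughly as follows. This is a standalone illustration and not the patch's code: the helper name `minProfitableTripCount` and its plain `double` parameters are assumptions made for the example, the epilogue cost is taken as 0, and the sketch assumes `ScalarC > VecC / VF` (the vector body is cheaper per element than the scalar body).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical helper (an assumption for illustration, not an LLVM API).
// Returns the smallest trip count, rounded up to a multiple of VF, at which
// the vector loop is expected to beat the scalar loop.
uint64_t minProfitableTripCount(double RtC, double ReduxC, double ScalarC,
                                double VecC, unsigned VF) {
  assert(VF > 0 && ScalarC > VecC / VF && "sketch assumes a cheaper vector body");
  double VecCOverVF = VecC / VF;
  // Bound 1: RtC + ReduxC + VecC * (TC / VF) < ScalarC * TC.
  double MinTC1 = (RtC + ReduxC) / (ScalarC - VecCOverVF);
  // Bound 2: keep the runtime checks below 1/10 of the scalar loop cost.
  // The reduction cost is not part of this bound because it is only paid
  // when the vector loop is taken.
  double MinTC2 = RtC * 10 / ScalarC;
  uint64_t MinTC = static_cast<uint64_t>(std::ceil(std::max(MinTC1, MinTC2)));
  // Round up to the next multiple of VF.
  return (MinTC + VF - 1) / VF * VF;
}
```

With no runtime checks, a reduction cost of 12, a scalar iteration cost of 4, and a vector iteration cost of 8 at VF=4, the first bound gives 12 / (4 - 2) = 6, which rounds up to 8.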


Event Timeline

anna created this revision. Jun 29 2023, 2:47 PM

Herald added a project: Restricted Project. Jun 29 2023, 2:47 PM

Herald added subscribers: artagnon, StephenFan, pengfei, hiraditya.

anna requested review of this revision. Jun 29 2023, 2:47 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B242250: Diff 536010. Jun 29 2023, 6:45 PM

Sounds OK to me. Did you consider including the cost of reductions in the existing vectorizer cost calculations, as opposed to a hard limit?

From the description, it is not clear why FMinimum/FMaximum reductions in particular should be a blocker for vectorizing small trip count loops. What about other reductions?

Given that this is for short trip counts, it might be better/more general to extend the code that computes the minimum trip count needed to offset the overhead of runtime checks so that it also considers the overhead of computing the final reduction value (see https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L10047)

In D154157#4468183, @fhahn wrote:

From the description, it is not clear why FMinimum/FMaximum reductions in particular should be a blocker for vectorizing small trip count loops. What about other reductions?

Given that this is for short trip counts, it might be better/more general to extend the code that computes the minimum trip count needed to offset the overhead of runtime checks so that it also considers the overhead of computing the final reduction value (see https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L10047)

Maybe this generalized version would also fix https://github.com/llvm/llvm-project/issues/57476.

I support Florian's suggestion. We had better take reduction overhead into account the same way as we did for runtime checks.
Today's cost model is very limited in its ability to predict the overhead that comes with vectorization. For example, the overhead of the scalar epilogue loop is simply ignored. I think one day we should generalize the cost model so that vectorization overhead is an integral part of the model itself rather than a follow-up "fix up".

Herald added a subscriber: wangpc. Jul 4 2023, 1:20 AM

Thank you Florian, this is a nice idea. Working on it.

Rewrote the patch to extend the existing logic for the runtime checks calculation.
This now supports all kinds of out-of-loop reductions. Since we compute an upper bound on MinProfitableTripCount, this may be a more conservative estimate.

Harbormaster completed remote builds in B254136: Diff 552416. Aug 22 2023, 11:06 AM

Added an option that keeps this switched off by default. The failures above are related to ARM and RISCV targets.

Harbormaster completed remote builds in B254889: Diff 553476. Aug 25 2023, 9:29 AM

Gentle ping @fhahn. Anything I could do to move this forward? I have kept the option off by default, since I would need help to test this on supported targets upstream. It helps our use case, where loops should not be vectorized when out-of-loop reductions are present in small trip count loops.
There may be some fallout where corrections to the reduction cost modelling will be required.

In D154157#4628496, @anna wrote:

Gentle ping @fhahn. Anything I could do to move this forward? I have kept the option off by default, since I would need help to test this on supported targets upstream. It helps our use case, where loops should not be vectorized when out-of-loop reductions are present in small trip count loops.
There may be some fallout where corrections to the reduction cost modelling will be required.

What are the regressions on ARM/RISCV? I think we should aim to enable this by default for all platforms, otherwise it risks never getting enabled. Also, those tests may show issues with the current implementation.

Do you have any numbers on the impact on X86?

Changed the flag to on by default. Fixed the ARM/RISCV tests (by updating the CHECK lines). More details will follow.

Herald added subscribers: luke, sunshaoce, frasercrmck and 21 others. Aug 31 2023, 12:02 PM

In D154157#4628521, @fhahn wrote:

In D154157#4628496, @anna wrote:

Gentle ping @fhahn. Anything I could do to move this forward? I have kept the option off by default, since I would need help to test this on supported targets upstream. It helps our use case, where loops should not be vectorized when out-of-loop reductions are present in small trip count loops.
There may be some fallout where corrections to the reduction cost modelling will be required.

What are the regressions on ARM/RISCV? I think we should aim to enable this by default for all platforms, otherwise it risks never getting enabled. Also, those tests may show issues with the current implementation.

The regressions on ARM and RISCV are because we have an updated minimum trip count we expect (to offset the cost of the out-of-loop reductions), so the CHECK lines have been updated accordingly. RISCV looks okay with the minimum trip count being at least 4 or 8 (we round the MinimumTripCount up to VF, so it is more conservative).

However, in the case of ARM, there is no special cost for min/max reductions. They compute a pretty expensive cost under the assumption that we expand these reductions and extract the value. On some simple tests I ran, the assembly generated supports this cost for the -mtriple=armv7-none-eabi, but the code generated is different for another triple such as -mtriple=thumbv8.1m.main-none-none-eabi. I've added more details under the respective test.

Do you have any numbers on the impact on X86?

I will have some numbers soon. The tests are still running. We see positive impact on small trip count loops on our workloads (in the range of 40% improvement) - these workloads are small trip count loops with fminimum/fmaximum reductions where we no longer vectorize.

llvm/test/Transforms/LoopVectorize/ARM/tail-fold-multiple-icmps.ll
Line 12:

This minimum trip count of 56 comes about because we compute the reduction cost as 140 (and there are two such reductions, with total being 280). This is a pretty high cost which is computed through https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/CodeGen/BasicTTIImpl.h#L2372. This is correct for a triple with mattr such as:

llc < %s -mtriple=armv7-none-eabi -float-abi=hard -mattr=+neon -verify-machineinstrs | FileCheck %s

define i8 @test_umin_v8i8(<8 x i8> %x) {
; CHECK-LABEL: test_umin_v8i8:
; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vpmin.u8 d16, d0, d0
; CHECK-NEXT: vpmin.u8 d16, d16, d16
; CHECK-NEXT: vpmin.u8 d16, d16, d16
; CHECK-NEXT: vmov.u8 r0, d16[0]
; CHECK-NEXT: bx lr
entry:
  %z = call i8 @llvm.vector.reduce.umin.v8i8(<8 x i8> %x)
  ret i8 %z
}

However, I'm not sure if this is a correct cost for other triples.

I can look into the Arm costs. Some of them were deliberately high to try and stop bad vectorization from happening (or they haven't come up much in the past).

I've made some changes for the Arm MVE and NEON costs. Can you try a rebase? Thanks

In D154157#4638288, @dmgreen wrote:

I've made some changes for the Arm MVE and NEON costs. Can you try a rebase? Thanks

Thanks @dmgreen! The updated MinTripCount looks better now (4 instead of 56). I will update the patch.

Our X86 results are in. Over 244 workloads, the geomean changes by about 0.4%. Looking at individual workloads, there are no major gains or regressions in large applications (unfortunately, we cannot share the exact benchmark names publicly). However, we do see big gains in 3 of these workloads, each of which performs a floating point minimum/maximum reduction over a float array of 3 elements (without this change we were vectorizing it).
Overall, I think the change is still reasonable to have since we now account for out of loop reductions more accurately.

Also, based on the RISCV and ARM results, we can see that the minimum trip counts are now more reasonable (4 or 8 for both targets). I will upload the rebased patch.
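
To make the 3-element case concrete, here is a rough emulation of what an out-of-loop reduction looks like for a float maximum. It is only a sketch: the function names are made up for this example, the vector lanes are emulated with `std::array` at an assumed VF of 4, and `std::max` stands in for the reduction operation (it does not model the NaN behaviour of fmaximum).

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <limits>

// Scalar reference: maximum over N floats.
float scalarMax(const float *A, std::size_t N) {
  float M = -std::numeric_limits<float>::infinity();
  for (std::size_t I = 0; I < N; ++I)
    M = std::max(M, A[I]);
  return M;
}

// Emulated VF=4 vectorization with an out-of-loop reduction.
float vectorizedMax(const float *A, std::size_t N) {
  constexpr std::size_t VF = 4;
  std::array<float, VF> Acc;
  Acc.fill(-std::numeric_limits<float>::infinity());
  std::size_t I = 0;
  // Vector body: lane-wise max, one "vector iteration" per VF elements.
  for (; I + VF <= N; I += VF)
    for (std::size_t L = 0; L < VF; ++L)
      Acc[L] = std::max(Acc[L], A[I + L]);
  // Out-of-loop reduction: executed once after the vector body to fold the
  // VF lanes into a single scalar.
  float M = Acc[0];
  for (std::size_t L = 1; L < VF; ++L)
    M = std::max(M, Acc[L]);
  // Scalar epilogue for the remaining N % VF elements.
  for (; I < N; ++I)
    M = std::max(M, A[I]);
  return M;
}
```

For an array of 3 elements the lane-folding step is fixed overhead on top of only three element comparisons, which is the kind of cost the patch adds to the minimum trip count calculation.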

Rebased over the changes from dmgreen with more accurate costs for ARM out-of-loop reductions.

anna edited the summary of this revision. Sep 6 2023, 11:29 AM

ping?

Any comments to progress this further?

Would you mind testing whether this also fixes https://github.com/llvm/llvm-project/issues/57476 and adding it as a test case if so?

In D154157#4648096, @nikic wrote:

Would you mind testing whether this also fixes https://github.com/llvm/llvm-project/issues/57476 and adding it as a test case if so?

Unfortunately, it doesn't help when the trip count is exactly specified at compile time. However, if the trip count comes from a branch weight profile (with the same behaviour as trip-count=2), then with this patch we no longer vectorize ctpop. The reason is that when the exact trip count is 2, we clamp the max VF to exactly 2, the reduction cost is 1, and we still vectorize. With a branch profile stating that the trip count will be 2, we go through VF=4, and without the patch we end up vectorizing with masked vector instructions. With the patch, the vector reduce.add with VF=4 ends up being costlier for the small trip count loop and we avoid vectorizing. I will add both test cases to show the difference.

Added ctpop test

Harbormaster completed remote builds in B257451: Diff 557112. Sep 20 2023, 8:14 AM

Ping.

I ran some tests and they looked OK. Sometimes these RuntimeChecks costs need to be quite precise, more precise than they have needed to be for vectorization in the past, and it can end up with cases where you spend some of the overhead costs of vectorization without getting the benefits, but the results look good from what I can tell.

I might regret mentioning this if it increases the costs, but should interleaved loops get an extra cost to account for reducing the vectors by UF, into a single vector that is then vec.reduced?

Thanks for testing the patch @dmgreen!

I might regret mentioning this if it increases the costs, but should interleaved loops get an extra cost to account for reducing the vectors by UF, into a single vector that is then vec.reduced?

That's a good point, I think there is a higher chance of "touching a lot more cases" on vectorization :) but it seems that can be done as a later patch, if needed?

Yeah that sounds OK to me. I think the extra cost should be fairly minor.
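
The extra cost discussed here could be modelled along these lines. This is a hypothetical sketch only: `interleavedReductionCost` and its unsigned cost parameters are assumptions for illustration and are not part of the patch, which does not account for the interleave count.

```cpp
#include <cassert>

// Hypothetical cost sketch (not an LLVM API). With an interleave count of UF
// there are UF partial vector accumulators. They are first combined with
// UF - 1 element-wise vector operations, and a single horizontal reduction
// then produces the scalar result.
unsigned interleavedReductionCost(unsigned UF, unsigned VectorOpCost,
                                  unsigned HorizontalReduceCost) {
  assert(UF >= 1 && "interleave count must be at least 1");
  return (UF - 1) * VectorOpCost + HorizontalReduceCost;
}
```

With UF = 1 this reduces to the plain horizontal reduction cost, which is consistent with the expectation in the thread that the additional term is fairly minor.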

This LGTM if no-one else has any other comments.

This revision is now accepted and ready to land. Sep 29 2023, 3:20 AM

Any particular reason this hasn't been landed yet?

Revision Contents

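Path (Size)

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (190 lines)
llvm/test/Transforms/LoopVectorize/ARM/mve-selectandorcost.ll (4 lines)
llvm/test/Transforms/LoopVectorize/ARM/sphinx.ll (4 lines)
llvm/test/Transforms/LoopVectorize/ARM/tail-fold-multiple-icmps.ll (36 lines)
llvm/test/Transforms/LoopVectorize/RISCV/defaults.ll (37 lines)
llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll (43 lines)
llvm/test/Transforms/LoopVectorize/RISCV/scalable-basics.ll (82 lines)
llvm/test/Transforms/LoopVectorize/X86/ctpop-small-trip-count.ll (109 lines)
llvm/test/Transforms/LoopVectorize/X86/reduction-small-trip-count.ll (94 lines)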

Diff 557112

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


(193 lines not shown)
 static cl::opt<unsigned> TinyTripCountVectorThreshold(
     cl::desc("Loops with a constant trip count that is smaller than this "
              "value are vectorized only if no scalar iteration overheads "
              "are incurred."));

 static cl::opt<unsigned> VectorizeMemoryCheckThreshold(
     "vectorize-memory-check-threshold", cl::init(128), cl::Hidden,
     cl::desc("The maximum allowed number of runtime memory checks"));

+static cl::opt<bool> IgnoreOutOfLoopReductionCost(
+    "vectorizer-ignore-out-of-loop-reduction-cost", cl::init(false),
+    cl::desc("Ignore the cost of out-of-loop reductions in vectorizer cost "
+             "model"));

 // Option prefer-predicate-over-epilogue indicates that an epilogue is undesired,
 // that predication is preferred, and this lists all options. I.e., the
 // vectorizer will try to fold the tail-loop (epilogue) into the vector body
 // and predicate the instructions accordingly. If tail-folding fails, there are
 // different fallback strategies depending on these values:
 namespace PreferPredicateTy {
   enum Option {
     ScalarEpilogue = 0,
(1,439 lines not shown)
 public:

   bool hasPredStores() const { return NumPredStores > 0; }

   /// Returns true if epilogue vectorization is considered profitable, and
   /// false otherwise.
   /// \p VF is the vectorization factor chosen for the original loop.
   bool isEpilogueVectorizationProfitable(const ElementCount VF) const;

+  /// Returns total cost for out-of-loop reductions.
+  InstructionCost getOutOfLoopReductionCost(VectorizationFactor VF);

 private:
   unsigned NumPredStores = 0;

   /// \return An upper bound for the vectorization factors for both
   /// fixed and scalable vectorization, where the minimum-known number of
   /// elements is a power-of-2 larger than zero. If scalable vectorization is
   /// disabled or unsupported, then the scalable part will be equal to
   /// ElementCount::getScalable(0).
(4,835 lines not shown)
     assert(!Legal->isMaskRequired(I) &&
            "Reverse masked interleaved access not supported.");
     Cost += Group->getNumMembers() *
             TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy,
                                std::nullopt, CostKind, 0);
   }
   return Cost;
 }

+InstructionCost
+LoopVectorizationCostModel::getOutOfLoopReductionCost(VectorizationFactor VF) {
+  InstructionCost ReduxCost = 0;
+  if (VF.Width.isScalar() || IgnoreOutOfLoopReductionCost)
+    return ReduxCost;
+
+  TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;
+  for (auto &Reduction : Legal->getReductionVars()) {
+    PHINode *Phi = Reduction.first;
+    auto *VectorTy = cast<VectorType>(ToVectorTy(Phi->getType(), VF.Width));
+    const RecurrenceDescriptor &RdxDesc = Reduction.second;
+    if (isInLoopReduction(Phi))
+      continue;
+    RecurKind RK = RdxDesc.getRecurrenceKind();
+    auto FMF = RdxDesc.getFastMathFlags();
+    switch (RK) {
+    case RecurKind::Add:
+    case RecurKind::Mul:
+    case RecurKind::Or:
+    case RecurKind::And:
+    case RecurKind::Xor:
+    case RecurKind::FAdd:
+    case RecurKind::FMul: {
+      unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RK);
+      ReduxCost +=
+          TTI.getArithmeticReductionCost(RdxOpcode, VectorTy, FMF, CostKind);
+      break;
+    }
+    case RecurKind::FMax:
+    case RecurKind::FMin:
+    case RecurKind::FMaximum:
+    case RecurKind::FMinimum:
+    case RecurKind::SMax:
+    case RecurKind::SMin:
+    case RecurKind::UMax:
+    case RecurKind::UMin: {
+      Intrinsic::ID Id = getMinMaxReductionIntrinsicOp(RK);
+      ReduxCost += TTI.getMinMaxReductionCost(Id, VectorTy, FMF, CostKind);
+      break;
+    }
+    case RecurKind::FMulAdd: {
+      unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RK);
+      ReduxCost +=
+          TTI.getArithmeticReductionCost(RdxOpcode, VectorTy, FMF, CostKind);
+      // For a call to the llvm.fmuladd intrinsic we need to add the cost of a
+      // normal fmul instruction to the cost of the fadd reduction.
+      ReduxCost +=
+          TTI.getArithmeticInstrCost(Instruction::FMul, VectorTy, CostKind);
+      break;
+    }
+    case RecurKind::FAnyOf:
+    case RecurKind::IAnyOf: {
+      // This has the cost of vector.reduce.or, but may have other costs as
+      // well. FIXME: This recur kind does not have a well defined cost yet.
+      unsigned RdxOpcode = RecurrenceDescriptor::getOpcode(RecurKind::Or);
+      ReduxCost +=
+          TTI.getArithmeticReductionCost(RdxOpcode, VectorTy, FMF, CostKind);
+      break;
+    }
+    default:
+      llvm_unreachable("Unexpected reduction operation!");
+    }
+  }
+  return ReduxCost;
+}
+
 std::optional<InstructionCost>
 LoopVectorizationCostModel::getReductionPatternCost(
     Instruction *I, ElementCount VF, Type *Ty, TTI::TargetCostKind CostKind) {
   using namespace llvm::PatternMatch;
   // Early exit for no inloop reductions
   if (InLoopReductions.empty() || VF.isScalar() || !isa<VectorType>(Ty))
     return std::nullopt;
   auto *VectorTy = cast<VectorType>(Ty);
(3,202 lines not shown)
     if (isa<FPExtInst>(I) && EmittedRemark.insert(I).second)
       });

     for (Use &Op : I->operands())
       if (auto *OpI = dyn_cast<Instruction>(Op))
         Worklist.push_back(OpI);
   }
 }

-static bool areRuntimeChecksProfitable(GeneratedRTChecks &Checks,
-                                       VectorizationFactor &VF,
-                                       std::optional<unsigned> VScale, Loop *L,
-                                       ScalarEvolution &SE) {
-  InstructionCost CheckCost = Checks.getCost();
+// What makes the loop unprofitable to vectorize.
+namespace OutOfLoopCost {
+enum Reason {
+  None, // OutOfLoopCost is zero.
+  RuntimeCheck,
+  OutOfLoopReduction,
+  Some // Combination of above reasons: We have both runtime checks and out of
+       // loop reductions.
+};
+}
+
+static OutOfLoopCost::Reason areOutOfLoopComputationsProfitable(
+    InstructionCost CheckCost, InstructionCost ReduxCost,
+    VectorizationFactor &VF, std::optional<unsigned> VScale, Loop *L,
+    ScalarEvolution &SE) {
   if (!CheckCost.isValid())
-    return false;
+    return OutOfLoopCost::RuntimeCheck;

+  auto ReduxCostVal = *ReduxCost.getValue();
+  double RtC = *CheckCost.getValue();
+  if (!ReduxCostVal && !RtC)
+    return OutOfLoopCost::None;
   // When interleaving only scalar and vector cost will be equal, which in turn
   // would lead to a divide by 0. Fall back to hard threshold.
   if (VF.Width.isScalar()) {
     if (CheckCost > VectorizeMemoryCheckThreshold) {
       LLVM_DEBUG(
           dbgs()
           << "LV: Interleaving only is not profitable due to runtime checks\n");
-      return false;
+      return OutOfLoopCost::RuntimeCheck;
     }
-    return true;
+    if (ReduxCostVal) {
+      LLVM_DEBUG(dbgs() << "Interleaving only is not profitable due to out of "
+                           "loop reductions\n");
+      return OutOfLoopCost::OutOfLoopReduction;
+    }
+    return OutOfLoopCost::None;
   }

   // The scalar cost should only be 0 when vectorizing with a user specified
-  // VF/IC. In those cases, runtime checks should always be generated.
+  // VF/IC. In those cases, ignore out of loop costs.
   double ScalarC = *VF.ScalarCost.getValue();
   if (ScalarC == 0)
-    return true;
+    return OutOfLoopCost::None;

   // First, compute the minimum iteration count required so that the vector
   // loop outperforms the scalar loop.
   // The total cost of the scalar loop is
   //   ScalarC * TC
   // where
   //   * TC is the actual trip count of the loop.
   //   * ScalarC is the cost of a single scalar iteration.
   //
   // The total cost of the vector loop is
-  //   RtC + VecC * (TC / VF) + EpiC
+  //   RtC + VecC * (TC / VF) + EpiC + ReduxCost
   // where
   //   * RtC is the cost of the generated runtime checks
   //   * VecC is the cost of a single vector iteration.
   //   * TC is the actual trip count of the loop
   //   * VF is the vectorization factor
   //   * EpiCost is the cost of the generated epilogue, including the cost
   //     of the remaining scalar operations.
+  //   * ReduxCost is the cost of out-of-loop reductions which are executed if
+  //     the vector loop is taken.
   //
   // Vectorization is profitable once the total vector cost is less than the
   // total scalar cost:
-  //   RtC + VecC * (TC / VF) + EpiC < ScalarC * TC
+  //   RtC + VecC * (TC / VF) + EpiC +ReduxCost < ScalarC * TC
   //
   // Now we can compute the minimum required trip count TC as
-  //   (RtC + EpiC) / (ScalarC - (VecC / VF)) < TC
+  //   (RtC + EpiC + ReduxCost) / (ScalarC - (VecC / VF)) < TC
   //
   // For now we assume the epilogue cost EpiC = 0 for simplicity. Note that
   // the computations are performed on doubles, not integers and the result
   // is rounded up, hence we get an upper estimate of the TC.
   unsigned IntVF = VF.Width.getKnownMinValue();
   if (VF.Width.isScalable()) {
     unsigned AssumedMinimumVscale = 1;
     if (VScale)
       AssumedMinimumVscale = *VScale;
     IntVF *= AssumedMinimumVscale;
   }
   double VecCOverVF = double(*VF.Cost.getValue()) / IntVF;
-  double RtC = *CheckCost.getValue();
-  double MinTC1 = RtC / (ScalarC - VecCOverVF);
+  double MinTC1 = (RtC + ReduxCostVal) / (ScalarC - VecCOverVF);

   // Second, compute a minimum iteration count so that the cost of the
   // runtime checks is only a fraction of the total scalar loop cost. This
   // adds a loop-dependent bound on the overhead incurred if the runtime
   // checks fail. In case the runtime checks fail, the cost is RtC + ScalarC
   // * TC. To bound the runtime check to be a fraction 1/X of the scalar
   // cost, compute
   //   RtC < ScalarC * TC * (1 / X)  ==>  RtC * X / ScalarC < TC
+  // Note that we can ignore ReduxCost here since out-of-loop reductions are
+  // computed only if the vector loop is taken.
   double MinTC2 = RtC * 10 / ScalarC;

   // Now pick the larger minimum. If it is not a multiple of VF, choose the
   // next closest multiple of VF. This should partly compensate for ignoring
   // the epilogue cost.
   uint64_t MinTC = std::ceil(std::max(MinTC1, MinTC2));
   VF.MinProfitableTripCount = ElementCount::getFixed(alignTo(MinTC, IntVF));

-  LLVM_DEBUG(
-      dbgs() << "LV: Minimum required TC for runtime checks to be profitable:"
-             << VF.MinProfitableTripCount << "\n");
+  LLVM_DEBUG(dbgs() << "LV: Minimum required TC for out-of-loop computations "
+                       "to be profitable:"
+                    << VF.MinProfitableTripCount << "\n");

   // Skip vectorization if the expected trip count is less than the minimum
   // required trip count.
   if (auto ExpectedTC = getSmallBestKnownTC(SE, L)) {
     if (ElementCount::isKnownLT(ElementCount::getFixed(*ExpectedTC),
                                 VF.MinProfitableTripCount)) {
       LLVM_DEBUG(dbgs() << "LV: Vectorization is not beneficial: expected "
                            "trip count < minimum profitable VF ("
                         << *ExpectedTC << " < " << VF.MinProfitableTripCount
                         << ")\n");

-      return false;
+      // If possible, return the exact reason we cannot vectorize the small trip
+      // count loop.
+      return (!RtC)            ? OutOfLoopCost::OutOfLoopReduction
+             : !(ReduxCostVal) ? OutOfLoopCost::RuntimeCheck
+                               : OutOfLoopCost::Some;
     }
   }
-  return true;
+  return OutOfLoopCost::None;
 }

 LoopVectorizePass::LoopVectorizePass(LoopVectorizeOptions Opts)
     : InterleaveOnlyWhenForced(Opts.InterleaveOnlyWhenForced ||
                                !EnableLoopInterleaving),
       VectorizeOnlyWhenForced(Opts.VectorizeOnlyWhenForced ||
                               !EnableLoopVectorization) {}

(174 lines not shown)
   if (MaybeVF) {
     IC = CM.selectInterleaveCount(VF.Width, VF.Cost);

     unsigned SelectedIC = std::max(IC, UserIC);
     // Optimistically generate runtime checks if they are needed. Drop them if
     // they turn out to not be profitable.
     if (VF.Width.isVector() || SelectedIC > 1)
       Checks.Create(L, *LVL.getLAI(), PSE.getPredicate(), VF.Width, SelectedIC);

-    // Check if it is profitable to vectorize with runtime checks.
+    // Check if it is profitable to vectorize with out of loop computations
+    // (such as reductions and runtime checks).
     bool ForceVectorization =
         Hints.getForce() == LoopVectorizeHints::FK_Enabled;
-    if (!ForceVectorization &&
-        !areRuntimeChecksProfitable(Checks, VF, getVScaleForTuning(L, *TTI), L,
-                                    *PSE.getSE())) {
+    if (!ForceVectorization) {
+      InstructionCost RTCheckCost = Checks.getCost();
+      InstructionCost ReduxCost = CM.getOutOfLoopReductionCost(VF);
+
+      auto UnprofitableReason = areOutOfLoopComputationsProfitable(
+          RTCheckCost, ReduxCost, VF, getVScaleForTuning(L, *TTI), L,
+          *PSE.getSE());
+      switch (UnprofitableReason) {
+      case OutOfLoopCost::None:
+        break;
+      case OutOfLoopCost::RuntimeCheck: {
       ORE->emit([&]() {
         return OptimizationRemarkAnalysisAliasing(
                    DEBUG_TYPE, "CantReorderMemOps", L->getStartLoc(),
                    L->getHeader())
                << "loop not vectorized: cannot prove it is safe to reorder "
                   "memory operations";
       });
       LLVM_DEBUG(dbgs() << "LV: Too many memory checks needed.\n");
       Hints.emitRemarkWithHints();
       return false;
     }
+      case OutOfLoopCost::OutOfLoopReduction:
+        LLVM_DEBUG(dbgs() << "LV: Costly out of loop reductions for small "
+                             "trip count loop.\n");
+        return false;
+      default:
+        LLVM_DEBUG(dbgs() << "LV: Costly out of loop computation for small "
+                             "trip count loop.\n");
+        return false;
+      }
+    }// ForceVectorization
   }

   // Identify the diagnostic messages that should be produced.
   std::pair<StringRef, std::string> VecDiagMsg, IntDiagMsg;
   bool VectorizeLoop = true, InterleaveLoop = true;
   if (VF.Width.isScalar()) {
     LLVM_DEBUG(dbgs() << "LV: Vectorization is possible but not beneficial.\n");
     VecDiagMsg = std::make_pair(
(363 lines not shown)

llvm/test/Transforms/LoopVectorize/ARM/mve-selectandorcost.ll

	Show All 11 Lines
	; CHECK-COST: LV: Found an estimated cost of 2 for VF 4 For instruction: %or.cond = select i1 %cmp2, i1 true, i1 %cmp3			; CHECK-COST: LV: Found an estimated cost of 2 for VF 4 For instruction: %or.cond = select i1 %cmp2, i1 true, i1 %cmp3

	define float @test(ptr nocapture readonly %pA, ptr nocapture readonly %pB, i32 %blockSize) #0 {			define float @test(ptr nocapture readonly %pA, ptr nocapture readonly %pB, i32 %blockSize) #0 {
	; CHECK-LABEL: @test(			; CHECK-LABEL: @test(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[CMP_NOT16:%.*]] = icmp eq i32 [[BLOCKSIZE:%.*]], 0			; CHECK-NEXT: [[CMP_NOT16:%.*]] = icmp eq i32 [[BLOCKSIZE:%.*]], 0
	; CHECK-NEXT: br i1 [[CMP_NOT16]], label [[WHILE_END:%.*]], label [[WHILE_BODY_PREHEADER:%.*]]			; CHECK-NEXT: br i1 [[CMP_NOT16]], label [[WHILE_END:%.*]], label [[WHILE_BODY_PREHEADER:%.*]]
	; CHECK: while.body.preheader:			; CHECK: while.body.preheader:
	; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[BLOCKSIZE]], 4			; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[BLOCKSIZE]], 16
	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[BLOCKSIZE]], -4			; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[BLOCKSIZE]], -4
	; CHECK-NEXT: [[TMP0:%.*]] = shl i32 [[N_VEC]], 2			; CHECK-NEXT: [[TMP0:%.*]] = shl i32 [[N_VEC]], 2
	; CHECK-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[PA:%.*]], i32 [[TMP0]]			; CHECK-NEXT: [[IND_END:%.*]] = getelementptr i8, ptr [[PA:%.*]], i32 [[TMP0]]
	; CHECK-NEXT: [[TMP1:%.*]] = shl i32 [[N_VEC]], 2			; CHECK-NEXT: [[TMP1:%.*]] = shl i32 [[N_VEC]], 2
	; CHECK-NEXT: [[IND_END1:%.*]] = getelementptr i8, ptr [[PB:%.*]], i32 [[TMP1]]			; CHECK-NEXT: [[IND_END1:%.*]] = getelementptr i8, ptr [[PB:%.*]], i32 [[TMP1]]
	; CHECK-NEXT: [[IND_END3:%.*]] = and i32 [[BLOCKSIZE]], 3			; CHECK-NEXT: [[IND_END3:%.*]] = and i32 [[BLOCKSIZE]], 3
	; CHECK-NEXT: [[TMP19:%.*]] = tail call fast float @llvm.fabs.f32(float [[SUB]])			; CHECK-NEXT: [[TMP19:%.*]] = tail call fast float @llvm.fabs.f32(float [[SUB]])
	; CHECK-NEXT: [[DIV:%.*]] = fdiv fast float [[TMP19]], [[ADD]]			; CHECK-NEXT: [[DIV:%.*]] = fdiv fast float [[TMP19]], [[ADD]]
	; CHECK-NEXT: [[ADD4:%.*]] = fadd fast float [[DIV]], [[ACCUM_017]]			; CHECK-NEXT: [[ADD4:%.*]] = fadd fast float [[DIV]], [[ACCUM_017]]
	; CHECK-NEXT: br label [[IF_END]]			; CHECK-NEXT: br label [[IF_END]]
	; CHECK: if.end:			; CHECK: if.end:
	; CHECK-NEXT: [[ACCUM_1]] = phi float [ [[ADD4]], [[IF_THEN]] ], [ [[ACCUM_017]], [[WHILE_BODY]] ]			; CHECK-NEXT: [[ACCUM_1]] = phi float [ [[ADD4]], [[IF_THEN]] ], [ [[ACCUM_017]], [[WHILE_BODY]] ]
	; CHECK-NEXT: [[DEC]] = add i32 [[BLOCKSIZE_ADDR_018]], -1			; CHECK-NEXT: [[DEC]] = add i32 [[BLOCKSIZE_ADDR_018]], -1
	; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[DEC]], 0			; CHECK-NEXT: [[CMP_NOT:%.*]] = icmp eq i32 [[DEC]], 0
	; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END]], label [[WHILE_BODY]], !llvm.loop [[LOOP2:![0-9]+]]			; CHECK-NEXT: br i1 [[CMP_NOT]], label [[WHILE_END]], label [[WHILE_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
	; CHECK: while.end:			; CHECK: while.end:
	; CHECK-NEXT: [[ACCUM_0_LCSSA:%.*]] = phi float [ 0.000000e+00, [[ENTRY:%.*]] ], [ [[ACCUM_1]], [[IF_END]] ], [ [[TMP14]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[ACCUM_0_LCSSA:%.*]] = phi float [ 0.000000e+00, [[ENTRY:%.*]] ], [ [[ACCUM_1]], [[IF_END]] ], [ [[TMP14]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: ret float [[ACCUM_0_LCSSA]]			; CHECK-NEXT: ret float [[ACCUM_0_LCSSA]]
	;			;
	entry:			entry:
	%cmp.not16 = icmp eq i32 %blockSize, 0			%cmp.not16 = icmp eq i32 %blockSize, 0
	br i1 %cmp.not16, label %while.end, label %while.body			br i1 %cmp.not16, label %while.end, label %while.body


llvm/test/Transforms/LoopVectorize/ARM/sphinx.ll

	; CHECK-NEXT: [[T4:%.*]] = load ptr, ptr [[ARRAYIDX109]], align 4			; CHECK-NEXT: [[T4:%.*]] = load ptr, ptr [[ARRAYIDX109]], align 4
	; CHECK-NEXT: [[T5:%.*]] = load ptr, ptr @vv, align 4			; CHECK-NEXT: [[T5:%.*]] = load ptr, ptr @vv, align 4
	; CHECK-NEXT: [[ARRAYIDX111:%.*]] = getelementptr inbounds ptr, ptr [[T5]], i32 [[T2]]			; CHECK-NEXT: [[ARRAYIDX111:%.*]] = getelementptr inbounds ptr, ptr [[T5]], i32 [[T2]]
	; CHECK-NEXT: [[T6:%.*]] = load ptr, ptr [[ARRAYIDX111]], align 4			; CHECK-NEXT: [[T6:%.*]] = load ptr, ptr [[ARRAYIDX111]], align 4
	; CHECK-NEXT: [[T7:%.*]] = load ptr, ptr @ll, align 4			; CHECK-NEXT: [[T7:%.*]] = load ptr, ptr @ll, align 4
	; CHECK-NEXT: [[ARRAYIDX113:%.*]] = getelementptr inbounds float, ptr [[T7]], i32 [[T2]]			; CHECK-NEXT: [[ARRAYIDX113:%.*]] = getelementptr inbounds float, ptr [[T7]], i32 [[T2]]
	; CHECK-NEXT: [[T8:%.*]] = load float, ptr [[ARRAYIDX113]], align 4			; CHECK-NEXT: [[T8:%.*]] = load float, ptr [[ARRAYIDX113]], align 4
	; CHECK-NEXT: [[CONV114:%.*]] = fpext float [[T8]] to double			; CHECK-NEXT: [[CONV114:%.*]] = fpext float [[T8]] to double
	; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[T]], 2			; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[T]], 8
	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[T]], 2			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[T]], 2
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[T]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[T]], [[N_MOD_VF]]
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> zeroinitializer, double [[CONV114]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double> zeroinitializer, double [[CONV114]], i32 0
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[MUL123:%.*]] = fmul fast double [[CONV122]], [[CONV122]]			; CHECK-NEXT: [[MUL123:%.*]] = fmul fast double [[CONV122]], [[CONV122]]
	; CHECK-NEXT: [[ARRAYIDX124:%.*]] = getelementptr inbounds float, ptr [[T6]], i32 [[I_2132]]			; CHECK-NEXT: [[ARRAYIDX124:%.*]] = getelementptr inbounds float, ptr [[T6]], i32 [[I_2132]]
	; CHECK-NEXT: [[T11:%.*]] = load float, ptr [[ARRAYIDX124]], align 4			; CHECK-NEXT: [[T11:%.*]] = load float, ptr [[ARRAYIDX124]], align 4
	; CHECK-NEXT: [[CONV125:%.*]] = fpext float [[T11]] to double			; CHECK-NEXT: [[CONV125:%.*]] = fpext float [[T11]] to double
	; CHECK-NEXT: [[MUL126:%.*]] = fmul fast double [[MUL123]], [[CONV125]]			; CHECK-NEXT: [[MUL126:%.*]] = fmul fast double [[MUL123]], [[CONV125]]
	; CHECK-NEXT: [[SUB127]] = fsub fast double [[DVAL1_4131]], [[MUL126]]			; CHECK-NEXT: [[SUB127]] = fsub fast double [[DVAL1_4131]], [[MUL126]]
	; CHECK-NEXT: [[INC129]] = add nuw nsw i32 [[I_2132]], 1			; CHECK-NEXT: [[INC129]] = add nuw nsw i32 [[I_2132]], 1
	; CHECK-NEXT: [[EXITCOND143:%.*]] = icmp eq i32 [[INC129]], [[T]]			; CHECK-NEXT: [[EXITCOND143:%.*]] = icmp eq i32 [[INC129]], [[T]]
	; CHECK-NEXT: br i1 [[EXITCOND143]], label [[OUTEREND]], label [[INNERLOOP]], !llvm.loop [[LOOP2:![0-9]+]]			; CHECK-NEXT: br i1 [[EXITCOND143]], label [[OUTEREND]], label [[INNERLOOP]], !llvm.loop [[LOOP3:![0-9]+]]
	; CHECK: outerend:			; CHECK: outerend:
	; CHECK-NEXT: [[SUB127_LCSSA:%.*]] = phi double [ [[SUB127]], [[INNERLOOP]] ], [ [[TMP15]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[SUB127_LCSSA:%.*]] = phi double [ [[SUB127]], [[INNERLOOP]] ], [ [[TMP15]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: [[CONV138:%.*]] = fptosi double [[SUB127_LCSSA]] to i32			; CHECK-NEXT: [[CONV138:%.*]] = fptosi double [[SUB127_LCSSA]] to i32
	; CHECK-NEXT: [[CALL142]] = add nuw nsw i32 [[SCORE_1135]], [[CONV138]]			; CHECK-NEXT: [[CALL142]] = add nuw nsw i32 [[SCORE_1135]], [[CONV138]]
	; CHECK-NEXT: [[INC144]] = add nuw nsw i32 [[J_0136]], 1			; CHECK-NEXT: [[INC144]] = add nuw nsw i32 [[J_0136]], 1
	; CHECK-NEXT: [[ARRAYIDX102:%.*]] = getelementptr inbounds i32, ptr @a, i32 [[INC144]]			; CHECK-NEXT: [[ARRAYIDX102:%.*]] = getelementptr inbounds i32, ptr @a, i32 [[INC144]]
	; CHECK-NEXT: [[V17]] = load i32, ptr [[ARRAYIDX102]], align 4			; CHECK-NEXT: [[V17]] = load i32, ptr [[ARRAYIDX102]], align 4
	; CHECK-NEXT: [[CMP103:%.*]] = icmp sgt i32 [[V17]], -1			; CHECK-NEXT: [[CMP103:%.*]] = icmp sgt i32 [[V17]], -1

llvm/test/Transforms/LoopVectorize/ARM/tail-fold-multiple-icmps.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve -tail-predication=enabled -passes=loop-vectorize,instcombine,simplifycfg -simplifycfg-require-and-preserve-domtree=1 %s -S -o - | FileCheck %s			; RUN: opt -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve -tail-predication=enabled -passes=loop-vectorize,instcombine,simplifycfg -simplifycfg-require-and-preserve-domtree=1 %s -S -o - | FileCheck %s

	target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"

	define arm_aapcs_vfpcc i32 @minmaxval4(ptr nocapture readonly %x, ptr nocapture %minp, i32 %N) {			define arm_aapcs_vfpcc i32 @minmaxval4(ptr nocapture readonly %x, ptr nocapture %minp, i32 %N) {
	; CHECK-LABEL: @minmaxval4(			; CHECK-LABEL: @minmaxval4(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[CMP26_NOT:%.*]] = icmp eq i32 [[N:%.*]], 0			; CHECK-NEXT: [[CMP26_NOT:%.*]] = icmp eq i32 [[N:%.*]], 0
	; CHECK-NEXT: br i1 [[CMP26_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY_PREHEADER:%.*]]			; CHECK-NEXT: br i1 [[CMP26_NOT]], label [[FOR_COND_CLEANUP:%.*]], label [[FOR_BODY_PREHEADER:%.*]]
	; CHECK: for.body.preheader:			; CHECK: for.body.preheader:
	; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4			; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
	Inline comment by anna (Author):

	This minimum trip count of 56 comes about because we compute the reduction cost as 140 (and there are two such reductions, with total being 280). This is a pretty high cost which is computed through https://github.com/llvm/llvm-project/blob/main/llvm/include/llvm/CodeGen/BasicTTIImpl.h#L2372. This is correct for a triple with mattr such as:

	llc < %s -mtriple=armv7-none-eabi -float-abi=hard -mattr=+neon -verify-machineinstrs | FileCheck %s

	define i8 @test_umin_v8i8(<8 x i8> %x) {
	; CHECK-LABEL: test_umin_v8i8:
	; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: vpmin.u8 d16, d0, d0
	; CHECK-NEXT: vpmin.u8 d16, d16, d16
	; CHECK-NEXT: vpmin.u8 d16, d16, d16
	; CHECK-NEXT: vmov.u8 r0, d16[0]
	; CHECK-NEXT: bx lr
	entry:
	  %z = call i8 @llvm.vector.reduce.umin.v8i8(<8 x i8> %x)
	  ret i8 %z
	}

	However, I'm not sure if this is a correct cost for other triples.
	; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -4			; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -4
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647>, [[VECTOR_PH]] ], [ [[TMP3:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <4 x i32> [ <i32 2147483647, i32 2147483647, i32 2147483647, i32 2147483647>, [[VECTOR_PH]] ], [ [[TMP2:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI1:%.*]] = phi <4 x i32> [ <i32 -2147483648, i32 -2147483648, i32 -2147483648, i32 -2147483648>, [[VECTOR_PH]] ], [ [[TMP2:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI1:%.*]] = phi <4 x i32> [ <i32 -2147483648, i32 -2147483648, i32 -2147483648, i32 -2147483648>, [[VECTOR_PH]] ], [ [[TMP1:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, ptr [[X:%.*]], i32 [[INDEX]]			; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, ptr [[X:%.*]], i32 [[INDEX]]
	; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP0]], align 4			; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, ptr [[TMP0]], align 4
	; CHECK-NEXT: [[TMP2]] = call <4 x i32> @llvm.smax.v4i32(<4 x i32> [[WIDE_LOAD]], <4 x i32> [[VEC_PHI1]])			; CHECK-NEXT: [[TMP1]] = call <4 x i32> @llvm.smax.v4i32(<4 x i32> [[WIDE_LOAD]], <4 x i32> [[VEC_PHI1]])
	; CHECK-NEXT: [[TMP3]] = call <4 x i32> @llvm.smin.v4i32(<4 x i32> [[WIDE_LOAD]], <4 x i32> [[VEC_PHI]])			; CHECK-NEXT: [[TMP2]] = call <4 x i32> @llvm.smin.v4i32(<4 x i32> [[WIDE_LOAD]], <4 x i32> [[VEC_PHI]])
	; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], 4
	; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP3:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP4]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; CHECK-NEXT: br i1 [[TMP3]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.vector.reduce.smax.v4i32(<4 x i32> [[TMP2]])			; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.smax.v4i32(<4 x i32> [[TMP1]])
	; CHECK-NEXT: [[TMP6:%.*]] = call i32 @llvm.vector.reduce.smin.v4i32(<4 x i32> [[TMP3]])			; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.vector.reduce.smin.v4i32(<4 x i32> [[TMP2]])
	; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]			; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]
	; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
	; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP6]], [[MIDDLE_BLOCK]] ], [ 2147483647, [[FOR_BODY_PREHEADER]] ]			; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP5]], [[MIDDLE_BLOCK]] ], [ 2147483647, [[FOR_BODY_PREHEADER]] ]
	; CHECK-NEXT: [[BC_MERGE_RDX2:%.*]] = phi i32 [ [[TMP5]], [[MIDDLE_BLOCK]] ], [ -2147483648, [[FOR_BODY_PREHEADER]] ]			; CHECK-NEXT: [[BC_MERGE_RDX2:%.*]] = phi i32 [ [[TMP4]], [[MIDDLE_BLOCK]] ], [ -2147483648, [[FOR_BODY_PREHEADER]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	; CHECK: for.cond.cleanup:			; CHECK: for.cond.cleanup:
	; CHECK-NEXT: [[MAX_0_LCSSA:%.*]] = phi i32 [ -2147483648, [[ENTRY:%.*]] ], [ [[TMP8:%.*]], [[FOR_BODY]] ], [ [[TMP5]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[MAX_0_LCSSA:%.*]] = phi i32 [ -2147483648, [[ENTRY:%.*]] ], [ [[COND:%.*]], [[FOR_BODY]] ], [ [[TMP4]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: [[MIN_0_LCSSA:%.*]] = phi i32 [ 2147483647, [[ENTRY]] ], [ [[TMP9:%.*]], [[FOR_BODY]] ], [ [[TMP6]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[MIN_0_LCSSA:%.*]] = phi i32 [ 2147483647, [[ENTRY]] ], [ [[COND9:%.*]], [[FOR_BODY]] ], [ [[TMP5]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: store i32 [[MIN_0_LCSSA]], ptr [[MINP:%.*]], align 4			; CHECK-NEXT: store i32 [[MIN_0_LCSSA]], ptr [[MINP:%.*]], align 4
	; CHECK-NEXT: ret i32 [[MAX_0_LCSSA]]			; CHECK-NEXT: ret i32 [[MAX_0_LCSSA]]
	; CHECK: for.body:			; CHECK: for.body:
	; CHECK-NEXT: [[I_029:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[I_029:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
	; CHECK-NEXT: [[MIN_028:%.*]] = phi i32 [ [[TMP9]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[MIN_028:%.*]] = phi i32 [ [[COND9]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
	; CHECK-NEXT: [[MAX_027:%.*]] = phi i32 [ [[TMP8]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX2]], [[SCALAR_PH]] ]			; CHECK-NEXT: [[MAX_027:%.*]] = phi i32 [ [[COND]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX2]], [[SCALAR_PH]] ]
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[X]], i32 [[I_029]]			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[X]], i32 [[I_029]]
	; CHECK-NEXT: [[TMP7:%.*]] = load i32, ptr [[ARRAYIDX]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
	; CHECK-NEXT: [[TMP8]] = call i32 @llvm.smax.i32(i32 [[TMP7]], i32 [[MAX_027]])			; CHECK-NEXT: [[COND]] = call i32 @llvm.smax.i32(i32 [[TMP6]], i32 [[MAX_027]])
	; CHECK-NEXT: [[TMP9]] = call i32 @llvm.smin.i32(i32 [[TMP7]], i32 [[MIN_028]])			; CHECK-NEXT: [[COND9]] = call i32 @llvm.smin.i32(i32 [[TMP6]], i32 [[MIN_028]])
	; CHECK-NEXT: [[INC]] = add nuw i32 [[I_029]], 1			; CHECK-NEXT: [[INC]] = add nuw i32 [[I_029]], 1
	; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC]], [[N]]			; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i32 [[INC]], [[N]]
	; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP2:![0-9]+]]			; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
	;			;
	entry:			entry:
	%cmp26.not = icmp eq i32 %N, 0			%cmp26.not = icmp eq i32 %N, 0
	br i1 %cmp26.not, label %for.cond.cleanup, label %for.body			br i1 %cmp26.not, label %for.cond.cleanup, label %for.body

	for.cond.cleanup: ; preds = %for.body, %entry			for.cond.cleanup: ; preds = %for.body, %entry
	%max.0.lcssa = phi i32 [ -2147483648, %entry ], [ %cond, %for.body ]			%max.0.lcssa = phi i32 [ -2147483648, %entry ], [ %cond, %for.body ]
	%min.0.lcssa = phi i32 [ 2147483647, %entry ], [ %cond9, %for.body ]			%min.0.lcssa = phi i32 [ 2147483647, %entry ], [ %cond9, %for.body ]
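The inline comment on tail-fold-multiple-icmps.ll ties the minimum trip count to the total out-of-loop reduction cost. A minimal sketch of that relationship follows, assuming the patch reuses the break-even formula that the runtime-check profitability code applies (one-off overhead multiplied by VF, divided by the per-VF-iterations saving of the vector loop). The function name `minProfitableTripCount`, its parameters, and the per-iteration costs used in the note below are hypothetical; the review only states the overhead of 280 and the resulting trip count of 56.

```cpp
#include <cassert>
#include <cstdint>

// Ceiling division for unsigned values (same idea as llvm::divideCeil).
static uint64_t divideCeil(uint64_t Num, uint64_t Den) {
  return (Num + Den - 1) / Den;
}

// Hypothetical helper, not the patch's actual code.
// Scalar loop cost:  ScalarC * TC
// Vector loop cost:  OverheadC + VecC * (TC / VF)
// Setting the two equal and solving for TC gives
//   TC = OverheadC * VF / (ScalarC * VF - VecC)
// Returns 0 when the vector body is not cheaper than VF scalar iterations,
// i.e. when no finite break-even trip count exists.
static uint64_t minProfitableTripCount(uint64_t OverheadC, uint64_t VF,
                                       uint64_t ScalarC, uint64_t VecC) {
  if (ScalarC * VF <= VecC)
    return 0;
  return divideCeil(OverheadC * VF, ScalarC * VF - VecC);
}
```

With an overhead of 280 (two reductions at 140 each), VF = 4, and illustrative costs of 10 per scalar iteration and 20 per vector iteration, `minProfitableTripCount(280, 4, 10, 20)` evaluates to 56. The two per-iteration costs were picked only to make the arithmetic land on the figure quoted in the comment; the real costs for this loop are not given in the review.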

llvm/test/Transforms/LoopVectorize/RISCV/defaults.ll

for.end:
ret void		ret void
}		}

define i64 @vector_add_reduce(ptr noalias nocapture %a) {		define i64 @vector_add_reduce(ptr noalias nocapture %a) {
; CHECK-LABEL: @vector_add_reduce(		; CHECK-LABEL: @vector_add_reduce(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()		; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2		; CHECK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2
; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]		; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 4, i64 [[TMP1]])
		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP2]]
; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()		; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 2		; CHECK-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 2
; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP3]]		; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP4]]
; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]		; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP7:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0		; CHECK-NEXT: [[TMP5:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[TMP4]]		; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], i64 [[TMP5]]
; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr [[TMP5]], i32 0		; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i64, ptr [[TMP6]], i32 0
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, ptr [[TMP6]], align 8		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, ptr [[TMP7]], align 8
; CHECK-NEXT: [[TMP7]] = add <vscale x 2 x i64> [[VEC_PHI]], [[WIDE_LOAD]]		; CHECK-NEXT: [[TMP8]] = add <vscale x 2 x i64> [[VEC_PHI]], [[WIDE_LOAD]]
; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()		; CHECK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
; CHECK-NEXT: [[TMP9:%.*]] = mul i64 [[TMP8]], 2		; CHECK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 2
; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP9]]		; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP10]]
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]		; CHECK-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK-NEXT: [[TMP11:%.*]] = call i64 @llvm.vector.reduce.add.nxv2i64(<vscale x 2 x i64> [[TMP7]])		; CHECK-NEXT: [[TMP12:%.*]] = call i64 @llvm.vector.reduce.add.nxv2i64(<vscale x 2 x i64> [[TMP8]])
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
; CHECK: scalar.ph:		; CHECK: scalar.ph:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[TMP11]], [[MIDDLE_BLOCK]] ]		; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: br label [[FOR_BODY:%.*]]		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]		; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
; CHECK-NEXT: [[SUM:%.*]] = phi i64 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[SUM_NEXT:%.*]], [[FOR_BODY]] ]		; CHECK-NEXT: [[SUM:%.*]] = phi i64 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[SUM_NEXT:%.*]], [[FOR_BODY]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV]]		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[IV]]
; CHECK-NEXT: [[ELEM:%.*]] = load i64, ptr [[ARRAYIDX]], align 8		; CHECK-NEXT: [[ELEM:%.*]] = load i64, ptr [[ARRAYIDX]], align 8
; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1		; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
; CHECK-NEXT: [[SUM_NEXT]] = add i64 [[SUM]], [[ELEM]]		; CHECK-NEXT: [[SUM_NEXT]] = add i64 [[SUM]], [[ELEM]]
; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024		; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024
; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]		; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP5:![0-9]+]]
; CHECK: for.end:		; CHECK: for.end:
; CHECK-NEXT: [[SUM_NEXT_LCSSA:%.*]] = phi i64 [ [[SUM_NEXT]], [[FOR_BODY]] ], [ [[TMP11]], [[MIDDLE_BLOCK]] ]		; CHECK-NEXT: [[SUM_NEXT_LCSSA:%.*]] = phi i64 [ [[SUM_NEXT]], [[FOR_BODY]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[SUM_NEXT_LCSSA]]		; CHECK-NEXT: ret i64 [[SUM_NEXT_LCSSA]]
;		;
entry:		entry:
br label %for.body		br label %for.body

for.body:		for.body:
%iv = phi i64 [0, %entry], [%iv.next, %for.body]		%iv = phi i64 [0, %entry], [%iv.next, %for.body]
%sum = phi i64 [0, %entry], [%sum.next, %for.body]		%sum = phi i64 [0, %entry], [%sum.next, %for.body]

llvm/test/Transforms/LoopVectorize/RISCV/inloop-reduction.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -mtriple riscv64-linux-gnu -mattr=+v,+d -passes=loop-vectorize < %s -S -o - | FileCheck %s -check-prefix=OUTLOOP			; RUN: opt -mtriple riscv64-linux-gnu -mattr=+v,+d -passes=loop-vectorize < %s -S -o - | FileCheck %s -check-prefix=OUTLOOP
	; RUN: opt -mtriple riscv64-linux-gnu -mattr=+v,+d -passes=loop-vectorize -prefer-inloop-reductions < %s -S -o - | FileCheck %s -check-prefix=INLOOP			; RUN: opt -mtriple riscv64-linux-gnu -mattr=+v,+d -passes=loop-vectorize -prefer-inloop-reductions < %s -S -o - | FileCheck %s -check-prefix=INLOOP


	target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n64-S128"			target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n64-S128"
	target triple = "riscv64"			target triple = "riscv64"

	define i32 @add_i16_i32(ptr nocapture readonly %x, i32 %n) {			define i32 @add_i16_i32(ptr nocapture readonly %x, i32 %n) {
	; OUTLOOP-LABEL: @add_i16_i32(			; OUTLOOP-LABEL: @add_i16_i32(
	; OUTLOOP-NEXT: entry:			; OUTLOOP-NEXT: entry:
	; OUTLOOP-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0			; OUTLOOP-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
	; OUTLOOP-NEXT: br i1 [[CMP6]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]			; OUTLOOP-NEXT: br i1 [[CMP6]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
	; OUTLOOP: for.body.preheader:			; OUTLOOP: for.body.preheader:
	; OUTLOOP-NEXT: [[TMP0:%.*]] = call i32 @llvm.vscale.i32()			; OUTLOOP-NEXT: [[TMP0:%.*]] = call i32 @llvm.vscale.i32()
	; OUTLOOP-NEXT: [[TMP1:%.*]] = mul i32 [[TMP0]], 4			; OUTLOOP-NEXT: [[TMP1:%.*]] = mul i32 [[TMP0]], 4
	; OUTLOOP-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], [[TMP1]]			; OUTLOOP-NEXT: [[TMP2:%.*]] = call i32 @llvm.umax.i32(i32 8, i32 [[TMP1]])
				; OUTLOOP-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], [[TMP2]]
	; OUTLOOP-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; OUTLOOP-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; OUTLOOP: vector.ph:			; OUTLOOP: vector.ph:
	; OUTLOOP-NEXT: [[TMP2:%.*]] = call i32 @llvm.vscale.i32()			; OUTLOOP-NEXT: [[TMP3:%.*]] = call i32 @llvm.vscale.i32()
	; OUTLOOP-NEXT: [[TMP3:%.*]] = mul i32 [[TMP2]], 4			; OUTLOOP-NEXT: [[TMP4:%.*]] = mul i32 [[TMP3]], 4
	; OUTLOOP-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N]], [[TMP3]]			; OUTLOOP-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N]], [[TMP4]]
	; OUTLOOP-NEXT: [[N_VEC:%.*]] = sub i32 [[N]], [[N_MOD_VF]]			; OUTLOOP-NEXT: [[N_VEC:%.*]] = sub i32 [[N]], [[N_MOD_VF]]
	; OUTLOOP-NEXT: br label [[VECTOR_BODY:%.*]]			; OUTLOOP-NEXT: br label [[VECTOR_BODY:%.*]]
	; OUTLOOP: vector.body:			; OUTLOOP: vector.body:
	; OUTLOOP-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; OUTLOOP-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; OUTLOOP-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]			; OUTLOOP-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
	; OUTLOOP-NEXT: [[TMP4:%.*]] = add i32 [[INDEX]], 0			; OUTLOOP-NEXT: [[TMP5:%.*]] = add i32 [[INDEX]], 0
	; OUTLOOP-NEXT: [[TMP5:%.*]] = getelementptr inbounds i16, ptr [[X:%.*]], i32 [[TMP4]]			; OUTLOOP-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr [[X:%.*]], i32 [[TMP5]]
	; OUTLOOP-NEXT: [[TMP6:%.*]] = getelementptr inbounds i16, ptr [[TMP5]], i32 0			; OUTLOOP-NEXT: [[TMP7:%.*]] = getelementptr inbounds i16, ptr [[TMP6]], i32 0
	; OUTLOOP-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i16>, ptr [[TMP6]], align 2			; OUTLOOP-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 4 x i16>, ptr [[TMP7]], align 2
	; OUTLOOP-NEXT: [[TMP7:%.*]] = sext <vscale x 4 x i16> [[WIDE_LOAD]] to <vscale x 4 x i32>			; OUTLOOP-NEXT: [[TMP8:%.*]] = sext <vscale x 4 x i16> [[WIDE_LOAD]] to <vscale x 4 x i32>
	; OUTLOOP-NEXT: [[TMP8]] = add <vscale x 4 x i32> [[VEC_PHI]], [[TMP7]]			; OUTLOOP-NEXT: [[TMP9]] = add <vscale x 4 x i32> [[VEC_PHI]], [[TMP8]]
	; OUTLOOP-NEXT: [[TMP9:%.*]] = call i32 @llvm.vscale.i32()			; OUTLOOP-NEXT: [[TMP10:%.*]] = call i32 @llvm.vscale.i32()
	; OUTLOOP-NEXT: [[TMP10:%.*]] = mul i32 [[TMP9]], 4			; OUTLOOP-NEXT: [[TMP11:%.*]] = mul i32 [[TMP10]], 4
	; OUTLOOP-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP10]]			; OUTLOOP-NEXT: [[INDEX_NEXT]] = add nuw i32 [[INDEX]], [[TMP11]]
	; OUTLOOP-NEXT: [[TMP11:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]			; OUTLOOP-NEXT: [[TMP12:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
	; OUTLOOP-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]			; OUTLOOP-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
	; OUTLOOP: middle.block:			; OUTLOOP: middle.block:
	; OUTLOOP-NEXT: [[TMP12:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP8]])			; OUTLOOP-NEXT: [[TMP13:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP9]])
	; OUTLOOP-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N]], [[N_VEC]]			; OUTLOOP-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N]], [[N_VEC]]
	; OUTLOOP-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; OUTLOOP-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; OUTLOOP: scalar.ph:			; OUTLOOP: scalar.ph:
	; OUTLOOP-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]			; OUTLOOP-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
	; OUTLOOP-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]			; OUTLOOP-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[TMP13]], [[MIDDLE_BLOCK]] ]
	; OUTLOOP-NEXT: br label [[FOR_BODY:%.*]]			; OUTLOOP-NEXT: br label [[FOR_BODY:%.*]]
	; OUTLOOP: for.body:			; OUTLOOP: for.body:
	; OUTLOOP-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]			; OUTLOOP-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
	; OUTLOOP-NEXT: [[R_07:%.*]] = phi i32 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]			; OUTLOOP-NEXT: [[R_07:%.*]] = phi i32 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
	; OUTLOOP-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, ptr [[X]], i32 [[I_08]]			; OUTLOOP-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, ptr [[X]], i32 [[I_08]]
	; OUTLOOP-NEXT: [[TMP13:%.*]] = load i16, ptr [[ARRAYIDX]], align 2			; OUTLOOP-NEXT: [[TMP14:%.*]] = load i16, ptr [[ARRAYIDX]], align 2
	; OUTLOOP-NEXT: [[CONV:%.*]] = sext i16 [[TMP13]] to i32			; OUTLOOP-NEXT: [[CONV:%.*]] = sext i16 [[TMP14]] to i32
	; OUTLOOP-NEXT: [[ADD]] = add nsw i32 [[R_07]], [[CONV]]			; OUTLOOP-NEXT: [[ADD]] = add nsw i32 [[R_07]], [[CONV]]
	; OUTLOOP-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1			; OUTLOOP-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
	; OUTLOOP-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]			; OUTLOOP-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
	; OUTLOOP-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]			; OUTLOOP-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP_LOOPEXIT]], label [[FOR_BODY]], !llvm.loop [[LOOP3:![0-9]+]]
	; OUTLOOP: for.cond.cleanup.loopexit:			; OUTLOOP: for.cond.cleanup.loopexit:
	; OUTLOOP-NEXT: [[ADD_LCSSA:%.*]] = phi i32 [ [[ADD]], [[FOR_BODY]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]			; OUTLOOP-NEXT: [[ADD_LCSSA:%.*]] = phi i32 [ [[ADD]], [[FOR_BODY]] ], [ [[TMP13]], [[MIDDLE_BLOCK]] ]
	; OUTLOOP-NEXT: br label [[FOR_COND_CLEANUP]]			; OUTLOOP-NEXT: br label [[FOR_COND_CLEANUP]]
	; OUTLOOP: for.cond.cleanup:			; OUTLOOP: for.cond.cleanup:
	; OUTLOOP-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[ADD_LCSSA]], [[FOR_COND_CLEANUP_LOOPEXIT]] ]			; OUTLOOP-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[ADD_LCSSA]], [[FOR_COND_CLEANUP_LOOPEXIT]] ]
	; OUTLOOP-NEXT: ret i32 [[R_0_LCSSA]]			; OUTLOOP-NEXT: ret i32 [[R_0_LCSSA]]
	;			;
	; INLOOP-LABEL: @add_i16_i32(			; INLOOP-LABEL: @add_i16_i32(
	; INLOOP-NEXT: entry:			; INLOOP-NEXT: entry:
	; INLOOP-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0			; INLOOP-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0

llvm/test/Transforms/LoopVectorize/RISCV/scalable-basics.ll

for.end:		for.end:
ret void		ret void
}		}

define i64 @indexed_load(ptr noalias nocapture %a, ptr noalias nocapture %b, i64 %v, i64 %n) {		define i64 @indexed_load(ptr noalias nocapture %a, ptr noalias nocapture %b, i64 %v, i64 %n) {
; VLENUNK-LABEL: @indexed_load(		; VLENUNK-LABEL: @indexed_load(
; VLENUNK-NEXT: entry:		; VLENUNK-NEXT: entry:
; VLENUNK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()		; VLENUNK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; VLENUNK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2		; VLENUNK-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2
; VLENUNK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]		; VLENUNK-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 4, i64 [[TMP1]])
		; VLENUNK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP2]]
; VLENUNK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]		; VLENUNK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; VLENUNK: vector.ph:		; VLENUNK: vector.ph:
; VLENUNK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()		; VLENUNK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
; VLENUNK-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 2		; VLENUNK-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 2
; VLENUNK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP3]]		; VLENUNK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP4]]
; VLENUNK-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]		; VLENUNK-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]
; VLENUNK-NEXT: br label [[VECTOR_BODY:%.*]]		; VLENUNK-NEXT: br label [[VECTOR_BODY:%.*]]
; VLENUNK: vector.body:		; VLENUNK: vector.body:
; VLENUNK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; VLENUNK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; VLENUNK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]		; VLENUNK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
; VLENUNK-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0		; VLENUNK-NEXT: [[TMP5:%.*]] = add i64 [[INDEX]], 0
; VLENUNK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[B:%.*]], i64 [[TMP4]]		; VLENUNK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr [[B:%.*]], i64 [[TMP5]]
; VLENUNK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr [[TMP5]], i32 0		; VLENUNK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i64, ptr [[TMP6]], i32 0
; VLENUNK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, ptr [[TMP6]], align 8		; VLENUNK-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, ptr [[TMP7]], align 8
; VLENUNK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], <vscale x 2 x i64> [[WIDE_LOAD]]		; VLENUNK-NEXT: [[TMP8:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], <vscale x 2 x i64> [[WIDE_LOAD]]
; VLENUNK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 2 x i64> @llvm.masked.gather.nxv2i64.nxv2p0(<vscale x 2 x ptr> [[TMP7]], i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i64> poison)		; VLENUNK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 2 x i64> @llvm.masked.gather.nxv2i64.nxv2p0(<vscale x 2 x ptr> [[TMP8]], i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i64> poison)
; VLENUNK-NEXT: [[TMP8]] = add <vscale x 2 x i64> [[VEC_PHI]], [[WIDE_MASKED_GATHER]]		; VLENUNK-NEXT: [[TMP9]] = add <vscale x 2 x i64> [[VEC_PHI]], [[WIDE_MASKED_GATHER]]
; VLENUNK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()		; VLENUNK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
; VLENUNK-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 2		; VLENUNK-NEXT: [[TMP11:%.*]] = mul i64 [[TMP10]], 2
; VLENUNK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP10]]		; VLENUNK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP11]]
; VLENUNK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]		; VLENUNK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; VLENUNK-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]		; VLENUNK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
; VLENUNK: middle.block:		; VLENUNK: middle.block:
; VLENUNK-NEXT: [[TMP12:%.*]] = call i64 @llvm.vector.reduce.add.nxv2i64(<vscale x 2 x i64> [[TMP8]])		; VLENUNK-NEXT: [[TMP13:%.*]] = call i64 @llvm.vector.reduce.add.nxv2i64(<vscale x 2 x i64> [[TMP9]])
; VLENUNK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]		; VLENUNK-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
; VLENUNK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]		; VLENUNK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
; VLENUNK: scalar.ph:		; VLENUNK: scalar.ph:
; VLENUNK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]		; VLENUNK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; VLENUNK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]		; VLENUNK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[TMP13]], [[MIDDLE_BLOCK]] ]
; VLENUNK-NEXT: br label [[FOR_BODY:%.*]]		; VLENUNK-NEXT: br label [[FOR_BODY:%.*]]
; VLENUNK: for.body:		; VLENUNK: for.body:
; VLENUNK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]		; VLENUNK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
; VLENUNK-NEXT: [[SUM:%.*]] = phi i64 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[SUM_NEXT:%.*]], [[FOR_BODY]] ]		; VLENUNK-NEXT: [[SUM:%.*]] = phi i64 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[SUM_NEXT:%.*]], [[FOR_BODY]] ]
; VLENUNK-NEXT: [[BADDR:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[IV]]		; VLENUNK-NEXT: [[BADDR:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[IV]]
; VLENUNK-NEXT: [[AIDX:%.*]] = load i64, ptr [[BADDR]], align 8		; VLENUNK-NEXT: [[AIDX:%.*]] = load i64, ptr [[BADDR]], align 8
; VLENUNK-NEXT: [[AADDR:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[AIDX]]		; VLENUNK-NEXT: [[AADDR:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[AIDX]]
; VLENUNK-NEXT: [[ELEM:%.*]] = load i64, ptr [[AADDR]], align 8		; VLENUNK-NEXT: [[ELEM:%.*]] = load i64, ptr [[AADDR]], align 8
; VLENUNK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1		; VLENUNK-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
; VLENUNK-NEXT: [[SUM_NEXT]] = add i64 [[SUM]], [[ELEM]]		; VLENUNK-NEXT: [[SUM_NEXT]] = add i64 [[SUM]], [[ELEM]]
; VLENUNK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024		; VLENUNK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024
; VLENUNK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]		; VLENUNK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]
; VLENUNK: for.end:		; VLENUNK: for.end:
; VLENUNK-NEXT: [[SUM_NEXT_LCSSA:%.*]] = phi i64 [ [[SUM_NEXT]], [[FOR_BODY]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]		; VLENUNK-NEXT: [[SUM_NEXT_LCSSA:%.*]] = phi i64 [ [[SUM_NEXT]], [[FOR_BODY]] ], [ [[TMP13]], [[MIDDLE_BLOCK]] ]
; VLENUNK-NEXT: ret i64 [[SUM_NEXT_LCSSA]]		; VLENUNK-NEXT: ret i64 [[SUM_NEXT_LCSSA]]
;		;
; VLEN128-LABEL: @indexed_load(		; VLEN128-LABEL: @indexed_load(
; VLEN128-NEXT: entry:		; VLEN128-NEXT: entry:
; VLEN128-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()		; VLEN128-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; VLEN128-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2		; VLEN128-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2
; VLEN128-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP1]]		; VLEN128-NEXT: [[TMP2:%.*]] = call i64 @llvm.umax.i64(i64 4, i64 [[TMP1]])
		; VLEN128-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 1024, [[TMP2]]
; VLEN128-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]		; VLEN128-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; VLEN128: vector.ph:		; VLEN128: vector.ph:
; VLEN128-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()		; VLEN128-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
; VLEN128-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 2		; VLEN128-NEXT: [[TMP4:%.*]] = mul i64 [[TMP3]], 2
; VLEN128-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP3]]		; VLEN128-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP4]]
; VLEN128-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]		; VLEN128-NEXT: [[N_VEC:%.*]] = sub i64 1024, [[N_MOD_VF]]
; VLEN128-NEXT: br label [[VECTOR_BODY:%.*]]		; VLEN128-NEXT: br label [[VECTOR_BODY:%.*]]
; VLEN128: vector.body:		; VLEN128: vector.body:
; VLEN128-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; VLEN128-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; VLEN128-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]		; VLEN128-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
; VLEN128-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0		; VLEN128-NEXT: [[TMP5:%.*]] = add i64 [[INDEX]], 0
; VLEN128-NEXT: [[TMP5:%.*]] = getelementptr inbounds i64, ptr [[B:%.*]], i64 [[TMP4]]		; VLEN128-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr [[B:%.*]], i64 [[TMP5]]
; VLEN128-NEXT: [[TMP6:%.*]] = getelementptr inbounds i64, ptr [[TMP5]], i32 0		; VLEN128-NEXT: [[TMP7:%.*]] = getelementptr inbounds i64, ptr [[TMP6]], i32 0
; VLEN128-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, ptr [[TMP6]], align 8		; VLEN128-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i64>, ptr [[TMP7]], align 8
; VLEN128-NEXT: [[TMP7:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], <vscale x 2 x i64> [[WIDE_LOAD]]		; VLEN128-NEXT: [[TMP8:%.*]] = getelementptr inbounds i64, ptr [[A:%.*]], <vscale x 2 x i64> [[WIDE_LOAD]]
; VLEN128-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 2 x i64> @llvm.masked.gather.nxv2i64.nxv2p0(<vscale x 2 x ptr> [[TMP7]], i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i64> poison)		; VLEN128-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 2 x i64> @llvm.masked.gather.nxv2i64.nxv2p0(<vscale x 2 x ptr> [[TMP8]], i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i64 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x i64> poison)
; VLEN128-NEXT: [[TMP8]] = add <vscale x 2 x i64> [[VEC_PHI]], [[WIDE_MASKED_GATHER]]		; VLEN128-NEXT: [[TMP9]] = add <vscale x 2 x i64> [[VEC_PHI]], [[WIDE_MASKED_GATHER]]
; VLEN128-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()		; VLEN128-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
; VLEN128-NEXT: [[TMP10:%.*]] = mul i64 [[TMP9]], 2		; VLEN128-NEXT: [[TMP11:%.*]] = mul i64 [[TMP10]], 2
; VLEN128-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP10]]		; VLEN128-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP11]]
; VLEN128-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]		; VLEN128-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
; VLEN128-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]		; VLEN128-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
; VLEN128: middle.block:		; VLEN128: middle.block:
; VLEN128-NEXT: [[TMP12:%.*]] = call i64 @llvm.vector.reduce.add.nxv2i64(<vscale x 2 x i64> [[TMP8]])		; VLEN128-NEXT: [[TMP13:%.*]] = call i64 @llvm.vector.reduce.add.nxv2i64(<vscale x 2 x i64> [[TMP9]])
; VLEN128-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]		; VLEN128-NEXT: [[CMP_N:%.*]] = icmp eq i64 1024, [[N_VEC]]
; VLEN128-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]		; VLEN128-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
; VLEN128: scalar.ph:		; VLEN128: scalar.ph:
; VLEN128-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]		; VLEN128-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
; VLEN128-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]		; VLEN128-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[TMP13]], [[MIDDLE_BLOCK]] ]
; VLEN128-NEXT: br label [[FOR_BODY:%.*]]		; VLEN128-NEXT: br label [[FOR_BODY:%.*]]
; VLEN128: for.body:		; VLEN128: for.body:
; VLEN128-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]		; VLEN128-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV_NEXT:%.*]], [[FOR_BODY]] ]
; VLEN128-NEXT: [[SUM:%.*]] = phi i64 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[SUM_NEXT:%.*]], [[FOR_BODY]] ]		; VLEN128-NEXT: [[SUM:%.*]] = phi i64 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[SUM_NEXT:%.*]], [[FOR_BODY]] ]
; VLEN128-NEXT: [[BADDR:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[IV]]		; VLEN128-NEXT: [[BADDR:%.*]] = getelementptr inbounds i64, ptr [[B]], i64 [[IV]]
; VLEN128-NEXT: [[AIDX:%.*]] = load i64, ptr [[BADDR]], align 8		; VLEN128-NEXT: [[AIDX:%.*]] = load i64, ptr [[BADDR]], align 8
; VLEN128-NEXT: [[AADDR:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[AIDX]]		; VLEN128-NEXT: [[AADDR:%.*]] = getelementptr inbounds i64, ptr [[A]], i64 [[AIDX]]
; VLEN128-NEXT: [[ELEM:%.*]] = load i64, ptr [[AADDR]], align 8		; VLEN128-NEXT: [[ELEM:%.*]] = load i64, ptr [[AADDR]], align 8
; VLEN128-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1		; VLEN128-NEXT: [[IV_NEXT]] = add nuw nsw i64 [[IV]], 1
; VLEN128-NEXT: [[SUM_NEXT]] = add i64 [[SUM]], [[ELEM]]		; VLEN128-NEXT: [[SUM_NEXT]] = add i64 [[SUM]], [[ELEM]]
; VLEN128-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024		; VLEN128-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[IV_NEXT]], 1024
; VLEN128-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]		; VLEN128-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_END]], label [[FOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]
; VLEN128: for.end:		; VLEN128: for.end:
; VLEN128-NEXT: [[SUM_NEXT_LCSSA:%.*]] = phi i64 [ [[SUM_NEXT]], [[FOR_BODY]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]		; VLEN128-NEXT: [[SUM_NEXT_LCSSA:%.*]] = phi i64 [ [[SUM_NEXT]], [[FOR_BODY]] ], [ [[TMP13]], [[MIDDLE_BLOCK]] ]
; VLEN128-NEXT: ret i64 [[SUM_NEXT_LCSSA]]		; VLEN128-NEXT: ret i64 [[SUM_NEXT_LCSSA]]
;		;
entry:		entry:
br label %for.body		br label %for.body

for.body:		for.body:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %for.body ]
%sum = phi i64 [0, %entry], [%sum.next, %for.body]		%sum = phi i64 [0, %entry], [%sum.next, %for.body]
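In the `indexed_load` checks of scalable-basics.ll above, the operand of `MIN_ITERS_CHECK` changes from `vscale * 2` to `umax(4, vscale * 2)`. One way to read this is that the vector loop is now entered only when the trip count reaches both the vector step and a minimum profitable trip count. The following is a minimal C++ sketch of that guard, assuming the constant 4 is a minimum profitable trip count chosen by the cost model (the diff shows only the constant, not how it was derived). `takeScalarPath` and its parameter names are hypothetical and are not part of LoopVectorize.cpp.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper mirroring the updated MIN_ITERS_CHECK:
//   icmp ult i64 TripCount, umax(MinProfitableTC, vscale * KnownMinLanes)
// Returns true when execution should branch to scalar.ph.
bool takeScalarPath(std::uint64_t tripCount, std::uint64_t vscale,
                    std::uint64_t minProfitableTC = 4,  // constant seen in the diff
                    std::uint64_t knownMinLanes = 2) {  // <vscale x 2 x i64>
  // Elements processed per vector iteration.
  const std::uint64_t step = vscale * knownMinLanes;
  // The guard threshold is the larger of the step and the profitability bound.
  const std::uint64_t threshold = std::max(minProfitableTC, step);
  return tripCount < threshold;
}
```

With `vscale == 1`, a trip count of 3 would pass the old guard (3 is not less than 2) but fails the new one (3 is less than 4), so such a loop would now stay scalar. The constant trip count of 1024 in the test takes the vector path under both guards for this `vscale`.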

llvm/test/Transforms/LoopVectorize/X86/ctpop-small-trip-count.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 3
				; RUN: opt -S -passes=loop-vectorize -mcpu=znver2 -vectorizer-ignore-out-of-loop-reduction-cost=0 -force-vector-interleave=1 < %s | FileCheck %s
				target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				; PR 57476
				; Hard-coded value of trip count being 2.
				; FIXME: We still vectorize it, since reduction cost is 1 (for VF=2).
				define i64 @test_trip_count_2(ptr %arr) {
				; CHECK-LABEL: define i64 @test_trip_count_2(
				; CHECK-SAME: ptr [[ARR:%.*]]) #[[ATTR0:[0-9]+]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x i64> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
				; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i64, ptr [[ARR]], i64 [[TMP0]]
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i64, ptr [[TMP1]], i32 0
				; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x i64>, ptr [[TMP2]], align 8
				; CHECK-NEXT: [[TMP3:%.*]] = call <2 x i64> @llvm.ctpop.v2i64(<2 x i64> [[WIDE_LOAD]])
				; CHECK-NEXT: [[TMP4]] = add <2 x i64> [[VEC_PHI]], [[TMP3]]
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], 2
				; CHECK-NEXT: br i1 true, label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vector.reduce.add.v2i64(<2 x i64> [[TMP4]])
				; CHECK-NEXT: br i1 true, label [[EXIT:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 2, [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[TMP5]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: br label [[LOOP:%.*]]
				; CHECK: loop:
				; CHECK-NEXT: [[ACCUM:%.*]] = phi i64 [ [[ACCUM_NEXT:%.*]], [[LOOP]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], [[LOOP]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[IV_NEXT]] = add nuw i64 [[IV]], 1
				; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds i64, ptr [[ARR]], i64 [[IV]]
				; CHECK-NEXT: [[VALUE:%.*]] = load i64, ptr [[GEP]], align 8
				; CHECK-NEXT: [[CTPOP:%.*]] = tail call i64 @llvm.ctpop.i64(i64 [[VALUE]])
				; CHECK-NEXT: [[ACCUM_NEXT]] = add i64 [[ACCUM]], [[CTPOP]]
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], 2
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[EXIT]], label [[LOOP]], !llvm.loop [[LOOP3:![0-9]+]]
				; CHECK: exit:
				; CHECK-NEXT: [[LCSSA:%.*]] = phi i64 [ [[ACCUM_NEXT]], [[LOOP]] ], [ [[TMP5]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: ret i64 [[LCSSA]]
				;
				entry:
				br label %loop

				loop:
				%accum = phi i64 [ %accum.next, %loop ], [ 0, %entry ]
				%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
				%iv.next = add nuw i64 %iv, 1
				%gep = getelementptr inbounds i64, ptr %arr, i64 %iv
				%value = load i64, ptr %gep, align 8
				%ctpop = tail call i64 @llvm.ctpop.i64(i64 %value)
				%accum.next = add i64 %accum, %ctpop
				%exitcond = icmp eq i64 %iv.next, 2
				br i1 %exitcond, label %exit, label %loop

				exit:
				%lcssa = phi i64 [ %accum.next, %loop ]
				ret i64 %lcssa
				}

				; Same loop as above, with profile showing trip count of 2.
				; We do not vectorize this when we consider cost of reductions since reduction
				; cost along with vectorcost (with VF=4) is higher than scalar cost.
				define i64 @test_trip_count_prof_2(ptr %arr, i64 %n) {
				; CHECK-LABEL: define i64 @test_trip_count_prof_2(
				; CHECK-SAME: ptr [[ARR:%.*]], i64 [[N:%.*]]) #[[ATTR0]] {
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br label [[LOOP:%.*]]
				; CHECK: loop:
				; CHECK-NEXT: [[ACCUM:%.*]] = phi i64 [ [[ACCUM_NEXT:%.*]], [[LOOP]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[IV_NEXT:%.*]], [[LOOP]] ], [ 0, [[ENTRY]] ]
				; CHECK-NEXT: [[IV_NEXT]] = add nuw i64 [[IV]], 1
				; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds i64, ptr [[ARR]], i64 [[IV]]
				; CHECK-NEXT: [[VALUE:%.*]] = load i64, ptr [[GEP]], align 8
				; CHECK-NEXT: [[CTPOP:%.*]] = tail call i64 @llvm.ctpop.i64(i64 [[VALUE]])
				; CHECK-NEXT: [[ACCUM_NEXT]] = add i64 [[ACCUM]], [[CTPOP]]
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[EXIT:%.*]], label [[LOOP]], !prof [[PROF4:![0-9]+]]
				; CHECK: exit:
				; CHECK-NEXT: [[LCSSA:%.*]] = phi i64 [ [[ACCUM_NEXT]], [[LOOP]] ]
				; CHECK-NEXT: ret i64 [[LCSSA]]
				;
				entry:
				br label %loop

				loop:
				%accum = phi i64 [ %accum.next, %loop ], [ 0, %entry ]
				%iv = phi i64 [ %iv.next, %loop ], [ 0, %entry ]
				%iv.next = add nuw i64 %iv, 1
				%gep = getelementptr inbounds i64, ptr %arr, i64 %iv
				%value = load i64, ptr %gep, align 8
				%ctpop = tail call i64 @llvm.ctpop.i64(i64 %value)
				%accum.next = add i64 %accum, %ctpop
				%exitcond = icmp eq i64 %iv.next, %n
				br i1 %exitcond, label %exit, label %loop, !prof !2

				exit:
				%lcssa = phi i64 [ %accum.next, %loop ]
				ret i64 %lcssa
				}
				declare i64 @llvm.ctpop.i64(i64)

				!2 = !{!"branch_weights", i32 1, i32 2}

llvm/test/Transforms/LoopVectorize/X86/reduction-small-trip-count.ll

This file was added.

				; RUN: opt -S -passes=loop-vectorize,dce -mcpu=skylake -vectorizer-ignore-out-of-loop-reduction-cost=0 -force-vector-interleave=1 < %s | FileCheck %s
				target triple = "x86_64-unknown-linux-gnu"

				declare float @llvm.maximum.f32(float, float)
				declare float @llvm.fabs.f32(float)

				; This is a small trip count loop. The cost of the out-of-loop reduction is
				; significant in this case when we only perform a single vector iteration.
				; However, loop vectorizer does not consider out of loop reduction costs.

				; CHECK-LABEL: fmaximum_intrinsic
				; CHECK-NOT: llvm.vector.reduce.fmaximum
				define float @fmaximum_intrinsic(ptr nocapture readonly %x, ptr nocapture readonly %y, i32 %n, i32 %tc) {
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%i.012 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%s.011 = phi float [ 0.000000e+00, %entry ], [ %max, %for.body ]
				%arrayidx = getelementptr inbounds float, ptr %x, i32 %i.012
				%x_f = load float, ptr %arrayidx, align 4
				%arrayidxy = getelementptr inbounds float, ptr %y, i32 %i.012
				%y_f = load float, ptr %arrayidxy, align 4
				%sub = fsub float %x_f, %y_f
				%fabs = call float @llvm.fabs.f32(float %sub)
				%max = tail call float @llvm.maximum.f32(float %s.011, float %fabs)
				%inc = add nuw nsw i32 %i.012, 1
				%exitcond = icmp ult i32 %inc, 3
				br i1 %exitcond, label %for.body, label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body
				ret float %max
				}

				; trip count of 6 is still considered non-profitable for reducing adds (min trip
				; count required is 8).
				; CHECK-LABEL: reduction_sum
				; CHECK-NOT: llvm.vector.reduce.add
				define i32 @reduction_sum(i32 %n, ptr noalias nocapture %A, ptr noalias nocapture %B) nounwind uwtable readonly noinline ssp {
				%1 = icmp sgt i32 %n, 0
				br i1 %1, label %.lr.ph, label %._crit_edge

				.lr.ph: ; preds = %0, %.lr.ph
				%indvars.iv = phi i64 [ %indvars.iv.next, %.lr.ph ], [ 0, %0 ]
				%sum.02 = phi i32 [ %9, %.lr.ph ], [ 0, %0 ]
				%2 = getelementptr inbounds i32, ptr %A, i64 %indvars.iv
				%3 = load i32, ptr %2, align 4
				%4 = getelementptr inbounds i32, ptr %B, i64 %indvars.iv
				%5 = load i32, ptr %4, align 4
				%6 = trunc i64 %indvars.iv to i32
				%7 = add i32 %sum.02, %6
				%8 = add i32 %7, %3
				%9 = add i32 %8, %5
				%indvars.iv.next = add i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %n
				br i1 %exitcond, label %._crit_edge, label %.lr.ph, !prof !1

				._crit_edge: ; preds = %.lr.ph, %0
				%sum.0.lcssa = phi i32 [ 0, %0 ], [ %9, %.lr.ph ]
				ret i32 %sum.0.lcssa
				}

				; CHECK-LABEL: reduction_mix
				; CHECK-LABEL: middle.block:
				; CHECK-NEXT: vector.reduce.add
				; CHECK-NEXT: br
				define i32 @reduction_mix(i32 %n, ptr noalias nocapture %A, ptr noalias nocapture %B) nounwind uwtable readonly noinline ssp {
				%1 = icmp sgt i32 %n, 0
				br i1 %1, label %.lr.ph, label %._crit_edge

				.lr.ph: ; preds = %0, %.lr.ph
				%indvars.iv = phi i64 [ %indvars.iv.next, %.lr.ph ], [ 0, %0 ]
				%sum.02 = phi i32 [ %9, %.lr.ph ], [ 0, %0 ]
				%2 = getelementptr inbounds i32, ptr %A, i64 %indvars.iv
				%3 = load i32, ptr %2, align 4
				%4 = getelementptr inbounds i32, ptr %B, i64 %indvars.iv
				%5 = load i32, ptr %4, align 4
				%6 = mul nsw i32 %5, %3
				%7 = trunc i64 %indvars.iv to i32
				%8 = add i32 %sum.02, %7
				%9 = add i32 %8, %6
				%indvars.iv.next = add i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32
				%exitcond = icmp eq i32 %lftr.wideiv, %n
				br i1 %exitcond, label %._crit_edge, label %.lr.ph, !prof !2

				._crit_edge: ; preds = %.lr.ph, %0
				%sum.0.lcssa = phi i32 [ 0, %0 ], [ %9, %.lr.ph ]
				ret i32 %sum.0.lcssa
				}

				!1 = !{!"branch_weights", i32 1, i32 5}
				!2 = !{!"branch_weights", i32 1, i32 7}
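
The `!prof` metadata in reduction-small-trip-count.ll is what gives the vectorizer an estimated trip count for `reduction_sum` and `reduction_mix`. As a rough illustration, the C++ sketch below derives an estimate from the two latch branch weights, assuming the estimate is one plus the backedge weight divided by the exit weight, rounded to nearest. That assumption is consistent with the test comment (weights 1 and 5 give the trip count of 6 mentioned for `reduction_sum`, and weights 1 and 7 give 8, the stated minimum), but `estimateTripCount` is a hypothetical stand-in and not LLVM's API.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a profile-based trip count estimate.
// exitWeight:     weight of the latch edge that leaves the loop
// backedgeWeight: weight of the latch edge that returns to the header
std::optional<std::uint64_t> estimateTripCount(std::uint64_t exitWeight,
                                               std::uint64_t backedgeWeight) {
  // Without an exit weight there is nothing to divide by, so give no estimate.
  if (exitWeight == 0)
    return std::nullopt;
  // Estimated number of times the backedge is taken, rounded to nearest.
  const std::uint64_t backedgeTaken =
      (backedgeWeight + exitWeight / 2) / exitWeight;
  // The loop body runs once more than the backedge is taken.
  return backedgeTaken + 1;
}
```

Under this reading, `reduction_sum` (estimate 6) falls below the minimum of 8 and is expected to stay scalar, while `reduction_mix` (estimate 8) reaches it, matching the `CHECK-NOT` and `CHECK-NEXT: vector.reduce.add` lines respectively.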