Diff 318293

llvm/include/llvm/Analysis/TargetTransformInfo.h

Show First 20 Lines • Show All 1,176 Lines • ▼ Show 20 Lines	public:
int getArithmeticReductionCost(		int getArithmeticReductionCost(
unsigned Opcode, VectorType *Ty, bool IsPairwiseForm,		unsigned Opcode, VectorType *Ty, bool IsPairwiseForm,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;

int getMinMaxReductionCost(		int getMinMaxReductionCost(
VectorType *Ty, VectorType *CondTy, bool IsPairwiseForm, bool IsUnsigned,		VectorType *Ty, VectorType *CondTy, bool IsPairwiseForm, bool IsUnsigned,
TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;

		/// Calculate the cost of an extended reduction pattern, similar to
		/// getArithmeticReductionCost of an Add reduction with an extension and
		/// optional multiply. This is the cost of:
		/// ResTy vecreduce.add(ext(Ty A)), or if IsMLA flag is set then:
		/// ResTy vecreduce.add(mul(ext(Ty A), ext(Ty B))). The reduction happens
		/// on a VectorType with ResTy elements and Ty lanes.
		InstructionCost getExtendedAddReductionCost(
		SjoerdMeijer: I am wondering if `IsMLA` is a bit too narrow as an interface, perhaps even unclear. If this is similar to `getArithmeticReductionCost` as mentioned in the comment, which takes an `opcode`, should this also take an `opcode` instead of `IsMLA`? The advantage is that we could describe costs for different types of reductions, or is this not useful/necessary?
		SjoerdMeijer: Or now I am thinking if we actually need a new interface at all. Could this just be part of `getArithmeticReductionCost`, maybe with a flag if operands are extended?
		dmgreen (author): I think the only extended patterns that are interesting are adds. Muls maybe could come up but they are very rare in comparison. As you might imagine, adding up numbers is pretty common in comparison! And/Or/Xor don't really make sense to be extended, as well as min/max I think. Same for floats, where I don't know of any systems that convert the float type and reduce at the same time. In MVE we only have VADDV/VMLAV instructions that need the extension, not any others. I've tried to update the comment to explain better that this is an extended add pattern. (And we can always extend it in the future if needed.) The separate interface was suggested in https://reviews.llvm.org/D93476#2506583, which I like as it keeps the normal reductions in one place, and the extended pattern is matched here.
		SjoerdMeijer: Okay, thanks, that makes sense. One nit/question/suggestion then: we are talking very specifically about estimating costs for `vecreduce.add`, possibly with a mul, so I think `getExtendedAddReductionCost` would describe this better. (I am now looking at the rest of this change.)
		dmgreen (author): Sounds good to me. I'll make that change. Thanks for the suggestion.
		fhahn: Can this use `InstructionCost`?
		bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const;

/// \returns The cost of Intrinsic instructions. Analyses the real arguments.		/// \returns The cost of Intrinsic instructions. Analyses the real arguments.
/// Three cases are handled: 1. scalar instruction 2. vector instruction		/// Three cases are handled: 1. scalar instruction 2. vector instruction
/// 3. scalar instruction which is to be vectorized.		/// 3. scalar instruction which is to be vectorized.
int getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		int getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind) const;		TTI::TargetCostKind CostKind) const;

/// \returns The cost of Call instructions.		/// \returns The cost of Call instructions.
int getCallInstrCost(Function *F, Type *RetTy, ArrayRef<Type *> Tys,		int getCallInstrCost(Function *F, Type *RetTy, ArrayRef<Type *> Tys,
▲ Show 20 Lines • Show All 390 Lines • ▼ Show 20 Lines	virtual int getInterleavedMemoryOpCost(
Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,		Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
bool UseMaskForCond = false, bool UseMaskForGaps = false) = 0;		bool UseMaskForCond = false, bool UseMaskForGaps = false) = 0;
virtual int getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,		virtual int getArithmeticReductionCost(unsigned Opcode, VectorType *Ty,
bool IsPairwiseForm,		bool IsPairwiseForm,
TTI::TargetCostKind CostKind) = 0;		TTI::TargetCostKind CostKind) = 0;
virtual int getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,		virtual int getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,
bool IsPairwiseForm, bool IsUnsigned,		bool IsPairwiseForm, bool IsUnsigned,
TTI::TargetCostKind CostKind) = 0;		TTI::TargetCostKind CostKind) = 0;
		virtual InstructionCost getExtendedAddReductionCost(
		bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) = 0;
virtual int getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		virtual int getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind) = 0;		TTI::TargetCostKind CostKind) = 0;
virtual int getCallInstrCost(Function *F, Type *RetTy,		virtual int getCallInstrCost(Function *F, Type *RetTy,
ArrayRef<Type *> Tys,		ArrayRef<Type *> Tys,
TTI::TargetCostKind CostKind) = 0;		TTI::TargetCostKind CostKind) = 0;
virtual unsigned getNumberOfParts(Type *Tp) = 0;		virtual unsigned getNumberOfParts(Type *Tp) = 0;
virtual int getAddressComputationCost(Type *Ty, ScalarEvolution *SE,		virtual int getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
const SCEV *Ptr) = 0;		const SCEV *Ptr) = 0;
▲ Show 20 Lines • Show All 461 Lines • ▼ Show 20 Lines	return Impl.getArithmeticReductionCost(Opcode, Ty, IsPairwiseForm,
CostKind);		CostKind);
}		}
int getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,		int getMinMaxReductionCost(VectorType *Ty, VectorType *CondTy,
bool IsPairwiseForm, bool IsUnsigned,		bool IsPairwiseForm, bool IsUnsigned,
TTI::TargetCostKind CostKind) override {		TTI::TargetCostKind CostKind) override {
return Impl.getMinMaxReductionCost(Ty, CondTy, IsPairwiseForm, IsUnsigned,		return Impl.getMinMaxReductionCost(Ty, CondTy, IsPairwiseForm, IsUnsigned,
CostKind);		CostKind);
}		}
		InstructionCost getExtendedAddReductionCost(
		bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) override {
		return Impl.getExtendedAddReductionCost(IsMLA, IsUnsigned, ResTy, Ty,
		CostKind);
		}
int getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		int getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind) override {		TTI::TargetCostKind CostKind) override {
return Impl.getIntrinsicInstrCost(ICA, CostKind);		return Impl.getIntrinsicInstrCost(ICA, CostKind);
}		}
int getCallInstrCost(Function *F, Type *RetTy,		int getCallInstrCost(Function *F, Type *RetTy,
ArrayRef<Type *> Tys,		ArrayRef<Type *> Tys,
TTI::TargetCostKind CostKind) override {		TTI::TargetCostKind CostKind) override {
return Impl.getCallInstrCost(F, RetTy, Tys, CostKind);		return Impl.getCallInstrCost(F, RetTy, Tys, CostKind);
▲ Show 20 Lines • Show All 214 Lines • Show Last 20 Lines
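
Note for readers of this diff: to make the two patterns named in the `getExtendedAddReductionCost` doc comment concrete, here is a minimal sketch of scalar loops that would typically vectorize to them. The function names `sumExtended` and `dotExtended` are hypothetical and not part of this patch, and whether a target actually folds the extension into a single reduction instruction is an assumption that depends on the backend.

```cpp
#include <cstddef>
#include <cstdint>

// Source form of vecreduce.add(sext(A)): i8 elements are widened to i32
// before being accumulated (the IsMLA == false case).
int32_t sumExtended(const int8_t *A, std::size_t N) {
  int32_t Sum = 0;
  for (std::size_t I = 0; I < N; ++I)
    Sum += static_cast<int32_t>(A[I]);
  return Sum;
}

// Source form of vecreduce.add(mul(sext(A), sext(B))): both operands are
// widened, multiplied, then accumulated (the IsMLA == true case).
int32_t dotExtended(const int8_t *A, const int8_t *B, std::size_t N) {
  int32_t Sum = 0;
  for (std::size_t I = 0; I < N; ++I)
    Sum += static_cast<int32_t>(A[I]) * static_cast<int32_t>(B[I]);
  return Sum;
}
```

Using unsigned element types with a zero extension would correspond to `IsUnsigned == true`.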

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

Show First 20 Lines • Show All 594 Lines • ▼ Show 20 Lines	unsigned getArithmeticReductionCost(unsigned, VectorType *, bool,
return 1;		return 1;
}		}

unsigned getMinMaxReductionCost(VectorType *, VectorType *, bool, bool,		unsigned getMinMaxReductionCost(VectorType *, VectorType *, bool, bool,
TTI::TargetCostKind) const {		TTI::TargetCostKind) const {
return 1;		return 1;
}		}

		InstructionCost getExtendedAddReductionCost(
		bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
		TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput) const {
		return 1;
		}

unsigned getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) const {		unsigned getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) const {
return 0;		return 0;
}		}

bool getTgtMemIntrinsic(IntrinsicInst *Inst, MemIntrinsicInfo &Info) const {		bool getTgtMemIntrinsic(IntrinsicInst *Inst, MemIntrinsicInfo &Info) const {
return false;		return false;
}		}

▲ Show 20 Lines • Show All 497 Lines • Show Last 20 Lines

llvm/include/llvm/CodeGen/BasicTTIImpl.h

Show First 20 Lines • Show All 2,008 Lines • ▼ Show 20 Lines	MinMaxCost +=
thisT()->getCmpSelInstrCost(Instruction::Select, Ty, CondTy,		thisT()->getCmpSelInstrCost(Instruction::Select, Ty, CondTy,
CmpInst::BAD_ICMP_PREDICATE, CostKind));		CmpInst::BAD_ICMP_PREDICATE, CostKind));
// The last min/max should be in vector registers and we counted it above.		// The last min/max should be in vector registers and we counted it above.
// So just need a single extractelement.		// So just need a single extractelement.
return ShuffleCost + MinMaxCost +		return ShuffleCost + MinMaxCost +
thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, 0);		thisT()->getVectorInstrCost(Instruction::ExtractElement, Ty, 0);
}		}

		InstructionCost getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,
		Type *ResTy, VectorType *Ty,
		TTI::TargetCostKind CostKind) {
		// Without any native support, this is equivalent to the cost of
		// vecreduce.add(ext) or if IsMLA vecreduce.add(mul(ext, ext))
		VectorType *ExtTy = VectorType::get(ResTy, Ty);
		unsigned RedCost = thisT()->getArithmeticReductionCost(
		Instruction::Add, ExtTy, false, CostKind);
		unsigned MulCost = 0;
		unsigned ExtCost = thisT()->getCastInstrCost(
		IsUnsigned ? Instruction::ZExt : Instruction::SExt, ExtTy, Ty,
		TTI::CastContextHint::None, CostKind);
		if (IsMLA) {
		MulCost =
		thisT()->getArithmeticInstrCost(Instruction::Mul, ExtTy, CostKind);
		ExtCost *= 2;
		}

		return RedCost + MulCost + ExtCost;
		}

unsigned getVectorSplitCost() { return 1; }		unsigned getVectorSplitCost() { return 1; }

/// @}		/// @}
};		};

/// Concrete BasicTTIImpl that can be used if no further customization		/// Concrete BasicTTIImpl that can be used if no further customization
/// is needed.		/// is needed.
class BasicTTIImpl : public BasicTTIImplBase<BasicTTIImpl> {		class BasicTTIImpl : public BasicTTIImplBase<BasicTTIImpl> {
Show All 17 Lines
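
Note for readers of this diff: the fallback in `BasicTTIImplBase::getExtendedAddReductionCost` above is a plain sum of component costs. The sketch below restates only that arithmetic with hypothetical unit costs (`UnitCosts` and `fallbackExtendedAddReductionCost` are illustrative names, not LLVM APIs); in the patch the three inputs come from the target's reduction, multiply, and cast cost hooks.

```cpp
// Hypothetical per-operation costs standing in for the values returned by
// getArithmeticReductionCost, getArithmeticInstrCost and getCastInstrCost.
struct UnitCosts {
  unsigned Reduction; // add reduction on the extended vector type
  unsigned Multiply;  // vector multiply on the extended vector type
  unsigned Extend;    // one sext/zext from the narrow to the extended type
};

// Without native support, the pattern costs as much as its parts:
// one reduction, plus (for the IsMLA form) one multiply and a second extend.
unsigned fallbackExtendedAddReductionCost(bool IsMLA, const UnitCosts &C) {
  unsigned MulCost = 0;
  unsigned ExtCost = C.Extend;
  if (IsMLA) {
    MulCost = C.Multiply;
    ExtCost *= 2; // both multiply operands are extended
  }
  return C.Reduction + MulCost + ExtCost;
}
```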

llvm/lib/Analysis/TargetTransformInfo.cpp

Show First 20 Lines • Show All 937 Lines • ▼ Show 20 Lines	int TargetTransformInfo::getMinMaxReductionCost(
TTI::TargetCostKind CostKind) const {		TTI::TargetCostKind CostKind) const {
int Cost =		int Cost =
TTIImpl->getMinMaxReductionCost(Ty, CondTy, IsPairwiseForm, IsUnsigned,		TTIImpl->getMinMaxReductionCost(Ty, CondTy, IsPairwiseForm, IsUnsigned,
CostKind);		CostKind);
assert(Cost >= 0 && "TTI should not produce negative costs!");		assert(Cost >= 0 && "TTI should not produce negative costs!");
return Cost;		return Cost;
}		}

		InstructionCost TargetTransformInfo::getExtendedAddReductionCost(
		bool IsMLA, bool IsUnsigned, Type *ResTy, VectorType *Ty,
		TTI::TargetCostKind CostKind) const {
		return TTIImpl->getExtendedAddReductionCost(IsMLA, IsUnsigned, ResTy, Ty,
		CostKind);
		}

unsigned		unsigned
TargetTransformInfo::getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) const {		TargetTransformInfo::getCostOfKeepingLiveOverCall(ArrayRef<Type *> Tys) const {
return TTIImpl->getCostOfKeepingLiveOverCall(Tys);		return TTIImpl->getCostOfKeepingLiveOverCall(Tys);
}		}

bool TargetTransformInfo::getTgtMemIntrinsic(IntrinsicInst *Inst,		bool TargetTransformInfo::getTgtMemIntrinsic(IntrinsicInst *Inst,
MemIntrinsicInfo &Info) const {		MemIntrinsicInfo &Info) const {
return TTIImpl->getTgtMemIntrinsic(Inst, Info);		return TTIImpl->getTgtMemIntrinsic(Inst, Info);
▲ Show 20 Lines • Show All 508 Lines • Show Last 20 Lines

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

Show First 20 Lines • Show All 241 Lines • ▼ Show 20 Lines	public:
unsigned getGatherScatterOpCost(unsigned Opcode, Type *DataTy,		unsigned getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
const Value *Ptr, bool VariableMask,		const Value *Ptr, bool VariableMask,
Align Alignment, TTI::TargetCostKind CostKind,		Align Alignment, TTI::TargetCostKind CostKind,
const Instruction *I = nullptr);		const Instruction *I = nullptr);

int getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,		int getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
bool IsPairwiseForm,		bool IsPairwiseForm,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);
		InstructionCost getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,
		Type *ResTy, VectorType *ValTy,
		TTI::TargetCostKind CostKind);

int getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		int getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind);		TTI::TargetCostKind CostKind);

bool maybeLoweredToCall(Instruction &I);		bool maybeLoweredToCall(Instruction &I);
bool isLoweredToCall(const Function *F);		bool isLoweredToCall(const Function *F);
bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,		bool isHardwareLoopProfitable(Loop *L, ScalarEvolution &SE,
AssumptionCache &AC,		AssumptionCache &AC,
Show All 30 Lines

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

Show First 20 Lines • Show All 1,505 Lines • ▼ Show 20 Lines	int ARMTTIImpl::getArithmeticReductionCost(unsigned Opcode, VectorType *ValTy,
};		};
if (const auto *Entry = CostTableLookup(CostTblAdd, ISD, LT.second))		if (const auto *Entry = CostTableLookup(CostTblAdd, ISD, LT.second))
return Entry->Cost * ST->getMVEVectorCostFactor() * LT.first;		return Entry->Cost * ST->getMVEVectorCostFactor() * LT.first;

return BaseT::getArithmeticReductionCost(Opcode, ValTy, IsPairwiseForm,		return BaseT::getArithmeticReductionCost(Opcode, ValTy, IsPairwiseForm,
CostKind);		CostKind);
}		}

		InstructionCost
		ARMTTIImpl::getExtendedAddReductionCost(bool IsMLA, bool IsUnsigned,
		Type *ResTy, VectorType *ValTy,
		TTI::TargetCostKind CostKind) {
		EVT ValVT = TLI->getValueType(DL, ValTy);
		EVT ResVT = TLI->getValueType(DL, ResTy);
		if (ST->hasMVEIntegerOps() && ValVT.isSimple() && ResVT.isSimple()) {
		std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, ValTy);
		if ((LT.second == MVT::v16i8 && ResVT.getSizeInBits() <= 32) ||
		(LT.second == MVT::v8i16 &&
		ResVT.getSizeInBits() <= (IsMLA ? 64 : 32)) ||
		(LT.second == MVT::v4i32 && ResVT.getSizeInBits() <= 64))
		return ST->getMVEVectorCostFactor() * LT.first;
		}

		return BaseT::getExtendedAddReductionCost(IsMLA, IsUnsigned, ResTy, ValTy,
		CostKind);
		}

int ARMTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,		int ARMTTIImpl::getIntrinsicInstrCost(const IntrinsicCostAttributes &ICA,
TTI::TargetCostKind CostKind) {		TTI::TargetCostKind CostKind) {
switch (ICA.getID()) {		switch (ICA.getID()) {
case Intrinsic::get_active_lane_mask:		case Intrinsic::get_active_lane_mask:
// Currently we make a somewhat optimistic assumption that		// Currently we make a somewhat optimistic assumption that
// active_lane_mask's are always free. In reality it may be freely folded		// active_lane_mask's are always free. In reality it may be freely folded
// into a tail predicated loop, expanded into a VCPT or expanded into a lot		// into a tail predicated loop, expanded into a VCPT or expanded into a lot
// of add/icmp code. We may need to improve this in the future, but being		// of add/icmp code. We may need to improve this in the future, but being
▲ Show 20 Lines • Show All 555 Lines • ▼ Show 20 Lines
bool ARMTTIImpl::preferInLoopReduction(unsigned Opcode, Type *Ty,		bool ARMTTIImpl::preferInLoopReduction(unsigned Opcode, Type *Ty,
TTI::ReductionFlags Flags) const {		TTI::ReductionFlags Flags) const {
if (!ST->hasMVEIntegerOps())		if (!ST->hasMVEIntegerOps())
return false;		return false;

unsigned ScalarBits = Ty->getScalarSizeInBits();		unsigned ScalarBits = Ty->getScalarSizeInBits();
switch (Opcode) {		switch (Opcode) {
case Instruction::Add:		case Instruction::Add:
return ScalarBits <= 32;		return ScalarBits <= 64;
default:		default:
return false;		return false;
}		}
}		}

bool ARMTTIImpl::preferPredicatedReductionSelect(		bool ARMTTIImpl::preferPredicatedReductionSelect(
unsigned Opcode, Type *Ty, TTI::ReductionFlags Flags) const {		unsigned Opcode, Type *Ty, TTI::ReductionFlags Flags) const {
if (!ST->hasMVEIntegerOps())		if (!ST->hasMVEIntegerOps())
return false;		return false;
return true;		return true;
}		}
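
Note for readers of this diff: the type check in `ARMTTIImpl::getExtendedAddReductionCost` above can be read as a small table of (legalized input type, maximum result width). The sketch below restates only that predicate; `LegalVecTy` and `hasNativeExtendedAddReduction` are hypothetical names, and the `hasMVEIntegerOps()` and `isSimple()` guards from the patch are assumed to have passed already.

```cpp
// Hypothetical stand-in for the legalized MVT of the input vector.
enum class LegalVecTy { V16I8, V8I16, V4I32, Other };

// Mirrors the condition in the patch: which extended add reductions are
// treated as a single cheap operation for a given result width in bits.
bool hasNativeExtendedAddReduction(LegalVecTy VT, unsigned ResBits,
                                   bool IsMLA) {
  switch (VT) {
  case LegalVecTy::V16I8:
    return ResBits <= 32;
  case LegalVecTy::V8I16:
    // Only the multiply-accumulate form is allowed a 64-bit result here.
    return ResBits <= (IsMLA ? 64u : 32u);
  case LegalVecTy::V4I32:
    return ResBits <= 64;
  case LegalVecTy::Other:
    return false;
  }
  return false;
}
```

When the predicate fails, the patch falls through to the `BaseT` implementation.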

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


Show First 20 Lines • Show All 1,649 Lines • ▼ Show 20 Lines	private:
/// width. Vector width of one means scalar.		/// width. Vector width of one means scalar.
VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);		VectorizationCostTy getInstructionCost(Instruction *I, ElementCount VF);

/// The cost-computation logic from getInstructionCost which provides		/// The cost-computation logic from getInstructionCost which provides
/// the vector type as an output parameter.		/// the vector type as an output parameter.
InstructionCost getInstructionCost(Instruction *I, ElementCount VF,		InstructionCost getInstructionCost(Instruction *I, ElementCount VF,
Type *&VectorTy);		Type *&VectorTy);

		/// Return the cost of instructions in an inloop reduction pattern, if I is
		/// part of that pattern.
		InstructionCost getReductionPatternCost(Instruction *I, ElementCount VF,
		Type *VectorTy,
		TTI::TargetCostKind CostKind);

/// Calculate vectorization cost of memory instruction \p I.		/// Calculate vectorization cost of memory instruction \p I.
InstructionCost getMemoryInstructionCost(Instruction *I, ElementCount VF);		InstructionCost getMemoryInstructionCost(Instruction *I, ElementCount VF);

/// The cost computation for scalarized memory instruction.		/// The cost computation for scalarized memory instruction.
InstructionCost getMemInstScalarizationCost(Instruction *I, ElementCount VF);		InstructionCost getMemInstScalarizationCost(Instruction *I, ElementCount VF);

/// The cost computation for interleaving group of memory instructions.		/// The cost computation for interleaving group of memory instructions.
InstructionCost getInterleaveGroupCost(Instruction *I, ElementCount VF);		InstructionCost getInterleaveGroupCost(Instruction *I, ElementCount VF);
▲ Show 20 Lines • Show All 67 Lines • ▼ Show 20 Lines	private:
/// scalarized.		/// scalarized.
DenseMap<ElementCount, SmallPtrSet<Instruction *, 4>> ForcedScalars;		DenseMap<ElementCount, SmallPtrSet<Instruction *, 4>> ForcedScalars;

/// PHINodes of the reductions that should be expanded in-loop along with		/// PHINodes of the reductions that should be expanded in-loop along with
/// their associated chains of reduction operations, in program order from top		/// their associated chains of reduction operations, in program order from top
/// (PHI) to bottom		/// (PHI) to bottom
ReductionChainMap InLoopReductionChains;		ReductionChainMap InLoopReductionChains;

		/// A Map of inloop reduction operations and their immediate chain operand.
		/// FIXME: This can be removed once reductions can be costed correctly in
		/// vplan. This was added to allow quick lookup to the inloop operations,
		/// without having to loop through InLoopReductionChains.
		DenseMap<Instruction *, Instruction *> InLoopReductionImmediateChains;

/// Returns the expected difference in cost from scalarizing the expression		/// Returns the expected difference in cost from scalarizing the expression
/// feeding a predicated instruction \p PredInst. The instructions to		/// feeding a predicated instruction \p PredInst. The instructions to
/// scalarize and their scalar costs are collected in \p ScalarCosts. A		/// scalarize and their scalar costs are collected in \p ScalarCosts. A
/// non-negative return value implies the expression will be scalarized.		/// non-negative return value implies the expression will be scalarized.
/// Currently, only single-use chains are considered for scalarization.		/// Currently, only single-use chains are considered for scalarization.
int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts,		int computePredInstDiscount(Instruction *PredInst, ScalarCostsTy &ScalarCosts,
ElementCount VF);		ElementCount VF);

▲ Show 20 Lines • Show All 4,229 Lines • ▼ Show 20 Lines	for (Instruction &I : BB->instructionsWithoutDebug()) {
continue;		continue;

// Examine PHI nodes that are reduction variables. Update the type to		// Examine PHI nodes that are reduction variables. Update the type to
// account for the recurrence type.		// account for the recurrence type.
if (auto *PN = dyn_cast<PHINode>(&I)) {		if (auto *PN = dyn_cast<PHINode>(&I)) {
if (!Legal->isReductionVariable(PN))		if (!Legal->isReductionVariable(PN))
continue;		continue;
RecurrenceDescriptor RdxDesc = Legal->getReductionVars()[PN];		RecurrenceDescriptor RdxDesc = Legal->getReductionVars()[PN];
		if (PreferInLoopReductions ||
		TTI.preferInLoopReduction(RdxDesc.getOpcode(),
		RdxDesc.getRecurrenceType(),
		TargetTransformInfo::ReductionFlags()))
		continue;
T = RdxDesc.getRecurrenceType();		T = RdxDesc.getRecurrenceType();
}		}

// Examine the stored values.		// Examine the stored values.
if (auto *ST = dyn_cast<StoreInst>(&I))		if (auto *ST = dyn_cast<StoreInst>(&I))
T = ST->getValueOperand()->getType();		T = ST->getValueOperand()->getType();

// Ignore loaded pointer types and stored pointer types that are not		// Ignore loaded pointer types and stored pointer types that are not
▲ Show 20 Lines • Show All 816 Lines • ▼ Show 20 Lines	if (Group->isReverse()) {
assert(!Legal->isMaskRequired(I) &&		assert(!Legal->isMaskRequired(I) &&
"Reverse masked interleaved access not supported.");		"Reverse masked interleaved access not supported.");
Cost += Group->getNumMembers() *		Cost += Group->getNumMembers() *
TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);		TTI.getShuffleCost(TargetTransformInfo::SK_Reverse, VectorTy, 0);
}		}
return Cost;		return Cost;
}		}

		InstructionCost LoopVectorizationCostModel::getReductionPatternCost(
		fhahn: This should probably use `InstructionCost`?
		Instruction *I, ElementCount VF, Type *Ty, TTI::TargetCostKind CostKind) {
		// Early exit for no inloop reductions
		if (InLoopReductionChains.empty() || VF.isScalar() || !isa<VectorType>(Ty))
		return InstructionCost::getInvalid();
		auto *VectorTy = cast<VectorType>(Ty);

		// We are looking for a pattern of, and finding the minimal acceptable cost:
		// reduce(mul(ext(A), ext(B))) or
		// reduce(mul(A, B)) or
		// reduce(ext(A)) or
		// reduce(A).
		// The basic idea is that we walk down the tree to do that, finding the root
		// reduction instruction in InLoopReductionImmediateChains. From there we find
		// the pattern of mul/ext and test the cost of the entire pattern vs the cost
		// of the components. If the reduction cost is lower, then we return it for the
		// reduction instruction and 0 for the other instructions in the pattern. If
		// it is not, we return an invalid cost specifying that the original cost method
		// should be used.
		Instruction *RetI = I;
		if ((RetI->getOpcode() == Instruction::SExt ||
		RetI->getOpcode() == Instruction::ZExt)) {
		SjoerdMeijer: Nit: both cases check for hasOneUser. Can we return early if it doesn't have one user?
		if (!RetI->hasOneUser())
		return InstructionCost::getInvalid();
		RetI = RetI->user_back();
		}
		if (RetI->getOpcode() == Instruction::Mul &&
		RetI->user_back()->getOpcode() == Instruction::Add) {
		if (!RetI->hasOneUser())
		return InstructionCost::getInvalid();
		RetI = RetI->user_back();
		}

		// Test if the found instruction is a reduction, and if not return an invalid
		// cost specifying the parent to use the original cost modelling.
		if (!InLoopReductionImmediateChains.count(RetI))
		return InstructionCost::getInvalid();

		// Find the reduction this chain is a part of and calculate the basic cost of
		// the reduction on its own.
		Instruction *LastChain = InLoopReductionImmediateChains[RetI];
		Instruction *ReductionPhi = LastChain;
		while (!isa<PHINode>(ReductionPhi))
		ReductionPhi = InLoopReductionImmediateChains[ReductionPhi];

		RecurrenceDescriptor RdxDesc =
		Legal->getReductionVars()[cast<PHINode>(ReductionPhi)];
		unsigned BaseCost = TTI.getArithmeticReductionCost(RdxDesc.getOpcode(),
		VectorTy, false, CostKind);

		// Get the operand that was not the reduction chain and match it to one of the
		// patterns, returning the better cost if it is found.
		Instruction *RedOp = RetI->getOperand(1) == LastChain
		? dyn_cast<Instruction>(RetI->getOperand(0))
		: dyn_cast<Instruction>(RetI->getOperand(1));

		VectorTy = VectorType::get(I->getOperand(0)->getType(), VectorTy);

		if (RedOp && (isa<SExtInst>(RedOp) || isa<ZExtInst>(RedOp)) &&
		!TheLoop->isLoopInvariant(RedOp)) {
		bool IsUnsigned = isa<ZExtInst>(RedOp);
		auto *ExtType = VectorType::get(RedOp->getOperand(0)->getType(), VectorTy);
		InstructionCost RedCost = TTI.getExtendedAddReductionCost(
		/*IsMLA=*/false, IsUnsigned, RdxDesc.getRecurrenceType(), ExtType,
		CostKind);

		unsigned ExtCost =
		TTI.getCastInstrCost(RedOp->getOpcode(), VectorTy, ExtType,
		TTI::CastContextHint::None, CostKind, RedOp);
		if (RedCost.isValid() && RedCost < BaseCost + ExtCost)
		SjoerdMeijer: Do we also need to check Op1 for ZExt or SExt?
		dmgreen (author): Yeah, we check that the opcodes of Op0 and Op1 are the same, as we won't be able to match mul(sext, zext) to anything useful. It should make sure that both operands are SExts or ZExts.
		return I == RetI ? *RedCost.getValue() : 0;
		} else if (RedOp && RedOp->getOpcode() == Instruction::Mul) {
		Instruction *Mul = RedOp;
		Instruction *Op0 = dyn_cast<Instruction>(Mul->getOperand(0));
		Instruction *Op1 = dyn_cast<Instruction>(Mul->getOperand(1));
		if (Op0 && Op1 && (isa<SExtInst>(Op0) || isa<ZExtInst>(Op0)) &&
		Op0->getOpcode() == Op1->getOpcode() &&
		Op0->getOperand(0)->getType() == Op1->getOperand(0)->getType() &&
		!TheLoop->isLoopInvariant(Op0) && !TheLoop->isLoopInvariant(Op1)) {
		bool IsUnsigned = isa<ZExtInst>(Op0);
		auto *ExtType = VectorType::get(Op0->getOperand(0)->getType(), VectorTy);
		// reduce(mul(ext, ext))
		unsigned ExtCost =
		TTI.getCastInstrCost(Op0->getOpcode(), VectorTy, ExtType,
		TTI::CastContextHint::None, CostKind, Op0);
		unsigned MulCost =
		TTI.getArithmeticInstrCost(Mul->getOpcode(), VectorTy, CostKind);

		InstructionCost RedCost = TTI.getExtendedAddReductionCost(
		/*IsMLA=*/true, IsUnsigned, RdxDesc.getRecurrenceType(), ExtType,
		CostKind);

		if (RedCost.isValid() && RedCost < ExtCost * 2 + MulCost + BaseCost)
		return I == RetI ? *RedCost.getValue() : 0;
		} else {
		unsigned MulCost =
		TTI.getArithmeticInstrCost(Mul->getOpcode(), VectorTy, CostKind);

		InstructionCost RedCost = TTI.getExtendedAddReductionCost(
		/*IsMLA=*/true, true, RdxDesc.getRecurrenceType(), VectorTy,
		CostKind);

		if (RedCost.isValid() && RedCost < MulCost + BaseCost)
		return I == RetI ? *RedCost.getValue() : 0;
		}
		}

		return I == RetI ? BaseCost : InstructionCost::getInvalid();
		}

InstructionCost		InstructionCost
LoopVectorizationCostModel::getMemoryInstructionCost(Instruction *I,		LoopVectorizationCostModel::getMemoryInstructionCost(Instruction *I,
ElementCount VF) {		ElementCount VF) {
// Calculate scalar cost only. Vectorization cost should be ready at this		// Calculate scalar cost only. Vectorization cost should be ready at this
// moment.		// moment.
if (VF.isScalar()) {		if (VF.isScalar()) {
Type *ValTy = getMemInstValueType(I);		Type *ValTy = getMemInstValueType(I);
const Align Alignment = getLoadStoreAlignment(I);		const Align Alignment = getLoadStoreAlignment(I);
▲ Show 20 Lines • Show All 338 Lines • ▼ Show 20 Lines	LoopVectorizationCostModel::getInstructionCost(Instruction *I, ElementCount VF,
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
// Since we will replace the stride by 1 the multiplication should go away.		// Since we will replace the stride by 1 the multiplication should go away.
if (I->getOpcode() == Instruction::Mul && isStrideMul(I, Legal))		if (I->getOpcode() == Instruction::Mul && isStrideMul(I, Legal))
return 0;		return 0;

		// Detect reduction patterns
		InstructionCost RedCost;
		if ((RedCost = getReductionPatternCost(I, VF, VectorTy, CostKind))
		.isValid())
		return RedCost;

// Certain instructions can be cheaper to vectorize if they have a constant		// Certain instructions can be cheaper to vectorize if they have a constant
// second vector operand. One example of this are shifts on x86.		// second vector operand. One example of this are shifts on x86.
Value *Op2 = I->getOperand(1);		Value *Op2 = I->getOperand(1);
TargetTransformInfo::OperandValueProperties Op2VP;		TargetTransformInfo::OperandValueProperties Op2VP;
TargetTransformInfo::OperandValueKind Op2VK =		TargetTransformInfo::OperandValueKind Op2VK =
TTI.getOperandInfo(Op2, Op2VP);		TTI.getOperandInfo(Op2, Op2VP);
if (Op2VK == TargetTransformInfo::OK_AnyValue && Legal->isUniform(Op2))		if (Op2VK == TargetTransformInfo::OK_AnyValue && Legal->isUniform(Op2))
Op2VK = TargetTransformInfo::OK_UniformValue;		Op2VK = TargetTransformInfo::OK_UniformValue;
▲ Show 20 Lines • Show All 107 Lines • ▼ Show 20 Lines	case Instruction::BitCast: {
// integer steps. The cost of these truncations is the same as the scalar		// integer steps. The cost of these truncations is the same as the scalar
// operation.		// operation.
if (isOptimizableIVTruncate(I, VF)) {		if (isOptimizableIVTruncate(I, VF)) {
auto *Trunc = cast<TruncInst>(I);		auto *Trunc = cast<TruncInst>(I);
return TTI.getCastInstrCost(Instruction::Trunc, Trunc->getDestTy(),		return TTI.getCastInstrCost(Instruction::Trunc, Trunc->getDestTy(),
Trunc->getSrcTy(), CCH, CostKind, Trunc);		Trunc->getSrcTy(), CCH, CostKind, Trunc);
}		}

		// Detect reduction patterns
		InstructionCost RedCost;
		if ((RedCost = getReductionPatternCost(I, VF, VectorTy, CostKind))
		.isValid())
		return RedCost;

Type *SrcScalarTy = I->getOperand(0)->getType();		Type *SrcScalarTy = I->getOperand(0)->getType();
Type *SrcVecTy =		Type *SrcVecTy =
VectorTy->isVectorTy() ? ToVectorTy(SrcScalarTy, VF) : SrcScalarTy;		VectorTy->isVectorTy() ? ToVectorTy(SrcScalarTy, VF) : SrcScalarTy;
if (canTruncateToMinimalBitwidth(I, VF)) {		if (canTruncateToMinimalBitwidth(I, VF)) {
// This cast is going to be shrunk. This may remove the cast or it might		// This cast is going to be shrunk. This may remove the cast or it might
// turn it into slightly different cast. For example, if MinBW == 16,		// turn it into slightly different cast. For example, if MinBW == 16,
// "zext i8 %1 to i32" becomes "zext i8 %1 to i16".		// "zext i8 %1 to i32" becomes "zext i8 %1 to i16".
//		//
[114 lines not shown]	if (!PreferInLoopReductions &&
continue;		continue;

// Check that we can correctly put the reductions into the loop, by		// Check that we can correctly put the reductions into the loop, by
// finding the chain of operations that leads from the phi to the loop		// finding the chain of operations that leads from the phi to the loop
// exit value.		// exit value.
SmallVector<Instruction *, 4> ReductionOperations =		SmallVector<Instruction *, 4> ReductionOperations =
RdxDesc.getReductionOpChain(Phi, TheLoop);		RdxDesc.getReductionOpChain(Phi, TheLoop);
bool InLoop = !ReductionOperations.empty();		bool InLoop = !ReductionOperations.empty();
if (InLoop)		if (InLoop) {
InLoopReductionChains[Phi] = ReductionOperations;		InLoopReductionChains[Phi] = ReductionOperations;
		// Add the elements to InLoopReductionImmediateChains for cost modelling.
		Instruction *LastChain = Phi;
		for (auto *I : ReductionOperations) {
		InLoopReductionImmediateChains[I] = LastChain;
		LastChain = I;
		}
		}
LLVM_DEBUG(dbgs() << "LV: Using " << (InLoop ? "inloop" : "out of loop")		LLVM_DEBUG(dbgs() << "LV: Using " << (InLoop ? "inloop" : "out of loop")
<< " reduction for phi: " << *Phi << "\n");		<< " reduction for phi: " << *Phi << "\n");
}		}
}		}
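
For illustration only: the loop added above records, for every operation in an in-loop reduction chain, the value that immediately precedes it, with the phi preceding the first operation. A minimal standalone sketch of that bookkeeping follows. It uses strings as hypothetical stand-ins for `Instruction *`, and `buildImmediateChains` is a made-up helper name, not part of the patch.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for llvm::Instruction *; for illustration only.
using Node = std::string;

// Map each operation in the chain to its immediate predecessor:
// the first operation maps to the phi, every later one to the
// operation before it.
std::unordered_map<Node, Node>
buildImmediateChains(const Node &Phi, const std::vector<Node> &Ops) {
  std::unordered_map<Node, Node> Chains;
  Node Last = Phi;
  for (const Node &Op : Ops) {
    Chains[Op] = Last;
    Last = Op;
  }
  return Chains;
}
```

With a chain `{"add1", "add2"}` rooted at `"phi"`, the sketch maps `add1` to `phi` and `add2` to `add1`, so a cost model could walk from any chain member back towards the phi.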

// TODO: we could return a pair of values that specify the max VF and		// TODO: we could return a pair of values that specify the max VF and
// min VF, to be used in `buildVPlans(MinVF, MaxVF)` instead of		// min VF, to be used in `buildVPlans(MinVF, MaxVF)` instead of
// `buildVPlans(VF, VF)`. We cannot do it because VPLAN at the moment		// `buildVPlans(VF, VF)`. We cannot do it because VPLAN at the moment
[2,143 lines not shown]

llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -loop-vectorize < %s -S -o - | FileCheck %s			; RUN: opt -loop-vectorize < %s -S -o - | FileCheck %s

	target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"			target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
	target triple = "thumbv8.1m.main-none-none-eabi"			target triple = "thumbv8.1m.main-none-none-eabi"

	define i32 @mla_i32(i8* noalias nocapture readonly %A, i8* noalias nocapture readonly %B, i32 %n) #0 {			define i32 @mla_i32(i8* noalias nocapture readonly %A, i8* noalias nocapture readonly %B, i32 %n) #0 {
	; CHECK-LABEL: @mla_i32(			; CHECK-LABEL: @mla_i32(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[CMP9:%.*]] = icmp sgt i32 [[N:%.*]], 0			; CHECK-NEXT: [[CMP9:%.*]] = icmp sgt i32 [[N:%.*]], 0
	; CHECK-NEXT: br i1 [[CMP9]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]			; CHECK-NEXT: br i1 [[CMP9]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
	; CHECK: for.body.preheader:			; CHECK: for.body.preheader:
	; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]			; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
	; CHECK: vector.ph:			; CHECK: vector.ph:
	; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3			; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
	; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], 4			; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i32 [[N_RND_UP]], 16
	; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[N_RND_UP]], [[N_MOD_VF]]			; CHECK-NEXT: [[N_VEC:%.*]] = sub i32 [[N_RND_UP]], [[N_MOD_VF]]
	; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]			; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
	; CHECK: vector.body:			; CHECK: vector.body:
	; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP12:%.*]], [[VECTOR_BODY]] ]			; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP12:%.*]], [[VECTOR_BODY]] ]
	; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <4 x i32> poison, i32 [[INDEX]], i32 0			; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <16 x i32> poison, i32 [[INDEX]], i32 0
	; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <4 x i32> [[BROADCAST_SPLATINSERT]], <4 x i32> poison, <4 x i32> zeroinitializer			; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <16 x i32> [[BROADCAST_SPLATINSERT]], <16 x i32> poison, <16 x i32> zeroinitializer
	; CHECK-NEXT: [[INDUCTION:%.*]] = add <4 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3>			; CHECK-NEXT: [[INDUCTION:%.*]] = add <16 x i32> [[BROADCAST_SPLAT]], <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
	; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0			; CHECK-NEXT: [[TMP0:%.*]] = add i32 [[INDEX]], 0
	; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[TMP0]], i32 [[N]])			; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[TMP0]], i32 [[N]])
	; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, i8* [[A:%.*]], i32 [[TMP0]]			; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, i8* [[A:%.*]], i32 [[TMP0]]
	; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[TMP1]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i8, i8* [[TMP1]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = bitcast i8* [[TMP2]] to <4 x i8>*			; CHECK-NEXT: [[TMP3:%.*]] = bitcast i8* [[TMP2]] to <16 x i8>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* [[TMP3]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP3]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
	; CHECK-NEXT: [[TMP4:%.*]] = sext <4 x i8> [[WIDE_MASKED_LOAD]] to <4 x i32>			; CHECK-NEXT: [[TMP4:%.*]] = sext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i32>
	; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, i8* [[B:%.*]], i32 [[TMP0]]			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i8, i8* [[B:%.*]], i32 [[TMP0]]
	; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, i8* [[TMP5]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds i8, i8* [[TMP5]], i32 0
	; CHECK-NEXT: [[TMP7:%.*]] = bitcast i8* [[TMP6]] to <4 x i8>*			; CHECK-NEXT: [[TMP7:%.*]] = bitcast i8* [[TMP6]] to <16 x i8>*
	; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* [[TMP7]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)			; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP7]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
	; CHECK-NEXT: [[TMP8:%.*]] = sext <4 x i8> [[WIDE_MASKED_LOAD1]] to <4 x i32>			; CHECK-NEXT: [[TMP8:%.*]] = sext <16 x i8> [[WIDE_MASKED_LOAD1]] to <16 x i32>
	; CHECK-NEXT: [[TMP9:%.*]] = mul nsw <4 x i32> [[TMP8]], [[TMP4]]			; CHECK-NEXT: [[TMP9:%.*]] = mul nsw <16 x i32> [[TMP8]], [[TMP4]]
	; CHECK-NEXT: [[TMP10:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP9]], <4 x i32> zeroinitializer			; CHECK-NEXT: [[TMP10:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i32> [[TMP9]], <16 x i32> zeroinitializer
	; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP10]])			; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP10]])
	; CHECK-NEXT: [[TMP12]] = add i32 [[TMP11]], [[VEC_PHI]]			; CHECK-NEXT: [[TMP12]] = add i32 [[TMP11]], [[VEC_PHI]]
	; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4			; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
	; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]			; CHECK-NEXT: [[TMP13:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
	; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP0:!llvm.loop !.*]]			; CHECK-NEXT: br i1 [[TMP13]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP0:!llvm.loop !.*]]
	; CHECK: middle.block:			; CHECK: middle.block:
	; CHECK-NEXT: br i1 true, label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]			; CHECK-NEXT: br i1 true, label [[FOR_COND_CLEANUP_LOOPEXIT:%.*]], label [[SCALAR_PH]]
	; CHECK: scalar.ph:			; CHECK: scalar.ph:
	; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]			; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
	; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]			; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ 0, [[FOR_BODY_PREHEADER]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]
	; CHECK-NEXT: br label [[FOR_BODY:%.*]]			; CHECK-NEXT: br label [[FOR_BODY:%.*]]
	[1,052 lines not shown]

llvm/test/Transforms/LoopVectorize/ARM/mve-reductions.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt -loop-vectorize -instcombine -simplifycfg -simplifycfg-require-and-preserve-domtree=1 -tail-predication=enabled < %s -S -o - | FileCheck %s		; RUN: opt -loop-vectorize -instcombine -simplifycfg -simplifycfg-require-and-preserve-domtree=1 -tail-predication=enabled < %s -S -o - | FileCheck %s

target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"		target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
target triple = "thumbv8.1m.main-arm-none-eabi"		target triple = "thumbv8.1m.main-arm-none-eabi"

		; Should not be vectorized
define i64 @add_i64_i64(i64* nocapture readonly %x, i32 %n) #0 {		define i64 @add_i64_i64(i64* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i64_i64(		; CHECK-LABEL: @add_i64_i64(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP6]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP6]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]		; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
; CHECK-NEXT: [[R_07:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]		; CHECK-NEXT: [[R_07:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]
[21 lines not shown]	for.body: ; preds = %entry, %for.body
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
ret i64 %r.0.lcssa		ret i64 %r.0.lcssa
}		}

; FIXME: 4x		; 4x to use VADDLV
		; FIXME: TailPredicate
define i64 @add_i32_i64(i32* nocapture readonly %x, i32 %n) #0 {		define i64 @add_i32_i64(i32* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i32_i64(		; CHECK-LABEL: @add_i32_i64(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP6]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP6]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
		; CHECK: for.body.preheader:
		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -4
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, i32* [[X:%.*]], i32 [[INDEX]]
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[TMP0]] to <4 x i32>*
		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
		; CHECK-NEXT: [[TMP2:%.*]] = sext <4 x i32> [[WIDE_LOAD]] to <4 x i64>
		; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> [[TMP2]])
		; CHECK-NEXT: [[TMP4]] = add i64 [[TMP3]], [[VEC_PHI]]
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
		; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP0:!llvm.loop !.*]]
		; CHECK: middle.block:
		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]
		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP4]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]		; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[R_07:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]		; CHECK-NEXT: [[R_07:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[X:%.*]], i32 [[I_08]]		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[X]], i32 [[I_08]]
; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[ARRAYIDX]], align 4		; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
; CHECK-NEXT: [[CONV:%.*]] = sext i32 [[TMP0]] to i64		; CHECK-NEXT: [[CONV:%.*]] = sext i32 [[TMP6]] to i64
; CHECK-NEXT: [[ADD]] = add nsw i64 [[R_07]], [[CONV]]		; CHECK-NEXT: [[ADD]] = add nsw i64 [[R_07]], [[CONV]]
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1		; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]		; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]		; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], [[LOOP2:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[ADD]], [[FOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[ADD]], [[FOR_BODY]] ], [ [[TMP4]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[R_0_LCSSA]]		; CHECK-NEXT: ret i64 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp6 = icmp sgt i32 %n, 0		%cmp6 = icmp sgt i32 %n, 0
br i1 %cmp6, label %for.body, label %for.cond.cleanup		br i1 %cmp6, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%r.07 = phi i64 [ %add, %for.body ], [ 0, %entry ]		%r.07 = phi i64 [ %add, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i32, i32* %x, i32 %i.08		%arrayidx = getelementptr inbounds i32, i32* %x, i32 %i.08
%0 = load i32, i32* %arrayidx, align 4		%0 = load i32, i32* %arrayidx, align 4
%conv = sext i32 %0 to i64		%conv = sext i32 %0 to i64
%add = add nsw i64 %r.07, %conv		%add = add nsw i64 %r.07, %conv
%inc = add nuw nsw i32 %i.08, 1		%inc = add nuw nsw i32 %i.08, 1
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
ret i64 %r.0.lcssa		ret i64 %r.0.lcssa
}		}
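
For illustration only: the new CHECK lines for `@add_i32_i64` sign-extend four `i32` lanes to `i64`, add-reduce them, and then add the result to the accumulator phi. The sketch below is a scalar model of one vector iteration of that IR pattern. The name `extendedAddReduce` is hypothetical, and the sketch models only the arithmetic, not the MVE instruction selected for it or its cost.

```cpp
#include <array>
#include <cstdint>

// Scalar model of: Acc + vecreduce.add(sext <4 x i32> Lanes to <4 x i64>).
// Each lane is widened before the addition, so the sum cannot wrap at 32 bits.
int64_t extendedAddReduce(const std::array<int32_t, 4> &Lanes, int64_t Acc) {
  for (int32_t Lane : Lanes)
    Acc += static_cast<int64_t>(Lane);
  return Acc;
}
```

Widening before the reduction is what distinguishes this pattern from a plain `i32` add reduction: two lanes holding `INT32_MAX` already produce a sum that does not fit in 32 bits.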

; FIXME: 4x ?		; 4x to use VADDLV
		; FIXME: TailPredicate
define i64 @add_i16_i64(i16* nocapture readonly %x, i32 %n) #0 {		define i64 @add_i16_i64(i16* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i16_i64(		; CHECK-LABEL: @add_i16_i64(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP6]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP6]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
		; CHECK: for.body.preheader:
		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -4
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[INDEX]]
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <4 x i16>*
		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i16>, <4 x i16>* [[TMP1]], align 2
		; CHECK-NEXT: [[TMP2:%.*]] = sext <4 x i16> [[WIDE_LOAD]] to <4 x i64>
		; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> [[TMP2]])
		; CHECK-NEXT: [[TMP4]] = add i64 [[TMP3]], [[VEC_PHI]]
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
		; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP4:!llvm.loop !.*]]
		; CHECK: middle.block:
		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]
		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP4]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]		; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[R_07:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]		; CHECK-NEXT: [[R_07:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[I_08]]		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[X]], i32 [[I_08]]
; CHECK-NEXT: [[TMP0:%.*]] = load i16, i16* [[ARRAYIDX]], align 2		; CHECK-NEXT: [[TMP6:%.*]] = load i16, i16* [[ARRAYIDX]], align 2
; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[TMP0]] to i64		; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[TMP6]] to i64
; CHECK-NEXT: [[ADD]] = add nsw i64 [[R_07]], [[CONV]]		; CHECK-NEXT: [[ADD]] = add nsw i64 [[R_07]], [[CONV]]
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1		; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]		; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]		; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], [[LOOP5:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[ADD]], [[FOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[ADD]], [[FOR_BODY]] ], [ [[TMP4]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[R_0_LCSSA]]		; CHECK-NEXT: ret i64 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp6 = icmp sgt i32 %n, 0		%cmp6 = icmp sgt i32 %n, 0
br i1 %cmp6, label %for.body, label %for.cond.cleanup		br i1 %cmp6, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%r.07 = phi i64 [ %add, %for.body ], [ 0, %entry ]		%r.07 = phi i64 [ %add, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i16, i16* %x, i32 %i.08		%arrayidx = getelementptr inbounds i16, i16* %x, i32 %i.08
%0 = load i16, i16* %arrayidx, align 2		%0 = load i16, i16* %arrayidx, align 2
%conv = sext i16 %0 to i64		%conv = sext i16 %0 to i64
%add = add nsw i64 %r.07, %conv		%add = add nsw i64 %r.07, %conv
%inc = add nuw nsw i32 %i.08, 1		%inc = add nuw nsw i32 %i.08, 1
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
ret i64 %r.0.lcssa		ret i64 %r.0.lcssa
}		}

; FIXME: 4x ?		; 4x to use VADDLV
		; FIXME: TailPredicate
define i64 @add_i8_i64(i8* nocapture readonly %x, i32 %n) #0 {		define i64 @add_i8_i64(i8* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i8_i64(		; CHECK-LABEL: @add_i8_i64(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP6]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP6]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
		; CHECK: for.body.preheader:
		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -4
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <4 x i8>*
		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i8>, <4 x i8>* [[TMP1]], align 1
		; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i8> [[WIDE_LOAD]] to <4 x i64>
		; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> [[TMP2]])
		; CHECK-NEXT: [[TMP4]] = add i64 [[TMP3]], [[VEC_PHI]]
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
		; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-NEXT: br i1 [[TMP5]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP6:!llvm.loop !.*]]
		; CHECK: middle.block:
		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]
		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP4]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]		; CHECK-NEXT: [[I_08:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[R_07:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]		; CHECK-NEXT: [[R_07:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[I_08]]		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, i8* [[X]], i32 [[I_08]]
; CHECK-NEXT: [[TMP0:%.*]] = load i8, i8* [[ARRAYIDX]], align 1		; CHECK-NEXT: [[TMP6:%.*]] = load i8, i8* [[ARRAYIDX]], align 1
; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP0]] to i64		; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP6]] to i64
; CHECK-NEXT: [[ADD]] = add nuw nsw i64 [[R_07]], [[CONV]]		; CHECK-NEXT: [[ADD]] = add nuw nsw i64 [[R_07]], [[CONV]]
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1		; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_08]], 1
; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]		; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]		; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], [[LOOP7:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[ADD]], [[FOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[ADD]], [[FOR_BODY]] ], [ [[TMP4]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[R_0_LCSSA]]		; CHECK-NEXT: ret i64 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp6 = icmp sgt i32 %n, 0		%cmp6 = icmp sgt i32 %n, 0
br i1 %cmp6, label %for.body, label %for.cond.cleanup		br i1 %cmp6, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%r.07 = phi i64 [ %add, %for.body ], [ 0, %entry ]		%r.07 = phi i64 [ %add, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i8, i8* %x, i32 %i.08		%arrayidx = getelementptr inbounds i8, i8* %x, i32 %i.08
%0 = load i8, i8* %arrayidx, align 1		%0 = load i8, i8* %arrayidx, align 1
%conv = zext i8 %0 to i64		%conv = zext i8 %0 to i64
%add = add nuw nsw i64 %r.07, %conv		%add = add nuw nsw i64 %r.07, %conv
%inc = add nuw nsw i32 %i.08, 1		%inc = add nuw nsw i32 %i.08, 1
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
ret i64 %r.0.lcssa		ret i64 %r.0.lcssa
}		}

		; 4x to use VADDV.u32
define i32 @add_i32_i32(i32* nocapture readonly %x, i32 %n) #0 {		define i32 @add_i32_i32(i32* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i32_i32(		; CHECK-LABEL: @add_i32_i32(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP6]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP6]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])		; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, i32* [[X:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, i32* [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[TMP0]] to <4 x i32>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[TMP0]] to <4 x i32>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP1]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP1]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> poison)
; CHECK-NEXT: [[TMP2:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[WIDE_MASKED_LOAD]], <4 x i32> zeroinitializer		; CHECK-NEXT: [[TMP2:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[WIDE_MASKED_LOAD]], <4 x i32> zeroinitializer
; CHECK-NEXT: [[TMP3:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP2]])		; CHECK-NEXT: [[TMP3:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP2]])
; CHECK-NEXT: [[TMP4]] = add i32 [[TMP3]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP4]] = add i32 [[TMP3]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP5]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP0:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP5]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP8:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP4]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP4]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i32 [[R_0_LCSSA]]		; CHECK-NEXT: ret i32 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp6 = icmp sgt i32 %n, 0		%cmp6 = icmp sgt i32 %n, 0
br i1 %cmp6, label %for.body, label %for.cond.cleanup		br i1 %cmp6, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%r.07 = phi i32 [ %add, %for.body ], [ 0, %entry ]		%r.07 = phi i32 [ %add, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i32, i32* %x, i32 %i.08		%arrayidx = getelementptr inbounds i32, i32* %x, i32 %i.08
%0 = load i32, i32* %arrayidx, align 4		%0 = load i32, i32* %arrayidx, align 4
%add = add nsw i32 %0, %r.07		%add = add nsw i32 %0, %r.07
%inc = add nuw nsw i32 %i.08, 1		%inc = add nuw nsw i32 %i.08, 1
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
ret i32 %r.0.lcssa		ret i32 %r.0.lcssa
}		}
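
For illustration only: the `vector.ph` CHECK lines in this file compute the vector trip count in two different ways. With the tail folded into the vector loop (as in `@add_i32_i32`), `N` is rounded up to a multiple of the vectorization factor: `add N, 3` followed by `and ..., -4`. With a scalar remainder loop (as in the new `@add_i32_i64` checks), `N` is rounded down: `and N, -4`. A minimal sketch of both follows, assuming a power-of-two `VF` and ignoring overflow of `N + VF - 1`. The function names are hypothetical.

```cpp
#include <cstdint>

// Tail folded into the vector loop: round N up to a multiple of VF.
// Mirrors `add i32 N, VF-1` followed by `and i32 ..., -VF`.
constexpr uint32_t foldedTripCount(uint32_t N, uint32_t VF) {
  return (N + (VF - 1)) & ~(VF - 1);
}

// Scalar epilogue handles the remainder: round N down to a multiple of VF.
// Mirrors `and i32 N, -VF`.
constexpr uint32_t epilogueTripCount(uint32_t N, uint32_t VF) {
  return N & ~(VF - 1);
}
```

For `N = 10` and `VF = 4`, the folded form runs the vector loop over 12 lanes with the last two masked off, while the epilogue form vectorizes 8 elements and leaves 2 for the scalar loop.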

; FIXME: 8x		; 8x to use VADDV.u16
define i32 @add_i16_i32(i16* nocapture readonly %x, i32 %n) #0 {		define i32 @add_i16_i32(i16* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i16_i32(		; CHECK-LABEL: @add_i16_i32(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP6]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP6]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])		; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <4 x i16>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <8 x i16>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* [[TMP1]], i32 2, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i16> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* [[TMP1]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
; CHECK-NEXT: [[TMP2:%.*]] = sext <4 x i16> [[WIDE_MASKED_LOAD]] to <4 x i32>		; CHECK-NEXT: [[TMP2:%.*]] = sext <8 x i16> [[WIDE_MASKED_LOAD]] to <8 x i32>
; CHECK-NEXT: [[TMP3:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP2]], <4 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i32> [[TMP2]], <8 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP3]])		; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP3]])
; CHECK-NEXT: [[TMP5]] = add i32 [[TMP4]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP5]] = add i32 [[TMP4]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8
; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP6]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP2:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP6]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP9:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP5]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP5]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i32 [[R_0_LCSSA]]		; CHECK-NEXT: ret i32 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp6 = icmp sgt i32 %n, 0		%cmp6 = icmp sgt i32 %n, 0
br i1 %cmp6, label %for.body, label %for.cond.cleanup		br i1 %cmp6, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%r.07 = phi i32 [ %add, %for.body ], [ 0, %entry ]		%r.07 = phi i32 [ %add, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i16, i16* %x, i32 %i.08		%arrayidx = getelementptr inbounds i16, i16* %x, i32 %i.08
%0 = load i16, i16* %arrayidx, align 2		%0 = load i16, i16* %arrayidx, align 2
%conv = sext i16 %0 to i32		%conv = sext i16 %0 to i32
%add = add nsw i32 %r.07, %conv		%add = add nsw i32 %r.07, %conv
%inc = add nuw nsw i32 %i.08, 1		%inc = add nuw nsw i32 %i.08, 1
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
ret i32 %r.0.lcssa		ret i32 %r.0.lcssa
}		}

; FIXME: 16x		; 16x to use VADDV.u16
define i32 @add_i8_i32(i8* nocapture readonly %x, i32 %n) #0 {		define i32 @add_i8_i32(i8* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i8_i32(		; CHECK-LABEL: @add_i8_i32(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP6:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP6]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP6]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -16
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])		; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <4 x i8>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <16 x i8>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* [[TMP1]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP1]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i8> [[WIDE_MASKED_LOAD]] to <4 x i32>		; CHECK-NEXT: [[TMP2:%.*]] = zext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i32>
; CHECK-NEXT: [[TMP3:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP2]], <4 x i32> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i32> [[TMP2]], <16 x i32> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP3]])		; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP3]])
; CHECK-NEXT: [[TMP5]] = add i32 [[TMP4]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP5]] = add i32 [[TMP4]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP6]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP3:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP6]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP10:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP5]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP5]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i32 [[R_0_LCSSA]]		; CHECK-NEXT: ret i32 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp6 = icmp sgt i32 %n, 0		%cmp6 = icmp sgt i32 %n, 0
br i1 %cmp6, label %for.body, label %for.cond.cleanup		br i1 %cmp6, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.08 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%r.07 = phi i32 [ %add, %for.body ], [ 0, %entry ]		%r.07 = phi i32 [ %add, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i8, i8* %x, i32 %i.08		%arrayidx = getelementptr inbounds i8, i8* %x, i32 %i.08
%0 = load i8, i8* %arrayidx, align 1		%0 = load i8, i8* %arrayidx, align 1
%conv = zext i8 %0 to i32		%conv = zext i8 %0 to i32
%add = add nuw nsw i32 %r.07, %conv		%add = add nuw nsw i32 %r.07, %conv
%inc = add nuw nsw i32 %i.08, 1		%inc = add nuw nsw i32 %i.08, 1
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
ret i32 %r.0.lcssa		ret i32 %r.0.lcssa
}		}

		; 8x to use VADDV.u16
define signext i16 @add_i16_i16(i16* nocapture readonly %x, i32 %n) #0 {		define signext i16 @add_i16_i16(i16* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i16_i16(		; CHECK-LABEL: @add_i16_i16(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP8]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP8]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i16 [ 0, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i16 [ 0, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[N]])		; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <8 x i16>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <8 x i16>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* [[TMP1]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* [[TMP1]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
; CHECK-NEXT: [[TMP2:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> [[WIDE_MASKED_LOAD]], <8 x i16> zeroinitializer		; CHECK-NEXT: [[TMP2:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> [[WIDE_MASKED_LOAD]], <8 x i16> zeroinitializer
; CHECK-NEXT: [[TMP3:%.*]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> [[TMP2]])		; CHECK-NEXT: [[TMP3:%.*]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> [[TMP2]])
; CHECK-NEXT: [[TMP4]] = add i16 [[TMP3]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP4]] = add i16 [[TMP3]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8
; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP5]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP4:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP5]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP11:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TMP4]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TMP4]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i16 [[R_0_LCSSA]]		; CHECK-NEXT: ret i16 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %for.body, label %for.cond.cleanup		br i1 %cmp8, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.010 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.010 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%r.09 = phi i16 [ %add, %for.body ], [ 0, %entry ]		%r.09 = phi i16 [ %add, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i16, i16* %x, i32 %i.010		%arrayidx = getelementptr inbounds i16, i16* %x, i32 %i.010
%0 = load i16, i16* %arrayidx, align 2		%0 = load i16, i16* %arrayidx, align 2
%add = add i16 %0, %r.09		%add = add i16 %0, %r.09
%inc = add nuw nsw i32 %i.010, 1		%inc = add nuw nsw i32 %i.010, 1
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i16 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i16 [ 0, %entry ], [ %add, %for.body ]
ret i16 %r.0.lcssa		ret i16 %r.0.lcssa
}		}

; FIXME: 16x ?		; 16x to use VADDV.u8
define signext i16 @add_i8_i16(i8* nocapture readonly %x, i32 %n) #0 {		define signext i16 @add_i8_i16(i8* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i8_i16(		; CHECK-LABEL: @add_i8_i16(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP8]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP8]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -16
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i16 [ 0, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i16 [ 0, [[VECTOR_PH]] ], [ [[TMP5:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[N]])		; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <8 x i8>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <16 x i8>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* [[TMP1]], i32 1, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i8> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP1]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP2:%.*]] = zext <8 x i8> [[WIDE_MASKED_LOAD]] to <8 x i16>		; CHECK-NEXT: [[TMP2:%.*]] = zext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i16>
; CHECK-NEXT: [[TMP3:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> [[TMP2]], <8 x i16> zeroinitializer		; CHECK-NEXT: [[TMP3:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i16> [[TMP2]], <16 x i16> zeroinitializer
; CHECK-NEXT: [[TMP4:%.*]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> [[TMP3]])		; CHECK-NEXT: [[TMP4:%.*]] = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> [[TMP3]])
; CHECK-NEXT: [[TMP5]] = add i16 [[TMP4]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP5]] = add i16 [[TMP4]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP6:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP6]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP5:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP6]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP12:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TMP5]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TMP5]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i16 [[R_0_LCSSA]]		; CHECK-NEXT: ret i16 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %for.body, label %for.cond.cleanup		br i1 %cmp8, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.010 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.010 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%r.09 = phi i16 [ %add, %for.body ], [ 0, %entry ]		%r.09 = phi i16 [ %add, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i8, i8* %x, i32 %i.010		%arrayidx = getelementptr inbounds i8, i8* %x, i32 %i.010
%0 = load i8, i8* %arrayidx, align 1		%0 = load i8, i8* %arrayidx, align 1
%conv = zext i8 %0 to i16		%conv = zext i8 %0 to i16
%add = add i16 %r.09, %conv		%add = add i16 %r.09, %conv
%inc = add nuw nsw i32 %i.010, 1		%inc = add nuw nsw i32 %i.010, 1
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i16 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i16 [ 0, %entry ], [ %add, %for.body ]
ret i16 %r.0.lcssa		ret i16 %r.0.lcssa
}		}

		; 16x to use VADDV.u8
define zeroext i8 @add_i8_i8(i8* nocapture readonly %x, i32 %n) #0 {		define zeroext i8 @add_i8_i8(i8* nocapture readonly %x, i32 %n) #0 {
; CHECK-LABEL: @add_i8_i8(		; CHECK-LABEL: @add_i8_i8(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP7:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP7:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP7]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP7]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -16		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -16
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i8 [ 0, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i8 [ 0, [[VECTOR_PH]] ], [ [[TMP4:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[INDEX]], i32 [[N]])		; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <16 x i8>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <16 x i8>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP1]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP1]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP2:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> [[WIDE_MASKED_LOAD]], <16 x i8> zeroinitializer		; CHECK-NEXT: [[TMP2:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> [[WIDE_MASKED_LOAD]], <16 x i8> zeroinitializer
; CHECK-NEXT: [[TMP3:%.*]] = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> [[TMP2]])		; CHECK-NEXT: [[TMP3:%.*]] = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> [[TMP2]])
; CHECK-NEXT: [[TMP4]] = add i8 [[TMP3]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP4]] = add i8 [[TMP3]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP5:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP5]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP6:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP5]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP13:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i8 [ 0, [[ENTRY:%.*]] ], [ [[TMP4]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i8 [ 0, [[ENTRY:%.*]] ], [ [[TMP4]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i8 [[R_0_LCSSA]]		; CHECK-NEXT: ret i8 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp7 = icmp sgt i32 %n, 0		%cmp7 = icmp sgt i32 %n, 0
br i1 %cmp7, label %for.body, label %for.cond.cleanup		br i1 %cmp7, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.09 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.09 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
%r.08 = phi i8 [ %add, %for.body ], [ 0, %entry ]		%r.08 = phi i8 [ %add, %for.body ], [ 0, %entry ]
%arrayidx = getelementptr inbounds i8, i8* %x, i32 %i.09		%arrayidx = getelementptr inbounds i8, i8* %x, i32 %i.09
%0 = load i8, i8* %arrayidx, align 1		%0 = load i8, i8* %arrayidx, align 1
%add = add i8 %0, %r.08		%add = add i8 %0, %r.08
%inc = add nuw nsw i32 %i.09, 1		%inc = add nuw nsw i32 %i.09, 1
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i8 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i8 [ 0, %entry ], [ %add, %for.body ]
ret i8 %r.0.lcssa		ret i8 %r.0.lcssa
}		}

		; Not vectorized
define i64 @mla_i64_i64(i64* nocapture readonly %x, i64* nocapture readonly %y, i32 %n) #0 {		define i64 @mla_i64_i64(i64* nocapture readonly %x, i64* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i64_i64(		; CHECK-LABEL: @mla_i64_i64(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP8]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP8]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[I_010:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]		; CHECK-NEXT: [[I_010:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
; CHECK-NEXT: [[R_09:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]		; CHECK-NEXT: [[R_09:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]
; ... 27 lines not shown in this diff view ...
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
ret i64 %r.0.lcssa		ret i64 %r.0.lcssa
}		}

		; 4x to use VMLAL.u32
		; FIXME: TailPredicate
define i64 @mla_i32_i64(i32* nocapture readonly %x, i32* nocapture readonly %y, i32 %n) #0 {		define i64 @mla_i32_i64(i32* nocapture readonly %x, i32* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i32_i64(		; CHECK-LABEL: @mla_i32_i64(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP8]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP8]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
		; CHECK: for.body.preheader:
		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 4
		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -4
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[TMP7:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i32, i32* [[X:%.*]], i32 [[INDEX]]
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[TMP0]] to <4 x i32>*
		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
		; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[Y:%.*]], i32 [[INDEX]]
		; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[TMP2]] to <4 x i32>*
		; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4
		; CHECK-NEXT: [[TMP4:%.*]] = mul nsw <4 x i32> [[WIDE_LOAD1]], [[WIDE_LOAD]]
		; CHECK-NEXT: [[TMP5:%.*]] = sext <4 x i32> [[TMP4]] to <4 x i64>
		; CHECK-NEXT: [[TMP6:%.*]] = call i64 @llvm.vector.reduce.add.v4i64(<4 x i64> [[TMP5]])
		; CHECK-NEXT: [[TMP7]] = add i64 [[TMP6]], [[VEC_PHI]]
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
		; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-NEXT: br i1 [[TMP8]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP14:!llvm.loop !.*]]
		; CHECK: middle.block:
		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]
		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP7]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[I_010:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]		; CHECK-NEXT: [[I_010:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[R_09:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]		; CHECK-NEXT: [[R_09:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[X:%.*]], i32 [[I_010]]		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[X]], i32 [[I_010]]
; CHECK-NEXT: [[TMP0:%.*]] = load i32, i32* [[ARRAYIDX]], align 4		; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[ARRAYIDX]], align 4
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[Y:%.*]], i32 [[I_010]]		; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[Y]], i32 [[I_010]]
; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* [[ARRAYIDX1]], align 4		; CHECK-NEXT: [[TMP10:%.*]] = load i32, i32* [[ARRAYIDX1]], align 4
; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP1]], [[TMP0]]		; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP10]], [[TMP9]]
; CHECK-NEXT: [[CONV:%.*]] = sext i32 [[MUL]] to i64		; CHECK-NEXT: [[CONV:%.*]] = sext i32 [[MUL]] to i64
; CHECK-NEXT: [[ADD]] = add nsw i64 [[R_09]], [[CONV]]		; CHECK-NEXT: [[ADD]] = add nsw i64 [[R_09]], [[CONV]]
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_010]], 1		; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_010]], 1
; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]		; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]		; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], [[LOOP15:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[ADD]], [[FOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[ADD]], [[FOR_BODY]] ], [ [[TMP7]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[R_0_LCSSA]]		; CHECK-NEXT: ret i64 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %for.body, label %for.cond.cleanup		br i1 %cmp8, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.010 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.010 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
; ... 9 lines not shown in this diff view ...
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
ret i64 %r.0.lcssa		ret i64 %r.0.lcssa
}		}

		; 8x to use VMLAL.u16
		; FIXME: TailPredicate
define i64 @mla_i16_i64(i16* nocapture readonly %x, i16* nocapture readonly %y, i32 %n) #0 {		define i64 @mla_i16_i64(i16* nocapture readonly %x, i16* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i16_i64(		; CHECK-LABEL: @mla_i16_i64(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP10:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP10:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP10]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP10]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
		; CHECK: for.body.preheader:
		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 8
		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -8
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[INDEX]]
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <8 x i16>*
		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <8 x i16>, <8 x i16>* [[TMP1]], align 2
		; CHECK-NEXT: [[TMP2:%.*]] = sext <8 x i16> [[WIDE_LOAD]] to <8 x i32>
		; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, i16* [[Y:%.*]], i32 [[INDEX]]
		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <8 x i16>*
		; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <8 x i16>, <8 x i16>* [[TMP4]], align 2
		; CHECK-NEXT: [[TMP5:%.*]] = sext <8 x i16> [[WIDE_LOAD1]] to <8 x i32>
		; CHECK-NEXT: [[TMP6:%.*]] = mul nsw <8 x i32> [[TMP5]], [[TMP2]]
		; CHECK-NEXT: [[TMP7:%.*]] = sext <8 x i32> [[TMP6]] to <8 x i64>
		; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> [[TMP7]])
		; CHECK-NEXT: [[TMP9]] = add i64 [[TMP8]], [[VEC_PHI]]
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8
		; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP16:!llvm.loop !.*]]
		; CHECK: middle.block:
		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]
		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP9]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[I_012:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]		; CHECK-NEXT: [[I_012:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[R_011:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]		; CHECK-NEXT: [[R_011:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[I_012]]		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i16, i16* [[X]], i32 [[I_012]]
; CHECK-NEXT: [[TMP0:%.*]] = load i16, i16* [[ARRAYIDX]], align 2		; CHECK-NEXT: [[TMP11:%.*]] = load i16, i16* [[ARRAYIDX]], align 2
; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[TMP0]] to i32		; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[TMP11]] to i32
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i16, i16* [[Y:%.*]], i32 [[I_012]]		; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i16, i16* [[Y]], i32 [[I_012]]
; CHECK-NEXT: [[TMP1:%.*]] = load i16, i16* [[ARRAYIDX1]], align 2		; CHECK-NEXT: [[TMP12:%.*]] = load i16, i16* [[ARRAYIDX1]], align 2
; CHECK-NEXT: [[CONV2:%.*]] = sext i16 [[TMP1]] to i32		; CHECK-NEXT: [[CONV2:%.*]] = sext i16 [[TMP12]] to i32
; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[CONV2]], [[CONV]]		; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[CONV2]], [[CONV]]
; CHECK-NEXT: [[CONV3:%.*]] = sext i32 [[MUL]] to i64		; CHECK-NEXT: [[CONV3:%.*]] = sext i32 [[MUL]] to i64
; CHECK-NEXT: [[ADD]] = add nsw i64 [[R_011]], [[CONV3]]		; CHECK-NEXT: [[ADD]] = add nsw i64 [[R_011]], [[CONV3]]
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_012]], 1		; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_012]], 1
; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]		; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]		; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], [[LOOP17:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[ADD]], [[FOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[ADD]], [[FOR_BODY]] ], [ [[TMP9]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[R_0_LCSSA]]		; CHECK-NEXT: ret i64 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp10 = icmp sgt i32 %n, 0		%cmp10 = icmp sgt i32 %n, 0
br i1 %cmp10, label %for.body, label %for.cond.cleanup		br i1 %cmp10, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.012 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.012 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
[... 11 lines not shown ...]
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
ret i64 %r.0.lcssa		ret i64 %r.0.lcssa
}		}

		; 8x to use VMLAL.u16
		; FIXME: 8x, TailPredicate, double-extended
define i64 @mla_i8_i64(i8* nocapture readonly %x, i8* nocapture readonly %y, i32 %n) #0 {		define i64 @mla_i8_i64(i8* nocapture readonly %x, i8* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i8_i64(		; CHECK-LABEL: @mla_i8_i64(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP10:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP10:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP10]], label [[FOR_BODY:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP10]], label [[FOR_BODY_PREHEADER:%.*]], label [[FOR_COND_CLEANUP:%.*]]
		; CHECK: for.body.preheader:
		; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i32 [[N]], 16
		; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
		; CHECK: vector.ph:
		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N]], -16
		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
		; CHECK: vector.body:
		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]
		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <16 x i8>*
		; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <16 x i8>, <16 x i8>* [[TMP1]], align 1
		; CHECK-NEXT: [[TMP2:%.*]] = zext <16 x i8> [[WIDE_LOAD]] to <16 x i32>
		; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[Y:%.*]], i32 [[INDEX]]
		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <16 x i8>*
		; CHECK-NEXT: [[WIDE_LOAD1:%.*]] = load <16 x i8>, <16 x i8>* [[TMP4]], align 1
		; CHECK-NEXT: [[TMP5:%.*]] = zext <16 x i8> [[WIDE_LOAD1]] to <16 x i32>
		; CHECK-NEXT: [[TMP6:%.*]] = mul nuw nsw <16 x i32> [[TMP5]], [[TMP2]]
		; CHECK-NEXT: [[TMP7:%.*]] = zext <16 x i32> [[TMP6]] to <16 x i64>
		; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vector.reduce.add.v16i64(<16 x i64> [[TMP7]])
		; CHECK-NEXT: [[TMP9]] = add i64 [[TMP8]], [[VEC_PHI]]
		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
		; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
		; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP18:!llvm.loop !.*]]
		; CHECK: middle.block:
		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[N_VEC]], [[N]]
		; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP]], label [[SCALAR_PH]]
		; CHECK: scalar.ph:
		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i64 [ [[TMP9]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[I_012:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]		; CHECK-NEXT: [[I_012:%.*]] = phi i32 [ [[INC:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[R_011:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY]] ]		; CHECK-NEXT: [[R_011:%.*]] = phi i64 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[I_012]]		; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i8, i8* [[X]], i32 [[I_012]]
; CHECK-NEXT: [[TMP0:%.*]] = load i8, i8* [[ARRAYIDX]], align 1		; CHECK-NEXT: [[TMP11:%.*]] = load i8, i8* [[ARRAYIDX]], align 1
; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP0]] to i32		; CHECK-NEXT: [[CONV:%.*]] = zext i8 [[TMP11]] to i32
; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i8, i8* [[Y:%.*]], i32 [[I_012]]		; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i8, i8* [[Y]], i32 [[I_012]]
; CHECK-NEXT: [[TMP1:%.*]] = load i8, i8* [[ARRAYIDX1]], align 1		; CHECK-NEXT: [[TMP12:%.*]] = load i8, i8* [[ARRAYIDX1]], align 1
; CHECK-NEXT: [[CONV2:%.*]] = zext i8 [[TMP1]] to i32		; CHECK-NEXT: [[CONV2:%.*]] = zext i8 [[TMP12]] to i32
; CHECK-NEXT: [[MUL:%.*]] = mul nuw nsw i32 [[CONV2]], [[CONV]]		; CHECK-NEXT: [[MUL:%.*]] = mul nuw nsw i32 [[CONV2]], [[CONV]]
; CHECK-NEXT: [[CONV3:%.*]] = zext i32 [[MUL]] to i64		; CHECK-NEXT: [[CONV3:%.*]] = zext i32 [[MUL]] to i64
; CHECK-NEXT: [[ADD]] = add nuw nsw i64 [[R_011]], [[CONV3]]		; CHECK-NEXT: [[ADD]] = add nuw nsw i64 [[R_011]], [[CONV3]]
; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_012]], 1		; CHECK-NEXT: [[INC]] = add nuw nsw i32 [[I_012]], 1
; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]		; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i32 [[INC]], [[N]]
; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]]		; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], [[LOOP19:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY]] ], [ [[ADD]], [[FOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[ADD]], [[FOR_BODY]] ], [ [[TMP9]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i64 [[R_0_LCSSA]]		; CHECK-NEXT: ret i64 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp10 = icmp sgt i32 %n, 0		%cmp10 = icmp sgt i32 %n, 0
br i1 %cmp10, label %for.body, label %for.cond.cleanup		br i1 %cmp10, label %for.body, label %for.cond.cleanup

for.body: ; preds = %entry, %for.body		for.body: ; preds = %entry, %for.body
%i.012 = phi i32 [ %inc, %for.body ], [ 0, %entry ]		%i.012 = phi i32 [ %inc, %for.body ], [ 0, %entry ]
[... 11 lines not shown ...]
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i64 [ 0, %entry ], [ %add, %for.body ]
ret i64 %r.0.lcssa		ret i64 %r.0.lcssa
}		}

		; 4x to use VMLA.u32
define i32 @mla_i32_i32(i32* nocapture readonly %x, i32* nocapture readonly %y, i32 %n) #0 {		define i32 @mla_i32_i32(i32* nocapture readonly %x, i32* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i32_i32(		; CHECK-LABEL: @mla_i32_i32(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP8:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP8]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP8]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4
[... 9 lines not shown ...]
; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[TMP2]] to <4 x i32>*		; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[TMP2]] to <4 x i32>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP3]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <4 x i32> @llvm.masked.load.v4i32.p0v4i32(<4 x i32>* [[TMP3]], i32 4, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> poison)
; CHECK-NEXT: [[TMP4:%.*]] = mul nsw <4 x i32> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]		; CHECK-NEXT: [[TMP4:%.*]] = mul nsw <4 x i32> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]
; CHECK-NEXT: [[TMP5:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP4]], <4 x i32> zeroinitializer		; CHECK-NEXT: [[TMP5:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP4]], <4 x i32> zeroinitializer
; CHECK-NEXT: [[TMP6:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP5]])		; CHECK-NEXT: [[TMP6:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP5]])
; CHECK-NEXT: [[TMP7]] = add i32 [[TMP6]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP7]] = add i32 [[TMP6]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP8]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP7:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP8]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP20:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP7]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP7]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i32 [[R_0_LCSSA]]		; CHECK-NEXT: ret i32 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp8 = icmp sgt i32 %n, 0		%cmp8 = icmp sgt i32 %n, 0
br i1 %cmp8, label %for.body, label %for.cond.cleanup		br i1 %cmp8, label %for.body, label %for.cond.cleanup

[... 10 lines not shown ...]
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
ret i32 %r.0.lcssa		ret i32 %r.0.lcssa
}		}

		; 8x to use VMLA.u16
define i32 @mla_i16_i32(i16* nocapture readonly %x, i16* nocapture readonly %y, i32 %n) #0 {		define i32 @mla_i16_i32(i16* nocapture readonly %x, i16* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i16_i32(		; CHECK-LABEL: @mla_i16_i32(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP9:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP9:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP9]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP9]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])		; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i16, i16* [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <4 x i16>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[TMP0]] to <8 x i16>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* [[TMP1]], i32 2, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i16> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* [[TMP1]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
; CHECK-NEXT: [[TMP2:%.*]] = sext <4 x i16> [[WIDE_MASKED_LOAD]] to <4 x i32>		; CHECK-NEXT: [[TMP2:%.*]] = sext <8 x i16> [[WIDE_MASKED_LOAD]] to <8 x i32>
; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, i16* [[Y:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i16, i16* [[Y:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <4 x i16>*		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i16* [[TMP3]] to <8 x i16>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <4 x i16> @llvm.masked.load.v4i16.p0v4i16(<4 x i16>* [[TMP4]], i32 2, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i16> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* [[TMP4]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
; CHECK-NEXT: [[TMP5:%.*]] = sext <4 x i16> [[WIDE_MASKED_LOAD1]] to <4 x i32>		; CHECK-NEXT: [[TMP5:%.*]] = sext <8 x i16> [[WIDE_MASKED_LOAD1]] to <8 x i32>
; CHECK-NEXT: [[TMP6:%.*]] = mul nsw <4 x i32> [[TMP5]], [[TMP2]]		; CHECK-NEXT: [[TMP6:%.*]] = mul nsw <8 x i32> [[TMP5]], [[TMP2]]
; CHECK-NEXT: [[TMP7:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP6]], <4 x i32> zeroinitializer		; CHECK-NEXT: [[TMP7:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i32> [[TMP6]], <8 x i32> zeroinitializer
; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP7]])		; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.v8i32(<8 x i32> [[TMP7]])
; CHECK-NEXT: [[TMP9]] = add i32 [[TMP8]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP9]] = add i32 [[TMP8]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP10]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP8:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP10]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP21:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP9]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP9]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i32 [[R_0_LCSSA]]		; CHECK-NEXT: ret i32 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp9 = icmp sgt i32 %n, 0		%cmp9 = icmp sgt i32 %n, 0
br i1 %cmp9, label %for.body, label %for.cond.cleanup		br i1 %cmp9, label %for.body, label %for.cond.cleanup

[... 12 lines not shown ...]
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
ret i32 %r.0.lcssa		ret i32 %r.0.lcssa
}		}

		; 16x to use VMLA.u8
define i32 @mla_i8_i32(i8* nocapture readonly %x, i8* nocapture readonly %y, i32 %n) #0 {		define i32 @mla_i8_i32(i8* nocapture readonly %x, i8* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i8_i32(		; CHECK-LABEL: @mla_i8_i32(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP9:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP9:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP9]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP9]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 3		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -4		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -16
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 [[INDEX]], i32 [[N]])		; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <4 x i8>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <16 x i8>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* [[TMP1]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP1]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i8> [[WIDE_MASKED_LOAD]] to <4 x i32>		; CHECK-NEXT: [[TMP2:%.*]] = zext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i32>
; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[Y:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[Y:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <4 x i8>*		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <16 x i8>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <4 x i8> @llvm.masked.load.v4i8.p0v4i8(<4 x i8>* [[TMP4]], i32 1, <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i8> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP4]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP5:%.*]] = zext <4 x i8> [[WIDE_MASKED_LOAD1]] to <4 x i32>		; CHECK-NEXT: [[TMP5:%.*]] = zext <16 x i8> [[WIDE_MASKED_LOAD1]] to <16 x i32>
; CHECK-NEXT: [[TMP6:%.*]] = mul nuw nsw <4 x i32> [[TMP5]], [[TMP2]]		; CHECK-NEXT: [[TMP6:%.*]] = mul nuw nsw <16 x i32> [[TMP5]], [[TMP2]]
; CHECK-NEXT: [[TMP7:%.*]] = select <4 x i1> [[ACTIVE_LANE_MASK]], <4 x i32> [[TMP6]], <4 x i32> zeroinitializer		; CHECK-NEXT: [[TMP7:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i32> [[TMP6]], <16 x i32> zeroinitializer
; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP7]])		; CHECK-NEXT: [[TMP8:%.*]] = call i32 @llvm.vector.reduce.add.v16i32(<16 x i32> [[TMP7]])
; CHECK-NEXT: [[TMP9]] = add i32 [[TMP8]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP9]] = add i32 [[TMP8]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP10]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP9:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP10]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP22:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP9]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[TMP9]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i32 [[R_0_LCSSA]]		; CHECK-NEXT: ret i32 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp9 = icmp sgt i32 %n, 0		%cmp9 = icmp sgt i32 %n, 0
br i1 %cmp9, label %for.body, label %for.cond.cleanup		br i1 %cmp9, label %for.body, label %for.cond.cleanup

[... 12 lines not shown ...]
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i32 [ 0, %entry ], [ %add, %for.body ]
ret i32 %r.0.lcssa		ret i32 %r.0.lcssa
}		}

		; 8x to use VMLA.u16
define signext i16 @mla_i16_i16(i16* nocapture readonly %x, i16* nocapture readonly %y, i32 %n) #0 {		define signext i16 @mla_i16_i16(i16* nocapture readonly %x, i16* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i16_i16(		; CHECK-LABEL: @mla_i16_i16(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP11]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP11]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8
[... 9 lines not shown ...]
; CHECK-NEXT: [[TMP3:%.*]] = bitcast i16* [[TMP2]] to <8 x i16>*		; CHECK-NEXT: [[TMP3:%.*]] = bitcast i16* [[TMP2]] to <8 x i16>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* [[TMP3]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <8 x i16> @llvm.masked.load.v8i16.p0v8i16(<8 x i16>* [[TMP3]], i32 2, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> poison)
; CHECK-NEXT: [[TMP4:%.*]] = mul <8 x i16> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]		; CHECK-NEXT: [[TMP4:%.*]] = mul <8 x i16> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]
; CHECK-NEXT: [[TMP5:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> [[TMP4]], <8 x i16> zeroinitializer		; CHECK-NEXT: [[TMP5:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> [[TMP4]], <8 x i16> zeroinitializer
; CHECK-NEXT: [[TMP6:%.*]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> [[TMP5]])		; CHECK-NEXT: [[TMP6:%.*]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> [[TMP5]])
; CHECK-NEXT: [[TMP7]] = add i16 [[TMP6]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP7]] = add i16 [[TMP6]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8
; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP8]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP10:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP8]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP23:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TMP7]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TMP7]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i16 [[R_0_LCSSA]]		; CHECK-NEXT: ret i16 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp11 = icmp sgt i32 %n, 0		%cmp11 = icmp sgt i32 %n, 0
br i1 %cmp11, label %for.body, label %for.cond.cleanup		br i1 %cmp11, label %for.body, label %for.cond.cleanup

[... 10 lines not shown ...]
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i16 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i16 [ 0, %entry ], [ %add, %for.body ]
ret i16 %r.0.lcssa		ret i16 %r.0.lcssa
}		}

		; 16x to use VMLA.u8
define signext i16 @mla_i8_i16(i8* nocapture readonly %x, i8* nocapture readonly %y, i32 %n) #0 {		define signext i16 @mla_i8_i16(i8* nocapture readonly %x, i8* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i8_i16(		; CHECK-LABEL: @mla_i8_i16(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP11:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP11]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP11]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 7		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -8		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -16
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]		; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:		; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[INDEX:%.*]] = phi i32 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i16 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[VEC_PHI:%.*]] = phi i16 [ 0, [[VECTOR_PH]] ], [ [[TMP9:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <8 x i1> @llvm.get.active.lane.mask.v8i1.i32(i32 [[INDEX]], i32 [[N]])		; CHECK-NEXT: [[ACTIVE_LANE_MASK:%.*]] = call <16 x i1> @llvm.get.active.lane.mask.v16i1.i32(i32 [[INDEX]], i32 [[N]])
; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds i8, i8* [[X:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <8 x i8>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast i8* [[TMP0]] to <16 x i8>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* [[TMP1]], i32 1, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i8> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP1]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP2:%.*]] = zext <8 x i8> [[WIDE_MASKED_LOAD]] to <8 x i16>		; CHECK-NEXT: [[TMP2:%.*]] = zext <16 x i8> [[WIDE_MASKED_LOAD]] to <16 x i16>
; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[Y:%.*]], i32 [[INDEX]]		; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i8, i8* [[Y:%.*]], i32 [[INDEX]]
; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <8 x i8>*		; CHECK-NEXT: [[TMP4:%.*]] = bitcast i8* [[TMP3]] to <16 x i8>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <8 x i8> @llvm.masked.load.v8i8.p0v8i8(<8 x i8>* [[TMP4]], i32 1, <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i8> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP4]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP5:%.*]] = zext <8 x i8> [[WIDE_MASKED_LOAD1]] to <8 x i16>		; CHECK-NEXT: [[TMP5:%.*]] = zext <16 x i8> [[WIDE_MASKED_LOAD1]] to <16 x i16>
; CHECK-NEXT: [[TMP6:%.*]] = mul nuw <8 x i16> [[TMP5]], [[TMP2]]		; CHECK-NEXT: [[TMP6:%.*]] = mul nuw <16 x i16> [[TMP5]], [[TMP2]]
; CHECK-NEXT: [[TMP7:%.*]] = select <8 x i1> [[ACTIVE_LANE_MASK]], <8 x i16> [[TMP6]], <8 x i16> zeroinitializer		; CHECK-NEXT: [[TMP7:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i16> [[TMP6]], <16 x i16> zeroinitializer
; CHECK-NEXT: [[TMP8:%.*]] = call i16 @llvm.vector.reduce.add.v8i16(<8 x i16> [[TMP7]])		; CHECK-NEXT: [[TMP8:%.*]] = call i16 @llvm.vector.reduce.add.v16i16(<16 x i16> [[TMP7]])
; CHECK-NEXT: [[TMP9]] = add i16 [[TMP8]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP9]] = add i16 [[TMP8]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 8		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP10]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP11:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP10]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP24:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TMP9]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i16 [ 0, [[ENTRY:%.*]] ], [ [[TMP9]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i16 [[R_0_LCSSA]]		; CHECK-NEXT: ret i16 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp11 = icmp sgt i32 %n, 0		%cmp11 = icmp sgt i32 %n, 0
br i1 %cmp11, label %for.body, label %for.cond.cleanup		br i1 %cmp11, label %for.body, label %for.cond.cleanup

[... 12 lines collapsed ...]	for.body: ; preds = %entry, %for.body
%exitcond = icmp eq i32 %inc, %n		%exitcond = icmp eq i32 %inc, %n
br i1 %exitcond, label %for.cond.cleanup, label %for.body		br i1 %exitcond, label %for.cond.cleanup, label %for.body

for.cond.cleanup: ; preds = %for.body, %entry		for.cond.cleanup: ; preds = %for.body, %entry
%r.0.lcssa = phi i16 [ 0, %entry ], [ %add, %for.body ]		%r.0.lcssa = phi i16 [ 0, %entry ], [ %add, %for.body ]
ret i16 %r.0.lcssa		ret i16 %r.0.lcssa
}		}

		; 16x to use VMLA.u8
define zeroext i8 @mla_i8_i8(i8* nocapture readonly %x, i8* nocapture readonly %y, i32 %n) #0 {		define zeroext i8 @mla_i8_i8(i8* nocapture readonly %x, i8* nocapture readonly %y, i32 %n) #0 {
; CHECK-LABEL: @mla_i8_i8(		; CHECK-LABEL: @mla_i8_i8(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[CMP10:%.*]] = icmp sgt i32 [[N:%.*]], 0		; CHECK-NEXT: [[CMP10:%.*]] = icmp sgt i32 [[N:%.*]], 0
; CHECK-NEXT: br i1 [[CMP10]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]		; CHECK-NEXT: br i1 [[CMP10]], label [[VECTOR_PH:%.*]], label [[FOR_COND_CLEANUP:%.*]]
; CHECK: vector.ph:		; CHECK: vector.ph:
; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15		; CHECK-NEXT: [[N_RND_UP:%.*]] = add i32 [[N]], 15
; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -16		; CHECK-NEXT: [[N_VEC:%.*]] = and i32 [[N_RND_UP]], -16
[... 9 lines collapsed ...]
; CHECK-NEXT: [[TMP3:%.*]] = bitcast i8* [[TMP2]] to <16 x i8>*		; CHECK-NEXT: [[TMP3:%.*]] = bitcast i8* [[TMP2]] to <16 x i8>*
; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP3]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)		; CHECK-NEXT: [[WIDE_MASKED_LOAD1:%.*]] = call <16 x i8> @llvm.masked.load.v16i8.p0v16i8(<16 x i8>* [[TMP3]], i32 1, <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> poison)
; CHECK-NEXT: [[TMP4:%.*]] = mul <16 x i8> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]		; CHECK-NEXT: [[TMP4:%.*]] = mul <16 x i8> [[WIDE_MASKED_LOAD1]], [[WIDE_MASKED_LOAD]]
; CHECK-NEXT: [[TMP5:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> [[TMP4]], <16 x i8> zeroinitializer		; CHECK-NEXT: [[TMP5:%.*]] = select <16 x i1> [[ACTIVE_LANE_MASK]], <16 x i8> [[TMP4]], <16 x i8> zeroinitializer
; CHECK-NEXT: [[TMP6:%.*]] = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> [[TMP5]])		; CHECK-NEXT: [[TMP6:%.*]] = call i8 @llvm.vector.reduce.add.v16i8(<16 x i8> [[TMP5]])
; CHECK-NEXT: [[TMP7]] = add i8 [[TMP6]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP7]] = add i8 [[TMP6]], [[VEC_PHI]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 16
; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP8:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP8]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP12:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP8]], label [[FOR_COND_CLEANUP]], label [[VECTOR_BODY]], [[LOOP25:!llvm.loop !.*]]
; CHECK: for.cond.cleanup:		; CHECK: for.cond.cleanup:
; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i8 [ 0, [[ENTRY:%.*]] ], [ [[TMP7]], [[VECTOR_BODY]] ]		; CHECK-NEXT: [[R_0_LCSSA:%.*]] = phi i8 [ 0, [[ENTRY:%.*]] ], [ [[TMP7]], [[VECTOR_BODY]] ]
; CHECK-NEXT: ret i8 [[R_0_LCSSA]]		; CHECK-NEXT: ret i8 [[R_0_LCSSA]]
;		;
entry:		entry:
%cmp10 = icmp sgt i32 %n, 0		%cmp10 = icmp sgt i32 %n, 0
br i1 %cmp10, label %for.body, label %for.cond.cleanup		br i1 %cmp10, label %for.body, label %for.cond.cleanup

[... 43 lines collapsed ...]
; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>		; CHECK-NEXT: [[STRIDED_VEC:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>		; CHECK-NEXT: [[STRIDED_VEC1:%.*]] = shufflevector <8 x i32> [[WIDE_VEC]], <8 x i32> poison, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
; CHECK-NEXT: [[TMP7:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[STRIDED_VEC1]])		; CHECK-NEXT: [[TMP7:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[STRIDED_VEC1]])
; CHECK-NEXT: [[TMP8:%.*]] = add i32 [[TMP7]], [[VEC_PHI]]		; CHECK-NEXT: [[TMP8:%.*]] = add i32 [[TMP7]], [[VEC_PHI]]
; CHECK-NEXT: [[TMP9:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[STRIDED_VEC]])		; CHECK-NEXT: [[TMP9:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[STRIDED_VEC]])
; CHECK-NEXT: [[TMP10]] = add i32 [[TMP9]], [[TMP8]]		; CHECK-NEXT: [[TMP10]] = add i32 [[TMP9]], [[TMP8]]
; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4		; CHECK-NEXT: [[INDEX_NEXT]] = add i32 [[INDEX]], 4
; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]		; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i32 [[INDEX_NEXT]], [[N_VEC]]
; CHECK-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP13:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], [[LOOP26:!llvm.loop !.*]]
; CHECK: middle.block:		; CHECK: middle.block:
; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[TMP2]], [[N_VEC]]		; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i32 [[TMP2]], [[N_VEC]]
; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT]], label [[SCALAR_PH]]		; CHECK-NEXT: br i1 [[CMP_N]], label [[EXIT]], label [[SCALAR_PH]]
; CHECK: scalar.ph:		; CHECK: scalar.ph:
; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]		; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i32 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP10]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]		; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP10]], [[MIDDLE_BLOCK]] ], [ 0, [[FOR_BODY_PREHEADER]] ]
; CHECK-NEXT: br label [[FOR_BODY:%.*]]		; CHECK-NEXT: br label [[FOR_BODY:%.*]]
; CHECK: for.body:		; CHECK: for.body:
; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]		; CHECK-NEXT: [[IV:%.*]] = phi i32 [ [[IV_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[RED_PHI:%.*]] = phi i32 [ [[RED_2:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]		; CHECK-NEXT: [[RED_PHI:%.*]] = phi i32 [ [[RED_2:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
; CHECK-NEXT: [[ADD:%.*]] = or i32 [[IV]], 1		; CHECK-NEXT: [[ADD:%.*]] = or i32 [[IV]], 1
; CHECK-NEXT: [[GEP_0:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i32 [[ADD]]		; CHECK-NEXT: [[GEP_0:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i32 [[ADD]]
; CHECK-NEXT: [[L_0:%.*]] = load i32, i32* [[GEP_0]], align 4		; CHECK-NEXT: [[L_0:%.*]] = load i32, i32* [[GEP_0]], align 4
; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i32 [[IV]]		; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i32 [[IV]]
; CHECK-NEXT: [[L_1:%.*]] = load i32, i32* [[GEP_1]], align 4		; CHECK-NEXT: [[L_1:%.*]] = load i32, i32* [[GEP_1]], align 4
; CHECK-NEXT: [[RED_1:%.*]] = add i32 [[L_0]], [[RED_PHI]]		; CHECK-NEXT: [[RED_1:%.*]] = add i32 [[L_0]], [[RED_PHI]]
; CHECK-NEXT: [[RED_2]] = add i32 [[RED_1]], [[L_1]]		; CHECK-NEXT: [[RED_2]] = add i32 [[RED_1]], [[L_1]]
; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i32 [[IV]], 2		; CHECK-NEXT: [[IV_NEXT]] = add nuw nsw i32 [[IV]], 2
; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[IV_NEXT]], [[N]]		; CHECK-NEXT: [[CMP:%.*]] = icmp slt i32 [[IV_NEXT]], [[N]]
; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY]], label [[EXIT]], [[LOOP14:!llvm.loop !.*]]		; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY]], label [[EXIT]], [[LOOP27:!llvm.loop !.*]]
; CHECK: exit:		; CHECK: exit:
; CHECK-NEXT: [[RET_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[RED_2]], [[FOR_BODY]] ], [ [[TMP10]], [[MIDDLE_BLOCK]] ]		; CHECK-NEXT: [[RET_LCSSA:%.*]] = phi i32 [ 0, [[ENTRY:%.*]] ], [ [[RED_2]], [[FOR_BODY]] ], [ [[TMP10]], [[MIDDLE_BLOCK]] ]
; CHECK-NEXT: ret i32 [[RET_LCSSA]]		; CHECK-NEXT: ret i32 [[RET_LCSSA]]
;		;
entry:		entry:
%guard = icmp sgt i32 %n, 0		%guard = icmp sgt i32 %n, 0
br i1 %guard , label %for.body, label %exit		br i1 %guard , label %for.body, label %exit

[... 20 lines collapsed ...]

This is an archive of the discontinued LLVM Phabricator instance.

[LV][ARM] Inloop reduction cost modelling
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 318293

llvm/include/llvm/Analysis/TargetTransformInfo.h

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

llvm/include/llvm/CodeGen/BasicTTIImpl.h

llvm/lib/Analysis/TargetTransformInfo.cpp

llvm/lib/Target/ARM/ARMTargetTransformInfo.h

llvm/lib/Target/ARM/ARMTargetTransformInfo.cpp

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

llvm/test/Transforms/LoopVectorize/ARM/mve-reduction-types.ll

llvm/test/Transforms/LoopVectorize/ARM/mve-reductions.ll
