This adds cost modelling for the in-loop vectorization added in 745bf6cf4471. Up until now these reductions have been modelled as the original underlying instruction, usually an add. That happens to work OK for MVE with instructions that are reducing into the same type as they are working on. But MVE's instructions can perform the equivalent of an extended MLA as a single instruction:

  %sa = sext <16 x i8> A to <16 x i32>
  %sb = sext <16 x i8> B to <16 x i32>
  %m = mul <16 x i32> %sa, %sb
  %r = vecreduce.add(%m)
  -> R = VMLADAV A, B
There are other instructions for performing add reductions of v4i32/v8i16/v16i8 into i32 (VADDV), for doing the same with v4i32->i64 (VADDLV), and for performing a v4i32/v8i16 MLA into an i64 (VMLALDAV). The i64 variants are particularly interesting, as there are no native i64 add/mul instructions, so the i64 add and mul would otherwise naturally get very high costs.
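For example, in the same notation as above, the v4i32 MLA into an i64 would be:

  %sa = sext <4 x i32> A to <4 x i64>
  %sb = sext <4 x i32> B to <4 x i64>
  %m = mul <4 x i64> %sa, %sb
  %r = vecreduce.add(%m)
  -> R = VMLALDAV A, B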
Also worth mentioning: under NEON there is the concept of a sdot/udot instruction, which performs a partial reduction from a v16i8 to a v4i32. They extend and mul/sum the first four elements from the inputs into the first element of the output, repeating for each of the four output lanes. They could possibly be represented in the same way as above in LLVM, so long as a vecreduce.add could perform a partial reduction. The vectorizer would then produce a combination of in-loop and outer-loop reductions to efficiently use the sdot and udot instructions. Although this patch does not do that yet, it does suggest that separating the input reduction type from the produced result type is a useful concept to model. It also shows that an MLA reduction as a single instruction is fairly common.
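Purely as an illustration of that idea (no such partial vecreduce exists today, so the name below is hypothetical), sdot might be represented as:

  %sa = sext <16 x i8> A to <16 x i32>
  %sb = sext <16 x i8> B to <16 x i32>
  %m = mul <16 x i32> %sa, %sb
  %r = vecreduce.add.partial(%m)  ; hypothetical: produces <4 x i32>, each
                                  ; lane summing four adjacent products
  -> R = SDOT A, B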
This patch attempts to improve the cost modelling of in-loop reductions by:
- Adding some pattern matching in the loop vectorizer cost model to match reduction patterns that are optionally extended and/or MLA patterns (a rough sketch of the matching follows this list). This costs the reduction instruction correctly and marks the sext/zext/mul leading up to it as free, which is otherwise difficult to detect and can otherwise be given a very high cost. (In the long run this can hopefully be replaced by vplan producing a single node and costing it correctly, but that is not yet something that vplan can do.)
- getArithmeticReductionCost is expanded to include a new result type for the reduction and a flag specifying whether the reduction is an MLA pattern (a sketch of the expanded signature also follows this list).
- The ARM costs are expanded to account for these extended reduction sizes, which is a fairly simple change in itself.
- Some minor alterations to allow in-loop reductions larger than the highest vector width, and i64 MVE reductions.
- An extra InLoopReductionImmediateChains map was added to the vectorizer so that it can efficiently detect which instructions are reductions in the cost model.
- The tests have some updates to show what I believe is optimal vectorization and where we are now.
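
A rough sketch of the pattern matching from the first point, assuming RedOp is the value feeding the in-loop reduction add (illustrative only; the helper name is mine and this is not the exact code in the patch):

  #include "llvm/IR/PatternMatch.h"

  using namespace llvm;
  using namespace llvm::PatternMatch;

  // Does the value feeding an in-loop add reduction look like
  // mul(ext(A), ext(B)) (an MLA pattern) or a plain ext(A)?
  static bool isExtendedReductionPattern(Value *RedOp, bool &IsMLA) {
    Value *A, *B;
    // mul(sext/zext(A), sext/zext(B)): an extended MLA. A real
    // implementation would also check that both extends are of the
    // same kind and from the same source type.
    if (match(RedOp,
              m_Mul(m_ZExtOrSExt(m_Value(A)), m_ZExtOrSExt(m_Value(B))))) {
      IsMLA = true;
      return true;
    }
    // A plain sext/zext feeding the reduction: an extended reduction.
    IsMLA = false;
    return match(RedOp, m_ZExtOrSExt(m_Value(A)));
  }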
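And a sketch of what the expanded getArithmeticReductionCost interface might look like (parameter names and return type are illustrative, and pre-existing parameters such as the cost kind are elided):

  int getArithmeticReductionCost(
      unsigned Opcode,    // the reduction operation, e.g. Instruction::Add
      VectorType *ValTy,  // the vector type being reduced
      Type *ResTy,        // new: the scalar result type, which may be wider
                          //      than ValTy's element type (e.g. v4i32->i64)
      bool IsMLA);        // new: true when reducing mul(ext(A), ext(B))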
Put together, this can greatly improve performance for reduction loops under MVE.
I am wondering if IsMLA is a bit too narrow as an interface, perhaps even unclear. If this is similar to getArithmeticReductionCost, as mentioned in the comment, which takes an opcode, should this also take an opcode instead of IsMLA? The advantage would be that we could describe costs for different types of reductions. Or is this not useful/necessary?
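To make that suggestion concrete, the alternative might look something like this (entirely hypothetical shape and names):

  int getArithmeticReductionCost(
      unsigned Opcode,       // the reduction operation, e.g. Instruction::Add
      unsigned InnerOpcode,  // opcode feeding the reduction, e.g.
                             // Instruction::Mul for an MLA, or 0 for a
                             // plain (possibly extended) reduction
      VectorType *ValTy, Type *ResTy);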