This is an archive of the discontinued LLVM Phabricator instance.

[CostModel] Remove VF from IntrinsicCostAttributes
ClosedPublic

Authored by dmgreen on Jan 23 2021, 12:11 PM.

Details

Summary

getIntrinsicInstrCost takes an IntrinsicCostAttributes holding various parameters of the intrinsic being costed. It can be called with a scalar intrinsic (RetTy==Scalar, VF==1), with a vector instruction (RetTy==Vector, VF==1), or from the vectorizer with a scalar type and vector width (RetTy==Scalar, VF>1). RetTy==Vector with VF>1 is considered an error. Both of the vector modes are expected to be treated the same, but because this is confusing, many backends end up getting it wrong.

Instead of trying to work with those two values separately, this removes the VF parameter and instead widens the RetTy/ArgTys by VF when called from the vectorizer. This keeps things simpler, but does require some other modifications to keep things consistent.
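
To illustrate the direction (a minimal sketch, not the code from this patch; the helper name and call shape are illustrative assumptions), the idea is that callers such as the vectorizer widen the types themselves before building the attributes, so the cost model only ever sees the final vector types:

// Sketch: widen a scalar type by VF before constructing the cost attributes.
// FixedVectorType::get is the standard LLVM API; widenByVF is a hypothetical helper.
#include "llvm/IR/DerivedTypes.h"

static llvm::Type *widenByVF(llvm::Type *Ty, unsigned VF) {
  if (VF <= 1 || Ty->isVoidTy() || Ty->isVectorTy())
    return Ty; // nothing to widen, or already a vector
  return llvm::FixedVectorType::get(Ty, VF);
}

// A caller would then pass widenByVF(ScalarRetTy, VF) and widened argument
// types into IntrinsicCostAttributes, instead of a scalar type plus a VF.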

For most backends this looks like an improvement (or they were not using getIntrinsicInstrCost). AMDGPU needed the most changes to keep the code from c230965ccf36af5c88c working. ARM removes the fix from dfac521da1b90db683, WebAssembly happens to get a fixup for an SLP cost issue, and both X86 and AArch64 now seem to be using better costs from the vectorizer.

Diff Detail

Event Timeline

dmgreen created this revision.Jan 23 2021, 12:11 PM
dmgreen requested review of this revision.Jan 23 2021, 12:11 PM
Herald added a project: Restricted Project. Jan 23 2021, 12:11 PM
Herald added subscribers: aheejin, wdng.

Thanks for working on this. Last week I was really puzzled when looking at this code and I wasn't sure if there was a good reason for it, so it's good to see it get fixed!

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
781

Why are you specially calling out uadd_sat and others?

llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll
11–13

Are there changes to these costs because it would previously calculate scalarization cost + 2 * cost(sadd.sat.i16), where it can now calculate the cost as cost(sadd.sat.v2i16)?

sdesmalen removed a subscriber: sdesmalen.
dmgreen added inline comments.Jan 25 2021, 12:46 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
781

Because of intrinsicHasPackedVectorBenefit above. For most other intrinsics it will just end up calling BaseT::getIntrinsicInstrCost.
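
For context, a rough sketch of the kind of gating being referred to (not the actual AMDGPU code; the exact list of intrinsics lives in intrinsicHasPackedVectorBenefit in the backend):

// Sketch only: a few intrinsics are candidates for packed (e.g. v2f16/v2i16)
// lowering and get special-cased costs; everything else falls back to
// BaseT::getIntrinsicInstrCost.
#include "llvm/IR/Intrinsics.h"

static bool hasPackedVectorBenefitSketch(llvm::Intrinsic::ID ID) {
  switch (ID) {
  case llvm::Intrinsic::fma:
  case llvm::Intrinsic::uadd_sat:
  case llvm::Intrinsic::usub_sat:
  case llvm::Intrinsic::sadd_sat:
  case llvm::Intrinsic::ssub_sat:
    return true;
  default:
    return false;
  }
}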

llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll
11–13

Yes, I believe so. At least, the old cost was from Base::getIntrinsicInstrCost, as the overrides from AArch64TargetTransformInfo were not taking effect. I'm not 100% sure how it was calculated, but that sounds very plausible.

https://reviews.llvm.org/D95292 updated the sadd.sat costs.

sdesmalen added inline comments.Jan 25 2021, 1:00 AM
llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
781

Is this change then to keep the cost the same as before this patch? I would have expected NElts not to need any change; maybe the cost was just wrong before.

llvm/test/Transforms/LoopVectorize/AArch64/intrinsiccost.ll
11–13

Ah, I hadn't seen that patch yet; that explains things. Given that D95292 hasn't landed, is this patch missing a dependency?

samparker added inline comments.Jan 25 2021, 3:38 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
58–59

Could it help to further reduce complexity in TTI if we moved the vectorizer-specific changes into the vectorizer? (Assuming that the constructors are one of the ways in which TTI is used by the vectorizer.)

tlively resigned from this revision.Jan 26 2021, 12:24 AM

I'm happy to see that this patch improves the WebAssembly code, but I am not familiar enough with the relevant code to comment on the rest of the patch. No concerns from me.

dmgreen updated this revision to Diff 319522.Jan 27 2021, 3:50 AM
dmgreen added inline comments.
llvm/lib/Analysis/TargetTransformInfo.cpp
58–59

I have given it a try, and ended up adjusting some of the other constructors at the same time. It doesn't feel like the cleanest thing in the world, but it keeps the code in the vectorizer. Let me know what you think.
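
As a rough illustration of what keeping this in one place could look like (a sketch under the assumption that the widening happens when the attributes are built from a scalar call plus a vectorization factor; this is not the exact constructor added by the patch, and the type name is a stand-in):

// Sketch: build widened return/argument types from a scalar call and a VF at
// construction time, so the rest of TTI never needs to see the VF.
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Instructions.h"

struct IntrinsicCostAttributesSketch {
  llvm::Type *RetTy;
  llvm::SmallVector<llvm::Type *, 4> ArgTys;

  IntrinsicCostAttributesSketch(const llvm::CallInst &CI, unsigned VF) {
    auto Widen = [VF](llvm::Type *Ty) -> llvm::Type * {
      if (VF <= 1 || Ty->isVoidTy() || Ty->isVectorTy())
        return Ty;
      return llvm::FixedVectorType::get(Ty, VF);
    };
    RetTy = Widen(CI.getType());
    for (const llvm::Use &Arg : CI.args())
      ArgTys.push_back(Widen(Arg.get()->getType()));
  }
};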

llvm/lib/Target/AMDGPU/AMDGPUTargetTransformInfo.cpp
781

Yeah, I think the cost was incorrect before for vector instructions, but correct for the vectorizer's costing of vectorized scalar instructions (because it was looking at RetTy and ignoring VF). I was hoping someone from AMDGPU could take a look and check this is OK. I tried some f16 FMA and ssat routines on a gfx900 and they all ended up with the same vectorization, which is a good sign at least.

sdesmalen accepted this revision.Feb 2 2021, 3:56 AM

Maybe you want to give it a day for someone to confirm the changes to AMDGPUTargetTransformInfo.cpp, but otherwise LGTM, thanks!

This revision is now accepted and ready to land.Feb 2 2021, 3:56 AM
dmgreen added a reviewer: Restricted Project.Feb 2 2021, 9:40 AM

Thanks.

arsenm accepted this revision.Feb 2 2021, 9:43 AM

^ Thanks for taking a look :)

I'll commit this hopefully tomorrow.

samparker added inline comments.Feb 3 2021, 2:40 AM
llvm/lib/Analysis/TargetTransformInfo.cpp
58–59

Cheers Dave!

This revision was automatically updated to reflect the committed changes.
jsji added a subscriber: jsji.EditedFeb 8 2021, 9:15 AM

Looks like this is causing an assert in the test-suite when built with CMake on PowerPC.
The current PowerPC buildbots are using Make, so this was not triggered (-fenable-matrix was not tested); we will update that soon.

@dmgreen Can you have a quick look? Thanks.

Reduced source and command line to reproduce.

.../build/bin/clang -cc1 -fenable-matrix -triple=powerpc64le-unknown-linux-gnu -O3 -emit-llvm matrix-types-spec-f4acb3.reduced.cpp

$ cat matrix-types-spec-f4acb3.reduced.cpp
template <typename a, int b, int c, int d> void e() {
  a f, g, h;
  using i = a __attribute__((matrix_type(b, d)));
  auto j = __builtin_matrix_column_major_load(&g, b, c, b);
  auto k = __builtin_matrix_column_major_load(&f, c, d, c);
  i l = j * k;
  __builtin_matrix_column_major_store(l, &h, b);
}
template <typename, int b, int c, int> void m() { e<int, b, c, 1>(); }
int main() { m<double, 10, 1, 3>(); }
jsji added a comment.Feb 8 2021, 11:20 AM

Reduced IR:

opt bugpoint-reduced-simplified.ll -inline

target datalayout = "E-m:e-i64:64-n32:64-v256:256:256-v512:512:512"
target triple = "powerpc64-unknown-linux-gnu"

$_Z1mIdLi10ELi1ELi3EEvv = comdat any
$_Z1eIiLi10ELi1ELi1EEvv = comdat any

define void @_Z1mIdLi10ELi1ELi3EEvv() local_unnamed_addr #1 comdat {
entry:
  call void @_Z1eIiLi10ELi1ELi1EEvv()
  ret void
}

define void @_Z1eIiLi10ELi1ELi1EEvv() local_unnamed_addr #1 comdat {
entry:
  %matrix1 = call <1 x i32> @llvm.matrix.column.major.load.v1i32(i32* nonnull align 4 undef, i64 1, i1 false, i32 1, i32 1)
  %0 = call <10 x i32> @llvm.matrix.multiply.v10i32.v10i32.v1i32(<10 x i32> undef, <1 x i32> %matrix1, i32 10, i32 1, i32 1)
  ret void
}

declare <1 x i32> @llvm.matrix.column.major.load.v1i32(i32* nocapture, i64, i1 immarg, i32 immarg, i32 immarg) #2
declare <10 x i32> @llvm.matrix.multiply.v10i32.v10i32.v1i32(<10 x i32>, <1 x i32>, i32 immarg, i32 immarg, i32 immarg) #3

Hello. Thanks for the report. It looks like there is still a scalarization overhead calculation that assumes the VF still means something, but that doesn't hold for these matrix intrinsics (their return and argument vectors can have different element counts, as in the reduced IR above).

Let me try and put together something to fix it.

jsji added a comment.Feb 8 2021, 11:37 AM

Hello. Thanks for the report. It looks like there is still a scalarization overhead calculation that assumes the VF still means something, but that doesn't hold for these matrix intrinsics!

Let me try and put together something to fix it.

Thanks Dave.

OK. I've put up D96287 for review. It required more surgery than I was hoping in order to get it right. If you need to revert this in the meantime then feel free, and I can re-commit the two once the follow-up is ready.

jsji added a comment.Feb 8 2021, 6:16 PM

OK. I've put up D96287 for review. It required more surgery than I was hoping in order to get it right. If you need to revert this in the meantime then feel free, and I can re-commit the two once the follow-up is ready.

Thanks Dave for the prompt action! Considering the time needed for review, I reverted it first in 92028062413907e1c10b16e7c338e0745a11a051.