This is an archive of the discontinued LLVM Phabricator instance.

Differential D32706

[AArch64] Consider widening instructions in cast cost calculation
ClosedPublic

Authored by mssimpso on May 1 2017, 11:04 AM.

Details

Reviewers

mcrosier
t.p.northover
kristof.beyls
evandro
anemet
sbaranga

Commits

rG78fd46b23054: [AArch64] Consider widening instructions in cost calculations
rL302582: [AArch64] Consider widening instructions in cost calculations

Summary

The AArch64 instruction set has a few "widening" instructions (e.g., uaddl, saddl, uaddw, etc.) that take one or more doubleword operands and produce quadword results. The operands are automatically sign- or zero-extended as appropriate. However, in LLVM IR, these extends are explicit. This patch updates TTI to consider these widening instructions when determining the cost of sign- and zero-extends. If an extend is part of a widening operation, it will be eliminated during code generation, so the reported cost should be zero.
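To make the summary concrete, here is a minimal scalar sketch of the IR pattern "zero-extend both operands, then add" for eight byte lanes, which is the pattern the tests in this revision expect to become a single uaddl v0.8h, v0.8b, v1.8b. The function name uaddl_8h_model and the use of std::array are illustrative assumptions and are not part of the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of: %t0 = zext <8 x i8> %a; %t1 = zext <8 x i8> %b;
//                  %r  = add <8 x i16> %t0, %t1
// The name and container types are hypothetical, chosen for illustration.
std::array<uint16_t, 8> uaddl_8h_model(const std::array<uint8_t, 8> &a,
                                       const std::array<uint8_t, 8> &b) {
  std::array<uint16_t, 8> out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    // The two casts stand in for the explicit zext instructions in the IR.
    out[i] = static_cast<uint16_t>(static_cast<uint16_t>(a[i]) +
                                   static_cast<uint16_t>(b[i]));
  }
  return out;
}
```

Because each lane is widened before the addition, 255 + 255 yields 510 rather than wrapping at 8 bits.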

Diff Detail

Repository: rL LLVM

Event Timeline

mssimpso created this revision. May 1 2017, 11:04 AM

Herald added subscribers: rengolin, aemerson. May 1 2017, 11:04 AM

mssimpso updated this revision to Diff 97324. May 1 2017, 11:58 AM

kristof.beyls added inline comments. May 2 2017, 12:31 AM

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
195–196 ↗	(On Diff #97324)	If the idea is that this function only returns true if the lengthening/widening is zero-cost, maybe the function name should be isLengtheningFree instead of isLengthening? Or something similar indicating the expected cost of widening is zero?
lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47 ↗	(On Diff #97324)	I think the term "widening" is used for this operation, both in the LLVM code base, as well as in the ARMARM. So, probably best to use isWidening here too.
test/Analysis/CostModel/AArch64/lengthening-casts.ll
1–14 ↗	(On Diff #97324)	Next to tests that verify that a 0 cost is returned, maybe it'd also be useful to explicitly test that the widening instruction variant actually gets generated?

Thanks Kristof! Responses inline.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
195–196 ↗	(On Diff #97324)	Sounds good! I like isLengtheningFree.
lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47 ↗	(On Diff #97324)	From what I've read, ARM uses "lengthening" to refer to the case where both operands are doublewords (e.g., uaddl) and "widening" to refer to the case when only one operand is a doubleword (e.g., uaddw). This patch only deals with the lengthening cases. We might handle the widening cases in the future, but they may not be as straightforward because the operand order matters. I can add a comment here explaining all of this.
test/Analysis/CostModel/AArch64/lengthening-casts.ll
1–14 ↗	(On Diff #97324)	We actually already have code generation tests for all of these cases over in test/CodeGen/AArch64 (see arm64-vadd.ll for the uaddl cases, for example). I just copied them over here to test the cost model. I'm fine adding an additional "llc" run-line to this test, though, that verifies the code generation property in this context. I checked and there are other cost model tests that do this (e.g., X86/scalarize.ll), so this wouldn't be that unusual.

kristof.beyls added inline comments. May 2 2017, 8:01 AM

lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47 ↗	(On Diff #97324)	I was making my comment based on Table C3-67 which contains UADDL in the ARMARM copy at https://static.docs.arm.com/ddi0487/b/DDI0487B_a_armv8_arm.pdf. The table header is "SIMD widening and narrowing arithmetic instructions". The table entry for UADDL has "UADDL, UADDL2 Unsigned add long (vector form)". Therefore, I am assuming the L in the mnemonic stands for "long" not "lengthening". Maybe there is other documentation that specifically calls this specific case "lengthening"?
test/Analysis/CostModel/AArch64/lengthening-casts.ll
1–14 ↗	(On Diff #97324)	Oh, we already have the tests to check the actual code generation. Then I think this is fine. I think that you could argue both for and against the actual code generation test being in the same location as the cost model test. I don't feel strongly enough either way to argue.

mssimpso added inline comments. May 2 2017, 8:19 AM

lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47 ↗	(On Diff #97324)	I was looking at different documentation (possibly outdated): http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.den0024a/ch07s03.html. Yours looks better to me, so let's go with isWideningFree, like you suggested.

mssimpso added inline comments. May 2 2017, 8:38 AM

lib/Target/AArch64/AArch64TargetTransformInfo.h
46–47 ↗	(On Diff #97324)	Also, it looks like we already handle the W variants as well, where only one operand is extended. I was thinking the operand order in the IR might affect our ability to match these instructions, but it doesn't. I'll update the patch to consider W variants (this will actually make it simpler since we don't need to care about the other operand of the binop), and add W tests.
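As a rough illustration of the "W" variants mentioned above, where only one operand is extended, the following scalar sketch models a usubw-style subtraction on eight lanes. The function name usubw_8h_model is a hypothetical label, and the sketch says nothing about how LLVM matches the instruction.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar model of: %t = zext <8 x i8> %narrow; %r = sub <8 x i16> %wide, %t
// Only the second operand is extended; the first is already 16 bits wide.
// The name is hypothetical, chosen for illustration.
std::array<uint16_t, 8> usubw_8h_model(const std::array<uint16_t, 8> &wide,
                                       const std::array<uint8_t, 8> &narrow) {
  std::array<uint16_t, 8> out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    // The cast stands in for the single explicit zext in the IR.
    out[i] = static_cast<uint16_t>(wide[i] - static_cast<uint16_t>(narrow[i]));
  }
  return out;
}
```

Unlike the "L" variants, there is only one extend to account for here, which is consistent with the remark that handling the W variants does not require looking at the other operand of the binop.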

Addressed Kristof's comments.

Changed wording to refer to "widening" instructions rather than "lengthening" instructions.
Changed function name to isWideningFree.
Updated the logic to consider the W instruction variants in addition to the L variants.
Added code generation tests alongside the cost test cases.
Added additional tests for the W variants.

In D32706#743682, @mssimpso wrote:

Addressed Kristof's comments.

Changed wording to refer to "widening" instructions rather than "lengthening" instructions.

Changed function name to isWideningFree.

Updated the logic to consider the W instruction variants in addition to the L variants.

Added code generation tests alongside the cost test cases.

Added additional tests for the W variants.

Thanks Matt.
I'm not familiar enough with this code to feel comfortable approving, so I think someone else should do that.
Other than that, I only have one more remark, now that I've looked at the patch again: maybe there should also be a "negative" test, i.e. verifying that a non-zero cost is computed when a widening instruction cannot be generated, e.g. when the type operated on cannot map to a native instruction?

In D32706#744317, @kristof.beyls wrote:

Thanks Matt.
I'm not familiar enough with this code to feel comfortable approving, so I think someone else should do that.
Other than that, I only have one more remark, now that I've looked at the patch again: maybe there should also be a "negative" test, i.e. verifying that a non-zero cost is computed when a widening instruction cannot be generated, e.g. when the type operated on cannot map to a native instruction?

Thanks Kristof! I'll add some negative tests and wait for more reviews.

Added some negative tests.

It shouldn't be assumed that such widening instructions are always free. Their latency may be longer than the equivalent instruction sequences. Preferably, the cost of both choices should be weighed after consulting the backend.

In D32706#745289, @evandro wrote:

It shouldn't be assumed that such widening instructions are always free. Their latency may be longer than the equivalent instruction sequences. Preferably, the cost of both choices should be weighed after consulting the backend.

Hi Evandro,

Thanks for the comment.

We mean "free" in the sense that the extends considered in this patch don't exist in the generated code. They're only there in the IR to make it well-typed (i.e., an IR "add" can't operate on <2 x i32> vectors and produce a <2 x i64> result). This is consistent with the way we use "free" in other places in the cost model, when applied to extends.
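The notion of a "free" extend described above can be sketched as a small standalone predicate. This is a toy model that loosely follows the check added to getCastInstrCost in this revision. ToyOperand, ToyBinOp, and isExtendFree are hypothetical names, and the sketch assumes the user has already been confirmed to be a two-operand add or sub with legal types.

```cpp
// Kind of value feeding a binary operator. None means "not an extend".
enum class ExtKind { None, SExt, ZExt };

// Hypothetical stand-in for an operand of the add/sub.
struct ToyOperand {
  ExtKind Kind;     // Which extend produced the operand, if any.
  unsigned SrcBits; // Source element width of the extend, in bits.
};

// Hypothetical stand-in for the single user of the extend.
struct ToyBinOp {
  ToyOperand Op0;
  ToyOperand Op1;
};

// Returns true if the extend at operand index Idx would be reported as free.
bool isExtendFree(const ToyBinOp &User, unsigned Idx) {
  // A widening user needs an extend as its second operand.
  if (User.Op1.Kind == ExtKind::None)
    return false;
  // The second operand is folded by either the "wide" or the "long" form.
  if (Idx == 1)
    return true;
  // The first operand is folded only if it looks like the second one,
  // in which case the "long" form is expected.
  return User.Op0.Kind == User.Op1.Kind &&
         User.Op0.SrcBits == User.Op1.SrcBits;
}
```

For example, with two matching zero-extends both operands are reported as free, while a sign-extend paired with a zero-extend leaves only the second operand free.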

If you're suggesting, for example, that an ADD and a UADDL might in theory have different max latencies, I suppose this could be true for some hardware, but in a quick look through the sub-target scheduling models, none of them seem to make a distinction that I saw. I think we generally prefer to omit sub-target details from the cost-model unless necessary.

I'm not sure what you mean by "both choices" since there's only one code sequence that is generated. Do you mean that the backend should decide whether to generate, for example, an extend plus ADD vs. UADDL based on cost? I don't think that's something we can do today.

In D32706#745329, @mssimpso wrote:

We mean "free" in the sense that the extends considered in this patch don't exist in the generated code. They're only there in the IR to make it well-typed (i.e., an IR "add" can't operate on <2 x i32> vectors and produce a <2 x i64> result). This is consistent with the way we use "free" in other places in the cost model, when applied to extends.

I see.

If you're suggesting, for example, that an ADD and a UADDL might in theory have different max latencies, I suppose this could be true for some hardware, but in a quick look through the sub-target scheduling models, none of them seem to make a distinction that I saw. I think we generally prefer to omit sub-target details from the cost-model unless necessary.

Exynos M1-2 do: 1 cycle for ASIMD ADD and 3 cycles for ASIMD SADDL. I tried this patch on Exynos and noticed minor regressions and no improvements in a handful of workloads.

I'm not sure what you mean by "both choices" since there's only one code sequence that is generated. Do you mean that the backend should decide whether to generate, for example, an extend plus ADD vs. UADDL based on cost? I don't think that's something we can do today.

I'll have to think more about this.

In D32706#745348, @evandro wrote:

Exynos M1-2 do: 1 cycle for ASIMD ADD and 3 cycles for ASIMD SADDL. I tried this patch on Exynos and noticed minor regressions and no improvements in a handful of workloads.

OK, yes I see that now. I missed it on the first look! What do you think about adding a new sub-target feature for this then? Something like "FastWideningOps"? We could condition this TTI change on the feature to prevent it from hurting performance on Exynos. And regarding your last point, you could then later use it to guide instruction selection to not generate the widening ops if the non-widening versions (with added extends) are more profitable.

In D32706#746087, @mssimpso wrote:

OK, yes I see that now. I missed it on the first look! What do you think about adding a new sub-target feature for this then? Something like "FastWideningOps"? We could condition this TTI change on the feature to prevent it from hurting performance on Exynos. And regarding your last point, you could then later use it to guide instruction selection to not generate the widening ops if the non-widening versions (with added extends) are more profitable.

Please, not yet another feature. It should be possible, methinks, to getSubtarget().getSchedModel() to compare the relative costs.

Do you have a publicly available benchmark where this change is meaningful so that I can evaluate it better? For I haven't seen any change in our workloads after this change. Maybe, in the end, it doesn't matter; or maybe it's not that common.
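The comparison suggested above could look roughly like the following toy sketch. It does not use the real scheduling-model API. ToyLatencies and wideningIsNoSlower are hypothetical names, and the standalone extend latency is an assumed input, since the thread only gives figures for ASIMD ADD and ASIMD SADDL.

```cpp
// Hypothetical per-subtarget latencies, in cycles.
struct ToyLatencies {
  unsigned Extend;      // Assumed latency of a standalone extend.
  unsigned Add;         // Latency of the plain vector add.
  unsigned WideningAdd; // Latency of the widening add (e.g., saddl).
};

// Returns true if the widening form is no slower than extend followed by add
// along a single dependence chain. This is a simplification for illustration.
bool wideningIsNoSlower(const ToyLatencies &L) {
  return L.WideningAdd <= L.Extend + L.Add;
}
```

With the figures quoted in this thread (1 cycle for ADD, 3 cycles for SADDL) and an assumed 1-cycle extend, the predicate returns false. With equal latencies for both adds it returns true.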

In D32706#746122, @evandro wrote:

Please, not yet another feature. It should be possible, methinks, to getSubtarget().getSchedModel() to compare the relative costs.

I can't really comment on whether that would be possible or not, because I haven't looked at it. It seems to be somewhat tangential to this patch, though, since we currently always generate the widening instruction variants if possible, eliminating the extends. So I'm not sure how that would affect TTI. It seems like you're wanting to change instruction selection?

Do you have a publicly available benchmark where this change is meaningful so that I can evaluate it better? For I haven't seen any change in our workloads after this change. Maybe, in the end, it doesn't matter; or maybe it's not that common.

Yes, this change is primarily to encourage more aggressive SLP in spec2006/h264ref. It provides a ~17% improvement on Falkor.

Addressed Evandro's concerns.

Hey Evandro,

I appreciate all the feedback. After discussing your concerns with @gberry, he suggested an alternative solution, which I've implemented in this updated version of the patch. The basic idea is to treat the widening instructions as single operations (rather than operations composed of two or three instructions), whose cost is attached to the arithmetic instruction. This design conceptually better matches the code that we generate.

I've rearranged the patch so that we (1) mark the extends as "free" and (2) apply a sub-target specified overhead to the arithmetic instructions that are widening. I've left the default overhead zero (so this version of the patch is functionally the same as the last one), but sub-targets will now have the flexibility to increase this overhead to better match the capabilities of their hardware. Thus, an "add" ultimately mapping to UADDL can have a different cost than an "add" mapping to ADD, for example. Does this make sense?
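The two-part design described above can be sketched as a small standalone cost function. This is a toy model that mirrors the shape of the change to getArithmeticInstrCost in this revision. ToySubtarget and toyAddCost are hypothetical names, and LegalizationParts stands in for the first element of the type legalization result.

```cpp
// Hypothetical subtarget with the new knob. The default of zero matches the
// default chosen in this revision.
struct ToySubtarget {
  unsigned WideningBaseCost = 0;
};

// Toy cost of an add. The extends feeding a widening add are costed as zero
// elsewhere, so any extra cost of the combined operation is attached here.
int toyAddCost(const ToySubtarget &ST, bool IsWidening, int LegalizationParts) {
  int Cost = 0;
  if (IsWidening)
    Cost += static_cast<int>(ST.WideningBaseCost);
  // One unit per legal add, plus the widening overhead, per legalized part.
  return (Cost + 1) * LegalizationParts;
}
```

With the default overhead, a widening add and a plain add get the same cost. A subtarget that sets a non-zero overhead makes only the widening form more expensive.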

Please let me know what you think.

Herald added subscribers: javed.absar, mzolotukhin. May 5 2017, 12:15 PM

@mssimpso, yes, it makes a lot of sense! This is a more sensible approach, methinks.

This revision is now accepted and ready to land. May 5 2017, 2:20 PM

Thanks Evandro!

Closed by commit rL302582: [AArch64] Consider widening instructions in cost calculations (authored by mssimpso). May 9 2017, 1:31 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/AArch64/AArch64Subtarget.h	3 lines
llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.h	3 lines
llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	106 lines
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp	10 lines
llvm/trunk/test/Analysis/CostModel/AArch64/free-widening-casts.ll	622 lines

Diff 98347

llvm/trunk/lib/Target/AArch64/AArch64Subtarget.h

[100 unchanged lines hidden] protected:
   uint8_t VectorInsertExtractBaseCost = 3;
   uint16_t CacheLineSize = 0;
   uint16_t PrefetchDistance = 0;
   uint16_t MinPrefetchStride = 1;
   unsigned MaxPrefetchIterationsAhead = UINT_MAX;
   unsigned PrefFunctionAlignment = 0;
   unsigned PrefLoopAlignment = 0;
   unsigned MaxJumpTableSize = 0;
+  unsigned WideningBaseCost = 0;

   // ReserveX18 - X18 is not available as a general purpose register.
   bool ReserveX18;

   bool IsLittle;

   /// TargetTriple - What processor and OS we're targeting.
   Triple TargetTriple;
[106 unchanged lines hidden] public:
   unsigned getMaxPrefetchIterationsAhead() const {
     return MaxPrefetchIterationsAhead;
   }
   unsigned getPrefFunctionAlignment() const { return PrefFunctionAlignment; }
   unsigned getPrefLoopAlignment() const { return PrefLoopAlignment; }

   unsigned getMaximumJumpTableSize() const { return MaxJumpTableSize; }

+  unsigned getWideningBaseCost() const { return WideningBaseCost; }
+
   /// CPU has TBI (top byte of addresses is ignored during HW address
   /// translation) and OS enables it.
   bool supportsAddressTopByteIgnored() const;

   bool hasPerfMon() const { return HasPerfMon; }
   bool hasFullFP16() const { return HasFullFP16; }
   bool hasSPE() const { return HasSPE; }
   bool hasLSLFast() const { return HasLSLFast; }
[59 unchanged lines hidden]

llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.h

[37 unchanged lines hidden] class AArch64TTIImpl : public BasicTTIImplBase<AArch64TTIImpl> {
   const AArch64TargetLowering *getTLI() const { return TLI; }

   enum MemIntrinsicType {
     VECTOR_LDST_TWO_ELEMENTS,
     VECTOR_LDST_THREE_ELEMENTS,
     VECTOR_LDST_FOUR_ELEMENTS
   };

+  bool isWideningInstruction(Type *Ty, unsigned Opcode,
+                             ArrayRef<const Value *> Args);
+
 public:
   explicit AArch64TTIImpl(const AArch64TargetMachine *TM, const Function &F)
       : BaseT(TM, F.getParent()->getDataLayout()), ST(TM->getSubtargetImpl(F)),
         TLI(ST->getTargetLowering()) {}

   /// \name Scalar TTI Implementations
   /// @{

[89 unchanged lines hidden]

llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[170 unchanged lines hidden]
 AArch64TTIImpl::getPopcntSupport(unsigned TyWidth) {
   assert(isPowerOf2_32(TyWidth) && "Ty width must be power of 2");
   if (TyWidth == 32 || TyWidth == 64)
     return TTI::PSK_FastHardware;
   // TODO: AArch64TargetLowering::LowerCTPOP() supports 128bit popcount.
   return TTI::PSK_Software;
 }

+bool AArch64TTIImpl::isWideningInstruction(Type *DstTy, unsigned Opcode,
+                                           ArrayRef<const Value *> Args) {
+
+  // A helper that returns a vector type from the given type. The number of
+  // elements in type Ty determine the vector width.
+  auto toVectorTy = [&](Type *ArgTy) {
+    return VectorType::get(ArgTy->getScalarType(),
+                           DstTy->getVectorNumElements());
+  };
+
+  // Exit early if DstTy is not a vector type whose elements are at least
+  // 16-bits wide.
+  if (!DstTy->isVectorTy() || DstTy->getScalarSizeInBits() < 16)
+    return false;
+
+  // Determine if the operation has a widening variant. We consider both the
+  // "long" (e.g., usubl) and "wide" (e.g., usubw) versions of the
+  // instructions.
+  //
+  // TODO: Add additional widening operations (e.g., mul, shl, etc.) once we
+  //       verify that their extending operands are eliminated during code
+  //       generation.
+  switch (Opcode) {
+  case Instruction::Add: // UADDL(2), SADDL(2), UADDW(2), SADDW(2).
+  case Instruction::Sub: // USUBL(2), SSUBL(2), USUBW(2), SSUBW(2).
+    break;
+  default:
+    return false;
+  }
+
+  // To be a widening instruction (either the "wide" or "long" versions), the
+  // second operand must be a sign- or zero extend having a single user. We
+  // only consider extends having a single user because they may otherwise not
+  // be eliminated.
+  if (Args.size() != 2 ||
+      (!isa<SExtInst>(Args[1]) && !isa<ZExtInst>(Args[1])) ||
+      !Args[1]->hasOneUse())
+    return false;
+  auto *Extend = cast<CastInst>(Args[1]);
+
+  // Legalize the destination type and ensure it can be used in a widening
+  // operation.
+  auto DstTyL = TLI->getTypeLegalizationCost(DL, DstTy);
+  unsigned DstElTySize = DstTyL.second.getScalarSizeInBits();
+  if (!DstTyL.second.isVector() || DstElTySize != DstTy->getScalarSizeInBits())
+    return false;
+
+  // Legalize the source type and ensure it can be used in a widening
+  // operation.
+  Type *SrcTy = toVectorTy(Extend->getSrcTy());
+  auto SrcTyL = TLI->getTypeLegalizationCost(DL, SrcTy);
+  unsigned SrcElTySize = SrcTyL.second.getScalarSizeInBits();
+  if (!SrcTyL.second.isVector() || SrcElTySize != SrcTy->getScalarSizeInBits())
+    return false;
+
+  // Get the total number of vector elements in the legalized types.
+  unsigned NumDstEls = DstTyL.first * DstTyL.second.getVectorNumElements();
+  unsigned NumSrcEls = SrcTyL.first * SrcTyL.second.getVectorNumElements();
+
+  // Return true if the legalized types have the same number of vector elements
+  // and the destination element type size is twice that of the source type.
+  return NumDstEls == NumSrcEls && 2 * SrcElTySize == DstElTySize;
+}
+
 int AArch64TTIImpl::getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
                                      const Instruction *I) {
   int ISD = TLI->InstructionOpcodeToISD(Opcode);
   assert(ISD && "Invalid opcode");

+  // If the cast is observable, and it is used by a widening instruction (e.g.,
+  // uaddl, saddw, etc.), it may be free.
+  if (I && I->hasOneUse()) {
+    auto *SingleUser = cast<Instruction>(*I->user_begin());
+    SmallVector<const Value *, 4> Operands(SingleUser->operand_values());
+    if (isWideningInstruction(Dst, SingleUser->getOpcode(), Operands)) {
+      // If the cast is the second operand, it is free. We will generate either
+      // a "wide" or "long" version of the widening instruction.
+      if (I == SingleUser->getOperand(1))
+        return 0;
+      // If the cast is not the second operand, it will be free if it looks the
+      // same as the second operand. In this case, we will generate a "long"
+      // version of the widening instruction.
+      if (auto *Cast = dyn_cast<CastInst>(SingleUser->getOperand(1)))
+        if (I->getOpcode() == Cast->getOpcode() &&
+            cast<CastInst>(I)->getSrcTy() == Cast->getSrcTy())
+          return 0;
+    }
+  }
+
   EVT SrcTy = TLI->getValueType(DL, Src);
   EVT DstTy = TLI->getValueType(DL, Dst);

   if (!SrcTy.isSimple() || !DstTy.isSimple())
     return BaseT::getCastInstrCost(Opcode, Dst, Src);

   static const TypeConversionCostTblEntry
   ConversionTbl[] = {
[182 unchanged lines hidden]

 int AArch64TTIImpl::getArithmeticInstrCost(
     unsigned Opcode, Type *Ty, TTI::OperandValueKind Opd1Info,
     TTI::OperandValueKind Opd2Info, TTI::OperandValueProperties Opd1PropInfo,
     TTI::OperandValueProperties Opd2PropInfo, ArrayRef<const Value *> Args) {
   // Legalize the type.
   std::pair<int, MVT> LT = TLI->getTypeLegalizationCost(DL, Ty);

+  // If the instruction is a widening instruction (e.g., uaddl, saddw, etc.),
+  // add in the widening overhead specified by the sub-target. Since the
+  // extends feeding widening instructions are performed automatically, they
+  // aren't present in the generated code and have a zero cost. By adding a
+  // widening overhead here, we attach the total cost of the combined operation
+  // to the widening instruction.
+  int Cost = 0;
+  if (isWideningInstruction(Ty, Opcode, Args))
+    Cost += ST->getWideningBaseCost();
+
   int ISD = TLI->InstructionOpcodeToISD(Opcode);

   if (ISD == ISD::SDIV &&
       Opd2Info == TargetTransformInfo::OK_UniformConstantValue &&
       Opd2PropInfo == TargetTransformInfo::OP_PowerOf2) {
     // On AArch64, scalar signed division by constants power-of-two are
     // normally expanded to the sequence ADD + CMP + SELECT + SRA.
     // The OperandValue properties many not be same as that of previous
     // operation; conservatively assume OP_None.
-    int Cost = getArithmeticInstrCost(Instruction::Add, Ty, Opd1Info, Opd2Info,
-                                      TargetTransformInfo::OP_None,
-                                      TargetTransformInfo::OP_None);
+    Cost += getArithmeticInstrCost(Instruction::Add, Ty, Opd1Info, Opd2Info,
+                                   TargetTransformInfo::OP_None,
+                                   TargetTransformInfo::OP_None);
     Cost += getArithmeticInstrCost(Instruction::Sub, Ty, Opd1Info, Opd2Info,
                                    TargetTransformInfo::OP_None,
                                    TargetTransformInfo::OP_None);
     Cost += getArithmeticInstrCost(Instruction::Select, Ty, Opd1Info, Opd2Info,
                                    TargetTransformInfo::OP_None,
                                    TargetTransformInfo::OP_None);
     Cost += getArithmeticInstrCost(Instruction::AShr, Ty, Opd1Info, Opd2Info,
                                    TargetTransformInfo::OP_None,
                                    TargetTransformInfo::OP_None);
     return Cost;
   }

   switch (ISD) {
   default:
-    return BaseT::getArithmeticInstrCost(Opcode, Ty, Opd1Info, Opd2Info,
-                                         Opd1PropInfo, Opd2PropInfo);
+    return Cost + BaseT::getArithmeticInstrCost(Opcode, Ty, Opd1Info, Opd2Info,
+                                                Opd1PropInfo, Opd2PropInfo);
   case ISD::ADD:
   case ISD::MUL:
   case ISD::XOR:
   case ISD::OR:
   case ISD::AND:
     // These nodes are marked as 'custom' for combining purposes only.
     // We know that they are legal. See LowerAdd in ISelLowering.
-    return 1 * LT.first;
+    return (Cost + 1) * LT.first;
   }
 }

 int AArch64TTIImpl::getAddressComputationCost(Type *Ty, ScalarEvolution *SE,
                                               const SCEV *Ptr) {
   // Address computations in vectorized code with non-consecutive addresses will
   // likely result in more instructions compared to scalar code where the
   // computation can more often be merged into the index mode. The resulting
[252 unchanged lines hidden]
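The final type checks in isWideningInstruction can be sketched in isolation as follows. This toy version takes the legalization results as plain inputs instead of calling into TargetLowering. ToyLegalizedVec and toyTypesAllowWidening are hypothetical names, and the example values in the note below are assumed legalization outcomes, not output captured from LLVM.

```cpp
// Hypothetical stand-in for the (split count, legal vector type) pair
// returned by type legalization.
struct ToyLegalizedVec {
  unsigned Parts;        // Number of legal registers the type is split into.
  unsigned ElemBits;     // Element width of the legal type, in bits.
  unsigned ElemsPerPart; // Number of elements in each legal register.
};

// Returns true if the source and destination types fit a widening operation.
bool toyTypesAllowWidening(unsigned DstElemBits, const ToyLegalizedVec &Dst,
                           unsigned SrcElemBits, const ToyLegalizedVec &Src) {
  // Destination elements must be at least 16 bits wide.
  if (DstElemBits < 16)
    return false;
  // Legalization must not have changed either element width.
  if (Dst.ElemBits != DstElemBits || Src.ElemBits != SrcElemBits)
    return false;
  // Total element counts must match across all legalized parts.
  unsigned NumDstEls = Dst.Parts * Dst.ElemsPerPart;
  unsigned NumSrcEls = Src.Parts * Src.ElemsPerPart;
  // The destination elements must be exactly twice as wide as the source.
  return NumDstEls == NumSrcEls && 2 * Src.ElemBits == Dst.ElemBits;
}
```

As an illustration, if a <16 x i16> destination is assumed to legalize to two 8-element parts and its <16 x i8> source to one 16-element part, the check passes. If a source's elements are assumed to be promoted from 8 to 16 bits during legalization, the check fails because the element width changed.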

llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp

[1,813 unchanged lines hidden] case Instruction::Xor: {
         Op2VK = TargetTransformInfo::OK_NonUniformConstantValue;
       }
       // FIXME: Currently cost of model modification for division by power of
       // 2 is handled for X86 and AArch64. Add support for other targets.
       if (Op2VK == TargetTransformInfo::OK_UniformConstantValue && CInt &&
           CInt->getValue().isPowerOf2())
         Op2VP = TargetTransformInfo::OP_PowerOf2;

-      int ScalarCost = VecTy->getNumElements() *
-                       TTI->getArithmeticInstrCost(Opcode, ScalarTy, Op1VK,
-                                                   Op2VK, Op1VP, Op2VP);
+      SmallVector<const Value *, 4> Operands(VL0->operand_values());
+      int ScalarCost =
+          VecTy->getNumElements() *
+          TTI->getArithmeticInstrCost(Opcode, ScalarTy, Op1VK, Op2VK, Op1VP,
+                                      Op2VP, Operands);
       int VecCost = TTI->getArithmeticInstrCost(Opcode, VecTy, Op1VK, Op2VK,
-                                                Op1VP, Op2VP);
+                                                Op1VP, Op2VP, Operands);
       return VecCost - ScalarCost;
     }
     case Instruction::GetElementPtr: {
       TargetTransformInfo::OperandValueKind Op1VK =
           TargetTransformInfo::OK_AnyValue;
       TargetTransformInfo::OperandValueKind Op2VK =
           TargetTransformInfo::OK_UniformConstantValue;

[3,317 unchanged lines hidden]
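The cost difference computed in the SLP vectorizer hunk above has a simple shape, sketched here as a standalone function. toySlpCostDelta is a hypothetical name and the inputs are arbitrary example costs, not values produced by the AArch64 cost model.

```cpp
// Toy version of "VecCost - ScalarCost", where the scalar cost is the
// per-lane cost multiplied by the number of vector elements.
int toySlpCostDelta(int VecCost, int ScalarCostPerLane, unsigned NumLanes) {
  int ScalarCost = static_cast<int>(NumLanes) * ScalarCostPerLane;
  return VecCost - ScalarCost;
}
```

A negative result means the vector form is estimated to be cheaper than the scalar lanes it replaces. For example, a vector cost of 1 against four scalar lanes of cost 1 each gives a delta of -3.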

llvm/trunk/test/Analysis/CostModel/AArch64/free-widening-casts.ll

				; RUN: opt < %s -mtriple=aarch64--linux-gnu -cost-model -analyze | FileCheck %s --check-prefix=COST
				; RUN: llc < %s -mtriple=aarch64--linux-gnu | FileCheck %s --check-prefix=CODE

				; COST-LABEL: uaddl_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <8 x i8> %a to <8 x i16>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <8 x i8> %b to <8 x i16>
				; CODE-LABEL: uaddl_8h
				; CODE: uaddl v0.8h, v0.8b, v1.8b
				define <8 x i16> @uaddl_8h(<8 x i8> %a, <8 x i8> %b) {
				%tmp0 = zext <8 x i8> %a to <8 x i16>
				%tmp1 = zext <8 x i8> %b to <8 x i16>
				%tmp2 = add <8 x i16> %tmp0, %tmp1
				ret <8 x i16> %tmp2
				}

				; COST-LABEL: uaddl_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <4 x i16> %a to <4 x i32>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <4 x i16> %b to <4 x i32>
				; CODE-LABEL: uaddl_4s
				; CODE: uaddl v0.4s, v0.4h, v1.4h
				define <4 x i32> @uaddl_4s(<4 x i16> %a, <4 x i16> %b) {
				%tmp0 = zext <4 x i16> %a to <4 x i32>
				%tmp1 = zext <4 x i16> %b to <4 x i32>
				%tmp2 = add <4 x i32> %tmp0, %tmp1
				ret <4 x i32> %tmp2
				}

				; COST-LABEL: uaddl_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <2 x i32> %a to <2 x i64>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <2 x i32> %b to <2 x i64>
				; CODE-LABEL: uaddl_2d
				; CODE: uaddl v0.2d, v0.2s, v1.2s
				define <2 x i64> @uaddl_2d(<2 x i32> %a, <2 x i32> %b) {
				%tmp0 = zext <2 x i32> %a to <2 x i64>
				%tmp1 = zext <2 x i32> %b to <2 x i64>
				%tmp2 = add <2 x i64> %tmp0, %tmp1
				ret <2 x i64> %tmp2
				}

				; COST-LABEL: uaddl2_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <16 x i8> %a to <16 x i16>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <16 x i8> %b to <16 x i16>
				; CODE-LABEL: uaddl2_8h
				; CODE: uaddl2 v2.8h, v0.16b, v1.16b
				; CODE-NEXT: uaddl v0.8h, v0.8b, v1.8b
				define <16 x i16> @uaddl2_8h(<16 x i8> %a, <16 x i8> %b) {
				%tmp0 = zext <16 x i8> %a to <16 x i16>
				%tmp1 = zext <16 x i8> %b to <16 x i16>
				%tmp2 = add <16 x i16> %tmp0, %tmp1
				ret <16 x i16> %tmp2
				}

				; COST-LABEL: uaddl2_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <8 x i16> %a to <8 x i32>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <8 x i16> %b to <8 x i32>
				; CODE-LABEL: uaddl2_4s
				; CODE: uaddl2 v2.4s, v0.8h, v1.8h
				; CODE-NEXT: uaddl v0.4s, v0.4h, v1.4h
				define <8 x i32> @uaddl2_4s(<8 x i16> %a, <8 x i16> %b) {
				%tmp0 = zext <8 x i16> %a to <8 x i32>
				%tmp1 = zext <8 x i16> %b to <8 x i32>
				%tmp2 = add <8 x i32> %tmp0, %tmp1
				ret <8 x i32> %tmp2
				}

				; COST-LABEL: uaddl2_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <4 x i32> %a to <4 x i64>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <4 x i32> %b to <4 x i64>
				; CODE-LABEL: uaddl2_2d
				; CODE: uaddl2 v2.2d, v0.4s, v1.4s
				; CODE-NEXT: uaddl v0.2d, v0.2s, v1.2s
				define <4 x i64> @uaddl2_2d(<4 x i32> %a, <4 x i32> %b) {
				%tmp0 = zext <4 x i32> %a to <4 x i64>
				%tmp1 = zext <4 x i32> %b to <4 x i64>
				%tmp2 = add <4 x i64> %tmp0, %tmp1
				ret <4 x i64> %tmp2
				}

				; COST-LABEL: saddl_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <8 x i8> %a to <8 x i16>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <8 x i8> %b to <8 x i16>
				; CODE-LABEL: saddl_8h
				; CODE: saddl v0.8h, v0.8b, v1.8b
				define <8 x i16> @saddl_8h(<8 x i8> %a, <8 x i8> %b) {
				%tmp0 = sext <8 x i8> %a to <8 x i16>
				%tmp1 = sext <8 x i8> %b to <8 x i16>
				%tmp2 = add <8 x i16> %tmp0, %tmp1
				ret <8 x i16> %tmp2
				}

				; COST-LABEL: saddl_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <4 x i16> %a to <4 x i32>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <4 x i16> %b to <4 x i32>
				; CODE-LABEL: saddl_4s
				; CODE: saddl v0.4s, v0.4h, v1.4h
				define <4 x i32> @saddl_4s(<4 x i16> %a, <4 x i16> %b) {
				%tmp0 = sext <4 x i16> %a to <4 x i32>
				%tmp1 = sext <4 x i16> %b to <4 x i32>
				%tmp2 = add <4 x i32> %tmp0, %tmp1
				ret <4 x i32> %tmp2
				}

				; COST-LABEL: saddl_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <2 x i32> %a to <2 x i64>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <2 x i32> %b to <2 x i64>
				; CODE-LABEL: saddl_2d
				; CODE: saddl v0.2d, v0.2s, v1.2s
				define <2 x i64> @saddl_2d(<2 x i32> %a, <2 x i32> %b) {
				%tmp0 = sext <2 x i32> %a to <2 x i64>
				%tmp1 = sext <2 x i32> %b to <2 x i64>
				%tmp2 = add <2 x i64> %tmp0, %tmp1
				ret <2 x i64> %tmp2
				}

				; COST-LABEL: saddl2_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <16 x i8> %a to <16 x i16>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <16 x i8> %b to <16 x i16>
				; CODE-LABEL: saddl2_8h
				; CODE: saddl2 v2.8h, v0.16b, v1.16b
				; CODE-NEXT: saddl v0.8h, v0.8b, v1.8b
				define <16 x i16> @saddl2_8h(<16 x i8> %a, <16 x i8> %b) {
				%tmp0 = sext <16 x i8> %a to <16 x i16>
				%tmp1 = sext <16 x i8> %b to <16 x i16>
				%tmp2 = add <16 x i16> %tmp0, %tmp1
				ret <16 x i16> %tmp2
				}

				; COST-LABEL: saddl2_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <8 x i16> %a to <8 x i32>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <8 x i16> %b to <8 x i32>
				; CODE-LABEL: saddl2_4s
				; CODE: saddl2 v2.4s, v0.8h, v1.8h
				; CODE-NEXT: saddl v0.4s, v0.4h, v1.4h
				define <8 x i32> @saddl2_4s(<8 x i16> %a, <8 x i16> %b) {
				%tmp0 = sext <8 x i16> %a to <8 x i32>
				%tmp1 = sext <8 x i16> %b to <8 x i32>
				%tmp2 = add <8 x i32> %tmp0, %tmp1
				ret <8 x i32> %tmp2
				}

				; COST-LABEL: saddl2_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <4 x i32> %a to <4 x i64>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <4 x i32> %b to <4 x i64>
				; CODE-LABEL: saddl2_2d
				; CODE: saddl2 v2.2d, v0.4s, v1.4s
				; CODE-NEXT: saddl v0.2d, v0.2s, v1.2s
				define <4 x i64> @saddl2_2d(<4 x i32> %a, <4 x i32> %b) {
				%tmp0 = sext <4 x i32> %a to <4 x i64>
				%tmp1 = sext <4 x i32> %b to <4 x i64>
				%tmp2 = add <4 x i64> %tmp0, %tmp1
				ret <4 x i64> %tmp2
				}

				; COST-LABEL: usubl_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <8 x i8> %a to <8 x i16>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <8 x i8> %b to <8 x i16>
				; CODE-LABEL: usubl_8h
				; CODE: usubl v0.8h, v0.8b, v1.8b
				define <8 x i16> @usubl_8h(<8 x i8> %a, <8 x i8> %b) {
				%tmp0 = zext <8 x i8> %a to <8 x i16>
				%tmp1 = zext <8 x i8> %b to <8 x i16>
				%tmp2 = sub <8 x i16> %tmp0, %tmp1
				ret <8 x i16> %tmp2
				}

				; COST-LABEL: usubl_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <4 x i16> %a to <4 x i32>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <4 x i16> %b to <4 x i32>
				; CODE-LABEL: usubl_4s
				; CODE: usubl v0.4s, v0.4h, v1.4h
				define <4 x i32> @usubl_4s(<4 x i16> %a, <4 x i16> %b) {
				%tmp0 = zext <4 x i16> %a to <4 x i32>
				%tmp1 = zext <4 x i16> %b to <4 x i32>
				%tmp2 = sub <4 x i32> %tmp0, %tmp1
				ret <4 x i32> %tmp2
				}

				; COST-LABEL: usubl_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <2 x i32> %a to <2 x i64>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <2 x i32> %b to <2 x i64>
				; CODE-LABEL: usubl_2d
				; CODE: usubl v0.2d, v0.2s, v1.2s
				define <2 x i64> @usubl_2d(<2 x i32> %a, <2 x i32> %b) {
				%tmp0 = zext <2 x i32> %a to <2 x i64>
				%tmp1 = zext <2 x i32> %b to <2 x i64>
				%tmp2 = sub <2 x i64> %tmp0, %tmp1
				ret <2 x i64> %tmp2
				}

				; COST-LABEL: usubl2_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <16 x i8> %a to <16 x i16>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <16 x i8> %b to <16 x i16>
				; CODE-LABEL: usubl2_8h
				; CODE: usubl2 v2.8h, v0.16b, v1.16b
				; CODE-NEXT: usubl v0.8h, v0.8b, v1.8b
				define <16 x i16> @usubl2_8h(<16 x i8> %a, <16 x i8> %b) {
				%tmp0 = zext <16 x i8> %a to <16 x i16>
				%tmp1 = zext <16 x i8> %b to <16 x i16>
				%tmp2 = sub <16 x i16> %tmp0, %tmp1
				ret <16 x i16> %tmp2
				}

				; COST-LABEL: usubl2_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <8 x i16> %a to <8 x i32>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <8 x i16> %b to <8 x i32>
				; CODE-LABEL: usubl2_4s
				; CODE: usubl2 v2.4s, v0.8h, v1.8h
				; CODE-NEXT: usubl v0.4s, v0.4h, v1.4h
				define <8 x i32> @usubl2_4s(<8 x i16> %a, <8 x i16> %b) {
				%tmp0 = zext <8 x i16> %a to <8 x i32>
				%tmp1 = zext <8 x i16> %b to <8 x i32>
				%tmp2 = sub <8 x i32> %tmp0, %tmp1
				ret <8 x i32> %tmp2
				}

				; COST-LABEL: usubl2_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <4 x i32> %a to <4 x i64>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <4 x i32> %b to <4 x i64>
				; CODE-LABEL: usubl2_2d
				; CODE: usubl2 v2.2d, v0.4s, v1.4s
				; CODE-NEXT: usubl v0.2d, v0.2s, v1.2s
				define <4 x i64> @usubl2_2d(<4 x i32> %a, <4 x i32> %b) {
				%tmp0 = zext <4 x i32> %a to <4 x i64>
				%tmp1 = zext <4 x i32> %b to <4 x i64>
				%tmp2 = sub <4 x i64> %tmp0, %tmp1
				ret <4 x i64> %tmp2
				}

				; COST-LABEL: ssubl_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <8 x i8> %a to <8 x i16>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <8 x i8> %b to <8 x i16>
				; CODE-LABEL: ssubl_8h
				; CODE: ssubl v0.8h, v0.8b, v1.8b
				define <8 x i16> @ssubl_8h(<8 x i8> %a, <8 x i8> %b) {
				%tmp0 = sext <8 x i8> %a to <8 x i16>
				%tmp1 = sext <8 x i8> %b to <8 x i16>
				%tmp2 = sub <8 x i16> %tmp0, %tmp1
				ret <8 x i16> %tmp2
				}

				; COST-LABEL: ssubl_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <4 x i16> %a to <4 x i32>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <4 x i16> %b to <4 x i32>
				; CODE-LABEL: ssubl_4s
				; CODE: ssubl v0.4s, v0.4h, v1.4h
				define <4 x i32> @ssubl_4s(<4 x i16> %a, <4 x i16> %b) {
				%tmp0 = sext <4 x i16> %a to <4 x i32>
				%tmp1 = sext <4 x i16> %b to <4 x i32>
				%tmp2 = sub <4 x i32> %tmp0, %tmp1
				ret <4 x i32> %tmp2
				}

				; COST-LABEL: ssubl_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <2 x i32> %a to <2 x i64>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <2 x i32> %b to <2 x i64>
				; CODE-LABEL: ssubl_2d
				; CODE: ssubl v0.2d, v0.2s, v1.2s
				define <2 x i64> @ssubl_2d(<2 x i32> %a, <2 x i32> %b) {
				%tmp0 = sext <2 x i32> %a to <2 x i64>
				%tmp1 = sext <2 x i32> %b to <2 x i64>
				%tmp2 = sub <2 x i64> %tmp0, %tmp1
				ret <2 x i64> %tmp2
				}

				; COST-LABEL: ssubl2_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <16 x i8> %a to <16 x i16>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <16 x i8> %b to <16 x i16>
				; CODE-LABEL: ssubl2_8h
				; CODE: ssubl2 v2.8h, v0.16b, v1.16b
				; CODE-NEXT: ssubl v0.8h, v0.8b, v1.8b
				define <16 x i16> @ssubl2_8h(<16 x i8> %a, <16 x i8> %b) {
				%tmp0 = sext <16 x i8> %a to <16 x i16>
				%tmp1 = sext <16 x i8> %b to <16 x i16>
				%tmp2 = sub <16 x i16> %tmp0, %tmp1
				ret <16 x i16> %tmp2
				}

				; COST-LABEL: ssubl2_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <8 x i16> %a to <8 x i32>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <8 x i16> %b to <8 x i32>
				; CODE-LABEL: ssubl2_4s
				; CODE: ssubl2 v2.4s, v0.8h, v1.8h
				; CODE-NEXT: ssubl v0.4s, v0.4h, v1.4h
				define <8 x i32> @ssubl2_4s(<8 x i16> %a, <8 x i16> %b) {
				%tmp0 = sext <8 x i16> %a to <8 x i32>
				%tmp1 = sext <8 x i16> %b to <8 x i32>
				%tmp2 = sub <8 x i32> %tmp0, %tmp1
				ret <8 x i32> %tmp2
				}

				; COST-LABEL: ssubl2_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <4 x i32> %a to <4 x i64>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = sext <4 x i32> %b to <4 x i64>
				; CODE-LABEL: ssubl2_2d
				; CODE: ssubl2 v2.2d, v0.4s, v1.4s
				; CODE-NEXT: ssubl v0.2d, v0.2s, v1.2s
				define <4 x i64> @ssubl2_2d(<4 x i32> %a, <4 x i32> %b) {
				%tmp0 = sext <4 x i32> %a to <4 x i64>
				%tmp1 = sext <4 x i32> %b to <4 x i64>
				%tmp2 = sub <4 x i64> %tmp0, %tmp1
				ret <4 x i64> %tmp2
				}

				; COST-LABEL: uaddw_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <8 x i8> %a to <8 x i16>
				; CODE-LABEL: uaddw_8h
				; CODE: uaddw v0.8h, v1.8h, v0.8b
				define <8 x i16> @uaddw_8h(<8 x i8> %a, <8 x i16> %b) {
				%tmp0 = zext <8 x i8> %a to <8 x i16>
				%tmp1 = add <8 x i16> %b, %tmp0
				ret <8 x i16> %tmp1
				}

				; COST-LABEL: uaddw_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <4 x i16> %a to <4 x i32>
				; CODE-LABEL: uaddw_4s
				; CODE: uaddw v0.4s, v1.4s, v0.4h
				define <4 x i32> @uaddw_4s(<4 x i16> %a, <4 x i32> %b) {
				%tmp0 = zext <4 x i16> %a to <4 x i32>
				%tmp1 = add <4 x i32> %b, %tmp0
				ret <4 x i32> %tmp1
				}

				; COST-LABEL: uaddw_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <2 x i32> %a to <2 x i64>
				; CODE-LABEL: uaddw_2d
				; CODE: uaddw v0.2d, v1.2d, v0.2s
				define <2 x i64> @uaddw_2d(<2 x i32> %a, <2 x i64> %b) {
				%tmp0 = zext <2 x i32> %a to <2 x i64>
				%tmp1 = add <2 x i64> %b, %tmp0
				ret <2 x i64> %tmp1
				}

				; COST-LABEL: uaddw2_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <16 x i8> %a to <16 x i16>
				; CODE-LABEL: uaddw2_8h
				; CODE: uaddw2 v2.8h, v2.8h, v0.16b
				; CODE-NEXT: uaddw v0.8h, v1.8h, v0.8b
				define <16 x i16> @uaddw2_8h(<16 x i8> %a, <16 x i16> %b) {
				%tmp0 = zext <16 x i8> %a to <16 x i16>
				%tmp1 = add <16 x i16> %b, %tmp0
				ret <16 x i16> %tmp1
				}

				; COST-LABEL: uaddw2_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <8 x i16> %a to <8 x i32>
				; CODE-LABEL: uaddw2_4s
				; CODE: uaddw2 v2.4s, v2.4s, v0.8h
				; CODE-NEXT: uaddw v0.4s, v1.4s, v0.4h
				define <8 x i32> @uaddw2_4s(<8 x i16> %a, <8 x i32> %b) {
				%tmp0 = zext <8 x i16> %a to <8 x i32>
				%tmp1 = add <8 x i32> %b, %tmp0
				ret <8 x i32> %tmp1
				}

				; COST-LABEL: uaddw2_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <4 x i32> %a to <4 x i64>
				; CODE-LABEL: uaddw2_2d
				; CODE: uaddw2 v2.2d, v2.2d, v0.4s
				; CODE-NEXT: uaddw v0.2d, v1.2d, v0.2s
				define <4 x i64> @uaddw2_2d(<4 x i32> %a, <4 x i64> %b) {
				%tmp0 = zext <4 x i32> %a to <4 x i64>
				%tmp1 = add <4 x i64> %b, %tmp0
				ret <4 x i64> %tmp1
				}

				; COST-LABEL: saddw_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <8 x i8> %a to <8 x i16>
				; CODE-LABEL: saddw_8h
				; CODE: saddw v0.8h, v1.8h, v0.8b
				define <8 x i16> @saddw_8h(<8 x i8> %a, <8 x i16> %b) {
				%tmp0 = sext <8 x i8> %a to <8 x i16>
				%tmp1 = add <8 x i16> %b, %tmp0
				ret <8 x i16> %tmp1
				}

				; COST-LABEL: saddw_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <4 x i16> %a to <4 x i32>
				; CODE-LABEL: saddw_4s
				; CODE: saddw v0.4s, v1.4s, v0.4h
				define <4 x i32> @saddw_4s(<4 x i16> %a, <4 x i32> %b) {
				%tmp0 = sext <4 x i16> %a to <4 x i32>
				%tmp1 = add <4 x i32> %b, %tmp0
				ret <4 x i32> %tmp1
				}

				; COST-LABEL: saddw_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <2 x i32> %a to <2 x i64>
				; CODE-LABEL: saddw_2d
				; CODE: saddw v0.2d, v1.2d, v0.2s
				define <2 x i64> @saddw_2d(<2 x i32> %a, <2 x i64> %b) {
				%tmp0 = sext <2 x i32> %a to <2 x i64>
				%tmp1 = add <2 x i64> %b, %tmp0
				ret <2 x i64> %tmp1
				}

				; COST-LABEL: saddw2_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <16 x i8> %a to <16 x i16>
				; CODE-LABEL: saddw2_8h
				; CODE: saddw2 v2.8h, v2.8h, v0.16b
				; CODE-NEXT: saddw v0.8h, v1.8h, v0.8b
				define <16 x i16> @saddw2_8h(<16 x i8> %a, <16 x i16> %b) {
				%tmp0 = sext <16 x i8> %a to <16 x i16>
				%tmp1 = add <16 x i16> %b, %tmp0
				ret <16 x i16> %tmp1
				}

				; COST-LABEL: saddw2_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <8 x i16> %a to <8 x i32>
				; CODE-LABEL: saddw2_4s
				; CODE: saddw2 v2.4s, v2.4s, v0.8h
				; CODE-NEXT: saddw v0.4s, v1.4s, v0.4h
				define <8 x i32> @saddw2_4s(<8 x i16> %a, <8 x i32> %b) {
				%tmp0 = sext <8 x i16> %a to <8 x i32>
				%tmp1 = add <8 x i32> %b, %tmp0
				ret <8 x i32> %tmp1
				}

				; COST-LABEL: saddw2_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <4 x i32> %a to <4 x i64>
				; CODE-LABEL: saddw2_2d
				; CODE: saddw2 v2.2d, v2.2d, v0.4s
				; CODE-NEXT: saddw v0.2d, v1.2d, v0.2s
				define <4 x i64> @saddw2_2d(<4 x i32> %a, <4 x i64> %b) {
				%tmp0 = sext <4 x i32> %a to <4 x i64>
				%tmp1 = add <4 x i64> %b, %tmp0
				ret <4 x i64> %tmp1
				}

				; COST-LABEL: usubw_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <8 x i8> %a to <8 x i16>
				; CODE-LABEL: usubw_8h
				; CODE: usubw v0.8h, v1.8h, v0.8b
				define <8 x i16> @usubw_8h(<8 x i8> %a, <8 x i16> %b) {
				%tmp0 = zext <8 x i8> %a to <8 x i16>
				%tmp1 = sub <8 x i16> %b, %tmp0
				ret <8 x i16> %tmp1
				}

				; COST-LABEL: usubw_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <4 x i16> %a to <4 x i32>
				; CODE-LABEL: usubw_4s
				; CODE: usubw v0.4s, v1.4s, v0.4h
				define <4 x i32> @usubw_4s(<4 x i16> %a, <4 x i32> %b) {
				%tmp0 = zext <4 x i16> %a to <4 x i32>
				%tmp1 = sub <4 x i32> %b, %tmp0
				ret <4 x i32> %tmp1
				}

				; COST-LABEL: usubw_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <2 x i32> %a to <2 x i64>
				; CODE-LABEL: usubw_2d
				; CODE: usubw v0.2d, v1.2d, v0.2s
				define <2 x i64> @usubw_2d(<2 x i32> %a, <2 x i64> %b) {
				%tmp0 = zext <2 x i32> %a to <2 x i64>
				%tmp1 = sub <2 x i64> %b, %tmp0
				ret <2 x i64> %tmp1
				}

				; COST-LABEL: usubw2_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <16 x i8> %a to <16 x i16>
				; CODE-LABEL: usubw2_8h
				; CODE: usubw2 v2.8h, v2.8h, v0.16b
				; CODE-NEXT: usubw v0.8h, v1.8h, v0.8b
				define <16 x i16> @usubw2_8h(<16 x i8> %a, <16 x i16> %b) {
				%tmp0 = zext <16 x i8> %a to <16 x i16>
				%tmp1 = sub <16 x i16> %b, %tmp0
				ret <16 x i16> %tmp1
				}

				; COST-LABEL: usubw2_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <8 x i16> %a to <8 x i32>
				; CODE-LABEL: usubw2_4s
				; CODE: usubw2 v2.4s, v2.4s, v0.8h
				; CODE-NEXT: usubw v0.4s, v1.4s, v0.4h
				define <8 x i32> @usubw2_4s(<8 x i16> %a, <8 x i32> %b) {
				%tmp0 = zext <8 x i16> %a to <8 x i32>
				%tmp1 = sub <8 x i32> %b, %tmp0
				ret <8 x i32> %tmp1
				}

				; COST-LABEL: usubw2_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = zext <4 x i32> %a to <4 x i64>
				; CODE-LABEL: usubw2_2d
				; CODE: usubw2 v2.2d, v2.2d, v0.4s
				; CODE-NEXT: usubw v0.2d, v1.2d, v0.2s
				define <4 x i64> @usubw2_2d(<4 x i32> %a, <4 x i64> %b) {
				%tmp0 = zext <4 x i32> %a to <4 x i64>
				%tmp1 = sub <4 x i64> %b, %tmp0
				ret <4 x i64> %tmp1
				}

				; COST-LABEL: ssubw_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <8 x i8> %a to <8 x i16>
				; CODE-LABEL: ssubw_8h
				; CODE: ssubw v0.8h, v1.8h, v0.8b
				define <8 x i16> @ssubw_8h(<8 x i8> %a, <8 x i16> %b) {
				%tmp0 = sext <8 x i8> %a to <8 x i16>
				%tmp1 = sub <8 x i16> %b, %tmp0
				ret <8 x i16> %tmp1
				}

				; COST-LABEL: ssubw_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <4 x i16> %a to <4 x i32>
				; CODE-LABEL: ssubw_4s
				; CODE: ssubw v0.4s, v1.4s, v0.4h
				define <4 x i32> @ssubw_4s(<4 x i16> %a, <4 x i32> %b) {
				%tmp0 = sext <4 x i16> %a to <4 x i32>
				%tmp1 = sub <4 x i32> %b, %tmp0
				ret <4 x i32> %tmp1
				}

				; COST-LABEL: ssubw_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <2 x i32> %a to <2 x i64>
				; CODE-LABEL: ssubw_2d
				; CODE: ssubw v0.2d, v1.2d, v0.2s
				define <2 x i64> @ssubw_2d(<2 x i32> %a, <2 x i64> %b) {
				%tmp0 = sext <2 x i32> %a to <2 x i64>
				%tmp1 = sub <2 x i64> %b, %tmp0
				ret <2 x i64> %tmp1
				}

				; COST-LABEL: ssubw2_8h
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <16 x i8> %a to <16 x i16>
				; CODE-LABEL: ssubw2_8h
				; CODE: ssubw2 v2.8h, v2.8h, v0.16b
				; CODE-NEXT: ssubw v0.8h, v1.8h, v0.8b
				define <16 x i16> @ssubw2_8h(<16 x i8> %a, <16 x i16> %b) {
				%tmp0 = sext <16 x i8> %a to <16 x i16>
				%tmp1 = sub <16 x i16> %b, %tmp0
				ret <16 x i16> %tmp1
				}

				; COST-LABEL: ssubw2_4s
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <8 x i16> %a to <8 x i32>
				; CODE-LABEL: ssubw2_4s
				; CODE: ssubw2 v2.4s, v2.4s, v0.8h
				; CODE-NEXT: ssubw v0.4s, v1.4s, v0.4h
				define <8 x i32> @ssubw2_4s(<8 x i16> %a, <8 x i32> %b) {
				%tmp0 = sext <8 x i16> %a to <8 x i32>
				%tmp1 = sub <8 x i32> %b, %tmp0
				ret <8 x i32> %tmp1
				}

				; COST-LABEL: ssubw2_2d
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp0 = sext <4 x i32> %a to <4 x i64>
				; CODE-LABEL: ssubw2_2d
				; CODE: ssubw2 v2.2d, v2.2d, v0.4s
				; CODE-NEXT: ssubw v0.2d, v1.2d, v0.2s
				define <4 x i64> @ssubw2_2d(<4 x i32> %a, <4 x i64> %b) {
				%tmp0 = sext <4 x i32> %a to <4 x i64>
				%tmp1 = sub <4 x i64> %b, %tmp0
				ret <4 x i64> %tmp1
				}

				; COST-LABEL: neg_wrong_operand_order
				; COST-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp0 = zext <8 x i8> %a to <8 x i16>
				define <8 x i16> @neg_wrong_operand_order(<8 x i8> %a, <8 x i16> %b) {
				%tmp0 = zext <8 x i8> %a to <8 x i16>
				%tmp1 = sub <8 x i16> %tmp0, %b
				ret <8 x i16> %tmp1
				}

				; COST-LABEL: neg_non_widening_op
				; COST-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp0 = zext <8 x i8> %a to <8 x i16>
				define <8 x i16> @neg_non_widening_op(<8 x i8> %a, <8 x i16> %b) {
				%tmp0 = zext <8 x i8> %a to <8 x i16>
				%tmp1 = udiv <8 x i16> %b, %tmp0
				ret <8 x i16> %tmp1
				}

				; COST-LABEL: neg_dissimilar_operand_kind_0
				; COST-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp0 = sext <8 x i8> %a to <8 x i16>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <8 x i8> %b to <8 x i16>
				define <8 x i16> @neg_dissimilar_operand_kind_0(<8 x i8> %a, <8 x i8> %b) {
				%tmp0 = sext <8 x i8> %a to <8 x i16>
				%tmp1 = zext <8 x i8> %b to <8 x i16>
				%tmp2 = add <8 x i16> %tmp0, %tmp1
				ret <8 x i16> %tmp2
				}

				; COST-LABEL: neg_dissimilar_operand_kind_1
				; COST-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp0 = zext <4 x i8> %a to <4 x i32>
				; COST-NEXT: Cost Model: Found an estimated cost of 0 for instruction: %tmp1 = zext <4 x i16> %b to <4 x i32>
				define <4 x i32> @neg_dissimilar_operand_kind_1(<4 x i8> %a, <4 x i16> %b) {
				%tmp0 = zext <4 x i8> %a to <4 x i32>
				%tmp1 = zext <4 x i16> %b to <4 x i32>
				%tmp2 = add <4 x i32> %tmp0, %tmp1
				ret <4 x i32> %tmp2
				}

				; COST-LABEL: neg_illegal_vector_type_0
				; COST-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp0 = zext <16 x i4> %a to <16 x i8>
				define <16 x i8> @neg_illegal_vector_type_0(<16 x i4> %a, <16 x i8> %b) {
				%tmp0 = zext <16 x i4> %a to <16 x i8>
				%tmp1 = sub <16 x i8> %b, %tmp0
				ret <16 x i8> %tmp1
				}

				; COST-LABEL: neg_illegal_vector_type_1
				; COST-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp0 = zext <1 x i16> %a to <1 x i32>
				define <1 x i32> @neg_illegal_vector_type_1(<1 x i16> %a, <1 x i32> %b) {
				%tmp0 = zext <1 x i16> %a to <1 x i32>
				%tmp1 = add <1 x i32> %b, %tmp0
				ret <1 x i32> %tmp1
				}

				; COST-LABEL: neg_illegal_vector_type_2
				; COST-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp0 = zext <4 x i16> %a to <4 x i64>
				define <4 x i64> @neg_illegal_vector_type_2(<4 x i16> %a, <4 x i64> %b) {
				%tmp0 = zext <4 x i16> %a to <4 x i64>
				%tmp1 = add <4 x i64> %b, %tmp0
				ret <4 x i64> %tmp1
				}

				; COST-LABEL: neg_illegal_vector_type_3
				; COST-NEXT: Cost Model: Found an estimated cost of 3 for instruction: %tmp0 = zext <3 x i34> %a to <3 x i68>
				define <3 x i68> @neg_illegal_vector_type_3(<3 x i34> %a, <3 x i68> %b) {
				%tmp0 = zext <3 x i34> %a to <3 x i68>
				%tmp1 = add <3 x i68> %b, %tmp0
				ret <3 x i68> %tmp1
				}
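
The COST expectations in the test file above follow a pattern that may be easier to see as a small predicate. The sketch below is only a reading of those expectations, not the code added to AArch64TargetTransformInfo.cpp. Every name in it (`Operand`, `ext`, `plain`, `fitsWidening`, `expectFree`) is hypothetical, and the type check is an assumed simplification: it accepts 8-, 16-, or 32-bit source elements that double in width and whose source vector fills whole 64-bit registers, which happens to match every positive and negative case in the tests.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for IR concepts; none of these are LLVM types.
enum class ExtKind { SExt, ZExt };
enum class Op { Add, Sub, UDiv };

// One operand of a binary vector operation.
struct Operand {
  bool isExtend;     // false for a plain (already wide) value
  ExtKind kind;      // meaningful only when isExtend is true
  unsigned numElts;  // vector element count
  unsigned srcBits;  // element width before the extend
  unsigned dstBits;  // element width after the extend
};

// Convenience constructors for the two operand shapes used in the tests.
Operand ext(ExtKind k, unsigned n, unsigned src, unsigned dst) {
  return Operand{true, k, n, src, dst};
}
Operand plain() { return Operand{false, ExtKind::ZExt, 0, 0, 0}; }

// Assumed approximation of "the types can feed a widening instruction":
// the element width doubles, the source element is 8, 16, or 32 bits wide,
// and the source vector occupies a whole number of 64-bit registers.
bool fitsWidening(const Operand &e) {
  if (!e.isExtend)
    return false;
  bool srcEltOk = e.srcBits == 8 || e.srcBits == 16 || e.srcBits == 32;
  bool doubles = e.dstBits == 2 * e.srcBits;
  bool fillsRegs = (e.numElts * e.srcBits) % 64 == 0;
  return srcEltOk && doubles && fillsRegs;
}

// Returns true if the tests expect a cost of 0 for the extend that is
// operand `idx` (0 or 1) of the binary operation `op`.
bool expectFree(Op op, const Operand &lhs, const Operand &rhs, unsigned idx) {
  // Only add and sub appear with free extends; udiv does not.
  if (op != Op::Add && op != Op::Sub)
    return false;
  // The second operand has to be a suitable extend in every free case.
  if (!fitsWidening(rhs))
    return false;
  // The second operand itself is then free ("wide" and "long" forms).
  if (idx == 1)
    return true;
  // The first operand is free only if it matches the second ("long" form).
  return lhs.isExtend && lhs.kind == rhs.kind && lhs.numElts == rhs.numElts &&
         lhs.srcBits == rhs.srcBits && lhs.dstBits == rhs.dstBits;
}
```

Under this model, `uaddl_8h` corresponds to `expectFree(Op::Add, ext(ExtKind::ZExt, 8, 8, 16), ext(ExtKind::ZExt, 8, 8, 16), 0)`, which holds, while `neg_wrong_operand_order` corresponds to `expectFree(Op::Sub, ext(ExtKind::ZExt, 8, 8, 16), plain(), 0)`, which does not.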