This is an archive of the discontinued LLVM Phabricator instance.

[X86] Changes to extract Horizontal addition operation for AVX-512.
Needs Revision (Public)

Authored by jbhateja on Aug 8 2017, 3:12 AM.

Details

Reviewers

craig.topper
delena
zvi
spatel
RKSimon

Summary

vphadd has no AVX-512 form. However, if the result of a 512-bit add is only
partially consumed, such that the user of the add reads less than half of its bits,
we can perform the horizontal addition on a narrower vector and concatenate the result
with undef.

This will fix PR33758
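
To make the idea in the summary concrete, here is a minimal standalone C++ sketch (plain arrays, not LLVM code). It assumes the usual 128-bit PHADDD semantics and models an undef element with std::nullopt; the names hadd128 and concatWithUndef are hypothetical and do not appear in the patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

using V4 = std::array<uint32_t, 4>;
// std::nullopt stands in for an undef element of the 512-bit result.
using V16 = std::array<std::optional<uint32_t>, 16>;

// Models a 128-bit integer horizontal add:
// <a0+a1, a2+a3, b0+b1, b2+b3>.
V4 hadd128(const V4 &a, const V4 &b) {
  return {a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]};
}

// Models CONCAT_VECTORS of the narrow result with undef padding, so the
// value keeps its original 16-element type while only the low 128 bits
// carry data.
V16 concatWithUndef(const V4 &low) {
  V16 out{}; // every element starts as "undef"
  for (std::size_t i = 0; i < low.size(); ++i)
    out[i] = low[i];
  return out;
}
```

A user that reads only the low four elements of the wide value sees the same data as if the add had been done at 512 bits.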

Diff Detail

Build Status

Buildable 9671
Build 9671: arc lint + arc unit

Event Timeline

jbhateja created this revision. Aug 8 2017, 3:12 AM

jbhateja added reviewers: craig.topper, delena, RKSimon. Aug 8 2017, 3:15 AM

jbhateja added a subscriber: llvm-commits.

Ping @reviewers.

craig.topper added inline comments. Aug 10 2017, 5:23 PM

lib/Target/X86/X86ISelLowering.cpp
35597	Don't we need to make sure this only happens on v32i16 and v16i32 types? combineAdd can get called on all sorts of types.
35610	Don't use a SmallVector for a hardcoded two elements. Just use a plain old array.

Are there any cases where this should happen where Op0 is different than Op1? All of these test changes are unary shuffles.

And really shouldn't we have some more generic way of shrinking adds like this? How many of the other adds in these tests could have been done in a smaller type to use a VEX encoding?

craig.topper mentioned this in D36601: [x86] Enable some support for lowerVectorShuffleWithUndefHalf with AVX-512. Aug 10 2017, 6:52 PM

Diffusion mentioned this in rL310724: [x86] Enable some support for lowerVectorShuffleWithUndefHalf with AVX-512. Aug 11 2017, 9:21 AM

craig.topper mentioned this in D36650: [X86] WIP support narrowing operations when only a subvector is demanded. Aug 12 2017, 11:16 PM

Thinking about this some more. Do we really want to use a horizontal add instruction for a register with itself? Horizontal add is suboptimally implemented in microcode. It's 3 uops while the pshufd and the add are only 2 uops. The 3 uops also mean it's limited to the complex decoder on Intel hardware.
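
As an illustration of why the two sequences are interchangeable for a reduction step, here is a small standalone C++ sketch (not LLVM code). It assumes 128-bit PHADDD semantics and a PSHUFD mask of <1,0,3,2>; the function names are hypothetical. Only element 0 is compared, since that is the lane a scalar reduction would read.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using V4 = std::array<uint32_t, 4>;

// Models a 128-bit horizontal add with both sources equal to x:
// <x0+x1, x2+x3, x0+x1, x2+x3>.
V4 haddSelf(const V4 &x) {
  return {x[0] + x[1], x[2] + x[3], x[0] + x[1], x[2] + x[3]};
}

// Models PSHUFD with mask <1,0,3,2> followed by a plain vector add:
// <x0+x1, x1+x0, x2+x3, x3+x2>.
V4 shuffleThenAdd(const V4 &x) {
  V4 swapped = {x[1], x[0], x[3], x[2]};
  V4 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = x[i] + swapped[i];
  return out;
}
```

Both functions produce x0+x1 in element 0, so the choice between them comes down to uop count and code size rather than the value computed.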

Handling for horizontal [F]ADD/[F]SUB for AVX512 vector size 16x32 in a generic manner.
Merge branch 'master' of https://github.com/llvm-mirror/llvm
Updating test reference

In D36454#853351, @craig.topper wrote:

Thinking about this some more. Do we really want to use a horizontal add instruction for a register with itself? Horizontal add is suboptimally implemented in microcode. It's 3 uops while the pshufd and the add are only 2 uops. The 3 uops also mean it's limited to the complex decoder on Intel hardware.

Yes, I agree. Latency and uop count are better without hadd; the only advantage of hadd with identical operands is smaller code size.
Do you suggest adding a check in isHorizontalBinOp so that it does not recognize an otherwise valid horizontal add/sub pattern when both operands are the same?

Removing a file that got added in the last patch.

Harbormaster completed remote builds in B9670: Diff 112823. Aug 27 2017, 6:32 AM

Stashed change left out of the last check-in, plus formatting changes.
Updating test reference

ping @reviewers

guyblank added a subscriber: guyblank. Aug 31 2017, 8:03 AM

Ping reviewers

I think we should try to combine based on the add only being used by the extract_vector_elt. Turn the add into a 128-bit add being fed by extract_subvectors. Similarly if we see an add only being used by an extract_subvector we can shrink that add too and push the extracts up. This type of transform feels more generally useful because it will allow us to narrow many more adds in this code. This will enable EVEX->VEX to use a smaller encoding. We can apply this to many other opcodes as well.

If we do this early enough we should be able to shrink the add before the horizontal add detection.

In D36454#861826, @craig.topper wrote:

I think we should try to combine based on the add only being used by the extract_vector_elt. Turn the add into a 128-bit add being fed by extract_subvectors. Similarly if we see an add only being used by an extract_subvector we can shrink that add too and push the extracts up. This type of transform feels more generally useful because it will allow us to narrow many more adds in this code. This will enable EVEX->VEX to use a smaller encoding. We can apply this to many other opcodes as well.

If we do this early enough we should be able to shrink the add before the horizontal add detection.

Two cases for DAG node reduction:

a / Look at the operands and try to narrow them (EXTRACT_SUBVECTOR) for a narrower operation, which is then concatenated with padding so that the final result has the same size as the original. Here we look only at the operands, not at the uses of the operation. This means it could break valid patterns checked at the use nodes, because a CONCAT_VECTORS is now inserted.

b / Look at both the uses of the DAG node and its operands while narrowing the operation. The idea is to avoid inserting an extra concat for padding, which keeps the pattern matches at the use nodes intact.

My initial patch was based on strategy (b). What I understand from your comment above is that it should be changed to handle any generic opcode instead of only HADD. Correct?

The current patch is based on strategy (a), with a wrapper over a generic subroutine that scales down the operation only for HADD/HSUB on particular vector types, which is also safe.

Merge branch 'master' of https://github.com/llvm-mirror/llvm
Making the downscaling operation changes generic.

Why can't this be solved by just combining extract_subvectors through things like binops and onto their inputs?

Like this:
combine (extract_subvector (add X, Y)) -> (narrower_add (extract_subvector X, extract_subvector Y))
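
The combine above is valid because an element-wise add commutes with taking a contiguous slice. A minimal standalone C++ sketch of that equivalence follows (plain arrays, not SelectionDAG code); add, extractSubvector, wideThenExtract and extractThenNarrowAdd are hypothetical stand-ins for the DAG nodes involved.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

template <std::size_t N> using Vec = std::array<uint32_t, N>;

// Element-wise add, standing in for an ADD node on an N-element vector.
template <std::size_t N> Vec<N> add(const Vec<N> &x, const Vec<N> &y) {
  Vec<N> out{};
  for (std::size_t i = 0; i < N; ++i)
    out[i] = x[i] + y[i];
  return out;
}

// Stand-in for EXTRACT_SUBVECTOR: M consecutive elements starting at idx.
template <std::size_t M, std::size_t N>
Vec<M> extractSubvector(const Vec<N> &v, std::size_t idx) {
  assert(idx + M <= N && "subvector out of range");
  Vec<M> out{};
  for (std::size_t i = 0; i < M; ++i)
    out[i] = v[idx + i];
  return out;
}

// Before the combine: a 16-element add, then a 4-element extract.
Vec<4> wideThenExtract(const Vec<16> &x, const Vec<16> &y, std::size_t idx) {
  return extractSubvector<4>(add(x, y), idx);
}

// After the combine: extract both inputs, then a 4-element add.
Vec<4> extractThenNarrowAdd(const Vec<16> &x, const Vec<16> &y,
                            std::size_t idx) {
  return add(extractSubvector<4>(x, idx), extractSubvector<4>(y, idx));
}
```

The two functions return the same four elements for any in-range idx; the second form only ever performs a 128-bit add.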

lib/Target/X86/X86ISelLowering.cpp
30111	I think element indices should use DAG.getIntPtrConstant to make the type correct. It's not supposed to be i32.
33648	Can we teach combineExtractSubvector to narrow an extract of an add by extracting from the inputs? Then we shouldn't need any change here because we'll already have the narrow add?
33724	Why is this variable now upper case?
33792	Why was this moved out of the if.
test/CodeGen/X86/madd.ll
303	Why is this horizontal add not narrowed?

Would we be better off focussing on developing the @llvm.experimental.vector.reduce.add.* intrinsics? Or looking to get slpvectorizer to do extract_subvector style shuffles as part of the reduction?

jbhateja added inline comments. Sep 19 2017, 4:15 AM

lib/Target/X86/X86ISelLowering.cpp
30111	Yes this will be fixed.
33648	I think each node-specific combiner should look at patterns starting from itself, i.e. combineExtractSubvector could be taught to remove an extract_subvector from a vector shuffle (if it has only one use), thus producing a smaller vector shuffle; that would be a generic change. Looking for an opportunity like the one you mentioned, "extract of an add by extracting from the inputs", appears to be a specialization. As of now narrowing is done generically in the following two places: 1/ extract_vector_elt: it looks at its input vector and tries to scale it down if possible. 2/ Narrowing of binary operations (add/sub/fadd/fsub), which implicitly does what your comment says.
33724	For better visual clarity of variable name.
33792	This will be fixed.

I don't think we should be shrinking operations based on undef inputs. I think we should be shrinking them based on what elements are consumed by their users. There's no reason the shuffles in these reductions have to have undef elements. For integers we may rewrite the shuffle mask to undef in InstCombine if the elements aren't used downstream. But we don't do that for FP.

As an example, why shouldn't we be able to use a horizontal add for this:

define <4 x double> @fadd_noundef(<8 x double> %x225, <8 x double> %x227) {

  %x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
  %x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>
  %x229 = fadd <8 x double> %x226, %x228
  %x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x double> %x230
}
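
To see why a horizontal add fits this example, here is a standalone C++ sketch (not LLVM code) that transcribes the IR above and compares it with a model of the 256-bit VHADDPD semantics, which operate per 128-bit lane. The helper names shuffle, faddNoUndef, vhaddpd256 and lowHalf are hypothetical.

```cpp
#include <array>
#include <cstddef>

using V8 = std::array<double, 8>;
using V4 = std::array<double, 4>;

// Models shufflevector: indices 0..7 pick from a, 8..15 pick from b.
V8 shuffle(const V8 &a, const V8 &b, const std::array<int, 8> &mask) {
  V8 out{};
  for (std::size_t i = 0; i < out.size(); ++i) {
    int m = mask[i];
    out[i] = m < 8 ? a[m] : b[m - 8];
  }
  return out;
}

// Direct transcription of @fadd_noundef: 512-bit fadd, then keep the low
// four elements.
V4 faddNoUndef(const V8 &x, const V8 &y) {
  const std::array<int, 8> evenMask = {0, 8, 2, 10, 4, 12, 6, 14};
  const std::array<int, 8> oddMask = {1, 9, 3, 11, 5, 13, 7, 15};
  V8 even = shuffle(x, y, evenMask);
  V8 odd = shuffle(x, y, oddMask);
  V4 out{};
  for (std::size_t i = 0; i < out.size(); ++i)
    out[i] = even[i] + odd[i];
  return out;
}

// Models 256-bit VHADDPD: <a0+a1, b0+b1, a2+a3, b2+b3>.
V4 vhaddpd256(const V4 &a, const V4 &b) {
  return {a[0] + a[1], b[0] + b[1], a[2] + a[3], b[2] + b[3]};
}

// Low 256 bits of a 512-bit vector.
V4 lowHalf(const V8 &v) { return {v[0], v[1], v[2], v[3]}; }
```

Under these assumptions, faddNoUndef(x, y) equals vhaddpd256(lowHalf(x), lowHalf(y)), even though none of the shuffle mask elements are undef.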

lib/Target/X86/X86ISelLowering.cpp
1694	Use of MMX here is weird. We explicitly don't generate any optimized code for MMX. So making references to it in terms of SSE/AVX is misleading.

Merge branch 'master' of https://github.com/llvm-mirror/llvm
Review comments resolution.

In D36454#876132, @craig.topper wrote:
I don't think we should be shrinking operations based on undef inputs. I think we should be shrinking them based on what elements are consumed by their users. There's no reason the shuffles in these reductions have to have undef elements. For integers we may rewrite the shuffle mask to undef in InstCombine if the elements aren't used downstream. But we don't do that for FP.

As an example, why shouldn't we be able to use a horizontal add for this:
define <4 x double> @fadd_noundef(<8 x double> %x225, <8 x double> %x227) {

  %x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
  %x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>
  %x229 = fadd <8 x double> %x226, %x228
  %x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x double> %x230
}

We do generate horizontal addition in this case.

lib/Target/X86/X86ISelLowering.cpp
1694	Reference to MMX here is signifying one of the denominations of a smaller vector register.

The test case I gave does not generate horizontal add with your patch with avx512f enabled. It does with avx2 but that's only because type legalization did the dirty work of splitting the result.

Harbormaster completed remote builds in B10591: Diff 116634. Sep 26 2017, 3:31 PM

Changes to cover more patterns for [f]hadd/[f]hsub for AVX512 vector types.
Generic routines added which look at the uses / undef operands of an operation to scale it down.

Harbormaster completed remote builds in B10657: Diff 116998. Sep 28 2017, 8:15 AM

jbhateja added inline comments. Sep 28 2017, 8:22 AM

lib/Target/X86/X86ISelLowering.cpp
33638	Just realized that the DAG argument is not used here; it will be removed when the other comments on the patch are addressed.

jbhateja added reviewers: zvi, spatel. Sep 29 2017, 2:08 AM

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.
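
For readers unfamiliar with what such a reduction looks like once expanded, here is a standalone C++ sketch of one common shuffle-and-add expansion of a 16-element add reduction. It is an illustration only and is not claimed to be the exact pattern LLVM's legalizer emits; the name reduceAddByHalving is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using V16 = std::array<uint32_t, 16>;

// Reduces 16 elements in log2(16) = 4 steps. Each step models a shuffle
// that moves the upper half of the still-live elements down, followed by
// a vector add. After the first step only 8 elements are demanded, then
// 4, then 2, then 1, which is why the later adds can use narrower types.
uint32_t reduceAddByHalving(V16 v) {
  for (std::size_t width = 8; width >= 1; width /= 2) {
    for (std::size_t i = 0; i < width; ++i)
      v[i] += v[i + width];
  }
  return v[0];
}
```

The shrinking set of live elements at each step is the property that the narrowing proposals in this review try to exploit.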

This revision now requires changes to proceed. Sep 29 2017, 3:49 AM

In D36454#884252, @RKSimon wrote:

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

Hi Simon,

Thanks for pointing me to the code references. I tried a simple case, which was not optimized by InstCombiner::SimplifyDemandedVectorElts.
It works via the known-bits mechanism. In fact, none of the test cases provided in avx512-hadd-hsub.ll were optimized by InstCombiner.

define float @fhsub_16(<16 x float> %x225) {
;define <16 x float> @fhsub_16(<16 x float> %x225) {

%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x227 = fadd <16 x float> %x225, %x226
%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x229 = fsub <16 x float> %x227, %x228
%x230 = extractelement <16 x float> %x229, i32 0
ret float %x230

}
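
To show which input elements the returned scalar actually depends on, here is a standalone C++ sketch of the function above (not LLVM code). Undef shuffle lanes are modeled as 0.0f, which is an assumption made only so the wide version is well defined; the names fhsub16Wide and fhsub16Narrow are hypothetical.

```cpp
#include <array>
#include <cstddef>

using V16 = std::array<float, 16>;

// Full-width transcription of @fhsub_16. Lanes that the IR leaves undef
// are filled with 0.0f here; they never reach element 0 of the result.
float fhsub16Wide(const V16 &x) {
  V16 s1{}; // models %x226: <x2, x3, undef, ...>
  s1[0] = x[2];
  s1[1] = x[3];
  V16 a{}; // models %x227 = fadd %x225, %x226
  for (std::size_t i = 0; i < a.size(); ++i)
    a[i] = x[i] + s1[i];
  V16 s2{}; // models %x228: <a1, undef, ...>
  s2[0] = a[1];
  V16 d{}; // models %x229 = fsub %x227, %x228
  for (std::size_t i = 0; i < d.size(); ++i)
    d[i] = a[i] - s2[i];
  return d[0]; // models extractelement ..., i32 0
}

// Only elements 0..3 of the input are demanded by the result.
float fhsub16Narrow(const V16 &x) {
  return (x[0] + x[2]) - (x[1] + x[3]);
}
```

Changing any of x[4] through x[15] leaves both results unchanged, so the whole computation fits in a 128-bit vector.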

This patch provides two generic routines which try to scale down operations in denominations of X86 vector register sizes; patch D36650 suggests a similar effort.

PR33758 is specifically about inference of horizontal operations for AVX512 vector types.

Kindly elaborate on how add reduction is helpful here for the cases provided in the test case avx512-hadd-hsub.ll.

Thanks

In D36454#884427, @jbhateja wrote:

In D36454#884252, @RKSimon wrote:

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

Hi Simon,

Thanks for pointing me to the code references. I tried a simple case, which was not optimized by InstCombiner::SimplifyDemandedVectorElts.
It works via the known-bits mechanism. In fact, none of the test cases provided in avx512-hadd-hsub.ll were optimized by InstCombiner.

In my opinion, @llvm.experimental.vector.reduce.fadd and friends should be treated as the canonical forms of those operations. InstCombine should form these intrinsics upon encountering these shuffle patterns.

The SLPVectorizer, LoopVectorizer, etc. should use the intrinsics when handling relevant reductions.

Is there an advantage to handling the shuffle patterns in the backend directly as opposed to forming the intrinsics earlier and then handling them in the backend?

define float @fhsub_16(<16 x float> %x225) {
;define <16 x float> @fhsub_16(<16 x float> %x225) {
%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x227 = fadd <16 x float> %x225, %x226
%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x229 = fsub <16 x float> %x227, %x228
%x230 = extractelement <16 x float> %x229, i32 0
ret float %x230
}

This patch provides two generic routines which try to scale down operations in denominations of X86 vector register sizes; patch D36650 suggests a similar effort.

PR33758 is specifically about inference of horizontal operations for AVX512 vector types.

Kindly elaborate on how add reduction is helpful here for the cases provided in the test case avx512-hadd-hsub.ll.

Thanks

RKSimon resigned from this revision. Oct 11 2018, 8:56 AM

@jbhateja Abandon this? @spatel's recent work seems to have covered everything already

Revision Contents

Path                                    Size
lib/Target/X86/X86ISelLowering.cpp      233 lines
test/CodeGen/X86/avx512-hadd-hsub.ll    48 lines
test/CodeGen/X86/madd.ll                3 lines
test/CodeGen/X86/sad.ll                 18 lines

Diff 112829

lib/Target/X86/X86ISelLowering.cpp

(1,685 lines not shown)
// but a conditional move could be stalled by an expensive earlier operation.		// but a conditional move could be stalled by an expensive earlier operation.
PredictableSelectIsExpensive = Subtarget.getSchedModel().isOutOfOrder();		PredictableSelectIsExpensive = Subtarget.getSchedModel().isOutOfOrder();
EnableExtLdPromotion = true;		EnableExtLdPromotion = true;
setPrefFunctionAlignment(4); // 2^4 bytes.		setPrefFunctionAlignment(4); // 2^4 bytes.

verifyIntrinsicTables();		verifyIntrinsicTables();
}		}

		typedef enum : unsigned { MMX = 0, XMM = 1, YMM = 3, ZMM = 7 } VecRegKind;
		craig.topper: Use of MMX here is weird. We explicitly don't generate any optimized code for MMX. So making references to it in terms of SSE/AVX is misleading.
		jbhateja: Reference to MMX here is signifying one of the denominations of a smaller vector register.
		enum : unsigned { UNDEF, FRWD, BKWD };

		static inline int GetLaneIndex(int Bits, int StartIdx = 0) {
		return (((Bits + 64) >> 6) - 1) + StartIdx;
		}

		/// Find the smallest sub-register which accommodates all the non-undefs
		/// of a node, Lane granularity is taken as 64 bit (MMX). Lookup is
		/// performed in both forward and backward directions.
		static bool GetMinimalUsedSubReg(SmallBitVector &Lanes, int &ForwardSubReg,
		VecRegKind &SubRegKind) {
		if (Lanes.all() || Lanes.none()) {
		ForwardSubReg = UNDEF;
		SubRegKind = ZMM;
		return false;
		}

		auto GetSubReg = [&](bool ScanForward) -> VecRegKind {
		VecRegKind SubReg = ZMM;
		VecRegKind VecSubRegs[4] = {MMX, XMM, YMM, ZMM};
		for (int i = 0; i < 4; i++) {
		int Checker = ScanForward ? Lanes.find_next(VecSubRegs[i])
		: Lanes.find_prev(ZMM - VecSubRegs[i]);
		if (Checker == -1) {
		SubReg = VecSubRegs[i];
		break;
		}
		}
		return SubReg;
		};

		VecRegKind FrwdSubReg = GetSubReg(true);
		VecRegKind BkwdSubReg = GetSubReg(false);
		if (FrwdSubReg < BkwdSubReg) {
		ForwardSubReg = FRWD;
		SubRegKind = FrwdSubReg;
		return true;
		} else if (BkwdSubReg < FrwdSubReg) {
		ForwardSubReg = BKWD;
		SubRegKind = BkwdSubReg;
		return true;
		}
		return false;
		}

		// Granularity of the lane considered is 64 bit, mark a bit in the
		// bitvector if corresponding lane is accessed.
		static bool MarkUsedLanes(SDNode *N, SmallBitVector &Lanes, int StartIdx) {
		bool retVal = false;

		switch (N->getOpcode()) {
		default: {
		int VTSz = N->getValueType(0).getSizeInBits();
		for (int i = StartIdx, e = GetLaneIndex(VTSz, StartIdx); i < e; i++)
		Lanes[i] = 1;
		} break;
		case ISD::CONCAT_VECTORS: {
		int SZInBits = 0;
		for (auto &Oprnd : N->op_values()) {
		if (!Oprnd.isUndef())
		retVal |= MarkUsedLanes(Oprnd.getNode(), Lanes, StartIdx);
		SZInBits += Oprnd.getValueType().getSizeInBits();
		StartIdx = GetLaneIndex(SZInBits);
		}
		} break;
		case ISD::VECTOR_SHUFFLE: {
		ShuffleVectorSDNode *SV = dyn_cast<ShuffleVectorSDNode>(N);
		EVT ElemTy = SV->getOperand(0).getValueType().getVectorElementType();
		int OperNumElems = SV->getOperand(0).getValueType().getVectorNumElements();
		int ElemSz = ElemTy.getSizeInBits();

		bool OpersUndef[2] = {SV->getOperand(0).isUndef(),
		SV->getOperand(1).isUndef()};

		ArrayRef<int> Mask = SV->getMask();
		for (int i = 0, e = Mask.size(); i < e; i++) {
		if (Mask[i] >= 0 && !OpersUndef[Mask[i] >= OperNumElems])
		Lanes[GetLaneIndex(i * ElemSz, StartIdx)] = 1;
		}
		} break;
		}

		if (Lanes.all())
		return true;

		return retVal;
		}

		// A generic routine which checks if operands of a binary operation
		// can be scaled down to a lower sub-register, also it sets the
		// StartIdx (from where operand's extraction needs to start)
		// NewOperVT (new value type of result) and PadVT (value type of
		// padding for result).
		static bool TryDownScalingBinaryOperands(SDNode *N, SelectionDAG &DAG,
		const X86Subtarget &Subtarget,
		uint64_t &StartIdx, EVT &NewOperVT,
		EVT &PadVT) {
		SDLoc DL(N);
		VecRegKind Op0SubReg, Op1SubReg;
		int Op0FrwdSubReg, Op1FrwdSubReg;

		assert(N->getNumOperands() == 2 && "Not a binary operation");

		SDValue Op0 = N->getOperand(0);
		SDValue Op1 = N->getOperand(1);

		EVT OperVT = Op0.getValueType();
		int LaneSz = GetLaneIndex(OperVT.getSizeInBits());

		SmallBitVector Op0Lanes(LaneSz, 0);
		SmallBitVector Op1Lanes(LaneSz, 0);

		if (!OperVT.isSimple() || OperVT.getSizeInBits() < 64)
		return false;

		EVT OperElemVT = OperVT.getVectorElementType();
		int OperNumElems = OperVT.getVectorNumElements();
		int OperElemSZ = OperElemVT.getSizeInBits();

		// Mark bit corresponding to 64 bit lane if the particular
		// lane is accessed by the node.
		bool Op0FullUse = MarkUsedLanes(Op0.getNode(), Op0Lanes, 0);
		bool Op1FullUse = MarkUsedLanes(Op1.getNode(), Op1Lanes, 0);
		if (Op0FullUse && Op1FullUse)
		return false;

		// Find the smallest sub-register which can accommodate
		// non-undef part of operands.
		bool Res0 = GetMinimalUsedSubReg(Op0Lanes, Op0FrwdSubReg, Op0SubReg);
		bool Res1 = GetMinimalUsedSubReg(Op1Lanes, Op1FrwdSubReg, Op1SubReg);
		if (!Res0 && !Res1)
		return false;

		int OperSubReg = std::min(Op0SubReg, Op1SubReg);
		int PadNumElems =
		PowerOf2Floor(OperNumElems - ((OperSubReg + 1) * 64) / OperElemSZ);
		int NewOperNumElems = OperNumElems - PadNumElems;

		if ((Op0FrwdSubReg && Op1FrwdSubReg && Op0FrwdSubReg != Op1FrwdSubReg) ||
		NewOperNumElems >= OperNumElems)
		return false;

		// Legal direction for one of the operand could be UNDEF
		// hence both operand direction are OR'ed to ascertain
		// the actual direction of subreg.
		int FrwdSubReg = Op0FrwdSubReg | Op1FrwdSubReg;
		NewOperVT = EVT::getVectorVT(*DAG.getContext(), OperElemVT, NewOperNumElems);
		PadVT = EVT::getVectorVT(*DAG.getContext(), OperElemVT, PadNumElems);

		const TargetLowering &TLI = DAG.getTargetLoweringInfo();
		if (!TLI.isTypeLegal(NewOperVT) || !TLI.isTypeLegal(PadVT))
		return false;

		StartIdx = FrwdSubReg ? 0 : OperNumElems - NewOperNumElems;
		return true;
		}

// This has so far only been implemented for 64-bit MachO.		// This has so far only been implemented for 64-bit MachO.
bool X86TargetLowering::useLoadStackGuardNode() const {		bool X86TargetLowering::useLoadStackGuardNode() const {
return Subtarget.isTargetMachO() && Subtarget.is64Bit();		return Subtarget.isTargetMachO() && Subtarget.is64Bit();
}		}

TargetLoweringBase::LegalizeTypeAction		TargetLoweringBase::LegalizeTypeAction
X86TargetLowering::getPreferredVectorAction(EVT VT) const {		X86TargetLowering::getPreferredVectorAction(EVT VT) const {
if (ExperimentalVectorWideningLegalization &&		if (ExperimentalVectorWideningLegalization &&
(27,358 lines not shown; context: for (int Elt : SVOp->getMask()))
Mask.push_back(Elt < NumElts ? Elt : (Elt - NumElts / 2));		Mask.push_back(Elt < NumElts ? Elt : (Elt - NumElts / 2));

SDLoc DL(N);		SDLoc DL(N);
SDValue Concat = DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, N0.getOperand(0),		SDValue Concat = DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, N0.getOperand(0),
N1.getOperand(0));		N1.getOperand(0));
return DAG.getVectorShuffle(VT, DL, Concat, DAG.getUNDEF(VT), Mask);		return DAG.getVectorShuffle(VT, DL, Concat, DAG.getUNDEF(VT), Mask);
}		}


static SDValue combineShuffle(SDNode *N, SelectionDAG &DAG,		static SDValue combineShuffle(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,		TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
SDLoc dl(N);		SDLoc dl(N);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
const TargetLowering &TLI = DAG.getTargetLoweringInfo();		const TargetLowering &TLI = DAG.getTargetLoweringInfo();
// If we have legalized the vector types, look for blends of FADD and FSUB		// If we have legalized the vector types, look for blends of FADD and FSUB
// nodes that we can fuse into an ADDSUB node.		// nodes that we can fuse into an ADDSUB node.
(868 lines not shown; context: for (SDNode::use_iterator UI = InputVector.getNode()->use_begin(),)
UE = InputVector.getNode()->use_end(); UI != UE; ++UI) {		UE = InputVector.getNode()->use_end(); UI != UE; ++UI) {
if (UI.getUse().getResNo() != InputVector.getResNo())		if (UI.getUse().getResNo() != InputVector.getResNo())
return SDValue();		return SDValue();

SDNode *Extract = *UI;		SDNode *Extract = *UI;
if (Extract->getOpcode() != ISD::EXTRACT_VECTOR_ELT)		if (Extract->getOpcode() != ISD::EXTRACT_VECTOR_ELT)
return SDValue();		return SDValue();

if (Extract->getValueType(0) != MVT::i32)		if (Extract->getValueType(0) != MVT::i32)
		craig.topper: I think element indices should use DAG.getIntPtrConstant to make the type correct. It's not supposed to be i32.
		jbhateja: Yes this will be fixed.
return SDValue();		return SDValue();
if (!Extract->hasOneUse())		if (!Extract->hasOneUse())
return SDValue();		return SDValue();
if (Extract->use_begin()->getOpcode() != ISD::SIGN_EXTEND &&		if (Extract->use_begin()->getOpcode() != ISD::SIGN_EXTEND &&
Extract->use_begin()->getOpcode() != ISD::ZERO_EXTEND)		Extract->use_begin()->getOpcode() != ISD::ZERO_EXTEND)
return SDValue();		return SDValue();
if (!isa<ConstantSDNode>(Extract->getOperand(1)))		if (!isa<ConstantSDNode>(Extract->getOperand(1)))
return SDValue();		return SDValue();
(3,509 lines not shown)
/// B = < float b0, float b1, float b2, float b3 >		/// B = < float b0, float b1, float b2, float b3 >
/// then the result of doing a horizontal operation on A and B is		/// then the result of doing a horizontal operation on A and B is
/// A horizontal-op B = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >.		/// A horizontal-op B = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >.
/// In short, LHS and RHS are inspected to see if LHS op RHS is of the form		/// In short, LHS and RHS are inspected to see if LHS op RHS is of the form
/// A horizontal-op B, for some already available A and B, and if so then LHS is		/// A horizontal-op B, for some already available A and B, and if so then LHS is
/// set to A, RHS to B, and the routine returns 'true'.		/// set to A, RHS to B, and the routine returns 'true'.
/// Note that the binary operation should have the property that if one of the		/// Note that the binary operation should have the property that if one of the
/// operands is UNDEF then the result is UNDEF.		/// operands is UNDEF then the result is UNDEF.
static bool isHorizontalBinOp(SDValue &LHS, SDValue &RHS, bool IsCommutative) {		static bool isHorizontalBinOp(SDValue &LHS, SDValue &RHS, bool IsCommutative,
		bool AllowAVX512VT = false) {
		jbhateja: Just realized that the DAG argument is not used here; it will be removed when the other comments on the patch are addressed.
// Look for the following pattern: if		// Look for the following pattern: if
// A = < float a0, float a1, float a2, float a3 >		// A = < float a0, float a1, float a2, float a3 >
// B = < float b0, float b1, float b2, float b3 >		// B = < float b0, float b1, float b2, float b3 >
// and		// and
// LHS = VECTOR_SHUFFLE A, B, <0, 2, 4, 6>		// LHS = VECTOR_SHUFFLE A, B, <0, 2, 4, 6>
// RHS = VECTOR_SHUFFLE A, B, <1, 3, 5, 7>		// RHS = VECTOR_SHUFFLE A, B, <1, 3, 5, 7>
// then LHS op RHS = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >		// then LHS op RHS = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >
// which is A horizontal-op B.		// which is A horizontal-op B.

// At least one of the operands should be a vector shuffle.		// At least one of the operands should be a vector shuffle.
		craig.topper: Can we teach combineExtractSubvector to narrow an extract of an add by extracting from the inputs? Then we shouldn't need any change here because we'll already have the narrow add?
		jbhateja: I think each node-specific combiner should look at patterns starting from itself, i.e. combineExtractSubvector could be taught to remove an extract_subvector from a vector shuffle (if it has only one use), thus producing a smaller vector shuffle; that would be a generic change. Looking for an opportunity like the one you mentioned, "extract of an add by extracting from the inputs", appears to be a specialization. As of now narrowing is done generically in the following two places: 1/ extract_vector_elt: it looks at its input vector and tries to scale it down if possible. 2/ Narrowing of binary operations (add/sub/fadd/fsub), which implicitly does what your comment says.
if (LHS.getOpcode() != ISD::VECTOR_SHUFFLE &&		if (LHS.getOpcode() != ISD::VECTOR_SHUFFLE &&
RHS.getOpcode() != ISD::VECTOR_SHUFFLE)		RHS.getOpcode() != ISD::VECTOR_SHUFFLE)
return false;		return false;

MVT VT = LHS.getSimpleValueType();		MVT VT = LHS.getSimpleValueType();

assert((VT.is128BitVector() || VT.is256BitVector()) &&		assert((AllowAVX512VT || VT.is128BitVector() || VT.is256BitVector()) &&
"Unsupported vector type for horizontal add/sub");		"Unsupported vector type for horizontal add/sub");

// Handle 128 and 256-bit vector lengths. AVX defines horizontal add/sub to		// Handle 128 and 256-bit vector lengths. AVX defines horizontal add/sub to
// operate independently on 128-bit lanes.		// operate independently on 128-bit lanes.
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();
unsigned NumLanes = VT.getSizeInBits()/128;		unsigned NumLanes = VT.getSizeInBits()/128;
unsigned NumLaneElts = NumElts / NumLanes;		unsigned NumLaneElts = NumElts / NumLanes;
assert((NumLaneElts % 2 == 0) &&		assert((NumLaneElts % 2 == 0) &&
"Vector type should have an even number of elements in each lane");		"Vector type should have an even number of elements in each lane");
(51 lines not shown; context: static bool isHorizontalBinOp(SDValue &LHS, SDValue &RHS, bool IsCommutative,)
// rewriting the mask).		// rewriting the mask).
if (A != C)		if (A != C)
ShuffleVectorSDNode::commuteMask(RMask);		ShuffleVectorSDNode::commuteMask(RMask);

// At this point LHS and RHS are equivalent to		// At this point LHS and RHS are equivalent to
// LHS = VECTOR_SHUFFLE A, B, LMask		// LHS = VECTOR_SHUFFLE A, B, LMask
// RHS = VECTOR_SHUFFLE A, B, RMask		// RHS = VECTOR_SHUFFLE A, B, RMask
// Check that the masks correspond to performing a horizontal operation.		// Check that the masks correspond to performing a horizontal operation.
for (unsigned l = 0; l != NumElts; l += NumLaneElts) {		for (unsigned L = 0; L != NumElts; L += NumLaneElts) {
		craig.topper: Why is this variable now upper case?
		jbhateja: For better visual clarity of variable name.
for (unsigned i = 0; i != NumLaneElts; ++i) {		for (unsigned i = 0; i != NumLaneElts; ++i) {
int LIdx = LMask[i+l], RIdx = RMask[i+l];		int LIdx = LMask[i+L], RIdx = RMask[i+L];

// Ignore any UNDEF components.		// Ignore any UNDEF components.
if (LIdx < 0 || RIdx < 0 ||		if (LIdx < 0 || RIdx < 0 ||
(!A.getNode() && (LIdx < (int)NumElts || RIdx < (int)NumElts)) ||		(!A.getNode() && (LIdx < (int)NumElts || RIdx < (int)NumElts)) ||
(!B.getNode() && (LIdx >= (int)NumElts || RIdx >= (int)NumElts)))		(!B.getNode() && (LIdx >= (int)NumElts || RIdx >= (int)NumElts)))
continue;		continue;

// Check that successive elements are being operated on. If not, this is		// Check that successive elements are being operated on. If not, this is
// not a horizontal operation.		// not a horizontal operation.
unsigned Src = (i/HalfLaneElts); // each lane is split between srcs		unsigned Src = (i/HalfLaneElts); // each lane is split between srcs
int Index = 2*(i%HalfLaneElts) + NumElts*Src + l;		int Index = 2*(i%HalfLaneElts) + NumElts*Src + L;
if (!(LIdx == Index && RIdx == Index + 1) &&		if (!(LIdx == Index && RIdx == Index + 1) &&
!(IsCommutative && LIdx == Index + 1 && RIdx == Index))		!(IsCommutative && LIdx == Index + 1 && RIdx == Index))
return false;		return false;
}		}
}		}

LHS = A.getNode() ? A : B; // If A is 'UNDEF', use B for it.		LHS = A.getNode() ? A : B; // If A is 'UNDEF', use B for it.
RHS = B.getNode() ? B : A; // If B is 'UNDEF', use A for it.		RHS = B.getNode() ? B : A; // If B is 'UNDEF', use A for it.
return true;		return true;
}		}
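
For readers following the mask check in the loop above, here is a minimal standalone sketch of the same index arithmetic. It is not the LLVM code: `isHorizontalMaskPair` is a hypothetical name, plain `std::vector<int>` masks stand in for the shuffle masks, and only negative (undef) indices are skipped, so the extra handling for a missing A or B operand is left out.

```cpp
#include <cassert>
#include <vector>

// Standalone model of the horizontal-op mask check (hypothetical helper,
// not part of LLVM). Mask entries index into the concatenation of A and B;
// a negative entry means "undef" and is ignored.
bool isHorizontalMaskPair(const std::vector<int> &LMask,
                          const std::vector<int> &RMask,
                          unsigned NumLaneElts, bool IsCommutative) {
  assert(LMask.size() == RMask.size() && "masks must have equal length");
  const unsigned NumElts = static_cast<unsigned>(LMask.size());
  if (NumLaneElts < 2 || NumLaneElts % 2 != 0 || NumElts % NumLaneElts != 0)
    return false;
  const unsigned HalfLaneElts = NumLaneElts / 2;

  for (unsigned L = 0; L != NumElts; L += NumLaneElts) {
    for (unsigned i = 0; i != NumLaneElts; ++i) {
      const int LIdx = LMask[i + L], RIdx = RMask[i + L];
      if (LIdx < 0 || RIdx < 0)
        continue; // Ignore undef components.

      // The first half of each lane reads from A, the second half from B.
      const unsigned Src = i / HalfLaneElts;
      const int Index =
          static_cast<int>(2 * (i % HalfLaneElts) + NumElts * Src + L);

      const bool InOrder = LIdx == Index && RIdx == Index + 1;
      const bool Swapped = IsCommutative && LIdx == Index + 1 && RIdx == Index;
      if (!InOrder && !Swapped)
        return false;
    }
  }
  return true;
}
```

With 8 elements and 4 elements per 128-bit lane, the masks `{0,2,8,10,4,6,12,14}` and `{1,3,9,11,5,7,13,15}` pass the check; swapping them passes only when `IsCommutative` is true.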

		static SDValue TryGenHorizontalAddSub(SDNode *N, SelectionDAG &DAG,
		const X86Subtarget &Subtarget,
		bool IsCommutative, int Opcode) {
		SDLoc DL(N);
		uint64_t StartIdx;
		EVT PadVT, NewOperVT;

		EVT VT = N->getValueType(0);
		assert((VT == MVT::v16i32 || VT == MVT::v16f32 || VT == MVT::v8f64) &&
		"Unexpected DAG node type");

		SDValue Op0 = N->getOperand(0);
		SDValue Op1 = N->getOperand(1);

		if (isHorizontalBinOp(Op0, Op1, IsCommutative, true) &&
		(TryDownScalingBinaryOperands(N, DAG, Subtarget, StartIdx, NewOperVT,
		PadVT))) {
		SDValue NewOp0 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, NewOperVT, Op0,
		DAG.getIntPtrConstant(StartIdx, DL));
		SDValue NewOp1 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, NewOperVT, Op1,
		DAG.getIntPtrConstant(StartIdx, DL));
		SDValue NewN = DAG.getNode(Opcode, SDLoc(N), NewOperVT, NewOp0, NewOp1);
		SDValue ConcatOps[2] = {DAG.getUNDEF(PadVT), NewN};
		if (StartIdx == 0) {
		ConcatOps[0] = NewN;
		ConcatOps[1] = DAG.getUNDEF(PadVT);
		}
		return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ConcatOps);
		}

		return SDValue();
		}
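
The narrowing this helper aims for can be sanity-checked with a small scalar model. The sketch below assumes the low 256-bit half is the one selected (StartIdx == 0); how `TryDownScalingBinaryOperands` makes that choice is not shown in this excerpt. The names `hadd256`, `lowHalf` and `addOfShufflesLow` are hypothetical, and the shuffle masks are copied from the `hadd_16_3` test in this revision.

```cpp
#include <array>
#include <cstdint>

using V8 = std::array<std::uint32_t, 8>;
using V16 = std::array<std::uint32_t, 16>;

// Scalar model of a 256-bit integer horizontal add: within each 128-bit
// lane, adjacent pairs of A fill the low half and pairs of B the high half.
V8 hadd256(const V8 &A, const V8 &B) {
  V8 R{};
  for (unsigned Lane = 0; Lane != 2; ++Lane) {
    const unsigned Base = Lane * 4;
    R[Base + 0] = A[Base + 0] + A[Base + 1];
    R[Base + 1] = A[Base + 2] + A[Base + 3];
    R[Base + 2] = B[Base + 0] + B[Base + 1];
    R[Base + 3] = B[Base + 2] + B[Base + 3];
  }
  return R;
}

// Models extracting the low 256 bits of a 512-bit vector.
V8 lowHalf(const V16 &V) {
  V8 R{};
  for (unsigned i = 0; i != 8; ++i)
    R[i] = V[i];
  return R;
}

// Reference: the defined (low 8) elements of the 512-bit add of two
// shuffles, using the masks from the hadd_16_3 test.
V8 addOfShufflesLow(const V16 &X, const V16 &Y) {
  static const int LMask[8] = {0, 2, 16, 18, 4, 6, 20, 22};
  static const int RMask[8] = {1, 3, 17, 19, 5, 7, 21, 23};
  auto Pick = [&](int Idx) { return Idx < 16 ? X[Idx] : Y[Idx - 16]; };
  V8 R{};
  for (unsigned i = 0; i != 8; ++i)
    R[i] = Pick(LMask[i]) + Pick(RMask[i]);
  return R;
}
```

For any inputs, `hadd256(lowHalf(X), lowHalf(Y))` equals `addOfShufflesLow(X, Y)`, which is why the upper half of the result can be left undef in that test.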


/// Do target-specific dag combines on floating-point adds/subs.		/// Do target-specific dag combines on floating-point adds/subs.
static SDValue combineFaddFsub(SDNode *N, SelectionDAG &DAG,		static SDValue combineFaddFsub(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
SDValue LHS = N->getOperand(0);		SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);		SDValue RHS = N->getOperand(1);
bool IsFadd = N->getOpcode() == ISD::FADD;		bool IsFadd = N->getOpcode() == ISD::FADD;
assert((IsFadd || N->getOpcode() == ISD::FSUB) && "Wrong opcode");		assert((IsFadd || N->getOpcode() == ISD::FSUB) && "Wrong opcode");

		auto NewOpcode = IsFadd ? X86ISD::FHADD : X86ISD::FHSUB;
		craig.topper: Why was this moved out of the if?
		jbhateja: This will be fixed.

// Try to synthesize horizontal add/sub from adds/subs of shuffles.		// Try to synthesize horizontal add/sub from adds/subs of shuffles.
if (((Subtarget.hasSSE3() && (VT == MVT::v4f32 || VT == MVT::v2f64)) ||		if (((Subtarget.hasSSE3() && (VT == MVT::v4f32 || VT == MVT::v2f64)) ||
(Subtarget.hasFp256() && (VT == MVT::v8f32 || VT == MVT::v4f64))) &&		(Subtarget.hasFp256() && (VT == MVT::v8f32 || VT == MVT::v4f64))) &&
isHorizontalBinOp(LHS, RHS, IsFadd)) {		isHorizontalBinOp(LHS, RHS, IsFadd)) {
auto NewOpcode = IsFadd ? X86ISD::FHADD : X86ISD::FHSUB;
return DAG.getNode(NewOpcode, SDLoc(N), VT, LHS, RHS);		return DAG.getNode(NewOpcode, SDLoc(N), VT, LHS, RHS);
}		}

		SDValue V;
		if ((Subtarget.hasFp256() && (VT == MVT::v16f32 || VT == MVT::v8f64)) &&
		(V = TryGenHorizontalAddSub(N, DAG, Subtarget, IsFadd, NewOpcode)))
		return V;

return SDValue();		return SDValue();
}		}

/// Attempt to pre-truncate inputs to arithmetic ops if it will simplify		/// Attempt to pre-truncate inputs to arithmetic ops if it will simplify
/// the codegen.		/// the codegen.
/// e.g. TRUNC( BINOP( X, Y ) ) --> BINOP( TRUNC( X ), TRUNC( Y ) )		/// e.g. TRUNC( BINOP( X, Y ) ) --> BINOP( TRUNC( X ), TRUNC( Y ) )
static SDValue combineTruncatedArithmetic(SDNode *N, SelectionDAG &DAG,		static SDValue combineTruncatedArithmetic(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget,		const X86Subtarget &Subtarget,
[...]	if (!ISD::isConstantSplatVector(N1, SplatVal) ||
!SplatVal.isOneValue())		!SplatVal.isOneValue())
return SDValue();		return SDValue();

SDValue AllOnesVec = getOnesVector(VT, DAG, SDLoc(N));		SDValue AllOnesVec = getOnesVector(VT, DAG, SDLoc(N));
unsigned NewOpcode = N->getOpcode() == ISD::ADD ? ISD::SUB : ISD::ADD;		unsigned NewOpcode = N->getOpcode() == ISD::ADD ? ISD::SUB : ISD::ADD;
return DAG.getNode(NewOpcode, SDLoc(N), VT, N->getOperand(0), AllOnesVec);		return DAG.getNode(NewOpcode, SDLoc(N), VT, N->getOperand(0), AllOnesVec);
}		}


static SDValue combineAdd(SDNode *N, SelectionDAG &DAG,		static SDValue combineAdd(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
		SDLoc DL(N);

const SDNodeFlags Flags = N->getFlags();		const SDNodeFlags Flags = N->getFlags();
if (Flags.hasVectorReduction()) {		if (Flags.hasVectorReduction()) {
if (SDValue Sad = combineLoopSADPattern(N, DAG, Subtarget))		if (SDValue Sad = combineLoopSADPattern(N, DAG, Subtarget))
return Sad;		return Sad;
if (SDValue MAdd = combineLoopMAddPattern(N, DAG, Subtarget))		if (SDValue MAdd = combineLoopMAddPattern(N, DAG, Subtarget))
return MAdd;		return MAdd;
}		}
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
SDValue Op0 = N->getOperand(0);		SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);		SDValue Op1 = N->getOperand(1);

// Try to synthesize horizontal adds from adds of shuffles.		// Try to synthesize horizontal adds from adds of shuffles.
if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||		if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||
(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&		(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&
isHorizontalBinOp(Op0, Op1, true))		isHorizontalBinOp(Op0, Op1, true))
return DAG.getNode(X86ISD::HADD, SDLoc(N), VT, Op0, Op1);		return DAG.getNode(X86ISD::HADD, SDLoc(N), VT, Op0, Op1);

		SDValue V;
		craig.topper: Don't we need to make sure this only happens on v32i16 and v16i32 types? combineAdd can get called on all sorts of types.
		if ((Subtarget.hasInt256() && VT == MVT::v16i32) &&
		(V = TryGenHorizontalAddSub(N, DAG, Subtarget, true, X86ISD::HADD)))
		return V;

if (SDValue V = combineIncDecVector(N, DAG))		if (SDValue V = combineIncDecVector(N, DAG))
return V;		return V;

return combineAddOrSubToADCOrSBB(N, DAG);		if (SDValue V = combineAddOrSubToADCOrSBB(N, DAG))
		return V;

		return SDValue();
}		}

		craig.topper: Don't use a SmallVector for a hardcoded two elements. Just use a plain old array.
static SDValue combineSub(SDNode *N, SelectionDAG &DAG,		static SDValue combineSub(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
SDValue Op0 = N->getOperand(0);		SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);		SDValue Op1 = N->getOperand(1);

// X86 can't encode an immediate LHS of a sub. See if we can push the		// X86 can't encode an immediate LHS of a sub. See if we can push the
// negation into a preceding instruction.		// negation into a preceding instruction.
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op0)) {		if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op0)) {
[...]	static SDValue combineSub(SDNode *N, SelectionDAG &DAG,

// Try to synthesize horizontal subs from subs of shuffles.		// Try to synthesize horizontal subs from subs of shuffles.
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||		if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||
(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&		(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&
isHorizontalBinOp(Op0, Op1, false))		isHorizontalBinOp(Op0, Op1, false))
return DAG.getNode(X86ISD::HSUB, SDLoc(N), VT, Op0, Op1);		return DAG.getNode(X86ISD::HSUB, SDLoc(N), VT, Op0, Op1);

		SDValue V;
		if ((Subtarget.hasInt256() && VT == MVT::v16i32) &&
		(V = TryGenHorizontalAddSub(N, DAG, Subtarget, false, X86ISD::HSUB)))
		return V;

if (SDValue V = combineIncDecVector(N, DAG))		if (SDValue V = combineIncDecVector(N, DAG))
return V;		return V;

return combineAddOrSubToADCOrSBB(N, DAG);		return combineAddOrSubToADCOrSBB(N, DAG);
}		}

static SDValue combineVSZext(SDNode *N, SelectionDAG &DAG,		static SDValue combineVSZext(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,		TargetLowering::DAGCombinerInfo &DCI,
[...]

test/CodeGen/X86/avx512-hadd-hsub.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	;RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=knl | FileCheck %s --check-prefix=KNL			;RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=knl | FileCheck %s --check-prefix=KNL
	;RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=skx | FileCheck %s --check-prefix=SKX			;RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=skx | FileCheck %s --check-prefix=SKX

	define i32 @hadd_16(<16 x i32> %x225) {			define i32 @hadd_16(<16 x i32> %x225) {
	; KNL-LABEL: hadd_16:			; KNL-LABEL: hadd_16:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; KNL-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vmovd %xmm0, %eax			; KNL-NEXT: vmovd %xmm0, %eax
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: hadd_16:			; SKX-LABEL: hadd_16:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; SKX-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vmovd %xmm0, %eax			; SKX-NEXT: vmovd %xmm0, %eax
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x i32> %x225, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <16 x i32> %x225, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x227 = add <16 x i32> %x225, %x226			%x227 = add <16 x i32> %x225, %x226
	%x228 = shufflevector <16 x i32> %x227, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <16 x i32> %x227, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = add <16 x i32> %x227, %x228			%x229 = add <16 x i32> %x227, %x228
	%x230 = extractelement <16 x i32> %x229, i32 0			%x230 = extractelement <16 x i32> %x229, i32 0
	ret i32 %x230			ret i32 %x230
	}			}

	define i32 @hsub_16(<16 x i32> %x225) {			define i32 @hsub_16(<16 x i32> %x225) {
	; KNL-LABEL: hsub_16:			; KNL-LABEL: hsub_16:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; KNL-NEXT: vphsubd %ymm0, %ymm0, %ymm0
	; KNL-NEXT: vpsubd %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vmovd %xmm0, %eax			; KNL-NEXT: vmovd %xmm0, %eax
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: hsub_16:			; SKX-LABEL: hsub_16:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; SKX-NEXT: vphsubd %ymm0, %ymm0, %ymm0
	; SKX-NEXT: vpsubd %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vmovd %xmm0, %eax			; SKX-NEXT: vmovd %xmm0, %eax
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x i32> %x225, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <16 x i32> %x225, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x227 = add <16 x i32> %x225, %x226			%x227 = add <16 x i32> %x225, %x226
	%x228 = shufflevector <16 x i32> %x227, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <16 x i32> %x227, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = sub <16 x i32> %x227, %x228			%x229 = sub <16 x i32> %x227, %x228
	%x230 = extractelement <16 x i32> %x229, i32 0			%x230 = extractelement <16 x i32> %x229, i32 0
	ret i32 %x230			ret i32 %x230
	}			}

	define float @fhadd_16(<16 x float> %x225) {			define float @fhadd_16(<16 x float> %x225) {
	; KNL-LABEL: fhadd_16:			; KNL-LABEL: fhadd_16:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; KNL-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
	; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0			; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; KNL-NEXT: vhaddps %ymm0, %ymm0, %ymm0
	; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fhadd_16:			; SKX-LABEL: fhadd_16:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; SKX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
	; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0			; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; SKX-NEXT: vhaddps %ymm0, %ymm0, %ymm0
	; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x227 = fadd <16 x float> %x225, %x226			%x227 = fadd <16 x float> %x225, %x226
	%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = fadd <16 x float> %x227, %x228			%x229 = fadd <16 x float> %x227, %x228
	%x230 = extractelement <16 x float> %x229, i32 0			%x230 = extractelement <16 x float> %x229, i32 0
	ret float %x230			ret float %x230
	}			}

	define float @fhsub_16(<16 x float> %x225) {			define float @fhsub_16(<16 x float> %x225) {
	; KNL-LABEL: fhsub_16:			; KNL-LABEL: fhsub_16:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; KNL-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
	; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0			; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; KNL-NEXT: vhsubps %ymm0, %ymm0, %ymm0
	; KNL-NEXT: vsubps %zmm1, %zmm0, %zmm0
	; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fhsub_16:			; SKX-LABEL: fhsub_16:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; SKX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
	; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0			; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; SKX-NEXT: vhsubps %ymm0, %ymm0, %ymm0
	; SKX-NEXT: vsubps %zmm1, %zmm0, %zmm0
	; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x227 = fadd <16 x float> %x225, %x226			%x227 = fadd <16 x float> %x225, %x226
	%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = fsub <16 x float> %x227, %x228			%x229 = fsub <16 x float> %x227, %x228
	%x230 = extractelement <16 x float> %x229, i32 0			%x230 = extractelement <16 x float> %x229, i32 0
	ret float %x230			ret float %x230
	}			}

	define <16 x i32> @hadd_16_3(<16 x i32> %x225, <16 x i32> %x227) {			define <16 x i32> @hadd_16_3(<16 x i32> %x225, <16 x i32> %x227) {
	; CHECK-LABEL: hadd_16_3:			; CHECK-LABEL: hadd_16_3:
	; CHECK: # BB#0:			; CHECK: # BB#0:
	; CHECK-NEXT: vphaddd %ymm1, %ymm0, %ymm0			; CHECK-NEXT: vphaddd %ymm1, %ymm0, %ymm0
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	; KNL-LABEL: hadd_16_3:			; KNL-LABEL: hadd_16_3:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vshufps {{.*#+}} ymm2 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]			; KNL-NEXT: vphaddd %ymm1, %ymm0, %ymm0
	; KNL-NEXT: vshufps {{.*#+}} ymm0 = ymm0[1,3],ymm1[1,3],ymm0[5,7],ymm1[5,7]
	; KNL-NEXT: vpaddd %zmm0, %zmm2, %zmm0
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: hadd_16_3:			; SKX-LABEL: hadd_16_3:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vshufps {{.*#+}} ymm2 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]			; SKX-NEXT: vphaddd %ymm1, %ymm0, %ymm0
	; SKX-NEXT: vshufps {{.*#+}} ymm0 = ymm0[1,3],ymm1[1,3],ymm0[5,7],ymm1[5,7]
	; SKX-NEXT: vpaddd %zmm0, %zmm2, %zmm0
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18			%x226 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18
	, i32 4, i32 6, i32 20, i32 22, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			, i32 4, i32 6, i32 20, i32 22, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x228 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19			%x228 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19
	, i32 5 , i32 7, i32 21, i32 23, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef,			, i32 5 , i32 7, i32 21, i32 23, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef,
	i32 undef, i32 undef>			i32 undef, i32 undef>
	%x229 = add <16 x i32> %x226, %x228			%x229 = add <16 x i32> %x226, %x228
	ret <16 x i32> %x229			ret <16 x i32> %x229
	}			}

	define <16 x float> @fhadd_16_3(<16 x float> %x225, <16 x float> %x227) {			define <16 x float> @fhadd_16_3(<16 x float> %x225, <16 x float> %x227) {
	; CHECK-LABEL: fhadd_16_3:			; CHECK-LABEL: fhadd_16_3:
	; CHECK: # BB#0:			; CHECK: # BB#0:
	; CHECK-NEXT: vhaddps %ymm1, %ymm0, %ymm0			; CHECK-NEXT: vhaddps %ymm1, %ymm0, %ymm0
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	; KNL-LABEL: fhadd_16_3:			; KNL-LABEL: fhadd_16_3:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vshufps {{.*#+}} ymm2 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]			; KNL-NEXT: vhaddps %ymm1, %ymm0, %ymm0
	; KNL-NEXT: vshufps {{.*#+}} ymm0 = ymm0[1,3],ymm1[1,3],ymm0[5,7],ymm1[5,7]
	; KNL-NEXT: vaddps %zmm0, %zmm2, %zmm0
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fhadd_16_3:			; SKX-LABEL: fhadd_16_3:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vshufps {{.*#+}} ymm2 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]			; SKX-NEXT: vhaddps %ymm1, %ymm0, %ymm0
	; SKX-NEXT: vshufps {{.*#+}} ymm0 = ymm0[1,3],ymm1[1,3],ymm0[5,7],ymm1[5,7]
	; SKX-NEXT: vaddps %zmm0, %zmm2, %zmm0
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x float> %x225, <16 x float> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18			%x226 = shufflevector <16 x float> %x225, <16 x float> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18
	, i32 4, i32 6, i32 20, i32 22, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			, i32 4, i32 6, i32 20, i32 22, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x228 = shufflevector <16 x float> %x225, <16 x float> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19			%x228 = shufflevector <16 x float> %x225, <16 x float> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19
	, i32 5 , i32 7, i32 21, i32 23, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			, i32 5 , i32 7, i32 21, i32 23, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = fadd <16 x float> %x226, %x228			%x229 = fadd <16 x float> %x226, %x228
	ret <16 x float> %x229			ret <16 x float> %x229
	}			}

	define <8 x double> @fhadd_16_4(<8 x double> %x225, <8 x double> %x227) {			define <8 x double> @fhadd_16_4(<8 x double> %x225, <8 x double> %x227) {
	; CHECK-LABEL: fhadd_16_4:			; CHECK-LABEL: fhadd_16_4:
	; CHECK: # BB#0:			; CHECK: # BB#0:
	; CHECK-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]			; CHECK-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]
	; CHECK-NEXT: vpermpd {{.*#+}} ymm2 = ymm2[0,2,1,3]			; CHECK-NEXT: vpermpd {{.*#+}} ymm2 = ymm2[0,2,1,3]
	; CHECK-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3]			; CHECK-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3]
	; CHECK-NEXT: vpermpd {{.*#+}} ymm0 = ymm0[0,2,1,3]			; CHECK-NEXT: vpermpd {{.*#+}} ymm0 = ymm0[0,2,1,3]
	; CHECK-NEXT: vaddpd %zmm0, %zmm2, %zmm0			; CHECK-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; CHECK-NEXT: retq			; CHECK-NEXT: retq
	; KNL-LABEL: fhadd_16_4:			; KNL-LABEL: fhadd_16_4:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]			; KNL-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; KNL-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3]
	; KNL-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fhadd_16_4:			; SKX-LABEL: fhadd_16_4:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]			; SKX-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; SKX-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3]
	; SKX-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 undef, i32 undef, i32 undef, i32 undef>
	%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 undef ,i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 undef ,i32 undef, i32 undef, i32 undef>
	%x229 = fadd <8 x double> %x226, %x228			%x229 = fadd <8 x double> %x226, %x228
	ret <8 x double> %x229			ret <8 x double> %x229
	}			}

test/CodeGen/X86/madd.ll

	[...]
	; AVX2-NEXT: cmpq %rcx, %rax			; AVX2-NEXT: cmpq %rcx, %rax
	; AVX2-NEXT: jne .LBB2_1			; AVX2-NEXT: jne .LBB2_1
	; AVX2-NEXT: # BB#2: # %middle.block			; AVX2-NEXT: # BB#2: # %middle.block
	; AVX2-NEXT: vpaddd %ymm0, %ymm1, %ymm0			; AVX2-NEXT: vpaddd %ymm0, %ymm1, %ymm0
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0			; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
	; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0			; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
	; AVX2-NEXT: vphaddd %ymm0, %ymm0, %ymm0			; AVX2-NEXT: vphaddd %ymm0, %ymm0, %ymm0
				craig.topper: Why is this horizontal add not narrowed?
	; AVX2-NEXT: vmovd %xmm0, %eax			; AVX2-NEXT: vmovd %xmm0, %eax
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: _Z9test_charPcS_i:			; AVX512-LABEL: _Z9test_charPcS_i:
	; AVX512: # BB#0: # %entry			; AVX512: # BB#0: # %entry
	; AVX512-NEXT: movl %edx, %eax			; AVX512-NEXT: movl %edx, %eax
	; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0
	[...]
	; AVX512-NEXT: jne .LBB2_1			; AVX512-NEXT: jne .LBB2_1
	; AVX512-NEXT: # BB#2: # %middle.block			; AVX512-NEXT: # BB#2: # %middle.block
	; AVX512-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512-NEXT: vmovd %xmm0, %eax			; AVX512-NEXT: vmovd %xmm0, %eax
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	entry:			entry:
	%3 = zext i32 %2 to i64			%3 = zext i32 %2 to i64
	br label %vector.body			br label %vector.body

	vector.body:			vector.body:
	[...]

test/CodeGen/X86/sad.ll

	[...]
	; AVX512F-NEXT: jne .LBB0_1			; AVX512F-NEXT: jne .LBB0_1
	; AVX512F-NEXT: # BB#2: # %middle.block			; AVX512F-NEXT: # BB#2: # %middle.block
	; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512F-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vmovd %xmm0, %eax			; AVX512F-NEXT: vmovd %xmm0, %eax
	; AVX512F-NEXT: vzeroupper			; AVX512F-NEXT: vzeroupper
	; AVX512F-NEXT: retq			; AVX512F-NEXT: retq
	;			;
	; AVX512BW-LABEL: sad_16i8:			; AVX512BW-LABEL: sad_16i8:
	; AVX512BW: # BB#0: # %entry			; AVX512BW: # BB#0: # %entry
	; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00			; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00
	; AVX512BW-NEXT: .p2align 4, 0x90			; AVX512BW-NEXT: .p2align 4, 0x90
	; AVX512BW-NEXT: .LBB0_1: # %vector.body			; AVX512BW-NEXT: .LBB0_1: # %vector.body
	; AVX512BW-NEXT: # =>This Inner Loop Header: Depth=1			; AVX512BW-NEXT: # =>This Inner Loop Header: Depth=1
	; AVX512BW-NEXT: vmovdqu a+1024(%rax), %xmm1			; AVX512BW-NEXT: vmovdqu a+1024(%rax), %xmm1
	; AVX512BW-NEXT: vpsadbw b+1024(%rax), %xmm1, %xmm1			; AVX512BW-NEXT: vpsadbw b+1024(%rax), %xmm1, %xmm1
	; AVX512BW-NEXT: vpaddd %xmm0, %xmm1, %xmm1			; AVX512BW-NEXT: vpaddd %xmm0, %xmm1, %xmm1
	; AVX512BW-NEXT: vinserti32x4 $0, %xmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vinserti32x4 $0, %xmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: addq $4, %rax			; AVX512BW-NEXT: addq $4, %rax
	; AVX512BW-NEXT: jne .LBB0_1			; AVX512BW-NEXT: jne .LBB0_1
	; AVX512BW-NEXT: # BB#2: # %middle.block			; AVX512BW-NEXT: # BB#2: # %middle.block
	; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512BW-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vmovd %xmm0, %eax			; AVX512BW-NEXT: vmovd %xmm0, %eax
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	entry:			entry:
	br label %vector.body			br label %vector.body

	vector.body:			vector.body:
	%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]			%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
	; AVX512F-NEXT: # BB#2: # %middle.block			; AVX512F-NEXT: # BB#2: # %middle.block
	; AVX512F-NEXT: vpaddd %zmm0, %zmm1, %zmm0			; AVX512F-NEXT: vpaddd %zmm0, %zmm1, %zmm0
	; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512F-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vmovd %xmm0, %eax			; AVX512F-NEXT: vmovd %xmm0, %eax
	; AVX512F-NEXT: vzeroupper			; AVX512F-NEXT: vzeroupper
	; AVX512F-NEXT: retq			; AVX512F-NEXT: retq
	;			;
	; AVX512BW-LABEL: sad_32i8:			; AVX512BW-LABEL: sad_32i8:
	; AVX512BW: # BB#0: # %entry			; AVX512BW: # BB#0: # %entry
	; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00			; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00
	; AVX512BW-NEXT: # BB#2: # %middle.block			; AVX512BW-NEXT: # BB#2: # %middle.block
	; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0			; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0
	; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512BW-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vmovd %xmm0, %eax			; AVX512BW-NEXT: vmovd %xmm0, %eax
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	entry:			entry:
	br label %vector.body			br label %vector.body

	vector.body:			vector.body:
	%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]			%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
	; AVX512F-NEXT: vpaddd %zmm3, %zmm1, %zmm1			; AVX512F-NEXT: vpaddd %zmm3, %zmm1, %zmm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512F-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vmovd %xmm0, %eax			; AVX512F-NEXT: vmovd %xmm0, %eax
	; AVX512F-NEXT: vzeroupper			; AVX512F-NEXT: vzeroupper
	; AVX512F-NEXT: retq			; AVX512F-NEXT: retq
	;			;
	; AVX512BW-LABEL: sad_avx64i8:			; AVX512BW-LABEL: sad_avx64i8:
	; AVX512BW: # BB#0: # %entry			; AVX512BW: # BB#0: # %entry
	; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00			; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00
	; AVX512BW-NEXT: vpaddd %zmm0, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm0, %zmm0, %zmm0
	; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0			; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0
	; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512BW-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vmovd %xmm0, %eax			; AVX512BW-NEXT: vmovd %xmm0, %eax
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	entry:			entry:
	br label %vector.body			br label %vector.body

	vector.body:			vector.body:
	%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]			%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]