This is an archive of the discontinued LLVM Phabricator instance.

[X86] Changes to extract Horizontal addition operation for AVX-512.
Needs Revision, Public

Authored by jbhateja on Aug 8 2017, 3:12 AM.

Details

Reviewers

craig.topper
delena
zvi
spatel
RKSimon

Summary

vphadd is not a supported instruction for AVX-512. However, if the result of a 512-bit add
is only partially consumed, such that less than half of its bits are used by the users of the
add, we can perform the horizontal addition at a narrower width and concatenate the result
with undef.

This will fix PR33758.

Diff Detail

Build Status

Buildable 10657
Build 10657: arc lint + arc unit

Event Timeline

jbhateja created this revision. Aug 8 2017, 3:12 AM

jbhateja added reviewers: craig.topper, delena, RKSimon. Aug 8 2017, 3:15 AM

jbhateja added a subscriber: llvm-commits.

Ping @reviewers.

craig.topper added inline comments. Aug 10 2017, 5:23 PM

lib/Target/X86/X86ISelLowering.cpp
35922	Don't we need to make sure this only happens on v32i16 and v16i32 types? combineAdd can get called on all sorts of types.
35935	Don't use a SmallVector for a hardcoded two elements. Just use a plain old array.

Are there any cases where this should happen where Op0 is different than Op1? All of these test changes are unary shuffles.

And really shouldn't we have some more generic way of shrinking adds like this? How many of the other adds in these tests could have been done in a smaller type to use a VEX encoding?

craig.topper mentioned this in D36601: [x86] Enable some support for lowerVectorShuffleWithUndefHalf with AVX-512. Aug 10 2017, 6:52 PM

Diffusion mentioned this in rL310724: [x86] Enable some support for lowerVectorShuffleWithUndefHalf with AVX-512. Aug 11 2017, 9:21 AM

craig.topper mentioned this in D36650: [X86] WIP support narrowing operations when only a subvector is demanded. Aug 12 2017, 11:16 PM

Thinking about this some more. Do we really want to use a horizontal add instruction for a register with itself? Horizontal add is suboptimally implemented in microcode. It's 3 uops while the pshufd and the add are only 2 uops. The 3 uops also mean it's limited to the complex decoder on Intel hardware.

Handling for horizontal [F]ADD/[F]SUB for AVX512 vector size 16x32 in a generic manner.
Merge branch 'master' of https://github.com/llvm-mirror/llvm
Updating test reference

In D36454#853351, @craig.topper wrote:

Thinking about this some more. Do we really want to use a horizontal add instruction for a register with itself? Horizontal add is suboptimally implemented in microcode. It's 3 uops while the pshufd and the add are only 2 uops. The 3 uops also mean it's limited to the complex decoder on Intel hardware.

Yes, I agree. Latency and the number of micro-ops are better without hadd; smaller code size is the only advantage of hadd with identical operands.
Do you suggest adding a check in isHorizontalBinOp so that it does not recognize an otherwise valid pattern for horizontal add/sub if the operands are the same?

Removing a file that got added in the last patch.

Harbormaster completed remote builds in B9670: Diff 112823. Aug 27 2017, 6:32 AM

Stashed change left out of the last check-in + formatting changes.
Updating test reference

ping @reviewers

guyblank added a subscriber: guyblank. Aug 31 2017, 8:03 AM

Ping reviewers

I think we should try to combine based on the add only being used by the extract_vector_elt. Turn the add into a 128-bit add being fed by extract_subvectors. Similarly if we see an add only being used by an extract_subvector we can shrink that add too and push the extracts up. This type of transform feels more generally useful because it will allow us to narrow many more adds in this code. This will enable EVEX->VEX to use a smaller encoding. We can apply this to many other opcodes as well.

If we do this early enough we should be able to shrink the add before the horizontal add detection.

In D36454#861826, @craig.topper wrote:

I think we should try to combine based on the add only being used by the extract_vector_elt. Turn the add into a 128-bit add being fed by extract_subvectors. Similarly if we see an add only being used by an extract_subvector we can shrink that add too and push the extracts up. This type of transform feels more generally useful because it will allow us to narrow many more adds in this code. This will enable EVEX->VEX to use a smaller encoding. We can apply this to many other opcodes as well.

If we do this early enough we should be able to shrink the add before the horizontal add detection.

Two cases for DAG node reduction:

a/ Look at the operands and try narrowing them (EXTRACT_SUBVECTOR) for a narrower operation, which is then concatenated with a pad to make the final result the same size as the original. Here we only look at the operands and not the uses of the operation, which means the newly inserted CONCAT_VECTORS could break valid patterns being checked at the use nodes.

b/ Look at both the uses of the DAG node and the operands of the node while narrowing the operation. The idea here is to avoid inserting an extra concat operation for padding, which keeps the pattern matches at the use nodes happy.

My initial patch was based on strategy (b). What I get from your comment above is to change it to handle any generic opcode instead of only HADD. Correct?

Currently the patch is based on strategy (a), with a wrapper over a generic subroutine that scales down the operation only for HADD/HSUB on particular vector types, which is also safe.
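
To make the lane bookkeeping behind both strategies concrete, here is a minimal C++ sketch that is not taken from the patch. It assumes a 512-bit value split into eight 64-bit lanes and models only the low-end ("forward") direction; the function name minimalLowSubRegBits and the bitmask encoding are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not from the patch): UsedLanes has bit i set when
// 64-bit lane i of a 512-bit value is actually used. Returns the width in
// bits of the smallest low sub-register (128, 256 or 512) that contains
// every used lane. Only the low-end direction is modelled.
unsigned minimalLowSubRegBits(std::uint8_t UsedLanes) {
  const unsigned Widths[] = {128, 256, 512};
  for (unsigned Width : Widths) {
    unsigned NumLanes = Width / 64;
    // Mask with the low NumLanes bits set.
    unsigned LowMask = (1u << NumLanes) - 1u;
    // If no used lane lies above the low NumLanes lanes, this width suffices.
    if ((UsedLanes & ~LowMask) == 0)
      return Width;
  }
  return 512;
}
```

In this simplified model a mask of 0x03 (only lanes 0 and 1 used) allows a 128-bit operation, while 0x10 still needs the full 512 bits because the single used lane does not sit at the low end. The patch additionally scans from the high end, which this sketch leaves out.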

Merge branch 'master' of https://github.com/llvm-mirror/llvm
Making the downscaling operation changes generic.

Why can't this be solved by just combining extract_subvectors through things like binops and onto their inputs?

Like this:
combine (extract_subvector (add X, Y)) -> (narrower_add (extract_subvector X, extract_subvector Y))
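
The reason this combine is sound for element-wise binops can be sketched in plain C++. This is only an illustrative model on arrays, not SelectionDAG code; the type aliases and function names are hypothetical, and a 16-lane to 4-lane narrowing is assumed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V16 = std::array<std::uint32_t, 16>; // stands in for v16i32
using V4 = std::array<std::uint32_t, 4>;   // stands in for v4i32

// Model of extract_subvector: four contiguous lanes starting at Start.
V4 extractSubvector(const V16 &V, std::size_t Start) {
  V4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = V[Start + I];
  return R;
}

// Before the combine: a 16-lane add, then extract four lanes of the result.
V4 extractOfWideAdd(const V16 &X, const V16 &Y, std::size_t Start) {
  V16 Sum{};
  for (std::size_t I = 0; I < 16; ++I)
    Sum[I] = X[I] + Y[I];
  return extractSubvector(Sum, Start);
}

// After the combine: extract four lanes of each input, then a 4-lane add.
V4 narrowAddOfExtracts(const V16 &X, const V16 &Y, std::size_t Start) {
  V4 A = extractSubvector(X, Start);
  V4 B = extractSubvector(Y, Start);
  V4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = A[I] + B[I];
  return R;
}
```

Because each output lane of an add depends only on the same lane of its inputs, both functions return the same four values for any in-range Start; the narrow form simply never computes the twelve lanes nobody reads.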

lib/Target/X86/X86ISelLowering.cpp
30224	I think element indices should use DAG.getIntPtrConstant to make the type correct. It's not supposed to be i32.
33834	Can we teach combineExtractSubvector to narrow an extract of an add by extracting from the inputs? Then we shouldn't need any change here because we'll already have the narrow add?
33912	Why is this variable now upper case?
33995	Why was this moved out of the if.
test/CodeGen/X86/madd.ll
299	Why is this horizontal add not narrowed?

Would we be better off focussing on developing the @llvm.experimental.vector.reduce.add.* intrinsics? Or looking to get slpvectorizer to do extract_subvector style shuffles as part of the reduction?

jbhateja added inline comments. Sep 19 2017, 4:15 AM

lib/Target/X86/X86ISelLowering.cpp
30224	Yes this will be fixed.
33834	I think each node-specific combiner should look at patterns starting from itself, i.e. combineExtractSubvector could be taught to remove an extract_subvector from a vector shuffle (if it has only one use), thus producing a smaller vector shuffle; that would be a generic change. Looking for an opportunity like the one you mentioned, "extract of an add by extracting from the inputs", appears to be a specialization. As of now, narrowing is done generically in the following two places: 1/ Extract_vector_elt: it looks at its input vector and tries scaling it down if possible. 2/ Narrowing of binary operations (add/sub/fadd/fsub) (this is implicitly doing what your comment says).
33912	For better visual clarity of variable name.
33995	This will be fixed.

I don't think we should be shrinking operations based on undef inputs. I think we should be shrinking them based on what elements are consumed by their users. There's no reason the shuffles in these reductions have to have undef elements. For integers we may rewrite the shuffle mask to undef in InstCombine if the elements aren't used downstream. But we don't do that for FP.

As an example, why shouldn't we be able to use a horizontal add for this?

define <4 x double> @fadd_noundef(<8 x double> %x225, <8 x double> %x227) {

  %x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
  %x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5, i32 13, i32 7, i32 15>
  %x229 = fadd <8 x double> %x226, %x228
  %x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x double> %x230
}
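
As a sanity check on why a horizontal add fits this IR, the following C++ sketch transcribes the two shuffles and the fadd onto arrays and compares the low four lanes with a model of a 256-bit VHADDPD applied to the low halves of the two inputs. The helper names are hypothetical, and the sketch only demonstrates lane-wise equivalence; it says nothing about what any particular backend version emits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using V8 = std::array<double, 8>;
using V4 = std::array<double, 4>;

// Model of shufflevector on two 8-lane inputs: a mask index below 8 selects
// from A, an index of 8 or more selects from B.
V8 shuffle(const V8 &A, const V8 &B, const std::array<int, 8> &Mask) {
  V8 R{};
  for (std::size_t I = 0; I < 8; ++I)
    R[I] = Mask[I] < 8 ? A[Mask[I]] : B[Mask[I] - 8];
  return R;
}

// The IR above, transcribed: two shuffles, an 8-lane fadd, keep lanes 0..3.
V4 faddNoUndef(const V8 &A, const V8 &B) {
  V8 Even = shuffle(A, B, {0, 8, 2, 10, 4, 12, 6, 14});
  V8 Odd = shuffle(A, B, {1, 9, 3, 11, 5, 13, 7, 15});
  V4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = Even[I] + Odd[I];
  return R;
}

// Model of 256-bit VHADDPD on the low 256 bits of A and B: each 128-bit
// lane holds the pairwise sum from A followed by the pairwise sum from B.
V4 hadd256OfLowHalves(const V8 &A, const V8 &B) {
  return {A[0] + A[1], B[0] + B[1], A[2] + A[3], B[2] + B[3]};
}
```

Both functions produce the same four values for any inputs, because each lane performs the same single addition on the same two operands. Only the low 256 bits of each 512-bit input contribute to the result, even though no shuffle mask element is undef.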

lib/Target/X86/X86ISelLowering.cpp
1703	Use of MMX here is weird. We explicitly don't generate any optimized code for MMX. So making references to it in terms of SSE/AVX is misleading.

Merge branch 'master' of https://github.com/llvm-mirror/llvm
Review comments resolution.

In D36454#876132, @craig.topper wrote:
I don't think we should be shrinking operations based on undef inputs. I think we should be shrinking them based on what elements are consumed by their users. There's no reason the shuffles in these reductions have to have undef elements. For integers we may rewrite the shuffle mask to undef in InstCombine if the elements aren't used downstream. But we don't do that for FP.

As an example, why shouldn't we be able to use a horizontal add for this?
define <4 x double> @fadd_noundef(<8 x double> %x225, <8 x double> %x227) {

  %x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
  %x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5, i32 13, i32 7, i32 15>
  %x229 = fadd <8 x double> %x226, %x228
  %x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x double> %x230
}

We do generate horizontal addition in this case.

lib/Target/X86/X86ISelLowering.cpp
1703	Reference to MMX here is signifying one of the denominations of a smaller vector register.

The test case I gave does not generate horizontal add with your patch with avx512f enabled. It does with avx2 but that's only because type legalization did the dirty work of splitting the result.

Harbormaster completed remote builds in B10591: Diff 116634. Sep 26 2017, 3:31 PM

Changes to cover more patterns for [f]hadd/[f]sub for AVX512 vector types.
Generic routines added which looks at uses / undef operands of an operation for scaling it down.

Harbormaster completed remote builds in B10657: Diff 116998. Sep 28 2017, 8:15 AM

jbhateja added inline comments. Sep 28 2017, 8:22 AM

lib/Target/X86/X86ISelLowering.cpp
33824	Just realized that the DAG argument is not used here; it will be removed when I address the other comments on the patch.

jbhateja added reviewers: zvi, spatel. Sep 29 2017, 2:08 AM

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

This revision now requires changes to proceed. Sep 29 2017, 3:49 AM

In D36454#884252, @RKSimon wrote:

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

Hi Simon,

Thanks for pointing me to the code references. I tried a simple case, which was not optimized by InstCombiner::SimplifyDemandedVectorElts.
It works over the known-bits mechanism. In fact, none of the test cases provided in avx512-hadd-hsub.ll were optimized by InstCombiner.

define float @fhsub_16(<16 x float> %x225) {
;define <16 x float> @fhsub_16(<16 x float> %x225) {

%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x227 = fadd <16 x float> %x225, %x226
%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x229 = fsub <16 x float> %x227, %x228
%x230 = extractelement <16 x float> %x229, i32 0
ret float %x230

}

This patch provides two generic routines which try to scale down an operation in denominations of X86 vector register sizes; patch D36650 suggests a similar effort.

PR33758 is specifically about inference of horizontal operations for AVX512 vector types.

Kindly elaborate on how add reduction is helpful here for the cases provided in the test case avx512-hadd-hsub.ll.

Thanks

In D36454#884427, @jbhateja wrote:

In D36454#884252, @RKSimon wrote:

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

Hi Simon,

Thanks for pointing me to the code references. I tried a simple case, which was not optimized by InstCombiner::SimplifyDemandedVectorElts.
It works over the known-bits mechanism. In fact, none of the test cases provided in avx512-hadd-hsub.ll were optimized by InstCombiner.

In my opinion, @llvm.experimental.vector.reduce.fadd and friends should be treated as the canonical forms of those operations. InstCombine should form these intrinsics upon encountering these shuffle patterns.

The SLPVectorizer, LoopVectorizer, etc. should use the intrinsics when handling relevant reductions.

Is there an advantage to handling the shuffle patterns in the backend directly as opposed to forming the intrinsics earlier and then handling them in the backend?

define float @fhsub_16(<16 x float> %x225) {
;define <16 x float> @fhsub_16(<16 x float> %x225) {
%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x227 = fadd <16 x float> %x225, %x226
%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x229 = fsub <16 x float> %x227, %x228
%x230 = extractelement <16 x float> %x229, i32 0
ret float %x230
}

This patch provides two generic routines which try to scale down an operation in denominations of X86 vector register sizes; patch D36650 suggests a similar effort.

PR33758 is specifically about inference of horizontal operations for AVX512 vector types.

Kindly elaborate on how add reduction is helpful here for the cases provided in the test case avx512-hadd-hsub.ll.

Thanks

RKSimon resigned from this revision. Oct 11 2018, 8:56 AM

@jbhateja Abandon this? @spatel's recent work seems to have covered everything already

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	407 lines
test/CodeGen/X86/avx512-hadd-hsub.ll	100 lines
test/CodeGen/X86/madd.ll	3 lines
test/CodeGen/X86/sad.ll	18 lines

Commit	Tree	Parents	Author	Summary	Date
9db9d0c98924	5a5cef192f7c	e4e270bbe1ee	Jatin Bhateja	Changes to cover more patterns for [f]hadd/[f]sub for AVX512 vector types.	Sep 28 2017, 8:08 AM
e4e270bbe1ee	bedffbed293d	6286d327cb28 84b05de9b1bb	Jatin Bhateja	Merge branch 'master' of https://github.com/llvm-mirror/llvm	Sep 28 2017, 1:42 AM
6286d327cb28	a943edc250bb	baaea8163aad 463b87bb2b44	Jatin Bhateja	Merge branch 'master' of https://github.com/llvm-mirror/llvm	Sep 27 2017, 10:49 AM
baaea8163aad	d0523409fc14	351b1cb83fab	Jatin Bhateja	Review comments resolution.	Sep 25 2017, 7:55 PM
351b1cb83fab	e4ae1c98f9a1	e6cfe00f0aba e402b549badd	Jatin Bhateja	Merge branch 'master' of https://github.com/llvm-mirror/llvm	Sep 23 2017, 7:40 PM
e6cfe00f0aba	8d30ca287ca9	79cec79fdffe	Jatin Bhateja	Making the downscaling operation changes generic.	Sep 19 2017, 12:06 AM
79cec79fdffe	9ff869d70cf6	c997227bf1b6 df2a024df200	Jatin Bhateja	Merge branch 'master' of https://github.com/llvm-mirror/llvm	Sep 18 2017, 9:23 AM
c997227bf1b6	4a8f6b2f926f	902b28df71db 7d677e7e2a15	Jatin Bhateja	Merge branch 'master' of https://github.com/llvm-mirror/llvm	Sep 15 2017, 3:02 AM
902b28df71db	4067ceda34c8	202c6bdd69e2	Jatin Bhateja	Updating test reference	Aug 27 2017, 7:55 AM
202c6bdd69e2	39e80a98cf39	eab49daf221c	Jatin Bhateja	Changes left in last patch	Aug 27 2017, 7:54 AM
eab49daf221c	65a9f237a015	ed3ea30bbbfe	Jatin Bhateja	Removing a file	Aug 27 2017, 6:29 AM
ed3ea30bbbfe	7bb4176b350d	405a63306953	Jatin Bhateja	Updating testcase post rebase.	Aug 27 2017, 6:21 AM
405a63306953	074963b92d3c	1e3b180a54e9	Jatin Bhateja	Formatting changes	Aug 27 2017, 12:32 AM
1e3b180a54e9	21d14618819a	79962d1e37df	Jatin Bhateja	Updating test reference	Aug 27 2017, 12:20 AM
79962d1e37df	7cd64bfc4446	015b4cdfed5e	Jatin Bhateja	[X86] Handling for horizontal [F]ADD/[F]SUB for AVX512 vector size 16x32.	Aug 26 2017, 12:21 PM
015b4cdfed5e	a034bb9d0094	68611ca17fa7	Jatin Bhateja	[X86] Changes to extract Horizontal addition operation for AVX-512.	Aug 8 2017, 2:58 AM
68611ca17fa7	0d5bf4ec28b9	1467a089bcd7	Jatin Bhateja	[X86] Adding a test for vector shuffle extractions.	Aug 3 2017, 9:43 AM

Diff 116998

lib/Target/X86/X86ISelLowering.cpp

[1,694 lines not shown]
// but a conditional move could be stalled by an expensive earlier operation.		// but a conditional move could be stalled by an expensive earlier operation.
PredictableSelectIsExpensive = Subtarget.getSchedModel().isOutOfOrder();		PredictableSelectIsExpensive = Subtarget.getSchedModel().isOutOfOrder();
EnableExtLdPromotion = true;		EnableExtLdPromotion = true;
setPrefFunctionAlignment(4); // 2^4 bytes.		setPrefFunctionAlignment(4); // 2^4 bytes.

verifyIntrinsicTables();		verifyIntrinsicTables();
}		}

		typedef enum : unsigned { MMX = 0, XMM = 1, YMM = 3, ZMM = 7 } VecRegKind;
		craig.topper: Use of MMX here is weird. We explicitly don't generate any optimized code for MMX. So making references to it in terms of SSE/AVX is misleading.
		jbhateja: Reference to MMX here is signifying one of the denominations of a smaller vector register.
		enum : unsigned { UNDEF, FRWD, BKWD };

		static inline int GetLaneIndex(int Bits, int StartIdx = 0) {
		  return (((Bits + 64) >> 6) - 1) + StartIdx;
		}

		/// Find the smallest sub-register which accommodates all the non-undefs
		/// of a node, Lane granularity is taken as 64 bit (MMX). Lookup is
		/// performed in both forward and backward directions.
		static bool GetMinimalUsedSubReg(SmallBitVector &Lanes, int &ForwardSubReg,
		                                 VecRegKind &SubRegKind) {
		  if (Lanes.all() || Lanes.none()) {
		    ForwardSubReg = UNDEF;
		    SubRegKind = ZMM;
		    return false;
		  }

		  auto GetSubReg = [&](bool ScanForward) -> VecRegKind {
		    VecRegKind SubReg = ZMM;
		    VecRegKind VecSubRegs[4] = {MMX, XMM, YMM, ZMM};
		    for (int i = 0; i < 4; i++) {
		      int Checker = ScanForward ? Lanes.find_next(VecSubRegs[i])
		                                : Lanes.find_prev(ZMM - VecSubRegs[i]);
		      if (Checker == -1) {
		        SubReg = VecSubRegs[i];
		        break;
		      }
		    }
		    return SubReg;
		  };

		  VecRegKind FrwdSubReg = GetSubReg(true);
		  VecRegKind BkwdSubReg = GetSubReg(false);
		  if (FrwdSubReg < BkwdSubReg) {
		    ForwardSubReg = FRWD;
		    SubRegKind = FrwdSubReg;
		    return true;
		  } else if (BkwdSubReg < FrwdSubReg) {
		    ForwardSubReg = BKWD;
		    SubRegKind = BkwdSubReg;
		    return true;
		  }
		  return false;
		}

		// Granularity of the lane considered is 64 bit, mark a bit in the
		// bitvector if corresponding lane is accessed.
		static bool MarkLanesUsedByOperands(SDNode *N, SmallBitVector &Lanes,
		                                    int StartIdx) {
		  bool retVal = false;

		  switch (N->getOpcode()) {
		  default: {
		    EVT VT = N->getValueType(0);
		    if (VT.isSimple() && VT.isVector()) {
		      int VTSz = VT.getSizeInBits();
		      for (int i = StartIdx, e = GetLaneIndex(VTSz, StartIdx); i < e; i++)
		        Lanes[i] = 1;
		    } else {
		      // Mark all the lanes used in default case.
		      Lanes.set(0, Lanes.size());
		    }
		  } break;
		  case ISD::CONCAT_VECTORS: {
		    int SZInBits = 0;
		    for (auto &Oprnd : N->op_values()) {
		      if (!Oprnd.isUndef())
		        retVal |= MarkLanesUsedByOperands(Oprnd.getNode(), Lanes, StartIdx);
		      SZInBits += Oprnd.getValueType().getSizeInBits();
		      StartIdx = GetLaneIndex(SZInBits);
		    }
		  } break;
		  case ISD::VECTOR_SHUFFLE: {
		    ShuffleVectorSDNode *SV = dyn_cast<ShuffleVectorSDNode>(N);
		    EVT ElemTy = SV->getOperand(0).getValueType().getVectorElementType();
		    int OperNumElems = SV->getOperand(0).getValueType().getVectorNumElements();
		    int ElemSz = ElemTy.getSizeInBits();

		    bool OpersUndef[2] = {SV->getOperand(0).isUndef(),
		                          SV->getOperand(1).isUndef()};

		    ArrayRef<int> Mask = SV->getMask();
		    for (int i = 0, e = Mask.size(); i < e; i++) {
		      if (Mask[i] >= 0 && !OpersUndef[Mask[i] >= OperNumElems])
		        Lanes[GetLaneIndex(i * ElemSz, StartIdx)] = 1;
		    }
		  } break;
		  }

		  if (Lanes.all())
		    return true;

		  return retVal;
		}

		// A generic routine which checks if operands of a binary operation
		// can be scaled down to a lower sub-register, also it sets the
		// StartIdx (from where operand's extraction needs to start)
		// NewOperVT (new value type of result) and PadVT (value type of
		// padding for result).
		static bool CheckDownScalingBinOpByOperands(SDNode *N, SelectionDAG &DAG,
		                                            const X86Subtarget &Subtarget,
		                                            uint64_t &StartIdx, EVT &NewOperVT,
		                                            EVT &PadVT) {
		  SDLoc DL(N);
		  VecRegKind Op0SubReg, Op1SubReg;
		  int Op0FrwdSubReg, Op1FrwdSubReg;

		  if (N->getNumOperands() != 2)
		    return false;

		  SDValue Op0 = N->getOperand(0);
		  SDValue Op1 = N->getOperand(1);

		  if (!Op0.getValueType().isVector() || !Op1.getValueType().isVector())
		    return false;

		  EVT OperVT = Op0.getValueType();
		  int LaneSz = GetLaneIndex(OperVT.getSizeInBits());

		  SmallBitVector Op0Lanes(LaneSz, 0);
		  SmallBitVector Op1Lanes(LaneSz, 0);

		  if (!OperVT.isSimple() || OperVT.getSizeInBits() < 64)
		    return false;

		  EVT OperElemVT = OperVT.getVectorElementType();
		  int OperNumElems = OperVT.getVectorNumElements();
		  int OperElemSZ = OperElemVT.getSizeInBits();

		  // Mark bit corresponding to 64 bit lane if the particular
		  // lane is accessed by the node.
		  bool Op0FullUse = MarkLanesUsedByOperands(Op0.getNode(), Op0Lanes, 0);
		  bool Op1FullUse = MarkLanesUsedByOperands(Op1.getNode(), Op1Lanes, 0);
		  if (Op0FullUse && Op1FullUse)
		    return false;

		  // Find the smallest sub-register which can accommodate
		  // non-undef part of operands.
		  bool Res0 = GetMinimalUsedSubReg(Op0Lanes, Op0FrwdSubReg, Op0SubReg);
		  bool Res1 = GetMinimalUsedSubReg(Op1Lanes, Op1FrwdSubReg, Op1SubReg);
		  if (!Res0 && !Res1)
		    return false;

		  int OperSubReg = std::min(Op0SubReg, Op1SubReg);
		  int PadNumElems =
		      PowerOf2Floor(OperNumElems - ((OperSubReg + 1) * 64) / OperElemSZ);
		  int NewOperNumElems = OperNumElems - PadNumElems;

		  if ((Op0FrwdSubReg && Op1FrwdSubReg && Op0FrwdSubReg != Op1FrwdSubReg) ||
		      NewOperNumElems >= OperNumElems)
		    return false;

		  // Legal direction for one of the operand could be UNDEF
		  // hence both operand direction are OR'ed to ascertain
		  // the actual direction of subreg.
		  int FrwdSubReg = Op0FrwdSubReg | Op1FrwdSubReg;
		  NewOperVT = EVT::getVectorVT(*DAG.getContext(), OperElemVT, NewOperNumElems);
		  PadVT = EVT::getVectorVT(*DAG.getContext(), OperElemVT, PadNumElems);

		  const TargetLowering &TLI = DAG.getTargetLoweringInfo();
		  if (!TLI.isTypeLegal(NewOperVT) || !TLI.isTypeLegal(PadVT))
		    return false;

		  StartIdx = FrwdSubReg == FRWD ? 0 : PadNumElems;
		  return true;
		}

		static bool MarkLanesUsedByUsers(SDNode *N, SmallBitVector &Lanes) {
		  unsigned ElemSz = 0;
		  int64_t SubVecSize = 1;

		  switch (N->getOpcode()) {
		  default: {
		    EVT VT = N->getValueType(0);
		    if (VT.isSimple() && VT.isVector()) {
		      int VTSz = VT.getSizeInBits();
		      for (int i = 0, e = GetLaneIndex(VTSz); i < e; i++)
		        Lanes[i] = 1;
		    } else {
		      // Mark all the lanes used in default case.
		      Lanes.set(0, Lanes.size());
		    }
		  } break;

		  case ISD::EXTRACT_SUBVECTOR: {
		    SubVecSize = N->getValueType(0).getVectorNumElements();
		    ElemSz = N->getValueType(0).getVectorElementType().getSizeInBits();
		  }
		  case ISD::EXTRACT_VECTOR_ELT: {
		    SDValue Idx = N->getOperand(1);
		    ElemSz = ElemSz ? ElemSz : N->getValueType(0).getSizeInBits();
		    if (!isa<ConstantSDNode>(Idx))
		      return false;

		    int64_t StartIdx =
		        (dyn_cast<ConstantSDNode>(Idx.getNode()))->getSExtValue();
		    int64_t EndIdx = StartIdx + SubVecSize - 1;
		    for (int i = StartIdx, e = EndIdx; i <= e; i++)
		      Lanes[GetLaneIndex(i * ElemSz)] = 1;
		  } break;

		  case ISD::VECTOR_SHUFFLE: {
		    ShuffleVectorSDNode *SV = dyn_cast<ShuffleVectorSDNode>(N);
		    EVT ElemTy = SV->getOperand(0).getValueType().getVectorElementType();
		    int OperNumElems = SV->getOperand(0).getValueType().getVectorNumElements();
		    int ElemSz = ElemTy.getSizeInBits();
		    ArrayRef<int> Mask = SV->getMask();

		    bool FirstOprd = SV->getOperand(0).getNode() == N;
		    bool UsedOprd[2] = {FirstOprd, !FirstOprd};

		    // Mark lane bits of node N which are used by
		    // the shuffle vector.
		    for (int i = 0, e = Mask.size(); i < e; i++)
		      if (Mask[i] >= 0 && UsedOprd[Mask[i] >= OperNumElems])
		        Lanes[GetLaneIndex(i * ElemSz)] = 1;
		  } break;
		  }

		  if (Lanes.all())
		    return false;

		  return true;
		}

		static bool CheckDownScalingBinOpByUses(SDNode *N, SelectionDAG &DAG,
		                                        const X86Subtarget &Subtarget,
		                                        uint64_t &StartIdx, EVT &NewVT,
		                                        EVT &PadVT) {
		  int FrwdSubReg;
		  VecRegKind SubReg;
		  EVT VT = N->getValueType(0);

		  if (!VT.isSimple() || VT.getSizeInBits() < 64)
		    return false;

		  int LaneSz = GetLaneIndex(VT.getSizeInBits());
		  SmallBitVector Lanes(LaneSz, 0);

		  for (SDNode::use_iterator UI = N->use_begin(), UE = N->use_end(); UI != UE;
		       ++UI)
		    if (!MarkLanesUsedByUsers(*UI, Lanes))
		      return false;

		  if (!GetMinimalUsedSubReg(Lanes, FrwdSubReg, SubReg))
		    return false;

		  EVT ElemVT = VT.getVectorElementType();
		  int ElemSZ = ElemVT.getSizeInBits();
		  int NumElems = VT.getVectorNumElements();

		  int PadNumElems = PowerOf2Floor(NumElems - ((SubReg + 1) * 64) / ElemSZ);
		  int NewNumElems = NumElems - PadNumElems;

		  if (NewNumElems >= NumElems)
		    return false;

		  NewVT = EVT::getVectorVT(*DAG.getContext(), ElemVT, NewNumElems);
		  PadVT = EVT::getVectorVT(*DAG.getContext(), ElemVT, PadNumElems);

		  const TargetLowering &TLI = DAG.getTargetLoweringInfo();
		  if (!TLI.isTypeLegal(NewVT) || !TLI.isTypeLegal(PadVT))
		    return false;

		  StartIdx = FrwdSubReg == FRWD ? 0 : PadNumElems;
		  return true;
		}

		static SDValue CreateScaledDownBinOper(SDLoc &DL, EVT VT, SDValue Op0,
		                                       SDValue Op1, SelectionDAG &DAG,
		                                       unsigned OpCode, uint64_t StartIdx,
		                                       EVT OperVT, EVT PadVT,
		                                       const X86Subtarget &Subtarget) {
		  SDValue ConstOffset = DAG.getIntPtrConstant(StartIdx, DL);
		  SDValue NewOp0 =
		      DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, OperVT, Op0, ConstOffset);

		  SDValue NewOp1 =
		      DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, OperVT, Op1, ConstOffset);

		  SDValue NewN = DAG.getNode(OpCode, DL, OperVT, NewOp0, NewOp1);
		  SDValue ConcatOps[2] = {DAG.getUNDEF(PadVT), NewN};
		  if (StartIdx == 0) {
		    ConcatOps[0] = NewN;
		    ConcatOps[1] = DAG.getUNDEF(PadVT);
		  }
		  return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ConcatOps);
		}

// This has so far only been implemented for 64-bit MachO.		// This has so far only been implemented for 64-bit MachO.
bool X86TargetLowering::useLoadStackGuardNode() const {		bool X86TargetLowering::useLoadStackGuardNode() const {
return Subtarget.isTargetMachO() && Subtarget.is64Bit();		return Subtarget.isTargetMachO() && Subtarget.is64Bit();
}		}

TargetLoweringBase::LegalizeTypeAction		TargetLoweringBase::LegalizeTypeAction
X86TargetLowering::getPreferredVectorAction(EVT VT) const {		X86TargetLowering::getPreferredVectorAction(EVT VT) const {
if (ExperimentalVectorWideningLegalization &&		if (ExperimentalVectorWideningLegalization &&
[28,214 lines not shown]
for (SDNode::use_iterator UI = InputVector.getNode()->use_begin(),		for (SDNode::use_iterator UI = InputVector.getNode()->use_begin(),
UE = InputVector.getNode()->use_end(); UI != UE; ++UI) {		UE = InputVector.getNode()->use_end(); UI != UE; ++UI) {
if (UI.getUse().getResNo() != InputVector.getResNo())		if (UI.getUse().getResNo() != InputVector.getResNo())
return SDValue();		return SDValue();

SDNode *Extract = *UI;		SDNode *Extract = *UI;
if (Extract->getOpcode() != ISD::EXTRACT_VECTOR_ELT)		if (Extract->getOpcode() != ISD::EXTRACT_VECTOR_ELT)
return SDValue();		return SDValue();

if (Extract->getValueType(0) != MVT::i32)		if (Extract->getValueType(0) != MVT::i32)
		craig.topper: I think element indices should use DAG.getIntPtrConstant to make the type correct. It's not supposed to be i32.
		jbhateja: Yes this will be fixed.
return SDValue();		return SDValue();
if (!Extract->hasOneUse())		if (!Extract->hasOneUse())
return SDValue();		return SDValue();
if (Extract->use_begin()->getOpcode() != ISD::SIGN_EXTEND &&		if (Extract->use_begin()->getOpcode() != ISD::SIGN_EXTEND &&
Extract->use_begin()->getOpcode() != ISD::ZERO_EXTEND)		Extract->use_begin()->getOpcode() != ISD::ZERO_EXTEND)
return SDValue();		return SDValue();
if (!isa<ConstantSDNode>(Extract->getOperand(1)))		if (!isa<ConstantSDNode>(Extract->getOperand(1)))
return SDValue();		return SDValue();
[3,582 lines not shown]
/// B = < float b0, float b1, float b2, float b3 >		/// B = < float b0, float b1, float b2, float b3 >
/// then the result of doing a horizontal operation on A and B is		/// then the result of doing a horizontal operation on A and B is
/// A horizontal-op B = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >.		/// A horizontal-op B = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >.
/// In short, LHS and RHS are inspected to see if LHS op RHS is of the form		/// In short, LHS and RHS are inspected to see if LHS op RHS is of the form
/// A horizontal-op B, for some already available A and B, and if so then LHS is		/// A horizontal-op B, for some already available A and B, and if so then LHS is
/// set to A, RHS to B, and the routine returns 'true'.		/// set to A, RHS to B, and the routine returns 'true'.
/// Note that the binary operation should have the property that if one of the		/// Note that the binary operation should have the property that if one of the
/// operands is UNDEF then the result is UNDEF.		/// operands is UNDEF then the result is UNDEF.
static bool isHorizontalBinOp(SDValue &LHS, SDValue &RHS, bool IsCommutative) {		static bool isHorizontalBinOp(SDValue &LHS, SDValue &RHS, bool IsCommutative,
		SelectionDAG &DAG) {
		jbhateja: Just realized that the DAG argument is not used here; it will be removed when I address the other comments on the patch.
// Look for the following pattern: if		// Look for the following pattern: if
// A = < float a0, float a1, float a2, float a3 >		// A = < float a0, float a1, float a2, float a3 >
// B = < float b0, float b1, float b2, float b3 >		// B = < float b0, float b1, float b2, float b3 >
// and		// and
// LHS = VECTOR_SHUFFLE A, B, <0, 2, 4, 6>		// LHS = VECTOR_SHUFFLE A, B, <0, 2, 4, 6>
// RHS = VECTOR_SHUFFLE A, B, <1, 3, 5, 7>		// RHS = VECTOR_SHUFFLE A, B, <1, 3, 5, 7>
// then LHS op RHS = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >		// then LHS op RHS = < a0 op a1, a2 op a3, b0 op b1, b2 op b3 >
// which is A horizontal-op B.		// which is A horizontal-op B.

// At least one of the operands should be a vector shuffle.		// At least one of the operands should be a vector shuffle.
		craig.topper: Can we teach combineExtractSubvector to narrow an extract of an add by extracting from the inputs? Then we shouldn't need any change here because we'll already have the narrow add.
		jbhateja (Author): I think each node-specific combiner should look at the pattern starting from itself, i.e. combineExtractSubvector could be taught to remove an extract_subvector from a vector shuffle (if it has only one use), producing a smaller vector shuffle; that would be a generic change. Looking for an opportunity like the one you mentioned, "extract of an add by extracting from input", appears to be a specialization. As of now, narrowing is done generically in the following two places: 1/ Extract_vector_elt: it looks at its input vector and tries scaling down if possible. 2/ Narrowing of binary operations (add/sub/fadd/fsub), which implicitly does what your comment says.
if (LHS.getOpcode() != ISD::VECTOR_SHUFFLE &&		if (LHS.getOpcode() != ISD::VECTOR_SHUFFLE &&
RHS.getOpcode() != ISD::VECTOR_SHUFFLE)		RHS.getOpcode() != ISD::VECTOR_SHUFFLE)
return false;		return false;

MVT VT = LHS.getSimpleValueType();		if (!LHS.getValueType().isSimple() || !RHS.getValueType().isSimple())
		return false;

assert((VT.is128BitVector() || VT.is256BitVector()) &&		MVT VT = LHS.getSimpleValueType();
"Unsupported vector type for horizontal add/sub");		if (!(VT.is128BitVector() || VT.is256BitVector() || VT.is512BitVector()))
		return false;

// Handle 128 and 256-bit vector lengths. AVX defines horizontal add/sub to		// Handle 128 and 256-bit vector lengths. AVX defines horizontal add/sub to
// operate independently on 128-bit lanes.		// operate independently on 128-bit lanes.
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();
unsigned NumLanes = VT.getSizeInBits()/128;		unsigned NumLanes = VT.getSizeInBits()/128;
unsigned NumLaneElts = NumElts / NumLanes;		unsigned NumLaneElts = NumElts / NumLanes;
assert((NumLaneElts % 2 == 0) &&		assert((NumLaneElts % 2 == 0) &&
"Vector type should have an even number of elements in each lane");		"Vector type should have an even number of elements in each lane");
▲ Show 20 Lines • Show All 51 Lines • ▼ Show 20 Lines	static bool isHorizontalBinOp(SDValue &LHS, SDValue &RHS, bool IsCommutative,
// rewriting the mask).		// rewriting the mask).
if (A != C)		if (A != C)
ShuffleVectorSDNode::commuteMask(RMask);		ShuffleVectorSDNode::commuteMask(RMask);

// At this point LHS and RHS are equivalent to		// At this point LHS and RHS are equivalent to
// LHS = VECTOR_SHUFFLE A, B, LMask		// LHS = VECTOR_SHUFFLE A, B, LMask
// RHS = VECTOR_SHUFFLE A, B, RMask		// RHS = VECTOR_SHUFFLE A, B, RMask
// Check that the masks correspond to performing a horizontal operation.		// Check that the masks correspond to performing a horizontal operation.
for (unsigned l = 0; l != NumElts; l += NumLaneElts) {		for (unsigned L = 0; L != NumElts; L += NumLaneElts) {
		craig.topper: Why is this variable now upper case?
		jbhateja (Author): For better visual clarity of the variable name.
for (unsigned i = 0; i != NumLaneElts; ++i) {		for (unsigned i = 0; i != NumLaneElts; ++i) {
int LIdx = LMask[i+l], RIdx = RMask[i+l];		int LIdx = LMask[i+L], RIdx = RMask[i+L];

// Ignore any UNDEF components.		// Ignore any UNDEF components.
if (LIdx < 0 || RIdx < 0 ||		if (LIdx < 0 || RIdx < 0 ||
(!A.getNode() && (LIdx < (int)NumElts || RIdx < (int)NumElts)) ||		(!A.getNode() && (LIdx < (int)NumElts || RIdx < (int)NumElts)) ||
(!B.getNode() && (LIdx >= (int)NumElts || RIdx >= (int)NumElts)))		(!B.getNode() && (LIdx >= (int)NumElts || RIdx >= (int)NumElts)))
continue;		continue;

// Check that successive elements are being operated on. If not, this is		// Check that successive elements are being operated on. If not, this is
// not a horizontal operation.		// not a horizontal operation.
unsigned Src = (i/HalfLaneElts); // each lane is split between srcs		unsigned Src = (i/HalfLaneElts); // each lane is split between srcs
int Index = 2*(i%HalfLaneElts) + NumElts*Src + l;		int Index = 2*(i%HalfLaneElts) + NumElts*Src + L;
if (!(LIdx == Index && RIdx == Index + 1) &&		if (!(LIdx == Index && RIdx == Index + 1) &&
!(IsCommutative && LIdx == Index + 1 && RIdx == Index))		!(IsCommutative && LIdx == Index + 1 && RIdx == Index))
return false;		return false;
}		}
}		}

LHS = A.getNode() ? A : B; // If A is 'UNDEF', use B for it.		LHS = A.getNode() ? A : B; // If A is 'UNDEF', use B for it.
RHS = B.getNode() ? B : A; // If B is 'UNDEF', use A for it.		RHS = B.getNode() ? B : A; // If B is 'UNDEF', use A for it.

return true;		return true;
}		}

		// Combining DAG of ADD/SUB with shuffles to Horizontal Operations.
		static SDValue combineToHorizontalOperation(SDNode *N, bool IsIntegeralOp,
		bool IsCommutative, unsigned OpCode,
		const X86Subtarget &Subtarget,
		SelectionDAG &DAG) {
		SDLoc DL(N);
		EVT VT = N->getValueType(0);
		SDValue Op0 = N->getOperand(0);
		SDValue Op1 = N->getOperand(1);

		auto ValidHorizontalTypes = [&](EVT ValTyp) -> bool {
		if (IsIntegeralOp)
		return ((Subtarget.hasSSSE3() &&
		(ValTyp == MVT::v8i16 || ValTyp == MVT::v4i32)) ||
		(Subtarget.hasInt256() &&
		(ValTyp == MVT::v16i16 || ValTyp == MVT::v8i32)));
		else
		return ((Subtarget.hasSSE3() &&
		(ValTyp == MVT::v4f32 || ValTyp == MVT::v2f64)) ||
		(Subtarget.hasFp256() &&
		(ValTyp == MVT::v8f32 || ValTyp == MVT::v4f64)));
		};

		// Try to synthesize horizontal [f]add/[f]sub from adds of shuffles.
		bool ValidHorizontalPattern = isHorizontalBinOp(Op0, Op1, IsCommutative, DAG);

		if (ValidHorizontalPattern && ValidHorizontalTypes(VT))
		return DAG.getNode(OpCode, DL, VT, Op0, Op1);

		if (ValidHorizontalPattern && VT.is512BitVector()) {
		uint64_t StartIdx;
		EVT NewOperVT, PadVT;

		if (CheckDownScalingBinOpByUses(N, DAG, Subtarget, StartIdx, NewOperVT,
		PadVT) &&
		ValidHorizontalTypes(NewOperVT))
		return CreateScaledDownBinOper(DL, VT, Op0, Op1, DAG, OpCode, StartIdx,
		NewOperVT, PadVT, Subtarget);
		else if (CheckDownScalingBinOpByOperands(N, DAG, Subtarget, StartIdx,
		NewOperVT, PadVT) &&
		ValidHorizontalTypes(NewOperVT))
		return CreateScaledDownBinOper(DL, VT, Op0, Op1, DAG, OpCode, StartIdx,
		NewOperVT, PadVT, Subtarget);
		else {
		// Not creating multiple Horizontal add/sub due to latency considerations.
		}
		}

		return SDValue();
		}

/// Do target-specific dag combines on floating-point adds/subs.		/// Do target-specific dag combines on floating-point adds/subs.
static SDValue combineFaddFsub(SDNode *N, SelectionDAG &DAG,		static SDValue combineFaddFsub(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
EVT VT = N->getValueType(0);
SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);
bool IsFadd = N->getOpcode() == ISD::FADD;		bool IsFadd = N->getOpcode() == ISD::FADD;
assert((IsFadd || N->getOpcode() == ISD::FSUB) && "Wrong opcode");		assert((IsFadd || N->getOpcode() == ISD::FSUB) && "Wrong opcode");

// Try to synthesize horizontal add/sub from adds/subs of shuffles.
if (((Subtarget.hasSSE3() && (VT == MVT::v4f32 || VT == MVT::v2f64)) ||
(Subtarget.hasFp256() && (VT == MVT::v8f32 || VT == MVT::v4f64))) &&
isHorizontalBinOp(LHS, RHS, IsFadd)) {
auto NewOpcode = IsFadd ? X86ISD::FHADD : X86ISD::FHSUB;		auto NewOpcode = IsFadd ? X86ISD::FHADD : X86ISD::FHSUB;
		craig.topper: Why was this moved out of the if?
		jbhateja (Author): This will be fixed.
return DAG.getNode(NewOpcode, SDLoc(N), VT, LHS, RHS);		if (SDValue V = combineToHorizontalOperation(N, false, IsFadd, NewOpcode,
}		Subtarget, DAG))
		return V;

return SDValue();		return SDValue();
}		}

/// Attempt to pre-truncate inputs to arithmetic ops if it will simplify		/// Attempt to pre-truncate inputs to arithmetic ops if it will simplify
/// the codegen.		/// the codegen.
/// e.g. TRUNC( BINOP( X, Y ) ) --> BINOP( TRUNC( X ), TRUNC( Y ) )		/// e.g. TRUNC( BINOP( X, Y ) ) --> BINOP( TRUNC( X ), TRUNC( Y ) )
static SDValue combineTruncatedArithmetic(SDNode *N, SelectionDAG &DAG,		static SDValue combineTruncatedArithmetic(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget,		const X86Subtarget &Subtarget,
▲ Show 20 Lines • Show All 1,892 Lines • ▼ Show 20 Lines	static SDValue combineIncDecVector(SDNode *N, SelectionDAG &DAG) {

SDValue AllOnesVec = getOnesVector(VT, DAG, SDLoc(N));		SDValue AllOnesVec = getOnesVector(VT, DAG, SDLoc(N));
unsigned NewOpcode = N->getOpcode() == ISD::ADD ? ISD::SUB : ISD::ADD;		unsigned NewOpcode = N->getOpcode() == ISD::ADD ? ISD::SUB : ISD::ADD;
return DAG.getNode(NewOpcode, SDLoc(N), VT, N->getOperand(0), AllOnesVec);		return DAG.getNode(NewOpcode, SDLoc(N), VT, N->getOperand(0), AllOnesVec);
}		}

static SDValue combineAdd(SDNode *N, SelectionDAG &DAG,		static SDValue combineAdd(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
		SDLoc DL(N);

const SDNodeFlags Flags = N->getFlags();		const SDNodeFlags Flags = N->getFlags();
if (Flags.hasVectorReduction()) {		if (Flags.hasVectorReduction()) {
if (SDValue Sad = combineLoopSADPattern(N, DAG, Subtarget))		if (SDValue Sad = combineLoopSADPattern(N, DAG, Subtarget))
return Sad;		return Sad;
if (SDValue MAdd = combineLoopMAddPattern(N, DAG, Subtarget))		if (SDValue MAdd = combineLoopMAddPattern(N, DAG, Subtarget))
return MAdd;		return MAdd;
}		}
EVT VT = N->getValueType(0);
SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);

// Try to synthesize horizontal adds from adds of shuffles.		if (SDValue V = combineToHorizontalOperation(N, true, true, X86ISD::HADD,
if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||		Subtarget, DAG))
(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&		return V;
isHorizontalBinOp(Op0, Op1, true))
return DAG.getNode(X86ISD::HADD, SDLoc(N), VT, Op0, Op1);

if (SDValue V = combineIncDecVector(N, DAG))		if (SDValue V = combineIncDecVector(N, DAG))
		craig.topper: Don't we need to make sure this only happens on v32i16 and v16i32 types? combineAdd can get called on all sorts of types.
return V;		return V;

return combineAddOrSubToADCOrSBB(N, DAG);		if (SDValue V = combineAddOrSubToADCOrSBB(N, DAG))
		return V;

		return SDValue();
}		}

static SDValue combineSub(SDNode *N, SelectionDAG &DAG,		static SDValue combineSub(SDNode *N, SelectionDAG &DAG,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
SDValue Op0 = N->getOperand(0);		SDValue Op0 = N->getOperand(0);
SDValue Op1 = N->getOperand(1);		SDValue Op1 = N->getOperand(1);

		craig.topper: Don't use a SmallVector for a hardcoded two elements. Just use a plain old array.
// X86 can't encode an immediate LHS of a sub. See if we can push the		// X86 can't encode an immediate LHS of a sub. See if we can push the
// negation into a preceding instruction.		// negation into a preceding instruction.
if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op0)) {		if (ConstantSDNode *C = dyn_cast<ConstantSDNode>(Op0)) {
// If the RHS of the sub is a XOR with one use and a constant, invert the		// If the RHS of the sub is a XOR with one use and a constant, invert the
// immediate. Then add one to the LHS of the sub so we can turn		// immediate. Then add one to the LHS of the sub so we can turn
// X-Y -> X+~Y+1, saving one register.		// X-Y -> X+~Y+1, saving one register.
if (Op1->hasOneUse() && Op1.getOpcode() == ISD::XOR &&		if (Op1->hasOneUse() && Op1.getOpcode() == ISD::XOR &&
isa<ConstantSDNode>(Op1.getOperand(1))) {		isa<ConstantSDNode>(Op1.getOperand(1))) {
APInt XorC = cast<ConstantSDNode>(Op1.getOperand(1))->getAPIntValue();		APInt XorC = cast<ConstantSDNode>(Op1.getOperand(1))->getAPIntValue();
EVT VT = Op0.getValueType();		EVT VT = Op0.getValueType();
SDValue NewXor = DAG.getNode(ISD::XOR, SDLoc(Op1), VT,		SDValue NewXor = DAG.getNode(ISD::XOR, SDLoc(Op1), VT,
Op1.getOperand(0),		Op1.getOperand(0),
DAG.getConstant(~XorC, SDLoc(Op1), VT));		DAG.getConstant(~XorC, SDLoc(Op1), VT));
return DAG.getNode(ISD::ADD, SDLoc(N), VT, NewXor,		return DAG.getNode(ISD::ADD, SDLoc(N), VT, NewXor,
DAG.getConstant(C->getAPIntValue() + 1, SDLoc(N), VT));		DAG.getConstant(C->getAPIntValue() + 1, SDLoc(N), VT));
}		}
}		}

// Try to synthesize horizontal subs from subs of shuffles.		if (SDValue V = combineToHorizontalOperation(N, true, false, X86ISD::HSUB,
EVT VT = N->getValueType(0);		Subtarget, DAG))
if (((Subtarget.hasSSSE3() && (VT == MVT::v8i16 || VT == MVT::v4i32)) ||		return V;
(Subtarget.hasInt256() && (VT == MVT::v16i16 || VT == MVT::v8i32))) &&
isHorizontalBinOp(Op0, Op1, false))
return DAG.getNode(X86ISD::HSUB, SDLoc(N), VT, Op0, Op1);

if (SDValue V = combineIncDecVector(N, DAG))		if (SDValue V = combineIncDecVector(N, DAG))
return V;		return V;

return combineAddOrSubToADCOrSBB(N, DAG);		if (SDValue V = combineAddOrSubToADCOrSBB(N, DAG))
		return V;

		return SDValue();
}		}

static SDValue combineVSZext(SDNode *N, SelectionDAG &DAG,		static SDValue combineVSZext(SDNode *N, SelectionDAG &DAG,
TargetLowering::DAGCombinerInfo &DCI,		TargetLowering::DAGCombinerInfo &DCI,
const X86Subtarget &Subtarget) {		const X86Subtarget &Subtarget) {
if (DCI.isBeforeLegalize())		if (DCI.isBeforeLegalize())
return SDValue();		return SDValue();

▲ Show 20 Lines • Show All 1,461 Lines • Show Last 20 Lines
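
The mask check at the heart of isHorizontalBinOp can be tried out in isolation. The following is only a minimal sketch, not the LLVM function: the name isHorizontalMaskPair is hypothetical, the masks are plain integer vectors instead of shuffle nodes, both sources A and B are assumed to be defined (the handling of an UNDEF A or B is left out), and HalfLaneElts is assumed to be NumLaneElts / 2, since its definition falls in the part of the diff that is not shown.

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone version of the mask check (not the LLVM API).
// LMask/RMask index into the concatenation of A and B: values 0..NumElts-1
// select from A, NumElts..2*NumElts-1 select from B, negative means UNDEF.
// NumLanes is the number of 128-bit lanes in the vector type.
bool isHorizontalMaskPair(const std::vector<int> &LMask,
                          const std::vector<int> &RMask,
                          unsigned NumLanes, bool IsCommutative) {
  const unsigned NumElts = static_cast<unsigned>(LMask.size());
  if (RMask.size() != LMask.size() || NumLanes == 0 ||
      NumElts % NumLanes != 0)
    return false;
  const unsigned NumLaneElts = NumElts / NumLanes;
  if (NumLaneElts % 2 != 0)
    return false;
  // Assumption: each lane is split evenly between the two sources.
  const unsigned HalfLaneElts = NumLaneElts / 2;

  for (unsigned L = 0; L != NumElts; L += NumLaneElts) {
    for (unsigned i = 0; i != NumLaneElts; ++i) {
      const int LIdx = LMask[i + L], RIdx = RMask[i + L];
      // Ignore any UNDEF components.
      if (LIdx < 0 || RIdx < 0)
        continue;
      // The first half of each lane reads from A, the second half from B.
      const unsigned Src = i / HalfLaneElts;
      const int Index =
          static_cast<int>(2 * (i % HalfLaneElts) + NumElts * Src + L);
      const bool InOrder = LIdx == Index && RIdx == Index + 1;
      const bool Swapped = IsCommutative && LIdx == Index + 1 && RIdx == Index;
      if (!InOrder && !Swapped)
        return false;
    }
  }
  return true;
}
```

With one 128-bit lane of four elements, the masks <0, 2, 4, 6> and <1, 3, 5, 7> from the source comment are accepted. With two lanes of four elements (a 256-bit type), the accepted masks are <0, 2, 8, 10, 4, 6, 12, 14> and <1, 3, 9, 11, 5, 7, 13, 15>, which reflects that the AVX horizontal operations work independently on each 128-bit lane. Swapping the two masks is accepted only when IsCommutative is true.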

test/CodeGen/X86/avx512-hadd-hsub.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	;RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=knl | FileCheck %s --check-prefix=KNL			;RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=knl | FileCheck %s --check-prefix=KNL
	;RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=skx | FileCheck %s --check-prefix=SKX			;RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=skx | FileCheck %s --check-prefix=SKX

	define i32 @hadd_16(<16 x i32> %x225) {			define i32 @hadd_16(<16 x i32> %x225) {
	; KNL-LABEL: hadd_16:			; KNL-LABEL: hadd_16:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; KNL-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vmovd %xmm0, %eax			; KNL-NEXT: vmovd %xmm0, %eax
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: hadd_16:			; SKX-LABEL: hadd_16:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; SKX-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vmovd %xmm0, %eax			; SKX-NEXT: vmovd %xmm0, %eax
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x i32> %x225, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <16 x i32> %x225, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x227 = add <16 x i32> %x225, %x226			%x227 = add <16 x i32> %x225, %x226
	%x228 = shufflevector <16 x i32> %x227, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <16 x i32> %x227, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = add <16 x i32> %x227, %x228			%x229 = add <16 x i32> %x227, %x228
	%x230 = extractelement <16 x i32> %x229, i32 0			%x230 = extractelement <16 x i32> %x229, i32 0
	ret i32 %x230			ret i32 %x230
	}			}

	define i32 @hsub_16(<16 x i32> %x225) {			define i32 @hsub_16(<16 x i32> %x225) {
	; KNL-LABEL: hsub_16:			; KNL-LABEL: hsub_16:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; KNL-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; KNL-NEXT: vphsubd %ymm0, %ymm0, %ymm0
	; KNL-NEXT: vpsubd %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vmovd %xmm0, %eax			; KNL-NEXT: vmovd %xmm0, %eax
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: hsub_16:			; SKX-LABEL: hsub_16:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; SKX-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; SKX-NEXT: vphsubd %ymm0, %ymm0, %ymm0
	; SKX-NEXT: vpsubd %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vmovd %xmm0, %eax			; SKX-NEXT: vmovd %xmm0, %eax
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x i32> %x225, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <16 x i32> %x225, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x227 = add <16 x i32> %x225, %x226			%x227 = add <16 x i32> %x225, %x226
	%x228 = shufflevector <16 x i32> %x227, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <16 x i32> %x227, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = sub <16 x i32> %x227, %x228			%x229 = sub <16 x i32> %x227, %x228
	%x230 = extractelement <16 x i32> %x229, i32 0			%x230 = extractelement <16 x i32> %x229, i32 0
	ret i32 %x230			ret i32 %x230
	}			}

	define float @fhadd_16(<16 x float> %x225) {			define float @fhadd_16(<16 x float> %x225) {
	; KNL-LABEL: fhadd_16:			; KNL-LABEL: fhadd_16:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; KNL-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
	; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0			; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; KNL-NEXT: vhaddps %ymm0, %ymm0, %ymm0
	; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fhadd_16:			; SKX-LABEL: fhadd_16:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; SKX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
	; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0			; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; SKX-NEXT: vhaddps %ymm0, %ymm0, %ymm0
	; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x227 = fadd <16 x float> %x225, %x226			%x227 = fadd <16 x float> %x225, %x226
	%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = fadd <16 x float> %x227, %x228			%x229 = fadd <16 x float> %x227, %x228
	%x230 = extractelement <16 x float> %x229, i32 0			%x230 = extractelement <16 x float> %x229, i32 0
	ret float %x230			ret float %x230
	}			}

	define float @fhsub_16(<16 x float> %x225) {			define float @fhsub_16(<16 x float> %x225) {
	; KNL-LABEL: fhsub_16:			; KNL-LABEL: fhsub_16:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; KNL-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
	; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0			; KNL-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; KNL-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; KNL-NEXT: vhsubps %ymm0, %ymm0, %ymm0
	; KNL-NEXT: vsubps %zmm1, %zmm0, %zmm0
	; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fhsub_16:			; SKX-LABEL: fhsub_16:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]			; SKX-NEXT: vpermilpd {{.*#+}} xmm1 = xmm0[1,0]
	; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0			; SKX-NEXT: vaddps %zmm1, %zmm0, %zmm0
	; SKX-NEXT: vmovshdup {{.*#+}} xmm1 = xmm0[1,1,3,3]			; SKX-NEXT: vhsubps %ymm0, %ymm0, %ymm0
	; SKX-NEXT: vsubps %zmm1, %zmm0, %zmm0
	; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x227 = fadd <16 x float> %x225, %x226			%x227 = fadd <16 x float> %x225, %x226
	%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = fsub <16 x float> %x227, %x228			%x229 = fsub <16 x float> %x227, %x228
	%x230 = extractelement <16 x float> %x229, i32 0			%x230 = extractelement <16 x float> %x229, i32 0
	ret float %x230			ret float %x230
	}			}

	define <16 x i32> @hadd_16_3(<16 x i32> %x225, <16 x i32> %x227) {			define <16 x i32> @hadd_16_3(<16 x i32> %x225, <16 x i32> %x227) {
	; KNL-LABEL: hadd_16_3:			; KNL-LABEL: hadd_16_3:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vshufps {{.*#+}} ymm2 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]			; KNL-NEXT: vphaddd %ymm1, %ymm0, %ymm0
	; KNL-NEXT: vshufps {{.*#+}} ymm0 = ymm0[1,3],ymm1[1,3],ymm0[5,7],ymm1[5,7]
	; KNL-NEXT: vpaddd %zmm0, %zmm2, %zmm0
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: hadd_16_3:			; SKX-LABEL: hadd_16_3:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vshufps {{.*#+}} ymm2 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]			; SKX-NEXT: vphaddd %ymm1, %ymm0, %ymm0
	; SKX-NEXT: vshufps {{.*#+}} ymm0 = ymm0[1,3],ymm1[1,3],ymm0[5,7],ymm1[5,7]
	; SKX-NEXT: vpaddd %zmm0, %zmm2, %zmm0
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18			%x226 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18
	, i32 4, i32 6, i32 20, i32 22, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			, i32 4, i32 6, i32 20, i32 22, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x228 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19			%x228 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19
	, i32 5 , i32 7, i32 21, i32 23, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef,			, i32 5 , i32 7, i32 21, i32 23, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef,
	i32 undef, i32 undef>			i32 undef, i32 undef>
	%x229 = add <16 x i32> %x226, %x228			%x229 = add <16 x i32> %x226, %x228
	ret <16 x i32> %x229			ret <16 x i32> %x229
	}			}

	define <16 x float> @fhadd_16_3(<16 x float> %x225, <16 x float> %x227) {			define <16 x float> @fhadd_16_3(<16 x float> %x225, <16 x float> %x227) {
	; KNL-LABEL: fhadd_16_3:			; KNL-LABEL: fhadd_16_3:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vshufps {{.*#+}} ymm2 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]			; KNL-NEXT: vhaddps %ymm1, %ymm0, %ymm0
	; KNL-NEXT: vshufps {{.*#+}} ymm0 = ymm0[1,3],ymm1[1,3],ymm0[5,7],ymm1[5,7]
	; KNL-NEXT: vaddps %zmm0, %zmm2, %zmm0
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fhadd_16_3:			; SKX-LABEL: fhadd_16_3:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vshufps {{.*#+}} ymm2 = ymm0[0,2],ymm1[0,2],ymm0[4,6],ymm1[4,6]			; SKX-NEXT: vhaddps %ymm1, %ymm0, %ymm0
	; SKX-NEXT: vshufps {{.*#+}} ymm0 = ymm0[1,3],ymm1[1,3],ymm0[5,7],ymm1[5,7]
	; SKX-NEXT: vaddps %zmm0, %zmm2, %zmm0
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x float> %x225, <16 x float> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18			%x226 = shufflevector <16 x float> %x225, <16 x float> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18
	, i32 4, i32 6, i32 20, i32 22, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			, i32 4, i32 6, i32 20, i32 22, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x228 = shufflevector <16 x float> %x225, <16 x float> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19			%x228 = shufflevector <16 x float> %x225, <16 x float> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19
	, i32 5 , i32 7, i32 21, i32 23, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			, i32 5 , i32 7, i32 21, i32 23, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%x229 = fadd <16 x float> %x226, %x228			%x229 = fadd <16 x float> %x226, %x228
	ret <16 x float> %x229			ret <16 x float> %x229
	}			}

	define <8 x double> @fhadd_16_4(<8 x double> %x225, <8 x double> %x227) {			define <8 x double> @fhadd_16_4(<8 x double> %x225, <8 x double> %x227) {
	; KNL-LABEL: fhadd_16_4:			; KNL-LABEL: fhadd_16_4:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]			; KNL-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; KNL-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3]
	; KNL-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fhadd_16_4:			; SKX-LABEL: fhadd_16_4:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]			; SKX-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; SKX-NEXT: vunpckhpd {{.*#+}} ymm0 = ymm0[1],ymm1[1],ymm0[3],ymm1[3]
	; SKX-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 undef, i32 undef, i32 undef, i32 undef>			%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 undef, i32 undef, i32 undef, i32 undef>
	%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 undef ,i32 undef, i32 undef, i32 undef>			%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 undef ,i32 undef, i32 undef, i32 undef>
	%x229 = fadd <8 x double> %x226, %x228			%x229 = fadd <8 x double> %x226, %x228
	ret <8 x double> %x229			ret <8 x double> %x229
	}			}

	define <4 x double> @fadd_noundef_low(<8 x double> %x225, <8 x double> %x227) {			define <4 x double> @fadd_noundef_low(<8 x double> %x225, <8 x double> %x227) {
	; KNL-LABEL: fadd_noundef_low:			; KNL-LABEL: fadd_noundef_low:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vunpcklpd {{.*#+}} zmm2 = zmm0[0],zmm1[0],zmm0[2],zmm1[2],zmm0[4],zmm1[4],zmm0[6],zmm1[6]			; KNL-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; KNL-NEXT: vunpckhpd {{.*#+}} zmm0 = zmm0[1],zmm1[1],zmm0[3],zmm1[3],zmm0[5],zmm1[5],zmm0[7],zmm1[7]
	; KNL-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; KNL-NEXT: # kill: %YMM0<def> %YMM0<kill> %ZMM0<kill>
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fadd_noundef_low:			; SKX-LABEL: fadd_noundef_low:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vunpcklpd {{.*#+}} zmm2 = zmm0[0],zmm1[0],zmm0[2],zmm1[2],zmm0[4],zmm1[4],zmm0[6],zmm1[6]			; SKX-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; SKX-NEXT: vunpckhpd {{.*#+}} zmm0 = zmm0[1],zmm1[1],zmm0[3],zmm1[3],zmm0[5],zmm1[5],zmm0[7],zmm1[7]
	; SKX-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; SKX-NEXT: # kill: %YMM0<def> %YMM0<kill> %ZMM0<kill>
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>			%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
	%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>			%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>
	%x229 = fadd <8 x double> %x226, %x228			%x229 = fadd <8 x double> %x226, %x228
	%x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>			%x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
	ret <4 x double> %x230			ret <4 x double> %x230
	}			}

	define <4 x double> @fadd_noundef_high(<8 x double> %x225, <8 x double> %x227) {			define <4 x double> @fadd_noundef_high(<8 x double> %x225, <8 x double> %x227) {
	; KNL-LABEL: fadd_noundef_high:			; KNL-LABEL: fadd_noundef_high:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vunpcklpd {{.*#+}} zmm2 = zmm0[0],zmm1[0],zmm0[2],zmm1[2],zmm0[4],zmm1[4],zmm0[6],zmm1[6]			; KNL-NEXT: vextractf64x4 $1, %zmm1, %ymm1
	; KNL-NEXT: vunpckhpd {{.*#+}} zmm0 = zmm0[1],zmm1[1],zmm0[3],zmm1[3],zmm0[5],zmm1[5],zmm0[7],zmm1[7]
	; KNL-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; KNL-NEXT: vextractf64x4 $1, %zmm0, %ymm0			; KNL-NEXT: vextractf64x4 $1, %zmm0, %ymm0
				; KNL-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fadd_noundef_high:			; SKX-LABEL: fadd_noundef_high:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vunpcklpd {{.*#+}} zmm2 = zmm0[0],zmm1[0],zmm0[2],zmm1[2],zmm0[4],zmm1[4],zmm0[6],zmm1[6]			; SKX-NEXT: vextractf64x4 $1, %zmm1, %ymm1
	; SKX-NEXT: vunpckhpd {{.*#+}} zmm0 = zmm0[1],zmm1[1],zmm0[3],zmm1[3],zmm0[5],zmm1[5],zmm0[7],zmm1[7]
	; SKX-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; SKX-NEXT: vextractf64x4 $1, %zmm0, %ymm0			; SKX-NEXT: vextractf64x4 $1, %zmm0, %ymm0
				; SKX-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>			%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
	%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>			%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>
	%x229 = fadd <8 x double> %x226, %x228			%x229 = fadd <8 x double> %x226, %x228
	%x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>			%x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
	ret <4 x double> %x230			ret <4 x double> %x230
	}			}


	define <8 x i32> @hadd_16_3_sv(<16 x i32> %x225, <16 x i32> %x227) {			define <8 x i32> @hadd_16_3_sv(<16 x i32> %x225, <16 x i32> %x227) {
	; KNL-LABEL: hadd_16_3_sv:			; KNL-LABEL: hadd_16_3_sv:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vshufps {{.*#+}} zmm2 = zmm0[0,2],zmm1[0,2],zmm0[4,6],zmm1[4,6],zmm0[8,10],zmm1[8,10],zmm0[12,14],zmm1[12,14]			; KNL-NEXT: vphaddd %ymm1, %ymm0, %ymm0
	; KNL-NEXT: vshufps {{.*#+}} zmm0 = zmm0[1,3],zmm1[1,3],zmm0[5,7],zmm1[5,7],zmm0[9,11],zmm1[9,11],zmm0[13,15],zmm1[13,15]
	; KNL-NEXT: vpaddd %zmm0, %zmm2, %zmm0
	; KNL-NEXT: # kill: %YMM0<def> %YMM0<kill> %ZMM0<kill>
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: hadd_16_3_sv:			; SKX-LABEL: hadd_16_3_sv:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vshufps {{.*#+}} zmm2 = zmm0[0,2],zmm1[0,2],zmm0[4,6],zmm1[4,6],zmm0[8,10],zmm1[8,10],zmm0[12,14],zmm1[12,14]			; SKX-NEXT: vphaddd %ymm1, %ymm0, %ymm0
	; SKX-NEXT: vshufps {{.*#+}} zmm0 = zmm0[1,3],zmm1[1,3],zmm0[5,7],zmm1[5,7],zmm0[9,11],zmm1[9,11],zmm0[13,15],zmm1[13,15]
	; SKX-NEXT: vpaddd %zmm0, %zmm2, %zmm0
	; SKX-NEXT: # kill: %YMM0<def> %YMM0<kill> %ZMM0<kill>
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18			%x226 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 0, i32 2, i32 16, i32 18
	, i32 4, i32 6, i32 20, i32 22, i32 8, i32 10, i32 24, i32 26, i32 12, i32 14, i32 28, i32 30>			, i32 4, i32 6, i32 20, i32 22, i32 8, i32 10, i32 24, i32 26, i32 12, i32 14, i32 28, i32 30>
	%x228 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19			%x228 = shufflevector <16 x i32> %x225, <16 x i32> %x227, <16 x i32> <i32 1, i32 3, i32 17, i32 19
	, i32 5 , i32 7, i32 21, i32 23, i32 9, i32 11, i32 25, i32 27, i32 13, i32 15,			, i32 5 , i32 7, i32 21, i32 23, i32 9, i32 11, i32 25, i32 27, i32 13, i32 15,
	i32 29, i32 31>			i32 29, i32 31>
	%x229 = add <16 x i32> %x226, %x228			%x229 = add <16 x i32> %x226, %x228
	%x230 = shufflevector <16 x i32> %x229, <16 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4 ,i32 5, i32 6, i32 7>			%x230 = shufflevector <16 x i32> %x229, <16 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4 ,i32 5, i32 6, i32 7>
	ret <8 x i32> %x230			ret <8 x i32> %x230
	}			}


	define double @fadd_noundef_eel(<8 x double> %x225, <8 x double> %x227) {			define double @fadd_noundef_eel(<8 x double> %x225, <8 x double> %x227) {
	; KNL-LABEL: fadd_noundef_eel:			; KNL-LABEL: fadd_noundef_eel:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vunpcklpd {{.*#+}} zmm2 = zmm0[0],zmm1[0],zmm0[2],zmm1[2],zmm0[4],zmm1[4],zmm0[6],zmm1[6]			; KNL-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; KNL-NEXT: vunpckhpd {{.*#+}} zmm0 = zmm0[1],zmm1[1],zmm0[3],zmm1[3],zmm0[5],zmm1[5],zmm0[7],zmm1[7]
	; KNL-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; KNL-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fadd_noundef_eel:			; SKX-LABEL: fadd_noundef_eel:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vunpcklpd {{.*#+}} zmm2 = zmm0[0],zmm1[0],zmm0[2],zmm1[2],zmm0[4],zmm1[4],zmm0[6],zmm1[6]			; SKX-NEXT: vhaddpd %ymm1, %ymm0, %ymm0
	; SKX-NEXT: vunpckhpd {{.*#+}} zmm0 = zmm0[1],zmm1[1],zmm0[3],zmm1[3],zmm0[5],zmm1[5],zmm0[7],zmm1[7]
	; SKX-NEXT: vaddpd %zmm0, %zmm2, %zmm0
	; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>			; SKX-NEXT: # kill: %XMM0<def> %XMM0<kill> %ZMM0<kill>
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>			%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
	%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>			%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>
	%x229 = fadd <8 x double> %x226, %x228			%x229 = fadd <8 x double> %x226, %x228
	%x230 = extractelement <8 x double> %x229, i32 0			%x230 = extractelement <8 x double> %x229, i32 0
	ret double %x230			ret double %x230
	}			}



	define double @fsub_noundef_ee (<8 x double> %x225, <8 x double> %x227) {			define double @fsub_noundef_ee (<8 x double> %x225, <8 x double> %x227) {
	; KNL-LABEL: fsub_noundef_ee:			; KNL-LABEL: fsub_noundef_ee:
	; KNL: # BB#0:			; KNL: # BB#0:
	; KNL-NEXT: vunpcklpd {{.*#+}} zmm2 = zmm0[0],zmm1[0],zmm0[2],zmm1[2],zmm0[4],zmm1[4],zmm0[6],zmm1[6]			; KNL-NEXT: vextractf64x4 $1, %zmm1, %ymm1
	; KNL-NEXT: vunpckhpd {{.*#+}} zmm0 = zmm0[1],zmm1[1],zmm0[3],zmm1[3],zmm0[5],zmm1[5],zmm0[7],zmm1[7]			; KNL-NEXT: vextractf64x4 $1, %zmm0, %ymm0
	; KNL-NEXT: vsubpd %zmm0, %zmm2, %zmm0			; KNL-NEXT: vhsubpd %ymm1, %ymm0, %ymm0
				; KNL-NEXT: vinsertf64x4 $1, %ymm0, %zmm0, %zmm0
	; KNL-NEXT: vextractf32x4 $2, %zmm0, %xmm0			; KNL-NEXT: vextractf32x4 $2, %zmm0, %xmm0
	; KNL-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0]			; KNL-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
	; KNL-NEXT: retq			; KNL-NEXT: retq
	;			;
	; SKX-LABEL: fsub_noundef_ee:			; SKX-LABEL: fsub_noundef_ee:
	; SKX: # BB#0:			; SKX: # BB#0:
	; SKX-NEXT: vunpcklpd {{.*#+}} zmm2 = zmm0[0],zmm1[0],zmm0[2],zmm1[2],zmm0[4],zmm1[4],zmm0[6],zmm1[6]			; SKX-NEXT: vextractf64x4 $1, %zmm1, %ymm1
	; SKX-NEXT: vunpckhpd {{.*#+}} zmm0 = zmm0[1],zmm1[1],zmm0[3],zmm1[3],zmm0[5],zmm1[5],zmm0[7],zmm1[7]			; SKX-NEXT: vextractf64x4 $1, %zmm0, %ymm0
	; SKX-NEXT: vsubpd %zmm0, %zmm2, %zmm0			; SKX-NEXT: vhsubpd %ymm1, %ymm0, %ymm0
				; SKX-NEXT: vinsertf64x4 $1, %ymm0, %zmm0, %zmm0
	; SKX-NEXT: vextractf32x4 $2, %zmm0, %xmm0			; SKX-NEXT: vextractf32x4 $2, %zmm0, %xmm0
	; SKX-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0]			; SKX-NEXT: vpermilpd {{.*#+}} xmm0 = xmm0[1,0]
	; SKX-NEXT: vzeroupper			; SKX-NEXT: vzeroupper
	; SKX-NEXT: retq			; SKX-NEXT: retq
	%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>			%x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
	%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>			%x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5 ,i32 13, i32 7, i32 15>
	%x229 = fsub <8 x double> %x226, %x228			%x229 = fsub <8 x double> %x226, %x228
	%x230 = extractelement <8 x double> %x229, i32 5			%x230 = extractelement <8 x double> %x229, i32 5
	ret double %x230			ret double %x230
	}			}

test/CodeGen/X86/madd.ll

	[... 290 lines elided ...]
	; AVX2-NEXT: cmpq %rcx, %rax			; AVX2-NEXT: cmpq %rcx, %rax
	; AVX2-NEXT: jne .LBB2_1			; AVX2-NEXT: jne .LBB2_1
	; AVX2-NEXT: # BB#2: # %middle.block			; AVX2-NEXT: # BB#2: # %middle.block
	; AVX2-NEXT: vpaddd %ymm0, %ymm1, %ymm0			; AVX2-NEXT: vpaddd %ymm0, %ymm1, %ymm0
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0			; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
	; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0			; AVX2-NEXT: vpaddd %ymm1, %ymm0, %ymm0
	; AVX2-NEXT: vphaddd %ymm0, %ymm0, %ymm0			; AVX2-NEXT: vphaddd %ymm0, %ymm0, %ymm0
				craig.topper: Why is this horizontal add not narrowed?
	; AVX2-NEXT: vmovd %xmm0, %eax			; AVX2-NEXT: vmovd %xmm0, %eax
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: _Z9test_charPcS_i:			; AVX512-LABEL: _Z9test_charPcS_i:
	; AVX512: # BB#0: # %entry			; AVX512: # BB#0: # %entry
	; AVX512-NEXT: movl %edx, %eax			; AVX512-NEXT: movl %edx, %eax
	; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0
	[... 10 lines elided ...]
	; AVX512-NEXT: jne .LBB2_1			; AVX512-NEXT: jne .LBB2_1
	; AVX512-NEXT: # BB#2: # %middle.block			; AVX512-NEXT: # BB#2: # %middle.block
	; AVX512-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512-NEXT: vmovd %xmm0, %eax			; AVX512-NEXT: vmovd %xmm0, %eax
	; AVX512-NEXT: vzeroupper			; AVX512-NEXT: vzeroupper
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	entry:			entry:
	%3 = zext i32 %2 to i64			%3 = zext i32 %2 to i64
	br label %vector.body			br label %vector.body

	vector.body:			vector.body:
	[... 28 lines elided ...]

test/CodeGen/X86/sad.ll

	[... 72 lines elided ...]
	; AVX512F-NEXT: jne .LBB0_1			; AVX512F-NEXT: jne .LBB0_1
	; AVX512F-NEXT: # BB#2: # %middle.block			; AVX512F-NEXT: # BB#2: # %middle.block
	; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512F-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vmovd %xmm0, %eax			; AVX512F-NEXT: vmovd %xmm0, %eax
	; AVX512F-NEXT: vzeroupper			; AVX512F-NEXT: vzeroupper
	; AVX512F-NEXT: retq			; AVX512F-NEXT: retq
	;			;
	; AVX512BW-LABEL: sad_16i8:			; AVX512BW-LABEL: sad_16i8:
	; AVX512BW: # BB#0: # %entry			; AVX512BW: # BB#0: # %entry
	; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00			; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00
	; AVX512BW-NEXT: .p2align 4, 0x90			; AVX512BW-NEXT: .p2align 4, 0x90
	; AVX512BW-NEXT: .LBB0_1: # %vector.body			; AVX512BW-NEXT: .LBB0_1: # %vector.body
	; AVX512BW-NEXT: # =>This Inner Loop Header: Depth=1			; AVX512BW-NEXT: # =>This Inner Loop Header: Depth=1
	; AVX512BW-NEXT: vmovdqu a+1024(%rax), %xmm1			; AVX512BW-NEXT: vmovdqu a+1024(%rax), %xmm1
	; AVX512BW-NEXT: vpsadbw b+1024(%rax), %xmm1, %xmm1			; AVX512BW-NEXT: vpsadbw b+1024(%rax), %xmm1, %xmm1
	; AVX512BW-NEXT: vmovdqa %xmm1, %xmm1			; AVX512BW-NEXT: vmovdqa %xmm1, %xmm1
	; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0			; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0
	; AVX512BW-NEXT: addq $4, %rax			; AVX512BW-NEXT: addq $4, %rax
	; AVX512BW-NEXT: jne .LBB0_1			; AVX512BW-NEXT: jne .LBB0_1
	; AVX512BW-NEXT: # BB#2: # %middle.block			; AVX512BW-NEXT: # BB#2: # %middle.block
	; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512BW-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vmovd %xmm0, %eax			; AVX512BW-NEXT: vmovd %xmm0, %eax
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	entry:			entry:
	br label %vector.body			br label %vector.body

	vector.body:			vector.body:
	%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]			%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
	[... 205 lines elided ...]
	; AVX512F-NEXT: # BB#2: # %middle.block			; AVX512F-NEXT: # BB#2: # %middle.block
	; AVX512F-NEXT: vpaddd %zmm0, %zmm1, %zmm0			; AVX512F-NEXT: vpaddd %zmm0, %zmm1, %zmm0
	; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512F-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vmovd %xmm0, %eax			; AVX512F-NEXT: vmovd %xmm0, %eax
	; AVX512F-NEXT: vzeroupper			; AVX512F-NEXT: vzeroupper
	; AVX512F-NEXT: retq			; AVX512F-NEXT: retq
	;			;
	; AVX512BW-LABEL: sad_32i8:			; AVX512BW-LABEL: sad_32i8:
	; AVX512BW: # BB#0: # %entry			; AVX512BW: # BB#0: # %entry
	; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00			; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00
	[... 10 lines elided ...]
	; AVX512BW-NEXT: # BB#2: # %middle.block			; AVX512BW-NEXT: # BB#2: # %middle.block
	; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0			; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0
	; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512BW-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vmovd %xmm0, %eax			; AVX512BW-NEXT: vmovd %xmm0, %eax
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	entry:			entry:
	br label %vector.body			br label %vector.body

	vector.body:			vector.body:
	%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]			%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
	[... 427 lines elided ...]
	; AVX512F-NEXT: vpaddd %zmm3, %zmm1, %zmm1			; AVX512F-NEXT: vpaddd %zmm3, %zmm1, %zmm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512F-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512F-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512F-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512F-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512F-NEXT: vmovd %xmm0, %eax			; AVX512F-NEXT: vmovd %xmm0, %eax
	; AVX512F-NEXT: vzeroupper			; AVX512F-NEXT: vzeroupper
	; AVX512F-NEXT: retq			; AVX512F-NEXT: retq
	;			;
	; AVX512BW-LABEL: sad_avx64i8:			; AVX512BW-LABEL: sad_avx64i8:
	; AVX512BW: # BB#0: # %entry			; AVX512BW: # BB#0: # %entry
	; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0
	; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00			; AVX512BW-NEXT: movq $-1024, %rax # imm = 0xFC00
	[... 11 lines elided ...]
	; AVX512BW-NEXT: vpaddd %zmm0, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm0, %zmm0, %zmm0
	; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0			; AVX512BW-NEXT: vpaddd %zmm0, %zmm1, %zmm0
	; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1			; AVX512BW-NEXT: vextracti64x4 $1, %zmm0, %ymm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1			; AVX512BW-NEXT: vextracti128 $1, %ymm0, %xmm1
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]			; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0			; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]			; AVX512BW-NEXT: vphaddd %ymm0, %ymm0, %ymm0
	; AVX512BW-NEXT: vpaddd %zmm1, %zmm0, %zmm0
	; AVX512BW-NEXT: vmovd %xmm0, %eax			; AVX512BW-NEXT: vmovd %xmm0, %eax
	; AVX512BW-NEXT: vzeroupper			; AVX512BW-NEXT: vzeroupper
	; AVX512BW-NEXT: retq			; AVX512BW-NEXT: retq
	entry:			entry:
	br label %vector.body			br label %vector.body

	vector.body:			vector.body:
	%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]			%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
	[... 465 lines elided ...]