
[X86] Changes to extract Horizontal addition operation for AVX-512.
Needs RevisionPublic

Authored by jbhateja on Aug 8 2017, 3:12 AM.

Details

Summary

vphadd is not a supported instruction for AVX-512, but if the result of a 512-bit add
is only partially consumed, such that fewer than half of its bits are used by the user
of the add instruction, we can perform the horizontal addition at a narrower width and
concatenate the result with undef.

This will fix PR33758
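As a hedged illustration (not code from the patch), the transform described in the summary can be modeled in scalar C++: a wide 16-lane add fed by a shuffle, whose result is only read through lane 0, collapses to a single horizontal-add lane. The lane count and zero-for-undef modeling are assumptions of this sketch.

```cpp
#include <array>
#include <cassert>

// Hypothetical scalar model of the pattern: the wide add is fed by an
// odd/even shuffle of x, but only lane 0 of the 16-lane result is consumed,
// so the whole computation is equivalent to one horizontal-add lane.
int wide_add_lane0(const std::array<int, 16>& x) {
    // shuffled[0] = x[1]; the remaining lanes are undef in the original IR,
    // modeled here as 0 since they are never read.
    std::array<int, 16> shuffled{};
    shuffled[0] = x[1];
    // The wide 16-lane add; only lane 0 is ever read by the user.
    std::array<int, 16> sum{};
    for (int i = 0; i < 16; ++i) sum[i] = x[i] + shuffled[i];
    return sum[0]; // the only consumed bits
}

// What a narrow horizontal add would produce for the same consumed lane.
int hadd_lane0(const std::array<int, 16>& x) {
    return x[0] + x[1];
}
```

Both paths yield the same consumed value, which is what makes the hadd-plus-undef-concat rewrite legal.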

Event Timeline

jbhateja created this revision.Aug 8 2017, 3:12 AM

Ping @ reviewers.

craig.topper added inline comments.Aug 10 2017, 5:23 PM
lib/Target/X86/X86ISelLowering.cpp
35355

Don't we need to make sure this only happens on v32i16 and v16i32 types? combineAdd can get called on all sorts of types.

35368

Don't use a SmallVector for a hardcoded two elements. Just use a plain old array.
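A minimal sketch of that suggestion, with element types simplified to int (the real code operates on SDValue, so this is purely illustrative): a fixed pair of operands needs no growable container.

```cpp
#include <cassert>

// Before (sketch): llvm::SmallVector<SDValue, 2> Ops;
// After (sketch):  SDValue Ops[2] = {Lo, Hi};  // a plain old array
// Modeled here with int instead of SDValue.
int sum2(const int (&ops)[2]) {
    return ops[0] + ops[1]; // consume exactly the two hardcoded elements
}
```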

craig.topper edited edge metadata.Aug 10 2017, 5:29 PM

Are there any cases where this should happen where Op0 is different from Op1? All of these test changes are unary shuffles.

And really shouldn't we have some more generic way of shrinking adds like this? How many of the other adds in these tests could have been done in a smaller type to use a VEX encoding?

Thinking about this some more. Do we really want to use a horizontal add instruction for a register with itself? Horizontal add is suboptimally implemented in microcode. It's 3 uops while the pshufd and the add are only 2 uops. The 3 uops also mean it's limited to the complex decoder on Intel hardware.
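The uop argument above can be checked against the semantics: when both operands of a horizontal add are the same register, a shuffle plus a vertical add produces the same pair sums, only in different lanes. A hedged scalar model (4 x i32 lanes, simplified types):

```cpp
#include <array>
#include <cassert>

// vphaddd x, x -> {x0+x1, x2+x3, x0+x1, x2+x3}
std::array<int, 4> hadd_self(std::array<int, 4> x) {
    return {x[0] + x[1], x[2] + x[3], x[0] + x[1], x[2] + x[3]};
}

// pshufd with mask {1,0,3,2}, then paddd:
// {x0+x1, x1+x0, x2+x3, x3+x2} -- the pair sums land in lanes 0 and 2.
std::array<int, 4> shuffle_add(std::array<int, 4> x) {
    std::array<int, 4> s = {x[1], x[0], x[3], x[2]}; // pshufd
    return {x[0] + s[0], x[1] + s[1], x[2] + s[2], x[3] + s[3]}; // paddd
}
```

The consumed sums are available either way; only the lane placement differs, which a later extract can absorb.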

jbhateja updated this revision to Diff 112822.Aug 27 2017, 6:22 AM
jbhateja added a comment.EditedAug 27 2017, 6:27 AM

Thinking about this some more. Do we really want to use a horizontal add instruction for a register with itself? Horizontal add is suboptimally implemented in microcode. It's 3 uops while the pshufd and the add are only 2 uops. The 3 uops also mean it's limited to the complex decoder on Intel hardware.

Yes, I agree. Latency and uop count are better without hadd; the only advantage of hadd with identical operands is smaller code size.
Do you suggest adding a check in isHorizontalBinOp so it does not recognize a valid pattern for horizontal add/sub if the operands are the same?

jbhateja updated this revision to Diff 112823.Aug 27 2017, 6:30 AM
  • Removing a file that got added in the last patch.
jbhateja updated this revision to Diff 112829.Aug 27 2017, 8:22 AM
  • Stashed change left out of the last check-in + formatting changes.
  • Updating test reference

ping @reviewers

Ping reviewers

I think we should try to combine based on the add only being used by the extract_vector_elt. Turn the add into a 128-bit add being fed by extract_subvectors. Similarly if we see an add only being used by an extract_subvector we can shrink that add too and push the extracts up. This type of transform feels more generally useful because it will allow us to narrow many more adds in this code. This will enable EVEX->VEX to use a smaller encoding. We can apply this to many other opcodes as well.

If we do this early enough we should be able to shrink the add before the horizontal add detection.
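The suggested combine — (extract_subvector (add X, Y)) -> (narrower add of extracted inputs) — can be sketched as a scalar model, with lanes as plain ints and sizes chosen only for illustration:

```cpp
#include <array>
#include <cassert>

// Wide path: do a 16-lane add, then extract the low 4 lanes
// (extract_subvector at index 0).
std::array<int, 4> wide_then_extract(const std::array<int, 16>& x,
                                     const std::array<int, 16>& y) {
    std::array<int, 16> sum{};
    for (int i = 0; i < 16; ++i) sum[i] = x[i] + y[i];
    return {sum[0], sum[1], sum[2], sum[3]};
}

// Narrow path: push the extracts up through the add and do a 4-lane add.
// This is the smaller operation that can then use a VEX encoding.
std::array<int, 4> extract_then_narrow_add(const std::array<int, 16>& x,
                                           const std::array<int, 16>& y) {
    return {x[0] + y[0], x[1] + y[1], x[2] + y[2], x[3] + y[3]};
}
```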

jbhateja added a comment.EditedSep 6 2017, 12:43 AM

I think we should try to combine based on the add only being used by the extract_vector_elt. Turn the add into a 128-bit add being fed by extract_subvectors. Similarly if we see an add only being used by an extract_subvector we can shrink that add too and push the extracts up. This type of transform feels more generally useful because it will allow us to narrow many more adds in this code. This will enable EVEX->VEX to use a smaller encoding. We can apply this to many other opcodes as well.

If we do this early enough we should be able to shrink the add before the horizontal add detection.

Two cases for DAG node reduction:

 a/ Look at the operands and try squeezing them (EXTRACT_SUBVECTOR) into a narrower operation, which is then concatenated with a pad to make the final result the same size as the original. Here we only look at the operands and not the uses of the operation, which means it could break valid patterns being checked at the use nodes due to the insertion of CONCAT_VECTORS.

 b/ Look at both the uses of the DAG node and the operands of the node while narrowing the operation. The idea here is to avoid inserting an extra concat operation for padding, which keeps the pattern matches at the use nodes happy.

My initial patch was based on strategy (b). What I take from your comment above is to change it to work for any generic opcode instead of HADD. Correct?

Currently the patch is based on strategy (a), with a wrapper over a generic subroutine that scales down the operation only for HADD/HSUB on particular vector types, which is also safe.

jbhateja updated this revision to Diff 115795.Sep 19 2017, 12:10 AM

Why can't this be solved by just combining extract_subvectors through things like binops and onto their inputs?

Like this:
combine (extract_subvector (add X, Y)) -> (narrower_add (extract_subvector X, extract_subvector Y))

lib/Target/X86/X86ISelLowering.cpp
29873

I think element indices should use DAG.getIntPtrConstant to make the type correct. It's not supposed to be i32.

33395

Can we teach combineExtractSubvector to narrow an extract of an add by extracting from the inputs? Then we shouldn't need any change here because we'll already have the narrow add?

33471

Why is this variable now upper case?

33505

Why was this moved out of the if?

test/CodeGen/X86/madd.ll
303

Why is this horizontal add not narrowed?

RKSimon edited edge metadata.Sep 19 2017, 2:43 AM

Would we be better off focussing on developing the @llvm.experimental.vector.reduce.add.* intrinsics? Or looking to get slpvectorizer to do extract_subvector style shuffles as part of the reduction?

jbhateja added inline comments.Sep 19 2017, 4:15 AM
lib/Target/X86/X86ISelLowering.cpp
29873

Yes this will be fixed.

33395

I think each node-specific combiner should look at the pattern starting from itself.

i.e. combineExtractSubvector could be taught to remove an extract_subvector from a vector shuffle (if it has only one use), thus producing a smaller vector shuffle; that would be a generic change.

Looking for an opportunity like the one you mentioned, "extract of an add by extracting from the inputs", appears to be a specialization of that.

As of now, narrowing is done generically at the following two places:
1/ extract_vector_elt: it looks at its input vector and tries scaling it down if possible.

2/ Narrowing the binary operation (add/sub/fadd/fsub); this implicitly does what your comment says.

33471

For better visual clarity of the variable name.

33505

This will be fixed.

I don't think we should be shrinking operations based on undef inputs. I think we should be shrinking them based on what elements are consumed by their users. There's no reason the shuffles in these reductions have to have undef elements. For integers we may rewrite the shuffle mask to undef in InstCombine if the elements aren't used downstream. But we don't do that for FP.

As an example, why shouldn't we be able to use a horizontal add for this?

define <4 x double> @fadd_noundef(<8 x double> %x225, <8 x double> %x227) {

  %x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
  %x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5, i32 13, i32 7, i32 15>
  %x229 = fadd <8 x double> %x226, %x228
  %x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x double> %x230
}
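For illustration, a scalar C++ model (not from the patch) of the IR above, with a = %x225 and b = %x227: the interleaving shuffles, fadd, and low-half extract compute exactly what 256-bit vhaddpd produces on the low halves of the two inputs, which is why a horizontal add should be usable here.

```cpp
#include <array>
#include <cassert>

// The IR pattern, lane by lane.
std::array<double, 4> ir_pattern(const std::array<double, 8>& a,
                                 const std::array<double, 8>& b) {
    // %x226: even lanes of a interleaved with even lanes of b
    std::array<double, 8> x226 = {a[0], b[0], a[2], b[2], a[4], b[4], a[6], b[6]};
    // %x228: odd lanes of a interleaved with odd lanes of b
    std::array<double, 8> x228 = {a[1], b[1], a[3], b[3], a[5], b[5], a[7], b[7]};
    std::array<double, 8> x229{};
    for (int i = 0; i < 8; ++i) x229[i] = x226[i] + x228[i]; // %x229 = fadd
    return {x229[0], x229[1], x229[2], x229[3]};             // %x230: low half
}

// vhaddpd ymm semantics applied to the low 256-bit halves:
// {a0+a1, b0+b1, a2+a3, b2+b3} (per-128-bit-lane pairwise sums).
std::array<double, 4> vhaddpd_lo(const std::array<double, 8>& a,
                                 const std::array<double, 8>& b) {
    return {a[0] + a[1], b[0] + b[1], a[2] + a[3], b[2] + b[3]};
}
```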
lib/Target/X86/X86ISelLowering.cpp
1696

Use of MMX here is weird. We explicitly don't generate any optimized code for MMX. So making references to it in terms of SSE/AVX is misleading.

jbhateja updated this revision to Diff 116634.Sep 25 2017, 7:58 PM

I don't think we should be shrinking operations based on undef inputs. I think we should be shrinking them based on what elements are consumed by their users. There's no reason the shuffles in these reductions have to have undef elements. For integers we may rewrite the shuffle mask to undef in InstCombine if the elements aren't used downstream. But we don't do that for FP.

As an example, why shouldn't we be able to use a horizontal add for this?

define <4 x double> @fadd_noundef(<8 x double> %x225, <8 x double> %x227) {

  %x226 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 0, i32 8, i32 2, i32 10, i32 4, i32 12, i32 6, i32 14>
  %x228 = shufflevector <8 x double> %x225, <8 x double> %x227, <8 x i32> <i32 1, i32 9, i32 3, i32 11, i32 5, i32 13, i32 7, i32 15>
  %x229 = fadd <8 x double> %x226, %x228
  %x230 = shufflevector <8 x double> %x229, <8 x double> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  ret <4 x double> %x230
}

We do generate horizontal addition in this case.

lib/Target/X86/X86ISelLowering.cpp
1696

The reference to MMX here signifies one of the denominations of smaller vector register sizes.

The test case I gave does not generate horizontal add with your patch with avx512f enabled. It does with avx2 but that's only because type legalization did the dirty work of splitting the result.

jbhateja updated this revision to Diff 116998.Sep 28 2017, 8:14 AM
  • Changes to cover more patterns for [f]hadd/[f]hsub for AVX-512 vector types.
  • Generic routines added which look at the uses / undef operands of an operation to scale it down.
jbhateja added inline comments.Sep 28 2017, 8:22 AM
lib/Target/X86/X86ISelLowering.cpp
33385

Just realized that the DAG argument is not used here; it will be removed along with the other review comments on the patch.

RKSimon requested changes to this revision.Sep 29 2017, 3:49 AM

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

This revision now requires changes to proceed.Sep 29 2017, 3:49 AM
jbhateja added a comment.EditedSep 29 2017, 7:47 AM

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

Hi Simon,

Thanks for pointing me to the code references. I tried a simple case which was not optimized by InstCombiner::SimplifyDemandedVectorElts;
it works over the known-bits mechanism. In fact, none of the test cases provided in avx512-hadd-hsub.ll were optimized by InstCombine.

define float @fhsub_16(<16 x float> %x225) {
;define <16 x float> @fhsub_16(<16 x float> %x225) {

%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x227 = fadd <16 x float> %x225, %x226
%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x229 = fsub <16 x float> %x227, %x228
%x230 = extractelement <16 x float> %x229, i32 0
ret float %x230

}
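Tracing the lanes of @fhsub_16 above makes the point concrete: only lane 0 of the final fsub is ever consumed, and it works out to (x0+x2) - (x1+x3), i.e. one horizontal-add lane minus another. A hedged scalar model (x = %x225):

```cpp
#include <array>
#include <cassert>

// Lane-0 value of the shufflevector/fadd/fsub chain in @fhsub_16.
float fhsub_16_model(const std::array<float, 16>& x) {
    float x227_0 = x[0] + x[2]; // lane 0 of %x227 = x + shuffle<2,3,...>(x)
    float x227_1 = x[1] + x[3]; // lane 1 of %x227
    return x227_0 - x227_1;     // lane 0 of %x229 = %x227 - shuffle<1,...>(%x227)
}
```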

This patch provides two generic routines which try to scale down an operation in denominations of X86 vector register sizes; patch D36650 also suggests a similar effort.

PR33758 is specifically about inference of horizontal operations for AVX-512 vector types.

Kindly elaborate how an add reduction is helpful here for the cases provided in the test case avx512-hadd-hsub.ll.

Thanks

I'm not sure about this approach; I don't think most of this needs to be done in the X86 backend at all, and much of it shouldn't even be done in the DAG. Most of the code appears to be better handled in a mixture of SimplifyDemandedVectorElts and SLPVectorizer.

PR33758 was about improving codegen for horizontal reductions, so we'd probably be better off having the backend optimize for @llvm.experimental.vector.reduce.add.* (or the legalized patterns it produces), and then getting the vectorizers to create these properly.

Hi Simon,

Thanks for pointing me to the code references. I tried a simple case which was not optimized by InstCombiner::SimplifyDemandedVectorElts;
it works over the known-bits mechanism. In fact, none of the test cases provided in avx512-hadd-hsub.ll were optimized by InstCombine.

In my opinion, @llvm.experimental.vector.reduce.fadd and friends should be treated as the canonical forms of those operations. InstCombine should form these intrinsics upon encountering these shuffle patterns.

The SLPVectorizer, LoopVectorizer, etc. should use the intrinsics when handling relevant reductions.

Is there an advantage to handling the shuffle patterns in the backend directly as opposed to forming the intrinsics earlier and then handling them in the backend?

define float @fhsub_16(<16 x float> %x225) {
;define <16 x float> @fhsub_16(<16 x float> %x225) {

%x226 = shufflevector <16 x float> %x225, <16 x float> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x227 = fadd <16 x float> %x225, %x226
%x228 = shufflevector <16 x float> %x227, <16 x float> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
%x229 = fsub <16 x float> %x227, %x228
%x230 = extractelement <16 x float> %x229, i32 0
ret float %x230

}

This patch provides two generic routines which try to scale down an operation in denominations of X86 vector register sizes; patch D36650 also suggests a similar effort.

PR33758 is specifically about inference of horizontal operations for AVX-512 vector types.

Kindly elaborate how an add reduction is helpful here for the cases provided in the test case avx512-hadd-hsub.ll.

Thanks

RKSimon resigned from this revision.Oct 11 2018, 8:56 AM

@jbhateja Abandon this? @spatel's recent work seems to have covered everything already