This is an archive of the discontinued LLVM Phabricator instance.

[x86] use movmsk when extracting multiple lanes of a vector compare (PR39665)
AbandonedPublic

Authored by spatel on Mar 21 2019, 2:29 PM.

Details

Summary

The motivation for this patch is the 2nd example from PR39665 comment 2:
https://bugs.llvm.org/show_bug.cgi?id=39665#c2
(To avoid confusion, we could open a separate bug report to correspond with this patch because there are several independent problems in the bug report.)

We do a vector comparison, extract >1 lane, and then do some arbitrary ops on those results. The likely expensive part of that sequence is getting the results from XMM to GPR, so we want to do that with a single 'movmsk'.
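As a rough illustration (a schematic Python model, not the actual codegen), each lane of a vector compare is either all-ones or all-zeros, and a single movmsk packs the lanes' sign bits into one scalar mask, so all the per-lane tests become cheap scalar bit tests instead of multiple XMM-to-GPR moves:

```python
def movmsk(lanes):
    """Model of x86 MOVMSKPD/PMOVMSKB: pack each lane's sign bit
    into a scalar integer, lowest lane in bit 0."""
    mask = 0
    for i, lane in enumerate(lanes):
        if lane:  # lane is "true" (all-ones in the vector register)
            mask |= 1 << i
    return mask

# Hypothetical two-lane compare: x > y per lane.
x = [2.0, 5.0]
y = [1.0, 7.0]
cmp_lanes = [a > b for a, b in zip(x, y)]  # [True, False]

# Instead of extracting each lane to a GPR separately,
# one movmsk yields both results at once.
m = movmsk(cmp_lanes)
assert m == 0b01
```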

Typically, we'd want to do "all-of" or "any-of" type comparisons, but I've intentionally created more complicated test cases here to show potential trade-offs. If we're not happy with those diffs, we can restrict the pattern matching to only apply to the more specific/typical patterns.

Unfortunately, we seem to be missing folds to form 'test' instructions with bitmasks, so even the motivating case isn't optimal yet. I'll take a look at that transform next if the general direction on this patch looks good, but I think this patch is likely still a perf win on that example (although I have no idea about the KNL target specifically).

Diff Detail

Event Timeline

spatel created this revision.Mar 21 2019, 2:29 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 21 2019, 2:29 PM
RKSimon added inline comments.Mar 21 2019, 2:46 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
34440

With a little care we should be able to do v8i16 as well.

34480

Can any of the code from combineHorizontalPredicateResult or combineBitcastvxi1 be reused?

llvm/test/CodeGen/X86/movmsk-cmp.ll
4926

Are these 'and' instructions superfluous if we just need the lsb?

5136

For the any-of/all-of cases it'd be a lot better if we could merge the tests into a single compare - in c-ray that would allow us to merge the multiple cmp+jmp pairs - the 2 separate jmps packed so close together are a known perf issue.

spatel marked 2 inline comments as done.Mar 21 2019, 3:07 PM
spatel added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
34480

I didn't see a way. This raises what might be a fundamental question about these kinds of patterns.

I was going for a general solution, but if we really only care about reductions or compare-of-compare, then we might be better off trying to add a glob of vector compare logic to SLP (maybe it's already there)?

Ie, we could try to form this in IR:

%cmp = fcmp ogt <2 x double> %x, %y
%e1 = extractelement <2 x i1> %cmp, i32 0
%e2 = extractelement <2 x i1> %cmp, i32 1
%u = and i1 %e1, %e2
=>
%cmp = fcmp ogt <2 x double> %x, %y
%bc = bitcast <2 x i1> %cmp to i2
%u = icmp eq i2 %bc, -1
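A quick exhaustive check of that equivalence (a Python model over hypothetical lane values, not tied to any test in the patch): the 'and' of all extracted i1 lanes matches comparing the packed bitmask against all-ones.

```python
from itertools import product

def all_of_extracts(lanes):
    # Model of: extractelement each lane, then 'and i1' them together.
    result = True
    for lane in lanes:
        result = result and lane
    return result

def all_of_bitmask(lanes):
    # Model of: bitcast <N x i1> to iN, then 'icmp eq iN %bc, -1'
    # (-1 in an N-bit integer is the all-ones pattern).
    n = len(lanes)
    mask = sum(1 << i for i, lane in enumerate(lanes) if lane)
    return mask == (1 << n) - 1

# Check all 2^2 lane combinations for the <2 x i1> case.
for lanes in product([False, True], repeat=2):
    assert all_of_extracts(list(lanes)) == all_of_bitmask(list(lanes))
```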
llvm/test/CodeGen/X86/movmsk-cmp.ll
5136

Yes, I think this is an independent problem. Let me see if I can solve at least part of that, so it's not distracting here.

spatel marked an inline comment as done.
spatel added inline comments.
llvm/test/CodeGen/X86/movmsk-cmp.ll
5136

Oh no...we need the select-to-logic transform that we recently discussed as illegal due to poison:

  t21: i32 = select t28, Constant:i32<42>, Constant:i32<99>
  t22: i32 = select t30, t21, Constant:i32<99>

The more I look at the motivating patterns, the less hopeful I am that we can get optimal code by delaying the transforms until SDAG.

So here's a 1st step to improve the SLP vectorizer: D59710

For reference, the double-select is *created here in SDAG* because that is supposed to be an optimization: D7622

There's still a chance that this transform is useful in some cases, so I'm not quite ready to abandon.

spatel abandoned this revision.Apr 30 2019, 7:44 AM

D61189 improves on what this patch set out to do (although we're still likely to need IR fixes), so abandoning.