This is an archive of the discontinued LLVM Phabricator instance.

Differential D60545

[DAGCombiner] narrow shuffle of concatenated vectors
ClosedPublic

Authored by spatel on Apr 10 2019, 3:35 PM.

Details

Reviewers

dmgreen
sdesmalen
efriedma
RKSimon
craig.topper

Commits

rG5e4ad39af7c2: [DAGCombiner] narrow shuffle of concatenated vectors
rL358291: [DAGCombiner] narrow shuffle of concatenated vectors

Summary

// shuffle (concat X, undef), (concat Y, undef), Mask -->
// concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)

Someone with more ARM NEON experience can confirm, but I think the changes with 'vtrn' are improvements.

The x86 changes look neutral or better. There's one test with an extra instruction, but that could be reversed for a subtarget with the right attributes.

But by default, I think we want to avoid the 256-bit op when possible. In my motivating benchmark, a handful of ymm ops sprinkled into a sequence of xmm ops triggers frequency throttling on Haswell, resulting in significantly worse perf.
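The mask split behind the transform in this summary can be sketched outside of LLVM. The following is a minimal standalone model, not the patch itself: `splitWideMask` is a hypothetical name, plain `int` vectors stand in for the shuffle mask, and -1 stands in for an undef mask element. It assumes the wide mask indexes the two concatenated operands back to back.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical standalone model of the mask split (not LLVM API).
// The wide mask indexes concat(X, undef) followed by concat(Y, undef),
// each NumElts wide; -1 marks an undef mask element.
std::pair<std::vector<int>, std::vector<int>>
splitWideMask(const std::vector<int> &Mask) {
  assert(Mask.size() % 2 == 0 && "expected an even number of mask elements");
  const int NumElts = static_cast<int>(Mask.size());
  const int HalfNumElts = NumElts / 2;
  std::vector<int> Mask0(HalfNumElts, -1), Mask1(HalfNumElts, -1);
  for (int i = 0; i != NumElts; ++i) {
    if (Mask[i] == -1)
      continue;
    // Elements that select from the second wide operand move down by half
    // a vector, because the narrow operand X is only HalfNumElts wide.
    int M = Mask[i] < NumElts ? Mask[i] : Mask[i] - HalfNumElts;
    if (i < HalfNumElts)
      Mask0[i] = M;
    else
      Mask1[i - HalfNumElts] = M;
  }
  return {Mask0, Mask1};
}
```

For example, `splitWideMask({1, 5, 0, 4})` yields the narrow masks `{1, 3}` and `{0, 2}`: the elements 5 and 4, which select from the second wide operand, are offset down by two.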

Diff Detail

Repository: rL LLVM

Event Timeline

spatel created this revision. Apr 10 2019, 3:35 PM

Herald added a project: Restricted Project. Apr 10 2019, 3:35 PM

Herald added subscribers: hiraditya, kristof.beyls, javed.absar, mcrosier.

From looking at the A57 scheduler model in LLVM this seems like an improvement, with for example VUZPq taking twice as long as VUZPd. But in general I could imagine shuffles on shorter vectors to be cheaper.
It will probably also be easier to spot that one of the concatenated subvectors becomes fully undef after doing the shuffling on the subvector first.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17640 ↗	(On Diff #194598)	Would it be possible to move/copy this comment to the call of foldShuffleOfConcatUndefs?

spatel marked an inline comment as done. Apr 11 2019, 10:46 AM

spatel added inline comments.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17640 ↗	(On Diff #194598)	I can do that if you think that makes the code clearer, but in general I try to make the function's doxygen comment a bit more vague in case the implementation changes/grows, but not repeat that function-level comment at the callers. For example, I think this function could be extended to handle a pattern where the 2nd operand is not an undef value as long as the shuffle mask is not choosing elements from that 2nd operand. Let me know if I am misunderstanding the suggestion.
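The possible extension mentioned in this comment (a second concat operand that is not undef, provided the mask never selects from it) would hinge on a check over the mask. A minimal sketch of such a check follows. It is not part of the patch: `maskAvoidsUpperHalves` is a hypothetical name, and plain `int` vectors with -1 for undef stand in for the real mask.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not LLVM API): returns true if no mask element
// selects from the upper half of either wide operand, that is, from the
// second operand of either concat.
bool maskAvoidsUpperHalves(const std::vector<int> &Mask) {
  assert(Mask.size() % 2 == 0 && "expected an even number of mask elements");
  const int NumElts = static_cast<int>(Mask.size());
  const int HalfNumElts = NumElts / 2;
  for (int M : Mask) {
    if (M == -1)
      continue; // undef mask elements select nothing
    assert(M >= 0 && M < 2 * NumElts && "mask element out of range");
    // M % NumElts is the position inside whichever wide operand M selects.
    if (M % NumElts >= HalfNumElts)
      return false;
  }
  return true;
}
```

With four-element wide vectors, `{1, 5, 0, 4}` passes the check, while `{2, 5, 0, 4}` does not because element 2 reads the upper half of the first operand.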

LGTM. The compiler will have more opportunity to use cheaper shuffles if part of the (sub)vector is undef, so this looks like a general improvement.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
17640 ↗	(On Diff #194598)	When reading through DAGCombiner code I usually appreciate the "X -> Y" comments that provide an overview of transforms so you can easily see what case is optimised without having to look at further function definitions/doxygen. But that only makes sense if the function handles a single transform. So I agree that if the scope of this function widens, it makes little sense to copy this comment to the call-site.

This revision is now accepted and ready to land. Apr 12 2019, 6:27 AM

In D60545#1464207, @sdesmalen wrote:

LGTM. The compiler will have more opportunity to use cheaper shuffles if part of the (sub)vector is undef, so this looks like a general improvement.

Thanks!

Re: code comment organization - I see your point, and I'll take that as something to be more aware of in general...I've been hacking away in here long enough that it's hard to remember the shock of seeing this mess for the 1st time. :)

Closed by commit rL358291: [DAGCombiner] narrow shuffle of concatenated vectors (authored by spatel). Apr 12 2019, 9:30 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	50 lines
llvm/trunk/test/CodeGen/ARM/vuzp.ll	28 lines
llvm/trunk/test/CodeGen/ARM/vzip.ll	4 lines
llvm/trunk/test/CodeGen/X86/mulvi32.ll	5 lines
llvm/trunk/test/CodeGen/X86/oddshuffles.ll	47 lines

Diff 194905

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[...]	if (SDValue NarrowBOp = narrowExtractedVectorBinOp(N, DAG))
return NarrowBOp;		return NarrowBOp;

if (SimplifyDemandedVectorElts(SDValue(N, 0)))		if (SimplifyDemandedVectorElts(SDValue(N, 0)))
return SDValue(N, 0);		return SDValue(N, 0);

return SDValue();		return SDValue();
}		}

		/// Try to convert a wide shuffle of concatenated vectors into 2 narrow shuffles
		/// followed by concatenation. Narrow vector ops may have better performance
		/// than wide ops, and this can unlock further narrowing of other vector ops.
		/// Targets can invert this transform later if it is not profitable.
		static SDValue foldShuffleOfConcatUndefs(ShuffleVectorSDNode *Shuf,
		SelectionDAG &DAG) {
		SDValue N0 = Shuf->getOperand(0), N1 = Shuf->getOperand(1);
		if (N0.getOpcode() != ISD::CONCAT_VECTORS || N0.getNumOperands() != 2 ||
		N1.getOpcode() != ISD::CONCAT_VECTORS || N1.getNumOperands() != 2 ||
		!N0.getOperand(1).isUndef() || !N1.getOperand(1).isUndef())
		return SDValue();

		// Split the wide shuffle mask into halves. Any mask element that is accessing
		// operand 1 is offset down to account for narrowing of the vectors.
		ArrayRef<int> Mask = Shuf->getMask();
		EVT VT = Shuf->getValueType(0);
		unsigned NumElts = VT.getVectorNumElements();
		unsigned HalfNumElts = NumElts / 2;
		SmallVector<int, 16> Mask0(HalfNumElts, -1);
		SmallVector<int, 16> Mask1(HalfNumElts, -1);
		for (unsigned i = 0; i != NumElts; ++i) {
		if (Mask[i] == -1)
		continue;
		int M = Mask[i] < (int)NumElts ? Mask[i] : Mask[i] - (int)HalfNumElts;
		if (i < HalfNumElts)
		Mask0[i] = M;
		else
		Mask1[i - HalfNumElts] = M;
		}

		// Ask the target if this is a valid transform.
		const TargetLowering &TLI = DAG.getTargetLoweringInfo();
		EVT HalfVT = EVT::getVectorVT(*DAG.getContext(), VT.getScalarType(),
		HalfNumElts);
		if (!TLI.isShuffleMaskLegal(Mask0, HalfVT) ||
		!TLI.isShuffleMaskLegal(Mask1, HalfVT))
		return SDValue();

		// shuffle (concat X, undef), (concat Y, undef), Mask -->
		// concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)
		SDValue X = N0.getOperand(0), Y = N1.getOperand(0);
		SDLoc DL(Shuf);
		SDValue Shuf0 = DAG.getVectorShuffle(HalfVT, DL, X, Y, Mask0);
		SDValue Shuf1 = DAG.getVectorShuffle(HalfVT, DL, X, Y, Mask1);
		return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Shuf0, Shuf1);
		}

// Tries to turn a shuffle of two CONCAT_VECTORS into a single concat,		// Tries to turn a shuffle of two CONCAT_VECTORS into a single concat,
// or turn a shuffle of a single concat into simpler shuffle then concat.		// or turn a shuffle of a single concat into simpler shuffle then concat.
static SDValue partitionShuffleOfConcats(SDNode *N, SelectionDAG &DAG) {		static SDValue partitionShuffleOfConcats(SDNode *N, SelectionDAG &DAG) {
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();

SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
[...]	if (N0.getOpcode() == ISD::VECTOR_SHUFFLE && N->isOnlyUserOf(N0.getNode()) &&
}		}

// shuffle(shuffle(A, B, M0), C, M1) -> shuffle(A, B, M2)		// shuffle(shuffle(A, B, M0), C, M1) -> shuffle(A, B, M2)
// shuffle(shuffle(A, B, M0), C, M1) -> shuffle(A, C, M2)		// shuffle(shuffle(A, B, M0), C, M1) -> shuffle(A, C, M2)
// shuffle(shuffle(A, B, M0), C, M1) -> shuffle(B, C, M2)		// shuffle(shuffle(A, B, M0), C, M1) -> shuffle(B, C, M2)
return DAG.getVectorShuffle(VT, SDLoc(N), SV0, SV1, Mask);		return DAG.getVectorShuffle(VT, SDLoc(N), SV0, SV1, Mask);
}		}

		if (SDValue V = foldShuffleOfConcatUndefs(SVN, DAG))
		return V;

return SDValue();		return SDValue();
}		}

SDValue DAGCombiner::visitSCALAR_TO_VECTOR(SDNode *N) {		SDValue DAGCombiner::visitSCALAR_TO_VECTOR(SDNode *N) {
SDValue InVal = N->getOperand(0);		SDValue InVal = N->getOperand(0);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

// Replace a SCALAR_TO_VECTOR(EXTRACT_VECTOR_ELT(V,C0)) pattern		// Replace a SCALAR_TO_VECTOR(EXTRACT_VECTOR_ELT(V,C0)) pattern
[...]

llvm/trunk/test/CodeGen/ARM/vuzp.ll

[...]	; CHECK-NEXT: mov pc, lr
%tmp3 = shufflevector <8 x i16> %tmp1, <8 x i16> %tmp2, <16 x i32> <i32 0, i32 undef, i32 4, i32 undef, i32 8, i32 10, i32 12, i32 14, i32 1, i32 3, i32 5, i32 undef, i32 undef, i32 11, i32 13, i32 15>		%tmp3 = shufflevector <8 x i16> %tmp1, <8 x i16> %tmp2, <16 x i32> <i32 0, i32 undef, i32 4, i32 undef, i32 8, i32 10, i32 12, i32 14, i32 1, i32 3, i32 5, i32 undef, i32 undef, i32 11, i32 13, i32 15>
ret <16 x i16> %tmp3		ret <16 x i16> %tmp3
}		}

define <8 x i16> @vuzp_lower_shufflemask_undef(<4 x i16>* %A, <4 x i16>* %B) {		define <8 x i16> @vuzp_lower_shufflemask_undef(<4 x i16>* %A, <4 x i16>* %B) {
; CHECK-LABEL: vuzp_lower_shufflemask_undef:		; CHECK-LABEL: vuzp_lower_shufflemask_undef:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vldr d17, [r1]		; CHECK-NEXT: vldr d17, [r1]
; CHECK-NEXT: vldr d16, [r0]		; CHECK-NEXT: vldr d18, [r0]
; CHECK-NEXT: vorr q9, q8, q8		; CHECK-NEXT: vuzp.16 d18, d17
; CHECK-NEXT: vuzp.16 q8, q9		; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r0, r1, d18		; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: vmov r2, r3, d19
; CHECK-NEXT: mov pc, lr		; CHECK-NEXT: mov pc, lr
entry:		entry:
%tmp1 = load <4 x i16>, <4 x i16>* %A		%tmp1 = load <4 x i16>, <4 x i16>* %A
%tmp2 = load <4 x i16>, <4 x i16>* %B		%tmp2 = load <4 x i16>, <4 x i16>* %B
%0 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 3, i32 5, i32 7>		%0 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 1, i32 3, i32 5, i32 7>
ret <8 x i16> %0		ret <8 x i16> %0
}		}

define <4 x i32> @vuzp_lower_shufflemask_zeroed(<2 x i32>* %A, <2 x i32>* %B) {		define <4 x i32> @vuzp_lower_shufflemask_zeroed(<2 x i32>* %A, <2 x i32>* %B) {
; CHECK-LABEL: vuzp_lower_shufflemask_zeroed:		; CHECK-LABEL: vuzp_lower_shufflemask_zeroed:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
		; CHECK-NEXT: vldr d18, [r0]
		; CHECK-NEXT: vorr d19, d18, d18
; CHECK-NEXT: vldr d17, [r1]		; CHECK-NEXT: vldr d17, [r1]
; CHECK-NEXT: vldr d16, [r0]		; CHECK-NEXT: vtrn.32 d19, d17
; CHECK-NEXT: vdup.32 q9, d16[0]		; CHECK-NEXT: vdup.32 d16, d18[0]
; CHECK-NEXT: vuzp.32 q8, q9
; CHECK-NEXT: vext.32 q8, q9, q9, #2
; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17		; CHECK-NEXT: vmov r2, r3, d17
		; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: mov pc, lr		; CHECK-NEXT: mov pc, lr
entry:		entry:
%tmp1 = load <2 x i32>, <2 x i32>* %A		%tmp1 = load <2 x i32>, <2 x i32>* %A
%tmp2 = load <2 x i32>, <2 x i32>* %B		%tmp2 = load <2 x i32>, <2 x i32>* %B
%0 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp2, <4 x i32> <i32 0, i32 0, i32 1, i32 3>		%0 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp2, <4 x i32> <i32 0, i32 0, i32 1, i32 3>
ret <4 x i32> %0		ret <4 x i32> %0
}		}

define void @vuzp_rev_shufflemask_vtrn(<2 x i32>* %A, <2 x i32>* %B, <4 x i32>* %C) {		define void @vuzp_rev_shufflemask_vtrn(<2 x i32>* %A, <2 x i32>* %B, <4 x i32>* %C) {
; CHECK-LABEL: vuzp_rev_shufflemask_vtrn:		; CHECK-LABEL: vuzp_rev_shufflemask_vtrn:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vldr d17, [r1]		; CHECK-NEXT: vldr d16, [r1]
; CHECK-NEXT: vldr d16, [r0]		; CHECK-NEXT: vldr d17, [r0]
; CHECK-NEXT: vrev64.32 q9, q8		; CHECK-NEXT: vtrn.32 d17, d16
; CHECK-NEXT: vuzp.32 q8, q9		; CHECK-NEXT: vst1.64 {d16, d17}, [r2]
; CHECK-NEXT: vst1.64 {d18, d19}, [r2]
; CHECK-NEXT: mov pc, lr		; CHECK-NEXT: mov pc, lr
entry:		entry:
%tmp1 = load <2 x i32>, <2 x i32>* %A		%tmp1 = load <2 x i32>, <2 x i32>* %A
%tmp2 = load <2 x i32>, <2 x i32>* %B		%tmp2 = load <2 x i32>, <2 x i32>* %B
%0 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp2, <4 x i32> <i32 1, i32 3, i32 0, i32 2>		%0 = shufflevector <2 x i32> %tmp1, <2 x i32> %tmp2, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
store <4 x i32> %0, <4 x i32>* %C		store <4 x i32> %0, <4 x i32>* %C
ret void		ret void
}		}
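The vuzp_rev_shufflemask_vtrn test above can be worked through on concrete data. The sketch below is a standalone model rather than LLVM code, and all of its names (`applyShuffle`, `concatVectors`, `wideForm`, `narrowForm`, `Undef`) are hypothetical. It assumes the <2 x i32> inputs reach the combiner widened to four elements by concatenation with undef, which would turn the IR mask <1, 3, 0, 2> into the wide mask <1, 5, 0, 4>.

```cpp
#include <cassert>
#include <vector>

using Vec = std::vector<int>;

// Sentinel for an undef lane; the example data must not use this value.
constexpr int Undef = -1;

// Model of a two-input shuffle: Result[i] = (A ++ B)[Mask[i]],
// with a mask element of -1 producing an undef lane.
Vec applyShuffle(const Vec &A, const Vec &B, const Vec &Mask) {
  assert(A.size() == B.size() && "inputs must have the same width");
  const int N = static_cast<int>(A.size());
  Vec Result;
  for (int M : Mask) {
    assert(M >= -1 && M < 2 * N && "mask element out of range");
    if (M == -1)
      Result.push_back(Undef);
    else
      Result.push_back(M < N ? A[M] : B[M - N]);
  }
  return Result;
}

Vec concatVectors(const Vec &Lo, const Vec &Hi) {
  Vec Result(Lo);
  Result.insert(Result.end(), Hi.begin(), Hi.end());
  return Result;
}

// Before the fold: one wide shuffle of the undef-padded inputs.
Vec wideForm(const Vec &X, const Vec &Y) {
  const Vec Pad(X.size(), Undef);
  return applyShuffle(concatVectors(X, Pad), concatVectors(Y, Pad),
                      {1, 5, 0, 4});
}

// After the fold: two narrow shuffles of X and Y, then a concat.
Vec narrowForm(const Vec &X, const Vec &Y) {
  return concatVectors(applyShuffle(X, Y, {1, 3}), applyShuffle(X, Y, {0, 2}));
}
```

With X = {10, 11} and Y = {20, 21}, both forms produce {11, 21, 10, 20}. The two narrow halves are {X[1], Y[1]} and {X[0], Y[0]}, which appears consistent with the single vtrn.32 on d registers in the updated CHECK lines.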
[...]

llvm/trunk/test/CodeGen/ARM/vzip.ll

[...]	; CHECK-NEXT: mov pc, lr
%tmp3 = shufflevector <16 x i8> %tmp1, <16 x i8> %tmp2, <32 x i32> <i32 0, i32 16, i32 1, i32 undef, i32 undef, i32 undef, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 undef, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 undef, i32 14, i32 30, i32 undef, i32 31>		%tmp3 = shufflevector <16 x i8> %tmp1, <16 x i8> %tmp2, <32 x i32> <i32 0, i32 16, i32 1, i32 undef, i32 undef, i32 undef, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 undef, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 undef, i32 14, i32 30, i32 undef, i32 31>
ret <32 x i8> %tmp3		ret <32 x i8> %tmp3
}		}

define <8 x i16> @vzip_lower_shufflemask_undef(<4 x i16>* %A, <4 x i16>* %B) {		define <8 x i16> @vzip_lower_shufflemask_undef(<4 x i16>* %A, <4 x i16>* %B) {
; CHECK-LABEL: vzip_lower_shufflemask_undef:		; CHECK-LABEL: vzip_lower_shufflemask_undef:
; CHECK: @ %bb.0: @ %entry		; CHECK: @ %bb.0: @ %entry
; CHECK-NEXT: vldr d17, [r1]		; CHECK-NEXT: vldr d17, [r1]
; CHECK-NEXT: vldr d16, [r0]		; CHECK-NEXT: vldr d18, [r0]
; CHECK-NEXT: vzip.16 d16, d17		; CHECK-NEXT: vzip.16 d18, d17
; CHECK-NEXT: vmov r0, r1, d16		; CHECK-NEXT: vmov r0, r1, d16
; CHECK-NEXT: vmov r2, r3, d17		; CHECK-NEXT: vmov r2, r3, d17
; CHECK-NEXT: mov pc, lr		; CHECK-NEXT: mov pc, lr
entry:		entry:
%tmp1 = load <4 x i16>, <4 x i16>* %A		%tmp1 = load <4 x i16>, <4 x i16>* %A
%tmp2 = load <4 x i16>, <4 x i16>* %B		%tmp2 = load <4 x i16>, <4 x i16>* %B
%0 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 2, i32 6, i32 3, i32 7>		%0 = shufflevector <4 x i16> %tmp1, <4 x i16> %tmp2, <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 2, i32 6, i32 3, i32 7>
ret <8 x i16> %0		ret <8 x i16> %0
[...]

llvm/trunk/test/CodeGen/X86/mulvi32.ll

	[...]
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: _mul4xi32toi64b:			; AVX2-LABEL: _mul4xi32toi64b:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vpmuludq %xmm1, %xmm0, %xmm2			; AVX2-NEXT: vpmuludq %xmm1, %xmm0, %xmm2
	; AVX2-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]			; AVX2-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[1,1,3,3]
	; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]			; AVX2-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[1,1,3,3]
	; AVX2-NEXT: vpmuludq %xmm1, %xmm0, %xmm0			; AVX2-NEXT: vpmuludq %xmm1, %xmm0, %xmm0
	; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm2, %ymm0			; AVX2-NEXT: vpunpckhqdq {{.*#+}} xmm1 = xmm2[1],xmm0[1]
	; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,2,1,3]			; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm2[0],xmm0[0]
				; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%even0 = shufflevector <4 x i32> %0, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 2, i32 undef>			%even0 = shufflevector <4 x i32> %0, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 2, i32 undef>
	%even1 = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 2, i32 undef>			%even1 = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 0, i32 undef, i32 2, i32 undef>
	%evenMul = call <2 x i64> @llvm.x86.sse2.pmulu.dq(<4 x i32> %even0, <4 x i32> %even1) readnone			%evenMul = call <2 x i64> @llvm.x86.sse2.pmulu.dq(<4 x i32> %even0, <4 x i32> %even1) readnone
	%odd0 = shufflevector <4 x i32> %0, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 3, i32 undef>			%odd0 = shufflevector <4 x i32> %0, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 3, i32 undef>
	%odd1 = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 3, i32 undef>			%odd1 = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 3, i32 undef>
	%oddMul = call <2 x i64> @llvm.x86.sse2.pmulu.dq(<4 x i32> %odd0, <4 x i32> %odd1) readnone			%oddMul = call <2 x i64> @llvm.x86.sse2.pmulu.dq(<4 x i32> %odd0, <4 x i32> %odd1) readnone
	%r = shufflevector <2 x i64> %evenMul, <2 x i64> %oddMul, <4 x i32> <i32 0, i32 2, i32 1, i32 3>			%r = shufflevector <2 x i64> %evenMul, <2 x i64> %oddMul, <4 x i32> <i32 0, i32 2, i32 1, i32 3>
	[...]

llvm/trunk/test/CodeGen/X86/oddshuffles.ll

	[...]
	; SSE42-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,2,3,2]			; SSE42-NEXT: pshufd {{.*#+}} xmm2 = xmm2[0,2,3,2]
	; SSE42-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]			; SSE42-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
	; SSE42-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]			; SSE42-NEXT: pshufd {{.*#+}} xmm0 = xmm0[1,3,2,3]
	; SSE42-NEXT: movd %xmm1, 24(%rdi)			; SSE42-NEXT: movd %xmm1, 24(%rdi)
	; SSE42-NEXT: movq %xmm0, 16(%rdi)			; SSE42-NEXT: movq %xmm0, 16(%rdi)
	; SSE42-NEXT: movdqa %xmm2, (%rdi)			; SSE42-NEXT: movdqa %xmm2, (%rdi)
	; SSE42-NEXT: retq			; SSE42-NEXT: retq
	;			;
	; AVX1-LABEL: v7i32:			; AVX-LABEL: v7i32:
	; AVX1: # %bb.0:			; AVX: # %bb.0:
	; AVX1-NEXT: vblendps {{.*#+}} xmm2 = xmm0[0,1],xmm1[2],xmm0[3]			; AVX-NEXT: vblendps {{.*#+}} xmm2 = xmm0[0,1],xmm1[2],xmm0[3]
	; AVX1-NEXT: vpermilps {{.*#+}} xmm2 = xmm2[0,2,3,2]			; AVX-NEXT: vpermilps {{.*#+}} xmm2 = xmm2[0,2,3,2]
	; AVX1-NEXT: vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1],xmm1[2,3]			; AVX-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3]
	; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[1,3,0,3]			; AVX-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[1,3,2,3]
	; AVX1-NEXT: vmovss %xmm1, 24(%rdi)			; AVX-NEXT: vmovss %xmm1, 24(%rdi)
	; AVX1-NEXT: vmovlps %xmm0, 16(%rdi)			; AVX-NEXT: vmovlps %xmm0, 16(%rdi)
	; AVX1-NEXT: vmovaps %xmm2, (%rdi)			; AVX-NEXT: vmovaps %xmm2, (%rdi)
	; AVX1-NEXT: retq			; AVX-NEXT: retq
	;
	; AVX2-LABEL: v7i32:
	; AVX2: # %bb.0:
	; AVX2-NEXT: # kill: def $xmm0 killed $xmm0 def $ymm0
	; AVX2-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX2-NEXT: vmovaps {{.*#+}} ymm2 = <0,6,3,6,1,7,4,u>
	; AVX2-NEXT: vpermps %ymm0, %ymm2, %ymm0
	; AVX2-NEXT: vmovss %xmm1, 24(%rdi)
	; AVX2-NEXT: vextractf128 $1, %ymm0, %xmm1
	; AVX2-NEXT: vmovlps %xmm1, 16(%rdi)
	; AVX2-NEXT: vmovaps %xmm0, (%rdi)
	; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq
	;			;
	; XOP-LABEL: v7i32:			; XOP-LABEL: v7i32:
	; XOP: # %bb.0:			; XOP: # %bb.0:
	; XOP-NEXT: vblendps {{.*#+}} xmm2 = xmm0[0,1],xmm1[2],xmm0[3]			; XOP-NEXT: vblendps {{.*#+}} xmm2 = xmm0[0,1],xmm1[2],xmm0[3]
	; XOP-NEXT: vpermilps {{.*#+}} xmm2 = xmm2[0,2,3,2]			; XOP-NEXT: vpermilps {{.*#+}} xmm2 = xmm2[0,2,3,2]
	; XOP-NEXT: vblendps {{.*#+}} xmm0 = xmm1[0],xmm0[1],xmm1[2,3]			; XOP-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3]
	; XOP-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[1,3,0,3]			; XOP-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[1,3,2,3]
	; XOP-NEXT: vmovss %xmm1, 24(%rdi)			; XOP-NEXT: vmovss %xmm1, 24(%rdi)
	; XOP-NEXT: vmovlps %xmm0, 16(%rdi)			; XOP-NEXT: vmovlps %xmm0, 16(%rdi)
	; XOP-NEXT: vmovaps %xmm2, (%rdi)			; XOP-NEXT: vmovaps %xmm2, (%rdi)
	; XOP-NEXT: retq			; XOP-NEXT: retq
	%r = shufflevector <4 x i32> %a, <4 x i32> %b, <7 x i32> <i32 0, i32 6, i32 3, i32 6, i32 1, i32 7, i32 4>			%r = shufflevector <4 x i32> %a, <4 x i32> %b, <7 x i32> <i32 0, i32 6, i32 3, i32 6, i32 1, i32 7, i32 4>
	store <7 x i32> %r, <7 x i32>* %p			store <7 x i32> %r, <7 x i32>* %p
	ret void			ret void
	}			}
	[...]
	; SSE2-NEXT: pshufhw {{.*#+}} xmm4 = xmm0[0,1,2,3,6,5,4,7]			; SSE2-NEXT: pshufhw {{.*#+}} xmm4 = xmm0[0,1,2,3,6,5,4,7]
	; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,3,2,1]			; SSE2-NEXT: pshufd {{.*#+}} xmm4 = xmm4[0,3,2,1]
	; SSE2-NEXT: pshuflw {{.*#+}} xmm4 = xmm4[0,2,2,1,4,5,6,7]			; SSE2-NEXT: pshuflw {{.*#+}} xmm4 = xmm4[0,2,2,1,4,5,6,7]
	; SSE2-NEXT: pshufhw {{.*#+}} xmm4 = xmm4[0,1,2,3,5,5,6,4]			; SSE2-NEXT: pshufhw {{.*#+}} xmm4 = xmm4[0,1,2,3,5,5,6,4]
	; SSE2-NEXT: pand %xmm3, %xmm4			; SSE2-NEXT: pand %xmm3, %xmm4
	; SSE2-NEXT: pandn %xmm2, %xmm3			; SSE2-NEXT: pandn %xmm2, %xmm3
	; SSE2-NEXT: por %xmm4, %xmm3			; SSE2-NEXT: por %xmm4, %xmm3
	; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,2,3]			; SSE2-NEXT: pshufd {{.*#+}} xmm1 = xmm1[1,1,2,3]
	; SSE2-NEXT: movdqa {{.*#+}} xmm2 = [65535,0,0,65535,65535,65535,65535,65535]			; SSE2-NEXT: movdqa {{.*#+}} xmm2 = [0,65535,65535,0,65535,65535,65535,65535]
	; SSE2-NEXT: pand %xmm2, %xmm1
	; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[3,1,2,3]			; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[3,1,2,3]
	; SSE2-NEXT: pshuflw {{.*#+}} xmm0 = xmm0[0,3,1,3,4,5,6,7]			; SSE2-NEXT: pshuflw {{.*#+}} xmm0 = xmm0[0,3,1,3,4,5,6,7]
	; SSE2-NEXT: pandn %xmm0, %xmm2			; SSE2-NEXT: pand %xmm2, %xmm0
	; SSE2-NEXT: por %xmm1, %xmm2			; SSE2-NEXT: pandn %xmm1, %xmm2
				; SSE2-NEXT: por %xmm0, %xmm2
	; SSE2-NEXT: movq %xmm2, 16(%rdi)			; SSE2-NEXT: movq %xmm2, 16(%rdi)
	; SSE2-NEXT: movdqa %xmm3, (%rdi)			; SSE2-NEXT: movdqa %xmm3, (%rdi)
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSE42-LABEL: v12i16:			; SSE42-LABEL: v12i16:
	; SSE42: # %bb.0:			; SSE42: # %bb.0:
	; SSE42-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,2,3]			; SSE42-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,1,2,3]
	; SSE42-NEXT: pshufd {{.*#+}} xmm3 = xmm0[3,1,2,3]			; SSE42-NEXT: pshufd {{.*#+}} xmm3 = xmm0[3,1,2,3]
	; SSE42-NEXT: pshuflw {{.*#+}} xmm3 = xmm3[0,3,1,3,4,5,6,7]			; SSE42-NEXT: pshuflw {{.*#+}} xmm3 = xmm3[0,3,1,3,4,5,6,7]
	; SSE42-NEXT: pblendw {{.*#+}} xmm3 = xmm2[0],xmm3[1,2],xmm2[3,4,5,6,7]			; SSE42-NEXT: pblendw {{.*#+}} xmm3 = xmm2[0],xmm3[1,2],xmm2[3],xmm3[4,5,6,7]
	; SSE42-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,0,0,3]			; SSE42-NEXT: pshufd {{.*#+}} xmm1 = xmm1[0,0,0,3]
	; SSE42-NEXT: pshufb {{.*#+}} xmm0 = xmm0[0,1,8,9,8,9,2,3,10,11,10,11,4,5,12,13]			; SSE42-NEXT: pshufb {{.*#+}} xmm0 = xmm0[0,1,8,9,8,9,2,3,10,11,10,11,4,5,12,13]
	; SSE42-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2],xmm0[3,4],xmm1[5],xmm0[6,7]			; SSE42-NEXT: pblendw {{.*#+}} xmm0 = xmm0[0,1],xmm1[2],xmm0[3,4],xmm1[5],xmm0[6,7]
	; SSE42-NEXT: movdqa %xmm0, (%rdi)			; SSE42-NEXT: movdqa %xmm0, (%rdi)
	; SSE42-NEXT: movq %xmm3, 16(%rdi)			; SSE42-NEXT: movq %xmm3, 16(%rdi)
	; SSE42-NEXT: retq			; SSE42-NEXT: retq
	;			;
	; AVX1-LABEL: v12i16:			; AVX1-LABEL: v12i16:
	[...]