This is an archive of the discontinued LLVM Phabricator instance.

Differential D19661

[X86] Also try to zero elts when lowering v32i8 shuffle with PSHUFB.
ClosedPublic

Authored by ab on Apr 28 2016, 7:29 AM.


Details

Reviewers

spatel
RKSimon

Commits

rGa3dc1ba142c5: [X86] Try to zero elts when lowering 256-bit shuffle with PSHUFB.
rL271113: [X86] Try to zero elts when lowering 256-bit shuffle with PSHUFB.

Summary

Otherwise we fall back to pshufb+blend later on.

We could generalize lowerVectorShuffleAsPSHUFB to ymm, but when I tried that was less readable than doing it inline.

Diff Detail

Repository: rL LLVM

Event Timeline

ab updated this revision to Diff 55413. Apr 28 2016, 7:29 AM

ab retitled this revision to "[X86] Also try to zero elts when lowering v32i8 shuffle with PSHUFB."

ab updated this object.

ab added reviewers: spatel, RKSimon.

ab added a subscriber: llvm-commits.

break early when the shuffle isn't single-input, even with zeroing.

spatel added inline comments. Apr 28 2016, 8:04 AM

lib/Target/X86/X86ISelLowering.cpp
11427–11436 ↗	(On Diff #55413)	Would it make sense to enhance isSingleInputShuffleMask() like: bool isSingleInputShuffleMask(ArrayRef<int> Mask, bool OrZeroes = false) {} and then optionally check for zeroes there? It seems like vperm2* and maybe some other instructions would then be able to use the same helper.
11445 ↗	(On Diff #55413)	Please add a comment here to describe the pshufb encoding. Something like: If the most-significant-bit of any byte of the PSHUFB mask is set, the destination byte is zeroed. I hate having to look back at the manual. :)
11446 ↗	(On Diff #55413)	Use SM_SentinelUndef?
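
To make the encoding requested in the comment on line 11445 concrete, here is a minimal standalone sketch that emulates PSHUFB on one 128-bit lane. The function name `pshufbLane` is hypothetical and not part of LLVM; it only models the instruction's per-byte rule, which the 256-bit form applies to each lane independently.

```cpp
#include <array>
#include <cstdint>

// Emulates PSHUFB on a single 128-bit lane (16 bytes).
// `pshufbLane` is a hypothetical helper used only for illustration.
std::array<uint8_t, 16> pshufbLane(const std::array<uint8_t, 16> &Src,
                                   const std::array<uint8_t, 16> &Mask) {
  std::array<uint8_t, 16> Dst{};
  for (int i = 0; i < 16; ++i) {
    if (Mask[i] & 0x80)
      Dst[i] = 0;                   // MSB set: the destination byte is zeroed.
    else
      Dst[i] = Src[Mask[i] & 0x0F]; // Otherwise the low 4 bits select a source byte.
  }
  return Dst;
}
```

This is why a shuffle that only uses the second operand for zero elements can still be a single PSHUFB: the zero bytes come from the mask, not from another register.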

We could generalize lowerVectorShuffleAsPSHUFB to ymm, but when I tried that was less readable than doing it inline.

How much did you have to do to alter lowerVectorShuffleAsPSHUFB? I'd have expected it to end up quite similar to the implementation in combineX86ShuffleChain which I thought was quite straightforward.

lib/Target/X86/X86ISelLowering.cpp
11448 ↗	(On Diff #55414)	Convention for lowering is to accept any negative value as undefined - it's only later on that we make use of sentinel constants.

In D19661#415401, @RKSimon wrote:

We could generalize lowerVectorShuffleAsPSHUFB to ymm, but when I tried that was less readable than doing it inline.

How much did you have to do to alter lowerVectorShuffleAsPSHUFB? I'd have expected it to end up quite similar to the implementation in combineX86ShuffleChain which I thought was quite straightforward.

Sorry: it's not ymm that's the problem, it's that lowerVectorShuffleAsPSHUFB should really be lowerVectorShuffleAsBlendOfPSHUFBs; if you don't want the final blend, there's not that much you can reuse.
It's also tricky to extract the PSHUFB creation out of this hypothetical lowerVectorShuffleAsBlendOfPSHUFB. It wants a PSHUFB that ignores the lanes from the other operand. But we want a PSHUFB if there are no lanes from the other operand.

I'll give it another think; maybe we can do better.

lib/Target/X86/X86ISelLowering.cpp
11427–11445 ↗	(On Diff #55414)	I wanted to do that at first, but you need to look at the operands, and at that point you basically duplicated computeZeroableShuffleElements. An alternative I considered was to do some kind of computeZeroableShuffleMask(Mask, V1, V2, NewMask), which returns a mask with SM_SentinelZero and SM_SentinelUndef. Then, users can check that it isSingleInputShuffleMask and do their thing. WDYT?
11447–11455 ↗	(On Diff #55414)	Heh, fair enough, will do!
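
The computeZeroableShuffleMask idea floated above could be sketched roughly as follows, outside of LLVM. The names `toZeroableMask`, `isSingleInput`, `kUndef`, and `kZero` are hypothetical, and the per-element known-zero flags are an assumed stand-in for what computeZeroableShuffleElements derives by inspecting the operands.

```cpp
#include <vector>

// Hypothetical sentinels, mirroring SM_SentinelUndef / SM_SentinelZero.
constexpr int kUndef = -1;
constexpr int kZero = -2;

// Replace mask elements that read a known-zero source element with kZero.
// V1Zero[j] / V2Zero[j] are assumed inputs saying whether element j of each
// operand is known to be zero.
std::vector<int> toZeroableMask(std::vector<int> Mask,
                                const std::vector<bool> &V1Zero,
                                const std::vector<bool> &V2Zero) {
  const int Size = static_cast<int>(Mask.size());
  for (int &M : Mask) {
    if (M < 0)
      continue; // Undef (or already-sentinel) elements are left alone.
    const bool IsZero = M < Size ? V1Zero[M] : V2Zero[M - Size];
    if (IsZero)
      M = kZero;
  }
  return Mask;
}

// True if no element reads from the second operand; sentinels never do.
bool isSingleInput(const std::vector<int> &Mask) {
  const int Size = static_cast<int>(Mask.size());
  for (int M : Mask)
    if (M >= Size)
      return false;
  return true;
}
```

With this split, a caller would first rewrite the mask and then run the ordinary single-input check on the result, so the check itself needs no knowledge of the operands.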

spatel added inline comments. May 5 2016, 2:13 PM

lib/Target/X86/X86ISelLowering.cpp
11427–11445 ↗	(On Diff #55414)	That would be good - slow down the code explosion for x86 vector lowering. But I don't think we have to hold up this patch for that; a TODO comment should be ok. LGTM, but I'll let Simon give the final word because he has a much better understanding of everything going on with shuffles.

LGTM with a couple of nits:

Please can you add a comment explaining why lowerVectorShuffleAsPSHUFB can't be used?

I think we're getting to the point where we should probably just generate Zeroable as standard and pass it around the same as Mask - we're generating it a lot these days. That would also make the addition of an 'isSingleInputShuffleMaskWithZeroables' helper more straightforward. We can look into this for a future patch.

lib/Target/X86/X86ISelLowering.cpp
11448 ↗	(On Diff #55414)	Sorry I was wrong here (I missed the Zeroable case afterward) - it should read Mask[i] == SM_SentinelUndef

This revision is now accepted and ready to land. May 7 2016, 7:56 AM

I did a final review before committing, but realized I missed the v16i16 case.

Here's an alternative patch, with a generalized PSHUFB lowering, and "zeroable mask" computation. WDYT?

I went ahead and renamed lowerVectorShuffleAsPSHUFB in r271024.

Minor comments but my previous LGTM still stands!

lib/Target/X86/X86ISelLowering.cpp
7189 ↗	(On Diff #58824)	const int NumEltBytes = VT.getScalarSizeInBits() / 8;
7196 ↗	(On Diff #58824)	Can't we just pass Subtarget as an argument like we do in other cases?
7205 ↗	(On Diff #58824)	Rephrase 'Sign bit set in i8 mask means zero element'?

Closed by commit rL271113: [X86] Try to zero elts when lowering 256-bit shuffle with PSHUFB. (authored by ab). May 28 2016, 7:44 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp	101 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v16.ll	18 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v32.ll	16 lines

Diff 58889

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp


@@ ... @@ if ((V.getNumOperands() % Size) == 0) {
       Zeroable[i] = AllZeroable;
       continue;
     }
   }

   return Zeroable;
 }

+/// Mutate a shuffle mask, replacing zeroable elements with SM_SentinelZero.
+static void computeZeroableShuffleMask(MutableArrayRef<int> Mask,
+                                       SDValue V1, SDValue V2) {
+  SmallBitVector Zeroable = computeZeroableShuffleElements(Mask, V1, V2);
+  for (int i = 0, Size = Mask.size(); i < Size; ++i) {
+    if (Mask[i] != SM_SentinelUndef && Zeroable[i])
+      Mask[i] = SM_SentinelZero;
+  }
+}
+
+/// Try to lower a shuffle with a single PSHUFB of V1.
+/// This is only possible if V2 is unused (at all, or only for zero elements).
+static SDValue lowerVectorShuffleWithPSHUFB(SDLoc DL, MVT VT,
+                                            ArrayRef<int> Mask, SDValue V1,
+                                            SDValue V2,
+                                            const X86Subtarget &Subtarget,
+                                            SelectionDAG &DAG) {
+  const int NumBytes = VT.is128BitVector() ? 16 : 32;
+  const int NumEltBytes = VT.getScalarSizeInBits() / 8;
+
+  assert((Subtarget.hasSSSE3() && VT.is128BitVector()) ||
+         (Subtarget.hasAVX2() && VT.is256BitVector()));
+
+  SmallVector<int, 32> ZeroableMask(Mask.begin(), Mask.end());
+  computeZeroableShuffleMask(ZeroableMask, V1, V2);
+
+  if (!isSingleInputShuffleMask(ZeroableMask) ||
+      is128BitLaneCrossingShuffleMask(VT, Mask))
+    return SDValue();
+
+  SmallVector<SDValue, 32> PSHUFBMask(NumBytes);
+  // Sign bit set in i8 mask means zero element.
+  SDValue ZeroMask = DAG.getConstant(0x80, DL, MVT::i8);
+
+  for (int i = 0; i < NumBytes; ++i) {
+    int M = ZeroableMask[i / NumEltBytes];
+    if (M == SM_SentinelUndef) {
+      PSHUFBMask[i] = DAG.getUNDEF(MVT::i8);
+    } else if (M == SM_SentinelZero) {
+      PSHUFBMask[i] = ZeroMask;
+    } else {
+      M = M * NumEltBytes + (i % NumEltBytes);
+      M = i < 16 ? M : M - 16;
+      PSHUFBMask[i] = DAG.getConstant(M, DL, MVT::i8);
+    }
+  }
+
+  MVT I8VT = MVT::getVectorVT(MVT::i8, NumBytes);
+  return DAG.getBitcast(
+      VT, DAG.getNode(X86ISD::PSHUFB, DL, I8VT, DAG.getBitcast(I8VT, V1),
+                      DAG.getBuildVector(I8VT, DL, PSHUFBMask)));
+}
+
 // X86 has dedicated unpack instructions that can handle specific blend
 // operations: UNPCKH and UNPCKL.
 static SDValue lowerVectorShuffleWithUNPCK(SDLoc DL, MVT VT, ArrayRef<int> Mask,
                                            SDValue V1, SDValue V2,
                                            SelectionDAG &DAG) {
   int NumElts = VT.getVectorNumElements();
   int NumEltsInLane = 128 / VT.getScalarSizeInBits();
   SmallVector<int, 8> Unpckl;
@@ ... @@ if (isSingleInputShuffleMask(Mask)) {
     SmallVector<int, 8> RepeatedMask;
     if (is128BitLaneRepeatedShuffleMask(MVT::v16i16, Mask, RepeatedMask)) {
       // As this is a single-input shuffle, the repeated mask should be
       // a strictly valid v8i16 mask that we can pass through to the v8i16
       // lowering to handle even the v16 case.
       return lowerV8I16GeneralSingleInputVectorShuffle(
           DL, MVT::v16i16, V1, RepeatedMask, Subtarget, DAG);
     }
-
-    SDValue PSHUFBMask[32];
-    for (int i = 0; i < 16; ++i) {
-      if (Mask[i] == -1) {
-        PSHUFBMask[2 * i] = PSHUFBMask[2 * i + 1] = DAG.getUNDEF(MVT::i8);
-        continue;
-      }
-
-      int M = i < 8 ? Mask[i] : Mask[i] - 8;
-      assert(M >= 0 && M < 8 && "Invalid single-input mask!");
-      PSHUFBMask[2 * i] = DAG.getConstant(2 * M, DL, MVT::i8);
-      PSHUFBMask[2 * i + 1] = DAG.getConstant(2 * M + 1, DL, MVT::i8);
-    }
-    return DAG.getBitcast(
-        MVT::v16i16,
-        DAG.getNode(X86ISD::PSHUFB, DL, MVT::v32i8,
-                    DAG.getBitcast(MVT::v32i8, V1),
-                    DAG.getBuildVector(MVT::v32i8, DL, PSHUFBMask)));
   }

+  if (SDValue PSHUFB = lowerVectorShuffleWithPSHUFB(DL, MVT::v16i16, Mask, V1,
+                                                    V2, Subtarget, DAG))
+    return PSHUFB;
+
   // Try to simplify this by merging 128-bit lanes to enable a lane-based
   // shuffle.
   if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
           DL, MVT::v16i16, V1, V2, Mask, Subtarget, DAG))
     return Result;

   // Otherwise fall back on generic lowering.
@@ ... @@ if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
     return Rotate;

   // Try to create an in-lane repeating shuffle mask and then shuffle the
   // the results into the target lanes.
   if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
           DL, MVT::v32i8, V1, V2, Mask, Subtarget, DAG))
     return V;

-  if (isSingleInputShuffleMask(Mask)) {
-    // There are no generalized cross-lane shuffle operations available on i8
-    // element types.
-    if (is128BitLaneCrossingShuffleMask(MVT::v32i8, Mask))
-      return lowerVectorShuffleAsLanePermuteAndBlend(DL, MVT::v32i8, V1, V2,
-                                                     Mask, DAG);
-
-    SDValue PSHUFBMask[32];
-    for (int i = 0; i < 32; ++i)
-      PSHUFBMask[i] =
-          Mask[i] < 0
-              ? DAG.getUNDEF(MVT::i8)
-              : DAG.getConstant(Mask[i] < 16 ? Mask[i] : Mask[i] - 16, DL,
-                                MVT::i8);
-
-    return DAG.getNode(X86ISD::PSHUFB, DL, MVT::v32i8, V1,
-                       DAG.getBuildVector(MVT::v32i8, DL, PSHUFBMask));
-  }
+  // There are no generalized cross-lane shuffle operations available on i8
+  // element types.
+  if (isSingleInputShuffleMask(Mask) &&
+      is128BitLaneCrossingShuffleMask(MVT::v32i8, Mask))
+    return lowerVectorShuffleAsLanePermuteAndBlend(DL, MVT::v32i8, V1, V2, Mask,
+                                                   DAG);
+
+  if (SDValue PSHUFB = lowerVectorShuffleWithPSHUFB(DL, MVT::v32i8, Mask, V1,
+                                                    V2, Subtarget, DAG))
+    return PSHUFB;

   // Try to simplify this by merging 128-bit lanes to enable a lane-based
   // shuffle.
   if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
           DL, MVT::v32i8, V1, V2, Mask, Subtarget, DAG))
     return Result;

   // Otherwise fall back on generic lowering.
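
As a rough standalone model of the mask expansion performed by lowerVectorShuffleWithPSHUFB in this file, the sketch below turns an element-level mask into a per-byte PSHUFB mask using plain integers instead of SelectionDAG nodes. The function name `expandToByteMask` and the constants `kUndef` and `kZero` are hypothetical, and the sketch assumes the caller has already established that the mask is single-input and does not cross 128-bit lanes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical sentinels, mirroring SM_SentinelUndef / SM_SentinelZero.
constexpr int kUndef = -1;
constexpr int kZero = -2;

// Expands an element mask into a byte mask. Each result is kUndef for an
// undefined byte, 0x80 for a zeroed byte, or a lane-relative index in [0, 15].
// Assumes a single-input mask that stays within its 128-bit lane.
std::vector<int> expandToByteMask(const std::vector<int> &EltMask,
                                  int NumEltBytes) {
  const int NumBytes = static_cast<int>(EltMask.size()) * NumEltBytes;
  assert(NumBytes == 16 || NumBytes == 32);
  std::vector<int> ByteMask(NumBytes);
  for (int i = 0; i < NumBytes; ++i) {
    int M = EltMask[i / NumEltBytes];
    if (M == kUndef) {
      ByteMask[i] = kUndef;
    } else if (M == kZero) {
      ByteMask[i] = 0x80; // Sign bit set: PSHUFB writes zero.
    } else {
      // Byte offset of the selected element, plus the byte within it.
      M = M * NumEltBytes + (i % NumEltBytes);
      // Indices in the upper lane are relative to that lane.
      ByteMask[i] = i < 16 ? M : M - 16;
    }
  }
  return ByteMask;
}
```

For the v16i16 mask 1, zz, 2, zz, 4, uu, 6, 7, 8, ..., 15 with two bytes per element, the low lane comes out as 2, 3, 0x80, 0x80, 4, 5, 0x80, 0x80, 8, 9, undef, undef, 12, 13, 14, 15, which lines up with the AVX2 check line in the new v16i16 test (the assembly comment prints upper-lane bytes with absolute indices).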

llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v16.ll

@@ ... @@
 ; AVX2-LABEL: shuffle_v16i16_04_05_06_03_uu_uu_uu_uu_12_13_14_11_uu_uu_uu_uu:
 ; AVX2: # BB#0:
 ; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[8,9,10,11,12,13,6,7,8,9,10,11,0,1,2,3,24,25,26,27,28,29,22,23,24,25,26,27,16,17,18,19]
 ; AVX2-NEXT: retq
   %shuffle = shufflevector <16 x i16> %a, <16 x i16> %b, <16 x i32> <i32 4, i32 5, i32 6, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 12, i32 13, i32 14, i32 11, i32 undef, i32 undef, i32 undef, i32 undef>
   ret <16 x i16> %shuffle
 }

+define <16 x i16> @shuffle_v16i16_01_zz_02_zz_04_uu_06_07_08_09_10_11_12_13_14_15(<16 x i16> %a) {
+; AVX1-LABEL: shuffle_v16i16_01_zz_02_zz_04_uu_06_07_08_09_10_11_12_13_14_15:
+; AVX1: # BB#0:
+; AVX1-NEXT: vpshuflw {{.*#+}} xmm1 = xmm0[1,1,2,3,4,5,6,7]
+; AVX1-NEXT: vpxor %xmm2, %xmm2, %xmm2
+; AVX1-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0],xmm2[1],xmm1[2],xmm2[3],xmm1[4,5,6,7]
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
+; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT: retq
+;
+; AVX2-LABEL: shuffle_v16i16_01_zz_02_zz_04_uu_06_07_08_09_10_11_12_13_14_15:
+; AVX2: # BB#0:
+; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[2,3],zero,zero,ymm0[4,5],zero,zero,ymm0[8,9,u,u,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
+; AVX2-NEXT: retq
+  %shuffle = shufflevector <16 x i16> %a, <16 x i16> zeroinitializer, <16 x i32> <i32 1, i32 16, i32 2, i32 16, i32 4, i32 undef, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
+  ret <16 x i16> %shuffle
+}
+
 define <16 x i16> @shuffle_v16i16_00_01_02_07_04_05_06_11_08_09_10_15_12_13_14_11(<16 x i16> %a, <16 x i16> %b) {
 ; AVX1-LABEL: shuffle_v16i16_00_01_02_07_04_05_06_11_08_09_10_15_12_13_14_11:
 ; AVX1: # BB#0:
 ; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1
 ; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [0,1,2,3,4,5,14,15,8,9,10,11,12,13,6,7]
 ; AVX1-NEXT: vpshufb %xmm2, %xmm1, %xmm3
 ; AVX1-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[3],xmm0[4,5,6,7]
 ; AVX1-NEXT: vpshufb %xmm2, %xmm0, %xmm0

llvm/trunk/test/CodeGen/X86/vector-shuffle-256-v32.ll

@@ ... @@
 ; ALL-LABEL: shuffle_v32i8_zz_01_zz_03_zz_05_zz_07_zz_09_zz_11_zz_13_zz_15_zz_17_zz_19_zz_21_zz_23_zz_25_zz_27_zz_29_zz_31:
 ; ALL: # BB#0:
 ; ALL-NEXT: vandps {{.*}}(%rip), %ymm0, %ymm0
 ; ALL-NEXT: retq
   %shuffle = shufflevector <32 x i8> %a, <32 x i8> zeroinitializer, <32 x i32> <i32 32, i32 1, i32 34, i32 3, i32 36, i32 5, i32 38, i32 7, i32 40, i32 9, i32 42, i32 11, i32 44, i32 13, i32 46, i32 15, i32 48, i32 17, i32 50, i32 19, i32 52, i32 21, i32 54, i32 23, i32 56, i32 25, i32 58, i32 27, i32 60, i32 29, i32 62, i32 31>
   ret <32 x i8> %shuffle
 }

+define <32 x i8> @shuffle_v32i8_01_zz_02_zz_04_uu_06_07_08_09_10_11_12_13_14_15_u6_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31(<32 x i8> %a) {
+; AVX1-LABEL: shuffle_v32i8_01_zz_02_zz_04_uu_06_07_08_09_10_11_12_13_14_15_u6_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31:
+; AVX1: # BB#0:
+; AVX1-NEXT: vpshufb {{.*#+}} xmm1 = xmm0[1],zero,xmm0[2],zero,xmm0[4,u,6,7,8,9,10,11,12,13,14,15]
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
+; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT: retq
+;
+; AVX2-LABEL: shuffle_v32i8_01_zz_02_zz_04_uu_06_07_08_09_10_11_12_13_14_15_u6_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31:
+; AVX2: # BB#0:
+; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[1],zero,ymm0[2],zero,ymm0[4,u,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
+; AVX2-NEXT: retq
+  %shuffle = shufflevector <32 x i8> %a, <32 x i8> zeroinitializer, <32 x i32> <i32 1, i32 32, i32 2, i32 32, i32 4, i32 undef, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  ret <32 x i8> %shuffle
+}
+
 define <32 x i8> @shuffle_v32i8_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32(<32 x i8> %a, <32 x i8> %b) {
 ; AVX1-LABEL: shuffle_v32i8_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32:
 ; AVX1: # BB#0:
 ; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
 ; AVX1-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[0,0,0,0,4,5,6,7]
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
 ; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
 ; AVX1-NEXT: retq