This is an archive of the discontinued LLVM Phabricator instance.

Differential D19661

[X86] Also try to zero elts when lowering v32i8 shuffle with PSHUFB.
ClosedPublic

Authored by ab on Apr 28 2016, 7:29 AM.


Details

Reviewers

spatel
RKSimon

Commits

rGa3dc1ba142c5: [X86] Try to zero elts when lowering 256-bit shuffle with PSHUFB.
rL271113: [X86] Try to zero elts when lowering 256-bit shuffle with PSHUFB.

Summary

Otherwise we fall back to pshufb+blend later on.

We could generalize lowerVectorShuffleAsPSHUFB to ymm, but when I tried that was less readable than doing it inline.

Diff Detail

Event Timeline

ab updated this revision to Diff 55413. Apr 28 2016, 7:29 AM

ab retitled this revision to [X86] Also try to zero elts when lowering v32i8 shuffle with PSHUFB.

ab updated this object.

ab added reviewers: spatel, RKSimon.

ab added a subscriber: llvm-commits.

break early when the shuffle isn't single-input, even with zeroing.

spatel added inline comments. Apr 28 2016, 8:04 AM

lib/Target/X86/X86ISelLowering.cpp
11427–11436	Would it make sense to enhance isSingleInputShuffleMask() like: bool isSingleInputShuffleMask(ArrayRef<int> Mask, bool OrZeroes = false) {} and then optionally check for zeroes there? It seems like vperm2* and maybe some other instructions would then be able to use the same helper.
11445	Please add a comment here to describe the pshufb encoding. Something like: If the most-significant-bit of any byte of the PSHUFB mask is set, the destination byte is zeroed. I hate having to look back at the manual. :)
11446	Use SM_SentinelUndef?
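
For anyone else who would rather not look back at the manual, the encoding rule can be sketched in plain C++. This is a minimal model of the 256-bit VPSHUFB, not LLVM code: the name `emulateVPSHUFB256` and its array-based signature are made up for illustration. It shows the two facts the patch relies on: a control byte with its most significant bit set zeroes the destination byte, and otherwise the low four bits pick a byte from the same 128-bit lane.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of LLVM): models the 256-bit VPSHUFB on
// plain byte arrays. Each 128-bit lane is shuffled independently, so a
// control byte can only select from the 16 source bytes of its own lane.
std::array<uint8_t, 32> emulateVPSHUFB256(const std::array<uint8_t, 32> &Src,
                                          const std::array<uint8_t, 32> &Ctrl) {
  std::array<uint8_t, 32> Dst{};
  for (int i = 0; i < 32; ++i) {
    const int LaneBase = (i / 16) * 16;
    if (Ctrl[i] & 0x80) {
      // Most significant bit set: the destination byte is zeroed.
      Dst[i] = 0;
    } else {
      // Otherwise only the low 4 bits are used as an in-lane index.
      Dst[i] = Src[LaneBase + (Ctrl[i] & 0x0F)];
    }
  }
  return Dst;
}
```

This is why the lowering can use 0x80 as the control byte for every zeroable element, and why indices into the upper lane have 16 subtracted before they are emitted.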

We could generalize lowerVectorShuffleAsPSHUFB to ymm, but when I tried that was less readable than doing it inline.

How much did you have to do to alter lowerVectorShuffleAsPSHUFB? I'd have expected it to end up quite similar to the implementation in combineX86ShuffleChain which I thought was quite straightforward.

lib/Target/X86/X86ISelLowering.cpp
11446	Convention for lowering is to accept any negative value as undefined - it's only later on that we make use of sentinel constants.

In D19661#415401, @RKSimon wrote:

We could generalize lowerVectorShuffleAsPSHUFB to ymm, but when I tried that was less readable than doing it inline.

How much did you have to do to alter lowerVectorShuffleAsPSHUFB? I'd have expected it to end up quite similar to the implementation in combineX86ShuffleChain which I thought was quite straightforward.

Sorry: it's not ymm that's the problem, it's that lowerVectorShuffleAsPSHUFB should really be lowerVectorShuffleAsBlendOfPSHUFBs; if you don't want the final blend, there's not that much you can reuse.
It's also tricky to extract the PSHUFB creation out of this hypothetical lowerVectorShuffleAsBlendOfPSHUFB. It wants a PSHUFB that ignores the lanes from the other operand. But we want a PSHUFB if there are no lanes from the other operand.

I'll give it another think; maybe we can do better.

lib/Target/X86/X86ISelLowering.cpp
11427–11444	I wanted to do that at first, but you need to look at the operands, and at that point you've basically duplicated computeZeroableShuffleElements. An alternative I considered was to do some kind of computeZeroableShuffleMask(Mask, V1, V2, NewMask), which returns a mask with SM_SentinelZero and SM_SentinelUndef. Then, users can check that it isSingleInputShuffleMask and do their thing. WDYT?
11445–11454	Heh, fair enough, will do!
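
A rough sketch of the computeZeroableShuffleMask idea, written outside of LLVM so it stands alone. Several things here are assumptions: the operands are modeled as per-element "known zero" flags rather than SDValues, the function returns the new mask instead of filling a NewMask out-parameter, and the sentinel constants are local stand-ins named after SM_SentinelUndef and SM_SentinelZero (the values -1 and -2 are assumed). The single-input check is likewise a local stand-in for isSingleInputShuffleMask.

```cpp
#include <cassert>
#include <vector>

// Local stand-ins for SM_SentinelUndef / SM_SentinelZero (values assumed).
constexpr int SentinelUndef = -1;
constexpr int SentinelZero = -2;

// Hypothetical model of the proposed helper. V1Zero/V2Zero say, per element,
// whether that element of the corresponding operand is known to be zero.
std::vector<int> computeZeroableShuffleMask(const std::vector<int> &Mask,
                                            const std::vector<bool> &V1Zero,
                                            const std::vector<bool> &V2Zero) {
  const int Size = static_cast<int>(Mask.size());
  std::vector<int> NewMask(Mask.size());
  for (int i = 0; i < Size; ++i) {
    const int M = Mask[i];
    if (M < 0) {
      NewMask[i] = SentinelUndef; // Any negative value is treated as undef.
    } else if (M < Size ? V1Zero[M] : V2Zero[M - Size]) {
      NewMask[i] = SentinelZero;  // Source element is known to be zero.
    } else {
      NewMask[i] = M;             // Keep the original index.
    }
  }
  return NewMask;
}

// Stand-in for the single-input check: every element that is not a sentinel
// must read from the first operand.
bool isSingleInputMask(const std::vector<int> &Mask) {
  const int Size = static_cast<int>(Mask.size());
  for (int M : Mask)
    if (M >= Size)
      return false;
  return true;
}
```

With an all-zero second operand, a mask such as {1, 4, 2, 5} becomes {1, zero, 2, zero}, which then passes the single-input check without the caller having to look at Zeroable separately.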

spatel added inline comments. May 5 2016, 2:13 PM

lib/Target/X86/X86ISelLowering.cpp
11427–11444	That would be good - slow down the code explosion for x86 vector lowering. But I don't think we have to hold up this patch for that; a TODO comment should be ok. LGTM, but I'll let Simon give the final word because he has a much better understanding of everything going on with shuffles.

LGTM with a couple of nits:

Please can you add a comment explaining why lowerVectorShuffleAsPSHUFB can't be used?

I think we're getting to the point where we should probably just generate Zeroable as standard and pass it around the same as Mask - we're generating it a lot these days. That would also make the addition of a 'isSingleInputShuffleMaskWithZeroables' helper more straightforward. We can look into this for a future patch.
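
If Zeroable were passed around alongside Mask, the helper mentioned above could be as small as the following sketch. Only the name isSingleInputShuffleMaskWithZeroables comes from this review; the signature and the use of std::vector in place of ArrayRef and SmallBitVector are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <vector>

// Sketch only: a mask counts as single-input if every element that reads
// from the second operand is zeroable. Undef (negative) elements always pass.
bool isSingleInputShuffleMaskWithZeroables(const std::vector<int> &Mask,
                                           const std::vector<bool> &Zeroable) {
  const int Size = static_cast<int>(Mask.size());
  for (int i = 0; i < Size; ++i)
    if (Mask[i] >= Size && !Zeroable[i])
      return false;
  return true;
}
```

This is the same condition the patch computes inline as SingleInputAndZeroesMask.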

lib/Target/X86/X86ISelLowering.cpp
11446	Sorry I was wrong here (I missed the Zeroable case afterward) - it should read Mask[i] == SM_SentinelUndef

This revision is now accepted and ready to land. May 7 2016, 7:56 AM

I did a final review before committing, but realized I missed the v16i16 case.

Here's an alternative patch, with a generalized PSHUFB lowering, and "zeroable mask" computation. WDYT?

I went ahead and renamed lowerVectorShuffleAsPSHUFB in r271024.

Minor comments but my previous LGTM still stands!

lib/Target/X86/X86ISelLowering.cpp
7150	const int NumEltBytes = VT.getScalarSizeInBits() / 8;
7157	Can't we just pass Subtarget as an argument like we do in other cases?
7166	Rephrase 'Sign bit set in i8 mask means zero element'?
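
The NumEltBytes suggestion hints at how a generalized PSHUFB lowering can cover v16i16 as well as v32i8: each element index expands to NumEltBytes consecutive byte indices. A minimal sketch of just that scaling step follows. The function name scaleMaskToBytes and the std::vector interface are hypothetical, and negative (undef or sentinel) entries are simply copied to every byte of the element.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: expands an element-level shuffle mask into a
// byte-level mask. Element index M covers bytes [M * NumEltBytes,
// M * NumEltBytes + NumEltBytes). Negative entries are propagated as-is.
std::vector<int> scaleMaskToBytes(const std::vector<int> &Mask,
                                  int NumEltBytes) {
  std::vector<int> ByteMask;
  ByteMask.reserve(Mask.size() * NumEltBytes);
  for (int M : Mask)
    for (int b = 0; b < NumEltBytes; ++b)
      ByteMask.push_back(M < 0 ? M : M * NumEltBytes + b);
  return ByteMask;
}
```

For a v16i16 shuffle NumEltBytes is 2, so the element mask {1, -1, 0} would scale to the byte mask {2, 3, -1, -1, 0, 1}; with NumEltBytes equal to 1 the mask is returned unchanged, which is the v32i8 case.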

Closed by commit rL271113: [X86] Try to zero elts when lowering 256-bit shuffle with PSHUFB. (authored by ab). May 28 2016, 7:44 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp	37 lines
test/CodeGen/X86/vector-shuffle-256-v32.ll	16 lines

Diff 55413

lib/Target/X86/X86ISelLowering.cpp


[...]
 static SDValue lowerVectorShuffleWithUNPCK(SDLoc DL, MVT VT, ArrayRef<int> Mask,
   SmallVector<int, 8> Unpckh;

   for (int i = 0; i < NumElts; ++i) {
     unsigned LaneStart = (i / NumEltsInLane) * NumEltsInLane;
     int LoPos = (i % NumEltsInLane) / 2 + LaneStart + NumElts * (i % 2);
     int HiPos = LoPos + NumEltsInLane / 2;
     Unpckl.push_back(LoPos);
     Unpckh.push_back(HiPos);
   }

   if (isShuffleEquivalent(V1, V2, Mask, Unpckl))
     return DAG.getNode(X86ISD::UNPCKL, DL, VT, V1, V2);
   if (isShuffleEquivalent(V1, V2, Mask, Unpckh))
     return DAG.getNode(X86ISD::UNPCKH, DL, VT, V1, V2);

   // Commute and try again.
   ShuffleVectorSDNode::commuteMask(Unpckl);
   if (isShuffleEquivalent(V1, V2, Mask, Unpckl))
     return DAG.getNode(X86ISD::UNPCKL, DL, VT, V2, V1);

   ShuffleVectorSDNode::commuteMask(Unpckh);
   if (isShuffleEquivalent(V1, V2, Mask, Unpckh))
     return DAG.getNode(X86ISD::UNPCKH, DL, VT, V2, V1);

   return SDValue();
 }

 /// \brief Try to emit a bitmask instruction for a shuffle.
 ///
 /// This handles cases where we can model a blend exactly as a bitmask due to
 /// one of the inputs being zeroable.
 static SDValue lowerVectorShuffleAsBitMask(SDLoc DL, MVT VT, SDValue V1,
                                            SDValue V2, ArrayRef<int> Mask,
[...]
   if (SDValue Rotate = lowerVectorShuffleAsByteRotate(
     return Rotate;

   // Try to create an in-lane repeating shuffle mask and then shuffle the
   // the results into the target lanes.
   if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
           DL, MVT::v32i8, V1, V2, Mask, Subtarget, DAG))
     return V;

-  if (isSingleInputShuffleMask(Mask)) {
-    // There are no generalized cross-lane shuffle operations available on i8
-    // element types.
-    if (is128BitLaneCrossingShuffleMask(MVT::v32i8, Mask))
-      return lowerVectorShuffleAsLanePermuteAndBlend(DL, MVT::v32i8, V1, V2,
-                                                     Mask, DAG);
-
-    SDValue PSHUFBMask[32];
-    for (int i = 0; i < 32; ++i)
-      PSHUFBMask[i] =
-          Mask[i] < 0
-              ? DAG.getUNDEF(MVT::i8)
-              : DAG.getConstant(Mask[i] < 16 ? Mask[i] : Mask[i] - 16, DL,
-                                MVT::i8);
-
-    return DAG.getNode(X86ISD::PSHUFB, DL, MVT::v32i8, V1,
-                       DAG.getBuildVector(MVT::v32i8, DL, PSHUFBMask));
-  }
+  SmallBitVector Zeroable = computeZeroableShuffleElements(Mask, V1, V2);
+  bool SingleInputMask = true;
+  bool SingleInputAndZeroesMask = true;
+  for (int i = 0, Size = Mask.size(); i < Size; ++i) {
+    if (Mask[i] >= Size) {
+      SingleInputMask = false;
+      if (!Zeroable[i])
+        SingleInputAndZeroesMask = false;
+    }
+  }
+  // There are no generalized cross-lane shuffle operations available on i8
+  // element types.
+  if (SingleInputMask && is128BitLaneCrossingShuffleMask(MVT::v32i8, Mask))
+    return lowerVectorShuffleAsLanePermuteAndBlend(DL, MVT::v32i8, V1, V2, Mask,
+                                                   DAG);
+
+  if (SingleInputAndZeroesMask) {
+    SDValue PSHUFBMask[32];
+    for (int i = 0; i < 32; ++i) {
+      if (Mask[i] < 0)
+        PSHUFBMask[i] = DAG.getUNDEF(MVT::i8);
+      else if (Zeroable[i])
+        PSHUFBMask[i] = DAG.getConstant(0x80, DL, MVT::i8);
+      else
+        PSHUFBMask[i] =
+            DAG.getConstant(Mask[i] < 16 ? Mask[i] : Mask[i] - 16, DL, MVT::i8);
+    }
+
+    return DAG.getNode(X86ISD::PSHUFB, DL, MVT::v32i8, V1,
+                       DAG.getBuildVector(MVT::v32i8, DL, PSHUFBMask));
+  }

   // Try to simplify this by merging 128-bit lanes to enable a lane-based
   // shuffle.
   if (SDValue Result = lowerVectorShuffleByMerging128BitLanes(
           DL, MVT::v32i8, V1, V2, Mask, Subtarget, DAG))
[...]

test/CodeGen/X86/vector-shuffle-256-v32.ll

[...]
 ; ALL-LABEL: shuffle_v32i8_zz_01_zz_03_zz_05_zz_07_zz_09_zz_11_zz_13_zz_15_zz_17_zz_19_zz_21_zz_23_zz_25_zz_27_zz_29_zz_31:
 ; ALL: # BB#0:
 ; ALL-NEXT: vandps {{.*}}(%rip), %ymm0, %ymm0
 ; ALL-NEXT: retq
   %shuffle = shufflevector <32 x i8> %a, <32 x i8> zeroinitializer, <32 x i32> <i32 32, i32 1, i32 34, i32 3, i32 36, i32 5, i32 38, i32 7, i32 40, i32 9, i32 42, i32 11, i32 44, i32 13, i32 46, i32 15, i32 48, i32 17, i32 50, i32 19, i32 52, i32 21, i32 54, i32 23, i32 56, i32 25, i32 58, i32 27, i32 60, i32 29, i32 62, i32 31>
   ret <32 x i8> %shuffle
 }

+define <32 x i8> @shuffle_v32i8_01_zz_02_zz_04_05_06_07_08_09_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31(<32 x i8> %a) {
+; AVX1-LABEL: shuffle_v32i8_01_zz_02_zz_04_05_06_07_08_09_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31:
+; AVX1: # BB#0:
+; AVX1-NEXT: vpshufb {{.*#+}} xmm1 = xmm0[1],zero,xmm0[2],zero,xmm0[4,5,6,7,8,9,10,11,12,13,14,15]
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
+; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
+; AVX1-NEXT: retq
+;
+; AVX2-LABEL: shuffle_v32i8_01_zz_02_zz_04_05_06_07_08_09_10_11_12_13_14_15_16_17_18_19_20_21_22_23_24_25_26_27_28_29_30_31:
+; AVX2: # BB#0:
+; AVX2-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[1],zero,ymm0[2],zero,ymm0[4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
+; AVX2-NEXT: retq
+  %shuffle = shufflevector <32 x i8> %a, <32 x i8> zeroinitializer, <32 x i32> <i32 1, i32 32, i32 2, i32 32, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
+  ret <32 x i8> %shuffle
+}
+
 define <32 x i8> @shuffle_v32i8_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32(<32 x i8> %a, <32 x i8> %b) {
 ; AVX1-LABEL: shuffle_v32i8_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32_00_32:
 ; AVX1: # BB#0:
 ; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
 ; AVX1-NEXT: vpshuflw {{.*#+}} xmm0 = xmm0[0,0,0,0,4,5,6,7]
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,0,1,1]
 ; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
 ; AVX1-NEXT: retq
[...]
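
To connect the two files in this diff, here is a standalone sketch of the mask construction from the X86ISelLowering.cpp change, applied to the mask of the new shuffle_v32i8_01_zz_02_zz_... test. It is not the LLVM code itself: buildV32I8PSHUFBMask is a hypothetical name, plain ints replace the i8 SDValue constants, and UndefByte stands in for DAG.getUNDEF(MVT::i8). Zeroable is supplied by hand here, whereas the patch obtains it from computeZeroableShuffleElements.

```cpp
#include <cassert>
#include <vector>

constexpr int UndefByte = -1; // Stand-in for DAG.getUNDEF(MVT::i8).

// Sketch of the per-byte control computation for a v32i8 shuffle whose
// second-operand elements are all zeroable.
std::vector<int> buildV32I8PSHUFBMask(const std::vector<int> &Mask,
                                      const std::vector<bool> &Zeroable) {
  std::vector<int> Ctrl(32);
  for (int i = 0; i < 32; ++i) {
    if (Mask[i] < 0)
      Ctrl[i] = UndefByte; // Undef element: any control byte will do.
    else if (Zeroable[i])
      Ctrl[i] = 0x80;      // Sign bit set in the i8 mask zeroes the byte.
    else
      Ctrl[i] = Mask[i] < 16 ? Mask[i] : Mask[i] - 16; // In-lane index.
  }
  return Ctrl;
}
```

For the test mask <1, 32, 2, 32, 4, ..., 31> with a zeroinitializer second operand, elements 1 and 3 are zeroable, so the control bytes start 1, 0x80, 2, 0x80, and the upper lane uses indices 0 through 15. That is consistent with the single vpshufb in the AVX2 check lines.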