This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Shuffle blends with zero
Closed, Public

Authored by RKSimon on Oct 25 2015, 3:06 PM.


Details

Reviewers

spatel
qcolombet
delena
andreadb

Commits

rGca56a72af997: [X86][SSE] Shuffle blends with zero
rL251659: [X86][SSE] Shuffle blends with zero

Summary

This patch generalizes the zeroing of vector elements with the BLEND instructions. Currently, a zero vector will only blend if the shuffled elements are already in the correct lanes. This patch recognises when a vector input is zero (or zeroable) and modifies a local copy of the shuffle mask to support a blend. As a zeroable vector input may not be all zeroes, the zeroable vector is regenerated if necessary.
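The mask rewrite described in the summary can be sketched outside of LLVM. The following is a minimal standalone model, not the patch itself: `BlendPlan` and `planBlendWithZero` are hypothetical names, plain `std::vector` stands in for `ArrayRef<int>` and `SmallBitVector`, and the zeroable flags are passed in rather than computed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical result type: the blend immediate plus flags saying which
// input would have to be regenerated as a true all-zeros vector.
struct BlendPlan {
  unsigned BlendMask = 0;
  bool ForceV1Zero = false;
  bool ForceV2Zero = false;
};

// Mask follows the shufflevector convention: -1 is undef, [0, Size) picks a
// lane of V1, [Size, 2 * Size) picks a lane of V2. Zeroable[i] is true when
// result lane i is known to be zero (or undef).
// Returns false when the shuffle cannot be expressed as a blend.
bool planBlendWithZero(const std::vector<int> &Mask,
                       const std::vector<bool> &Zeroable, bool V1IsZero,
                       bool V2IsZero, BlendPlan &Plan) {
  assert(Mask.size() == Zeroable.size() && "one zeroable flag per lane");
  assert(Mask.size() <= 32 && "blend mask must fit in an unsigned");
  const int Size = static_cast<int>(Mask.size());
  Plan = BlendPlan();
  for (int i = 0; i < Size; ++i) {
    int M = Mask[i];
    if (M < 0 || M == i)
      continue; // undef lane, or lane i of V1 stays in place
    if (M == i + Size) {
      Plan.BlendMask |= 1u << i; // lane i of V2 stays in place
      continue;
    }
    if (Zeroable[i]) {
      if (V1IsZero) { // any lane of a zero V1 will do
        Plan.ForceV1Zero = true;
        continue;
      }
      if (V2IsZero) { // any lane of a zero V2 will do
        Plan.ForceV2Zero = true;
        Plan.BlendMask |= 1u << i;
        continue;
      }
    }
    return false; // element moves between lanes: not a blend
  }
  return true;
}
```

For the mask `<0, 4, 2, 4>` with a zeroable second input, this sketch selects lanes 1 and 3 from the second input and flags it for regeneration as a real zero vector.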

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 38360. Oct 25 2015, 3:06 PM

RKSimon retitled this revision from to [X86][SSE] Shuffle blends with zero.

RKSimon updated this object.

RKSimon added reviewers: spatel, andreadb, delena, qcolombet.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: llvm-commits.

In the more general case, it will work if either V1 or V2 is a vector of constants with '0' in the right place. computeZeroableShuffleElements() already checks for this when it computes the zeroable elements.

if (Zeroable[i]) {

You just need to know which input to choose. You don't need to rebuild V1 or V2.
And you **always** can define a mask for the "zeroable" element, so there is no fallthrough in this case.

If computeZeroableShuffleElements() returned not only the mask but also the input number (V1 or V2) per zeroable element, you could just use that information.

Both isBuildVectorAllZeros and computeZeroableShuffleElements() treat undef lanes as zeroable, so we have a problem when the shuffle mask wants an actual zero input but the lane that we'd need to blend from is actually UNDEF:

shufflevector <4 x float> %v, <4 x float> <float 0.000000e+00, float undef, float undef, float undef>, <4 x i32> <i32 0, i32 4, i32 2, i32 4>

which, to use BLENDPS, we'd need to convert to:

shufflevector <4 x float> %v, <4 x float> <float undef, float 0.000000e+00, float undef, float 0.000000e+00>, <4 x i32> <i32 0, i32 5, i32 2, i32 7>

But it's easier if we just set the whole input vector to zero (since we know it's zeroable anyhow).

I'll add an extra test for this example.
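To make the undef problem concrete, here is a small standalone model. It is only an illustration: the `Lane` enum and both helper functions are hypothetical and do not correspond to LLVM APIs. It shows that a permissive "zero or undef" check accepts the input above, while the rewritten in-place mask `<0, 5, 2, 7>` would read undef lanes unless the input is regenerated as all zeros.

```cpp
#include <cassert>
#include <vector>

// Hypothetical lane classification for a constant vector operand.
enum class Lane { Undef, Zero, Other };

// Permissive check in the spirit of the behaviour described above:
// undef lanes are accepted as if they were zero.
bool isAllZerosOrUndef(const std::vector<Lane> &V) {
  for (Lane L : V)
    if (L == Lane::Other)
      return false;
  return true;
}

// InPlaceMask is a shuffle mask already in blend form (entry i is either i
// or i + Size). Reports whether any lane taken from V2 is undef.
bool blendReadsUndefFromV2(const std::vector<int> &InPlaceMask,
                           const std::vector<Lane> &V2) {
  assert(InPlaceMask.size() == V2.size() && "mask and operand must match");
  const int Size = static_cast<int>(V2.size());
  for (int i = 0; i < Size; ++i)
    if (InPlaceMask[i] == i + Size && V2[i] == Lane::Undef)
      return true;
  return false;
}
```

With `V2 = {Zero, Undef, Undef, Undef}`, the permissive check passes but the blend would read undef from lanes 1 and 3; replacing `V2` with four `Zero` lanes removes the problem.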

Now, it could be that we have cases where a BUILD_VECTOR input with zero/nonzero constants could be matched up (possibly by creating a new BUILD_VECTOR with reordered constants suitable for blending). But that would be a much more involved change, and I haven't seen any real-world code that would benefit from it yet, so I just focussed on the zeroing, which I do have examples of.

Hi Simon,

Your code is fully correct. I just think that you are missing some opportunities.

I'll take your example and change one element:

shufflevector <4 x float> %v, <4 x float> <float 0.000000e+00, float undef, float undef, float 0.2>, <4 x i32> <i32 0, i32 4, i32 2, i32 7>

It is equivalent to the blend:

shufflevector <4 x float> %v, <4 x float> <float 0.000000e+00, float 0.0, float undef, float 0.2>, <4 x i32> <i32 0, i32 5, i32 2, i32 7>

Elena

I've been investigating general build vector support, and the problem I'm finding is that the BUILD_VECTOR node is usually lowered before the shuffle we're trying to match, which prevents easy matching (the constant data has often disappeared inside a LOAD, INSERT_VECTOR_ELT or something else). It should be possible to create a helper function to do this (possibly just expanding getShuffleScalarElt), but it's quite beyond the scope of what I had in mind for this patch.

RKSimon updated this revision to Diff 38739. Oct 29 2015, 7:58 AM

RKSimon edited edge metadata.

I did not mean to complicate the solution. Please commit; I'll try to check with a debugger what happens with the const vector.

This revision is now accepted and ready to land. Oct 29 2015, 9:09 AM

Closed by commit rL251659: [X86][SSE] Shuffle blends with zero (authored by RKSimon). Oct 29 2015, 3:13 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

| Path | Size |
| --- | --- |
| llvm/trunk/lib/Target/X86/X86ISelLowering.cpp | 68 lines |
| llvm/trunk/test/CodeGen/X86/avx-vperm2x128.ll | 25 lines |
| llvm/trunk/test/CodeGen/X86/vector-shuffle-128-v4.ll | 37 lines |

Diff 38769

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp


static SDValue lowerVectorShuffleAsBitBlend(SDLoc DL, MVT VT, SDValue V1,
return DAG.getNode(ISD::OR, DL, VT, V1, V2);		return DAG.getNode(ISD::OR, DL, VT, V1, V2);
}		}

/// \brief Try to emit a blend instruction for a shuffle.		/// \brief Try to emit a blend instruction for a shuffle.
///		///
/// This doesn't do any checks for the availability of instructions for blending		/// This doesn't do any checks for the availability of instructions for blending
/// these values. It relies on the availability of the X86ISD::BLENDI pattern to		/// these values. It relies on the availability of the X86ISD::BLENDI pattern to
/// be matched in the backend with the type given. What it does check for is		/// be matched in the backend with the type given. What it does check for is
/// that the shuffle mask is in fact a blend.		/// that the shuffle mask is a blend, or convertible into a blend with zero.
static SDValue lowerVectorShuffleAsBlend(SDLoc DL, MVT VT, SDValue V1,		static SDValue lowerVectorShuffleAsBlend(SDLoc DL, MVT VT, SDValue V1,
SDValue V2, ArrayRef<int> Mask,		SDValue V2, ArrayRef<int> Original,
const X86Subtarget *Subtarget,		const X86Subtarget *Subtarget,
SelectionDAG &DAG) {		SelectionDAG &DAG) {
		bool V1IsZero = ISD::isBuildVectorAllZeros(V1.getNode());
		bool V2IsZero = ISD::isBuildVectorAllZeros(V2.getNode());
		SmallVector<int, 8> Mask(Original.begin(), Original.end());
		SmallBitVector Zeroable = computeZeroableShuffleElements(Mask, V1, V2);
		bool ForceV1Zero = false, ForceV2Zero = false;

		// Attempt to generate the binary blend mask. If an input is zero then
		// we can use any lane.
		// TODO: generalize the zero matching to any scalar like isShuffleEquivalent.
unsigned BlendMask = 0;		unsigned BlendMask = 0;
for (int i = 0, Size = Mask.size(); i < Size; ++i) {		for (int i = 0, Size = Mask.size(); i < Size; ++i) {
if (Mask[i] >= Size) {		int M = Mask[i];
if (Mask[i] != i + Size)		if (M < 0)
return SDValue(); // Shuffled V2 input!		continue;
		if (M == i)
		continue;
		if (M == i + Size) {
		BlendMask |= 1u << i;
		continue;
		}
		if (Zeroable[i]) {
		if (V1IsZero) {
		ForceV1Zero = true;
		Mask[i] = i;
		continue;
		}
		if (V2IsZero) {
		ForceV2Zero = true;
BlendMask |= 1u << i;		BlendMask |= 1u << i;
		Mask[i] = i + Size;
continue;		continue;
}		}
if (Mask[i] >= 0 && Mask[i] != i)
return SDValue(); // Shuffled V1 input!
}		}
		return SDValue(); // Shuffled input!
		}

		// Create a REAL zero vector - ISD::isBuildVectorAllZeros allows UNDEFs.
		if (ForceV1Zero)
		V1 = getZeroVector(VT, Subtarget, DAG, DL);
		if (ForceV2Zero)
		V2 = getZeroVector(VT, Subtarget, DAG, DL);

		auto ScaleBlendMask = [](unsigned BlendMask, int Size, int Scale) {
		unsigned ScaledMask = 0;
		for (int i = 0; i != Size; ++i)
		if (BlendMask & (1u << i))
		for (int j = 0; j != Scale; ++j)
		ScaledMask |= 1u << (i * Scale + j);
		return ScaledMask;
		};

switch (VT.SimpleTy) {		switch (VT.SimpleTy) {
case MVT::v2f64:		case MVT::v2f64:
case MVT::v4f32:		case MVT::v4f32:
case MVT::v4f64:		case MVT::v4f64:
case MVT::v8f32:		case MVT::v8f32:
return DAG.getNode(X86ISD::BLENDI, DL, VT, V1, V2,		return DAG.getNode(X86ISD::BLENDI, DL, VT, V1, V2,
DAG.getConstant(BlendMask, DL, MVT::i8));		DAG.getConstant(BlendMask, DL, MVT::i8));

case MVT::v4i64:		case MVT::v4i64:
case MVT::v8i32:		case MVT::v8i32:
assert(Subtarget->hasAVX2() && "256-bit integer blends require AVX2!");		assert(Subtarget->hasAVX2() && "256-bit integer blends require AVX2!");
// FALLTHROUGH		// FALLTHROUGH
case MVT::v2i64:		case MVT::v2i64:
case MVT::v4i32:		case MVT::v4i32:
// If we have AVX2 it is faster to use VPBLENDD when the shuffle fits into		// If we have AVX2 it is faster to use VPBLENDD when the shuffle fits into
// that instruction.		// that instruction.
if (Subtarget->hasAVX2()) {		if (Subtarget->hasAVX2()) {
// Scale the blend by the number of 32-bit dwords per element.		// Scale the blend by the number of 32-bit dwords per element.
int Scale = VT.getScalarSizeInBits() / 32;		int Scale = VT.getScalarSizeInBits() / 32;
BlendMask = 0;		BlendMask = ScaleBlendMask(BlendMask, Mask.size(), Scale);
for (int i = 0, Size = Mask.size(); i < Size; ++i)
if (Mask[i] >= Size)
for (int j = 0; j < Scale; ++j)
BlendMask |= 1u << (i * Scale + j);

MVT BlendVT = VT.getSizeInBits() > 128 ? MVT::v8i32 : MVT::v4i32;		MVT BlendVT = VT.getSizeInBits() > 128 ? MVT::v8i32 : MVT::v4i32;
V1 = DAG.getBitcast(BlendVT, V1);		V1 = DAG.getBitcast(BlendVT, V1);
V2 = DAG.getBitcast(BlendVT, V2);		V2 = DAG.getBitcast(BlendVT, V2);
return DAG.getBitcast(		return DAG.getBitcast(
VT, DAG.getNode(X86ISD::BLENDI, DL, BlendVT, V1, V2,		VT, DAG.getNode(X86ISD::BLENDI, DL, BlendVT, V1, V2,
DAG.getConstant(BlendMask, DL, MVT::i8)));		DAG.getConstant(BlendMask, DL, MVT::i8)));
}		}
// FALLTHROUGH		// FALLTHROUGH
case MVT::v8i16: {		case MVT::v8i16: {
// For integer shuffles we need to expand the mask and cast the inputs to		// For integer shuffles we need to expand the mask and cast the inputs to
// v8i16s prior to blending.		// v8i16s prior to blending.
int Scale = 8 / VT.getVectorNumElements();		int Scale = 8 / VT.getVectorNumElements();
BlendMask = 0;		BlendMask = ScaleBlendMask(BlendMask, Mask.size(), Scale);
for (int i = 0, Size = Mask.size(); i < Size; ++i)
if (Mask[i] >= Size)
for (int j = 0; j < Scale; ++j)
BlendMask |= 1u << (i * Scale + j);

V1 = DAG.getBitcast(MVT::v8i16, V1);		V1 = DAG.getBitcast(MVT::v8i16, V1);
V2 = DAG.getBitcast(MVT::v8i16, V2);		V2 = DAG.getBitcast(MVT::v8i16, V2);
return DAG.getBitcast(VT,		return DAG.getBitcast(VT,
DAG.getNode(X86ISD::BLENDI, DL, MVT::v8i16, V1, V2,		DAG.getNode(X86ISD::BLENDI, DL, MVT::v8i16, V1, V2,
DAG.getConstant(BlendMask, DL, MVT::i8)));		DAG.getConstant(BlendMask, DL, MVT::i8)));
}		}

case MVT::v16i16: {		case MVT::v16i16: {
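The ScaleBlendMask lambda in the diff above widens each set bit of a per-element blend mask into Scale adjacent bits, so the same blend can be issued on narrower elements (32-bit dwords for VPBLENDD, 16-bit words for the v8i16 path). A standalone sketch of the same bit manipulation, written as a free function with the assumed name `scaleBlendMask` and an added range check that the original lambda does not have:

```cpp
#include <cassert>

// Expand each of the low Size bits of BlendMask into Scale adjacent bits.
// Bit i of the input becomes bits [i * Scale, i * Scale + Scale) of the
// result.
unsigned scaleBlendMask(unsigned BlendMask, int Size, int Scale) {
  assert(Size > 0 && Scale > 0 && Size * Scale <= 32 &&
         "scaled mask must fit in an unsigned");
  unsigned ScaledMask = 0;
  for (int i = 0; i != Size; ++i)
    if (BlendMask & (1u << i))
      for (int j = 0; j != Scale; ++j)
        ScaledMask |= 1u << (i * Scale + j);
  return ScaledMask;
}
```

For example, a v2i64 blend that takes element 1 from the second input (mask 0x2) becomes 0xC when scaled by 2 to dwords, and a v4i32 mask of 0xA becomes 0xCC when scaled by 2 to 16-bit words.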

llvm/trunk/test/CodeGen/X86/avx-vperm2x128.ll

	; ALL-NEXT: retq			; ALL-NEXT: retq
	%s = shufflevector <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x double> %a, <4 x i32> <i32 0, i32 1, i32 6, i32 7>			%s = shufflevector <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x double> %a, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
	ret <4 x double> %s			ret <4 x double> %s
	}			}

	define <4 x double> @vperm2z_0x80(<4 x double> %a) {			define <4 x double> @vperm2z_0x80(<4 x double> %a) {
	; ALL-LABEL: vperm2z_0x80:			; ALL-LABEL: vperm2z_0x80:
	; ALL: ## BB#0:			; ALL: ## BB#0:
	; ALL-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[0,1],zero,zero			; ALL-NEXT: vxorpd %ymm1, %ymm1, %ymm1
				; ALL-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0,1],ymm1[2,3]
	; ALL-NEXT: retq			; ALL-NEXT: retq
	%s = shufflevector <4 x double> %a, <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x i32> <i32 0, i32 1, i32 4, i32 5>			%s = shufflevector <4 x double> %a, <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
	ret <4 x double> %s			ret <4 x double> %s
	}			}

	define <4 x double> @vperm2z_0x81(<4 x double> %a) {			define <4 x double> @vperm2z_0x81(<4 x double> %a) {
	; ALL-LABEL: vperm2z_0x81:			; ALL-LABEL: vperm2z_0x81:
	; ALL: ## BB#0:			; ALL: ## BB#0:
	; ALL-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],zero,zero			; ALL-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],zero,zero
	; ALL-NEXT: retq			; ALL-NEXT: retq
	%s = shufflevector <4 x double> %a, <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x i32> <i32 2, i32 3, i32 4, i32 5>			%s = shufflevector <4 x double> %a, <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
	ret <4 x double> %s			ret <4 x double> %s
	}			}

	define <4 x double> @vperm2z_0x82(<4 x double> %a) {			define <4 x double> @vperm2z_0x82(<4 x double> %a) {
	; ALL-LABEL: vperm2z_0x82:			; ALL-LABEL: vperm2z_0x82:
	; ALL: ## BB#0:			; ALL: ## BB#0:
	; ALL-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[0,1],zero,zero			; ALL-NEXT: vxorpd %ymm1, %ymm1, %ymm1
				; ALL-NEXT: vblendpd {{.*#+}} ymm0 = ymm0[0,1],ymm1[2,3]
	; ALL-NEXT: retq			; ALL-NEXT: retq
	%s = shufflevector <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x double> %a, <4 x i32> <i32 4, i32 5, i32 0, i32 1>			%s = shufflevector <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x double> %a, <4 x i32> <i32 4, i32 5, i32 0, i32 1>
	ret <4 x double> %s			ret <4 x double> %s
	}			}

	define <4 x double> @vperm2z_0x83(<4 x double> %a) {			define <4 x double> @vperm2z_0x83(<4 x double> %a) {
	; ALL-LABEL: vperm2z_0x83:			; ALL-LABEL: vperm2z_0x83:
	; ALL: ## BB#0:			; ALL: ## BB#0:
	; ALL-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],zero,zero			; ALL-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],zero,zero
	; ALL-NEXT: retq			; ALL-NEXT: retq
	%s = shufflevector <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x double> %a, <4 x i32> <i32 6, i32 7, i32 0, i32 1>			%s = shufflevector <4 x double> <double 0.0, double 0.0, double undef, double undef>, <4 x double> %a, <4 x i32> <i32 6, i32 7, i32 0, i32 1>
	ret <4 x double> %s			ret <4 x double> %s
	}			}

	;; With AVX2 select the integer version of the instruction. Use an add to force the domain selection.			;; With AVX2 select the integer version of the instruction. Use an add to force the domain selection.

	define <4 x i64> @vperm2z_int_0x83(<4 x i64> %a, <4 x i64> %b) {			define <4 x i64> @vperm2z_int_0x83(<4 x i64> %a, <4 x i64> %b) {
	; ALL-LABEL: vperm2z_int_0x83:			; AVX1-LABEL: vperm2z_int_0x83:
	; ALL: ## BB#0:			; AVX1: ## BB#0:
	; AVX1: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],zero,zero			; AVX1-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],zero,zero
	; AVX2: vperm2i128 {{.*#+}} ymm0 = ymm0[2,3],zero,zero			; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm2
				; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm3
				; AVX1-NEXT: vpaddq %xmm2, %xmm3, %xmm2
				; AVX1-NEXT: vpaddq %xmm0, %xmm1, %xmm0
				; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
				; AVX1-NEXT: retq
				;
				; AVX2-LABEL: vperm2z_int_0x83:
				; AVX2: ## BB#0:
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm0[2,3],zero,zero
				; AVX2-NEXT: vpaddq %ymm0, %ymm1, %ymm0
				; AVX2-NEXT: retq
	%s = shufflevector <4 x i64> <i64 0, i64 0, i64 undef, i64 undef>, <4 x i64> %a, <4 x i32> <i32 6, i32 7, i32 0, i32 1>			%s = shufflevector <4 x i64> <i64 0, i64 0, i64 undef, i64 undef>, <4 x i64> %a, <4 x i32> <i32 6, i32 7, i32 0, i32 1>
	%c = add <4 x i64> %b, %s			%c = add <4 x i64> %b, %s
	ret <4 x i64> %c			ret <4 x i64> %c
	}			}

llvm/trunk/test/CodeGen/X86/vector-shuffle-128-v4.ll

	; AVX: # BB#0:			; AVX: # BB#0:
	; AVX-NEXT: vxorps %xmm1, %xmm1, %xmm1			; AVX-NEXT: vxorps %xmm1, %xmm1, %xmm1
	; AVX-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2],xmm0[3]			; AVX-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1,2],xmm0[3]
	; AVX-NEXT: retq			; AVX-NEXT: retq
	%shuffle = shufflevector <4 x float> %a, <4 x float> zeroinitializer, <4 x i32> <i32 0, i32 4, i32 4, i32 3>			%shuffle = shufflevector <4 x float> %a, <4 x float> zeroinitializer, <4 x i32> <i32 0, i32 4, i32 4, i32 3>
	ret <4 x float> %shuffle			ret <4 x float> %shuffle
	}			}

				define <4 x float> @shuffle_v4f32_0z2z(<4 x float> %v) {
				; SSE2-LABEL: shuffle_v4f32_0z2z:
				; SSE2: # BB#0:
				; SSE2-NEXT: xorps %xmm1, %xmm1
				; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,0]
				; SSE2-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,1,3]
				; SSE2-NEXT: retq
				;
				; SSE3-LABEL: shuffle_v4f32_0z2z:
				; SSE3: # BB#0:
				; SSE3-NEXT: xorps %xmm1, %xmm1
				; SSE3-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,0]
				; SSE3-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,1,3]
				; SSE3-NEXT: retq
				;
				; SSSE3-LABEL: shuffle_v4f32_0z2z:
				; SSSE3: # BB#0:
				; SSSE3-NEXT: xorps %xmm1, %xmm1
				; SSSE3-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,0]
				; SSSE3-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2,1,3]
				; SSSE3-NEXT: retq
				;
				; SSE41-LABEL: shuffle_v4f32_0z2z:
				; SSE41: # BB#0:
				; SSE41-NEXT: xorps %xmm1, %xmm1
				; SSE41-NEXT: blendps {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
				; SSE41-NEXT: retq
				;
				; AVX-LABEL: shuffle_v4f32_0z2z:
				; AVX: # BB#0:
				; AVX-NEXT: vxorps %xmm1, %xmm1, %xmm1
				; AVX-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0],xmm1[1],xmm0[2],xmm1[3]
				; AVX-NEXT: retq
				%shuffle = shufflevector <4 x float> %v, <4 x float> <float 0.000000e+00, float undef, float undef, float undef>, <4 x i32> <i32 0, i32 4, i32 2, i32 4>
				ret <4 x float> %shuffle
				}

	define <4 x float> @shuffle_v4f32_u051(<4 x float> %a, <4 x float> %b) {			define <4 x float> @shuffle_v4f32_u051(<4 x float> %a, <4 x float> %b) {
	; SSE-LABEL: shuffle_v4f32_u051:			; SSE-LABEL: shuffle_v4f32_u051:
	; SSE: # BB#0:			; SSE: # BB#0:
	; SSE-NEXT: unpcklps {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]			; SSE-NEXT: unpcklps {{.*#+}} xmm1 = xmm1[0],xmm0[0],xmm1[1],xmm0[1]
	; SSE-NEXT: movaps %xmm1, %xmm0			; SSE-NEXT: movaps %xmm1, %xmm0
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: shuffle_v4f32_u051:			; AVX-LABEL: shuffle_v4f32_u051: