This is an archive of the discontinued LLVM Phabricator instance.

[X86] Fix bad treatment of multi-lane blends in BUILD_VECTORtoBlendMask()
ClosedPublic

Authored by mkuper on Oct 7 2015, 2:47 AM.

Details

Reviewers

chandlerc
filcab
andreadb
aivchenk

Commits

rG04e79329d090: [X86] Fix wrong treatment of multi-lane blends in BUILD_VECTORtoBlendMask()
rL249669: [X86] Fix wrong treatment of multi-lane blends in BUILD_VECTORtoBlendMask()

Summary

This is a replacement for the patch in D13207.
It turns out that there are actually two separate bugs. One is the bug that D13207 originally handled.
The other is that the number of lanes the function computes may not actually be 1 or 2. See the AVX case of the constant_pblendvb_avx2 test for an example. I'm not sure how exactly it should behave, so I've decided to pessimize it back to safety.
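
The lane-count arithmetic can be checked in isolation. The following is a minimal sketch that assumes only the `(NumElems - 1) / 8 + 1` formula visible in the diff; `numLanes` is a hypothetical helper name used for illustration, not an LLVM function.

```cpp
// Same arithmetic as the NumLanes computation in BUILD_VECTORtoBlendMask().
// numLanes is a hypothetical name introduced for this illustration.
constexpr unsigned numLanes(unsigned NumElems) {
  return (NumElems - 1) / 8 + 1;
}

// Up to 8 elements: one lane. 16 elements (e.g. v16i8, v16i16): two lanes.
static_assert(numLanes(4) == 1, "4 elements fit in one lane");
static_assert(numLanes(8) == 1, "8 elements fit in one lane");
static_assert(numLanes(16) == 2, "16 elements give two lanes");

// 32 elements (e.g. v32i8) give four lanes, which the two-lane loop in the
// function does not model. The patch returns false in this case.
static_assert(numLanes(32) == 4, "32 elements give four lanes");
```

With a v32i8 condition the formula yields 4, so the `NumLanes > 2` early exit added by the patch is what keeps the function from producing a mask for that case.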

Event Timeline

mkuper updated this revision to Diff 36721. Oct 7 2015, 2:47 AM

mkuper retitled this revision from to [X86] Fix bad treatment of multi-lane blends in BUILD_VECTORtoBlendMask().

mkuper updated this object.

mkuper added reviewers: filcab, chandlerc, aivchenk.

mkuper mentioned this in D13207: Fix wrong mask generated for _mm_blendv_epi8 (BUG 24532).

mkuper added subscribers: qcolombet, llvm-commits.

andreadb added a subscriber: andreadb. Oct 7 2015, 3:33 AM

Hi Michael

The patch looks good to me.

Unrelated to your patch (but still important in my opinion).

Currently, function BUILD_VECTORtoBlendMask is only called by function 'transformVSELECTtoBlendVECTOR_SHUFFLE'. The goal of that function is (if possible) to convert a VSELECT into a target independent vector shuffle node.

Function transformVSELECTtoBlendVECTOR_SHUFFLE performs the following steps:

1. First, it checks whether the VSELECT condition is constant (i.e. a build_vector of ConstantSDNode (or Undef)).
2. It then delegates to function 'BUILD_VECTORtoBlendMask' the computation of a bitmask (an unsigned value) which represents some sort of 'compressed' vector shuffle mask.
3. The bitmask is eventually decoded into a mask which is suitable for a target independent vector shuffle node.

The more I look at that code, the more I think that point 2. could have been entirely avoided.
What I am trying to say (correct me if I am wrong) is that we probably don't need to go through a two-step process where we first convert the select mask to a bitmask, and then convert the bitmask to a vector shuffle mask...

Presumably this design made sense in the past (more than one year ago) since we didn't have the new shuffle lowering logic in place. Nowadays we should probably check if it would make more sense to just remove function BUILD_VECTORtoBlendMask and simplify the logic in transformVSELECTtoBlendVECTOR_SHUFFLE. This is just an idea and, as I said, unrelated to your patch. I hope it makes sense.

-Andrea
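
The two routes discussed in the comment above can be compared with a small standalone model. This is only a sketch under stated assumptions: the condition is modelled as a `std::vector<bool>` (undef elements and the lane-symmetry check are ignored), and the function names `condToBitmask`, `bitmaskToShuffleMask` and `condToShuffleMask` are hypothetical, not LLVM APIs. The bit encoding (0 for the first operand, 1 for the second) follows the comment in the diff.

```cpp
#include <cassert>
#include <vector>

// Two-step route, step 1: condition -> bitmask.
// Bit i is set when element i comes from the second operand.
unsigned condToBitmask(const std::vector<bool> &Cond) {
  assert(Cond.size() <= 32 && "the bitmask is modelled as a 32-bit unsigned");
  unsigned Mask = 0;
  for (unsigned i = 0; i < Cond.size(); ++i)
    if (!Cond[i])
      Mask |= 1u << i;
  return Mask;
}

// Two-step route, step 2: bitmask -> shuffle mask.
// Index i picks element i of the first operand, i + NumElems of the second.
std::vector<int> bitmaskToShuffleMask(unsigned Bits, unsigned NumElems) {
  std::vector<int> Mask(NumElems);
  for (unsigned i = 0; i < NumElems; ++i)
    Mask[i] = ((Bits >> i) & 1u) ? static_cast<int>(i + NumElems)
                                 : static_cast<int>(i);
  return Mask;
}

// Direct route: condition -> shuffle mask, with no intermediate bitmask.
std::vector<int> condToShuffleMask(const std::vector<bool> &Cond) {
  const unsigned NumElems = static_cast<unsigned>(Cond.size());
  std::vector<int> Mask(NumElems);
  for (unsigned i = 0; i < NumElems; ++i)
    Mask[i] = Cond[i] ? static_cast<int>(i) : static_cast<int>(i + NumElems);
  return Mask;
}
```

For a condition such as `{true, false, false, true}` both routes give the shuffle mask `{0, 5, 6, 3}` in this model, which illustrates why the intermediate bitmask may be avoidable.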

This revision is now accepted and ready to land. Oct 7 2015, 6:33 AM

Thanks, Andrea.

What you wrote makes sense to me, but I'm not familiar enough with this code (or its history) to have a strong opinion either way.
Filipe and Chandler probably understand the bigger picture better.

I have a question:
This looks like it's to generate BLENDI nodes, in order to match the Intel instructions for blends with immediates...
But there are no byte-shuffles with immediates, AFAICT.

What did I miss?

P.S: If I change that function to bail out on ElemType == MVT::i8, I see the masks reversed (but both lanes are equal)...

test/CodeGen/X86/vector-blend.ll
664	Why are the "first lanes" changing here? Either it was completely wrong, and you're fixing it, or the replacement is wrong (the "second lane" was always wrong).

In D13505#261765, @filcab wrote:

I have a question:
This looks like it's to generate BLENDI nodes, in order to match the Intel instructions for blends with immediates...
But there are no byte-shuffles with immediates, AFAICT.

What did I miss?

The user of that function delegates the selection of shuffle blend nodes to the shuffle lowering logic.
That said, you are right and there are no byte shuffles with immediate mask. However, I don't think it is in general a bad idea to canonicalize byte shuffles to shuffle vector node since it may potentially enable more combining of shuffles.

test/CodeGen/X86/vector-blend.ll
664	The operands to the vpblendvb have been commuted. So the mask has been changed accordingly. I don't think this is wrong.
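
The point about commuted operands can be illustrated with a simplified byte-blend model. This is a sketch, not the instruction's exact definition: it assumes a blend that takes the byte from the second source where the top bit of the mask byte is set, and the names `V16`, `blendv`, `invert` and `equal` are hypothetical helpers.

```cpp
#include <cstdint>

// Hypothetical 16-byte vector type for this illustration.
struct V16 {
  std::uint8_t Byte[16];
};

// Simplified blend model: where the mask byte has its top bit set, take the
// byte from B, otherwise take it from A.
constexpr V16 blendv(const V16 &A, const V16 &B, const V16 &Mask) {
  V16 R{};
  for (int i = 0; i < 16; ++i)
    R.Byte[i] = (Mask.Byte[i] & 0x80) ? B.Byte[i] : A.Byte[i];
  return R;
}

// Bitwise inversion of every mask byte (255 <-> 0).
constexpr V16 invert(const V16 &Mask) {
  V16 R{};
  for (int i = 0; i < 16; ++i)
    R.Byte[i] = static_cast<std::uint8_t>(~Mask.Byte[i]);
  return R;
}

// Element-wise equality of two vectors.
constexpr bool equal(const V16 &X, const V16 &Y) {
  for (int i = 0; i < 16; ++i)
    if (X.Byte[i] != Y.Byte[i])
      return false;
  return true;
}
```

In this model `blendv(A, B, M)` and `blendv(B, A, invert(M))` produce the same vector, so swapping the two sources together with inverting the constant mask leaves the result unchanged.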

Closed by commit rL249669: [X86] Fix wrong treatment of multi-lane blends in BUILD_VECTORtoBlendMask() (authored by mkuper). Oct 8 2015, 1:15 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.cpp (revision 249379)	14 lines
test/CodeGen/X86/vector-blend.ll (revision 249379)	72 lines

Diff 36721

lib/Target/X86/X86ISelLowering.cpp

	[11,079 lines not shown]

	// This function assumes its argument is a BUILD_VECTOR of constants or			// This function assumes its argument is a BUILD_VECTOR of constants or
	// undef SDNodes. i.e: ISD::isBuildVectorOfConstantSDNodes(BuildVector) is			// undef SDNodes. i.e: ISD::isBuildVectorOfConstantSDNodes(BuildVector) is
	// true.			// true.
	static bool BUILD_VECTORtoBlendMask(BuildVectorSDNode *BuildVector,			static bool BUILD_VECTORtoBlendMask(BuildVectorSDNode *BuildVector,
	unsigned &MaskValue) {			unsigned &MaskValue) {
	MaskValue = 0;			MaskValue = 0;
	unsigned NumElems = BuildVector->getNumOperands();			unsigned NumElems = BuildVector->getNumOperands();

	// There are 2 lanes if (NumElems > 8), and 1 lane otherwise.			// There are 2 lanes if (NumElems > 8), and 1 lane otherwise.
				// We don't handle the >2 lanes case right now.
	unsigned NumLanes = (NumElems - 1) / 8 + 1;			unsigned NumLanes = (NumElems - 1) / 8 + 1;
				if (NumLanes > 2)
				return false;

	unsigned NumElemsInLane = NumElems / NumLanes;			unsigned NumElemsInLane = NumElems / NumLanes;

	// Blend for v16i16 should be symmetric for the both lanes.			// Blend for v16i16 should be symmetric for the both lanes.
	for (unsigned i = 0; i < NumElemsInLane; ++i) {			for (unsigned i = 0; i < NumElemsInLane; ++i) {
	SDValue EltCond = BuildVector->getOperand(i);			SDValue EltCond = BuildVector->getOperand(i);
	SDValue SndLaneEltCond =			SDValue SndLaneEltCond =
	(NumLanes == 2) ? BuildVector->getOperand(i + NumElemsInLane) : EltCond;			(NumLanes == 2) ? BuildVector->getOperand(i + NumElemsInLane) : EltCond;

	int Lane1Cond = -1, Lane2Cond = -1;			int Lane1Cond = -1, Lane2Cond = -1;
	if (isa<ConstantSDNode>(EltCond))			if (isa<ConstantSDNode>(EltCond))
	Lane1Cond = !isZero(EltCond);			Lane1Cond = !isZero(EltCond);
	if (isa<ConstantSDNode>(SndLaneEltCond))			if (isa<ConstantSDNode>(SndLaneEltCond))
	Lane2Cond = !isZero(SndLaneEltCond);			Lane2Cond = !isZero(SndLaneEltCond);

				unsigned LaneMask = 0;
	if (Lane1Cond == Lane2Cond || Lane2Cond < 0)			if (Lane1Cond == Lane2Cond || Lane2Cond < 0)
	// Lane1Cond != 0, means we want the first argument.			// Lane1Cond != 0, means we want the first argument.
	// Lane1Cond == 0, means we want the second argument.			// Lane1Cond == 0, means we want the second argument.
	// The encoding of this argument is 0 for the first argument, 1			// The encoding of this argument is 0 for the first argument, 1
	// for the second. Therefore, invert the condition.			// for the second. Therefore, invert the condition.
	MaskValue |= !Lane1Cond << i;			LaneMask = !Lane1Cond << i;
	else if (Lane1Cond < 0)			else if (Lane1Cond < 0)
	MaskValue |= !Lane2Cond << i;			LaneMask = !Lane2Cond << i;
	else			else
	return false;			return false;

				MaskValue |= LaneMask;
				if (NumLanes == 2)
				MaskValue |= LaneMask << NumElemsInLane;
	}			}
	return true;			return true;
	}			}

	/// \brief Try to lower a VSELECT instruction to a vector shuffle.			/// \brief Try to lower a VSELECT instruction to a vector shuffle.
	static SDValue lowerVSELECTtoVectorShuffle(SDValue Op,			static SDValue lowerVSELECTtoVectorShuffle(SDValue Op,
	const X86Subtarget *Subtarget,			const X86Subtarget *Subtarget,
	SelectionDAG &DAG) {			SelectionDAG &DAG) {
	[16,055 lines not shown]
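
The patched logic in the right-hand column above can be exercised outside LLVM with a standalone model. This is a sketch under stated assumptions: each BUILD_VECTOR operand is modelled as an `int` (-1 for undef, 0 for a zero constant, 1 for a non-zero constant), and `buildVectorToBlendMask` is a hypothetical stand-in for the real function, which operates on `BuildVectorSDNode`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one operand: -1 = undef, 0 = zero, 1 = non-zero.
using Cond = int;

// Model of the patched BUILD_VECTORtoBlendMask(). Returns false when no
// blend mask can be formed.
bool buildVectorToBlendMask(const std::vector<Cond> &Elts,
                            unsigned &MaskValue) {
  MaskValue = 0;
  assert(!Elts.empty() && "expects at least one element");
  const unsigned NumElems = static_cast<unsigned>(Elts.size());

  // The >2 lanes case is not handled, so bail out as the patch does.
  const unsigned NumLanes = (NumElems - 1) / 8 + 1;
  if (NumLanes > 2)
    return false;
  const unsigned NumElemsInLane = NumElems / NumLanes;

  for (unsigned i = 0; i < NumElemsInLane; ++i) {
    const Cond Lane1 = Elts[i];
    const Cond Lane2 = (NumLanes == 2) ? Elts[i + NumElemsInLane] : Lane1;

    // Encoding: 0 selects the first argument, 1 the second, so the
    // condition is inverted.
    unsigned LaneMask = 0;
    if (Lane1 == Lane2 || Lane2 < 0)
      LaneMask = static_cast<unsigned>(!Lane1) << i;
    else if (Lane1 < 0)
      LaneMask = static_cast<unsigned>(!Lane2) << i;
    else
      return false; // The two lanes disagree.

    // The fix: replicate the per-lane bit into the second lane as well.
    MaskValue |= LaneMask;
    if (NumLanes == 2)
      MaskValue |= LaneMask << NumElemsInLane;
  }
  return true;
}
```

For the 16-element condition used by the vsel_i8 test (true, false, false, false, repeated four times) this model yields 0xEEEE, whereas the pre-patch code, which only ever set the low lane's bits, would have left the upper eight bits zero. A 32-element input returns false.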

test/CodeGen/X86/vector-blend.ll

	[249 lines not shown]
	entry:			entry:
	%vsel = select <8 x i1> <i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false>, <8 x i16> %v1, <8 x i16> %v2			%vsel = select <8 x i1> <i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false>, <8 x i16> %v1, <8 x i16> %v2
	ret <8 x i16> %vsel			ret <8 x i16> %vsel
	}			}

	define <16 x i8> @vsel_i8(<16 x i8> %v1, <16 x i8> %v2) {			define <16 x i8> @vsel_i8(<16 x i8> %v1, <16 x i8> %v2) {
	; SSE2-LABEL: vsel_i8:			; SSE2-LABEL: vsel_i8:
	; SSE2: # BB#0: # %entry			; SSE2: # BB#0: # %entry
	; SSE2-NEXT: movaps {{.*#+}} xmm2 = [255,0,0,0,255,0,0,0,255,255,255,255,255,255,255,255]			; SSE2-NEXT: movaps {{.*#+}} xmm2 = [0,255,255,255,0,255,255,255,0,255,255,255,0,255,255,255]
	; SSE2-NEXT: andps %xmm2, %xmm0			; SSE2-NEXT: andps %xmm2, %xmm1
	; SSE2-NEXT: andnps %xmm1, %xmm2			; SSE2-NEXT: andnps %xmm0, %xmm2
	; SSE2-NEXT: orps %xmm2, %xmm0			; SSE2-NEXT: orps %xmm1, %xmm2
				; SSE2-NEXT: movaps %xmm2, %xmm0
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSSE3-LABEL: vsel_i8:			; SSSE3-LABEL: vsel_i8:
	; SSSE3: # BB#0: # %entry			; SSSE3: # BB#0: # %entry
	; SSSE3-NEXT: pshufb {{.*#+}} xmm1 = zero,xmm1[1,2,3],zero,xmm1[5,6,7],zero,zero,zero,zero,zero,zero,zero,zero			; SSSE3-NEXT: pshufb {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[12],zero,zero,zero
	; SSSE3-NEXT: pshufb {{.*#+}} xmm0 = xmm0[0],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[8,9,10,11,12,13,14,15]			; SSSE3-NEXT: pshufb {{.*#+}} xmm1 = zero,xmm1[1,2,3],zero,xmm1[5,6,7],zero,xmm1[9,10,11],zero,xmm1[13,14,15]
	; SSSE3-NEXT: por %xmm1, %xmm0			; SSSE3-NEXT: por %xmm1, %xmm0
	; SSSE3-NEXT: retq			; SSSE3-NEXT: retq
	;			;
	; SSE41-LABEL: vsel_i8:			; SSE41-LABEL: vsel_i8:
	; SSE41: # BB#0: # %entry			; SSE41: # BB#0: # %entry
	; SSE41-NEXT: movdqa %xmm0, %xmm2			; SSE41-NEXT: movdqa %xmm0, %xmm2
	; SSE41-NEXT: movaps {{.*#+}} xmm0 = [255,0,0,0,255,0,0,0,255,255,255,255,255,255,255,255]			; SSE41-NEXT: movaps {{.*#+}} xmm0 = [0,255,255,255,0,255,255,255,0,255,255,255,0,255,255,255]
	; SSE41-NEXT: pblendvb %xmm2, %xmm1			; SSE41-NEXT: pblendvb %xmm1, %xmm2
	; SSE41-NEXT: movdqa %xmm1, %xmm0			; SSE41-NEXT: movdqa %xmm2, %xmm0
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX-LABEL: vsel_i8:			; AVX-LABEL: vsel_i8:
	; AVX: # BB#0: # %entry			; AVX: # BB#0: # %entry
	; AVX-NEXT: vmovdqa {{.*#+}} xmm2 = [255,0,0,0,255,0,0,0,255,255,255,255,255,255,255,255]			; AVX-NEXT: vmovdqa {{.*#+}} xmm2 = [0,255,255,255,0,255,255,255,0,255,255,255,0,255,255,255]
	; AVX-NEXT: vpblendvb %xmm2, %xmm0, %xmm1, %xmm0			; AVX-NEXT: vpblendvb %xmm2, %xmm1, %xmm0, %xmm0
	; AVX-NEXT: retq			; AVX-NEXT: retq
	entry:			entry:
	%vsel = select <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false>, <16 x i8> %v1, <16 x i8> %v2			%vsel = select <16 x i1> <i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false>, <16 x i8> %v1, <16 x i8> %v2
	ret <16 x i8> %vsel			ret <16 x i8> %vsel
	}			}


	; AVX256 tests:			; AVX256 tests:
	[327 lines not shown]
	entry:			entry:
	%select = select <8 x i1> <i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true>, <8 x float> %xyzw, <8 x float> %abcd			%select = select <8 x i1> <i1 false, i1 false, i1 false, i1 true, i1 false, i1 false, i1 false, i1 true>, <8 x float> %xyzw, <8 x float> %abcd
	ret <8 x float> %select			ret <8 x float> %select
	}			}

	define <32 x i8> @constant_pblendvb_avx2(<32 x i8> %xyzw, <32 x i8> %abcd) {			define <32 x i8> @constant_pblendvb_avx2(<32 x i8> %xyzw, <32 x i8> %abcd) {
	; SSE2-LABEL: constant_pblendvb_avx2:			; SSE2-LABEL: constant_pblendvb_avx2:
	; SSE2: # BB#0: # %entry			; SSE2: # BB#0: # %entry
	; SSE2-NEXT: movaps {{.*#+}} xmm4 = [0,0,255,0,255,255,255,0,255,255,255,255,255,255,255,255]			; SSE2-NEXT: movaps {{.*#+}} xmm4 = [255,255,0,255,0,0,0,255,255,255,0,255,0,0,0,255]
	; SSE2-NEXT: movaps %xmm4, %xmm5			; SSE2-NEXT: movaps %xmm4, %xmm5
	; SSE2-NEXT: andnps %xmm2, %xmm5			; SSE2-NEXT: andnps %xmm0, %xmm5
	; SSE2-NEXT: andps %xmm4, %xmm0			; SSE2-NEXT: andps %xmm4, %xmm2
	; SSE2-NEXT: orps %xmm5, %xmm0			; SSE2-NEXT: orps %xmm2, %xmm5
	; SSE2-NEXT: andps %xmm4, %xmm1			; SSE2-NEXT: andps %xmm4, %xmm3
	; SSE2-NEXT: andnps %xmm3, %xmm4			; SSE2-NEXT: andnps %xmm1, %xmm4
	; SSE2-NEXT: orps %xmm4, %xmm1			; SSE2-NEXT: orps %xmm3, %xmm4
				; SSE2-NEXT: movaps %xmm5, %xmm0
				; SSE2-NEXT: movaps %xmm4, %xmm1
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSSE3-LABEL: constant_pblendvb_avx2:			; SSSE3-LABEL: constant_pblendvb_avx2:
	; SSSE3: # BB#0: # %entry			; SSSE3: # BB#0: # %entry
	; SSSE3-NEXT: movdqa {{.*#+}} xmm4 = [0,1,128,3,128,128,128,7,128,128,128,128,128,128,128,128]			; SSSE3-NEXT: movdqa {{.*#+}} xmm4 = [128,128,2,128,4,5,6,128,128,128,10,128,12,13,14,128]
	; SSSE3-NEXT: pshufb %xmm4, %xmm2			; SSSE3-NEXT: pshufb %xmm4, %xmm0
	; SSSE3-NEXT: movdqa {{.*#+}} xmm5 = [128,128,2,128,4,5,6,128,8,9,10,11,12,13,14,15]			; SSSE3-NEXT: movdqa {{.*#+}} xmm5 = [0,1,128,3,128,128,128,7,8,9,128,11,128,128,128,15]
	; SSSE3-NEXT: pshufb %xmm5, %xmm0			; SSSE3-NEXT: pshufb %xmm5, %xmm2
	; SSSE3-NEXT: por %xmm2, %xmm0			; SSSE3-NEXT: por %xmm2, %xmm0
	; SSSE3-NEXT: pshufb %xmm4, %xmm3			; SSSE3-NEXT: pshufb %xmm4, %xmm1
	; SSSE3-NEXT: pshufb %xmm5, %xmm1			; SSSE3-NEXT: pshufb %xmm5, %xmm3
	; SSSE3-NEXT: por %xmm3, %xmm1			; SSSE3-NEXT: por %xmm3, %xmm1
	; SSSE3-NEXT: retq			; SSSE3-NEXT: retq
	;			;
	; SSE41-LABEL: constant_pblendvb_avx2:			; SSE41-LABEL: constant_pblendvb_avx2:
	; SSE41: # BB#0: # %entry			; SSE41: # BB#0: # %entry
	; SSE41-NEXT: movdqa %xmm0, %xmm4			; SSE41-NEXT: movdqa %xmm0, %xmm4
	; SSE41-NEXT: movaps {{.*#+}} xmm0 = [0,0,255,0,255,255,255,0,255,255,255,255,255,255,255,255]			; SSE41-NEXT: movaps {{.*#+}} xmm0 = [255,255,0,255,0,0,0,255,255,255,0,255,0,0,0,255]
	; SSE41-NEXT: pblendvb %xmm4, %xmm2			; SSE41-NEXT: pblendvb %xmm2, %xmm4
	; SSE41-NEXT: pblendvb %xmm1, %xmm3			; SSE41-NEXT: pblendvb %xmm3, %xmm1
	; SSE41-NEXT: movdqa %xmm2, %xmm0			; SSE41-NEXT: movdqa %xmm4, %xmm0
	; SSE41-NEXT: movdqa %xmm3, %xmm1
	; SSE41-NEXT: retq			; SSE41-NEXT: retq
	;			;
	; AVX1-LABEL: constant_pblendvb_avx2:			; AVX1-LABEL: constant_pblendvb_avx2:
	; AVX1: # BB#0: # %entry			; AVX1: # BB#0: # %entry
	; AVX1-NEXT: vmovdqa {{.*#+}} xmm2 = [0,0,255,0,255,255,255,0,255,255,255,255,255,255,255,255]			; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm2
	; AVX1-NEXT: vpblendvb %xmm2, %xmm0, %xmm1, %xmm1			; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm3
	; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0			; AVX1-NEXT: vmovdqa .LCPI18_0(%rip), %xmm4 # xmm4 = [255,255,0,255,0,0,0,255,255,255,0,255,0,0,0,255]
				filcab: Why are the "first lanes" changing here? Either it was completely wrong, and you're fixing it, or the replacement is wrong (the "second lane" was always wrong).
				andreadb: The operands to the vpblendvb have been commuted. So the mask has been changed accordingly. I don't think this is wrong.
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0			; AVX1-NEXT: vpblendvb %xmm4, %xmm2, %xmm3, %xmm2
				; AVX1-NEXT: vpblendvb %xmm4, %xmm1, %xmm0, %xmm0
				; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: constant_pblendvb_avx2:			; AVX2-LABEL: constant_pblendvb_avx2:
	; AVX2: # BB#0: # %entry			; AVX2: # BB#0: # %entry
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,0,255,0,255,255,255,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]			; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,0,255,0,255,255,255,0,0,0,255,0,255,255,255,0,0,0,255,0,255,255,255,0,0,0,255,0,255,255,255,0]
	; AVX2-NEXT: vpblendvb %ymm2, %ymm0, %ymm1, %ymm0			; AVX2-NEXT: vpblendvb %ymm2, %ymm0, %ymm1, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	entry:			entry:
	%select = select <32 x i1> <i1 false, i1 false, i1 true, i1 false, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 true, i1 true, i1 true, i1 false>, <32 x i8> %xyzw, <32 x i8> %abcd			%select = select <32 x i1> <i1 false, i1 false, i1 true, i1 false, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 true, i1 true, i1 true, i1 false, i1 false, i1 false, i1 true, i1 false, i1 true, i1 true, i1 true, i1 false>, <32 x i8> %xyzw, <32 x i8> %abcd
	ret <32 x i8> %select			ret <32 x i8> %select
	}			}

	declare <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float>, <8 x float>, <8 x float>)			declare <8 x float> @llvm.x86.avx.blendv.ps.256(<8 x float>, <8 x float>, <8 x float>)
	[123 lines not shown]