This is an archive of the discontinued LLVM Phabricator instance.

[x86] improve AVX lowering of vector zext
Closed · Public

Authored by spatel on Mar 25 2019, 9:33 AM.

Details

Reviewers

RKSimon
craig.topper

Commits

rG1df0bb6264a3: [x86] improve AVX lowering of vector zext
rL357129: [x86] improve AVX lowering of vector zext

Summary

If we know the 2 halves of an oversized zext-in-reg are the same, don't create those halves independently.
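To illustrate why identical halves make the second extension redundant, here is a small standalone model in plain C++. It is not LLVM code: the names `applyShuffle` and `zextRange` are hypothetical, and a 4 x i32 input extended to 4 x i64 is assumed purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model, not LLVM code: apply a single-source shuffle mask.
// A negative mask value stands for an undef lane; it is modeled as 0 here.
std::vector<uint32_t> applyShuffle(const std::vector<uint32_t> &Src,
                                   const std::vector<int> &Mask) {
  std::vector<uint32_t> Out;
  for (int M : Mask)
    Out.push_back(M < 0 ? 0u : Src[static_cast<std::size_t>(M)]);
  return Out;
}

// Zero-extend elements [Begin, Begin + Count) from 32 bits to 64 bits.
std::vector<uint64_t> zextRange(const std::vector<uint32_t> &In,
                                std::size_t Begin, std::size_t Count) {
  std::vector<uint64_t> Out;
  for (std::size_t i = 0; i != Count; ++i)
    Out.push_back(static_cast<uint64_t>(In[Begin + i]));
  return Out;
}
```

In this model, when the shuffle mask has identical halves (for example a splat such as `{0, 0, 0, 0}`), `zextRange(V, 0, 2)` and `zextRange(V, 2, 2)` compare equal, so the extended low half can simply be reused as the high half.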

I tried several different approaches to fold this, but it's difficult to get right during legalization. In the default path, we are creating a generic shuffle that looks like an unpack high, but it can get transformed into a different mask (a blend), so it's not straightforward to match that. If we try to fold after it actually becomes an X86ISD::UNPCKH node, we can't be sure what the operand node is - it might be a generic shuffle, or it could be some x86-specific op.

I thought we had some utility to determine if a mask had an any-size splat subset pattern, but I don't see it, so I wrote a small match mask helper for this 1 case.

From the test output, we should be doing something like this for SSE4.1 as well, but I'd rather leave that as a follow-up since it involves changing lowering actions.

Event Timeline

spatel created this revision. Mar 25 2019, 9:33 AM

Herald added a project: Restricted Project. Mar 25 2019, 9:33 AM

Herald added subscribers: hiraditya, mcrosier.

RKSimon added inline comments. Mar 25 2019, 9:42 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
9899	This might fail if one half has an undef mask value where the other doesn't, and we then use the mask half with the undef (there's probably a better way to phrase that...). We need to return a merged mask with each mask value set to the non-undef case. IIRC we used to have a helper to do this, but it evolved into isRepeatedShuffleMask, which assumes sublane offsets.

spatel marked an inline comment as done. Mar 25 2019, 9:55 AM

spatel added inline comments.

llvm/lib/Target/X86/X86ISelLowering.cpp
9899	Yes, good catch. I'll add another test to try to show that. Returning the 'stronger' half mask would make this patch more complex, if I understand correctly, and I haven't actually seen that pattern. Is it OK to just make the matcher stricter and fail if the undef lanes don't line up?
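
For reference, a rough sketch of the "merged mask" alternative raised in this thread, written against `std::vector` rather than LLVM's `ArrayRef` so that it stands alone. The name `mergeIdenticalHalves` is hypothetical, and `-1` for an undef lane is an assumption meant to mirror `SM_SentinelUndef`. This is not code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: -1 marks an undef lane, mirroring SM_SentinelUndef.
constexpr int UndefLane = -1;

// Hypothetical helper: succeed if both halves of Mask agree wherever both
// are defined. On success, Merged holds one half-sized mask that uses the
// defined value for every lane that is undef in only one half.
bool mergeIdenticalHalves(const std::vector<int> &Mask,
                          std::vector<int> &Merged) {
  assert(Mask.size() % 2 == 0 && "Expecting even number of elements in mask");
  std::size_t HalfSize = Mask.size() / 2;
  Merged.assign(HalfSize, UndefLane);
  for (std::size_t i = 0; i != HalfSize; ++i) {
    int Lo = Mask[i];
    int Hi = Mask[i + HalfSize];
    // Two defined but different lanes mean the halves really differ.
    if (Lo != UndefLane && Hi != UndefLane && Lo != Hi)
      return false;
    Merged[i] = (Lo != UndefLane) ? Lo : Hi;
  }
  return true;
}
```

For the mask `{0, undef, 0, 1}` this yields the merged half `{0, 1}`. Simply reusing the low half `{0, undef}` would lose the defined lane 1 from the high half, which is the failure mode described above. Using a merged mask would also mean rebuilding the shuffle, which is the extra complexity the author mentions.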

spatel mentioned this in rL356930: [x86] add another vector zext test; NFC. Mar 25 2019, 10:53 AM

spatel mentioned this in rGf49e33e252c4: [x86] add another vector zext test; NFC.

Patch updated:

Don't match undef lanes unless they exist on both halves of the mask.
New test to show that difference.
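
A minimal sketch of the stricter check this update describes, again using `std::vector` so that it compiles on its own. The name `hasIdenticalHalvesStrict` is hypothetical, `-1` is assumed to stand for an undef lane, and the committed helper may differ in detail.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical strict variant: the halves match only if every lane is equal,
// so an undef lane (-1 here, by assumption) must be undef in both halves.
bool hasIdenticalHalvesStrict(const std::vector<int> &Mask) {
  assert(Mask.size() % 2 == 0 && "Expecting even number of elements in mask");
  std::size_t HalfSize = Mask.size() / 2;
  for (std::size_t i = 0; i != HalfSize; ++i)
    if (Mask[i] != Mask[i + HalfSize])
      return false;
  return true;
}
```

Under this rule `{0, undef, 0, undef}` still matches, while `{0, undef, 0, 1}` does not, so the fold is skipped whenever the undef lanes do not line up.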

LGTM - someday shuffle combining will handle different sized source/destination registers (and we can properly combine from vinsertf128) but this is good enough for now.

This revision is now accepted and ready to land. Mar 27 2019, 11:24 AM

Closed by commit rL357129: [x86] improve AVX lowering of vector zext (authored by spatel). Mar 27 2019, 3:41 PM

This revision was automatically updated to reflect the committed changes.

spatel mentioned this in D59961: [DAGCombiner] simplify shuffle of shuffle. Mar 28 2019, 2:36 PM

spatel mentioned this in rL357258: [DAGCombiner] simplify shuffle of shuffle. Mar 29 2019, 7:19 AM

spatel mentioned this in rG12685d0f7cd8: [DAGCombiner] simplify shuffle of shuffle.

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelLowering.cpp	21 lines
llvm/test/CodeGen/X86/vector-zext.ll	18 lines

Diff 192122

llvm/lib/Target/X86/X86ISelLowering.cpp

[9,877 lines hidden; context: for (unsigned i = 0; i != 4; ++i) {]
     createUnpackShuffleMask(VT, UnpackMask, (i >> 1) % 2, i % 2);
     if (isTargetShuffleEquivalent(Mask, UnpackMask) ||
         isTargetShuffleEquivalent(CommutedMask, UnpackMask))
       return true;
   }
   return false;
 }
 
+/// Return true if a shuffle mask chooses elements identically in its top and
+/// bottom halves. For example, any splat mask has the same top and bottom
+/// halves. Undefined elements in the mask are ignored.
+static bool hasIdenticalHalvesShuffleMask(ArrayRef<int> Mask) {
+  assert(Mask.size() % 2 == 0 && "Expecting even number of elements in mask");
+  unsigned HalfSize = Mask.size() / 2;
+  for (unsigned i = 0; i != HalfSize; ++i) {
+    if (Mask[i] == SM_SentinelUndef || Mask[i + HalfSize] == SM_SentinelUndef)
+      continue;
+    if (Mask[i] != Mask[i + HalfSize])
+      return false;
+  }
+  return true;
+}
+
 /// Get a 4-lane 8-bit shuffle immediate for a mask.
 ///
 /// This helper function produces an 8-bit shuffle immediate corresponding to
 /// the ubiquitous shuffle encoding scheme used in x86 instructions for
 /// shuffling 4 lanes. It can be used with most of the PSHUF instructions for
 /// example.
 ///
 /// NB: We rely heavily on "undef" masks preserving the input lane.
[8,470 lines hidden; context: static SDValue LowerAVXExtend(SDValue Op, SelectionDAG &DAG,]
   // Concat upper and lower parts.
   //
 
   MVT HalfVT = MVT::getVectorVT(VT.getVectorElementType(),
                                 VT.getVectorNumElements() / 2);
 
   SDValue OpLo = DAG.getNode(ISD::ZERO_EXTEND_VECTOR_INREG, dl, HalfVT, In);
 
+  // Short-circuit if we can determine that each 128-bit half is the same value.
+  // Otherwise, this is difficult to match and optimize.
+  if (auto *Shuf = dyn_cast<ShuffleVectorSDNode>(In))
+    if (hasIdenticalHalvesShuffleMask(Shuf->getMask()))
+      return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, OpLo, OpLo);
+
   SDValue ZeroVec = DAG.getConstant(0, dl, InVT);
   SDValue Undef = DAG.getUNDEF(InVT);
   bool NeedZero = Op.getOpcode() == ISD::ZERO_EXTEND;
   SDValue OpHi = getUnpackh(DAG, dl, InVT, In, NeedZero ? ZeroVec : Undef);
   OpHi = DAG.getBitcast(HalfVT, OpHi);
 
   return DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, OpLo, OpHi);
 }
[25,576 lines hidden]

llvm/test/CodeGen/X86/vector-zext.ll

[2,586 lines hidden]
 ; SSE41-NEXT: pxor %xmm2, %xmm2
 ; SSE41-NEXT: pmovzxdq {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero
 ; SSE41-NEXT: punpckhdq {{.*#+}} xmm1 = xmm1[2],xmm2[2],xmm1[3],xmm2[3]
 ; SSE41-NEXT: retq
 ;
 ; AVX1-LABEL: splatshuf_zext_v4i64:
 ; AVX1: # %bb.0:
 ; AVX1-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,0,0,0]
-; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
-; AVX1-NEXT: vpunpckhdq {{.*#+}} xmm1 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
 ; AVX1-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
-; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: splatshuf_zext_v4i64:
 ; AVX2: # %bb.0:
 ; AVX2-NEXT: vpbroadcastd %xmm0, %xmm0
 ; AVX2-NEXT: vpmovzxdq {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
 ; AVX2-NEXT: retq
 ;
[35 lines hidden]
 ; SSE41-NEXT: pshufb {{.*#+}} xmm1 = xmm1[0,1,2,3,6,7,14,15,0,1,6,7,6,7,14,15]
 ; SSE41-NEXT: pxor %xmm2, %xmm2
 ; SSE41-NEXT: pmovzxwd {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero
 ; SSE41-NEXT: punpckhwd {{.*#+}} xmm1 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
 ; SSE41-NEXT: retq
 ;
 ; AVX1-LABEL: splatshuf_zext_v8i32:
 ; AVX1: # %bb.0:
-; AVX1-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,1,2,3,6,7,14,15,0,1,6,7,6,7,14,15]
-; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
-; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm1 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
-; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
-; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,1],zero,zero,xmm0[2,3],zero,zero,xmm0[6,7],zero,zero,xmm0[14,15],zero,zero
+; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: splatshuf_zext_v8i32:
 ; AVX2: # %bb.0:
 ; AVX2-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[0,1,2,3,6,7,14,15,0,1,6,7,6,7,14,15]
 ; AVX2-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
 ; AVX2-NEXT: retq
 ;
[30 lines hidden]
 ; SSE41-NEXT: pshufb {{.*#+}} xmm1 = xmm1[14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14]
 ; SSE41-NEXT: pxor %xmm2, %xmm2
 ; SSE41-NEXT: pmovzxbw {{.*#+}} xmm0 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
 ; SSE41-NEXT: punpckhbw {{.*#+}} xmm1 = xmm1[8],xmm2[8],xmm1[9],xmm2[9],xmm1[10],xmm2[10],xmm1[11],xmm2[11],xmm1[12],xmm2[12],xmm1[13],xmm2[13],xmm1[14],xmm2[14],xmm1[15],xmm2[15]
 ; SSE41-NEXT: retq
 ;
 ; AVX1-LABEL: splatshuf_zext_v16i16:
 ; AVX1: # %bb.0:
-; AVX1-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14]
-; AVX1-NEXT: vpxor %xmm1, %xmm1, %xmm1
-; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm1 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
-; AVX1-NEXT: vpmovzxbw {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
-; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
+; AVX1-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[14],zero,xmm0[14],zero,xmm0[14],zero,xmm0[14],zero,xmm0[14],zero,xmm0[14],zero,xmm0[14],zero,xmm0[14],zero
+; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm0, %ymm0
 ; AVX1-NEXT: retq
 ;
 ; AVX2-LABEL: splatshuf_zext_v16i16:
 ; AVX2: # %bb.0:
 ; AVX2-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[14,14,14,14,14,14,14,14,14,14,14,14,14,14,14,14]
 ; AVX2-NEXT: vpmovzxbw {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
 ; AVX2-NEXT: retq
 ;
[9 lines hidden]