We currently don't use the PACKSS truncation combine for AVX512 targets, forcing targets without BWI to use the EXT+TRUNC pattern.
This patch came about as a possible tweak prior to adding AVX512BW PACKUS/PACKSS support for PR34871, which was concerned about the port5 load from double the number of truncations. But yes, register limits are likely to be an issue. I still need to finish PR34773 first, though, so I'll come back to this in a while.
KNL has slow vpacksswb ymm. It doesn't look optimal for most of these cases: vpmovzxwd / vpmovdb looks better. For AVX512F without BW+DQ, we should probably go ahead and use the pack sequence, i.e. assume that AVX2 instructions are fast.
Also note that KNL has faster vpmovzx than vpmovsx.
AVX2 ymm byte / word shuffles are SLOW on KNL, even though the xmm version is fast (except for pshufb). This sequence would make sense for AVX512F tune=generic (because it's very good on SKX and presumably future mainstream CPUs with good AVX2 support), but definitely *not* for -march=knl.
vpacksswb xmm is fast: 1 uop / 1c throughput / 2-6c latency, but the YMM versions of vpack / vpunpck (except for DQ and QDQ) are 5 uops / 9c throughput.
In this case: vpacksswb ymm + vpermq ymm = 5 + 1 = 6 shuffle uops, and maybe 10c throughput (9 + 1 assuming they all compete for the same execution resources and can't pipeline with each other).
2x vpmovzxwd y->z + 2x vpmovdb z->x + vinserti128 x->y = 5 shuffle uops, throughput = 2x2 + 2x1 + 1 = 7 cycles (with no decode stalls from multi-uop instructions). The extra ILP probably doesn't help at all because it appears there's only one shuffle execution unit (on FP0). So it's not *much* better, but avoiding the decode bottleneck should allow much better out-of-order execution and probably hyperthreading friendliness.
vpmovsx (all forms) is 2 uops on KNL, vs. 1 for vpmovzx (all element / register sizes). This is a big deal for the front-end (2c throughput vs. 7-8c throughput). If you're about to feed it to a truncate and are only doing it to work around the lack of AVX512BW, definitely use ZX.
If only one vector was needed, vpmovzx %ymm,%zmm / vpmovdb %zmm, %xmm looks like a big win according to Agner Fog's uarch guide + instruction tables.
This is a win: two 1-uop shuffles with 1c throughput (vextracti128 / vpackuswb xmm) are definitely better than vpmovzx (1 uop / 2c throughput) / vpmovdb (1 uop / 1c throughput).
And vpmovSX is 2 uops, 7-8c throughput (decode bottleneck), so the original was horrible because of the missed vpmovzx optimization, but the vpackuswb version is still better than that because it's only using XMM registers.
In most cases that's hopefully minor compared to the shuffle throughput gain from using vextracti128 / vpackss xmm (both 1c throughput on KNL).
vpmovsx is 2 uops on KNL, so it's a big missed optimization to use it instead of vpmovzx, but even vpmovzx is 1 uop / 2c throughput (not fully pipelined). See my previous comment.