This is an archive of the discontinued LLVM Phabricator instance.

[X86] Disable 512-bit vectors during type legalization for prefer-vector-width
Abandoned, Public

Authored by craig.topper on Dec 17 2017, 10:31 PM.

Details

Reviewers

hfinkel
spatel
RKSimon
echristo
chandlerc

Summary

This continues the work started in D41096.

This patch adds a new subtarget feature to indicate that there are no 512-bit vectors present in the function. When combined with the prefer-avx256 feature, it will disable 512-bit vectors in the legalizer. I intend to set this subtarget feature in getSubtargetImpl when the Function has a function attribute indicating that the required vector width for the function is less than 512 bits.
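As a rough illustration of the decision described above (not code from the patch), the check that getSubtargetImpl would perform can be modeled in standalone C++. The helper name `shouldSetNo512BitVectors` and the convention that an empty string means "attribute absent" are assumptions made for this sketch; the real attribute name and parsing are not shown in the summary.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper, not part of the patch: decides whether the
// "no-512-bit-vectors" subtarget feature should be added for a function.
// `requiredWidthAttr` models the string value of the function attribute that
// carries the required vector width; an empty string means "attribute absent".
bool shouldSetNo512BitVectors(const std::string &requiredWidthAttr) {
  if (requiredWidthAttr.empty())
    return false; // No attribute: assume 512-bit vectors may be needed.

  char *end = nullptr;
  unsigned long width = std::strtoul(requiredWidthAttr.c_str(), &end, 10);
  if (end == requiredWidthAttr.c_str() || *end != '\0')
    return false; // Malformed value: stay conservative.

  // Only a required width strictly below 512 bits rules out 512-bit vectors.
  return width < 512;
}
```

The sketch deliberately falls back to "512-bit vectors may be present" whenever the attribute is missing or unparsable, since wrongly disabling them could make required types illegal.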

I looked into trying to add a 512-bit register feature that could be disabled instead like D41096 proposed, but I couldn't find a good way to make the existing command line options work. The tablegen generated subtarget feature system just doesn't allow for a wrapper/alias feature that implies other features.

I had to allow VK32 to be legal with BWI and (VLX || 512-bit vectors). VK64 is only legal with BWI and 512-bit vectors. VK64 is only needed when v64i8 is legal.
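The two legality conditions in the paragraph above can be written as simple predicates. This is only a model of the wording in the summary: the struct `SubtargetModel` and its field names are hypothetical, and the actual patch expresses the conditions through the register-class setup in X86ISelLowering.cpp.

```cpp
// Hypothetical stand-in for the relevant subtarget state.
struct SubtargetModel {
  bool HasBWI;             // AVX512BW is available
  bool HasVLX;             // AVX512VL is available
  bool Allow512BitVectors; // 512-bit vectors have not been disabled
};

// VK32 (v32i1 masks): legal with BWI and (VLX || 512-bit vectors).
bool isVK32Legal(const SubtargetModel &ST) {
  return ST.HasBWI && (ST.HasVLX || ST.Allow512BitVectors);
}

// VK64 (v64i1 masks): legal only with BWI and 512-bit vectors, because it is
// only needed when v64i8 is legal.
bool isVK64Legal(const SubtargetModel &ST) {
  return ST.HasBWI && ST.Allow512BitVectors;
}
```

With BWI and VLX but 512-bit vectors disabled, `isVK32Legal` is true while `isVK64Legal` is false, which matches the split described in the summary.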

I know there are still places in the code that extend narrow vectors to 512-bit to do an operation on wider vector elements and truncate back. Those will need to be split before extending instead.
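To make "split before extending" concrete, here is a small scalar model, assuming an unsigned multiply-high on 32 byte lanes as the example operation (chosen only for illustration). Widening all 32 lanes to 16 bits at once corresponds to a 512-bit intermediate value; splitting first means each half widens to only 256 bits. The function names are hypothetical and the loops stand in for vector instructions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Reference path: zero-extend every u8 lane to 16 bits, multiply, and keep
// the high byte. For N == 32 the widened value would be 512 bits.
template <std::size_t N>
std::array<std::uint8_t, N> mulhuViaExtend(const std::array<std::uint8_t, N> &a,
                                           const std::array<std::uint8_t, N> &b) {
  std::array<std::uint8_t, N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    std::uint16_t wa = a[i], wb = b[i];
    out[i] = static_cast<std::uint8_t>((static_cast<unsigned>(wa) * wb) >> 8);
  }
  return out;
}

// Split path: cut the 32 lanes into two 16-lane halves first, so each half
// only widens to 16 x 16 bits = 256 bits, then concatenate the results.
std::array<std::uint8_t, 32> mulhuSplit(const std::array<std::uint8_t, 32> &a,
                                        const std::array<std::uint8_t, 32> &b) {
  std::array<std::uint8_t, 16> aLo{}, aHi{}, bLo{}, bHi{};
  for (std::size_t i = 0; i < 16; ++i) {
    aLo[i] = a[i];
    aHi[i] = a[i + 16];
    bLo[i] = b[i];
    bHi[i] = b[i + 16];
  }
  const std::array<std::uint8_t, 16> lo = mulhuViaExtend<16>(aLo, bLo);
  const std::array<std::uint8_t, 16> hi = mulhuViaExtend<16>(aHi, bHi);

  std::array<std::uint8_t, 32> out{};
  for (std::size_t i = 0; i < 16; ++i) {
    out[i] = lo[i];
    out[i + 16] = hi[i];
  }
  return out;
}
```

Both paths produce the same lanes; the difference is only in the widest intermediate type, which is what matters once 512-bit types are no longer legal.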

I plan to add some basic tests for this as well and I'll be adding tests as I fix the various extending lowerings mentioned above.

Event Timeline

craig.topper created this revision. Dec 17 2017, 10:31 PM

craig.topper added a parent revision: D41096: [X86] Initial support for prefer-vector-width function attribute.

Add some tests.

Fix LowerMULH and LowerShift to not extend to 512 bits when it's not legal.

I looked into trying to add a 512-bit register feature that could be disabled instead like D41096 proposed, but I couldn't find a good way to make the existing command line options work. The tablegen generated subtarget feature system just doesn't allow for a wrapper/alias feature that implies other features.

Thanks for experimenting.

lib/Target/X86/X86Subtarget.h
584	This sits on top of D41096? I thought it would replace it. Do we need to prefer AVX2 if we have AVX-512 without using zmm?

craig.topper added inline comments. Dec 19 2017, 10:42 PM

lib/Target/X86/X86Subtarget.h
584	It's not prefer AVX2. It's prefer 256-bit vectors. The name could be better. From our previous discussions, we still wanted to "prefer 256-bit" even when the user uses 512-bit explicitly, unless they pass -mprefer-vector-width=512. And we need the prefer flag to be a property of the affected CPUs. So this flag represents those two things. We can also use this flag to do targeted fixes that disable extensions to 512-bit when the CPU prefers 256-bit but we weren't able to disable 512-bit types in the legalizer. I think the LowerShift and LowerMULH fixes in this patch might want to be qualified with only "prefer 256-bit" rather than "512-bit types are illegal".

This patch now contains fixes for all of the known issues with lowering trying to use 512-bit types for extending to make operations legal.

I've modified the logic for disabling 512-bit types a little. I now only disable 512-bit types if the ABI doesn't require them, the CPU prefers 256-bit, and we have the AVX512VL instruction set (the feature that enables masking on 128/256-bit registers). This is fine for SKX, but means -mprefer-vector-width=256 will not disable the use of 512-bit regs on KNL, as it doesn't have AVX512VL. But KNL doesn't have the frequency issue either, so it's probably OK. This was mainly done to avoid having to deal with a situation where we would support masking on scalar operations but not be able to do any masking on vectors. By introducing the VLX requirement before disabling 512-bit vectors, we ensure we always have vector+scalar masking, and we can continue widening to 512 bits for masking when AVX512VL isn't available.
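The three-part condition above can be sketched as a single predicate. This is a standalone model under assumed names (`CpuModel`, `use512BitRegs`, `requires512Bit`); the patch itself exposes the result through Subtarget.useAVX512Regs(), whose actual implementation is not shown on this page.

```cpp
// Hypothetical stand-in for the CPU/subtarget properties mentioned above.
struct CpuModel {
  bool HasAVX512;     // AVX-512 foundation is available
  bool HasVLX;        // AVX512VL: masking on 128/256-bit registers
  bool Prefers256Bit; // CPU (or -mprefer-vector-width=256) prefers 256-bit
};

// Returns true when 512-bit vector types should remain legal.
// `requires512Bit` models "the ABI (or an intrinsic) requires 512-bit vectors".
bool use512BitRegs(const CpuModel &cpu, bool requires512Bit) {
  if (!cpu.HasAVX512)
    return false; // No 512-bit registers at all.

  // 512-bit types are disabled only when all three conditions hold.
  const bool disable = !requires512Bit && cpu.Prefers256Bit && cpu.HasVLX;
  return !disable;
}
```

For an SKX-like model (AVX512VL present, prefers 256-bit) the predicate returns false unless 512-bit vectors are required; for a KNL-like model without AVX512VL it stays true even when 256-bit is preferred, which is the behavior described in the comment.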

I've added an experimental pass just before isel that detects intrinsics and ABI requirements for needing 512-bit vectors and adds the appropriate require-vector-width attribute based on what it finds. This should hopefully avoid any miscompiles or isel failures for testing this feature. It currently only sets the required width to 256 or 512, but it can be made more generic in the future.
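A minimal model of what such an inference could look like is sketched below. It is not the code in X86VectorWidthInfer.cpp (which is not shown on this page); `FunctionModel` and `inferRequiredVectorWidth` are hypothetical names, and the function is reduced to two lists of vector widths in bits.

```cpp
#include <vector>

// Hypothetical summary of one function as seen by the inference.
struct FunctionModel {
  bool HasRequireVectorWidthAttr;            // attribute already present
  std::vector<unsigned> AbiVectorBits;       // vector args/returns, in bits
  std::vector<unsigned> IntrinsicVectorBits; // vectors used by intrinsics
};

// Returns 0 when the attribute is already present (nothing to infer).
// Otherwise returns 512 if any ABI-visible or intrinsic vector is at least
// 512 bits wide, and 256 in every other case, mirroring the "256 or 512"
// limitation described above.
unsigned inferRequiredVectorWidth(const FunctionModel &F) {
  if (F.HasRequireVectorWidthAttr)
    return 0;
  for (unsigned Bits : F.AbiVectorBits)
    if (Bits >= 512)
      return 512;
  for (unsigned Bits : F.IntrinsicVectorBits)
    if (Bits >= 512)
      return 512;
  return 256;
}
```

Skipping functions that already carry the attribute matches the comment added in X86.h, which says the pass only infers a width when require-vector-width isn't present.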

I found all of the lowering fixes by enabling the pass and forcing the prefer-vector-width-256 feature on for all CPUs, then looking for assertion failures and llvm_unreachables in the X86 codegen lit tests.

I plan to add more directed tests for each of the lowering fixes.

Herald added a subscriber: mgorny. Dec 22 2017, 12:57 PM

spatel mentioned this in D41618: [x86] allow pairs of PCMPEQ for vector-sized integer equality comparisons (PR33325). Jan 1 2018, 9:43 AM

Added test cases for all places where I had to prevent extending to 512-bit to legalize an operation.

Fixed a bug in zext and sext lowering that created a v8i8 node and forced scalarization.

craig.topper added reviewers: echristo, chandlerc. Jan 4 2018, 4:05 PM

craig.topper retitled this revision from [X86] WIP disable 512-bit vectors during type legalization for prefer-vector-width to [X86] Disable 512-bit vectors during type legalization for prefer-vector-width. Jan 8 2018, 10:24 AM

echristo added inline comments. Jan 8 2018, 5:50 PM

lib/Target/X86/X86Subtarget.h
590	I think I'd rather a preferred-vector-width attribute rather than the combination of 128/256/etc features. Thoughts?

craig.topper mentioned this in D41895: [X86] Another attempt at support prefer-vector-width function attribute. Jan 9 2018, 6:02 PM

craig.topper mentioned this in D42724: [X86] Don't make 512-bit vectors legal when preferred vector width is 256 bits and 512 bits aren't required. Jan 30 2018, 5:46 PM

craig.topper abandoned this revision. Feb 4 2018, 8:30 PM

Herald added a subscriber: hintonda. Feb 4 2018, 8:30 PM

craig.topper mentioned this in D123284: [ArgPromotion][Attributor] Update min-legal-vector-width when do promotion. Dec 8 2022, 12:00 PM

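Revision Contents

Path	Size
lib/Target/X86/[file name not preserved]	1 line
lib/Target/X86/[file name not preserved]	6 lines
lib/Target/X86/[file name not preserved]	6 lines
lib/Target/X86/[file name not preserved]	208 lines
lib/Target/X86/[file name not preserved]	14 lines
lib/Target/X86/[file name not preserved]	1 line
lib/Target/X86/[file name not preserved]	22 lines
lib/Target/X86/X86TargetTransformInfo.cpp	2 lines
lib/Target/X86/X86VectorWidthInfer.cpp	126 lines
test/CodeGen/X86/prefer-avx256-lzcnt.ll	139 lines
test/CodeGen/X86/prefer-avx256-mask-extend.ll	265 lines
test/CodeGen/X86/prefer-avx256-mask-shuffle.ll	185 lines
test/CodeGen/X86/prefer-avx256-popcnt.ll	104 lines
test/CodeGen/X86/prefer-avx256-shift.ll	419 lines
test/CodeGen/X86/prefer-avx256-trunc.ll	41 lines
test/CodeGen/X86/prefer-avx256-wide-mul.ll	67 lines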

Diff 128583

lib/Target/X86/CMakeLists.txt

[...]	set(sources
X86RegisterInfo.cpp		X86RegisterInfo.cpp
X86SelectionDAGInfo.cpp		X86SelectionDAGInfo.cpp
X86ShuffleDecodeConstantPool.cpp		X86ShuffleDecodeConstantPool.cpp
X86Subtarget.cpp		X86Subtarget.cpp
X86TargetMachine.cpp		X86TargetMachine.cpp
X86TargetObjectFile.cpp		X86TargetObjectFile.cpp
X86TargetTransformInfo.cpp		X86TargetTransformInfo.cpp
X86VZeroUpper.cpp		X86VZeroUpper.cpp
		X86VectorWidthInfer.cpp
X86WinAllocaExpander.cpp		X86WinAllocaExpander.cpp
X86WinEHState.cpp		X86WinEHState.cpp
X86CallingConv.cpp		X86CallingConv.cpp
)		)

add_llvm_target(X86CodeGen ${sources})		add_llvm_target(X86CodeGen ${sources})

add_subdirectory(AsmParser)		add_subdirectory(AsmParser)
add_subdirectory(Disassembler)		add_subdirectory(Disassembler)
add_subdirectory(InstPrinter)		add_subdirectory(InstPrinter)
add_subdirectory(MCTargetDesc)		add_subdirectory(MCTargetDesc)
add_subdirectory(TargetInfo)		add_subdirectory(TargetInfo)
add_subdirectory(Utils)		add_subdirectory(Utils)

lib/Target/X86/X86.h

[...]
	FunctionPass *createX86EvexToVexInsts();			FunctionPass *createX86EvexToVexInsts();

	InstructionSelector *createX86InstructionSelector(const X86TargetMachine &TM,			InstructionSelector *createX86InstructionSelector(const X86TargetMachine &TM,
	X86Subtarget &,			X86Subtarget &,
	X86RegisterBankInfo &);			X86RegisterBankInfo &);

	void initializeEvexToVexInstPassPass(PassRegistry &);			void initializeEvexToVexInstPassPass(PassRegistry &);

				/// This pass tries to infer a required vector width for a function if the
				/// require-vector-width attribute isn't present.
				FunctionPass *createX86VectorWidthInferPass();

				void initializeX86VectorWidthInferPass(PassRegistry &);

	} // End llvm namespace			} // End llvm namespace

	#endif			#endif

lib/Target/X86/X86.td

[...]
	def FeatureHasFastGather			def FeatureHasFastGather
	: SubtargetFeature<"fast-gather", "HasFastGather", "true",			: SubtargetFeature<"fast-gather", "HasFastGather", "true",
	"Indicates if gather is reasonably fast.">;			"Indicates if gather is reasonably fast.">;

	def FeaturePreferVecWidth256			def FeaturePreferVecWidth256
	: SubtargetFeature<"prefer-vector-width-256", "PreferVecWidth256", "true",			: SubtargetFeature<"prefer-vector-width-256", "PreferVecWidth256", "true",
	"Prefer 256-bit AVX instructions">;			"Prefer 256-bit AVX instructions">;

				// This feature is used in combination with prefer-avx256 to disable 512-bit
				// instructions in the legalizer.
				def FeatureNo512BitVectors
				: SubtargetFeature<"no-512-bit-vectors", "No512BitVectors", "true",
				"No 512-bit vectors present in function">;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Register File Description			// Register File Description
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	include "X86RegisterInfo.td"			include "X86RegisterInfo.td"
	include "X86RegisterBanks.td"			include "X86RegisterBanks.td"

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
[...]

lib/Target/X86/X86ISelLowering.cpp

[...]	if (HasInt256) {

for (auto VT : { MVT::v4i32, MVT::v8i32, MVT::v2i64, MVT::v4i64,		for (auto VT : { MVT::v4i32, MVT::v8i32, MVT::v2i64, MVT::v4i64,
MVT::v4f32, MVT::v8f32, MVT::v2f64, MVT::v4f64 })		MVT::v4f32, MVT::v8f32, MVT::v2f64, MVT::v4f64 })
setOperationAction(ISD::MGATHER, VT, Custom);		setOperationAction(ISD::MGATHER, VT, Custom);
}		}
}		}

if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {		if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {
addRegisterClass(MVT::v16i32, &X86::VR512RegClass);
addRegisterClass(MVT::v16f32, &X86::VR512RegClass);
addRegisterClass(MVT::v8i64, &X86::VR512RegClass);
addRegisterClass(MVT::v8f64, &X86::VR512RegClass);

addRegisterClass(MVT::v1i1, &X86::VK1RegClass);		addRegisterClass(MVT::v1i1, &X86::VK1RegClass);
addRegisterClass(MVT::v8i1, &X86::VK8RegClass);		addRegisterClass(MVT::v8i1, &X86::VK8RegClass);
addRegisterClass(MVT::v16i1, &X86::VK16RegClass);		addRegisterClass(MVT::v16i1, &X86::VK16RegClass);

setOperationAction(ISD::SELECT, MVT::v1i1, Custom);		setOperationAction(ISD::SELECT, MVT::v1i1, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v1i1, Custom);		setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v1i1, Custom);
setOperationAction(ISD::BUILD_VECTOR, MVT::v1i1, Custom);		setOperationAction(ISD::BUILD_VECTOR, MVT::v1i1, Custom);

[...]	for (auto VT : { MVT::v8i1, MVT::v16i1 }) {
setOperationAction(ISD::VSELECT, VT, Expand);		setOperationAction(ISD::VSELECT, VT, Expand);
}		}

setOperationAction(ISD::CONCAT_VECTORS, MVT::v16i1, Custom);		setOperationAction(ISD::CONCAT_VECTORS, MVT::v16i1, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v8i1, Custom);		setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v8i1, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v16i1, Custom);		setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v16i1, Custom);
for (auto VT : { MVT::v1i1, MVT::v8i1 })		for (auto VT : { MVT::v1i1, MVT::v8i1 })
setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);		setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);
		}

		if (!Subtarget.useSoftFloat() && Subtarget.useAVX512Regs()) {
		addRegisterClass(MVT::v16i32, &X86::VR512RegClass);
		addRegisterClass(MVT::v16f32, &X86::VR512RegClass);
		addRegisterClass(MVT::v8i64, &X86::VR512RegClass);
		addRegisterClass(MVT::v8f64, &X86::VR512RegClass);

for (MVT VT : MVT::fp_vector_valuetypes())		for (MVT VT : MVT::fp_vector_valuetypes())
setLoadExtAction(ISD::EXTLOAD, VT, MVT::v8f32, Legal);		setLoadExtAction(ISD::EXTLOAD, VT, MVT::v8f32, Legal);

for (auto ExtType : {ISD::ZEXTLOAD, ISD::SEXTLOAD}) {		for (auto ExtType : {ISD::ZEXTLOAD, ISD::SEXTLOAD}) {
setLoadExtAction(ExtType, MVT::v16i32, MVT::v16i8, Legal);		setLoadExtAction(ExtType, MVT::v16i32, MVT::v16i8, Legal);
setLoadExtAction(ExtType, MVT::v16i32, MVT::v16i16, Legal);		setLoadExtAction(ExtType, MVT::v16i32, MVT::v16i16, Legal);
setLoadExtAction(ExtType, MVT::v8i64, MVT::v8i8, Legal);		setLoadExtAction(ExtType, MVT::v8i64, MVT::v8i8, Legal);
[...]	for (auto VT : { MVT::v16i32, MVT::v8i64, MVT::v16f32, MVT::v8f64 }) {
setOperationAction(ISD::MSCATTER, VT, Custom);		setOperationAction(ISD::MSCATTER, VT, Custom);
}		}
for (auto VT : { MVT::v64i8, MVT::v32i16, MVT::v16i32 }) {		for (auto VT : { MVT::v64i8, MVT::v32i16, MVT::v16i32 }) {
setOperationPromotedToType(ISD::LOAD, VT, MVT::v8i64);		setOperationPromotedToType(ISD::LOAD, VT, MVT::v8i64);
setOperationPromotedToType(ISD::SELECT, VT, MVT::v8i64);		setOperationPromotedToType(ISD::SELECT, VT, MVT::v8i64);
}		}
}// has AVX-512		}// has AVX-512

if (!Subtarget.useSoftFloat() &&		if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {
(Subtarget.hasAVX512() || Subtarget.hasVLX())) {
// These operations are handled on non-VLX by artificially widening in		// These operations are handled on non-VLX by artificially widening in
// isel patterns.		// isel patterns.
// TODO: Custom widen in lowering on non-VLX and drop the isel patterns?		// TODO: Custom widen in lowering on non-VLX and drop the isel patterns?

setOperationAction(ISD::FP_TO_UINT, MVT::v8i32, Legal);		setOperationAction(ISD::FP_TO_UINT, MVT::v8i32, Legal);
setOperationAction(ISD::FP_TO_UINT, MVT::v4i32, Legal);		setOperationAction(ISD::FP_TO_UINT, MVT::v4i32, Legal);
setOperationAction(ISD::FP_TO_UINT, MVT::v2i32, Custom);		setOperationAction(ISD::FP_TO_UINT, MVT::v2i32, Custom);
setOperationAction(ISD::UINT_TO_FP, MVT::v8i32, Legal);		setOperationAction(ISD::UINT_TO_FP, MVT::v8i32, Legal);
[...]	if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {

if (Subtarget.hasVPOPCNTDQ()) {		if (Subtarget.hasVPOPCNTDQ()) {
for (auto VT : { MVT::v4i32, MVT::v8i32, MVT::v2i64, MVT::v4i64 })		for (auto VT : { MVT::v4i32, MVT::v8i32, MVT::v2i64, MVT::v4i64 })
setOperationAction(ISD::CTPOP, VT, Legal);		setOperationAction(ISD::CTPOP, VT, Legal);
}		}
}		}

if (!Subtarget.useSoftFloat() && Subtarget.hasBWI()) {		if (!Subtarget.useSoftFloat() && Subtarget.hasBWI()) {
addRegisterClass(MVT::v32i16, &X86::VR512RegClass);
addRegisterClass(MVT::v64i8, &X86::VR512RegClass);

addRegisterClass(MVT::v32i1, &X86::VK32RegClass);		addRegisterClass(MVT::v32i1, &X86::VK32RegClass);
addRegisterClass(MVT::v64i1, &X86::VK64RegClass);

for (auto VT : { MVT::v32i1, MVT::v64i1 }) {		for (auto VT : { MVT::v32i1 }) {
setOperationAction(ISD::ADD, VT, Custom);		setOperationAction(ISD::ADD, VT, Custom);
setOperationAction(ISD::SUB, VT, Custom);		setOperationAction(ISD::SUB, VT, Custom);
setOperationAction(ISD::MUL, VT, Custom);		setOperationAction(ISD::MUL, VT, Custom);
setOperationAction(ISD::VSELECT, VT, Expand);		setOperationAction(ISD::VSELECT, VT, Expand);

setOperationAction(ISD::TRUNCATE, VT, Custom);		setOperationAction(ISD::TRUNCATE, VT, Custom);
setOperationAction(ISD::SETCC, VT, Custom);		setOperationAction(ISD::SETCC, VT, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);		setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Custom);		setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::SELECT, VT, Custom);		setOperationAction(ISD::SELECT, VT, Custom);
setOperationAction(ISD::BUILD_VECTOR, VT, Custom);		setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);		setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);
}		}

setOperationAction(ISD::CONCAT_VECTORS, MVT::v32i1, Custom);		setOperationAction(ISD::CONCAT_VECTORS, MVT::v32i1, Custom);
setOperationAction(ISD::CONCAT_VECTORS, MVT::v64i1, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v32i1, Custom);		setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v32i1, Custom);
setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v64i1, Custom);		setOperationAction(ISD::EXTRACT_SUBVECTOR, MVT::v16i1, Custom);
for (auto VT : { MVT::v16i1, MVT::v32i1 })
setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);

// Extends from v32i1 masks to 256-bit vectors.		// Extends from v32i1 masks to 256-bit vectors.
setOperationAction(ISD::SIGN_EXTEND, MVT::v32i8, Custom);		setOperationAction(ISD::SIGN_EXTEND, MVT::v32i8, Custom);
setOperationAction(ISD::ZERO_EXTEND, MVT::v32i8, Custom);		setOperationAction(ISD::ZERO_EXTEND, MVT::v32i8, Custom);
setOperationAction(ISD::ANY_EXTEND, MVT::v32i8, Custom);		setOperationAction(ISD::ANY_EXTEND, MVT::v32i8, Custom);
		}

		if (!Subtarget.useSoftFloat() && Subtarget.useBWIRegs()) {
		addRegisterClass(MVT::v32i16, &X86::VR512RegClass);
		addRegisterClass(MVT::v64i8, &X86::VR512RegClass);

		addRegisterClass(MVT::v64i1, &X86::VK64RegClass);

		for (auto VT : { MVT::v64i1 }) {
		setOperationAction(ISD::ADD, VT, Custom);
		setOperationAction(ISD::SUB, VT, Custom);
		setOperationAction(ISD::MUL, VT, Custom);
		setOperationAction(ISD::VSELECT, VT, Expand);

		setOperationAction(ISD::TRUNCATE, VT, Custom);
		setOperationAction(ISD::SETCC, VT, Custom);
		setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
		setOperationAction(ISD::INSERT_VECTOR_ELT, VT, Custom);
		setOperationAction(ISD::SELECT, VT, Custom);
		setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
		setOperationAction(ISD::VECTOR_SHUFFLE, VT, Custom);
		}

		setOperationAction(ISD::CONCAT_VECTORS, MVT::v64i1, Custom);
		setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v64i1, Custom);
		setOperationAction(ISD::EXTRACT_SUBVECTOR, MVT::v32i1, Custom);

// Extends from v64i1 masks to 512-bit vectors.		// Extends from v64i1 masks to 512-bit vectors.
setOperationAction(ISD::SIGN_EXTEND, MVT::v64i8, Custom);		setOperationAction(ISD::SIGN_EXTEND, MVT::v64i8, Custom);
setOperationAction(ISD::ZERO_EXTEND, MVT::v64i8, Custom);		setOperationAction(ISD::ZERO_EXTEND, MVT::v64i8, Custom);
setOperationAction(ISD::ANY_EXTEND, MVT::v64i8, Custom);		setOperationAction(ISD::ANY_EXTEND, MVT::v64i8, Custom);

setOperationAction(ISD::MUL, MVT::v32i16, Legal);		setOperationAction(ISD::MUL, MVT::v32i16, Legal);
setOperationAction(ISD::MUL, MVT::v64i8, Custom);		setOperationAction(ISD::MUL, MVT::v64i8, Custom);
setOperationAction(ISD::MULHS, MVT::v32i16, Legal);		setOperationAction(ISD::MULHS, MVT::v32i16, Legal);
[...]	if (!Subtarget.useSoftFloat() && Subtarget.useBWIRegs()) {
}		}

if (Subtarget.hasBITALG()) {		if (Subtarget.hasBITALG()) {
for (auto VT : { MVT::v64i8, MVT::v32i16 })		for (auto VT : { MVT::v64i8, MVT::v32i16 })
setOperationAction(ISD::CTPOP, VT, Legal);		setOperationAction(ISD::CTPOP, VT, Legal);
}		}
}		}

if (!Subtarget.useSoftFloat() && Subtarget.hasBWI() &&		if (!Subtarget.useSoftFloat() && Subtarget.hasBWI()) {
(Subtarget.hasAVX512() || Subtarget.hasVLX())) {
for (auto VT : { MVT::v32i8, MVT::v16i8, MVT::v16i16, MVT::v8i16 }) {		for (auto VT : { MVT::v32i8, MVT::v16i8, MVT::v16i16, MVT::v8i16 }) {
setOperationAction(ISD::MLOAD, VT, Subtarget.hasVLX() ? Legal : Custom);		setOperationAction(ISD::MLOAD, VT, Subtarget.hasVLX() ? Legal : Custom);
setOperationAction(ISD::MSTORE, VT, Subtarget.hasVLX() ? Legal : Custom);		setOperationAction(ISD::MSTORE, VT, Subtarget.hasVLX() ? Legal : Custom);
}		}

// These operations are handled on non-VLX by artificially widening in		// These operations are handled on non-VLX by artificially widening in
// isel patterns.		// isel patterns.
// TODO: Custom widen in lowering on non-VLX and drop the isel patterns?		// TODO: Custom widen in lowering on non-VLX and drop the isel patterns?
[...]	case MVT::v4i1:
ExtVT = MVT::v4i32;		ExtVT = MVT::v4i32;
break;		break;
case MVT::v8i1:		case MVT::v8i1:
// Take 512-bit type, more shuffles on KNL. If we have VLX use a 256-bit		// Take 512-bit type, more shuffles on KNL. If we have VLX use a 256-bit
// shuffle.		// shuffle.
ExtVT = Subtarget.hasVLX() ? MVT::v8i32 : MVT::v8i64;		ExtVT = Subtarget.hasVLX() ? MVT::v8i32 : MVT::v8i64;
break;		break;
case MVT::v16i1:		case MVT::v16i1:
ExtVT = MVT::v16i32;		// Take 512-bit type, unless we are forbidden to use 512-bit types.
		ExtVT = DAG.getTargetLoweringInfo().isTypeLegal(MVT::v16i32) ? MVT::v16i32
		: MVT::v16i16;
break;		break;
case MVT::v32i1:		case MVT::v32i1:
ExtVT = MVT::v32i16;		// Take 512-bit type, unless we are forbidden to use 512-bit types.
		ExtVT = DAG.getTargetLoweringInfo().isTypeLegal(MVT::v32i16) ? MVT::v32i16
		: MVT::v32i8;
break;		break;
case MVT::v64i1:		case MVT::v64i1:
ExtVT = MVT::v64i8;		ExtVT = MVT::v64i8;
break;		break;
}		}

if (ISD::isBuildVectorAllZeros(V1.getNode()))		if (ISD::isBuildVectorAllZeros(V1.getNode()))
V1 = getZeroVector(ExtVT, Subtarget, DAG, DL);		V1 = getZeroVector(ExtVT, Subtarget, DAG, DL);
[...]	static SDValue LowerZERO_EXTEND_Mask(SDValue Op,
MVT InVT = In.getSimpleValueType();		MVT InVT = In.getSimpleValueType();
assert(InVT.getVectorElementType() == MVT::i1 && "Unexpected input type!");		assert(InVT.getVectorElementType() == MVT::i1 && "Unexpected input type!");
SDLoc DL(Op);		SDLoc DL(Op);
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();

// Extend VT if the scalar type is v8/v16 and BWI is not supported.		// Extend VT if the scalar type is v8/v16 and BWI is not supported.
MVT ExtVT = VT;		MVT ExtVT = VT;
if (!Subtarget.hasBWI() &&		if (!Subtarget.hasBWI() &&
(VT.getVectorElementType().getSizeInBits() <= 16))		(VT.getVectorElementType().getSizeInBits() <= 16)) {
		// If v16i32 isn't legal we'll need to split and concatenate.
		if (NumElts == 16 && !Subtarget.useAVX512Regs()) {
		SDValue Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v8i1, In,
		DAG.getIntPtrConstant(0, DL));
		SDValue Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v8i1, In,
		DAG.getIntPtrConstant(8, DL));
		Lo = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::v8i16, Lo);
		Hi = DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::v8i16, Hi);
		SDValue Res = DAG.getNode(ISD::CONCAT_VECTORS, DL, MVT::v16i16, Lo, Hi);
		return DAG.getNode(ISD::TRUNCATE, DL, VT, Res);
		}

ExtVT = MVT::getVectorVT(MVT::i32, NumElts);		ExtVT = MVT::getVectorVT(MVT::i32, NumElts);
		}

// Widen to 512-bits if VLX is not supported.		// Widen to 512-bits if VLX is not supported.
MVT WideVT = ExtVT;		MVT WideVT = ExtVT;
if (!ExtVT.is512BitVector() && !Subtarget.hasVLX()) {		if (!ExtVT.is512BitVector() && !Subtarget.hasVLX()) {
NumElts *= 512 / ExtVT.getSizeInBits();		NumElts *= 512 / ExtVT.getSizeInBits();
InVT = MVT::getVectorVT(MVT::i1, NumElts);		InVT = MVT::getVectorVT(MVT::i1, NumElts);
In = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, InVT, DAG.getUNDEF(InVT),		In = DAG.getNode(ISD::INSERT_SUBVECTOR, DL, InVT, DAG.getUNDEF(InVT),
In, DAG.getIntPtrConstant(0, DL));		In, DAG.getIntPtrConstant(0, DL));
[...]	if (Subtarget.hasBWI()) {
In = DAG.getBitcast(InVT, In);		In = DAG.getBitcast(InVT, In);
}		}
return DAG.getNode(X86ISD::CVT2MASK, DL, VT, In);		return DAG.getNode(X86ISD::CVT2MASK, DL, VT, In);
}		}
// Use TESTD/Q, extended vector to packed dword/qword.		// Use TESTD/Q, extended vector to packed dword/qword.
assert((InVT.is256BitVector() || InVT.is128BitVector()) &&		assert((InVT.is256BitVector() || InVT.is128BitVector()) &&
"Unexpected vector type.");		"Unexpected vector type.");
unsigned NumElts = InVT.getVectorNumElements();		unsigned NumElts = InVT.getVectorNumElements();
		if (NumElts == 16 && !Subtarget.useAVX512Regs()) {
		assert(Subtarget.hasVLX() && "Can't use 512-bit registers or VLX?");
		// If we can't use 512-bit ops we'll need to split this to use
		// MVT::v8i32 and concat the result.
		if (InVT == MVT::v16i8) {
		// First we need to sign extend up to 256-bits so we can split that.
		InVT = MVT::v16i16;
		In = DAG.getNode(ISD::SIGN_EXTEND, DL, InVT, In);
		}
		SDValue Lo = extract128BitVector(In, 0, DAG, DL);
		SDValue Hi = extract128BitVector(In, 8, DAG, DL);
		// We're split now, just emit two truncates and a concat. The two
		// truncates will trigger legalization to come back to this function.
		Lo = DAG.getNode(ISD::TRUNCATE, DL, MVT::v8i1, Lo);
		Hi = DAG.getNode(ISD::TRUNCATE, DL, MVT::v8i1, Hi);
		return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, Lo, Hi);
		}
MVT EltVT = Subtarget.hasVLX() ? MVT::i32 : MVT::getIntegerVT(512/NumElts);		MVT EltVT = Subtarget.hasVLX() ? MVT::i32 : MVT::getIntegerVT(512/NumElts);
MVT ExtVT = MVT::getVectorVT(EltVT, NumElts);		MVT ExtVT = MVT::getVectorVT(EltVT, NumElts);
In = DAG.getNode(ISD::SIGN_EXTEND, DL, ExtVT, In);		In = DAG.getNode(ISD::SIGN_EXTEND, DL, ExtVT, In);
InVT = ExtVT;		InVT = ExtVT;
ShiftInx = InVT.getScalarSizeInBits() - 1;		ShiftInx = InVT.getScalarSizeInBits() - 1;
}		}

if (DAG.ComputeNumSignBits(In) < InVT.getScalarSizeInBits()) {		if (DAG.ComputeNumSignBits(In) < InVT.getScalarSizeInBits()) {
[...]	assert(VT.getVectorNumElements() == InVT.getVectorNumElements() &&
"Invalid TRUNCATE operation");		"Invalid TRUNCATE operation");

if (VT.getVectorElementType() == MVT::i1)		if (VT.getVectorElementType() == MVT::i1)
return LowerTruncateVecI1(Op, DAG, Subtarget);		return LowerTruncateVecI1(Op, DAG, Subtarget);

// vpmovqb/w/d, vpmovdb/w, vpmovwb		// vpmovqb/w/d, vpmovdb/w, vpmovwb
if (Subtarget.hasAVX512()) {		if (Subtarget.hasAVX512()) {
// word to byte only under BWI		// word to byte only under BWI
if (InVT == MVT::v16i16 && !Subtarget.hasBWI()) // v16i16 -> v16i8		if (InVT == MVT::v16i16 && !Subtarget.hasBWI()) { // v16i16 -> v16i8
		if (Subtarget.useAVX512Regs())
return DAG.getNode(X86ISD::VTRUNC, DL, VT,		return DAG.getNode(X86ISD::VTRUNC, DL, VT,
getExtendInVec(X86ISD::VSEXT, DL, MVT::v16i32, In, DAG));		getExtendInVec(X86ISD::VSEXT, DL, MVT::v16i32, In,
		DAG));
		} else {
return DAG.getNode(X86ISD::VTRUNC, DL, VT, In);		return DAG.getNode(X86ISD::VTRUNC, DL, VT, In);
}		}
		}

// Truncate with PACKSS if we are truncating a vector with sign-bits that		// Truncate with PACKSS if we are truncating a vector with sign-bits that
// extend all the way to the packed/truncated value.		// extend all the way to the packed/truncated value.
unsigned NumPackedBits = std::min<unsigned>(VT.getScalarSizeInBits(), 16);		unsigned NumPackedBits = std::min<unsigned>(VT.getScalarSizeInBits(), 16);
if ((InNumEltBits - NumPackedBits) < DAG.ComputeNumSignBits(In))		if ((InNumEltBits - NumPackedBits) < DAG.ComputeNumSignBits(In))
if (SDValue V =		if (SDValue V =
truncateVectorWithPACK(X86ISD::PACKSS, VT, In, DL, DAG, Subtarget))		truncateVectorWithPACK(X86ISD::PACKSS, VT, In, DL, DAG, Subtarget))
return V;		return V;
[...]	static SDValue LowerSIGN_EXTEND_Mask(SDValue Op,
assert(InVT.getVectorElementType() == MVT::i1 && "Unexpected input type!");		assert(InVT.getVectorElementType() == MVT::i1 && "Unexpected input type!");
MVT VTElt = VT.getVectorElementType();		MVT VTElt = VT.getVectorElementType();
SDLoc dl(Op);		SDLoc dl(Op);

unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();

// Extend VT if the scalar type is v8/v16 and BWI is not supported.		// Extend VT if the scalar type is v8/v16 and BWI is not supported.
MVT ExtVT = VT;		MVT ExtVT = VT;
if (!Subtarget.hasBWI() && VTElt.getSizeInBits() <= 16)		if (!Subtarget.hasBWI() && VTElt.getSizeInBits() <= 16) {
		// If v16i32 isn't legal we'll need to split and concatenate.
		if (NumElts == 16 && !Subtarget.useAVX512Regs()) {
		SDValue Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v8i1, In,
		DAG.getIntPtrConstant(0, dl));
		SDValue Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, MVT::v8i1, In,
		DAG.getIntPtrConstant(8, dl));
		Lo = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v8i16, Lo);
		Hi = DAG.getNode(ISD::SIGN_EXTEND, dl, MVT::v8i16, Hi);
		SDValue Res = DAG.getNode(ISD::CONCAT_VECTORS, dl, MVT::v16i16, Lo, Hi);
		return DAG.getNode(ISD::TRUNCATE, dl, VT, Res);
		}

ExtVT = MVT::getVectorVT(MVT::i32, NumElts);		ExtVT = MVT::getVectorVT(MVT::i32, NumElts);
		}

// Widen to 512-bits if VLX is not supported.		// Widen to 512-bits if VLX is not supported.
MVT WideVT = ExtVT;		MVT WideVT = ExtVT;
if (!ExtVT.is512BitVector() && !Subtarget.hasVLX()) {		if (!ExtVT.is512BitVector() && !Subtarget.hasVLX()) {
NumElts *= 512 / ExtVT.getSizeInBits();		NumElts *= 512 / ExtVT.getSizeInBits();
InVT = MVT::getVectorVT(MVT::i1, NumElts);		InVT = MVT::getVectorVT(MVT::i1, NumElts);
In = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, InVT, DAG.getUNDEF(InVT),		In = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, InVT, DAG.getUNDEF(InVT),
In, DAG.getIntPtrConstant(0, dl));		In, DAG.getIntPtrConstant(0, dl));
[...]
// Decompose 512-bit ops into smaller 256-bit ops.		// Decompose 512-bit ops into smaller 256-bit ops.
static SDValue Lower512IntUnary(SDValue Op, SelectionDAG &DAG) {		static SDValue Lower512IntUnary(SDValue Op, SelectionDAG &DAG) {
assert(Op.getSimpleValueType().is512BitVector() &&		assert(Op.getSimpleValueType().is512BitVector() &&
Op.getSimpleValueType().isInteger() &&		Op.getSimpleValueType().isInteger() &&
"Only handle AVX 512-bit vector integer operation");		"Only handle AVX 512-bit vector integer operation");
return LowerVectorIntUnary(Op, DAG);		return LowerVectorIntUnary(Op, DAG);
}		}

/// \brief Lower a vector CTLZ using native supported vector CTLZ instruction.
//
// i8/i16 vector implemented using dword LZCNT vector instruction
// ( sub(trunc(lzcnt(zext32(x)))) ). In case zext32(x) is illegal,
// split the vector, perform operation on it's Lo a Hi part and
// concatenate the results.
static SDValue LowerVectorCTLZ_AVX512CDI(SDValue Op, SelectionDAG &DAG) {
assert(Op.getOpcode() == ISD::CTLZ);
SDLoc dl(Op);
MVT VT = Op.getSimpleValueType();
MVT EltVT = VT.getVectorElementType();
unsigned NumElems = VT.getVectorNumElements();

assert((EltVT == MVT::i8 || EltVT == MVT::i16) &&
"Unsupported element type");

// Split vector, it's Lo and Hi parts will be handled in next iteration.
if (16 < NumElems)
return LowerVectorIntUnary(Op, DAG);

MVT NewVT = MVT::getVectorVT(MVT::i32, NumElems);
assert((NewVT.is256BitVector() || NewVT.is512BitVector()) &&
"Unsupported value type for operation");

// Use native supported vector instruction vplzcntd.
Op = DAG.getNode(ISD::ZERO_EXTEND, dl, NewVT, Op.getOperand(0));
SDValue CtlzNode = DAG.getNode(ISD::CTLZ, dl, NewVT, Op);
SDValue TruncNode = DAG.getNode(ISD::TRUNCATE, dl, VT, CtlzNode);
SDValue Delta = DAG.getConstant(32 - EltVT.getSizeInBits(), dl, VT);

return DAG.getNode(ISD::SUB, dl, VT, TruncNode, Delta);
}

// Lower CTLZ using a PSHUFB lookup table implementation.		// Lower CTLZ using a PSHUFB lookup table implementation.
static SDValue LowerVectorCTLZInRegLUT(SDValue Op, const SDLoc &DL,		static SDValue LowerVectorCTLZInRegLUT(SDValue Op, const SDLoc &DL,
const X86Subtarget &Subtarget,		const X86Subtarget &Subtarget,
SelectionDAG &DAG) {		SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();
int NumElts = VT.getVectorNumElements();		int NumElts = VT.getVectorNumElements();
int NumBytes = NumElts * (VT.getScalarSizeInBits() / 8);		int NumBytes = NumElts * (VT.getScalarSizeInBits() / 8);
MVT CurrVT = MVT::getVectorVT(MVT::i8, NumBytes);		MVT CurrVT = MVT::getVectorVT(MVT::i8, NumBytes);
[68 lines not shown; context: while (CurrVT != VT) {]
R1 = DAG.getNode(ISD::AND, DL, NextVT, ResNext, R1);		R1 = DAG.getNode(ISD::AND, DL, NextVT, ResNext, R1);
Res = DAG.getNode(ISD::ADD, DL, NextVT, R0, R1);		Res = DAG.getNode(ISD::ADD, DL, NextVT, R0, R1);
CurrVT = NextVT;		CurrVT = NextVT;
}		}

return Res;		return Res;
}		}

		/// \brief Lower a vector CTLZ using native supported vector CTLZ instruction.
		//
		// i8/i16 vector implemented using dword LZCNT vector instruction
		// ( sub(trunc(lzcnt(zext32(x)))) ). In case zext32(x) is illegal,
		// split the vector, perform operation on its Lo and Hi parts and
		// concatenate the results.
		static SDValue LowerVectorCTLZ_AVX512CDI(SDValue Op, SelectionDAG &DAG,
		const X86Subtarget &Subtarget) {
		assert(Op.getOpcode() == ISD::CTLZ);
		SDLoc dl(Op);
		MVT VT = Op.getSimpleValueType();
		MVT EltVT = VT.getVectorElementType();
		unsigned NumElems = VT.getVectorNumElements();

		assert((EltVT == MVT::i8 || EltVT == MVT::i16) &&
		"Unsupported element type");

		// Split vector, its Lo and Hi parts will be handled in next iteration.
		if (NumElems > 16 || (NumElems == 16 && !Subtarget.useAVX512Regs())) {
		// If the input is v16i8, we can't split it, just fall back to LUT.
		if (VT == MVT::v16i8)
		return LowerVectorCTLZInRegLUT(Op, dl, Subtarget, DAG);

		return LowerVectorIntUnary(Op, DAG);
		}

		MVT NewVT = MVT::getVectorVT(MVT::i32, NumElems);
		assert((NewVT.is256BitVector() || NewVT.is512BitVector()) &&
		"Unsupported value type for operation");

		// Use native supported vector instruction vplzcntd.
		Op = DAG.getNode(ISD::ZERO_EXTEND, dl, NewVT, Op.getOperand(0));
		SDValue CtlzNode = DAG.getNode(ISD::CTLZ, dl, NewVT, Op);
		SDValue TruncNode = DAG.getNode(ISD::TRUNCATE, dl, VT, CtlzNode);
		SDValue Delta = DAG.getConstant(32 - EltVT.getSizeInBits(), dl, VT);

		return DAG.getNode(ISD::SUB, dl, VT, TruncNode, Delta);
		}
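A reading aid for the function above: the lowering relies on the identity that, for an N-bit element, ctlz(x) equals the 32-bit leading-zero count of the zero-extended value minus (32 - N). The scalar sketch below models a single lane under that assumption. `lzcnt32` and `ctlzNarrow` are hypothetical helper names that do not appear in the patch, and the loop only stands in for what vplzcntd does per dword.

```cpp
#include <cassert>
#include <cstdint>

// Portable 32-bit leading-zero count. Returns 32 for a zero input, which is
// the behavior this sketch assumes for one vplzcntd lane.
static unsigned lzcnt32(uint32_t X) {
  unsigned N = 0;
  for (uint32_t Mask = 0x80000000u; Mask != 0 && (X & Mask) == 0; Mask >>= 1)
    ++N;
  return N;
}

// Models sub(trunc(lzcnt(zext32(x))), 32 - EltBits) for one i8 or i16 lane.
// X holds the already zero-extended element value.
static unsigned ctlzNarrow(uint32_t X, unsigned EltBits) {
  assert((EltBits == 8 || EltBits == 16) && "Unsupported element width");
  assert(X < (1u << EltBits) && "Value does not fit in the element");
  return lzcnt32(X) - (32 - EltBits);
}
```

The subtraction never underflows because a zero-extended N-bit value always has at least 32 - N leading zeros in 32 bits.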

static SDValue LowerVectorCTLZ(SDValue Op, const SDLoc &DL,		static SDValue LowerVectorCTLZ(SDValue Op, const SDLoc &DL,
const X86Subtarget &Subtarget,		const X86Subtarget &Subtarget,
SelectionDAG &DAG) {		SelectionDAG &DAG) {
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();

if (Subtarget.hasCDI())		if (Subtarget.hasCDI())
return LowerVectorCTLZ_AVX512CDI(Op, DAG);		return LowerVectorCTLZ_AVX512CDI(Op, DAG, Subtarget);

// Decompose 256-bit ops into smaller 128-bit ops.		// Decompose 256-bit ops into smaller 128-bit ops.
if (VT.is256BitVector() && !Subtarget.hasInt256())		if (VT.is256BitVector() && !Subtarget.hasInt256())
return Lower256IntUnary(Op, DAG);		return Lower256IntUnary(Op, DAG);

// Decompose 512-bit ops into smaller 256-bit ops.		// Decompose 512-bit ops into smaller 256-bit ops.
if (VT.is512BitVector() && !Subtarget.hasBWI())		if (VT.is512BitVector() && !Subtarget.hasBWI())
return Lower512IntUnary(Op, DAG);		return Lower512IntUnary(Op, DAG);
[399 lines not shown; context: static SDValue LowerMULH(SDValue Op, const X86Subtarget &Subtarget,]

// AVX2 implementations - extend xmm subvectors to ymm.		// AVX2 implementations - extend xmm subvectors to ymm.
if (Subtarget.hasInt256()) {		if (Subtarget.hasInt256()) {
unsigned NumElems = VT.getVectorNumElements();		unsigned NumElems = VT.getVectorNumElements();
SDValue Lo = DAG.getIntPtrConstant(0, dl);		SDValue Lo = DAG.getIntPtrConstant(0, dl);
SDValue Hi = DAG.getIntPtrConstant(NumElems / 2, dl);		SDValue Hi = DAG.getIntPtrConstant(NumElems / 2, dl);

if (VT == MVT::v32i8) {		if (VT == MVT::v32i8) {
if (Subtarget.hasBWI()) {		if (Subtarget.useBWIRegs()) {
SDValue ExA = DAG.getNode(ExAVX, dl, MVT::v32i16, A);		SDValue ExA = DAG.getNode(ExAVX, dl, MVT::v32i16, A);
SDValue ExB = DAG.getNode(ExAVX, dl, MVT::v32i16, B);		SDValue ExB = DAG.getNode(ExAVX, dl, MVT::v32i16, B);
SDValue Mul = DAG.getNode(ISD::MUL, dl, MVT::v32i16, ExA, ExB);		SDValue Mul = DAG.getNode(ISD::MUL, dl, MVT::v32i16, ExA, ExB);
Mul = DAG.getNode(ISD::SRL, dl, MVT::v32i16, Mul,		Mul = DAG.getNode(ISD::SRL, dl, MVT::v32i16, Mul,
DAG.getConstant(8, dl, MVT::v32i16));		DAG.getConstant(8, dl, MVT::v32i16));
return DAG.getNode(ISD::TRUNCATE, dl, VT, Mul);		return DAG.getNode(ISD::TRUNCATE, dl, VT, Mul);
}		}
SDValue ALo = extract128BitVector(A, 0, DAG, dl);		SDValue ALo = extract128BitVector(A, 0, DAG, dl);
[773 lines not shown; context: if (VT == MVT::v4i32) {]
SDValue R13 = DAG.getVectorShuffle(VT, dl, R1, R3, {-1, 1, -1, 7});		SDValue R13 = DAG.getVectorShuffle(VT, dl, R1, R3, {-1, 1, -1, 7});
return DAG.getVectorShuffle(VT, dl, R02, R13, {0, 5, 2, 7});		return DAG.getVectorShuffle(VT, dl, R02, R13, {0, 5, 2, 7});
}		}

// It's worth extending once and using the vXi16/vXi32 shifts for smaller		// It's worth extending once and using the vXi16/vXi32 shifts for smaller
// types, but without AVX512 the extra overheads to get from vXi8 to vXi32		// types, but without AVX512 the extra overheads to get from vXi8 to vXi32
// make the existing SSE solution better.		// make the existing SSE solution better.
if ((Subtarget.hasInt256() && VT == MVT::v8i16) ||		if ((Subtarget.hasInt256() && VT == MVT::v8i16) ||
(Subtarget.hasAVX512() && VT == MVT::v16i16) ||		(Subtarget.useAVX512Regs() && VT == MVT::v16i16) ||
(Subtarget.hasAVX512() && VT == MVT::v16i8) ||		(Subtarget.useAVX512Regs() && VT == MVT::v16i8) ||
(Subtarget.hasBWI() && VT == MVT::v32i8)) {		(Subtarget.hasBWI() && Subtarget.hasVLX() && VT == MVT::v16i8) ||
		(Subtarget.useBWIRegs() && VT == MVT::v32i8)) {
assert((!Subtarget.hasBWI() || VT == MVT::v32i8 || VT == MVT::v16i8) &&		assert((!Subtarget.hasBWI() || VT == MVT::v32i8 || VT == MVT::v16i8) &&
"Unexpected vector type");		"Unexpected vector type");
MVT EvtSVT = Subtarget.hasBWI() ? MVT::i16 : MVT::i32;		MVT EvtSVT = Subtarget.hasBWI() ? MVT::i16 : MVT::i32;
MVT ExtVT = MVT::getVectorVT(EvtSVT, VT.getVectorNumElements());		MVT ExtVT = MVT::getVectorVT(EvtSVT, VT.getVectorNumElements());
unsigned ExtOpc =		unsigned ExtOpc =
Op.getOpcode() == ISD::SRA ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;		Op.getOpcode() == ISD::SRA ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;
R = DAG.getNode(ExtOpc, dl, ExtVT, R);		R = DAG.getNode(ExtOpc, dl, ExtVT, R);
Amt = DAG.getNode(ISD::ZERO_EXTEND, dl, ExtVT, Amt);		Amt = DAG.getNode(ISD::ZERO_EXTEND, dl, ExtVT, Amt);
[807 lines not shown; context: static SDValue LowerVectorCTPOP(SDValue Op, const X86Subtarget &Subtarget,]
SDLoc DL(Op.getNode());		SDLoc DL(Op.getNode());
SDValue Op0 = Op.getOperand(0);		SDValue Op0 = Op.getOperand(0);

// TRUNC(CTPOP(ZEXT(X))) to make use of vXi32/vXi64 VPOPCNT instructions.		// TRUNC(CTPOP(ZEXT(X))) to make use of vXi32/vXi64 VPOPCNT instructions.
if (Subtarget.hasVPOPCNTDQ()) {		if (Subtarget.hasVPOPCNTDQ()) {
unsigned NumElems = VT.getVectorNumElements();		unsigned NumElems = VT.getVectorNumElements();
assert((VT.getVectorElementType() == MVT::i8 ||		assert((VT.getVectorElementType() == MVT::i8 ||
VT.getVectorElementType() == MVT::i16) && "Unexpected type");		VT.getVectorElementType() == MVT::i16) && "Unexpected type");
if (NumElems <= 16) {		if (NumElems <= 16 && !(NumElems == 16 && !Subtarget.useAVX512Regs())) {
MVT NewVT = MVT::getVectorVT(MVT::i32, NumElems);		MVT NewVT = MVT::getVectorVT(MVT::i32, NumElems);
Op = DAG.getNode(ISD::ZERO_EXTEND, DL, NewVT, Op0);		Op = DAG.getNode(ISD::ZERO_EXTEND, DL, NewVT, Op0);
Op = DAG.getNode(ISD::CTPOP, DL, NewVT, Op);		Op = DAG.getNode(ISD::CTPOP, DL, NewVT, Op);
return DAG.getNode(ISD::TRUNCATE, DL, VT, Op);		return DAG.getNode(ISD::TRUNCATE, DL, VT, Op);
}		}
}		}
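A side note on the revised CTPOP guard: it appears to be the exact complement of the split condition used in the revised LowerVectorCTLZ_AVX512CDI, so both lowerings agree on when a 16-element vector may be extended to v16i32. The sketch below checks that claim exhaustively for power-of-two lane counts. The function names are hypothetical and only the two boolean expressions are taken from the diff.

```cpp
#include <cassert>

// Split/fallback condition from the revised LowerVectorCTLZ_AVX512CDI.
static bool mustSplitOrFallBack(unsigned NumElems, bool UseAVX512Regs) {
  assert(NumElems != 0 && "Vectors have at least one element");
  return NumElems > 16 || (NumElems == 16 && !UseAVX512Regs);
}

// Extend condition from the revised LowerVectorCTPOP.
static bool canExtendToI32(unsigned NumElems, bool UseAVX512Regs) {
  assert(NumElems != 0 && "Vectors have at least one element");
  return NumElems <= 16 && !(NumElems == 16 && !UseAVX512Regs);
}

// Returns true when the two predicates disagree on every input, i.e. when
// they are exact complements for all power-of-two lane counts up to 64.
static bool predicatesAreComplements() {
  for (unsigned N = 1; N <= 64; N *= 2)
    for (int U = 0; U < 2; ++U)
      if (mustSplitOrFallBack(N, U != 0) == canExtendToI32(N, U != 0))
        return false;
  return true;
}
```

If that holds, the CTPOP guard could equally be written as `NumElems < 16 || (NumElems == 16 && useAVX512Regs())`, which may read more directly.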

if (!Subtarget.hasSSSE3()) {		if (!Subtarget.hasSSSE3()) {
[14,720 lines not shown]

lib/Target/X86/X86Subtarget.h

[349 lines not shown; context: protected:]

/// Max. memset / memcpy size that is turned into rep/movs, rep/stos ops.		/// Max. memset / memcpy size that is turned into rep/movs, rep/stos ops.
///		///
unsigned MaxInlineSizeThreshold;		unsigned MaxInlineSizeThreshold;

/// Prefer 256-bit AVX instructions over 512-bit instructions.		/// Prefer 256-bit AVX instructions over 512-bit instructions.
bool PreferVecWidth256;		bool PreferVecWidth256;

		/// Indicates there are no 512-bit vectors present in the function.
		bool No512BitVectors;

/// What processor and OS we're targeting.		/// What processor and OS we're targeting.
Triple TargetTriple;		Triple TargetTriple;

/// Instruction itineraries for scheduling		/// Instruction itineraries for scheduling
InstrItineraryData InstrItins;		InstrItineraryData InstrItins;

/// GlobalISel related APIs.		/// GlobalISel related APIs.
std::unique_ptr<CallLowering> CallLoweringInfo;		std::unique_ptr<CallLowering> CallLoweringInfo;
[207 lines not shown; context: public:]
bool hasVNNI() const { return HasVNNI; }		bool hasVNNI() const { return HasVNNI; }
bool hasBITALG() const { return HasBITALG; }		bool hasBITALG() const { return HasBITALG; }
bool hasMPX() const { return HasMPX; }		bool hasMPX() const { return HasMPX; }
bool hasSHSTK() const { return HasSHSTK; }		bool hasSHSTK() const { return HasSHSTK; }
bool hasIBT() const { return HasIBT; }		bool hasIBT() const { return HasIBT; }
bool hasCLFLUSHOPT() const { return HasCLFLUSHOPT; }		bool hasCLFLUSHOPT() const { return HasCLFLUSHOPT; }
bool hasCLWB() const { return HasCLWB; }		bool hasCLWB() const { return HasCLWB; }

bool preferVecWidth256() const { return PreferVecWidth256; }		bool preferVecWidth256() const { return PreferVecWidth256; }
		hfinkel (inline comment): This sits on top of D41096? I thought it would replace it. Do we need to prefer AVX2 if we have AVX-512 without using zmm?
		craig.topper (author, inline comment): It's not prefer AVX2. It's prefer 256-bit vectors. The name could be better. From our previous discussions we still wanted to "prefer 256-bit" even when the user uses 512-bit explicitly unless they pass -mprefer-vector-width=512. And we need the prefer flag to be a property of the affected CPUs. So this flag represents those two things. We can also use this flag to do targeted fixes to disable extensions to 512-bit when the CPU prefers 256-bit, but we weren't able to disable the legalizer. I think the LowerShift and LowerMULH fixes in this patch might want to be qualified with only "prefer 256-bit" rather than "512-bit types are illegal".

		// If there are no 512-bit vectors and we prefer not to use 512-bit registers,
		// disable them in the legalizer. We also need VLX support so we can do
		// masked operations.
		bool useAVX512Regs() const {
		return hasAVX512() && !(hasVLX() && PreferVecWidth256 && No512BitVectors);
		}
		echristo (inline comment): I think I'd rather have a preferred-vector-width attribute than the combination of 128/256/etc. features. Thoughts?

		bool useBWIRegs() const {
		return hasBWI() && useAVX512Regs();
		}
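To make the interaction of the flags easier to see, here is a small standalone model of the two predicates above. `SubtargetModel` and `makeModel` are hypothetical names for illustration only. The boolean expressions are copied from the diff, and the precondition that BWI and VLX imply AVX512F is an assumption about how the real feature sets are defined.

```cpp
#include <cassert>

// Hypothetical stand-in for the X86Subtarget fields the predicates read.
struct SubtargetModel {
  bool HasAVX512;
  bool HasBWI;
  bool HasVLX;
  bool PreferVecWidth256;
  bool No512BitVectors;

  // 512-bit registers stay enabled unless all three of VLX, the 256-bit
  // preference, and the "no 512-bit vectors" flag are set.
  bool useAVX512Regs() const {
    return HasAVX512 && !(HasVLX && PreferVecWidth256 && No512BitVectors);
  }

  bool useBWIRegs() const { return HasBWI && useAVX512Regs(); }
};

// Builds a model, assuming BWI and VLX are only ever set together with AVX512F.
static SubtargetModel makeModel(bool AVX512, bool BWI, bool VLX,
                                bool Prefer256, bool No512) {
  assert((!BWI || AVX512) && "BWI without AVX512F is not modeled");
  assert((!VLX || AVX512) && "VLX without AVX512F is not modeled");
  SubtargetModel M;
  M.HasAVX512 = AVX512;
  M.HasBWI = BWI;
  M.HasVLX = VLX;
  M.PreferVecWidth256 = Prefer256;
  M.No512BitVectors = No512;
  return M;
}
```

In this model, a target without VLX keeps its 512-bit registers even when both other flags are set, which matches the comment about needing VLX for masked operations.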

bool isXRaySupported() const override { return is64Bit(); }		bool isXRaySupported() const override { return is64Bit(); }

X86ProcFamilyEnum getProcFamily() const { return X86ProcFamily; }		X86ProcFamilyEnum getProcFamily() const { return X86ProcFamily; }

/// TODO: to be removed later and replaced with suitable properties		/// TODO: to be removed later and replaced with suitable properties
bool isAtom() const { return X86ProcFamily == IntelAtom; }		bool isAtom() const { return X86ProcFamily == IntelAtom; }
bool isSLM() const { return X86ProcFamily == IntelSLM; }		bool isSLM() const { return X86ProcFamily == IntelSLM; }
bool useSoftFloat() const { return UseSoftFloat; }		bool useSoftFloat() const { return UseSoftFloat; }
[136 lines not shown]

lib/Target/X86/X86Subtarget.cpp

[340 lines not shown; context: void X86Subtarget::initializeEnvironment() {]
stackAlignment = 4;		stackAlignment = 4;
// FIXME: this is a known good value for Yonah. How about others?		// FIXME: this is a known good value for Yonah. How about others?
MaxInlineSizeThreshold = 128;		MaxInlineSizeThreshold = 128;
UseSoftFloat = false;		UseSoftFloat = false;
X86ProcFamily = Others;		X86ProcFamily = Others;
GatherOverhead = 1024;		GatherOverhead = 1024;
ScatterOverhead = 1024;		ScatterOverhead = 1024;
PreferVecWidth256 = false;		PreferVecWidth256 = false;
		No512BitVectors = false;
}		}

X86Subtarget &X86Subtarget::initializeSubtargetDependencies(StringRef CPU,		X86Subtarget &X86Subtarget::initializeSubtargetDependencies(StringRef CPU,
StringRef FS) {		StringRef FS) {
initializeEnvironment();		initializeEnvironment();
initSubtargetFeatures(CPU, FS);		initSubtargetFeatures(CPU, FS);
return *this;		return *this;
}		}
[53 lines not shown]

lib/Target/X86/X86TargetMachine.cpp

[48 lines not shown]
#include <string>		#include <string>

using namespace llvm;		using namespace llvm;

static cl::opt<bool> EnableMachineCombinerPass("x86-machine-combiner",		static cl::opt<bool> EnableMachineCombinerPass("x86-machine-combiner",
cl::desc("Enable the machine combiner pass"),		cl::desc("Enable the machine combiner pass"),
cl::init(true), cl::Hidden);		cl::init(true), cl::Hidden);

		static cl::opt<bool>
		InferRequiredVectorWidth("x86-experimental-infer-vector-width",
		cl::desc("Infer the required vector width"),
		cl::init(false), cl::Hidden);

namespace llvm {		namespace llvm {

void initializeWinEHStatePassPass(PassRegistry &);		void initializeWinEHStatePassPass(PassRegistry &);
void initializeFixupLEAPassPass(PassRegistry &);		void initializeFixupLEAPassPass(PassRegistry &);
void initializeX86CallFrameOptimizationPass(PassRegistry &);		void initializeX86CallFrameOptimizationPass(PassRegistry &);
void initializeX86CmovConverterPassPass(PassRegistry &);		void initializeX86CmovConverterPassPass(PassRegistry &);
void initializeX86ExecutionDepsFixPass(PassRegistry &);		void initializeX86ExecutionDepsFixPass(PassRegistry &);
void initializeX86DomainReassignmentPass(PassRegistry &);		void initializeX86DomainReassignmentPass(PassRegistry &);
[10 lines not shown; context: extern "C" void LLVMInitializeX86Target() {]
initializeWinEHStatePassPass(PR);		initializeWinEHStatePassPass(PR);
initializeFixupBWInstPassPass(PR);		initializeFixupBWInstPassPass(PR);
initializeEvexToVexInstPassPass(PR);		initializeEvexToVexInstPassPass(PR);
initializeFixupLEAPassPass(PR);		initializeFixupLEAPassPass(PR);
initializeX86CallFrameOptimizationPass(PR);		initializeX86CallFrameOptimizationPass(PR);
initializeX86CmovConverterPassPass(PR);		initializeX86CmovConverterPassPass(PR);
initializeX86ExecutionDepsFixPass(PR);		initializeX86ExecutionDepsFixPass(PR);
initializeX86DomainReassignmentPass(PR);		initializeX86DomainReassignmentPass(PR);
		initializeX86VectorWidthInferPass(PR);
}		}

static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {		static std::unique_ptr<TargetLoweringObjectFile> createTLOF(const Triple &TT) {
if (TT.isOSBinFormatMachO()) {		if (TT.isOSBinFormatMachO()) {
if (TT.getArch() == Triple::x86_64)		if (TT.getArch() == Triple::x86_64)
return llvm::make_unique<X86_64MachoTargetObjectFile>();		return llvm::make_unique<X86_64MachoTargetObjectFile>();
return llvm::make_unique<TargetLoweringObjectFileMachO>();		return llvm::make_unique<TargetLoweringObjectFileMachO>();
}		}
[172 lines not shown; context: if (F.hasFnAttribute("prefer-vector-width")) {]
if (!Val.getAsInteger(0, Width)) {		if (!Val.getAsInteger(0, Width)) {
if (Key.size() > CPU.size())		if (Key.size() > CPU.size())
Key += ",";		Key += ",";
Key += (Width < 512) ? "+prefer-vector-width-256"		Key += (Width < 512) ? "+prefer-vector-width-256"
: "-prefer-vector-width-256";		: "-prefer-vector-width-256";
}		}
}		}

		// Translate required vector width function attribute into subtarget features.
		// This enables the legalizer to disable 512-bit vectors on targets that
		// prefer to avoid them.
		if (F.hasFnAttribute("require-vector-width")) {
		StringRef Val = F.getFnAttribute("require-vector-width").getValueAsString();
		unsigned Width;
		if (!Val.getAsInteger(0, Width)) {
		if (Key.size() > CPU.size())
		Key += ",";
		Key += (Width <= 256) ? "+no-512-bit-vectors" : "-no-512-bit-vectors";
		}
		}
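The two attribute translations above both append to the same subtarget key. The sketch below models that string handling with std::string, assuming the key starts as just the CPU name with an empty feature string (the real code appends to whatever features are already present). `appendFeature` and `buildKey` are hypothetical names. The thresholds are copied from the diff: `Width < 512` for the preference and `Width <= 256` for the requirement.

```cpp
#include <cassert>
#include <string>

// Appends one feature to Key, inserting a comma only when Key already holds
// at least one feature after the CPU name.
static void appendFeature(std::string &Key, const std::string &CPU,
                          const std::string &Feature) {
  assert(Key.size() >= CPU.size() && "Key must start with the CPU name");
  if (Key.size() > CPU.size())
    Key += ",";
  Key += Feature;
}

// Builds the lookup key for a function carrying both attributes.
static std::string buildKey(const std::string &CPU, unsigned PreferWidth,
                            unsigned RequireWidth) {
  std::string Key = CPU;
  appendFeature(Key, CPU,
                PreferWidth < 512 ? "+prefer-vector-width-256"
                                  : "-prefer-vector-width-256");
  appendFeature(Key, CPU,
                RequireWidth <= 256 ? "+no-512-bit-vectors"
                                    : "-no-512-bit-vectors");
  return Key;
}
```

The two comparisons agree for power-of-two widths but would differ for a value such as 384, which may be worth unifying if non-power-of-two widths are ever expected.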

FS = Key.substr(CPU.size());		FS = Key.substr(CPU.size());

auto &I = SubtargetMap[Key];		auto &I = SubtargetMap[Key];
if (!I) {		if (!I) {
// This needs to be done before we create a new subtarget since any		// This needs to be done before we create a new subtarget since any
// creation will depend on the TM and the code generation flags on the		// creation will depend on the TM and the code generation flags on the
// function that reside in TargetOptions.		// function that reside in TargetOptions.
resetTargetOptions(F);		resetTargetOptions(F);
[123 lines not shown; context: bool X86PassConfig::addILPOpts() {]
addPass(&EarlyIfConverterID);		addPass(&EarlyIfConverterID);
if (EnableMachineCombinerPass)		if (EnableMachineCombinerPass)
addPass(&MachineCombinerID);		addPass(&MachineCombinerID);
addPass(createX86CmovConverterPass());		addPass(createX86CmovConverterPass());
return true;		return true;
}		}

bool X86PassConfig::addPreISel() {		bool X86PassConfig::addPreISel() {
		if (InferRequiredVectorWidth)
		addPass(createX86VectorWidthInferPass());

// Only add this pass for 32-bit x86 Windows.		// Only add this pass for 32-bit x86 Windows.
const Triple &TT = TM->getTargetTriple();		const Triple &TT = TM->getTargetTriple();
if (TT.isOSWindows() && TT.getArch() == Triple::x86)		if (TT.isOSWindows() && TT.getArch() == Triple::x86)
addPass(createX86WinEHStatePass());		addPass(createX86WinEHStatePass());
return true;		return true;
}		}

void X86PassConfig::addPreRegAlloc() {		void X86PassConfig::addPreRegAlloc() {
[34 lines not shown]

lib/Target/X86/X86TargetTransformInfo.cpp

[2,516 lines not shown; context: bool X86TTIImpl::isLegalMaskedGather(Type *DataTy) {]
Type *ScalarTy = DataTy->getScalarType();		Type *ScalarTy = DataTy->getScalarType();
int DataWidth = isa<PointerType>(ScalarTy) ?		int DataWidth = isa<PointerType>(ScalarTy) ?
DL.getPointerSizeInBits() : ScalarTy->getPrimitiveSizeInBits();		DL.getPointerSizeInBits() : ScalarTy->getPrimitiveSizeInBits();

// Some CPUs have better gather performance than others.		// Some CPUs have better gather performance than others.
// TODO: Remove the explicit ST->hasAVX512()?, That would mean we would only		// TODO: Remove the explicit ST->hasAVX512()?, That would mean we would only
// enable gather with a -march.		// enable gather with a -march.
return (DataWidth == 32 || DataWidth == 64) &&		return (DataWidth == 32 || DataWidth == 64) &&
(ST->hasAVX512() || (ST->hasFastGather() && ST->hasAVX2()));		(ST->hasAVX512() || (ST->hasFastGather() && ST->hasAVX2()));
}		}

bool X86TTIImpl::isLegalMaskedScatter(Type *DataType) {		bool X86TTIImpl::isLegalMaskedScatter(Type *DataType) {
// AVX2 doesn't support scatter		// AVX2 doesn't support scatter
if (!ST->hasAVX512())		if (!ST->hasAVX512())
return false;		return false;
return isLegalMaskedGather(DataType);		return isLegalMaskedGather(DataType);
}		}
[327 lines not shown]

lib/Target/X86/X86VectorWidthInfer.cpp

This file was added.

				//===- X86VectorWidthInfer.cpp - Infer require-vector-width attribute -----===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				/// \file This pass tries to infer the required vector width for a function
				/// if the require-vector-width attribute isn't present.
				// ===---------------------------------------------------------------------===//

				#include "X86TargetMachine.h"
				#include "llvm/CodeGen/TargetPassConfig.h"
				#include "llvm/IR/IntrinsicInst.h"
				#include "llvm/Pass.h"

				using namespace llvm;

				#define DEBUG_TYPE "x86-vector-width-fix"

				namespace {

				class X86VectorWidthInfer : public FunctionPass {
				public:
				static char ID; // Pass ID

				X86VectorWidthInfer() : FunctionPass(ID) {
				initializeX86VectorWidthInferPass(*PassRegistry::getPassRegistry());
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<TargetPassConfig>();
				}

				bool runOnFunction(Function &F) override;
				};

				} // end anonymous namespace

				char X86VectorWidthInfer::ID = 0;

				INITIALIZE_PASS_BEGIN(X86VectorWidthInfer, DEBUG_TYPE,
				"X86 Vector Width Infer", false, false)
				INITIALIZE_PASS_DEPENDENCY(TargetPassConfig)
				INITIALIZE_PASS_END(X86VectorWidthInfer, DEBUG_TYPE,
				"X86 Vector Width Infer", false, false)

				FunctionPass *llvm::createX86VectorWidthInferPass() {
				return new X86VectorWidthInfer();
				}

				bool X86VectorWidthInfer::runOnFunction(Function &F) {
				TargetPassConfig &TPC = getAnalysis<TargetPassConfig>();
				const X86Subtarget *ST =
				TPC.getTM<X86TargetMachine>().getSubtargetImpl(F);

				// If the target doesn't support 512-bit vectors or doesn't prefer 256-bit
				// vectors, then there is nothing to do.
				if (!ST->hasAVX512() || !ST->preferVecWidth256())
				return false;

				unsigned RequiredWidth = 0;

				// If we already have a function attribute and it says that 512-bit vectors
				// are required, we are done.
				// TODO: In the future we should maybe just trust the attribute.
				if (F.hasFnAttribute("require-vector-width")) {
				StringRef Val = F.getFnAttribute("require-vector-width").getValueAsString();
				unsigned Width;
				if (!Val.getAsInteger(0, Width)) {
				if (Width > 256)
				return false;
				RequiredWidth = Width;
				}
				}

				// Check for a vector return type.
				Type *RetTy = F.getReturnType();
				if (RetTy->isVectorTy())
				RequiredWidth = std::max(RequiredWidth,
				RetTy->getPrimitiveSizeInBits());

				// Check for any vector arguments.
				for (const auto &A : F.args()) {
				Type *ArgTy = A.getType();
				if (ArgTy->isVectorTy())
				RequiredWidth = std::max(RequiredWidth,
				ArgTy->getPrimitiveSizeInBits());
				}

				// Otherwise scan for any calls that need wide registers to match ABI.
				// Also need this for any target specific intrinsics.
				for (auto &BB : F) {
				for (auto &I : BB) {
				if (auto *CI = dyn_cast<CallInst>(&I)) {
				// We can handle target independent intrinsics via type legalization so
				// skip those.
				if (auto *II = dyn_cast<IntrinsicInst>(&I)) {
				StringRef Name = II->getCalledFunction()->getName();
				if (!Name.startswith("llvm.x86."))
				continue;
				}
				// Ok we have a call. Check its types.
				Type *RetTy = CI->getType();
				if (RetTy->isVectorTy())
				RequiredWidth = std::max(RequiredWidth,
				RetTy->getPrimitiveSizeInBits());
				for (Value *A : CI->arg_operands()) {
				Type *ArgTy = A->getType();
				if (ArgTy->isVectorTy())
				RequiredWidth = std::max(RequiredWidth,
				ArgTy->getPrimitiveSizeInBits());
				}
				}
				}
				}

				// Remove and replace the function's require-vector-width attribute.
				// TODO: This should be more generic, but this will work until we have wider
				// vectors.
				F.removeFnAttr("require-vector-width");
				F.addFnAttr("require-vector-width", (RequiredWidth > 256) ? "512" : "256");

				return true; // The function's attributes were modified.
				}
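The inference in runOnFunction boils down to taking the maximum bit width over a set of vector types and then clamping to one of two values. A minimal standalone sketch of that reduction follows. `inferRequiredWidth` is a hypothetical name, the input list stands in for the widths collected from the return type, the arguments, and the scanned calls, and the early return for an existing attribute above 256 is folded into the same clamp as a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// AttrWidth is the value of an existing require-vector-width attribute, or 0
// if there is none. VectorWidths holds the bit widths of every vector type
// that constrains the function (signature, ABI-relevant calls, x86 intrinsics).
static unsigned inferRequiredWidth(unsigned AttrWidth,
                                   const std::vector<unsigned> &VectorWidths) {
  unsigned Required = AttrWidth;
  for (unsigned W : VectorWidths) {
    assert(W > 0 && "Vector types have a nonzero bit width");
    Required = std::max(Required, W);
  }
  // Only two outcomes are distinguished, mirroring the "512" / "256" strings
  // the pass writes back to the attribute.
  return Required > 256 ? 512u : 256u;
}
```

Note that a function with no vector types at all still ends up with a required width of 256 in this scheme, since nothing narrower is ever recorded.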

test/CodeGen/X86/prefer-avx256-lzcnt.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+avx512cd,+prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX256
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+avx512cd,-prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512VL
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512cd | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512F

				define <8 x i16> @testv8i16(<8 x i16> %in) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv8i16:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX256-NEXT: vplzcntd %ymm0, %ymm0
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vpsubw {{.*}}(%rip), %xmm0, %xmm0
				; AVX256-NEXT: vzeroupper
				; AVX256-NEXT: retq
				;
				; AVX512VL-LABEL: testv8i16:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX512VL-NEXT: vplzcntd %ymm0, %ymm0
				; AVX512VL-NEXT: vpmovdw %ymm0, %xmm0
				; AVX512VL-NEXT: vpsubw {{.*}}(%rip), %xmm0, %xmm0
				; AVX512VL-NEXT: vzeroupper
				; AVX512VL-NEXT: retq
				;
				; AVX512F-LABEL: testv8i16:
				; AVX512F: # %bb.0:
				; AVX512F-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX512F-NEXT: vplzcntd %zmm0, %zmm0
				; AVX512F-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512F-NEXT: vpsubw {{.*}}(%rip), %xmm0, %xmm0
				; AVX512F-NEXT: vzeroupper
				; AVX512F-NEXT: retq
				%out = call <8 x i16> @llvm.ctlz.v8i16(<8 x i16> %in, i1 false)
				ret <8 x i16> %out
				}

				define <16 x i8> @testv16i8(<16 x i8> %in) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv16i8:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX256-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX256-NEXT: vmovdqa {{.*#+}} xmm3 = [4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0]
				; AVX256-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX256-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX256-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX256-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX256-NEXT: vpcmpeqb %xmm1, %xmm0, %xmm1
				; AVX256-NEXT: vpand %xmm1, %xmm2, %xmm1
				; AVX256-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX256-NEXT: vpaddb %xmm0, %xmm1, %xmm0
				; AVX256-NEXT: retq
				;
				; AVX512-LABEL: testv16i8:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero
				; AVX512-NEXT: vplzcntd %zmm0, %zmm0
				; AVX512-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512-NEXT: vpsubb {{.*}}(%rip), %xmm0, %xmm0
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				%out = call <16 x i8> @llvm.ctlz.v16i8(<16 x i8> %in, i1 false)
				ret <16 x i8> %out
				}

				define <16 x i16> @testv16i16(<16 x i16> %in) nounwind "require-vector-width"="256" {
				; AVX256-LABEL: testv16i16:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX256-NEXT: vpmovzxwd {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
				; AVX256-NEXT: vplzcntd %ymm1, %ymm1
				; AVX256-NEXT: vpmovdw %ymm1, %xmm1
				; AVX256-NEXT: vmovdqa {{.*#+}} xmm2 = [16,16,16,16,16,16,16,16]
				; AVX256-NEXT: vpsubw %xmm2, %xmm1, %xmm1
				; AVX256-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX256-NEXT: vplzcntd %ymm0, %ymm0
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vpsubw %xmm2, %xmm0, %xmm0
				; AVX256-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
				; AVX256-NEXT: retq
				;
				; AVX512-LABEL: testv16i16:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vpmovzxwd {{.*#+}} zmm0 = ymm0[0],zero,ymm0[1],zero,ymm0[2],zero,ymm0[3],zero,ymm0[4],zero,ymm0[5],zero,ymm0[6],zero,ymm0[7],zero,ymm0[8],zero,ymm0[9],zero,ymm0[10],zero,ymm0[11],zero,ymm0[12],zero,ymm0[13],zero,ymm0[14],zero,ymm0[15],zero
				; AVX512-NEXT: vplzcntd %zmm0, %zmm0
				; AVX512-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512-NEXT: vpsubw {{.*}}(%rip), %ymm0, %ymm0
				; AVX512-NEXT: retq
				%out = call <16 x i16> @llvm.ctlz.v16i16(<16 x i16> %in, i1 false)
				ret <16 x i16> %out
				}

				define <32 x i8> @testv32i8(<32 x i8> %in) nounwind "require-vector-width"="256" {
				; AVX256-LABEL: testv32i8:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX256-NEXT: vmovdqa {{.*#+}} xmm2 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX256-NEXT: vpand %xmm2, %xmm1, %xmm3
				; AVX256-NEXT: vmovdqa {{.*#+}} xmm4 = [4,3,2,2,1,1,1,1,0,0,0,0,0,0,0,0]
				; AVX256-NEXT: vpshufb %xmm3, %xmm4, %xmm3
				; AVX256-NEXT: vpsrlw $4, %xmm1, %xmm1
				; AVX256-NEXT: vpand %xmm2, %xmm1, %xmm1
				; AVX256-NEXT: vpxor %xmm5, %xmm5, %xmm5
				; AVX256-NEXT: vpcmpeqb %xmm5, %xmm1, %xmm6
				; AVX256-NEXT: vpand %xmm6, %xmm3, %xmm3
				; AVX256-NEXT: vpshufb %xmm1, %xmm4, %xmm1
				; AVX256-NEXT: vpaddb %xmm1, %xmm3, %xmm1
				; AVX256-NEXT: vpand %xmm2, %xmm0, %xmm3
				; AVX256-NEXT: vpshufb %xmm3, %xmm4, %xmm3
				; AVX256-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX256-NEXT: vpand %xmm2, %xmm0, %xmm0
				; AVX256-NEXT: vpcmpeqb %xmm5, %xmm0, %xmm2
				; AVX256-NEXT: vpand %xmm2, %xmm3, %xmm2
				; AVX256-NEXT: vpshufb %xmm0, %xmm4, %xmm0
				; AVX256-NEXT: vpaddb %xmm0, %xmm2, %xmm0
				; AVX256-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
				; AVX256-NEXT: retq
				;
				; AVX512-LABEL: testv32i8:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX512-NEXT: vpmovzxbd {{.*#+}} zmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero,xmm1[8],zero,zero,zero,xmm1[9],zero,zero,zero,xmm1[10],zero,zero,zero,xmm1[11],zero,zero,zero,xmm1[12],zero,zero,zero,xmm1[13],zero,zero,zero,xmm1[14],zero,zero,zero,xmm1[15],zero,zero,zero
				; AVX512-NEXT: vplzcntd %zmm1, %zmm1
				; AVX512-NEXT: vpmovdb %zmm1, %xmm1
				; AVX512-NEXT: vmovdqa {{.*#+}} xmm2 = [24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24]
				; AVX512-NEXT: vpsubb %xmm2, %xmm1, %xmm1
				; AVX512-NEXT: vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero
				; AVX512-NEXT: vplzcntd %zmm0, %zmm0
				; AVX512-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512-NEXT: vpsubb %xmm2, %xmm0, %xmm0
				; AVX512-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
				; AVX512-NEXT: retq
				%out = call <32 x i8> @llvm.ctlz.v32i8(<32 x i8> %in, i1 false)
				ret <32 x i8> %out
				}

				declare <8 x i16> @llvm.ctlz.v8i16(<8 x i16>, i1)
				declare <16 x i8> @llvm.ctlz.v16i8(<16 x i8>, i1)
				declare <16 x i16> @llvm.ctlz.v16i16(<16 x i16>, i1)
				declare <32 x i8> @llvm.ctlz.v32i8(<32 x i8>, i1)

test/CodeGen/X86/prefer-avx256-mask-extend.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX256
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,-prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512VL
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512F

				define <8 x i16> @testv8i64_sext(<8 x i64>* %p) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv8i64_sext:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256-NEXT: vpcmpeqq (%rdi), %ymm0, %k0
				; AVX256-NEXT: vpcmpeqq 32(%rdi), %ymm0, %k1
				; AVX256-NEXT: kshiftlw $4, %k1, %k1
				; AVX256-NEXT: korw %k1, %k0, %k1
				; AVX256-NEXT: vpcmpeqd %ymm0, %ymm0, %ymm0
				; AVX256-NEXT: vmovdqa32 %ymm0, %ymm0 {%k1} {z}
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vzeroupper
				; AVX256-NEXT: retq
				;
				; AVX512VL-LABEL: testv8i64_sext:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512VL-NEXT: vpcmpeqq (%rdi), %zmm0, %k1
				; AVX512VL-NEXT: vpcmpeqd %ymm0, %ymm0, %ymm0
				; AVX512VL-NEXT: vmovdqa32 %ymm0, %ymm0 {%k1} {z}
				; AVX512VL-NEXT: vpmovdw %ymm0, %xmm0
				; AVX512VL-NEXT: vzeroupper
				; AVX512VL-NEXT: retq
				;
				; AVX512F-LABEL: testv8i64_sext:
				; AVX512F: # %bb.0:
				; AVX512F-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512F-NEXT: vpcmpeqq (%rdi), %zmm0, %k1
				; AVX512F-NEXT: vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
				; AVX512F-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512F-NEXT: # kill: def %xmm0 killed %xmm0 killed %ymm0
				; AVX512F-NEXT: vzeroupper
				; AVX512F-NEXT: retq
				%in = load <8 x i64>, <8 x i64>* %p
				%cmp = icmp eq <8 x i64> %in, zeroinitializer
				%trunc = sext <8 x i1> %cmp to <8 x i16>
				ret <8 x i16> %trunc
				}

				define <8 x i16> @testv8i32_sext(<8 x i32>* %p) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv8i32_sext:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256-NEXT: vpcmpeqd (%rdi), %ymm0, %k1
				; AVX256-NEXT: vpcmpeqd %ymm0, %ymm0, %ymm0
				; AVX256-NEXT: vmovdqa32 %ymm0, %ymm0 {%k1} {z}
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vzeroupper
				; AVX256-NEXT: retq
				;
				; AVX512VL-LABEL: testv8i32_sext:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512VL-NEXT: vpcmpeqd (%rdi), %ymm0, %k1
				; AVX512VL-NEXT: vpcmpeqd %ymm0, %ymm0, %ymm0
				; AVX512VL-NEXT: vmovdqa32 %ymm0, %ymm0 {%k1} {z}
				; AVX512VL-NEXT: vpmovdw %ymm0, %xmm0
				; AVX512VL-NEXT: vzeroupper
				; AVX512VL-NEXT: retq
				;
				; AVX512F-LABEL: testv8i32_sext:
				; AVX512F: # %bb.0:
				; AVX512F-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512F-NEXT: vpcmpeqd (%rdi), %ymm0, %ymm0
				; AVX512F-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512F-NEXT: # kill: def %xmm0 killed %xmm0 killed %ymm0
				; AVX512F-NEXT: vzeroupper
				; AVX512F-NEXT: retq
				%in = load <8 x i32>, <8 x i32>* %p
				%cmp = icmp eq <8 x i32> %in, zeroinitializer
				%trunc = sext <8 x i1> %cmp to <8 x i16>
				ret <8 x i16> %trunc
				}

				define <16 x i8> @testv16i32_sext(<16 x i32>* %p) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv16i32_sext:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256-NEXT: vpcmpeqd (%rdi), %ymm0, %k1
				; AVX256-NEXT: vpcmpeqd 32(%rdi), %ymm0, %k2
				; AVX256-NEXT: vpcmpeqd %ymm0, %ymm0, %ymm0
				; AVX256-NEXT: vmovdqa32 %ymm0, %ymm1 {%k2} {z}
				; AVX256-NEXT: vpmovdw %ymm1, %xmm1
				; AVX256-NEXT: vpacksswb %xmm0, %xmm1, %xmm1
				; AVX256-NEXT: vmovdqa32 %ymm0, %ymm0 {%k1} {z}
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vpacksswb %xmm0, %xmm0, %xmm0
				; AVX256-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
				; AVX256-NEXT: vzeroupper
				; AVX256-NEXT: retq
				;
				; AVX512-LABEL: testv16i32_sext:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vpcmpeqd (%rdi), %zmm0, %k1
				; AVX512-NEXT: vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
				; AVX512-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				%in = load <16 x i32>, <16 x i32>* %p
				%cmp = icmp eq <16 x i32> %in, zeroinitializer
				%trunc = sext <16 x i1> %cmp to <16 x i8>
				ret <16 x i8> %trunc
				}

				define <16 x i16> @testv16i32_sext_v16i16(<16 x i32>* %p) nounwind "require-vector-width"="256" {
				; AVX256-LABEL: testv16i32_sext_v16i16:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256-NEXT: vpcmpeqd 32(%rdi), %ymm0, %k1
				; AVX256-NEXT: vpcmpeqd (%rdi), %ymm0, %k2
				; AVX256-NEXT: vpcmpeqd %ymm0, %ymm0, %ymm0
				; AVX256-NEXT: vmovdqa32 %ymm0, %ymm1 {%k2} {z}
				; AVX256-NEXT: vpmovdw %ymm1, %xmm1
				; AVX256-NEXT: vmovdqa32 %ymm0, %ymm0 {%k1} {z}
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vinserti128 $1, %xmm0, %ymm1, %ymm0
				; AVX256-NEXT: retq
				;
				; AVX512-LABEL: testv16i32_sext_v16i16:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vpcmpeqd (%rdi), %zmm0, %k1
				; AVX512-NEXT: vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
				; AVX512-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512-NEXT: retq
				%in = load <16 x i32>, <16 x i32>* %p
				%cmp = icmp eq <16 x i32> %in, zeroinitializer
				%trunc = sext <16 x i1> %cmp to <16 x i16>
				ret <16 x i16> %trunc
				}

				define <8 x i16> @testv8i64_zext(<8 x i64>* %p) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv8i64_zext:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256-NEXT: vpcmpeqq (%rdi), %ymm0, %k0
				; AVX256-NEXT: vpcmpeqq 32(%rdi), %ymm0, %k1
				; AVX256-NEXT: kshiftlw $4, %k1, %k1
				; AVX256-NEXT: korw %k1, %k0, %k1
				; AVX256-NEXT: vpbroadcastd {{.*}}(%rip), %ymm0 {%k1} {z}
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vzeroupper
				; AVX256-NEXT: retq
				;
				; AVX512VL-LABEL: testv8i64_zext:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512VL-NEXT: vpcmpeqq (%rdi), %zmm0, %k1
				; AVX512VL-NEXT: vpbroadcastd {{.*}}(%rip), %ymm0 {%k1} {z}
				; AVX512VL-NEXT: vpmovdw %ymm0, %xmm0
				; AVX512VL-NEXT: vzeroupper
				; AVX512VL-NEXT: retq
				;
				; AVX512F-LABEL: testv8i64_zext:
				; AVX512F: # %bb.0:
				; AVX512F-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512F-NEXT: vpcmpeqq (%rdi), %zmm0, %k1
				; AVX512F-NEXT: vpbroadcastd {{.*}}(%rip), %zmm0 {%k1} {z}
				; AVX512F-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512F-NEXT: # kill: def %xmm0 killed %xmm0 killed %ymm0
				; AVX512F-NEXT: vzeroupper
				; AVX512F-NEXT: retq
				%in = load <8 x i64>, <8 x i64>* %p
				%cmp = icmp eq <8 x i64> %in, zeroinitializer
				%trunc = zext <8 x i1> %cmp to <8 x i16>
				ret <8 x i16> %trunc
				}

				define <8 x i16> @testv8i32_zext(<8 x i32>* %p) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv8i32_zext:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256-NEXT: vpcmpeqd (%rdi), %ymm0, %k1
				; AVX256-NEXT: vpbroadcastd {{.*}}(%rip), %ymm0 {%k1} {z}
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vzeroupper
				; AVX256-NEXT: retq
				;
				; AVX512VL-LABEL: testv8i32_zext:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512VL-NEXT: vpcmpeqd (%rdi), %ymm0, %k1
				; AVX512VL-NEXT: vpbroadcastd {{.*}}(%rip), %ymm0 {%k1} {z}
				; AVX512VL-NEXT: vpmovdw %ymm0, %xmm0
				; AVX512VL-NEXT: vzeroupper
				; AVX512VL-NEXT: retq
				;
				; AVX512F-LABEL: testv8i32_zext:
				; AVX512F: # %bb.0:
				; AVX512F-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512F-NEXT: vpcmpeqd (%rdi), %ymm0, %ymm0
				; AVX512F-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512F-NEXT: vpsrlw $15, %xmm0, %xmm0
				; AVX512F-NEXT: vzeroupper
				; AVX512F-NEXT: retq
				%in = load <8 x i32>, <8 x i32>* %p
				%cmp = icmp eq <8 x i32> %in, zeroinitializer
				%trunc = zext <8 x i1> %cmp to <8 x i16>
				ret <8 x i16> %trunc
				}

				define <16 x i8> @testv16i32_zext(<16 x i32>* %p) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv16i32_zext:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256-NEXT: vpcmpeqd (%rdi), %ymm0, %k1
				; AVX256-NEXT: vpcmpeqd 32(%rdi), %ymm0, %k2
				; AVX256-NEXT: movl {{.*}}(%rip), %eax
				; AVX256-NEXT: vpbroadcastd %eax, %ymm0 {%k2} {z}
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vmovdqa {{.*#+}} xmm1 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
				; AVX256-NEXT: vpshufb %xmm1, %xmm0, %xmm0
				; AVX256-NEXT: vpbroadcastd %eax, %ymm2 {%k1} {z}
				; AVX256-NEXT: vpmovdw %ymm2, %xmm2
				; AVX256-NEXT: vpshufb %xmm1, %xmm2, %xmm1
				; AVX256-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm1[0],xmm0[0]
				; AVX256-NEXT: vzeroupper
				; AVX256-NEXT: retq
				;
				; AVX512-LABEL: testv16i32_zext:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vpcmpeqd (%rdi), %zmm0, %k1
				; AVX512-NEXT: vpbroadcastd {{.*}}(%rip), %zmm0 {%k1} {z}
				; AVX512-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				%in = load <16 x i32>, <16 x i32>* %p
				%cmp = icmp eq <16 x i32> %in, zeroinitializer
				%trunc = zext <16 x i1> %cmp to <16 x i8>
				ret <16 x i8> %trunc
				}

				define <16 x i16> @testv16i32_zext_v16i16(<16 x i32>* %p) nounwind "require-vector-width"="256" {
				; AVX256-LABEL: testv16i32_zext_v16i16:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256-NEXT: vpcmpeqd 32(%rdi), %ymm0, %k1
				; AVX256-NEXT: vpcmpeqd (%rdi), %ymm0, %k2
				; AVX256-NEXT: movl {{.*}}(%rip), %eax
				; AVX256-NEXT: vpbroadcastd %eax, %ymm0 {%k2} {z}
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vpbroadcastd %eax, %ymm1 {%k1} {z}
				; AVX256-NEXT: vpmovdw %ymm1, %xmm1
				; AVX256-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
				; AVX256-NEXT: retq
				;
				; AVX512-LABEL: testv16i32_zext_v16i16:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512-NEXT: vpcmpeqd (%rdi), %zmm0, %k1
				; AVX512-NEXT: vpbroadcastd {{.*}}(%rip), %zmm0 {%k1} {z}
				; AVX512-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512-NEXT: retq
				%in = load <16 x i32>, <16 x i32>* %p
				%cmp = icmp eq <16 x i32> %in, zeroinitializer
				%trunc = zext <16 x i1> %cmp to <16 x i16>
				ret <16 x i16> %trunc
				}

test/CodeGen/X86/prefer-avx256-mask-shuffle.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX256 --check-prefix=AVX256VL
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,-prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512NOBW --check-prefix=AVX512VL
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw,+avx512vl,+prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX256 --check-prefix=AVX256VLBW
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw,+avx512vl,-prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512VLBW
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512NOBW --check-prefix=AVX512F
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512BW

				define <16 x i1> @shuf16i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0(<16 x i32>* %a, <16 x i32>* %b) "require-vector-width"="256" {
				; AVX256VL-LABEL: shuf16i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX256VL: # %bb.0:
				; AVX256VL-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256VL-NEXT: vpcmpeqd 32(%rdi), %ymm0, %k1
				; AVX256VL-NEXT: vpcmpeqd (%rdi), %ymm0, %k2
				; AVX256VL-NEXT: vpcmpeqd (%rsi), %ymm0, %k3
				; AVX256VL-NEXT: vpcmpeqd %ymm0, %ymm0, %ymm0
				; AVX256VL-NEXT: vmovdqa32 %ymm0, %ymm1 {%k2} {z}
				; AVX256VL-NEXT: vpmovdw %ymm1, %xmm1
				; AVX256VL-NEXT: vmovdqa32 %ymm0, %ymm2 {%k1} {z}
				; AVX256VL-NEXT: vpmovdw %ymm2, %xmm2
				; AVX256VL-NEXT: vinserti128 $1, %xmm2, %ymm1, %ymm1
				; AVX256VL-NEXT: vpermq {{.*#+}} ymm2 = ymm1[2,3,0,1]
				; AVX256VL-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1],ymm2[2],ymm1[3],ymm2[4,5],ymm1[6],ymm2[7]
				; AVX256VL-NEXT: vpshufb {{.*#+}} ymm1 = ymm1[6,7,12,13,u,u,8,9,6,7,14,15,14,15,0,1,22,23,28,29,18,19,26,27,22,23,u,u,30,31,16,17]
				; AVX256VL-NEXT: vmovdqa32 %ymm0, %ymm2 {%k3} {z}
				; AVX256VL-NEXT: vpmovdw %ymm2, %xmm2
				; AVX256VL-NEXT: kshiftrw $8, %k3, %k1
				; AVX256VL-NEXT: vmovdqa32 %ymm0, %ymm3 {%k1} {z}
				; AVX256VL-NEXT: vpmovdw %ymm3, %xmm3
				; AVX256VL-NEXT: vinserti128 $1, %xmm3, %ymm2, %ymm2
				; AVX256VL-NEXT: vpermq {{.*#+}} ymm2 = ymm2[1,1,2,1]
				; AVX256VL-NEXT: vmovdqa {{.*#+}} ymm3 = [255,255,255,255,0,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,0,255,255,255,255]
				; AVX256VL-NEXT: vpblendvb %ymm3, %ymm1, %ymm2, %ymm1
				; AVX256VL-NEXT: vpmovsxwd %xmm1, %ymm2
				; AVX256VL-NEXT: vpslld $31, %ymm2, %ymm2
				; AVX256VL-NEXT: vptestmd %ymm2, %ymm2, %k1
				; AVX256VL-NEXT: vextracti128 $1, %ymm1, %xmm1
				; AVX256VL-NEXT: vpmovsxwd %xmm1, %ymm1
				; AVX256VL-NEXT: vpslld $31, %ymm1, %ymm1
				; AVX256VL-NEXT: vptestmd %ymm1, %ymm1, %k0
				; AVX256VL-NEXT: kunpckbw %k1, %k0, %k0
				; AVX256VL-NEXT: kshiftrw $8, %k0, %k2
				; AVX256VL-NEXT: vmovdqa32 %ymm0, %ymm1 {%k2} {z}
				; AVX256VL-NEXT: vpmovdw %ymm1, %xmm1
				; AVX256VL-NEXT: vpacksswb %xmm0, %xmm1, %xmm1
				; AVX256VL-NEXT: vmovdqa32 %ymm0, %ymm0 {%k1} {z}
				; AVX256VL-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256VL-NEXT: vpacksswb %xmm0, %xmm0, %xmm0
				; AVX256VL-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
				; AVX256VL-NEXT: vzeroupper
				; AVX256VL-NEXT: retq
				;
				; AVX512NOBW-LABEL: shuf16i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX512NOBW: # %bb.0:
				; AVX512NOBW-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512NOBW-NEXT: vpcmpeqd (%rdi), %zmm0, %k1
				; AVX512NOBW-NEXT: vpcmpeqd (%rsi), %zmm0, %k2
				; AVX512NOBW-NEXT: vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k2} {z}
				; AVX512NOBW-NEXT: vpternlogd $255, %zmm1, %zmm1, %zmm1 {%k1} {z}
				; AVX512NOBW-NEXT: vmovdqa32 {{.*#+}} zmm2 = [3,6,22,12,3,7,7,0,3,6,1,13,3,21,7,0]
				; AVX512NOBW-NEXT: vpermi2d %zmm0, %zmm1, %zmm2
				; AVX512NOBW-NEXT: vptestmd %zmm2, %zmm2, %k1
				; AVX512NOBW-NEXT: vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k1} {z}
				; AVX512NOBW-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512NOBW-NEXT: vzeroupper
				; AVX512NOBW-NEXT: retq
				;
				; AVX256VLBW-LABEL: shuf16i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX256VLBW: # %bb.0:
				; AVX256VLBW-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX256VLBW-NEXT: vpcmpeqd (%rdi), %ymm0, %k0
				; AVX256VLBW-NEXT: vpcmpeqd 32(%rdi), %ymm0, %k1
				; AVX256VLBW-NEXT: kunpckbw %k0, %k1, %k0
				; AVX256VLBW-NEXT: vpcmpeqd (%rsi), %ymm0, %k1
				; AVX256VLBW-NEXT: vpmovm2w %k0, %ymm0
				; AVX256VLBW-NEXT: vpmovm2w %k1, %ymm1
				; AVX256VLBW-NEXT: vmovdqa {{.*#+}} ymm2 = [3,6,22,12,3,7,7,0,3,6,1,13,3,21,7,0]
				; AVX256VLBW-NEXT: vpermi2w %ymm1, %ymm0, %ymm2
				; AVX256VLBW-NEXT: vpmovw2m %ymm2, %k0
				; AVX256VLBW-NEXT: vpmovm2b %k0, %xmm0
				; AVX256VLBW-NEXT: vzeroupper
				; AVX256VLBW-NEXT: retq
				;
				; AVX512VLBW-LABEL: shuf16i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX512VLBW: # %bb.0:
				; AVX512VLBW-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512VLBW-NEXT: vpcmpeqd (%rdi), %zmm0, %k1
				; AVX512VLBW-NEXT: vpcmpeqd (%rsi), %zmm0, %k2
				; AVX512VLBW-NEXT: vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k2} {z}
				; AVX512VLBW-NEXT: vpternlogd $255, %zmm1, %zmm1, %zmm1 {%k1} {z}
				; AVX512VLBW-NEXT: vmovdqa32 {{.*#+}} zmm2 = [3,6,22,12,3,7,7,0,3,6,1,13,3,21,7,0]
				; AVX512VLBW-NEXT: vpermi2d %zmm0, %zmm1, %zmm2
				; AVX512VLBW-NEXT: vptestmd %zmm2, %zmm2, %k0
				; AVX512VLBW-NEXT: vpmovm2b %k0, %xmm0
				; AVX512VLBW-NEXT: vzeroupper
				; AVX512VLBW-NEXT: retq
				;
				; AVX512BW-LABEL: shuf16i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; AVX512BW-NEXT: vpcmpeqd (%rdi), %zmm0, %k1
				; AVX512BW-NEXT: vpcmpeqd (%rsi), %zmm0, %k2
				; AVX512BW-NEXT: vpternlogd $255, %zmm0, %zmm0, %zmm0 {%k2} {z}
				; AVX512BW-NEXT: vpternlogd $255, %zmm1, %zmm1, %zmm1 {%k1} {z}
				; AVX512BW-NEXT: vmovdqa32 {{.*#+}} zmm2 = [3,6,22,12,3,7,7,0,3,6,1,13,3,21,7,0]
				; AVX512BW-NEXT: vpermi2d %zmm0, %zmm1, %zmm2
				; AVX512BW-NEXT: vptestmd %zmm2, %zmm2, %k0
				; AVX512BW-NEXT: vpmovm2b %k0, %zmm0
				; AVX512BW-NEXT: # kill: def %xmm0 killed %xmm0 killed %zmm0
				; AVX512BW-NEXT: vzeroupper
				; AVX512BW-NEXT: retq

				%a1 = load <16 x i32>, <16 x i32>* %a
				%b1 = load <16 x i32>, <16 x i32>* %b
				%a2 = icmp eq <16 x i32> %a1, zeroinitializer
				%b2 = icmp eq <16 x i32> %b1, zeroinitializer
				%c = shufflevector <16 x i1> %a2, <16 x i1> %b2, <16 x i32> <i32 3, i32 6, i32 22, i32 12, i32 3, i32 7, i32 7, i32 0, i32 3, i32 6, i32 1, i32 13, i32 3, i32 21, i32 7, i32 0>
				ret <16 x i1> %c
				}

				define <32 x i1> @shuf32i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0(<32 x i8> %a) "require-vector-width"="256" {
				; AVX256VL-LABEL: shuf32i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX256VL: # %bb.0:
				; AVX256VL-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX256VL-NEXT: vpcmpeqb %ymm1, %ymm0, %ymm0
				; AVX256VL-NEXT: vpshufb {{.*#+}} ymm1 = ymm0[3,6,u,12,3,7,7,0,3,6,1,13,3,u,7,0,u,u,22,u,u,u,u,u,u,u,u,u,u,21,u,u]
				; AVX256VL-NEXT: vpermq {{.*#+}} ymm0 = ymm0[2,3,0,1]
				; AVX256VL-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[u,u,6,u,u,u,u,u,u,u,u,u,u,5,u,u,19,22,u,28,19,23,23,16,19,22,17,29,19,u,23,16]
				; AVX256VL-NEXT: vmovdqa {{.*#+}} ymm2 = [255,255,0,255,255,255,255,255,255,255,255,255,255,0,255,255,0,0,255,0,0,0,0,0,0,0,0,0,0,255,0,0]
				; AVX256VL-NEXT: vpblendvb %ymm2, %ymm1, %ymm0, %ymm0
				; AVX256VL-NEXT: retq
				;
				; AVX512NOBW-LABEL: shuf32i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX512NOBW: # %bb.0:
				; AVX512NOBW-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX512NOBW-NEXT: vpcmpeqb %ymm1, %ymm0, %ymm0
				; AVX512NOBW-NEXT: vpshufb {{.*#+}} ymm1 = ymm0[3,6,u,12,3,7,7,0,3,6,1,13,3,u,7,0,u,u,22,u,u,u,u,u,u,u,u,u,u,21,u,u]
				; AVX512NOBW-NEXT: vpermq {{.*#+}} ymm0 = ymm0[2,3,0,1]
				; AVX512NOBW-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[u,u,6,u,u,u,u,u,u,u,u,u,u,5,u,u,19,22,u,28,19,23,23,16,19,22,17,29,19,u,23,16]
				; AVX512NOBW-NEXT: vmovdqa {{.*#+}} ymm2 = [255,255,0,255,255,255,255,255,255,255,255,255,255,0,255,255,0,0,255,0,0,0,0,0,0,0,0,0,0,255,0,0]
				; AVX512NOBW-NEXT: vpblendvb %ymm2, %ymm1, %ymm0, %ymm0
				; AVX512NOBW-NEXT: retq
				;
				; AVX256VLBW-LABEL: shuf32i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX256VLBW: # %bb.0:
				; AVX256VLBW-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX256VLBW-NEXT: vpcmpeqb %ymm1, %ymm0, %k0
				; AVX256VLBW-NEXT: vpmovm2b %k0, %ymm0
				; AVX256VLBW-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]
				; AVX256VLBW-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[3,6,u,12,3,7,7,0,3,6,1,13,3,u,7,0,u,u,22,u,u,u,u,u,u,u,u,u,u,21,u,u]
				; AVX256VLBW-NEXT: movl $-537190396, %eax # imm = 0xDFFB2004
				; AVX256VLBW-NEXT: kmovd %eax, %k1
				; AVX256VLBW-NEXT: vpshufb {{.*#+}} ymm0 {%k1} = ymm1[u,u,6,u,u,u,u,u,u,u,u,u,u,5,u,u,19,22,u,28,19,23,23,16,19,22,17,29,19,u,23,16]
				; AVX256VLBW-NEXT: vpmovb2m %ymm0, %k0
				; AVX256VLBW-NEXT: vpmovm2b %k0, %ymm0
				; AVX256VLBW-NEXT: retq
				;
				; AVX512VLBW-LABEL: shuf32i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX512VLBW: # %bb.0:
				; AVX512VLBW-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX512VLBW-NEXT: vpcmpeqb %ymm1, %ymm0, %k0
				; AVX512VLBW-NEXT: vpmovm2w %k0, %zmm0
				; AVX512VLBW-NEXT: vmovdqa64 {{.*#+}} zmm1 = [3,6,22,12,3,7,7,0,3,6,1,13,3,21,7,0,3,6,22,12,3,7,7,0,3,6,1,13,3,21,7,0]
				; AVX512VLBW-NEXT: vpermw %zmm0, %zmm1, %zmm0
				; AVX512VLBW-NEXT: vpmovw2m %zmm0, %k0
				; AVX512VLBW-NEXT: vpmovm2b %k0, %ymm0
				; AVX512VLBW-NEXT: retq
				;
				; AVX512BW-LABEL: shuf32i1_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0_3_6_22_12_3_7_7_0_3_6_1_13_3_21_7_0:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; AVX512BW-NEXT: vpcmpeqb %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vpmovb2m %zmm0, %k0
				; AVX512BW-NEXT: vpmovm2w %k0, %zmm0
				; AVX512BW-NEXT: vmovdqa64 {{.*#+}} zmm1 = [3,6,22,12,3,7,7,0,3,6,1,13,3,21,7,0,3,6,22,12,3,7,7,0,3,6,1,13,3,21,7,0]
				; AVX512BW-NEXT: vpermw %zmm0, %zmm1, %zmm0
				; AVX512BW-NEXT: vpmovw2m %zmm0, %k0
				; AVX512BW-NEXT: vpmovm2b %k0, %zmm0
				; AVX512BW-NEXT: # kill: def %ymm0 killed %ymm0 killed %zmm0
				; AVX512BW-NEXT: retq
				%cmp = icmp eq <32 x i8> %a, zeroinitializer
				%b = shufflevector <32 x i1> %cmp, <32 x i1> undef, <32 x i32> <i32 3, i32 6, i32 22, i32 12, i32 3, i32 7, i32 7, i32 0, i32 3, i32 6, i32 1, i32 13, i32 3, i32 21, i32 7, i32 0, i32 3, i32 6, i32 22, i32 12, i32 3, i32 7, i32 7, i32 0, i32 3, i32 6, i32 1, i32 13, i32 3, i32 21, i32 7, i32 0>
				ret <32 x i1> %b
				}

test/CodeGen/X86/prefer-avx256-popcnt.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+avx512vpopcntdq,+prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX256
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+avx512vpopcntdq,-prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512VL
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512vpopcntdq | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512F

				define <8 x i16> @testv8i16(<8 x i16> %in) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv8i16:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX256-NEXT: vpopcntd %ymm0, %ymm0
				; AVX256-NEXT: vpmovdw %ymm0, %xmm0
				; AVX256-NEXT: vzeroupper
				; AVX256-NEXT: retq
				;
				; AVX512VL-LABEL: testv8i16:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX512VL-NEXT: vpopcntd %ymm0, %ymm0
				; AVX512VL-NEXT: vpmovdw %ymm0, %xmm0
				; AVX512VL-NEXT: vzeroupper
				; AVX512VL-NEXT: retq
				;
				; AVX512F-LABEL: testv8i16:
				; AVX512F: # %bb.0:
				; AVX512F-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
				; AVX512F-NEXT: vpopcntd %zmm0, %zmm0
				; AVX512F-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512F-NEXT: # kill: def %xmm0 killed %xmm0 killed %ymm0
				; AVX512F-NEXT: vzeroupper
				; AVX512F-NEXT: retq
				%out = call <8 x i16> @llvm.ctpop.v8i16(<8 x i16> %in)
				ret <8 x i16> %out
				}

				define <16 x i8> @testv16i8(<16 x i8> %in) nounwind "require-vector-width"="128" {
				; AVX256-LABEL: testv16i8:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vmovdqa {{.*#+}} xmm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX256-NEXT: vpand %xmm1, %xmm0, %xmm2
				; AVX256-NEXT: vmovdqa {{.*#+}} xmm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX256-NEXT: vpshufb %xmm2, %xmm3, %xmm2
				; AVX256-NEXT: vpsrlw $4, %xmm0, %xmm0
				; AVX256-NEXT: vpand %xmm1, %xmm0, %xmm0
				; AVX256-NEXT: vpshufb %xmm0, %xmm3, %xmm0
				; AVX256-NEXT: vpaddb %xmm2, %xmm0, %xmm0
				; AVX256-NEXT: retq
				;
				; AVX512-LABEL: testv16i8:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero
				; AVX512-NEXT: vpopcntd %zmm0, %zmm0
				; AVX512-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512-NEXT: vzeroupper
				; AVX512-NEXT: retq
				%out = call <16 x i8> @llvm.ctpop.v16i8(<16 x i8> %in)
				ret <16 x i8> %out
				}

				define <16 x i16> @testv16i16(<16 x i16> %in) nounwind "require-vector-width"="256" {
				; AVX256-LABEL: testv16i16:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vmovdqa {{.*#+}} ymm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; AVX256-NEXT: vpand %ymm1, %ymm0, %ymm2
				; AVX256-NEXT: vmovdqa {{.*#+}} ymm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; AVX256-NEXT: vpshufb %ymm2, %ymm3, %ymm2
				; AVX256-NEXT: vpsrlw $4, %ymm0, %ymm0
				; AVX256-NEXT: vpand %ymm1, %ymm0, %ymm0
				; AVX256-NEXT: vpshufb %ymm0, %ymm3, %ymm0
				; AVX256-NEXT: vpaddb %ymm2, %ymm0, %ymm0
				; AVX256-NEXT: vpsllw $8, %ymm0, %ymm1
				; AVX256-NEXT: vpaddb %ymm0, %ymm1, %ymm0
				; AVX256-NEXT: vpsrlw $8, %ymm0, %ymm0
				; AVX256-NEXT: retq
				;
				; AVX512-LABEL: testv16i16:
				; AVX512: # %bb.0:
				; AVX512-NEXT: vpmovzxwd {{.*#+}} zmm0 = ymm0[0],zero,ymm0[1],zero,ymm0[2],zero,ymm0[3],zero,ymm0[4],zero,ymm0[5],zero,ymm0[6],zero,ymm0[7],zero,ymm0[8],zero,ymm0[9],zero,ymm0[10],zero,ymm0[11],zero,ymm0[12],zero,ymm0[13],zero,ymm0[14],zero,ymm0[15],zero
				; AVX512-NEXT: vpopcntd %zmm0, %zmm0
				; AVX512-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512-NEXT: retq
				%out = call <16 x i16> @llvm.ctpop.v16i16(<16 x i16> %in)
				ret <16 x i16> %out
				}

				define <32 x i8> @testv32i8(<32 x i8> %in) nounwind "require-vector-width"="256" {
				; CHECK-LABEL: testv32i8:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm1 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm2
				; CHECK-NEXT: vmovdqa {{.*#+}} ymm3 = [0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4]
				; CHECK-NEXT: vpshufb %ymm2, %ymm3, %ymm2
				; CHECK-NEXT: vpsrlw $4, %ymm0, %ymm0
				; CHECK-NEXT: vpand %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpshufb %ymm0, %ymm3, %ymm0
				; CHECK-NEXT: vpaddb %ymm2, %ymm0, %ymm0
				; CHECK-NEXT: retq
				%out = call <32 x i8> @llvm.ctpop.v32i8(<32 x i8> %in)
				ret <32 x i8> %out
				}

				declare <8 x i16> @llvm.ctpop.v8i16(<8 x i16>)
				declare <16 x i8> @llvm.ctpop.v16i8(<16 x i8>)
				declare <16 x i16> @llvm.ctpop.v16i16(<16 x i16>)
				declare <32 x i8> @llvm.ctpop.v32i8(<32 x i8>)

test/CodeGen/X86/prefer-avx256-shift.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw,+avx512vl,+prefer-vector-width-256,+no-512-bit-vectors | FileCheck %s --check-prefix=ALL --check-prefix=AVX256 --check-prefix=AVX256BW
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+prefer-vector-width-256,+no-512-bit-vectors | FileCheck %s --check-prefix=ALL --check-prefix=AVX256 --check-prefix=AVX256VL
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw,+avx512vl | FileCheck %s --check-prefix=ALL --check-prefix=AVX512 --check-prefix=AVX512BW
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl | FileCheck %s --check-prefix=ALL --check-prefix=AVX512 --check-prefix=AVX512VL

				define <32 x i8> @var_shl_v32i8(<32 x i8> %a, <32 x i8> %b) nounwind {
				; AVX256-LABEL: var_shl_v32i8:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpsllw $5, %ymm1, %ymm1
				; AVX256-NEXT: vpsllw $4, %ymm0, %ymm2
				; AVX256-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX256-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX256-NEXT: vpsllw $2, %ymm0, %ymm2
				; AVX256-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX256-NEXT: vpaddb %ymm1, %ymm1, %ymm1
				; AVX256-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX256-NEXT: vpaddb %ymm0, %ymm0, %ymm2
				; AVX256-NEXT: vpaddb %ymm1, %ymm1, %ymm1
				; AVX256-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX256-NEXT: retq
				;
				; AVX512BW-LABEL: var_shl_v32i8:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} zmm1 = ymm1[0],zero,ymm1[1],zero,ymm1[2],zero,ymm1[3],zero,ymm1[4],zero,ymm1[5],zero,ymm1[6],zero,ymm1[7],zero,ymm1[8],zero,ymm1[9],zero,ymm1[10],zero,ymm1[11],zero,ymm1[12],zero,ymm1[13],zero,ymm1[14],zero,ymm1[15],zero,ymm1[16],zero,ymm1[17],zero,ymm1[18],zero,ymm1[19],zero,ymm1[20],zero,ymm1[21],zero,ymm1[22],zero,ymm1[23],zero,ymm1[24],zero,ymm1[25],zero,ymm1[26],zero,ymm1[27],zero,ymm1[28],zero,ymm1[29],zero,ymm1[30],zero,ymm1[31],zero
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} zmm0 = ymm0[0],zero,ymm0[1],zero,ymm0[2],zero,ymm0[3],zero,ymm0[4],zero,ymm0[5],zero,ymm0[6],zero,ymm0[7],zero,ymm0[8],zero,ymm0[9],zero,ymm0[10],zero,ymm0[11],zero,ymm0[12],zero,ymm0[13],zero,ymm0[14],zero,ymm0[15],zero,ymm0[16],zero,ymm0[17],zero,ymm0[18],zero,ymm0[19],zero,ymm0[20],zero,ymm0[21],zero,ymm0[22],zero,ymm0[23],zero,ymm0[24],zero,ymm0[25],zero,ymm0[26],zero,ymm0[27],zero,ymm0[28],zero,ymm0[29],zero,ymm0[30],zero,ymm0[31],zero
				; AVX512BW-NEXT: vpsllvw %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vpmovwb %zmm0, %ymm0
				; AVX512BW-NEXT: retq
				;
				; AVX512VL-LABEL: var_shl_v32i8:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpsllw $5, %ymm1, %ymm1
				; AVX512VL-NEXT: vpsllw $4, %ymm0, %ymm2
				; AVX512VL-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX512VL-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX512VL-NEXT: vpsllw $2, %ymm0, %ymm2
				; AVX512VL-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX512VL-NEXT: vpaddb %ymm1, %ymm1, %ymm1
				; AVX512VL-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX512VL-NEXT: vpaddb %ymm0, %ymm0, %ymm2
				; AVX512VL-NEXT: vpaddb %ymm1, %ymm1, %ymm1
				; AVX512VL-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX512VL-NEXT: retq
				%shift = shl <32 x i8> %a, %b
				ret <32 x i8> %shift
				}

				define <16 x i16> @var_shl_v16i16(<16 x i16> %a, <16 x i16> %b) nounwind {
				; AVX256BW-LABEL: var_shl_v16i16:
				; AVX256BW: # %bb.0:
				; AVX256BW-NEXT: vpsllvw %ymm1, %ymm0, %ymm0
				; AVX256BW-NEXT: retq
				;
				; AVX256VL-LABEL: var_shl_v16i16:
				; AVX256VL: # %bb.0:
				; AVX256VL-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX256VL-NEXT: vpunpckhwd {{.*#+}} ymm3 = ymm1[4],ymm2[4],ymm1[5],ymm2[5],ymm1[6],ymm2[6],ymm1[7],ymm2[7],ymm1[12],ymm2[12],ymm1[13],ymm2[13],ymm1[14],ymm2[14],ymm1[15],ymm2[15]
				; AVX256VL-NEXT: vpunpckhwd {{.*#+}} ymm4 = ymm2[4],ymm0[4],ymm2[5],ymm0[5],ymm2[6],ymm0[6],ymm2[7],ymm0[7],ymm2[12],ymm0[12],ymm2[13],ymm0[13],ymm2[14],ymm0[14],ymm2[15],ymm0[15]
				; AVX256VL-NEXT: vpsllvd %ymm3, %ymm4, %ymm3
				; AVX256VL-NEXT: vpsrld $16, %ymm3, %ymm3
				; AVX256VL-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm1[0],ymm2[0],ymm1[1],ymm2[1],ymm1[2],ymm2[2],ymm1[3],ymm2[3],ymm1[8],ymm2[8],ymm1[9],ymm2[9],ymm1[10],ymm2[10],ymm1[11],ymm2[11]
				; AVX256VL-NEXT: vpunpcklwd {{.*#+}} ymm0 = ymm2[0],ymm0[0],ymm2[1],ymm0[1],ymm2[2],ymm0[2],ymm2[3],ymm0[3],ymm2[8],ymm0[8],ymm2[9],ymm0[9],ymm2[10],ymm0[10],ymm2[11],ymm0[11]
				; AVX256VL-NEXT: vpsllvd %ymm1, %ymm0, %ymm0
				; AVX256VL-NEXT: vpsrld $16, %ymm0, %ymm0
				; AVX256VL-NEXT: vpackusdw %ymm3, %ymm0, %ymm0
				; AVX256VL-NEXT: retq
				;
				; AVX512BW-LABEL: var_shl_v16i16:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpsllvw %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: retq
				;
				; AVX512VL-LABEL: var_shl_v16i16:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpmovzxwd {{.*#+}} zmm1 = ymm1[0],zero,ymm1[1],zero,ymm1[2],zero,ymm1[3],zero,ymm1[4],zero,ymm1[5],zero,ymm1[6],zero,ymm1[7],zero,ymm1[8],zero,ymm1[9],zero,ymm1[10],zero,ymm1[11],zero,ymm1[12],zero,ymm1[13],zero,ymm1[14],zero,ymm1[15],zero
				; AVX512VL-NEXT: vpmovzxwd {{.*#+}} zmm0 = ymm0[0],zero,ymm0[1],zero,ymm0[2],zero,ymm0[3],zero,ymm0[4],zero,ymm0[5],zero,ymm0[6],zero,ymm0[7],zero,ymm0[8],zero,ymm0[9],zero,ymm0[10],zero,ymm0[11],zero,ymm0[12],zero,ymm0[13],zero,ymm0[14],zero,ymm0[15],zero
				; AVX512VL-NEXT: vpsllvd %zmm1, %zmm0, %zmm0
				; AVX512VL-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512VL-NEXT: retq
				%shift = shl <16 x i16> %a, %b
				ret <16 x i16> %shift
				}

				define <16 x i8> @var_shl_v16i8(<16 x i8> %a, <16 x i8> %b) nounwind {
				; AVX256BW-LABEL: var_shl_v16i8:
				; AVX256BW: # %bb.0:
				; AVX256BW-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
				; AVX256BW-NEXT: vpmovzxbw {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
				; AVX256BW-NEXT: vpsllvw %ymm1, %ymm0, %ymm0
				; AVX256BW-NEXT: vpmovwb %ymm0, %xmm0
				; AVX256BW-NEXT: vzeroupper
				; AVX256BW-NEXT: retq
				;
				; AVX256VL-LABEL: var_shl_v16i8:
				; AVX256VL: # %bb.0:
				; AVX256VL-NEXT: vpsllw $5, %xmm1, %xmm1
				; AVX256VL-NEXT: vpsllw $4, %xmm0, %xmm2
				; AVX256VL-NEXT: vpand {{.*}}(%rip), %xmm2, %xmm2
				; AVX256VL-NEXT: vpblendvb %xmm1, %xmm2, %xmm0, %xmm0
				; AVX256VL-NEXT: vpsllw $2, %xmm0, %xmm2
				; AVX256VL-NEXT: vpand {{.*}}(%rip), %xmm2, %xmm2
				; AVX256VL-NEXT: vpaddb %xmm1, %xmm1, %xmm1
				; AVX256VL-NEXT: vpblendvb %xmm1, %xmm2, %xmm0, %xmm0
				; AVX256VL-NEXT: vpaddb %xmm0, %xmm0, %xmm2
				; AVX256VL-NEXT: vpaddb %xmm1, %xmm1, %xmm1
				; AVX256VL-NEXT: vpblendvb %xmm1, %xmm2, %xmm0, %xmm0
				; AVX256VL-NEXT: retq
				;
				; AVX512BW-LABEL: var_shl_v16i8:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
				; AVX512BW-NEXT: vpsllvw %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vpmovwb %ymm0, %xmm0
				; AVX512BW-NEXT: vzeroupper
				; AVX512BW-NEXT: retq
				;
				; AVX512VL-LABEL: var_shl_v16i8:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpmovzxbd {{.*#+}} zmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero,xmm1[8],zero,zero,zero,xmm1[9],zero,zero,zero,xmm1[10],zero,zero,zero,xmm1[11],zero,zero,zero,xmm1[12],zero,zero,zero,xmm1[13],zero,zero,zero,xmm1[14],zero,zero,zero,xmm1[15],zero,zero,zero
				; AVX512VL-NEXT: vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero
				; AVX512VL-NEXT: vpsllvd %zmm1, %zmm0, %zmm0
				; AVX512VL-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512VL-NEXT: vzeroupper
				; AVX512VL-NEXT: retq
				%shift = shl <16 x i8> %a, %b
				ret <16 x i8> %shift
				}

				define <32 x i8> @var_lshr_v32i8(<32 x i8> %a, <32 x i8> %b) nounwind {
				; AVX256-LABEL: var_lshr_v32i8:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpsllw $5, %ymm1, %ymm1
				; AVX256-NEXT: vpsrlw $4, %ymm0, %ymm2
				; AVX256-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX256-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX256-NEXT: vpsrlw $2, %ymm0, %ymm2
				; AVX256-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX256-NEXT: vpaddb %ymm1, %ymm1, %ymm1
				; AVX256-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX256-NEXT: vpsrlw $1, %ymm0, %ymm2
				; AVX256-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX256-NEXT: vpaddb %ymm1, %ymm1, %ymm1
				; AVX256-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX256-NEXT: retq
				;
				; AVX512BW-LABEL: var_lshr_v32i8:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} zmm1 = ymm1[0],zero,ymm1[1],zero,ymm1[2],zero,ymm1[3],zero,ymm1[4],zero,ymm1[5],zero,ymm1[6],zero,ymm1[7],zero,ymm1[8],zero,ymm1[9],zero,ymm1[10],zero,ymm1[11],zero,ymm1[12],zero,ymm1[13],zero,ymm1[14],zero,ymm1[15],zero,ymm1[16],zero,ymm1[17],zero,ymm1[18],zero,ymm1[19],zero,ymm1[20],zero,ymm1[21],zero,ymm1[22],zero,ymm1[23],zero,ymm1[24],zero,ymm1[25],zero,ymm1[26],zero,ymm1[27],zero,ymm1[28],zero,ymm1[29],zero,ymm1[30],zero,ymm1[31],zero
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} zmm0 = ymm0[0],zero,ymm0[1],zero,ymm0[2],zero,ymm0[3],zero,ymm0[4],zero,ymm0[5],zero,ymm0[6],zero,ymm0[7],zero,ymm0[8],zero,ymm0[9],zero,ymm0[10],zero,ymm0[11],zero,ymm0[12],zero,ymm0[13],zero,ymm0[14],zero,ymm0[15],zero,ymm0[16],zero,ymm0[17],zero,ymm0[18],zero,ymm0[19],zero,ymm0[20],zero,ymm0[21],zero,ymm0[22],zero,ymm0[23],zero,ymm0[24],zero,ymm0[25],zero,ymm0[26],zero,ymm0[27],zero,ymm0[28],zero,ymm0[29],zero,ymm0[30],zero,ymm0[31],zero
				; AVX512BW-NEXT: vpsrlvw %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vpmovwb %zmm0, %ymm0
				; AVX512BW-NEXT: retq
				;
				; AVX512VL-LABEL: var_lshr_v32i8:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpsllw $5, %ymm1, %ymm1
				; AVX512VL-NEXT: vpsrlw $4, %ymm0, %ymm2
				; AVX512VL-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX512VL-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX512VL-NEXT: vpsrlw $2, %ymm0, %ymm2
				; AVX512VL-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX512VL-NEXT: vpaddb %ymm1, %ymm1, %ymm1
				; AVX512VL-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX512VL-NEXT: vpsrlw $1, %ymm0, %ymm2
				; AVX512VL-NEXT: vpand {{.*}}(%rip), %ymm2, %ymm2
				; AVX512VL-NEXT: vpaddb %ymm1, %ymm1, %ymm1
				; AVX512VL-NEXT: vpblendvb %ymm1, %ymm2, %ymm0, %ymm0
				; AVX512VL-NEXT: retq
				%shift = lshr <32 x i8> %a, %b
				ret <32 x i8> %shift
				}

				define <16 x i16> @var_lshr_v16i16(<16 x i16> %a, <16 x i16> %b) nounwind {
				; AVX256BW-LABEL: var_lshr_v16i16:
				; AVX256BW: # %bb.0:
				; AVX256BW-NEXT: vpsrlvw %ymm1, %ymm0, %ymm0
				; AVX256BW-NEXT: retq
				;
				; AVX256VL-LABEL: var_lshr_v16i16:
				; AVX256VL: # %bb.0:
				; AVX256VL-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX256VL-NEXT: vpunpckhwd {{.*#+}} ymm3 = ymm1[4],ymm2[4],ymm1[5],ymm2[5],ymm1[6],ymm2[6],ymm1[7],ymm2[7],ymm1[12],ymm2[12],ymm1[13],ymm2[13],ymm1[14],ymm2[14],ymm1[15],ymm2[15]
				; AVX256VL-NEXT: vpunpckhwd {{.*#+}} ymm4 = ymm2[4],ymm0[4],ymm2[5],ymm0[5],ymm2[6],ymm0[6],ymm2[7],ymm0[7],ymm2[12],ymm0[12],ymm2[13],ymm0[13],ymm2[14],ymm0[14],ymm2[15],ymm0[15]
				; AVX256VL-NEXT: vpsrlvd %ymm3, %ymm4, %ymm3
				; AVX256VL-NEXT: vpsrld $16, %ymm3, %ymm3
				; AVX256VL-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm1[0],ymm2[0],ymm1[1],ymm2[1],ymm1[2],ymm2[2],ymm1[3],ymm2[3],ymm1[8],ymm2[8],ymm1[9],ymm2[9],ymm1[10],ymm2[10],ymm1[11],ymm2[11]
				; AVX256VL-NEXT: vpunpcklwd {{.*#+}} ymm0 = ymm2[0],ymm0[0],ymm2[1],ymm0[1],ymm2[2],ymm0[2],ymm2[3],ymm0[3],ymm2[8],ymm0[8],ymm2[9],ymm0[9],ymm2[10],ymm0[10],ymm2[11],ymm0[11]
				; AVX256VL-NEXT: vpsrlvd %ymm1, %ymm0, %ymm0
				; AVX256VL-NEXT: vpsrld $16, %ymm0, %ymm0
				; AVX256VL-NEXT: vpackusdw %ymm3, %ymm0, %ymm0
				; AVX256VL-NEXT: retq
				;
				; AVX512BW-LABEL: var_lshr_v16i16:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpsrlvw %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: retq
				;
				; AVX512VL-LABEL: var_lshr_v16i16:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpmovzxwd {{.*#+}} zmm1 = ymm1[0],zero,ymm1[1],zero,ymm1[2],zero,ymm1[3],zero,ymm1[4],zero,ymm1[5],zero,ymm1[6],zero,ymm1[7],zero,ymm1[8],zero,ymm1[9],zero,ymm1[10],zero,ymm1[11],zero,ymm1[12],zero,ymm1[13],zero,ymm1[14],zero,ymm1[15],zero
				; AVX512VL-NEXT: vpmovzxwd {{.*#+}} zmm0 = ymm0[0],zero,ymm0[1],zero,ymm0[2],zero,ymm0[3],zero,ymm0[4],zero,ymm0[5],zero,ymm0[6],zero,ymm0[7],zero,ymm0[8],zero,ymm0[9],zero,ymm0[10],zero,ymm0[11],zero,ymm0[12],zero,ymm0[13],zero,ymm0[14],zero,ymm0[15],zero
				; AVX512VL-NEXT: vpsrlvd %zmm1, %zmm0, %zmm0
				; AVX512VL-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512VL-NEXT: retq
				%shift = lshr <16 x i16> %a, %b
				ret <16 x i16> %shift
				}

				define <16 x i8> @var_lshr_v16i8(<16 x i8> %a, <16 x i8> %b) nounwind {
				; AVX256BW-LABEL: var_lshr_v16i8:
				; AVX256BW: # %bb.0:
				; AVX256BW-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
				; AVX256BW-NEXT: vpmovzxbw {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
				; AVX256BW-NEXT: vpsrlvw %ymm1, %ymm0, %ymm0
				; AVX256BW-NEXT: vpmovwb %ymm0, %xmm0
				; AVX256BW-NEXT: vzeroupper
				; AVX256BW-NEXT: retq
				;
				; AVX256VL-LABEL: var_lshr_v16i8:
				; AVX256VL: # %bb.0:
				; AVX256VL-NEXT: vpsllw $5, %xmm1, %xmm1
				; AVX256VL-NEXT: vpsrlw $4, %xmm0, %xmm2
				; AVX256VL-NEXT: vpand {{.*}}(%rip), %xmm2, %xmm2
				; AVX256VL-NEXT: vpblendvb %xmm1, %xmm2, %xmm0, %xmm0
				; AVX256VL-NEXT: vpsrlw $2, %xmm0, %xmm2
				; AVX256VL-NEXT: vpand {{.*}}(%rip), %xmm2, %xmm2
				; AVX256VL-NEXT: vpaddb %xmm1, %xmm1, %xmm1
				; AVX256VL-NEXT: vpblendvb %xmm1, %xmm2, %xmm0, %xmm0
				; AVX256VL-NEXT: vpsrlw $1, %xmm0, %xmm2
				; AVX256VL-NEXT: vpand {{.*}}(%rip), %xmm2, %xmm2
				; AVX256VL-NEXT: vpaddb %xmm1, %xmm1, %xmm1
				; AVX256VL-NEXT: vpblendvb %xmm1, %xmm2, %xmm0, %xmm0
				; AVX256VL-NEXT: retq
				;
				; AVX512BW-LABEL: var_lshr_v16i8:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
				; AVX512BW-NEXT: vpsrlvw %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vpmovwb %ymm0, %xmm0
				; AVX512BW-NEXT: vzeroupper
				; AVX512BW-NEXT: retq
				;
				; AVX512VL-LABEL: var_lshr_v16i8:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpmovzxbd {{.*#+}} zmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero,xmm1[8],zero,zero,zero,xmm1[9],zero,zero,zero,xmm1[10],zero,zero,zero,xmm1[11],zero,zero,zero,xmm1[12],zero,zero,zero,xmm1[13],zero,zero,zero,xmm1[14],zero,zero,zero,xmm1[15],zero,zero,zero
				; AVX512VL-NEXT: vpmovzxbd {{.*#+}} zmm0 = xmm0[0],zero,zero,zero,xmm0[1],zero,zero,zero,xmm0[2],zero,zero,zero,xmm0[3],zero,zero,zero,xmm0[4],zero,zero,zero,xmm0[5],zero,zero,zero,xmm0[6],zero,zero,zero,xmm0[7],zero,zero,zero,xmm0[8],zero,zero,zero,xmm0[9],zero,zero,zero,xmm0[10],zero,zero,zero,xmm0[11],zero,zero,zero,xmm0[12],zero,zero,zero,xmm0[13],zero,zero,zero,xmm0[14],zero,zero,zero,xmm0[15],zero,zero,zero
				; AVX512VL-NEXT: vpsrlvd %zmm1, %zmm0, %zmm0
				; AVX512VL-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512VL-NEXT: vzeroupper
				; AVX512VL-NEXT: retq
				%shift = lshr <16 x i8> %a, %b
				ret <16 x i8> %shift
				}

				define <32 x i8> @var_ashr_v32i8(<32 x i8> %a, <32 x i8> %b) nounwind {
				; AVX256-LABEL: var_ashr_v32i8:
				; AVX256: # %bb.0:
				; AVX256-NEXT: vpsllw $5, %ymm1, %ymm1
				; AVX256-NEXT: vpunpckhbw {{.*#+}} ymm2 = ymm0[8],ymm1[8],ymm0[9],ymm1[9],ymm0[10],ymm1[10],ymm0[11],ymm1[11],ymm0[12],ymm1[12],ymm0[13],ymm1[13],ymm0[14],ymm1[14],ymm0[15],ymm1[15],ymm0[24],ymm1[24],ymm0[25],ymm1[25],ymm0[26],ymm1[26],ymm0[27],ymm1[27],ymm0[28],ymm1[28],ymm0[29],ymm1[29],ymm0[30],ymm1[30],ymm0[31],ymm1[31]
				; AVX256-NEXT: vpunpckhbw {{.*#+}} ymm3 = ymm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
				; AVX256-NEXT: vpsraw $4, %ymm3, %ymm4
				; AVX256-NEXT: vpblendvb %ymm2, %ymm4, %ymm3, %ymm3
				; AVX256-NEXT: vpsraw $2, %ymm3, %ymm4
				; AVX256-NEXT: vpaddw %ymm2, %ymm2, %ymm2
				; AVX256-NEXT: vpblendvb %ymm2, %ymm4, %ymm3, %ymm3
				; AVX256-NEXT: vpsraw $1, %ymm3, %ymm4
				; AVX256-NEXT: vpaddw %ymm2, %ymm2, %ymm2
				; AVX256-NEXT: vpblendvb %ymm2, %ymm4, %ymm3, %ymm2
				; AVX256-NEXT: vpsrlw $8, %ymm2, %ymm2
				; AVX256-NEXT: vpunpcklbw {{.*#+}} ymm1 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[4],ymm1[4],ymm0[5],ymm1[5],ymm0[6],ymm1[6],ymm0[7],ymm1[7],ymm0[16],ymm1[16],ymm0[17],ymm1[17],ymm0[18],ymm1[18],ymm0[19],ymm1[19],ymm0[20],ymm1[20],ymm0[21],ymm1[21],ymm0[22],ymm1[22],ymm0[23],ymm1[23]
				; AVX256-NEXT: vpunpcklbw {{.*#+}} ymm0 = ymm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23]
				; AVX256-NEXT: vpsraw $4, %ymm0, %ymm3
				; AVX256-NEXT: vpblendvb %ymm1, %ymm3, %ymm0, %ymm0
				; AVX256-NEXT: vpsraw $2, %ymm0, %ymm3
				; AVX256-NEXT: vpaddw %ymm1, %ymm1, %ymm1
				; AVX256-NEXT: vpblendvb %ymm1, %ymm3, %ymm0, %ymm0
				; AVX256-NEXT: vpsraw $1, %ymm0, %ymm3
				; AVX256-NEXT: vpaddw %ymm1, %ymm1, %ymm1
				; AVX256-NEXT: vpblendvb %ymm1, %ymm3, %ymm0, %ymm0
				; AVX256-NEXT: vpsrlw $8, %ymm0, %ymm0
				; AVX256-NEXT: vpackuswb %ymm2, %ymm0, %ymm0
				; AVX256-NEXT: retq
				;
				; AVX512BW-LABEL: var_ashr_v32i8:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} zmm1 = ymm1[0],zero,ymm1[1],zero,ymm1[2],zero,ymm1[3],zero,ymm1[4],zero,ymm1[5],zero,ymm1[6],zero,ymm1[7],zero,ymm1[8],zero,ymm1[9],zero,ymm1[10],zero,ymm1[11],zero,ymm1[12],zero,ymm1[13],zero,ymm1[14],zero,ymm1[15],zero,ymm1[16],zero,ymm1[17],zero,ymm1[18],zero,ymm1[19],zero,ymm1[20],zero,ymm1[21],zero,ymm1[22],zero,ymm1[23],zero,ymm1[24],zero,ymm1[25],zero,ymm1[26],zero,ymm1[27],zero,ymm1[28],zero,ymm1[29],zero,ymm1[30],zero,ymm1[31],zero
				; AVX512BW-NEXT: vpmovsxbw %ymm0, %zmm0
				; AVX512BW-NEXT: vpsravw %zmm1, %zmm0, %zmm0
				; AVX512BW-NEXT: vpmovwb %zmm0, %ymm0
				; AVX512BW-NEXT: retq
				;
				; AVX512VL-LABEL: var_ashr_v32i8:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpsllw $5, %ymm1, %ymm1
				; AVX512VL-NEXT: vpunpckhbw {{.*#+}} ymm2 = ymm0[8],ymm1[8],ymm0[9],ymm1[9],ymm0[10],ymm1[10],ymm0[11],ymm1[11],ymm0[12],ymm1[12],ymm0[13],ymm1[13],ymm0[14],ymm1[14],ymm0[15],ymm1[15],ymm0[24],ymm1[24],ymm0[25],ymm1[25],ymm0[26],ymm1[26],ymm0[27],ymm1[27],ymm0[28],ymm1[28],ymm0[29],ymm1[29],ymm0[30],ymm1[30],ymm0[31],ymm1[31]
				; AVX512VL-NEXT: vpunpckhbw {{.*#+}} ymm3 = ymm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
				; AVX512VL-NEXT: vpsraw $4, %ymm3, %ymm4
				; AVX512VL-NEXT: vpblendvb %ymm2, %ymm4, %ymm3, %ymm3
				; AVX512VL-NEXT: vpsraw $2, %ymm3, %ymm4
				; AVX512VL-NEXT: vpaddw %ymm2, %ymm2, %ymm2
				; AVX512VL-NEXT: vpblendvb %ymm2, %ymm4, %ymm3, %ymm3
				; AVX512VL-NEXT: vpsraw $1, %ymm3, %ymm4
				; AVX512VL-NEXT: vpaddw %ymm2, %ymm2, %ymm2
				; AVX512VL-NEXT: vpblendvb %ymm2, %ymm4, %ymm3, %ymm2
				; AVX512VL-NEXT: vpsrlw $8, %ymm2, %ymm2
				; AVX512VL-NEXT: vpunpcklbw {{.*#+}} ymm1 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[4],ymm1[4],ymm0[5],ymm1[5],ymm0[6],ymm1[6],ymm0[7],ymm1[7],ymm0[16],ymm1[16],ymm0[17],ymm1[17],ymm0[18],ymm1[18],ymm0[19],ymm1[19],ymm0[20],ymm1[20],ymm0[21],ymm1[21],ymm0[22],ymm1[22],ymm0[23],ymm1[23]
				; AVX512VL-NEXT: vpunpcklbw {{.*#+}} ymm0 = ymm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23]
				; AVX512VL-NEXT: vpsraw $4, %ymm0, %ymm3
				; AVX512VL-NEXT: vpblendvb %ymm1, %ymm3, %ymm0, %ymm0
				; AVX512VL-NEXT: vpsraw $2, %ymm0, %ymm3
				; AVX512VL-NEXT: vpaddw %ymm1, %ymm1, %ymm1
				; AVX512VL-NEXT: vpblendvb %ymm1, %ymm3, %ymm0, %ymm0
				; AVX512VL-NEXT: vpsraw $1, %ymm0, %ymm3
				; AVX512VL-NEXT: vpaddw %ymm1, %ymm1, %ymm1
				; AVX512VL-NEXT: vpblendvb %ymm1, %ymm3, %ymm0, %ymm0
				; AVX512VL-NEXT: vpsrlw $8, %ymm0, %ymm0
				; AVX512VL-NEXT: vpackuswb %ymm2, %ymm0, %ymm0
				; AVX512VL-NEXT: retq
				%shift = ashr <32 x i8> %a, %b
				ret <32 x i8> %shift
				}

				define <16 x i16> @var_ashr_v16i16(<16 x i16> %a, <16 x i16> %b) nounwind {
				; AVX256BW-LABEL: var_ashr_v16i16:
				; AVX256BW: # %bb.0:
				; AVX256BW-NEXT: vpsravw %ymm1, %ymm0, %ymm0
				; AVX256BW-NEXT: retq
				;
				; AVX256VL-LABEL: var_ashr_v16i16:
				; AVX256VL: # %bb.0:
				; AVX256VL-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; AVX256VL-NEXT: vpunpckhwd {{.*#+}} ymm3 = ymm1[4],ymm2[4],ymm1[5],ymm2[5],ymm1[6],ymm2[6],ymm1[7],ymm2[7],ymm1[12],ymm2[12],ymm1[13],ymm2[13],ymm1[14],ymm2[14],ymm1[15],ymm2[15]
				; AVX256VL-NEXT: vpunpckhwd {{.*#+}} ymm4 = ymm2[4],ymm0[4],ymm2[5],ymm0[5],ymm2[6],ymm0[6],ymm2[7],ymm0[7],ymm2[12],ymm0[12],ymm2[13],ymm0[13],ymm2[14],ymm0[14],ymm2[15],ymm0[15]
				; AVX256VL-NEXT: vpsravd %ymm3, %ymm4, %ymm3
				; AVX256VL-NEXT: vpsrld $16, %ymm3, %ymm3
				; AVX256VL-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm1[0],ymm2[0],ymm1[1],ymm2[1],ymm1[2],ymm2[2],ymm1[3],ymm2[3],ymm1[8],ymm2[8],ymm1[9],ymm2[9],ymm1[10],ymm2[10],ymm1[11],ymm2[11]
				; AVX256VL-NEXT: vpunpcklwd {{.*#+}} ymm0 = ymm2[0],ymm0[0],ymm2[1],ymm0[1],ymm2[2],ymm0[2],ymm2[3],ymm0[3],ymm2[8],ymm0[8],ymm2[9],ymm0[9],ymm2[10],ymm0[10],ymm2[11],ymm0[11]
				; AVX256VL-NEXT: vpsravd %ymm1, %ymm0, %ymm0
				; AVX256VL-NEXT: vpsrld $16, %ymm0, %ymm0
				; AVX256VL-NEXT: vpackusdw %ymm3, %ymm0, %ymm0
				; AVX256VL-NEXT: retq
				;
				; AVX512BW-LABEL: var_ashr_v16i16:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpsravw %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: retq
				;
				; AVX512VL-LABEL: var_ashr_v16i16:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpmovzxwd {{.*#+}} zmm1 = ymm1[0],zero,ymm1[1],zero,ymm1[2],zero,ymm1[3],zero,ymm1[4],zero,ymm1[5],zero,ymm1[6],zero,ymm1[7],zero,ymm1[8],zero,ymm1[9],zero,ymm1[10],zero,ymm1[11],zero,ymm1[12],zero,ymm1[13],zero,ymm1[14],zero,ymm1[15],zero
				; AVX512VL-NEXT: vpmovsxwd %ymm0, %zmm0
				; AVX512VL-NEXT: vpsravd %zmm1, %zmm0, %zmm0
				; AVX512VL-NEXT: vpmovdw %zmm0, %ymm0
				; AVX512VL-NEXT: retq
				%shift = ashr <16 x i16> %a, %b
				ret <16 x i16> %shift
				}

				define <16 x i8> @var_ashr_v16i8(<16 x i8> %a, <16 x i8> %b) nounwind {
				; AVX256BW-LABEL: var_ashr_v16i8:
				; AVX256BW: # %bb.0:
				; AVX256BW-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
				; AVX256BW-NEXT: vpmovsxbw %xmm0, %ymm0
				; AVX256BW-NEXT: vpsravw %ymm1, %ymm0, %ymm0
				; AVX256BW-NEXT: vpmovwb %ymm0, %xmm0
				; AVX256BW-NEXT: vzeroupper
				; AVX256BW-NEXT: retq
				;
				; AVX256VL-LABEL: var_ashr_v16i8:
				; AVX256VL: # %bb.0:
				; AVX256VL-NEXT: vpsllw $5, %xmm1, %xmm1
				; AVX256VL-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
				; AVX256VL-NEXT: vpunpckhbw {{.*#+}} xmm3 = xmm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]
				; AVX256VL-NEXT: vpsraw $4, %xmm3, %xmm4
				; AVX256VL-NEXT: vpblendvb %xmm2, %xmm4, %xmm3, %xmm3
				; AVX256VL-NEXT: vpsraw $2, %xmm3, %xmm4
				; AVX256VL-NEXT: vpaddw %xmm2, %xmm2, %xmm2
				; AVX256VL-NEXT: vpblendvb %xmm2, %xmm4, %xmm3, %xmm3
				; AVX256VL-NEXT: vpsraw $1, %xmm3, %xmm4
				; AVX256VL-NEXT: vpaddw %xmm2, %xmm2, %xmm2
				; AVX256VL-NEXT: vpblendvb %xmm2, %xmm4, %xmm3, %xmm2
				; AVX256VL-NEXT: vpsrlw $8, %xmm2, %xmm2
				; AVX256VL-NEXT: vpunpcklbw {{.*#+}} xmm1 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
				; AVX256VL-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
				; AVX256VL-NEXT: vpsraw $4, %xmm0, %xmm3
				; AVX256VL-NEXT: vpblendvb %xmm1, %xmm3, %xmm0, %xmm0
				; AVX256VL-NEXT: vpsraw $2, %xmm0, %xmm3
				; AVX256VL-NEXT: vpaddw %xmm1, %xmm1, %xmm1
				; AVX256VL-NEXT: vpblendvb %xmm1, %xmm3, %xmm0, %xmm0
				; AVX256VL-NEXT: vpsraw $1, %xmm0, %xmm3
				; AVX256VL-NEXT: vpaddw %xmm1, %xmm1, %xmm1
				; AVX256VL-NEXT: vpblendvb %xmm1, %xmm3, %xmm0, %xmm0
				; AVX256VL-NEXT: vpsrlw $8, %xmm0, %xmm0
				; AVX256VL-NEXT: vpackuswb %xmm2, %xmm0, %xmm0
				; AVX256VL-NEXT: retq
				;
				; AVX512BW-LABEL: var_ashr_v16i8:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
				; AVX512BW-NEXT: vpmovsxbw %xmm0, %ymm0
				; AVX512BW-NEXT: vpsravw %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vpmovwb %ymm0, %xmm0
				; AVX512BW-NEXT: vzeroupper
				; AVX512BW-NEXT: retq
				;
				; AVX512VL-LABEL: var_ashr_v16i8:
				; AVX512VL: # %bb.0:
				; AVX512VL-NEXT: vpmovzxbd {{.*#+}} zmm1 = xmm1[0],zero,zero,zero,xmm1[1],zero,zero,zero,xmm1[2],zero,zero,zero,xmm1[3],zero,zero,zero,xmm1[4],zero,zero,zero,xmm1[5],zero,zero,zero,xmm1[6],zero,zero,zero,xmm1[7],zero,zero,zero,xmm1[8],zero,zero,zero,xmm1[9],zero,zero,zero,xmm1[10],zero,zero,zero,xmm1[11],zero,zero,zero,xmm1[12],zero,zero,zero,xmm1[13],zero,zero,zero,xmm1[14],zero,zero,zero,xmm1[15],zero,zero,zero
				; AVX512VL-NEXT: vpmovsxbd %xmm0, %zmm0
				; AVX512VL-NEXT: vpsravd %zmm1, %zmm0, %zmm0
				; AVX512VL-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512VL-NEXT: vzeroupper
				; AVX512VL-NEXT: retq
				%shift = ashr <16 x i8> %a, %b
				ret <16 x i8> %shift
				}

test/CodeGen/X86/prefer-avx256-trunc.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX256 --check-prefix=AVX256NOBW
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,-prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512NOBW --check-prefix=AVX512VL
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512NOBW --check-prefix=AVX512F
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512 --check-prefix=AVX512BW
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw,+avx512vl,+prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX256 --check-prefix=AVX256BWVL

				define <16 x i8> @testv8i64_sext(<16 x i16> %x) nounwind "require-vector-width"="256" {
				; AVX256NOBW-LABEL: testv8i64_sext:
				; AVX256NOBW: # %bb.0:
				; AVX256NOBW-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX256NOBW-NEXT: vmovdqa {{.*#+}} xmm2 = <0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u>
				; AVX256NOBW-NEXT: vpshufb %xmm2, %xmm1, %xmm1
				; AVX256NOBW-NEXT: vpshufb %xmm2, %xmm0, %xmm0
				; AVX256NOBW-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
				; AVX256NOBW-NEXT: vzeroupper
				; AVX256NOBW-NEXT: retq
				;
				; AVX512NOBW-LABEL: testv8i64_sext:
				; AVX512NOBW: # %bb.0:
				; AVX512NOBW-NEXT: vpmovsxwd %ymm0, %zmm0
				; AVX512NOBW-NEXT: vpmovdb %zmm0, %xmm0
				; AVX512NOBW-NEXT: vzeroupper
				; AVX512NOBW-NEXT: retq
				;
				; AVX512BW-LABEL: testv8i64_sext:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: # kill: def %ymm0 killed %ymm0 def %zmm0
				; AVX512BW-NEXT: vpmovwb %zmm0, %ymm0
				; AVX512BW-NEXT: # kill: def %xmm0 killed %xmm0 killed %ymm0
				; AVX512BW-NEXT: vzeroupper
				; AVX512BW-NEXT: retq
				;
				; AVX256BWVL-LABEL: testv8i64_sext:
				; AVX256BWVL: # %bb.0:
				; AVX256BWVL-NEXT: vpmovwb %ymm0, %xmm0
				; AVX256BWVL-NEXT: vzeroupper
				; AVX256BWVL-NEXT: retq
				%trunc = trunc <16 x i16> %x to <16 x i8>
				ret <16 x i8> %trunc
				}

test/CodeGen/X86/prefer-avx256-wide-mul.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+avx512bw,+prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX256BW
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512vl,+avx512bw,-prefer-vector-width-256 | FileCheck %s --check-prefix=CHECK --check-prefix=AVX512BW

				define <32 x i8> @test_div7_32i8(<32 x i8> %a) nounwind "require-vector-width"="256" {
				; AVX256BW-LABEL: test_div7_32i8:
				; AVX256BW: # %bb.0:
				; AVX256BW-NEXT: vextracti128 $1, %ymm0, %xmm1
				; AVX256BW-NEXT: vpmovzxbw {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero,xmm1[8],zero,xmm1[9],zero,xmm1[10],zero,xmm1[11],zero,xmm1[12],zero,xmm1[13],zero,xmm1[14],zero,xmm1[15],zero
				; AVX256BW-NEXT: vmovdqa {{.*#+}} ymm2 = [37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37]
				; AVX256BW-NEXT: vpmullw %ymm2, %ymm1, %ymm1
				; AVX256BW-NEXT: vpsrlw $8, %ymm1, %ymm1
				; AVX256BW-NEXT: vpmovzxbw {{.*#+}} ymm3 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero,xmm0[8],zero,xmm0[9],zero,xmm0[10],zero,xmm0[11],zero,xmm0[12],zero,xmm0[13],zero,xmm0[14],zero,xmm0[15],zero
				; AVX256BW-NEXT: vpmullw %ymm2, %ymm3, %ymm2
				; AVX256BW-NEXT: vpsrlw $8, %ymm2, %ymm2
				; AVX256BW-NEXT: vperm2i128 {{.*#+}} ymm3 = ymm2[2,3],ymm1[2,3]
				; AVX256BW-NEXT: vinserti128 $1, %xmm1, %ymm2, %ymm1
				; AVX256BW-NEXT: vpackuswb %ymm3, %ymm1, %ymm1
				; AVX256BW-NEXT: vpsubb %ymm1, %ymm0, %ymm0
				; AVX256BW-NEXT: vpsrlw $1, %ymm0, %ymm0
				; AVX256BW-NEXT: vpand {{.*}}(%rip), %ymm0, %ymm0
				; AVX256BW-NEXT: vpaddb %ymm1, %ymm0, %ymm0
				; AVX256BW-NEXT: vpsrlw $2, %ymm0, %ymm0
				; AVX256BW-NEXT: vpand {{.*}}(%rip), %ymm0, %ymm0
				; AVX256BW-NEXT: retq
				;
				; AVX512BW-LABEL: test_div7_32i8:
				; AVX512BW: # %bb.0:
				; AVX512BW-NEXT: vpmovzxbw {{.*#+}} zmm1 = ymm0[0],zero,ymm0[1],zero,ymm0[2],zero,ymm0[3],zero,ymm0[4],zero,ymm0[5],zero,ymm0[6],zero,ymm0[7],zero,ymm0[8],zero,ymm0[9],zero,ymm0[10],zero,ymm0[11],zero,ymm0[12],zero,ymm0[13],zero,ymm0[14],zero,ymm0[15],zero,ymm0[16],zero,ymm0[17],zero,ymm0[18],zero,ymm0[19],zero,ymm0[20],zero,ymm0[21],zero,ymm0[22],zero,ymm0[23],zero,ymm0[24],zero,ymm0[25],zero,ymm0[26],zero,ymm0[27],zero,ymm0[28],zero,ymm0[29],zero,ymm0[30],zero,ymm0[31],zero
				; AVX512BW-NEXT: vpmullw {{.*}}(%rip), %zmm1, %zmm1
				; AVX512BW-NEXT: vpsrlw $8, %zmm1, %zmm1
				; AVX512BW-NEXT: vpmovwb %zmm1, %ymm1
				; AVX512BW-NEXT: vpsubb %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vpsrlw $1, %ymm0, %ymm0
				; AVX512BW-NEXT: vpand {{.*}}(%rip), %ymm0, %ymm0
				; AVX512BW-NEXT: vpaddb %ymm1, %ymm0, %ymm0
				; AVX512BW-NEXT: vpsrlw $2, %ymm0, %ymm0
				; AVX512BW-NEXT: vpand {{.*}}(%rip), %ymm0, %ymm0
				; AVX512BW-NEXT: retq
				%res = udiv <32 x i8> %a, <i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7>
				ret <32 x i8> %res
				}

				define <64 x i8> @test_div7_64i8(<64 x i8> %a) nounwind "require-vector-width"="512" {
				; CHECK-LABEL: test_div7_64i8:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vpmovzxbw {{.*#+}} zmm1 = ymm0[0],zero,ymm0[1],zero,ymm0[2],zero,ymm0[3],zero,ymm0[4],zero,ymm0[5],zero,ymm0[6],zero,ymm0[7],zero,ymm0[8],zero,ymm0[9],zero,ymm0[10],zero,ymm0[11],zero,ymm0[12],zero,ymm0[13],zero,ymm0[14],zero,ymm0[15],zero,ymm0[16],zero,ymm0[17],zero,ymm0[18],zero,ymm0[19],zero,ymm0[20],zero,ymm0[21],zero,ymm0[22],zero,ymm0[23],zero,ymm0[24],zero,ymm0[25],zero,ymm0[26],zero,ymm0[27],zero,ymm0[28],zero,ymm0[29],zero,ymm0[30],zero,ymm0[31],zero
				; CHECK-NEXT: vmovdqa64 {{.*#+}} zmm2 = [37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37,37]
				; CHECK-NEXT: vpmullw %zmm2, %zmm1, %zmm1
				; CHECK-NEXT: vpsrlw $8, %zmm1, %zmm1
				; CHECK-NEXT: vpmovwb %zmm1, %ymm1
				; CHECK-NEXT: vextracti64x4 $1, %zmm0, %ymm3
				; CHECK-NEXT: vpmovzxbw {{.*#+}} zmm3 = ymm3[0],zero,ymm3[1],zero,ymm3[2],zero,ymm3[3],zero,ymm3[4],zero,ymm3[5],zero,ymm3[6],zero,ymm3[7],zero,ymm3[8],zero,ymm3[9],zero,ymm3[10],zero,ymm3[11],zero,ymm3[12],zero,ymm3[13],zero,ymm3[14],zero,ymm3[15],zero,ymm3[16],zero,ymm3[17],zero,ymm3[18],zero,ymm3[19],zero,ymm3[20],zero,ymm3[21],zero,ymm3[22],zero,ymm3[23],zero,ymm3[24],zero,ymm3[25],zero,ymm3[26],zero,ymm3[27],zero,ymm3[28],zero,ymm3[29],zero,ymm3[30],zero,ymm3[31],zero
				; CHECK-NEXT: vpmullw %zmm2, %zmm3, %zmm2
				; CHECK-NEXT: vpsrlw $8, %zmm2, %zmm2
				; CHECK-NEXT: vpmovwb %zmm2, %ymm2
				; CHECK-NEXT: vinserti64x4 $1, %ymm2, %zmm1, %zmm1
				; CHECK-NEXT: vpsubb %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vpsrlw $1, %zmm0, %zmm0
				; CHECK-NEXT: vpandq {{.*}}(%rip), %zmm0, %zmm0
				; CHECK-NEXT: vpaddb %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vpsrlw $2, %zmm0, %zmm0
				; CHECK-NEXT: vpandq {{.*}}(%rip), %zmm0, %zmm0
				; CHECK-NEXT: retq
				%res = udiv <64 x i8> %a, <i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7, i8 7,i8 7, i8 7, i8 7, i8 7>
				ret <64 x i8> %res
				}
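For readers following the test_div7 checks in prefer-avx256-wide-mul.ll, the per-byte arithmetic they appear to encode can be modelled in scalar form. The sketch below is not part of the patch. It is one reading of the CHECK lines, assuming the constant 37 is the multiplier for an unsigned 8-bit divide by 7 and that each vpand after a word shift only clears bits that crossed a byte boundary. The function name `udiv7_u8` is hypothetical.

```python
def udiv7_u8(x: int) -> int:
    """Scalar model of one byte lane of the udiv-by-7 sequence (assumed reading)."""
    assert 0 <= x <= 0xFF
    # vpmullw by 37 on the zero-extended byte, then vpsrlw $8:
    # keep the high byte of the 16-bit product.
    t = (x * 37) >> 8
    # vpsubb, then vpsrlw $1 with vpand: per-byte logical shift right by 1.
    u = ((x - t) & 0xFF) >> 1
    # vpaddb: add the high product byte back in.
    u = (u + t) & 0xFF
    # vpsrlw $2 with vpand: per-byte logical shift right by 2.
    return u >> 2


def matches_integer_division() -> bool:
    """Compare the model against x // 7 for every 8-bit input."""
    return all(udiv7_u8(x) == x // 7 for x in range(256))
```

Under this reading, the 256-bit and 512-bit sequences compute the same per-byte result. They differ only in where the widening multiply happens: in two ymm halves that are packed back together, or in a single zmm register followed by vpmovwb.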