This is an archive of the discontinued LLVM Phabricator instance.

Paths

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp
llvm/trunk/lib/Target/X86/X86Subtarget.h
llvm/trunk/lib/Target/X86/X86Subtarget.cpp
llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
llvm/trunk/test/CodeGen/X86/required-vector-width.ll

Differential D42724

[X86] Don't make 512-bit vectors legal when preferred vector width is 256 bits and 512 bits aren't required
ClosedPublic

Authored by craig.topper on Jan 30 2018, 5:46 PM.

Details

Reviewers

RKSimon
echristo
chandlerc
spatel
hfinkel

Commits

rG24d3b28d931a: [X86] Don't make 512-bit vectors legal when preferred vector width is 256 bits…
rL324834: [X86] Don't make 512-bit vectors legal when preferred vector width is 256 bits…

Summary

This patch is a replacement for D41341.

This patch adds a new function attribute "required-vector-width" that can be set by the frontend to indicate the maximum vector width present in the original source code. The idea is that this would be set based on ABI requirements, intrinsics or explicit vector types being used, maybe simd pragmas, etc. The backend will then use this information to determine if it's safe to make 512-bit vectors illegal when the preference is for 256-bit vectors.

For code that has no vectors in it originally and only gets vectors through the loop and SLP vectorizers, this allows us to generate code largely similar to our AVX2-only output while still enabling AVX512 features like mask registers and gather/scatter. The loop vectorizer doesn't always obey TTI and will create oversized vectors with the expectation that the backend will legalize them. To avoid changing the vectorizer and potentially harming our AVX2 codegen, this patch tries to make the legalizer behavior similar.

This is restricted to CPUs that support AVX512F and AVX512VL so that we have good fallback options to use 128 and 256-bit vectors and still get masking.

I've qualified every place I could find in X86ISelLowering.cpp and added test cases for many of them with 2 different values for the attribute to see the codegen differences.

We still need to do frontend work for the attribute and teach the inliner how to merge it, etc. But this gets the codegen layer ready for it.
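
As a rough illustration of the gating described in this summary, the following C++ sketch models when 512-bit vector registers stay legal. The struct `SubtargetModel` and its fields are hypothetical stand-ins, not the real `X86Subtarget`. The predicate names mirror `canExtendTo512DQ`, `useAVX512Regs` and `useBWIRegs` from this revision, but the exact conditions shown are an assumption reconstructed from the summary and the review comments.

```cpp
#include <cstdint>

// Hypothetical, simplified model of the subtarget state discussed in this
// revision. It is not the real X86Subtarget class.
struct SubtargetModel {
  bool HasAVX512 = false; // AVX512F is available
  bool HasVLX = false;    // AVX512VL is available
  bool HasBWI = false;    // AVX512BW is available
  // Values taken from the prefer-vector-width and required-vector-width
  // function attributes. UINT32_MAX models "attribute not present".
  std::uint32_t PreferVectorWidth = UINT32_MAX;
  std::uint32_t RequiredVectorWidth = UINT32_MAX;

  // Assumed condition: without VLX there is no 128/256-bit fallback that
  // keeps masking, so widening to 512 bits is always allowed (the KNL case).
  bool canExtendTo512DQ() const {
    return HasAVX512 && (!HasVLX || PreferVectorWidth >= 512);
  }

  // Assumed condition: 512-bit registers stay legal when widening is
  // allowed or when the function is known to require more than 256 bits.
  bool useAVX512Regs() const {
    return HasAVX512 && (canExtendTo512DQ() || RequiredVectorWidth > 256);
  }

  // The 512-bit byte/word types additionally need AVX512BW.
  bool useBWIRegs() const { return HasBWI && useAVX512Regs(); }
};
```

In this model, a CPU with AVX512F and AVX512VL, a preferred width of 256 and a required width of 0 reports `useAVX512Regs()` as false. Raising the required width to 512, leaving the attribute absent, or dropping VLX makes it true again.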

Diff Detail

Repository: rL LLVM

Event Timeline

craig.topper created this revision. Jan 30 2018, 5:46 PM

Herald added subscribers: kristof.beyls, aemerson. Jan 30 2018, 5:46 PM

craig.topper added a reviewer: hfinkel. Jan 31 2018, 9:59 AM

Add sad test cases as well.

Ping

What happens if 512-bit intrinsics are used? Do we have a decent error message?

lib/Target/X86/X86ISelLowering.cpp
1363 ↗	(On Diff #132683)	What happens with these? If we've disabled 512-bit vectors what is expected to happen?

For intrinsics, this patch relies on a function attribute "required-vector-width" that is not created anywhere today. We're trusting the accuracy of the attribute when it is present, and if it's not present we'll assume 512-bit vectors should be allowed. I plan to add support to clang to set this attribute to 512 when any AVX512 intrinsics are used and to a smaller value if only narrower vectors are used. That will prevent the legalizer from disabling 512-bit support when it is needed. Something similar will need to be done for any function arguments passed in ZMM registers, and probably for some of the pragmas as well.

lib/Target/X86/X86ISelLowering.cpp
1363 ↗	(On Diff #132683)	If AVX512F is enabled, we can only disable 512-bit vectors if VLX is also enabled. So KNL will always allow 512-bit vectors. It's not clear how much we could do on KNL without 512-bit vectors. We'd probably have to just fall all the way back to AVX2. The only AVX512 instructions that would work without 512-bit vectors would be things like KOR/KAND/KXOR on mask registers, but you wouldn't be able to use masks in any instructions.
lib/Target/X86/X86Subtarget.h
636 ↗	(On Diff #132683)	canExtendTo512DQ checks !VLX to always allow widening on KNL CPUs.

LGTM as a first step - a few minors to explain the meaning of a lot of this.

lib/Target/X86/X86ISelLowering.cpp
1140 ↗	(On Diff #132683)	Comments explaining that we only legalize mask registers for hasAVX512, with the vector registers requiring useAVX512Regs
1156 ↗	(On Diff #132683)	Add tests for these since they touch 512-bit types.
1165 ↗	(On Diff #132683)	Add tests for these since they touch 512-bit types.
1447 ↗	(On Diff #132683)	Similar comment for 512-bit bw types and explain why v64i1 mask register isn't enabled with hasBWI
1533 ↗	(On Diff #132683)	This change looks independent? If so commit it separately.

This revision is now accepted and ready to land. Feb 10 2018, 9:07 AM

Closed by commit rL324834: [X86] Don't make 512-bit vectors legal when preferred vector width is 256 bits… (authored by ctopper). Feb 11 2018, 12:08 AM

This revision was automatically updated to reflect the committed changes.

LGTM for now, but the set of function attributes being passed into the subtarget is getting a little awkward so we should come up with some better way to encompass them.

danilaml added a subscriber: danilaml. Aug 22 2023, 8:32 AM

danilaml added inline comments.

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
275	@craig.topper Sorry for commenting on such an old review, but I was investigating some codegen differences for very similar IR and came across this code (the attribute was later renamed to `min-legal-vector-width`, but otherwise it's the same on main). Is RequiredVectorWidth intended to be initialized to `UINT32_MAX`? What is the rationale? It forces the maximum vector width if the function is missing the attribute for some reason, ignoring the `prefer-*` attributes. To me it seems that the conservative approach would be to set it to `0` and increase it according to the attribute, since zero-length vectors are always legal/"required".

Herald added projects: Restricted Project, Restricted Project. Aug 22 2023, 8:32 AM

Herald added subscribers: wangpc, sunshaoce, StephenFan, pengfei.

craig.topper added inline comments. Aug 22 2023, 9:42 AM

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
275	It is intentionally set to UINT32_MAX if the attribute is missing. If the IR contains any 512 bit inline assembly, function arguments, returns, or X86 specific vector, the backend will crash or violate the ABI. The presence of the attribute indicates that those cases have been checked and nothing requires 512 bit vectors.
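
The defaulting rule described in this reply can be sketched in standalone C++. The helper name `requiredVectorWidth` and its `std::optional` parameter are hypothetical and only model "attribute present or absent"; the real code reads the attribute from an LLVM `Function`. Treating a malformed value like a missing attribute is also an assumption of this sketch.

```cpp
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper. Attr holds the attribute's string value, or
// std::nullopt when the function does not carry the attribute at all.
inline std::uint32_t
requiredVectorWidth(std::optional<std::string_view> Attr) {
  // Conservative default: with no attribute, nothing is known about
  // 512-bit intrinsics, inline assembly, arguments or returns, so every
  // vector width must stay legal.
  std::uint32_t Width = UINT32_MAX;
  if (!Attr)
    return Width;

  std::uint32_t Parsed = 0;
  const char *First = Attr->data();
  const char *Last = First + Attr->size();
  auto [Ptr, Ec] = std::from_chars(First, Last, Parsed);
  // Only accept the value if the whole string parsed as an integer.
  if (Ec == std::errc() && Ptr == Last)
    Width = Parsed;
  return Width;
}
```

Under this sketch an absent attribute yields `UINT32_MAX`, while an explicit value such as `256` or `0` is returned unchanged, which is what lets the preferred width take effect.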

danilaml added inline comments. Aug 22 2023, 10:26 AM

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
275	Not sure I understand. Without the attribute the backend will ignore prefer-vector-width and generate avx512 asm regardless. How is setting default to 0 worse?

craig.topper added inline comments. Aug 22 2023, 10:37 AM

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
275	If there are 512-bit x86 intrinsics in the IR it will crash the compiler. I assume compiler crashes are worse than suboptimal code. Prefer vector width is still checked in many other places to prevent aggressive use of 512 bit vectors. For example, `X86TTIImpl::getRegisterBitWidth` will still tell the vectorizer that the register width is 256 bits. Are you finding the attribute missing in code compiled with clang or another frontend?

danilaml added inline comments. Aug 22 2023, 11:37 AM

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
275	Doesn't seem to crash (although I haven't tried inline assembly): https://llvm.godbolt.org/z/h6jEYrnaa In my case it's another frontend (JIT compiler). I noticed that the compiler would generate suboptimal code using avx512 even though the target CPU has prefer256 tuning, and found that the cause is the missing attribute (I also noticed that the target knows that expanding a certain intrinsic using avx512 is more costly than using avx2, but still uses the highest ISA available, but that's another issue entirely). Now I'm wondering what to set it to. Also, passes like SLPVectorizer don't really care about `getRegisterBitWidth`, since they usually just check whether some operation/type is legal and what cost the target hooks return.

craig.topper added inline comments. Aug 22 2023, 12:15 PM

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
275	It crashes if prefer-vector-width<=256 is also specified or a CPU that implies prefer vector width <=256 is used https://llvm.godbolt.org/z/G5458499K

danilaml added inline comments. Aug 22 2023, 12:30 PM

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
275	I see. This is counterintuitive. It appears that `min-legal-vector-width` is actually sort of a `max`, but not really. So what is the intended usage by some non-C backend? What should it be set to in order to allow avx512 intrinsics/inline asm when explicitly requested AND keep the "prefer 256" semantics in most other places? Should we mark every "regular" function (one that doesn't use avx512 intrinsics or inline asm) with `min-legal-vector-width=0` and those that do with `=512` (we won't be passing/returning avx512 types, so the ABI is not a concern AFAIU)?

craig.topper added inline comments. Aug 22 2023, 12:39 PM

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
275	I unfortunately named it from the perspective of the X86 backend where it is the minimum vector width that the backend must make a legal type to prevent crashes. The clang frontend calculates it by taking the maximum value from all intrinsics, inline assembly, and function arguments/returns. Setting it to 0 for functions without avx512 intrinsics or inline assembly should be fine today.
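
The calculation described in this reply, taking the maximum over everything that needs a particular vector width, can be sketched as follows. The function name `minLegalVectorWidth` and its input (a flat list of widths in bits) are hypothetical; how a frontend collects those widths is outside the scope of this sketch.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper. RequiredWidths lists the width in bits of every
// vector that the function's intrinsics, inline assembly operands,
// arguments and return value need.
inline unsigned
minLegalVectorWidth(const std::vector<unsigned> &RequiredWidths) {
  // A function with no such uses requires nothing, so it gets 0.
  unsigned Width = 0;
  for (unsigned W : RequiredWidths)
    Width = std::max(Width, W);
  return Width;
}
```

A function that only uses 128-bit and 256-bit intrinsics would get 256 here, one that touches a 512-bit intrinsic would get 512, and a function with none of these would get 0.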

Revision Contents

Path	Size
llvm/trunk/lib/Target/X86/ (four files)	60 lines, 17 lines, 4 lines, 21 lines
llvm/trunk/test/CodeGen/X86/required-vector-width.ll	628 lines

Diff 133785

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

@@ 1,138 lines hidden @@ if (HasInt256) {
       setOperationAction(ISD::MGATHER, MVT::v2i32, Custom);

       for (auto VT : { MVT::v4i32, MVT::v8i32, MVT::v2i64, MVT::v4i64,
                        MVT::v4f32, MVT::v8f32, MVT::v2f64, MVT::v4f64 })
         setOperationAction(ISD::MGATHER, VT, Custom);
     }
   }

+  // This block controls legalization of the mask vector sizes that are
+  // available with AVX512. 512-bit vectors are in a separate block controlled
+  // by useAVX512Regs.
   if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {
-    addRegisterClass(MVT::v16i32, &X86::VR512RegClass);
-    addRegisterClass(MVT::v16f32, &X86::VR512RegClass);
-    addRegisterClass(MVT::v8i64, &X86::VR512RegClass);
-    addRegisterClass(MVT::v8f64, &X86::VR512RegClass);

     addRegisterClass(MVT::v1i1, &X86::VK1RegClass);
     addRegisterClass(MVT::v2i1, &X86::VK2RegClass);
     addRegisterClass(MVT::v4i1, &X86::VK4RegClass);
     addRegisterClass(MVT::v8i1, &X86::VK8RegClass);
     addRegisterClass(MVT::v16i1, &X86::VK16RegClass);

     setOperationAction(ISD::SELECT, MVT::v1i1, Custom);
     setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v1i1, Custom);
     setOperationAction(ISD::BUILD_VECTOR, MVT::v1i1, Custom);

-    setOperationPromotedToType(ISD::FP_TO_SINT, MVT::v16i1, MVT::v16i32);
-    setOperationPromotedToType(ISD::FP_TO_UINT, MVT::v16i1, MVT::v16i32);
     setOperationPromotedToType(ISD::FP_TO_SINT, MVT::v8i1, MVT::v8i32);
     setOperationPromotedToType(ISD::FP_TO_UINT, MVT::v8i1, MVT::v8i32);
     setOperationPromotedToType(ISD::FP_TO_SINT, MVT::v4i1, MVT::v4i32);
     setOperationPromotedToType(ISD::FP_TO_UINT, MVT::v4i1, MVT::v4i32);
     setOperationAction(ISD::FP_TO_SINT, MVT::v2i1, Custom);
     setOperationAction(ISD::FP_TO_UINT, MVT::v2i1, Custom);

     // Extends of v16i1/v8i1/v4i1/v2i1 to 128-bit vectors.
@@ 22 lines hidden @@ if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {
     setOperationAction(ISD::CONCAT_VECTORS, MVT::v8i1, Custom);
     setOperationAction(ISD::CONCAT_VECTORS, MVT::v4i1, Custom);
     setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v2i1, Custom);
     setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v4i1, Custom);
     setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v8i1, Custom);
     setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v16i1, Custom);
     for (auto VT : { MVT::v1i1, MVT::v2i1, MVT::v4i1, MVT::v8i1 })
       setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);
+  }

+  // This block controls legalization for 512-bit operations with 32/64 bit
+  // elements. 512-bits can be disabled based on prefer-vector-width and
+  // required-vector-width function attributes.
+  if (!Subtarget.useSoftFloat() && Subtarget.useAVX512Regs()) {
+    addRegisterClass(MVT::v16i32, &X86::VR512RegClass);
+    addRegisterClass(MVT::v16f32, &X86::VR512RegClass);
+    addRegisterClass(MVT::v8i64, &X86::VR512RegClass);
+    addRegisterClass(MVT::v8f64, &X86::VR512RegClass);

     for (MVT VT : MVT::fp_vector_valuetypes())
       setLoadExtAction(ISD::EXTLOAD, VT, MVT::v8f32, Legal);

     for (auto ExtType : {ISD::ZEXTLOAD, ISD::SEXTLOAD}) {
       setLoadExtAction(ExtType, MVT::v16i32, MVT::v16i8, Legal);
       setLoadExtAction(ExtType, MVT::v16i32, MVT::v16i16, Legal);
       setLoadExtAction(ExtType, MVT::v8i64, MVT::v8i8, Legal);
       setLoadExtAction(ExtType, MVT::v8i64, MVT::v8i16, Legal);
       setLoadExtAction(ExtType, MVT::v8i64, MVT::v8i32, Legal);
     }

     for (MVT VT : { MVT::v16f32, MVT::v8f64 }) {
       setOperationAction(ISD::FNEG, VT, Custom);
       setOperationAction(ISD::FABS, VT, Custom);
       setOperationAction(ISD::FMA, VT, Legal);
       setOperationAction(ISD::FCOPYSIGN, VT, Custom);
     }

     setOperationAction(ISD::FP_TO_SINT, MVT::v16i32, Legal);
     setOperationPromotedToType(ISD::FP_TO_SINT, MVT::v16i16, MVT::v16i32);
     setOperationPromotedToType(ISD::FP_TO_SINT, MVT::v16i8, MVT::v16i32);
+    setOperationPromotedToType(ISD::FP_TO_SINT, MVT::v16i1, MVT::v16i32);
     setOperationAction(ISD::FP_TO_UINT, MVT::v16i32, Legal);
+    setOperationPromotedToType(ISD::FP_TO_UINT, MVT::v16i1, MVT::v16i32);
     setOperationPromotedToType(ISD::FP_TO_UINT, MVT::v16i8, MVT::v16i32);
     setOperationPromotedToType(ISD::FP_TO_UINT, MVT::v16i16, MVT::v16i32);
     setOperationAction(ISD::SINT_TO_FP, MVT::v16i32, Legal);
     setOperationAction(ISD::UINT_TO_FP, MVT::v16i32, Legal);

     setTruncStoreAction(MVT::v8i64, MVT::v8i8, Legal);
     setTruncStoreAction(MVT::v8i64, MVT::v8i16, Legal);
     setTruncStoreAction(MVT::v8i64, MVT::v8i32, Legal);
@@ 113 lines hidden @@ for (auto VT : { MVT::v16i32, MVT::v8i64, MVT::v16f32, MVT::v8f64 }) {
       setOperationAction(ISD::MSCATTER, VT, Custom);
     }
     for (auto VT : { MVT::v64i8, MVT::v32i16, MVT::v16i32 }) {
       setOperationPromotedToType(ISD::LOAD, VT, MVT::v8i64);
       setOperationPromotedToType(ISD::SELECT, VT, MVT::v8i64);
     }
   }// has AVX-512

+  // This block controls legalization for operations that don't have
+  // pre-AVX512 equivalents. Without VLX we use 512-bit operations for
+  // narrower widths.
   if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {
     // These operations are handled on non-VLX by artificially widening in
     // isel patterns.
     // TODO: Custom widen in lowering on non-VLX and drop the isel patterns?

     setOperationAction(ISD::FP_TO_UINT, MVT::v8i32, Legal);
     setOperationAction(ISD::FP_TO_UINT, MVT::v4i32, Legal);
     setOperationAction(ISD::FP_TO_UINT, MVT::v2i32, Custom);
@@ 38 lines hidden @@ if (!Subtarget.useSoftFloat() && Subtarget.hasAVX512()) {
     } // Subtarget.hasCDI()

     if (Subtarget.hasVPOPCNTDQ()) {
       for (auto VT : { MVT::v4i32, MVT::v8i32, MVT::v2i64, MVT::v4i64 })
         setOperationAction(ISD::CTPOP, VT, Legal);
     }
   }

+  // This block control legalization of v32i1/v64i1 which are available with
+  // AVX512BW. 512-bit v32i16 and v64i8 vector legalization is controlled with
+  // useBWIRegs.
   if (!Subtarget.useSoftFloat() && Subtarget.hasBWI()) {
-    addRegisterClass(MVT::v32i16, &X86::VR512RegClass);
-    addRegisterClass(MVT::v64i8, &X86::VR512RegClass);

     addRegisterClass(MVT::v32i1, &X86::VK32RegClass);
     addRegisterClass(MVT::v64i1, &X86::VK64RegClass);

     for (auto VT : { MVT::v32i1, MVT::v64i1 }) {
       setOperationAction(ISD::ADD, VT, Custom);
       setOperationAction(ISD::SUB, VT, Custom);
       setOperationAction(ISD::MUL, VT, Custom);
       setOperationAction(ISD::VSELECT, VT, Expand);
@@ 13 lines hidden @@ if (!Subtarget.useSoftFloat() && Subtarget.hasBWI()) {
     setOperationAction(ISD::INSERT_SUBVECTOR, MVT::v64i1, Custom);
     for (auto VT : { MVT::v16i1, MVT::v32i1 })
       setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);

     // Extends from v32i1 masks to 256-bit vectors.
     setOperationAction(ISD::SIGN_EXTEND, MVT::v32i8, Custom);
     setOperationAction(ISD::ZERO_EXTEND, MVT::v32i8, Custom);
     setOperationAction(ISD::ANY_EXTEND, MVT::v32i8, Custom);
+  }

+  // This block controls legalization for v32i16 and v64i8. 512-bits can be
+  // disabled based on prefer-vector-width and required-vector-width function
+  // attributes.
+  if (!Subtarget.useSoftFloat() && Subtarget.useBWIRegs()) {
+    addRegisterClass(MVT::v32i16, &X86::VR512RegClass);
+    addRegisterClass(MVT::v64i8, &X86::VR512RegClass);

     // Extends from v64i1 masks to 512-bit vectors.
     setOperationAction(ISD::SIGN_EXTEND, MVT::v64i8, Custom);
     setOperationAction(ISD::ZERO_EXTEND, MVT::v64i8, Custom);
     setOperationAction(ISD::ANY_EXTEND, MVT::v64i8, Custom);

     setOperationAction(ISD::MUL, MVT::v32i16, Legal);
     setOperationAction(ISD::MUL, MVT::v64i8, Custom);
     setOperationAction(ISD::MULHS, MVT::v32i16, Legal);
@@ 28,594 lines hidden @@
 /// the fact that they're unused.
 static bool isAddSubOrSubAdd(SDNode *N, const X86Subtarget &Subtarget,
                              SDValue &Opnd0, SDValue &Opnd1,
                              bool matchSubAdd = false) {

   EVT VT = N->getValueType(0);
   if ((!Subtarget.hasSSE3() || (VT != MVT::v4f32 && VT != MVT::v2f64)) &&
       (!Subtarget.hasAVX() || (VT != MVT::v8f32 && VT != MVT::v4f64)) &&
-      (!Subtarget.hasAVX512() || (VT != MVT::v16f32 && VT != MVT::v8f64)))
+      (!Subtarget.useAVX512Regs() || (VT != MVT::v16f32 && VT != MVT::v8f64)))
     return false;

   // We only handle target-independent shuffles.
   // FIXME: It would be easy and harmless to use the target shuffle mask
   // extraction tool to support more.
   if (N->getOpcode() != ISD::VECTOR_SHUFFLE)
     return false;

@@ 1,020 lines hidden @@ if (!Subtarget.hasSSE2())
     return SDValue();

   // Verify the type we're extracting from is any integer type above i16.
   EVT VT = Extract->getOperand(0).getValueType();
   if (!VT.isSimple() || !(VT.getVectorElementType().getSizeInBits() > 16))
     return SDValue();

   unsigned RegSize = 128;
-  if (Subtarget.hasBWI())
+  if (Subtarget.useBWIRegs())
     RegSize = 512;
   else if (Subtarget.hasAVX2())
     RegSize = 256;

   // We handle upto v16i* for SSE2 / v32i* for AVX2 / v64i* for AVX512.
   // TODO: We should be able to handle larger vectors by splitting them before
   // feeding them into several SADs, and then reducing over those.
   if (RegSize / VT.getVectorNumElements() < 8)
@@ 1,561 lines hidden @@ static SDValue combineMul(SDNode *N, SelectionDAG &DAG,
                           const X86Subtarget &Subtarget) {
   EVT VT = N->getValueType(0);

   // If the upper 17 bits of each element are zero then we can use PMADDWD,
   // which is always at least as quick as PMULLD, expect on KNL.
   if (Subtarget.getProcFamily() != X86Subtarget::IntelKNL &&
       ((VT == MVT::v4i32 && Subtarget.hasSSE2()) ||
        (VT == MVT::v8i32 && Subtarget.hasAVX2()) ||
-       (VT == MVT::v16i32 && Subtarget.hasBWI()))) {
+       (VT == MVT::v16i32 && Subtarget.useBWIRegs()))) {
     SDValue N0 = N->getOperand(0);
     SDValue N1 = N->getOperand(1);
     APInt Mask17 = APInt::getHighBitsSet(32, 17);
     if (DAG.MaskedValueIsZero(N0, Mask17) &&
         DAG.MaskedValueIsZero(N1, Mask17)) {
       unsigned NumElts = VT.getVectorNumElements();
       MVT WVT = MVT::getVectorVT(MVT::i16, 2 * NumElts);
       return DAG.getNode(X86ISD::VPMADDWD, SDLoc(N), VT,
@@ 1,509 lines hidden @@
 // The argument Builder is a function that will be applied on each split psrt:
 // SDValue Builder(SelectionDAG&G, SDLoc, SDValue, SDValue)
 template <typename F>
 SDValue SplitBinaryOpsAndApply(SelectionDAG &DAG, const X86Subtarget &Subtarget,
                                const SDLoc &DL, EVT VT, SDValue Op0,
                                SDValue Op1, F Builder) {
   assert(Subtarget.hasSSE2() && "Target assumed to support at least SSE2");
   unsigned NumSubs = 1;
-  if (Subtarget.hasBWI()) {
+  if (Subtarget.useBWIRegs()) {
     if (VT.getSizeInBits() > 512) {
       NumSubs = VT.getSizeInBits() / 512;
       assert((VT.getSizeInBits() % 512) == 0 && "Illegal vector size");
     }
   } else if (Subtarget.hasAVX2()) {
     if (VT.getSizeInBits() > 256) {
       NumSubs = VT.getSizeInBits() / 256;
       assert((VT.getSizeInBits() % 256) == 0 && "Illegal vector size");
@@ 1,974 lines hidden @@ return DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, VT, SExt,
                        DAG.getIntPtrConstant(0, DL));
   }

   // If target-size is 128-bits (or 256-bits on AVX2 target), then convert to
   // ISD::_EXTEND_VECTOR_INREG which ensures lowering to X86ISD::VEXT.
   // Also use this if we don't have SSE41 to allow the legalizer do its job.
   if (!Subtarget.hasSSE41() || VT.is128BitVector() ||
       (VT.is256BitVector() && Subtarget.hasInt256()) ||
-      (VT.is512BitVector() && Subtarget.hasAVX512())) {
+      (VT.is512BitVector() && Subtarget.useAVX512Regs())) {
     SDValue ExOp = ExtendVecSize(DL, N0, VT.getSizeInBits());
     return Opcode == ISD::SIGN_EXTEND
                ? DAG.getSignExtendVectorInReg(ExOp, DL, VT)
                : DAG.getZeroExtendVectorInReg(ExOp, DL, VT);
   }

   auto SplitAndExtendInReg = [&](unsigned SplitSize) {
     unsigned NumVecs = VT.getSizeInBits() / SplitSize;
@@ 16 lines hidden @@ static SDValue combineToExtendVectorInReg(SDNode *N, SelectionDAG &DAG,

   // On pre-AVX2 targets, split into 128-bit nodes of
   // ISD::*_EXTEND_VECTOR_INREG.
   if (!Subtarget.hasInt256() && !(VT.getSizeInBits() % 128))
     return SplitAndExtendInReg(128);

   // On pre-AVX512 targets, split into 256-bit nodes of
   // ISD::*_EXTEND_VECTOR_INREG.
-  if (!Subtarget.hasAVX512() && !(VT.getSizeInBits() % 256))
+  if (!Subtarget.useAVX512Regs() && !(VT.getSizeInBits() % 256))
     return SplitAndExtendInReg(256);

   return SDValue();
 }

 // Attempt to combine a (sext/zext (setcc)) to a setcc with a xmm/ymm/zmm
 // result type.
 static SDValue combineExtSetcc(SDNode *N, SelectionDAG &DAG,
@@ 938 lines hidden @@ static SDValue combineLoopMAddPattern(SDNode *N, SelectionDAG &DAG,

   ShrinkMode Mode;
   if (!canReduceVMulWidth(MulOp.getNode(), DAG, Mode) || Mode == MULU16)
     return SDValue();

   EVT VT = N->getValueType(0);

   unsigned RegSize = 128;
-  if (Subtarget.hasBWI())
+  if (Subtarget.useBWIRegs())
     RegSize = 512;
   else if (Subtarget.hasAVX2())
     RegSize = 256;
   unsigned VectorSize = VT.getVectorNumElements() * 16;
   // If the vector size is less than 128, or greater than the supported RegSize,
   // do not use PMADD.
   if (VectorSize < 128 || VectorSize > RegSize)
     return SDValue();
@@ 28 lines hidden @@ static SDValue combineLoopSADPattern(SDNode *N, SelectionDAG &DAG,

   // TODO: There's nothing special about i32, any integer type above i16 should
   // work just as well.
   if (!VT.isVector() || !VT.isSimple() ||
       !(VT.getVectorElementType() == MVT::i32))
     return SDValue();

   unsigned RegSize = 128;
-  if (Subtarget.hasBWI())
+  if (Subtarget.useBWIRegs())
     RegSize = 512;
   else if (Subtarget.hasAVX2())
     RegSize = 256;

   // We only handle v16i32 for SSE2 / v32i32 for AVX2 / v64i32 for AVX512.
   // TODO: We should be able to handle larger vectors by splitting them before
   // feeding them into several SADs, and then reducing over those.
   if (VT.getSizeInBits() / 4 > RegSize)
[211 lines not shown]	static SDValue combineSubToSubus(SDNode *N, SelectionDAG &DAG,
SDValue Op1 = N->getOperand(1);		SDValue Op1 = N->getOperand(1);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

// PSUBUS is supported, starting from SSE2, but special preprocessing		// PSUBUS is supported, starting from SSE2, but special preprocessing
// for v8i32 requires umin, which appears in SSE41.		// for v8i32 requires umin, which appears in SSE41.
if (!(Subtarget.hasSSE2() && (VT == MVT::v16i8 \|\| VT == MVT::v8i16)) &&		if (!(Subtarget.hasSSE2() && (VT == MVT::v16i8 \|\| VT == MVT::v8i16)) &&
!(Subtarget.hasSSE41() && (VT == MVT::v8i32)) &&		!(Subtarget.hasSSE41() && (VT == MVT::v8i32)) &&
!(Subtarget.hasAVX2() && (VT == MVT::v32i8 \|\| VT == MVT::v16i16)) &&		!(Subtarget.hasAVX2() && (VT == MVT::v32i8 \|\| VT == MVT::v16i16)) &&
!(Subtarget.hasBWI() && (VT == MVT::v64i8 \|\| VT == MVT::v32i16 \|\|		!(Subtarget.useBWIRegs() && (VT == MVT::v64i8 \|\| VT == MVT::v32i16 \|\|
VT == MVT::v16i32 \|\| VT == MVT::v8i64)))		VT == MVT::v16i32 \|\| VT == MVT::v8i64)))
return SDValue();		return SDValue();

SDValue SubusLHS, SubusRHS;		SDValue SubusLHS, SubusRHS;
// Try to find umax(a,b) - b or a - umin(a,b) patterns		// Try to find umax(a,b) - b or a - umin(a,b) patterns
// they may be converted to subus(a,b).		// they may be converted to subus(a,b).
// TODO: Need to add IR canonicalization for this code.		// TODO: Need to add IR canonicalization for this code.
if (Op0.getOpcode() == ISD::UMAX) {		if (Op0.getOpcode() == ISD::UMAX) {
SubusRHS = Op1;		SubusRHS = Op1;
[1,574 lines not shown]

llvm/trunk/lib/Target/X86/X86Subtarget.h

[401 lines not shown]	private:

/// Preferred vector width from function attribute.		/// Preferred vector width from function attribute.
unsigned PreferVectorWidthOverride;		unsigned PreferVectorWidthOverride;

/// Resolved preferred vector width from function attribute and subtarget		/// Resolved preferred vector width from function attribute and subtarget
/// features.		/// features.
unsigned PreferVectorWidth;		unsigned PreferVectorWidth;

		/// Required vector width from function attribute.
		unsigned RequiredVectorWidth;

/// True if compiling for 64-bit, false for 16-bit or 32-bit.		/// True if compiling for 64-bit, false for 16-bit or 32-bit.
bool In64BitMode;		bool In64BitMode;

/// True if compiling for 32-bit, false for 16-bit or 64-bit.		/// True if compiling for 32-bit, false for 16-bit or 64-bit.
bool In32BitMode;		bool In32BitMode;

/// True if compiling for 16-bit, false for 32-bit or 64-bit.		/// True if compiling for 16-bit, false for 32-bit or 64-bit.
bool In16BitMode;		bool In16BitMode;
[10 lines not shown]	private:
X86FrameLowering FrameLowering;		X86FrameLowering FrameLowering;

public:		public:
/// This constructor initializes the data members to match that		/// This constructor initializes the data members to match that
/// of the specified triple.		/// of the specified triple.
///		///
X86Subtarget(const Triple &TT, StringRef CPU, StringRef FS,		X86Subtarget(const Triple &TT, StringRef CPU, StringRef FS,
const X86TargetMachine &TM, unsigned StackAlignOverride,		const X86TargetMachine &TM, unsigned StackAlignOverride,
unsigned PreferVectorWidthOverride);		unsigned PreferVectorWidthOverride,
		unsigned RequiredVectorWidth);

const X86TargetLowering *getTargetLowering() const override {		const X86TargetLowering *getTargetLowering() const override {
return &TLInfo;		return &TLInfo;
}		}

const X86InstrInfo *getInstrInfo() const override { return &InstrInfo; }		const X86InstrInfo *getInstrInfo() const override { return &InstrInfo; }

const X86FrameLowering *getFrameLowering() const override {		const X86FrameLowering *getFrameLowering() const override {
[172 lines not shown]	public:
bool hasIBT() const { return HasIBT; }		bool hasIBT() const { return HasIBT; }
bool hasCLFLUSHOPT() const { return HasCLFLUSHOPT; }		bool hasCLFLUSHOPT() const { return HasCLFLUSHOPT; }
bool hasCLWB() const { return HasCLWB; }		bool hasCLWB() const { return HasCLWB; }
bool hasRDPID() const { return HasRDPID; }		bool hasRDPID() const { return HasRDPID; }
bool useRetpoline() const { return UseRetpoline; }		bool useRetpoline() const { return UseRetpoline; }
bool useRetpolineExternalThunk() const { return UseRetpolineExternalThunk; }		bool useRetpolineExternalThunk() const { return UseRetpolineExternalThunk; }

unsigned getPreferVectorWidth() const { return PreferVectorWidth; }		unsigned getPreferVectorWidth() const { return PreferVectorWidth; }
		unsigned getRequiredVectorWidth() const { return RequiredVectorWidth; }

// Helper functions to determine when we should allow widening to 512-bit		// Helper functions to determine when we should allow widening to 512-bit
// during codegen.		// during codegen.
// TODO: Currently we're always allowing widening on CPUs without VLX,		// TODO: Currently we're always allowing widening on CPUs without VLX,
// because for many cases we don't have a better option.		// because for many cases we don't have a better option.
bool canExtendTo512DQ() const {		bool canExtendTo512DQ() const {
return hasAVX512() && (!hasVLX() \|\| getPreferVectorWidth() >= 512);		return hasAVX512() && (!hasVLX() \|\| getPreferVectorWidth() >= 512);
}		}
bool canExtendTo512BW() const {		bool canExtendTo512BW() const {
return hasBWI() && canExtendTo512DQ();		return hasBWI() && canExtendTo512DQ();
}		}

		// If there are no 512-bit vectors and we prefer not to use 512-bit registers,
		// disable them in the legalizer.
		bool useAVX512Regs() const {
		return hasAVX512() && (canExtendTo512DQ() \|\| RequiredVectorWidth > 256);
		}

		bool useBWIRegs() const {
		return hasBWI() && useAVX512Regs();
		}

bool isXRaySupported() const override { return is64Bit(); }		bool isXRaySupported() const override { return is64Bit(); }

X86ProcFamilyEnum getProcFamily() const { return X86ProcFamily; }		X86ProcFamilyEnum getProcFamily() const { return X86ProcFamily; }

/// TODO: to be removed later and replaced with suitable properties		/// TODO: to be removed later and replaced with suitable properties
bool isAtom() const { return X86ProcFamily == IntelAtom; }		bool isAtom() const { return X86ProcFamily == IntelAtom; }
bool isSLM() const { return X86ProcFamily == IntelSLM; }		bool isSLM() const { return X86ProcFamily == IntelSLM; }
bool useSoftFloat() const { return UseSoftFloat; }		bool useSoftFloat() const { return UseSoftFloat; }
[140 lines not shown]
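
The interaction between `canExtendTo512DQ()`, `useAVX512Regs()` and `useBWIRegs()` in the header above can be exercised in isolation. The following is a minimal standalone sketch, not part of the patch: `SubtargetModel`, `makePrefer256` and `exampleFromTestFile` are hypothetical names, and the model assumes that `prefer-256-bit` resolves `PreferVectorWidth` to 256.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the few X86Subtarget fields these helpers read.
struct SubtargetModel {
  bool HasAVX512 = false;
  bool HasVLX = false;
  bool HasBWI = false;
  unsigned PreferVectorWidth = 512;          // assumed to be already resolved
  unsigned RequiredVectorWidth = UINT32_MAX; // value used when the attribute is absent

  // Same boolean logic as the helpers in the diff above.
  bool canExtendTo512DQ() const {
    return HasAVX512 && (!HasVLX || PreferVectorWidth >= 512);
  }
  bool canExtendTo512BW() const { return HasBWI && canExtendTo512DQ(); }
  bool useAVX512Regs() const {
    return HasAVX512 && (canExtendTo512DQ() || RequiredVectorWidth > 256);
  }
  bool useBWIRegs() const { return HasBWI && useAVX512Regs(); }
};

// Mirrors the feature set on the RUN line of required-vector-width.ll
// (avx512vl, avx512bw, prefer-256-bit), with a caller-chosen required width.
inline SubtargetModel makePrefer256(unsigned Required) {
  SubtargetModel ST;
  ST.HasAVX512 = ST.HasVLX = ST.HasBWI = true;
  ST.PreferVectorWidth = 256;
  ST.RequiredVectorWidth = Required;
  return ST;
}

inline void exampleFromTestFile() {
  // "required-vector-width"="256": 512-bit registers are not used.
  assert(!makePrefer256(256).useAVX512Regs());
  // "required-vector-width"="512": 512-bit registers are used.
  assert(makePrefer256(512).useBWIRegs());
  // Attribute missing: the UINT32_MAX default keeps 512-bit registers in use.
  assert(makePrefer256(UINT32_MAX).useAVX512Regs());
}
```

Under these assumptions, a target without VLX uses 512-bit registers regardless of the required width, because `canExtendTo512DQ()` is already true there.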

llvm/trunk/lib/Target/X86/X86Subtarget.cpp

[367 lines not shown]	X86Subtarget &X86Subtarget::initializeSubtargetDependencies(StringRef CPU,
initializeEnvironment();		initializeEnvironment();
initSubtargetFeatures(CPU, FS);		initSubtargetFeatures(CPU, FS);
return *this;		return *this;
}		}

X86Subtarget::X86Subtarget(const Triple &TT, StringRef CPU, StringRef FS,		X86Subtarget::X86Subtarget(const Triple &TT, StringRef CPU, StringRef FS,
const X86TargetMachine &TM,		const X86TargetMachine &TM,
unsigned StackAlignOverride,		unsigned StackAlignOverride,
unsigned PreferVectorWidthOverride)		unsigned PreferVectorWidthOverride,
		unsigned RequiredVectorWidth)
: X86GenSubtargetInfo(TT, CPU, FS), X86ProcFamily(Others),		: X86GenSubtargetInfo(TT, CPU, FS), X86ProcFamily(Others),
PICStyle(PICStyles::None), TM(TM), TargetTriple(TT),		PICStyle(PICStyles::None), TM(TM), TargetTriple(TT),
StackAlignOverride(StackAlignOverride),		StackAlignOverride(StackAlignOverride),
PreferVectorWidthOverride(PreferVectorWidthOverride),		PreferVectorWidthOverride(PreferVectorWidthOverride),
		RequiredVectorWidth(RequiredVectorWidth),
In64BitMode(TargetTriple.getArch() == Triple::x86_64),		In64BitMode(TargetTriple.getArch() == Triple::x86_64),
In32BitMode(TargetTriple.getArch() == Triple::x86 &&		In32BitMode(TargetTriple.getArch() == Triple::x86 &&
TargetTriple.getEnvironment() != Triple::CODE16),		TargetTriple.getEnvironment() != Triple::CODE16),
In16BitMode(TargetTriple.getArch() == Triple::x86 &&		In16BitMode(TargetTriple.getArch() == Triple::x86 &&
TargetTriple.getEnvironment() == Triple::CODE16),		TargetTriple.getEnvironment() == Triple::CODE16),
InstrInfo(initializeSubtargetDependencies(CPU, FS)), TLInfo(TM, *this),		InstrInfo(initializeSubtargetDependencies(CPU, FS)), TLInfo(TM, *this),
FrameLowering(*this, getStackAlignment()) {		FrameLowering(*this, getStackAlignment()) {
// Determine the PICStyle based on the target selected.		// Determine the PICStyle based on the target selected.
[38 lines not shown]

llvm/trunk/lib/Target/X86/X86TargetMachine.cpp

[253 lines not shown]	X86TargetMachine::getSubtargetImpl(const Function &F) const {
// subtarget feature.		// subtarget feature.
if (SoftFloat)		if (SoftFloat)
Key += FS.empty() ? "+soft-float" : ",+soft-float";		Key += FS.empty() ? "+soft-float" : ",+soft-float";

// Keep track of the key width after all features are added so we can extract		// Keep track of the key width after all features are added so we can extract
// the feature string out later.		// the feature string out later.
unsigned CPUFSWidth = Key.size();		unsigned CPUFSWidth = Key.size();

// Translate vector width function attribute into subtarget features. This		// Extract prefer-vector-width attribute.
// overrides any CPU specific tuning parameter
unsigned PreferVectorWidthOverride = 0;		unsigned PreferVectorWidthOverride = 0;
if (F.hasFnAttribute("prefer-vector-width")) {		if (F.hasFnAttribute("prefer-vector-width")) {
StringRef Val = F.getFnAttribute("prefer-vector-width").getValueAsString();		StringRef Val = F.getFnAttribute("prefer-vector-width").getValueAsString();
unsigned Width;		unsigned Width;
if (!Val.getAsInteger(0, Width)) {		if (!Val.getAsInteger(0, Width)) {
Key += ",prefer-vector-width=";		Key += ",prefer-vector-width=";
Key += Val;		Key += Val;
PreferVectorWidthOverride = Width;		PreferVectorWidthOverride = Width;
}		}
}		}

		// Extract required-vector-width attribute.
		unsigned RequiredVectorWidth = UINT32_MAX;
		danilaml: @craig.topper Sorry for commenting on such an old review, but I was investigating some codegen differences for very similar IR and came across this code (the attribute was later renamed to `min-legal-vector-width`, but otherwise it's the same on main). Is RequiredVectorWidth intended to be initialized to `UINT32_MAX`? What is the rationale? It forces the maximum vector width if the function is missing the attribute for some reason, ignoring the `prefer-` attributes. To me it seems that the conservative approach would be to set it to `0` and increase it according to the attribute, since zero-length vectors are always legal/"required".
		craig.topper (Author): It is intentionally set to UINT32_MAX if the attribute is missing. If the IR contains any 512-bit inline assembly, function arguments, returns, or X86-specific vector intrinsics, the backend will crash or violate the ABI. The presence of the attribute indicates that those cases have been checked and nothing requires 512-bit vectors.
		danilaml: Not sure I understand. Without the attribute the backend will ignore prefer-vector-width and generate AVX512 asm regardless. How is setting the default to 0 worse?
		craig.topper (Author): If there are 512-bit x86 intrinsics in the IR it will crash the compiler. I assume compiler crashes are worse than suboptimal code. Prefer vector width is still checked in many other places to prevent aggressive use of 512-bit vectors. For example, `X86TTIImpl::getRegisterBitWidth` will still tell the vectorizer that the register width is 256 bits. Are you finding the attribute missing in code compiled with clang or another frontend?
		danilaml: Doesn't seem to crash (although I haven't tried inline assembly): https://llvm.godbolt.org/z/h6jEYrnaa In my case it's another frontend (a JIT compiler). I noticed that the compiler would generate suboptimal code using AVX512 even though the target CPU has prefer256 tuning, and found that the issue is the missing attribute (I also noticed that the target knows that expanding a certain intrinsic using AVX512 is more costly than using AVX2, but still uses the highest ISA available, but that's another issue entirely). Now I'm wondering what to set it to. Also, passes like SLPVectorizer don't really care about `getRegisterBitWidth`, since they usually just check whether some operation/type is legal and look at the cost returned by the target hooks.
		craig.topper (Author): It crashes if prefer-vector-width<=256 is also specified, or if a CPU that implies a preferred vector width <=256 is used: https://llvm.godbolt.org/z/G5458499K
		danilaml: I see. This is counterintuitive. It appears that `min-legal-vector-width` is actually sort of a `max`, but not really. So what is the intended usage for a non-C frontend? What should it be set to in order to allow AVX512 intrinsics/inline asm when explicitly requested AND to keep the "prefer 256" semantics in most other places? Should we mark every "regular" function (one that doesn't use AVX512 intrinsics or inline asm) with `min-legal-vector-width=0` and those that do with `=512` (we won't be passing/returning AVX512 types, so the ABI is not a concern AFAIU)?
		craig.topper (Author): I unfortunately named it from the perspective of the X86 backend, where it is the minimum vector width that the backend must make a legal type to prevent crashes. The clang frontend calculates it by taking the maximum value from all intrinsics, inline assembly, and function arguments/returns. Setting it to 0 for functions without AVX512 intrinsics or inline assembly should be fine today.
		if (F.hasFnAttribute("required-vector-width")) {
		StringRef Val = F.getFnAttribute("required-vector-width").getValueAsString();
		unsigned Width;
		if (!Val.getAsInteger(0, Width)) {
		Key += ",required-vector-width=";
		Key += Val;
		RequiredVectorWidth = Width;
		}
		}

		// Extracted here so that we make sure there is backing for the StringRef. If
		// we assigned earlier, it's possible the SmallString reallocated leaving a
		// dangling StringRef.
FS = Key.slice(CPU.size(), CPUFSWidth);		FS = Key.slice(CPU.size(), CPUFSWidth);

auto &I = SubtargetMap[Key];		auto &I = SubtargetMap[Key];
if (!I) {		if (!I) {
// This needs to be done before we create a new subtarget since any		// This needs to be done before we create a new subtarget since any
// creation will depend on the TM and the code generation flags on the		// creation will depend on the TM and the code generation flags on the
// function that reside in TargetOptions.		// function that reside in TargetOptions.
resetTargetOptions(F);		resetTargetOptions(F);
I = llvm::make_unique<X86Subtarget>(TargetTriple, CPU, FS, *this,		I = llvm::make_unique<X86Subtarget>(TargetTriple, CPU, FS, *this,
Options.StackAlignmentOverride,		Options.StackAlignmentOverride,
PreferVectorWidthOverride);		PreferVectorWidthOverride,
		RequiredVectorWidth);
}		}
return I.get();		return I.get();
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Command line options for x86		// Command line options for x86
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
static cl::opt<bool>		static cl::opt<bool>
[180 lines not shown]
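
The inline discussion above describes two halves of the contract: the frontend takes the maximum width over every explicit vector use, and the backend falls back to `UINT32_MAX` when the attribute is absent or unparsable. A minimal standalone sketch of both halves follows. The function names are hypothetical, the frontend side is only an illustration of the description in the thread (not clang's actual code), and `std::from_chars` with base 10 is a simplification of the `StringRef::getAsInteger(0, Width)` call in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical frontend side: the widest vector (in bits) that the source
// uses explicitly through intrinsics, inline assembly, arguments or returns.
// Returns 0 when nothing in the function forces a particular width.
unsigned computeRequiredVectorWidth(const std::vector<unsigned> &ExplicitWidths) {
  unsigned Max = 0;
  for (unsigned W : ExplicitWidths)
    Max = std::max(Max, W);
  return Max;
}

// Hypothetical backend side: a missing or non-integer attribute keeps the
// conservative UINT32_MAX default, so 512-bit vectors stay legal.
unsigned resolveRequiredVectorWidth(const std::optional<std::string> &Attr) {
  unsigned Width = UINT32_MAX;
  if (Attr) {
    unsigned Parsed = 0;
    const char *Begin = Attr->data();
    const char *End = Begin + Attr->size();
    auto [Ptr, Ec] = std::from_chars(Begin, End, Parsed);
    if (Ec == std::errc() && Ptr == End)
      Width = Parsed;
  }
  return Width;
}

inline void exampleUsage() {
  // A function with no explicit vectors can be tagged with width 0.
  assert(computeRequiredVectorWidth({}) == 0);
  // A function that uses a 512-bit intrinsic must be tagged with 512.
  assert(computeRequiredVectorWidth({128, 512, 256}) == 512);
  // No attribute at all: the backend assumes anything may be required.
  assert(resolveRequiredVectorWidth(std::nullopt) == UINT32_MAX);
}
```

The asymmetric default is the point of the design discussed in the thread: an untagged function may contain 512-bit intrinsics that nobody checked for, so only an explicit attribute lets the backend make 512-bit types illegal.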

llvm/trunk/test/CodeGen/X86/required-vector-width.ll

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512vl,avx512bw,avx512dq,prefer-256-bit \| FileCheck %s

				; This file primarily contains tests for specific places in X86ISelLowering.cpp that needed to be made aware of the legalizer not allowing 512-bit vectors due to prefer-256-bit even though AVX512 is enabled.

				define void @add256(<16 x i32>* %a, <16 x i32>* %b, <16 x i32>* %c) "required-vector-width"="256" {
				; CHECK-LABEL: add256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovdqa (%rdi), %ymm0
				; CHECK-NEXT: vmovdqa 32(%rdi), %ymm1
				; CHECK-NEXT: vpaddd (%rsi), %ymm0, %ymm0
				; CHECK-NEXT: vpaddd 32(%rsi), %ymm1, %ymm1
				; CHECK-NEXT: vmovdqa %ymm1, 32(%rdx)
				; CHECK-NEXT: vmovdqa %ymm0, (%rdx)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%d = load <16 x i32>, <16 x i32>* %a
				%e = load <16 x i32>, <16 x i32>* %b
				%f = add <16 x i32> %d, %e
				store <16 x i32> %f, <16 x i32>* %c
				ret void
				}

				define void @add512(<16 x i32>* %a, <16 x i32>* %b, <16 x i32>* %c) "required-vector-width"="512" {
				; CHECK-LABEL: add512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovdqa64 (%rdi), %zmm0
				; CHECK-NEXT: vpaddd (%rsi), %zmm0, %zmm0
				; CHECK-NEXT: vmovdqa64 %zmm0, (%rdx)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%d = load <16 x i32>, <16 x i32>* %a
				%e = load <16 x i32>, <16 x i32>* %b
				%f = add <16 x i32> %d, %e
				store <16 x i32> %f, <16 x i32>* %c
				ret void
				}

				define void @avg_v64i8_256(<64 x i8>* %a, <64 x i8>* %b) "required-vector-width"="256" {
				; CHECK-LABEL: avg_v64i8_256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovdqa (%rsi), %ymm0
				; CHECK-NEXT: vmovdqa 32(%rsi), %ymm1
				; CHECK-NEXT: vpavgb (%rdi), %ymm0, %ymm0
				; CHECK-NEXT: vpavgb 32(%rdi), %ymm1, %ymm1
				; CHECK-NEXT: vmovdqu %ymm1, (%rax)
				; CHECK-NEXT: vmovdqu %ymm0, (%rax)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%1 = load <64 x i8>, <64 x i8>* %a
				%2 = load <64 x i8>, <64 x i8>* %b
				%3 = zext <64 x i8> %1 to <64 x i32>
				%4 = zext <64 x i8> %2 to <64 x i32>
				%5 = add nuw nsw <64 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
				%6 = add nuw nsw <64 x i32> %5, %4
				%7 = lshr <64 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
				%8 = trunc <64 x i32> %7 to <64 x i8>
				store <64 x i8> %8, <64 x i8>* undef, align 4
				ret void
				}


				define void @avg_v64i8_512(<64 x i8>* %a, <64 x i8>* %b) "required-vector-width"="512" {
				; CHECK-LABEL: avg_v64i8_512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovdqa64 (%rsi), %zmm0
				; CHECK-NEXT: vpavgb (%rdi), %zmm0, %zmm0
				; CHECK-NEXT: vmovdqu64 %zmm0, (%rax)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%1 = load <64 x i8>, <64 x i8>* %a
				%2 = load <64 x i8>, <64 x i8>* %b
				%3 = zext <64 x i8> %1 to <64 x i32>
				%4 = zext <64 x i8> %2 to <64 x i32>
				%5 = add nuw nsw <64 x i32> %3, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
				%6 = add nuw nsw <64 x i32> %5, %4
				%7 = lshr <64 x i32> %6, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
				%8 = trunc <64 x i32> %7 to <64 x i8>
				store <64 x i8> %8, <64 x i8>* undef, align 4
				ret void
				}

				define void @pmaddwd_32_256(<32 x i16>* %APtr, <32 x i16>* %BPtr, <16 x i32>* %CPtr) "required-vector-width"="256" {
				; CHECK-LABEL: pmaddwd_32_256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovdqa (%rdi), %ymm0
				; CHECK-NEXT: vmovdqa 32(%rdi), %ymm1
				; CHECK-NEXT: vpmaddwd (%rsi), %ymm0, %ymm0
				; CHECK-NEXT: vpmaddwd 32(%rsi), %ymm1, %ymm1
				; CHECK-NEXT: vmovdqa %ymm1, 32(%rdx)
				; CHECK-NEXT: vmovdqa %ymm0, (%rdx)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%A = load <32 x i16>, <32 x i16>* %APtr
				%B = load <32 x i16>, <32 x i16>* %BPtr
				%a = sext <32 x i16> %A to <32 x i32>
				%b = sext <32 x i16> %B to <32 x i32>
				%m = mul nsw <32 x i32> %a, %b
				%odd = shufflevector <32 x i32> %m, <32 x i32> undef, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%even = shufflevector <32 x i32> %m, <32 x i32> undef, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%ret = add <16 x i32> %odd, %even
				store <16 x i32> %ret, <16 x i32>* %CPtr
				ret void
				}

				define void @pmaddwd_32_512(<32 x i16>* %APtr, <32 x i16>* %BPtr, <16 x i32>* %CPtr) "required-vector-width"="512" {
				; CHECK-LABEL: pmaddwd_32_512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovdqa64 (%rdi), %zmm0
				; CHECK-NEXT: vpmaddwd (%rsi), %zmm0, %zmm0
				; CHECK-NEXT: vmovdqa64 %zmm0, (%rdx)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%A = load <32 x i16>, <32 x i16>* %APtr
				%B = load <32 x i16>, <32 x i16>* %BPtr
				%a = sext <32 x i16> %A to <32 x i32>
				%b = sext <32 x i16> %B to <32 x i32>
				%m = mul nsw <32 x i32> %a, %b
				%odd = shufflevector <32 x i32> %m, <32 x i32> undef, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%even = shufflevector <32 x i32> %m, <32 x i32> undef, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%ret = add <16 x i32> %odd, %even
				store <16 x i32> %ret, <16 x i32>* %CPtr
				ret void
				}

				define void @psubus_64i8_max_256(<64 x i8>* %xptr, <64 x i8>* %yptr, <64 x i8>* %zptr) "required-vector-width"="256" {
				; CHECK-LABEL: psubus_64i8_max_256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovdqa (%rdi), %ymm0
				; CHECK-NEXT: vmovdqa 32(%rdi), %ymm1
				; CHECK-NEXT: vpsubusb (%rsi), %ymm0, %ymm0
				; CHECK-NEXT: vpsubusb 32(%rsi), %ymm1, %ymm1
				; CHECK-NEXT: vmovdqa %ymm1, 32(%rdx)
				; CHECK-NEXT: vmovdqa %ymm0, (%rdx)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%x = load <64 x i8>, <64 x i8>* %xptr
				%y = load <64 x i8>, <64 x i8>* %yptr
				%cmp = icmp ult <64 x i8> %x, %y
				%max = select <64 x i1> %cmp, <64 x i8> %y, <64 x i8> %x
				%res = sub <64 x i8> %max, %y
				store <64 x i8> %res, <64 x i8>* %zptr
				ret void
				}

				define void @psubus_64i8_max_512(<64 x i8>* %xptr, <64 x i8>* %yptr, <64 x i8>* %zptr) "required-vector-width"="512" {
				; CHECK-LABEL: psubus_64i8_max_512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovdqa64 (%rdi), %zmm0
				; CHECK-NEXT: vpsubusb (%rsi), %zmm0, %zmm0
				; CHECK-NEXT: vmovdqa64 %zmm0, (%rdx)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				%x = load <64 x i8>, <64 x i8>* %xptr
				%y = load <64 x i8>, <64 x i8>* %yptr
				%cmp = icmp ult <64 x i8> %x, %y
				%max = select <64 x i1> %cmp, <64 x i8> %y, <64 x i8> %x
				%res = sub <64 x i8> %max, %y
				store <64 x i8> %res, <64 x i8>* %zptr
				ret void
				}

				define i32 @_Z9test_charPcS_i_256(i8* nocapture readonly, i8* nocapture readonly, i32) "required-vector-width"="256" {
				; CHECK-LABEL: _Z9test_charPcS_i_256:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: movl %edx, %eax
				; CHECK-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; CHECK-NEXT: xorl %ecx, %ecx
				; CHECK-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; CHECK-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; CHECK-NEXT: vpxor %xmm3, %xmm3, %xmm3
				; CHECK-NEXT: .p2align 4, 0x90
				; CHECK-NEXT: .LBB8_1: # %vector.body
				; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vpmovsxbw (%rdi,%rcx), %xmm4
				; CHECK-NEXT: vpmovsxbw 8(%rdi,%rcx), %xmm5
				; CHECK-NEXT: vpmovsxbw 16(%rdi,%rcx), %xmm6
				; CHECK-NEXT: vpmovsxbw 24(%rdi,%rcx), %xmm8
				; CHECK-NEXT: vpmovsxbw (%rsi,%rcx), %xmm7
				; CHECK-NEXT: vpmaddwd %xmm4, %xmm7, %xmm4
				; CHECK-NEXT: vpmovsxbw 8(%rsi,%rcx), %xmm7
				; CHECK-NEXT: vpmaddwd %xmm5, %xmm7, %xmm5
				; CHECK-NEXT: vpmovsxbw 16(%rsi,%rcx), %xmm7
				; CHECK-NEXT: vpmaddwd %xmm6, %xmm7, %xmm6
				; CHECK-NEXT: vpmovsxbw 24(%rsi,%rcx), %xmm7
				; CHECK-NEXT: vpmaddwd %xmm8, %xmm7, %xmm7
				; CHECK-NEXT: vpaddd %ymm3, %ymm7, %ymm3
				; CHECK-NEXT: vpaddd %ymm2, %ymm6, %ymm2
				; CHECK-NEXT: vpaddd %ymm1, %ymm5, %ymm1
				; CHECK-NEXT: vpaddd %ymm0, %ymm4, %ymm0
				; CHECK-NEXT: addq $32, %rcx
				; CHECK-NEXT: cmpq %rcx, %rax
				; CHECK-NEXT: jne .LBB8_1
				; CHECK-NEXT: # %bb.2: # %middle.block
				; CHECK-NEXT: vpaddd %ymm2, %ymm0, %ymm0
				; CHECK-NEXT: vpaddd %ymm3, %ymm1, %ymm1
				; CHECK-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vextracti128 $1, %ymm0, %xmm1
				; CHECK-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vphaddd %ymm0, %ymm0, %ymm0
				; CHECK-NEXT: vmovd %xmm0, %eax
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				entry:
				%3 = zext i32 %2 to i64
				br label %vector.body

				vector.body:
				%index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
				%vec.phi = phi <32 x i32> [ %11, %vector.body ], [ zeroinitializer, %entry ]
				%4 = getelementptr inbounds i8, i8* %0, i64 %index
				%5 = bitcast i8* %4 to <32 x i8>*
				%wide.load = load <32 x i8>, <32 x i8>* %5, align 1
				%6 = sext <32 x i8> %wide.load to <32 x i32>
				%7 = getelementptr inbounds i8, i8* %1, i64 %index
				%8 = bitcast i8* %7 to <32 x i8>*
				%wide.load14 = load <32 x i8>, <32 x i8>* %8, align 1
				%9 = sext <32 x i8> %wide.load14 to <32 x i32>
				%10 = mul nsw <32 x i32> %9, %6
				%11 = add nsw <32 x i32> %10, %vec.phi
				%index.next = add i64 %index, 32
				%12 = icmp eq i64 %index.next, %3
				br i1 %12, label %middle.block, label %vector.body

				middle.block:
				%rdx.shuf1 = shufflevector <32 x i32> %11, <32 x i32> undef, <32 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx1 = add <32 x i32> %11, %rdx.shuf1
				%rdx.shuf = shufflevector <32 x i32> %bin.rdx1, <32 x i32> undef, <32 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <32 x i32> %bin.rdx1, %rdx.shuf
				%rdx.shuf15 = shufflevector <32 x i32> %bin.rdx, <32 x i32> undef, <32 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx32 = add <32 x i32> %bin.rdx, %rdx.shuf15
				%rdx.shuf17 = shufflevector <32 x i32> %bin.rdx32, <32 x i32> undef, <32 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx18 = add <32 x i32> %bin.rdx32, %rdx.shuf17
				%rdx.shuf19 = shufflevector <32 x i32> %bin.rdx18, <32 x i32> undef, <32 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx20 = add <32 x i32> %bin.rdx18, %rdx.shuf19
				%13 = extractelement <32 x i32> %bin.rdx20, i32 0
				ret i32 %13
				}

				define i32 @_Z9test_charPcS_i_512(i8* nocapture readonly, i8* nocapture readonly, i32) "required-vector-width"="512" {
				; CHECK-LABEL: _Z9test_charPcS_i_512:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: movl %edx, %eax
				; CHECK-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; CHECK-NEXT: xorl %ecx, %ecx
				; CHECK-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; CHECK-NEXT: .p2align 4, 0x90
				; CHECK-NEXT: .LBB9_1: # %vector.body
				; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vpmovsxbw (%rdi,%rcx), %zmm2
				; CHECK-NEXT: vpmovsxbw (%rsi,%rcx), %zmm3
				; CHECK-NEXT: vpmaddwd %zmm2, %zmm3, %zmm2
				; CHECK-NEXT: vpaddd %zmm1, %zmm2, %zmm1
				; CHECK-NEXT: addq $32, %rcx
				; CHECK-NEXT: cmpq %rcx, %rax
				; CHECK-NEXT: jne .LBB9_1
				; CHECK-NEXT: # %bb.2: # %middle.block
				; CHECK-NEXT: vpaddd %zmm0, %zmm1, %zmm0
				; CHECK-NEXT: vextracti64x4 $1, %zmm0, %ymm1
				; CHECK-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vextracti128 $1, %ymm0, %xmm1
				; CHECK-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
				; CHECK-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vmovd %xmm0, %eax
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				entry:
				%3 = zext i32 %2 to i64
				br label %vector.body

				vector.body:
				%index = phi i64 [ %index.next, %vector.body ], [ 0, %entry ]
				%vec.phi = phi <32 x i32> [ %11, %vector.body ], [ zeroinitializer, %entry ]
				%4 = getelementptr inbounds i8, i8* %0, i64 %index
				%5 = bitcast i8* %4 to <32 x i8>*
				%wide.load = load <32 x i8>, <32 x i8>* %5, align 1
				%6 = sext <32 x i8> %wide.load to <32 x i32>
				%7 = getelementptr inbounds i8, i8* %1, i64 %index
				%8 = bitcast i8* %7 to <32 x i8>*
				%wide.load14 = load <32 x i8>, <32 x i8>* %8, align 1
				%9 = sext <32 x i8> %wide.load14 to <32 x i32>
				%10 = mul nsw <32 x i32> %9, %6
				%11 = add nsw <32 x i32> %10, %vec.phi
				%index.next = add i64 %index, 32
				%12 = icmp eq i64 %index.next, %3
				br i1 %12, label %middle.block, label %vector.body

				middle.block:
				%rdx.shuf1 = shufflevector <32 x i32> %11, <32 x i32> undef, <32 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx1 = add <32 x i32> %11, %rdx.shuf1
				%rdx.shuf = shufflevector <32 x i32> %bin.rdx1, <32 x i32> undef, <32 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <32 x i32> %bin.rdx1, %rdx.shuf
				%rdx.shuf15 = shufflevector <32 x i32> %bin.rdx, <32 x i32> undef, <32 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx32 = add <32 x i32> %bin.rdx, %rdx.shuf15
				%rdx.shuf17 = shufflevector <32 x i32> %bin.rdx32, <32 x i32> undef, <32 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx18 = add <32 x i32> %bin.rdx32, %rdx.shuf17
				%rdx.shuf19 = shufflevector <32 x i32> %bin.rdx18, <32 x i32> undef, <32 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx20 = add <32 x i32> %bin.rdx18, %rdx.shuf19
				%13 = extractelement <32 x i32> %bin.rdx20, i32 0
				ret i32 %13
				}

				@a = global [1024 x i8] zeroinitializer, align 16
				@b = global [1024 x i8] zeroinitializer, align 16

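				; Sum of absolute differences over @a and @b; with a required width of 256 the <16 x i32> accumulator is expected to stay in ymm registers.
				define i32 @sad_16i8_256() "required-vector-width"="256" {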
				; CHECK-LABEL: sad_16i8_256:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; CHECK-NEXT: movq $-1024, %rax # imm = 0xFC00
				; CHECK-NEXT: vpxor %xmm1, %xmm1, %xmm1
				; CHECK-NEXT: .p2align 4, 0x90
				; CHECK-NEXT: .LBB10_1: # %vector.body
				; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vmovdqu a+1024(%rax), %xmm2
				; CHECK-NEXT: vpsadbw b+1024(%rax), %xmm2, %xmm2
				; CHECK-NEXT: vpaddd %ymm1, %ymm2, %ymm1
				; CHECK-NEXT: addq $4, %rax
				; CHECK-NEXT: jne .LBB10_1
				; CHECK-NEXT: # %bb.2: # %middle.block
				; CHECK-NEXT: vpaddd %ymm0, %ymm1, %ymm0
				; CHECK-NEXT: vextracti128 $1, %ymm0, %xmm1
				; CHECK-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: vpaddd %ymm1, %ymm0, %ymm0
				; CHECK-NEXT: vphaddd %ymm0, %ymm0, %ymm0
				; CHECK-NEXT: vmovd %xmm0, %eax
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				entry:
				br label %vector.body

				vector.body:
				%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
				%vec.phi = phi <16 x i32> [ zeroinitializer, %entry ], [ %10, %vector.body ]
				%0 = getelementptr inbounds [1024 x i8], [1024 x i8]* @a, i64 0, i64 %index
				%1 = bitcast i8* %0 to <16 x i8>*
				%wide.load = load <16 x i8>, <16 x i8>* %1, align 4
				%2 = zext <16 x i8> %wide.load to <16 x i32>
				%3 = getelementptr inbounds [1024 x i8], [1024 x i8]* @b, i64 0, i64 %index
				%4 = bitcast i8* %3 to <16 x i8>*
				%wide.load1 = load <16 x i8>, <16 x i8>* %4, align 4
				%5 = zext <16 x i8> %wide.load1 to <16 x i32>
				%6 = sub nsw <16 x i32> %2, %5
				%7 = icmp sgt <16 x i32> %6, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%8 = sub nsw <16 x i32> zeroinitializer, %6
				%9 = select <16 x i1> %7, <16 x i32> %6, <16 x i32> %8
				%10 = add nsw <16 x i32> %9, %vec.phi
				%index.next = add i64 %index, 4
				%11 = icmp eq i64 %index.next, 1024
				br i1 %11, label %middle.block, label %vector.body

				middle.block:
				%.lcssa = phi <16 x i32> [ %10, %vector.body ]
				%rdx.shuf = shufflevector <16 x i32> %.lcssa, <16 x i32> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <16 x i32> %.lcssa, %rdx.shuf
				%rdx.shuf2 = shufflevector <16 x i32> %bin.rdx, <16 x i32> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx2 = add <16 x i32> %bin.rdx, %rdx.shuf2
				%rdx.shuf3 = shufflevector <16 x i32> %bin.rdx2, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx3 = add <16 x i32> %bin.rdx2, %rdx.shuf3
				%rdx.shuf4 = shufflevector <16 x i32> %bin.rdx3, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx4 = add <16 x i32> %bin.rdx3, %rdx.shuf4
				%12 = extractelement <16 x i32> %bin.rdx4, i32 0
				ret i32 %12
				}

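				; Same reduction with 512-bit vectors required; the accumulator is expected to live in a single zmm register.
				define i32 @sad_16i8_512() "required-vector-width"="512" {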
				; CHECK-LABEL: sad_16i8_512:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: vpxor %xmm0, %xmm0, %xmm0
				; CHECK-NEXT: movq $-1024, %rax # imm = 0xFC00
				; CHECK-NEXT: .p2align 4, 0x90
				; CHECK-NEXT: .LBB11_1: # %vector.body
				; CHECK-NEXT: # =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vmovdqu a+1024(%rax), %xmm1
				; CHECK-NEXT: vpsadbw b+1024(%rax), %xmm1, %xmm1
				; CHECK-NEXT: vpaddd %zmm0, %zmm1, %zmm0
				; CHECK-NEXT: addq $4, %rax
				; CHECK-NEXT: jne .LBB11_1
				; CHECK-NEXT: # %bb.2: # %middle.block
				; CHECK-NEXT: vextracti64x4 $1, %zmm0, %ymm1
				; CHECK-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vextracti128 $1, %ymm0, %xmm1
				; CHECK-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[2,3,0,1]
				; CHECK-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vpshufd {{.*#+}} xmm1 = xmm0[1,1,2,3]
				; CHECK-NEXT: vpaddd %zmm1, %zmm0, %zmm0
				; CHECK-NEXT: vmovd %xmm0, %eax
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				entry:
				br label %vector.body

				vector.body:
				%index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
				%vec.phi = phi <16 x i32> [ zeroinitializer, %entry ], [ %10, %vector.body ]
				%0 = getelementptr inbounds [1024 x i8], [1024 x i8]* @a, i64 0, i64 %index
				%1 = bitcast i8* %0 to <16 x i8>*
				%wide.load = load <16 x i8>, <16 x i8>* %1, align 4
				%2 = zext <16 x i8> %wide.load to <16 x i32>
				%3 = getelementptr inbounds [1024 x i8], [1024 x i8]* @b, i64 0, i64 %index
				%4 = bitcast i8* %3 to <16 x i8>*
				%wide.load1 = load <16 x i8>, <16 x i8>* %4, align 4
				%5 = zext <16 x i8> %wide.load1 to <16 x i32>
				%6 = sub nsw <16 x i32> %2, %5
				%7 = icmp sgt <16 x i32> %6, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%8 = sub nsw <16 x i32> zeroinitializer, %6
				%9 = select <16 x i1> %7, <16 x i32> %6, <16 x i32> %8
				%10 = add nsw <16 x i32> %9, %vec.phi
				%index.next = add i64 %index, 4
				%11 = icmp eq i64 %index.next, 1024
				br i1 %11, label %middle.block, label %vector.body

				middle.block:
				%.lcssa = phi <16 x i32> [ %10, %vector.body ]
				%rdx.shuf = shufflevector <16 x i32> %.lcssa, <16 x i32> undef, <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx = add <16 x i32> %.lcssa, %rdx.shuf
				%rdx.shuf2 = shufflevector <16 x i32> %bin.rdx, <16 x i32> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx2 = add <16 x i32> %bin.rdx, %rdx.shuf2
				%rdx.shuf3 = shufflevector <16 x i32> %bin.rdx2, <16 x i32> undef, <16 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx3 = add <16 x i32> %bin.rdx2, %rdx.shuf3
				%rdx.shuf4 = shufflevector <16 x i32> %bin.rdx3, <16 x i32> undef, <16 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%bin.rdx4 = add <16 x i32> %bin.rdx3, %rdx.shuf4
				%12 = extractelement <16 x i32> %bin.rdx4, i32 0
				ret i32 %12
				}

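				; sitofp of a <16 x i1> compare result; at width 256 the work is expected to split into two ymm vpcmpgtd/vcvtdq2ps pairs.
				define <16 x float> @sbto16f32_256(<16 x i32> %a) "required-vector-width"="256" {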
				; CHECK-LABEL: sbto16f32_256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; CHECK-NEXT: vpcmpgtd %ymm0, %ymm2, %ymm0
				; CHECK-NEXT: vcvtdq2ps %ymm0, %ymm0
				; CHECK-NEXT: vpcmpgtd %ymm1, %ymm2, %ymm1
				; CHECK-NEXT: vcvtdq2ps %ymm1, %ymm1
				; CHECK-NEXT: retq
				%mask = icmp slt <16 x i32> %a, zeroinitializer
				%1 = sitofp <16 x i1> %mask to <16 x float>
				ret <16 x float> %1
				}

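				; At width 512 the same conversion is expected to go through a mask register (vpmovd2m/vpmovm2d) and a single zmm vcvtdq2ps.
				define <16 x float> @sbto16f32_512(<16 x i32> %a) "required-vector-width"="512" {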
				; CHECK-LABEL: sbto16f32_512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vpmovd2m %zmm0, %k0
				; CHECK-NEXT: vpmovm2d %k0, %zmm0
				; CHECK-NEXT: vcvtdq2ps %zmm0, %zmm0
				; CHECK-NEXT: retq
				%mask = icmp slt <16 x i32> %a, zeroinitializer
				%1 = sitofp <16 x i1> %mask to <16 x float>
				ret <16 x float> %1
				}

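				; sitofp of an fcmp mask to <16 x double>; at width 256 each compare is expected to run on a ymm quarter, with the masks merged by kshiftlb/korb.
				define <16 x double> @sbto16f64_256(<16 x double> %a) "required-vector-width"="256" {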
				; CHECK-LABEL: sbto16f64_256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vxorpd %xmm4, %xmm4, %xmm4
				; CHECK-NEXT: vcmpltpd %ymm2, %ymm4, %k0
				; CHECK-NEXT: vcmpltpd %ymm3, %ymm4, %k1
				; CHECK-NEXT: kshiftlb $4, %k1, %k1
				; CHECK-NEXT: korb %k1, %k0, %k0
				; CHECK-NEXT: vcmpltpd %ymm0, %ymm4, %k1
				; CHECK-NEXT: vcmpltpd %ymm1, %ymm4, %k2
				; CHECK-NEXT: kshiftlb $4, %k2, %k2
				; CHECK-NEXT: korb %k2, %k1, %k1
				; CHECK-NEXT: vpmovm2d %k1, %ymm1
				; CHECK-NEXT: vcvtdq2pd %xmm1, %ymm0
				; CHECK-NEXT: vextracti128 $1, %ymm1, %xmm1
				; CHECK-NEXT: vcvtdq2pd %xmm1, %ymm1
				; CHECK-NEXT: vpmovm2d %k0, %ymm3
				; CHECK-NEXT: vcvtdq2pd %xmm3, %ymm2
				; CHECK-NEXT: vextracti128 $1, %ymm3, %xmm3
				; CHECK-NEXT: vcvtdq2pd %xmm3, %ymm3
				; CHECK-NEXT: retq
				%cmpres = fcmp ogt <16 x double> %a, zeroinitializer
				%1 = sitofp <16 x i1> %cmpres to <16 x double>
				ret <16 x double> %1
				}

				define <16 x double> @sbto16f64_512(<16 x double> %a) "required-vector-width"="512" {
				; CHECK-LABEL: sbto16f64_512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vxorpd %xmm2, %xmm2, %xmm2
				; CHECK-NEXT: vcmpltpd %zmm0, %zmm2, %k0
				; CHECK-NEXT: vcmpltpd %zmm1, %zmm2, %k1
				; CHECK-NEXT: kunpckbw %k0, %k1, %k0
				; CHECK-NEXT: vpmovm2d %k0, %zmm1
				; CHECK-NEXT: vcvtdq2pd %ymm1, %zmm0
				; CHECK-NEXT: vextracti64x4 $1, %zmm1, %ymm1
				; CHECK-NEXT: vcvtdq2pd %ymm1, %zmm1
				; CHECK-NEXT: retq
				%cmpres = fcmp ogt <16 x double> %a, zeroinitializer
				%1 = sitofp <16 x i1> %cmpres to <16 x double>
				ret <16 x double> %1
				}

				define <16 x float> @ubto16f32_256(<16 x i32> %a) "required-vector-width"="256" {
				; CHECK-LABEL: ubto16f32_256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; CHECK-NEXT: vpcmpgtd %ymm0, %ymm2, %ymm0
				; CHECK-NEXT: vpbroadcastd {{.*#+}} ymm3 = [1,1,1,1,1,1,1,1]
				; CHECK-NEXT: vpand %ymm3, %ymm0, %ymm0
				; CHECK-NEXT: vpcmpgtd %ymm1, %ymm2, %ymm1
				; CHECK-NEXT: vpand %ymm3, %ymm1, %ymm1
				; CHECK-NEXT: retq
				%mask = icmp slt <16 x i32> %a, zeroinitializer
				%1 = uitofp <16 x i1> %mask to <16 x float>
				ret <16 x float> %1
				}

				define <16 x float> @ubto16f32_512(<16 x i32> %a) "required-vector-width"="512" {
				; CHECK-LABEL: ubto16f32_512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vpmovd2m %zmm0, %k0
				; CHECK-NEXT: vpmovm2d %k0, %zmm0
				; CHECK-NEXT: vpsrld $31, %zmm0, %zmm0
				; CHECK-NEXT: vcvtdq2ps %zmm0, %zmm0
				; CHECK-NEXT: retq
				%mask = icmp slt <16 x i32> %a, zeroinitializer
				%1 = uitofp <16 x i1> %mask to <16 x float>
				ret <16 x float> %1
				}

				define <16 x double> @ubto16f64_256(<16 x i32> %a) "required-vector-width"="256" {
				; CHECK-LABEL: ubto16f64_256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vpxor %xmm2, %xmm2, %xmm2
				; CHECK-NEXT: vpcmpgtd %ymm0, %ymm2, %ymm0
				; CHECK-NEXT: vpsrld $31, %ymm0, %ymm3
				; CHECK-NEXT: vcvtdq2pd %xmm3, %ymm0
				; CHECK-NEXT: vextracti128 $1, %ymm3, %xmm3
				; CHECK-NEXT: vcvtdq2pd %xmm3, %ymm4
				; CHECK-NEXT: vpcmpgtd %ymm1, %ymm2, %ymm1
				; CHECK-NEXT: vpsrld $31, %ymm1, %ymm1
				; CHECK-NEXT: vcvtdq2pd %xmm1, %ymm2
				; CHECK-NEXT: vextracti128 $1, %ymm1, %xmm1
				; CHECK-NEXT: vcvtdq2pd %xmm1, %ymm3
				; CHECK-NEXT: vmovaps %ymm4, %ymm1
				; CHECK-NEXT: retq
				%mask = icmp slt <16 x i32> %a, zeroinitializer
				%1 = uitofp <16 x i1> %mask to <16 x double>
				ret <16 x double> %1
				}

				define <16 x double> @ubto16f64_512(<16 x i32> %a) "required-vector-width"="512" {
				; CHECK-LABEL: ubto16f64_512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vpmovd2m %zmm0, %k0
				; CHECK-NEXT: vpmovm2d %k0, %zmm0
				; CHECK-NEXT: vpsrld $31, %zmm0, %zmm1
				; CHECK-NEXT: vcvtdq2pd %ymm1, %zmm0
				; CHECK-NEXT: vextracti64x4 $1, %zmm1, %ymm1
				; CHECK-NEXT: vcvtdq2pd %ymm1, %zmm1
				; CHECK-NEXT: retq
				%mask = icmp slt <16 x i32> %a, zeroinitializer
				%1 = uitofp <16 x i1> %mask to <16 x double>
				ret <16 x double> %1
				}

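				; fptoui to <16 x i1> feeding a zero-masked select; at width 256 the conversion is expected to split into two ymm halves, narrowed with vpmovdw to build one 16-bit mask.
				define <16 x i32> @test_16f32toub_256(<16 x float> %a, <16 x i32> %passthru) "required-vector-width"="256" {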
				; CHECK-LABEL: test_16f32toub_256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vcvttps2dq %ymm0, %ymm0
				; CHECK-NEXT: vpmovdw %ymm0, %xmm0
				; CHECK-NEXT: vcvttps2dq %ymm1, %ymm1
				; CHECK-NEXT: vpmovdw %ymm1, %xmm1
				; CHECK-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
				; CHECK-NEXT: vpsllw $15, %ymm0, %ymm0
				; CHECK-NEXT: vpmovw2m %ymm0, %k1
				; CHECK-NEXT: vmovdqa32 %ymm2, %ymm0 {%k1} {z}
				; CHECK-NEXT: kshiftrw $8, %k1, %k1
				; CHECK-NEXT: vmovdqa32 %ymm3, %ymm1 {%k1} {z}
				; CHECK-NEXT: retq
				%mask = fptoui <16 x float> %a to <16 x i1>
				%select = select <16 x i1> %mask, <16 x i32> %passthru, <16 x i32> zeroinitializer
				ret <16 x i32> %select
				}

				define <16 x i32> @test_16f32toub_512(<16 x float> %a, <16 x i32> %passthru) "required-vector-width"="512" {
				; CHECK-LABEL: test_16f32toub_512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vcvttps2dq %zmm0, %zmm0
				; CHECK-NEXT: vpslld $31, %zmm0, %zmm0
				; CHECK-NEXT: vptestmd %zmm0, %zmm0, %k1
				; CHECK-NEXT: vmovdqa32 %zmm1, %zmm0 {%k1} {z}
				; CHECK-NEXT: retq
				%mask = fptoui <16 x float> %a to <16 x i1>
				%select = select <16 x i1> %mask, <16 x i32> %passthru, <16 x i32> zeroinitializer
				ret <16 x i32> %select
				}

				define <16 x i32> @test_16f32tosb_256(<16 x float> %a, <16 x i32> %passthru) "required-vector-width"="256" {
				; CHECK-LABEL: test_16f32tosb_256:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vcvttps2dq %ymm0, %ymm0
				; CHECK-NEXT: vpmovdw %ymm0, %xmm0
				; CHECK-NEXT: vcvttps2dq %ymm1, %ymm1
				; CHECK-NEXT: vpmovdw %ymm1, %xmm1
				; CHECK-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
				; CHECK-NEXT: vpsllw $15, %ymm0, %ymm0
				; CHECK-NEXT: vpmovw2m %ymm0, %k1
				; CHECK-NEXT: vmovdqa32 %ymm2, %ymm0 {%k1} {z}
				; CHECK-NEXT: kshiftrw $8, %k1, %k1
				; CHECK-NEXT: vmovdqa32 %ymm3, %ymm1 {%k1} {z}
				; CHECK-NEXT: retq
				%mask = fptosi <16 x float> %a to <16 x i1>
				%select = select <16 x i1> %mask, <16 x i32> %passthru, <16 x i32> zeroinitializer
				ret <16 x i32> %select
				}

				define <16 x i32> @test_16f32tosb_512(<16 x float> %a, <16 x i32> %passthru) "required-vector-width"="512" {
				; CHECK-LABEL: test_16f32tosb_512:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vcvttps2dq %zmm0, %zmm0
				; CHECK-NEXT: vptestmd %zmm0, %zmm0, %k1
				; CHECK-NEXT: vmovdqa32 %zmm1, %zmm0 {%k1} {z}
				; CHECK-NEXT: retq
				%mask = fptosi <16 x float> %a to <16 x i1>
				%select = select <16 x i1> %mask, <16 x i32> %passthru, <16 x i32> zeroinitializer
				ret <16 x i32> %select
				}