This should help in lowering the following four intrinsics:
_mm256_cvtepi32_epi8
_mm256_cvtepi64_epi16
_mm256_cvtepi64_epi8
_mm512_cvtepi64_epi8
Can you precommit the test cases?
Does this handle the more intuitive case of shuffling in 0 elements without the bitcast to <2 x i64>?
What about this? It seems the most obvious representation:
```llvm
define <8 x i16> @trunc_v4i64_to_v4i16_return_v8i16(<4 x i64> %vec) nounwind {
  %truncated = trunc <4 x i64> %vec to <4 x i16>
  %result = shufflevector <4 x i16> %truncated, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i16> %result
}
```
Added some more tests.
We have four test functions for vpmovdb, and four test functions for vpmovdw.
None of these were detected before this patch.
This indirect approach is still not detected, but I think it shouldn't be too hard:
```llvm
; truncate to v8i16 in the first operation, but this really is a vpmovdb
%truncated = trunc <8 x i32> %vec to <8 x i16>
%bc = bitcast <8 x i16> %truncated to <16 x i8>
%result = shufflevector <16 x i8> %bc, <16 x i8> zeroinitializer, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
```
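For readers tracing the lane arithmetic, here is a scalar model (my illustration, not code from the patch; the function name is made up) of what that trunc + bitcast + shuffle computes, assuming little-endian lane layout: lanes 0..7 receive the low byte of each truncated i16 (the vpmovdb-like half), and lanes 8..15 the high byte.

```cpp
#include <cstdint>

// Scalar model of the indirect trunc + bitcast + shuffle pattern above.
// Each i32 is truncated to i16; the byte shuffle then gathers the low
// byte of every i16 into lanes 0..7 (the vpmovdb-style part) and the
// high byte of every i16 into lanes 8..15.
void indirect_trunc_model(const int32_t vec[8], int8_t out[16]) {
  for (int i = 0; i < 8; ++i) {
    uint16_t t = (uint16_t)vec[i];   // trunc i32 -> i16
    out[i] = (int8_t)(t & 0xff);     // even byte indices: low bytes
    out[i + 8] = (int8_t)(t >> 8);   // odd byte indices: high bytes
  }
}
```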
I haven't tried, but you might be able to do

```c
__builtin_shufflevector(__builtin_convertvector((v4di)A, v4qi),
                        (v4qi){0, 0, 0, 0}, 0, 1, 2, 3, 4, 5, 6, 7);
```
What about _mm512_cvtepi64_epi8?
There's also _mm256_cvtepi64_epi8, which produces a 32-bit result. And all the ones that start as 128 bits.
Commit the tests with the current codegen so the patch diff shows the improvement?
lib/Target/X86/X86ISelLowering.cpp |
---|---
9391 | clang-format all of this
9395 | Clean up the for-loop condition to avoid so much re-evaluation and so many size_t casts:
9413 | Those for-loop braces can be removed?
I think some more work needs to be done here, to also match the masked versions of these, e.g.
_mm256_mask_cvtepi64_epi16 -> vpmovqw %ymm0, %xmm0 {%k1}
lib/Target/X86/X86ISelLowering.cpp |
---|---
9407 | Why can't you permit UNDEFs in the shuffle mask?
test/CodeGen/X86/shuffle-vs-trunc-512.ll |
---|---
946 | This is an AVX512F instruction we're trying to handle here and we are failing to remove the shuffle with AVX512F.
lib/Target/X86/X86ISelLowering.cpp |
---|---
9407 | I do permit undef here. The condition of the branch is what I don't permit.
lib/Target/X86/X86ISelLowering.cpp |
---|---
9401 | If you tweak isSequentialOrUndefInRange to take an increment argument (default = 1) then you could remove this loop.
9407 | Then why don't you use isUndefOrZeroOrInRange()?

```cpp
if (!isUndefOrZeroOrInRange(Mask.slice(Split, Size), TruncatedVectorStart,
                            TruncatedVectorStart + Size))
  return false;
```
lib/Target/X86/X86ISelLowering.cpp |
---|---
9407 | That would check for the wrong elements, wouldn't it?

```cpp
OtherVectorStart = SwappedOps ? 0 : Size;
if (!isUndefOrZeroOrInRange(Mask.slice(Split, Size), OtherVectorStart,
                            OtherVectorStart + Size))
  return false;
```
lib/Target/X86/X86ISelLowering.cpp |
---|---
9407 | Then I would rather make a new function:

```cpp
if (isAnyInRange(Mask.slice(Split, Size), TruncatedVectorStart,
                 TruncatedVectorStart + Size))
  return false;
```
Taking over at @GBuella's request. The patterns that are currently implemented will be finished, but I don't have much hope for the masked versions. Since the mask is only good for the lower elements and the upper elements must be zeroed out, lowering the masked versions of these intrinsics would require not simple selects (see PR34877) but patterns like
```llvm
define <8 x i16> @trunc_v4i64_to_v4i16_return_v2i64_1(<4 x i64> %vec, i8 zeroext %k, <2 x i64> %dest) nounwind {
  %truncated = trunc <4 x i64> %vec to <4 x i16>
  %dst = bitcast <2 x i64> %dest to <8 x i16>
  %dst_select = shufflevector <8 x i16> %dst, <8 x i16> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
  %mask = trunc i8 %k to i4
  %mask_vec = bitcast i4 %mask to <4 x i1>
  %res = select <4 x i1> %mask_vec, <4 x i16> %truncated, <4 x i16> %dst_select
  %result = shufflevector <4 x i16> %res, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x i16> %result
}
```
or
```llvm
define <8 x i16> @trunc_v4i64_to_v4i16_return_v2i64_2(<4 x i64> %vec, i8 zeroext %k, <2 x i64> %dest) nounwind {
  %truncated = trunc <4 x i64> %vec to <4 x i16>
  %res = shufflevector <4 x i16> %truncated, <4 x i16> zeroinitializer, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  %mask = xor i8 %k, -1
  %mask1 = and i8 %mask, 15
  %mask_vec = bitcast i8 %mask1 to <8 x i1>
  %dst = bitcast <2 x i64> %dest to <8 x i16>
  %result = select <8 x i1> %mask_vec, <8 x i16> %dst, <8 x i16> %res
  ret <8 x i16> %result
}
```
and I feel that both of these are too complex.
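For what it's worth, the lane semantics both IR variants encode can be written down as a scalar reference model (my illustration under my reading of the IR above; the function name is made up): mask bits select per lane between the truncated source and the pass-through destination, and the upper half of the 128-bit result is unconditionally zeroed.

```cpp
#include <cstdint>

// Illustrative scalar model of the masked trunc-and-zero pattern above:
// lanes 0..3 pick between the truncated source and the destination under
// mask k; lanes 4..7 of the 128-bit result are always zero.
void masked_trunc_q_to_w(const int64_t vec[4], uint8_t k,
                         const int16_t dest[8], int16_t result[8]) {
  for (int i = 0; i < 4; ++i)
    result[i] = ((k >> i) & 1) ? (int16_t)vec[i] : dest[i];
  for (int i = 4; i < 8; ++i)
    result[i] = 0; // the zeroed upper half is what a plain select can't express
}
```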
Changed isSequentialOrUndefInRange to take an increment argument and added isAnyInRange.
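Roughly, the two helpers might look like this (a sketch with assumed signatures; the real LLVM versions operate on ArrayRef<int> and use LLVM's shuffle-mask sentinels):

```cpp
#include <vector>

const int SM_SentinelUndef = -1; // mirrors LLVM's undef mask sentinel

// Sketch: true if every element of Mask[Pos, Pos+Size) is undef or equals
// Low, Low+Step, Low+2*Step, ... -- the Step argument (default 1) is the
// tweak discussed above.
bool isSequentialOrUndefInRange(const std::vector<int> &Mask, unsigned Pos,
                                unsigned Size, int Low, int Step = 1) {
  for (unsigned i = Pos, e = Pos + Size; i != e; ++i, Low += Step)
    if (Mask[i] != SM_SentinelUndef && Mask[i] != Low)
      return false;
  return true;
}

// Sketch of the new helper: true if any mask element falls inside [Low, Hi).
bool isAnyInRange(const std::vector<int> &Mask, int Low, int Hi) {
  for (int M : Mask)
    if (M >= Low && M < Hi)
      return true;
  return false;
}
```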
llvm/lib/Target/X86/X86ISelLowering.cpp |
---|---
9406 ↗ (On Diff #149730) | We've usually called these functions something like 'matchVectorShuffleAsVPMOV' or 'matchVectorShuffleAsVTRUNC'
9407 ↗ (On Diff #149730) | int Delta
9417 ↗ (On Diff #149730) | I still don't get why you don't want to just use isUndefOrInRange with the 'safe' vector range.
9473 ↗ (On Diff #149730) | Save yourself some typing ;-) `SDValue Src = V1.getOperand(0).getOperand(0); MVT SrcVT = Src.getSimpleValueType();`
9483 ↗ (On Diff #149730) | Why is it just 16i8 and not 32i8 as well for _mm512_cvtepi16_epi8?
9487 ↗ (On Diff #149730) | Couldn't this be at the top for an early-out?
9492 ↗ (On Diff #149730) | Is this right? I'd expect it to check for the ! case.
llvm/lib/Target/X86/X86ISelLowering.cpp |
---|---
9407 ↗ (On Diff #149730) | Or Stride
9483 ↗ (On Diff #149730) | This part is only about truncations where the result must be filled with extra zeros, due to the (narrower than 128 bits) result being in an xmm register. The check here is about _mm_cvtepi16_epi8 (which requires avx512vl & avx512bw). It truncates from v8i16 -> v8i8, but the vpmovwb instruction actually sets a whole xmm register, so the actual result is going to be v16i8, with the other 8 bytes set to zero. OK, perhaps these details should be explained in comments around here.
9492 ↗ (On Diff #149730) | Yes, originally it was:

```cpp
if (!maskContainsSequenceForVPMOV(Mask, SwappedOps, 2) &&
    !maskContainsSequenceForVPMOV(Mask, SwappedOps, 4))
  return SDValue();
```

The new form of this conjunction doesn't seem to make much sense at first sight; what happened?
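The zero-filling behaviour described in the 9483 comment can be modelled in scalar code (an illustration only; the function name is made up): truncating v8i16 to v8i8 with vpmovwb still writes a full 16-byte register, so the observable value is v16i8 with the top 8 bytes cleared.

```cpp
#include <cstdint>

// Scalar model of the vpmovwb result discussed above: eight truncated
// bytes, then eight explicit zeros filling out the xmm register.
void trunc_w_to_b_zero_upper(const int16_t src[8], int8_t out[16]) {
  for (int i = 0; i < 8; ++i)
    out[i] = (int8_t)src[i]; // trunc i16 -> i8
  for (int i = 8; i < 16; ++i)
    out[i] = 0; // upper half of the 128-bit register is zeroed
}
```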
Fixed the error in the final check (it was from badly undone edits around there). Moved the early-exit check. Expanded the comment on the AVX512BW check for clarity. Some names changed per comments.
llvm/lib/Target/X86/X86ISelLowering.cpp |
---|---
9417 ↗ (On Diff #149730) | Due to SwappedOps, the 'safe' range may be before or after the 'unsafe' one, so in this case catching unsafe values is tidier. To use isUndefOrInRange we'd first have to define whether we accept elements in [Size, 2*Size) or in [0, Size). If you keep insisting, I may do that, but at the moment I don't see much benefit in that.
llvm/lib/Target/X86/X86ISelLowering.cpp |
---|---
9481 ↗ (On Diff #150717) | SrcVT.is512BitVector()
9483 ↗ (On Diff #149730) | Shouldn't it handle this case? https://godbolt.org/g/Yxw7nE
llvm/lib/Target/X86/X86ISelLowering.cpp |
---|---
9483 ↗ (On Diff #149730) | Probably that could also be implemented here; we just didn't think about it so far. But if it is not a lot of extra work, the shufflevector equivalents of those convertvector ones could be detected here.
llvm/lib/Target/X86/X86ISelLowering.cpp |
---|---
9483 ↗ (On Diff #149730) | The example doesn't contain explicit truncations, so it isn't handled by this particular function anyway, and there's no need to check for features for it here. Correct me if I'm wrong, but I don't think the aim of this patch is to catch every possible VPMOV pattern, just as many as is convenient for subsequent intrinsic lowering. In the cases of 128-bit-or-wider outputs, where there's no need to zero out the upper parts of xmm registers, a pattern more complex than (in this case) `%res = trunc <32 x i16> %v to <32 x i8>` is just a needless complication.
llvm/lib/Target/X86/X86ISelLowering.cpp |
---|---
4867 ↗ (On Diff #150895) | Fix comment - (Low, Low+Step*Size] ?