This is an archive of the discontinued LLVM Phabricator instance.

Differential D134477

[X86] Lower vector interleave into unpck and perm
Closed · Public

Authored by zhuhan0 on Sep 22 2022, 1:53 PM.

Details

Reviewers

nadav
MatzeB
wenlei
RKSimon
craig.topper
spatel

Commits

rGd0d48a91f8bc: [X86] Lower vector interleave into unpck and perm

Summary

This Godbolt link shows different codegen between clang and gcc for a transpose operation.

clang result:

vmovdqu xmm0, xmmword ptr [rcx + rax]
vmovdqu xmm1, xmmword ptr [rcx + rax + 16]
vmovdqu xmm2, xmmword ptr [r8 + rax]
vmovdqu xmm3, xmmword ptr [r8 + rax + 16]
vpunpckhbw      xmm4, xmm2, xmm0
vpunpcklbw      xmm0, xmm2, xmm0
vpunpcklbw      xmm2, xmm3, xmm1
vpunpckhbw      xmm1, xmm3, xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 48], xmm1
vmovdqu xmmword ptr [rdi + 2*rax + 32], xmm2
vmovdqu xmmword ptr [rdi + 2*rax], xmm0
vmovdqu xmmword ptr [rdi + 2*rax + 16], xmm4

gcc result:

vmovdqu ymm3, YMMWORD PTR [rdi+rax]
vpunpcklbw      ymm1, ymm3, YMMWORD PTR [rsi+rax]
vpunpckhbw      ymm0, ymm3, YMMWORD PTR [rsi+rax]
vperm2i128      ymm2, ymm1, ymm0, 32
vperm2i128      ymm1, ymm1, ymm0, 49
vmovdqu YMMWORD PTR [rcx+rax*2], ymm2
vmovdqu YMMWORD PTR [rcx+32+rax*2], ymm1

clang's code is roughly 15% slower than gcc's when evaluated on an internal compression benchmark.
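For orientation, the assembly above is consistent with a byte-interleaving loop of roughly the following shape. This is a minimal sketch only: the function name, signature, and operand order are assumptions made for illustration, and the actual source behind the Godbolt link is not reproduced on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar loop (not taken from the benchmark): interleave the
// bytes of `a` and `b` into `out`, which must have room for 2 * n bytes.
void interleaveBytes(const std::uint8_t *a, const std::uint8_t *b,
                     std::uint8_t *out, std::size_t n) {
  assert(n == 0 || (a != nullptr && b != nullptr && out != nullptr));
  for (std::size_t i = 0; i < n; ++i) {
    out[2 * i] = a[i];     // even output positions come from a
    out[2 * i + 1] = b[i]; // odd output positions come from b
  }
}
```

When a loop like this is vectorized with 32-byte inputs, each iteration produces 64 interleaved bytes, which is the case discussed in the rest of the summary.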

The loop vectorizer generates the following shufflevector instruction:

%interleaved.vec = shufflevector <32 x i8> %a, <32 x i8> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>

which is lowered to SelectionDAG:

t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t6: v64i8 = concat_vectors t2, undef:v32i8
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t7: v64i8 = concat_vectors t4, undef:v32i8
t8: v64i8 = vector_shuffle<0,64,1,65,2,66,3,67,4,68,5,69,6,70,7,71,8,72,9,73,10,74,11,75,12,76,13,77,14,78,15,79,16,80,17,81,18,82,19,83,20,84,21,85,22,86,23,87,24,88,25,89,26,90,27,91,28,92,29,93,30,94,31,95> t6, t7
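Both masks above follow the same stride-2 pattern and differ only in where the second operand's indices start (32 in the IR, 64 in the DAG, because each operand has been widened with an undef half). A small sketch that generates such a mask, using a hypothetical helper name that does not appear in LLVM:

```cpp
#include <cassert>
#include <vector>

// Build <0, secondBase, 1, secondBase + 1, ...> with `count` pairs.
// Indices below secondBase select from the first shuffle operand; indices
// from secondBase upwards select from the second operand.
std::vector<int> makeInterleaveMask(int count, int secondBase) {
  assert(count >= 0 && secondBase >= count);
  std::vector<int> mask;
  mask.reserve(2 * static_cast<std::size_t>(count));
  for (int i = 0; i < count; ++i) {
    mask.push_back(i);
    mask.push_back(secondBase + i);
  }
  return mask;
}
```

Under these assumptions, `makeInterleaveMask(32, 32)` corresponds to the IR mask and `makeInterleaveMask(32, 64)` to the mask of the v64i8 vector_shuffle.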

So far this vector_shuffle is good enough for us to pattern-match and transform, but further down the SelectionDAG pipeline it gets split into smaller shuffles. During dagcombine1, the shuffle is split by foldShuffleOfConcatUndefs:

  // shuffle (concat X, undef), (concat Y, undef), Mask -->
  // concat (shuffle X, Y, Mask0), (shuffle X, Y, Mask1)
t2: v32i8,ch = CopyFromReg t0, Register:v32i8 %0
t4: v32i8,ch = CopyFromReg t0, Register:v32i8 %1
t19: v32i8 = vector_shuffle<0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47> t2, t4
t15: ch,glue = CopyToReg t0, Register:v32i8 $ymm0, t19
t20: v32i8 = vector_shuffle<16,48,17,49,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63> t2, t4
t17: ch,glue = CopyToReg t15, Register:v32i8 $ymm1, t20, t15:1
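The effect of this split on the mask can be modelled as follows. This is a simplified sketch of the index remapping only, not the implementation of foldShuffleOfConcatUndefs (which checks further conditions); the helper name and types are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using Mask = std::vector<int>;

// Split a mask over (concat X, undef), (concat Y, undef) into two masks over
// X and Y. numElts is the element count N of X (and of Y); wideMask has 2N
// entries that index into the 4N elements of the two widened operands.
std::pair<Mask, Mask> splitConcatUndefMask(const Mask &wideMask, int numElts) {
  assert(numElts > 0 &&
         wideMask.size() == 2 * static_cast<std::size_t>(numElts));
  auto remap = [numElts](int idx) {
    if (idx >= 0 && idx < numElts)
      return idx;           // element of X keeps its index
    if (idx >= 2 * numElts && idx < 3 * numElts)
      return idx - numElts; // element of Y moves to [N, 2N)
    return -1;              // anything else refers to an undef half
  };
  Mask lo, hi;
  for (int i = 0; i < numElts; ++i) {
    lo.push_back(remap(wideMask[i]));
    hi.push_back(remap(wideMask[numElts + i]));
  }
  return {lo, hi};
}
```

Applied to the v64i8 mask with N = 32, the two results are the masks of t19 and t20 in the dump above.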

With foldShuffleOfConcatUndefs commented out, the vector is still split later by the type legalizer, which runs after dagcombine1, because v64i8 is not a legal type with AVX2 (64 * 8 = 512 bits, while a ymm register is 256 bits). There doesn't seem to be a good way to avoid this split, and lowering the vector_shuffle into unpck and perm during dagcombine1 is too early. Therefore, although somewhat inconvenient, we decided to pattern-match a pair of vector shuffles later in the SelectionDAG pipeline, as part of lowerV32I8Shuffle.

The code looks at the two operands of the first shuffle it encounters, iterates through the users of those operands, and tries to find two shuffles that are consecutive interleaves. Once the pattern is found, it lowers them into unpcks and perms. It returns the perm for the shuffle that is currently being lowered (so that ISel updates the DAG for that node) and replaces the other shuffle in place.
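To see why two unpcks followed by two lane permutes reproduce the interleave, the v32i8 case can be simulated on plain arrays. The sketch below models the per-lane behaviour of vpunpcklbw/vpunpckhbw and the non-zeroing form of vperm2i128 with immediates 0x20 and 0x31 (the same values gcc emits as 32 and 49). All names are illustrative and none of this is LLVM code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V32 = std::array<std::uint8_t, 32>; // models one 256-bit ymm of bytes

// Per-128-bit-lane byte unpack: the low (or high) 8 bytes of each lane of a
// and b are interleaved within that lane.
V32 unpackBytes(const V32 &a, const V32 &b, bool high) {
  V32 r{};
  for (std::size_t lane = 0; lane < 2; ++lane) {
    const std::size_t base = lane * 16;
    const std::size_t src = base + (high ? 8 : 0);
    for (std::size_t i = 0; i < 8; ++i) {
      r[base + 2 * i] = a[src + i];
      r[base + 2 * i + 1] = b[src + i];
    }
  }
  return r;
}

// 128-bit lane permute: imm bits [1:0] pick the result's low lane and bits
// [5:4] its high lane from {a.lo, a.hi, b.lo, b.hi}.
V32 perm2x128(const V32 &a, const V32 &b, unsigned imm) {
  assert((imm & 0x88u) == 0 && "zeroing bits are not modelled");
  V32 r{};
  const unsigned sel[2] = {imm & 3u, (imm >> 4) & 3u};
  for (std::size_t lane = 0; lane < 2; ++lane) {
    const V32 &src = (sel[lane] & 2u) ? b : a;
    const std::size_t srcBase = (sel[lane] & 1u) * 16;
    for (std::size_t i = 0; i < 16; ++i)
      r[lane * 16 + i] = src[srcBase + i];
  }
  return r;
}

// Checks that unpck + perm yields the two halves of interleave(a, b).
bool loweringMatchesInterleave() {
  V32 a{}, b{};
  for (std::size_t i = 0; i < 32; ++i) {
    a[i] = static_cast<std::uint8_t>(i);
    b[i] = static_cast<std::uint8_t>(100 + i);
  }
  const V32 ul = unpackBytes(a, b, false);
  const V32 uh = unpackBytes(a, b, true);
  const V32 perm1 = perm2x128(ul, uh, 0x20); // first half of the interleave
  const V32 perm2 = perm2x128(ul, uh, 0x31); // second half of the interleave
  for (std::size_t i = 0; i < 16; ++i) {
    if (perm1[2 * i] != a[i] || perm1[2 * i + 1] != b[i])
      return false;
    if (perm2[2 * i] != a[16 + i] || perm2[2 * i + 1] != b[16 + i])
      return false;
  }
  return true;
}
```

The permutes are needed because the unpck instructions work within each 128-bit lane, so the interleaved elements end up in the right order only after the lanes are regrouped.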

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

zhuhan0 created this revision. Sep 22 2022, 1:53 PM

Herald added a project: Restricted Project. Sep 22 2022, 1:53 PM

Herald added subscribers: hoy, wenlei, pengfei, hiraditya.

zhuhan0 requested review of this revision. Sep 22 2022, 1:53 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B188257: Diff 462286. Sep 22 2022, 1:53 PM

zhuhan0 edited the summary of this revision. Sep 22 2022, 2:52 PM

Herald added a subscriber: steven.zhang. Sep 22 2022, 2:52 PM

zhuhan0 edited the summary of this revision. Sep 22 2022, 2:58 PM

zhuhan0 added reviewers: nadav, MatzeB, wenlei, RKSimon, craig.topper, spatel.

Herald added a subscriber: StephenFan. Sep 22 2022, 2:58 PM

RKSimon added inline comments. Sep 24 2022, 6:06 AM

llvm/test/CodeGen/X86/vector-interleave.ll
176	Please can you pre-commit this and rebase so we see the codegen change?

RKSimon added inline comments. Sep 24 2022, 8:27 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
17814	I'd probably replace these hard coded numbers with more general NumElts / NumEltsPerLane variables.
17853	Why are the ReplaceAllUsesWith calls necessary? We usually just rely on combines / demandedelts to replace these.
18553	I'm still not certain if we're better off trying to perform this in lowering or via shuffle combining or maybe combineConcatVectorOps?

RKSimon retitled this revision from [isel] Lower vector interleave into unpck and perm to [X86] Lower vector interleave into unpck and perm. Sep 24 2022, 8:27 AM

zhuhan0 mentioned this in rG67a04edd4eb0: [X86] Pre-commit unit test for D134477. Sep 24 2022, 9:52 PM

Pre-commit test. Use NumElts instead of hard-coding.

zhuhan0 added inline comments. Sep 24 2022, 10:48 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
17853	Because in SelectionDAGLegalize::LegalizeOp only the currently visited node is replaced in the DAG, but here we're trying to transform both vector shuffle nodes, so I replaced the other (not currently visited) node on the fly. This does seem weird so I'm happy to change it if there's a better way. Is there an example of using "combines / demandedelts" to replace multiple nodes at the same time?
18553	We had the same question. The thing is, the original v64i8 vector_shuffle is split during generic dag combine, so it never gets to the X86 dag combiner. After the v64i8 shuffle is split, the v32i8 shuffles do get to the X86 dag combiner, and they go through the `combineShuffle` path. That does come earlier than the legalizer. But `combineShuffle` doesn't call `combineConcatVectorOps` though and I'm not sure if any existing combine there is a good place to add this change. I can try adding a new combine there if you think that's a better place.

Harbormaster completed remote builds in B188555: Diff 462696. Sep 24 2022, 11:23 PM

Hi @RKSimon, any thought on the new version? Thanks.

Friendly ping for a review of this.

LGTM. I'm not super familiar with vector code lowering, so let's wait some more days to give others a chance to chime in if necessary, will accept in a few days then.

MatzeB added inline comments. Oct 6 2022, 4:00 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
17853	I think this is normal code when dealing with patterns that have multiple "root nodes".

RKSimon added inline comments. Oct 10 2022, 6:18 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
17798	Have you investigated using this for other types?
17831	auto SVN1 auto SVN2

ShuffleVectorSDNode -> auto

zhuhan0 added inline comments. Oct 10 2022, 11:40 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
17798	Not yet. I'll submit another patch to generalize this. Let me first run some benchmarks to confirm that it's profitable.

Harbormaster completed remote builds in B191434: Diff 466709. Oct 11 2022, 1:08 AM

zhuhan0 mentioned this in rG20be96b19813: [X86] Pre-commit more unit tests for D134477. Oct 12 2022, 10:47 AM

Generalize to other 256-bit vector types; Limit change to AVX2 only.

I decided to put the changes in the same diff so that it's easier to review.
This change showed wins across all other 256-bit vector types as well, when
measured on an internal compression benchmark (basically a loop as shown in
https://godbolt.org/z/s17Kv1s9T). We saw 40.6% and 33.4% improvement on
v16i16 and v8i32 respectively. For v4i64 though, we see merely 0.9%
improvement as the current perm + blend codegen seems already very good. So
I don't think this change is worth it for the 64-bit types.

Harbormaster completed remote builds in B191841: Diff 467293. Oct 12 2022, 5:23 PM

RKSimon added inline comments. Oct 13 2022, 12:58 AM

llvm/test/CodeGen/X86/vector-interleave.ll
247	Wouldn't AVX1 benefit from the v8f32 case as well?

zhuhan0 added inline comments. Oct 13 2022, 12:05 PM

llvm/test/CodeGen/X86/vector-interleave.ll
247	Ah yes. I missed it.

Remove AVX2 check on v8f32.

zhuhan0 added inline comments. Oct 13 2022, 12:10 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
18585–18609	v8i32 is also covered now because of the cast to v8f32 here.

Harbormaster completed remote builds in B192024: Diff 467561. Oct 13 2022, 1:00 PM

RKSimon added inline comments. Oct 14 2022, 12:54 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
18171	Yeah, those AVX1 v8i32 regressions need addressing first - please can you add the hasAVX2 check back and a TODO saying this should be expanded to AVX1?

Add back AVX2 check for v8f32. Add TODO.

Harbormaster completed remote builds in B192250: Diff 467883. Oct 14 2022, 1:31 PM

LGTM - cheers

This revision is now accepted and ready to land. Oct 15 2022, 6:44 AM

In D134477#3860460, @RKSimon wrote:

LGTM - cheers

Thanks for reviewing!

Closed by commit rGd0d48a91f8bc: [X86] Lower vector interleave into unpck and perm (authored by zhuhan0). Oct 17 2022, 11:40 AM

This revision was automatically updated to reflect the committed changes.

zhuhan0 added a commit: rGd0d48a91f8bc: [X86] Lower vector interleave into unpck and perm.

Revision Contents

Path                                                              Size
llvm/lib/Target/X86/X86ISelLowering.cpp                           115 lines
llvm/test/CodeGen/X86/slow-pmulld.ll                              14 lines
llvm/test/CodeGen/X86/vector-interleave.ll                        133 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-2.ll    118 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-2.ll    145 lines
llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-2.ll     43 lines

Diff 467883

llvm/lib/Target/X86/X86ISelLowering.cpp

[17,769 lines not shown]

  SDValue Unpack = DAG.getVectorShuffle(MVT::v16i8, DL, V1, V2,
                                        { 0, 1, 2, 3, 16, 17, 18, 19,
                                          4, 5, 6, 7, 20, 21, 22, 23 });
  // Insert the unpckldq into a zero vector to widen to v32i8.
  return DAG.getNode(ISD::INSERT_SUBVECTOR, DL, MVT::v32i8,
                     DAG.getConstant(0, DL, MVT::v32i8), Unpack,
                     DAG.getIntPtrConstant(0, DL));
}

// a = shuffle v1, v2, mask1 ; interleaving lower lanes of v1 and v2
// b = shuffle v1, v2, mask2 ; interleaving higher lanes of v1 and v2
// =>
// ul = unpckl v1, v2
// uh = unpckh v1, v2
// a = vperm ul, uh
// b = vperm ul, uh
//
// Pattern-match interleave(256b v1, 256b v2) -> 512b v3 and lower it into unpck
// and permute. We cannot directly match v3 because it is split into two
// 256-bit vectors in earlier isel stages. Therefore, this function matches a
// pair of 256-bit shuffles and makes sure the masks are consecutive.
//
// Once unpck and permute nodes are created, the permute corresponding to this
// shuffle is returned, while the other permute replaces the other half of the
// shuffle in the selection dag.
static SDValue lowerShufflePairAsUNPCKAndPermute(const SDLoc &DL, MVT VT,
                                                 SDValue V1, SDValue V2,
                                                 ArrayRef<int> Mask,
                                                 SelectionDAG &DAG) {
  if (VT != MVT::v8f32 && VT != MVT::v8i32 && VT != MVT::v16i16 &&
      VT != MVT::v32i8)
    return SDValue();
  // <B0, B1, B0+1, B1+1, ..., >
  auto IsInterleavingPattern = [&](ArrayRef<int> Mask, unsigned Begin0,
                                   unsigned Begin1) {
    size_t Size = Mask.size();
    assert(Size % 2 == 0 && "Expected even mask size");
    for (unsigned I = 0; I < Size; I += 2) {
      if (Mask[I] != (int)(Begin0 + I / 2) ||
          Mask[I + 1] != (int)(Begin1 + I / 2))
        return false;
    }
    return true;
  };
  // Check which half is this shuffle node
  int NumElts = VT.getVectorNumElements();
  size_t FirstQtr = NumElts / 2;
  size_t ThirdQtr = NumElts + NumElts / 2;
  bool IsFirstHalf = IsInterleavingPattern(Mask, 0, NumElts);
  bool IsSecondHalf = IsInterleavingPattern(Mask, FirstQtr, ThirdQtr);
  if (!IsFirstHalf && !IsSecondHalf)
    return SDValue();

  // Find the intersection between shuffle users of V1 and V2.
  SmallVector<SDNode *, 2> Shuffles;
  for (SDNode *User : V1->uses())
    if (User->getOpcode() == ISD::VECTOR_SHUFFLE && User->getOperand(0) == V1 &&
        User->getOperand(1) == V2)
      Shuffles.push_back(User);
  // Limit user size to two for now.
  if (Shuffles.size() != 2)
    return SDValue();
  // Find out which half of the 512-bit shuffles is each smaller shuffle
  auto *SVN1 = cast<ShuffleVectorSDNode>(Shuffles[0]);
  auto *SVN2 = cast<ShuffleVectorSDNode>(Shuffles[1]);
  SDNode *FirstHalf;
  SDNode *SecondHalf;
  if (IsInterleavingPattern(SVN1->getMask(), 0, NumElts) &&
      IsInterleavingPattern(SVN2->getMask(), FirstQtr, ThirdQtr)) {
    FirstHalf = Shuffles[0];
    SecondHalf = Shuffles[1];
  } else if (IsInterleavingPattern(SVN1->getMask(), FirstQtr, ThirdQtr) &&
             IsInterleavingPattern(SVN2->getMask(), 0, NumElts)) {
    FirstHalf = Shuffles[1];
    SecondHalf = Shuffles[0];
  } else {
    return SDValue();
  }
  // Lower into unpck and perm. Return the perm of this shuffle and replace
  // the other.
  SDValue Unpckl = DAG.getNode(X86ISD::UNPCKL, DL, VT, V1, V2);
  SDValue Unpckh = DAG.getNode(X86ISD::UNPCKH, DL, VT, V1, V2);
  SDValue Perm1 = DAG.getNode(X86ISD::VPERM2X128, DL, VT, Unpckl, Unpckh,
                              DAG.getTargetConstant(0x20, DL, MVT::i8));
  SDValue Perm2 = DAG.getNode(X86ISD::VPERM2X128, DL, VT, Unpckl, Unpckh,
                              DAG.getTargetConstant(0x31, DL, MVT::i8));
  if (IsFirstHalf) {
    DAG.ReplaceAllUsesWith(SecondHalf, &Perm2);
    return Perm1;
  }
  DAG.ReplaceAllUsesWith(FirstHalf, &Perm1);
  return Perm2;
}

/// Handle lowering of 4-lane 64-bit floating point shuffles.
///
/// Also ends up handling lowering of 4-lane 64-bit integer shuffles when AVX2
/// isn't available.
static SDValue lowerV4F64Shuffle(const SDLoc &DL, ArrayRef<int> Mask,
                                 const APInt &Zeroable, SDValue V1, SDValue V2,
                                 const X86Subtarget &Subtarget,
[291 lines not shown] if (SDValue Result = lowerShuffleAsLanePermuteAndRepeatedMask(
    return Result;

  // If we have VLX support, we can use VEXPAND.
  if (Subtarget.hasVLX())
    if (SDValue V = lowerShuffleToEXPAND(DL, MVT::v8f32, Zeroable, Mask, V1, V2,
                                         DAG, Subtarget))
      return V;

  // Try to match an interleave of two v8f32s and lower them as unpck and
  // permutes using ymms. This needs to go before we try to split the vectors.
  //
  // TODO: Expand this to AVX1. Currently v8i32 is casted to v8f32 and hits
  // this path inadvertently.
  if (Subtarget.hasAVX2() && !Subtarget.hasAVX512())
    if (SDValue V = lowerShufflePairAsUNPCKAndPermute(DL, MVT::v8f32, V1, V2,
                                                      Mask, DAG))
      return V;

  // For non-AVX512 if the Mask is of 16bit elements in lane then try to split
  // since after split we get a more efficient code using vpunpcklwd and
  // vpunpckhwd instrs than vblend.
  if (!Subtarget.hasAVX512() && isUnpackWdShuffleMask(Mask, MVT::v8f32, DAG))
    return lowerShuffleAsSplitOrBlend(DL, MVT::v8f32, V1, V2, Mask, Subtarget,
                                      DAG);

  // If we have AVX2 then we always want to lower with a blend because at v8 we
[22 lines not shown] static SDValue lowerV8I32Shuffle(const SDLoc &DL, ArrayRef<int> Mask,

  // Whenever we can lower this as a zext, that instruction is strictly faster
  // than any alternative. It also allows us to fold memory operands into the
  // shuffle in many cases.
  if (SDValue ZExt = lowerShuffleAsZeroOrAnyExtend(DL, MVT::v8i32, V1, V2, Mask,
                                                   Zeroable, Subtarget, DAG))
    return ZExt;

  // Try to match an interleave of two v8i32s and lower them as unpck and
  // permutes using ymms. This needs to go before we try to split the vectors.
  if (!Subtarget.hasAVX512())
    if (SDValue V = lowerShufflePairAsUNPCKAndPermute(DL, MVT::v8i32, V1, V2,
                                                      Mask, DAG))
      return V;

  // For non-AVX512 if the Mask is of 16bit elements in lane then try to split
  // since after split we get a more efficient code than vblend by using
  // vpunpcklwd and vpunpckhwd instrs.
  if (isUnpackWdShuffleMask(Mask, MVT::v8i32, DAG) && !V2.isUndef() &&
      !Subtarget.hasAVX512())
    return lowerShuffleAsSplitOrBlend(DL, MVT::v8i32, V1, V2, Mask, Subtarget,
                                      DAG);

[189 lines not shown] if (SDValue Result = lowerShuffleAsLanePermuteAndRepeatedMask(
          DL, MVT::v16i16, V1, V2, Mask, Subtarget, DAG))
    return Result;

  // Try to permute the lanes and then use a per-lane permute.
  if (SDValue V = lowerShuffleAsLanePermuteAndPermute(
          DL, MVT::v16i16, V1, V2, Mask, DAG, Subtarget))
    return V;

  // Try to match an interleave of two v16i16s and lower them as unpck and
  // permutes using ymms.
  if (!Subtarget.hasAVX512())
    if (SDValue V = lowerShufflePairAsUNPCKAndPermute(DL, MVT::v16i16, V1, V2,
                                                      Mask, DAG))
      return V;

  // Otherwise fall back on generic lowering.
  return lowerShuffleAsSplitOrBlend(DL, MVT::v16i16, V1, V2, Mask,
                                    Subtarget, DAG);
}

/// Handle lowering of 32-lane 8-bit integer shuffles.
///
/// This routine is only called when we have AVX2 and thus a reasonable
[97 lines not shown] static SDValue lowerV32I8Shuffle(const SDLoc &DL, ArrayRef<int> Mask,
  // Look for {0, 8, 16, 24, 32, 40, 48, 56 } in the first 8 elements. Followed
  // by zeroable elements in the remaining 24 elements. Turn this into two
  // vmovqb instructions shuffled together.
  if (Subtarget.hasVLX())
    if (SDValue V = lowerShuffleAsVTRUNCAndUnpack(DL, MVT::v32i8, V1, V2,
                                                  Mask, Zeroable, DAG))
      return V;

  // Try to match an interleave of two v32i8s and lower them as unpck and
  // permutes using ymms.
  if (!Subtarget.hasAVX512())
    if (SDValue V = lowerShufflePairAsUNPCKAndPermute(DL, MVT::v32i8, V1, V2,
                                                      Mask, DAG))
      return V;

  // Otherwise fall back on generic lowering.
  return lowerShuffleAsSplitOrBlend(DL, MVT::v32i8, V1, V2, Mask,
                                    Subtarget, DAG);
}

/// High-level routine to lower various 256-bit x86 vector shuffles.
///
/// This routine either breaks down the specific type of a 256-bit x86 vector
[13 lines not shown] if (SDValue Insertion = lowerShuffleAsElementInsertion(
          DL, VT, V1, V2, Mask, Zeroable, Subtarget, DAG))
    return Insertion;

  // Handle special cases where the lower or upper half is UNDEF.
  if (SDValue V =
          lowerShuffleWithUndefHalf(DL, VT, V1, V2, Mask, Subtarget, DAG))
    return V;

  // There is a really nice hard cut-over between AVX1 and AVX2 that means we
  // can check for those subtargets here and avoid much of the subtarget
  // querying in the per-vector-type lowering routines. With AVX1 we have
  // essentially zero ability to manipulate a 256-bit vector with integer
  // types. Since we'll use floating point types there eventually, just
  // immediately cast everything to a float and operate entirely in that domain.
  if (VT.isInteger() && !Subtarget.hasAVX2()) {
    int ElementBits = VT.getScalarSizeInBits();
    if (ElementBits < 32) {
      // No floating point type available, if we can't use the bit operations
      // for masking/blending then decompose into 128-bit vectors.
      if (SDValue V = lowerShuffleAsBitMask(DL, VT, V1, V2, Mask, Zeroable,
                                            Subtarget, DAG))
        return V;
      if (SDValue V = lowerShuffleAsBitBlend(DL, VT, V1, V2, Mask, DAG))
        return V;
      return splitAndLowerShuffle(DL, VT, V1, V2, Mask, DAG);
    }

    MVT FpVT = MVT::getVectorVT(MVT::getFloatingPointVT(ElementBits),
                                VT.getVectorNumElements());
    V1 = DAG.getBitcast(FpVT, V1);
    V2 = DAG.getBitcast(FpVT, V2);
    return DAG.getBitcast(VT, DAG.getVectorShuffle(FpVT, DL, V1, V2, Mask));
  }

  if (VT == MVT::v16f16) {
    V1 = DAG.getBitcast(MVT::v16i16, V1);
    V2 = DAG.getBitcast(MVT::v16i16, V2);
    return DAG.getBitcast(MVT::v16f16,
                          DAG.getVectorShuffle(MVT::v16i16, DL, V1, V2, Mask));
  }

[32,705 lines not shown]
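The mask check at the start of lowerShufflePairAsUNPCKAndPermute can be tried outside of LLVM with a standalone version of the same logic. The sketch below is an illustration only: it uses std::vector in place of ArrayRef, and the `Half` enum and `classifyHalf` helper are hypothetical names that do not exist in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True if mask is <begin0, begin1, begin0 + 1, begin1 + 1, ...>.
bool isInterleavingPattern(const std::vector<int> &mask, unsigned begin0,
                           unsigned begin1) {
  assert(mask.size() % 2 == 0 && "Expected even mask size");
  for (std::size_t i = 0; i < mask.size(); i += 2) {
    if (mask[i] != static_cast<int>(begin0 + i / 2) ||
        mask[i + 1] != static_cast<int>(begin1 + i / 2))
      return false;
  }
  return true;
}

enum class Half { None, First, Second };

// Classify a shuffle mask over two NumElts-wide operands as the first or
// second half of a 2 * NumElts-wide interleave, mirroring the
// IsFirstHalf / IsSecondHalf tests in the patch.
Half classifyHalf(const std::vector<int> &mask) {
  const unsigned numElts = static_cast<unsigned>(mask.size());
  if (numElts == 0 || numElts % 2 != 0)
    return Half::None;
  if (isInterleavingPattern(mask, 0, numElts))
    return Half::First;
  if (isInterleavingPattern(mask, numElts / 2, numElts + numElts / 2))
    return Half::Second;
  return Half::None;
}
```

For an 8-element type, `<0,8,1,9,2,10,3,11>` is classified as the first half and `<4,12,5,13,6,14,7,15>` as the second; the lowering only fires when both halves are present among the users of V1 and V2.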

llvm/test/CodeGen/X86/slow-pmulld.ll

[486 lines not shown]
 ; SSE4-NEXT: pmulld %xmm1, %xmm3
 ; SSE4-NEXT: movdqa %xmm4, %xmm1
 ; SSE4-NEXT: ret{{[l|q]}}
 ;
 ; AVX2-SLOW-LABEL: test_mul_v16i32_v16i16:
 ; AVX2-SLOW: # %bb.0:
 ; AVX2-SLOW-NEXT: vmovdqa {{.*#+}} ymm1 = [18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778,18778]
 ; AVX2-SLOW-NEXT: vpmulhuw %ymm1, %ymm0, %ymm2
-; AVX2-SLOW-NEXT: vpmullw %ymm1, %ymm0, %ymm1
-; AVX2-SLOW-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
-; AVX2-SLOW-NEXT: vpunpcklwd {{.*#+}} xmm3 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
-; AVX2-SLOW-NEXT: vinserti128 $1, %xmm0, %ymm3, %ymm0
-; AVX2-SLOW-NEXT: vextracti128 $1, %ymm2, %xmm2
-; AVX2-SLOW-NEXT: vextracti128 $1, %ymm1, %xmm1
-; AVX2-SLOW-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm1[4],xmm2[4],xmm1[5],xmm2[5],xmm1[6],xmm2[6],xmm1[7],xmm2[7]
-; AVX2-SLOW-NEXT: vpunpcklwd {{.*#+}} xmm1 = xmm1[0],xmm2[0],xmm1[1],xmm2[1],xmm1[2],xmm2[2],xmm1[3],xmm2[3]
-; AVX2-SLOW-NEXT: vinserti128 $1, %xmm3, %ymm1, %ymm1
+; AVX2-SLOW-NEXT: vpmullw %ymm1, %ymm0, %ymm0
+; AVX2-SLOW-NEXT: vpunpckhwd {{.*#+}} ymm1 = ymm0[4],ymm2[4],ymm0[5],ymm2[5],ymm0[6],ymm2[6],ymm0[7],ymm2[7],ymm0[12],ymm2[12],ymm0[13],ymm2[13],ymm0[14],ymm2[14],ymm0[15],ymm2[15]
+; AVX2-SLOW-NEXT: vpunpcklwd {{.*#+}} ymm2 = ymm0[0],ymm2[0],ymm0[1],ymm2[1],ymm0[2],ymm2[2],ymm0[3],ymm2[3],ymm0[8],ymm2[8],ymm0[9],ymm2[9],ymm0[10],ymm2[10],ymm0[11],ymm2[11]
+; AVX2-SLOW-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm2[0,1],ymm1[0,1]
+; AVX2-SLOW-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm2[2,3],ymm1[2,3]
 ; AVX2-SLOW-NEXT: ret{{[l|q]}}
 ;
 ; AVX2-32-LABEL: test_mul_v16i32_v16i16:
 ; AVX2-32: # %bb.0:
 ; AVX2-32-NEXT: vextracti128 $1, %ymm0, %xmm1
 ; AVX2-32-NEXT: vpmovzxwd {{.*#+}} ymm1 = xmm1[0],zero,xmm1[1],zero,xmm1[2],zero,xmm1[3],zero,xmm1[4],zero,xmm1[5],zero,xmm1[6],zero,xmm1[7],zero
 ; AVX2-32-NEXT: vpmovzxwd {{.*#+}} ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero,xmm0[4],zero,xmm0[5],zero,xmm0[6],zero,xmm0[7],zero
 ; AVX2-32-NEXT: vpbroadcastd {{.*#+}} ymm2 = [18778,18778,18778,18778,18778,18778,18778,18778]
[497 lines not shown]

llvm/test/CodeGen/X86/vector-interleave.ll

	[... 85 lines not shown ...]
	; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm2, %ymm2			; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm2, %ymm2
	; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm4 = xmm3[4],xmm7[4],xmm3[5],xmm7[5],xmm3[6],xmm7[6],xmm3[7],xmm7[7]			; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm4 = xmm3[4],xmm7[4],xmm3[5],xmm7[5],xmm3[6],xmm7[6],xmm3[7],xmm7[7]
	; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm3 = xmm3[0],xmm7[0],xmm3[1],xmm7[1],xmm3[2],xmm7[2],xmm3[3],xmm7[3]			; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm3 = xmm3[0],xmm7[0],xmm3[1],xmm7[1],xmm3[2],xmm7[2],xmm3[3],xmm7[3]
	; AVX1-NEXT: vinsertf128 $1, %xmm4, %ymm3, %ymm3			; AVX1-NEXT: vinsertf128 $1, %xmm4, %ymm3, %ymm3
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: interleave8x8:			; AVX2-LABEL: interleave8x8:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm8 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]			; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm8 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm1 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3]
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm2 = xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
	; AVX2-NEXT: vpunpckhdq {{.*#+}} xmm3 = xmm0[2],xmm2[2],xmm0[3],xmm2[3]
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
	; AVX2-NEXT: vpunpckhdq {{.*#+}} xmm9 = xmm8[2],xmm1[2],xmm8[3],xmm1[3]
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm8[0],xmm1[0],xmm8[1],xmm1[1]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm1 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3]
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm4 = xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm5 = xmm7[0],xmm6[0],xmm7[1],xmm6[1],xmm7[2],xmm6[2],xmm7[3],xmm6[3]
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm7[4],xmm6[4],xmm7[5],xmm6[5],xmm7[6],xmm6[6],xmm7[7],xmm6[7]
	; AVX2-NEXT: vpunpckhdq {{.*#+}} xmm7 = xmm4[2],xmm6[2],xmm4[3],xmm6[3]
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm4 = xmm4[0],xmm6[0],xmm4[1],xmm6[1]
	; AVX2-NEXT: vpunpckhdq {{.*#+}} xmm6 = xmm1[2],xmm5[2],xmm1[3],xmm5[3]
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1]
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]			; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
	; AVX2-NEXT: vinserti128 $1, %xmm5, %ymm0, %ymm0			; AVX2-NEXT: vinserti128 $1, %xmm8, %ymm0, %ymm0
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm1 = xmm9[4],xmm6[4],xmm9[5],xmm6[5],xmm9[6],xmm6[6],xmm9[7],xmm6[7]			; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm1 = xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm5 = xmm9[0],xmm6[0],xmm9[1],xmm6[1],xmm9[2],xmm6[2],xmm9[3],xmm6[3]			; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3]
	; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm5, %ymm1			; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm2, %ymm1
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm2[4],xmm4[4],xmm2[5],xmm4[5],xmm2[6],xmm4[6],xmm2[7],xmm4[7]			; AVX2-NEXT: vpunpckhdq {{.*#+}} ymm2 = ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[6],ymm1[6],ymm0[7],ymm1[7]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm2[0],xmm4[0],xmm2[1],xmm4[1],xmm2[2],xmm4[2],xmm2[3],xmm4[3]			; AVX2-NEXT: vpunpckldq {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[4],ymm1[4],ymm0[5],ymm1[5]
	; AVX2-NEXT: vinserti128 $1, %xmm5, %ymm2, %ymm2			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm3 = ymm0[2,3],ymm2[2,3]
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm4 = xmm3[4],xmm7[4],xmm3[5],xmm7[5],xmm3[6],xmm7[6],xmm3[7],xmm7[7]			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm0[0,1],ymm2[0,1]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm3 = xmm3[0],xmm7[0],xmm3[1],xmm7[1],xmm3[2],xmm7[2],xmm3[3],xmm7[3]			; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm1 = xmm4[4],xmm5[4],xmm4[5],xmm5[5],xmm4[6],xmm5[6],xmm4[7],xmm5[7]
	; AVX2-NEXT: vinserti128 $1, %xmm4, %ymm3, %ymm3			; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm4[0],xmm5[0],xmm4[1],xmm5[1],xmm4[2],xmm5[2],xmm4[3],xmm5[3]
				; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm2, %ymm1
				; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm2 = xmm7[4],xmm6[4],xmm7[5],xmm6[5],xmm7[6],xmm6[6],xmm7[7],xmm6[7]
				; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm4 = xmm7[0],xmm6[0],xmm7[1],xmm6[1],xmm7[2],xmm6[2],xmm7[3],xmm6[3]
				; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm4, %ymm2
				; AVX2-NEXT: vpunpckhdq {{.*#+}} ymm4 = ymm1[2],ymm2[2],ymm1[3],ymm2[3],ymm1[6],ymm2[6],ymm1[7],ymm2[7]
				; AVX2-NEXT: vpunpckldq {{.*#+}} ymm1 = ymm1[0],ymm2[0],ymm1[1],ymm2[1],ymm1[4],ymm2[4],ymm1[5],ymm2[5]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm2 = ymm1[2,3],ymm4[2,3]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm1[0,1],ymm4[0,1]
				; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm4 = ymm0[4],ymm1[4],ymm0[5],ymm1[5],ymm0[6],ymm1[6],ymm0[7],ymm1[7],ymm0[12],ymm1[12],ymm0[13],ymm1[13],ymm0[14],ymm1[14],ymm0[15],ymm1[15]
				; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[8],ymm1[8],ymm0[9],ymm1[9],ymm0[10],ymm1[10],ymm0[11],ymm1[11]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm1[0,1],ymm4[0,1]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm1[2,3],ymm4[2,3]
				; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm4 = ymm3[4],ymm2[4],ymm3[5],ymm2[5],ymm3[6],ymm2[6],ymm3[7],ymm2[7],ymm3[12],ymm2[12],ymm3[13],ymm2[13],ymm3[14],ymm2[14],ymm3[15],ymm2[15]
				; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm3 = ymm3[0],ymm2[0],ymm3[1],ymm2[1],ymm3[2],ymm2[2],ymm3[3],ymm2[3],ymm3[8],ymm2[8],ymm3[9],ymm2[9],ymm3[10],ymm2[10],ymm3[11],ymm2[11]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm2 = ymm3[0,1],ymm4[0,1]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm3 = ymm3[2,3],ymm4[2,3]
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%ab = shufflevector <8 x i16> %a, <8 x i16> %b, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>			%ab = shufflevector <8 x i16> %a, <8 x i16> %b, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
	%cd = shufflevector <8 x i16> %c, <8 x i16> %d, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>			%cd = shufflevector <8 x i16> %c, <8 x i16> %d, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
	%ab32 = bitcast <16 x i16> %ab to <8 x i32>			%ab32 = bitcast <16 x i16> %ab to <8 x i32>
	%cd32 = bitcast <16 x i16> %cd to <8 x i32>			%cd32 = bitcast <16 x i16> %cd to <8 x i32>
	%abcd32 = shufflevector <8 x i32> %ab32, <8 x i32> %cd32, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>			%abcd32 = shufflevector <8 x i32> %ab32, <8 x i32> %cd32, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
	%abcd = bitcast <16 x i32> %abcd32 to <32 x i16>			%abcd = bitcast <16 x i32> %abcd32 to <32 x i16>

	[... 38 lines not shown ...]
	; AVX2-NEXT: vblendpd {{.*#+}} ymm2 = ymm3[0],ymm2[1],ymm3[2],ymm2[3]			; AVX2-NEXT: vblendpd {{.*#+}} ymm2 = ymm3[0],ymm2[1],ymm3[2],ymm2[3]
	; AVX2-NEXT: vpermpd {{.*#+}} ymm1 = ymm1[2,1,2,3]			; AVX2-NEXT: vpermpd {{.*#+}} ymm1 = ymm1[2,1,2,3]
	; AVX2-NEXT: vpermpd {{.*#+}} ymm0 = ymm0[2,1,2,3]			; AVX2-NEXT: vpermpd {{.*#+}} ymm0 = ymm0[2,1,2,3]
	; AVX2-NEXT: vshufpd {{.*#+}} ymm1 = ymm0[0],ymm1[0],ymm0[3],ymm1[3]			; AVX2-NEXT: vshufpd {{.*#+}} ymm1 = ymm0[0],ymm1[0],ymm0[3],ymm1[3]
	; AVX2-NEXT: vmovapd %ymm2, %ymm0			; AVX2-NEXT: vmovapd %ymm2, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%result = shufflevector <4 x double> %a, <4 x double> %b, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>			%result = shufflevector <4 x double> %a, <4 x double> %b, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
	ret <8 x double> %result			ret <8 x double> %result
	}			}
				RKSimon: Please can you pre-commit this and rebase so we see the codegen change?

	define <8 x i64> @interleave2x4i64(<4 x i64> %a, <4 x i64> %b) {			define <8 x i64> @interleave2x4i64(<4 x i64> %a, <4 x i64> %b) {
	; SSE-LABEL: interleave2x4i64:			; SSE-LABEL: interleave2x4i64:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	; SSE-NEXT: movaps %xmm1, %xmm4			; SSE-NEXT: movaps %xmm1, %xmm4
	; SSE-NEXT: movaps %xmm0, %xmm1			; SSE-NEXT: movaps %xmm0, %xmm1
	; SSE-NEXT: movlhps {{.*#+}} xmm0 = xmm0[0],xmm2[0]			; SSE-NEXT: movlhps {{.*#+}} xmm0 = xmm0[0],xmm2[0]
	; SSE-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm2[1]			; SSE-NEXT: unpckhpd {{.*#+}} xmm1 = xmm1[1],xmm2[1]
	[... 51 lines not shown ...]
	; AVX1-NEXT: vunpckhps {{.*#+}} xmm3 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]			; AVX1-NEXT: vunpckhps {{.*#+}} xmm3 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
	; AVX1-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]			; AVX1-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1			; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1
	; AVX1-NEXT: vmovaps %ymm2, %ymm0			; AVX1-NEXT: vmovaps %ymm2, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: interleave2x8f32:			; AVX2-LABEL: interleave2x8f32:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm2 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]			; AVX2-NEXT: vunpckhps {{.*#+}} ymm2 = ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[6],ymm1[6],ymm0[7],ymm1[7]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm3 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]			; AVX2-NEXT: vunpcklps {{.*#+}} ymm1 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[4],ymm1[4],ymm0[5],ymm1[5]
	; AVX2-NEXT: vinsertf128 $1, %xmm2, %ymm3, %ymm2			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm1[0,1],ymm2[0,1]
	; AVX2-NEXT: vextractf128 $1, %ymm1, %xmm1			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm1[2,3],ymm2[2,3]
				RKSimon: Wouldn't AVX1 benefit from the v8f32 case as well?
				zhuhan0 (author): Ah yes. I missed it.
	; AVX2-NEXT: vextractf128 $1, %ymm0, %xmm0
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm3 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; AVX2-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1
	; AVX2-NEXT: vmovaps %ymm2, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%result = shufflevector <8 x float> %a, <8 x float> %b, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>			%result = shufflevector <8 x float> %a, <8 x float> %b, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
	ret <16 x float> %result			ret <16 x float> %result
	}			}

	define <16 x i32> @interleave2x8i32(<8 x i32> %a, <8 x i32> %b) {			define <16 x i32> @interleave2x8i32(<8 x i32> %a, <8 x i32> %b) {
	; SSE-LABEL: interleave2x8i32:			; SSE-LABEL: interleave2x8i32:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	[... 17 lines not shown ...]
	; AVX1-NEXT: vunpckhps {{.*#+}} xmm3 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]			; AVX1-NEXT: vunpckhps {{.*#+}} xmm3 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
	; AVX1-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]			; AVX1-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1			; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1
	; AVX1-NEXT: vmovaps %ymm2, %ymm0			; AVX1-NEXT: vmovaps %ymm2, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: interleave2x8i32:			; AVX2-LABEL: interleave2x8i32:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm2 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]			; AVX2-NEXT: vunpckhps {{.*#+}} ymm2 = ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[6],ymm1[6],ymm0[7],ymm1[7]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm3 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]			; AVX2-NEXT: vunpcklps {{.*#+}} ymm1 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[4],ymm1[4],ymm0[5],ymm1[5]
	; AVX2-NEXT: vinsertf128 $1, %xmm2, %ymm3, %ymm2			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm1[0,1],ymm2[0,1]
	; AVX2-NEXT: vextractf128 $1, %ymm1, %xmm1			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm1[2,3],ymm2[2,3]
	; AVX2-NEXT: vextractf128 $1, %ymm0, %xmm0
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm3 = xmm0[2],xmm1[2],xmm0[3],xmm1[3]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; AVX2-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1
	; AVX2-NEXT: vmovaps %ymm2, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%result = shufflevector <8 x i32> %a, <8 x i32> %b, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>			%result = shufflevector <8 x i32> %a, <8 x i32> %b, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
	ret <16 x i32> %result			ret <16 x i32> %result
	}			}

	define <32 x i16> @interleave2x16i16(<16 x i16> %a, <16 x i16> %b) {			define <32 x i16> @interleave2x16i16(<16 x i16> %a, <16 x i16> %b) {
	; SSE-LABEL: interleave2x16i16:			; SSE-LABEL: interleave2x16i16:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	[... 17 lines not shown ...]
	; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]			; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
	; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]			; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
	; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1			; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1
	; AVX1-NEXT: vmovaps %ymm2, %ymm0			; AVX1-NEXT: vmovaps %ymm2, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: interleave2x16i16:			; AVX2-LABEL: interleave2x16i16:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm2 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]			; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm2 = ymm0[4],ymm1[4],ymm0[5],ymm1[5],ymm0[6],ymm1[6],ymm0[7],ymm1[7],ymm0[12],ymm1[12],ymm0[13],ymm1[13],ymm0[14],ymm1[14],ymm0[15],ymm1[15]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm3 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]			; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[8],ymm1[8],ymm0[9],ymm1[9],ymm0[10],ymm1[10],ymm0[11],ymm1[11]
	; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm3, %ymm2			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm1[0,1],ymm2[0,1]
	; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm1			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm1[2,3],ymm2[2,3]
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm0
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3]
	; AVX2-NEXT: vinserti128 $1, %xmm3, %ymm0, %ymm1
	; AVX2-NEXT: vmovdqa %ymm2, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%result = shufflevector <16 x i16> %a, <16 x i16> %b, <32 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>			%result = shufflevector <16 x i16> %a, <16 x i16> %b, <32 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
	ret <32 x i16> %result			ret <32 x i16> %result
	}			}

	define <64 x i16> @interleave2x32i16(<32 x i16> %a, <32 x i16> %b) {			define <64 x i16> @interleave2x32i16(<32 x i16> %a, <32 x i16> %b) {
	; SSE-LABEL: interleave2x32i16:			; SSE-LABEL: interleave2x32i16:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	[... 39 lines not shown ...]
	; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]			; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
	; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm3			; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm3
	; AVX1-NEXT: vmovaps %ymm4, %ymm0			; AVX1-NEXT: vmovaps %ymm4, %ymm0
	; AVX1-NEXT: vmovaps %ymm5, %ymm1			; AVX1-NEXT: vmovaps %ymm5, %ymm1
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: interleave2x32i16:			; AVX2-LABEL: interleave2x32i16:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm4 = xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]			; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm4 = ymm0[4],ymm2[4],ymm0[5],ymm2[5],ymm0[6],ymm2[6],ymm0[7],ymm2[7],ymm0[12],ymm2[12],ymm0[13],ymm2[13],ymm0[14],ymm2[14],ymm0[15],ymm2[15]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm5 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]			; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm2 = ymm0[0],ymm2[0],ymm0[1],ymm2[1],ymm0[2],ymm2[2],ymm0[3],ymm2[3],ymm0[8],ymm2[8],ymm0[9],ymm2[9],ymm0[10],ymm2[10],ymm0[11],ymm2[11]
	; AVX2-NEXT: vinserti128 $1, %xmm4, %ymm5, %ymm4			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm2[0,1],ymm4[0,1]
	; AVX2-NEXT: vextracti128 $1, %ymm2, %xmm2			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm4 = ymm2[2,3],ymm4[2,3]
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm0			; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm5 = ymm1[4],ymm3[4],ymm1[5],ymm3[5],ymm1[6],ymm3[6],ymm1[7],ymm3[7],ymm1[12],ymm3[12],ymm1[13],ymm3[13],ymm1[14],ymm3[14],ymm1[15],ymm3[15]
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]			; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm1[0],ymm3[0],ymm1[1],ymm3[1],ymm1[2],ymm3[2],ymm1[3],ymm3[3],ymm1[8],ymm3[8],ymm1[9],ymm3[9],ymm1[10],ymm3[10],ymm1[11],ymm3[11]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm2 = ymm1[0,1],ymm5[0,1]
	; AVX2-NEXT: vinserti128 $1, %xmm5, %ymm0, %ymm5			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm3 = ymm1[2,3],ymm5[2,3]
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm0 = xmm1[4],xmm3[4],xmm1[5],xmm3[5],xmm1[6],xmm3[6],xmm1[7],xmm3[7]			; AVX2-NEXT: vmovdqa %ymm4, %ymm1
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm1[0],xmm3[0],xmm1[1],xmm3[1],xmm1[2],xmm3[2],xmm1[3],xmm3[3]
	; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm2, %ymm2
	; AVX2-NEXT: vextracti128 $1, %ymm3, %xmm0
	; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm1
	; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
	; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
	; AVX2-NEXT: vinserti128 $1, %xmm3, %ymm0, %ymm3
	; AVX2-NEXT: vmovdqa %ymm4, %ymm0
	; AVX2-NEXT: vmovdqa %ymm5, %ymm1
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%result = shufflevector <32 x i16> %a, <32 x i16> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>			%result = shufflevector <32 x i16> %a, <32 x i16> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
	ret <64 x i16> %result			ret <64 x i16> %result
	}			}

	define <64 x i8> @interleave2x32i8(<32 x i8> %a, <32 x i8> %b) {			define <64 x i8> @interleave2x32i8(<32 x i8> %a, <32 x i8> %b) {
	; SSE-LABEL: interleave2x32i8:			; SSE-LABEL: interleave2x32i8:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	[... 17 lines not shown ...]
	; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm3 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]			; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm3 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
	; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]			; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
	; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1			; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm1
	; AVX1-NEXT: vmovaps %ymm2, %ymm0			; AVX1-NEXT: vmovaps %ymm2, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: interleave2x32i8:			; AVX2-LABEL: interleave2x32i8:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]			; AVX2-NEXT: vpunpckhbw {{.*#+}} ymm2 = ymm0[8],ymm1[8],ymm0[9],ymm1[9],ymm0[10],ymm1[10],ymm0[11],ymm1[11],ymm0[12],ymm1[12],ymm0[13],ymm1[13],ymm0[14],ymm1[14],ymm0[15],ymm1[15],ymm0[24],ymm1[24],ymm0[25],ymm1[25],ymm0[26],ymm1[26],ymm0[27],ymm1[27],ymm0[28],ymm1[28],ymm0[29],ymm1[29],ymm0[30],ymm1[30],ymm0[31],ymm1[31]
	; AVX2-NEXT: vpunpcklbw {{.*#+}} xmm3 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]			; AVX2-NEXT: vpunpcklbw {{.*#+}} ymm1 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[4],ymm1[4],ymm0[5],ymm1[5],ymm0[6],ymm1[6],ymm0[7],ymm1[7],ymm0[16],ymm1[16],ymm0[17],ymm1[17],ymm0[18],ymm1[18],ymm0[19],ymm1[19],ymm0[20],ymm1[20],ymm0[21],ymm1[21],ymm0[22],ymm1[22],ymm0[23],ymm1[23]
	; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm3, %ymm2			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm1[0,1],ymm2[0,1]
	; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm1			; AVX2-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm1[2,3],ymm2[2,3]
	; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm0
	; AVX2-NEXT: vpunpckhbw {{.*#+}} xmm3 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
	; AVX2-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
	; AVX2-NEXT: vinserti128 $1, %xmm3, %ymm0, %ymm1
	; AVX2-NEXT: vmovdqa %ymm2, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%result = shufflevector <32 x i8> %a, <32 x i8> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>			%result = shufflevector <32 x i8> %a, <32 x i8> %b, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
	ret <64 x i8> %result			ret <64 x i8> %result
	}			}
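
The new AVX2 sequence for interleave2x32i8 above (a per-lane unpack low/high followed by two 128-bit lane permutes) can be checked against the shufflevector mask with a small model. This is only a sketch of the instruction semantics in Python, not code from the patch; the helper names `unpack_lo`, `unpack_hi`, `perm2x128`, and `interleave` are made up for illustration, and vectors are modeled as plain lists of element indices.

```python
def unpack_lo(a, b, lanes=2):
    """Model vpunpckl* on a ymm register: interleave the low half of each 128-bit lane."""
    n = len(a) // lanes  # elements per 128-bit lane
    out = []
    for lane in range(lanes):
        base = lane * n
        for i in range(n // 2):
            out += [a[base + i], b[base + i]]
    return out


def unpack_hi(a, b, lanes=2):
    """Model vpunpckh* on a ymm register: interleave the high half of each 128-bit lane."""
    n = len(a) // lanes
    out = []
    for lane in range(lanes):
        base = lane * n + n // 2
        for i in range(n // 2):
            out += [a[base + i], b[base + i]]
    return out


def perm2x128(x, y, sel_lo, sel_hi):
    """Model vperm2i128/vperm2f128 lane selection.

    Selector 0/1 picks the low/high lane of x, 2/3 the low/high lane of y.
    """
    half = len(x) // 2
    lanes = [x[:half], x[half:], y[:half], y[half:]]
    return lanes[sel_lo] + lanes[sel_hi]


def interleave(a, b):
    """Unpack, then regroup lanes so both result registers are in memory order."""
    lo = unpack_lo(a, b)
    hi = unpack_hi(a, b)
    first = perm2x128(lo, hi, 0, 2)   # printed as lo[0,1],hi[0,1] in the CHECK lines
    second = perm2x128(lo, hi, 1, 3)  # printed as lo[2,3],hi[2,3] in the CHECK lines
    return first + second


if __name__ == "__main__":
    a = list(range(32))       # element indices of %a
    b = list(range(32, 64))   # element indices of %b
    print(interleave(a, b)[:8])
```

Because the unpack instructions work within each 128-bit lane, their results hold the interleaved data in the lane order 0, 2 (low unpack) and 1, 3 (high unpack); the two lane permutes restore the order 0, 1, 2, 3. The same model applies to the <16 x i16> and <8 x i32> cases by changing the input length.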

	define void @splat2_i8(ptr %s, ptr %d) {			define void @splat2_i8(ptr %s, ptr %d) {
	; SSE-LABEL: splat2_i8:			; SSE-LABEL: splat2_i8:
	; SSE: # %bb.0:			; SSE: # %bb.0:
	[... 177 lines not shown ...]

llvm/test/CodeGen/X86/vector-interleaved-store-i16-stride-2.ll

	[... 140 lines not shown ...]
	; SSE-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]			; SSE-NEXT: punpckhwd {{.*#+}} xmm2 = xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
	; SSE-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1],xmm1[2],xmm3[2],xmm1[3],xmm3[3]			; SSE-NEXT: punpcklwd {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1],xmm1[2],xmm3[2],xmm1[3],xmm3[3]
	; SSE-NEXT: movdqa %xmm1, 32(%rdx)			; SSE-NEXT: movdqa %xmm1, 32(%rdx)
	; SSE-NEXT: movdqa %xmm2, 48(%rdx)			; SSE-NEXT: movdqa %xmm2, 48(%rdx)
	; SSE-NEXT: movdqa %xmm0, (%rdx)			; SSE-NEXT: movdqa %xmm0, (%rdx)
	; SSE-NEXT: movdqa %xmm4, 16(%rdx)			; SSE-NEXT: movdqa %xmm4, 16(%rdx)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: vf16:			; AVX1-LABEL: vf16:
	; AVX: # %bb.0:			; AVX1: # %bb.0:
	; AVX-NEXT: vmovdqa (%rsi), %xmm0			; AVX1-NEXT: vmovdqa (%rsi), %xmm0
	; AVX-NEXT: vmovdqa 16(%rsi), %xmm1			; AVX1-NEXT: vmovdqa 16(%rsi), %xmm1
	; AVX-NEXT: vmovdqa (%rdi), %xmm2			; AVX1-NEXT: vmovdqa (%rdi), %xmm2
	; AVX-NEXT: vmovdqa 16(%rdi), %xmm3			; AVX1-NEXT: vmovdqa 16(%rdi), %xmm3
	; AVX-NEXT: vpunpckhwd {{.*#+}} xmm4 = xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]			; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm4 = xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
	; AVX-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]			; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3]
	; AVX-NEXT: vpunpckhwd {{.*#+}} xmm2 = xmm3[4],xmm1[4],xmm3[5],xmm1[5],xmm3[6],xmm1[6],xmm3[7],xmm1[7]			; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm2 = xmm3[4],xmm1[4],xmm3[5],xmm1[5],xmm3[6],xmm1[6],xmm3[7],xmm1[7]
	; AVX-NEXT: vpunpcklwd {{.*#+}} xmm1 = xmm3[0],xmm1[0],xmm3[1],xmm1[1],xmm3[2],xmm1[2],xmm3[3],xmm1[3]			; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm1 = xmm3[0],xmm1[0],xmm3[1],xmm1[1],xmm3[2],xmm1[2],xmm3[3],xmm1[3]
	; AVX-NEXT: vmovdqa %xmm1, 32(%rdx)			; AVX1-NEXT: vmovdqa %xmm1, 32(%rdx)
	; AVX-NEXT: vmovdqa %xmm2, 48(%rdx)			; AVX1-NEXT: vmovdqa %xmm2, 48(%rdx)
	; AVX-NEXT: vmovdqa %xmm0, (%rdx)			; AVX1-NEXT: vmovdqa %xmm0, (%rdx)
	; AVX-NEXT: vmovdqa %xmm4, 16(%rdx)			; AVX1-NEXT: vmovdqa %xmm4, 16(%rdx)
	; AVX-NEXT: retq			; AVX1-NEXT: retq
				;
				; AVX2-LABEL: vf16:
				; AVX2: # %bb.0:
				; AVX2-NEXT: vmovdqa (%rdi), %ymm0
				; AVX2-NEXT: vmovdqa (%rsi), %ymm1
				; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm2 = ymm0[4],ymm1[4],ymm0[5],ymm1[5],ymm0[6],ymm1[6],ymm0[7],ymm1[7],ymm0[12],ymm1[12],ymm0[13],ymm1[13],ymm0[14],ymm1[14],ymm0[15],ymm1[15]
				; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[8],ymm1[8],ymm0[9],ymm1[9],ymm0[10],ymm1[10],ymm0[11],ymm1[11]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm0[0,1],ymm2[0,1]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm0[2,3],ymm2[2,3]
				; AVX2-NEXT: vmovdqa %ymm0, 32(%rdx)
				; AVX2-NEXT: vmovdqa %ymm1, (%rdx)
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: vf16:			; AVX512-LABEL: vf16:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vmovdqa (%rdi), %ymm0			; AVX512-NEXT: vmovdqa (%rdi), %ymm0
	; AVX512-NEXT: vinserti64x4 $1, (%rsi), %zmm0, %zmm0			; AVX512-NEXT: vinserti64x4 $1, (%rsi), %zmm0, %zmm0
	; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm1 = [0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23,8,24,9,25,10,26,11,27,12,28,13,29,14,30,15,31]			; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm1 = [0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23,8,24,9,25,10,26,11,27,12,28,13,29,14,30,15,31]
	; AVX512-NEXT: vpermw %zmm0, %zmm1, %zmm0			; AVX512-NEXT: vpermw %zmm0, %zmm1, %zmm0
	; AVX512-NEXT: vmovdqu64 %zmm0, (%rdx)			; AVX512-NEXT: vmovdqu64 %zmm0, (%rdx)
	[... 38 lines not shown ...]
	; SSE-NEXT: movdqa %xmm2, 64(%rdx)			; SSE-NEXT: movdqa %xmm2, 64(%rdx)
	; SSE-NEXT: movdqa %xmm5, 80(%rdx)			; SSE-NEXT: movdqa %xmm5, 80(%rdx)
	; SSE-NEXT: movdqa %xmm1, 32(%rdx)			; SSE-NEXT: movdqa %xmm1, 32(%rdx)
	; SSE-NEXT: movdqa %xmm4, 48(%rdx)			; SSE-NEXT: movdqa %xmm4, 48(%rdx)
	; SSE-NEXT: movdqa %xmm0, (%rdx)			; SSE-NEXT: movdqa %xmm0, (%rdx)
	; SSE-NEXT: movdqa %xmm8, 16(%rdx)			; SSE-NEXT: movdqa %xmm8, 16(%rdx)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: vf32:			; AVX1-LABEL: vf32:
	; AVX: # %bb.0:			; AVX1: # %bb.0:
	; AVX-NEXT: vmovdqa (%rsi), %xmm0			; AVX1-NEXT: vmovdqa (%rsi), %xmm0
	; AVX-NEXT: vmovdqa 16(%rsi), %xmm1			; AVX1-NEXT: vmovdqa 16(%rsi), %xmm1
	; AVX-NEXT: vmovdqa 32(%rsi), %xmm2			; AVX1-NEXT: vmovdqa 32(%rsi), %xmm2
	; AVX-NEXT: vmovdqa 48(%rsi), %xmm3			; AVX1-NEXT: vmovdqa 48(%rsi), %xmm3
	; AVX-NEXT: vmovdqa (%rdi), %xmm4			; AVX1-NEXT: vmovdqa (%rdi), %xmm4
	; AVX-NEXT: vmovdqa 16(%rdi), %xmm5			; AVX1-NEXT: vmovdqa 16(%rdi), %xmm5
	; AVX-NEXT: vmovdqa 32(%rdi), %xmm6			; AVX1-NEXT: vmovdqa 32(%rdi), %xmm6
	; AVX-NEXT: vmovdqa 48(%rdi), %xmm7			; AVX1-NEXT: vmovdqa 48(%rdi), %xmm7
	; AVX-NEXT: vpunpckhwd {{.*#+}} xmm8 = xmm6[4],xmm2[4],xmm6[5],xmm2[5],xmm6[6],xmm2[6],xmm6[7],xmm2[7]			; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm8 = xmm6[4],xmm2[4],xmm6[5],xmm2[5],xmm6[6],xmm2[6],xmm6[7],xmm2[7]
	; AVX-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm6[0],xmm2[0],xmm6[1],xmm2[1],xmm6[2],xmm2[2],xmm6[3],xmm2[3]			; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm6[0],xmm2[0],xmm6[1],xmm2[1],xmm6[2],xmm2[2],xmm6[3],xmm2[3]
	; AVX-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm7[4],xmm3[4],xmm7[5],xmm3[5],xmm7[6],xmm3[6],xmm7[7],xmm3[7]			; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm7[4],xmm3[4],xmm7[5],xmm3[5],xmm7[6],xmm3[6],xmm7[7],xmm3[7]
	; AVX-NEXT: vpunpcklwd {{.*#+}} xmm3 = xmm7[0],xmm3[0],xmm7[1],xmm3[1],xmm7[2],xmm3[2],xmm7[3],xmm3[3]			; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm3 = xmm7[0],xmm3[0],xmm7[1],xmm3[1],xmm7[2],xmm3[2],xmm7[3],xmm3[3]
	; AVX-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm5[4],xmm1[4],xmm5[5],xmm1[5],xmm5[6],xmm1[6],xmm5[7],xmm1[7]			; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm5[4],xmm1[4],xmm5[5],xmm1[5],xmm5[6],xmm1[6],xmm5[7],xmm1[7]
	; AVX-NEXT: vpunpcklwd {{.*#+}} xmm1 = xmm5[0],xmm1[0],xmm5[1],xmm1[1],xmm5[2],xmm1[2],xmm5[3],xmm1[3]			; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm1 = xmm5[0],xmm1[0],xmm5[1],xmm1[1],xmm5[2],xmm1[2],xmm5[3],xmm1[3]
	; AVX-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]			; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm4[4],xmm0[4],xmm4[5],xmm0[5],xmm4[6],xmm0[6],xmm4[7],xmm0[7]
	; AVX-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]			; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm4[0],xmm0[0],xmm4[1],xmm0[1],xmm4[2],xmm0[2],xmm4[3],xmm0[3]
	; AVX-NEXT: vmovdqa %xmm0, (%rdx)			; AVX1-NEXT: vmovdqa %xmm0, (%rdx)
	; AVX-NEXT: vmovdqa %xmm5, 16(%rdx)			; AVX1-NEXT: vmovdqa %xmm5, 16(%rdx)
	; AVX-NEXT: vmovdqa %xmm1, 32(%rdx)			; AVX1-NEXT: vmovdqa %xmm1, 32(%rdx)
	; AVX-NEXT: vmovdqa %xmm7, 48(%rdx)			; AVX1-NEXT: vmovdqa %xmm7, 48(%rdx)
	; AVX-NEXT: vmovdqa %xmm3, 96(%rdx)			; AVX1-NEXT: vmovdqa %xmm3, 96(%rdx)
	; AVX-NEXT: vmovdqa %xmm6, 112(%rdx)			; AVX1-NEXT: vmovdqa %xmm6, 112(%rdx)
	; AVX-NEXT: vmovdqa %xmm2, 64(%rdx)			; AVX1-NEXT: vmovdqa %xmm2, 64(%rdx)
	; AVX-NEXT: vmovdqa %xmm8, 80(%rdx)			; AVX1-NEXT: vmovdqa %xmm8, 80(%rdx)
	; AVX-NEXT: retq			; AVX1-NEXT: retq
				;
				; AVX2-LABEL: vf32:
				; AVX2: # %bb.0:
				; AVX2-NEXT: vmovdqa (%rdi), %ymm0
				; AVX2-NEXT: vmovdqa 32(%rdi), %ymm1
				; AVX2-NEXT: vmovdqa (%rsi), %ymm2
				; AVX2-NEXT: vmovdqa 32(%rsi), %ymm3
				; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm4 = ymm0[4],ymm2[4],ymm0[5],ymm2[5],ymm0[6],ymm2[6],ymm0[7],ymm2[7],ymm0[12],ymm2[12],ymm0[13],ymm2[13],ymm0[14],ymm2[14],ymm0[15],ymm2[15]
				; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm0 = ymm0[0],ymm2[0],ymm0[1],ymm2[1],ymm0[2],ymm2[2],ymm0[3],ymm2[3],ymm0[8],ymm2[8],ymm0[9],ymm2[9],ymm0[10],ymm2[10],ymm0[11],ymm2[11]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm2 = ymm0[2,3],ymm4[2,3]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm0[0,1],ymm4[0,1]
				; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm4 = ymm1[4],ymm3[4],ymm1[5],ymm3[5],ymm1[6],ymm3[6],ymm1[7],ymm3[7],ymm1[12],ymm3[12],ymm1[13],ymm3[13],ymm1[14],ymm3[14],ymm1[15],ymm3[15]
				; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm1[0],ymm3[0],ymm1[1],ymm3[1],ymm1[2],ymm3[2],ymm1[3],ymm3[3],ymm1[8],ymm3[8],ymm1[9],ymm3[9],ymm1[10],ymm3[10],ymm1[11],ymm3[11]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm3 = ymm1[2,3],ymm4[2,3]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm1[0,1],ymm4[0,1]
				; AVX2-NEXT: vmovdqa %ymm1, 64(%rdx)
				; AVX2-NEXT: vmovdqa %ymm3, 96(%rdx)
				; AVX2-NEXT: vmovdqa %ymm0, (%rdx)
				; AVX2-NEXT: vmovdqa %ymm2, 32(%rdx)
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: vf32:			; AVX512-LABEL: vf32:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vmovdqu64 (%rdi), %zmm0			; AVX512-NEXT: vmovdqu64 (%rdi), %zmm0
	; AVX512-NEXT: vmovdqu64 (%rsi), %zmm1			; AVX512-NEXT: vmovdqu64 (%rsi), %zmm1
	; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm2 = [0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47]			; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm2 = [0,32,1,33,2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47]
	; AVX512-NEXT: vpermi2w %zmm1, %zmm0, %zmm2			; AVX512-NEXT: vpermi2w %zmm1, %zmm0, %zmm2
	; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm3 = [16,48,17,49,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63]			; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm3 = [16,48,17,49,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63]
	[... 15 lines not shown ...]

llvm/test/CodeGen/X86/vector-interleaved-store-i32-stride-2.ll

	[... 130 lines not shown ...]
	; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm0, %ymm0
	; AVX1-NEXT: vmovaps %ymm0, (%rdx)			; AVX1-NEXT: vmovaps %ymm0, (%rdx)
	; AVX1-NEXT: vmovaps %ymm1, 32(%rdx)			; AVX1-NEXT: vmovaps %ymm1, 32(%rdx)
	; AVX1-NEXT: vzeroupper			; AVX1-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: store_i32_stride2_vf8:			; AVX2-LABEL: store_i32_stride2_vf8:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vmovaps (%rsi), %xmm0			; AVX2-NEXT: vmovaps (%rdi), %ymm0
	; AVX2-NEXT: vmovaps 16(%rsi), %xmm1			; AVX2-NEXT: vmovaps (%rsi), %ymm1
	; AVX2-NEXT: vmovaps (%rdi), %xmm2			; AVX2-NEXT: vunpckhps {{.*#+}} ymm2 = ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[6],ymm1[6],ymm0[7],ymm1[7]
	; AVX2-NEXT: vmovaps 16(%rdi), %xmm3			; AVX2-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[4],ymm1[4],ymm0[5],ymm1[5]
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm4 = xmm2[2],xmm0[2],xmm2[3],xmm0[3]			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm0[0,1],ymm2[0,1]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],ymm2[2,3]
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm2 = xmm3[2],xmm1[2],xmm3[3],xmm1[3]			; AVX2-NEXT: vmovaps %ymm0, 32(%rdx)
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm1 = xmm3[0],xmm1[0],xmm3[1],xmm1[1]			; AVX2-NEXT: vmovaps %ymm1, (%rdx)
	; AVX2-NEXT: vmovaps %xmm1, 32(%rdx)			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: vmovaps %xmm2, 48(%rdx)
	; AVX2-NEXT: vmovaps %xmm0, (%rdx)
	; AVX2-NEXT: vmovaps %xmm4, 16(%rdx)
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: store_i32_stride2_vf8:			; AVX512-LABEL: store_i32_stride2_vf8:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vmovaps (%rdi), %ymm0			; AVX512-NEXT: vmovaps (%rdi), %ymm0
	; AVX512-NEXT: vinsertf64x4 $1, (%rsi), %zmm0, %zmm0			; AVX512-NEXT: vinsertf64x4 $1, (%rsi), %zmm0, %zmm0
	; AVX512-NEXT: vmovaps {{.*#+}} zmm1 = [0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15]			; AVX512-NEXT: vmovaps {{.*#+}} zmm1 = [0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15]
	; AVX512-NEXT: vpermps %zmm0, %zmm1, %zmm0			; AVX512-NEXT: vpermps %zmm0, %zmm1, %zmm0
	[... 70 lines not shown ...]
	; AVX1-NEXT: vmovaps %ymm3, 96(%rdx)			; AVX1-NEXT: vmovaps %ymm3, 96(%rdx)
	; AVX1-NEXT: vmovaps %ymm2, 64(%rdx)			; AVX1-NEXT: vmovaps %ymm2, 64(%rdx)
	; AVX1-NEXT: vmovaps %ymm0, (%rdx)			; AVX1-NEXT: vmovaps %ymm0, (%rdx)
	; AVX1-NEXT: vzeroupper			; AVX1-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: store_i32_stride2_vf16:			; AVX2-LABEL: store_i32_stride2_vf16:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vmovaps (%rsi), %xmm0			; AVX2-NEXT: vmovaps (%rdi), %ymm0
	; AVX2-NEXT: vmovaps 16(%rsi), %xmm1			; AVX2-NEXT: vmovaps 32(%rdi), %ymm1
	; AVX2-NEXT: vmovaps 32(%rsi), %xmm2			; AVX2-NEXT: vmovaps (%rsi), %ymm2
	; AVX2-NEXT: vmovaps 48(%rsi), %xmm3			; AVX2-NEXT: vmovaps 32(%rsi), %ymm3
	; AVX2-NEXT: vmovaps (%rdi), %xmm4			; AVX2-NEXT: vunpckhps {{.*#+}} ymm4 = ymm0[2],ymm2[2],ymm0[3],ymm2[3],ymm0[6],ymm2[6],ymm0[7],ymm2[7]
	; AVX2-NEXT: vmovaps 16(%rdi), %xmm5			; AVX2-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],ymm2[0],ymm0[1],ymm2[1],ymm0[4],ymm2[4],ymm0[5],ymm2[5]
	; AVX2-NEXT: vmovaps 32(%rdi), %xmm6			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm2 = ymm0[2,3],ymm4[2,3]
	; AVX2-NEXT: vmovaps 48(%rdi), %xmm7			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[0,1],ymm4[0,1]
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm8 = xmm6[2],xmm2[2],xmm6[3],xmm2[3]			; AVX2-NEXT: vunpckhps {{.*#+}} ymm4 = ymm1[2],ymm3[2],ymm1[3],ymm3[3],ymm1[6],ymm3[6],ymm1[7],ymm3[7]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm2 = xmm6[0],xmm2[0],xmm6[1],xmm2[1]			; AVX2-NEXT: vunpcklps {{.*#+}} ymm1 = ymm1[0],ymm3[0],ymm1[1],ymm3[1],ymm1[4],ymm3[4],ymm1[5],ymm3[5]
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm6 = xmm7[2],xmm3[2],xmm7[3],xmm3[3]			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm3 = ymm1[2,3],ymm4[2,3]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm3 = xmm7[0],xmm3[0],xmm7[1],xmm3[1]			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm1[0,1],ymm4[0,1]
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm7 = xmm5[2],xmm1[2],xmm5[3],xmm1[3]			; AVX2-NEXT: vmovaps %ymm1, 64(%rdx)
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm1 = xmm5[0],xmm1[0],xmm5[1],xmm1[1]			; AVX2-NEXT: vmovaps %ymm3, 96(%rdx)
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm5 = xmm4[2],xmm0[2],xmm4[3],xmm0[3]			; AVX2-NEXT: vmovaps %ymm0, (%rdx)
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm0 = xmm4[0],xmm0[0],xmm4[1],xmm0[1]			; AVX2-NEXT: vmovaps %ymm2, 32(%rdx)
	; AVX2-NEXT: vmovaps %xmm0, (%rdx)			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: vmovaps %xmm5, 16(%rdx)
	; AVX2-NEXT: vmovaps %xmm1, 32(%rdx)
	; AVX2-NEXT: vmovaps %xmm7, 48(%rdx)
	; AVX2-NEXT: vmovaps %xmm3, 96(%rdx)
	; AVX2-NEXT: vmovaps %xmm6, 112(%rdx)
	; AVX2-NEXT: vmovaps %xmm2, 64(%rdx)
	; AVX2-NEXT: vmovaps %xmm8, 80(%rdx)
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: store_i32_stride2_vf16:			; AVX512-LABEL: store_i32_stride2_vf16:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vmovdqu64 (%rdi), %zmm0			; AVX512-NEXT: vmovdqu64 (%rdi), %zmm0
	; AVX512-NEXT: vmovdqu64 (%rsi), %zmm1			; AVX512-NEXT: vmovdqu64 (%rsi), %zmm1
	; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm2 = [0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23]			; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm2 = [0,16,1,17,2,18,3,19,4,20,5,21,6,22,7,23]
	; AVX512-NEXT: vpermi2d %zmm1, %zmm0, %zmm2			; AVX512-NEXT: vpermi2d %zmm1, %zmm0, %zmm2
	[... 127 lines not shown ...]
	; AVX1-NEXT: vmovaps %ymm2, 160(%rdx)			; AVX1-NEXT: vmovaps %ymm2, 160(%rdx)
	; AVX1-NEXT: vmovaps %ymm1, 128(%rdx)			; AVX1-NEXT: vmovaps %ymm1, 128(%rdx)
	; AVX1-NEXT: vmovaps %ymm0, 192(%rdx)			; AVX1-NEXT: vmovaps %ymm0, 192(%rdx)
	; AVX1-NEXT: vzeroupper			; AVX1-NEXT: vzeroupper
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: store_i32_stride2_vf32:			; AVX2-LABEL: store_i32_stride2_vf32:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vmovaps 64(%rsi), %xmm1			; AVX2-NEXT: vmovaps (%rdi), %ymm0
	; AVX2-NEXT: vmovaps 64(%rdi), %xmm2			; AVX2-NEXT: vmovaps 32(%rdi), %ymm1
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm0 = xmm2[2],xmm1[2],xmm2[3],xmm1[3]			; AVX2-NEXT: vmovaps 64(%rdi), %ymm2
	; AVX2-NEXT: vmovaps %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill			; AVX2-NEXT: vmovaps 96(%rdi), %ymm3
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]			; AVX2-NEXT: vmovaps (%rsi), %ymm4
	; AVX2-NEXT: vmovaps 80(%rsi), %xmm3			; AVX2-NEXT: vmovaps 32(%rsi), %ymm5
	; AVX2-NEXT: vmovaps 80(%rdi), %xmm4			; AVX2-NEXT: vmovaps 64(%rsi), %ymm6
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm2 = xmm4[2],xmm3[2],xmm4[3],xmm3[3]			; AVX2-NEXT: vmovaps 96(%rsi), %ymm7
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm3 = xmm4[0],xmm3[0],xmm4[1],xmm3[1]			; AVX2-NEXT: vunpckhps {{.*#+}} ymm8 = ymm0[2],ymm4[2],ymm0[3],ymm4[3],ymm0[6],ymm4[6],ymm0[7],ymm4[7]
	; AVX2-NEXT: vmovaps (%rsi), %xmm4			; AVX2-NEXT: vunpcklps {{.*#+}} ymm0 = ymm0[0],ymm4[0],ymm0[1],ymm4[1],ymm0[4],ymm4[4],ymm0[5],ymm4[5]
	; AVX2-NEXT: vmovaps 16(%rsi), %xmm5			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm4 = ymm0[2,3],ymm8[2,3]
	; AVX2-NEXT: vmovaps 32(%rsi), %xmm6			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[0,1],ymm8[0,1]
	; AVX2-NEXT: vmovaps 48(%rsi), %xmm7			; AVX2-NEXT: vunpckhps {{.*#+}} ymm8 = ymm1[2],ymm5[2],ymm1[3],ymm5[3],ymm1[6],ymm5[6],ymm1[7],ymm5[7]
	; AVX2-NEXT: vmovaps (%rdi), %xmm8			; AVX2-NEXT: vunpcklps {{.*#+}} ymm1 = ymm1[0],ymm5[0],ymm1[1],ymm5[1],ymm1[4],ymm5[4],ymm1[5],ymm5[5]
	; AVX2-NEXT: vmovaps 16(%rdi), %xmm9			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm5 = ymm1[2,3],ymm8[2,3]
	; AVX2-NEXT: vmovaps 32(%rdi), %xmm10			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm1[0,1],ymm8[0,1]
	; AVX2-NEXT: vmovaps 48(%rdi), %xmm11			; AVX2-NEXT: vunpckhps {{.*#+}} ymm8 = ymm2[2],ymm6[2],ymm2[3],ymm6[3],ymm2[6],ymm6[6],ymm2[7],ymm6[7]
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm12 = xmm9[2],xmm5[2],xmm9[3],xmm5[3]			; AVX2-NEXT: vunpcklps {{.*#+}} ymm2 = ymm2[0],ymm6[0],ymm2[1],ymm6[1],ymm2[4],ymm6[4],ymm2[5],ymm6[5]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm5 = xmm9[0],xmm5[0],xmm9[1],xmm5[1]			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm6 = ymm2[2,3],ymm8[2,3]
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm9 = xmm8[2],xmm4[2],xmm8[3],xmm4[3]			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm2 = ymm2[0,1],ymm8[0,1]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm4 = xmm8[0],xmm4[0],xmm8[1],xmm4[1]			; AVX2-NEXT: vunpckhps {{.*#+}} ymm8 = ymm3[2],ymm7[2],ymm3[3],ymm7[3],ymm3[6],ymm7[6],ymm3[7],ymm7[7]
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm8 = xmm11[2],xmm7[2],xmm11[3],xmm7[3]			; AVX2-NEXT: vunpcklps {{.*#+}} ymm3 = ymm3[0],ymm7[0],ymm3[1],ymm7[1],ymm3[4],ymm7[4],ymm3[5],ymm7[5]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm7 = xmm11[0],xmm7[0],xmm11[1],xmm7[1]			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm7 = ymm3[2,3],ymm8[2,3]
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm11 = xmm10[2],xmm6[2],xmm10[3],xmm6[3]			; AVX2-NEXT: vperm2f128 {{.*#+}} ymm3 = ymm3[0,1],ymm8[0,1]
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm6 = xmm10[0],xmm6[0],xmm10[1],xmm6[1]			; AVX2-NEXT: vmovaps %ymm3, 192(%rdx)
	; AVX2-NEXT: vmovaps 112(%rsi), %xmm10			; AVX2-NEXT: vmovaps %ymm7, 224(%rdx)
	; AVX2-NEXT: vmovaps 112(%rdi), %xmm13			; AVX2-NEXT: vmovaps %ymm2, 128(%rdx)
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm14 = xmm13[2],xmm10[2],xmm13[3],xmm10[3]			; AVX2-NEXT: vmovaps %ymm6, 160(%rdx)
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm10 = xmm13[0],xmm10[0],xmm13[1],xmm10[1]			; AVX2-NEXT: vmovaps %ymm1, 64(%rdx)
	; AVX2-NEXT: vmovaps 96(%rsi), %xmm13			; AVX2-NEXT: vmovaps %ymm5, 96(%rdx)
	; AVX2-NEXT: vmovaps 96(%rdi), %xmm15			; AVX2-NEXT: vmovaps %ymm0, (%rdx)
	; AVX2-NEXT: vunpckhps {{.*#+}} xmm0 = xmm15[2],xmm13[2],xmm15[3],xmm13[3]			; AVX2-NEXT: vmovaps %ymm4, 32(%rdx)
	; AVX2-NEXT: vunpcklps {{.*#+}} xmm13 = xmm15[0],xmm13[0],xmm15[1],xmm13[1]			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: vmovaps %xmm13, 192(%rdx)
	; AVX2-NEXT: vmovaps %xmm0, 208(%rdx)
	; AVX2-NEXT: vmovaps %xmm10, 224(%rdx)
	; AVX2-NEXT: vmovaps %xmm14, 240(%rdx)
	; AVX2-NEXT: vmovaps %xmm6, 64(%rdx)
	; AVX2-NEXT: vmovaps %xmm11, 80(%rdx)
	; AVX2-NEXT: vmovaps %xmm7, 96(%rdx)
	; AVX2-NEXT: vmovaps %xmm8, 112(%rdx)
	; AVX2-NEXT: vmovaps %xmm4, (%rdx)
	; AVX2-NEXT: vmovaps %xmm9, 16(%rdx)
	; AVX2-NEXT: vmovaps %xmm5, 32(%rdx)
	; AVX2-NEXT: vmovaps %xmm12, 48(%rdx)
	; AVX2-NEXT: vmovaps %xmm3, 160(%rdx)
	; AVX2-NEXT: vmovaps %xmm2, 176(%rdx)
	; AVX2-NEXT: vmovaps %xmm1, 128(%rdx)
	; AVX2-NEXT: vmovaps {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 16-byte Reload
	; AVX2-NEXT: vmovaps %xmm0, 144(%rdx)
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: store_i32_stride2_vf32:			; AVX512-LABEL: store_i32_stride2_vf32:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vmovdqu64 (%rdi), %zmm0			; AVX512-NEXT: vmovdqu64 (%rdi), %zmm0
	; AVX512-NEXT: vmovdqu64 64(%rdi), %zmm1			; AVX512-NEXT: vmovdqu64 64(%rdi), %zmm1
	; AVX512-NEXT: vmovdqu64 (%rsi), %zmm2			; AVX512-NEXT: vmovdqu64 (%rsi), %zmm2
	; AVX512-NEXT: vmovdqu64 64(%rsi), %zmm3			; AVX512-NEXT: vmovdqu64 64(%rsi), %zmm3
	[... 23 lines not shown ...]

llvm/test/CodeGen/X86/vector-interleaved-store-i8-stride-2.ll

	[... 172 lines not shown ...]
	; SSE-NEXT: punpckhbw {{.*#+}} xmm2 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]			; SSE-NEXT: punpckhbw {{.*#+}} xmm2 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]
	; SSE-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1],xmm1[2],xmm3[2],xmm1[3],xmm3[3],xmm1[4],xmm3[4],xmm1[5],xmm3[5],xmm1[6],xmm3[6],xmm1[7],xmm3[7]			; SSE-NEXT: punpcklbw {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1],xmm1[2],xmm3[2],xmm1[3],xmm3[3],xmm1[4],xmm3[4],xmm1[5],xmm3[5],xmm1[6],xmm3[6],xmm1[7],xmm3[7]
	; SSE-NEXT: movdqa %xmm1, 32(%rdx)			; SSE-NEXT: movdqa %xmm1, 32(%rdx)
	; SSE-NEXT: movdqa %xmm2, 48(%rdx)			; SSE-NEXT: movdqa %xmm2, 48(%rdx)
	; SSE-NEXT: movdqa %xmm0, (%rdx)			; SSE-NEXT: movdqa %xmm0, (%rdx)
	; SSE-NEXT: movdqa %xmm4, 16(%rdx)			; SSE-NEXT: movdqa %xmm4, 16(%rdx)
	; SSE-NEXT: retq			; SSE-NEXT: retq
	;			;
	; AVX-LABEL: store_i8_stride2_vf32:			; AVX1-LABEL: store_i8_stride2_vf32:
	; AVX: # %bb.0:			; AVX1: # %bb.0:
	; AVX-NEXT: vmovdqa (%rsi), %xmm0			; AVX1-NEXT: vmovdqa (%rsi), %xmm0
	; AVX-NEXT: vmovdqa 16(%rsi), %xmm1			; AVX1-NEXT: vmovdqa 16(%rsi), %xmm1
	; AVX-NEXT: vmovdqa (%rdi), %xmm2			; AVX1-NEXT: vmovdqa (%rdi), %xmm2
	; AVX-NEXT: vmovdqa 16(%rdi), %xmm3			; AVX1-NEXT: vmovdqa 16(%rdi), %xmm3
	; AVX-NEXT: vpunpckhbw {{.*#+}} xmm4 = xmm2[8],xmm0[8],xmm2[9],xmm0[9],xmm2[10],xmm0[10],xmm2[11],xmm0[11],xmm2[12],xmm0[12],xmm2[13],xmm0[13],xmm2[14],xmm0[14],xmm2[15],xmm0[15]			; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm4 = xmm2[8],xmm0[8],xmm2[9],xmm0[9],xmm2[10],xmm0[10],xmm2[11],xmm0[11],xmm2[12],xmm0[12],xmm2[13],xmm0[13],xmm2[14],xmm0[14],xmm2[15],xmm0[15]
	; AVX-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]			; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1],xmm2[2],xmm0[2],xmm2[3],xmm0[3],xmm2[4],xmm0[4],xmm2[5],xmm0[5],xmm2[6],xmm0[6],xmm2[7],xmm0[7]
	; AVX-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm3[8],xmm1[8],xmm3[9],xmm1[9],xmm3[10],xmm1[10],xmm3[11],xmm1[11],xmm3[12],xmm1[12],xmm3[13],xmm1[13],xmm3[14],xmm1[14],xmm3[15],xmm1[15]			; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm3[8],xmm1[8],xmm3[9],xmm1[9],xmm3[10],xmm1[10],xmm3[11],xmm1[11],xmm3[12],xmm1[12],xmm3[13],xmm1[13],xmm3[14],xmm1[14],xmm3[15],xmm1[15]
	; AVX-NEXT: vpunpcklbw {{.*#+}} xmm1 = xmm3[0],xmm1[0],xmm3[1],xmm1[1],xmm3[2],xmm1[2],xmm3[3],xmm1[3],xmm3[4],xmm1[4],xmm3[5],xmm1[5],xmm3[6],xmm1[6],xmm3[7],xmm1[7]			; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm1 = xmm3[0],xmm1[0],xmm3[1],xmm1[1],xmm3[2],xmm1[2],xmm3[3],xmm1[3],xmm3[4],xmm1[4],xmm3[5],xmm1[5],xmm3[6],xmm1[6],xmm3[7],xmm1[7]
	; AVX-NEXT: vmovdqa %xmm1, 32(%rdx)			; AVX1-NEXT: vmovdqa %xmm1, 32(%rdx)
	; AVX-NEXT: vmovdqa %xmm2, 48(%rdx)			; AVX1-NEXT: vmovdqa %xmm2, 48(%rdx)
	; AVX-NEXT: vmovdqa %xmm0, (%rdx)			; AVX1-NEXT: vmovdqa %xmm0, (%rdx)
	; AVX-NEXT: vmovdqa %xmm4, 16(%rdx)			; AVX1-NEXT: vmovdqa %xmm4, 16(%rdx)
	; AVX-NEXT: retq			; AVX1-NEXT: retq
				;
				; AVX2-LABEL: store_i8_stride2_vf32:
				; AVX2: # %bb.0:
				; AVX2-NEXT: vmovdqa (%rdi), %ymm0
				; AVX2-NEXT: vmovdqa (%rsi), %ymm1
				; AVX2-NEXT: vpunpckhbw {{.*#+}} ymm2 = ymm0[8],ymm1[8],ymm0[9],ymm1[9],ymm0[10],ymm1[10],ymm0[11],ymm1[11],ymm0[12],ymm1[12],ymm0[13],ymm1[13],ymm0[14],ymm1[14],ymm0[15],ymm1[15],ymm0[24],ymm1[24],ymm0[25],ymm1[25],ymm0[26],ymm1[26],ymm0[27],ymm1[27],ymm0[28],ymm1[28],ymm0[29],ymm1[29],ymm0[30],ymm1[30],ymm0[31],ymm1[31]
				; AVX2-NEXT: vpunpcklbw {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[4],ymm1[4],ymm0[5],ymm1[5],ymm0[6],ymm1[6],ymm0[7],ymm1[7],ymm0[16],ymm1[16],ymm0[17],ymm1[17],ymm0[18],ymm1[18],ymm0[19],ymm1[19],ymm0[20],ymm1[20],ymm0[21],ymm1[21],ymm0[22],ymm1[22],ymm0[23],ymm1[23]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm0[0,1],ymm2[0,1]
				; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm0[2,3],ymm2[2,3]
				; AVX2-NEXT: vmovdqa %ymm0, 32(%rdx)
				; AVX2-NEXT: vmovdqa %ymm1, (%rdx)
				; AVX2-NEXT: vzeroupper
				; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: store_i8_stride2_vf32:			; AVX512-LABEL: store_i8_stride2_vf32:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vmovdqa (%rdi), %ymm0			; AVX512-NEXT: vmovdqa (%rdi), %ymm0
	; AVX512-NEXT: vinserti64x4 $1, (%rsi), %zmm0, %zmm0			; AVX512-NEXT: vinserti64x4 $1, (%rsi), %zmm0, %zmm0
	; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm1 = [0,4,1,5,2,6,3,7]			; AVX512-NEXT: vmovdqa64 {{.*#+}} zmm1 = [0,4,1,5,2,6,3,7]
	; AVX512-NEXT: vpermq %zmm0, %zmm1, %zmm0			; AVX512-NEXT: vpermq %zmm0, %zmm1, %zmm0
	; AVX512-NEXT: vpshufb {{.*#+}} zmm0 = zmm0[0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15,16,24,17,25,18,26,19,27,20,28,21,29,22,30,23,31,32,40,33,41,34,42,35,43,36,44,37,45,38,46,39,47,48,56,49,57,50,58,51,59,52,60,53,61,54,62,55,63]			; AVX512-NEXT: vpshufb {{.*#+}} zmm0 = zmm0[0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15,16,24,17,25,18,26,19,27,20,28,21,29,22,30,23,31,32,40,33,41,34,42,35,43,36,44,37,45,38,46,39,47,48,56,49,57,50,58,51,59,52,60,53,61,54,62,55,63]
	[... 13 lines not shown ...]
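
The new AVX2 check lines above all follow the same pattern: one unpck-low, one unpck-high, then two vperm2i128/vperm2f128 selecting lanes [0,1] and [2,3]. The following is a minimal Python model of why that sequence reproduces the interleave shuffle mask from the summary. It is only an illustrative sketch, not the code in X86ISelLowering.cpp; the helper names (unpack_lo, unpack_hi, perm2x128, interleave_avx2) are hypothetical, and vectors are modelled as plain lists with lane_elems elements per 128-bit lane.

```python
def unpack_lo(a, b, lane_elems):
    """Model of vpunpckl*/vunpcklps: within each 128-bit lane,
    interleave the low half of the lane of a with that of b."""
    half = lane_elems // 2
    out = []
    for base in range(0, len(a), lane_elems):
        for i in range(base, base + half):
            out += [a[i], b[i]]
    return out


def unpack_hi(a, b, lane_elems):
    """Model of vpunpckh*/vunpckhps: same as unpack_lo, but on the
    high half of each 128-bit lane."""
    half = lane_elems // 2
    out = []
    for base in range(0, len(a), lane_elems):
        for i in range(base + half, base + lane_elems):
            out += [a[i], b[i]]
    return out


def perm2x128(x, y, x_lane, y_lane, lane_elems):
    """Model of vperm2i128/vperm2f128 as used in the checks:
    low result lane = lane x_lane of x, high result lane = lane y_lane of y."""
    lo = x[x_lane * lane_elems:(x_lane + 1) * lane_elems]
    hi = y[y_lane * lane_elems:(y_lane + 1) * lane_elems]
    return lo + hi


def interleave_avx2(a, b, lane_elems):
    """Interleave two 256-bit vectors using unpck + perm, returning the
    two 256-bit results concatenated in memory order."""
    lo = unpack_lo(a, b, lane_elems)
    hi = unpack_hi(a, b, lane_elems)
    first = perm2x128(lo, hi, 0, 0, lane_elems)   # lo[0,1], hi[0,1]
    second = perm2x128(lo, hi, 1, 1, lane_elems)  # lo[2,3], hi[2,3]
    return first + second


if __name__ == "__main__":
    # <32 x i8> case: 16 bytes per 128-bit lane.
    a = list(range(32))
    b = list(range(32, 64))
    print(interleave_avx2(a, b, 16)[:8])
```

Because the unpck instructions operate per 128-bit lane, the low result holds elements 0-7 and 16-23 of the interleave and the high result holds 8-15 and 24-31; the two lane permutes put them back in order. The same model covers the i16 and i32 tests by passing lane_elems of 8 or 4.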