This is an archive of the discontinued LLVM Phabricator instance.

[X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.
ClosedPublic

Authored by m_zuckerman on Jun 25 2017, 12:38 AM.

Details

Reviewers

dorit
Farhana
RKSimon
guyblank
DavidKreitzer

Commits

rGc1918ad571e0: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.
rL309086: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.

Summary

This patch expands the support of lowerInterleavedStore to 32x8i stride 4.

LLVM creates suboptimal shuffle code-gen for AVX2. Overall, this patch is a specific fix for one pattern (Stride=4, VF=32), and we plan to include more patterns in the future. To support more patterns, we include two mask creators. The first creates a shuffle mask equivalent to the unpacklo/unpackhi instructions. The second creates a mask equivalent to a concatenation of two half vectors (high/low).
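
As a rough illustration of the first mask creator, here is a minimal standalone sketch that mirrors the loop shown in the diff further down. The name makeUnpackMask and the use of std::vector in place of SmallVectorImpl are assumptions made so the snippet compiles on its own; it is not the code that was committed.

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone helper (not the LLVM function itself).
// Builds the shuffle mask matching unpacklo/unpackhi on a vector made of
// two equally sized lanes (two 128-bit lanes for a 256-bit vector).
//   NumElts: number of elements in the whole vector (32 for i8, 16 for i16)
//   Lo:      true for unpacklo, false for unpackhi
//   Unary:   true when both shuffle operands are the same vector
std::vector<int> makeUnpackMask(int NumElts, bool Lo, bool Unary) {
  assert(NumElts > 0 && NumElts % 4 == 0 && "expected two even-sized lanes");
  const int NumEltsInLane = NumElts / 2;
  std::vector<int> Mask;
  Mask.reserve(NumElts);
  for (int i = 0; i < NumElts; ++i) {
    int LaneStart = (i / NumEltsInLane) * NumEltsInLane;
    int Pos = (i % NumEltsInLane) / 2 + LaneStart;
    Pos += Unary ? 0 : NumElts * (i % 2); // odd slots read the second operand
    Pos += Lo ? 0 : NumEltsInLane / 2;    // unpackhi starts in the middle of a lane
    Mask.push_back(Pos);
  }
  return Mask;
}
```

For 32 byte elements with Lo set, the mask starts 0, 32, 1, 33, which pairs element i of the first operand with element i of the second, lane by lane.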

The patch goal is to optimize the following sequence:
At the end of the computation, ymm2, ymm0, ymm12 and ymm3 each hold 32 chars:

c0, c1, ..., c31
m0, m1, ..., m31
y0, y1, ..., y31
k0, k1, ..., k31

And these need to be transposed/interleaved and stored like so:

c0 m0 y0 k0 c1 m1 y1 k1 c2 m2 y2 k2 c3 m3 y3 k3 ....
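
The target layout can be pinned down with a scalar reference. The following is only a sketch for checking results against; the function name interleave4x32 and the std::array types are assumptions, and it says nothing about how the vectorized sequence is generated.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference for the stride-4, VF=32 interleaved store:
// Out = c0 m0 y0 k0 c1 m1 y1 k1 ... c31 m31 y31 k31
std::array<std::uint8_t, 128>
interleave4x32(const std::array<std::uint8_t, 32> &C,
               const std::array<std::uint8_t, 32> &M,
               const std::array<std::uint8_t, 32> &Y,
               const std::array<std::uint8_t, 32> &K) {
  std::array<std::uint8_t, 128> Out{};
  for (std::size_t i = 0; i < 32; ++i) {
    Out[4 * i + 0] = C[i];
    Out[4 * i + 1] = M[i];
    Out[4 * i + 2] = Y[i];
    Out[4 * i + 3] = K[i];
  }
  return Out;
}
```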

Diff Detail

Event Timeline

m_zuckerman created this revision.Jun 25 2017, 12:38 AM

m_zuckerman added a parent revision: D32658: Supports lowerInterleavedStore() in X86InterleavedAccess..

m_zuckerman added a reviewer: DavidKreitzer.Jun 25 2017, 1:54 AM

Would it be beneficial to work on a more general solution for the 128-bit subvector issue? Won't 16x16 and 8x32 (as well as all the 512-bit equivalents) still suffer?

lib/Target/X86/X86InterleavedAccess.cpp
369–370	Indenting / clang-format
384	If you pull out the Type::getInt16Ty(Shuffles[0]->getContext()) you should be able to tidy all this up
test/CodeGen/X86/x86-interleaved-access.ll
130	Add this test to trunk with current codegen so this patch shows the diff.
test/Transforms/InterleavedAccess/X86/interleavedStore.ll
1	Add this file to trunk with current codegen so this patch shows the diff.
2	You just need -mattr=+avx2 - it implies -mattr=+avx

dorit mentioned this in rL306238: [AVX2] [TTI CostModel] Add cost of interleaved loads/stores for AVX2.Jun 26 2017, 2:47 AM

m_zuckerman updated this revision to Diff 103953.Jun 26 2017, 7:59 AM

m_zuckerman marked 4 inline comments as done.

m_zuckerman marked an inline comment as done.

In D34601#789983, @RKSimon wrote:

Would it be beneficial to work on a more general solution for the 128-bit subvector issue? Won't 16x16 and 8x32 (as well as all the 512-bit equivalents) still suffer?

In general, we plan to generalize the problem and this is why we added the two mask functions.

ping

Just a few minor comments. Looks good to me otherwise, but should give @RKSimon a chance to see if he's happy with your responses (and maybe also another day to give @Farhana/@DavidKreitzer a chance to comment).

lib/Target/X86/X86InterleavedAccess.cpp
105–107	Should this comment be updated?
182	80-column overflow... (and while you're fixing this, maybe use the same capitalization in naming of arguments (i.e. - either "vec1" or "VEC1" form but not both)).
422	Have you tried enabling this also for AVX? (I understand if not, because with the current cost numbers that TTI returns for interleaved accesses on AVX we'll probably determine it's not worth vectorizing... so that may need to come along with an update of getInterleavedMemoryOpCost -- maybe at least a TODO comment is needed).

m_zuckerman updated this revision to Diff 104424.Jun 28 2017, 8:30 AM

m_zuckerman marked 2 inline comments as done and an inline comment as not done.

In D34601#791846, @dorit wrote:

Have you tried enabling this also for AVX? (I understand if not, because with the current cost numbers that TTI returns for interleaved accesses on AVX we'll probably determine it's not worth vectorizing... so that may need to come along with an update of getInterleavedMemoryOpCost -- maybe at least a TODO comment is needed).

Thanks, I tried it and the result was good for AVX.
You can see in the test interleavedStore.ll that we get fewer instructions compared to the old code.

m_zuckerman updated this revision to Diff 104990.Jul 1 2017, 5:11 AM

In the new patch I added a test that checks the AVX sequence, and a new if that verifies that VF32 is used with store instructions.

zvi added a subscriber: zvi.Jul 4 2017, 11:19 PM

Michael, can you please add AVX512 targets to the tests so we ensure that the AVX512 targets are at least as good as the AVX2 targets?

In D34601#799242, @zvi wrote:

Michael, can you please add AVX512 targets to the tests so we ensure that the AVX512 targets are at least as good as the AVX2 targets?

No problem

Farhana added inline comments.Jul 5 2017, 5:07 PM

lib/Target/X86/X86InterleavedAccess.cpp
166	Why not use the createUnpackShuffleMask(..) defined in X86ISelLowering.cpp? I would recommend declaring a template function for this in X86ISelLowering.h, which would allow both int and int32_t masks, and removing this one from X86InterleavedAccess.cpp.
182	The comment does not match the coding logic. Something is off here. Maybe it should be NumElement+NumElement/2?
201	I would think you could write a more general transpose function, a function of 8 elements which would scale to 16 and 32. Why is it written only for a VLen of 32?
380	Value *VecInst = SI->getPointerOperand()
380	What is the reason for generating four stores instead of 1 wide store? I would think codegen is smart enough to generate the expected 4 optimized stores here. Can you please keep it consistent with the other case, which means do the concatenation of the transposed vectors and generate 1 wide store.
	  StoreInst *SI = cast<StoreInst>(Inst);
	  Value *BasePtr = SI->getPointerOperand();
	  case 32: {
	    transposeChar_32x4(DecomposedVectors, TransposedVectors);
	    // VecInst contains the Ptr argument.
	    Value *VecInst = SI->getPointerOperand();
	    Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
	    // From <128xi8>* to <64xi16>*
	    Type *VecTran = VectorType::get(IntOf16, 64)->getPointerTo();
	    BasePtr = Builder.CreateBitCast(VecInst, VecTran);
	And move the following two statements out of the switch.
	  // 3. Concatenate the contiguous-vectors back into a wide vector.
	  Value *WideVec = concatenateVectors(Builder, TransposedVectors);
	  // 4. Generate a store instruction for wide-vec.
	  Builder.CreateAlignedStore(WideVec, BasePtr, // replace this with
	                             SI->getAlignment());
385	++i

Hi Michael, sorry it took a while for me to get to this. I was on vacation last week.

lib/Target/X86/X86InterleavedAccess.cpp
73–74	"transpose" is a poor name here. "interleave" would be better. Also, I would prefer "8bit" or "1byte" to "Char", e.g. interleave8bit_32x4. "transpose" works for the 4x4 case (and other NxN cases), because the shuffle sequence does a matrix transpose on the input vectors, and the same code can be used for interleaving and de-interleaving. To handle the 32x8 load case, we would need a different code sequence than what you are currently generating in transposeChar_32x4. Presumably, we would use deinterleave8bit_32x4 for that.
74	Trasposed --> Transposed. Can you fix this at line 72 as well?
191	Perhaps change "BeginIndex" to "CurrentIndex" since you are updating it as you go.
256	I think it is unnecessary and undesirable to do bitcasts here. It complicates both the IR and the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence, which now has to account for the interleave function returning a different type in "TransposedVectors" than the original "DecomposedVectors". Instead, you just need to "scale" your <16 x i32> masks to <32 x i32> like this: <a, b, ..., p> --> <a*2, a*2+1, b*2, b*2+1, ..., p*2, p*2+1> There is an existing routine to do this scaling in X86ISelLowering.cpp, scaleShuffleMask. Also see canWidenShuffleElements, which is how the 32xi8 shuffle gets automatically converted to a 16xi16 shuffle by the CodeGen, ultimately causing the desired unpckwd instructions to be generated.
380	+1
422	This comment is no longer accurate since you enabled this for AVX. It is probably okay to simply delete this comment since it is redundant with 105-107.
test/CodeGen/X86/x86-interleaved-access.ll
199–204	The codegen improvements here look great!
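
The mask scaling suggested in the comment on line 256 can be sketched in isolation as follows. This is a hedged illustration, not the scaleShuffleMask routine from X86ISelLowering.cpp: the name scaleMask, the std::vector types, and the pass-through of negative entries (often used as undef sentinels) are assumptions of this sketch.

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone analogue of shuffle-mask scaling.
// Each entry M that selects one wide element becomes Scale consecutive
// entries M*Scale, M*Scale+1, ..., selecting the narrow elements it covers.
// Negative entries are copied unchanged (assumed sentinel handling).
std::vector<int> scaleMask(int Scale, const std::vector<int> &Mask) {
  assert(Scale > 0 && "scale must be positive");
  std::vector<int> Scaled;
  Scaled.reserve(Mask.size() * Scale);
  for (int M : Mask)
    for (int s = 0; s < Scale; ++s)
      Scaled.push_back(M < 0 ? M : M * Scale + s);
  return Scaled;
}
```

With a scale of 2, a 16-entry word mask becomes a 32-entry byte mask, so the shuffle can stay on the original <32 x i8> type without any bitcast.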

m_zuckerman updated this revision to Diff 106158.Jul 12 2017, 2:33 AM

m_zuckerman marked 9 inline comments as done.

m_zuckerman added inline comments.

lib/Target/X86/X86InterleavedAccess.cpp
201	You are right and we plan to do so in the next patch.

m_zuckerman marked an inline comment as done.Jul 12 2017, 6:15 AM

Thanks for the changes, Michael, and thanks for following up on the perf issue in https://bugs.llvm.org/show_bug.cgi?id=33740.

I have a few additional comments, but this generally looks good. The simplification of the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence is great!

lib/Target/X86/X86ISelLowering.h
1440 ↗	(On Diff #106158)	Fix the indenting here.
lib/Target/X86/X86InterleavedAccess.cpp
73–74	I see that you changed this to "deinterleave8bit_32x4" rather than "interleave8bit_32x4". Can you please explain why? This routine is taking 4 input vectors and merging their elements like this: v0[0], v1[0], v2[0], v3[0], v0[1], v1[1], v2[1], v3[1], ... Wouldn't you call that interleaving?
114–116	The old wording here was better, and you have a typo in "oration".
194	Would it be possible to simplify the implementation of this routine by just calling createUnpackShuffleMask with an i16 type plus calling scaleShuffleMask with a scale of 2? (That would require you to move scaleShuffleMask into X86ISelLowering.h like you did with createUnpackShuffleMask.)

m_zuckerman added inline comments.Jul 12 2017, 9:50 AM

lib/Target/X86/X86InterleavedAccess.cpp
73–74	You are right, I swapped the terminology.

m_zuckerman updated this revision to Diff 106407.Jul 13 2017, 4:33 AM

m_zuckerman marked 3 inline comments as done.

Hi Michael, I have one small additional comment. Otherwise, this looks good.

lib/Target/X86/X86ISelLowering.h
1457 ↗	(On Diff #106407)	Did you want to have T default to int as you did for createUnpackShuffleMask? That will avoid the extra changes in X86ISelLowering.cpp.

In D34601#808867, @DavidKreitzer wrote:

Hi Michael, I have one small additional comment. Otherwise, this looks good.

This is the error I get when I try to define the function with a default parameter:
'void llvm::scaleShuffleMask(int,llvm::ArrayRef<T>,llvm::SmallVectorImpl<T> &)': could not deduce template argument for 'llvm::ArrayRef<T>' from 'llvm::SmallVector<int,4>'

If you prefer not to touch the CPP file, we can define an overloaded function in the header file that explicitly sets the function signature.

The function in the header will look like:

template <typename T1, unsigned N>
void scaleShuffleMask(int Scale, SmallVector<T1, N> Mask,
                      SmallVectorImpl<T1> &ScaledMask) {
  scaleShuffleMask(Scale, makeArrayRef(Mask), ScaledMask);
}

This function describes another signature, the one that one of the calls in the CPP file uses.

The other option is as I proposed on this review.

Which do you prefer?
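
The deduction failure and the forwarding-overload workaround discussed above can be reproduced without LLVM. In this minimal sketch, View is a hypothetical stand-in for a non-owning array view such as llvm::ArrayRef, std::vector stands in for SmallVector, and scaleMask is an illustrative name. Template argument deduction does not look through the converting constructor, so only the forwarding overload matches a call that passes the container directly.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a non-owning array view.
template <typename T> struct View {
  const T *Data;
  std::size_t Size;
  View(const std::vector<T> &V) : Data(V.data()), Size(V.size()) {}
};

// Primary routine: T is deducible only when the argument already is a View<T>.
template <typename T>
void scaleMask(int Scale, View<T> Mask, std::vector<T> &Scaled) {
  Scaled.clear();
  for (std::size_t i = 0; i < Mask.Size; ++i)
    for (int s = 0; s < Scale; ++s)
      Scaled.push_back(Mask.Data[i] * Scale + s);
}

// Forwarding overload: takes the container itself and builds the view
// explicitly, so callers do not have to spell out the template argument.
template <typename T>
void scaleMask(int Scale, const std::vector<T> &Mask, std::vector<T> &Scaled) {
  scaleMask(Scale, View<T>(Mask), Scaled);
}
```

Without the second overload, a call such as scaleMask(2, SomeVector, Out) would need an explicit template argument, which is the trade-off being weighed in this thread.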

ping

RKSimon added inline comments.Jul 19 2017, 2:44 AM

lib/Target/X86/X86ISelLowering.h
1457 ↗	(On Diff #106407)	+1
lib/Target/X86/X86InterleavedAccess.cpp
172	shuffle
185	I think you could make this much simpler to understand:
	  int NumHalfElements = NumElement / 2;
	  int Offset = Low ? 0 : NumHalfElements;
	  for (int i = 0; i < NumHalfElements; ++i)
	    Mask.push_back(i + Offset);
	  for (int i = 0; i < NumHalfElements; ++i)
	    Mask.push_back(i + Offset + NumElement);
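
For reference, the simplification suggested for line 185 can be written as a self-contained function. The name makeConcatMask and the std::vector return type are assumptions for illustration; the two loops are the ones from the comment.

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone version of the concat mask.
// Low = true:  concat(low_half(VEC1),  low_half(VEC2))
// Low = false: concat(high_half(VEC1), high_half(VEC2))
std::vector<int> makeConcatMask(int NumElement, bool Low) {
  assert(NumElement > 0 && NumElement % 2 == 0 && "need an even element count");
  const int NumHalfElements = NumElement / 2;
  const int Offset = Low ? 0 : NumHalfElements;
  std::vector<int> Mask;
  Mask.reserve(NumElement);
  // Selected half of the first operand.
  for (int i = 0; i < NumHalfElements; ++i)
    Mask.push_back(i + Offset);
  // Same half of the second operand (indices are offset by NumElement).
  for (int i = 0; i < NumHalfElements; ++i)
    Mask.push_back(i + Offset + NumElement);
  return Mask;
}
```

For NumElement = 16 this yields 0..7, 16..23 when Low is true and 8..15, 24..31 when Low is false, the same values the BeginIndex/EndIndex loop in the diff produces.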

In D34601#810757, @m_zuckerman wrote:
In D34601#808867, @DavidKreitzer wrote:

Hi Michael, I have one small additional comment. Otherwise, this looks good.

This is the error I get when I try to define the function with a default parameter:
'void llvm::scaleShuffleMask(int,llvm::ArrayRef<T>,llvm::SmallVectorImpl<T> &)': could not deduce template argument for 'llvm::ArrayRef<T>' from 'llvm::SmallVector<int,4>'

If you prefer not to touch the CPP file, we can define an overloaded function in the header file that explicitly sets the function signature.

The function in the header will look like:

template <typename T1, unsigned N>
void scaleShuffleMask(int Scale, SmallVector<T1, N> Mask,
                      SmallVectorImpl<T1> &ScaledMask) {
  scaleShuffleMask(Scale, makeArrayRef(Mask), ScaledMask);
}

This function describes another signature, the one that one of the calls in the CPP file uses.

The other option is as I proposed on this review.

Which do you prefer?

I am not a fan of either of those options. How about getting rid of the template argument altogether and just use "int"? That is what ShuffleVectorInstruction::getShuffleMask uses, so it seems like it should be good enough here too. If you take this approach, it would be good to modify transpose_4x4 to use "int" instead of "uint32_t" for consistency.

m_zuckerman updated this revision to Diff 107663.Jul 21 2017, 6:06 AM

m_zuckerman marked an inline comment as done.

Builder.CreateShuffleVector accepts only uint32_t. I created a static CreateShuffleVector that does a reinterpret_cast from int to uint. This removes our dependency on the template.

DavidKreitzer added inline comments.Jul 21 2017, 11:05 AM

lib/Target/X86/X86InterleavedAccess.cpp
185	Ugh! I didn't realize that you were going to have to add this method to do the ArrayRef conversion. Sorry, I think I gave you bad advice. I think this is a worse evil than the explicit template arguments in the previous patch. Unless someone else has a better idea, my recommendation is to go back to your previous solution and accept the explicit template arguments as an unfortunate consequence of the inconsistent ShuffleVector interfaces.

m_zuckerman updated this revision to Diff 107808.Jul 23 2017, 12:21 AM

I returned to the previous patch with a small modification, as @RKSimon asked.

Thanks, Michael. This LGTM pending @RKSimon's re-review. But can you please update your sources so that we can see how you merged with https://reviews.llvm.org/D35638, since that will cause conflicts?

m_zuckerman updated this revision to Diff 107938.Jul 24 2017, 11:57 AM

Thanks, Michael. LGTM, again pending @RKSimon's re-review.

This revision is now accepted and ready to land.Jul 24 2017, 12:27 PM

m_zuckerman added a child revision: D35829: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess (VF16 stride 4)..Jul 25 2017, 3:28 AM

LGTM - if possible moving createUnpackShuffleMask and scaleShuffleMask should be done as NFC pre-commit.

Diffusion mentioned this in rL309084: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess….Jul 26 2017, 12:45 AM

Closed by commit rL309086: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess. (authored by mzuckerm). Jul 26 2017, 1:12 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/X86/X86InterleavedAccess.cpp	185 lines
test/CodeGen/X86/x86-interleaved-access.ll	137 lines
test/Transforms/InterleavedAccess/X86/interleavedStore.ll	33 lines

Diff 104424

lib/Target/X86/X86InterleavedAccess.cpp

Show First 20 Lines • Show All 64 Lines • ▼ Show 20 Lines	class X86InterleavedAccessGroup {
/// In-V3 = s1, s2, s3, s4		/// In-V3 = s1, s2, s3, s4
/// OutputVectors:		/// OutputVectors:
/// Out-V0 = p1, q1, r1, s1		/// Out-V0 = p1, q1, r1, s1
/// Out-V1 = p2, q2, r2, s2		/// Out-V1 = p2, q2, r2, s2
/// Out-V2 = p3, q3, r3, s3		/// Out-V2 = p3, q3, r3, s3
/// Out-V3 = P4, q4, r4, s4		/// Out-V3 = P4, q4, r4, s4
void transpose_4x4(ArrayRef<Instruction *> InputVectors,		void transpose_4x4(ArrayRef<Instruction *> InputVectors,
SmallVectorImpl<Value *> &TrasposedVectors);		SmallVectorImpl<Value *> &TrasposedVectors);
		void transposeChar_32x4(ArrayRef<Instruction *> InputVectors,
		SmallVectorImpl<Value *> &TrasposedVectors);
public:		public:
/// In order to form an interleaved access group X86InterleavedAccessGroup		/// In order to form an interleaved access group X86InterleavedAccessGroup
/// requires a wide-load instruction \p 'I', a group of interleaved-vectors		/// requires a wide-load instruction \p 'I', a group of interleaved-vectors
/// \p Shuffs, reference to the first indices of each interleaved-vector		/// \p Shuffs, reference to the first indices of each interleaved-vector
/// \p 'Ind' and the interleaving stride factor \p F. In order to generate		/// \p 'Ind' and the interleaving stride factor \p F. In order to generate
/// X86-specific instructions/intrinsics it also requires the underlying		/// X86-specific instructions/intrinsics it also requires the underlying
/// target information \p STarget.		/// target information \p STarget.
explicit X86InterleavedAccessGroup(Instruction *I,		explicit X86InterleavedAccessGroup(Instruction *I,
Show All 14 Lines
};		};
} // end anonymous namespace		} // end anonymous namespace

bool X86InterleavedAccessGroup::isSupported() const {		bool X86InterleavedAccessGroup::isSupported() const {
VectorType *ShuffleVecTy = Shuffles[0]->getType();		VectorType *ShuffleVecTy = Shuffles[0]->getType();
uint64_t ShuffleVecSize = DL.getTypeSizeInBits(ShuffleVecTy);		uint64_t ShuffleVecSize = DL.getTypeSizeInBits(ShuffleVecTy);
Type *ShuffleEltTy = ShuffleVecTy->getVectorElementType();		Type *ShuffleEltTy = ShuffleVecTy->getVectorElementType();

// Currently, lowering is supported for 4-element vectors of 64 bits on AVX.		// Currently, lowering is supported for the following vectors:
		// 1. 4-element vectors of 64 bits on AVX.
		// 2. 32-element vectors of 8 bits on AVX.
uint64_t ExpectedShuffleVecSize;		uint64_t ExpectedShuffleVecSize;
if (isa<LoadInst>(Inst))		if (isa<LoadInst>(Inst))
ExpectedShuffleVecSize = 256;		ExpectedShuffleVecSize = 256;
else		else
ExpectedShuffleVecSize = 1024;		ExpectedShuffleVecSize = 1024;

if (!Subtarget.hasAVX() || ShuffleVecSize != ExpectedShuffleVecSize ||		if (((DL.getTypeSizeInBits(ShuffleEltTy) != 64) &&
DL.getTypeSizeInBits(ShuffleEltTy) != 64 || Factor != 4)		(DL.getTypeSizeInBits(ShuffleEltTy) != 8)) || !Subtarget.hasAVX() ||
		ShuffleVecSize != ExpectedShuffleVecSize || Factor != 4)
return false;		return false;

return true;		return true;
}		}

void X86InterleavedAccessGroup::decompose(		void X86InterleavedAccessGroup::decompose(
Instruction *VecInst, unsigned NumSubVectors, VectorType *SubVecTy,		Instruction *VecInst, unsigned NumSubVectors, VectorType *SubVecTy,
SmallVectorImpl<Instruction *> &DecomposedVectors) {		SmallVectorImpl<Instruction *> &DecomposedVectors) {
Show All 32 Lines	for (unsigned i = 0; i < NumSubVectors; i++) {
// TODO: Support inbounds GEP.		// TODO: Support inbounds GEP.
Value *NewBasePtr = Builder.CreateGEP(VecBasePtr, Builder.getInt32(i));		Value *NewBasePtr = Builder.CreateGEP(VecBasePtr, Builder.getInt32(i));
Instruction *NewLoad =		Instruction *NewLoad =
Builder.CreateAlignedLoad(NewBasePtr, LI->getAlignment());		Builder.CreateAlignedLoad(NewBasePtr, LI->getAlignment());
DecomposedVectors.push_back(NewLoad);		DecomposedVectors.push_back(NewLoad);
}		}
}		}

		/// Generate unpacklo/unpackhi shuffle mask.
		static void createUnpackShuffleMask(int NumElts, SmallVectorImpl<uint32_t> &Mask,
		bool Lo, bool Unary) {
		int NumEltsInLane = NumElts / 2;
		assert(Mask.empty() && "Expected an empty shuffle mask vector");
		for (int i = 0; i < NumElts; ++i) {
		unsigned LaneStart = (i / NumEltsInLane) * NumEltsInLane;
		int Pos = (i % NumEltsInLane) / 2 + LaneStart;
		Pos += (Unary ? 0 : NumElts * (i % 2));
		Pos += (Lo ? 0 : NumEltsInLane / 2);
		Mask.push_back(Pos);
		}
		}

		// Create shuffle mask for concatenation of two half vectors.
		// Low = false: mask generated for the shuffle
		// shufle(VEC1,VEC2,{NumElement/2, NumElement/2+1, NumElement/2+2...,
		// NumElement-1, NumElement-NumElement/2,
		// NumElement-NumElement/2+1..., 2*NumElement-1})
		// = concat(high_half(VEC1),high_half(VEC2))
		// Low = true: mask generated for the shuffle
		// shufle(VEC1,VEC2,{0,1,2,...,NumElement/2-1,NumElement,
		// NumElement+1...,NumElement+NumElement/2-1})
		// = concat(low_half(VEC1),low_half(VEC2))
		static void createConcatShuffleMask(int NumElement,
		SmallVectorImpl<uint32_t> &Mask, bool Low) {
		int BeginIndex = Low ? 0 : NumElement / 2;
		int EndIndex = BeginIndex + NumElement / 2;
		for (int i = 0; i < NumElement; ++i) {
		if (BeginIndex == EndIndex)
		BeginIndex += NumElement / 2;
		Mask.push_back(BeginIndex);
		BeginIndex++;
		}
		}

		void X86InterleavedAccessGroup::transposeChar_32x4(
		ArrayRef<Instruction *> Matrix,
		SmallVectorImpl<Value *> &TransposedMatrix) {

		// Example: Assuming we start from the following vectors:
		// Matrix[0]= c0 c1 c2 c3 c4 ... c31
		// Matrix[1]= m0 m1 m2 m3 m4 ... m31
		// Matrix[2]= y0 y1 y2 y3 y4 ... y31
		// Matrix[3]= k0 k1 k2 k3 k4 ... k31

		TransposedMatrix.resize(4);

		SmallVector<uint32_t, 32> MaskHighTemp;
		SmallVector<uint32_t, 32> MaskLowTemp;
		SmallVector<uint32_t, 32> MaskHighTemp1;
		SmallVector<uint32_t, 32> MaskLowTemp1;
		SmallVector<uint32_t, 32> ConcatLow;
		SmallVector<uint32_t, 32> ConcatHigh;

		// MaskHighTemp and MaskLowTemp built in the vpunpckhbw and vpunpcklbw X86
		// shuffle pattern.
		createUnpackShuffleMask(32, MaskHighTemp, false, false);
		createUnpackShuffleMask(32, MaskLowTemp, true, false);

		// MaskHighTemp1 and MaskLowTemp1 built in the vpunpckhdw and vpunpckldw X86
		// shuffle pattern.
		createUnpackShuffleMask(16, MaskLowTemp1, true, false);
		createUnpackShuffleMask(16, MaskHighTemp1, false, false);

		// ConcatHigh and ConcatLow built in the vperm2i128 and vinserti128 X86
		// shuffle pattern.
		createConcatShuffleMask(16, ConcatLow, true);
		createConcatShuffleMask(16, ConcatHigh, false);

		ArrayRef<uint32_t> MaskHigh = makeArrayRef(MaskHighTemp);
		ArrayRef<uint32_t> MaskLow = makeArrayRef(MaskLowTemp);
		ArrayRef<uint32_t> MaskConcatLow = makeArrayRef(ConcatLow);
		ArrayRef<uint32_t> MaskConcatHigh = makeArrayRef(ConcatHigh);
		ArrayRef<uint32_t> MaskHighWord = makeArrayRef(MaskHighTemp1);
		ArrayRef<uint32_t> MaskLowWord = makeArrayRef(MaskLowTemp1);

		// IntrVec1Low = c0 m0 c1 m1 ... c7 m7 | c16 m16 c17 m17 ... c23 m23
		// IntrVec1High = c8 m8 c9 m9 ... c15 m15 | c24 m24 c25 m25 ... c31 m31
		// IntrVec2Low = y0 k0 y1 k1 ... y7 k7 | y16 k16 y17 k17 ... y23 k23
		// IntrVec2High = y8 k8 y9 k9 ... y15 k15 | y24 k24 y25 k25 ... y31 k31

		Value *IntrVec1Low =
		Builder.CreateShuffleVector(Matrix[0], Matrix[1], MaskLow);
		Value *IntrVec1High =
		Builder.CreateShuffleVector(Matrix[0], Matrix[1], MaskHigh);
		Value *IntrVec2Low =
		Builder.CreateShuffleVector(Matrix[2], Matrix[3], MaskLow);
		Value *IntrVec2High =
		Builder.CreateShuffleVector(Matrix[2], Matrix[3], MaskHigh);

		IntrVec1Low = Builder.CreateBitCast(
		IntrVec1Low,
		VectorType::get(Type::getInt16Ty(Shuffles[0]->getContext()), 16));
		IntrVec1High = Builder.CreateBitCast(
		IntrVec1High,
		VectorType::get(Type::getInt16Ty(Shuffles[0]->getContext()), 16));
		IntrVec2Low = Builder.CreateBitCast(
		IntrVec2Low,
		VectorType::get(Type::getInt16Ty(Shuffles[0]->getContext()), 16));
		IntrVec2High = Builder.CreateBitCast(
		IntrVec2High,
		VectorType::get(Type::getInt16Ty(Shuffles[0]->getContext()), 16));

		// cmyk4 cmyk5 cmyk6 cmyk7 | cmyk20 cmyk21 cmyk22 cmyk23
		// cmyk12 cmyk13 cmyk14 cmyk15 | cmyk28 cmyk29 cmyk30 cmyk31
		// cmyk0 cmyk1 cmyk2 cmyk3 | cmyk16 cmyk17 cmyk18 cmyk19
		// cmyk8 cmyk9 cmyk10 cmyk11 | cmyk24 cmyk25 cmyk26 cmyk27

		Value *High = Builder.CreateShuffleVector(IntrVec1Low, IntrVec2Low,
		MaskHighWord);
		Value *High1 = Builder.CreateShuffleVector(IntrVec1High, IntrVec2High,
		MaskHighWord);
		Value *Low = Builder.CreateShuffleVector(IntrVec1Low, IntrVec2Low,
		MaskLowWord);
		Value *Low1 = Builder.CreateShuffleVector(IntrVec1High, IntrVec2High,
		MaskLowWord);

		// cmyk0 cmyk1 cmyk2 cmyk3 | cmyk4 cmyk5 cmyk6 cmyk7
		// cmyk8 cmyk9 cmyk10 cmyk11 | cmyk12 cmyk13 cmyk14 cmyk15
		// cmyk16 cmyk17 cmyk18 cmyk19 | cmyk20 cmyk21 cmyk22 cmyk23
		// cmyk24 cmyk25 cmyk26 cmyk27 | cmyk28 cmyk29 cmyk30 cmyk31

		TransposedMatrix[0] =
		Builder.CreateShuffleVector(Low, High, MaskConcatLow);
		TransposedMatrix[1] =
		Builder.CreateShuffleVector(Low1, High1, MaskConcatLow);
		TransposedMatrix[2] =
		Builder.CreateShuffleVector(Low, High, MaskConcatHigh);
		TransposedMatrix[3] =
		Builder.CreateShuffleVector(Low1, High1, MaskConcatHigh);
		}

void X86InterleavedAccessGroup::transpose_4x4(		void X86InterleavedAccessGroup::transpose_4x4(
ArrayRef<Instruction *> Matrix,		ArrayRef<Instruction *> Matrix,
SmallVectorImpl<Value *> &TransposedMatrix) {		SmallVectorImpl<Value *> &TransposedMatrix) {
assert(Matrix.size() == 4 && "Invalid matrix size");		assert(Matrix.size() == 4 && "Invalid matrix size");
TransposedMatrix.resize(4);		TransposedMatrix.resize(4);

// dst = src1[0,1],src2[0,1]		// dst = src1[0,1],src2[0,1]
uint32_t IntMask1[] = {0, 1, 4, 5};		uint32_t IntMask1[] = {0, 1, 4, 5};
▲ Show 20 Lines • Show All 50 Lines • ▼ Show 20 Lines	bool X86InterleavedAccessGroup::lowerIntoOptimizedSequence() {
// Lower the interleaved stores:		// Lower the interleaved stores:
// 1. Decompose the interleaved wide shuffle into individual shuffle		// 1. Decompose the interleaved wide shuffle into individual shuffle
// vectors.		// vectors.
decompose(Shuffles[0], Factor,		decompose(Shuffles[0], Factor,
VectorType::get(ShuffleEltTy, NumSubVecElems), DecomposedVectors);		VectorType::get(ShuffleEltTy, NumSubVecElems), DecomposedVectors);

// 2. Transpose the interleaved-vectors into vectors of contiguous		// 2. Transpose the interleaved-vectors into vectors of contiguous
// elements.		// elements.
		StoreInst *SI = cast<StoreInst>(Inst);
		switch (NumSubVecElems) {
		case 4: {
transpose_4x4(DecomposedVectors, TransposedVectors);		transpose_4x4(DecomposedVectors, TransposedVectors);

// 3. Concatenate the contiguous-vectors back into a wide vector.		// 3. Concatenate the contiguous-vectors back into a wide vector.
Value *WideVec = concatenateVectors(Builder, TransposedVectors);		Value *WideVec = concatenateVectors(Builder, TransposedVectors);
		RKSimon: Indenting / clang-format

// 4. Generate a store instruction for wide-vec.		// 4. Generate a store instruction for wide-vec.
StoreInst *SI = cast<StoreInst>(Inst);
Builder.CreateAlignedStore(WideVec, SI->getPointerOperand(),		Builder.CreateAlignedStore(WideVec, SI->getPointerOperand(),
SI->getAlignment());		SI->getAlignment());
		break;
		}
		case 32: {
		transposeChar_32x4(DecomposedVectors, TransposedVectors);
		// VecInst contains the Ptr argument.
		Value *VecInst = Inst->getOperand(1);
		Farhana: Value *VecInst = SI->getPointerOperand()
		Farhana: What is the reason for generating four stores instead of 1 wide store? I would think codegen is smart enough to generate the expected 4 optimized stores here. Can you please keep it consistent with the other case, which means doing the concatenation of the transposed vectors and generating 1 wide store.
		StoreInst *SI = cast<StoreInst>(Inst);
		Value *BasePtr = SI->getPointerOperand();
		case 32: {
		  transposeChar_32x4(DecomposedVectors, TransposedVectors);
		  // VecInst contains the Ptr argument.
		  Value *VecInst = SI->getPointerOperand();
		  Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
		  // From <128xi8>* to <64xi16>*
		  Type *VecTran = VectorType::get(IntOf16, 64)->getPointerTo();
		  BasePtr = Builder.CreateBitCast(VecInst, VecTran);
		And move the following two statements out of the switch.
		// 3. Concatenate the contiguous-vectors back into a wide vector.
		Value *WideVec = concatenateVectors(Builder, TransposedVectors);
		// 4. Generate a store instruction for wide-vec.
		Builder.CreateAlignedStore(WideVec, BasePtr, // replace this with
		                           SI->getAlignment());
		DavidKreitzer: +1
		Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
		// From <128xi8>* to <16xi16>*
		Type *VecTran = VectorType::get(IntOf16, 16)->getPointerTo();
		Value *VecBasePtr = Builder.CreateBitCast(VecInst, VecTran);
		RKSimon: If you pull out the Type::getInt16Ty(Shuffles[0]->getContext()) you should be able to tidy all this up
		for (unsigned i = 0; i < 4; i++) {
		Farhana: ++i
		Value *NewBasePtr = Builder.CreateGEP(VecBasePtr, Builder.getInt32(i));
		Builder.CreateAlignedStore(TransposedVectors[i], NewBasePtr,
		SI->getAlignment());
		}
		break;
		}
		default:
		return false;
		}

return true;		return true;
}		}
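
The `case 32` path above issues four 32-byte stores through a `<16 x i16>*` base pointer, while the review asks for one wide store of the concatenated vector. A minimal sketch of why both layouts write the same bytes, under the assumption that memory is a flat byte array; `chunkByteOffset`, `storeFourChunks` and `storeWide` are hypothetical names, and the sketch does not model alignment or what the backend actually emits.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Row = std::array<std::uint8_t, 32>;
using Wide = std::array<std::uint8_t, 128>;

// Byte offset reached by indexing a pointer-to-vector by Index,
// e.g. a <16 x i16>* advanced by 1 moves 16 * 16 / 8 = 32 bytes.
constexpr unsigned chunkByteOffset(unsigned Index, unsigned NumElts,
                                   unsigned EltBits) {
  return Index * NumElts * EltBits / 8;
}

// Four narrow stores, one per transposed row.
Wide storeFourChunks(const std::array<Row, 4> &Rows) {
  Wide Mem{};
  for (unsigned I = 0; I < 4; ++I) {
    const unsigned Offset = chunkByteOffset(I, 16, 16);
    assert(Offset + Rows[I].size() <= Mem.size() && "store out of bounds");
    std::memcpy(Mem.data() + Offset, Rows[I].data(), Rows[I].size());
  }
  return Mem;
}

// Concatenate the rows first, then issue a single wide store.
Wide storeWide(const std::array<Row, 4> &Rows) {
  Wide Concat{};
  std::size_t Pos = 0;
  for (const Row &R : Rows)
    for (std::uint8_t Byte : R)
      Concat[Pos++] = Byte;
  Wide Mem{};
  std::memcpy(Mem.data(), Concat.data(), Concat.size());
  return Mem;
}
```

The offsets 0, 32, 64 and 96 computed by `chunkByteOffset(I, 16, 16)` correspond to the `(%rdi)`, `32(%rdi)`, `64(%rdi)` and `96(%rdi)` stores in the CHECK lines of the test file.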

// Lower interleaved load(s) into target specific instructions/		// Lower interleaved load(s) into target specific instructions/
// intrinsics. Lowering sequence varies depending on the vector-types, factor,		// intrinsics. Lowering sequence varies depending on the vector-types, factor,
// number of shuffles and ISA.		// number of shuffles and ISA.
// Currently, lowering is supported for 4x64 bits with Factor = 4 on AVX.		// Currently, lowering is supported for 4x64 bits with Factor = 4 on AVX.
		bool X86TargetLowering::lowerInterleavedLoad(
// Create an interleaved access group.		// Create an interleaved access group.
IRBuilder<> Builder(LI);		IRBuilder<> Builder(LI);
X86InterleavedAccessGroup Grp(LI, Shuffles, Indices, Factor, Subtarget,		X86InterleavedAccessGroup Grp(LI, Shuffles, Indices, Factor, Subtarget,
Builder);		Builder);

return Grp.isSupported() && Grp.lowerIntoOptimizedSequence();		return Grp.isSupported() && Grp.lowerIntoOptimizedSequence();
}		}

		// Currently lowering is supported for the following interleaves:
		// 1. stride4 x 64bit elements with vector factor 4 on AVX
		// 2. stride4 x 8bit elements with vector factor 32 on AVX2
		dorit: Have you tried enabling this also for AVX? (I understand if not, because with the current cost numbers that TTI returns for interleaved accesses on AVX we'll probably determine it's not worth vectorizing... so that may need to come along with an update of getInterleavedMemoryOpCost -- maybe at least a TODO comment is needed).
		DavidKreitzer: This comment is no longer accurate since you enabled this for AVX. It is probably okay to simply delete this comment since it is redundant with 105-107.
bool X86TargetLowering::lowerInterleavedStore(StoreInst *SI,		bool X86TargetLowering::lowerInterleavedStore(StoreInst *SI,
ShuffleVectorInst *SVI,		ShuffleVectorInst *SVI,
unsigned Factor) const {		unsigned Factor) const {
assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&		assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
"Invalid interleave factor");		"Invalid interleave factor");

assert(SVI->getType()->getVectorNumElements() % Factor == 0 &&		assert(SVI->getType()->getVectorNumElements() % Factor == 0 &&
"Invalid interleaved store");		"Invalid interleaved store");

// Holds the indices of SVI that correspond to the starting index of each		// Holds the indices of SVI that correspond to the starting index of each
// interleaved shuffle.		// interleaved shuffle.
SmallVector<unsigned, 4> Indices;		SmallVector<unsigned, 4> Indices;
auto Mask = SVI->getShuffleMask();		auto Mask = SVI->getShuffleMask();
for (unsigned i = 0; i < Factor; i++)		for (unsigned i = 0; i < Factor; i++)
Indices.push_back(Mask[i]);		Indices.push_back(Mask[i]);

ArrayRef<ShuffleVectorInst *> Shuffles = makeArrayRef(SVI);		ArrayRef<ShuffleVectorInst *> Shuffles = makeArrayRef(SVI);

// Create an interleaved access group.		// Create an interleaved access group.
IRBuilder<> Builder(SI);		IRBuilder<> Builder(SI);
X86InterleavedAccessGroup Grp(SI, Shuffles, Indices, Factor, Subtarget,		X86InterleavedAccessGroup Grp(SI, Shuffles, Indices, Factor, Subtarget,
Builder);		Builder);

return Grp.isSupported() && Grp.lowerIntoOptimizedSequence();		return Grp.isSupported() && Grp.lowerIntoOptimizedSequence();
}		}
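
To make the `Indices` computation above concrete, here is a small sketch of what the first `Factor` mask elements look like for the stride-4, VF=32 pattern. `makeInterleaveMask` and `startIndices` are hypothetical helpers written for illustration only; they assume a well-formed interleave mask and use plain `std::vector` rather than LLVM types.

```cpp
#include <cassert>
#include <vector>

// Builds the interleave mask for Factor sub-vectors of VF elements each:
// element J of sub-vector I lives at index I * VF + J in the concatenated input.
std::vector<unsigned> makeInterleaveMask(unsigned VF, unsigned Factor) {
  std::vector<unsigned> Mask;
  Mask.reserve(VF * Factor);
  for (unsigned J = 0; J < VF; ++J)
    for (unsigned I = 0; I < Factor; ++I)
      Mask.push_back(I * VF + J);
  return Mask;
}

// Mirrors the loop in lowerInterleavedStore: the first Factor mask entries
// are the starting index of each interleaved sub-vector.
std::vector<unsigned> startIndices(const std::vector<unsigned> &Mask,
                                   unsigned Factor) {
  assert(Factor != 0 && Mask.size() % Factor == 0 &&
         "Invalid interleaved store");
  return std::vector<unsigned>(Mask.begin(), Mask.begin() + Factor);
}
```

For `makeInterleaveMask(32, 4)` the mask begins 0, 32, 64, 96, 1, 33, 65, 97, which is the `<128 x i32>` mask in the `interleaved_store_vf32_i8_stride4` test, so `startIndices` yields 0, 32, 64 and 96.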

test/CodeGen/X86/x86-interleaved-access.ll

		; AVX2-NEXT: retq
%strided.v2 = shufflevector <16 x i64> %wide.vec, <16 x i64> undef, <4 x i32> <i32 2, i32 6, i32 10, i32 14>		%strided.v2 = shufflevector <16 x i64> %wide.vec, <16 x i64> undef, <4 x i32> <i32 2, i32 6, i32 10, i32 14>
%strided.v3 = shufflevector <16 x i64> %wide.vec, <16 x i64> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>		%strided.v3 = shufflevector <16 x i64> %wide.vec, <16 x i64> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>
%add1 = add <4 x i64> %strided.v0, %strided.v1		%add1 = add <4 x i64> %strided.v0, %strided.v1
%add2 = add <4 x i64> %add1, %strided.v2		%add2 = add <4 x i64> %add1, %strided.v2
%add3 = add <4 x i64> %add2, %strided.v3		%add3 = add <4 x i64> %add2, %strided.v3
ret <4 x i64> %add3		ret <4 x i64> %add3
}		}

define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {		define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {
		RKSimon: Add this test to trunk with current codegen so this patch shows the diff.
; AVX-LABEL: store_factorf64_4:		; AVX-LABEL: store_factorf64_4:
; AVX: # BB#0:		; AVX: # BB#0:
; AVX-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm4		; AVX-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm4
; AVX-NEXT: vinsertf128 $1, %xmm3, %ymm1, %ymm5		; AVX-NEXT: vinsertf128 $1, %xmm3, %ymm1, %ymm5
; AVX-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],ymm2[2,3]		; AVX-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],ymm2[2,3]
; AVX-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm1[2,3],ymm3[2,3]		; AVX-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm1[2,3],ymm3[2,3]
; AVX-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm4[0],ymm5[0],ymm4[2],ymm5[2]		; AVX-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm4[0],ymm5[0],ymm4[2],ymm5[2]
; AVX-NEXT: vunpcklpd {{.*#+}} ymm3 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]		; AVX-NEXT: vunpcklpd {{.*#+}} ymm3 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]
		; AVX2-NEXT: retq
store <16 x i64> %interleaved.vec, <16 x i64>* %ptr, align 16		store <16 x i64> %interleaved.vec, <16 x i64>* %ptr, align 16
ret void		ret void
}		}


define void @interleaved_store_vf32_i8_stride4(<32 x i8> %x1, <32 x i8> %x2, <32 x i8> %x3, <32 x i8> %x4, <128 x i8>* %p) {		define void @interleaved_store_vf32_i8_stride4(<32 x i8> %x1, <32 x i8> %x2, <32 x i8> %x3, <32 x i8> %x4, <128 x i8>* %p) {
; AVX1-LABEL: interleaved_store_vf32_i8_stride4:		; AVX1-LABEL: interleaved_store_vf32_i8_stride4:
; AVX1: # BB#0:		; AVX1: # BB#0:
; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm4 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]		; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm9 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]		; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm5
; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm4 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]		; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm6
; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm4, %ymm5		; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm7 = xmm6[0],xmm5[0],xmm6[1],xmm5[1],xmm6[2],xmm5[2],xmm6[3],xmm5[3],xmm6[4],xmm5[4],xmm6[5],xmm5[5],xmm6[6],xmm5[6],xmm6[7],xmm5[7]
; AVX1-NEXT: vmovaps {{.*#+}} ymm4 = [65535,0,65535,0,65535,0,65535,0,65535,0,65535,0,65535,0,65535,0]		; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm8 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
; AVX1-NEXT: vandnps %ymm5, %ymm4, %ymm5		; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm1 = xmm6[8],xmm5[8],xmm6[9],xmm5[9],xmm6[10],xmm5[10],xmm6[11],xmm5[11],xmm6[12],xmm5[12],xmm6[13],xmm5[13],xmm6[14],xmm5[14],xmm6[15],xmm5[15]
		DavidKreitzer: The codegen improvements here look great!
; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm6 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm6, %ymm6
; AVX1-NEXT: vandps %ymm4, %ymm6, %ymm6
; AVX1-NEXT: vorps %ymm5, %ymm6, %ymm8
; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm6 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm0[4],xmm6[4],xmm0[5],xmm6[5],xmm0[6],xmm6[6],xmm0[7],xmm6[7]
; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm6 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3]
; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm6, %ymm6
; AVX1-NEXT: vandnps %ymm6, %ymm4, %ymm6
; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm7 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm7, %ymm5
; AVX1-NEXT: vandps %ymm4, %ymm5, %ymm5
; AVX1-NEXT: vorps %ymm6, %ymm5, %ymm9
; AVX1-NEXT: vextractf128 $1, %ymm3, %xmm3
; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm2
; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm5 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]		; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm5 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm0[4],xmm5[4],xmm0[5],xmm5[5],xmm0[6],xmm5[6],xmm0[7],xmm5[7]		; AVX1-NEXT: vextractf128 $1, %ymm3, %xmm6
; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm5 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3]		; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm5, %ymm5		; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm4 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3],xmm0[4],xmm6[4],xmm0[5],xmm6[5],xmm0[6],xmm6[6],xmm0[7],xmm6[7]
; AVX1-NEXT: vandnps %ymm5, %ymm4, %ymm5
; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm1
; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm7 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
; AVX1-NEXT: vinsertf128 $1, %xmm6, %ymm7, %ymm6
; AVX1-NEXT: vandps %ymm4, %ymm6, %ymm6
; AVX1-NEXT: vorps %ymm5, %ymm6, %ymm5
; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]		; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]		; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm6[8],xmm0[9],xmm6[9],xmm0[10],xmm6[10],xmm0[11],xmm6[11],xmm0[12],xmm6[12],xmm0[13],xmm6[13],xmm0[14],xmm6[14],xmm0[15],xmm6[15]
; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]		; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm7[4],xmm4[4],xmm7[5],xmm4[5],xmm7[6],xmm4[6],xmm7[7],xmm4[7]
; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm2, %ymm2		; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm9[4],xmm5[4],xmm9[5],xmm5[5],xmm9[6],xmm5[6],xmm9[7],xmm5[7]
; AVX1-NEXT: vandnps %ymm2, %ymm4, %ymm2		; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm6, %ymm10
; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]		; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm1 = xmm0[4,4,5,5,6,6,7,7]		; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm11 = xmm8[4],xmm2[4],xmm8[5],xmm2[5],xmm8[6],xmm2[6],xmm8[7],xmm2[7]
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero		; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm11, %ymm3
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0		; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm4 = xmm7[0],xmm4[0],xmm7[1],xmm4[1],xmm7[2],xmm4[2],xmm7[3],xmm4[3]
; AVX1-NEXT: vandps %ymm4, %ymm0, %ymm0		; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm5 = xmm9[0],xmm5[0],xmm9[1],xmm5[1],xmm9[2],xmm5[2],xmm9[3],xmm5[3]
; AVX1-NEXT: vorps %ymm2, %ymm0, %ymm0		; AVX1-NEXT: vinsertf128 $1, %xmm4, %ymm5, %ymm4
		; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
		; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm1 = xmm8[0],xmm2[0],xmm8[1],xmm2[1],xmm8[2],xmm2[2],xmm8[3],xmm2[3]
		; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
		; AVX1-NEXT: vinsertf128 $1, %xmm6, %ymm4, %ymm1
		; AVX1-NEXT: vinsertf128 $1, %xmm11, %ymm0, %ymm2
		; AVX1-NEXT: vperm2f128 {{.*#+}} ymm4 = ymm4[2,3],ymm10[2,3]
		; AVX1-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],ymm3[2,3]
		; AVX1-NEXT: vmovaps %ymm1, (%rdi)
		; AVX1-NEXT: vmovaps %ymm2, 32(%rdi)
		; AVX1-NEXT: vmovaps %ymm4, 64(%rdi)
; AVX1-NEXT: vmovaps %ymm0, 96(%rdi)		; AVX1-NEXT: vmovaps %ymm0, 96(%rdi)
; AVX1-NEXT: vmovaps %ymm5, 64(%rdi)
; AVX1-NEXT: vmovaps %ymm9, 32(%rdi)
; AVX1-NEXT: vmovaps %ymm8, (%rdi)
; AVX1-NEXT: vzeroupper		; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq		; AVX1-NEXT: retq
;		;
; AVX2-LABEL: interleaved_store_vf32_i8_stride4:		; AVX2-LABEL: interleaved_store_vf32_i8_stride4:
; AVX2: # BB#0:		; AVX2: # BB#0:
; AVX2-NEXT: vpunpcklbw {{.*#+}} xmm4 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]		; AVX2-NEXT: vpunpcklbw {{.*#+}} ymm4 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[4],ymm1[4],ymm0[5],ymm1[5],ymm0[6],ymm1[6],ymm0[7],ymm1[7],ymm0[16],ymm1[16],ymm0[17],ymm1[17],ymm0[18],ymm1[18],ymm0[19],ymm1[19],ymm0[20],ymm1[20],ymm0[21],ymm1[21],ymm0[22],ymm1[22],ymm0[23],ymm1[23]
; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]		; AVX2-NEXT: vpunpckhbw {{.*#+}} ymm0 = ymm0[8],ymm1[8],ymm0[9],ymm1[9],ymm0[10],ymm1[10],ymm0[11],ymm1[11],ymm0[12],ymm1[12],ymm0[13],ymm1[13],ymm0[14],ymm1[14],ymm0[15],ymm1[15],ymm0[24],ymm1[24],ymm0[25],ymm1[25],ymm0[26],ymm1[26],ymm0[27],ymm1[27],ymm0[28],ymm1[28],ymm0[29],ymm1[29],ymm0[30],ymm1[30],ymm0[31],ymm1[31]
; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm4 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]		; AVX2-NEXT: vpunpcklbw {{.*#+}} ymm1 = ymm2[0],ymm3[0],ymm2[1],ymm3[1],ymm2[2],ymm3[2],ymm2[3],ymm3[3],ymm2[4],ymm3[4],ymm2[5],ymm3[5],ymm2[6],ymm3[6],ymm2[7],ymm3[7],ymm2[16],ymm3[16],ymm2[17],ymm3[17],ymm2[18],ymm3[18],ymm2[19],ymm3[19],ymm2[20],ymm3[20],ymm2[21],ymm3[21],ymm2[22],ymm3[22],ymm2[23],ymm3[23]
; AVX2-NEXT: vinserti128 $1, %xmm5, %ymm4, %ymm4		; AVX2-NEXT: vpunpckhbw {{.*#+}} ymm2 = ymm2[8],ymm3[8],ymm2[9],ymm3[9],ymm2[10],ymm3[10],ymm2[11],ymm3[11],ymm2[12],ymm3[12],ymm2[13],ymm3[13],ymm2[14],ymm3[14],ymm2[15],ymm3[15],ymm2[24],ymm3[24],ymm2[25],ymm3[25],ymm2[26],ymm3[26],ymm2[27],ymm3[27],ymm2[28],ymm3[28],ymm2[29],ymm3[29],ymm2[30],ymm3[30],ymm2[31],ymm3[31]
; AVX2-NEXT: vpunpcklbw {{.*#+}} xmm5 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]		; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm3 = ymm4[4],ymm1[4],ymm4[5],ymm1[5],ymm4[6],ymm1[6],ymm4[7],ymm1[7],ymm4[12],ymm1[12],ymm4[13],ymm1[13],ymm4[14],ymm1[14],ymm4[15],ymm1[15]
; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]		; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm5 = ymm0[4],ymm2[4],ymm0[5],ymm2[5],ymm0[6],ymm2[6],ymm0[7],ymm2[7],ymm0[12],ymm2[12],ymm0[13],ymm2[13],ymm0[14],ymm2[14],ymm0[15],ymm2[15]
; AVX2-NEXT: vpmovzxwd {{.*#+}} xmm5 = xmm5[0],zero,xmm5[1],zero,xmm5[2],zero,xmm5[3],zero		; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm4[0],ymm1[0],ymm4[1],ymm1[1],ymm4[2],ymm1[2],ymm4[3],ymm1[3],ymm4[8],ymm1[8],ymm4[9],ymm1[9],ymm4[10],ymm1[10],ymm4[11],ymm1[11]
; AVX2-NEXT: vinserti128 $1, %xmm6, %ymm5, %ymm5		; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm0 = ymm0[0],ymm2[0],ymm0[1],ymm2[1],ymm0[2],ymm2[2],ymm0[3],ymm2[3],ymm0[8],ymm2[8],ymm0[9],ymm2[9],ymm0[10],ymm2[10],ymm0[11],ymm2[11]
; AVX2-NEXT: vpblendw {{.*#+}} ymm8 = ymm5[0],ymm4[1],ymm5[2],ymm4[3],ymm5[4],ymm4[5],ymm5[6],ymm4[7],ymm5[8],ymm4[9],ymm5[10],ymm4[11],ymm5[12],ymm4[13],ymm5[14],ymm4[15]		; AVX2-NEXT: vinserti128 $1, %xmm3, %ymm1, %ymm2
; AVX2-NEXT: vpunpckhbw {{.*#+}} xmm5 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]		; AVX2-NEXT: vinserti128 $1, %xmm5, %ymm0, %ymm4
; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm0[4],xmm5[4],xmm0[5],xmm5[5],xmm0[6],xmm5[6],xmm0[7],xmm5[7]		; AVX2-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm1[2,3],ymm3[2,3]
; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm5 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3]		; AVX2-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm0[2,3],ymm5[2,3]
; AVX2-NEXT: vinserti128 $1, %xmm6, %ymm5, %ymm5		; AVX2-NEXT: vmovdqa %ymm2, (%rdi)
; AVX2-NEXT: vpunpckhbw {{.*#+}} xmm6 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]		; AVX2-NEXT: vmovdqa %ymm4, 32(%rdi)
; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]		; AVX2-NEXT: vmovdqa %ymm1, 64(%rdi)
; AVX2-NEXT: vpmovzxwd {{.*#+}} xmm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
; AVX2-NEXT: vinserti128 $1, %xmm7, %ymm6, %ymm6
; AVX2-NEXT: vpblendw {{.*#+}} ymm5 = ymm6[0],ymm5[1],ymm6[2],ymm5[3],ymm6[4],ymm5[5],ymm6[6],ymm5[7],ymm6[8],ymm5[9],ymm6[10],ymm5[11],ymm6[12],ymm5[13],ymm6[14],ymm5[15]
; AVX2-NEXT: vextracti128 $1, %ymm3, %xmm3
; AVX2-NEXT: vextracti128 $1, %ymm2, %xmm2
; AVX2-NEXT: vpunpcklbw {{.*#+}} xmm6 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm0[4],xmm6[4],xmm0[5],xmm6[5],xmm0[6],xmm6[6],xmm0[7],xmm6[7]
; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm6 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3]
; AVX2-NEXT: vinserti128 $1, %xmm7, %ymm6, %ymm6
; AVX2-NEXT: vextracti128 $1, %ymm1, %xmm1
; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm0
; AVX2-NEXT: vpunpcklbw {{.*#+}} xmm7 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm4 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; AVX2-NEXT: vpmovzxwd {{.*#+}} xmm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
; AVX2-NEXT: vinserti128 $1, %xmm4, %ymm7, %ymm4
; AVX2-NEXT: vpblendw {{.*#+}} ymm4 = ymm4[0],ymm6[1],ymm4[2],ymm6[3],ymm4[4],ymm6[5],ymm4[6],ymm6[7],ymm4[8],ymm6[9],ymm4[10],ymm6[11],ymm4[12],ymm6[13],ymm4[14],ymm6[15]
; AVX2-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]
; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
; AVX2-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
; AVX2-NEXT: vinserti128 $1, %xmm3, %ymm2, %ymm2
; AVX2-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
; AVX2-NEXT: vpunpckhwd {{.*#+}} xmm1 = xmm0[4,4,5,5,6,6,7,7]
; AVX2-NEXT: vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX2-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0],ymm2[1],ymm0[2],ymm2[3],ymm0[4],ymm2[5],ymm0[6],ymm2[7],ymm0[8],ymm2[9],ymm0[10],ymm2[11],ymm0[12],ymm2[13],ymm0[14],ymm2[15]
; AVX2-NEXT: vmovdqa %ymm0, 96(%rdi)		; AVX2-NEXT: vmovdqa %ymm0, 96(%rdi)
; AVX2-NEXT: vmovdqa %ymm4, 64(%rdi)
; AVX2-NEXT: vmovdqa %ymm5, 32(%rdi)
; AVX2-NEXT: vmovdqa %ymm8, (%rdi)
; AVX2-NEXT: vzeroupper		; AVX2-NEXT: vzeroupper
; AVX2-NEXT: retq		; AVX2-NEXT: retq
%v1 = shufflevector <32 x i8> %x1, <32 x i8> %x2, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>		%v1 = shufflevector <32 x i8> %x1, <32 x i8> %x2, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
%v2 = shufflevector <32 x i8> %x3, <32 x i8> %x4, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>		%v2 = shufflevector <32 x i8> %x3, <32 x i8> %x4, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
%interleaved.vec = shufflevector <64 x i8> %v1, <64 x i8> %v2, <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>		%interleaved.vec = shufflevector <64 x i8> %v1, <64 x i8> %v2, <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>
store <128 x i8> %interleaved.vec, <128 x i8>* %p		store <128 x i8> %interleaved.vec, <128 x i8>* %p
ret void		ret void
}		}

test/Transforms/InterleavedAccess/X86/interleavedStore.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				RKSimon: Add this file to trunk with current codegen so this patch shows the diff.
	; RUN: opt < %s -mtriple=x86_64-pc-linux -mattr=+avx -mattr=+avx2 -interleaved-access -S | FileCheck %s			; RUN: opt < %s -mtriple=x86_64-pc-linux -mattr=+avx2 -interleaved-access -S | FileCheck %s
				RKSimon: You just need -mattr=+avx2 - it implies -mattr=+avx

	define void @interleaved_store_vf32_i8_stride4(<32 x i8> %x1, <32 x i8> %x2, <32 x i8> %x3, <32 x i8> %x4, <128 x i8>* %p) {			define void @interleaved_store_vf32_i8_stride4(<32 x i8> %x1, <32 x i8> %x2, <32 x i8> %x3, <32 x i8> %x4, <128 x i8>* %p) {
	; CHECK-LABEL: @interleaved_store_vf32_i8_stride4(			; CHECK-LABEL: @interleaved_store_vf32_i8_stride4(
	; CHECK-NEXT: [[V1:%.*]] = shufflevector <32 x i8> [[X1:%.*]], <32 x i8> [[X2:%.*]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>			; CHECK-NEXT: [[V1:%.*]] = shufflevector <32 x i8> [[X1:%.*]], <32 x i8> [[X2:%.*]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[V2:%.*]] = shufflevector <32 x i8> [[X3:%.*]], <32 x i8> [[X4:%.*]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>			; CHECK-NEXT: [[V2:%.*]] = shufflevector <32 x i8> [[X3:%.*]], <32 x i8> [[X4:%.*]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>			; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
	; CHECK-NEXT: store <128 x i8> [[INTERLEAVED_VEC]], <128 x i8>* [[P:%.*]]			; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <32 x i32> <i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <32 x i32> <i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 72, i32 73, i32 74, i32 75, i32 76, i32 77, i32 78, i32 79, i32 80, i32 81, i32 82, i32 83, i32 84, i32 85, i32 86, i32 87, i32 88, i32 89, i32 90, i32 91, i32 92, i32 93, i32 94, i32 95>
				; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <32 x i32> <i32 96, i32 97, i32 98, i32 99, i32 100, i32 101, i32 102, i32 103, i32 104, i32 105, i32 106, i32 107, i32 108, i32 109, i32 110, i32 111, i32 112, i32 113, i32 114, i32 115, i32 116, i32 117, i32 118, i32 119, i32 120, i32 121, i32 122, i32 123, i32 124, i32 125, i32 126, i32 127>
				; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <32 x i8> [[TMP1]], <32 x i8> [[TMP2]], <32 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55>
				; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <32 x i8> [[TMP1]], <32 x i8> [[TMP2]], <32 x i32> <i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
				; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <32 x i8> [[TMP3]], <32 x i8> [[TMP4]], <32 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55>
				; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <32 x i8> [[TMP3]], <32 x i8> [[TMP4]], <32 x i32> <i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
				; CHECK-NEXT: [[TMP9:%.*]] = bitcast <32 x i8> [[TMP5]] to <16 x i16>
				; CHECK-NEXT: [[TMP10:%.*]] = bitcast <32 x i8> [[TMP6]] to <16 x i16>
				; CHECK-NEXT: [[TMP11:%.*]] = bitcast <32 x i8> [[TMP7]] to <16 x i16>
				; CHECK-NEXT: [[TMP12:%.*]] = bitcast <32 x i8> [[TMP8]] to <16 x i16>
				; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <16 x i16> [[TMP9]], <16 x i16> [[TMP11]], <16 x i32> <i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
				; CHECK-NEXT: [[TMP14:%.*]] = shufflevector <16 x i16> [[TMP10]], <16 x i16> [[TMP12]], <16 x i32> <i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
				; CHECK-NEXT: [[TMP15:%.*]] = shufflevector <16 x i16> [[TMP9]], <16 x i16> [[TMP11]], <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27>
				; CHECK-NEXT: [[TMP16:%.*]] = shufflevector <16 x i16> [[TMP10]], <16 x i16> [[TMP12]], <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27>
				; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <16 x i16> [[TMP15]], <16 x i16> [[TMP13]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
				; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <16 x i16> [[TMP16]], <16 x i16> [[TMP14]], <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23>
				; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <16 x i16> [[TMP15]], <16 x i16> [[TMP13]], <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
				; CHECK-NEXT: [[TMP20:%.*]] = shufflevector <16 x i16> [[TMP16]], <16 x i16> [[TMP14]], <16 x i32> <i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
				; CHECK-NEXT: [[TMP21:%.*]] = bitcast <128 x i8>* [[P:%.*]] to <16 x i16>*
				; CHECK-NEXT: [[TMP22:%.*]] = getelementptr <16 x i16>, <16 x i16>* [[TMP21]], i32 0
				; CHECK-NEXT: store <16 x i16> [[TMP17]], <16 x i16>* [[TMP22]]
				; CHECK-NEXT: [[TMP23:%.*]] = getelementptr <16 x i16>, <16 x i16>* [[TMP21]], i32 1
				; CHECK-NEXT: store <16 x i16> [[TMP18]], <16 x i16>* [[TMP23]]
				; CHECK-NEXT: [[TMP24:%.*]] = getelementptr <16 x i16>, <16 x i16>* [[TMP21]], i32 2
				; CHECK-NEXT: store <16 x i16> [[TMP19]], <16 x i16>* [[TMP24]]
				; CHECK-NEXT: [[TMP25:%.*]] = getelementptr <16 x i16>, <16 x i16>* [[TMP21]], i32 3
				; CHECK-NEXT: store <16 x i16> [[TMP20]], <16 x i16>* [[TMP25]]
				; CHECK-NEXT: store <16 x i16> [[TMP20]], <16 x i16>* [[TMP25]]
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	%v1 = shufflevector <32 x i8> %x1, <32 x i8> %x2, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>			%v1 = shufflevector <32 x i8> %x1, <32 x i8> %x2, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	%v2 = shufflevector <32 x i8> %x3, <32 x i8> %x4, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>			%v2 = shufflevector <32 x i8> %x3, <32 x i8> %x4, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	%interleaved.vec = shufflevector <64 x i8> %v1, <64 x i8> %v2, <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>			%interleaved.vec = shufflevector <64 x i8> %v1, <64 x i8> %v2, <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>
	store <128 x i8> %interleaved.vec, <128 x i8>* %p			store <128 x i8> %interleaved.vec, <128 x i8>* %p
	ret void			ret void
	}			}
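
The new CHECK lines on the right-hand side can be sanity-checked by hand with a small simulation. The sketch below is not part of the patch: it models `shufflevector` on Python lists and rebuilds the three kinds of mask seen in the expected output (byte unpacklo/unpackhi, word unpacklo/unpackhi, and low/high half concat). The helper names `shuffle`, `unpack_mask`, `concat_mask`, `to_words` and `interleave4` are hypothetical and chosen only for illustration; the sketch assumes 256-bit vectors made of two 128-bit lanes.

```python
# Illustrative model only; helper names are hypothetical, not LLVM APIs.

def shuffle(a, b, mask):
    """Model shufflevector: each mask index selects from the concatenation a + b."""
    src = a + b
    return [src[i] for i in mask]

def unpack_mask(n, high):
    """Mask equivalent to unpacklo/unpackhi on n-element vectors,
    applied independently to each of two 128-bit lanes (assumption)."""
    lane = n // 2          # elements per 128-bit lane
    half = lane // 2       # elements taken from each source per lane
    mask = []
    for base in (0, lane):
        start = base + (half if high else 0)
        for i in range(start, start + half):
            mask += [i, i + n]   # element i of a, then element i of b
    return mask

def concat_mask(n, high):
    """Mask that concatenates the low (or high) halves of two n-element vectors."""
    lo, hi = (n // 2, n) if high else (0, n // 2)
    return list(range(lo, hi)) + list(range(n + lo, n + hi))

def to_words(v):
    """Model the bitcast <32 x i8> -> <16 x i16> by pairing adjacent bytes."""
    return [(v[i], v[i + 1]) for i in range(0, len(v), 2)]

def interleave4(c, m, y, k):
    """Follow the TMP5..TMP20 sequence of the CHECK lines for four 32-byte inputs."""
    lo8, hi8 = unpack_mask(32, False), unpack_mask(32, True)
    lo16, hi16 = unpack_mask(16, False), unpack_mask(16, True)
    cat_lo, cat_hi = concat_mask(16, False), concat_mask(16, True)

    t9 = to_words(shuffle(c, m, lo8))     # TMP5 -> TMP9
    t10 = to_words(shuffle(c, m, hi8))    # TMP6 -> TMP10
    t11 = to_words(shuffle(y, k, lo8))    # TMP7 -> TMP11
    t12 = to_words(shuffle(y, k, hi8))    # TMP8 -> TMP12

    t13 = shuffle(t9, t11, hi16)
    t14 = shuffle(t10, t12, hi16)
    t15 = shuffle(t9, t11, lo16)
    t16 = shuffle(t10, t12, lo16)

    stores = [                            # TMP17..TMP20, in store order
        shuffle(t15, t13, cat_lo),
        shuffle(t16, t14, cat_lo),
        shuffle(t15, t13, cat_hi),
        shuffle(t16, t14, cat_hi),
    ]
    # Flatten the four 16-word vectors back into 128 bytes.
    return [byte for vec in stores for word in vec for byte in word]

c = [f"c{i}" for i in range(32)]
m = [f"m{i}" for i in range(32)]
y = [f"y{i}" for i in range(32)]
k = [f"k{i}" for i in range(32)]
result = interleave4(c, m, y, k)
```

With these inputs, `result` comes out in the `c0 m0 y0 k0 c1 m1 y1 k1 ...` order described in the summary, which is the same order the single 128-element shuffle on the left-hand side produces.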

Revision Contents

Diff 104424

lib/Target/X86/X86InterleavedAccess.cpp

test/CodeGen/X86/x86-interleaved-access.ll

test/Transforms/InterleavedAccess/X86/interleavedStore.ll
