This is an archive of the discontinued LLVM Phabricator instance.

[X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.
ClosedPublic

Authored by m_zuckerman on Jun 25 2017, 12:38 AM.

Details

Reviewers

dorit
Farhana
RKSimon
guyblank
DavidKreitzer

Commits

rGc1918ad571e0: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.
rL309086: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess.

Summary

This patch extends lowerInterleavedStore to support 32 x i8 vectors with stride 4.

LLVM generates suboptimal shuffle code for AVX2. Overall, this patch is a targeted fix for one pattern (Stride=4, VF=32), and we plan to add more patterns in the future. To support that goal, we include two mask creators. The first creates a shuffle mask equivalent to the unpacklo/unpackhi instructions. The second creates a mask equivalent to concatenating two half vectors (high/low).

The goal of the patch is to optimize the following sequence. At the end of the computation, ymm2, ymm0, ymm12 and ymm3 each hold 32 chars:

c0, c1, ..., c31
m0, m1, ..., m31
y0, y1, ..., y31
k0, k1, ..., k31

And these need to be transposed/interleaved and stored like so:

c0 m0 y0 k0 c1 m1 y1 k1 c2 m2 y2 k2 c3 m3 y3 k3 ....
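
For reference, the layout described above can be written as a scalar loop. This is only a sketch of the intended result, not the shuffle sequence the patch emits; the function name interleave4x32 and the use of std::array are assumptions made for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar reference for the stride-4, VF=32 interleaved store:
// Out = c0 m0 y0 k0 c1 m1 y1 k1 ... c31 m31 y31 k31.
// (Hypothetical helper; the patch produces the same layout with shuffles.)
std::array<uint8_t, 128> interleave4x32(const std::array<uint8_t, 32> &C,
                                        const std::array<uint8_t, 32> &M,
                                        const std::array<uint8_t, 32> &Y,
                                        const std::array<uint8_t, 32> &K) {
  std::array<uint8_t, 128> Out{};
  for (std::size_t I = 0; I < 32; ++I) {
    Out[4 * I + 0] = C[I];
    Out[4 * I + 1] = M[I];
    Out[4 * I + 2] = Y[I];
    Out[4 * I + 3] = K[I];
  }
  return Out;
}
```

The vectorized lowering has to produce the same 128 bytes, split across four 256-bit registers.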

Diff Detail

Event Timeline

m_zuckerman created this revision.Jun 25 2017, 12:38 AM

m_zuckerman added a parent revision: D32658: Supports lowerInterleavedStore() in X86InterleavedAccess..

m_zuckerman added a reviewer: DavidKreitzer.Jun 25 2017, 1:54 AM

Would it be beneficial to work on a more general solution for the 128-bit subvector issue? Won't 16x16 and 8x32 (as well as all the 512-bit equivalents) still suffer?

lib/Target/X86/X86InterleavedAccess.cpp
353	Indenting / clang-format
370	If you pull out the Type::getInt16Ty(Shuffles[0]->getContext()) you should be able to tidy all this up
test/CodeGen/X86/x86-interleaved-access.ll
172	Add this test to trunk with current codegen so this patch shows the diff.
test/Transforms/InterleavedAccess/X86/interleavedStore.ll
1	Add this file to trunk with current codegen so this patch shows the diff.
2	You just need -mattr=+avx2 - it implies -mattr=+avx

dorit mentioned this in rL306238: [AVX2] [TTI CostModel] Add cost of interleaved loads/stores for AVX2.Jun 26 2017, 2:47 AM

m_zuckerman updated this revision to Diff 103953.Jun 26 2017, 7:59 AM

m_zuckerman marked 4 inline comments as done.

m_zuckerman marked an inline comment as done.

In D34601#789983, @RKSimon wrote:

Would it be beneficial to work on a more general solution for the 128-bit subvector issue? Won't 16x16 and 8x32 (as well as all the 512-bit equivalents) still suffer?

In general, we plan to generalize the problem and this is why we added the two mask functions.

ping

Just a few minor comments. Looks good to me otherwise, but should give @RKSimon a chance to see if he's happy with your responses (and maybe also another day to give @Farhana/@DavidKreitzer a chance to comment).

lib/Target/X86/X86InterleavedAccess.cpp
104–106	Should this comment be updated?
185	80-column overflow... (and while you're fixing this, maybe use the same capitalization in naming of arguments (i.e. - either "vec1" or "VEC1" form but not both)).
386	Have you tried enabling this also for AVX? (I understand if not, because with the current cost numbers that TTI returns for interleaved accesses on AVX we'll probably determine it's not worth vectorizing... so that may need to come along with an update of getInterleavedMemoryOpCost -- maybe at least a TODO comment is needed).

m_zuckerman updated this revision to Diff 104424.Jun 28 2017, 8:30 AM

m_zuckerman marked 2 inline comments as done and an inline comment as not done.

In D34601#791846, @dorit wrote:

Have you tried enabling this also for AVX? (I understand if not, because with the current cost numbers that TTI returns for interleaved accesses on AVX we'll probably determine it's not worth vectorizing... so that may need to come along with an update of getInterleavedMemoryOpCost -- maybe at least a TODO comment is needed).

Thanks, I tried it and the result was good for AVX.
You can see in the test interleavedStore.ll that we get fewer instructions than with the old code.

m_zuckerman updated this revision to Diff 104990.Jul 1 2017, 5:11 AM

In the new patch I added a test that checks the AVX sequence and a new if-check that verifies that VF32 is used only with store instructions.

zvi added a subscriber: zvi.Jul 4 2017, 11:19 PM

Michael, can you please add AVX512 targets to the tests so we ensure that the AVX512 targets are at least as good as the AVX2 targets?

In D34601#799242, @zvi wrote:

Michael, can you please add AVX512 targets to the tests so we ensure that the AVX512 targets are at least as good as the AVX2 targets?

No problem

Farhana added inline comments.Jul 5 2017, 5:07 PM

lib/Target/X86/X86InterleavedAccess.cpp
169	Why not use the createUnpackShuffleMask(..) defined in X86ISelLowering.cpp? I would recommend you to declare a template function of this in X86ISelLowering.h which will allow both int and int32_t mask and get rid of this from X86InterleavedAccess.cpp.
185	The comment does not match the code logic. Something is off here. Maybe it should be NumElement+NumElement/2?
204	I would think you could write a more general transpose function, a function of 8 elements which would scale to 16 and 32. Why is it written only for a VLen of 32?
366	Value *VecInst = SI->getPointerOperand()
366	What is the reason for generating four stores instead of 1 wide store? I would think codegen is smart enough to generate the expected 4 optimized stores here. Can you please keep it consistent with the other case, which means concatenating the transposed vectors and generating 1 wide store:
  StoreInst *SI = cast<StoreInst>(Inst);
  Value *BasePtr = SI->getPointerOperand();
  case 32: {
    transposeChar_32x4(DecomposedVectors, TransposedVectors);
    // VecInst contains the Ptr argument.
    Value *VecInst = SI->getPointerOperand();
    Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
    // From <128xi8>* to <64xi16>*
    Type *VecTran = VectorType::get(IntOf16, 64)->getPointerTo();
    BasePtr = Builder.CreateBitCast(VecInst, VecTran);
And move the following two statements out of the switch.
  // 3. Concatenate the contiguous-vectors back into a wide vector.
  Value *WideVec = concatenateVectors(Builder, TransposedVectors);
  // 4. Generate a store instruction for wide-vec.
  Builder.CreateAlignedStore(WideVec, BasePtr, // replace this with
                             SI->getAlignment());
371	++i
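
To make the unpack mask discussed in the comment on line 169 concrete, here is a standalone sketch of the same index computation that createUnpackShuffleMask performs in this diff. It uses std::vector and plain integers instead of MVT and SmallVectorImpl, and the name unpackMask is an assumption for illustration only.

```cpp
#include <vector>

// Standalone sketch of an unpacklo/unpackhi shuffle mask for a vector of
// NumElts elements, each ScalarBits wide. Indices >= NumElts select from the
// second shuffle operand. Unpacks work independently in each 128-bit lane.
std::vector<int> unpackMask(int NumElts, int ScalarBits, bool Lo, bool Unary) {
  std::vector<int> Mask;
  int NumEltsInLane = 128 / ScalarBits;
  for (int i = 0; i < NumElts; ++i) {
    int LaneStart = (i / NumEltsInLane) * NumEltsInLane;
    int Pos = (i % NumEltsInLane) / 2 + LaneStart; // element within the lane
    Pos += Unary ? 0 : NumElts * (i % 2);          // odd slots read operand 2
    Pos += Lo ? 0 : NumEltsInLane / 2;             // high half of the lane
    Mask.push_back(Pos);
  }
  return Mask;
}
```

For a 256-bit vector of eight i32 elements, unpackMask(8, 32, true, false) yields 0, 8, 1, 9, 4, 12, 5, 13: each 128-bit lane interleaves its own low half with the matching half of the second operand.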

Hi Michael, sorry it took a while for me to get to this. I was on vacation last week.

lib/Target/X86/X86InterleavedAccess.cpp
71–73	"transpose" is a poor name here. "interleave" would be better. Also, I would prefer "8bit" or "1byte" to "Char", e.g. interleave8bit_32x4. "transpose" works for the 4x4 case (and other NxN cases), because the shuffle sequence does a matrix transpose on the input vectors, and the same code can be used for interleaving and de-interleaving. To handle the 32x8 load case, we would need a different code sequence than what you are currently generating in transposeChar_32x4. Presumably, we would use deinterleave8bit_32x4 for that.
72	Trasposed --> Transposed. Can you fix this at line 72 as well?
194	Perhaps change "BeginIndex" to "CurrentIndex" since you are updating it as you go.
259	I think it is unnecessary and undesirable to do bitcasts here. It complicates both the IR and the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence, which now has to account for the interleave function returning a different type in "TransposedVectors" than the original "DecomposedVectors". Instead, you just need to "scale" your <16 x i32> masks to <32 x i32> like this: <a, b, ..., p> --> <a*2, a*2+1, b*2, b*2+1, ..., p*2, p*2+1>. There is an existing routine to do this scaling in X86ISelLowering.cpp, scaleShuffleMask. Also see canWidenShuffleElements, which is how the 32xi8 shuffle gets automatically converted to a 16xi16 shuffle by the CodeGen, ultimately causing the desired unpckwd instructions to be generated.
366	+1
386	This comment is no longer accurate since you enabled this for AVX. It is probably okay to simply delete this comment since it is redundant with 105-107.
test/CodeGen/X86/x86-interleaved-access.ll
258–263	The codegen improvements here look great!
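
The mask scaling suggested in the comment on line 259 can be sketched without any LLVM types. This is a simplified stand-in for scaleShuffleMask, assuming std::vector in place of ArrayRef and SmallVectorImpl; the name scaleMask is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Replace each mask index M with Scale consecutive indices
// Scale*M, Scale*M + 1, ..., so the shuffle works on narrower elements.
// Negative (sentinel) entries are repeated unchanged.
std::vector<int> scaleMask(int Scale, const std::vector<int> &Mask) {
  assert(Scale > 0 && "Unexpected scaling factor");
  std::vector<int> Scaled;
  Scaled.reserve(Mask.size() * static_cast<std::size_t>(Scale));
  for (int M : Mask)
    for (int s = 0; s != Scale; ++s)
      Scaled.push_back(M < 0 ? M : Scale * M + s);
  return Scaled;
}
```

With a scale of 2, the mask 0, 8, 1, 9 becomes 0, 1, 16, 17, 2, 3, 18, 19, which moves the same data in units of half the original element width.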

m_zuckerman updated this revision to Diff 106158.Jul 12 2017, 2:33 AM

m_zuckerman marked 9 inline comments as done.

m_zuckerman added inline comments.

lib/Target/X86/X86InterleavedAccess.cpp
204	You are right and we plan to do so in the next patch.

m_zuckerman marked an inline comment as done.Jul 12 2017, 6:15 AM

Thanks for the changes, Michael, and thanks for following up on the perf issue in https://bugs.llvm.org/show_bug.cgi?id=33740.

I have a few additional comments, but this generally looks good. The simplification of the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence is great!

lib/Target/X86/X86ISelLowering.h
1440	Fix the indenting here.
lib/Target/X86/X86InterleavedAccess.cpp
71–73	I see that you changed this to "deinterleave8bit_32x4" rather than "interleave8bit_32x4". Can you please explain why? This routine is taking 4 input vectors and merging their elements like this: v0[0], v1[0], v2[0], v3[0], v0[1], v1[1], v2[1], v3[1], ... Wouldn't you call that interleaving?
113–119	The old wording here was better, and you have a typo in "oration".
197	Would it be possible to simplify the implementation of this routine by just calling createUnpackShuffleMask with an i16 type plus calling scaleShuffleMask with a scale of 2? (That would require you to move scaleShuffleMask into X86ISelLowering.h like you did with createUnpackShuffleMask.)

m_zuckerman added inline comments.Jul 12 2017, 9:50 AM

lib/Target/X86/X86InterleavedAccess.cpp
71–73	You are right, I swapped the terminology.

m_zuckerman updated this revision to Diff 106407.Jul 13 2017, 4:33 AM

m_zuckerman marked 3 inline comments as done.

Hi Michael, I have one small additional comment. Otherwise, this looks good.

lib/Target/X86/X86ISelLowering.h
1457	Did you want to have T default to int as you did for createUnpackShuffleMask? That will avoid the extra changes in X86ISelLowering.cpp.

In D34601#808867, @DavidKreitzer wrote:

Hi Michael, I have one small additional comment. Otherwise, this looks good.

This is the error I get when I try to define the function with a default template argument:
'void llvm::scaleShuffleMask(int,llvm::ArrayRef<T>,llvm::SmallVectorImpl<T> &)': could not deduce template argument for 'llvm::ArrayRef<T>' from 'llvm::SmallVector<int,4>'

If you prefer not to touch the CPP file, we can define an overloaded function in the header file that explicitly sets the function signature.

The function in the header will look like:

template <typename T1, unsigned N>
void scaleShuffleMask(int Scale, SmallVector<T1, N> Mask,
                      SmallVectorImpl<T1> &ScaledMask) {
  scaleShuffleMask(Scale, makeArrayRef(Mask), ScaledMask);
}

This function describes another signature that one of the calls in the CPP file uses.

The other option is as I proposed on this review.

Which do you prefer?
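
The deduction failure quoted above can be reproduced without LLVM. In the sketch below, Span is a hypothetical stand-in for ArrayRef and countElts for scaleShuffleMask; both names are assumptions. The point it illustrates is that template argument deduction does not look through implicit conversions, which is consistent with the error message and with the explicit <int> at the call sites in this diff.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for ArrayRef<T>: implicitly constructible from a
// container of T.
template <typename T> struct Span {
  const T *Data;
  std::size_t Size;
  Span(const std::vector<T> &V) : Data(V.data()), Size(V.size()) {}
};

// Hypothetical stand-in for a function templated on the mask element type.
template <typename T> std::size_t countElts(Span<T> Mask) { return Mask.Size; }

std::size_t demo() {
  std::vector<int> V = {0, 1, 2, 3};
  // countElts(V);           // error: T cannot be deduced from std::vector<int>
  return countElts<int>(V);  // OK: T is given, so the conversion to Span<int>
                             // is allowed
}
```

Spelling out the template argument at each call site is the cost of keeping the routine templated.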

ping

RKSimon added inline comments.Jul 19 2017, 2:44 AM

lib/Target/X86/X86ISelLowering.h
1457	+1
lib/Target/X86/X86InterleavedAccess.cpp
175	shuffle
188	I think you could make this much simpler to understand:
  int NumHalfElements = NumElements / 2;
  int Offset = Low ? 0 : NumHalfElements;
  for (int i = 0; i < NumHalfElements; ++i)
    Mask.push_back(i + Offset);
  for (int i = 0; i < NumHalfElements; ++i)
    Mask.push_back(i + Offset + NumElements);
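
As a quick check of the simplified loop suggested in the comment on line 188, here is a standalone version. It assumes std::vector in place of SmallVectorImpl, and the name concatHalvesMask is hypothetical.

```cpp
#include <vector>

// Build a shuffle mask that concatenates the low (or high) halves of two
// NumElements-wide operands. Indices >= NumElements select from operand 2.
std::vector<int> concatHalvesMask(int NumElements, bool Low) {
  std::vector<int> Mask;
  int NumHalfElements = NumElements / 2;
  int Offset = Low ? 0 : NumHalfElements;
  for (int i = 0; i < NumHalfElements; ++i)
    Mask.push_back(i + Offset);               // half of the first operand
  for (int i = 0; i < NumHalfElements; ++i)
    Mask.push_back(i + Offset + NumElements); // same half of the second
  return Mask;
}
```

For NumElements = 8, the low mask is 0, 1, 2, 3, 8, 9, 10, 11 and the high mask is 4, 5, 6, 7, 12, 13, 14, 15.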

In D34601#810757, @m_zuckerman wrote:
In D34601#808867, @DavidKreitzer wrote:

Hi Michael, I have one small additional comment. Otherwise, this looks good.

This is the error I get when I try to define the function with a default template argument:
'void llvm::scaleShuffleMask(int,llvm::ArrayRef<T>,llvm::SmallVectorImpl<T> &)': could not deduce template argument for 'llvm::ArrayRef<T>' from 'llvm::SmallVector<int,4>'

If you prefer not to touch the CPP file, we can define an overloaded function in the header file that explicitly sets the function signature.

The function in the header will look like:
template <typename T1, unsigned N>
void scaleShuffleMask(int Scale, SmallVector<T1, N> Mask,
                      SmallVectorImpl<T1> &ScaledMask) {
  scaleShuffleMask(Scale, makeArrayRef(Mask), ScaledMask);
}
This function describes another signature that one of the calls in the CPP file uses.

The other option is as I proposed on this review.

Which do you prefer?

I am not a fan of either of those options. How about getting rid of the template argument altogether and just using "int"? That is what ShuffleVectorInst::getShuffleMask uses, so it seems like it should be good enough here too. If you take this approach, it would be good to modify transpose_4x4 to use "int" instead of "uint32_t" for consistency.

m_zuckerman updated this revision to Diff 107663.Jul 21 2017, 6:06 AM

m_zuckerman marked an inline comment as done.

Builder.CreateShuffleVector accepts only uint32_t masks. I created a static CreateShuffleVector helper that does a reinterpret_cast from int to uint32_t. This removes our dependency on the template.

DavidKreitzer added inline comments.Jul 21 2017, 11:05 AM

lib/Target/X86/X86InterleavedAccess.cpp
188	Ugh! I didn't realize that you were going to have to add this method to do the ArrayRef conversion. Sorry, I think I gave you bad advice. I think this is a worse evil than the explicit template arguments in the previous patch. Unless someone else has a better idea, my recommendation is to go back to your previous solution and accept the explicit template arguments as an unfortunate consequence of the inconsistent ShuffleVector interfaces.

m_zuckerman updated this revision to Diff 107808.Jul 23 2017, 12:21 AM

I returned to the previous patch with a small modification, as @RKSimon asked.

Thanks, Michael. This LGTM pending @RKSimon's re-review. But can you please update your sources so that we can see how you merged with https://reviews.llvm.org/D35638, since that will cause conflicts?

m_zuckerman updated this revision to Diff 107938.Jul 24 2017, 11:57 AM

Thanks, Michael. LGTM, again pending @RKSimon's re-review.

This revision is now accepted and ready to land.Jul 24 2017, 12:27 PM

m_zuckerman added a child revision: D35829: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess (VF16 stride 4)..Jul 25 2017, 3:28 AM

LGTM - if possible moving createUnpackShuffleMask and scaleShuffleMask should be done as NFC pre-commit.

Diffusion mentioned this in rL309084: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess….Jul 26 2017, 12:45 AM

Closed by commit rL309086: [X86][LLVM]Expanding Supports lowerInterleavedStore() in X86InterleavedAccess. (authored by mzuckerm). Jul 26 2017, 1:12 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/X86/X86ISelLowering.h	42 lines
lib/Target/X86/X86ISelLowering.cpp	52 lines
lib/Target/X86/X86InterleavedAccess.cpp	137 lines
test/CodeGen/X86/x86-interleaved-access.ll	137 lines
test/Transforms/InterleavedAccess/X86/interleavedStore.ll	24 lines

Diff 107809

lib/Target/X86/X86ISelLowering.h

[1,428 lines not shown]	X86MaskedGatherSDNode(unsigned Order,
MachineMemOperand *MMO)		MachineMemOperand *MMO)
: MaskedGatherScatterSDNode(X86ISD::MGATHER, Order, dl, VTs, MemVT, MMO)		: MaskedGatherScatterSDNode(X86ISD::MGATHER, Order, dl, VTs, MemVT, MMO)
{}		{}
static bool classof(const SDNode *N) {		static bool classof(const SDNode *N) {
return N->getOpcode() == X86ISD::MGATHER;		return N->getOpcode() == X86ISD::MGATHER;
}		}
};		};

		/// Generate unpacklo/unpackhi shuffle mask.
		template <typename T = int>
		void createUnpackShuffleMask(MVT VT, SmallVectorImpl<T> &Mask, bool Lo,
		bool Unary) {
		assert(Mask.empty() && "Expected an empty shuffle mask vector");
		int NumElts = VT.getVectorNumElements();
		int NumEltsInLane = 128 / VT.getScalarSizeInBits();
		for (int i = 0; i < NumElts; ++i) {
		unsigned LaneStart = (i / NumEltsInLane) * NumEltsInLane;
		int Pos = (i % NumEltsInLane) / 2 + LaneStart;
		Pos += (Unary ? 0 : NumElts * (i % 2));
		Pos += (Lo ? 0 : NumEltsInLane / 2);
		Mask.push_back(Pos);
		}
		}

		/// Helper function to scale a shuffle or target shuffle mask, replacing each
		/// mask index with the scaled sequential indices for an equivalent narrowed
		/// mask. This is the reverse process to canWidenShuffleElements, but can
		/// always succeed.
		template <typename T>
		void scaleShuffleMask(int Scale, ArrayRef<T> Mask,
		SmallVectorImpl<T> &ScaledMask) {
		assert(0 < Scale && "Unexpected scaling factor");
		int NumElts = Mask.size();
		ScaledMask.assign(static_cast<size_t>(NumElts * Scale), -1);

		for (int i = 0; i != NumElts; ++i) {
		int M = Mask[i];

		// Repeat sentinel values in every mask element.
		if (M < 0) {
		for (int s = 0; s != Scale; ++s)
		ScaledMask[(Scale * i) + s] = M;
		continue;
		}

		// Scale mask element and increment across each mask element.
		for (int s = 0; s != Scale; ++s)
		ScaledMask[(Scale * i) + s] = (Scale * M) + s;
		}
		}
} // end namespace llvm		} // end namespace llvm

#endif // LLVM_LIB_TARGET_X86_X86ISELLOWERING_H		#endif // LLVM_LIB_TARGET_X86_X86ISELLOWERING_H

lib/Target/X86/X86ISelLowering.cpp

[4,752 lines not shown]	for (int i = 0, Size = Mask.size(); i < Size; i += 2) {
return false;		return false;
}		}
assert(WidenedMask.size() == Mask.size() / 2 &&		assert(WidenedMask.size() == Mask.size() / 2 &&
"Incorrect size of mask after widening the elements!");		"Incorrect size of mask after widening the elements!");

return true;		return true;
}		}

/// Helper function to scale a shuffle or target shuffle mask, replacing each
/// mask index with the scaled sequential indices for an equivalent narrowed
/// mask. This is the reverse process to canWidenShuffleElements, but can always
/// succeed.
static void scaleShuffleMask(int Scale, ArrayRef<int> Mask,
SmallVectorImpl<int> &ScaledMask) {
assert(0 < Scale && "Unexpected scaling factor");
int NumElts = Mask.size();
ScaledMask.assign(static_cast<size_t>(NumElts * Scale), -1);

for (int i = 0; i != NumElts; ++i) {
int M = Mask[i];

// Repeat sentinel values in every mask element.
if (M < 0) {
for (int s = 0; s != Scale; ++s)
ScaledMask[(Scale * i) + s] = M;
continue;
}

// Scale mask element and increment across each mask element.
for (int s = 0; s != Scale; ++s)
ScaledMask[(Scale * i) + s] = (Scale * M) + s;
}
}

/// Return true if the specified EXTRACT_SUBVECTOR operand specifies a vector		/// Return true if the specified EXTRACT_SUBVECTOR operand specifies a vector
/// extract that is suitable for instruction that extract 128 or 256 bit vectors		/// extract that is suitable for instruction that extract 128 or 256 bit vectors
static bool isVEXTRACTIndex(SDNode *N, unsigned vecWidth) {		static bool isVEXTRACTIndex(SDNode *N, unsigned vecWidth) {
assert((vecWidth == 128 || vecWidth == 256) && "Unexpected vector width");		assert((vecWidth == 128 || vecWidth == 256) && "Unexpected vector width");
if (!isa<ConstantSDNode>(N->getOperand(1).getNode()))		if (!isa<ConstantSDNode>(N->getOperand(1).getNode()))
return false;		return false;

// The index should be aligned on a vecWidth-bit boundary.		// The index should be aligned on a vecWidth-bit boundary.
[453 lines not shown]	if (VT.getSizeInBits() > 128 && InVT.getSizeInBits() > 128) {
int Scale = VT.getScalarSizeInBits() / InVT.getScalarSizeInBits();		int Scale = VT.getScalarSizeInBits() / InVT.getScalarSizeInBits();
In = extractSubVector(In, 0, DAG, DL,		In = extractSubVector(In, 0, DAG, DL,
std::max(128, (int)VT.getSizeInBits() / Scale));		std::max(128, (int)VT.getSizeInBits() / Scale));
}		}

return DAG.getNode(Opc, DL, VT, In);		return DAG.getNode(Opc, DL, VT, In);
}		}

/// Generate unpacklo/unpackhi shuffle mask.
static void createUnpackShuffleMask(MVT VT, SmallVectorImpl<int> &Mask, bool Lo,
bool Unary) {
assert(Mask.empty() && "Expected an empty shuffle mask vector");
int NumElts = VT.getVectorNumElements();
int NumEltsInLane = 128 / VT.getScalarSizeInBits();

for (int i = 0; i < NumElts; ++i) {
unsigned LaneStart = (i / NumEltsInLane) * NumEltsInLane;
int Pos = (i % NumEltsInLane) / 2 + LaneStart;
Pos += (Unary ? 0 : NumElts * (i % 2));
Pos += (Lo ? 0 : NumEltsInLane / 2);
Mask.push_back(Pos);
}
}

/// Returns a vector_shuffle node for an unpackl operation.		/// Returns a vector_shuffle node for an unpackl operation.
static SDValue getUnpackl(SelectionDAG &DAG, const SDLoc &dl, MVT VT,		static SDValue getUnpackl(SelectionDAG &DAG, const SDLoc &dl, MVT VT,
SDValue V1, SDValue V2) {		SDValue V1, SDValue V2) {
SmallVector<int, 8> Mask;		SmallVector<int, 8> Mask;
createUnpackShuffleMask(VT, Mask, /* Lo = */ true, /* Unary = */ false);		createUnpackShuffleMask(VT, Mask, /* Lo = */ true, /* Unary = */ false);
return DAG.getVectorShuffle(VT, dl, V1, V2, Mask);		return DAG.getVectorShuffle(VT, dl, V1, V2, Mask);
}		}

[7,493 lines not shown]	if (SDValue Broadcast = lowerVectorShuffleAsBroadcast(DL, MVT::v4i64, V1, V2,
return Broadcast;		return Broadcast;

if (V2.isUndef()) {		if (V2.isUndef()) {
// When the shuffle is mirrored between the 128-bit lanes of the unit, we		// When the shuffle is mirrored between the 128-bit lanes of the unit, we
// can use lower latency instructions that will operate on both lanes.		// can use lower latency instructions that will operate on both lanes.
SmallVector<int, 2> RepeatedMask;		SmallVector<int, 2> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MVT::v4i64, Mask, RepeatedMask)) {		if (is128BitLaneRepeatedShuffleMask(MVT::v4i64, Mask, RepeatedMask)) {
SmallVector<int, 4> PSHUFDMask;		SmallVector<int, 4> PSHUFDMask;
scaleShuffleMask(2, RepeatedMask, PSHUFDMask);		scaleShuffleMask<int>(2, RepeatedMask, PSHUFDMask);
return DAG.getBitcast(		return DAG.getBitcast(
MVT::v4i64,		MVT::v4i64,
DAG.getNode(X86ISD::PSHUFD, DL, MVT::v8i32,		DAG.getNode(X86ISD::PSHUFD, DL, MVT::v8i32,
DAG.getBitcast(MVT::v8i32, V1),		DAG.getBitcast(MVT::v8i32, V1),
getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG)));		getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG)));
}		}

// AVX2 provides a direct instruction for permuting a single input across		// AVX2 provides a direct instruction for permuting a single input across
[687 lines not shown]	static SDValue lowerV8I64VectorShuffle(const SDLoc &DL, ArrayRef<int> Mask,

if (V2.isUndef()) {		if (V2.isUndef()) {
// When the shuffle is mirrored between the 128-bit lanes of the unit, we		// When the shuffle is mirrored between the 128-bit lanes of the unit, we
// can use lower latency instructions that will operate on all four		// can use lower latency instructions that will operate on all four
// 128-bit lanes.		// 128-bit lanes.
SmallVector<int, 2> Repeated128Mask;		SmallVector<int, 2> Repeated128Mask;
if (is128BitLaneRepeatedShuffleMask(MVT::v8i64, Mask, Repeated128Mask)) {		if (is128BitLaneRepeatedShuffleMask(MVT::v8i64, Mask, Repeated128Mask)) {
SmallVector<int, 4> PSHUFDMask;		SmallVector<int, 4> PSHUFDMask;
scaleShuffleMask(2, Repeated128Mask, PSHUFDMask);		scaleShuffleMask<int>(2, Repeated128Mask, PSHUFDMask);
return DAG.getBitcast(		return DAG.getBitcast(
MVT::v8i64,		MVT::v8i64,
DAG.getNode(X86ISD::PSHUFD, DL, MVT::v16i32,		DAG.getNode(X86ISD::PSHUFD, DL, MVT::v16i32,
DAG.getBitcast(MVT::v16i32, V1),		DAG.getBitcast(MVT::v16i32, V1),
getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG)));		getV4X86ShuffleImm8ForMask(PSHUFDMask, DL, DAG)));
}		}

SmallVector<int, 4> Repeated256Mask;		SmallVector<int, 4> Repeated256Mask;
[13,709 lines not shown]	static bool matchUnaryPermuteVectorShuffle(MVT MaskVT, ArrayRef<int> Mask,
// had to use 2-input SHUFPD/SHUFPS shuffles (not handled here).		// had to use 2-input SHUFPD/SHUFPS shuffles (not handled here).
if ((MaskScalarSizeInBits == 64 || MaskScalarSizeInBits == 32) &&		if ((MaskScalarSizeInBits == 64 || MaskScalarSizeInBits == 32) &&
!ContainsZeros && (AllowIntDomain || Subtarget.hasAVX())) {		!ContainsZeros && (AllowIntDomain || Subtarget.hasAVX())) {
SmallVector<int, 4> RepeatedMask;		SmallVector<int, 4> RepeatedMask;
if (is128BitLaneRepeatedShuffleMask(MaskEltVT, Mask, RepeatedMask)) {		if (is128BitLaneRepeatedShuffleMask(MaskEltVT, Mask, RepeatedMask)) {
// Narrow the repeated mask to create 32-bit element permutes.		// Narrow the repeated mask to create 32-bit element permutes.
SmallVector<int, 4> WordMask = RepeatedMask;		SmallVector<int, 4> WordMask = RepeatedMask;
if (MaskScalarSizeInBits == 64)		if (MaskScalarSizeInBits == 64)
scaleShuffleMask(2, RepeatedMask, WordMask);		scaleShuffleMask<int>(2, RepeatedMask, WordMask);

Shuffle = (AllowIntDomain ? X86ISD::PSHUFD : X86ISD::VPERMILPI);		Shuffle = (AllowIntDomain ? X86ISD::PSHUFD : X86ISD::VPERMILPI);
ShuffleVT = (AllowIntDomain ? MVT::i32 : MVT::f32);		ShuffleVT = (AllowIntDomain ? MVT::i32 : MVT::f32);
ShuffleVT = MVT::getVectorVT(ShuffleVT, InputSizeInBits / 32);		ShuffleVT = MVT::getVectorVT(ShuffleVT, InputSizeInBits / 32);
PermuteImm = getV4X86ShuffleImm(WordMask);		PermuteImm = getV4X86ShuffleImm(WordMask);
return true;		return true;
}		}
}		}
[346 lines not shown]	static bool combineX86ShuffleChain(ArrayRef<SDValue> Inputs, SDValue Root,
}		}

// For masks that have been widened to 128-bit elements or more,		// For masks that have been widened to 128-bit elements or more,
// narrow back down to 64-bit elements.		// narrow back down to 64-bit elements.
SmallVector<int, 64> Mask;		SmallVector<int, 64> Mask;
if (BaseMaskEltSizeInBits > 64) {		if (BaseMaskEltSizeInBits > 64) {
assert((BaseMaskEltSizeInBits % 64) == 0 && "Illegal mask size");		assert((BaseMaskEltSizeInBits % 64) == 0 && "Illegal mask size");
int MaskScale = BaseMaskEltSizeInBits / 64;		int MaskScale = BaseMaskEltSizeInBits / 64;
scaleShuffleMask(MaskScale, BaseMask, Mask);		scaleShuffleMask<int>(MaskScale, BaseMask, Mask);
} else {		} else {
Mask = SmallVector<int, 64>(BaseMask.begin(), BaseMask.end());		Mask = SmallVector<int, 64>(BaseMask.begin(), BaseMask.end());
}		}

unsigned NumMaskElts = Mask.size();		unsigned NumMaskElts = Mask.size();
unsigned MaskEltSizeInBits = RootSizeInBits / NumMaskElts;		unsigned MaskEltSizeInBits = RootSizeInBits / NumMaskElts;

// Determine the effective mask value type.		// Determine the effective mask value type.
[2,123 lines not shown]	static SDValue combineExtractWithShuffle(SDNode *N, SelectionDAG &DAG,
if (!resolveTargetShuffleInputs(peekThroughBitcasts(Src), Ops, Mask, DAG))		if (!resolveTargetShuffleInputs(peekThroughBitcasts(Src), Ops, Mask, DAG))
return SDValue();		return SDValue();

// Attempt to narrow/widen the shuffle mask to the correct size.		// Attempt to narrow/widen the shuffle mask to the correct size.
if (Mask.size() != NumSrcElts) {		if (Mask.size() != NumSrcElts) {
if ((NumSrcElts % Mask.size()) == 0) {		if ((NumSrcElts % Mask.size()) == 0) {
SmallVector<int, 16> ScaledMask;		SmallVector<int, 16> ScaledMask;
int Scale = NumSrcElts / Mask.size();		int Scale = NumSrcElts / Mask.size();
scaleShuffleMask(Scale, Mask, ScaledMask);		scaleShuffleMask<int>(Scale, Mask, ScaledMask);
Mask = std::move(ScaledMask);		Mask = std::move(ScaledMask);
} else if ((Mask.size() % NumSrcElts) == 0) {		} else if ((Mask.size() % NumSrcElts) == 0) {
SmallVector<int, 16> WidenedMask;		SmallVector<int, 16> WidenedMask;
while (Mask.size() > NumSrcElts &&		while (Mask.size() > NumSrcElts &&
canWidenShuffleElements(Mask, WidenedMask))		canWidenShuffleElements(Mask, WidenedMask))
Mask = std::move(WidenedMask);		Mask = std::move(WidenedMask);
// TODO - investigate support for wider shuffle masks with known upper		// TODO - investigate support for wider shuffle masks with known upper
// undef/zero elements for implicit zero-extension.		// undef/zero elements for implicit zero-extension.
[6,933 lines not shown]

lib/Target/X86/X86InterleavedAccess.cpp

//===--------- X86InterleavedAccess.cpp ----------------------------------===//		//===--------- X86InterleavedAccess.cpp ----------------------------------===//
//		//
// The LLVM Compiler Infrastructure		// The LLVM Compiler Infrastructure
//		//
// This file is distributed under the University of Illinois Open Source		// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.		// License. See LICENSE.TXT for details.
//		//
//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
///		///
/// \file		/// \file
/// This file contains the X86 implementation of the interleaved accesses		/// This file contains the X86 implementation of the interleaved accesses
/// optimization generating X86-specific instructions/intrinsics for		/// optimization generating X86-specific instructions/intrinsics for
/// interleaved access groups.		/// interleaved access groups.
///		///
//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//

#include "X86ISelLowering.h"
#include "X86TargetMachine.h"		#include "X86TargetMachine.h"
#include "llvm/Analysis/VectorUtils.h"		#include "llvm/Analysis/VectorUtils.h"

using namespace llvm;		using namespace llvm;

namespace {		namespace {
/// \brief This class holds necessary information to represent an interleaved		/// \brief This class holds necessary information to represent an interleaved
/// access group and supports utilities to lower the group into		/// access group and supports utilities to lower the group into
[38 lines not shown]	class X86InterleavedAccessGroup {
/// In-V2 = r1, r2, r3, r4		/// In-V2 = r1, r2, r3, r4
/// In-V3 = s1, s2, s3, s4		/// In-V3 = s1, s2, s3, s4
/// OutputVectors:		/// OutputVectors:
/// Out-V0 = p1, q1, r1, s1		/// Out-V0 = p1, q1, r1, s1
/// Out-V1 = p2, q2, r2, s2		/// Out-V1 = p2, q2, r2, s2
/// Out-V2 = p3, q3, r3, s3		/// Out-V2 = p3, q3, r3, s3
/// Out-V3 = P4, q4, r4, s4		/// Out-V3 = P4, q4, r4, s4
void transpose_4x4(ArrayRef<Instruction *> InputVectors,		void transpose_4x4(ArrayRef<Instruction *> InputVectors,
SmallVectorImpl<Value *> &TrasposedVectors);		SmallVectorImpl<Value *> &TransposedMatrix);
		void interleave8bit_32x4(ArrayRef<Instruction *> InputVectors,
		SmallVectorImpl<Value *> &TransposedMatrix);
public:		public:
/// In order to form an interleaved access group X86InterleavedAccessGroup		/// In order to form an interleaved access group X86InterleavedAccessGroup
/// requires a wide-load instruction \p 'I', a group of interleaved-vectors		/// requires a wide-load instruction \p 'I', a group of interleaved-vectors
/// \p Shuffs, reference to the first indices of each interleaved-vector		/// \p Shuffs, reference to the first indices of each interleaved-vector
/// \p 'Ind' and the interleaving stride factor \p F. In order to generate		/// \p 'Ind' and the interleaving stride factor \p F. In order to generate
/// X86-specific instructions/intrinsics it also requires the underlying		/// X86-specific instructions/intrinsics it also requires the underlying
/// target information \p STarget.		/// target information \p STarget.
explicit X86InterleavedAccessGroup(Instruction *I,		explicit X86InterleavedAccessGroup(Instruction *I,
Show All 14 Lines
};		};
} // end anonymous namespace		} // end anonymous namespace

bool X86InterleavedAccessGroup::isSupported() const {		bool X86InterleavedAccessGroup::isSupported() const {
VectorType *ShuffleVecTy = Shuffles[0]->getType();		VectorType *ShuffleVecTy = Shuffles[0]->getType();
uint64_t ShuffleVecSize = DL.getTypeSizeInBits(ShuffleVecTy);		uint64_t ShuffleVecSize = DL.getTypeSizeInBits(ShuffleVecTy);
Type *ShuffleEltTy = ShuffleVecTy->getVectorElementType();		Type *ShuffleEltTy = ShuffleVecTy->getVectorElementType();

// Currently, lowering is supported for 4-element vectors of 64 bits on AVX.		// Currently, lowering is supported for the following vectors:
		// 1. 4-element vectors of 64 bits on AVX.
		// 2. 32-element vectors of 8 bits on AVX.
		dorit: Should this comment be updated?
uint64_t ExpectedShuffleVecSize;		uint64_t ExpectedShuffleVecSize;
if (isa<LoadInst>(Inst))		if (isa<LoadInst>(Inst))
ExpectedShuffleVecSize = 256;		ExpectedShuffleVecSize = 256;
else		else
ExpectedShuffleVecSize = 1024;		ExpectedShuffleVecSize = 1024;

if (!Subtarget.hasAVX() || ShuffleVecSize != ExpectedShuffleVecSize ||		if (DL.getTypeSizeInBits(ShuffleEltTy) == 8 && !isa<StoreInst>(Inst))
DL.getTypeSizeInBits(ShuffleEltTy) != 64 || Factor != 4)		return false;

		if (((DL.getTypeSizeInBits(ShuffleEltTy) != 64) &&
		((DL.getTypeSizeInBits(ShuffleEltTy) != 8))) ||
		!Subtarget.hasAVX() || ShuffleVecSize != ExpectedShuffleVecSize ||
		Factor != 4)
		DavidKreitzer: The old wording here was better, and you have a typo in "oration".
return false;		return false;

return true;		return true;
}		}
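
The updated support check can be restated as a small standalone predicate. This is only a sketch: `GroupInfo` and `isSupportedSketch` are hypothetical names, and the fields are assumed stand-ins for values the real method reads from the instruction, the data layout, and the subtarget.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flattened inputs (not part of the patch).
struct GroupInfo {
  bool HasAVX;
  bool IsStore;         // false means the group is rooted at a load
  uint64_t EltBits;     // size in bits of one shuffle element
  uint64_t ShuffleBits; // size in bits of Shuffles[0]'s vector type
  unsigned Factor;      // interleave stride
};

bool isSupportedSketch(const GroupInfo &G) {
  // 8-bit elements are only handled on the store path.
  if (G.EltBits == 8 && !G.IsStore)
    return false;
  // Loads are checked against 256 bits, everything else against 1024 bits,
  // matching ExpectedShuffleVecSize in the diff.
  uint64_t Expected = G.IsStore ? 1024 : 256;
  bool EltOk = G.EltBits == 64 || G.EltBits == 8;
  return EltOk && G.HasAVX && G.ShuffleBits == Expected && G.Factor == 4;
}
```

Under these assumptions, a stride-4 store of four `<32 x i8>` vectors (a 1024-bit wide shuffle) is accepted, while an 8-bit load group is rejected before the size checks run.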

void X86InterleavedAccessGroup::decompose(		void X86InterleavedAccessGroup::decompose(
Instruction *VecInst, unsigned NumSubVectors, VectorType *SubVecTy,		Instruction *VecInst, unsigned NumSubVectors, VectorType *SubVecTy,
SmallVectorImpl<Instruction *> &DecomposedVectors) {		SmallVectorImpl<Instruction *> &DecomposedVectors) {
Show All 32 Lines	for (unsigned i = 0; i < NumSubVectors; i++) {
// TODO: Support inbounds GEP.		// TODO: Support inbounds GEP.
Value *NewBasePtr = Builder.CreateGEP(VecBasePtr, Builder.getInt32(i));		Value *NewBasePtr = Builder.CreateGEP(VecBasePtr, Builder.getInt32(i));
Instruction *NewLoad =		Instruction *NewLoad =
Builder.CreateAlignedLoad(NewBasePtr, LI->getAlignment());		Builder.CreateAlignedLoad(NewBasePtr, LI->getAlignment());
DecomposedVectors.push_back(NewLoad);		DecomposedVectors.push_back(NewLoad);
}		}
}		}

		// Create shuffle mask for concatenation of two half vectors.
		// Low = false: mask generated for the shuffle
		Farhana: Why not use the createUnpackShuffleMask(..) defined in X86ISelLowering.cpp? I would recommend declaring a template version of it in X86ISelLowering.h that accepts both int and uint32_t masks, and removing this copy from X86InterleavedAccess.cpp.
		// shuffle(VEC1,VEC2,{NumElement/2, NumElement/2+1, NumElement/2+2...,
		// NumElement-1, NumElement+NumElement/2,
		// NumElement+NumElement/2+1..., 2*NumElement-1})
		// = concat(high_half(VEC1),high_half(VEC2))
		// Low = true: mask generated for the shuffle
		// shuffle(VEC1,VEC2,{0,1,2,...,NumElement/2-1,NumElement,
		RKSimon: shuffle
		// NumElement+1...,NumElement+NumElement/2-1})
		// = concat(low_half(VEC1),low_half(VEC2))
		static void createConcatShuffleMask(int NumElements,
		SmallVectorImpl<uint32_t> &Mask, bool Low) {
		int NumHalfElements = NumElements / 2;
		int Offset = Low ? 0 : NumHalfElements;
		for (int i = 0; i < NumHalfElements; ++i)
		Mask.push_back(i + Offset);
		for (int i = 0; i < NumHalfElements; ++i)
		Mask.push_back(i + Offset + NumElements);
		dorit: 80-column overflow... (and while you're fixing this, maybe use the same capitalization in naming of arguments, i.e. either "vec1" or "VEC1" form but not both).
		Farhana: The comment does not match the coding logic. Something is off here. Maybe it should be NumElement+NumElement/2?
		}
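
To make the concat mask concrete, here is a standalone restatement of the loop above using `std::vector` instead of `SmallVectorImpl`. The name `concatShuffleMask` and the return-by-value shape are illustrative assumptions, not part of the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds the mask for concat(half(VEC1), half(VEC2)); Low selects which half.
// Indices 0..NumElements-1 refer to VEC1, NumElements..2*NumElements-1 to VEC2.
std::vector<uint32_t> concatShuffleMask(int NumElements, bool Low) {
  std::vector<uint32_t> Mask;
  int NumHalfElements = NumElements / 2;
  int Offset = Low ? 0 : NumHalfElements;
  for (int i = 0; i < NumHalfElements; ++i)
    Mask.push_back(i + Offset);               // half of the first vector
  for (int i = 0; i < NumHalfElements; ++i)
    Mask.push_back(i + Offset + NumElements); // same half of the second vector
  return Mask;
}
```

For `NumElements = 8`, the low mask is `{0, 1, 2, 3, 8, 9, 10, 11}` and the high mask is `{4, 5, 6, 7, 12, 13, 14, 15}`, which is the pattern the comment above describes.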

		void X86InterleavedAccessGroup::interleave8bit_32x4(
		RKSimon: I think you could make this much simpler to understand:
		  int NumHalfElements = NumElement / 2;
		  int Offset = Low ? 0 : NumHalfElements;
		  for (int i = 0; i < NumHalfElements; ++i)
		    Mask.push_back(i + Offset);
		  for (int i = 0; i < NumHalfElements; ++i)
		    Mask.push_back(i + Offset + NumElements);
		DavidKreitzer: Ugh! I didn't realize that you were going to have to add this method to do the ArrayRef conversion. Sorry, I think I gave you bad advice. I think this is a worse evil than the explicit template arguments in the previous patch. Unless someone else has a better idea, my recommendation is to go back to your previous solution and accept the explicit template arguments as an unfortunate consequence of the inconsistent ShuffleVector interfaces.
		ArrayRef<Instruction *> Matrix,
		SmallVectorImpl<Value *> &TransposedMatrix) {

		// Example: Assuming we start from the following vectors:
		// Matrix[0]= c0 c1 c2 c3 c4 ... c31
		// Matrix[1]= m0 m1 m2 m3 m4 ... m31
		DavidKreitzer: Perhaps change "BeginIndex" to "CurrentIndex" since you are updating it as you go.
		// Matrix[2]= y0 y1 y2 y3 y4 ... y31
		// Matrix[3]= k0 k1 k2 k3 k4 ... k31

		DavidKreitzer: Would it be possible to simplify the implementation of this routine by just calling createUnpackShuffleMask with an i16 type plus calling scaleShuffleMask with a scale of 2? (That would require you to move scaleShuffleMask into X86ISelLowering.h like you did with createUnpackShuffleMask.)
		TransposedMatrix.resize(4);

		SmallVector<uint32_t, 32> MaskHighTemp;
		SmallVector<uint32_t, 32> MaskLowTemp;
		SmallVector<uint32_t, 32> MaskHighTemp1;
		SmallVector<uint32_t, 32> MaskLowTemp1;
		SmallVector<uint32_t, 32> MaskHighTemp2;
		Farhana: I would think you could write a more general transpose function, a function of 8 elements which would scale to 16 and 32. Why is it written only for a VLen of 32?
		m_zuckerman (author): You are right, and we plan to do so in the next patch.
		SmallVector<uint32_t, 32> MaskLowTemp2;
		SmallVector<uint32_t, 32> ConcatLow;
		SmallVector<uint32_t, 32> ConcatHigh;

		// MaskHighTemp and MaskLowTemp built in the vpunpckhbw and vpunpcklbw X86
		// shuffle pattern.

		createUnpackShuffleMask<uint32_t>(MVT::v32i8, MaskHighTemp, false, false);
		createUnpackShuffleMask<uint32_t>(MVT::v32i8, MaskLowTemp, true, false);
		ArrayRef<uint32_t> MaskHigh = makeArrayRef(MaskHighTemp);
		ArrayRef<uint32_t> MaskLow = makeArrayRef(MaskLowTemp);

		// ConcatHigh and ConcatLow built in the vperm2i128 and vinserti128 X86
		// shuffle pattern.

		createConcatShuffleMask(32, ConcatLow, true);
		createConcatShuffleMask(32, ConcatHigh, false);
		ArrayRef<uint32_t> MaskConcatLow = makeArrayRef(ConcatLow);
		ArrayRef<uint32_t> MaskConcatHigh = makeArrayRef(ConcatHigh);

		// MaskHighTemp1 and MaskLowTemp1 built in the vpunpckhwd and vpunpcklwd X86
		// shuffle pattern.

		createUnpackShuffleMask<uint32_t>(MVT::v16i16, MaskLowTemp1, true, false);
		createUnpackShuffleMask<uint32_t>(MVT::v16i16, MaskHighTemp1, false, false);
		scaleShuffleMask<uint32_t>(2, makeArrayRef(MaskHighTemp1), MaskHighTemp2);
		scaleShuffleMask<uint32_t>(2, makeArrayRef(MaskLowTemp1), MaskLowTemp2);
		ArrayRef<uint32_t> MaskHighWord = makeArrayRef(MaskHighTemp2);
		ArrayRef<uint32_t> MaskLowWord = makeArrayRef(MaskLowTemp2);

		// IntrVec1Low = c0 m0 c1 m1 ... c7 m7 | c16 m16 c17 m17 ... c23 m23
		// IntrVec1High = c8 m8 c9 m9 ... c15 m15 | c24 m24 c25 m25 ... c31 m31
		// IntrVec2Low = y0 k0 y1 k1 ... y7 k7 | y16 k16 y17 k17 ... y23 k23
		// IntrVec2High = y8 k8 y9 k9 ... y15 k15 | y24 k24 y25 k25 ... y31 k31

		Value *IntrVec1Low =
		Builder.CreateShuffleVector(Matrix[0], Matrix[1], MaskLow);
		Value *IntrVec1High =
		Builder.CreateShuffleVector(Matrix[0], Matrix[1], MaskHigh);
		Value *IntrVec2Low =
		Builder.CreateShuffleVector(Matrix[2], Matrix[3], MaskLow);
		Value *IntrVec2High =
		Builder.CreateShuffleVector(Matrix[2], Matrix[3], MaskHigh);

		// cmyk4 cmyk5 cmyk6 cmyk7 | cmyk20 cmyk21 cmyk22 cmyk23
		// cmyk12 cmyk13 cmyk14 cmyk15 | cmyk28 cmyk29 cmyk30 cmyk31
		// cmyk0 cmyk1 cmyk2 cmyk3 | cmyk16 cmyk17 cmyk18 cmyk19
		// cmyk8 cmyk9 cmyk10 cmyk11 | cmyk24 cmyk25 cmyk26 cmyk27

		Value *High =
		Builder.CreateShuffleVector(IntrVec1Low, IntrVec2Low, MaskHighWord);
		Value *High1 =
		Builder.CreateShuffleVector(IntrVec1High, IntrVec2High, MaskHighWord);
		Value *Low =
		Builder.CreateShuffleVector(IntrVec1Low, IntrVec2Low, MaskLowWord);
		DavidKreitzer: I think it is unnecessary and undesirable to do bitcasts here. It complicates both the IR and the code in X86InterleavedAccessGroup::lowerIntoOptimizedSequence, which now has to account for the interleave function returning a different type in "TransposedVectors" than the original "DecomposedVectors". Instead, you just need to "scale" your <16 x i32> masks to <32 x i32> like this: <a, b, ..., p> --> <a*2, a*2+1, b*2, b*2+1, ..., p*2, p*2+1>. There is an existing routine to do this scaling in X86ISelLowering.cpp, scaleShuffleMask. Also see canWidenShuffleElements, which is how the 32xi8 shuffle gets automatically converted to a 16xi16 shuffle by the CodeGen, ultimately causing the desired unpckwd instructions to be generated.
		Value *Low1 =
		Builder.CreateShuffleVector(IntrVec1High, IntrVec2High, MaskLowWord);

		// cmyk0 cmyk1 cmyk2 cmyk3 | cmyk4 cmyk5 cmyk6 cmyk7
		// cmyk8 cmyk9 cmyk10 cmyk11 | cmyk12 cmyk13 cmyk14 cmyk15
		// cmyk16 cmyk17 cmyk18 cmyk19 | cmyk20 cmyk21 cmyk22 cmyk23
		// cmyk24 cmyk25 cmyk26 cmyk27 | cmyk28 cmyk29 cmyk30 cmyk31

		TransposedMatrix[0] = Builder.CreateShuffleVector(Low, High, MaskConcatLow);
		TransposedMatrix[1] = Builder.CreateShuffleVector(Low1, High1, MaskConcatLow);
		TransposedMatrix[2] = Builder.CreateShuffleVector(Low, High, MaskConcatHigh);
		TransposedMatrix[3] =
		Builder.CreateShuffleVector(Low1, High1, MaskConcatHigh);
		}
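
The three shuffle stages of interleave8bit_32x4 can be checked outside LLVM with a small simulation. This is a sketch under stated assumptions: `shuffle`, `unpackMask`, `scaleMask`, `concatMask`, `makeRows` and `interleave32x4` are hypothetical helpers, and `unpackMask` is only assumed to match the per-128-bit-lane masks that createUnpackShuffleMask produces for the binary (non-unary) case. Elements are encoded as `channel * 32 + index`, so c5 is 5 and m5 is 37.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Vec = std::vector<uint32_t>;

// Two-input shuffle: mask indices below N read A, indices N..2N-1 read B.
Vec shuffle(const Vec &A, const Vec &B, const Vec &Mask) {
  Vec Out;
  const uint32_t N = A.size();
  for (uint32_t M : Mask)
    Out.push_back(M < N ? A[M] : B[M - N]);
  return Out;
}

// Binary unpacklo/unpackhi mask applied independently per 128-bit lane.
Vec unpackMask(int NumElts, int EltBits, bool Lo) {
  Vec Mask;
  int PerLane = 128 / EltBits;
  for (int i = 0; i < NumElts; ++i) {
    int LaneStart = (i / PerLane) * PerLane;
    int Pos = (i % PerLane) / 2 + LaneStart; // source element inside the lane
    Pos += NumElts * (i % 2);                // odd slots read the second input
    Pos += Lo ? 0 : PerLane / 2;             // the high variant starts mid-lane
    Mask.push_back(Pos);
  }
  return Mask;
}

// Rewrite a mask over wide elements as a mask over Scale-times-narrower ones.
Vec scaleMask(int Scale, const Vec &Mask) {
  Vec Out;
  for (uint32_t M : Mask)
    for (int s = 0; s < Scale; ++s)
      Out.push_back(M * Scale + s);
  return Out;
}

// concat(low halves) or concat(high halves) of two vectors.
Vec concatMask(int NumElts, bool Low) {
  Vec Mask;
  int Half = NumElts / 2, Offset = Low ? 0 : Half;
  for (int i = 0; i < Half; ++i) Mask.push_back(i + Offset);
  for (int i = 0; i < Half; ++i) Mask.push_back(i + Offset + NumElts);
  return Mask;
}

// Four rows of 32 elements: row ch holds ch*32 + 0 .. ch*32 + 31.
std::vector<Vec> makeRows() {
  std::vector<Vec> Rows(4);
  for (uint32_t Ch = 0; Ch < 4; ++Ch)
    for (uint32_t i = 0; i < 32; ++i)
      Rows[Ch].push_back(Ch * 32 + i);
  return Rows;
}

// Same shuffle order as the patch; returns the four results concatenated.
Vec interleave32x4(const std::vector<Vec> &M) {
  Vec MaskLow = unpackMask(32, 8, true), MaskHigh = unpackMask(32, 8, false);
  Vec MaskLowWord = scaleMask(2, unpackMask(16, 16, true));
  Vec MaskHighWord = scaleMask(2, unpackMask(16, 16, false));
  Vec CatLow = concatMask(32, true), CatHigh = concatMask(32, false);

  Vec V1Low = shuffle(M[0], M[1], MaskLow), V1High = shuffle(M[0], M[1], MaskHigh);
  Vec V2Low = shuffle(M[2], M[3], MaskLow), V2High = shuffle(M[2], M[3], MaskHigh);

  Vec High = shuffle(V1Low, V2Low, MaskHighWord);
  Vec High1 = shuffle(V1High, V2High, MaskHighWord);
  Vec Low = shuffle(V1Low, V2Low, MaskLowWord);
  Vec Low1 = shuffle(V1High, V2High, MaskLowWord);

  Vec Out;
  for (const Vec &Part : {shuffle(Low, High, CatLow), shuffle(Low1, High1, CatLow),
                          shuffle(Low, High, CatHigh), shuffle(Low1, High1, CatHigh)})
    Out.insert(Out.end(), Part.begin(), Part.end());
  return Out;
}
```

Calling `interleave32x4(makeRows())` should yield 128 values in which position `p` holds channel `p % 4` at index `p / 4`, i.e. the c0 m0 y0 k0 c1 m1 y1 k1 ... order from the summary.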

void X86InterleavedAccessGroup::transpose_4x4(		void X86InterleavedAccessGroup::transpose_4x4(
ArrayRef<Instruction *> Matrix,		ArrayRef<Instruction *> Matrix,
SmallVectorImpl<Value *> &TransposedMatrix) {		SmallVectorImpl<Value *> &TransposedMatrix) {
assert(Matrix.size() == 4 && "Invalid matrix size");		assert(Matrix.size() == 4 && "Invalid matrix size");
TransposedMatrix.resize(4);		TransposedMatrix.resize(4);

// dst = src1[0,1],src2[0,1]		// dst = src1[0,1],src2[0,1]
uint32_t IntMask1[] = {0, 1, 4, 5};		uint32_t IntMask1[] = {0, 1, 4, 5};
▲ Show 20 Lines • Show All 50 Lines • ▼ Show 20 Lines	bool X86InterleavedAccessGroup::lowerIntoOptimizedSequence() {
// Lower the interleaved stores:		// Lower the interleaved stores:
// 1. Decompose the interleaved wide shuffle into individual shuffle		// 1. Decompose the interleaved wide shuffle into individual shuffle
// vectors.		// vectors.
decompose(Shuffles[0], Factor,		decompose(Shuffles[0], Factor,
VectorType::get(ShuffleEltTy, NumSubVecElems), DecomposedVectors);		VectorType::get(ShuffleEltTy, NumSubVecElems), DecomposedVectors);

// 2. Transpose the interleaved-vectors into vectors of contiguous		// 2. Transpose the interleaved-vectors into vectors of contiguous
// elements.		// elements.
		switch (NumSubVecElems) {
		case 4:
transpose_4x4(DecomposedVectors, TransposedVectors);		transpose_4x4(DecomposedVectors, TransposedVectors);
		break;
		case 32:
		interleave8bit_32x4(DecomposedVectors, TransposedVectors);
		break;
		default:
		return false;
		}

// 3. Concatenate the contiguous-vectors back into a wide vector.		// 3. Concatenate the contiguous-vectors back into a wide vector.
Value *WideVec = concatenateVectors(Builder, TransposedVectors);		Value *WideVec = concatenateVectors(Builder, TransposedVectors);
		RKSimon: Indenting / clang-format

// 4. Generate a store instruction for wide-vec.		// 4. Generate a store instruction for wide-vec.
StoreInst *SI = cast<StoreInst>(Inst);		StoreInst *SI = cast<StoreInst>(Inst);
Builder.CreateAlignedStore(WideVec, SI->getPointerOperand(),		Builder.CreateAlignedStore(WideVec, SI->getPointerOperand(),
SI->getAlignment());		SI->getAlignment());

return true;		return true;
}		}

// Lower interleaved load(s) into target specific instructions/		// Lower interleaved load(s) into target specific instructions/
// intrinsics. Lowering sequence varies depending on the vector-types, factor,		// intrinsics. Lowering sequence varies depending on the vector-types, factor,
// number of shuffles and ISA.		// number of shuffles and ISA.
// Currently, lowering is supported for 4x64 bits with Factor = 4 on AVX.		// Currently, lowering is supported for 4x64 bits with Factor = 4 on AVX.
		Farhana: Value *VecInst = SI->getPointerOperand()
		Farhana: What is the reason for generating four stores instead of 1 wide store? I would think codegen is smart enough to generate the expected 4 optimized stores here. Can you please keep it consistent with the other case, which means do the concatenation of the transposed vectors and generate 1 wide store.
		  StoreInst *SI = cast<StoreInst>(Inst);
		  Value *BasePtr = SI->getPointerOperand();
		  case 32: {
		    transposeChar_32x4(DecomposedVectors, TransposedVectors);
		    // VecInst contains the Ptr argument.
		    Value *VecInst = SI->getPointerOperand();
		    Type *IntOf16 = Type::getInt16Ty(Shuffles[0]->getContext());
		    // From <128xi8>* to <64xi16>*
		    Type *VecTran = VectorType::get(IntOf16, 64)->getPointerTo();
		    BasePtr = Builder.CreateBitCast(VecInst, VecTran);
		And move the following two statements out of the switch.
		  // 3. Concatenate the contiguous-vectors back into a wide vector.
		  Value *WideVec = concatenateVectors(Builder, TransposedVectors);
		  // 4. Generate a store instruction for wide-vec.
		  Builder.CreateAlignedStore(WideVec, BasePtr, // replace this with
		                             SI->getAlignment());
		DavidKreitzer: +1
bool X86TargetLowering::lowerInterleavedLoad(		bool X86TargetLowering::lowerInterleavedLoad(
LoadInst *LI, ArrayRef<ShuffleVectorInst *> Shuffles,		LoadInst *LI, ArrayRef<ShuffleVectorInst *> Shuffles,
ArrayRef<unsigned> Indices, unsigned Factor) const {		ArrayRef<unsigned> Indices, unsigned Factor) const {
assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&		assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
		RKSimon: If you pull out the Type::getInt16Ty(Shuffles[0]->getContext()) you should be able to tidy all this up
"Invalid interleave factor");		"Invalid interleave factor");
		Farhana: ++i
assert(!Shuffles.empty() && "Empty shufflevector input");		assert(!Shuffles.empty() && "Empty shufflevector input");
assert(Shuffles.size() == Indices.size() &&		assert(Shuffles.size() == Indices.size() &&
"Unmatched number of shufflevectors and indices");		"Unmatched number of shufflevectors and indices");

// Create an interleaved access group.		// Create an interleaved access group.
IRBuilder<> Builder(LI);		IRBuilder<> Builder(LI);
X86InterleavedAccessGroup Grp(LI, Shuffles, Indices, Factor, Subtarget,		X86InterleavedAccessGroup Grp(LI, Shuffles, Indices, Factor, Subtarget,
Builder);		Builder);

return Grp.isSupported() && Grp.lowerIntoOptimizedSequence();		return Grp.isSupported() && Grp.lowerIntoOptimizedSequence();
}		}

bool X86TargetLowering::lowerInterleavedStore(StoreInst *SI,		bool X86TargetLowering::lowerInterleavedStore(StoreInst *SI,
ShuffleVectorInst *SVI,		ShuffleVectorInst *SVI,
unsigned Factor) const {		unsigned Factor) const {
		dorit: Have you tried enabling this also for AVX? (I understand if not, because with the current cost numbers that TTI returns for interleaved accesses on AVX we'll probably determine it's not worth vectorizing... so that may need to come along with an update of getInterleavedMemoryOpCost; maybe at least a TODO comment is needed.)
		DavidKreitzer: This comment is no longer accurate since you enabled this for AVX. It is probably okay to simply delete this comment since it is redundant with 105-107.
assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&		assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
"Invalid interleave factor");		"Invalid interleave factor");

assert(SVI->getType()->getVectorNumElements() % Factor == 0 &&		assert(SVI->getType()->getVectorNumElements() % Factor == 0 &&
"Invalid interleaved store");		"Invalid interleaved store");

// Holds the indices of SVI that correspond to the starting index of each		// Holds the indices of SVI that correspond to the starting index of each
// interleaved shuffle.		// interleaved shuffle.
Show All 14 Lines

test/CodeGen/X86/x86-interleaved-access.ll

Show First 20 Lines • Show All 163 Lines • ▼ Show 20 Lines	; AVX-NEXT: retq
%strided.v2 = shufflevector <16 x i64> %wide.vec, <16 x i64> undef, <4 x i32> <i32 2, i32 6, i32 10, i32 14>		%strided.v2 = shufflevector <16 x i64> %wide.vec, <16 x i64> undef, <4 x i32> <i32 2, i32 6, i32 10, i32 14>
%strided.v3 = shufflevector <16 x i64> %wide.vec, <16 x i64> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>		%strided.v3 = shufflevector <16 x i64> %wide.vec, <16 x i64> undef, <4 x i32> <i32 3, i32 7, i32 11, i32 15>
%add1 = add <4 x i64> %strided.v0, %strided.v1		%add1 = add <4 x i64> %strided.v0, %strided.v1
%add2 = add <4 x i64> %add1, %strided.v2		%add2 = add <4 x i64> %add1, %strided.v2
%add3 = add <4 x i64> %add2, %strided.v3		%add3 = add <4 x i64> %add2, %strided.v3
ret <4 x i64> %add3		ret <4 x i64> %add3
}		}

define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {		define void @store_factorf64_4(<16 x double>* %ptr, <4 x double> %v0, <4 x double> %v1, <4 x double> %v2, <4 x double> %v3) {
		RKSimon: Add this test to trunk with current codegen so this patch shows the diff.
; AVX1-LABEL: store_factorf64_4:		; AVX1-LABEL: store_factorf64_4:
; AVX1: # BB#0:		; AVX1: # BB#0:
; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm4		; AVX1-NEXT: vinsertf128 $1, %xmm2, %ymm0, %ymm4
; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm1, %ymm5		; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm1, %ymm5
; AVX1-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],ymm2[2,3]		; AVX1-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],ymm2[2,3]
; AVX1-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm1[2,3],ymm3[2,3]		; AVX1-NEXT: vperm2f128 {{.*#+}} ymm1 = ymm1[2,3],ymm3[2,3]
; AVX1-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm4[0],ymm5[0],ymm4[2],ymm5[2]		; AVX1-NEXT: vunpcklpd {{.*#+}} ymm2 = ymm4[0],ymm5[0],ymm4[2],ymm5[2]
; AVX1-NEXT: vunpcklpd {{.*#+}} ymm3 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]		; AVX1-NEXT: vunpcklpd {{.*#+}} ymm3 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]
▲ Show 20 Lines • Show All 69 Lines • ▼ Show 20 Lines	; AVX-NEXT: retq
store <16 x i64> %interleaved.vec, <16 x i64>* %ptr, align 16		store <16 x i64> %interleaved.vec, <16 x i64>* %ptr, align 16
ret void		ret void
}		}


define void @interleaved_store_vf32_i8_stride4(<32 x i8> %x1, <32 x i8> %x2, <32 x i8> %x3, <32 x i8> %x4, <128 x i8>* %p) {		define void @interleaved_store_vf32_i8_stride4(<32 x i8> %x1, <32 x i8> %x2, <32 x i8> %x3, <32 x i8> %x4, <128 x i8>* %p) {
; AVX1-LABEL: interleaved_store_vf32_i8_stride4:		; AVX1-LABEL: interleaved_store_vf32_i8_stride4:
; AVX1: # BB#0:		; AVX1: # BB#0:
; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm4 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]		; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm9 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]		; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm5
; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm4 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]		; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm6
; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm4, %ymm5		; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm7 = xmm6[0],xmm5[0],xmm6[1],xmm5[1],xmm6[2],xmm5[2],xmm6[3],xmm5[3],xmm6[4],xmm5[4],xmm6[5],xmm5[5],xmm6[6],xmm5[6],xmm6[7],xmm5[7]
; AVX1-NEXT: vmovaps {{.*#+}} ymm4 = [65535,0,65535,0,65535,0,65535,0,65535,0,65535,0,65535,0,65535,0]		; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm8 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
; AVX1-NEXT: vandnps %ymm5, %ymm4, %ymm5		; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm1 = xmm6[8],xmm5[8],xmm6[9],xmm5[9],xmm6[10],xmm5[10],xmm6[11],xmm5[11],xmm6[12],xmm5[12],xmm6[13],xmm5[13],xmm6[14],xmm5[14],xmm6[15],xmm5[15]
		DavidKreitzer: The codegen improvements here look great!
; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm6 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm6, %ymm6
; AVX1-NEXT: vandps %ymm4, %ymm6, %ymm6
; AVX1-NEXT: vorps %ymm5, %ymm6, %ymm8
; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm6 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm0[4],xmm6[4],xmm0[5],xmm6[5],xmm0[6],xmm6[6],xmm0[7],xmm6[7]
; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm6 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3]
; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm6, %ymm6
; AVX1-NEXT: vandnps %ymm6, %ymm4, %ymm6
; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm7 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
; AVX1-NEXT: vinsertf128 $1, %xmm5, %ymm7, %ymm5
; AVX1-NEXT: vandps %ymm4, %ymm5, %ymm5
; AVX1-NEXT: vorps %ymm6, %ymm5, %ymm9
; AVX1-NEXT: vextractf128 $1, %ymm3, %xmm3
; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm2
; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm5 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]		; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm5 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm0[4],xmm5[4],xmm0[5],xmm5[5],xmm0[6],xmm5[6],xmm0[7],xmm5[7]		; AVX1-NEXT: vextractf128 $1, %ymm3, %xmm6
; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm5 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3]		; AVX1-NEXT: vextractf128 $1, %ymm2, %xmm0
; AVX1-NEXT: vinsertf128 $1, %xmm7, %ymm5, %ymm5		; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm4 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3],xmm0[4],xmm6[4],xmm0[5],xmm6[5],xmm0[6],xmm6[6],xmm0[7],xmm6[7]
; AVX1-NEXT: vandnps %ymm5, %ymm4, %ymm5
; AVX1-NEXT: vextractf128 $1, %ymm1, %xmm1
; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm0
; AVX1-NEXT: vpunpcklbw {{.*#+}} xmm7 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
; AVX1-NEXT: vinsertf128 $1, %xmm6, %ymm7, %ymm6
; AVX1-NEXT: vandps %ymm4, %ymm6, %ymm6
; AVX1-NEXT: vorps %ymm5, %ymm6, %ymm5
; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]		; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]		; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm6[8],xmm0[9],xmm6[9],xmm0[10],xmm6[10],xmm0[11],xmm6[11],xmm0[12],xmm6[12],xmm0[13],xmm6[13],xmm0[14],xmm6[14],xmm0[15],xmm6[15]
; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]		; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm7[4],xmm4[4],xmm7[5],xmm4[5],xmm7[6],xmm4[6],xmm7[7],xmm4[7]
; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm2, %ymm2		; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm9[4],xmm5[4],xmm9[5],xmm5[5],xmm9[6],xmm5[6],xmm9[7],xmm5[7]
; AVX1-NEXT: vandnps %ymm2, %ymm4, %ymm2		; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm6, %ymm10
; AVX1-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]		; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm1[4],xmm0[4],xmm1[5],xmm0[5],xmm1[6],xmm0[6],xmm1[7],xmm0[7]
; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm1 = xmm0[4,4,5,5,6,6,7,7]		; AVX1-NEXT: vpunpckhwd {{.*#+}} xmm11 = xmm8[4],xmm2[4],xmm8[5],xmm2[5],xmm8[6],xmm2[6],xmm8[7],xmm2[7]
; AVX1-NEXT: vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero		; AVX1-NEXT: vinsertf128 $1, %xmm3, %ymm11, %ymm3
; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0		; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm4 = xmm7[0],xmm4[0],xmm7[1],xmm4[1],xmm7[2],xmm4[2],xmm7[3],xmm4[3]
; AVX1-NEXT: vandps %ymm4, %ymm0, %ymm0		; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm5 = xmm9[0],xmm5[0],xmm9[1],xmm5[1],xmm9[2],xmm5[2],xmm9[3],xmm5[3]
; AVX1-NEXT: vorps %ymm2, %ymm0, %ymm0		; AVX1-NEXT: vinsertf128 $1, %xmm4, %ymm5, %ymm4
		; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm0 = xmm1[0],xmm0[0],xmm1[1],xmm0[1],xmm1[2],xmm0[2],xmm1[3],xmm0[3]
		; AVX1-NEXT: vpunpcklwd {{.*#+}} xmm1 = xmm8[0],xmm2[0],xmm8[1],xmm2[1],xmm8[2],xmm2[2],xmm8[3],xmm2[3]
		; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
		; AVX1-NEXT: vinsertf128 $1, %xmm6, %ymm4, %ymm1
		; AVX1-NEXT: vinsertf128 $1, %xmm11, %ymm0, %ymm2
		; AVX1-NEXT: vperm2f128 {{.*#+}} ymm4 = ymm4[2,3],ymm10[2,3]
		; AVX1-NEXT: vperm2f128 {{.*#+}} ymm0 = ymm0[2,3],ymm3[2,3]
; AVX1-NEXT: vmovaps %ymm0, 96(%rdi)		; AVX1-NEXT: vmovaps %ymm0, 96(%rdi)
; AVX1-NEXT: vmovaps %ymm5, 64(%rdi)		; AVX1-NEXT: vmovaps %ymm4, 64(%rdi)
; AVX1-NEXT: vmovaps %ymm9, 32(%rdi)		; AVX1-NEXT: vmovaps %ymm2, 32(%rdi)
; AVX1-NEXT: vmovaps %ymm8, (%rdi)		; AVX1-NEXT: vmovaps %ymm1, (%rdi)
; AVX1-NEXT: vzeroupper		; AVX1-NEXT: vzeroupper
; AVX1-NEXT: retq		; AVX1-NEXT: retq
;		;
; AVX-LABEL: interleaved_store_vf32_i8_stride4:		; AVX-LABEL: interleaved_store_vf32_i8_stride4:
; AVX: # BB#0:		; AVX: # BB#0:
; AVX-NEXT: vpunpcklbw {{.*#+}} xmm4 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]		; AVX-NEXT: vpunpcklbw {{.*#+}} ymm4 = ymm0[0],ymm1[0],ymm0[1],ymm1[1],ymm0[2],ymm1[2],ymm0[3],ymm1[3],ymm0[4],ymm1[4],ymm0[5],ymm1[5],ymm0[6],ymm1[6],ymm0[7],ymm1[7],ymm0[16],ymm1[16],ymm0[17],ymm1[17],ymm0[18],ymm1[18],ymm0[19],ymm1[19],ymm0[20],ymm1[20],ymm0[21],ymm1[21],ymm0[22],ymm1[22],ymm0[23],ymm1[23]
; AVX-NEXT: vpunpckhwd {{.*#+}} xmm5 = xmm0[4],xmm4[4],xmm0[5],xmm4[5],xmm0[6],xmm4[6],xmm0[7],xmm4[7]		; AVX-NEXT: vpunpckhbw {{.*#+}} ymm0 = ymm0[8],ymm1[8],ymm0[9],ymm1[9],ymm0[10],ymm1[10],ymm0[11],ymm1[11],ymm0[12],ymm1[12],ymm0[13],ymm1[13],ymm0[14],ymm1[14],ymm0[15],ymm1[15],ymm0[24],ymm1[24],ymm0[25],ymm1[25],ymm0[26],ymm1[26],ymm0[27],ymm1[27],ymm0[28],ymm1[28],ymm0[29],ymm1[29],ymm0[30],ymm1[30],ymm0[31],ymm1[31]
; AVX-NEXT: vpunpcklwd {{.*#+}} xmm4 = xmm0[0],xmm4[0],xmm0[1],xmm4[1],xmm0[2],xmm4[2],xmm0[3],xmm4[3]		; AVX-NEXT: vpunpcklbw {{.*#+}} ymm1 = ymm2[0],ymm3[0],ymm2[1],ymm3[1],ymm2[2],ymm3[2],ymm2[3],ymm3[3],ymm2[4],ymm3[4],ymm2[5],ymm3[5],ymm2[6],ymm3[6],ymm2[7],ymm3[7],ymm2[16],ymm3[16],ymm2[17],ymm3[17],ymm2[18],ymm3[18],ymm2[19],ymm3[19],ymm2[20],ymm3[20],ymm2[21],ymm3[21],ymm2[22],ymm3[22],ymm2[23],ymm3[23]
; AVX-NEXT: vinserti128 $1, %xmm5, %ymm4, %ymm4		; AVX-NEXT: vpunpckhbw {{.*#+}} ymm2 = ymm2[8],ymm3[8],ymm2[9],ymm3[9],ymm2[10],ymm3[10],ymm2[11],ymm3[11],ymm2[12],ymm3[12],ymm2[13],ymm3[13],ymm2[14],ymm3[14],ymm2[15],ymm3[15],ymm2[24],ymm3[24],ymm2[25],ymm3[25],ymm2[26],ymm3[26],ymm2[27],ymm3[27],ymm2[28],ymm3[28],ymm2[29],ymm3[29],ymm2[30],ymm3[30],ymm2[31],ymm3[31]
; AVX-NEXT: vpunpcklbw {{.*#+}} xmm5 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]		; AVX-NEXT: vpunpckhwd {{.*#+}} ymm3 = ymm4[4],ymm1[4],ymm4[5],ymm1[5],ymm4[6],ymm1[6],ymm4[7],ymm1[7],ymm4[12],ymm1[12],ymm4[13],ymm1[13],ymm4[14],ymm1[14],ymm4[15],ymm1[15]
; AVX-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm5[4],xmm0[4],xmm5[5],xmm0[5],xmm5[6],xmm0[6],xmm5[7],xmm0[7]		; AVX-NEXT: vpunpckhwd {{.*#+}} ymm5 = ymm0[4],ymm2[4],ymm0[5],ymm2[5],ymm0[6],ymm2[6],ymm0[7],ymm2[7],ymm0[12],ymm2[12],ymm0[13],ymm2[13],ymm0[14],ymm2[14],ymm0[15],ymm2[15]
; AVX-NEXT: vpmovzxwd {{.*#+}} xmm5 = xmm5[0],zero,xmm5[1],zero,xmm5[2],zero,xmm5[3],zero		; AVX-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm4[0],ymm1[0],ymm4[1],ymm1[1],ymm4[2],ymm1[2],ymm4[3],ymm1[3],ymm4[8],ymm1[8],ymm4[9],ymm1[9],ymm4[10],ymm1[10],ymm4[11],ymm1[11]
; AVX-NEXT: vinserti128 $1, %xmm6, %ymm5, %ymm5		; AVX-NEXT: vpunpcklwd {{.*#+}} ymm0 = ymm0[0],ymm2[0],ymm0[1],ymm2[1],ymm0[2],ymm2[2],ymm0[3],ymm2[3],ymm0[8],ymm2[8],ymm0[9],ymm2[9],ymm0[10],ymm2[10],ymm0[11],ymm2[11]
; AVX-NEXT: vpblendw {{.*#+}} ymm8 = ymm5[0],ymm4[1],ymm5[2],ymm4[3],ymm5[4],ymm4[5],ymm5[6],ymm4[7],ymm5[8],ymm4[9],ymm5[10],ymm4[11],ymm5[12],ymm4[13],ymm5[14],ymm4[15]		; AVX-NEXT: vinserti128 $1, %xmm3, %ymm1, %ymm2
; AVX-NEXT: vpunpckhbw {{.*#+}} xmm5 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]		; AVX-NEXT: vinserti128 $1, %xmm5, %ymm0, %ymm4
; AVX-NEXT: vpunpckhwd {{.*#+}} xmm6 = xmm0[4],xmm5[4],xmm0[5],xmm5[5],xmm0[6],xmm5[6],xmm0[7],xmm5[7]		; AVX-NEXT: vperm2i128 {{.*#+}} ymm1 = ymm1[2,3],ymm3[2,3]
; AVX-NEXT: vpunpcklwd {{.*#+}} xmm5 = xmm0[0],xmm5[0],xmm0[1],xmm5[1],xmm0[2],xmm5[2],xmm0[3],xmm5[3]		; AVX-NEXT: vperm2i128 {{.*#+}} ymm0 = ymm0[2,3],ymm5[2,3]
; AVX-NEXT: vinserti128 $1, %xmm6, %ymm5, %ymm5
; AVX-NEXT: vpunpckhbw {{.*#+}} xmm6 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
; AVX-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm6[4],xmm0[4],xmm6[5],xmm0[5],xmm6[6],xmm0[6],xmm6[7],xmm0[7]
; AVX-NEXT: vpmovzxwd {{.*#+}} xmm6 = xmm6[0],zero,xmm6[1],zero,xmm6[2],zero,xmm6[3],zero
; AVX-NEXT: vinserti128 $1, %xmm7, %ymm6, %ymm6
; AVX-NEXT: vpblendw {{.*#+}} ymm5 = ymm6[0],ymm5[1],ymm6[2],ymm5[3],ymm6[4],ymm5[5],ymm6[6],ymm5[7],ymm6[8],ymm5[9],ymm6[10],ymm5[11],ymm6[12],ymm5[13],ymm6[14],ymm5[15]
; AVX-NEXT: vextracti128 $1, %ymm3, %xmm3
; AVX-NEXT: vextracti128 $1, %ymm2, %xmm2
; AVX-NEXT: vpunpcklbw {{.*#+}} xmm6 = xmm2[0],xmm3[0],xmm2[1],xmm3[1],xmm2[2],xmm3[2],xmm2[3],xmm3[3],xmm2[4],xmm3[4],xmm2[5],xmm3[5],xmm2[6],xmm3[6],xmm2[7],xmm3[7]
; AVX-NEXT: vpunpckhwd {{.*#+}} xmm7 = xmm0[4],xmm6[4],xmm0[5],xmm6[5],xmm0[6],xmm6[6],xmm0[7],xmm6[7]
; AVX-NEXT: vpunpcklwd {{.*#+}} xmm6 = xmm0[0],xmm6[0],xmm0[1],xmm6[1],xmm0[2],xmm6[2],xmm0[3],xmm6[3]
; AVX-NEXT: vinserti128 $1, %xmm7, %ymm6, %ymm6
; AVX-NEXT: vextracti128 $1, %ymm1, %xmm1
; AVX-NEXT: vextracti128 $1, %ymm0, %xmm0
; AVX-NEXT: vpunpcklbw {{.*#+}} xmm7 = xmm0[0],xmm1[0],xmm0[1],xmm1[1],xmm0[2],xmm1[2],xmm0[3],xmm1[3],xmm0[4],xmm1[4],xmm0[5],xmm1[5],xmm0[6],xmm1[6],xmm0[7],xmm1[7]
; AVX-NEXT: vpunpckhwd {{.*#+}} xmm4 = xmm7[4],xmm0[4],xmm7[5],xmm0[5],xmm7[6],xmm0[6],xmm7[7],xmm0[7]
; AVX-NEXT: vpmovzxwd {{.*#+}} xmm7 = xmm7[0],zero,xmm7[1],zero,xmm7[2],zero,xmm7[3],zero
; AVX-NEXT: vinserti128 $1, %xmm4, %ymm7, %ymm4
; AVX-NEXT: vpblendw {{.*#+}} ymm4 = ymm4[0],ymm6[1],ymm4[2],ymm6[3],ymm4[4],ymm6[5],ymm4[6],ymm6[7],ymm4[8],ymm6[9],ymm4[10],ymm6[11],ymm4[12],ymm6[13],ymm4[14],ymm6[15]
; AVX-NEXT: vpunpckhbw {{.*#+}} xmm2 = xmm2[8],xmm3[8],xmm2[9],xmm3[9],xmm2[10],xmm3[10],xmm2[11],xmm3[11],xmm2[12],xmm3[12],xmm2[13],xmm3[13],xmm2[14],xmm3[14],xmm2[15],xmm3[15]
; AVX-NEXT: vpunpckhwd {{.*#+}} xmm3 = xmm0[4],xmm2[4],xmm0[5],xmm2[5],xmm0[6],xmm2[6],xmm0[7],xmm2[7]
; AVX-NEXT: vpunpcklwd {{.*#+}} xmm2 = xmm0[0],xmm2[0],xmm0[1],xmm2[1],xmm0[2],xmm2[2],xmm0[3],xmm2[3]
; AVX-NEXT: vinserti128 $1, %xmm3, %ymm2, %ymm2
; AVX-NEXT: vpunpckhbw {{.*#+}} xmm0 = xmm0[8],xmm1[8],xmm0[9],xmm1[9],xmm0[10],xmm1[10],xmm0[11],xmm1[11],xmm0[12],xmm1[12],xmm0[13],xmm1[13],xmm0[14],xmm1[14],xmm0[15],xmm1[15]
; AVX-NEXT: vpunpckhwd {{.*#+}} xmm1 = xmm0[4,4,5,5,6,6,7,7]
; AVX-NEXT: vpmovzxwd {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
; AVX-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
; AVX-NEXT: vpblendw {{.*#+}} ymm0 = ymm0[0],ymm2[1],ymm0[2],ymm2[3],ymm0[4],ymm2[5],ymm0[6],ymm2[7],ymm0[8],ymm2[9],ymm0[10],ymm2[11],ymm0[12],ymm2[13],ymm0[14],ymm2[15]
; AVX-NEXT: vmovdqa %ymm0, 96(%rdi)		; AVX-NEXT: vmovdqa %ymm0, 96(%rdi)
; AVX-NEXT: vmovdqa %ymm4, 64(%rdi)		; AVX-NEXT: vmovdqa %ymm1, 64(%rdi)
; AVX-NEXT: vmovdqa %ymm5, 32(%rdi)		; AVX-NEXT: vmovdqa %ymm4, 32(%rdi)
; AVX-NEXT: vmovdqa %ymm8, (%rdi)		; AVX-NEXT: vmovdqa %ymm2, (%rdi)
; AVX-NEXT: vzeroupper		; AVX-NEXT: vzeroupper
; AVX-NEXT: retq		; AVX-NEXT: retq
%v1 = shufflevector <32 x i8> %x1, <32 x i8> %x2, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>		%v1 = shufflevector <32 x i8> %x1, <32 x i8> %x2, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
%v2 = shufflevector <32 x i8> %x3, <32 x i8> %x4, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>		%v2 = shufflevector <32 x i8> %x3, <32 x i8> %x4, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
%interleaved.vec = shufflevector <64 x i8> %v1, <64 x i8> %v2, <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>		%interleaved.vec = shufflevector <64 x i8> %v1, <64 x i8> %v2, <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>
store <128 x i8> %interleaved.vec, <128 x i8>* %p		store <128 x i8> %interleaved.vec, <128 x i8>* %p
ret void		ret void
}		}
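The 128-entry mask of %interleaved.vec in the test above follows a simple rule: output position 4*i + j reads element i of input j, where the four inputs are laid out back to back. A minimal sketch of that rule, assuming a hypothetical helper name `interleave_mask` that does not appear in the patch:

```python
def interleave_mask(vf, stride):
    """Shuffle mask that interleaves `stride` vectors of `vf` elements each.

    The inputs are assumed to be concatenated, so element i of input j
    has index j * vf + i in the concatenation.
    """
    return [j * vf + i for i in range(vf) for j in range(stride)]


mask = interleave_mask(32, 4)
print(mask[:8])  # [0, 32, 64, 96, 1, 33, 65, 97]
```

This is only a model of the IR-level mask; it says nothing about how the backend selects instructions for it.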

test/Transforms/InterleavedAccess/X86/interleavedStore.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				RKSimon: Add this file to trunk with current codegen so this patch shows the diff.
	; RUN: opt < %s -mtriple=x86_64-pc-linux -mattr=+avx -mattr=+avx2 -interleaved-access -S | FileCheck %s			; RUN: opt < %s -mtriple=x86_64-pc-linux -mattr=+avx2 -interleaved-access -S | FileCheck %s
				RKSimon: You just need -mattr=+avx2 - it implies -mattr=+avx

	define void @interleaved_store_vf32_i8_stride4(<32 x i8> %x1, <32 x i8> %x2, <32 x i8> %x3, <32 x i8> %x4, <128 x i8>* %p) {			define void @interleaved_store_vf32_i8_stride4(<32 x i8> %x1, <32 x i8> %x2, <32 x i8> %x3, <32 x i8> %x4, <128 x i8>* %p) {
	; CHECK-LABEL: @interleaved_store_vf32_i8_stride4(			; CHECK-LABEL: @interleaved_store_vf32_i8_stride4(
	; CHECK-NEXT: [[V1:%.*]] = shufflevector <32 x i8> [[X1:%.*]], <32 x i8> [[X2:%.*]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>			; CHECK-NEXT: [[V1:%.*]] = shufflevector <32 x i8> [[X1:%.*]], <32 x i8> [[X2:%.*]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[V2:%.*]] = shufflevector <32 x i8> [[X3:%.*]], <32 x i8> [[X4:%.*]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>			; CHECK-NEXT: [[V2:%.*]] = shufflevector <32 x i8> [[X3:%.*]], <32 x i8> [[X4:%.*]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	; CHECK-NEXT: [[INTERLEAVED_VEC:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>			; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
	; CHECK-NEXT: store <128 x i8> [[INTERLEAVED_VEC]], <128 x i8>* [[P:%.*]]			; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <32 x i32> <i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
				; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <32 x i32> <i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 72, i32 73, i32 74, i32 75, i32 76, i32 77, i32 78, i32 79, i32 80, i32 81, i32 82, i32 83, i32 84, i32 85, i32 86, i32 87, i32 88, i32 89, i32 90, i32 91, i32 92, i32 93, i32 94, i32 95>
				; CHECK-NEXT: [[TMP4:%.*]] = shufflevector <64 x i8> [[V1]], <64 x i8> [[V2]], <32 x i32> <i32 96, i32 97, i32 98, i32 99, i32 100, i32 101, i32 102, i32 103, i32 104, i32 105, i32 106, i32 107, i32 108, i32 109, i32 110, i32 111, i32 112, i32 113, i32 114, i32 115, i32 116, i32 117, i32 118, i32 119, i32 120, i32 121, i32 122, i32 123, i32 124, i32 125, i32 126, i32 127>
				; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <32 x i8> [[TMP1]], <32 x i8> [[TMP2]], <32 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55>
				; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <32 x i8> [[TMP1]], <32 x i8> [[TMP2]], <32 x i32> <i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
				; CHECK-NEXT: [[TMP7:%.*]] = shufflevector <32 x i8> [[TMP3]], <32 x i8> [[TMP4]], <32 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55>
				; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <32 x i8> [[TMP3]], <32 x i8> [[TMP4]], <32 x i32> <i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
				; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <32 x i8> [[TMP5]], <32 x i8> [[TMP7]], <32 x i32> <i32 8, i32 9, i32 40, i32 41, i32 10, i32 11, i32 42, i32 43, i32 12, i32 13, i32 44, i32 45, i32 14, i32 15, i32 46, i32 47, i32 24, i32 25, i32 56, i32 57, i32 26, i32 27, i32 58, i32 59, i32 28, i32 29, i32 60, i32 61, i32 30, i32 31, i32 62, i32 63>
				; CHECK-NEXT: [[TMP10:%.*]] = shufflevector <32 x i8> [[TMP6]], <32 x i8> [[TMP8]], <32 x i32> <i32 8, i32 9, i32 40, i32 41, i32 10, i32 11, i32 42, i32 43, i32 12, i32 13, i32 44, i32 45, i32 14, i32 15, i32 46, i32 47, i32 24, i32 25, i32 56, i32 57, i32 26, i32 27, i32 58, i32 59, i32 28, i32 29, i32 60, i32 61, i32 30, i32 31, i32 62, i32 63>
				; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <32 x i8> [[TMP5]], <32 x i8> [[TMP7]], <32 x i32> <i32 0, i32 1, i32 32, i32 33, i32 2, i32 3, i32 34, i32 35, i32 4, i32 5, i32 36, i32 37, i32 6, i32 7, i32 38, i32 39, i32 16, i32 17, i32 48, i32 49, i32 18, i32 19, i32 50, i32 51, i32 20, i32 21, i32 52, i32 53, i32 22, i32 23, i32 54, i32 55>
				; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <32 x i8> [[TMP6]], <32 x i8> [[TMP8]], <32 x i32> <i32 0, i32 1, i32 32, i32 33, i32 2, i32 3, i32 34, i32 35, i32 4, i32 5, i32 36, i32 37, i32 6, i32 7, i32 38, i32 39, i32 16, i32 17, i32 48, i32 49, i32 18, i32 19, i32 50, i32 51, i32 20, i32 21, i32 52, i32 53, i32 22, i32 23, i32 54, i32 55>
				; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <32 x i8> [[TMP11]], <32 x i8> [[TMP9]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47>
				; CHECK-NEXT: [[TMP14:%.*]] = shufflevector <32 x i8> [[TMP12]], <32 x i8> [[TMP10]], <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47>
				; CHECK-NEXT: [[TMP15:%.*]] = shufflevector <32 x i8> [[TMP11]], <32 x i8> [[TMP9]], <32 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
				; CHECK-NEXT: [[TMP16:%.*]] = shufflevector <32 x i8> [[TMP12]], <32 x i8> [[TMP10]], <32 x i32> <i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
				; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <32 x i8> [[TMP13]], <32 x i8> [[TMP14]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
				; CHECK-NEXT: [[TMP18:%.*]] = shufflevector <32 x i8> [[TMP15]], <32 x i8> [[TMP16]], <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
				; CHECK-NEXT: [[TMP19:%.*]] = shufflevector <64 x i8> [[TMP17]], <64 x i8> [[TMP18]], <128 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63, i32 64, i32 65, i32 66, i32 67, i32 68, i32 69, i32 70, i32 71, i32 72, i32 73, i32 74, i32 75, i32 76, i32 77, i32 78, i32 79, i32 80, i32 81, i32 82, i32 83, i32 84, i32 85, i32 86, i32 87, i32 88, i32 89, i32 90, i32 91, i32 92, i32 93, i32 94, i32 95, i32 96, i32 97, i32 98, i32 99, i32 100, i32 101, i32 102, i32 103, i32 104, i32 105, i32 106, i32 107, i32 108, i32 109, i32 110, i32 111, i32 112, i32 113, i32 114, i32 115, i32 116, i32 117, i32 118, i32 119, i32 120, i32 121, i32 122, i32 123, i32 124, i32 125, i32 126, i32 127>
				; CHECK-NEXT: store <128 x i8> [[TMP19]], <128 x i8>* [[P:%.*]]
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	%v1 = shufflevector <32 x i8> %x1, <32 x i8> %x2, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>			%v1 = shufflevector <32 x i8> %x1, <32 x i8> %x2, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	%v2 = shufflevector <32 x i8> %x3, <32 x i8> %x4, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>			%v2 = shufflevector <32 x i8> %x3, <32 x i8> %x4, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38, i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46, i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54, i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62, i32 63>
	%interleaved.vec = shufflevector <64 x i8> %v1, <64 x i8> %v2, <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>			%interleaved.vec = shufflevector <64 x i8> %v1, <64 x i8> %v2, <128 x i32> <i32 0, i32 32, i32 64, i32 96, i32 1, i32 33, i32 65, i32 97, i32 2, i32 34, i32 66, i32 98, i32 3, i32 35, i32 67, i32 99, i32 4, i32 36, i32 68, i32 100, i32 5, i32 37, i32 69, i32 101, i32 6, i32 38, i32 70, i32 102, i32 7, i32 39, i32 71, i32 103, i32 8, i32 40, i32 72, i32 104, i32 9, i32 41, i32 73, i32 105, i32 10, i32 42, i32 74, i32 106, i32 11, i32 43, i32 75, i32 107, i32 12, i32 44, i32 76, i32 108, i32 13, i32 45, i32 77, i32 109, i32 14, i32 46, i32 78, i32 110, i32 15, i32 47, i32 79, i32 111, i32 16, i32 48, i32 80, i32 112, i32 17, i32 49, i32 81, i32 113, i32 18, i32 50, i32 82, i32 114, i32 19, i32 51, i32 83, i32 115, i32 20, i32 52, i32 84, i32 116, i32 21, i32 53, i32 85, i32 117, i32 22, i32 54, i32 86, i32 118, i32 23, i32 55, i32 87, i32 119, i32 24, i32 56, i32 88, i32 120, i32 25, i32 57, i32 89, i32 121, i32 26, i32 58, i32 90, i32 122, i32 27, i32 59, i32 91, i32 123, i32 28, i32 60, i32 92, i32 124, i32 29, i32 61, i32 93, i32 125, i32 30, i32 62, i32 94, i32 126, i32 31, i32 63, i32 95, i32 127>
	store <128 x i8> %interleaved.vec, <128 x i8>* %p			store <128 x i8> %interleaved.vec, <128 x i8>* %p
	ret void			ret void
	}			}
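The new CHECK lines above use three kinds of masks: unpacklo/unpackhi on bytes (TMP5 to TMP8), unpacklo/unpackhi on 16-bit pairs (TMP9 to TMP12), and a concatenation of two low or two high 128-bit halves (TMP13 to TMP16). The following sketch models that sequence on plain Python lists and checks that it reproduces the stride-4 interleave. The function names (`unpack_mask`, `half_concat_mask`, `interleave_stride4`) are hypothetical and are not the names used in X86InterleavedAccess.cpp; the sketch only mirrors the masks visible in the test.

```python
def shuffle(a, b, mask):
    """Model of shufflevector: each mask entry indexes the concatenation a ++ b."""
    both = a + b
    return [both[i] for i in mask]


def unpack_mask(elem_bytes, high, num_bytes=32, lane_bytes=16):
    """Byte mask mimicking unpacklo/unpackhi inside each 128-bit lane."""
    mask = []
    half = lane_bytes // 2
    for lane in range(0, num_bytes, lane_bytes):
        start = lane + (half if high else 0)
        for e in range(0, half, elem_bytes):
            first = list(range(start + e, start + e + elem_bytes))
            mask += first                           # element from operand 0
            mask += [num_bytes + i for i in first]  # same element from operand 1
    return mask


def half_concat_mask(high, num_bytes=32):
    """Mask joining the low (or high) halves of two vectors."""
    half = num_bytes // 2
    start = half if high else 0
    return [src * num_bytes + start + i for src in range(2) for i in range(half)]


def interleave_stride4(x1, x2, x3, x4):
    """Replay the TMP5..TMP19 shuffle sequence from the CHECK lines."""
    lo8, hi8 = unpack_mask(1, False), unpack_mask(1, True)
    lo16, hi16 = unpack_mask(2, False), unpack_mask(2, True)
    t5, t6 = shuffle(x1, x2, lo8), shuffle(x1, x2, hi8)
    t7, t8 = shuffle(x3, x4, lo8), shuffle(x3, x4, hi8)
    t9, t10 = shuffle(t5, t7, hi16), shuffle(t6, t8, hi16)
    t11, t12 = shuffle(t5, t7, lo16), shuffle(t6, t8, lo16)
    lo, hi = half_concat_mask(False), half_concat_mask(True)
    t13, t14 = shuffle(t11, t9, lo), shuffle(t12, t10, lo)
    t15, t16 = shuffle(t11, t9, hi), shuffle(t12, t10, hi)
    return t13 + t14 + t15 + t16  # TMP17..TMP19 are plain concatenations


# Tag every byte with its source vector (c, m, y, k) and its position.
inputs = [[(name, i) for i in range(32)] for name in "cmyk"]
result = interleave_stride4(*inputs)
expected = [inputs[j][i] for i in range(32) for j in range(4)]
assert result == expected
```

Tagging each element with its source makes it easy to see that the output begins c0 m0 y0 k0 c1 m1 y1 k1, which is the layout described in the summary.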

Revision Contents

Diff 107809

lib/Target/X86/X86ISelLowering.h

lib/Target/X86/X86ISelLowering.cpp

lib/Target/X86/X86InterleavedAccess.cpp

test/CodeGen/X86/x86-interleaved-access.ll

test/Transforms/InterleavedAccess/X86/interleavedStore.ll
