This is an archive of the discontinued LLVM Phabricator instance.

[ARM/AArch64] Support wide interleaved accesses
ClosedPublic

Authored by mssimpso on Feb 2 2017, 11:00 AM.


Details

Reviewers

rengolin
MatzeB
javed.absar
mcrosier

Commits

rG1bfa159db981: [ARM/AArch64] Support wide interleaved accesses
rL296750: [ARM/AArch64] Support wide interleaved accesses

Summary

This patch teaches (ARM|AArch64)ISelLowering.cpp to match illegal vector types to interleaved access intrinsics as long as the types are multiples of the vector register width (128 bits). A "wide" access will now be mapped to multiple interleave intrinsics similar to the way in which non-interleaved accesses with illegal types are legalized into multiple accesses. For example, given an interleaved access whose sub-vectors are 256 bits wide, the patch would generate 2 consecutive interleaved memory accesses.

The primary motivation is the vectorization of "mixed-type" loops, such as the one shown below.

void f(char *A, int *B, unsigned N) {
  for (unsigned i = 0; i < N; i += 3) {
    B[i + 0] = A[i + 0];
    B[i + 1] = A[i + 1];
    B[i + 2] = A[i + 2];
  }
}

Here, we load char data (i8) and then store it as int data (i32). We'd like to set the loop vectorization factor based on the smaller type, rather than the larger one (we can do this today using the -vectorizer-maximize-bandwidth flag). Let the vectorization factor be 16 in this case for the <16 x i8> data. If we do this, the stored vector type becomes wider than is legal. If we had stride-one accesses, this would be fine: type legalization would split the access up. But for the interleaved accesses we have here, we currently won't be able to map what the vectorizer generates to the proper interleave intrinsics because the type is too wide. Please see the test cases for more concrete examples.

I'll update the associated TTI costs (in getInterleavedMemoryOpCost) as a follow-on patch.

Diff Detail

Build Status

Buildable 3564
Build 3564: arc lint + arc unit

Event Timeline

mssimpso created this revision. Feb 2 2017, 11:00 AM

Herald added a reviewer: javed.absar. Feb 2 2017, 11:00 AM

Herald added a subscriber: aemerson.

evandro added a subscriber: evandro. Feb 6 2017, 3:09 PM

mssimpso mentioned this in D29675: [ARM/AArch64] Update costs for interleaved accesses with wide types. Feb 7 2017, 12:26 PM

mssimpso added a child revision: D29675: [ARM/AArch64] Update costs for interleaved accesses with wide types.

Hi Matthew,

This change seems pretty straightforward. I only have some silly comments inline and a simple request: to create one additional test for three LD2s (even/odd extract from <24 x i32>) on each arch, just to make sure that it scales.

cheers,
--renato

lib/Target/AArch64/AArch64ISelLowering.cpp
7281	I almost missed the modulo check above. To make it easier for posterity, I'd recommend moving the 128 check after this comment.
7414	Now that I see this repeated, could it be a simple inline function?
7486	this comment change is unnecessary? was that clang-format?

Addressed Renato's comments.

Moved the "legalize" comment before the modulo check
Added a helper function for determining the number of accesses for a wide type
Unformatted a comment (clang-format still wrapped the last line)
Added a 3xLD2 test case (<24 x i32>) for both ARM and AArch64.

LGTM, thanks!

This revision is now accepted and ready to land. Feb 21 2017, 12:41 PM

mssimpso mentioned this in D30305: [LV] Consider non-consecutive vectorizable accesses in max VF selection. Feb 28 2017, 3:06 PM

Closed by commit rL296750: [ARM/AArch64] Support wide interleaved accesses (authored by mssimpso). Mar 2 2017, 7:23 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                                 Size
lib/Target/AArch64/AArch64ISelLowering.cpp                           159 lines
lib/Target/ARM/ARMISelLowering.cpp                                   183 lines
test/Transforms/InterleavedAccess/AArch64/interleaved-accesses.ll    161 lines
test/Transforms/InterleavedAccess/ARM/interleaved-accesses.ll        163 lines

Diff 86856

lib/Target/AArch64/AArch64ISelLowering.cpp


assert(Shuffles.size() == Indices.size() &&
"Unmatched number of shufflevectors and indices");		"Unmatched number of shufflevectors and indices");

const DataLayout &DL = LI->getModule()->getDataLayout();		const DataLayout &DL = LI->getModule()->getDataLayout();

VectorType *VecTy = Shuffles[0]->getType();		VectorType *VecTy = Shuffles[0]->getType();
unsigned VecSize = DL.getTypeSizeInBits(VecTy);		unsigned VecSize = DL.getTypeSizeInBits(VecTy);

// Skip if we do not have NEON and skip illegal vector types.		// Skip if we do not have NEON and skip illegal vector types.
if (!Subtarget->hasNEON() || (VecSize != 64 && VecSize != 128))		if (!Subtarget->hasNEON() || (VecSize != 64 && VecSize % 128 != 0))
return false;		return false;

		// We can "legalize" wide vector types into multiple interleaved loads as
		// long as the vector types are divisible by 128. The code below determines
		// the number of loads we will generate.
		unsigned NumLoads = 1;
		if (VecSize > 128)
		NumLoads = VecSize / 128;

// A pointer vector can not be the return type of the ldN intrinsics. Need to		// A pointer vector can not be the return type of the ldN intrinsics. Need to
// load integer vectors first and then convert to pointer vectors.		// load integer vectors first and then convert to pointer vectors.
Type *EltTy = VecTy->getVectorElementType();		Type *EltTy = VecTy->getVectorElementType();
if (EltTy->isPointerTy())		if (EltTy->isPointerTy())
VecTy =		VecTy =
VectorType::get(DL.getIntPtrType(EltTy), VecTy->getVectorNumElements());		VectorType::get(DL.getIntPtrType(EltTy), VecTy->getVectorNumElements());

		IRBuilder<> Builder(LI);

		// The base address of the load.
		Value *BaseAddr = LI->getPointerOperand();

		if (NumLoads > 1) {
		// If we're going to generate more than one load, reset the sub-vector type
		// to something legal.
		VecTy = VectorType::get(VecTy->getVectorElementType(),
		VecTy->getVectorNumElements() / NumLoads);

		// We will compute the pointer operand of each load from the original base
		// address using GEPs. Cast the base address to a pointer to the scalar
		// element type.
		BaseAddr = Builder.CreateBitCast(
		BaseAddr, VecTy->getVectorElementType()->getPointerTo(
		LI->getPointerAddressSpace()));
		}

Type *PtrTy = VecTy->getPointerTo(LI->getPointerAddressSpace());		Type *PtrTy = VecTy->getPointerTo(LI->getPointerAddressSpace());
Type *Tys[2] = {VecTy, PtrTy};		Type *Tys[2] = {VecTy, PtrTy};
static const Intrinsic::ID LoadInts[3] = {Intrinsic::aarch64_neon_ld2,		static const Intrinsic::ID LoadInts[3] = {Intrinsic::aarch64_neon_ld2,
Intrinsic::aarch64_neon_ld3,		Intrinsic::aarch64_neon_ld3,
Intrinsic::aarch64_neon_ld4};		Intrinsic::aarch64_neon_ld4};
Function *LdNFunc =		Function *LdNFunc =
Intrinsic::getDeclaration(LI->getModule(), LoadInts[Factor - 2], Tys);		Intrinsic::getDeclaration(LI->getModule(), LoadInts[Factor - 2], Tys);

IRBuilder<> Builder(LI);		// Holds sub-vectors extracted from the load intrinsic return values. The
Value *Ptr = Builder.CreateBitCast(LI->getPointerOperand(), PtrTy);		// sub-vectors are associated with the shufflevector instructions they will
		// replace.
		DenseMap<ShuffleVectorInst *, SmallVector<Value *, 4>> SubVecs;

		for (unsigned LoadCount = 0; LoadCount < NumLoads; ++LoadCount) {

		// If we're generating more than one load, compute the base address of
		// subsequent loads as an offset from the previous.
		if (LoadCount > 0)
		BaseAddr = Builder.CreateConstGEP1_32(
		BaseAddr, VecTy->getVectorNumElements() * Factor);

CallInst *LdN = Builder.CreateCall(LdNFunc, Ptr, "ldN");		CallInst *LdN = Builder.CreateCall(
		LdNFunc, Builder.CreateBitCast(BaseAddr, PtrTy), "ldN");

// Replace uses of each shufflevector with the corresponding vector loaded		// Extract and store the sub-vectors returned by the load intrinsic.
// by ldN.
for (unsigned i = 0; i < Shuffles.size(); i++) {		for (unsigned i = 0; i < Shuffles.size(); i++) {
ShuffleVectorInst *SVI = Shuffles[i];		ShuffleVectorInst *SVI = Shuffles[i];
unsigned Index = Indices[i];		unsigned Index = Indices[i];

Value *SubVec = Builder.CreateExtractValue(LdN, Index);		Value *SubVec = Builder.CreateExtractValue(LdN, Index);

// Convert the integer vector to pointer vector if the element is pointer.		// Convert the integer vector to pointer vector if the element is pointer.
if (EltTy->isPointerTy())		if (EltTy->isPointerTy())
SubVec = Builder.CreateIntToPtr(SubVec, SVI->getType());		SubVec = Builder.CreateIntToPtr(SubVec, SVI->getType());

SVI->replaceAllUsesWith(SubVec);		SubVecs[SVI].push_back(SubVec);
		}
		}

		// Replace uses of the shufflevector instructions with the sub-vectors
		// returned by the load intrinsic. If a shufflevector instruction is
		// associated with more than one sub-vector, those sub-vectors will be
		// concatenated into a single wide vector.
		for (ShuffleVectorInst *SVI : Shuffles) {
		auto &SubVec = SubVecs[SVI];
		auto *WideVec =
		SubVec.size() > 1 ? concatenateVectors(Builder, SubVec) : SubVec[0];
		SVI->replaceAllUsesWith(WideVec);
}		}

return true;		return true;
}		}

/// \brief Lower an interleaved store into a stN intrinsic.		/// \brief Lower an interleaved store into a stN intrinsic.
///		///
/// E.g. Lower an interleaved store (Factor = 3):		/// E.g. Lower an interleaved store (Factor = 3):
bool AArch64TargetLowering::lowerInterleavedStore(StoreInst *SI,
unsigned LaneLen = VecTy->getVectorNumElements() / Factor;		unsigned LaneLen = VecTy->getVectorNumElements() / Factor;
Type *EltTy = VecTy->getVectorElementType();		Type *EltTy = VecTy->getVectorElementType();
VectorType *SubVecTy = VectorType::get(EltTy, LaneLen);		VectorType *SubVecTy = VectorType::get(EltTy, LaneLen);

const DataLayout &DL = SI->getModule()->getDataLayout();		const DataLayout &DL = SI->getModule()->getDataLayout();
unsigned SubVecSize = DL.getTypeSizeInBits(SubVecTy);		unsigned SubVecSize = DL.getTypeSizeInBits(SubVecTy);

// Skip if we do not have NEON and skip illegal vector types.		// Skip if we do not have NEON and skip illegal vector types.
if (!Subtarget->hasNEON() || (SubVecSize != 64 && SubVecSize != 128))		if (!Subtarget->hasNEON() || (SubVecSize != 64 && SubVecSize % 128 != 0))
return false;		return false;

		// We can "legalize" wide vector types into multiple interleaved stores as
		// long as the vector types are divisible by 128. The code below determines
		// the number of stores we will generate.
		unsigned NumStores = 1;
		if (SubVecSize > 128)
		NumStores = SubVecSize / 128;

Value *Op0 = SVI->getOperand(0);		Value *Op0 = SVI->getOperand(0);
Value *Op1 = SVI->getOperand(1);		Value *Op1 = SVI->getOperand(1);
IRBuilder<> Builder(SI);		IRBuilder<> Builder(SI);

// StN intrinsics don't support pointer vectors as arguments. Convert pointer		// StN intrinsics don't support pointer vectors as arguments. Convert pointer
// vectors to integer vectors.		// vectors to integer vectors.
if (EltTy->isPointerTy()) {		if (EltTy->isPointerTy()) {
Type *IntTy = DL.getIntPtrType(EltTy);		Type *IntTy = DL.getIntPtrType(EltTy);
unsigned NumOpElts =		unsigned NumOpElts =
dyn_cast<VectorType>(Op0->getType())->getVectorNumElements();		dyn_cast<VectorType>(Op0->getType())->getVectorNumElements();

// Convert to the corresponding integer vector.		// Convert to the corresponding integer vector.
Type *IntVecTy = VectorType::get(IntTy, NumOpElts);		Type *IntVecTy = VectorType::get(IntTy, NumOpElts);
Op0 = Builder.CreatePtrToInt(Op0, IntVecTy);		Op0 = Builder.CreatePtrToInt(Op0, IntVecTy);
Op1 = Builder.CreatePtrToInt(Op1, IntVecTy);		Op1 = Builder.CreatePtrToInt(Op1, IntVecTy);

SubVecTy = VectorType::get(IntTy, LaneLen);		SubVecTy = VectorType::get(IntTy, LaneLen);
}		}

		// The base address of the store.
		Value *BaseAddr = SI->getPointerOperand();

		if (NumStores > 1) {
		// If we're going to generate more than one store, reset the lane length
		// and sub-vector type to something legal.
		LaneLen /= NumStores;
		SubVecTy = VectorType::get(SubVecTy->getVectorElementType(), LaneLen);

		// We will compute the pointer operand of each store from the original base
		// address using GEPs. Cast the base address to a pointer to the scalar
		// element type.
		BaseAddr = Builder.CreateBitCast(
		BaseAddr, SubVecTy->getVectorElementType()->getPointerTo(
		SI->getPointerAddressSpace()));
		}

		auto Mask = SVI->getShuffleMask();

Type *PtrTy = SubVecTy->getPointerTo(SI->getPointerAddressSpace());		Type *PtrTy = SubVecTy->getPointerTo(SI->getPointerAddressSpace());
Type *Tys[2] = {SubVecTy, PtrTy};		Type *Tys[2] = {SubVecTy, PtrTy};
static const Intrinsic::ID StoreInts[3] = {Intrinsic::aarch64_neon_st2,		static const Intrinsic::ID StoreInts[3] = {Intrinsic::aarch64_neon_st2,
Intrinsic::aarch64_neon_st3,		Intrinsic::aarch64_neon_st3,
Intrinsic::aarch64_neon_st4};		Intrinsic::aarch64_neon_st4};
Function *StNFunc =		Function *StNFunc =
Intrinsic::getDeclaration(SI->getModule(), StoreInts[Factor - 2], Tys);		Intrinsic::getDeclaration(SI->getModule(), StoreInts[Factor - 2], Tys);

		for (unsigned StoreCount = 0; StoreCount < NumStores; ++StoreCount) {

SmallVector<Value *, 5> Ops;		SmallVector<Value *, 5> Ops;

// Split the shufflevector operands into sub vectors for the new stN call.		// Split the shufflevector operands into sub vectors for the new stN call.
auto Mask = SVI->getShuffleMask();
for (unsigned i = 0; i < Factor; i++) {		for (unsigned i = 0; i < Factor; i++) {
if (Mask[i] >= 0) {		unsigned IdxI = StoreCount * LaneLen * Factor + i;
		if (Mask[IdxI] >= 0) {
Ops.push_back(Builder.CreateShuffleVector(		Ops.push_back(Builder.CreateShuffleVector(
Op0, Op1, createSequentialMask(Builder, Mask[i], LaneLen, 0)));		Op0, Op1, createSequentialMask(Builder, Mask[IdxI], LaneLen, 0)));
} else {		} else {
unsigned StartMask = 0;		unsigned StartMask = 0;
for (unsigned j = 1; j < LaneLen; j++) {		for (unsigned j = 1; j < LaneLen; j++) {
if (Mask[j*Factor + i] >= 0) {		unsigned IdxJ = StoreCount * LaneLen * Factor + j;
StartMask = Mask[j*Factor + i] - j;		if (Mask[IdxJ * Factor + IdxI] >= 0) {
		StartMask = Mask[IdxJ * Factor + IdxI] - IdxJ;
break;		break;
}		}
}		}
// Note: If all elements in a chunk are undefs, StartMask=0!		// Note: If all elements in a chunk are undefs, StartMask=0! Note:
// Note: Filling undef gaps with random elements is ok, since		// Filling undef gaps with random elements is ok, since those elements
// those elements were being written anyway (with undefs).		// were being written anyway (with undefs). In the case of all undefs
// In the case of all undefs we're defaulting to using elems from 0		// we're defaulting to using elems from 0 Note: StartMask cannot be
// Note: StartMask cannot be negative, it's checked in isReInterleaveMask		// negative, it's checked in isReInterleaveMask
Ops.push_back(Builder.CreateShuffleVector(		Ops.push_back(Builder.CreateShuffleVector(
Op0, Op1, createSequentialMask(Builder, StartMask, LaneLen, 0)));		Op0, Op1, createSequentialMask(Builder, StartMask, LaneLen, 0)));
}		}
}		}

Ops.push_back(Builder.CreateBitCast(SI->getPointerOperand(), PtrTy));		// If we generating more than one store, we compute the base address of
		// subsequent stores as an offset from the previous.
		if (StoreCount > 0)
		BaseAddr = Builder.CreateConstGEP1_32(BaseAddr, LaneLen * Factor);

		Ops.push_back(Builder.CreateBitCast(BaseAddr, PtrTy));
Builder.CreateCall(StNFunc, Ops);		Builder.CreateCall(StNFunc, Ops);
		}
return true;		return true;
}		}

static bool memOpAlign(unsigned DstAlign, unsigned SrcAlign,		static bool memOpAlign(unsigned DstAlign, unsigned SrcAlign,
unsigned AlignCheck) {		unsigned AlignCheck) {
return ((SrcAlign == 0 || SrcAlign % AlignCheck == 0) &&		return ((SrcAlign == 0 || SrcAlign % AlignCheck == 0) &&
(DstAlign == 0 || DstAlign % AlignCheck == 0));		(DstAlign == 0 || DstAlign % AlignCheck == 0));
}		}

lib/Target/ARM/ARMISelLowering.cpp


bool ARMTargetLowering::lowerInterleavedLoad(
Type *EltTy = VecTy->getVectorElementType();		Type *EltTy = VecTy->getVectorElementType();

const DataLayout &DL = LI->getModule()->getDataLayout();		const DataLayout &DL = LI->getModule()->getDataLayout();
unsigned VecSize = DL.getTypeSizeInBits(VecTy);		unsigned VecSize = DL.getTypeSizeInBits(VecTy);
bool EltIs64Bits = DL.getTypeSizeInBits(EltTy) == 64;		bool EltIs64Bits = DL.getTypeSizeInBits(EltTy) == 64;

// Skip if we do not have NEON and skip illegal vector types and vector types		// Skip if we do not have NEON and skip illegal vector types and vector types
// with i64/f64 elements (vldN doesn't support i64/f64 elements).		// with i64/f64 elements (vldN doesn't support i64/f64 elements).
if (!Subtarget->hasNEON() || (VecSize != 64 && VecSize != 128) || EltIs64Bits)		if (!Subtarget->hasNEON() || (VecSize != 64 && VecSize % 128 != 0) ||
		EltIs64Bits)
return false;		return false;

		// We can "legalize" wide vector types into multiple interleaved loads as
		// long as the vector types are divisible by 128. The code below determines
		// the number of loads we will generate.
		unsigned NumLoads = 1;
		if (VecSize > 128)
		NumLoads = VecSize / 128;

// A pointer vector can not be the return type of the ldN intrinsics. Need to		// A pointer vector can not be the return type of the ldN intrinsics. Need to
// load integer vectors first and then convert to pointer vectors.		// load integer vectors first and then convert to pointer vectors.
if (EltTy->isPointerTy())		if (EltTy->isPointerTy())
VecTy =		VecTy =
VectorType::get(DL.getIntPtrType(EltTy), VecTy->getVectorNumElements());		VectorType::get(DL.getIntPtrType(EltTy), VecTy->getVectorNumElements());

		IRBuilder<> Builder(LI);

		// The base address of the load.
		Value *BaseAddr = LI->getPointerOperand();

		if (NumLoads > 1) {
		// If we're going to generate more than one load, reset the sub-vector type
		// to something legal.
		VecTy = VectorType::get(VecTy->getVectorElementType(),
		VecTy->getVectorNumElements() / NumLoads);

		// We will compute the pointer operand of each load from the original base
		// address using GEPs. Cast the base address to a pointer to the scalar
		// element type.
		BaseAddr = Builder.CreateBitCast(
		BaseAddr, VecTy->getVectorElementType()->getPointerTo(
		LI->getPointerAddressSpace()));
		}

		Type *Int8Ptr = Builder.getInt8PtrTy(LI->getPointerAddressSpace());
		Type *Tys[] = {VecTy, Int8Ptr};
static const Intrinsic::ID LoadInts[3] = {Intrinsic::arm_neon_vld2,		static const Intrinsic::ID LoadInts[3] = {Intrinsic::arm_neon_vld2,
Intrinsic::arm_neon_vld3,		Intrinsic::arm_neon_vld3,
Intrinsic::arm_neon_vld4};		Intrinsic::arm_neon_vld4};
		Function *VldnFunc =
		Intrinsic::getDeclaration(LI->getModule(), LoadInts[Factor - 2], Tys);

IRBuilder<> Builder(LI);		// Holds sub-vectors extracted from the load intrinsic return values. The
SmallVector<Value *, 2> Ops;		// sub-vectors are associated with the shufflevector instructions they will
		// replace.
		DenseMap<ShuffleVectorInst *, SmallVector<Value *, 4>> SubVecs;

		for (unsigned LoadCount = 0; LoadCount < NumLoads; ++LoadCount) {

		// If we're generating more than one load, compute the base address of
		// subsequent loads as an offset from the previous.
		if (LoadCount > 0)
		BaseAddr = Builder.CreateConstGEP1_32(
		BaseAddr, VecTy->getVectorNumElements() * Factor);

Type *Int8Ptr = Builder.getInt8PtrTy(LI->getPointerAddressSpace());		SmallVector<Value *, 2> Ops;
Ops.push_back(Builder.CreateBitCast(LI->getPointerOperand(), Int8Ptr));		Ops.push_back(Builder.CreateBitCast(BaseAddr, Int8Ptr));
Ops.push_back(Builder.getInt32(LI->getAlignment()));		Ops.push_back(Builder.getInt32(LI->getAlignment()));

Type *Tys[] = { VecTy, Int8Ptr };
Function *VldnFunc =
Intrinsic::getDeclaration(LI->getModule(), LoadInts[Factor - 2], Tys);
CallInst *VldN = Builder.CreateCall(VldnFunc, Ops, "vldN");		CallInst *VldN = Builder.CreateCall(VldnFunc, Ops, "vldN");

// Replace uses of each shufflevector with the corresponding vector loaded		// Replace uses of each shufflevector with the corresponding vector loaded
// by ldN.		// by ldN.
for (unsigned i = 0; i < Shuffles.size(); i++) {		for (unsigned i = 0; i < Shuffles.size(); i++) {
ShuffleVectorInst *SV = Shuffles[i];		ShuffleVectorInst *SV = Shuffles[i];
unsigned Index = Indices[i];		unsigned Index = Indices[i];

Value *SubVec = Builder.CreateExtractValue(VldN, Index);		Value *SubVec = Builder.CreateExtractValue(VldN, Index);

// Convert the integer vector to pointer vector if the element is pointer.		// Convert the integer vector to pointer vector if the element is pointer.
if (EltTy->isPointerTy())		if (EltTy->isPointerTy())
SubVec = Builder.CreateIntToPtr(SubVec, SV->getType());		SubVec = Builder.CreateIntToPtr(SubVec, SV->getType());

SV->replaceAllUsesWith(SubVec);		SubVecs[SV].push_back(SubVec);
		}
		}

		// Replace uses of the shufflevector instructions with the sub-vectors
		// returned by the load intrinsic. If a shufflevector instruction is
		// associated with more than one sub-vector, those sub-vectors will be
		// concatenated into a single wide vector.
		for (ShuffleVectorInst *SVI : Shuffles) {
		auto &SubVec = SubVecs[SVI];
		auto *WideVec =
		SubVec.size() > 1 ? concatenateVectors(Builder, SubVec) : SubVec[0];
		SVI->replaceAllUsesWith(WideVec);
}		}

return true;		return true;
}		}

/// \brief Lower an interleaved store into a vstN intrinsic.		/// \brief Lower an interleaved store into a vstN intrinsic.
///		///
/// E.g. Lower an interleaved store (Factor = 3):		/// E.g. Lower an interleaved store (Factor = 3):
bool ARMTargetLowering::lowerInterleavedStore(StoreInst *SI,
VectorType *SubVecTy = VectorType::get(EltTy, LaneLen);		VectorType *SubVecTy = VectorType::get(EltTy, LaneLen);

const DataLayout &DL = SI->getModule()->getDataLayout();		const DataLayout &DL = SI->getModule()->getDataLayout();
unsigned SubVecSize = DL.getTypeSizeInBits(SubVecTy);		unsigned SubVecSize = DL.getTypeSizeInBits(SubVecTy);
bool EltIs64Bits = DL.getTypeSizeInBits(EltTy) == 64;		bool EltIs64Bits = DL.getTypeSizeInBits(EltTy) == 64;

// Skip if we do not have NEON and skip illegal vector types and vector types		// Skip if we do not have NEON and skip illegal vector types and vector types
// with i64/f64 elements (vstN doesn't support i64/f64 elements).		// with i64/f64 elements (vstN doesn't support i64/f64 elements).
if (!Subtarget->hasNEON() || (SubVecSize != 64 && SubVecSize != 128) ||		if (!Subtarget->hasNEON() || (SubVecSize != 64 && SubVecSize % 128 != 0) ||
EltIs64Bits)		EltIs64Bits)
return false;		return false;

		// We can "legalize" wide vector types into multiple interleaved stores as
		// long as the vector types are divisible by 128. The code below determines
		// the number of stores we will generate.
		unsigned NumStores = 1;
		if (SubVecSize > 128)
		NumStores = SubVecSize / 128;

Value *Op0 = SVI->getOperand(0);		Value *Op0 = SVI->getOperand(0);
Value *Op1 = SVI->getOperand(1);		Value *Op1 = SVI->getOperand(1);
IRBuilder<> Builder(SI);		IRBuilder<> Builder(SI);

// StN intrinsics don't support pointer vectors as arguments. Convert pointer		// StN intrinsics don't support pointer vectors as arguments. Convert pointer
// vectors to integer vectors.		// vectors to integer vectors.
if (EltTy->isPointerTy()) {		if (EltTy->isPointerTy()) {
Type *IntTy = DL.getIntPtrType(EltTy);		Type *IntTy = DL.getIntPtrType(EltTy);

// Convert to the corresponding integer vector.		// Convert to the corresponding integer vector.
Type *IntVecTy =		Type *IntVecTy =
VectorType::get(IntTy, Op0->getType()->getVectorNumElements());		VectorType::get(IntTy, Op0->getType()->getVectorNumElements());
Op0 = Builder.CreatePtrToInt(Op0, IntVecTy);		Op0 = Builder.CreatePtrToInt(Op0, IntVecTy);
Op1 = Builder.CreatePtrToInt(Op1, IntVecTy);		Op1 = Builder.CreatePtrToInt(Op1, IntVecTy);

SubVecTy = VectorType::get(IntTy, LaneLen);		SubVecTy = VectorType::get(IntTy, LaneLen);
}		}

		// The base address of the store.
		Value *BaseAddr = SI->getPointerOperand();

		if (NumStores > 1) {
		// If we're going to generate more than one store, reset the lane length
		// and sub-vector type to something legal.
		LaneLen /= NumStores;
		SubVecTy = VectorType::get(SubVecTy->getVectorElementType(), LaneLen);

		// We will compute the pointer operand of each store from the original base
		// address using GEPs. Cast the base address to a pointer to the scalar
		// element type.
		BaseAddr = Builder.CreateBitCast(
		BaseAddr, SubVecTy->getVectorElementType()->getPointerTo(
		SI->getPointerAddressSpace()));
		}

		auto Mask = SVI->getShuffleMask();

		Type *Int8Ptr = Builder.getInt8PtrTy(SI->getPointerAddressSpace());
		Type *Tys[] = {Int8Ptr, SubVecTy};
static const Intrinsic::ID StoreInts[3] = {Intrinsic::arm_neon_vst2,		static const Intrinsic::ID StoreInts[3] = {Intrinsic::arm_neon_vst2,
Intrinsic::arm_neon_vst3,		Intrinsic::arm_neon_vst3,
Intrinsic::arm_neon_vst4};		Intrinsic::arm_neon_vst4};
SmallVector<Value *, 6> Ops;

Type *Int8Ptr = Builder.getInt8PtrTy(SI->getPointerAddressSpace());		for (unsigned StoreCount = 0; StoreCount < NumStores; ++StoreCount) {
Ops.push_back(Builder.CreateBitCast(SI->getPointerOperand(), Int8Ptr));

Type *Tys[] = { Int8Ptr, SubVecTy };		// If we generating more than one store, we compute the base address of
Function *VstNFunc = Intrinsic::getDeclaration(		// subsequent stores as an offset from the previous.
SI->getModule(), StoreInts[Factor - 2], Tys);		if (StoreCount > 0)
		BaseAddr = Builder.CreateConstGEP1_32(BaseAddr, LaneLen * Factor);

		SmallVector<Value *, 6> Ops;
		Ops.push_back(Builder.CreateBitCast(BaseAddr, Int8Ptr));

		Function *VstNFunc =
		Intrinsic::getDeclaration(SI->getModule(), StoreInts[Factor - 2], Tys);

// Split the shufflevector operands into sub vectors for the new vstN call.		// Split the shufflevector operands into sub vectors for the new vstN call.
auto Mask = SVI->getShuffleMask();
for (unsigned i = 0; i < Factor; i++) {		for (unsigned i = 0; i < Factor; i++) {
if (Mask[i] >= 0) {		unsigned IdxI = StoreCount * LaneLen * Factor + i;
		if (Mask[IdxI] >= 0) {
Ops.push_back(Builder.CreateShuffleVector(		Ops.push_back(Builder.CreateShuffleVector(
Op0, Op1, createSequentialMask(Builder, Mask[i], LaneLen, 0)));		Op0, Op1, createSequentialMask(Builder, Mask[IdxI], LaneLen, 0)));
} else {		} else {
unsigned StartMask = 0;		unsigned StartMask = 0;
for (unsigned j = 1; j < LaneLen; j++) {		for (unsigned j = 1; j < LaneLen; j++) {
if (Mask[j*Factor + i] >= 0) {		unsigned IdxJ = StoreCount * LaneLen * Factor + j;
StartMask = Mask[j*Factor + i] - j;		if (Mask[IdxJ * Factor + IdxI] >= 0) {
		StartMask = Mask[IdxJ * Factor + IdxI] - IdxJ;
break;		break;
}		}
}		}
// Note: If all elements in a chunk are undefs, StartMask=0!		// Note: If all elements in a chunk are undefs, StartMask=0! Note:
// Note: Filling undef gaps with random elements is ok, since		// Filling undef gaps with random elements is ok, since those elements
// those elements were being written anyway (with undefs).		// were being written anyway (with undefs). In the case of all undefs
// In the case of all undefs we're defaulting to using elems from 0		// we're defaulting to using elems from 0 Note: StartMask cannot be
// Note: StartMask cannot be negative, it's checked in isReInterleaveMask		// negative, it's checked in isReInterleaveMask
Ops.push_back(Builder.CreateShuffleVector(		Ops.push_back(Builder.CreateShuffleVector(
Op0, Op1, createSequentialMask(Builder, StartMask, LaneLen, 0)));		Op0, Op1, createSequentialMask(Builder, StartMask, LaneLen, 0)));
}		}
}		}

Ops.push_back(Builder.getInt32(SI->getAlignment()));		Ops.push_back(Builder.getInt32(SI->getAlignment()));
Builder.CreateCall(VstNFunc, Ops);		Builder.CreateCall(VstNFunc, Ops);
		}
return true;		return true;
}		}

enum HABaseType {		enum HABaseType {
HA_UNKNOWN = 0,		HA_UNKNOWN = 0,
HA_FLOAT,		HA_FLOAT,
HA_DOUBLE,		HA_DOUBLE,
HA_VECT64,		HA_VECT64,

test/Transforms/InterleavedAccess/AArch64/interleaved-accesses.ll

	; NO_NEON: shufflevector			; NO_NEON: shufflevector
	; NO_NEON: store			; NO_NEON: store
	; NO_NEON: ret void			; NO_NEON: ret void
	define void @no_interleave(<4 x float> %a0) {			define void @no_interleave(<4 x float> %a0) {
	%v0 = shufflevector <4 x float> %a0, <4 x float> %a0, <4 x i32> <i32 0, i32 3, i32 7, i32 undef>			%v0 = shufflevector <4 x float> %a0, <4 x float> %a0, <4 x i32> <i32 0, i32 3, i32 7, i32 undef>
	store <4 x float> %v0, <4 x float>* @g, align 16			store <4 x float> %v0, <4 x float>* @g, align 16
	ret void			ret void
	}			}

				define void @load_factor2_wide(<16 x i32>* %ptr) {
				; NEON-LABEL: @load_factor2_wide(
				; NEON-NEXT: [[TMP1:%.*]] = bitcast <16 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = bitcast i32* [[TMP1]] to <4 x i32>*
				; NEON-NEXT: [[LDN:%.*]] = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0v4i32(<4 x i32>* [[TMP2]])
				; NEON-NEXT: [[TMP3:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[LDN]], 1
				; NEON-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[LDN]], 0
				; NEON-NEXT: [[TMP5:%.*]] = getelementptr i32, i32* [[TMP1]], i32 8
				; NEON-NEXT: [[TMP6:%.*]] = bitcast i32* [[TMP5]] to <4 x i32>*
				; NEON-NEXT: [[LDN1:%.*]] = call { <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld2.v4i32.p0v4i32(<4 x i32>* [[TMP6]])
				; NEON-NEXT: [[TMP7:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[LDN1]], 1
				; NEON-NEXT: [[TMP8:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[LDN1]], 0
				; NEON-NEXT: [[TMP9:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP7]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP8]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @load_factor2_wide(
				; NO_NEON-NOT: @llvm.aarch64.neon
				; NO_NEON: ret void
				;
				%interleaved.vec = load <16 x i32>, <16 x i32>* %ptr, align 4
				%v0 = shufflevector <16 x i32> %interleaved.vec, <16 x i32> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%v1 = shufflevector <16 x i32> %interleaved.vec, <16 x i32> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				ret void
				}

				define void @load_factor3_wide(<24 x i32>* %ptr) {
				; NEON-LABEL: @load_factor3_wide(
				; NEON-NEXT: [[TMP1:%.]] = bitcast <24 x i32> %ptr to i32*
				; NEON-NEXT: [[TMP2:%.]] = bitcast i32 [[TMP1]] to <4 x i32>*
				; NEON-NEXT: [[LDN:%.]] = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0v4i32(<4 x i32> [[TMP2]])
				; NEON-NEXT: [[TMP3:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[LDN]], 2
				; NEON-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[LDN]], 1
				; NEON-NEXT: [[TMP5:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[LDN]], 0
				; NEON-NEXT: [[TMP6:%.*]] = getelementptr i32, i32* [[TMP1]], i32 12
				; NEON-NEXT: [[TMP7:%.*]] = bitcast i32* [[TMP6]] to <4 x i32>*
				; NEON-NEXT: [[LDN1:%.*]] = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld3.v4i32.p0v4i32(<4 x i32>* [[TMP7]])
				; NEON-NEXT: [[TMP8:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[LDN1]], 2
				; NEON-NEXT: [[TMP9:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[LDN1]], 1
				; NEON-NEXT: [[TMP10:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[LDN1]], 0
				; NEON-NEXT: [[TMP11:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP8]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP12:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP13:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP10]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @load_factor3_wide(
				; NO_NEON-NOT: @llvm.aarch64.neon
				; NO_NEON: ret void
				;
				%interleaved.vec = load <24 x i32>, <24 x i32>* %ptr, align 4
				%v0 = shufflevector <24 x i32> %interleaved.vec, <24 x i32> undef, <8 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21>
				%v1 = shufflevector <24 x i32> %interleaved.vec, <24 x i32> undef, <8 x i32> <i32 1, i32 4, i32 7, i32 10, i32 13, i32 16, i32 19, i32 22>
				%v2 = shufflevector <24 x i32> %interleaved.vec, <24 x i32> undef, <8 x i32> <i32 2, i32 5, i32 8, i32 11, i32 14, i32 17, i32 20, i32 23>
				ret void
				}

				define void @load_factor4_wide(<32 x i32>* %ptr) {
				; NEON-LABEL: @load_factor4_wide(
				; NEON-NEXT: [[TMP1:%.*]] = bitcast <32 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = bitcast i32* [[TMP1]] to <4 x i32>*
				; NEON-NEXT: [[LDN:%.*]] = call { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld4.v4i32.p0v4i32(<4 x i32>* [[TMP2]])
				; NEON-NEXT: [[TMP3:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[LDN]], 3
				; NEON-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[LDN]], 2
				; NEON-NEXT: [[TMP5:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[LDN]], 1
				; NEON-NEXT: [[TMP6:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[LDN]], 0
				; NEON-NEXT: [[TMP7:%.*]] = getelementptr i32, i32* [[TMP1]], i32 16
				; NEON-NEXT: [[TMP8:%.*]] = bitcast i32* [[TMP7]] to <4 x i32>*
				; NEON-NEXT: [[LDN1:%.*]] = call { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } @llvm.aarch64.neon.ld4.v4i32.p0v4i32(<4 x i32>* [[TMP8]])
				; NEON-NEXT: [[TMP9:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[LDN1]], 3
				; NEON-NEXT: [[TMP10:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[LDN1]], 2
				; NEON-NEXT: [[TMP11:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[LDN1]], 1
				; NEON-NEXT: [[TMP12:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[LDN1]], 0
				; NEON-NEXT: [[TMP13:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP14:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP10]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP15:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP11]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP16:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @load_factor4_wide(
				; NO_NEON-NOT: @llvm.aarch64.neon
				; NO_NEON: ret void
				;
				%interleaved.vec = load <32 x i32>, <32 x i32>* %ptr, align 4
				%v0 = shufflevector <32 x i32> %interleaved.vec, <32 x i32> undef, <8 x i32> <i32 0, i32 4, i32 8, i32 12, i32 16, i32 20, i32 24, i32 28>
				%v1 = shufflevector <32 x i32> %interleaved.vec, <32 x i32> undef, <8 x i32> <i32 1, i32 5, i32 9, i32 13, i32 17, i32 21, i32 25, i32 29>
				%v2 = shufflevector <32 x i32> %interleaved.vec, <32 x i32> undef, <8 x i32> <i32 2, i32 6, i32 10, i32 14, i32 18, i32 22, i32 26, i32 30>
				%v3 = shufflevector <32 x i32> %interleaved.vec, <32 x i32> undef, <8 x i32> <i32 3, i32 7, i32 11, i32 15, i32 19, i32 23, i32 27, i32 31>
				ret void
				}

				define void @store_factor2_wide(<16 x i32>* %ptr, <8 x i32> %v0, <8 x i32> %v1) {
				; NEON-LABEL: @store_factor2_wide(
				; NEON-NEXT: [[TMP1:%.*]] = bitcast <16 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; NEON-NEXT: [[TMP3:%.*]] = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
				; NEON-NEXT: [[TMP4:%.*]] = bitcast i32* [[TMP1]] to <4 x i32>*
				; NEON-NEXT: call void @llvm.aarch64.neon.st2.v4i32.p0v4i32(<4 x i32> [[TMP2]], <4 x i32> [[TMP3]], <4 x i32>* [[TMP4]])
				; NEON-NEXT: [[TMP5:%.*]] = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP6:%.*]] = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
				; NEON-NEXT: [[TMP7:%.*]] = getelementptr i32, i32* [[TMP1]], i32 8
				; NEON-NEXT: [[TMP8:%.*]] = bitcast i32* [[TMP7]] to <4 x i32>*
				; NEON-NEXT: call void @llvm.aarch64.neon.st2.v4i32.p0v4i32(<4 x i32> [[TMP5]], <4 x i32> [[TMP6]], <4 x i32>* [[TMP8]])
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @store_factor2_wide(
				; NO_NEON: ret void
				;
				%interleaved.vec = shufflevector <8 x i32> %v0, <8 x i32> %v1, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
				store <16 x i32> %interleaved.vec, <16 x i32>* %ptr, align 4
				ret void
				}

				define void @store_factor3_wide(<24 x i32>* %ptr, <8 x i32> %v0, <8 x i32> %v1, <8 x i32> %v2) {
				; NEON-LABEL: @store_factor3_wide(
				; NEON: [[TMP1:%.*]] = bitcast <24 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; NEON-NEXT: [[TMP3:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
				; NEON-NEXT: [[TMP4:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 16, i32 17, i32 18, i32 19>
				; NEON-NEXT: [[TMP5:%.*]] = bitcast i32* [[TMP1]] to <4 x i32>*
				; NEON-NEXT: call void @llvm.aarch64.neon.st3.v4i32.p0v4i32(<4 x i32> [[TMP2]], <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32>* [[TMP5]])
				; NEON-NEXT: [[TMP6:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP7:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
				; NEON-NEXT: [[TMP8:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 20, i32 21, i32 22, i32 23>
				; NEON-NEXT: [[TMP9:%.*]] = getelementptr i32, i32* [[TMP1]], i32 12
				; NEON-NEXT: [[TMP10:%.*]] = bitcast i32* [[TMP9]] to <4 x i32>*
				; NEON-NEXT: call void @llvm.aarch64.neon.st3.v4i32.p0v4i32(<4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> [[TMP8]], <4 x i32>* [[TMP10]])
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @store_factor3_wide(
				; NO_NEON: ret void
				;
				%s0 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%s1 = shufflevector <8 x i32> %v2, <8 x i32> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%interleaved.vec = shufflevector <16 x i32> %s0, <16 x i32> %s1, <24 x i32> <i32 0, i32 8, i32 16, i32 1, i32 9, i32 17, i32 2, i32 10, i32 18, i32 3, i32 11, i32 19, i32 4, i32 12, i32 20, i32 5, i32 13, i32 21, i32 6, i32 14, i32 22, i32 7, i32 15, i32 23>
				store <24 x i32> %interleaved.vec, <24 x i32>* %ptr, align 4
				ret void
				}

				define void @store_factor4_wide(<32 x i32>* %ptr, <8 x i32> %v0, <8 x i32> %v1, <8 x i32> %v2, <8 x i32> %v3) {
				; NEON-LABEL: @store_factor4_wide(
				; NEON: [[TMP1:%.*]] = bitcast <32 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; NEON-NEXT: [[TMP3:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
				; NEON-NEXT: [[TMP4:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 16, i32 17, i32 18, i32 19>
				; NEON-NEXT: [[TMP5:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 24, i32 25, i32 26, i32 27>
				; NEON-NEXT: [[TMP6:%.*]] = bitcast i32* [[TMP1]] to <4 x i32>*
				; NEON-NEXT: call void @llvm.aarch64.neon.st4.v4i32.p0v4i32(<4 x i32> [[TMP2]], <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]])
				; NEON-NEXT: [[TMP7:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP8:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
				; NEON-NEXT: [[TMP9:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 20, i32 21, i32 22, i32 23>
				; NEON-NEXT: [[TMP10:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 28, i32 29, i32 30, i32 31>
				; NEON-NEXT: [[TMP11:%.*]] = getelementptr i32, i32* [[TMP1]], i32 16
				; NEON-NEXT: [[TMP12:%.*]] = bitcast i32* [[TMP11]] to <4 x i32>*
				; NEON-NEXT: call void @llvm.aarch64.neon.st4.v4i32.p0v4i32(<4 x i32> [[TMP7]], <4 x i32> [[TMP8]], <4 x i32> [[TMP9]], <4 x i32> [[TMP10]], <4 x i32>* [[TMP12]])
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @store_factor4_wide(
				; NO_NEON-NOT: @llvm.aarch64.neon
				; NO_NEON: ret void
				;
				%s0 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%s1 = shufflevector <8 x i32> %v2, <8 x i32> %v3, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%interleaved.vec = shufflevector <16 x i32> %s0, <16 x i32> %s1, <32 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9, i32 17, i32 25, i32 2, i32 10, i32 18, i32 26, i32 3, i32 11, i32 19, i32 27, i32 4, i32 12, i32 20, i32 28, i32 5, i32 13, i32 21, i32 29, i32 6, i32 14, i32 22, i32 30, i32 7, i32 15, i32 23, i32 31>
				store <32 x i32> %interleaved.vec, <32 x i32>* %ptr, align 4
				ret void
				}

test/Transforms/InterleavedAccess/ARM/interleaved-accesses.ll

				; NO_NEON: shufflevector
				; NO_NEON: store
				; NO_NEON: ret void
				define void @no_interleave(<4 x float> %a0) {
				%v0 = shufflevector <4 x float> %a0, <4 x float> %a0, <4 x i32> <i32 0, i32 7, i32 1, i32 undef>
				store <4 x float> %v0, <4 x float>* @g, align 16
				ret void
				}

				define void @load_factor2_wide(<16 x i32>* %ptr) {
				; NEON-LABEL: @load_factor2_wide(
				; NEON-NEXT: [[TMP1:%.*]] = bitcast <16 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = bitcast i32* [[TMP1]] to i8*
				; NEON-NEXT: [[VLDN:%.*]] = call { <4 x i32>, <4 x i32> } @llvm.arm.neon.vld2.v4i32.p0i8(i8* [[TMP2]], i32 4)
				; NEON-NEXT: [[TMP3:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[VLDN]], 1
				; NEON-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[VLDN]], 0
				; NEON-NEXT: [[TMP5:%.*]] = getelementptr i32, i32* [[TMP1]], i32 8
				; NEON-NEXT: [[TMP6:%.*]] = bitcast i32* [[TMP5]] to i8*
				; NEON-NEXT: [[VLDN1:%.*]] = call { <4 x i32>, <4 x i32> } @llvm.arm.neon.vld2.v4i32.p0i8(i8* [[TMP6]], i32 4)
				; NEON-NEXT: [[TMP7:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[VLDN1]], 1
				; NEON-NEXT: [[TMP8:%.*]] = extractvalue { <4 x i32>, <4 x i32> } [[VLDN1]], 0
				; NEON-NEXT: [[TMP9:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP7]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP10:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP8]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @load_factor2_wide(
				; NO_NEON-NOT: @llvm.arm.neon
				; NO_NEON: ret void
				;
				%interleaved.vec = load <16 x i32>, <16 x i32>* %ptr, align 4
				%v0 = shufflevector <16 x i32> %interleaved.vec, <16 x i32> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%v1 = shufflevector <16 x i32> %interleaved.vec, <16 x i32> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				ret void
				}

				define void @load_factor3_wide(<24 x i32>* %ptr) {
				; NEON-LABEL: @load_factor3_wide(
				; NEON-NEXT: [[TMP1:%.*]] = bitcast <24 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = bitcast i32* [[TMP1]] to i8*
				; NEON-NEXT: [[VLDN:%.*]] = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.arm.neon.vld3.v4i32.p0i8(i8* [[TMP2]], i32 4)
				; NEON-NEXT: [[TMP3:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN]], 2
				; NEON-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN]], 1
				; NEON-NEXT: [[TMP5:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN]], 0
				; NEON-NEXT: [[TMP6:%.*]] = getelementptr i32, i32* [[TMP1]], i32 12
				; NEON-NEXT: [[TMP7:%.*]] = bitcast i32* [[TMP6]] to i8*
				; NEON-NEXT: [[VLDN1:%.*]] = call { <4 x i32>, <4 x i32>, <4 x i32> } @llvm.arm.neon.vld3.v4i32.p0i8(i8* [[TMP7]], i32 4)
				; NEON-NEXT: [[TMP8:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN1]], 2
				; NEON-NEXT: [[TMP9:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN1]], 1
				; NEON-NEXT: [[TMP10:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN1]], 0
				; NEON-NEXT: [[TMP11:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP8]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP12:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP13:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP10]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @load_factor3_wide(
				; NO_NEON-NOT: @llvm.arm.neon
				; NO_NEON: ret void
				;
				%interleaved.vec = load <24 x i32>, <24 x i32>* %ptr, align 4
				%v0 = shufflevector <24 x i32> %interleaved.vec, <24 x i32> undef, <8 x i32> <i32 0, i32 3, i32 6, i32 9, i32 12, i32 15, i32 18, i32 21>
				%v1 = shufflevector <24 x i32> %interleaved.vec, <24 x i32> undef, <8 x i32> <i32 1, i32 4, i32 7, i32 10, i32 13, i32 16, i32 19, i32 22>
				%v2 = shufflevector <24 x i32> %interleaved.vec, <24 x i32> undef, <8 x i32> <i32 2, i32 5, i32 8, i32 11, i32 14, i32 17, i32 20, i32 23>
				ret void
				}

				define void @load_factor4_wide(<32 x i32>* %ptr) {
				; NEON-LABEL: @load_factor4_wide(
				; NEON-NEXT: [[TMP1:%.*]] = bitcast <32 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = bitcast i32* [[TMP1]] to i8*
				; NEON-NEXT: [[VLDN:%.*]] = call { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } @llvm.arm.neon.vld4.v4i32.p0i8(i8* [[TMP2]], i32 4)
				; NEON-NEXT: [[TMP3:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN]], 3
				; NEON-NEXT: [[TMP4:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN]], 2
				; NEON-NEXT: [[TMP5:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN]], 1
				; NEON-NEXT: [[TMP6:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN]], 0
				; NEON-NEXT: [[TMP7:%.*]] = getelementptr i32, i32* [[TMP1]], i32 16
				; NEON-NEXT: [[TMP8:%.*]] = bitcast i32* [[TMP7]] to i8*
				; NEON-NEXT: [[VLDN1:%.*]] = call { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } @llvm.arm.neon.vld4.v4i32.p0i8(i8* [[TMP8]], i32 4)
				; NEON-NEXT: [[TMP9:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN1]], 3
				; NEON-NEXT: [[TMP10:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN1]], 2
				; NEON-NEXT: [[TMP11:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN1]], 1
				; NEON-NEXT: [[TMP12:%.*]] = extractvalue { <4 x i32>, <4 x i32>, <4 x i32>, <4 x i32> } [[VLDN1]], 0
				; NEON-NEXT: [[TMP13:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP9]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP14:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP10]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP15:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> [[TMP11]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP16:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP12]], <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @load_factor4_wide(
				; NO_NEON-NOT: @llvm.arm.neon
				; NO_NEON: ret void
				;
				%interleaved.vec = load <32 x i32>, <32 x i32>* %ptr, align 4
				%v0 = shufflevector <32 x i32> %interleaved.vec, <32 x i32> undef, <8 x i32> <i32 0, i32 4, i32 8, i32 12, i32 16, i32 20, i32 24, i32 28>
				%v1 = shufflevector <32 x i32> %interleaved.vec, <32 x i32> undef, <8 x i32> <i32 1, i32 5, i32 9, i32 13, i32 17, i32 21, i32 25, i32 29>
				%v2 = shufflevector <32 x i32> %interleaved.vec, <32 x i32> undef, <8 x i32> <i32 2, i32 6, i32 10, i32 14, i32 18, i32 22, i32 26, i32 30>
				%v3 = shufflevector <32 x i32> %interleaved.vec, <32 x i32> undef, <8 x i32> <i32 3, i32 7, i32 11, i32 15, i32 19, i32 23, i32 27, i32 31>
				ret void
				}

				define void @store_factor2_wide(<16 x i32>* %ptr, <8 x i32> %v0, <8 x i32> %v1) {
				; NEON-LABEL: @store_factor2_wide(
				; NEON-NEXT: [[TMP1:%.*]] = bitcast <16 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = bitcast i32* [[TMP1]] to i8*
				; NEON-NEXT: [[TMP3:%.*]] = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; NEON-NEXT: [[TMP4:%.*]] = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
				; NEON-NEXT: call void @llvm.arm.neon.vst2.p0i8.v4i32(i8* [[TMP2]], <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], i32 4)
				; NEON-NEXT: [[TMP5:%.*]] = getelementptr i32, i32* [[TMP1]], i32 8
				; NEON-NEXT: [[TMP6:%.*]] = bitcast i32* [[TMP5]] to i8*
				; NEON-NEXT: [[TMP7:%.*]] = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP8:%.*]] = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
				; NEON-NEXT: call void @llvm.arm.neon.vst2.p0i8.v4i32(i8* [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> [[TMP8]], i32 4)
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @store_factor2_wide(
				; NO_NEON-NOT: @llvm.arm.neon
				; NO_NEON: ret void
				;
				%interleaved.vec = shufflevector <8 x i32> %v0, <8 x i32> %v1, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
				store <16 x i32> %interleaved.vec, <16 x i32>* %ptr, align 4
				ret void
				}

				define void @store_factor3_wide(<24 x i32>* %ptr, <8 x i32> %v0, <8 x i32> %v1, <8 x i32> %v2) {
				; NEON-LABEL: @store_factor3_wide(
				; NEON: [[TMP1:%.*]] = bitcast <24 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = bitcast i32* [[TMP1]] to i8*
				; NEON-NEXT: [[TMP3:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; NEON-NEXT: [[TMP4:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
				; NEON-NEXT: [[TMP5:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 16, i32 17, i32 18, i32 19>
				; NEON-NEXT: call void @llvm.arm.neon.vst3.p0i8.v4i32(i8* [[TMP2]], <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], i32 4)
				; NEON-NEXT: [[TMP6:%.*]] = getelementptr i32, i32* [[TMP1]], i32 12
				; NEON-NEXT: [[TMP7:%.*]] = bitcast i32* [[TMP6]] to i8*
				; NEON-NEXT: [[TMP8:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP9:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
				; NEON-NEXT: [[TMP10:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 20, i32 21, i32 22, i32 23>
				; NEON-NEXT: call void @llvm.arm.neon.vst3.p0i8.v4i32(i8* [[TMP7]], <4 x i32> [[TMP8]], <4 x i32> [[TMP9]], <4 x i32> [[TMP10]], i32 4)
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @store_factor3_wide(
				; NO_NEON-NOT: @llvm.arm.neon
				; NO_NEON: ret void
				;
				%s0 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%s1 = shufflevector <8 x i32> %v2, <8 x i32> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
				%interleaved.vec = shufflevector <16 x i32> %s0, <16 x i32> %s1, <24 x i32> <i32 0, i32 8, i32 16, i32 1, i32 9, i32 17, i32 2, i32 10, i32 18, i32 3, i32 11, i32 19, i32 4, i32 12, i32 20, i32 5, i32 13, i32 21, i32 6, i32 14, i32 22, i32 7, i32 15, i32 23>
				store <24 x i32> %interleaved.vec, <24 x i32>* %ptr, align 4
				ret void
				}

				define void @store_factor4_wide(<32 x i32>* %ptr, <8 x i32> %v0, <8 x i32> %v1, <8 x i32> %v2, <8 x i32> %v3) {
				; NEON-LABEL: @store_factor4_wide(
				; NEON: [[TMP1:%.*]] = bitcast <32 x i32>* %ptr to i32*
				; NEON-NEXT: [[TMP2:%.*]] = bitcast i32* [[TMP1]] to i8*
				; NEON-NEXT: [[TMP3:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
				; NEON-NEXT: [[TMP4:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
				; NEON-NEXT: [[TMP5:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 16, i32 17, i32 18, i32 19>
				; NEON-NEXT: [[TMP6:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 24, i32 25, i32 26, i32 27>
				; NEON-NEXT: call void @llvm.arm.neon.vst4.p0i8.v4i32(i8* [[TMP2]], <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> [[TMP6]], i32 4)
				; NEON-NEXT: [[TMP7:%.*]] = getelementptr i32, i32* [[TMP1]], i32 16
				; NEON-NEXT: [[TMP8:%.*]] = bitcast i32* [[TMP7]] to i8*
				; NEON-NEXT: [[TMP9:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
				; NEON-NEXT: [[TMP10:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 12, i32 13, i32 14, i32 15>
				; NEON-NEXT: [[TMP11:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 20, i32 21, i32 22, i32 23>
				; NEON-NEXT: [[TMP12:%.*]] = shufflevector <16 x i32> %s0, <16 x i32> %s1, <4 x i32> <i32 28, i32 29, i32 30, i32 31>
				; NEON-NEXT: call void @llvm.arm.neon.vst4.p0i8.v4i32(i8* [[TMP8]], <4 x i32> [[TMP9]], <4 x i32> [[TMP10]], <4 x i32> [[TMP11]], <4 x i32> [[TMP12]], i32 4)
				; NEON-NEXT: ret void
				; NO_NEON-LABEL: @store_factor4_wide(
				; NO_NEON-NOT: @llvm.arm.neon
				; NO_NEON: ret void
				;
				%s0 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%s1 = shufflevector <8 x i32> %v2, <8 x i32> %v3, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
				%interleaved.vec = shufflevector <16 x i32> %s0, <16 x i32> %s1, <32 x i32> <i32 0, i32 8, i32 16, i32 24, i32 1, i32 9, i32 17, i32 25, i32 2, i32 10, i32 18, i32 26, i32 3, i32 11, i32 19, i32 27, i32 4, i32 12, i32 20, i32 28, i32 5, i32 13, i32 21, i32 29, i32 6, i32 14, i32 22, i32 30, i32 7, i32 15, i32 23, i32 31>
				store <32 x i32> %interleaved.vec, <32 x i32>* %ptr, align 4
				ret void
				}