This is an archive of the discontinued LLVM Phabricator instance.

Differential D145163

Add support for vectorization of interleaved memory accesses for scalable VF
Closed, Public

Authored by huntergr on Mar 2 2023, 7:24 AM.

Details

Reviewers

paulwalker-arm
reames
luke
mgabka
fhahn

Commits

rG95bfb1902db9: [LV][AArch64] Allow (limited) interleaving for scalable vectors

Summary

This patch uses the new intrinsics introduced in
https://reviews.llvm.org/D141924
to enable vectorization of interleaved accesses for scalable VFs.
Targets need to implement a proper cost model for the supported operations
to make sure that code can be generated for the resulting IR.
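
As a rough illustration of what the interleave2/deinterleave2 intrinsics do to the lanes of a vector, here is a scalar model on std::vector. The function names mirror the intrinsics, but the code is only a sketch of the lane ordering for factor 2, not the LLVM implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Scalar model of a factor-2 deinterleave: even lanes of the wide vector go
// to the first result, odd lanes go to the second result.
std::pair<std::vector<int>, std::vector<int>>
deinterleave2(const std::vector<int> &Wide) {
  assert(Wide.size() % 2 == 0 && "expected an even number of lanes");
  std::vector<int> Even, Odd;
  for (std::size_t I = 0; I < Wide.size(); I += 2) {
    Even.push_back(Wide[I]);
    Odd.push_back(Wide[I + 1]);
  }
  return {Even, Odd};
}

// Scalar model of a factor-2 interleave: lanes of A and B alternate in the
// result, starting with A.
std::vector<int> interleave2(const std::vector<int> &A,
                             const std::vector<int> &B) {
  assert(A.size() == B.size() && "operands must have the same lane count");
  std::vector<int> Wide;
  for (std::size_t I = 0; I < A.size(); ++I) {
    Wide.push_back(A[I]);
    Wide.push_back(B[I]);
  }
  return Wide;
}
```

An interleaved load then corresponds to one wide load followed by deinterleave2, and an interleaved store to interleave2 followed by one wide store.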

Diff Detail

Unit Tests: Failed

	Time	Test
	60,050 ms	x64 debian > libFuzzer.libFuzzer::minimize_crash.test

Event Timeline

mgabka created this revision. Mar 2 2023, 7:24 AM

Herald added a project: Restricted Project. Mar 2 2023, 7:24 AM

Herald added subscribers: nlopes, hiraditya.

mgabka requested review of this revision. Mar 2 2023, 7:24 AM

Herald added subscribers: llvm-commits, pcwang-thead, alextsao1999.

mgabka mentioned this in D134438: POC patch to demonstrate how new intrinsics for interleaved load/store could be used in LoopVectorize. Mar 2 2023, 7:25 AM

Matt added a subscriber: Matt. Mar 2 2023, 7:27 AM

nlopes added inline comments. Mar 2 2023, 8:09 AM

llvm/lib/IR/IRBuilder.cpp
596	Please use PoisonValue here and whenever possible as we are trying to get rid of UndefValue. Thank you!

Harbormaster completed remote builds in B216974: Diff 501861. Mar 2 2023, 8:36 AM

Thanks for adding this! I'm currently plugging in the hooks for RISC-V and will let you know what I run into.

llvm/lib/IR/IRBuilder.cpp
587	Maybe this should be an assertion
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2721	Need to check that `Group->getFactor() == 2` here or that the call to CreateMaskedInterleavedLoad succeeds
2809	Need to check `Group->getFactor() == 2` here too

luke added inline comments. Mar 2 2023, 9:58 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2751	It's somehow possible to reach here with a scalable vector type if `TII->hasInterleavedLoad` returns false. Can we check somewhere inside the vectorizer cost model that if `hasInterleavedLoad` is false then we rule out any recipe with an interleave group for a scalable VF?

mgabka added inline comments. Mar 10 2023, 5:47 AM

llvm/include/llvm/Analysis/TargetTransformInfo.h
835–839	I think these functions could actually be merged into one, something like: supportsInterleaving(VectorType *VecTy, uint32_t Factor, bool IsMasked). I think it is unlikely that a target supports store but not load for the same vector type. @paulwalker-arm what do you think?
llvm/lib/IR/IRBuilder.cpp
587	Hi Luke, I could remove the Factor argument entirely and make this function specific to Factor=2 (and use assertions), which makes sense for now as the experimental interleaving intrinsics only exist for factor 2.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2721	My idea was that it would be up to the hasInterleavedLoad function to return true only when Factor is 2, so I think no extra checks are needed.
2751	This is actually handled by the LV cost model: LoopVectorizationCostModel::getInterleaveGroupCost calls TTI.getInterleavedMemoryOpCost, which should return an invalid cost for factors other than 2.
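
The gating described in this reply can be pictured with a small standalone sketch. The names interleavedGroupCost and canInterleaveGroup are hypothetical, std::nullopt stands in for an invalid cost, and the returned number is a placeholder; none of this is the actual TTI interface.

```cpp
#include <optional>

// Hypothetical stand-in for a target cost hook. For a scalable VF only
// factor 2 is supported, so every other factor yields "no valid cost".
std::optional<unsigned> interleavedGroupCost(bool IsScalableVF,
                                             unsigned Factor) {
  if (IsScalableVF && Factor != 2)
    return std::nullopt; // models an invalid cost
  return Factor;         // placeholder cost, not a real estimate
}

// Hypothetical stand-in for the vectorizer's decision: an interleave group
// is only considered when the target reports a valid cost for it.
bool canInterleaveGroup(bool IsScalableVF, unsigned Factor) {
  return interleavedGroupCost(IsScalableVF, Factor).has_value();
}
```

With this shape, the code that emits the interleaved access never sees a scalable group with an unsupported factor, because the cost query has already ruled it out.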

Replaced use of Undef with Poison value

mgabka marked an inline comment as done. Mar 10 2023, 6:07 AM

Harbormaster completed remote builds in B218667: Diff 504122. Mar 10 2023, 7:20 AM

reames added inline comments. Mar 10 2023, 7:35 AM

llvm/include/llvm/IR/IRBuilder.h
771	This is the wrong interface. The IRBuilder interface should provide a way to create the interleave and deinterleave intrinsic calls. That interface should generate shuffles for fixed vectors. Then the calling logic in the vectorizer should worry about emitting the load/store. (That's the existing structure in fact.)
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2987	The changes to this function are NFC for fixed length vectors, and a generally useful scalable cleanup. Please separate and land this change without the need for further review. This applies only to the changes in this function so as to shrink the diff for future review.
llvm/test/Transforms/LoopVectorize/sve-interleaved-accesses.ll
2	This should be in the AArch64 sub-tree, and probably precommitted. Depending on your confidence in the AArch64 code, you may want to separate that into its own review.

huntergr mentioned this in rG9aa01c4e8917: [LV] Remove scalable constraints on creating bitcasts. Mar 17 2023, 9:20 AM

huntergr mentioned this in rGfba2a7c6958b: [LV][AArch64] Precommit interleaved access tests. Mar 29 2023, 2:26 AM

Taking this one over from @mgabka

Separated out the bitcast fix and committed. Precommitted tests.

Changed IRBuilder interface to focus on the intrinsics (and fixed-length shuffle equivalents) instead of mixing in loads/stores.

There are a few unit tests which will fail with the new interface -- although we generate the same IR instructions, the order is different. Assuming the interface is suitable, I'll update the tests before posting the next patch revision.
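
For readers unfamiliar with the fixed-length shuffle equivalents mentioned above, the following standalone sketch shows the kind of shuffle masks involved. The helper names interleaveMask and strideMask are made up for this illustration and operate on plain integers; they only model the lane indices a fixed-width interleave or deinterleave shuffle would use.

```cpp
#include <vector>

// Mask that interleaves NumVecs vectors of VF lanes each, assuming the
// inputs have been concatenated into one wide vector first.
std::vector<int> interleaveMask(unsigned VF, unsigned NumVecs) {
  std::vector<int> Mask;
  for (unsigned I = 0; I < VF; ++I)
    for (unsigned J = 0; J < NumVecs; ++J)
      Mask.push_back(static_cast<int>(J * VF + I));
  return Mask;
}

// Mask that extracts every Stride-th lane beginning at Start, i.e. one
// member of an interleave group from a wide loaded vector.
std::vector<int> strideMask(unsigned Start, unsigned Stride, unsigned VF) {
  std::vector<int> Mask;
  for (unsigned I = 0; I < VF; ++I)
    Mask.push_back(static_cast<int>(Start + I * Stride));
  return Mask;
}
```

Masks like these depend on a compile-time lane count, which is why scalable vectors need the intrinsics instead of shuffles.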

Herald added subscribers: frasercrmck, luismarques, apazos and 20 others. Apr 5 2023, 5:48 AM

huntergr marked 2 inline comments as done. Apr 5 2023, 5:55 AM

huntergr added inline comments.

llvm/tools/llvm-profdata/CMakeLists.txt
7 ↗	(On Diff #511067)	This is due to the fixed-length mask generation code being in VectorUtils. This isn't the only tool affected, though oddly enough I've only observed build failures on X86 and not AArch64 hosts. I've included it as a representative. I would prefer not to make changes to a bunch of cmake files for this, so I'm currently leaning towards either duplicating the mask generation code or moving it into Core. Any preferences from a reviewer?

Harbormaster completed remote builds in B223778: Diff 511067. Apr 5 2023, 6:29 AM

mgabka added inline comments. Apr 6 2023, 3:47 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
125–127	Hi @huntergr , Thanks for your changes to this patch! I have one question, the interface you proposed looks clean and nice, however it forces code generation for the deinterleaving/interleaving intrinsics to be implemented before merging this patch, am I correct? The reason why I had this option here is that it would allow us to merge this patch before other pieces are implemented.

huntergr added inline comments. Apr 6 2023, 3:57 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
125–127	Hi @mgabka , We do have code generation for these intrinsics already, they just get lowered to zips/uzips. See D141924. D146218 will match to ld2/st2 where possible (which is what we want), and should perhaps land first. The changes to isLegalInterleavedAccessType will also be needed there, so the next version of this patch can just rely on that.

mgabka mentioned this in D136153: [AArch64] Allow cost computation for interleaved accesses. Apr 6 2023, 7:02 AM

reames added inline comments. Apr 6 2023, 6:01 PM

llvm/lib/IR/IRBuilder.cpp
1343	cast<>
1384	Given this assert, we shouldn't need to pass Factor in here at all.
1387	cast<>
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2730	Having this be only in the normal load path seems unlikely to be correct. Surely we must also handle masked loads as well?
2763–2764	It looks like you're changing the handling for gaps in the deinterleave. This seems surprising and worth some discussion?

Implemented requested changes to utility functions (casts, removing redundant parameter)
Moved utility functions to VectorUtils instead of IRBuilder; this removes the problem that introduced a dependency on the Analysis component for several tools which don't need the functionality.
Updated affected tests.
Full check output included for the strict fadd tests.

Herald added a subscriber: dmgreen. Apr 14 2023, 7:53 AM

huntergr marked 5 inline comments as done. Apr 14 2023, 7:59 AM

huntergr added inline comments.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2730	This does handle masked loads -- 'NewLoad = Builder.CreateAligned....' is a standalone statement on the else with no opening brace. I've added a blank line to perhaps make that a little more obvious. Unless there's something else I've missed?
2763–2764	That was the result of a bit of overzealous cleanup on my part when removing some code from the original patch; I missed the 'continue'. Reverted.

reames added inline comments. Apr 14 2023, 8:36 AM

llvm/include/llvm/Analysis/VectorUtils.h
596 ↗	(On Diff #513595)	Unless you have plans to reuse these, this is just an implementation detail of the vectorizer. As such, these would be better as static functions in LoopVectorize.cpp
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
2730	Yeah, I got confused by the brace style in the code above.
2746–2751	The interface here feels really awkward for fixed length vectors. We have to create this dummy struct type, construct it, destruct it, and we lose the ability to slice out the inactive lanes. I almost wonder if this code would be clearer without the helper function at all. With an explicit version based on scalable type here, we could do a simplified version of this loop with an early return and leave the fixed length codegen unchanged. I'd be tempted to try that and see if the overall code quality looked reasonable. You could also try a lambda which enumerates the active lanes (i.e. doing the shuffle or extract as required), and move the handling of the bitcast and reverse to a callback. This might be too much complexity though.
llvm/test/Transforms/LoopVectorize/AArch64/scalable-strict-fadd.ll
1 ↗	(On Diff #513595)	Please submit a separate change to autogen this file, and then rebase. Same with the other file you switched to autogen.

Harbormaster completed remote builds in B225622: Diff 513595. Apr 14 2023, 8:59 AM

huntergr mentioned this in rGd8c49d2ac9dd: [LV][AArch64] Autogenerate checks for scalable-strict-fadd.ll (NFC). Apr 18 2023, 2:25 AM

Moved interleaveVectors to a static function in LoopVectorize
Removed deinterleaveVector, inlined intrinsic creation. This means the shuffles for fixed-length loads won't be changed, though we do end up with a little duplication as a result.
Precommitted autogen checks for scalable-strict-fadd.ll

huntergr marked 2 inline comments as done. Apr 18 2023, 4:01 AM

Harbormaster completed remote builds in B226354: Diff 514596. Apr 18 2023, 4:40 AM

Ping?

igor.kirillov added a subscriber: igor.kirillov. May 15 2023, 10:32 AM

igor.kirillov added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14373	Looks like we can use MinElts and avoid duplicating EC.getKnownMinValue()

Rebased, removed redundant call as requested.

huntergr marked an inline comment as done. May 19 2023, 1:50 AM

Harbormaster completed remote builds in B233118: Diff 523695. May 19 2023, 2:41 AM

I can't really comment on the AArch64 parts of this, but the LoopVectorizer bits look entirely reasonable to me at this point.

fhahn added a subscriber: fhahn. May 28 2023, 11:29 AM

fhahn added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14384	Nit: can just return the condition
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
441	Should we have this assert already when constructing the interleave recipe?

Simplified size checking code, changed assert on interleave factor to occur before creating recipes.

The recipe constructor doesn't actually know the VF, so just confirming that the decision for a given scalable VF is to interleave only for factors of 2 should suffice. The code that performs interleaving is still effectively guarded by asserts in call construction that it has the correct number of arguments.

huntergr marked 2 inline comments as done. Jun 1 2023, 1:41 AM

Harbormaster completed remote builds in B235766: Diff 527321. Jun 1 2023, 2:54 AM

igor.kirillov added a child revision: D152258: [LV] Add mask support for vectorizing interleaved groups. Jun 6 2023, 4:30 AM

LGTM, thanks!

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
449 ↗	(On Diff #527321)	Could you make sure there's a test case for RISCV that covers this case before landing?

This revision is now accepted and ready to land. Jun 6 2023, 1:18 PM

Herald added a subscriber: StephenFan. Jun 6 2023, 1:18 PM

mgabka added inline comments. Jun 7 2023, 1:25 AM

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
444	Can you pass Vals directly here?

huntergr added inline comments. Jun 8 2023, 3:11 AM

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
449 ↗	(On Diff #527321)	I added that check because both interleaved-accesses.ll and strided-accesses.ll (in llvm/test/Transforms/LoopVectorize/RISCV/) crash on the cast below when vplan tries to get the cost of the interleaving group with a scalable VF. Is that sufficient?

This revision was landed with ongoing or failed builds. Jun 9 2023, 3:43 AM

Closed by commit rG95bfb1902db9: [LV][AArch64] Allow (limited) interleaving for scalable vectors (authored by huntergr).

This revision was automatically updated to reflect the committed changes.

huntergr added a commit: rG95bfb1902db9: [LV][AArch64] Allow (limited) interleaving for scalable vectors.

luke mentioned this in D145485: [PoC][IR] Generalize interleave/deinterleave intrinsics to factors > 2. Jun 27 2023, 3:14 AM

Revision Contents

Path	Size
llvm/include/llvm/Analysis/TargetTransformInfo.h	19 lines
llvm/include/llvm/Analysis/TargetTransformInfoImpl.h	10 lines
llvm/include/llvm/IR/IRBuilder.h	13 lines
llvm/lib/Analysis/TargetTransformInfo.cpp	12 lines
llvm/lib/IR/IRBuilder.cpp	67 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	23 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h	6 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	36 lines
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp	62 lines
llvm/test/Transforms/LoopVectorize/sve-interleaved-accesses.ll	1762 lines

Diff 501861

llvm/include/llvm/Analysis/TargetTransformInfo.h

[... 826 lines not shown ...]
   MemCmpExpansionOptions enableMemCmpExpansion(bool OptSize,
                                                bool IsZeroCmp) const;

   /// Should the Select Optimization pass be enabled and ran.
   bool enableSelectOptimize() const;

   /// Enable matching of interleaved access groups.
   bool enableInterleavedAccessVectorization() const;

+  bool hasInterleavedLoad(VectorType *VecTy, Value *Addr, uint32_t Factor,
+                          bool IsMasked) const;
+
+  bool hasInterleavedStore(SmallVectorImpl<Value *> &StoredVecs, Value *Addr,
+                           uint32_t Factor, bool IsMasked) const;

   /// Enable matching of interleaved access groups that contain predicated
   /// accesses or gaps and therefore vectorized using masked
   /// vector loads/stores.
   bool enableMaskedInterleavedAccessVectorization() const;

   /// Indicate that it is potentially unsafe to automatically vectorize
   /// floating-point operations because the semantics of vector and scalar
   /// floating-point semantics may differ. For example, ARM NEON v7 SIMD math
[... 876 lines not shown ...]
 public:
 [...]
   virtual bool supportsEfficientVectorElementLoadStore() = 0;
   virtual bool supportsTailCalls() = 0;
   virtual bool supportsTailCallFor(const CallBase *CB) = 0;
   virtual bool enableAggressiveInterleaving(bool LoopHasReductions) = 0;
   virtual MemCmpExpansionOptions
   enableMemCmpExpansion(bool OptSize, bool IsZeroCmp) const = 0;
   virtual bool enableSelectOptimize() = 0;
   virtual bool enableInterleavedAccessVectorization() = 0;
+  virtual bool hasInterleavedLoad(VectorType *VecTy, Value *Addr,
+                                  uint32_t Factor, bool IsMasked) = 0;
+  virtual bool hasInterleavedStore(SmallVectorImpl<Value *> &StoredVecs,
+                                   Value *Addr, uint32_t Factor,
+                                   bool IsMasked) = 0;
   virtual bool enableMaskedInterleavedAccessVectorization() = 0;
   virtual bool isFPVectorizationPotentiallyUnsafe() = 0;
   virtual bool allowsMisalignedMemoryAccesses(LLVMContext &Context,
                                               unsigned BitWidth,
                                               unsigned AddressSpace,
                                               Align Alignment,
                                               unsigned *Fast) = 0;
   virtual PopcntSupportKind getPopcntSupport(unsigned IntTyWidthInBit) = 0;
[... 480 lines not shown ...]
 public:
 [...]
   }
   MemCmpExpansionOptions enableMemCmpExpansion(bool OptSize,
                                                bool IsZeroCmp) const override {
     return Impl.enableMemCmpExpansion(OptSize, IsZeroCmp);
   }
   bool enableInterleavedAccessVectorization() override {
     return Impl.enableInterleavedAccessVectorization();
   }
+  bool hasInterleavedLoad(VectorType *VecTy, Value *Addr, uint32_t Factor,
+                          bool IsMasked) override {
+    return Impl.hasInterleavedLoad(VecTy, Addr, Factor, IsMasked);
+  }
+  bool hasInterleavedStore(SmallVectorImpl<Value *> &StoredVecs, Value *Addr,
+                           uint32_t Factor, bool IsMasked) override {
+    return Impl.hasInterleavedStore(StoredVecs, Addr, Factor, IsMasked);
+  }
   bool enableSelectOptimize() override {
     return Impl.enableSelectOptimize();
   }
   bool enableMaskedInterleavedAccessVectorization() override {
     return Impl.enableMaskedInterleavedAccessVectorization();
   }
   bool isFPVectorizationPotentiallyUnsafe() override {
     return Impl.isFPVectorizationPotentiallyUnsafe();
[... 496 lines not shown ...]

llvm/include/llvm/Analysis/TargetTransformInfoImpl.h

[... 360 lines not shown ...]
   TTI::MemCmpExpansionOptions enableMemCmpExpansion(bool OptSize,
                                                     bool IsZeroCmp) const {
     return {};
   }

   bool enableSelectOptimize() const { return true; }

   bool enableInterleavedAccessVectorization() const { return false; }

+  bool hasInterleavedLoad(VectorType *VecTy, Value *Addr, uint32_t Factor,
+                          bool IsMasked) const {
+    return false;
+  }
+
+  bool hasInterleavedStore(SmallVectorImpl<Value *> &StoredVecs, Value *Addr,
+                           uint32_t Factor, bool IsMasked) const {
+    return false;
+  }
+
   bool enableMaskedInterleavedAccessVectorization() const { return false; }

   bool isFPVectorizationPotentiallyUnsafe() const { return false; }

   bool allowsMisalignedMemoryAccesses(LLVMContext &Context, unsigned BitWidth,
                                       unsigned AddressSpace, Align Alignment,
                                       unsigned *Fast) const {
     return false;
[... 926 lines not shown ...]

llvm/include/llvm/IR/IRBuilder.h

[... 762 lines not shown ...]
 public:
 [...]
   /// Create a call to invariant.start intrinsic.
   ///
   /// If the pointer isn't i8* it will be converted.
   CallInst *CreateInvariantStart(Value *Ptr, ConstantInt *Size = nullptr);

   /// Create a call to llvm.threadlocal.address intrinsic.
   CallInst *CreateThreadLocalAddress(Value *Ptr);

+  /// Create a masked interleaved load using a masked load and deinterleaving
+  /// intrinsics.
+  CallInst *CreateMaskedInterleavedLoad(uint32_t Factor, Type *Ty, Value *Ptr,
+                                        Align Alignment, Value *Mask = nullptr,
+                                        Value *PassThru = nullptr,
+                                        const Twine &Name = "");
+
+  /// Create a masked interleaved store using a masked store and interleaving
+  /// intrinsics.
+  CallInst *CreateMaskedInterleavedStore(uint32_t Factor, ArrayRef<Value *> Val,
+                                         Value *Ptr, Align Alignment,
+                                         Value *Mask = nullptr);
+
   /// Create a call to Masked Load intrinsic
   CallInst *CreateMaskedLoad(Type *Ty, Value *Ptr, Align Alignment, Value *Mask,
                              Value *PassThru = nullptr, const Twine &Name = "");

   /// Create a call to Masked Store intrinsic
   CallInst *CreateMaskedStore(Value *Val, Value *Ptr, Align Alignment,
                               Value *Mask);

[... 1,856 lines not shown ...]

llvm/lib/Analysis/TargetTransformInfo.cpp

[... 557 lines not shown ...]
 bool TargetTransformInfo::enableSelectOptimize() const {
   return TTIImpl->enableSelectOptimize();
 }

 bool TargetTransformInfo::enableInterleavedAccessVectorization() const {
   return TTIImpl->enableInterleavedAccessVectorization();
 }

+bool TargetTransformInfo::hasInterleavedLoad(VectorType *VecTy, Value *Addr,
+                                             uint32_t Factor,
+                                             bool IsMasked) const {
+  return TTIImpl->hasInterleavedLoad(VecTy, Addr, Factor, IsMasked);
+}
+
+bool TargetTransformInfo::hasInterleavedStore(
+    SmallVectorImpl<Value *> &StoredVecs, Value *Addr, uint32_t Factor,
+    bool IsMasked) const {
+  return TTIImpl->hasInterleavedStore(StoredVecs, Addr, Factor, IsMasked);
+}
+
 bool TargetTransformInfo::enableMaskedInterleavedAccessVectorization() const {
   return TTIImpl->enableMaskedInterleavedAccessVectorization();
 }

 bool TargetTransformInfo::isFPVectorizationPotentiallyUnsafe() const {
   return TTIImpl->isFPVectorizationPotentiallyUnsafe();
 }

[... 677 lines not shown ...]

llvm/lib/IR/IRBuilder.cpp

Show First 20 Lines • Show All 563 Lines • ▼ Show 20 Lines	assert(Cond->getType() == getInt1Ty() &&
"an assumption condition must be of type i1");		"an assumption condition must be of type i1");

Value *Ops[] = { Cond };		Value *Ops[] = { Cond };
Module *M = BB->getParent()->getParent();		Module *M = BB->getParent()->getParent();
Function *FnAssume = Intrinsic::getDeclaration(M, Intrinsic::assume);		Function *FnAssume = Intrinsic::getDeclaration(M, Intrinsic::assume);
return CreateCall(FnAssume, Ops, OpBundles);		return CreateCall(FnAssume, Ops, OpBundles);
}		}

		/// Create a masked interleaved load using a masked load and deinterliving
		/// intrinsics.
		/// \p Ty - vector type to load
		/// \p Ptr - base pointer for the load
		/// \p Alignment - alignment of the source location
		/// \p Mask - vector of booleans which indicates what vector lanes should
		/// be accessed in memory
		/// \p PassThru - pass-through value that is used to fill the masked-off lanes
		/// of the result
		/// \p Name - name of the result variable
		/// \p Factor - interleaving factor
		CallInst *IRBuilderBase::CreateMaskedInterleavedLoad(
		uint32_t Factor, Type Ty, Value Ptr, Align Alignment, Value *Mask,
		Value *PassThru, const Twine &Name) {
		if (Factor != 2)
		return nullptr;
		lukeUnsubmitted Not Done Reply Inline Actions Maybe this should be an assertion luke: Maybe this should be an assertion
		mgabkaUnsubmitted Done Reply Inline Actions Hi Luke, so I could remove the Factor argument entirely and make this function specific for just Factor=2 (and use assertions) what makes sense for now as the experimental interleaving intrinsics are only for factor 2. mgabka: Hi Luke, so I could remove the Factor argument entirely and make this function specific for…
		assert(Ty->isVectorTy() && "Type should be vector");
		auto *PtrTy = cast<PointerType>(Ptr->getType());
		assert(PtrTy->isOpaqueOrPointeeTypeMatches(Ty) && "Wrong element type");
		auto *VecTy = cast<VectorType>(Ty);
		if (!Mask)
		Mask = Constant::getAllOnesValue(
		VectorType::get(Type::getInt1Ty(Context), VecTy->getElementCount()));
		if (!PassThru)
		PassThru = UndefValue::get(Ty);
		nlopesUnsubmitted Done Reply Inline Actions Please use PoisonValue here and whenever possible as we are trying to get rid of UndefValue. Thank you! nlopes: Please use PoisonValue here and whenever possible as we are trying to get rid of UndefValue.
		auto *Ld = CreateMaskedLoad(VecTy, Ptr, Alignment, Mask, PassThru, Name);
		return CreateIntrinsic(Intrinsic::experimental_vector_deinterleave2, VecTy,
		Ld);
		}

		/// Create a masked interleaved store using a masked store and interleaving
		/// intrinsics.
		/// \p StoredVals - data to be stored
		/// \p Ptr - base pointer for the store
		/// \p Alignment - alignment of the destination location
		/// \p Mask - vector of booleans which indicates what vector lanes should
		/// be accessed in memory
		/// \p Factor - interleaving factor
		CallInst *IRBuilderBase::CreateMaskedInterleavedStore(
		uint32_t Factor, ArrayRef<Value > StoredVals, Value Ptr, Align Alignment,
		Value *Mask) {
		if (Factor != 2)
		return nullptr;
		assert(StoredVals.size() == Factor &&
		"Not enough data to store for given factor");
		Type DataTy = (StoredVals.begin())->getType();
		#ifndef NDEBUG
		for (auto &Val : StoredVals)
		assert(Val->getType()->isVectorTy() && "Stored value should be a vector");
		#endif
		auto *VecTy = cast<VectorType>(DataTy);
		auto *WideVecTy = VectorType::getDoubleElementsVectorType(VecTy);
		auto *PtrTy = VecTy->getElementType()->getPointerTo(
		Ptr->getType()->getPointerAddressSpace());
		assert(PtrTy->isOpaqueOrPointeeTypeMatches(VecTy->getElementType()) &&
		"Wrong element type");
		if (!Mask)
		Mask = Constant::getAllOnesValue(VectorType::get(
		Type::getInt1Ty(Context),
		VecTy->getElementCount().multiplyCoefficientBy(Factor)));

		SmallVector<Value *, 8> Ops(StoredVals.begin(), StoredVals.end());
		auto *Val = CreateIntrinsic(WideVecTy,
		Intrinsic::experimental_vector_interleave2, Ops);
		return CreateMaskedStore(Val, Ptr, Alignment, Mask);
		}

Instruction IRBuilderBase::CreateNoAliasScopeDeclaration(Value Scope) {		Instruction IRBuilderBase::CreateNoAliasScopeDeclaration(Value Scope) {
Module *M = BB->getModule();		Module *M = BB->getModule();
auto *FnIntrinsic = Intrinsic::getDeclaration(		auto *FnIntrinsic = Intrinsic::getDeclaration(
M, Intrinsic::experimental_noalias_scope_decl, {});		M, Intrinsic::experimental_noalias_scope_decl, {});
return CreateCall(FnIntrinsic, {Scope});		return CreateCall(FnIntrinsic, {Scope});
}		}

/// Create a call to a Masked Load intrinsic.		/// Create a call to a Masked Load intrinsic.
▲ Show 20 Lines • Show All 688 Lines • ▼ Show 20 Lines

Value *IRBuilderBase::CreateExtractInteger(		Value *IRBuilderBase::CreateExtractInteger(
const DataLayout &DL, Value From, IntegerType ExtractedTy,		const DataLayout &DL, Value From, IntegerType ExtractedTy,
uint64_t Offset, const Twine &Name) {		uint64_t Offset, const Twine &Name) {
auto *IntTy = cast<IntegerType>(From->getType());		auto *IntTy = cast<IntegerType>(From->getType());
assert(DL.getTypeStoreSize(ExtractedTy) + Offset <=		assert(DL.getTypeStoreSize(ExtractedTy) + Offset <=
DL.getTypeStoreSize(IntTy) &&		DL.getTypeStoreSize(IntTy) &&
"Element extends past full value");		"Element extends past full value");
uint64_t ShAmt = 8 * Offset;		uint64_t ShAmt = 8 * Offset;
		reamesUnsubmitted Done Reply Inline Actions cast<> reames: cast<>
Value *V = From;		Value *V = From;
if (DL.isBigEndian())		if (DL.isBigEndian())
ShAmt = 8 * (DL.getTypeStoreSize(IntTy) -		ShAmt = 8 * (DL.getTypeStoreSize(IntTy) -
DL.getTypeStoreSize(ExtractedTy) - Offset);		DL.getTypeStoreSize(ExtractedTy) - Offset);
if (ShAmt) {		if (ShAmt) {
V = CreateLShr(V, ShAmt, Name + ".shift");		V = CreateLShr(V, ShAmt, Name + ".shift");
}		}
assert(ExtractedTy->getBitWidth() <= IntTy->getBitWidth() &&		assert(ExtractedTy->getBitWidth() <= IntTy->getBitWidth() &&
Show All 24 Lines	Value *IRBuilderBase::CreatePreserveArrayAccessIndex(
Module *M = BB->getParent()->getParent();		Module *M = BB->getParent()->getParent();
Function *FnPreserveArrayAccessIndex = Intrinsic::getDeclaration(		Function *FnPreserveArrayAccessIndex = Intrinsic::getDeclaration(
M, Intrinsic::preserve_array_access_index, {ResultType, BaseType});		M, Intrinsic::preserve_array_access_index, {ResultType, BaseType});

Value *DimV = getInt32(Dimension);		Value *DimV = getInt32(Dimension);
CallInst *Fn =		CallInst *Fn =
CreateCall(FnPreserveArrayAccessIndex, {Base, DimV, LastIndexV});		CreateCall(FnPreserveArrayAccessIndex, {Base, DimV, LastIndexV});
Fn->addParamAttr(		Fn->addParamAttr(
0, Attribute::get(Fn->getContext(), Attribute::ElementType, ElTy));		0, Attribute::get(Fn->getContext(), Attribute::ElementType, ElTy));
		reames (Done): Given this assert, we shouldn't need to pass Factor in here at all.
if (DbgInfo)		if (DbgInfo)
Fn->setMetadata(LLVMContext::MD_preserve_access_index, DbgInfo);		Fn->setMetadata(LLVMContext::MD_preserve_access_index, DbgInfo);

		reames (Done): cast<>
return Fn;		return Fn;
}		}

Value *IRBuilderBase::CreatePreserveUnionAccessIndex(		Value *IRBuilderBase::CreatePreserveUnionAccessIndex(
Value *Base, unsigned FieldIndex, MDNode *DbgInfo) {		Value *Base, unsigned FieldIndex, MDNode *DbgInfo) {
assert(isa<PointerType>(Base->getType()) &&		assert(isa<PointerType>(Base->getType()) &&
"Invalid Base ptr type for preserve.union.access.index.");		"Invalid Base ptr type for preserve.union.access.index.");
auto *BaseType = Base->getType();		auto *BaseType = Base->getType();
▲ Show 20 Lines • Show All 81 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


Show First 20 Lines • Show All 14,339 Lines • ▼ Show 20 Lines	bool AArch64TargetLowering::hasPairedLoad(EVT LoadedType,
return NumBits == 32 || NumBits == 64;		return NumBits == 32 || NumBits == 64;
}		}

/// A helper function for determining the number of interleaved accesses we		/// A helper function for determining the number of interleaved accesses we
/// will generate when lowering accesses of the given type.		/// will generate when lowering accesses of the given type.
unsigned AArch64TargetLowering::getNumInterleavedAccesses(		unsigned AArch64TargetLowering::getNumInterleavedAccesses(
VectorType *VecTy, const DataLayout &DL, bool UseScalable) const {		VectorType *VecTy, const DataLayout &DL, bool UseScalable) const {
unsigned VecSize = 128;		unsigned VecSize = 128;
		unsigned ElSize = DL.getTypeSizeInBits(VecTy->getElementType());
		auto EC = VecTy->getElementCount();
if (UseScalable)		if (UseScalable)
VecSize = std::max(Subtarget->getMinSVEVectorSizeInBits(), 128u);		VecSize = std::max(Subtarget->getMinSVEVectorSizeInBits(), 128u);
return std::max<unsigned>(1, (DL.getTypeSizeInBits(VecTy) + 127) / VecSize);		return std::max<unsigned>(1,
		(EC.getKnownMinValue() * ElSize + 127) / VecSize);
}		}
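The rounding in the updated getNumInterleavedAccesses can be checked in isolation. The following is a minimal standalone sketch, not LLVM code: `numInterleavedAccesses` and its plain-integer parameters are hypothetical stand-ins for the `ElementCount`, `DataLayout`, and subtarget queries used in the patch.

```cpp
// Hypothetical stand-in for AArch64TargetLowering::getNumInterleavedAccesses.
// minElts : known minimum element count (EC.getKnownMinValue() in the patch)
// elSize  : element size in bits
// vecSize : 128 for NEON, or max(min SVE vector size, 128) when UseScalable
unsigned numInterleavedAccesses(unsigned minElts, unsigned elSize,
                                unsigned vecSize) {
  // Same arithmetic as the right-hand side of the diff.
  unsigned n = (minElts * elSize + 127) / vecSize;
  // Always report at least one access.
  return n < 1 ? 1 : n;
}
```

Under these assumptions, `<vscale x 4 x i32>` has a known minimum size of 4 * 32 = 128 bits and maps to one access, while a fixed `<8 x i32>` with a 128-bit register maps to two.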

MachineMemOperand::Flags		MachineMemOperand::Flags
AArch64TargetLowering::getTargetMMOFlags(const Instruction &I) const {		AArch64TargetLowering::getTargetMMOFlags(const Instruction &I) const {
if (Subtarget->getProcFamily() == AArch64Subtarget::Falkor &&		if (Subtarget->getProcFamily() == AArch64Subtarget::Falkor &&
I.getMetadata(FALKOR_STRIDED_ACCESS_MD) != nullptr)		I.getMetadata(FALKOR_STRIDED_ACCESS_MD) != nullptr)
return MOStridedAccess;		return MOStridedAccess;
return MachineMemOperand::MONone;		return MachineMemOperand::MONone;
}		}

bool AArch64TargetLowering::isLegalInterleavedAccessType(		bool AArch64TargetLowering::isLegalInterleavedAccessType(
VectorType *VecTy, const DataLayout &DL, bool &UseScalable) const {		VectorType *VecTy, const DataLayout &DL, bool &UseScalable) const {

unsigned VecSize = DL.getTypeSizeInBits(VecTy);
unsigned ElSize = DL.getTypeSizeInBits(VecTy->getElementType());		unsigned ElSize = DL.getTypeSizeInBits(VecTy->getElementType());
unsigned NumElements = cast<FixedVectorType>(VecTy)->getNumElements();		auto EC = VecTy->getElementCount();

UseScalable = false;		UseScalable = false;

// Ensure that the predicate for this number of elements is available.		// Ensure that the predicate for this number of elements is available.
if (Subtarget->hasSVE() && !getSVEPredPatternFromNumElements(NumElements))		if (Subtarget->hasSVE() &&
		!getSVEPredPatternFromNumElements(EC.getKnownMinValue()))
		igor.kirillov (Done): Looks like we can use MinElts and avoid duplicating EC.getKnownMinValue().
return false;		return false;

// Ensure the number of vector elements is greater than 1.		// Ensure the number of vector elements is greater than 1.
if (NumElements < 2)		if (EC.getKnownMinValue() < 2)
return false;		return false;

// Ensure the element type is legal.		// Ensure the element type is legal.
if (ElSize != 8 && ElSize != 16 && ElSize != 32 && ElSize != 64)		if (ElSize != 8 && ElSize != 16 && ElSize != 32 && ElSize != 64)
return false;		return false;

		if (EC.isScalable()) {
		fhahn (Done): Nit: can just return the condition.
		if (EC.getKnownMinValue() * ElSize == 128)
		return true;
		return false;
		}

		unsigned VecSize = DL.getTypeSizeInBits(VecTy);
if (Subtarget->forceStreamingCompatibleSVE() ||		if (Subtarget->forceStreamingCompatibleSVE() ||
(Subtarget->useSVEForFixedLengthVectors() &&		(Subtarget->useSVEForFixedLengthVectors() &&
(VecSize % Subtarget->getMinSVEVectorSizeInBits() == 0 ||		(VecSize % Subtarget->getMinSVEVectorSizeInBits() == 0 ||
(VecSize < Subtarget->getMinSVEVectorSizeInBits() &&		(VecSize < Subtarget->getMinSVEVectorSizeInBits() &&
isPowerOf2_32(NumElements) && VecSize > 128)))) {		isPowerOf2_32(EC.getKnownMinValue()) && VecSize > 128)))) {
UseScalable = true;		UseScalable = true;
return true;		return true;
}		}

// Ensure the total vector size is 64 or a multiple of 128. Types larger than		// Ensure the total vector size is 64 or a multiple of 128. Types larger than
// 128 will be split into multiple interleaved accesses.		// 128 will be split into multiple interleaved accesses.
return VecSize == 64 || VecSize % 128 == 0;		return VecSize == 64 || VecSize % 128 == 0;
}		}
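The two review nits on the scalable path of isLegalInterleavedAccessType (cache the minimum element count, return the condition directly) can be sketched as follows. This is a hypothetical standalone model rather than the LLVM implementation: `isLegalScalableInterleavedType` is an illustrative name, and the SVE predicate-pattern check from the patch is deliberately omitted because it depends on the subtarget.

```cpp
// Hypothetical standalone model of the scalable-vector path in
// isLegalInterleavedAccessType; not the LLVM implementation.
// minElts is the known minimum element count, elSize the element size in bits.
bool isLegalScalableInterleavedType(unsigned minElts, unsigned elSize) {
  // At least two elements are required.
  if (minElts < 2)
    return false;
  // Only 8-, 16-, 32- and 64-bit elements are legal.
  if (elSize != 8 && elSize != 16 && elSize != 32 && elSize != 64)
    return false;
  // Return the condition directly: the known minimum size must be exactly
  // 128 bits.
  return minElts * elSize == 128;
}
```

With this model, `<vscale x 4 x i32>` and `<vscale x 2 x i64>` are accepted, while `<vscale x 8 x i32>` and `<vscale x 2 x i32>` are rejected.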
▲ Show 20 Lines • Show All 10,133 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

Show First 20 Lines • Show All 93 Lines • ▼ Show 20 Lines	public:

/// @}		/// @}

/// \name Vector TTI Implementations		/// \name Vector TTI Implementations
/// @{		/// @{

bool enableInterleavedAccessVectorization() { return true; }		bool enableInterleavedAccessVectorization() { return true; }

		bool hasInterleavedLoad(VectorType *VecTy, Value *Addr, uint32_t Factor,
		bool IsMasked);

		bool hasInterleavedStore(SmallVectorImpl<Value *> &StoredVecs, Value *Addr,
		uint32_t Factor, bool IsMasked);

unsigned getNumberOfRegisters(unsigned ClassID) const {		unsigned getNumberOfRegisters(unsigned ClassID) const {
bool Vector = (ClassID == 1);		bool Vector = (ClassID == 1);
if (Vector) {		if (Vector) {
if (ST->hasNEON())		if (ST->hasNEON())
return 32;		return 32;
return 0;		return 0;
}		}
return 31;		return 31;
▲ Show 20 Lines • Show All 290 Lines • Show Last 20 Lines

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

Show First 20 Lines • Show All 115 Lines • ▼ Show 20 Lines	static cl::opt<bool> EnableFixedwidthAutovecInStreamingMode(
"enable-fixedwidth-autovec-in-streaming-mode", cl::init(false), cl::Hidden);		"enable-fixedwidth-autovec-in-streaming-mode", cl::init(false), cl::Hidden);

// Experimental option that will only be fully functional when the cost-model		// Experimental option that will only be fully functional when the cost-model
// and code-generator have been changed to avoid using scalable vector		// and code-generator have been changed to avoid using scalable vector
// instructions that are not legal in streaming SVE mode.		// instructions that are not legal in streaming SVE mode.
static cl::opt<bool> EnableScalableAutovecInStreamingMode(		static cl::opt<bool> EnableScalableAutovecInStreamingMode(
"enable-scalable-autovec-in-streaming-mode", cl::init(false), cl::Hidden);		"enable-scalable-autovec-in-streaming-mode", cl::init(false), cl::Hidden);

		cl::opt<bool> EnableSVEInterleavedMemAccesses(
		"enable-sve-interleaved-mem-accesses", cl::init(false), cl::Hidden,
		cl::desc("Enable vectorization on interleaved memory accesses in a loop "
		"using sve load/store."));
		mgabka (Not Done): Hi @huntergr, thanks for your changes to this patch! I have one question: the interface you proposed looks clean and nice, but it forces code generation for the deinterleaving/interleaving intrinsics to be implemented before this patch is merged, am I correct? The reason I had this option here is that it would allow us to merge this patch before the other pieces are implemented.
		huntergr (Author, Not Done): Hi @mgabka, we do have code generation for these intrinsics already; they just get lowered to zips/uzips. See D141924. D146218 will match to ld2/st2 where possible (which is what we want), and should perhaps land first. The changes to isLegalInterleavedAccessType will also be needed there, so the next version of this patch can just rely on that.

		bool AArch64TTIImpl::hasInterleavedLoad(VectorType *VecTy, Value *Addr,
		uint32_t Factor, bool IsMasked) {
		if (!EnableSVEInterleavedMemAccesses || !isa<ScalableVectorType>(VecTy))
		return false;
		if (Factor == 2)
		return true;

		return false;
		}

		bool AArch64TTIImpl::hasInterleavedStore(SmallVectorImpl<Value *> &StoredVecs,
		Value *Addr, uint32_t Factor,
		bool IsMasked) {
		return hasInterleavedLoad(
		dyn_cast<VectorType>((*StoredVecs.begin())->getType()), Addr, Factor,
		IsMasked);
		}

bool AArch64TTIImpl::areInlineCompatible(const Function *Caller,		bool AArch64TTIImpl::areInlineCompatible(const Function *Caller,
const Function *Callee) const {		const Function *Callee) const {
SMEAttrs CallerAttrs(*Caller);		SMEAttrs CallerAttrs(*Caller);
SMEAttrs CalleeAttrs(*Callee);		SMEAttrs CalleeAttrs(*Callee);
if (CallerAttrs.requiresSMChange(CalleeAttrs,		if (CallerAttrs.requiresSMChange(CalleeAttrs,
/*BodyOverridesInterface=*/true) ||		/*BodyOverridesInterface=*/true) ||
CallerAttrs.requiresLazySave(CalleeAttrs) ||		CallerAttrs.requiresLazySave(CalleeAttrs) ||
CalleeAttrs.hasNewZAInterface())		CalleeAttrs.hasNewZAInterface())
▲ Show 20 Lines • Show All 2,499 Lines • ▼ Show 20 Lines	InstructionCost AArch64TTIImpl::getMemoryOpCost(unsigned Opcode, Type *Ty,
return LT.first;		return LT.first;
}		}

InstructionCost AArch64TTIImpl::getInterleavedMemoryOpCost(		InstructionCost AArch64TTIImpl::getInterleavedMemoryOpCost(
unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,		unsigned Opcode, Type *VecTy, unsigned Factor, ArrayRef<unsigned> Indices,
Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,		Align Alignment, unsigned AddressSpace, TTI::TargetCostKind CostKind,
bool UseMaskForCond, bool UseMaskForGaps) {		bool UseMaskForCond, bool UseMaskForGaps) {
assert(Factor >= 2 && "Invalid interleave factor");		assert(Factor >= 2 && "Invalid interleave factor");
auto *VecVTy = cast<FixedVectorType>(VecTy);		auto *VecVTy = cast<VectorType>(VecTy);

		if (isa<ScalableVectorType>(VecTy) &&
		(Factor != 2 || !EnableSVEInterleavedMemAccesses))
		return InstructionCost::getInvalid();

if (!UseMaskForCond && !UseMaskForGaps &&		if (!UseMaskForCond && !UseMaskForGaps &&
Factor <= TLI->getMaxSupportedInterleaveFactor()) {		Factor <= TLI->getMaxSupportedInterleaveFactor()) {
unsigned NumElts = VecVTy->getNumElements();		unsigned NumElts = VecVTy->getElementCount().getKnownMinValue();
auto *SubVecTy =		auto *SubVecTy =
FixedVectorType::get(VecTy->getScalarType(), NumElts / Factor);		VectorType::get(VecVTy->getElementType(),
		VecVTy->getElementCount().divideCoefficientBy(Factor));

// ldN/stN only support legal vector types of size 64 or 128 in bits.		// ldN/stN only support legal vector types of size 64 or 128 in bits.
// Accesses having vector types that are a multiple of 128 bits can be		// Accesses having vector types that are a multiple of 128 bits can be
// matched to more than one ldN/stN instruction.		// matched to more than one ldN/stN instruction.
bool UseScalable;		bool UseScalable;
if (NumElts % Factor == 0 &&		if (NumElts % Factor == 0 &&
TLI->isLegalInterleavedAccessType(SubVecTy, DL, UseScalable))		TLI->isLegalInterleavedAccessType(SubVecTy, DL, UseScalable))
return Factor * TLI->getNumInterleavedAccesses(SubVecTy, DL, UseScalable);		return Factor * TLI->getNumInterleavedAccesses(SubVecTy, DL, UseScalable);
▲ Show 20 Lines • Show All 760 Lines • Show Last 20 Lines
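The early exit added to getInterleavedMemoryOpCost is what keeps the vectorizer from choosing an interleave group it cannot emit for a scalable VF. A minimal sketch of that gate, under stated assumptions: `Cost` and `interleavedCostGate` are hypothetical names, with `Valid == false` playing the role of `InstructionCost::getInvalid()`, and `legalCost` standing in for whatever the normal costing path would return.

```cpp
// Hypothetical model of an instruction cost that can be invalid.
struct Cost {
  bool Valid;
  unsigned Value;
};

// Hypothetical model of the gate at the top of getInterleavedMemoryOpCost.
Cost interleavedCostGate(bool isScalable, unsigned factor,
                         bool sveInterleaveEnabled, unsigned legalCost) {
  // Scalable vectors are only supported for factor 2 and only when the
  // (hidden) enable flag is set; anything else is reported as invalid.
  if (isScalable && (factor != 2 || !sveInterleaveEnabled))
    return Cost{false, 0};
  // Otherwise defer to the normal costing result.
  return Cost{true, legalCost};
}
```

Because an invalid cost can never win plan selection, a scalable group with a factor other than 2 should not reach code generation in this model.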

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp


Show First 20 Lines • Show All 432 Lines • ▼ Show 20 Lines

/// InnerLoopVectorizer vectorizes loops which contain only one basic		/// InnerLoopVectorizer vectorizes loops which contain only one basic
/// block to a specified vectorization factor (VF).		/// block to a specified vectorization factor (VF).
/// This class performs the widening of scalars into vectors, or multiple		/// This class performs the widening of scalars into vectors, or multiple
/// scalars. This class also implements the following features:		/// scalars. This class also implements the following features:
/// * It inserts an epilogue loop for handling loops that don't have iteration		/// * It inserts an epilogue loop for handling loops that don't have iteration
/// counts that are known to be a multiple of the vectorization factor.		/// counts that are known to be a multiple of the vectorization factor.
/// * It handles the code generation for reduction variables.		/// * It handles the code generation for reduction variables.
/// * Scalarization (implementation using scalars) of un-vectorizable		/// * Scalarization (implementation using scalars) of un-vectorizable
		fhahn (Done): Should we have this assert already when constructing the interleave recipe?
/// instructions.		/// instructions.
/// InnerLoopVectorizer does not perform any vectorization-legality		/// InnerLoopVectorizer does not perform any vectorization-legality
/// checks, and relies on the caller to check for the different legality		/// checks, and relies on the caller to check for the different legality
		mgabka (Not Done): Can you pass Vals here directly?
/// aspects. The InnerLoopVectorizer relies on the		/// aspects. The InnerLoopVectorizer relies on the
/// LoopVectorizationLegality class to provide information about the induction		/// LoopVectorizationLegality class to provide information about the induction
/// and reduction variables that were found to a given vectorization factor.		/// and reduction variables that were found to a given vectorization factor.
class InnerLoopVectorizer {		class InnerLoopVectorizer {
public:		public:
InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE,		InnerLoopVectorizer(Loop *OrigLoop, PredicatedScalarEvolution &PSE,
LoopInfo *LI, DominatorTree *DT,		LoopInfo *LI, DominatorTree *DT,
const TargetLibraryInfo *TLI,		const TargetLibraryInfo *TLI,
▲ Show 20 Lines • Show All 2,171 Lines • ▼ Show 20 Lines	void InnerLoopVectorizer::vectorizeInterleaveGroup(
VPTransformState &State, VPValue *Addr, ArrayRef<VPValue *> StoredValues,		VPTransformState &State, VPValue *Addr, ArrayRef<VPValue *> StoredValues,
VPValue *BlockInMask) {		VPValue *BlockInMask) {
Instruction *Instr = Group->getInsertPos();		Instruction *Instr = Group->getInsertPos();
const DataLayout &DL = Instr->getModule()->getDataLayout();		const DataLayout &DL = Instr->getModule()->getDataLayout();

// Prepare for the vector type of the interleaved load/store.		// Prepare for the vector type of the interleaved load/store.
Type *ScalarTy = getLoadStoreType(Instr);		Type *ScalarTy = getLoadStoreType(Instr);
unsigned InterleaveFactor = Group->getFactor();		unsigned InterleaveFactor = Group->getFactor();
assert(!VF.isScalable() && "scalable vectors not yet supported.");
auto *VecTy = VectorType::get(ScalarTy, VF * InterleaveFactor);		auto *VecTy = VectorType::get(ScalarTy, VF * InterleaveFactor);

// Prepare for the new pointers.		// Prepare for the new pointers.
SmallVector<Value *, 2> AddrParts;		SmallVector<Value *, 2> AddrParts;
unsigned Index = Group->getIndex(Instr);		unsigned Index = Group->getIndex(Instr);

// TODO: extend the masked interleaved-group support to reversed access.		// TODO: extend the masked interleaved-group support to reversed access.
assert((!BlockInMask || !Group->isReverse()) &&		assert((!BlockInMask || !Group->isReverse()) &&
"Reversed masked interleave-group not supported.");		"Reversed masked interleave-group not supported.");

		Value *Idx;
// If the group is reverse, adjust the index to refer to the last vector lane		// If the group is reverse, adjust the index to refer to the last vector lane
// instead of the first. We adjust the index from the first vector lane,		// instead of the first. We adjust the index from the first vector lane,
// rather than directly getting the pointer for lane VF - 1, because the		// rather than directly getting the pointer for lane VF - 1, because the
// pointer operand of the interleaved access is supposed to be uniform. For		// pointer operand of the interleaved access is supposed to be uniform. For
// uniform instructions, we're only required to generate a value for the		// uniform instructions, we're only required to generate a value for the
// first vector lane in each unroll iteration.		// first vector lane in each unroll iteration.
if (Group->isReverse())		if (Group->isReverse()) {
Index += (VF.getKnownMinValue() - 1) * Group->getFactor();		Value *RuntimeVF = getRuntimeVF(Builder, Builder.getInt32Ty(), VF);
		Idx = Builder.CreateSub(RuntimeVF, Builder.getInt32(1));
		Idx = Builder.CreateMul(Idx, Builder.getInt32(Group->getFactor()));
		Idx = Builder.CreateAdd(Idx, Builder.getInt32(Index));
		Idx = Builder.CreateNeg(Idx);
		} else
		Idx = Builder.getInt32(-Index);
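The index arithmetic above can be modelled outside the IR builder. The sketch below is illustrative only: `reverseGroupOffset` and `forwardGroupOffset` are hypothetical names, and `vscale` is passed in explicitly, whereas the patch obtains the runtime VF through getRuntimeVF.

```cpp
#include <cstdint>

// Hypothetical model of the Idx computation for a reversed group.
// knownMinVF * vscale corresponds to the runtime VF.
int32_t reverseGroupOffset(int32_t knownMinVF, int32_t vscale, int32_t factor,
                           int32_t index) {
  int32_t runtimeVF = knownMinVF * vscale;
  // Step back to member 0 of the last vector lane:
  // -((RuntimeVF - 1) * Factor + Index).
  return -((runtimeVF - 1) * factor + index);
}

// Non-reversed groups only step back to member 0 of the current lane.
int32_t forwardGroupOffset(int32_t index) { return -index; }
```

With `vscale` fixed at 1 this reduces to the left-hand (fixed-width) computation, `Index += (VF - 1) * Factor` followed by a GEP of `-Index`.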

for (unsigned Part = 0; Part < UF; Part++) {		for (unsigned Part = 0; Part < UF; Part++) {
Value *AddrPart = State.get(Addr, VPIteration(Part, 0));		Value *AddrPart = State.get(Addr, VPIteration(Part, 0));
State.setDebugLocFromInst(AddrPart);		State.setDebugLocFromInst(AddrPart);

// Notice current instruction could be any index. Need to adjust the address		// Notice current instruction could be any index. Need to adjust the address
// to the member of index 0.		// to the member of index 0.
//		//
// E.g. a = A[i+1]; // Member of index 1 (Current instruction)		// E.g. a = A[i+1]; // Member of index 1 (Current instruction)
// b = A[i]; // Member of index 0		// b = A[i]; // Member of index 0
// Current pointer is pointed to A[i+1], adjust it to A[i].		// Current pointer is pointed to A[i+1], adjust it to A[i].
//		//
// E.g. A[i+1] = a; // Member of index 1		// E.g. A[i+1] = a; // Member of index 1
// A[i] = b; // Member of index 0		// A[i] = b; // Member of index 0
// A[i+2] = c; // Member of index 2 (Current instruction)		// A[i+2] = c; // Member of index 2 (Current instruction)
// Current pointer is pointed to A[i+2], adjust it to A[i].		// Current pointer is pointed to A[i+2], adjust it to A[i].

bool InBounds = false;		bool InBounds = false;
if (auto *gep = dyn_cast<GetElementPtrInst>(AddrPart->stripPointerCasts()))		if (auto *gep = dyn_cast<GetElementPtrInst>(AddrPart->stripPointerCasts()))
InBounds = gep->isInBounds();		InBounds = gep->isInBounds();
AddrPart = Builder.CreateGEP(ScalarTy, AddrPart, Builder.getInt32(-Index));		AddrPart = Builder.CreateGEP(ScalarTy, AddrPart, Idx);
cast<GetElementPtrInst>(AddrPart)->setIsInBounds(InBounds);		cast<GetElementPtrInst>(AddrPart)->setIsInBounds(InBounds);

// Cast to the vector pointer type.		// Cast to the vector pointer type.
unsigned AddressSpace = AddrPart->getType()->getPointerAddressSpace();		unsigned AddressSpace = AddrPart->getType()->getPointerAddressSpace();
Type *PtrTy = VecTy->getPointerTo(AddressSpace);		Type *PtrTy = VecTy->getPointerTo(AddressSpace);
AddrParts.push_back(Builder.CreateBitCast(AddrPart, PtrTy));		AddrParts.push_back(Builder.CreateBitCast(AddrPart, PtrTy));
}		}

Show All 25 Lines	for (unsigned Part = 0; Part < UF; Part++) {
GroupMask = MaskForGaps		GroupMask = MaskForGaps
? Builder.CreateBinOp(Instruction::And, ShuffledMask,		? Builder.CreateBinOp(Instruction::And, ShuffledMask,
MaskForGaps)		MaskForGaps)
: ShuffledMask;		: ShuffledMask;
}		}
NewLoad =		NewLoad =
Builder.CreateMaskedLoad(VecTy, AddrParts[Part], Group->getAlign(),		Builder.CreateMaskedLoad(VecTy, AddrParts[Part], Group->getAlign(),
GroupMask, PoisonVec, "wide.masked.vec");		GroupMask, PoisonVec, "wide.masked.vec");
}		} else {
		// Check if we can create target specific interleaving load.
		if (TTI->hasInterleavedLoad(VecTy, AddrParts[Part], Group->getFactor(),
		luke (Not Done): Need to check that `Group->getFactor() == 2` here or that the call to CreateMaskedInterleavedLoad succeeds.
		mgabka (Done): My idea was that it would be up to the hasInterleavedLoad function to make sure it returns true only when Factor is 2, so I think no extra check is needed.
		false))
		NewLoad = Builder.CreateMaskedInterleavedLoad(
		Group->getFactor(), VecTy, AddrParts[Part], Group->getAlign());
else		else
NewLoad = Builder.CreateAlignedLoad(VecTy, AddrParts[Part],		NewLoad = Builder.CreateAlignedLoad(VecTy, AddrParts[Part],
Group->getAlign(), "wide.vec");		Group->getAlign(), "wide.vec");
		}
Group->addMetadata(NewLoad);		Group->addMetadata(NewLoad);
NewLoads.push_back(NewLoad);		NewLoads.push_back(NewLoad);
		reames (Done): Having this be only in the normal load path seems unlikely to be correct. Surely we must also handle masked loads as well?
		huntergr (Author, Done): This does handle masked loads -- 'NewLoad = Builder.CreateAligned....' is a standalone statement on the else with no opening brace. I've added a blank line to perhaps make that a little more obvious. Unless there's something else I've missed?
		reames (Done): Yeah, I got confused by the brace style in the code above.
}		}

// For each member in the group, shuffle out the appropriate data from the		// For each member in the group, shuffle out the appropriate data from the
// wide loads.		// wide loads.
unsigned J = 0;		unsigned J = 0;
for (unsigned I = 0; I < InterleaveFactor; ++I) {		for (unsigned I = 0; I < InterleaveFactor; ++I) {
Instruction *Member = Group->getMember(I);		Instruction *Member = Group->getMember(I);

// Skip the gaps in the group.		// Skip the gaps in the group.
if (!Member)		if (!Member)
continue;		continue;

auto StrideMask =		auto StrideMask =
createStrideMask(I, InterleaveFactor, VF.getKnownMinValue());		createStrideMask(I, InterleaveFactor, VF.getKnownMinValue());
for (unsigned Part = 0; Part < UF; Part++) {		for (unsigned Part = 0; Part < UF; Part++) {
Value *StridedVec = Builder.CreateShuffleVector(		Value *StridedVec;
NewLoads[Part], StrideMask, "strided.vec");		if (NewLoads[Part]->getType()->isAggregateType())
		StridedVec = Builder.CreateExtractValue(NewLoads[Part], I);
		else
		StridedVec = Builder.CreateShuffleVector(NewLoads[Part], StrideMask,
		"strided.vec");
		luke (Not Done): It's somehow possible to reach here with a scalable vector type if `TTI->hasInterleavedLoad` returns false. Can we check somewhere inside the vectorizer cost model that if `hasInterleavedLoad` is false then we rule out any recipe with an interleave group for a scalable VF?
		mgabka (Done): This is actually connected through the LV cost model: LoopVectorizationCostModel::getInterleaveGroupCost calls TTI.getInterleavedMemoryOpCost, which should return an invalid cost for factors other than 2.
		reames (Done): The interface here feels really awkward for fixed length vectors. We have to create this dummy struct type, construct it, destruct it, and we lose the ability to slice out the inactive lanes. I almost wonder if this code would be clearer without the helper function at all. With an explicit version based on scalable type here, we could do a simplified version of this loop with an early return and leave the fixed length codegen unchanged. I'd be tempted to try that and see if the overall code quality looked reasonable. You could also try a lambda which enumerates the active lanes (i.e. doing the shuffle or extract as required), and move the handling of the bitcast and reverse to a callback. This might be too much complexity though.

// If this member has different type, cast the result type.		// If this member has different type, cast the result type.
if (Member->getType() != ScalarTy) {		if (Member->getType() != ScalarTy) {
assert(!VF.isScalable() && "VF is assumed to be non scalable.");
VectorType *OtherVTy = VectorType::get(Member->getType(), VF);		VectorType *OtherVTy = VectorType::get(Member->getType(), VF);
StridedVec = createBitOrPointerCast(StridedVec, OtherVTy, DL);		StridedVec = createBitOrPointerCast(StridedVec, OtherVTy, DL);
}		}

if (Group->isReverse())		if (Group->isReverse())
StridedVec = Builder.CreateVectorReverse(StridedVec, "reverse");		StridedVec = Builder.CreateVectorReverse(StridedVec, "reverse");

State.set(VPDefs[J], StridedVec, Part);		State.set(VPDefs[J], StridedVec, Part);
}		}
++J;		++J;
		reames (Done): It looks like you're changing the handling for gaps in the deinterleave. This seems surprising and worth some discussion?
		huntergr (Author, Done): That was the result of a bit of overzealous cleanup on my part when removing some code from the original patch; I missed the 'continue'. Reverted.
}		}
return;		return;
}		}

// The sub vector type for current instruction.		// The sub vector type for current instruction.
auto *SubVT = VectorType::get(ScalarTy, VF);		auto *SubVT = VectorType::get(ScalarTy, VF);

// Vectorize the interleaved store group.		// Vectorize the interleaved store group.
Show All 27 Lines	for (unsigned i = 0; i < InterleaveFactor; i++) {
// If this member has different type, cast it to a unified type.		// If this member has different type, cast it to a unified type.

if (StoredVec->getType() != SubVT)		if (StoredVec->getType() != SubVT)
StoredVec = createBitOrPointerCast(StoredVec, SubVT, DL);		StoredVec = createBitOrPointerCast(StoredVec, SubVT, DL);

StoredVecs.push_back(StoredVec);		StoredVecs.push_back(StoredVec);
}		}

		// Check if we can create target specific interleaving store.
		if (TTI->hasInterleavedStore(StoredVecs, AddrParts[Part],
		luke (Not Done): Need to check `Group->getFactor() == 2` here too.
		Group->getFactor(), false)) {
		CallInst *Store = Builder.CreateMaskedInterleavedStore(
		Group->getFactor(), StoredVecs, AddrParts[Part], Group->getAlign());

		// create interleaved store
		Group->addMetadata(Store);
		continue;
		}

// Concatenate all vectors into a wide vector.		// Concatenate all vectors into a wide vector.
Value *WideVec = concatenateVectors(Builder, StoredVecs);		Value *WideVec = concatenateVectors(Builder, StoredVecs);

// Interleave the elements in the wide vector.		// Interleave the elements in the wide vector.
Value *IVec = Builder.CreateShuffleVector(		Value *IVec = Builder.CreateShuffleVector(
WideVec, createInterleaveMask(VF.getKnownMinValue(), InterleaveFactor),		WideVec, createInterleaveMask(VF.getKnownMinValue(), InterleaveFactor),
"interleaved.vec");		"interleaved.vec");

▲ Show 20 Lines • Show All 152 Lines • ▼ Show 20 Lines	if (Cost->requiresScalarEpilogue(VF)) {
R = Builder.CreateSelect(IsZero, Step, R);		R = Builder.CreateSelect(IsZero, Step, R);
}		}

VectorTripCount = Builder.CreateSub(TC, R, "n.vec");		VectorTripCount = Builder.CreateSub(TC, R, "n.vec");

return VectorTripCount;		return VectorTripCount;
}		}

Value *InnerLoopVectorizer::createBitOrPointerCast(Value *V, VectorType *DstVTy,		Value *InnerLoopVectorizer::createBitOrPointerCast(Value *V, VectorType *DstVTy,
		reames (Done): The changes to this function are NFC for fixed length vectors, and a generally useful scalable cleanup. Please separate and land this change without the need for further review. This applies only to the changes in this function so as to shrink the diff for future review.
const DataLayout &DL) {		const DataLayout &DL) {
// Verify that V is a vector type with same number of elements as DstVTy.		// Verify that V is a vector type with same number of elements as DstVTy.
auto *DstFVTy = cast<FixedVectorType>(DstVTy);		auto *DstFVTy = cast<VectorType>(DstVTy);
unsigned VF = DstFVTy->getNumElements();		auto VF = DstFVTy->getElementCount();
auto *SrcVecTy = cast<FixedVectorType>(V->getType());		auto *SrcVecTy = cast<VectorType>(V->getType());
assert((VF == SrcVecTy->getNumElements()) && "Vector dimensions do not match");		assert((VF == SrcVecTy->getElementCount()) &&
		"Vector dimensions do not match");
Type *SrcElemTy = SrcVecTy->getElementType();		Type *SrcElemTy = SrcVecTy->getElementType();
Type *DstElemTy = DstFVTy->getElementType();		Type *DstElemTy = DstFVTy->getElementType();
assert((DL.getTypeSizeInBits(SrcElemTy) == DL.getTypeSizeInBits(DstElemTy)) &&		assert((DL.getTypeSizeInBits(SrcElemTy) == DL.getTypeSizeInBits(DstElemTy)) &&
"Vector elements must have same size");		"Vector elements must have same size");

// Do a direct cast if element types are castable.		// Do a direct cast if element types are castable.
if (CastInst::isBitOrNoopPointerCastable(SrcElemTy, DstElemTy, DL)) {		if (CastInst::isBitOrNoopPointerCastable(SrcElemTy, DstElemTy, DL)) {
return Builder.CreateBitOrPointerCast(V, DstFVTy);		return Builder.CreateBitOrPointerCast(V, DstFVTy);
}		}
// V cannot be directly casted to desired vector type.		// V cannot be directly casted to desired vector type.
// May happen when V is a floating point vector but DstVTy is a vector of		// May happen when V is a floating point vector but DstVTy is a vector of
// pointers or vice-versa. Handle this using a two-step bitcast using an		// pointers or vice-versa. Handle this using a two-step bitcast using an
// intermediate Integer type for the bitcast i.e. Ptr <-> Int <-> Float.		// intermediate Integer type for the bitcast i.e. Ptr <-> Int <-> Float.
assert((DstElemTy->isPointerTy() != SrcElemTy->isPointerTy()) &&		assert((DstElemTy->isPointerTy() != SrcElemTy->isPointerTy()) &&
"Only one type should be a pointer type");		"Only one type should be a pointer type");
assert((DstElemTy->isFloatingPointTy() != SrcElemTy->isFloatingPointTy()) &&		assert((DstElemTy->isFloatingPointTy() != SrcElemTy->isFloatingPointTy()) &&
"Only one type should be a floating point type");		"Only one type should be a floating point type");
Type *IntTy =		Type *IntTy =
IntegerType::getIntNTy(V->getContext(), DL.getTypeSizeInBits(SrcElemTy));		IntegerType::getIntNTy(V->getContext(), DL.getTypeSizeInBits(SrcElemTy));
auto *VecIntTy = FixedVectorType::get(IntTy, VF);		auto *VecIntTy = VectorType::get(IntTy, VF);
Value *CastVal = Builder.CreateBitOrPointerCast(V, VecIntTy);		Value *CastVal = Builder.CreateBitOrPointerCast(V, VecIntTy);
return Builder.CreateBitOrPointerCast(CastVal, DstFVTy);		return Builder.CreateBitOrPointerCast(CastVal, DstFVTy);
}		}
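
The cast helper above now compares element counts rather than plain lane numbers, which is why the same assertion can serve fixed and scalable vectors. A minimal standalone sketch of that idea follows. `ElemCount`, `lanesAt` and `sameDimensions` are hypothetical stand-ins written for illustration and are not LLVM's API; the only assumption taken from LLVM is that an element count is a minimum lane count plus a scalable flag, and that two counts are equal only when both agree.

```cpp
#include <cassert>

// Hypothetical stand-in for an element count: a minimum lane count plus a
// flag saying whether the run-time count is that minimum times vscale.
struct ElemCount {
  unsigned MinLanes;
  bool Scalable;

  static ElemCount fixed(unsigned N) { return {N, false}; }
  static ElemCount scalable(unsigned N) { return {N, true}; }

  // Two counts match only if both the lane count and the scalable flag agree.
  bool operator==(const ElemCount &Other) const {
    return MinLanes == Other.MinLanes && Scalable == Other.Scalable;
  }

  // Number of lanes at run time for a given vscale (ignored for fixed counts).
  unsigned lanesAt(unsigned VScale) const {
    assert(VScale >= 1 && "vscale is a positive integer");
    return Scalable ? MinLanes * VScale : MinLanes;
  }
};

// Mirrors the shape of the "Vector dimensions do not match" check.
inline bool sameDimensions(ElemCount Src, ElemCount Dst) { return Src == Dst; }
```

Under this model `sameDimensions(ElemCount::fixed(4), ElemCount::scalable(4))` is false, so a fixed `<4 x i32>` would not be accepted where `<vscale x 4 x i32>` is expected, while fixed-versus-fixed comparisons behave exactly as a plain lane-count comparison would.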

void InnerLoopVectorizer::emitIterationCountCheck(BasicBlock *Bypass) {		void InnerLoopVectorizer::emitIterationCountCheck(BasicBlock *Bypass) {
Value *Count = getOrCreateTripCount(LoopVectorPreHeader);		Value *Count = getOrCreateTripCount(LoopVectorPreHeader);
// Reuse existing vector loop preheader for TC checks.		// Reuse existing vector loop preheader for TC checks.
// Note that new preheader block is generated for vector loop.		// Note that new preheader block is generated for vector loop.
[... 3,557 lines not shown ...]
return TTI.getAddressComputationCost(VectorTy) +
TTI.getGatherScatterOpCost(		TTI.getGatherScatterOpCost(
I->getOpcode(), VectorTy, Ptr, Legal->isMaskRequired(I), Alignment,		I->getOpcode(), VectorTy, Ptr, Legal->isMaskRequired(I), Alignment,
TargetTransformInfo::TCK_RecipThroughput, I);		TargetTransformInfo::TCK_RecipThroughput, I);
}		}

InstructionCost		InstructionCost
LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I,		LoopVectorizationCostModel::getInterleaveGroupCost(Instruction *I,
ElementCount VF) {		ElementCount VF) {
// TODO: Once we have support for interleaving with scalable vectors
// we can calculate the cost properly here.
if (VF.isScalable())
return InstructionCost::getInvalid();

Type *ValTy = getLoadStoreType(I);		Type *ValTy = getLoadStoreType(I);
auto *VectorTy = cast<VectorType>(ToVectorTy(ValTy, VF));		auto *VectorTy = cast<VectorType>(ToVectorTy(ValTy, VF));
unsigned AS = getLoadStoreAddressSpace(I);		unsigned AS = getLoadStoreAddressSpace(I);
enum TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;		enum TTI::TargetCostKind CostKind = TTI::TCK_RecipThroughput;

auto Group = getInterleavedAccessGroup(I);		auto Group = getInterleavedAccessGroup(I);
assert(Group && "Fail to get an interleaved access group.");		assert(Group && "Fail to get an interleaved access group.");

[... 4,088 lines not shown ...]

llvm/test/Transforms/LoopVectorize/sve-interleaved-accesses.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
				; RUN: opt -mtriple=aarch64-none-linux-gnu -S -passes=loop-vectorize,instcombine -force-vector-width=4 -force-vector-interleave=1 -enable-interleaved-mem-accesses=true -enable-sve-interleaved-mem-accesses=true -mattr=+sve -scalable-vectorization=on -runtime-memory-check-threshold=24 < %s | FileCheck %s
				reames: This should be in the AArch64 sub-tree, and probably precommitted. Depending on your confidence in the AArch64 code, you may want to separate that into its own review.

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"

				; Check vectorization on an interleaved load group of factor 2 and an interleaved
				; store group of factor 2.

				; int AB[1024];
				; int CD[1024];
				; void test_array_load2_store2(int C, int D) {
				; for (int i = 0; i < 1024; i+=2) {
				; int A = AB[i];
				; int B = AB[i+1];
				; CD[i] = A + C;
				; CD[i+1] = B * D;
				; }
				; }


				@AB = common global [1024 x i32] zeroinitializer, align 4
				@CD = common global [1024 x i32] zeroinitializer, align 4

				define void @test_array_load2_store2(i32 %C, i32 %D) #1 {
				; CHECK-LABEL: @test_array_load2_store2(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 512, [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nuw nsw i64 512, [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = shl nuw nsw i64 [[N_VEC]], 1
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[C:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[D:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = shl i64 [[INDEX]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [1024 x i32], ptr @AB, i64 0, i64 [[OFFSET_IDX]]
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP2]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP3]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP3]], 1
				; CHECK-NEXT: [[TMP6:%.*]] = or i64 [[OFFSET_IDX]], 1
				; CHECK-NEXT: [[TMP7:%.*]] = add nsw <vscale x 4 x i32> [[TMP4]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP8:%.*]] = mul nsw <vscale x 4 x i32> [[TMP5]], [[BROADCAST_SPLAT2]]
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds [1024 x i32], ptr @CD, i64 0, i64 [[TMP6]]
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds i32, ptr [[TMP9]], i64 -1
				; CHECK-NEXT: [[TMP11:%.*]] = call <vscale x 8 x i32> @llvm.experimental.vector.interleave2.nxv8i32(<vscale x 4 x i32> [[TMP7]], <vscale x 4 x i32> [[TMP8]])
				; CHECK-NEXT: store <vscale x 8 x i32> [[TMP11]], ptr [[TMP10]], align 4
				; CHECK-NEXT: [[TMP12:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP13:%.*]] = shl nuw nsw i64 [[TMP12]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP13]]
				; CHECK-NEXT: [[TMP14:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP14]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP0:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX0:%.*]] = getelementptr inbounds [1024 x i32], ptr @AB, i64 0, i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP:%.*]] = load i32, ptr [[ARRAYIDX0]], align 4
				; CHECK-NEXT: [[TMP1:%.*]] = or i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds [1024 x i32], ptr @AB, i64 0, i64 [[TMP1]]
				; CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr [[ARRAYIDX1]], align 4
				; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP]], [[C]]
				; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP2]], [[D]]
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds [1024 x i32], ptr @CD, i64 0, i64 [[INDVARS_IV]]
				; CHECK-NEXT: store i32 [[ADD]], ptr [[ARRAYIDX2]], align 4
				; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds [1024 x i32], ptr @CD, i64 0, i64 [[TMP1]]
				; CHECK-NEXT: store i32 [[MUL]], ptr [[ARRAYIDX3]], align 4
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[CMP:%.*]] = icmp slt i64 [[INDVARS_IV]], 1022
				; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_END]], !llvm.loop [[LOOP3:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx0 = getelementptr inbounds [1024 x i32], [1024 x i32]* @AB, i64 0, i64 %indvars.iv
				%tmp = load i32, i32* %arrayidx0, align 4
				%tmp1 = or i64 %indvars.iv, 1
				%arrayidx1 = getelementptr inbounds [1024 x i32], [1024 x i32]* @AB, i64 0, i64 %tmp1
				%tmp2 = load i32, i32* %arrayidx1, align 4
				%add = add nsw i32 %tmp, %C
				%mul = mul nsw i32 %tmp2, %D
				%arrayidx2 = getelementptr inbounds [1024 x i32], [1024 x i32]* @CD, i64 0, i64 %indvars.iv
				store i32 %add, i32* %arrayidx2, align 4
				%arrayidx3 = getelementptr inbounds [1024 x i32], [1024 x i32]* @CD, i64 0, i64 %tmp1
				store i32 %mul, i32* %arrayidx3, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp slt i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}
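
The CHECK lines for `@test_array_load2_store2` show one wide load, a `deinterleave2` into two streams, an add on one stream and a mul on the other, then an `interleave2` and one wide store. A standalone sketch of that data flow follows, assuming the usual reading of the two intrinsics (even lanes and odd lanes on the way in, alternating lanes on the way out). The function names are hypothetical models over `std::vector`, not LLVM calls, and `Lanes` stands in for the run-time value of vscale x 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Vec = std::vector<int32_t>;

// Model of deinterleave2: even-indexed lanes go to .first, odd to .second.
std::pair<Vec, Vec> deinterleave2(const Vec &Wide) {
  assert(Wide.size() % 2 == 0 && "wide vector must hold an even lane count");
  Vec Even, Odd;
  for (std::size_t I = 0; I < Wide.size(); I += 2) {
    Even.push_back(Wide[I]);
    Odd.push_back(Wide[I + 1]);
  }
  return {Even, Odd};
}

// Model of interleave2: the result is A[0], B[0], A[1], B[1], ...
Vec interleave2(const Vec &A, const Vec &B) {
  assert(A.size() == B.size() && "operands must have the same lane count");
  Vec Wide;
  for (std::size_t I = 0; I < A.size(); ++I) {
    Wide.push_back(A[I]);
    Wide.push_back(B[I]);
  }
  return Wide;
}

// One vector-body iteration: CD[i] = AB[i] + C and CD[i+1] = AB[i+1] * D
// for the 2 * Lanes consecutive elements starting at Base.
void vectorIteration(const Vec &AB, Vec &CD, std::size_t Base,
                     std::size_t Lanes, int32_t C, int32_t D) {
  Vec Wide(AB.begin() + Base, AB.begin() + Base + 2 * Lanes); // wide load
  std::pair<Vec, Vec> Parts = deinterleave2(Wide);
  for (std::size_t L = 0; L < Lanes; ++L) {
    Parts.first[L] += C;  // add on the even stream
    Parts.second[L] *= D; // mul on the odd stream
  }
  Vec Out = interleave2(Parts.first, Parts.second);
  for (std::size_t L = 0; L < Out.size(); ++L)
    CD[Base + L] = Out[L]; // wide store
}
```

Calling `vectorIteration(AB, CD, 0, 4, C, D)` on an eight-element `AB` writes the same values the scalar loop would write for `i = 0, 2, 4, 6`. The sketch uses small inputs in mind; it does not model `nsw` or overflow behaviour.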

				; Check vectorization on an interleaved load group of factor 2 with narrower types and an interleaved
				; store group of factor 2.

				; short AB[1024];
				; int CD[1024];
				; void test_array_load2_i16_store2(int C, int D) {
				; for (int i = 0; i < 1024; i+=2) {
				; short A = AB[i];
				; short B = AB[i+1];
				; CD[i] = A + C;
				; CD[i+1] = B * D;
				; }
				; }


				@AB_i16 = common global [1024 x i16] zeroinitializer, align 4

				define void @test_array_load2_i16_store2(i32 %C, i32 %D) #1 {
				; CHECK-LABEL: @test_array_load2_i16_store2(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 512, [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nuw nsw i64 512, [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = shl nuw nsw i64 [[N_VEC]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP3:%.*]] = shl <vscale x 4 x i64> [[TMP2]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP5:%.*]] = shl nuw nsw i64 [[TMP4]], 3
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP5]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[C:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT2:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[D:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT3:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT2]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP3]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [1024 x i16], ptr @AB_i16, i64 0, <vscale x 4 x i64> [[VEC_IND]]
				; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i16> @llvm.masked.gather.nxv4i16.nxv4p0(<vscale x 4 x ptr> [[TMP6]], i32 2, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i16> poison)
				; CHECK-NEXT: [[TMP7:%.*]] = or <vscale x 4 x i64> [[VEC_IND]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds [1024 x i16], ptr @AB_i16, i64 0, <vscale x 4 x i64> [[TMP7]]
				; CHECK-NEXT: [[WIDE_MASKED_GATHER1:%.*]] = call <vscale x 4 x i16> @llvm.masked.gather.nxv4i16.nxv4p0(<vscale x 4 x ptr> [[TMP8]], i32 2, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i16> poison)
				; CHECK-NEXT: [[TMP9:%.*]] = sext <vscale x 4 x i16> [[WIDE_MASKED_GATHER]] to <vscale x 4 x i32>
				; CHECK-NEXT: [[TMP10:%.*]] = add nsw <vscale x 4 x i32> [[BROADCAST_SPLAT]], [[TMP9]]
				; CHECK-NEXT: [[TMP11:%.*]] = sext <vscale x 4 x i16> [[WIDE_MASKED_GATHER1]] to <vscale x 4 x i32>
				; CHECK-NEXT: [[TMP12:%.*]] = mul nsw <vscale x 4 x i32> [[BROADCAST_SPLAT3]], [[TMP11]]
				; CHECK-NEXT: [[TMP13:%.*]] = extractelement <vscale x 4 x i64> [[TMP7]], i64 0
				; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds [1024 x i32], ptr @CD, i64 0, i64 [[TMP13]]
				; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i32, ptr [[TMP14]], i64 -1
				; CHECK-NEXT: [[TMP16:%.*]] = call <vscale x 8 x i32> @llvm.experimental.vector.interleave2.nxv8i32(<vscale x 4 x i32> [[TMP10]], <vscale x 4 x i32> [[TMP12]])
				; CHECK-NEXT: store <vscale x 8 x i32> [[TMP16]], ptr [[TMP15]], align 4
				; CHECK-NEXT: [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP18:%.*]] = shl nuw nsw i64 [[TMP17]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP18]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP19:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP4:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [1024 x i16], ptr @AB_i16, i64 0, i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP20:%.*]] = load i16, ptr [[ARRAYIDX]], align 4
				; CHECK-NEXT: [[TMP21:%.*]] = or i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds [1024 x i16], ptr @AB_i16, i64 0, i64 [[TMP21]]
				; CHECK-NEXT: [[TMP22:%.*]] = load i16, ptr [[ARRAYIDX2]], align 2
				; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[TMP20]] to i32
				; CHECK-NEXT: [[ADD3:%.*]] = add nsw i32 [[CONV]], [[C]]
				; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds [1024 x i32], ptr @CD, i64 0, i64 [[INDVARS_IV]]
				; CHECK-NEXT: store i32 [[ADD3]], ptr [[ARRAYIDX5]], align 4
				; CHECK-NEXT: [[CONV6:%.*]] = sext i16 [[TMP22]] to i32
				; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[CONV6]], [[D]]
				; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds [1024 x i32], ptr @CD, i64 0, i64 [[TMP21]]
				; CHECK-NEXT: store i32 [[MUL]], ptr [[ARRAYIDX9]], align 4
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[CMP:%.*]] = icmp ult i64 [[INDVARS_IV]], 1022
				; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_END]], !llvm.loop [[LOOP5:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds [1024 x i16], [1024 x i16]* @AB_i16, i64 0, i64 %indvars.iv
				%0 = load i16, i16* %arrayidx, align 2
				%1 = or i64 %indvars.iv, 1
				%arrayidx2 = getelementptr inbounds [1024 x i16], [1024 x i16]* @AB_i16, i64 0, i64 %1
				%2 = load i16, i16* %arrayidx2, align 2
				%conv = sext i16 %0 to i32
				%add3 = add nsw i32 %conv, %C
				%arrayidx5 = getelementptr inbounds [1024 x i32], [1024 x i32]* @CD, i64 0, i64 %indvars.iv
				store i32 %add3, i32* %arrayidx5, align 4
				%conv6 = sext i16 %2 to i32
				%mul = mul nsw i32 %conv6, %D
				%arrayidx9 = getelementptr inbounds [1024 x i32], [1024 x i32]* @CD, i64 0, i64 %1
				store i32 %mul, i32* %arrayidx9, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv, 1022
				br i1 %cmp, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}
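
The `vector.ph` blocks in these tests derive the vector trip count from `llvm.vscale`: the step is vscale shifted left by 2, `N_MOD_VF` is `512 urem step`, `N_VEC` is `512 - N_MOD_VF`, and `IND_END` is `N_VEC` shifted left by 1 because the source loop advances `i` by 2. The arithmetic can be sketched as plain integer code. `PreheaderValues` and `computePreheader` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Values computed in vector.ph for a loop with TripCount scalar iterations,
// MinVF lanes per unit of vscale, and an induction variable that advances
// by Stride per scalar iteration.
struct PreheaderValues {
  uint64_t Step;   // vscale * MinVF (a shl by 2 when MinVF == 4)
  uint64_t NModVF; // TripCount urem Step
  uint64_t NVec;   // TripCount - NModVF
  uint64_t IndEnd; // NVec * Stride (a shl by 1 when Stride == 2)
};

PreheaderValues computePreheader(uint64_t TripCount, uint64_t VScale,
                                 uint64_t MinVF, uint64_t Stride) {
  assert(VScale >= 1 && MinVF >= 1 && "step must be non-zero");
  PreheaderValues V;
  V.Step = VScale * MinVF;
  V.NModVF = TripCount % V.Step;
  V.NVec = TripCount - V.NModVF;
  V.IndEnd = V.NVec * Stride;
  return V;
}
```

With a trip count of 512 and vscale equal to 1 the remainder is 0, so `middle.block` branches straight to `for.end`. With a hypothetical vscale of 3 the step is 12, the remainder is 8, `N_VEC` is 504, and the scalar loop resumes at index 1008 for the last 8 iterations.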

				; Check vectorization on an interleaved load group of factor 2 and an interleaved
				; store group of factor 2 with narrower types.

				; int AB[1024];
				; short CD[1024];
				; void test_array_load2_store2_i16(int C, int D) {
				; for (int i = 0; i < 1024; i+=2) {
				; int A = AB[i];
				; int B = AB[i+1];
				; CD[i] = A + C;
				; CD[i+1] = B * D;
				; }
				; }


				@CD_i16 = dso_local local_unnamed_addr global [1024 x i16] zeroinitializer, align 2

				define void @test_array_load2_store2_i16(i32 noundef %C, i32 noundef %D) #1 {
				; CHECK-LABEL: @test_array_load2_store2_i16(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 512, [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nuw nsw i64 512, [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = shl nuw nsw i64 [[N_VEC]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP3:%.*]] = shl <vscale x 4 x i64> [[TMP2]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP4:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP5:%.*]] = shl nuw nsw i64 [[TMP4]], 3
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP5]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[C:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[D:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP3]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = shl i64 [[INDEX]], 1
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [1024 x i32], ptr @AB, i64 0, i64 [[OFFSET_IDX]]
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP6]], align 4
				; CHECK-NEXT: [[TMP7:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP8:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP7]], 0
				; CHECK-NEXT: [[TMP9:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP7]], 1
				; CHECK-NEXT: [[TMP10:%.*]] = or <vscale x 4 x i64> [[VEC_IND]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP11:%.*]] = add nsw <vscale x 4 x i32> [[TMP8]], [[BROADCAST_SPLAT]]
				; CHECK-NEXT: [[TMP12:%.*]] = trunc <vscale x 4 x i32> [[TMP11]] to <vscale x 4 x i16>
				; CHECK-NEXT: [[TMP13:%.*]] = getelementptr inbounds [1024 x i16], ptr @CD_i16, i64 0, <vscale x 4 x i64> [[VEC_IND]]
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i16.nxv4p0(<vscale x 4 x i16> [[TMP12]], <vscale x 4 x ptr> [[TMP13]], i32 2, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[TMP14:%.*]] = mul nsw <vscale x 4 x i32> [[TMP9]], [[BROADCAST_SPLAT2]]
				; CHECK-NEXT: [[TMP15:%.*]] = trunc <vscale x 4 x i32> [[TMP14]] to <vscale x 4 x i16>
				; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds [1024 x i16], ptr @CD_i16, i64 0, <vscale x 4 x i64> [[TMP10]]
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i16.nxv4p0(<vscale x 4 x i16> [[TMP15]], <vscale x 4 x ptr> [[TMP16]], i32 2, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP18:%.*]] = shl nuw nsw i64 [[TMP17]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP18]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP19:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP6:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds [1024 x i32], ptr @AB, i64 0, i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP20:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
				; CHECK-NEXT: [[TMP21:%.*]] = or i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds [1024 x i32], ptr @AB, i64 0, i64 [[TMP21]]
				; CHECK-NEXT: [[TMP22:%.*]] = load i32, ptr [[ARRAYIDX2]], align 4
				; CHECK-NEXT: [[ADD3:%.*]] = add nsw i32 [[TMP20]], [[C]]
				; CHECK-NEXT: [[CONV:%.*]] = trunc i32 [[ADD3]] to i16
				; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds [1024 x i16], ptr @CD_i16, i64 0, i64 [[INDVARS_IV]]
				; CHECK-NEXT: store i16 [[CONV]], ptr [[ARRAYIDX5]], align 2
				; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP22]], [[D]]
				; CHECK-NEXT: [[CONV6:%.*]] = trunc i32 [[MUL]] to i16
				; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds [1024 x i16], ptr @CD_i16, i64 0, i64 [[TMP21]]
				; CHECK-NEXT: store i16 [[CONV6]], ptr [[ARRAYIDX9]], align 2
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[CMP:%.*]] = icmp ult i64 [[INDVARS_IV]], 1022
				; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_END]], !llvm.loop [[LOOP7:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds [1024 x i32], [1024 x i32]* @AB, i64 0, i64 %indvars.iv
				%0 = load i32, i32* %arrayidx, align 4
				%1 = or i64 %indvars.iv, 1
				%arrayidx2 = getelementptr inbounds [1024 x i32], [1024 x i32]* @AB, i64 0, i64 %1
				%2 = load i32, i32* %arrayidx2, align 4
				%add3 = add nsw i32 %0, %C
				%conv = trunc i32 %add3 to i16
				%arrayidx5 = getelementptr inbounds [1024 x i16], [1024 x i16]* @CD_i16, i64 0, i64 %indvars.iv
				store i16 %conv, i16* %arrayidx5, align 2
				%mul = mul nsw i32 %2, %D
				%conv6 = trunc i32 %mul to i16
				%arrayidx9 = getelementptr inbounds [1024 x i16], [1024 x i16]* @CD_i16, i64 0, i64 %1
				store i16 %conv6, i16* %arrayidx9, align 2
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv, 1022
				br i1 %cmp, label %for.body, label %for.end

				for.end: ; preds = %for.body
				ret void
				}
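
Where the i16 side of a group is not turned into a wide access, the CHECK lines above fall back to `llvm.masked.gather` or `llvm.masked.scatter` driven by a vector induction variable. `VEC_IND` starts as the step vector shifted left by 1, the odd elements use `VEC_IND | 1`, and each iteration adds a splat of vscale shifted left by 3. A standalone sketch of those index vectors follows; the helper names are hypothetical and the lane count is assumed to be vscale x 4, as in the tests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using IdxVec = std::vector<uint64_t>;

// Initial VEC_IND: stepvector (0, 1, 2, ...) shifted left by 1, giving the
// even element indices 0, 2, 4, ... across VScale * 4 lanes.
IdxVec initialVecInd(uint64_t VScale) {
  assert(VScale >= 1 && "vscale is a positive integer");
  IdxVec Ind(VScale * 4);
  for (std::size_t Lane = 0; Lane < Ind.size(); ++Lane)
    Ind[Lane] = static_cast<uint64_t>(Lane) << 1;
  return Ind;
}

// VEC_IND_NEXT: add the splat of (vscale << 3) to every lane.
IdxVec nextVecInd(const IdxVec &Ind, uint64_t VScale) {
  IdxVec Next(Ind);
  for (uint64_t &I : Next)
    I += VScale << 3;
  return Next;
}

// Indices of the odd elements: VEC_IND | 1 in every lane.
IdxVec oddIndices(const IdxVec &Ind) {
  IdxVec Odd(Ind);
  for (uint64_t &I : Odd)
    I |= 1;
  return Odd;
}
```

The increment of vscale x 8 is twice the lane count, which matches a source loop that consumes two array elements per scalar iteration: after one vector iteration the even indices continue exactly where the previous block of lanes stopped.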

				; Check vectorization on an interleaved load group of factor 6.
				; There is no dedicated ldN/stN for this factor, so gathers are used instead.

				%struct.ST6 = type { i32, i32, i32, i32, i32, i32 }

				define i32 @test_struct_load6(%struct.ST6* %S) #1 {
				; CHECK-LABEL: @test_struct_load6(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nuw nsw i64 1024, [[N_MOD_VF]]
				; CHECK-NEXT: [[TMP2:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 2
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP4]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP2]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP16:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds [[STRUCT_ST6:%.*]], ptr [[S:%.*]], <vscale x 4 x i64> [[VEC_IND]], i32 0
				; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0(<vscale x 4 x ptr> [[TMP5]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], <vscale x 4 x i64> [[VEC_IND]], i32 1
				; CHECK-NEXT: [[WIDE_MASKED_GATHER1:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0(<vscale x 4 x ptr> [[TMP6]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], <vscale x 4 x i64> [[VEC_IND]], i32 2
				; CHECK-NEXT: [[WIDE_MASKED_GATHER2:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0(<vscale x 4 x ptr> [[TMP7]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], <vscale x 4 x i64> [[VEC_IND]], i32 3
				; CHECK-NEXT: [[WIDE_MASKED_GATHER3:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0(<vscale x 4 x ptr> [[TMP8]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], <vscale x 4 x i64> [[VEC_IND]], i32 4
				; CHECK-NEXT: [[WIDE_MASKED_GATHER4:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0(<vscale x 4 x ptr> [[TMP9]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], <vscale x 4 x i64> [[VEC_IND]], i32 5
				; CHECK-NEXT: [[WIDE_MASKED_GATHER5:%.*]] = call <vscale x 4 x i32> @llvm.masked.gather.nxv4i32.nxv4p0(<vscale x 4 x ptr> [[TMP10]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i32> poison)
				; CHECK-NEXT: [[TMP11:%.*]] = add <vscale x 4 x i32> [[WIDE_MASKED_GATHER]], [[VEC_PHI]]
				; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 4 x i32> [[TMP11]], [[WIDE_MASKED_GATHER2]]
				; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 4 x i32> [[WIDE_MASKED_GATHER1]], [[WIDE_MASKED_GATHER3]]
				; CHECK-NEXT: [[TMP14:%.*]] = add <vscale x 4 x i32> [[TMP13]], [[WIDE_MASKED_GATHER4]]
				; CHECK-NEXT: [[TMP15:%.*]] = add <vscale x 4 x i32> [[TMP14]], [[WIDE_MASKED_GATHER5]]
				; CHECK-NEXT: [[TMP16]] = sub <vscale x 4 x i32> [[TMP12]], [[TMP15]]
				; CHECK-NEXT: [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP18:%.*]] = shl nuw nsw i64 [[TMP17]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP18]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP19:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP8:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[TMP20:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP16]])
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP20]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[R_041:%.*]] = phi i32 [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[SUB14:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[X:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], i64 [[INDVARS_IV]], i32 0
				; CHECK-NEXT: [[TMP21:%.*]] = load i32, ptr [[X]], align 4
				; CHECK-NEXT: [[Y:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], i64 [[INDVARS_IV]], i32 1
				; CHECK-NEXT: [[TMP22:%.*]] = load i32, ptr [[Y]], align 4
				; CHECK-NEXT: [[Z:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], i64 [[INDVARS_IV]], i32 2
				; CHECK-NEXT: [[TMP23:%.*]] = load i32, ptr [[Z]], align 4
				; CHECK-NEXT: [[W:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], i64 [[INDVARS_IV]], i32 3
				; CHECK-NEXT: [[TMP24:%.*]] = load i32, ptr [[W]], align 4
				; CHECK-NEXT: [[A:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], i64 [[INDVARS_IV]], i32 4
				; CHECK-NEXT: [[TMP25:%.*]] = load i32, ptr [[A]], align 4
				; CHECK-NEXT: [[B:%.*]] = getelementptr inbounds [[STRUCT_ST6]], ptr [[S]], i64 [[INDVARS_IV]], i32 5
				; CHECK-NEXT: [[TMP26:%.*]] = load i32, ptr [[B]], align 4
				; CHECK-NEXT: [[DOTNEG36:%.*]] = add i32 [[TMP21]], [[R_041]]
				; CHECK-NEXT: [[TMP27:%.*]] = add i32 [[DOTNEG36]], [[TMP23]]
				; CHECK-NEXT: [[TMP28:%.*]] = add i32 [[TMP22]], [[TMP24]]
				; CHECK-NEXT: [[TMP29:%.*]] = add i32 [[TMP28]], [[TMP25]]
				; CHECK-NEXT: [[TMP30:%.*]] = add i32 [[TMP29]], [[TMP26]]
				; CHECK-NEXT: [[SUB14]] = sub i32 [[TMP27]], [[TMP30]]
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[EXITCOND_NOT:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 1024
				; CHECK-NEXT: br i1 [[EXITCOND_NOT]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP9:![0-9]+]]
				; CHECK: for.cond.cleanup:
				; CHECK-NEXT: [[SUB14_LCSSA:%.*]] = phi i32 [ [[SUB14]], [[FOR_BODY]] ], [ [[TMP20]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: ret i32 [[SUB14_LCSSA]]
				;
				entry:
				br label %for.body

				for.body: ; preds = %entry, %for.body
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%r.041 = phi i32 [ 0, %entry ], [ %sub14, %for.body ]
				%x = getelementptr inbounds %struct.ST6, %struct.ST6* %S, i64 %indvars.iv, i32 0
				%0 = load i32, i32* %x, align 4
				%y = getelementptr inbounds %struct.ST6, %struct.ST6* %S, i64 %indvars.iv, i32 1
				%1 = load i32, i32* %y, align 4
				%z = getelementptr inbounds %struct.ST6, %struct.ST6* %S, i64 %indvars.iv, i32 2
				%2 = load i32, i32* %z, align 4
				%w = getelementptr inbounds %struct.ST6, %struct.ST6* %S, i64 %indvars.iv, i32 3
				%3 = load i32, i32* %w, align 4
				%a = getelementptr inbounds %struct.ST6, %struct.ST6* %S, i64 %indvars.iv, i32 4
				%4 = load i32, i32* %a, align 4
				%b = getelementptr inbounds %struct.ST6, %struct.ST6* %S, i64 %indvars.iv, i32 5
				%5 = load i32, i32* %b, align 4
				%.neg36 = add i32 %0, %r.041
				%6 = add i32 %.neg36, %2
				%7 = add i32 %1, %3
				%8 = add i32 %7, %4
				%9 = add i32 %8, %5
				%sub14 = sub i32 %6, %9
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond.not = icmp eq i64 %indvars.iv.next, 1024
				br i1 %exitcond.not, label %for.cond.cleanup, label %for.body

				for.cond.cleanup: ; preds = %for.body
				%sub14.lcssa = phi i32 [ %sub14, %for.body ]
				ret i32 %sub14.lcssa
				}


				; Check vectorization on a reverse interleaved load group of factor 2 and
				; a reverse interleaved store group of factor 2.

				; struct ST2 {
				; int x;
				; int y;
				; };
				;
				; void test_reversed_load2_store2(struct ST2 *A, struct ST2 *B) {
				; for (int i = 1023; i >= 0; i--) {
				; int a = A[i].x + i; // interleaved load of index 0
				; int b = A[i].y - i; // interleaved load of index 1
				; B[i].x = a; // interleaved store of index 0
				; B[i].y = b; // interleaved store of index 1
				; }
				; }
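				;
				; Sketch (not part of the test input) of how the CHECK lines below address
				; a reversed factor-2 group, assuming VF = 4 * vscale. Lane 0 of a vector
				; iteration handles the highest index i, so the wide access has to start at
				; the lowest member touched, (VF - 1) structs below A[i]:
				;
				;   int *first = &A[i].x - 2 * (VF - 1);   // i.e. 2 - 8 * vscale i32 elements
				;   wide       = load 2 * VF ints from first;
				;   {x, y}     = deinterleave2(wide);      // memory order: i-VF+1 .. i
				;   x = reverse(x); y = reverse(y);        // loop order:   i .. i-VF+1
				;
				; The store side mirrors this: both results are reversed, interleave2'd and
				; stored as one wide vector starting at &B[i].x - 2 * (VF - 1).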


				%struct.ST2 = type { i32, i32 }

				define void @test_reversed_load2_store2(%struct.ST2* noalias nocapture readonly %A, %struct.ST2* noalias nocapture %B) #1 {
				; CHECK-LABEL: @test_reversed_load2_store2(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nuw nsw i64 1024, [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = add nsw i64 [[N_MOD_VF]], -1
				; CHECK-NEXT: [[TMP2:%.*]] = call <vscale x 4 x i32> @llvm.experimental.stepvector.nxv4i32()
				; CHECK-NEXT: [[INDUCTION:%.*]] = sub <vscale x 4 x i32> shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 1023, i64 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer), [[TMP2]]
				; CHECK-NEXT: [[TMP3:%.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: [[DOTNEG:%.*]] = mul nsw i32 [[TMP3]], -4
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[DOTNEG]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[DOTSPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i32> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = sub i64 1023, [[INDEX]]
				; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds [[STRUCT_ST2:%.*]], ptr [[A:%.*]], i64 [[OFFSET_IDX]], i32 0
				; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: [[TMP6:%.*]] = shl nuw nsw i32 [[TMP5]], 3
				; CHECK-NEXT: [[TMP7:%.*]] = sub nsw i32 2, [[TMP6]]
				; CHECK-NEXT: [[TMP8:%.*]] = sext i32 [[TMP7]] to i64
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, ptr [[TMP4]], i64 [[TMP8]]
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP9]], align 4
				; CHECK-NEXT: [[TMP10:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP11:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP10]], 0
				; CHECK-NEXT: [[REVERSE:%.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.reverse.nxv4i32(<vscale x 4 x i32> [[TMP11]])
				; CHECK-NEXT: [[TMP12:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP10]], 1
				; CHECK-NEXT: [[REVERSE1:%.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.reverse.nxv4i32(<vscale x 4 x i32> [[TMP12]])
				; CHECK-NEXT: [[TMP13:%.*]] = add nsw <vscale x 4 x i32> [[REVERSE]], [[VEC_IND]]
				; CHECK-NEXT: [[TMP14:%.*]] = sub nsw <vscale x 4 x i32> [[REVERSE1]], [[VEC_IND]]
				; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds [[STRUCT_ST2]], ptr [[B:%.*]], i64 [[OFFSET_IDX]], i32 1
				; CHECK-NEXT: [[TMP16:%.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: [[TMP17:%.*]] = shl nuw nsw i32 [[TMP16]], 3
				; CHECK-NEXT: [[TMP18:%.*]] = sub nsw i32 1, [[TMP17]]
				; CHECK-NEXT: [[TMP19:%.*]] = sext i32 [[TMP18]] to i64
				; CHECK-NEXT: [[TMP20:%.*]] = getelementptr inbounds i32, ptr [[TMP15]], i64 [[TMP19]]
				; CHECK-NEXT: [[REVERSE2:%.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.reverse.nxv4i32(<vscale x 4 x i32> [[TMP13]])
				; CHECK-NEXT: [[REVERSE3:%.*]] = call <vscale x 4 x i32> @llvm.experimental.vector.reverse.nxv4i32(<vscale x 4 x i32> [[TMP14]])
				; CHECK-NEXT: [[TMP21:%.*]] = call <vscale x 8 x i32> @llvm.experimental.vector.interleave2.nxv8i32(<vscale x 4 x i32> [[REVERSE2]], <vscale x 4 x i32> [[REVERSE3]])
				; CHECK-NEXT: store <vscale x 8 x i32> [[TMP21]], ptr [[TMP20]], align 4
				; CHECK-NEXT: [[TMP22:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP23:%.*]] = shl nuw nsw i64 [[TMP22]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP23]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i32> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP24:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP24]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP10:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 1023, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.cond.cleanup:
				; CHECK-NEXT: ret void
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[X:%.*]] = getelementptr inbounds [[STRUCT_ST2]], ptr [[A]], i64 [[INDVARS_IV]], i32 0
				; CHECK-NEXT: [[TMP:%.*]] = load i32, ptr [[X]], align 4
				; CHECK-NEXT: [[TMP1:%.*]] = trunc i64 [[INDVARS_IV]] to i32
				; CHECK-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP]], [[TMP1]]
				; CHECK-NEXT: [[Y:%.*]] = getelementptr inbounds [[STRUCT_ST2]], ptr [[A]], i64 [[INDVARS_IV]], i32 1
				; CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr [[Y]], align 4
				; CHECK-NEXT: [[SUB:%.*]] = sub nsw i32 [[TMP2]], [[TMP1]]
				; CHECK-NEXT: [[X5:%.*]] = getelementptr inbounds [[STRUCT_ST2]], ptr [[B]], i64 [[INDVARS_IV]], i32 0
				; CHECK-NEXT: store i32 [[ADD]], ptr [[X5]], align 4
				; CHECK-NEXT: [[Y8:%.*]] = getelementptr inbounds [[STRUCT_ST2]], ptr [[B]], i64 [[INDVARS_IV]], i32 1
				; CHECK-NEXT: store i32 [[SUB]], ptr [[Y8]], align 4
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nsw i64 [[INDVARS_IV]], -1
				; CHECK-NEXT: [[CMP:%.*]] = icmp sgt i64 [[INDVARS_IV]], 0
				; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP11:![0-9]+]]
				;
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 1023, %entry ], [ %indvars.iv.next, %for.body ]
				%x = getelementptr inbounds %struct.ST2, %struct.ST2* %A, i64 %indvars.iv, i32 0
				%tmp = load i32, i32* %x, align 4
				%tmp1 = trunc i64 %indvars.iv to i32
				%add = add nsw i32 %tmp, %tmp1
				%y = getelementptr inbounds %struct.ST2, %struct.ST2* %A, i64 %indvars.iv, i32 1
				%tmp2 = load i32, i32* %y, align 4
				%sub = sub nsw i32 %tmp2, %tmp1
				%x5 = getelementptr inbounds %struct.ST2, %struct.ST2* %B, i64 %indvars.iv, i32 0
				store i32 %add, i32* %x5, align 4
				%y8 = getelementptr inbounds %struct.ST2, %struct.ST2* %B, i64 %indvars.iv, i32 1
				store i32 %sub, i32* %y8, align 4
				%indvars.iv.next = add nsw i64 %indvars.iv, -1
				%cmp = icmp sgt i64 %indvars.iv, 0
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; Check vectorization on an interleaved load group of factor 2 with 1 gap
				; (missing the load of odd elements). Because the vectorized loop would
				; speculatively access memory out-of-bounds, we must execute at least one
				; iteration of the scalar loop.

				; void even_load_static_tc(int *A, int *B) {
				; for (unsigned i = 0; i < 1024; i+=2)
				; B[i/2] = A[i] * 2;
				; }
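				;
				; Worked example (illustrative only, assuming vscale = 1 so VF = 4) of the
				; trip-count split checked below. The loop runs 512 iterations:
				;
				;   n.mod.vf = 512 % 4                        = 0
				;   tail     = n.mod.vf == 0 ? VF : n.mod.vf  = 4     // never 0
				;   n.vec    = 512 - tail                     = 508
				;   ind.end  = n.vec * 2                      = 1016  // first i of the scalar loop
				;
				; Under that assumption the vector body covers i = 0 .. 1014 and the scalar
				; loop runs the remaining iterations (i = 1016 .. 1022), which is why
				; middle.block branches to scalar.ph unconditionally.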


				define void @even_load_static_tc(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) #1 {
				; CHECK-LABEL: @even_load_static_tc(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 512, [[TMP1]]
				; CHECK-NEXT: [[TMP2:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: [[TMP3:%.*]] = select i1 [[TMP2]], i64 [[TMP1]], i64 [[N_MOD_VF]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nuw nsw i64 512, [[TMP3]]
				; CHECK-NEXT: [[IND_END:%.*]] = shl nuw nsw i64 [[N_VEC]], 1
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = shl i64 [[INDEX]], 1
				; CHECK-NEXT: [[TMP4:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[OFFSET_IDX]]
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP4]], align 4
				; CHECK-NEXT: [[TMP5:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP6:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP5]], 0
				; CHECK-NEXT: [[TMP7:%.*]] = shl nsw <vscale x 4 x i32> [[TMP6]], shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 1, i64 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP8:%.*]] = and i64 [[INDEX]], 9223372036854775804
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, ptr [[B:%.*]], i64 [[TMP8]]
				; CHECK-NEXT: store <vscale x 4 x i32> [[TMP7]], ptr [[TMP9]], align 4
				; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP11:%.*]] = shl nuw nsw i64 [[TMP10]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP11]]
				; CHECK-NEXT: [[TMP12:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP12]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP12:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.cond.cleanup:
				; CHECK-NEXT: ret void
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
				; CHECK-NEXT: [[MUL:%.*]] = shl nsw i32 [[TMP]], 1
				; CHECK-NEXT: [[TMP1:%.*]] = lshr exact i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[TMP1]]
				; CHECK-NEXT: store i32 [[MUL]], ptr [[ARRAYIDX2]], align 4
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[CMP:%.*]] = icmp ult i64 [[INDVARS_IV]], 1022
				; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_COND_CLEANUP:%.*]], !llvm.loop [[LOOP13:![0-9]+]]
				;
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%tmp = load i32, i32* %arrayidx, align 4
				%mul = shl nsw i32 %tmp, 1
				%tmp1 = lshr exact i64 %indvars.iv, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %B, i64 %tmp1
				store i32 %mul, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; Check vectorization on an interleaved load group of factor 2 with 1 gap
				; (missing the load of odd elements). Because the vectorized loop would
				; speculatively access memory out-of-bounds, we must execute at least one
				; iteration of the scalar loop.

				; void even_load_dynamic_tc(int *A, int *B, unsigned N) {
				; for (unsigned i = 0; i < N; i+=2)
				; B[i/2] = A[i] * 2;
				; }
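				;
				; Worked example (illustrative only, assuming N = 100 and vscale = 1 so
				; VF = 4) of the runtime trip-count computation checked below:
				;
				;   tc - 1   = (umax(N, 2) - 1) >> 1            = 49
				;   tc       = 49 + 1                           = 50
				;   49 < VF is false, so the vector loop is entered
				;   n.mod.vf = 50 % 4                           = 2
				;   tail     = n.mod.vf == 0 ? VF : n.mod.vf    = 2
				;   n.vec    = 50 - 2                           = 48
				;   ind.end  = n.vec * 2                        = 96
				;
				; Under those assumptions the scalar loop still handles i = 96 and i = 98.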


				define void @even_load_dynamic_tc(i32* noalias nocapture readonly %A, i32* noalias nocapture %B, i64 %N) #1 {
				; CHECK-LABEL: @even_load_dynamic_tc(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[UMAX:%.*]] = call i64 @llvm.umax.i64(i64 [[N:%.*]], i64 2)
				; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[UMAX]], -1
				; CHECK-NEXT: [[TMP1:%.*]] = lshr i64 [[TMP0]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw i64 [[TMP2]], 2
				; CHECK-NEXT: [[MIN_ITERS_CHECK_NOT_NOT:%.*]] = icmp ult i64 [[TMP1]], [[TMP3]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK_NOT_NOT]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP4:%.*]] = add nuw i64 [[TMP1]], 1
				; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP6:%.*]] = shl nuw nsw i64 [[TMP5]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP4]], [[TMP6]]
				; CHECK-NEXT: [[TMP7:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: [[TMP8:%.*]] = select i1 [[TMP7]], i64 [[TMP6]], i64 [[N_MOD_VF]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP4]], [[TMP8]]
				; CHECK-NEXT: [[IND_END:%.*]] = shl i64 [[N_VEC]], 1
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = shl i64 [[INDEX]], 1
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[OFFSET_IDX]]
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP9]], align 4
				; CHECK-NEXT: [[TMP10:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP11:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP10]], 0
				; CHECK-NEXT: [[TMP12:%.*]] = shl nsw <vscale x 4 x i32> [[TMP11]], shufflevector (<vscale x 4 x i32> insertelement (<vscale x 4 x i32> poison, i32 1, i64 0), <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP13:%.*]] = and i64 [[INDEX]], 9223372036854775804
				; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[B:%.*]], i64 [[TMP13]]
				; CHECK-NEXT: store <vscale x 4 x i32> [[TMP12]], ptr [[TMP14]], align 4
				; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP16:%.*]] = shl nuw nsw i64 [[TMP15]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP16]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP14:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.cond.cleanup:
				; CHECK-NEXT: ret void
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
				; CHECK-NEXT: [[MUL:%.*]] = shl nsw i32 [[TMP]], 1
				; CHECK-NEXT: [[TMP1:%.*]] = lshr exact i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[TMP1]]
				; CHECK-NEXT: store i32 [[MUL]], ptr [[ARRAYIDX2]], align 4
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[CMP:%.*]] = icmp ult i64 [[INDVARS_IV_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_COND_CLEANUP:%.*]], !llvm.loop [[LOOP15:![0-9]+]]
				;
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%tmp = load i32, i32* %arrayidx, align 4
				%mul = shl nsw i32 %tmp, 1
				%tmp1 = lshr exact i64 %indvars.iv, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %B, i64 %tmp1
				store i32 %mul, i32* %arrayidx2, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, %N
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}

				; Check vectorization on a reverse interleaved load group of factor 2 with 1
				; gap and a reverse interleaved store group of factor 2. The interleaved load
				; group should be removed since it has a gap and is reverse.

				; struct pair {
				; int x;
				; int y;
				; };
				;
				; void load_gap_reverse(struct pair *P1, struct pair *P2, int X) {
				; for (int i = 1023; i >= 0; i--) {
				; int a = X + i;
				; int b = A[i].y - i;
				; B[i].x = a;
				; B[i].y = b;
				; }
				; }

				; TODO: This still generates gather/scatter; it looks like an st2 could be used instead of the scatter.
				%pair = type { i64, i64 }
				define void @load_gap_reverse(%pair* noalias nocapture readonly %P1, %pair* noalias nocapture readonly %P2, i64 %X) #1 {
				; CHECK-LABEL: @load_gap_reverse(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nuw nsw i64 1024, [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = add nsw i64 [[N_MOD_VF]], -1
				; CHECK-NEXT: [[TMP2:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[INDUCTION:%.*]] = sub <vscale x 4 x i64> shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1023, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer), [[TMP2]]
				; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[DOTNEG:%.*]] = mul nsw i64 [[TMP3]], -4
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[DOTNEG]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[X:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP4:%.*]] = add nsw <vscale x 4 x i64> [[BROADCAST_SPLAT]], [[VEC_IND]]
				; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds [[PAIR:%.*]], ptr [[P1:%.*]], <vscale x 4 x i64> [[VEC_IND]], i32 0
				; CHECK-NEXT: [[TMP6:%.*]] = getelementptr inbounds [[PAIR]], ptr [[P2:%.*]], <vscale x 4 x i64> [[VEC_IND]], i32 1
				; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i64> @llvm.masked.gather.nxv4i64.nxv4p0(<vscale x 4 x ptr> [[TMP6]], i32 8, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i64> poison)
				; CHECK-NEXT: [[TMP7:%.*]] = sub nsw <vscale x 4 x i64> [[WIDE_MASKED_GATHER]], [[VEC_IND]]
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i64.nxv4p0(<vscale x 4 x i64> [[TMP4]], <vscale x 4 x ptr> [[TMP5]], i32 8, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i64.nxv4p0(<vscale x 4 x i64> [[TMP7]], <vscale x 4 x ptr> [[TMP6]], i32 8, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP9:%.*]] = shl nuw nsw i64 [[TMP8]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP9]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP10:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP10]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP16:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_EXIT:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 1023, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[I_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[I_NEXT]] = add nsw i64 [[I]], -1
				; CHECK-NEXT: [[COND:%.*]] = icmp sgt i64 [[I]], 0
				; CHECK-NEXT: br i1 [[COND]], label [[FOR_BODY]], label [[FOR_EXIT]], !llvm.loop [[LOOP17:![0-9]+]]
				; CHECK: for.exit:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ 1023, %entry ], [ %i.next, %for.body ]
				%0 = add nsw i64 %X, %i
				%1 = getelementptr inbounds %pair, %pair* %P1, i64 %i, i32 0
				%2 = getelementptr inbounds %pair, %pair* %P2, i64 %i, i32 1
				%3 = load i64, i64* %2, align 8
				%4 = sub nsw i64 %3, %i
				store i64 %0, i64* %1, align 8
				store i64 %4, i64* %2, align 8
				%i.next = add nsw i64 %i, -1
				%cond = icmp sgt i64 %i, 0
				br i1 %cond, label %for.body, label %for.exit

				for.exit:
				ret void
				}

				; Check vectorization on interleaved access groups identified from mixed
				; loads/stores.
				; void mixed_load2_store2(int *A, int *B) {
				; for (unsigned i = 0; i < 1024; i+=2) {
				; B[i] = A[i] * A[i+1];
				; B[i+1] = A[i] + A[i+1];
				; }
				; }
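				;
				; Note (an informal reading of the CHECK lines below, not an extra test):
				; A[i] is loaded twice in the source, so the vector body calls
				; deinterleave2 twice on the same wide load, once for the mul and once for
				; the add. The store address is formed as (B - 1) + (2 * index | 1); since
				; 2 * index is even, the OR adds 1, so this is &B[2 * index], the first
				; member of the store group.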


				define void @mixed_load2_store2(i32* noalias nocapture readonly %A, i32* noalias nocapture %B) #1 {
				; CHECK-LABEL: @mixed_load2_store2(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 512, [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nuw nsw i64 512, [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = shl nuw nsw i64 [[N_VEC]], 1
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = shl i64 [[INDEX]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], i64 [[OFFSET_IDX]]
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP2]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP3]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP3]], 1
				; CHECK-NEXT: [[TMP6:%.*]] = or i64 [[OFFSET_IDX]], 1
				; CHECK-NEXT: [[TMP7:%.*]] = mul nsw <vscale x 4 x i32> [[TMP5]], [[TMP4]]
				; CHECK-NEXT: [[TMP8:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP9:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP8]], 0
				; CHECK-NEXT: [[TMP10:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP8]], 1
				; CHECK-NEXT: [[TMP11:%.*]] = add nsw <vscale x 4 x i32> [[TMP10]], [[TMP9]]
				; CHECK-NEXT: [[TMP12:%.*]] = getelementptr i32, ptr [[B:%.*]], i64 -1
				; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i32, ptr [[TMP12]], i64 [[TMP6]]
				; CHECK-NEXT: [[TMP14:%.*]] = call <vscale x 8 x i32> @llvm.experimental.vector.interleave2.nxv8i32(<vscale x 4 x i32> [[TMP7]], <vscale x 4 x i32> [[TMP11]])
				; CHECK-NEXT: store <vscale x 8 x i32> [[TMP14]], ptr [[TMP13]], align 4
				; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP16:%.*]] = shl nuw nsw i64 [[TMP15]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP16]]
				; CHECK-NEXT: [[TMP17:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP17]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP18:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.cond.cleanup:
				; CHECK-NEXT: ret void
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: [[TMP:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
				; CHECK-NEXT: [[TMP1:%.*]] = or i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[TMP1]]
				; CHECK-NEXT: [[TMP2:%.*]] = load i32, ptr [[ARRAYIDX2]], align 4
				; CHECK-NEXT: [[MUL:%.*]] = mul nsw i32 [[TMP2]], [[TMP]]
				; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[INDVARS_IV]]
				; CHECK-NEXT: store i32 [[MUL]], ptr [[ARRAYIDX4]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = load i32, ptr [[ARRAYIDX]], align 4
				; CHECK-NEXT: [[ADD10:%.*]] = add nsw i32 [[TMP2]], [[TMP3]]
				; CHECK-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[TMP1]]
				; CHECK-NEXT: store i32 [[ADD10]], ptr [[ARRAYIDX13]], align 4
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 2
				; CHECK-NEXT: [[CMP:%.*]] = icmp ult i64 [[INDVARS_IV]], 1022
				; CHECK-NEXT: br i1 [[CMP]], label [[FOR_BODY]], label [[FOR_COND_CLEANUP]], !llvm.loop [[LOOP19:![0-9]+]]
				;
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				ret void

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
				%tmp = load i32, i32* %arrayidx, align 4
				%tmp1 = or i64 %indvars.iv, 1
				%arrayidx2 = getelementptr inbounds i32, i32* %A, i64 %tmp1
				%tmp2 = load i32, i32* %arrayidx2, align 4
				%mul = mul nsw i32 %tmp2, %tmp
				%arrayidx4 = getelementptr inbounds i32, i32* %B, i64 %indvars.iv
				store i32 %mul, i32* %arrayidx4, align 4
				%tmp3 = load i32, i32* %arrayidx, align 4
				%tmp4 = load i32, i32* %arrayidx2, align 4
				%add10 = add nsw i32 %tmp4, %tmp3
				%arrayidx13 = getelementptr inbounds i32, i32* %B, i64 %tmp1
				store i32 %add10, i32* %arrayidx13, align 4
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 2
				%cmp = icmp ult i64 %indvars.iv.next, 1024
				br i1 %cmp, label %for.body, label %for.cond.cleanup
				}


				; Check vectorization on interleaved access groups with members having different
				; kinds of type.

				; struct IntFloat {
				; int a;
				; float b;
				; };
				;
				; int SA;
				; float SB;
				;
				; void int_float_struct(struct IntFloat *A) {
				; int SumA;
				; float SumB;
				; for (unsigned i = 0; i < 1024; i++) {
				; SumA += A[i].a;
				; SumB += A[i].b;
				; }
				; SA = SumA;
				; SB = SumB;
				; }


				%struct.IntFloat = type { i32, float }

				@SA = common global i32 0, align 4
				@SB = common global float 0.000000e+00, align 4

				define void @int_float_struct(%struct.IntFloat* nocapture readonly %p) #0 {
				; CHECK-LABEL: @int_float_struct(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 1024, [[TMP1]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nuw nsw i64 1024, [[N_MOD_VF]]
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x float> [ insertelement (<vscale x 4 x float> zeroinitializer, float undef, i32 0), [[VECTOR_PH]] ], [ [[TMP8:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_PHI1:%.*]] = phi <vscale x 4 x i32> [ insertelement (<vscale x 4 x i32> zeroinitializer, i32 undef, i32 0), [[VECTOR_PH]] ], [ [[TMP7:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds [[STRUCT_INTFLOAT:%.*]], ptr [[P:%.*]], i64 [[INDEX]], i32 0
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP2]], align 4
				; CHECK-NEXT: [[TMP3:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP4:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP3]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP3]], 1
				; CHECK-NEXT: [[TMP6:%.*]] = bitcast <vscale x 4 x i32> [[TMP5]] to <vscale x 4 x float>
				; CHECK-NEXT: [[TMP7]] = add <vscale x 4 x i32> [[TMP4]], [[VEC_PHI1]]
				; CHECK-NEXT: [[TMP8]] = fadd fast <vscale x 4 x float> [[VEC_PHI]], [[TMP6]]
				; CHECK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP10:%.*]] = shl nuw nsw i64 [[TMP9]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP10]]
				; CHECK-NEXT: [[TMP11:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP11]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP20:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[TMP12:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP7]])
				; CHECK-NEXT: [[TMP13:%.*]] = call fast float @llvm.vector.reduce.fadd.nxv4f32(float -0.000000e+00, <vscale x 4 x float> [[TMP8]])
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_COND_CLEANUP:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi float [ [[TMP13]], [[MIDDLE_BLOCK]] ], [ undef, [[ENTRY]] ]
				; CHECK-NEXT: [[BC_MERGE_RDX2:%.*]] = phi i32 [ [[TMP12]], [[MIDDLE_BLOCK]] ], [ undef, [[ENTRY]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.cond.cleanup:
				; CHECK-NEXT: [[ADD_LCSSA:%.*]] = phi i32 [ [[ADD:%.*]], [[FOR_BODY]] ], [ [[TMP12]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: [[ADD3_LCSSA:%.*]] = phi float [ [[ADD3:%.*]], [[FOR_BODY]] ], [ [[TMP13]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: store i32 [[ADD_LCSSA]], ptr @SA, align 4
				; CHECK-NEXT: store float [[ADD3_LCSSA]], ptr @SB, align 4
				; CHECK-NEXT: ret void
				; CHECK: for.body:
				; CHECK-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[SUMB_014:%.*]] = phi float [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ], [ [[ADD3]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[SUMA_013:%.*]] = phi i32 [ [[BC_MERGE_RDX2]], [[SCALAR_PH]] ], [ [[ADD]], [[FOR_BODY]] ]
				; CHECK-NEXT: [[A:%.*]] = getelementptr inbounds [[STRUCT_INTFLOAT]], ptr [[P]], i64 [[INDVARS_IV]], i32 0
				; CHECK-NEXT: [[TMP:%.*]] = load i32, ptr [[A]], align 4
				; CHECK-NEXT: [[ADD]] = add nsw i32 [[TMP]], [[SUMA_013]]
				; CHECK-NEXT: [[B:%.*]] = getelementptr inbounds [[STRUCT_INTFLOAT]], ptr [[P]], i64 [[INDVARS_IV]], i32 1
				; CHECK-NEXT: [[TMP1:%.*]] = load float, ptr [[B]], align 4
				; CHECK-NEXT: [[ADD3]] = fadd fast float [[SUMB_014]], [[TMP1]]
				; CHECK-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[INDVARS_IV_NEXT]], 1024
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[FOR_COND_CLEANUP]], label [[FOR_BODY]], !llvm.loop [[LOOP21:![0-9]+]]
				;
				entry:
				br label %for.body

				for.cond.cleanup: ; preds = %for.body
				store i32 %add, i32* @SA, align 4
				store float %add3, float* @SB, align 4
				ret void

				for.body: ; preds = %for.body, %entry
				%indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
				%SumB.014 = phi float [ undef, %entry ], [ %add3, %for.body ]
				%SumA.013 = phi i32 [ undef, %entry ], [ %add, %for.body ]
				%a = getelementptr inbounds %struct.IntFloat, %struct.IntFloat* %p, i64 %indvars.iv, i32 0
				%tmp = load i32, i32* %a, align 4
				%add = add nsw i32 %tmp, %SumA.013
				%b = getelementptr inbounds %struct.IntFloat, %struct.IntFloat* %p, i64 %indvars.iv, i32 1
				%tmp1 = load float, float* %b, align 4
				%add3 = fadd fast float %SumB.014, %tmp1
				%indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
				%exitcond = icmp eq i64 %indvars.iv.next, 1024
				br i1 %exitcond, label %for.cond.cleanup, label %for.body
				}

				; Check vectorization of interleaved access groups in the presence of
				; dependences (PR27626). The following tests check that we don't reorder
				; dependent loads and stores when generating code for interleaved access
				; groups. Stores should be scalarized because the required code motion would
				; break dependences, and the remaining interleaved load groups should have
				; gaps.

				; PR27626_0: Ensure a strided store is not moved after a dependent (zero
				; distance) strided load.

				; void PR27626_0(struct pair *p, int z, int n) {
				; for (int i = 0; i < n; i++) {
				; p[i].x = z;
				; p[i].y = p[i].x;
				; }
				; }


				%pair.i32 = type { i32, i32 }
				; TODO: SVE uses a masked scatter for the p[i+1].y store, whereas NEON uses a
				; scalarised store, which is what this test is actually meant to check.
				define void @PR27626_0(%pair.i32 *%p, i32 %z, i64 %n) #1 {
				; CHECK-LABEL: @PR27626_0(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SMAX:%.*]] = call i64 @llvm.smax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[MIN_ITERS_CHECK_NOT:%.*]] = icmp ugt i64 [[SMAX]], [[TMP1]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK_NOT]], label [[VECTOR_PH:%.*]], label [[SCALAR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw i64 [[TMP2]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[SMAX]], [[TMP3]]
				; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = select i1 [[TMP4]], i64 [[TMP3]], i64 [[N_MOD_VF]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nsw i64 [[SMAX]], [[TMP5]]
				; CHECK-NEXT: [[TMP6:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP8:%.*]] = shl nuw nsw i64 [[TMP7]], 2
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP8]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[Z:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP6]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds [[PAIR_I32:%.*]], ptr [[P:%.*]], <vscale x 4 x i64> [[VEC_IND]], i32 0
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], <vscale x 4 x i64> [[VEC_IND]], i32 1
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x ptr> [[TMP9]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[TMP11:%.*]] = extractelement <vscale x 4 x ptr> [[TMP9]], i64 0
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP11]], align 4
				; CHECK-NEXT: [[TMP12:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP13:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP12]], 0
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP13]], <vscale x 4 x ptr> [[TMP10]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[TMP14:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP15:%.*]] = shl nuw nsw i64 [[TMP14]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP15]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP22:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I:%.*]] = phi i64 [ [[I_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[P_I_X:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[I]], i32 0
				; CHECK-NEXT: [[P_I_Y:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[I]], i32 1
				; CHECK-NEXT: store i32 [[Z]], ptr [[P_I_X]], align 4
				; CHECK-NEXT: store i32 [[Z]], ptr [[P_I_Y]], align 4
				; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 1
				; CHECK-NEXT: [[COND:%.*]] = icmp slt i64 [[I_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[COND]], label [[FOR_BODY]], label [[FOR_END:%.*]], !llvm.loop [[LOOP23:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
				%p_i.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 0
				%p_i.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 1
				store i32 %z, i32* %p_i.x, align 4
				%0 = load i32, i32* %p_i.x, align 4
				store i32 %0, i32 *%p_i.y, align 4
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				ret void
				}

				; PR27626_1: Ensure a strided load is not moved before a dependent (zero
				; distance) strided store.

				; void PR27626_1(struct pair *p, int n) {
				; int s = 0;
				; for (int i = 0; i < n; i++) {
				; p[i].y = p[i].x;
				; s += p[i].y;
				; }
				; }


				; TODO: SVE uses a masked scatter for the p[i+1].y store, whereas NEON uses a
				; scalarised store, which is what this test is actually meant to check.
				define i32 @PR27626_1(%pair.i32 *%p, i64 %n) #1 {
				; CHECK-LABEL: @PR27626_1(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SMAX:%.*]] = call i64 @llvm.smax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[MIN_ITERS_CHECK_NOT:%.*]] = icmp ugt i64 [[SMAX]], [[TMP1]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK_NOT]], label [[VECTOR_PH:%.*]], label [[SCALAR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw i64 [[TMP2]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[SMAX]], [[TMP3]]
				; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = select i1 [[TMP4]], i64 [[TMP3]], i64 [[N_MOD_VF]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nsw i64 [[SMAX]], [[TMP5]]
				; CHECK-NEXT: [[TMP6:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP8:%.*]] = shl nuw nsw i64 [[TMP7]], 2
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP8]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP6]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP16:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds [[PAIR_I32:%.*]], ptr [[P:%.*]], i64 [[INDEX]], i32 0
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], <vscale x 4 x i64> [[VEC_IND]], i32 1
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP9]], align 4
				; CHECK-NEXT: [[TMP11:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP12:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP11]], 0
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP12]], <vscale x 4 x ptr> [[TMP10]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[TMP13:%.*]] = extractelement <vscale x 4 x ptr> [[TMP10]], i64 0
				; CHECK-NEXT: [[UNMASKEDLOAD1:%.*]] = load <vscale x 8 x i32>, ptr [[TMP13]], align 4
				; CHECK-NEXT: [[TMP14:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD1]])
				; CHECK-NEXT: [[TMP15:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP14]], 0
				; CHECK-NEXT: [[TMP16]] = add <vscale x 4 x i32> [[TMP15]], [[VEC_PHI]]
				; CHECK-NEXT: [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP18:%.*]] = shl nuw nsw i64 [[TMP17]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP18]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP19:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP24:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[TMP20:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP16]])
				; CHECK-NEXT: br label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP20]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I:%.*]] = phi i64 [ [[I_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[S:%.*]] = phi i32 [ [[TMP22:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[P_I_X:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[I]], i32 0
				; CHECK-NEXT: [[P_I_Y:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[I]], i32 1
				; CHECK-NEXT: [[TMP21:%.*]] = load i32, ptr [[P_I_X]], align 4
				; CHECK-NEXT: store i32 [[TMP21]], ptr [[P_I_Y]], align 4
				; CHECK-NEXT: [[TMP22]] = add nsw i32 [[TMP21]], [[S]]
				; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 1
				; CHECK-NEXT: [[COND:%.*]] = icmp slt i64 [[I_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[COND]], label [[FOR_BODY]], label [[FOR_END:%.*]], !llvm.loop [[LOOP25:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret i32 [[TMP22]]
				;
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
				%s = phi i32 [ %2, %for.body ], [ 0, %entry ]
				%p_i.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 0
				%p_i.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 1
				%0 = load i32, i32* %p_i.x, align 4
				store i32 %0, i32* %p_i.y, align 4
				%1 = load i32, i32* %p_i.y, align 4
				%2 = add nsw i32 %1, %s
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				%3 = phi i32 [ %2, %for.body ]
				ret i32 %3
				}

				; PR27626_2: Ensure a strided store is not moved after a dependent (negative
				; distance) strided load.

				; void PR27626_2(struct pair *p, int z, int n) {
				; for (int i = 0; i < n; i++) {
				; p[i].x = z;
				; p[i].y = p[i - 1].x;
				; }
				; }


				; TODO: SVE uses a masked scatter for the p[i+1].y store, whereas NEON uses a
				; scalarised store, which is what this test is actually meant to check.
				define void @PR27626_2(%pair.i32 *%p, i64 %n, i32 %z) #1 {
				; CHECK-LABEL: @PR27626_2(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SMAX:%.*]] = call i64 @llvm.smax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[MIN_ITERS_CHECK_NOT:%.*]] = icmp ugt i64 [[SMAX]], [[TMP1]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK_NOT]], label [[VECTOR_PH:%.*]], label [[SCALAR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw i64 [[TMP2]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[SMAX]], [[TMP3]]
				; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = select i1 [[TMP4]], i64 [[TMP3]], i64 [[N_MOD_VF]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nsw i64 [[SMAX]], [[TMP5]]
				; CHECK-NEXT: [[TMP6:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP8:%.*]] = shl nuw nsw i64 [[TMP7]], 2
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP8]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[Z:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP6]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds [[PAIR_I32:%.*]], ptr [[P:%.*]], <vscale x 4 x i64> [[VEC_IND]], i32 0
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 -1, i32 0
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], <vscale x 4 x i64> [[VEC_IND]], i32 1
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x ptr> [[TMP9]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP10]], align 4
				; CHECK-NEXT: [[TMP12:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP13:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP12]], 0
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP13]], <vscale x 4 x ptr> [[TMP11]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[TMP14:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP15:%.*]] = shl nuw nsw i64 [[TMP14]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP15]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP16:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP16]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP26:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: br label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I:%.*]] = phi i64 [ [[I_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[P_I_X:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[I]], i32 0
				; CHECK-NEXT: [[P_I_MINUS_1_X:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 -1, i32 0
				; CHECK-NEXT: [[P_I_Y:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[I]], i32 1
				; CHECK-NEXT: store i32 [[Z]], ptr [[P_I_X]], align 4
				; CHECK-NEXT: [[TMP17:%.*]] = load i32, ptr [[P_I_MINUS_1_X]], align 4
				; CHECK-NEXT: store i32 [[TMP17]], ptr [[P_I_Y]], align 4
				; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 1
				; CHECK-NEXT: [[COND:%.*]] = icmp slt i64 [[I_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[COND]], label [[FOR_BODY]], label [[FOR_END:%.*]], !llvm.loop [[LOOP27:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
				%i_minus_1 = add nuw nsw i64 %i, -1
				%p_i.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 0
				%p_i_minus_1.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i_minus_1, i32 0
				%p_i.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 1
				store i32 %z, i32* %p_i.x, align 4
				%0 = load i32, i32* %p_i_minus_1.x, align 4
				store i32 %0, i32 *%p_i.y, align 4
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				ret void
				}

				; PR27626_3: Ensure a strided load is not moved before a dependent (negative
				; distance) strided store.

				; void PR27626_3(struct pair *p, int z, int n) {
				; for (int i = 0; i < n; i++) {
				; p[i + 1].y = p[i].x;
				; s += p[i].y;
				; }
				; }


				; TODO: SVE uses a masked scatter for the p[i+1].y store, whereas NEON uses a
				; scalarised store, which is what this test is actually meant to check.
				define i32 @PR27626_3(%pair.i32 *%p, i64 %n, i32 %z) #1 {
				; CHECK-LABEL: @PR27626_3(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SMAX:%.*]] = call i64 @llvm.smax.i64(i64 [[N:%.*]], i64 1)
				; CHECK-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP1:%.*]] = shl nuw nsw i64 [[TMP0]], 2
				; CHECK-NEXT: [[MIN_ITERS_CHECK_NOT:%.*]] = icmp ugt i64 [[SMAX]], [[TMP1]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK_NOT]], label [[VECTOR_PH:%.*]], label [[SCALAR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw i64 [[TMP2]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[SMAX]], [[TMP3]]
				; CHECK-NEXT: [[TMP4:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: [[TMP5:%.*]] = select i1 [[TMP4]], i64 [[TMP3]], i64 [[N_MOD_VF]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nsw i64 [[SMAX]], [[TMP5]]
				; CHECK-NEXT: [[TMP6:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP7:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP8:%.*]] = shl nuw nsw i64 [[TMP7]], 2
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP8]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP6]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 4 x i32> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP17:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP9:%.*]] = add nuw nsw <vscale x 4 x i64> [[VEC_IND]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP10:%.*]] = getelementptr inbounds [[PAIR_I32:%.*]], ptr [[P:%.*]], i64 [[INDEX]], i32 0
				; CHECK-NEXT: [[TMP11:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[INDEX]], i32 1
				; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], <vscale x 4 x i64> [[TMP9]], i32 1
				; CHECK-NEXT: [[UNMASKEDLOAD:%.*]] = load <vscale x 8 x i32>, ptr [[TMP10]], align 4
				; CHECK-NEXT: [[TMP13:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD]])
				; CHECK-NEXT: [[TMP14:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP13]], 0
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[TMP14]], <vscale x 4 x ptr> [[TMP12]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[UNMASKEDLOAD1:%.*]] = load <vscale x 8 x i32>, ptr [[TMP11]], align 4
				; CHECK-NEXT: [[TMP15:%.*]] = call { <vscale x 4 x i32>, <vscale x 4 x i32> } @llvm.experimental.vector.deinterleave2.nxv8i32(<vscale x 8 x i32> [[UNMASKEDLOAD1]])
				; CHECK-NEXT: [[TMP16:%.*]] = extractvalue { <vscale x 4 x i32>, <vscale x 4 x i32> } [[TMP15]], 0
				; CHECK-NEXT: [[TMP17]] = add <vscale x 4 x i32> [[TMP16]], [[VEC_PHI]]
				; CHECK-NEXT: [[TMP18:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP19:%.*]] = shl nuw nsw i64 [[TMP18]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP19]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP20:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP20]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP28:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[TMP21:%.*]] = call i32 @llvm.vector.reduce.add.nxv4i32(<vscale x 4 x i32> [[TMP17]])
				; CHECK-NEXT: br label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[N_VEC]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: [[BC_MERGE_RDX:%.*]] = phi i32 [ [[TMP21]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I:%.*]] = phi i64 [ [[I_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[S:%.*]] = phi i32 [ [[TMP24:%.*]], [[FOR_BODY]] ], [ [[BC_MERGE_RDX]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[I_PLUS_1:%.*]] = add nuw nsw i64 [[I]], 1
				; CHECK-NEXT: [[P_I_X:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[I]], i32 0
				; CHECK-NEXT: [[P_I_Y:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[I]], i32 1
				; CHECK-NEXT: [[P_I_PLUS_1_Y:%.*]] = getelementptr inbounds [[PAIR_I32]], ptr [[P]], i64 [[I_PLUS_1]], i32 1
				; CHECK-NEXT: [[TMP22:%.*]] = load i32, ptr [[P_I_X]], align 4
				; CHECK-NEXT: store i32 [[TMP22]], ptr [[P_I_PLUS_1_Y]], align 4
				; CHECK-NEXT: [[TMP23:%.*]] = load i32, ptr [[P_I_Y]], align 4
				; CHECK-NEXT: [[TMP24]] = add nsw i32 [[TMP23]], [[S]]
				; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 1
				; CHECK-NEXT: [[COND:%.*]] = icmp slt i64 [[I_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[COND]], label [[FOR_BODY]], label [[FOR_END:%.*]], !llvm.loop [[LOOP29:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret i32 [[TMP24]]
				;
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
				%s = phi i32 [ %2, %for.body ], [ 0, %entry ]
				%i_plus_1 = add nuw nsw i64 %i, 1
				%p_i.x = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 0
				%p_i.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i, i32 1
				%p_i_plus_1.y = getelementptr inbounds %pair.i32, %pair.i32* %p, i64 %i_plus_1, i32 1
				%0 = load i32, i32* %p_i.x, align 4
				store i32 %0, i32* %p_i_plus_1.y, align 4
				%1 = load i32, i32* %p_i.y, align 4
				%2 = add nsw i32 %1, %s
				%i.next = add nuw nsw i64 %i, 1
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				%3 = phi i32 [ %2, %for.body ]
				ret i32 %3
				}

				; PR27626_4: Ensure we form an interleaved group for strided stores in the
				; presence of a write-after-write dependence. We create a group for
				; (2) and (3) while excluding (1).

				; void PR27626_4(int *a, int x, int y, int z, int n) {
				; for (int i = 0; i < n; i += 2) {
				; a[i] = x; // (1)
				; a[i] = y; // (2)
				; a[i + 1] = z; // (3)
				; }
				; }

				; TODO: SVE uses a masked scatter, but NEON uses a scalarised store for a[i] = x, which is fine.
				define void @PR27626_4(i32 *%a, i32 %x, i32 %y, i32 %z, i64 %n) #1 {
				; CHECK-LABEL: @PR27626_4(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SMAX:%.*]] = call i64 @llvm.smax.i64(i64 [[N:%.*]], i64 2)
				; CHECK-NEXT: [[TMP0:%.*]] = add nsw i64 [[SMAX]], -1
				; CHECK-NEXT: [[TMP1:%.*]] = lshr i64 [[TMP0]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
				; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 2
				; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], [[TMP4]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP6:%.*]] = shl nuw nsw i64 [[TMP5]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], [[TMP6]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nsw i64 [[TMP2]], [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = shl i64 [[N_VEC]], 1
				; CHECK-NEXT: [[TMP7:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP8:%.*]] = shl <vscale x 4 x i64> [[TMP7]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP9:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP10:%.*]] = shl nuw nsw i64 [[TMP9]], 3
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP10]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[X:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[Y:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[Z:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP8]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[OFFSET_IDX:%.*]] = shl i64 [[INDEX]], 1
				; CHECK-NEXT: [[TMP11:%.*]] = or i64 [[OFFSET_IDX]], 1
				; CHECK-NEXT: [[TMP12:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], <vscale x 4 x i64> [[VEC_IND]]
				; CHECK-NEXT: [[TMP13:%.*]] = getelementptr i32, ptr [[A]], i64 -1
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x ptr> [[TMP12]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[TMP14:%.*]] = getelementptr i32, ptr [[TMP13]], i64 [[TMP11]]
				; CHECK-NEXT: [[TMP15:%.*]] = call <vscale x 8 x i32> @llvm.experimental.vector.interleave2.nxv8i32(<vscale x 4 x i32> [[BROADCAST_SPLAT2]], <vscale x 4 x i32> [[BROADCAST_SPLAT4]])
				; CHECK-NEXT: store <vscale x 8 x i32> [[TMP15]], ptr [[TMP14]], align 4
				; CHECK-NEXT: [[TMP16:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP17:%.*]] = shl nuw nsw i64 [[TMP16]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP17]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP18:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP18]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP30:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 0, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I:%.*]] = phi i64 [ [[I_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[I_PLUS_1:%.*]] = or i64 [[I]], 1
				; CHECK-NEXT: [[A_I:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[I]]
				; CHECK-NEXT: [[A_I_PLUS_1:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[I_PLUS_1]]
				; CHECK-NEXT: store i32 [[Y]], ptr [[A_I]], align 4
				; CHECK-NEXT: store i32 [[Z]], ptr [[A_I_PLUS_1]], align 4
				; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 2
				; CHECK-NEXT: [[COND:%.*]] = icmp slt i64 [[I_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[COND]], label [[FOR_BODY]], label [[FOR_END]], !llvm.loop [[LOOP31:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ %i.next, %for.body ], [ 0, %entry ]
				%i_plus_1 = add i64 %i, 1
				%a_i = getelementptr inbounds i32, i32* %a, i64 %i
				%a_i_plus_1 = getelementptr inbounds i32, i32* %a, i64 %i_plus_1
				store i32 %x, i32* %a_i, align 4
				store i32 %y, i32* %a_i, align 4
				store i32 %z, i32* %a_i_plus_1, align 4
				%i.next = add nuw nsw i64 %i, 2
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				ret void
				}

				; PR27626_5: Ensure we do not form an interleaved group for strided stores in
				; the presence of a write-after-write dependence.

				; void PR27626_5(int *a, int x, int y, int z, int n) {
				; for (int i = 3; i < n; i += 2) {
				; a[i - 1] = x;
				; a[i - 3] = y;
				; a[i] = z;
				; }
				; }


				; TODO: Uses a masked scatter, but this test checks that interleaving is not used.
				define void @PR27626_5(i32 *%a, i32 %x, i32 %y, i32 %z, i64 %n) #1 {
				; CHECK-LABEL: @PR27626_5(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[SMAX:%.*]] = call i64 @llvm.smax.i64(i64 [[N:%.*]], i64 5)
				; CHECK-NEXT: [[TMP0:%.*]] = add nsw i64 [[SMAX]], -4
				; CHECK-NEXT: [[TMP1:%.*]] = lshr i64 [[TMP0]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = add nuw nsw i64 [[TMP1]], 1
				; CHECK-NEXT: [[TMP3:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP4:%.*]] = shl nuw nsw i64 [[TMP3]], 2
				; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP2]], [[TMP4]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP5:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP6:%.*]] = shl nuw nsw i64 [[TMP5]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP2]], [[TMP6]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub nsw i64 [[TMP2]], [[N_MOD_VF]]
				; CHECK-NEXT: [[TMP7:%.*]] = shl i64 [[N_VEC]], 1
				; CHECK-NEXT: [[IND_END:%.*]] = add i64 [[TMP7]], 3
				; CHECK-NEXT: [[TMP8:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP9:%.*]] = shl <vscale x 4 x i64> [[TMP8]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[INDUCTION:%.*]] = add <vscale x 4 x i64> [[TMP9]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 3, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP10:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP11:%.*]] = shl nuw nsw i64 [[TMP10]], 3
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP11]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[X:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT1:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[Y:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT2:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT1]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: [[BROADCAST_SPLATINSERT3:%.*]] = insertelement <vscale x 4 x i32> poison, i32 [[Z:%.*]], i64 0
				; CHECK-NEXT: [[BROADCAST_SPLAT4:%.*]] = shufflevector <vscale x 4 x i32> [[BROADCAST_SPLATINSERT3]], <vscale x 4 x i32> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[INDUCTION]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP12:%.*]] = add <vscale x 4 x i64> [[VEC_IND]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 -1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP13:%.*]] = add <vscale x 4 x i64> [[VEC_IND]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 -3, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds i32, ptr [[A:%.*]], <vscale x 4 x i64> [[VEC_IND]]
				; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds i32, ptr [[A]], <vscale x 4 x i64> [[TMP12]]
				; CHECK-NEXT: [[TMP16:%.*]] = getelementptr inbounds i32, ptr [[A]], <vscale x 4 x i64> [[TMP13]]
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[BROADCAST_SPLAT]], <vscale x 4 x ptr> [[TMP15]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[BROADCAST_SPLAT2]], <vscale x 4 x ptr> [[TMP16]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: call void @llvm.masked.scatter.nxv4i32.nxv4p0(<vscale x 4 x i32> [[BROADCAST_SPLAT4]], <vscale x 4 x ptr> [[TMP14]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer))
				; CHECK-NEXT: [[TMP17:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP18:%.*]] = shl nuw nsw i64 [[TMP17]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP18]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP19:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP19]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP32:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: br i1 [[CMP_N]], label [[FOR_END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ [[IND_END]], [[MIDDLE_BLOCK]] ], [ 3, [[ENTRY:%.*]] ]
				; CHECK-NEXT: br label [[FOR_BODY:%.*]]
				; CHECK: for.body:
				; CHECK-NEXT: [[I:%.*]] = phi i64 [ [[I_NEXT:%.*]], [[FOR_BODY]] ], [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ]
				; CHECK-NEXT: [[I_MINUS_1:%.*]] = add i64 [[I]], -1
				; CHECK-NEXT: [[I_MINUS_3:%.*]] = add i64 [[I]], -3
				; CHECK-NEXT: [[A_I:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[I]]
				; CHECK-NEXT: [[A_I_MINUS_1:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[I_MINUS_1]]
				; CHECK-NEXT: [[A_I_MINUS_3:%.*]] = getelementptr inbounds i32, ptr [[A]], i64 [[I_MINUS_3]]
				; CHECK-NEXT: store i32 [[X]], ptr [[A_I_MINUS_1]], align 4
				; CHECK-NEXT: store i32 [[Y]], ptr [[A_I_MINUS_3]], align 4
				; CHECK-NEXT: store i32 [[Z]], ptr [[A_I]], align 4
				; CHECK-NEXT: [[I_NEXT]] = add nuw nsw i64 [[I]], 2
				; CHECK-NEXT: [[COND:%.*]] = icmp slt i64 [[I_NEXT]], [[N]]
				; CHECK-NEXT: br i1 [[COND]], label [[FOR_BODY]], label [[FOR_END]], !llvm.loop [[LOOP33:![0-9]+]]
				; CHECK: for.end:
				; CHECK-NEXT: ret void
				;
				entry:
				br label %for.body

				for.body:
				%i = phi i64 [ %i.next, %for.body ], [ 3, %entry ]
				%i_minus_1 = sub i64 %i, 1
				%i_minus_3 = sub i64 %i_minus_1, 2
				%a_i = getelementptr inbounds i32, i32* %a, i64 %i
				%a_i_minus_1 = getelementptr inbounds i32, i32* %a, i64 %i_minus_1
				%a_i_minus_3 = getelementptr inbounds i32, i32* %a, i64 %i_minus_3
				store i32 %x, i32* %a_i_minus_1, align 4
				store i32 %y, i32* %a_i_minus_3, align 4
				store i32 %z, i32* %a_i, align 4
				%i.next = add nuw nsw i64 %i, 2
				%cond = icmp slt i64 %i.next, %n
				br i1 %cond, label %for.body, label %for.end

				for.end:
				ret void
				}

				; PR34743: Ensure that a cast which needs to sink after a load that belongs to
				; an interleaved group, indeed gets sunk.

				; void PR34743(short *a, int *b, int n) {
				; for (int i = 0, iv = 0; iv < n; i++, iv += 2) {
				; b[i] = a[iv] * a[iv+1] * a[iv+2];
				; }
				; }


				define void @PR34743(i16* %a, i32* %b, i64 %n) #1 {
				; CHECK-LABEL: @PR34743(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[DOTPRE:%.*]] = load i16, ptr [[A:%.*]], align 2
				; CHECK-NEXT: [[TMP0:%.*]] = lshr i64 [[N:%.*]], 1
				; CHECK-NEXT: [[TMP1:%.*]] = add nuw i64 [[TMP0]], 1
				; CHECK-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw i64 [[TMP2]], 2
				; CHECK-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 [[TMP1]], [[TMP3]]
				; CHECK-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_MEMCHECK:%.*]]
				; CHECK: vector.memcheck:
				; CHECK-NEXT: [[TMP4:%.*]] = shl i64 [[N]], 1
				; CHECK-NEXT: [[TMP5:%.*]] = and i64 [[TMP4]], -4
				; CHECK-NEXT: [[TMP6:%.*]] = add i64 [[TMP5]], 4
				; CHECK-NEXT: [[UGLYGEP:%.*]] = getelementptr i8, ptr [[B:%.*]], i64 [[TMP6]]
				; CHECK-NEXT: [[UGLYGEP1:%.*]] = getelementptr i8, ptr [[A]], i64 2
				; CHECK-NEXT: [[TMP7:%.*]] = add i64 [[TMP5]], 6
				; CHECK-NEXT: [[UGLYGEP2:%.*]] = getelementptr i8, ptr [[A]], i64 [[TMP7]]
				; CHECK-NEXT: [[BOUND0:%.*]] = icmp ugt ptr [[UGLYGEP2]], [[B]]
				; CHECK-NEXT: [[BOUND1:%.*]] = icmp ult ptr [[UGLYGEP1]], [[UGLYGEP]]
				; CHECK-NEXT: [[FOUND_CONFLICT:%.*]] = and i1 [[BOUND0]], [[BOUND1]]
				; CHECK-NEXT: br i1 [[FOUND_CONFLICT]], label [[SCALAR_PH]], label [[VECTOR_PH:%.*]]
				; CHECK: vector.ph:
				; CHECK-NEXT: [[TMP8:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP9:%.*]] = shl nuw nsw i64 [[TMP8]], 2
				; CHECK-NEXT: [[N_MOD_VF:%.*]] = urem i64 [[TMP1]], [[TMP9]]
				; CHECK-NEXT: [[N_VEC:%.*]] = sub i64 [[TMP1]], [[N_MOD_VF]]
				; CHECK-NEXT: [[IND_END:%.*]] = shl i64 [[N_VEC]], 1
				; CHECK-NEXT: [[TMP10:%.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: [[TMP11:%.*]] = shl nuw nsw i32 [[TMP10]], 2
				; CHECK-NEXT: [[TMP12:%.*]] = add nsw i32 [[TMP11]], -1
				; CHECK-NEXT: [[VECTOR_RECUR_INIT:%.*]] = insertelement <vscale x 4 x i16> poison, i16 [[DOTPRE]], i32 [[TMP12]]
				; CHECK-NEXT: [[TMP13:%.*]] = call <vscale x 4 x i64> @llvm.experimental.stepvector.nxv4i64()
				; CHECK-NEXT: [[TMP14:%.*]] = shl <vscale x 4 x i64> [[TMP13]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP15:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP16:%.*]] = shl nuw nsw i64 [[TMP15]], 3
				; CHECK-NEXT: [[DOTSPLATINSERT:%.*]] = insertelement <vscale x 4 x i64> poison, i64 [[TMP16]], i64 0
				; CHECK-NEXT: [[DOTSPLAT:%.*]] = shufflevector <vscale x 4 x i64> [[DOTSPLATINSERT]], <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer
				; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
				; CHECK: vector.body:
				; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VECTOR_RECUR:%.*]] = phi <vscale x 4 x i16> [ [[VECTOR_RECUR_INIT]], [[VECTOR_PH]] ], [ [[WIDE_MASKED_GATHER4:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[VEC_IND:%.*]] = phi <vscale x 4 x i64> [ [[TMP14]], [[VECTOR_PH]] ], [ [[VEC_IND_NEXT:%.*]], [[VECTOR_BODY]] ]
				; CHECK-NEXT: [[TMP17:%.*]] = add nuw nsw <vscale x 4 x i64> [[VEC_IND]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 1, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP18:%.*]] = add nuw nsw <vscale x 4 x i64> [[VEC_IND]], shufflevector (<vscale x 4 x i64> insertelement (<vscale x 4 x i64> poison, i64 2, i64 0), <vscale x 4 x i64> poison, <vscale x 4 x i32> zeroinitializer)
				; CHECK-NEXT: [[TMP19:%.*]] = getelementptr inbounds i16, ptr [[A]], <vscale x 4 x i64> [[TMP17]]
				; CHECK-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 4 x i16> @llvm.masked.gather.nxv4i16.nxv4p0(<vscale x 4 x ptr> [[TMP19]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i16> poison), !alias.scope !34
				; CHECK-NEXT: [[TMP20:%.*]] = sext <vscale x 4 x i16> [[WIDE_MASKED_GATHER]] to <vscale x 4 x i32>
				; CHECK-NEXT: [[TMP21:%.*]] = getelementptr inbounds i16, ptr [[A]], <vscale x 4 x i64> [[TMP18]]
				; CHECK-NEXT: [[WIDE_MASKED_GATHER4]] = call <vscale x 4 x i16> @llvm.masked.gather.nxv4i16.nxv4p0(<vscale x 4 x ptr> [[TMP21]], i32 4, <vscale x 4 x i1> shufflevector (<vscale x 4 x i1> insertelement (<vscale x 4 x i1> poison, i1 true, i64 0), <vscale x 4 x i1> poison, <vscale x 4 x i32> zeroinitializer), <vscale x 4 x i16> poison), !alias.scope !34
				; CHECK-NEXT: [[TMP22:%.*]] = call <vscale x 4 x i16> @llvm.experimental.vector.splice.nxv4i16(<vscale x 4 x i16> [[VECTOR_RECUR]], <vscale x 4 x i16> [[WIDE_MASKED_GATHER4]], i32 -1)
				; CHECK-NEXT: [[TMP23:%.*]] = sext <vscale x 4 x i16> [[TMP22]] to <vscale x 4 x i32>
				; CHECK-NEXT: [[TMP24:%.*]] = sext <vscale x 4 x i16> [[WIDE_MASKED_GATHER4]] to <vscale x 4 x i32>
				; CHECK-NEXT: [[TMP25:%.*]] = mul nsw <vscale x 4 x i32> [[TMP23]], [[TMP20]]
				; CHECK-NEXT: [[TMP26:%.*]] = mul nsw <vscale x 4 x i32> [[TMP25]], [[TMP24]]
				; CHECK-NEXT: [[TMP27:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[INDEX]]
				; CHECK-NEXT: store <vscale x 4 x i32> [[TMP26]], ptr [[TMP27]], align 4, !alias.scope !37, !noalias !34
				; CHECK-NEXT: [[TMP28:%.*]] = call i64 @llvm.vscale.i64()
				; CHECK-NEXT: [[TMP29:%.*]] = shl nuw nsw i64 [[TMP28]], 2
				; CHECK-NEXT: [[INDEX_NEXT]] = add nuw i64 [[INDEX]], [[TMP29]]
				; CHECK-NEXT: [[VEC_IND_NEXT]] = add <vscale x 4 x i64> [[VEC_IND]], [[DOTSPLAT]]
				; CHECK-NEXT: [[TMP30:%.*]] = icmp eq i64 [[INDEX_NEXT]], [[N_VEC]]
				; CHECK-NEXT: br i1 [[TMP30]], label [[MIDDLE_BLOCK:%.*]], label [[VECTOR_BODY]], !llvm.loop [[LOOP39:![0-9]+]]
				; CHECK: middle.block:
				; CHECK-NEXT: [[CMP_N:%.*]] = icmp eq i64 [[N_MOD_VF]], 0
				; CHECK-NEXT: [[TMP31:%.*]] = call i32 @llvm.vscale.i32()
				; CHECK-NEXT: [[TMP32:%.*]] = shl nuw nsw i32 [[TMP31]], 2
				; CHECK-NEXT: [[TMP33:%.*]] = add nsw i32 [[TMP32]], -1
				; CHECK-NEXT: [[VECTOR_RECUR_EXTRACT:%.*]] = extractelement <vscale x 4 x i16> [[WIDE_MASKED_GATHER4]], i32 [[TMP33]]
				; CHECK-NEXT: br i1 [[CMP_N]], label [[END:%.*]], label [[SCALAR_PH]]
				; CHECK: scalar.ph:
				; CHECK-NEXT: [[SCALAR_RECUR_INIT:%.*]] = phi i16 [ [[DOTPRE]], [[VECTOR_MEMCHECK]] ], [ [[DOTPRE]], [[ENTRY:%.*]] ], [ [[VECTOR_RECUR_EXTRACT]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: [[BC_RESUME_VAL:%.*]] = phi i64 [ 0, [[VECTOR_MEMCHECK]] ], [ 0, [[ENTRY]] ], [ [[IND_END]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: [[BC_RESUME_VAL3:%.*]] = phi i64 [ 0, [[VECTOR_MEMCHECK]] ], [ 0, [[ENTRY]] ], [ [[N_VEC]], [[MIDDLE_BLOCK]] ]
				; CHECK-NEXT: br label [[LOOP:%.*]]
				; CHECK: loop:
				; CHECK-NEXT: [[SCALAR_RECUR:%.*]] = phi i16 [ [[SCALAR_RECUR_INIT]], [[SCALAR_PH]] ], [ [[LOAD2:%.*]], [[LOOP]] ]
				; CHECK-NEXT: [[IV:%.*]] = phi i64 [ [[BC_RESUME_VAL]], [[SCALAR_PH]] ], [ [[IV2:%.*]], [[LOOP]] ]
				; CHECK-NEXT: [[I:%.*]] = phi i64 [ [[BC_RESUME_VAL3]], [[SCALAR_PH]] ], [ [[I1:%.*]], [[LOOP]] ]
				; CHECK-NEXT: [[CONV:%.*]] = sext i16 [[SCALAR_RECUR]] to i32
				; CHECK-NEXT: [[I1]] = add nuw nsw i64 [[I]], 1
				; CHECK-NEXT: [[IV1:%.*]] = or i64 [[IV]], 1
				; CHECK-NEXT: [[IV2]] = add nuw nsw i64 [[IV]], 2
				; CHECK-NEXT: [[GEP1:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[IV1]]
				; CHECK-NEXT: [[LOAD1:%.*]] = load i16, ptr [[GEP1]], align 4
				; CHECK-NEXT: [[CONV1:%.*]] = sext i16 [[LOAD1]] to i32
				; CHECK-NEXT: [[GEP2:%.*]] = getelementptr inbounds i16, ptr [[A]], i64 [[IV2]]
				; CHECK-NEXT: [[LOAD2]] = load i16, ptr [[GEP2]], align 4
				; CHECK-NEXT: [[CONV2:%.*]] = sext i16 [[LOAD2]] to i32
				; CHECK-NEXT: [[MUL01:%.*]] = mul nsw i32 [[CONV]], [[CONV1]]
				; CHECK-NEXT: [[MUL012:%.*]] = mul nsw i32 [[MUL01]], [[CONV2]]
				; CHECK-NEXT: [[ARRAYIDX5:%.*]] = getelementptr inbounds i32, ptr [[B]], i64 [[I]]
				; CHECK-NEXT: store i32 [[MUL012]], ptr [[ARRAYIDX5]], align 4
				; CHECK-NEXT: [[EXITCOND:%.*]] = icmp eq i64 [[IV]], [[N]]
				; CHECK-NEXT: br i1 [[EXITCOND]], label [[END]], label [[LOOP]], !llvm.loop [[LOOP40:![0-9]+]]
				; CHECK: end:
				; CHECK-NEXT: ret void
				;
				entry:
				%.pre = load i16, i16* %a
				br label %loop

				loop:
				%0 = phi i16 [ %.pre, %entry ], [ %load2, %loop ]
				%iv = phi i64 [ 0, %entry ], [ %iv2, %loop ]
				%i = phi i64 [ 0, %entry ], [ %i1, %loop ]
				%conv = sext i16 %0 to i32
				%i1 = add nuw nsw i64 %i, 1
				%iv1 = add nuw nsw i64 %iv, 1
				%iv2 = add nuw nsw i64 %iv, 2
				%gep1 = getelementptr inbounds i16, i16* %a, i64 %iv1
				%load1 = load i16, i16* %gep1, align 4
				%conv1 = sext i16 %load1 to i32
				%gep2 = getelementptr inbounds i16, i16* %a, i64 %iv2
				%load2 = load i16, i16* %gep2, align 4
				%conv2 = sext i16 %load2 to i32
				%mul01 = mul nsw i32 %conv, %conv1
				%mul012 = mul nsw i32 %mul01, %conv2
				%arrayidx5 = getelementptr inbounds i32, i32* %b, i64 %i
				store i32 %mul012, i32* %arrayidx5
				%exitcond = icmp eq i64 %iv, %n
				br i1 %exitcond, label %end, label %loop

				end:
				ret void
				}

				attributes #1 = { "target-features"="+sve" vscale_range(1, 16) }
				attributes #0 = { "unsafe-fp-math"="true" "target-features"="+sve" vscale_range(1, 16) }