This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] widen a load with subvector insert
ClosedPublic

Authored by spatel on Nov 3 2022, 8:11 AM.

Details

Summary

This adapts/copies code from the existing fold that allows widening of a scalar load + insert. It can help in IR because it removes a shuffle, and the backend can already narrow loads if that is profitable in codegen.

We might be able to consolidate more of the logic, but handling this basic pattern should be enough to make a small difference on one of the motivating examples from issue #17113. The final goal of combining the loads in those patterns is not solved by this patch, though.
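For reference, here is a hypothetical before/after sketch of the pattern this fold targets (function names and attributes are illustrative, not taken from the patch's tests):

```llvm
; Before: a half-width load whose result is widened by a shuffle.
define <8 x float> @widen(ptr align 16 dereferenceable(32) %p) {
  %l = load <4 x float>, ptr %p, align 16
  %s = shufflevector <4 x float> %l, <4 x float> poison,
       <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x float> %s
}

; After: the shuffle is removed and the load is widened. The extra
; lanes were undef in the mask, so loading them is fine as long as the
; wider access is known dereferenceable and the cost model agrees.
define <8 x float> @widen.folded(ptr align 16 dereferenceable(32) %p) {
  %l = load <8 x float>, ptr %p, align 16
  ret <8 x float> %l
}
```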

Diff Detail

Event Timeline

spatel created this revision.Nov 3 2022, 8:11 AM
Herald added a project: Restricted Project.Nov 3 2022, 8:11 AM
spatel requested review of this revision.Nov 3 2022, 8:11 AM
spatel added inline comments.Nov 3 2022, 8:15 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
131

Note: I created this helper function to reduce code duplication. I didn't push that NFC change to main yet though (pending feedback here to do something different).

Thanks for looking at this!

How useful/feasible would this be to support inserting into an upper subvector as well?

define <8 x float> @upper(ptr dereferenceable(32) %x) {
  %offset = getelementptr inbounds float, ptr %x, i64 4
  %load = load <4 x float>, ptr %offset, align 4
  %insert = shufflevector <4 x float> undef, <4 x float> %load, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
  ret <8 x float> %insert
}
spatel added a comment.Nov 6 2022, 6:42 AM

> Thanks for looking at this!
>
> How useful/feasible would this be to support inserting into an upper subvector as well?
>
> define <8 x float> @upper(ptr dereferenceable(32) %x) {
>   %offset = getelementptr inbounds float, ptr %x, i64 4
>   %load = load <4 x float>, ptr %offset, align 4
>   %insert = shufflevector <4 x float> undef, <4 x float> %load, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
>   ret <8 x float> %insert
> }

I don't have a real example for that, but yes, it's doable. It would be similar to enhancements that we made in VectorCombine::vectorizeLoadInsert(), so we'd hopefully be able to share some more code for the two patterns. We have to (very carefully...) allow peeking through a GEP and translating that as a shuffle index offset. I can add a TODO.
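For the upper-subvector case, the GEP offset would have to be folded into the shuffle mask, roughly like this (hypothetical output for the @upper example above; not what the patch currently produces):

```llvm
; Load all 32 bytes from the base pointer and select elements 4..7,
; i.e. the GEP offset of 4 floats becomes a shuffle index offset.
; Note that this mask is an identity-with-undef mask, so the shuffle
; could fold away entirely, leaving just the wide load.
define <8 x float> @upper.widened(ptr dereferenceable(32) %x) {
  %wide = load <8 x float>, ptr %x, align 4
  %insert = shufflevector <8 x float> %wide, <8 x float> poison,
            <8 x i32> <i32 undef, i32 undef, i32 undef, i32 undef, i32 4, i32 5, i32 6, i32 7>
  ret <8 x float> %insert
}
```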

RKSimon added inline comments.Nov 6 2022, 7:36 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
279

Do we need to check for cases where the identity comes from the second operand? IIRC we got burnt by something similar recently.

spatel planned changes to this revision.Nov 6 2022, 7:57 AM
spatel marked an inline comment as done.
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
279

Good catch. If we have a very non-canonical shuffle where both operands are valid loads, but we're actually only shuffling elements from op1 (op0 is unused), this would use the wrong operand.
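In IR, the non-canonical form in question looks something like this (illustrative test, not copied from the patch):

```llvm
; The load is operand 1 of the shuffle; mask indices 4..7 select its
; elements 0..3, while operand 0 (poison) is never used. A matcher
; that only looks at operand 0 would widen the wrong value here.
define <8 x float> @identity_from_op1(ptr align 16 dereferenceable(32) %p) {
  %l = load <4 x float>, ptr %p, align 16
  %s = shufflevector <4 x float> poison, <4 x float> %l,
       <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x float> %s
}
```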

spatel updated this revision to Diff 473502.Nov 6 2022, 8:55 AM
spatel marked an inline comment as done.

Patch updated:
There are no guarantees about what shuffles the other vectorizer passes can produce, and we don't know that instcombine runs before this pass, so I added code and a test to handle a non-canonical identity shuffle that chooses from operand 1 of the shuffle.

arsenm added inline comments.Nov 6 2022, 9:38 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

Shouldn't need to create an addrspacecast here

spatel added inline comments.Nov 7 2022, 8:13 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

I copied this line from the existing fold for a load+insertelt, and that was last changed with D121787.

Is that not relevant with this transform and/or the change to opaque pointers? I'm not familiar with all of the addrspace corner-cases, so I'm not sure what to do here.

Add tests derived from D121787?

define <4 x i32> @load_from_other_as(ptr addrspace(5) align 16 dereferenceable(16) %p) {
  %asc = addrspacecast ptr addrspace(5) %p to ptr
  %l = load <2 x i32>, ptr %asc, align 4
  %s = shufflevector <2 x i32> %l, <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
  ret <4 x i32> %s
}
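If the transform strips the addrspacecast, it has to cast the stripped pointer back before creating the wide load; conceptually the result would look like this (hypothetical output, with the alignment taken from the parameter attributes):

```llvm
; Widened form: the underlying addrspace(5) pointer is known
; dereferenceable(16) align 16, but the load itself must stay on a
; generic ('ptr') pointer, hence the re-inserted addrspacecast.
define <4 x i32> @load_from_other_as.widened(ptr addrspace(5) align 16 dereferenceable(16) %p) {
  %asc = addrspacecast ptr addrspace(5) %p to ptr
  %w = load <4 x i32>, ptr %asc, align 16
  ret <4 x i32> %w
}
```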
arsenm added inline comments.Nov 7 2022, 8:21 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

Opaque pointers are unrelated to the address space; there's no change there.

I don't see how it's relevant for this transform. You're widening a load, which should always produce a load with the same address space as the original load. The only cast you should need here is the element bitcast for typed pointers.

spatel added inline comments.Nov 7 2022, 9:51 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

I think the issue is that stripPointerCasts() will peek through addrspacecasts.

So if we do that, then we need to cast the source pointer back to the required destination addrspace. There was a comment in the code about this before D121787 changed the code to include a cast.

It's not clear to me from the descriptions if we can use a different stripPointer* API to avoid the issue. But if we do that, then it would be better to change the existing code too, so these 2 transforms are not diverging in implementation.

I'll add a test with addrspacecast and update here, so we have some test coverage for this.

arsenm added inline comments.Nov 7 2022, 9:56 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
321–322

OK, I didn't realize this was stripping away casts. Per the new LangRef address space rules, it should be legal to change the address space of a non-volatile, known-dereferenceable pointer back to the original address space. However, I don't believe this pass should be responsible for changing it here. I'd expect pure load widening not to look through addrspacecast.

spatel updated this revision to Diff 473724.Nov 7 2022, 9:57 AM

Rebased with a new test (last test/diff in the test file) that includes addrspacecast. This will crash if we don't re-insert a cast.

I think we could reduce the wordiness by calling CreatePointerCast() rather than CreateBitCastOrAddrSpaceCast(), but I'd make that change to the existing code too to keep these in sync (assuming that works in the other transform).

> Rebased with a new test (last test/diff in the test file) that includes addrspacecast. This will crash if we don't re-insert a cast.
>
> I think we could reduce the wordiness by calling CreatePointerCast() rather than CreateBitCastOrAddrSpaceCast(), but I'd make that change to the existing code too to keep these in sync (assuming that works in the other transform).

CreateBitCast should be adequate. If you need to, there's stripPointerCastsSameRepresentation

> CreateBitCast should be adequate. If you need to, there's stripPointerCastsSameRepresentation

Hmm...CreateBitCast crashes with:

"Assertion failed: (castIsValid(op, S, Ty) && "Invalid cast!")"

stripPointerCastsSameRepresentation() avoids the crash, but then we don't capture the expected alignment from the ptr param.

> stripPointerCastsSameRepresentation() avoids the crash, but then we don't capture the expected alignment from the ptr param.

For alignment purposes, stripPointerCasts is OK.

However, why does the vector combiner need to look at the underlying pointer for the alignment? Doesn't instcombine increase the alignment based on the underlying pointer already, such that this can just read the alignment direct from the instruction?

> However, why does the vector combiner need to look at the underlying pointer for the alignment? Doesn't instcombine increase the alignment based on the underlying pointer already, such that this can just read the alignment direct from the instruction?

It's correct that InstCombine improves alignment like that, but it's similar reasoning to why we handled a non-canonical shuffle mask in this patch: vector-combine isn't guaranteed to see canonical code. Currently, it's sitting directly after SLP in the normal opt pipelines, and there's a good chance that SLP has created non-canonical code. We could adjust the opt pipeline to avoid that, but then we increase compile-time, so there's no easy answer AFAIK.
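A small example of the non-canonical alignment case being discussed (illustrative; instcombine would normally raise the align on the load itself):

```llvm
; The load only claims align 4, but the argument guarantees align 16.
; If the widening fold reads the alignment straight off the
; instruction, the new <8 x float> load gets the weaker align 4;
; computing known alignment from %p recovers align 16.
define <8 x float> @underaligned(ptr align 16 dereferenceable(32) %p) {
  %l = load <4 x float>, ptr %p, align 4
  %s = shufflevector <4 x float> %l, <4 x float> poison,
       <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef>
  ret <8 x float> %s
}
```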

RKSimon accepted this revision.Nov 9 2022, 8:35 AM

The addrspacecast issue seems to be independent of the rest of the patch, and it matches the equivalent code we already have in vectorizeLoadInsert.

I don't know whether there's a suitable bug you could raise, but it shouldn't stop us getting this in.

This revision is now accepted and ready to land.Nov 9 2022, 8:35 AM
This revision was landed with ongoing or failed builds.Nov 10 2022, 11:13 AM
This revision was automatically updated to reflect the committed changes.