This is an archive of the discontinued LLVM Phabricator instance.


Differential D106028

[AArch64][SelectionDAG] Add legalization for widening LOAD/MLOAD.
Needs Review · Public

Authored by efriedma on Jul 14 2021, 5:32 PM.


Details

Reviewers

sdesmalen
paulwalker-arm
craig.topper

Summary

By the magic of masked loads, a widened MLOAD is almost identical to the original MLOAD.

Need to handle a few more INSERT_SUBVECTOR legalization cases to avoid crashing on the testcase.

The code for computing the mask is unfortunately not very efficient; maybe we need a target-independent version of whilelo? Or a DAGCombine to form whilelo?
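
The first point of the summary can be illustrated with a scalar model: a masked load only reads lanes whose mask bit is set, so widening the result and padding the mask with inactive lanes touches the same memory as before. The sketch below is plain C++, not LLVM code; `maskedLoad` and `widenMask` are hypothetical helpers, and fixed-size `std::vector`s stand in for scalable vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a masked load: lane i is read from memory only when
// mask[i] is true; otherwise the pass-through value is kept.
std::vector<int32_t> maskedLoad(const int32_t *base,
                                const std::vector<bool> &mask,
                                const std::vector<int32_t> &passThru) {
  assert(mask.size() == passThru.size() && "mask/pass-through width mismatch");
  std::vector<int32_t> result(passThru);
  for (std::size_t i = 0; i < mask.size(); ++i)
    if (mask[i])
      result[i] = base[i]; // Inactive lanes never touch memory.
  return result;
}

// Widen a mask by appending inactive (false) lanes, comparable to filling
// the extra mask lanes with zeroes.
std::vector<bool> widenMask(std::vector<bool> mask, std::size_t wideLanes) {
  assert(wideLanes >= mask.size() && "cannot widen to fewer lanes");
  mask.resize(wideLanes, false);
  return mask;
}
```

With a three-lane mask padded to four lanes, the widened load yields the same three values plus one pass-through lane, and the fourth memory element is never read.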

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
3,190 ms	x64 debian > libarcher.critical::critical.c
2,900 ms	x64 debian > libarcher.races::critical-unrelated.c
2,980 ms	x64 debian > libarcher.races::lock-nested-unrelated.c
3,280 ms	x64 debian > libarcher.races::lock-unrelated.c
3,070 ms	x64 debian > libarcher.races::parallel-simple.c
(16 failed in total)

Event Timeline

efriedma created this revision. · Jul 14 2021, 5:32 PM

Herald added subscribers: ecnelises, danielkiss, steven.zhang and 2 others. · Jul 14 2021, 5:32 PM

efriedma requested review of this revision. · Jul 14 2021, 5:32 PM

Herald added a project: Restricted Project. · Jul 14 2021, 5:32 PM

Harbormaster completed remote builds in B114124: Diff 358791. · Jul 14 2021, 6:09 PM

sdesmalen added inline comments. · Jul 19 2021, 1:14 PM

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
4056	Do we also want to do this for other element-counts that are not a power of 2? (e.g. `<vscale x 6 x i8>`) Was this code intended to work for `<vscale x 1 x eltty>` ? (I tried that a few days ago with this patch and ran into some failures. I didn't really investigate further though)
4057	nit: VT.isKnownMultipleOf(2)
4144	Should `MemoryLocation::UnknownSize` only be used for the scalable case?

efriedma added inline comments. · Jul 19 2021, 1:34 PM

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
4056	Should work for 1, I guess? Not sure I actually tried it. But note we only support a very limited set of operations. See also D105591. For `<vscale x 6 x i8>`, we can use a different sequence: generate extloads for `<vscale x 2 x i64>` and `<vscale x 4 x i32>`, and merge them together. GenWidenVectorLoads should handle that in theory; not sure how well it works in practice. I'll add more testcases, in any case.
4144	There's a general restriction on memoperands in selectiondag: the size of the operand must be at least the size of the memory VT. I didn't really want to think about precisely what MemoryLocation we could be using instead...
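
The `<vscale x 6 x i8>` sequence described in the reply on line 4056 amounts to covering six lanes with one four-lane and one two-lane load. A minimal sketch of that idea follows, as a greedy split over an assumed list of legal lane counts. `splitLanes` is a hypothetical helper and is not how GenWidenVectorLoads is implemented; the list `{16, 8, 4, 2}` is an assumption chosen to model SVE, where the smallest chunk has two lanes.

```cpp
#include <cassert>
#include <vector>

// Greedily split a minimum lane count into chunks whose sizes come from
// `legalLanes` (sorted from largest to smallest). Returns false when some
// lanes cannot be covered by any legal chunk.
bool splitLanes(unsigned minLanes, const std::vector<unsigned> &legalLanes,
                std::vector<unsigned> &chunks) {
  chunks.clear();
  for (unsigned lanes : legalLanes) {
    assert(lanes != 0 && "a chunk must cover at least one lane");
    while (minLanes >= lanes) {
      chunks.push_back(lanes);
      minLanes -= lanes;
    }
  }
  // Leftover lanes mean no legal chunk was small enough.
  return minLanes == 0;
}
```

Under these assumptions a count of 6 splits into chunks of 4 and 2, while a count of 3 leaves one lane uncovered, which is the case the masked-load path is meant to handle.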

More tests. Make "<vscale x 1 x i32>" loads work.

Harbormaster completed remote builds in B114958: Diff 359931. · Jul 19 2021, 3:52 PM

Matt added a subscriber: Matt. · Jul 20 2021, 6:48 AM

sdesmalen mentioned this in D105591: [AArch64][SelectionDAG] Support passing/returning scalable vectors with unusual types. · Aug 2 2021, 1:51 PM

Apologies for the delay in reviewing this, it has been on my to-do list for quite a while!

Thanks for adding support for <vscale x 1 x eltty> types and adding a test for widening the even-numbered ElementCounts (e.g. <vscale x 6 x i16>), it's nice to see that this works too.

Other than some minor comments, the patch looks mostly fine to me.

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
4881–4883	nit: Is/would this be the same as: `SDValue Op0 = N->getOperand(0); TLI.getTypeToTransformTo(*DAG.getContext(), Op0.getValueType());`?
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
4011	This code is identical to the block on lines 4022-4026, with the exception of `s/NumElements/MinNumElements`. Can you align the implementation in the two blocks and remove the `if (VT.isScalableVector())`? (and then move the `report_fatal_error` to line 4027)
4017	nit: unnecessary curly braces.
4056	Is the FIXME still relevant?
	> For `<vscale x 6 x i8>`, we can use a different sequence: generate extloads for `<vscale x 2 x i64>` and `<vscale x 4 x i32>`, and merge them together. GenWidenVectorLoads should handle that in theory; not sure how well it works in practice. I'll add more testcases, in any case.
	Nice. From what I can see from the test you've added, `<vscale x 6 x i16>` is handled correctly by GenWidenVectorLoads.
4062	nit: can you move `WidenVT` closer to where it's used? (above line 4079)
llvm/test/CodeGen/AArch64/sve-split-load.ll
72	`sve-split-load.ll` may no longer be the best name for this file given that these new tests are about widening?

sdesmalen added a child revision: D107390: WIP: [SelectionDAG] Promote types in widenVectorToPartType. · Aug 3 2021, 12:48 PM

sdesmalen mentioned this in D107390: WIP: [SelectionDAG] Promote types in widenVectorToPartType.

efriedma added inline comments. · Aug 3 2021, 1:40 PM

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
4881–4883	No, it's not the same. We need to extend the element VT to the element VT of the result type. For SVE, the resulting vector type is usually promotable. If it is, we deal with it in PromoteIntOp_CONCAT_VECTORS.
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
4056	Yes, it's still an issue. The constant "2" is specific to SVE. The question we need to figure out here is, will GenWidenVectorLoads succeed (and be profitable)? For SVE, it should succeed for any multiple of 2, because we can use 64-bit extending loads. Not sure about profitability for SVE; the generated code is a bit messy, but the messy bits are all invariant (and probably less messy once we start using whilelo optimally). Other vector instruction sets might have different restrictions.
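
The condition behind the FIXME can be written out as a small decision function. This is a sketch only: `LaneCount` is a simplified stand-in for `llvm::ElementCount`, `chooseStrategy` and `WidenStrategy` are hypothetical names, and the default multiple of 2 is the SVE-specific constant discussed above.

```cpp
#include <cassert>

// Simplified stand-in for llvm::ElementCount: the real lane count is
// minLanes * vscale when `scalable` is set, and exactly minLanes otherwise.
struct LaneCount {
  unsigned minLanes;
  bool scalable;

  // When the known minimum is a multiple of `rhs`, the lane count is a
  // multiple of `rhs` for every value of vscale.
  bool isKnownMultipleOf(unsigned rhs) const {
    assert(rhs != 0 && "divisor must be non-zero");
    return minLanes % rhs == 0;
  }
};

enum class WidenStrategy { MaskedLoad, SplitIntoLegalLoads };

// Scalable vectors whose minimum lane count is not a multiple of
// `laneMultiple` go through a masked load; everything else is left to the
// chunked-load path.
WidenStrategy chooseStrategy(LaneCount count, unsigned laneMultiple = 2) {
  if (count.scalable && !count.isKnownMultipleOf(laneMultiple))
    return WidenStrategy::MaskedLoad;
  return WidenStrategy::SplitIntoLegalLoads;
}
```

Another vector instruction set would presumably pass a different `laneMultiple`, which is the open question the FIXME records.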

Revision Contents

Path	Size
llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp	13 lines
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp	144 lines
llvm/test/CodeGen/AArch64/sve-split-load.ll	220 lines

Diff 359931

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

	(4,869 lines not shown)

	SDValue DAGTypeLegalizer::PromoteIntRes_CONCAT_VECTORS(SDNode *N) {			SDValue DAGTypeLegalizer::PromoteIntRes_CONCAT_VECTORS(SDNode *N) {
	SDLoc dl(N);			SDLoc dl(N);

	EVT OutVT = N->getValueType(0);			EVT OutVT = N->getValueType(0);
	EVT NOutVT = TLI.getTypeToTransformTo(*DAG.getContext(), OutVT);			EVT NOutVT = TLI.getTypeToTransformTo(*DAG.getContext(), OutVT);
	assert(NOutVT.isVector() && "This type must be promoted to a vector type");			assert(NOutVT.isVector() && "This type must be promoted to a vector type");

				if (NOutVT.isScalableVector()) {
				unsigned NumOperands = N->getNumOperands();
				SmallVector<SDValue, 16> ConcatOps(NumOperands);
				EVT ConcatOpVT = EVT::getVectorVT(
				*DAG.getContext(), NOutVT.getVectorElementType(),
				N->getOperand(0).getValueType().getVectorElementCount());
				for (unsigned i = 0; i < NumOperands; ++i) {
				ConcatOps[i] =
				DAG.getNode(ISD::ANY_EXTEND, dl, ConcatOpVT, N->getOperand(i));
				}
				return DAG.getNode(ISD::CONCAT_VECTORS, dl, NOutVT, ConcatOps);
				}

	EVT OutElemTy = NOutVT.getVectorElementType();			EVT OutElemTy = NOutVT.getVectorElementType();

	unsigned NumElem = N->getOperand(0).getValueType().getVectorNumElements();			unsigned NumElem = N->getOperand(0).getValueType().getVectorNumElements();
	unsigned NumOutElem = NOutVT.getVectorNumElements();			unsigned NumOutElem = NOutVT.getVectorNumElements();
	unsigned NumOperands = N->getNumOperands();			unsigned NumOperands = N->getNumOperands();
	assert(NumElem * NumOperands == NumOutElem &&			assert(NumElem * NumOperands == NumOutElem &&
	"Unexpected number of elements");			"Unexpected number of elements");

	(144 lines not shown)
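
The scalable path added to `PromoteIntRes_CONCAT_VECTORS` extends each operand to the promoted element type and then concatenates the results. A scalar model of that shape is sketched below, assuming `i8` operands promoted to `i32`; `anyExtend` and `promoteConcat` are hypothetical names, and zero-extension is used only because ANY_EXTEND leaves the upper bits unspecified.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Extend every i8 lane of one operand to i32. ANY_EXTEND leaves the upper
// bits unspecified; zero-extension is used here as one valid choice.
std::vector<uint32_t> anyExtend(const std::vector<uint8_t> &operand) {
  return std::vector<uint32_t>(operand.begin(), operand.end());
}

// Model of the scalable path: extend each operand to the promoted element
// type, then concatenate the extended operands in order.
std::vector<uint32_t>
promoteConcat(const std::vector<std::vector<uint8_t>> &operands) {
  assert(!operands.empty() && "CONCAT_VECTORS needs at least one operand");
  std::vector<uint32_t> result;
  for (const std::vector<uint8_t> &operand : operands) {
    assert(operand.size() == operands.front().size() &&
           "all operands must have the same lane count");
    std::vector<uint32_t> extended = anyExtend(operand);
    result.insert(result.end(), extended.begin(), extended.end());
  }
  return result;
}
```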

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

(1,295 lines not shown)	void DAGTypeLegalizer::SplitVecRes_INSERT_SUBVECTOR(SDNode *N, SDValue &Lo,
// of a scalable vector.		// of a scalable vector.
if (VecVT.isScalableVector() == SubVecVT.isScalableVector() &&		if (VecVT.isScalableVector() == SubVecVT.isScalableVector() &&
IdxVal >= LoElems && IdxVal + SubElems <= VecElems) {		IdxVal >= LoElems && IdxVal + SubElems <= VecElems) {
Hi = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, Hi.getValueType(), Hi, SubVec,		Hi = DAG.getNode(ISD::INSERT_SUBVECTOR, dl, Hi.getValueType(), Hi, SubVec,
DAG.getVectorIdxConstant(IdxVal - LoElems, dl));		DAG.getVectorIdxConstant(IdxVal - LoElems, dl));
return;		return;
}		}

		// insert_subvector(Op, SubVec, 0) where SubVec widens to the result type
		// can be converted to a vselect.
		if (IdxVal == 0 && VecVT.isScalableVector() &&
		TLI.getTypeToTransformTo(*DAG.getContext(), SubVecVT) == VecVT) {
		SDValue WidenedSubVec = GetWidenedVector(SubVec);
		EVT CmpElementVT = MVT::i32;
		EVT CmpVT = EVT::getVectorVT(*DAG.getContext(), CmpElementVT,
		VecVT.getVectorElementCount());
		EVT MaskVT = EVT::getVectorVT(*DAG.getContext(), MVT::i1,
		VecVT.getVectorElementCount());

		SDLoc dl(N);
		SDValue Step = DAG.getStepVector(dl, CmpVT);
		unsigned NumElements = SubVec.getValueType().getVectorMinNumElements();
		SDValue SplatNumElements = DAG.getSplatVector(
		CmpVT, dl, DAG.getVScale(dl, CmpElementVT, APInt(32, NumElements)));
		SDValue Mask =
		DAG.getSetCC(dl, MaskVT, Step, SplatNumElements, ISD::SETULT);
		SDValue Select =
		DAG.getNode(ISD::VSELECT, dl, VecVT, Mask, WidenedSubVec, Vec);

		Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, Lo.getValueType(), Select,
		DAG.getVectorIdxConstant(0, dl));
		Hi = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, Hi.getValueType(), Select,
		DAG.getVectorIdxConstant(LoElems, dl));
		return;
		}

// Spill the vector to the stack.		// Spill the vector to the stack.
// In cases where the vector is illegal it will be broken down into parts		// In cases where the vector is illegal it will be broken down into parts
// and stored in parts - we should use the alignment for the smallest part.		// and stored in parts - we should use the alignment for the smallest part.
Align SmallestAlign = DAG.getReducedAlign(VecVT, /*UseABI=*/false);		Align SmallestAlign = DAG.getReducedAlign(VecVT, /*UseABI=*/false);
SDValue StackPtr =		SDValue StackPtr =
DAG.CreateStackTemporary(VecVT.getStoreSize(), SmallestAlign);		DAG.CreateStackTemporary(VecVT.getStoreSize(), SmallestAlign);
auto &MF = DAG.getMachineFunction();		auto &MF = DAG.getMachineFunction();
auto FrameIndex = cast<FrameIndexSDNode>(StackPtr.getNode())->getIndex();		auto FrameIndex = cast<FrameIndexSDNode>(StackPtr.getNode())->getIndex();
(2,662 lines not shown)	SDValue DAGTypeLegalizer::WidenVecRes_EXTRACT_SUBVECTOR(SDNode *N) {

EVT InVT = InOp.getValueType();		EVT InVT = InOp.getValueType();

// Check if we can just return the input vector after widening.		// Check if we can just return the input vector after widening.
uint64_t IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();		uint64_t IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
if (IdxVal == 0 && InVT == WidenVT)		if (IdxVal == 0 && InVT == WidenVT)
return InOp;		return InOp;

if (VT.isScalableVector())		if (VT.isScalableVector()) {
		unsigned WidenNumElts = WidenVT.getVectorMinNumElements();
		unsigned InNumElts = InVT.getVectorMinNumElements();
		if (IdxVal % WidenNumElts == 0 && IdxVal + WidenNumElts < InNumElts)
		return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, WidenVT, InOp, Idx);
		}

		if (InVT.isScalableVector()) {
report_fatal_error("Don't know how to widen the result of "		report_fatal_error("Don't know how to widen the result of "
"EXTRACT_SUBVECTOR for scalable vectors");		"EXTRACT_SUBVECTOR for scalable vectors");
		}

// Check if we can extract from the vector.		// Check if we can extract from the vector.
unsigned WidenNumElts = WidenVT.getVectorNumElements();		unsigned WidenNumElts = WidenVT.getVectorNumElements();
unsigned InNumElts = InVT.getVectorNumElements();		unsigned InNumElts = InVT.getVectorNumElements();
if (IdxVal % WidenNumElts == 0 && IdxVal + WidenNumElts < InNumElts)		if (IdxVal % WidenNumElts == 0 && IdxVal + WidenNumElts < InNumElts)
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, WidenVT, InOp, Idx);		return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, WidenVT, InOp, Idx);

// We could try widening the input to the right length but for now, extract		// We could try widening the input to the right length but for now, extract
(17 lines not shown)	SDValue DAGTypeLegalizer::WidenVecRes_INSERT_VECTOR_ELT(SDNode *N) {
return DAG.getNode(ISD::INSERT_VECTOR_ELT, SDLoc(N),		return DAG.getNode(ISD::INSERT_VECTOR_ELT, SDLoc(N),
InOp.getValueType(), InOp,		InOp.getValueType(), InOp,
N->getOperand(1), N->getOperand(2));		N->getOperand(1), N->getOperand(2));
}		}

SDValue DAGTypeLegalizer::WidenVecRes_LOAD(SDNode *N) {		SDValue DAGTypeLegalizer::WidenVecRes_LOAD(SDNode *N) {
LoadSDNode *LD = cast<LoadSDNode>(N);		LoadSDNode *LD = cast<LoadSDNode>(N);
ISD::LoadExtType ExtType = LD->getExtensionType();		ISD::LoadExtType ExtType = LD->getExtensionType();
		EVT VT = N->getValueType(0);

		// FIXME: Figure out how to replace constant "2".
		if (VT.isScalableVector() &&
		!VT.getVectorElementCount().isKnownMultipleOf(2)) {
		// Convert load to masked load. Let MLOAD legalization handle widening.
		// (We assume hardware with scalable vectors supports masked load/store.)
		SDLoc dl(N);
		EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), VT);
		EVT MaskVT = EVT::getVectorVT(*DAG.getContext(), MVT::i1,
		VT.getVectorElementCount());
		SDValue Mask = DAG.getAllOnesConstant(dl, MaskVT);
		SDValue PassThru = DAG.getUNDEF(VT);

		// Convert load to masked load. Let MLOAD legalization handle widening.
		SDValue Res = DAG.getMaskedLoad(VT, dl, LD->getChain(), LD->getBasePtr(),
		LD->getOffset(), Mask, PassThru,
		LD->getMemoryVT(), LD->getMemOperand(),
		LD->getAddressingMode(), ExtType);

		// Legalize the chain result - switch anything that used the old chain to
		// use the new one.
		ReplaceValueWith(SDValue(N, 1), Res.getValue(1));

		// Widen the result.
		return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, WidenVT,
		DAG.getUNDEF(WidenVT), Res,
		DAG.getVectorIdxConstant(0, dl));
		}

// A vector must always be stored in memory as-is, i.e. without any padding		// A vector must always be stored in memory as-is, i.e. without any padding
// between the elements, since various code depend on it, e.g. in the		// between the elements, since various code depend on it, e.g. in the
// handling of a bitcast of a vector type to int, which may be done with a		// handling of a bitcast of a vector type to int, which may be done with a
// vector store followed by an integer load. A vector that does not have		// vector store followed by an integer load. A vector that does not have
// elements that are byte-sized must therefore be stored as an integer		// elements that are byte-sized must therefore be stored as an integer
// built out of the extracted vector elements.		// built out of the extracted vector elements.
if (!LD->getMemoryVT().isByteSized()) {		if (!LD->getMemoryVT().isByteSized()) {
(23 lines not shown)	SDValue DAGTypeLegalizer::WidenVecRes_LOAD(SDNode *N) {
// Modified the chain - switch anything that used the old chain to use		// Modified the chain - switch anything that used the old chain to use
// the new one.		// the new one.
ReplaceValueWith(SDValue(N, 1), NewChain);		ReplaceValueWith(SDValue(N, 1), NewChain);

return Result;		return Result;
}		}

SDValue DAGTypeLegalizer::WidenVecRes_MLOAD(MaskedLoadSDNode *N) {		SDValue DAGTypeLegalizer::WidenVecRes_MLOAD(MaskedLoadSDNode *N) {
		assert(N->getAddressingMode() == ISD::UNINDEXED &&
		"We shouldn't form indexed loads with illegal types");

EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(),N->getValueType(0));		EVT WidenVT = TLI.getTypeToTransformTo(*DAG.getContext(), N->getValueType(0));
SDValue Mask = N->getMask();		SDValue Mask = N->getMask();
EVT MaskVT = Mask.getValueType();		EVT MaskVT = Mask.getValueType();
SDValue PassThru = GetWidenedVector(N->getPassThru());		SDValue PassThru = GetWidenedVector(N->getPassThru());
ISD::LoadExtType ExtType = N->getExtensionType();		ISD::LoadExtType ExtType = N->getExtensionType();
SDLoc dl(N);		SDLoc dl(N);

// The mask should be widened as well		// The mask should be widened as well
EVT WideMaskVT = EVT::getVectorVT(*DAG.getContext(),		EVT WideMaskVT =
MaskVT.getVectorElementType(),		EVT::getVectorVT(*DAG.getContext(), MaskVT.getVectorElementType(),
WidenVT.getVectorNumElements());		WidenVT.getVectorElementCount());
Mask = ModifyToType(Mask, WideMaskVT, true);		Mask = ModifyToType(Mask, WideMaskVT, true);

SDValue Res = DAG.getMaskedLoad(		EVT MemVT = N->getMemoryVT();
WidenVT, dl, N->getChain(), N->getBasePtr(), N->getOffset(), Mask,		EVT WideMemVT =
PassThru, N->getMemoryVT(), N->getMemOperand(), N->getAddressingMode(),		EVT::getVectorVT(*DAG.getContext(), MemVT.getVectorElementType(),
ExtType, N->isExpandingLoad());		WidenVT.getVectorElementCount());
		MachineFunction &MF = DAG.getMachineFunction();
		MachineMemOperand *MemOp = MF.getMachineMemOperand(
		N->getMemOperand(), 0, MemoryLocation::UnknownSize);

		SDValue Res =
		DAG.getMaskedLoad(WidenVT, dl, N->getChain(), N->getBasePtr(),
		N->getOffset(), Mask, PassThru, WideMemVT, MemOp,
		ISD::UNINDEXED, ExtType, N->isExpandingLoad());
// Legalize the chain result - switch anything that used the old chain to		// Legalize the chain result - switch anything that used the old chain to
// use the new one.		// use the new one.
ReplaceValueWith(SDValue(N, 1), Res.getValue(1));		ReplaceValueWith(SDValue(N, 1), Res.getValue(1));
return Res;		return Res;
}		}

SDValue DAGTypeLegalizer::WidenVecRes_MGATHER(MaskedGatherSDNode *N) {		SDValue DAGTypeLegalizer::WidenVecRes_MGATHER(MaskedGatherSDNode *N) {

(742 lines not shown)	if (VT == TLI.getTypeToTransformTo(*DAG.getContext(), InVT)) {
for (i = 1; i < NumOperands; ++i)		for (i = 1; i < NumOperands; ++i)
if (!N->getOperand(i).isUndef())		if (!N->getOperand(i).isUndef())
break;		break;

if (i == NumOperands)		if (i == NumOperands)
return GetWidenedVector(N->getOperand(0));		return GetWidenedVector(N->getOperand(0));
}		}

		if (InVT.isScalableVector())
		report_fatal_error("Cannot legalize this scalable CONCAT_VECTORS");

// Otherwise, fall back to a nasty build vector.		// Otherwise, fall back to a nasty build vector.
unsigned NumElts = VT.getVectorNumElements();		unsigned NumElts = VT.getVectorNumElements();
SmallVector<SDValue, 16> Ops(NumElts);		SmallVector<SDValue, 16> Ops(NumElts);

unsigned NumInElts = InVT.getVectorNumElements();		unsigned NumInElts = InVT.getVectorNumElements();

unsigned Idx = 0;		unsigned Idx = 0;
for (unsigned i=0; i < NumOperands; ++i) {		for (unsigned i=0; i < NumOperands; ++i) {
(18 lines not shown)	SDValue DAGTypeLegalizer::WidenVecOp_INSERT_SUBVECTOR(SDNode *N) {

if (getTypeAction(SubVec.getValueType()) == TargetLowering::TypeWidenVector)		if (getTypeAction(SubVec.getValueType()) == TargetLowering::TypeWidenVector)
SubVec = GetWidenedVector(SubVec);		SubVec = GetWidenedVector(SubVec);

if (SubVec.getValueType() == InVec.getValueType() && InVec.isUndef() &&		if (SubVec.getValueType() == InVec.getValueType() && InVec.isUndef() &&
N->getConstantOperandVal(2) == 0)		N->getConstantOperandVal(2) == 0)
return SubVec;		return SubVec;

		if (InVec.getValueType().isScalableVector() &&
		N->getConstantOperandVal(2) == 0) {
		SDLoc dl(N);
		EVT VT = InVec.getValueType();
		EVT CmpElementVT = MVT::i32;
		EVT CmpVT = EVT::getVectorVT(*DAG.getContext(), CmpElementVT,
		VT.getVectorElementCount());
		EVT MaskVT = EVT::getVectorVT(*DAG.getContext(), MVT::i1,
		VT.getVectorElementCount());

		if (SubVec.getValueType() != VT) {
		// If the widened SubVec is still too narrow, widen it again.
		unsigned NumConcat = VT.getVectorMinNumElements() /
		SubVec.getValueType().getVectorMinNumElements();
		SmallVector<SDValue, 16> Ops(NumConcat);
		SDValue FillVal = DAG.getUNDEF(SubVec.getValueType());
		Ops[0] = SubVec;
		for (unsigned i = 1; i != NumConcat; ++i)
		Ops[i] = FillVal;

		SubVec = DAG.getNode(ISD::CONCAT_VECTORS, dl, VT, Ops);
		}

		SDValue Step = DAG.getStepVector(dl, CmpVT);
		unsigned NumElements = VT.getVectorMinNumElements();
		SDValue SplatNumElements = DAG.getSplatVector(
		CmpVT, dl, DAG.getVScale(dl, CmpElementVT, APInt(32, NumElements)));
		SDValue Mask =
		DAG.getSetCC(dl, MaskVT, Step, SplatNumElements, ISD::SETULT);
		return DAG.getNode(ISD::VSELECT, dl, VT, Mask, SubVec, InVec);
		}

report_fatal_error("Don't know how to widen the operands for "		report_fatal_error("Don't know how to widen the operands for "
"INSERT_SUBVECTOR");		"INSERT_SUBVECTOR");
}		}

SDValue DAGTypeLegalizer::WidenVecOp_EXTRACT_SUBVECTOR(SDNode *N) {		SDValue DAGTypeLegalizer::WidenVecOp_EXTRACT_SUBVECTOR(SDNode *N) {
SDValue InOp = GetWidenedVector(N->getOperand(0));		SDValue InOp = GetWidenedVector(N->getOperand(0));
return DAG.getNode(ISD::EXTRACT_SUBVECTOR, SDLoc(N),		return DAG.getNode(ISD::EXTRACT_SUBVECTOR, SDLoc(N),
N->getValueType(0), InOp, N->getOperand(1));		N->getValueType(0), InOp, N->getOperand(1));
(662 lines not shown)	SDValue DAGTypeLegalizer::ModifyToType(SDValue InOp, EVT NVT,
assert(InVT.getVectorElementType() == NVT.getVectorElementType() &&		assert(InVT.getVectorElementType() == NVT.getVectorElementType() &&
"input and widen element type must match");		"input and widen element type must match");
SDLoc dl(InOp);		SDLoc dl(InOp);

// Check if InOp already has the right width.		// Check if InOp already has the right width.
if (InVT == NVT)		if (InVT == NVT)
return InOp;		return InOp;

unsigned InNumElts = InVT.getVectorNumElements();		unsigned InNumElts = InVT.getVectorMinNumElements();
unsigned WidenNumElts = NVT.getVectorNumElements();		unsigned WidenNumElts = NVT.getVectorMinNumElements();

		if (NVT.isScalableVector()) {
		if (WidenNumElts < InNumElts)
		return DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, NVT, InOp,
		DAG.getVectorIdxConstant(0, dl));

		SDValue FillVal =
		FillWithZeroes ? DAG.getConstant(0, dl, NVT) : DAG.getUNDEF(NVT);
		return DAG.getNode(ISD::INSERT_SUBVECTOR, dl, NVT, FillVal, InOp,
		DAG.getVectorIdxConstant(0, dl));
		}

if (WidenNumElts > InNumElts && WidenNumElts % InNumElts == 0) {		if (WidenNumElts > InNumElts && WidenNumElts % InNumElts == 0) {
unsigned NumConcat = WidenNumElts / InNumElts;		unsigned NumConcat = WidenNumElts / InNumElts;
SmallVector<SDValue, 16> Ops(NumConcat);		SmallVector<SDValue, 16> Ops(NumConcat);
SDValue FillVal = FillWithZeroes ? DAG.getConstant(0, dl, InVT) :		SDValue FillVal = FillWithZeroes ? DAG.getConstant(0, dl, InVT) :
DAG.getUNDEF(InVT);		DAG.getUNDEF(InVT);
Ops[0] = InOp;		Ops[0] = InOp;
for (unsigned i = 1; i != NumConcat; ++i)		for (unsigned i = 1; i != NumConcat; ++i)
Ops[i] = FillVal;		Ops[i] = FillVal;
(49 lines not shown)
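
The INSERT_SUBVECTOR hunks above build a lane mask from a step vector compared against a splat of `vscale * N`, then use it in a VSELECT. The scalar sketch below models that sequence with a runtime `vscale`; `laneMask` and `insertAtZero` are hypothetical names, and `std::vector` stands in for a scalable vector of a given runtime length. A predicate of this form is what the summary suggests could be produced by `whilelo`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build the lane mask: lane i is active when i < vscale * minSubLanes.
// This is the step-vector / splat / unsigned-less-than sequence in scalar
// form.
std::vector<bool> laneMask(std::size_t wideLanes, unsigned vscale,
                           unsigned minSubLanes) {
  std::vector<bool> mask(wideLanes);
  const std::size_t active = static_cast<std::size_t>(vscale) * minSubLanes;
  for (std::size_t i = 0; i < wideLanes; ++i)
    mask[i] = i < active;
  return mask;
}

// insert_subvector(vec, sub, 0) expressed as a select: active lanes take
// the widened subvector, the remaining lanes keep the original vector.
std::vector<int32_t> insertAtZero(const std::vector<int32_t> &vec,
                                  const std::vector<int32_t> &widenedSub,
                                  unsigned vscale, unsigned minSubLanes) {
  assert(vec.size() == widenedSub.size() && "operands must have equal width");
  const std::vector<bool> mask = laneMask(vec.size(), vscale, minSubLanes);
  std::vector<int32_t> result(vec);
  for (std::size_t i = 0; i < vec.size(); ++i)
    if (mask[i])
      result[i] = widenedSub[i];
  return result;
}
```

For example, with `vscale = 2`, a `<vscale x 3 x i32>` subvector widened to `<vscale x 4 x i32>` has six meaningful lanes out of eight, so the last two lanes of the result come from the original vector.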

llvm/test/CodeGen/AArch64/sve-split-load.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s			; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s

	; UNPREDICATED			; UNPREDICATED

	define <vscale x 4 x i16> @load_promote_4i16(<vscale x 4 x i16>* %a) {			define <vscale x 4 x i16> @load_promote_4i16(<vscale x 4 x i16>* %a) {
	; CHECK-LABEL: load_promote_4i16:			; CHECK-LABEL: load_promote_4i16:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ptrue p0.s			; CHECK-NEXT: ptrue p0.s
	; CHECK-NEXT: ld1h { z0.s }, p0/z, [x0]			; CHECK-NEXT: ld1h { z0.s }, p0/z, [x0]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%load = load <vscale x 4 x i16>, <vscale x 4 x i16>* %a			%load = load <vscale x 4 x i16>, <vscale x 4 x i16>* %a, align 1
	ret <vscale x 4 x i16> %load			ret <vscale x 4 x i16> %load
	}			}

	define <vscale x 16 x i16> @load_split_i16(<vscale x 16 x i16>* %a) {			define <vscale x 16 x i16> @load_split_i16(<vscale x 16 x i16>* %a) {
	; CHECK-LABEL: load_split_i16:			; CHECK-LABEL: load_split_i16:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ptrue p0.h			; CHECK-NEXT: ptrue p0.h
	; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]			; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
	; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, #1, mul vl]			; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, #1, mul vl]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%load = load <vscale x 16 x i16>, <vscale x 16 x i16>* %a			%load = load <vscale x 16 x i16>, <vscale x 16 x i16>* %a, align 1
	ret <vscale x 16 x i16> %load			ret <vscale x 16 x i16> %load
	}			}

	define <vscale x 24 x i16> @load_split_24i16(<vscale x 24 x i16>* %a) {			define <vscale x 24 x i16> @load_split_24i16(<vscale x 24 x i16>* %a) {
	; CHECK-LABEL: load_split_24i16:			; CHECK-LABEL: load_split_24i16:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ptrue p0.h			; CHECK-NEXT: ptrue p0.h
	; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]			; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
	; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, #1, mul vl]			; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, #1, mul vl]
	; CHECK-NEXT: ld1h { z2.h }, p0/z, [x0, #2, mul vl]			; CHECK-NEXT: ld1h { z2.h }, p0/z, [x0, #2, mul vl]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%load = load <vscale x 24 x i16>, <vscale x 24 x i16>* %a			%load = load <vscale x 24 x i16>, <vscale x 24 x i16>* %a, align 1
	ret <vscale x 24 x i16> %load			ret <vscale x 24 x i16> %load
	}			}

				define <vscale x 8 x i16> @load_widen_6i16(<vscale x 6 x i16>* %a) {
				; CHECK-LABEL: load_widen_6i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: ptrue p1.d
				; CHECK-NEXT: ld1h { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.d }, p1/z, [x0, #2, mul vl]
				; CHECK-NEXT: uzp1 z1.s, z1.s, z0.s
				; CHECK-NEXT: uzp1 z0.h, z0.h, z1.h
				; CHECK-NEXT: ret
				%load = load <vscale x 6 x i16>, <vscale x 6 x i16>* %a, align 1
				%r = call <vscale x 8 x i16> @llvm.experimental.vector.insert.nxv8i16.nxv6i16(<vscale x 8 x i16> undef, <vscale x 6 x i16> %load, i64 0)
				ret <vscale x 8 x i16> %r
				}

				define <vscale x 4 x i32> @load_widen_1i32(<vscale x 1 x i32>* %a) {
				; CHECK-LABEL: load_widen_1i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cntw x8
				; CHECK-NEXT: index z0.s, #0, #1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: ptrue p1.d
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: cmphi p2.s, p0/z, z1.s, z0.s
				; CHECK-NEXT: uzp1 p1.s, p1.s, p0.s
				; CHECK-NEXT: and p0.b, p0/z, p2.b, p1.b
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%load = load <vscale x 1 x i32>, <vscale x 1 x i32>* %a, align 1
				%r = call <vscale x 4 x i32> @llvm.experimental.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> undef, <vscale x 1 x i32> %load, i64 0)
				ret <vscale x 4 x i32> %r
				}

				define <vscale x 4 x i32> @load_widen_3i32(<vscale x 3 x i32>* %a) {
				sdesmalen (inline comment): `sve-split-load.ll` may no longer be the best name for this file given that these new tests are about widening?
				; CHECK-LABEL: load_widen_3i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cntw x8
				; CHECK-NEXT: index z0.s, #0, #1
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: cmphi p0.s, p0/z, z1.s, z0.s
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%load = load <vscale x 3 x i32>, <vscale x 3 x i32>* %a, align 1
				%r = call <vscale x 4 x i32> @llvm.experimental.vector.insert.nxv4i32.nxv3i32(<vscale x 4 x i32> undef, <vscale x 3 x i32> %load, i64 0)
				ret <vscale x 4 x i32> %r
				}

				define <vscale x 8 x i32> @load_widen_6i32(<vscale x 6 x i32>* %a) {
				; CHECK-LABEL: load_widen_6i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: ptrue p1.d
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.d }, p1/z, [x0, #2, mul vl]
				; CHECK-NEXT: uzp1 z1.s, z1.s, z0.s
				; CHECK-NEXT: ret
				%load = load <vscale x 6 x i32>, <vscale x 6 x i32>* %a, align 1
				%r = call <vscale x 8 x i32> @llvm.experimental.vector.insert.nxv8i32.nxv6i32(<vscale x 8 x i32> undef, <vscale x 6 x i32> %load, i64 0)
				ret <vscale x 8 x i32> %r
				}

				define <vscale x 8 x i32> @load_widen_7i32(<vscale x 7 x i32>* %a) {
				; CHECK-LABEL: load_widen_7i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cntw x9
				; CHECK-NEXT: cnth x8
				; CHECK-NEXT: index z0.s, #0, #1
				; CHECK-NEXT: mov z2.s, w9
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: add z2.s, z0.s, z2.s
				; CHECK-NEXT: cmphi p1.s, p0/z, z1.s, z0.s
				; CHECK-NEXT: cmphi p0.s, p0/z, z1.s, z2.s
				; CHECK-NEXT: ld1w { z0.s }, p1/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, #1, mul vl]
				; CHECK-NEXT: ret
				%load = load <vscale x 7 x i32>, <vscale x 7 x i32>* %a, align 1
				%r = call <vscale x 8 x i32> @llvm.experimental.vector.insert.nxv8i32.nxv7i32(<vscale x 8 x i32> undef, <vscale x 7 x i32> %load, i64 0)
				ret <vscale x 8 x i32> %r
				}

	define <vscale x 32 x i16> @load_split_32i16(<vscale x 32 x i16>* %a) {			define <vscale x 32 x i16> @load_split_32i16(<vscale x 32 x i16>* %a) {
	; CHECK-LABEL: load_split_32i16:			; CHECK-LABEL: load_split_32i16:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ptrue p0.h			; CHECK-NEXT: ptrue p0.h
	; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]			; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
	; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, #1, mul vl]			; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, #1, mul vl]
	; CHECK-NEXT: ld1h { z2.h }, p0/z, [x0, #2, mul vl]			; CHECK-NEXT: ld1h { z2.h }, p0/z, [x0, #2, mul vl]
	; CHECK-NEXT: ld1h { z3.h }, p0/z, [x0, #3, mul vl]			; CHECK-NEXT: ld1h { z3.h }, p0/z, [x0, #3, mul vl]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%load = load <vscale x 32 x i16>, <vscale x 32 x i16>* %a			%load = load <vscale x 32 x i16>, <vscale x 32 x i16>* %a, align 1
	ret <vscale x 32 x i16> %load			ret <vscale x 32 x i16> %load
	}			}

	define <vscale x 16 x i64> @load_split_16i64(<vscale x 16 x i64>* %a) {			define <vscale x 16 x i64> @load_split_16i64(<vscale x 16 x i64>* %a) {
	; CHECK-LABEL: load_split_16i64:			; CHECK-LABEL: load_split_16i64:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: ptrue p0.d			; CHECK-NEXT: ptrue p0.d
	; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]			; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
	; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0, #1, mul vl]			; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0, #1, mul vl]
	; CHECK-NEXT: ld1d { z2.d }, p0/z, [x0, #2, mul vl]			; CHECK-NEXT: ld1d { z2.d }, p0/z, [x0, #2, mul vl]
	; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0, #3, mul vl]			; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0, #3, mul vl]
	; CHECK-NEXT: ld1d { z4.d }, p0/z, [x0, #4, mul vl]			; CHECK-NEXT: ld1d { z4.d }, p0/z, [x0, #4, mul vl]
	; CHECK-NEXT: ld1d { z5.d }, p0/z, [x0, #5, mul vl]			; CHECK-NEXT: ld1d { z5.d }, p0/z, [x0, #5, mul vl]
	; CHECK-NEXT: ld1d { z6.d }, p0/z, [x0, #6, mul vl]			; CHECK-NEXT: ld1d { z6.d }, p0/z, [x0, #6, mul vl]
	; CHECK-NEXT: ld1d { z7.d }, p0/z, [x0, #7, mul vl]			; CHECK-NEXT: ld1d { z7.d }, p0/z, [x0, #7, mul vl]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%load = load <vscale x 16 x i64>, <vscale x 16 x i64>* %a			%load = load <vscale x 16 x i64>, <vscale x 16 x i64>* %a, align 1
	ret <vscale x 16 x i64> %load			ret <vscale x 16 x i64> %load
	}			}

	; MASKED			; MASKED

	define <vscale x 2 x i32> @masked_load_promote_2i32(<vscale x 2 x i32> *%a, <vscale x 2 x i1> %pg) {			define <vscale x 2 x i32> @masked_load_promote_2i32(<vscale x 2 x i32> *%a, <vscale x 2 x i1> %pg) {
	; CHECK-LABEL: masked_load_promote_2i32:			; CHECK-LABEL: masked_load_promote_2i32:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	[57 lines not shown]
	; CHECK-NEXT: zip2 p0.s, p0.s, p1.s			; CHECK-NEXT: zip2 p0.s, p0.s, p1.s
	; CHECK-NEXT: ld1d { z2.d }, p2/z, [x0, #2, mul vl]			; CHECK-NEXT: ld1d { z2.d }, p2/z, [x0, #2, mul vl]
	; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0, #3, mul vl]			; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0, #3, mul vl]
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%load = call <vscale x 8 x i64> @llvm.masked.load.nxv8i64(<vscale x 8 x i64> *%a, i32 1, <vscale x 8 x i1> %pg, <vscale x 8 x i64> undef)			%load = call <vscale x 8 x i64> @llvm.masked.load.nxv8i64(<vscale x 8 x i64> *%a, i32 1, <vscale x 8 x i1> %pg, <vscale x 8 x i64> undef)
	ret <vscale x 8 x i64> %load			ret <vscale x 8 x i64> %load
	}			}

	declare <vscale x 32 x i8> @llvm.masked.load.nxv32i8(<vscale x 32 x i8>*, i32, <vscale x 32 x i1>, <vscale x 32 x i8>)			define <vscale x 8 x i16> @masked_load_widen_6i16(<vscale x 6 x i16>* %a, <vscale x 8 x i1> %pg.wide) {
				; CHECK-LABEL: masked_load_widen_6i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cnth x8
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: cntw x8
				; CHECK-NEXT: index z0.s, #0, #1
				; CHECK-NEXT: ptrue p1.s
				; CHECK-NEXT: mov z2.s, w8
				; CHECK-NEXT: cmphi p2.s, p1/z, z1.s, z0.s
				; CHECK-NEXT: add z0.s, z0.s, z2.s
				; CHECK-NEXT: cmphi p1.s, p1/z, z1.s, z0.s
				; CHECK-NEXT: uzp1 p1.h, p2.h, p1.h
				; CHECK-NEXT: ptrue p2.h
				; CHECK-NEXT: and p0.b, p2/z, p1.b, p0.b
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ret
				%pg = call <vscale x 6 x i1> @llvm.experimental.vector.extract.nxv6i1.nxv8i1(<vscale x 8 x i1> %pg.wide, i64 0)
				%load = call <vscale x 6 x i16> @llvm.masked.load.nxv6i16(<vscale x 6 x i16> *%a, i32 1, <vscale x 6 x i1> %pg, <vscale x 6 x i16> undef)
				%r = call <vscale x 8 x i16> @llvm.experimental.vector.insert.nxv8i16.nxv6i16(<vscale x 8 x i16> undef, <vscale x 6 x i16> %load, i64 0)
				ret <vscale x 8 x i16> %r
				}

	declare <vscale x 32 x i16> @llvm.masked.load.nxv32i16(<vscale x 32 x i16>*, i32, <vscale x 32 x i1>, <vscale x 32 x i16>)			define <vscale x 4 x i32> @masked_load_widen_1i32(<vscale x 1 x i32>* %a, <vscale x 4 x i1> %pg.wide) {
				; CHECK-LABEL: masked_load_widen_1i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cntw x8
				; CHECK-NEXT: index z0.s, #0, #1
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: ptrue p1.s
				; CHECK-NEXT: cmphi p2.s, p1/z, z1.s, z0.s
				; CHECK-NEXT: and p0.b, p1/z, p2.b, p0.b
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%pg = call <vscale x 1 x i1> @llvm.experimental.vector.extract.nxv1i1.nxv4i1(<vscale x 4 x i1> %pg.wide, i64 0)
				%load = call <vscale x 1 x i32> @llvm.masked.load.nxv1i32(<vscale x 1 x i32> *%a, i32 1, <vscale x 1 x i1> %pg, <vscale x 1 x i32> undef)
				%r = call <vscale x 4 x i32> @llvm.experimental.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32> undef, <vscale x 1 x i32> %load, i64 0)
				ret <vscale x 4 x i32> %r
				}

				define <vscale x 4 x i32> @masked_load_widen_3i32(<vscale x 3 x i32>* %a, <vscale x 4 x i1> %pg.wide) {
				; CHECK-LABEL: masked_load_widen_3i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cntw x8
				; CHECK-NEXT: index z0.s, #0, #1
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: ptrue p1.s
				; CHECK-NEXT: cmphi p2.s, p1/z, z1.s, z0.s
				; CHECK-NEXT: and p0.b, p1/z, p2.b, p0.b
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%pg = call <vscale x 3 x i1> @llvm.experimental.vector.extract.nxv3i1.nxv4i1(<vscale x 4 x i1> %pg.wide, i64 0)
				%load = call <vscale x 3 x i32> @llvm.masked.load.nxv3i32(<vscale x 3 x i32> *%a, i32 1, <vscale x 3 x i1> %pg, <vscale x 3 x i32> undef)
				%r = call <vscale x 4 x i32> @llvm.experimental.vector.insert.nxv4i32.nxv3i32(<vscale x 4 x i32> undef, <vscale x 3 x i32> %load, i64 0)
				ret <vscale x 4 x i32> %r
				}

				define <vscale x 8 x i32> @masked_load_widen_6i32(<vscale x 6 x i32>* %a, <vscale x 8 x i1> %pg.wide) {
				; CHECK-LABEL: masked_load_widen_6i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cnth x8
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: cntw x8
				; CHECK-NEXT: ptrue p1.s
				; CHECK-NEXT: index z0.s, #0, #1
				; CHECK-NEXT: mov z2.s, w8
				; CHECK-NEXT: cmphi p2.s, p1/z, z1.s, z0.s
				; CHECK-NEXT: add z0.s, z0.s, z2.s
				; CHECK-NEXT: cmphi p1.s, p1/z, z1.s, z0.s
				; CHECK-NEXT: uzp1 p1.h, p2.h, p1.h
				; CHECK-NEXT: ptrue p2.h
				; CHECK-NEXT: and p0.b, p2/z, p1.b, p0.b
				; CHECK-NEXT: pfalse p1.b
				; CHECK-NEXT: zip1 p2.h, p0.h, p1.h
				; CHECK-NEXT: zip2 p0.h, p0.h, p1.h
				; CHECK-NEXT: ld1w { z0.s }, p2/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, #1, mul vl]
				; CHECK-NEXT: ret
				%pg = call <vscale x 6 x i1> @llvm.experimental.vector.extract.nxv6i1.nxv8i1(<vscale x 8 x i1> %pg.wide, i64 0)
				%load = call <vscale x 6 x i32> @llvm.masked.load.nxv6i32(<vscale x 6 x i32> *%a, i32 1, <vscale x 6 x i1> %pg, <vscale x 6 x i32> undef)
				%r = call <vscale x 8 x i32> @llvm.experimental.vector.insert.nxv8i32.nxv6i32(<vscale x 8 x i32> undef, <vscale x 6 x i32> %load, i64 0)
				ret <vscale x 8 x i32> %r
				}

				define <vscale x 8 x i32> @masked_load_widen_7i32(<vscale x 7 x i32>* %a, <vscale x 8 x i1> %pg.wide) {
				; CHECK-LABEL: masked_load_widen_7i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cnth x8
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: cntw x8
				; CHECK-NEXT: ptrue p1.s
				; CHECK-NEXT: index z0.s, #0, #1
				; CHECK-NEXT: mov z2.s, w8
				; CHECK-NEXT: cmphi p2.s, p1/z, z1.s, z0.s
				; CHECK-NEXT: add z0.s, z0.s, z2.s
				; CHECK-NEXT: cmphi p1.s, p1/z, z1.s, z0.s
				; CHECK-NEXT: uzp1 p1.h, p2.h, p1.h
				; CHECK-NEXT: ptrue p2.h
				; CHECK-NEXT: and p0.b, p2/z, p1.b, p0.b
				; CHECK-NEXT: pfalse p1.b
				; CHECK-NEXT: zip1 p2.h, p0.h, p1.h
				; CHECK-NEXT: zip2 p0.h, p0.h, p1.h
				; CHECK-NEXT: ld1w { z0.s }, p2/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, #1, mul vl]
				; CHECK-NEXT: ret
				%pg = call <vscale x 7 x i1> @llvm.experimental.vector.extract.nxv7i1.nxv8i1(<vscale x 8 x i1> %pg.wide, i64 0)
				%load = call <vscale x 7 x i32> @llvm.masked.load.nxv7i32(<vscale x 7 x i32> *%a, i32 1, <vscale x 7 x i1> %pg, <vscale x 7 x i32> undef)
				%r = call <vscale x 8 x i32> @llvm.experimental.vector.insert.nxv8i32.nxv7i32(<vscale x 8 x i32> undef, <vscale x 7 x i32> %load, i64 0)
				ret <vscale x 8 x i32> %r
				}

				declare <vscale x 32 x i8> @llvm.masked.load.nxv32i8(<vscale x 32 x i8>*, i32, <vscale x 32 x i1>, <vscale x 32 x i8>)
				declare <vscale x 6 x i16> @llvm.masked.load.nxv6i16(<vscale x 6 x i16>*, i32, <vscale x 6 x i1>, <vscale x 6 x i16>)
				declare <vscale x 32 x i16> @llvm.masked.load.nxv32i16(<vscale x 32 x i16>*, i32, <vscale x 32 x i1>, <vscale x 32 x i16>)
				declare <vscale x 1 x i32> @llvm.masked.load.nxv1i32(<vscale x 1 x i32>*, i32, <vscale x 1 x i1>, <vscale x 1 x i32>)
	declare <vscale x 2 x i32> @llvm.masked.load.nxv2i32(<vscale x 2 x i32>*, i32, <vscale x 2 x i1>, <vscale x 2 x i32>)			declare <vscale x 2 x i32> @llvm.masked.load.nxv2i32(<vscale x 2 x i32>*, i32, <vscale x 2 x i1>, <vscale x 2 x i32>)
				declare <vscale x 3 x i32> @llvm.masked.load.nxv3i32(<vscale x 3 x i32>*, i32, <vscale x 3 x i1>, <vscale x 3 x i32>)
				declare <vscale x 6 x i32> @llvm.masked.load.nxv6i32(<vscale x 6 x i32>*, i32, <vscale x 6 x i1>, <vscale x 6 x i32>)
				declare <vscale x 7 x i32> @llvm.masked.load.nxv7i32(<vscale x 7 x i32>*, i32, <vscale x 7 x i1>, <vscale x 7 x i32>)
	declare <vscale x 8 x i32> @llvm.masked.load.nxv8i32(<vscale x 8 x i32>*, i32, <vscale x 8 x i1>, <vscale x 8 x i32>)			declare <vscale x 8 x i32> @llvm.masked.load.nxv8i32(<vscale x 8 x i32>*, i32, <vscale x 8 x i1>, <vscale x 8 x i32>)

	declare <vscale x 8 x i64> @llvm.masked.load.nxv8i64(<vscale x 8 x i64>*, i32, <vscale x 8 x i1>, <vscale x 8 x i64>)			declare <vscale x 8 x i64> @llvm.masked.load.nxv8i64(<vscale x 8 x i64>*, i32, <vscale x 8 x i1>, <vscale x 8 x i64>)
				declare <vscale x 8 x i16> @llvm.experimental.vector.insert.nxv8i16.nxv6i16(<vscale x 8 x i16>, <vscale x 6 x i16>, i64)
				declare <vscale x 4 x i32> @llvm.experimental.vector.insert.nxv4i32.nxv1i32(<vscale x 4 x i32>, <vscale x 1 x i32>, i64)
				declare <vscale x 4 x i32> @llvm.experimental.vector.insert.nxv4i32.nxv3i32(<vscale x 4 x i32>, <vscale x 3 x i32>, i64)
				declare <vscale x 8 x i32> @llvm.experimental.vector.insert.nxv8i32.nxv6i32(<vscale x 8 x i32>, <vscale x 6 x i32>, i64)
				declare <vscale x 8 x i32> @llvm.experimental.vector.insert.nxv8i32.nxv7i32(<vscale x 8 x i32>, <vscale x 7 x i32>, i64)
				declare <vscale x 1 x i1> @llvm.experimental.vector.extract.nxv1i1.nxv4i1(<vscale x 4 x i1>, i64)
				declare <vscale x 3 x i1> @llvm.experimental.vector.extract.nxv3i1.nxv4i1(<vscale x 4 x i1>, i64)
				declare <vscale x 6 x i1> @llvm.experimental.vector.extract.nxv6i1.nxv8i1(<vscale x 8 x i1>, i64)
				declare <vscale x 7 x i1> @llvm.experimental.vector.extract.nxv7i1.nxv8i1(<vscale x 8 x i1>, i64)