This is an archive of the discontinued LLVM Phabricator instance.

Differential D75751

[AArch64][SVE] Implement structured load intrinsics
ClosedPublic

Authored by c-rhodes on Mar 6 2020, 8:08 AM.

Details

Reviewers

sdesmalen
efriedma
rengolin

Commits

rGb82be5db71fb: [AArch64][SVE] Implement structured load intrinsics

Summary

This patch adds initial support for the following intrinsics:

llvm.aarch64.sve.ld2
llvm.aarch64.sve.ld3
llvm.aarch64.sve.ld4

These load two, three and four vectors' worth of data. Basic codegen is
implemented; the reg+reg and reg+imm addressing modes will be addressed
in a later patch.

The types returned by these intrinsics have a number of elements that is
a multiple of the elements in a 128-bit vector for a given type and N,
where N is the number of vectors being loaded, i.e. 2, 3 or 4. Thus, for
32-bit elements the types are:

LD2 : <vscale x 8 x i32>
LD3 : <vscale x 12 x i32>
LD4 : <vscale x 16 x i32>
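
The element counts above follow from simple arithmetic: a 128-bit vector holds 128 / (element width) elements, and the tuple type holds N times that many. A minimal sketch of that calculation, where `tupleMinElements` and `tupleTypeName` are hypothetical helper names used only for illustration and are not part of the patch:

```cpp
#include <cassert>
#include <string>

// Minimum element count (per unit of vscale) of the tuple type returned by
// an ldN intrinsic: NumVecs vectors of 128 / ElemBits elements each.
// Hypothetical helper for illustration only.
constexpr unsigned tupleMinElements(unsigned NumVecs, unsigned ElemBits) {
  return NumVecs * (128u / ElemBits);
}

// Render the scalable integer IR type, e.g. "<vscale x 8 x i32>".
std::string tupleTypeName(unsigned NumVecs, unsigned ElemBits) {
  return "<vscale x " + std::to_string(tupleMinElements(NumVecs, ElemBits)) +
         " x i" + std::to_string(ElemBits) + ">";
}
```

Calling `tupleTypeName(3, 32)` reproduces the LD3 entry in the list above; the same helper gives 32 elements for LD2 on 8-bit data, which matches the `<vscale x 32 x i8>` type used in the tests.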

This is implemented with target-specific intrinsics for each variant
that take the same operands as the IR intrinsic but return N values,
where the type of each value is a full vector, i.e. <vscale x 4 x i32>
in the above example. These values are then concatenated using the
standard concat_vectors node (ISD::CONCAT_VECTORS) to maintain type legality with the IR.

These intrinsics are intended for use in the Arm C Language
Extension (ACLE).
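
The relationship between the N per-vector results and the single concatenated tuple can be modelled outside LLVM. The sketch below is an illustration only: it assumes a fixed vscale, uses `std::vector<int>` in place of scalable vectors, and the names `concatVectors` and `tupleGet` are hypothetical stand-ins rather than LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Concatenate equally sized parts into one tuple-like vector, in order.
Vec concatVectors(const std::vector<Vec> &Parts) {
  Vec Tuple;
  for (const Vec &P : Parts)
    Tuple.insert(Tuple.end(), P.begin(), P.end());
  return Tuple;
}

// Extract part Idx from a tuple that was built from NumVecs equal parts.
Vec tupleGet(const Vec &Tuple, std::size_t NumVecs, std::size_t Idx) {
  assert(NumVecs != 0 && Tuple.size() % NumVecs == 0 && Idx < NumVecs);
  const std::size_t PartLen = Tuple.size() / NumVecs;
  return Vec(Tuple.begin() + Idx * PartLen,
             Tuple.begin() + (Idx + 1) * PartLen);
}
```

In this model, extracting index 1 from a three-part tuple returns exactly the second part that was concatenated, which is the property the IR-level tuple type relies on.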

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

c-rhodes created this revision. Mar 6 2020, 8:08 AM

Herald added a project: Restricted Project. Mar 6 2020, 8:08 AM

Herald added subscribers: danielkiss, psnobl, rkruppe and 3 others.

Harbormaster completed remote builds in B48345: Diff 248735. Mar 6 2020, 8:16 AM

c-rhodes added a parent revision: D75674: [AArch64][SVE] Implement vector tuple intrinsics. Mar 12 2020, 4:51 AM

Hi @c-rhodes ,

thank you for working on this. I am basing the addressing mode optimization for ldN on this patch, I just wanted to point out a couple of minor remarks!

Grazie,

Francesco

[Nit] Should use vscale instead of n in the commit message:

LD2 : <n x 8 x i32>
LD3 : <n x 12 x i32>
LD4 : <n x 16 x i32>

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1423	`SubRegIdx` is always set to `AArch64::zsub0`. Can we remove it from the parameter list of the method and use `AArch64::zsub0` directly inside the function?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9850	[Nit] Might be worth asserting that `VT.getVectorElementCount() % N == 0`.

fpetrogalli added inline comments. Mar 18 2020, 7:46 AM

llvm/include/llvm/IR/IntrinsicsAArch64.td
818	Question: you have three overloaded operands here. How come you need to specify only one of them in the intrinsic name? If I look at one of your tests:

%res = call <vscale x 32 x i8> @llvm.aarch64.sve.ld2.nxv32i8(<vscale x 16 x i1> %pred, <vscale x 16 x i8>* %addr)

By the definition you have specified here, I was expecting to see the following intrinsic: `@llvm.aarch64.sve.ld2.nxv32i8.nxv16i8.p0nxv16i8`. Am I missing something?

Thanks for the comments @fpetrogalli! I'll update this patch to address your comments.

llvm/include/llvm/IR/IntrinsicsAArch64.td
818	You're right, there should be a suffix for each overloaded type in the intrinsic name. I initially implemented each load intrinsic separately with special types where the predicate and address were based on the return type, so something like `LLVMVectorHalfWidth<0, llvm_i1_ty>` and `LLVMPointerToHalfWidthVector<0>` for LD2 args. With that definition only the return type was overloaded, and I missed updating the intrinsics in the tests. It is interesting that there are no issues without specifying each overloaded type; I'll fix the tests.
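
The naming rule discussed in this exchange (one suffix per overloaded type, appended to the base intrinsic name) can be sketched as plain string composition. `mangleIntrinsicName` is a hypothetical helper written for this illustration; it is not how LLVM itself computes intrinsic names, and the suffix strings are taken from the committed tests, which use `nxv16i1` for the predicate.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Append one ".suffix" per overloaded type to the base intrinsic name.
// Hypothetical helper; suffixes are supplied by the caller, not derived
// from real LLVM types.
std::string mangleIntrinsicName(const std::string &Base,
                                const std::vector<std::string> &Suffixes) {
  std::string Name = Base;
  for (const std::string &S : Suffixes) {
    Name += '.';
    Name += S;
  }
  return Name;
}
```

With the three overloaded types of `ld2` on 8-bit data (return type, predicate, pointer) this yields `llvm.aarch64.sve.ld2.nxv32i8.nxv16i1.p0nxv16i8`, the form used in the updated test file.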

fpetrogalli added inline comments. Mar 18 2020, 6:54 PM

llvm/test/CodeGen/AArch64/sve-intrinsics-loads-with-extract.ll
23 ↗	(On Diff #248735)	I just realized... do we need these tests? I might have missed something, but it seems to me that none of the code you have added here is related to the interaction between `ld<N>` and `tuple.get`. For example, the fact that in this specific test the load is loading memory in `z31` and `z0`, in this order, seems to me more of a register allocation problem than anything related to the fact that `%res` is coming from a `ld<N>` intrinsic.

I think that we can simplify these tests by removing the `ld<N>` and just writing tests that involve using the `tuple.get` on the input parameter of the test. For example, in the specific case of `@ld2b_i8_1`, I think you want to run `llc` on this code:

define <vscale x 16 x i8> @ld2b_i8_1(<vscale x 32 x i8> %res) {
; CHECK-LABEL: ld2b_i8_1:
; CHECK: mov z0.d, z1.d
; CHECK-NEXT: ret
  %v2 = call <vscale x 16 x i8> @llvm.aarch64.sve.tuple.get.nxv32i8(<vscale x 32 x i8> %res, i32 1)
  ret <vscale x 16 x i8> %v2
}

What you would be testing here is that whatever register is assigned to the input parameter, it is shuffled around by the intrinsic to be returned by the correct output register.

If you agree on this approach though, I think this deserves a separate patch.

One last (minor) extra reason for splitting things like I suggested: if we make these tests `ld<N>`-free, we won't have to bother updating/extending them when we need to do any work around the `ld<N>`.

c-rhodes added inline comments. Mar 19 2020, 3:05 AM

llvm/test/CodeGen/AArch64/sve-intrinsics-loads-with-extract.ll
23 ↗	(On Diff #248735)	I implemented these tests before we had calling convention support for passing and returning tuples of scalable vectors to/from functions (downstream) so the extract was necessary. There are already tests similar to what you suggested in D75674 covering `tuple.get` that should be sufficient (?), I can remove these if they're not adding anything.

fpetrogalli added inline comments. Mar 19 2020, 5:05 AM

llvm/test/CodeGen/AArch64/sve-intrinsics-loads-with-extract.ll
23 ↗	(On Diff #248735)	I think you can safely remove them. As you said, they were written when we didn't have the tuple calling conventions, but now we have them, and they are tested in https://reviews.llvm.org/D75674.

Nit: it might not be a problem, but in https://reviews.llvm.org/D75674 I cannot see tests that check when a tuple like <vscale x 8 x i64> is passed to a function definition. I see tests that take multiple "one-register" parameters as input, build a tuple with an intrinsic and pass it to a `tuple.get` intrinsic or to a function call, but not a test like the one I extracted, in which the tuple is an input to the define. I'll leave a note there and draw @sdesmalen's attention to it to see whether we should add them.

Thanks, Francesco

fpetrogalli added a child revision: D77251: [llvm][CodeGen] Addressing modes for SVE ldN.. Apr 1 2020, 4:03 PM

Changes:

Rebased.
Move lowering to DAGCombine pre-legalisation.
Added assert for VT.getVectorElementCount().Min % N == 0.
Removed test/CodeGen/AArch64/sve-intrinsics-loads-with-extract.ll.
Added missing overloaded argument types to intrinsics in test/CodeGen/AArch64/sve-intrinsics-loads.ll.
Removed SubRegIdx from SelectPredicatedLoad since AArch64::zsub0 is always used.
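
The divisibility assert mentioned in the list above guards the step that splits the tuple's element count into N equal per-vector counts. A minimal sketch of that check, assuming plain unsigned counts instead of LLVM's element-count type; `splitTupleElementCount` is a hypothetical name used only here:

```cpp
#include <cassert>

// Split a tuple's minimum element count into NumVecs equal parts.
// Returns false (leaving SplitMin untouched) when the count does not divide
// evenly, which is the situation the assert in the patch rejects.
bool splitTupleElementCount(unsigned TupleMin, unsigned NumVecs,
                            unsigned &SplitMin) {
  if (NumVecs == 0 || TupleMin % NumVecs != 0)
    return false;
  SplitMin = TupleMin / NumVecs;
  return true;
}
```

For an LD3 of 32-bit elements the tuple count of 12 splits into three vectors of 4 elements, whereas a count such as 10 with N = 4 is rejected.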

c-rhodes marked 4 inline comments as done. May 26 2020, 9:01 AM

c-rhodes added inline comments.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1423	Done, you'll have to rebase D77251 once this is merged

ping

sdesmalen added inline comments. Jun 8 2020, 9:48 AM

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1463	Is the above statement correct when having just replaced `SDValue(N, NumVecs-1)` in the loop above?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9822	nit: call this `LowerSVEStructLoad` ?

LGTM if you can add the extra comment.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1463	Sorry, never mind that comment, I now see it is taking the chain from Load, not N. Can you add a comment clarifying this though? (that it is copying the chain here)?

This revision is now accepted and ready to land. Jun 8 2020, 9:55 AM

Changes:

Rename LowerSVEStructuredLoad -> LowerSVEStructLoad.
Clarify copy chain in SelectPredicatedLoad.

c-rhodes marked 2 inline comments as done. Jun 8 2020, 10:43 AM

c-rhodes added inline comments.

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp
1463	Yeah it's just copying the chain, agree it's confusing using `NumVecs` as the index, I've updated it.
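
The indexing that caused the confusion in this thread can be modelled in a few lines: a node that yields `NumVecs` vectors followed by a chain has its chain at result index `NumVecs`. This is a simplified model with hypothetical names (`ResultKind`, `structLoadResults`, `chainIndex`), not SelectionDAG code.

```cpp
#include <cassert>
#include <vector>

enum class ResultKind { Vector, Chain };

// Result list of a node producing NumVecs vector values followed by a chain.
std::vector<ResultKind> structLoadResults(unsigned NumVecs) {
  std::vector<ResultKind> Results(NumVecs, ResultKind::Vector);
  Results.push_back(ResultKind::Chain);
  return Results;
}

// Vector results occupy indices 0 .. NumVecs-1, so the chain sits at NumVecs.
unsigned chainIndex(unsigned NumVecs) { return NumVecs; }
```

For an LD3 the vector results are at indices 0, 1 and 2 and the chain is at index 3; the selected machine node in the patch has only two results, the untyped super-register at index 0 and the chain at index 1.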

Closed by commit rGb82be5db71fb: [AArch64][SVE] Implement structured load intrinsics (authored by c-rhodes). Jun 9 2020, 2:09 AM

This revision was automatically updated to reflect the committed changes.

fpetrogalli removed a child revision: D77251: [llvm][CodeGen] Addressing modes for SVE ldN.. Jul 6 2020, 9:28 AM

fpetrogalli added a child revision: D77251: [llvm][CodeGen] Addressing modes for SVE ldN.. Jul 6 2020, 9:30 AM

Revision Contents

Path                                                  Size
llvm/include/llvm/IR/IntrinsicsAArch64.td             8 lines
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp       73 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.h         7 lines
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp       67 lines
llvm/test/CodeGen/AArch64/sve-intrinsics-loads.ll     264 lines

Diff 269445

llvm/include/llvm/IR/IntrinsicsAArch64.td

[808 lines not shown; context: class AdvSIMD_SVE_Set_Vector_Tuple]
     : Intrinsic<[llvm_anyvector_ty],
                 [LLVMMatchType<0>, llvm_i32_ty, llvm_anyvector_ty],
                 [IntrReadMem, ImmArg<ArgIndex<1>>]>;

   class AdvSIMD_SVE_Get_Vector_Tuple
     : Intrinsic<[llvm_anyvector_ty], [llvm_anyvector_ty, llvm_i32_ty],
                 [IntrReadMem, IntrArgMemOnly, ImmArg<ArgIndex<1>>]>;

+  class AdvSIMD_ManyVec_PredLoad_Intrinsic
+    : Intrinsic<[llvm_anyvector_ty], [llvm_anyvector_ty, llvm_anyptr_ty],
+                [IntrReadMem, IntrArgMemOnly]>;
+
   class AdvSIMD_1Vec_PredLoad_Intrinsic
     : Intrinsic<[llvm_anyvector_ty],
                 [LLVMScalarOrSameVectorWidth<0, llvm_i1_ty>,
                  LLVMPointerToElt<0>],
                 [IntrReadMem, IntrArgMemOnly]>;

   class AdvSIMD_1Vec_PredStore_Intrinsic
     : Intrinsic<[],
[516 lines not shown]
 def int_aarch64_sve_tuple_set : AdvSIMD_SVE_Set_Vector_Tuple;

 //
 // Loads
 //

 def int_aarch64_sve_ld1 : AdvSIMD_1Vec_PredLoad_Intrinsic;

+def int_aarch64_sve_ld2 : AdvSIMD_ManyVec_PredLoad_Intrinsic;
+def int_aarch64_sve_ld3 : AdvSIMD_ManyVec_PredLoad_Intrinsic;
+def int_aarch64_sve_ld4 : AdvSIMD_ManyVec_PredLoad_Intrinsic;
+
 def int_aarch64_sve_ldnt1 : AdvSIMD_1Vec_PredLoad_Intrinsic;
 def int_aarch64_sve_ldnf1 : AdvSIMD_1Vec_PredLoad_Intrinsic;
 def int_aarch64_sve_ldff1 : AdvSIMD_1Vec_PredLoad_Intrinsic;

 def int_aarch64_sve_ld1rq : AdvSIMD_1Vec_PredLoad_Intrinsic;
 def int_aarch64_sve_ld1ro : AdvSIMD_1Vec_PredLoad_Intrinsic;

 //
[978 lines not shown]

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp

[239 lines not shown; context: public:]
   void SelectTagP(SDNode *N);

   void SelectLoad(SDNode *N, unsigned NumVecs, unsigned Opc,
                   unsigned SubRegIdx);
   void SelectPostLoad(SDNode *N, unsigned NumVecs, unsigned Opc,
                       unsigned SubRegIdx);
   void SelectLoadLane(SDNode *N, unsigned NumVecs, unsigned Opc);
   void SelectPostLoadLane(SDNode *N, unsigned NumVecs, unsigned Opc);
+  void SelectPredicatedLoad(SDNode *N, unsigned NumVecs, const unsigned Opc);

   bool SelectAddrModeFrameIndexSVE(SDValue N, SDValue &Base, SDValue &OffImm);
   /// SVE Reg+Imm addressing mode.
   template <int64_t Min, int64_t Max>
   bool SelectAddrModeIndexedSVE(SDNode *Root, SDValue N, SDValue &Base,
                                 SDValue &OffImm);
   /// SVE Reg+Reg address mode.
   template <unsigned Scale>
[1,158 lines not shown; context: void AArch64DAGToDAGISel::SelectPostLoad(SDNode *N, unsigned NumVecs,]

   // Update the chain
   ReplaceUses(SDValue(N, NumVecs + 1), SDValue(Ld, 2));
   CurDAG->RemoveDeadNode(N);
 }

 /// Optimize \param OldBase and \param OldOffset selecting the best addressing
 /// mode. Returns a tuple consisting of an Opcode, an SDValue representing the
 /// new Base and an SDValue representing the new offset.
 template <unsigned Scale>
 std::tuple<unsigned, SDValue, SDValue>
 AArch64DAGToDAGISel::findAddrModeSVELoadStore(SDNode *N, const unsigned Opc_rr,
                                               const unsigned Opc_ri,
                                               const SDValue &OldBase,
                                               const SDValue &OldOffset) {
   SDValue NewBase = OldBase;
   SDValue NewOffset = OldOffset;
   // Detect a possible Reg+Imm addressing mode.
   const bool IsRegImm = SelectAddrModeIndexedSVE</*Min=*/-8, /*Max=*/7>(
       N, OldBase, NewBase, NewOffset);

   // Detect a possible reg+reg addressing mode, but only if we haven't already
   // detected a Reg+Imm one.
   const bool IsRegReg =
       !IsRegImm && SelectSVERegRegAddrMode<Scale>(OldBase, NewBase, NewOffset);

   // Select the instruction.
   return std::make_tuple(IsRegReg ? Opc_rr : Opc_ri, NewBase, NewOffset);
 }

+void AArch64DAGToDAGISel::SelectPredicatedLoad(SDNode *N, unsigned NumVecs,
+                                               const unsigned Opc) {
+  SDLoc DL(N);
+  EVT VT = N->getValueType(0);
+  SDValue Chain = N->getOperand(0);
+
+  SDValue Ops[] = {N->getOperand(1), // Predicate
+                   N->getOperand(2), // Memory operand
+                   CurDAG->getTargetConstant(0, DL, MVT::i64), Chain};
+
+  const EVT ResTys[] = {MVT::Untyped, MVT::Other};
+
+  SDNode *Load = CurDAG->getMachineNode(Opc, DL, ResTys, Ops);
+  SDValue SuperReg = SDValue(Load, 0);
+  for (unsigned i = 0; i < NumVecs; ++i)
+    ReplaceUses(SDValue(N, i), CurDAG->getTargetExtractSubreg(
+                                   AArch64::zsub0 + i, DL, VT, SuperReg));
+
+  // Copy chain
+  unsigned ChainIdx = NumVecs;
+  ReplaceUses(SDValue(N, ChainIdx), SDValue(Load, 1));
+  CurDAG->RemoveDeadNode(N);
+}
+
 void AArch64DAGToDAGISel::SelectStore(SDNode *N, unsigned NumVecs,
                                       unsigned Opc) {
   SDLoc dl(N);
   EVT VT = N->getOperand(2)->getValueType(0);

   // Form a REG_SEQUENCE to force register allocation.
   bool Is128Bit = VT.getSizeInBits() == 128;
   SmallVector<SDValue, 4> Regs(N->op_begin() + 2, N->op_begin() + 2 + NumVecs);
[3,146 lines not shown; context: if (VT == MVT::v16i8 || VT == MVT::v8i8) {]
       return;
     } else if (VT == MVT::v2i64 || VT == MVT::v1i64 || VT == MVT::v2f64 ||
                VT == MVT::v1f64) {
       SelectPostStoreLane(Node, 4, AArch64::ST4i64_POST);
       return;
     }
     break;
   }
+  case AArch64ISD::SVE_LD2: {
+    if (VT == MVT::nxv16i8) {
+      SelectPredicatedLoad(Node, 2, AArch64::LD2B_IMM);
+      return;
+    } else if (VT == MVT::nxv8i16 || VT == MVT::nxv8f16) {
+      SelectPredicatedLoad(Node, 2, AArch64::LD2H_IMM);
+      return;
+    } else if (VT == MVT::nxv4i32 || VT == MVT::nxv4f32) {
+      SelectPredicatedLoad(Node, 2, AArch64::LD2W_IMM);
+      return;
+    } else if (VT == MVT::nxv2i64 || VT == MVT::nxv2f64) {
+      SelectPredicatedLoad(Node, 2, AArch64::LD2D_IMM);
+      return;
+    }
+    break;
+  }
+  case AArch64ISD::SVE_LD3: {
+    if (VT == MVT::nxv16i8) {
+      SelectPredicatedLoad(Node, 3, AArch64::LD3B_IMM);
+      return;
+    } else if (VT == MVT::nxv8i16 || VT == MVT::nxv8f16) {
+      SelectPredicatedLoad(Node, 3, AArch64::LD3H_IMM);
+      return;
+    } else if (VT == MVT::nxv4i32 || VT == MVT::nxv4f32) {
+      SelectPredicatedLoad(Node, 3, AArch64::LD3W_IMM);
+      return;
+    } else if (VT == MVT::nxv2i64 || VT == MVT::nxv2f64) {
+      SelectPredicatedLoad(Node, 3, AArch64::LD3D_IMM);
+      return;
+    }
+    break;
+  }
+  case AArch64ISD::SVE_LD4: {
+    if (VT == MVT::nxv16i8) {
+      SelectPredicatedLoad(Node, 4, AArch64::LD4B_IMM);
+      return;
+    } else if (VT == MVT::nxv8i16 || VT == MVT::nxv8f16) {
+      SelectPredicatedLoad(Node, 4, AArch64::LD4H_IMM);
+      return;
+    } else if (VT == MVT::nxv4i32 || VT == MVT::nxv4f32) {
+      SelectPredicatedLoad(Node, 4, AArch64::LD4W_IMM);
+      return;
+    } else if (VT == MVT::nxv2i64 || VT == MVT::nxv2f64) {
+      SelectPredicatedLoad(Node, 4, AArch64::LD4D_IMM);
+      return;
+    }
+    break;
+  }
   }

   // Select the default instruction
   SelectCode(Node);
 }

 /// createAArch64ISelDag - This pass converts a legalized DAG into a
 /// AArch64-specific DAG, ready for instruction scheduling.
[131 lines not shown]
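
The three `case` blocks added to the selector follow one pattern: the number of vectors picks the `LD2`/`LD3`/`LD4` prefix and the element width picks the `B`/`H`/`W`/`D` size letter. A rough sketch of that mapping, using strings in place of the real `AArch64::` opcode enumerators; `sizeLetter` and `structLoadOpcodeName` are hypothetical names and this is not how the patch is written:

```cpp
#include <cassert>
#include <string>

// Map an element width in bits to the SVE size letter used in opcode names.
// Returns '\0' for widths the structured loads in this patch do not handle.
char sizeLetter(unsigned ElemBits) {
  switch (ElemBits) {
  case 8:  return 'B';
  case 16: return 'H';
  case 32: return 'W';
  case 64: return 'D';
  default: return '\0';
  }
}

// Build the opcode name "LD<N><letter>_IMM", or an empty string when the
// combination is not one of the twelve cases in the switch above.
std::string structLoadOpcodeName(unsigned NumVecs, unsigned ElemBits) {
  const char Letter = sizeLetter(ElemBits);
  if (NumVecs < 2 || NumVecs > 4 || Letter == '\0')
    return std::string();
  return "LD" + std::to_string(NumVecs) + Letter + "_IMM";
}
```

Integer and floating-point types of the same width share an opcode in the patch (for example `nxv4i32` and `nxv4f32` both select `LD3W_IMM` for LD3), which is why the sketch keys only on the element width.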

llvm/lib/Target/AArch64/AArch64ISelLowering.h

[249 lines not shown; context: enum NodeType : unsigned {]
   LD1S,
   LDNF1,
   LDNF1S,
   LDFF1,
   LDFF1S,
   LD1RQ,
   LD1RO,

+  // Structured loads.
+  SVE_LD2,
+  SVE_LD3,
+  SVE_LD4,
+
   // Unsigned gather loads.
   GLD1,
   GLD1_SCALED,
   GLD1_UXTW,
   GLD1_SXTW,
   GLD1_UXTW_SCALED,
   GLD1_SXTW_SCALED,
   GLD1_IMM,
[564 lines not shown; context: private:]
   SDValue LowerVSCALE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVECREDUCE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerATOMIC_LOAD_SUB(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerATOMIC_LOAD_AND(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerDYNAMIC_STACKALLOC(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerWindowsDYNAMIC_STACKALLOC(SDValue Op, SDValue Chain,
                                          SDValue &Size,
                                          SelectionDAG &DAG) const;
+  SDValue LowerSVEStructLoad(unsigned Intrinsic, ArrayRef<SDValue> LoadOps,
+                             EVT VT, SelectionDAG &DAG, const SDLoc &DL) const;

   SDValue BuildSDIVPow2(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
                         SmallVectorImpl<SDNode *> &Created) const override;
   SDValue getSqrtEstimate(SDValue Operand, SelectionDAG &DAG, int Enabled,
                           int &ExtraSteps, bool &UseOneConst,
                           bool Reciprocal) const override;
   SDValue getRecipEstimate(SDValue Operand, SelectionDAG &DAG, int Enabled,
                            int &ExtraSteps) const override;
[66 lines not shown]

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[1,461 lines not shown; context: const char *AArch64TargetLowering::getTargetNodeName(unsigned Opcode) const {]
   case AArch64ISD::LD1: return "AArch64ISD::LD1";
   case AArch64ISD::LD1S: return "AArch64ISD::LD1S";
   case AArch64ISD::LDNF1: return "AArch64ISD::LDNF1";
   case AArch64ISD::LDNF1S: return "AArch64ISD::LDNF1S";
   case AArch64ISD::LDFF1: return "AArch64ISD::LDFF1";
   case AArch64ISD::LDFF1S: return "AArch64ISD::LDFF1S";
   case AArch64ISD::LD1RQ: return "AArch64ISD::LD1RQ";
   case AArch64ISD::LD1RO: return "AArch64ISD::LD1RO";
+  case AArch64ISD::SVE_LD2: return "AArch64ISD::SVE_LD2";
+  case AArch64ISD::SVE_LD3: return "AArch64ISD::SVE_LD3";
+  case AArch64ISD::SVE_LD4: return "AArch64ISD::SVE_LD4";
   case AArch64ISD::GLD1: return "AArch64ISD::GLD1";
   case AArch64ISD::GLD1_SCALED: return "AArch64ISD::GLD1_SCALED";
   case AArch64ISD::GLD1_SXTW: return "AArch64ISD::GLD1_SXTW";
   case AArch64ISD::GLD1_UXTW: return "AArch64ISD::GLD1_UXTW";
   case AArch64ISD::GLD1_SXTW_SCALED: return "AArch64ISD::GLD1_SXTW_SCALED";
   case AArch64ISD::GLD1_UXTW_SCALED: return "AArch64ISD::GLD1_UXTW_SCALED";
   case AArch64ISD::GLD1_IMM: return "AArch64ISD::GLD1_IMM";
   case AArch64ISD::GLD1S: return "AArch64ISD::GLD1S";
[8,313 lines not shown; context: if (StoreCount > 0)]
                                             BaseAddr, LaneLen * Factor);

     Ops.push_back(Builder.CreateBitCast(BaseAddr, PtrTy));
     Builder.CreateCall(StNFunc, Ops);
   }
   return true;
 }

+// Lower an SVE structured load intrinsic returning a tuple type to target
+// specific intrinsic taking the same input but returning a multi-result value
+// of the split tuple type.
+//
+// E.g. Lowering an LD3:
+//
+//  call <vscale x 12 x i32> @llvm.aarch64.sve.ld3.nxv12i32(
+//                                                    <vscale x 4 x i1> %pred,
+//                                                    <vscale x 4 x i32>* %addr)
+//
+// Output DAG:
+//
+//    t0: ch = EntryToken
+//        t2: nxv4i1,ch = CopyFromReg t0, Register:nxv4i1 %0
+//        t4: i64,ch = CopyFromReg t0, Register:i64 %1
+//    t5: nxv4i32,nxv4i32,nxv4i32,ch = AArch64ISD::SVE_LD3 t0, t2, t4
+//    t6: nxv12i32 = concat_vectors t5, t5:1, t5:2
+//
+// This is called pre-legalization to avoid widening/splitting issues with
+// non-power-of-2 tuple types used for LD3, such as nxv12i32.
+SDValue AArch64TargetLowering::LowerSVEStructLoad(unsigned Intrinsic,
+                                                  ArrayRef<SDValue> LoadOps,
+                                                  EVT VT, SelectionDAG &DAG,
+                                                  const SDLoc &DL) const {
+  assert(VT.isScalableVector() && "Can only lower scalable vectors");
+
+  unsigned N, Opcode;
+  static std::map<unsigned, std::pair<unsigned, unsigned>> IntrinsicMap = {
+      {Intrinsic::aarch64_sve_ld2, {2, AArch64ISD::SVE_LD2}},
+      {Intrinsic::aarch64_sve_ld3, {3, AArch64ISD::SVE_LD3}},
+      {Intrinsic::aarch64_sve_ld4, {4, AArch64ISD::SVE_LD4}}};
+
+  std::tie(N, Opcode) = IntrinsicMap[Intrinsic];
+  assert(VT.getVectorElementCount().Min % N == 0 &&
+         "invalid tuple vector type!");
+
+  EVT SplitVT = EVT::getVectorVT(*DAG.getContext(), VT.getVectorElementType(),
+                                 VT.getVectorElementCount() / N);
+  assert(isTypeLegal(SplitVT));
+
+  SmallVector<EVT, 5> VTs(N, SplitVT);
+  VTs.push_back(MVT::Other); // Chain
+  SDVTList NodeTys = DAG.getVTList(VTs);
+
+  SDValue PseudoLoad = DAG.getNode(Opcode, DL, NodeTys, LoadOps);
+  SmallVector<SDValue, 4> PseudoLoadOps;
+  for (unsigned I = 0; I < N; ++I)
+    PseudoLoadOps.push_back(SDValue(PseudoLoad.getNode(), I));
+  return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, PseudoLoadOps);
+}

 EVT AArch64TargetLowering::getOptimalMemOpType(
     const MemOp &Op, const AttributeList &FuncAttributes) const {
   bool CanImplicitFloat =
       !FuncAttributes.hasFnAttribute(Attribute::NoImplicitFloat);
   bool CanUseNEON = Subtarget->hasNEON() && CanImplicitFloat;
   bool CanUseFP = Subtarget->hasFPARMv8() && CanImplicitFloat;
   // Only use AdvSIMD to implement memset of 32-byte and above. It would have
[3,916 lines not shown; context: case Intrinsic::aarch64_sve_tuple_create4: {]
       EVT VT = Opnds[0].getValueType();
       EVT EltVT = VT.getVectorElementType();
       EVT DestVT = EVT::getVectorVT(*DAG.getContext(), EltVT,
                                     VT.getVectorElementCount() *
                                         (N->getNumOperands() - 2));
       SDValue Concat = DAG.getNode(ISD::CONCAT_VECTORS, DL, DestVT, Opnds);
       return DAG.getMergeValues({Concat, Chain}, DL);
     }
+    case Intrinsic::aarch64_sve_ld2:
+    case Intrinsic::aarch64_sve_ld3:
+    case Intrinsic::aarch64_sve_ld4: {
+      SDLoc DL(N);
+      SDValue Chain = N->getOperand(0);
+      SDValue Mask = N->getOperand(2);
+      SDValue BasePtr = N->getOperand(3);
+      SDValue LoadOps[] = {Chain, Mask, BasePtr};
+      unsigned IntrinsicID =
+          cast<ConstantSDNode>(N->getOperand(1))->getZExtValue();
+      SDValue Result =
+          LowerSVEStructLoad(IntrinsicID, LoadOps, N->getValueType(0), DAG, DL);
+      return DAG.getMergeValues({Result, Chain}, DL);
+    }
     default:
       break;
     }
     break;
   case ISD::GlobalAddress:
     return performGlobalAddressCombine(N, DAG, Subtarget, getTargetMachine());
   }
   return SDValue();
[771 lines not shown]
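
One property of the `IntrinsicMap[Intrinsic]` lookup in `LowerSVEStructLoad` is worth noting: `std::map::operator[]` inserts a value-initialized entry when the key is absent, so an unexpected intrinsic ID would produce N == 0 rather than a lookup failure. The patch only calls the function for the three ldN intrinsics, so this does not arise there. As an alternative formulation only, the sketch below expresses the same mapping with an explicit failure path; the enums are stand-ins for the `Intrinsic::aarch64_sve_ldN` and `AArch64ISD::SVE_LDN` values and `lookupStructLoad` is a hypothetical name.

```cpp
#include <cassert>

// Stand-ins for the real intrinsic IDs and target node opcodes.
enum class LoadIntrinsic { LD2, LD3, LD4, Unrelated };
enum class PseudoOpcode { SVE_LD2, SVE_LD3, SVE_LD4 };

// Map an intrinsic to (number of vectors, pseudo opcode).
// Returns false and leaves the outputs untouched for any other intrinsic.
bool lookupStructLoad(LoadIntrinsic ID, unsigned &NumVecs,
                      PseudoOpcode &Opcode) {
  switch (ID) {
  case LoadIntrinsic::LD2:
    NumVecs = 2;
    Opcode = PseudoOpcode::SVE_LD2;
    return true;
  case LoadIntrinsic::LD3:
    NumVecs = 3;
    Opcode = PseudoOpcode::SVE_LD3;
    return true;
  case LoadIntrinsic::LD4:
    NumVecs = 4;
    Opcode = PseudoOpcode::SVE_LD4;
    return true;
  case LoadIntrinsic::Unrelated:
    return false;
  }
  return false;
}
```

A caller can then check the return value before using N as a divisor, instead of relying on the key always being present.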

llvm/test/CodeGen/AArch64/sve-intrinsics-loads.ll

	; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s \| FileCheck %s			; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve -asm-verbose=0 < %s \| FileCheck %s

;
; LD1RQB
;

define <vscale x 16 x i8> @ld1rqb_i8(<vscale x 16 x i1> %pred, i8* %addr) {
; CHECK-LABEL: ld1rqb_i8:
; CHECK: ld1rqb { z0.b }, p0/z, [x0]
[...]
; CHECK-LABEL: ldnt1d_f64:
; CHECK: ldnt1d { z0.d }, p0/z, [x0]
; CHECK-NEXT: ret
  %res = call <vscale x 2 x double> @llvm.aarch64.sve.ldnt1.nxv2f64(<vscale x 2 x i1> %pred,
                                                                 double* %addr)
  ret <vscale x 2 x double> %res
}

				;
				; LD2B
				;

				define <vscale x 32 x i8> @ld2b_i8(<vscale x 16 x i1> %pred, <vscale x 16 x i8>* %addr) {
				; CHECK-LABEL: ld2b_i8:
				; CHECK: ld2b { z0.b, z1.b }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 32 x i8> @llvm.aarch64.sve.ld2.nxv32i8.nxv16i1.p0nxv16i8(<vscale x 16 x i1> %pred,
				<vscale x 16 x i8>* %addr)
				ret <vscale x 32 x i8> %res
				}

				;
				; LD2H
				;

				define <vscale x 16 x i16> @ld2h_i16(<vscale x 8 x i1> %pred, <vscale x 8 x i16>* %addr) {
				; CHECK-LABEL: ld2h_i16:
				; CHECK: ld2h { z0.h, z1.h }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i16> @llvm.aarch64.sve.ld2.nxv16i16.nxv8i1.p0nxv8i16(<vscale x 8 x i1> %pred,
				<vscale x 8 x i16>* %addr)
				ret <vscale x 16 x i16> %res
				}

				define <vscale x 16 x half> @ld2h_f16(<vscale x 8 x i1> %pred, <vscale x 8 x half>* %addr) {
				; CHECK-LABEL: ld2h_f16:
				; CHECK: ld2h { z0.h, z1.h }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x half> @llvm.aarch64.sve.ld2.nxv16f16.nxv8i1.p0nxv8f16(<vscale x 8 x i1> %pred,
				<vscale x 8 x half>* %addr)
				ret <vscale x 16 x half> %res
				}

				;
				; LD2W
				;

				define <vscale x 8 x i32> @ld2w_i32(<vscale x 4 x i1> %pred, <vscale x 4 x i32>* %addr) {
				; CHECK-LABEL: ld2w_i32:
				; CHECK: ld2w { z0.s, z1.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i32> @llvm.aarch64.sve.ld2.nxv8i32.nxv4i1.p0nxv4i32(<vscale x 4 x i1> %pred,
				<vscale x 4 x i32>* %addr)
				ret <vscale x 8 x i32> %res
				}

				define <vscale x 8 x float> @ld2w_f32(<vscale x 4 x i1> %pred, <vscale x 4 x float>* %addr) {
				; CHECK-LABEL: ld2w_f32:
				; CHECK: ld2w { z0.s, z1.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x float> @llvm.aarch64.sve.ld2.nxv8f32.nxv4i1.p0nxv4f32(<vscale x 4 x i1> %pred,
				<vscale x 4 x float>* %addr)
				ret <vscale x 8 x float> %res
				}

				;
				; LD2D
				;

				define <vscale x 4 x i64> @ld2d_i64(<vscale x 2 x i1> %pred, <vscale x 2 x i64>* %addr) {
				; CHECK-LABEL: ld2d_i64:
				; CHECK: ld2d { z0.d, z1.d }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x i64> @llvm.aarch64.sve.ld2.nxv4i64.nxv2i1.p0nxv2i64(<vscale x 2 x i1> %pred,
				<vscale x 2 x i64>* %addr)
				ret <vscale x 4 x i64> %res
				}

				define <vscale x 4 x double> @ld2d_f64(<vscale x 2 x i1> %pred, <vscale x 2 x double>* %addr) {
				; CHECK-LABEL: ld2d_f64:
				; CHECK: ld2d { z0.d, z1.d }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x double> @llvm.aarch64.sve.ld2.nxv4f64.nxv2i1.p0nxv2f64(<vscale x 2 x i1> %pred,
				<vscale x 2 x double>* %addr)
				ret <vscale x 4 x double> %res
				}

				;
				; LD3B
				;

				define <vscale x 48 x i8> @ld3b_i8(<vscale x 16 x i1> %pred, <vscale x 16 x i8>* %addr) {
				; CHECK-LABEL: ld3b_i8:
				; CHECK: ld3b { z0.b, z1.b, z2.b }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 48 x i8> @llvm.aarch64.sve.ld3.nxv48i8.nxv16i1.p0nxv16i8(<vscale x 16 x i1> %pred,
				<vscale x 16 x i8>* %addr)
				ret <vscale x 48 x i8> %res
				}

				;
				; LD3H
				;

				define <vscale x 24 x i16> @ld3h_i16(<vscale x 8 x i1> %pred, <vscale x 8 x i16>* %addr) {
				; CHECK-LABEL: ld3h_i16:
				; CHECK: ld3h { z0.h, z1.h, z2.h }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 24 x i16> @llvm.aarch64.sve.ld3.nxv24i16.nxv8i1.p0nxv8i16(<vscale x 8 x i1> %pred,
				<vscale x 8 x i16>* %addr)
				ret <vscale x 24 x i16> %res
				}

				define <vscale x 24 x half> @ld3h_f16(<vscale x 8 x i1> %pred, <vscale x 8 x half>* %addr) {
				; CHECK-LABEL: ld3h_f16:
				; CHECK: ld3h { z0.h, z1.h, z2.h }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 24 x half> @llvm.aarch64.sve.ld3.nxv24f16.nxv8i1.p0nxv8f16(<vscale x 8 x i1> %pred,
				<vscale x 8 x half>* %addr)
				ret <vscale x 24 x half> %res
				}

				;
				; LD3W
				;

				define <vscale x 12 x i32> @ld3w_i32(<vscale x 4 x i1> %pred, <vscale x 4 x i32>* %addr) {
				; CHECK-LABEL: ld3w_i32:
				; CHECK: ld3w { z0.s, z1.s, z2.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 12 x i32> @llvm.aarch64.sve.ld3.nxv12i32.nxv4i1.p0nxv4i32(<vscale x 4 x i1> %pred,
				<vscale x 4 x i32>* %addr)
				ret <vscale x 12 x i32> %res
				}

				define <vscale x 12 x float> @ld3w_f32(<vscale x 4 x i1> %pred, <vscale x 4 x float>* %addr) {
				; CHECK-LABEL: ld3w_f32:
				; CHECK: ld3w { z0.s, z1.s, z2.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 12 x float> @llvm.aarch64.sve.ld3.nxv12f32.nxv4i1.p0nxv4f32(<vscale x 4 x i1> %pred,
				<vscale x 4 x float>* %addr)
				ret <vscale x 12 x float> %res
				}

				;
				; LD3D
				;

				define <vscale x 6 x i64> @ld3d_i64(<vscale x 2 x i1> %pred, <vscale x 2 x i64>* %addr) {
				; CHECK-LABEL: ld3d_i64:
				; CHECK: ld3d { z0.d, z1.d, z2.d }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 6 x i64> @llvm.aarch64.sve.ld3.nxv6i64.nxv2i1.p0nxv2i64(<vscale x 2 x i1> %pred,
				<vscale x 2 x i64>* %addr)
				ret <vscale x 6 x i64> %res
				}

				define <vscale x 6 x double> @ld3d_f64(<vscale x 2 x i1> %pred, <vscale x 2 x double>* %addr) {
				; CHECK-LABEL: ld3d_f64:
				; CHECK: ld3d { z0.d, z1.d, z2.d }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 6 x double> @llvm.aarch64.sve.ld3.nxv6f64.nxv2i1.p0nxv2f64(<vscale x 2 x i1> %pred,
				<vscale x 2 x double>* %addr)
				ret <vscale x 6 x double> %res
				}

				;
				; LD4B
				;

				define <vscale x 64 x i8> @ld4b_i8(<vscale x 16 x i1> %pred, <vscale x 16 x i8>* %addr) {
				; CHECK-LABEL: ld4b_i8:
				; CHECK: ld4b { z0.b, z1.b, z2.b, z3.b }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 64 x i8> @llvm.aarch64.sve.ld4.nxv64i8.nxv16i1.p0nxv16i8(<vscale x 16 x i1> %pred,
				<vscale x 16 x i8>* %addr)
				ret <vscale x 64 x i8> %res
				}

				;
				; LD4H
				;

				define <vscale x 32 x i16> @ld4h_i16(<vscale x 8 x i1> %pred, <vscale x 8 x i16>* %addr) {
				; CHECK-LABEL: ld4h_i16:
				; CHECK: ld4h { z0.h, z1.h, z2.h, z3.h }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 32 x i16> @llvm.aarch64.sve.ld4.nxv32i16.nxv8i1.p0nxv8i16(<vscale x 8 x i1> %pred,
				<vscale x 8 x i16>* %addr)
				ret <vscale x 32 x i16> %res
				}

				define <vscale x 32 x half> @ld4h_f16(<vscale x 8 x i1> %pred, <vscale x 8 x half>* %addr) {
				; CHECK-LABEL: ld4h_f16:
				; CHECK: ld4h { z0.h, z1.h, z2.h, z3.h }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 32 x half> @llvm.aarch64.sve.ld4.nxv32f16.nxv8i1.p0nxv8f16(<vscale x 8 x i1> %pred,
				<vscale x 8 x half>* %addr)
				ret <vscale x 32 x half> %res
				}

				;
				; LD4W
				;

				define <vscale x 16 x i32> @ld4w_i32(<vscale x 4 x i1> %pred, <vscale x 4 x i32>* %addr) {
				; CHECK-LABEL: ld4w_i32:
				; CHECK: ld4w { z0.s, z1.s, z2.s, z3.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i32> @llvm.aarch64.sve.ld4.nxv16i32.nxv4i1.p0nxv4i32(<vscale x 4 x i1> %pred,
				<vscale x 4 x i32>* %addr)
				ret <vscale x 16 x i32> %res
				}

				define <vscale x 16 x float> @ld4w_f32(<vscale x 4 x i1> %pred, <vscale x 4 x float>* %addr) {
				; CHECK-LABEL: ld4w_f32:
				; CHECK: ld4w { z0.s, z1.s, z2.s, z3.s }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x float> @llvm.aarch64.sve.ld4.nxv16f32.nxv4i1.p0nxv4f32(<vscale x 4 x i1> %pred,
				<vscale x 4 x float>* %addr)
				ret <vscale x 16 x float> %res
				}

				;
				; LD4D
				;

				define <vscale x 8 x i64> @ld4d_i64(<vscale x 2 x i1> %pred, <vscale x 2 x i64>* %addr) {
				; CHECK-LABEL: ld4d_i64:
				; CHECK: ld4d { z0.d, z1.d, z2.d, z3.d }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i64> @llvm.aarch64.sve.ld4.nxv8i64.nxv2i1.p0nxv2i64(<vscale x 2 x i1> %pred,
				<vscale x 2 x i64>* %addr)
				ret <vscale x 8 x i64> %res
				}

				define <vscale x 8 x double> @ld4d_f64(<vscale x 2 x i1> %pred, <vscale x 2 x double>* %addr) {
				; CHECK-LABEL: ld4d_f64:
				; CHECK: ld4d { z0.d, z1.d, z2.d, z3.d }, p0/z, [x0]
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x double> @llvm.aarch64.sve.ld4.nxv8f64.nxv2i1.p0nxv2f64(<vscale x 2 x i1> %pred,
				<vscale x 2 x double>* %addr)
				ret <vscale x 8 x double> %res
				}


declare <vscale x 16 x i8> @llvm.aarch64.sve.ld1rq.nxv16i8(<vscale x 16 x i1>, i8*)
declare <vscale x 8 x i16> @llvm.aarch64.sve.ld1rq.nxv8i16(<vscale x 8 x i1>, i16*)
declare <vscale x 4 x i32> @llvm.aarch64.sve.ld1rq.nxv4i32(<vscale x 4 x i1>, i32*)
declare <vscale x 2 x i64> @llvm.aarch64.sve.ld1rq.nxv2i64(<vscale x 2 x i1>, i64*)
declare <vscale x 8 x half> @llvm.aarch64.sve.ld1rq.nxv8f16(<vscale x 8 x i1>, half*)
declare <vscale x 4 x float> @llvm.aarch64.sve.ld1rq.nxv4f32(<vscale x 4 x i1>, float*)
declare <vscale x 2 x double> @llvm.aarch64.sve.ld1rq.nxv2f64(<vscale x 2 x i1>, double*)

declare <vscale x 16 x i8> @llvm.aarch64.sve.ldnt1.nxv16i8(<vscale x 16 x i1>, i8*)
declare <vscale x 8 x i16> @llvm.aarch64.sve.ldnt1.nxv8i16(<vscale x 8 x i1>, i16*)
declare <vscale x 4 x i32> @llvm.aarch64.sve.ldnt1.nxv4i32(<vscale x 4 x i1>, i32*)
declare <vscale x 2 x i64> @llvm.aarch64.sve.ldnt1.nxv2i64(<vscale x 2 x i1>, i64*)
declare <vscale x 8 x half> @llvm.aarch64.sve.ldnt1.nxv8f16(<vscale x 8 x i1>, half*)
declare <vscale x 4 x float> @llvm.aarch64.sve.ldnt1.nxv4f32(<vscale x 4 x i1>, float*)
declare <vscale x 2 x double> @llvm.aarch64.sve.ldnt1.nxv2f64(<vscale x 2 x i1>, double*)

				declare <vscale x 32 x i8> @llvm.aarch64.sve.ld2.nxv32i8.nxv16i1.p0nxv16i8(<vscale x 16 x i1>, <vscale x 16 x i8>*)
				declare <vscale x 16 x i16> @llvm.aarch64.sve.ld2.nxv16i16.nxv8i1.p0nxv8i16(<vscale x 8 x i1>, <vscale x 8 x i16>*)
				declare <vscale x 8 x i32> @llvm.aarch64.sve.ld2.nxv8i32.nxv4i1.p0nxv4i32(<vscale x 4 x i1>, <vscale x 4 x i32>*)
				declare <vscale x 4 x i64> @llvm.aarch64.sve.ld2.nxv4i64.nxv2i1.p0nxv2i64(<vscale x 2 x i1>, <vscale x 2 x i64>*)
				declare <vscale x 16 x half> @llvm.aarch64.sve.ld2.nxv16f16.nxv8i1.p0nxv8f16(<vscale x 8 x i1>, <vscale x 8 x half>*)
				declare <vscale x 8 x float> @llvm.aarch64.sve.ld2.nxv8f32.nxv4i1.p0nxv4f32(<vscale x 4 x i1>, <vscale x 4 x float>*)
				declare <vscale x 4 x double> @llvm.aarch64.sve.ld2.nxv4f64.nxv2i1.p0nxv2f64(<vscale x 2 x i1>, <vscale x 2 x double>*)

				declare <vscale x 48 x i8> @llvm.aarch64.sve.ld3.nxv48i8.nxv16i1.p0nxv16i8(<vscale x 16 x i1>, <vscale x 16 x i8>*)
				declare <vscale x 24 x i16> @llvm.aarch64.sve.ld3.nxv24i16.nxv8i1.p0nxv8i16(<vscale x 8 x i1>, <vscale x 8 x i16>*)
				declare <vscale x 12 x i32> @llvm.aarch64.sve.ld3.nxv12i32.nxv4i1.p0nxv4i32(<vscale x 4 x i1>, <vscale x 4 x i32>*)
				declare <vscale x 6 x i64> @llvm.aarch64.sve.ld3.nxv6i64.nxv2i1.p0nxv2i64(<vscale x 2 x i1>, <vscale x 2 x i64>*)
				declare <vscale x 24 x half> @llvm.aarch64.sve.ld3.nxv24f16.nxv8i1.p0nxv8f16(<vscale x 8 x i1>, <vscale x 8 x half>*)
				declare <vscale x 12 x float> @llvm.aarch64.sve.ld3.nxv12f32.nxv4i1.p0nxv4f32(<vscale x 4 x i1>, <vscale x 4 x float>*)
				declare <vscale x 6 x double> @llvm.aarch64.sve.ld3.nxv6f64.nxv2i1.p0nxv2f64(<vscale x 2 x i1>, <vscale x 2 x double>*)

				declare <vscale x 64 x i8> @llvm.aarch64.sve.ld4.nxv64i8.nxv16i1.p0nxv16i8(<vscale x 16 x i1>, <vscale x 16 x i8>*)
				declare <vscale x 32 x i16> @llvm.aarch64.sve.ld4.nxv32i16.nxv8i1.p0nxv8i16(<vscale x 8 x i1>, <vscale x 8 x i16>*)
				declare <vscale x 16 x i32> @llvm.aarch64.sve.ld4.nxv16i32.nxv4i1.p0nxv4i32(<vscale x 4 x i1>, <vscale x 4 x i32>*)
				declare <vscale x 8 x i64> @llvm.aarch64.sve.ld4.nxv8i64.nxv2i1.p0nxv2i64(<vscale x 2 x i1>, <vscale x 2 x i64>*)
				declare <vscale x 32 x half> @llvm.aarch64.sve.ld4.nxv32f16.nxv8i1.p0nxv8f16(<vscale x 8 x i1>, <vscale x 8 x half>*)
				declare <vscale x 16 x float> @llvm.aarch64.sve.ld4.nxv16f32.nxv4i1.p0nxv4f32(<vscale x 4 x i1>, <vscale x 4 x float>*)
				declare <vscale x 8 x double> @llvm.aarch64.sve.ld4.nxv8f64.nxv2i1.p0nxv2f64(<vscale x 2 x i1>, <vscale x 2 x double>*)
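
The declarations above follow a regular naming pattern: the intrinsic name is suffixed with the wide result type, the predicate type and the pointer type. The sketch below reproduces that pattern as observed in this test file. It is not LLVM's name-mangling code, and `structLoadIntrinsicName` is a hypothetical helper introduced purely for illustration. It assumes integer (`'i'`) or floating-point (`'f'`) elements of 8 to 64 bits and address space 0.

```cpp
#include <cassert>
#include <string>

// Builds the overloaded intrinsic name used in the tests above, e.g.
// "llvm.aarch64.sve.ld2.nxv8i32.nxv4i1.p0nxv4i32".
// Hypothetical helper: it mirrors the names in this file, not an LLVM API.
std::string structLoadIntrinsicName(unsigned NumVecs, char Kind,
                                    unsigned EltBits) {
  assert(NumVecs >= 2 && NumVecs <= 4 && "only ld2, ld3 and ld4 exist");
  assert((Kind == 'i' || Kind == 'f') && "integer or floating-point only");
  assert((EltBits == 8 || EltBits == 16 || EltBits == 32 || EltBits == 64) &&
         "unsupported element width");

  // Minimum element count of a single SVE vector of this element type.
  const unsigned PartElts = 128 / EltBits;
  const std::string Elt = std::string(1, Kind) + std::to_string(EltBits);

  return "llvm.aarch64.sve.ld" + std::to_string(NumVecs) +
         ".nxv" + std::to_string(PartElts * NumVecs) + Elt + // wide result
         ".nxv" + std::to_string(PartElts) + "i1" +          // predicate
         ".p0nxv" + std::to_string(PartElts) + Elt;          // pointer
}
```

For example, `structLoadIntrinsicName(3, 'f', 16)` yields `llvm.aarch64.sve.ld3.nxv24f16.nxv8i1.p0nxv8f16`, the name declared above for the half-precision LD3 variant.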