This is an archive of the discontinued LLVM Phabricator instance.

[ARM64] Implement NEON post-increment LD1 (lane) and post-increment LD1R
Closed, Public

Authored by • HaoLiu on May 13 2014, 3:00 AM.


Details

Reviewers

t.p.northover
jmolloy

Summary

Hi Tim,

This patch implements post-increment LD1 (lane) and post-increment LD1R. The implementation follows that of the NEON post-increment loads with 2/3/4 vectors.
It attempts the following two combines when their conditions are met:

(1) combine an ARM64ISD::DUP and a post-increment load into a post-increment LD1R.
(2) combine an ISD::INSERT_VECTOR_ELT and a post-increment load into a post-increment LD1 (lane).
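
As a rough illustration of when either combine can fold the address update, the sketch below models only the increment check in plain C++. `ToyVectorType`, `PostIncForm`, and `classifyPostInc` are hypothetical names invented for this example and are not LLVM APIs. The rule they encode is taken from the `performPostLD1Combine` code in this diff: a constant increment must equal the size of one element, and a non-constant increment uses the register form.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for a vector value type; this is not LLVM's EVT.
struct ToyVectorType {
  unsigned TotalBits;   // e.g. 128 for v16i8
  unsigned NumElements; // e.g. 16 for v16i8
};

// The two post-increment addressing forms seen in the tests:
// "[x0], #imm" and "[x0], xN".
enum class PostIncForm { Immediate, Register };

// LD1R and LD1 (lane) each read a single element, so the immediate
// post-increment must equal the size of one element in bytes.
inline unsigned elementBytes(ToyVectorType VT) {
  return (VT.TotalBits / 8) / VT.NumElements;
}

// ConstInc holds the increment when it is a compile-time constant and is
// empty when the increment lives in a register.
// Returns the form to use, or an empty optional when the add cannot be folded.
inline std::optional<PostIncForm>
classifyPostInc(ToyVectorType VT, std::optional<uint64_t> ConstInc) {
  if (!ConstInc)
    return PostIncForm::Register;  // register increment: any value is fine
  if (*ConstInc != elementBytes(VT))
    return std::nullopt;           // constant of the wrong size: do not fold
  return PostIncForm::Immediate;   // constant matching the element size
}
```

Under these assumptions, a `v8i16` load followed by an address increment of 2 would be classified as the immediate form, matching the `ld1r.8h { v0 }, [x0], #2` check in the test file.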

Please review.

Thanks,
-Hao

Event Timeline

• HaoLiu updated this revision to Diff 9340. May 13 2014, 3:00 AM

• HaoLiu retitled this revision to [ARM64] Implement NEON post-increment LD1 (lane) and post-increment LD1R.

• HaoLiu updated this object.

• HaoLiu edited the test plan for this revision.

• HaoLiu added a reviewer: t.p.northover.

• HaoLiu added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. May 13 2014, 3:00 AM

mcrosier added a subscriber: mcrosier. May 13 2014, 7:46 AM

ping...

-----Original Message-----
From: Hao Liu [mailto:Hao.Liu@arm.com]
Sent: Tuesday, May 13, 2014 6:01 PM
To: Hao Liu; t.p.northover@gmail.com
Cc: Amara Emerson; llvm-commits@cs.uiuc.edu
Subject: [PATCH] [ARM64] Implement NEON post-increment LD1 (lane) and
post-increment LD1R


http://reviews.llvm.org/D3740

Files:
lib/Target/ARM64/ARM64ISelDAGToDAG.cpp
lib/Target/ARM64/ARM64ISelLowering.cpp
lib/Target/ARM64/ARM64ISelLowering.h
test/CodeGen/ARM64/indexed-vector-ldst.ll

Hi Hao,

I have a couple of comments.

Cheers,

James

lib/Target/ARM64/ARM64ISelDAGToDAG.cpp
1159	Please add braces around this if clause. Either there should be no braces around both the if and the else, or braces around both.
2670	Would a switch be clearer here?
2746	Having all these individual MVT:: cases makes it likely you'll miss one or someone will miss one on an update. Would it be easier to switch on the EVT::getVectorElementSizeInBits()?
lib/Target/ARM64/ARM64ISelLowering.cpp
7098	What about SEXTLOAD and ZEXTLOAD here?
7129	GetScalarSizeInBits() may simplify this.
7158	Typo: "write back"
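
The suggestion for line 2746 could look roughly like the sketch below, which keys the opcode choice on element width instead of enumerating every MVT. `ToyOpcode` and `selectPostLoadLaneOpcode` are hypothetical stand-ins written for this example; the enumerator names only mirror the `ARM64::LD1i*_POST` opcodes used in the patch, and this is not the code that was committed.

```cpp
#include <cassert>

// Hypothetical stand-ins for the ARM64::LD1i*_POST opcodes named in the diff.
enum class ToyOpcode {
  LD1i8_POST,
  LD1i16_POST,
  LD1i32_POST,
  LD1i64_POST,
  Invalid
};

// Picks the lane-load opcode from the element width alone. Integer and
// floating-point vectors of the same element width share one opcode, so
// v4i32, v2i32, v4f32 and v2f32 all land on the 32-bit case.
inline ToyOpcode selectPostLoadLaneOpcode(unsigned ElementBits) {
  switch (ElementBits) {
  case 8:
    return ToyOpcode::LD1i8_POST;
  case 16:
    return ToyOpcode::LD1i16_POST;
  case 32:
    return ToyOpcode::LD1i32_POST;
  case 64:
    return ToyOpcode::LD1i64_POST;
  default:
    return ToyOpcode::Invalid; // unsupported width: caller falls through
  }
}
```

The appeal of this shape is that a newly added vector type with an already supported element width needs no new case, which is the maintenance risk the comment describes.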

• HaoLiu updated this revision to Diff 9408. May 14 2014, 8:39 PM

Hi James,

Thanks for the code review. I've reworked the code according to your comments.

Thanks,
-Hao

lib/Target/ARM64/ARM64ISelDAGToDAG.cpp
2670	Hi James, Yes, a switch is much clearer. But all the other load/store selection code is implemented this way, so changing it here would mean changing all of it, which is a large change. Tim suggested dealing with this in a separate future patch (see the comment in http://reviews.llvm.org/D3605), so I implemented this piece of code like the others. It will be improved in the future.
lib/Target/ARM64/ARM64ISelLowering.cpp
7098	Oh, I analyzed this again and found that this is always ISD::LOAD, never ISD::EXTLOAD or SEXTLOAD/ZEXTLOAD. So I removed the EXTLOAD check and instead check the type of the memory operand, which should be the same as the element type of the vector.
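
The memory-type check described in the reply to line 7098 can be pictured with the minimal sketch below. `ToyScalar`, `ToyLoad`, and `memoryTypeMatchesElement` are hypothetical names made up for illustration and do not correspond to LLVM types; the updated diff itself is not shown on this page, so this reflects only the condition as described in the reply.

```cpp
#include <cassert>

// Hypothetical scalar type tags; these are not LLVM's MVT values.
enum class ToyScalar { i8, i16, i32, i64, f32, f64 };

// Hypothetical summary of a load node: only its in-memory type matters here.
struct ToyLoad {
  ToyScalar MemoryType;
};

// The combine is only attempted when the type read from memory is exactly
// the element type of the vector being built. A load that reads a narrower
// type and widens it would not satisfy this equality.
inline bool memoryTypeMatchesElement(const ToyLoad &LD,
                                     ToyScalar VectorElementType) {
  return LD.MemoryType == VectorElementType;
}
```

For example, under this model an `i8` load feeding a vector of `i8` elements qualifies, while an `i8` load feeding a vector of `i16` elements does not.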

Hi Hao,

OK, understood. This looks fine to me, but Tim needs to OK it before it can go in.

Cheers,

James

This revision is now accepted and ready to land. May 15 2014, 3:45 AM

Hi Tim,

Do you have any comments?

If not, I'll commit this patch.

Thanks,
-Hao

Hi Hao,

I think this looks OK. It would have been nicer to leave it until later (ISelDAGToDAG), but we already had this code lying around, so why not make use of it?

Go for it!

Tim.

Hi Tim,

This patch has two steps:

(1) do the post-increment LD1 combine in ARM64ISelLowering
(2) select the post-increment LD1 in ARM64ISelDAGToDAG

So you mean it would be better not to do the post-increment LD1 combine in ARM64ISelLowering, and instead do the combine and the selection in one step in ARM64ISelDAGToDAG?

If so, the other NEON post-increment loads/stores such as LD2/LD3/LD4 are also implemented in two steps, so they would need to be refactored as well.

Thanks,
-Hao

Hi Hao,

> This patch has two steps:
> (1) do post LD1 combine in ARM64ISelLowering
> (2) do select post LD1 in ARM64ISelDAGToDAG
> So you mean it's better not to do post LD1 combine in ARM64ISelLowering? We can do such combine and select in one step in ARM64ISelDAGToDAG?

No need to take it as a suggestion: it was just a vague desire for the
future really.

> If so, considering other post NEON load/store like LD2/LD3/LD4 are also implemented in two steps, they also need to be refactored.

The main difference there, as I see it, is that they're *already*
intrinsics coming into the DAG. We're definitely not going to be
missing out on any generic DAG combines by fiddling them a bit more.
The same can't necessarily be said for more generic ISD::LOADs.

That said, the LD2/LD3/LD4 situation did help convince me that doing
it in ISelLowering wasn't too bad.

Cheers.

Tim.

Oh, I see. That's reasonable.

Committed in http://llvm.org/viewvc/llvm-project?rev=208955&view=rev

Thanks,
-Hao

• HaoLiu closed this revision. May 13 2015, 7:55 PM

Revision Contents

Path                                          Size
lib/Target/ARM64/ARM64ISelDAGToDAG.cpp        63 lines
lib/Target/ARM64/ARM64ISelLowering.h          2 lines
lib/Target/ARM64/ARM64ISelLowering.cpp        90 lines
test/CodeGen/ARM64/indexed-vector-ldst.ll     486 lines

Diff 9340

lib/Target/ARM64/ARM64ISelDAGToDAG.cpp

Context not available.

	// Update uses of vector list	// Update uses of vector list
	SDValue SuperReg = SDValue(Ld, 1);	SDValue SuperReg = SDValue(Ld, 1);
	for (unsigned i = 0; i < NumVecs; ++i)	if (NumVecs == 1)
	ReplaceUses(SDValue(N, i),	ReplaceUses(SDValue(N, 0), SuperReg);
	CurDAG->getTargetExtractSubreg(SubRegIdx + i, dl, VT, SuperReg));	else
		for (unsigned i = 0; i < NumVecs; ++i)
		ReplaceUses(SDValue(N, i),
		CurDAG->getTargetExtractSubreg(SubRegIdx + i, dl, VT, SuperReg));

	// Update the chain	// Update the chain
	ReplaceUses(SDValue(N, NumVecs + 1), SDValue(Ld, 2));	ReplaceUses(SDValue(N, NumVecs + 1), SDValue(Ld, 2));
Context not available.

	// Update uses of the vector list	// Update uses of the vector list
	SDValue SuperReg = SDValue(Ld, 1);	SDValue SuperReg = SDValue(Ld, 1);
	EVT WideVT = RegSeq.getOperand(1)->getValueType(0);	if (NumVecs == 1)
		jmolloy: Please add braces around this if clause. Either there should be no braces around both the if and the else, or braces around both.
	static unsigned QSubs[] = { ARM64::qsub0, ARM64::qsub1, ARM64::qsub2,	ReplaceUses(SDValue(N, 0),
	ARM64::qsub3 };	Narrow ? NarrowVector(SuperReg, *CurDAG) : SuperReg);
	for (unsigned i = 0; i < NumVecs; ++i) {	else {
	SDValue NV = CurDAG->getTargetExtractSubreg(QSubs[i], dl, WideVT, SuperReg);	EVT WideVT = RegSeq.getOperand(1)->getValueType(0);
	if (Narrow)	static unsigned QSubs[] = { ARM64::qsub0, ARM64::qsub1, ARM64::qsub2,
	NV = NarrowVector(NV, *CurDAG);	ARM64::qsub3 };
	ReplaceUses(SDValue(N, i), NV);	for (unsigned i = 0; i < NumVecs; ++i) {
		SDValue NV = CurDAG->getTargetExtractSubreg(QSubs[i], dl, WideVT,
		SuperReg);
		if (Narrow)
		NV = NarrowVector(NV, *CurDAG);
		ReplaceUses(SDValue(N, i), NV);
		}
	}	}

	// Update the Chain	// Update the Chain
Context not available.
	return SelectPostLoad(Node, 4, ARM64::LD1Fourv2d_POST, ARM64::qsub0);	return SelectPostLoad(Node, 4, ARM64::LD1Fourv2d_POST, ARM64::qsub0);
	break;	break;
	}	}
		case ARM64ISD::LD1DUPpost: {
		if (VT == MVT::v8i8)
		jmolloy: Would a switch be clearer here?
		HaoLiu: Hi James, Yes, switch is much clearer. But all the other code to select load/store is implemented like this, if we modify this, all other code needs to be modified. This will change a lot. So Tim suggested to deal with this piece of code in the future with a separate patch (see the comment in http://reviews.llvm.org/D3605). So I just implement this piece of code like others. This will be improved in the future.
		return SelectPostLoad(Node, 1, ARM64::LD1Rv8b_POST, ARM64::dsub0);
		else if (VT == MVT::v16i8)
		return SelectPostLoad(Node, 1, ARM64::LD1Rv16b_POST, ARM64::qsub0);
		else if (VT == MVT::v4i16)
		return SelectPostLoad(Node, 1, ARM64::LD1Rv4h_POST, ARM64::dsub0);
		else if (VT == MVT::v8i16)
		return SelectPostLoad(Node, 1, ARM64::LD1Rv8h_POST, ARM64::qsub0);
		else if (VT == MVT::v2i32 || VT == MVT::v2f32)
		return SelectPostLoad(Node, 1, ARM64::LD1Rv2s_POST, ARM64::dsub0);
		else if (VT == MVT::v4i32 || VT == MVT::v4f32)
		return SelectPostLoad(Node, 1, ARM64::LD1Rv4s_POST, ARM64::qsub0);
		else if (VT == MVT::v1i64 || VT == MVT::v1f64)
		return SelectPostLoad(Node, 1, ARM64::LD1Rv1d_POST, ARM64::dsub0);
		else if (VT == MVT::v2i64 || VT == MVT::v2f64)
		return SelectPostLoad(Node, 1, ARM64::LD1Rv2d_POST, ARM64::qsub0);
		break;
		}
	case ARM64ISD::LD2DUPpost: {	case ARM64ISD::LD2DUPpost: {
	if (VT == MVT::v8i8)	if (VT == MVT::v8i8)
	return SelectPostLoad(Node, 2, ARM64::LD2Rv8b_POST, ARM64::dsub0);	return SelectPostLoad(Node, 2, ARM64::LD2Rv8b_POST, ARM64::dsub0);
Context not available.
	return SelectPostLoad(Node, 4, ARM64::LD4Rv2d_POST, ARM64::qsub0);	return SelectPostLoad(Node, 4, ARM64::LD4Rv2d_POST, ARM64::qsub0);
	break;	break;
	}	}
		case ARM64ISD::LD1LANEpost: {
		if (VT == MVT::v16i8 || VT == MVT::v8i8)
		jmolloy: Having all these individual MVT:: cases makes it likely you'll miss one or someone will miss one on an update. Would it be easier to switch on the EVT::getVectorElementSizeInBits()?
		return SelectPostLoadLane(Node, 1, ARM64::LD1i8_POST);
		else if (VT == MVT::v8i16 || VT == MVT::v4i16)
		return SelectPostLoadLane(Node, 1, ARM64::LD1i16_POST);
		else if (VT == MVT::v4i32 || VT == MVT::v2i32 || VT == MVT::v4f32 ||
		VT == MVT::v2f32)
		return SelectPostLoadLane(Node, 1, ARM64::LD1i32_POST);
		else if (VT == MVT::v2i64 || VT == MVT::v1i64 || VT == MVT::v2f64 ||
		VT == MVT::v1f64)
		return SelectPostLoadLane(Node, 1, ARM64::LD1i64_POST);
		break;
		}
	case ARM64ISD::LD2LANEpost: {	case ARM64ISD::LD2LANEpost: {
	if (VT == MVT::v16i8 || VT == MVT::v8i8)	if (VT == MVT::v16i8 || VT == MVT::v8i8)
	return SelectPostLoadLane(Node, 2, ARM64::LD2i8_POST);	return SelectPostLoadLane(Node, 2, ARM64::LD2i8_POST);
Context not available.

lib/Target/ARM64/ARM64ISelLowering.h

Context not available.
	ST1x2post,	ST1x2post,
	ST1x3post,	ST1x3post,
	ST1x4post,	ST1x4post,
		LD1DUPpost,
	LD2DUPpost,	LD2DUPpost,
	LD3DUPpost,	LD3DUPpost,
	LD4DUPpost,	LD4DUPpost,
		LD1LANEpost,
	LD2LANEpost,	LD2LANEpost,
	LD3LANEpost,	LD3LANEpost,
	LD4LANEpost,	LD4LANEpost,
Context not available.

lib/Target/ARM64/ARM64ISelLowering.cpp

Context not available.

	setTargetDAGCombine(ISD::INTRINSIC_VOID);	setTargetDAGCombine(ISD::INTRINSIC_VOID);
	setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);	setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
		setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);

	MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 8;	MaxStoresPerMemset = MaxStoresPerMemsetOptSize = 8;
	MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 4;	MaxStoresPerMemcpy = MaxStoresPerMemcpyOptSize = 4;
Context not available.
	case ARM64ISD::ST1x2post: return "ARM64ISD::ST1x2post";	case ARM64ISD::ST1x2post: return "ARM64ISD::ST1x2post";
	case ARM64ISD::ST1x3post: return "ARM64ISD::ST1x3post";	case ARM64ISD::ST1x3post: return "ARM64ISD::ST1x3post";
	case ARM64ISD::ST1x4post: return "ARM64ISD::ST1x4post";	case ARM64ISD::ST1x4post: return "ARM64ISD::ST1x4post";
		case ARM64ISD::LD1DUPpost: return "ARM64ISD::LD1DUPpost";
	case ARM64ISD::LD2DUPpost: return "ARM64ISD::LD2DUPpost";	case ARM64ISD::LD2DUPpost: return "ARM64ISD::LD2DUPpost";
	case ARM64ISD::LD3DUPpost: return "ARM64ISD::LD3DUPpost";	case ARM64ISD::LD3DUPpost: return "ARM64ISD::LD3DUPpost";
	case ARM64ISD::LD4DUPpost: return "ARM64ISD::LD4DUPpost";	case ARM64ISD::LD4DUPpost: return "ARM64ISD::LD4DUPpost";
		case ARM64ISD::LD1LANEpost: return "ARM64ISD::LD1LANEpost";
	case ARM64ISD::LD2LANEpost: return "ARM64ISD::LD2LANEpost";	case ARM64ISD::LD2LANEpost: return "ARM64ISD::LD2LANEpost";
	case ARM64ISD::LD3LANEpost: return "ARM64ISD::LD3LANEpost";	case ARM64ISD::LD3LANEpost: return "ARM64ISD::LD3LANEpost";
	case ARM64ISD::LD4LANEpost: return "ARM64ISD::LD4LANEpost";	case ARM64ISD::LD4LANEpost: return "ARM64ISD::LD4LANEpost";
Context not available.
	S->getAlignment());	S->getAlignment());
	}	}

		/// Target-specific DAG combine function for post-increment LD1 (lane) and
		/// post-increment LD1R.
		static SDValue performPostLD1Combine(SDNode *N,
		TargetLowering::DAGCombinerInfo &DCI,
		bool IsLaneOp) {
		if (DCI.isBeforeLegalizeOps())
		return SDValue();

		SelectionDAG &DAG = DCI.DAG;
		EVT VT = N->getValueType(0);

		unsigned LoadIdx = IsLaneOp ? 1 : 0;
		SDNode *LD = N->getOperand(LoadIdx).getNode();
		// If it is not LOAD/EXTLOAD, can not do such combine.
		if (LD->getOpcode() != ISD::LOAD && LD->getOpcode() != ISD::EXTLOAD)
		jmolloy: What about SEXTLOAD and ZEXTLOAD here?
		HaoLiu: Oh, I analysis this again. I find that this is always ISD::LOAD, no ISD::EXTLOAD or SEXTLOAD/ZEXTLOAD. So I remove EXTLOAD and check the type of memory operand type, which should be the same to the element type of the vector.
		return SDValue();

		// Check if there are other uses. If so, do not combine as it will introduce
		// an extra load.
		for (SDNode::use_iterator UI = LD->use_begin(), UE = LD->use_end(); UI != UE;
		++UI) {
		if (UI.getUse().getResNo() == 1) // Ignore uses of the chain result.
		continue;
		if (*UI != N)
		return SDValue();
		}

		SDValue Addr = LD->getOperand(1);
		// Search for a use of the address operand that is an increment.
		for (SDNode::use_iterator UI = Addr.getNode()->use_begin(), UE =
		Addr.getNode()->use_end(); UI != UE; ++UI) {
		SDNode *User = *UI;
		if (User->getOpcode() != ISD::ADD
		|| UI.getUse().getResNo() != Addr.getResNo())
		continue;

		// Check that the add is independent of the load. Otherwise, folding it
		// would create a cycle.
		if (User->isPredecessorOf(LD) || LD->isPredecessorOf(User))
		continue;

		// If the increment is a constant, it must match the memory ref size.
		SDValue Inc = User->getOperand(User->getOperand(0) == Addr ? 1 : 0);
		if (ConstantSDNode *CInc = dyn_cast<ConstantSDNode>(Inc.getNode())) {
		uint32_t IncVal = CInc->getZExtValue();
		unsigned NumBytes = (VT.getSizeInBits() / 8) / VT.getVectorNumElements();
		jmolloy: GetScalarSizeInBits() may simplify this.
		if (IncVal != NumBytes)
		continue;
		Inc = DAG.getRegister(ARM64::XZR, MVT::i64);
		}

		SmallVector<SDValue, 8> Ops;
		Ops.push_back(LD->getOperand(0)); // Chain
		if (IsLaneOp) {
		Ops.push_back(N->getOperand(0)); // The vector to be inserted
		Ops.push_back(N->getOperand(2)); // The lane to be inserted in the vector
		}
		Ops.push_back(Addr);
		Ops.push_back(Inc);

		EVT Tys[3] = { VT, MVT::i64, MVT::Other };
		SDVTList SDTys = DAG.getVTList(ArrayRef<EVT>(Tys, 3));
		LoadSDNode *LoadSD = cast<LoadSDNode>(LD);
		unsigned NewOp = IsLaneOp ? ARM64ISD::LD1LANEpost : ARM64ISD::LD1DUPpost;
		SDValue UpdN = DAG.getMemIntrinsicNode(NewOp, SDLoc(N), SDTys, Ops,
		LoadSD->getMemoryVT(),
		LoadSD->getMemOperand());

		// Update the uses.
		std::vector<SDValue> NewResults;
		NewResults.push_back(SDValue(LD, 0)); // The result of load
		NewResults.push_back(SDValue(UpdN.getNode(), 2)); // Chain
		DCI.CombineTo(LD, NewResults);
		DCI.CombineTo(N, SDValue(UpdN.getNode(), 0)); // Dup/Inserted Result
		DCI.CombineTo(User, SDValue(UpdN.getNode(), 1)); // Write backe register
		jmolloy: Typo: "write back"

		break;
		}
		return SDValue();
		}

	/// Target-specific DAG combine function for NEON load/store intrinsics	/// Target-specific DAG combine function for NEON load/store intrinsics
	/// to merge base address updates.	/// to merge base address updates.
	static SDValue performNEONPostLDSTCombine(SDNode *N,	static SDValue performNEONPostLDSTCombine(SDNode *N,
Context not available.
	if (IsLaneOp || IsStore)	if (IsLaneOp || IsStore)
	for (unsigned i = 2; i < AddrOpIdx; ++i)	for (unsigned i = 2; i < AddrOpIdx; ++i)
	Ops.push_back(N->getOperand(i));	Ops.push_back(N->getOperand(i));
	Ops.push_back(N->getOperand(AddrOpIdx)); // Base register	Ops.push_back(Addr); // Base register
	Ops.push_back(Inc);	Ops.push_back(Inc);

	// Return Types.	// Return Types.
Context not available.
	return performSTORECombine(N, DCI, DAG, Subtarget);	return performSTORECombine(N, DCI, DAG, Subtarget);
	case ARM64ISD::BRCOND:	case ARM64ISD::BRCOND:
	return performBRCONDCombine(N, DCI, DAG);	return performBRCONDCombine(N, DCI, DAG);
		case ARM64ISD::DUP:
		return performPostLD1Combine(N, DCI, false);
		case ISD::INSERT_VECTOR_ELT:
		return performPostLD1Combine(N, DCI, true);
	case ISD::INTRINSIC_VOID:	case ISD::INTRINSIC_VOID:
	case ISD::INTRINSIC_W_CHAIN:	case ISD::INTRINSIC_W_CHAIN:
	switch (cast<ConstantSDNode>(N->getOperand(1))->getZExtValue()) {	switch (cast<ConstantSDNode>(N->getOperand(1))->getZExtValue()) {
Context not available.

test/CodeGen/ARM64/indexed-vector-ldst.ll

Context not available.
	ret double* %tmp	ret double* %tmp
	}	}

	declare void @llvm.arm64.neon.st4lane.v1f64.p0f64(<1 x double>, <1 x double>, <1 x double>, <1 x double>, i64, double*)	declare void @llvm.arm64.neon.st4lane.v1f64.p0f64(<1 x double>, <1 x double>, <1 x double>, <1 x double>, i64, double*)
	No newline at end of file
		define <16 x i8> @test_v16i8_post_imm_ld1r(i8* %bar, i8** %ptr) {
		; CHECK-LABEL: test_v16i8_post_imm_ld1r:
		; CHECK: ld1r.16b { v0 }, [x0], #1
		%tmp1 = load i8* %bar
		%tmp2 = insertelement <16 x i8> <i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>, i8 %tmp1, i32 0
		%tmp3 = insertelement <16 x i8> %tmp2, i8 %tmp1, i32 1
		%tmp4 = insertelement <16 x i8> %tmp3, i8 %tmp1, i32 2
		%tmp5 = insertelement <16 x i8> %tmp4, i8 %tmp1, i32 3
		%tmp6 = insertelement <16 x i8> %tmp5, i8 %tmp1, i32 4
		%tmp7 = insertelement <16 x i8> %tmp6, i8 %tmp1, i32 5
		%tmp8 = insertelement <16 x i8> %tmp7, i8 %tmp1, i32 6
		%tmp9 = insertelement <16 x i8> %tmp8, i8 %tmp1, i32 7
		%tmp10 = insertelement <16 x i8> %tmp9, i8 %tmp1, i32 8
		%tmp11 = insertelement <16 x i8> %tmp10, i8 %tmp1, i32 9
		%tmp12 = insertelement <16 x i8> %tmp11, i8 %tmp1, i32 10
		%tmp13 = insertelement <16 x i8> %tmp12, i8 %tmp1, i32 11
		%tmp14 = insertelement <16 x i8> %tmp13, i8 %tmp1, i32 12
		%tmp15 = insertelement <16 x i8> %tmp14, i8 %tmp1, i32 13
		%tmp16 = insertelement <16 x i8> %tmp15, i8 %tmp1, i32 14
		%tmp17 = insertelement <16 x i8> %tmp16, i8 %tmp1, i32 15
		%tmp18 = getelementptr i8* %bar, i64 1
		store i8* %tmp18, i8** %ptr
		ret <16 x i8> %tmp17
		}

		define <16 x i8> @test_v16i8_post_reg_ld1r(i8* %bar, i8** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v16i8_post_reg_ld1r:
		; CHECK: ld1r.16b { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load i8* %bar
		%tmp2 = insertelement <16 x i8> <i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>, i8 %tmp1, i32 0
		%tmp3 = insertelement <16 x i8> %tmp2, i8 %tmp1, i32 1
		%tmp4 = insertelement <16 x i8> %tmp3, i8 %tmp1, i32 2
		%tmp5 = insertelement <16 x i8> %tmp4, i8 %tmp1, i32 3
		%tmp6 = insertelement <16 x i8> %tmp5, i8 %tmp1, i32 4
		%tmp7 = insertelement <16 x i8> %tmp6, i8 %tmp1, i32 5
		%tmp8 = insertelement <16 x i8> %tmp7, i8 %tmp1, i32 6
		%tmp9 = insertelement <16 x i8> %tmp8, i8 %tmp1, i32 7
		%tmp10 = insertelement <16 x i8> %tmp9, i8 %tmp1, i32 8
		%tmp11 = insertelement <16 x i8> %tmp10, i8 %tmp1, i32 9
		%tmp12 = insertelement <16 x i8> %tmp11, i8 %tmp1, i32 10
		%tmp13 = insertelement <16 x i8> %tmp12, i8 %tmp1, i32 11
		%tmp14 = insertelement <16 x i8> %tmp13, i8 %tmp1, i32 12
		%tmp15 = insertelement <16 x i8> %tmp14, i8 %tmp1, i32 13
		%tmp16 = insertelement <16 x i8> %tmp15, i8 %tmp1, i32 14
		%tmp17 = insertelement <16 x i8> %tmp16, i8 %tmp1, i32 15
		%tmp18 = getelementptr i8* %bar, i64 %inc
		store i8* %tmp18, i8** %ptr
		ret <16 x i8> %tmp17
		}

		define <8 x i8> @test_v8i8_post_imm_ld1r(i8* %bar, i8** %ptr) {
		; CHECK-LABEL: test_v8i8_post_imm_ld1r:
		; CHECK: ld1r.8b { v0 }, [x0], #1
		%tmp1 = load i8* %bar
		%tmp2 = insertelement <8 x i8> <i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>, i8 %tmp1, i32 0
		%tmp3 = insertelement <8 x i8> %tmp2, i8 %tmp1, i32 1
		%tmp4 = insertelement <8 x i8> %tmp3, i8 %tmp1, i32 2
		%tmp5 = insertelement <8 x i8> %tmp4, i8 %tmp1, i32 3
		%tmp6 = insertelement <8 x i8> %tmp5, i8 %tmp1, i32 4
		%tmp7 = insertelement <8 x i8> %tmp6, i8 %tmp1, i32 5
		%tmp8 = insertelement <8 x i8> %tmp7, i8 %tmp1, i32 6
		%tmp9 = insertelement <8 x i8> %tmp8, i8 %tmp1, i32 7
		%tmp10 = getelementptr i8* %bar, i64 1
		store i8* %tmp10, i8** %ptr
		ret <8 x i8> %tmp9
		}

		define <8 x i8> @test_v8i8_post_reg_ld1r(i8* %bar, i8** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v8i8_post_reg_ld1r:
		; CHECK: ld1r.8b { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load i8* %bar
		%tmp2 = insertelement <8 x i8> <i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef, i8 undef>, i8 %tmp1, i32 0
		%tmp3 = insertelement <8 x i8> %tmp2, i8 %tmp1, i32 1
		%tmp4 = insertelement <8 x i8> %tmp3, i8 %tmp1, i32 2
		%tmp5 = insertelement <8 x i8> %tmp4, i8 %tmp1, i32 3
		%tmp6 = insertelement <8 x i8> %tmp5, i8 %tmp1, i32 4
		%tmp7 = insertelement <8 x i8> %tmp6, i8 %tmp1, i32 5
		%tmp8 = insertelement <8 x i8> %tmp7, i8 %tmp1, i32 6
		%tmp9 = insertelement <8 x i8> %tmp8, i8 %tmp1, i32 7
		%tmp10 = getelementptr i8* %bar, i64 %inc
		store i8* %tmp10, i8** %ptr
		ret <8 x i8> %tmp9
		}

		define <8 x i16> @test_v8i16_post_imm_ld1r(i16* %bar, i16** %ptr) {
		; CHECK-LABEL: test_v8i16_post_imm_ld1r:
		; CHECK: ld1r.8h { v0 }, [x0], #2
		%tmp1 = load i16* %bar
		%tmp2 = insertelement <8 x i16> <i16 undef, i16 undef, i16 undef, i16 undef, i16 undef, i16 undef, i16 undef, i16 undef>, i16 %tmp1, i32 0
		%tmp3 = insertelement <8 x i16> %tmp2, i16 %tmp1, i32 1
		%tmp4 = insertelement <8 x i16> %tmp3, i16 %tmp1, i32 2
		%tmp5 = insertelement <8 x i16> %tmp4, i16 %tmp1, i32 3
		%tmp6 = insertelement <8 x i16> %tmp5, i16 %tmp1, i32 4
		%tmp7 = insertelement <8 x i16> %tmp6, i16 %tmp1, i32 5
		%tmp8 = insertelement <8 x i16> %tmp7, i16 %tmp1, i32 6
		%tmp9 = insertelement <8 x i16> %tmp8, i16 %tmp1, i32 7
		%tmp10 = getelementptr i16* %bar, i64 1
		store i16* %tmp10, i16** %ptr
		ret <8 x i16> %tmp9
		}

		define <8 x i16> @test_v8i16_post_reg_ld1r(i16* %bar, i16** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v8i16_post_reg_ld1r:
		; CHECK: ld1r.8h { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load i16* %bar
		%tmp2 = insertelement <8 x i16> <i16 undef, i16 undef, i16 undef, i16 undef, i16 undef, i16 undef, i16 undef, i16 undef>, i16 %tmp1, i32 0
		%tmp3 = insertelement <8 x i16> %tmp2, i16 %tmp1, i32 1
		%tmp4 = insertelement <8 x i16> %tmp3, i16 %tmp1, i32 2
		%tmp5 = insertelement <8 x i16> %tmp4, i16 %tmp1, i32 3
		%tmp6 = insertelement <8 x i16> %tmp5, i16 %tmp1, i32 4
		%tmp7 = insertelement <8 x i16> %tmp6, i16 %tmp1, i32 5
		%tmp8 = insertelement <8 x i16> %tmp7, i16 %tmp1, i32 6
		%tmp9 = insertelement <8 x i16> %tmp8, i16 %tmp1, i32 7
		%tmp10 = getelementptr i16* %bar, i64 %inc
		store i16* %tmp10, i16** %ptr
		ret <8 x i16> %tmp9
		}

		define <4 x i16> @test_v4i16_post_imm_ld1r(i16* %bar, i16** %ptr) {
		; CHECK-LABEL: test_v4i16_post_imm_ld1r:
		; CHECK: ld1r.4h { v0 }, [x0], #2
		%tmp1 = load i16* %bar
		%tmp2 = insertelement <4 x i16> <i16 undef, i16 undef, i16 undef, i16 undef>, i16 %tmp1, i32 0
		%tmp3 = insertelement <4 x i16> %tmp2, i16 %tmp1, i32 1
		%tmp4 = insertelement <4 x i16> %tmp3, i16 %tmp1, i32 2
		%tmp5 = insertelement <4 x i16> %tmp4, i16 %tmp1, i32 3
		%tmp6 = getelementptr i16* %bar, i64 1
		store i16* %tmp6, i16** %ptr
		ret <4 x i16> %tmp5
		}

		define <4 x i16> @test_v4i16_post_reg_ld1r(i16* %bar, i16** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v4i16_post_reg_ld1r:
		; CHECK: ld1r.4h { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load i16* %bar
		%tmp2 = insertelement <4 x i16> <i16 undef, i16 undef, i16 undef, i16 undef>, i16 %tmp1, i32 0
		%tmp3 = insertelement <4 x i16> %tmp2, i16 %tmp1, i32 1
		%tmp4 = insertelement <4 x i16> %tmp3, i16 %tmp1, i32 2
		%tmp5 = insertelement <4 x i16> %tmp4, i16 %tmp1, i32 3
		%tmp6 = getelementptr i16* %bar, i64 %inc
		store i16* %tmp6, i16** %ptr
		ret <4 x i16> %tmp5
		}

		define <4 x i32> @test_v4i32_post_imm_ld1r(i32* %bar, i32** %ptr) {
		; CHECK-LABEL: test_v4i32_post_imm_ld1r:
		; CHECK: ld1r.4s { v0 }, [x0], #4
		%tmp1 = load i32* %bar
		%tmp2 = insertelement <4 x i32> <i32 undef, i32 undef, i32 undef, i32 undef>, i32 %tmp1, i32 0
		%tmp3 = insertelement <4 x i32> %tmp2, i32 %tmp1, i32 1
		%tmp4 = insertelement <4 x i32> %tmp3, i32 %tmp1, i32 2
		%tmp5 = insertelement <4 x i32> %tmp4, i32 %tmp1, i32 3
		%tmp6 = getelementptr i32* %bar, i64 1
		store i32* %tmp6, i32** %ptr
		ret <4 x i32> %tmp5
		}

		define <4 x i32> @test_v4i32_post_reg_ld1r(i32* %bar, i32** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v4i32_post_reg_ld1r:
		; CHECK: ld1r.4s { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load i32* %bar
		%tmp2 = insertelement <4 x i32> <i32 undef, i32 undef, i32 undef, i32 undef>, i32 %tmp1, i32 0
		%tmp3 = insertelement <4 x i32> %tmp2, i32 %tmp1, i32 1
		%tmp4 = insertelement <4 x i32> %tmp3, i32 %tmp1, i32 2
		%tmp5 = insertelement <4 x i32> %tmp4, i32 %tmp1, i32 3
		%tmp6 = getelementptr i32* %bar, i64 %inc
		store i32* %tmp6, i32** %ptr
		ret <4 x i32> %tmp5
		}

		define <2 x i32> @test_v2i32_post_imm_ld1r(i32* %bar, i32** %ptr) {
		; CHECK-LABEL: test_v2i32_post_imm_ld1r:
		; CHECK: ld1r.2s { v0 }, [x0], #4
		%tmp1 = load i32* %bar
		%tmp2 = insertelement <2 x i32> <i32 undef, i32 undef>, i32 %tmp1, i32 0
		%tmp3 = insertelement <2 x i32> %tmp2, i32 %tmp1, i32 1
		%tmp4 = getelementptr i32* %bar, i64 1
		store i32* %tmp4, i32** %ptr
		ret <2 x i32> %tmp3
		}

		define <2 x i32> @test_v2i32_post_reg_ld1r(i32* %bar, i32** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v2i32_post_reg_ld1r:
		; CHECK: ld1r.2s { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load i32* %bar
		%tmp2 = insertelement <2 x i32> <i32 undef, i32 undef>, i32 %tmp1, i32 0
		%tmp3 = insertelement <2 x i32> %tmp2, i32 %tmp1, i32 1
		%tmp4 = getelementptr i32* %bar, i64 %inc
		store i32* %tmp4, i32** %ptr
		ret <2 x i32> %tmp3
		}

		define <2 x i64> @test_v2i64_post_imm_ld1r(i64* %bar, i64** %ptr) {
		; CHECK-LABEL: test_v2i64_post_imm_ld1r:
		; CHECK: ld1r.2d { v0 }, [x0], #8
		%tmp1 = load i64* %bar
		%tmp2 = insertelement <2 x i64> <i64 undef, i64 undef>, i64 %tmp1, i32 0
		%tmp3 = insertelement <2 x i64> %tmp2, i64 %tmp1, i32 1
		%tmp4 = getelementptr i64* %bar, i64 1
		store i64* %tmp4, i64** %ptr
		ret <2 x i64> %tmp3
		}

		define <2 x i64> @test_v2i64_post_reg_ld1r(i64* %bar, i64** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v2i64_post_reg_ld1r:
		; CHECK: ld1r.2d { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load i64* %bar
		%tmp2 = insertelement <2 x i64> <i64 undef, i64 undef>, i64 %tmp1, i32 0
		%tmp3 = insertelement <2 x i64> %tmp2, i64 %tmp1, i32 1
		%tmp4 = getelementptr i64* %bar, i64 %inc
		store i64* %tmp4, i64** %ptr
		ret <2 x i64> %tmp3
		}

		define <4 x float> @test_v4f32_post_imm_ld1r(float* %bar, float** %ptr) {
		; CHECK-LABEL: test_v4f32_post_imm_ld1r:
		; CHECK: ld1r.4s { v0 }, [x0], #4
		%tmp1 = load float* %bar
		%tmp2 = insertelement <4 x float> <float undef, float undef, float undef, float undef>, float %tmp1, i32 0
		%tmp3 = insertelement <4 x float> %tmp2, float %tmp1, i32 1
		%tmp4 = insertelement <4 x float> %tmp3, float %tmp1, i32 2
		%tmp5 = insertelement <4 x float> %tmp4, float %tmp1, i32 3
		%tmp6 = getelementptr float* %bar, i64 1
		store float* %tmp6, float** %ptr
		ret <4 x float> %tmp5
		}

		define <4 x float> @test_v4f32_post_reg_ld1r(float* %bar, float** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v4f32_post_reg_ld1r:
		; CHECK: ld1r.4s { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load float* %bar
		%tmp2 = insertelement <4 x float> <float undef, float undef, float undef, float undef>, float %tmp1, i32 0
		%tmp3 = insertelement <4 x float> %tmp2, float %tmp1, i32 1
		%tmp4 = insertelement <4 x float> %tmp3, float %tmp1, i32 2
		%tmp5 = insertelement <4 x float> %tmp4, float %tmp1, i32 3
		%tmp6 = getelementptr float* %bar, i64 %inc
		store float* %tmp6, float** %ptr
		ret <4 x float> %tmp5
		}

		define <2 x float> @test_v2f32_post_imm_ld1r(float* %bar, float** %ptr) {
		; CHECK-LABEL: test_v2f32_post_imm_ld1r:
		; CHECK: ld1r.2s { v0 }, [x0], #4
		%tmp1 = load float* %bar
		%tmp2 = insertelement <2 x float> <float undef, float undef>, float %tmp1, i32 0
		%tmp3 = insertelement <2 x float> %tmp2, float %tmp1, i32 1
		%tmp4 = getelementptr float* %bar, i64 1
		store float* %tmp4, float** %ptr
		ret <2 x float> %tmp3
		}

		define <2 x float> @test_v2f32_post_reg_ld1r(float* %bar, float** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v2f32_post_reg_ld1r:
		; CHECK: ld1r.2s { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load float* %bar
		%tmp2 = insertelement <2 x float> <float undef, float undef>, float %tmp1, i32 0
		%tmp3 = insertelement <2 x float> %tmp2, float %tmp1, i32 1
		%tmp4 = getelementptr float* %bar, i64 %inc
		store float* %tmp4, float** %ptr
		ret <2 x float> %tmp3
		}

		define <2 x double> @test_v2f64_post_imm_ld1r(double* %bar, double** %ptr) {
		; CHECK-LABEL: test_v2f64_post_imm_ld1r:
		; CHECK: ld1r.2d { v0 }, [x0], #8
		%tmp1 = load double* %bar
		%tmp2 = insertelement <2 x double> <double undef, double undef>, double %tmp1, i32 0
		%tmp3 = insertelement <2 x double> %tmp2, double %tmp1, i32 1
		%tmp4 = getelementptr double* %bar, i64 1
		store double* %tmp4, double** %ptr
		ret <2 x double> %tmp3
		}

		define <2 x double> @test_v2f64_post_reg_ld1r(double* %bar, double** %ptr, i64 %inc) {
		; CHECK-LABEL: test_v2f64_post_reg_ld1r:
		; CHECK: ld1r.2d { v0 }, [x0], x{{[0-9]+}}
		%tmp1 = load double* %bar
		%tmp2 = insertelement <2 x double> <double undef, double undef>, double %tmp1, i32 0
		%tmp3 = insertelement <2 x double> %tmp2, double %tmp1, i32 1
		%tmp4 = getelementptr double* %bar, i64 %inc
		store double* %tmp4, double** %ptr
		ret <2 x double> %tmp3
		}

		define <16 x i8> @test_v16i8_post_imm_ld1lane(i8* %bar, i8** %ptr, <16 x i8> %A) {
		; CHECK-LABEL: test_v16i8_post_imm_ld1lane:
		; CHECK: ld1.b { v0 }[1], [x0], #1
		%tmp1 = load i8* %bar
		%tmp2 = insertelement <16 x i8> %A, i8 %tmp1, i32 1
		%tmp3 = getelementptr i8* %bar, i64 1
		store i8* %tmp3, i8** %ptr
		ret <16 x i8> %tmp2
		}

		define <16 x i8> @test_v16i8_post_reg_ld1lane(i8* %bar, i8** %ptr, i64 %inc, <16 x i8> %A) {
		; CHECK-LABEL: test_v16i8_post_reg_ld1lane:
		; CHECK: ld1.b { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load i8* %bar
		%tmp2 = insertelement <16 x i8> %A, i8 %tmp1, i32 1
		%tmp3 = getelementptr i8* %bar, i64 %inc
		store i8* %tmp3, i8** %ptr
		ret <16 x i8> %tmp2
		}

		define <8 x i8> @test_v8i8_post_imm_ld1lane(i8* %bar, i8** %ptr, <8 x i8> %A) {
		; CHECK-LABEL: test_v8i8_post_imm_ld1lane:
		; CHECK: ld1.b { v0 }[1], [x0], #1
		%tmp1 = load i8* %bar
		%tmp2 = insertelement <8 x i8> %A, i8 %tmp1, i32 1
		%tmp3 = getelementptr i8* %bar, i64 1
		store i8* %tmp3, i8** %ptr
		ret <8 x i8> %tmp2
		}

		define <8 x i8> @test_v8i8_post_reg_ld1lane(i8* %bar, i8** %ptr, i64 %inc, <8 x i8> %A) {
		; CHECK-LABEL: test_v8i8_post_reg_ld1lane:
		; CHECK: ld1.b { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load i8* %bar
		%tmp2 = insertelement <8 x i8> %A, i8 %tmp1, i32 1
		%tmp3 = getelementptr i8* %bar, i64 %inc
		store i8* %tmp3, i8** %ptr
		ret <8 x i8> %tmp2
		}

		define <8 x i16> @test_v8i16_post_imm_ld1lane(i16* %bar, i16** %ptr, <8 x i16> %A) {
		; CHECK-LABEL: test_v8i16_post_imm_ld1lane:
		; CHECK: ld1.h { v0 }[1], [x0], #2
		%tmp1 = load i16* %bar
		%tmp2 = insertelement <8 x i16> %A, i16 %tmp1, i32 1
		%tmp3 = getelementptr i16* %bar, i64 1
		store i16* %tmp3, i16** %ptr
		ret <8 x i16> %tmp2
		}

		define <8 x i16> @test_v8i16_post_reg_ld1lane(i16* %bar, i16** %ptr, i64 %inc, <8 x i16> %A) {
		; CHECK-LABEL: test_v8i16_post_reg_ld1lane:
		; CHECK: ld1.h { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load i16* %bar
		%tmp2 = insertelement <8 x i16> %A, i16 %tmp1, i32 1
		%tmp3 = getelementptr i16* %bar, i64 %inc
		store i16* %tmp3, i16** %ptr
		ret <8 x i16> %tmp2
		}

		define <4 x i16> @test_v4i16_post_imm_ld1lane(i16* %bar, i16** %ptr, <4 x i16> %A) {
		; CHECK-LABEL: test_v4i16_post_imm_ld1lane:
		; CHECK: ld1.h { v0 }[1], [x0], #2
		%tmp1 = load i16* %bar
		%tmp2 = insertelement <4 x i16> %A, i16 %tmp1, i32 1
		%tmp3 = getelementptr i16* %bar, i64 1
		store i16* %tmp3, i16** %ptr
		ret <4 x i16> %tmp2
		}

		define <4 x i16> @test_v4i16_post_reg_ld1lane(i16* %bar, i16** %ptr, i64 %inc, <4 x i16> %A) {
		; CHECK-LABEL: test_v4i16_post_reg_ld1lane:
		; CHECK: ld1.h { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load i16* %bar
		%tmp2 = insertelement <4 x i16> %A, i16 %tmp1, i32 1
		%tmp3 = getelementptr i16* %bar, i64 %inc
		store i16* %tmp3, i16** %ptr
		ret <4 x i16> %tmp2
		}

		define <4 x i32> @test_v4i32_post_imm_ld1lane(i32* %bar, i32** %ptr, <4 x i32> %A) {
		; CHECK-LABEL: test_v4i32_post_imm_ld1lane:
		; CHECK: ld1.s { v0 }[1], [x0], #4
		%tmp1 = load i32* %bar
		%tmp2 = insertelement <4 x i32> %A, i32 %tmp1, i32 1
		%tmp3 = getelementptr i32* %bar, i64 1
		store i32* %tmp3, i32** %ptr
		ret <4 x i32> %tmp2
		}

		define <4 x i32> @test_v4i32_post_reg_ld1lane(i32* %bar, i32** %ptr, i64 %inc, <4 x i32> %A) {
		; CHECK-LABEL: test_v4i32_post_reg_ld1lane:
		; CHECK: ld1.s { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load i32* %bar
		%tmp2 = insertelement <4 x i32> %A, i32 %tmp1, i32 1
		%tmp3 = getelementptr i32* %bar, i64 %inc
		store i32* %tmp3, i32** %ptr
		ret <4 x i32> %tmp2
		}

		define <2 x i32> @test_v2i32_post_imm_ld1lane(i32* %bar, i32** %ptr, <2 x i32> %A) {
		; CHECK-LABEL: test_v2i32_post_imm_ld1lane:
		; CHECK: ld1.s { v0 }[1], [x0], #4
		%tmp1 = load i32* %bar
		%tmp2 = insertelement <2 x i32> %A, i32 %tmp1, i32 1
		%tmp3 = getelementptr i32* %bar, i64 1
		store i32* %tmp3, i32** %ptr
		ret <2 x i32> %tmp2
		}

		define <2 x i32> @test_v2i32_post_reg_ld1lane(i32* %bar, i32** %ptr, i64 %inc, <2 x i32> %A) {
		; CHECK-LABEL: test_v2i32_post_reg_ld1lane:
		; CHECK: ld1.s { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load i32* %bar
		%tmp2 = insertelement <2 x i32> %A, i32 %tmp1, i32 1
		%tmp3 = getelementptr i32* %bar, i64 %inc
		store i32* %tmp3, i32** %ptr
		ret <2 x i32> %tmp2
		}

		define <2 x i64> @test_v2i64_post_imm_ld1lane(i64* %bar, i64** %ptr, <2 x i64> %A) {
		; CHECK-LABEL: test_v2i64_post_imm_ld1lane:
		; CHECK: ld1.d { v0 }[1], [x0], #8
		%tmp1 = load i64* %bar
		%tmp2 = insertelement <2 x i64> %A, i64 %tmp1, i32 1
		%tmp3 = getelementptr i64* %bar, i64 1
		store i64* %tmp3, i64** %ptr
		ret <2 x i64> %tmp2
		}

		define <2 x i64> @test_v2i64_post_reg_ld1lane(i64* %bar, i64** %ptr, i64 %inc, <2 x i64> %A) {
		; CHECK-LABEL: test_v2i64_post_reg_ld1lane:
		; CHECK: ld1.d { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load i64* %bar
		%tmp2 = insertelement <2 x i64> %A, i64 %tmp1, i32 1
		%tmp3 = getelementptr i64* %bar, i64 %inc
		store i64* %tmp3, i64** %ptr
		ret <2 x i64> %tmp2
		}

		define <4 x float> @test_v4f32_post_imm_ld1lane(float* %bar, float** %ptr, <4 x float> %A) {
		; CHECK-LABEL: test_v4f32_post_imm_ld1lane:
		; CHECK: ld1.s { v0 }[1], [x0], #4
		%tmp1 = load float* %bar
		%tmp2 = insertelement <4 x float> %A, float %tmp1, i32 1
		%tmp3 = getelementptr float* %bar, i64 1
		store float* %tmp3, float** %ptr
		ret <4 x float> %tmp2
		}

		define <4 x float> @test_v4f32_post_reg_ld1lane(float* %bar, float** %ptr, i64 %inc, <4 x float> %A) {
		; CHECK-LABEL: test_v4f32_post_reg_ld1lane:
		; CHECK: ld1.s { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load float* %bar
		%tmp2 = insertelement <4 x float> %A, float %tmp1, i32 1
		%tmp3 = getelementptr float* %bar, i64 %inc
		store float* %tmp3, float** %ptr
		ret <4 x float> %tmp2
		}

		define <2 x float> @test_v2f32_post_imm_ld1lane(float* %bar, float** %ptr, <2 x float> %A) {
		; CHECK-LABEL: test_v2f32_post_imm_ld1lane:
		; CHECK: ld1.s { v0 }[1], [x0], #4
		%tmp1 = load float* %bar
		%tmp2 = insertelement <2 x float> %A, float %tmp1, i32 1
		%tmp3 = getelementptr float* %bar, i64 1
		store float* %tmp3, float** %ptr
		ret <2 x float> %tmp2
		}

		define <2 x float> @test_v2f32_post_reg_ld1lane(float* %bar, float** %ptr, i64 %inc, <2 x float> %A) {
		; CHECK-LABEL: test_v2f32_post_reg_ld1lane:
		; CHECK: ld1.s { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load float* %bar
		%tmp2 = insertelement <2 x float> %A, float %tmp1, i32 1
		%tmp3 = getelementptr float* %bar, i64 %inc
		store float* %tmp3, float** %ptr
		ret <2 x float> %tmp2
		}

		define <2 x double> @test_v2f64_post_imm_ld1lane(double* %bar, double** %ptr, <2 x double> %A) {
		; CHECK-LABEL: test_v2f64_post_imm_ld1lane:
		; CHECK: ld1.d { v0 }[1], [x0], #8
		%tmp1 = load double* %bar
		%tmp2 = insertelement <2 x double> %A, double %tmp1, i32 1
		%tmp3 = getelementptr double* %bar, i64 1
		store double* %tmp3, double** %ptr
		ret <2 x double> %tmp2
		}

		define <2 x double> @test_v2f64_post_reg_ld1lane(double* %bar, double** %ptr, i64 %inc, <2 x double> %A) {
		; CHECK-LABEL: test_v2f64_post_reg_ld1lane:
		; CHECK: ld1.d { v0 }[1], [x0], x{{[0-9]+}}
		%tmp1 = load double* %bar
		%tmp2 = insertelement <2 x double> %A, double %tmp1, i32 1
		%tmp3 = getelementptr double* %bar, i64 %inc
		store double* %tmp3, double** %ptr
		ret <2 x double> %tmp2
		}

Revision Contents

Diff 9340

lib/Target/ARM64/ARM64ISelDAGToDAG.cpp

lib/Target/ARM64/ARM64ISelLowering.h

lib/Target/ARM64/ARM64ISelLowering.cpp

test/CodeGen/ARM64/indexed-vector-ldst.ll
