This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE2] Combine add+lsr to rshrnb for stores
ClosedPublic

Authored by MattDevereau on Jul 14 2023, 8:09 AM.

Details

Summary

[AArch64][SVE2] Combine add+lsr to rshrnb for stores

The example sequence

add z0.h, z0.h, #32
lsr z0.h, z0.h, #6
st1b z0.h, x1

can be replaced with

rshrnb z0.b, z0.h, #6
st1b z0.h, x1

as the top half of each destination element is truncated by the store.
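The equivalence can be checked in scalar code. The following is a standalone C sketch (an illustration, not the LLVM code; the function names are made up): adding 1 << (shift-1) before a logical right shift is exactly the rounding that RSHRNB performs, and because the narrowing store keeps only the low 8 bits of each halfword, even wraparound of the 16-bit add cannot be observed.

```c
#include <stdint.h>

/* Scalar model of the original sequence: a 16-bit add (which may
 * wrap), a logical shift right, then a truncating store to 8 bits. */
uint8_t add_lsr_st1b(uint16_t x, unsigned shift) {
    uint16_t sum = (uint16_t)(x + (1u << (shift - 1))); /* add z0.h, #bias */
    return (uint8_t)(sum >> shift);                     /* lsr, then st1b */
}

/* Scalar model of RSHRNB, following the Arm pseudocode:
 * res = (UInt(element) + (1 << (shift-1))) >> shift, then narrowed. */
uint8_t rshrnb_model(uint16_t x, unsigned shift) {
    return (uint8_t)(((uint32_t)x + (1u << (shift - 1))) >> shift);
}

/* Exhaustive check over all halfwords: any difference caused by the
 * wrapping 16-bit add is a multiple of 2^16 >> shift (2^10 for
 * shift 6), which the truncation to 8 bits discards. */
int models_agree(unsigned shift) {
    for (uint32_t x = 0; x <= 0xFFFF; x++)
        if (add_lsr_st1b((uint16_t)x, shift) != rshrnb_model((uint16_t)x, shift))
            return 0;
    return 1;
}
```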

In similar fashion,

add z0.s, z0.s, #32
lsr z1.s, z1.s, #6
add z1.s, z1.s, #32
lsr z0.s, z0.s, #6
uzp1 z0.h, z0.h, z1.h

can be replaced with

rshrnb z1.h, z1.s, #6
rshrnb z0.h, z0.s, #6
uzp1 z0.h, z0.h, z1.h

Diff Detail

Event Timeline

MattDevereau created this revision.Jul 14 2023, 8:09 AM
MattDevereau requested review of this revision.Jul 14 2023, 8:09 AM
Herald added a project: Restricted Project.Jul 14 2023, 8:09 AM

This is a nice optimisation @MattDevereau, thanks! I found there is another case we could support, with loops like this where the store doesn't come straight afterwards:

void foo(unsigned short *dest, unsigned short *src, long n) {
  for (long i = 0; i < n; i++)
    dest[i] += ((src[i] + 32) >> 6);
}

In this case the IR sequence is add, lshr, trunc, since the truncate doesn't get absorbed into the store. Maybe it's worth seeing if you can reuse your code from tryCombineStoredNarrowShift for this case too?

MattDevereau edited the summary of this revision. (Show Details)
kmclaughlin added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20107

Hi @MattDevereau,
You might need to check the value returned by DAG.getSplatValue here, as I think the dyn_cast will fail if the operand is not a splat as expected.
Please can you add a negative test for this scenario as well?

@kmclaughlin Thank you. I've added two tests @neg_trunc_lsr_add_op1_not_splat and @neg_trunc_lsr_op1_not_splat to bail out of emitting rshrnb when the RHS operands are not splat values.

I've also added a test @neg_add_has_two_uses and a check that the add does not have more than one use. Without this check it is possible to generate the following regression, which costs 2 extra cycles:

neg_add_two_use:
  ptrue p0.h
  ld1h { z0.h }, p0/z, [x0]
  rshrnb z1.b, z0.h, #6
  add z0.h, z0.h, #32 // =0x20
  add z0.h, z0.h, z0.h
  st1h { z0.h }, p0, [x2, x3, lsl #1]
  st1b { z1.h }, p0, [x1, x3]
  ret

Thank you for adding the new tests @MattDevereau, I just have a couple of small suggestions in the trySimplifySrlAddToRshrnb function.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20100

I think if you change this to dyn_cast_or_null you can remove the additional isSplatValue check at the beginning of the function.

20109

Similarly here, I think you can remove the isSplatValue above by using dyn_cast_or_null

Replaced explicit checks for splat values with dyn_cast_or_null

MattDevereau marked 3 inline comments as done.Jul 26 2023, 6:48 AM
kmclaughlin accepted this revision.Jul 27 2023, 6:14 AM

Thanks @MattDevereau, there's one nit but otherwise LGTM!

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20169–20175

nit: The braces here aren't necessary

20182–20188

As above :)

This revision is now accepted and ready to land.Jul 27 2023, 6:14 AM

Hello. When adding these for NEON we were able to do the transform via tablegen patterns at selection time, which has some benefits in leaving the nodes generic for as long as possible. In this case, because of the differences in the instructions, I would expect that it would need demanded bits, so doing it as a combine probably makes sense.

It's best not to generate a MachineNode directly though, if we can avoid it.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20119

We should avoid creating machine nodes in SelectionDAG combines. It's like a layering violation. Can you change this to either generate an intrinsic, or preferably add a new AArch64ISD node for it?

hassnaa-arm added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20114

Hi Matt,
I think that check doesn't consider the case where the AddValue has bits set other than the most significant bit.
For example: the ShiftValue is 6 and the AddValue is 33 (100001 in binary).
Is that correct?

20114

Do you handle the case where the destination of the add operation has a set bit at the same index as the set bit in the added value?
In that case the add operation will have a side effect, and by combining that case into rshrnb you cancel the side effect of the add.
Do you agree, or am I missing something?

MattDevereau added inline comments.Aug 1 2023, 6:10 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20114

I think that check doesn't consider the case where the AddValue has bits set other than the most significant bit.
For example: the ShiftValue is 6 and the AddValue is 33 (100001 in binary).

This check would correctly reject that example, which is not a case for this combine. As per the documentation for rshrnb the operation is

integer res = (UInt(element) + (1 << (shift-1))) >> shift;

Since 33 cannot be produced by left-shifting 1, the combine will correctly bail out. The lower bits in this case are kept, as the combine doesn't fire.
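The guard being described can be sketched as a standalone predicate (the function name is hypothetical, not the one used in the patch): the combine only fires when the added constant is exactly the rounding bias 1 << (ShiftValue - 1) that RSHRNB itself adds internally.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch of the guard discussed above: the add constant
 * must be exactly the rounding bias that RSHRNB adds internally, so
 * any other set bit causes a bail-out. */
bool is_rshrnb_bias(uint64_t add_value, uint64_t shift_value) {
    if (shift_value == 0 || shift_value > 63)
        return false;
    return add_value == (1ULL << (shift_value - 1));
}
```

With ShiftValue 6 the only accepted AddValue is 32; 33 also has bit 0 set and is rejected.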

20114

What side effect do you mean exactly?

Do you handle the case where the destination of the add operation has a set bit at the same index as the set bit in the added value?

I don't think the bits in the destination register of the add matter; register allocation should handle that fine and give us a non-conflicting destination register.

20119

Thanks for pointing that out. I'm working on emitting a new AArch64ISD node now, however it adds a bit more complexity to the patch.

Matt added a subscriber: Matt.Aug 1 2023, 2:28 PM
hassnaa-arm added inline comments.Aug 2 2023, 2:31 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20114

Sorry, I didn't express it clearly.
For example, the add operation is src + (100000 in binary).
What if the src has a set bit at the index corresponding to the '1' in the added value, i.e. index 5?
That means there will be 1+1, and bit 6 will be affected by the carry.
How is that case handled?

MattDevereau added inline comments.Aug 2 2023, 3:18 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20114

Sorry, but I don't understand which case you mean. I think what you are referring to is the rounding behaviour of rshrnb, which is the intended calculation and not a side effect.
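The carry case can be pinned down numerically (a scalar illustration, not the patch code): with shift 6 the bias is 32, and a source that already has bit 5 set does carry into bit 6. That carry is precisely what makes the add-then-shift round to nearest instead of truncating.

```c
#include <stdint.h>

/* Round-to-nearest via add-then-shift: the carry out of bit
 * (shift - 1) is what rounds values with fractional part >= 0.5
 * upward. */
uint16_t round_shift(uint16_t x, unsigned shift) {
    return (uint16_t)((x + (1u << (shift - 1))) >> shift);
}

/* A plain lsr, by contrast, always truncates toward zero. */
uint16_t trunc_shift(uint16_t x, unsigned shift) {
    return (uint16_t)(x >> shift);
}
```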

Emit an AArch64ISD node instead of the machine node directly.
Include D->S addressing mode of RSHRNB

Thanks. This looks good, but it might need to be using 1ULL for the shift value, and there are a few extra suggestions below. Otherwise LGTM.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20093

Can you either add a check that the VT is one that is valid for RSHRNB, or an assert that we only handle those types?

20113

This may need to be 1ULL << (ShiftValue - 1), as the number could be a 64-bit value.
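The point about 1ULL can be seen in isolation (a standalone sketch, not the patch itself): for .d elements the element shift can exceed 31, and shifting a plain int literal by 32 or more is undefined behaviour in C and C++, whereas the unsigned long long form stays well defined.

```c
#include <stdint.h>

/* Compute the RSHRNB rounding bias for a given element shift. Using
 * 1ULL keeps the shift in 64-bit arithmetic; a plain `1` would be an
 * int, and shifting an int by 32 or more is undefined behaviour. */
uint64_t rounding_bias(unsigned shift_value) {
    return 1ULL << (shift_value - 1);
}
```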

20117–20119
SDValue Rshrnb = DAG.getNode(
      AArch64ISD::RSHRNB_I, DL, VT,
      Add->getOperand(0), DAG.getTargetConstant(ShiftValue, DL, MVT::i32));

(Or just return it directly)

20168

trySimplifySrlAddToRshrnb could take an SDValue, instead of needing the cast.

20792

I think RshrnbVT is just ValueVT?

llvm/lib/Target/AArch64/AArch64ISelLowering.h
217

This might deserve to be in its own little section. There may be a number of T/B nodes we end up adding, if we treat them in the same way.

llvm/lib/Target/AArch64/SVEInstrFormats.td
4309 ↗(On Diff #546405)

It might be better to generate a nxv16i8->nxv8i16 RSHRN with a bitcast/nvcast back to nxv8i16.

Allen added a subscriber: Allen.Aug 4 2023, 7:05 PM
MattDevereau marked 6 inline comments as done.
MattDevereau added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20093

Since we're matching the Rshrnb ISD node emitted with the tablegen patterns that match nxv8i16 -> nxv16i8, nxv4i32 -> nxv8i16 and nxv2i64 -> nxv4i32, I've added an explicit block for this now:

EVT ResVT;
if (VT == MVT::nxv8i16)
  ResVT = MVT::nxv16i8;
else if (VT == MVT::nxv4i32)
  ResVT = MVT::nxv8i16;
else if (VT == MVT::nxv2i64)
  ResVT = MVT::nxv4i32;
else
  return SDValue();

We can use ResVT for the SDNode creation of Rshrnb and use VT for the bitcast back to the type that fits the DAG correctly.
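The type restriction can also be expressed as a small scalar table (a hypothetical mirror of the check above, not LLVM code): RSHRNB only halves element widths of 16, 32 and 64 bits, and anything else must bail out.

```c
/* Hypothetical scalar mirror of the type check above: returns the
 * narrowed element width in bits, or 0 when the combine must bail
 * out (the SDValue() case). */
unsigned narrowed_elt_bits(unsigned wide_elt_bits) {
    switch (wide_elt_bits) {
    case 16: return 8;   /* nxv8i16 -> nxv16i8 */
    case 32: return 16;  /* nxv4i32 -> nxv8i16 */
    case 64: return 32;  /* nxv2i64 -> nxv4i32 */
    default: return 0;   /* no matching RSHRNB form */
    }
}
```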

20113

Done, and changed int64_t AddValue to uint64_t AddValue to line up with the change.

20117–20119

This is directly returning a bitcast SDNode back to the original VT now.

20168

I get the feeling this is a side-grade, since I need to pass in SDLoc as an extra parameter this way, but I can see the cast looks a bit ugly. I'm not fussy about this, so I'll go with your suggestion.

dmgreen accepted this revision.Aug 8 2023, 7:04 AM

Thanks. LGTM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
20091

SDValue is usually passed by value, not as a pointer. It might also be possible to generate the DL from SDLoc DL(Srl);, depending on where it is best for the debug loc to come from.

This revision was automatically updated to reflect the committed changes.