This is an archive of the discontinued LLVM Phabricator instance.

Differential D135324

[AArch64-SVE]: force using SVE in streaming mode to lower arithmetic and logical fixed-width vector ops.
ClosedPublic

Authored by hassnaa-arm on Oct 5 2022, 3:05 PM.

Details

Reviewers

david-arm
kmclaughlin
sdesmalen
paulwalker-arm

Commits

rG956489700e73: [AArch64-SVE]: Force generating code compatible to streaming mode.

Summary

Force using SVE in streaming mode to lower arithmetic and logical fixed-width vector ops.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hassnaa-arm created this revision. Oct 5 2022, 3:05 PM

Herald added a project: Restricted Project. Oct 5 2022, 3:05 PM

Herald added subscribers: ctetreau, hiraditya, kristof.beyls, tschuett.

hassnaa-arm requested review of this revision. Oct 5 2022, 3:06 PM

Herald added a project: Restricted Project. Oct 5 2022, 3:06 PM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B190614: Diff 465569. Oct 5 2022, 3:06 PM

hassnaa-arm added reviewers: david-arm, kmclaughlin. Oct 5 2022, 3:06 PM

Matt added a subscriber: Matt. Oct 5 2022, 7:55 PM

Hi @hassnaa-arm, could you rename the title to something that describes the patch a little more? I think something like

[AArch64][SVE]: Force the use of SVE to lower fixed-width arithmetic ops in streaming mode

would be a bit clearer. What do you think?

Hi @hassnaa-arm, it looks like this patch is based off D133433. Can you add that as a parent revision so it's obvious to the reviewer please? You can do this by clicking on "Edit Related Revisions" -> "Edit Parent Revisions" at the top-right corner of the page. Thanks!

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4450	Hi @hassnaa-arm, this change doesn't look right. I would expect it to break some tests? When we're not in streaming mode we also want to override NEON for 64-bit element types. Can you put the OverrideNEON flag back in, perhaps something like: `// If SVE is available then i64 vector multiplications can also be made legal.` `bool OverrideNEON = VT == MVT::v2i64 || VT == MVT::v1i64 || Subtarget->forceSVEInStreamingMode();`
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-arith.ll
92	Again, this is illegal in streaming mode.
1588	This looks like a NEON instruction - can you investigate where this is coming from?

fix lowerMul to override NEON for v2i64 and v1i64 even if SVE is not forced
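
The intent of this update can be sketched with a small toy model. Everything in it (`VecType`, `ToySubtarget`, `overrideNEONForMul`, `useSVEForMul`) is a hypothetical stand-in and not LLVM API; only the shape of the `OverrideNEON` expression comes from the review, and the real `useSVEForFixedLengthVectorVT()` checks considerably more than this.

```cpp
// Hypothetical, simplified stand-ins for illustration only; not LLVM types.
enum class VecType { v8i8, v4i32, v1i64, v2i64 };

struct ToySubtarget {
  bool HasSVE;                      // SVE instructions are available
  bool ForceStreamingCompatibleSVE; // NEON must be avoided everywhere
};

// Same shape as the OverrideNEON expression discussed for LowerMUL:
// i64 element vectors always prefer SVE, and every NEON-sized vector
// prefers SVE once streaming-compatible code is forced.
bool overrideNEONForMul(VecType VT, const ToySubtarget &ST) {
  return VT == VecType::v2i64 || VT == VecType::v1i64 ||
         ST.ForceStreamingCompatibleSVE;
}

// Toy decision for NEON-sized vectors: take the SVE path only when SVE
// exists and the override was requested.
bool useSVEForMul(VecType VT, const ToySubtarget &ST) {
  return ST.HasSVE && overrideNEONForMul(VT, ST);
}
```

In this model a `v2i64` multiply takes the SVE path whenever SVE is available, while a `v4i32` multiply does so only when the streaming-compatible flag is set, which is the behaviour the reviewer asked to preserve.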

Harbormaster completed remote builds in B190701: Diff 465695.Oct 6 2022, 3:52 AM

hassnaa-arm marked an inline comment as done.Oct 6 2022, 3:53 AM

hassnaa-arm added a parent revision: D133433: [AArch64]: Force generating code compatible to streaming mode.Oct 6 2022, 3:57 AM

hassnaa-arm retitled this revision from "[AArch64-SVE]: force using SVE in streaming mode" to "[AArch64-SVE]: force using SVE in streaming mode to lower arithmetic and logical fixed-width vector ops.". Oct 6 2022, 4:02 AM

Thanks for this @hassnaa-arm! I had some comments about how to tidy up the tests a bit. I also think there are some load/store test changes that shouldn't be part of this patch.

llvm/test/CodeGen/AArch64/sve-fixed-length-masked-stores.ll
420 ↗	(On Diff #465695)	nit: whitespace
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ext-loads.ll
8 ↗	(On Diff #465695)	I don't think these changes should be part of this patch, since it's not changing loads and stores?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-arith.ll
13	Could you also add a test for an illegal NEON type too, i.e. `<4 x i8>` or `<2 x i16>`?
92	Please ignore this comment! `stp q0, q1` is legal - my mistake!
473	Again, could you add at least one illegal type - `<4 x i8>` or `<2 x i16>`?
1410	Can you add an illegal NEON type such as `<2 x i16>`?
1588	Please ignore this - my mistake!
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-div.ll
299	This still has the `vscale_range(16,0)` attribute. Can you remove it and recreate the CHECK lines please?
1063	Again, this still has the `vscale_range(16,0)` attribute. Can you remove it and regenerate the CHECK lines?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll
17	Can you add a test for an illegal type such as `<4 x i8>` too?
218	Wow, this code surely gets an award for being so impressively bad?!
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-rem.ll
472	I think that you can remove the tests greater than 512 bits, i.e. <128 x i8>. If the tests already work for <64 x i8> they are likely to work for anything larger too.
774	Again, maybe remove this test since I'm not sure what extra value it gives us?
1608	Again, maybe remove this test?
1775	Again, maybe remove this test?
2240	Again, maybe remove this test?
2335	Again, maybe remove this test?
2652	Again, maybe remove this test?
2747	Again, maybe remove this test?

david-arm added inline comments.Oct 6 2022, 7:33 AM

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-rem.ll
3384	Again, maybe remove this test?
3686	Again, maybe remove this test?
4520	Again, maybe remove this test?
4687	Again, maybe remove this test?
5152	Again, maybe remove this test?
5247	Again, maybe remove this test?
5564	Again, maybe remove this test?
5659	Again, maybe remove this test?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-loads.ll
2 ↗	(On Diff #465695)	Not part of this patch?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-store.ll
2 ↗	(On Diff #465695)	Not part of this patch?

revert changes related to load/store as they are not related to this patch

Harbormaster completed remote builds in B190757: Diff 465775.Oct 6 2022, 10:48 AM

add some illegal NEON types test cases and remove unnecessary tests

hassnaa-arm added inline comments.Oct 6 2022, 12:01 PM

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ext-loads.ll
8 ↗	(On Diff #465695)	Sorry, that was included by mistake. I will correct it.

Harbormaster completed remote builds in B190784: Diff 465818.Oct 6 2022, 12:55 PM

This looks a lot better now, thanks @hassnaa-arm, and the test files are much smaller too. I spotted issues in a couple of tests where we are using illegal NEON instructions, e.g. `bic`. Would you be able to investigate these please?

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-div.ll
977	This is a NEON vector instruction - this is definitely illegal in streaming mode. Can you try to find out why this is being inserted please?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll
1212	Again, these `bic` instructions are illegal in streaming mode.

Fix invalid bic instruction in streaming mode.
Fix the invalid `bic` instruction that was generated while combining `and` nodes: convert the fixed-length vector to a scalable one so that the SVE `and` combine is used instead of the plain `and` combine.

Harbormaster completed remote builds in B191653: Diff 467017.Oct 11 2022, 10:14 PM

hassnaa-arm edited parent revisions, added: D135564: [AArch64-SVE]: Force generating code compatible to streaming mode.; removed: D133433: [AArch64]: Force generating code compatible to streaming mode.Oct 11 2022, 10:17 PM

Remove unrelated changes

Harbormaster completed remote builds in B191925: Diff 467414.Oct 13 2022, 2:54 AM

Remove unrelated changes

Harbormaster completed remote builds in B191926: Diff 467416.Oct 13 2022, 3:00 AM

Remove unrelated changes

Harbormaster completed remote builds in B191928: Diff 467417.Oct 13 2022, 3:03 AM

Update by latest changes of parent patch

Harbormaster completed remote builds in B192013: Diff 467545.Oct 13 2022, 12:39 PM

Update by changes of parent patch

Harbormaster completed remote builds in B192186: Diff 467788.Oct 14 2022, 9:17 AM

hassnaa-arm added reviewers: sdesmalen, paulwalker-arm.Oct 17 2022, 2:06 AM

Update by parent patch

Harbormaster completed remote builds in B192462: Diff 468154.Oct 17 2022, 3:56 AM

sdesmalen added inline comments.Oct 18 2022, 5:32 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15559	nit: In LLVM the style is to start local variable names with an upper-case letter, e.g. `ScalableLHS`.
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-arith.ll
66–93	I think this test can be removed, because you've already covered the "twice as wide" case (32 x i8), which ensures we don't emit any other instructions not valid in streaming mode. The "four times as wide" case should already be covered by `sve-fixed-length-int-arith.ll`.
151	This test can be removed for the same reason as mentioned above.
224	This test can be removed for the same reason as mentioned above.
297	This test can be removed for the same reason as mentioned above. (same for all other 4 x as wide instances in the remainder of this file and other files in this patch)

hassnaa-arm marked 5 inline comments as done.Oct 18 2022, 7:34 AM

Remove not needed test cases

Harbormaster completed remote builds in B192745: Diff 468535.Oct 18 2022, 7:35 AM

paulwalker-arm added inline comments.Oct 19 2022, 5:47 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15556–15562	This looks odd. We shouldn't really be doing lowering within DAGCombine. What happens if you just exit the combine for the invalid case? That said, I can see functions like `tryAdvSIMDModImm32()` are used in other parts of codegen, so I'm wondering if the prevention logic is best placed within such functions so all use cases are covered.
22092	When we hit a similar issue with `LowerToPredicatedOp()` we decided to drop the calls to `useSVEForFixedLengthVectorVT()` in favour of just using `VT.isFixedLengthVector() && isTypeLegal(VT)`. Would the same work in your case?

hassnaa-arm added inline comments.Oct 19 2022, 8:33 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22092	Sorry, I don't understand. Do you mean dropping the call to `useSVEForFixedLengthVectorVT(...)`, or do you mean using `useSVEForFixedLengthVectorVT(VT)` without passing the OverrideNEON parameter?

paulwalker-arm added inline comments.Oct 19 2022, 8:42 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22092	The former, so you can drop the call to `useSVEForFixedLengthVectorVT()` and instead have `assert(VT.isFixedLengthVector() && isTypeLegal(VT) && ...`. By this point we should be working with only legal types and there's no harm in handling any of them.

hassnaa-arm added inline comments.Oct 19 2022, 9:26 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22092	Why do you suggest that instead of using `useSVEForFixedLengthVectorVT()`? And why do you suggest it only for `LowerToPredicatedOp()` and not for the other lowering functions as well?

paulwalker-arm added inline comments.Oct 19 2022, 9:47 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22092	`useSVEForFixedLengthVectorVT()` is a semi-complex function that exists to choose which path to take during code generation and is thusly used to determine how to lower an `ISD::ADD` for example. However, by calling `LowerToPredicatedOp()` you've already made that decision and so you only need to detect scenarios that would result in broken code generation. For the case of `LowerToPredicatedOp()` this just means ensuring the input is a legal fixed length vector.
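
The split described here, between choosing a path and merely guarding the chosen lowering, can be sketched as follows. This is a toy model under stated assumptions: `ToyVT`, `Path`, `choosePath`, and `isValidPredicatedOpInput` are hypothetical names, and the policy function is far simpler than the real `useSVEForFixedLengthVectorVT()`.

```cpp
// Hypothetical toy description of a value type; not LLVM's EVT.
struct ToyVT {
  bool IsFixedLength;
  bool IsLegal;
  unsigned SizeInBits;
};

enum class Path { NEON, SVE };

// Policy: decide which code generation path an operation should take.
// This loosely plays the role the thread assigns to
// useSVEForFixedLengthVectorVT(); the per-opcode lowering calls it.
Path choosePath(const ToyVT &VT, bool OverrideNEON) {
  if (!VT.IsFixedLength)
    return Path::SVE; // scalable vectors always use SVE
  if (VT.SizeInBits <= 128)
    return OverrideNEON ? Path::SVE : Path::NEON;
  return Path::SVE; // toy assumption: wider vectors always fit in SVE
}

// Mechanism: once the SVE path has been chosen, the lowering helper only
// needs a sanity check on its input, mirroring the suggested
// `VT.isFixedLengthVector() && isTypeLegal(VT)` assertion.
bool isValidPredicatedOpInput(const ToyVT &VT) {
  return VT.IsFixedLength && VT.IsLegal;
}
```

The design point is that the guard does not repeat the policy: a legal 128-bit fixed-length vector passes `isValidPredicatedOpInput` regardless of the `OverrideNEON` value that led the caller to pick the SVE path.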

hassnaa-arm marked 4 inline comments as done.Oct 20 2022, 4:26 AM

Remove step of converting to scalable vector that was added within DAGCombine (performAndCombine)

Harbormaster completed remote builds in B193186: Diff 469161.Oct 20 2022, 4:29 AM

hassnaa-arm added a child revision: D136147: [AArch64-SVE]: Test enabling streaming mode for tests of: shifts, extract subverter, build vector, concat, and extract vector elt.Oct 20 2022, 4:47 AM

Update by latest changes of parent patch.

Harbormaster completed remote builds in B193514: Diff 469597.Oct 21 2022, 8:02 AM

hassnaa-arm removed a child revision: D136147: [AArch64-SVE]: Test enabling streaming mode for tests of: shifts, extract subverter, build vector, concat, and extract vector elt.Oct 21 2022, 8:56 AM

Update by parent patch

Harbormaster completed remote builds in B193554: Diff 469650.Oct 21 2022, 10:30 AM

Update by parent patch

Harbormaster completed remote builds in B194673: Diff 471183.Oct 27 2022, 10:01 AM

Update by parent patch

Harbormaster completed remote builds in B194899: Diff 471493.Oct 28 2022, 4:52 AM

hassnaa-arm added a child revision: D136858: [AArch64-SVE]: Force generating code compatible to streaming mode for sve-fixed-length tests..Nov 1 2022, 3:59 AM

hassnaa-arm added a child revision: D137093: [AArch64][SVE][NFC] Add streaming mode SVE tests.

Hi @hassnaa-arm, I think this patch is very close to being ready! However, do you know why the test file sve-streaming-fixed-length-int-shifts.ll was deleted?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1398–1399	I remember in one of your previous patches that @sdesmalen mentioned you shouldn't need to add `v1i64` as it should be treated as a scalar. What happens if you remove it? I imagine your v1i64 tests might just generate scalar code?

Update by parent patch

Harbormaster completed remote builds in B195464: Diff 472290.Nov 1 2022, 7:04 AM

In D135324#3898875, @david-arm wrote:

Hi @hassnaa-arm, I think this patch is very close to being ready! However, do you know why the test file sve-streaming-fixed-length-int-shifts.ll was deleted?

It was a mistake made while rebasing this patch against the parent patch.
In the parent patch, that deleted file was replaced by sve-streaming-mode-fixed-length-int-shifts.ll

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1398–1399	Yes, in that patch there were no tests that needed custom lowering for v1i64. But in this patch, the test file sve-streaming-mode-fixed-length-int-log.ll has invalid instructions for these test cases: `define <1 x i64> @and_v1i64(<1 x i64> %op1, <1 x i64> %op2)` and `define <1 x i64> @xor_v1i64(<1 x i64> %op1, <1 x i64> %op2)`.

hassnaa-arm marked an inline comment as done and an inline comment as not done.Nov 1 2022, 7:42 AM

paulwalker-arm accepted this revision.Nov 1 2022, 6:07 PM

paulwalker-arm added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1398–1399	I'll also add that in general you don't want to revert such integer types to scalar because that'll cause GPR-VPR transfers that can be expensive, perhaps even more so when it comes to streaming mode. You can also see that within `LowerMUL` we have special handling for `v1i64` to keep it in the vector unit.

This revision is now accepted and ready to land.Nov 1 2022, 6:07 PM

sdesmalen added inline comments.Nov 2 2022, 1:38 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1398–1399	Does that mean we need coverage as well for v1i32? (and perhaps also v1i8, v1i16). If so, I wonder if that might warrant a separate patch rather than support the odd case here?

paulwalker-arm added inline comments.Nov 2 2022, 6:52 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1398–1399	Discussed offline but for completeness the answer is no. The other MVTs you list are not type legal (only 64/128-bit vectors are legal for NEON) and so they'll not make it into operation legalisation.
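
The size argument in this reply can be checked with a one-line helper. This is only a sketch: `hasNeonRegisterSize` is a hypothetical name, and a matching total width is a necessary condition taken from the comment above, not the full type-legality rule the backend applies.

```cpp
// Hypothetical helper, not LLVM API: does a fixed-length vector with the
// given element count and element width occupy exactly a 64-bit or 128-bit
// NEON register? Per the review comment, other widths are not type legal.
constexpr bool hasNeonRegisterSize(unsigned NumElts, unsigned EltBits) {
  return NumElts * EltBits == 64 || NumElts * EltBits == 128;
}

// v1i64 is 64 bits wide, so it can reach operation legalisation.
static_assert(hasNeonRegisterSize(1, 64), "v1i64 is NEON sized");
// v1i32, v1i16 and v1i8 are narrower than 64 bits.
static_assert(!hasNeonRegisterSize(1, 32), "v1i32 is not NEON sized");
static_assert(!hasNeonRegisterSize(1, 16), "v1i16 is not NEON sized");
static_assert(!hasNeonRegisterSize(1, 8), "v1i8 is not NEON sized");
```

Under this check `v1i64` is the only single-element integer vector mentioned in the thread that has a NEON register width, which is consistent with it being the only one listed alongside the other types in the diff.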

This revision was landed with ongoing or failed builds.Nov 10 2022, 4:38 AM

Closed by commit rG956489700e73: [AArch64-SVE]: Force generating code compatible to streaming mode. (authored by Hassnaa Hamdi <hassnaa.hamdi@arm.com>).

This revision was automatically updated to reflect the committed changes.

hassnaa-arm added a commit: rG956489700e73: [AArch64-SVE]: Force generating code compatible to streaming mode..

Revision Contents

Path                                                                       Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp                            76 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-arith.ll     1310 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-div.ll       1229 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-log.ll       546 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll      924 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-rem.ll       774 lines

Diff 467417

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

@@ 1,389 lines not shown @@ if (Subtarget->hasSVE()) {
     setOperationAction(ISD::MUL, MVT::v1i64, Custom);
     setOperationAction(ISD::MUL, MVT::v2i64, Custom);

     // NEON doesn't support across-vector reductions, but SVE does.
     for (auto VT : {MVT::v4f16, MVT::v8f16, MVT::v2f32, MVT::v4f32, MVT::v2f64})
       setOperationAction(ISD::VECREDUCE_SEQ_FADD, VT, Custom);

     if (Subtarget->forceStreamingCompatibleSVE()) {
       for (MVT VT : {MVT::v8i8, MVT::v16i8, MVT::v4i16, MVT::v8i16, MVT::v2i32,
                      MVT::v4i32, MVT::v1i64, MVT::v2i64})
         if (useSVEForFixedLengthVectorVT(VT, true))
           addTypeForStreamingSVE(VT);

       for (MVT VT : {MVT::v4f16, MVT::v8f16, MVT::v2f32, MVT::v4f32, MVT::v1f64,
                      MVT::v2f64})
         if (useSVEForFixedLengthVectorVT(VT, true))
           addTypeForStreamingSVE(VT);
     }
@@ 201 lines not shown @@ bool AArch64TargetLowering::shouldExpandGetActiveLaneMask(EVT ResVT,
   if (OpVT != MVT::i32 && OpVT != MVT::i64)
     return true;

   return false;
 }

 void AArch64TargetLowering::addTypeForStreamingSVE(MVT VT) {
   setOperationAction(ISD::LOAD, VT, Custom);
+  setOperationAction(ISD::ANY_EXTEND, VT, Custom);
+  setOperationAction(ISD::ZERO_EXTEND, VT, Custom);
+  setOperationAction(ISD::SIGN_EXTEND, VT, Custom);
+  setOperationAction(ISD::CONCAT_VECTORS, VT, Custom);
+  setOperationAction(ISD::ADD, VT, Custom);
+  setOperationAction(ISD::SUB, VT, Custom);
+  setOperationAction(ISD::MUL, VT, Custom);
+  setOperationAction(ISD::MULHS, VT, Custom);
+  setOperationAction(ISD::MULHU, VT, Custom);
+  setOperationAction(ISD::ABS, VT, Custom);
+  setOperationAction(ISD::AND, VT, Custom);
+  setOperationAction(ISD::XOR, VT, Custom);
 }

 void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
   assert(VT.isFixedLengthVector() && "Expected fixed length vector type!");

   // By default everything must be expanded.
   for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op)
     setOperationAction(Op, VT, Expand);
@@ 1,888 lines not shown @@ if (Opc) {
     // Emit the AArch64 operation with overflow check.
     Value = DAG.getNode(Opc, DL, VTs, LHS, RHS);
     Overflow = Value.getValue(1);
   }
   return std::make_pair(Value, Overflow);
 }

 SDValue AArch64TargetLowering::LowerXOR(SDValue Op, SelectionDAG &DAG) const {
-  if (useSVEForFixedLengthVectorVT(Op.getValueType()))
+  if (useSVEForFixedLengthVectorVT(Op.getValueType(),
+                                   Subtarget->forceStreamingCompatibleSVE()))
     return LowerToScalableOp(Op, DAG);

   SDValue Sel = Op.getOperand(0);
   SDValue Other = Op.getOperand(1);
   SDLoc dl(Sel);

   // If the operand is an overflow checking operation, invert the condition
   // code and kill the Not operation. I.e., transform:
@@ 895 lines not shown @@ static unsigned selectUmullSmull(SDNode &N0, SDNode &N1, SelectionDAG &DAG,
   }
   return 0;
 }

 SDValue AArch64TargetLowering::LowerMUL(SDValue Op, SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();

   // If SVE is available then i64 vector multiplications can also be made legal.
-  bool OverrideNEON = VT == MVT::v2i64 || VT == MVT::v1i64;
+  bool OverrideNEON = VT == MVT::v2i64 || VT == MVT::v1i64 ||
+                      Subtarget->forceStreamingCompatibleSVE();

   if (VT.isScalableVector() || useSVEForFixedLengthVectorVT(VT, OverrideNEON))
     return LowerToPredicatedOp(Op, DAG, AArch64ISD::MUL_PRED);

   // Multiplications are only custom-lowered for 128-bit vectors so that
   // VMULL can be detected. Otherwise v2i64 multiplications are not legal.
   assert(VT.is128BitVector() && VT.isInteger() &&
          "unexpected type for custom-lowering ISD::MUL");
   SDNode *N0 = Op.getOperand(0).getNode();
   SDNode *N1 = Op.getOperand(1).getNode();
   bool isMLA = false;
@@ 6,448 lines not shown @@

 SDValue AArch64TargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
                                                    SelectionDAG &DAG) const {
   SDLoc dl(Op);
   EVT VT = Op.getValueType();

   ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op.getNode());

-  if (useSVEForFixedLengthVectorVT(VT))
+  if (useSVEForFixedLengthVectorVT(VT,
+                                   Subtarget->forceStreamingCompatibleSVE()))
     return LowerFixedLengthVECTOR_SHUFFLEToSVE(Op, DAG);

   // Convert shuffles that are directly supported on NEON to target-specific
   // DAG nodes, instead of keeping them as shuffles and matching them again
   // during code selection. This is more efficient and avoids the possibility
   // of inconsistencies between legalization and selection.
   ArrayRef<int> ShuffleMask = SVN->getMask();

@@ 553 lines not shown @@ static SDValue tryLowerToSLI(SDNode *N, SelectionDAG &DAG) {
   LLVM_DEBUG(ResultSLI->dump(&DAG));

   ++NumShiftInserts;
   return ResultSLI;
 }

 SDValue AArch64TargetLowering::LowerVectorOR(SDValue Op,
                                              SelectionDAG &DAG) const {
-  if (useSVEForFixedLengthVectorVT(Op.getValueType()))
+  if (useSVEForFixedLengthVectorVT(Op.getValueType(),
+                                   Subtarget->forceStreamingCompatibleSVE()))
     return LowerToScalableOp(Op, DAG);

   // Attempt to form a vector S[LR]I from (or (and X, C1), (lsl Y, C2))
   if (SDValue Res = tryLowerToSLI(Op.getNode(), DAG))
     return Res;

   EVT VT = Op.getValueType();

@@ 434 lines not shown @@ SDValue AArch64TargetLowering::LowerBUILD_VECTOR(SDValue Op,
   LLVM_DEBUG(
       dbgs() << "LowerBUILD_VECTOR: use default expansion, failed to find "
                 "better alternative\n");
   return SDValue();
 }

 SDValue AArch64TargetLowering::LowerCONCAT_VECTORS(SDValue Op,
                                                    SelectionDAG &DAG) const {
-  if (useSVEForFixedLengthVectorVT(Op.getValueType()))
+  if (useSVEForFixedLengthVectorVT(Op.getValueType(),
+                                   Subtarget->forceStreamingCompatibleSVE()))
     return LowerFixedLengthConcatVectorsToSVE(Op, DAG);

   assert(Op.getValueType().isScalableVector() &&
          isTypeLegal(Op.getValueType()) &&
          "Expected legal scalable vector type!");

   if (isTypeLegal(Op.getOperand(0).getValueType())) {
     unsigned NumOperands = Op->getNumOperands();
@@ 89 lines not shown @@ if (VT.getScalarType() == MVT::i1) {
     SDValue Extend =
         DAG.getNode(ISD::ANY_EXTEND, DL, VectorVT, Op.getOperand(0));
     MVT ExtractTy = VectorVT == MVT::nxv2i64 ? MVT::i64 : MVT::i32;
     SDValue Extract = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, ExtractTy,
                                   Extend, Op.getOperand(1));
     return DAG.getAnyExtOrTrunc(Extract, DL, Op.getValueType());
   }

-  // try overriding NEON if possible.
-  if (useSVEForFixedLengthVectorVT(VT))
+  if (useSVEForFixedLengthVectorVT(VT,
+                                   Subtarget->forceStreamingCompatibleSVE()))
     return LowerFixedLengthExtractVectorElt(Op, DAG);

   // Check for non-constant or out of range lane.
   ConstantSDNode *CI = dyn_cast<ConstantSDNode>(Op.getOperand(1));
   if (!CI || CI->getZExtValue() >= VT.getVectorNumElements())
     return SDValue();

   // Insertion/extraction are legal for V128 types.
@@ 42 lines not shown @@ if (InVT.isScalableVector()) {

     return SDValue();
   }

   // This will get lowered to an appropriate EXTRACT_SUBREG in ISel.
   if (Idx == 0 && InVT.getSizeInBits() <= 128)
     return Op;

+  if (!Subtarget->forceStreamingCompatibleSVE()) {
   // If this is extracting the upper 64-bits of a 128-bit vector, we match
   // that directly.
   if (Size == 64 && Idx * InVT.getScalarSizeInBits() == 64 &&
       InVT.getSizeInBits() == 128)
     return Op;
+  }

-  if (useSVEForFixedLengthVectorVT(InVT)) {
+  if (useSVEForFixedLengthVectorVT(InVT,
+                                   Subtarget->forceStreamingCompatibleSVE())) {
     SDLoc DL(Op);

     EVT ContainerVT = getContainerForFixedLengthVector(DAG, InVT);
     SDValue NewInVec =
         convertToScalableVector(DAG, ContainerVT, Op.getOperand(0));

     SDValue Splice = DAG.getNode(ISD::VECTOR_SPLICE, DL, ContainerVT, NewInVec,
                                  NewInVec, DAG.getConstant(Idx, DL, MVT::i64));
▲ Show 20 Lines • Show All 281 Lines • ▼ Show 20 Lines	SDValue AArch64TargetLowering::LowerVectorSRA_SRL_SHL(SDValue Op,
int64_t Cnt;		int64_t Cnt;

if (!Op.getOperand(1).getValueType().isVector())		if (!Op.getOperand(1).getValueType().isVector())
return Op;		return Op;
unsigned EltSize = VT.getScalarSizeInBits();		unsigned EltSize = VT.getScalarSizeInBits();

switch (Op.getOpcode()) {		switch (Op.getOpcode()) {
case ISD::SHL:		case ISD::SHL:
if (VT.isScalableVector() || useSVEForFixedLengthVectorVT(VT))		if (VT.isScalableVector() ||
		useSVEForFixedLengthVectorVT(VT,
		Subtarget->forceStreamingCompatibleSVE()))
return LowerToPredicatedOp(Op, DAG, AArch64ISD::SHL_PRED);		return LowerToPredicatedOp(Op, DAG, AArch64ISD::SHL_PRED);

if (isVShiftLImm(Op.getOperand(1), VT, false, Cnt) && Cnt < EltSize)		if (isVShiftLImm(Op.getOperand(1), VT, false, Cnt) && Cnt < EltSize)
return DAG.getNode(AArch64ISD::VSHL, DL, VT, Op.getOperand(0),		return DAG.getNode(AArch64ISD::VSHL, DL, VT, Op.getOperand(0),
DAG.getConstant(Cnt, DL, MVT::i32));		DAG.getConstant(Cnt, DL, MVT::i32));
return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, VT,		return DAG.getNode(ISD::INTRINSIC_WO_CHAIN, DL, VT,
DAG.getConstant(Intrinsic::aarch64_neon_ushl, DL,		DAG.getConstant(Intrinsic::aarch64_neon_ushl, DL,
MVT::i32),		MVT::i32),
Op.getOperand(0), Op.getOperand(1));		Op.getOperand(0), Op.getOperand(1));
case ISD::SRA:		case ISD::SRA:
case ISD::SRL:		case ISD::SRL:
if (VT.isScalableVector() || useSVEForFixedLengthVectorVT(VT)) {		if (VT.isScalableVector() ||
		useSVEForFixedLengthVectorVT(
		VT, Subtarget->forceStreamingCompatibleSVE())) {
unsigned Opc = Op.getOpcode() == ISD::SRA ? AArch64ISD::SRA_PRED		unsigned Opc = Op.getOpcode() == ISD::SRA ? AArch64ISD::SRA_PRED
: AArch64ISD::SRL_PRED;		: AArch64ISD::SRL_PRED;
return LowerToPredicatedOp(Op, DAG, Opc);		return LowerToPredicatedOp(Op, DAG, Opc);
}		}

// Right shift immediate		// Right shift immediate
if (isVShiftRImm(Op.getOperand(1), VT, false, Cnt) && Cnt < EltSize) {		if (isVShiftRImm(Op.getOperand(1), VT, false, Cnt) && Cnt < EltSize) {
unsigned Opc =		unsigned Opc =
[3,096 lines hidden]	static SDValue performSVEAndCombine(SDNode *N,

if (isConstantSplatVectorMaskForType(Mask.getNode(), MemVT))		if (isConstantSplatVectorMaskForType(Mask.getNode(), MemVT))
return Src;		return Src;

return SDValue();		return SDValue();
}		}

static SDValue performANDCombine(SDNode *N,		static SDValue performANDCombine(SDNode *N,
TargetLowering::DAGCombinerInfo &DCI) {		TargetLowering::DAGCombinerInfo &DCI,
		const AArch64Subtarget *const Subtarget) {
SelectionDAG &DAG = DCI.DAG;		SelectionDAG &DAG = DCI.DAG;
SDValue LHS = N->getOperand(0);		SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);		SDValue RHS = N->getOperand(1);
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

if (SDValue R = performANDORCSELCombine(N, DAG))		if (SDValue R = performANDORCSELCombine(N, DAG))
return R;		return R;

if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))		if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))
return SDValue();		return SDValue();

if (VT.isScalableVector())		if (VT.isScalableVector())
return performSVEAndCombine(N, DCI);		return performSVEAndCombine(N, DCI);

		if (VT.isFixedLengthVector() && Subtarget->forceStreamingCompatibleSVE()) {
		// convert fixed-length vector to scalable one:
		EVT scalableContainerVT = getContainerForFixedLengthVector(DAG, VT);
		SDValue scalableLHS =
		Inline comment (sdesmalen, Done): nit: In LLVM the style is to start local variable names with an upper-case letter, i.e. ScalableLHS.
		convertToScalableVector(DAG, scalableContainerVT, LHS);
		SDValue scalableRHS =
		convertToScalableVector(DAG, scalableContainerVT, RHS);
		Inline comment (paulwalker-arm, Done): This looks odd. We shouldn't really be doing lowering within DAGCombine. What happens if you just exit the combine for the invalid case? That said, I can see functions like `tryAdvSIMDModImm32()` are used in other parts of codegen, so I'm wondering if the prevention logic is best placed within such functions so all use cases are covered.
		return performSVEAndCombine(N, DCI);
		}

// The combining code below works only for NEON vectors. In particular, it		// The combining code below works only for NEON vectors. In particular, it
// does not work for SVE when dealing with vectors wider than 128 bits.		// does not work for SVE when dealing with vectors wider than 128 bits.
if (!VT.is64BitVector() && !VT.is128BitVector())		if (!VT.is64BitVector() && !VT.is128BitVector())
return SDValue();		return SDValue();

BuildVectorSDNode *BVN = dyn_cast<BuildVectorSDNode>(RHS.getNode());		BuildVectorSDNode *BVN = dyn_cast<BuildVectorSDNode>(RHS.getNode());
if (!BVN)		if (!BVN)
return SDValue();		return SDValue();
[4,702 lines hidden]	SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
case ISD::FP_TO_SINT_SAT:		case ISD::FP_TO_SINT_SAT:
case ISD::FP_TO_UINT_SAT:		case ISD::FP_TO_UINT_SAT:
return performFpToIntCombine(N, DAG, DCI, Subtarget);		return performFpToIntCombine(N, DAG, DCI, Subtarget);
case ISD::FDIV:		case ISD::FDIV:
return performFDivCombine(N, DAG, DCI, Subtarget);		return performFDivCombine(N, DAG, DCI, Subtarget);
case ISD::OR:		case ISD::OR:
return performORCombine(N, DCI, Subtarget);		return performORCombine(N, DCI, Subtarget);
case ISD::AND:		case ISD::AND:
return performANDCombine(N, DCI);		return performANDCombine(N, DCI, Subtarget);
case ISD::INTRINSIC_WO_CHAIN:		case ISD::INTRINSIC_WO_CHAIN:
return performIntrinsicCombine(N, DCI, Subtarget);		return performIntrinsicCombine(N, DCI, Subtarget);
case ISD::ANY_EXTEND:		case ISD::ANY_EXTEND:
case ISD::ZERO_EXTEND:		case ISD::ZERO_EXTEND:
case ISD::SIGN_EXTEND:		case ISD::SIGN_EXTEND:
return performExtendCombine(N, DCI, DAG);		return performExtendCombine(N, DCI, DAG);
case ISD::SIGN_EXTEND_INREG:		case ISD::SIGN_EXTEND_INREG:
return performSignExtendInRegCombine(N, DCI, DAG);		return performSignExtendInRegCombine(N, DCI, DAG);
[1,790 lines hidden]
}		}

// If a fixed length vector operation has no side effects when applied to		// If a fixed length vector operation has no side effects when applied to
// undefined elements, we can safely use scalable vectors to perform the same		// undefined elements, we can safely use scalable vectors to perform the same
// operation without needing to worry about predication.		// operation without needing to worry about predication.
SDValue AArch64TargetLowering::LowerToScalableOp(SDValue Op,		SDValue AArch64TargetLowering::LowerToScalableOp(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
EVT VT = Op.getValueType();		EVT VT = Op.getValueType();
assert(useSVEForFixedLengthVectorVT(VT) &&		assert(useSVEForFixedLengthVectorVT(
		VT, Subtarget->forceStreamingCompatibleSVE()) &&
		Inline comment (paulwalker-arm, Done): When we hit a similar issue with `LowerToPredicatedOp()` we decided to drop the calls to `useSVEForFixedLengthVectorVT()` in favour of just using `VT.isFixedLengthVector() && isTypeLegal(VT)`. Would the same work in your case?
		Inline comment (hassnaa-arm, author, Done): Sorry, I don't understand. Do you mean dropping the call to `useSVEForFixedLengthVectorVT(...)`, or do you mean using `useSVEForFixedLengthVectorVT(VT)` without passing the OverrideNEON parameter?
		Inline comment (paulwalker-arm, Done): The former, so you can drop the call to `useSVEForFixedLengthVectorVT()` and instead have `assert(VT.isFixedLengthVector() && isTypeLegal(VT) && ...`. By this point we should be working with only legal types and there's no harm in handling any of them.
		Inline comment (hassnaa-arm, author, Done): Why do you suggest that instead of using `useSVEForFixedLengthVectorVT()`? And why do you suggest it only for `LowerToPredicatedOp()` and not for the other lowering functions as well?
		Inline comment (paulwalker-arm, Done): `useSVEForFixedLengthVectorVT()` is a semi-complex function that exists to choose which path to take during code generation and is thus used to determine how to lower an `ISD::ADD`, for example. However, by calling `LowerToPredicatedOp()` you've already made that decision, so you only need to detect scenarios that would result in broken code generation. For the case of `LowerToPredicatedOp()` this just means ensuring the input is a legal fixed length vector.
"Only expected to lower fixed length vector operation!");		"Only expected to lower fixed length vector operation!");
EVT ContainerVT = getContainerForFixedLengthVector(DAG, VT);		EVT ContainerVT = getContainerForFixedLengthVector(DAG, VT);

// Create list of operands by converting existing ones to scalable types.		// Create list of operands by converting existing ones to scalable types.
SmallVector<SDValue, 4> Ops;		SmallVector<SDValue, 4> Ops;
for (const SDValue &V : Op->op_values()) {		for (const SDValue &V : Op->op_values()) {
assert(!isa<VTSDNode>(V) && "Unexpected VTSDNode node!");		assert(!isa<VTSDNode>(V) && "Unexpected VTSDNode node!");

// Pass through non-vector operands.		// Pass through non-vector operands.
if (!V.getValueType().isVector()) {		if (!V.getValueType().isVector()) {
Ops.push_back(V);		Ops.push_back(V);
continue;		continue;
}		}

// "cast" fixed length vector to a scalable vector.		// "cast" fixed length vector to a scalable vector.
assert(useSVEForFixedLengthVectorVT(V.getValueType()) &&		assert(useSVEForFixedLengthVectorVT(
		V.getValueType(), Subtarget->forceStreamingCompatibleSVE()) &&
"Only fixed length vectors are supported!");		"Only fixed length vectors are supported!");
Ops.push_back(convertToScalableVector(DAG, ContainerVT, V));		Ops.push_back(convertToScalableVector(DAG, ContainerVT, V));
}		}

auto ScalableRes = DAG.getNode(Op.getOpcode(), SDLoc(Op), ContainerVT, Ops);		auto ScalableRes = DAG.getNode(Op.getOpcode(), SDLoc(Op), ContainerVT, Ops);
return convertFromScalableVector(DAG, VT, ScalableRes);		return convertFromScalableVector(DAG, VT, ScalableRes);
}		}
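
The inline thread on the assert in `LowerToScalableOp` separates two roles: a predicate that chooses the lowering path, and a cheaper check inside the helper that only rejects inputs that would produce broken code. The block below is a toy sketch of that split and is not LLVM code. `ToyVT`, `ToySubtarget`, `isLegalFixedVT`, `useSVEForFixedVT`, `lowerToScalableOp` and `lowerAdd` are hypothetical names, and the legality and path-selection rules are simplified assumptions rather than the real implementation.

```cpp
#include <cassert>

// Toy stand-ins for llvm::EVT and AArch64Subtarget. Every name in this block
// is hypothetical and models only the properties discussed in the thread.
struct ToyVT {
  bool IsFixedLength;
  unsigned SizeInBits;
};

struct ToySubtarget {
  unsigned MinSVEBits;              // assumed known minimum SVE register size
  bool ForceStreamingCompatibleSVE; // models -force-streaming-compatible-sve
};

// Assumption: a fixed-length type is legal when it fits a NEON register or
// the known minimum SVE register.
bool isLegalFixedVT(const ToyVT &VT, const ToySubtarget &ST) {
  if (!VT.IsFixedLength)
    return false;
  return VT.SizeInBits == 64 || VT.SizeInBits == 128 ||
         (VT.SizeInBits > 128 && VT.SizeInBits <= ST.MinSVEBits);
}

// Path-selection predicate (simplified): NEON-sized vectors use SVE only when
// the caller overrides NEON; wider vectors use SVE when they fit.
bool useSVEForFixedVT(const ToyVT &VT, const ToySubtarget &ST,
                      bool OverrideNEON) {
  if (!VT.IsFixedLength)
    return false;
  if (VT.SizeInBits <= 128)
    return OverrideNEON && (VT.SizeInBits == 64 || VT.SizeInBits == 128);
  return VT.SizeInBits <= ST.MinSVEBits;
}

enum class Path { NEON, SVE };

// The helper no longer repeats the path decision; it only guards against
// inputs that could not be lowered correctly.
Path lowerToScalableOp(const ToyVT &VT, const ToySubtarget &ST) {
  assert(isLegalFixedVT(VT, ST) &&
         "Only expected to lower legal fixed length vector operation!");
  return Path::SVE;
}

// The caller makes the decision exactly once.
Path lowerAdd(const ToyVT &VT, const ToySubtarget &ST) {
  if (useSVEForFixedVT(VT, ST, ST.ForceStreamingCompatibleSVE))
    return lowerToScalableOp(VT, ST);
  return Path::NEON;
}
```

In this model a 128-bit vector takes the SVE path only when the streaming-compatible flag is set, and the helper's assert holds in either configuration because it no longer depends on how the caller reached its decision.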

[570 lines hidden]

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-arith.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				;
				; ADD
				;
				define <4 x i8> @add_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; CHECK-LABEL: add_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				Inline comment (david-arm, Done): Could you also add a test for an illegal NEON type too, i.e. `<4 x i8>` or `<2 x i16>`?
				; CHECK-NEXT: add z0.h, z0.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = add <4 x i8> %op1, %op2
				ret <4 x i8> %res
				}

				define <8 x i8> @add_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: add_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: add z0.b, z0.b, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = add <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @add_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: add_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: add z0.b, z0.b, z1.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = add <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @add_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: add_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: add z1.b, z1.b, z3.b
				; CHECK-NEXT: add z0.b, z0.b, z2.b
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = add <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @add_v64i8(<64 x i8>* %a, <64 x i8>* %b) #0 {
				; CHECK-LABEL: add_v64i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #32
				; CHECK-NEXT: mov w9, #48
				; CHECK-NEXT: mov w10, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0, x9]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x0, x10]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z4.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z5.b }, p0/z, [x1, x9]
				; CHECK-NEXT: ld1b { z6.b }, p0/z, [x1, x10]
				; CHECK-NEXT: ld1b { z7.b }, p0/z, [x1]
				; CHECK-NEXT: add z0.b, z0.b, z4.b
				; CHECK-NEXT: add z1.b, z1.b, z5.b
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: add z0.b, z3.b, z7.b
				; CHECK-NEXT: add z1.b, z2.b, z6.b
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%res = add <64 x i8> %op1, %op2
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}
				Inline comment (david-arm, Not Done): Again, this is illegal in streaming mode.
				Inline comment (david-arm, Not Done): Please ignore this comment! `stp q0, q1` is legal - my mistake!

				Inline comment (sdesmalen, Done): I think this test can be removed, because you've already covered the "twice as wide" case (32 x i8), which ensures we don't emit any other instructions not valid in streaming mode. The "four times as wide" case should already be covered by `sve-fixed-length-int-arith.ll`.
				define <2 x i16> @add_v2i16(<2 x i16> %op1, <2 x i16> %op2) #0 {
				; CHECK-LABEL: add_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: add z0.s, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = add <2 x i16> %op1, %op2
				ret <2 x i16> %res
				}

				define <4 x i16> @add_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: add_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: add z0.h, z0.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = add <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @add_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: add_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: add z0.h, z0.h, z1.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = add <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @add_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: add_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: add z1.h, z1.h, z3.h
				; CHECK-NEXT: add z0.h, z0.h, z2.h
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = add <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @add_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; CHECK-LABEL: add_v32i16:
				Inline comment (sdesmalen, Done): This test can be removed for the same reason as mentioned above.
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #16
				; CHECK-NEXT: mov x9, #24
				; CHECK-NEXT: mov x10, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, x9, lsl #1]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x0, x10, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z4.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z5.h }, p0/z, [x1, x9, lsl #1]
				; CHECK-NEXT: ld1h { z6.h }, p0/z, [x1, x10, lsl #1]
				; CHECK-NEXT: ld1h { z7.h }, p0/z, [x1]
				; CHECK-NEXT: add z0.h, z0.h, z4.h
				; CHECK-NEXT: add z1.h, z1.h, z5.h
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: add z0.h, z3.h, z7.h
				; CHECK-NEXT: add z1.h, z2.h, z6.h
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%res = add <32 x i16> %op1, %op2
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define <2 x i32> @add_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: add_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: add z0.s, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = add <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @add_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: add_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: add z0.s, z0.s, z1.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = add <4 x i32> %op1, %op2
				ret <4 x i32> %res
				}

				define void @add_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: add_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: add z1.s, z1.s, z3.s
				; CHECK-NEXT: add z0.s, z0.s, z2.s
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = add <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define void @add_v16i32(<16 x i32>* %a, <16 x i32>* %b) #0 {
				; CHECK-LABEL: add_v16i32:
				Inline comment (sdesmalen, Done): This test can be removed for the same reason as mentioned above.
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: mov x9, #12
				; CHECK-NEXT: mov x10, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, x9, lsl #2]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x0, x10, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z4.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z5.s }, p0/z, [x1, x9, lsl #2]
				; CHECK-NEXT: ld1w { z6.s }, p0/z, [x1, x10, lsl #2]
				; CHECK-NEXT: ld1w { z7.s }, p0/z, [x1]
				; CHECK-NEXT: add z0.s, z0.s, z4.s
				; CHECK-NEXT: add z1.s, z1.s, z5.s
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: add z0.s, z3.s, z7.s
				; CHECK-NEXT: add z1.s, z2.s, z6.s
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i32>, <16 x i32>* %a
				%op2 = load <16 x i32>, <16 x i32>* %b
				%res = add <16 x i32> %op1, %op2
				store <16 x i32> %res, <16 x i32>* %a
				ret void
				}

				define <1 x i64> @add_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: add_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: add z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = add <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @add_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: add_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: add z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = add <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @add_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: add_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: add z1.d, z1.d, z3.d
				; CHECK-NEXT: add z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = add <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				define void @add_v8i64(<8 x i64>* %a, <8 x i64>* %b) #0 {
				; CHECK-LABEL: add_v8i64:
				Inline comment (sdesmalen, Done): This test can be removed for the same reason as mentioned above. (The same goes for all other "four times as wide" instances in the remainder of this file and in the other files in this patch.)
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: mov x9, #6
				; CHECK-NEXT: mov x10, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0, x9, lsl #3]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x0, x10, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z4.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z5.d }, p0/z, [x1, x9, lsl #3]
				; CHECK-NEXT: ld1d { z6.d }, p0/z, [x1, x10, lsl #3]
				; CHECK-NEXT: ld1d { z7.d }, p0/z, [x1]
				; CHECK-NEXT: add z0.d, z0.d, z4.d
				; CHECK-NEXT: add z1.d, z1.d, z5.d
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: add z0.d, z3.d, z7.d
				; CHECK-NEXT: add z1.d, z2.d, z6.d
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i64>, <8 x i64>* %a
				%op2 = load <8 x i64>, <8 x i64>* %b
				%res = add <8 x i64> %op1, %op2
				store <8 x i64> %res, <8 x i64>* %a
				ret void
				}

				;
				; MUL
				;

				define <4 x i8> @mul_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; CHECK-LABEL: mul_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <4 x i8> %op1, %op2
				ret <4 x i8> %res
				}

				define <8 x i8> @mul_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: mul_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mul z0.b, p0/m, z0.b, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @mul_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: mul_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mul z0.b, p0/m, z0.b, z1.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @mul_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: mul_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: mul z1.b, p0/m, z1.b, z3.b
				; CHECK-NEXT: mul z0.b, p0/m, z0.b, z2.b
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = mul <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @mul_v64i8(<64 x i8>* %a, <64 x i8>* %b) #0 {
				; CHECK-LABEL: mul_v64i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #32
				; CHECK-NEXT: mov w9, #48
				; CHECK-NEXT: mov w10, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0, x9]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x0, x10]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z4.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z5.b }, p0/z, [x1, x9]
				; CHECK-NEXT: ld1b { z6.b }, p0/z, [x1, x10]
				; CHECK-NEXT: ld1b { z7.b }, p0/z, [x1]
				; CHECK-NEXT: mul z0.b, p0/m, z0.b, z4.b
				; CHECK-NEXT: mul z1.b, p0/m, z1.b, z5.b
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: mul z0.b, p0/m, z0.b, z7.b
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: mul z1.b, p0/m, z1.b, z6.b
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%res = mul <64 x i8> %op1, %op2
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}

				define <2 x i16> @mul_v2i16(<2 x i16> %op1, <2 x i16> %op2) #0 {
				; CHECK-LABEL: mul_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mul z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <2 x i16> %op1, %op2
				ret <2 x i16> %res
				}

				define <4 x i16> @mul_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: mul_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @mul_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: mul_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @mul_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: mul_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: mul z1.h, p0/m, z1.h, z3.h
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = mul <16 x i16> %op1, %op2
				Inline comment (david-arm, Done): Again, could you add at least one illegal type - `<4 x i8>` or `<2 x i16>`?
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @mul_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; CHECK-LABEL: mul_v32i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #16
				; CHECK-NEXT: mov x9, #24
				; CHECK-NEXT: mov x10, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, x9, lsl #1]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x0, x10, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z4.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z5.h }, p0/z, [x1, x9, lsl #1]
				; CHECK-NEXT: ld1h { z6.h }, p0/z, [x1, x10, lsl #1]
				; CHECK-NEXT: ld1h { z7.h }, p0/z, [x1]
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z4.h
				; CHECK-NEXT: mul z1.h, p0/m, z1.h, z5.h
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z7.h
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: mul z1.h, p0/m, z1.h, z6.h
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%res = mul <32 x i16> %op1, %op2
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define <2 x i32> @mul_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: mul_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mul z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @mul_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: mul_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mul z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <4 x i32> %op1, %op2
				ret <4 x i32> %res
				}

				define void @mul_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: mul_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: mul z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: mul z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = mul <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define void @mul_v16i32(<16 x i32>* %a, <16 x i32>* %b) #0 {
				; CHECK-LABEL: mul_v16i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: mov x9, #12
				; CHECK-NEXT: mov x10, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, x9, lsl #2]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x0, x10, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z4.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z5.s }, p0/z, [x1, x9, lsl #2]
				; CHECK-NEXT: ld1w { z6.s }, p0/z, [x1, x10, lsl #2]
				; CHECK-NEXT: ld1w { z7.s }, p0/z, [x1]
				; CHECK-NEXT: mul z0.s, p0/m, z0.s, z4.s
				; CHECK-NEXT: mul z1.s, p0/m, z1.s, z5.s
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: mul z0.s, p0/m, z0.s, z7.s
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: mul z1.s, p0/m, z1.s, z6.s
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i32>, <16 x i32>* %a
				%op2 = load <16 x i32>, <16 x i32>* %b
				%res = mul <16 x i32> %op1, %op2
				store <16 x i32> %res, <16 x i32>* %a
				ret void
				}

				define <1 x i64> @mul_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: mul_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mul z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @mul_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: mul_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mul z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = mul <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @mul_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: mul_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: mul z1.d, p0/m, z1.d, z3.d
				; CHECK-NEXT: mul z0.d, p0/m, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = mul <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				define void @mul_v8i64(<8 x i64>* %a, <8 x i64>* %b) #0 {
				; CHECK-LABEL: mul_v8i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: mov x9, #6
				; CHECK-NEXT: mov x10, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0, x9, lsl #3]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x0, x10, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z4.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z5.d }, p0/z, [x1, x9, lsl #3]
				; CHECK-NEXT: ld1d { z6.d }, p0/z, [x1, x10, lsl #3]
				; CHECK-NEXT: ld1d { z7.d }, p0/z, [x1]
				; CHECK-NEXT: mul z0.d, p0/m, z0.d, z4.d
				; CHECK-NEXT: mul z1.d, p0/m, z1.d, z5.d
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: mul z0.d, p0/m, z0.d, z7.d
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: mul z1.d, p0/m, z1.d, z6.d
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i64>, <8 x i64>* %a
				%op2 = load <8 x i64>, <8 x i64>* %b
				%res = mul <8 x i64> %op1, %op2
				store <8 x i64> %res, <8 x i64>* %a
				ret void
				}

				;
				; SUB
				;

				define <4 x i8> @sub_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; CHECK-LABEL: sub_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: sub z0.h, z0.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <4 x i8> %op1, %op2
				ret <4 x i8> %res
				}

				define <8 x i8> @sub_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: sub_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: sub z0.b, z0.b, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @sub_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: sub_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: sub z0.b, z0.b, z1.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @sub_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: sub_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: sub z1.b, z1.b, z3.b
				; CHECK-NEXT: sub z0.b, z0.b, z2.b
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = sub <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @sub_v64i8(<64 x i8>* %a, <64 x i8>* %b) #0 {
				; CHECK-LABEL: sub_v64i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #32
				; CHECK-NEXT: mov w9, #48
				; CHECK-NEXT: mov w10, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0, x9]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x0, x10]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z4.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z5.b }, p0/z, [x1, x9]
				; CHECK-NEXT: ld1b { z6.b }, p0/z, [x1, x10]
				; CHECK-NEXT: ld1b { z7.b }, p0/z, [x1]
				; CHECK-NEXT: sub z0.b, z0.b, z4.b
				; CHECK-NEXT: sub z1.b, z1.b, z5.b
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: sub z0.b, z3.b, z7.b
				; CHECK-NEXT: sub z1.b, z2.b, z6.b
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%res = sub <64 x i8> %op1, %op2
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}

				define <2 x i16> @sub_v2i16(<2 x i16> %op1, <2 x i16> %op2) #0 {
				; CHECK-LABEL: sub_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: sub z0.s, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <2 x i16> %op1, %op2
				ret <2 x i16> %res
				}

				define <4 x i16> @sub_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: sub_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: sub z0.h, z0.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @sub_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: sub_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: sub z0.h, z0.h, z1.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @sub_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: sub_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: sub z1.h, z1.h, z3.h
				; CHECK-NEXT: sub z0.h, z0.h, z2.h
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = sub <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @sub_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; CHECK-LABEL: sub_v32i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #16
				; CHECK-NEXT: mov x9, #24
				; CHECK-NEXT: mov x10, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, x9, lsl #1]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x0, x10, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z4.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z5.h }, p0/z, [x1, x9, lsl #1]
				; CHECK-NEXT: ld1h { z6.h }, p0/z, [x1, x10, lsl #1]
				; CHECK-NEXT: ld1h { z7.h }, p0/z, [x1]
				; CHECK-NEXT: sub z0.h, z0.h, z4.h
				; CHECK-NEXT: sub z1.h, z1.h, z5.h
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: sub z0.h, z3.h, z7.h
				; CHECK-NEXT: sub z1.h, z2.h, z6.h
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%res = sub <32 x i16> %op1, %op2
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define <2 x i32> @sub_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: sub_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: sub z0.s, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @sub_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: sub_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: sub z0.s, z0.s, z1.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <4 x i32> %op1, %op2
				ret <4 x i32> %res
				}

				define void @sub_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: sub_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: sub z1.s, z1.s, z3.s
				; CHECK-NEXT: sub z0.s, z0.s, z2.s
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = sub <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define void @sub_v16i32(<16 x i32>* %a, <16 x i32>* %b) #0 {
				; CHECK-LABEL: sub_v16i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: mov x9, #12
				; CHECK-NEXT: mov x10, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, x9, lsl #2]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x0, x10, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z4.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z5.s }, p0/z, [x1, x9, lsl #2]
				; CHECK-NEXT: ld1w { z6.s }, p0/z, [x1, x10, lsl #2]
				; CHECK-NEXT: ld1w { z7.s }, p0/z, [x1]
				; CHECK-NEXT: sub z0.s, z0.s, z4.s
				; CHECK-NEXT: sub z1.s, z1.s, z5.s
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: sub z0.s, z3.s, z7.s
				; CHECK-NEXT: sub z1.s, z2.s, z6.s
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i32>, <16 x i32>* %a
				%op2 = load <16 x i32>, <16 x i32>* %b
				%res = sub <16 x i32> %op1, %op2
				store <16 x i32> %res, <16 x i32>* %a
				ret void
				}

				define <1 x i64> @sub_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: sub_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: sub z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @sub_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: sub_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: sub z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = sub <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @sub_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: sub_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: sub z1.d, z1.d, z3.d
				; CHECK-NEXT: sub z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = sub <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				define void @sub_v8i64(<8 x i64>* %a, <8 x i64>* %b) #0 {
				; CHECK-LABEL: sub_v8i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: mov x9, #6
				; CHECK-NEXT: mov x10, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0, x9, lsl #3]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x0, x10, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z4.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z5.d }, p0/z, [x1, x9, lsl #3]
				; CHECK-NEXT: ld1d { z6.d }, p0/z, [x1, x10, lsl #3]
				; CHECK-NEXT: ld1d { z7.d }, p0/z, [x1]
				; CHECK-NEXT: sub z0.d, z0.d, z4.d
				; CHECK-NEXT: sub z1.d, z1.d, z5.d
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: sub z0.d, z3.d, z7.d
				; CHECK-NEXT: sub z1.d, z2.d, z6.d
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i64>, <8 x i64>* %a
				%op2 = load <8 x i64>, <8 x i64>* %b
				%res = sub <8 x i64> %op1, %op2
				store <8 x i64> %res, <8 x i64>* %a
				ret void
				}


				;
				; ABS
				;

				define <4 x i8> @abs_v4i8(<4 x i8> %op1) #0 {
				; CHECK-LABEL: abs_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI54_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI54_0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x8]
				; CHECK-NEXT: lsl z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: asr z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: abs z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i8> @llvm.abs.v4i8(<4 x i8> %op1, i1 false)
				ret <4 x i8> %res
				}

				define <8 x i8> @abs_v8i8(<8 x i8> %op1) #0 {
				; CHECK-LABEL: abs_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: abs z0.b, p0/m, z0.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <8 x i8> @llvm.abs.v8i8(<8 x i8> %op1, i1 false)
				ret <8 x i8> %res
				}

				define <16 x i8> @abs_v16i8(<16 x i8> %op1) #0 {
				; CHECK-LABEL: abs_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: abs z0.b, p0/m, z0.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <16 x i8> @llvm.abs.v16i8(<16 x i8> %op1, i1 false)
				ret <16 x i8> %res
				}

				define void @abs_v32i8(<32 x i8>* %a) #0 {
				; CHECK-LABEL: abs_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: abs z1.b, p0/m, z1.b
				; CHECK-NEXT: abs z0.b, p0/m, z0.b
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%res = call <32 x i8> @llvm.abs.v32i8(<32 x i8> %op1, i1 false)
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @abs_v64i8(<64 x i8>* %a) #0 {
				; CHECK-LABEL: abs_v64i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #32
				; CHECK-NEXT: mov w9, #48
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: mov w10, #16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0, x9]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x0, x10]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x0]
				; CHECK-NEXT: abs z0.b, p0/m, z0.b
				; CHECK-NEXT: abs z1.b, p0/m, z1.b
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: abs z0.b, p0/m, z3.b
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: abs z1.b, p0/m, z2.b
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%res = call <64 x i8> @llvm.abs.v64i8(<64 x i8> %op1, i1 false)
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}

				define <2 x i16> @abs_v2i16(<2 x i16> %op1) #0 {
				; CHECK-LABEL: abs_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI59_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI59_0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x8]
				; CHECK-NEXT: lsl z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: asr z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: abs z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i16> @llvm.abs.v2i16(<2 x i16> %op1, i1 false)
				ret <2 x i16> %res
				}

				define <4 x i16> @abs_v4i16(<4 x i16> %op1) #0 {
				; CHECK-LABEL: abs_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: abs z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i16> @llvm.abs.v4i16(<4 x i16> %op1, i1 false)
				ret <4 x i16> %res
				}

				define <8 x i16> @abs_v8i16(<8 x i16> %op1) #0 {
				; CHECK-LABEL: abs_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: abs z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <8 x i16> @llvm.abs.v8i16(<8 x i16> %op1, i1 false)
				ret <8 x i16> %res
				}

				define void @abs_v16i16(<16 x i16>* %a) #0 {
				; CHECK-LABEL: abs_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: abs z1.h, p0/m, z1.h
				; CHECK-NEXT: abs z0.h, p0/m, z0.h
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%res = call <16 x i16> @llvm.abs.v16i16(<16 x i16> %op1, i1 false)
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @abs_v32i16(<32 x i16>* %a) #0 {
				; CHECK-LABEL: abs_v32i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #16
				; CHECK-NEXT: mov x9, #24
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: mov x10, #8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0, x9, lsl #1]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x0, x10, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x0]
				; CHECK-NEXT: abs z0.h, p0/m, z0.h
				; CHECK-NEXT: abs z1.h, p0/m, z1.h
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: abs z0.h, p0/m, z3.h
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: abs z1.h, p0/m, z2.h
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%res = call <32 x i16> @llvm.abs.v32i16(<32 x i16> %op1, i1 false)
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define <2 x i32> @abs_v2i32(<2 x i32> %op1) #0 {
				; CHECK-LABEL: abs_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: abs z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i32> @llvm.abs.v2i32(<2 x i32> %op1, i1 false)
				ret <2 x i32> %res
				}

				define <4 x i32> @abs_v4i32(<4 x i32> %op1) #0 {
				; CHECK-LABEL: abs_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: abs z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i32> @llvm.abs.v4i32(<4 x i32> %op1, i1 false)
				ret <4 x i32> %res
				}

				define void @abs_v8i32(<8 x i32>* %a) #0 {
				; CHECK-LABEL: abs_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: abs z1.s, p0/m, z1.s
				; CHECK-NEXT: abs z0.s, p0/m, z0.s
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%res = call <8 x i32> @llvm.abs.v8i32(<8 x i32> %op1, i1 false)
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define void @abs_v16i32(<16 x i32>* %a) #0 {
				; CHECK-LABEL: abs_v16i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: mov x9, #12
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: mov x10, #4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, x9, lsl #2]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x0, x10, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x0]
				; CHECK-NEXT: abs z0.s, p0/m, z0.s
				; CHECK-NEXT: abs z1.s, p0/m, z1.s
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: abs z0.s, p0/m, z3.s
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: abs z1.s, p0/m, z2.s
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i32>, <16 x i32>* %a
				%res = call <16 x i32> @llvm.abs.v16i32(<16 x i32> %op1, i1 false)
				store <16 x i32> %res, <16 x i32>* %a
				ret void
				}

				define <1 x i64> @abs_v1i64(<1 x i64> %op1) #0 {
				; CHECK-LABEL: abs_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: abs z0.d, p0/m, z0.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <1 x i64> @llvm.abs.v1i64(<1 x i64> %op1, i1 false)
				ret <1 x i64> %res
				}

				define <2 x i64> @abs_v2i64(<2 x i64> %op1) #0 {
				; CHECK-LABEL: abs_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: abs z0.d, p0/m, z0.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i64> @llvm.abs.v2i64(<2 x i64> %op1, i1 false)
				ret <2 x i64> %res
				}

				define void @abs_v4i64(<4 x i64>* %a) #0 {
				; CHECK-LABEL: abs_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: abs z1.d, p0/m, z1.d
				; CHECK-NEXT: abs z0.d, p0/m, z0.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%res = call <4 x i64> @llvm.abs.v4i64(<4 x i64> %op1, i1 false)
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				define void @abs_v8i64(<8 x i64>* %a) #0 {
				; CHECK-LABEL: abs_v8i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: mov x9, #6
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: mov x10, #2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0, x9, lsl #3]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x0, x10, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0]
				; CHECK-NEXT: abs z0.d, p0/m, z0.d
				; CHECK-NEXT: abs z1.d, p0/m, z1.d
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: abs z0.d, p0/m, z3.d
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: abs z1.d, p0/m, z2.d
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i64>, <8 x i64>* %a
				%res = call <8 x i64> @llvm.abs.v8i64(<8 x i64> %op1, i1 false)
				store <8 x i64> %res, <8 x i64>* %a
				ret void
				}

				declare <4 x i8> @llvm.abs.v4i8(<4 x i8>, i1)
				declare <8 x i8> @llvm.abs.v8i8(<8 x i8>, i1)
				declare <16 x i8> @llvm.abs.v16i8(<16 x i8>, i1)
				declare <32 x i8> @llvm.abs.v32i8(<32 x i8>, i1)
				declare <64 x i8> @llvm.abs.v64i8(<64 x i8>, i1)
				declare <4 x i16> @llvm.abs.v4i16(<4 x i16>, i1)
				declare <2 x i16> @llvm.abs.v2i16(<2 x i16>, i1)
				declare <8 x i16> @llvm.abs.v8i16(<8 x i16>, i1)
				declare <16 x i16> @llvm.abs.v16i16(<16 x i16>, i1)
				declare <32 x i16> @llvm.abs.v32i16(<32 x i16>, i1)
				declare <2 x i32> @llvm.abs.v2i32(<2 x i32>, i1)
				declare <4 x i32> @llvm.abs.v4i32(<4 x i32>, i1)
				declare <8 x i32> @llvm.abs.v8i32(<8 x i32>, i1)
				declare <16 x i32> @llvm.abs.v16i32(<16 x i32>, i1)
				declare <1 x i64> @llvm.abs.v1i64(<1 x i64>, i1)
				declare <2 x i64> @llvm.abs.v2i64(<2 x i64>, i1)
				declare <4 x i64> @llvm.abs.v4i64(<4 x i64>, i1)
				declare <8 x i64> @llvm.abs.v8i64(<8 x i64>, i1)


				attributes #0 = { "target-features"="+sve" }
				david-arm: This looks like a NEON instruction - can you investigate where this is coming from?
				david-arm: Please ignore this - my mistake!
				david-arm: Can you add an illegal NEON type such as `<2 x i16>`?

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-div.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				;
				; SDIV
				;

				define <4 x i8> @sdiv_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; CHECK-LABEL: sdiv_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: adrp x8, .LCPI0_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI0_0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p1.s, vl4
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: lsl z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: lsl z1.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: asr z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: asr z1.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: sunpklo z1.s, z1.h
				; CHECK-NEXT: sunpklo z0.s, z0.h
				; CHECK-NEXT: sdiv z0.s, p1/m, z0.s, z1.s
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: mov z1.s, z0.s[3]
				; CHECK-NEXT: mov z2.s, z0.s[2]
				; CHECK-NEXT: mov z0.s, z0.s[1]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: fmov w10, s2
				; CHECK-NEXT: strh w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strh w9, [sp, #14]
				; CHECK-NEXT: strh w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strh w10, [sp, #12]
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = sdiv <4 x i8> %op1, %op2
				ret <4 x i8> %res
				}

				define <8 x i8> @sdiv_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: sdiv_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpklo z1.h, z1.b
				; CHECK-NEXT: sunpklo z0.h, z0.b
				; CHECK-NEXT: sunpkhi z2.s, z1.h
				; CHECK-NEXT: sunpkhi z3.s, z0.h
				; CHECK-NEXT: sunpklo z1.s, z1.h
				; CHECK-NEXT: sunpklo z0.s, z0.h
				; CHECK-NEXT: sdivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: uzp1 z0.h, z0.h, z2.h
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: mov z1.h, z0.h[7]
				; CHECK-NEXT: mov z3.h, z0.h[5]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: mov z2.h, z0.h[6]
				; CHECK-NEXT: mov z4.h, z0.h[4]
				; CHECK-NEXT: strb w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s3
				; CHECK-NEXT: mov z6.h, z0.h[2]
				; CHECK-NEXT: fmov w10, s2
				; CHECK-NEXT: strb w9, [sp, #15]
				; CHECK-NEXT: fmov w9, s4
				; CHECK-NEXT: strb w8, [sp, #13]
				; CHECK-NEXT: fmov w8, s6
				; CHECK-NEXT: mov z5.h, z0.h[3]
				; CHECK-NEXT: mov z0.h, z0.h[1]
				; CHECK-NEXT: strb w10, [sp, #14]
				; CHECK-NEXT: fmov w10, s5
				; CHECK-NEXT: strb w9, [sp, #12]
				; CHECK-NEXT: fmov w9, s0
				; CHECK-NEXT: strb w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strb w10, [sp, #11]
				; CHECK-NEXT: strb w9, [sp, #9]
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = sdiv <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @sdiv_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: sdiv_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: sunpkhi z2.h, z1.b
				; CHECK-NEXT: sunpkhi z3.h, z0.b
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpklo z1.h, z1.b
				; CHECK-NEXT: sunpkhi z4.s, z2.h
				; CHECK-NEXT: sunpkhi z5.s, z3.h
				; CHECK-NEXT: sunpklo z2.s, z2.h
				; CHECK-NEXT: sunpklo z3.s, z3.h
				; CHECK-NEXT: sunpklo z0.h, z0.b
				; CHECK-NEXT: sdivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: sdivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: sunpkhi z3.s, z1.h
				; CHECK-NEXT: sunpkhi z5.s, z0.h
				; CHECK-NEXT: sunpklo z1.s, z1.h
				; CHECK-NEXT: sunpklo z0.s, z0.h
				; CHECK-NEXT: sdivr z3.s, p0/m, z3.s, z5.s
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: uzp1 z1.h, z2.h, z4.h
				; CHECK-NEXT: uzp1 z0.h, z0.h, z3.h
				; CHECK-NEXT: uzp1 z0.b, z0.b, z1.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = sdiv <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @sdiv_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpkhi z5.h, z0.b
				; CHECK-NEXT: sunpklo z0.h, z0.b
				; CHECK-NEXT: sunpkhi z4.h, z2.b
				; CHECK-NEXT: sunpklo z2.h, z2.b
				; CHECK-NEXT: sunpkhi z6.s, z4.h
				; CHECK-NEXT: sunpkhi z7.s, z5.h
				; CHECK-NEXT: sunpklo z4.s, z4.h
				; CHECK-NEXT: sunpklo z5.s, z5.h
				; CHECK-NEXT: sunpkhi z16.s, z2.h
				; CHECK-NEXT: sdivr z6.s, p0/m, z6.s, z7.s
				; CHECK-NEXT: sdivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: sunpkhi z5.s, z0.h
				; CHECK-NEXT: sunpklo z2.s, z2.h
				; CHECK-NEXT: sunpklo z0.s, z0.h
				; CHECK-NEXT: uzp1 z4.h, z4.h, z6.h
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: sunpkhi z2.h, z3.b
				; CHECK-NEXT: sunpkhi z6.h, z1.b
				; CHECK-NEXT: sdiv z5.s, p0/m, z5.s, z16.s
				; CHECK-NEXT: sunpkhi z7.s, z2.h
				; CHECK-NEXT: sunpkhi z16.s, z6.h
				; CHECK-NEXT: sunpklo z2.s, z2.h
				; CHECK-NEXT: sunpklo z6.s, z6.h
				; CHECK-NEXT: sunpklo z3.h, z3.b
				; CHECK-NEXT: sunpklo z1.h, z1.b
				; CHECK-NEXT: sdivr z7.s, p0/m, z7.s, z16.s
				; CHECK-NEXT: sdivr z2.s, p0/m, z2.s, z6.s
				; CHECK-NEXT: sunpkhi z6.s, z3.h
				; CHECK-NEXT: sunpkhi z16.s, z1.h
				; CHECK-NEXT: sunpklo z3.s, z3.h
				; CHECK-NEXT: sunpklo z1.s, z1.h
				; CHECK-NEXT: sdivr z6.s, p0/m, z6.s, z16.s
				; CHECK-NEXT: sdiv z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: uzp1 z2.h, z2.h, z7.h
				; CHECK-NEXT: uzp1 z1.h, z1.h, z6.h
				; CHECK-NEXT: uzp1 z0.h, z0.h, z5.h
				; CHECK-NEXT: uzp1 z1.b, z1.b, z2.b
				; CHECK-NEXT: uzp1 z0.b, z0.b, z4.b
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = sdiv <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @sdiv_v64i8(<64 x i8>* %a, <64 x i8>* %b) #0 {
				; CHECK-LABEL: sdiv_v64i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #32
				; CHECK-NEXT: mov w9, #48
				; CHECK-NEXT: mov w10, #16
				; CHECK-NEXT: ptrue p1.b, vl16
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1b { z2.b }, p1/z, [x0, x8]
				; CHECK-NEXT: ld1b { z3.b }, p1/z, [x0, x9]
				; CHECK-NEXT: ld1b { z4.b }, p1/z, [x0, x10]
				; CHECK-NEXT: ld1b { z0.b }, p1/z, [x0]
				; CHECK-NEXT: ld1b { z5.b }, p1/z, [x1, x10]
				; CHECK-NEXT: ld1b { z7.b }, p1/z, [x1, x9]
				; CHECK-NEXT: ld1b { z6.b }, p1/z, [x1, x8]
				; CHECK-NEXT: sunpkhi z16.h, z4.b
				; CHECK-NEXT: sunpklo z4.h, z4.b
				; CHECK-NEXT: sunpkhi z1.h, z5.b
				; CHECK-NEXT: sunpkhi z18.s, z16.h
				; CHECK-NEXT: sunpkhi z17.s, z1.h
				; CHECK-NEXT: sunpklo z1.s, z1.h
				; CHECK-NEXT: sunpklo z16.s, z16.h
				; CHECK-NEXT: sdivr z17.s, p0/m, z17.s, z18.s
				; CHECK-NEXT: sdivr z1.s, p0/m, z1.s, z16.s
				; CHECK-NEXT: sunpklo z5.h, z5.b
				; CHECK-NEXT: uzp1 z1.h, z1.h, z17.h
				; CHECK-NEXT: sunpkhi z17.s, z5.h
				; CHECK-NEXT: sunpkhi z18.s, z4.h
				; CHECK-NEXT: sunpklo z5.s, z5.h
				; CHECK-NEXT: sunpklo z4.s, z4.h
				; CHECK-NEXT: sdivr z17.s, p0/m, z17.s, z18.s
				; CHECK-NEXT: sdiv z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: sunpkhi z5.h, z7.b
				; CHECK-NEXT: sunpkhi z18.h, z3.b
				; CHECK-NEXT: sunpkhi z19.s, z5.h
				; CHECK-NEXT: sunpkhi z20.s, z18.h
				; CHECK-NEXT: sunpklo z5.s, z5.h
				; CHECK-NEXT: sunpklo z18.s, z18.h
				; CHECK-NEXT: sunpklo z7.h, z7.b
				; CHECK-NEXT: sunpklo z3.h, z3.b
				; CHECK-NEXT: sdivr z19.s, p0/m, z19.s, z20.s
				; CHECK-NEXT: sdivr z5.s, p0/m, z5.s, z18.s
				; CHECK-NEXT: sunpkhi z18.s, z7.h
				; CHECK-NEXT: sunpkhi z20.s, z3.h
				; CHECK-NEXT: sunpklo z7.s, z7.h
				; CHECK-NEXT: sunpklo z3.s, z3.h
				; CHECK-NEXT: sdivr z18.s, p0/m, z18.s, z20.s
				; CHECK-NEXT: sdiv z3.s, p0/m, z3.s, z7.s
				; CHECK-NEXT: uzp1 z5.h, z5.h, z19.h
				; CHECK-NEXT: uzp1 z3.h, z3.h, z18.h
				; CHECK-NEXT: ld1b { z16.b }, p1/z, [x1]
				; CHECK-NEXT: uzp1 z3.b, z3.b, z5.b
				; CHECK-NEXT: sunpkhi z5.h, z6.b
				; CHECK-NEXT: sunpkhi z7.h, z2.b
				; CHECK-NEXT: uzp1 z4.h, z4.h, z17.h
				; CHECK-NEXT: sunpkhi z17.s, z5.h
				; CHECK-NEXT: sunpkhi z18.s, z7.h
				; CHECK-NEXT: sunpklo z5.s, z5.h
				; CHECK-NEXT: sunpklo z7.s, z7.h
				; CHECK-NEXT: sunpklo z6.h, z6.b
				; CHECK-NEXT: sunpklo z2.h, z2.b
				; CHECK-NEXT: sdivr z17.s, p0/m, z17.s, z18.s
				; CHECK-NEXT: sdivr z5.s, p0/m, z5.s, z7.s
				; CHECK-NEXT: sunpkhi z7.s, z6.h
				; CHECK-NEXT: sunpkhi z18.s, z2.h
				; CHECK-NEXT: sunpklo z6.s, z6.h
				; CHECK-NEXT: sunpklo z2.s, z2.h
				; CHECK-NEXT: sdivr z7.s, p0/m, z7.s, z18.s
				; CHECK-NEXT: sdiv z2.s, p0/m, z2.s, z6.s
				; CHECK-NEXT: uzp1 z2.h, z2.h, z7.h
				; CHECK-NEXT: sunpkhi z6.h, z16.b
				; CHECK-NEXT: sunpkhi z7.h, z0.b
				; CHECK-NEXT: uzp1 z5.h, z5.h, z17.h
				; CHECK-NEXT: sunpkhi z17.s, z6.h
				; CHECK-NEXT: sunpkhi z18.s, z7.h
				; CHECK-NEXT: sunpklo z6.s, z6.h
				; CHECK-NEXT: sunpklo z7.s, z7.h
				; CHECK-NEXT: sdivr z17.s, p0/m, z17.s, z18.s
				; CHECK-NEXT: sdivr z6.s, p0/m, z6.s, z7.s
				; CHECK-NEXT: uzp1 z2.b, z2.b, z5.b
				; CHECK-NEXT: uzp1 z5.h, z6.h, z17.h
				; CHECK-NEXT: sunpklo z6.h, z16.b
				; CHECK-NEXT: sunpklo z0.h, z0.b
				; CHECK-NEXT: sunpkhi z7.s, z6.h
				; CHECK-NEXT: sunpkhi z16.s, z0.h
				; CHECK-NEXT: sunpklo z6.s, z6.h
				; CHECK-NEXT: sunpklo z0.s, z0.h
				; CHECK-NEXT: sdivr z7.s, p0/m, z7.s, z16.s
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z6.s
				; CHECK-NEXT: uzp1 z0.h, z0.h, z7.h
				; CHECK-NEXT: uzp1 z1.b, z4.b, z1.b
				; CHECK-NEXT: uzp1 z0.b, z0.b, z5.b
				; CHECK-NEXT: stp q2, q3, [x0, #32]
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%res = sdiv <64 x i8> %op1, %op2
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}

				define <2 x i16> @sdiv_v2i16(<2 x i16> %op1, <2 x i16> %op2) #0 {
				; CHECK-LABEL: sdiv_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI5_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI5_0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x8]
				; CHECK-NEXT: lsl z1.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: lsl z0.s, p0/m, z0.s, z2.s
				david-arm (inline comment, Done): This still has the `vscale_range(16,0)` attribute. Can you remove it and recreate the CHECK lines please?
				; CHECK-NEXT: asr z1.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: asr z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = sdiv <2 x i16> %op1, %op2
				ret <2 x i16> %res
				}

				define <4 x i16> @sdiv_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: sdiv_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpklo z1.s, z1.h
				; CHECK-NEXT: sunpklo z0.s, z0.h
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: mov z1.s, z0.s[3]
				; CHECK-NEXT: mov z2.s, z0.s[2]
				; CHECK-NEXT: mov z0.s, z0.s[1]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: fmov w10, s2
				; CHECK-NEXT: strh w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strh w9, [sp, #14]
				; CHECK-NEXT: strh w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strh w10, [sp, #12]
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = sdiv <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @sdiv_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: sdiv_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpkhi z2.s, z1.h
				; CHECK-NEXT: sunpkhi z3.s, z0.h
				; CHECK-NEXT: sunpklo z1.s, z1.h
				; CHECK-NEXT: sunpklo z0.s, z0.h
				; CHECK-NEXT: sdivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: uzp1 z0.h, z0.h, z2.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = sdiv <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @sdiv_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpkhi z5.s, z0.h
				; CHECK-NEXT: sunpklo z0.s, z0.h
				; CHECK-NEXT: sunpkhi z4.s, z2.h
				; CHECK-NEXT: sunpkhi z6.s, z3.h
				; CHECK-NEXT: sdivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: sunpkhi z5.s, z1.h
				; CHECK-NEXT: sunpklo z2.s, z2.h
				; CHECK-NEXT: sunpklo z3.s, z3.h
				; CHECK-NEXT: sunpklo z1.s, z1.h
				; CHECK-NEXT: sdiv z5.s, p0/m, z5.s, z6.s
				; CHECK-NEXT: sdiv z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: uzp1 z1.h, z1.h, z5.h
				; CHECK-NEXT: uzp1 z0.h, z0.h, z4.h
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = sdiv <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @sdiv_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; CHECK-LABEL: sdiv_v32i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #16
				; CHECK-NEXT: mov x9, #24
				; CHECK-NEXT: mov x10, #8
				; CHECK-NEXT: ptrue p1.h, vl8
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1h { z0.h }, p1/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p1/z, [x0, x9, lsl #1]
				; CHECK-NEXT: ld1h { z2.h }, p1/z, [x0, x10, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p1/z, [x0]
				; CHECK-NEXT: ld1h { z4.h }, p1/z, [x1, x10, lsl #1]
				; CHECK-NEXT: ld1h { z5.h }, p1/z, [x1, x9, lsl #1]
				; CHECK-NEXT: ld1h { z6.h }, p1/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z17.h }, p1/z, [x1]
				; CHECK-NEXT: sunpkhi z18.s, z1.h
				; CHECK-NEXT: sunpklo z1.s, z1.h
				; CHECK-NEXT: sunpkhi z16.s, z2.h
				; CHECK-NEXT: sunpklo z2.s, z2.h
				; CHECK-NEXT: sunpkhi z7.s, z4.h
				; CHECK-NEXT: sunpklo z4.s, z4.h
				; CHECK-NEXT: sdivr z7.s, p0/m, z7.s, z16.s
				; CHECK-NEXT: sunpkhi z16.s, z5.h
				; CHECK-NEXT: sunpklo z5.s, z5.h
				; CHECK-NEXT: sdiv z2.s, p0/m, z2.s, z4.s
				; CHECK-NEXT: sdiv z1.s, p0/m, z1.s, z5.s
				; CHECK-NEXT: sunpkhi z4.s, z6.h
				; CHECK-NEXT: sunpkhi z5.s, z0.h
				; CHECK-NEXT: sunpklo z6.s, z6.h
				; CHECK-NEXT: sunpklo z0.s, z0.h
				; CHECK-NEXT: sdivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z6.s
				; CHECK-NEXT: sunpkhi z5.s, z17.h
				; CHECK-NEXT: sdivr z16.s, p0/m, z16.s, z18.s
				; CHECK-NEXT: sunpkhi z6.s, z3.h
				; CHECK-NEXT: uzp1 z0.h, z0.h, z4.h
				; CHECK-NEXT: movprfx z4, z6
				; CHECK-NEXT: sdiv z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: sunpklo z5.s, z17.h
				; CHECK-NEXT: sunpklo z3.s, z3.h
				; CHECK-NEXT: uzp1 z1.h, z1.h, z16.h
				; CHECK-NEXT: sdiv z3.s, p0/m, z3.s, z5.s
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: uzp1 z0.h, z3.h, z4.h
				; CHECK-NEXT: uzp1 z1.h, z2.h, z7.h
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%res = sdiv <32 x i16> %op1, %op2
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define <2 x i32> @sdiv_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: sdiv_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = sdiv <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @sdiv_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: sdiv_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = sdiv <4 x i32> %op1, %op2
				ret <4 x i32> %res
				}

				define void @sdiv_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: sdiv_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: sdiv z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = sdiv <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define void @sdiv_v16i32(<16 x i32>* %a, <16 x i32>* %b) #0 {
				; CHECK-LABEL: sdiv_v16i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: mov x9, #12
				; CHECK-NEXT: mov x10, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, x9, lsl #2]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x0, x10, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z4.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z5.s }, p0/z, [x1, x9, lsl #2]
				; CHECK-NEXT: ld1w { z6.s }, p0/z, [x1, x10, lsl #2]
				; CHECK-NEXT: ld1w { z7.s }, p0/z, [x1]
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z4.s
				; CHECK-NEXT: sdiv z1.s, p0/m, z1.s, z5.s
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: sdiv z0.s, p0/m, z0.s, z7.s
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: sdiv z1.s, p0/m, z1.s, z6.s
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i32>, <16 x i32>* %a
				%op2 = load <16 x i32>, <16 x i32>* %b
				%res = sdiv <16 x i32> %op1, %op2
				store <16 x i32> %res, <16 x i32>* %a
				ret void
				}

				define <1 x i64> @sdiv_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: sdiv_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: sdiv z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = sdiv <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @sdiv_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: sdiv_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: sdiv z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = sdiv <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @sdiv_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: sdiv_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: sdiv z1.d, p0/m, z1.d, z3.d
				; CHECK-NEXT: sdiv z0.d, p0/m, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = sdiv <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				define void @sdiv_v8i64(<8 x i64>* %a, <8 x i64>* %b) #0 {
				; CHECK-LABEL: sdiv_v8i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: mov x9, #6
				; CHECK-NEXT: mov x10, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0, x9, lsl #3]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x0, x10, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z4.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z5.d }, p0/z, [x1, x9, lsl #3]
				; CHECK-NEXT: ld1d { z6.d }, p0/z, [x1, x10, lsl #3]
				; CHECK-NEXT: ld1d { z7.d }, p0/z, [x1]
				; CHECK-NEXT: sdiv z0.d, p0/m, z0.d, z4.d
				; CHECK-NEXT: sdiv z1.d, p0/m, z1.d, z5.d
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: sdiv z0.d, p0/m, z0.d, z7.d
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: sdiv z1.d, p0/m, z1.d, z6.d
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i64>, <8 x i64>* %a
				%op2 = load <8 x i64>, <8 x i64>* %b
				%res = sdiv <8 x i64> %op1, %op2
				store <8 x i64> %res, <8 x i64>* %a
				ret void
				}

				;
				; UDIV
				;

				define <4 x i8> @udiv_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; CHECK-LABEL: udiv_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: adrp x8, .LCPI18_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI18_0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p1.s, vl4
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: and z1.d, z1.d, z2.d
				; CHECK-NEXT: uunpklo z1.s, z1.h
				; CHECK-NEXT: uunpklo z0.s, z0.h
				; CHECK-NEXT: udiv z0.s, p1/m, z0.s, z1.s
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: mov z1.s, z0.s[3]
				; CHECK-NEXT: mov z2.s, z0.s[2]
				; CHECK-NEXT: mov z0.s, z0.s[1]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: fmov w10, s2
				; CHECK-NEXT: strh w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strh w9, [sp, #14]
				; CHECK-NEXT: strh w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strh w10, [sp, #12]
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = udiv <4 x i8> %op1, %op2
				ret <4 x i8> %res
				}

				define <8 x i8> @udiv_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: udiv_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpklo z1.h, z1.b
				; CHECK-NEXT: uunpklo z0.h, z0.b
				; CHECK-NEXT: uunpkhi z2.s, z1.h
				; CHECK-NEXT: uunpkhi z3.s, z0.h
				; CHECK-NEXT: uunpklo z1.s, z1.h
				; CHECK-NEXT: uunpklo z0.s, z0.h
				; CHECK-NEXT: udivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: uzp1 z0.h, z0.h, z2.h
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: mov z1.h, z0.h[7]
				; CHECK-NEXT: mov z3.h, z0.h[5]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: mov z2.h, z0.h[6]
				; CHECK-NEXT: mov z4.h, z0.h[4]
				; CHECK-NEXT: strb w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s3
				; CHECK-NEXT: mov z6.h, z0.h[2]
				; CHECK-NEXT: fmov w10, s2
				; CHECK-NEXT: strb w9, [sp, #15]
				; CHECK-NEXT: fmov w9, s4
				; CHECK-NEXT: strb w8, [sp, #13]
				; CHECK-NEXT: fmov w8, s6
				; CHECK-NEXT: mov z5.h, z0.h[3]
				; CHECK-NEXT: mov z0.h, z0.h[1]
				; CHECK-NEXT: strb w10, [sp, #14]
				; CHECK-NEXT: fmov w10, s5
				; CHECK-NEXT: strb w9, [sp, #12]
				; CHECK-NEXT: fmov w9, s0
				; CHECK-NEXT: strb w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strb w10, [sp, #11]
				; CHECK-NEXT: strb w9, [sp, #9]
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = udiv <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @udiv_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: udiv_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: uunpkhi z2.h, z1.b
				; CHECK-NEXT: uunpkhi z3.h, z0.b
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpklo z1.h, z1.b
				; CHECK-NEXT: uunpkhi z4.s, z2.h
				; CHECK-NEXT: uunpkhi z5.s, z3.h
				; CHECK-NEXT: uunpklo z2.s, z2.h
				; CHECK-NEXT: uunpklo z3.s, z3.h
				; CHECK-NEXT: uunpklo z0.h, z0.b
				; CHECK-NEXT: udivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: udivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: uunpkhi z3.s, z1.h
				; CHECK-NEXT: uunpkhi z5.s, z0.h
				; CHECK-NEXT: uunpklo z1.s, z1.h
				; CHECK-NEXT: uunpklo z0.s, z0.h
				; CHECK-NEXT: udivr z3.s, p0/m, z3.s, z5.s
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: uzp1 z1.h, z2.h, z4.h
				; CHECK-NEXT: uzp1 z0.h, z0.h, z3.h
				; CHECK-NEXT: uzp1 z0.b, z0.b, z1.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = udiv <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @udiv_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: udiv_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpkhi z5.h, z0.b
				; CHECK-NEXT: uunpklo z0.h, z0.b
				; CHECK-NEXT: uunpkhi z4.h, z2.b
				; CHECK-NEXT: uunpklo z2.h, z2.b
				; CHECK-NEXT: uunpkhi z6.s, z4.h
				; CHECK-NEXT: uunpkhi z7.s, z5.h
				; CHECK-NEXT: uunpklo z4.s, z4.h
				; CHECK-NEXT: uunpklo z5.s, z5.h
				; CHECK-NEXT: uunpkhi z16.s, z2.h
				; CHECK-NEXT: udivr z6.s, p0/m, z6.s, z7.s
				; CHECK-NEXT: udivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: uunpkhi z5.s, z0.h
				; CHECK-NEXT: uunpklo z2.s, z2.h
				; CHECK-NEXT: uunpklo z0.s, z0.h
				; CHECK-NEXT: uzp1 z4.h, z4.h, z6.h
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: uunpkhi z2.h, z3.b
				; CHECK-NEXT: uunpkhi z6.h, z1.b
				; CHECK-NEXT: udiv z5.s, p0/m, z5.s, z16.s
				; CHECK-NEXT: uunpkhi z7.s, z2.h
				; CHECK-NEXT: uunpkhi z16.s, z6.h
				; CHECK-NEXT: uunpklo z2.s, z2.h
				; CHECK-NEXT: uunpklo z6.s, z6.h
				; CHECK-NEXT: uunpklo z3.h, z3.b
				; CHECK-NEXT: uunpklo z1.h, z1.b
				; CHECK-NEXT: udivr z7.s, p0/m, z7.s, z16.s
				; CHECK-NEXT: udivr z2.s, p0/m, z2.s, z6.s
				; CHECK-NEXT: uunpkhi z6.s, z3.h
				; CHECK-NEXT: uunpkhi z16.s, z1.h
				; CHECK-NEXT: uunpklo z3.s, z3.h
				; CHECK-NEXT: uunpklo z1.s, z1.h
				; CHECK-NEXT: udivr z6.s, p0/m, z6.s, z16.s
				; CHECK-NEXT: udiv z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: uzp1 z2.h, z2.h, z7.h
				; CHECK-NEXT: uzp1 z1.h, z1.h, z6.h
				; CHECK-NEXT: uzp1 z0.h, z0.h, z5.h
				; CHECK-NEXT: uzp1 z1.b, z1.b, z2.b
				; CHECK-NEXT: uzp1 z0.b, z0.b, z4.b
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = udiv <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define void @udiv_v64i8(<64 x i8>* %a, <64 x i8>* %b) #0 {
				; CHECK-LABEL: udiv_v64i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #32
				; CHECK-NEXT: mov w9, #48
				; CHECK-NEXT: mov w10, #16
				; CHECK-NEXT: ptrue p1.b, vl16
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1b { z2.b }, p1/z, [x0, x8]
				; CHECK-NEXT: ld1b { z3.b }, p1/z, [x0, x9]
				; CHECK-NEXT: ld1b { z4.b }, p1/z, [x0, x10]
				; CHECK-NEXT: ld1b { z0.b }, p1/z, [x0]
				; CHECK-NEXT: ld1b { z5.b }, p1/z, [x1, x10]
				; CHECK-NEXT: ld1b { z7.b }, p1/z, [x1, x9]
				; CHECK-NEXT: ld1b { z6.b }, p1/z, [x1, x8]
				; CHECK-NEXT: uunpkhi z16.h, z4.b
				; CHECK-NEXT: uunpklo z4.h, z4.b
				; CHECK-NEXT: uunpkhi z1.h, z5.b
				; CHECK-NEXT: uunpkhi z18.s, z16.h
				; CHECK-NEXT: uunpkhi z17.s, z1.h
				; CHECK-NEXT: uunpklo z1.s, z1.h
				; CHECK-NEXT: uunpklo z16.s, z16.h
				; CHECK-NEXT: udivr z17.s, p0/m, z17.s, z18.s
				; CHECK-NEXT: udivr z1.s, p0/m, z1.s, z16.s
				; CHECK-NEXT: uunpklo z5.h, z5.b
				; CHECK-NEXT: uzp1 z1.h, z1.h, z17.h
				; CHECK-NEXT: uunpkhi z17.s, z5.h
				; CHECK-NEXT: uunpkhi z18.s, z4.h
				; CHECK-NEXT: uunpklo z5.s, z5.h
				; CHECK-NEXT: uunpklo z4.s, z4.h
				; CHECK-NEXT: udivr z17.s, p0/m, z17.s, z18.s
				; CHECK-NEXT: udiv z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: uunpkhi z5.h, z7.b
				; CHECK-NEXT: uunpkhi z18.h, z3.b
				; CHECK-NEXT: uunpkhi z19.s, z5.h
				; CHECK-NEXT: uunpkhi z20.s, z18.h
				; CHECK-NEXT: uunpklo z5.s, z5.h
				; CHECK-NEXT: uunpklo z18.s, z18.h
				; CHECK-NEXT: uunpklo z7.h, z7.b
				; CHECK-NEXT: uunpklo z3.h, z3.b
				; CHECK-NEXT: udivr z19.s, p0/m, z19.s, z20.s
				; CHECK-NEXT: udivr z5.s, p0/m, z5.s, z18.s
				; CHECK-NEXT: uunpkhi z18.s, z7.h
				; CHECK-NEXT: uunpkhi z20.s, z3.h
				; CHECK-NEXT: uunpklo z7.s, z7.h
				; CHECK-NEXT: uunpklo z3.s, z3.h
				; CHECK-NEXT: udivr z18.s, p0/m, z18.s, z20.s
				; CHECK-NEXT: udiv z3.s, p0/m, z3.s, z7.s
				; CHECK-NEXT: uzp1 z5.h, z5.h, z19.h
				; CHECK-NEXT: uzp1 z3.h, z3.h, z18.h
				; CHECK-NEXT: ld1b { z16.b }, p1/z, [x1]
				; CHECK-NEXT: uzp1 z3.b, z3.b, z5.b
				; CHECK-NEXT: uunpkhi z5.h, z6.b
				; CHECK-NEXT: uunpkhi z7.h, z2.b
				; CHECK-NEXT: uzp1 z4.h, z4.h, z17.h
				; CHECK-NEXT: uunpkhi z17.s, z5.h
				; CHECK-NEXT: uunpkhi z18.s, z7.h
				; CHECK-NEXT: uunpklo z5.s, z5.h
				; CHECK-NEXT: uunpklo z7.s, z7.h
				; CHECK-NEXT: uunpklo z6.h, z6.b
				; CHECK-NEXT: uunpklo z2.h, z2.b
				; CHECK-NEXT: udivr z17.s, p0/m, z17.s, z18.s
				; CHECK-NEXT: udivr z5.s, p0/m, z5.s, z7.s
				; CHECK-NEXT: uunpkhi z7.s, z6.h
				; CHECK-NEXT: uunpkhi z18.s, z2.h
				; CHECK-NEXT: uunpklo z6.s, z6.h
				; CHECK-NEXT: uunpklo z2.s, z2.h
				; CHECK-NEXT: udivr z7.s, p0/m, z7.s, z18.s
				; CHECK-NEXT: udiv z2.s, p0/m, z2.s, z6.s
				; CHECK-NEXT: uzp1 z2.h, z2.h, z7.h
				; CHECK-NEXT: uunpkhi z6.h, z16.b
				; CHECK-NEXT: uunpkhi z7.h, z0.b
				; CHECK-NEXT: uzp1 z5.h, z5.h, z17.h
				; CHECK-NEXT: uunpkhi z17.s, z6.h
				; CHECK-NEXT: uunpkhi z18.s, z7.h
				; CHECK-NEXT: uunpklo z6.s, z6.h
				; CHECK-NEXT: uunpklo z7.s, z7.h
				; CHECK-NEXT: udivr z17.s, p0/m, z17.s, z18.s
				; CHECK-NEXT: udivr z6.s, p0/m, z6.s, z7.s
				; CHECK-NEXT: uzp1 z2.b, z2.b, z5.b
				; CHECK-NEXT: uzp1 z5.h, z6.h, z17.h
				; CHECK-NEXT: uunpklo z6.h, z16.b
				; CHECK-NEXT: uunpklo z0.h, z0.b
				; CHECK-NEXT: uunpkhi z7.s, z6.h
				; CHECK-NEXT: uunpkhi z16.s, z0.h
				; CHECK-NEXT: uunpklo z6.s, z6.h
				; CHECK-NEXT: uunpklo z0.s, z0.h
				; CHECK-NEXT: udivr z7.s, p0/m, z7.s, z16.s
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z6.s
				; CHECK-NEXT: uzp1 z0.h, z0.h, z7.h
				; CHECK-NEXT: uzp1 z1.b, z4.b, z1.b
				; CHECK-NEXT: uzp1 z0.b, z0.b, z5.b
				; CHECK-NEXT: stp q2, q3, [x0, #32]
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <64 x i8>, <64 x i8>* %a
				%op2 = load <64 x i8>, <64 x i8>* %b
				%res = udiv <64 x i8> %op1, %op2
				store <64 x i8> %res, <64 x i8>* %a
				ret void
				}

				define <2 x i16> @udiv_v2i16(<2 x i16> %op1, <2 x i16> %op2) #0 {
				; CHECK-LABEL: udiv_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI23_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI23_0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x8]
				; CHECK-NEXT: and z1.d, z1.d, z2.d
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = udiv <2 x i16> %op1, %op2
				ret <2 x i16> %res
				}

				define <4 x i16> @udiv_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: udiv_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpklo z1.s, z1.h
				; CHECK-NEXT: uunpklo z0.s, z0.h
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: mov z1.s, z0.s[3]
				; CHECK-NEXT: mov z2.s, z0.s[2]
				; CHECK-NEXT: mov z0.s, z0.s[1]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: fmov w10, s2
				; CHECK-NEXT: strh w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: strh w9, [sp, #14]
				; CHECK-NEXT: strh w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strh w10, [sp, #12]
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = udiv <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @udiv_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: udiv_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpkhi z2.s, z1.h
				; CHECK-NEXT: uunpkhi z3.s, z0.h
				; CHECK-NEXT: uunpklo z1.s, z1.h
				; CHECK-NEXT: uunpklo z0.s, z0.h
				; CHECK-NEXT: udivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: uzp1 z0.h, z0.h, z2.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = udiv <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @udiv_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: udiv_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpkhi z5.s, z0.h
				; CHECK-NEXT: uunpklo z0.s, z0.h
				; CHECK-NEXT: uunpkhi z4.s, z2.h
				; CHECK-NEXT: uunpkhi z6.s, z3.h
				; CHECK-NEXT: udivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: uunpkhi z5.s, z1.h
				; CHECK-NEXT: uunpklo z2.s, z2.h
				; CHECK-NEXT: uunpklo z3.s, z3.h
				; CHECK-NEXT: uunpklo z1.s, z1.h
				; CHECK-NEXT: udiv z5.s, p0/m, z5.s, z6.s
				; CHECK-NEXT: udiv z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: uzp1 z1.h, z1.h, z5.h
				; CHECK-NEXT: uzp1 z0.h, z0.h, z4.h
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				david-arm: This is a NEON vector instruction - this is definitely illegal in streaming mode. Can you try to find out why this is being inserted please?
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = udiv <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define void @udiv_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; CHECK-LABEL: udiv_v32i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #16
				; CHECK-NEXT: mov x9, #24
				; CHECK-NEXT: mov x10, #8
				; CHECK-NEXT: ptrue p1.h, vl8
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1h { z0.h }, p1/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p1/z, [x0, x9, lsl #1]
				; CHECK-NEXT: ld1h { z2.h }, p1/z, [x0, x10, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p1/z, [x0]
				; CHECK-NEXT: ld1h { z4.h }, p1/z, [x1, x10, lsl #1]
				; CHECK-NEXT: ld1h { z5.h }, p1/z, [x1, x9, lsl #1]
				; CHECK-NEXT: ld1h { z6.h }, p1/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z17.h }, p1/z, [x1]
				; CHECK-NEXT: uunpkhi z18.s, z1.h
				; CHECK-NEXT: uunpklo z1.s, z1.h
				; CHECK-NEXT: uunpkhi z16.s, z2.h
				; CHECK-NEXT: uunpklo z2.s, z2.h
				; CHECK-NEXT: uunpkhi z7.s, z4.h
				; CHECK-NEXT: uunpklo z4.s, z4.h
				; CHECK-NEXT: udivr z7.s, p0/m, z7.s, z16.s
				; CHECK-NEXT: uunpkhi z16.s, z5.h
				; CHECK-NEXT: uunpklo z5.s, z5.h
				; CHECK-NEXT: udiv z2.s, p0/m, z2.s, z4.s
				; CHECK-NEXT: udiv z1.s, p0/m, z1.s, z5.s
				; CHECK-NEXT: uunpkhi z4.s, z6.h
				; CHECK-NEXT: uunpkhi z5.s, z0.h
				; CHECK-NEXT: uunpklo z6.s, z6.h
				; CHECK-NEXT: uunpklo z0.s, z0.h
				; CHECK-NEXT: udivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z6.s
				; CHECK-NEXT: uunpkhi z5.s, z17.h
				; CHECK-NEXT: udivr z16.s, p0/m, z16.s, z18.s
				; CHECK-NEXT: uunpkhi z6.s, z3.h
				; CHECK-NEXT: uzp1 z0.h, z0.h, z4.h
				; CHECK-NEXT: movprfx z4, z6
				; CHECK-NEXT: udiv z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: uunpklo z5.s, z17.h
				; CHECK-NEXT: uunpklo z3.s, z3.h
				; CHECK-NEXT: uzp1 z1.h, z1.h, z16.h
				; CHECK-NEXT: udiv z3.s, p0/m, z3.s, z5.s
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: uzp1 z0.h, z3.h, z4.h
				; CHECK-NEXT: uzp1 z1.h, z2.h, z7.h
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i16>, <32 x i16>* %a
				%op2 = load <32 x i16>, <32 x i16>* %b
				%res = udiv <32 x i16> %op1, %op2
				store <32 x i16> %res, <32 x i16>* %a
				ret void
				}

				define <2 x i32> @udiv_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: udiv_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = udiv <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @udiv_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: udiv_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = udiv <4 x i32> %op1, %op2
				ret <4 x i32> %res
				david-arm: Again, this still has the `vscale_range(16,0)` attribute. Can you remove it and regenerate the CHECK lines?
				}

				define void @udiv_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: udiv_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: udiv z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = udiv <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define void @udiv_v16i32(<16 x i32>* %a, <16 x i32>* %b) #0 {
				; CHECK-LABEL: udiv_v16i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: mov x9, #12
				; CHECK-NEXT: mov x10, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0, x9, lsl #2]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x0, x10, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z4.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z5.s }, p0/z, [x1, x9, lsl #2]
				; CHECK-NEXT: ld1w { z6.s }, p0/z, [x1, x10, lsl #2]
				; CHECK-NEXT: ld1w { z7.s }, p0/z, [x1]
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z4.s
				; CHECK-NEXT: udiv z1.s, p0/m, z1.s, z5.s
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: udiv z0.s, p0/m, z0.s, z7.s
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: udiv z1.s, p0/m, z1.s, z6.s
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i32>, <16 x i32>* %a
				%op2 = load <16 x i32>, <16 x i32>* %b
				%res = udiv <16 x i32> %op1, %op2
				store <16 x i32> %res, <16 x i32>* %a
				ret void
				}

				define <1 x i64> @udiv_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: udiv_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: udiv z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = udiv <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @udiv_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: udiv_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: udiv z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = udiv <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @udiv_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: udiv_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: udiv z1.d, p0/m, z1.d, z3.d
				; CHECK-NEXT: udiv z0.d, p0/m, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = udiv <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				define void @udiv_v8i64(<8 x i64>* %a, <8 x i64>* %b) #0 {
				; CHECK-LABEL: udiv_v8i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: mov x9, #6
				; CHECK-NEXT: mov x10, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0, x9, lsl #3]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x0, x10, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z4.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z5.d }, p0/z, [x1, x9, lsl #3]
				; CHECK-NEXT: ld1d { z6.d }, p0/z, [x1, x10, lsl #3]
				; CHECK-NEXT: ld1d { z7.d }, p0/z, [x1]
				; CHECK-NEXT: udiv z0.d, p0/m, z0.d, z4.d
				; CHECK-NEXT: udiv z1.d, p0/m, z1.d, z5.d
				; CHECK-NEXT: stp q0, q1, [x0, #32]
				; CHECK-NEXT: movprfx z0, z3
				; CHECK-NEXT: udiv z0.d, p0/m, z0.d, z7.d
				; CHECK-NEXT: movprfx z1, z2
				; CHECK-NEXT: udiv z1.d, p0/m, z1.d, z6.d
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i64>, <8 x i64>* %a
				%op2 = load <8 x i64>, <8 x i64>* %b
				%res = udiv <8 x i64> %op1, %op2
				store <8 x i64> %res, <8 x i64>* %a
				ret void
				}

				define void @udiv_constantsplat_v8i32(<8 x i32>* %a) #0 {
				; CHECK-LABEL: udiv_constantsplat_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: adrp x8, .LCPI36_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI36_0
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x8]
				; CHECK-NEXT: adrp x8, .LCPI36_1
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI36_1
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x8]
				; CHECK-NEXT: adrp x8, .LCPI36_2
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI36_2
				; CHECK-NEXT: ld1w { z4.s }, p0/z, [x8]
				; CHECK-NEXT: movprfx z5, z1
				; CHECK-NEXT: umulh z5.s, p0/m, z5.s, z2.s
				; CHECK-NEXT: umulh z2.s, p0/m, z2.s, z0.s
				; CHECK-NEXT: sub z1.s, z1.s, z5.s
				; CHECK-NEXT: sub z0.s, z0.s, z2.s
				; CHECK-NEXT: lsr z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: lsr z0.s, p0/m, z0.s, z3.s
				; CHECK-NEXT: add z1.s, z1.s, z5.s
				; CHECK-NEXT: add z0.s, z0.s, z2.s
				; CHECK-NEXT: lsr z1.s, p0/m, z1.s, z4.s
				; CHECK-NEXT: lsr z0.s, p0/m, z0.s, z4.s
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%res = udiv <8 x i32> %op1, <i32 95, i32 95, i32 95, i32 95, i32 95, i32 95, i32 95, i32 95>
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-log.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				;
				; AND
				;

				define <8 x i8> @and_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: and_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = and <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @and_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: and_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = and <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @and_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: and_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: and z1.d, z1.d, z3.d
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = and <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <4 x i16> @and_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: and_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = and <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @and_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: and_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = and <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @and_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: and_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: and z1.d, z1.d, z3.d
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = and <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @and_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: and_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = and <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @and_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: and_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = and <4 x i32> %op1, %op2
				ret <4 x i32> %res
				}

				define void @and_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: and_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: and z1.d, z1.d, z3.d
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = and <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @and_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: and_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = and <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @and_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: and_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = and <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @and_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: and_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: and z1.d, z1.d, z3.d
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = and <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				;
				; OR
				;

				define <8 x i8> @or_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: or_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = or <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @or_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: or_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = or <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @or_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: or_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: orr z1.d, z1.d, z3.d
				; CHECK-NEXT: orr z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = or <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <4 x i16> @or_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: or_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = or <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @or_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: or_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = or <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @or_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: or_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: orr z1.d, z1.d, z3.d
				; CHECK-NEXT: orr z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = or <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @or_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: or_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = or <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @or_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: or_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = or <4 x i32> %op1, %op2
				ret <4 x i32> %res
				}

				define void @or_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: or_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: orr z1.d, z1.d, z3.d
				; CHECK-NEXT: orr z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = or <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @or_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: or_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = or <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @or_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: or_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = or <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @or_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: or_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: orr z1.d, z1.d, z3.d
				; CHECK-NEXT: orr z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = or <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				;
				; XOR
				;

				define <8 x i8> @xor_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: xor_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: eor z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = xor <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @xor_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: xor_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: eor z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = xor <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @xor_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: xor_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: eor z1.d, z1.d, z3.d
				; CHECK-NEXT: eor z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = xor <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <4 x i16> @xor_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: xor_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: eor z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = xor <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @xor_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: xor_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: eor z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = xor <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @xor_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: xor_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: eor z1.d, z1.d, z3.d
				; CHECK-NEXT: eor z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = xor <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @xor_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: xor_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: eor z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = xor <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @xor_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: xor_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: eor z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = xor <4 x i32> %op1, %op2
				ret <4 x i32> %res
				}

				define void @xor_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: xor_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: eor z1.d, z1.d, z3.d
				; CHECK-NEXT: eor z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = xor <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @xor_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: xor_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: eor z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = xor <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @xor_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: xor_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: eor z0.d, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = xor <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @xor_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: xor_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: eor z1.d, z1.d, z3.d
				; CHECK-NEXT: eor z0.d, z0.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = xor <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				; This test only tests the legal types for a given vector width, as mulh nodes
				; do not get generated for non-legal types.

				target triple = "aarch64-unknown-linux-gnu"

				;
				; SMULH
				;

				define <4 x i8> @smulh_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; CHECK-LABEL: smulh_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI0_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI0_0
				david-arm: Can you add a test for an illegal type such as `<4 x i8>` too?
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: adrp x8, .LCPI0_1
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI0_1
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x8]
				; CHECK-NEXT: lsl z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: lsl z1.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: asr z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: asr z1.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: lsr z0.h, p0/m, z0.h, z3.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%insert = insertelement <4 x i16> undef, i16 4, i64 0
				%splat = shufflevector <4 x i16> %insert, <4 x i16> undef, <4 x i32> zeroinitializer
				%1 = sext <4 x i8> %op1 to <4 x i16>
				%2 = sext <4 x i8> %op2 to <4 x i16>
				%mul = mul <4 x i16> %1, %2
				%shr = lshr <4 x i16> %mul, <i16 4, i16 4, i16 4, i16 4>
				%res = trunc <4 x i16> %shr to <4 x i8>
				ret <4 x i8> %res
				}

				define <8 x i8> @smulh_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: smulh_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: smulh z0.b, p0/m, z0.b, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%insert = insertelement <8 x i16> undef, i16 8, i64 0
				%splat = shufflevector <8 x i16> %insert, <8 x i16> undef, <8 x i32> zeroinitializer
				%1 = sext <8 x i8> %op1 to <8 x i16>
				%2 = sext <8 x i8> %op2 to <8 x i16>
				%mul = mul <8 x i16> %1, %2
				%shr = lshr <8 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
				%res = trunc <8 x i16> %shr to <8 x i8>
				ret <8 x i8> %res
				}

				define <16 x i8> @smulh_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: smulh_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: smulh z0.b, p0/m, z0.b, z1.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%1 = sext <16 x i8> %op1 to <16 x i16>
				%2 = sext <16 x i8> %op2 to <16 x i16>
				%mul = mul <16 x i16> %1, %2
				%shr = lshr <16 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
				%res = trunc <16 x i16> %shr to <16 x i8>
				ret <16 x i8> %res
				}

				define void @smulh_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: smulh_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ptrue p1.h, vl8
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: adrp x8, .LCPI3_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI3_0
				; CHECK-NEXT: sunpklo z5.h, z2.b
				; CHECK-NEXT: ext z2.b, z2.b, z2.b, #8
				; CHECK-NEXT: sunpklo z7.h, z3.b
				; CHECK-NEXT: ld1h { z16.h }, p1/z, [x8]
				; CHECK-NEXT: ext z3.b, z3.b, z3.b, #8
				; CHECK-NEXT: sunpklo z2.h, z2.b
				; CHECK-NEXT: sunpklo z3.h, z3.b
				; CHECK-NEXT: mul z5.h, p1/m, z5.h, z7.h
				; CHECK-NEXT: mul z2.h, p1/m, z2.h, z3.h
				; CHECK-NEXT: movprfx z3, z5
				; CHECK-NEXT: lsr z3.h, p1/m, z3.h, z16.h
				; CHECK-NEXT: fmov w8, s3
				; CHECK-NEXT: sunpklo z4.h, z0.b
				; CHECK-NEXT: sunpklo z6.h, z1.b
				; CHECK-NEXT: ext z0.b, z0.b, z0.b, #8
				; CHECK-NEXT: ext z1.b, z1.b, z1.b, #8
				; CHECK-NEXT: lsr z2.h, p1/m, z2.h, z16.h
				; CHECK-NEXT: mov z5.h, z3.h[7]
				; CHECK-NEXT: sunpklo z0.h, z0.b
				; CHECK-NEXT: sunpklo z1.h, z1.b
				; CHECK-NEXT: mul z4.h, p1/m, z4.h, z6.h
				; CHECK-NEXT: mov z6.h, z3.h[6]
				; CHECK-NEXT: mov z7.h, z3.h[5]
				; CHECK-NEXT: mov z17.h, z3.h[4]
				; CHECK-NEXT: mov z18.h, z3.h[3]
				; CHECK-NEXT: mov z19.h, z3.h[2]
				; CHECK-NEXT: mov z20.h, z3.h[1]
				; CHECK-NEXT: mov z3.h, z2.h[7]
				; CHECK-NEXT: mov z21.h, z2.h[6]
				; CHECK-NEXT: mov z22.h, z2.h[5]
				; CHECK-NEXT: mov z23.h, z2.h[4]
				; CHECK-NEXT: mov z24.h, z2.h[3]
				; CHECK-NEXT: mov z25.h, z2.h[2]
				; CHECK-NEXT: mov z26.h, z2.h[1]
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: mul z0.h, p1/m, z0.h, z1.h
				; CHECK-NEXT: fmov w10, s5
				; CHECK-NEXT: strb w8, [sp, #-32]!
				; CHECK-NEXT: .cfi_def_cfa_offset 32
				; CHECK-NEXT: fmov w8, s6
				; CHECK-NEXT: strb w9, [sp, #8]
				; CHECK-NEXT: fmov w9, s7
				; CHECK-NEXT: strb w10, [sp, #7]
				; CHECK-NEXT: fmov w10, s17
				; CHECK-NEXT: lsr z0.h, p1/m, z0.h, z16.h
				; CHECK-NEXT: strb w8, [sp, #6]
				; CHECK-NEXT: fmov w8, s18
				; CHECK-NEXT: strb w9, [sp, #5]
				; CHECK-NEXT: fmov w9, s19
				; CHECK-NEXT: strb w10, [sp, #4]
				; CHECK-NEXT: fmov w10, s20
				; CHECK-NEXT: strb w8, [sp, #3]
				; CHECK-NEXT: fmov w8, s3
				; CHECK-NEXT: strb w9, [sp, #2]
				; CHECK-NEXT: fmov w9, s21
				; CHECK-NEXT: strb w10, [sp, #1]
				; CHECK-NEXT: fmov w10, s22
				; CHECK-NEXT: strb w8, [sp, #15]
				; CHECK-NEXT: fmov w8, s23
				; CHECK-NEXT: strb w9, [sp, #14]
				; CHECK-NEXT: fmov w9, s24
				; CHECK-NEXT: strb w10, [sp, #13]
				; CHECK-NEXT: fmov w10, s25
				; CHECK-NEXT: strb w8, [sp, #12]
				; CHECK-NEXT: fmov w8, s26
				; CHECK-NEXT: movprfx z1, z4
				; CHECK-NEXT: lsr z1.h, p1/m, z1.h, z16.h
				; CHECK-NEXT: strb w9, [sp, #11]
				; CHECK-NEXT: mov z2.h, z1.h[7]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: strb w10, [sp, #10]
				; CHECK-NEXT: fmov w10, s0
				; CHECK-NEXT: strb w8, [sp, #9]
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: mov z3.h, z1.h[6]
				; CHECK-NEXT: mov z4.h, z1.h[5]
				; CHECK-NEXT: mov z5.h, z1.h[4]
				; CHECK-NEXT: strb w9, [sp, #16]
				; CHECK-NEXT: fmov w9, s3
				; CHECK-NEXT: strb w10, [sp, #24]
				; CHECK-NEXT: fmov w10, s4
				; CHECK-NEXT: strb w8, [sp, #23]
				; CHECK-NEXT: fmov w8, s5
				; CHECK-NEXT: mov z6.h, z1.h[3]
				; CHECK-NEXT: mov z7.h, z1.h[2]
				; CHECK-NEXT: mov z16.h, z1.h[1]
				; CHECK-NEXT: strb w9, [sp, #22]
				; CHECK-NEXT: fmov w9, s6
				; CHECK-NEXT: strb w10, [sp, #21]
				; CHECK-NEXT: fmov w10, s7
				; CHECK-NEXT: strb w8, [sp, #20]
				; CHECK-NEXT: fmov w8, s16
				; CHECK-NEXT: mov z1.h, z0.h[7]
				; CHECK-NEXT: mov z17.h, z0.h[6]
				; CHECK-NEXT: mov z18.h, z0.h[5]
				; CHECK-NEXT: strb w9, [sp, #19]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: strb w10, [sp, #18]
				; CHECK-NEXT: fmov w10, s17
				; CHECK-NEXT: strb w8, [sp, #17]
				; CHECK-NEXT: fmov w8, s18
				; CHECK-NEXT: mov z19.h, z0.h[4]
				; CHECK-NEXT: mov z20.h, z0.h[3]
				; CHECK-NEXT: mov z21.h, z0.h[2]
				; CHECK-NEXT: strb w9, [sp, #31]
				; CHECK-NEXT: fmov w9, s19
				; CHECK-NEXT: strb w10, [sp, #30]
				; CHECK-NEXT: fmov w10, s20
				; CHECK-NEXT: strb w8, [sp, #29]
				; CHECK-NEXT: fmov w8, s21
				; CHECK-NEXT: mov z22.h, z0.h[1]
				; CHECK-NEXT: strb w9, [sp, #28]
				; CHECK-NEXT: fmov w9, s22
				; CHECK-NEXT: strb w10, [sp, #27]
				; CHECK-NEXT: mov x10, sp
				; CHECK-NEXT: strb w8, [sp, #26]
				; CHECK-NEXT: add x8, sp, #16
				; CHECK-NEXT: strb w9, [sp, #25]
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x10]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x8]
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: add sp, sp, #32
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%1 = sext <32 x i8> %op1 to <32 x i16>
				%2 = sext <32 x i8> %op2 to <32 x i16>
				%mul = mul <32 x i16> %1, %2
				david-arm: Wow, this code surely gets an award for being so impressively bad?!
				%shr = lshr <32 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
				%res = trunc <32 x i16> %shr to <32 x i8>
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <2 x i16> @smulh_v2i16(<2 x i16> %op1, <2 x i16> %op2) #0 {
				; CHECK-LABEL: smulh_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI4_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI4_0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x8]
				; CHECK-NEXT: lsl z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: lsl z1.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: asr z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: asr z1.s, p0/m, z1.s, z2.s
				; CHECK-NEXT: mul z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: lsr z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%1 = sext <2 x i16> %op1 to <2 x i32>
				%2 = sext <2 x i16> %op2 to <2 x i32>
				%mul = mul <2 x i32> %1, %2
				%shr = lshr <2 x i32> %mul, <i32 16, i32 16>
				%res = trunc <2 x i32> %shr to <2 x i16>
				ret <2 x i16> %res
				}

				define <4 x i16> @smulh_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: smulh_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: smulh z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%1 = sext <4 x i16> %op1 to <4 x i32>
				%2 = sext <4 x i16> %op2 to <4 x i32>
				%mul = mul <4 x i32> %1, %2
				%shr = lshr <4 x i32> %mul, <i32 16, i32 16, i32 16, i32 16>
				%res = trunc <4 x i32> %shr to <4 x i16>
				ret <4 x i16> %res
				}

				define <8 x i16> @smulh_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: smulh_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: smulh z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%1 = sext <8 x i16> %op1 to <8 x i32>
				%2 = sext <8 x i16> %op2 to <8 x i32>
				%mul = mul <8 x i32> %1, %2
				%shr = lshr <8 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
				%res = trunc <8 x i32> %shr to <8 x i16>
				ret <8 x i16> %res
				}

				define void @smulh_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: smulh_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: movprfx z4, z1
				; CHECK-NEXT: smulh z4.h, p0/m, z4.h, z3.h
				; CHECK-NEXT: ext z1.b, z1.b, z1.b, #8
				; CHECK-NEXT: ext z3.b, z3.b, z3.b, #8
				; CHECK-NEXT: smulh z1.h, p0/m, z1.h, z3.h
				; CHECK-NEXT: movprfx z3, z0
				; CHECK-NEXT: smulh z3.h, p0/m, z3.h, z2.h
				; CHECK-NEXT: ext z2.b, z2.b, z2.b, #8
				; CHECK-NEXT: ext z0.b, z0.b, z0.b, #8
				; CHECK-NEXT: smulh z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: splice z4.h, p0, z4.h, z1.h
				; CHECK-NEXT: splice z3.h, p0, z3.h, z0.h
				; CHECK-NEXT: stp q4, q3, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%1 = sext <16 x i16> %op1 to <16 x i32>
				%2 = sext <16 x i16> %op2 to <16 x i32>
				%mul = mul <16 x i32> %1, %2
				%shr = lshr <16 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
				%res = trunc <16 x i32> %shr to <16 x i16>
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @smulh_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: smulh_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: smulh z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%1 = sext <2 x i32> %op1 to <2 x i64>
				%2 = sext <2 x i32> %op2 to <2 x i64>
				%mul = mul <2 x i64> %1, %2
				%shr = lshr <2 x i64> %mul, <i64 32, i64 32>
				%res = trunc <2 x i64> %shr to <2 x i32>
				ret <2 x i32> %res
				}

				define <4 x i32> @smulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: smulh_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: smulh z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%1 = sext <4 x i32> %op1 to <4 x i64>
				%2 = sext <4 x i32> %op2 to <4 x i64>
				%mul = mul <4 x i64> %1, %2
				%shr = lshr <4 x i64> %mul, <i64 32, i64 32, i64 32, i64 32>
				%res = trunc <4 x i64> %shr to <4 x i32>
				ret <4 x i32> %res
				}

				define void @smulh_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: smulh_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: movprfx z4, z1
				; CHECK-NEXT: smulh z4.s, p0/m, z4.s, z3.s
				; CHECK-NEXT: ext z1.b, z1.b, z1.b, #8
				; CHECK-NEXT: ext z3.b, z3.b, z3.b, #8
				; CHECK-NEXT: smulh z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: movprfx z3, z0
				; CHECK-NEXT: smulh z3.s, p0/m, z3.s, z2.s
				; CHECK-NEXT: ext z2.b, z2.b, z2.b, #8
				; CHECK-NEXT: ext z0.b, z0.b, z0.b, #8
				; CHECK-NEXT: smulh z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: splice z4.s, p0, z4.s, z1.s
				; CHECK-NEXT: splice z3.s, p0, z3.s, z0.s
				; CHECK-NEXT: stp q4, q3, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%1 = sext <8 x i32> %op1 to <8 x i64>
				%2 = sext <8 x i32> %op2 to <8 x i64>
				%mul = mul <8 x i64> %1, %2
				%shr = lshr <8 x i64> %mul, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>
				%res = trunc <8 x i64> %shr to <8 x i32>
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @smulh_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: smulh_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: smulh z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%insert = insertelement <1 x i128> undef, i128 64, i128 0
				%splat = shufflevector <1 x i128> %insert, <1 x i128> undef, <1 x i32> zeroinitializer
				%1 = sext <1 x i64> %op1 to <1 x i128>
				%2 = sext <1 x i64> %op2 to <1 x i128>
				%mul = mul <1 x i128> %1, %2
				%shr = lshr <1 x i128> %mul, %splat
				%res = trunc <1 x i128> %shr to <1 x i64>
				ret <1 x i64> %res
				}

				define <2 x i64> @smulh_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: smulh_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: smulh z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%1 = sext <2 x i64> %op1 to <2 x i128>
				%2 = sext <2 x i64> %op2 to <2 x i128>
				%mul = mul <2 x i128> %1, %2
				%shr = lshr <2 x i128> %mul, <i128 64, i128 64>
				%res = trunc <2 x i128> %shr to <2 x i64>
				ret <2 x i64> %res
				}

				define void @smulh_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: smulh_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: fmov x8, d0
				; CHECK-NEXT: mov z4.d, z0.d[1]
				; CHECK-NEXT: fmov x10, d2
				; CHECK-NEXT: mov z0.d, z1.d[1]
				; CHECK-NEXT: fmov x9, d1
				; CHECK-NEXT: mov z1.d, z2.d[1]
				; CHECK-NEXT: mov z2.d, z3.d[1]
				; CHECK-NEXT: fmov x11, d3
				; CHECK-NEXT: fmov x12, d0
				; CHECK-NEXT: fmov x13, d2
				; CHECK-NEXT: fmov x14, d4
				; CHECK-NEXT: smulh x8, x8, x10
				; CHECK-NEXT: fmov x10, d1
				; CHECK-NEXT: smulh x9, x9, x11
				; CHECK-NEXT: smulh x12, x12, x13
				; CHECK-NEXT: smulh x10, x14, x10
				; CHECK-NEXT: fmov d2, x8
				; CHECK-NEXT: fmov d0, x9
				; CHECK-NEXT: fmov d1, x12
				; CHECK-NEXT: fmov d3, x10
				; CHECK-NEXT: splice z0.d, p0, z0.d, z1.d
				; CHECK-NEXT: splice z2.d, p0, z2.d, z3.d
				; CHECK-NEXT: stp q0, q2, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%1 = sext <4 x i64> %op1 to <4 x i128>
				%2 = sext <4 x i64> %op2 to <4 x i128>
				%mul = mul <4 x i128> %1, %2
				%shr = lshr <4 x i128> %mul, <i128 64, i128 64, i128 64, i128 64>
				%res = trunc <4 x i128> %shr to <4 x i64>
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				;
				; UMULH
				;

				define <4 x i8> @umulh_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; CHECK-LABEL: umulh_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI14_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI14_0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: adrp x8, .LCPI14_1
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI14_1
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x8]
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: and z1.d, z1.d, z2.d
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: lsr z0.h, p0/m, z0.h, z3.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <4 x i8> %op1 to <4 x i16>
				%2 = zext <4 x i8> %op2 to <4 x i16>
				%mul = mul <4 x i16> %1, %2
				%shr = lshr <4 x i16> %mul, <i16 4, i16 4, i16 4, i16 4>
				%res = trunc <4 x i16> %shr to <4 x i8>
				ret <4 x i8> %res
				}

				define <8 x i8> @umulh_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: umulh_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: umulh z0.b, p0/m, z0.b, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <8 x i8> %op1 to <8 x i16>
				%2 = zext <8 x i8> %op2 to <8 x i16>
				%mul = mul <8 x i16> %1, %2
				%shr = lshr <8 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
				%res = trunc <8 x i16> %shr to <8 x i8>
				ret <8 x i8> %res
				}

				define <16 x i8> @umulh_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: umulh_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: umulh z0.b, p0/m, z0.b, z1.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <16 x i8> %op1 to <16 x i16>
				%2 = zext <16 x i8> %op2 to <16 x i16>
				%mul = mul <16 x i16> %1, %2
				%shr = lshr <16 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
				%res = trunc <16 x i16> %shr to <16 x i8>
				ret <16 x i8> %res
				}

				define void @umulh_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: umulh_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ptrue p1.h, vl8
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: adrp x8, .LCPI17_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI17_0
				; CHECK-NEXT: uunpklo z5.h, z2.b
				; CHECK-NEXT: ext z2.b, z2.b, z2.b, #8
				; CHECK-NEXT: uunpklo z7.h, z3.b
				; CHECK-NEXT: ld1h { z16.h }, p1/z, [x8]
				; CHECK-NEXT: ext z3.b, z3.b, z3.b, #8
				; CHECK-NEXT: uunpklo z2.h, z2.b
				; CHECK-NEXT: uunpklo z3.h, z3.b
				; CHECK-NEXT: mul z5.h, p1/m, z5.h, z7.h
				; CHECK-NEXT: mul z2.h, p1/m, z2.h, z3.h
				; CHECK-NEXT: movprfx z3, z5
				; CHECK-NEXT: lsr z3.h, p1/m, z3.h, z16.h
				; CHECK-NEXT: fmov w8, s3
				; CHECK-NEXT: uunpklo z4.h, z0.b
				; CHECK-NEXT: uunpklo z6.h, z1.b
				; CHECK-NEXT: ext z0.b, z0.b, z0.b, #8
				; CHECK-NEXT: ext z1.b, z1.b, z1.b, #8
				; CHECK-NEXT: lsr z2.h, p1/m, z2.h, z16.h
				; CHECK-NEXT: mov z5.h, z3.h[7]
				; CHECK-NEXT: uunpklo z0.h, z0.b
				; CHECK-NEXT: uunpklo z1.h, z1.b
				; CHECK-NEXT: mul z4.h, p1/m, z4.h, z6.h
				; CHECK-NEXT: mov z6.h, z3.h[6]
				; CHECK-NEXT: mov z7.h, z3.h[5]
				; CHECK-NEXT: mov z17.h, z3.h[4]
				; CHECK-NEXT: mov z18.h, z3.h[3]
				; CHECK-NEXT: mov z19.h, z3.h[2]
				; CHECK-NEXT: mov z20.h, z3.h[1]
				; CHECK-NEXT: mov z3.h, z2.h[7]
				; CHECK-NEXT: mov z21.h, z2.h[6]
				; CHECK-NEXT: mov z22.h, z2.h[5]
				; CHECK-NEXT: mov z23.h, z2.h[4]
				; CHECK-NEXT: mov z24.h, z2.h[3]
				; CHECK-NEXT: mov z25.h, z2.h[2]
				; CHECK-NEXT: mov z26.h, z2.h[1]
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: mul z0.h, p1/m, z0.h, z1.h
				; CHECK-NEXT: fmov w10, s5
				; CHECK-NEXT: strb w8, [sp, #-32]!
				; CHECK-NEXT: .cfi_def_cfa_offset 32
				; CHECK-NEXT: fmov w8, s6
				; CHECK-NEXT: strb w9, [sp, #8]
				; CHECK-NEXT: fmov w9, s7
				; CHECK-NEXT: strb w10, [sp, #7]
				; CHECK-NEXT: fmov w10, s17
				; CHECK-NEXT: lsr z0.h, p1/m, z0.h, z16.h
				; CHECK-NEXT: strb w8, [sp, #6]
				; CHECK-NEXT: fmov w8, s18
				; CHECK-NEXT: strb w9, [sp, #5]
				; CHECK-NEXT: fmov w9, s19
				; CHECK-NEXT: strb w10, [sp, #4]
				; CHECK-NEXT: fmov w10, s20
				; CHECK-NEXT: strb w8, [sp, #3]
				; CHECK-NEXT: fmov w8, s3
				; CHECK-NEXT: strb w9, [sp, #2]
				; CHECK-NEXT: fmov w9, s21
				; CHECK-NEXT: strb w10, [sp, #1]
				; CHECK-NEXT: fmov w10, s22
				; CHECK-NEXT: strb w8, [sp, #15]
				; CHECK-NEXT: fmov w8, s23
				; CHECK-NEXT: strb w9, [sp, #14]
				; CHECK-NEXT: fmov w9, s24
				; CHECK-NEXT: strb w10, [sp, #13]
				; CHECK-NEXT: fmov w10, s25
				; CHECK-NEXT: strb w8, [sp, #12]
				; CHECK-NEXT: fmov w8, s26
				; CHECK-NEXT: movprfx z1, z4
				; CHECK-NEXT: lsr z1.h, p1/m, z1.h, z16.h
				; CHECK-NEXT: strb w9, [sp, #11]
				; CHECK-NEXT: mov z2.h, z1.h[7]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: strb w10, [sp, #10]
				; CHECK-NEXT: fmov w10, s0
				; CHECK-NEXT: strb w8, [sp, #9]
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: mov z3.h, z1.h[6]
				; CHECK-NEXT: mov z4.h, z1.h[5]
				; CHECK-NEXT: mov z5.h, z1.h[4]
				; CHECK-NEXT: strb w9, [sp, #16]
				; CHECK-NEXT: fmov w9, s3
				; CHECK-NEXT: strb w10, [sp, #24]
				; CHECK-NEXT: fmov w10, s4
				; CHECK-NEXT: strb w8, [sp, #23]
				; CHECK-NEXT: fmov w8, s5
				; CHECK-NEXT: mov z6.h, z1.h[3]
				; CHECK-NEXT: mov z7.h, z1.h[2]
				; CHECK-NEXT: mov z16.h, z1.h[1]
				; CHECK-NEXT: strb w9, [sp, #22]
				; CHECK-NEXT: fmov w9, s6
				; CHECK-NEXT: strb w10, [sp, #21]
				; CHECK-NEXT: fmov w10, s7
				; CHECK-NEXT: strb w8, [sp, #20]
				; CHECK-NEXT: fmov w8, s16
				; CHECK-NEXT: mov z1.h, z0.h[7]
				; CHECK-NEXT: mov z17.h, z0.h[6]
				; CHECK-NEXT: mov z18.h, z0.h[5]
				; CHECK-NEXT: strb w9, [sp, #19]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: strb w10, [sp, #18]
				; CHECK-NEXT: fmov w10, s17
				; CHECK-NEXT: strb w8, [sp, #17]
				; CHECK-NEXT: fmov w8, s18
				; CHECK-NEXT: mov z19.h, z0.h[4]
				; CHECK-NEXT: mov z20.h, z0.h[3]
				; CHECK-NEXT: mov z21.h, z0.h[2]
				; CHECK-NEXT: strb w9, [sp, #31]
				; CHECK-NEXT: fmov w9, s19
				; CHECK-NEXT: strb w10, [sp, #30]
				; CHECK-NEXT: fmov w10, s20
				; CHECK-NEXT: strb w8, [sp, #29]
				; CHECK-NEXT: fmov w8, s21
				; CHECK-NEXT: mov z22.h, z0.h[1]
				; CHECK-NEXT: strb w9, [sp, #28]
				; CHECK-NEXT: fmov w9, s22
				; CHECK-NEXT: strb w10, [sp, #27]
				; CHECK-NEXT: mov x10, sp
				; CHECK-NEXT: strb w8, [sp, #26]
				; CHECK-NEXT: add x8, sp, #16
				; CHECK-NEXT: strb w9, [sp, #25]
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x10]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x8]
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: add sp, sp, #32
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%1 = zext <32 x i8> %op1 to <32 x i16>
				%2 = zext <32 x i8> %op2 to <32 x i16>
				%mul = mul <32 x i16> %1, %2
				%shr = lshr <32 x i16> %mul, <i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8, i16 8>
				%res = trunc <32 x i16> %shr to <32 x i8>
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <2 x i16> @umulh_v2i16(<2 x i16> %op1, <2 x i16> %op2) #0 {
				; CHECK-LABEL: umulh_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI18_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI18_0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x8]
				; CHECK-NEXT: adrp x8, .LCPI18_1
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI18_1
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x8]
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: and z1.d, z1.d, z2.d
				; CHECK-NEXT: mul z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: lsr z0.s, p0/m, z0.s, z3.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <2 x i16> %op1 to <2 x i32>
				%2 = zext <2 x i16> %op2 to <2 x i32>
				%mul = mul <2 x i32> %1, %2
				%shr = lshr <2 x i32> %mul, <i32 16, i32 16>
				%res = trunc <2 x i32> %shr to <2 x i16>
				ret <2 x i16> %res
				}

				define <4 x i16> @umulh_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: umulh_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: umulh z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <4 x i16> %op1 to <4 x i32>
				%2 = zext <4 x i16> %op2 to <4 x i32>
				%mul = mul <4 x i32> %1, %2
				%shr = lshr <4 x i32> %mul, <i32 16, i32 16, i32 16, i32 16>
				%res = trunc <4 x i32> %shr to <4 x i16>
				ret <4 x i16> %res
				}

				define <8 x i16> @umulh_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: umulh_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: umulh z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <8 x i16> %op1 to <8 x i32>
				%2 = zext <8 x i16> %op2 to <8 x i32>
				%mul = mul <8 x i32> %1, %2
				%shr = lshr <8 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
				%res = trunc <8 x i32> %shr to <8 x i16>
				ret <8 x i16> %res
				}

				define void @umulh_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: umulh_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: movprfx z4, z1
				; CHECK-NEXT: umulh z4.h, p0/m, z4.h, z3.h
				; CHECK-NEXT: ext z1.b, z1.b, z1.b, #8
				; CHECK-NEXT: ext z3.b, z3.b, z3.b, #8
				; CHECK-NEXT: umulh z1.h, p0/m, z1.h, z3.h
				; CHECK-NEXT: movprfx z3, z0
				; CHECK-NEXT: umulh z3.h, p0/m, z3.h, z2.h
				; CHECK-NEXT: ext z2.b, z2.b, z2.b, #8
				; CHECK-NEXT: ext z0.b, z0.b, z0.b, #8
				; CHECK-NEXT: umulh z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: splice z4.h, p0, z4.h, z1.h
				; CHECK-NEXT: splice z3.h, p0, z3.h, z0.h
				; CHECK-NEXT: stp q4, q3, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%1 = zext <16 x i16> %op1 to <16 x i32>
				%2 = zext <16 x i16> %op2 to <16 x i32>
				%mul = mul <16 x i32> %1, %2
				%shr = lshr <16 x i32> %mul, <i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16, i32 16>
				%res = trunc <16 x i32> %shr to <16 x i16>
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @umulh_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: umulh_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: umulh z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <2 x i32> %op1 to <2 x i64>
				%2 = zext <2 x i32> %op2 to <2 x i64>
				%mul = mul <2 x i64> %1, %2
				%shr = lshr <2 x i64> %mul, <i64 32, i64 32>
				%res = trunc <2 x i64> %shr to <2 x i32>
				ret <2 x i32> %res
				}

				define <4 x i32> @umulh_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: umulh_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: umulh z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <4 x i32> %op1 to <4 x i64>
				%2 = zext <4 x i32> %op2 to <4 x i64>
				%mul = mul <4 x i64> %1, %2
				%shr = lshr <4 x i64> %mul, <i64 32, i64 32, i64 32, i64 32>
				%res = trunc <4 x i64> %shr to <4 x i32>
				ret <4 x i32> %res
				}

				define void @umulh_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: umulh_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: movprfx z4, z1
				; CHECK-NEXT: umulh z4.s, p0/m, z4.s, z3.s
				; CHECK-NEXT: ext z1.b, z1.b, z1.b, #8
				; CHECK-NEXT: ext z3.b, z3.b, z3.b, #8
				; CHECK-NEXT: umulh z1.s, p0/m, z1.s, z3.s
				; CHECK-NEXT: movprfx z3, z0
				; CHECK-NEXT: umulh z3.s, p0/m, z3.s, z2.s
				; CHECK-NEXT: ext z2.b, z2.b, z2.b, #8
				; CHECK-NEXT: ext z0.b, z0.b, z0.b, #8
				; CHECK-NEXT: umulh z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: splice z4.s, p0, z4.s, z1.s
				; CHECK-NEXT: splice z3.s, p0, z3.s, z0.s
				; CHECK-NEXT: stp q4, q3, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%insert = insertelement <8 x i64> undef, i64 32, i64 0
				%splat = shufflevector <8 x i64> %insert, <8 x i64> undef, <8 x i32> zeroinitializer
				%1 = zext <8 x i32> %op1 to <8 x i64>
				%2 = zext <8 x i32> %op2 to <8 x i64>
				%mul = mul <8 x i64> %1, %2
				%shr = lshr <8 x i64> %mul, <i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32, i64 32>
				%res = trunc <8 x i64> %shr to <8 x i32>
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @umulh_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: umulh_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: umulh z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <1 x i64> %op1 to <1 x i128>
				%2 = zext <1 x i64> %op2 to <1 x i128>
				%mul = mul <1 x i128> %1, %2
				%shr = lshr <1 x i128> %mul, <i128 64>
				%res = trunc <1 x i128> %shr to <1 x i64>
				ret <1 x i64> %res
				}

				define <2 x i64> @umulh_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: umulh_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: umulh z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%1 = zext <2 x i64> %op1 to <2 x i128>
				%2 = zext <2 x i64> %op2 to <2 x i128>
				%mul = mul <2 x i128> %1, %2
				%shr = lshr <2 x i128> %mul, <i128 64, i128 64>
				%res = trunc <2 x i128> %shr to <2 x i64>
				ret <2 x i64> %res
				}

				define void @umulh_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: umulh_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: fmov x8, d0
				; CHECK-NEXT: mov z4.d, z0.d[1]
				; CHECK-NEXT: fmov x10, d2
				; CHECK-NEXT: mov z0.d, z1.d[1]
				; CHECK-NEXT: fmov x9, d1
				; CHECK-NEXT: mov z1.d, z2.d[1]
				; CHECK-NEXT: mov z2.d, z3.d[1]
				; CHECK-NEXT: fmov x11, d3
				; CHECK-NEXT: fmov x12, d0
				; CHECK-NEXT: fmov x13, d2
				; CHECK-NEXT: fmov x14, d4
				; CHECK-NEXT: umulh x8, x8, x10
				; CHECK-NEXT: fmov x10, d1
				; CHECK-NEXT: umulh x9, x9, x11
				; CHECK-NEXT: umulh x12, x12, x13
				; CHECK-NEXT: umulh x10, x14, x10
				; CHECK-NEXT: fmov d2, x8
				; CHECK-NEXT: fmov d0, x9
				; CHECK-NEXT: fmov d1, x12
				; CHECK-NEXT: fmov d3, x10
				; CHECK-NEXT: splice z0.d, p0, z0.d, z1.d
				; CHECK-NEXT: splice z2.d, p0, z2.d, z3.d
				; CHECK-NEXT: stp q0, q2, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%1 = zext <4 x i64> %op1 to <4 x i128>
				%2 = zext <4 x i64> %op2 to <4 x i128>
				%mul = mul <4 x i128> %1, %2
				%shr = lshr <4 x i128> %mul, <i128 64, i128 64, i128 64, i128 64>
				%res = trunc <4 x i128> %shr to <4 x i64>
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }
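For readers checking the expected output above: the IR in each `umulh_*` test spells out a high-half multiply as zero-extend, multiply, shift right by the element width, then truncate. A minimal scalar sketch of that pattern for one 64-bit lane follows. The helper name `umulh64_ref` is hypothetical and not part of the patch, and `unsigned __int128` is a GCC/Clang extension rather than standard C++.

```cpp
#include <cstdint>

// Hypothetical scalar model of one lane of umulh_v2i64:
// zext to 128 bits, multiply, lshr by 64, trunc back to 64 bits.
// Assumes a compiler that provides the `unsigned __int128` extension.
inline std::uint64_t umulh64_ref(std::uint64_t a, std::uint64_t b) {
  unsigned __int128 wide = static_cast<unsigned __int128>(a) * b;
  return static_cast<std::uint64_t>(wide >> 64);
}
```

This is only a reference for reasoning about lane values; it says nothing about which instructions are legal in streaming mode.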
				david-arm (inline comment, Not Done): Again, these `bic` instructions are illegal in streaming mode.

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-rem.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				;
				; SREM
				;

				define <4 x i8> @srem_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; CHECK-LABEL: srem_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: adrp x8, .LCPI0_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI0_0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p1.s, vl4
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: lsl z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: lsl z1.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: asr z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: asr z1.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: sunpklo z2.s, z1.h
				; CHECK-NEXT: sunpklo z3.s, z0.h
				; CHECK-NEXT: sdivr z2.s, p1/m, z2.s, z3.s
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: mov z3.s, z2.s[3]
				; CHECK-NEXT: mov z4.s, z2.s[2]
				; CHECK-NEXT: mov z2.s, z2.s[1]
				; CHECK-NEXT: fmov w9, s3
				; CHECK-NEXT: fmov w10, s4
				; CHECK-NEXT: strh w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: strh w9, [sp, #14]
				; CHECK-NEXT: strh w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strh w10, [sp, #12]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: mls z0.h, p0/m, z2.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = srem <4 x i8> %op1, %op2
				ret <4 x i8> %res
				}

				define <8 x i8> @srem_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: srem_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: sunpklo z2.h, z1.b
				; CHECK-NEXT: sunpklo z3.h, z0.b
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpkhi z4.s, z2.h
				; CHECK-NEXT: sunpkhi z5.s, z3.h
				; CHECK-NEXT: sunpklo z2.s, z2.h
				; CHECK-NEXT: sunpklo z3.s, z3.h
				; CHECK-NEXT: sdivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: sdivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: uzp1 z2.h, z2.h, z4.h
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: mov z3.h, z2.h[7]
				; CHECK-NEXT: mov z5.h, z2.h[5]
				; CHECK-NEXT: fmov w9, s3
				; CHECK-NEXT: mov z4.h, z2.h[6]
				; CHECK-NEXT: mov z6.h, z2.h[4]
				; CHECK-NEXT: strb w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s5
				; CHECK-NEXT: mov z16.h, z2.h[2]
				; CHECK-NEXT: fmov w10, s4
				; CHECK-NEXT: strb w9, [sp, #15]
				; CHECK-NEXT: fmov w9, s6
				; CHECK-NEXT: strb w8, [sp, #13]
				; CHECK-NEXT: fmov w8, s16
				; CHECK-NEXT: mov z7.h, z2.h[3]
				; CHECK-NEXT: mov z2.h, z2.h[1]
				; CHECK-NEXT: strb w10, [sp, #14]
				; CHECK-NEXT: fmov w10, s7
				; CHECK-NEXT: strb w9, [sp, #12]
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: strb w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strb w10, [sp, #11]
				; CHECK-NEXT: strb w9, [sp, #9]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x8]
				; CHECK-NEXT: mls z0.b, p0/m, z2.b, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = srem <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @srem_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: srem_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: sunpkhi z2.h, z1.b
				; CHECK-NEXT: sunpkhi z3.h, z0.b
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpkhi z5.s, z2.h
				; CHECK-NEXT: sunpkhi z6.s, z3.h
				; CHECK-NEXT: sunpklo z2.s, z2.h
				; CHECK-NEXT: sunpklo z3.s, z3.h
				; CHECK-NEXT: sunpklo z4.h, z1.b
				; CHECK-NEXT: sdivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: sunpklo z3.h, z0.b
				; CHECK-NEXT: sdivr z5.s, p0/m, z5.s, z6.s
				; CHECK-NEXT: sunpkhi z6.s, z4.h
				; CHECK-NEXT: sunpkhi z7.s, z3.h
				; CHECK-NEXT: sunpklo z4.s, z4.h
				; CHECK-NEXT: sunpklo z3.s, z3.h
				; CHECK-NEXT: sdivr z6.s, p0/m, z6.s, z7.s
				; CHECK-NEXT: sdiv z3.s, p0/m, z3.s, z4.s
				; CHECK-NEXT: uzp1 z2.h, z2.h, z5.h
				; CHECK-NEXT: uzp1 z3.h, z3.h, z6.h
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: uzp1 z2.b, z3.b, z2.b
				; CHECK-NEXT: mls z0.b, p0/m, z2.b, z1.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = srem <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @srem_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: srem_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ptrue p1.s, vl4
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: sunpkhi z5.h, z0.b
				; CHECK-NEXT: sunpklo z7.h, z0.b
				; CHECK-NEXT: sunpkhi z4.h, z2.b
				; CHECK-NEXT: sunpklo z6.h, z2.b
				; CHECK-NEXT: sunpkhi z16.s, z4.h
				; CHECK-NEXT: sunpkhi z17.s, z5.h
				; CHECK-NEXT: sunpklo z4.s, z4.h
				; CHECK-NEXT: sunpklo z5.s, z5.h
				; CHECK-NEXT: sunpkhi z18.s, z6.h
				; CHECK-NEXT: sdivr z16.s, p1/m, z16.s, z17.s
				; CHECK-NEXT: sdivr z4.s, p1/m, z4.s, z5.s
				; CHECK-NEXT: sunpkhi z5.s, z7.h
				; CHECK-NEXT: sunpklo z6.s, z6.h
				; CHECK-NEXT: sunpklo z7.s, z7.h
				; CHECK-NEXT: uzp1 z4.h, z4.h, z16.h
				; CHECK-NEXT: sdivr z6.s, p1/m, z6.s, z7.s
				; CHECK-NEXT: sunpkhi z7.h, z3.b
				; CHECK-NEXT: sunpkhi z16.h, z1.b
				; CHECK-NEXT: sdiv z5.s, p1/m, z5.s, z18.s
				; CHECK-NEXT: sunpkhi z17.s, z7.h
				; CHECK-NEXT: sunpkhi z18.s, z16.h
				; CHECK-NEXT: sunpklo z7.s, z7.h
				; CHECK-NEXT: sunpklo z16.s, z16.h
				; CHECK-NEXT: sdivr z17.s, p1/m, z17.s, z18.s
				; CHECK-NEXT: sdivr z7.s, p1/m, z7.s, z16.s
				; CHECK-NEXT: sunpklo z16.h, z3.b
				; CHECK-NEXT: sunpklo z18.h, z1.b
				; CHECK-NEXT: sunpkhi z19.s, z16.h
				; CHECK-NEXT: sunpkhi z20.s, z18.h
				; CHECK-NEXT: sunpklo z16.s, z16.h
				; CHECK-NEXT: sunpklo z18.s, z18.h
				; CHECK-NEXT: sdivr z19.s, p1/m, z19.s, z20.s
				; CHECK-NEXT: sdivr z16.s, p1/m, z16.s, z18.s
				; CHECK-NEXT: uzp1 z7.h, z7.h, z17.h
				; CHECK-NEXT: uzp1 z16.h, z16.h, z19.h
				; CHECK-NEXT: uzp1 z5.h, z6.h, z5.h
				; CHECK-NEXT: uzp1 z6.b, z16.b, z7.b
				; CHECK-NEXT: uzp1 z4.b, z5.b, z4.b
				; CHECK-NEXT: mls z1.b, p0/m, z6.b, z3.b
				; CHECK-NEXT: mls z0.b, p0/m, z4.b, z2.b
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = srem <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <4 x i16> @srem_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: srem_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpklo z2.s, z1.h
				; CHECK-NEXT: sunpklo z3.s, z0.h
				; CHECK-NEXT: sdivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: mov z3.s, z2.s[3]
				; CHECK-NEXT: mov z4.s, z2.s[2]
				; CHECK-NEXT: mov z2.s, z2.s[1]
				; CHECK-NEXT: fmov w9, s3
				; CHECK-NEXT: fmov w10, s4
				; CHECK-NEXT: strh w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: strh w9, [sp, #14]
				; CHECK-NEXT: strh w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strh w10, [sp, #12]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: mls z0.h, p0/m, z2.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = srem <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @srem_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: srem_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: sunpkhi z2.s, z1.h
				; CHECK-NEXT: sunpkhi z3.s, z0.h
				; CHECK-NEXT: sunpklo z4.s, z1.h
				; CHECK-NEXT: sdivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: sunpklo z5.s, z0.h
				; CHECK-NEXT: movprfx z3, z5
				; CHECK-NEXT: sdiv z3.s, p0/m, z3.s, z4.s
				; CHECK-NEXT: uzp1 z2.h, z3.h, z2.h
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: mls z0.h, p0/m, z2.h, z1.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = srem <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @srem_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: srem_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ptrue p1.s, vl4
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: sunpkhi z5.s, z0.h
				; CHECK-NEXT: sunpkhi z16.s, z1.h
				; CHECK-NEXT: sunpkhi z4.s, z2.h
				; CHECK-NEXT: sunpkhi z7.s, z3.h
				; CHECK-NEXT: sdivr z4.s, p1/m, z4.s, z5.s
				; CHECK-NEXT: sunpklo z5.s, z3.h
				; CHECK-NEXT: sdivr z7.s, p1/m, z7.s, z16.s
				; CHECK-NEXT: sunpklo z16.s, z1.h
				; CHECK-NEXT: sunpklo z6.s, z2.h
				; CHECK-NEXT: sdivr z5.s, p1/m, z5.s, z16.s
				; CHECK-NEXT: sunpklo z16.s, z0.h
				; CHECK-NEXT: uzp1 z5.h, z5.h, z7.h
				; CHECK-NEXT: sdivr z6.s, p1/m, z6.s, z16.s
				; CHECK-NEXT: mls z1.h, p0/m, z5.h, z3.h
				; CHECK-NEXT: uzp1 z4.h, z6.h, z4.h
				; CHECK-NEXT: mls z0.h, p0/m, z4.h, z2.h
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = srem <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @srem_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: srem_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: movprfx z2, z0
				; CHECK-NEXT: sdiv z2.s, p0/m, z2.s, z1.s
				; CHECK-NEXT: mls z0.s, p0/m, z2.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = srem <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @srem_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: srem_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: movprfx z2, z0
				; CHECK-NEXT: sdiv z2.s, p0/m, z2.s, z1.s
				; CHECK-NEXT: mls z0.s, p0/m, z2.s, z1.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = srem <4 x i32> %op1, %op2
				ret <4 x i32> %res
				}

				define void @srem_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: srem_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: movprfx z4, z1
				; CHECK-NEXT: sdiv z4.s, p0/m, z4.s, z3.s
				; CHECK-NEXT: movprfx z5, z0
				; CHECK-NEXT: sdiv z5.s, p0/m, z5.s, z2.s
				; CHECK-NEXT: mls z1.s, p0/m, z4.s, z3.s
				; CHECK-NEXT: mls z0.s, p0/m, z5.s, z2.s
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = srem <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @srem_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: srem_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: movprfx z2, z0
				; CHECK-NEXT: sdiv z2.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: mls z0.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = srem <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @srem_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: srem_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: movprfx z2, z0
				; CHECK-NEXT: sdiv z2.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: mls z0.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = srem <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @srem_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: srem_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: movprfx z4, z1
				; CHECK-NEXT: sdiv z4.d, p0/m, z4.d, z3.d
				; CHECK-NEXT: movprfx z5, z0
				; CHECK-NEXT: sdiv z5.d, p0/m, z5.d, z2.d
				; CHECK-NEXT: mls z1.d, p0/m, z4.d, z3.d
				; CHECK-NEXT: mls z0.d, p0/m, z5.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = srem <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}
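
The `srem_*` checks above all end in the same pair of operations: a divide (`sdiv` or `sdivr`) that produces the quotient, then `mls`, which subtracts quotient times divisor from the dividend. A scalar sketch of that identity for one 32-bit lane is shown here. The name `srem_via_div_mls` is hypothetical, and the sketch assumes a nonzero divisor and excludes the `INT32_MIN / -1` case.

```cpp
#include <cstdint>

// One lane of the sdiv + mls sequence: r = a - (a / b) * b.
// Assumed precondition: b != 0 and not (a == INT32_MIN && b == -1),
// since both are undefined behaviour for C++ integer division.
inline std::int32_t srem_via_div_mls(std::int32_t a, std::int32_t b) {
  std::int32_t q = a / b;  // sdiv: truncating signed quotient
  return a - q * b;        // mls: dividend minus quotient * divisor
}
```

Because the quotient truncates toward zero, the result takes the sign of the dividend, which matches the semantics of the IR `srem` instruction.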

				;
				; UREM
				;

				define <4 x i8> @urem_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; CHECK-LABEL: urem_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: adrp x8, .LCPI13_0
				; CHECK-NEXT: add x8, x8, :lo12:.LCPI13_0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p1.s, vl4
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: and z0.d, z0.d, z2.d
				; CHECK-NEXT: and z1.d, z1.d, z2.d
				; CHECK-NEXT: uunpklo z2.s, z1.h
				; CHECK-NEXT: uunpklo z3.s, z0.h
				; CHECK-NEXT: udivr z2.s, p1/m, z2.s, z3.s
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: mov z3.s, z2.s[3]
				; CHECK-NEXT: mov z4.s, z2.s[2]
				; CHECK-NEXT: mov z2.s, z2.s[1]
				; CHECK-NEXT: fmov w9, s3
				; CHECK-NEXT: fmov w10, s4
				; CHECK-NEXT: strh w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: strh w9, [sp, #14]
				; CHECK-NEXT: strh w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strh w10, [sp, #12]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: mls z0.h, p0/m, z2.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = urem <4 x i8> %op1, %op2
				ret <4 x i8> %res
				}

				define <8 x i8> @urem_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: urem_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: uunpklo z2.h, z1.b
				; CHECK-NEXT: uunpklo z3.h, z0.b
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpkhi z4.s, z2.h
				; CHECK-NEXT: uunpkhi z5.s, z3.h
				; CHECK-NEXT: uunpklo z2.s, z2.h
				; CHECK-NEXT: uunpklo z3.s, z3.h
				; CHECK-NEXT: udivr z4.s, p0/m, z4.s, z5.s
				; CHECK-NEXT: udivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: uzp1 z2.h, z2.h, z4.h
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: mov z3.h, z2.h[7]
				; CHECK-NEXT: mov z5.h, z2.h[5]
				; CHECK-NEXT: fmov w9, s3
				; CHECK-NEXT: mov z4.h, z2.h[6]
				; CHECK-NEXT: mov z6.h, z2.h[4]
				; CHECK-NEXT: strb w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s5
				; CHECK-NEXT: mov z16.h, z2.h[2]
				; CHECK-NEXT: fmov w10, s4
				; CHECK-NEXT: strb w9, [sp, #15]
				; CHECK-NEXT: fmov w9, s6
				; CHECK-NEXT: strb w8, [sp, #13]
				; CHECK-NEXT: fmov w8, s16
				; CHECK-NEXT: mov z7.h, z2.h[3]
				; CHECK-NEXT: mov z2.h, z2.h[1]
				; CHECK-NEXT: strb w10, [sp, #14]
				; CHECK-NEXT: fmov w10, s7
				; CHECK-NEXT: strb w9, [sp, #12]
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: strb w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				david-arm (inline comment, Done): I think that you can remove the tests greater than 512 bits, i.e. <128 x i8>. If the tests already work for <64 x i8> they are likely to work for anything larger too.
				; CHECK-NEXT: strb w10, [sp, #11]
				; CHECK-NEXT: strb w9, [sp, #9]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x8]
				; CHECK-NEXT: mls z0.b, p0/m, z2.b, z1.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = urem <8 x i8> %op1, %op2
				ret <8 x i8> %res
				}

				define <16 x i8> @urem_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: urem_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: uunpkhi z2.h, z1.b
				; CHECK-NEXT: uunpkhi z3.h, z0.b
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpkhi z5.s, z2.h
				; CHECK-NEXT: uunpkhi z6.s, z3.h
				; CHECK-NEXT: uunpklo z2.s, z2.h
				; CHECK-NEXT: uunpklo z3.s, z3.h
				; CHECK-NEXT: uunpklo z4.h, z1.b
				; CHECK-NEXT: udivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: uunpklo z3.h, z0.b
				; CHECK-NEXT: udivr z5.s, p0/m, z5.s, z6.s
				; CHECK-NEXT: uunpkhi z6.s, z4.h
				; CHECK-NEXT: uunpkhi z7.s, z3.h
				; CHECK-NEXT: uunpklo z4.s, z4.h
				; CHECK-NEXT: uunpklo z3.s, z3.h
				; CHECK-NEXT: udivr z6.s, p0/m, z6.s, z7.s
				; CHECK-NEXT: udiv z3.s, p0/m, z3.s, z4.s
				; CHECK-NEXT: uzp1 z2.h, z2.h, z5.h
				; CHECK-NEXT: uzp1 z3.h, z3.h, z6.h
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: uzp1 z2.b, z3.b, z2.b
				; CHECK-NEXT: mls z0.b, p0/m, z2.b, z1.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = urem <16 x i8> %op1, %op2
				ret <16 x i8> %res
				}

				define void @urem_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: urem_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #16
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: ptrue p1.s, vl4
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0, x8]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z2.b }, p0/z, [x1, x8]
				; CHECK-NEXT: ld1b { z3.b }, p0/z, [x1]
				; CHECK-NEXT: uunpkhi z5.h, z0.b
				; CHECK-NEXT: uunpklo z7.h, z0.b
				; CHECK-NEXT: uunpkhi z4.h, z2.b
				; CHECK-NEXT: uunpklo z6.h, z2.b
				; CHECK-NEXT: uunpkhi z16.s, z4.h
				; CHECK-NEXT: uunpkhi z17.s, z5.h
				; CHECK-NEXT: uunpklo z4.s, z4.h
				; CHECK-NEXT: uunpklo z5.s, z5.h
				; CHECK-NEXT: uunpkhi z18.s, z6.h
				; CHECK-NEXT: udivr z16.s, p1/m, z16.s, z17.s
				; CHECK-NEXT: udivr z4.s, p1/m, z4.s, z5.s
				; CHECK-NEXT: uunpkhi z5.s, z7.h
				; CHECK-NEXT: uunpklo z6.s, z6.h
				; CHECK-NEXT: uunpklo z7.s, z7.h
				; CHECK-NEXT: uzp1 z4.h, z4.h, z16.h
				; CHECK-NEXT: udivr z6.s, p1/m, z6.s, z7.s
				; CHECK-NEXT: uunpkhi z7.h, z3.b
				; CHECK-NEXT: uunpkhi z16.h, z1.b
				; CHECK-NEXT: udiv z5.s, p1/m, z5.s, z18.s
				; CHECK-NEXT: uunpkhi z17.s, z7.h
				; CHECK-NEXT: uunpkhi z18.s, z16.h
				; CHECK-NEXT: uunpklo z7.s, z7.h
				; CHECK-NEXT: uunpklo z16.s, z16.h
				; CHECK-NEXT: udivr z17.s, p1/m, z17.s, z18.s
				; CHECK-NEXT: udivr z7.s, p1/m, z7.s, z16.s
				; CHECK-NEXT: uunpklo z16.h, z3.b
				; CHECK-NEXT: uunpklo z18.h, z1.b
				; CHECK-NEXT: uunpkhi z19.s, z16.h
				; CHECK-NEXT: uunpkhi z20.s, z18.h
				; CHECK-NEXT: uunpklo z16.s, z16.h
				; CHECK-NEXT: uunpklo z18.s, z18.h
				; CHECK-NEXT: udivr z19.s, p1/m, z19.s, z20.s
				; CHECK-NEXT: udivr z16.s, p1/m, z16.s, z18.s
				; CHECK-NEXT: uzp1 z7.h, z7.h, z17.h
				; CHECK-NEXT: uzp1 z16.h, z16.h, z19.h
				; CHECK-NEXT: uzp1 z5.h, z6.h, z5.h
				; CHECK-NEXT: uzp1 z6.b, z16.b, z7.b
				; CHECK-NEXT: uzp1 z4.b, z5.b, z4.b
				; CHECK-NEXT: mls z1.b, p0/m, z6.b, z3.b
				; CHECK-NEXT: mls z0.b, p0/m, z4.b, z2.b
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%res = urem <32 x i8> %op1, %op2
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <4 x i16> @urem_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: urem_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpklo z2.s, z1.h
				; CHECK-NEXT: uunpklo z3.s, z0.h
				; CHECK-NEXT: udivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: mov z3.s, z2.s[3]
				; CHECK-NEXT: mov z4.s, z2.s[2]
				; CHECK-NEXT: mov z2.s, z2.s[1]
				; CHECK-NEXT: fmov w9, s3
				; CHECK-NEXT: fmov w10, s4
				; CHECK-NEXT: strh w8, [sp, #8]
				; CHECK-NEXT: fmov w8, s2
				; CHECK-NEXT: strh w9, [sp, #14]
				; CHECK-NEXT: strh w8, [sp, #10]
				; CHECK-NEXT: add x8, sp, #8
				; CHECK-NEXT: strh w10, [sp, #12]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x8]
				; CHECK-NEXT: mls z0.h, p0/m, z2.h, z1.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%res = urem <4 x i16> %op1, %op2
				ret <4 x i16> %res
				}

				define <8 x i16> @urem_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: urem_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: uunpkhi z2.s, z1.h
				; CHECK-NEXT: uunpkhi z3.s, z0.h
				; CHECK-NEXT: uunpklo z4.s, z1.h
				; CHECK-NEXT: udivr z2.s, p0/m, z2.s, z3.s
				; CHECK-NEXT: uunpklo z5.s, z0.h
				; CHECK-NEXT: movprfx z3, z5
				; CHECK-NEXT: udiv z3.s, p0/m, z3.s, z4.s
				; CHECK-NEXT: uzp1 z2.h, z3.h, z2.h
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: mls z0.h, p0/m, z2.h, z1.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = urem <8 x i16> %op1, %op2
				ret <8 x i16> %res
				}

				define void @urem_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: urem_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: ptrue p1.s, vl4
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: uunpkhi z5.s, z0.h
				; CHECK-NEXT: uunpkhi z16.s, z1.h
				; CHECK-NEXT: uunpkhi z4.s, z2.h
				; CHECK-NEXT: uunpkhi z7.s, z3.h
				; CHECK-NEXT: udivr z4.s, p1/m, z4.s, z5.s
				; CHECK-NEXT: uunpklo z5.s, z3.h
				; CHECK-NEXT: udivr z7.s, p1/m, z7.s, z16.s
				; CHECK-NEXT: uunpklo z16.s, z1.h
				; CHECK-NEXT: uunpklo z6.s, z2.h
				; CHECK-NEXT: udivr z5.s, p1/m, z5.s, z16.s
				; CHECK-NEXT: uunpklo z16.s, z0.h
				; CHECK-NEXT: uzp1 z5.h, z5.h, z7.h
				; CHECK-NEXT: udivr z6.s, p1/m, z6.s, z16.s
				; CHECK-NEXT: mls z1.h, p0/m, z5.h, z3.h
				; CHECK-NEXT: uzp1 z4.h, z6.h, z4.h
				; CHECK-NEXT: mls z0.h, p0/m, z4.h, z2.h
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%res = urem <16 x i16> %op1, %op2
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @urem_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: urem_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: movprfx z2, z0
				; CHECK-NEXT: udiv z2.s, p0/m, z2.s, z1.s
				; CHECK-NEXT: mls z0.s, p0/m, z2.s, z1.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = urem <2 x i32> %op1, %op2
				ret <2 x i32> %res
				}

				define <4 x i32> @urem_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: urem_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: movprfx z2, z0
				; CHECK-NEXT: udiv z2.s, p0/m, z2.s, z1.s
				; CHECK-NEXT: mls z0.s, p0/m, z2.s, z1.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = urem <4 x i32> %op1, %op2
				ret <4 x i32> %res
				}

				define void @urem_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: urem_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #4
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x1]
				; CHECK-NEXT: movprfx z4, z1
				; CHECK-NEXT: udiv z4.s, p0/m, z4.s, z3.s
				; CHECK-NEXT: movprfx z5, z0
				; CHECK-NEXT: udiv z5.s, p0/m, z5.s, z2.s
				; CHECK-NEXT: mls z1.s, p0/m, z4.s, z3.s
				; CHECK-NEXT: mls z0.s, p0/m, z5.s, z2.s
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%res = urem <8 x i32> %op1, %op2
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @urem_v1i64(<1 x i64> %op1, <1 x i64> %op2) #0 {
				; CHECK-LABEL: urem_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: movprfx z2, z0
				; CHECK-NEXT: udiv z2.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: mls z0.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = urem <1 x i64> %op1, %op2
				ret <1 x i64> %res
				}

				define <2 x i64> @urem_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: urem_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: movprfx z2, z0
				; CHECK-NEXT: udiv z2.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: mls z0.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = urem <2 x i64> %op1, %op2
				ret <2 x i64> %res
				}

				define void @urem_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: urem_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #2
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; CHECK-NEXT: ld1d { z3.d }, p0/z, [x1]
				; CHECK-NEXT: movprfx z4, z1
				; CHECK-NEXT: udiv z4.d, p0/m, z4.d, z3.d
				; CHECK-NEXT: movprfx z5, z0
				; CHECK-NEXT: udiv z5.d, p0/m, z5.d, z2.d
				; CHECK-NEXT: mls z1.d, p0/m, z4.d, z3.d
				; CHECK-NEXT: mls z0.d, p0/m, z5.d, z2.d
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%res = urem <4 x i64> %op1, %op2
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }
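
The 8-bit and 16-bit tests in this file unpack their operands to 32-bit lanes before dividing, because the SVE `sdiv`/`udiv` instructions operate only on 32-bit and 64-bit elements. The quotient is then narrowed with `uzp1` and the remainder is formed at the original width with `mls`. A scalar sketch of that shape for `urem` on eight bytes follows. `urem_v8i8_ref` is a hypothetical name, and every divisor lane is assumed to be nonzero.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical lane-by-lane model of the urem_v8i8 lowering:
// widen to 32 bits (uunpklo/uunpkhi), divide (udiv/udivr),
// narrow the quotient (uzp1), then multiply-subtract at 8 bits (mls).
// Assumed precondition: no element of b is zero.
inline std::array<std::uint8_t, 8> urem_v8i8_ref(
    const std::array<std::uint8_t, 8>& a,
    const std::array<std::uint8_t, 8>& b) {
  std::array<std::uint8_t, 8> r{};
  for (std::size_t i = 0; i < a.size(); ++i) {
    std::uint32_t q = std::uint32_t{a[i]} / std::uint32_t{b[i]};  // 32-bit divide
    std::uint8_t qn = static_cast<std::uint8_t>(q);               // narrow quotient
    r[i] = static_cast<std::uint8_t>(a[i] - qn * b[i]);           // a - q * b
  }
  return r;
}
```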
				david-arm (inline comment, Done): Again, maybe remove this test since I'm not sure what extra value it gives us?
				david-arm (inline comment, Done): Again, maybe remove this test?

Revision Contents

Diff 467417

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-arith.ll

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-div.ll

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-log.ll

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-mulh.ll

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-int-rem.ll
