This is an archive of the discontinued LLVM Phabricator instance.

Differential D136858

[AArch64-SVE]: Force generating code compatible to streaming mode for sve-fixed-length tests.
Abandoned · Public

Authored by hassnaa-arm on Oct 27 2022, 9:40 AM.

Details

Reviewers

sdesmalen
david-arm
paulwalker-arm

Summary

Add testing files and enable streaming mode flag for:

bit-counting.ll
bitselect.ll
insert-vector-elt.ll
subvector.ll
vector-shuffle.ll
int-immediates.ll
int-minmax.ll
int-reduce.ll
trunc.ll
int-compare.ll
int-vselect.ll
mask-opt.ll
masked-scatter.ll
masked-gather.ll
fp-compares.ll
fp-extend-trunc.ll
addressing-modes.ll
fp-arith.ll
int-select.ll
log-reduce.ll
ld2-alloca.ll

Add the changes needed to force generating code compatible with streaming mode:
1- enable custom lowering for CTLZ and CTPOP (needed for the bit-counting.ll test).
2- enable custom lowering for insert_vector_elt (needed for the insert-vector-elt.ll test).
3- enable custom lowering for vector SETCC (needed for the subvector.ll and int-compare.ll tests).
4- enable custom lowering for SMIN, SMAX, UMIN and UMAX (needed for the int-minmax.ll and int-immediates.ll tests).
5- enable custom lowering for vecreduce_smin/smax/umin/umax/add (needed for int-reduce.ll).
6- enable custom lowering for truncate (needed for trunc.ll).
7- enable custom lowering for truncStore (needed for fp-extend-trunc.ll).
8- enable expanding setueq, to avoid custom-lowering setcc to setcc_merge_zero, which crashes during instruction selection because no pattern matches it (needed for fp-compares.ll).
9- disable combining OR into BSL (needed for the bitselect.ll test).
10- disable lowering interleaved loads, to avoid generating an invalid NEON intrinsic (needed for ld2-alloca.ll).
11- use the SVE ORR instruction instead of the NEON ORR when copying physical registers in AArch64InstrInfo::copyPhysReg (needed for vector-shuffle.ll).
12- force scalarisation of masked gather/scatter, because they are not supported in streaming mode.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hassnaa-arm created this revision. Oct 27 2022, 9:40 AM

Herald added a project: Restricted Project. Oct 27 2022, 9:40 AM

Herald added subscribers: ctetreau, hiraditya, kristof.beyls, tschuett.

hassnaa-arm requested review of this revision. Oct 27 2022, 9:40 AM

Herald added a project: Restricted Project. Oct 27 2022, 9:40 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B194685: Diff 471197. Oct 27 2022, 9:41 AM

hassnaa-arm added inline comments. Oct 27 2022, 9:56 AM

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
9	when I uncomment this test, llc returns this error: invalid shufflevector operands: %ret = shufflevector <4 x i8> %op1, <4 x i8> %op2, <4 x i32> <i32 7, i32 8, i32 9, i32 10>
67	when I uncomment this test, llc returns this error: invalid shufflevector operands: %ret = shufflevector <2 x i16> %op1, <2 x i16> %op2, <2 x i32> <i32 3, i32 4>

Add additional test cases and remove unneeded ones from the subvector.ll test file.

Harbormaster completed remote builds in B194699: Diff 471213. Oct 27 2022, 10:21 AM

hassnaa-arm added inline comments. Oct 27 2022, 10:25 AM

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll
11	I left that comment in intentionally so that we can choose the better solution: the current one (disable combining OR into BSP), or implementing SVE lowering for the BSP pseudo-instruction, as the comment suggests.
47	Should I add more test cases to this test file? It seems that the original test file, sve-fixed-length-bitselect.ll, tests one specific case (a specific size).

sdesmalen added inline comments. Oct 27 2022, 10:39 AM

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll
9	That is because elements 8, 9 and 10 are out of bounds when you concatenate %op1 and %op2 (<=> 8 elements). The following does work, for example: %ret = shufflevector <4 x i8> %op1, <4 x i8> %op2, <4 x i32> <i32 3, i32 4, i32 5, i32 6>
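
The bounds rule described in this reply can be sketched as a small compile-time check. This is only an illustration: `isMaskInBounds` is a hypothetical helper, not an LLVM API, and it does not model undef mask elements.

```cpp
#include <cstddef>
#include <initializer_list>

// Hypothetical helper, not an LLVM API: true when every mask index selects a
// lane of the concatenation of two NumElts-wide inputs, i.e. lies in
// [0, 2 * NumElts). Undef mask elements are not modelled.
constexpr bool isMaskInBounds(std::initializer_list<int> Mask,
                              std::size_t NumElts) {
  for (int Idx : Mask)
    if (Idx < 0 || static_cast<std::size_t>(Idx) >= 2 * NumElts)
      return false;
  return true;
}

// <4 x i8> inputs give 8 lanes (0..7), so indices 8, 9 and 10 are rejected.
static_assert(!isMaskInBounds({7, 8, 9, 10}, 4), "mask from the failing test");
// The mask suggested in the review stays within lanes 0..7.
static_assert(isMaskInBounds({3, 4, 5, 6}, 4), "mask from the review");
// <2 x i16> inputs give 4 lanes (0..3), so index 4 is rejected.
static_assert(!isMaskInBounds({3, 4}, 2), "second failing mask");
```

The same rule explains the second error reported for line 67: with two <2 x i16> inputs only indices 0 to 3 exist.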

Matt added a subscriber: Matt. Oct 28 2022, 1:26 PM

hassnaa-arm added a parent revision: D135324: [AArch64-SVE]: force using SVE in streaming mode to lower arithmetic and logical fixed-width vector ops. Nov 1 2022, 3:59 AM

Adding new testing files and required changes for generating streaming-compatible code for them.
int-immediates.ll
int-minmax.ll
int-reduce.ll
int-compares.ll
trunc.ll

Harbormaster completed remote builds in B195714: Diff 472626. Nov 2 2022, 8:33 AM

Add new testing files:
int-vselect.ll
mask-opt.ll
masked-scatter.ll -has problems-.
masked-gather.ll -has problems-.

Harbormaster completed remote builds in B195916: Diff 472914. Nov 3 2022, 6:03 AM

hassnaa-arm added inline comments. Nov 3 2022, 6:04 AM

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-gather.ll
5 ↗	(On Diff #472914)	This testing file is still in progress. It crashes because of test cases of f16.
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-scatter.ll
1 ↗	(On Diff #472914)	This testing file is still in progress. It crashes because of test cases of f16.

Add new testing files.
Add masked-gather.ll and masked-scatter.ll and force scalarisation for them.

Harbormaster completed remote builds in B195959: Diff 472976. Nov 3 2022, 10:08 AM

hassnaa-arm edited the summary of this revision. Nov 3 2022, 10:09 AM

hassnaa-arm edited the summary of this revision.

Matt added a subscriber: paulwalker-arm. Nov 3 2022, 10:25 AM

Matt added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
265 ↗	(On Diff #472976)	I wonder, would it also make sense to do that for 128-bit vectors (regardless of the streaming mode) as a (temporary?) fix for https://github.com/llvm/llvm-project/issues/56412? @hassnaa-arm, @paulwalker-arm: Thoughts?

paulwalker-arm added inline comments. Nov 3 2022, 11:15 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
265 ↗	(On Diff #472976)	@Matt : That's not a bug we can actually hit though is it? I mean, you have to edit the LLVM code in order to trigger the failure case?

Matt added inline comments. Nov 3 2022, 11:21 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
265 ↗	(On Diff #472976)	Indeed, I'm asking in the context assuming the modification of `useSVEForFixedLengthVectors`--curious whether a similar fix is applicable, given how similar that modification is to the `forceStreamingCompatibleSVE` special case (that's the only relation to this patch). Chances are it would only be needed for `half`/`fp16`, too.

paulwalker-arm added inline comments. Nov 3 2022, 11:52 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
265 ↗	(On Diff #472976)	I see, then yes. Doing this will force the intrinsic to be scalarised at the IR level and thus you will not hit the failure case within code generation. 56412 isn't just about 128bit vectors though, because those work today. It's really about restricting smaller than 64bit vectors (e.g. <2 x half>) when specially targeting SVE128. That said, I'd sooner fix the underlying issue :)

Matt added inline comments. Nov 3 2022, 12:02 PM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
265 ↗	(On Diff #472976)	Makes sense! (Would it be fair to say that the underlying issue is somewhere in SelectionDAG and the interaction of SVE128 and smaller vector?)

paulwalker-arm added inline comments. Nov 3 2022, 1:57 PM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
265 ↗	(On Diff #472976)	Yes. I'm pretty sure it's a legalisation hang where we keep bouncing between two different legalisation styles.

Matt added inline comments. Nov 3 2022, 1:59 PM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
265 ↗	(On Diff #472976)	All right, thanks!

Add new testing files.

Harbormaster completed remote builds in B196124: Diff 473206. Nov 4 2022, 5:49 AM

Add new testing files:
zip-uzp-trn.ll
optimize-ptrue.ll
int-to-fp.ll
permute-rev.ll

Harbormaster completed remote builds in B196434: Diff 473605. Nov 7 2022, 2:17 AM

Add testing files for fp.

Harbormaster completed remote builds in B196653: Diff 473908. Nov 8 2022, 12:53 AM

david-arm mentioned this in D137093: [AArch64][SVE][NFC] Add streaming mode SVE tests. Nov 8 2022, 6:26 AM

Hi @hassnaa-arm, could you rebase this off D137093 please because I'd like to see whether or not this patch fixes up the gathers and scatters present in CodeGen/AArch64/sve-streaming-mode-fixed-length-addressing-modes.ll from that patch.

I've only reviewed about 1/3 of this patch so far, since it's so big! But I'm leaving the comments I have so far ...

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
3660	Given you're trying to mov SrcReg into DstReg I think this is incorrect for two reasons: You're marking the source registers as being Defined, which isn't right since they're only being read. The second source should also be SrcReg. i.e. something like this:
  BuildMI(MBB, I, DL, get(AArch64::ORR_ZZZ))
      .addReg(AArch64::Z0 + (DestReg - AArch64::Q0), RegState::Define)
      .addReg(AArch64::Z0 + (SrcReg - AArch64::Q0))
      .addReg(AArch64::Z0 + (SrcReg - AArch64::Q0))
llvm/test/CodeGen/AArch64/sve-fixed-length-int-reduce.ll
1111 ↗	(On Diff #473908)	There is nothing technically wrong with these changes - we're getting the same result. I just wonder if we want to change the behaviour for NEON-like vectors when not in streaming mode? @paulwalker-arm any thoughts?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-to-int.ll
321 ↗	(On Diff #473908)	I think we can remove this test because the input vector > 256 bits.
1070 ↗	(On Diff #473908)	Remove this test, since input > 256 bits?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-masked-scatter.ll
2401 ↗	(On Diff #473908)	Given we know we're just going to scalarise this operation I wonder if there is much value in producing tests for large types? Perhaps we can just ignore tests for vectors that are > 256 bits?
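
As a rough illustration of the `copyPhysReg` suggestion above (line 3660), the sketch below models a register copy as an ORR whose two sources are the same register, with only the destination marked as defined. The register numbering and the names `Reg`, `Operand`, `Move`, `qToZ` and `buildStreamingCopy` are hypothetical stand-ins, not LLVM APIs. The real AArch64 register enum is generated by TableGen and is not reproduced here.

```cpp
// Hypothetical register numbering, for illustration only.
enum Reg : unsigned { Q0 = 100, Z0 = 200 };

struct Operand {
  unsigned RegNo;
  bool IsDef; // true only for the register being written
};

struct Move {
  Operand Dst, Src1, Src2;
};

// Map a NEON Q register to the SVE Z register whose low 128 bits it aliases,
// assuming both banks are numbered contiguously.
constexpr unsigned qToZ(unsigned Q) { return Z0 + (Q - Q0); }

// Model "mov Zd, Zn" as "orr Zd, Zn, Zn": one definition and two reads of the
// same source register.
constexpr Move buildStreamingCopy(unsigned DestQ, unsigned SrcQ) {
  return Move{{qToZ(DestQ), true}, {qToZ(SrcQ), false}, {qToZ(SrcQ), false}};
}

// Copying Q1 into Q0 defines Z0 and reads Z1 twice.
static_assert(buildStreamingCopy(Q0, Q0 + 1).Dst.RegNo == Z0, "dest is Z0");
static_assert(buildStreamingCopy(Q0, Q0 + 1).Dst.IsDef, "dest is defined");
static_assert(buildStreamingCopy(Q0, Q0 + 1).Src1.RegNo == Z0 + 1, "reads Z1");
static_assert(!buildStreamingCopy(Q0, Q0 + 1).Src1.IsDef, "source is a use");
static_assert(buildStreamingCopy(Q0, Q0 + 1).Src2.RegNo == Z0 + 1, "reads Z1");
```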

sdesmalen added inline comments. Nov 8 2022, 7:01 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
12881	Instead of updating the OverrideNEON variable, I suspect that you actually want to do something like this:
  if (SrcVT.isScalableVector() ||
      useSVEForFixedLengthVectorVT(
          SrcVT, OverrideNEON && Subtarget->useSVEForFixedLengthVectors()) ||
      useSVEForFixedLengthVectorVT(
          SrcVT, Subtarget->forceStreamingCompatibleSVE()))
so as not to alter the behaviour for non-streaming fixed-length vectors. This avoids the regressions in sve-fixed-length-int-reduce.ll where the SVE variants require an additional predicate (whereas the NEON reduction operations are unpredicated and thus only 1 instruction).
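
The claim that this form leaves non-streaming behaviour unchanged can be checked on a boolean model. The sketch below is a simplification under stated assumptions: `useSVEModel` is a hypothetical stand-in for `useSVEForFixedLengthVectorVT` (wide vectors always use SVE, NEON-sized vectors only when the override is set), and `Query`, `originalCond` and `suggestedCond` are illustrative names, not LLVM code.

```cpp
// Stand-in for useSVEForFixedLengthVectorVT. This is an assumption made for
// illustration, not the LLVM implementation.
constexpr bool useSVEModel(bool IsWide, bool IsNEONSized, bool OverrideNEON) {
  return IsWide || (IsNEONSized && OverrideNEON);
}

// Inputs of the check reduced to booleans (all names are hypothetical).
struct Query {
  bool Scalable;       // SrcVT.isScalableVector()
  bool Wide;           // fixed-length vector wider than NEON
  bool NEONSized;      // 64-bit or 128-bit fixed-length vector
  bool OpOverride;     // the opcode-based OverrideNEON value
  bool UseSVEFixed;    // Subtarget->useSVEForFixedLengthVectors()
  bool ForceStreaming; // Subtarget->forceStreamingCompatibleSVE()
};

// Condition before the suggestion.
constexpr bool originalCond(Query Q) {
  return Q.Scalable ||
         useSVEModel(Q.Wide, Q.NEONSized, Q.OpOverride && Q.UseSVEFixed);
}

// Condition from the review comment: the original plus a streaming-mode term.
constexpr bool suggestedCond(Query Q) {
  return originalCond(Q) || useSVEModel(Q.Wide, Q.NEONSized, Q.ForceStreaming);
}

// Exhaustively confirm that both conditions agree when streaming mode is off.
constexpr bool nonStreamingUnchanged() {
  for (unsigned Bits = 0; Bits < 32; ++Bits) {
    Query Q{(Bits & 1) != 0, (Bits & 2) != 0, (Bits & 4) != 0,
            (Bits & 8) != 0, (Bits & 16) != 0, /*ForceStreaming=*/false};
    if (originalCond(Q) != suggestedCond(Q))
      return false;
  }
  return true;
}

static_assert(nonStreamingUnchanged(), "non-streaming behaviour is preserved");
// A NEON-sized reduction without an opcode override: NEON path originally,
// SVE path once streaming compatibility is forced.
static_assert(!originalCond(Query{false, false, true, false, false, true}), "");
static_assert(suggestedCond(Query{false, false, true, false, false, true}), "");
```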

revert changes added to sve-fixed-length-int-reduce.ll

hassnaa-arm added inline comments. Nov 10 2022, 7:50 AM

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-fp-to-int.ll
321 ↗	(On Diff #473908)	I left it because the output vector is not > 256 bits. In all cases, I keep a test if either its input or its output vector is not > 256 bits.
1070 ↗	(On Diff #473908)	I left it because the output vector is not > 256 bits. In all cases, I keep a test if either its input or its output vector is not > 256 bits.

Harbormaster completed remote builds in B197080: Diff 474549. Nov 10 2022, 9:06 AM

Add new testing files

hassnaa-arm edited the summary of this revision. Nov 11 2022, 9:18 AM

hassnaa-arm added a reviewer: paulwalker-arm.

Harbormaster completed remote builds in B197254: Diff 474791. Nov 11 2022, 10:01 AM

Add new testing file: addressing-modes.ll

hassnaa-arm edited the summary of this revision. Nov 14 2022, 4:15 AM

Harbormaster completed remote builds in B197498: Diff 475103. Nov 14 2022, 4:52 AM

Adding new testing file and its related changes.

hassnaa-arm edited the summary of this revision. Nov 15 2022, 6:04 AM

Harbormaster completed remote builds in B197746: Diff 475442. Nov 15 2022, 6:32 AM

Hi @hassnaa-arm, I've not been able to go through the entire patch yet, but I think it makes sense to split it up to make the changes a bit easier to review. I've left some comments to suggest how to split it up and also some comments on the code-changes itself.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1660	SETLT and SETLE should not be in this list, because they have a 1-1 mapping with instructions. Most of the other nodes need expanding into two nodes (one for ordered/unordered and one for LE/LT), with SETO needing to be expanded to `not(unordered)`. It would also be nice if these changes could be moved to a separate patch.
1670–1678	These actions are unrelated to `VT` as passed into the function, so they can be moved out of this function.
3777	It would be nice if you could split up your patch so that each such code change lives in its own patch with a set of corresponding tests. That makes the patch a bit more manageable to review.
12624	Please move this change and corresponding tests to a separate patch.
12881	This can actually be simplified to:
  bool OverrideNEON =
      Subtarget->forceStreamingCompatibleSVE() ||
      (Subtarget->useSVEForFixedLengthVectors() &&
       (Op.getOpcode() == ISD::VECREDUCE_AND ||
        Op.getOpcode() == ISD::VECREDUCE_OR ||
        Op.getOpcode() == ISD::VECREDUCE_XOR ||
        Op.getOpcode() == ISD::VECREDUCE_FADD ||
        (Op.getOpcode() != ISD::VECREDUCE_ADD &&
         SrcVT.getVectorElementType() == MVT::i64)));
  if (SrcVT.isScalableVector() ||
      useSVEForFixedLengthVectorVT(SrcVT, OverrideNEON)) {
It would be nice if you could move this change, the changes in `addTypeForStreamingSVE` and corresponding reduction tests to a separate patch.
13898	Rather than adding this condition here, you can add the condition to `isLegalInterleavedAccessType` like this:
-  if (Subtarget->useSVEForFixedLengthVectors() &&
-      (VecSize % Subtarget->getMinSVEVectorSizeInBits() == 0 ||
-       (VecSize < Subtarget->getMinSVEVectorSizeInBits() &&
-        isPowerOf2_32(NumElements) && VecSize > 128))) {
+  if (Subtarget->forceStreamingCompatibleSVE() ||
+      (Subtarget->useSVEForFixedLengthVectors() &&
+       (VecSize % Subtarget->getMinSVEVectorSizeInBits() == 0 ||
+        (VecSize < Subtarget->getMinSVEVectorSizeInBits() &&
+         isPowerOf2_32(NumElements) && VecSize > 128)))) {
When you add `vscale_range(1,16)` to the attributes of the test file, you will get the code you expect.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
3658–3662	Can you move this change to a separate patch and test it with something very simple, such as:
  define fp128 @test_streaming_compatible_register_mov(fp128 %q0, fp128 %q1) {
  ; CHECK-LABEL: test_streaming_compatible_register_mov:
  ; CHECK: // %bb.0:
  ; CHECK-NEXT: mov z0.d, z1.d
  ; CHECK-NEXT: ret
    ret fp128 %q1
  }
llvm/test/CodeGen/AArch64/-streaming-mode-fixed-length-fp-arith.ll
1 ↗	(On Diff #475442)	The name of this file is incorrect, it should start with `sve-` instead of `-`
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ld2-alloca.ll
20 ↗	(On Diff #475442)	This test seems broken, because it's not using the result `%strided.vec` from the shufflevector, which as we can see from the assembly causes the load + shuffle to be removed entirely. I've fixed `sve-fixed-ld2-alloca.ll` in c2600244fc14, can you update this test accordingly?
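
The condition proposed above for `isLegalInterleavedAccessType` (line 13898) can be explored in isolation. The sketch below models only the boolean expression quoted in the review. The function parameters stand in for the Subtarget queries, `isPow2` is a hypothetical stand-in for `isPowerOf2_32`, and what the surrounding function does when the condition holds is not shown in the source and is not modelled.

```cpp
#include <cstdint>

// Hypothetical stand-in for llvm::isPowerOf2_32.
constexpr bool isPow2(std::uint32_t V) { return V != 0 && (V & (V - 1)) == 0; }

// Models only the suggested condition. MinSVEBits must be non-zero whenever
// UseSVEFixed is true, because it is used as a divisor.
constexpr bool interleavedCondHolds(bool ForceStreaming, bool UseSVEFixed,
                                    unsigned MinSVEBits, unsigned VecSize,
                                    unsigned NumElements) {
  return ForceStreaming ||
         (UseSVEFixed &&
          (VecSize % MinSVEBits == 0 ||
           (VecSize < MinSVEBits && isPow2(NumElements) && VecSize > 128)));
}

// Forcing streaming compatibility makes the condition hold even for a
// 128-bit vector, which the original expression rejects below.
static_assert(interleavedCondHolds(true, false, 0, 128, 4), "");
static_assert(!interleavedCondHolds(false, true, 256, 128, 4), "");
// Without streaming mode, a vector that is a multiple of the minimum SVE
// size still satisfies the original part of the condition.
static_assert(interleavedCondHolds(false, true, 256, 512, 16), "");
```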

fix broken ld2-alloca.ll test

Harbormaster completed remote builds in B198186: Diff 476092. Nov 17 2022, 5:37 AM

paulwalker-arm mentioned this in D138351: [SVE] Fix incorrect predicate for fixed length int/fp conversion. Nov 19 2022, 4:40 AM

sdesmalen mentioned this in D138670: [AArch64][SME]: Generate streaming-compatible code for fp compares. Nov 24 2022, 7:10 AM

[NFC] rearrange lines

Update by main branch

Harbormaster completed remote builds in B199528: Diff 477932. Nov 25 2022, 6:38 AM

Update by parent patch

Harbormaster completed remote builds in B199803: Diff 478293. Nov 28 2022, 1:20 PM

Remove testing files that use masked gather/scatter

Harbormaster completed remote builds in B200001: Diff 478554. Nov 29 2022, 7:30 AM

Update by latest changes in main branch

Remove redundant condition in AArch64TargetTransformInfo.h

Harbormaster completed remote builds in B200421: Diff 479154. Nov 30 2022, 9:41 PM

It seems this patch can be abandoned now.

This revision now requires changes to proceed. Dec 1 2022, 7:57 AM

This patch was split into smaller patches.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	12 lines
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp	6 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bit-counting.ll	638 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll	47 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-insert-vector-elt.ll	414 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-subvector.ll	493 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll	356 lines

Diff 471197

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

@@ ... @@ void AArch64TargetLowering::addTypeForStreamingSVE(MVT VT) {
   setOperationAction(ISD::AND, VT, Custom);
   setOperationAction(ISD::ADD, VT, Custom);
   setOperationAction(ISD::SUB, VT, Custom);
   setOperationAction(ISD::MUL, VT, Custom);
   setOperationAction(ISD::MULHS, VT, Custom);
   setOperationAction(ISD::MULHU, VT, Custom);
   setOperationAction(ISD::ABS, VT, Custom);
   setOperationAction(ISD::XOR, VT, Custom);
+  setOperationAction(ISD::CTLZ, VT, Custom);
 }

 void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
   assert(VT.isFixedLengthVector() && "Expected fixed length vector type!");

   // By default everything must be expanded.
   for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op)
     setOperationAction(Op, VT, Expand);
@@ ... @@ void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
   // Mark integer truncating stores/extending loads as having custom lowering
   if (VT.isInteger()) {
     MVT InnerVT = VT.changeVectorElementType(MVT::i8);
     while (InnerVT != VT) {
       setTruncStoreAction(VT, InnerVT, Custom);
       setLoadExtAction(ISD::ZEXTLOAD, VT, InnerVT, Custom);
       setLoadExtAction(ISD::SEXTLOAD, VT, InnerVT, Custom);
       InnerVT = InnerVT.changeVectorElementType(
           MVT::getIntegerVT(2 * InnerVT.getScalarSizeInBits()));
     }
   }

   // Mark floating-point truncating stores/extending loads as having custom
   // lowering
   if (VT.isFloatingPoint()) {
     MVT InnerVT = VT.changeVectorElementType(MVT::f16);
     while (InnerVT != VT) {
       setTruncStoreAction(VT, InnerVT, Custom);
       setLoadExtAction(ISD::EXTLOAD, VT, InnerVT, Custom);
       InnerVT = InnerVT.changeVectorElementType(
           MVT::getFloatingPointVT(2 * InnerVT.getScalarSizeInBits()));
     }
   }

   // Lower fixed length vector operations to scalable equivalents.
   setOperationAction(ISD::ABS, VT, Custom);
   setOperationAction(ISD::ADD, VT, Custom);
   setOperationAction(ISD::AND, VT, Custom);
   setOperationAction(ISD::ANY_EXTEND, VT, Custom);
   setOperationAction(ISD::BITCAST, VT, Custom);
   setOperationAction(ISD::BITREVERSE, VT, Custom);
   setOperationAction(ISD::BUILD_VECTOR, VT, Custom);
   setOperationAction(ISD::BSWAP, VT, Custom);
   setOperationAction(ISD::CONCAT_VECTORS, VT, Custom);
   setOperationAction(ISD::CTLZ, VT, Custom);
@@ ... @@ SDValue AArch64TargetLowering::LowerFP_ROUND(SDValue Op,
                                              SelectionDAG &DAG) const {
   if (Op.getValueType().isScalableVector())
     return LowerToPredicatedOp(Op, DAG, AArch64ISD::FP_ROUND_MERGE_PASSTHRU);

   bool IsStrict = Op->isStrictFPOpcode();
   SDValue SrcVal = Op.getOperand(IsStrict ? 1 : 0);
   EVT SrcVT = SrcVal.getValueType();

   if (useSVEForFixedLengthVectorVT(SrcVT))
     return LowerFixedLengthFPRoundToSVE(Op, DAG);

   if (SrcVT != MVT::f128) {
     // Expand cases where the input is a vector bigger than NEON.
     if (useSVEForFixedLengthVectorVT(SrcVT))
       return SDValue();

     // It's legal except when f128 is involved
@@ ... @@ if (IsParity)
       UaddLV = DAG.getNode(ISD::AND, DL, MVT::i32, UaddLV,
                            DAG.getConstant(1, DL, MVT::i32));

     return DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i128, UaddLV);
   }

   assert(!IsParity && "ISD::PARITY of vector types not supported");

-  if (VT.isScalableVector() || useSVEForFixedLengthVectorVT(VT))
+  if (VT.isScalableVector() || useSVEForFixedLengthVectorVT(VT, Subtarget->forceStreamingCompatibleSVE()))
     return LowerToPredicatedOp(Op, DAG, AArch64ISD::CTPOP_MERGE_PASSTHRU);

   assert((VT == MVT::v1i64 || VT == MVT::v2i64 || VT == MVT::v2i32 ||
           VT == MVT::v4i32 || VT == MVT::v4i16 || VT == MVT::v8i16) &&
          "Unexpected type for custom ctpop lowering");

   EVT VT8Bit = VT.is64BitVector() ? MVT::v8i8 : MVT::v16i8;
   Val = DAG.getBitcast(VT8Bit, Val);
@@ ... @@ SDValue AArch64TargetLowering::LowerCONCAT_VECTORS(SDValue Op,

   return SDValue();
 }

 SDValue AArch64TargetLowering::LowerINSERT_VECTOR_ELT(SDValue Op,
                                                       SelectionDAG &DAG) const {
   assert(Op.getOpcode() == ISD::INSERT_VECTOR_ELT && "Unknown opcode!");

-  if (useSVEForFixedLengthVectorVT(Op.getValueType()))
+  if (useSVEForFixedLengthVectorVT(Op.getValueType(), Subtarget->forceStreamingCompatibleSVE()))
     return LowerFixedLengthInsertVectorElt(Op, DAG);

   // Check for non-constant or out of range lane.
   EVT VT = Op.getOperand(0).getValueType();

   if (VT.getScalarType() == MVT::i1) {
     EVT VectorVT = getPromotedVTForPredicate(VT);
     SDLoc DL(Op);
@@ ... @@ if (VT.getScalarType() == MVT::i1) {
     SDValue One = DAG.getConstant(1, dl, OpVT);
     SDValue And = DAG.getNode(ISD::AND, dl, OpVT, Op.getOperand(0), One);
     return DAG.getSetCC(dl, VT, And, Zero, ISD::SETNE);
   }

   if (!VT.isVector() || VT.isScalableVector())
     return SDValue();

   if (useSVEForFixedLengthVectorVT(Op.getOperand(0).getValueType()))
     return LowerFixedLengthVectorTruncateToSVE(Op, DAG);

   return SDValue();
 }

 SDValue AArch64TargetLowering::LowerVectorSRA_SRL_SHL(SDValue Op,
                                                       SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();
@@ ... @@ static SDValue EmitVectorComparison(SDValue LHS, SDValue RHS,
   }
 }

 SDValue AArch64TargetLowering::LowerVSETCC(SDValue Op,
                                            SelectionDAG &DAG) const {
   if (Op.getValueType().isScalableVector())
     return LowerToPredicatedOp(Op, DAG, AArch64ISD::SETCC_MERGE_ZERO);

-  if (useSVEForFixedLengthVectorVT(Op.getOperand(0).getValueType()))
+  if (useSVEForFixedLengthVectorVT(Op.getOperand(0).getValueType(), Subtarget->forceStreamingCompatibleSVE()))
     return LowerFixedLengthVectorSetccToSVE(Op, DAG);

   ISD::CondCode CC = cast<CondCodeSDNode>(Op.getOperand(2))->get();
   SDValue LHS = Op.getOperand(0);
   SDValue RHS = Op.getOperand(1);
   EVT CmpVT = LHS.getValueType().changeVectorElementTypeToInteger();
   SDLoc dl(Op);

@@ ... @@ SDValue AArch64TargetLowering::LowerVECREDUCE(SDValue Op,
   bool OverrideNEON = Op.getOpcode() == ISD::VECREDUCE_AND ||
                       Op.getOpcode() == ISD::VECREDUCE_OR ||
                       Op.getOpcode() == ISD::VECREDUCE_XOR ||
                       Op.getOpcode() == ISD::VECREDUCE_FADD ||
                       (Op.getOpcode() != ISD::VECREDUCE_ADD &&
                        SrcVT.getVectorElementType() == MVT::i64);
   if (SrcVT.isScalableVector() ||
       useSVEForFixedLengthVectorVT(
           SrcVT, OverrideNEON && Subtarget->useSVEForFixedLengthVectors())) {

     if (SrcVT.getVectorElementType() == MVT::i1)
       return LowerPredReductionToSVE(Op, DAG);

     switch (Op.getOpcode()) {
     case ISD::VECREDUCE_ADD:
       return LowerReductionToSVE(AArch64ISD::UADDV_PRED, Op, DAG);
     case ISD::VECREDUCE_AND:
▲ Show 20 Lines • Show All 1,000 Lines • ▼ Show 20 Lines	bool AArch64TargetLowering::lowerInterleavedLoad(
const DataLayout &DL = LI->getModule()->getDataLayout();		const DataLayout &DL = LI->getModule()->getDataLayout();

VectorType *VTy = Shuffles[0]->getType();		VectorType *VTy = Shuffles[0]->getType();

// Skip if we do not have NEON and skip illegal vector types. We can		// Skip if we do not have NEON and skip illegal vector types. We can
// "legalize" wide vector types into multiple interleaved accesses as long as		// "legalize" wide vector types into multiple interleaved accesses as long as
// the vector types are divisible by 128.		// the vector types are divisible by 128.
bool UseScalable;		bool UseScalable;
if (!Subtarget->hasNEON() ||		if (!Subtarget->hasNEON() ||
		sdesmalen: Rather than adding this condition here, you can add the condition to `isLegalInterleavedAccessType` like this:

		    - if (Subtarget->useSVEForFixedLengthVectors() &&
		    -     (VecSize % Subtarget->getMinSVEVectorSizeInBits() == 0 ||
		    -      (VecSize < Subtarget->getMinSVEVectorSizeInBits() &&
		    -       isPowerOf2_32(NumElements) && VecSize > 128))) {
		    + if (Subtarget->forceStreamingCompatibleSVE() ||
		    +     (Subtarget->useSVEForFixedLengthVectors() &&
		    +      (VecSize % Subtarget->getMinSVEVectorSizeInBits() == 0 ||
		    +       (VecSize < Subtarget->getMinSVEVectorSizeInBits() &&
		    +        isPowerOf2_32(NumElements) && VecSize > 128)))) {

		When you add `vscale_range(1,16)` to the attributes of the test file, you will get the code you expect.
!isLegalInterleavedAccessType(VTy, DL, UseScalable))		!isLegalInterleavedAccessType(VTy, DL, UseScalable))
return false;		return false;

unsigned NumLoads = getNumInterleavedAccesses(VTy, DL, UseScalable);		unsigned NumLoads = getNumInterleavedAccesses(VTy, DL, UseScalable);

auto *FVTy = cast<FixedVectorType>(VTy);		auto *FVTy = cast<FixedVectorType>(VTy);

// A pointer vector can not be the return type of the ldN intrinsics. Need to		// A pointer vector can not be the return type of the ldN intrinsics. Need to
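
The condition proposed for `isLegalInterleavedAccessType` can be evaluated on its own to see which vector sizes it accepts. This is a rough sketch under stated assumptions: `SubtargetModel`, `isPowerOf2` and `suggestedCondition` are hypothetical names, and the code guarded by the condition is not shown in this diff, so only the condition itself is modelled.

```cpp
#include <cassert>

// Hypothetical model of the subtarget queries used by the condition.
struct SubtargetModel {
  bool ForceStreamingCompatibleSVE; // models forceStreamingCompatibleSVE()
  bool UseSVEForFixedLength;        // models useSVEForFixedLengthVectors()
  unsigned MinSVEVectorSizeInBits;  // models getMinSVEVectorSizeInBits()
};

constexpr bool isPowerOf2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

// Evaluates the "+" side of the suggested change. MinSVEVectorSizeInBits
// must be non-zero whenever UseSVEForFixedLength is set, because it is used
// as a divisor; short-circuiting protects the other cases.
constexpr bool suggestedCondition(const SubtargetModel &ST, unsigned VecSize,
                                  unsigned NumElements) {
  return ST.ForceStreamingCompatibleSVE ||
         (ST.UseSVEForFixedLength &&
          (VecSize % ST.MinSVEVectorSizeInBits == 0 ||
           (VecSize < ST.MinSVEVectorSizeInBits && isPowerOf2(NumElements) &&
            VecSize > 128)));
}
```

With streaming compatibility forced, the condition holds for every vector size, including 64-bit and 128-bit vectors that the original condition would have rejected.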
[... 1,614 lines not shown ...]	static SDValue tryCombineToBSL(SDNode *N,
SelectionDAG &DAG = DCI.DAG;		SelectionDAG &DAG = DCI.DAG;
SDLoc DL(N);		SDLoc DL(N);

if (!VT.isVector())		if (!VT.isVector())
return SDValue();		return SDValue();

// The combining code currently only works for NEON vectors. In particular,		// The combining code currently only works for NEON vectors. In particular,
// it does not work for SVE when dealing with vectors wider than 128 bits.		// it does not work for SVE when dealing with vectors wider than 128 bits.
if (!VT.is64BitVector() && !VT.is128BitVector())		if ((!VT.is64BitVector() && !VT.is128BitVector()) ||
		DAG.getSubtarget<AArch64Subtarget>().forceStreamingCompatibleSVE())
return SDValue();		return SDValue();

SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
if (N0.getOpcode() != ISD::AND)		if (N0.getOpcode() != ISD::AND)
return SDValue();		return SDValue();

SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
if (N1.getOpcode() != ISD::AND)		if (N1.getOpcode() != ISD::AND)
[... 7,086 lines not shown ...]
}		}

SDValue AArch64TargetLowering::LowerFixedLengthVectorSetccToSVE(		SDValue AArch64TargetLowering::LowerFixedLengthVectorSetccToSVE(
SDValue Op, SelectionDAG &DAG) const {		SDValue Op, SelectionDAG &DAG) const {
SDLoc DL(Op);		SDLoc DL(Op);
EVT InVT = Op.getOperand(0).getValueType();		EVT InVT = Op.getOperand(0).getValueType();
EVT ContainerVT = getContainerForFixedLengthVector(DAG, InVT);		EVT ContainerVT = getContainerForFixedLengthVector(DAG, InVT);

assert(useSVEForFixedLengthVectorVT(InVT) &&		assert(useSVEForFixedLengthVectorVT(InVT, Subtarget->forceStreamingCompatibleSVE()) &&
"Only expected to lower fixed length vector operation!");		"Only expected to lower fixed length vector operation!");
assert(Op.getValueType() == InVT.changeTypeToInteger() &&		assert(Op.getValueType() == InVT.changeTypeToInteger() &&
"Expected integer result of the same bit length as the inputs!");		"Expected integer result of the same bit length as the inputs!");

auto Op1 = convertToScalableVector(DAG, ContainerVT, Op.getOperand(0));		auto Op1 = convertToScalableVector(DAG, ContainerVT, Op.getOperand(0));
auto Op2 = convertToScalableVector(DAG, ContainerVT, Op.getOperand(1));		auto Op2 = convertToScalableVector(DAG, ContainerVT, Op.getOperand(1));
auto Pg = getPredicateForFixedLengthVector(DAG, DL, InVT);		auto Pg = getPredicateForFixedLengthVector(DAG, DL, InVT);

[... 426 lines not shown ...]

llvm/lib/Target/AArch64/AArch64InstrInfo.cpp

[... 3,649 lines not shown ...]	if (AArch64::WSeqPairsClassRegClass.contains(DestReg) &&
static const unsigned Indices[] = {AArch64::sube32, AArch64::subo32};		static const unsigned Indices[] = {AArch64::sube32, AArch64::subo32};
copyGPRRegTuple(MBB, I, DL, DestReg, SrcReg, KillSrc, AArch64::ORRWrs,		copyGPRRegTuple(MBB, I, DL, DestReg, SrcReg, KillSrc, AArch64::ORRWrs,
AArch64::WZR, Indices);		AArch64::WZR, Indices);
return;		return;
}		}

if (AArch64::FPR128RegClass.contains(DestReg) &&		if (AArch64::FPR128RegClass.contains(DestReg) &&
AArch64::FPR128RegClass.contains(SrcReg)) {		AArch64::FPR128RegClass.contains(SrcReg)) {
if (Subtarget.hasNEON()) {		if (Subtarget.forceStreamingCompatibleSVE()) {
		BuildMI(MBB, I, DL, get(AArch64::ORR_ZZZ), DestReg)
		.addReg(AArch64::Z0 + (SrcReg - AArch64::Q0), RegState::Define)
		david-arm: Given you're trying to mov SrcReg into DstReg I think this is incorrect for two reasons:

		  1. You're marking the source registers as being Defined, which isn't right since they're only being read.
		  2. The second source should also be SrcReg.

		i.e. something like this:

		    BuildMI(MBB, I, DL, get(AArch64::ORR_ZZZ))
		        .addReg(AArch64::Z0 + (DestReg - AArch64::Q0), RegState::Define)
		        .addReg(AArch64::Z0 + (SrcReg - AArch64::Q0))
		        .addReg(AArch64::Z0 + (SrcReg - AArch64::Q0))
		.addReg(AArch64::Z0 + (DestReg - AArch64::Q0), RegState::Define);
		} else if (Subtarget.hasNEON()) {
		sdesmalen (not done): Can you move this change to a separate patch and test it with something very simple, such as:

		    define fp128 @test_streaming_compatible_register_mov(fp128 %q0, fp128 %q1) {
		    ; CHECK-LABEL: test_streaming_compatible_register_mov:
		    ; CHECK: // %bb.0:
		    ; CHECK-NEXT: mov z0.d, z1.d
		    ; CHECK-NEXT: ret
		      ret fp128 %y
		    }
BuildMI(MBB, I, DL, get(AArch64::ORRv16i8), DestReg)		BuildMI(MBB, I, DL, get(AArch64::ORRv16i8), DestReg)
.addReg(SrcReg)		.addReg(SrcReg)
.addReg(SrcReg, getKillRegState(KillSrc));		.addReg(SrcReg, getKillRegState(KillSrc));
} else {		} else {
BuildMI(MBB, I, DL, get(AArch64::STRQpre))		BuildMI(MBB, I, DL, get(AArch64::STRQpre))
.addReg(AArch64::SP, RegState::Define)		.addReg(AArch64::SP, RegState::Define)
.addReg(SrcReg, getKillRegState(KillSrc))		.addReg(SrcReg, getKillRegState(KillSrc))
.addReg(AArch64::SP)		.addReg(AArch64::SP)
[... 4,542 lines not shown ...]
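
The two points in david-arm's comment on the FPR128 copy can be illustrated outside of LLVM. The sketch below assumes that Q and Z registers are numbered consecutively within their banks, which is what the `Z0 + (Reg - Q0)` arithmetic presumes; the register ids, `OrrZZZ`, `toZReg` and `makeCopy` are hypothetical and exist only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical register ids; only the consecutive numbering matters.
constexpr unsigned Q0 = 100;
constexpr unsigned Z0 = 200;

// Maps a Q register id to the Z register id with the same index.
constexpr unsigned toZReg(unsigned QReg) { return Z0 + (QReg - Q0); }

// Operand layout of an unpredicated ORR: one destination, two sources.
struct OrrZZZ {
  unsigned Dst;
  unsigned Src1;
  unsigned Src2;
};

// A register copy expressed as ORR: the destination is the only defined
// operand, and both sources are the same register.
constexpr OrrZZZ makeCopy(unsigned DestQ, unsigned SrcQ) {
  return {toZReg(DestQ), toZReg(SrcQ), toZReg(SrcQ)};
}

// OR-ing a value with itself returns the value unchanged, which is why
// ORR with identical sources behaves as a move.
constexpr uint64_t orr(uint64_t A, uint64_t B) { return A | B; }

static_assert(orr(0x1234u, 0x1234u) == 0x1234u, "x | x == x");
```

For example, `makeCopy(Q0, Q0 + 1)` yields destination `Z0` with both sources `Z0 + 1`, which corresponds to the `mov z0.d, z1.d` line in the test suggested by sdesmalen (`mov` between Z registers is an alias of `orr` with identical sources).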

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bit-counting.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				;
				; CLZ
				;

				define <4 x i8> @ctlz_v4i8(<4 x i8> %op) #0 {
				; CHECK-LABEL: ctlz_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI0_0
				; CHECK-NEXT: adrp x9, .LCPI0_1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI0_0]
				; CHECK-NEXT: ldr d2, [x9, :lo12:.LCPI0_1]
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: clz z0.h, p0/m, z0.h
				; CHECK-NEXT: sub z0.h, z0.h, z2.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i8> @llvm.ctlz.v4i8(<4 x i8> %op)
				ret <4 x i8> %res
				}

				define <8 x i8> @ctlz_v8i8(<8 x i8> %op) #0 {
				; CHECK-LABEL: ctlz_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: clz z0.b, p0/m, z0.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <8 x i8> @llvm.ctlz.v8i8(<8 x i8> %op)
				ret <8 x i8> %res
				}

				define <16 x i8> @ctlz_v16i8(<16 x i8> %op) #0 {
				; CHECK-LABEL: ctlz_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: clz z0.b, p0/m, z0.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <16 x i8> @llvm.ctlz.v16i8(<16 x i8> %op)
				ret <16 x i8> %res
				}

				define void @ctlz_v32i8(<32 x i8>* %a) #0 {
				; CHECK-LABEL: ctlz_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: clz z0.b, p0/m, z0.b
				; CHECK-NEXT: clz z1.b, p0/m, z1.b
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <32 x i8>, <32 x i8>* %a
				%res = call <32 x i8> @llvm.ctlz.v32i8(<32 x i8> %op)
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <2 x i16> @ctlz_v2i16(<2 x i16> %op) #0 {
				; CHECK-LABEL: ctlz_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI4_0
				; CHECK-NEXT: adrp x9, .LCPI4_1
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI4_0]
				; CHECK-NEXT: ldr d2, [x9, :lo12:.LCPI4_1]
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: clz z0.s, p0/m, z0.s
				; CHECK-NEXT: sub z0.s, z0.s, z2.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i16> @llvm.ctlz.v2i16(<2 x i16> %op)
				ret <2 x i16> %res
				}

				define <4 x i16> @ctlz_v4i16(<4 x i16> %op) #0 {
				; CHECK-LABEL: ctlz_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: clz z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i16> @llvm.ctlz.v4i16(<4 x i16> %op)
				ret <4 x i16> %res
				}

				define <8 x i16> @ctlz_v8i16(<8 x i16> %op) #0 {
				; CHECK-LABEL: ctlz_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: clz z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <8 x i16> @llvm.ctlz.v8i16(<8 x i16> %op)
				ret <8 x i16> %res
				}

				define void @ctlz_v16i16(<16 x i16>* %a) #0 {
				; CHECK-LABEL: ctlz_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: clz z0.h, p0/m, z0.h
				; CHECK-NEXT: clz z1.h, p0/m, z1.h
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <16 x i16>, <16 x i16>* %a
				%res = call <16 x i16> @llvm.ctlz.v16i16(<16 x i16> %op)
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @ctlz_v2i32(<2 x i32> %op) #0 {
				; CHECK-LABEL: ctlz_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: clz z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i32> @llvm.ctlz.v2i32(<2 x i32> %op)
				ret <2 x i32> %res
				}

				define <4 x i32> @ctlz_v4i32(<4 x i32> %op) #0 {
				; CHECK-LABEL: ctlz_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: clz z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i32> @llvm.ctlz.v4i32(<4 x i32> %op)
				ret <4 x i32> %res
				}

				define void @ctlz_v8i32(<8 x i32>* %a) #0 {
				; CHECK-LABEL: ctlz_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: clz z0.s, p0/m, z0.s
				; CHECK-NEXT: clz z1.s, p0/m, z1.s
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <8 x i32>, <8 x i32>* %a
				%res = call <8 x i32> @llvm.ctlz.v8i32(<8 x i32> %op)
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @ctlz_v1i64(<1 x i64> %op) #0 {
				; CHECK-LABEL: ctlz_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: clz z0.d, p0/m, z0.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <1 x i64> @llvm.ctlz.v1i64(<1 x i64> %op)
				ret <1 x i64> %res
				}

				define <2 x i64> @ctlz_v2i64(<2 x i64> %op) #0 {
				; CHECK-LABEL: ctlz_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: clz z0.d, p0/m, z0.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i64> @llvm.ctlz.v2i64(<2 x i64> %op)
				ret <2 x i64> %res
				}

				define void @ctlz_v4i64(<4 x i64>* %a) #0 {
				; CHECK-LABEL: ctlz_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: clz z0.d, p0/m, z0.d
				; CHECK-NEXT: clz z1.d, p0/m, z1.d
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <4 x i64>, <4 x i64>* %a
				%res = call <4 x i64> @llvm.ctlz.v4i64(<4 x i64> %op)
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				;
				; CNT
				;

				define <4 x i8> @ctpop_v4i8(<4 x i8> %op) #0 {
				; CHECK-LABEL: ctpop_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI14_0
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI14_0]
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: cnt z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i8> @llvm.ctpop.v4i8(<4 x i8> %op)
				ret <4 x i8> %res
				}

				define <8 x i8> @ctpop_v8i8(<8 x i8> %op) #0 {
				; CHECK-LABEL: ctpop_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cnt v0.8b, v0.8b
				; CHECK-NEXT: ret
				%res = call <8 x i8> @llvm.ctpop.v8i8(<8 x i8> %op)
				ret <8 x i8> %res
				}

				define <16 x i8> @ctpop_v16i8(<16 x i8> %op) #0 {
				; CHECK-LABEL: ctpop_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: cnt v0.16b, v0.16b
				; CHECK-NEXT: ret
				%res = call <16 x i8> @llvm.ctpop.v16i8(<16 x i8> %op)
				ret <16 x i8> %res
				}

				define void @ctpop_v32i8(<32 x i8>* %a) #0 {
				; CHECK-LABEL: ctpop_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: cnt v0.16b, v0.16b
				; CHECK-NEXT: cnt v1.16b, v1.16b
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <32 x i8>, <32 x i8>* %a
				%res = call <32 x i8> @llvm.ctpop.v32i8(<32 x i8> %op)
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <2 x i16> @ctpop_v2i16(<2 x i16> %op) #0 {
				; CHECK-LABEL: ctpop_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI18_0
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI18_0]
				; CHECK-NEXT: and z0.d, z0.d, z1.d
				; CHECK-NEXT: cnt z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i16> @llvm.ctpop.v2i16(<2 x i16> %op)
				ret <2 x i16> %res
				}

				define <4 x i16> @ctpop_v4i16(<4 x i16> %op) #0 {
				; CHECK-LABEL: ctpop_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: cnt z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i16> @llvm.ctpop.v4i16(<4 x i16> %op)
				ret <4 x i16> %res
				}

				define <8 x i16> @ctpop_v8i16(<8 x i16> %op) #0 {
				; CHECK-LABEL: ctpop_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: cnt z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <8 x i16> @llvm.ctpop.v8i16(<8 x i16> %op)
				ret <8 x i16> %res
				}

				define void @ctpop_v16i16(<16 x i16>* %a) #0 {
				; CHECK-LABEL: ctpop_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: cnt z0.h, p0/m, z0.h
				; CHECK-NEXT: cnt z1.h, p0/m, z1.h
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <16 x i16>, <16 x i16>* %a
				%res = call <16 x i16> @llvm.ctpop.v16i16(<16 x i16> %op)
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @ctpop_v2i32(<2 x i32> %op) #0 {
				; CHECK-LABEL: ctpop_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: cnt z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i32> @llvm.ctpop.v2i32(<2 x i32> %op)
				ret <2 x i32> %res
				}

				define <4 x i32> @ctpop_v4i32(<4 x i32> %op) #0 {
				; CHECK-LABEL: ctpop_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: cnt z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i32> @llvm.ctpop.v4i32(<4 x i32> %op)
				ret <4 x i32> %res
				}

				define void @ctpop_v8i32(<8 x i32>* %a) #0 {
				; CHECK-LABEL: ctpop_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: cnt z0.s, p0/m, z0.s
				; CHECK-NEXT: cnt z1.s, p0/m, z1.s
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <8 x i32>, <8 x i32>* %a
				%res = call <8 x i32> @llvm.ctpop.v8i32(<8 x i32> %op)
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @ctpop_v1i64(<1 x i64> %op) #0 {
				; CHECK-LABEL: ctpop_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: cnt z0.d, p0/m, z0.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <1 x i64> @llvm.ctpop.v1i64(<1 x i64> %op)
				ret <1 x i64> %res
				}

				define <2 x i64> @ctpop_v2i64(<2 x i64> %op) #0 {
				; CHECK-LABEL: ctpop_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: cnt z0.d, p0/m, z0.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i64> @llvm.ctpop.v2i64(<2 x i64> %op)
				ret <2 x i64> %res
				}

				define void @ctpop_v4i64(<4 x i64>* %a) #0 {
				; CHECK-LABEL: ctpop_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: cnt z0.d, p0/m, z0.d
				; CHECK-NEXT: cnt z1.d, p0/m, z1.d
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <4 x i64>, <4 x i64>* %a
				%res = call <4 x i64> @llvm.ctpop.v4i64(<4 x i64> %op)
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				;
				; Count trailing zeros
				;

				define <4 x i8> @cttz_v4i8(<4 x i8> %op) #0 {
				; CHECK-LABEL: cttz_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI28_0
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI28_0]
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: rbit z0.h, p0/m, z0.h
				; CHECK-NEXT: clz z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i8> @llvm.cttz.v4i8(<4 x i8> %op)
				ret <4 x i8> %res
				}

				define <8 x i8> @cttz_v8i8(<8 x i8> %op) #0 {
				; CHECK-LABEL: cttz_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl8
				; CHECK-NEXT: rbit z0.b, p0/m, z0.b
				; CHECK-NEXT: clz z0.b, p0/m, z0.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <8 x i8> @llvm.cttz.v8i8(<8 x i8> %op)
				ret <8 x i8> %res
				}

				define <16 x i8> @cttz_v16i8(<16 x i8> %op) #0 {
				; CHECK-LABEL: cttz_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: rbit z0.b, p0/m, z0.b
				; CHECK-NEXT: clz z0.b, p0/m, z0.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <16 x i8> @llvm.cttz.v16i8(<16 x i8> %op)
				ret <16 x i8> %res
				}

				define void @cttz_v32i8(<32 x i8>* %a) #0 {
				; CHECK-LABEL: cttz_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.b, vl16
				; CHECK-NEXT: rbit z0.b, p0/m, z0.b
				; CHECK-NEXT: clz z0.b, p0/m, z0.b
				; CHECK-NEXT: rbit z1.b, p0/m, z1.b
				; CHECK-NEXT: clz z1.b, p0/m, z1.b
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <32 x i8>, <32 x i8>* %a
				%res = call <32 x i8> @llvm.cttz.v32i8(<32 x i8> %op)
				store <32 x i8> %res, <32 x i8>* %a
				ret void
				}

				define <2 x i16> @cttz_v2i16(<2 x i16> %op) #0 {
				; CHECK-LABEL: cttz_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI32_0
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: ldr d1, [x8, :lo12:.LCPI32_0]
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: rbit z0.s, p0/m, z0.s
				; CHECK-NEXT: clz z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i16> @llvm.cttz.v2i16(<2 x i16> %op)
				ret <2 x i16> %res
				}

				define <4 x i16> @cttz_v4i16(<4 x i16> %op) #0 {
				; CHECK-LABEL: cttz_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl4
				; CHECK-NEXT: rbit z0.h, p0/m, z0.h
				; CHECK-NEXT: clz z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i16> @llvm.cttz.v4i16(<4 x i16> %op)
				ret <4 x i16> %res
				}

				define <8 x i16> @cttz_v8i16(<8 x i16> %op) #0 {
				; CHECK-LABEL: cttz_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: rbit z0.h, p0/m, z0.h
				; CHECK-NEXT: clz z0.h, p0/m, z0.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <8 x i16> @llvm.cttz.v8i16(<8 x i16> %op)
				ret <8 x i16> %res
				}

				define void @cttz_v16i16(<16 x i16>* %a) #0 {
				; CHECK-LABEL: cttz_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.h, vl8
				; CHECK-NEXT: rbit z0.h, p0/m, z0.h
				; CHECK-NEXT: clz z0.h, p0/m, z0.h
				; CHECK-NEXT: rbit z1.h, p0/m, z1.h
				; CHECK-NEXT: clz z1.h, p0/m, z1.h
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <16 x i16>, <16 x i16>* %a
				%res = call <16 x i16> @llvm.cttz.v16i16(<16 x i16> %op)
				store <16 x i16> %res, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @cttz_v2i32(<2 x i32> %op) #0 {
				; CHECK-LABEL: cttz_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl2
				; CHECK-NEXT: rbit z0.s, p0/m, z0.s
				; CHECK-NEXT: clz z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i32> @llvm.cttz.v2i32(<2 x i32> %op)
				ret <2 x i32> %res
				}

				define <4 x i32> @cttz_v4i32(<4 x i32> %op) #0 {
				; CHECK-LABEL: cttz_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: rbit z0.s, p0/m, z0.s
				; CHECK-NEXT: clz z0.s, p0/m, z0.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <4 x i32> @llvm.cttz.v4i32(<4 x i32> %op)
				ret <4 x i32> %res
				}

				define void @cttz_v8i32(<8 x i32>* %a) #0 {
				; CHECK-LABEL: cttz_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: rbit z0.s, p0/m, z0.s
				; CHECK-NEXT: clz z0.s, p0/m, z0.s
				; CHECK-NEXT: rbit z1.s, p0/m, z1.s
				; CHECK-NEXT: clz z1.s, p0/m, z1.s
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <8 x i32>, <8 x i32>* %a
				%res = call <8 x i32> @llvm.cttz.v8i32(<8 x i32> %op)
				store <8 x i32> %res, <8 x i32>* %a
				ret void
				}

				define <1 x i64> @cttz_v1i64(<1 x i64> %op) #0 {
				; CHECK-LABEL: cttz_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl1
				; CHECK-NEXT: rbit z0.d, p0/m, z0.d
				; CHECK-NEXT: clz z0.d, p0/m, z0.d
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%res = call <1 x i64> @llvm.cttz.v1i64(<1 x i64> %op)
				ret <1 x i64> %res
				}

				define <2 x i64> @cttz_v2i64(<2 x i64> %op) #0 {
				; CHECK-LABEL: cttz_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: rbit z0.d, p0/m, z0.d
				; CHECK-NEXT: clz z0.d, p0/m, z0.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%res = call <2 x i64> @llvm.cttz.v2i64(<2 x i64> %op)
				ret <2 x i64> %res
				}

				define void @cttz_v4i64(<4 x i64>* %a) #0 {
				; CHECK-LABEL: cttz_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.d, vl2
				; CHECK-NEXT: rbit z0.d, p0/m, z0.d
				; CHECK-NEXT: clz z0.d, p0/m, z0.d
				; CHECK-NEXT: rbit z1.d, p0/m, z1.d
				; CHECK-NEXT: clz z1.d, p0/m, z1.d
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op = load <4 x i64>, <4 x i64>* %a
				%res = call <4 x i64> @llvm.cttz.v4i64(<4 x i64> %op)
				store <4 x i64> %res, <4 x i64>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }

				declare <4 x i8> @llvm.ctlz.v4i8(<4 x i8>)
				declare <8 x i8> @llvm.ctlz.v8i8(<8 x i8>)
				declare <16 x i8> @llvm.ctlz.v16i8(<16 x i8>)
				declare <32 x i8> @llvm.ctlz.v32i8(<32 x i8>)
				declare <2 x i16> @llvm.ctlz.v2i16(<2 x i16>)
				declare <4 x i16> @llvm.ctlz.v4i16(<4 x i16>)
				declare <8 x i16> @llvm.ctlz.v8i16(<8 x i16>)
				declare <16 x i16> @llvm.ctlz.v16i16(<16 x i16>)
				declare <2 x i32> @llvm.ctlz.v2i32(<2 x i32>)
				declare <4 x i32> @llvm.ctlz.v4i32(<4 x i32>)
				declare <8 x i32> @llvm.ctlz.v8i32(<8 x i32>)
				declare <1 x i64> @llvm.ctlz.v1i64(<1 x i64>)
				declare <2 x i64> @llvm.ctlz.v2i64(<2 x i64>)
				declare <4 x i64> @llvm.ctlz.v4i64(<4 x i64>)

				declare <4 x i8> @llvm.ctpop.v4i8(<4 x i8>)
				declare <8 x i8> @llvm.ctpop.v8i8(<8 x i8>)
				declare <16 x i8> @llvm.ctpop.v16i8(<16 x i8>)
				declare <32 x i8> @llvm.ctpop.v32i8(<32 x i8>)
				declare <2 x i16> @llvm.ctpop.v2i16(<2 x i16>)
				declare <4 x i16> @llvm.ctpop.v4i16(<4 x i16>)
				declare <8 x i16> @llvm.ctpop.v8i16(<8 x i16>)
				declare <16 x i16> @llvm.ctpop.v16i16(<16 x i16>)
				declare <2 x i32> @llvm.ctpop.v2i32(<2 x i32>)
				declare <4 x i32> @llvm.ctpop.v4i32(<4 x i32>)
				declare <8 x i32> @llvm.ctpop.v8i32(<8 x i32>)
				declare <1 x i64> @llvm.ctpop.v1i64(<1 x i64>)
				declare <2 x i64> @llvm.ctpop.v2i64(<2 x i64>)
				declare <4 x i64> @llvm.ctpop.v4i64(<4 x i64>)

				declare <4 x i8> @llvm.cttz.v4i8(<4 x i8>)
				declare <8 x i8> @llvm.cttz.v8i8(<8 x i8>)
				declare <16 x i8> @llvm.cttz.v16i8(<16 x i8>)
				declare <32 x i8> @llvm.cttz.v32i8(<32 x i8>)
				declare <2 x i16> @llvm.cttz.v2i16(<2 x i16>)
				declare <4 x i16> @llvm.cttz.v4i16(<4 x i16>)
				declare <8 x i16> @llvm.cttz.v8i16(<8 x i16>)
				declare <16 x i16> @llvm.cttz.v16i16(<16 x i16>)
				declare <2 x i32> @llvm.cttz.v2i32(<2 x i32>)
				declare <4 x i32> @llvm.cttz.v4i32(<4 x i32>)
				declare <8 x i32> @llvm.cttz.v8i32(<8 x i32>)
				declare <1 x i64> @llvm.cttz.v1i64(<1 x i64>)
				declare <2 x i64> @llvm.cttz.v2i64(<2 x i64>)
				declare <4 x i64> @llvm.cttz.v4i64(<4 x i64>)
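
The checks for the promoted cases (`ctlz_v4i8`, `ctpop_v4i8`, `cttz_v4i8`) operate on 16-bit lanes that hold 8-bit values, and `cttz` is expressed as `rbit` followed by `clz`. The sketch below models one such lane in scalar code. The constant-pool values (`.LCPI…`) are not shown in the diff, so the mask `0x00FF`, the bias `8` and the guard bit `0x0100` are assumptions based on how such promotions are usually done; all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Leading zero count of a 16-bit lane; yields 16 for a zero input.
constexpr unsigned clz16(uint16_t X) {
  unsigned N = 0;
  for (uint16_t Bit = 0x8000; Bit != 0 && (X & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

// Bit reversal of a 16-bit lane (bit I moves to bit 15 - I).
constexpr uint16_t rbit16(uint16_t X) {
  uint16_t R = 0;
  for (int I = 0; I < 16; ++I)
    if (X & (1u << I))
      R = static_cast<uint16_t>(R | (1u << (15 - I)));
  return R;
}

// Population count of a 16-bit lane.
constexpr unsigned popcount16(uint16_t X) {
  unsigned N = 0;
  for (; X != 0; X = static_cast<uint16_t>(X & (X - 1)))
    ++N;
  return N;
}

// and + clz + sub: clear the undefined upper byte, count in 16 bits, then
// subtract the 8 extra leading zeros introduced by the wider lane.
constexpr unsigned ctlz8InLane(uint16_t Lane) {
  return clz16(Lane & 0x00FF) - 8;
}

// and + cnt: clear the upper byte so only the 8 payload bits are counted.
constexpr unsigned ctpop8InLane(uint16_t Lane) {
  return popcount16(Lane & 0x00FF);
}

// orr + rbit + clz: set a guard bit just above the payload so a zero input
// yields 8, then count trailing zeros as leading zeros of the reversal.
constexpr unsigned cttz8InLane(uint16_t Lane) {
  return clz16(rbit16(static_cast<uint16_t>(Lane | 0x0100)));
}
```

The guard bit also makes the upper byte irrelevant for `cttz`, since the lowest set bit can never be above bit 8; that is consistent with the `orr` (rather than `and`) seen in the `cttz` checks.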

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-bitselect.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				target triple = "aarch64"

				;
				; NOTE: SVE lowering for the BSP pseudoinst is not currently implemented, so we
				; don't currently expect the code below to lower to BSL/BIT/BIF. Once
				; this is implemented, this test will be fleshed out.
				;

				hassnaa-arm (author): I left that comment intentionally so that we can choose which solution is better: the current solution (disabling the combine of OR into BSP), or implementing SVE lowering for the BSP pseudo-instruction as the comment suggests.
				define <8 x i32> @fixed_bitselect_v8i32(<8 x i32>* %pre_cond_ptr, <8 x i32>* %left_ptr, <8 x i32>* %right_ptr) #0 {
				; CHECK-LABEL: fixed_bitselect_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: adrp x8, .LCPI0_0
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldr q2, [x8, :lo12:.LCPI0_0]
				; CHECK-NEXT: adrp x8, .LCPI0_1
				; CHECK-NEXT: ldp q3, q4, [x1]
				; CHECK-NEXT: sub z6.s, z2.s, z1.s
				; CHECK-NEXT: sub z2.s, z2.s, z0.s
				; CHECK-NEXT: and z3.d, z6.d, z3.d
				; CHECK-NEXT: ldp q7, q16, [x2]
				; CHECK-NEXT: and z2.d, z2.d, z4.d
				; CHECK-NEXT: ldr q5, [x8, :lo12:.LCPI0_1]
				; CHECK-NEXT: add z1.s, z1.s, z5.s
				; CHECK-NEXT: add z0.s, z0.s, z5.s
				; CHECK-NEXT: and z4.d, z0.d, z16.d
				; CHECK-NEXT: and z0.d, z1.d, z7.d
				; CHECK-NEXT: orr z0.d, z0.d, z3.d
				; CHECK-NEXT: orr z1.d, z4.d, z2.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 killed $z1
				; CHECK-NEXT: ret
				%pre_cond = load <8 x i32>, <8 x i32>* %pre_cond_ptr
				%left = load <8 x i32>, <8 x i32>* %left_ptr
				%right = load <8 x i32>, <8 x i32>* %right_ptr

				%neg_cond = sub <8 x i32> zeroinitializer, %pre_cond
				%min_cond = add <8 x i32> %pre_cond, <i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1, i32 -1>
				%left_bits_0 = and <8 x i32> %neg_cond, %left
				%right_bits_0 = and <8 x i32> %min_cond, %right
				%bsl0000 = or <8 x i32> %right_bits_0, %left_bits_0
				ret <8 x i32> %bsl0000
				}

				attributes #0 = { "target-features"="+sve" }
				hassnaa-arm (author): Should I add more test cases to this test file? It seems that the original test file, sve-fixed-length-bitselect.ll, tests only one specific case (a single vector size).
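
The IR in `fixed_bitselect_v8i32` forms a bit select because its two masks are bitwise complements: `0 - c` equals `~(c - 1)` in two's complement arithmetic. The scalar sketch below models one `i32` lane of that IR; the function names are hypothetical and chosen to match the IR value names.

```cpp
#include <cassert>
#include <cstdint>

// %neg_cond = sub zeroinitializer, %pre_cond
constexpr uint32_t negCond(uint32_t C) { return 0u - C; }

// %min_cond = add %pre_cond, -1
constexpr uint32_t minCond(uint32_t C) { return C + 0xFFFFFFFFu; }

// %bsl0000 = or (and %min_cond, %right), (and %neg_cond, %left)
constexpr uint32_t bitselect(uint32_t C, uint32_t Left, uint32_t Right) {
  return (minCond(C) & Right) | (negCond(C) & Left);
}

// The two masks never overlap and together cover every bit, so each result
// bit comes from exactly one of Left or Right.
static_assert(negCond(6u) == static_cast<uint32_t>(~minCond(6u)),
              "masks are complements");
```

Because each result bit is taken from exactly one input, the `and`/`and`/`orr` sequence in the CHECK lines computes the same value that a single bit-select instruction would; the patch only changes which instructions are emitted in streaming-compatible mode.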

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-insert-vector-elt.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				;
				; insertelement
				;

				; i8
				define <4 x i8> @insertelement_v4i8(<4 x i8> %op1) #0 {
				; CHECK-LABEL: insertelement_v4i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #3
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z2.h, #0, #1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: mov z1.h, w8
				; CHECK-NEXT: cmpeq p0.h, p0/z, z2.h, z1.h
				; CHECK-NEXT: mov z0.h, p0/m, w9
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <4 x i8> %op1, i8 5, i64 3
				ret <4 x i8> %r
				}

				define <8 x i8> @insertelement_v8i8(<8 x i8> %op1) #0 {
				; CHECK-LABEL: insertelement_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #7
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z2.b, #0, #1
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: mov z1.b, w8
				; CHECK-NEXT: cmpeq p0.b, p0/z, z2.b, z1.b
				; CHECK-NEXT: mov z0.b, p0/m, w9
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <8 x i8> %op1, i8 5, i64 7
				ret <8 x i8> %r
				}

				define <16 x i8> @insertelement_v16i8(<16 x i8> %op1) #0 {
				; CHECK-LABEL: insertelement_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #15
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z2.b, #0, #1
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: mov z1.b, w8
				; CHECK-NEXT: cmpeq p0.b, p0/z, z2.b, z1.b
				; CHECK-NEXT: mov z0.b, p0/m, w9
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <16 x i8> %op1, i8 5, i64 15
				ret <16 x i8> %r
				}

				define <32 x i8> @insertelement_v32i8(<32 x i8> %op1) #0 {
				; CHECK-LABEL: insertelement_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #15
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z3.b, #0, #1
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mov z2.b, w8
				; CHECK-NEXT: cmpeq p0.b, p0/z, z3.b, z2.b
				; CHECK-NEXT: mov z1.b, p0/m, w9
				; CHECK-NEXT: // kill: def $q1 killed $q1 killed $z1
				; CHECK-NEXT: ret
				%r = insertelement <32 x i8> %op1, i8 5, i64 31
				ret <32 x i8> %r
				}

				; i16
				define <2 x i16> @insertelement_v2i16(<2 x i16> %op1) #0 {
				; CHECK-LABEL: insertelement_v2i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #1
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z2.s, #0, #1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: cmpeq p0.s, p0/z, z2.s, z1.s
				; CHECK-NEXT: mov z0.s, p0/m, w9
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <2 x i16> %op1, i16 5, i64 1
				ret <2 x i16> %r
				}

				define <4 x i16> @insertelement_v4i16(<4 x i16> %op1) #0 {
				; CHECK-LABEL: insertelement_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #3
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z2.h, #0, #1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: mov z1.h, w8
				; CHECK-NEXT: cmpeq p0.h, p0/z, z2.h, z1.h
				; CHECK-NEXT: mov z0.h, p0/m, w9
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <4 x i16> %op1, i16 5, i64 3
				ret <4 x i16> %r
				}

				define <8 x i16> @insertelement_v8i16(<8 x i16> %op1) #0 {
				; CHECK-LABEL: insertelement_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #7
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z2.h, #0, #1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: mov z1.h, w8
				; CHECK-NEXT: cmpeq p0.h, p0/z, z2.h, z1.h
				; CHECK-NEXT: mov z0.h, p0/m, w9
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <8 x i16> %op1, i16 5, i64 7
				ret <8 x i16> %r
				}

				define <16 x i16> @insertelement_v16i16(<16 x i16> %op1) #0 {
				; CHECK-LABEL: insertelement_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #7
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z3.h, #0, #1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: cmpeq p0.h, p0/z, z3.h, z2.h
				; CHECK-NEXT: mov z1.h, p0/m, w9
				; CHECK-NEXT: // kill: def $q1 killed $q1 killed $z1
				; CHECK-NEXT: ret
				%r = insertelement <16 x i16> %op1, i16 5, i64 15
				ret <16 x i16> %r
				}

				;i32
				define <2 x i32> @insertelement_v2i32(<2 x i32> %op1) #0 {
				; CHECK-LABEL: insertelement_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #1
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z2.s, #0, #1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: cmpeq p0.s, p0/z, z2.s, z1.s
				; CHECK-NEXT: mov z0.s, p0/m, w9
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <2 x i32> %op1, i32 5, i64 1
				ret <2 x i32> %r
				}

				define <4 x i32> @insertelement_v4i32(<4 x i32> %op1) #0 {
				; CHECK-LABEL: insertelement_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #3
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z2.s, #0, #1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: mov z1.s, w8
				; CHECK-NEXT: cmpeq p0.s, p0/z, z2.s, z1.s
				; CHECK-NEXT: mov z0.s, p0/m, w9
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <4 x i32> %op1, i32 5, i64 3
				ret <4 x i32> %r
				}

				define <8 x i32> @insertelement_v8i32(<8 x i32>* %a) #0 {
				; CHECK-LABEL: insertelement_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #3
				; CHECK-NEXT: index z3.s, #0, #1
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov z2.s, w8
				; CHECK-NEXT: mov w8, #5
				; CHECK-NEXT: cmpeq p0.s, p0/z, z3.s, z2.s
				; CHECK-NEXT: mov z1.s, p0/m, w8
				; CHECK-NEXT: // kill: def $q1 killed $q1 killed $z1
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%r = insertelement <8 x i32> %op1, i32 5, i64 7
				ret <8 x i32> %r
				}

				;i64
				define <1 x i64> @insertelement_v1i64(<1 x i64> %op1) #0 {
				; CHECK-LABEL: insertelement_v1i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #5
				; CHECK-NEXT: fmov d0, x8
				; CHECK-NEXT: ret
				%r = insertelement <1 x i64> %op1, i64 5, i64 0
				ret <1 x i64> %r
				}

				define <2 x i64> @insertelement_v2i64(<2 x i64> %op1) #0 {
				; CHECK-LABEL: insertelement_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #1
				; CHECK-NEXT: mov w9, #5
				; CHECK-NEXT: index z2.d, #0, #1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: mov z1.d, x8
				; CHECK-NEXT: cmpeq p0.d, p0/z, z2.d, z1.d
				; CHECK-NEXT: mov z0.d, p0/m, x9
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <2 x i64> %op1, i64 5, i64 1
				ret <2 x i64> %r
				}

				define <4 x i64> @insertelement_v4i64(<4 x i64>* %a) #0 {
				; CHECK-LABEL: insertelement_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #1
				; CHECK-NEXT: index z3.d, #0, #1
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov z2.d, x8
				; CHECK-NEXT: mov w8, #5
				; CHECK-NEXT: cmpeq p0.d, p0/z, z3.d, z2.d
				; CHECK-NEXT: mov z1.d, p0/m, x8
				; CHECK-NEXT: // kill: def $q1 killed $q1 killed $z1
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%r = insertelement <4 x i64> %op1, i64 5, i64 3
				ret <4 x i64> %r
				}

				;f16
				define <2 x half> @insertelement_v2f16(<2 x half> %op1) #0 {
				; CHECK-LABEL: insertelement_v2f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: fmov h1, #5.00000000
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: str h0, [sp, #8]
				; CHECK-NEXT: str h1, [sp, #10]
				; CHECK-NEXT: ldr d0, [sp, #8]
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%r = insertelement <2 x half> %op1, half 5.0, i64 1
				ret <2 x half> %r
				}

				define <4 x half> @insertelement_v4f16(<4 x half> %op1) #0 {
				; CHECK-LABEL: insertelement_v4f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #3
				; CHECK-NEXT: fmov h1, #5.00000000
				; CHECK-NEXT: index z3.h, #0, #1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: cmpeq p0.h, p0/z, z3.h, z2.h
				; CHECK-NEXT: mov z0.h, p0/m, h1
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <4 x half> %op1, half 5.0, i64 3
				ret <4 x half> %r
				}

				define <8 x half> @insertelement_v8f16(<8 x half> %op1) #0 {
				; CHECK-LABEL: insertelement_v8f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #7
				; CHECK-NEXT: fmov h1, #5.00000000
				; CHECK-NEXT: index z3.h, #0, #1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: cmpeq p0.h, p0/z, z3.h, z2.h
				; CHECK-NEXT: mov z0.h, p0/m, h1
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <8 x half> %op1, half 5.0, i64 7
				ret <8 x half> %r
				}

				define <16 x half> @insertelement_v16f16(<16 x half>* %a) #0 {
				; CHECK-LABEL: insertelement_v16f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: mov w8, #7
				; CHECK-NEXT: fmov h3, #5.00000000
				; CHECK-NEXT: index z4.h, #0, #1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: cmpeq p0.h, p0/z, z4.h, z2.h
				; CHECK-NEXT: mov z1.h, p0/m, h3
				; CHECK-NEXT: // kill: def $q1 killed $q1 killed $z1
				; CHECK-NEXT: ret
				%op1 = load <16 x half>, <16 x half>* %a
				%r = insertelement <16 x half> %op1, half 5.0, i64 15
				ret <16 x half> %r
				}

				;f32
				define <2 x float> @insertelement_v2f32(<2 x float> %op1) #0 {
				; CHECK-LABEL: insertelement_v2f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #1
				; CHECK-NEXT: fmov s1, #5.00000000
				; CHECK-NEXT: index z3.s, #0, #1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: mov z2.s, w8
				; CHECK-NEXT: cmpeq p0.s, p0/z, z3.s, z2.s
				; CHECK-NEXT: mov z0.s, p0/m, s1
				; CHECK-NEXT: // kill: def $d0 killed $d0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <2 x float> %op1, float 5.0, i64 1
				ret <2 x float> %r
				}

				define <4 x float> @insertelement_v4f32(<4 x float> %op1) #0 {
				; CHECK-LABEL: insertelement_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #3
				; CHECK-NEXT: fmov s1, #5.00000000
				; CHECK-NEXT: index z3.s, #0, #1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: mov z2.s, w8
				; CHECK-NEXT: cmpeq p0.s, p0/z, z3.s, z2.s
				; CHECK-NEXT: mov z0.s, p0/m, s1
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <4 x float> %op1, float 5.0, i64 3
				ret <4 x float> %r
				}

				define <8 x float> @insertelement_v8f32(<8 x float>* %a) #0 {
				; CHECK-LABEL: insertelement_v8f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: mov w8, #3
				; CHECK-NEXT: fmov s4, #5.00000000
				; CHECK-NEXT: index z2.s, #0, #1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov z3.s, w8
				; CHECK-NEXT: cmpeq p0.s, p0/z, z2.s, z3.s
				; CHECK-NEXT: mov z1.s, p0/m, s4
				; CHECK-NEXT: // kill: def $q1 killed $q1 killed $z1
				; CHECK-NEXT: ret
				%op1 = load <8 x float>, <8 x float>* %a
				%r = insertelement <8 x float> %op1, float 5.0, i64 7
				ret <8 x float> %r
				}

				;f64
				define <1 x double> @insertelement_v1f64(<1 x double> %op1) #0 {
				; CHECK-LABEL: insertelement_v1f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: fmov d0, #5.00000000
				; CHECK-NEXT: ret
				%r = insertelement <1 x double> %op1, double 5.0, i64 0
				ret <1 x double> %r
				}

				define <2 x double> @insertelement_v2f64(<2 x double> %op1) #0 {
				; CHECK-LABEL: insertelement_v2f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov w8, #1
				; CHECK-NEXT: fmov d1, #5.00000000
				; CHECK-NEXT: index z3.d, #0, #1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: mov z2.d, x8
				; CHECK-NEXT: cmpeq p0.d, p0/z, z3.d, z2.d
				; CHECK-NEXT: mov z0.d, p0/m, d1
				; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0
				; CHECK-NEXT: ret
				%r = insertelement <2 x double> %op1, double 5.0, i64 1
				ret <2 x double> %r
				}

				define <4 x double> @insertelement_v4f64(<4 x double>* %a) #0 {
				; CHECK-LABEL: insertelement_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: mov w8, #1
				; CHECK-NEXT: fmov d3, #5.00000000
				; CHECK-NEXT: index z4.d, #0, #1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov z2.d, x8
				; CHECK-NEXT: cmpeq p0.d, p0/z, z4.d, z2.d
				; CHECK-NEXT: mov z1.d, p0/m, d3
				; CHECK-NEXT: // kill: def $q1 killed $q1 killed $z1
				; CHECK-NEXT: ret
				%op1 = load <4 x double>, <4 x double>* %a
				%r = insertelement <4 x double> %op1, double 5.0, i64 3
				ret <4 x double> %r
				}

				attributes #0 = { "target-features"="+sve" }

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-subvector.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				define void @subvector_v8i16(<8 x i16>* %in, <8 x i16>* %out) #0 {
				; CHECK-LABEL: subvector_v8i16:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: str q0, [x1]
				; CHECK-NEXT: ret
				%a = load <8 x i16>, <8 x i16>* %in
				br label %bb1

				bb1:
				store <8 x i16> %a, <8 x i16>* %out
				ret void
				}

				define void @subvector_v16i16(<16 x i16>* %in, <16 x i16>* %out) #0 {
				; CHECK-LABEL: subvector_v16i16:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: stp q0, q1, [x1]
				; CHECK-NEXT: ret
				%a = load <16 x i16>, <16 x i16>* %in
				br label %bb1

				bb1:
				store <16 x i16> %a, <16 x i16>* %out
				ret void
				}

				define void @subvector_v32i16(<32 x i16>* %in, <32 x i16>* %out) #0 {
				; CHECK-LABEL: subvector_v32i16:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0, #32]
				; CHECK-NEXT: ldp q2, q3, [x0]
				; CHECK-NEXT: stp q0, q1, [x1, #32]
				; CHECK-NEXT: stp q2, q3, [x1]
				; CHECK-NEXT: ret
				%a = load <32 x i16>, <32 x i16>* %in
				br label %bb1

				bb1:
				store <32 x i16> %a, <32 x i16>* %out
				ret void
				}

				define void @subvector_v64i16(<64 x i16>* %in, <64 x i16>* %out) #0 {
				; CHECK-LABEL: subvector_v64i16:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #96]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #32]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q3, q2, [x1, #96]
				; CHECK-NEXT: stp q7, q6, [x1, #32]
				; CHECK-NEXT: ret
				%a = load <64 x i16>, <64 x i16>* %in
				br label %bb1

				bb1:
				store <64 x i16> %a, <64 x i16>* %out
				ret void
				}

				define void @subvector_v8i32(<8 x i32>* %in, <8 x i32>* %out) #0 {
				; CHECK-LABEL: subvector_v8i32:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: stp q0, q1, [x1]
				; CHECK-NEXT: ret
				%a = load <8 x i32>, <8 x i32>* %in
				br label %bb1

				bb1:
				store <8 x i32> %a, <8 x i32>* %out
				ret void
				}

				define void @subvector_v16i32(<16 x i32>* %in, <16 x i32>* %out) #0 {
				; CHECK-LABEL: subvector_v16i32:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0, #32]
				; CHECK-NEXT: ldp q2, q3, [x0]
				; CHECK-NEXT: stp q0, q1, [x1, #32]
				; CHECK-NEXT: stp q2, q3, [x1]
				; CHECK-NEXT: ret
				%a = load <16 x i32>, <16 x i32>* %in
				br label %bb1

				bb1:
				store <16 x i32> %a, <16 x i32>* %out
				ret void
				}

				define void @subvector_v32i32(<32 x i32>* %in, <32 x i32>* %out) #0 {
				; CHECK-LABEL: subvector_v32i32:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #96]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #32]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q3, q2, [x1, #96]
				; CHECK-NEXT: stp q7, q6, [x1, #32]
				; CHECK-NEXT: ret
				%a = load <32 x i32>, <32 x i32>* %in
				br label %bb1

				bb1:
				store <32 x i32> %a, <32 x i32>* %out
				ret void
				}

				define void @subvector_v64i32(<64 x i32>* %in, <64 x i32>* %out) #0 {
				; CHECK-LABEL: subvector_v64i32:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #32]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #96]
				; CHECK-NEXT: ldp q17, q16, [x0, #128]
				; CHECK-NEXT: ldp q19, q18, [x0, #224]
				; CHECK-NEXT: ldp q21, q20, [x0, #192]
				; CHECK-NEXT: ldp q23, q22, [x0, #160]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q3, q2, [x1, #32]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q7, q6, [x1, #96]
				; CHECK-NEXT: stp q17, q16, [x1, #128]
				; CHECK-NEXT: stp q23, q22, [x1, #160]
				; CHECK-NEXT: stp q21, q20, [x1, #192]
				; CHECK-NEXT: stp q19, q18, [x1, #224]
				; CHECK-NEXT: ret
				%a = load <64 x i32>, <64 x i32>* %in
				br label %bb1

				bb1:
				store <64 x i32> %a, <64 x i32>* %out
				ret void
				}


				define void @subvector_v8i64(<8 x i64>* %in, <8 x i64>* %out) #0 {
				; CHECK-LABEL: subvector_v8i64:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0, #32]
				; CHECK-NEXT: ldp q2, q3, [x0]
				; CHECK-NEXT: stp q0, q1, [x1, #32]
				; CHECK-NEXT: stp q2, q3, [x1]
				; CHECK-NEXT: ret
				%a = load <8 x i64>, <8 x i64>* %in
				br label %bb1

				bb1:
				store <8 x i64> %a, <8 x i64>* %out
				ret void
				}

				define void @subvector_v16i64(<16 x i64>* %in, <16 x i64>* %out) #0 {
				; CHECK-LABEL: subvector_v16i64:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #96]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #32]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q3, q2, [x1, #96]
				; CHECK-NEXT: stp q7, q6, [x1, #32]
				; CHECK-NEXT: ret
				%a = load <16 x i64>, <16 x i64>* %in
				br label %bb1

				bb1:
				store <16 x i64> %a, <16 x i64>* %out
				ret void
				}

				define void @subvector_v32i64(<32 x i64>* %in, <32 x i64>* %out) #0 {
				; CHECK-LABEL: subvector_v32i64:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #32]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #96]
				; CHECK-NEXT: ldp q17, q16, [x0, #128]
				; CHECK-NEXT: ldp q19, q18, [x0, #224]
				; CHECK-NEXT: ldp q21, q20, [x0, #192]
				; CHECK-NEXT: ldp q23, q22, [x0, #160]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q3, q2, [x1, #32]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q7, q6, [x1, #96]
				; CHECK-NEXT: stp q17, q16, [x1, #128]
				; CHECK-NEXT: stp q23, q22, [x1, #160]
				; CHECK-NEXT: stp q21, q20, [x1, #192]
				; CHECK-NEXT: stp q19, q18, [x1, #224]
				; CHECK-NEXT: ret
				%a = load <32 x i64>, <32 x i64>* %in
				br label %bb1

				bb1:
				store <32 x i64> %a, <32 x i64>* %out
				ret void
				}

				define void @subvector_v8f16(<8 x half>* %in, <8 x half>* %out) #0 {
				; CHECK-LABEL: subvector_v8f16:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: str q0, [x1]
				; CHECK-NEXT: ret
				%a = load <8 x half>, <8 x half>* %in
				br label %bb1

				bb1:
				store <8 x half> %a, <8 x half>* %out
				ret void
				}

				define void @subvector_v16f16(<16 x half>* %in, <16 x half>* %out) #0 {
				; CHECK-LABEL: subvector_v16f16:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: stp q0, q1, [x1]
				; CHECK-NEXT: ret
				%a = load <16 x half>, <16 x half>* %in
				br label %bb1

				bb1:
				store <16 x half> %a, <16 x half>* %out
				ret void
				}

				define void @subvector_v32f16(<32 x half>* %in, <32 x half>* %out) #0 {
				; CHECK-LABEL: subvector_v32f16:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0, #32]
				; CHECK-NEXT: ldp q2, q3, [x0]
				; CHECK-NEXT: stp q0, q1, [x1, #32]
				; CHECK-NEXT: stp q2, q3, [x1]
				; CHECK-NEXT: ret
				%a = load <32 x half>, <32 x half>* %in
				br label %bb1

				bb1:
				store <32 x half> %a, <32 x half>* %out
				ret void
				}

				define void @subvector_v64f16(<64 x half>* %in, <64 x half>* %out) #0 {
				; CHECK-LABEL: subvector_v64f16:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #96]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #32]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q3, q2, [x1, #96]
				; CHECK-NEXT: stp q7, q6, [x1, #32]
				; CHECK-NEXT: ret
				%a = load <64 x half>, <64 x half>* %in
				br label %bb1

				bb1:
				store <64 x half> %a, <64 x half>* %out
				ret void
				}

				define void @subvector_v8f32(<8 x float>* %in, <8 x float>* %out) #0 {
				; CHECK-LABEL: subvector_v8f32:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0]
				; CHECK-NEXT: stp q0, q1, [x1]
				; CHECK-NEXT: ret
				%a = load <8 x float>, <8 x float>* %in
				br label %bb1

				bb1:
				store <8 x float> %a, <8 x float>* %out
				ret void
				}

				define void @subvector_v16f32(<16 x float>* %in, <16 x float>* %out) #0 {
				; CHECK-LABEL: subvector_v16f32:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0, #32]
				; CHECK-NEXT: ldp q2, q3, [x0]
				; CHECK-NEXT: stp q0, q1, [x1, #32]
				; CHECK-NEXT: stp q2, q3, [x1]
				; CHECK-NEXT: ret
				%a = load <16 x float>, <16 x float>* %in
				br label %bb1

				bb1:
				store <16 x float> %a, <16 x float>* %out
				ret void
				}

				define void @subvector_v32f32(<32 x float>* %in, <32 x float>* %out) #0 {
				; CHECK-LABEL: subvector_v32f32:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #96]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #32]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q3, q2, [x1, #96]
				; CHECK-NEXT: stp q7, q6, [x1, #32]
				; CHECK-NEXT: ret
				%a = load <32 x float>, <32 x float>* %in
				br label %bb1

				bb1:
				store <32 x float> %a, <32 x float>* %out
				ret void
				}

				define void @subvector_v64f32(<64 x float>* %in, <64 x float>* %out) #0 {
				; CHECK-LABEL: subvector_v64f32:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #32]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #96]
				; CHECK-NEXT: ldp q17, q16, [x0, #128]
				; CHECK-NEXT: ldp q19, q18, [x0, #224]
				; CHECK-NEXT: ldp q21, q20, [x0, #192]
				; CHECK-NEXT: ldp q23, q22, [x0, #160]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q3, q2, [x1, #32]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q7, q6, [x1, #96]
				; CHECK-NEXT: stp q17, q16, [x1, #128]
				; CHECK-NEXT: stp q23, q22, [x1, #160]
				; CHECK-NEXT: stp q21, q20, [x1, #192]
				; CHECK-NEXT: stp q19, q18, [x1, #224]
				; CHECK-NEXT: ret
				%a = load <64 x float>, <64 x float>* %in
				br label %bb1

				bb1:
				store <64 x float> %a, <64 x float>* %out
				ret void
				}
				define void @subvector_v8f64(<8 x double>* %in, <8 x double>* %out) #0 {
				; CHECK-LABEL: subvector_v8f64:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q0, q1, [x0, #32]
				; CHECK-NEXT: ldp q2, q3, [x0]
				; CHECK-NEXT: stp q0, q1, [x1, #32]
				; CHECK-NEXT: stp q2, q3, [x1]
				; CHECK-NEXT: ret
				%a = load <8 x double>, <8 x double>* %in
				br label %bb1

				bb1:
				store <8 x double> %a, <8 x double>* %out
				ret void
				}

				define void @subvector_v16f64(<16 x double>* %in, <16 x double>* %out) #0 {
				; CHECK-LABEL: subvector_v16f64:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #96]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #32]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q3, q2, [x1, #96]
				; CHECK-NEXT: stp q7, q6, [x1, #32]
				; CHECK-NEXT: ret
				%a = load <16 x double>, <16 x double>* %in
				br label %bb1

				bb1:
				store <16 x double> %a, <16 x double>* %out
				ret void
				}

				define void @subvector_v32f64(<32 x double>* %in, <32 x double>* %out) #0 {
				; CHECK-LABEL: subvector_v32f64:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q3, q2, [x0, #32]
				; CHECK-NEXT: ldp q5, q4, [x0, #64]
				; CHECK-NEXT: ldp q7, q6, [x0, #96]
				; CHECK-NEXT: ldp q17, q16, [x0, #128]
				; CHECK-NEXT: ldp q19, q18, [x0, #224]
				; CHECK-NEXT: ldp q21, q20, [x0, #192]
				; CHECK-NEXT: ldp q23, q22, [x0, #160]
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: stp q3, q2, [x1, #32]
				; CHECK-NEXT: stp q5, q4, [x1, #64]
				; CHECK-NEXT: stp q7, q6, [x1, #96]
				; CHECK-NEXT: stp q17, q16, [x1, #128]
				; CHECK-NEXT: stp q23, q22, [x1, #160]
				; CHECK-NEXT: stp q21, q20, [x1, #192]
				; CHECK-NEXT: stp q19, q18, [x1, #224]
				; CHECK-NEXT: ret
				%a = load <32 x double>, <32 x double>* %in
				br label %bb1

				bb1:
				store <32 x double> %a, <32 x double>* %out
				ret void
				}

				define <8 x i1> @no_warn_dropped_scalable(<8 x i32>* %in) #0 {
				; CHECK-LABEL: no_warn_dropped_scalable:
				; CHECK: // %bb.0: // %bb1
				; CHECK-NEXT: sub sp, sp, #16
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: adrp x8, .LCPI22_0
				; CHECK-NEXT: ptrue p0.s, vl4
				; CHECK-NEXT: ldp q2, q0, [x0]
				; CHECK-NEXT: ldr q1, [x8, :lo12:.LCPI22_0]
				; CHECK-NEXT: cmpgt p1.s, p0/z, z0.s, z1.s
				; CHECK-NEXT: cmpgt p0.s, p0/z, z2.s, z1.s
				; CHECK-NEXT: mov z0.s, p1/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: mov z1.s, p0/z, #-1 // =0xffffffffffffffff
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: mov z2.s, z0.s[3]
				; CHECK-NEXT: mov z3.s, z0.s[2]
				; CHECK-NEXT: mov z4.s, z0.s[1]
				; CHECK-NEXT: fmov w10, s2
				; CHECK-NEXT: strb w8, [sp, #12]
				; CHECK-NEXT: fmov w8, s3
				; CHECK-NEXT: strb w9, [sp, #8]
				; CHECK-NEXT: fmov w9, s4
				; CHECK-NEXT: mov z0.s, z1.s[3]
				; CHECK-NEXT: mov z5.s, z1.s[2]
				; CHECK-NEXT: mov z1.s, z1.s[1]
				; CHECK-NEXT: strb w10, [sp, #15]
				; CHECK-NEXT: fmov w10, s0
				; CHECK-NEXT: strb w8, [sp, #14]
				; CHECK-NEXT: fmov w8, s5
				; CHECK-NEXT: strb w9, [sp, #13]
				; CHECK-NEXT: fmov w9, s1
				; CHECK-NEXT: strb w10, [sp, #11]
				; CHECK-NEXT: strb w8, [sp, #10]
				; CHECK-NEXT: strb w9, [sp, #9]
				; CHECK-NEXT: ldr d0, [sp, #8]
				; CHECK-NEXT: add sp, sp, #16
				; CHECK-NEXT: ret
				%a = load <8 x i32>, <8 x i32>* %in
				br label %bb1

				bb1:
				%cond = icmp sgt <8 x i32> %a, zeroinitializer
				ret <8 x i1> %cond
				}

				; binop(insert_subvec(a), insert_subvec(b)) -> insert_subvec(binop(a,b)) like
				; combines remove redundant subvector operations. This test ensures it's not
				; performed when the input idiom is the result of operation legalisation. When
				; not prevented the test triggers infinite combine->legalise->combine->...
				define void @no_subvector_binop_hang(<8 x i32>* %in, <8 x i32>* %out, i1 %cond) #0 {
				; CHECK-LABEL: no_subvector_binop_hang:
				; CHECK: // %bb.0:
				; CHECK-NEXT: tbz w2, #0, .LBB23_2
				; CHECK-NEXT: // %bb.1: // %bb.1
				; CHECK-NEXT: ldp q1, q0, [x0]
				; CHECK-NEXT: ldp q2, q3, [x1]
				; CHECK-NEXT: orr z1.d, z1.d, z2.d
				; CHECK-NEXT: orr z0.d, z0.d, z3.d
				; CHECK-NEXT: stp q1, q0, [x1]
				; CHECK-NEXT: .LBB23_2: // %bb.2
				; CHECK-NEXT: ret
				%a = load <8 x i32>, <8 x i32>* %in
				%b = load <8 x i32>, <8 x i32>* %out
				br i1 %cond, label %bb.1, label %bb.2

				bb.1:
				%or = or <8 x i32> %a, %b
				store <8 x i32> %or, <8 x i32>* %out
				br label %bb.2

				bb.2:
				ret void
				}

				attributes #0 = { "target-features"="+sve" }

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-vector-shuffle.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				; define <4 x i8> @shuffle_ext_byone_v4i8(<4 x i8> %op1, <4 x i8> %op2) #0 {
				; %ret = shufflevector <4 x i8> %op1, <4 x i8> %op2, <4 x i32> <i32 7, i32 8, i32 9, i32 10>
				; ret <4 x i8> %ret
				; }
				hassnaa-arm (Author): When I uncomment this test, llc reports this error: invalid shufflevector operands: %ret = shufflevector <4 x i8> %op1, <4 x i8> %op2, <4 x i32> <i32 7, i32 8, i32 9, i32 10>
				sdesmalen: That is because elements 8, 9 and 10 are out of bounds when you concatenate %op1 and %op2 (8 elements in total, so the valid indices are 0 to 7). The following does work, for example: %ret = shufflevector <4 x i8> %op1, <4 x i8> %op2, <4 x i32> <i32 3, i32 4, i32 5, i32 6>
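
A minimal sketch of the bounds rule from the reply above, written in Python rather than LLVM IR so that it can be run directly. The helper name `shuffle_mask_is_valid` is hypothetical and is not an LLVM API. It models only the index-range rule (each mask index must select a lane of the concatenation of the two operands) and ignores undef/poison mask elements, which LLVM also accepts.

```python
def shuffle_mask_is_valid(num_elts_per_operand, mask):
    """Return True if every mask index selects a lane of concat(op1, op2).

    Hypothetical helper for illustration only: two operands with N lanes
    each concatenate to 2*N lanes, so valid indices are 0 .. 2*N - 1.
    """
    total_lanes = 2 * num_elts_per_operand
    return all(0 <= idx < total_lanes for idx in mask)


# The commented-out v4i8 test: indices 8, 9 and 10 exceed the last lane (7).
print(shuffle_mask_is_valid(4, [7, 8, 9, 10]))  # False
# The mask suggested in the reply stays within lanes 0..7.
print(shuffle_mask_is_valid(4, [3, 4, 5, 6]))   # True
```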

				define <8 x i8> @shuffle_ext_byone_v8i8(<8 x i8> %op1, <8 x i8> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v8i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mov z0.b, z0.b[7]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: insr z1.b, w8
				; CHECK-NEXT: fmov d0, d1
				; CHECK-NEXT: ret
				%ret = shufflevector <8 x i8> %op1, <8 x i8> %op2, <8 x i32> <i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>
				ret <8 x i8> %ret
				}

				define <16 x i8> @shuffle_ext_byone_v16i8(<16 x i8> %op1, <16 x i8> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mov z0.b, z0.b[15]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: insr z1.b, w8
				; CHECK-NEXT: orr q0.d, z1.d, z0.d
				; CHECK-NEXT: ret
				%ret = shufflevector <16 x i8> %op1, <16 x i8> %op2, <16 x i32> <i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22,
				i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30>
				ret <16 x i8> %ret
				}

				define void @shuffle_ext_byone_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0, #16]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: mov z0.b, z0.b[15]
				; CHECK-NEXT: mov z2.b, z1.b[15]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: ldr q0, [x1, #16]
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: insr z1.b, w8
				; CHECK-NEXT: insr z0.b, w9
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <32 x i8>, <32 x i8>* %a
				%op2 = load <32 x i8>, <32 x i8>* %b
				%ret = shufflevector <32 x i8> %op1, <32 x i8> %op2, <32 x i32> <i32 31, i32 32, i32 33, i32 34, i32 35, i32 36, i32 37, i32 38,
				i32 39, i32 40, i32 41, i32 42, i32 43, i32 44, i32 45, i32 46,
				i32 47, i32 48, i32 49, i32 50, i32 51, i32 52, i32 53, i32 54,
				i32 55, i32 56, i32 57, i32 58, i32 59, i32 60, i32 61, i32 62>
				store <32 x i8> %ret, <32 x i8>* %a
				ret void
				}

				; define <2 x i16> @shuffle_ext_byone_v2i16(<2 x i16> %op1, <2 x i16> %op2) #0 {
				; %ret = shufflevector <2 x i16> %op1, <2 x i16> %op2, <2 x i32> <i32 3, i32 4>
				; ret <2 x i16> %ret
				; }
				hassnaa-arm (Author): When I uncomment this test, llc returns this error: invalid shufflevector operands:
				%ret = shufflevector <2 x i16> %op1, <2 x i16> %op2, <2 x i32> <i32 3, i32 4>
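One possible reading of this error: two <2 x i16> operands concatenate to only 4 lanes, so index 4 cannot be selected. The sketch below is a hedged illustration, assuming the "ext by one" pattern used by the neighbouring tests (the last lane of %op1 followed by the leading lanes of %op2). `ext_byone_mask` is a hypothetical helper, not part of LLVM.

```python
def ext_byone_mask(num_elements):
    """Mask selecting the last lane of the first operand followed by the
    first num_elements - 1 lanes of the second operand."""
    return list(range(num_elements - 1, 2 * num_elements - 1))


print(ext_byone_mask(4))  # [3, 4, 5, 6], as in shuffle_ext_byone_v4i16
print(ext_byone_mask(2))  # [1, 2], as in shuffle_ext_byone_v2i32
# The commented-out v2i16 mask is 3, 4; index 4 is not below the 4-lane limit.
print(max([3, 4]) < 2 * 2)  # False
```

Under that assumption, a two-lane test would use the mask `1, 2`, which is the mask the v2i32 test in this file already uses.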

				define <4 x i16> @shuffle_ext_byone_v4i16(<4 x i16> %op1, <4 x i16> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v4i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mov z0.h, z0.h[3]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: insr z1.h, w8
				; CHECK-NEXT: fmov d0, d1
				; CHECK-NEXT: ret
				%ret = shufflevector <4 x i16> %op1, <4 x i16> %op2, <4 x i32> <i32 3, i32 4, i32 5, i32 6>
				ret <4 x i16> %ret
				}

				define <8 x i16> @shuffle_ext_byone_v8i16(<8 x i16> %op1, <8 x i16> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mov z0.h, z0.h[7]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: insr z1.h, w8
				; CHECK-NEXT: orr q0.d, z1.d, z0.d
				; CHECK-NEXT: ret
				%ret = shufflevector <8 x i16> %op1, <8 x i16> %op2, <8 x i32> <i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>
				ret <8 x i16> %ret
				}

				define void @shuffle_ext_byone_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0, #16]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: mov z0.h, z0.h[7]
				; CHECK-NEXT: mov z2.h, z1.h[7]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: ldr q0, [x1, #16]
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: insr z1.h, w8
				; CHECK-NEXT: insr z0.h, w9
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x i16>, <16 x i16>* %a
				%op2 = load <16 x i16>, <16 x i16>* %b
				%ret = shufflevector <16 x i16> %op1, <16 x i16> %op2, <16 x i32> <i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22,
				i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30>
				store <16 x i16> %ret, <16 x i16>* %a
				ret void
				}

				define <2 x i32> @shuffle_ext_byone_v2i32(<2 x i32> %op1, <2 x i32> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v2i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mov z0.s, z0.s[1]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: insr z1.s, w8
				; CHECK-NEXT: fmov d0, d1
				; CHECK-NEXT: ret
				%ret = shufflevector <2 x i32> %op1, <2 x i32> %op2, <2 x i32> <i32 1, i32 2>
				ret <2 x i32> %ret
				}

				define <4 x i32> @shuffle_ext_byone_v4i32(<4 x i32> %op1, <4 x i32> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mov z0.s, z0.s[3]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: insr z1.s, w8
				; CHECK-NEXT: orr q0.d, z1.d, z0.d
				; CHECK-NEXT: ret
				%ret = shufflevector <4 x i32> %op1, <4 x i32> %op2, <4 x i32> <i32 3, i32 4, i32 5, i32 6>
				ret <4 x i32> %ret
				}

				define void @shuffle_ext_byone_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0, #16]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: mov z0.s, z0.s[3]
				; CHECK-NEXT: mov z2.s, z1.s[3]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: ldr q0, [x1, #16]
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: insr z1.s, w8
				; CHECK-NEXT: insr z0.s, w9
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x i32>, <8 x i32>* %a
				%op2 = load <8 x i32>, <8 x i32>* %b
				%ret = shufflevector <8 x i32> %op1, <8 x i32> %op2, <8 x i32> <i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>
				store <8 x i32> %ret, <8 x i32>* %a
				ret void
				}

				define <2 x i64> @shuffle_ext_byone_v2i64(<2 x i64> %op1, <2 x i64> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mov z0.d, z0.d[1]
				; CHECK-NEXT: fmov x8, d0
				; CHECK-NEXT: insr z1.d, x8
				; CHECK-NEXT: orr q0.d, z1.d, z0.d
				; CHECK-NEXT: ret
				%ret = shufflevector <2 x i64> %op1, <2 x i64> %op2, <2 x i32> <i32 1, i32 2>
				ret <2 x i64> %ret
				}

				define void @shuffle_ext_byone_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0, #16]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: mov z0.d, z0.d[1]
				; CHECK-NEXT: mov z2.d, z1.d[1]
				; CHECK-NEXT: fmov x8, d0
				; CHECK-NEXT: ldr q0, [x1, #16]
				; CHECK-NEXT: fmov x9, d2
				; CHECK-NEXT: insr z1.d, x8
				; CHECK-NEXT: insr z0.d, x9
				; CHECK-NEXT: stp q1, q0, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x i64>, <4 x i64>* %a
				%op2 = load <4 x i64>, <4 x i64>* %b
				%ret = shufflevector <4 x i64> %op1, <4 x i64> %op2, <4 x i32> <i32 3, i32 4, i32 5, i32 6>
				store <4 x i64> %ret, <4 x i64>* %a
				ret void
				}


				define <4 x half> @shuffle_ext_byone_v4f16(<4 x half> %op1, <4 x half> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v4f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mov z0.h, z0.h[3]
				; CHECK-NEXT: insr z1.h, h0
				; CHECK-NEXT: fmov d0, d1
				; CHECK-NEXT: ret
				%ret = shufflevector <4 x half> %op1, <4 x half> %op2, <4 x i32> <i32 3, i32 4, i32 5, i32 6>
				ret <4 x half> %ret
				}

				define <8 x half> @shuffle_ext_byone_v8f16(<8 x half> %op1, <8 x half> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v8f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mov z0.h, z0.h[7]
				; CHECK-NEXT: insr z1.h, h0
				; CHECK-NEXT: orr q0.d, z1.d, z0.d
				; CHECK-NEXT: ret
				%ret = shufflevector <8 x half> %op1, <8 x half> %op2, <8 x i32> <i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>
				ret <8 x half> %ret
				}

				define void @shuffle_ext_byone_v16f16(<16 x half>* %a, <16 x half>* %b) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v16f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q1, q2, [x1]
				; CHECK-NEXT: mov z3.h, z1.h[7]
				; CHECK-NEXT: ldr q0, [x0, #16]
				; CHECK-NEXT: insr z2.h, h3
				; CHECK-NEXT: mov z0.h, z0.h[7]
				; CHECK-NEXT: insr z1.h, h0
				; CHECK-NEXT: stp q1, q2, [x0]
				; CHECK-NEXT: ret
				%op1 = load <16 x half>, <16 x half>* %a
				%op2 = load <16 x half>, <16 x half>* %b
				%ret = shufflevector <16 x half> %op1, <16 x half> %op2, <16 x i32> <i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22,
				i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30>
				store <16 x half> %ret, <16 x half>* %a
				ret void
				}

				define <2 x float> @shuffle_ext_byone_v2f32(<2 x float> %op1, <2 x float> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v2f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
				; CHECK-NEXT: // kill: def $d1 killed $d1 def $z1
				; CHECK-NEXT: mov z0.s, z0.s[1]
				; CHECK-NEXT: insr z1.s, s0
				; CHECK-NEXT: fmov d0, d1
				; CHECK-NEXT: ret
				%ret = shufflevector <2 x float> %op1, <2 x float> %op2, <2 x i32> <i32 1, i32 2>
				ret <2 x float> %ret
				}

				define <4 x float> @shuffle_ext_byone_v4f32(<4 x float> %op1, <4 x float> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mov z0.s, z0.s[3]
				; CHECK-NEXT: insr z1.s, s0
				; CHECK-NEXT: orr q0.d, z1.d, z0.d
				; CHECK-NEXT: ret
				%ret = shufflevector <4 x float> %op1, <4 x float> %op2, <4 x i32> <i32 3, i32 4, i32 5, i32 6>
				ret <4 x float> %ret
				}

				define void @shuffle_ext_byone_v8f32(<8 x float>* %a, <8 x float>* %b) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v8f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q1, q2, [x1]
				; CHECK-NEXT: mov z3.s, z1.s[3]
				; CHECK-NEXT: ldr q0, [x0, #16]
				; CHECK-NEXT: insr z2.s, s3
				; CHECK-NEXT: mov z0.s, z0.s[3]
				; CHECK-NEXT: insr z1.s, s0
				; CHECK-NEXT: stp q1, q2, [x0]
				; CHECK-NEXT: ret
				%op1 = load <8 x float>, <8 x float>* %a
				%op2 = load <8 x float>, <8 x float>* %b
				%ret = shufflevector <8 x float> %op1, <8 x float> %op2, <8 x i32> <i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14>
				store <8 x float> %ret, <8 x float>* %a
				ret void
				}

				define <2 x double> @shuffle_ext_byone_v2f64(<2 x double> %op1, <2 x double> %op2) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v2f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: // kill: def $q0 killed $q0 def $z0
				; CHECK-NEXT: // kill: def $q1 killed $q1 def $z1
				; CHECK-NEXT: mov z0.d, z0.d[1]
				; CHECK-NEXT: insr z1.d, d0
				; CHECK-NEXT: orr q0.d, z1.d, z0.d
				; CHECK-NEXT: ret
				%ret = shufflevector <2 x double> %op1, <2 x double> %op2, <2 x i32> <i32 1, i32 2>
				ret <2 x double> %ret
				}

				define void @shuffle_ext_byone_v4f64(<4 x double>* %a, <4 x double>* %b) #0 {
				; CHECK-LABEL: shuffle_ext_byone_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q1, q2, [x1]
				; CHECK-NEXT: mov z3.d, z1.d[1]
				; CHECK-NEXT: ldr q0, [x0, #16]
				; CHECK-NEXT: insr z2.d, d3
				; CHECK-NEXT: mov z0.d, z0.d[1]
				; CHECK-NEXT: insr z1.d, d0
				; CHECK-NEXT: stp q1, q2, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x double>, <4 x double>* %a
				%op2 = load <4 x double>, <4 x double>* %b
				%ret = shufflevector <4 x double> %op1, <4 x double> %op2, <4 x i32> <i32 3, i32 4, i32 5, i32 6>
				store <4 x double> %ret, <4 x double>* %a
				ret void
				}

				define void @shuffle_ext_byone_reverse(<4 x double>* %a, <4 x double>* %b) #0 {
				; CHECK-LABEL: shuffle_ext_byone_reverse:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldp q1, q2, [x0]
				; CHECK-NEXT: mov z3.d, z1.d[1]
				; CHECK-NEXT: ldr q0, [x1, #16]
				; CHECK-NEXT: insr z2.d, d3
				; CHECK-NEXT: mov z0.d, z0.d[1]
				; CHECK-NEXT: insr z1.d, d0
				; CHECK-NEXT: stp q1, q2, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x double>, <4 x double>* %a
				%op2 = load <4 x double>, <4 x double>* %b
				%ret = shufflevector <4 x double> %op1, <4 x double> %op2, <4 x i32> <i32 7, i32 0, i32 1, i32 2>
				store <4 x double> %ret, <4 x double>* %a
				ret void
				}

				define void @shuffle_ext_invalid(<4 x double>* %a, <4 x double>* %b) #0 {
				; CHECK-LABEL: shuffle_ext_invalid:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0, #16]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: stp q0, q1, [x0]
				; CHECK-NEXT: ret
				%op1 = load <4 x double>, <4 x double>* %a
				%op2 = load <4 x double>, <4 x double>* %b
				%ret = shufflevector <4 x double> %op1, <4 x double> %op2, <4 x i32> <i32 2, i32 3, i32 4, i32 5>
				store <4 x double> %ret, <4 x double>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }