This is an archive of the discontinued LLVM Phabricator instance.

Differential D94708

[IR] Introduce llvm.experimental.vector.splice intrinsic
ClosedPublic

Authored by c-rhodes on Jan 14 2021, 12:18 PM.

Details

Reviewers

sdesmalen
david-arm
fhahn
craig.topper
cameron.mcinally
CarolineConcatto

Commits

rG2750f3ed3155: [IR] Introduce llvm.experimental.vector.splice intrinsic

Summary

This patch introduces a new intrinsic @llvm.experimental.vector.splice
that constructs a vector of the same type as the two input vectors,
based on an immediate where the sign of the immediate distinguishes two
variants. A positive immediate specifies an index into the first vector
and a negative immediate specifies the number of trailing elements to
extract from the first vector.

For example:

@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, 1) ==> <B, C, D, E>  ; index
@llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, -3) ==> <B, C, D, E> ; trailing element count
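
The two variants can be modelled for a known vector length as a small reference function. This is only a sketch under stated assumptions: the name `splice` and the use of `std::optional` to stand in for a poison result are illustrative and not part of the patch, and the accepted range `-VL <= imm < VL` is taken from the LangRef text in this revision.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Reference model of the splice semantics for a known vector length VL.
// std::nullopt stands in for a poison result (an illustrative choice).
template <typename T>
std::optional<std::vector<T>> splice(const std::vector<T> &vec1,
                                     const std::vector<T> &vec2,
                                     std::int64_t imm) {
  const std::int64_t vl = static_cast<std::int64_t>(vec1.size());
  if (vec2.size() != vec1.size() || imm < -vl || imm >= vl)
    return std::nullopt;
  // A negative immediate counts trailing elements of vec1, so the start
  // index into concat(vec1, vec2) is vl + imm.
  const std::int64_t start = imm < 0 ? vl + imm : imm;
  std::vector<T> result;
  result.reserve(vec1.size());
  for (std::int64_t i = 0; i < vl; ++i) {
    const std::int64_t idx = start + i;
    result.push_back(idx < vl ? vec1[static_cast<std::size_t>(idx)]
                              : vec2[static_cast<std::size_t>(idx - vl)]);
  }
  return result;
}
```

With `<A,B,C,D>` and `<E,F,G,H>` as inputs, both `imm = 1` and `imm = -3` select the same start position in this model, which matches the two example lines above.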

These intrinsics support both fixed and scalable vectors. The former is
lowered to a shufflevector to maintain existing behaviour, although
while the intrinsic is marked as experimental the recommended way to
express this operation for fixed-width vectors is still to use
shufflevector. For scalable vectors, where it is not possible to express
a shufflevector mask for this operation, a new ISD node has been implemented.

This is one of the named shufflevector intrinsics proposed on the
mailing-list in the RFC at [1].

Patch by Paul Walker and Cullen Rhodes.

[1] https://lists.llvm.org/pipermail/llvm-dev/2020-November/146864.html

Diff Detail

Event Timeline

c-rhodes created this revision. Jan 14 2021, 12:18 PM

Herald added subscribers: jdoerfert, hiraditya, kristof.beyls. Jan 14 2021, 12:18 PM

c-rhodes requested review of this revision. Jan 14 2021, 12:18 PM

Herald added a project: Restricted Project. Jan 14 2021, 12:18 PM

cameron.mcinally added inline comments. Jan 14 2021, 12:37 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1109	Is this change required for the splice patch? I wonder if it should be broken out to its own commit.

Harbormaster completed remote builds in B85213: Diff 316728. Jan 14 2021, 1:09 PM

c-rhodes added inline comments. Jan 15 2021, 4:19 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1109	> Is this change required for the splice patch? I wonder if it should be broken out to its own commit.
	This fixes a selection error I hit when splicing predicates. The promotion of predicates uses a truncate which gets lowered to a setcc that crashes on operands of type `VT`, e.g.:
	t17: nxv2i64 = and t29, t30
	t31: nxv2i64 = AArch64ISD::DUP Constant:i64<0>
	t19: nxv2i1 = setcc t17, t31, setne:ch
	I can split this into a separate patch.

In D94444, @paulwalker-arm proposed a more generic extract vector intrinsic that accepts an index and stride. Now I'm wondering if we should just have a generic scalable shuffle vector intrinsic to handle all these operations under one intrinsic.

That idea doesn't need to hold up this Diff, but it might be something to consider...

craig.topper added inline comments. Jan 15 2021, 4:33 PM

llvm/docs/LangRef.rst
16236	Does this mention that %trailing.elt must be a constant?
llvm/include/llvm/IR/Intrinsics.td
1657	Should this have DefaultAttrs?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1274	Can you use setOperationPromotedToType here?

In D94708#2501255, @cameron.mcinally wrote:

In D94444, @paulwalker-arm proposed a more generic extract vector intrinsic that accepts an index and stride. Now I'm wondering if we should just have a generic scalable shuffle vector intrinsic to handle all these operations under one intrinsic.

That idea doesn't need to hold up this Diff, but it might be something to consider...

I don't believe that can work for the same reasons having a single LLVM instruction does not work. Each shuffle has different requirements. Some require one operand whilst others require two, and some require additional scalar operands when others do not.

paulwalker-arm added inline comments. Jan 16 2021, 4:18 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.h
903 ↗	(On Diff #316728)	It looks like you're missing the implementation of this function and some associated isel patterns, which explains the poor code generation for the SVE _1 test variants. Or are you planning to add those under a second patch, in which case this function declaration should be removed from this patch.

c-rhodes added inline comments. Jan 16 2021, 4:27 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.h
903 ↗	(On Diff #316728)	> It looks like you're missing the implementation of this function and some associated isel patterns, which explains the poor code generation for the SVE _1 test variants. Or are you planning to add those under a second patch, in which case this function declaration should be removed from this patch.
	The plan is to upstream those patterns separately, they were initially part of this patch but they depended on some changes we have downstream to use the SIMD variant of INSR when the scalar argument comes from a vector extract. It looks like I missed this declaration, I'll remove it.

Matt added a subscriber: Matt. Jan 19 2021, 9:14 AM

Changes:

Remove unused function declaration for LowerVECTOR_SPLICE.
Clarify trailing.elts must be an integer constant in LangRef.
Use DefaultAttrs for intrinsic.
Use setOperationPromotedToType.

c-rhodes marked 3 inline comments as done. Jan 20 2021, 6:39 AM

c-rhodes added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1109	> Is this change required for the splice patch? I wonder if it should be broken out to its own commit.
	> This fixes a selection error I hit when splicing predicates. The promotion of predicates uses a truncate which gets lowered to a setcc that crashes on operands of type `VT`, e.g.:
	> t17: nxv2i64 = and t29, t30
	> t31: nxv2i64 = AArch64ISD::DUP Constant:i64<0>
	> t19: nxv2i1 = setcc t17, t31, setne:ch
	> I can split this into a separate patch.
	I tried splitting this out but I couldn't figure out how to test it so I've left it in this patch. What I found was the setcc being generated by the splice occurs after result type legalization, so without the above a setcc returning an `nxv2i1` and taking two `nxv2i64` vectors is considered legal at this point in selection. With the above it falls into `SelectionDAGLegalize::LegalizeOp` which considers the legality based on the operand VTs, so setcc then gets custom lowered to the target-specific predicated variant `AArch64ISD::SETCC_MERGE_ZERO` that can be selected.
1274	> Can you use setOperationPromotedToType here?
	That's nice, wasn't aware of that thanks.

Are there plans to use this for fixed vectors as well? If not, can this be restricted to scalable vectors only? It seems like for fixed vectors, preferring the shufflevector versions would be beneficial, to avoid having to update all transforms looking at shuffles.

Otherwise, this should probably have some tests for platforms other than AArch64 and also support in GlobalISel, which is the default on AArch64 with -O0 IIRC (or do the transform to shuffles as an IR transform).

sdesmalen added inline comments. Jan 20 2021, 7:15 AM

llvm/docs/LangRef.rst
16266	I expect that we want to support two flavours of the splice intrinsic:
	// in flavour1 an immediate of '3' translates to the number of trailing elements,
	// e.g. start index of VL - 3 == 'B'.
	experimental.vector.splice.flavour1(<A,B,C,D>, <E,F,G,H>, 3) ==> <B, C, D, E>
	// in flavour2 an immediate of '1' translates to the start index 1 == 'B'.
	experimental.vector.splice.flavour2(<A,B,C,D>, <E,F,G,H>, 1) ==> <B, C, D, E>
	This patch implements only flavour1, and because trailing.elts should be an immediate it is not possible to express flavour2. Do we want to add flavour2 as well at some point? And if so, what would it be named? Alternatively, it seems possible to express both with the current splice intrinsic where the sign of the immediate distinguishes the two flavours, e.g. a start index of `-3` would 'wrap' by VL to index 1, and thus have the same meaning as `3` in flavour1, whereas a positive immediate of `1` would mean index 1, as in flavour2.

In D94708#2509610, @fhahn wrote:

Are there plans to use this for fixed vectors as well? If not, can this be restricted to scalable vectors only? It seems like for fixed vectors, preferring the shufflevector versions would be beneficial, to avoid having to update all transforms looking at shuffles.

Otherwise, this should probably have some tests for platforms other than AArch64 and also support in GlobalISel, which is the default on AArch64 with -O0 IIRC (or do the transform to shuffles as an IR transform).

The named shufflevector intrinsics accept both fixed and scalable vectors but for fixed they map to existing shufflevector. That's a good point thanks for raising that, converting to a shufflevector as an IR transform like the insert/extract do is probably the right thing to do here.

paulwalker-arm added inline comments. Jan 20 2021, 12:50 PM

llvm/docs/LangRef.rst
16266	Overloading the usage based on the sign of the immediate sounds like a worthy upgrade to me.
16268–16269	This restriction does not make sense to me. The logical requirement is for the trailing elements to be less than or equal to the full runtime vector length. Then I'd say that counts beyond this range result in poison.

In D94708#2509677, @c-rhodes wrote:

In D94708#2509610, @fhahn wrote:

Are there plans to use this for fixed vectors as well? If not, can this be restricted to scalable vectors only? It seems like for fixed vectors, preferring the shufflevector versions would be beneficial, to avoid having to update all transforms looking at shuffles.

Otherwise, this should probably have some tests for platforms other than AArch64 and also support in GlobalISel, which is the default on AArch64 with -O0 IIRC (or do the transform to shuffles as an IR transform).

The named shufflevector intrinsics accept both fixed and scalable vectors but for fixed they map to existing shufflevector.

As specified here yes, but I was wondering if they actually need to support fixed-width vectors? Is there a reason to use the intrinsics with fixed vectors instead of shuffles?

fhahn mentioned this in D94883: [CodeGen][SelectionDAG]Add new intrinsic experimental.vector.reverse. Jan 21 2021, 8:40 AM

CarolineConcatto added a reviewer: CarolineConcatto. Jan 21 2021, 8:51 AM

timsmith78 added a subscriber: timsmith78. Jan 21 2021, 9:23 AM

FWIW, we have a similar intrinsic in our downstream compiler for RISC-V. We call it experimental.vector.slideleftfill and it is the same as flavour 2 of this (except the "offset" need not be an immediate) as suggested by @sdesmalen. We also have a separate predicated version (experimental.vector.vp.slideleftfill), which takes additional arguments for the explicit vector length of the two vectors.

llvm/docs/LangRef.rst
16267	Why is it required for `trailing.elts` to be an immediate? I imagine there could be a scenario where this value is unknown at compile-time, for instance a recurrence where the order of recurrence is determined at runtime.

paulwalker-arm added inline comments. Jan 27 2021, 4:04 AM

llvm/docs/LangRef.rst
16267	At this stage we don't want to muddy the waters by introducing new shuffle behaviours to LLVM, but rather just extend the existing shuffle requirements to cover scalable vectors. Today `shufflevector` only supports a literal mask and thus at this stage we are enforcing this same requirement to the intrinsic variants.

Changes:

As proposed by @sdesmalen, the intrinsic now supports two variants based on the sign of the immediate, where a negative immediate is a trailing element count and a positive immediate an index.
Following the discussion on the reverse intrinsic (D94883) around whether named shufflevector intrinsics should support fixed vectors, I've opted for the same approach and made it explicit in the LangRef that, whilst these intrinsics are experimental, shufflevector should be used for fixed-width vectors. The changes to InstCombineCalls to map this intrinsic to shufflevector as an IR transform have been dropped.
ExpandVectorSpliceThroughStack has been moved to TLI under expandVectorSplice which is reused by DAGTypeLegalizer::SplitVecRes_VECTOR_SPLICE.

In D94708#2512706, @fhahn wrote:

In D94708#2509677, @c-rhodes wrote:

In D94708#2509610, @fhahn wrote:

Are there plans to use this for fixed vectors as well? If not, can this be restricted to scalable vectors only? It seems like for fixed vectors, preferring the shufflevector versions would be beneficial, to avoid having to update all transforms looking at shuffles.

Otherwise, this should probably have some tests for platforms other than AArch64 and also support in GlobalISel, which is the default on AArch64 with -O0 IIRC (or do the transform to shuffles as an IR transform).

The named shufflevector intrinsics accept both fixed and scalable vectors but for fixed they map to existing shufflevector.

As specified here yes, but I was wondering if they actually need to support fixed-width vectors? Is there a reason to use the intrinsics with fixed vectors instead of shuffles?

I noticed there was a similar discussion on D94883. As mentioned, I've taken the same approach and made it explicit in the LangRef that, whilst this intrinsic is experimental, shufflevector should be used instead for fixed-width vectors.

llvm/docs/LangRef.rst
16268–16269	> This restriction does not make sense to me. The logical requirement is for the trailing elements to be less than or equal to the full runtime vector length. Then I'd say that counts beyond this range result in poison.
	My mistake, you're right, it should be the full runtime VL rather than the minimum number of elements. LangRef has been updated.

Just a few nits, will have a closer look at the patch later.

llvm/docs/LangRef.rst
16278–16284	nit: The signed immediate, modulo the number of elements in the vector, is the index into the first vector from which to extract the result value. This means conceptually that for a positive immediate, a vector is extracted from concat(%vec1, %vec2) starting at index `imm`, whereas for a negative immediate, it extracts `imm` trailing elements from the first vector, and the remaining elements from %vec2.
16287	nit: insert comma.
16305	It's better to say that if the immediate value is outside this range, the result is a poison value. That leaves it up to the implementation how to handle it.
16305	where VL is the runtime vector length of the source/result vector.

sdesmalen added inline comments. Feb 19 2021, 6:44 AM

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5524–5531	is there a way to split the load using existing code?
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
10911	just `Imm % NumElts`; ?
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8657	You'll need to do a similar clamping for Imm > 0 (or actually, Imm > MinKnownVL) , to avoid loading beyond the allocated stack object.
llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll
20	nit: not sure if there is a lot of value to do this for each possible element-type, if you do two (i8 and some other type) that may be sufficient.

Address comments

c-rhodes marked 6 inline comments as done. Feb 19 2021, 7:51 AM

c-rhodes added inline comments.

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
5524–5531	> is there a way to split the load using existing code?
	Splitting the full load from `expandVectorSplice` with `SplitVecRes_LOAD` seems to work.
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
10911	> just `Imm % NumElts`; ?
	I tried that at first but found the behaviour isn't correct for negative immediates. For example, with a trailing element count of `-15` and 16 elements, `-15 % 16 = -15`, so we end up with this mask: `[-15,-14,-13,-12,-11,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1,0]`. From what I read, the sign of the modulo result is implementation defined if one or both of the operands are negative.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8657	> You'll need to do a similar clamping for Imm > 0 (or actually, Imm > MinKnownVL), to avoid loading beyond the allocated stack object.
	For the index, `getVectorElementPointer` does the clamping via `clampDynamicVectorIndex`.
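
The `Imm % NumElts` exchange above can be illustrated with a small standalone sketch. Since C++11 the `%` operator truncates toward zero, so the result takes the sign of the dividend and `-15 % 16` is `-15`; the sketch therefore folds negative immediates back into range explicitly. The helper name `buildSpliceMask` is hypothetical and this is not the code from the patch; it also assumes the immediate is already within `[-NumElts, NumElts)`.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: build a shufflevector-style mask for a fixed-width
// splice. Indices 0..numElts-1 select from the first vector and
// numElts..2*numElts-1 select from the second.
// Precondition (assumed, not checked): -numElts <= imm < numElts.
std::vector<int> buildSpliceMask(std::int32_t imm, unsigned numElts) {
  const std::int64_t n = numElts;
  // Fold a negative trailing-element count into a start index explicitly
  // instead of relying on %, whose result is negative for a negative imm.
  const std::int64_t start = imm < 0 ? n + imm : imm;
  std::vector<int> mask;
  mask.reserve(numElts);
  for (std::int64_t i = 0; i < n; ++i)
    mask.push_back(static_cast<int>(start + i));
  return mask;
}
```

For 16 elements and an immediate of `-15`, this yields a mask that starts at index 1 rather than at `-15`.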

paulwalker-arm added inline comments. Feb 19 2021, 11:08 AM

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
940	This doesn't look great to me because it ties the hands of expandVectorSplice, whose name does not suggest it must return a load. If you prefer not to explicitly create the split loads then it seems more natural, and future proof, if the result of expandVectorSplice is split using a pair of EXTRACT_SUBVECTOR operations.

bin.cheng-ali added a subscriber: bin.cheng-ali. Feb 19 2021, 10:51 PM

bin.cheng-ali added inline comments.

llvm/docs/LangRef.rst
16285	One nit. Given `imm` is negative here, should this be? "it extracts `-imm` trailing elements from the first vector, ..."
llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp
10911	Then `(NumElts + Imm) % NumElts` ?
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8623	Sorry for my stupid question: this stores two vector registers and loads from the middle, so does endianness matter here? How does LLVM make sure the correct sequence of elements is loaded? Thanks in advance.

paulwalker-arm added inline comments. Feb 22 2021, 3:27 AM

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8623	There's no such thing as a stupid question :) These are load/store operations on vector types so endianness only plays a role in how each element is stored and not in the order of the elements. That's to say, element N will be at the same location in memory regardless of endianness, however, the bytes that make up element N will be laid out differently. The index used to splice the vectors is also element based, which means there are no partial element accesses and thus no endianness issues.
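
The store-then-load expansion being discussed can be sketched in plain C++ to show why only whole-element offsets are involved. This is a rough model under stated assumptions, not LLVM's implementation: the function name `spliceThroughBuffer` is hypothetical, a `std::vector` stands in for the stack object, and the clamping shown is modelled on the review comments rather than copied from `expandVectorSplice`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a "through the stack" splice for 32-bit elements.
std::vector<std::uint32_t> spliceThroughBuffer(
    const std::vector<std::uint32_t> &vec1,
    const std::vector<std::uint32_t> &vec2, std::int32_t imm) {
  const std::size_t vl = vec1.size();
  if (vec2.size() != vl)
    return {};
  // Stand-in for the stack object: vec1 stored first, vec2 right after it.
  std::vector<std::uint32_t> buffer(2 * vl);
  std::copy(vec1.begin(), vec1.end(), buffer.begin());
  std::copy(vec2.begin(), vec2.end(),
            buffer.begin() + static_cast<std::ptrdiff_t>(vl));

  const std::int64_t wide = imm; // widen so negation cannot overflow
  std::size_t start = 0;
  if (wide >= 0) {
    // Index form: clamp so the load stays inside the buffer.
    start = std::min(static_cast<std::size_t>(wide), vl);
  } else {
    // Trailing-count form: clamp the count to vl, then step back from
    // the end of vec1.
    const std::size_t trailing = std::min(static_cast<std::size_t>(-wide), vl);
    start = vl - trailing;
  }
  // The load begins at a whole-element offset, so the byte order inside
  // each element never affects which elements are selected.
  const auto first = buffer.begin() + static_cast<std::ptrdiff_t>(start);
  return std::vector<std::uint32_t>(first,
                                    first + static_cast<std::ptrdiff_t>(vl));
}
```

Because every offset is a multiple of the element size, the sketch never reads part of an element, which is the property the explanation above relies on.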

Address comments

c-rhodes marked 3 inline comments as done. Feb 22 2021, 7:11 AM

c-rhodes added inline comments. Feb 22 2021, 7:17 AM

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
940	> This doesn't look great to me because it ties the hands of expandVectorSplice, whose name does not suggest it must return a load. If you prefer not to explicitly create the split loads then it seems more natural, and future proof, if the result of expandVectorSplice is split using a pair of EXTRACT_SUBVECTOR operations.
	Fair point, I've used EXTRACT_SUBVECTOR instead. I should point out warnings were being generated for the `nxv16f32` splitvec tests by the `getVectorNumElements` call in `DAGTypeLegalizer::SplitVecRes_EXTRACT_SUBVECTOR`. To remove the warnings I changed it to `getVectorMinNumElements`.

bin.cheng-ali added inline comments. Feb 23 2021, 1:59 AM

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8623	Thanks very much for the explanation. I assume this is defined in the Arm architecture reference manual; however, I noticed there is a supplementary rule for SVE: "For unpredicated SVE vector register loads and stores, the vector is treated as containing byte elements that are transferred in increasing element number order without any endianness conversion." IIUC, this rule would apply here? Of course, there is no endianness issue either way.

HLJ2009 added a subscriber: HLJ2009. Feb 23 2021, 2:04 AM

Just a few more nits.

llvm/docs/LangRef.rst
16307	nit: `s/, for/. For/`
llvm/include/llvm/CodeGen/ISDOpcodes.h
565	nit: `s/T is [..] and//`
565	nit: `s/out-of-bounds/not within the range [-VL, VL)/`
566–567	Please remove this restriction for scalable vectors. The implementation (by clamping to avoid a runtime crash) should not dictate the specification. It is sufficient to say that if IMM is out of range that the result value is undefined.
llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8644	nit: because the Imm < 0 block is a lot bigger, maybe you can rewrite it as:
	if (Imm >= 0) {
	  // Load back the required element.
	  StackPtr = getVectorElementPointer(DAG, StackPtr, VT, Node->getOperand(2));
	  // Load the spliced result
	  return DAG.getLoad(VT, DL, StoreV2, StackPtr, MachinePointerInfo::getUnknownStack(MF));
	}
	// Handle Imm < 0 case here.
8657	Thanks for confirming. Can you just add a comment that clarifies this?

Address @sdesmalen's comments

c-rhodes marked 6 inline comments as done. Mar 3 2021, 3:52 AM

The patch looks fine to me, cheers @c-rhodes.

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp
8645	nit: s/This/getVectorElementPointer/

This revision is now accepted and ready to land. Mar 3 2021, 4:13 AM

Harbormaster completed remote builds in B91782: Diff 327738. Mar 3 2021, 7:39 AM

Thanks for reviewing all, I'll land this early next week unless there are any objections before then, cheers.

This revision was landed with ongoing or failed builds. Mar 9 2021, 2:45 AM

Closed by commit rG2750f3ed3155: [IR] Introduce llvm.experimental.vector.splice intrinsic (authored by c-rhodes).

This revision was automatically updated to reflect the committed changes.

c-rhodes added a commit: rG2750f3ed3155: [IR] Introduce llvm.experimental.vector.splice intrinsic.

vkmr mentioned this in D103898: [VP] Vector predicated vector splice intrinsic. Jun 8 2021, 7:46 AM

simoll mentioned this in rG72a08c0b9404: [VP] Vector predicated vector splice intrinsic. Sep 29 2021, 1:44 AM

Jimerlife mentioned this in D128717: [RISCV] Change VECTOR_SPLICE mask operation from expand to promote. Jun 28 2022, 3:13 AM

Revision Contents

Path	Size
llvm/
docs/
LangRef.rst	46 lines
include/
llvm/
CodeGen/
ISDOpcodes.h	13 lines
TargetLowering.h	4 lines
IR/
Intrinsics.td	7 lines
Target/
TargetSelectionDAG.td	4 lines
lib/
CodeGen/
SelectionDAG/
LegalizeDAG.cpp	12 lines
LegalizeIntegerTypes.cpp	11 lines
LegalizeTypes.h	2 lines
LegalizeVectorTypes.cpp	21 lines
SelectionDAGBuilder.h	1 line
SelectionDAGBuilder.cpp	37 lines
SelectionDAGDumper.cpp	1 line
TargetLowering.cpp	73 lines
TargetLoweringBase.cpp	3 lines
Target/
AArch64/
AArch64ISelLowering.cpp	6 lines
test/
CodeGen/
AArch64/
named-vector-shuffles-neon.ll	142 lines
named-vector-shuffles-sve.ll	1310 lines

Diff 325438

llvm/docs/LangRef.rst

	Show First 20 Lines • Show All 16,227 Lines • ▼ Show 20 Lines
	vector length of the result type. If the result type is a scalable vector,
	``idx`` is first scaled by the result type's runtime scaling factor. Elements
	``idx`` through (``idx`` + num_elements(result_type) - 1) must be valid vector
	indices. If this condition cannot be determined statically but is false at
	runtime, then the result vector is undefined. The ``idx`` parameter must be a
	vector index constant type (for most targets this will be an integer pointer
	type).

	'``llvm.experimental.vector.reverse``' Intrinsic
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Syntax:
	"""""""
	This is an overloaded intrinsic.

	::

	Show All 10 Lines
	recommended way to express reverse operations for fixed-width vectors is still
	to use a shufflevector, as that may allow for more optimization opportunities.

	Arguments:
	""""""""""

	The argument to this intrinsic must be a vector.

				'``llvm.experimental.vector.splice``' Intrinsic
				^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

				Syntax:
				"""""""
				This is an overloaded intrinsic.

				::

				declare <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %vec1, <2 x double> %vec2, i32 %imm)
				declare <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %vec1, <vscale x 4 x i32> %vec2, i32 %imm)

				Overview:
				"""""""""

				The '``llvm.experimental.vector.splice.*``' intrinsics construct a vector by
				concatenating elements from the first input vector with elements of the second
				input vector, returning a vector of the same type as the input vectors. The
				signed immediate, modulo the number of elements in the vector, is the index
				into the first vector from which to extract the result value. This means
				conceptually that for a positive immediate, a vector is extracted from
				``concat(%vec1, %vec2)`` starting at index ``imm``, whereas for a negative
				immediate, it extracts ``-imm`` trailing elements from the first vector, and
				the remaining elements from ``%vec2``.

				These intrinsics work for both fixed and scalable vectors. While this intrinsic
				is marked as experimental, the recommended way to express this operation for
				fixed-width vectors is still to use a shufflevector, as that may allow for more
				optimization opportunities.

				For example:

				.. code-block:: text

				llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, 1) ==> <B, C, D, E> ; index
				llvm.experimental.vector.splice(<A,B,C,D>, <E,F,G,H>, -3) ==> <B, C, D, E> ; trailing elements


				Arguments:
				""""""""""

				The first two operands are vectors with the same type. The third argument
				``imm`` is the start index, modulo VL, where VL is the runtime vector length of
				the source/result vector. The ``imm`` is a signed integer constant in the range
				``-VL <= imm < VL``, for values outside of this range the result is poison.

	Matrix Intrinsics
	-----------------

	Operations on matrixes requiring shape information (like number of rows/columns
	or the memory layout) can be expressed using the matrix intrinsics. These
	intrinsics require matrix dimensions to be passed as immediate arguments, and
	matrixes are passed and returned as vectors. This means that for a ``R`` x
	``C`` matrix, element ``i`` of column ``j`` is at index ``j * R + i`` in the

llvm/include/llvm/CodeGen/ISDOpcodes.h

[...] enum NodeType {
/// VEC1/VEC2. A VECTOR_SHUFFLE node also contains an array of constant int		/// VEC1/VEC2. A VECTOR_SHUFFLE node also contains an array of constant int
/// values that indicate which value (or undef) each result element will		/// values that indicate which value (or undef) each result element will
/// get. These constant ints are accessible through the		/// get. These constant ints are accessible through the
/// ShuffleVectorSDNode class. This is quite similar to the Altivec		/// ShuffleVectorSDNode class. This is quite similar to the Altivec
/// 'vperm' instruction, except that the indices must be constants and are		/// 'vperm' instruction, except that the indices must be constants and are
/// in terms of the element size of VEC1/VEC2, not in terms of bytes.		/// in terms of the element size of VEC1/VEC2, not in terms of bytes.
VECTOR_SHUFFLE,		VECTOR_SHUFFLE,

		/// VECTOR_SPLICE(VEC1, VEC2, IMM) - Returns a subvector of the same type as
		/// VEC1/VEC2 from CONCAT_VECTORS(VEC1, VEC2), based on the IMM in two ways.
		/// Let the result type be T, if IMM is positive it represents the starting
		/// element number (an index) from which a subvector of type T is extracted
		/// from CONCAT_VECTORS(VEC1, VEC2). If IMM is negative it represents a count
		/// specifying the number of trailing elements to extract from VEC1, where the
		/// elements of T are selected using the following algorithm:
		/// RESULT[i] = CONCAT_VECTORS(VEC1,VEC2)[VEC1.ElementCount - ABS(IMM) + i]
		/// If T is a fixed-width vector and IMM is out-of-bounds the result vector is
		sdesmalen: nit: `s/T is [..] and//`
		sdesmalen: nit: `s/out-of-bounds/not within the range [-VL, VL)/`
		/// undefined. If T is a scalable vector, IMM is clamped by the runtime
		/// scaling factor 'vscale'. IMM is a constant integer.
		sdesmalen: Please remove this restriction for scalable vectors. The implementation (by clamping to avoid a runtime crash) should not dictate the specification. It is sufficient to say that if IMM is out of range that the result value is undefined.
		VECTOR_SPLICE,

/// SCALAR_TO_VECTOR(VAL) - This represents the operation of loading a		/// SCALAR_TO_VECTOR(VAL) - This represents the operation of loading a
/// scalar value into element 0 of the resultant vector type. The top		/// scalar value into element 0 of the resultant vector type. The top
/// elements 1 to N-1 of the N-element vector are undefined. The type		/// elements 1 to N-1 of the N-element vector are undefined. The type
/// of the operand must match the vector element type, except when they		/// of the operand must match the vector element type, except when they
/// are integer types. In this case the operand is allowed to be wider		/// are integer types. In this case the operand is allowed to be wider
/// than the vector element type, and is implicitly truncated to it.		/// than the vector element type, and is implicitly truncated to it.
SCALAR_TO_VECTOR,		SCALAR_TO_VECTOR,


llvm/include/llvm/CodeGen/TargetLowering.h


	/// Expand a VECREDUCE_SEQ_* into an explicit ordered calculation.			/// Expand a VECREDUCE_SEQ_* into an explicit ordered calculation.
	SDValue expandVecReduceSeq(SDNode *Node, SelectionDAG &DAG) const;			SDValue expandVecReduceSeq(SDNode *Node, SelectionDAG &DAG) const;

	/// Expand an SREM or UREM using SDIV/UDIV or SDIVREM/UDIVREM, if legal.			/// Expand an SREM or UREM using SDIV/UDIV or SDIVREM/UDIVREM, if legal.
	/// Returns true if the expansion was successful.			/// Returns true if the expansion was successful.
	bool expandREM(SDNode *Node, SDValue &Result, SelectionDAG &DAG) const;			bool expandREM(SDNode *Node, SDValue &Result, SelectionDAG &DAG) const;

				/// Method for building the DAG expansion of ISD::VECTOR_SPLICE. This
				/// method accepts vectors as its arguments.
				SDValue expandVectorSplice(SDNode *Node, SelectionDAG &DAG) const;

	//===--------------------------------------------------------------------===//			//===--------------------------------------------------------------------===//
	// Instruction Emitting Hooks			// Instruction Emitting Hooks
	//			//

	/// This method should be implemented by targets that mark instructions with			/// This method should be implemented by targets that mark instructions with
	/// the 'usesCustomInserter' flag. These instructions are special in various			/// the 'usesCustomInserter' flag. These instructions are special in various
	/// ways, which require special support to insert. The specified MachineInstr			/// ways, which require special support to insert. The specified MachineInstr
	/// is created but not inserted into any basic blocks, and this method is			/// is created but not inserted into any basic blocks, and this method is

llvm/include/llvm/IR/Intrinsics.td

	def int_experimental_vector_insert : DefaultAttrsIntrinsic<[llvm_anyvector_ty],			def int_experimental_vector_insert : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
	[LLVMMatchType<0>, llvm_anyvector_ty, llvm_i64_ty],			[LLVMMatchType<0>, llvm_anyvector_ty, llvm_i64_ty],
	[IntrNoMem, ImmArg<ArgIndex<2>>]>;			[IntrNoMem, ImmArg<ArgIndex<2>>]>;

	def int_experimental_vector_extract : DefaultAttrsIntrinsic<[llvm_anyvector_ty],			def int_experimental_vector_extract : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
	[llvm_anyvector_ty, llvm_i64_ty],			[llvm_anyvector_ty, llvm_i64_ty],
	[IntrNoMem, ImmArg<ArgIndex<1>>]>;			[IntrNoMem, ImmArg<ArgIndex<1>>]>;

				//===---------- Named shufflevector intrinsics ------===//
				def int_experimental_vector_splice : DefaultAttrsIntrinsic<[llvm_anyvector_ty],
				craig.topper: Should this have DefaultAttrs?
				[LLVMMatchType<0>,
				LLVMMatchType<0>,
				llvm_i32_ty],
				[IntrNoMem, ImmArg<ArgIndex<2>>]>;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Target-specific intrinsics			// Target-specific intrinsics
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	include "llvm/IR/IntrinsicsPowerPC.td"			include "llvm/IR/IntrinsicsPowerPC.td"
	include "llvm/IR/IntrinsicsX86.td"			include "llvm/IR/IntrinsicsX86.td"

llvm/include/llvm/Target/TargetSelectionDAG.td

def SDTMaskedLoad: SDTypeProfile<1, 4, [ // masked load		def SDTMaskedLoad: SDTypeProfile<1, 4, [ // masked load
SDTCisVec<0>, SDTCisPtrTy<1>, SDTCisPtrTy<2>, SDTCisVec<3>, SDTCisSameAs<0, 4>,		SDTCisVec<0>, SDTCisPtrTy<1>, SDTCisPtrTy<2>, SDTCisVec<3>, SDTCisSameAs<0, 4>,
SDTCisSameNumEltsAs<0, 3>		SDTCisSameNumEltsAs<0, 3>
]>;		]>;

def SDTVecShuffle : SDTypeProfile<1, 2, [		def SDTVecShuffle : SDTypeProfile<1, 2, [
SDTCisSameAs<0, 1>, SDTCisSameAs<1, 2>		SDTCisSameAs<0, 1>, SDTCisSameAs<1, 2>
]>;		]>;
		def SDTVecSlice : SDTypeProfile<1, 3, [ // vector splice
		SDTCisSameAs<0, 1>, SDTCisSameAs<1, 2>, SDTCisInt<3>
		]>;
def SDTVecExtract : SDTypeProfile<1, 2, [ // vector extract		def SDTVecExtract : SDTypeProfile<1, 2, [ // vector extract
SDTCisEltOfVec<0, 1>, SDTCisPtrTy<2>		SDTCisEltOfVec<0, 1>, SDTCisPtrTy<2>
]>;		]>;
def SDTVecInsert : SDTypeProfile<1, 3, [ // vector insert		def SDTVecInsert : SDTypeProfile<1, 3, [ // vector insert
SDTCisEltOfVec<2, 1>, SDTCisSameAs<0, 1>, SDTCisPtrTy<3>		SDTCisEltOfVec<2, 1>, SDTCisSameAs<0, 1>, SDTCisPtrTy<3>
]>;		]>;
def SDTVecReduce : SDTypeProfile<1, 1, [ // vector reduction		def SDTVecReduce : SDTypeProfile<1, 1, [ // vector reduction
SDTCisInt<0>, SDTCisVec<1>		SDTCisInt<0>, SDTCisVec<1>
[...] def ld : SDNode<"ISD::LOAD" , SDTLoad,
[SDNPHasChain, SDNPMayLoad, SDNPMemOperand]>;		[SDNPHasChain, SDNPMayLoad, SDNPMemOperand]>;
def st : SDNode<"ISD::STORE" , SDTStore,		def st : SDNode<"ISD::STORE" , SDTStore,
[SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;		[SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;
def ist : SDNode<"ISD::STORE" , SDTIStore,		def ist : SDNode<"ISD::STORE" , SDTIStore,
[SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;		[SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;

def vector_shuffle : SDNode<"ISD::VECTOR_SHUFFLE", SDTVecShuffle, []>;		def vector_shuffle : SDNode<"ISD::VECTOR_SHUFFLE", SDTVecShuffle, []>;
def vector_reverse : SDNode<"ISD::VECTOR_REVERSE", SDTVecReverse>;		def vector_reverse : SDNode<"ISD::VECTOR_REVERSE", SDTVecReverse>;
		def vector_splice : SDNode<"ISD::VECTOR_SPLICE", SDTVecSlice, []>;
def build_vector : SDNode<"ISD::BUILD_VECTOR", SDTypeProfile<1, -1, []>, []>;		def build_vector : SDNode<"ISD::BUILD_VECTOR", SDTypeProfile<1, -1, []>, []>;
def splat_vector : SDNode<"ISD::SPLAT_VECTOR", SDTypeProfile<1, 1, []>, []>;		def splat_vector : SDNode<"ISD::SPLAT_VECTOR", SDTypeProfile<1, 1, []>, []>;
def scalar_to_vector : SDNode<"ISD::SCALAR_TO_VECTOR", SDTypeProfile<1, 1, []>,		def scalar_to_vector : SDNode<"ISD::SCALAR_TO_VECTOR", SDTypeProfile<1, 1, []>,
[]>;		[]>;

// vector_extract/vector_insert are deprecated. extractelt/insertelt		// vector_extract/vector_insert are deprecated. extractelt/insertelt
// are preferred.		// are preferred.
def vector_extract : SDNode<"ISD::EXTRACT_VECTOR_ELT",		def vector_extract : SDNode<"ISD::EXTRACT_VECTOR_ELT",

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

[...] case ISD::VECTOR_SHUFFLE: {
}		}

Tmp1 = DAG.getBuildVector(VT, dl, Ops);		Tmp1 = DAG.getBuildVector(VT, dl, Ops);
// We may have changed the BUILD_VECTOR type. Cast it back to the Node type.		// We may have changed the BUILD_VECTOR type. Cast it back to the Node type.
Tmp1 = DAG.getNode(ISD::BITCAST, dl, Node->getValueType(0), Tmp1);		Tmp1 = DAG.getNode(ISD::BITCAST, dl, Node->getValueType(0), Tmp1);
Results.push_back(Tmp1);		Results.push_back(Tmp1);
break;		break;
}		}
		case ISD::VECTOR_SPLICE: {
		Results.push_back(TLI.expandVectorSplice(Node, DAG));
		break;
		}
case ISD::EXTRACT_ELEMENT: {		case ISD::EXTRACT_ELEMENT: {
EVT OpTy = Node->getOperand(0).getValueType();		EVT OpTy = Node->getOperand(0).getValueType();
if (cast<ConstantSDNode>(Node->getOperand(1))->getZExtValue()) {		if (cast<ConstantSDNode>(Node->getOperand(1))->getZExtValue()) {
// 1 -> Hi		// 1 -> Hi
Tmp1 = DAG.getNode(ISD::SRL, dl, OpTy, Node->getOperand(0),		Tmp1 = DAG.getNode(ISD::SRL, dl, OpTy, Node->getOperand(0),
DAG.getConstant(OpTy.getSizeInBits() / 2, dl,		DAG.getConstant(OpTy.getSizeInBits() / 2, dl,
TLI.getShiftAmountTy(		TLI.getShiftAmountTy(
Node->getOperand(0).getValueType(),		Node->getOperand(0).getValueType(),
[...] case ISD::VECTOR_SHUFFLE: {
Tmp2 = DAG.getNode(ISD::BITCAST, dl, NVT, Node->getOperand(1));		Tmp2 = DAG.getNode(ISD::BITCAST, dl, NVT, Node->getOperand(1));

// Convert the shuffle mask to the right # elements.		// Convert the shuffle mask to the right # elements.
Tmp1 = ShuffleWithNarrowerEltType(NVT, OVT, dl, Tmp1, Tmp2, Mask);		Tmp1 = ShuffleWithNarrowerEltType(NVT, OVT, dl, Tmp1, Tmp2, Mask);
Tmp1 = DAG.getNode(ISD::BITCAST, dl, OVT, Tmp1);		Tmp1 = DAG.getNode(ISD::BITCAST, dl, OVT, Tmp1);
Results.push_back(Tmp1);		Results.push_back(Tmp1);
break;		break;
}		}
		case ISD::VECTOR_SPLICE: {
		Tmp1 = DAG.getNode(ISD::ANY_EXTEND, dl, NVT, Node->getOperand(0));
		Tmp2 = DAG.getNode(ISD::ANY_EXTEND, dl, NVT, Node->getOperand(1));
		Tmp3 = DAG.getNode(ISD::VECTOR_SPLICE, dl, NVT, Tmp1, Tmp2,
		Node->getOperand(2));
		Results.push_back(DAG.getNode(ISD::TRUNCATE, dl, OVT, Tmp3));
		break;
		}
case ISD::SETCC:		case ISD::SETCC:
case ISD::STRICT_FSETCC:		case ISD::STRICT_FSETCC:
case ISD::STRICT_FSETCCS: {		case ISD::STRICT_FSETCCS: {
unsigned ExtOp = ISD::FP_EXTEND;		unsigned ExtOp = ISD::FP_EXTEND;
if (NVT.isInteger()) {		if (NVT.isInteger()) {
ISD::CondCode CCCode = cast<CondCodeSDNode>(Node->getOperand(2))->get();		ISD::CondCode CCCode = cast<CondCodeSDNode>(Node->getOperand(2))->get();
ExtOp = isSignedIntSetCC(CCCode) ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;		ExtOp = isSignedIntSetCC(CCCode) ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;
}		}

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

[...] #endif
case ISD::VSCALE: Res = PromoteIntRes_VSCALE(N); break;		case ISD::VSCALE: Res = PromoteIntRes_VSCALE(N); break;

case ISD::EXTRACT_SUBVECTOR:		case ISD::EXTRACT_SUBVECTOR:
Res = PromoteIntRes_EXTRACT_SUBVECTOR(N); break;		Res = PromoteIntRes_EXTRACT_SUBVECTOR(N); break;
case ISD::VECTOR_REVERSE:		case ISD::VECTOR_REVERSE:
Res = PromoteIntRes_VECTOR_REVERSE(N); break;		Res = PromoteIntRes_VECTOR_REVERSE(N); break;
case ISD::VECTOR_SHUFFLE:		case ISD::VECTOR_SHUFFLE:
Res = PromoteIntRes_VECTOR_SHUFFLE(N); break;		Res = PromoteIntRes_VECTOR_SHUFFLE(N); break;
		case ISD::VECTOR_SPLICE:
		Res = PromoteIntRes_VECTOR_SPLICE(N); break;
case ISD::INSERT_VECTOR_ELT:		case ISD::INSERT_VECTOR_ELT:
Res = PromoteIntRes_INSERT_VECTOR_ELT(N); break;		Res = PromoteIntRes_INSERT_VECTOR_ELT(N); break;
case ISD::BUILD_VECTOR:		case ISD::BUILD_VECTOR:
Res = PromoteIntRes_BUILD_VECTOR(N); break;		Res = PromoteIntRes_BUILD_VECTOR(N); break;
case ISD::SCALAR_TO_VECTOR:		case ISD::SCALAR_TO_VECTOR:
Res = PromoteIntRes_SCALAR_TO_VECTOR(N); break;		Res = PromoteIntRes_SCALAR_TO_VECTOR(N); break;
case ISD::SPLAT_VECTOR:		case ISD::SPLAT_VECTOR:
Res = PromoteIntRes_SPLAT_VECTOR(N); break;		Res = PromoteIntRes_SPLAT_VECTOR(N); break;
[...] SDValue DAGTypeLegalizer::ExpandIntOp_ATOMIC_STORE(SDNode *N) {
SDValue Swap = DAG.getAtomic(ISD::ATOMIC_SWAP, dl,		SDValue Swap = DAG.getAtomic(ISD::ATOMIC_SWAP, dl,
cast<AtomicSDNode>(N)->getMemoryVT(),		cast<AtomicSDNode>(N)->getMemoryVT(),
N->getOperand(0),		N->getOperand(0),
N->getOperand(1), N->getOperand(2),		N->getOperand(1), N->getOperand(2),
cast<AtomicSDNode>(N)->getMemOperand());		cast<AtomicSDNode>(N)->getMemOperand());
return Swap.getValue(1);		return Swap.getValue(1);
}		}

		SDValue DAGTypeLegalizer::PromoteIntRes_VECTOR_SPLICE(SDNode *N) {
		SDLoc dl(N);

		SDValue V0 = GetPromotedInteger(N->getOperand(0));
		SDValue V1 = GetPromotedInteger(N->getOperand(1));
		EVT OutVT = V0.getValueType();

		return DAG.getNode(ISD::VECTOR_SPLICE, dl, OutVT, V0, V1, N->getOperand(2));
		}

SDValue DAGTypeLegalizer::PromoteIntRes_EXTRACT_SUBVECTOR(SDNode *N) {		SDValue DAGTypeLegalizer::PromoteIntRes_EXTRACT_SUBVECTOR(SDNode *N) {

EVT OutVT = N->getValueType(0);		EVT OutVT = N->getValueType(0);
EVT NOutVT = TLI.getTypeToTransformTo(*DAG.getContext(), OutVT);		EVT NOutVT = TLI.getTypeToTransformTo(*DAG.getContext(), OutVT);
assert(NOutVT.isVector() && "This type must be promoted to a vector type");		assert(NOutVT.isVector() && "This type must be promoted to a vector type");
EVT NOutVTElem = NOutVT.getVectorElementType();		EVT NOutVTElem = NOutVT.getVectorElementType();


llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h

[...] private:
SDValue PromoteIntRes_AssertSext(SDNode *N);		SDValue PromoteIntRes_AssertSext(SDNode *N);
SDValue PromoteIntRes_AssertZext(SDNode *N);		SDValue PromoteIntRes_AssertZext(SDNode *N);
SDValue PromoteIntRes_Atomic0(AtomicSDNode *N);		SDValue PromoteIntRes_Atomic0(AtomicSDNode *N);
SDValue PromoteIntRes_Atomic1(AtomicSDNode *N);		SDValue PromoteIntRes_Atomic1(AtomicSDNode *N);
SDValue PromoteIntRes_AtomicCmpSwap(AtomicSDNode *N, unsigned ResNo);		SDValue PromoteIntRes_AtomicCmpSwap(AtomicSDNode *N, unsigned ResNo);
SDValue PromoteIntRes_EXTRACT_SUBVECTOR(SDNode *N);		SDValue PromoteIntRes_EXTRACT_SUBVECTOR(SDNode *N);
SDValue PromoteIntRes_VECTOR_REVERSE(SDNode *N);		SDValue PromoteIntRes_VECTOR_REVERSE(SDNode *N);
SDValue PromoteIntRes_VECTOR_SHUFFLE(SDNode *N);		SDValue PromoteIntRes_VECTOR_SHUFFLE(SDNode *N);
		SDValue PromoteIntRes_VECTOR_SPLICE(SDNode *N);
SDValue PromoteIntRes_BUILD_VECTOR(SDNode *N);		SDValue PromoteIntRes_BUILD_VECTOR(SDNode *N);
SDValue PromoteIntRes_SCALAR_TO_VECTOR(SDNode *N);		SDValue PromoteIntRes_SCALAR_TO_VECTOR(SDNode *N);
SDValue PromoteIntRes_SPLAT_VECTOR(SDNode *N);		SDValue PromoteIntRes_SPLAT_VECTOR(SDNode *N);
SDValue PromoteIntRes_EXTEND_VECTOR_INREG(SDNode *N);		SDValue PromoteIntRes_EXTEND_VECTOR_INREG(SDNode *N);
SDValue PromoteIntRes_INSERT_VECTOR_ELT(SDNode *N);		SDValue PromoteIntRes_INSERT_VECTOR_ELT(SDNode *N);
SDValue PromoteIntRes_CONCAT_VECTORS(SDNode *N);		SDValue PromoteIntRes_CONCAT_VECTORS(SDNode *N);
SDValue PromoteIntRes_BITCAST(SDNode *N);		SDValue PromoteIntRes_BITCAST(SDNode *N);
SDValue PromoteIntRes_BSWAP(SDNode *N);		SDValue PromoteIntRes_BSWAP(SDNode *N);
[...] private:
void SplitVecRes_LOAD(LoadSDNode *LD, SDValue &Lo, SDValue &Hi);		void SplitVecRes_LOAD(LoadSDNode *LD, SDValue &Lo, SDValue &Hi);
void SplitVecRes_MLOAD(MaskedLoadSDNode *MLD, SDValue &Lo, SDValue &Hi);		void SplitVecRes_MLOAD(MaskedLoadSDNode *MLD, SDValue &Lo, SDValue &Hi);
void SplitVecRes_MGATHER(MaskedGatherSDNode *MGT, SDValue &Lo, SDValue &Hi);		void SplitVecRes_MGATHER(MaskedGatherSDNode *MGT, SDValue &Lo, SDValue &Hi);
void SplitVecRes_ScalarOp(SDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_ScalarOp(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_SETCC(SDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_SETCC(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_VECTOR_REVERSE(SDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_VECTOR_REVERSE(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N, SDValue &Lo,		void SplitVecRes_VECTOR_SHUFFLE(ShuffleVectorSDNode *N, SDValue &Lo,
SDValue &Hi);		SDValue &Hi);
		void SplitVecRes_VECTOR_SPLICE(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_VAARG(SDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_VAARG(SDNode *N, SDValue &Lo, SDValue &Hi);
void SplitVecRes_FP_TO_XINT_SAT(SDNode *N, SDValue &Lo, SDValue &Hi);		void SplitVecRes_FP_TO_XINT_SAT(SDNode *N, SDValue &Lo, SDValue &Hi);

// Vector Operand Splitting: <128 x ty> -> 2 x <64 x ty>.		// Vector Operand Splitting: <128 x ty> -> 2 x <64 x ty>.
bool SplitVectorOperand(SDNode *N, unsigned OpNo);		bool SplitVectorOperand(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_VSELECT(SDNode *N, unsigned OpNo);		SDValue SplitVecOp_VSELECT(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_VECREDUCE(SDNode *N, unsigned OpNo);		SDValue SplitVecOp_VECREDUCE(SDNode *N, unsigned OpNo);
SDValue SplitVecOp_VECREDUCE_SEQ(SDNode *N);		SDValue SplitVecOp_VECREDUCE_SEQ(SDNode *N);

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

[...] case ISD::SETCC:
SplitVecRes_SETCC(N, Lo, Hi);		SplitVecRes_SETCC(N, Lo, Hi);
break;		break;
case ISD::VECTOR_REVERSE:		case ISD::VECTOR_REVERSE:
SplitVecRes_VECTOR_REVERSE(N, Lo, Hi);		SplitVecRes_VECTOR_REVERSE(N, Lo, Hi);
break;		break;
case ISD::VECTOR_SHUFFLE:		case ISD::VECTOR_SHUFFLE:
SplitVecRes_VECTOR_SHUFFLE(cast<ShuffleVectorSDNode>(N), Lo, Hi);		SplitVecRes_VECTOR_SHUFFLE(cast<ShuffleVectorSDNode>(N), Lo, Hi);
break;		break;
		case ISD::VECTOR_SPLICE:
		SplitVecRes_VECTOR_SPLICE(N, Lo, Hi);
		paulwalker-arm: This doesn't look great to me because it ties the hands of expandVectorSplice, whose name does not suggest it must return a load. If you prefer not to explicitly create the split loads then it seems more natural, and future proof, if the result of expandVectorSplice is split using a pair of EXTRACT_SUBVECTOR operations.
		c-rhodes: > This doesn't look great to me because it ties the hands of expandVectorSplice, whose name does not suggest it must return a load. If you prefer not to explicitly create the split loads then it seems more natural, and future proof, if the result of expandVectorSplice is split using a pair of EXTRACT_SUBVECTOR operations.
		Fair point, I've used EXTRACT_SUBVECTOR instead. I should point out warnings were being generated for the `nxv16f32` splitvec tests by the `getVectorNumElements` call in `DAGTypeLegalizer::SplitVecRes_EXTRACT_SUBVECTOR`. To remove the warnings I changed it to `getVectorMinNumElements`.
		break;
case ISD::VAARG:		case ISD::VAARG:
SplitVecRes_VAARG(N, Lo, Hi);		SplitVecRes_VAARG(N, Lo, Hi);
break;		break;

case ISD::ANY_EXTEND_VECTOR_INREG:		case ISD::ANY_EXTEND_VECTOR_INREG:
case ISD::SIGN_EXTEND_VECTOR_INREG:		case ISD::SIGN_EXTEND_VECTOR_INREG:
case ISD::ZERO_EXTEND_VECTOR_INREG:		case ISD::ZERO_EXTEND_VECTOR_INREG:
SplitVecRes_ExtVecInRegOp(N, Lo, Hi);		SplitVecRes_ExtVecInRegOp(N, Lo, Hi);
[...] void DAGTypeLegalizer::SplitVecRes_EXTRACT_SUBVECTOR(SDNode *N, SDValue &Lo,

EVT LoVT, HiVT;		EVT LoVT, HiVT;
std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));		std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(N->getValueType(0));

Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, LoVT, Vec, Idx);		Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, LoVT, Vec, Idx);
uint64_t IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();		uint64_t IdxVal = cast<ConstantSDNode>(Idx)->getZExtValue();
Hi = DAG.getNode(		Hi = DAG.getNode(
ISD::EXTRACT_SUBVECTOR, dl, HiVT, Vec,		ISD::EXTRACT_SUBVECTOR, dl, HiVT, Vec,
DAG.getVectorIdxConstant(IdxVal + LoVT.getVectorNumElements(), dl));		DAG.getVectorIdxConstant(IdxVal + LoVT.getVectorMinNumElements(), dl));
}		}

void DAGTypeLegalizer::SplitVecRes_INSERT_SUBVECTOR(SDNode *N, SDValue &Lo,		void DAGTypeLegalizer::SplitVecRes_INSERT_SUBVECTOR(SDNode *N, SDValue &Lo,
SDValue &Hi) {		SDValue &Hi) {
SDValue Vec = N->getOperand(0);		SDValue Vec = N->getOperand(0);
SDValue SubVec = N->getOperand(1);		SDValue SubVec = N->getOperand(1);
SDValue Idx = N->getOperand(2);		SDValue Idx = N->getOperand(2);
SDLoc dl(N);		SDLoc dl(N);
[...] void DAGTypeLegalizer::SplitVecRes_VECTOR_REVERSE(SDNode *N, SDValue &Lo,
SDValue &Hi) {		SDValue &Hi) {
SDValue InLo, InHi;		SDValue InLo, InHi;
GetSplitVector(N->getOperand(0), InLo, InHi);		GetSplitVector(N->getOperand(0), InLo, InHi);
SDLoc DL(N);		SDLoc DL(N);

Lo = DAG.getNode(ISD::VECTOR_REVERSE, DL, InHi.getValueType(), InHi);		Lo = DAG.getNode(ISD::VECTOR_REVERSE, DL, InHi.getValueType(), InHi);
Hi = DAG.getNode(ISD::VECTOR_REVERSE, DL, InLo.getValueType(), InLo);		Hi = DAG.getNode(ISD::VECTOR_REVERSE, DL, InLo.getValueType(), InLo);
}		}

		void DAGTypeLegalizer::SplitVecRes_VECTOR_SPLICE(SDNode *N, SDValue &Lo,
		SDValue &Hi) {
		EVT VT = N->getValueType(0);
		SDLoc DL(N);

		EVT LoVT, HiVT;
		std::tie(LoVT, HiVT) = DAG.GetSplitDestVTs(VT);

		SDValue Expanded = TLI.expandVectorSplice(N, DAG);
		Lo = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, LoVT, Expanded,
		DAG.getVectorIdxConstant(0, DL));
		Hi =
		DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, HiVT, Expanded,
		DAG.getVectorIdxConstant(LoVT.getVectorMinNumElements(), DL));
		}
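
As a rough illustration of the splitting step above, the following C++ sketch models the two EXTRACT_SUBVECTOR operations for a fixed-width vector: the low half starts at index 0 and the high half starts at the element count of the low half. The helper name `splitHalves` and the even split are assumptions for illustration; the patch itself obtains the two result types from `GetSplitDestVTs`.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model: split an already-expanded splice result into a low and
// a high part, the way a pair of subvector extractions would.
template <typename T>
std::pair<std::vector<T>, std::vector<T>>
splitHalves(const std::vector<T> &Expanded) {
  // Assumption for this sketch: the low part holds half of the elements.
  const std::size_t LoElts = Expanded.size() / 2;
  std::vector<T> Lo(Expanded.begin(), Expanded.begin() + LoElts); // index 0
  std::vector<T> Hi(Expanded.begin() + LoElts, Expanded.end());   // index LoElts
  return {Lo, Hi};
}
```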

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h

[...] private:
void visitPatchpoint(const CallBase &CB, const BasicBlock *EHPadBB = nullptr);		void visitPatchpoint(const CallBase &CB, const BasicBlock *EHPadBB = nullptr);

// These two are implemented in StatepointLowering.cpp		// These two are implemented in StatepointLowering.cpp
void visitGCRelocate(const GCRelocateInst &Relocate);		void visitGCRelocate(const GCRelocateInst &Relocate);
void visitGCResult(const GCResultInst &I);		void visitGCResult(const GCResultInst &I);

void visitVectorReduce(const CallInst &I, unsigned Intrinsic);		void visitVectorReduce(const CallInst &I, unsigned Intrinsic);
void visitVectorReverse(const CallInst &I);		void visitVectorReverse(const CallInst &I);
		void visitVectorSplice(const CallInst &I);

void visitUserOp1(const Instruction &I) {		void visitUserOp1(const Instruction &I) {
llvm_unreachable("UserOp1 should not exist at instruction selection time!");		llvm_unreachable("UserOp1 should not exist at instruction selection time!");
}		}
void visitUserOp2(const Instruction &I) {		void visitUserOp2(const Instruction &I) {
llvm_unreachable("UserOp2 should not exist at instruction selection time!");		llvm_unreachable("UserOp2 should not exist at instruction selection time!");
}		}


llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp


[...] case Intrinsic::experimental_vector_extract: {
EVT ResultVT = TLI.getValueType(DAG.getDataLayout(), I.getType());		EVT ResultVT = TLI.getValueType(DAG.getDataLayout(), I.getType());

setValue(&I, DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ResultVT, Vec, Index));		setValue(&I, DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, ResultVT, Vec, Index));
return;		return;
}		}
case Intrinsic::experimental_vector_reverse:		case Intrinsic::experimental_vector_reverse:
visitVectorReverse(I);		visitVectorReverse(I);
return;		return;
		case Intrinsic::experimental_vector_splice:
		visitVectorSplice(I);
		return;
}		}
}		}

void SelectionDAGBuilder::visitConstrainedFPIntrinsic(		void SelectionDAGBuilder::visitConstrainedFPIntrinsic(
const ConstrainedFPIntrinsic &FPI) {		const ConstrainedFPIntrinsic &FPI) {
SDLoc sdl = getCurSDLoc();		SDLoc sdl = getCurSDLoc();

const TargetLowering &TLI = DAG.getTargetLoweringInfo();		const TargetLowering &TLI = DAG.getTargetLoweringInfo();
[...] void SelectionDAGBuilder::visitFreeze(const FreezeInst &I) {

for (unsigned i = 0; i != NumValues; ++i)		for (unsigned i = 0; i != NumValues; ++i)
Values[i] = DAG.getNode(ISD::FREEZE, getCurSDLoc(), ValueVTs[i],		Values[i] = DAG.getNode(ISD::FREEZE, getCurSDLoc(), ValueVTs[i],
SDValue(Op.getNode(), Op.getResNo() + i));		SDValue(Op.getNode(), Op.getResNo() + i));

setValue(&I, DAG.getNode(ISD::MERGE_VALUES, getCurSDLoc(),		setValue(&I, DAG.getNode(ISD::MERGE_VALUES, getCurSDLoc(),
DAG.getVTList(ValueVTs), Values));		DAG.getVTList(ValueVTs), Values));
}		}

		void SelectionDAGBuilder::visitVectorSplice(const CallInst &I) {
		const TargetLowering &TLI = DAG.getTargetLoweringInfo();
		EVT VT = TLI.getValueType(DAG.getDataLayout(), I.getType());

		SDLoc DL = getCurSDLoc();
		SDValue V1 = getValue(I.getOperand(0));
		SDValue V2 = getValue(I.getOperand(1));
		int64_t Imm = cast<ConstantInt>(I.getOperand(2))->getSExtValue();

		// VECTOR_SHUFFLE doesn't support a scalable mask so use a dedicated node.
		if (VT.isScalableVector()) {
		MVT IdxVT = TLI.getVectorIdxTy(DAG.getDataLayout());
		setValue(&I, DAG.getNode(ISD::VECTOR_SPLICE, DL, VT, V1, V2,
		DAG.getConstant(Imm, DL, IdxVT)));
		return;
		}

		unsigned NumElts = VT.getVectorNumElements();

		if ((-Imm > NumElts) || (Imm >= NumElts)) {
		// Result is undefined if immediate is out-of-bounds.
		setValue(&I, DAG.getUNDEF(VT));
		return;
		}

		uint64_t Idx = (NumElts + Imm) % NumElts;
		sdesmalen: just `Imm % NumElts`; ?
		c-rhodes: > just `Imm % NumElts`; ?
		I tried that at first but found the behaviour isn't correct for negative immediates. For example, with a trailing element count of `-15` and 16 elements, `-15 % 16 = -15`, so I end up with this mask: `[-15,-14,-13,-12,-11,-10,-9,-8,-7,-6,-5,-4,-3,-2,-1,0]`. From what I read the sign of the modulo result is implementation defined if one or both of the operands are negative.
		bin.cheng-ali: Then `(NumElts + Imm) % NumElts` ?

		// Use VECTOR_SHUFFLE to maintain original behaviour for fixed-length vectors.
		SmallVector<int, 8> Mask;
		for (unsigned i = 0; i < NumElts; ++i)
		Mask.push_back(Idx + i);
		setValue(&I, DAG.getVectorShuffle(VT, DL, V1, V2, Mask));
		}
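
The index computation discussed in the comments above can be checked in isolation. Below is a minimal C++ sketch that mirrors the fixed-width path of `visitVectorSplice`; the helper name `spliceShuffleMask` and the `std::optional` return (empty where the patch emits UNDEF) are assumptions for illustration. One caveat on the thread: since C++11 the `%` operator truncates toward zero, so the result takes the sign of the dividend and `-15 % 16` is `-15`. Adding the element count before taking the remainder keeps the dividend non-negative for every in-range immediate.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical helper: build the shuffle mask that selects NumElts
// consecutive elements from concat(V1, V2) for a given splice immediate.
std::optional<std::vector<int>> spliceShuffleMask(int64_t Imm,
                                                  unsigned NumElts) {
  // Widen to a signed type so the range checks are signed comparisons.
  const int64_t N = NumElts;
  if (Imm < -N || Imm >= N)
    return std::nullopt; // out of range: the patch produces UNDEF here

  // For Imm in [-N, N), N + Imm lies in [0, 2N), so the remainder is the
  // start index in [0, N) for both the index and trailing-count variants.
  const int64_t Idx = (N + Imm) % N;

  std::vector<int> Mask;
  Mask.reserve(NumElts);
  for (int64_t I = 0; I < N; ++I)
    Mask.push_back(static_cast<int>(Idx + I));
  return Mask;
}
```

For the case raised in the thread, an immediate of `-15` with 16 elements gives a start index of 1 here, so the mask runs from 1 to 16 rather than from -15 to 0.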

llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp

[...] #endif
case ISD::SELECT_CC: return "select_cc";		case ISD::SELECT_CC: return "select_cc";
case ISD::INSERT_VECTOR_ELT: return "insert_vector_elt";		case ISD::INSERT_VECTOR_ELT: return "insert_vector_elt";
case ISD::EXTRACT_VECTOR_ELT: return "extract_vector_elt";		case ISD::EXTRACT_VECTOR_ELT: return "extract_vector_elt";
case ISD::CONCAT_VECTORS: return "concat_vectors";		case ISD::CONCAT_VECTORS: return "concat_vectors";
case ISD::INSERT_SUBVECTOR: return "insert_subvector";		case ISD::INSERT_SUBVECTOR: return "insert_subvector";
case ISD::EXTRACT_SUBVECTOR: return "extract_subvector";		case ISD::EXTRACT_SUBVECTOR: return "extract_subvector";
case ISD::SCALAR_TO_VECTOR: return "scalar_to_vector";		case ISD::SCALAR_TO_VECTOR: return "scalar_to_vector";
case ISD::VECTOR_SHUFFLE: return "vector_shuffle";		case ISD::VECTOR_SHUFFLE: return "vector_shuffle";
		case ISD::VECTOR_SPLICE: return "vector_splice";
case ISD::SPLAT_VECTOR: return "splat_vector";		case ISD::SPLAT_VECTOR: return "splat_vector";
case ISD::VECTOR_REVERSE: return "vector_reverse";		case ISD::VECTOR_REVERSE: return "vector_reverse";
case ISD::CARRY_FALSE: return "carry_false";		case ISD::CARRY_FALSE: return "carry_false";
case ISD::ADDC: return "addc";		case ISD::ADDC: return "addc";
case ISD::ADDE: return "adde";		case ISD::ADDE: return "adde";
case ISD::ADDCARRY: return "addcarry";		case ISD::ADDCARRY: return "addcarry";
case ISD::SADDO_CARRY: return "saddo_carry";		case ISD::SADDO_CARRY: return "saddo_carry";
case ISD::SADDO: return "saddo";		case ISD::SADDO: return "saddo";

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp


[...] SDValue TargetLowering::expandFP_TO_INT_SAT(SDNode *Node,
// is already zero.		// is already zero.
if (!IsSigned)		if (!IsSigned)
return Select;		return Select;

// Otherwise, select 0 if Src is NaN.		// Otherwise, select 0 if Src is NaN.
SDValue ZeroInt = DAG.getConstant(0, dl, DstVT);		SDValue ZeroInt = DAG.getConstant(0, dl, DstVT);
return DAG.getSelectCC(dl, Src, Src, ZeroInt, Select, ISD::CondCode::SETUO);		return DAG.getSelectCC(dl, Src, Src, ZeroInt, Select, ISD::CondCode::SETUO);
}		}

		SDValue TargetLowering::expandVectorSplice(SDNode *Node,
		SelectionDAG &DAG) const {
		assert(Node->getOpcode() == ISD::VECTOR_SPLICE && "Unexpected opcode!");
		assert(Node->getValueType(0).isScalableVector() &&
		"Fixed length vector types expected to use SHUFFLE_VECTOR!");

		EVT VT = Node->getValueType(0);
		SDValue V1 = Node->getOperand(0);
		SDValue V2 = Node->getOperand(1);
		int64_t Imm = cast<ConstantSDNode>(Node->getOperand(2))->getSExtValue();
		SDLoc DL(Node);

		// Expand through memory thusly:
		// Alloca CONCAT_VECTORS_TYPES(V1, V2) Ptr
		// Store V1, Ptr
		// Store V2, Ptr + sizeof(V1)
		// If (Imm < 0)
		// TrailingElts = -Imm
		// Ptr = Ptr + sizeof(V1) - (TrailingElts * sizeof(VT.Elt))
		// else
		// Ptr = Ptr + (Imm * sizeof(VT.Elt))
		// Res = Load Ptr
		bin.cheng-ali: Sorry for my stupid question: this stores two vector registers and loads from the middle, so does endianness matter here? How does LLVM make sure the correct sequence of elements is loaded? Thanks in advance.
		paulwalker-arm: There's no such thing as a stupid question :) These are load/store operations on vector types, so endianness only plays a role in how each element is stored and not in the order of the elements. That is to say, element N will be at the same location in memory regardless of endianness; however, the bytes that make up element N will be laid out differently. The index used to splice the vectors is also element based, which means there are no partial element accesses and thus no endianness issues.
		bin.cheng-ali: Thanks very much for the explanation. I assume this is defined in the Arm Architecture Reference Manual; however, I noticed there is a supplementary rule for SVE: "For unpredicated SVE vector register loads and stores, the vector is treated as containing byte elements that are transferred in increasing element number order without any endianness conversion." IIUC, this rule would apply here? Of course, there is no endianness issue either way.

		Align Alignment = DAG.getReducedAlign(VT, /*UseABI=*/false);

		EVT MemVT = EVT::getVectorVT(*DAG.getContext(), VT.getVectorElementType(),
		VT.getVectorElementCount() * 2);
		SDValue StackPtr = DAG.CreateStackTemporary(MemVT.getStoreSize(), Alignment);
		EVT PtrVT = StackPtr.getValueType();
		auto &MF = DAG.getMachineFunction();
		auto FrameIndex = cast<FrameIndexSDNode>(StackPtr.getNode())->getIndex();
		auto PtrInfo = MachinePointerInfo::getFixedStack(MF, FrameIndex);

		// Store the lo part of CONCAT_VECTORS(V1, V2)
		SDValue StoreV1 = DAG.getStore(DAG.getEntryNode(), DL, V1, StackPtr, PtrInfo);
		// Store the hi part of CONCAT_VECTORS(V1, V2)
		SDValue OffsetToV2 = DAG.getVScale(
		DL, PtrVT,
		APInt(PtrVT.getFixedSizeInBits(), VT.getStoreSize().getKnownMinSize()));
		SDValue StackPtr2 = DAG.getNode(ISD::ADD, DL, PtrVT, StackPtr, OffsetToV2);
		SDValue StoreV2 = DAG.getStore(StoreV1, DL, V2, StackPtr2, PtrInfo);

		if (Imm < 0) {
		sdesmalen: nit: because the Imm < 0 block is a lot bigger, maybe you can rewrite it as:
		    if (Imm >= 0) {
		      // Load back the required element.
		      StackPtr = getVectorElementPointer(DAG, StackPtr, VT, Node->getOperand(2));
		      // Load the spliced result
		      return DAG.getLoad(VT, DL, StoreV2, StackPtr, MachinePointerInfo::getUnknownStack(MF));
		    }
		    // Handle Imm < 0 case here.
		uint64_t TrailingElts = -Imm;
		sdesmalen: nit: s/This/getVectorElementPointer/

		// NOTE: TrailingElts must be clamped so as not to read outside of V1:V2.
		TypeSize EltByteSize = VT.getVectorElementType().getStoreSize();
		SDValue TrailingBytes =
		DAG.getConstant(TrailingElts * EltByteSize, DL, PtrVT);

		if (TrailingElts > VT.getVectorMinNumElements()) {
		SDValue VLBytes =
		DAG.getVScale(DL, PtrVT,
		APInt(PtrVT.getFixedSizeInBits(),
		VT.getStoreSize().getKnownMinSize()));
		TrailingBytes = DAG.getNode(ISD::UMIN, DL, PtrVT, TrailingBytes, VLBytes);
		sdesmalen: You'll need to do a similar clamping for Imm > 0 (or actually, Imm > MinKnownVL), to avoid loading beyond the allocated stack object.
		c-rhodes (Author):
		> You'll need to do a similar clamping for Imm > 0 (or actually, Imm > MinKnownVL), to avoid loading beyond the allocated stack object.
		For the index, `getVectorElementPointer` does the clamping via `clampDynamicVectorIndex`.
		sdesmalen: Thanks for confirming. Can you just add a comment that clarifies this?
		}

		// Calculate the start address of the spliced result.
		StackPtr2 = DAG.getNode(ISD::SUB, DL, PtrVT, StackPtr2, TrailingBytes);

		// Load the spliced result
		return DAG.getLoad(VT, DL, StoreV2, StackPtr2,
		MachinePointerInfo::getUnknownStack(MF));
		}

		// Load back the required element.
		StackPtr = getVectorElementPointer(DAG, StackPtr, VT, Node->getOperand(2));
		// Load the spliced result
		return DAG.getLoad(VT, DL, StoreV2, StackPtr,
		MachinePointerInfo::getUnknownStack(MF));
		}
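
The expansion through memory described in the comment above can be modelled outside of SelectionDAG. The following is only a minimal sketch, assuming a runtime element count `VL` in place of `vscale * MinNumElts`; `spliceViaMemory` is a hypothetical helper name, not an LLVM API, and the clamping is done in elements rather than bytes (which is equivalent for whole elements).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <vector>

// Hypothetical model of the expansion (not an LLVM API): write V1 and V2 to a
// scratch buffer back to back, then read VL elements from a computed offset.
template <typename T>
std::vector<T> spliceViaMemory(const std::vector<T> &V1,
                               const std::vector<T> &V2, int64_t Imm) {
  static_assert(std::is_trivially_copyable<T>::value, "memcpy-based model");
  assert(!V1.empty() && V1.size() == V2.size() && "operands must match");

  const size_t VL = V1.size();           // stands in for vscale * MinNumElts
  const size_t VecBytes = VL * sizeof(T);

  // Alloca CONCAT_VECTORS_TYPES(V1, V2); Store V1; Store V2 at +sizeof(V1).
  std::vector<unsigned char> Stack(2 * VecBytes);
  std::memcpy(Stack.data(), V1.data(), VecBytes);
  std::memcpy(Stack.data() + VecBytes, V2.data(), VecBytes);

  size_t Offset;
  if (Imm < 0) {
    // Trailing-element form: clamp so the load never starts before V1.
    const uint64_t TrailingElts =
        std::min<uint64_t>(static_cast<uint64_t>(-Imm), VL);
    Offset = VecBytes - static_cast<size_t>(TrailingElts) * sizeof(T);
  } else {
    // Index form: clamp to VL - 1 so the load never runs past V2.
    const uint64_t Idx = std::min<uint64_t>(static_cast<uint64_t>(Imm), VL - 1);
    Offset = static_cast<size_t>(Idx) * sizeof(T);
  }

  // Res = Load Ptr
  std::vector<T> Res(VL);
  std::memcpy(Res.data(), Stack.data() + Offset, VecBytes);
  return Res;
}
```

With `V1 = <A,B,C,D>` and `V2 = <E,F,G,H>`, immediates of 1 and -3 both give `<B,C,D,E>`, matching the example in the summary. Because the offset is always a whole number of elements, the model never reads a partial element, which is the property the endianness discussion above relies on.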

llvm/lib/CodeGen/TargetLoweringBase.cpp

#include "llvm/IR/ConstrainedOps.def"
setOperationAction(ISD::VECREDUCE_SMAX, VT, Expand);
setOperationAction(ISD::VECREDUCE_SMIN, VT, Expand);
setOperationAction(ISD::VECREDUCE_UMAX, VT, Expand);
setOperationAction(ISD::VECREDUCE_UMIN, VT, Expand);
setOperationAction(ISD::VECREDUCE_FMAX, VT, Expand);
setOperationAction(ISD::VECREDUCE_FMIN, VT, Expand);
setOperationAction(ISD::VECREDUCE_SEQ_FADD, VT, Expand);
setOperationAction(ISD::VECREDUCE_SEQ_FMUL, VT, Expand);

		// Named vector shuffles default to expand.
		setOperationAction(ISD::VECTOR_SPLICE, VT, Expand);
}

// Most targets ignore the @llvm.prefetch intrinsic.
setOperationAction(ISD::PREFETCH, MVT::Other, Expand);

// Most targets also ignore the @llvm.readcyclecounter intrinsic.
setOperationAction(ISD::READCYCLECOUNTER, MVT::i64, Expand);

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


for (auto VT : {MVT::nxv16i8, MVT::nxv8i16, MVT::nxv4i32, MVT::nxv2i64}) {
setOperationAction(ISD::SINT_TO_FP, VT, Custom);
setOperationAction(ISD::FP_TO_UINT, VT, Custom);
setOperationAction(ISD::FP_TO_SINT, VT, Custom);
setOperationAction(ISD::MGATHER, VT, Custom);
setOperationAction(ISD::MSCATTER, VT, Custom);
setOperationAction(ISD::MUL, VT, Custom);
setOperationAction(ISD::SPLAT_VECTOR, VT, Custom);
setOperationAction(ISD::SELECT, VT, Custom);
		setOperationAction(ISD::SETCC, VT, Custom);
		cameron.mcinally: Is this change required for the splice patch? I wonder if it should be broken out to its own commit.
		c-rhodes (Author):
		> Is this change required for the splice patch? I wonder if it should be broken out to its own commit.
		This fixes a selection error I hit when splicing predicates. The promotion of predicates uses a truncate which gets lowered to a setcc that crashes on operands of type `VT`, e.g.:
		    t17: nxv2i64 = and t29, t30
		    t31: nxv2i64 = AArch64ISD::DUP Constant:i64<0>
		    t19: nxv2i1 = setcc t17, t31, setne:ch
		I can split this into a separate patch.
		c-rhodes (Author):
		> > Is this change required for the splice patch? I wonder if it should be broken out to its own commit.
		> This fixes a selection error I hit when splicing predicates. [...] I can split this into a separate patch.
		I tried splitting this out, but I couldn't figure out how to test it, so I've left it in this patch. What I found was that the setcc generated by the splice occurs after result type legalization, so without the above a setcc returning an `nxv2i1` and taking two `nxv2i64` vectors is considered legal at this point in selection. With the above it falls into `SelectionDAGLegalize::LegalizeOp`, which considers the legality based on the operand VTs, so the setcc then gets custom lowered to the target-specific predicated variant `AArch64ISD::SETCC_MERGE_ZERO`, which can be selected.
setOperationAction(ISD::SDIV, VT, Custom);
setOperationAction(ISD::UDIV, VT, Custom);
setOperationAction(ISD::SMIN, VT, Custom);
setOperationAction(ISD::UMIN, VT, Custom);
setOperationAction(ISD::SMAX, VT, Custom);
setOperationAction(ISD::UMAX, VT, Custom);
setOperationAction(ISD::SHL, VT, Custom);
setOperationAction(ISD::SRL, VT, Custom);
…
if (Subtarget->useSVEForFixedLengthVectors()) {
for (auto VT : {MVT::v4f16, MVT::v8f16, MVT::v2f32, MVT::v4f32,
MVT::v1f64, MVT::v2f64})
setOperationAction(ISD::VECREDUCE_SEQ_FADD, VT, Custom);

// Use SVE for vectors with more than 2 elements.
for (auto VT : {MVT::v4f16, MVT::v8f16, MVT::v4f32})
setOperationAction(ISD::VECREDUCE_FADD, VT, Custom);
}

		setOperationPromotedToType(ISD::VECTOR_SPLICE, MVT::nxv2i1, MVT::nxv2i64);
		craig.topper: Can you use setOperationPromotedToType here?
		c-rhodes (Author):
		> Can you use setOperationPromotedToType here?
		That's nice, I wasn't aware of that, thanks.
		setOperationPromotedToType(ISD::VECTOR_SPLICE, MVT::nxv4i1, MVT::nxv4i32);
		setOperationPromotedToType(ISD::VECTOR_SPLICE, MVT::nxv8i1, MVT::nxv8i16);
		setOperationPromotedToType(ISD::VECTOR_SPLICE, MVT::nxv16i1, MVT::nxv16i8);
}

PredictableSelectIsExpensive = Subtarget->predictableSelectIsExpensive();
}

void AArch64TargetLowering::addTypeForNEON(MVT VT, MVT PromotedBitwiseVT) {
assert(VT.isVector() && "VT should be a vector type");

llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -verify-machineinstrs < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				;
				; VECTOR_SPLICE (index)
				;

				define <16 x i8> @splice_v16i8_idx(<16 x i8> %a, <16 x i8> %b) #0 {
				; CHECK-LABEL: splice_v16i8_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #1
				; CHECK-NEXT: ret
				%res = call <16 x i8> @llvm.experimental.vector.splice.v16i8(<16 x i8> %a, <16 x i8> %b, i32 1)
				ret <16 x i8> %res
				}

				define <2 x double> @splice_v2f64_idx(<2 x double> %a, <2 x double> %b) #0 {
				; CHECK-LABEL: splice_v2f64_idx:
				sdesmalen: nit: not sure if there is a lot of value in doing this for each possible element type; if you do two (i8 and some other type), that may be sufficient.
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #8
				; CHECK-NEXT: ret
				%res = call <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %a, <2 x double> %b, i32 1)
				ret <2 x double> %res
				}

				; Verify promote type legalisation works as expected.
				define <2 x i8> @splice_v2i8_idx(<2 x i8> %a, <2 x i8> %b) #0 {
				; CHECK-LABEL: splice_v2i8_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.8b, v0.8b, v1.8b, #4
				; CHECK-NEXT: ret
				%res = call <2 x i8> @llvm.experimental.vector.splice.v2i8(<2 x i8> %a, <2 x i8> %b, i32 1)
				ret <2 x i8> %res
				}

				; Verify splitvec type legalisation works as expected.
				define <8 x i32> @splice_v8i32_idx(<8 x i32> %a, <8 x i32> %b) #0 {
				; CHECK-LABEL: splice_v8i32_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.16b, v1.16b, v2.16b, #4
				; CHECK-NEXT: ext v1.16b, v2.16b, v3.16b, #4
				; CHECK-NEXT: ret
				%res = call <8 x i32> @llvm.experimental.vector.splice.v8i32(<8 x i32> %a, <8 x i32> %b, i32 5)
				ret <8 x i32> %res
				}

				; Verify splitvec type legalisation works as expected.
				define <16 x float> @splice_v16f32_idx(<16 x float> %a, <16 x float> %b) #0 {
				; CHECK-LABEL: splice_v16f32_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.16b, v1.16b, v2.16b, #12
				; CHECK-NEXT: ext v1.16b, v2.16b, v3.16b, #12
				; CHECK-NEXT: ext v2.16b, v3.16b, v4.16b, #12
				; CHECK-NEXT: ext v3.16b, v4.16b, v5.16b, #12
				; CHECK-NEXT: ret
				%res = call <16 x float> @llvm.experimental.vector.splice.v16f32(<16 x float> %a, <16 x float> %b, i32 7)
				ret <16 x float> %res
				}

				; Verify out-of-bounds index results in undef vector.
				define <2 x double> @splice_v2f64_idx_out_of_bounds(<2 x double> %a, <2 x double> %b) #0 {
				; CHECK-LABEL: splice_v2f64_idx_out_of_bounds:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%res = call <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %a, <2 x double> %b, i32 2)
				ret <2 x double> %res
				}

				;
				; VECTOR_SPLICE (trailing elements)
				;

				define <16 x i8> @splice_v16i8(<16 x i8> %a, <16 x i8> %b) #0 {
				; CHECK-LABEL: splice_v16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #1
				; CHECK-NEXT: ret
				%res = call <16 x i8> @llvm.experimental.vector.splice.v16i8(<16 x i8> %a, <16 x i8> %b, i32 -15)
				ret <16 x i8> %res
				}

				define <2 x double> @splice_v2f64(<2 x double> %a, <2 x double> %b) #0 {
				; CHECK-LABEL: splice_v2f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #8
				; CHECK-NEXT: ret
				%res = call <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %a, <2 x double> %b, i32 -1)
				ret <2 x double> %res
				}

				; Verify promote type legalisation works as expected.
				define <2 x i8> @splice_v2i8(<2 x i8> %a, <2 x i8> %b) #0 {
				; CHECK-LABEL: splice_v2i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.8b, v0.8b, v1.8b, #4
				; CHECK-NEXT: ret
				%res = call <2 x i8> @llvm.experimental.vector.splice.v2i8(<2 x i8> %a, <2 x i8> %b, i32 -1)
				ret <2 x i8> %res
				}

				; Verify splitvec type legalisation works as expected.
				define <8 x i32> @splice_v8i32(<8 x i32> %a, <8 x i32> %b) #0 {
				; CHECK-LABEL: splice_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.16b, v1.16b, v2.16b, #4
				; CHECK-NEXT: ext v1.16b, v2.16b, v3.16b, #4
				; CHECK-NEXT: ret
				%res = call <8 x i32> @llvm.experimental.vector.splice.v8i32(<8 x i32> %a, <8 x i32> %b, i32 -3)
				ret <8 x i32> %res
				}

				; Verify splitvec type legalisation works as expected.
				define <16 x float> @splice_v16f32(<16 x float> %a, <16 x float> %b) #0 {
				; CHECK-LABEL: splice_v16f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ext v0.16b, v1.16b, v2.16b, #12
				; CHECK-NEXT: ext v1.16b, v2.16b, v3.16b, #12
				; CHECK-NEXT: ext v2.16b, v3.16b, v4.16b, #12
				; CHECK-NEXT: ext v3.16b, v4.16b, v5.16b, #12
				; CHECK-NEXT: ret
				%res = call <16 x float> @llvm.experimental.vector.splice.v16f32(<16 x float> %a, <16 x float> %b, i32 -9)
				ret <16 x float> %res
				}

				; Verify out-of-bounds trailing element count results in undef vector.
				define <2 x double> @splice_v2f64_out_of_bounds(<2 x double> %a, <2 x double> %b) #0 {
				; CHECK-LABEL: splice_v2f64_out_of_bounds:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%res = call <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double> %a, <2 x double> %b, i32 -3)
				ret <2 x double> %res
				}

				declare <2 x i8> @llvm.experimental.vector.splice.v2i8(<2 x i8>, <2 x i8>, i32)
				declare <16 x i8> @llvm.experimental.vector.splice.v16i8(<16 x i8>, <16 x i8>, i32)
				declare <8 x i32> @llvm.experimental.vector.splice.v8i32(<8 x i32>, <8 x i32>, i32)
				declare <16 x float> @llvm.experimental.vector.splice.v16f32(<16 x float>, <16 x float>, i32)
				declare <2 x double> @llvm.experimental.vector.splice.v2f64(<2 x double>, <2 x double>, i32)

				attributes #0 = { nounwind "target-features"="+neon" }
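
For fixed-width vectors the summary says the intrinsic is lowered to a shufflevector, and the NEON checks above show a single `ext` per result register. The sketch below is one way to reproduce the byte immediates in those CHECK lines by hand. It is not the backend's algorithm: `spliceStartIndex`, `spliceShuffleMask`, `ExtOperand` and `extOperandFor` are hypothetical names, the immediate is assumed to lie in `[-NumElts, NumElts - 1]`, and `EltBytes` is assumed to be the element size after type legalisation (e.g. 4 bytes for the promoted `<2 x i8>` case).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: normalise both immediate forms to a start index into
// the concatenation V1:V2. A negative immediate counts back from the end of V1.
inline unsigned spliceStartIndex(int64_t Imm, unsigned NumElts) {
  const int64_t N = NumElts;
  assert(Imm >= -N && Imm < N && "immediate assumed to be in range");
  return static_cast<unsigned>(Imm >= 0 ? Imm : N + Imm);
}

// Shuffle mask selecting NumElts consecutive elements from V1:V2.
inline std::vector<int> spliceShuffleMask(int64_t Imm, unsigned NumElts) {
  const unsigned Start = spliceStartIndex(Imm, NumElts);
  std::vector<int> Mask(NumElts);
  for (unsigned I = 0; I < NumElts; ++I)
    Mask[I] = static_cast<int>(Start + I);
  return Mask;
}

// Which register of the split V1:V2 sequence the first EXT reads from, and
// the byte immediate it uses (an observation from the CHECK lines only).
struct ExtOperand {
  unsigned FirstReg;
  unsigned ByteImm;
};

inline ExtOperand extOperandFor(int64_t Imm, unsigned NumElts,
                                unsigned EltBytes, unsigned RegBytes) {
  const unsigned StartByte = spliceStartIndex(Imm, NumElts) * EltBytes;
  return {StartByte / RegBytes, StartByte % RegBytes};
}
```

For example, `extOperandFor(5, 8, 4, 16)` and `extOperandFor(-3, 8, 4, 16)` both describe a start in the second register with a byte immediate of 4, which lines up with `ext v0.16b, v1.16b, v2.16b, #4` in `splice_v8i32_idx` and `splice_v8i32`.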

llvm/test/CodeGen/AArch64/named-vector-shuffles-sve.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -verify-machineinstrs < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				;
				; VECTOR_SPLICE (index)
				;

				define <vscale x 16 x i8> @splice_nxv16i8_first_idx(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: splice_nxv16i8_first_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #0 // =0
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, xzr, lo
				; CHECK-NEXT: st1b { z0.b }, p0, [sp]
				; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 0)
				ret <vscale x 16 x i8> %res
				}

				define <vscale x 16 x i8> @splice_nxv16i8_last_idx(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: splice_nxv16i8_last_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: mov w10, #15
				; CHECK-NEXT: cmp x9, #15 // =15
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: st1b { z0.b }, p0, [sp]
				; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 15)
				ret <vscale x 16 x i8> %res
				}

				; Ensure index is clamped when we cannot prove it's less than VL-1.
				define <vscale x 16 x i8> @splice_nxv16i8_clamped_idx(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: splice_nxv16i8_clamped_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: mov w10, #16
				; CHECK-NEXT: cmp x9, #16 // =16
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: st1b { z0.b }, p0, [sp]
				; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 16)
				ret <vscale x 16 x i8> %res
				}

				define <vscale x 8 x i16> @splice_nxv8i16_first_idx(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: splice_nxv8i16_first_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cnth x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #0 // =0
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, xzr, lo
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #1
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 0)
				ret <vscale x 8 x i16> %res
				}

				define <vscale x 8 x i16> @splice_nxv8i16_last_idx(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: splice_nxv8i16_last_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cnth x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #7
				; CHECK-NEXT: cmp x10, #7 // =7
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #1
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 7)
				ret <vscale x 8 x i16> %res
				}

				; Ensure index is clamped when we cannot prove it's less than VL-1.
				define <vscale x 8 x i16> @splice_nxv8i16_clamped_idx(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: splice_nxv8i16_clamped_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cnth x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #8
				; CHECK-NEXT: cmp x10, #8 // =8
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #1
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 8)
				ret <vscale x 8 x i16> %res
				}

				define <vscale x 4 x i32> @splice_nxv4i32_first_idx(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: splice_nxv4i32_first_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntw x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #0 // =0
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, xzr, lo
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #2
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 0)
				ret <vscale x 4 x i32> %res
				}

				define <vscale x 4 x i32> @splice_nxv4i32_last_idx(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: splice_nxv4i32_last_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntw x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #3
				; CHECK-NEXT: cmp x10, #3 // =3
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #2
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 3)
				ret <vscale x 4 x i32> %res
				}

				; Ensure index is clamped when we cannot prove it's less than VL-1.
				define <vscale x 4 x i32> @splice_nxv4i32_clamped_idx(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: splice_nxv4i32_clamped_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntw x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #4
				; CHECK-NEXT: cmp x10, #4 // =4
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #2
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 4)
				ret <vscale x 4 x i32> %res
				}

				define <vscale x 2 x i64> @splice_nxv2i64_first_idx(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: splice_nxv2i64_first_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntd x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #0 // =0
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, xzr, lo
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #3
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 0)
				ret <vscale x 2 x i64> %res
				}

				define <vscale x 2 x i64> @splice_nxv2i64_last_idx(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: splice_nxv2i64_last_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntd x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #1 // =1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csinc x9, x9, xzr, lo
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #3
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 1)
				ret <vscale x 2 x i64> %res
				}

				; Ensure index is clamped when we cannot prove it's less than VL-1.
				define <vscale x 2 x i64> @splice_nxv2i64_clamped_idx(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: splice_nxv2i64_clamped_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntd x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #2
				; CHECK-NEXT: cmp x10, #2 // =2
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #3
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 2)
				ret <vscale x 2 x i64> %res
				}

				define <vscale x 8 x half> @splice_nxv8f16_first_idx(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
				; CHECK-LABEL: splice_nxv8f16_first_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cnth x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #0 // =0
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, xzr, lo
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #1
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 0)
				ret <vscale x 8 x half> %res
				}

				define <vscale x 8 x half> @splice_nxv8f16_last_idx(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
				; CHECK-LABEL: splice_nxv8f16_last_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cnth x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #7
				; CHECK-NEXT: cmp x10, #7 // =7
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #1
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 7)
				ret <vscale x 8 x half> %res
				}

				; Ensure index is clamped when we cannot prove it's at most VL-1.
				define <vscale x 8 x half> @splice_nxv8f16_clamped_idx(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
				; CHECK-LABEL: splice_nxv8f16_clamped_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cnth x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #8
				; CHECK-NEXT: cmp x10, #8 // =8
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #1
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 8)
				ret <vscale x 8 x half> %res
				}

				define <vscale x 4 x float> @splice_nxv4f32_first_idx(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
				; CHECK-LABEL: splice_nxv4f32_first_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntw x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #0 // =0
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, xzr, lo
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #2
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 0)
				ret <vscale x 4 x float> %res
				}

				define <vscale x 4 x float> @splice_nxv4f32_last_idx(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
				; CHECK-LABEL: splice_nxv4f32_last_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntw x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #3
				; CHECK-NEXT: cmp x10, #3 // =3
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #2
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 3)
				ret <vscale x 4 x float> %res
				}

				; Ensure index is clamped when we cannot prove it's at most VL-1.
				define <vscale x 4 x float> @splice_nxv4f32_clamped_idx(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
				; CHECK-LABEL: splice_nxv4f32_clamped_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntw x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #4
				; CHECK-NEXT: cmp x10, #4 // =4
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #2
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 4)
				ret <vscale x 4 x float> %res
				}

				define <vscale x 2 x double> @splice_nxv2f64_first_idx(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
				; CHECK-LABEL: splice_nxv2f64_first_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntd x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #0 // =0
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, xzr, lo
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #3
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 0)
				ret <vscale x 2 x double> %res
				}

				define <vscale x 2 x double> @splice_nxv2f64_last_idx(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
				; CHECK-LABEL: splice_nxv2f64_last_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntd x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #1 // =1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csinc x9, x9, xzr, lo
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #3
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 1)
				ret <vscale x 2 x double> %res
				}

				; Ensure index is clamped when we cannot prove it's at most VL-1.
				define <vscale x 2 x double> @splice_nxv2f64_clamped_idx(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
				; CHECK-LABEL: splice_nxv2f64_clamped_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntd x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #2
				; CHECK-NEXT: cmp x10, #2 // =2
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #3
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 2)
				ret <vscale x 2 x double> %res
				}

				; Ensure predicate-based splice is promoted to use ZPRs.
				define <vscale x 2 x i1> @splice_nxv2i1_idx(<vscale x 2 x i1> %a, <vscale x 2 x i1> %b) #0 {
				; CHECK-LABEL: splice_nxv2i1_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntd x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: mov z0.d, p0/z, #1 // =0x1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: cmp x9, #1 // =1
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: mov z0.d, p1/z, #1 // =0x1
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csinc x9, x9, xzr, lo
				; CHECK-NEXT: st1d { z0.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #3
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: and z0.d, z0.d, #0x1
				; CHECK-NEXT: cmpne p0.d, p0/z, z0.d, #0
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i1> @llvm.experimental.vector.splice.nxv2i1(<vscale x 2 x i1> %a, <vscale x 2 x i1> %b, i32 1)
				ret <vscale x 2 x i1> %res
				}
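
				; One way to read the sequence above: the stack expansion works on data
				; vectors, so each i1 lane is first widened to a 64-bit lane.
				;   mov z0.d, p0/z, #1     1 where %a is true, 0 elsewhere (p1 likewise for %b)
				;   st1d / ld1d            same stack-based splice as the nxv2i64 tests
				;   and, then cmpne #0     lanes holding 1 become true in the result predicate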

				; Ensure predicate-based splice is promoted to use ZPRs.
				define <vscale x 4 x i1> @splice_nxv4i1_idx(<vscale x 4 x i1> %a, <vscale x 4 x i1> %b) #0 {
				; CHECK-LABEL: splice_nxv4i1_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntw x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov z0.s, p0/z, #1 // =0x1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov w9, #2
				; CHECK-NEXT: cmp x10, #2 // =2
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: mov z0.s, p1/z, #1 // =0x1
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1w { z0.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #2
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: and z0.s, z0.s, #0x1
				; CHECK-NEXT: cmpne p0.s, p0/z, z0.s, #0
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x i1> @llvm.experimental.vector.splice.nxv4i1(<vscale x 4 x i1> %a, <vscale x 4 x i1> %b, i32 2)
				ret <vscale x 4 x i1> %res
				}

				; Ensure predicate-based splice is promoted to use ZPRs.
				define <vscale x 8 x i1> @splice_nxv8i1_idx(<vscale x 8 x i1> %a, <vscale x 8 x i1> %b) #0 {
				; CHECK-LABEL: splice_nxv8i1_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cnth x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov z0.h, p0/z, #1 // =0x1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov w9, #4
				; CHECK-NEXT: cmp x10, #4 // =4
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: mov z0.h, p1/z, #1 // =0x1
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1h { z0.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #1
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: and z0.h, z0.h, #0x1
				; CHECK-NEXT: cmpne p0.h, p0/z, z0.h, #0
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i1> @llvm.experimental.vector.splice.nxv8i1(<vscale x 8 x i1> %a, <vscale x 8 x i1> %b, i32 4)
				ret <vscale x 8 x i1> %res
				}

				; Ensure predicate-based splice is promoted to use ZPRs.
				define <vscale x 16 x i1> @splice_nxv16i1_idx(<vscale x 16 x i1> %a, <vscale x 16 x i1> %b) #0 {
				; CHECK-LABEL: splice_nxv16i1_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: mov z0.b, p0/z, #1 // =0x1
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov w10, #8
				; CHECK-NEXT: cmp x9, #8 // =8
				; CHECK-NEXT: st1b { z0.b }, p0, [sp]
				; CHECK-NEXT: mov z0.b, p1/z, #1 // =0x1
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: st1b { z0.b }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: and z0.b, z0.b, #0x1
				; CHECK-NEXT: cmpne p0.b, p0/z, z0.b, #0
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i1> @llvm.experimental.vector.splice.nxv16i1(<vscale x 16 x i1> %a, <vscale x 16 x i1> %b, i32 8)
				ret <vscale x 16 x i1> %res
				}

				; Verify promote type legalisation works as expected.
				define <vscale x 2 x i8> @splice_nxv2i8_idx(<vscale x 2 x i8> %a, <vscale x 2 x i8> %b) #0 {
				; CHECK-LABEL: splice_nxv2i8_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: cntd x9
				; CHECK-NEXT: sub x9, x9, #1 // =1
				; CHECK-NEXT: cmp x9, #1 // =1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csinc x9, x9, xzr, lo
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #3
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i8> @llvm.experimental.vector.splice.nxv2i8(<vscale x 2 x i8> %a, <vscale x 2 x i8> %b, i32 1)
				ret <vscale x 2 x i8> %res
				}

				; Verify splitvec type legalisation works as expected.
				define <vscale x 8 x i32> @splice_nxv8i32_idx(<vscale x 8 x i32> %a, <vscale x 8 x i32> %b) #0 {
				; CHECK-LABEL: splice_nxv8i32_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-4
				; CHECK-NEXT: cnth x10
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #2
				; CHECK-NEXT: cmp x10, #2 // =2
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z3.s }, p0, [x8, #3, mul vl]
				; CHECK-NEXT: st1w { z2.s }, p0, [x8, #2, mul vl]
				; CHECK-NEXT: orr x8, x8, x9, lsl #2
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x8, #1, mul vl]
				; CHECK-NEXT: addvl sp, sp, #4
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i32> @llvm.experimental.vector.splice.nxv8i32(<vscale x 8 x i32> %a, <vscale x 8 x i32> %b, i32 2)
				ret <vscale x 8 x i32> %res
				}

				; Verify splitvec type legalisation works as expected.
				define <vscale x 16 x float> @splice_nxv16f32_clamped_idx(<vscale x 16 x float> %a, <vscale x 16 x float> %b) #0 {
				; CHECK-LABEL: splice_nxv16f32_clamped_idx:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-8
				; CHECK-NEXT: rdvl x10, #1
				; CHECK-NEXT: sub x10, x10, #1 // =1
				; CHECK-NEXT: mov w9, #16
				; CHECK-NEXT: cmp x10, #16 // =16
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: csel x9, x10, x9, lo
				; CHECK-NEXT: st1w { z3.s }, p0, [x8, #3, mul vl]
				; CHECK-NEXT: st1w { z2.s }, p0, [x8, #2, mul vl]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z7.s }, p0, [x8, #7, mul vl]
				; CHECK-NEXT: st1w { z4.s }, p0, [x8, #4, mul vl]
				; CHECK-NEXT: st1w { z5.s }, p0, [x8, #5, mul vl]
				; CHECK-NEXT: st1w { z6.s }, p0, [x8, #6, mul vl]
				; CHECK-NEXT: add x8, x8, x9, lsl #2
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x8, #1, mul vl]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x8, #2, mul vl]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x8, #3, mul vl]
				; CHECK-NEXT: addvl sp, sp, #8
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x float> @llvm.experimental.vector.splice.nxv16f32(<vscale x 16 x float> %a, <vscale x 16 x float> %b, i32 16)
				ret <vscale x 16 x float> %res
				}

				;
				; VECTOR_SPLICE (trailing elements)
				;

				define <vscale x 16 x i8> @splice_nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: splice_nxv16i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1b { z0.b }, p0, [sp]
				; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #16 // =16
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 -16)
				ret <vscale x 16 x i8> %res
				}

				define <vscale x 16 x i8> @splice_nxv16i8_1(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: splice_nxv16i8_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1b { z0.b }, p0, [sp]
				; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #1 // =1
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 -1)
				ret <vscale x 16 x i8> %res
				}

				; Ensure the number of trailing elements is clamped when we cannot prove it's at most VL.
				define <vscale x 16 x i8> @splice_nxv16i8_clamped(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b) #0 {
				; CHECK-LABEL: splice_nxv16i8_clamped:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: mov w10, #17
				; CHECK-NEXT: cmp x9, #17 // =17
				; CHECK-NEXT: st1b { z0.b }, p0, [sp]
				; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, x9
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8> %a, <vscale x 16 x i8> %b, i32 -17)
				ret <vscale x 16 x i8> %res
				}
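
				; One way to read the checks above, taking 128-bit and 256-bit SVE registers as
				; example implementations (nxv16i8, so one element is one byte):
				;   bytes = umin(rdvl, 17)                      (rdvl, cmp, csel)
				;   start = sp + (vector size in bytes) - bytes (addvl #1, then sub)
				; With 128-bit registers rdvl gives 16, so the load starts at sp and returns %a
				; unchanged. With 256-bit registers rdvl gives 32, so the result is the last 17
				; bytes of %a followed by the first 15 bytes of %b.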

				define <vscale x 8 x i16> @splice_nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: splice_nxv8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #16 // =16
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 -8)
				ret <vscale x 8 x i16> %res
				}

				define <vscale x 8 x i16> @splice_nxv8i16_1(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: splice_nxv8i16_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #2 // =2
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 -1)
				ret <vscale x 8 x i16> %res
				}

				; Ensure the number of trailing elements is clamped when we cannot prove it's at most VL.
				define <vscale x 8 x i16> @splice_nxv8i16_clamped(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b) #0 {
				; CHECK-LABEL: splice_nxv8i16_clamped:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: mov w10, #18
				; CHECK-NEXT: cmp x9, #18 // =18
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, x9
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16> %a, <vscale x 8 x i16> %b, i32 -9)
				ret <vscale x 8 x i16> %res
				}

				define <vscale x 4 x i32> @splice_nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: splice_nxv4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #16 // =16
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 -4)
				ret <vscale x 4 x i32> %res
				}

				define <vscale x 4 x i32> @splice_nxv4i32_1(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: splice_nxv4i32_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #4 // =4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 -1)
				ret <vscale x 4 x i32> %res
				}

				; Ensure the number of trailing elements is clamped when we cannot prove it's at most VL.
				define <vscale x 4 x i32> @splice_nxv4i32_clamped(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b) #0 {
				; CHECK-LABEL: splice_nxv4i32_clamped:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: mov w10, #20
				; CHECK-NEXT: cmp x9, #20 // =20
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, x9
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32> %a, <vscale x 4 x i32> %b, i32 -5)
				ret <vscale x 4 x i32> %res
				}

				define <vscale x 2 x i64> @splice_nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: splice_nxv2i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #16 // =16
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 -2)
				ret <vscale x 2 x i64> %res
				}

				define <vscale x 2 x i64> @splice_nxv2i64_1(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: splice_nxv2i64_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #8 // =8
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 -1)
				ret <vscale x 2 x i64> %res
				}

				; Ensure the number of trailing elements is clamped when we cannot prove it's at most VL.
				define <vscale x 2 x i64> @splice_nxv2i64_clamped(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b) #0 {
				; CHECK-LABEL: splice_nxv2i64_clamped:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: mov w10, #24
				; CHECK-NEXT: cmp x9, #24 // =24
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, x9
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64> %a, <vscale x 2 x i64> %b, i32 -3)
				ret <vscale x 2 x i64> %res
				}

				define <vscale x 8 x half> @splice_nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
				; CHECK-LABEL: splice_nxv8f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #16 // =16
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 -8)
				ret <vscale x 8 x half> %res
				}

				define <vscale x 8 x half> @splice_nxv8f16_1(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
				; CHECK-LABEL: splice_nxv8f16_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #2 // =2
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 -1)
				ret <vscale x 8 x half> %res
				}

				; Ensure the number of trailing elements is clamped when we cannot prove it's at most VL.
				define <vscale x 8 x half> @splice_nxv8f16_clamped(<vscale x 8 x half> %a, <vscale x 8 x half> %b) #0 {
				; CHECK-LABEL: splice_nxv8f16_clamped:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: mov w10, #18
				; CHECK-NEXT: cmp x9, #18 // =18
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, x9
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half> %a, <vscale x 8 x half> %b, i32 -9)
				ret <vscale x 8 x half> %res
				}

				define <vscale x 4 x float> @splice_nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
				; CHECK-LABEL: splice_nxv4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #16 // =16
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 -4)
				ret <vscale x 4 x float> %res
				}

				define <vscale x 4 x float> @splice_nxv4f32_1(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
				; CHECK-LABEL: splice_nxv4f32_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #4 // =4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 -1)
				ret <vscale x 4 x float> %res
				}

				; Ensure number of trailing elements is clamped when we cannot prove it's less than VL.
				define <vscale x 4 x float> @splice_nxv4f32_clamped(<vscale x 4 x float> %a, <vscale x 4 x float> %b) #0 {
				; CHECK-LABEL: splice_nxv4f32_clamped:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: mov w10, #20
				; CHECK-NEXT: cmp x9, #20 // =20
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, x9
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float> %a, <vscale x 4 x float> %b, i32 -5)
				ret <vscale x 4 x float> %res
				}

				define <vscale x 2 x double> @splice_nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
				; CHECK-LABEL: splice_nxv2f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #16 // =16
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 -2)
				ret <vscale x 2 x double> %res
				}

				define <vscale x 2 x double> @splice_nxv2f64_1(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
				; CHECK-LABEL: splice_nxv2f64_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #8 // =8
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 -1)
				ret <vscale x 2 x double> %res
				}

				; Ensure number of trailing elements is clamped when we cannot prove it's less than VL.
				define <vscale x 2 x double> @splice_nxv2f64_clamped(<vscale x 2 x double> %a, <vscale x 2 x double> %b) #0 {
				; CHECK-LABEL: splice_nxv2f64_clamped:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: rdvl x9, #1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: mov w10, #24
				; CHECK-NEXT: cmp x9, #24 // =24
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, x9
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double> %a, <vscale x 2 x double> %b, i32 -3)
				ret <vscale x 2 x double> %res
				}

				; Ensure predicate based splice is promoted to use ZPRs.
				define <vscale x 2 x i1> @splice_nxv2i1(<vscale x 2 x i1> %a, <vscale x 2 x i1> %b) #0 {
				; CHECK-LABEL: splice_nxv2i1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: mov z0.d, p0/z, #1 // =0x1
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov z1.d, p1/z, #1 // =0x1
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #8 // =8
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: and z0.d, z0.d, #0x1
				; CHECK-NEXT: cmpne p0.d, p0/z, z0.d, #0
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i1> @llvm.experimental.vector.splice.nxv2i1(<vscale x 2 x i1> %a, <vscale x 2 x i1> %b, i32 -1)
				ret <vscale x 2 x i1> %res
				}

				; Ensure predicate based splice is promoted to use ZPRs.
				define <vscale x 4 x i1> @splice_nxv4i1(<vscale x 4 x i1> %a, <vscale x 4 x i1> %b) #0 {
				; CHECK-LABEL: splice_nxv4i1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: mov z0.s, p0/z, #1 // =0x1
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov z1.s, p1/z, #1 // =0x1
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #4 // =4
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: and z0.s, z0.s, #0x1
				; CHECK-NEXT: cmpne p0.s, p0/z, z0.s, #0
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 4 x i1> @llvm.experimental.vector.splice.nxv4i1(<vscale x 4 x i1> %a, <vscale x 4 x i1> %b, i32 -1)
				ret <vscale x 4 x i1> %res
				}

				; Ensure predicate based splice is promoted to use ZPRs.
				define <vscale x 8 x i1> @splice_nxv8i1(<vscale x 8 x i1> %a, <vscale x 8 x i1> %b) #0 {
				; CHECK-LABEL: splice_nxv8i1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: mov z0.h, p0/z, #1 // =0x1
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov z1.h, p1/z, #1 // =0x1
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1h { z0.h }, p0, [sp]
				; CHECK-NEXT: st1h { z1.h }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #2 // =2
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x8]
				; CHECK-NEXT: and z0.h, z0.h, #0x1
				; CHECK-NEXT: cmpne p0.h, p0/z, z0.h, #0
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i1> @llvm.experimental.vector.splice.nxv8i1(<vscale x 8 x i1> %a, <vscale x 8 x i1> %b, i32 -1)
				ret <vscale x 8 x i1> %res
				}

				; Ensure predicate based splice is promoted to use ZPRs.
				define <vscale x 16 x i1> @splice_nxv16i1(<vscale x 16 x i1> %a, <vscale x 16 x i1> %b) #0 {
				; CHECK-LABEL: splice_nxv16i1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: mov z0.b, p0/z, #1 // =0x1
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: mov z1.b, p1/z, #1 // =0x1
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1b { z0.b }, p0, [sp]
				; CHECK-NEXT: st1b { z1.b }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #1 // =1
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x8]
				; CHECK-NEXT: and z0.b, z0.b, #0x1
				; CHECK-NEXT: cmpne p0.b, p0/z, z0.b, #0
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x i1> @llvm.experimental.vector.splice.nxv16i1(<vscale x 16 x i1> %a, <vscale x 16 x i1> %b, i32 -1)
				ret <vscale x 16 x i1> %res
				}

				; Verify promote type legalisation works as expected.
				define <vscale x 2 x i8> @splice_nxv2i8(<vscale x 2 x i8> %a, <vscale x 2 x i8> %b) #0 {
				; CHECK-LABEL: splice_nxv2i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-2
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1d { z0.d }, p0, [sp]
				; CHECK-NEXT: st1d { z1.d }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: addvl x8, x8, #1
				; CHECK-NEXT: sub x8, x8, #16 // =16
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x8]
				; CHECK-NEXT: addvl sp, sp, #2
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 2 x i8> @llvm.experimental.vector.splice.nxv2i8(<vscale x 2 x i8> %a, <vscale x 2 x i8> %b, i32 -2)
				ret <vscale x 2 x i8> %res
				}

				; Verify splitvec type legalisation works as expected.
				define <vscale x 8 x i32> @splice_nxv8i32(<vscale x 8 x i32> %a, <vscale x 8 x i32> %b) #0 {
				; CHECK-LABEL: splice_nxv8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-4
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z3.s }, p0, [x8, #3, mul vl]
				; CHECK-NEXT: st1w { z2.s }, p0, [x8, #2, mul vl]
				; CHECK-NEXT: addvl x8, x8, #2
				; CHECK-NEXT: sub x8, x8, #32 // =32
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x8, #1, mul vl]
				; CHECK-NEXT: addvl sp, sp, #4
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 8 x i32> @llvm.experimental.vector.splice.nxv8i32(<vscale x 8 x i32> %a, <vscale x 8 x i32> %b, i32 -8)
				ret <vscale x 8 x i32> %res
				}

				; Verify splitvec type legalisation works as expected.
				define <vscale x 16 x float> @splice_nxv16f32_clamped(<vscale x 16 x float> %a, <vscale x 16 x float> %b) #0 {
				; CHECK-LABEL: splice_nxv16f32_clamped:
				; CHECK: // %bb.0:
				; CHECK-NEXT: str x29, [sp, #-16]! // 8-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-8
				; CHECK-NEXT: rdvl x9, #4
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov x8, sp
				; CHECK-NEXT: mov w10, #68
				; CHECK-NEXT: cmp x9, #68 // =68
				; CHECK-NEXT: st1w { z3.s }, p0, [x8, #3, mul vl]
				; CHECK-NEXT: st1w { z2.s }, p0, [x8, #2, mul vl]
				; CHECK-NEXT: st1w { z1.s }, p0, [x8, #1, mul vl]
				; CHECK-NEXT: st1w { z0.s }, p0, [sp]
				; CHECK-NEXT: st1w { z7.s }, p0, [x8, #7, mul vl]
				; CHECK-NEXT: st1w { z4.s }, p0, [x8, #4, mul vl]
				; CHECK-NEXT: st1w { z5.s }, p0, [x8, #5, mul vl]
				; CHECK-NEXT: st1w { z6.s }, p0, [x8, #6, mul vl]
				; CHECK-NEXT: addvl x8, x8, #4
				; CHECK-NEXT: csel x9, x9, x10, lo
				; CHECK-NEXT: sub x8, x8, x9
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x8]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x8, #1, mul vl]
				; CHECK-NEXT: ld1w { z2.s }, p0/z, [x8, #2, mul vl]
				; CHECK-NEXT: ld1w { z3.s }, p0/z, [x8, #3, mul vl]
				; CHECK-NEXT: addvl sp, sp, #8
				; CHECK-NEXT: ldr x29, [sp], #16 // 8-byte Folded Reload
				; CHECK-NEXT: ret
				%res = call <vscale x 16 x float> @llvm.experimental.vector.splice.nxv16f32(<vscale x 16 x float> %a, <vscale x 16 x float> %b, i32 -17)
				ret <vscale x 16 x float> %res
				}

				declare <vscale x 2 x i1> @llvm.experimental.vector.splice.nxv2i1(<vscale x 2 x i1>, <vscale x 2 x i1>, i32)
				declare <vscale x 4 x i1> @llvm.experimental.vector.splice.nxv4i1(<vscale x 4 x i1>, <vscale x 4 x i1>, i32)
				declare <vscale x 8 x i1> @llvm.experimental.vector.splice.nxv8i1(<vscale x 8 x i1>, <vscale x 8 x i1>, i32)
				declare <vscale x 16 x i1> @llvm.experimental.vector.splice.nxv16i1(<vscale x 16 x i1>, <vscale x 16 x i1>, i32)
				declare <vscale x 2 x i8> @llvm.experimental.vector.splice.nxv2i8(<vscale x 2 x i8>, <vscale x 2 x i8>, i32)
				declare <vscale x 16 x i8> @llvm.experimental.vector.splice.nxv16i8(<vscale x 16 x i8>, <vscale x 16 x i8>, i32)
				declare <vscale x 8 x i16> @llvm.experimental.vector.splice.nxv8i16(<vscale x 8 x i16>, <vscale x 8 x i16>, i32)
				declare <vscale x 4 x i32> @llvm.experimental.vector.splice.nxv4i32(<vscale x 4 x i32>, <vscale x 4 x i32>, i32)
				declare <vscale x 8 x i32> @llvm.experimental.vector.splice.nxv8i32(<vscale x 8 x i32>, <vscale x 8 x i32>, i32)
				declare <vscale x 2 x i64> @llvm.experimental.vector.splice.nxv2i64(<vscale x 2 x i64>, <vscale x 2 x i64>, i32)
				declare <vscale x 8 x half> @llvm.experimental.vector.splice.nxv8f16(<vscale x 8 x half>, <vscale x 8 x half>, i32)
				declare <vscale x 4 x float> @llvm.experimental.vector.splice.nxv4f32(<vscale x 4 x float>, <vscale x 4 x float>, i32)
				declare <vscale x 16 x float> @llvm.experimental.vector.splice.nxv16f32(<vscale x 16 x float>, <vscale x 16 x float>, i32)
				declare <vscale x 2 x double> @llvm.experimental.vector.splice.nxv2f64(<vscale x 2 x double>, <vscale x 2 x double>, i32)

				attributes #0 = { nounwind "target-features"="+sve" }
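
The trailing-element tests above all follow one pattern: both inputs are stored back to back on the stack, and the result is reloaded from an address that sits `min(VL, count * element size)` bytes before the start of the second vector. Below is a minimal Python sketch of that pattern, working on element lists rather than bytes. The function names `splice_trailing` and `needs_clamp` are hypothetical and not part of the patch. The clamp only mirrors what the `csel` in the `*_clamped` CHECK lines computes; it is not a statement of what the intrinsic guarantees for out-of-range immediates.

```python
def splice_trailing(a, b, trailing):
    """Model splice(a, b, -trailing) as the stack expansion in the tests does.

    `a` and `b` are equal-length lists standing in for the two input vectors
    at some concrete vscale. Hypothetical helper, not code from the patch.
    """
    assert len(a) == len(b)
    n = len(a)
    # Mirrors the csel: never step back further than one whole vector.
    k = min(trailing, n)
    # Mirrors the two stores: a at [sp], b at [sp + 1 * VL].
    stack = a + b
    # Mirrors addvl/sub: start k elements before the beginning of b.
    start = n - k
    return stack[start:start + n]


def needs_clamp(min_elts, trailing):
    """True when the count can exceed the vector length at vscale == 1.

    `min_elts` is the N in <vscale x N x ty>. Assumes vscale >= 1, so a
    count of at most N is always in range and needs no runtime check.
    """
    return trailing > min_elts
```

For instance, `splice_nxv4f32` uses `i32 -4` on a `<vscale x 4 x float>` and gets a constant `sub x8, x8, #16`, while `splice_nxv4f32_clamped` uses `i32 -5` and gets the `rdvl`/`cmp`/`csel` sequence; `needs_clamp(4, 4)` and `needs_clamp(4, 5)` distinguish the two cases in this model.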

Revision Contents

Diff 325438

llvm/docs/LangRef.rst

llvm/include/llvm/CodeGen/ISDOpcodes.h

llvm/include/llvm/CodeGen/TargetLowering.h

llvm/include/llvm/IR/Intrinsics.td

llvm/include/llvm/Target/TargetSelectionDAG.td

llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp

llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

llvm/lib/CodeGen/SelectionDAG/LegalizeTypes.h

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp

llvm/lib/CodeGen/SelectionDAG/SelectionDAGDumper.cpp

llvm/lib/CodeGen/SelectionDAG/TargetLowering.cpp

llvm/lib/CodeGen/TargetLoweringBase.cpp

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

llvm/test/CodeGen/AArch64/named-vector-shuffles-neon.ll

llvm/test/CodeGen/AArch64/named-vector-shuffles-sve.ll
