This is an archive of the discontinued LLVM Phabricator instance.

Differential D113376

[AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2
Closed, Public

Authored by wwei on Nov 7 2021, 6:40 PM.

Details

Reviewers

paulwalker-arm
peterwaller-arm
bsmith
david-arm
sdesmalen
dmgreen
efriedma

Commits

rG03dc2975d07e: [AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2

Summary

Attempt to lower a shuffle to a permute instruction (zip/uzp/trn) for fixed-length SVE.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

wwei created this revision. Nov 7 2021, 6:40 PM

Herald added subscribers: ctetreau, psnobl, hiraditya and 2 others. Nov 7 2021, 6:40 PM

wwei requested review of this revision. Nov 7 2021, 6:40 PM

Herald added a subscriber: llvm-commits. Nov 7 2021, 6:40 PM

Harbormaster completed remote builds in B132942: Diff 385388. Nov 7 2021, 7:24 PM

david-arm added inline comments. Nov 8 2021, 1:47 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19602	This looks a bit wrong because VT is a fixed-length vector type, but Op1 and Op2 have scalable vector types.

wwei added inline comments. Nov 8 2021, 3:40 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19602	No, the code here is correct. `VT` and `ShuffleMask` are used to check the type of the mask value, such as a zip or uzp mask. `Op1` and `Op2` are used to generate the new permute nodes, which will be scalable AArch64 SDNodes here. In this patch, I moved the code that lowers shuffles to NEON ZIP/UZP/TRN into a new function. Fortunately, the same code can be used for fixed-length SVE.

LGTM, thanks.

This revision is now accepted and ready to land. Nov 8 2021, 3:49 AM

paulwalker-arm requested changes to this revision. Nov 8 2021, 5:13 AM

paulwalker-arm added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9739–9742	I don't believe this logic is universally safe. The use of ZIP2 specifically relies on knowing the exact size of the register to know which indices represent the "high half" of a vector register. This is something that is only known when `sve-vector-bits-min==sve-vector-bits-max`. I also suspect functions like isZIPMask were written when only 64-bit and 128-bit legal vectors existed. I doubt the logic holds for the case when a vector is legal but not the exact size of the target vector register, as is the case with the SVE fixed-length support. I have not looked into the extent to which the other shuffles are affected, but I suspect each has its own complexity.
llvm/test/CodeGen/AArch64/sve-fix-length-permute-zip-uzp-trn.ll
21 ↗	(On Diff #385388)	To highlight my previous comment: index `16` is only the first byte of the second half of a 256-bit vector, so if the target has a bigger vector register, `zip2` will not have the correct behaviour.
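
To make the `zip2` concern concrete, the following is a minimal sketch on a scaled-down model: a 4-lane fixed-length vector held either in a register of exactly 4 lanes or in one of 8 lanes (standing in for the 256-bit and 512-bit cases). The string-based helpers `zip2`, `widen` and `low` are hypothetical and written only for illustration; they are not LLVM code. A `?` marks a lane whose contents are unspecified.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model: one character per lane, '?' = unspecified lane.

// ZIP2 interleaves the high halves of the two *registers*, regardless of
// how much of each register the fixed-length value actually occupies.
std::string zip2(const std::string &a, const std::string &b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  const std::size_t half = a.size() / 2;
  std::string out;
  for (std::size_t i = 0; i < half; ++i) {
    out.push_back(a[half + i]);
    out.push_back(b[half + i]);
  }
  return out;
}

// Place a fixed-length vector at the start of a register with regLanes lanes.
std::string widen(const std::string &fixed, std::size_t regLanes) {
  assert(fixed.size() <= regLanes);
  return fixed + std::string(regLanes - fixed.size(), '?');
}

// The fixed-length result is the first n lanes of the register.
std::string low(const std::string &reg, std::size_t n) {
  return reg.substr(0, n);
}
```

With inputs `"ABCD"` and `"EFGH"`, `low(zip2(widen(a, 4), widen(b, 4)), 4)` yields `"CGDH"`, which is what a zip2-shaped mask asks for. With 8-lane registers the same call yields `"????"`, because the register's high half lies entirely beyond the fixed-length data.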

This revision now requires changes to proceed. Nov 8 2021, 5:13 AM

wwei added inline comments. Nov 8 2021, 8:02 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9739–9742	Thanks, I get it. It seems there are potential problems with the fixed-length shuffle vector lowering, including the patch https://reviews.llvm.org/D105289. I will upload a new revision later to fix it by adding the check `sve-vector-bits-min==sve-vector-bits-max`.

paulwalker-arm added inline comments. Nov 8 2021, 8:36 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9739–9742	I believe that patch is OK given the restrictions it imposes. It is fine to use the `is###Mask` functions as a way to identify a named shuffle. What is unsafe is assuming `is###Mask` means the `###` instruction can be used directly when converted to scalable vectors. Taking D105289, you can see that it uses `isEXTMask` to determine the type of shuffle, but the actual lowering code only handles a specific case, and that case does not emit an `EXT` instruction, as that would have fallen into the same trap.
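
The difference between identifying a named shuffle and choosing an instruction can be sketched with a simplified mask classifier. The function below is a hypothetical stand-in written for illustration; it is not LLVM's `isZIPMask`, and its name and signature are assumptions. Note that it inspects only the mask and its element count, so a positive answer says nothing about the width of the register the vector will eventually live in.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified classifier (not LLVM's implementation).
// A negative mask entry means "undef" and matches any index.
// On success, whichResult is 0 for a zip1-shaped mask and 1 for zip2-shaped.
bool looksLikeZipMask(const std::vector<int> &mask, unsigned &whichResult) {
  const std::size_t n = mask.size();
  if (n == 0 || n % 2 != 0)
    return false;
  whichResult = (mask[0] == 0) ? 0 : 1;
  // zip1 starts at element 0 of each operand, zip2 at element n/2.
  std::size_t idx = whichResult * (n / 2);
  for (std::size_t i = 0; i < n; i += 2, ++idx) {
    // Even positions come from the first operand, odd ones from the second
    // (second-operand indices are offset by n in a shuffle mask).
    if (mask[i] >= 0 && static_cast<std::size_t>(mask[i]) != idx)
      return false;
    if (mask[i + 1] >= 0 && static_cast<std::size_t>(mask[i + 1]) != idx + n)
      return false;
  }
  return true;
}
```

For a 4-element shuffle, `{0, 4, 1, 5}` is classified as zip1-shaped and `{2, 6, 3, 7}` as zip2-shaped; whether an actual `zip2` instruction implements the latter correctly is a separate question that depends on the register size.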

wwei added inline comments. Nov 9 2021, 12:55 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9739–9742	Yeah, you are right. The logic in D105289 is correct: it uses `EXTRACT_VECTOR_ELT` and `INSR` to match a specific `EXT` pattern. It's a smart solution.

Updated the patch, adding an SVE min/max size check.

Harbormaster completed remote builds in B133194: Diff 385727. Nov 9 2021, 1:41 AM

paulwalker-arm added inline comments. Nov 9 2021, 8:23 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19605–19606	I still don't think this is enough, as you've missed my previous `I doubt the logic holds for the case when a vector is legal but not the exact size of the target vector register` comment. Further discussion is attached to the test zip_v32i8 below. I'm starting to wonder whether it's worth breaking out lowerShuffleToZIP_UZP_TRN, as I feel that once support is added for more of the combinations there's not likely to be much reuse. For example, taking the zip1/zip2 case, I think supporting zip1 is likely simple and just works for all legal VTs, but zip2 requires the VT to be exactly the size of the target vector (i.e. `VT.getSizeInBits() == Subtarget->getMinSVEVectorSizeInBits() == Subtarget->getMaxSVEVectorSizeInBits()`) or perhaps a different instruction sequence.
llvm/test/CodeGen/AArch64/sve-fix-length-permute-zip-uzp-trn.ll
24–25 ↗	(On Diff #385727)	The output for `VBITS_EQ_256` and `VBITS_EQ_512` is the same, which doesn't make sense considering the target vector length is different. For the `VBITS_EQ_256` case, index `16` is the start of the high half of the target vector and thus zip2 works. But for the `VBITS_EQ_512` case, index `16` is actually the start of the second quarter of the target vector and thus zip2 will do the wrong thing.
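
The "exactly the size of the target vector" condition above is written as a chained `==`, which reads naturally but is shorthand rather than valid C++ logic: `a == b == c` parses as `(a == b) == c`. A minimal sketch of the intended check follows, assuming a hypothetical free function `fillsKnownRegister` that takes the three sizes in bits; it is not an LLVM API.

```cpp
#include <cassert>

// Hypothetical helper (not an LLVM API): true only when the SVE register
// size is known exactly (min == max) and the fixed-length vector fills it.
// Spelled with && because `a == b == c` would compare a bool with c;
// for example 256u == 256u == 256u evaluates to false.
bool fillsKnownRegister(unsigned vtBits, unsigned minSVEBits,
                        unsigned maxSVEBits) {
  return minSVEBits == maxSVEBits && maxSVEBits == vtBits;
}
```

Under this sketch, a 256-bit vector with min and max both 256 passes, while the same vector with min and max both 512 does not, which matches the two RUN configurations discussed in the test comment.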

wwei added inline comments. Nov 10 2021, 1:00 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19605–19606	Thanks, I agree with you. Perhaps the current implementation of `lowerShuffleToZIP_UZP_TRN`, which reuses the NEON code, is not so good, since SVE may have variable length. Maybe we should identify and classify the different shuffle patterns. The four shuffle types zip/uzp/trn/rev can be roughly divided into two categories: zip1/uzp1/uzp2/trn1/trn2/revb/revh/revw (simple, and works for all legal VTs) and zip2/rev (complex, and needs to consider the high half or the whole register). I think we can implement these two scenarios in steps.

Matt added a subscriber: Matt. Nov 17 2021, 12:01 PM

wwei updated this revision to Diff 391320. Dec 2 2021, 7:05 AM

Updated the patch. There is also another patch, D114960, to support the rev instructions.

Harbormaster completed remote builds in B137136: Diff 391320. Dec 2 2021, 8:07 AM

wwei mentioned this in D114960: [AArch64][SVE] Lower shuffles to permute instructions: rev/revb/revh/revw. Dec 7 2021, 12:30 AM

paulwalker-arm added inline comments. Dec 9 2021, 8:09 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19605–19606	Looking at the six shuffles trn1, trn2, uzp1, uzp2, zip1, zip2, I believe that when min_length != max_length != VT.getSizeInBits() only trn1, trn2 & zip1 are valid. We agree that zip2 is not valid because the top half of the input fixed-length vector does not necessarily map to the top half of a scalable vector. The reason I believe both uzp variants are also invalid is that their underlying operation is to concat both input vectors and then extract every second element. However, the vectors we're concatenating may contain junk, so you end up with:
expected uzp1 [A | B | C | D], [E | F | G | H] => [A | C | E | G]
SVE_VLS uzp [A | B | C | D | ? | ? | ? | ?], [E | F | G | H | ? | ? | ? | ?] => [A | C | ? | ? | E | G | ? | ?]
And thus when you extract the fixed-length result you'll end up with `[A | C | ? | ?]`. This, I think, further reduces the value of having `lowerShuffleToZIP_UZP_TRN`, as it would be clearer to just handle each of the scenarios separately.
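
The uzp example above can be reproduced with a small model. This is a sketch under stated assumptions: lanes are modelled as characters of a string, `?` is an unspecified lane, and `uzp1` is a hypothetical helper written for illustration, not LLVM code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UZP1 modelled on strings, one character per lane: concatenate both
// registers and keep the even-numbered lanes of the concatenation.
std::string uzp1(const std::string &a, const std::string &b) {
  assert(a.size() == b.size());
  const std::string both = a + b;
  std::string out;
  for (std::size_t i = 0; i < both.size(); i += 2)
    out.push_back(both[i]);
  return out;
}
```

When the registers are exactly as wide as the data, `uzp1("ABCD", "EFGH")` gives `"ACEG"`. With padded registers, `uzp1("ABCD????", "EFGH????")` gives `"AC??EG??"`, and taking the first four lanes as the fixed-length result leaves `"AC??"`.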

wwei added inline comments. Dec 9 2021, 8:59 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19605–19606	@paulwalker-arm Thanks for your detailed explanation. I fully agree with your point of view. Indeed, we don't need `lowerShuffleToZIP_UZP_TRN` anymore; handling each case separately will make the code clearer. I will try to refactor the code later.

wwei updated this revision to Diff 393828. Dec 13 2021, 3:18 AM

Harbormaster completed remote builds in B138922: Diff 393828. Dec 13 2021, 4:37 AM

Updated the patch: lowerShuffleToZIP_UZP_TRN removed and test file refactored.

Rebased.

Harbormaster completed remote builds in B139448: Diff 394577. Dec 15 2021, 9:01 AM

Just a passing review I'm afraid. I'll take a proper look tomorrow.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19625	I think you can use `ContainerVT` here? It might help with the formatting. Same goes for the other places where you use `Op1.getValueType()`.
19679–19688	I've not thought about it deeply so might be wrong but given these cases only use `Op1` I'm wondering if they're always safe and thus don't need to be part of the `MinSVESize == MaxSVESize == VT.getSizeInBits()` restricted set?

paulwalker-arm added inline comments. Dec 16 2021, 10:55 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19644–19647	I figured this comment could be better and started to write something, but ended up with:
Functions like isZIPMask return true when an ISD::VECTOR_SHUFFLE's mask represents the same logical operation as performed by a ZIP instruction. In isolation these functions do not mean the ISD::VECTOR_SHUFFLE is exactly equivalent to an AArch64 instruction. There's the extra component of ISD::VECTOR_SHUFFLE's value type to consider. Prior to SVE these functions only operated on 64/128bit vector types that have a direct mapping to a target register and so an exact mapping is implied. However, when using SVE for fixed length vectors, most legal vector types are actually sub-vectors of a larger SVE register. When mapping ISD::VECTOR_SHUFFLE to an SVE instruction care must be taken to consider how the mask's indices translate. Specifically, when the mapping requires an exact meaning for a specific vector index (e.g. Index X is the last vector element in the register) then such mappings are often only safe when the exact SVE register size is known. The main exception to this is when indices are logically relative to the first element of either ISD::VECTOR_SHUFFLE operand because these relative indices don't change when converting from fixed-length to scalable vector types (i.e. the start of a fixed length vector is always the start of a scalable vector).
Which is more like a novel than a comment :) I've posted it anyway just in case there's something in there that's useful.
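
The "main exception" described above, where indices are relative to the first element of an operand, can be checked on a small model. This is a sketch only: lanes are characters of a string, `?` is an unspecified lane, and `zip1` and `trn1` are hypothetical helpers written for illustration rather than LLVM code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ZIP1: interleave the low halves of the two registers.
std::string zip1(const std::string &a, const std::string &b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  std::string out;
  for (std::size_t i = 0; i < a.size() / 2; ++i) {
    out.push_back(a[i]);
    out.push_back(b[i]);
  }
  return out;
}

// TRN1: pair up the even-numbered lanes of the two registers.
std::string trn1(const std::string &a, const std::string &b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  std::string out;
  for (std::size_t i = 0; i < a.size(); i += 2) {
    out.push_back(a[i]);
    out.push_back(b[i]);
  }
  return out;
}
```

In this model the first four lanes of the result are the same whether the inputs are `"ABCD"`/`"EFGH"` or the padded `"ABCD????"`/`"EFGH????"`: `"AEBF"` for `zip1` and `"AECG"` for `trn1`. Both operations only read lanes counted from the start of each operand, so the padding never reaches the fixed-length part of the result.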
llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll
11–15	It looks like `InterleavedAccessPass` is causing your code to be bypassed. I couldn't immediately see a way to disable the pass other than using `-O0` which might cause other issues but I think if you use `volatile` loads and stores within these tests you'll get what you need.
414	Please place the attributes together at the end of the file because otherwise they're hard to find when trying to see what attributes exist for a specific function.
628	As above.

wwei updated this revision to Diff 395107. Dec 17 2021, 6:05 AM

wwei added inline comments. Dec 17 2021, 6:24 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19625	Changed, using `ContainerVT` to replace `Op1.getValueType()`.
19644–19647	Thanks for your comment, I used it directly in the code.
19679–19688	I don't think it's safe. In the undef case, `Op1` is used as the input twice, so it still has the problem shown below:
uzp [A | B | C | D | ? | ? | ? | ?], [A | B | C | D | ? | ? | ? | ?] => [A | C | ? | ? | A | C | ? | ?]
llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll
11–15	Thanks, `volatile` works.
414	Done.

Harbormaster completed remote builds in B139832: Diff 395107. Dec 17 2021, 6:58 AM

paulwalker-arm accepted this revision. Dec 20 2021, 10:15 AM

paulwalker-arm added inline comments.

llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll
420	I think there's value in adding a comment that states why we can safely emit zip2 instructions. Not too detailed; it's just worth drawing the reader's attention to the fact that for this and related tests the runtime vector length is known.
482	As above, there's value in adding a comment that mentions why we can safely emit uzp instructions.
628	Please add a small comment to highlight that this is a negative test and why only zip1 instructions are emitted.

This revision is now accepted and ready to land. Dec 20 2021, 10:15 AM

This revision was landed with ongoing or failed builds. Dec 21 2021, 2:46 AM

Closed by commit rG03dc2975d07e: [AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2 (authored by wwei).

This revision was automatically updated to reflect the committed changes.

wwei added a commit: rG03dc2975d07e: [AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	68 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll	686 lines

Diff 395629

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[9,730 lines not shown; context: static SDValue tryWidenMaskForShuffle(SDValue Op, SelectionDAG &DAG) {]

   return SDValue();
 }

 SDValue AArch64TargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
                                                    SelectionDAG &DAG) const {
   SDLoc dl(Op);
   EVT VT = Op.getValueType();

   ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op.getNode());

   if (useSVEForFixedLengthVectorVT(VT))
     return LowerFixedLengthVECTOR_SHUFFLEToSVE(Op, DAG);

   // Convert shuffles that are directly supported on NEON to target-specific
   // DAG nodes, instead of keeping them as shuffles and matching them again
   // during code selection. This is more efficient and avoids the possibility
   // of inconsistencies between legalization and selection.
   ArrayRef<int> ShuffleMask = SVN->getMask();

[9,843 lines not shown; context: if (isEXTMask(ShuffleMask, VT, ReverseEXT, Imm) &&]
     SDValue Scalar = DAG.getNode(
         ISD::EXTRACT_VECTOR_ELT, DL, ScalarTy, Op1,
         DAG.getConstant(VT.getVectorNumElements() - 1, DL, MVT::i64));
     Op = DAG.getNode(AArch64ISD::INSR, DL, ContainerVT, Op2, Scalar);
     return convertFromScalableVector(DAG, VT, Op);
   }

   for (unsigned LaneSize : {64U, 32U, 16U}) {
     if (isREVMask(ShuffleMask, VT, LaneSize)) {
       EVT NewVT =
           getPackedSVEVectorVT(EVT::getIntegerVT(*DAG.getContext(), LaneSize));
       unsigned RevOp;
       unsigned EltSz = VT.getScalarSizeInBits();
       if (EltSz == 8)
         RevOp = AArch64ISD::BSWAP_MERGE_PASSTHRU;
       else if (EltSz == 16)
         RevOp = AArch64ISD::REVH_MERGE_PASSTHRU;
       else
         RevOp = AArch64ISD::REVW_MERGE_PASSTHRU;

       Op = DAG.getNode(ISD::BITCAST, DL, NewVT, Op1);
       Op = LowerToPredicatedOp(Op, DAG, RevOp);
       Op = DAG.getNode(ISD::BITCAST, DL, ContainerVT, Op);
       return convertFromScalableVector(DAG, VT, Op);
     }
   }

+  unsigned WhichResult;
+  if (isZIPMask(ShuffleMask, VT, WhichResult) && WhichResult == 0)
+    return convertFromScalableVector(
+        DAG, VT, DAG.getNode(AArch64ISD::ZIP1, DL, ContainerVT, Op1, Op2));

+  if (isTRNMask(ShuffleMask, VT, WhichResult)) {
+    unsigned Opc = (WhichResult == 0) ? AArch64ISD::TRN1 : AArch64ISD::TRN2;
+    return convertFromScalableVector(
+        DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op2));
+  }
+
+  if (isZIP_v_undef_Mask(ShuffleMask, VT, WhichResult) && WhichResult == 0)
+    return convertFromScalableVector(
+        DAG, VT, DAG.getNode(AArch64ISD::ZIP1, DL, ContainerVT, Op1, Op1));
+
+  if (isTRN_v_undef_Mask(ShuffleMask, VT, WhichResult)) {
+    unsigned Opc = (WhichResult == 0) ? AArch64ISD::TRN1 : AArch64ISD::TRN2;
+    return convertFromScalableVector(
+        DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op1));
+  }
+
+  // Functions like isZIPMask return true when a ISD::VECTOR_SHUFFLE's mask
+  // represents the same logical operation as performed by a ZIP instruction. In
+  // isolation these functions do not mean the ISD::VECTOR_SHUFFLE is exactly
+  // equivalent to an AArch64 instruction. There's the extra component of
+  // ISD::VECTOR_SHUFFLE's value type to consider. Prior to SVE these functions
+  // only operated on 64/128bit vector types that have a direct mapping to a
+  // target register and so an exact mapping is implied.
+  // However, when using SVE for fixed length vectors, most legal vector types
+  // are actually sub-vectors of a larger SVE register. When mapping
+  // ISD::VECTOR_SHUFFLE to an SVE instruction care must be taken to consider
+  // how the mask's indices translate. Specifically, when the mapping requires
+  // an exact meaning for a specific vector index (e.g. Index X is the last
+  // vector element in the register) then such mappings are often only safe when
+  // the exact SVE register size is know. The main exception to this is when
+  // indices are logically relative to the first element of either
+  // ISD::VECTOR_SHUFFLE operand because these relative indices don't change
+  // when converting from fixed-length to scalable vector types (i.e. the start
+  // of a fixed length vector is always the start of a scalable vector).
   unsigned MinSVESize = Subtarget->getMinSVEVectorSizeInBits();
   unsigned MaxSVESize = Subtarget->getMaxSVEVectorSizeInBits();
-  if (MinSVESize == MaxSVESize && MaxSVESize == VT.getSizeInBits() &&
-      ShuffleVectorInst::isReverseMask(ShuffleMask) && Op2.isUndef()) {
-    Op = DAG.getNode(ISD::VECTOR_REVERSE, DL, ContainerVT, Op1);
-    return convertFromScalableVector(DAG, VT, Op);
-  }
+  if (MinSVESize == MaxSVESize && MaxSVESize == VT.getSizeInBits()) {
+    if (ShuffleVectorInst::isReverseMask(ShuffleMask) && Op2.isUndef()) {
+      Op = DAG.getNode(ISD::VECTOR_REVERSE, DL, ContainerVT, Op1);
+      return convertFromScalableVector(DAG, VT, Op);
+    }
+
+    if (isZIPMask(ShuffleMask, VT, WhichResult) && WhichResult != 0)
+      return convertFromScalableVector(
+          DAG, VT, DAG.getNode(AArch64ISD::ZIP2, DL, ContainerVT, Op1, Op2));
+
+    if (isUZPMask(ShuffleMask, VT, WhichResult)) {
+      unsigned Opc = (WhichResult == 0) ? AArch64ISD::UZP1 : AArch64ISD::UZP2;
+      return convertFromScalableVector(
+          DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op2));
+    }
+
+    if (isZIP_v_undef_Mask(ShuffleMask, VT, WhichResult) && WhichResult != 0)
+      return convertFromScalableVector(
+          DAG, VT, DAG.getNode(AArch64ISD::ZIP2, DL, ContainerVT, Op1, Op1));
+
+    if (isUZP_v_undef_Mask(ShuffleMask, VT, WhichResult)) {
+      unsigned Opc = (WhichResult == 0) ? AArch64ISD::UZP1 : AArch64ISD::UZP2;
+      return convertFromScalableVector(
+          DAG, VT, DAG.getNode(Opc, DL, ContainerVT, Op1, Op1));
+    }
+  }

   return SDValue();
 }

 SDValue AArch64TargetLowering::getSVESafeBitCast(EVT VT, SDValue Op,
                                                  SelectionDAG &DAG) const {
   SDLoc DL(Op);
   EVT InVT = Op.getValueType();
   const TargetLowering &TLI = DAG.getTargetLoweringInfo();
[87 lines not shown]

llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -aarch64-sve-vector-bits-min=256 -aarch64-sve-vector-bits-max=256 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_EQ_256
				; RUN: llc -aarch64-sve-vector-bits-min=512 -aarch64-sve-vector-bits-max=512 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_EQ_512

				target triple = "aarch64-unknown-linux-gnu"

				define void @zip1_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; VBITS_EQ_256-LABEL: zip1_v32i8:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.b
				; VBITS_EQ_256-NEXT: ld1b { z0.b }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1b { z1.b }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: zip1 z0.b, z0.b, z1.b
				; VBITS_EQ_256-NEXT: st1b { z0.b }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: zip1_v32i8:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.b, vl32
				; VBITS_EQ_512-NEXT: ld1b { z0.b }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1b { z1.b }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: zip1 z0.b, z0.b, z1.b
				; VBITS_EQ_512-NEXT: st1b { z0.b }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load volatile <32 x i8>, <32 x i8>* %a
				%tmp2 = load volatile <32 x i8>, <32 x i8>* %b
				%tmp3 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47>
				store volatile <32 x i8> %tmp3, <32 x i8>* %a
				ret void
				}

				define void @zip_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; VBITS_EQ_256-LABEL: zip_v32i16:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: mov x8, #16
				; VBITS_EQ_256-NEXT: ptrue p0.h
				; VBITS_EQ_256-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z1.h }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z3.h }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: zip2 z4.h, z1.h, z3.h
				; VBITS_EQ_256-NEXT: zip1 z1.h, z1.h, z3.h
				; VBITS_EQ_256-NEXT: zip2 z3.h, z0.h, z2.h
				; VBITS_EQ_256-NEXT: zip1 z0.h, z0.h, z2.h
				; VBITS_EQ_256-NEXT: add z0.h, p0/m, z0.h, z1.h
				; VBITS_EQ_256-NEXT: movprfx z1, z4
				; VBITS_EQ_256-NEXT: add z1.h, p0/m, z1.h, z3.h
				; VBITS_EQ_256-NEXT: st1h { z1.h }, p0, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: zip_v32i16:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.h
				; VBITS_EQ_512-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: zip1 z2.h, z0.h, z1.h
				; VBITS_EQ_512-NEXT: zip2 z0.h, z0.h, z1.h
				; VBITS_EQ_512-NEXT: add z0.h, p0/m, z0.h, z2.h
				; VBITS_EQ_512-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load <32 x i16>, <32 x i16>* %a
				%tmp2 = load <32 x i16>, <32 x i16>* %b
				%tmp3 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47>
				%tmp4 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
				%tmp5 = add <32 x i16> %tmp3, %tmp4
				store <32 x i16> %tmp5, <32 x i16>* %a
				ret void
				}

				define void @zip1_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; VBITS_EQ_256-LABEL: zip1_v16i16:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.h
				; VBITS_EQ_256-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: zip1 z0.h, z0.h, z1.h
				; VBITS_EQ_256-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: zip1_v16i16:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.h, vl16
				; VBITS_EQ_512-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: zip1 z0.h, z0.h, z1.h
				; VBITS_EQ_512-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load volatile <16 x i16>, <16 x i16>* %a
				%tmp2 = load volatile <16 x i16>, <16 x i16>* %b
				%tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
				store volatile <16 x i16> %tmp3, <16 x i16>* %a
				ret void
				}

				define void @zip1_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; VBITS_EQ_256-LABEL: zip1_v8i32:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.s
				; VBITS_EQ_256-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1w { z1.s }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: zip1 z0.s, z0.s, z1.s
				; VBITS_EQ_256-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: zip1_v8i32:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.s, vl8
				; VBITS_EQ_512-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1w { z1.s }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: zip1 z0.s, z0.s, z1.s
				; VBITS_EQ_512-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load volatile <8 x i32>, <8 x i32>* %a
				%tmp2 = load volatile <8 x i32>, <8 x i32>* %b
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> %tmp2, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
				store volatile <8 x i32> %tmp3, <8 x i32>* %a
				ret void
				}

				define void @zip_v4f64(<4 x double>* %a, <4 x double>* %b) #0 {
				; VBITS_EQ_256-LABEL: zip_v4f64:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.d
				; VBITS_EQ_256-NEXT: ld1d { z0.d }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1d { z1.d }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: zip1 z2.d, z0.d, z1.d
				; VBITS_EQ_256-NEXT: zip2 z0.d, z0.d, z1.d
				; VBITS_EQ_256-NEXT: fadd z0.d, z2.d, z0.d
				; VBITS_EQ_256-NEXT: st1d { z0.d }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: zip_v4f64:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
				; VBITS_EQ_512-NEXT: sub x9, sp, #48
				; VBITS_EQ_512-NEXT: mov x29, sp
				; VBITS_EQ_512-NEXT: and sp, x9, #0xffffffffffffffe0
				; VBITS_EQ_512-NEXT: .cfi_def_cfa w29, 16
				; VBITS_EQ_512-NEXT: .cfi_offset w30, -8
				; VBITS_EQ_512-NEXT: .cfi_offset w29, -16
				; VBITS_EQ_512-NEXT: ptrue p0.d, vl4
				; VBITS_EQ_512-NEXT: ld1d { z0.d }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1d { z1.d }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: mov z2.d, z1.d[3]
				; VBITS_EQ_512-NEXT: mov z3.d, z0.d[3]
				; VBITS_EQ_512-NEXT: stp d3, d2, [sp, #16]
				; VBITS_EQ_512-NEXT: mov z2.d, z1.d[2]
				; VBITS_EQ_512-NEXT: mov z3.d, z0.d[2]
				; VBITS_EQ_512-NEXT: zip1 z0.d, z0.d, z1.d
				; VBITS_EQ_512-NEXT: stp d3, d2, [sp]
				; VBITS_EQ_512-NEXT: ld1d { z2.d }, p0/z, [sp]
				; VBITS_EQ_512-NEXT: fadd z0.d, p0/m, z0.d, z2.d
				; VBITS_EQ_512-NEXT: st1d { z0.d }, p0, [x0]
				; VBITS_EQ_512-NEXT: mov sp, x29
				; VBITS_EQ_512-NEXT: ldp x29, x30, [sp], #16 // 16-byte Folded Reload
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load <4 x double>, <4 x double>* %a
				%tmp2 = load <4 x double>, <4 x double>* %b
				%tmp3 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
				%tmp4 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
				%tmp5 = fadd <4 x double> %tmp3, %tmp4
				store <4 x double> %tmp5, <4 x double>* %a
				ret void
				}

				; Don't use SVE for 128-bit vectors
				define void @zip_v4i32(<4 x i32>* %a, <4 x i32>* %b) #0 {
				; CHECK-LABEL: zip_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: zip1 v2.4s, v0.4s, v1.4s
				; CHECK-NEXT: zip2 v0.4s, v0.4s, v1.4s
				; CHECK-NEXT: add v0.4s, v2.4s, v0.4s
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <4 x i32>, <4 x i32>* %a
				%tmp2 = load <4 x i32>, <4 x i32>* %b
				%tmp3 = shufflevector <4 x i32> %tmp1, <4 x i32> %tmp2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
				%tmp4 = shufflevector <4 x i32> %tmp1, <4 x i32> %tmp2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
				%tmp5 = add <4 x i32> %tmp3, %tmp4
				store <4 x i32> %tmp5, <4 x i32>* %a
				ret void
				}

				define void @zip1_v8i32_undef(<8 x i32>* %a) #0 {
				; VBITS_EQ_256-LABEL: zip1_v8i32_undef:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.s
				; VBITS_EQ_256-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: zip1 z0.s, z0.s, z0.s
				; VBITS_EQ_256-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: zip1_v8i32_undef:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.s, vl8
				; VBITS_EQ_512-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: zip1 z0.s, z0.s, z0.s
				; VBITS_EQ_512-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load volatile <8 x i32>, <8 x i32>* %a
				%tmp2 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 2, i32 2, i32 3, i32 3>
				store volatile <8 x i32> %tmp2, <8 x i32>* %a
				ret void
				}

				define void @trn_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; VBITS_EQ_256-LABEL: trn_v32i8:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.b
				; VBITS_EQ_256-NEXT: ld1b { z0.b }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1b { z1.b }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: trn1 z2.b, z0.b, z1.b
				; VBITS_EQ_256-NEXT: trn2 z0.b, z0.b, z1.b
				; VBITS_EQ_256-NEXT: add z0.b, p0/m, z0.b, z2.b
				; VBITS_EQ_256-NEXT: st1b { z0.b }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: trn_v32i8:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.b, vl32
				; VBITS_EQ_512-NEXT: ld1b { z0.b }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1b { z1.b }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: trn1 z2.b, z0.b, z1.b
				; VBITS_EQ_512-NEXT: trn2 z0.b, z0.b, z1.b
				; VBITS_EQ_512-NEXT: add z0.b, p0/m, z0.b, z2.b
				; VBITS_EQ_512-NEXT: st1b { z0.b }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load <32 x i8>, <32 x i8>* %a
				%tmp2 = load <32 x i8>, <32 x i8>* %b
				%tmp3 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 0, i32 32, i32 2, i32 34, i32 4, i32 36, i32 6, i32 38, i32 8, i32 40, i32 10, i32 42, i32 12, i32 44, i32 14, i32 46, i32 16, i32 48, i32 18, i32 50, i32 20, i32 52, i32 22, i32 54, i32 24, i32 56, i32 26, i32 58, i32 28, i32 60, i32 30, i32 62>
				%tmp4 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 1, i32 33, i32 3, i32 35, i32 undef, i32 37, i32 7, i32 undef, i32 undef, i32 41, i32 11, i32 43, i32 13, i32 45, i32 15, i32 47, i32 17, i32 49, i32 19, i32 51, i32 21, i32 53, i32 23, i32 55, i32 25, i32 57, i32 27, i32 59, i32 29, i32 61, i32 31, i32 63>
				%tmp5 = add <32 x i8> %tmp3, %tmp4
				store <32 x i8> %tmp5, <32 x i8>* %a
				ret void
				}

				define void @trn_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; VBITS_EQ_256-LABEL: trn_v32i16:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: mov x8, #16
				; VBITS_EQ_256-NEXT: ptrue p0.h
				; VBITS_EQ_256-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z1.h }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z3.h }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: trn1 z4.h, z0.h, z2.h
				; VBITS_EQ_256-NEXT: trn1 z5.h, z1.h, z3.h
				; VBITS_EQ_256-NEXT: trn2 z0.h, z0.h, z2.h
				; VBITS_EQ_256-NEXT: trn2 z1.h, z1.h, z3.h
				; VBITS_EQ_256-NEXT: add z0.h, p0/m, z0.h, z4.h
				; VBITS_EQ_256-NEXT: add z1.h, p0/m, z1.h, z5.h
				; VBITS_EQ_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: st1h { z1.h }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: trn_v32i16:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.h
				; VBITS_EQ_512-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: trn1 z2.h, z0.h, z1.h
				; VBITS_EQ_512-NEXT: trn2 z0.h, z0.h, z1.h
				; VBITS_EQ_512-NEXT: add z0.h, p0/m, z0.h, z2.h
				; VBITS_EQ_512-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load <32 x i16>, <32 x i16>* %a
				%tmp2 = load <32 x i16>, <32 x i16>* %b
				%tmp3 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 0, i32 32, i32 2, i32 34, i32 4, i32 36, i32 6, i32 38, i32 8, i32 40, i32 10, i32 42, i32 12, i32 44, i32 14, i32 46, i32 16, i32 48, i32 18, i32 50, i32 20, i32 52, i32 22, i32 54, i32 24, i32 56, i32 26, i32 58, i32 28, i32 60, i32 30, i32 62>
				%tmp4 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 1, i32 33, i32 3, i32 35, i32 undef, i32 37, i32 7, i32 undef, i32 undef, i32 41, i32 11, i32 43, i32 13, i32 45, i32 15, i32 47, i32 17, i32 49, i32 19, i32 51, i32 21, i32 53, i32 23, i32 55, i32 25, i32 57, i32 27, i32 59, i32 29, i32 61, i32 31, i32 63>
				%tmp5 = add <32 x i16> %tmp3, %tmp4
				store <32 x i16> %tmp5, <32 x i16>* %a
				ret void
				}

				define void @trn_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; VBITS_EQ_256-LABEL: trn_v16i16:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.h
				; VBITS_EQ_256-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: trn1 z2.h, z0.h, z1.h
				; VBITS_EQ_256-NEXT: trn2 z0.h, z0.h, z1.h
				; VBITS_EQ_256-NEXT: add z0.h, p0/m, z0.h, z2.h
				; VBITS_EQ_256-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: trn_v16i16:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.h, vl16
				; VBITS_EQ_512-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: trn1 z2.h, z0.h, z1.h
				; VBITS_EQ_512-NEXT: trn2 z0.h, z0.h, z1.h
				; VBITS_EQ_512-NEXT: add z0.h, p0/m, z0.h, z2.h
				; VBITS_EQ_512-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load <16 x i16>, <16 x i16>* %a
				%tmp2 = load <16 x i16>, <16 x i16>* %b
				%tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8, i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
				%tmp4 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 1, i32 17, i32 3, i32 19, i32 5, i32 21, i32 7, i32 23, i32 9, i32 25, i32 11, i32 27, i32 13, i32 29, i32 15, i32 31>
				%tmp5 = add <16 x i16> %tmp3, %tmp4
				store <16 x i16> %tmp5, <16 x i16>* %a
				ret void
				}

				define void @trn_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; VBITS_EQ_256-LABEL: trn_v8i32:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.s
				; VBITS_EQ_256-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1w { z1.s }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: trn1 z2.s, z0.s, z1.s
				; VBITS_EQ_256-NEXT: trn2 z0.s, z0.s, z1.s
				; VBITS_EQ_256-NEXT: add z0.s, p0/m, z0.s, z2.s
				; VBITS_EQ_256-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: trn_v8i32:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.s, vl8
				; VBITS_EQ_512-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1w { z1.s }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: trn1 z2.s, z0.s, z1.s
				; VBITS_EQ_512-NEXT: trn2 z0.s, z0.s, z1.s
				; VBITS_EQ_512-NEXT: add z0.s, p0/m, z0.s, z2.s
				; VBITS_EQ_512-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load <8 x i32>, <8 x i32>* %a
				%tmp2 = load <8 x i32>, <8 x i32>* %b
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> %tmp2, <8 x i32> <i32 0, i32 8, i32 undef, i32 undef, i32 4, i32 12, i32 6, i32 14>
				%tmp4 = shufflevector <8 x i32> %tmp1, <8 x i32> %tmp2, <8 x i32> <i32 1, i32 undef, i32 3, i32 11, i32 5, i32 13, i32 undef, i32 undef>
				%tmp5 = add <8 x i32> %tmp3, %tmp4
				store <8 x i32> %tmp5, <8 x i32>* %a
				ret void
				}

				define void @trn_v4f64(<4 x double>* %a, <4 x double>* %b) #0 {
				; VBITS_EQ_256-LABEL: trn_v4f64:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.d
				; VBITS_EQ_256-NEXT: ld1d { z0.d }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1d { z1.d }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: trn1 z2.d, z0.d, z1.d
				; VBITS_EQ_256-NEXT: trn2 z0.d, z0.d, z1.d
				; VBITS_EQ_256-NEXT: fadd z0.d, z2.d, z0.d
				; VBITS_EQ_256-NEXT: st1d { z0.d }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: trn_v4f64:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.d, vl4
				; VBITS_EQ_512-NEXT: ld1d { z0.d }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: ld1d { z1.d }, p0/z, [x1]
				; VBITS_EQ_512-NEXT: trn1 z2.d, z0.d, z1.d
				; VBITS_EQ_512-NEXT: trn2 z0.d, z0.d, z1.d
				; VBITS_EQ_512-NEXT: fadd z0.d, p0/m, z0.d, z2.d
				; VBITS_EQ_512-NEXT: st1d { z0.d }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load <4 x double>, <4 x double>* %a
				%tmp2 = load <4 x double>, <4 x double>* %b
				%tmp3 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
				%tmp4 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 1, i32 5, i32 3, i32 7>
				%tmp5 = fadd <4 x double> %tmp3, %tmp4
				store <4 x double> %tmp5, <4 x double>* %a
				ret void
				}

				; Don't use SVE for 128-bit vectors
				define void @trn_v4f32(<4 x float>* %a, <4 x float>* %b) #0 {
				; CHECK-LABEL: trn_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: trn1 v2.4s, v0.4s, v1.4s
				; CHECK-NEXT: trn2 v0.4s, v0.4s, v1.4s
				; CHECK-NEXT: fadd v0.4s, v2.4s, v0.4s
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <4 x float>, <4 x float>* %a
				%tmp2 = load <4 x float>, <4 x float>* %b
				%tmp3 = shufflevector <4 x float> %tmp1, <4 x float> %tmp2, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
				%tmp4 = shufflevector <4 x float> %tmp1, <4 x float> %tmp2, <4 x i32> <i32 1, i32 5, i32 3, i32 7>
				%tmp5 = fadd <4 x float> %tmp3, %tmp4
				store <4 x float> %tmp5, <4 x float>* %a
				ret void
				}

				define void @trn_v8i32_undef(<8 x i32>* %a) #0 {
				; VBITS_EQ_256-LABEL: trn_v8i32_undef:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: ptrue p0.s
				; VBITS_EQ_256-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: trn1 z1.s, z0.s, z0.s
				; VBITS_EQ_256-NEXT: trn2 z0.s, z0.s, z0.s
				; VBITS_EQ_256-NEXT: add z0.s, p0/m, z0.s, z1.s
				; VBITS_EQ_256-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_EQ_512-LABEL: trn_v8i32_undef:
				; VBITS_EQ_512: // %bb.0:
				; VBITS_EQ_512-NEXT: ptrue p0.s, vl8
				; VBITS_EQ_512-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_EQ_512-NEXT: trn1 z1.s, z0.s, z0.s
				; VBITS_EQ_512-NEXT: trn2 z0.s, z0.s, z0.s
				; VBITS_EQ_512-NEXT: add z0.s, p0/m, z0.s, z1.s
				; VBITS_EQ_512-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_EQ_512-NEXT: ret
				%tmp1 = load <8 x i32>, <8 x i32>* %a
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 0, i32 0, i32 2, i32 2, i32 4, i32 4, i32 6, i32 6>
				%tmp4 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 1, i32 1, i32 3, i32 3, i32 5, i32 5, i32 7, i32 7>
				%tmp5 = add <8 x i32> %tmp3, %tmp4
				paulwalker-arm: Please place the attributes together at the end of the file because otherwise they're hard to find when trying to see what attributes exist for a specific function.
				wwei (author): Done.
				store <8 x i32> %tmp5, <8 x i32>* %a
				ret void
				}

				; Emit zip2 instruction for v32i8 shuffle with vscale_range(2,2),
				; since the size of v32i8 is the same as the runtime vector length.
				paulwalker-arm: I think there's value in adding a comment that states why we can safely emit zip2 instructions. Not too detailed; it's just worth drawing the reader's attention to the fact that for this and related tests the runtime vector length is known.
				define void @zip2_v32i8(<32 x i8>* %a, <32 x i8>* %b) #1 {
				; CHECK-LABEL: zip2_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x1]
				; CHECK-NEXT: zip2 z0.b, z0.b, z1.b
				; CHECK-NEXT: st1b { z0.b }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load volatile <32 x i8>, <32 x i8>* %a
				%tmp2 = load volatile <32 x i8>, <32 x i8>* %b
				%tmp3 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
				store volatile <32 x i8> %tmp3, <32 x i8>* %a
				ret void
				}

				; Emit zip2 instruction for v16i16 shuffle with vscale_range(2,2),
				; since the size of v16i16 is the same as the runtime vector length.
				define void @zip2_v16i16(<16 x i16>* %a, <16 x i16>* %b) #1 {
				; CHECK-LABEL: zip2_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: zip2 z0.h, z0.h, z1.h
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load volatile <16 x i16>, <16 x i16>* %a
				%tmp2 = load volatile <16 x i16>, <16 x i16>* %b
				%tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
				store volatile <16 x i16> %tmp3, <16 x i16>* %a
				ret void
				}

				; Emit zip2 instruction for v8i32 shuffle with vscale_range(2,2),
				; since the size of v8i32 is the same as the runtime vector length.
				define void @zip2_v8i32(<8 x i32>* %a, <8 x i32>* %b) #1 {
				; CHECK-LABEL: zip2_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: zip2 z0.s, z0.s, z1.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load volatile <8 x i32>, <8 x i32>* %a
				%tmp2 = load volatile <8 x i32>, <8 x i32>* %b
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> %tmp2, <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
				store volatile <8 x i32> %tmp3, <8 x i32>* %a
				ret void
				}

				; Emit zip2 instruction for v8i32 and undef shuffle with vscale_range(2,2)
				define void @zip2_v8i32_undef(<8 x i32>* %a) #1 {
				; CHECK-LABEL: zip2_v8i32_undef:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: zip2 z0.s, z0.s, z0.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load volatile <8 x i32>, <8 x i32>* %a
				paulwalker-arm: As above, there's value in adding a comment that mentions why we can safely emit uzp instructions.
				%tmp2 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 4, i32 4, i32 5, i32 5, i32 6, i32 6, i32 7, i32 7>
				store volatile <8 x i32> %tmp2, <8 x i32>* %a
				ret void
				}

				; Emit uzp1/2 instruction for v32i8 shuffle with vscale_range(2,2),
				; since the size of v32i8 is the same as the runtime vector length.
				define void @uzp_v32i8(<32 x i8>* %a, <32 x i8>* %b) #1 {
				; CHECK-LABEL: uzp_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x1]
				; CHECK-NEXT: uzp1 z2.b, z0.b, z1.b
				; CHECK-NEXT: uzp2 z0.b, z0.b, z1.b
				; CHECK-NEXT: add z0.b, p0/m, z0.b, z2.b
				; CHECK-NEXT: st1b { z0.b }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <32 x i8>, <32 x i8>* %a
				%tmp2 = load <32 x i8>, <32 x i8>* %b
				%tmp3 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30, i32 32, i32 34, i32 36, i32 38, i32 40, i32 42, i32 44, i32 46, i32 48, i32 50, i32 52, i32 54, i32 56, i32 58, i32 60, i32 62>
				%tmp4 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 1, i32 3, i32 5, i32 undef, i32 9, i32 11, i32 13, i32 undef, i32 undef, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31, i32 33, i32 35, i32 37, i32 39, i32 41, i32 43, i32 45, i32 47, i32 49, i32 51, i32 53, i32 55, i32 57, i32 59, i32 61, i32 63>
				%tmp5 = add <32 x i8> %tmp3, %tmp4
				store <32 x i8> %tmp5, <32 x i8>* %a
				ret void
				}

				; Emit uzp1/2 instruction for v32i16 shuffle with vscale_range(2,2),
				; v32i16 will be expanded into two v16i16, and the size of v16i16 is
				; the same as the runtime vector length.
				define void @uzp_v32i16(<32 x i16>* %a, <32 x i16>* %b) #1 {
				; CHECK-LABEL: uzp_v32i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: mov x8, #16
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; CHECK-NEXT: ld1h { z3.h }, p0/z, [x1]
				; CHECK-NEXT: uzp1 z5.h, z1.h, z0.h
				; CHECK-NEXT: uzp2 z0.h, z1.h, z0.h
				; CHECK-NEXT: add z0.h, p0/m, z0.h, z5.h
				; CHECK-NEXT: uzp1 z4.h, z3.h, z2.h
				; CHECK-NEXT: uzp2 z2.h, z3.h, z2.h
				; CHECK-NEXT: movprfx z1, z4
				; CHECK-NEXT: add z1.h, p0/m, z1.h, z2.h
				; CHECK-NEXT: st1h { z1.h }, p0, [x0, x8, lsl #1]
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <32 x i16>, <32 x i16>* %a
				%tmp2 = load <32 x i16>, <32 x i16>* %b
				%tmp3 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30, i32 32, i32 34, i32 36, i32 38, i32 40, i32 42, i32 44, i32 46, i32 48, i32 50, i32 52, i32 54, i32 56, i32 58, i32 60, i32 62>
				%tmp4 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 1, i32 3, i32 5, i32 undef, i32 9, i32 11, i32 13, i32 undef, i32 undef, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31, i32 33, i32 35, i32 37, i32 39, i32 41, i32 43, i32 45, i32 47, i32 49, i32 51, i32 53, i32 55, i32 57, i32 59, i32 61, i32 63>
				%tmp5 = add <32 x i16> %tmp3, %tmp4
				store <32 x i16> %tmp5, <32 x i16>* %a
				ret void
				}

				; Emit uzp1/2 instruction for v16i16 shuffle with vscale_range(2,2),
				; since the size of v16i16 is the same as the runtime vector length.
				define void @uzp_v16i16(<16 x i16>* %a, <16 x i16>* %b) #1 {
				; CHECK-LABEL: uzp_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: uzp1 z2.h, z0.h, z1.h
				; CHECK-NEXT: uzp2 z0.h, z0.h, z1.h
				; CHECK-NEXT: add z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <16 x i16>, <16 x i16>* %a
				%tmp2 = load <16 x i16>, <16 x i16>* %b
				%tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%tmp4 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%tmp5 = add <16 x i16> %tmp3, %tmp4
				store <16 x i16> %tmp5, <16 x i16>* %a
				ret void
				}

				; Emit uzp1/2 instruction for v8f32 shuffle with vscale_range(2,2),
				; since the size of v8f32 is the same as the runtime vector length.
				define void @uzp_v8f32(<8 x float>* %a, <8 x float>* %b) #1 {
				; CHECK-LABEL: uzp_v8f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: uzp1 z2.s, z0.s, z1.s
				; CHECK-NEXT: uzp2 z0.s, z0.s, z1.s
				; CHECK-NEXT: fadd z0.s, z2.s, z0.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x float>, <8 x float>* %a
				%tmp2 = load <8 x float>, <8 x float>* %b
				%tmp3 = shufflevector <8 x float> %tmp1, <8 x float> %tmp2, <8 x i32> <i32 0, i32 undef, i32 4, i32 6, i32 undef, i32 10, i32 12, i32 14>
				%tmp4 = shufflevector <8 x float> %tmp1, <8 x float> %tmp2, <8 x i32> <i32 1, i32 undef, i32 5, i32 7, i32 9, i32 11, i32 undef, i32 undef>
				%tmp5 = fadd <8 x float> %tmp3, %tmp4
				store <8 x float> %tmp5, <8 x float>* %a
				ret void
				}

				; Emit uzp1/2 instruction for v4i64 shuffle with vscale_range(2,2),
				; since the size of v4i64 is the same as the runtime vector length.
				define void @uzp_v4i64(<4 x i64>* %a, <4 x i64>* %b) #1 {
				; CHECK-LABEL: uzp_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: uzp1 z2.d, z0.d, z1.d
				; CHECK-NEXT: uzp2 z0.d, z0.d, z1.d
				; CHECK-NEXT: add z0.d, p0/m, z0.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <4 x i64>, <4 x i64>* %a
				%tmp2 = load <4 x i64>, <4 x i64>* %b
				%tmp3 = shufflevector <4 x i64> %tmp1, <4 x i64> %tmp2, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%tmp4 = shufflevector <4 x i64> %tmp1, <4 x i64> %tmp2, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%tmp5 = add <4 x i64> %tmp3, %tmp4
				store <4 x i64> %tmp5, <4 x i64>* %a
				ret void
				}

				; Don't use SVE for 128-bit vectors
				define void @uzp_v8i16(<8 x i16>* %a, <8 x i16>* %b) #1 {
				; CHECK-LABEL: uzp_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: uzp1 v2.8h, v0.8h, v1.8h
				; CHECK-NEXT: uzp2 v0.8h, v0.8h, v1.8h
				; CHECK-NEXT: add v0.8h, v2.8h, v0.8h
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x i16>, <8 x i16>* %a
				%tmp2 = load <8 x i16>, <8 x i16>* %b
				%tmp3 = shufflevector <8 x i16> %tmp1, <8 x i16> %tmp2, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%tmp4 = shufflevector <8 x i16> %tmp1, <8 x i16> %tmp2, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%tmp5 = add <8 x i16> %tmp3, %tmp4
				store <8 x i16> %tmp5, <8 x i16>* %a
				ret void
				}

				; Emit uzp1/2 instruction for v8i32 and undef shuffle with vscale_range(2,2)
				define void @uzp_v8i32_undef(<8 x i32>* %a) #1 {
				paulwalker-arm: As above.
				paulwalker-arm: Please add a small comment to highlight this is a negative test and why only zip1 instructions are emitted.
				; CHECK-LABEL: uzp_v8i32_undef:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: uzp1 z1.s, z0.s, z0.s
				; CHECK-NEXT: uzp2 z0.s, z0.s, z0.s
				; CHECK-NEXT: add z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x i32>, <8 x i32>* %a
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 0, i32 2, i32 4, i32 6>
				%tmp4 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 1, i32 3, i32 5, i32 7>
				%tmp5 = add <8 x i32> %tmp3, %tmp4
				store <8 x i32> %tmp5, <8 x i32>* %a
				ret void
				}

				; Only zip1 can be emitted safely with vscale_range(2,4).
				; vscale_range(2,4) means different min/max vector sizes, zip2 relies on
				; knowing which indices represent the high half of sve vector register.
				define void @zip_vscale2_4(<4 x double>* %a, <4 x double>* %b) #2 {
				; CHECK-LABEL: zip_vscale2_4:
				; CHECK: // %bb.0:
				; CHECK-NEXT: stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
				; CHECK-NEXT: sub x9, sp, #48
				; CHECK-NEXT: mov x29, sp
				; CHECK-NEXT: and sp, x9, #0xffffffffffffffe0
				; CHECK-NEXT: .cfi_def_cfa w29, 16
				; CHECK-NEXT: .cfi_offset w30, -8
				; CHECK-NEXT: .cfi_offset w29, -16
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: mov z2.d, z1.d[3]
				; CHECK-NEXT: mov z3.d, z0.d[3]
				; CHECK-NEXT: stp d3, d2, [sp, #16]
				; CHECK-NEXT: mov z2.d, z1.d[2]
				; CHECK-NEXT: mov z3.d, z0.d[2]
				; CHECK-NEXT: zip1 z0.d, z0.d, z1.d
				; CHECK-NEXT: stp d3, d2, [sp]
				; CHECK-NEXT: ld1d { z2.d }, p0/z, [sp]
				; CHECK-NEXT: fadd z0.d, p0/m, z0.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: mov sp, x29
				; CHECK-NEXT: ldp x29, x30, [sp], #16 // 16-byte Folded Reload
				; CHECK-NEXT: ret
				%tmp1 = load <4 x double>, <4 x double>* %a
				%tmp2 = load <4 x double>, <4 x double>* %b
				%tmp3 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
				%tmp4 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
				%tmp5 = fadd <4 x double> %tmp3, %tmp4
				store <4 x double> %tmp5, <4 x double>* %a
				ret void
				}
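
The comment on `zip_vscale2_4` says that zip2 depends on knowing where the high half of the SVE register is. The following is a minimal sketch of that reasoning, not code from the patch: it models a register as a plain `std::vector<int>` with the fixed-length data in the low lanes and `-1` in the unused lanes. The names `widen`, `zipReg` and `zipMatchesFixed` are hypothetical helpers introduced only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Place a fixed-length vector in the low lanes of a wider "register".
// Lanes above the fixed-length data are filled with -1 (don't-care junk).
Vec widen(const Vec &Fixed, int RegLanes) {
  assert(RegLanes >= static_cast<int>(Fixed.size()));
  Vec Reg(RegLanes, -1);
  for (std::size_t I = 0; I < Fixed.size(); ++I)
    Reg[I] = Fixed[I];
  return Reg;
}

// Whole-register zip: interleave the low halves (Part == 0, like zip1)
// or the high halves (Part == 1, like zip2) of A and B.
Vec zipReg(const Vec &A, const Vec &B, int Part) {
  assert(A.size() == B.size() && A.size() % 2 == 0);
  assert(Part == 0 || Part == 1);
  std::size_t Half = A.size() / 2;
  Vec R;
  for (std::size_t I = 0; I < Half; ++I) {
    R.push_back(A[Part * Half + I]);
    R.push_back(B[Part * Half + I]);
  }
  return R;
}

// True if zipping the widened registers gives, in its low FixedLanes lanes,
// the same result as zipping the fixed-length vectors directly.
bool zipMatchesFixed(int FixedLanes, int RegLanes, int Part) {
  Vec A, B;
  for (int I = 0; I < FixedLanes; ++I) {
    A.push_back(I);       // lanes of the first operand
    B.push_back(100 + I); // lanes of the second operand
  }
  Vec Want = zipReg(A, B, Part);
  Vec Got = zipReg(widen(A, RegLanes), widen(B, RegLanes), Part);
  Got.resize(FixedLanes);
  return Got == Want;
}
```

In this model, `zipMatchesFixed(4, 8, 0)` holds while `zipMatchesFixed(4, 8, 1)` does not, which mirrors the `VBITS_EQ_512` output for `zip_v4f64`: zip1 is still emitted, but the second shuffle is expanded through the stack. With `RegLanes == FixedLanes`, as `vscale_range(2,2)` guarantees for 256-bit vectors, both parts match.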

				attributes #0 = { "target-features"="+sve" }
				attributes #1 = { "target-features"="+sve" vscale_range(2,2) }
				attributes #2 = { "target-features"="+sve" vscale_range(2,4) }
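
The fully defined `shufflevector` masks in this test follow regular index patterns. As a reading aid, the sketch below regenerates them. It is an illustration written for this page, not the matching code in `AArch64ISelLowering.cpp`, and `zipMask`, `uzpMask` and `trnMask` are hypothetical names. Mask indices `0..NumElts-1` select from the first operand and `NumElts..2*NumElts-1` from the second; `Part` is 0 for the "1" form of the instruction and 1 for the "2" form.

```cpp
#include <cassert>
#include <vector>

// zip1/zip2: interleave the low (Part == 0) or high (Part == 1) halves.
std::vector<int> zipMask(int NumElts, int Part) {
  assert(NumElts % 2 == 0 && (Part == 0 || Part == 1));
  std::vector<int> Mask;
  int Base = Part * NumElts / 2; // zip2 starts at the high half
  for (int I = 0; I < NumElts / 2; ++I) {
    Mask.push_back(Base + I);           // lane from the first operand
    Mask.push_back(Base + I + NumElts); // same lane from the second operand
  }
  return Mask;
}

// uzp1/uzp2: even (Part == 0) or odd (Part == 1) lanes of the
// concatenation of both operands.
std::vector<int> uzpMask(int NumElts, int Part) {
  assert(Part == 0 || Part == 1);
  std::vector<int> Mask;
  for (int I = 0; I < NumElts; ++I)
    Mask.push_back(2 * I + Part);
  return Mask;
}

// trn1/trn2: pair up even (Part == 0) or odd (Part == 1) lanes of the
// two operands.
std::vector<int> trnMask(int NumElts, int Part) {
  assert(NumElts % 2 == 0 && (Part == 0 || Part == 1));
  std::vector<int> Mask;
  for (int I = 0; I < NumElts; I += 2) {
    Mask.push_back(I + Part);           // lane from the first operand
    Mask.push_back(I + Part + NumElts); // same lane from the second operand
  }
  return Mask;
}
```

For example, `zipMask(8, 1)` produces the mask used in `zip2_v8i32`, and `uzpMask(4, 0)` the first mask in `uzp_v4i64`. Several tests replace some of these indices with `undef`; the sketch only generates the fully defined form.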