This is an archive of the discontinued LLVM Phabricator instance.


Differential D113376

[AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2
ClosedPublic

Authored by wwei on Nov 7 2021, 6:40 PM.

Details

Reviewers

paulwalker-arm
peterwaller-arm
bsmith
david-arm
sdesmalen
dmgreen
efriedma

Commits

rG03dc2975d07e: [AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2

Summary

Attempt to lower a shuffle as a permute instruction (zip/uzp/trn) for fixed-length SVE.

Event Timeline

wwei created this revision. Nov 7 2021, 6:40 PM

Herald added subscribers: ctetreau, psnobl, hiraditya and 2 others. Nov 7 2021, 6:40 PM

wwei requested review of this revision. Nov 7 2021, 6:40 PM

Herald added a subscriber: llvm-commits. Nov 7 2021, 6:40 PM

Harbormaster completed remote builds in B132942: Diff 385388. Nov 7 2021, 7:24 PM

david-arm added inline comments. Nov 8 2021, 1:47 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19264	This looks a bit wrong because VT=Fixed Length Vector Type, but Op1 and Op2 have scalable vector types.

wwei added inline comments. Nov 8 2021, 3:40 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19264	No, the code here is correct. `VT` and `ShuffleMask` are used to check the type of the mask value, like zip or uzp mask. `Op1` and `Op2` are used to generate new permute nodes, which will be scalable aarch64 SDNodes here. In this patch, I encapsulated a new function for the code lowering shuffles to neon ZIP/UZP/TRN. Fortunately, the same code can be used for fixed-length SVE.

LGTM, thanks.

This revision is now accepted and ready to land. Nov 8 2021, 3:49 AM

paulwalker-arm requested changes to this revision. Nov 8 2021, 5:13 AM

paulwalker-arm added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9678–9681	I don't believe this logic is universally safe. The use of ZIP2 specifically relies on knowing the exact size of the register to know which indices represent the "high half" of a vector register. This is something that is only known when `sve-vector-bits-min==sve-vector-bits-max`. I also suspect functions like isZIPMask were written when only 64 and 128 bit legal vectors existed. I doubt the logic holds for the case when a vector is legal but not the exact size of the target vector register, as is the case with the SVE fixed length support. I have not looked into the extent to which the other shuffles are affected but I suspect each has its own complexity.
llvm/test/CodeGen/AArch64/sve-fix-length-permute-zip-uzp-trn.ll
21	To highlight my previous comment. Index `16` is only the first byte of the second half of a 256-bit vector and thus if the target has a bigger vector register the use of `zip2` will not have the correct behaviour.
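
The high-half problem described above can be reproduced with a small lane-level model. This is only an illustrative sketch, not LLVM code: `Lanes`, `widen`, `zip2` and `lowLanes` are hypothetical helpers, and `"?"` stands for whatever data sits in register lanes beyond the fixed-length vector.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Lanes = std::vector<std::string>;

// Place a fixed-length vector in the low lanes of a register that has
// `regLanes` lanes. Lanes above the vector hold unknown data ("?").
Lanes widen(const Lanes &v, std::size_t regLanes) {
  assert(v.size() <= regLanes);
  Lanes r(regLanes, "?");
  for (std::size_t i = 0; i < v.size(); ++i)
    r[i] = v[i];
  return r;
}

// Model of zip2 over the WHOLE register: interleave the high halves.
Lanes zip2(const Lanes &a, const Lanes &b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  const std::size_t half = a.size() / 2;
  Lanes r;
  for (std::size_t i = 0; i < half; ++i) {
    r.push_back(a[half + i]);
    r.push_back(b[half + i]);
  }
  return r;
}

// Extract the fixed-length result: keep only the low n lanes.
Lanes lowLanes(const Lanes &v, std::size_t n) {
  assert(n <= v.size());
  return Lanes(v.begin(), v.begin() + n);
}
```

With a 4-lane vector in a 4-lane register the model yields the expected `a2 b2 a3 b3`. With the same vector in an 8-lane register, the high half of the register contains only unknown lanes, so the extracted result is all `?`.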

This revision now requires changes to proceed. Nov 8 2021, 5:13 AM

wwei added inline comments. Nov 8 2021, 8:02 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9678–9681	Thanks, I get it. It seems that there are potential problems for the fixed-length shuffle vector lowering, including the patch https://reviews.llvm.org/D105289. I will upload a new revision later to fix it by adding the check: `sve-vector-bits-min==sve-vector-bits-max`

paulwalker-arm added inline comments. Nov 8 2021, 8:36 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9678–9681	I believe that patch is ok given the restrictions it imposes. It is fine using the `is###Mask` functions as a way to identify a named shuffle. What is unsafe is assuming `is###Mask` means the `###` instruction can be used directly when converted to scalable vectors. So taking D105289 you can see that it uses `isEXTMask` to determine the type of shuffle but then the actual lowering code only handles a specific case and that case does not emit an `EXT` instruction as that would have fallen into the same trap.

wwei added inline comments. Nov 9 2021, 12:55 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9678–9681	Yeah, you are right. The logic in D105289 is correct: it uses `EXTRACT_VECTOR_ELT` and `INSR` to match a specific `EXT` pattern, which is a smart solution.
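
To see why the `EXTRACT_VECTOR_ELT` plus `INSR` sequence does not depend on the register size, here is a lane-level sketch. It assumes the specific pattern is "last element of the first operand followed by the leading elements of the second", as suggested by the lowering code quoted in the diff; `Lanes`, `insr` and `lowerExtViaInsr` are hypothetical names for illustration, not LLVM APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Lanes = std::vector<std::string>;

// Model of INSR: shift every lane up by one and place `scalar` in lane 0.
Lanes insr(const Lanes &v, const std::string &scalar) {
  assert(!v.empty());
  Lanes r(v.size());
  r[0] = scalar;
  for (std::size_t i = 1; i < v.size(); ++i)
    r[i] = v[i - 1];
  return r;
}

// Shuffle <N-1, N, N+1, ..., 2N-2> for N-lane operands held in a register
// with `regLanes` lanes. Unknown upper lanes are modelled as "?".
Lanes lowerExtViaInsr(const Lanes &op1, const Lanes &op2,
                      std::size_t regLanes) {
  assert(!op1.empty() && op1.size() == op2.size() && op1.size() <= regLanes);
  Lanes wide2(regLanes, "?");
  for (std::size_t i = 0; i < op2.size(); ++i)
    wide2[i] = op2[i];
  // The extract uses a fixed-length index (N-1), counted from lane 0.
  const std::string scalar = op1.back();
  Lanes r = insr(wide2, scalar);
  r.resize(op1.size()); // keep only the fixed-length part
  return r;
}
```

Both indices involved (element `N-1` of the first operand and lane 0 of the second) are counted from the start of the vector, so in this model the result is the same for a 4-lane and an 8-lane register.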

update the patch, adding sve min/max size check.

Harbormaster completed remote builds in B133194: Diff 385727. Nov 9 2021, 1:41 AM

paulwalker-arm added inline comments. Nov 9 2021, 8:23 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267–19268	I still don't think this is enough as you've missed my previous `I doubt the logic holds for the case when a vector is legal but not the exact size of the target vector register` comment. Further discussion is attached to the test zip_v32i8 below... I'm starting to wonder if it's worth breaking out lowerShuffleToZIP_UZP_TRN as I feel like once support is added for more of the combinations there's not likely to be much reuse. For example taking the zip1/zip2 case I think supporting zip1 is likely simple and just works for all legal VTs, but zip2 requires the VT to be exactly the size of the target vector (i.e. `VT.getSizeInBits() == Subtarget->getMinSVEVectorSizeInBits() == Subtarget->getMaxSVEVectorSizeInBits()`) or perhaps a different instruction sequence.
llvm/test/CodeGen/AArch64/sve-fix-length-permute-zip-uzp-trn.ll
25–26	The output for `VBITS_EQ_256` and `VBITS_EQ_512` is the same, which doesn't make sense considering the target vector length is different. For the `VBITS_EQ_256` case index `16` is the start of the high half of the target vector and thus zip2 works. But for the `VBITS_EQ_512` case index `16` is actually the start of the second quarter of the target vector and thus zip2 will do the wrong thing.
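
The condition quoted above for zip2, and the remark about index `16`, can be written down as two small checks. This is a sketch under assumptions: `zip2IsSafe` and `highHalfStart` are hypothetical helpers taking plain bit counts, not the `Subtarget` queries themselves.

```cpp
#include <cassert>

// zip2 may be emitted directly only when the register size is known exactly
// and the fixed-length type fills the whole register.
bool zip2IsSafe(unsigned vtBits, unsigned minSVEBits, unsigned maxSVEBits) {
  return minSVEBits == maxSVEBits && vtBits == minSVEBits;
}

// First lane of the high half of a register, for a given element width.
unsigned highHalfStart(unsigned regBits, unsigned eltBits) {
  assert(eltBits != 0 && regBits % (2 * eltBits) == 0);
  return regBits / eltBits / 2;
}
```

For byte elements, `highHalfStart(256, 8)` is 16 while `highHalfStart(512, 8)` is 32, which is why the same mask index lands in the high half of one register and in the second quarter of the other.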

wwei added inline comments. Nov 10 2021, 1:00 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267–19268	Thanks, I agree with you. Perhaps the current implementation of `lowerShuffleToZIP_UZP_TRN`, which reuses the NEON code, is not so good, since SVE may have variable length. Maybe we should identify and classify the different shuffle patterns. For the four shuffle types zip/uzp/trn/rev, we can roughly divide them into two categories:
zip1/uzp1/uzp2/trn1/trn2/revb/revh/revw (simple, and works for all legal VTs)
zip2/rev (complex, and needs to consider the high half or the whole register)
I think we can implement these two scenarios in steps.

Matt added a subscriber: Matt. Nov 17 2021, 12:01 PM

wwei updated this revision to Diff 391320. Dec 2 2021, 7:05 AM

Update the patch; there is also another patch, D114960, to support the rev instructions.

Harbormaster completed remote builds in B137136: Diff 391320. Dec 2 2021, 8:07 AM

wwei mentioned this in D114960: [AArch64][SVE] Lower shuffles to permute instructions: rev/revb/revh/revw. Dec 7 2021, 12:30 AM

paulwalker-arm added inline comments. Dec 9 2021, 8:09 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267–19268	Looking at the six shuffles trn1, trn2, uzp1, uzp2, zip1, zip2 I believe that when min_length != max_length != VT.getSizeInBits() only trn1, trn2 & zip1 are valid. We agree that zip2 is not valid because the top half of the input fixed length vector does not necessarily map to the top half of a scalable vector. The reason I believe both uzp variants are also invalid is because their underlying operation is to concat both input vectors and then extract every second element. However, the vectors we're concatenating may contain junk so you end up with:

expected uzp1: [A | B | C | D], [E | F | G | H] => [A | C | E | G]
SVE_VLS uzp: [A | B | C | D | ? | ? | ? | ?], [E | F | G | H | ? | ? | ? | ?] => [A | C | ? | ? | E | G | ? | ?]

And thus when you extract the fixed length result you'll end up with `[A | C | ? | ?]`.

This I think further reduces the value of having `lowerShuffleToZIP_UZP_TRN` as it would be clearer to just handle each of the scenarios separately.
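
The uzp example in this comment, and the contrasting trn1 case, can be checked with a short lane-level model. It is a sketch for illustration only: `Lanes`, `widen`, `uzp1`, `trn1` and `lowLanes` are hypothetical helpers, and `"?"` marks junk lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

using Lanes = std::vector<std::string>;

// Fixed-length vector in the low lanes of a wider register; rest is junk.
Lanes widen(const Lanes &v, std::size_t regLanes) {
  assert(v.size() <= regLanes);
  Lanes r(regLanes, "?");
  for (std::size_t i = 0; i < v.size(); ++i)
    r[i] = v[i];
  return r;
}

// uzp1: concatenate both registers, then keep every even-numbered element.
Lanes uzp1(const Lanes &a, const Lanes &b) {
  assert(a.size() == b.size());
  Lanes cat(a);
  cat.insert(cat.end(), b.begin(), b.end());
  Lanes r;
  for (std::size_t i = 0; i < cat.size(); i += 2)
    r.push_back(cat[i]);
  return r;
}

// trn1: interleave the even-numbered elements of the two registers.
Lanes trn1(const Lanes &a, const Lanes &b) {
  assert(a.size() == b.size() && a.size() % 2 == 0);
  Lanes r;
  for (std::size_t i = 0; i < a.size(); i += 2) {
    r.push_back(a[i]);
    r.push_back(b[i]);
  }
  return r;
}

// Extract the fixed-length result from the low n lanes.
Lanes lowLanes(const Lanes &v, std::size_t n) {
  assert(n <= v.size());
  return Lanes(v.begin(), v.begin() + n);
}
```

In this model uzp1 on 8-lane registers returns `A C ? ?` in the low four lanes, matching the comment, while trn1 returns `A E C G` for both 4-lane and 8-lane registers because each output lane only reads input lanes at the same low positions.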

wwei added inline comments. Dec 9 2021, 8:59 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267–19268	@paulwalker-arm Thanks for your detailed explanation. I fully agree with your point of view. Indeed, we don’t need `lowerShuffleToZIP_UZP_TRN` anymore; handling each case separately will make the code clearer. I will try to refactor the code later.

wwei updated this revision to Diff 393828. Dec 13 2021, 3:18 AM

Harbormaster completed remote builds in B138922: Diff 393828. Dec 13 2021, 4:37 AM

update the patch, lowerShuffleToZIP_UZP_TRN removed and test file refactored.

rebased.

Harbormaster completed remote builds in B139448: Diff 394577. Dec 15 2021, 9:01 AM

Just a passing review I'm afraid. I'll take a proper look tomorrow.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267	I think you can use `ContainerVT` here? It might help with the formatting. Same goes for the other places where you use `Op1.getValueType()`.
19279–19288	I've not thought about it deeply so might be wrong but given these cases only use `Op1` I'm wondering if they're always safe and thus don't need to be part of the `MinSVESize == MaxSVESize == VT.getSizeInBits()` restricted set?

paulwalker-arm added inline comments. Dec 16 2021, 10:55 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19286–19289	I figured this comment could be better and started to write something, but ended up with:

Functions like isZIPMask return true when an ISD::VECTOR_SHUFFLE's mask represents the same logical operation as performed by a ZIP instruction. In isolation these functions do not mean the ISD::VECTOR_SHUFFLE is exactly equivalent to an AArch64 instruction. There's the extra component of ISD::VECTOR_SHUFFLE's value type to consider. Prior to SVE these functions only operated on 64/128-bit vector types that have a direct mapping to a target register, so an exact mapping is implied. However, when using SVE for fixed-length vectors, most legal vector types are actually sub-vectors of a larger SVE register. When mapping ISD::VECTOR_SHUFFLE to an SVE instruction, care must be taken to consider how the mask's indices translate. Specifically, when the mapping requires an exact meaning for a specific vector index (e.g. index X is the last vector element in the register), such mappings are often only safe when the exact SVE register size is known. The main exception to this is when indices are logically relative to the first element of either ISD::VECTOR_SHUFFLE operand, because these relative indices don't change when converting from fixed-length to scalable vector types (i.e. the start of a fixed-length vector is always the start of a scalable vector).

Which is more like a novel than a comment :) I've posted it anyway just in case there's something in there that's useful.
llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll
10–14 ↗	(On Diff #394577)	It looks like `InterleavedAccessPass` is causing your code to be bypassed. I couldn't immediately see a way to disable the pass other than using `-O0` which might cause other issues but I think if you use `volatile` loads and stores within these tests you'll get what you need.
413 ↗	(On Diff #394577)	Please place the attributes together at the end of the file because otherwise they're hard to find when trying to see what attributes exist for a specific function.
627 ↗	(On Diff #394577)	As above.
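
The distinction drawn in the proposed code comment, between recognising a zip-shaped mask and deciding which instruction is legal, can be sketched with a simplified mask classifier. This is a deliberately reduced model and not LLVM's `isZIPMask`: `classifyZip` and `zip2FirstIndex` are hypothetical names, and undef mask entries are not handled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns 1 for a zip1-shaped mask, 2 for a zip2-shaped mask, 0 otherwise.
// The mask indexes the concatenation of two n-element operands.
int classifyZip(const std::vector<int> &mask) {
  const std::size_t n = mask.size();
  if (n == 0 || n % 2 != 0)
    return 0;
  for (int which = 1; which <= 2; ++which) {
    // zip1 starts at element 0, zip2 at element n/2 of the fixed-length type.
    int idx = (which == 1) ? 0 : static_cast<int>(n / 2);
    bool ok = true;
    for (std::size_t i = 0; i < n && ok; i += 2, ++idx)
      ok = mask[i] == idx && mask[i + 1] == idx + static_cast<int>(n);
    if (ok)
      return which;
  }
  return 0;
}

// First operand element a zip2-shaped mask refers to. This is the register
// midpoint only when the type has as many lanes as the register.
std::size_t zip2FirstIndex(std::size_t numElts) {
  assert(numElts % 2 == 0);
  return numElts / 2;
}
```

A zip1-shaped mask only names elements counted from the start of each operand, which is the "relative index" case the comment describes. A zip2-shaped mask starts at `zip2FirstIndex(n)`, a position whose meaning inside a scalable register depends on the register size.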

wwei updated this revision to Diff 395107. Dec 17 2021, 6:05 AM

wwei added inline comments. Dec 17 2021, 6:24 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
19267	Changed, using `ContainerVT` to replace `Op1.getValueType()`
19279–19288	I don't think it's safe. In the undef case, `Op1` is used as both inputs, so it still has the same problem: uzp [A | B | C | D | ? | ? | ? | ?], [A | B | C | D | ? | ? | ? | ?] => [A | C | ? | ? | A | C | ? | ?]
19286–19289	Thanks for your comment, I used it directly in the code
llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll
10–14 ↗	(On Diff #394577)	Thanks, `volatile` works.
413 ↗	(On Diff #394577)	done

Harbormaster completed remote builds in B139832: Diff 395107. Dec 17 2021, 6:58 AM

paulwalker-arm accepted this revision. Dec 20 2021, 10:15 AM

paulwalker-arm added inline comments.

llvm/test/CodeGen/AArch64/sve-fixed-length-permute-zip-uzp-trn.ll
419 ↗	(On Diff #395107)	I think there's value in adding a comment that states why we can safely emit zip2 instructions. Not too detailed, it's just worth drawing the reader's attention to the fact that for this and related tests the runtime vector length is known.
481 ↗	(On Diff #395107)	As above, there's value in adding a comment that mentions why we can safely emit uzp instructions.
627 ↗	(On Diff #395107)	Please add a small comment to highlight this is a negative test and why only zip1 instructions are emitted.

This revision is now accepted and ready to land. Dec 20 2021, 10:15 AM

This revision was landed with ongoing or failed builds. Dec 21 2021, 2:46 AM

Closed by commit rG03dc2975d07e: [AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2 (authored by wwei).

This revision was automatically updated to reflect the committed changes.

wwei added a commit: rG03dc2975d07e: [AArch64][SVE] Lower shuffles to permute instructions: zip1/2, uzp1/2, trn1/2.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	66 lines
llvm/test/CodeGen/AArch64/sve-fix-length-permute-zip-uzp-trn.ll	477 lines

Diff 385388

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[9,665 lines not shown]	if (DAG.getTargetLoweringInfo().isTypeLegal(NewVT)) {
return DAG.getBitcast(VT,		return DAG.getBitcast(VT,
DAG.getVectorShuffle(NewVT, DL, V0, V1, NewMask));		DAG.getVectorShuffle(NewVT, DL, V0, V1, NewMask));
}		}
}		}

return SDValue();		return SDValue();
}		}

		static SDValue lowerShuffleToZIP_UZP_TRN(ArrayRef<int> M, EVT VT, SDValue V1,
		SDValue V2, SDLoc dl,
		SelectionDAG &DAG) {
		unsigned WhichResult;
		if (isZIPMask(M, VT, WhichResult)) {
		unsigned Opc = (WhichResult == 0) ? AArch64ISD::ZIP1 : AArch64ISD::ZIP2;
		return DAG.getNode(Opc, dl, V1.getValueType(), V1, V2);
		}
		if (isUZPMask(M, VT, WhichResult)) {
		unsigned Opc = (WhichResult == 0) ? AArch64ISD::UZP1 : AArch64ISD::UZP2;
		return DAG.getNode(Opc, dl, V1.getValueType(), V1, V2);
		}
		if (isTRNMask(M, VT, WhichResult)) {
		unsigned Opc = (WhichResult == 0) ? AArch64ISD::TRN1 : AArch64ISD::TRN2;
		return DAG.getNode(Opc, dl, V1.getValueType(), V1, V2);
		}

		if (isZIP_v_undef_Mask(M, VT, WhichResult)) {
		unsigned Opc = (WhichResult == 0) ? AArch64ISD::ZIP1 : AArch64ISD::ZIP2;
		return DAG.getNode(Opc, dl, V1.getValueType(), V1, V1);
		}
		if (isUZP_v_undef_Mask(M, VT, WhichResult)) {
		unsigned Opc = (WhichResult == 0) ? AArch64ISD::UZP1 : AArch64ISD::UZP2;
		return DAG.getNode(Opc, dl, V1.getValueType(), V1, V1);
		}
		if (isTRN_v_undef_Mask(M, VT, WhichResult)) {
		unsigned Opc = (WhichResult == 0) ? AArch64ISD::TRN1 : AArch64ISD::TRN2;
		return DAG.getNode(Opc, dl, V1.getValueType(), V1, V1);
		}

		return SDValue();
		}

SDValue AArch64TargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,		SDValue AArch64TargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
SDLoc dl(Op);		SDLoc dl(Op);
EVT VT = Op.getValueType();		EVT VT = Op.getValueType();

ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op.getNode());		ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(Op.getNode());

if (useSVEForFixedLengthVectorVT(VT))		if (useSVEForFixedLengthVectorVT(VT))
[75 lines not shown]	if (isEXTMask(ShuffleMask, VT, ReverseEXT, Imm)) {
return DAG.getNode(AArch64ISD::EXT, dl, V1.getValueType(), V1, V2,		return DAG.getNode(AArch64ISD::EXT, dl, V1.getValueType(), V1, V2,
DAG.getConstant(Imm, dl, MVT::i32));		DAG.getConstant(Imm, dl, MVT::i32));
} else if (V2->isUndef() && isSingletonEXTMask(ShuffleMask, VT, Imm)) {		} else if (V2->isUndef() && isSingletonEXTMask(ShuffleMask, VT, Imm)) {
Imm *= getExtFactor(V1);		Imm *= getExtFactor(V1);
return DAG.getNode(AArch64ISD::EXT, dl, V1.getValueType(), V1, V1,		return DAG.getNode(AArch64ISD::EXT, dl, V1.getValueType(), V1, V1,
DAG.getConstant(Imm, dl, MVT::i32));		DAG.getConstant(Imm, dl, MVT::i32));
}		}

unsigned WhichResult;		if (SDValue Permute =
if (isZIPMask(ShuffleMask, VT, WhichResult)) {		lowerShuffleToZIP_UZP_TRN(ShuffleMask, VT, V1, V2, dl, DAG))
unsigned Opc = (WhichResult == 0) ? AArch64ISD::ZIP1 : AArch64ISD::ZIP2;		return Permute;
return DAG.getNode(Opc, dl, V1.getValueType(), V1, V2);
}
if (isUZPMask(ShuffleMask, VT, WhichResult)) {
unsigned Opc = (WhichResult == 0) ? AArch64ISD::UZP1 : AArch64ISD::UZP2;
return DAG.getNode(Opc, dl, V1.getValueType(), V1, V2);
}
if (isTRNMask(ShuffleMask, VT, WhichResult)) {
unsigned Opc = (WhichResult == 0) ? AArch64ISD::TRN1 : AArch64ISD::TRN2;
return DAG.getNode(Opc, dl, V1.getValueType(), V1, V2);
}

if (isZIP_v_undef_Mask(ShuffleMask, VT, WhichResult)) {
unsigned Opc = (WhichResult == 0) ? AArch64ISD::ZIP1 : AArch64ISD::ZIP2;
return DAG.getNode(Opc, dl, V1.getValueType(), V1, V1);
}
if (isUZP_v_undef_Mask(ShuffleMask, VT, WhichResult)) {
unsigned Opc = (WhichResult == 0) ? AArch64ISD::UZP1 : AArch64ISD::UZP2;
return DAG.getNode(Opc, dl, V1.getValueType(), V1, V1);
}
if (isTRN_v_undef_Mask(ShuffleMask, VT, WhichResult)) {
unsigned Opc = (WhichResult == 0) ? AArch64ISD::TRN1 : AArch64ISD::TRN2;
return DAG.getNode(Opc, dl, V1.getValueType(), V1, V1);
}

if (SDValue Concat = tryFormConcatFromShuffle(Op, DAG))		if (SDValue Concat = tryFormConcatFromShuffle(Op, DAG))
return Concat;		return Concat;

bool DstIsLeft;		bool DstIsLeft;
int Anomaly;		int Anomaly;
int NumInputElements = V1.getValueType().getVectorNumElements();		int NumInputElements = V1.getValueType().getVectorNumElements();
if (isINSMask(ShuffleMask, NumInputElements, DstIsLeft, Anomaly)) {		if (isINSMask(ShuffleMask, NumInputElements, DstIsLeft, Anomaly)) {
[9,446 lines not shown]	if ((ScalarTy == MVT::i8) || (ScalarTy == MVT::i16))
ScalarTy = MVT::i32;		ScalarTy = MVT::i32;
SDValue Scalar = DAG.getNode(		SDValue Scalar = DAG.getNode(
ISD::EXTRACT_VECTOR_ELT, DL, ScalarTy, Op1,		ISD::EXTRACT_VECTOR_ELT, DL, ScalarTy, Op1,
DAG.getConstant(VT.getVectorNumElements() - 1, DL, MVT::i64));		DAG.getConstant(VT.getVectorNumElements() - 1, DL, MVT::i64));
Op = DAG.getNode(AArch64ISD::INSR, DL, ContainerVT, Op2, Scalar);		Op = DAG.getNode(AArch64ISD::INSR, DL, ContainerVT, Op2, Scalar);
return convertFromScalableVector(DAG, VT, Op);		return convertFromScalableVector(DAG, VT, Op);
}		}

		if (SDValue PermOp =
		lowerShuffleToZIP_UZP_TRN(ShuffleMask, VT, Op1, Op2, DL, DAG))
		return convertFromScalableVector(DAG, VT, PermOp);

return SDValue();		return SDValue();
}		}

SDValue AArch64TargetLowering::getSVESafeBitCast(EVT VT, SDValue Op,		SDValue AArch64TargetLowering::getSVESafeBitCast(EVT VT, SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
SDLoc DL(Op);		SDLoc DL(Op);
EVT InVT = Op.getValueType();		EVT InVT = Op.getValueType();
const TargetLowering &TLI = DAG.getTargetLoweringInfo();		const TargetLowering &TLI = DAG.getTargetLoweringInfo();
(void)TLI;		(void)TLI;

assert(VT.isScalableVector() && TLI.isTypeLegal(VT) &&		assert(VT.isScalableVector() && TLI.isTypeLegal(VT) &&
InVT.isScalableVector() && TLI.isTypeLegal(InVT) &&		InVT.isScalableVector() && TLI.isTypeLegal(InVT) &&
"Only expect to cast between legal scalable vector types!");		"Only expect to cast between legal scalable vector types!");
assert((VT.getVectorElementType() == MVT::i1) ==		assert((VT.getVectorElementType() == MVT::i1) ==
(InVT.getVectorElementType() == MVT::i1) &&		(InVT.getVectorElementType() == MVT::i1) &&
"Cannot cast between data and predicate scalable vector types!");		"Cannot cast between data and predicate scalable vector types!");

if (InVT == VT)		if (InVT == VT)
return Op;		return Op;

if (VT.getVectorElementType() == MVT::i1)		if (VT.getVectorElementType() == MVT::i1)
return DAG.getNode(AArch64ISD::REINTERPRET_CAST, DL, VT, Op);		return DAG.getNode(AArch64ISD::REINTERPRET_CAST, DL, VT, Op);

EVT PackedVT = getPackedSVEVectorVT(VT.getVectorElementType());		EVT PackedVT = getPackedSVEVectorVT(VT.getVectorElementType());
EVT PackedInVT = getPackedSVEVectorVT(InVT.getVectorElementType());		EVT PackedInVT = getPackedSVEVectorVT(InVT.getVectorElementType());

// Pack input if required.		// Pack input if required.
if (InVT != PackedInVT)		if (InVT != PackedInVT)
Op = DAG.getNode(AArch64ISD::REINTERPRET_CAST, DL, PackedInVT, Op);		Op = DAG.getNode(AArch64ISD::REINTERPRET_CAST, DL, PackedInVT, Op);

Op = DAG.getNode(ISD::BITCAST, DL, PackedVT, Op);		Op = DAG.getNode(ISD::BITCAST, DL, PackedVT, Op);
[64 lines not shown]

llvm/test/CodeGen/AArch64/sve-fix-length-permute-zip-uzp-trn.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_EQ_256
				; RUN: llc -aarch64-sve-vector-bits-min=512 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_GE_512

				target triple = "aarch64-unknown-linux-gnu"

				define void @zip_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: zip_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b, vl32
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x1]
				; CHECK-NEXT: zip1 z2.b, z0.b, z1.b
				; CHECK-NEXT: zip2 z0.b, z0.b, z1.b
				; CHECK-NEXT: add z0.b, p0/m, z0.b, z2.b
				; CHECK-NEXT: st1b { z0.b }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <32 x i8>, <32 x i8>* %a
				%tmp2 = load <32 x i8>, <32 x i8>* %b
				%tmp3 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47>
				%tmp4 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
				%tmp5 = add <32 x i8> %tmp3, %tmp4
				store <32 x i8> %tmp5, <32 x i8>* %a
				ret void
				}

				paulwalker-arm: The output for `VBITS_EQ_256` and `VBITS_EQ_512` is the same, which doesn't make sense considering the target vector length is different. For the `VBITS_EQ_256` case, index `16` is the start of the high half of the target vector and thus `zip2` works. But for the `VBITS_EQ_512` case, index `16` is actually the start of the second quarter of the target vector and thus `zip2` will do the wrong thing.
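
The concern in the comment above can be reproduced with a small lane-level model. The Python sketch below is not part of the patch and is only an illustration under stated assumptions: `zip1`, `zip2`, `load_fixed`, `shuffle` and `lowered` are hypothetical helpers, a register is modelled as a plain list of byte lanes, and the lanes beyond the 32-element fixed-length vector are modelled as zero (as a `p0/z` load would leave them). It compares the `<32 x i8>` shuffles from `zip_v32i8` against `zip1`/`zip2` on a 256-bit and a 512-bit register.

```python
# Simplified model (an assumption, not the compiler's code): a register is a
# list of byte lanes, and zip1/zip2 split it at half the *register* length.

def zip1(zn, zm):
    """Interleave the low halves of two equally sized registers."""
    half = len(zn) // 2
    out = []
    for x, y in zip(zn[:half], zm[:half]):
        out += [x, y]
    return out

def zip2(zn, zm):
    """Interleave the high halves of two equally sized registers."""
    half = len(zn) // 2
    out = []
    for x, y in zip(zn[half:], zm[half:]):
        out += [x, y]
    return out

def load_fixed(values, reg_lanes):
    """Place a fixed-length vector in the low lanes; the rest read as 0."""
    return list(values) + [0] * (reg_lanes - len(values))

def shuffle(a, b, mask):
    """Reference semantics of shufflevector on two fixed-length vectors."""
    both = list(a) + list(b)
    return [both[i] for i in mask]

N = 32                                  # <32 x i8>
a = list(range(100, 100 + N))
b = list(range(200, 200 + N))

# The two masks used by zip_v32i8: <0, 32, 1, 33, ...> and <16, 48, 17, 49, ...>
lo_mask = [i // 2 + (N if i % 2 else 0) for i in range(N)]
hi_mask = [N // 2 + i // 2 + (N if i % 2 else 0) for i in range(N)]

def lowered(op, reg_bits):
    """Apply op on registers of reg_bits bits and keep the low N lanes."""
    lanes = reg_bits // 8
    return op(load_fixed(a, lanes), load_fixed(b, lanes))[:N]

zip1_ok_256 = lowered(zip1, 256) == shuffle(a, b, lo_mask)
zip2_ok_256 = lowered(zip2, 256) == shuffle(a, b, hi_mask)
zip1_ok_512 = lowered(zip1, 512) == shuffle(a, b, lo_mask)
zip2_ok_512 = lowered(zip2, 512) == shuffle(a, b, hi_mask)

print(zip1_ok_256, zip2_ok_256)  # True True
print(zip1_ok_512, zip2_ok_512)  # True False
```

In this model `zip1` stays correct at both register sizes because the fixed-length data always sits in the low lanes, while `zip2` on the 512-bit register reads lanes 32 to 63, which hold none of the `<32 x i8>` data.
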
				define void @zip_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; VBITS_EQ_256-LABEL: zip_v32i16:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: mov x8, #16
				; VBITS_EQ_256-NEXT: ptrue p0.h, vl16
				; VBITS_EQ_256-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z1.h }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z3.h }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: zip2 z4.h, z1.h, z3.h
				; VBITS_EQ_256-NEXT: zip1 z1.h, z1.h, z3.h
				; VBITS_EQ_256-NEXT: zip2 z3.h, z0.h, z2.h
				; VBITS_EQ_256-NEXT: zip1 z0.h, z0.h, z2.h
				; VBITS_EQ_256-NEXT: add z0.h, p0/m, z0.h, z1.h
				; VBITS_EQ_256-NEXT: movprfx z1, z4
				; VBITS_EQ_256-NEXT: add z1.h, p0/m, z1.h, z3.h
				; VBITS_EQ_256-NEXT: st1h { z1.h }, p0, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_GE_512-LABEL: zip_v32i16:
				; VBITS_GE_512: // %bb.0:
				; VBITS_GE_512-NEXT: ptrue p0.h, vl32
				; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_GE_512-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_GE_512-NEXT: zip1 z2.h, z0.h, z1.h
				; VBITS_GE_512-NEXT: zip2 z0.h, z0.h, z1.h
				; VBITS_GE_512-NEXT: add z0.h, p0/m, z0.h, z2.h
				; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_GE_512-NEXT: ret
				%tmp1 = load <32 x i16>, <32 x i16>* %a
				%tmp2 = load <32 x i16>, <32 x i16>* %b
				%tmp3 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47>
				%tmp4 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
				%tmp5 = add <32 x i16> %tmp3, %tmp4
				store <32 x i16> %tmp5, <32 x i16>* %a
				ret void
				}

				define void @zip_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: zip_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl16
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: zip1 z2.h, z0.h, z1.h
				; CHECK-NEXT: zip2 z0.h, z0.h, z1.h
				; CHECK-NEXT: add z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <16 x i16>, <16 x i16>* %a
				%tmp2 = load <16 x i16>, <16 x i16>* %b
				%tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23>
				%tmp4 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
				%tmp5 = add <16 x i16> %tmp3, %tmp4
				store <16 x i16> %tmp5, <16 x i16>* %a
				ret void
				}

				define void @zip_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: zip_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: zip1 z2.s, z0.s, z1.s
				; CHECK-NEXT: zip2 z0.s, z0.s, z1.s
				; CHECK-NEXT: add z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x i32>, <8 x i32>* %a
				%tmp2 = load <8 x i32>, <8 x i32>* %b
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> %tmp2, <8 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11>
				%tmp4 = shufflevector <8 x i32> %tmp1, <8 x i32> %tmp2, <8 x i32> <i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
				%tmp5 = add <8 x i32> %tmp3, %tmp4
				store <8 x i32> %tmp5, <8 x i32>* %a
				ret void
				}

				define void @zip_v4f64(<4 x double>* %a, <4 x double>* %b) #0 {
				; CHECK-LABEL: zip_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: zip1 z2.d, z0.d, z1.d
				; CHECK-NEXT: zip2 z0.d, z0.d, z1.d
				; CHECK-NEXT: fadd z0.d, p0/m, z0.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <4 x double>, <4 x double>* %a
				%tmp2 = load <4 x double>, <4 x double>* %b
				%tmp3 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
				%tmp4 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
				%tmp5 = fadd <4 x double> %tmp3, %tmp4
				store <4 x double> %tmp5, <4 x double>* %a
				ret void
				}

				; Don't use SVE for 128-bit vectors
				define void @zip_v4i32(<4 x i32>* %a, <4 x i32>* %b) #0 {
				; CHECK-LABEL: zip_v4i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: zip1 v2.4s, v0.4s, v1.4s
				; CHECK-NEXT: zip2 v0.4s, v0.4s, v1.4s
				; CHECK-NEXT: add v0.4s, v2.4s, v0.4s
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <4 x i32>, <4 x i32>* %a
				%tmp2 = load <4 x i32>, <4 x i32>* %b
				%tmp3 = shufflevector <4 x i32> %tmp1, <4 x i32> %tmp2, <4 x i32> <i32 0, i32 4, i32 1, i32 5>
				%tmp4 = shufflevector <4 x i32> %tmp1, <4 x i32> %tmp2, <4 x i32> <i32 2, i32 6, i32 3, i32 7>
				%tmp5 = add <4 x i32> %tmp3, %tmp4
				store <4 x i32> %tmp5, <4 x i32>* %a
				ret void
				}

				define void @zip_v8i32_undef(<8 x i32>* %a) #0 {
				; CHECK-LABEL: zip_v8i32_undef:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: zip1 z1.s, z0.s, z0.s
				; CHECK-NEXT: zip2 z0.s, z0.s, z0.s
				; CHECK-NEXT: add z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x i32>, <8 x i32>* %a
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 2, i32 2, i32 3, i32 3>
				%tmp4 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 4, i32 4, i32 5, i32 5, i32 6, i32 6, i32 7, i32 7>
				%tmp5 = add <8 x i32> %tmp3, %tmp4
				store <8 x i32> %tmp5, <8 x i32>* %a
				ret void
				}

				define void @uzp_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: uzp_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b, vl32
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x1]
				; CHECK-NEXT: uzp1 z2.b, z0.b, z1.b
				; CHECK-NEXT: uzp2 z0.b, z0.b, z1.b
				; CHECK-NEXT: add z0.b, p0/m, z0.b, z2.b
				; CHECK-NEXT: st1b { z0.b }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <32 x i8>, <32 x i8>* %a
				%tmp2 = load <32 x i8>, <32 x i8>* %b
				%tmp3 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30, i32 32, i32 34, i32 36, i32 38, i32 40, i32 42, i32 44, i32 46, i32 48, i32 50, i32 52, i32 54, i32 56, i32 58, i32 60, i32 62>
				%tmp4 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 1, i32 3, i32 5, i32 undef, i32 9, i32 11, i32 13, i32 undef, i32 undef, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31, i32 33, i32 35, i32 37, i32 39, i32 41, i32 43, i32 45, i32 47, i32 49, i32 51, i32 53, i32 55, i32 57, i32 59, i32 61, i32 63>
				%tmp5 = add <32 x i8> %tmp3, %tmp4
				store <32 x i8> %tmp5, <32 x i8>* %a
				ret void
				}

				define void @uzp_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; VBITS_EQ_256-LABEL: uzp_v32i16:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: mov x8, #16
				; VBITS_EQ_256-NEXT: ptrue p0.h, vl16
				; VBITS_EQ_256-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z1.h }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z3.h }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: uzp1 z5.h, z1.h, z0.h
				; VBITS_EQ_256-NEXT: uzp2 z0.h, z1.h, z0.h
				; VBITS_EQ_256-NEXT: add z0.h, p0/m, z0.h, z5.h
				; VBITS_EQ_256-NEXT: uzp1 z4.h, z3.h, z2.h
				; VBITS_EQ_256-NEXT: uzp2 z2.h, z3.h, z2.h
				; VBITS_EQ_256-NEXT: movprfx z1, z4
				; VBITS_EQ_256-NEXT: add z1.h, p0/m, z1.h, z2.h
				; VBITS_EQ_256-NEXT: st1h { z1.h }, p0, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_GE_512-LABEL: uzp_v32i16:
				; VBITS_GE_512: // %bb.0:
				; VBITS_GE_512-NEXT: ptrue p0.h, vl32
				; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_GE_512-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_GE_512-NEXT: uzp1 z2.h, z0.h, z1.h
				; VBITS_GE_512-NEXT: uzp2 z0.h, z0.h, z1.h
				; VBITS_GE_512-NEXT: add z0.h, p0/m, z0.h, z2.h
				; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_GE_512-NEXT: ret
				%tmp1 = load <32 x i16>, <32 x i16>* %a
				%tmp2 = load <32 x i16>, <32 x i16>* %b
				%tmp3 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30, i32 32, i32 34, i32 36, i32 38, i32 40, i32 42, i32 44, i32 46, i32 48, i32 50, i32 52, i32 54, i32 56, i32 58, i32 60, i32 62>
				%tmp4 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 1, i32 3, i32 5, i32 undef, i32 9, i32 11, i32 13, i32 undef, i32 undef, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31, i32 33, i32 35, i32 37, i32 39, i32 41, i32 43, i32 45, i32 47, i32 49, i32 51, i32 53, i32 55, i32 57, i32 59, i32 61, i32 63>
				%tmp5 = add <32 x i16> %tmp3, %tmp4
				store <32 x i16> %tmp5, <32 x i16>* %a
				ret void
				}

				define void @uzp_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: uzp_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl16
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: uzp1 z2.h, z0.h, z1.h
				; CHECK-NEXT: uzp2 z0.h, z0.h, z1.h
				; CHECK-NEXT: add z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <16 x i16>, <16 x i16>* %a
				%tmp2 = load <16 x i16>, <16 x i16>* %b
				%tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16, i32 18, i32 20, i32 22, i32 24, i32 26, i32 28, i32 30>
				%tmp4 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15, i32 17, i32 19, i32 21, i32 23, i32 25, i32 27, i32 29, i32 31>
				%tmp5 = add <16 x i16> %tmp3, %tmp4
				store <16 x i16> %tmp5, <16 x i16>* %a
				ret void
				}

				define void @uzp_v8f32(<8 x float>* %a, <8 x float>* %b) #0 {
				; CHECK-LABEL: uzp_v8f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: uzp1 z2.s, z0.s, z1.s
				; CHECK-NEXT: uzp2 z0.s, z0.s, z1.s
				; CHECK-NEXT: fadd z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x float>, <8 x float>* %a
				%tmp2 = load <8 x float>, <8 x float>* %b
				%tmp3 = shufflevector <8 x float> %tmp1, <8 x float> %tmp2, <8 x i32> <i32 0, i32 undef, i32 4, i32 6, i32 undef, i32 10, i32 12, i32 14>
				%tmp4 = shufflevector <8 x float> %tmp1, <8 x float> %tmp2, <8 x i32> <i32 1, i32 undef, i32 5, i32 7, i32 9, i32 11, i32 undef, i32 undef>
				%tmp5 = fadd <8 x float> %tmp3, %tmp4
				store <8 x float> %tmp5, <8 x float>* %a
				ret void
				}

				define void @uzp_v4i64(<4 x i64>* %a, <4 x i64>* %b) #0 {
				; CHECK-LABEL: uzp_v4i64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: uzp1 z2.d, z0.d, z1.d
				; CHECK-NEXT: uzp2 z0.d, z0.d, z1.d
				; CHECK-NEXT: add z0.d, p0/m, z0.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <4 x i64>, <4 x i64>* %a
				%tmp2 = load <4 x i64>, <4 x i64>* %b
				%tmp3 = shufflevector <4 x i64> %tmp1, <4 x i64> %tmp2, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
				%tmp4 = shufflevector <4 x i64> %tmp1, <4 x i64> %tmp2, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
				%tmp5 = add <4 x i64> %tmp3, %tmp4
				store <4 x i64> %tmp5, <4 x i64>* %a
				ret void
				}

				; Don't use SVE for 128-bit vectors
				define void @uzp_v8i16(<8 x i16>* %a, <8 x i16>* %b) #0 {
				; CHECK-LABEL: uzp_v8i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: uzp1 v2.8h, v0.8h, v1.8h
				; CHECK-NEXT: uzp2 v0.8h, v0.8h, v1.8h
				; CHECK-NEXT: add v0.8h, v2.8h, v0.8h
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x i16>, <8 x i16>* %a
				%tmp2 = load <8 x i16>, <8 x i16>* %b
				%tmp3 = shufflevector <8 x i16> %tmp1, <8 x i16> %tmp2, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
				%tmp4 = shufflevector <8 x i16> %tmp1, <8 x i16> %tmp2, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
				%tmp5 = add <8 x i16> %tmp3, %tmp4
				store <8 x i16> %tmp5, <8 x i16>* %a
				ret void
				}

				define void @uzp_v8i32_undef(<8 x i32>* %a) #0 {
				; CHECK-LABEL: uzp_v8i32_undef:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: uzp1 z1.s, z0.s, z0.s
				; CHECK-NEXT: uzp2 z0.s, z0.s, z0.s
				; CHECK-NEXT: add z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x i32>, <8 x i32>* %a
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 0, i32 2, i32 4, i32 6>
				%tmp4 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 1, i32 3, i32 5, i32 7>
				%tmp5 = add <8 x i32> %tmp3, %tmp4
				store <8 x i32> %tmp5, <8 x i32>* %a
				ret void
				}

				define void @trn_v32i8(<32 x i8>* %a, <32 x i8>* %b) #0 {
				; CHECK-LABEL: trn_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.b, vl32
				; CHECK-NEXT: ld1b { z0.b }, p0/z, [x0]
				; CHECK-NEXT: ld1b { z1.b }, p0/z, [x1]
				; CHECK-NEXT: trn1 z2.b, z0.b, z1.b
				; CHECK-NEXT: trn2 z0.b, z0.b, z1.b
				; CHECK-NEXT: add z0.b, p0/m, z0.b, z2.b
				; CHECK-NEXT: st1b { z0.b }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <32 x i8>, <32 x i8>* %a
				%tmp2 = load <32 x i8>, <32 x i8>* %b
				%tmp3 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 0, i32 32, i32 2, i32 34, i32 4, i32 36, i32 6, i32 38, i32 8, i32 40, i32 10, i32 42, i32 12, i32 44, i32 14, i32 46, i32 16, i32 48, i32 18, i32 50, i32 20, i32 52, i32 22, i32 54, i32 24, i32 56, i32 26, i32 58, i32 28, i32 60, i32 30, i32 62>
				%tmp4 = shufflevector <32 x i8> %tmp1, <32 x i8> %tmp2, <32 x i32> <i32 1, i32 33, i32 3, i32 35, i32 undef, i32 37, i32 7, i32 undef, i32 undef, i32 41, i32 11, i32 43, i32 13, i32 45, i32 15, i32 47, i32 17, i32 49, i32 19, i32 51, i32 21, i32 53, i32 23, i32 55, i32 25, i32 57, i32 27, i32 59, i32 29, i32 61, i32 31, i32 63>
				%tmp5 = add <32 x i8> %tmp3, %tmp4
				store <32 x i8> %tmp5, <32 x i8>* %a
				ret void
				}

				define void @trn_v32i16(<32 x i16>* %a, <32 x i16>* %b) #0 {
				; VBITS_EQ_256-LABEL: trn_v32i16:
				; VBITS_EQ_256: // %bb.0:
				; VBITS_EQ_256-NEXT: mov x8, #16
				; VBITS_EQ_256-NEXT: ptrue p0.h, vl16
				; VBITS_EQ_256-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z1.h }, p0/z, [x0]
				; VBITS_EQ_256-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; VBITS_EQ_256-NEXT: ld1h { z3.h }, p0/z, [x1]
				; VBITS_EQ_256-NEXT: trn1 z4.h, z0.h, z2.h
				; VBITS_EQ_256-NEXT: trn1 z5.h, z1.h, z3.h
				; VBITS_EQ_256-NEXT: trn2 z0.h, z0.h, z2.h
				; VBITS_EQ_256-NEXT: trn2 z1.h, z1.h, z3.h
				; VBITS_EQ_256-NEXT: add z0.h, p0/m, z0.h, z4.h
				; VBITS_EQ_256-NEXT: add z1.h, p0/m, z1.h, z5.h
				; VBITS_EQ_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
				; VBITS_EQ_256-NEXT: st1h { z1.h }, p0, [x0]
				; VBITS_EQ_256-NEXT: ret
				;
				; VBITS_GE_512-LABEL: trn_v32i16:
				; VBITS_GE_512: // %bb.0:
				; VBITS_GE_512-NEXT: ptrue p0.h, vl32
				; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_GE_512-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_GE_512-NEXT: trn1 z2.h, z0.h, z1.h
				; VBITS_GE_512-NEXT: trn2 z0.h, z0.h, z1.h
				; VBITS_GE_512-NEXT: add z0.h, p0/m, z0.h, z2.h
				; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_GE_512-NEXT: ret
				%tmp1 = load <32 x i16>, <32 x i16>* %a
				%tmp2 = load <32 x i16>, <32 x i16>* %b
				%tmp3 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 0, i32 32, i32 2, i32 34, i32 4, i32 36, i32 6, i32 38, i32 8, i32 40, i32 10, i32 42, i32 12, i32 44, i32 14, i32 46, i32 16, i32 48, i32 18, i32 50, i32 20, i32 52, i32 22, i32 54, i32 24, i32 56, i32 26, i32 58, i32 28, i32 60, i32 30, i32 62>
				%tmp4 = shufflevector <32 x i16> %tmp1, <32 x i16> %tmp2, <32 x i32> <i32 1, i32 33, i32 3, i32 35, i32 undef, i32 37, i32 7, i32 undef, i32 undef, i32 41, i32 11, i32 43, i32 13, i32 45, i32 15, i32 47, i32 17, i32 49, i32 19, i32 51, i32 21, i32 53, i32 23, i32 55, i32 25, i32 57, i32 27, i32 59, i32 29, i32 61, i32 31, i32 63>
				%tmp5 = add <32 x i16> %tmp3, %tmp4
				store <32 x i16> %tmp5, <32 x i16>* %a
				ret void
				}

				define void @trn_v16i16(<16 x i16>* %a, <16 x i16>* %b) #0 {
				; CHECK-LABEL: trn_v16i16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl16
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: trn1 z2.h, z0.h, z1.h
				; CHECK-NEXT: trn2 z0.h, z0.h, z1.h
				; CHECK-NEXT: add z0.h, p0/m, z0.h, z2.h
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <16 x i16>, <16 x i16>* %a
				%tmp2 = load <16 x i16>, <16 x i16>* %b
				%tmp3 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 0, i32 16, i32 2, i32 18, i32 4, i32 20, i32 6, i32 22, i32 8, i32 24, i32 10, i32 26, i32 12, i32 28, i32 14, i32 30>
				%tmp4 = shufflevector <16 x i16> %tmp1, <16 x i16> %tmp2, <16 x i32> <i32 1, i32 17, i32 3, i32 19, i32 5, i32 21, i32 7, i32 23, i32 9, i32 25, i32 11, i32 27, i32 13, i32 29, i32 15, i32 31>
				%tmp5 = add <16 x i16> %tmp3, %tmp4
				store <16 x i16> %tmp5, <16 x i16>* %a
				ret void
				}

				define void @trn_v8i32(<8 x i32>* %a, <8 x i32>* %b) #0 {
				; CHECK-LABEL: trn_v8i32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: trn1 z2.s, z0.s, z1.s
				; CHECK-NEXT: trn2 z0.s, z0.s, z1.s
				; CHECK-NEXT: add z0.s, p0/m, z0.s, z2.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x i32>, <8 x i32>* %a
				%tmp2 = load <8 x i32>, <8 x i32>* %b
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> %tmp2, <8 x i32> <i32 0, i32 8, i32 undef, i32 undef, i32 4, i32 12, i32 6, i32 14>
				%tmp4 = shufflevector <8 x i32> %tmp1, <8 x i32> %tmp2, <8 x i32> <i32 1, i32 undef, i32 3, i32 11, i32 5, i32 13, i32 undef, i32 undef>
				%tmp5 = add <8 x i32> %tmp3, %tmp4
				store <8 x i32> %tmp5, <8 x i32>* %a
				ret void
				}

				define void @trn_v4f64(<4 x double>* %a, <4 x double>* %b) #0 {
				; CHECK-LABEL: trn_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: trn1 z2.d, z0.d, z1.d
				; CHECK-NEXT: trn2 z0.d, z0.d, z1.d
				; CHECK-NEXT: fadd z0.d, p0/m, z0.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <4 x double>, <4 x double>* %a
				%tmp2 = load <4 x double>, <4 x double>* %b
				%tmp3 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
				%tmp4 = shufflevector <4 x double> %tmp1, <4 x double> %tmp2, <4 x i32> <i32 1, i32 5, i32 3, i32 7>
				%tmp5 = fadd <4 x double> %tmp3, %tmp4
				store <4 x double> %tmp5, <4 x double>* %a
				ret void
				}

				; Don't use SVE for 128-bit vectors
				define void @trn_v4f32(<4 x float>* %a, <4 x float>* %b) nounwind {
				; CHECK-LABEL: trn_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: trn1 v2.4s, v0.4s, v1.4s
				; CHECK-NEXT: trn2 v0.4s, v0.4s, v1.4s
				; CHECK-NEXT: fadd v0.4s, v2.4s, v0.4s
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <4 x float>, <4 x float>* %a
				%tmp2 = load <4 x float>, <4 x float>* %b
				%tmp3 = shufflevector <4 x float> %tmp1, <4 x float> %tmp2, <4 x i32> <i32 0, i32 4, i32 2, i32 6>
				%tmp4 = shufflevector <4 x float> %tmp1, <4 x float> %tmp2, <4 x i32> <i32 1, i32 5, i32 3, i32 7>
				%tmp5 = fadd <4 x float> %tmp3, %tmp4
				store <4 x float> %tmp5, <4 x float>* %a
				ret void
				}

				define void @trn_v8i32_undef(<8 x i32>* %a) #0 {
				; CHECK-LABEL: trn_v8i32_undef:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: trn1 z1.s, z0.s, z0.s
				; CHECK-NEXT: trn2 z0.s, z0.s, z0.s
				; CHECK-NEXT: add z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%tmp1 = load <8 x i32>, <8 x i32>* %a
				%tmp3 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 0, i32 0, i32 2, i32 2, i32 4, i32 4, i32 6, i32 6>
				%tmp4 = shufflevector <8 x i32> %tmp1, <8 x i32> undef, <8 x i32> <i32 1, i32 1, i32 3, i32 3, i32 5, i32 5, i32 7, i32 7>
				%tmp5 = add <8 x i32> %tmp3, %tmp4
				store <8 x i32> %tmp5, <8 x i32>* %a
				ret void
				}

				attributes #0 = { "target-features"="+sve" }