This is an archive of the discontinued LLVM Phabricator instance.

Differential D128642

[AArch64][SVE] Use SVE for VLS fcopysign for wide vectors
ClosedPublic

Authored by DavidTruby on Jun 27 2022, 7:02 AM.


Details

Reviewers

efriedma
paulwalker-arm
peterwaller-arm
bsmith
c-rhodes
dtemirbulatov
MattDevereau

Commits

rGb1b9c39629b5: [AArch64][SVE] Use SVE for VLS fcopysign for wide vectors

Summary

Currently fcopysign for VLS vectors lowers through NEON even when the
vector width is wider than a NEON vector, causing bad codegen as the
vectors are split. This patch causes SVE to be used for these vectors
instead, giving much better codegen on wide VLS vectors.
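As a rough illustration of what the lowering computes per lane, copysign can be written as a bitwise select on the sign bit. This is a minimal standalone sketch, not code from the patch; the helper names (`toBits`, `fromBits`, `copysignBits`, `copysignLanes`) are hypothetical. It mirrors the `and #0x7fffffff` / `and #0x80000000` / `orr` sequence that appears in the f32 CHECK lines of the test file.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a float as its 32-bit pattern (memcpy avoids aliasing issues).
std::uint32_t toBits(float F) {
  std::uint32_t B;
  std::memcpy(&B, &F, sizeof(B));
  return B;
}

// Reinterpret a 32-bit pattern as a float.
float fromBits(std::uint32_t B) {
  float F;
  std::memcpy(&F, &B, sizeof(F));
  return F;
}

// copysign(Mag, Sgn): keep the magnitude bits of Mag, take the sign bit of Sgn.
float copysignBits(float Mag, float Sgn) {
  const std::uint32_t SignMask = 0x80000000u;
  return fromBits((toBits(Mag) & ~SignMask) | (toBits(Sgn) & SignMask));
}

// Apply the same operation lane by lane, as one wide vector operation would.
void copysignLanes(float *A, const float *B, unsigned N) {
  for (unsigned I = 0; I < N; ++I)
    A[I] = copysignBits(A[I], B[I]);
}
```

A single wide operation over all lanes is what the patch aims for; splitting the vector into NEON-sized pieces repeats this sequence once per piece.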

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,360 ms	x64 debian > Clang.CodeGen/RISCV/rvv-intrinsics::vlsegff_mask.c
	60,510 ms	x64 debian > Clang.Driver::arm-cortex-cpus-1.c
	60,560 ms	x64 debian > Clang.Driver::arm-cortex-cpus-2.c
	60,130 ms	x64 debian > Clang.Driver::emit-reproducer.c
	60,560 ms	x64 debian > Clang.Driver::fsanitize.c
		View Full Test Results (11 Failed)

Event Timeline

DavidTruby created this revision. Jun 27 2022, 7:02 AM

Herald added a reviewer: efriedma. Jun 27 2022, 7:02 AM

Herald added a project: Restricted Project.

Herald added subscribers: ctetreau, psnobl, hiraditya and 2 others.

DavidTruby requested review of this revision. Jun 27 2022, 7:02 AM

Herald added a project: Restricted Project. Jun 27 2022, 7:02 AM

Herald added a subscriber: llvm-commits.

DavidTruby added reviewers: paulwalker-arm, peterwaller-arm, bsmith, c-rhodes, dtemirbulatov, MattDevereau. Jun 27 2022, 7:03 AM

FYI, if I add -mattr=+sve2 to your test arguments, I get:

LLVM ERROR: Cannot select: t17: v16i16 = AArch64ISD::BSP t43, t35, t32

Harbormaster completed remote builds in B172183: Diff 440205. Jun 27 2022, 7:52 AM

Fix expansion for VLS on SVE2

Harbormaster completed remote builds in B172248: Diff 440295. Jun 27 2022, 11:57 AM

When checking the output for SVE2 I see no difference, which means we're missing out on the BSL optimisation we get for scalable vectors. I think this is because you're handling the fixed->scalable lowering too late. I think you really need to edit LowerFCOPYSIGN to first convert the fixed length ISD::FCOPYSIGN to a scalable one, then let the existing scalable vector code decide how best to lower it.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1531	Rather than have this dangling there's a large ordered/sorted block further down.
18990	I think you mean `VT.isScalableVector()` here. However... Given this bug fix, it makes me wonder if the following code was ever exercised before this patch? Given my SVE2 comment, I think we can in fact keep the original code and just remove the `fixedSVEVectorVT` code?

Matt added a subscriber: Matt. Jun 28 2022, 2:03 PM

Rework patch to use VLA lowering for the VLS types.

Harbormaster completed remote builds in B172742: Diff 440991. Jun 29 2022, 8:09 AM

paulwalker-arm added inline comments. Jun 30 2022, 9:47 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7577–7584	This doesn't look safe with respect to the extend/rounding code just below. When faced with differing types, the results of both `convertToScalableVector` calls will be types of the same size, but their element counts will differ. For example, `fcopysign v8f64, v8f32` will result in `In1 = nxv2f64` and `In2 = nxv4f32`, which I doubt the remaining logic will handle properly. The most likely effect is a getNode assert firing for invalid operands. My guess is that you're not seeing this because `In1` and `In2` always have the same type, and indeed I couldn't immediately see a way to exercise this logic. I think this means your "mixtype" tests are likely exercising nothing new and are redundant. This is likely also true for your original patch, when you added the initial scalable vector support. If they are not exercising this code, as I suspect, then you either need to rewrite them or just remove them if there's no actual route to test this logic. Personally I think the safest route is to simply rewrite the fixed length fcopysign into a scalable vector one after any necessary extending/rounding of the input has taken place. For what it's worth, I also think the use of FP_EXTEND/FP_ROUND is not the most efficient way to get the sign bits to align, but that can be changed later.
18968–18970	Isn't this original code now fine, so that you instead just need to remove the following `// Don't expand for NEON` / `if (VT.isFixedLengthVector()) return SDValue();` block, because that is covered by the `!VT.isScalableVector()` check?
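The lane-count mismatch described in the 7577–7584 comment can be checked with a little arithmetic. The sketch below assumes a packed container holds one 128-bit granule's worth of elements at minimum, which matches the `nxv2f64` and `nxv4f32` types named in the comment; `packedMinLanes` and `lanesAgree` are hypothetical helpers, not LLVM functions.

```cpp
#include <cassert>

// Minimum lane count of a packed scalable container: one 128-bit granule.
constexpr unsigned packedMinLanes(unsigned EltBits) { return 128 / EltBits; }

// Two operands can feed one elementwise node only if their lane counts match.
constexpr bool lanesAgree(unsigned EltBitsA, unsigned EltBitsB) {
  return packedMinLanes(EltBitsA) == packedMinLanes(EltBitsB);
}

static_assert(packedMinLanes(64) == 2, "f64 elements give nxv2f64");
static_assert(packedMinLanes(32) == 4, "f32 elements give nxv4f32");
static_assert(!lanesAgree(64, 32), "mixed f64/f32 operands disagree");
```

Extending or rounding the second operand to the first operand's element type before the conversion makes both element widths equal, so the lane counts agree by construction.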

Move VLS handling after ROUND/EXTEND

DavidTruby added inline comments. Jul 4 2022, 3:52 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7577–7584	I believe I've corrected this now; I think you're right that the inputs will always be the same type anyway, though. I agree that it is safer to leave the handling in, just in case that does get triggered. I think it's better to leave the mixed type tests in as is: if something changes in future and the types coming into this function can differ, we want to make sure we don't regress in that case.

Looking generally good but I see some possible minor improvements/cleanup.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7591	Nit: Does `isFixedSVE` want to move down with the use?
7661	Is this line necessary or could it be pushed up? At a glance it appears it should already be an integer VT derived from VT. Same question for the VT assignment.

This revision is now accepted and ready to land. Jul 4 2022, 4:14 AM

Harbormaster completed remote builds in B173535: Diff 442061. Jul 4 2022, 4:49 AM

DavidTruby added inline comments. Jul 4 2022, 5:24 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7661	From 7593-4: VT and IntVT will be scalable containers for the fixed length vector types. Here we need to get the original VTs back.

Requesting changes to deal with the mixed-type combine/tests, since we have found a case where the types can be different.

This revision now requires changes to proceed. Jul 18 2022, 5:50 AM

Add flag to test FCOPYSIGN nodes with differing argument types.

This patch now depends on D130370 as a result.

Herald added a subscriber: ecnelises. Jul 28 2022, 5:40 AM

Harbormaster completed remote builds in B178059: Diff 448315. Jul 28 2022, 6:24 AM

DavidTruby added a parent revision: D130370: [llvm] Always use TargetConstant for FP_ROUND ISD Nodes. Jul 28 2022, 7:33 AM

paulwalker-arm added inline comments. Jul 28 2022, 4:30 PM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15405 ↗	(On Diff #448315)	What about `return EnableVectorFcopysignExtendRound;`?
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3597 ↗	(On Diff #448315)	By this point we know the result type is legal because results are legalised before operands. What's important here is the result type remains legal after splitting the operands. Given the result and first operands have the same type this means ensuring the types of `LHSLo` and `LHSHi` are legal after splitting. There's a function `GetSplitDestVTs` which returns the types expected from splitting. I mention this because I think it's better to query the expected types are legal before performing the actual splitting.

DavidTruby added inline comments. Jul 28 2022, 4:36 PM

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3597 ↗	(On Diff #448315)	Ah ok, I think I was considering this wrong. I thought that the result type of the concat (which is the result type of the original FCOPYSIGN) needed to be legal for us to do the transform. If that's already legal, is there a problem? Is there a case where splitting an already legal vector in two would make a vector illegal? (Genuine question, I'm not sure when this would pop up.) Or do we need RHSLo to be legal?

paulwalker-arm added inline comments. Jul 28 2022, 4:51 PM

llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3597 ↗	(On Diff #448315)	You can have multiple legal types for the same vector element type. For NEON `v4f32` and `v2f32` are legal. So it is possible for the result type to be legal and yet still be legal after splitting. Likewise `v1f32` is not legal for NEON and so it is possible to enter with a legal type that would become illegal when split. For the former case we can split the operation in two as you've done. For the latter we're better reverting to the original code path of calling `UnrollVector`. So generally what you've done is fine, it is just you're checking the wrong type (i.e. N's result type rather than the expected result type of the new `FCOPYSIGN` operations). Plus my comment that you probably want to use `GetSplitDestVTs` so you only call `SplitVector` for the cases that are safe.
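The decision described in this comment (query the type the halves would have, and only split when that type is legal) can be modelled with a toy type table. This is a hedged sketch and not the legalizer's code: `ToyVT`, `isLegal`, `splitHalfVT` and `chooseStrategy` are hypothetical names, and the legal set only models the NEON f32 cases mentioned above (`v4f32` and `v2f32` legal, `v1f32` not).

```cpp
#include <cassert>
#include <set>
#include <utility>

// A toy vector type: (lane count, element bits).
using ToyVT = std::pair<unsigned, unsigned>;

// Hypothetical legal-type table modelling NEON f32 vectors only.
bool isLegal(ToyVT VT) {
  static const std::set<ToyVT> Legal = {{2, 32}, {4, 32}};
  return Legal.count(VT) != 0;
}

// The type each half would have after splitting, computed without splitting.
ToyVT splitHalfVT(ToyVT VT) { return {VT.first / 2, VT.second}; }

enum class Strategy { Split, Unroll };

// Check the expected half type first; fall back to unrolling otherwise.
Strategy chooseStrategy(ToyVT ResultVT) {
  return isLegal(splitHalfVT(ResultVT)) ? Strategy::Split : Strategy::Unroll;
}
```

With this table, `chooseStrategy({4, 32})` selects `Split` because `v2f32` is legal, while `chooseStrategy({2, 32})` selects `Unroll` because `v1f32` is not.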

Fix validity check for FCOPYSIGN legalization

Harbormaster completed remote builds in B178292: Diff 448640. Jul 29 2022, 9:14 AM

paulwalker-arm added inline comments. Aug 1 2022, 10:14 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
138 ↗	(On Diff #448640)	Up to you but I think `EnableVectorFCopySignExtendRound` looks better.
140 ↗	(On Diff #448640)	for?
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3612 ↗	(On Diff #448640)	LHSLoVT?
3614 ↗	(On Diff #448640)	LHSHiVT?
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7588–7589	Not new, but can this be removed? It can never happen given the `SrcVT.bitsLT/SrcVT.bitsGT` code above.
7592	This can be assumed, plus `getContainerForFixedLengthVector` will ensure the type is legal anyway.
7659–7663	Bookending the fixed length lowering like this has pitfalls and can complicate the code. It's better to just rewrite the fixed length operations using scalable vector types and then let the scalable vector lowering handle any complexity. Towards the start of the function you can do:
	EVT ContainerVT = getContainerForFixedLengthVector(DAG, VT);
	In1 = convertToScalableVector(DAG, ContainerVT, In1);
	In2 = convertToScalableVector(DAG, ContainerVT, In2);
	Res = getNode(ISD::FCOPYSIGN, ContainerVT, In1, In2)
	return convertFromScalableVector(DAG, ContainerVT, Res);
	This way it doesn't matter how complicated the scalable vector lowering gets. Doing this also means you no longer need sve2-fixed-length-fcopysign.ll because there's nothing SVE2 special about the lowering code you've added (i.e. the original sve2-fcopysign.ll tests are good enough to protect that functionality).
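The shape of that suggestion (convert in, delegate to the scalable path, convert out) can be shown with a toy model outside LLVM. Everything here is an assumption made for illustration: `scalableCopysign` stands in for the existing scalable lowering, `toContainer` and `fromContainer` stand in for the conversion helpers, and none of these are the real SelectionDAG functions.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for the existing scalable lowering: works on any lane count.
std::vector<float> scalableCopysign(const std::vector<float> &A,
                                    const std::vector<float> &B) {
  assert(A.size() == B.size() && "operands must have matching lane counts");
  std::vector<float> R(A.size());
  for (std::size_t I = 0; I < A.size(); ++I)
    R[I] = std::copysign(A[I], B[I]);
  return R;
}

// Widen a fixed-length value into a container; extra lanes are don't-care.
template <std::size_t N>
std::vector<float> toContainer(const std::array<float, N> &V,
                               std::size_t ContainerLanes) {
  assert(ContainerLanes >= N && "container must hold every fixed lane");
  std::vector<float> C(ContainerLanes, 0.0f);
  std::copy(V.begin(), V.end(), C.begin());
  return C;
}

// Take the low N lanes back out of the container.
template <std::size_t N>
std::array<float, N> fromContainer(const std::vector<float> &C) {
  std::array<float, N> V{};
  std::copy_n(C.begin(), N, V.begin());
  return V;
}

// Fixed-length entry point: convert, delegate, convert back.
template <std::size_t N>
std::array<float, N> fixedCopysign(const std::array<float, N> &A,
                                   const std::array<float, N> &B,
                                   std::size_t ContainerLanes) {
  auto Res = scalableCopysign(toContainer(A, ContainerLanes),
                              toContainer(B, ContainerLanes));
  return fromContainer<N>(Res);
}
```

The fixed-length wrapper never needs to know how the scalable path is implemented, which is the property the reviewer is asking for.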

Changed fixed-length lowering to rely on scalable lowering.
Removed redundant code.

Harbormaster completed remote builds in B178742: Diff 449254. Aug 2 2022, 5:53 AM

Documentation for combiner-vector-fcopysign-extend-round needs updating but otherwise looks good.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
141 ↗	(On Diff #449254)	Please drop this part of the documentation. Although this is why you've added the flag, it is not the only reason somebody might want to use it (i.e. somebody might actually want to enable the optimisation).
llvm/lib/CodeGen/SelectionDAG/LegalizeVectorTypes.cpp
3615–3618 ↗	(On Diff #449254)	You could just `return DAG.getNode(...`.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7573	Bogus blank line.

peterwaller-arm accepted this revision. Aug 9 2022, 4:00 AM

This revision is now accepted and ready to land. Aug 9 2022, 4:00 AM

Closed by commit rGb1b9c39629b5: [AArch64][SVE] Use SVE for VLS fcopysign for wide vectors (authored by DavidTruby). Aug 10 2022, 3:17 AM

This revision was automatically updated to reflect the committed changes.

DavidTruby added a commit: rGb1b9c39629b5: [AArch64][SVE] Use SVE for VLS fcopysign for wide vectors.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	33 lines
llvm/test/CodeGen/AArch64/sve-fixed-length-fcopysign.ll	542 lines
llvm/test/CodeGen/AArch64/sve2-fixed-length-fcopysign.ll	530 lines

Diff 442061

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


(1,522 lines hidden)	void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
// By default everything must be expanded.		// By default everything must be expanded.
for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op)		for (unsigned Op = 0; Op < ISD::BUILTIN_OP_END; ++Op)
setOperationAction(Op, VT, Expand);		setOperationAction(Op, VT, Expand);

// We use EXTRACT_SUBVECTOR to "cast" a scalable vector to a fixed length one.		// We use EXTRACT_SUBVECTOR to "cast" a scalable vector to a fixed length one.
setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);		setOperationAction(ISD::EXTRACT_SUBVECTOR, VT, Custom);

if (VT.isFloatingPoint()) {		if (VT.isFloatingPoint()) {
setCondCodeAction(ISD::SETO, VT, Expand);		setCondCodeAction(ISD::SETO, VT, Expand);
setCondCodeAction(ISD::SETOLT, VT, Expand);		setCondCodeAction(ISD::SETOLT, VT, Expand);
setCondCodeAction(ISD::SETLT, VT, Expand);		setCondCodeAction(ISD::SETLT, VT, Expand);
setCondCodeAction(ISD::SETOLE, VT, Expand);		setCondCodeAction(ISD::SETOLE, VT, Expand);
setCondCodeAction(ISD::SETLE, VT, Expand);		setCondCodeAction(ISD::SETLE, VT, Expand);
setCondCodeAction(ISD::SETULT, VT, Expand);		setCondCodeAction(ISD::SETULT, VT, Expand);
setCondCodeAction(ISD::SETULE, VT, Expand);		setCondCodeAction(ISD::SETULE, VT, Expand);
setCondCodeAction(ISD::SETUGE, VT, Expand);		setCondCodeAction(ISD::SETUGE, VT, Expand);
setCondCodeAction(ISD::SETUGT, VT, Expand);		setCondCodeAction(ISD::SETUGT, VT, Expand);
(36 lines hidden)	void AArch64TargetLowering::addTypeForFixedLengthSVE(MVT VT) {
setOperationAction(ISD::CONCAT_VECTORS, VT, Custom);		setOperationAction(ISD::CONCAT_VECTORS, VT, Custom);
setOperationAction(ISD::CTLZ, VT, Custom);		setOperationAction(ISD::CTLZ, VT, Custom);
setOperationAction(ISD::CTPOP, VT, Custom);		setOperationAction(ISD::CTPOP, VT, Custom);
setOperationAction(ISD::CTTZ, VT, Custom);		setOperationAction(ISD::CTTZ, VT, Custom);
setOperationAction(ISD::FABS, VT, Custom);		setOperationAction(ISD::FABS, VT, Custom);
setOperationAction(ISD::FADD, VT, Custom);		setOperationAction(ISD::FADD, VT, Custom);
setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);		setOperationAction(ISD::EXTRACT_VECTOR_ELT, VT, Custom);
setOperationAction(ISD::FCEIL, VT, Custom);		setOperationAction(ISD::FCEIL, VT, Custom);
		setOperationAction(ISD::FCOPYSIGN, VT, Custom);
setOperationAction(ISD::FDIV, VT, Custom);		setOperationAction(ISD::FDIV, VT, Custom);
setOperationAction(ISD::FFLOOR, VT, Custom);		setOperationAction(ISD::FFLOOR, VT, Custom);
setOperationAction(ISD::FMA, VT, Custom);		setOperationAction(ISD::FMA, VT, Custom);
setOperationAction(ISD::FMAXIMUM, VT, Custom);		setOperationAction(ISD::FMAXIMUM, VT, Custom);
setOperationAction(ISD::FMAXNUM, VT, Custom);		setOperationAction(ISD::FMAXNUM, VT, Custom);
setOperationAction(ISD::FMINIMUM, VT, Custom);		setOperationAction(ISD::FMINIMUM, VT, Custom);
setOperationAction(ISD::FMINNUM, VT, Custom);		setOperationAction(ISD::FMINNUM, VT, Custom);
setOperationAction(ISD::FMUL, VT, Custom);		setOperationAction(ISD::FMUL, VT, Custom);
(5,972 lines hidden)	if (!Subtarget->hasNEON())
return SDValue();		return SDValue();

EVT VT = Op.getValueType();		EVT VT = Op.getValueType();
EVT IntVT = VT.changeTypeToInteger();		EVT IntVT = VT.changeTypeToInteger();
SDLoc DL(Op);		SDLoc DL(Op);

SDValue In1 = Op.getOperand(0);		SDValue In1 = Op.getOperand(0);
SDValue In2 = Op.getOperand(1);		SDValue In2 = Op.getOperand(1);

		const bool isFixedSVE =
		VT.isFixedLengthVector() && useSVEForFixedLengthVectorVT(VT);

EVT SrcVT = In2.getValueType();		EVT SrcVT = In2.getValueType();

if (SrcVT.bitsLT(VT))		if (SrcVT.bitsLT(VT))
In2 = DAG.getNode(ISD::FP_EXTEND, DL, VT, In2);		In2 = DAG.getNode(ISD::FP_EXTEND, DL, VT, In2);
else if (SrcVT.bitsGT(VT))		else if (SrcVT.bitsGT(VT))
In2 = DAG.getNode(ISD::FP_ROUND, DL, VT, In2, DAG.getIntPtrConstant(0, DL));		In2 = DAG.getNode(ISD::FP_ROUND, DL, VT, In2, DAG.getIntPtrConstant(0, DL));

if (VT.isScalableVector())		if (VT.isScalableVector())
IntVT =		IntVT =
getPackedSVEVectorVT(VT.getVectorElementType().changeTypeToInteger());		getPackedSVEVectorVT(VT.getVectorElementType().changeTypeToInteger());

if (VT != In2.getValueType())		if (VT != In2.getValueType())
return SDValue();		return SDValue();

		if (isFixedSVE) {
		assert(isTypeLegal(VT) && "Expected only legal fixed-width types");
		VT = getContainerForFixedLengthVector(DAG, VT);
		IntVT = getContainerForFixedLengthVector(DAG, IntVT);

		In1 = convertToScalableVector(DAG, VT, In1);
		In2 = convertToScalableVector(DAG, VT, In2);
		}


auto BitCast = [this](EVT VT, SDValue Op, SelectionDAG &DAG) {		auto BitCast = [this](EVT VT, SDValue Op, SelectionDAG &DAG) {
if (VT.isScalableVector())		if (VT.isScalableVector())
return getSVESafeBitCast(VT, Op, DAG);		return getSVESafeBitCast(VT, Op, DAG);

return DAG.getBitcast(VT, Op);		return DAG.getBitcast(VT, Op);
};		};

SDValue VecVal1, VecVal2;		SDValue VecVal1, VecVal2;
(42 lines hidden)	SDValue BSP =
DAG.getNode(AArch64ISD::BSP, DL, VecVT, SignMaskV, VecVal1, VecVal2);		DAG.getNode(AArch64ISD::BSP, DL, VecVT, SignMaskV, VecVal1, VecVal2);
if (VT == MVT::f16)		if (VT == MVT::f16)
return DAG.getTargetExtractSubreg(AArch64::hsub, DL, VT, BSP);		return DAG.getTargetExtractSubreg(AArch64::hsub, DL, VT, BSP);
if (VT == MVT::f32)		if (VT == MVT::f32)
return DAG.getTargetExtractSubreg(AArch64::ssub, DL, VT, BSP);		return DAG.getTargetExtractSubreg(AArch64::ssub, DL, VT, BSP);
if (VT == MVT::f64)		if (VT == MVT::f64)
return DAG.getTargetExtractSubreg(AArch64::dsub, DL, VT, BSP);		return DAG.getTargetExtractSubreg(AArch64::dsub, DL, VT, BSP);

		if (isFixedSVE) {
		VT = Op.getValueType();
		IntVT = VT.changeTypeToInteger();
		BSP = convertFromScalableVector(DAG, IntVT, BSP);
		}

return BitCast(VT, BSP, DAG);		return BitCast(VT, BSP, DAG);
}		}

SDValue AArch64TargetLowering::LowerCTPOP(SDValue Op, SelectionDAG &DAG) const {		SDValue AArch64TargetLowering::LowerCTPOP(SDValue Op, SelectionDAG &DAG) const {
if (DAG.getMachineFunction().getFunction().hasFnAttribute(		if (DAG.getMachineFunction().getFunction().hasFnAttribute(
Attribute::NoImplicitFloat))		Attribute::NoImplicitFloat))
return SDValue();		return SDValue();

(11,304 lines hidden)	DCI.CombineTo(N0.getNode(),
ExtLoad.getValue(1));		ExtLoad.getValue(1));
return SDValue(N, 0); // Return N so it doesn't get rechecked!		return SDValue(N, 0); // Return N so it doesn't get rechecked!
}		}

return SDValue();		return SDValue();
}		}

static SDValue performBSPExpandForSVE(SDNode *N, SelectionDAG &DAG,		static SDValue performBSPExpandForSVE(SDNode *N, SelectionDAG &DAG,
const AArch64Subtarget *Subtarget,		const AArch64Subtarget *Subtarget) {
bool fixedSVEVectorVT) {
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);

// Don't expand for SVE2		// Don't expand for NEON, SVE2 or SME
if (!VT.isScalableVector() || Subtarget->hasSVE2() || Subtarget->hasSME())		if (!VT.isScalableVector() || Subtarget->hasSVE2() || Subtarget->hasSME())
return SDValue();		return SDValue();

// Don't expand for NEON
if (VT.isFixedLengthVector() && !fixedSVEVectorVT)
return SDValue();

SDLoc DL(N);		SDLoc DL(N);

SDValue Mask = N->getOperand(0);		SDValue Mask = N->getOperand(0);
SDValue In1 = N->getOperand(1);		SDValue In1 = N->getOperand(1);
SDValue In2 = N->getOperand(2);		SDValue In2 = N->getOperand(2);

SDValue InvMask = DAG.getNOT(DL, Mask, VT);		SDValue InvMask = DAG.getNOT(DL, Mask, VT);
SDValue Sel = DAG.getNode(ISD::AND, DL, VT, Mask, In1);		SDValue Sel = DAG.getNode(ISD::AND, DL, VT, Mask, In1);
(114 lines hidden)	SDValue AArch64TargetLowering::PerformDAGCombine(SDNode *N,
case AArch64ISD::GLD1S_IMM_MERGE_ZERO:		case AArch64ISD::GLD1S_IMM_MERGE_ZERO:
return performGLD1Combine(N, DAG);		return performGLD1Combine(N, DAG);
case AArch64ISD::VASHR:		case AArch64ISD::VASHR:
case AArch64ISD::VLSHR:		case AArch64ISD::VLSHR:
return performVectorShiftCombine(N, *this, DCI);		return performVectorShiftCombine(N, *this, DCI);
case AArch64ISD::SUNPKLO:		case AArch64ISD::SUNPKLO:
return performSunpkloCombine(N, DAG);		return performSunpkloCombine(N, DAG);
case AArch64ISD::BSP:		case AArch64ISD::BSP:
return performBSPExpandForSVE(		return performBSPExpandForSVE(N, DAG, Subtarget);
N, DAG, Subtarget, useSVEForFixedLengthVectorVT(N->getValueType(0)));
case ISD::INSERT_VECTOR_ELT:		case ISD::INSERT_VECTOR_ELT:
return performInsertVectorEltCombine(N, DCI);		return performInsertVectorEltCombine(N, DCI);
case ISD::EXTRACT_VECTOR_ELT:		case ISD::EXTRACT_VECTOR_ELT:
return performExtractVectorEltCombine(N, DCI, Subtarget);		return performExtractVectorEltCombine(N, DCI, Subtarget);
case ISD::VECREDUCE_ADD:		case ISD::VECREDUCE_ADD:
return performVecReduceAddCombine(N, DCI.DAG, Subtarget);		return performVecReduceAddCombine(N, DCI.DAG, Subtarget);
case AArch64ISD::UADDV:		case AArch64ISD::UADDV:
return performUADDVCombine(N, DAG);		return performUADDVCombine(N, DAG);
(2,333 lines hidden)

llvm/test/CodeGen/AArch64/sve-fixed-length-fcopysign.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_GE_256
				; RUN: llc -aarch64-sve-vector-bits-min=512 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=2048 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_GE_512

				target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"

				target triple = "aarch64-unknown-linux-gnu"

				;============ f16

				define void @test_copysign_v4f16_v4f16(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f16_v4f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr d0, [x0]
				; CHECK-NEXT: mvni v2.4h, #128, lsl #8
				; CHECK-NEXT: ldr d1, [x1]
				; CHECK-NEXT: bif v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x half>, ptr %ap
				%b = load <4 x half>, ptr %bp
				%r = call <4 x half> @llvm.copysign.v4f16(<4 x half> %a, <4 x half> %b)
				store <4 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v8f16_v8f16(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v8f16_v8f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: mvni v2.8h, #128, lsl #8
				; CHECK-NEXT: bif v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <8 x half>, ptr %ap
				%b = load <8 x half>, ptr %bp
				%r = call <8 x half> @llvm.copysign.v8f16(<8 x half> %a, <8 x half> %b)
				store <8 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v16f16_v16f16(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v16f16_v16f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl16
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: and z1.h, z1.h, #0x8000
				; CHECK-NEXT: and z0.h, z0.h, #0x7fff
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <16 x half>, ptr %ap
				%b = load <16 x half>, ptr %bp
				%r = call <16 x half> @llvm.copysign.v16f16(<16 x half> %a, <16 x half> %b)
				store <16 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v32f16_v32f16(ptr %ap, ptr %bp) #0 {
				; VBITS_GE_256-LABEL: test_copysign_v32f16_v32f16:
				; VBITS_GE_256: // %bb.0:
				; VBITS_GE_256-NEXT: mov x8, #16
				; VBITS_GE_256-NEXT: ptrue p0.h, vl16
				; VBITS_GE_256-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; VBITS_GE_256-NEXT: ld1h { z1.h }, p0/z, [x0]
				; VBITS_GE_256-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; VBITS_GE_256-NEXT: ld1h { z3.h }, p0/z, [x1]
				; VBITS_GE_256-NEXT: and z0.h, z0.h, #0x7fff
				; VBITS_GE_256-NEXT: and z1.h, z1.h, #0x7fff
				; VBITS_GE_256-NEXT: and z2.h, z2.h, #0x8000
				; VBITS_GE_256-NEXT: and z3.h, z3.h, #0x8000
				; VBITS_GE_256-NEXT: orr z0.d, z0.d, z2.d
				; VBITS_GE_256-NEXT: orr z1.d, z1.d, z3.d
				; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
				; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
				; VBITS_GE_256-NEXT: ret
				;
				; VBITS_GE_512-LABEL: test_copysign_v32f16_v32f16:
				; VBITS_GE_512: // %bb.0:
				; VBITS_GE_512-NEXT: ptrue p0.h, vl32
				; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_GE_512-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_GE_512-NEXT: and z1.h, z1.h, #0x8000
				; VBITS_GE_512-NEXT: and z0.h, z0.h, #0x7fff
				; VBITS_GE_512-NEXT: orr z0.d, z0.d, z1.d
				; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_GE_512-NEXT: ret
				%a = load <32 x half>, ptr %ap
				%b = load <32 x half>, ptr %bp
				%r = call <32 x half> @llvm.copysign.v32f16(<32 x half> %a, <32 x half> %b)
				store <32 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v64f16_v64f16(ptr %ap, ptr %bp) vscale_range(8,0) #0 {
				; CHECK-LABEL: test_copysign_v64f16_v64f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl64
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: and z1.h, z1.h, #0x8000
				; CHECK-NEXT: and z0.h, z0.h, #0x7fff
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <64 x half>, ptr %ap
				%b = load <64 x half>, ptr %bp
				%r = call <64 x half> @llvm.copysign.v64f16(<64 x half> %a, <64 x half> %b)
				store <64 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v128f16_v128f16(ptr %ap, ptr %bp) vscale_range(16,0) #0 {
				; CHECK-LABEL: test_copysign_v128f16_v128f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl128
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: and z1.h, z1.h, #0x8000
				; CHECK-NEXT: and z0.h, z0.h, #0x7fff
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <128 x half>, ptr %ap
				%b = load <128 x half>, ptr %bp
				%r = call <128 x half> @llvm.copysign.v128f16(<128 x half> %a, <128 x half> %b)
				store <128 x half> %r, ptr %ap
				ret void
				}

				;============ f32

				define void @test_copysign_v2f32_v2f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v2f32_v2f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr d0, [x0]
				; CHECK-NEXT: mvni v2.2s, #128, lsl #24
				; CHECK-NEXT: ldr d1, [x1]
				; CHECK-NEXT: bif v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <2 x float>, ptr %ap
				%b = load <2 x float>, ptr %bp
				%r = call <2 x float> @llvm.copysign.v2f32(<2 x float> %a, <2 x float> %b)
				store <2 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v4f32_v4f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f32_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: mvni v2.4s, #128, lsl #24
				; CHECK-NEXT: bif v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x float>, ptr %ap
				%b = load <4 x float>, ptr %bp
				%r = call <4 x float> @llvm.copysign.v4f32(<4 x float> %a, <4 x float> %b)
				store <4 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v8f32_v8f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v8f32_v8f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: and z1.s, z1.s, #0x80000000
				; CHECK-NEXT: and z0.s, z0.s, #0x7fffffff
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <8 x float>, ptr %ap
				%b = load <8 x float>, ptr %bp
				%r = call <8 x float> @llvm.copysign.v8f32(<8 x float> %a, <8 x float> %b)
				store <8 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v16f32_v16f32(ptr %ap, ptr %bp) #0 {
				; VBITS_GE_256-LABEL: test_copysign_v16f32_v16f32:
				; VBITS_GE_256: // %bb.0:
				; VBITS_GE_256-NEXT: mov x8, #8
				; VBITS_GE_256-NEXT: ptrue p0.s, vl8
				; VBITS_GE_256-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; VBITS_GE_256-NEXT: ld1w { z1.s }, p0/z, [x0]
				; VBITS_GE_256-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; VBITS_GE_256-NEXT: ld1w { z3.s }, p0/z, [x1]
				; VBITS_GE_256-NEXT: and z0.s, z0.s, #0x7fffffff
				; VBITS_GE_256-NEXT: and z1.s, z1.s, #0x7fffffff
				; VBITS_GE_256-NEXT: and z2.s, z2.s, #0x80000000
				; VBITS_GE_256-NEXT: and z3.s, z3.s, #0x80000000
				; VBITS_GE_256-NEXT: orr z0.d, z0.d, z2.d
				; VBITS_GE_256-NEXT: orr z1.d, z1.d, z3.d
				; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
				; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
				; VBITS_GE_256-NEXT: ret
				;
				; VBITS_GE_512-LABEL: test_copysign_v16f32_v16f32:
				; VBITS_GE_512: // %bb.0:
				; VBITS_GE_512-NEXT: ptrue p0.s, vl16
				; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_GE_512-NEXT: ld1w { z1.s }, p0/z, [x1]
				; VBITS_GE_512-NEXT: and z1.s, z1.s, #0x80000000
				; VBITS_GE_512-NEXT: and z0.s, z0.s, #0x7fffffff
				; VBITS_GE_512-NEXT: orr z0.d, z0.d, z1.d
				; VBITS_GE_512-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_GE_512-NEXT: ret
				%a = load <16 x float>, ptr %ap
				%b = load <16 x float>, ptr %bp
				%r = call <16 x float> @llvm.copysign.v16f32(<16 x float> %a, <16 x float> %b)
				store <16 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v32f32_v32f32(ptr %ap, ptr %bp) vscale_range(8,0) #0 {
				; CHECK-LABEL: test_copysign_v32f32_v32f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl32
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: and z1.s, z1.s, #0x80000000
				; CHECK-NEXT: and z0.s, z0.s, #0x7fffffff
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <32 x float>, ptr %ap
				%b = load <32 x float>, ptr %bp
				%r = call <32 x float> @llvm.copysign.v32f32(<32 x float> %a, <32 x float> %b)
				store <32 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v64f32_v64f32(ptr %ap, ptr %bp) vscale_range(16,0) #0 {
				; CHECK-LABEL: test_copysign_v64f32_v64f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl64
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: and z1.s, z1.s, #0x80000000
				; CHECK-NEXT: and z0.s, z0.s, #0x7fffffff
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <64 x float>, ptr %ap
				%b = load <64 x float>, ptr %bp
				%r = call <64 x float> @llvm.copysign.v64f32(<64 x float> %a, <64 x float> %b)
				store <64 x float> %r, ptr %ap
				ret void
				}

				;============ f64

				define void @test_copysign_v2f64_v2f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v2f64_v2f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: movi v0.2d, #0xffffffffffffffff
				; CHECK-NEXT: ldr q1, [x0]
				; CHECK-NEXT: ldr q2, [x1]
				; CHECK-NEXT: fneg v0.2d, v0.2d
				; CHECK-NEXT: bsl v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <2 x double>, ptr %ap
				%b = load <2 x double>, ptr %bp
				%r = call <2 x double> @llvm.copysign.v2f64(<2 x double> %a, <2 x double> %b)
				store <2 x double> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v4f64_v4f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f64_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: and z1.d, z1.d, #0x8000000000000000
				; CHECK-NEXT: and z0.d, z0.d, #0x7fffffffffffffff
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x double>, ptr %ap
				%b = load <4 x double>, ptr %bp
				%r = call <4 x double> @llvm.copysign.v4f64(<4 x double> %a, <4 x double> %b)
				store <4 x double> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v8f64_v8f64(ptr %ap, ptr %bp) #0 {
				; VBITS_GE_256-LABEL: test_copysign_v8f64_v8f64:
				; VBITS_GE_256: // %bb.0:
				; VBITS_GE_256-NEXT: mov x8, #4
				; VBITS_GE_256-NEXT: ptrue p0.d, vl4
				; VBITS_GE_256-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; VBITS_GE_256-NEXT: ld1d { z1.d }, p0/z, [x0]
				; VBITS_GE_256-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; VBITS_GE_256-NEXT: ld1d { z3.d }, p0/z, [x1]
				; VBITS_GE_256-NEXT: and z0.d, z0.d, #0x7fffffffffffffff
				; VBITS_GE_256-NEXT: and z1.d, z1.d, #0x7fffffffffffffff
				; VBITS_GE_256-NEXT: and z2.d, z2.d, #0x8000000000000000
				; VBITS_GE_256-NEXT: and z3.d, z3.d, #0x8000000000000000
				; VBITS_GE_256-NEXT: orr z0.d, z0.d, z2.d
				; VBITS_GE_256-NEXT: orr z1.d, z1.d, z3.d
				; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
				; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
				; VBITS_GE_256-NEXT: ret
				;
				; VBITS_GE_512-LABEL: test_copysign_v8f64_v8f64:
				; VBITS_GE_512: // %bb.0:
				; VBITS_GE_512-NEXT: ptrue p0.d, vl8
				; VBITS_GE_512-NEXT: ld1d { z0.d }, p0/z, [x0]
				; VBITS_GE_512-NEXT: ld1d { z1.d }, p0/z, [x1]
				; VBITS_GE_512-NEXT: and z1.d, z1.d, #0x8000000000000000
				; VBITS_GE_512-NEXT: and z0.d, z0.d, #0x7fffffffffffffff
				; VBITS_GE_512-NEXT: orr z0.d, z0.d, z1.d
				; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x0]
				; VBITS_GE_512-NEXT: ret
				%a = load <8 x double>, ptr %ap
				%b = load <8 x double>, ptr %bp
				%r = call <8 x double> @llvm.copysign.v8f64(<8 x double> %a, <8 x double> %b)
				store <8 x double> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v16f64_v16f64(ptr %ap, ptr %bp) vscale_range(8,0) #0 {
				; CHECK-LABEL: test_copysign_v16f64_v16f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl16
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: and z1.d, z1.d, #0x8000000000000000
				; CHECK-NEXT: and z0.d, z0.d, #0x7fffffffffffffff
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <16 x double>, ptr %ap
				%b = load <16 x double>, ptr %bp
				%r = call <16 x double> @llvm.copysign.v16f64(<16 x double> %a, <16 x double> %b)
				store <16 x double> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v32f64_v32f64(ptr %ap, ptr %bp) vscale_range(16,0) #0 {
				; CHECK-LABEL: test_copysign_v32f64_v32f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl32
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: and z1.d, z1.d, #0x8000000000000000
				; CHECK-NEXT: and z0.d, z0.d, #0x7fffffffffffffff
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <32 x double>, ptr %ap
				%b = load <32 x double>, ptr %bp
				%r = call <32 x double> @llvm.copysign.v32f64(<32 x double> %a, <32 x double> %b)
				store <32 x double> %r, ptr %ap
				ret void
				}

				;============ v2f32

				define void @test_copysign_v2f32_v2f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v2f32_v2f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x1]
				; CHECK-NEXT: mvni v2.2s, #128, lsl #24
				; CHECK-NEXT: ldr d1, [x0]
				; CHECK-NEXT: fcvtn v0.2s, v0.2d
				; CHECK-NEXT: bit v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <2 x float>, ptr %ap
				%b = load <2 x double>, ptr %bp
				%tmp0 = fptrunc <2 x double> %b to <2 x float>
				%r = call <2 x float> @llvm.copysign.v2f32(<2 x float> %a, <2 x float> %tmp0)
				store <2 x float> %r, ptr %ap
				ret void
				}

				;============ v4f32

				; SplitVecOp #1
				define void @test_copysign_v4f32_v4f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f32_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mvni v2.4s, #128, lsl #24
				; CHECK-NEXT: fcvt z1.s, p0/m, z1.d
				; CHECK-NEXT: uzp1 z1.s, z1.s, z1.s
				; CHECK-NEXT: bif v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x float>, ptr %ap
				%b = load <4 x double>, ptr %bp
				%tmp0 = fptrunc <4 x double> %b to <4 x float>
				%r = call <4 x float> @llvm.copysign.v4f32(<4 x float> %a, <4 x float> %tmp0)
				store <4 x float> %r, ptr %ap
				ret void
				}

				;============ v2f64

				define void @test_copysign_v2f64_v2f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v2f64_v2f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: movi v0.2d, #0xffffffffffffffff
				; CHECK-NEXT: ldr d1, [x1]
				; CHECK-NEXT: ldr q2, [x0]
				; CHECK-NEXT: fcvtl v1.2d, v1.2s
				; CHECK-NEXT: fneg v0.2d, v0.2d
				; CHECK-NEXT: bsl v0.16b, v2.16b, v1.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <2 x double>, ptr %ap
				%b = load < 2 x float>, ptr %bp
				%tmp0 = fpext <2 x float> %b to <2 x double>
				%r = call <2 x double> @llvm.copysign.v2f64(<2 x double> %a, <2 x double> %tmp0)
				store <2 x double> %r, ptr %ap
				ret void
				}

				;============ v4f64

				; SplitVecRes mismatched
				define void @test_copysign_v4f64_v4f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f64_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.d }, p0/z, [x1]
				; CHECK-NEXT: fcvt z1.d, p0/m, z1.s
				; CHECK-NEXT: and z0.d, z0.d, #0x7fffffffffffffff
				; CHECK-NEXT: and z1.d, z1.d, #0x8000000000000000
				; CHECK-NEXT: orr z0.d, z0.d, z1.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x double>, ptr %ap
				%b = load <4 x float>, ptr %bp
				%tmp0 = fpext <4 x float> %b to <4 x double>
				%r = call <4 x double> @llvm.copysign.v4f64(<4 x double> %a, <4 x double> %tmp0)
				store <4 x double> %r, ptr %ap
				ret void
				}

				;============ v4f16

				define void @test_copysign_v4f16_v4f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f16_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x1]
				; CHECK-NEXT: mvni v2.4h, #128, lsl #8
				; CHECK-NEXT: ldr d1, [x0]
				; CHECK-NEXT: fcvtn v0.4h, v0.4s
				; CHECK-NEXT: bit v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x half>, ptr %ap
				%b = load <4 x float>, ptr %bp
				%tmp0 = fptrunc <4 x float> %b to <4 x half>
				%r = call <4 x half> @llvm.copysign.v4f16(<4 x half> %a, <4 x half> %tmp0)
				store <4 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v4f16_v4f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f16_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ldr d0, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mvni v2.4h, #128, lsl #8
				; CHECK-NEXT: fcvt z1.h, p0/m, z1.d
				; CHECK-NEXT: uzp1 z1.s, z1.s, z1.s
				; CHECK-NEXT: uzp1 z1.h, z1.h, z1.h
				; CHECK-NEXT: bif v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x half>, ptr %ap
				%b = load <4 x double>, ptr %bp
				%tmp0 = fptrunc <4 x double> %b to <4 x half>
				%r = call <4 x half> @llvm.copysign.v4f16(<4 x half> %a, <4 x half> %tmp0)
				store <4 x half> %r, ptr %ap
				ret void
				}

				declare <4 x half> @llvm.copysign.v4f16(<4 x half> %a, <4 x half> %b) #0

				;============ v8f16


				define void @test_copysign_v8f16_v8f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v8f16_v8f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mvni v2.8h, #128, lsl #8
				; CHECK-NEXT: fcvt z1.h, p0/m, z1.s
				; CHECK-NEXT: uzp1 z1.h, z1.h, z1.h
				; CHECK-NEXT: bif v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <8 x half>, ptr %ap
				%b = load <8 x float>, ptr %bp
				%tmp0 = fptrunc <8 x float> %b to <8 x half>
				%r = call <8 x half> @llvm.copysign.v8f16(<8 x half> %a, <8 x half> %tmp0)
				store <8 x half> %r, ptr %ap
				ret void
				}

				declare <8 x half> @llvm.copysign.v8f16(<8 x half> %a, <8 x half> %b) #0
				declare <16 x half> @llvm.copysign.v16f16(<16 x half> %a, <16 x half> %b) #0
				declare <32 x half> @llvm.copysign.v32f16(<32 x half> %a, <32 x half> %b) #0
				declare <64 x half> @llvm.copysign.v64f16(<64 x half> %a, <64 x half> %b) #0
				declare <128 x half> @llvm.copysign.v128f16(<128 x half> %a, <128 x half> %b) #0

				declare <2 x float> @llvm.copysign.v2f32(<2 x float> %a, <2 x float> %b) #0
				declare <4 x float> @llvm.copysign.v4f32(<4 x float> %a, <4 x float> %b) #0
				declare <8 x float> @llvm.copysign.v8f32(<8 x float> %a, <8 x float> %b) #0
				declare <16 x float> @llvm.copysign.v16f32(<16 x float> %a, <16 x float> %b) #0
				declare <32 x float> @llvm.copysign.v32f32(<32 x float> %a, <32 x float> %b) #0
				declare <64 x float> @llvm.copysign.v64f32(<64 x float> %a, <64 x float> %b) #0

				declare <2 x double> @llvm.copysign.v2f64(<2 x double> %a, <2 x double> %b) #0
				declare <4 x double> @llvm.copysign.v4f64(<4 x double> %a, <4 x double> %b) #0
				declare <8 x double> @llvm.copysign.v8f64(<8 x double> %a, <8 x double> %b) #0
				declare <16 x double> @llvm.copysign.v16f64(<16 x double> %a, <16 x double> %b) #0
				declare <32 x double> @llvm.copysign.v32f64(<32 x double> %a, <32 x double> %b) #0

				attributes #0 = { "target-features"="+sve" }
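
For readers comparing the two test files, the check lines reduce to one bit-level identity per lane. The following is a minimal C++ sketch, not code from the patch: it models a single f16 lane as a raw 16-bit integer. The helper names (`copysign_and_orr`, `bitwise_select`, `copysign_bsl`) are hypothetical and chosen only for illustration. The first function mirrors the `and`/`and`/`orr` sequence checked in the `+sve` file, and the second mirrors the single `bsl` with a `0x7fff` mask (`mov w8, #32767`) checked in the sve2 file.

```cpp
#include <cstdint>

// Masks for one IEEE-754 half-precision lane held as raw bits.
constexpr std::uint16_t kSignMask = 0x8000; // sign bit only
constexpr std::uint16_t kMagMask = 0x7fff;  // exponent and mantissa bits

// Models the SVE sequence: and #0x7fff, and #0x8000, orr.
constexpr std::uint16_t copysign_and_orr(std::uint16_t mag, std::uint16_t sign) {
  return static_cast<std::uint16_t>((mag & kMagMask) | (sign & kSignMask));
}

// Models a bitwise select: take bits of `a` where `mask` is 1, else bits of `b`.
constexpr std::uint16_t bitwise_select(std::uint16_t a, std::uint16_t b,
                                       std::uint16_t mask) {
  return static_cast<std::uint16_t>((a & mask) | (b & ~mask));
}

// Models the SVE2 sequence: one select with the magnitude mask.
constexpr std::uint16_t copysign_bsl(std::uint16_t mag, std::uint16_t sign) {
  return bitwise_select(mag, sign, kMagMask);
}
```

Both helpers compute the same value for every input, which is why the SVE2 output can replace two `and` instructions and an `orr` with one `bsl` plus a mask materialization. For example, taking the magnitude of 1.0 (`0x3c00`) and the sign of -2.0 (`0xc000`) yields -1.0 (`0xbc00`) either way.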

llvm/test/CodeGen/AArch64/sve2-fixed-length-fcopysign.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_GE_256
				; RUN: llc -aarch64-sve-vector-bits-min=512 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_GE_512
				; RUN: llc -aarch64-sve-vector-bits-min=2048 < %s | FileCheck %s -check-prefixes=CHECK,VBITS_GE_512

				target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"

				target triple = "aarch64-unknown-linux-gnu"

				;============ f16

				define void @test_copysign_v4f16_v4f16(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f16_v4f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr d0, [x0]
				; CHECK-NEXT: mvni v2.4h, #128, lsl #8
				; CHECK-NEXT: ldr d1, [x1]
				; CHECK-NEXT: bif v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x half>, ptr %ap
				%b = load <4 x half>, ptr %bp
				%r = call <4 x half> @llvm.copysign.v4f16(<4 x half> %a, <4 x half> %b)
				store <4 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v8f16_v8f16(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v8f16_v8f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: mvni v2.8h, #128, lsl #8
				; CHECK-NEXT: bif v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <8 x half>, ptr %ap
				%b = load <8 x half>, ptr %bp
				%r = call <8 x half> @llvm.copysign.v8f16(<8 x half> %a, <8 x half> %b)
				store <8 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v16f16_v16f16(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v16f16_v16f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl16
				; CHECK-NEXT: mov w8, #32767
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <16 x half>, ptr %ap
				%b = load <16 x half>, ptr %bp
				%r = call <16 x half> @llvm.copysign.v16f16(<16 x half> %a, <16 x half> %b)
				store <16 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v32f16_v32f16(ptr %ap, ptr %bp) #0 {
				; VBITS_GE_256-LABEL: test_copysign_v32f16_v32f16:
				; VBITS_GE_256: // %bb.0:
				; VBITS_GE_256-NEXT: mov x8, #16
				; VBITS_GE_256-NEXT: ptrue p0.h, vl16
				; VBITS_GE_256-NEXT: mov w9, #32767
				; VBITS_GE_256-NEXT: ld1h { z0.h }, p0/z, [x0, x8, lsl #1]
				; VBITS_GE_256-NEXT: ld1h { z1.h }, p0/z, [x0]
				; VBITS_GE_256-NEXT: ld1h { z2.h }, p0/z, [x1, x8, lsl #1]
				; VBITS_GE_256-NEXT: ld1h { z3.h }, p0/z, [x1]
				; VBITS_GE_256-NEXT: mov z4.h, w9
				; VBITS_GE_256-NEXT: bsl z0.d, z0.d, z2.d, z4.d
				; VBITS_GE_256-NEXT: bsl z1.d, z1.d, z3.d, z4.d
				; VBITS_GE_256-NEXT: st1h { z0.h }, p0, [x0, x8, lsl #1]
				; VBITS_GE_256-NEXT: st1h { z1.h }, p0, [x0]
				; VBITS_GE_256-NEXT: ret
				;
				; VBITS_GE_512-LABEL: test_copysign_v32f16_v32f16:
				; VBITS_GE_512: // %bb.0:
				; VBITS_GE_512-NEXT: ptrue p0.h, vl32
				; VBITS_GE_512-NEXT: mov w8, #32767
				; VBITS_GE_512-NEXT: ld1h { z0.h }, p0/z, [x0]
				; VBITS_GE_512-NEXT: ld1h { z1.h }, p0/z, [x1]
				; VBITS_GE_512-NEXT: mov z2.h, w8
				; VBITS_GE_512-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; VBITS_GE_512-NEXT: st1h { z0.h }, p0, [x0]
				; VBITS_GE_512-NEXT: ret
				%a = load <32 x half>, ptr %ap
				%b = load <32 x half>, ptr %bp
				%r = call <32 x half> @llvm.copysign.v32f16(<32 x half> %a, <32 x half> %b)
				store <32 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v64f16_v64f16(ptr %ap, ptr %bp) vscale_range(8,0) #0 {
				; CHECK-LABEL: test_copysign_v64f16_v64f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl64
				; CHECK-NEXT: mov w8, #32767
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <64 x half>, ptr %ap
				%b = load <64 x half>, ptr %bp
				%r = call <64 x half> @llvm.copysign.v64f16(<64 x half> %a, <64 x half> %b)
				store <64 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v128f16_v128f16(ptr %ap, ptr %bp) vscale_range(16,0) #0 {
				; CHECK-LABEL: test_copysign_v128f16_v128f16:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, vl128
				; CHECK-NEXT: mov w8, #32767
				; CHECK-NEXT: ld1h { z0.h }, p0/z, [x0]
				; CHECK-NEXT: ld1h { z1.h }, p0/z, [x1]
				; CHECK-NEXT: mov z2.h, w8
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1h { z0.h }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <128 x half>, ptr %ap
				%b = load <128 x half>, ptr %bp
				%r = call <128 x half> @llvm.copysign.v128f16(<128 x half> %a, <128 x half> %b)
				store <128 x half> %r, ptr %ap
				ret void
				}

				;============ f32

				define void @test_copysign_v2f32_v2f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v2f32_v2f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr d0, [x0]
				; CHECK-NEXT: mvni v2.2s, #128, lsl #24
				; CHECK-NEXT: ldr d1, [x1]
				; CHECK-NEXT: bif v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <2 x float>, ptr %ap
				%b = load <2 x float>, ptr %bp
				%r = call <2 x float> @llvm.copysign.v2f32(<2 x float> %a, <2 x float> %b)
				store <2 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v4f32_v4f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f32_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ldr q1, [x1]
				; CHECK-NEXT: mvni v2.4s, #128, lsl #24
				; CHECK-NEXT: bif v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x float>, ptr %ap
				%b = load <4 x float>, ptr %bp
				%r = call <4 x float> @llvm.copysign.v4f32(<4 x float> %a, <4 x float> %b)
				store <4 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v8f32_v8f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v8f32_v8f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: mov w8, #2147483647
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: mov z2.s, w8
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <8 x float>, ptr %ap
				%b = load <8 x float>, ptr %bp
				%r = call <8 x float> @llvm.copysign.v8f32(<8 x float> %a, <8 x float> %b)
				store <8 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v16f32_v16f32(ptr %ap, ptr %bp) #0 {
				; VBITS_GE_256-LABEL: test_copysign_v16f32_v16f32:
				; VBITS_GE_256: // %bb.0:
				; VBITS_GE_256-NEXT: mov x8, #8
				; VBITS_GE_256-NEXT: ptrue p0.s, vl8
				; VBITS_GE_256-NEXT: mov w9, #2147483647
				; VBITS_GE_256-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; VBITS_GE_256-NEXT: ld1w { z1.s }, p0/z, [x0]
				; VBITS_GE_256-NEXT: ld1w { z2.s }, p0/z, [x1, x8, lsl #2]
				; VBITS_GE_256-NEXT: ld1w { z3.s }, p0/z, [x1]
				; VBITS_GE_256-NEXT: mov z4.s, w9
				; VBITS_GE_256-NEXT: bsl z0.d, z0.d, z2.d, z4.d
				; VBITS_GE_256-NEXT: bsl z1.d, z1.d, z3.d, z4.d
				; VBITS_GE_256-NEXT: st1w { z0.s }, p0, [x0, x8, lsl #2]
				; VBITS_GE_256-NEXT: st1w { z1.s }, p0, [x0]
				; VBITS_GE_256-NEXT: ret
				;
				; VBITS_GE_512-LABEL: test_copysign_v16f32_v16f32:
				; VBITS_GE_512: // %bb.0:
				; VBITS_GE_512-NEXT: ptrue p0.s, vl16
				; VBITS_GE_512-NEXT: mov w8, #2147483647
				; VBITS_GE_512-NEXT: ld1w { z0.s }, p0/z, [x0]
				; VBITS_GE_512-NEXT: ld1w { z1.s }, p0/z, [x1]
				; VBITS_GE_512-NEXT: mov z2.s, w8
				; VBITS_GE_512-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; VBITS_GE_512-NEXT: st1w { z0.s }, p0, [x0]
				; VBITS_GE_512-NEXT: ret
				%a = load <16 x float>, ptr %ap
				%b = load <16 x float>, ptr %bp
				%r = call <16 x float> @llvm.copysign.v16f32(<16 x float> %a, <16 x float> %b)
				store <16 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v32f32_v32f32(ptr %ap, ptr %bp) vscale_range(8,0) #0 {
				; CHECK-LABEL: test_copysign_v32f32_v32f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl32
				; CHECK-NEXT: mov w8, #2147483647
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: mov z2.s, w8
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <32 x float>, ptr %ap
				%b = load <32 x float>, ptr %bp
				%r = call <32 x float> @llvm.copysign.v32f32(<32 x float> %a, <32 x float> %b)
				store <32 x float> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v64f32_v64f32(ptr %ap, ptr %bp) vscale_range(16,0) #0 {
				; CHECK-LABEL: test_copysign_v64f32_v64f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl64
				; CHECK-NEXT: mov w8, #2147483647
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: mov z2.s, w8
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1w { z0.s }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <64 x float>, ptr %ap
				%b = load <64 x float>, ptr %bp
				%r = call <64 x float> @llvm.copysign.v64f32(<64 x float> %a, <64 x float> %b)
				store <64 x float> %r, ptr %ap
				ret void
				}

				;============ f64

				define void @test_copysign_v2f64_v2f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v2f64_v2f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: movi v0.2d, #0xffffffffffffffff
				; CHECK-NEXT: ldr q1, [x0]
				; CHECK-NEXT: ldr q2, [x1]
				; CHECK-NEXT: fneg v0.2d, v0.2d
				; CHECK-NEXT: bsl v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <2 x double>, ptr %ap
				%b = load <2 x double>, ptr %bp
				%r = call <2 x double> @llvm.copysign.v2f64(<2 x double> %a, <2 x double> %b)
				store <2 x double> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v4f64_v4f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f64_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: mov z2.d, #0x7fffffffffffffff
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x double>, ptr %ap
				%b = load <4 x double>, ptr %bp
				%r = call <4 x double> @llvm.copysign.v4f64(<4 x double> %a, <4 x double> %b)
				store <4 x double> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v8f64_v8f64(ptr %ap, ptr %bp) #0 {
				; VBITS_GE_256-LABEL: test_copysign_v8f64_v8f64:
				; VBITS_GE_256: // %bb.0:
				; VBITS_GE_256-NEXT: mov x8, #4
				; VBITS_GE_256-NEXT: ptrue p0.d, vl4
				; VBITS_GE_256-NEXT: mov z4.d, #0x7fffffffffffffff
				; VBITS_GE_256-NEXT: ld1d { z0.d }, p0/z, [x0, x8, lsl #3]
				; VBITS_GE_256-NEXT: ld1d { z1.d }, p0/z, [x0]
				; VBITS_GE_256-NEXT: ld1d { z2.d }, p0/z, [x1, x8, lsl #3]
				; VBITS_GE_256-NEXT: ld1d { z3.d }, p0/z, [x1]
				; VBITS_GE_256-NEXT: bsl z0.d, z0.d, z2.d, z4.d
				; VBITS_GE_256-NEXT: bsl z1.d, z1.d, z3.d, z4.d
				; VBITS_GE_256-NEXT: st1d { z0.d }, p0, [x0, x8, lsl #3]
				; VBITS_GE_256-NEXT: st1d { z1.d }, p0, [x0]
				; VBITS_GE_256-NEXT: ret
				;
				; VBITS_GE_512-LABEL: test_copysign_v8f64_v8f64:
				; VBITS_GE_512: // %bb.0:
				; VBITS_GE_512-NEXT: ptrue p0.d, vl8
				; VBITS_GE_512-NEXT: mov z2.d, #0x7fffffffffffffff
				; VBITS_GE_512-NEXT: ld1d { z0.d }, p0/z, [x0]
				; VBITS_GE_512-NEXT: ld1d { z1.d }, p0/z, [x1]
				; VBITS_GE_512-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; VBITS_GE_512-NEXT: st1d { z0.d }, p0, [x0]
				; VBITS_GE_512-NEXT: ret
				%a = load <8 x double>, ptr %ap
				%b = load <8 x double>, ptr %bp
				%r = call <8 x double> @llvm.copysign.v8f64(<8 x double> %a, <8 x double> %b)
				store <8 x double> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v16f64_v16f64(ptr %ap, ptr %bp) vscale_range(8,0) #0 {
				; CHECK-LABEL: test_copysign_v16f64_v16f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl16
				; CHECK-NEXT: mov z2.d, #0x7fffffffffffffff
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <16 x double>, ptr %ap
				%b = load <16 x double>, ptr %bp
				%r = call <16 x double> @llvm.copysign.v16f64(<16 x double> %a, <16 x double> %b)
				store <16 x double> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v32f64_v32f64(ptr %ap, ptr %bp) vscale_range(16,0) #0 {
				; CHECK-LABEL: test_copysign_v32f64_v32f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl32
				; CHECK-NEXT: mov z2.d, #0x7fffffffffffffff
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <32 x double>, ptr %ap
				%b = load <32 x double>, ptr %bp
				%r = call <32 x double> @llvm.copysign.v32f64(<32 x double> %a, <32 x double> %b)
				store <32 x double> %r, ptr %ap
				ret void
				}

				;============ v2f32

				define void @test_copysign_v2f32_v2f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v2f32_v2f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x1]
				; CHECK-NEXT: mvni v2.2s, #128, lsl #24
				; CHECK-NEXT: ldr d1, [x0]
				; CHECK-NEXT: fcvtn v0.2s, v0.2d
				; CHECK-NEXT: bit v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <2 x float>, ptr %ap
				%b = load <2 x double>, ptr %bp
				%tmp0 = fptrunc <2 x double> %b to <2 x float>
				%r = call <2 x float> @llvm.copysign.v2f32(<2 x float> %a, <2 x float> %tmp0)
				store <2 x float> %r, ptr %ap
				ret void
				}

				;============ v4f32

				; SplitVecOp #1
				define void @test_copysign_v4f32_v4f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f32_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mvni v2.4s, #128, lsl #24
				; CHECK-NEXT: fcvt z1.s, p0/m, z1.d
				; CHECK-NEXT: uzp1 z1.s, z1.s, z1.s
				; CHECK-NEXT: bif v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x float>, ptr %ap
				%b = load <4 x double>, ptr %bp
				%tmp0 = fptrunc <4 x double> %b to <4 x float>
				%r = call <4 x float> @llvm.copysign.v4f32(<4 x float> %a, <4 x float> %tmp0)
				store <4 x float> %r, ptr %ap
				ret void
				}

				;============ v2f64

				define void @test_copysign_v2f64_v2f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v2f64_v2f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: movi v0.2d, #0xffffffffffffffff
				; CHECK-NEXT: ldr d1, [x1]
				; CHECK-NEXT: ldr q2, [x0]
				; CHECK-NEXT: fcvtl v1.2d, v1.2s
				; CHECK-NEXT: fneg v0.2d, v0.2d
				; CHECK-NEXT: bsl v0.16b, v2.16b, v1.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <2 x double>, ptr %ap
				%b = load <2 x float>, ptr %bp
				%tmp0 = fpext <2 x float> %b to <2 x double>
				%r = call <2 x double> @llvm.copysign.v2f64(<2 x double> %a, <2 x double> %tmp0)
				store <2 x double> %r, ptr %ap
				ret void
				}

				;============ v4f64

				; SplitVecRes mismatched
				define void @test_copysign_v4f64_v4f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f64_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: mov z2.d, #0x7fffffffffffffff
				; CHECK-NEXT: ld1d { z0.d }, p0/z, [x0]
				; CHECK-NEXT: ld1w { z1.d }, p0/z, [x1]
				; CHECK-NEXT: fcvt z1.d, p0/m, z1.s
				; CHECK-NEXT: bsl z0.d, z0.d, z1.d, z2.d
				; CHECK-NEXT: st1d { z0.d }, p0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x double>, ptr %ap
				%b = load <4 x float>, ptr %bp
				%tmp0 = fpext <4 x float> %b to <4 x double>
				%r = call <4 x double> @llvm.copysign.v4f64(<4 x double> %a, <4 x double> %tmp0)
				store <4 x double> %r, ptr %ap
				ret void
				}

				;============ v4f16

				define void @test_copysign_v4f16_v4f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f16_v4f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ldr q0, [x1]
				; CHECK-NEXT: mvni v2.4h, #128, lsl #8
				; CHECK-NEXT: ldr d1, [x0]
				; CHECK-NEXT: fcvtn v0.4h, v0.4s
				; CHECK-NEXT: bit v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x half>, ptr %ap
				%b = load <4 x float>, ptr %bp
				%tmp0 = fptrunc <4 x float> %b to <4 x half>
				%r = call <4 x half> @llvm.copysign.v4f16(<4 x half> %a, <4 x half> %tmp0)
				store <4 x half> %r, ptr %ap
				ret void
				}

				define void @test_copysign_v4f16_v4f64(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v4f16_v4f64:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d, vl4
				; CHECK-NEXT: ldr d0, [x0]
				; CHECK-NEXT: ld1d { z1.d }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mvni v2.4h, #128, lsl #8
				; CHECK-NEXT: fcvt z1.h, p0/m, z1.d
				; CHECK-NEXT: uzp1 z1.s, z1.s, z1.s
				; CHECK-NEXT: uzp1 z1.h, z1.h, z1.h
				; CHECK-NEXT: bif v0.8b, v1.8b, v2.8b
				; CHECK-NEXT: str d0, [x0]
				; CHECK-NEXT: ret
				%a = load <4 x half>, ptr %ap
				%b = load <4 x double>, ptr %bp
				%tmp0 = fptrunc <4 x double> %b to <4 x half>
				%r = call <4 x half> @llvm.copysign.v4f16(<4 x half> %a, <4 x half> %tmp0)
				store <4 x half> %r, ptr %ap
				ret void
				}

				declare <4 x half> @llvm.copysign.v4f16(<4 x half> %a, <4 x half> %b) #0

				;============ v8f16


				define void @test_copysign_v8f16_v8f32(ptr %ap, ptr %bp) vscale_range(2,0) #0 {
				; CHECK-LABEL: test_copysign_v8f16_v8f32:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ldr q0, [x0]
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x1]
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mvni v2.8h, #128, lsl #8
				; CHECK-NEXT: fcvt z1.h, p0/m, z1.s
				; CHECK-NEXT: uzp1 z1.h, z1.h, z1.h
				; CHECK-NEXT: bif v0.16b, v1.16b, v2.16b
				; CHECK-NEXT: str q0, [x0]
				; CHECK-NEXT: ret
				%a = load <8 x half>, ptr %ap
				%b = load <8 x float>, ptr %bp
				%tmp0 = fptrunc <8 x float> %b to <8 x half>
				%r = call <8 x half> @llvm.copysign.v8f16(<8 x half> %a, <8 x half> %tmp0)
				store <8 x half> %r, ptr %ap
				ret void
				}

				declare <8 x half> @llvm.copysign.v8f16(<8 x half> %a, <8 x half> %b) #0
				declare <16 x half> @llvm.copysign.v16f16(<16 x half> %a, <16 x half> %b) #0
				declare <32 x half> @llvm.copysign.v32f16(<32 x half> %a, <32 x half> %b) #0
				declare <64 x half> @llvm.copysign.v64f16(<64 x half> %a, <64 x half> %b) #0
				declare <128 x half> @llvm.copysign.v128f16(<128 x half> %a, <128 x half> %b) #0

				declare <2 x float> @llvm.copysign.v2f32(<2 x float> %a, <2 x float> %b) #0
				declare <4 x float> @llvm.copysign.v4f32(<4 x float> %a, <4 x float> %b) #0
				declare <8 x float> @llvm.copysign.v8f32(<8 x float> %a, <8 x float> %b) #0
				declare <16 x float> @llvm.copysign.v16f32(<16 x float> %a, <16 x float> %b) #0
				declare <32 x float> @llvm.copysign.v32f32(<32 x float> %a, <32 x float> %b) #0
				declare <64 x float> @llvm.copysign.v64f32(<64 x float> %a, <64 x float> %b) #0

				declare <2 x double> @llvm.copysign.v2f64(<2 x double> %a, <2 x double> %b) #0
				declare <4 x double> @llvm.copysign.v4f64(<4 x double> %a, <4 x double> %b) #0
				declare <8 x double> @llvm.copysign.v8f64(<8 x double> %a, <8 x double> %b) #0
				declare <16 x double> @llvm.copysign.v16f64(<16 x double> %a, <16 x double> %b) #0
				declare <32 x double> @llvm.copysign.v32f64(<32 x double> %a, <32 x double> %b) #0

				attributes #0 = { "target-features"="+sve2" }