This is an archive of the discontinued LLVM Phabricator instance.

Differential D142132

[AArch64] Map DestructiveTernaryCommWithRev intrinsics to pseudo instructions
Abandoned · Public

Authored by lizhijin on Jan 19 2023, 9:20 AM.

Details

Reviewers

fhahn
sdesmalen
kmclaughlin
Allen
mdchen
wwei
paulwalker-arm
david-arm
dmgreen

Summary

This patch maps DestructiveTernaryCommWithRev intrinsics to pseudo instructions. This makes it easier to choose between fmla/fmls/fnmla/fnmls and fmad/fmsb/fnmad/fnmsb, which reduces the number of mov instructions generated in computation-heavy code.
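
To make the trade-off concrete, here is a minimal scalar sketch in C++ of the two destructive forms. It is not LLVM code: the 4-lane `Vec` and `Pred` types and all function names are assumptions made for illustration, and the model uses a separate multiply and add, so it ignores the single rounding of the real fused instructions. `fmla` overwrites the accumulator, while `fmad` overwrites a multiplicand, so whichever operand is dead after the operation can donate its register without an extra `mov`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-width stand-ins for an SVE data vector and predicate.
constexpr std::size_t kLanes = 4;
using Vec = std::array<double, kLanes>;
using Pred = std::array<bool, kLanes>;

// FMLA Zda, Pg/M, Zn, Zm: active lanes become Zda + Zn * Zm.
// The accumulator register is the destination, so its old value is lost.
void fmla(Vec &zda, const Pred &pg, const Vec &zn, const Vec &zm) {
  for (std::size_t i = 0; i < kLanes; ++i)
    if (pg[i])
      zda[i] = zda[i] + zn[i] * zm[i];
}

// FMAD Zdn, Pg/M, Zm, Za: active lanes become Za + Zdn * Zm.
// A multiplicand register is the destination, so the accumulator survives.
void fmad(Vec &zdn, const Pred &pg, const Vec &zm, const Vec &za) {
  for (std::size_t i = 0; i < kLanes; ++i)
    if (pg[i])
      zdn[i] = za[i] + zdn[i] * zm[i];
}

// Usage sketch: with an all-true predicate both forms give the same values;
// they differ only in which input register is clobbered.
void demo() {
  const Pred all = {true, true, true, true};
  const Vec acc = {1.0, 2.0, 3.0, 4.0};
  const Vec x = {2.0, 2.0, 2.0, 2.0};
  const Vec y = {10.0, 20.0, 30.0, 40.0};

  Vec viaFmla = acc;  // destination tied to the accumulator
  fmla(viaFmla, all, x, y);

  Vec viaFmad = x;    // destination tied to a multiplicand
  fmad(viaFmad, all, y, acc);

  assert(viaFmla == viaFmad);
}
```

The equivalence shown in `demo` only covers lanes where the predicate is true; lanes where it is false keep the old destination value, which differs between the two forms.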

Event Timeline

lizhijin created this revision. · Jan 19 2023, 9:20 AM

Herald added a project: Restricted Project. · Jan 19 2023, 9:20 AM

Herald added subscribers: StephenFan, hiraditya, kristof.beyls.

lizhijin requested review of this revision. · Jan 19 2023, 9:20 AM

Herald added a project: Restricted Project. · Jan 19 2023, 9:20 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B208783: Diff 490545. · Jan 19 2023, 10:23 AM

Allen added inline comments. · Jan 19 2023, 5:04 PM

llvm/test/CodeGen/AArch64/sve-intrinsic-fmla-fmad.ll
Line 32 (on Diff #490545): nit: I'd expect multiple small test cases, one for each instruction.

The pre-merge checks fail.

lizhijin updated this revision to Diff 493003. · Jan 28 2023, 7:31 AM

lizhijin edited the summary of this revision.

lizhijin added reviewers: paulwalker-arm, david-arm, dmgreen.

Harbormaster completed remote builds in B210541: Diff 493003. · Jan 28 2023, 8:27 AM

ping

llvm/test/CodeGen/AArch64/sve-intrinsic-fmla-fmad.ll
Line 32 (on Diff #490545): Modified.

sdesmalen added inline comments. · Feb 3 2023, 8:06 AM

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td
Line 660: The intrinsics are defined to always merge into the first source operand, so passing this to a pseudo node which assumes the inactive lanes are undef doesn't seem entirely right.
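
The concern can be illustrated with a small scalar model in C++. This is a sketch under stated assumptions, not LLVM code: the 4-lane types, the `Tied` enum, `expandPseudo` and `matchesMergingSemantics` are hypothetical names, and the multiply-add is unfused. The merging intrinsic requires inactive lanes to keep the first source operand (the accumulator). A pseudo whose inactive lanes are undefined may be expanded with the destination tied to either operand, and the two choices disagree on exactly those lanes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed-width stand-ins for an SVE data vector and predicate.
constexpr std::size_t kLanes = 4;
using Vec = std::array<double, kLanes>;
using Pred = std::array<bool, kLanes>;

// Which source register the expansion reuses as the destination:
// Accumulator models an FMLA-style expansion, Multiplicand an FMAD-style one.
enum class Tied { Accumulator, Multiplicand };

// Models expanding an "inactive lanes undefined" multiply-add pseudo.
// Active lanes always compute acc + n * m. Inactive lanes keep whatever the
// destination register held before, which depends on the tied operand.
Vec expandPseudo(Tied tied, const Pred &pg, const Vec &acc, const Vec &n,
                 const Vec &m) {
  Vec dst = (tied == Tied::Accumulator) ? acc : n;
  for (std::size_t i = 0; i < kLanes; ++i)
    if (pg[i])
      dst[i] = acc[i] + n[i] * m[i];
  return dst;
}

// What the merging intrinsic promises: inactive lanes equal the accumulator.
bool matchesMergingSemantics(const Vec &result, const Pred &pg,
                             const Vec &acc) {
  for (std::size_t i = 0; i < kLanes; ++i)
    if (!pg[i] && result[i] != acc[i])
      return false;
  return true;
}

// Usage sketch: only the accumulator-tied expansion honours merging.
void demo() {
  const Pred pg = {true, false, true, false};
  const Vec acc = {1.0, 2.0, 3.0, 4.0};
  const Vec n = {5.0, 6.0, 7.0, 8.0};
  const Vec m = {1.0, 1.0, 1.0, 1.0};

  assert(matchesMergingSemantics(
      expandPseudo(Tied::Accumulator, pg, acc, n, m), pg, acc));
  assert(!matchesMergingSemantics(
      expandPseudo(Tied::Multiplicand, pg, acc, n, m), pg, acc));
}
```

In this model the multiplicand-tied expansion is only safe when no caller observes the inactive lanes, which is the assumption an undef-lanes pseudo encodes and the merging intrinsic does not.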

Hi @lizhijin, I don't think this patch makes much sense because at the code generation layer we already have pseudo instructions to allow better FMLA/FMAD usage based on what the register allocator chooses to do. I suspect the problem you care about is due to how the C/C++ builtins are lowered for things like svmla_x, which currently overly restricts code generation. I've created a patch series ending with D143767 that I believe fulfils the intent of what you wanted to achieve. Please let me know if I've misunderstood the issue you wanted to solve.

NOTE: D143767 can be extended to integer MLAs, but there's ongoing work on the code generation side (D142656) before that can be done.

Matt added a subscriber: Matt. · Mar 8 2023, 11:09 AM

lizhijin abandoned this revision. · Mar 8 2023, 5:21 PM

To prevent some duplication of effort, I just wanted to update my previous comment and say that all the dependent patches have now landed and I'm planning to add _u intrinsics/builtins for the integer MLA instructions soon.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td	27 lines
llvm/lib/Target/AArch64/SVEInstrFormats.td	6 lines
llvm/test/CodeGen/AArch64/sve-intrinsic-fmla-fmls-fnmla-fnmls.ll	178 lines

Diff 493003

llvm/lib/Target/AArch64/AArch64SVEInstrInfo.td

[391 unchanged lines hidden; enclosing context: def AArch64mls_m1 : PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),]
 [(int_aarch64_sve_mls node:$pred, node:$op1, node:$op2, node:$op3),
 (sub node:$op1, (AArch64mul_p_oneuse node:$pred, node:$op2, node:$op3)),
 // sub(a, select(mask, mul(b, c), splat(0))) -> mls(a, mask, b, c)
 (sub node:$op1, (vselect node:$pred, (AArch64mul_p_oneuse (SVEAllActive), node:$op2, node:$op3), (SVEDup0)))]>;
 def AArch64eor3 : PatFrags<(ops node:$op1, node:$op2, node:$op3),
 [(int_aarch64_sve_eor3 node:$op1, node:$op2, node:$op3),
 (xor node:$op1, (xor node:$op2, node:$op3))]>;

-class fma_patfrags<SDPatternOperator intrinsic, SDPatternOperator sdnode>
-: PatFrags<(ops node:$pred, node:$op1, node:$op2, node:$op3),
-[(intrinsic node:$pred, node:$op1, node:$op2, node:$op3),
-(sdnode (SVEAllActive), node:$op1, (vselect node:$pred, (AArch64fmul_p_oneuse (SVEAllActive), node:$op2, node:$op3), (SVEDup0)))],
-[{
-if ((N->getOpcode() != AArch64ISD::FADD_PRED) &&
-(N->getOpcode() != AArch64ISD::FSUB_PRED))
-return true; // it's the intrinsic
+class fma_patfrag<SDPatternOperator sdnode>
+: PatFrag<(ops node:$pred, node:$op1, node:$op2, node:$op3),
+(sdnode (SVEAllActive), node:$op1, (vselect node:$pred, (AArch64fmul_p_oneuse (SVEAllActive), node:$op2, node:$op3), (SVEDup0))), [{
 return N->getFlags().hasAllowContract();
 }]>;

-def AArch64fmla_m1 : fma_patfrags<int_aarch64_sve_fmla, AArch64fadd_p_nsz>;
-def AArch64fmls_m1 : fma_patfrags<int_aarch64_sve_fmls, AArch64fsub_p>;
+def AArch64fmla_m1 : fma_patfrag<AArch64fadd_p_nsz>;
+def AArch64fmls_m1 : fma_patfrag<AArch64fsub_p>;

 def AArch64smax_m1 : EitherVSelectOrPassthruPatFrags<int_aarch64_sve_smax, AArch64smax_p>;
 def AArch64umax_m1 : EitherVSelectOrPassthruPatFrags<int_aarch64_sve_umax, AArch64umax_p>;
 def AArch64smin_m1 : EitherVSelectOrPassthruPatFrags<int_aarch64_sve_smin, AArch64smin_p>;
 def AArch64umin_m1 : EitherVSelectOrPassthruPatFrags<int_aarch64_sve_umin, AArch64umin_p>;

 let Predicates = [HasSVE] in {
 defm RDFFR_PPz : sve_int_rdffr_pred<0b0, "rdffr", int_aarch64_sve_rdffr_z>;
[228 unchanged lines hidden]
 } // End HasSVE

 let Predicates = [HasSVEorSME] in {
 defm FCADD_ZPmZ : sve_fp_fcadd<"fcadd", int_aarch64_sve_fcadd>;
 defm FCMLA_ZPmZZ : sve_fp_fcmla<"fcmla", int_aarch64_sve_fcmla>;

 defm FMLA_ZPmZZ : sve_fp_3op_p_zds_a<0b00, "fmla", "FMLA_ZPZZZ", AArch64fmla_m1, "FMAD_ZPmZZ">;
 defm FMLS_ZPmZZ : sve_fp_3op_p_zds_a<0b01, "fmls", "FMLS_ZPZZZ", AArch64fmls_m1, "FMSB_ZPmZZ">;
-defm FNMLA_ZPmZZ : sve_fp_3op_p_zds_a<0b10, "fnmla", "FNMLA_ZPZZZ", int_aarch64_sve_fnmla, "FNMAD_ZPmZZ">;
-defm FNMLS_ZPmZZ : sve_fp_3op_p_zds_a<0b11, "fnmls", "FNMLS_ZPZZZ", int_aarch64_sve_fnmls, "FNMSB_ZPmZZ">;
+defm FNMLA_ZPmZZ : sve_fp_3op_p_zds_a<0b10, "fnmla", "FNMLA_ZPZZZ", null_frag, "FNMAD_ZPmZZ">;
+defm FNMLS_ZPmZZ : sve_fp_3op_p_zds_a<0b11, "fnmls", "FNMLS_ZPZZZ", null_frag, "FNMSB_ZPmZZ">;

 defm FMAD_ZPmZZ : sve_fp_3op_p_zds_b<0b00, "fmad", int_aarch64_sve_fmad, "FMLA_ZPmZZ", /*isReverseInstr*/ 1>;
 defm FMSB_ZPmZZ : sve_fp_3op_p_zds_b<0b01, "fmsb", int_aarch64_sve_fmsb, "FMLS_ZPmZZ", /*isReverseInstr*/ 1>;
 defm FNMAD_ZPmZZ : sve_fp_3op_p_zds_b<0b10, "fnmad", int_aarch64_sve_fnmad, "FNMLA_ZPmZZ", /*isReverseInstr*/ 1>;
 defm FNMSB_ZPmZZ : sve_fp_3op_p_zds_b<0b11, "fnmsb", int_aarch64_sve_fnmsb, "FNMLS_ZPmZZ", /*isReverseInstr*/ 1>;

-defm FMLA_ZPZZZ : sve_fp_3op_p_zds_zx;
-defm FMLS_ZPZZZ : sve_fp_3op_p_zds_zx;
-defm FNMLA_ZPZZZ : sve_fp_3op_p_zds_zx;
-defm FNMLS_ZPZZZ : sve_fp_3op_p_zds_zx;
+defm FMLA_ZPZZZ : sve_fp_3op_p_zds_zx<int_aarch64_sve_fmla>;
[Inline comment by sdesmalen: The intrinsics are defined to always merge into the first source operand, so passing this to a pseudo node which assumes the inactive lanes are undef doesn't seem entirely right.]
+defm FMLS_ZPZZZ : sve_fp_3op_p_zds_zx<int_aarch64_sve_fmls>;
+defm FNMLA_ZPZZZ : sve_fp_3op_p_zds_zx<int_aarch64_sve_fnmla>;
+defm FNMLS_ZPZZZ : sve_fp_3op_p_zds_zx<int_aarch64_sve_fnmls>;

 multiclass fma<ValueType Ty, ValueType PredTy, string Suffix> {
 // Zd = Za + Zn * Zm
 def : Pat<(Ty (AArch64fma_p PredTy:$P, Ty:$Zn, Ty:$Zm, Ty:$Za)),
 (!cast<Instruction>("FMLA_ZPZZZ_UNDEF_"#Suffix) $P, ZPR:$Za, ZPR:$Zn, ZPR:$Zm)>;

 // Zd = Za + -Zn * Zm
 def : Pat<(Ty (AArch64fmls_p PredTy:$P, Ty:$Zn, Ty:$Zm, Ty:$Za)),
[3,175 unchanged lines hidden]

llvm/lib/Target/AArch64/SVEInstrFormats.td

[2,171 unchanged lines hidden; enclosing context: multiclass sve_fp_3op_p_zds_b<bits<2> opc, string asm, SDPatternOperator op,]
 def _D : sve_fp_3op_p_zds_b<0b11, opc, asm, ZPR64>,
 SVEInstr2Rev<NAME # _D, revname # _D, isReverseInstr>;

 def : SVE_4_Op_Pat<nxv8f16, op, nxv8i1, nxv8f16, nxv8f16, nxv8f16, !cast<Instruction>(NAME # _H)>;
 def : SVE_4_Op_Pat<nxv4f32, op, nxv4i1, nxv4f32, nxv4f32, nxv4f32, !cast<Instruction>(NAME # _S)>;
 def : SVE_4_Op_Pat<nxv2f64, op, nxv2i1, nxv2f64, nxv2f64, nxv2f64, !cast<Instruction>(NAME # _D)>;
 }

-multiclass sve_fp_3op_p_zds_zx {
+multiclass sve_fp_3op_p_zds_zx<SDPatternOperator op> {
 def _UNDEF_H : PredThreeOpPseudo<NAME # _H, ZPR16, FalseLanesUndef>;
 def _UNDEF_S : PredThreeOpPseudo<NAME # _S, ZPR32, FalseLanesUndef>;
 def _UNDEF_D : PredThreeOpPseudo<NAME # _D, ZPR64, FalseLanesUndef>;
+
+def : SVE_4_Op_Pat<nxv8f16, op, nxv8i1, nxv8f16, nxv8f16, nxv8f16, !cast<Instruction>(NAME # _UNDEF_H)>;
+def : SVE_4_Op_Pat<nxv4f32, op, nxv4i1, nxv4f32, nxv4f32, nxv4f32, !cast<Instruction>(NAME # _UNDEF_S)>;
+def : SVE_4_Op_Pat<nxv2f64, op, nxv2i1, nxv2f64, nxv2f64, nxv2f64, !cast<Instruction>(NAME # _UNDEF_D)>;
 }

 //===----------------------------------------------------------------------===//
 // SVE Floating Point Multiply-Add - Indexed Group
 //===----------------------------------------------------------------------===//

 class sve_fp_fma_by_indexed_elem<bits<2> sz, bits<2> opc, string asm,
 ZPRRegOp zprty1,
[7,354 unchanged lines hidden]

llvm/test/CodeGen/AArch64/sve-intrinsic-fmla-fmls-fnmla-fnmls.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s | FileCheck %s

				target triple = "aarch64-unknown-linux-gnu"

				@g_val = external local_unnamed_addr constant [8 x double], align 8

				define <vscale x 2 x double> @test_fmla_fmad(<vscale x 16 x i1> %pg, <vscale x 2 x double> %r) {
				; CHECK-LABEL: test_fmla_fmad:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: adrp x8, :got:g_val
				; CHECK-NEXT: ptrue p1.d
				; CHECK-NEXT: mov z1.d, z0.d
				; CHECK-NEXT: fmul z1.d, p0/m, z1.d, z0.d
				; CHECK-NEXT: ldr x8, [x8, :got_lo12:g_val]
				; CHECK-NEXT: ld1rd { z2.d }, p1/z, [x8]
				; CHECK-NEXT: movprfx z3, z2
				; CHECK-NEXT: fmla z3.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: fmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmla z2.d, p0/m, z3.d, z1.d
				; CHECK-NEXT: fmov z3.d, #-1.00000000
				; CHECK-NEXT: fmul z1.d, p0/m, z1.d, z0.d
				; CHECK-NEXT: fmul z0.d, p0/m, z0.d, z3.d
				; CHECK-NEXT: fmla z0.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: ret
				entry:
				%0 = tail call <vscale x 2 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv2i1(<vscale x 16 x i1> %pg)
				%1 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %r, <vscale x 2 x double> %r)
				%2 = load double, double* getelementptr inbounds ([8 x double], [8 x double]* @g_val, i64 0, i64 0), align 8
				%.splatinsert = insertelement <vscale x 2 x double> poison, double %2, i64 0
				%3 = shufflevector <vscale x 2 x double> %.splatinsert, <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer
				%4 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %3, <vscale x 2 x double> %1)
				%5 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %4, <vscale x 2 x double> %1)
				%6 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %5, <vscale x 2 x double> %1)
				%7 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %6, <vscale x 2 x double> %1)
				%8 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %7, <vscale x 2 x double> %1)
				%9 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %8, <vscale x 2 x double> %1)
				%10 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %1, <vscale x 2 x double> %r)
				%11 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %9, <vscale x 2 x double> %1)
				%12 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %r, <vscale x 2 x double> shufflevector (<vscale x 2 x double> insertelement (<vscale x 2 x double> poison, double -1.000000e+00, i32 0), <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer))
				%13 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %12, <vscale x 2 x double> %11, <vscale x 2 x double> %10)
				ret <vscale x 2 x double> %13
				}

				define <vscale x 2 x double> @test_fmls_fmsb(<vscale x 16 x i1> %pg, <vscale x 2 x double> %r) {
				; CHECK-LABEL: test_fmls_fmsb:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: adrp x8, :got:g_val
				; CHECK-NEXT: ptrue p1.d
				; CHECK-NEXT: mov z1.d, z0.d
				; CHECK-NEXT: fmul z1.d, p0/m, z1.d, z0.d
				; CHECK-NEXT: ldr x8, [x8, :got_lo12:g_val]
				; CHECK-NEXT: ld1rd { z2.d }, p1/z, [x8]
				; CHECK-NEXT: movprfx z3, z2
				; CHECK-NEXT: fmls z3.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: fmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fmls z2.d, p0/m, z3.d, z1.d
				; CHECK-NEXT: fmov z3.d, #-1.00000000
				; CHECK-NEXT: fmul z1.d, p0/m, z1.d, z0.d
				; CHECK-NEXT: fmul z0.d, p0/m, z0.d, z3.d
				; CHECK-NEXT: fmls z0.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: ret
				entry:
				%0 = tail call <vscale x 2 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv2i1(<vscale x 16 x i1> %pg)
				%1 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %r, <vscale x 2 x double> %r)
				%2 = load double, double* getelementptr inbounds ([8 x double], [8 x double]* @g_val, i64 0, i64 0), align 8
				%.splatinsert = insertelement <vscale x 2 x double> poison, double %2, i64 0
				%3 = shufflevector <vscale x 2 x double> %.splatinsert, <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer
				%4 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %3, <vscale x 2 x double> %1)
				%5 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %4, <vscale x 2 x double> %1)
				%6 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %5, <vscale x 2 x double> %1)
				%7 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %6, <vscale x 2 x double> %1)
				%8 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %7, <vscale x 2 x double> %1)
				%9 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %8, <vscale x 2 x double> %1)
				%10 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %1, <vscale x 2 x double> %r)
				%11 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %9, <vscale x 2 x double> %1)
				%12 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %r, <vscale x 2 x double> shufflevector (<vscale x 2 x double> insertelement (<vscale x 2 x double> poison, double -1.000000e+00, i32 0), <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer))
				%13 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %12, <vscale x 2 x double> %11, <vscale x 2 x double> %10)
				ret <vscale x 2 x double> %13
				}

				define <vscale x 2 x double> @test_fnmla_fnmad(<vscale x 16 x i1> %pg, <vscale x 2 x double> %r) {
				; CHECK-LABEL: test_fnmla_fnmad:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: adrp x8, :got:g_val
				; CHECK-NEXT: ptrue p1.d
				; CHECK-NEXT: mov z1.d, z0.d
				; CHECK-NEXT: fmul z1.d, p0/m, z1.d, z0.d
				; CHECK-NEXT: ldr x8, [x8, :got_lo12:g_val]
				; CHECK-NEXT: ld1rd { z2.d }, p1/z, [x8]
				; CHECK-NEXT: movprfx z3, z2
				; CHECK-NEXT: fnmla z3.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: fnmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmad z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmla z2.d, p0/m, z3.d, z1.d
				; CHECK-NEXT: fmov z3.d, #-1.00000000
				; CHECK-NEXT: fmul z1.d, p0/m, z1.d, z0.d
				; CHECK-NEXT: fmul z0.d, p0/m, z0.d, z3.d
				; CHECK-NEXT: fnmla z0.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: ret
				entry:
				%0 = tail call <vscale x 2 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv2i1(<vscale x 16 x i1> %pg)
				%1 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %r, <vscale x 2 x double> %r)
				%2 = load double, double* getelementptr inbounds ([8 x double], [8 x double]* @g_val, i64 0, i64 0), align 8
				%.splatinsert = insertelement <vscale x 2 x double> poison, double %2, i64 0
				%3 = shufflevector <vscale x 2 x double> %.splatinsert, <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer
				%4 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %3, <vscale x 2 x double> %1)
				%5 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %4, <vscale x 2 x double> %1)
				%6 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %5, <vscale x 2 x double> %1)
				%7 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %6, <vscale x 2 x double> %1)
				%8 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %7, <vscale x 2 x double> %1)
				%9 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %8, <vscale x 2 x double> %1)
				%10 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %1, <vscale x 2 x double> %r)
				%11 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %9, <vscale x 2 x double> %1)
				%12 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %r, <vscale x 2 x double> shufflevector (<vscale x 2 x double> insertelement (<vscale x 2 x double> poison, double -1.000000e+00, i32 0), <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer))
				%13 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmla.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %12, <vscale x 2 x double> %11, <vscale x 2 x double> %10)
				ret <vscale x 2 x double> %13
				}

				define <vscale x 2 x double> @test_fnmls_fnmsb(<vscale x 16 x i1> %pg, <vscale x 2 x double> %r) {
				; CHECK-LABEL: test_fnmls_fnmsb:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: adrp x8, :got:g_val
				; CHECK-NEXT: ptrue p1.d
				; CHECK-NEXT: mov z1.d, z0.d
				; CHECK-NEXT: fmul z1.d, p0/m, z1.d, z0.d
				; CHECK-NEXT: ldr x8, [x8, :got_lo12:g_val]
				; CHECK-NEXT: ld1rd { z2.d }, p1/z, [x8]
				; CHECK-NEXT: movprfx z3, z2
				; CHECK-NEXT: fnmls z3.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: fnmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmsb z3.d, p0/m, z1.d, z2.d
				; CHECK-NEXT: fnmls z2.d, p0/m, z3.d, z1.d
				; CHECK-NEXT: fmov z3.d, #-1.00000000
				; CHECK-NEXT: fmul z1.d, p0/m, z1.d, z0.d
				; CHECK-NEXT: fmul z0.d, p0/m, z0.d, z3.d
				; CHECK-NEXT: fnmls z0.d, p0/m, z2.d, z1.d
				; CHECK-NEXT: ret
				entry:
				%0 = tail call <vscale x 2 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv2i1(<vscale x 16 x i1> %pg)
				%1 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %r, <vscale x 2 x double> %r)
				%2 = load double, double* getelementptr inbounds ([8 x double], [8 x double]* @g_val, i64 0, i64 0), align 8
				%.splatinsert = insertelement <vscale x 2 x double> poison, double %2, i64 0
				%3 = shufflevector <vscale x 2 x double> %.splatinsert, <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer
				%4 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %3, <vscale x 2 x double> %1)
				%5 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %4, <vscale x 2 x double> %1)
				%6 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %5, <vscale x 2 x double> %1)
				%7 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %6, <vscale x 2 x double> %1)
				%8 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %7, <vscale x 2 x double> %1)
				%9 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %8, <vscale x 2 x double> %1)
				%10 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %1, <vscale x 2 x double> %r)
				%11 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %3, <vscale x 2 x double> %9, <vscale x 2 x double> %1)
				%12 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %r, <vscale x 2 x double> shufflevector (<vscale x 2 x double> insertelement (<vscale x 2 x double> poison, double -1.000000e+00, i32 0), <vscale x 2 x double> poison, <vscale x 2 x i32> zeroinitializer))
				%13 = tail call contract <vscale x 2 x double> @llvm.aarch64.sve.fnmls.nxv2f64(<vscale x 2 x i1> %0, <vscale x 2 x double> %12, <vscale x 2 x double> %11, <vscale x 2 x double> %10)
				ret <vscale x 2 x double> %13
				}

				declare <vscale x 2 x i1> @llvm.aarch64.sve.convert.from.svbool.nxv2i1(<vscale x 16 x i1>)
				declare <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1>, <vscale x 2 x double>, <vscale x 2 x double>)
				declare <vscale x 2 x double> @llvm.aarch64.sve.fmla.nxv2f64(<vscale x 2 x i1>, <vscale x 2 x double>, <vscale x 2 x double>, <vscale x 2 x double>)
				declare <vscale x 2 x double> @llvm.aarch64.sve.fmls.nxv2f64(<vscale x 2 x i1>, <vscale x 2 x double>, <vscale x 2 x double>, <vscale x 2 x double>)
				declare <vscale x 2 x double> @llvm.aarch64.sve.fnmla.nxv2f64(<vscale x 2 x i1>, <vscale x 2 x double>, <vscale x 2 x double>, <vscale x 2 x double>)
				declare <vscale x 2 x double> @llvm.aarch64.sve.fnmls.nxv2f64(<vscale x 2 x i1>, <vscale x 2 x double>, <vscale x 2 x double>, <vscale x 2 x double>)
				declare i64 @llvm.vscale.i64()