This is an archive of the discontinued LLVM Phabricator instance.


Differential D80410

[WIP][SVE] Pass through dup(0) to zero-merging pseudos
Abandoned, Public

Authored by sdesmalen on May 21 2020, 2:46 PM.

Details

Reviewers

cameron.mcinally
efriedma

Summary

Hi @cameron.mcinally, I'm just sharing what I tried out today based
off your patch D80260. I'm not really planning to land it, but feel
free to use for reference or discard entirely if you've already been
working on something similar.

It passes the dup(0) to the zero-merging pseudos, similar to what D80260
does for any other mask value.

This patch also highlights a bug that currently exists with the expansion
of the pseudo instructions that merge zeros into the false lanes.

The zero-merging pseudos don't have any tied operand constraints to give
the register allocator more freedom to use the reverse instructions
(like FSUBR).
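
To illustrate why leaving the operands untied helps, here is a small lane-level model of a zero-merging subtract. It is only a sketch under stated assumptions: the four-lane vector width and the helper names (movprfxZero, fsubMerge, fsubrMerge and the two wrappers) are hypothetical and are not part of the patch or of LLVM. It shows that the same result can be produced with either source operand acting as the destructive operand, provided the reverse instruction is used when the second source is the one being overwritten.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Assumption: a fixed 4-lane vector stands in for a scalable SVE register.
constexpr std::size_t Lanes = 4;
using Vec = std::array<float, Lanes>;
using Pred = std::array<bool, Lanes>;

// Zeroing MOVPRFX model: active lanes copy Src, inactive lanes become 0.
Vec movprfxZero(const Pred &Pg, const Vec &Src) {
  Vec Dst{};
  for (std::size_t I = 0; I < Lanes; ++I)
    Dst[I] = Pg[I] ? Src[I] : 0.0f;
  return Dst;
}

// Merging FSUB model: Zdn = Zdn - Zm on active lanes, inactive lanes keep Zdn.
void fsubMerge(const Pred &Pg, Vec &Zdn, const Vec &Zm) {
  for (std::size_t I = 0; I < Lanes; ++I)
    if (Pg[I])
      Zdn[I] = Zdn[I] - Zm[I];
}

// Merging FSUBR model: Zdn = Zm - Zdn on active lanes, inactive lanes keep Zdn.
void fsubrMerge(const Pred &Pg, Vec &Zdn, const Vec &Zm) {
  for (std::size_t I = 0; I < Lanes; ++I)
    if (Pg[I])
      Zdn[I] = Zm[I] - Zdn[I];
}

// A - B with zeroed false lanes, using A as the destructive operand.
Vec zeroMergedSubViaFsub(const Pred &Pg, const Vec &A, const Vec &B) {
  Vec Dst = movprfxZero(Pg, A);
  fsubMerge(Pg, Dst, B);
  return Dst;
}

// The same value, using B as the destructive operand and the reverse op.
Vec zeroMergedSubViaFsubr(const Pred &Pg, const Vec &A, const Vec &B) {
  Vec Dst = movprfxZero(Pg, B);
  fsubrMerge(Pg, Dst, A);
  return Dst;
}
```

Both wrappers return the same vector for any predicate, which is the freedom the register allocator gets when the pseudo has no tied operand.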

A bug currently exists when the register allocation of one of the pseudos
ends up as:

Dst = FSUB_ZERO_S P0, Z0, Z0

The expand pass cannot zero the false lanes of Z0 using MOVPRFX, because
the MOVPRFX instruction specifies that the destination register must not
be used in any other operand position than the destination register. This
would not be valid:

Z0 = MOVPRFX P0/z, Z0
Z0 = FSUB_S Z0, P0/m, Z0
                      ^^

At the point of expanding the pseudo, there may not be a spare register
available to expand this into a legal sequence. In D71712 we've solved
this by using a 'Conditional Early Clobber' pass that runs during register
allocation and makes sure the destination register is different from
any of the input registers, if the two input registers will otherwise end
up the same. This is a bit fiddly, and it's probably better to build
on the design set out in D80260 where the merge value is passed
to the pseudo, so the compiler can decide at the point of pseudo expansion
whether to use the DUP(0) value, or to use the zeroing MOVPRFX.
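
The decision described above can be sketched in isolation. This is a minimal model, not the code from AArch64ExpandPseudoInsts.cpp: the PseudoOperands struct, the ZeroingExpansion enum and both function names are hypothetical, and registers are plain integers. The uniqueness condition is the one the diff shows for DestructiveBinaryCommWithRev.

```cpp
#include <cassert>

// Hypothetical stand-ins for physical Z registers; these are not LLVM types.
struct PseudoOperands {
  int Dst; // destination register of the pseudo
  int DOP; // operand chosen as the destructive operand
  int Src; // the other source operand
};

enum class ZeroingExpansion {
  MovprfxZero,   // zeroing MOVPRFX followed by the destructive instruction
  SelWithDupZero // SEL of the operand and the passed-in DUP(0), then the op
};

// Same condition as the DestructiveBinaryCommWithRev case in the diff:
// the destructive operand is unique unless Dst, DOP and Src all coincide.
bool dopRegIsUnique(const PseudoOperands &Ops) {
  return Ops.Dst != Ops.DOP || Ops.DOP != Ops.Src;
}

// Use the zeroing MOVPRFX whenever it is legal; otherwise fall back to the
// select against the zero vector that was passed to the pseudo.
ZeroingExpansion chooseZeroingExpansion(const PseudoOperands &Ops) {
  return dopRegIsUnique(Ops) ? ZeroingExpansion::MovprfxZero
                             : ZeroingExpansion::SelWithDupZero;
}
```

With this model, the problematic allocation `Dst = FSUB_ZERO_S P0, Z0, Z0` corresponds to `{0, 0, 0}` and selects the SEL path, while any allocation with at least one distinct register keeps the MOVPRFX path.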

Given that the DUP IMM instructions have isReMaterializable set,
the register allocator hopefully won't try too hard to keep it in a
register.

Event Timeline

sdesmalen created this revision. May 21 2020, 2:46 PM

Herald added a reviewer: efriedma. May 21 2020, 2:46 PM

Herald added a project: Restricted Project.

Herald added subscribers: llvm-commits, psnobl, rkruppe and 2 others.

sdesmalen mentioned this in D80260: [WIP][SVE] Prototype for general merging MOVPRFX support. May 21 2020, 2:47 PM

cameron.mcinally added inline comments. May 26 2020, 3:01 PM

llvm/test/CodeGen/AArch64/sve-intrinsics-fp-arith-merging.ll
Line 301:
I'm looking at some rough latency tables we've put together and it looks like the tied-reg MOVPRFX sequence is 1 cycle faster than the SEL sequence:

; CHECK-NEXT: movprfx z1.s, p0/z, z0.s
; CHECK-NEXT: fsubr z1.s, p0/m, z1.s, z0.s
; CHECK-NEXT: mov z0.d, z1.d

The vector MOV is faster than the DUP. And we burn the extra z1 register for both cases, so that's a wash.

That said, the MOVPRFX sequence we're generating actually looks like this:

; CHECK-NEXT: mov z1.s, #0
; CHECK-NEXT: movprfx z1.s, p0/z, z0.s
; CHECK-NEXT: fsubr z1.s, p0/m, z1.s, z0.s
; CHECK-NEXT: mov z0.d, z1.d

where the DUP #0 is a dead instruction. It's proving pretty hard to get rid of the DUP at the MachineInstruction level though. Still looking...

sdesmalen marked an inline comment as done. May 27 2020, 2:38 PM

sdesmalen added inline comments.

llvm/test/CodeGen/AArch64/sve-intrinsics-fp-arith-merging.ll
Line 301:
> I'm looking at some rough latency tables we've put together and it looks like the tied-reg MOVPRFX sequence is 1 cycle faster than the SEL sequence:

Ah that's good to know. Always using the tied-operand constraint for the zeroing forms possibly makes the common cases slower though, because it forces the compiler to honour the constraints and avoids benefiting from the reverse instructions as the register allocator will already have done the work. All cases except this one don't need the dup+select and can use movprfx directly and make use of the commutative/reverse instructions to expand the pseudo.

> That said, the MOVPRFX sequence we're generating actually looks like this:

Is that with a different example than the one in this test?

sdesmalen mentioned this in D82780: [AArch64][SVE] Put zeroing pseudos and patterns under flag. Jun 29 2020, 8:55 AM

sdesmalen mentioned this in rG075c440f7bc8: [AArch64][SVE] Put zeroing pseudos and patterns under flag. Jul 2 2020, 6:28 AM

sdesmalen abandoned this revision. Jul 26 2021, 8:11 AM

rkruppe removed a subscriber: rkruppe. Jul 26 2021, 9:41 AM

Matt added a subscriber: Matt. Jan 11 2023, 3:46 PM

Herald added a project: Restricted Project. Jan 11 2023, 3:46 PM

Revision Contents

Path                                                          Size
llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp          17 lines
llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp               3 lines
llvm/lib/Target/AArch64/SVEInstrFormats.td                    25 lines
llvm/test/CodeGen/AArch64/sve-intrinsics-fp-arith-merging.ll  20 lines

Diff 265607

llvm/lib/Target/AArch64/AArch64ExpandPseudoInsts.cpp

(431 lines hidden above; hunk context: case AArch64::DestructiveBinaryCommWithRev:)
     DOPRegIsUnique =
         DstReg != MI.getOperand(DOPIdx).getReg() ||
         MI.getOperand(DOPIdx).getReg() != MI.getOperand(SrcIdx).getReg();
     break;
   case AArch64::DestructiveBinaryImm:
     DOPRegIsUnique = true;
     break;
   }

-  assert (DOPRegIsUnique && "The destructive operand should be unique");
 #endif

   // Resolve the reverse opcode
   if (UseRev) {
     if (AArch64::getSVERevInstr(Opcode) != -1)
       Opcode = AArch64::getSVERevInstr(Opcode);
     else if (AArch64::getSVEOrigInstr(Opcode) != -1)
       Opcode = AArch64::getSVEOrigInstr(Opcode);
   }

   // Get the right MOVPRFX
   uint64_t ElementSize = TII->getElementSizeForOpcode(Opcode);
-  unsigned MovPrfx, MovPrfxZero, MovPrfxMerge;
+  unsigned MovPrfx, MovPrfxZero, MovPrfxMerge, Sel;
   switch (ElementSize) {
   case AArch64::ElementSizeNone:
   case AArch64::ElementSizeB:
     MovPrfx = AArch64::MOVPRFX_ZZ;
     MovPrfxZero = AArch64::MOVPRFX_ZPzZ_B;
     MovPrfxMerge = AArch64::MOVPRFX_ZPmZ_B;
+    Sel = AArch64::SEL_ZPZZ_B;
     break;
   case AArch64::ElementSizeH:
     MovPrfx = AArch64::MOVPRFX_ZZ;
     MovPrfxZero = AArch64::MOVPRFX_ZPzZ_H;
     MovPrfxMerge = AArch64::MOVPRFX_ZPmZ_H;
+    Sel = AArch64::SEL_ZPZZ_H;
     break;
   case AArch64::ElementSizeS:
     MovPrfx = AArch64::MOVPRFX_ZZ;
     MovPrfxZero = AArch64::MOVPRFX_ZPzZ_S;
     MovPrfxMerge = AArch64::MOVPRFX_ZPmZ_S;
+    Sel = AArch64::SEL_ZPZZ_S;
     break;
   case AArch64::ElementSizeD:
     MovPrfx = AArch64::MOVPRFX_ZZ;
     MovPrfxZero = AArch64::MOVPRFX_ZPzZ_D;
     MovPrfxMerge = AArch64::MOVPRFX_ZPmZ_D;
+    Sel = AArch64::SEL_ZPZZ_D;
     break;
   default:
     llvm_unreachable("Unsupported ElementSize");
   }

   //
   // Create the destructive operation (if required)
   //
   MachineInstrBuilder PRFX, DOP;
-  if (FalseLanes == AArch64::FalseLanesZero) {
+  if (FalseLanes == AArch64::FalseLanesZero && !DOPRegIsUnique) {
+    PRFX = BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(Sel))
+               .addReg(DstReg, RegState::Define)
+               .addReg(MI.getOperand(PredIdx).getReg())
+               .addReg(MI.getOperand(DOPIdx).getReg())
+               .addReg(MI.getOperand(PassIdx).getReg());
+  } else if (FalseLanes == AArch64::FalseLanesZero) {
     assert(ElementSize != AArch64::ElementSizeNone &&
            "This instruction is unpredicated");

     // Merge source operand into destination register
     PRFX = BuildMI(MBB, MBBI, MI.getDebugLoc(), TII->get(MovPrfxZero))
                .addReg(DstReg, RegState::Define)
                .addReg(MI.getOperand(PredIdx).getReg())
                .addReg(MI.getOperand(DOPIdx).getReg());
(558 lines hidden below)

llvm/lib/Target/AArch64/AArch64ISelDAGToDAG.cpp

(156 lines hidden above; hunk context: bool SelectDupZeroOrUndef(SDValue N) {)
     }
     default:
       break;
     }

     return false;
   }

-  bool SelectDupZero(SDValue N) {
+  bool SelectDupZero(SDValue N, SDValue &Res) {
     switch(N->getOpcode()) {
     case AArch64ISD::DUP:
     case ISD::SPLAT_VECTOR: {
+      Res = N;
       auto Opnd0 = N->getOperand(0);
       if (auto CN = dyn_cast<ConstantSDNode>(Opnd0))
         if (CN->isNullValue())
           return true;
       if (auto CN = dyn_cast<ConstantFPSDNode>(Opnd0))
         if (CN->isZero())
           return true;
       break;
(4,639 lines hidden below)

llvm/lib/Target/AArch64/SVEInstrFormats.td

(373 lines hidden above; hunk context: : Pat<(vtd (op vt1:$Op1, vt2:$Op2, (vt3 ImmTy:$Op3))),)
         (inst $Op1, $Op2, ImmTy:$Op3)>;

 class SVE_4_Op_Imm_Pat<ValueType vtd, SDPatternOperator op, ValueType vt1,
                        ValueType vt2, ValueType vt3, ValueType vt4,
                        Operand ImmTy, Instruction inst>
   : Pat<(vtd (op vt1:$Op1, vt2:$Op2, vt3:$Op3, (vt4 ImmTy:$Op4))),
         (inst $Op1, $Op2, $Op3, ImmTy:$Op4)>;

-def SVEDup0 : ComplexPattern<i64, 0, "SelectDupZero", []>;
+def SVEDup0 : ComplexPattern<vAny, 1, "SelectDupZero", []>;
 def SVEDup0Undef : ComplexPattern<i64, 0, "SelectDupZeroOrUndef", []>;

 let AddedComplexity = 1 in {
 class SVE_3_Op_Pat_Sel<ValueType vtd, SDPatternOperator op, ValueType vt1,
                        ValueType vt2, ValueType vt3, Instruction inst>
   : Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, vt2:$Passthru), vt3:$Op3)),
         (inst $Op1, $Op2, $Op3, $Passthru)>;

 class SVE_3_Op_Pat_SelZero<ValueType vtd, SDPatternOperator op, ValueType vt1,
                            ValueType vt2, ValueType vt3, Instruction inst>
   : Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), vt3:$Op3))),
         (inst $Op1, $Op2, $Op3)>;

+class SVE_3_Op_Pat_SelZero_Passthru<ValueType vtd, SDPatternOperator op, ValueType vt1,
+                                    ValueType vt2, ValueType vt3, Instruction inst>
+  : Pat<(vtd (vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (vt2 SVEDup0:$Dup)), vt3:$Op3))),
+        (inst $Op1, $Op2, $Op3, $Dup)>;
+
 class SVE_3_Op_Pat_Shift_Imm_SelZero<ValueType vtd, SDPatternOperator op,
                                      ValueType vt1, ValueType vt2,
                                      Operand vt3, Instruction inst>
   : Pat<(vtd (op vt1:$Op1, (vselect vt1:$Op1, vt2:$Op2, (SVEDup0)), (i32 (vt3:$Op3)))),
         (inst $Op1, $Op2, vt3:$Op3)>;
 }

 //
(67 lines hidden; hunk context: let hasNoSchedulingInfo = 1 in {)

   class PredTwoOpMergePseudo<string name, ZPRRegOp zprty,
                              FalseLanesEnum flags = FalseLanesNone>
   : SVEPseudo2Instr<name, 0>,
     Pseudo<(outs zprty:$Zd), (ins PPR3bAny:$Pg, zprty:$Zs1, zprty:$Zs2, zprty:$Zpt), []> {
     let FalseLanes = flags;
     let Constraints = "$Zd = $Zpt";
   }
+
+  class PredTwoOpMergeZero<string name, ZPRRegOp zprty>
+  : SVEPseudo2Instr<name, 0>,
+    Pseudo<(outs zprty:$Zd), (ins PPR3bAny:$Pg, zprty:$Zs1, zprty:$Zs2, zprty:$Zpt), []> {
+    let FalseLanes = FalseLanesZero;
+  }
 }

 //===----------------------------------------------------------------------===//
 // SVE Predicate Misc Group
 //===----------------------------------------------------------------------===//

 class sve_int_pfalse<bits<6> opc, string asm>
 : I<(outs PPR8:$Pd), (ins),
(1,124 lines hidden; hunk context: multiclass sve_fp_2op_p_zds_fscale<bits<4> opc, string asm,)
   def _D : sve_fp_2op_p_zds<0b11, opc, asm, ZPR64>;

   def : SVE_3_Op_Pat<nxv8f16, op, nxv8i1, nxv8f16, nxv8i16, !cast<Instruction>(NAME # _H)>;
   def : SVE_3_Op_Pat<nxv4f32, op, nxv4i1, nxv4f32, nxv4i32, !cast<Instruction>(NAME # _S)>;
   def : SVE_3_Op_Pat<nxv2f64, op, nxv2i1, nxv2f64, nxv2i64, !cast<Instruction>(NAME # _D)>;
 }

 multiclass sve_fp_2op_p_zds_zx<SDPatternOperator op> {
-  def _ZERO_H : PredTwoOpPseudo<NAME # _H, ZPR16, FalseLanesZero>;
-  def _ZERO_S : PredTwoOpPseudo<NAME # _S, ZPR32, FalseLanesZero>;
-  def _ZERO_D : PredTwoOpPseudo<NAME # _D, ZPR64, FalseLanesZero>;
+  def _ZERO_H : PredTwoOpMergeZero<NAME # _H, ZPR16>;
+  def _ZERO_S : PredTwoOpMergeZero<NAME # _S, ZPR32>;
+  def _ZERO_D : PredTwoOpMergeZero<NAME # _D, ZPR64>;

-  def : SVE_3_Op_Pat_SelZero<nxv8f16, op, nxv8i1, nxv8f16, nxv8f16, !cast<Pseudo>(NAME # _ZERO_H)>;
-  def : SVE_3_Op_Pat_SelZero<nxv4f32, op, nxv4i1, nxv4f32, nxv4f32, !cast<Pseudo>(NAME # _ZERO_S)>;
-  def : SVE_3_Op_Pat_SelZero<nxv2f64, op, nxv2i1, nxv2f64, nxv2f64, !cast<Pseudo>(NAME # _ZERO_D)>;
+  def : SVE_3_Op_Pat_SelZero_Passthru<nxv8f16, op, nxv8i1, nxv8f16, nxv8f16, !cast<Pseudo>(NAME # _ZERO_H)>;
+  def : SVE_3_Op_Pat_SelZero_Passthru<nxv4f32, op, nxv4i1, nxv4f32, nxv4f32, !cast<Pseudo>(NAME # _ZERO_S)>;
+  def : SVE_3_Op_Pat_SelZero_Passthru<nxv2f64, op, nxv2i1, nxv2f64, nxv2f64, !cast<Pseudo>(NAME # _ZERO_D)>;

   def _H : PredTwoOpMergePseudo<NAME # _H, ZPR16, FalseLanesMerge>;
   def _S : PredTwoOpMergePseudo<NAME # _S, ZPR32, FalseLanesMerge>;
   def _D : PredTwoOpMergePseudo<NAME # _D, ZPR64, FalseLanesMerge>;

   def : SVE_3_Op_Pat_Sel<nxv8f16, op, nxv8i1, nxv8f16, nxv8f16, !cast<Pseudo>(NAME # _H)>;
   def : SVE_3_Op_Pat_Sel<nxv4f32, op, nxv4i1, nxv4f32, nxv4f32, !cast<Pseudo>(NAME # _S)>;
   def : SVE_3_Op_Pat_Sel<nxv2f64, op, nxv2i1, nxv2f64, nxv2f64, !cast<Pseudo>(NAME # _D)>;
(6,157 lines hidden below)

llvm/test/CodeGen/AArch64/sve-intrinsics-fp-arith-merging.ll

(274 lines hidden above)
 ; CHECK-NEXT: ret
   %a_z = select <vscale x 2 x i1> %pg, <vscale x 2 x double> %a, <vscale x 2 x double> zeroinitializer
   %out = call <vscale x 2 x double> @llvm.aarch64.sve.fsub.nxv2f64(<vscale x 2 x i1> %pg,
                                                                    <vscale x 2 x double> %a_z,
                                                                    <vscale x 2 x double> %b)
   ret <vscale x 2 x double> %out
 }

+; This test currently breaks on master, because the register allocation ends up as:
+;   Dst = FSUB_ZERO_S P0, Z0, Z0
+; And the expand pass cannot zero the false lanes of Z0 using MOVPRFX, because the
+; instruction specifies that the destination register must not be used in any other
+; operand position than the destination register, so:
+;   Z0 = MOVPRFX P0/z, Z0
+;   Z0 = FSUB_S Z0, P0/m, Z0
+; would not be valid. Hence the need to use a SELECT of Z0 and DUP(0).
+define <vscale x 4 x float> @fsub_zero_z0_z0(<vscale x 4 x i1> %p, <vscale x 4 x float> %z0) {
+; CHECK-LABEL: fsub_zero_z0_z0
+; CHECK: mov z1.s, #0
+; CHECK-NEXT: sel z0.s, p0, z0.s, z1.s
+; CHECK-NEXT: fsubr z0.s, p0/m, z0.s, z0.s
+; CHECK-NEXT: ret
+  %z0_in = select <vscale x 4 x i1> %p, <vscale x 4 x float> %z0, <vscale x 4 x float> zeroinitializer
+  %add = call <vscale x 4 x float> @llvm.aarch64.sve.fsub.nxv4f32(<vscale x 4 x i1> %p, <vscale x 4 x float> %z0_in, <vscale x 4 x float> %z0)
+  ret <vscale x 4 x float> %add
+}

 ;
 ; FSUBR
 ;

 define <vscale x 8 x half> @fsubr_h(<vscale x 8 x i1> %pg, <vscale x 8 x half> %a, <vscale x 8 x half> %b) {
 ; CHECK-LABEL: fsubr_h:
 ; CHECK: movprfx z0.h, p0/z, z0.h
 ; CHECK-NEXT: fsubr z0.h, p0/m, z0.h, z1.h
(79 lines hidden below)