This is an archive of the discontinued LLVM Phabricator instance.

Differential D96961

[AArch64][SVE][DAGCombine] Factor out redundant SVE mul/fmul intrinsics
Abandoned, Public

Authored by joechrisellis on Feb 18 2021, 8:10 AM.

Details

Reviewers

huntergr
peterwaller-arm
bsmith
efriedma

Summary

This commit implements the following patterns:

fmul (ptrue sv_all) (dup 1.0) V => V
fmul (ptrue sv_all) V (dup 1.0) => V
mul (ptrue sv_all) (dup 1) V => V
mul (ptrue sv_all) V (dup 1) => V

That is: using the SVE mul/fmul intrinsic with an all-true predicate to
multiply a vector V by a vector of all ones is redundant.

The result of this commit is that code such as:

#include <arm_sve.h>

svfloat64_t foo(svfloat64_t a) {
  svbool_t t = svptrue_b64();
  svfloat64_t b = svdup_f64(1.0);
  return svmul_m(t, a, b);
}

will compile to a no-op: the multiply is removed and the function simply returns its argument.

This commit does not capture all possibilities; only the simple case as
described above. There is still room for further optimisation.
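
The redundancy described in the summary can be checked lane by lane. The following is a minimal C++ sketch, not LLVM or ACLE code: it assumes a fixed four-lane vector and models merging predication, where inactive lanes keep the first vector operand. The names Vec, Pred, mul_m, dup, ptrue_all and demo are hypothetical stand-ins introduced only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Toy model only: real SVE vectors are scalable; four lanes is an assumption
// made to keep the sketch small.
constexpr std::size_t kLanes = 4;
using Vec = std::array<double, kLanes>;
using Pred = std::array<bool, kLanes>;

// Merging predicated multiply: active lanes get a[i] * b[i], inactive lanes
// keep a[i] (the first vector operand).
Vec mul_m(const Pred &pg, const Vec &a, const Vec &b) {
  Vec out{};
  for (std::size_t i = 0; i < kLanes; ++i)
    out[i] = pg[i] ? a[i] * b[i] : a[i];
  return out;
}

// Hypothetical stand-ins for a splat and an all-true predicate.
Vec dup(double x) {
  Vec v{};
  v.fill(x);
  return v;
}

Pred ptrue_all() {
  Pred p{};
  p.fill(true);
  return p;
}

void demo() {
  const Vec a = {1.5, -2.0, 0.0, 42.0};
  // All-true predicate: either operand order gives back a unchanged.
  assert(mul_m(ptrue_all(), a, dup(1.0)) == a);
  assert(mul_m(ptrue_all(), dup(1.0), a) == a);
  // Partial predicate with the ones vector first: inactive lanes hold 1.0,
  // so the result is no longer a.
  const Pred partial = {true, false, true, false};
  assert(mul_m(partial, dup(1.0), a) != a);
}
```

In this model the all-true predicate is what makes both operand orders safe to fold, which matches the four patterns listed in the summary.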

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

joechrisellis created this revision. Feb 18 2021, 8:10 AM

Herald added a reviewer: efriedma. Feb 18 2021, 8:10 AM

Herald added subscribers: psnobl, hiraditya, kristof.beyls, tschuett.

joechrisellis requested review of this revision. Feb 18 2021, 8:10 AM

Herald added a project: Restricted Project. Feb 18 2021, 8:10 AM

Herald added a subscriber: llvm-commits.

Ideas for more tests appreciated. 😄

david-arm added a subscriber: david-arm. Feb 18 2021, 8:44 AM

david-arm added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13531	Is it worth naming this something like combineSVEIntrinsicBinOp and similarly for FP you could have combineSVEIntrinsicFPBinOp? A bit like SelectionDAG::simplifyFPBinop. The reason I mention this is that I can imagine you wanting similar things for divides, adds at some point too, i.e. fdiv X, 1.0 -> X or fadd X, 0.0 -> X
13575	Perhaps you can commonise the check for "ptrue all" that appears in both functions? Also, it might be worth inverting the logic and moving to the start of the function, i.e.

    SDLoc DL(N);
    SDValue Pg = N->getOperand(1);
    if (!ptrueAllIntrinsic(Pg))
      return SDValue();

This avoids the extra indentation. Not sure what you think?
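
The first comment above mentions extending the fold to divides and adds. As a hedged aside, the scalar behaviour of the candidate identity elements can be checked in plain C++. This is a sketch under IEEE 754 default rounding, not LLVM code, and the helper names are hypothetical. It suggests that multiplying or dividing by 1.0 preserves the sign of zero, whereas adding +0.0 turns -0.0 into +0.0, so an fadd fold would need -0.0 as the neutral operand or some guarantee that signed zeros do not matter.

```cpp
#include <cassert>
#include <cmath>

// True when a and b are equal including the sign of zero.
// NaN inputs are outside the scope of this sketch.
bool same_value(double a, double b) {
  return a == b && std::signbit(a) == std::signbit(b);
}

bool mul_one_is_identity(double x) { return same_value(x * 1.0, x); }
bool div_one_is_identity(double x) { return same_value(x / 1.0, x); }
bool add_pos_zero_is_identity(double x) { return same_value(x + 0.0, x); }
bool add_neg_zero_is_identity(double x) { return same_value(x + (-0.0), x); }

void demo() {
  // Multiplying or dividing by one keeps -0.0 as -0.0.
  assert(mul_one_is_identity(-0.0));
  assert(div_one_is_identity(-0.0));
  // -0.0 + 0.0 is +0.0, so +0.0 is not neutral for every input.
  assert(!add_pos_zero_is_identity(-0.0));
  // -0.0 is neutral for both zeros.
  assert(add_neg_zero_is_identity(-0.0));
  assert(add_neg_zero_is_identity(0.0));
}
```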

Harbormaster completed remote builds in B89733: Diff 324646. Feb 18 2021, 9:08 AM

Address @david-arm's comments:

- More general implementation.
- Factor out checking for ptrue(SV_ALL) intrinsic calls to isPTrueAllIntrinsic.
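
The shape of that refactoring can be sketched on a toy node type. This is an illustration in standard C++, not the updated patch: Node, Kind, kPatternAll, isPTrueAll, isUnitDup and combineMul are hypothetical names, and the only detail taken from the revision is that the tests pass 31 as the "all" pattern. The sketch shows a shared predicate check used as an early return, so the fold itself stays unindented.

```cpp
#include <cassert>

// Toy stand-ins for DAG nodes; none of these types are LLVM's.
enum class Kind { Input, PTrue, Dup, Mul };

// Pattern immediate the tests in this revision use for an all-true ptrue.
constexpr int kPatternAll = 31;

struct Node {
  Kind kind;
  int imm;             // PTrue: pattern; Dup: splatted constant
  const Node *ops[3];  // Mul: predicate, first vector, second vector
};

// Shared check, in the spirit of the factored-out helper.
bool isPTrueAll(const Node *n) {
  return n && n->kind == Kind::PTrue && n->imm == kPatternAll;
}

bool isUnitDup(const Node *n) {
  return n && n->kind == Kind::Dup && n->imm == 1;
}

// Returns the node that can replace `mul`, or nullptr when no fold applies.
const Node *combineMul(const Node *mul) {
  assert(mul && mul->kind == Kind::Mul);
  // Guard clause: bail out first when the predicate is not all-true.
  if (!isPTrueAll(mul->ops[0]))
    return nullptr;
  // mul (ptrue all) (dup 1) V => V
  if (isUnitDup(mul->ops[1]))
    return mul->ops[2];
  // mul (ptrue all) V (dup 1) => V
  if (isUnitDup(mul->ops[2]))
    return mul->ops[1];
  return nullptr;
}
```

A floating-point variant would differ only in how the unit splat is recognised, which is one way to read the "more general implementation" item above.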

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13531	This is a great suggestion, thank you! Done.
13575	Agreed -- done. 😄

Harbormaster completed remote builds in B89929: Diff 324998. Feb 19 2021, 9:23 AM

After some discussion with @paulwalker-arm we think this might be better placed as an IR-level optimisation in SVEIntrinsicOpts.cpp. Therefore, I am going to abandon this revision and create a new one. 🙂

Revision Contents

Path                                                     Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp          70 lines
llvm/test/CodeGen/AArch64/sve-mul-fmul-idempotency.ll    164 lines

Diff 324646

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

[13,522 lines not shown] static SDValue combineSVEReductionFP(SDNode *N, unsigned Opc,

   // SVE reductions set the whole vector register with the first element
   // containing the reduction result, which we'll now extract.
   SDValue Zero = DAG.getConstant(0, DL, MVT::i64);
   return DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, N->getValueType(0), Reduce,
                      Zero);
 }

+static SDValue combineSVEMul(SDNode *N, unsigned Opc, SelectionDAG &DAG) {
+  SDLoc DL(N);
+
+  SDValue Pg = N->getOperand(1);
+  SDValue VecA = N->getOperand(2);
+  SDValue VecB = N->getOperand(3);
+  SDNode *PgNode = Pg.getNode();
+  SDNode *VecANode = VecA.getNode();
+  SDNode *VecBNode = VecB.getNode();
+
+  // If Pg is an all-true predicate...
+  if (getIntrinsicID(PgNode) == Intrinsic::aarch64_sve_ptrue &&
+      Pg.getConstantOperandVal(1) == AArch64SVEPredPattern::all) {
+
+    const auto IsUnitDup = [&](SDNode *N) {
+      if (getIntrinsicID(N) == Intrinsic::aarch64_sve_dup_x) {
+        auto *DupOperand = cast<ConstantSDNode>(N->getOperand(1));
+        return DupOperand->isOne();
+      }
+      return false;
+    };
+
+    // mul (ptrue sv_all) (dup 1) V => V
+    if (IsUnitDup(VecANode))
+      return VecB;
+    // mul (ptrue sv_all) V (dup 1) => V
+    if (IsUnitDup(VecBNode))
+      return VecA;
+  }
+
+  return SDValue();
+}
+
+static SDValue combineSVEFMul(SDNode *N, unsigned Opc, SelectionDAG &DAG) {
+  SDLoc DL(N);
+
+  SDValue Pg = N->getOperand(1);
+  SDValue VecA = N->getOperand(2);
+  SDValue VecB = N->getOperand(3);
+  SDNode *PgNode = Pg.getNode();
+  SDNode *VecANode = VecA.getNode();
+  SDNode *VecBNode = VecB.getNode();
+
+  // If Pg is an all-true predicate...
+  if (getIntrinsicID(PgNode) == Intrinsic::aarch64_sve_ptrue &&
+      Pg.getConstantOperandVal(1) == AArch64SVEPredPattern::all) {
+
+    const auto IsUnitDup = [&](SDNode *N) {
+      if (getIntrinsicID(N) == Intrinsic::aarch64_sve_dup_x) {
+        auto *DupOperand = cast<ConstantFPSDNode>(N->getOperand(1));
+        return DupOperand->isExactlyValue(1.0);
+      }
+      return false;
+    };
+
+    // fmul (ptrue sv_all) (dup 1.0) V => V
+    if (IsUnitDup(VecANode))
+      return VecB;
+    // fmul (ptrue sv_all) V (dup 1.0) => V
+    if (IsUnitDup(VecBNode))
+      return VecA;
+  }
+
+  return SDValue();
+}
+
 static SDValue combineSVEReductionOrderedFP(SDNode *N, unsigned Opc,
                                             SelectionDAG &DAG) {
   SDLoc DL(N);

   SDValue Pred = N->getOperand(1);
   SDValue InitVal = N->getOperand(2);
   SDValue VecToReduce = N->getOperand(3);
   EVT ReduceVT = VecToReduce.getValueType();
[173 lines not shown] static SDValue performIntrinsicCombine(SDNode *N,
   case Intrinsic::aarch64_sve_fmaxnmv:
     return combineSVEReductionFP(N, AArch64ISD::FMAXNMV_PRED, DAG);
   case Intrinsic::aarch64_sve_fmaxv:
     return combineSVEReductionFP(N, AArch64ISD::FMAXV_PRED, DAG);
   case Intrinsic::aarch64_sve_fminnmv:
     return combineSVEReductionFP(N, AArch64ISD::FMINNMV_PRED, DAG);
   case Intrinsic::aarch64_sve_fminv:
     return combineSVEReductionFP(N, AArch64ISD::FMINV_PRED, DAG);
+  case Intrinsic::aarch64_sve_mul:
+    return combineSVEMul(N, AArch64ISD::MUL_PRED, DAG);
+  case Intrinsic::aarch64_sve_fmul:
+    return combineSVEFMul(N, AArch64ISD::FMUL_PRED, DAG);
   case Intrinsic::aarch64_sve_sel:
     return DAG.getNode(ISD::VSELECT, SDLoc(N), N->getValueType(0),
                        N->getOperand(1), N->getOperand(2), N->getOperand(3));
   case Intrinsic::aarch64_sve_cmpeq_wide:
     return tryConvertSVEWideCompare(N, ISD::SETEQ, DCI, DAG);
   case Intrinsic::aarch64_sve_cmpne_wide:
     return tryConvertSVEWideCompare(N, ISD::SETNE, DCI, DAG);
   case Intrinsic::aarch64_sve_cmpge_wide:
[3,536 lines not shown]

llvm/test/CodeGen/AArch64/sve-mul-fmul-idempotency.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+sve < %s 2>%t | FileCheck %s
				; RUN: FileCheck --check-prefix=WARN --allow-empty %s <%t

				; If this check fails please read test/CodeGen/AArch64/README for instructions on how to resolve it.
				; WARN-NOT: warning

				; Idempotent muls -- should compile to just a ret.
				define <vscale x 8 x i16> @idempotent_mul_0(<vscale x 8 x i16> %a) {
				; CHECK-LABEL: idempotent_mul_0:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
				%2 = call <vscale x 8 x i16> @llvm.aarch64.sve.dup.x.nxv8i16(i16 1)
				%3 = call <vscale x 8 x i16> @llvm.aarch64.sve.mul.nxv8i16(<vscale x 8 x i1> %1, <vscale x 8 x i16> %a, <vscale x 8 x i16> %2)
				ret <vscale x 8 x i16> %3
				}

				define <vscale x 4 x i32> @idempotent_mul_1(<vscale x 4 x i32> %a) {
				; CHECK-LABEL: idempotent_mul_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
				%2 = call <vscale x 4 x i32> @llvm.aarch64.sve.dup.x.nxv4i32(i32 1)
				%3 = call <vscale x 4 x i32> @llvm.aarch64.sve.mul.nxv4i32(<vscale x 4 x i1> %1, <vscale x 4 x i32> %a, <vscale x 4 x i32> %2)
				ret <vscale x 4 x i32> %3
				}

				define <vscale x 2 x i64> @idempotent_mul_2(<vscale x 2 x i64> %a) {
				; CHECK-LABEL: idempotent_mul_2:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%1 = call <vscale x 2 x i1> @llvm.aarch64.sve.ptrue.nxv2i1(i32 31)
				%2 = call <vscale x 2 x i64> @llvm.aarch64.sve.dup.x.nxv2i64(i64 1)
				%3 = call <vscale x 2 x i64> @llvm.aarch64.sve.mul.nxv2i64(<vscale x 2 x i1> %1, <vscale x 2 x i64> %a, <vscale x 2 x i64> %2)
				ret <vscale x 2 x i64> %3
				}

				define <vscale x 2 x i64> @idempotent_mul_3(<vscale x 2 x i64> %a) {
				; CHECK-LABEL: idempotent_mul_3:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%1 = call <vscale x 2 x i1> @llvm.aarch64.sve.ptrue.nxv2i1(i32 31)
				%2 = call <vscale x 2 x i64> @llvm.aarch64.sve.dup.x.nxv2i64(i64 1)
				; Different argument order to the above tests.
				%3 = call <vscale x 2 x i64> @llvm.aarch64.sve.mul.nxv2i64(<vscale x 2 x i1> %1, <vscale x 2 x i64> %2, <vscale x 2 x i64> %a)
				ret <vscale x 2 x i64> %3
				}

				; Idempotent fmuls -- should compile to just a ret.
				define <vscale x 8 x half> @idempotent_fmul_0(<vscale x 8 x half> %a) {
				; CHECK-LABEL: idempotent_fmul_0:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
				%2 = call <vscale x 8 x half> @llvm.aarch64.sve.dup.x.nxv8f16(half 1.0)
				%3 = call <vscale x 8 x half> @llvm.aarch64.sve.fmul.nxv8f16(<vscale x 8 x i1> %1, <vscale x 8 x half> %a, <vscale x 8 x half> %2)
				ret <vscale x 8 x half> %3
				}

				define <vscale x 4 x float> @idempotent_fmul_1(<vscale x 4 x float> %a) {
				; CHECK-LABEL: idempotent_fmul_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
				%2 = call <vscale x 4 x float> @llvm.aarch64.sve.dup.x.nxv4f32(float 1.0)
				%3 = call <vscale x 4 x float> @llvm.aarch64.sve.fmul.nxv4f32(<vscale x 4 x i1> %1, <vscale x 4 x float> %a, <vscale x 4 x float> %2)
				ret <vscale x 4 x float> %3
				}

				define <vscale x 2 x double> @idempotent_fmul_2(<vscale x 2 x double> %a) {
				; CHECK-LABEL: idempotent_fmul_2:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%1 = call <vscale x 2 x i1> @llvm.aarch64.sve.ptrue.nxv2i1(i32 31)
				%2 = call <vscale x 2 x double> @llvm.aarch64.sve.dup.x.nxv2f64(double 1.0)
				%3 = call <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %1, <vscale x 2 x double> %a, <vscale x 2 x double> %2)
				ret <vscale x 2 x double> %3
				}

				define <vscale x 2 x double> @idempotent_fmul_3(<vscale x 2 x double> %a) {
				; CHECK-LABEL: idempotent_fmul_3:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ret
				%1 = call <vscale x 2 x i1> @llvm.aarch64.sve.ptrue.nxv2i1(i32 31)
				%2 = call <vscale x 2 x double> @llvm.aarch64.sve.dup.x.nxv2f64(double 1.0)
				; Different argument order to the above tests.
				%3 = call <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1> %1, <vscale x 2 x double> %2, <vscale x 2 x double> %a)
				ret <vscale x 2 x double> %3
				}

				; Non-idempotent muls -- we don't expect these to be optimised out.
				define <vscale x 8 x i16> @non_idempotent_mul_0(<vscale x 8 x i16> %a) {
				; CHECK-LABEL: non_idempotent_mul_0:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h
				; CHECK-NEXT: mov z1.h, #2 // =0x2
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: ret
				%1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 31)
				%2 = call <vscale x 8 x i16> @llvm.aarch64.sve.dup.x.nxv8i16(i16 2)
				%3 = call <vscale x 8 x i16> @llvm.aarch64.sve.mul.nxv8i16(<vscale x 8 x i1> %1, <vscale x 8 x i16> %a, <vscale x 8 x i16> %2)
				ret <vscale x 8 x i16> %3
				}

				define <vscale x 4 x i32> @non_idempotent_mul_1(<vscale x 4 x i32> %a) {
				; CHECK-LABEL: non_idempotent_mul_1:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.s
				; CHECK-NEXT: mov z1.s, #2 // =0x2
				; CHECK-NEXT: mul z0.s, p0/m, z0.s, z1.s
				; CHECK-NEXT: ret
				%1 = call <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 31)
				%2 = call <vscale x 4 x i32> @llvm.aarch64.sve.dup.x.nxv4i32(i32 2)
				%3 = call <vscale x 4 x i32> @llvm.aarch64.sve.mul.nxv4i32(<vscale x 4 x i1> %1, <vscale x 4 x i32> %a, <vscale x 4 x i32> %2)
				ret <vscale x 4 x i32> %3
				}

				define <vscale x 2 x i64> @non_idempotent_mul_2(<vscale x 2 x i64> %a) {
				; CHECK-LABEL: non_idempotent_mul_2:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: mov z1.d, #2 // =0x2
				; CHECK-NEXT: mul z0.d, p0/m, z0.d, z1.d
				; CHECK-NEXT: ret
				%1 = call <vscale x 2 x i1> @llvm.aarch64.sve.ptrue.nxv2i1(i32 31)
				%2 = call <vscale x 2 x i64> @llvm.aarch64.sve.dup.x.nxv2i64(i64 2)
				%3 = call <vscale x 2 x i64> @llvm.aarch64.sve.mul.nxv2i64(<vscale x 2 x i1> %1, <vscale x 2 x i64> %a, <vscale x 2 x i64> %2)
				ret <vscale x 2 x i64> %3
				}

				define <vscale x 8 x i16> @non_idempotent_mul_3(<vscale x 8 x i16> %a) {
				; Uses a predicate that is not all true, so it shouldn't be optimized out.
				; CHECK-LABEL: non_idempotent_mul_3:
				; CHECK: // %bb.0:
				; CHECK-NEXT: ptrue p0.h, pow2
				; CHECK-NEXT: mov z1.h, #1 // =0x1
				; CHECK-NEXT: mul z0.h, p0/m, z0.h, z1.h
				; CHECK-NEXT: ret
				%1 = call <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 0)
				%2 = call <vscale x 8 x i16> @llvm.aarch64.sve.dup.x.nxv8i16(i16 1)
				%3 = call <vscale x 8 x i16> @llvm.aarch64.sve.mul.nxv8i16(<vscale x 8 x i1> %1, <vscale x 8 x i16> %a, <vscale x 8 x i16> %2)
				ret <vscale x 8 x i16> %3
				}

				declare <vscale x 16 x i1> @llvm.aarch64.sve.convert.to.svbool.nxv2i1(<vscale x 2 x i1>)
				declare <vscale x 2 x i1> @llvm.aarch64.sve.ptrue.nxv2i1(i32 immarg)
				declare <vscale x 4 x i1> @llvm.aarch64.sve.ptrue.nxv4i1(i32 immarg)
				declare <vscale x 8 x i1> @llvm.aarch64.sve.ptrue.nxv8i1(i32 immarg)

				declare <vscale x 8 x i16> @llvm.aarch64.sve.dup.x.nxv8i16(i16)
				declare <vscale x 4 x i32> @llvm.aarch64.sve.dup.x.nxv4i32(i32)
				declare <vscale x 2 x i64> @llvm.aarch64.sve.dup.x.nxv2i64(i64)
				declare <vscale x 8 x half> @llvm.aarch64.sve.dup.x.nxv8f16(half)
				declare <vscale x 4 x float> @llvm.aarch64.sve.dup.x.nxv4f32(float)
				declare <vscale x 2 x double> @llvm.aarch64.sve.dup.x.nxv2f64(double)

				declare <vscale x 8 x i16> @llvm.aarch64.sve.mul.nxv8i16(<vscale x 8 x i1>, <vscale x 8 x i16>, <vscale x 8 x i16>)
				declare <vscale x 4 x i32> @llvm.aarch64.sve.mul.nxv4i32(<vscale x 4 x i1>, <vscale x 4 x i32>, <vscale x 4 x i32>)
				declare <vscale x 2 x i64> @llvm.aarch64.sve.mul.nxv2i64(<vscale x 2 x i1>, <vscale x 2 x i64>, <vscale x 2 x i64>)

				declare <vscale x 8 x half> @llvm.aarch64.sve.fmul.nxv8f16(<vscale x 8 x i1>, <vscale x 8 x half>, <vscale x 8 x half>)
				declare <vscale x 4 x float> @llvm.aarch64.sve.fmul.nxv4f32(<vscale x 4 x i1>, <vscale x 4 x float>, <vscale x 4 x float>)
				declare <vscale x 2 x double> @llvm.aarch64.sve.fmul.nxv2f64(<vscale x 2 x i1>, <vscale x 2 x double>, <vscale x 2 x double>)