This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Combine multiple FDIVs with the same divisor
ClosedPublic

Authored by • HaoLiu on Nov 19 2014, 11:04 PM.


Details

Reviewers

t.p.northover
hfinkel

Summary

Hi Tim and other reviewers,

This patch tries to combine multiple FDIVs with the same divisor into one FDIV and multiple FMULs. This can improve performance because an FMUL is much faster than an FDIV.
E.g. we combine:

a / D;
b / D;
c / D;

into:

recip = 1.0 / D;
a * recip;
b * recip;
c * recip;

This is not always beneficial: the critical path grows from one FDIV to one FDIV plus one FMUL, which may cause regressions. So this patch only performs the combine when there are more than two FDIVs.
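As a minimal source-level sketch of the rewrite described above (the function names `divideEach` and `multiplyByReciprocal` are illustrative and not part of the patch), the two forms look like this in C++. The two forms are not guaranteed to agree bit for bit, because `a * (1.0 / D)` rounds twice where `a / D` rounds once, which is why the patch only performs the combine under unsafe FP math.

```cpp
#include <array>

// Straightforward form: one division per dividend (three FDIVs).
std::array<double, 3> divideEach(double a, double b, double c, double D) {
  return {a / D, b / D, c / D};
}

// Combined form: a single division to form the reciprocal, then one
// multiply per dividend (one FDIV and three FMULs). Each result now
// depends on the reciprocal, so the critical path is FDIV followed by FMUL.
std::array<double, 3> multiplyByReciprocal(double a, double b, double c,
                                           double D) {
  const double recip = 1.0 / D;
  return {a * recip, b * recip, c * recip};
}
```

For a divisor such as 4.0, whose reciprocal is exactly representable, both functions return identical values; for a divisor such as 3.0 the results may differ in the last bit.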

This patch only benefits certain benchmarks.
Our performance tests on Cortex-A57 show that only the SPEC2006 benchmark 188.ammp improves, by 2.5%-3.0%.

Review please.

Thanks,
-Hao

Diff Detail

Event Timeline

• HaoLiu updated this revision to Diff 16414. Nov 19 2014, 11:04 PM

• HaoLiu retitled this revision from to [AArch64] Combine multiple FDIVs with the same divisor.

• HaoLiu updated this object.

• HaoLiu edited the test plan for this revision.

• HaoLiu added a reviewer: t.p.northover.

• HaoLiu added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Nov 19 2014, 11:04 PM

For reference, Ben proposed a target-independent solution here:
http://llvm.org/bugs/show_bug.cgi?id=16218

Unfortunately, it never made it into mainline because there were some regressions that needed inspection. The obvious advantage of the target-independent approach is that it works for all targets and across basic blocks. However, performing this transform across basic blocks may not matter in practice, as Duncan pointed out. Also, it may prevent other IR transforms from firing.

Can you please put this in DAGCombine, and let the target optionally enable it? The target can customize this:

// Skip if there is less than three FDIVs.
// FIXME: Different subtargets may behave differently. This can be
// controlled depending on subtargets.
if (Users.size() < 3)

I think that adding something like:

virtual bool combineRepeatedFPDivisors(unsigned &MinUsers) {
  return false;
}

added in TargetLowering.h right around the existing division functions would work well (I'm not attached to the name, feel free to propose some other name). I'd like to use this for PPC too.
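A rough sketch of how such a hook might be consumed, assuming the signature proposed above. `MockTargetLowering`, `MockAArch64Lowering`, and `shouldCombine` are hypothetical stand-ins written for illustration, not LLVM classes or functions, and the `const` qualifier is an addition made here so the hook can be called through a const reference. The threshold of 3 mirrors the `Users.size() < 3` check quoted above.

```cpp
// Hypothetical stand-in for the TargetLowering interface (not the LLVM class).
struct MockTargetLowering {
  virtual ~MockTargetLowering() = default;

  // Default: the target does not ask for repeated divisors to be combined.
  virtual bool combineRepeatedFPDivisors(unsigned &MinUsers) const {
    return false;
  }
};

// Hypothetical target that opts in and requires at least three FDIVs.
struct MockAArch64Lowering : MockTargetLowering {
  bool combineRepeatedFPDivisors(unsigned &MinUsers) const override {
    MinUsers = 3;
    return true;
  }
};

// What a target-independent combine could do with the hook: bail out unless
// the target opted in and enough FDIVs share the divisor.
bool shouldCombine(const MockTargetLowering &TLI, unsigned NumUsers) {
  unsigned MinUsers = 0;
  if (!TLI.combineRepeatedFPDivisors(MinUsers))
    return false;
  return NumUsers >= MinUsers;
}
```

With this shape, a target that never overrides the hook keeps its current behavior, while a target that opts in also chooses its own threshold.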

Just chiming in for the target-independent path: x86 needs this too.

I don't understand the expense argument cited here:
http://llvm.org/bugs/show_bug.cgi?id=16218#c4

Is checking uses in InstCombine more expensive than in DAGCombine? This transform is only firing with fast FP-math. Does that make it any more tolerable?

• HaoLiu updated this revision to Diff 16468. Nov 20 2014, 7:28 PM

In D6334#5, @mcrosier wrote:

For reference, Ben proposed a target-independent solution here:
http://llvm.org/bugs/show_bug.cgi?id=16218

Unfortunately, it never made it into mainline because there were some regressions that needed inspection. The obvious advantage of the target-independent approach is that it works for all targets and across basic blocks. However, performing this transform across basic blocks may not matter in practice, as Duncan pointed out. Also, it may prevent other IR transforms from firing.

Hi Chad,

Yes, if it crosses basic blocks, it may sometimes cause regressions. Also, since this is a target-specific problem, we think it's better to do the combine in CodeGen.

Thanks,
-Hao

In D6334#7, @hfinkel wrote:

Can you please put this in DAGCombine, and let the target optionally enable it? The target can customize this:

// Skip if there is less than three FDIVs.
// FIXME: Different subtargets may behave differently. This can be
// controlled depending on subtargets.
if (Users.size() < 3)

I think that adding something like:

virtual bool combineRepeatedFPDivisors(unsigned &MinUsers) {
  return false;
}

added in TargetLowering.h right around the existing division functions would work well (I'm not attached to the name, feel free to propose some other name). I'd like to use this for PPC too.

Hi Hale,

That's a good idea.

I've attached a new patch moving the logic into DAGCombiner.

Thanks,
-hao

Please change how the comparison is done as noted below, otherwise LGTM.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7115 (On Diff #16468)
Don't do the comparison by creating new nodes. You can:

if (auto *CN0 = dyn_cast<ConstantFPSDNode>(N0)) {
  if (CN0->isExactlyValue(1.0))
    return SDValue();
}

Create the FPOne later if you need it.

This revision is now accepted and ready to land. Nov 20 2014, 7:36 PM

In D6334#9, @spatel wrote:

Just chiming in for the target-independent path: x86 needs this too.

I don't understand the expense argument cited here:
http://llvm.org/bugs/show_bug.cgi?id=16218#c4

Is checking uses in InstCombine more expensive than in DAGCombine? This transform is only firing with fast FP-math. Does that make it any more tolerable?

Hi Sanjay,

I think it is a target-specific problem. Some targets may not want this combine.
E.g. we combine "N FDIVs" into "1 FDIV and N FMULs". One shortcoming is that there is one more instruction. Also, different targets may have different costs for FDIV and FMUL. Even if we assume that an FMUL is faster than an FDIV, "2 FDIVs" may still be faster than "1 FDIV and 2 FMULs".

So I think it's better to solve this in the backend.

Thanks,
-Hao
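The cost argument above can be made concrete with a deliberately simple model that sums instruction costs and ignores pipelining and the longer critical path. The `FPCosts` struct, the helper names, and the cycle counts used in the usage note are assumptions chosen for illustration, not measurements of any real core.

```cpp
// Hypothetical per-instruction costs in cycles; real values depend on the
// target and are not taken from this review.
struct FPCosts {
  unsigned FDiv;
  unsigned FMul;
};

// Total cost of N divisions by the same divisor: N FDIVs.
unsigned costOfDivisions(const FPCosts &C, unsigned N) { return N * C.FDiv; }

// Total cost after the combine: one FDIV for the reciprocal plus N FMULs.
unsigned costOfReciprocalForm(const FPCosts &C, unsigned N) {
  return C.FDiv + N * C.FMul;
}

// Under this model the combine pays off only if it strictly lowers the total.
bool combineIsProfitable(const FPCosts &C, unsigned N) {
  return costOfReciprocalForm(C, N) < costOfDivisions(C, N);
}
```

With assumed costs of 7 cycles for FDIV and 4 for FMUL, two divisions cost 14 against 15 for the reciprocal form, so the combine loses; with three divisions it is 21 against 19, so it wins. A target with a different FDIV to FMUL ratio would cross over at a different count, which is the reason for leaving the threshold to the backend.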

In D6334#14, @hfinkel wrote:

Please change how the comparison is done as noted below, otherwise LGTM.

Hi Hale,

Nice suggestion.
The patch has been modified and committed in http://llvm.org/viewvc/llvm-project?view=revision&revision=222510.

Thanks,
-Hao

weimingz added a subscriber: weimingz. Nov 21 2014, 12:05 AM

Hi Hao,

Thanks for the patch. I think ARM needs it too.

• HaoLiu closed this revision. Dec 21 2014, 10:29 PM

Revision Contents

Path                                          Size
lib/Target/AArch64/AArch64ISelLowering.cpp    70 lines
test/CodeGen/AArch64/fdiv-combine.ll          145 lines

Diff 16414

lib/Target/AArch64/AArch64ISelLowering.cpp


...
AArch64TargetLowering::AArch64TargetLowering(const TargetMachine &TM)
setTargetDAGCombine(ISD::ANY_EXTEND);
setTargetDAGCombine(ISD::ZERO_EXTEND);
setTargetDAGCombine(ISD::SIGN_EXTEND);
setTargetDAGCombine(ISD::BITCAST);
setTargetDAGCombine(ISD::CONCAT_VECTORS);
setTargetDAGCombine(ISD::STORE);

setTargetDAGCombine(ISD::MUL);
		setTargetDAGCombine(ISD::FDIV);

setTargetDAGCombine(ISD::SELECT);
setTargetDAGCombine(ISD::VSELECT);

setTargetDAGCombine(ISD::INTRINSIC_VOID);
setTargetDAGCombine(ISD::INTRINSIC_W_CHAIN);
setTargetDAGCombine(ISD::INSERT_VECTOR_ELT);

...
if (Value.isNonNegative()) {
return DAG.getNode(ISD::SUB, SDLoc(N), VT, N->getOperand(0),
ShiftedVal);
}
}
}
return SDValue();
}

		// As FMUL is much faster than FDIV, we combine multiple FDIVs with the same
		// divisor to a reciprocal and multiple FMULs with the reciprocal.
		// E.g. ( a / D; b / D; c / D; )
		// =>
		// ( recip = 1.0 / D; a * recip; b * recip; c * recip )
		// Notice that one shortcoming is the critical path increases from "one FDIV"
		// to "one FDIV + one FMUL", which may cause regressions on some benchmarks. To
		// reduce regressions, we only do such combine when there are more than two
		// FDIVs.
		// If one of the FDIV is reciprocal, we reuse it directly.
		// E.g. ( recip = 1.0 / D; c = a / D; )
		// =>
		// ( recip = 1.0 / D; c = a * recip; )
		static SDValue performFDIVCombine(SDNode *N, SelectionDAG &DAG) {
		// Only do such combine when unsafe fp math is enabled.
		if (!DAG.getTarget().Options.UnsafeFPMath)
		return SDValue();

		SDValue Dividend = N->getOperand(0);
		SDValue Divisor = N->getOperand(1);
		EVT VT = N->getValueType(0);

		SDValue FPOne = DAG.getConstantFP(1.0, VT); // floating point 1.0
		// Skip if current Node is a reciprocal.
		if (Dividend == FPOne)
		return SDValue();

		SDValue Reciprocal = SDValue();
		SmallVector<SDNode *, 4> Users;
		// Collect all non-reciprocal users of Divisor. Also find if there is already
		// a reciprocal of Divisor. If so, we can reuse the reciprocal instead of
		// creating a new one.
		for (SDNode::use_iterator UI = Divisor.getNode()->use_begin(),
		UE = Divisor.getNode()->use_end();
		UI != UE; ++UI) {
		SDNode *User = UI.getUse().getUser();
		if (User->getOpcode() == ISD::FDIV && User->getOperand(1) == Divisor) {
		if (User->getOperand(0) == FPOne)
		Reciprocal = SDValue(User, 0);
		else
		Users.push_back(User);
		}
		}

		if (Reciprocal == SDValue()) {
		// Skip if there is less than three FDIVs.
		// FIXME: Different subtargets may behave differently. This can be
		// controlled depending on subtargets.
		if (Users.size() < 3)
		return SDValue();
		// Create a reciprocal of Divisor if there is no such reciprocal.
		Reciprocal = DAG.getNode(ISD::FDIV, SDLoc(N), VT, FPOne, Divisor);
		} else if (Users.size() == 0) {
		// Skip if there is no other users except the reciprocal.
		return SDValue();
		}

		// Dividend / Divisor => Dividend * Reciprocal
		for (auto I = Users.begin(), E = Users.end(); I != E; ++I) {
		SDValue NewNode =
		DAG.getNode(ISD::FMUL, SDLoc(*I), VT, (*I)->getOperand(0), Reciprocal);
		DAG.ReplaceAllUsesWith(*I, NewNode.getNode());
		}

		return SDValue();
		}

static SDValue performVectorCompareAndMaskUnaryOpCombine(SDNode *N,
SelectionDAG &DAG) {
// Take advantage of vector comparisons producing 0 or -1 in each lane to
// optimize away operation when it's from a constant.
//
// The general transformation is:
// UNARYOP(AND(VECTOR_CMP(x,y), constant)) -->
// AND(VECTOR_CMP(x,y), constant2)
...
default:
break;
case ISD::ADD:
case ISD::SUB:
return performAddSubLongCombine(N, DCI, DAG);
case ISD::XOR:
return performXorCombine(N, DAG, DCI, Subtarget);
case ISD::MUL:
return performMulCombine(N, DAG, DCI, Subtarget);
		case ISD::FDIV:
		return performFDIVCombine(N, DAG);
case ISD::SINT_TO_FP:
case ISD::UINT_TO_FP:
return performIntToFpCombine(N, DAG);
case ISD::OR:
return performORCombine(N, DCI, Subtarget);
case ISD::INTRINSIC_WO_CHAIN:
return performIntrinsicCombine(N, DCI, Subtarget);
case ISD::ANY_EXTEND:
...

test/CodeGen/AArch64/fdiv-combine.ll

This file was added.

				; RUN: llc -march=aarch64 < %s | FileCheck %s

				; Following test cases check:
				; a / D; b / D; c / D;
				; =>
				; recip = 1.0 / D; a * recip; b * recip; c * recip;
				define void @three_fdiv_float(float %D, float %a, float %b, float %c) #0 {
				; CHECK-LABEL: three_fdiv_float:
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fdiv
				; CHECK: fmul
				; CHECK: fmul
				; CHECK: fmul
				%div = fdiv float %a, %D
				%div1 = fdiv float %b, %D
				%div2 = fdiv float %c, %D
				tail call void @foo_3f(float %div, float %div1, float %div2)
				ret void
				}

				define void @three_fdiv_double(double %D, double %a, double %b, double %c) #0 {
				; CHECK-LABEL: three_fdiv_double:
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fdiv
				; CHECK: fmul
				; CHECK: fmul
				; CHECK: fmul
				%div = fdiv double %a, %D
				%div1 = fdiv double %b, %D
				%div2 = fdiv double %c, %D
				tail call void @foo_3d(double %div, double %div1, double %div2)
				ret void
				}

				define void @three_fdiv_4xfloat(<4 x float> %D, <4 x float> %a, <4 x float> %b, <4 x float> %c) #0 {
				; CHECK-LABEL: three_fdiv_4xfloat:
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fdiv
				; CHECK: fmul
				; CHECK: fmul
				; CHECK: fmul
				%div = fdiv <4 x float> %a, %D
				%div1 = fdiv <4 x float> %b, %D
				%div2 = fdiv <4 x float> %c, %D
				tail call void @foo_3_4xf(<4 x float> %div, <4 x float> %div1, <4 x float> %div2)
				ret void
				}

				define void @three_fdiv_2xdouble(<2 x double> %D, <2 x double> %a, <2 x double> %b, <2 x double> %c) #0 {
				; CHECK-LABEL: three_fdiv_2xdouble:
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fdiv
				; CHECK: fmul
				; CHECK: fmul
				; CHECK: fmul
				%div = fdiv <2 x double> %a, %D
				%div1 = fdiv <2 x double> %b, %D
				%div2 = fdiv <2 x double> %c, %D
				tail call void @foo_3_2xd(<2 x double> %div, <2 x double> %div1, <2 x double> %div2)
				ret void
				}

				; Following test cases check we never combine two FDIVs if neither of them
				; calculates a reciprocal.
				define void @two_fdiv_float(float %D, float %a, float %b) #0 {
				; CHECK-LABEL: two_fdiv_float:
				; CHECK: fdiv
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fmul
				%div = fdiv float %a, %D
				%div1 = fdiv float %b, %D
				tail call void @foo_2f(float %div, float %div1)
				ret void
				}

				define void @two_fdiv_double(double %D, double %a, double %b) #0 {
				; CHECK-LABEL: two_fdiv_double:
				; CHECK: fdiv
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fmul
				%div = fdiv double %a, %D
				%div1 = fdiv double %b, %D
				tail call void @foo_2d(double %div, double %div1)
				ret void
				}

				; Following test cases check
				; recip = 1.0 / D; c = a / D;
				; =>
				; recip = 1.0 / D; c = a * recip;
				define void @recip_fdiv_float(float %D, float %a) #0 {
				; CHECK-LABEL: recip_fdiv_float:
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fdiv
				; CHECK: fmul
				%div = fdiv float 1.000000e+00, %D
				%div1 = fdiv float %a, %D
				tail call void @foo_2f(float %div, float %div1)
				ret void
				}

				define void @recip_fdiv_double(double %D, double %a) #0 {
				; CHECK-LABEL: recip_fdiv_double:
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fdiv
				; CHECK: fmul
				%div = fdiv double 1.000000e+00, %D
				%div1 = fdiv double %a, %D
				tail call void @foo_2d(double %div, double %div1)
				ret void
				}

				define void @recip_fdiv_4xfloat(<4 x float> %D, <4 x float> %a) #0 {
				; CHECK-LABEL: recip_fdiv_4xfloat:
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fdiv
				; CHECK: fmul
				%div = fdiv <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %D
				%div1 = fdiv <4 x float> %a, %D
				tail call void @foo_2_4xf(<4 x float> %div, <4 x float> %div1)
				ret void
				}

				define void @recip_fdiv_2xdouble(<2 x double> %D, <2 x double> %a) #0 {
				; CHECK-LABEL: recip_fdiv_2xdouble:
				; CHECK: fdiv
				; CHECK-NEXT-NOT: fdiv
				; CHECK: fmul
				%div = fdiv <2 x double> <double 1.000000e+00, double 1.000000e+00>, %D
				%div1 = fdiv <2 x double> %a, %D
				tail call void @foo_2_2xd(<2 x double> %div, <2 x double> %div1)
				ret void
				}

				declare void @foo_3f(float, float, float)
				declare void @foo_3d(double, double, double)
				declare void @foo_3_4xf(<4 x float>, <4 x float>, <4 x float>)
				declare void @foo_3_2xd(<2 x double>, <2 x double>, <2 x double>)
				declare void @foo_2f(float, float)
				declare void @foo_2d(double, double)
				declare void @foo_2_4xf(<4 x float>, <4 x float>)
				declare void @foo_2_2xd(<2 x double>, <2 x double>)

				attributes #0 = { "unsafe-fp-math"="true" }
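The three groups of test cases above correspond to the three outcomes of the decision in `performFDIVCombine`. A toy restatement of that decision, assuming a simplified summary of a divisor's users (`DivisorUses` and `wouldCombine` are illustrative names, not LLVM data structures or functions), might look like this:

```cpp
#include <cstddef>

// Toy summary of the FDIV users of one divisor D (not an LLVM type).
struct DivisorUses {
  std::size_t NonReciprocalDivs; // number of users of the form x / D
  bool HasReciprocal;            // whether a 1.0 / D node already exists
};

// Mirrors the checks in the patch: reuse an existing reciprocal whenever at
// least one other FDIV shares the divisor; otherwise only create a new
// reciprocal when at least MinDivs FDIVs share it.
bool wouldCombine(const DivisorUses &U, std::size_t MinDivs = 3) {
  if (U.HasReciprocal)
    return U.NonReciprocalDivs > 0;
  return U.NonReciprocalDivs >= MinDivs;
}
```

Under this sketch, the `three_fdiv_*` tests correspond to `{3, false}` (combined), the `two_fdiv_*` tests to `{2, false}` (left alone), and the `recip_fdiv_*` tests to `{1, true}` (combined by reusing the existing reciprocal).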
