This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Combine multiple FDIVs with the same divisor
ClosedPublic

Authored by • HaoLiu on Nov 19 2014, 11:04 PM.

Details

Reviewers

t.p.northover
hfinkel

Summary

Hi Tim and other reviewers,

This patch tries to combine multiple FDIVs with the same divisor into one FDIV and multiple FMULs. This can improve performance because an FMUL is much faster than an FDIV.
E.g. we combine:

a / D;
b / D;
c / D;

into:

recip = 1.0 / D;
a * recip;
b * recip;
c * recip;

This is not always beneficial: the critical path grows from one FDIV to one FDIV plus one FMUL, which may cause regressions. So this patch only performs the combine when there are more than two FDIVs with the same divisor.

This patch can only benefit some special benchmarks.
Our performance tests on Cortex-A57 show that only 188.ammp (a SPEC CPU2000 benchmark) improves, by 2.5%-3.0%.

Review please.

Thanks,
-Hao
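
As a rough source-level illustration of the rewrite described in the summary, here is a standalone C++ sketch. It is not part of the patch, and the helper names `divideEach`, `multiplyByReciprocal` and `nearlyEqual` are made up for this example. The two forms can differ in the last bits of the result, which is why the combine in this revision is only applied under unsafe FP math.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical helper: three independent divisions by the same divisor.
std::array<double, 3> divideEach(double a, double b, double c, double d) {
  return {a / d, b / d, c / d};
}

// Hypothetical helper: one division to form the reciprocal, then three
// multiplications. This is the shape the combine produces.
std::array<double, 3> multiplyByReciprocal(double a, double b, double c,
                                           double d) {
  double recip = 1.0 / d;
  return {a * recip, b * recip, c * recip};
}

// Compare two values up to a relative tolerance, since the rewritten form
// rounds twice (once for recip, once for the product).
bool nearlyEqual(double x, double y, double relTol) {
  assert(relTol > 0.0 && "tolerance must be positive");
  return std::fabs(x - y) <= relTol * std::fmax(std::fabs(x), std::fabs(y));
}
```

With a divisor that is a power of two the two forms agree exactly; with a divisor such as 3.0 they are only guaranteed to agree up to rounding.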

Event Timeline

• HaoLiu updated this revision to Diff 16414. Nov 19 2014, 11:04 PM

• HaoLiu retitled this revision to [AArch64] Combine multiple FDIVs with the same divisor.

• HaoLiu updated this object.

• HaoLiu edited the test plan for this revision.

• HaoLiu added a reviewer: t.p.northover.

• HaoLiu added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Nov 19 2014, 11:04 PM

For reference, Ben proposed a target-independent solution here:
http://llvm.org/bugs/show_bug.cgi?id=16218

Unfortunately, it never made it into mainline because there were some regressions that needed inspection. The obvious advantage of the target-independent approach is that it works for all targets and across basic blocks. However, performing this transform across basic blocks may not matter in practice, as Duncan pointed out. Also, it may prevent other IR transforms from firing.

Can you please put this in DAGCombine, and let the target optionally enable it? The target can customize this:

// Skip if there is less than three FDIVs.
// FIXME: Different subtargets may behave differently. This can be
// controlled depending on subtargets.
if (Users.size() < 3)

I think that adding something like:

virtual bool combineRepeatedFPDivisors(unsigned &MinUsers) {
  return false;
}

added in TargetLowering.h right around the existing division functions would work well (I'm not attached to the name, feel free to propose some other name). I'd like to use this for PPC too.
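
A minimal sketch of how such an opt-in hook could look, using toy classes rather than the real LLVM ones. `ToyTargetLowering`, `ToyAArch64Lowering` and `shouldCombine` are hypothetical names for this example; the hook takes the user count by value, as the updated diff in this revision does, and the AArch64-like threshold mirrors the "three or more" rule.

```cpp
#include <cassert>

// Toy stand-in for a target-lowering base class. This is NOT the real
// llvm::TargetLowering; it only models the opt-in hook.
struct ToyTargetLowering {
  virtual ~ToyTargetLowering() = default;

  // Default: a target does not ask for the combine.
  virtual bool combineRepeatedFPDivisors(unsigned NumUsers) const {
    (void)NumUsers;
    return false;
  }
};

// Toy target that opts in once three or more FDIVs share a divisor.
struct ToyAArch64Lowering : ToyTargetLowering {
  bool combineRepeatedFPDivisors(unsigned NumUsers) const override {
    return NumUsers > 2;
  }
};

// Target-independent code only queries the hook; the threshold itself
// lives in the target.
bool shouldCombine(const ToyTargetLowering &TLI, unsigned NumUsers) {
  assert(NumUsers > 0 && "expected at least one FDIV user of the divisor");
  return TLI.combineRepeatedFPDivisors(NumUsers);
}
```

Keeping the decision behind a virtual call lets another target pick a different threshold without touching the shared combiner code.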

Just chiming in for the target-independent path: x86 needs this too.

I don't understand the expense argument cited here:
http://llvm.org/bugs/show_bug.cgi?id=16218#c4

Is checking uses in InstCombine more expensive than in DAGCombine? This transform is only firing with fast FP-math. Does that make it any more tolerable?

• HaoLiu updated this revision to Diff 16468. Nov 20 2014, 7:28 PM

In D6334#5, @mcrosier wrote:

For reference, Ben proposed a target-independent solution here:
http://llvm.org/bugs/show_bug.cgi?id=16218

Unfortunately, it never made it into mainline because there were some regressions that needed inspection. The obvious advantage of the target-independent approach is that it works for all targets and across basic blocks. However, performing this transform across basic blocks may not matter in practice, as Duncan pointed out. Also, it may prevent other IR transforms from firing.

Hi Chad,

Yes, if the combine crosses basic blocks, it may sometimes cause regressions. Also, since this is a target-specific problem, we think it's better to do the combine in CodeGen.

Thanks,
-Hao

In D6334#7, @hfinkel wrote:

Can you please put this in DAGCombine, and let the target optionally enable it? The target can customize this:

// Skip if there is less than three FDIVs.
// FIXME: Different subtargets may behave differently. This can be
// controlled depending on subtargets.
if (Users.size() < 3)

I think that adding something like:

virtual bool combineRepeatedFPDivisors(unsigned &MinUsers) {
  return false;
}

added in TargetLowering.h right around the existing division functions would work well (I'm not attached to the name, feel free to propose some other name). I'd like to use this for PPC too.

Hi Hale,

That's a good idea.

I've attached a new patch moving the logic into DAGCombiner.

Thanks,
-hao

Please change how the comparison is done as noted below, otherwise LGTM.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
Line 7115: Don't do the comparison by creating new nodes. You can:

if (auto *CN0 = dyn_cast<ConstantFPSDNode>(N0)) {
  if (CN0->isExactlyValue(1.0))
    return SDValue();
}

Create the FPOne later if you need it.

This revision is now accepted and ready to land. Nov 20 2014, 7:36 PM

In D6334#9, @spatel wrote:

Just chiming in for the target-independent path: x86 needs this too.

I don't understand the expense argument cited here:
http://llvm.org/bugs/show_bug.cgi?id=16218#c4

Is checking uses in InstCombine more expensive than in DAGCombine? This transform is only firing with fast FP-math. Does that make it any more tolerable?

Hi Sanjay,

I think it is a target-specific problem. Some targets may not want this combine.
E.g. we combine "N FDIVs" into "1 FDIV and N FMULs". One shortcoming is that there is one more instruction. Also, different targets may have different costs for FDIV and FMUL. Even if we assume that an FMUL is faster than an FDIV, "2 FDIVs" may still be faster than "1 FDIV and 2 FMULs".

So I think it's better to solve this in the backend.

Thanks,
-Hao
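
The cost argument in the reply above can be made concrete with a small sketch. The struct `FPCosts` and the functions below are hypothetical, and any cost numbers passed in are placeholders, not measurements of any real core; the model simply sums per-instruction costs and ignores pipelining.

```cpp
#include <cassert>

// Hypothetical per-instruction costs (arbitrary units, not measured).
struct FPCosts {
  unsigned FDiv;
  unsigned FMul;
};

// N FDIVs by the same divisor, left as they are.
unsigned costBefore(unsigned N, FPCosts C) { return N * C.FDiv; }

// One FDIV for the reciprocal plus N FMULs.
unsigned costAfter(unsigned N, FPCosts C) { return C.FDiv + N * C.FMul; }

// Critical path of a single result after the combine: the reciprocal FDIV
// followed by one FMUL (it was a single FDIV before).
unsigned criticalPathAfter(FPCosts C) { return C.FDiv + C.FMul; }

// True if the combine lowers the summed cost under this simple model.
bool reducesTotalCost(unsigned N, FPCosts C) {
  assert(N > 0 && "need at least one FDIV");
  return costAfter(N, C) < costBefore(N, C);
}
```

For example, with made-up costs FDiv = 4 and FMul = 3, two FDIVs cost 8 while one FDIV plus two FMULs cost 10, so the combine loses; with FDiv = 10 and FMul = 2 it wins already at two users. This is the kind of variation that motivates leaving the threshold to each target.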

In D6334#14, @hfinkel wrote:

Please change how the comparison is done as noted below, otherwise LGTM.

Hi Hale,

Nice suggestion.
The patch has been modified and committed in http://llvm.org/viewvc/llvm-project?view=revision&revision=222510.

Thanks,
-Hao

weimingz added a subscriber: weimingz. Nov 21 2014, 12:05 AM

Hi Hao,

Thanks for the patch. I think ARM needs it too.

• HaoLiu closed this revision. Dec 21 2014, 10:29 PM

Revision Contents

Path                                          Size
include/llvm/Target/TargetLowering.h          6 lines
lib/CodeGen/SelectionDAG/DAGCombiner.cpp      36 lines
lib/Target/AArch64/AArch64ISelLowering.h      1 line
lib/Target/AArch64/AArch64ISelLowering.cpp    6 lines
test/CodeGen/AArch64/fdiv-combine.ll          94 lines

Diff 16468

include/llvm/Target/TargetLowering.h

[...]
   SDValue BuildUDIV(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
                     bool IsAfterLegalization,
                     std::vector<SDNode *> *Created) const;
   virtual SDValue BuildSDIVPow2(SDNode *N, const APInt &Divisor,
                                 SelectionDAG &DAG,
                                 std::vector<SDNode *> *Created) const {
     return SDValue();
   }

+  /// Indicate whether this target prefers to combine the given number of FDIVs
+  /// with the same divisor.
+  virtual bool combineRepeatedFPDivisors(unsigned NumUsers) const {
+    return false;
+  }
+
   /// Hooks for building estimates in place of slower divisions and square
   /// roots.

   /// Return a reciprocal square root estimate value for the input operand.
   /// The RefinementSteps output is the number of Newton-Raphson refinement
   /// iterations required to generate a sufficient (though not necessarily
   /// IEEE-754 compliant) estimate for the value type.
   /// The boolean UseOneConstNR output is used to select a Newton-Raphson
[...]

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[...]
     if (char RHSNeg = isNegatibleForFree(N1, LegalOperations, TLI, &Options)) {
       // negated.
       if (LHSNeg == 2 || RHSNeg == 2)
         return DAG.getNode(ISD::FDIV, SDLoc(N), VT,
                            GetNegatedExpression(N0, DAG, LegalOperations),
                            GetNegatedExpression(N1, DAG, LegalOperations));
     }
   }

+  // Combine multiple FDIVs with the same divisor into one FDIV and multiple
+  // FMULs.
+  // E.g., (a / D; b / D;) -> (recip = 1.0 / D; a * recip; b * recip)
+  // Notice that this is not always beneficial. One reason is different target
+  // may have different costs for FDIV and FMUL, so sometimes the cost of two
+  // FDIVs may be lower than the cost of one FDIV and two FMULs. Another reason
+  // is the critical path is increased from "one FDIV" to "one FDIV + one FMUL".
+  if (Options.UnsafeFPMath) {
+    SDValue FPOne = DAG.getConstantFP(1.0, VT); // floating point 1.0

[Inline comment by hfinkel on the line above: Don't do the comparison by creating new nodes. You can: if (auto *CN0 = dyn_cast<ConstantFPSDNode>(N0)) { if (CN0->isExactlyValue(1.0)) return SDValue(); } Create the FPOne later if you need it.]

+    // Skip if current node is a reciprocal.
+    if (N0 == FPOne)
+      return SDValue();
+
+    SmallVector<SDNode *, 4> Users;
+    // Find all FDIV users of the divisor.
+    for (SDNode::use_iterator UI = N1.getNode()->use_begin(),
+                              UE = N1.getNode()->use_end();
+         UI != UE; ++UI) {
+      SDNode *User = UI.getUse().getUser();
+      if (User->getOpcode() == ISD::FDIV && User->getOperand(1) == N1)
+        Users.push_back(User);
+    }
+
+    if (TLI.combineRepeatedFPDivisors(Users.size())) {
+      SDValue Reciprocal = DAG.getNode(ISD::FDIV, SDLoc(N), VT, FPOne, N1);
+
+      // Dividend / Divisor -> Dividend * Reciprocal
+      for (auto I = Users.begin(), E = Users.end(); I != E; ++I) {
+        SDValue NewNode = DAG.getNode(ISD::FMUL, SDLoc(*I), VT,
+                                      (*I)->getOperand(0), Reciprocal);
+        DAG.ReplaceAllUsesWith(*I, NewNode.getNode());
+      }
+      return SDValue();
+    }
+  }
+
   return SDValue();
 }

 SDValue DAGCombiner::visitFREM(SDNode *N) {
   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);
   ConstantFPSDNode *N0CFP = dyn_cast<ConstantFPSDNode>(N0);
   ConstantFPSDNode *N1CFP = dyn_cast<ConstantFPSDNode>(N1);
[...]
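
The combine in the DAGCombiner.cpp diff can be modeled outside of LLVM with a toy instruction list. The sketch below is only an approximation under stated assumptions: `ToyInst`, `Op` and `combineRepeatedDivisor` are hypothetical names, values are plain integer ids, and the function runs once over a flat list instead of being revisited per node, so it does not need the "skip if current node is a reciprocal" check from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Op { FDiv, FMul };

// Toy instruction: Dest = Lhs <Opcode> Rhs, with values named by integer ids.
struct ToyInst {
  Op Opcode;
  int Dest;
  int Lhs;
  int Rhs;
};

// If at least MinUsers FDIVs use Divisor as their right-hand operand, insert
// "RecipId = OneId / Divisor" before the first of them and turn each such
// FDIV into an FMUL by RecipId. Returns true if the list was rewritten.
// OneId is assumed to name the constant 1.0; RecipId must be a fresh id.
bool combineRepeatedDivisor(std::vector<ToyInst> &Insts, int Divisor,
                            int OneId, int RecipId, std::size_t MinUsers) {
  assert(MinUsers > 0 && "threshold must be at least one user");

  // Find all FDIV users of the divisor.
  std::vector<std::size_t> Users;
  for (std::size_t I = 0; I < Insts.size(); ++I)
    if (Insts[I].Opcode == Op::FDiv && Insts[I].Rhs == Divisor)
      Users.push_back(I);

  if (Users.size() < MinUsers)
    return false;

  // Dividend / Divisor -> Dividend * Reciprocal
  for (std::size_t I : Users) {
    Insts[I].Opcode = Op::FMul;
    Insts[I].Rhs = RecipId;
  }

  // Define the reciprocal just before its first use.
  Insts.insert(Insts.begin() + static_cast<std::ptrdiff_t>(Users.front()),
               ToyInst{Op::FDiv, RecipId, OneId, Divisor});
  return true;
}
```

Passing `MinUsers = 3` reproduces the AArch64 choice in this revision: three divisions by the same value become one FDIV and three FMULs, while two divisions are left alone.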

lib/Target/AArch64/AArch64ISelLowering.h

[...]
 private:
   SDValue LowerINT_TO_FP(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVectorAND(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerVectorOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerCONCAT_VECTORS(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFSINCOS(SDValue Op, SelectionDAG &DAG) const;

   SDValue BuildSDIVPow2(SDNode *N, const APInt &Divisor, SelectionDAG &DAG,
                         std::vector<SDNode *> *Created) const override;
+  bool combineRepeatedFPDivisors(unsigned NumUsers) const override;

   ConstraintType
   getConstraintType(const std::string &Constraint) const override;
   unsigned getRegisterByName(const char* RegName, EVT VT) const override;

   /// Examine constraint string and operand type and determine a weight value.
   /// The operand object must already have been set up with the operand type.
   ConstraintWeight
[...]

lib/Target/AArch64/AArch64ISelLowering.cpp

[...]
   case ISD::FP_TO_SINT:
     return;
   }
 }

 bool AArch64TargetLowering::useLoadStackGuardNode() const {
   return true;
 }

+bool AArch64TargetLowering::combineRepeatedFPDivisors(unsigned NumUsers) const {
+  // Combine multiple FDIVs if there are three or more FDIVs with the same
+  // divisor.
+  return NumUsers > 2;
+}
+
 TargetLoweringBase::LegalizeTypeAction
 AArch64TargetLowering::getPreferredVectorAction(EVT VT) const {
   MVT SVT = VT.getSimpleVT();
   // During type legalization, we prefer to widen v1i8, v1i16, v1i32 to v8i8,
   // v4i16, v2i32 instead of to promote.
   if (SVT == MVT::v1i8 || SVT == MVT::v1i16 || SVT == MVT::v1i32
       || SVT == MVT::v1f32)
     return TypeWidenVector;
[...]

test/CodeGen/AArch64/fdiv-combine.ll

This file was added.

; RUN: llc -march=aarch64 < %s | FileCheck %s

; Following test cases check:
;   a / D; b / D; c / D;
;                =>
;   recip = 1.0 / D; a * recip; b * recip; c * recip;
define void @three_fdiv_float(float %D, float %a, float %b, float %c) #0 {
; CHECK-LABEL: three_fdiv_float:
; CHECK: fdiv
; CHECK-NEXT-NOT: fdiv
; CHECK: fmul
; CHECK: fmul
; CHECK: fmul
  %div = fdiv float %a, %D
  %div1 = fdiv float %b, %D
  %div2 = fdiv float %c, %D
  tail call void @foo_3f(float %div, float %div1, float %div2)
  ret void
}

define void @three_fdiv_double(double %D, double %a, double %b, double %c) #0 {
; CHECK-LABEL: three_fdiv_double:
; CHECK: fdiv
; CHECK-NEXT-NOT: fdiv
; CHECK: fmul
; CHECK: fmul
; CHECK: fmul
  %div = fdiv double %a, %D
  %div1 = fdiv double %b, %D
  %div2 = fdiv double %c, %D
  tail call void @foo_3d(double %div, double %div1, double %div2)
  ret void
}

define void @three_fdiv_4xfloat(<4 x float> %D, <4 x float> %a, <4 x float> %b, <4 x float> %c) #0 {
; CHECK-LABEL: three_fdiv_4xfloat:
; CHECK: fdiv
; CHECK-NEXT-NOT: fdiv
; CHECK: fmul
; CHECK: fmul
; CHECK: fmul
  %div = fdiv <4 x float> %a, %D
  %div1 = fdiv <4 x float> %b, %D
  %div2 = fdiv <4 x float> %c, %D
  tail call void @foo_3_4xf(<4 x float> %div, <4 x float> %div1, <4 x float> %div2)
  ret void
}

define void @three_fdiv_2xdouble(<2 x double> %D, <2 x double> %a, <2 x double> %b, <2 x double> %c) #0 {
; CHECK-LABEL: three_fdiv_2xdouble:
; CHECK: fdiv
; CHECK-NEXT-NOT: fdiv
; CHECK: fmul
; CHECK: fmul
; CHECK: fmul
  %div = fdiv <2 x double> %a, %D
  %div1 = fdiv <2 x double> %b, %D
  %div2 = fdiv <2 x double> %c, %D
  tail call void @foo_3_2xd(<2 x double> %div, <2 x double> %div1, <2 x double> %div2)
  ret void
}

; Following test cases check we never combine two FDIVs if neither of them
; calculates a reciprocal.
define void @two_fdiv_float(float %D, float %a, float %b) #0 {
; CHECK-LABEL: two_fdiv_float:
; CHECK: fdiv
; CHECK: fdiv
; CHECK-NEXT-NOT: fmul
  %div = fdiv float %a, %D
  %div1 = fdiv float %b, %D
  tail call void @foo_2f(float %div, float %div1)
  ret void
}

define void @two_fdiv_double(double %D, double %a, double %b) #0 {
; CHECK-LABEL: two_fdiv_double:
; CHECK: fdiv
; CHECK: fdiv
; CHECK-NEXT-NOT: fmul
  %div = fdiv double %a, %D
  %div1 = fdiv double %b, %D
  tail call void @foo_2d(double %div, double %div1)
  ret void
}

declare void @foo_3f(float, float, float)
declare void @foo_3d(double, double, double)
declare void @foo_3_4xf(<4 x float>, <4 x float>, <4 x float>)
declare void @foo_3_2xd(<2 x double>, <2 x double>, <2 x double>)
declare void @foo_2f(float, float)
declare void @foo_2d(double, double)

attributes #0 = { "unsafe-fp-math"="true" }
