
Details

Reviewers

spatel
zinovy.nis
andreadb
bogner
zansari
DavidKreitzer
hfinkel

Commits

rGf664f3a57865: [DAG] Remove redundant FMUL in Newton-Raphson SQRT code
rL272920: [DAG] Remove redundant FMUL in Newton-Raphson SQRT code

Summary

When calculating a square root using Newton-Raphson with two constants,
a naive implementation is to use five multiplications (four muls to calculate
reciprocal square root and another one to calculate the square root itself).
However, after some reassociation and CSE the same result can be obtained
with only four multiplications. Unfortunately, there's no reliable way to do
such a reassociation in the back-end. So, the patch modifies NR code itself
so that it directly builds optimal code for SQRT and doesn't rely on any
further reassociation.
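
The difference in multiplication count can be sketched in scalar C++ as follows. This is only an illustration of the arithmetic described above, not the DAGCombiner code itself: the function names `sqrtNaive` and `sqrtPatched` are hypothetical, and the initial estimate `est` (which would normally come from a hardware reciprocal square root estimate instruction) is passed in as a plain parameter.

```cpp
// Old shape: refine the reciprocal square root estimate with one
// two-constant Newton-Raphson step, then multiply by the argument.
// Five multiplications in total.
float sqrtNaive(float a, float est) {
  float halfEst = est * -0.5f;   // mul 1
  float ee = est * est;          // mul 2
  float aee = ee * a;            // mul 3
  float rhs = aee + -3.0f;
  float rsqrt = halfEst * rhs;   // mul 4: refined 1/sqrt(a)
  return a * rsqrt;              // mul 5: sqrt(a) = a * (1/sqrt(a))
}

// Patched shape: a * est is computed once and reused on both sides,
// so the square root needs only four multiplications.
float sqrtPatched(float a, float est) {
  float ae = a * est;            // mul 1 (common subexpression)
  float aee = ae * est;          // mul 2
  float rhs = aee + -3.0f;
  float lhs = ae * -0.5f;        // mul 3
  return lhs * rhs;              // mul 4
}
```

With an exact estimate, for example `a = 4.0f` and `est = 0.5f`, both functions return 2. With a rough estimate they agree only approximately, because the intermediate roundings happen in a different order.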

Diff Detail

Repository: rL LLVM

Event Timeline

n.bozhenov updated this revision to Diff 60018. Jun 8 2016, 5:36 AM

n.bozhenov retitled this revision to Remove redundant FMUL in Newton-Raphson SQRT code.

n.bozhenov updated this object.

n.bozhenov added reviewers: zansari, DavidKreitzer, zinovy.nis, bogner, spatel, hfinkel, andreadb.

n.bozhenov added a subscriber: llvm-commits.

This seems like a good perf optimization to me in general, but a few high-level questions:

How does the refactoring of the multiplication affect the accuracy of the results? I attached some test programs that could be modified to answer this in https://llvm.org/bugs/show_bug.cgi?id=21385 .

The case of converting a sqrt into an estimate sequence is a problem in https://llvm.org/bugs/show_bug.cgi?id=24063 . Would this patch sidestep that bug by producing the more accurate/expected result for that case?

Given that we have 3 generations of Intel FMA FPUs now (and the non-x86 world has always had FMA), it would be interesting to know the accuracy (and codegen tests should probably be added) for the FMA variant. It would also be good to add tests with 2 refinement steps, so we know this patch is behaving as expected in that case, but...

(Apologies for straying from the immediate goal of the patch...) Given that Intel FPUs in Broadwell and Skylake have reduced IEEE-compliant div/sqrt to 4 and then 3 cycle throughputs according to Agner's tables, using an estimate is likely a perf misoptimization for current and future Intel big cores. I think that we can fix this using the existing hook 'isFsqrtCheap()', but please confirm that this patch preserves that opportunity.

RKSimon added a subscriber: RKSimon. Jun 8 2016, 2:34 PM

Thanks for the great questions, Sanjay!

I slightly modified the example from PR21385 and compared these two
sequences to calculate square roots (the current one and the patched one):

est1 = (-0.5f * est0) * (-3.0f + est0 * est0 * f) * f;
float ae = est0 * f;
est2 = (-0.5f * ae) * (-3.0f + est0 * ae);

And I obtained the following results:

                                   est1         est2
Total tests:                 2130706432   2130706432
Inexact results:              926539007    834159368
Estimate missed by  1 ULP:    862814916    796017331
Estimate missed by  2 ULP:     62179595     37665787
Estimate missed by  3 ULP:      1537746       476250
Estimate missed by  4 ULP:         6750            0
Estimate missed by >4 ULP:            0            0

As you can see, with the patch square roots are significantly more
accurate on average, though I don't have a good explanation for this.
I performed testing on a number of Intel microarchitectures (including
Atom) and got exactly the same results for all of them.
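
For readers who want to reproduce a table like the one above, here is a minimal sketch of how such a ULP histogram could be tallied. It is not the tester attached to PR21385; `ulpDistance`, `UlpHistogram` and `tallySqrtErrors` are hypothetical names, and the sketch assumes all compared values are finite and non-negative. The total of 2130706432 tests equals the number of positive normal single-precision values (254 exponents times 2^23 mantissas), which suggests the tester sweeps every such input; that reading is an inference, not something stated in the thread.

```cpp
#include <array>
#include <cmath>
#include <cstdint>
#include <cstring>

// Distance in units in the last place between two finite, non-negative
// floats. For such values the IEEE-754 bit patterns order like integers.
uint32_t ulpDistance(float x, float y) {
  uint32_t bx, by;
  std::memcpy(&bx, &x, sizeof bx);
  std::memcpy(&by, &y, sizeof by);
  return bx > by ? bx - by : by - bx;
}

struct UlpHistogram {
  uint64_t total = 0;
  // missedBy[k] counts results off by exactly k ULP for k = 0..4;
  // missedBy[5] counts results off by more than 4 ULP.
  std::array<uint64_t, 6> missedBy{};
};

// Compare candidate(x) against std::sqrt(x) for every float whose bit
// pattern lies in [firstBits, lastBits], stepping by stride.
template <typename Fn>
UlpHistogram tallySqrtErrors(Fn candidate, uint32_t firstBits,
                             uint32_t lastBits, uint32_t stride) {
  if (stride == 0)
    stride = 1; // avoid an endless loop
  UlpHistogram h;
  // 64-bit counter so the loop cannot wrap around near UINT32_MAX.
  for (uint64_t bits = firstBits; bits <= lastBits; bits += stride) {
    uint32_t b = static_cast<uint32_t>(bits);
    float x;
    std::memcpy(&x, &b, sizeof x);
    uint32_t d = ulpDistance(candidate(x), std::sqrt(x));
    ++h.total;
    ++h.missedBy[d > 4 ? 5 : d];
  }
  return h;
}
```

A full sweep over the positive normal floats would use `firstBits = 0x00800000` and `lastBits = 0x7F7FFFFF` with a stride of 1; a larger stride gives a quicker sample.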

As for improved hardware SQRT efficiency in modern Intel CPUs, I'm
working on a patch that addresses this particular issue and I will
share the patch very soon.

Here are the results with vfmadd132ss:

                                   est1         est2
Total tests:                 2130706432   2130706432
Inexact results:              911966111    817657369
Estimate missed by  1 ULP:    854798052    785288244
Estimate missed by  2 ULP:     56073347     32117919
Estimate missed by  3 ULP:      1092044       251206
Estimate missed by  4 ULP:         2668            0
Estimate missed by >4 ULP:            0            0

Generally, using FMA improves precision. Again, the new code
sequence produces statistically better results.

Thanks, Nikolai!
I reproduced your results for both FMA and non-FMA on my local Haswell machine.
Here is the output from AMD Jaguar (no FMA):

                                         est1        est2
Inexact results                     828912065   694175396
 Estimate missed by 1 ULP:          788646331   681356143
 Estimate missed by 2 ULP:           39579384    12754737
 Estimate missed by 3 ULP:             680779       64516
 Estimate missed by 4 ULP:               5571           0
 Estimate missed by >= 5 ULP with one N-R step = 0

So again, eliminating the extra multiply benefits accuracy in general. I attached my hacked tester program for this experiment to PR21385 in case anyone else wants to try it or adapt it to non-x86, but I'm assuming the accuracy improvement holds for any architecture because we're eliminating some intermediate error with the refactoring.

Please add a regression test to show the codegen when 2 or more N-R steps are used. PowerPC uses at least 2 refinement steps with 'frsqrte', so it may be easiest to add a sqrt (rather than rsqrt) test to test/CodeGen/PowerPC/recipest.ll instead of adding another RUN to the x86 test.

Please add a regression test to show the x86 codegen with FMA. This can get updated when you do the follow-up patch for isFsqrtCheap() for the faster sqrt/div FPUs in Broadwell, etc.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
355–365 (On Diff #60018): It's a giant mess...but since you're updating these names, I think the recommended solution is to use the current naming convention for these functions - start with a lowercase letter.
14651–14652 (On Diff #60018): 'else' goes on the same line as '}'. Consider clang-format of the whole patch; there may be other formatting changes needed.
14696–14698 (On Diff #60018): No braces needed here; see previous comment about using clang-format.

Thanks for your remarks, Sanjay!
I've fixed formatting issues and uploaded a new version of the patch.

As for tests, I can easily add to x86/sqrt-fastmath.ll another RUN
with -mattr=fma -recip=all:2 to check both FMA and two-step
refinement, but I'm not sure it is a good idea. I believe
sqrt-fastmath.ll is already overly fragile. It can easily be broken by
perfectly valid variations in register allocation, code ordering or code
reassociation. And I'm afraid that additional tests for two-step
refinement will make the test even more fragile.

So, I wonder if it would be better to split SQRT testing logically into
two parts:

1. Check that NR is applied iff necessary (lit tests).
2. Check that NR code calculates roots with the expected accuracy (execution test).

In other words, my suggestion is to convert the accuracy testers from
PR21385 into new tests for LLVM test-suite instead of adding more
fragile lit-tests here. What do you think?

Sorry, last time I uploaded the wrong diff. Here's the correct diff.

In D21127#455050, @n.bozhenov wrote:

As for tests, I can easily add to x86/sqrt-fastmath.ll another RUN
with -mattr=fma -recip=all:2 to check both FMA and two-step
refinement, but I'm not sure it is a good idea. I believe
sqrt-fastmath.ll is already overly fragile. It can easily be broken by
perfectly valid variations in register allocation, code ordering or code
reassociation. And I'm afraid that additional tests for two-step
refinement will make the test even more fragile.

So, I wonder if it would be better to split SQRT testing logically into
two parts:

1. Check that NR is applied iff necessary (lit tests).
2. Check that NR code calculates roots with the expected accuracy (execution test).

In other words, my suggestion is to convert the accuracy testers from
PR21385 into new tests for LLVM test-suite instead of adding more
fragile lit-tests here. What do you think?

I think that adding an accuracy test to test-suite is a great idea. And I agree that these tests are susceptible to RA/scheduler changes, but the benefit of exact testing has usually outweighed the maintenance cost in my experience.

So I have another, hopefully better, test suggestion. :)

How about a MIR test? (cc'ing Matthias and Quentin for any MIR test suggestions)

The RUN line would include something like:
$ llc -o - -stop-after machine-scheduler rsqrt.ll -mattr=avx2,fma -recip=sqrtf

And then we can check the output closely for:

%1 = VRSQRTSSr undef %2, %0
%4 = VMULSSrr %1, %1
%4 = VFMADDSSr213m %4, %0, %rip, 1, _, %const.0, _ :: (load 4 from constant-pool)
%5 = VMULSSrm %1, %rip, 1, _, %const.1, _ :: (load 4 from constant-pool)
%11 = VMULSSrr %5, %0
%7 = VMULSSrr %4, %11
%8 = FsFLD0SS
%9 = VCMPSSrr %0, %8, 0
%10 = VFsANDNPSrr %9, %7

This will be immune to RA and other machine-level variations.

Ok, I added tests to check MIR output after instruction selection for
X86 with two refinement steps and with FMA enabled. These seem to cover
all the cases not covered previously.
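
As a reading aid for the MIR checks in sqrt-fastmath-mir.ll, the arithmetic of the two-step FMA square root sequence can be sketched in scalar C++. This is an interpretation of the checked instruction sequence, not code from the patch: `sqrtTwoStepsFma` is a hypothetical name, `std::fma` stands in for the fused multiply-add instruction, and the initial estimate is passed in rather than produced by a hardware estimate instruction.

```cpp
#include <cmath>

// Two Newton-Raphson steps with the two-constant form. Only the last
// step folds the multiplication by a into the left-hand factor.
float sqrtTwoStepsFma(float a, float est) {
  const float minusThree = -3.0f;
  const float minusHalf = -0.5f;

  // Step 1: refine the reciprocal square root estimate.
  float ae = a * est;
  float rhs = std::fma(est, ae, minusThree); // est * (a * est) - 3
  float lhs = est * minusHalf;
  est = lhs * rhs;

  // Step 2 (last): reuse a * est so the result is sqrt(a) directly.
  ae = a * est;
  rhs = std::fma(est, ae, minusThree);
  lhs = ae * minusHalf;
  float root = lhs * rhs;

  // For a == 0 the estimate is infinite and the product is NaN,
  // so select 0 for that input, as the compare-and-mask does.
  return a == 0.0f ? 0.0f : root;
}
```

For example, `sqrtTwoStepsFma(4.0f, 0.5f)` returns 2, and a zero input returns 0 even when the estimate passed in is infinite.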

LGTM.

This revision is now accepted and ready to land. Jun 14 2016, 8:44 AM

spatel mentioned this in D21379: [X86] Heuristic to selectively build Newton-Raphson SQRT estimation. Jun 15 2016, 1:16 PM

zinovy.nis accepted this revision. Jun 16 2016, 3:00 AM

zinovy.nis edited edge metadata.

Closed by commit rL272920: [DAG] Remove redundant FMUL in Newton-Raphson SQRT code (authored by spatel). Jun 16 2016, 10:05 AM

This revision was automatically updated to reflect the committed changes.

Diff 60991

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp


[351 lines not shown]
private:
SDValue CombineConsecutiveLoads(SDNode *N, EVT VT);		SDValue CombineConsecutiveLoads(SDNode *N, EVT VT);
SDValue CombineExtLoad(SDNode *N);		SDValue CombineExtLoad(SDNode *N);
SDValue combineRepeatedFPDivisors(SDNode *N);		SDValue combineRepeatedFPDivisors(SDNode *N);
SDValue ConstantFoldBITCASTofBUILD_VECTOR(SDNode *, EVT);		SDValue ConstantFoldBITCASTofBUILD_VECTOR(SDNode *, EVT);
SDValue BuildSDIV(SDNode *N);		SDValue BuildSDIV(SDNode *N);
SDValue BuildSDIVPow2(SDNode *N);		SDValue BuildSDIVPow2(SDNode *N);
SDValue BuildUDIV(SDNode *N);		SDValue BuildUDIV(SDNode *N);
SDValue BuildReciprocalEstimate(SDValue Op, SDNodeFlags *Flags);		SDValue BuildReciprocalEstimate(SDValue Op, SDNodeFlags *Flags);
SDValue BuildRsqrtEstimate(SDValue Op, SDNodeFlags *Flags);		SDValue buildRsqrtEstimate(SDValue Op, SDNodeFlags *Flags);
SDValue BuildRsqrtNROneConst(SDValue Op, SDValue Est, unsigned Iterations,		SDValue buildSqrtEstimate(SDValue Op, SDNodeFlags *Flags);
SDNodeFlags *Flags);		SDValue buildSqrtEstimateImpl(SDValue Op, SDNodeFlags *Flags, bool Recip);
SDValue BuildRsqrtNRTwoConst(SDValue Op, SDValue Est, unsigned Iterations,		SDValue buildSqrtNROneConst(SDValue Op, SDValue Est, unsigned Iterations,
SDNodeFlags *Flags);		SDNodeFlags *Flags, bool Reciprocal);
		SDValue buildSqrtNRTwoConst(SDValue Op, SDValue Est, unsigned Iterations,
		SDNodeFlags *Flags, bool Reciprocal);
SDValue MatchBSwapHWordLow(SDNode *N, SDValue N0, SDValue N1,		SDValue MatchBSwapHWordLow(SDNode *N, SDValue N0, SDValue N1,
bool DemandHighBits = true);		bool DemandHighBits = true);
SDValue MatchBSwapHWord(SDNode *N, SDValue N0, SDValue N1);		SDValue MatchBSwapHWord(SDNode *N, SDValue N0, SDValue N1);
SDNode *MatchRotatePosNeg(SDValue Shifted, SDValue Pos, SDValue Neg,		SDNode *MatchRotatePosNeg(SDValue Shifted, SDValue Pos, SDValue Neg,
SDValue InnerPos, SDValue InnerNeg,		SDValue InnerPos, SDValue InnerNeg,
unsigned PosOpcode, unsigned NegOpcode,		unsigned PosOpcode, unsigned NegOpcode,
const SDLoc &DL);		const SDLoc &DL);
SDNode *MatchRotate(SDValue LHS, SDValue RHS, const SDLoc &DL);		SDNode *MatchRotate(SDValue LHS, SDValue RHS, const SDLoc &DL);
[8,447 lines not shown]
if (N1CFP) {
TLI.isFPImmLegal(Recip, VT)))		TLI.isFPImmLegal(Recip, VT)))
return DAG.getNode(ISD::FMUL, DL, VT, N0,		return DAG.getNode(ISD::FMUL, DL, VT, N0,
DAG.getConstantFP(Recip, DL, VT), Flags);		DAG.getConstantFP(Recip, DL, VT), Flags);
}		}

// If this FDIV is part of a reciprocal square root, it may be folded		// If this FDIV is part of a reciprocal square root, it may be folded
// into a target-specific square root estimate instruction.		// into a target-specific square root estimate instruction.
if (N1.getOpcode() == ISD::FSQRT) {		if (N1.getOpcode() == ISD::FSQRT) {
if (SDValue RV = BuildRsqrtEstimate(N1.getOperand(0), Flags)) {		if (SDValue RV = buildRsqrtEstimate(N1.getOperand(0), Flags)) {
return DAG.getNode(ISD::FMUL, DL, VT, N0, RV, Flags);		return DAG.getNode(ISD::FMUL, DL, VT, N0, RV, Flags);
}		}
} else if (N1.getOpcode() == ISD::FP_EXTEND &&		} else if (N1.getOpcode() == ISD::FP_EXTEND &&
N1.getOperand(0).getOpcode() == ISD::FSQRT) {		N1.getOperand(0).getOpcode() == ISD::FSQRT) {
if (SDValue RV = BuildRsqrtEstimate(N1.getOperand(0).getOperand(0),		if (SDValue RV = buildRsqrtEstimate(N1.getOperand(0).getOperand(0),
Flags)) {		Flags)) {
RV = DAG.getNode(ISD::FP_EXTEND, SDLoc(N1), VT, RV);		RV = DAG.getNode(ISD::FP_EXTEND, SDLoc(N1), VT, RV);
AddToWorklist(RV.getNode());		AddToWorklist(RV.getNode());
return DAG.getNode(ISD::FMUL, DL, VT, N0, RV, Flags);		return DAG.getNode(ISD::FMUL, DL, VT, N0, RV, Flags);
}		}
} else if (N1.getOpcode() == ISD::FP_ROUND &&		} else if (N1.getOpcode() == ISD::FP_ROUND &&
N1.getOperand(0).getOpcode() == ISD::FSQRT) {		N1.getOperand(0).getOpcode() == ISD::FSQRT) {
if (SDValue RV = BuildRsqrtEstimate(N1.getOperand(0).getOperand(0),		if (SDValue RV = buildRsqrtEstimate(N1.getOperand(0).getOperand(0),
Flags)) {		Flags)) {
RV = DAG.getNode(ISD::FP_ROUND, SDLoc(N1), VT, RV, N1.getOperand(1));		RV = DAG.getNode(ISD::FP_ROUND, SDLoc(N1), VT, RV, N1.getOperand(1));
AddToWorklist(RV.getNode());		AddToWorklist(RV.getNode());
return DAG.getNode(ISD::FMUL, DL, VT, N0, RV, Flags);		return DAG.getNode(ISD::FMUL, DL, VT, N0, RV, Flags);
}		}
} else if (N1.getOpcode() == ISD::FMUL) {		} else if (N1.getOpcode() == ISD::FMUL) {
// Look through an FMUL. Even though this won't remove the FDIV directly,		// Look through an FMUL. Even though this won't remove the FDIV directly,
// it's still worthwhile to get rid of the FSQRT if possible.		// it's still worthwhile to get rid of the FSQRT if possible.
SDValue SqrtOp;		SDValue SqrtOp;
SDValue OtherOp;		SDValue OtherOp;
if (N1.getOperand(0).getOpcode() == ISD::FSQRT) {		if (N1.getOperand(0).getOpcode() == ISD::FSQRT) {
SqrtOp = N1.getOperand(0);		SqrtOp = N1.getOperand(0);
OtherOp = N1.getOperand(1);		OtherOp = N1.getOperand(1);
} else if (N1.getOperand(1).getOpcode() == ISD::FSQRT) {		} else if (N1.getOperand(1).getOpcode() == ISD::FSQRT) {
SqrtOp = N1.getOperand(1);		SqrtOp = N1.getOperand(1);
OtherOp = N1.getOperand(0);		OtherOp = N1.getOperand(0);
}		}
if (SqrtOp.getNode()) {		if (SqrtOp.getNode()) {
// We found a FSQRT, so try to make this fold:		// We found a FSQRT, so try to make this fold:
// x / (y * sqrt(z)) -> x * (rsqrt(z) / y)		// x / (y * sqrt(z)) -> x * (rsqrt(z) / y)
if (SDValue RV = BuildRsqrtEstimate(SqrtOp.getOperand(0), Flags)) {		if (SDValue RV = buildRsqrtEstimate(SqrtOp.getOperand(0), Flags)) {
RV = DAG.getNode(ISD::FDIV, SDLoc(N1), VT, RV, OtherOp, Flags);		RV = DAG.getNode(ISD::FDIV, SDLoc(N1), VT, RV, OtherOp, Flags);
AddToWorklist(RV.getNode());		AddToWorklist(RV.getNode());
return DAG.getNode(ISD::FMUL, DL, VT, N0, RV, Flags);		return DAG.getNode(ISD::FMUL, DL, VT, N0, RV, Flags);
}		}
}		}
}		}

// Fold into a reciprocal estimate and multiply instead of a real divide.		// Fold into a reciprocal estimate and multiply instead of a real divide.
[40 lines not shown]
SDValue DAGCombiner::visitFSQRT(SDNode *N) {		SDValue DAGCombiner::visitFSQRT(SDNode *N) {
if (!DAG.getTarget().Options.UnsafeFPMath || TLI.isFsqrtCheap())		if (!DAG.getTarget().Options.UnsafeFPMath || TLI.isFsqrtCheap())
return SDValue();		return SDValue();

// TODO: FSQRT nodes should have flags that propagate to the created nodes.		// TODO: FSQRT nodes should have flags that propagate to the created nodes.
// For now, create a Flags object for use with all unsafe math transforms.		// For now, create a Flags object for use with all unsafe math transforms.
SDNodeFlags Flags;		SDNodeFlags Flags;
Flags.setUnsafeAlgebra(true);		Flags.setUnsafeAlgebra(true);
		return buildSqrtEstimate(N->getOperand(0), &Flags);
// Compute this as X * (1/sqrt(X)) = X * (X ** -0.5)
SDValue RV = BuildRsqrtEstimate(N->getOperand(0), &Flags);
if (!RV)
return SDValue();

EVT VT = RV.getValueType();
SDLoc DL(N);
RV = DAG.getNode(ISD::FMUL, DL, VT, N->getOperand(0), RV, &Flags);
AddToWorklist(RV.getNode());

// Unfortunately, RV is now NaN if the input was exactly 0.
// Select out this case and force the answer to 0.
SDValue Zero = DAG.getConstantFP(0.0, DL, VT);
EVT CCVT = getSetCCResultType(VT);
SDValue ZeroCmp = DAG.getSetCC(DL, CCVT, N->getOperand(0), Zero, ISD::SETEQ);
AddToWorklist(ZeroCmp.getNode());
AddToWorklist(RV.getNode());

return DAG.getNode(VT.isVector() ? ISD::VSELECT : ISD::SELECT, DL, VT,
ZeroCmp, Zero, RV);
}		}

/// copysign(x, fp_extend(y)) -> copysign(x, y)		/// copysign(x, fp_extend(y)) -> copysign(x, y)
/// copysign(x, fp_round(y)) -> copysign(x, y)		/// copysign(x, fp_round(y)) -> copysign(x, y)
static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(SDNode *N) {		static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(SDNode *N) {
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
if ((N1.getOpcode() == ISD::FP_EXTEND ||		if ((N1.getOpcode() == ISD::FP_EXTEND ||
N1.getOpcode() == ISD::FP_ROUND)) {		N1.getOpcode() == ISD::FP_ROUND)) {
[5,634 lines not shown]
}		}

/// Newton iteration for a function: F(X) is X_{i+1} = X_i - F(X_i)/F'(X_i)		/// Newton iteration for a function: F(X) is X_{i+1} = X_i - F(X_i)/F'(X_i)
/// For the reciprocal sqrt, we need to find the zero of the function:		/// For the reciprocal sqrt, we need to find the zero of the function:
/// F(X) = 1/X^2 - A [which has a zero at X = 1/sqrt(A)]		/// F(X) = 1/X^2 - A [which has a zero at X = 1/sqrt(A)]
/// =>		/// =>
/// X_{i+1} = X_i (1.5 - A X_i^2 / 2)		/// X_{i+1} = X_i (1.5 - A X_i^2 / 2)
/// As a result, we precompute A/2 prior to the iteration loop.		/// As a result, we precompute A/2 prior to the iteration loop.
SDValue DAGCombiner::BuildRsqrtNROneConst(SDValue Arg, SDValue Est,		SDValue DAGCombiner::buildSqrtNROneConst(SDValue Arg, SDValue Est,
unsigned Iterations,		unsigned Iterations,
SDNodeFlags *Flags) {		SDNodeFlags *Flags, bool Reciprocal) {
EVT VT = Arg.getValueType();		EVT VT = Arg.getValueType();
SDLoc DL(Arg);		SDLoc DL(Arg);
SDValue ThreeHalves = DAG.getConstantFP(1.5, DL, VT);		SDValue ThreeHalves = DAG.getConstantFP(1.5, DL, VT);

// We now need 0.5 * Arg which we can write as (1.5 * Arg - Arg) so that		// We now need 0.5 * Arg which we can write as (1.5 * Arg - Arg) so that
// this entire sequence requires only one FP constant.		// this entire sequence requires only one FP constant.
SDValue HalfArg = DAG.getNode(ISD::FMUL, DL, VT, ThreeHalves, Arg, Flags);		SDValue HalfArg = DAG.getNode(ISD::FMUL, DL, VT, ThreeHalves, Arg, Flags);
AddToWorklist(HalfArg.getNode());		AddToWorklist(HalfArg.getNode());
[10 lines not shown]
for (unsigned i = 0; i < Iterations; ++i) {
AddToWorklist(NewEst.getNode());		AddToWorklist(NewEst.getNode());

NewEst = DAG.getNode(ISD::FSUB, DL, VT, ThreeHalves, NewEst, Flags);		NewEst = DAG.getNode(ISD::FSUB, DL, VT, ThreeHalves, NewEst, Flags);
AddToWorklist(NewEst.getNode());		AddToWorklist(NewEst.getNode());

Est = DAG.getNode(ISD::FMUL, DL, VT, Est, NewEst, Flags);		Est = DAG.getNode(ISD::FMUL, DL, VT, Est, NewEst, Flags);
AddToWorklist(Est.getNode());		AddToWorklist(Est.getNode());
}		}

		// If non-reciprocal square root is requested, multiply the result by Arg.
		if (!Reciprocal) {
		Est = DAG.getNode(ISD::FMUL, DL, VT, Est, Arg, Flags);
		AddToWorklist(Est.getNode());
		}

return Est;		return Est;
}		}

/// Newton iteration for a function: F(X) is X_{i+1} = X_i - F(X_i)/F'(X_i)		/// Newton iteration for a function: F(X) is X_{i+1} = X_i - F(X_i)/F'(X_i)
/// For the reciprocal sqrt, we need to find the zero of the function:		/// For the reciprocal sqrt, we need to find the zero of the function:
/// F(X) = 1/X^2 - A [which has a zero at X = 1/sqrt(A)]		/// F(X) = 1/X^2 - A [which has a zero at X = 1/sqrt(A)]
/// =>		/// =>
/// X_{i+1} = (-0.5 * X_i) * (A * X_i * X_i + (-3.0))		/// X_{i+1} = (-0.5 * X_i) * (A * X_i * X_i + (-3.0))
SDValue DAGCombiner::BuildRsqrtNRTwoConst(SDValue Arg, SDValue Est,		SDValue DAGCombiner::buildSqrtNRTwoConst(SDValue Arg, SDValue Est,
unsigned Iterations,		unsigned Iterations,
SDNodeFlags *Flags) {		SDNodeFlags *Flags, bool Reciprocal) {
EVT VT = Arg.getValueType();		EVT VT = Arg.getValueType();
SDLoc DL(Arg);		SDLoc DL(Arg);
SDValue MinusThree = DAG.getConstantFP(-3.0, DL, VT);		SDValue MinusThree = DAG.getConstantFP(-3.0, DL, VT);
SDValue MinusHalf = DAG.getConstantFP(-0.5, DL, VT);		SDValue MinusHalf = DAG.getConstantFP(-0.5, DL, VT);

// Newton iterations: Est = -0.5 * Est * (-3.0 + Arg * Est * Est)		// This routine must enter the loop below to work correctly
		// when (Reciprocal == false).
		assert(Iterations > 0);

		// Newton iterations for reciprocal square root:
		// E = (E * -0.5) * ((A * E) * E + -3.0)
for (unsigned i = 0; i < Iterations; ++i) {		for (unsigned i = 0; i < Iterations; ++i) {
SDValue HalfEst = DAG.getNode(ISD::FMUL, DL, VT, Est, MinusHalf, Flags);		SDValue AE = DAG.getNode(ISD::FMUL, DL, VT, Arg, Est, Flags);
AddToWorklist(HalfEst.getNode());		AddToWorklist(AE.getNode());

Est = DAG.getNode(ISD::FMUL, DL, VT, Est, Est, Flags);		SDValue AEE = DAG.getNode(ISD::FMUL, DL, VT, AE, Est, Flags);
AddToWorklist(Est.getNode());		AddToWorklist(AEE.getNode());

Est = DAG.getNode(ISD::FMUL, DL, VT, Est, Arg, Flags);		SDValue RHS = DAG.getNode(ISD::FADD, DL, VT, AEE, MinusThree, Flags);
AddToWorklist(Est.getNode());		AddToWorklist(RHS.getNode());

Est = DAG.getNode(ISD::FADD, DL, VT, Est, MinusThree, Flags);		// When calculating a square root at the last iteration build:
AddToWorklist(Est.getNode());		// S = ((A * E) * -0.5) * ((A * E) * E + -3.0)
		// (notice a common subexpression)
		SDValue LHS;
		if (Reciprocal || (i + 1) < Iterations) {
		// RSQRT: LHS = (E * -0.5)
		LHS = DAG.getNode(ISD::FMUL, DL, VT, Est, MinusHalf, Flags);
		} else {
		// SQRT: LHS = (A * E) * -0.5
		LHS = DAG.getNode(ISD::FMUL, DL, VT, AE, MinusHalf, Flags);
		}
		AddToWorklist(LHS.getNode());

Est = DAG.getNode(ISD::FMUL, DL, VT, Est, HalfEst, Flags);		Est = DAG.getNode(ISD::FMUL, DL, VT, LHS, RHS, Flags);
AddToWorklist(Est.getNode());		AddToWorklist(Est.getNode());
}		}

return Est;		return Est;
}		}

SDValue DAGCombiner::BuildRsqrtEstimate(SDValue Op, SDNodeFlags *Flags) {		/// Build code to calculate either rsqrt(Op) or sqrt(Op). In the latter case
		/// Op*rsqrt(Op) is actually computed, so additional postprocessing is needed if
		/// Op can be zero.
		SDValue DAGCombiner::buildSqrtEstimateImpl(SDValue Op, SDNodeFlags *Flags,
		bool Reciprocal) {
if (Level >= AfterLegalizeDAG)		if (Level >= AfterLegalizeDAG)
return SDValue();		return SDValue();

// Expose the DAG combiner to the target combiner implementations.		// Expose the DAG combiner to the target combiner implementations.
TargetLowering::DAGCombinerInfo DCI(DAG, Level, false, this);		TargetLowering::DAGCombinerInfo DCI(DAG, Level, false, this);
unsigned Iterations = 0;		unsigned Iterations = 0;
bool UseOneConstNR = false;		bool UseOneConstNR = false;
if (SDValue Est = TLI.getRsqrtEstimate(Op, DCI, Iterations, UseOneConstNR)) {		if (SDValue Est = TLI.getRsqrtEstimate(Op, DCI, Iterations, UseOneConstNR)) {
AddToWorklist(Est.getNode());		AddToWorklist(Est.getNode());
if (Iterations) {		if (Iterations) {
Est = UseOneConstNR ?		Est = UseOneConstNR
BuildRsqrtNROneConst(Op, Est, Iterations, Flags) :		? buildSqrtNROneConst(Op, Est, Iterations, Flags, Reciprocal)
BuildRsqrtNRTwoConst(Op, Est, Iterations, Flags);		: buildSqrtNRTwoConst(Op, Est, Iterations, Flags, Reciprocal);
}		}
return Est;		return Est;
}		}

return SDValue();		return SDValue();
}		}

		SDValue DAGCombiner::buildRsqrtEstimate(SDValue Op, SDNodeFlags *Flags) {
		return buildSqrtEstimateImpl(Op, Flags, true);
		}

		SDValue DAGCombiner::buildSqrtEstimate(SDValue Op, SDNodeFlags *Flags) {
		SDValue Est = buildSqrtEstimateImpl(Op, Flags, false);
		if (!Est)
		return SDValue();

		// Unfortunately, Est is now NaN if the input was exactly 0.
		// Select out this case and force the answer to 0.
		EVT VT = Est.getValueType();
		SDLoc DL(Op);
		SDValue Zero = DAG.getConstantFP(0.0, DL, VT);
		EVT CCVT = getSetCCResultType(VT);
		SDValue ZeroCmp = DAG.getSetCC(DL, CCVT, Op, Zero, ISD::SETEQ);
		AddToWorklist(ZeroCmp.getNode());

		Est = DAG.getNode(VT.isVector() ? ISD::VSELECT : ISD::SELECT, DL, VT, ZeroCmp,
		Zero, Est);
		AddToWorklist(Est.getNode());
		return Est;
		}

/// Return true if base is a frame index, which is known not to alias with		/// Return true if base is a frame index, which is known not to alias with
/// anything but itself. Provides base object and offset as results.		/// anything but itself. Provides base object and offset as results.
static bool FindBaseOffset(SDValue Ptr, SDValue &Base, int64_t &Offset,		static bool FindBaseOffset(SDValue Ptr, SDValue &Base, int64_t &Offset,
const GlobalValue *&GV, const void *&CV) {		const GlobalValue *&GV, const void *&CV) {
// Assume it is a primitive operation.		// Assume it is a primitive operation.
Base = Ptr; Offset = 0; GV = nullptr; CV = nullptr;		Base = Ptr; Offset = 0; GV = nullptr; CV = nullptr;

// If it's an adding a simple constant then integrate the offset.		// If it's an adding a simple constant then integrate the offset.
[315 lines not shown]

llvm/trunk/test/CodeGen/X86/sqrt-fastmath-mir.ll

				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx2,fma -recip=sqrt:2 -stop-after=expand-isel-pseudos 2>&1 | FileCheck %s

				declare float @llvm.sqrt.f32(float) #0

				define float @foo(float %f) #0 {
				; CHECK: {{name: *foo}}
				; CHECK: body:
				; CHECK: %0 = COPY %xmm0
				; CHECK: %1 = VRSQRTSSr killed %2, %0
				; CHECK: %3 = VMULSSrr %0, %1
				; CHECK: %4 = VMOVSSrm
				; CHECK: %5 = VFMADDSSr213r %1, killed %3, %4
				; CHECK: %6 = VMOVSSrm
				; CHECK: %7 = VMULSSrr %1, %6
				; CHECK: %8 = VMULSSrr killed %7, killed %5
				; CHECK: %9 = VMULSSrr %0, %8
				; CHECK: %10 = VFMADDSSr213r %8, %9, %4
				; CHECK: %11 = VMULSSrr %9, %6
				; CHECK: %12 = VMULSSrr killed %11, killed %10
				; CHECK: %13 = FsFLD0SS
				; CHECK: %14 = VCMPSSrr %0, killed %13, 0
				; CHECK: %15 = VFsANDNPSrr killed %14, killed %12
				; CHECK: %xmm0 = COPY %15
				; CHECK: RET 0, %xmm0
				%call = tail call float @llvm.sqrt.f32(float %f) #1
				ret float %call
				}

				define float @rfoo(float %f) #0 {
				; CHECK: {{name: *rfoo}}
				; CHECK: body: |
				; CHECK: %0 = COPY %xmm0
				; CHECK: %1 = VRSQRTSSr killed %2, %0
				; CHECK: %3 = VMULSSrr %0, %1
				; CHECK: %4 = VMOVSSrm
				; CHECK: %5 = VFMADDSSr213r %1, killed %3, %4
				; CHECK: %6 = VMOVSSrm
				; CHECK: %7 = VMULSSrr %1, %6
				; CHECK: %8 = VMULSSrr killed %7, killed %5
				; CHECK: %9 = VMULSSrr %0, %8
				; CHECK: %10 = VFMADDSSr213r %8, killed %9, %4
				; CHECK: %11 = VMULSSrr %8, %6
				; CHECK: %12 = VMULSSrr killed %11, killed %10
				; CHECK: %xmm0 = COPY %12
				; CHECK: RET 0, %xmm0
				%sqrt = tail call float @llvm.sqrt.f32(float %f)
				%div = fdiv fast float 1.0, %sqrt
				ret float %div
				}

				attributes #0 = { "unsafe-fp-math"="true" }
				attributes #1 = { nounwind readnone }

llvm/trunk/test/CodeGen/X86/sqrt-fastmath.ll

	[28 lines not shown]
	; NORECIP-LABEL: ff:			; NORECIP-LABEL: ff:
	; NORECIP: # BB#0:			; NORECIP: # BB#0:
	; NORECIP-NEXT: sqrtss %xmm0, %xmm0			; NORECIP-NEXT: sqrtss %xmm0, %xmm0
	; NORECIP-NEXT: retq			; NORECIP-NEXT: retq
	;			;
	; ESTIMATE-LABEL: ff:			; ESTIMATE-LABEL: ff:
	; ESTIMATE: # BB#0:			; ESTIMATE: # BB#0:
	; ESTIMATE-NEXT: vrsqrtss %xmm0, %xmm0, %xmm1			; ESTIMATE-NEXT: vrsqrtss %xmm0, %xmm0, %xmm1
	; ESTIMATE-NEXT: vmulss {{.*}}(%rip), %xmm1, %xmm2			; ESTIMATE-NEXT: vmulss %xmm1, %xmm0, %xmm2
	; ESTIMATE-NEXT: vmulss %xmm0, %xmm1, %xmm3			; ESTIMATE-NEXT: vmulss %xmm1, %xmm2, %xmm1
	; ESTIMATE-NEXT: vmulss %xmm3, %xmm1, %xmm1
	; ESTIMATE-NEXT: vaddss {{.*}}(%rip), %xmm1, %xmm1			; ESTIMATE-NEXT: vaddss {{.*}}(%rip), %xmm1, %xmm1
	; ESTIMATE-NEXT: vmulss %xmm0, %xmm2, %xmm2			; ESTIMATE-NEXT: vmulss {{.*}}(%rip), %xmm2, %xmm2
	; ESTIMATE-NEXT: vmulss %xmm2, %xmm1, %xmm1			; ESTIMATE-NEXT: vmulss %xmm1, %xmm2, %xmm1
	; ESTIMATE-NEXT: vxorps %xmm2, %xmm2, %xmm2			; ESTIMATE-NEXT: vxorps %xmm2, %xmm2, %xmm2
	; ESTIMATE-NEXT: vcmpeqss %xmm2, %xmm0, %xmm0			; ESTIMATE-NEXT: vcmpeqss %xmm2, %xmm0, %xmm0
	; ESTIMATE-NEXT: vandnps %xmm1, %xmm0, %xmm0			; ESTIMATE-NEXT: vandnps %xmm1, %xmm0, %xmm0
	; ESTIMATE-NEXT: retq			; ESTIMATE-NEXT: retq
	%call = tail call float @__sqrtf_finite(float %f) #1			%call = tail call float @__sqrtf_finite(float %f) #1
	ret float %call			ret float %call
	}			}

	[22 lines not shown]
	; NORECIP-NEXT: sqrtss %xmm0, %xmm1			; NORECIP-NEXT: sqrtss %xmm0, %xmm1
	; NORECIP-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero			; NORECIP-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
	; NORECIP-NEXT: divss %xmm1, %xmm0			; NORECIP-NEXT: divss %xmm1, %xmm0
	; NORECIP-NEXT: retq			; NORECIP-NEXT: retq
	;			;
	; ESTIMATE-LABEL: reciprocal_square_root:			; ESTIMATE-LABEL: reciprocal_square_root:
	; ESTIMATE: # BB#0:			; ESTIMATE: # BB#0:
	; ESTIMATE-NEXT: vrsqrtss %xmm0, %xmm0, %xmm1			; ESTIMATE-NEXT: vrsqrtss %xmm0, %xmm0, %xmm1
	; ESTIMATE-NEXT: vmulss {{.*}}(%rip), %xmm1, %xmm2			; ESTIMATE-NEXT: vmulss %xmm1, %xmm1, %xmm2
	; ESTIMATE-NEXT: vmulss %xmm0, %xmm1, %xmm0
	; ESTIMATE-NEXT: vmulss %xmm0, %xmm1, %xmm0
	; ESTIMATE-NEXT: vaddss {{.*}}(%rip), %xmm0, %xmm0
	; ESTIMATE-NEXT: vmulss %xmm2, %xmm0, %xmm0			; ESTIMATE-NEXT: vmulss %xmm2, %xmm0, %xmm0
				; ESTIMATE-NEXT: vaddss {{.*}}(%rip), %xmm0, %xmm0
				; ESTIMATE-NEXT: vmulss {{.*}}(%rip), %xmm1, %xmm1
				; ESTIMATE-NEXT: vmulss %xmm0, %xmm1, %xmm0
	; ESTIMATE-NEXT: retq			; ESTIMATE-NEXT: retq
	%sqrt = tail call float @llvm.sqrt.f32(float %x)			%sqrt = tail call float @llvm.sqrt.f32(float %x)
	%div = fdiv fast float 1.0, %sqrt			%div = fdiv fast float 1.0, %sqrt
	ret float %div			ret float %div
	}			}

	define <4 x float> @reciprocal_square_root_v4f32(<4 x float> %x) #0 {			define <4 x float> @reciprocal_square_root_v4f32(<4 x float> %x) #0 {
	; NORECIP-LABEL: reciprocal_square_root_v4f32:			; NORECIP-LABEL: reciprocal_square_root_v4f32:
	; NORECIP: # BB#0:			; NORECIP: # BB#0:
	; NORECIP-NEXT: sqrtps %xmm0, %xmm1			; NORECIP-NEXT: sqrtps %xmm0, %xmm1
	; NORECIP-NEXT: movaps {{.*#+}} xmm0 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]			; NORECIP-NEXT: movaps {{.*#+}} xmm0 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
	; NORECIP-NEXT: divps %xmm1, %xmm0			; NORECIP-NEXT: divps %xmm1, %xmm0
	; NORECIP-NEXT: retq			; NORECIP-NEXT: retq
	;			;
	; ESTIMATE-LABEL: reciprocal_square_root_v4f32:			; ESTIMATE-LABEL: reciprocal_square_root_v4f32:
	; ESTIMATE: # BB#0:			; ESTIMATE: # BB#0:
	; ESTIMATE-NEXT: vrsqrtps %xmm0, %xmm1			; ESTIMATE-NEXT: vrsqrtps %xmm0, %xmm1
	; ESTIMATE-NEXT: vmulps %xmm0, %xmm1, %xmm0			; ESTIMATE-NEXT: vmulps %xmm1, %xmm1, %xmm2
	; ESTIMATE-NEXT: vmulps %xmm0, %xmm1, %xmm0			; ESTIMATE-NEXT: vmulps %xmm2, %xmm0, %xmm0
	; ESTIMATE-NEXT: vaddps {{.*}}(%rip), %xmm0, %xmm0			; ESTIMATE-NEXT: vaddps {{.*}}(%rip), %xmm0, %xmm0
	; ESTIMATE-NEXT: vmulps {{.*}}(%rip), %xmm1, %xmm1			; ESTIMATE-NEXT: vmulps {{.*}}(%rip), %xmm1, %xmm1
	; ESTIMATE-NEXT: vmulps %xmm1, %xmm0, %xmm0			; ESTIMATE-NEXT: vmulps %xmm0, %xmm1, %xmm0
	; ESTIMATE-NEXT: retq			; ESTIMATE-NEXT: retq
	%sqrt = tail call <4 x float> @llvm.sqrt.v4f32(<4 x float> %x)			%sqrt = tail call <4 x float> @llvm.sqrt.v4f32(<4 x float> %x)
	%div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %sqrt			%div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %sqrt
	ret <4 x float> %div			ret <4 x float> %div
	}			}

	define <8 x float> @reciprocal_square_root_v8f32(<8 x float> %x) #0 {			define <8 x float> @reciprocal_square_root_v8f32(<8 x float> %x) #0 {
	; NORECIP-LABEL: reciprocal_square_root_v8f32:			; NORECIP-LABEL: reciprocal_square_root_v8f32:
	; NORECIP: # BB#0:			; NORECIP: # BB#0:
	; NORECIP-NEXT: sqrtps %xmm1, %xmm2			; NORECIP-NEXT: sqrtps %xmm1, %xmm2
	; NORECIP-NEXT: sqrtps %xmm0, %xmm3			; NORECIP-NEXT: sqrtps %xmm0, %xmm3
	; NORECIP-NEXT: movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]			; NORECIP-NEXT: movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]
	; NORECIP-NEXT: movaps %xmm1, %xmm0			; NORECIP-NEXT: movaps %xmm1, %xmm0
	; NORECIP-NEXT: divps %xmm3, %xmm0			; NORECIP-NEXT: divps %xmm3, %xmm0
	; NORECIP-NEXT: divps %xmm2, %xmm1			; NORECIP-NEXT: divps %xmm2, %xmm1
	; NORECIP-NEXT: retq			; NORECIP-NEXT: retq
	;			;
	; ESTIMATE-LABEL: reciprocal_square_root_v8f32:			; ESTIMATE-LABEL: reciprocal_square_root_v8f32:
	; ESTIMATE: # BB#0:			; ESTIMATE: # BB#0:
	; ESTIMATE-NEXT: vrsqrtps %ymm0, %ymm1			; ESTIMATE-NEXT: vrsqrtps %ymm0, %ymm1
	; ESTIMATE-NEXT: vmulps %ymm0, %ymm1, %ymm0			; ESTIMATE-NEXT: vmulps %ymm1, %ymm1, %ymm2
	; ESTIMATE-NEXT: vmulps %ymm0, %ymm1, %ymm0			; ESTIMATE-NEXT: vmulps %ymm2, %ymm0, %ymm0
	; ESTIMATE-NEXT: vaddps {{.*}}(%rip), %ymm0, %ymm0			; ESTIMATE-NEXT: vaddps {{.*}}(%rip), %ymm0, %ymm0
	; ESTIMATE-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm1			; ESTIMATE-NEXT: vmulps {{.*}}(%rip), %ymm1, %ymm1
	; ESTIMATE-NEXT: vmulps %ymm1, %ymm0, %ymm0			; ESTIMATE-NEXT: vmulps %ymm0, %ymm1, %ymm0
	; ESTIMATE-NEXT: retq			; ESTIMATE-NEXT: retq
	%sqrt = tail call <8 x float> @llvm.sqrt.v8f32(<8 x float> %x)			%sqrt = tail call <8 x float> @llvm.sqrt.v8f32(<8 x float> %x)
	%div = fdiv fast <8 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %sqrt			%div = fdiv fast <8 x float> <float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0, float 1.0>, %sqrt
	ret <8 x float> %div			ret <8 x float> %div
	}			}


	attributes #0 = { "unsafe-fp-math"="true" }			attributes #0 = { "unsafe-fp-math"="true" }
	attributes #1 = { nounwind readnone }			attributes #1 = { nounwind readnone }

This is an archive of the discontinued LLVM Phabricator instance.

Remove redundant FMUL in Newton-Raphson SQRT code
ClosedPublic
