This is an archive of the discontinued LLVM Phabricator instance.

Use rsqrt (X86) to speed up reciprocal square root calcs (PR20900)
Closed, Public

Authored by spatel on Oct 7 2014, 5:52 PM.


Details

Reviewers

nadav
andreadb
hfinkel

Commits

rG957efc23bb87: Use rsqrt (X86) to speed up reciprocal square root calcs
rL220570: Use rsqrt (X86) to speed up reciprocal square root calcs

Summary

This is a first step for generating SSE rsqrt instructions for reciprocal square root calcs when fast-math is allowed.

For now, be conservative and only enable this for AMD btver2 where performance improves significantly - for example, 29% on llvm/projects/test-suite/SingleSource/Benchmarks/BenchmarkGame/n-body.c if we convert the data type to single-precision float.

We will probably never enable this codegen for any Intel Core* chips because the sqrt/divider circuits are just too fast. On SandyBridge, sqrtss + divss can be as fast as 20 cycles which is better than the 23 cycle critical path for the rsqrt + mul + mul + add + mul estimate.

Follow-on patches may allow reciprocal (rcpss) optimizations, add more vector data types, and enable the optimization for more chips.

More background here: http://llvm.org/bugs/show_bug.cgi?id=20900

Diff Detail

Event Timeline

spatel updated this revision to Diff 14534. (Oct 7 2014, 5:52 PM)

spatel retitled this revision to "Use rsqrt (X86) to speed up reciprocal square root calcs (PR20900)".

spatel updated this object.

spatel edited the test plan for this revision.

spatel added reviewers: hfinkel, nadav.

spatel added subscribers: Unknown Object (MLST), RKSimon, tycho.

We will probably never enable this codegen for any Intel Core* chips because the sqrt/divider circuits are just too fast. On SandyBridge, sqrtss + divss can be as fast as 20 cycles which is better than the 23 cycle critical path for the rsqrt + mul + mul + add + mul estimate.

Critical path latency is good, but throughput is normally much better. According to Intel's optimization manual, rsqrtss, for example, is fully pipelined on most Intel cores (on Westmere and Nehalem the dispatch delay is 3 cycles, but 1 cycle elsewhere). But the dispatch delay time for sqrtss is 7 cycles on Haswell, 7-14 cycles on Sandy Bridge, something under 16 cycles for Westmere and Nehalem, and 11 cycles for Silvermont. The throughput for divss is a little better than sqrtss, but not by much.

In short, this is likely a big win *if* there is anything else going on (floating-point-wise), even on Intel cores. I could be wrong, but this very much reminds me of the problem that the MachineCombiner pass tries to solve for FMAs, etc. on some targets, and I wonder if it could somehow be applied to this as well.
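To make the latency-versus-throughput point concrete, here is a minimal sketch of a single pipelined execution unit. The function name `cyclesForIndependentOps` and the model itself are illustrative assumptions: it ignores port contention, the extra multiplies and add of the refinement sequence, and any dependence between operations.

```cpp
#include <cassert>

// Cycles to finish N independent operations on one pipelined unit:
// the first result appears after Latency cycles, and a new operation
// can be dispatched every DispatchInterval cycles after the previous one.
// This is a simplified model, not a description of any real core.
unsigned cyclesForIndependentOps(unsigned N, unsigned Latency,
                                 unsigned DispatchInterval) {
  assert(N > 0 && "need at least one operation");
  assert(DispatchInterval > 0 && "dispatch interval must be positive");
  return Latency + (N - 1) * DispatchInterval;
}
```

Holding latency fixed at an arbitrary 20 cycles, eight independent operations take 20 + 7 * 7 = 69 cycles with a 7-cycle dispatch interval (the Haswell sqrtss figure quoted above), but 20 + 7 * 1 = 27 cycles with a 1-cycle interval. The gap widens as more independent work becomes available.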

lib/Target/X86/X86ISelLowering.cpp:14341
I'd really prefer that you put the 2-constant version of the algorithm into the DAGCombiner alongside the 1-constant version, and just let the target pick. The algorithm itself is really a mathematical expression, not target-dependent at all, and we should try to keep such things available to other targets without copy-and-paste. Ideally, we'd then also have a flag to force one or the other, so that PPC can default to the 1-constant version and X86 can default to the 2-constant version, while a command-line option lets me force the choice for benchmarking.

In D5658#4, @hfinkel wrote:

In short, this is likely a big win *if* there is anything else going on (floating-point-wise), even on Intel cores. I could be wrong, but this very much reminds me of the problem that the MachineCombiner pass tries to solve for FMAs, etc. on some targets, and I wonder if it could somehow be applied to this as well.

Yes, I agree with the throughput argument. And if the user can flip just this one bit of codegen with an attribute flag, that would be ideal.

I didn't want to overstep with this patch though, so I thought it'd be best to just start with the core where I know it's always a win to get rid of the sqrtss/divss.

@tycho may be able to better assess the perf differences on Intel. Some informal benchmarking using an n-body program certainly does show big wins using the rsqrt code on Haswell.

lib/Target/X86/X86ISelLowering.cpp:14341
Agree - I'll rework this.

This version of the patch brings the 2-constant NR builder into DAGCombiner and adds a target-specified boolean to select whether we should use the 2-constant version or the 1-constant version.
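For readers who want to compare the two refinement forms directly, here is a minimal scalar sketch in plain C++. It follows the same operation order as the BuildRsqrtNROneConst and BuildRsqrtNRTwoConst builders in this patch, but works on ordinary floats instead of SDValues. The names `refineOneConst`, `refineTwoConst`, and `rsqrtRelError` are illustrative assumptions and do not appear in the patch.

```cpp
#include <cassert>
#include <cmath>

// One-constant form: Est = Est * (1.5 - (A/2) * Est * Est).
// A/2 is computed as (1.5 * A - A) so that 1.5 is the only FP constant.
float refineOneConst(float A, float Est, unsigned Iterations) {
  assert(A > 0.0f && "reciprocal square root needs a positive input");
  const float ThreeHalves = 1.5f;
  float HalfA = ThreeHalves * A - A;
  for (unsigned i = 0; i < Iterations; ++i)
    Est = Est * (ThreeHalves - HalfA * (Est * Est));
  return Est;
}

// Two-constant form: Est = (-0.5 * Est) * (A * Est * Est + (-3.0)).
float refineTwoConst(float A, float Est, unsigned Iterations) {
  assert(A > 0.0f && "reciprocal square root needs a positive input");
  const float MinusThree = -3.0f;
  const float MinusHalf = -0.5f;
  for (unsigned i = 0; i < Iterations; ++i) {
    float HalfEst = Est * MinusHalf;
    Est = (Est * Est * A + MinusThree) * HalfEst;
  }
  return Est;
}

// Relative error of an estimate against the library result, for checking.
float rsqrtRelError(float A, float Est) {
  float Exact = 1.0f / std::sqrt(A);
  return std::fabs(Est - Exact) / Exact;
}
```

Both forms are algebraically the same Newton-Raphson update, so starting from the same rough guess (say 0.7 for A = 2) they converge to the same value up to rounding. The difference is only in which constants must be materialized and how the multiplies are arranged.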

We can add a command-line override for the NR selector in a subsequent patch.

hfinkel added inline comments. (Oct 9 2014, 7:05 PM)

lib/Target/X86/X86ISelLowering.cpp:14347
Please write out "significant digits"
lib/Target/X86/X86ISelLowering.cpp:14355
Why wouldn't it be?

spatel added inline comments. (Oct 9 2014, 8:11 PM)

lib/Target/X86/X86ISelLowering.cpp:14355
A double-precision rsqrt estimate with refinement on x86 prior to FMA requires at least 16 instructions: convert to single, rsqrtss, convert back to double, refine (3 steps = at least 13 insts). I don't think Intel/AMD ever intended for that, or they would've added 'rsqrtsd' (similar to PPC's double-precision frsqrte). AFAICT, no x86 compiler tries to generate that sequence. Now that FMA has been introduced, it might be more feasible, but the HW implementations that have FMA also have really fast sqrt/div units, so it's again not worth it. Add this background to the code comment?
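The step counts in this discussion can be reproduced with a small sketch. It assumes the idealized model that each Newton-Raphson step exactly doubles the number of correct bits. Real refinement also loses a little accuracy to rounding, so treat the result as a rough guide. The function name `refinementSteps` is hypothetical.

```cpp
#include <cassert>

// Number of Newton-Raphson steps needed to reach TargetBits of precision,
// starting from an estimate with EstimateBits correct bits, under the
// idealized assumption that every step doubles the correct bits.
unsigned refinementSteps(unsigned EstimateBits, unsigned TargetBits) {
  assert(EstimateBits > 0 && "estimate must have at least one correct bit");
  unsigned Steps = 0;
  for (unsigned Bits = EstimateBits; Bits < TargetBits; Bits *= 2)
    ++Steps;
  return Steps;
}
```

With a 12-bit x86 estimate this gives one step for a 24-bit float result and three steps for a 53-bit double result, which matches the "3 steps" quoted above. With the PPC figures from the patch (5 or 14 bits of estimate, 23 and 52 digits of target), it gives 3 and 4 steps without hasRecipPrec() and 1 and 2 steps with it.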

----- Original Message -----

From: "Sanjay Patel" <spatel@rotateright.com>
To: spatel@rotateright.com, nrotem@apple.com, hfinkel@anl.gov
Cc: steven@uplinklabs.net, llvm-dev@redking.me.uk, llvm-commits@cs.uiuc.edu
Sent: Thursday, October 9, 2014 10:11:56 PM
Subject: Re: [PATCH] Use rsqrt (X86) to speed up reciprocal square root calcs (PR20900)

Comment at: lib/Target/X86/X86ISelLowering.cpp:14355
@@ +14354,3 @@
+ // TODO: Add support for AVX (v8f32) and AVX512 (v16f32).
+ // TODO: Is it ever worthwhile to use an estimate for f64?
+ if (Subtarget->hasSSE1() && (VT == MVT::f32 || VT == MVT::v4f32)) {

hfinkel wrote:

Why wouldn't it be?

A double-precision rsqrt estimate with refinement on x86 prior to FMA
requires at least 16 instructions: convert to single, rsqrtss,
convert back to double, refine (3 steps = at least 13 insts). I
don't think Intel/AMD ever intended for that, or they would've added
'rsqrtsd' (similar to PPC's double-precision frsqrte). AFAICT, no
x86 compiler tries to generate that sequence. Now that FMA has been
introduced, it might be more feasible, but the HW implementations
that have FMA also have really fast sqrt/div units, so it's again
not worth it. Add this background to the code comment?

Yes, please. But in light of that, I'd probably not make it a "TODO", just say, "It is likely not profitable to do this for f64 because...".

-Hal

http://reviews.llvm.org/D5658

Updated comments - thank you for the feedback!

Any other suggestions/improvements?

Using this n-body benchmark program:
https://github.com/tycho/nbody

...on a btver2 system, I see excellent performance improvements.

Before:

Running simulation with 16384 particles, crosscheck enabled, CPU enabled, 1 threads
CPU_SOA:            2.10 GFLOPS
CPU_SOA_tiled:   1.12 GFLOPS
CPU_AOS:            0.64 GFLOPS
CPU_AOS_tiled:   1.04 GFLOPS

After:

Running simulation with 16384 particles, crosscheck enabled, CPU enabled, 1 threads
CPU_SOA:           5.19 GFLOPS
CPU_SOA_tiled:  5.34 GFLOPS
CPU_AOS:           1.27 GFLOPS
CPU_AOS_tiled:  1.59 GFLOPS

Ping.

This one should be less controversial than http://reviews.llvm.org/D5787 / http://llvm.org/bugs/show_bug.cgi?id=21290 .
We're not actually adding any more fast-math complexity with this patch - just relying on the existing logic.

Ping * 2.

Hi Sanjay,

The changes to the dag combiner and all the x86 specific changes look good to me.
FWIW, the rest of the patch looks good to me too. But you might want to wait to see what Hal thinks.

This revision is now accepted and ready to land. (Oct 23 2014, 7:59 AM)

LGTM too.

Closed by commit rL220570 (authored by @spatel).

Thanks, Andrea and Hal. Checked in with r220570.

Revision Contents

Path	Size
include/llvm/Target/TargetLowering.h	5 lines
lib/CodeGen/SelectionDAG/DAGCombiner.cpp	117 lines
lib/Target/PowerPC/PPCISelLowering.h	3 lines
lib/Target/PowerPC/PPCISelLowering.cpp	4 lines
lib/Target/X86/ (five files, names missing)	5 lines, 5 lines, 26 lines, 6 lines, 1 line
test/CodeGen/X86/sqrt-fastmath.ll	55 lines

Diff 14694

include/llvm/Target/TargetLowering.h

[...] public:

/// Hooks for building estimates in place of slower divisions and square		/// Hooks for building estimates in place of slower divisions and square
/// roots.		/// roots.

/// Return a reciprocal square root estimate value for the input operand.		/// Return a reciprocal square root estimate value for the input operand.
/// The RefinementSteps output is the number of Newton-Raphson refinement		/// The RefinementSteps output is the number of Newton-Raphson refinement
/// iterations required to generate a sufficient (though not necessarily		/// iterations required to generate a sufficient (though not necessarily
/// IEEE-754 compliant) estimate for the value type.		/// IEEE-754 compliant) estimate for the value type.
		/// The boolean UseOneConstNR output is used to select a Newton-Raphson
		/// algorithm implementation that uses one constant or two constants.
/// A target may choose to implement its own refinement within this function.		/// A target may choose to implement its own refinement within this function.
/// If that's true, then return '0' as the number of RefinementSteps to avoid		/// If that's true, then return '0' as the number of RefinementSteps to avoid
/// any further refinement of the estimate.		/// any further refinement of the estimate.
/// An empty SDValue return means no estimate sequence can be created.		/// An empty SDValue return means no estimate sequence can be created.
virtual SDValue getRsqrtEstimate(SDValue Operand,		virtual SDValue getRsqrtEstimate(SDValue Operand,
DAGCombinerInfo &DCI,		DAGCombinerInfo &DCI,
unsigned &RefinementSteps) const {		unsigned &RefinementSteps,
		bool &UseOneConstNR) const {
return SDValue();		return SDValue();
}		}

/// Return a reciprocal estimate value for the input operand.		/// Return a reciprocal estimate value for the input operand.
/// The RefinementSteps output is the number of Newton-Raphson refinement		/// The RefinementSteps output is the number of Newton-Raphson refinement
/// iterations required to generate a sufficient (though not necessarily		/// iterations required to generate a sufficient (though not necessarily
/// IEEE-754 compliant) estimate for the value type.		/// IEEE-754 compliant) estimate for the value type.
/// A target may choose to implement its own refinement within this function.		/// A target may choose to implement its own refinement within this function.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp


[...] SDValue SimplifyNodeWithTwoResults(SDNode *N, unsigned LoOp,
unsigned HiOp);		unsigned HiOp);
SDValue CombineConsecutiveLoads(SDNode *N, EVT VT);		SDValue CombineConsecutiveLoads(SDNode *N, EVT VT);
SDValue ConstantFoldBITCASTofBUILD_VECTOR(SDNode *, EVT);		SDValue ConstantFoldBITCASTofBUILD_VECTOR(SDNode *, EVT);
SDValue BuildSDIV(SDNode *N);		SDValue BuildSDIV(SDNode *N);
SDValue BuildSDIVPow2(SDNode *N);		SDValue BuildSDIVPow2(SDNode *N);
SDValue BuildUDIV(SDNode *N);		SDValue BuildUDIV(SDNode *N);
SDValue BuildReciprocalEstimate(SDValue Op);		SDValue BuildReciprocalEstimate(SDValue Op);
SDValue BuildRsqrtEstimate(SDValue Op);		SDValue BuildRsqrtEstimate(SDValue Op);
		SDValue BuildRsqrtNROneConst(SDValue Op, SDValue Est, unsigned Iterations);
		SDValue BuildRsqrtNRTwoConst(SDValue Op, SDValue Est, unsigned Iterations);
SDValue MatchBSwapHWordLow(SDNode *N, SDValue N0, SDValue N1,		SDValue MatchBSwapHWordLow(SDNode *N, SDValue N0, SDValue N1,
bool DemandHighBits = true);		bool DemandHighBits = true);
SDValue MatchBSwapHWord(SDNode *N, SDValue N0, SDValue N1);		SDValue MatchBSwapHWord(SDNode *N, SDValue N0, SDValue N1);
SDNode *MatchRotatePosNeg(SDValue Shifted, SDValue Pos, SDValue Neg,		SDNode *MatchRotatePosNeg(SDValue Shifted, SDValue Pos, SDValue Neg,
SDValue InnerPos, SDValue InnerNeg,		SDValue InnerPos, SDValue InnerNeg,
unsigned PosOpcode, unsigned NegOpcode,		unsigned PosOpcode, unsigned NegOpcode,
SDLoc DL);		SDLoc DL);
SDNode *MatchRotate(SDValue LHS, SDValue RHS, SDLoc DL);		SDNode *MatchRotate(SDValue LHS, SDValue RHS, SDLoc DL);
[...] if (N1CFP) {
return DAG.getNode(ISD::FMUL, SDLoc(N), VT, N0,		return DAG.getNode(ISD::FMUL, SDLoc(N), VT, N0,
DAG.getConstantFP(Recip, VT));		DAG.getConstantFP(Recip, VT));
}		}

// If this FDIV is part of a reciprocal square root, it may be folded		// If this FDIV is part of a reciprocal square root, it may be folded
// into a target-specific square root estimate instruction.		// into a target-specific square root estimate instruction.
if (N1.getOpcode() == ISD::FSQRT) {		if (N1.getOpcode() == ISD::FSQRT) {
if (SDValue RV = BuildRsqrtEstimate(N1.getOperand(0))) {		if (SDValue RV = BuildRsqrtEstimate(N1.getOperand(0))) {
AddToWorklist(RV.getNode());
return DAG.getNode(ISD::FMUL, DL, VT, N0, RV);		return DAG.getNode(ISD::FMUL, DL, VT, N0, RV);
}		}
} else if (N1.getOpcode() == ISD::FP_EXTEND &&		} else if (N1.getOpcode() == ISD::FP_EXTEND &&
N1.getOperand(0).getOpcode() == ISD::FSQRT) {		N1.getOperand(0).getOpcode() == ISD::FSQRT) {
if (SDValue RV = BuildRsqrtEstimate(N1.getOperand(0).getOperand(0))) {		if (SDValue RV = BuildRsqrtEstimate(N1.getOperand(0).getOperand(0))) {
AddToWorklist(RV.getNode());
RV = DAG.getNode(ISD::FP_EXTEND, SDLoc(N1), VT, RV);		RV = DAG.getNode(ISD::FP_EXTEND, SDLoc(N1), VT, RV);
AddToWorklist(RV.getNode());		AddToWorklist(RV.getNode());
return DAG.getNode(ISD::FMUL, DL, VT, N0, RV);		return DAG.getNode(ISD::FMUL, DL, VT, N0, RV);
}		}
} else if (N1.getOpcode() == ISD::FP_ROUND &&		} else if (N1.getOpcode() == ISD::FP_ROUND &&
N1.getOperand(0).getOpcode() == ISD::FSQRT) {		N1.getOperand(0).getOpcode() == ISD::FSQRT) {
if (SDValue RV = BuildRsqrtEstimate(N1.getOperand(0).getOperand(0))) {		if (SDValue RV = BuildRsqrtEstimate(N1.getOperand(0).getOperand(0))) {
AddToWorklist(RV.getNode());
RV = DAG.getNode(ISD::FP_ROUND, SDLoc(N1), VT, RV, N1.getOperand(1));		RV = DAG.getNode(ISD::FP_ROUND, SDLoc(N1), VT, RV, N1.getOperand(1));
AddToWorklist(RV.getNode());		AddToWorklist(RV.getNode());
return DAG.getNode(ISD::FMUL, DL, VT, N0, RV);		return DAG.getNode(ISD::FMUL, DL, VT, N0, RV);
}		}
} else if (N1.getOpcode() == ISD::FMUL) {		} else if (N1.getOpcode() == ISD::FMUL) {
// Look through an FMUL. Even though this won't remove the FDIV directly,		// Look through an FMUL. Even though this won't remove the FDIV directly,
// it's still worthwhile to get rid of the FSQRT if possible.		// it's still worthwhile to get rid of the FSQRT if possible.
SDValue SqrtOp;		SDValue SqrtOp;
SDValue OtherOp;		SDValue OtherOp;
if (N1.getOperand(0).getOpcode() == ISD::FSQRT) {		if (N1.getOperand(0).getOpcode() == ISD::FSQRT) {
SqrtOp = N1.getOperand(0);		SqrtOp = N1.getOperand(0);
OtherOp = N1.getOperand(1);		OtherOp = N1.getOperand(1);
} else if (N1.getOperand(1).getOpcode() == ISD::FSQRT) {		} else if (N1.getOperand(1).getOpcode() == ISD::FSQRT) {
SqrtOp = N1.getOperand(1);		SqrtOp = N1.getOperand(1);
OtherOp = N1.getOperand(0);		OtherOp = N1.getOperand(0);
}		}
if (SqrtOp.getNode()) {		if (SqrtOp.getNode()) {
// We found a FSQRT, so try to make this fold:		// We found a FSQRT, so try to make this fold:
// x / (y * sqrt(z)) -> x * (rsqrt(z) / y)		// x / (y * sqrt(z)) -> x * (rsqrt(z) / y)
if (SDValue RV = BuildRsqrtEstimate(SqrtOp.getOperand(0))) {		if (SDValue RV = BuildRsqrtEstimate(SqrtOp.getOperand(0))) {
AddToWorklist(RV.getNode());
RV = DAG.getNode(ISD::FDIV, SDLoc(N1), VT, RV, OtherOp);		RV = DAG.getNode(ISD::FDIV, SDLoc(N1), VT, RV, OtherOp);
AddToWorklist(RV.getNode());		AddToWorklist(RV.getNode());
return DAG.getNode(ISD::FMUL, DL, VT, N0, RV);		return DAG.getNode(ISD::FMUL, DL, VT, N0, RV);
}		}
}		}
}		}

// Fold into a reciprocal estimate and multiply instead of a real divide.		// Fold into a reciprocal estimate and multiply instead of a real divide.
[...] SDValue DAGCombiner::visitFREM(SDNode *N) {

return SDValue();		return SDValue();
}		}

SDValue DAGCombiner::visitFSQRT(SDNode *N) {		SDValue DAGCombiner::visitFSQRT(SDNode *N) {
if (DAG.getTarget().Options.UnsafeFPMath) {		if (DAG.getTarget().Options.UnsafeFPMath) {
// Compute this as X * (1/sqrt(X)) = X * (X ** -0.5)		// Compute this as X * (1/sqrt(X)) = X * (X ** -0.5)
if (SDValue RV = BuildRsqrtEstimate(N->getOperand(0))) {		if (SDValue RV = BuildRsqrtEstimate(N->getOperand(0))) {
AddToWorklist(RV.getNode());
EVT VT = RV.getValueType();		EVT VT = RV.getValueType();
RV = DAG.getNode(ISD::FMUL, SDLoc(N), VT, N->getOperand(0), RV);		RV = DAG.getNode(ISD::FMUL, SDLoc(N), VT, N->getOperand(0), RV);
AddToWorklist(RV.getNode());		AddToWorklist(RV.getNode());

// Unfortunately, RV is now NaN if the input was exactly 0.		// Unfortunately, RV is now NaN if the input was exactly 0.
// Select out this case and force the answer to 0.		// Select out this case and force the answer to 0.
SDValue Zero = DAG.getConstantFP(0.0, VT);		SDValue Zero = DAG.getConstantFP(0.0, VT);
SDValue ZeroCmp =		SDValue ZeroCmp =
[...] if (Iterations) {
}		}
}		}
return Est;		return Est;
}		}

return SDValue();		return SDValue();
}		}

SDValue DAGCombiner::BuildRsqrtEstimate(SDValue Op) {		/// Newton iteration for a function: F(X) is X_{i+1} = X_i - F(X_i)/F'(X_i)
if (Level >= AfterLegalizeDAG)		/// For the reciprocal sqrt, we need to find the zero of the function:
return SDValue();		/// F(X) = 1/X^2 - A [which has a zero at X = 1/sqrt(A)]
		/// =>
// Expose the DAG combiner to the target combiner implementations.		/// X_{i+1} = X_i (1.5 - A X_i^2 / 2)
TargetLowering::DAGCombinerInfo DCI(DAG, Level, false, this);		/// As a result, we precompute A/2 prior to the iteration loop.
unsigned Iterations = 0;		SDValue DAGCombiner::BuildRsqrtNROneConst(SDValue Arg, SDValue Est,
if (SDValue Est = TLI.getRsqrtEstimate(Op, DCI, Iterations)) {		unsigned Iterations) {
if (Iterations) {		EVT VT = Arg.getValueType();
// Newton iteration for a function: F(X) is X_{i+1} = X_i - F(X_i)/F'(X_i)		SDLoc DL(Arg);
// For the reciprocal sqrt, we need to find the zero of the function:		SDValue ThreeHalves = DAG.getConstantFP(1.5, VT);
// F(X) = 1/X^2 - A [which has a zero at X = 1/sqrt(A)]
// =>
// X_{i+1} = X_i (1.5 - A X_i^2 / 2)
// As a result, we precompute A/2 prior to the iteration loop.
EVT VT = Op.getValueType();
SDLoc DL(Op);
SDValue FPThreeHalves = DAG.getConstantFP(1.5, VT);

AddToWorklist(Est.getNode());

// We now need 0.5 * Arg which we can write as (1.5 * Arg - Arg) so that		// We now need 0.5 * Arg which we can write as (1.5 * Arg - Arg) so that
// this entire sequence requires only one FP constant.		// this entire sequence requires only one FP constant.
SDValue HalfArg = DAG.getNode(ISD::FMUL, DL, VT, FPThreeHalves, Op);		SDValue HalfArg = DAG.getNode(ISD::FMUL, DL, VT, ThreeHalves, Arg);
AddToWorklist(HalfArg.getNode());		AddToWorklist(HalfArg.getNode());

HalfArg = DAG.getNode(ISD::FSUB, DL, VT, HalfArg, Op);		HalfArg = DAG.getNode(ISD::FSUB, DL, VT, HalfArg, Arg);
AddToWorklist(HalfArg.getNode());		AddToWorklist(HalfArg.getNode());

// Newton iterations: Est = Est * (1.5 - HalfArg * Est * Est)		// Newton iterations: Est = Est * (1.5 - HalfArg * Est * Est)
for (unsigned i = 0; i < Iterations; ++i) {		for (unsigned i = 0; i < Iterations; ++i) {
SDValue NewEst = DAG.getNode(ISD::FMUL, DL, VT, Est, Est);		SDValue NewEst = DAG.getNode(ISD::FMUL, DL, VT, Est, Est);
AddToWorklist(NewEst.getNode());		AddToWorklist(NewEst.getNode());

NewEst = DAG.getNode(ISD::FMUL, DL, VT, HalfArg, NewEst);		NewEst = DAG.getNode(ISD::FMUL, DL, VT, HalfArg, NewEst);
AddToWorklist(NewEst.getNode());		AddToWorklist(NewEst.getNode());

NewEst = DAG.getNode(ISD::FSUB, DL, VT, FPThreeHalves, NewEst);		NewEst = DAG.getNode(ISD::FSUB, DL, VT, ThreeHalves, NewEst);
AddToWorklist(NewEst.getNode());		AddToWorklist(NewEst.getNode());

Est = DAG.getNode(ISD::FMUL, DL, VT, Est, NewEst);		Est = DAG.getNode(ISD::FMUL, DL, VT, Est, NewEst);
AddToWorklist(Est.getNode());		AddToWorklist(Est.getNode());
}		}
		return Est;
		}

		/// Newton iteration for a function: F(X) is X_{i+1} = X_i - F(X_i)/F'(X_i)
		/// For the reciprocal sqrt, we need to find the zero of the function:
		/// F(X) = 1/X^2 - A [which has a zero at X = 1/sqrt(A)]
		/// =>
		/// X_{i+1} = (-0.5 * X_i) * (A * X_i * X_i + (-3.0))
		SDValue DAGCombiner::BuildRsqrtNRTwoConst(SDValue Arg, SDValue Est,
		unsigned Iterations) {
		EVT VT = Arg.getValueType();
		SDLoc DL(Arg);
		SDValue MinusThree = DAG.getConstantFP(-3.0, VT);
		SDValue MinusHalf = DAG.getConstantFP(-0.5, VT);

		// Newton iterations: Est = -0.5 * Est * (-3.0 + Arg * Est * Est)
		for (unsigned i = 0; i < Iterations; ++i) {
		SDValue HalfEst = DAG.getNode(ISD::FMUL, DL, VT, Est, MinusHalf);
		AddToWorklist(HalfEst.getNode());

		Est = DAG.getNode(ISD::FMUL, DL, VT, Est, Est);
		AddToWorklist(Est.getNode());

		Est = DAG.getNode(ISD::FMUL, DL, VT, Est, Arg);
		AddToWorklist(Est.getNode());

		Est = DAG.getNode(ISD::FADD, DL, VT, Est, MinusThree);
		AddToWorklist(Est.getNode());

		Est = DAG.getNode(ISD::FMUL, DL, VT, Est, HalfEst);
		AddToWorklist(Est.getNode());
		}
		return Est;
		}

		SDValue DAGCombiner::BuildRsqrtEstimate(SDValue Op) {
		if (Level >= AfterLegalizeDAG)
		return SDValue();

		// Expose the DAG combiner to the target combiner implementations.
		TargetLowering::DAGCombinerInfo DCI(DAG, Level, false, this);
		unsigned Iterations = 0;
		bool UseOneConstNR = false;
		if (SDValue Est = TLI.getRsqrtEstimate(Op, DCI, Iterations, UseOneConstNR)) {
		AddToWorklist(Est.getNode());
		if (Iterations) {
		Est = UseOneConstNR ?
		BuildRsqrtNROneConst(Op, Est, Iterations) :
		BuildRsqrtNRTwoConst(Op, Est, Iterations);
}		}
return Est;		return Est;
}		}

return SDValue();		return SDValue();
}		}

/// Return true if base is a frame index, which is known not to alias with		/// Return true if base is a frame index, which is known not to alias with
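The visitFSQRT hunk in this file computes sqrt(X) as X * rsqrt(X) and then selects out the X == 0 case, because the product would otherwise be NaN. Here is a minimal scalar sketch of that idea. The function name `sqrtViaRsqrt` is hypothetical, and the library call `1.0f / std::sqrt(X)` merely stands in for a hardware estimate followed by refinement.

```cpp
#include <cassert>
#include <cmath>

// sqrt(X) computed as X * (1/sqrt(X)), with the X == 0 case selected out.
float sqrtViaRsqrt(float X) {
  assert(X >= 0.0f && "sketch only handles non-negative inputs");
  float RsqrtEst = 1.0f / std::sqrt(X); // +infinity when X == 0
  float Result = X * RsqrtEst;          // 0 * infinity is NaN when X == 0
  return X == 0.0f ? 0.0f : Result;     // force the zero case to 0
}
```

The final select is what keeps the zero input well-defined. Without it, `sqrtViaRsqrt(0.0f)` would return NaN under IEEE-754 arithmetic.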

lib/Target/PowerPC/PPCISelLowering.h

[...] private:

SDValue lowerEH_SJLJ_SETJMP(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerEH_SJLJ_SETJMP(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerEH_SJLJ_LONGJMP(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerEH_SJLJ_LONGJMP(SDValue Op, SelectionDAG &DAG) const;

SDValue DAGCombineExtBoolTrunc(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue DAGCombineExtBoolTrunc(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue DAGCombineTruncBoolExt(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue DAGCombineTruncBoolExt(SDNode *N, DAGCombinerInfo &DCI) const;

SDValue getRsqrtEstimate(SDValue Operand, DAGCombinerInfo &DCI,		SDValue getRsqrtEstimate(SDValue Operand, DAGCombinerInfo &DCI,
unsigned &RefinementSteps) const override;		unsigned &RefinementSteps,
		bool &UseOneConstNR) const override;
SDValue getRecipEstimate(SDValue Operand, DAGCombinerInfo &DCI,		SDValue getRecipEstimate(SDValue Operand, DAGCombinerInfo &DCI,
unsigned &RefinementSteps) const override;		unsigned &RefinementSteps) const override;

CCAssignFn *useFastISelCCs(unsigned Flag) const;		CCAssignFn *useFastISelCCs(unsigned Flag) const;
};		};

namespace PPC {		namespace PPC {
FastISel *createFastISel(FunctionLoweringInfo &FuncInfo,		FastISel *createFastISel(FunctionLoweringInfo &FuncInfo,

lib/Target/PowerPC/PPCISelLowering.cpp


	}			}

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Target Optimization Hooks			// Target Optimization Hooks
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	SDValue PPCTargetLowering::getRsqrtEstimate(SDValue Operand,			SDValue PPCTargetLowering::getRsqrtEstimate(SDValue Operand,
	DAGCombinerInfo &DCI,			DAGCombinerInfo &DCI,
	unsigned &RefinementSteps) const {			unsigned &RefinementSteps,
				bool &UseOneConstNR) const {
	EVT VT = Operand.getValueType();			EVT VT = Operand.getValueType();
	if ((VT == MVT::f32 && Subtarget.hasFRSQRTES()) ||			if ((VT == MVT::f32 && Subtarget.hasFRSQRTES()) ||
	(VT == MVT::f64 && Subtarget.hasFRSQRTE()) ||			(VT == MVT::f64 && Subtarget.hasFRSQRTE()) ||
	(VT == MVT::v4f32 && Subtarget.hasAltivec()) ||			(VT == MVT::v4f32 && Subtarget.hasAltivec()) ||
	(VT == MVT::v2f64 && Subtarget.hasVSX())) {			(VT == MVT::v2f64 && Subtarget.hasVSX())) {
	// Convergence is quadratic, so we essentially double the number of digits			// Convergence is quadratic, so we essentially double the number of digits
	// correct after every iteration. For both FRE and FRSQRTE, the minimum			// correct after every iteration. For both FRE and FRSQRTE, the minimum
	// architected relative accuracy is 2^-5. When hasRecipPrec(), this is			// architected relative accuracy is 2^-5. When hasRecipPrec(), this is
	// 2^-14. IEEE float has 23 digits and double has 52 digits.			// 2^-14. IEEE float has 23 digits and double has 52 digits.
	RefinementSteps = Subtarget.hasRecipPrec() ? 1 : 3;			RefinementSteps = Subtarget.hasRecipPrec() ? 1 : 3;
	if (VT.getScalarType() == MVT::f64)			if (VT.getScalarType() == MVT::f64)
	++RefinementSteps;			++RefinementSteps;
				UseOneConstNR = true;
	return DCI.DAG.getNode(PPCISD::FRSQRTE, SDLoc(Operand), VT, Operand);			return DCI.DAG.getNode(PPCISD::FRSQRTE, SDLoc(Operand), VT, Operand);
	}			}
	return SDValue();			return SDValue();
	}			}

	SDValue PPCTargetLowering::getRecipEstimate(SDValue Operand,			SDValue PPCTargetLowering::getRecipEstimate(SDValue Operand,
	DAGCombinerInfo &DCI,			DAGCombinerInfo &DCI,
	unsigned &RefinementSteps) const {			unsigned &RefinementSteps) const {

lib/Target/X86/X86.td

[...] def FeatureCallRegIndirect : SubtargetFeature<"call-reg-indirect",
"CallRegIndirect", "true",		"CallRegIndirect", "true",
"Call register indirect">;		"Call register indirect">;
def FeatureLEAUsesAG : SubtargetFeature<"lea-uses-ag", "LEAUsesAG", "true",		def FeatureLEAUsesAG : SubtargetFeature<"lea-uses-ag", "LEAUsesAG", "true",
"LEA instruction needs inputs at AG stage">;		"LEA instruction needs inputs at AG stage">;
def FeatureSlowLEA : SubtargetFeature<"slow-lea", "SlowLEA", "true",		def FeatureSlowLEA : SubtargetFeature<"slow-lea", "SlowLEA", "true",
"LEA instruction with certain arguments is slow">;		"LEA instruction with certain arguments is slow">;
def FeatureSlowIncDec : SubtargetFeature<"slow-incdec", "SlowIncDec", "true",		def FeatureSlowIncDec : SubtargetFeature<"slow-incdec", "SlowIncDec", "true",
"INC and DEC instructions are slower than ADD and SUB">;		"INC and DEC instructions are slower than ADD and SUB">;
		def FeatureUseSqrtEst : SubtargetFeature<"use-sqrt-est", "UseSqrtEst", "true",
		"Use RSQRT* to optimize square root calculations">;

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// X86 processors supported.		// X86 processors supported.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

include "X86Schedule.td"		include "X86Schedule.td"

def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",		def ProcIntelAtom : SubtargetFeature<"atom", "X86ProcFamily", "IntelAtom",
[...] def : Proc<"btver1", [FeatureSSSE3, FeatureSSE4A, FeatureCMPXCHG16B,
FeaturePRFCHW, FeatureLZCNT, FeaturePOPCNT,		FeaturePRFCHW, FeatureLZCNT, FeaturePOPCNT,
FeatureSlowSHLD]>;		FeatureSlowSHLD]>;

// Jaguar		// Jaguar
def : ProcessorModel<"btver2", BtVer2Model,		def : ProcessorModel<"btver2", BtVer2Model,
[FeatureAVX, FeatureSSE4A, FeatureCMPXCHG16B,		[FeatureAVX, FeatureSSE4A, FeatureCMPXCHG16B,
FeaturePRFCHW, FeatureAES, FeaturePCLMUL,		FeaturePRFCHW, FeatureAES, FeaturePCLMUL,
FeatureBMI, FeatureF16C, FeatureMOVBE,		FeatureBMI, FeatureF16C, FeatureMOVBE,
FeatureLZCNT, FeaturePOPCNT, FeatureSlowSHLD]>;		FeatureLZCNT, FeaturePOPCNT, FeatureSlowSHLD,
		FeatureUseSqrtEst]>;

// Bulldozer		// Bulldozer
def : Proc<"bdver1", [FeatureXOP, FeatureFMA4, FeatureCMPXCHG16B,		def : Proc<"bdver1", [FeatureXOP, FeatureFMA4, FeatureCMPXCHG16B,
FeatureAES, FeaturePRFCHW, FeaturePCLMUL,		FeatureAES, FeaturePRFCHW, FeaturePCLMUL,
FeatureLZCNT, FeaturePOPCNT, FeatureSlowSHLD]>;		FeatureLZCNT, FeaturePOPCNT, FeatureSlowSHLD]>;
// Piledriver		// Piledriver
def : Proc<"bdver2", [FeatureXOP, FeatureFMA4, FeatureCMPXCHG16B,		def : Proc<"bdver2", [FeatureXOP, FeatureFMA4, FeatureCMPXCHG16B,
FeatureAES, FeaturePRFCHW, FeaturePCLMUL,		FeatureAES, FeaturePRFCHW, FeaturePCLMUL,

lib/Target/X86/X86ISelLowering.h

[...] private:

/// Emit nodes that will be selected as "cmp Op0,Op1", or something		/// Emit nodes that will be selected as "cmp Op0,Op1", or something
/// equivalent, for use with the given x86 condition code.		/// equivalent, for use with the given x86 condition code.
SDValue EmitCmp(SDValue Op0, SDValue Op1, unsigned X86CC, SDLoc dl,		SDValue EmitCmp(SDValue Op0, SDValue Op1, unsigned X86CC, SDLoc dl,
SelectionDAG &DAG) const;		SelectionDAG &DAG) const;

/// Convert a comparison if required by the subtarget.		/// Convert a comparison if required by the subtarget.
SDValue ConvertCmpIfNecessary(SDValue Cmp, SelectionDAG &DAG) const;		SDValue ConvertCmpIfNecessary(SDValue Cmp, SelectionDAG &DAG) const;

		/// Use rsqrt* to speed up sqrt calculations.
		SDValue getRsqrtEstimate(SDValue Operand, DAGCombinerInfo &DCI,
		unsigned &RefinementSteps,
		bool &UseOneConstNR) const override;
};		};

namespace X86 {		namespace X86 {
FastISel *createFastISel(FunctionLoweringInfo &funcInfo,		FastISel *createFastISel(FunctionLoweringInfo &funcInfo,
const TargetLibraryInfo *libInfo);		const TargetLibraryInfo *libInfo);
}		}
}		}

#endif // X86ISELLOWERING_H		#endif // X86ISELLOWERING_H

lib/Target/X86/X86ISelLowering.cpp


[...] SDValue X86TargetLowering::ConvertCmpIfNecessary(SDValue Cmp,
SDValue TruncFPSW = DAG.getNode(ISD::TRUNCATE, dl, MVT::i16, Cmp);		SDValue TruncFPSW = DAG.getNode(ISD::TRUNCATE, dl, MVT::i16, Cmp);
SDValue FNStSW = DAG.getNode(X86ISD::FNSTSW16r, dl, MVT::i16, TruncFPSW);		SDValue FNStSW = DAG.getNode(X86ISD::FNSTSW16r, dl, MVT::i16, TruncFPSW);
SDValue Srl = DAG.getNode(ISD::SRL, dl, MVT::i16, FNStSW,		SDValue Srl = DAG.getNode(ISD::SRL, dl, MVT::i16, FNStSW,
DAG.getConstant(8, MVT::i8));		DAG.getConstant(8, MVT::i8));
SDValue TruncSrl = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, Srl);		SDValue TruncSrl = DAG.getNode(ISD::TRUNCATE, dl, MVT::i8, Srl);
return DAG.getNode(X86ISD::SAHF, dl, MVT::i32, TruncSrl);		return DAG.getNode(X86ISD::SAHF, dl, MVT::i32, TruncSrl);
}		}

		/// The minimum architected relative accuracy is 2^-12. We need one
		/// Newton-Raphson step to have a good float result (24 bits of precision).
		SDValue X86TargetLowering::getRsqrtEstimate(SDValue Op,
		DAGCombinerInfo &DCI,
		hfinkel (inline comment): I'd really prefer that you put the 2-constant version of the algorithm into the DAGCombiner alongside the 1-constant version, and just let the target pick. The algorithm itself is really a mathematical expression, and not at all really target dependent, and we should try to keep such things available to other targets without copy-and-paste. Ideally, we'd then also have a flag to force one or the other, so that way PPC can default to the 1-constant version, X86 can default to the 2-constant version, but there's a command-line option I can use to force the choice for benchmarking.
		spatel (author, inline reply): Agree - I'll rework this.
		unsigned &RefinementSteps,
		bool &UseOneConstNR) const {
		// FIXME: We should use instruction latency models to calculate the cost of
		// each potential sequence, but this is very hard to do reliably because
		// at least Intel's Core* chips have variable timing based on the number of
		// sig digs in the divisor and/or sqrt operand.
		hfinkel (inline comment): Please write out "significant digits"
		if (!Subtarget->useSqrtEst())
		return SDValue();

		EVT VT = Op.getValueType();

		// SSE1 has rsqrtss and rsqrtps.
		// TODO: Add support for AVX (v8f32) and AVX512 (v16f32).
		// TODO: Is it ever worthwhile to use an estimate for f64?
		hfinkel (inline comment): Why wouldn't it be?
		spatel (author, inline reply): A double-precision rsqrt estimate with refinement on x86 prior to FMA requires at least 16 instructions: convert to single, rsqrtss, convert back to double, refine (3 steps = at least 13 insts). I don't think Intel/AMD ever intended for that, or they would've added 'rsqrtsd' (similar to PPC's double-precision frsqrte). AFAICT, no x86 compiler tries to generate that sequence. Now that FMA has been introduced, it might be more feasible, but the HW implementations that have FMA also have really fast sqrt/div units, so it's again not worth it. Add this background to the code comment?
		if (Subtarget->hasSSE1() && (VT == MVT::f32 || VT == MVT::v4f32)) {
		RefinementSteps = 1;
		UseOneConstNR = false;
		return DCI.DAG.getNode(X86ISD::FRSQRT, SDLoc(Op), VT, Op);
		}
		return SDValue();
		}
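
The inline discussion above refers to a "1-constant" and a "2-constant" form of the Newton-Raphson step for 1/sqrt(a). As a minimal sketch of what those two forms could look like in scalar C++ (this is not the DAGCombiner code; the function names, the `fakeEstimate` stand-in for rsqrtss, and the exact operation order are assumptions made for illustration):

```cpp
#include <cassert>
#include <cmath>

// One Newton-Raphson step for y ~= 1/sqrt(a), written with two constants
// (-0.5 and -3.0): y' = (-0.5 * y) * (a * y * y + (-3.0)).
// Operation order here: mul, mul, mul, add, mul.
inline float refineTwoConst(float a, float y) {
  assert(a > 0.0f && "sketch only handles positive finite inputs");
  float ay = a * y;            // mul
  float ayy = ay * y;          // mul
  float negHalfY = -0.5f * y;  // mul
  float t = ayy + -3.0f;       // add
  return negHalfY * t;         // mul
}

// The same step with the single constant 1.5:
// y' = y * (1.5 - (a / 2) * y * y), where a / 2 is formed as 1.5 * a - a.
inline float refineOneConst(float a, float y) {
  assert(a > 0.0f && "sketch only handles positive finite inputs");
  float halfA = 1.5f * a - a;
  float t = halfA * y * y;
  return y * (1.5f - t);
}

// Hypothetical stand-in for a hardware estimate: the exact value perturbed
// by a chosen relative error (for example 2^-12).
inline float fakeEstimate(float a, float relErr) {
  return (1.0f / std::sqrt(a)) * (1.0f + relErr);
}

// Relative error of y against a double-precision reference.
inline double relError(float a, float y) {
  double exact = 1.0 / std::sqrt(static_cast<double>(a));
  return std::fabs(static_cast<double>(y) - exact) / exact;
}
```

Starting from `fakeEstimate(2.0f, 1.0f / 4096.0f)`, either refinement function should bring the relative error from roughly 2^-12 down to the neighborhood of single-precision rounding error after one step. The two-constant form uses four multiplies and one add, the same operation count as the BTVER2 CHECK lines in the test file.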

static bool isAllOnes(SDValue V) {		static bool isAllOnes(SDValue V) {
ConstantSDNode *C = dyn_cast<ConstantSDNode>(V);		ConstantSDNode *C = dyn_cast<ConstantSDNode>(V);
return C && C->isAllOnesValue();		return C && C->isAllOnesValue();
}		}

/// LowerToBT - Result of 'and' is compared against zero. Turn it into a BT node		/// LowerToBT - Result of 'and' is compared against zero. Turn it into a BT node
/// if it's possible.		/// if it's possible.
SDValue X86TargetLowering::LowerToBT(SDValue And, ISD::CondCode CC,		SDValue X86TargetLowering::LowerToBT(SDValue And, ISD::CondCode CC,
[... 11,243 lines not shown ...]
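
The inline reply on the f64 TODO in getRsqrtEstimate counts three refinement steps for double precision, versus one for float. A rough way to see where those counts come from, assuming each Newton-Raphson step doubles the number of correct bits (an idealization that ignores rounding error; the helper name is hypothetical):

```cpp
#include <cassert>

// Number of Newton-Raphson steps needed to go from an estimate with
// `estimateBits` correct bits to at least `targetBits`, assuming each
// step doubles the number of correct bits.
inline unsigned refinementSteps(unsigned estimateBits, unsigned targetBits) {
  assert(estimateBits > 0 && "need a nonzero starting accuracy");
  unsigned steps = 0;
  for (unsigned bits = estimateBits; bits < targetBits; bits *= 2)
    ++steps;
  return steps;
}
```

With a 12-bit estimate, `refinementSteps(12, 24)` gives 1 for a float significand and `refinementSteps(12, 53)` gives 3 for a double significand (12 to 24 to 48 to 96 bits).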

lib/Target/X86/X86Subtarget.h

[... 186 lines not shown ...]	protected:
bool LEAUsesAG;		bool LEAUsesAG;

/// SlowLEA - True if the LEA instruction with certain arguments is slow		/// SlowLEA - True if the LEA instruction with certain arguments is slow
bool SlowLEA;		bool SlowLEA;

/// SlowIncDec - True if INC and DEC instructions are slow when writing to flags		/// SlowIncDec - True if INC and DEC instructions are slow when writing to flags
bool SlowIncDec;		bool SlowIncDec;

		/// Use the RSQRT* instructions to optimize square root calculations.
		/// For this to be profitable, the cost of FSQRT and FDIV must be
		/// substantially higher than normal FP ops like FADD and FMUL.
		bool UseSqrtEst;

/// Processor has AVX-512 PreFetch Instructions		/// Processor has AVX-512 PreFetch Instructions
bool HasPFI;		bool HasPFI;

/// Processor has AVX-512 Exponential and Reciprocal Instructions		/// Processor has AVX-512 Exponential and Reciprocal Instructions
bool HasERI;		bool HasERI;

/// Processor has AVX-512 Conflict Detection Instructions		/// Processor has AVX-512 Conflict Detection Instructions
bool HasCDI;		bool HasCDI;
[... 161 lines not shown ...]	public:
bool hasCmpxchg16b() const { return HasCmpxchg16b; }		bool hasCmpxchg16b() const { return HasCmpxchg16b; }
bool useLeaForSP() const { return UseLeaForSP; }		bool useLeaForSP() const { return UseLeaForSP; }
bool hasSlowDivide() const { return HasSlowDivide; }		bool hasSlowDivide() const { return HasSlowDivide; }
bool padShortFunctions() const { return PadShortFunctions; }		bool padShortFunctions() const { return PadShortFunctions; }
bool callRegIndirect() const { return CallRegIndirect; }		bool callRegIndirect() const { return CallRegIndirect; }
bool LEAusesAG() const { return LEAUsesAG; }		bool LEAusesAG() const { return LEAUsesAG; }
bool slowLEA() const { return SlowLEA; }		bool slowLEA() const { return SlowLEA; }
bool slowIncDec() const { return SlowIncDec; }		bool slowIncDec() const { return SlowIncDec; }
		bool useSqrtEst() const { return UseSqrtEst; }
bool hasCDI() const { return HasCDI; }		bool hasCDI() const { return HasCDI; }
bool hasPFI() const { return HasPFI; }		bool hasPFI() const { return HasPFI; }
bool hasERI() const { return HasERI; }		bool hasERI() const { return HasERI; }
bool hasDQI() const { return HasDQI; }		bool hasDQI() const { return HasDQI; }
bool hasBWI() const { return HasBWI; }		bool hasBWI() const { return HasBWI; }
bool hasVLX() const { return HasVLX; }		bool hasVLX() const { return HasVLX; }

bool isAtom() const { return X86ProcFamily == IntelAtom; }		bool isAtom() const { return X86ProcFamily == IntelAtom; }
[... 115 lines not shown ...]
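
The UseSqrtEst comment says the estimate is only profitable when FSQRT and FDIV cost substantially more than FADD and FMUL. The summary's SandyBridge comparison (20 cycles for sqrtss + divss against a 23-cycle estimate chain) can be sketched as a simple latency sum. The per-instruction latencies used in the usage note are assumed values chosen to reproduce the 23-cycle figure, the helper names are hypothetical, and the model looks only at latency, not at throughput, which the review discussion argues matters more:

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sum the latencies along a strictly serial dependency chain.
inline unsigned criticalPath(const std::vector<unsigned> &latencies) {
  return std::accumulate(latencies.begin(), latencies.end(), 0u);
}

// Hypothetical latency-only profitability test: prefer the estimate
// sequence only when its chain is shorter than the exact sqrt + div pair.
inline bool estimateIsProfitable(unsigned exactCycles,
                                 const std::vector<unsigned> &estimateChain) {
  return criticalPath(estimateChain) < exactCycles;
}
```

For a chain of rsqrt, mul, mul, add, mul with assumed latencies `{5, 5, 5, 3, 5}`, `criticalPath` returns 23, so `estimateIsProfitable(20, {5, 5, 5, 3, 5})` is false, matching the decision not to enable the flag for that core.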

lib/Target/X86/X86Subtarget.cpp

[... 272 lines not shown ...]	void X86Subtarget::initializeEnvironment() {
HasCmpxchg16b = false;		HasCmpxchg16b = false;
UseLeaForSP = false;		UseLeaForSP = false;
HasSlowDivide = false;		HasSlowDivide = false;
PadShortFunctions = false;		PadShortFunctions = false;
CallRegIndirect = false;		CallRegIndirect = false;
LEAUsesAG = false;		LEAUsesAG = false;
SlowLEA = false;		SlowLEA = false;
SlowIncDec = false;		SlowIncDec = false;
		UseSqrtEst = false;
stackAlignment = 4;		stackAlignment = 4;
// FIXME: this is a known good value for Yonah. How about others?		// FIXME: this is a known good value for Yonah. How about others?
MaxInlineSizeThreshold = 128;		MaxInlineSizeThreshold = 128;
}		}

static std::string computeDataLayout(const Triple &TT) {		static std::string computeDataLayout(const Triple &TT) {
// X86 is little endian		// X86 is little endian
std::string Ret = "e";		std::string Ret = "e";
[... 84 lines not shown ...]

test/CodeGen/X86/sqrt-fastmath.ll

	; RUN: llc < %s -mcpu=core2 | FileCheck %s			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=core2 | FileCheck %s
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mcpu=btver2 | FileCheck %s --check-prefix=BTVER2

	; generated using "clang -S -O2 -ffast-math -emit-llvm sqrt.c" from			; generated using "clang -S -O2 -ffast-math -emit-llvm sqrt.c" from
	; #include <math.h>			; #include <math.h>
	;			;
	; double fd(double d){			; double fd(double d){
	; return sqrt(d);			; return sqrt(d);
	; }			; }
	;			;
	[... 37 lines not shown ...]
	; Function Attrs: nounwind readnone uwtable			; Function Attrs: nounwind readnone uwtable
	define x86_fp80 @fld(x86_fp80 %ld) #0 {			define x86_fp80 @fld(x86_fp80 %ld) #0 {
	entry:			entry:
	; CHECK: fsqrt			; CHECK: fsqrt
	%call = tail call x86_fp80 @__sqrtl_finite(x86_fp80 %ld) #2			%call = tail call x86_fp80 @__sqrtl_finite(x86_fp80 %ld) #2
	ret x86_fp80 %call			ret x86_fp80 %call
	}			}

	; Function Attrs: nounwind readnone
	declare x86_fp80 @__sqrtl_finite(x86_fp80) #1			declare x86_fp80 @__sqrtl_finite(x86_fp80) #1

				; If the target's sqrtss and divss instructions are substantially
				; slower than rsqrtss with a Newton-Raphson refinement, we should
				; generate the estimate sequence.
				define float @reciprocal_square_root(float %x) #0 {
				%sqrt = tail call float @llvm.sqrt.f32(float %x)
				%div = fdiv fast float 1.0, %sqrt
				ret float %div

				; CHECK-LABEL: reciprocal_square_root:
				; CHECK: sqrtss
				; CHECK-NEXT: movss
				; CHECK-NEXT: divss
				; CHECK-NEXT: retq
				; BTVER2-LABEL: reciprocal_square_root:
				; BTVER2: vrsqrtss
				; BTVER2-NEXT: vmulss
				; BTVER2-NEXT: vmulss
				; BTVER2-NEXT: vmulss
				; BTVER2-NEXT: vaddss
				; BTVER2-NEXT: vmulss
				; BTVER2-NEXT: retq
				}

				declare float @llvm.sqrt.f32(float) #1

				; If the target's sqrtps and divps instructions are substantially
				; slower than rsqrtps with a Newton-Raphson refinement, we should
				; generate the estimate sequence.
				define <4 x float> @reciprocal_square_root_v4f32(<4 x float> %x) #0 {
				%sqrt = tail call <4 x float> @llvm.sqrt.v4f32(<4 x float> %x)
				%div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %sqrt
				ret <4 x float> %div

				; CHECK-LABEL: reciprocal_square_root_v4f32:
				; CHECK: sqrtps
				; CHECK-NEXT: movaps
				; CHECK-NEXT: divps
				; CHECK-NEXT: retq
				; BTVER2-LABEL: reciprocal_square_root_v4f32:
				; BTVER2: vrsqrtps
				; BTVER2-NEXT: vmulps
				; BTVER2-NEXT: vmulps
				; BTVER2-NEXT: vmulps
				; BTVER2-NEXT: vaddps
				; BTVER2-NEXT: vmulps
				; BTVER2-NEXT: retq
				}

				declare <4 x float> @llvm.sqrt.v4f32(<4 x float>) #1


	attributes #0 = { nounwind readnone uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "unsafe-fp-math"="true" "use-soft-float"="false" }			attributes #0 = { nounwind readnone uwtable "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "unsafe-fp-math"="true" "use-soft-float"="false" }
	attributes #1 = { nounwind readnone "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "unsafe-fp-math"="true" "use-soft-float"="false" }			attributes #1 = { nounwind readnone "less-precise-fpmad"="false" "no-frame-pointer-elim"="false" "no-infs-fp-math"="true" "no-nans-fp-math"="true" "unsafe-fp-math"="true" "use-soft-float"="false" }
	attributes #2 = { nounwind readnone }			attributes #2 = { nounwind readnone }
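
For reference, the two new test functions exercise the pattern 1.0 / sqrt(x). A source-level sketch of code that contains that pattern is shown below. It is an assumption that source like this, built with fast-math enabled, yields IR of the shape in the test; the test IR itself may have been written by hand, and whether the loop is vectorized into a <4 x float> operation depends on the compiler.

```cpp
#include <cassert>
#include <cmath>

// Scalar form: sqrt followed by a divide of 1.0 by the result.
float reciprocal_square_root(float x) {
  return 1.0f / std::sqrt(x);
}

// Four-element form; a vectorizing compiler may or may not turn this
// into a single <4 x float> sqrt and divide.
void reciprocal_square_root_v4f32(const float in[4], float out[4]) {
  for (int i = 0; i < 4; ++i)
    out[i] = 1.0f / std::sqrt(in[i]);
}
```

Without fast-math, a compiler is expected to keep the exact sqrt and divide, which corresponds to the plain CHECK lines for core2.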


Revision Contents

Diff 14694

include/llvm/Target/TargetLowering.h

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

lib/Target/PowerPC/PPCISelLowering.h

lib/Target/PowerPC/PPCISelLowering.cpp

lib/Target/X86/X86.td

lib/Target/X86/X86ISelLowering.h

lib/Target/X86/X86ISelLowering.cpp

lib/Target/X86/X86Subtarget.h

lib/Target/X86/X86Subtarget.cpp

test/CodeGen/X86/sqrt-fastmath.ll
