This patch changes the fast-math implementation for calculating sqrt(x) from:
y = 1 / (1 / sqrt(x))
to:
y = x * (1 / sqrt(x))
This has 2 benefits: less code / faster code and one less estimate instruction that may lose precision.
The only target that will be affected (until http://reviews.llvm.org/D5658 is approved) is PPC. The difference in codegen for PPC is 2 less flops for a single-precision sqrtf or vector sqrtf and 4 less flops for a double-precision sqrt. We also eliminate a constant load and extra register usage.
Here's the existing PPC codegen for a single-precision scalar sqrtf() using a reciprocal square root estimate and a reciprocal estimate:
.L.goo3: # BB#0: addis 3, 2, .LCPI10_2@toc@ha lfs 0, .LCPI10_2@toc@l(3) fcmpu 0, 1, 0 beq 0, .LBB10_2 # BB#1: frsqrtes 0, 1 addis 3, 2, .LCPI10_0@toc@ha lfs 2, .LCPI10_0@toc@l(3) addis 3, 2, .LCPI10_1@toc@ha lfs 13, .LCPI10_1@toc@l(3) fnmsubs 12, 1, 2, 1 fmuls 3, 0, 0 fmadds 1, 12, 3, 2 fmuls 0, 0, 1 fres 1, 0 <--- reciprocal estimate fnmsubs 0, 0, 1, 13 <--- refinement fmadds 0, 1, 0, 1 <--- refinement .LBB10_2: fmr 1, 0 blr
After the patch, we calculate the rsqrt and multiply by the original operand:
.L.goo3: # BB#0: addis 3, 2, .LCPI10_1@toc@ha lfs 0, .LCPI10_1@toc@l(3) fcmpu 0, 1, 0 beq 0, .LBB10_2 # BB#1: frsqrtes 0, 1 addis 3, 2, .LCPI10_0@toc@ha lfs 2, .LCPI10_0@toc@l(3) <--- only need 1 constant for NR sqrt refinement fnmsubs 3, 1, 2, 1 fmuls 4, 0, 0 fmadds 2, 3, 4, 2 fmuls 0, 0, 2 fmuls 0, 1, 0 <--- reciprocal calc replaced by multiply .LBB10_2: fmr 1, 0 blr