This is an archive of the discontinued LLVM Phabricator instance.

Improve sqrt estimate algorithm
ClosedPublic

Authored by spatel on Oct 8 2014, 1:53 PM.

Details

Summary

This patch changes the fast-math implementation for calculating sqrt(x) from:
y = 1 / (1 / sqrt(x))
to:
y = x * (1 / sqrt(x))

This has 2 benefits: less code / faster code and one less estimate instruction that may lose precision.

The only target that will be affected (until http://reviews.llvm.org/D5658 is approved) is PPC. The difference in codegen for PPC is 2 less flops for a single-precision sqrtf or vector sqrtf and 4 less flops for a double-precision sqrt. We also eliminate a constant load and extra register usage.

Here's the existing PPC codegen for a single-precision scalar sqrtf() using a reciprocal square root estimate and a reciprocal estimate:

.L.goo3:
# BB#0:
   addis 3, 2, .LCPI10_2@toc@ha
   lfs 0, .LCPI10_2@toc@l(3)
   fcmpu 0, 1, 0
   beq 0, .LBB10_2
# BB#1:
   frsqrtes 0, 1
   addis 3, 2, .LCPI10_0@toc@ha
   lfs 2, .LCPI10_0@toc@l(3)
   addis 3, 2, .LCPI10_1@toc@ha
   lfs 13, .LCPI10_1@toc@l(3)
   fnmsubs 12, 1, 2, 1
   fmuls 3, 0, 0
   fmadds 1, 12, 3, 2
   fmuls 0, 0, 1
   fres 1, 0                    <--- reciprocal estimate
   fnmsubs 0, 0, 1, 13  <--- refinement
   fmadds 0, 1, 0, 1      <--- refinement
.LBB10_2:
   fmr 1, 0
   blr

After the patch, we calculate the rsqrt and multiply by the original operand:

.L.goo3:
# BB#0:
   addis 3, 2, .LCPI10_1@toc@ha
   lfs 0, .LCPI10_1@toc@l(3)
   fcmpu 0, 1, 0
   beq 0, .LBB10_2
# BB#1:
   frsqrtes 0, 1
   addis 3, 2, .LCPI10_0@toc@ha
   lfs 2, .LCPI10_0@toc@l(3)   <--- only need 1 constant for NR sqrt refinement
   fnmsubs 3, 1, 2, 1
   fmuls 4, 0, 0
   fmadds 2, 3, 4, 2
   fmuls 0, 0, 2
   fmuls 0, 1, 0   <--- reciprocal calc replaced by multiply
.LBB10_2:
   fmr 1, 0
   blr

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 14600.Oct 8 2014, 1:53 PM
spatel retitled this revision from to Improve sqrt estimate algorithm.
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: hfinkel, wschmidt, willschm.
spatel added a subscriber: Unknown Object (MLST).
hfinkel accepted this revision.Oct 9 2014, 12:53 PM
hfinkel edited edge metadata.

Awesome, thanks! (LGTM)

-Hal

This revision is now accepted and ready to land.Oct 9 2014, 12:53 PM
spatel added a comment.Oct 9 2014, 1:45 PM

Thanks, Hal!

One other comment I'd like to point out here regarding the PPC codegen: I was expecting an 'fsel' to be generated rather than a compare and branch. I won't be able to look into why that happens anytime soon, but I can file a bug if you think that's appropriate.

X86 codegen (once we enable that) is not going to generate a compare for any of these testcases from what I can tell.

One other comment I'd like to point out here regarding the PPC codegen: I was expecting an 'fsel' to be generated rather than a compare and branch. I won't be able to look into why that happens anytime soon, but I can file a bug if you think that's appropriate.

Yes, please file a PR and we'll look at it; thanks!

spatel closed this revision.Oct 9 2014, 2:36 PM
spatel updated this revision to Diff 14678.

Closed by commit rL219445 (authored by @spatel).