# [libc][math] Implement asinf function correctly rounded for all rounding modes.ClosedPublic

Authored by lntue on Sep 6 2022, 11:24 PM.

# Details

Summary

Implement asinf function correctly rounded for all rounding modes.

For |x| <= 0.5, we approximate asin(x) by

asin(x) = x * P(x^2)

where P(X^2) = Q(X) is a degree-20 minimax even polynomial approximating
asin(x)/x on [0, 0.5] generated by Sollya with:

> Q = fpminimax(asin(x)/x, [|0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20|],
[|1, D...|], [0, 0.5]);

When |x| > 0.5, we perform range reduction as follow:
Assume further that 0.5 < x <= 1, and let:

y = asin(x)

We will use the double angle formula:

cos(2X) = 1 - 2 sin^2(X)

and the complement angle identity:

x = sin(y) = cos(pi/2 - y)
= 1 - 2 sin^2 (pi/4 - y/2)

So:

sin(pi/4 - y/2) = sqrt( (1 - x)/2 )

And hence:

pi/4 - y/2 = asin( sqrt( (1 - x)/2 ) )

Equivalently:

asin(x) = y = pi/2 - 2 * asin( sqrt( (1 - x)/2 ) )

Let u = (1 - x)/2, then

asin(x) = pi/2 - 2 * asin(u)

Moreover, since 0.5 < x <= 1,

0 <= u < 1/4, and 0 <= sqrt(u) < 0.5.

And hence we can reuse the same polynomial approximation of asin(x) when
|x| <= 0.5:

asin(x) = pi/2 - 2 * u * P(u^2).

Performance benchmark using perf tool from the CORE-MATH project on Ryzen 1700:

$CORE_MATH_PERF_MODE="rdtsc" ./perf.sh asinf CORE-MATH reciprocal throughput : 23.418 System LIBC reciprocal throughput : 27.310 LIBC reciprocal throughput : 22.741$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh asinf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 58.884
System LIBC latency : 62.055
LIBC latency        : 62.037

# Diff Detail

### Event Timeline

lntue created this revision.Sep 6 2022, 11:24 PM
Herald added projects: Restricted Project, Restricted Project. Sep 6 2022, 11:24 PM
Herald added subscribers: ecnelises, tschuett, mgorny.
lntue requested review of this revision.Sep 6 2022, 11:24 PM

the patch fails for me on top of main (revision ea953b9). Is there any other patch to apply first?

lntue edited the summary of this revision. (Show Details)Sep 6 2022, 11:41 PM
lntue updated this revision to Diff 458368.Sep 6 2022, 11:42 PM

Rebase.

lntue added a comment.Sep 6 2022, 11:44 PM

the patch fails for me on top of main (revision ea953b9). Is there any other patch to apply first?

I just uploaded the patch after rebasing. Can you try to see if it works now?

Here is the result of git log --oneline on my local repo:

7b6f9731ca7a (HEAD -> main) [libc][math] Implement asinf function correctly rounded for all rounding modes.
c836ddaf721c (origin/main) [X86][NFC] Refine load/store reg to StackSlot for extensibility
4e5a59a3839f [LLD][COFF] Fix writing a map file when range extension thunks are inserted
ea953b9d9a65 [compiler-rt] [test] Handle missing ld.gold gracefully
zimmermann6 accepted this revision.Sep 7 2022, 12:02 AM

it works now, thanks. I confirm it is correctly rounded. I get similar figures on a AMD EPYC 7282 (glibc, core-math, llvm-libc):

# reciprocal throughput
GNU libc version: 2.34
GNU libc release: stable
26.722
31.752
27.841
# latency
GNU libc version: 2.34
GNU libc release: stable
56.310
64.140
61.051
This revision is now accepted and ready to land.Sep 7 2022, 12:02 AM
orex added a comment.Sep 7 2022, 4:03 AM

Good job, Tue!

I have an optional suggestion. You can use formula
asin(x)=1/2*acos(1-2*x^2)
and, of course,
acos(x)=pi/2-asin(x)
for cases where |x|>=0.5 Probably it can be much faster than sqrt.

lntue added a comment.Sep 7 2022, 9:04 AM

Good job, Tue!

I have an optional suggestion. You can use formula
asin(x)=1/2*acos(1-2*x^2)
and, of course,
acos(x)=pi/2-asin(x)
for cases where |x|>=0.5 Probably it can be much faster than sqrt.

Thanks @orex for the suggestion! Unfortunately the formula asin(x) = pi/4 - 1/2 * asin(1 - 2*x^2) only works
well when |1 - 2*x^2| <= 0.5, i.e. 0.5 <= |x| <= sqrt(3)/2.

For sqrt(3)/2 < |x| <= 1, we still need some sort of range reduction involving square root.

Nonetheless, I've tried adding another branch for 0.5 <= |x| <= sqrt(3)/2 using the formula
asin(x) = pi/4 - 1/2 * asin(1 - 2*x^2) and it actually makes both the throughput and the latency
slightly worse than 2 branches.

orex accepted this revision.Sep 7 2022, 12:25 PM

Good job, Tue!

I have an optional suggestion. You can use formula
asin(x)=1/2*acos(1-2*x^2)
and, of course,
acos(x)=pi/2-asin(x)
for cases where |x|>=0.5 Probably it can be much faster than sqrt.

Thanks @orex for the suggestion! Unfortunately the formula asin(x) = pi/4 - 1/2 * asin(1 - 2*x^2) only works
well when |1 - 2*x^2| <= 0.5, i.e. 0.5 <= |x| <= sqrt(3)/2.

For sqrt(3)/2 < |x| <= 1, we still need some sort of range reduction involving square root.

Nonetheless, I've tried adding another branch for 0.5 <= |x| <= sqrt(3)/2 using the formula
asin(x) = pi/4 - 1/2 * asin(1 - 2*x^2) and it actually makes both the throughput and the latency
slightly worse than 2 branches.

No questions! Thanks for trying the approach.

This revision was automatically updated to reflect the committed changes.