This is an archive of the discontinued LLVM Phabricator instance.

[libc] Implement sinf function that is correctly rounded to all rounding modes.
ClosedPublic

Authored by lntue on Apr 5 2022, 1:14 PM.

Details

Summary

Implement sinf function that is correctly rounded to all rounding modes.

  • We use a simple range reduction for pi/16 < |x| : Let k = round(x / pi) and y = (x/pi) - k. So k is an integer and -0.5 <= y <= 0.5.

Then

sin(x) = sin(y*pi + k*pi)
          = (-1)^(k & 1) * sin(y*pi)
          ~ (-1)^(k & 1) * y * P(y^2)
where `y*P(y^2)` is a degree-15 minimax polynomial generated by Sollya with:
> P = fpminimax(sin(x*pi)/x, [|0, 2, 4, 6, 8, 10, 12, 14|], [|D...|], [0, 0.5]);
  • Performance benchmark using perf tool from CORE-MATH project

(https://gitlab.inria.fr/core-math/core-math/-/tree/master) on Ryzen 1700:
Before this patch (not correctly rounded):

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinf
CORE-MATH reciprocal throughput   : 17.892
System LIBC reciprocal throughput : 25.559
LIBC reciprocal throughput        : 29.381

After this patch (correctly rounded):

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinf
CORE-MATH reciprocal throughput   : 17.896
System LIBC reciprocal throughput : 25.740

LIBC reciprocal throughput        : 27.872
LIBC reciprocal throughput        : 20.012     (with `-msse4.2` flag)
LIBC reciprocal throughput        : 14.244     (with `-mfma` flag)

Diff Detail

Event Timeline

lntue created this revision.Apr 5 2022, 1:14 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 5 2022, 1:14 PM
lntue requested review of this revision.Apr 5 2022, 1:14 PM
lntue updated this revision to Diff 420720.Apr 5 2022, 11:49 PM

Simplify the range reduction logic and improve performance for |x| < 2^50.

lntue edited the summary of this revision. (Show Details)Apr 6 2022, 12:04 AM
zimmermann6 requested changes to this revision.Apr 6 2022, 12:36 AM

I get an error for rounding up:

Using llvm-libc
MPFR library: 4.1.0       
MPFR header:  4.1.0 (based on 4.1.0)
Checking function sinf with MPFR_RNDU
libm wrong by up to 3.40e-11 ulp(s) [1] for x=-0x1.47d0fep+34
sin      gives -0x1p+0
mpfr_sin gives -0x1.fffffep-1
Total: errors=1 (0.00%) errors2=0 maxerr=3.40e-11 ulp(s)
This revision now requires changes to proceed.Apr 6 2022, 12:36 AM
lntue updated this revision to Diff 420836.Apr 6 2022, 7:10 AM

Add the exceptional input: x = -0x1.47d0fep+34f.

As a note of why x = 0x1.47d0fep+34f with FE_DOWNWARD passed but
x = -0x1.47d0fep34f with FE_UPWARD failed, the main difference is that
ysq = y*y differs by 1 ULP, breaking the symmetry of sin(-x) = -sin(x)
in the evaluation chain.

zimmermann6 accepted this revision.Apr 6 2022, 7:51 AM

all tests pass now, and I get the following figures (first CORE-MATH, 2nd GNU libc, 2rd LLVM libc):

$ LIBM=/users/zimmerma/svn/core-math/libllvmlibc.a ./perf.sh sinf
38.997
26.503
33.990
This revision is now accepted and ready to land.Apr 6 2022, 7:51 AM
lntue updated this revision to Diff 443322.Jul 8 2022, 12:27 PM

Add support for non-FMA targets and simplify large range branch.

lntue updated this revision to Diff 443326.Jul 8 2022, 12:37 PM

Update math status page.

zimmermann6 accepted this revision.Jul 11 2022, 2:44 AM

I confirm the new version is correctly rounded (on the machine I tried it). However I find different figures for the number of cycles:

zimmerma@biscotte:~/svn/core-math$ LIBM=/localdisk/zimmerma/llvm-project/build/projects/libc/lib/libllvmlibc.a CORE_MATH_PERF_MODE=rdtsc ./perf.sh sinf
GNU libc version: 2.31
GNU libc release: stable
16.781
23.443
32.737

This is on a AMD EPYC 7282 with gcc version 10.2.1 and clang 11.0.1-2 (I guess llvm-libc is compiled with clang). This gives 17 cycles for the core-math routine, and 33 cycles for the llvm-libc one.

lntue updated this revision to Diff 444660.Jul 14 2022, 7:49 AM

Improve the performance with round to nearest integer instructions. The parts
adding nearest_integer.h are refactored to https://reviews.llvm.org/D129776.

lntue edited the summary of this revision. (Show Details)Jul 14 2022, 8:07 AM
lntue updated this revision to Diff 444675.Jul 14 2022, 8:14 AM

Update bazel overlay.

lntue updated this revision to Diff 444713.Jul 14 2022, 10:21 AM

Sync to HEAD.

lntue updated this revision to Diff 445107.Jul 15 2022, 12:46 PM

Add extra exceptional values from AARCH64.

zimmermann6 accepted this revision.Jul 18 2022, 4:14 AM

I confirm the latest version is correctly rounded for all rounding modes on the machine I tried (AMD EPYC 7282 with gcc 10.2.1 and clang 11.0.1-2).
For the reciprocal throughput I get:

zimmerma@biscotte:/tmp/core-math$ LIBM=/localdisk/zimmerma/llvm-project/build/projects/libc/lib/libllvmlibc.a CORE_MATH_LAUNCHER="/localdisk/zimmerma/glibc-2.35/install/lib/ld-linux-x86-64.so.2 --library-path /localdisk/zimmerma/glibc-2.35/install/lib" CORE_MATH_PERF_MODE=rdtsc ./perf.sh sinf
GNU libc version: 2.35
GNU libc release: stable
16.705
23.636
13.989

i.e., 16.7 cycles for core-math, 23.6 cycles for glibc 2.35, and 14.0 cycles for llvm-libc. Good work!
For the latency the figures are worse than glibc:

zimmerma@biscotte:/tmp/core-math$ LIBM=/localdisk/zimmerma/llvm-project/build/projects/libc/lib/libllvmlibc.a CORE_MATH_LAUNCHER="/localdisk/zimmerma/glibc-2.35/install/lib/ld-linux-x86-64.so.2 --library-path /localdisk/zimmerma/glibc-2.35/install/lib" CORE_MATH_PERF_MODE=rdtsc PERF_ARGS=--latency ./perf.sh sinf
GNU libc version: 2.35
GNU libc release: stable
47.926
57.338
62.912
santoshn added inline comments.
libc/src/math/generic/range_reduction.h
33–40 ↗(On Diff #445107)

I was reviewing this code. I am not able to understand how ONE_OVER_PI_28_LSB_EXP[2] == -86. The third entry in ONE_OVER_PI_28 (i.e., 0x1.3f84ebp-62) has an exponent of -62 and there are 27 fraction bits. So the exponent of the LSB should be -89.

Similarly ONE_OVER_PI_28_LSB_EXP[5] is -167?

Similarly ONE_OVER_PI_28_LSB_EXP[6] is -206?

Similarly ONE_OVER_PI_28_LSB_EXP[7] is -236?

lntue added inline comments.Jul 21 2022, 6:31 AM
libc/src/math/generic/range_reduction.h
33–40 ↗(On Diff #445107)

I think for ONE_OVER_PI_28_LSB_EXP[2] I over-counted a bit. The exact value 0x1.3f84eb p-62 actually has the last 1 bit at 24th bit after the decimal point (bits 25-27 are all 0's), so the correct value of ONE_VER_PI_28_LSB_EXP[2] should be -62 - 24 = -66. Similarly for other values.

santoshn added inline comments.Jul 21 2022, 7:06 AM
libc/src/math/generic/range_reduction.h
33–40 ↗(On Diff #445107)

Tue,

Thanks for the clarification.

The last hex digit in ONE_OVER_PI_28_LSP[2] is "b", which is 1011. We are using 28-bits of precision for each entry of OVER_OVER_PI_28
I get -62-27 = -89.

santoshn added inline comments.Jul 21 2022, 7:39 AM
libc/src/math/generic/range_reduction.h
33–40 ↗(On Diff #445107)

The ONE_OVER_PI_28_LSB_EXP array according to me for the double values with a precision of 28 for ONE_OVER_PI_28 should be.

static const double one_over_pi_28_exp[8] = {

 -29, //  -2 -27 = -29
 -60,  // -33 -27 = -60
 -89,  // -62 -27 = -89, last hex digit is b which is  1011
 -118, // -92 - 26 = -118, last hex digit is 2, which is  0010
 -147, // -121 -26  = -147, last hex digit is a, which is 1010
 -174, // -150-24 = -174, last hex digit is 8, which is 1000
 -204, // -179-25 = -204, last hex digit is c, which is 1100
 -234 // -209 -25 = -234, last hex digit is c, which is 1100
};

In the RLIBM project, we have a similar version range reduction for sinf. We preform two levels of range reduction. The first level of range reduction is exactly similar to the approach here. Second level approximates sinpi(x), with two polynomials sinpi and cospi, similar to the description in Section 2 of our PLDI 2021 paper: https://people.cs.rutgers.edu/~sn349/papers/rlibm32-pldi-2021-preprint.pdf

Hence, we think the performance of this sinf function can be significantly improved (close to 2X improvement).

We are doing final stages of testing for our sinf polynomial, which consist of two polynomials: a 3 term, degree-5 polynomial (sinpi) and a 3-term, degree-4 polynomial (cospi) and a table of 512 double values.

We will share the link to the implementation here once we are done with our testing. Hope it can be incorporated.

lntue added inline comments.Jul 21 2022, 8:02 AM
libc/src/math/generic/range_reduction.h
33–40 ↗(On Diff #445107)

Yes, the last hex digit is b, but it has only 6 hex digits, and hence -24 instead of -27; a full 7 hex digits will make it -28 instead.

It would be great to be able to reduce the polynomial's degree + exceptional cases. Thanks for letting me know the progress! I'm looking forward to it!

Tue,

Thanks for the information. Here is the correctly rounded sinf from our RLIBM project: https://github.com/rutgers-apl/The-RLIBM-Project/tree/main/experimental/sin

We know it works for round-to-nearest and produces correct results for all inputs. From our testing, it is close to 2X faster than the current LLVM version and about 25% faster than the CORE-MATH version with respect to latency.

Let me know if you have any questions.

lntue added a comment.Jul 21 2022, 1:08 PM

Tue,

Thanks for the information. Here is the correctly rounded sinf from our RLIBM project: https://github.com/rutgers-apl/The-RLIBM-Project/tree/main/experimental/sin

We know it works for round-to-nearest and produces correct results for all inputs. From our testing, it is close to 2X faster than the current LLVM version and about 25% faster than the CORE-MATH version with respect to latency.

Let me know if you have any questions.

Thanks for the link! It's nice to see that we can bring down the overall latency and it has no exceptional values!

In the earlier iteration of this patch, I've implemented some thing similar with the same degree using k*pi/256 range reduction instead of k*pi/512, with around 5-6 exceptional values: https://reviews.llvm.org/D123154?vs=on&id=420614#toc.
You might want to try with better polynomial generator to see if it can reduce the number of exceptional values, or maybe even bring down the number of sub-intervals to 128?

In this patch I went with the other extreme for range reduction. When I have more time, I'd definitely playing around with various ranges of second range reduction to find the balance between look-up table size, polynomials' degrees, and number of exceptional values.

Thanks!

lntue updated this revision to Diff 446816.Jul 22 2022, 6:57 AM

Update comments and fix bazel builds.