# [libc][math] Improve exp2f performance.

Authored by lntue on Sep 14 2022, 8:51 AM.

# Details

Reviewers
 michaelrj sivachandra orex zimmermann6
Commits
rGe6226e6b7234: [libc][math] Improve exp2f performance.
Summary

Reduce the number of subintervals (and hence lookup-table entries) and
optimize the evaluation steps.

Currently, exp2f is computed by writing 2^x = 2^hi * 2^mid * 2^lo, where
hi is an integer, mid is a multiple of 1/32 with -16/32 <= mid <= 15/32,
and -1/64 <= lo <= 1/64; 2^lo is then approximated by a degree-6 polynomial.
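In symbols, the current reduction reads as follows. Here "degree 6" is read as 1 + lo * P(lo) with P of degree 5, which matches the `fpminimax(..., 5, ...)` calls in the Sollya session below:

```math
\begin{aligned}
x &= hi + mid + lo, \qquad hi \in \mathbb{Z},\quad mid = \tfrac{j}{32},\ j \in \{-16, \dots, 15\},\quad |lo| \le \tfrac{1}{64},\\
2^{x} &= 2^{hi} \cdot 2^{mid} \cdot 2^{lo}, \qquad 2^{lo} \approx 1 + lo \cdot P(lo),\quad \deg P = 5.
\end{aligned}
```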

Experiments with Sollya showed that a degree-6 polynomial can approximate
2^lo over a larger range with reasonable error:

```
> P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/64, 1/64]);
> dirtyinfnorm(2^x - 1 - x*P, [-1/64, 1/64]);
0x1.e18a1bc09114def49eb851655e2e5c4dd08075ac2p-63

> P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/32, 1/32]);
> dirtyinfnorm(2^x - 1 - x*P, [-1/32, 1/32]);
0x1.05627b6ed48ca417fe53e3495f7df4baf84a05e2ap-56
```
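Converting the two reported bounds to powers of two (our rounding) shows what widening the interval costs:

```math
\begin{aligned}
|lo| \le \tfrac{1}{64}:&\quad \texttt{0x1.e18a\ldots p-63} \approx 1.88 \cdot 2^{-63} \approx 2^{-62.1}\\
|lo| \le \tfrac{1}{32}:&\quad \texttt{0x1.0562\ldots p-56} \approx 1.02 \cdot 2^{-56} \approx 2^{-56.0}
\end{aligned}
```

Doubling the interval loses about 6 bits, but both bounds stay far below the 2^{-24} unit roundoff of single precision, leaving roughly 32 bits of margin for a float result.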

So we can optimize the implementation with the following changes:

1. Reduce the range so that mid = i/16 for i = 0..15 and -1/32 <= lo <= 1/32, which halves the lookup table from 32 to 16 entries.
2. Store the 2^mid table as raw bit patterns, and add hi directly to the exponent field to compute 2^hi * 2^mid.
3. Reorder the evaluation of the polynomial approximating 2^lo.

Performance benchmarks using the perf tool from the CORE-MATH project on a Ryzen 1700:

```
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 9.534
System LIBC reciprocal throughput : 6.229

BEFORE:
LIBC reciprocal throughput        : 21.405
LIBC reciprocal throughput        : 15.241    (with -msse4.2 flag)
LIBC reciprocal throughput        : 11.111    (with -mfma flag)

AFTER:
LIBC reciprocal throughput        : 18.617
LIBC reciprocal throughput        : 12.852    (with -msse4.2 flag)
LIBC reciprocal throughput        : 9.253     (with -mfma flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 40.869
System LIBC latency : 30.580

BEFORE
LIBC latency        : 64.888
LIBC latency        : 61.027    (with -msse4.2 flag)
LIBC latency        : 48.778    (with -mfma flag)

AFTER
LIBC latency        : 48.803
LIBC latency        : 45.047    (with -msse4.2 flag)
LIBC latency        : 37.487    (with -mfma flag)
```
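For reference, the relative reductions implied by these measurements (our arithmetic on the numbers above, in rdtsc ticks, rounded to one decimal; lower is better for both metrics):

```math
\begin{aligned}
\text{throughput:}\quad & \tfrac{21.405 - 18.617}{21.405} \approx 13.0\%, \quad \tfrac{15.241 - 12.852}{15.241} \approx 15.7\%\ (\texttt{-msse4.2}), \quad \tfrac{11.111 - 9.253}{11.111} \approx 16.7\%\ (\texttt{-mfma})\\
\text{latency:}\quad & \tfrac{64.888 - 48.803}{64.888} \approx 24.8\%, \quad \tfrac{61.027 - 45.047}{61.027} \approx 26.2\%\ (\texttt{-msse4.2}), \quad \tfrac{48.778 - 37.487}{48.778} \approx 23.1\%\ (\texttt{-mfma})
\end{aligned}
```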


### Event Timeline

lntue created this revision. Sep 14 2022, 8:51 AM
Herald added projects: Restricted Project, Restricted Project. Sep 14 2022, 8:51 AM
Herald added subscribers: ecnelises, tschuett, mgorny.
lntue requested review of this revision. Sep 14 2022, 8:51 AM
lntue edited the summary of this revision. Sep 14 2022, 9:26 AM
orex accepted this revision. Sep 14 2022, 11:22 AM
This revision is now accepted and ready to land. Sep 14 2022, 11:22 AM
sivachandra accepted this revision. Sep 14 2022, 11:25 AM
This revision was automatically updated to reflect the committed changes.

I confirm the improvement, and the function is still correctly rounded.