This is an archive of the discontinued LLVM Phabricator instance.

[libc][math] Improve exp2f performance.
Closed · Public

Authored by lntue on Sep 14 2022, 8:51 AM.

Details

Summary

Reduce the number of subintervals that need a lookup table and optimize
the evaluation steps.

Currently, exp2f is computed by reducing to 2^hi * 2^mid * 2^lo where
-16/32 <= mid <= 15/32 and -1/64 <= lo <= 1/64, and 2^lo is then
approximated by a degree 6 polynomial.
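
A minimal sketch of this decomposition, to make the index ranges concrete. The function name `reduce_current` and the way hi and mid are derived from a rounded integer are assumptions for illustration; the summary only states the ranges of mid and lo, not how the code obtains them.

```python
import math

def reduce_current(x):
    """Split x into hi + j/32 + lo with j in [-16, 15] and |lo| <= 1/64.

    Hypothetical helper: the rounding strategy is an assumption.
    """
    k = round(x * 32)      # nearest multiple of 1/32, as an integer
    hi = (k + 16) >> 5     # integer part, chosen so that j lands in [-16, 15]
    j = k - 32 * hi        # mid = j / 32
    lo = x - k / 32        # remainder, at most 1/64 in magnitude
    return hi, j, lo

hi, j, lo = reduce_current(3.3)
# Recombine the three factors 2^hi * 2^mid * 2^lo.
recombined = 2.0 ** hi * 2.0 ** (j / 32) * 2.0 ** lo
```

With 32 possible values of j, this scheme needs a 32-entry table for 2^mid.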

Experiments with Sollya showed that a degree-6 polynomial can approximate
2^lo over a wider range with acceptable error (P below has degree 5, so
1 + x*P has degree 6):

> P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/64, 1/64]);
> dirtyinfnorm(2^x - 1 - x*P, [-1/64, 1/64]);
0x1.e18a1bc09114def49eb851655e2e5c4dd08075ac2p-63

> P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/32, 1/32]);
> dirtyinfnorm(2^x - 1 - x*P, [-1/32, 1/32]);
0x1.05627b6ed48ca417fe53e3495f7df4baf84a05e2ap-56
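
To read the two bounds more easily, they can be converted to "bits of accuracy". This is only a convenience sketch: the hex strings are copied verbatim from the Sollya output above, and the variable names are made up for illustration.

```python
import math

# Error bounds copied verbatim from the Sollya output above.
err_64 = float.fromhex("0x1.e18a1bc09114def49eb851655e2e5c4dd08075ac2p-63")
err_32 = float.fromhex("0x1.05627b6ed48ca417fe53e3495f7df4baf84a05e2ap-56")

# -log2(error) is the number of correct bits of the approximation.
bits_64 = -math.log2(err_64)   # interval [-1/64, 1/64]
bits_32 = -math.log2(err_32)   # interval [-1/32, 1/32]
bits_lost = bits_64 - bits_32  # cost of doubling the interval width

print(f"[-1/64, 1/64]: {bits_64:.2f} bits")
print(f"[-1/32, 1/32]: {bits_32:.2f} bits")
print(f"lost by widening: {bits_lost:.2f} bits")
```

Doubling the interval costs roughly 6 bits, but the wider bound (about 2^-56) is still below 2^-53, so the approximation error stays smaller than the rounding granularity of a double-precision intermediate.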

So we can optimize the implementation a bit with:

  1. Change the range reduction to mid = i/16 for i = 0..15 and -1/32 <= lo <= 1/32, which halves the table from 32 entries to 16.
  2. Store the table of 2^mid values as raw bit patterns, and add hi directly to the exponent field to compute 2^hi * 2^mid.
  3. Rearrange the order of evaluating the polynomial approximating 2^lo.
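
The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: it works on Python doubles, the helper names are hypothetical, the table is filled with `2.0 ** (i / 16)` rather than precomputed constants, the coefficients are plain Taylor coefficients standing in for the Sollya-generated ones (which are not listed here), and the pairwise grouping in step 3 is just one possible rearrangement.

```python
import math
import struct

def to_bits(d: float) -> int:
    """Raw IEEE-754 binary64 bit pattern of d, as a signed 64-bit integer."""
    return struct.unpack("<q", struct.pack("<d", d))[0]

def from_bits(b: int) -> float:
    """Inverse of to_bits."""
    return struct.unpack("<d", struct.pack("<q", b))[0]

# Step 2: 2^(i/16) for i = 0..15, stored as bit patterns (assumed layout).
EXP2_MID_BITS = [to_bits(2.0 ** (i / 16)) for i in range(16)]

# Placeholder coefficients c_n = ln(2)^n / n! for n = 1..6 (Taylor series,
# NOT the minimax coefficients produced by Sollya).
LN2 = math.log(2.0)
C = [LN2 ** n / math.factorial(n) for n in range(1, 7)]

def exp2_sketch(x: float) -> float:
    # Step 1: x = hi + i/16 + lo with i in 0..15 and |lo| <= 1/32.
    k = round(x * 16)
    hi = k >> 4            # floor(k / 16), also for negative k
    i = k & 15             # k mod 16
    lo = x - k / 16

    # Step 2: adding hi at bit 52 bumps the exponent field, giving
    # 2^hi * 2^(i/16) without a multiplication. Only valid while the
    # result stays in the normal range.
    scale = from_bits(EXP2_MID_BITS[i] + (hi << 52))

    # Step 3: evaluate 1 + c1*lo + ... + c6*lo^6 by pairing terms, so the
    # three pairs are independent of each other (Estrin-like grouping).
    lo2 = lo * lo
    p01 = 1.0 + C[0] * lo
    p23 = C[1] + C[2] * lo
    p45 = C[3] + C[4] * lo
    poly = p01 + lo2 * (p23 + lo2 * (p45 + lo2 * C[5]))
    return scale * poly

print(exp2_sketch(3.3), 2.0 ** 3.3)
```

Compared with a Horner chain, the grouped form shortens the dependency chain at the cost of one extra multiplication for lo^2, which is the kind of trade-off that tends to help latency on out-of-order cores.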

Performance benchmarks using the perf tool from the CORE-MATH project on a Ryzen 1700:

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 9.534
System LIBC reciprocal throughput : 6.229

BEFORE:
LIBC reciprocal throughput        : 21.405
LIBC reciprocal throughput        : 15.241    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 11.111    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 18.617
LIBC reciprocal throughput        : 12.852    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 9.253     (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 40.869
System LIBC latency : 30.580

BEFORE:
LIBC latency        : 64.888
LIBC latency        : 61.027    (with `-msse4.2` flag)
LIBC latency        : 48.778    (with `-mfma` flag)

AFTER:
LIBC latency        : 48.803
LIBC latency        : 45.047    (with `-msse4.2` flag)
LIBC latency        : 37.487    (with `-mfma` flag)
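
For a quick read of the numbers above, the relative reduction per build configuration can be computed as below. The figures are copied from the benchmark output; the dictionary layout and the helper name are just for illustration.

```python
# (before, after) pairs copied from the benchmark output above.
throughput = {
    "default": (21.405, 18.617),
    "-msse4.2": (15.241, 12.852),
    "-mfma": (11.111, 9.253),
}
latency = {
    "default": (64.888, 48.803),
    "-msse4.2": (61.027, 45.047),
    "-mfma": (48.778, 37.487),
}

def reduction_pct(before: float, after: float) -> float:
    """Relative reduction in percent (positive means faster)."""
    return 100.0 * (before - after) / before

for name, table in (("reciprocal throughput", throughput), ("latency", latency)):
    for flags, (before, after) in table.items():
        print(f"{name:22s} {flags:9s} {reduction_pct(before, after):5.1f}% lower")
```

By this measure, reciprocal throughput drops by roughly 13% to 17% and latency by roughly 23% to 26%, depending on the flags.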

Event Timeline

lntue created this revision. Sep 14 2022, 8:51 AM
lntue requested review of this revision. Sep 14 2022, 8:51 AM
lntue edited the summary of this revision. Sep 14 2022, 9:26 AM
orex accepted this revision. Sep 14 2022, 11:22 AM
This revision is now accepted and ready to land. Sep 14 2022, 11:22 AM
sivachandra accepted this revision. Sep 14 2022, 11:25 AM
This revision was automatically updated to reflect the committed changes.

I confirm the improvement, and the function is still correctly rounded.