Reduce the number of subintervals that need a lookup table and optimize
the evaluation steps.
Currently, exp2f is computed by reducing to 2^hi * 2^mid * 2^lo, where
hi is an integer, mid is a multiple of 1/32 with -16/32 <= mid <= 15/32,
and -1/64 <= lo <= 1/64. 2^lo is then approximated by a degree-6
polynomial.
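For reference, the decomposition above can be sketched in Python as
follows. The helper name `split_current` and the rounding construction
are assumptions made for illustration only; the actual implementation
may derive hi, mid and lo differently.

```python
def split_current(x):
    """Split x into hi + mid + lo as in the scheme described above.

    Assumed construction (not taken from the source): round 32*x to the
    nearest integer k, then peel off the multiple of 32.
    """
    k = round(x * 32)      # x is within 1/64 of k/32
    lo = x - k / 32        # |lo| <= 1/64
    hi = (k + 16) >> 5     # integer part, chosen so that m lands in [-16, 15]
    m = k - 32 * hi        # mid = m/32
    return hi, m / 32, lo


hi, mid, lo = split_current(3.3)
print(hi, mid, lo)
```

With this construction, x = 3.3 rounds to k = 106, which gives hi = 3,
mid = 10/32 and lo close to -0.0125.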
Experiments with Sollya show that a degree-6 polynomial can still
approximate 2^lo over a wider range with reasonable error:

  > P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/64, 1/64]);
  > dirtyinfnorm(2^x - 1 - x*P, [-1/64, 1/64]);
  0x1.e18a1bc09114def49eb851655e2e5c4dd08075ac2p-63
  > P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/32, 1/32]);
  > dirtyinfnorm(2^x - 1 - x*P, [-1/32, 1/32]);
  0x1.05627b6ed48ca417fe53e3495f7df4baf84a05e2ap-56

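To put the two bounds in perspective, the short Python sketch below
converts them into approximate bits of accuracy. The comparison against
2^-25 (half an ulp of a float just below 1) is an added yardstick for
illustration, not part of the Sollya session.

```python
import math

# Error bounds reported by dirtyinfnorm, copied verbatim from the session.
err_64 = float.fromhex("0x1.e18a1bc09114def49eb851655e2e5c4dd08075ac2p-63")
err_32 = float.fromhex("0x1.05627b6ed48ca417fe53e3495f7df4baf84a05e2ap-56")


def bits_of_accuracy(err):
    """Number of correct bits implied by an absolute error bound."""
    return -math.log2(err)


# Half an ulp of a single-precision value just below 1 (assumed yardstick).
FLOAT_HALF_ULP = 2.0 ** -25

print(f"[-1/64, 1/64]: about {bits_of_accuracy(err_64):.1f} bits")
print(f"[-1/32, 1/32]: about {bits_of_accuracy(err_32):.1f} bits")
print("wider range still below float half-ulp:", err_32 < FLOAT_HALF_ULP)
```

Doubling the interval costs roughly 6 bits (about 62 down to about 56),
which still leaves a wide margin over single precision.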
So we can optimize the implementation a bit:
- Reduce the range to mid = i/16 for i = 0..15 and -1/32 <= lo <= 1/32.
- Store the table of 2^mid values as bit patterns, and add hi directly
  to the exponent field to compute 2^hi * 2^mid.
- Rearrange the order in which the polynomial approximating 2^lo is
  evaluated.
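The three steps above can be sketched in Python as follows. This is a
minimal illustration under stated assumptions, not the committed code:
the names `EXP2_MID_BITS`, `COEFFS` and `exp2_sketch` are hypothetical,
the coefficients are plain Taylor terms standing in for the Sollya
minimax output, the pairing of terms is just one possible
rearrangement, and overflow, underflow and NaN inputs are not handled.

```python
import math
import struct

# Bit patterns of 2^(i/16) as doubles, i = 0..15 (the "table in bits").
# Built with math here for brevity; a real table would hold precomputed,
# correctly rounded constants.
EXP2_MID_BITS = [struct.unpack("<Q", struct.pack("<d", 2.0 ** (i / 16)))[0]
                 for i in range(16)]

# Stand-in coefficients for (2^lo - 1)/lo ~= c0 + c1*lo + ... + c5*lo^5:
# Taylor terms ln(2)^(k+1)/(k+1)!, NOT the Sollya minimax coefficients.
LN2 = math.log(2.0)
COEFFS = [LN2 ** (k + 1) / math.factorial(k + 1) for k in range(6)]


def exp2_sketch(x):
    """Approximate 2^x for results in the normal double range."""
    k = round(x * 16)      # x is within 1/32 of k/16
    lo = x - k / 16        # |lo| <= 1/32
    hi = k >> 4            # floor(k / 16)
    i = k & 15             # mid = i/16, i = 0..15

    # 2^hi * 2^mid: add hi straight into the exponent field (bits 52..62).
    bits = EXP2_MID_BITS[i] + (hi << 52)
    two_hi_mid = struct.unpack("<d", struct.pack("<Q", bits))[0]

    # Pair up terms so the three inner sums do not depend on each other.
    lo2 = lo * lo
    p01 = COEFFS[0] + COEFFS[1] * lo
    p23 = COEFFS[2] + COEFFS[3] * lo
    p45 = COEFFS[4] + COEFFS[5] * lo
    poly = p01 + lo2 * (p23 + lo2 * p45)   # ~= (2^lo - 1) / lo

    return two_hi_mid * (1.0 + lo * poly)


print(exp2_sketch(3.0), exp2_sketch(0.3), 2.0 ** 0.3)
```

Adding `hi << 52` to the stored bits replaces a separate scaling by
2^hi, and the paired evaluation shortens the dependency chain compared
with a straight Horner scheme.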
Performance benchmark using the perf tool from the CORE-MATH project on
a Ryzen 1700:

  $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f
  GNU libc version: 2.35
  GNU libc release: stable
  CORE-MATH reciprocal throughput   : 9.534
  System LIBC reciprocal throughput : 6.229

  BEFORE:
  LIBC reciprocal throughput        : 21.405
  LIBC reciprocal throughput        : 15.241 (with `-msse4.2` flag)
  LIBC reciprocal throughput        : 11.111 (with `-mfma` flag)

  AFTER:
  LIBC reciprocal throughput        : 18.617
  LIBC reciprocal throughput        : 12.852 (with `-msse4.2` flag)
  LIBC reciprocal throughput        : 9.253 (with `-mfma` flag)

  $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f --latency
  GNU libc version: 2.35
  GNU libc release: stable
  CORE-MATH latency   : 40.869
  System LIBC latency : 30.580

  BEFORE
  LIBC latency        : 64.888
  LIBC latency        : 61.027 (with `-msse4.2` flag)
  LIBC latency        : 48.778 (with `-mfma` flag)

  AFTER
  LIBC latency        : 48.803
  LIBC latency        : 45.047 (with `-msse4.2` flag)
  LIBC latency        : 37.487 (with `-mfma` flag)