Reduce the number of subintervals that need a lookup table and optimize the evaluation steps.

Currently, `exp2f` is computed by reducing to `2^hi * 2^mid * 2^lo` where `-16/32 <= mid <= 15/32` and `-1/64 <= lo <= 1/64`, and `2^lo` is then approximated by a degree 6 polynomial.

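Experiments with Sollya show that a degree 6 polynomial can still approximate `2^lo` over a range twice as wide with acceptable error. In the session below, `P` is a degree 5 approximation of `(2^x - 1)/x`, so `1 + x*P` is the degree 6 approximation of `2^x`: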

    > P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/64, 1/64]);
    > dirtyinfnorm(2^x - 1 - x*P, [-1/64, 1/64]);
    0x1.e18a1bc09114def49eb851655e2e5c4dd08075ac2p-63
    > P = fpminimax((2^x - 1)/x, 5, [|D...|], [-1/32, 1/32]);
    > dirtyinfnorm(2^x - 1 - x*P, [-1/32, 1/32]);
    0x1.05627b6ed48ca417fe53e3495f7df4baf84a05e2ap-56

So we can optimize the implementation a bit with:

- Reduce the range to `mid = i/16` for `i = 0..15` and `-1/32 <= lo <= 1/32`
- Store the table `2^mid` in bits, and add `hi` directly to its exponent field to compute `2^hi * 2^mid`
- Rearrange the order of evaluating the polynomial approximating `2^lo`.
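
The three steps above can be sketched as follows. This is a minimal illustration in Python, not the code in this patch. The names `exp2_sketch`, `EXP2_MID_BITS` and `COEFFS` are hypothetical, the table is generated at import time instead of being hardcoded, the coefficients are plain Taylor coefficients standing in for the Sollya minimax ones, and the Estrin-style grouping is only one plausible reading of "rearrange the order of evaluating". Special cases (NaN, infinities, overflow, underflow) are not handled.

```python
import math
import struct

LN2 = math.log(2.0)

# Placeholder coefficients (assumption): Taylor series of 2^lo,
# c_n = ln(2)^n / n! for n = 0..6. The patch uses minimax coefficients.
COEFFS = [LN2 ** n / math.factorial(n) for n in range(7)]


def double_to_bits(d):
    """Raw IEEE-754 binary64 bit pattern of d as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", d))[0]


def bits_to_double(b):
    """Inverse of double_to_bits."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]


# 2^(i/16) for i = 0..15, stored as bit patterns so that the exponent
# field can be adjusted with an integer addition.
EXP2_MID_BITS = [double_to_bits(2.0 ** (i / 16.0)) for i in range(16)]


def exp2_sketch(x):
    """Approximate 2^x for moderate finite x (no special-case handling)."""
    # Range reduction: x = hi + i/16 + lo with 0 <= i <= 15, |lo| <= 1/32.
    k = round(x * 16.0)
    lo = x - k / 16.0
    hi, i = divmod(k, 16)  # floor division keeps i in 0..15 for negative k

    # 2^hi * 2^mid: add hi to the exponent field (bits 52..62) of 2^mid.
    scale = bits_to_double(EXP2_MID_BITS[i] + (hi << 52))

    # Degree 6 polynomial for 2^lo, grouped in pairs (Estrin-style) so the
    # three inner sums are independent of each other.
    c = COEFFS
    s = lo * lo
    a = c[0] + c[1] * lo
    b = c[2] + c[3] * lo
    d = (c[4] + c[5] * lo) + c[6] * s
    p = a + s * (b + s * d)

    return scale * p


if __name__ == "__main__":
    for x in (0.3, -7.77, 10.5):
        print(x, exp2_sketch(x), 2.0 ** x)
```

Because `hi` only touches the exponent field, the product `2^hi * 2^mid` costs one integer addition instead of a floating-point multiply or a separate scaling call, which is the point of storing the table as bits.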

Performance benchmark using the perf tool from the CORE-MATH project on a Ryzen 1700:

    $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f
    GNU libc version: 2.35
    GNU libc release: stable
    CORE-MATH reciprocal throughput   : 9.534
    System LIBC reciprocal throughput : 6.229

    BEFORE:
    LIBC reciprocal throughput        : 21.405
    LIBC reciprocal throughput        : 15.241 (with `-msse4.2` flag)
    LIBC reciprocal throughput        : 11.111 (with `-mfma` flag)

    AFTER:
    LIBC reciprocal throughput        : 18.617
    LIBC reciprocal throughput        : 12.852 (with `-msse4.2` flag)
    LIBC reciprocal throughput        : 9.253 (with `-mfma` flag)

    $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh exp2f --latency
    GNU libc version: 2.35
    GNU libc release: stable
    CORE-MATH latency   : 40.869
    System LIBC latency : 30.580

    BEFORE
    LIBC latency        : 64.888
    LIBC latency        : 61.027 (with `-msse4.2` flag)
    LIBC latency        : 48.778 (with `-mfma` flag)

    AFTER
    LIBC latency        : 48.803
    LIBC latency        : 45.047 (with `-msse4.2` flag)
    LIBC latency        : 37.487 (with `-mfma` flag)