Optimize sinhf and coshf by computing exp(x) and exp(-x) simultaneously.
Currently sinhf and coshf are implemented using the following formulas:
sinh(x) = 0.5 * (exp(x) - 1) - 0.5 * (exp(-x) - 1)
cosh(x) = 0.5 * exp(x) + 0.5 * exp(-x)
where exp(x) and exp(-x) are calculated separately using the formula:
exp(x) ~ 2^hi * 2^mid * exp(dx) ~ 2^hi * 2^mid * P(dx)
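
As an illustration of the range reduction above, here is a minimal Python sketch. It rests on assumptions not stated in this message: hi is taken to be the integer part of x / ln(2), mid a fractional part with 5 bits (MID_BITS is a hypothetical name and value), and P is replaced by a degree-5 Taylor polynomial rather than the polynomial actually used in the patch. The function name exp_reduced is also hypothetical.

```python
import math

LN2 = math.log(2.0)
MID_BITS = 5  # assumed number of fractional bits carried by mid

def exp_reduced(x):
    """Approximate exp(x) as 2^hi * 2^mid * P(dx) (illustrative only)."""
    # Round x / ln(2) to the nearest multiple of 2^-MID_BITS.
    scaled = round(x / LN2 * (1 << MID_BITS))
    # hi is the integer part, mid the fractional part in [0, 1).
    hi, mid_index = divmod(scaled, 1 << MID_BITS)
    mid = mid_index / (1 << MID_BITS)
    # Remainder after removing (hi + mid) * ln(2); |dx| <= ln(2) / 64.
    dx = x - (hi + mid) * LN2
    # Stand-in for P(dx): Taylor expansion of exp(dx) up to degree 5.
    p = sum(dx**k / math.factorial(k) for k in range(6))
    # 2^mid would normally come from a lookup table; 2^hi is an exponent shift.
    return math.ldexp((2.0 ** mid) * p, hi)
```

In a real implementation 2^mid would typically be read from a small table instead of being computed with a power call; the sketch only shows how the three factors fit together.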
By expanding the polynomial P(dx) into even and odd parts
P(dx) = P_even(dx) + dx * P_odd(dx)
we can see that the computations of exp(x) and exp(-x) have many things in common,
namely:
exp(x)  ~ 2^(hi + mid) * (P_even(dx) + dx * P_odd(dx))
exp(-x) ~ 2^(-(hi + mid)) * (P_even(dx) - dx * P_odd(dx))
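
The even/odd split can be sketched in Python as follows. The coefficients are Taylor coefficients of exp, used here as a stand-in because the patch's actual polynomial is not given in this message; the names COEFFS, p_even, p_odd and exp_pair are hypothetical.

```python
import math

# Taylor coefficients of exp(dx) up to degree 5 (an assumption, not the
# coefficients used by the patch).
COEFFS = [1.0 / math.factorial(k) for k in range(6)]

def p_even(dx):
    """Even-degree terms: c0 + c2*dx^2 + c4*dx^4."""
    dx2 = dx * dx
    return COEFFS[0] + dx2 * (COEFFS[2] + dx2 * COEFFS[4])

def p_odd(dx):
    """Odd-degree terms with one dx factored out: c1 + c3*dx^2 + c5*dx^4."""
    dx2 = dx * dx
    return COEFFS[1] + dx2 * (COEFFS[3] + dx2 * COEFFS[5])

def exp_pair(dx):
    """Return (P(dx), P(-dx)) from a single evaluation of both parts."""
    even = p_even(dx)
    odd = dx * p_odd(dx)
    # Negating dx only flips the sign of the odd part.
    return even + odd, even - odd
```

Both parts depend only on dx^2, so one evaluation of each serves exp(dx) and exp(-dx) alike.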
Expanding sinh(x) and cosh(x) with respect to the above formulas, we can compute
these two functions as follows, to maximize the shared parts:
sinh(x) = (e^x - e^(-x)) / 2
        ~ 0.5 * (P_even * (2^(hi + mid) - 2^(-(hi + mid))) +
                 dx * P_odd * (2^(hi + mid) + 2^(-(hi + mid))))
cosh(x) = (e^x + e^(-x)) / 2
        ~ 0.5 * (P_even * (2^(hi + mid) + 2^(-(hi + mid))) +
                 dx * P_odd * (2^(hi + mid) - 2^(-(hi + mid))))
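
The two formulas above can be sketched together in Python. This is an illustration under assumptions, not the patch's implementation: hi + mid is treated as a single value k that is a multiple of 1/32, the powers of two are computed directly instead of being looked up, and P is a degree-5 Taylor polynomial. The name sinh_cosh is hypothetical.

```python
import math

LN2 = math.log(2.0)
# Taylor coefficients of exp up to degree 5 (stand-in for the real polynomial).
COEFFS = [1.0 / math.factorial(k) for k in range(6)]

def sinh_cosh(x):
    """Return (sinh(x), cosh(x)) sharing all intermediate results."""
    # Range reduction: x = k * ln(2) + dx, with k = hi + mid (assumed 5 fractional bits).
    k = round(x / LN2 * 32.0) / 32.0
    dx = x - k * LN2
    dx2 = dx * dx
    # Even and odd parts of P, each evaluated once.
    p_even = COEFFS[0] + dx2 * (COEFFS[2] + dx2 * COEFFS[4])
    p_odd = COEFFS[1] + dx2 * (COEFFS[3] + dx2 * COEFFS[5])
    # 2^(hi + mid) and its reciprocal, combined into a sum and a difference.
    pow_pos = 2.0 ** k
    pow_neg = 2.0 ** (-k)
    pow_sum = pow_pos + pow_neg
    pow_diff = pow_pos - pow_neg
    sinh_x = 0.5 * (p_even * pow_diff + dx * p_odd * pow_sum)
    cosh_x = 0.5 * (p_even * pow_sum + dx * p_odd * pow_diff)
    return sinh_x, cosh_x
```

The only difference between the two results is which of pow_sum and pow_diff multiplies which polynomial part, which is what makes the sharing possible.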
So in this patch, we perform the following optimizations for sinhf and coshf:
- Use the above formulas to maximize sharing of intermediate results,
- Apply optimizations similar to those in https://reviews.llvm.org/D133870
Performance benchmark using the perf tool from the CORE-MATH project on a Ryzen 1700:
For sinhf:
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinhf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 16.718
System LIBC reciprocal throughput : 63.151

BEFORE:
LIBC reciprocal throughput        : 90.116
LIBC reciprocal throughput        : 28.554    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 22.577    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 36.482
LIBC reciprocal throughput        : 16.955    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 13.943    (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh sinhf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 48.821
System LIBC latency : 137.019

BEFORE
LIBC latency        : 97.122
LIBC latency        : 84.214    (with `-msse4.2` flag)
LIBC latency        : 71.611    (with `-mfma` flag)

AFTER
LIBC latency        : 54.555
LIBC latency        : 50.865    (with `-msse4.2` flag)
LIBC latency        : 48.700    (with `-mfma` flag)
For coshf:
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh coshf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 16.939
System LIBC reciprocal throughput : 19.695

BEFORE:
LIBC reciprocal throughput        : 52.845
LIBC reciprocal throughput        : 29.174    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 22.553    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 37.169
LIBC reciprocal throughput        : 17.805    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 14.691    (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh coshf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 48.478
System LIBC latency : 48.044

BEFORE
LIBC latency        : 99.123
LIBC latency        : 85.595    (with `-msse4.2` flag)
LIBC latency        : 72.776    (with `-mfma` flag)

AFTER
LIBC latency        : 57.760
LIBC latency        : 53.967    (with `-msse4.2` flag)
LIBC latency        : 50.987    (with `-mfma` flag)