Optimize the core part of the tanhf implementation, which is computing e^x,
similar to https://reviews.llvm.org/D133870. Factor out the constants and
polynomial approximation so that they can be reused for exp10f.
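
For context, the reason e^x dominates the cost is the identity
tanh(x) = (e^(2x) - 1) / (e^(2x) + 1). The sketch below only illustrates that
identity and is not the code in this patch: the name `tanhf_sketch` and the
cutoff of 10 are hypothetical, and it calls the standard `std::expm1` in double
precision where the patch uses its own range reduction and polynomial
approximation.

```cpp
#include <cmath>

// Hypothetical illustration only, NOT the LLVM libc implementation.
// tanh(x) = (e^(2x) - 1) / (e^(2x) + 1), so evaluating e^(2x) is the core step.
float tanhf_sketch(float x) {
  if (std::isnan(x))
    return x;
  double ax = std::fabs(static_cast<double>(x));
  // Assumed cutoff: for |x| >= 10, 1 - tanh(|x|) is far below half a float
  // ulp of 1, so the result rounds to +-1.
  if (ax >= 10.0)
    return std::copysign(1.0f, x);
  // expm1(2|x|) = e^(2|x|) - 1, used to avoid cancellation near zero.
  double em1 = std::expm1(2.0 * ax);
  double r = em1 / (em1 + 2.0); // (e^(2|x|) - 1) / (e^(2|x|) + 1)
  // tanh is odd, so restore the sign of the input at the end.
  return static_cast<float>(std::copysign(r, static_cast<double>(x)));
}
```

Working in double and rounding once at the end keeps the sketch short. A real
float implementation would instead share the e^x constants and polynomial
across functions, which is what this change factors out.
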
Performance benchmark using perf tool from the CORE-MATH project on Ryzen 1700:
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanhf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 13.377
System LIBC reciprocal throughput : 55.046

BEFORE:
LIBC reciprocal throughput        : 75.674
LIBC reciprocal throughput        : 33.242    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 25.927    (with `-mfma` flag)

AFTER:
LIBC reciprocal throughput        : 26.359
LIBC reciprocal throughput        : 18.888    (with `-msse4.2` flag)
LIBC reciprocal throughput        : 14.243    (with `-mfma` flag)

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanhf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 43.365
System LIBC latency : 123.499

BEFORE
LIBC latency        : 112.968
LIBC latency        : 104.908    (with `-msse4.2` flag)
LIBC latency        : 92.310     (with `-mfma` flag)

AFTER
LIBC latency        : 69.828
LIBC latency        : 63.874    (with `-msse4.2` flag)
LIBC latency        : 57.427    (with `-mfma` flag)