Correctly rounded implementation of tanhf. Roughly 2x faster than the glibc counterpart.
Performance (LLVM 12, Intel):

CORE_MATH_PERF_MODE=rdtsc PERF_ARGS='' ./perf.sh tanhf
GNU libc version: 2.31
GNU libc release: stable
13.279
37.492
18.145

CORE_MATH_PERF_MODE=rdtsc PERF_ARGS='--latency' ./perf.sh tanhf
GNU libc version: 2.31
GNU libc release: stable
40.658
109.582
66.568