Implement tanf function correctly rounded for all rounding modes.
We use the range reduction that is shared with sinf, cosf, and sincosf:
k = round(x * 32/pi) and y = x * (32/pi) - k.
Then we use the tangent of sum formula:
tan(x) = tan((k + y)* pi/32) = tan((k mod 32) * pi / 32 + y * pi/32) = (tan((k mod 32) * pi/32) + tan(y * pi/32)) / (1 - tan((k mod 32) * pi/32) * tan(y * pi/32))
We need to make a further reduction when k mod 32 >= 16 due to the pole at pi/2 of tan(x) function:
if (k mod 32 >= 16): k = k - 31, y = y - 1.0
And to compute the final result, we store tan(k * pi/32) for k = -15..15 in a table of 32 double values,
and evaluate tan(y * pi/32) with a degree-11 minimax odd polynomial generated by Sollya with:
> P = fpminimax(tan(y * pi/32)/y, [|0, 2, 4, 6, 8, 10|], [|D...|], [0, 1.5]);
Performance benchmark using perf tool from the CORE-MATH project on Ryzen 1700:
$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanf CORE-MATH reciprocal throughput : 18.586 System LIBC reciprocal throughput : 50.068 LIBC reciprocal throughput : 33.823 LIBC reciprocal throughput : 25.161 (with `-msse4.2` flag) LIBC reciprocal throughput : 19.157 (with `-mfma` flag) $ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh tanf --latency GNU libc version: 2.31 GNU libc release: stable CORE-MATH latency : 55.630 System LIBC latency : 106.264 LIBC latency : 96.060 LIBC latency : 90.727 (with `-msse4.2` flag) LIBC latency : 82.361 (with `-mfma` flag)