Implement acosf function correctly rounded for all rounding modes.

We perform range reduction as follows:

- When `|x| < 2^(-10)`, we use the cubic Taylor polynomial:

acos(x) = pi/2 - asin(x) ~ pi/2 - x - x^3 / 6.

- When `2^(-10) <= |x| <= 0.5`, we use the same approximation that is used for `asinf(x)` when `|x| <= 0.5`:

acos(x) = pi/2 - asin(x) ~ pi/2 - x - x^3 * P(x^2).

- When `0.5 < x <= 1`, we use the double angle formula `cos(2y) = 1 - 2 * sin^2(y)` to reduce to:

acos(x) = 2 * asin( sqrt( (1 - x)/2 ) )

- When `-1 <= x < -0.5`, we reduce to the positive case above using the formula:

acos(x) = pi - acos(-x)
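
The four cases above can be sketched in Python as a rough illustration of the control flow only. This is not the patch's code: it runs in double precision, makes no attempt at correct rounding, and the names `asin_small` and `acos_sketch` are hypothetical. In particular, `asin_small` uses a truncated Taylor series as a stand-in for `x + x^3 * P(x^2)`, since the actual coefficients of `P` are not given here. For the third case, setting `x = cos(2y)` in the double angle formula gives `sin^2(y) = (1 - x)/2`, so `y = asin(sqrt((1 - x)/2))` and `acos(x) = 2y`; the argument of `asin` then lies in `[0, 0.5)`, which is why the same small-range approximation can be reused.

```python
import math


def asin_small(x: float) -> float:
    """Approximate asin(x) for |x| <= 0.5 (hypothetical helper).

    Stand-in for x + x^3 * P(x^2): sums the Taylor series
    asin(x) = sum_n a_n * x^(2n+1) / (2n+1), with a_n = C(2n, n) / 4^n.
    """
    x2 = x * x
    term = x    # a_n * x^(2n+1), starting at n = 0
    total = x
    for n in range(1, 30):
        term *= x2 * (2 * n - 1) / (2 * n)   # a_n / a_(n-1) = (2n-1)/(2n)
        total += term / (2 * n + 1)
    return total


def acos_sketch(x: float) -> float:
    """Piecewise acos following the range reduction described above."""
    if not -1.0 <= x <= 1.0:
        raise ValueError("acos is only defined on [-1, 1]")
    ax = abs(x)
    if ax < 2.0 ** -10:
        # Cubic Taylor polynomial: pi/2 - x - x^3 / 6.
        return math.pi / 2 - x - x ** 3 / 6
    if ax <= 0.5:
        # Same approximation as asin on |x| <= 0.5.
        return math.pi / 2 - asin_small(x)
    # 0.5 < |x| <= 1: acos(|x|) = 2 * asin(sqrt((1 - |x|) / 2)).
    u = math.sqrt((1 - ax) / 2)              # u is in [0, 0.5)
    positive_case = 2 * asin_small(u)
    # Negative inputs reduce to the positive case: acos(x) = pi - acos(-x).
    return positive_case if x > 0 else math.pi - positive_case
```

Comparing `acos_sketch` against `math.acos` at a few points in each interval is a quick way to check that the branches agree at the boundaries `2^(-10)`, `0.5`, and `1`.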

Performance benchmark using the perf tool from the CORE-MATH project on a Ryzen 1700:

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh acosf
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH reciprocal throughput   : 28.613
System LIBC reciprocal throughput : 29.204
LIBC reciprocal throughput        : 24.271

$ CORE_MATH_PERF_MODE="rdtsc" ./perf.sh asinf --latency
GNU libc version: 2.35
GNU libc release: stable
CORE-MATH latency   : 55.554
System LIBC latency : 76.879
LIBC latency        : 62.118