Change sinf range reduction to mod pi/16 to be shared with cosf.
Previously, sinf used range reduction mod pi. That reduction cannot be reused to implement cosf, because the minimax algorithm for cosf does not converge due to the critical points at pi/2. To share the same range reduction functions between sinf and cosf, we change the range reduction to mod pi/16, for the following reasons:
- The table size is sufficiently small: 32 entries for sin(k * pi/16) with k = 0..31. It could be reduced to 16 entries if we treat the final sign separately, with an extra multiplication at the end.
- The polynomials' degrees are reduced to 7/8 from 15, at the cost of extra computations to combine sin and cos with the angle-sum identity.
- The number of exceptional cases is reduced to 2 (with FMA) and 3 (without FMA).
- The latency is reduced while keeping throughput similar to before.
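To illustrate the scheme described above, here is a minimal sketch, not the code in this patch. Writing x = (k + y) * pi/16 with k an integer and |y| <= 1/2, the angle-sum identities give sin(x) = sin(k*pi/16) * cos(y*pi/16) + cos(k*pi/16) * sin(y*pi/16) and cos(x) = cos(k*pi/16) * cos(y*pi/16) - sin(k*pi/16) * sin(y*pi/16), so one reduction serves both functions. All names here (sketch, reduce_and_eval, sin_sketch, cos_sketch) are hypothetical. The sketch assumes moderate |x|, fills the 32-entry table with std::sin instead of hard-coded constants, and uses Taylor coefficients of degree 7 and 8 as stand-ins for the minimax polynomials.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

namespace sketch {  // hypothetical namespace, for illustration only

constexpr double PI = 3.14159265358979323846;

// sin(k * pi/16) for k = 0..31. Assumption: a real implementation would
// store precomputed constants; this sketch fills the table with std::sin.
struct SinTable {
  double v[32];
  SinTable() {
    for (int k = 0; k < 32; ++k)
      v[k] = std::sin(k * PI / 16.0);
  }
};

inline const SinTable &table() {
  static const SinTable t;
  return t;
}

// The four factors needed by both sin and cos after reduction mod pi/16.
struct Parts {
  double sin_k, cos_k, sin_y, cos_y;
};

inline Parts reduce_and_eval(float x) {
  // Naive reduction in double; only meant for moderate |x|.
  assert(std::isfinite(x) && std::fabs(x) < 1.0e6f);
  double t = static_cast<double>(x) * (16.0 / PI);
  double kd = std::round(t);             // nearest integer, so |t - kd| <= 0.5
  int64_t k = static_cast<int64_t>(kd);
  double u = (t - kd) * (PI / 16.0);     // |u| <= pi/32
  double u2 = u * u;
  // Taylor stand-ins: degree 7 for sin(u), degree 8 for cos(u).
  double sin_y =
      u * (1.0 + u2 * (-1.0 / 6 + u2 * (1.0 / 120 + u2 * (-1.0 / 5040))));
  double cos_y =
      1.0 + u2 * (-0.5 + u2 * (1.0 / 24 + u2 * (-1.0 / 720 + u2 * (1.0 / 40320))));
  // cos(k*pi/16) = sin((k + 8)*pi/16), so one table serves both lookups.
  // The mask reduces k mod 32, including for negative k (two's complement).
  return {table().v[k & 31], table().v[(k + 8) & 31], sin_y, cos_y};
}

inline double sin_sketch(float x) {
  Parts p = reduce_and_eval(x);
  return p.sin_k * p.cos_y + p.cos_k * p.sin_y;
}

inline double cos_sketch(float x) {
  Parts p = reduce_and_eval(x);
  return p.cos_k * p.cos_y - p.sin_k * p.sin_y;
}

}  // namespace sketch
```

Both functions call the same reduce_and_eval step and differ only in how the four factors are combined, which is the sharing this change is after.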
I think this line can be changed to static_cast<int64_t>(k_hi + k_low): k_hi and k_low are already integer-valued, so one static cast is enough instead of two. This might improve performance.
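A small sketch of the two forms, to make the suggestion concrete. The helper names two_casts and one_cast are hypothetical, and k_hi and k_low are assumed to be integer-valued doubles as the comment states. The two forms agree whenever k_hi + k_low is exactly representable in double (for example when its magnitude is below 2^53). For larger values the addition rounds before the cast, so the low bits can differ.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Original form: cast each part, then add as integers.
inline int64_t two_casts(double k_hi, double k_low) {
  // Assumption from the review comment: both parts are integer-valued.
  assert(k_hi == std::trunc(k_hi) && k_low == std::trunc(k_low));
  return static_cast<int64_t>(k_hi) + static_cast<int64_t>(k_low);
}

// Suggested form: add as doubles, then cast once.
inline int64_t one_cast(double k_hi, double k_low) {
  assert(k_hi == std::trunc(k_hi) && k_low == std::trunc(k_low));
  return static_cast<int64_t>(k_hi + k_low);
}
```

With k_hi = 1024.0 and k_low = 7.0 both helpers return 1031. With k_hi = 2^60 and k_low = 3.0 the double sum rounds back to 2^60, so one_cast drops the low bits that two_casts keeps. Whether that case can occur depends on the bounds of k_hi in the actual code, which this sketch does not assume.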