The implementations use the x86_64 FPU instructions. These instructions
are extremely slow compared to a polynomial based software
implementation. Also, their accuracy falls drastically once the input
goes beyond 2PI. To improve both the speed and accuracy, we will be
taking the following approach going forward:
- As a follow up to this CL, we will implement a range reduction algorithm
which will expand the accuracy to the entire double precision range.
- After that, we will replace the HW instructions with a polynomial
implementation to improve the run time.
After step 2, the implementations will be accurate, performant and target
architecture independent.