The current implementation of _fast_ division (`A/B`) works as follows:

- Get an initial estimate of the reciprocal of `B`
- Use Newton's iteration method to refine the reciprocal
- Multiply the refined estimate by `A`
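
For illustration only, the steps above can be sketched in Python as follows. This is a minimal model, not the actual code generator: the function name `fast_div_current` and the parameters `initial_estimate` and `iterations` are hypothetical, and the starting estimate is passed in by hand here.

```python
def fast_div_current(a, b, initial_estimate, iterations=2):
    """Approximate a / b by refining an estimate of 1 / b, then multiplying.

    `initial_estimate` is a hypothetical stand-in for the starting
    reciprocal estimate; `iterations` is the number of Newton steps.
    """
    x = initial_estimate
    for _ in range(iterations):
        # Newton step for f(x) = 1/x - b:  x <- x + x * (1 - b * x)
        x = x + x * (1.0 - b * x)
    # The multiplication by `a` happens only after all iterations.
    return a * x
```

For example, `fast_div_current(1.0, 3.0, 0.3, iterations=3)` converges toward `1/3`, with the error roughly squaring on each step.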

Compared with GCC, this loses some precision because the multiplication by `A` is done only after all iterations have finished.

This patch moves the multiplication to before the last iteration to make the result more accurate. It does not add or change any existing nodes/instructions; it only reorders the calculation.
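
One way to picture the reordering is the sketch below. It is an assumption about the shape of the computation, not a copy of the patch: the name `fast_div_reordered` and its parameters are hypothetical. All but the last Newton step refine the reciprocal as before; the last step is applied to the quotient `a * x` instead, using the residual `a - b * q`. Algebraically this equals `a * (x + x * (1 - b * x))`, and the final step still uses one multiply and two multiply-add style operations, so the operation count does not change.

```python
def fast_div_reordered(a, b, initial_estimate, iterations=2):
    """Approximate a / b, multiplying by `a` before the last Newton step.

    Hypothetical sketch: `initial_estimate` stands in for the starting
    reciprocal estimate, and `iterations` must be at least 1.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    x = initial_estimate
    for _ in range(iterations - 1):
        # Ordinary Newton steps on the reciprocal estimate.
        x = x + x * (1.0 - b * x)
    q = a * x        # multiply before the last iteration
    r = a - b * q    # residual of the quotient estimate
    return q + x * r # last iteration, applied to the quotient
```

Plain Python floats model neither single precision nor fused multiply-add, so this sketch shows only the order of operations, not the precision difference itself.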