Only fold for uniform values on pre-GFX9 chips. GFX9+ allow us
to keep the calculation entirely on the SALU.
We never fold if the mul has multiple uses, since that effectively
duplicates the (expensive) multiply.
Finally, we expand 64x32 and 64x64 multiplies here as well, if they
feed into an addition. This results in better code generation than
the generic expansion for such multiplies because we end up using
the accumulator of the MAD instructions.
Nit: use range-based loop over uses()?