D77152 tried to do this but got it wrong in the shift-by-zero case.
D86430 reverted the incorrect code. Reimplement the optimization with
different code depending on whether the shift amount is known to be
non-zero (modulo bitwidth).
This improves code quality for fshl tests on AMDGPU, which only has an
fshr instruction.
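
As a rough illustration of the two cases described above, here is a standalone sketch that models 32-bit funnel shifts in plain C++. The helper names (fshl, fshr, fshlViaFshrNonZero, fshlViaFshrAny) are hypothetical and only model the intrinsic semantics; this is not the code from the patch. The general-case rewrite shown is one identity that holds for any shift amount, assumed here for illustration.

```cpp
#include <cstdint>

// Reference model: concatenate x:y into 64 bits, shift left by z mod 32,
// and return the high 32 bits.
uint32_t fshl(uint32_t x, uint32_t y, uint32_t z) {
  uint64_t concat = (static_cast<uint64_t>(x) << 32) | y;
  return static_cast<uint32_t>((concat << (z % 32)) >> 32);
}

// Reference model: concatenate x:y into 64 bits, shift right by z mod 32,
// and return the low 32 bits.
uint32_t fshr(uint32_t x, uint32_t y, uint32_t z) {
  uint64_t concat = (static_cast<uint64_t>(x) << 32) | y;
  return static_cast<uint32_t>(concat >> (z % 32));
}

// Simple rewrite: negate the shift amount.
// Only correct when z is non-zero modulo 32; for z % 32 == 0 it yields y
// instead of x, which is the shift-by-zero problem.
uint32_t fshlViaFshrNonZero(uint32_t x, uint32_t y, uint32_t z) {
  return fshr(x, y, 0u - z);
}

// General rewrite: pre-shift the x:y pair right by one bit, then use ~z.
// Since ~z mod 32 == 31 - (z mod 32), the total right shift is
// 32 - (z mod 32), which is correct for every z, including zero.
uint32_t fshlViaFshrAny(uint32_t x, uint32_t y, uint32_t z) {
  return fshr(x >> 1, fshr(x, y, 1), ~z);
}
```

The general form costs two extra operations, so it only makes sense when the shift amount cannot be proven non-zero modulo the bit width.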
I find these comments a bit confusing, as the full transform is
I suggest that we either show the full transform as above or, for example, add primes to X and Y on the right-hand side to show that they are not the same X and Y as on the left-hand side: