This previously existed for C == 0, but is mostly beneficial for any
C.
There is a slight code-size cost, as we get more imm32 (as opposed to
imm8) constants in some cases. But the benefit is that we get fewer
imm16 constants (LCP stalls) and save instructions in some vec ->
scalar codegen.