This relies on previous support for expanding MULH[US] / [US]MUL_LOHI. Instead
of doing division-by-constant only when those instructions are legal, targets
should now use isIntDivCheap to signal that they do not want this expansion.
This change allows 64-bit division-by-constant to use the more efficient
multiply and shift lowering on AMDGPU.
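
For reference, a minimal sketch of what the multiply and shift lowering computes for one fixed divisor. This is an illustration, not the DAG combine itself: `mulhu64` and `udiv10` are hypothetical names, the divisor 10 is picked only as an example, and the helper assumes the GCC/Clang `unsigned __int128` extension to stand in for MULHU.

```cpp
#include <cstdint>

// Hypothetical stand-in for MULHU on i64: high 64 bits of the 128-bit
// product A * B. Assumes the GCC/Clang `unsigned __int128` extension.
static inline uint64_t mulhu64(uint64_t A, uint64_t B) {
  return static_cast<uint64_t>((static_cast<unsigned __int128>(A) * B) >> 64);
}

// Unsigned X / 10 without a divide instruction.
// Magic = ceil(2^67 / 10); the remaining shift after taking the high half is 3.
static inline uint64_t udiv10(uint64_t X) {
  const uint64_t Magic = 0xCCCCCCCCCCCCCCCDULL;
  return mulhu64(X, Magic) >> 3;
}
```

On a target without a native 64-bit MULHU, the high half itself has to be expanded from narrower multiplies, which is the expansion support this change builds on.
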
This also affects a lowering on SPARC in a way that may or may not be more
efficient; see the change in the corresponding test case for the effect. I'd
appreciate some feedback on that.
The vector case is not enabled yet even though it should be correct and will
likely allow better overall code generation eventually. Enabling it gives some
regressions in X86 tests, mostly due to what looks like insufficient
peep-holing when vNi64 multiplies are scalarized.
This is generating 8 multiply instructions; something is going wrong in your algorithm. (It should only take four multiply instructions to perform a double-width multiply.)
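
To make the four-multiply count concrete, here is a rough sketch that reads "double-width multiply" as a 64x64 -> 128-bit product built from 32-bit halves; that reading is an assumption, and `U128` and `mulWide64` are hypothetical names, not LLVM API. Each of the four partial products is a 32x32 -> 64-bit multiply.

```cpp
#include <cstdint>

// Hypothetical result type: the two 64-bit halves of a 128-bit product.
struct U128 {
  uint64_t Lo;
  uint64_t Hi;
};

// Full 64x64 -> 128 unsigned product using four 32x32 -> 64 multiplies.
static inline U128 mulWide64(uint64_t A, uint64_t B) {
  const uint64_t ALo = A & 0xFFFFFFFFu, AHi = A >> 32;
  const uint64_t BLo = B & 0xFFFFFFFFu, BHi = B >> 32;

  // The four partial products.
  const uint64_t LL = ALo * BLo;
  const uint64_t LH = ALo * BHi;
  const uint64_t HL = AHi * BLo;
  const uint64_t HH = AHi * BHi;

  // Sum of everything that lands in bits [32, 64); at most 3 * (2^32 - 1),
  // so it cannot overflow 64 bits. Its upper part carries into the high half.
  const uint64_t Mid = (LL >> 32) + (LH & 0xFFFFFFFFu) + (HL & 0xFFFFFFFFu);

  U128 R;
  R.Lo = (Mid << 32) | (LL & 0xFFFFFFFFu);
  R.Hi = HH + (LH >> 32) + (HL >> 32) + (Mid >> 32);
  return R;
}
```

Whether each of these partial products is a single instruction depends on the target; the sketch only counts widening multiplies.
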