This is an alternative to D37896. I don't see a way to decompose multiplies generically without a target hook to tell us when it's profitable.
As a first step, I'm just trying to get the vector cases requested in PR34474:
https://bugs.llvm.org/show_bug.cgi?id=34474
The shakiest test diff here may be SSE4.1 code that uses 'pmulld' with a constant pool load. That can become 4 instructions like:
  movdqa %xmm0, %xmm1
  pslld $4, %xmm1
  paddd %xmm0, %xmm1
  movdqa %xmm1, %xmm0
...but I think that, despite the code-size increase, this is still better-performing code. A scan of Agner's timing tables says pmulld always has a latency of at least 4 cycles, and possibly as much as 11 cycles. So replacing it with fast ops (and removing the constant load) should be a win even in the minimal case.
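For reference, here is a minimal scalar sketch of the identity behind the sequence above. It assumes the multiplier in that example is 17 (inferred from the shift by 4 followed by an add of the original value), and the function names are hypothetical, not anything in the patch. Each array element stands in for one 32-bit lane of the xmm register, and the register copies (movdqa) are not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec4 = std::array<std::uint32_t, 4>; // stand-in for four 32-bit lanes

// Reference form: a plain per-lane multiply by 17 (what pmulld would compute
// with a splat-17 constant). Unsigned arithmetic wraps modulo 2^32.
Vec4 mulBy17(const Vec4 &V) {
  Vec4 R{};
  for (std::size_t I = 0; I < V.size(); ++I)
    R[I] = V[I] * 17u;
  return R;
}

// Decomposed form: x * 17 == (x << 4) + x, mirroring pslld $4 then paddd.
Vec4 mulBy17ShiftAdd(const Vec4 &V) {
  Vec4 R{};
  for (std::size_t I = 0; I < V.size(); ++I)
    R[I] = (V[I] << 4) + V[I];
  return R;
}

// Hypothetical helper: asserts that both forms agree on every lane.
void checkLanes(const Vec4 &V) {
  assert(mulBy17(V) == mulBy17ShiftAdd(V));
}
```

Because both forms wrap modulo 2^32, they agree even when the product overflows a lane, which is why the replacement needs no extra checks on the input values.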