Similar to v4i32 SHL, convert v8i16 shift amounts to scale factors to improve performance and reduce instruction count. We were already doing this for constant shifts; this adds variable shift support.
This reduces the serial nature of the codegen, which relies on a chain of pblendvb/pand+pandn+por shifts.
This is a step towards adding support for vXi16 vector rotates.
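For reviewers who want the idea spelled out, here is a minimal scalar sketch of the shift-amount-to-scale-factor transform described above. It is not the code in this patch: the `V8i16` alias and all function names are hypothetical, and it assumes the scale is applied with a lane-wise low-half multiply (what pmullw provides on x86). It only models in-range amounts (less than 16), since a larger shift amount has no defined result for a 16-bit lane.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of a v8i16 vector: eight unsigned 16-bit lanes.
using V8i16 = std::array<std::uint16_t, 8>;

// Reference semantics: per-lane logical shift left, amounts assumed < 16.
V8i16 shl_reference(const V8i16 &x, const V8i16 &amt) {
  V8i16 r{};
  for (int i = 0; i < 8; ++i)
    r[i] = static_cast<std::uint16_t>(static_cast<std::uint32_t>(x[i]) << amt[i]);
  return r;
}

// Turn each shift amount into the scale factor (1 << amt).
V8i16 amounts_to_scale(const V8i16 &amt) {
  V8i16 s{};
  for (int i = 0; i < 8; ++i) {
    assert(amt[i] < 16 && "out-of-range amounts are not modelled here");
    s[i] = static_cast<std::uint16_t>(1u << amt[i]);
  }
  return s;
}

// Lane-wise multiply keeping only the low 16 bits of each product.
V8i16 mul_lo(const V8i16 &a, const V8i16 &b) {
  V8i16 r{};
  for (int i = 0; i < 8; ++i)
    r[i] = static_cast<std::uint16_t>(static_cast<std::uint32_t>(a[i]) * b[i]);
  return r;
}

// x << amt computed as x * (1 << amt): one multiply, no blend chain.
V8i16 shl_via_scale(const V8i16 &x, const V8i16 &amt) {
  return mul_lo(x, amounts_to_scale(amt));
}
```

The two forms agree because multiplying by 2^n and truncating to 16 bits discards exactly the bits a left shift by n would shift out. The multiply handles all lanes in a single independent operation, rather than a sequence of conditional shift-and-select steps that each depend on the previous one.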
A comment around here or maybe as a function comment to describe the transforms would be good.
?