I believe all the pieces are now in place in the backend to do this correctly. We can truncate the vXi64 operands to vXi32, sign- or zero-extend them back to the original width, and multiply.
In the backend the truncate+extend will become sign_extend_inreg/zero_extend_inreg (the latter is really just an and). Those will then be combined with the mul to form PMULDQ/PMULUDQ, and SimplifyDemandedBits will strip the sign_extend_inreg/zero_extend_inreg out.
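As a rough scalar model of why the extends can be stripped, the sketch below computes one 64-bit lane the way the truncate+extend+multiply sequence does. The function names are hypothetical and not part of the patch; the point is only that the result depends on nothing but the low 32 bits of each operand, which is what PMULDQ/PMULUDQ read.

```c
#include <stdint.h>

/* Hypothetical helper: one lane of the signed form (PMULDQ-like).
   Truncate to 32 bits, sign-extend back to 64 bits, then multiply.
   The narrowing cast is implementation-defined in C for out-of-range
   values; GCC and Clang wrap modulo 2^32, which is assumed here. */
static int64_t lane_mul_lo32_signed(int64_t a, int64_t b) {
    int64_t a_lo = (int64_t)(int32_t)(uint32_t)a; /* sign_extend_inreg */
    int64_t b_lo = (int64_t)(int32_t)(uint32_t)b;
    return a_lo * b_lo; /* cannot overflow: |a_lo|, |b_lo| <= 2^31 */
}

/* Hypothetical helper: one lane of the unsigned form (PMULUDQ-like).
   Truncate+zero-extend is just a mask of the low 32 bits. */
static uint64_t lane_mul_lo32_unsigned(uint64_t a, uint64_t b) {
    uint64_t a_lo = a & 0xFFFFFFFFu; /* zero_extend_inreg, i.e. an and */
    uint64_t b_lo = b & 0xFFFFFFFFu;
    return a_lo * b_lo;
}
```

Because the upper 32 bits of each input never affect the result, an instruction that only reads the low halves makes the explicit in-register extension redundant.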
The only question I have is whether it's OK to emit the v2i32 intermediate type for the 128-bit version. I'm not aware of any examples where we use an illegal type in our intrinsic/builtin handling, at least not a narrower one. I know pavg uses a wider type.
I think I could probably do this all in the header file using __builtin_convertvector if that's desired.
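A minimal sketch of what the header-only variant might look like for the 128-bit case, assuming generic vector types and `__builtin_convertvector` (available in Clang, and in GCC 9 or later). The typedef and function names are hypothetical, chosen for illustration, and are not the names used in the actual headers.

```c
#include <stdint.h>

/* Hypothetical typedefs; the v2i32/v2u32 types are the narrower
   64-bit-wide intermediates discussed above. */
typedef int64_t  v2i64 __attribute__((vector_size(16)));
typedef int32_t  v2i32 __attribute__((vector_size(8)));
typedef uint64_t v2u64 __attribute__((vector_size(16)));
typedef uint32_t v2u32 __attribute__((vector_size(8)));

/* Signed form: truncate each lane to 32 bits, sign-extend, multiply. */
static inline v2i64 sketch_mul_epi32(v2i64 a, v2i64 b) {
    v2i64 a_lo = __builtin_convertvector(__builtin_convertvector(a, v2i32), v2i64);
    v2i64 b_lo = __builtin_convertvector(__builtin_convertvector(b, v2i32), v2i64);
    return a_lo * b_lo;
}

/* Unsigned form: truncate each lane to 32 bits, zero-extend, multiply. */
static inline v2u64 sketch_mul_epu32(v2u64 a, v2u64 b) {
    v2u64 a_lo = __builtin_convertvector(__builtin_convertvector(a, v2u32), v2u64);
    v2u64 b_lo = __builtin_convertvector(__builtin_convertvector(b, v2u32), v2u64);
    return a_lo * b_lo;
}
```

Whether this pattern actually selects PMULDQ/PMULUDQ depends on the backend combines described above; the sketch only shows the source-level shape.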