If we are splatting pairs of 32-bit elements, we can use a 64-bit broadcast to get the job done.
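As a rough illustration of the idea above (not the code in this patch), the check can be sketched on a plain shuffle mask: a mask over 32-bit elements that repeats one aligned pair {2k, 2k+1} is the same as splatting 64-bit element k. The helper name `isPairSplatMask` and the use of `std::vector<int>` are assumptions made for this sketch; they are not LLVM APIs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only (not an LLVM API).
// Mask holds indices into a vector of 32-bit elements. Returns true if the
// mask repeats a single aligned pair {2k, 2k+1}, in which case the shuffle is
// equivalent to broadcasting 64-bit element k. On success, WideIdx is set to k.
bool isPairSplatMask(const std::vector<int> &Mask, int &WideIdx) {
  // Need a non-empty mask with an even number of 32-bit lanes.
  if (Mask.size() < 2 || Mask.size() % 2 != 0)
    return false;

  // The pair must start on an even index so it maps onto one 64-bit element.
  int Lo = Mask[0];
  if (Lo < 0 || Lo % 2 != 0)
    return false;

  // Every pair of lanes must be exactly {Lo, Lo + 1}.
  for (std::size_t I = 0; I < Mask.size(); I += 2)
    if (Mask[I] != Lo || Mask[I + 1] != Lo + 1)
      return false;

  WideIdx = Lo / 2;
  return true;
}
```

This sketch ignores undef lanes and two-input shuffles, both of which a real lowering would have to handle.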
We could probably do this with other sizes too, for example four 16-bit elements with a 64-bit broadcast, or pairs of 16-bit elements with a 32-bit broadcast. But I've left that as a future improvement.
I've also restricted this to AVX2 only, because under AVX1 we can only broadcast from a load (memory operand), not from a register.
It looks like we may still need a DAG combine for VBROADCAST + VZEXT_LOAD to fold the loads in insertelement-shuffle.ll and vector-shuffle-combining-xop.ll.