If we are splatting pairs of 32-bit elements, we can use a 64-bit broadcast to get the job done.
We could probably do this with other sizes too, for example four 16-bit elements with a 64-bit broadcast, or pairs of 16-bit elements with a 32-bit element broadcast. But I've left that as a future improvement.
I've also restricted this to AVX2, because under plain AVX we can only broadcast from a load.
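For readers following along, here is a minimal, portable sketch of the idea of treating a repeated pair of 32-bit elements as a single 64-bit element and broadcasting that. It is not the patch's lowering code: `V8x32` and `broadcastPair` are hypothetical names, and plain arrays stand in for vector registers.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a 256-bit vector of eight 32-bit lanes.
using V8x32 = std::array<std::uint32_t, 8>;

// Build <A, B, A, B, A, B, A, B> by splatting one 64-bit value.
V8x32 broadcastPair(std::uint32_t A, std::uint32_t B) {
  // View the two 32-bit elements as a single 64-bit element.
  // memcpy keeps the in-memory order, so this is endian-neutral.
  const std::uint32_t Pair[2] = {A, B};
  std::uint64_t Wide;
  std::memcpy(&Wide, Pair, sizeof(Wide));

  // The "broadcast": every 64-bit lane receives the same value.
  std::array<std::uint64_t, 4> Lanes64;
  Lanes64.fill(Wide);

  // Reinterpret the four 64-bit lanes as eight 32-bit lanes.
  V8x32 Result;
  std::memcpy(Result.data(), Lanes64.data(),
              Result.size() * sizeof(std::uint32_t));
  return Result;
}
```

Calling `broadcastPair(1, 2)` yields lanes that alternate between 1 and 2, which is the pattern the 64-bit broadcast is meant to cover.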
Looks like we may still need a DAG combine for VBROADCAST + VZEXT_LOAD to fold the loads in insertelement-shuffle.ll and vector-shuffle-combining-xop.ll.
Initialize the first 2 elements to simplify the code?
I think this would also be easier to read if it was split off into a helper function / lambda because you could just early return when you detect that it's not a splat pair.
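Roughly what I have in mind, as a sketch only: `isSplatPair` is a hypothetical name, and a `std::vector<int>` stands in for the build vector operands rather than the real `SDValue`s. The first two elements are taken up front and the loop returns as soon as a mismatch is seen.

```cpp
#include <cstddef>
#include <vector>

// Returns true if Ops has the form <a, b, a, b, ...>.
// Note that a plain splat <a, a, a, a> also satisfies this check.
bool isSplatPair(const std::vector<int> &Ops) {
  // Need at least two full pairs and an even element count.
  if (Ops.size() < 4 || Ops.size() % 2 != 0)
    return false;

  // Initialize the pair from the first two elements.
  const int Even = Ops[0];
  const int Odd = Ops[1];

  // Early return on the first element that breaks the pattern.
  for (std::size_t I = 2; I != Ops.size(); ++I)
    if (Ops[I] != (I % 2 == 0 ? Even : Odd))
      return false;

  return true;
}
```

With the check split out like this, the caller only has to handle the success case.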