This is very similar to D8486 (vperm2). If we treat insertps intrinsics as shufflevectors, the existing shufflevector optimizations can apply to them.
Of the zero mask variants, this patch handles only the case where all four elements are zeroed; I've left the rest out. I don't think those can be converted into a single shuffle in all cases, but I'd be happy to be proven wrong, as I was for vperm2f128.
Either way, we'd need to support whatever sequence we come up with for those cases in the backend before converting them here.
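
To make the mapping concrete, here is a rough standalone sketch of how an insertps immediate could be decoded into a shuffle mask. It is not the code in this patch: `decodeInsertPS`, `Kind`, and `Decoded` are hypothetical names used only for illustration. It assumes the usual imm8 layout (bits 7:6 select the source element, bits 5:4 select the destination element, bits 3:0 are the zero mask) and the shufflevector convention that indices 0-3 pick from the first operand and 4-7 from the second.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical illustration only; these names do not come from the patch.
enum class Kind { AllZero, Shuffle, Unsupported };

struct Decoded {
  Kind kind;
  // Only meaningful when kind == Kind::Shuffle.
  // Indices 0..3 select from the first operand, 4..7 from the second.
  std::array<int, 4> mask;
};

Decoded decodeInsertPS(uint8_t imm) {
  unsigned zmask = imm & 0xF;          // bits 3:0: elements to zero
  unsigned countD = (imm >> 4) & 0x3;  // bits 5:4: destination element
  unsigned countS = (imm >> 6) & 0x3;  // bits 7:6: source element

  // Every element is zeroed, so the result is a constant zero vector.
  if (zmask == 0xF)
    return {Kind::AllZero, {0, 0, 0, 0}};

  // Partial zero masks are deliberately left alone in this sketch.
  if (zmask != 0)
    return {Kind::Unsupported, {0, 0, 0, 0}};

  // Start from the identity shuffle of the first operand, then replace
  // one lane with the selected element of the second operand.
  std::array<int, 4> mask = {0, 1, 2, 3};
  mask[countD] = 4 + static_cast<int>(countS);
  return {Kind::Shuffle, mask};
}
```

For example, under these assumptions an immediate of 0x10 (source element 0 into destination element 1, no zeroing) gives the mask {0, 4, 2, 3}, while 0x01 is reported as unsupported because only one element is zeroed.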