This is a generalization of the IR fold in D38316 to handle insertion into a non-undef vector. If this looks OK, then we should probably just abandon D38316.
I had to add a target hook to avoid AVX512 horror with vXi1 shuffles. Based on the earlier discussion, I think ARM/AArch64 will want to enable this too, but I'm not sure whether they should limit it to certain types or just return true for everything.
There may be room for improvement in the shuffle lowering here, but I think that would be follow-up work.
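
For reviewers who want the shape of the target hook without opening the diff, here is a rough standalone sketch. Everything in it is hypothetical and for illustration only: `VecType`, `TargetHooks`, `shouldConvertInsertToShuffle`, and `tryFold` are made-up names, not the identifiers in the patch or in LLVM, and the default of returning false is an assumption. It only shows the idea of an opt-in hook that one target can restrict by element type (skipping vXi1) while another turns it on for everything.

```cpp
#include <cassert>

// Hypothetical stand-in for a vector value type (not LLVM's EVT/MVT).
struct VecType {
  unsigned EltBits; // element width in bits, e.g. 1 for vXi1
  unsigned NumElts; // number of elements
};

// Hypothetical hook interface; the real hook name and signature may differ.
struct TargetHooks {
  virtual ~TargetHooks() = default;
  // Assumed default: targets must opt in to the insert-to-shuffle fold.
  virtual bool shouldConvertInsertToShuffle(VecType) const { return false; }
};

// A target that wants the fold except for mask-like vXi1 vectors.
struct X86LikeHooks : TargetHooks {
  bool shouldConvertInsertToShuffle(VecType VT) const override {
    return VT.EltBits != 1;
  }
};

// A target that simply enables the fold for every vector type.
struct EnableAllHooks : TargetHooks {
  bool shouldConvertInsertToShuffle(VecType) const override { return true; }
};

// Stand-in for the combine's guard: the fold is attempted only if the
// target's hook allows it for this type.
bool tryFold(const TargetHooks &TLI, VecType VT) {
  assert(VT.NumElts > 0 && "expected a vector type");
  return TLI.shouldConvertInsertToShuffle(VT);
}
```

The open question for ARM/AArch64 is which of the two override styles fits: a per-type filter like `X86LikeHooks`, or a blanket opt-in like `EnableAllHooks`.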