This enhances the folds from part 1 and part 2 to allow insertion into an arbitrary base vector. To do that, we form a select-shuffle: each result lane is taken from the same lane of one of the two shuffle operands, so no cross-lane movement is allowed.
Example proofs showing the little-endian/big-endian differences:
https://alive2.llvm.org/ce/z/Mqfgt8
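As a rough illustration of why the select-shuffle form is equivalent, here is a small Python model of the little-endian case. It is only a sketch under stated assumptions, not the InstCombine implementation: all function names are hypothetical, poison lanes are modeled as 0, the vector is assumed to be a whole multiple of the scalar width, and the insert index is assumed to be the lane that holds the low bits of a wide element.

```python
def trunc_insert(base, x, index, elt_bits):
    """Reference semantics: insertelement base, (trunc x), index."""
    out = list(base)
    out[index] = x & ((1 << elt_bits) - 1)
    return out


def bitcast_to_lanes(x, scalar_bits, elt_bits):
    """Little-endian bitcast of one wide scalar into narrow lanes (low bits first)."""
    lane_mask = (1 << elt_bits) - 1
    return [(x >> (i * elt_bits)) & lane_mask for i in range(scalar_bits // elt_bits)]


def select_shuffle(a, b, mask):
    """shufflevector a, b, mask where mask[i] is i (lane from a) or i + n (lane from b)."""
    n = len(a)
    assert all(m in (i, i + n) for i, m in enumerate(mask)), "not a select mask"
    return [a[m] if m < n else b[m - n] for m in mask]


def bitcast_shuffle(base, x, index, scalar_bits, elt_bits):
    """Model of: insert wide scalar, bitcast to narrow lanes, select one lane into base."""
    n = len(base)
    ratio = scalar_bits // elt_bits
    # Assumptions of this sketch (the real fold has its own legality checks).
    assert n % ratio == 0 and index % ratio == 0
    wide = [0] * (n // ratio)          # poison modeled as 0
    wide[index // ratio] = x           # insertelement of the untruncated scalar
    lanes = [lane for w in wide for lane in bitcast_to_lanes(w, scalar_bits, elt_bits)]
    mask = [i + n if i == index else i for i in range(n)]  # only lane `index` comes from `lanes`
    return select_shuffle(base, lanes, mask)


base = [10, 11, 12, 13, 14, 15, 16, 17]   # stands in for <8 x i16>
x = 0x1122334455667788                    # stands in for an i64
assert bitcast_shuffle(base, x, 4, 64, 16) == trunc_insert(base, x, 4, 16)
```

On a big-endian target the low bits of the scalar land in a different lane after the bitcast, which is why the proofs above differ by endianness.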
We can create a select-shuffle for all targets because targets are expected to be able to lower select-shuffles reasonably. This transform could be generalized further if it were implemented in a target-specific pass (with a cost/legality model).
The transform can result in more instructions than we started with (in the case where the vector is wider or narrower than the scalar), but I think that's a reasonable trade-off to make the canonicalization more consistent.
This allows removing a pair of instructions from the motivating example in issue #17113, but the result is still not the ideal IR/codegen.