Today, we can generate a vector_shuffle from an IR shuffle where the size of the result is exactly the sum of the sizes of the input vectors. If the output vector is smaller (e.g. a <12 x i8> formed by a shuffle of two <8 x i8> inputs), we emit a sequence of extracts and inserts.
Instead, we can form a larger vector_shuffle and then extract a subvector of the right size, e.g. shuffle the two <8 x i8> inputs into a <16 x i8> and extract a <12 x i8> from it.
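To illustrate the idea at the level of lane values, here is a minimal standalone sketch. It is not the SelectionDAG code from the patch: `Lanes`, `shuffle`, and `shuffleViaWideThenExtract` are hypothetical names, undef lanes are arbitrarily materialized as 0, and the padding of the source operands that the real lowering needs is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a vector value: one int per lane.
using Lanes = std::vector<int>;

constexpr int Undef = -1; // mask entry meaning "don't care"

// Reference semantics of a two-input shuffle: lane i of the result is
// element Mask[i] of the concatenation (A, B). Undef lanes become 0 here,
// which is an arbitrary choice made for this sketch.
Lanes shuffle(const Lanes &A, const Lanes &B, const std::vector<int> &Mask) {
  Lanes Result;
  for (int Idx : Mask) {
    if (Idx == Undef) {
      Result.push_back(0);
      continue;
    }
    const std::size_t I = static_cast<std::size_t>(Idx);
    assert(Idx >= 0 && I < A.size() + B.size());
    Result.push_back(I < A.size() ? A[I] : B[I - A.size()]);
  }
  return Result;
}

// Pad the mask with undef up to A.size() + B.size() lanes, shuffle at that
// full width, then keep only the low lanes (the "extract subvector" step).
Lanes shuffleViaWideThenExtract(const Lanes &A, const Lanes &B,
                                std::vector<int> Mask) {
  const std::size_t NarrowElts = Mask.size();
  const std::size_t WideElts = A.size() + B.size();
  assert(NarrowElts <= WideElts);
  Mask.resize(WideElts, Undef);     // e.g. 12 mask entries -> 16
  Lanes Wide = shuffle(A, B, Mask); // e.g. the <16 x i8> shuffle
  return Lanes(Wide.begin(), Wide.begin() + NarrowElts); // e.g. <12 x i8>
}
```

For two 8-lane inputs and a 12-entry mask, the low 12 lanes of the wide result match a direct 12-lane shuffle, since the padding lanes are undef and are dropped by the extract.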
This solves PR29025.
This depends on D23893: forming a vector instead of a series of inserts/extracts pessimizes a test because of a mess created by legalization, and we need D23893 to clean that up.
Possibly remove this: initialize MappedOps as MappedOps(PaddedMaskNumElts, -1), and then just assign MappedOps[i] = Idx in the readjustment loop.
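A rough sketch of what that suggestion could look like, written as a standalone function. `std::vector` stands in for whatever container the patch uses, `buildMappedOps` is a hypothetical name, and the remapping rule in the loop is an assumption about what the readjustment loop does (shifting indices into the second operand once both operands are padded from SrcNumElts to PaddedMaskNumElts lanes).

```cpp
#include <cassert>
#include <vector>

// Sketch only: the container type and the remapping rule are assumptions,
// not taken from the patch under review.
std::vector<int> buildMappedOps(const std::vector<int> &Mask,
                                unsigned SrcNumElts,
                                unsigned PaddedMaskNumElts) {
  assert(Mask.size() <= PaddedMaskNumElts);
  assert(SrcNumElts <= PaddedMaskNumElts);

  // Start with every lane undef (-1), so the padding lanes past the end of
  // the original mask need no separate fill step.
  std::vector<int> MappedOps(PaddedMaskNumElts, -1);

  for (unsigned i = 0; i != Mask.size(); ++i) {
    int Idx = Mask[i];
    // Assumed readjustment: indices that select from the second operand
    // move up by the amount of padding added to the first operand.
    if (Idx >= static_cast<int>(SrcNumElts))
      Idx += static_cast<int>(PaddedMaskNumElts - SrcNumElts);
    MappedOps[i] = Idx; // assign in place instead of appending
  }
  return MappedOps;
}
```

With this shape, undef entries in the original mask stay at -1, and the trailing lanes are undef by construction.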