A logical extension to D107068 - we don't strictly need to have
a single insertion into an otherwise-identity LHS.
While the general case of iteratively chopping off bits of a mask,
but keeping the shuffle, results in obviously-bad codegen regressions,
the case where we reduce the entire shuffle into a sequence of
subvector insertions, and drop said shuffle, seems somewhat promising.
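
To illustrate the idea only (this is not the code in the patch), here is a
minimal Python sketch of checking whether a two-input shuffle mask can be
fully expressed as subvector insertions into the LHS. The function name, the
(source, src_index, dst_index) record, and the decision to reject undef (-1)
mask elements are all assumptions made for this example.

```python
from typing import List, Optional, Tuple

# Hypothetical record of one insertion: (source vector, source element
# index, destination element index). Not an LLVM data structure.
Insertion = Tuple[str, int, int]


def decompose_into_subvector_insertions(
    mask: List[int], width: int
) -> Optional[List[Insertion]]:
    """Try to express a two-input shuffle as subvector insertions into LHS.

    Mask indices 0..n-1 select from LHS, n..2n-1 select from RHS.
    Returns the list of insertions needed (empty for an identity shuffle),
    or None if some chunk is not an aligned, contiguous subvector.
    Simplifying assumption: undef (-1) elements are rejected.
    """
    n = len(mask)
    if width <= 0 or n % width != 0:
        return None

    insertions: List[Insertion] = []
    for dst in range(0, n, width):
        chunk = mask[dst:dst + width]
        first = chunk[0]
        # The chunk must be a run of consecutive, in-range indices.
        if first < 0 or first + width > 2 * n:
            return None
        if any(m != first + j for j, m in enumerate(chunk)):
            return None
        source, src = ("lhs", first) if first < n else ("rhs", first - n)
        # The run must start on a subvector boundary of its source; since
        # width divides n, this also prevents straddling LHS and RHS.
        if src % width != 0:
            return None
        if source == "lhs" and src == dst:
            continue  # Already in place in the base vector; nothing to insert.
        insertions.append((source, src, dst))
    return insertions
```

For example, with 8 elements and 2-element subvectors, the mask
[0, 1, 8, 9, 4, 5, 12, 13] reduces to two insertions from the RHS, whereas an
interleaving mask such as [0, 8, 1, 9, 2, 10, 3, 11] is rejected and would
have to stay a shuffle.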