A shuffle with an insert subvector mask is functionally equivalent to:
(insert_subvector v0, (extract_subvector v1, len), index)
We can emulate this by doing a vslideup of v1 to the right index, choosing VL carefully so that we don't overwrite any more destination elements than we have to.
This avoids the need for a select with a mask.
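
To illustrate the equivalence, here is a minimal Python model, not the actual lowering code. It assumes the subvector is taken from the start of v1, that the slide offset equals the insert index, that VL is index + len, and that the tail-undisturbed policy is in effect. The helper names `insert_subvector` and `vslideup_tu` are hypothetical and exist only for this sketch.

```python
def insert_subvector(v0, v1, length, index):
    """Reference semantics: put the first `length` elements of v1 into v0 at `index`."""
    out = list(v0)
    out[index:index + length] = v1[:length]
    return out


def vslideup_tu(dest, src, offset, vl):
    """Simplified model of an unmasked vslideup with a tail-undisturbed policy.

    Elements below `offset` and at or above `vl` keep their value from `dest`.
    """
    out = list(dest)
    for i in range(offset, vl):
        out[i] = src[i - offset]
    return out


v0 = [0, 1, 2, 3, 4, 5, 6, 7]
v1 = [10, 11, 12, 13, 14, 15, 16, 17]
index, length = 2, 3

# VL = index + length, so only the inserted elements are written.
result = vslideup_tu(v0, v1, offset=index, vl=index + length)
expected = insert_subvector(v0, v1, length, index)
assert result == expected
```

In this model, limiting VL to index + len is what leaves the trailing elements of v0 intact, which is why no mask or select is required.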
white space on the ", tu," bit