This is pretty straightforward in its basic form. I did need to move the slideup matching earlier, but that looks generally profitable on its own.
As follow-ups, I plan to explore the v(f)slide1down variants and see what I can do to canonicalize the shuffle-then-insert pattern (see the _inverse tests at the end of the vslide1up.ll test).
I do need to figure out why these aren't matching. My guess is that we're canonicalizing to a bitcast somewhere; I need to track that down. The delta is an improvement even without the vslide1up match, so I think this can be a separate change.