LGTM. See inline for some very minor possible improvements.
Just curious: why is this v_mov needed? Can't v_perm read this value directly from s0?
This would work out slightly better using a non-AMDGPU-specific lowering to something like x >> 8 | (x & 0xff) << 8.
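For reference, a minimal sketch of that generic shift/mask expansion on a 16-bit value, written in Python purely for illustration. The name `bswap16` is hypothetical and is not an LLVM helper.

```python
def bswap16(x: int) -> int:
    """Swap the two bytes of a 16-bit value using shift, mask, and or."""
    x &= 0xFFFF  # treat the input as an i16
    # High byte moves down, low byte moves up.
    return (x >> 8) | ((x & 0xFF) << 8)


print(hex(bswap16(0x1234)))  # 0x3412
```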
Could do a single v_perm with mask 03020001 to avoid the shift. (Or mask 0C0C0001 if you really want to guarantee the upper bits get zeroed.)
If you care about v2i16 this whole sequence could be done with a single v_perm with mask 02030001.
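To sanity-check the masks suggested above, here is a rough Python model of a byte permute. It is a simplification and not the full v_perm_b32 semantics: it assumes selector values 0-3 pick the corresponding byte of a single source and that selector 0x0C produces a zero byte; selectors for the other source operand, sign fill, and 0xFF are not modeled. The name `perm_b32` is hypothetical.

```python
def perm_b32(src: int, mask: int) -> int:
    """Simplified byte permute: each mask byte selects one result byte.

    Assumption: selectors 0-3 pick that byte of `src`; 0x0C yields 0x00.
    Any other selector is outside this model and raises.
    """
    result = 0
    for i in range(4):
        sel = (mask >> (8 * i)) & 0xFF
        if sel <= 3:
            byte = (src >> (8 * sel)) & 0xFF
        elif sel == 0x0C:
            byte = 0x00
        else:
            raise ValueError(f"selector {sel:#x} not modeled")
        result |= byte << (8 * i)
    return result


x = 0xAABBCCDD
print(hex(perm_b32(x, 0x03020001)))  # 0xaabbddcc: low half swapped, upper bits kept
print(hex(perm_b32(x, 0x0C0C0001)))  # 0xddcc: low half swapped, upper bits zeroed
print(hex(perm_b32(x, 0x02030001)))  # 0xbbaaddcc: both halves swapped (v2i16)
```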
This would violate the constant bus restriction. It could be folded on gfx10, where the limit is 2. However, this is only a problem because the constant is in an SGPR in the first place; if we materialized the mask in a VGPR, we could fold it. We don't try to optimize this case yet.