vXi1 vectors are legalized by promoting, but vXi8/i16/i32 vectors are legalized by widening. As a result, these extends become a truncate plus a sign_extend_inreg/zero_extend_inreg. This is worse than the costs we were getting from the default TTI implementation.
We could probably lower these costs by improving the codegen to do the sign_extend_inreg/zero_extend_inreg before the truncate. I think that would enable the use of packss/packus operations to do the truncation, so we wouldn't need to insert an AND to make packus usable.
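
To illustrate why the ordering matters, here is a rough scalar model of a single 16-bit to 8-bit lane. It is only a sketch, not LLVM code: the helper names (packus_lane, packss_lane, trunc_lane, zext_inreg_i1, sext_inreg_i1, verify_all_lanes) are made up for this example, and the in-reg extends are assumed to be from i1. Because the packs saturate instead of truncating, an arbitrary lane needs an AND before packus, but a lane that has already been extended in-reg is always in range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar model of one PACKUSWB lane:
// signed 16-bit input, saturated to the unsigned 8-bit range.
inline uint8_t packus_lane(int16_t v) {
    if (v < 0) return 0;
    if (v > 255) return 255;
    return static_cast<uint8_t>(v);
}

// Hypothetical scalar model of one PACKSSWB lane:
// signed 16-bit input, saturated to the signed 8-bit range.
inline int8_t packss_lane(int16_t v) {
    if (v < -128) return -128;
    if (v > 127) return 127;
    return static_cast<int8_t>(v);
}

// Plain truncation: keep the low 8 bits.
inline uint8_t trunc_lane(uint16_t v) { return static_cast<uint8_t>(v); }

// zero_extend_inreg from i1 (assumed): keep only bit 0, giving 0 or 1.
inline int16_t zext_inreg_i1(int16_t v) { return static_cast<int16_t>(v & 1); }

// sign_extend_inreg from i1 (assumed): replicate bit 0, giving 0 or -1.
inline int16_t sext_inreg_i1(int16_t v) { return static_cast<int16_t>(-(v & 1)); }

// Check every 16-bit lane value: once the in-reg extend has been done,
// the saturating packs produce the same result as a plain truncate.
inline void verify_all_lanes() {
    for (int i = -32768; i <= 32767; ++i) {
        const int16_t v = static_cast<int16_t>(i);
        const int16_t z = zext_inreg_i1(v);
        const int16_t s = sext_inreg_i1(v);
        assert(packus_lane(z) == trunc_lane(static_cast<uint16_t>(z)));
        assert(static_cast<uint8_t>(packss_lane(s)) ==
               trunc_lane(static_cast<uint16_t>(s)));
    }
}
```

In this model a lane such as 0x1234 saturates under packus rather than truncating to 0x34, which is why the AND is needed when the truncate comes first. After the in-reg extend the lane is already 0/1 (or 0/-1 for the signed case), so no extra masking is required.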