When eliding argument copies, the memory layout of a plain store of the type must match the layout of the argument lowering on the stack. For multi-part argument lowerings, this is not necessarily the case.
The code already tried to prevent this optimization for "scalarized and extended" vectors, but the check for "extended" was incomplete. While a scalarized vector of i32s stores i32 values on the stack, each value is stored in its own 8-byte stack slot (on x86_64), so the elements are effectively padded.
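To make the mismatch concrete, here is a minimal sketch of the two layouts. It is a simplified model, not the actual lowering code: the function names are hypothetical, and it assumes every scalarized element is passed on the stack in its own fixed-size slot.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: byte offsets of the elements when the whole vector
// is written by one plain store, i.e. elements packed back to back.
std::vector<std::size_t> plainStoreOffsets(std::size_t numElts,
                                           std::size_t eltBytes) {
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i < numElts; ++i)
    offsets.push_back(i * eltBytes);
  return offsets;
}

// Hypothetical helper: byte offsets of the elements when the vector is
// scalarized and each element occupies its own stack slot of slotBytes
// (assumption: all parts are passed on the stack, none in registers).
std::vector<std::size_t> scalarizedArgOffsets(std::size_t numElts,
                                              std::size_t slotBytes) {
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i < numElts; ++i)
    offsets.push_back(i * slotBytes);
  return offsets;
}

// The copy can only be elided if both layouts place every element at the
// same offset.
bool layoutsMatch(std::size_t numElts, std::size_t eltBytes,
                  std::size_t slotBytes) {
  return plainStoreOffsets(numElts, eltBytes) ==
         scalarizedArgOffsets(numElts, slotBytes);
}
```

Under these assumptions, four i32 elements land at offsets 0, 4, 8, 12 for a plain store but at 0, 8, 16, 24 in 8-byte slots, so `layoutsMatch(4, 4, 8)` is false even though no element was extended.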
Rather than trying to add more special cases to handle this (which is not straightforward), I'm going in the other direction and excluding scalarized vectors from this optimization entirely. This seems like a rare case that is not worth the hassle -- the complete lack of test coverage is not reassuring either.
Is it out of bounds?