The motivating case is seen in "splat4_v8f32_load_store" and based on code in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024
(I haven't stepped through the v8i32 sibling test yet to see why that diverged.)
There are other potential improvements visible like allowing scalarization or vector narrowing. My reading of AVX512 is still weak, so please have a close look at those diffs to make sure that's good.