This is more of a bug report than an earnest patch...
We've found a couple of inefficiently lowered shuffles when targeting neoverse-v1 with vector-length-specific (VLS) SVE code generation. This patch covers one of them:
%x = shufflevector <2 x double> %v, <2 x double> poison, <4 x i32> <i32 1, i32 1, i32 0, i32 0>
It could be lowered in a number of ways, but I chose:
zip1 z0.d, z0.d, z0.d
ext z0.b, z0.b, z0.b, #16
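As a sanity check on the sequence above, here is a rough lane-level model in Python. It is only a sketch: it assumes 256-bit SVE registers (four 64-bit lanes, as on neoverse-v1), models `ext` at whole-lane granularity, and the helper names (`zip1_d`, `ext_b`, `lower_shuffle`, `reference_shuffle`) are made up for illustration and are not anything in LLVM.

```python
# Simplified model of the two SVE instructions, assuming a 256-bit
# vector length (four 64-bit lanes per Z register). Illustrative only.
LANES = 4        # assumed: 256 bits / 64-bit doubles
LANE_BYTES = 8


def zip1_d(a, b):
    """ZIP1 on .d lanes: interleave the low halves of a and b."""
    out = []
    for i in range(LANES // 2):
        out += [a[i], b[i]]
    return out


def ext_b(a, b, imm_bytes):
    """EXT on .b lanes: take one vector's worth of bytes from a:b,
    starting at byte imm_bytes (a is the low part).
    Modelled per 64-bit lane, so imm_bytes must be a multiple of 8."""
    assert imm_bytes % LANE_BYTES == 0
    start = imm_bytes // LANE_BYTES
    return (a + b)[start:start + LANES]


def lower_shuffle(v):
    """Apply zip1 + ext to a <2 x double> held in the low lanes of z0."""
    z0 = list(v) + ["undef"] * (LANES - len(v))  # upper lanes are don't-care
    z0 = zip1_d(z0, z0)      # zip1 z0.d, z0.d, z0.d
    z0 = ext_b(z0, z0, 16)   # ext  z0.b, z0.b, z0.b, #16
    return z0


def reference_shuffle(v, mask):
    """shufflevector with a poison second operand: pick v[i] per mask."""
    return [v[i] for i in mask]


if __name__ == "__main__":
    v = ["v0", "v1"]
    print(lower_shuffle(v))
    print(reference_shuffle(v, [1, 1, 0, 0]))
```

In this model the `zip1` first duplicates each input element in place, and the `ext` by 16 bytes then rotates the register by two lanes, which puts the duplicated high element first.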
The new lowering shows an 11% performance boost on 538.namd with our out-of-tree compiler.