Continuing from D149749, here is another neoverse-v1 VLS shuffle whose lowering can be improved:
%x = shufflevector <2 x double> %v, <2 x double> poison, <4 x i32> <i32 1, i32 0, i32 1, i32 0>
It could be lowered in a number of ways, but I chose:
zip1 z0.d, z0.d, z0.d
uzp1 z0.d, z0.d, z0.d
ext z0.b, z0.b, z0.b, #8
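
As a sanity check, the sequence above can be modelled lane by lane. The following is a minimal Python sketch, not LLVM code: it assumes a 256-bit vector length (four 64-bit lanes), and the names `zip1`, `uzp1`, `ext` and `lower_shuffle` are illustrative helpers that only model the lane movement of the corresponding instructions.

```python
# Assumption: 256-bit SVE registers, i.e. four 64-bit lanes per register.
LANES = 4


def zip1(zn, zm):
    """Interleave the lanes of the low halves of zn and zm."""
    half = len(zn) // 2
    out = []
    for a, b in zip(zn[:half], zm[:half]):
        out += [a, b]
    return out


def uzp1(zn, zm):
    """Concatenate the even-numbered lanes of zn, then those of zm."""
    return zn[0::2] + zm[0::2]


def ext(zdn, zm, lane_offset):
    """Extract one register's worth of lanes from zdn:zm, starting at lane_offset."""
    return (zdn + zm)[lane_offset:lane_offset + len(zdn)]


def lower_shuffle(v):
    """Apply zip1, uzp1, ext to a 2-lane input held in the low half of z0."""
    z0 = list(v) + [None] * (LANES - len(v))  # upper lanes are undefined
    z0 = zip1(z0, z0)   # [v0, v0, v1, v1]
    z0 = uzp1(z0, z0)   # [v0, v1, v0, v1]
    z0 = ext(z0, z0, 1)  # #8 bytes is one 64-bit lane: [v1, v0, v1, v0]
    return z0


if __name__ == "__main__":
    print(lower_shuffle(["v0", "v1"]))
```

Under these assumptions the result has the lane order 1, 0, 1, 0, matching the shuffle mask.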
The new lowering shows a 9% performance boost on 538.namd with our out-of-tree compiler.
Note that this sequence takes 6 cycles, compared with 4 cycles for a NEON sequence. This is unfortunate, but I could not find a faster SVE sequence for this shuffle. [Maybe a better solution would be to fall back to NEON for shuffles of this form?]