From what I can tell, we are artificially restricting the pass: it bails out if it would vectorize to a non-power-of-2 number of elements. That is, everything below the changed part of this patch already works as intended for calculating costs and tree elements. As a safeguard, I am proposing to add a debug flag for experimentation, in case this change reveals regressions.
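For illustration, here is a reduced sketch (names invented; not a test from this patch) of the kind of pattern the bail-out affects: three adjacent scalar loads that could become a single <3 x float> load, except that 3 is not a power of 2:

; Hypothetical reduced example (not from this patch): three adjacent
; scalar loads feeding inserts. These could become one <3 x float>
; load, but 3 is not a power of 2, so the pass currently bails out.
define <4 x float> @load3(float* %p) {
  %p1 = getelementptr inbounds float, float* %p, i64 1
  %p2 = getelementptr inbounds float, float* %p, i64 2
  %x0 = load float, float* %p, align 4
  %x1 = load float, float* %p1, align 4
  %x2 = load float, float* %p2, align 4
  %v0 = insertelement <4 x float> undef, float %x0, i32 0
  %v1 = insertelement <4 x float> %v0, float %x1, i32 1
  %v2 = insertelement <4 x float> %v1, float %x2, i32 2
  ret <4 x float> %v2
}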
A test similar to the one in this diff:
rL369255
...shows that we can already generate a non-standard vector size (<2 x float>) and shuffle.
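For reference, a reduced sketch of that kind of output (invented names; not the exact test from rL369255):

; Hypothetical reduced example (not the exact test from rL369255):
; a non-standard vector size (<2 x float>) that we can already
; generate, widened to <4 x float> by a shuffle.
define <4 x float> @widen2(<2 x float>* %p) {
  %v2 = load <2 x float>, <2 x float>* %p, align 8
  %v4 = shufflevector <2 x float> %v2, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 1, i32 1>
  ret <4 x float> %v4
}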
The motivating case is from PR16739:
https://bugs.llvm.org/show_bug.cgi?id=16739
...and after instcombine, we end up with:
define <4 x float> @PR16739_byref(<4 x float>* nocapture readonly dereferenceable(16) %x) {
  %1 = bitcast <4 x float>* %x to <3 x float>*
  %2 = load <3 x float>, <3 x float>* %1, align 4
  %i3 = shufflevector <3 x float> %2, <3 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 2>
  ret <4 x float> %i3
}
And because we know the pointer is dereferenceable up to 16 bytes, the backend generates optimal x86 code:
movups (%rdi), %xmm0
shufps $164, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,2,2]
This does not appear to interact with proposal D57779, but maybe we are just lacking the regression tests to show it?