The loop that handles illegal types queried the cost of an extract_subvector shuffle, which I think always returns 0 for an illegal type on X86. That's not the correct shuffle anyway. For pairwise reductions we really need 2 two-source permutes to extract the even and odd elements. For non-pairwise reductions we don't need any shuffle, so I've removed it entirely.
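As a rough illustration of why the split step needs two two-source permutes in the pairwise case and none otherwise, here is a minimal standalone sketch. It is not LLVM code: `Vec`, `permuteStride2`, `pairwiseSplitStep`, and `plainSplitStep` are hypothetical names, and the 4-lane legal width is an assumption chosen only to keep the example small.

```cpp
#include <array>
#include <cstddef>

// Assumption: a "legal" vector has N lanes, and the illegal type has 2*N
// lanes that have already been split into the halves Lo and Hi.
constexpr std::size_t N = 4;
using Vec = std::array<int, N>;

// Hypothetical two-source permute: lane I of the result takes element
// Start + 2*I from the concatenation Lo:Hi.
Vec permuteStride2(const Vec &Lo, const Vec &Hi, std::size_t Start) {
  Vec R{};
  for (std::size_t I = 0; I < N; ++I) {
    std::size_t Src = Start + 2 * I;
    R[I] = Src < N ? Lo[Src] : Hi[Src - N];
  }
  return R;
}

// Lane-wise add of two legal vectors.
Vec add(const Vec &A, const Vec &B) {
  Vec R{};
  for (std::size_t I = 0; I < N; ++I)
    R[I] = A[I] + B[I];
  return R;
}

// Pairwise split step: 2 two-source permutes (even, odd) plus 1 add.
Vec pairwiseSplitStep(const Vec &Lo, const Vec &Hi) {
  Vec Even = permuteStride2(Lo, Hi, 0);
  Vec Odd = permuteStride2(Lo, Hi, 1);
  return add(Even, Odd);
}

// Non-pairwise split step: the halves are added directly, no shuffle.
Vec plainSplitStep(const Vec &Lo, const Vec &Hi) { return add(Lo, Hi); }
```

Both functions produce a legal-width vector whose lanes sum to the same total. Only the pairwise form has to rearrange lanes first.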
We were also counting 2 shuffles on the very last reduction step, where we add element 1 to element 0. But element 0 is already in place, so we only need to move element 1.
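The effect on the shuffle count can be sketched with a small standalone model. This is not the actual cost-model code: the function names are hypothetical, and it assumes a power-of-two element count with 2 shuffles charged for every pairwise step before the last one.

```cpp
// Hypothetical model of the previous accounting: every pairwise
// reduction step is charged 2 shuffles, including the final one.
unsigned oldPairwiseShuffleCount(unsigned NumElts) {
  unsigned Shuffles = 0;
  for (unsigned Live = NumElts; Live > 1; Live /= 2)
    Shuffles += 2;
  return Shuffles;
}

// Hypothetical model of the corrected accounting: the final step
// (2 live elements) only moves element 1, since element 0 is already
// in place, so it is charged 1 shuffle.
unsigned newPairwiseShuffleCount(unsigned NumElts) {
  unsigned Shuffles = 0;
  for (unsigned Live = NumElts; Live > 1; Live /= 2)
    Shuffles += (Live == 2) ? 1 : 2;
  return Shuffles;
}
```

Under these assumptions a 4-element reduction goes from 4 shuffles to 3, and an 8-element reduction from 6 to 5.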