The PPC backend currently enables shuffling for only a very limited set of vector types. For a vector with more than 2 elements, the default behavior is to move the data through memory. This usually triggers store forwarding, which is extremely slow on POWER.
This simple patch explicitly enables shuffling of vectors when VSX is available. For the test case in the bug entry, performance improves by 6.6x on POWER8, and the number of instructions drops from 756 to 406.
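For context, a minimal sketch of what a two-input vector shuffle computes, modeled in plain C++. This only illustrates the semantics of the operation; it is not the backend code from the patch, and the `shuffle` helper and its signature are hypothetical names chosen for this example.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper modeling a two-input shuffle on N-element vectors.
// Mask values 0..N-1 select an element of `a`; values N..2N-1 select an
// element of `b`. Mask values are assumed to be in range.
template <typename T, std::size_t N>
std::array<T, N> shuffle(const std::array<T, N>& a,
                         const std::array<T, N>& b,
                         const std::array<std::size_t, N>& mask) {
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        const std::size_t idx = mask[i];
        out[i] = idx < N ? a[idx] : b[idx - N];
    }
    return out;
}
```

When the target can lower such an operation to register-to-register permute instructions, each output element no longer has to be stored to memory and reloaded, which is the round trip the description above identifies as slow.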
I agree that this makes sense. I was going to ask that you combine this with the v2i64 logic above and add appropriate checks for VSX data types. However, since this shouldn't have any effect on scalarized types, it is not clear that the type checks here are actually useful. Maybe just simplify this to read: