As mentioned in D38318 and D40865, modern Intel processors prefer combining multiple shuffles into a single variable-mask shuffle (PSHUFB/VPERMPS etc.) over multi-stage 'fixed' shuffles, which put more pressure on Port 5. The cost is an extra shuffle mask load.
As discussed, this patch provides a FeatureFastVariableShuffle target flag for Haswell+ CPUs that prefers combining 2 or more fixed shuffles into a single variable shuffle (without the flag, the default threshold is 3 shuffles).
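The threshold choice described above can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: the function names `variableShuffleDepth` and `allowVariableMask` and the `Depth` parameter are hypothetical, and the sketch assumes the decision depends only on the number of fixed shuffles being combined and on whether the target flag is set.

```cpp
// Hypothetical sketch: all names here are illustrative assumptions and are
// not taken from the patch itself.

// Minimum number of fixed shuffles that must be combined before a single
// variable-mask shuffle is preferred: 2 with the fast-variable-shuffle
// flag set, 3 (the default) otherwise.
constexpr int variableShuffleDepth(bool HasFastVariableShuffle) {
  return HasFastVariableShuffle ? 2 : 3;
}

// Returns true if a chain of `Depth` fixed shuffles should be replaced by
// one variable shuffle (at the expense of an extra shuffle mask load).
constexpr bool allowVariableMask(int Depth, bool HasFastVariableShuffle) {
  return Depth >= variableShuffleDepth(HasFastVariableShuffle);
}

// With the flag, two fixed shuffles are already enough to combine.
static_assert(allowVariableMask(2, true), "flag lowers the threshold to 2");
// Without it, two fixed shuffles are left alone and three are combined.
static_assert(!allowVariableMask(2, false), "default threshold is 3");
static_assert(allowVariableMask(3, false), "default threshold is 3");
```

Keeping the threshold in one helper keyed on the flag would make it straightforward to swap in a value derived from schedule data later.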
If everybody is happy with this approach, I will refactor some of the vector-shuffle-* tests to run with -fast-variable-shuffle enabled so that the generated shuffles can be compared.
The long-term aim is to drive more of this from schedule data (probably via the MC), but we're not close to being ready for that yet.