Currently, X86 backend only has a global one-size-fits-all FeatureFastVariableShuffle feature,
which controls profitability of both the cross-lane and per-lane variable shuffles.
I guess, this has been fine so far.
But at least on AMD Zen 3, while per-line variable shuffles (e.g. VPSHUFB)
are as fast as as shuffles with fixed/immediate mask,
while lane-crossing shuffles, e.g. VPERMPS is performing worse.
So to get the benefits of variable-mask shuffles, but not the drawbacks of lane-crossing shuffles,
as suggested by @RKSimon, split the feature flag into two.
Nit, are pre-lane shuffles always fast if the target has fast cross-lane shuffles? In this mean, we can keep FeatureFastVariableShuffle and make it implicate FeatureFastVariablePerLaneShuffle. So that we can reduce the changes?