[X86] Split FeatureFastVariableShuffle tuning into Lane-Crossing and Per-Lane…

Authored by lebedev.ri on Tue, Jun 1, 12:39 AM.

Description

[X86] Split FeatureFastVariableShuffle tuning into Lane-Crossing and Per-Lane variants

Currently, the X86 backend has only a single, one-size-fits-all FeatureFastVariableShuffle feature,
which controls the profitability of both lane-crossing and per-lane variable shuffles.
I guess this has been fine so far.

But at least on AMD Zen 3, per-lane variable shuffles (e.g. VPSHUFB)
are as fast as shuffles with a fixed/immediate mask,
while lane-crossing variable shuffles (e.g. VPERMPS) perform worse.
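
For readers unfamiliar with the distinction, here is a minimal scalar sketch of what the two kinds of 256-bit variable shuffle compute. It only models the documented instruction semantics; the function names are hypothetical and are not part of this patch or of LLVM.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of 256-bit VPSHUFB (per-lane): each 16-byte lane is shuffled
// independently, so a result byte can only come from its own lane.
// (Function name is illustrative only.)
std::array<uint8_t, 32> perLaneByteShuffle(const std::array<uint8_t, 32> &src,
                                           const std::array<uint8_t, 32> &mask) {
  std::array<uint8_t, 32> dst{};
  for (unsigned i = 0; i != 32; ++i) {
    unsigned laneBase = (i / 16) * 16; // 0 for the low lane, 16 for the high lane
    // A set top bit in the mask byte zeroes the result byte; otherwise the
    // low 4 bits select a byte within the same lane.
    dst[i] = (mask[i] & 0x80) ? 0 : src[laneBase + (mask[i] & 0x0F)];
  }
  return dst;
}

// Scalar model of 256-bit VPERMPS (lane-crossing): each result element may be
// taken from any of the 8 source elements, regardless of lane.
// (Function name is illustrative only.)
std::array<float, 8> laneCrossingShuffle(const std::array<float, 8> &src,
                                         const std::array<uint32_t, 8> &idx) {
  std::array<float, 8> dst{};
  for (unsigned i = 0; i != 8; ++i)
    dst[i] = src[idx[i] & 7]; // only the low 3 bits of each index are used
  return dst;
}
```

In the per-lane model, data never moves between the two 128-bit halves, which is the property that lets the two cases be tuned separately.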

So, to get the benefits of per-lane variable-mask shuffles without the drawbacks of lane-crossing ones,
split the feature flag into two, as suggested by @RKSimon.

Differential Revision: https://reviews.llvm.org/D103274