This is an archive of the discontinued LLVM Phabricator instance.

[X86] Split FeatureFastVariableShuffle tuning into Lane-Crossing and Per-Lane variants
ClosedPublic

Authored by lebedev.ri on May 27 2021, 11:25 AM.

Details

Summary

Currently, X86 backend only has a global one-size-fits-all FeatureFastVariableShuffle feature,
which controls profitability of both the cross-lane and per-lane variable shuffles.
I guess, this has been fine so far.

But at least on AMD Zen 3, while per-line variable shuffles (e.g. VPSHUFB)
are as fast as as shuffles with fixed/immediate mask,
while lane-crossing shuffles, e.g. VPERMPS is performing worse.

So to get the benefits of variable-mask shuffles, but not the drawbacks of lane-crossing shuffles,
as suggested by @RKSimon, split the feature flag into two.

Diff Detail

Event Timeline

lebedev.ri created this revision.May 27 2021, 11:25 AM
lebedev.ri requested review of this revision.May 27 2021, 11:25 AM
lebedev.ri added a reviewer: craig.topper.
lebedev.ri added inline comments.May 27 2021, 11:31 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
1

*Please* double-check that i correctly deduced where we meant per-lane vs. cross-lane shuffles.

lebedev.ri added inline comments.May 27 2021, 2:30 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
36097

This one is weird. I'm not sure why fast-ness of variable shuffles matters here.

craig.topper added inline comments.May 27 2021, 2:48 PM
llvm/lib/Target/X86/X86ISelLowering.cpp
36097

You mean because it's just AND/FAND not a shuffle? Fastness was added to AllowVariableMask later. Initially it was just the depth check. Probably guarding the constant pool?

lebedev.ri marked an inline comment as not done.May 27 2021, 2:59 PM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
36097

You mean because it's just AND/FAND not a shuffle?

Yes.

Fastness was added to AllowVariableMask later. Initially it was just the depth check. Probably guarding the constant pool?

Yes, i think so.

RKSimon added inline comments.May 28 2021, 8:40 AM
llvm/lib/Target/X86/X86.td
1116

If possible I'd make the znver3 change as a followup so its separate from this refactor.

llvm/lib/Target/X86/X86ISelLowering.cpp
36023

We never decided whether to always allow per/cross lane shuffles for avx512 capable targets as they all have decent variable shuffle units (the only case that is bad seems to be KNL's PSHUFB, which we've never cared strongly about.) - which would allow us to clean up a lot of the tests below.

36097

Yeah its a legacy thing, and we don't have any good way to gauge the impact of vector constant masks in isel, so we're just a bit cautious :(

lebedev.ri marked 3 inline comments as done.May 28 2021, 9:07 AM

@craig.topper @RKSimon thank you for taking a look!

llvm/lib/Target/X86/X86.td
1116

Yep, i'll be committing this separately.
It just made sense to me to submit the review in bulk, to show motivation
(because i'm not adding test coverage for the +fast-variable-crosslane-shuffle standalone,
because nothing currently would make use of that.)

llvm/lib/Target/X86/X86ISelLowering.cpp
36097

Though, if i change the guard to

36097

Yep. I've poked at this, and i'm not sure if/how we could lift it,
all i tried seemed to make things not better. But then maybe
things are already bad in-the-wild, and we just don't know that because of tests..

So i'm personally mostly fine with the AllowVariablePerLaneMask guard here as it is now.

lebedev.ri marked 2 inline comments as done.May 28 2021, 9:08 AM
lebedev.ri added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
36097

Though, if i change the guard to

disregard, forgot to delete

Matt added a subscriber: Matt.May 28 2021, 1:02 PM
lebedev.ri retitled this revision from [X86] AMD Zen 3 has fast per-lane variable shuffles to [X86] Split FeatureFastVariableShuffle tuning into Lane-Crossing and Per-Lane variants.

Drop the Znver3 change itself.
Please let me know if there was some other review feedback
that i was supposed to address, i'm not sure if this is good to go?

I'm happy with this but I do think we should always enable variable shuffles (both types) on all AVX512 targets as well (and maybe land that first to simplify this patch) - @pengfei @craig.topper what do you think?

I'm happy with this but I do think we should always enable variable shuffles (both types) on all AVX512 targets as well (and maybe land that first to simplify this patch) - @pengfei @craig.topper what do you think?

The patch seems NFC for the existing targets to me, though I don't have a deep cognition on the shuffles' cost. No objections from me.

llvm/lib/Target/X86/X86.td
326–334

Nit, are pre-lane shuffles always fast if the target has fast cross-lane shuffles? In this mean, we can keep FeatureFastVariableShuffle and make it implicate FeatureFastVariablePerLaneShuffle. So that we can reduce the changes?

I'm happy with this but I do think we should always enable variable shuffles (both types) on all AVX512 targets as well (and maybe land that first to simplify this patch) - @pengfei @craig.topper what do you think?

I'm not sure how that would simplify this patch?

The patch seems NFC for the existing targets to me, though I don't have a deep cognition on the shuffles' cost. No objections from me.

llvm/lib/Target/X86/X86.td
326–334

Assumptions like this is pretty much why i'm having to do this change in the first place...
Could you please specify, *what* specific changes would be simplified by that?
I'm not really seeing it.

pengfei added inline comments.May 30 2021, 5:11 PM
llvm/lib/Target/X86/X86.td
326–334

I meant if we define it like

def FeatureFastVariableShuffle
    : SubtargetFeature<"fast-variable-shuffle",
                       "HasFastVariableShuffle",
                       "true", "Shuffles with variable masks are fast",
                       [FeatureFastVariablePerLaneShuffle]>;

then we can keep most use of FeatureFastVariableShuffle unchanged.
Anyway, it is just nitpicking.

Thank you for taking a look!

llvm/lib/Target/X86/X86.td
326–334

I think the current status is more consistent and perhaps more future-prof/testable,
but i don't think i care enough to argue. I just want to make forward progress with the least latency :)

@pengfei @craig.topper @RKSimon does anyone feel strongly regarding the avx512 question and the question about not having fast-variable-crosslane-shuffle?

@pengfei @craig.topper @RKSimon does anyone feel strongly regarding the avx512 question and the question about not having fast-variable-crosslane-shuffle?

No questions from me, thanks.

Well, okay then. The changes here are pretty mechanical,
and are NFC for all the existing targets.
Thanks everyone!

This revision was not accepted when it landed; it landed in state Needs Review.Jun 1 2021, 12:52 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.