Page MenuHomePhabricator

[X86][SSE] Match PSHUFLW/PSHUFHW + PSHUFD vXi16 shuffle patterns (PR34686)
ClosedPublic

Authored by RKSimon on Sep 27 2017, 8:43 AM.

Details

Summary

As noted in PR34686, we are relying on a PSHUFD+PSHUFLW+PSHUFHW shuffle chain for most general vXi16 unary shuffles.

This patch checks for simpler PSHUFLW+PSHUFD and PSHUFHW+PSHUFD cases beforehand, building on some existing code that just handled splat shuffles.

By doing so we also prevent premature use of PSHUFB shuffles which can be slower and require the creation/loading of constant shuffle masks.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Sep 27 2017, 8:43 AM
zvi added inline comments.Sep 27 2017, 9:47 AM
test/CodeGen/X86/vector-shuffle-128-v8.ll
1976

Looks like AVX2, AVX512 regressed. Any idea what happened?

1997

Here too

RKSimon added inline comments.Sep 27 2017, 9:54 AM
test/CodeGen/X86/vector-shuffle-128-v8.ll
1976

We went under the 3-op threshold for combining unary shuffles to PSHUFB (where before it was the PSHUFD+PSHUFLW+PSHUFHW code from SSE2). Despite being 2 ops, this is much smaller in codesize due to not requiring a constant pool entry. It also makes folding easier.

delena added inline comments.Sep 27 2017, 10:49 AM
test/CodeGen/X86/vector-shuffle-128-v8.ll
1976

Loading a constant form memory may be done outside the loop. And two shuffles instead of one increase shuffle port pressure.
I think that the original "pshufb" is better in this case.

RKSimon added inline comments.Sep 27 2017, 12:58 PM
test/CodeGen/X86/vector-shuffle-128-v8.ll
1976

That's true after Haswell/Zen, but not for any older SSSE3+ capable targets. We have hard coded depth controls in combineX86ShuffleChain, we've been putting off changing this as we'd ideally drive this by the scheduler.

RKSimon added inline comments.Sep 27 2017, 1:44 PM
test/CodeGen/X86/vector-shuffle-128-v8.ll
1976

I've committed rL314337 which could be used to permit earlier combining of shuffles to variable masks such as PSHUFB. Ideally though this would be done at a later stage where we have more scheduler details (MC, register pressure etc.).

@delena @zvi What do you want to do with this. IMO we shouldn't be prematurely combining to variable mask shuffles, and this should be performed later as a scheduler based decision. But that will involve a lot of work that I don't think we're ready for (D26855 tried to move some other code to the MC and hit a lot of issues).

What we could do is add a FeatureFastVariableShuffle feature flag to Haswell and later Intel CPUs and perform the decision in combineX86ShuffleChain off that?

delena edited edge metadata.Oct 11 2017, 2:53 AM

@delena @zvi What do you want to do with this. IMO we shouldn't be prematurely combining to variable mask shuffles, and this should be performed later as a scheduler based decision. But that will involve a lot of work that I don't think we're ready for (D26855 tried to move some other code to the MC and hit a lot of issues).

What we could do is add a FeatureFastVariableShuffle feature flag to Haswell and later Intel CPUs and perform the decision in combineX86ShuffleChain off that?

May be just add something like this:
bool hasVariableShuffle(MVT Ty) {

if ((hasAVX2() && Ty == XXX) || hasAVX512() && Ty == YYY)
  return true;

return false;

@delena @zvi What do you want to do with this. IMO we shouldn't be prematurely combining to variable mask shuffles, and this should be performed later as a scheduler based decision. But that will involve a lot of work that I don't think we're ready for (D26855 tried to move some other code to the MC and hit a lot of issues).

What we could do is add a FeatureFastVariableShuffle feature flag to Haswell and later Intel CPUs and perform the decision in combineX86ShuffleChain off that?

May be just add something like this:
bool hasVariableShuffle(MVT Ty) {

if ((hasAVX2() && Ty == XXX) || hasAVX512() && Ty == YYY)
  return true;

return false;

On bdver4 and znver1 this would be a perf regression.

In this case we need to add "SlowShuffle.." property to these two targets. I don't see any other way to distinguish. Or "Fast" to HSW, but I suggest the fist variant, because less instructions is more obvious.

RKSimon updated this revision to Diff 127504.Dec 19 2017, 5:27 AM

Rebased now that D41323 has landed at rL321074

hopeful ping - even though its the holidays, but Jan 3 is fast approaching........

spatel accepted this revision.Dec 28 2017, 4:13 PM

The questionable transforms are differentiated with 'fast-variable-shuffle' now, so LGTM.

This revision is now accepted and ready to land.Dec 28 2017, 4:13 PM
This revision was automatically updated to reflect the committed changes.