This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Improve i16 splatting shuffles
ClosedPublic

Authored by RKSimon on Nov 21 2015, 6:53 AM.

Details

Summary

Better handling of the annoying pshuflw/pshufhw ops which only shuffle lower/upper halves of a vector.

Added vXi16 unary shuffle support for cases where i16 elements (from the same half of the source) are being splatted to the whole of one of the halves. This avoids the general lowering case which must shuffle the 32-bit elements first - meaning that we used to end up with unnecessary duplicate pshuflw/pshufhw shuffles.

Note this has the side effect of a lot of SSSE3 test cases no longer needing to use PSHUFB, as it falls below the 3 op combine threshold for when PSHUFB is typically worth it.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 40868.Nov 21 2015, 6:53 AM
RKSimon retitled this revision from to [X86][SSE] Improve i16 splatting shuffles.
RKSimon updated this object.
RKSimon added a reviewer: chandlerc.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
RKSimon added subscribers: qcolombet, andreadb.

Ping - adding extra subscribers.

qcolombet added inline comments.Dec 16 2015, 10:44 AM
test/CodeGen/X86/avx-splat.ll
18

I am missing something.

You said:

Note this has the side effect of a lot of SSSE3 test cases no longer needing to use PSHUFB, as it falls below the 3 op combine threshold for when PSHUFB is typically worth it.

But isn’t pshufb always better than two other pshufX instructions?

RKSimon added inline comments.Dec 16 2015, 12:06 PM
test/CodeGen/X86/avx-splat.ll
18

On all but the newest machines (Haswell+) PSHUFB has a 3 or more cy latency (which is where the 3 op threshold comes from). The bigger problem is that PSHUFB nearly always has to load the shuffle mask operand from the constant pool, which is a lot more costly than 3 immediate ops.

escha added a subscriber: escha.Dec 16 2015, 12:14 PM

Just a side note, but Agner claims pshufb is 1 cycle latency on Wolfdale, Nehalem, and Ivy Bridge as well.

Just a side note, but Agner claims pshufb is 1 cycle latency on Wolfdale, Nehalem, and Ivy Bridge as well.

But not any recent Atom or AMD targets - it still doesn't account for the cost of loading the shuffle mask either unfortunately. On the whole I think the 3 op threshold is about right.

Tested on Jaguar CPU:

Throughput:
Old 3op shuffle: 4cy
New 2op shuffle: 2cy
pshufb_rr 3cy
pshufb_rm 3cy

Latency:
Old 3op shuffle: 4cy
New 2op shuffle: 3cy
pshufb_rr 3cy
pshufb_rm 4cy

qcolombet edited edge metadata.Dec 17 2015, 9:25 AM

Thanks for the numbers Simon.

Sorry Quentin - I missed your follow up email to the list - copied here:

Tested on Jaguar CPU:

Throughput:
Old 3op shuffle: 4cy
New 2op shuffle: 2cy
pshufb_rr 3cy
pshufb_rm 3cy

I am confused.
When the code sequence is shorter, I was expecting this, but this number is not for the problem we were discussing, i.e., when the shufb is replaced by 2 shuf(w|hd, whatever), right?
If it is, I am missing something because it should be 2 uops in both cases.

I'm confused too - I'm not certain what outstanding problem with my patch you think I should be addressing.

What it does is improve vXi16 shuffles so that more patterns can be performed in 2uops instead of 3uops, a side effect of which is that a later combine stage in PerformShuffleCombine (combineX86ShufflesRecursively) no longer merges these into a single PSHUFB as its threshold for combining is 3uops. The timing tests I did demonstrated that this threshold is probably about right - although I accept that more recent targets can perform PSHUFB faster.

What am I missing?

I've raised a bugzilla: https://llvm.org/bugs/show_bug.cgi?id=26183 to discuss what the threshold for combines to PSHUFB should be and how it should be controlled.

qcolombet accepted this revision.Jan 21 2016, 9:22 AM
qcolombet edited edge metadata.

Hi Simon,

Thanks for filing a PR.

You’re right the threshold problem is separate, LGTM then.

Cheers,
-Quentin

This revision is now accepted and ready to land.Jan 21 2016, 9:22 AM
This revision was automatically updated to reflect the committed changes.