This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Add support for lowering shuffles to PACKSS/PACKUS
ClosedPublic

Authored by RKSimon on Oct 2 2017, 11:02 AM.

Details

Summary

If the upper bits of a truncation shuffle pattern's inputs have at least the minimum number of sign/zero bits, then we can safely use PACKSS/PACKUS as shuffles.

Partial fix for https://bugs.llvm.org/show_bug.cgi?id=34773
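
For context, a minimal intrinsics sketch (not part of the patch; the helper name is made up) of the kind of pattern this lowering targets: when every input lane is known to be a sign-extended 16-bit value, PACKSSDW cannot saturate, so it acts as a pure truncating shuffle.

#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: narrow eight sign-extended 32-bit lanes to 16-bit
 * lanes. Each input lane has at least 17 sign bits, so packssdw never
 * saturates and simply drops the redundant upper halves. */
static inline __m128i trunc_epi32_to_epi16(__m128i lo, __m128i hi) {
    return _mm_packs_epi32(lo, hi);  /* packssdw %xmm1, %xmm0 */
}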

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision. Oct 2 2017, 11:02 AM
pcordes accepted this revision. Oct 2 2017, 5:58 PM
pcordes added a subscriber: pcordes.

Looks like many significant improvements, but there are a couple of possible regressions where we now get a shift+pack instead of a single pshufb, e.g. in trunc16i32_16i8_lshr, trunc8i32_8i16_lshr, and a couple of other cases.

test/CodeGen/X86/shuffle-strided-with-offset-256.ll
92 ↗(On Diff #117394)

Possible regression here, if this happens in a loop. Saving a pshufb vector constant may be worth it for a one-off, but vpsrld + vpackusdw is pretty much always worse for throughput than vpshufb.
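
To make the trade-off concrete, a hedged sketch (not taken from the test file; helper names and the mask are illustrative) of the two styles of sequence being compared here, extracting the high 16 bits of each 32-bit lane as packed words:

#include <smmintrin.h>  /* SSE4.1: _mm_packus_epi32; pulls in SSSE3/SSE2 too */

/* Shift+pack form (vpsrld + vpackusdw): narrows two inputs at once and
 * needs no constant load, but costs two instructions per input vector. */
static inline __m128i high_words_shift_pack(__m128i a, __m128i b) {
    return _mm_packus_epi32(_mm_srli_epi32(a, 16), _mm_srli_epi32(b, 16));
}

/* Single-shuffle form (vpshufb): one shuffle uop per input vector, at the
 * cost of a 16-byte mask constant (usually hoisted out of loops). */
static inline __m128i high_words_pshufb(__m128i a) {
    const __m128i mask = _mm_setr_epi8(2, 3, 6, 7, 10, 11, 14, 15,
                                       -1, -1, -1, -1, -1, -1, -1, -1);
    return _mm_shuffle_epi8(a, mask);
}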

test/CodeGen/X86/sse2-intrinsics-x86.ll
687 ↗(On Diff #117394)

Apparently constant propagation through packssdw-with-zero wasn't working before, but this fixes it.

test/CodeGen/X86/vector-compare-results.ll
3553 ↗(On Diff #117394)

packing into a single vector is a waste if we're still going to pextrb each element separately, and do a bunch of dead stores to 2(%rdi)... what the heck is going on here? Surely the pextr/and/mov asm is total garbage that we really don't want, right?

BTW, psrlw $15, %xmm6 before packing from words to bytes will avoid the need for and $1, so you could extract directly to memory.
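
A hedged sketch of that suggestion (not from the test file; the helper name is made up): shifting each 0/0xFFFF compare word right by 15 yields 0/1 words, so after packing, the bytes can be stored directly with no and $1 cleanup.

#include <emmintrin.h>  /* SSE2 */

/* Turn two vectors of 16-bit compare results (0 or 0xFFFF per lane) into
 * sixteen 0/1 bytes.  psrlw $15 maps 0xFFFF -> 1, and packuswb cannot
 * saturate values that are already 0 or 1. */
static inline __m128i bool_words_to_bytes(__m128i cmp_lo, __m128i cmp_hi) {
    __m128i lo = _mm_srli_epi16(cmp_lo, 15);
    __m128i hi = _mm_srli_epi16(cmp_hi, 15);
    return _mm_packus_epi16(lo, hi);   /* packuswb %xmm1, %xmm0 */
}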

test/CodeGen/X86/vector-trunc.ll
409 ↗(On Diff #117394)

If I'm understanding this function right, there's still a big missed optimization:

psrad    $16, %xmm0                   # get the words we want aligned with the garbage in xmm1
pblendw  $alternating, %xmm1, %xmm0
pshufb   (fix the order), %xmm0
ret

But this patch isn't trying to fix that. TODO: report this separately.

495 ↗(On Diff #117394)

This is questionable. 2x shift + 2x pack + punpck is probably worse than 2x pshufb / punpck.

Even better (if register pressure allows) would be 2x pshufb / POR, with 2 different shuffle-masks that leave the data high or low and zero the other half.
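
A hedged sketch of that 2x pshufb / POR idea (mask bytes and helper name are illustrative, not from the patch): each shuffle places its data in one half of the result and zeroes the other half (a mask byte with the top bit set writes zero), so a plain OR merges the two with no unpack.

#include <tmmintrin.h>  /* SSSE3 */

static inline __m128i merge_high_words(__m128i a, __m128i b) {
    /* a's high words go to the low 8 bytes, b's to the high 8 bytes;
     * -1 (top bit set) zeroes the unused half of each shuffle result. */
    const __m128i lo_mask = _mm_setr_epi8(2, 3, 6, 7, 10, 11, 14, 15,
                                          -1, -1, -1, -1, -1, -1, -1, -1);
    const __m128i hi_mask = _mm_setr_epi8(-1, -1, -1, -1, -1, -1, -1, -1,
                                          2, 3, 6, 7, 10, 11, 14, 15);
    return _mm_or_si128(_mm_shuffle_epi8(a, lo_mask),
                        _mm_shuffle_epi8(b, hi_mask));
}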

This revision is now accepted and ready to land. Oct 2 2017, 5:58 PM

Added comments before I commit. The remaining regressions should be handled when we enable shuffle combining to create PACKSS/PACKUS as well as combine from them, but that can only be done once lowering has landed.

test/CodeGen/X86/shuffle-strided-with-offset-256.ll
92 ↗(On Diff #117394)

This is a typical example of the separate shuffle (and shuffle-like) instructions now falling below the "3-ops" limit before combining to a shuffle with variable masks. This needs to be driven by the scheduler model, but we're nowhere close to supporting that yet.

test/CodeGen/X86/sse2-intrinsics-x86.ll
687 ↗(On Diff #117394)

I pushed this as a separate commit at rL314776

test/CodeGen/X86/vector-compare-results.ll
3553 ↗(On Diff #117394)

This codegen is still a joke - it's doing nothing but demonstrating how badly we handle boolean vectors - if you look below you'll see that every single extracted value is stored to the same byte of memory...

See https://bugs.llvm.org/show_bug.cgi?id=31265

test/CodeGen/X86/vector-trunc.ll
409 ↗(On Diff #117394)

Even better it should just combine to:

psrad $16, %xmm0
psrad $16, %xmm1
packssdw %xmm1, %xmm0

That should be handled when we enable shuffle combining to create PACKSS/PACKUS nodes (and not just lowering).

495 ↗(On Diff #117394)

I reckon we should be able to combine to:

psrld $16, %xmm0
psrld $16, %xmm1
packusdw %xmm1, %xmm0

This revision was automatically updated to reflect the committed changes.

Some CPUs have good pblendw throughput, so it's not always a win to do 2 shifts. (But I guess that's the same problem you mentioned in https://reviews.llvm.org/D38472?id=117394#inline-335636, that the scheduler model isn't close to figuring out when to use a variable-shuffle to reduce port pressure?)

I hope clang isn't going to start compiling _mm_shuffle_epi8 into psrlw $8, %xmm0 / packsswb %xmm0,%xmm0 in cases where that's not a win, when the shuffle control constant lets it do that.

I guess it's a tricky tradeoff between aggressive optimization of intrinsics helping novices (or code tuned for a uarch that isn't the target) vs. defeating deliberate tuning choices. I think it's good to have at least one compiler (clang) that does aggressively optimize, since we can always use gcc instead or for comparison.
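
For concreteness, a hypothetical control constant where such a rewrite would be legal: this pshufb selects the high byte of every 16-bit lane, duplicated across both halves, which a compiler could in principle re-express as psrlw $8 followed by a pack-with-itself. Whether that is actually a win depends on the target's shift/shuffle port layout.

#include <tmmintrin.h>  /* SSSE3 */

/* Hypothetical example: grab the high byte of each word, repeated in both
 * halves.  After psrlw $8 every word holds 0..255, so packuswb %xmm0,%xmm0
 * would reproduce this result exactly. */
static inline __m128i odd_bytes_dup(__m128i v) {
    const __m128i ctl = _mm_setr_epi8(1, 3, 5, 7, 9, 11, 13, 15,
                                      1, 3, 5, 7, 9, 11, 13, 15);
    return _mm_shuffle_epi8(v, ctl);
}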

test/CodeGen/X86/vector-trunc.ll
409 ↗(On Diff #117394)

@RKSimon:

Yeah, that's usually better on Skylake, where immediate vector shifts have 0.5c throughput (running on ports 0 or 1) but pblendw and pshufb compete for port 5. It would only be worse in a loop with lots of p01 pressure and very low p5 pressure.

On Haswell, pblendw and pshufb still compete for port 5, but shifts compete for port 0. So depending on the surrounding code, it's worth considering both options and using the one that leans on the port with lower demand.

On Ryzen, pblendw has 0.33c throughput (ports p013). pshufb and packss run on p12 (0.5c throughput).
psrad runs on p2 only (1c throughput), so it's a potential throughput bottleneck in a loop that isn't doing other stuff on other ports. My sequence has twice the throughput, with both sequences bottlenecked on port 2 for their shift uops.

On Sandybridge: pblendw and pshufb/pack: p15. psrad: p0. So like Ryzen, we get 2x the throughput from my sequence (if used on its own).

On Nehalem, psrad, packss, pshufb, and pblendw all run on p05.