This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Improve variable scalar shift of vXi8 vectors (PR34694)
ClosedPublic

Authored by RKSimon on Aug 25 2018, 3:03 PM.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision. Aug 25 2018, 3:03 PM
RKSimon added a reviewer: efriedma.
RKSimon updated this revision to Diff 162582. Aug 26 2018, 7:20 AM
RKSimon edited the summary of this revision. (Show Details)

Updated version that forms the shift masks on the vector unit as well since they all use the same xmm shift value.

Looks like an improvement, but I haven't fully looked at it yet.

test/CodeGen/X86/vector-rotate-512.ll
474 ↗(On Diff #162582)

This is a pretty questionable way to isolate the low byte of a vector. A vector AND would be 1 uop but needs a constant.

A vmovdqu8 with zero-masking would also work, costing a mov-immediate + kmov to create a mask register. (But all of that overhead is off the critical path from rotate count to final result).
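
As a concrete sketch of that zero-masking option (register and mask-register choices are illustrative only; AVX512BW+VL assumed):

movl     $1, %eax                  # immediate mask selecting only byte 0
kmovd    %eax, %k1                 # kmovd requires AVX512BW
vmovdqu8 %xmm1, %xmm0{%k1}{z}      # keep byte 0, zero the rest

The mov-immediate and kmov only build the mask, so as noted they sit off the critical path from rotate count to result.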

A left/right vector shift (immediate) would also work, 2 uops / 2c latency on the critical path (instead of 3 uops + 5c latency (3+2) for pextrb + movd). More shifts aren't great when we're already using a lot of shift uops, though.

vpsllq $56, %xmm1, %xmm0
vpsrlq $56, %xmm0, %xmm0

If we are extracting to scalar (or coming from there), it's definitely worth considering BMI2 scalar shifts, which are a single uop and mask the shift count, unlike vector shifts, which saturate.

So vmovd %xmm2, %ecx / shrx %ecx, %eax, %eax or something is only 2 uops. (And AVX512BW can vpbroadcastb from a GP register, but that probably costs 2 uops. Agner Fog doesn't seem to have tested VPBROADCASTB zmm1 {k1}{z}, reg for his SKX instruction tables.)
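
Spelled out (the exact registers are just placeholders for whatever the DAG picks):

vmovd %xmm2, %ecx          # move the shift count to a GPR
shrx  %ecx, %eax, %eax     # BMI2 shift: single uop, count masked, flags untouched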

RKSimon added inline comments. Aug 27 2018, 2:18 AM
test/CodeGen/X86/vector-rotate-512.ll
474 ↗(On Diff #162582)

Come to think of it, I should be able to use PMOVZXBQ - I'll just need to make getTargetVShiftNode v16i8 aware (it ignores this case at the moment as that type isn't supported as a shift type).
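
A rough sketch of that idea (the consuming shift shown here is only illustrative; the actual instruction depends on how the vXi8 shift is lowered):

pmovzxbq %xmm1, %xmm1      # zero-extend the low bytes to qwords; qword 0 = byte 0 zero-extended
psllw    %xmm1, %xmm0      # the variable shift reads only the low 64 bits of the count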

RKSimon updated this revision to Diff 162662. Aug 27 2018, 5:22 AM

Avoid vector-scalar-vector to zero extend the bottom v16i8 element

pmovzxbq is perfect for SSE4.1 and higher, well spotted.

In a loop, we might still consider vmovdqu8 xmm0{k1}{z}, xmm0 to avoid port5 pressure, if we can hoist a mask constant. But only if there *is* shuffle port pressure, and AVX512BW is available. (It should be 1c latency for p015 on SKX).

test/CodeGen/X86/vector-shift-shl-128.ll
673 ↗(On Diff #162662)

movzbl within the same register defeats mov-elimination on Intel. Prefer movzbl %al, %edx or any register other than %eax. (movzwl never benefits from mov-elimination; the same-register restriction also applies to mov %eax,%eax when zero-extending into RAX.)
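
For contrast (the two lines below are alternatives, not a sequence):

movzbl %al, %eax           # same register: takes an execution port, 1c latency
movzbl %al, %edx           # different register: can be mov-eliminated (no port, zero latency)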

Other options:

  • vector qword bit-shift left / right
  • vector byte-shift left (pslldq $7, %xmm0) then bit-shift right (psrlq $56, %xmm0), which distributes the work to different execution units (see the sketch after this list)
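
Sketch of the second option (register choice is illustrative):

pslldq $7, %xmm0           # byte-shift: moves byte 0 up to byte 7 (shuffle uop)
psrlq  $56, %xmm0          # bit-shift each qword right by 56; low qword = zero-extended byte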

Both of those are 2c latency, vs. a round-trip to integer registers and back, which costs more on Skylake (movd r,x and x,r are each 2c, up from 1c on Broadwell). It's also a big deal on Bulldozer-family, where a movd round-trip to integer regs is about 14 cycles on Piledriver. (Better on Steamroller.)

The uops for SSE2 shuffle+shift need the same ports on Intel as movd to/from integer, so we basically just save the movzbl, and win everywhere.

Of course if we can prove that the byte is already isolated at the bottom of an XMM reg, we don't need that extra work.

This patch is entirely in the DAG - so we have no way to recognise when we're in a loop suitable for hoisting masks.

test/CodeGen/X86/vector-shift-shl-128.ll
673 ↗(On Diff #162662)

I'll look at zero-extending with PSLLQ+PSRLQ - this is only necessary on pre-SSE41 targets (X86-SSE is only testing SSE2 cases).

RKSimon updated this revision to Diff 162720. Aug 27 2018, 11:40 AM

On pre-SSE41 targets, use PSLLQ+PSRLQ to zero-extend v16i8/v8i16 splat shift amounts. This affects a number of extra tests; if accepted, I'll push the PSLLQ+PSRLQ change before the vXi8 support.
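
For reference, the zero-extension this produces for a byte splat amount looks like the following (the v8i16 case would presumably use an immediate of 48 instead of 56):

psllq $56, %xmm1           # push the low byte to the top of the low qword
psrlq $56, %xmm1           # shift it back down; low qword = zero-extended byte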

pcordes accepted this revision. Aug 27 2018, 11:42 PM

Looks like an improvement everywhere; further refinements are possible, but this is already better than the current code-gen.

I still like my pslldq byte-shift + psrlq bit-shift to mix up the port pressure, especially when this is emitted specifically to feed other shifts. I'd recommend that, unless this commonly gets generated as part of shuffle-heavy code that bottlenecks on port 5 on modern Intel. (It's not rare for people to compile for baseline and run on Haswell+, unfortunately.)

Before Skylake, vector bit-shifts only ran on a single port on Intel, so moving 1 uop to another port could be a big deal if that's the bottleneck in a loop.

shift+shift is better with tune=core2 or earlier, though (slow shuffles on Merom and earlier; even 2nd-gen Core2 Penryn has fast shuffles). I don't think we care enough about first-gen Core2 or Pentium M to tip the balance. It's not disastrous on Core2 (1 extra uop), so I probably wouldn't even bother keeping the other code-gen option around unless we can select based on port pressure in a whole loop body.

A further refinement would be deciding when to load a mask outside a loop so the isolation can be done with a single PAND inside the loop.
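
An illustrative shape of that refinement (the constant label and registers are hypothetical, not something this patch emits):

movdqa  .LCPI_lowbyte(%rip), %xmm7   # hoisted outside the loop: constant {0xff,0,0,...}
pand    %xmm7, %xmm1                 # inside the loop: isolate the low byte in one uop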

This revision is now accepted and ready to land. Aug 27 2018, 11:42 PM
This revision was automatically updated to reflect the committed changes.