
[x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique.

Authored by chandlerc on May 28 2015, 2:16 AM.



A description of this technique can be found here:

The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.
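To make the LUT idea concrete, here is a scalar C++ sketch of what PSHUFB computes in each byte lane (the names are illustrative, not from the patch):

```cpp
#include <cstdint>

// The 16-entry in-register table: kNibblePopcnt[n] is the population
// count of the 4-bit value n. PSHUFB performs this table lookup in all
// 16 byte lanes of an XMM register simultaneously.
static const uint8_t kNibblePopcnt[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Per-byte popcount, as the vector lowering computes it: look up the
// low nibble and the high nibble separately, then add the two counts.
uint8_t bytePopcnt(uint8_t b) {
  uint8_t lo = kNibblePopcnt[b & 0x0F];
  uint8_t hi = kNibblePopcnt[b >> 4];
  return lo + hi;
}
```

In the real lowering the two lookups are a pair of PSHUFBs (one on the masked low nibbles, one on the shifted high nibbles) followed by a PADDB.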

On x86 there is an instruction that will directly compute the horizontal
sum for the low 8 and high 8 bytes, giving vNi64 popcount very easily.
Various tricks are used to get vNi32 and vNi16 from the vNi8 that the
LUT computes.
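The instruction in question is PSADBW. A scalar C++ model of the resulting v2i64 path (illustrative names, not the patch's code) looks like this:

```cpp
#include <cstdint>

// The PSHUFB nibble table again, modeled as a scalar array.
static const uint8_t kNibblePopcnt[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Model of the v2i64 lowering: PSHUFB produces a popcount per byte, and
// PSADBW then sums the 8 byte-counts in each 64-bit half of the vector,
// so each i64 lane ends up holding the popcount of the original i64.
void popcntV2i64(const uint8_t bytes[16], uint64_t out[2]) {
  for (int half = 0; half < 2; ++half) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; ++i) {
      uint8_t b = bytes[half * 8 + i];
      sum += kNibblePopcnt[b & 0x0F] + kNibblePopcnt[b >> 4];
    }
    out[half] = sum;
  }
}
```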

The base implementation of this, and most of the work, was done by Bruno
in a follow-up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.

I have extended Bruno's patch in the following ways:

0. I committed the new tests with baseline sequences so this shows a diff, and regenerated the tests using the update scripts.
1. Bruno had noticed and mentioned in IRC a redundant mask that I removed.
2. I introduced a particular optimization for the i32 vector cases where we use PSHL + PSADBW to compute the low i32 popcounts, and PSHUFD + PSADBW to compute doubled high i32 popcounts. This takes advantage of the fact that to line up the high i32 popcounts we have to shift them anyway, and we can shift them by one fewer bit to effectively divide the count by two. While the PSHUFD-based horizontal add is no faster, it doesn't require registers or load traffic the way a mask would, and provides more ILP as it happens on different ports with high throughput.
3. I did some code cleanups throughout to simplify the implementation logic.

With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even v2i64 and v4i64, and even
compared with using the native popcnt instruction! The latency of the
LUT lowering is often higher than the latency of the scalarized popcnt
instruction sequence, but I think those latency measurements are deeply
misleading. Keeping the operation fully in the vector unit and having
many chances for increased throughput seems much more likely to win.

I think with this, we can lower every integer vector popcount
implementation using the LUT strategy if we have SSSE3 or better (and
thus have PSHUFB).

Diff Detail


Event Timeline

chandlerc updated this revision to Diff 26666.May 28 2015, 2:16 AM
chandlerc retitled this revision from to [x86] Implement a faster vector population count based on the PSHUFB in-register LUT technique..
chandlerc updated this object.
chandlerc edited the test plan for this revision. (Show Details)
chandlerc added a reviewer: bruno.
chandlerc added a subscriber: Unknown Object (MLST).
chandlerc updated this revision to Diff 26668.May 28 2015, 2:47 AM

Update this with an even better algorithm that Fiora came up with when we were
discussing this in IRC.

By using PUNPCKLDQ and PUNPCKHDQ to interleave the i32 elements with zeros so
that we can use PSADBW to sum 8 bytes worth of counts horizontally, we end up
with the results of the PSADBW being laid out perfectly to concatenate and
shrink in a single instruction with PACKUSWB. These all pipeline nicely with
the PSADBW instructions, resulting in even lower latency and better throughput
than before.
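A scalar C++ model of this v4i32 path may help (illustrative names; this is a sketch of the data flow, not the patch's code): after interleaving each i32's four count bytes with a zero i32, every 8-byte PSADBW group contains exactly one element's count bytes plus four zeros, so each PSADBW result is directly that element's popcount, and PACKUSWB narrows the results back together.

```cpp
#include <cstdint>

// The PSHUFB nibble table, modeled as a scalar array.
static const uint8_t kNibblePopcnt[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4};

// Model of the v4i32 lowering: each element's four per-byte counts are
// interleaved with four zero bytes (PUNPCKLDQ/PUNPCKHDQ), so summing an
// 8-byte group (PSADBW) yields that single element's popcount.
void popcntV4i32(const uint32_t in[4], uint32_t out[4]) {
  for (int e = 0; e < 4; ++e) {
    uint32_t v = in[e];
    uint32_t sum = 0;
    for (int i = 0; i < 4; ++i) {
      uint8_t b = (v >> (8 * i)) & 0xFF;  // the 4 bytes of this i32;
      sum += kNibblePopcnt[b & 0x0F] + kNibblePopcnt[b >> 4];
    }                                     // the 4 interleaved zero bytes
    out[e] = sum;                         // contribute nothing to the sum.
  }
}
```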

We're down to an insane 10.45 cycle block throughput for this code sequence
compared to 13 for scalarized popcnt on Ivybridge. (12 vs. 13 on Haswell)

bruno accepted this revision.May 28 2015, 6:00 AM
bruno edited edge metadata.

This is awesome Chandler, thank you! Thanks Fiora! :D

I agree that keeping it in the vector unit is likely better when we already have vector ops around. We should do that!
FTR, some new haswell measurements from your patch for the cases where the implementation changed:

v8i32-avx2 -> sselookup now beats scalar ctpop \o/

scalar ctpop (v8i32): 3.93436
sselookup (v8i32): 3.36358

v4i32-avx -> yay, scalar is only slightly better over runs but improved significantly from my previous patch!

scalar ctpop (v4i32): 0.916582
sselookup (v4i32): from ~1.10 to 0.963113

That said, LGTM. Some minor comments in the patch below.

850 ↗(On Diff #26668)

Need to remove this check since we're not going to fall back anymore for >= SSSE3

1125 ↗(On Diff #26668)

Same here

1160 ↗(On Diff #26668)

With the change in the last comment above this line can go away

17392 ↗(On Diff #26668)

This comment can be removed

This revision is now accepted and ready to land.May 28 2015, 6:00 AM

I believe the same approach would work on ARM64, which also has byte-wise vector popcounts and can do interleave-with-zero. Do you think it would be worthwhile to find a way to share the core of this approach?


bruno added a comment.May 29 2015, 8:20 AM

Since we used a very specific x86 idiom and carefully tweaked it to get the best out of it (we handle vXi16, vXi32 and vXi64 differently), my feeling is that we should measure what's best for ARM64 and custom lower it independently. Right now on ARM64 we do this very poorly for EltTy != i8 because of the current scalar expansion (which is pretty horrible):

// CNT supports only B element sizes.
if (VT != MVT::v8i8 && VT != MVT::v16i8)
  setOperationAction(ISD::CTPOP, VT.getSimpleVT(), Expand);

My patch to improve vector legalization for pop count is certainly a win here, but it certainly won't beat using ARM64's native popcnt on vXi8 and building the results for wider types on top of that!

This revision was automatically updated to reflect the committed changes.