A description of this technique can be found here:

http://wm.ite.pl/articles/sse-popcount.html

The core of the idea is to use an in-register lookup table and the
PSHUFB instruction to compute the population count for the low and high
nibbles of each byte, and then to use horizontal sums to aggregate these
into vector population counts with wider element types.
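As a rough illustration of the per-byte step, here is a portable scalar model in C. It is only a sketch: the names `NIBBLE_POPCNT` and `popcnt_v16i8_lut` are hypothetical, and each loop iteration stands in for one of the 16 byte lanes that the vector instructions process at once.

```c
#include <stdint.h>

/* Popcount of every 4-bit value. In the vector lowering this 16-entry
   table lives in a single 128-bit register. */
static const uint8_t NIBBLE_POPCNT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar model of the v16i8 popcount (hypothetical helper name). */
static void popcnt_v16i8_lut(const uint8_t in[16], uint8_t out[16]) {
    for (int i = 0; i < 16; ++i) {
        uint8_t lo = in[i] & 0x0F;        /* low nibble: mask with 0x0F */
        uint8_t hi = (in[i] >> 4) & 0x0F; /* high nibble: shift, then mask */
        /* Two table lookups (PSHUFB in the vector version) and one add. */
        out[i] = (uint8_t)(NIBBLE_POPCNT[lo] + NIBBLE_POPCNT[hi]);
    }
}
```

Each output byte is at most 8, so the per-byte sums cannot overflow.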

On x86 there is an instruction (PSADBW) that will directly compute the
horizontal sum for the low 8 and high 8 bytes, giving vNi64 popcount
very easily. Various tricks are used to get vNi32 and vNi16 from the
vNi8 that the LUT computes.
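The horizontal sum for the i64 case can be modelled in the same portable style. This sketch assumes the sum is formed as a sum of absolute differences against an all-zero vector, which reduces to a plain sum of each group of 8 bytes; the function name `hsum_bytes_to_v2i64` is hypothetical.

```c
#include <stdint.h>

/* Scalar model of summing per-byte popcounts into two 64-bit lanes.
   |b - 0| == b for unsigned bytes, so a sum of absolute differences
   against zero is just the sum of the 8 bytes in each half. */
static void hsum_bytes_to_v2i64(const uint8_t counts[16], uint64_t out[2]) {
    for (int half = 0; half < 2; ++half) {
        uint64_t sum = 0;
        for (int i = 0; i < 8; ++i)
            sum += counts[8 * half + i];
        out[half] = sum; /* at most 8 * 8 = 64 */
    }
}
```

Feeding it the 16 per-byte counts produced by the lookup step gives the popcount of each 64-bit element.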

The base implementation of this, and most of the work, was done by Bruno
in a follow-up to D6531. See Bruno's detailed post there for lots of
timing information about these changes.

I have extended Bruno's patch in the following ways:

0) I committed the new tests with baseline sequences so this shows
   a diff, and regenerated the tests using the update scripts.

1) Bruno had noticed and mentioned on IRC a redundant mask that I
   removed.

2) I introduced a particular optimization for the i32 vector cases
   where we use PSHL + PSADBW to compute the low i32 popcounts, and
   PSHUFD + PSADBW to compute doubled high i32 popcounts. This takes
   advantage of the fact that to line up the high i32 popcounts we have
   to shift them anyway, and we can shift them by one fewer bit to
   effectively divide the count by two. While the PSHUFD-based
   horizontal add is no faster, it doesn't require registers or load
   traffic the way a mask would, and it provides more ILP as it happens
   on different ports with high throughput.

3) I did some code cleanups throughout to simplify the implementation
   logic.
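The i32 trick is easier to follow with concrete arithmetic. The sketch below is one reading of that description, modelled on a single 64-bit lane holding eight per-byte popcounts (two i32 elements). The helper names and the choice of 64-bit lane shifts are assumptions made for illustration, not taken from the patch.

```c
#include <stdint.h>

/* Sum of the 8 bytes in a 64-bit lane: models a sum of absolute
   differences against zero (hypothetical helper name). */
static uint64_t sum_bytes(uint64_t lane) {
    uint64_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum += (lane >> (8 * i)) & 0xFF;
    return sum;
}

/* Turn 8 per-byte counts into two i32 counts packed in one lane. */
static uint64_t hsum_bytes_to_2xi32(uint64_t byte_counts) {
    /* Low i32: shift the low four bytes into the high half, which
       zeroes the low half, then sum the whole lane. */
    uint64_t low = sum_bytes(byte_counts << 32);

    /* High i32: duplicate the high dword into both halves (a shuffle
       in the vector version). Summing the lane yields TWICE the count. */
    uint64_t hi32 = byte_counts >> 32;
    uint64_t doubled_high = sum_bytes((hi32 << 32) | hi32);

    /* The high count must move up by 32 bits anyway; shifting by 31
       instead also halves the doubled value. */
    return low | (doubled_high << 31);
}
```

For example, with low-dword byte counts 1, 2, 3, 4 and high-dword byte counts 5, 6, 7, 8, the low element becomes 10 and the doubled high sum is 52, which the 31-bit shift places in the upper element as 26. No mask constant is needed at any point, which matches the motivation given in item 2.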

With #1 and #2 above, I analyzed the result in IACA for sandybridge,
ivybridge, and haswell. In every case I measured, the throughput is the
same or better using the LUT lowering, even for v2i64 and v4i64, and
even compared with using the native popcnt instruction! The latency of
the LUT lowering is often higher than the latency of the scalarized
popcnt instruction sequence, but I think those latency measurements are
deeply misleading. Keeping the operation fully in the vector unit and
having many chances for increased throughput seems much more likely to
win.

I think with this, we can lower every integer vector popcount
implementation using the LUT strategy if we have SSSE3 or better (and
thus have PSHUFB).