This is an archive of the discontinued LLVM Phabricator instance.

[x86] @llvm.ctpop.v8i32 custom lowering
AbandonedPublic

Authored by bruno on Dec 4 2014, 8:22 AM.

Details

Summary

This patch adds x86 custom lowering for the @llvm.ctpop.v8i32 intrinsic.

Currently, the expansion of @llvm.ctpop.v8i32 uses vector element extractions,
insertions and individual calls to @llvm.ctpop.i32. Local Haswell measurements
show that @llvm.ctpop.v8i32 gets faster by using a vector parallel bit-twiddling
approach rather than @llvm.ctpop.i32 for each element, based on:

v = v - ((v >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
v = (v + (v >> 4)) & 0x0F0F0F0F;
v = v + (v >> 8);
v = v + (v >> 16);
v = v & 0x0000003F;
(from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)
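As a scalar sketch, the steps above amount to the following C function; the patch applies the same arithmetic across all eight i32 lanes of a YMM register at once:

```c
#include <stdint.h>

/* Scalar version of the parallel bit-twiddling steps quoted above.
   Each step merges adjacent bit-count fields: 2-bit sums, then
   4-bit sums, then per-byte sums, then a final horizontal add. */
uint32_t popcount32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555);                /* 2-bit partial sums  */
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333); /* 4-bit partial sums  */
    v = (v + (v >> 4)) & 0x0F0F0F0F;                /* per-byte sums       */
    v = v + (v >> 8);
    v = v + (v >> 16);
    return v & 0x0000003F;                          /* result is 0..32     */
}
```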

A toy microbenchmark showed a ~2x speedup, whereas vector types with fewer elements
are still better off with the old approach (see results below). Hence this
patch only implements it for the v8i32 type. The results indicate it might also be profitable
to implement this approach for v32i8 and v16i16, but I haven't measured those yet.

Under AVX1, ctpop.v8i32 is split into two ctpop.v4i32 operations, which is only slightly better than the old expansion.
However, this patch does not implement custom lowering for the general ctpop.v4i32 type, since it's not profitable.

[core-avx2]

v8i32-new: 10.3506
v8i32-old: 18.3879
v4i32-new: 10.3699
v4i32-old: 8.01387
v4i64-new: 11.7464
v4i64-old: 10.3043
v2i64-new: 11.7922
v2i64-old: 5.20916

[corei7-avx]

v8i32-new: 16.5359
v8i32-old: 18.2479
v4i32-new: 10.2069
v4i32-old: 8.03686
v4i64-new: 17.8085
v4i64-old: 10.2366
v2i64-new: 11.7623
v2i64-old: 5.11533

Diff Detail

Event Timeline

bruno updated this revision to Diff 16929.Dec 4 2014, 8:22 AM
bruno retitled this revision from to [x86] @llvm.ctpop.v8i32 custom lowering.
bruno updated this object.
bruno edited the test plan for this revision. (Show Details)
bruno added reviewers: nadav, chandlerc, andreadb, delena.
bruno set the repository for this revision to rL LLVM.
bruno added a subscriber: Unknown Object (MLST).
nadav edited edge metadata.Dec 4 2014, 8:37 AM

LGTM! Thank you for the detailed measurements!

Do you think that other targets may also benefit from this kind of transformation?

Thanks,
Nadav

bruno added a comment.Dec 4 2014, 11:50 AM

Chandler,

Thanks for the help, the assembly for v8i32-new/old:
http://pastebin.com/4gnd41Je

About the principled split: I'd rather go the other way around, i.e., since SelectionDAGLegalize::ExpandBitCount already emits the bit-math for scalarized versions, it makes more sense to custom split to other known vector types only when we already know it's profitable.

Nadav and Hal,

There are potential benefits for other targets, I believe, but this customisation generates a bunch of vector instructions, and I'm afraid that if one vector instruction or another isn't well supported on a target, that could lead to a lot of scalarized instructions and hence worse code than before. I might be wrong, though. I'd rather go in the direction of moving it to target-independent code once other targets implement it and succeed. Additional thoughts?

Actually, back to x86: if popcnt isn't supported by some x86 target, we currently emit this scalarized bit-math code for each element, and in that case it would always be profitable to emit the vectorized code instead. I tested this for v4i32, v2i64, v4i64 and v8i32, and it performs better. I'm going to update the patch to reflect that. For instance, "-arch x86_64" doesn't assume popcnt by default, since it is a separate feature; in cases like this we would always win.

bruno updated this revision to Diff 17177.Dec 11 2014, 8:39 AM
bruno edited edge metadata.

Patch updated!

chandlerc edited edge metadata.Dec 22 2014, 11:56 AM

I see you've gone ahead and committed this.

Please actually implement the significantly better algorithm I point you at
if you're going to have an x86-specific implementation. Also please
implement this for the other vector types. I really don't want this to be
left in a half-done state forever.

Expanding to other vector types doesn't seem unreasonable to do in a
follow-up patch, but I think it would have been better to start off with
the final algorithm. We now have a pretty substantial pile of code in the
x86 backend that will be completely replaced. =/

> I see you've gone ahead and committed this.
>
> Please actually implement the significantly better algorithm I point you at
> if you're going to have an x86-specific implementation. Also please
> implement this for the other vector types. I really don't want this to be
> left in a half-done state forever.

I understand your concern, I'll get to it, promise :-)

> Expanding to other vector types doesn't seem unreasonable to do in a
> follow-up patch, but I think it would have been better to start off with
> the final algorithm. We now have a pretty substantial pile of code in the
> x86 backend that will be completely replaced. =/

Although you're probably right, I'd rather not do it before taking appropriate
measurements, which I haven't been able to get yet. Also, since we already know
this performs better than the previous expansion, at least we have better code
generation for ctpop in the meantime.

Thanks for the feedback and ideas :D

And three months later, you still haven't implemented the requested changes during code review.

Please do so, and quickly. I'm really unhappy about the pattern of promising to make changes requested during code review in order to get past review, and then failing to follow through on them. I'm very tempted to just revert the patch until you actually have time to address this fully.

chandlerc requested changes to this revision.Mar 29 2015, 2:23 PM
chandlerc edited edge metadata.
This revision now requires changes to proceed.Mar 29 2015, 2:23 PM
bruno added a comment.Apr 30 2015, 7:55 AM

I guess this makes it four months later now =T

Really sorry for the late reply. You're totally right; my bad for not
tackling this from my priority list despite my promises.
However, I intend to resume this work in a week or two, but feel
free to revert it if that sounds like another vague promise :-)

Cheers,

bruno updated this revision to Diff 26286.May 21 2015, 4:27 PM
bruno edited edge metadata.
bruno set the repository for this revision to rL LLVM.

Hi,

This patch implements a faster vector population count based on the algorithm
described in http://wm.ite.pl/articles/sse-popcount.html

It does so by using an in-register lookup table and the pshufb instruction to
compute the popcnt for each byte. Additional instructions are then used to sum
the bytes and produce the result for wider element types. Numbers:

v4i32-avx:

sselookup (v4i32): 1.10211
scalar + ctpop (v4i32): 0.907016 <-- best == ToT
parallelbitmath (v4i32): 1.14124

v8i32-avx:

sselookup (v8i32): 1.97514 <-- best == patch
scalar + ctpop (v8i32): 2.37118

v8i32-avx2:

sselookup (v8i32): 1.17823
parallelbitmath (v8i32): 1.15288 <-- best == ToT

v2i64-avx:

scalar + ctpop (v2i64): 0.589292 <-- best == ToT
sselookup (v2i64): 0.865797
parallelbitmath (v2i64): 1.31027

v4i64-avx:

scalar + ctpop (v4i64): 0.903523 <-- best == ToT
sselookup (v4i64): 1.11988

v4i64-avx2:

scalar + ctpop (v4i64): 0.895486
sselookup (v4i64): 0.677801 <-- best == patch
parallelbitmath (v4i64): 1.02711

v16i8-avx:

scalar + ctpop (v16i8): 4.1569
sselookup (v16i8): 0.508693 <-- best == patch

v32i8-avx:

scalar + ctpop (v32i8): 8.32336
sselookup (v32i8): 0.961657 <-- best == patch

v32i8-avx2:

scalar + ctpop (v32i8): 8.79509
sselookup (v32i8): 0.487716 <-- best == patch

v8i16-avx:

scalar + ctpop (v8i16): 1.86908
sselookup (v8i16): 0.755885 <-- best == patch

v16i16-avx:

scalar + ctpop (v16i16): 4.08575
sselookup (v16i16): 1.32838 <-- best == patch

v16i16-avx2:

scalar + ctpop (v16i16): 4.19101
sselookup (v16i16): 1.18095 <-- best == patch

More info available at
https://github.com/bcardosolopes/llvm-vpopcount
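The in-register lookup idea from the linked article can be sketched in scalar C. A 16-entry table holds the popcount of every 4-bit value; each byte is split into its low and high nibbles, both are looked up, and the results are summed. pshufb performs 16 such table lookups in parallel from an XMM register; this scalar loop just models the per-byte arithmetic, not the actual SIMD lowering in the patch:

```c
#include <stdint.h>
#include <stddef.h>

/* Popcount of each 4-bit value 0x0..0xF; pshufb uses the same
   table held in a vector register. */
static const uint8_t nibble_popcnt[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Sum of popcounts over a buffer of bytes, one nibble lookup at
   a time (the hypothetical name is mine, not from the patch). */
unsigned bytewise_popcount(const uint8_t *bytes, size_t n) {
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += nibble_popcnt[bytes[i] & 0x0F];  /* low nibble  */
        total += nibble_popcnt[bytes[i] >> 4];    /* high nibble */
    }
    return total;
}
```

For wider element types, the patch then sums these per-byte counts within each i16/i32/i64 lane, which is what the extra instructions mentioned above are for.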

One unexpected case is v8i32-avx2. Although sselookup and parallelbitmath vary
in which one runs faster, I've seen the latter yield slightly better results
across multiple runs. I would expect sselookup to always be faster because it
uses fewer instructions, but it looks like there's some latency or
resource-conflict issue going on.

Given the slight perf difference between sselookup and parallelbitmath for
v8i32-avx2, I've removed parallelbitmath completely in this patch and left
sselookup as the default for this type too. We can later change the behavior
for this type back to parallelbitmath (see the next paragraph).

This patch only improves the x86-specific part of vector popcnt. The
parallelbitmath approach implemented for x86 in Dec 2014 is generally inferior
here. Given its target-independent nature, it will be resubmitted in a
follow-up patch as a target-independent expansion for vector popcnt, since
(although no longer for x86) it's much better than the scalar expansion we
currently do.

bruno abandoned this revision.Nov 30 2015, 3:30 PM

Abandoning this one, since an improved version was committed way back in http://reviews.llvm.org/D10084.