This patch adds x86 custom lowering for the @llvm.ctpop.v8i32 intrinsic.
Currently, @llvm.ctpop.v8i32 is expanded with vector element extractions,
insertions and individual calls to @llvm.ctpop.i32. Local Haswell measurements
show that @llvm.ctpop.v8i32 gets faster with a vector parallel bit-twiddling
approach than with a call to @llvm.ctpop.i32 for each element, based on:
v = v - ((v >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
v = (v + (v >> 4)) & 0x0F0F0F0F;
v = v + (v >> 8);
v = v + (v >> 16);
v = v & 0x0000003F;
(from http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel)
A toy microbenchmark showed a ~2x speedup, whereas vector types with fewer
elements are still better off with the old approach (see results below). Hence
this patch implements the custom lowering only for the v8i32 type. The results
also suggest it might be profitable to implement this approach for v32i8 and
v16i16, but I haven't measured those yet.
Under AVX1, ctpop.v8i32 is legalized into two ctpop.v4i32 operations, which is
only slightly better than the old expansion. This patch therefore does not
implement custom lowering for the standalone ctpop.v4i32 type, since it's not
profitable there.
[core-avx2]
v8i32-new: 10.3506
v8i32-old: 18.3879
v4i32-new: 10.3699
v4i32-old: 8.01387
v4i64-new: 11.7464
v4i64-old: 10.3043
v2i64-new: 11.7922
v2i64-old: 5.20916
[corei7-avx]
v8i32-new: 16.5359
v8i32-old: 18.2479
v4i32-new: 10.2069
v4i32-old: 8.03686
v4i64-new: 17.8085
v4i64-old: 10.2366
v2i64-new: 11.7623
v2i64-old: 5.11533