This patch adds x86 custom lowering for the @llvm.ctpop.v8i32 intrinsic.
Currently, the expansion of @llvm.ctpop.v8i32 uses vector element extractions,
insertions, and individual calls to @llvm.ctpop.i32. Local Haswell measurements
show that @llvm.ctpop.v8i32 is faster when lowered with a vector parallel
bit-twiddling approach than with a per-element @llvm.ctpop.i32, based on:
v = v - ((v >> 1) & 0x55555555);
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);
v = (v + (v >> 4)) & 0x0F0F0F0F;
v = v + (v >> 8);
v = v + (v >> 16);
v = v & 0x0000003F;
A toy microbenchmark showed a ~2x speedup, whereas vector types with fewer
elements are still better off with the old approach (see results below). Hence
this patch implements it only for the v8i32 type. The results indicate it might
also be profitable to implement this approach for v32i8 and v16i16, but I
haven't measured those yet.
On AVX1, ctpop.v8i32 is broken into two ctpop.v4i32 operations, which is only
slightly better than the old expansion. However, this patch does not implement
custom lowering for the standalone ctpop.v4i32 type, since it's not profitable.