This patch is a follow-up to the vector CTPOP work started in http://reviews.llvm.org/D6531
It modifies the current target-independent vector CTPOP expansion to implement a parallel version of the algorithm presented in http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
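For readers who don't have the bithacks page open, here is a minimal sketch of that parallel bit count applied across the lanes of a vector. It is plain C++ rather than SelectionDAG code: the V4 type and the lanewise, splat, and popcountV4 helpers are hypothetical names made up for illustration (they are not part of this patch or of LLVM), and a 4 x i32 vector is assumed. The point is that every step is a whole-vector shift, mask, add, subtract, or multiply, so no per-element extraction is needed.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a <4 x i32> vector value.
using V4 = std::array<std::uint32_t, 4>;

// Apply a binary operation to every pair of lanes, modelling one vector op.
template <typename Op>
V4 lanewise(const V4 &a, const V4 &b, Op op) {
  V4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = op(a[i], b[i]);
  return r;
}

// Broadcast a constant to all lanes.
inline V4 splat(std::uint32_t c) { return {c, c, c, c}; }

inline V4 vAnd(const V4 &a, const V4 &b) {
  return lanewise(a, b, [](std::uint32_t x, std::uint32_t y) { return x & y; });
}
inline V4 vAdd(const V4 &a, const V4 &b) {
  return lanewise(a, b, [](std::uint32_t x, std::uint32_t y) { return x + y; });
}
inline V4 vSub(const V4 &a, const V4 &b) {
  return lanewise(a, b, [](std::uint32_t x, std::uint32_t y) { return x - y; });
}
inline V4 vMul(const V4 &a, const V4 &b) {
  return lanewise(a, b, [](std::uint32_t x, std::uint32_t y) { return x * y; });
}
inline V4 vShr(const V4 &a, unsigned amount) {
  return lanewise(a, splat(amount),
                  [](std::uint32_t x, std::uint32_t s) { return x >> s; });
}

// Parallel bit count (bithacks "CountBitsSetParallel"), one step per vector op.
inline V4 popcountV4(V4 v) {
  // Each 2-bit field now holds the number of set bits in that field.
  v = vSub(v, vAnd(vShr(v, 1), splat(0x55555555u)));
  // Each 4-bit field holds the sum of its two 2-bit fields.
  v = vAdd(vAnd(v, splat(0x33333333u)),
           vAnd(vShr(v, 2), splat(0x33333333u)));
  // Each byte holds the sum of its two 4-bit fields.
  v = vAnd(vAdd(v, vShr(v, 4)), splat(0x0F0F0F0Fu));
  // The multiply accumulates all four bytes into the top byte.
  return vShr(vMul(v, splat(0x01010101u)), 24);
}
```

Calling popcountV4 on a V4 returns the per-lane bit counts. Other element widths would need different masks and a different final shift; this sketch does not cover them.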
A new TLI hook lets the target decide, per vector type, whether to use the unrolled CTPOP expansion or the algorithm implemented in this patch. This is especially useful for x86, where unrolling, parallel bitmath, and custom lowering each give the best performance depending on the type. It looks like this could benefit other targets as well. PowerPC folks, maybe this could show gains for vector types pre-POWER8?
The patch depends on http://reviews.llvm.org/D6531 being applied first so that the tests can run smoothly.
This highlights an unfortunate terminology collision that we really should fix. Essentially everywhere in LLVM, "Unroll" refers to something done to loops (not vectors). The operation referred to here is called 'Scalarize', not only in TTI and its users but all over CodeGen too, and I think that's a better name (we're breaking the vector apart into scalar operations, and then rebuilding a vector from the results).
I'd like to not make the problem worse with this patch: Please call the operation Scalarize, not Unroll. Fixing the other few places in CodeGen (just the users of DAG.UnrollVectorOp in practice) would also be a nice follow-up.
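To make the terminology concrete, here is a minimal sketch of what "scalarizing" a vector CTPOP means: extract each element, run the scalar operation on it, and insert the result into a new vector. It is standalone C++, not LLVM code. The names popcountScalar and scalarizePopcount are hypothetical, and the scalar bit-count loop is just one simple way to count bits, chosen for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar bit count: clear the lowest set bit until the value is zero.
inline std::uint32_t popcountScalar(std::uint32_t v) {
  std::uint32_t n = 0;
  for (; v != 0; v &= v - 1)
    ++n;
  return n;
}

// Scalarized vector bit count: one scalar operation per element,
// then rebuild a vector from the scalar results.
template <std::size_t N>
std::array<std::uint32_t, N>
scalarizePopcount(const std::array<std::uint32_t, N> &vec) {
  std::array<std::uint32_t, N> result{};
  for (std::size_t i = 0; i < N; ++i) {
    std::uint32_t element = vec[i];      // extract element i
    result[i] = popcountScalar(element); // scalar op, insert into result
  }
  return result;
}
```

Nothing here is a loop transformation: the loop only stands in for the N independent extract/operate/insert sequences, which is why "Scalarize" describes it better than "Unroll".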