This patch is a follow-up to the vector CTPOP work started in http://reviews.llvm.org/D6531
It modifies the current target-independent vector CTPOP expansion to implement a parallel version of the algorithm presented in http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
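For reviewers unfamiliar with the bithacks algorithm, here is a minimal C++ sketch of the bit-parallel count applied to every lane of a 32-bit-element vector. `V4` and `map` are hypothetical stand-ins for a `<4 x i32>` value and a single lane-wise vector operation. This is not the patch's actual expansion code, only the arithmetic it is meant to mirror, and it uses the multiply-based final step from the bithacks page.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a <4 x i32> vector value.
using V4 = std::array<uint32_t, 4>;

// Apply one scalar operation to every lane; models a single vector instruction.
template <typename F>
V4 map(const V4& a, F f) {
  V4 r{};
  for (std::size_t i = 0; i < a.size(); ++i) r[i] = f(a[i]);
  return r;
}

// Bit-parallel population count (bithacks CountBitsSetParallel), lane-wise.
// Usage: ctpop_parallel(V4{0xFFFFFFFFu, 0x0Fu, 0x80000001u, 0u}) gives {32, 4, 2, 0}.
V4 ctpop_parallel(V4 v) {
  // Step 1: each 2-bit field now holds the number of set bits it contained.
  v = map(v, [](uint32_t x) { return x - ((x >> 1) & 0x55555555u); });
  // Step 2: add adjacent 2-bit counts into 4-bit fields.
  v = map(v, [](uint32_t x) {
    return (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
  });
  // Step 3: add adjacent 4-bit counts into one count per byte.
  v = map(v, [](uint32_t x) { return (x + (x >> 4)) & 0x0F0F0F0Fu; });
  // Step 4: the multiply sums all four byte counts into the top byte.
  v = map(v, [](uint32_t x) { return (x * 0x01010101u) >> 24; });
  return v;
}
```

Every step performs the same operation on all lanes at once, so the sequence maps onto ordinary vector shifts, ANDs, adds, subtracts and one multiply, instead of handling each element separately as an unrolled expansion does.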
A new TLI hook lets the target decide, per vector type, whether to use the unrolled CTPOP expansion or the parallel algorithm implemented in this patch. This is especially useful for x86, where unrolling, parallel bitmath and custom lowering each win on performance depending on the type. Other targets may benefit as well. PowerPC folks: could this show gains for vector types on pre-POWER8 targets?
This patch depends on http://reviews.llvm.org/D6531, which must be applied first for the tests to run.