This is a partial port of AArch64TargetLowering::LowerCTPOP.
This custom lowering tries to use NEON instructions to give a more efficient CTPOP lowering when possible.
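
For readers unfamiliar with the NEON approach, here is a minimal portable sketch of the idea, assuming the usual shape of the sequence: a per-byte population count (CNT on byte lanes) followed by a horizontal add across the lanes (UADDLV). The function names are hypothetical and are not LLVM APIs; this models the arithmetic only, not the actual MIR that the lowering builds.

```cpp
#include <cassert> // only needed for the usage checks
#include <cstdint>

// Hypothetical helper: models one byte lane of a NEON CNT instruction.
static uint8_t cntByte(uint8_t B) {
  uint8_t N = 0;
  // Clear the lowest set bit until no bits remain.
  for (; B != 0; B &= static_cast<uint8_t>(B - 1))
    ++N;
  return N;
}

// Hypothetical model of the vector sequence for a 64-bit scalar:
// move the value into a vector register, count bits in each of the
// 8 byte lanes, then sum the lanes into a single result.
uint64_t ctpop64ViaByteLanes(uint64_t X) {
  uint16_t Sum = 0; // the horizontal add widens, so the sum cannot overflow
  for (unsigned Lane = 0; Lane < 8; ++Lane)
    Sum += cntByte(static_cast<uint8_t>(X >> (8 * Lane)));
  return Sum;
}
```

The point of the vector form is that the per-lane count and the lane sum are each a single instruction, rather than a chain of scalar shifts and masks.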
In the non-NEON/noimplicitfloat case, this should fall back to the generic lowering (see: https://godbolt.org/z/GcaPvWe4x). I think that's worth implementing, but only after the widening code for s16/s8 is in place.
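
As a rough illustration of what a generic (scalar-only) expansion looks like, here is a sketch of the classic parallel bit-count for a 32-bit value. It is assumed to be representative of the generic lowering rather than a copy of it; the function name is hypothetical.

```cpp
#include <cassert> // only needed for the usage checks
#include <cstdint>

// Hypothetical scalar fallback: parallel bit-count with no vector or FP
// registers, which is what the noimplicitfloat case requires.
uint32_t ctpop32BitTwiddle(uint32_t V) {
  // Each 2-bit field now holds the count of bits in that field.
  V = V - ((V >> 1) & 0x55555555u);
  // Each 4-bit field holds the count of bits in that field.
  V = (V & 0x33333333u) + ((V >> 2) & 0x33333333u);
  // Each byte holds the count of bits in that byte.
  V = (V + (V >> 4)) & 0x0F0F0F0Fu;
  // The multiply sums all four bytes into the top byte.
  return (V * 0x01010101u) >> 24;
}
```

Narrower types such as s8 and s16 would typically be zero-extended to a wider type before a sequence like this, which is one reason the widening code is a natural prerequisite.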