As I suggested on PR39281, this patch uses PADDL pairwise addition to widen from the vXi8 CTPOP result to the target vector type.
This is a blocker for generic vector CTPOP expansion (P32655): ARM's vXi64 CTPOP currently expands, which would generate a vXi64 MUL, but ARM's lowering expands the general MUL case and vectors aren't well handled in LegalizeDAG. Improving the CTPOP lowering was a lot easier than fixing the MUL lowering.
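For readers following along, the pairwise widening described above can be modelled in plain Python. This is only a rough sketch of the idea, not the patch itself: the helper names (`popcount_bytes`, `pairwise_add_long`, `ctpop_widen`) are made up for illustration, and lanes are represented as ordinary lists of integers.

```python
def popcount_bytes(lanes):
    """Per-byte population count, standing in for the vXi8 CTPOP result."""
    return [bin(b & 0xFF).count("1") for b in lanes]


def pairwise_add_long(lanes):
    """Hypothetical model of a widening pairwise add: each adjacent pair of
    lanes is summed into a single lane of twice the width."""
    return [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]


def ctpop_widen(byte_lanes, target_bits):
    """Widen the i8 counts step by step until lanes are target_bits wide
    (8, 16, 32 or 64). The lane count halves at every step."""
    counts = popcount_bytes(byte_lanes)
    width = 8
    while width < target_bits:
        counts = pairwise_add_long(counts)
        width *= 2
    return counts


# A 128-bit vector viewed as 16 bytes: eight 0xFF bytes, then 0x01 and zeros.
example = [0xFF] * 8 + [0x01] + [0x00] * 7
print(ctpop_widen(example, 64))
```

Under this model a v2i64 result takes three pairwise steps (i8 to i16, i16 to i32, i32 to i64), a v4i32 result takes two, and a v8i16 result takes one.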
For the 64-bit vector case, couldn't we use vpadd instead? We don't care about signed/unsigned, but we'd also have to know that the wide result isn't necessary, which I expect is fine for most bit counting cases.
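To make the question concrete, here is a small Python model of the non-widening alternative. It is a sketch under assumptions: `vpadd_u8` is a hypothetical stand-in for a two-operand pairwise add that keeps 8-bit lanes and wraps on overflow, and `ctpop_i64_nonwidening` is an illustrative name, not anything from the patch. The point is that the per-byte counts of a 64-bit value sum to at most 64, so the total still fits in an 8-bit lane.

```python
def vpadd_u8(a, b):
    """Hypothetical model of a non-widening pairwise add on two 8-lane u8
    vectors: the low half of the result holds pair sums from a, the high
    half holds pair sums from b, each wrapped to 8 bits."""
    def pairs(v):
        return [(v[i] + v[i + 1]) & 0xFF for i in range(0, len(v), 2)]
    return pairs(a) + pairs(b)


def ctpop_i64_nonwidening(byte_counts):
    """Reduce eight per-byte counts with three non-widening pairwise adds.
    Feeding the same vector as both operands leaves the total in every lane."""
    v = list(byte_counts)
    for _ in range(3):
        v = vpadd_u8(v, v)
    return v


# Worst case: all 64 bits set, so every byte count is 8.
print(ctpop_i64_nonwidening([8] * 8))
```

In this model the total ends up in 8-bit lanes, so whether it is usable depends on the consumer not needing a genuinely 64-bit lane, which is the condition raised in the comment above.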