As noticed on D129765 and reported in Issue #56531, AArch64 targets can use the NEON ctpop + add-reduce instructions to speed up scalar ctpop, but we fail to do this for parity calculations.
I'm not sure where the cutoff should be, but i64 (plus the i128 special case) shows a definite reduction in instruction count. i32 is about the same (I'm not sure whether scalar <-> NEON transfers are particularly costly), and promoting sub-i32 types looks to be a definite regression compared to the parity expansion optimized for those widths.
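For readers less familiar with the two lowerings being compared, here is a rough C++ sketch of the idea rather than the patch itself. It assumes the generic parity expansion is an xor-shift fold, and it uses the GCC/Clang builtin `__builtin_popcountll` as a stand-in for the scalar ctpop that AArch64 lowers through NEON. The function names `parity_xor_fold` and `parity_via_ctpop` are made up for illustration.

```cpp
#include <cstdint>

// Assumed shape of the generic expansion: xor the upper half into the lower
// half repeatedly until bit 0 holds the parity of all 64 bits.
inline unsigned parity_xor_fold(std::uint64_t x) {
    x ^= x >> 32;
    x ^= x >> 16;
    x ^= x >> 8;
    x ^= x >> 4;
    x ^= x >> 2;
    x ^= x >> 1;
    return static_cast<unsigned>(x & 1u);
}

// Parity expressed as the low bit of the population count. The builtin is a
// GCC/Clang extension standing in for ctpop; on AArch64 this is the form that
// can use the NEON ctpop + add-reduce sequence.
inline unsigned parity_via_ctpop(std::uint64_t x) {
    return static_cast<unsigned>(__builtin_popcountll(x)) & 1u;
}
```

Both functions return the same value for every input. The open question above is only which form produces the shorter instruction sequence at each bit width.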
Formatting.