Now that we have fast vector CTPOP implementations we can use this to speed up vector CTTZ using the pattern (cttz(x) = ctpop((x & -x) - 1))
Additionally, for AVX512CD that provides lzcnt instructions we can use the pattern (cttz_undef(x) = (width - 1) - ctlz(x & -x))
Originally I was intending to implement this generically in the VectorLegalizer but hit the issue that the 2i64 implementations were vectorized and saw a large perf regression. I could still do this and provide a 'empty' custom implementation on X86 to force scalarization - not sure if its good practice though? It would have the benefit that we could remove the very similar implementation in the ARM target as well (Logan any comments?).