Now that we have fast vector CTPOP implementations, we can use them to speed up vector CTTZ with the pattern (cttz(x) = ctpop((x & -x) - 1)).
Additionally, for AVX512CD, which provides vector lzcnt instructions, we can use the pattern (cttz_undef(x) = (width - 1) - ctlz(x & -x)).
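For reference, here is a minimal scalar sketch of the two patterns above, modelling a single 32-bit lane. It is not the lowering code from this patch: the helper names (ctpop32, ctlz32, cttzViaCtpop, cttzUndefViaCtlz) are hypothetical, and the loops stand in for the real vector CTPOP / lzcnt instructions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical scalar stand-ins for one 32-bit vector lane.
static unsigned ctpop32(uint32_t X) {
  unsigned N = 0;
  for (; X != 0; X &= X - 1) // clear the lowest set bit each iteration
    ++N;
  return N;
}

static unsigned ctlz32(uint32_t X) {
  unsigned N = 0;
  // Walk down from the top bit until a set bit is found (32 for X == 0).
  for (uint32_t Bit = 0x80000000u; Bit != 0 && (X & Bit) == 0; Bit >>= 1)
    ++N;
  return N;
}

// cttz(x) = ctpop((x & -x) - 1); yields the lane width (32) for x == 0.
unsigned cttzViaCtpop(uint32_t X) {
  uint32_t LowBit = X & (0u - X); // isolate the lowest set bit
  return ctpop32(LowBit - 1);     // ones below that bit == trailing zeros
}

// cttz_undef(x) = (width - 1) - ctlz(x & -x); x must be nonzero.
unsigned cttzUndefViaCtlz(uint32_t X) {
  assert(X != 0 && "result is undefined for a zero input");
  uint32_t LowBit = X & (0u - X); // isolate the lowest set bit
  return 31 - ctlz32(LowBit);
}
```

Note that the ctpop form also gives the defined result for a zero input (all ones after the subtract, so the count equals the lane width), whereas the ctlz form only matches for nonzero inputs, which is why it is limited to cttz_undef.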
Originally I intended to implement this generically in the VectorLegalizer, but hit the issue that the v2i64 implementations were vectorized and saw a large perf regression. I could still do this and provide an 'empty' custom implementation on X86 to force scalarization, though I'm not sure if it's good practice. It would have the benefit that we could also remove the very similar implementation in the ARM target (Logan, any comments?).
Wouldn’t hurt to write the pattern we build here: x & -x