This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Vectorize CTTZ + CTTZ_ZERO_UNDEF
ClosedPublic

Authored by RKSimon on Sep 5 2015, 7:33 AM.

Details

Summary

Now that we have fast vector CTPOP implementations we can use this to speed up vector CTTZ using the pattern (cttz(x) = ctpop((x & -x) - 1))
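
For readers following along, the identity can be checked on scalars. This is only a minimal sketch, not the code from the patch: the helper names (`popcount32`, `cttz_via_ctpop`, `check_cttz_via_ctpop`) are made up for illustration, and a plain loop stands in for the vector CTPOP lowering.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a fast CTPOP: clears the lowest set bit
// once per iteration, so the loop runs once per set bit.
unsigned popcount32(uint32_t x) {
  unsigned n = 0;
  for (; x != 0; x &= x - 1)
    ++n;
  return n;
}

// cttz(x) = ctpop((x & -x) - 1)
unsigned cttz_via_ctpop(uint32_t x) {
  uint32_t lowest = x & (0u - x);  // isolate the lowest set bit (0 if x == 0)
  return popcount32(lowest - 1u);  // the ones below that bit = trailing zeros
}

// Self-check on a few hand-computed values.
void check_cttz_via_ctpop() {
  assert(cttz_via_ctpop(1u) == 0);
  assert(cttz_via_ctpop(8u) == 3);
  assert(cttz_via_ctpop(0x80000000u) == 31);
  assert(cttz_via_ctpop(0u) == 32);  // zero input yields the bit width
}
```

Note that a zero input needs no special case here: `(0 & -0) - 1` is all ones, so the popcount returns the full width, which is the defined result for plain CTTZ.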

Additionally, for AVX512CD that provides lzcnt instructions we can use the pattern (cttz_undef(x) = (width - 1) - ctlz(x & -x))
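
The lzcnt-based variant can be sketched on scalars in the same way. Again this is an illustration under assumptions, not the patch itself: `ctlz32`, `cttz_zero_undef_via_ctlz` and `check_cttz_zero_undef_via_ctlz` are hypothetical names, and the loop only stands in for the vector lzcnt instruction.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a vector lzcnt: counts zero bits from the
// most significant bit downwards and returns 32 when x == 0.
unsigned ctlz32(uint32_t x) {
  unsigned n = 0;
  for (uint32_t bit = 0x80000000u; bit != 0 && (x & bit) == 0; bit >>= 1)
    ++n;
  return n;
}

// cttz_undef(x) = (width - 1) - ctlz(x & -x), valid only for x != 0.
unsigned cttz_zero_undef_via_ctlz(uint32_t x) {
  const unsigned width = 32;
  uint32_t lowest = x & (0u - x);       // isolate the lowest set bit
  return (width - 1) - ctlz32(lowest);  // bit index of that single set bit
}

// Self-check on a few hand-computed non-zero values.
void check_cttz_zero_undef_via_ctlz() {
  assert(cttz_zero_undef_via_ctlz(1u) == 0);
  assert(cttz_zero_undef_via_ctlz(8u) == 3);
  assert(cttz_zero_undef_via_ctlz(0x80000000u) == 31);
}
```

For a zero input, `ctlz` returns the width and the subtraction wraps around, so the result is meaningless. That is why this form only fits the ZERO_UNDEF variant.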

Originally I was intending to implement this generically in the VectorLegalizer but hit the issue that the 2i64 implementations were vectorized and saw a large perf regression. I could still do this and provide an 'empty' custom implementation on X86 to force scalarization - not sure if it's good practice though? It would have the benefit that we could remove the very similar implementation in the ARM target as well (Logan, any comments?).

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 34110. Sep 5 2015, 7:33 AM
RKSimon retitled this revision to [X86][SSE] Vectorize CTTZ + CTTZ_ZERO_UNDEF.
RKSimon updated this object.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added subscribers: llvm-commits, logan.
igorb added a subscriber: igorb. Sep 5 2015, 11:52 PM
qcolombet edited edge metadata. Sep 16 2015, 4:51 PM

Hi Simon,

Originally I was intending to implement this generically in the VectorLegalizer but hit the issue that the 2i64 implementations were vectorized and saw a large perf regression.

Pushing that into generic code may make sense, but we would need to be careful with the cost model, i.e., AND and SUB may not be legal on the target.

Anyhow, LGTM.

Thanks,
-Quentin

lib/Target/X86/X86ISelLowering.cpp, line 17075

Wouldn’t hurt to write the pattern we build here: x & -x

qcolombet accepted this revision. Sep 16 2015, 4:51 PM
qcolombet edited edge metadata.
This revision is now accepted and ready to land. Sep 16 2015, 4:51 PM
This revision was automatically updated to reflect the committed changes.