When CTPOP is not legal, emulating it can be quite expensive. In D89952, a TLI hook was added to allow targets to tune how expensive CTPOP emulation is. This patch attempts to do basic/easy tuning. More refinement is certainly possible, especially when a legal CTPOP is available but with the "wrong" type/size.
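For readers who have not looked at D89952, the trade-off being tuned can be sketched roughly as follows. This is a minimal, self-contained illustration and not the patch itself. The helper names `popcountEmulated`, `popcountLessThan`, and `popcountGreaterThan` are hypothetical. It also assumes the hook bounds how many clear-lowest-set-bit steps (`x & (x - 1)`) are acceptable before falling back to a full popcount emulation.

```cpp
#include <cassert>
#include <cstdint>

// Generic bit-twiddling emulation of a 32-bit population count: the kind of
// sequence a target may fall back to when CTPOP is not legal.
inline unsigned popcountEmulated(uint32_t X) {
  X = X - ((X >> 1) & 0x55555555u);                 // 2-bit partial sums
  X = (X & 0x33333333u) + ((X >> 2) & 0x33333333u); // 4-bit partial sums
  X = (X + (X >> 4)) & 0x0F0F0F0Fu;                 // 8-bit partial sums
  return (X * 0x01010101u) >> 24;                   // add the four bytes
}

// Hypothetical helper: decide "popcount(X) < K" without computing the count,
// by clearing the lowest set bit K - 1 times and testing for zero.
inline bool popcountLessThan(uint32_t X, unsigned K) {
  if (K == 0)
    return false; // no count is less than zero
  for (unsigned I = 1; I < K; ++I)
    X &= X - 1; // clears the lowest set bit; a no-op once X is zero
  return X == 0;
}

// Hypothetical helper: decide "popcount(X) > K" by clearing the lowest set
// bit K times and testing for a remaining bit.
inline bool popcountGreaterThan(uint32_t X, unsigned K) {
  for (unsigned I = 0; I < K; ++I)
    X &= X - 1;
  return X != 0;
}
```

For a small `K`, the loop unrolls into a few AND/SUB pairs and needs no mask constants. For a large `K`, the fixed-length emulation wins. The per-type thresholds discussed in this review are about where that crossover sits.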
Repository: rG LLVM Github Monorepo
Event Timeline
Fixed the malformed diff/patch.
Also, PPC reviewers would be appreciated. Thanks for any recommendations.
Adding more potential reviewers.
These are two independent patches in one proposal, so you may want to split it to avoid waiting on approval for both targets.
I made some general x86 comments, but someone who knows AVX512 better should have a look at those test diffs in particular.
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
5347 | We should have a code comment here to describe approximately what the difference in x86 asm will be with these settings. As-is, { 2, 4, 6, 10, 14 } is a set of magic numbers. We need to explain why a particular data type (i8, etc.) is treated differently from other types. | |
llvm/test/CodeGen/X86/vector-popcnt-128-ult-ugt.ll | ||
---|---|---|
473–484 | Are we saying that the longer sequence is better because it avoids the constant loads and/or potentially expensive vpshufb instructions? |
llvm/lib/Target/X86/X86ISelLowering.cpp | ||
---|---|---|
5335 | How are we able to make that assertion? I don't see anything in the caller that filters out scalar types. We probably need at least 1 scalar test as a sanity check if the plan is that this optimization is vector-only for x86. |
Sorry, I got distracted. If it were up to me, I'd prefer tuning numbers that are unambiguously better than the current codegen without needing a detailed check. In particular, the codegen for KNL is awful. Furthermore, I think that fine-tuning this further has diminishing returns for various reasons. Are people okay with that?
The PPC changes are fine. The code is not only slightly better, but all current PPC CPUs have a fast HW implementation of popcnt. Please note, of course, that my approval is only for the PPC part and should not be understood as an approval of the X86 changes.