Page MenuHomePhabricator

[X86 and PPC] Tune SelectionDAG CTPOP emulation via TLI hook
Needs ReviewPublic

Authored by davezarzycki on Nov 9 2020, 10:55 AM.

Details

Summary

When CTPOP is not legal, emulating it can be quite expensive. In D89952, a TLI hook was added to allow targets to tune how expensive CTPOP emulation is. This patch attempts to do basic/easy tuning. More refinement is certainly possible, especially when a legal CTPOP is available but with the "wrong" type/size.

Diff Detail

Event Timeline

davezarzycki created this revision.Nov 9 2020, 10:55 AM
davezarzycki requested review of this revision.Nov 9 2020, 10:55 AM

Fixed the malformed diff/patch.

Also, PPC reviewers would be appreciated. Thanks for any recommendations.

Adding more potential reviewers.
This is 2 independent patches in 1 proposal, so you may want to split it to avoid waiting on approval for both targets.
I made some general x86 comments, but someone that knows AVX512 better should have a look at those test diffs in particular.

llvm/lib/Target/X86/X86ISelLowering.cpp
5347

We should have a code comment here to describe approximately what the difference in x86 asm will be with these settings. As-is { 2, 4, 6, 10, 14 } is a set of magic numbers. Need to explain why a particular data type (i8, etc) is treated differently than other types.

llvm/test/CodeGen/X86/vector-popcnt-128-ult-ugt.ll
473–484

Are we saying that the longer sequence is better because it avoids the constant loads and/or potentially expensive vpshufb's?
Is there experimental perf data or llvm-mca to back that up?
If not, we might opt for a more conservative setting as a first step, and then allow this case in a follow-up patch.

spatel added inline comments.Nov 12 2020, 5:28 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
5335

How are we able to make that assertion? I don't see anything in the caller that filters out scalar types. We probably need at least 1 scalar test as a sanity check if the plan is that this optimization is vector-only for x86.