This is an archive of the discontinued LLVM Phabricator instance.

[X86][PPC] Tune SelectionDAG CTPOP emulation via TLI hook
AcceptedPublic

Authored by davezarzycki on Nov 9 2020, 10:55 AM.

Details

Summary

When CTPOP is not legal, emulating it can be quite expensive. In D89952, a TLI hook was added to allow targets to tune how expensive CTPOP emulation is. This patch attempts to do basic/easy tuning. More refinement is certainly possible, especially when a legal CTPOP is available but with the "wrong" type/size.

Diff Detail

Event Timeline

davezarzycki created this revision.Nov 9 2020, 10:55 AM
davezarzycki requested review of this revision.Nov 9 2020, 10:55 AM

Fixed the malformed diff/patch.

Also, PPC reviewers would be appreciated. Thanks for any recommendations.

Adding more potential reviewers.
This is 2 independent patches in 1 proposal, so you may want to split it to avoid waiting on approval for both targets.
I made some general x86 comments, but someone that knows AVX512 better should have a look at those test diffs in particular.

llvm/lib/Target/X86/X86ISelLowering.cpp
5347

We should have a code comment here to describe approximately what the difference in x86 asm will be with these settings. As-is { 2, 4, 6, 10, 14 } is a set of magic numbers. Need to explain why a particular data type (i8, etc) is treated differently than other types.

llvm/test/CodeGen/X86/vector-popcnt-128-ult-ugt.ll
473–484

Are we saying that the longer sequence is better because it avoids the constant loads and/or potentially expensive vpshufb's?
Is there experimental perf data or llvm-mca to back that up?
If not, we might opt for a more conservative setting as a first step, and then allow this case in a follow-up patch.

spatel added inline comments.Nov 12 2020, 5:28 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
5335

How are we able to make that assertion? I don't see anything in the caller that filters out scalar types. We probably need at least 1 scalar test as a sanity check if the plan is that this optimization is vector-only for x86.

@davezarzycki What is happening with this patch and PR47825 ?

RKSimon retitled this revision from [X86 and PPC] Tune SelectionDAG CTPOP emulation via TLI hook to [X86][PPC] Tune SelectionDAG CTPOP emulation via TLI hook.Apr 11 2021, 1:34 PM

Sorry, I got distracted. If it were up to me, I'd prefer tuning numbers that are, without checking, unambiguously better than the current code gen. In particular the code gen for KNL is awful. Furthermore, I think that fine tuning this further has diminishing returns for various reasons. Are people okay with that?

nemanjai accepted this revision.Apr 19 2021, 7:01 AM

The PPC changes are fine. The code is not only slightly better, but all current PPC CPU's have a fast HW implementation of popcnt. Please note of course that my approval is only for the PPC part and should not be understood to be an approval of the X86 changes.

This revision is now accepted and ready to land.Apr 19 2021, 7:01 AM
llvm/test/CodeGen/X86/vector-popcnt-256-ult-ugt.ll