This patch adds costs for the vectorized implementations of CTPOP. The default values seriously underestimated the cost of these and were encouraging vectorization on targets where serialized use of POPCNT would be much better.
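For context, the new costs take the form of per-type cost-table entries in X86TargetTransformInfo.cpp. The sketch below only illustrates the shape of such a table in the style the file already uses for other operations; the table name, the set of types, and the cost numbers are placeholders, not the actual values added by this diff.

```cpp
// Illustrative only: per-type CTPOP costs keyed on the legalized vector type.
static const CostTblEntry AVX2CostTbl[] = {
  { ISD::CTPOP, MVT::v4i64,  7 },
  { ISD::CTPOP, MVT::v8i32,  7 },
  { ISD::CTPOP, MVT::v16i16, 6 },
  { ISD::CTPOP, MVT::v32i8,  3 },
};

// ...in the intrinsic cost query, after legalizing the type to MTy:
if (ST->hasAVX2())
  if (const auto *Entry = CostTableLookup(AVX2CostTbl, ISD, MTy))
    return LT.first * Entry->Cost;
```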
Diff Detail
- Repository: rL LLVM
Event Timeline
lib/Target/X86/X86TargetTransformInfo.cpp
- Lines 970–973 ↗ (On Diff #64294): Add a general comment to explain why we have these numbers? Also add a comment in LowerVectorCTPOP() that the TTI cost model should be updated if the algorithm changes.
test/Transforms/SLPVectorizer/X86/ctpop.ll
- Lines 3–5 ↗ (On Diff #64294): Can we use -mattr=sse4.2 / avx / avx2 instead of -mcpu?
Is the plan to make these costs also dependent on the host CPU? For example, IIRC the vector ctpop lowerings have serially dependent pshufbs, which have 1-cycle latency on big Intel cores but 4-cycle latency on Jaguar, according to Agner.
Also, on Jaguar scalar popcnt is "as cheap as an add", but on e.g. Skylake scalar popcnt has 4x lower throughput than an add and 3x higher latency.
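For reference, the vector lowering being costed here is essentially the nibble-LUT popcount built around PSHUFB. A rough C++ intrinsics sketch of the 128-bit byte-element case is below; it is only meant to show why PSHUFB dominates the cost, not to reproduce the exact sequence LowerVectorCTPOP() emits.

```cpp
#include <immintrin.h> // SSSE3 intrinsics (compile with -mssse3 or later)

// Per-byte popcount via two PSHUFB nibble-LUT lookups. Wider element types
// then reduce these byte counts (e.g. PSADBW is typically used for i64).
static __m128i popcnt_epi8_pshufb(__m128i v) {
  const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                    1, 2, 2, 3, 2, 3, 3, 4);
  const __m128i nibble_mask = _mm_set1_epi8(0x0f);
  __m128i lo = _mm_and_si128(v, nibble_mask);                    // low nibbles
  __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), nibble_mask); // high nibbles
  __m128i cnt_lo = _mm_shuffle_epi8(lut, lo);                    // PSHUFB #1
  __m128i cnt_hi = _mm_shuffle_epi8(lut, hi);                    // PSHUFB #2
  return _mm_add_epi8(cnt_lo, cnt_hi);
}
```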
Someday - I think the priority is to get down to one set of cost tables. Sanjay knows this better than I do, but IIRC there are 3 or 4 separate sets of costs in the codebase - some based on approximate latency, others (like this one) on the throughput of recent big Intel cores. I don't think any of them use the scheduler models or anything overly target-specific.
> Also, on Jaguar scalar popcnt is "as cheap as an add", but on e.g. Skylake scalar popcnt has 4x lower throughput than an add and 3x higher latency.
I haven't added scalar throughput costs here; the costs are for the vector implementations, which, as you say, are dominated by PSHUFB. I was put off dealing with the scalars by TargetTransformInfo::PopcntSupportKind, which seems to be trying to do something similar.
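For anyone unfamiliar with it, this is roughly what that existing scalar-side hook does on X86 (a paraphrased sketch, not a verbatim copy of the current code): it only distinguishes "has a POPCNT instruction" from "software fallback", with no throughput or latency information attached.

```cpp
// Paraphrased sketch of the X86 TTI hook for scalar popcount support.
TargetTransformInfo::PopcntSupportKind
X86TTIImpl::getPopcntSupport(unsigned TyWidth) {
  assert(isPowerOf2_32(TyWidth) && "type width must be a power of 2");
  return ST->hasPOPCNT() ? TargetTransformInfo::PSK_FastHardware
                         : TargetTransformInfo::PSK_Software;
}
```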
A couple of links for reference...
- @mkuper mentioned changing the scalar cost of an op here: https://llvm.org/bugs/show_bug.cgi?id=28434#c4
- There was some discussion about the various cost models here: https://llvm.org/bugs/show_bug.cgi?id=26837
@mzolotukhin made a good point there: it's not clear whether we actually should differentiate for CPUs at this level (this is IR after all). A better solution might be to have the most conservative or the most common values as part of the cost model here, and then expect the backend to fix that up for particular CPU models.
The scary part about that currently is how bad the backend is at deconstructing too-wide vector IR code. This shows up even today in the semi-legal case of AVX, where we have 256-bit registers but no 256-bit integer ops. It causes trouble in several cases I've seen; i.e., we would have done much better if we had just pretended that we only have 128-bit registers for those targets.