This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Add cost model values for CTPOP of vectors
ClosedPublic

Authored by RKSimon on Jul 18 2016, 3:48 AM.

Details

Summary

This patch adds costs for the vectorized implementations of CTPOP. The default values were seriously underestimating the cost of these and were encouraging vectorization on targets where serialized use of POPCNT would be much better.
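
For context, the X86 cost model expresses these values as table entries keyed on an (operation, vector type) pair. Below is a minimal standalone sketch of that lookup scheme; the enum tags, table name, and cost values are illustrative placeholders rather than the patch's actual numbers, which live in lib/Target/X86/X86TargetTransformInfo.cpp.

  // Standalone sketch (not LLVM code) of the cost-table lookup scheme
  // used by the X86 TTI. All names and cost values are illustrative.
  struct CostTblEntry {
    int ISD;       // operation tag, e.g. CTPOP
    int Type;      // vector type tag, e.g. v4i32
    unsigned Cost; // estimated cost in abstract throughput units
  };

  enum { ISD_CTPOP = 1 };
  enum { MVT_v2i64 = 1, MVT_v4i32, MVT_v8i16, MVT_v16i8 };

  // Hypothetical SSE-level entries; the real values are in the patch.
  static const CostTblEntry SSECostTbl[] = {
      {ISD_CTPOP, MVT_v2i64, 7},
      {ISD_CTPOP, MVT_v4i32, 11},
      {ISD_CTPOP, MVT_v8i16, 9},
      {ISD_CTPOP, MVT_v16i8, 6},
  };

  // Linear scan, mirroring what a cost-table lookup does.
  static const CostTblEntry *lookupCost(int ISD, int Type) {
    for (const CostTblEntry &E : SSECostTbl)
      if (E.ISD == ISD && E.Type == Type)
        return &E;
    return nullptr; // no entry: fall back to the default cost model
  }

Raising the per-type costs in such a table is what steers the vectorizers away from a vector CTPOP when serialized POPCNT would be cheaper.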

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon updated this revision to Diff 64294.Jul 18 2016, 3:48 AM
RKSimon retitled this revision from to [X86][SSE] Add cost model values for CTPOP of vectors.
RKSimon updated this object.
RKSimon added reviewers: mkuper, ab, silvas, spatel, andreadb.
RKSimon set the repository for this revision to rL LLVM.
RKSimon added a subscriber: llvm-commits.
spatel added inline comments.Jul 18 2016, 11:11 AM
lib/Target/X86/X86TargetTransformInfo.cpp
974–977

Add a general comment to explain why we have these numbers? Also add a comment in LowerVectorCTPOP() that the TTI cost model should be updated if the algorithm changes.

test/Transforms/SLPVectorizer/X86/ctpop.ll
3–5

Can use -mattr=sse4.2 / avx / avx2 instead of -mcpu?
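
A sketch of what that might look like, assuming an x86_64 triple and FileCheck prefixes along these lines (hypothetical; the committed test may differ):

  ; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+sse4.2 -slp-vectorizer -S | FileCheck %s --check-prefix=SSE42
  ; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx -slp-vectorizer -S | FileCheck %s --check-prefix=AVX
  ; RUN: opt < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx2 -slp-vectorizer -S | FileCheck %s --check-prefix=AVX2

Using -mattr pins the test to the feature being costed rather than to whatever a particular -mcpu happens to imply.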

silvas edited edge metadata.Jul 18 2016, 2:46 PM

Is the plan to make these costs also dependent on host CPU? For example, IIRC the vector ctpop lowerings have serially dependent pshufb's, which are 1-cycle latency on big Intel cores but 4-cycle latency on Jaguar according to Agner.
Also, on Jaguar scalar popcnt is "as cheap as an add", but on e.g. Skylake scalar popcnt has a quarter of the throughput of an add and 3x its latency.

> Is the plan to make these costs also dependent on host CPU? For example, IIRC the vector ctpop lowerings have serially dependent pshufb's, which are 1-cycle latency on big Intel cores but 4-cycle latency on Jaguar according to Agner.

Someday; right now I think the priority is to get down to one set of cost tables. Sanjay knows this better than I do, but IIRC there are 3 or 4 separate sets of costs in the codebase: some are based on approximate latency, others (like this one) on the throughput of recent big Intel cores. I don't think any of them use the scheduler models or anything overly target-specific.

> Also, on Jaguar scalar popcnt is "as cheap as an add", but on e.g. Skylake scalar popcnt has a quarter of the throughput of an add and 3x its latency.

I haven't added scalar throughput costs here; the costs are for the vector implementations, which as you say are dominated by PSHUFB. I was put off dealing with the scalars by TargetTransformInfo::PopcntSupportKind, which seems to be trying to do something similar.
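
For context, the vector lowering is a nibble lookup built on PSHUFB: split each byte into two 4-bit halves and use PSHUFB as a 16-entry table lookup of per-nibble popcounts. A standalone sketch with SSE intrinsics, assuming SSSE3 is available (the helper name is illustrative, not LLVM's actual code):

  #include <tmmintrin.h> // SSSE3: _mm_shuffle_epi8 (PSHUFB)

  static __m128i popcount_epi8(__m128i V) {
    // 16-entry LUT: popcount of each possible 4-bit value.
    const __m128i Lookup = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                         1, 2, 2, 3, 2, 3, 3, 4);
    const __m128i LowMask = _mm_set1_epi8(0x0f);
    __m128i Lo = _mm_and_si128(V, LowMask);                    // low nibbles
    __m128i Hi = _mm_and_si128(_mm_srli_epi16(V, 4), LowMask); // high nibbles
    __m128i CntLo = _mm_shuffle_epi8(Lookup, Lo);              // PSHUFB #1
    __m128i CntHi = _mm_shuffle_epi8(Lookup, Hi);              // PSHUFB #2
    return _mm_add_epi8(CntLo, CntHi);                         // per-byte counts
  }

Wider elements then horizontally sum these byte counts (e.g. roughly via PSADBW for i64), which is why the cost grows with element width and why PSHUFB latency/throughput dominates the sequence.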

spatel edited edge metadata.Jul 18 2016, 4:48 PM
spatel added a subscriber: mzolotukhin.

A couple of links for reference...

  1. @mkuper mentioned changing the scalar cost of an op here:

https://llvm.org/bugs/show_bug.cgi?id=28434#c4

  2. There was some discussion about the various cost models here:

https://llvm.org/bugs/show_bug.cgi?id=26837

@mzolotukhin made a good point there: it's not clear whether we actually should differentiate for CPUs at this level (this is IR after all). A better solution might be to have the most conservative or the most common values as part of the cost model here, and then expect the backend to fix that up for particular CPU models.

The scary part about that currently is how bad the backend is at deconstructing too-wide vector IR. This shows up even today in the semi-legal case of AVX, where we have 256-bit registers but no 256-bit integer ops, and it causes trouble in several cases I've seen. That is, we would have done much better if we had just pretended that we only have 128-bit registers for those targets.

RKSimon updated this revision to Diff 64494.Jul 19 2016, 7:15 AM
RKSimon edited edge metadata.

Updated based on Sanjay's comments

spatel accepted this revision.Jul 19 2016, 2:33 PM
spatel edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Jul 19 2016, 2:33 PM
This revision was automatically updated to reflect the committed changes.