This is an archive of the discontinued LLVM Phabricator instance.

[WIP][RFC][Utils] Helper script to check sanity of cost tables vs scheduler models
Changes Planned · Public

Authored by RKSimon on Jun 4 2021, 5:27 AM.

Details

Summary

We're seeing more and more perf regressions that turn out to be issues with values reported from the cost tables.

I've written this (admittedly hacky) helper script to compare the estimated (worst case) costs reported by the cost tables against groups of similar CPUs represented by their scheduler models (+ llvm-mca). For each common IR instruction/intrinsic + type (up to a CPU's maximum vector width), it generates the IR/assembly, runs llvm-mca on it, and compares the result against the cost from 'opt --analyze --cost-model'. It reports a diff whenever the cost model doesn't match the worst-case value across the CPUs in a given 'level' (e.g. avx1 - btver2/bdver2/sandybridge).
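A minimal sketch of the comparison described above, not the script's actual code. The function name, the dictionary layout and the numbers are all hypothetical; only the CPU names come from the avx1 example in the summary.

```python
def find_cost_diffs(table_costs, mca_costs_by_cpu):
    """Compare cost-table values with the worst case across one CPU level.

    table_costs:      {op: cost reported by the cost tables}
    mca_costs_by_cpu: {cpu: {op: cost derived from llvm-mca for that cpu}}
    Returns {op: (table_cost, worst_case, per_cpu_costs)} for every mismatch.
    """
    diffs = {}
    for op, table_cost in table_costs.items():
        # Collect the per-CPU costs for this op, skipping CPUs with no data.
        per_cpu = {cpu: costs[op]
                   for cpu, costs in mca_costs_by_cpu.items() if op in costs}
        if not per_cpu:
            continue
        worst = max(per_cpu.values())  # worst case within the level
        if table_cost != worst:
            diffs[op] = (table_cost, worst, per_cpu)
    return diffs


# Placeholder values purely for illustration, not real measurements.
avx1_level = {
    "btver2": {"op_a": 1, "op_b": 4},
    "bdver2": {"op_a": 2, "op_b": 3},
    "sandybridge": {"op_a": 1, "op_b": 2},
}
cost_table = {"op_a": 2, "op_b": 5}

for op, (table_cost, worst, per_cpu) in find_cost_diffs(cost_table, avx1_level).items():
    print(f"{op}: cost {table_cost} vs worst case {worst} {per_cpu}")
```

With these placeholder inputs only op_b is reported, since the table value for op_a already equals the worst case in the level.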

If run without any args, the script will exhaustively (slowly) test every cpulevel for every IR/type - you can specify cpulevel and/or op to better focus the test runs.

If you use the --stop-on-diff command line argument it will dump the 'fuzz.ll' temp file of the IR where the first cost diff was found, so you can easily grab these to dump into godbolt.org for triage.

There are still a lot of discrepancies reported, some in the cost tables and others in the scheduler models. Many are obvious (v2i32->v2f64 sitofp doesn't take 20 cycles...), but this script has to be used with due care and with the initial assumption that none of the cost tables, generated assembly or models/llvm-mca reports are perfectly correct.

This is very much a WIP (just count the TODO comments...) but I wanted to get this out so people can check my reasoning as I continue to develop this. There's plenty still to do before this is ready to be committed.

I've written this primarily for x86 but can't see much that will make this tricky to support other targets.

Please don't judge me on my rubbish python skills :)

Event Timeline

RKSimon created this revision. · Jun 4 2021, 5:27 AM
RKSimon requested review of this revision. · Jun 4 2021, 5:27 AM
Herald added a project: Restricted Project. · Jun 4 2021, 5:27 AM

High-level comment: I have been thinking about this for a while,
and I'm basically set on at least trying to come up with
an infrastructure to autogenerate a cost model for a CPU.
The differences between worst-case and best-case models
are too great to ignore.

I'm happy to resurrect D46276 someday, but until we actually have accurate, well maintained/tested models for a broad range of CPUs (for instance anything that shows up on https://store.steampowered.com/hwsurvey) we can't rely on them.

Clarification: I was *NOT* talking about just auto-generating the generic cost model as the worst case over all the models we have,
but about having full custom cost models for specific, hand-picked sched models.

Matt added a subscriber: Matt. · Jun 4 2021, 9:02 AM
RKSimon updated this revision to Diff 349898. · Jun 4 2021, 9:37 AM

Thanks to @gbedwell for the python cleanup

RKSimon planned changes to this revision. · Jul 17 2021, 2:35 AM

I think we can't completely trust the reciprocal throughput reported by llvm-mca, since some instructions' RThroughput is not defined correctly in the scheduler models.
e.g.

$ ./llvm_utils_check_cost_tables.py --cpulevel=avx512 --stop-on-diff
double fdiv double: cost (4.0 - 4.0) vs recipthroughput (3 - 3)
skylake-avx512 : 4.0 vs 3

The definition in X86SchedSkylakeServer.td:

def SKXWriteResGroup184 : SchedWriteRes<[SKXPort0,SKXFPDivider]> {
  let Latency = 14;
  let NumMicroOps = 1;
  let ResourceCycles = [1,3];
}
def : SchedAlias<WriteFDiv64,  SKXWriteResGroup184>; // TODO - convert to ZnWriteResFpuPair

However, its measured throughput is 4 according to uops.info, and the llvm-exegesis result is also 4.
I think uops.info/agner.org should be more accurate.
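For reference, the 3 that llvm-mca reports appears to follow directly from the ResourceCycles in the definition above. A rough sketch of the arithmetic, assuming that the reciprocal throughput is the largest cycles-per-unit ratio over the consumed resources and that SKXPort0 and SKXFPDivider each have a single unit (the helper name is hypothetical):

```python
def reciprocal_throughput(resource_cycles, resource_units):
    """Largest (cycles / units) ratio over all resources the write consumes.

    Assumption: this mirrors how the scheduler model value is turned into the
    RThroughput figure; it is not copied from LLVM.
    """
    return max(cycles / units
               for cycles, units in zip(resource_cycles, resource_units)
               if cycles)


# SKXWriteResGroup184: ResourceCycles = [1,3] on [SKXPort0, SKXFPDivider],
# assuming one unit per resource.
skx_fdiv64 = reciprocal_throughput([1, 3], [1, 1])
print(skx_fdiv64)  # 3.0
```

Under those assumptions, matching the measured value of 4 would mean the divider is held for 4 cycles rather than 3 in the model.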

Have you verified the cost diffs against uops.info/agner.org?
We have seen some regressions on our internal benchmarks due to TTI cost-model patches based on this tool...

As I said in the summary - when the script reports a diff you then need to look into it manually to see where the problem is. It's most likely in the cost tables, but the models aren't always great.

At the moment I'm mainly using OSACA (inside godbolt) when I need an alternative analysis to llvm-mca.

If you have access to particular CPUs that we have models for - PLEASE run llvm-exegesis and report any mismatches on Bugzilla - ideally just attach the analysis.html report - https://llvm.org/docs/CommandGuide/llvm-exegesis.html

Herald added a project: Restricted Project. · May 17 2022, 9:19 AM
Herald added a subscriber: StephenFan.
RKSimon updated this revision to Diff 457796. · Sep 3 2022, 7:58 AM
RKSimon edited the summary of this revision.

Latest version of the helper script:

  • added support for all 4 cost kinds
  • pipe stdin/stdout/stderr between stages instead of writing to tmp files
  • llc/opt/llvm-mca calls are now threaded
  • a wider range of x86 CPUs is tested (although many just point back to the SandyBridge model)
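
The piping and threading changes listed above could look roughly like the sketch below. This is an illustration, not the script's actual code: run_pipeline and run_all are hypothetical names, and the demo uses the Python interpreter as a stand-in stage so it runs without an LLVM build. In real use the stages would be llc and llvm-mca command lines.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(stages, text):
    """Feed text through each command in turn via stdin/stdout, no temp files."""
    for cmd in stages:
        result = subprocess.run(cmd, input=text, capture_output=True,
                                text=True, check=True)
        text = result.stdout  # output of this stage is input to the next
    return text


def run_all(jobs, max_workers=4):
    """Run independent (stages, text) jobs on a thread pool, keeping job order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda job: run_pipeline(*job), jobs))


# Stand-in stage: upper-cases stdin. A real stage list would name the LLVM
# tools instead (assumed to be on PATH and to read stdin / write stdout).
upper = [sys.executable, "-c",
         "import sys; sys.stdout.write(sys.stdin.read().upper())"]

if __name__ == "__main__":
    print(run_all([([upper, upper], "fadd"), ([upper], "fdiv")]))
```

Threads are sufficient here because each worker spends its time waiting on a child process rather than executing Python code.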
RKSimon planned changes to this revision. · Sep 3 2022, 7:58 AM