We're seeing more and more vectorizer perf regressions that turn out to be issues with values reported from the cost tables.
I've written this (admittedly hacky) helper script to compare the estimated (worst-case) costs reported by the cost tables against groups of similar CPUs, represented by their scheduler models (+ llvm-mca). For each common IR instruction/intrinsic + type (up to a CPU's maximum vector width), it generates the IR/assembly, runs llvm-mca, and compares the reciprocal-throughput costs against 'opt --analyze --cost-model', reporting when the cost model doesn't match the worst-case value across the CPUs in a given 'level' (e.g. avx1: btver2/bdver2/sandybridge). Only reciprocal-throughput costs are handled at this time.
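The core of that check can be sketched as below. This is a minimal illustration, not the script's actual code: the function names and the tolerance are my own assumptions, standing in for the real generate/llvm-mca/opt plumbing.

```python
# Hypothetical sketch of the per-op comparison: take the reciprocal
# throughput reported by llvm-mca for each CPU in a "level" and flag a
# mismatch if the cost-model value doesn't match the worst of them.

def worst_case_rthroughput(mca_rthroughputs):
    """Worst case = the largest (slowest) reciprocal throughput in the group."""
    return max(mca_rthroughputs.values())

def cost_matches(cost_model_value, mca_rthroughputs, tolerance=0.5):
    """True if the cost-table value agrees with the worst-case CPU (assumed tolerance)."""
    worst = worst_case_rthroughput(mca_rthroughputs)
    return abs(cost_model_value - worst) <= tolerance

# e.g. the avx1 level mentioned above, with made-up rthroughput numbers
avx1 = {"btver2": 1.0, "bdver2": 2.0, "sandybridge": 1.0}
print(cost_matches(2, avx1))  # cost table says 2, worst case is 2.0 -> True
print(cost_matches(1, avx1))  # cost table says 1, worst case is 2.0 -> False
```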
If run without any args, the script will exhaustively (and slowly) test every cpulevel for every IR op/type - you can specify a cpulevel and/or op to better focus the test runs.
The script writes out the same 4 temp files to the cwd on each iteration (fuzz.ll, fuzz.s, analyze.txt and mca.txt); if you use the --stop-on-diff command line argument, you can easily grab these to dump into godbolt.org for triage.
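The driver loop implied by the two points above can be sketched like this. Again a hypothetical outline under stated assumptions: `compare` stands in for the real generate-IR/llvm-mca/opt steps, and the function and parameter names are inventions, not the script's actual API.

```python
# Hypothetical sketch of the iteration loop and --stop-on-diff behaviour:
# each case overwrites the same four files, and stopping at the first diff
# leaves them in the cwd for triage.
FILES = ["fuzz.ll", "fuzz.s", "analyze.txt", "mca.txt"]  # overwritten per case

def fuzz_all(cases, compare, stop_on_diff=False):
    """Run each (op, type, cpulevel) case; return the cases that diffed."""
    diffs = []
    for case in cases:
        # compare() would write the four temp files and return True on a match
        if not compare(case):
            diffs.append(case)
            if stop_on_diff:
                break  # the temp files for this case are left behind
    return diffs

cases = [("add", "v4i32", "avx1"), ("sitofp", "v2i32", "avx1"), ("mul", "v8i16", "avx2")]
bad = {("sitofp", "v2i32", "avx1")}
print(fuzz_all(cases, lambda c: c not in bad, stop_on_diff=True))
# -> [('sitofp', 'v2i32', 'avx1')]
```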
There are still a lot of discrepancies reported - some in the cost tables, others in the scheduler models - and many are obvious (a v2i32->v2f64 sitofp doesn't take 20 cycles...). This script has to be used with due care, starting from the assumption that none of the cost tables, generated assembly, or models/llvm-mca reports are perfectly correct.
This is very much a WIP (just count the TODO comments...), but I wanted to get this out so people can check my reasoning as I continue to develop it. There's plenty still to do before this is ready to be committed.
I've written this primarily for x86, but I can't see much that would make it tricky to support other targets.
Please don't judge me on my rubbish python skills :)