This is an archive of the discontinued LLVM Phabricator instance.

[WIP][RFC][Utils] Helper script to check sanity of cost tables vs scheduler models
Changes Planned · Public

Authored by RKSimon on Jun 4 2021, 5:27 AM.


Details

Reviewers

craig.topper
lebedev.ri
xbolva00
gbedwell
andreadb
ABataev
greened
anton-afanasyev
courbet
gchatelet

Summary

We're seeing more and more perf regressions that turn out to be issues with values reported from the cost tables.

I've written this (admittedly hacky) helper script to compare the estimated (worst case) costs reported by the cost tables against groups of similar CPUs represented by their scheduler models (+ llvm-mca). For each common IR instruction/intrinsic + type (up to a CPU's maximum vector width) it generates the IR/assembly, runs llvm-mca, compares the result against the costs from 'opt --analyze --cost-model', and reports if the cost model doesn't match the worst case value reported by the CPUs in a given 'level' (e.g. avx1 - btver2/bdver2/sandybridge).
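As a rough illustration of the "instruction + type, up to the maximum vector width" enumeration, here is a minimal sketch for integer types. The names `vector_type` and `candidate_types` are made up for this example and are not the script's own helpers.

```python
def vector_type(elementcount, base):
    """Return an LLVM IR type string; an elementcount of 0 means scalar."""
    if elementcount == 0:
        return base
    return f"<{elementcount} x {base}>"


def candidate_types(maxwidth, basewidths=(8, 16, 32, 64)):
    """Yield integer types whose total bit width fits within maxwidth."""
    for basewidth in basewidths:
        # Scalar plus power-of-2 element counts only.
        for elementcount in (0, 2, 4, 8, 16, 32, 64):
            if basewidth * elementcount <= maxwidth:
                yield vector_type(elementcount, f"i{basewidth}")


# Candidate IR lines for 'add' at a 128-bit (SSE-style) level.
for ty in candidate_types(128):
    print(f"%result = add {ty} %a0, %a1")
```

Each generated line would then be wrapped in a function body and handed to the tools for every CPU in the level.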

If run without any args, the script will exhaustively (slowly) test every cpulevel for every IR/type - you can specify cpulevel and/or op to better focus the test runs.

If you use the --stop-on-diff command line argument it will dump the 'fuzz.ll' temp file of the IR where the first cost diff was found, so you can easily grab these to dump into godbolt.org for triage.

There are still a lot of discrepancies reported, some in the cost tables and others in the scheduler models. Many are obvious (v2i32->v2f64 sitofp doesn't take 20 cycles...), but this script has to be used with due care and with the initial assumption that none of the cost tables, generated assembly or models/llvm-mca reports are perfectly correct.
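The comparison itself boils down to checking the worst case on each side across the CPUs of a level. A minimal sketch, assuming both sides are available as cpu -> cost dictionaries (`find_diff` is a hypothetical name, and the numbers are illustrative, not measured):

```python
def find_diff(table_costs, mca_costs):
    """Compare worst-case values across the CPUs of one level.

    Both arguments map cpu name -> cost. Returns None when the maxima
    agree, otherwise a (table_max, mca_max) tuple.
    """
    table_max = max(table_costs.values())
    mca_max = max(mca_costs.values())
    if table_max == mca_max:
        return None
    return (table_max, mca_max)


# Illustrative numbers only, not real cost table or llvm-mca values.
table = {"btver2": 4, "bdver2": 4, "sandybridge": 4}
mca = {"btver2": 2, "bdver2": 3, "sandybridge": 2}
print(find_diff(table, mca))
```

A reported diff only says the two sides disagree; it does not say which side is wrong.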

This is very much a WIP (just count the TODO comments...) but I wanted to get this out so people can check my reasoning as I continue to develop this. There's plenty still to do before this is ready to be committed.

I've written this primarily for x86 but can't see much that would make it tricky to support other targets.

Please don't judge me on my rubbish Python skills :)

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

RKSimon created this revision.Jun 4 2021, 5:27 AM

Herald added a subscriber: pengfei. Jun 4 2021, 5:27 AM

RKSimon requested review of this revision.Jun 4 2021, 5:27 AM

Herald added a project: Restricted Project. Jun 4 2021, 5:27 AM

High-level comment: I have been thinking about this for a while, and I'm basically set on at least trying to come up with infrastructure to autogenerate a cost model for a CPU. The differences between the worst-case and best-case models are too great to ignore.

I'm happy to resurrect D46276 someday, but until we actually have accurate, well maintained/tested models for a broad range of CPUs (for instance anything that shows up on https://store.steampowered.com/hwsurvey) we can't rely on them.

In D103695#2798852, @RKSimon wrote:

I'm happy to resurrect D46276 someday, but until we actually have accurate, well maintained/tested models for a broad range of CPUs (for instance anything that shows up on https://store.steampowered.com/hwsurvey) we can't rely on them.

Clarification: I was *NOT* talking about just autogenerating the generic cost model as the worst case over all the models we have, but about having full custom cost models for specific, hand-picked sched models.

Harbormaster completed remote builds in B107661: Diff 349830.Jun 4 2021, 6:10 AM

Matt added a subscriber: Matt.Jun 4 2021, 9:02 AM

Thanks to @gbedwell for the Python cleanup

Harbormaster completed remote builds in B107697: Diff 349898.Jun 4 2021, 10:34 AM

RKSimon mentioned this in rG49d3a367c037: [CostModel][X86] Improve AVX1/AVX2 truncation costs.Jun 8 2021, 2:42 AM

RKSimon mentioned this in D103925: [X86][SSE] Support 64-bit vectorization (WIP).Jun 8 2021, 1:34 PM

tschuett added a subscriber: tschuett.Jun 8 2021, 1:37 PM

RKSimon mentioned this in rG47941d601deb: [CostModel][X86] Adjust fp<->int vXi32 AVX1+ costs based on llvm-mca reports.Jun 30 2021, 7:34 AM

RKSimon mentioned this in rG5e5ba14b4d83: [CostModel][X86] Adjust fp<->int vXi32 SSE legalized costs based on llvm-mca….Jul 1 2021, 7:34 AM

RKSimon mentioned this in rGcdca1785d35f: [CostModel][X86] Adjust uitofp(vXi64) SSE/AVX legalized costs based on llvm-mca….Jul 2 2021, 5:14 AM

RKSimon mentioned this in rGd181fd918d18: [CostModel][X86] Drop some hard coded fp<->int scalarization costs.Jul 2 2021, 6:52 AM

RKSimon mentioned this in D89697: [X86] Implement smarter instruction lowering for FP_TO_UINT from vXf32/vXf64 to vXi32 for SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction..Jul 4 2021, 12:42 PM

RKSimon mentioned this in rGa7da0296a663: [CostModel][X86] Adjust sitofp/uitofp SSE/AVX legalized costs based on llvm-mca….Jul 7 2021, 4:24 AM

RKSimon mentioned this in rG4c7e9a385293: [CostModel][X86] Adjust sext/zext SSE/AVX legalized costs based on llvm-mca….Jul 7 2021, 5:58 AM

RKSimon mentioned this in rG9dbeac16ba9b: [X86] ReplaceNodeResults - fp_to_sint/uint - manually widen v2i32 results to….Jul 9 2021, 4:08 AM

RKSimon mentioned this in rG96b4117d5155: [CostModel][X86] Adjust truncate SSE/AVX legalized costs based on llvm-mca….Jul 12 2021, 5:50 AM

RKSimon mentioned this in rGae0d73ac3bb8: [CostModel][X86] Adjust fptosi/fptoui SSE/AVX legalized costs based on llvm-mca….Jul 12 2021, 12:42 PM

RKSimon planned changes to this revision.Jul 17 2021, 2:35 AM

RKSimon mentioned this in rGe1bdb5795879: [CostModel][X86] Adjust shift SSE legalized costs based on llvm-mca reports..Jul 22 2021, 10:14 AM

RKSimon mentioned this in rG4185c5502c81: [CostModel][X86] Adjust shift SSE4 legalized costs based on llvm-mca reports..Jul 22 2021, 12:10 PM

I think we can't completely trust the reciprocal throughput reported by llvm-mca, since the RThroughput of some instructions is not defined correctly in the scheduler models.
For example:

$ ./llvm_utils_check_cost_tables.py --cpulevel=avx512 --stop-on-diff
double fdiv double: cost (4.0 - 4.0) vs recipthroughput (3 - 3)
skylake-avx512 : 4.0 vs 3

The definition in X86SchedSkylakeServer.td:

def SKXWriteResGroup184 : SchedWriteRes<[SKXPort0,SKXFPDivider]> {
  let Latency = 14;
  let NumMicroOps = 1;
  let ResourceCycles = [1,3];
}
def : SchedAlias<WriteFDiv64,  SKXWriteResGroup184>; // TODO - convert to ZnWriteResFpuPair

However, its measured throughput is 4 according to uops.info, and the llvm-exegesis throughput result is also 4.
I think uops.info/agner.org should be more accurate.

Have you verified the cost diffs against uops.info/agner.org?
We have seen some regressions on our internal benchmarks due to TTI cost-model patches based on this tool...
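To make the arithmetic behind the "3" explicit: as far as I understand it, the per-instruction reciprocal throughput follows from the largest ratio of resource cycles to available units. The sketch below is a simplification under that assumption; the unit counts (one each) are assumed rather than taken from the .td file, and llvm-mca's block figure also depends on dispatch width, which is ignored here.

```python
from fractions import Fraction


def reciprocal_throughput(resource_cycles, num_units):
    """Largest (cycles consumed / units available) over all resources."""
    return max(
        Fraction(cycles, num_units[name])
        for name, cycles in resource_cycles.items()
    )


# ResourceCycles = [1,3] for [SKXPort0, SKXFPDivider].
cycles = {"SKXPort0": 1, "SKXFPDivider": 3}
# Assumption: a single unit of each resource.
units = {"SKXPort0": 1, "SKXFPDivider": 1}
print(reciprocal_throughput(cycles, units))
```

Under these assumptions the divider's 3 cycles dominate, which matches the reported value of 3 rather than the measured 4.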

As I said in the summary - when the script reports a diff you then need to start looking into it manually to see where the problem is; it's most likely in the cost tables, but the models aren't always great.

At the moment I'm mainly using osaca (inside godbolt) when I need an alternative analysis to llvm-mca

If you have access to particular CPUs that we have models for - PLEASE run llvm-exegesis and report any mismatches on bugzilla - ideally just attach the analysis.html report - https://llvm.org/docs/CommandGuide/llvm-exegesis.html

RKSimon mentioned this in rG6ba0b9f68ac9: [X86][SLM] Fix PBLENDVB uops and throughput.Sep 3 2021, 3:32 AM

RKSimon mentioned this in rG7d062d2c478b: [X86][Atom] MUL/DIV instructions require both ports, not either..Sep 4 2021, 3:58 AM

RKSimon mentioned this in rGda965a77d566: [X86][SLM] Fix MUL uops, latency and throughput.Sep 4 2021, 5:22 AM

RKSimon mentioned this in rG994da6570769: [X86][SLM] WriteVecIMul instructions only take 1uop.

RKSimon mentioned this in rG2005ae15a66d: [X86][SLM] WriteVecIMul instructions only take 1uop (REAPPLIED).Sep 4 2021, 7:10 AM

RKSimon mentioned this in rG484944ac3b10: [X86][SLM] Fix HADD/HSUB uops, latency and throughput.Sep 11 2021, 3:44 AM

RKSimon mentioned this in rGdf975e459008: [X86][SLM] Fix PSAD/MPSAD uops, latency and throughput.

RKSimon mentioned this in rG0767e43d8745: [CostModel][X86] Adjust bitreverse/ctpop/ctlz/cttz AVX2+ costs based on llvm….Sep 15 2021, 5:17 AM

RKSimon mentioned this in rG5ebe95e25673: [X86][Atom] Fix integer shuffles uops, latency and throughput.Sep 17 2021, 4:13 AM

RKSimon mentioned this in rGf855ef260148: [X86][Atom] Fix FP uops + port usage.Sep 19 2021, 12:58 PM

RKSimon mentioned this in D111460: [X86][LoopVectorize] "Fix" `X86TTIImpl::getAddressComputationCost()`.Oct 19 2021, 9:39 AM

RKSimon mentioned this in D114314: [X86][TTI] Costmodel for AVX512DQ's VPMOVM2[DQ] / VPMOV[DQ]2M instructions.Nov 20 2021, 8:45 AM

RKSimon mentioned this in rG5eb47961c42c: [CostModel][X86] Update ROTL/ROTR vXi8/vXi16 costs on AVX512BW targets.Jan 10 2022, 5:19 AM

RKSimon mentioned this in rGd663166acbe5: [CostModel][X86] Reduce cost of v2i64 icmp base cost on SSE2 targets.Mar 30 2022, 1:13 AM

RKSimon mentioned this in rGc2964746e339: [CostModel][X86] Reduce cost of vector selects on SSE2/AVX1 targets.May 1 2022, 1:32 AM

RKSimon mentioned this in rGf0e8c1d6d99e: [CostModel][X86] Adjust 256-bit select costs to account for slow BLENDV op.May 6 2022, 3:28 AM

RKSimon mentioned this in rGd21bf514940f: [CostModel][X86] Adjust pre-SSE41 fp scalar select costs to account for vector….May 6 2022, 3:42 AM

RKSimon mentioned this in rGcbfa85734632: [CostModel][X86] Adjust 128-bit select costs to account for slow BLENDV op.May 6 2022, 5:08 AM

tim.schmielau added a subscriber: tim.schmielau.May 17 2022, 9:19 AM

Herald added a project: Restricted Project. May 17 2022, 9:19 AM

Herald added a subscriber: StephenFan.

RKSimon mentioned this in D79483: [CostModel] Replace getUserCost with getInstructionCost..Aug 17 2022, 7:30 AM

RKSimon mentioned this in D132216: [CostModel][X86] Support cost kind specific look up tables.Aug 19 2022, 3:18 AM

RKSimon mentioned this in rG45846854a2c1: [CostModel][X86] Support cost kind specific look up tables.Aug 25 2022, 4:34 AM

RKSimon mentioned this in rG3edec9ba602c: [CostModel][X86] Support cost kind specific look up tables (REAPPLIED).Aug 25 2022, 8:49 AM

RKSimon mentioned this in rGad16f3e41354: [CostModel][X86] Add CostKinds handling for fadd/fsub/fneg ops.Sep 2 2022, 3:50 AM

RKSimon mentioned this in rG11765b77be84: [CostModel][X86] Add CostKinds handling for fmul ops.Sep 2 2022, 9:05 AM

RKSimon mentioned this in rG0735200e3f50: [CostModel][X86] Add CostKinds handling for fmul ops.Sep 3 2022, 2:50 AM

RKSimon mentioned this in rG5aee2726d8a6: [CostModel][X86] Add CostKinds handling for fdiv ops.Sep 3 2022, 7:51 AM

Latest version of the helper script:

added support for all 4 cost kinds
stdin/stdout/stderr are now piped between stages instead of writing to tmp files
llc/opt/llvm-mca calls are now threaded
wider range of x86 CPUs tested (although many just point back to the SandyBridge model)
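The piping and threading changes can be sketched in isolation as follows. This is a toy pipeline, not the script's code: the two stages are Python one-liners standing in for llc and llvm-mca, and `run_stage`/`pipeline` are hypothetical names.

```python
import concurrent.futures
import subprocess
import sys


def run_stage(code, text):
    """Run a Python one-liner as a child process, feeding text on stdin."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        input=text,
        text=True,
        capture_output=True,
        check=True,
    )
    return proc.stdout


def pipeline(text):
    # Stage 1 output becomes stage 2 input without touching the filesystem.
    upper = run_stage("import sys; sys.stdout.write(sys.stdin.read().upper())", text)
    return run_stage("import sys; sys.stdout.write(sys.stdin.read()[::-1])", upper)


inputs = ["add", "sub", "mul"]
# Each pipeline mostly waits on child processes, so threads are sufficient.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = {name: pool.submit(pipeline, name) for name in inputs}
results = {name: fut.result() for name, fut in futures.items()}
print(results)
```

Collecting the futures in a dictionary keyed by input keeps the reporting order deterministic even though the jobs finish in arbitrary order.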

RKSimon planned changes to this revision.Sep 3 2022, 7:58 AM

Harbormaster completed remote builds in B184959: Diff 457796.Sep 3 2022, 8:44 AM

RKSimon mentioned this in rG114b7762a996: [CostModel][X86] Add CostKinds handling for add/sub ops.Sep 3 2022, 10:45 AM

RKSimon mentioned this in rGc444af1c20b3: [CostModel][X86] Add CostKinds handling for mul ops.Sep 4 2022, 4:07 AM

RKSimon mentioned this in rG8534f514747d: [CostModel][X86] Add CostKinds handling for sqrt intrinsicc.Sep 4 2022, 10:39 AM

RKSimon mentioned this in rGbd0801cddfc8: [X86] Cleanup SLM SSE shift and CMPGTQ scheduler model numbers.Sep 5 2022, 5:45 AM

RKSimon mentioned this in rGc1b5e36d74a3: [CostModel][X86] Add CostKinds handling for fcmp ops.Sep 6 2022, 2:35 AM

RKSimon mentioned this in rG10e0f3e9481d: [CostModel][X86] Add CostKinds handling for ctpop ops.Sep 6 2022, 9:27 AM

RKSimon mentioned this in rG05f56f10ed84: [X86] Fix VPPERM load folding latency.Sep 9 2022, 5:58 AM

RKSimon mentioned this in rG7785bd34e744: [X86] Fix bdver2 128-bit ALU/logic/shift throughputs.Sep 10 2022, 8:24 AM

RKSimon mentioned this in rG4994f87ca1d9: [X86] Fix bdver2 128-bit shuffles throughputs.Sep 10 2022, 9:57 AM

RKSimon mentioned this in rG20ad05f9b462: [CostModel][X86] Add CostKinds handling for abs ops.Sep 12 2022, 8:34 AM

RKSimon mentioned this in rG0ec028fe105b: [CostModel][X86] Add CostKinds handling for vector shift by….Sep 15 2022, 6:05 AM

RKSimon mentioned this in rG94620e4fc340: [CostModel][X86] Add CostKinds handling for vector shift by generic/non-uniform….Sep 15 2022, 9:01 AM

RKSimon mentioned this in rGf8fa04295faa: [CostModel][X86] Add CostKinds handling for vector integer comparisons.Sep 16 2022, 5:21 AM

RKSimon mentioned this in rG89e4cb603d96: [X86] Add missing (unsupported) zmm vector move classes.Sep 16 2022, 7:31 AM

RKSimon mentioned this in rG23cb1c42cd20: [CostModel][X86] Update throughput costs for CTLZ ops.Sep 16 2022, 8:57 AM

RKSimon mentioned this in rG2538adde5c89: [CostModel][X86] Add CostKinds handling for cttz.Sep 19 2022, 7:57 AM

RKSimon mentioned this in rG135c9b2c4b47: [CostModel][X86] Add CostKinds handling for vector ctlz instructions.Sep 19 2022, 8:44 AM

RKSimon mentioned this in rG6b4d409f6948: [CostModel][X86] Add CostKinds handling for CTLZ_ZERO_UNDEF/CTTZ_ZERO_UNDEF….Sep 19 2022, 9:38 AM

RKSimon mentioned this in rG839ba13c3e8c: [CostModel][X86] Add vbmi2 costs for funnelshift/rotate intrinsics.Sep 21 2022, 5:48 AM

RKSimon mentioned this in rGb2cd8118d007: [CostModel][X86] Add CostKinds handling for smax/smin/umax/umin instructions.Sep 22 2022, 2:19 AM

RKSimon mentioned this in rGe030be64d8c4: [CostModel][X86] Add partial CostKinds handling for funnelshifts/rotates.Sep 22 2022, 3:27 AM

RKSimon mentioned this in rG0a0d2f540076: [X86] Ensure 256-bit inlane shuffles are set to 2 uops + half rate.Oct 29 2022, 4:03 AM

RKSimon mentioned this in rGeea6a2782e85: [X86] WriteFShuffle256 shuffles aren't microcoded in the llvm sense.

RKSimon mentioned this in rG54aeaa2a8bae: [X86] Ensure 256-bit sqrt + crosslane shuffles are set to 2 uops + half rate.Oct 30 2022, 7:26 AM

RKSimon mentioned this in rG6bb1626e5a5c: [X86] Fix simd integer ALU and shuffle port allocations.

RKSimon mentioned this in D138832: [llvm-exegesis][x86] Add option to prevent use of xmm8-xmm15 upper SSE registers.Nov 28 2022, 9:58 AM

RKSimon mentioned this in rGb0468e3e2288: [X86] Add missing PFM port mappings for Core2/Nehalem.Nov 30 2022, 4:31 AM

RKSimon mentioned this in rGb723d5a625c0: [llvm-exegesis][x86] Add option to prevent use of xmm8-xmm15 upper SSE registers.Dec 7 2022, 9:54 AM

RKSimon mentioned this in rG48fca4b6f30f: [CostModel][X86] Add latency/code-size/size-latency target costs for….Apr 13 2023, 10:12 AM

RKSimon mentioned this in rGc1af46cc20f8: [CostModel][X86] Add BITREVERSE cost model estimations.Apr 18 2023, 3:25 AM

RKSimon mentioned this in rG16808117c351: [CostModel][X86] Add BSWAP cost model estimations.Apr 18 2023, 8:13 AM

RKSimon mentioned this in rG406004238420: [CostModel][X86] Improve i8 and vXi8 MUL costs.Apr 20 2023, 11:39 AM

RKSimon mentioned this in rG3e9d046bfcfa: [CostModel][X86] Improve i16 and vXi16 MUL costs.Apr 21 2023, 8:09 AM

Revision Contents

Path: llvm/utils/check_cost_tables.py

Size: 565 lines

Diff 457796

llvm/utils/check_cost_tables.py

This file was added.

#!/usr/bin/env python3

# Helper script to compare the TTI cost table values for various IR ops and
# intrinsics against the llvm-mca costs reported from the generated assembly.
#
# As cost tables typically use worst case values, the script runs against a set
# of cpus in a similar level and checks the cost reported by opt's cost-model
# printer vs the highest cost across all those cpus.
#
# By default, the script will exhaustively check all cpulevels and all
# scalar/vector ops up to the max legal vector width (pow2 numelts only), but
# more specific checks can be made with the --cpulevel and --op command args.

import argparse
import math
import re
import os
import subprocess
import concurrent.futures

from collections import defaultdict

class Error(Exception):
    """Simple exception type for erroring without a traceback."""


def _run_command(cmd, *, input, op):
    try:
        # check=True is needed for a failing tool to raise CalledProcessError.
        return subprocess.run(
            cmd, input=input, text=True, capture_output=True, check=True
        )
    except subprocess.CalledProcessError as exc:
        raise Error(f"Error running {cmd} : {op}") from exc


def _run_analysis(op, opname, ir, cpu, costkind):
    # Run opt to get cost-model report
    analysis = _run_command(
        [
            args.opt_binary,
            "-passes=print<cost-model>",
            "-disable-output",
            f"-cost-kind={costkind}",
            f"-mcpu={cpu}",
            f"-mtriple={args.triple}"
        ],
        input=ir,
        op=op,
    )

    # Extract analyze costs
    for line in analysis.stderr.splitlines():
        if opname in line:
            matches = re.search(
                r"Cost Model: Found an estimated cost of (\d+)", line
            )
            return float(matches.group(1))

    return None


def _run_codegen(op, ir, cpu):
    # Run llc to generate asm
    llc = _run_command(
        [
            args.llc_binary,
            f"-mcpu={cpu}",
            f"-mtriple={args.triple}"
        ],
        input=ir,
        op=op,
    )

    # TODO - strip out assembly to pass to llvm-mca to avoid need for asm barriers in IR

    # Run llvm-mca to determine asm statistics
    mca = _run_command(
        [
            args.llvm_mca_binary,
            f"-mcpu={cpu}",
            f"-mtriple={args.triple}"
        ],
        input=llc.stdout,
        op=op,
    )

    # Extract mca statistics (worst case cost to use math.ceil() to round up)
    # llvm-mca simulates 100 iterations by default, so the instruction, cycle
    # and uop totals are divided by 100 to get per-iteration values.
    costs = {}

    for line in mca.stdout.splitlines():
        if "Instructions:" in line:
            matches = re.search(r"Instructions: ([0-9]+)", line)
            costs["code-size"] = round(math.ceil(max(float(1), float(matches.group(1)))) / float(100))
            continue
        if "Total Cycles:" in line:
            matches = re.search(r"Total Cycles: ([0-9]+)", line)
            costs["latency"] = round(math.ceil(max(float(1), float(matches.group(1)))) / float(100))
            continue
        if "Total uOps:" in line:
            matches = re.search(r"Total uOps: ([0-9]+)", line)
            costs["size-latency"] = round(math.ceil(max(float(1), float(matches.group(1)))) / float(100))
            continue
        if "Block RThroughput:" in line:
            matches = re.search(r"Block RThroughput: ([0-9\.]+)", line)
            costs["throughput"] = math.ceil(max(float(1), float(matches.group(1))))
            break # Assumes other lines are above rthroughput

    return costs


def run_analysis(srctype, dsttype, op, opname, cpus, declarations=""):
    costkinds = ["throughput", "latency", "code-size", "size-latency"]

    analysis_costs = defaultdict(dict)
    mca_costs = defaultdict(dict)

    # Write out candidate IR
    ir = "\n".join(
        [
            f"define {dsttype} @costfuzz({srctype} %a0, {srctype} %a1, {srctype} %a2) {{",
            'tail call void asm sideeffect "# LLVM-MCA-BEGIN foo", "~{dirflag},~{fpsr},~{flags},~{rsp}"()',
            op,
            'tail call void asm sideeffect "# LLVM-MCA-END foo", "~{dirflag},~{fpsr},~{flags},~{rsp}"()',
            f"ret {dsttype} %result",
            "}",
            declarations,
        ]
    )

    with concurrent.futures.ThreadPoolExecutor(max_workers=args.num_threads) as e:
        analysis_results = defaultdict(dict)
        mca_results = {}

        for cpu in cpus:
            mca_results[cpu] = e.submit(_run_codegen, op, ir, cpu)
            for costkind in costkinds:
                analysis_results[costkind][cpu] = e.submit(_run_analysis, op, opname, ir, cpu, costkind)

        for cpu in cpus:
            for costkind in costkinds:
                analysis_costs[costkind][cpu] = analysis_results[costkind][cpu].result()

        for cpu in cpus:
            costs = mca_results[cpu].result()
            for costkind in costkinds:
                mca_costs[costkind][cpu] = costs[costkind]

    for costkind in costkinds:
        minanalysis = min(analysis_costs[costkind].values())
        maxanalysis = max(analysis_costs[costkind].values())
        minmca = min(mca_costs[costkind].values())
        maxmca = max(mca_costs[costkind].values())

        if maxmca != maxanalysis:
        #if abs(maxmca - maxanalysis) > 1:
            print(
                f"{dsttype} {opname} {srctype}: cost ({minanalysis} - {maxanalysis}) vs {costkind} ({minmca} - {maxmca})"
            )
            for cpu in cpus:
                print(f" {cpu} : {analysis_costs[costkind][cpu]} vs {mca_costs[costkind][cpu]}")
            if args.stop_on_diff:
                with open("fuzz.ll", "w") as f:
                    f.write(ir)
                raise SystemExit(-1)


def get_float_string(width):
    if width == 16:
        return "half"
    if width == 32:
        return "float"
    if width == 64:
        return "double"
    return None


def get_type(elementcount, base):
    if elementcount == 0:
        return base
    return f"<{elementcount} x {base}>"


def get_typestub(elttype, elementcount, base):
    if elementcount == 0:
        return f"{elttype}{base}"
    return f"v{elementcount}{elttype}{base}"


def get_typeistub(elementcount, base):
    return get_typestub("i", elementcount, base)


def get_typefstub(elementcount, base):
    return get_typestub("f", elementcount, base)


# TODO - add half conversion
def fp_cast(maxwidth, ops, cpus):
    for op in ops:
        for srcbasewidth in [32, 64]:
            for dstbasewidth in [32, 64]:
                for elementcount in [0, 2, 4, 8, 16, 32, 64]:
                    srctype = get_type(elementcount, get_float_string(srcbasewidth))
                    dsttype = get_type(elementcount, get_float_string(dstbasewidth))
                    cmd = f"%result = {op} {srctype} %a0 to {dsttype}"

                    if srcbasewidth < dstbasewidth and op == "fpext":
                        if dstbasewidth * elementcount <= maxwidth:
                            run_analysis(srctype, dsttype, cmd, op, cpus)

                    if srcbasewidth > dstbasewidth and op == "fptrunc":
                        if srcbasewidth * elementcount <= maxwidth:
                            run_analysis(srctype, dsttype, cmd, op, cpus)


def fp_unaryops(maxwidth, ops, cpus):
    for op in ops:
        for basewidth in [32, 64]:
            for elementcount in [0, 2, 4, 8, 16]:
                if (basewidth * elementcount) <= maxwidth:
                    type = get_type(elementcount, get_float_string(basewidth))
                    cmd = f"%result = {op} {type} %a0"
                    run_analysis(type, type, cmd, op, cpus)


def fp_binops(maxwidth, ops, cpus):
    for op in ops:
        for basewidth in [32, 64]:
            for elementcount in [0, 2, 4, 8, 16]:
                if (basewidth * elementcount) <= maxwidth:
                    type = get_type(elementcount, get_float_string(basewidth))
                    cmd = f"%result = {op} {type} %a0, %a1"
                    run_analysis(type, type, cmd, op, cpus)


# TODO - support bool predicate results for some targets
def fp_cmp(maxwidth, ops, cpus):
    for op in ops:
        for basewidth in [32, 64]:
            for elementcount in [2, 4, 8, 16]:
                if (basewidth * elementcount) <= maxwidth:
                    for cc in [ "oeq", "ogt", "oge", "olt", "ole", "one", "ord", "ueq", "ugt", "uge", "ult", "ule", "une", "uno" ]:
                        cctype = get_type(elementcount, f"i{1}")
                        srctype = get_type(elementcount, get_float_string(basewidth))
                        dsttype = get_type(elementcount, f"i{basewidth}")
                        cmd = "\n".join(
                            [
                                f"%cmp = {op} {cc} {srctype} %a0, %a1",
                                f"%result = sext {cctype} %cmp to {dsttype}",
                            ]
                        )
                        run_analysis(srctype, dsttype, cmd, f"{op} {cc}", cpus)


def int_cast(maxwidth, ops, cpus):
    for op in ops:
        for srcbasewidth in [8, 16, 32, 64]:
            for dstbasewidth in [8, 16, 32, 64]:
                for elementcount in [0, 2, 4, 8, 16, 32, 64]:
                    srctype = get_type(elementcount, f"i{srcbasewidth}")
                    dsttype = get_type(elementcount, f"i{dstbasewidth}")
                    cmd = f"%result = {op} {srctype} %a0 to {dsttype}"

                    if srcbasewidth < dstbasewidth and op != "trunc":
                        if dstbasewidth * elementcount <= maxwidth:
                            run_analysis(srctype, dsttype, cmd, op, cpus)

                    if srcbasewidth > dstbasewidth and op == "trunc":
                        if srcbasewidth * elementcount <= maxwidth:
                            if elementcount != 0:
                                run_analysis(srctype, dsttype, cmd, op, cpus)


def int_binops(maxwidth, ops, cpus):
    for op in ops:
        for basewidth in [8, 16, 32, 64]:
            for elementcount in [0, 2, 4, 8, 16, 32, 64]:
                if (basewidth * elementcount) <= maxwidth:
                    type = get_type(elementcount, f"i{basewidth}")
                    cmd = f"%result = {op} {type} %a0, %a1"
                    run_analysis(type, type, cmd, f" {op} ", cpus)


def int_shifts(maxwidth, ops, cpus):
    for op in ops:
        for basewidth in [8, 16, 32, 64]:
            for elementcount in [0, 2, 4, 8, 16, 32, 64]:
                if (basewidth * elementcount) <= maxwidth:
                    type = get_type(elementcount, f"i{basewidth}")
                    cmd = f"%result = {op} {type} %a0, %a1"
                    run_analysis(type, type, cmd, op, cpus)


# TODO - support bool predicate results for some targets
def int_cmp(maxwidth, ops, cpus):
    for op in ops:
        for basewidth in [8, 16, 32, 64]:
            for elementcount in [2, 4, 8, 16, 32, 64]:
                if (basewidth * elementcount) <= maxwidth:
                    for cc in [ "eq", "ne", "ugt", "uge", "ult", "ule", "sgt", "sge", "slt", "sle" ]:
                        cctype = get_type(elementcount, f"i{1}")
                        type = get_type(elementcount, f"i{basewidth}")
                        cmd = "\n".join(
                            [
                                f"%cmp = {op} {cc} {type} %a0, %a1",
                                f"%result = sext {cctype} %cmp to {type}",
                            ]
                        )
                        run_analysis(type, type, cmd, f"{op} {cc}", cpus)


def int_to_fp(maxwidth, ops, cpus):
    for op in ops:
        for srcbasewidth in [8, 16, 32, 64]:
            for dstbasewidth in [32, 64]:
                for elementcount in [0, 2, 4, 8, 16, 32, 64]:
                    if (min(srcbasewidth, dstbasewidth) * elementcount) <= maxwidth:
                        srctype = get_type(elementcount, f"i{srcbasewidth}")
                        dsttype = get_type(elementcount, get_float_string(dstbasewidth))
                        cmd = f"%result = {op} {srctype} %a0 to {dsttype}"
                        run_analysis(srctype, dsttype, cmd, op, cpus)


def fp_to_int(maxwidth, ops, cpus):
    for op in ops:
        for srcbasewidth in [32, 64]:
            for dstbasewidth in [8, 16, 32, 64]:
                for elementcount in [0, 2, 4, 8, 16, 32, 64]:
                    if (min(srcbasewidth, dstbasewidth) * elementcount) <= maxwidth:
                        srctype = get_type(elementcount, get_float_string(srcbasewidth))
                        dsttype = get_type(elementcount, f"i{dstbasewidth}")
                        cmd = f"%result = {op} {srctype} %a0 to {dsttype}"
                        run_analysis(srctype, dsttype, cmd, op, cpus)


def int_unaryintrinsics(maxwidth, ops, cpus, boolarg=None):
    for op in ops:
        for basewidth in [8, 16, 32, 64]:
            if op == "bswap" and basewidth == 8:
                continue
            for elementcount in [0, 2, 4, 8, 16, 32, 64]:
                if (basewidth * elementcount) <= maxwidth:
                    type = get_type(elementcount, f"i{basewidth}")
                    stub = get_typeistub(elementcount, basewidth)
                    if boolarg is not None:
                        boolval = -1 if boolarg else 0
                        cmd = f"%result = call {type} @llvm.{op}.{stub}({type} %a0, i1 {boolval})"
                        declaration = f"declare {type} @llvm.{op}.{stub}({type}, i1)"
                    else:
                        cmd = f"%result = call {type} @llvm.{op}.{stub}({type} %a0)"
                        declaration = f"declare {type} @llvm.{op}.{stub}({type})"
                    run_analysis(type, type, cmd, op, cpus, declaration)


def int_binaryintrinsics(maxwidth, ops, cpus):
    for op in ops:
        for basewidth in [8, 16, 32, 64]:
            for elementcount in [0, 2, 4, 8, 16, 32, 64]:
                if (basewidth * elementcount) <= maxwidth:
                    type = get_type(elementcount, f"i{basewidth}")
                    stub = get_typeistub(elementcount, basewidth)
                    cmd = f"%result = call {type} @llvm.{op}.{stub}({type} %a0, {type} %a1)"
                    declaration = f"declare {type} @llvm.{op}.{stub}({type}, {type})"
                    run_analysis(type, type, cmd, op, cpus, declaration)


def int_ternaryintrinsics(maxwidth, ops, cpus):
    for op in ops:
        for basewidth in [8, 16, 32, 64]:
            for elementcount in [0, 2, 4, 8, 16, 32, 64]:
                if (basewidth * elementcount) <= maxwidth:
                    type = get_type(elementcount, f"i{basewidth}")
                    stub = get_typeistub(elementcount, basewidth)
                    cmd = f"%result = call {type} @llvm.{op}.{stub}({type} %a0, {type} %a1, {type} %a2)"
                    declaration = f"declare {type} @llvm.{op}.{stub}({type}, {type}, {type})"
                    run_analysis(type, type, cmd, op, cpus, declaration)


def int_reductions(maxwidth, ops, cpus):
    for op in ops:
        for basewidth in [8, 16, 32, 64]:
            for elementcount in [2, 4, 8, 16, 32, 64]:
                if (basewidth * elementcount) <= maxwidth:
                    vectype = get_type(elementcount, f"i{basewidth}")
                    scltype = get_type(0, f"i{basewidth}")
                    stub = get_typeistub(elementcount, basewidth)
                    cmd = f"%result = call {scltype} @llvm.vector.reduce.{op}.{stub}({vectype} %a0)"
                    declaration = f"declare {scltype} @llvm.vector.reduce.{op}.{stub}({vectype})"
                    run_analysis(vectype, scltype, cmd, f"vector.reduce.{op}", cpus, declaration)


def filter_ops(targetops, ops):
    if len(targetops) == 0:
        return ops

    selectops = list()
    for targetop in targetops:
        if ops.count(targetop):
            selectops.append(targetop)
    return selectops


def test_cpus(targetops, maxwidth, cpus):
    ops = filter_ops(targetops, ["fpext", "fptrunc"])
    fp_cast(maxwidth, ops, cpus)

    ops = filter_ops(targetops, ["fneg"])
    fp_unaryops(maxwidth, ops, cpus)

    ops = filter_ops(targetops, ["fadd", "fsub", "fmul", "fdiv"])
    fp_binops(maxwidth, ops, cpus)

    ops = filter_ops(targetops, ["fcmp"])
    fp_cmp(maxwidth, ops, cpus)

    ops = filter_ops(targetops, ["select"])
    # TODO - select with fcmp

    # TODO - fabs, fsqrt, ceil, floor, trunc, rint, nearbyint
    # fp_unaryintrinsics()

    # TODO - copysign, maxnum, maximum, minnum, minimum
    # fp_binaryintrinsics()

    # TODO - reduction op filtering
    # if len(targetops) == 0 or "reduce" in targetops:
    #     fp_reductions(maxwidth, [ "fadd", "fmul", "fmax", "fmin" ], cpus)

    ops = filter_ops(targetops, ["sext", "zext", "trunc"])
    int_cast(maxwidth, ops, cpus)

    # TODO - sdiv/udiv/srem/urem (+ by constant/pow2 cases)
    ops = filter_ops(targetops, ["and", "or", "xor", "add", "sub", "mul"])
    int_binops(maxwidth, ops, cpus)

    # TODO - uniform / constant shift amount costs
    ops = filter_ops(targetops, ["shl", "lshr", "ashr"])
    int_shifts(maxwidth, ops, cpus)

    ops = filter_ops(targetops, ["icmp"])
    int_cmp(maxwidth, ops, cpus)

    ops = filter_ops(targetops, ["select"])
    # TODO - select with icmp

    # TODO - bitcasts i1/i32/i64/float/double

    # TODO - vector ops (extract/insert/shuffle)

    # TODO - better reduction op filtering
    if len(targetops) == 0 or "reduce" in targetops:
        int_reductions(
            maxwidth,
            ["and", "or", "xor", "add", "mul", "smax", "smin", "umax", "umin"],
            cpus,
        )

    ops = filter_ops(targetops, ["sitofp", "uitofp"])
    int_to_fp(maxwidth, ops, cpus)

    ops = filter_ops(targetops, ["fptosi", "fptoui"])
    fp_to_int(maxwidth, ops, cpus)

    ops = filter_ops(targetops, ["bitreverse", "bswap", "ctpop"])
    int_unaryintrinsics(maxwidth, ops, cpus)

    ops = filter_ops(targetops, ["ctlz", "cttz"])
    int_unaryintrinsics(maxwidth, ops, cpus, False)
    int_unaryintrinsics(maxwidth, ops, cpus, True)

    ops = filter_ops(targetops, ["smax", "smin", "umax", "umin"])
    int_binaryintrinsics(maxwidth, ops, cpus)

    # TODO - uniform / constant shift amount costs
    ops = filter_ops(targetops, ["fshl", "fshr"])
    int_ternaryintrinsics(maxwidth, ops, cpus)


def main():
    default_num_threads = os.cpu_count()

    # TODO - 2 modes - (a) create generic codegen for sse level and compare cpu analysis
    # (b) create generic codegen for each cpu of a similar level and compare cpu analysis
    cpulevels = {
        "avx512" : (512, ["x86-64-v4", "skylake-avx512", "icelake-server"]),
        "avx512f" : (512, ["knl"]),
        "avx2" : (256, ["x86-64-v3", "broadwell", "haswell", "skylake", "alderlake", "znver1", "znver2", "znver3"]),
        "avx1" : (256, ["bdver2", "btver2", "sandybridge"]),
        "sse4.2" : (128, ["x86-64-v2", "silvermont", "goldmont", "nehalem"]),
        "sse4.1" : (128, ["penryn", "core2"]),
        "ssse3" : (128, ["atom"]),
        "sse3" : (128, ["atom"]),
        "sse2" : (128, ["x86-64"]),
    }

    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--triple",
        metavar="<triple>",
        default="x86_64--",
        help="Specify the target triple (default: x86_64--)",
    )
    parser.add_argument(
        "--cpulevel",
        choices=cpulevels.keys(),
        default=None,
        help="Only test cpus specific to a cpulevel",
    )
    # TODO - --op(s) command line handling to select multiple ops for testing
    parser.add_argument(
        "--op", metavar="<op>", default=None, help="Only test requested op"
    )
    parser.add_argument(
        "--stop-on-diff",
        action="store_true",
        help="Stop on first analysis/mca discrepancy, leaves fuzz.ll temp file",
    )
    parser.add_argument(
        "--opt-binary",
        metavar="<path>",
        default="opt",
        help='The "opt" binary to use to analyze the test case IR (default: opt)',
    )
    parser.add_argument(
        "--llc-binary",
        metavar="<path>",
        default="llc",
        help='The "llc" binary to use to generate the test case assembly (default: llc)',
    )
    parser.add_argument(
        "--llvm-mca-binary",
        metavar="<path>",
        default="llvm-mca",
        help='The "llvm-mca" binary to use to analyze the test case assembly (default: llvm-mca)',
    )
    parser.add_argument(
        "-j",
        "--num-threads",
        type=int,
        default=default_num_threads,
        help=f"default:{default_num_threads}",
    )

    global args
    args = parser.parse_args()

    targetops = list()
    if args.op is not None:
        targetops = args.op.split(",")

    targetcpus = ["avx512", "avx2", "avx1", "sse4.2", "ssse3", "sse2"]
    if args.cpulevel is not None:
        targetcpus = [args.cpulevel]

    for targetcpu in targetcpus:
        (maxwidth, cpus) = cpulevels[targetcpu]
        test_cpus(targetops, maxwidth, cpus)

    return 0


if __name__ == "__main__":
    try:
        raise SystemExit(main())
    except Error as error:
        print(f"error: {error}")
        raise SystemExit(1) from error
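
For anyone wanting to experiment with the llvm-mca parsing step without an LLVM build, the summary extraction in _run_codegen can be approximated standalone. This is only a sketch: `SAMPLE` follows the layout of llvm-mca's summary view but its numbers are invented, and `FIELDS`/`parse_summary` are hypothetical names that do not appear in the script.

```python
import re

# Invented numbers laid out like an llvm-mca summary; not real tool output.
SAMPLE = """\
Iterations:        100
Instructions:      100
Total Cycles:      403
Total uOps:        100

Dispatch Width:    6
uOps Per Cycle:    0.25
IPC:               0.25
Block RThroughput: 4.0
"""

# Summary fields of interest and how to convert each value.
FIELDS = {
    "Instructions": int,
    "Total Cycles": int,
    "Total uOps": int,
    "Block RThroughput": float,
}


def parse_summary(text):
    """Return {field: value} for the known summary lines found in text."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r"([A-Za-z ]+):\s+([0-9.]+)$", line)
        if match and match.group(1) in FIELDS:
            name = match.group(1)
            stats[name] = FIELDS[name](match.group(2))
    return stats


stats = parse_summary(SAMPLE)
print(stats)
```

Matching whole "name: value" lines avoids depending on the order in which the summary fields are printed.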