(I'd be fine just committing this, but decided to post it just in case, for greater visibility if nothing else.)
We *might* not want to perform codegen for all Configurations X Repetitions
beforehand, since the produced Runnable Configurations may have significant
file sizes (up to 1MB?), and there are many Runnable Configurations (30k?).
Doing the codegen in batches, however, is fine memory-wise, and is a win, as expected.
We really don't want to smudge the measurements, so we run those
standalone, without *anything* else running in parallel.
When not measuring, however, the codegen can be done in parallel.
Special care is taken to not produce snippets in non-deterministic order:
although the snippets themselves are randomized, a non-deterministic order is not as useful.
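To illustrate the scheme described above, here is a minimal standalone sketch, not the actual patch. `RunnableConfig`, `generate`, `measure` and `runAll` are hypothetical names invented for this example, and `std::async` stands in for whatever thread pool the real implementation uses. Codegen for one batch runs in parallel, results are collected in submission order so the output order stays deterministic, and measurements start only once nothing else is running.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for one generated Runnable Configuration.
struct RunnableConfig {
  std::size_t Index;
  std::string Object;
};

// Hypothetical stand-in for codegen. It touches no shared state,
// so it is safe to run concurrently.
RunnableConfig generate(std::size_t Index) {
  return {Index, "snippet-" + std::to_string(Index)};
}

// Hypothetical stand-in for a measurement. Must never overlap with codegen.
std::size_t measure(const RunnableConfig &RC) { return RC.Object.size(); }

// Returns the indices in the order in which they were measured.
std::vector<std::size_t> runAll(std::size_t NumConfigs, std::size_t BatchSize) {
  assert(BatchSize > 0 && "batch size must be positive");
  std::vector<std::size_t> MeasuredOrder;
  for (std::size_t Begin = 0; Begin < NumConfigs; Begin += BatchSize) {
    const std::size_t End = std::min(Begin + BatchSize, NumConfigs);

    // Phase 1: codegen for this batch only, in parallel. Batching bounds
    // how many generated objects are alive at once.
    std::vector<std::future<RunnableConfig>> Pending;
    for (std::size_t I = Begin; I < End; ++I)
      Pending.push_back(std::async(std::launch::async, generate, I));

    // Collect in submission order, not completion order, so the result
    // order is deterministic regardless of thread scheduling.
    std::vector<RunnableConfig> Batch;
    for (auto &F : Pending)
      Batch.push_back(F.get());

    // Phase 2: every codegen task has finished, so the measurements run
    // serially with nothing else in flight.
    for (const RunnableConfig &RC : Batch) {
      measure(RC);
      MeasuredOrder.push_back(RC.Index);
    }
  }
  return MeasuredOrder;
}
```

Calling `runAll(10, 4)` in this sketch processes batches of 4, 4 and 2 configurations and always measures them in index order.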
And so it becomes almost real-time:
time ./bin/llvm-exegesis --opcode-index=-1 --mode=latency --repetition-mode=duplicate --dump-object-to-disk=0 --benchmarks-file=/tmp/res-new.yaml --measurements-print-progress --max-configs-per-opcode=8192
old:
real 1m33.500s user 1m29.644s sys 0m1.762s
new:
real 0m18.191s user 3m8.253s sys 0m3.999s
(5.1x)
time ./bin/llvm-exegesis --opcode-index=-1 --mode=uops --repetition-mode=duplicate --dump-object-to-disk=0 --benchmarks-file=/tmp/res-new.yaml --measurements-print-progress
old:
real 1m52.256s user 1m48.518s sys 0m1.479s
new:
real 0m13.273s user 4m14.228s sys 0m4.903s
(8.5x)
time ./bin/llvm-exegesis --opcode-index=-1 --mode=inverse_throughput --repetition-mode=duplicate --dump-object-to-disk=0 --benchmarks-file=/tmp/res-new.yaml --measurements-print-progress
old:
real 1m58.765s user 1m53.259s sys 0m2.937s
new:
real 0m19.586s user 4m19.133s sys 0m6.314s
(6x)
See also: https://discourse.llvm.org/t/does-anyone-use-llvm-exegesis-feedback-wanted/67729/
Does this bring anything to the user?
Maybe "The number of threads to use for parallel operations (default = 0 (autodetect))"
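For reference, "0 means autodetect" could be resolved roughly as in the sketch below. `resolveThreadCount` is a hypothetical helper written for this comment, not a function from the patch or from LLVM. It assumes autodetection means asking the standard library for the hardware concurrency and falling back to a single thread when that is unknown.

```cpp
#include <thread>

// Hypothetical helper: maps the user-facing option value to an actual
// thread count. 0 means "autodetect".
unsigned resolveThreadCount(unsigned Requested) {
  if (Requested != 0)
    return Requested;
  // hardware_concurrency() may return 0 when the value is not computable.
  const unsigned Detected = std::thread::hardware_concurrency();
  return Detected != 0 ? Detected : 1;
}
```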