I'm observing a rather big runtime performance regression
as a result of the NewPM switch on one of RawSpeed's benchmarks:
raw.pixls.us-unique/Panasonic/DC-GH5S$ /repositories/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true P1022085.RW2 --benchmark_repetitions=9 --benchmark_min_time=1
RUNNING: /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true P1022085.RW2 --benchmark_repetitions=9 --benchmark_min_time=1 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpk9pkqe2s
2021-04-24T14:00:17+03:00
Running /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench
Run on (32 X 3599.99 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.65, 0.51, 1.27
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time       CPU  Iterations  CPUTime,s  CPUTime/WallTime    Pixels  Pixels/CPUTime  Pixels/WallTime  Raws/CPUTime  Raws/WallTime  WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
P1022085.RW2/threads:32/process_time/real_time_mean    0.748 ms   23.9 ms           9  0.0239231           31.9721  10.3933M        434.452M         13.8904G        41.801       1.33647k     748.25u
P1022085.RW2/threads:32/process_time/real_time_median  0.748 ms   23.9 ms           9  0.0239156           31.9716  10.3933M        434.585M         13.8934G       41.8138       1.33676k    748.079u
P1022085.RW2/threads:32/process_time/real_time_stddev  0.003 ms  0.080 ms           9   80.0846u          6.00073m         0        1.45335M         48.9162M      0.139834         4.7065    2.63684u
RUNNING: /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true P1022085.RW2 --benchmark_repetitions=9 --benchmark_min_time=1 --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpt6ijfryg
2021-04-24T14:00:31+03:00
Running /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Run on (32 X 3600.05 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 5.54, 1.56, 1.61
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                  Time       CPU  Iterations  CPUTime,s  CPUTime/WallTime    Pixels  Pixels/CPUTime  Pixels/WallTime  Raws/CPUTime  Raws/WallTime  WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
P1022085.RW2/threads:32/process_time/real_time_mean    0.851 ms   27.2 ms           9  0.0272077           31.9615  10.3933M        382.027M         12.2102G       36.7569       1.17481k    851.271u
P1022085.RW2/threads:32/process_time/real_time_median  0.848 ms   27.1 ms           9  0.0271017           31.9699  10.3933M        383.494M         12.2598G       36.8981       1.17959k    847.755u
P1022085.RW2/threads:32/process_time/real_time_stddev  0.008 ms  0.243 ms           9   243.106u         0.0215795         0        3.38806M          116.08M      0.325984        11.1687    8.16022u
Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark                                                  Time       CPU  Time Old  Time New  CPU Old  CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------
P1022085.RW2/threads:32/process_time/real_time_pvalue    0.0004    0.0004  U Test, Repetitions: 9 vs 9
P1022085.RW2/threads:32/process_time/real_time_mean     +0.1377   +0.1373         1         1       24       27
P1022085.RW2/threads:32/process_time/real_time_median   +0.1332   +0.1333         1         1       24       27
P1022085.RW2/threads:32/process_time/real_time_stddev   +2.0947   +2.0384         0         0        0        0
I've posted the repro IR at https://bugs.llvm.org/show_bug.cgi?id=50099.
It happens because a certain destructor, which runs at the end of a certain function,
is no longer inlined, which in turn prevents SROA of the class.
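For a sense of the shape of the problem, here's a minimal sketch (hypothetical code, not the actual RawSpeed class; all names are made up for illustration):

  // Hypothetical reduction of the pattern, for illustration only.
  #include <cstddef>

  struct Accumulator {
    int *Storage; // owned allocation; forces a non-trivial destructor
    std::size_t Count = 0;
    Accumulator() : Storage(new int[16]) {}
    // If this destructor is NOT inlined at the end of process(), `Acc`
    // has to be kept as a real in-memory object whose address is passed
    // to the out-of-line destructor call -- and SROA can no longer split
    // it into scalars.
    ~Accumulator() { delete[] Storage; }
  };

  int process(const int *Data, std::size_t N) {
    Accumulator Acc; // SROA candidate, but only if ~Accumulator() inlines
    for (std::size_t I = 0; I != N; ++I)
      Acc.Count += static_cast<std::size_t>(Data[I]);
    return static_cast<int>(Acc.Count);
  }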
I guess this bisects to D28331, which added the lower threshold for cold callsites,
with a comment to the effect that the chosen values might need further tuning.
Since the values haven't been changed since then, I guess that tuning hasn't happened as of yet.
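For reference, the knob in question lives in llvm/lib/Analysis/InlineCost.cpp and looks roughly like this (quoting from memory, so treat the exact wording as approximate; the init value is what this patch bumps):

  static cl::opt<int> ColdCallSiteThreshold(
      "inline-cold-callsite-threshold", cl::Hidden, cl::init(45),
      cl::desc("Threshold for inlining cold callsites"));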
Analyzing the problem on top of D101228: inlining that destructor would cost 50,
while the cold-callsite threshold is 45. Bumping the threshold to exactly 50 isn't sufficient,
because the threshold is non-inclusive, i.e. the cost must be strictly less than it
(is that intentional?), so I rounded it up to 55.
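To spell out the arithmetic, here's a toy model of the check (not the actual LLVM code; the real decision is more involved, but the strict `Cost < Threshold` comparison is the relevant part):

  #include <cstdio>

  // Toy model of the final inline-cost decision; the key detail is the
  // strict (non-inclusive) comparison.
  static bool shouldInline(int Cost, int Threshold) {
    return Cost < Threshold; // strictly less than, not <=
  }

  int main() {
    const int Cost = 50; // what inlining the destructor costs here
    std::printf("threshold 45 -> %d\n", shouldInline(Cost, 45)); // 0: regression
    std::printf("threshold 50 -> %d\n", shouldInline(Cost, 50)); // 0: still not inlined!
    std::printf("threshold 55 -> %d\n", shouldInline(Cost, 55)); // 1: inlined
  }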
This addresses the issue: inlining happens, SROA happens, and perf is happy again.
While this seems like the least problematic approach,
I think we may want to somehow boost inlining of callees
that have arguments marked as likely to be SROA-able.
Perhaps, at the very least, we shouldn't apply this threshold-lowering logic in such cases?
A hypothetical sketch of that alternative follows.
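Toy model only; `CallSiteInfo` and `anyArgLooksSROAble` are invented names, and the real change would go into the threshold computation in InlineCost.cpp:

  #include <vector>

  // Toy model only; these names do not exist in LLVM.
  struct CallSiteInfo {
    bool IsCold;                  // cold per profile data / heuristics
    std::vector<bool> ArgSROAble; // per argument: does it point to an
                                  // alloca that SROA could split apart
                                  // if this call were inlined?
  };

  static bool anyArgLooksSROAble(const CallSiteInfo &CS) {
    for (bool B : CS.ArgSROAble)
      if (B)
        return true;
    return false;
  }

  // Only lower the threshold for cold callsites when inlining would not
  // unlock SROA; otherwise keep the regular threshold.
  static int pickThreshold(const CallSiteInfo &CS, int RegularThreshold,
                           int ColdCallSiteThreshold) {
    if (CS.IsCold && !anyArgLooksSROAble(CS))
      return ColdCallSiteThreshold;
    return RegularThreshold;
  }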