This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Bump phi-node-folding-threshold from 2 to 3
Abandoned, Public

Authored by lebedev.ri on Jul 23 2019, 7:23 AM.

Details

Summary

The main motivation is the signbit-like-value-extension.ll test.
That pattern comes up in JPEG decoding; see e.g.
Figure F.12 ("Extending the sign bit of a decoded value in V")
of ITU-T T.81 (the JPEG specification).
That branch is not predictable, and it sits within the innermost loop,
so leaving the pattern stuck with a branch
instead of a select (i.e. CMOV on x86) is unlikely to be beneficial.
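
For readers unfamiliar with the pattern, here is a minimal C++ sketch of the sign-extension step as described by Figure F.12 of T.81. It is not taken from the test file or from RawSpeed; the function name `extend_sign` and the parameter names are hypothetical, and it assumes `1 <= t <= 16`. The conditional is the branch that the patch would like to see folded into a select.

```cpp
#include <cstdint>

// Sketch of the T.81 "EXTEND" procedure (Figure F.12).
// v: the t-bit value read from the bitstream (0 <= v < 2^t).
// t: the bit count of the value; assumed to be in [1, 16] here.
// Hypothetical helper, for illustration only.
int32_t extend_sign(int32_t v, unsigned t) {
  const int32_t vt = int32_t(1) << (t - 1);  // Vt = 2^(T-1)
  if (v < vt) {
    // Spec: V = V + ((-1 << T) + 1); written without shifting a negative value.
    v += 1 - (int32_t(1) << t);
  }
  return v;
}
```

Whether the conditional is emitted as a branch or as a select/CMOV depends on the optimizer's cost model, which is what the threshold in this patch influences.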

Performance- and code-size-wise, this appears to be mostly neutral to positive.
I'm seeing 4 major improvements on the RawSpeed benchmark:

Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 27 vs 27
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                                  -0.3052         -0.3052           225           156           225           156
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                                -0.3065         -0.3066           225           156           225           156
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                                -0.7143         -0.7198             1             0             1             0
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                                   -0.1468         -0.1466            79            67            79            67
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                                 -0.1513         -0.1513            79            67            79            67
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                                 +3.1372         +3.7836             0             1             0             1
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 27 vs 27
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                                  -0.1331         -0.1331           170           147           170           147
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                                -0.1329         -0.1327           170           147           170           147
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                                +1.4339         +1.9116             0             0             0             0
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                                   -0.0532         -0.0532           279           265           279           264
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                                 -0.0528         -0.0529           279           265           279           265
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                                 -0.2031         -0.2007             0             0             0             0
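
As a reading aid for the table above: to my understanding, the Time and CPU columns printed by compare.py are relative changes, (new - old) / |old|, so negative values mean the new build is faster. A minimal sketch of that calculation follows; the function name `relative_change` is hypothetical and the zero-baseline case is deliberately not handled.

```cpp
#include <cmath>

// Relative change between an old and a new measurement: (new - old) / |old|.
// Negative result: the new measurement is smaller (faster).
// Assumption: old_value is non-zero; a zero baseline is not handled in this sketch.
double relative_change(double old_value, double new_value) {
  return (new_value - old_value) / std::fabs(old_value);
}
```

Note that the "Time Old"/"Time New" columns are rounded for display, so recomputing from them only approximately reproduces the printed ratio (e.g. 225 -> 156 gives about -0.307, while the table shows -0.3052).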

Event Timeline

lebedev.ri created this revision. Jul 23 2019, 7:23 AM

To be noted, I'm changing this pass in particular because it is the pass that folds the case without not, in -O3
(llvm/test/Transforms/PhaseOrdering/unsigned-multiply-overflow-check.ll, will_not_overflow()).
While I suppose that could be addressed via alternative means (teach some other pass that
it is okay to hoist this @llvm.umul.with.overflow here?),
the second motivational case is separate and remains.

The obvious question here - is this just a part of the patchset, and does this make sense as a fix?
To answer that, a performance benchmark is needed. I'm not really good with LLVM's test-suite
(it is just too noisy), but I have my own benchmark that I trust-ish.
Results:

$ /usr/src/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_repetitions=27 -r ~/raw-camera-samples/raw.pixls.us-unique
RUNNING: /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_repetitions=27 -r /home/lebedevri/raw-camera-samples/raw.pixls.us-unique --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpSrwQVA
2019-07-26 21:18:05
Running /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.80, 1.07, 1.18
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                    Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_mean              221 ms          221 ms           27   0.220681         0.999753   9.96653M       45.1633M        45.1521M       4.5315       4.53038   0.220735
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_median            221 ms          221 ms           27   0.220519         0.999728   9.96653M       45.1958M        45.1878M      4.53476       4.53395   0.220558
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_stddev          0.852 ms        0.834 ms           27   834.338u         120.095u          0       168.652k        172.096k    0.0169219     0.0172674   851.991u
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_mean              129 ms          129 ms           27   0.128517         0.999792   5.25658M        40.902M        40.8935M      7.78112       7.77949   0.128544
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_median            128 ms          128 ms           27   0.128476         0.999748   5.25658M       40.9147M        40.9102M      7.78353       7.78268   0.128491
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_stddev          0.359 ms        0.360 ms           27   360.228u         106.636u          0         114.3k         113.86k    0.0217442     0.0216606   358.904u
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_mean                           439 ms          438 ms           27   0.438438         0.999792   52.6643M       120.119M        120.094M      2.28083       2.28036   0.438529
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_median                         438 ms          438 ms           27   0.438095         0.999792   52.6643M       120.212M        120.181M      2.28261       2.28203   0.438207
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_stddev                       0.829 ms        0.813 ms           27    813.43u           79.67u          0       222.503k         226.76k     4.22493m      4.30577m   829.357u
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_mean                           561 ms          561 ms           27   0.560808         0.999775   27.9936M       49.9169M        49.9056M      1.78315       1.78275   0.560934
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_median                         561 ms          561 ms           27   0.561278         0.999728   27.9936M       49.8748M        49.8698M      1.78165       1.78147   0.561334
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_stddev                        1.33 ms         1.31 ms           27   1.30781m         139.598u          0       116.574k        118.141k     4.16433m      4.22028m    1.3262m
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean                           313 ms          313 ms           27    0.31303         0.999766   12.4416M       39.7458M        39.7365M      3.19459       3.19384   0.313103
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median                         313 ms          313 ms           27    0.31304         0.999709   12.4416M       39.7445M        39.7307M      3.19449       3.19337   0.313149
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev                       0.274 ms        0.253 ms           27   253.188u         153.601u          0       32.1613k        34.7505k     2.58498m      2.79309m   273.671u
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_mean                          56.5 ms         56.5 ms           27  0.0565142         0.999779   2.51942M       44.5809M         44.571M      17.6949        17.691  0.0565267
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_median                        56.5 ms         56.5 ms           27  0.0564672         0.999777   2.51942M       44.6174M        44.6052M      17.7094       17.7045  0.0564827
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_stddev                       0.198 ms        0.195 ms           27   194.747u         137.404u          0        153.04k        155.732k     0.060744     0.0618124   198.255u
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                           279 ms          279 ms           27   0.279347         0.999783   25.5041M       91.2993M        91.2795M      3.57979       3.57901   0.279407
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                         279 ms          279 ms           27   0.279287         0.999783   25.5041M       91.3186M        91.3083M      3.58054       3.58014   0.279319
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                       0.320 ms        0.310 ms           27   310.345u         91.5369u          0       101.368k        104.508k     3.97457m      4.09769m   320.082u
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_mean                          63.1 ms         63.1 ms           27  0.0630571         0.999801   3.23814M       51.3527M        51.3425M      15.8587       15.8555  0.0630696
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_median                        63.1 ms         63.0 ms           27  0.0630448          0.99976   3.23814M       51.3626M         51.356M      15.8617       15.8597  0.0630529
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_stddev                       0.084 ms        0.082 ms           27   82.4575u         105.986u          0       66.9193k        68.3693k    0.0206659     0.0211137   84.2571u
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_mean                     63.9 ms         63.8 ms           27  0.0638391          0.99981   3.34464M       52.3918M        52.3818M      15.6644       15.6614  0.0638512
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_median                   63.9 ms         63.8 ms           27  0.0638421         0.999776   3.34464M       52.3893M        52.3788M      15.6636       15.6605  0.0638548
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_stddev                  0.039 ms        0.035 ms           27   34.9145u         117.826u          0       28.6446k        31.8393k     8.56433m       9.5195m   38.8146u
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_mean               483 ms         3589 ms           27    3.58856          7.42225   57.2314M       15.9483M        118.373M     0.278664       2.06832   0.483488
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_median             483 ms         3588 ms           27    3.58781          7.42495   57.2314M       15.9516M        118.427M     0.278721       2.06928   0.483261
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_stddev            1.10 ms         6.15 ms           27   6.15693m        0.0128675          0       27.3476k        266.874k     477.844u      4.66307m   1097.81u
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_mean                         185 ms         1480 ms           27    1.48001           7.9795   24.4218M       16.5011M        131.671M     0.675672       5.39152   0.185477
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_median                       185 ms         1480 ms           27    1.47957          7.98001   24.4218M        16.506M        131.709M      0.67587        5.3931   0.185422
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_stddev                     0.284 ms         1.48 ms           27    1.4769m         8.72791m          0       16.4542k        200.959k      673.75u      8.22867m   284.073u
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_mean                      39.7 ms          315 ms           27   0.314677          7.92492        12M       38.1591M        302.487M      3.17993       25.2072  0.0397233
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_median                    39.1 ms          311 ms           27   0.311034          7.96563        12M       38.5811M        307.257M      3.21509       25.6047  0.0390553
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_stddev                    1.51 ms         8.35 ms           27   8.34959m        0.0893997          0       971.883k        10.8299M    0.0809903      0.902496   1.51306m
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                          79.0 ms         79.0 ms           27  0.0789524          0.99963   6.12864M       77.6249M        77.5962M      12.6659       12.6612  0.0789817
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                        78.9 ms         78.9 ms           27  0.0788841         0.999725   6.12864M       77.6917M        77.6706M      12.6768       12.6734  0.0789055
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                       0.213 ms        0.184 ms           27   184.079u         403.452u          0       180.141k        208.061k    0.0293933      0.033949   212.935u
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_mean                375 ms          375 ms           27   0.375122         0.999788   24.1603M       64.4077M         64.394M      2.66585       2.66529   0.375201
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_median              376 ms          376 ms           27   0.376142          0.99978   24.1603M       64.2318M        64.2182M      2.65857       2.65801   0.376221
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_stddev             1.65 ms         1.65 ms           27   1.65138m         94.7327u          0       284.133k        284.329k    0.0117604     0.0117685   1.65327m
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_mean                430 ms          430 ms           27   0.429825         0.999783   24.1603M       56.2096M        56.1974M      2.32653       2.32603   0.429918
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_median              430 ms          430 ms           27   0.429744         0.999786   24.1603M       56.2201M        56.2084M      2.32697       2.32648   0.429833
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_stddev            0.318 ms        0.311 ms           27   310.493u         75.3429u          0       40.4997k        41.4639k      1.6763m       1.7162m   317.962u
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_mean                             255 ms          255 ms           27   0.254739         0.999781   24.1603M       94.8434M        94.8227M       3.9256       3.92474   0.254795
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_median                           255 ms          255 ms           27   0.254732         0.999771   24.1603M       94.8458M        94.8133M      3.92569       3.92435   0.254819
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_stddev                         0.385 ms        0.389 ms           27    389.04u         97.2797u          0       144.655k        143.144k     5.98729m      5.92477m   385.183u
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_mean                             248 ms          248 ms           27   0.248078         0.999799   24.1603M       97.3915M        97.3718M      4.03106       4.03025   0.248128
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_median                           248 ms          248 ms           27   0.247921         0.999781   24.1603M       97.4516M          97.43M      4.03355       4.03265   0.247976
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_stddev                          1.04 ms         1.04 ms           27    1036.1u         91.4037u          0       405.888k        406.443k    0.0167998     0.0168228   1037.87u
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_mean                             234 ms          234 ms           27   0.233758         0.999798   24.1603M       103.356M        103.335M      4.27793       4.27707   0.233805
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_median                           234 ms          234 ms           27   0.233726         0.999765   24.1603M        103.37M        103.368M      4.27851       4.27841   0.233732
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_stddev                         0.262 ms        0.257 ms           27   256.535u          111.88u          0       113.353k        115.675k     4.69171m      4.78781m   261.898u
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_mean                             235 ms          235 ms           27   0.235305         0.999806   24.1603M       102.676M        102.657M      4.24981       4.24899   0.235351
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_median                           235 ms          235 ms           27   0.235239         0.999769   24.1603M       102.705M        102.695M        4.251       4.25057   0.235263
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_stddev                         0.351 ms        0.342 ms           27   342.079u          111.19u          0       148.937k        152.731k     6.16456m      6.32158m    350.96u
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_mean                            222 ms          222 ms           27   0.222169         0.999802   10.1568M       45.7364M        45.7273M      4.50303       4.50214   0.222213
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_median                          219 ms          219 ms           27   0.219446         0.999771   10.1568M       46.2838M         46.274M      4.55693       4.55596   0.219493
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_stddev                         4.78 ms         4.78 ms           27   4.77589m         73.8364u          0        957.82k        957.907k    0.0943033     0.0943119   4.77829m
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_mean                        13.7 ms          109 ms           27   0.108907          7.95884   20.6131M       189.308M        1.50669G      9.18387       73.0935  0.0136839
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_median                      13.7 ms          109 ms           27   0.109386          7.95703   20.6131M       188.444M        1.50301G      9.14194       72.9152  0.0137146
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_stddev                     0.199 ms         1.50 ms           27   1.50293m        0.0163877          0       2.62034M        22.0222M      0.12712       1.06836   199.345u
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_mean                       27.0 ms          215 ms           27   0.215353          7.97312   20.5507M       95.4337M        760.914M      4.64383       37.0263  0.0270104
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_median                     27.0 ms          215 ms           27   0.215438          7.97485   20.5507M       95.3903M        760.427M      4.64171       37.0026  0.0270252
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_stddev                    0.269 ms         1.75 ms           27   1.74721m         0.014832          0       757.767k        7.38375M    0.0368731      0.359295   268.885u
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_mean                      5.31 ms         42.3 ms           27  0.0423478          7.96818   10.3933M       245.433M        1.95566G      23.6145       188.165   5.31463m
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_median                    5.32 ms         42.4 ms           27  0.0423718          7.96733   10.3933M       245.289M        1.95449G      23.6006       188.052   5.31768m
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_stddev                   0.029 ms        0.197 ms           27    197.07u         7.18863m          0       1.14322M        10.6535M     0.109996       1.02503   28.9079u
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_mean                            73.8 ms         77.3 ms           27  0.0772965          1.04637   10.3281M       137.232M        140.022M      13.2873       13.5574  0.0737896
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_median                          74.6 ms         74.6 ms           27   0.074631         0.999745   10.3281M       138.388M         138.37M      13.3993       13.3975  0.0746409
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_stddev                          1.48 ms         18.7 ms           27  0.0187208         0.242008          0        15.583M        2.85786M       1.5088      0.276709   1.48122m
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_mean                          79.6 ms          634 ms           27   0.634238          7.96905   61.3002M       96.6584M        770.286M       1.5768       12.5658   0.079589
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_median                        79.6 ms          635 ms           27   0.634777          7.97073   61.3002M       96.5698M        770.199M      1.57536       12.5644  0.0795902
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_stddev                       0.818 ms         5.39 ms           27   5.38533m        0.0151495          0       802.899k        7.71121M    0.0130978      0.125794   817.766u
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_mean           329 ms          329 ms           27   0.329368         0.999804   28.1667M       85.5174M        85.5006M      3.03612       3.03553   0.329432
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_median         329 ms          329 ms           27   0.329297          0.99977   28.1667M       85.5357M        85.5111M      3.03677        3.0359   0.329392
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_stddev       0.257 ms        0.247 ms           27   246.421u         141.203u          0       63.9024k        66.7468k     2.26872m      2.36971m   257.415u
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_mean          132 ms          132 ms           27   0.132312         0.999673   20.5978M       155.678M        155.627M      7.55796       7.55549   0.132355
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_median        132 ms          132 ms           27   0.132227         0.999734   20.5978M       155.776M        155.734M      7.56275       7.56069   0.132263
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_stddev      0.328 ms        0.285 ms           27   285.018u         348.593u          0       333.444k        382.935k    0.0161883      0.018591    327.84u
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                          225 ms          225 ms           27   0.224833         0.999791   20.7984M       92.5094M        92.4901M      4.44791       4.44698    0.22488
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                        225 ms          225 ms           27   0.225098         0.999768   20.7984M        92.397M         92.384M       4.4425       4.44188    0.22513
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                       1.40 ms         1.40 ms           27   1.40071m         123.436u          0       579.564k        578.366k    0.0278658     0.0278082   1.39857m
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                          170 ms          170 ms           27   0.169941         0.999796    10.119M       59.5444M        59.5322M      5.88439       5.88319   0.169976
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                        170 ms          170 ms           27   0.169951         0.999753    10.119M        59.541M        59.5262M      5.88406        5.8826   0.169993
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                      0.083 ms        0.072 ms           27   71.5906u         122.803u          0       25.0929k        29.0196k     2.47978m      2.86782m   82.8219u
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_mean                           8.12 ms         64.8 ms           27  0.0648045          7.97979   5.27155M       81.3511M         649.17M      15.4321       123.146   8.12116m
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_median                         8.14 ms         64.9 ms           27  0.0649027          7.97858   5.27155M       81.2224M        647.988M      15.4077       122.922   8.13526m
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_stddev                        0.078 ms        0.548 ms           27   547.916u         9.96365m          0       689.333k        6.22422M     0.130765       1.18072   77.7306u
RUNNING: /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_repetitions=27 -r /home/lebedevri/raw-camera-samples/raw.pixls.us-unique --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmp2fL6O_
2019-07-26 21:27:50
Running /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 3.56, 2.97, 2.06
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                    Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_mean              217 ms          217 ms           27   0.217357         0.999785   9.96653M       45.8534M        45.8435M      4.60074       4.59975   0.217404
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_median            217 ms          217 ms           27   0.217318         0.999726   9.96653M       45.8616M        45.8498M      4.60156       4.60038   0.217373
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_stddev          0.241 ms        0.239 ms           27   238.508u         110.512u          0        50.277k         50.773k     5.04459m      5.09435m   240.967u
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_mean              128 ms          128 ms           27   0.127577         0.999796   5.25658M       41.2033M        41.1949M      7.83843       7.83683   0.127603
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_median            128 ms          128 ms           27   0.127544         0.999737   5.25658M       41.2139M        41.2032M      7.84044       7.83842   0.127577
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_stddev          0.217 ms        0.214 ms           27   213.818u         124.342u          0       68.6124k        69.7472k    0.0130527     0.0132686   217.427u
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_mean                           438 ms          437 ms           27   0.437433         0.999775   52.6643M       120.394M        120.367M      2.28607       2.28556   0.437531
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_median                         437 ms          437 ms           27   0.437309         0.999782   52.6643M       120.428M          120.4M      2.28671       2.28618    0.43741
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_stddev                       0.678 ms        0.681 ms           27   681.529u         69.4031u          0       187.551k        186.555k     3.56125m      3.54233m   678.218u
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_mean                           554 ms          554 ms           27   0.553907         0.999781   27.9936M       50.5385M        50.5275M      1.80536       1.80496   0.554029
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_median                         554 ms          554 ms           27   0.553696         0.999687   27.9936M       50.5577M        50.5514M      1.80605       1.80582   0.553765
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_stddev                       0.861 ms        0.827 ms           27   827.326u          138.07u          0        75.409k        78.3935k      2.6938m      2.80041m   860.416u
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean                           305 ms          305 ms           27   0.304983          0.99978   12.4416M       40.7945M        40.7855M      3.27888       3.27816    0.30505
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median                         305 ms          305 ms           27    0.30499         0.999722   12.4416M       40.7935M        40.7818M       3.2788       3.27786   0.305077
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev                       0.451 ms        0.427 ms           27   426.649u         145.855u          0       56.9944k        60.2374k     4.58095m      4.84161m   451.077u
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_mean                          56.1 ms         56.1 ms           27  0.0561335         0.999803   2.51942M       44.8828M        44.8739M      17.8147       17.8112  0.0561446
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_median                        56.1 ms         56.1 ms           27  0.0561033         0.999766   2.51942M       44.9069M        44.8995M      17.8243       17.8213  0.0561126
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_stddev                       0.072 ms        0.071 ms           27   70.9614u         82.6905u          0       56.6786k        57.1153k    0.0224966       0.02267   71.5311u
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                           265 ms          264 ms           27   0.264493          0.99979   25.5041M       96.4266M        96.4063M      3.78082       3.78003   0.264549
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                         265 ms          265 ms           27   0.264515         0.999787   25.5041M       96.4186M        96.3984M      3.78051       3.77972    0.26457
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                       0.255 ms        0.248 ms           27    248.06u         80.6647u          0       90.4431k        92.8471k     3.54622m      3.64047m   254.789u
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_mean                          62.9 ms         62.9 ms           27  0.0628574         0.999803   3.23814M       51.5158M        51.5056M       15.909       15.9059  0.0628697
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_median                        62.9 ms         62.8 ms           27  0.0628458         0.999776   3.23814M       51.5252M        51.5177M       15.912       15.9096   0.062855
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_stddev                       0.034 ms        0.030 ms           27   29.5792u         116.782u          0       24.2363k        27.5752k     7.48464m      8.51574m   33.6708u
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_mean                     63.7 ms         63.7 ms           27  0.0637022         0.999804   3.34464M       52.5043M        52.4941M      15.6981        15.695  0.0637147
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_median                   63.7 ms         63.7 ms           27  0.0636987         0.999772   3.34464M       52.5072M        52.4926M      15.6989       15.6945  0.0637164
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_stddev                  0.056 ms        0.053 ms           27    52.531u         114.611u          0       43.2285k        45.7375k    0.0129247     0.0136749   55.5892u
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_mean               482 ms         3553 ms           27    3.55313          7.37673   57.2314M       16.1074M        118.819M     0.281443       2.07613   0.481667
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_median             482 ms         3552 ms           27    3.55238          7.37846   57.2314M       16.1107M        118.853M     0.281501       2.07671   0.481532
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_stddev           0.600 ms         5.90 ms           27   5.90148m         6.72044m          0        26.761k        147.912k     467.593u      2.58446m   600.167u
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_mean                         186 ms         1486 ms           27     1.4864          7.97935   24.4218M       16.4303M        131.103M     0.672772       5.36829   0.186281
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_median                       186 ms         1487 ms           27    1.48704          7.98027   24.4218M       16.4231M        131.018M     0.672475       5.36481     0.1864
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_stddev                     0.652 ms         4.11 ms           27   4.10582m         7.23506m          0       45.4141k        458.782k     1.85957m     0.0187857    651.74u
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_mean                      39.1 ms          312 ms           27   0.311914          7.97037        12M         38.48M        306.707M      3.20667       25.5589  0.0391353
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_median                    38.9 ms          310 ms           27   0.309873          7.97084        12M       38.7255M        308.506M      3.22713       25.7088  0.0388971
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_stddev                   0.641 ms         4.58 ms           27   4.57635m        0.0195726          0       559.017k        4.95242M    0.0465848      0.412702   640.718u
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                          67.4 ms         67.4 ms           27  0.0673743         0.999756   6.12864M       90.9788M        90.9566M      14.8449       14.8412  0.0673908
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                        67.0 ms         66.9 ms           27  0.0669451         0.999716   6.12864M       91.5473M        91.5124M      14.9376       14.9319  0.0669706
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                       0.881 ms        0.881 ms           27   880.633u         170.283u          0       1.16951M        1.16948M     0.190827      0.190822   881.007u
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_mean                374 ms          374 ms           27   0.374078          0.99976   24.1603M       64.5874M        64.5719M      2.67329       2.67265   0.374167
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_median              375 ms          375 ms           27   0.375076         0.999743   24.1603M       64.4142M        64.3981M      2.66612       2.66546    0.37517
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_stddev             1.68 ms         1.67 ms           27   1.67117m         118.413u          0       289.158k        290.051k    0.0119683     0.0120053   1.67717m
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_mean                429 ms          429 ms           27   0.429337         0.999778   24.1603M       56.2735M         56.261M      2.32918       2.32866   0.429432
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_median              429 ms          429 ms           27   0.429282         0.999783   24.1603M       56.2806M        56.2624M      2.32947       2.32872   0.429421
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_stddev            0.318 ms        0.303 ms           27   302.811u         86.0253u          0       39.6843k        41.6671k     1.64254m      1.72461m   318.077u
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_mean                             254 ms          254 ms           27   0.253873         0.999757   24.1603M       95.1669M        95.1437M      3.93898       3.93803   0.253935
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_median                           254 ms          254 ms           27   0.253924         0.999755   24.1603M       95.1475M        95.1285M      3.93818        3.9374   0.253975
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_stddev                         0.431 ms        0.419 ms           27   419.159u         116.554u          0       157.202k        161.585k     6.50665m      6.68804m   431.034u
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_mean                             248 ms          248 ms           27   0.248208         0.999866   24.1603M       97.3392M        97.3262M       4.0289       4.02836   0.248242
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_median                           248 ms          248 ms           27    0.24813         0.999897   24.1603M       97.3695M         97.363M      4.03015       4.02988   0.248146
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_stddev                         0.644 ms        0.636 ms           27   635.579u         92.0355u          0       248.613k        251.922k    0.0102901     0.0104271   644.247u
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_mean                             233 ms          233 ms           27   0.233221         0.999894   24.1603M       103.594M        103.583M      4.28779       4.28733   0.233246
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_median                           233 ms          233 ms           27   0.233194         0.999892   24.1603M       103.606M          103.6M      4.28828       4.28805   0.233206
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_stddev                         0.290 ms        0.291 ms           27   290.906u          29.465u          0       129.165k        128.928k     5.34617m      5.33638m    290.43u
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_mean                             235 ms          235 ms           27   0.235143         0.999895   24.1603M       102.747M        102.737M      4.25274        4.2523   0.235168
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_median                           235 ms          235 ms           27   0.234936         0.999892   24.1603M       102.837M        102.825M      4.25647       4.25594   0.234966
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_stddev                         0.432 ms        0.432 ms           27   431.828u         22.4686u          0       188.369k        188.513k     7.79666m      7.80261m   432.247u
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_mean                            220 ms          220 ms           27   0.219724         0.999912   10.1568M       46.2256M        46.2216M       4.5512        4.5508   0.219744
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_median                          219 ms          219 ms           27    0.21935         0.999915   10.1568M       46.3041M        46.2985M      4.55892       4.55838   0.219376
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_stddev                        0.701 ms        0.701 ms           27   700.728u         22.4112u          0       146.914k        146.886k    0.0144646     0.0144618   700.711u
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_mean                        18.4 ms          134 ms           27   0.134217          7.34991   20.6131M       156.589M        1.15903G      7.59658       56.2279  0.0184245
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_median                      20.0 ms          143 ms           27   0.142954          7.10788   20.6131M       144.194M        1031.26M      6.99527       50.0294  0.0199883
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_stddev                      3.36 ms         18.4 ms           27  0.0184067         0.379332          0       22.8621M        230.634M      1.10911       11.1887   3.35739m
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_mean                       48.4 ms          343 ms           27   0.342767          7.08746   20.5507M       60.0773M        425.781M      2.92338       20.7186  0.0483624
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_median                     48.2 ms          342 ms           27   0.341565          7.08036   20.5507M       60.1662M        426.271M       2.9277       20.7424  0.0482104
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_stddev                     2.18 ms         15.7 ms           27  0.0156749        0.0457947          0       2.77981M        19.6659M     0.135266      0.956949   2.18003m
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_mean                      9.23 ms         65.0 ms           27  0.0650142          7.04263   10.3933M       159.993M        1.12671G      15.3938       108.407   9.23115m
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_median                    9.19 ms         65.0 ms           27  0.0649701          7.04153   10.3933M       159.971M        1.13046G      15.3917       108.768    9.1939m
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_stddev                   0.254 ms         1.90 ms           27   1.90112m        0.0282113          0       4.62052M        30.5074M     0.444565       2.93528   253.691u
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_mean                            72.0 ms         74.6 ms           27  0.0746144          1.03562   10.3281M       140.833M         143.39M       13.636       13.8835  0.0720285
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_median                          72.0 ms         72.0 ms           27  0.0719971         0.999721   10.3281M       143.451M        143.402M      13.8894       13.8847  0.0720219
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_stddev                         0.225 ms         13.6 ms           27   0.013648         0.186462          0       13.7065M        447.989k      1.32711     0.0433759   225.388u
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_mean                          80.2 ms          637 ms           27   0.636574          7.93586   61.3002M       96.3609M        764.862M      1.57195       12.4773  0.0802402
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_median                        79.2 ms          631 ms           27   0.631014          7.96603   61.3002M       97.1456M        774.035M      1.58475        12.627  0.0791957
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_stddev                        2.92 ms         17.3 ms           27  0.0172532         0.078915          0       2.44406M        25.7735M    0.0398703      0.420447   2.92362m
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_mean           342 ms          342 ms           27   0.341819         0.999796   28.1667M       82.4024M        82.3856M      2.92553       2.92493   0.341889
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_median         342 ms          342 ms           27    0.34171         0.999758   28.1667M       82.4284M        82.4094M      2.92645       2.92578   0.341789
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_stddev       0.448 ms        0.435 ms           27   434.881u         111.992u          0       104.391k        107.627k     3.70618m      3.82109m    448.49u
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_mean          131 ms          131 ms           27   0.131045         0.999772   20.5978M       157.182M        157.146M      7.63097       7.62924   0.131075
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_median        131 ms          131 ms           27   0.131052         0.999737   20.5978M       157.173M        157.132M      7.63055       7.62858   0.131086
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_stddev      0.075 ms        0.071 ms           27    71.526u         90.3416u          0       85.7916k        90.1664k     4.16508m      4.37747m   75.2075u
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                          156 ms          156 ms           27   0.156209         0.999775   20.7984M       133.145M        133.115M       6.4017       6.40026   0.156245
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                        156 ms          156 ms           27   0.156084         0.999731   20.7984M       133.252M        133.221M      6.40682       6.40534    0.15612
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                      0.400 ms        0.392 ms           27   392.445u         128.334u          0       334.308k        340.175k    0.0160737     0.0163558    399.53u
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                          147 ms          147 ms           27   0.147323          0.99979    10.119M       68.6861M        68.6716M       6.7878       6.78638   0.147354
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                        147 ms          147 ms           27   0.147404          0.99976    10.119M       68.6483M        68.6493M      6.78408       6.78417   0.147402
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                      0.202 ms        0.209 ms           27   208.528u         112.988u          0       97.3715k        94.1225k     9.62261m      9.30152m    201.66u
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_mean                           8.14 ms         64.9 ms           27  0.0649296          7.97933   5.27155M       81.1938M        647.876M      15.4023         122.9   8.13727m
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_median                         8.16 ms         65.1 ms           27  0.0650779          7.97825   5.27155M       81.0038M        646.172M      15.3662       122.577   8.15813m
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_stddev                        0.072 ms        0.521 ms           27   520.762u         7.26169m          0       650.404k        5.69695M      0.12338        1.0807   71.7411u
Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_pvalue                     0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_mean                      -0.0151         -0.0151           221           217           221           217
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_median                    -0.0144         -0.0145           221           217           221           217
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_stddev                    -0.7171         -0.7141             1             0             1             0
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_pvalue                     0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_mean                      -0.0073         -0.0073           129           128           129           128
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_median                    -0.0071         -0.0073           128           128           128           128
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_stddev                    -0.3942         -0.4063             0             0             0             0
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_mean                                   -0.0023         -0.0023           439           438           438           437
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_median                                 -0.0018         -0.0018           438           437           438           437
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_stddev                                 -0.1822         -0.1622             1             1             1             1
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_mean                                   -0.0123         -0.0123           561           554           561           554
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_median                                 -0.0135         -0.0135           561           554           561           554
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_stddev                                 -0.3511         -0.3673             1             1             1             1
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean                                   -0.0257         -0.0257           313           305           313           305
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median                                 -0.0258         -0.0257           313           305           313           305
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev                                 +0.6483         +0.6853             0             0             0             0
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_mean                                   -0.0068         -0.0067            57            56            57            56
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_median                                 -0.0065         -0.0064            56            56            56            56
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_stddev                                 -0.6393         -0.6358             0             0             0             0
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                                   -0.0532         -0.0532           279           265           279           264
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                                 -0.0528         -0.0529           279           265           279           265
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                                 -0.2031         -0.2007             0             0             0             0
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_mean                                   -0.0032         -0.0032            63            63            63            63
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_median                                 -0.0031         -0.0032            63            63            63            63
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_stddev                                 -0.6003         -0.6412             0             0             0             0
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_mean                              -0.0021         -0.0021            64            64            64            64
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_median                            -0.0022         -0.0022            64            64            64            64
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_stddev                            +0.4319         +0.5043             0             0             0             0
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_pvalue                      0.0000          0.0000      U Test, Repetitions: 27 vs 27
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_mean                       -0.0038         -0.0099           483           482          3589          3553
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_median                     -0.0036         -0.0099           483           482          3588          3552
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_stddev                     -0.4532         -0.0408             1             1             6             6
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_pvalue                                0.0001          0.0000      U Test, Repetitions: 27 vs 27
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_mean                                 +0.0043         +0.0043           185           186          1480          1486
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_median                               +0.0053         +0.0050           185           186          1480          1487
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_stddev                               +1.2944         +1.7814             0             1             1             4
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_pvalue                              0.1717          0.2326      U Test, Repetitions: 27 vs 27
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_mean                               -0.0148         -0.0088            40            39           315           312
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_median                             -0.0040         -0.0037            39            39           311           310
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_stddev                             -0.5766         -0.4519             2             1             8             5
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                                   -0.1468         -0.1466            79            67            79            67
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                                 -0.1513         -0.1513            79            67            79            67
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                                 +3.1372         +3.7836             0             1             0             1
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_pvalue                       0.0009          0.0008      U Test, Repetitions: 27 vs 27
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_mean                        -0.0028         -0.0028           375           374           375           374
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_median                      -0.0028         -0.0028           376           375           376           375
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_stddev                      +0.0145         +0.0120             2             2             2             2
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_pvalue                       0.0000          0.0000      U Test, Repetitions: 27 vs 27
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_mean                        -0.0011         -0.0011           430           429           430           429
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_median                      -0.0010         -0.0011           430           429           430           429
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_stddev                      +0.0006         -0.0246             0             0             0             0
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_pvalue                                    0.0000          0.0000      U Test, Repetitions: 27 vs 27
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_mean                                     -0.0034         -0.0034           255           254           255           254
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_median                                   -0.0033         -0.0032           255           254           255           254
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_stddev                                   +0.1188         +0.0771             0             0             0             0
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_pvalue                                    0.3157          0.2758      U Test, Repetitions: 27 vs 27
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_mean                                     +0.0005         +0.0005           248           248           248           248
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_median                                   +0.0007         +0.0008           248           248           248           248
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_stddev                                   -0.3793         -0.3866             1             1             1             1
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_pvalue                                    0.0000          0.0000      U Test, Repetitions: 27 vs 27
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_mean                                     -0.0024         -0.0023           234           233           234           233
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_median                                   -0.0022         -0.0023           234           233           234           233
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_stddev                                   +0.1087         +0.1338             0             0             0             0
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_pvalue                                    0.0430          0.0593      U Test, Repetitions: 27 vs 27
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_mean                                     -0.0008         -0.0007           235           235           235           235
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_median                                   -0.0013         -0.0013           235           235           235           235
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_stddev                                   +0.2314         +0.2618             0             0             0             0
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_pvalue                                   0.0348          0.0806      U Test, Repetitions: 27 vs 27
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_mean                                    -0.0111         -0.0110           222           220           222           220
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_median                                  -0.0005         -0.0004           219           219           219           219
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_stddev                                  -0.8534         -0.8533             5             1             5             1
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_pvalue                                0.0000          0.0000      U Test, Repetitions: 27 vs 27
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_mean                                 +0.3464         +0.2324            14            18           109           134
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_median                               +0.4574         +0.3069            14            20           109           143
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_stddev                              +15.8419        +11.2460             0             3             2            18
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_pvalue                               0.0000          0.0000      U Test, Repetitions: 27 vs 27
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_mean                                +0.7905         +0.5916            27            48           215           343
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_median                              +0.7839         +0.5855            27            48           215           342
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_stddev                              +7.1066         +7.9693             0             2             2            16
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_pvalue                              0.0000          0.0000      U Test, Repetitions: 27 vs 27
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_mean                               +0.7369         +0.5353             5             9            42            65
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_median                             +0.7289         +0.5333             5             9            42            65
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_stddev                             +7.7760         +8.6463             0             0             0             2
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_pvalue                                    0.0090          0.0214      U Test, Repetitions: 27 vs 27
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_mean                                     -0.0239         -0.0347            74            72            77            75
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_median                                   -0.0351         -0.0353            75            72            75            72
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_stddev                                   -0.8478         -0.2710             1             0            19            14
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_pvalue                                  0.3870          0.1414      U Test, Repetitions: 27 vs 27
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_mean                                   +0.0082         +0.0037            80            80           634           637
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_median                                 -0.0050         -0.0059            80            79           635           631
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_stddev                                 +2.5751         +2.2037             1             3             5            17
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_pvalue                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_mean                   +0.0378         +0.0378           329           342           329           342
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_median                 +0.0376         +0.0377           329           342           329           342
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_stddev                 +0.7421         +0.7638             0             0             0             0
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 27 vs 27
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_mean                  -0.0097         -0.0096           132           131           132           131
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_median                -0.0089         -0.0089           132           131           132           131
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_stddev                -0.7709         -0.7493             0             0             0             0
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 27 vs 27
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                                  -0.3052         -0.3052           225           156           225           156
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                                -0.3065         -0.3066           225           156           225           156
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                                -0.7143         -0.7198             1             0             1             0
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 27 vs 27
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                                  -0.1331         -0.1331           170           147           170           147
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                                -0.1329         -0.1327           170           147           170           147
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                                +1.4339         +1.9116             0             0             0             0
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_pvalue                                   0.4999          0.4781      U Test, Repetitions: 27 vs 27
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_mean                                    +0.0020         +0.0019             8             8            65            65
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_median                                  +0.0028         +0.0027             8             8            65            65
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_stddev                                  -0.0771         -0.0497             0             0             1             1

Overview: there are three major regressions:

Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_pvalue                               0.0000          0.0000      U Test, Repetitions: 27 vs 27
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_mean                                +0.7905         +0.5916            27            48           215           343
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_median                              +0.7839         +0.5855            27            48           215           342
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_stddev                              +7.1066         +7.9693             0             2             2            16
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_pvalue                              0.0000          0.0000      U Test, Repetitions: 27 vs 27
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_mean                               +0.7369         +0.5353             5             9            42            65
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_median                             +0.7289         +0.5333             5             9            42            65
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_stddev                             +7.7760         +8.6463             0             0             0             2
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_pvalue                                0.0000          0.0000      U Test, Repetitions: 27 vs 27
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_mean                                 +0.3464         +0.2324            14            18           109           134
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_median                               +0.4574         +0.3069            14            20           109           143
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_stddev                              +15.8419        +11.2460             0             3             2            18

I have manually re-run them, and that result is noise; they don't actually regress.

There are 4 major improvements:

Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 27 vs 27
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                                  -0.3052         -0.3052           225           156           225           156
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                                -0.3065         -0.3066           225           156           225           156
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                                -0.7143         -0.7198             1             0             1             0
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                                   -0.1468         -0.1466            79            67            79            67
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                                 -0.1513         -0.1513            79            67            79            67
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                                 +3.1372         +3.7836             0             1             0             1
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 27 vs 27
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                                  -0.1331         -0.1331           170           147           170           147
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                                -0.1329         -0.1327           170           147           170           147
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                                +1.4339         +1.9116             0             0             0             0
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 27 vs 27
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                                   -0.0532         -0.0532           279           265           279           264
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                                 -0.0528         -0.0529           279           265           279           265
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                                 -0.2031         -0.2007             0             0             0             0

I have manually re-run them, and they are all real.

So ignoring those false regressions, this appears to be performance-positive overall.
In other words, yes, i believe this change makes sense in itself, not just as a chosen implementation path in the patchset.

Perhaps @dmgreen and other interested parties could also benchmark this change itself; that may be valuable.
Though the previous time -phi-node-folding-threshold was bumped (from 1 to 2, rL229099), i don't observe any perf considerations in that commit.

Hello

Most of the results I ran came out roughly equal. AArch64 codesize was a little better, Thumb2 a little worse, but mostly flat. Performance was all flat-ish.

All except Thumb1 codegen on cmsisdsp, which really didn't like this. Using this code shows the problem:

static int __SSAT(int val, unsigned sat)
{
  if ((sat >= 1U) && (sat <= 32U))
  {
    const int max = (int)((1U << (sat - 1U)) - 1U);
    const int min = -1 - max ;
    if (val > max)
    {
      return max;
    }
    else if (val < min)
    {
      return min;
    }
  }
  return val;
}

void arm_add_q15(const short * pSrcA, const short * pSrcB, short * pDst, unsigned  blockSize)
{
  unsigned  blkCnt = blockSize;
  while (blkCnt > 0U)
  {
    *pDst++ = (short) __SSAT(((int) *pSrcA++ + *pSrcB++), 16);
    blkCnt--;
  }
}

Usually, on a cpu that supports it, __SSAT would use an intrinsic, which expands to just a single instruction. It's not available on a Thumb1 core though, and the new change of using selects over blocks really bloats the code out. From a really quick look it seemed that the way it's folded before inlining is causing a bit of an odd ordering to the selects. It ends up selecting on the xor of two conditions, as opposed to the simpler pair of higher/lower selects.

I was compiling with something like "bin/clang -target arm-none-eabi -mcpu=cortex-m0plus -O3 arm_add_q15.c", but it seems to show the same issues on other architectures with this code. If that problem was cleared up, I think the rest of the results I ran would either be flat or an improvement.
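
To make the "pair of higher/lower selects" shape concrete, here is a minimal sketch in C for the fixed sat == 16 case. The function names and the checker are hypothetical (they are not part of CMSIS or LLVM), and the sketch assumes int is at least 32 bits wide. It only shows that the branchy form and the two-select form compute the same value; it says nothing about what any particular compiler emits.

```c
#include <assert.h>

/* Branchy form: same structure as __SSAT above with sat fixed to 16. */
static int ssat16_branches(int val)
{
  const int max = 32767;
  const int min = -1 - max;
  if (val > max)
    return max;
  else if (val < min)
    return min;
  return val;
}

/* Select form: clamp from below, then clamp from above. */
static int ssat16_selects(int val)
{
  int lo = (val > -32768) ? val : -32768;
  return (lo < 32767) ? lo : 32767;
}

/* Compares both forms on every possible sum of two 16-bit values
   and returns how many inputs were checked. */
static int ssat16_check(void)
{
  int checked = 0;
  for (int v = -65536; v <= 65534; ++v) {
    assert(ssat16_branches(v) == ssat16_selects(v));
    ++checked;
  }
  return checked;
}
```

Calling ssat16_check() from a test harness walks the whole input range of arm_add_q15's sum.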

For the mul overflow case, I wonder if we should be treating extractvalue as free? It won’t generate any code.

@dmgreen could you please specify, did you test this point of this patch queue, or this patch itself?
I'm interested in latter, only in perf impact of changing -phi-node-folding-threshold=2 to -phi-node-folding-threshold=3
Otherwise you'd need to test the entire patch queue.

For the mul overflow case, I wonder if we should be treating extractvalue as free? It won’t generate any code.

extractvalue from @llvm.with.overflow specifically?
Only for X86, or in general?
That could work.

@jmolloy (via mail)

extractvalue from @llvm.with.overflow specifically?

Is there a situation in which *any* extractvalue generates code? structs have to be SROA'd into registers anyway, right?

I don't know an answer to that question, thus i'm asking in the first place.
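
For context, a minimal sketch of one source-level shape that, as far as i know, Clang lowers to an @llvm.umul.with.overflow call followed by extractvalue of the overflow bit. It relies on the GCC/Clang builtin __builtin_mul_overflow, which is a compiler extension rather than standard C, and it assumes a 32-bit unsigned. The function names are hypothetical, and the guarded variant is only an assumption about what a "mul overflow check" in user code may look like, not a copy of the test in this patchset.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when x * y does not fit in unsigned. */
static bool umul_overflows(unsigned x, unsigned y)
{
  unsigned product; /* the wrapped product is computed but unused here */
  return __builtin_mul_overflow(x, y, &product);
}

/* Guarded variant: the multiply is only reached when x is non-zero,
   which is the kind of small conditional block a phi-folding
   threshold decides whether to turn into a select. */
static bool umul_overflows_guarded(unsigned x, unsigned y)
{
  if (x == 0)
    return false;
  return umul_overflows(x, y);
}
```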

...

Hmm, i never got this mail, sorry for not noticing this reply. Thank you for looking into it!
I believe this shows the change: https://godbolt.org/z/Mb6T3R
That indeed looks not great. But i think i'm missing something.
That replacement does not look correct to me: https://rise4fun.com/Alive/CNns
Did you run the tests, too, or only benchmarks?

Ok so roughly, the generalization seems to be: https://rise4fun.com/Alive/6Ey

I *think* that will handle the case you hit - https://godbolt.org/z/n35hPH
The weird constant is because it gets shrunk since those bits are not being used.
I think that happens late[r] in pipeline..

We indeed currently don't catch it: https://godbolt.org/z/14si0q
We could: https://rise4fun.com/Alive/ZeC

lebedev.ri edited the summary of this revision. Jul 30 2019, 1:31 PM

And some procrastination later, D65765, PTAL.

Did anything happen with the extractvalue cost suggestion?

I was planning to look into that, and i just did. Some observations:

  1. I'm not sure we can literally treat any extractvalue as free, it clearly isn't: https://godbolt.org/z/6wIxAa
  2. Even if we do treat it as free in TargetTransformInfo::getInstructionThroughput(), it is not sufficient yet to get the fold
  3. We also need to do the same in TargetTransformInfoImplCRTPBase::getUserCost() because again, PHINodeFoldingThreshold is at 2.
  4. While that addresses the unsigned-multiply-overflow-check.ll, it still leaves signbit-like-value-extension.ll on the table.

Did anything happen with the extractvalue cost suggestion?

I was planning to look into that, and i just did. Some observations:

  1. I'm not sure we can literally treat any extractvalue as free, it clearly isn't: https://godbolt.org/z/6wIxAa

For the short cases, the mov belongs to the return not the extractvalue. For the large cases all of that code is the result of passing an array by value.

Okay, D66098.
If accepted, i will rebase this patch on top of that committed patch,
and split this particular patch out of the patchset - i'm still rather interested
in this change, and i'm not seeing any alternative solutions for it.

@dmgreen hi

Hello
...
If that problem was cleared up, I think the rest of the results I ran would either be flat or an improvement.

This issue pattern has now been resolved as of rL368685 + rL368687.
Were there any other issues you observed? If not, care to stamp, please? :)

Does anyone else have any negative reviews as of this point?

For that test case above, it was producing:

%conv2 = sext i16 %1 to i32
%add = add nsw i32 %conv2, %conv
%2 = icmp sgt i32 %add, -32768
%spec.select.i = select i1 %2, i32 %add, i32 -32768
%3 = icmp slt i32 %spec.select.i, 32767
%call7 = select i1 %3, i32 %spec.select.i, i32 32767
%conv3 = trunc i32 %call7 to i16

And Now:

%conv2 = sext i16 %1 to i32
%add = add nsw i32 %conv2, %conv
%2 = icmp slt i32 %add, 32768
%3 = icmp sgt i32 %add, -32768
%4 = select i1 %3, i32 %add, i32 -32768
%spec.select.i = select i1 %2, i32 %4, i32 32767
%conv3 = trunc i32 %spec.select.i to i16

Any idea why the icmp slt i32 %spec.select.i, 32767 is now icmp slt i32 %add, 32768? Is that OK?

It looks a lot better than before, but means we need an extra "add 1" to materialise the constants.

Thank you for taking a look!

That is a valid transformation: https://rise4fun.com/Alive/bxq

It looks a lot better than before, but means we need an extra "add 1" to materialise the constants.

Yeah, i spoke a *bit* too soon, realized that almost immediately after posting (https://reviews.llvm.org/D65765#1627080) :)

I need to follow-up with one more patch:

%2 = icmp slt i32 %add, 32768
%3 = icmp sgt i32 %add, -32768
%4 = select i1 %3, i32 %add, i32 -32768
%spec.select.i = select i1 %2, i32 %4, i32 32767

should be

%n3 = icmp sgt i32 %add, -32768
%n4 = select i1 %n3, i32 %add, i32 -32768
%n5 = icmp slt i32 %n4, 32768
%spec.select.i = select i1 %n5, i32 %n4, i32 32767

https://rise4fun.com/Alive/zRx0
I.e. in the last select, the icmp should be comparing the result of the first select, not the original input.
That will magically fix remaining issues (will allow the constant to get fixed to avoid that add 1)
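
As a sanity check on the three select shapes discussed here, a minimal sketch in C that mirrors the IR by hand. The function names are hypothetical, and the loop range assumes %add is the sum of two sign-extended i16 values, as in arm_add_q15. It only demonstrates that the three forms agree on that range; it is not a substitute for the Alive proofs linked above.

```c
#include <assert.h>

/* Original shape: the upper compare looks at the lower-clamped value,
   with constant 32767. */
static int clamp_before(int add)
{
  int lo = (add > -32768) ? add : -32768;
  return (lo < 32767) ? lo : 32767;
}

/* Current shape: the upper compare looks at the unclamped input,
   with constant 32768. */
static int clamp_now(int add)
{
  int below_hi = add < 32768;
  int lo = (add > -32768) ? add : -32768;
  return below_hi ? lo : 32767;
}

/* Wanted shape: the upper compare looks at the lower-clamped value
   again, still with constant 32768. */
static int clamp_wanted(int add)
{
  int lo = (add > -32768) ? add : -32768;
  return (lo < 32768) ? lo : 32767;
}

/* Checks all three shapes against each other and returns the number
   of inputs visited. */
static int clamp_check(void)
{
  int checked = 0;
  for (int v = -65536; v <= 65534; ++v) {
    assert(clamp_before(v) == clamp_now(v));
    assert(clamp_now(v) == clamp_wanted(v));
    ++checked;
  }
  return checked;
}
```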

So i suppose my question here is - other than this, did you observe any other regressions?
If not, care to tentatively stamp? I will *not* land this patch until after that last needed patch.

That is a valid transformation: https://rise4fun.com/Alive/bxq

I see, because of the truncs back into an i16. That makes sense.

So i suppose my question here is - other than this, did you observe any other regressions?
If not, care to tentatively stamp? I will *not* land this patch until after that last needed patch.

It's hard to tell for sure if they are all the same thing. Here's one example of some code if it's useful. The code is now smaller, but I think it does worse as selects than as branches, as the chance of executing the branched code is so low:

void arm_abs_q31(
    const int * pSrc,
    int * pDst,
    unsigned blockSize)
{
  unsigned blkCnt = blockSize >> 2U;
  while (blkCnt > 0U)
  {
    int in = *pSrc++;
    *pDst++ = (in > 0) ? in : ((in == (~0x7fffffff)) ? 2147483647 : -in);
    in = *pSrc++;
    *pDst++ = (in > 0) ? in : ((in == (~0x7fffffff)) ? 2147483647 : -in);
    in = *pSrc++;
    *pDst++ = (in > 0) ? in : ((in == (~0x7fffffff)) ? 2147483647 : -in);
    in = *pSrc++;
    *pDst++ = (in > 0) ? in : ((in == (~0x7fffffff)) ? 2147483647 : -in);
    blkCnt--;
  }
  blkCnt = blockSize % 0x4U;
  while (blkCnt > 0U)
  {
    int in = *pSrc++;
    *pDst++ = (in > 0) ? in : ((in == (~0x7fffffff)) ? 2147483647 : -in);
    blkCnt--;
  }
}

That's from the same CMSIS DSP suite again, and only when compiling for v6m. Which might not be the most interesting of suites. I've not seen any significant changes anywhere else.

Oh, nice one. Thank you for taking a look!

This one is different; as per me, it's a true-positive improvement (will add a test).
If this is a one-time scalar abs, then i suspect either variant is ok, but if it's done in a loop,
i suspect branch-less abs will be better overall, unless of course the negativity check was predictable..

But then if that knowledge was present (PGO, __builtin_expect()),
then we will get the branch back in the end: https://godbolt.org/z/3Evpf6
Although that does not happen for ARM, so "you've got a bug": https://godbolt.org/z/91UQRK

TLDR: arm_abs_q31() change is not an issue i'm going to look into.
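
For illustration, a minimal sketch in C of what a branch-less saturating abs can look like, next to the expression used in arm_abs_q31(). The function names and the sample list are hypothetical, and this bit-trick form is just one way to write it; it is not claimed to be the sequence any backend emits for the select form.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: the per-element expression from arm_abs_q31(). */
static int32_t abs_q31_ref(int32_t in)
{
  return (in > 0) ? in : ((in == INT32_MIN) ? INT32_MAX : -in);
}

/* Branch-less form using only unsigned arithmetic. */
static int32_t abs_q31_branchless(int32_t in)
{
  uint32_t u = (uint32_t)in;
  uint32_t sign = u >> 31;        /* 1 for negative inputs, else 0 */
  uint32_t mask = 0u - sign;      /* all ones for negative inputs */
  uint32_t a = (u ^ mask) + sign; /* two's complement negate if negative */
  a -= a >> 31;                   /* 0x80000000 saturates to 0x7fffffff */
  return (int32_t)a;
}

/* Compares both forms on a few boundary values and returns how many
   samples were checked. */
static int abs_q31_check(void)
{
  static const int32_t samples[] = {
    INT32_MIN, INT32_MIN + 1, -65536, -1, 0, 1, 65535, INT32_MAX
  };
  int n = (int)(sizeof samples / sizeof samples[0]);
  for (int i = 0; i < n; ++i)
    assert(abs_q31_ref(samples[i]) == abs_q31_branchless(samples[i]));
  return n;
}
```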

That is a valid transformation: https://rise4fun.com/Alive/bxq

I see, because of the truncs back into an i16. That makes sense.

So i suppose my question here is - other than this, did you observe any other regressions?
If not, care to tentatively stamp? I will *not* land this patch until after that last needed patch.

It's hard to tell for sure if they are all the same thing.

Posted D66232, it finishes cleaning CLAMP pattern.
As far as i'm currently aware, that is the last missing bit from my side.

Fair enough.

One last issue, which might come up from this change. This time with a multiply, that was previously able to simplify one of the compares to true, I think in CVP. Same __SSAT as before.

void arm_mult_q7(const signed char * pSrcA, const signed char * pSrcB, signed char * pDst, unsigned  blockSize)
{
  unsigned  blkCnt = blockSize;
  while (blkCnt > 0U)
  {
    *pDst++ = (signed char) __SSAT((((short) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
    blkCnt--;
  }
}

i.e this: https://rise4fun.com/Alive/fplbG

Thank you for taking a look!

Interesting, so https://godbolt.org/z/No06Qq
I'm not sure what fold exactly is missing there yet, i think it could be constant range?
@nikic, is that something you might be interested looking into? :)

i.e this: https://rise4fun.com/Alive/fplbG

@lebedev.ri CVP can already determine that the condition is always true:

define i1 @test(i8 %a, i8 %b) {
  %conv1 = sext i8 %a to i32
  %conv3 = sext i8 %b to i32
  %mul = mul nsw i32 %conv3, %conv1
  %shr = ashr i32 %mul, 7
  br label %split

split:
  %icmp = icmp sgt i32 %shr, -128
  ret i1 %icmp
}

This becomes ret i1 true under -correlated-propagation. Unfortunately CVP has some limitations on icmp simplification that makes it only work in cross-BB scenarios. Relaxing those is not entirely straightforward (iirc it had some negative effects due to LVI query order changes).
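
The range argument behind that always-true compare can be checked by brute force. Below is a minimal sketch in C that enumerates every pair of i8 inputs; the helper names are hypothetical, and ashr is written as a floor division by 128 so the sketch does not depend on how C right-shifts negative values.

```c
#include <assert.h>

/* Arithmetic shift right by 7, i.e. floor(x / 128), written without
   shifting a negative value. */
static int ashr7(int x)
{
  return (x >= 0) ? (x >> 7) : -((-x + 127) >> 7);
}

/* Evaluates (sext a * sext b) ashr 7 for all i8 pairs, asserts the
   compare against -128 holds, and returns the smallest result seen. */
static int shr_min_over_i8_pairs(void)
{
  int min = 0;
  for (int a = -128; a <= 127; ++a) {
    for (int b = -128; b <= 127; ++b) {
      int shr = ashr7(a * b);
      assert(shr > -128);
      if (shr < min)
        min = shr;
    }
  }
  return min;
}
```

The smallest product is -128 * 127 = -16256, which shifts to -127, so the icmp sgt against -128 can never be false.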

Thank you for taking a look!

Hmm, and unlike computeKnownBits(), the computeConstantRange() is not recursive, i thought it was.
So using computeConstantRangeIncludingKnownBits() instead of computeConstantRange() in simplifyICmpWithConstant() does not help, we only get:

LHS_CR
[-16777216,16777216)

I suppose making it recursive (with depth limit of course) would solve it, but i'm not sure if it would be too costly?
This *sounds* like a more general fix rather than changing CVP.
Is there a third alternative i'm not seeing?

Unfortunately CVP has some limitations on icmp simplification that makes it only work in cross-BB scenarios.
Relaxing those is not entirely straightforward (iirc it had some negative effects due to LVI query order changes).

is there a bug# ?

lebedev.ri added a comment. Edited Aug 24 2019, 12:22 AM

That is a valid transformation: https://rise4fun.com/Alive/bxq

I see, because of the truncs back into an i16. That makes sense.

Ok, and that one is done too:

$ ./bin/clang -target arm-none-eabi -mcpu=cortex-m0plus -O3 -mllvm -phi-node-folding-threshold=2 -S -o old.s /tmp/test.c 
$ ./bin/clang -target arm-none-eabi -mcpu=cortex-m0plus -O3 -mllvm -phi-node-folding-threshold=3 -S -o new.s /tmp/test.c 
$ diff old.s new.s 
$ <no diff>
$ sha512sum *.s
9332ab2169151fa21ab03e1fb218178aa16bed2c0dfdd0a40b178a75dbd4a865ba7a7b1b4b37f252928ea794e38776260f84f6c1ae3684d5a445118f518351e2  new.s
9332ab2169151fa21ab03e1fb218178aa16bed2c0dfdd0a40b178a75dbd4a865ba7a7b1b4b37f252928ea794e38776260f84f6c1ae3684d5a445118f518351e2  old.s

Thank you for taking a look!

Fair enough.

One last issue, which might come up from this change. This time with a multiply, that was previously able to simplify one of the compares to true, I think in CVP. Same __SSAT as before.

void arm_mult_q7(const signed char * pSrcA, const signed char * pSrcB, signed char * pDst, unsigned  blockSize)
{
  unsigned  blkCnt = blockSize;
  while (blkCnt > 0U)
  {
    *pDst++ = (signed char) __SSAT((((short) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
    blkCnt--;
  }
}

@dmgreen & others: So now the question is, do we believe this general missing fold is a blocker here?

@lebedev.ri CVP can already determine that the condition is always true:
...
This becomes ret i1 true under -correlated-propagation.
...
Unfortunately CVP has some limitations on icmp simplification that makes it only work in cross-BB scenarios.
Relaxing those is not entirely straightforward (iirc it had some negative effects due to LVI query order changes).

is there a bug# ?

@nikic could you please either CC me to existing bugreport, or file one?

this general missing fold is a blocker here?

I don't think so.

bump

That is a valid transformation: https://rise4fun.com/Alive/bxq

I see, because of the truncs back into an i16. That makes sense.

Ok, and that one is done too:

$ ./bin/clang -target arm-none-eabi -mcpu=cortex-m0plus -O3 -mllvm -phi-node-folding-threshold=2 -S -o old.s /tmp/test.c 
$ ./bin/clang -target arm-none-eabi -mcpu=cortex-m0plus -O3 -mllvm -phi-node-folding-threshold=3 -S -o new.s /tmp/test.c 
$ diff old.s new.s 
$ <no diff>
$ sha512sum *.s
9332ab2169151fa21ab03e1fb218178aa16bed2c0dfdd0a40b178a75dbd4a865ba7a7b1b4b37f252928ea794e38776260f84f6c1ae3684d5a445118f518351e2  new.s
9332ab2169151fa21ab03e1fb218178aa16bed2c0dfdd0a40b178a75dbd4a865ba7a7b1b4b37f252928ea794e38776260f84f6c1ae3684d5a445118f518351e2  old.s

Thank you for taking a look!

Fair enough.

One last issue, which might come up from this change. This time with a multiply, that was previously able to simplify one of the compares to true, I think in CVP. Same __SSAT as before.

void arm_mult_q7(const signed char * pSrcA, const signed char * pSrcB, signed char * pDst, unsigned  blockSize)
{
  unsigned  blkCnt = blockSize;
  while (blkCnt > 0U)
  {
    *pDst++ = (signed char) __SSAT((((short) (*pSrcA++) * (*pSrcB++)) >> 7), 8);
    blkCnt--;
  }
}

Since D66098 got reviewed, i'll split this patch out of this patch series and land the series.
Nothing has changed here, i'm still interested in this change.
The remaining question here is https://reviews.llvm.org/D65148#1648801

lebedev.ri edited the summary of this revision. (Show Details)

Rebased.

And found one more pattern that'd eventually benefit from this - a safe naive implementation of the X86 BZHI pattern, like

// a more likely use-case to avoid shift C UB for c from 1 to 32
unsigned bzhi(unsigned x, unsigned c) {
    if (c < 32) {
        x &= ((1U << c) - 1);
    }
    return x;
}

https://gcc.godbolt.org/z/0UzVJN

^ everything except bzhil itself there isn't needed.
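
As a rough illustration of what if-conversion would do to this pattern, here is a hand-flattened C equivalent. The names `bzhi_branch` and `bzhi_select` are made up for this sketch, and it assumes a 32-bit `unsigned`. The guarded block holds roughly three cheap operations (shift, subtract, and), which is presumably why a threshold of 2 does not cover it.

```c
/* Branchy form, as in the snippet above. */
unsigned bzhi_branch(unsigned x, unsigned c)
{
    if (c < 32)
        x &= (1U << c) - 1;
    return x;
}

/* Hand-flattened form: compute the masked value unconditionally, then select.
 * In C the shift amount has to be clamped (c & 31) to stay defined for
 * c >= 32; in LLVM IR the speculated shl would simply yield poison that
 * the select discards. */
unsigned bzhi_select(unsigned x, unsigned c)
{
    unsigned masked = x & ((1U << (c & 31)) - 1);
    return (c < 32) ? masked : x;
}
```

Both functions return the same value for every `x` and `c`; the second just pays for the mask computation even when `c >= 32`.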

Hello. Sorry for the delay; I missed your earlier email and wanted to grab some more results. This is what it looks like on a Cortex-M0+:

cmsisdsp/BasicMath/arm_abs_q15                         61280.3209   51891.8918   -15.32
cmsisdsp/BasicMath/arm_abs_q31                         96690.6474   61414.7322   -36.48
cmsisdsp/BasicMath/arm_abs_q7                          65869.4373   56112.2244   -14.81
cmsisdsp/BasicMath/arm_mult_q15                        49288.5433   42873.5485   -13.01
cmsisdsp/BasicMath/arm_mult_q7                         51843.8512   41171.4250   -20.58
cmsisdsp/ComplexMath/arm_cmplx_mult_real_q15           33015.6234   27529.7009   -16.61
cmsisdsp/Filtering/arm_fir_lattice_q15                 9564.20255   9130.68289   -4.53
cmsisdsp/Transform/arm_cfft_q15                        3854.80244   3656.58566   -5.14
cmsisdsp/Transform/arm_rfft_q15                        5410.24904   5219.46206   -3.52

cmsisdsp/BasicMath                                     40822.2401   39464.0989   -3.32
cmsisdsp/ComplexMath                                   12243.6404   12120.6590   -1.00
cmsisdsp/Filtering                                     1094.25278   1092.98454   -0.11
cmsisdsp/Transform                                     1805.07193   1778.58598   -1.46
total                                                  2450.53110   2433.22327   -0.70

The first two columns mean very little, but are some measure of items/sec/MHz (higher is better). The third column is percentage change. The bottom results are geomeans. Any results that didn't change are omitted.
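
For reference, the third column can be reproduced from the first two with the usual relative-change formula. A minimal sketch (the function name is made up for illustration):

```c
/* Relative change of `after` versus `before`, in percent. */
double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

For example, the arm_abs_q15 row (61280.3209 to 51891.8918) comes out at about -15.32, and the arm_abs_q31 row at about -36.48. Since the scores are described as higher-is-better, a negative value means fewer items/sec/MHz with the patch.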

Those results are a load better than they were originally. They are still not great, but they are on a CPU/test suite combo that isn't the most interesting combo. This doesn't come up on other, more DSP-like CPUs, and there are some AArch64 codesize improvements for this patch too.

But equally I don't think that a lot of these regressions were necessarily about the CPU/architecture used, but about the code they are running. You may find that similar problems come up elsewhere from other people's code, but we have no problem with this going ahead.

Hello. Sorry for the delay; I missed your earlier email and wanted to grab some more results.

No problem and thank you for looking into this!

So correct me please if i'm misunderstanding this - on your benchmarks this does not appear to have any measurable regressions?

Those results are a load better than they were originally. They are still not great, but they are on a CPU/test suite combo that isn't the most interesting combo. This doesn't come up on other, more DSP-like CPUs, and there are some AArch64 codesize improvements for this patch too.

But equally I don't think that a lot of these regressions were necessarily about the CPU/architecture used, but about the code they are running.

You may find that similar problems come up elsewhere from other people's code,

Yeah, i suspect this will shake loose (temporarily regress) a few more patterns elsewhere.

but we have no problem with this going ahead.

Great!

By this point, does anyone else have any concerns here?
Anyone feel like reviewing/stamping, or indicating some further steps that must be taken first?

Also, bump @jmolloy who last bumped this threshold in rL229099.

There were/are(?) regressions even with 2...

https://bugs.llvm.org/show_bug.cgi?id=22616

@jmolloy / @RKSimon / @efriedma - thoughts?

There were/are(?) regressions even with 2...

https://bugs.llvm.org/show_bug.cgi?id=22616

Sure it did. Thanks for digging that up.
Honestly all graphics computations are always horrible to do on scalars,
those should (and almost always can be) vectorized, minimizing branching
as much as possible, replacing it with blending, which essentially is
what that patch did, except that the code remained scalar...
So the patch was doing the right thing there, "in general".

Aggressively flattening the CFG has tradeoffs. If the branch is very unpredictable, or it unblocks some important optimization, it can have a huge benefit. If you don't fall into one of those cases, you're mildly degrading the performance of a bunch of code, by forcing the execution of instructions where the result isn't used.
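
The tradeoff can be seen on the sign-extension pattern that motivates this patch. The sketch below is written from the EXTEND procedure in ITU T.81 Figure F.12 as I understand it; the function names are made up for illustration, and `1 - (1 << t)` is used in place of `(-1 << t) + 1` to avoid shifting a negative value in C.

```c
/* Branchy form: the adjustment is only computed when it is needed. */
int extend_branch(int v, int t)
{
    int vt = 1 << (t - 1);
    if (v < vt)
        v += 1 - (1 << t);
    return v;
}

/* Flattened form: the shift and adds always execute, and the select
 * throws the result away whenever v >= vt. */
int extend_select(int v, int t)
{
    int vt = 1 << (t - 1);
    int adjusted = v + 1 - (1 << t);
    return (v < vt) ? adjusted : v;
}
```

If the `v < vt` test is unpredictable, as it is expected to be for entropy-coded data, the second form avoids mispredictions at the price of a few always-executed instructions. If the test is well predicted, those instructions are pure overhead.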

I'd like some idea of how often we actually end up with a "select" in the generated code, vs. getting transformed back to a branch by some later pass. We do some select->branch conversion before isel, and on x86, we also do cmov->branch conversion after isel.

Aggressively flattening the CFG has tradeoffs. If the branch is very unpredictable,
or it unblocks some important optimization, it can have a huge benefit.
If you don't fall into one of those cases, you're mildly degrading the performance
of a bunch of code, by forcing the execution of instructions where the result isn't used.

Yep, and that matches my benchmarks in D59035.

I have added tests for several cases that will be affected by this, and stated
some perf numbers i'm seeing on x86 on some of my code (results are mildly great);
separately, @dmgreen has performed some ARM benchmarks
(https://reviews.llvm.org/D65148#1650984, also seemingly-ok)

I'd like some idea of how often we actually end up with a "select" in the generated code,
vs. getting transformed back to a branch by some later pass.
We do some select->branch conversion before isel, and on x86,
we also do cmov->branch conversion after isel.

Also note that if it was specified that the branch is predictable (via PGO/__builtin_expect()),
the select will be converted back to a branch.

So, what metric specifically do you want to see, a count of CMOV instructions at the end of codegen, how it is changed by this patch?

So, what metric specifically do you want to see, a count of CMOV instructions at the end of codegen, how it is changed by this patch?

I guess more the number of branches at the end of codegen... because there are really three possibilities here: the select is lowered to cmov, the select is lowered to branch, or the select gets optimized to some non-select computation. But yes, something like that. I would guess most of the selects generated this way can't be significantly optimized, so if everything is working correctly, most of them should be getting converted back to branches.

Okay, wrote a pass to count the interesting metrics at the final pass in backend (D67240).
Numbers for RawSpeed:

metric                                  old       new       cnt change  % change
x86-mi-counting.NumMachineFunctions     10513     10513     0           0.00%
x86-mi-counting.NumMachineBasicBlocks   200350    200163    -187        -0.09%
x86-mi-counting.NumMachineInstructions  3307866   3305504   -2362       -0.07%
x86-mi-counting.NumUncondBR             33479     33465     -14         -0.04%
x86-mi-counting.NumCondBR               91062     90908     -154        -0.17%
x86-mi-counting.NumCMOV                 4195      4284      89          2.12%
x86-mi-counting.NumVecBlend             17        17        0           0.00%

As is evident, while there is an increase in cmov count,
the decrease of branch count is almost twice that,
and there is a notable decrease in total instruction count and basic block count.
I believe that supports the benchmark numbers.

Does that line up with what you wanted to see?

Will post test-suite numbers a bit later, not sure how to aggregate them..

Yes, we need test suite numbers.

Also we need to know perf numbers with this patch for other architectures, @dmgreen ?

Maybe @evandro is interested in this patch too and could run it on SPECs?

@efriedma and done (these are vanilla llvm test-suite no externals, no rawspeed).
Did not find any nice way to auto-aggregate, so just applied some bash:

$ touch x86-mi-counting.NumMachineFunctions x86-mi-counting.NumMachineBasicBlocks x86-mi-counting.NumMachineInstructions x86-mi-counting.NumUncondBR x86-mi-counting.NumCondBR x86-mi-counting.NumCMOV x86-mi-counting.NumVecBlend
$ for i in x86-mi-counting.*; do echo -n "$i "; grep "$i" results-testsuite-old.json | awk '{print $2}' | sed 's/\.0,//' | awk '{s+=$1}END{print s}'; done;
metric                                  old       new       cnt change  % change
x86-mi-counting.NumMachineFunctions     27189     27189     0           0.00%
x86-mi-counting.NumMachineBasicBlocks   573079    571509    -1570       -0.27%
x86-mi-counting.NumMachineInstructions  4180244   4180750   506         0.01%
x86-mi-counting.NumUncondBR             102271    102115    -156        -0.15%
x86-mi-counting.NumCondBR               332645    331669    -976        -0.29%
x86-mi-counting.NumCMOV                 20620     21134     514         2.49%
x86-mi-counting.NumVecBlend             0         0         0           0

Here the results aren't as glaring.
We again notably decreased BB count, increased CMOV count but decreased branch count by more than twice the CMOV increase,
increased instruction count by a bit (the absolute number appears to match the increase in CMOV's);
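
The "more than twice" comparison can be checked with a little arithmetic on the counters. A minimal sketch, with helper names made up for illustration:

```c
/* Net change in branch instructions: conditional plus unconditional. */
long branch_delta(long cond_old, long cond_new,
                  long uncond_old, long uncond_new)
{
    return (cond_new - cond_old) + (uncond_new - uncond_old);
}

/* Branches removed per CMOV added; only meaningful when the CMOV count
 * actually went up (cmov_new > cmov_old). */
double branches_removed_per_cmov(long delta, long cmov_old, long cmov_new)
{
    return (double)(-delta) / (double)(cmov_new - cmov_old);
}
```

With the test-suite counters this gives 1132 fewer branches for 514 more CMOVs, about 2.2 branches removed per CMOV added. The RawSpeed counters give 168 fewer branches for 89 more CMOVs, about 1.9.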

That looks mostly fine, then. Maybe still a few more cmovs than I'd like, still... but close enough.

In terms of the actual code, I have a couple questions I'm trying to understand:

  1. It looks like PHINodeFoldingThreshold actually controls multiple SimplifyCFG transforms? Which transforms are actually providing the improvements you see? Should we be tying all of them together?
  2. For if-converting a triangle, like the testcases, it looks like we're using separate thresholds for the number of speculated instructions, and the number of generated select instructions? Should we be using some sort of combined cost, instead?

Thank you for taking a look!

That looks mostly fine, then. Maybe still a few more cmovs than I'd like, still... but close enough.

Aha! :)

In terms of the actual code, I have a couple questions I'm trying to understand:

  1. It looks like PHINodeFoldingThreshold actually controls multiple SimplifyCFG transforms?

Yes, it appears to control 3 separate folds:

  • SpeculativelyExecuteBB()
  • FoldTwoEntryPHINode()
  • mergeConditionalStoreToAddress()

Which transforms are actually providing the improvements you see?

FoldTwoEntryPHINode().

Should we be tying all of them together?

Uh, good question.
I'm not all that familiar with SimplifyCFG, but i believe this
is okay. That being said, i see at least 4 other bugs there.

Both FoldTwoEntryPHINode() and mergeConditionalStoreToAddress() use
PHINodeFoldingThreshold separately for each BB, both of them use it to
count the cumulative cost of all per-BB instructions, so this is consistent.

(1) minor:
mergeConditionalStoreToAddress() uses PHINodeFoldingThreshold
as instruction count while others actually use it as cost.
It should likely actually check the cost, not instruction count.

(2,3) critical:
Both FoldTwoEntryPHINode() and mergeConditionalStoreToAddress() use it separately
for each BB, so even though PHINodeFoldingThreshold is at 2,
we actually allow speculating 4 instructions (at most 2 per BB),
and this patch will bump it to 6 - https://godbolt.org/z/WLfpEP

This is not my intention, actually. Moreover, why would we be ok
flattening PHI with 2 instructions in both of BB's,
but not a PHI with 3 in one BB and 0 in another?
That looks just broken to me.

Do you agree that the threshold should specify the *total* cost
of instructions to speculate, not per-BB maximal speculation cost?

After fixing this, PHINodeFoldingThreshold will then need to be adjusted x2 (to 4)
to keep the existing behavior, but then the interesting cases
will be handled so there will be no immediate need to bump it to 3 (i.e. 6).
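
To spell out the difference between the two accounting schemes, here is a toy model in C. It is not the SimplifyCFG code; the function names and the plain "cost <= threshold" check are simplifying assumptions made for illustration.

```c
#include <stdbool.h>

/* Toy model of the behaviour described above: each block's speculation
 * cost is checked against the threshold on its own. */
bool can_fold_per_block(unsigned cost_a, unsigned cost_b, unsigned threshold)
{
    return cost_a <= threshold && cost_b <= threshold;
}

/* Toy model of the proposed behaviour: one shared budget for both blocks. */
bool can_fold_total(unsigned cost_a, unsigned cost_b, unsigned threshold)
{
    return cost_a + cost_b <= threshold;
}
```

In this model, a per-block threshold of 2 accepts costs (2, 2) but rejects (3, 0), and bumping it to 3 would accept (3, 3), i.e. 6 speculated instructions. A total budget of 4 accepts both (2, 2) and (3, 0) while still rejecting (3, 3).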

(4) insignificant:
In SpeculativelyExecuteBB(), we seem to at most speculate a single instruction,
so PHINodeFoldingThreshold is used as a cutoff threshold for that single instruction.
So this could be a separate cl::opt, but i'm not sure it's worth changing this.

  2. For if-converting a triangle, like the testcases, it looks like we're using separate thresholds for the number of speculated instructions, and the number of generated select instructions? Should we be using some sort of combined cost, instead?

I'm not sure what you are talking about here?
We don't count the cost of the select we will need to produce,
just like we don't consider the cost of the PHINode itself,
so i suppose we consider that we get the select for free by replacing the PHINode.
Was that the question?

I compared your jsons using

python3 compare.py results-testsuite-old.json results-testsuite-new.json --filter-short
Tests: 1318
Short Running: 736 (filtered out)
Remaining: 582
Metric: exec_time

Program                                        results-testsuite-old results-testsuite-new diff  
 test-suite...C++/Shootout-C++-heapsort.test     3.49                 10.84                210.7%
 test-suite...e/Benchmarks/Misc/flops-8.test     1.25                  3.70                195.3%
 test-suite...out-C++/Shootout-C++-hash.test     0.77                  1.97                155.1%
 test-suite...C/Packing-dbl/Packing-dbl.test     3.56                  8.12                128.1%
 test-suite...est:BM_GEN_LIN_RECUR_RAW/44217   339.78                737.62                117.1%
 test-suite...test:BM_PLANCKIAN_LAMBDA/44217   984.23                2096.79               113.0%
 test-suite...s/Rodinia/hotspot/hotspot.test     0.62                  1.25                102.1%
 test-suite...st:BM_MemCmp<64, EqZero, Last>   10153.35              584.76                -94.2%
 test-suite...est:BM_MemCmp<64, EqZero, Mid>   9845.54               581.01                -94.1%
 test-suite...mCmp<7, GreaterThanZero, None>   2168.74               4178.58               92.7% 
 test-suite...d-warshall/floyd-warshall.test     8.15                 15.46                89.6% 
 test-suite...t:BM_MemCmp<64, EqZero, First>   2859.27               401.51                -86.0%
 test-suite...ests/Vector/Vector-build2.test     9.43                  1.91                -79.7%
 test-suite...algebra/kernels/syrk/syrk.test    24.44                  5.37                -78.0%
 test-suite...mCmp<15, GreaterThanZero, Mid>   1615.43               2841.94               75.9%

Is this my mistake? Because these perf results do not look so great.

...

(1) minor:
mergeConditionalStoreToAddress() uses PHINodeFoldingThreshold
as instruction count while others actually use it as cost.
It should likely actually check the cost, not instruction count.

D67315

I compared your jsons using

python3 compare.py results-testsuite-old.json results-testsuite-new.json --filter-short

Don't.
I was only acquiring correctness/compiler stats there,
no benchmarking was performed, so those perf numbers are meaningless garbage.

...

Should we be tying all of them together?

But now i have changed my opinion.

(2) critical:
FoldTwoEntryPHINode()

D67318; if it makes sense, it effectively replaces this patch.

(2) critical:
mergeConditionalStoreToAddress()

I'm not fully sure about this one, not going to touch.

lebedev.ri abandoned this revision.Sep 8 2019, 1:45 AM

Let's move this to D67318, which looks like a more general fix,
and produces much nicer final assembly metrics while having the same perf characteristics on my benchmark.

Initial results from your benchmark do not indicate perf issues, but some time ago, this was posted:
https://lists.llvm.org/pipermail/llvm-dev/2018-August/125313.html

Since with this patch there would be many more selects, we should consider its impact.

@efriedma @john.brawn