This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] FoldTwoEntryPHINode(): consider *total* speculation cost, not per-BB cost
ClosedPublic

Authored by lebedev.ri on Sep 7 2019, 7:01 AM.

Details

Summary

Previously, if the threshold was 2, we were willing to speculatively
execute 2 cheap instructions in both basic blocks (thus we were willing
to speculatively execute cost = 4), but weren't willing to speculate
when one BB had 3 instructions and other one had no instructions,
even thought that would have total cost of 3.

This looks inconsistent to me.
I don't think cmov-like instructions will start executing
until both of it's inputs are available: https://godbolt.org/z/zgHePf
So i don't see why the existing behavior is the correct one.

Also, let's add it's own cl::opt for this threshold.

This is an alternative solution to D65148:
This fix is mainly motivated by signbit-like-value-extension.ll test.
That pattern comes up in JPEG decoding, see e.g.
Figure F.12 – Extending the sign bit of a decoded value in V
of ITU T.81 (JPEG specification).
That branch is not predictable, and it is within the innermost loop,
so the fact that that pattern ends up being stuck with a branch
instead of select (i.e. CMOV for x86) is unlikely to be beneficial.

This has great results on the final assembly (vanilla test-suite + RawSpeed): (metric pass - D67240)

metricoldnewdelta%
x86-mi-counting.NumMachineFunctions377203772110.00%
x86-mi-counting.NumMachineBasicBlocks773545771181-2364-0.31%
x86-mi-counting.NumMachineInstructions74888437486442-2401-0.03%
x86-mi-counting.NumUncondBR135770135543-227-0.17%
x86-mi-counting.NumCondBR423753422187-1566-0.37%
x86-mi-counting.NumCMOV24815257319163.69%
x86-mi-counting.NumVecBlend171700.00%

We significantly decrease basic block count, notably decrease instruction count,
significantly decrease branch count and very significantly increase cmov count.

Performance-wise, unsurprisingly, this has great effect on
target RawSpeed benchmark. I'm seeing 5 major improvements:

Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                                  -0.3064         -0.3064      226.9913      157.4452      226.9800      157.4384
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                                -0.3057         -0.3057      226.8407      157.4926      226.8282      157.4828
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                                -0.4985         -0.4954        0.3051        0.1530        0.3040        0.1534
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                                   -0.1747         -0.1747       80.4787       66.4227       80.4771       66.4146
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                                 -0.1742         -0.1743       80.4686       66.4542       80.4690       66.4436
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                                 +0.6089         +0.5797        0.0670        0.1078        0.0673        0.1062
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                                  -0.1598         -0.1598      171.6996      144.2575      171.6915      144.2538
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                                -0.1598         -0.1597      171.7109      144.2755      171.7018      144.2766
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                                +0.4024         +0.3850        0.0847        0.1187        0.0848        0.1175
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                                   -0.0550         -0.0551      280.3046      264.8800      280.3017      264.8559
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                                 -0.0554         -0.0554      280.2628      264.7360      280.2574      264.7297
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                                 +0.7005         +0.7041        0.2779        0.4725        0.2775        0.4729
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean                                   -0.0354         -0.0355      316.7396      305.5208      316.7342      305.4890
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median                                 -0.0354         -0.0356      316.6969      305.4798      316.6917      305.4324
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev                                 +0.0493         +0.0330        0.3562        0.3737        0.3563        0.3681

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Sep 7 2019, 7:01 AM

4 major improvements

and no regressions? Please also provide test suite numbers.

4 major improvements

Note that those numbers are for D65148;

and no regressions?

See https://reviews.llvm.org/D65148#1602979 for that disscussion.

I have not rerun any of the measurements that were done
in that patch here as of yet, but the results may or may not
be better since this actually speculates less instructions.

Please also provide test suite numbers.

As i have stated previously, that is not feasible for me,
i have tried and the results are always too noisy,
the tooling there is just too clumsy for me.
If you have working setup please feel free to post noise-less actionable the numbers.

I dont have such setup, but some folks could run SPEC for you. Maybe send a RFC about this patch to llvm-dev? :)

I think we really need proper benchmark data before any “LGTM”.

D67240 numbers for RawSpeed: (rL371316 vs. these two patches)

metricoldnewdelta% change
x86-mi-counting.NumMachineFunctions105131051300.00%
x86-mi-counting.NumMachineBasicBlocks200350200178-172-0.09%
x86-mi-counting.NumMachineInstructions33078663305737-2129-0.06%
x86-mi-counting.NumUncondBR3347933463-16-0.05%
x86-mi-counting.NumCondBR9106290925-137-0.15%
x86-mi-counting.NumCMOV41954276811.93%
x86-mi-counting.NumVecBlend171700.00%

This is pretty in-line with what we had in

Numbers for RawSpeed:

metricoldnewdelta% change
x86-mi-counting.NumMachineFunctions105131051300.00%
x86-mi-counting.NumMachineBasicBlocks200350200163-187-0.09%
x86-mi-counting.NumMachineInstructions33078663305504-2362-0.07%
x86-mi-counting.NumUncondBR3347933465-14-0.04%
x86-mi-counting.NumCondBR9106290908-154-0.17%
x86-mi-counting.NumCMOV41954284892.12%
x86-mi-counting.NumVecBlend171700.00%

@efriedma

D67240 numbers for vanilla (no externals, no rawspeed) test-suite: (rL371316 vs. these two patches)

metricoldnewdelta% change
x86-mi-counting.NumMachineFunctions272072720810.00%
x86-mi-counting.NumMachineBasicBlocks573195571003-2192-0.38%
x86-mi-counting.NumMachineInstructions41809774180705-272-0.01%
x86-mi-counting.NumUncondBR102291102080-211-0.21%
x86-mi-counting.NumCondBR332691331262-1429-0.43%
x86-mi-counting.NumCMOV20620214558354.05%
x86-mi-counting.NumVecBlend0000.00%

Which is actually extremely better than what we had in D65148,
we now produce more cmov's (good), and even further decrease everything else including instruction count (good).

metricoldnewdelta% change
x86-mi-counting.NumMachineFunctions271892718900.00%
x86-mi-counting.NumMachineBasicBlocks573079571509-1570-0.27%
x86-mi-counting.NumMachineInstructions418024441807505060.01%
x86-mi-counting.NumUncondBR102271102115-156-0.15%
x86-mi-counting.NumCondBR332645331669-976-0.29%
x86-mi-counting.NumCMOV20620211345142.49%
x86-mi-counting.NumVecBlend0000.00%

(the NumMachineFunctions changed because i did D65148#1660558 without xray tests...)

lebedev.ri edited the summary of this revision. (Show Details)Sep 7 2019, 1:03 PM

Nice, these new results look very good!

lebedev.ri edited the summary of this revision. (Show Details)Sep 7 2019, 2:55 PM

And got perf numbers. No particular differences from what i have seen in D65148, ~5 major improvements, no notable regressions.
TLDR: On average on used benchmarks this results in -~3% speedup.

$ /usr/src/googlebenchmark/tools/compare.py -a benchmarks rawspeed-benchmark-0-old.json rawspeed-benchmark-1-new.json 
Comparing rawspeed-benchmark-0-old.json to rawspeed-benchmark-1-new.json
Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_pvalue                     0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_mean                      -0.0236         -0.0236      222.5219      217.2595      222.4980      217.2369
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_median                    -0.0236         -0.0236      222.4776      217.2206      222.4307      217.1793
Canon/EOS 5D Mark II/09.canon.sraw1.cr2/threads:8/process_time/real_time_stddev                    -0.1381         -0.1389        0.5084        0.4382        0.5057        0.4354
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_pvalue                     0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_mean                      -0.0153         -0.0153      129.8367      127.8563      129.8264      127.8442
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_median                    -0.0153         -0.0152      129.7890      127.8042      129.7693      127.7967
Canon/EOS 5D Mark II/10.canon.sraw2.cr2/threads:8/process_time/real_time_stddev                    -0.6448         -0.6457        0.6080        0.2159        0.6068        0.2150
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_mean                                   -0.0181         -0.0181      444.5168      436.4534      444.4694      436.4027
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_median                                 -0.0181         -0.0181      444.5686      436.5023      444.5210      436.4561
Canon/EOS 5DS/2K4A9927.CR2/threads:8/process_time/real_time_stddev                                 +0.1821         +0.1897        0.5460        0.6454        0.5439        0.6471
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_mean                                   -0.0229         -0.0229      567.6662      554.6384      567.6017      554.5808
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_median                                 -0.0232         -0.0232      567.7452      554.5751      567.6885      554.5071
Canon/EOS 5DS/2K4A9928.CR2/threads:8/process_time/real_time_stddev                                 +0.4212         +0.4445        0.7544        1.0721        0.7371        1.0648
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean                                   -0.0354         -0.0355      316.7396      305.5208      316.7342      305.4890
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median                                 -0.0354         -0.0356      316.6969      305.4798      316.6917      305.4324
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev                                 +0.0493         +0.0330        0.3562        0.3737        0.3563        0.3681
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_mean                                   -0.0076         -0.0077       56.9675       56.5326       56.9671       56.5273
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_median                                 -0.0079         -0.0080       56.9662       56.5175       56.9664       56.5097
Canon/EOS 40D/_MG_0154.CR2/threads:8/process_time/real_time_stddev                                 -0.4629         -0.4702        0.1688        0.0907        0.1684        0.0892
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                                   -0.0550         -0.0551      280.3046      264.8800      280.3017      264.8559
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                                 -0.0554         -0.0554      280.2628      264.7360      280.2574      264.7297
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                                 +0.7005         +0.7041        0.2779        0.4725        0.2775        0.4729
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_mean                                   +0.0042         +0.0041       62.7856       63.0503       62.7856       63.0448
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_median                                 +0.0041         +0.0040       62.7851       63.0413       62.7854       63.0344
Canon/EOS D30/CRW_2444.CRW/threads:8/process_time/real_time_stddev                                 -0.1853         -0.2144        0.0556        0.0453        0.0551        0.0433
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_pvalue                             0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_mean                              +0.0046         +0.0045       63.7451       64.0406       63.7450       64.0343
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_median                            +0.0045         +0.0045       63.7339       64.0213       63.7321       64.0207
Canon/PowerShot G1/crw_1693.crw/threads:8/process_time/real_time_stddev                            +0.4642         +0.4343        0.0484        0.0709        0.0479        0.0687
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_pvalue                      0.0000          0.0000      U Test, Repetitions: 49 vs 49
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_mean                       -0.0086         -0.0038      486.1760      481.9715     3598.8707     3585.2633
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_median                     -0.0112         -0.0057      487.3562      481.8991     3604.8764     3584.3669
Fujifilm/GFX 50S/20170525_0037TEST.RAF/threads:8/process_time/real_time_stddev                     -0.8226         -0.5578        2.1500        0.3815       10.4607        4.6256
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_pvalue                                0.2217          0.0239      U Test, Repetitions: 49 vs 49
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_mean                                 -0.0031         -0.0029      185.9799      185.3941     1483.9142     1479.5656
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_median                               -0.0000         -0.0005      185.7836      185.7747     1482.7799     1482.0870
Fujifilm/X-Pro2/_DSF3051.RAF/threads:8/process_time/real_time_stddev                               +0.0014         -0.0133        1.0597        1.0611        7.0797        6.9857
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_pvalue                              0.0647          0.0271      U Test, Repetitions: 49 vs 49
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_mean                               +0.0016         +0.0017       38.8542       38.9182      309.6872      310.2127
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_median                             +0.0021         +0.0028       38.7812       38.8636      308.9483      309.8222
GoPro/HERO6 Black/GOPR9172.GPR/threads:8/process_time/real_time_stddev                             -0.3860         -0.3828        0.5640        0.3463        3.6906        2.2780
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                                   -0.1747         -0.1747       80.4787       66.4227       80.4771       66.4146
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                                 -0.1742         -0.1743       80.4686       66.4542       80.4690       66.4436
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                                 +0.6089         +0.5797        0.0670        0.1078        0.0673        0.1062
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_pvalue                       0.0048          0.0024      U Test, Repetitions: 49 vs 49
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_mean                        -0.0006         -0.0007      375.4644      375.2458      375.4502      375.1952
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_median                      -0.0013         -0.0014      376.6812      376.2053      376.6632      376.1458
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:8/process_time/real_time_stddev                      -0.1310         -0.1324        1.8502        1.6079        1.8492        1.6045
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_pvalue                       0.7546          0.5649      U Test, Repetitions: 49 vs 49
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_mean                        -0.0025         -0.0026      430.4334      429.3462      430.4185      429.2971
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_median                      -0.0000         -0.0001      429.2898      429.2758      429.2598      429.2311
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:8/process_time/real_time_stddev                      -0.7837         -0.7873        2.3148        0.5008        2.3167        0.4928
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_pvalue                                    0.0001          0.0001      U Test, Repetitions: 49 vs 49
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_mean                                     -0.0017         -0.0017      254.5381      254.1162      254.5281      254.0925
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_median                                   -0.0011         -0.0012      254.4295      254.1423      254.4156      254.1112
Nikon/D7200/DSC_0977.NEF/threads:8/process_time/real_time_stddev                                   +0.2642         +0.2611        0.4071        0.5146        0.4063        0.5123
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_pvalue                                    0.0000          0.0000      U Test, Repetitions: 49 vs 49
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_mean                                     +0.0086         +0.0085      247.1756      249.3010      247.1651      249.2767
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_median                                   +0.0081         +0.0081      247.1551      249.1528      247.1444      249.1472
Nikon/D7200/DSC_0978.NEF/threads:8/process_time/real_time_stddev                                   +1.1420         +1.1485        0.2862        0.6131        0.2854        0.6132
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_pvalue                                    0.1231          0.1574      U Test, Repetitions: 49 vs 49
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_mean                                     +0.0011         +0.0009      234.2886      234.5454      234.2788      234.4844
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_median                                   +0.0008         +0.0007      234.2195      234.4022      234.2126      234.3658
Nikon/D7200/DSC_0979.NEF/threads:8/process_time/real_time_stddev                                   +1.7501         +1.3553        0.2661        0.7319        0.2653        0.6249
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_pvalue                                    0.0000          0.0000      U Test, Repetitions: 49 vs 49
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_mean                                     +0.0054         +0.0052      235.1888      236.4484      235.1830      236.4139
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_median                                   +0.0050         +0.0049      235.1839      236.3490      235.1782      236.3416
Nikon/D7200/DSC_0982.NEF/threads:8/process_time/real_time_stddev                                   +1.5848         +1.5295        0.2262        0.5846        0.2256        0.5706
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_pvalue                                   0.0021          0.0010      U Test, Repetitions: 49 vs 49
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_mean                                    +0.0004         +0.0004      219.0685      219.1491      219.0624      219.1499
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_median                                  +0.0004         +0.0005      219.0479      219.1360      219.0389      219.1375
Olympus/XZ-1/p1319978.orf/threads:8/process_time/real_time_stddev                                  -0.2462         -0.2431        0.1743        0.1314        0.1730        0.1309
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_pvalue                                0.0000          0.0000      U Test, Repetitions: 49 vs 49
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_mean                                 -0.0300         -0.0322       14.1339       13.7092      112.5427      108.9215
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_median                               -0.0289         -0.0335       14.1253       13.7178      112.7482      108.9693
Panasonic/DC-G9/P1000476.RW2/threads:8/process_time/real_time_stddev                               -0.2187         -0.3065        0.2117        0.1654        1.5642        1.0847
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_pvalue                               0.0000          0.0000      U Test, Repetitions: 49 vs 49
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_mean                                +0.0103         +0.0104       27.0501       27.3287      215.7217      217.9707
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_median                              +0.0103         +0.0104       27.0693       27.3475      215.8385      218.0814
Panasonic/DC-GH5/_T012014.RW2/threads:8/process_time/real_time_stddev                              +0.0838         +0.1371        0.2002        0.2170        1.2856        1.4619
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_pvalue                              0.0000          0.0000      U Test, Repetitions: 49 vs 49
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_mean                               +0.0533         +0.0531        5.0627        5.3324       40.3521       42.4956
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_median                             +0.0560         +0.0551        5.0707        5.3547       40.4043       42.6317
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_stddev                             +0.0148         +0.0363        0.0547        0.0555        0.3523        0.3651
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_pvalue                                    0.0000          0.0000      U Test, Repetitions: 49 vs 49
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_mean                                     -0.0167         -0.0158       72.0108       70.8091       73.8282       72.6596
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_median                                   -0.0165         -0.0165       71.9825       70.7949       71.9808       70.7963
Pentax/K10D/_IGP7284.PEF/threads:8/process_time/real_time_stddev                                   -0.0501         +0.0156        0.1501        0.1426       12.8755       13.0763
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_pvalue                                  0.2646          0.2468      U Test, Repetitions: 49 vs 49
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_mean                                   +0.0025         +0.0024       79.2134       79.4154      631.2580      632.7601
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_median                                 +0.0003         +0.0000       79.3581       79.3834      632.1152      632.1352
Phase One/P65/CF027310.IIQ/threads:8/process_time/real_time_stddev                                 +0.2383         +0.3005        0.8326        1.0310        5.2849        6.8731
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_pvalue                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_mean                   +0.0179         +0.0179      328.3829      334.2514      328.3690      334.2385
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_median                 +0.0182         +0.0183      328.3175      334.3054      328.2931      334.2921
Samsung/NX1/2016-07-23-142101_sam_9364.srw/threads:8/process_time/real_time_stddev                 +0.8651         +0.8392        0.3049        0.5687        0.3085        0.5674
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_pvalue                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_mean                  -0.0092         -0.0092      129.0806      127.8978      129.0735      127.8921
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_median                -0.0085         -0.0086      129.0611      127.9581      129.0593      127.9545
Samsung/NX30/2015-03-07-163604_sam_7204.srw/threads:8/process_time/real_time_stddev                +0.2651         +0.2716        0.1436        0.1816        0.1439        0.1830
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                                  -0.3064         -0.3064      226.9913      157.4452      226.9800      157.4384
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                                -0.3057         -0.3057      226.8407      157.4926      226.8282      157.4828
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                                -0.4985         -0.4954        0.3051        0.1530        0.3040        0.1534
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                                  -0.1598         -0.1598      171.6996      144.2575      171.6915      144.2538
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                                -0.1598         -0.1597      171.7109      144.2755      171.7018      144.2766
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                                +0.4024         +0.3850        0.0847        0.1187        0.0848        0.1175
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_pvalue                                   0.0023          0.0014      U Test, Repetitions: 49 vs 49
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_mean                                    +0.0052         +0.0055        8.1308        8.1728       64.8663       65.2246
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_median                                  +0.0154         +0.0131        8.0814        8.2058       64.5718       65.4151
Sony/ILCE-7S/DSC04126.ARW/threads:8/process_time/real_time_stddev                                  -0.3637         -0.3469        0.1846        0.1175        1.2931        0.8445

5 biggest improvements:

Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_mean                                  -0.3064         -0.3064      226.9913      157.4452      226.9800      157.4384
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_median                                -0.3057         -0.3057      226.8407      157.4926      226.8282      157.4828
Samsung/NX3000/_3184416.SRW/threads:8/process_time/real_time_stddev                                -0.4985         -0.4954        0.3051        0.1530        0.3040        0.1534
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_mean                                   -0.1747         -0.1747       80.4787       66.4227       80.4771       66.4146
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_median                                 -0.1742         -0.1743       80.4686       66.4542       80.4690       66.4436
Kodak/DCS760C/86L57188.DCR/threads:8/process_time/real_time_stddev                                 +0.6089         +0.5797        0.0670        0.1078        0.0673        0.1062
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_pvalue                                 0.0000          0.0000      U Test, Repetitions: 49 vs 49
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_mean                                  -0.1598         -0.1598      171.6996      144.2575      171.6915      144.2538
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_median                                -0.1598         -0.1597      171.7109      144.2755      171.7018      144.2766
Sony/DSLR-A230/DSC08026.ARW/threads:8/process_time/real_time_stddev                                +0.4024         +0.3850        0.0847        0.1187        0.0848        0.1175
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_mean                                   -0.0550         -0.0551      280.3046      264.8800      280.3017      264.8559
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_median                                 -0.0554         -0.0554      280.2628      264.7360      280.2574      264.7297
Canon/EOS 77D/IMG_4049.CR2/threads:8/process_time/real_time_stddev                                 +0.7005         +0.7041        0.2779        0.4725        0.2775        0.4729
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_pvalue                                  0.0000          0.0000      U Test, Repetitions: 49 vs 49
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_mean                                   -0.0354         -0.0355      316.7396      305.5208      316.7342      305.4890
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_median                                 -0.0354         -0.0356      316.6969      305.4798      316.6917      305.4324
Canon/EOS 5DS/2K4A9929.CR2/threads:8/process_time/real_time_stddev                                 +0.0493         +0.0330        0.3562        0.3737        0.3563        0.3681

1 biggest regression:

Benchmark                                                                                             Time             CPU      Time Old      Time New       CPU Old       CPU New
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_pvalue                              0.0000          0.0000      U Test, Repetitions: 49 vs 49
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_mean                               +0.0533         +0.0531        5.0627        5.3324       40.3521       42.4956
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_median                             +0.0560         +0.0551        5.0707        5.3547       40.4043       42.6317
Panasonic/DC-GH5S/P1022085.RW2/threads:8/process_time/real_time_stddev                             +0.0148         +0.0363        0.0547        0.0555        0.3523        0.3651

... and that regression is noise, i re-benchmarked it.

jmolloy accepted this revision.Sep 16 2019, 1:57 AM

I can think of a rationale for the current behaviour. A lopsided if-triangle could indicate it might be better to emit a branch than speculate. Especially as this transform is context-free, input code that contains the same pattern many times can cause an explosion in code size and the transform was likely designed to be conservative.

Also note that there exist targets that don't use conditional moves to perform predication (most VLIW targets have a dedicated predicate register file) so that changes the behaviour somewhat.

I don't think there's a right answer for SimplifyCFG. Whatever we do is wrong, but sometimes it is useful. I think your approach here has been great; you have a bunch of performance data showing improvements and no regressions. I also like your renaming from "Cost" to "Budget". That's annoyed me for a while.

LGTM.

This revision is now accepted and ready to land.Sep 16 2019, 1:57 AM

I can think of a rationale for the current behaviour. A lopsided if-triangle could indicate it might be better to emit a branch than speculate. Especially as this transform is context-free, input code that contains the same pattern many times can cause an explosion in code size and the transform was likely designed to be conservative.

I don't think there's a right answer for SimplifyCFG. Whatever we do is wrong, but sometimes it is useful. I think your approach here has been great; you have a bunch of performance data showing improvements

Right.

and no regressions.

To be noted, on the benchmarks run.
I will be very surprised if this does not cause any regressions at all.

I also like your renaming from "Cost" to "Budget". That's annoyed me for a while.

Yeah, that one looked really weird to me too.

Also note that there exist targets that don't use conditional moves to perform predication (most VLIW targets have a dedicated predicate register file) so that changes the behaviour somewhat.

LGTM.

@jmolloy thank you for the review.
@efriedma / others - any comments? Not sure if i should wait a bit more or proceed to commit here.

I will be very surprised if this does not cause any regressions at all.

There is no avoiding that, the best we can do is best-effort here.

Just commit it, there is LNT so we should just watch perf bots after your commit if new changes are acceptable.

This revision was automatically updated to reflect the committed changes.

Seems pretty ok: http://lnt.llvm.org/db_default/v4/nts/129108

except "MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow" really big perf/code size regression. Roman, can you look at it?

lebedev.ri added a comment.EditedSep 16 2019, 2:29 PM

Seems pretty ok: http://lnt.llvm.org/db_default/v4/nts/129108

except "MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow" really big perf/code size regression. Roman, can you look at it?

Hmm yeah, not surprised at all that it regressed.


The patch did what it was supposed to do, and got rid of all the branching in @value.
And in that case it's bad because we end up with a wall (~1500 lines) of basically

%tmp519 = and i64 %tmp62, %tmp233
%tmp520 = icmp eq i64 %tmp519, %tmp62
%tmp521 = sitofp i32 %tmp518 to float
%tmp522 = fadd float %tmp236, %tmp521
%tmp523 = fptosi float %tmp522 to i32
%tmp524 = select i1 %tmp520, i32 %tmp523, i32 %tmp518

That test is really perfect storm for this. And i really don't like those int-float-int translations;
if i get rid of them (obviously not the fix), the regression goes away - even though in IR the flattening
still happens, in the asm we again get branches. So i suppose this may be a bug in undo transform in backend.

Just curious, did you try threshold 5 or 6 for your rawspeed benchmark?

Just curious, did you try threshold 5 or 6 for your rawspeed benchmark?

No, i have not tried any other thresholds here.
I'm personally not aware of any patterns that would benefit from those thresholds, although i did not look.

spatel added a subscriber: spatel.Sep 17 2019, 6:58 AM