This is an archive of the discontinued LLVM Phabricator instance.

[DSE] Add support for not aligned begin/end
ClosedPublic

Authored by ebrevnov on Dec 18 2020, 2:43 AM.

Details

Summary

This is an attempt to improve handling of partial overlaps in case of unaligned begin\end.

Existing implementation just bails out if it encounters such cases. Even when it doesn't I believe existing code checking alignment constraints is not quite correct. It tries to ensure alignment of the "later" start/end offset while should be preserving relative alignment between earlier and later start/end.

The idea behind the change is simple. When start/end is not aligned as we wish instead of bailing out let's adjust it as necessary to get desired alignment.

I'll update with performance results as measured by the test-suite...it's still running...

Diff Detail

Event Timeline

ebrevnov created this revision.Dec 18 2020, 2:43 AM
ebrevnov requested review of this revision.Dec 18 2020, 2:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 18 2020, 2:43 AM
ebrevnov edited the summary of this revision. (Show Details)Dec 18 2020, 2:59 AM
ebrevnov added a reviewer: fhahn.
ebrevnov updated this revision to Diff 312748.Dec 18 2020, 3:44 AM

Initial update

ebrevnov added a comment.EditedDec 20 2020, 11:55 PM

I had already tried to measure performance with the test-suite previously with out success. This time again I observe big variation. I'm using dedicated performance machine which runs only my process. I've build test-suite as follows:

cmake -DTEST_SUITE_BENCHMARKING_ONLY=ON -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -C../cmake/caches/O3.cmake ..; make -j16

And ran the tests 3 times. Here is the results of comparison first vs second vs third runs.

> utils/compare.py --merge-average  --filter-short ~/results.coffeelake.orig1.json  vs ~/results.coffeelake.orig2.json 
Tests: 736
Short Running: 197 (filtered out)
Remaining: 539
Metric: exec_time

Program                                        lhs     rhs     diff  
                                                                     
 test-suite....test:BM_MULADDSUB_LAMBDA/5001    10.18   12.39  21.8% 
 test-suite...da.test:BM_PIC_1D_LAMBDA/44217   514.08  436.69  -15.1%
 test-suite...sCRaw.test:BM_HYDRO_2D_RAW/171     9.97   11.12  11.5% 
 test-suite...bda.test:BM_PIC_1D_LAMBDA/5001    53.98   48.92  -9.4% 
 test-suite...lcalsCRaw.test:BM_ADI_RAW/5001    90.87   84.16  -7.4% 
 test-suite...XRayFDRMultiThreaded/threads:2   142.53  151.64   6.4% 
 test-suite...sARaw.test:BM_VOL3D_CALC_RAW/2     2.57    2.43  -5.6% 
 test-suite....test:BENCHMARK_HARRIS/256/256   341.95  359.82   5.2% 
 test-suite...a/kernels/doitgen/doitgen.test     0.67    0.70   3.8% 
 test-suite.../Applications/spiff/spiff.test     1.12    1.08  -3.5% 
 test-suite...g/correlation/correlation.test     1.27    1.23  -3.2% 
 test-suite....test:BENCHMARK_HARRIS/512/512   1854.99 1913.18  3.1% 
 test-suite...test:BM_MULADDSUB_LAMBDA/44217    95.43   98.36   3.1% 
 test-suite...aw.test:BM_MULADDSUB_RAW/44217    95.66   92.73  -3.1% 
 test-suite...CRaw.test:BM_MAT_X_MAT_RAW/171   105.88  102.78  -2.9% 
 Geomean difference                                             nan% 
                 lhs            rhs        diff
count  538.000000     539.000000     538.000000
mean   1554.916420    1548.375947   -0.000238  
std    13204.345562   13143.680859   0.015499  
min    0.610700       0.609000      -0.150543  
25%    2.729454       2.729260      -0.000955  
50%    92.554771      90.861409      0.000000  
75%    555.904231     555.793696     0.000634  
max    208982.569333  207906.284333  0.217786

>utils/compare.py --merge-average  --filter-short ~/results.coffeelake.orig2.json  vs ~/results.coffeelake.orig3.json 
Tests: 736
Short Running: 197 (filtered out)
Remaining: 539
Metric: exec_time

Program                                        lhs      rhs      diff  
                                                                       
 test-suite...sCRaw.test:BM_PIC_1D_RAW/44217   433.62   519.59   19.8% 
 test-suite....test:BM_MULADDSUB_LAMBDA/5001    12.39    10.36   -16.4%
 test-suite...CHMARK_ANISTROPIC_DIFFUSION/64   2056.09  2369.92  15.3% 
 test-suite...HMARK_ANISTROPIC_DIFFUSION/128   8850.38  10172.76 14.9% 
 test-suite...HMARK_ANISTROPIC_DIFFUSION/256   36797.14 42257.00 14.8% 
 test-suite...CHMARK_ANISTROPIC_DIFFUSION/32   457.93   523.86   14.4% 
 test-suite....test:BENCHMARK_HARRIS/256/256   359.82   310.62   -13.7%
 test-suite...lsCRaw.test:BM_PIC_1D_RAW/5001    48.22    54.78   13.6% 
 test-suite...sCRaw.test:BM_HYDRO_2D_RAW/171    11.12    11.98    7.7% 
 test-suite...flt/LoopRestructuring-flt.test     2.64     2.79    5.5% 
 test-suite...XRayFDRMultiThreaded/threads:2   151.64   143.43   -5.4% 
 test-suite...CRaw.test:BM_HYDRO_2D_RAW/5001   304.73   320.81    5.3% 
 test-suite...mbda.test:BM_INIT3_LAMBDA/5001     9.20     9.68    5.2% 
 test-suite.../Applications/spiff/spiff.test     1.08     1.11    2.8% 
 test-suite...++/Shootout-C++-ackermann.test     0.63     0.65    2.8% 
 Geomean difference                                               nan% 
                 lhs            rhs        diff
count  539.000000     538.000000     538.000000
mean   1548.375947    1552.758096    0.001124  
std    13143.680859   12998.591022   0.020444  
min    0.609000       0.609400      -0.164172  
25%    2.729260       2.770800      -0.000865  
50%    90.861409      91.507016     -0.000025  
75%    555.793696     555.931031     0.000464  
max    207906.284333  204187.808667  0.198265

Is this expected? What I'm doing wrong?

Thanks
Evgeniy

fhahn added a comment.Feb 17 2021, 9:10 AM

Just realized that I missed this one.

With respect to the performance numbers you posted earlier, I am not sure if that is caused by the patch. For SPEC2000/SPEC2006 & MultiSource with -O3 -flto on X86, only 12 benchmarks have binary changes and they are all very minor. The total number of stores is as below (that's with --filter-hash --all). What's a bit curious is that in 2 cases we are left with a tiny number of more stores, but the number of eliminated stores stays the same.

It seems like we do not have a stats counter to measure the number of times we successfully shortened operations. The patch basically looks good to me, but it would be good to understand where the additional stores are coming from.

Same hash: 225 (filtered out)
Remaining: 12
Metric: dse.NumRemainingStores

Program                                        base     patch    diff
 test-suite...006/450.soplex/450.soplex.test   8700.00  8701.00   0.0%
 test-suite...006/447.dealII/447.dealII.test   90533.00 90535.00  0.0%
 test-suite.../CINT2000/176.gcc/176.gcc.test   18291.00 18291.00  0.0%
 test-suite.../CINT2006/403.gcc/403.gcc.test   36093.00 36093.00  0.0%
 test-suite...nal/skidmarks10/skidmarks.test   1161.00  1161.00   0.0%
 test-suite...lications/ClamAV/clamscan.test   10078.00 10078.00  0.0%
 test-suite...lications/viterbi/viterbi.test   102.00   102.00    0.0%
 test-suite...marks/7zip/7zip-benchmark.test   35503.00 35503.00  0.0%
 test-suite...oxyApps-C/RSBench/rsbench.test   145.00   145.00    0.0%
 test-suite...oxyApps-C/XSBench/XSBench.test   120.00   120.00    0.0%
 test-suite...nsumer-jpeg/consumer-jpeg.test   5972.00  5972.00   0.0%
 test-suite...nsumer-lame/consumer-lame.test   4501.00  4501.00   0.0%
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1096

IIUC this is the adjustment to keep ToRemoveStart aligned to PrefAlign, right? Might be worth a comment be worth a comment.

1132

perhaps it would be worth adjusting the message to make it clear that the second number is the size we shorten by?

Florian,

I appreciate your time to review and experimenting with the patch. I agree it's kind of unexpected to see less stores get removed. Would it be possible for you to share an IR where that happens (not reduced one is fine)? I'm asking for this because I don't have access to those SPECs.

Thanks
Evgeniy

fhahn accepted this revision.Mar 2 2021, 3:40 AM

Florian,

I appreciate your time to review and experimenting with the patch. I agree it's kind of unexpected to see less stores get removed. Would it be possible for you to share an IR where that happens (not reduced one is fine)? I'm asking for this because I don't have access to those SPECs.

Unfortunately I could not really reproduce this without LTO and I don't want to hold this up until I have more time to extract a reproducer from an LTO build. Give that it is a tiny number of changes I don't think that's necessary and the improvements in the patch in terms of readability are more than enough to offset them. It could also be the case where we now choose a more profitable offset.

LGTM, with my remaining minor suggestions (optional) with respect to the debug message.

This revision is now accepted and ready to land.Mar 2 2021, 3:40 AM
ebrevnov added inline comments.Mar 3 2021, 1:03 AM
llvm/lib/Transforms/Scalar/DeadStoreElimination.cpp
1132

I agree. This message confused me when I met it in the dump. Other places uses (start,end] notation to represent removed interval. I think it makes sense to unify this one with other places.