This is an archive of the discontinued LLVM Phabricator instance.

[JumpThreading] sync DT for LVI analysis (PR 36133)
ClosedPublic

Authored by brzycki on Jan 30 2018, 3:59 PM.

Details

Summary

LVI will use DT if it is available and in the case of the JumpThreading pass it is. When using LVI for analysis it is necessary to flush all pending DominatorTree updates inside JumpThreading to be certain the analysis is correct.

See https://bugs.llvm.org/show_bug.cgi?id=36133 for detailed debug information.

Diff Detail

Event Timeline

brzycki created this revision.Jan 30 2018, 3:59 PM
kuhar added a comment.Feb 2 2018, 9:36 AM

Looks OK. Out of curiosity, what is the compilation time impact?

kuhar accepted this revision.Feb 2 2018, 9:37 AM
This revision is now accepted and ready to land.Feb 2 2018, 9:37 AM

Thank you @kuhar for the review. I ran CTmark (x86_64) on tip today and with this patch applied. The results are:

"compile_time": 1264.1519999999994, # tip
"compile_time": 1267.0480000000002, # patched

Which is around 0.3% degredation.

kuhar added a comment.Feb 2 2018, 12:37 PM

That seems to be within noise and not an issue

That seems to be within noise and not an issue

I tend to agree. Just for my own curiosity I also ran a check to see if my proposed patch for D34135 changed compile time:

"compile_time": 1268.6039999999998, # tip patched with proposed fix for both this issue and D34135

It's also still within noise levels I think.

@kuhar I had a bug in my test-suite runner script and just fixed it. The new results are require more analysis, particularly sqlite.

## COMPARING TIP OF LLVM-PROJECT VS THE SAME SHA WITH THE PROPOSED PATCH FOR D42717
brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.upstream/results.json build.D42717/results.json  compile_time
#  left column: build.upstream/results.json
# right column: build.D42717/results.json
#  metric name: compile_time
    336.3600 <- 377.3240     [12.18%]    CTMark/sqlite3/sqlite3.test
    493.8600 -> 486.7000     [1.47%]    CTMark/tramp3d-v4/tramp3d-v4.test
    497.1560 <- 500.4880     [0.67%]    CTMark/ClamAV/clamscan.test
   1265.6800 <- 1272.0480    [0.50%]    CTMark/7zip/7zip-benchmark.test
    589.8000 <- 591.0520     [0.21%]    CTMark/lencod/lencod.test
    387.2960 -> 386.5120     [0.20%]    CTMark/kimwitu++/kc.test
    459.5960 <- 460.3640     [0.17%]    CTMark/SPASS/SPASS.test
    256.1360 <- 256.4000     [0.10%]    CTMark/mafft/pairlocalalign.test
    373.0960 -> 372.7520     [0.09%]    CTMark/consumer-typeset/consumer-typeset.test
    924.1160 -> 923.4120     [0.08%]    CTMark/Bullet/bullet.test

## COMPARING TIP OF LLVM-PROJECT VS THE SAME SHA WITH THE PROPOSED PATCH FOR D34135
brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.upstream/results.json build.D34135/results.json  compile_time
#  left column: build.upstream/results.json
# right column: build.D34135/results.json
#  metric name: compile_time
    336.3600 <- 387.9720     [15.34%]    CTMark/sqlite3/sqlite3.test
    493.8600 -> 485.5160     [1.72%]    CTMark/tramp3d-v4/tramp3d-v4.test
    256.1360 -> 252.7320     [1.35%]    CTMark/mafft/pairlocalalign.test
    387.2960 <- 392.0080     [1.22%]    CTMark/kimwitu++/kc.test
    497.1560 <- 499.8040     [0.53%]    CTMark/ClamAV/clamscan.test
    373.0960 -> 371.5320     [0.42%]    CTMark/consumer-typeset/consumer-typeset.test
    459.5960 <- 461.2160     [0.35%]    CTMark/SPASS/SPASS.test
    589.8000 -> 588.1520     [0.28%]    CTMark/lencod/lencod.test
   1265.6800 <- 1268.2080    [0.20%]    CTMark/7zip/7zip-benchmark.test
    924.1160 -> 923.6120     [0.05%]    CTMark/Bullet/bullet.test

## COMPARING THE PROPOSED PATCH FOR D34135 VS THE SAME SHA WITH THE PROPOSED PATCH FOR D42717
brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.D34135/results.json build.D42717/results.json  compile_time
#  left column: build.D34135/results.json
# right column: build.D42717/results.json
#  metric name: compile_time
    387.9720 -> 377.3240     [2.82%]    CTMark/sqlite3/sqlite3.test
    252.7320 <- 256.4000     [1.45%]    CTMark/mafft/pairlocalalign.test
    392.0080 -> 386.5120     [1.42%]    CTMark/kimwitu++/kc.test
    588.1520 <- 591.0520     [0.49%]    CTMark/lencod/lencod.test
    371.5320 <- 372.7520     [0.33%]    CTMark/consumer-typeset/consumer-typeset.test
   1268.2080 <- 1272.0480    [0.30%]    CTMark/7zip/7zip-benchmark.test
    485.5160 <- 486.7000     [0.24%]    CTMark/tramp3d-v4/tramp3d-v4.test
    461.2160 -> 460.3640     [0.19%]    CTMark/SPASS/SPASS.test
    499.8040 <- 500.4880     [0.14%]    CTMark/ClamAV/clamscan.test
    923.6120 -> 923.4120     [0.02%]    CTMark/Bullet/bullet.test

All the other benchmarks are ~1% deltas and I'm not very concerned with them. The sqlite benchmark loses 12% with this proposed patch and an additional 3% with the patch proposed for D34135. I'm going to do another run of the upstream compiler to "check the cable" but I'm fairly certain these results are accurate.

My conjecture is the DT edge calls on a large tree are very expensive and I'm not sure how to mitigate this...

I'm seeing about 3% run variation on sqlite with the upstream compiler:

brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.upstream/results.json build.upstream2/results.json compile_time
#  left column: build.upstream/results.json
# right column: build.upstream2/results.json
#  metric name: compile_time
    336.3600 <- 347.4120     [  3.29%]  CTMark/sqlite3/sqlite3.test
    493.8600 -> 487.0560     [  1.40%]  CTMark/tramp3d-v4/tramp3d-v4.test
    373.0960 -> 368.5280     [  1.24%]  CTMark/consumer-typeset/consumer-typeset.test
    387.2960 <- 389.8320     [  0.65%]  CTMark/kimwitu++/kc.test
    459.5960 <- 462.5440     [  0.64%]  CTMark/SPASS/SPASS.test
    924.1160 -> 921.2680     [  0.31%]  CTMark/Bullet/bullet.test
    497.1560 <- 498.5240     [  0.28%]  CTMark/ClamAV/clamscan.test
    589.8000 -> 588.3840     [  0.24%]  CTMark/lencod/lencod.test
   1265.6800 <- 1267.2280    [  0.12%]  CTMark/7zip/7zip-benchmark.test
    256.1360 -> 256.1000     [  0.01%]  CTMark/mafft/pairlocalalign.test

That still puts the patch around 10% slower.

Here is a comparison of a run of LLVM before the DDT patch was applied to JT vs upstream.

brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.pre_dt/results.json build.upstream/results.json compile_time
#  left column: build.pre_dt/results.json
# right column: build.upstream/results.json
#  metric name: compile_time
    527.9440 -> 493.8600     [  6.90%]  CTMark/tramp3d-v4/tramp3d-v4.test
   1203.8440 <- 1265.6800    [  5.14%]  CTMark/7zip/7zip-benchmark.test
    881.3400 <- 924.1160     [  4.85%]  CTMark/Bullet/bullet.test
    576.0160 <- 589.8000     [  2.39%]  CTMark/lencod/lencod.test
    365.8560 <- 373.0960     [  1.98%]  CTMark/consumer-typeset/consumer-typeset.test
    452.0200 <- 459.5960     [  1.68%]  CTMark/SPASS/SPASS.test
    252.3400 <- 256.1360     [  1.50%]  CTMark/mafft/pairlocalalign.test
    491.1680 <- 497.1560     [  1.22%]  CTMark/ClamAV/clamscan.test
    382.8120 <- 387.2960     [  1.17%]  CTMark/kimwitu++/kc.test
    336.8720 -> 336.3600     [  0.15%]  CTMark/sqlite3/sqlite3.test


brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.pre_dt/results.json build.upstream2/results.json compile_time
#  left column: build.pre_dt/results.json
# right column: build.upstream2/results.json
#  metric name: compile_time
    527.9440 -> 487.0560     [  8.39%]  CTMark/tramp3d-v4/tramp3d-v4.test
   1203.8440 <- 1267.2280    [  5.27%]  CTMark/7zip/7zip-benchmark.test
    881.3400 <- 921.2680     [  4.53%]  CTMark/Bullet/bullet.test
    336.8720 <- 347.4120     [  3.13%]  CTMark/sqlite3/sqlite3.test
    452.0200 <- 462.5440     [  2.33%]  CTMark/SPASS/SPASS.test
    576.0160 <- 588.3840     [  2.15%]  CTMark/lencod/lencod.test
    382.8120 <- 389.8320     [  1.83%]  CTMark/kimwitu++/kc.test
    491.1680 <- 498.5240     [  1.50%]  CTMark/ClamAV/clamscan.test
    252.3400 <- 256.1000     [  1.49%]  CTMark/mafft/pairlocalalign.test
    365.8560 <- 368.5280     [  0.73%]  CTMark/consumer-typeset/consumer-typeset.test

@kuhar and @dberlin I would appreciate any guidance you two may have. I spoke with @sebpop about this problem and we don't see a way around the performance impact to preserve DT for LVI analysis. As it stands now there are cases where the deferred updates change the LVI analysis sufficiently to cause a miscompile. We do think there may be ways to mitigate this loss in the future: one way is to prevent pointless threads where a BB is threaded only to be discarded in a subsequent loop iteration within JumpThreading.

LVI functions fine without DT, and it's going to be too expensive to ever use a lazy analysis that is constantly recomputing info like LVI because it may require hundreds or thousands of Dominator tree rebuilds. I would hand LVI a fake dt pointer class where when the Dominator tree is up to date, it works and when it's not up to date, it is null. That would work with how LVI currently checking I think. Regardless I think the right answer here is to make it not use it when it's not up to date. That will give better answers then it used to get and you get gradually better answers as you can preserve more or become less lazy

brzycki updated this revision to Diff 133630.Feb 9 2018, 9:05 AM

Added a feature to LVI to enable or disable the use of the DominatorTree for analysis. Updated JumpThreading to disable the use of DT for LVI analysis when there are pending updates in the DDT.

@dberlin thank you for the suggestion and I have implemented the change. The good news is sqlite is back to our original performance. The not-so-great news is this causes a regression for tramp3d-v4:

brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.upstream/results.json build.D42717_newpatch/results.json compile_time
#  left column: build.upstream/results.json
# right column: build.D42717_newpatch/results.json
#  metric name: compile_time
    493.8600 <- 536.3320     [  8.60%]  CTMark/tramp3d-v4/tramp3d-v4.test
    336.3600 <- 343.2440     [  2.05%]  CTMark/sqlite3/sqlite3.test
    373.0960 -> 369.3400     [  1.02%]  CTMark/consumer-typeset/consumer-typeset.test
    256.1360 -> 253.9560     [  0.86%]  CTMark/mafft/pairlocalalign.test
    459.5960 <- 462.5360     [  0.64%]  CTMark/SPASS/SPASS.test
   1265.6800 <- 1270.5560    [  0.39%]  CTMark/7zip/7zip-benchmark.test
    589.8000 <- 591.7280     [  0.33%]  CTMark/lencod/lencod.test
    387.2960 <- 388.3840     [  0.28%]  CTMark/kimwitu++/kc.test
    924.1160 -> 921.7360     [  0.26%]  CTMark/Bullet/bullet.test
    497.1560 -> 496.2280     [  0.19%]  CTMark/ClamAV/clamscan.test
brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.upstream2/results.json build.D42717_newpatch/results.json compile_time
#  left column: build.upstream2/results.json
# right column: build.D42717_newpatch/results.json
#  metric name: compile_time
    487.0560 <- 536.3320     [ 10.12%]  CTMark/tramp3d-v4/tramp3d-v4.test
    347.4120 -> 343.2440     [  1.21%]  CTMark/sqlite3/sqlite3.test
    256.1000 -> 253.9560     [  0.84%]  CTMark/mafft/pairlocalalign.test
    588.3840 <- 591.7280     [  0.57%]  CTMark/lencod/lencod.test
    498.5240 -> 496.2280     [  0.46%]  CTMark/ClamAV/clamscan.test
    389.8320 -> 388.3840     [  0.37%]  CTMark/kimwitu++/kc.test
   1267.2280 <- 1270.5560    [  0.26%]  CTMark/7zip/7zip-benchmark.test
    368.5280 <- 369.3400     [  0.22%]  CTMark/consumer-typeset/consumer-typeset.test
    921.2680 <- 921.7360     [  0.05%]  CTMark/Bullet/bullet.test
    462.5440 -> 462.5360     [  0.00%]  CTMark/SPASS/SPASS.test

The upstream and upstream2 builds both use DT for LVI analysis. In the case of tramp3d-v4 we lucked out using DT analysis without causing a miscompile. I don't see a way to get that performance back.

This patch also invalidates my proposed fix for D34135 since we no longer have a valid DT before calling LVI in all instances.

I'm seeing considerable variance on the tramp3d-v4 compile time test before and after the preservation of DT on JumpThreading patch:

brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.pre_dt_1/results.json build.pre_dt_2/results.json compile_time
#  left column: build.pre_dt_1/results.json
# right column: build.pre_dt_2/results.json
#  metric name: compile_time
    524.8640 -> 474.9400     [ 10.51%]  CTMark/tramp3d-v4/tramp3d-v4.test
    343.1280 <- 369.4960     [  7.68%]  CTMark/sqlite3/sqlite3.test
    379.4560 <- 386.0880     [  1.75%]  CTMark/kimwitu++/kc.test
    250.3400 -> 248.0680     [  0.92%]  CTMark/mafft/pairlocalalign.test
    583.2080 <- 587.5400     [  0.74%]  CTMark/lencod/lencod.test
    918.8360 -> 912.4920     [  0.70%]  CTMark/Bullet/bullet.test
    456.7720 -> 454.7480     [  0.45%]  CTMark/SPASS/SPASS.test
    367.2440 -> 366.3920     [  0.23%]  CTMark/consumer-typeset/consumer-typeset.test
   1254.5240 -> 1252.3720    [  0.17%]  CTMark/7zip/7zip-benchmark.test
    490.1600 -> 489.4360     [  0.15%]  CTMark/ClamAV/clamscan.test
brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.pre_dt_1/results.json build.pre_dt_2/results.json compile_time
#  left column: build.pre_dt_1/results.json
# right column: build.pre_dt_2/results.json
#  metric name: compile_time
    524.8640 -> 474.9400     [ 10.51%]  CTMark/tramp3d-v4/tramp3d-v4.test
    343.1280 <- 369.4960     [  7.68%]  CTMark/sqlite3/sqlite3.test
    379.4560 <- 386.0880     [  1.75%]  CTMark/kimwitu++/kc.test
    250.3400 -> 248.0680     [  0.92%]  CTMark/mafft/pairlocalalign.test
    583.2080 <- 587.5400     [  0.74%]  CTMark/lencod/lencod.test
    918.8360 -> 912.4920     [  0.70%]  CTMark/Bullet/bullet.test
    456.7720 -> 454.7480     [  0.45%]  CTMark/SPASS/SPASS.test
    367.2440 -> 366.3920     [  0.23%]  CTMark/consumer-typeset/consumer-typeset.test
   1254.5240 -> 1252.3720    [  0.17%]  CTMark/7zip/7zip-benchmark.test
    490.1600 -> 489.4360     [  0.15%]  CTMark/ClamAV/clamscan.test
brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.0cb78d41fca7bac_1/results.json build.0cb78d41fca7bac_2/results.json compile_time
#  left column: build.0cb78d41fca7bac_1/results.json
# right column: build.0cb78d41fca7bac_2/results.json
#  metric name: compile_time
    535.8520 -> 483.3720     [ 10.86%]  CTMark/tramp3d-v4/tramp3d-v4.test
    372.9000 <- 378.4200     [  1.48%]  CTMark/sqlite3/sqlite3.test
    256.3040 -> 253.7600     [  1.00%]  CTMark/mafft/pairlocalalign.test
    387.8480 -> 385.0480     [  0.73%]  CTMark/kimwitu++/kc.test
    369.8760 <- 372.3440     [  0.67%]  CTMark/consumer-typeset/consumer-typeset.test
    458.8480 <- 461.5720     [  0.59%]  CTMark/SPASS/SPASS.test
    590.1440 -> 588.8520     [  0.22%]  CTMark/lencod/lencod.test
    499.0920 <- 499.9840     [  0.18%]  CTMark/ClamAV/clamscan.test
   1270.1200 -> 1267.9480    [  0.17%]  CTMark/7zip/7zip-benchmark.test
    928.7720 <- 928.8000     [  0.00%]  CTMark/Bullet/bullet.test
brzycki@cc01 /work/brzycki/test-suite $ ~/git/me/llvm/litdiff build.D42717_1/results.json build.D42717_2/results.json compile_time
#  left column: build.D42717_1/results.json
# right column: build.D42717_2/results.json
#  metric name: compile_time
    494.1120 -> 487.4120     [  1.37%]  CTMark/tramp3d-v4/tramp3d-v4.test
    369.8320 <- 373.5960     [  1.02%]  CTMark/consumer-typeset/consumer-typeset.test
    386.5280 <- 390.0800     [  0.92%]  CTMark/kimwitu++/kc.test
    501.9440 -> 497.7200     [  0.85%]  CTMark/ClamAV/clamscan.test
    255.9640 -> 254.5120     [  0.57%]  CTMark/mafft/pairlocalalign.test
    347.9000 -> 346.5880     [  0.38%]  CTMark/sqlite3/sqlite3.test
   1267.7040 <- 1269.8640    [  0.17%]  CTMark/7zip/7zip-benchmark.test
    462.5080 <- 462.8000     [  0.06%]  CTMark/SPASS/SPASS.test
    923.5880 -> 923.0320     [  0.06%]  CTMark/Bullet/bullet.test
    589.6800 <- 589.8840     [  0.03%]  CTMark/lencod/lencod.test

All of these are the same compiler compared against itself on as close to a quiescent x86_64 machine as I could find. The D42717 version has been rebased on top of llvm-project Git sha 0cb78d41fca7bac. I'm going to run more iterations of the pre_dt compiler (llvm-project sha 788d5233ad1f2 from January 12) to see if the variance persists.

If it does then this current patch does not meaningfully alter the compile time of tramp3d-v4.

The variance of tramp3d-v4 is considerable, as much as 18.5%. I ran CTMark on the pre-DDT compiler and here are the results:

$ cat bar.sh
#!/bin/bash

for i in 1 2 3 4 5 6; do
    for j in 1 2 3 4 5 6; do
        ~/git/me/llvm/litdiff \
            build.pre_dt_$i/results.json \
            build.pre_dt_$j/results.json \
            compile_time  | grep tramp3d
    done
done
brzycki@cc01 /work/brzycki/test-suite $ ./bar.sh
    524.8640 -> 474.9400     [ 10.51%]  CTMark/tramp3d-v4/tramp3d-v4.test
    524.8640 -> 486.9560     [  7.78%]  CTMark/tramp3d-v4/tramp3d-v4.test
    524.8640 -> 482.9680     [  8.67%]  CTMark/tramp3d-v4/tramp3d-v4.test
    524.8640 -> 484.6760     [  8.29%]  CTMark/tramp3d-v4/tramp3d-v4.test
    524.8640 <- 562.8000     [  7.23%]  CTMark/tramp3d-v4/tramp3d-v4.test
    474.9400 <- 524.8640     [ 10.51%]  CTMark/tramp3d-v4/tramp3d-v4.test
    474.9400 <- 486.9560     [  2.53%]  CTMark/tramp3d-v4/tramp3d-v4.test
    474.9400 <- 482.9680     [  1.69%]  CTMark/tramp3d-v4/tramp3d-v4.test
    474.9400 <- 484.6760     [  2.05%]  CTMark/tramp3d-v4/tramp3d-v4.test
    474.9400 <- 562.8000     [ 18.50%]  CTMark/tramp3d-v4/tramp3d-v4.test
    486.9560 <- 524.8640     [  7.78%]  CTMark/tramp3d-v4/tramp3d-v4.test
    486.9560 -> 474.9400     [  2.53%]  CTMark/tramp3d-v4/tramp3d-v4.test
    486.9560 -> 482.9680     [  0.83%]  CTMark/tramp3d-v4/tramp3d-v4.test
    486.9560 -> 484.6760     [  0.47%]  CTMark/tramp3d-v4/tramp3d-v4.test
    486.9560 <- 562.8000     [ 15.58%]  CTMark/tramp3d-v4/tramp3d-v4.test
    482.9680 <- 524.8640     [  8.67%]  CTMark/tramp3d-v4/tramp3d-v4.test
    482.9680 -> 474.9400     [  1.69%]  CTMark/tramp3d-v4/tramp3d-v4.test
    482.9680 <- 486.9560     [  0.83%]  CTMark/tramp3d-v4/tramp3d-v4.test
    482.9680 <- 484.6760     [  0.35%]  CTMark/tramp3d-v4/tramp3d-v4.test
    482.9680 <- 562.8000     [ 16.53%]  CTMark/tramp3d-v4/tramp3d-v4.test
    484.6760 <- 524.8640     [  8.29%]  CTMark/tramp3d-v4/tramp3d-v4.test
    484.6760 -> 474.9400     [  2.05%]  CTMark/tramp3d-v4/tramp3d-v4.test
    484.6760 <- 486.9560     [  0.47%]  CTMark/tramp3d-v4/tramp3d-v4.test
    484.6760 -> 482.9680     [  0.35%]  CTMark/tramp3d-v4/tramp3d-v4.test
    484.6760 <- 562.8000     [ 16.12%]  CTMark/tramp3d-v4/tramp3d-v4.test
    562.8000 -> 524.8640     [  7.23%]  CTMark/tramp3d-v4/tramp3d-v4.test
    562.8000 -> 474.9400     [ 18.50%]  CTMark/tramp3d-v4/tramp3d-v4.test
    562.8000 -> 486.9560     [ 15.58%]  CTMark/tramp3d-v4/tramp3d-v4.test
    562.8000 -> 482.9680     [ 16.53%]  CTMark/tramp3d-v4/tramp3d-v4.test
    562.8000 -> 484.6760     [ 16.12%]  CTMark/tramp3d-v4/tramp3d-v4.test

Best time was 474.94, worst was 562.80. I'll perform the same experiment with proposed patch in this differential.

Six runs of tramp3d-v4 with this patch applied show comperable variance and similar low/high bounds:

$ cat baz.sh
#!/bin/bash

for i in 1 2 3 4 5 6; do
    for j in 1 2 3 4 5 6; do
        ~/git/me/llvm/litdiff \
            build.D42717_$i/results.json \
            build.D42717_$j/results.json \
            compile_time  | grep tramp3d
    done
done
brzycki@cc01 /work/brzycki/test-suite $ ./baz.sh
    494.1120 -> 487.4120     [  1.37%]  CTMark/tramp3d-v4/tramp3d-v4.test
    494.1120 -> 485.5440     [  1.76%]  CTMark/tramp3d-v4/tramp3d-v4.test
    494.1120 <- 494.9720     [  0.17%]  CTMark/tramp3d-v4/tramp3d-v4.test
    494.1120 <- 502.4760     [  1.69%]  CTMark/tramp3d-v4/tramp3d-v4.test
    494.1120 <- 536.5600     [  8.59%]  CTMark/tramp3d-v4/tramp3d-v4.test
    487.4120 <- 494.1120     [  1.37%]  CTMark/tramp3d-v4/tramp3d-v4.test
    487.4120 -> 485.5440     [  0.38%]  CTMark/tramp3d-v4/tramp3d-v4.test
    487.4120 <- 494.9720     [  1.55%]  CTMark/tramp3d-v4/tramp3d-v4.test
    487.4120 <- 502.4760     [  3.09%]  CTMark/tramp3d-v4/tramp3d-v4.test
    487.4120 <- 536.5600     [ 10.08%]  CTMark/tramp3d-v4/tramp3d-v4.test
    485.5440 <- 494.1120     [  1.76%]  CTMark/tramp3d-v4/tramp3d-v4.test
    485.5440 <- 487.4120     [  0.38%]  CTMark/tramp3d-v4/tramp3d-v4.test
    485.5440 <- 494.9720     [  1.94%]  CTMark/tramp3d-v4/tramp3d-v4.test
    485.5440 <- 502.4760     [  3.49%]  CTMark/tramp3d-v4/tramp3d-v4.test
    485.5440 <- 536.5600     [ 10.51%]  CTMark/tramp3d-v4/tramp3d-v4.test
    494.9720 -> 494.1120     [  0.17%]  CTMark/tramp3d-v4/tramp3d-v4.test
    494.9720 -> 487.4120     [  1.55%]  CTMark/tramp3d-v4/tramp3d-v4.test
    494.9720 -> 485.5440     [  1.94%]  CTMark/tramp3d-v4/tramp3d-v4.test
    494.9720 <- 502.4760     [  1.52%]  CTMark/tramp3d-v4/tramp3d-v4.test
    494.9720 <- 536.5600     [  8.40%]  CTMark/tramp3d-v4/tramp3d-v4.test
    502.4760 -> 494.1120     [  1.69%]  CTMark/tramp3d-v4/tramp3d-v4.test
    502.4760 -> 487.4120     [  3.09%]  CTMark/tramp3d-v4/tramp3d-v4.test
    502.4760 -> 485.5440     [  3.49%]  CTMark/tramp3d-v4/tramp3d-v4.test
    502.4760 -> 494.9720     [  1.52%]  CTMark/tramp3d-v4/tramp3d-v4.test
    502.4760 <- 536.5600     [  6.78%]  CTMark/tramp3d-v4/tramp3d-v4.test
    536.5600 -> 494.1120     [  8.59%]  CTMark/tramp3d-v4/tramp3d-v4.test
    536.5600 -> 487.4120     [ 10.08%]  CTMark/tramp3d-v4/tramp3d-v4.test
    536.5600 -> 485.5440     [ 10.51%]  CTMark/tramp3d-v4/tramp3d-v4.test
    536.5600 -> 494.9720     [  8.40%]  CTMark/tramp3d-v4/tramp3d-v4.test
    536.5600 -> 502.4760     [  6.78%]  CTMark/tramp3d-v4/tramp3d-v4.test

The best time was 485.544 and the worst time was 536.56.

@dberlin I think this patch can be applied as-is to fix pr36133.

I used clang -O3 -fno-exceptions -ftime-report -c tramp3d-v4.cpp -o /tmp/foo.o with an upstream (non-patched) compiler. Here are the hottest areas of pass execution according to the reports. These lines are displayed in the same order:

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
37.7480 (  9.8%)   0.1080 (  3.9%)  37.8560 (  9.7%)  37.8099 (  9.7%)  Global Value Numbering
32.5400 (  8.4%)   0.3320 ( 11.9%)  32.8720 (  8.4%)  32.8960 (  8.5%)  Function Integration/Inlining
26.7520 (  6.9%)   0.2280 (  8.2%)  26.9800 (  6.9%)  26.9368 (  6.9%)  X86 DAG->DAG Instruction Selection
25.8080 (  6.7%)   0.0960 (  3.4%)  25.9040 (  6.7%)  25.8395 (  6.6%)  Combine redundant instructions
16.7080 (  4.3%)   0.1160 (  4.2%)  16.8240 (  4.3%)  17.0581 (  4.4%)  Combine redundant instructions
14.7120 (  3.8%)   0.1000 (  3.6%)  14.8120 (  3.8%)  14.8012 (  3.8%)  Combine redundant instructions
13.6320 (  3.5%)   0.1120 (  4.0%)  13.7440 (  3.5%)  13.6522 (  3.5%)  Combine redundant instructions
11.3840 (  2.9%)   0.0120 (  0.4%)  11.3960 (  2.9%)  11.2862 (  2.9%)  Loop Strength Reduction
10.6120 (  2.7%)   0.0080 (  0.3%)  10.6200 (  2.7%)  10.6267 (  2.7%)  Combine redundant instructions
10.0360 (  2.6%)   0.0320 (  1.1%)  10.0680 (  2.6%)  10.1269 (  2.6%)  Combine redundant instructions
 9.7280 (  2.5%)   0.0160 (  0.6%)   9.7440 (  2.5%)   9.7972 (  2.5%)  Combine redundant instructions
 8.1040 (  2.1%)   0.0240 (  0.9%)   8.1280 (  2.1%)   8.0646 (  2.1%)  Dead Store Elimination
 7.6040 (  2.0%)   0.0480 (  1.7%)   7.6520 (  2.0%)   7.6756 (  2.0%)  Value Propagation
 7.5520 (  2.0%)   0.0200 (  0.7%)   7.5720 (  1.9%)   7.5224 (  1.9%)  SLP Vectorizer
 6.6240 (  1.7%)   0.0800 (  2.9%)   6.7040 (  1.7%)   6.7257 (  1.7%)  Induction Variable Simplification
 6.6320 (  1.7%)   0.0320 (  1.1%)   6.6640 (  1.7%)   6.7028 (  1.7%)  Loop Invariant Code Motion
 6.5040 (  1.7%)   0.1000 (  3.6%)   6.6040 (  1.7%)   6.5648 (  1.7%)  Combine redundant instructions
 6.0480 (  1.6%)   0.0440 (  1.6%)   6.0920 (  1.6%)   6.2509 (  1.6%)  Memory SSA
 5.9080 (  1.5%)   0.0560 (  2.0%)   5.9640 (  1.5%)   6.0655 (  1.6%)  Value Propagation
 5.8120 (  1.5%)   0.0160 (  0.6%)   5.8280 (  1.5%)   5.9670 (  1.5%)  Early CSE w/ MemorySSA
 5.0480 (  1.3%)   0.0040 (  0.1%)   5.0520 (  1.3%)   5.0647 (  1.3%)  Loop Invariant Code Motion
 4.3280 (  1.1%)   0.0120 (  0.4%)   4.3400 (  1.1%)   4.3600 (  1.1%)  Loop Invariant Code Motion
 4.0080 (  1.0%)   0.0280 (  1.0%)   4.0360 (  1.0%)   4.0010 (  1.0%)  Greedy Register Allocator
 3.7480 (  1.0%)   0.0400 (  1.4%)   3.7880 (  1.0%)   3.7836 (  1.0%)  CodeGen Prepare
 3.2240 (  0.8%)   0.0080 (  0.3%)   3.2320 (  0.8%)   3.3606 (  0.9%)  Machine Instruction Scheduler
 3.2640 (  0.8%)   0.0280 (  1.0%)   3.2920 (  0.8%)   3.3383 (  0.9%)  MemCpy Optimization
 3.1040 (  0.8%)   0.0160 (  0.6%)   3.1200 (  0.8%)   3.1925 (  0.8%)  Induction Variable Users
 2.5760 (  0.7%)   0.0160 (  0.6%)   2.5920 (  0.7%)   2.6184 (  0.7%)  Unroll loops
 2.5720 (  0.7%)   0.0320 (  1.1%)   2.6040 (  0.7%)   2.5598 (  0.7%)  SROA
 2.5320 (  0.7%)   0.0200 (  0.7%)   2.5520 (  0.7%)   2.3682 (  0.6%)  Jump Threading
 2.3400 (  0.6%)   0.0040 (  0.1%)   2.3440 (  0.6%)   2.3639 (  0.6%)  Jump Threading
 2.2880 (  0.6%)   0.0280 (  1.0%)   2.3160 (  0.6%)   2.2819 (  0.6%)  Reassociate expressions
 2.1440 (  0.6%)   0.0120 (  0.4%)   2.1560 (  0.6%)   2.1243 (  0.5%)  Loop Load Elimination
 1.9400 (  0.5%)   0.0080 (  0.3%)   1.9480 (  0.5%)   2.0576 (  0.5%)  SROA
 1.9640 (  0.5%)   0.0160 (  0.6%)   1.9800 (  0.5%)   1.9998 (  0.5%)  Loop Vectorization
 1.5960 (  0.4%)   0.0280 (  1.0%)   1.6240 (  0.4%)   1.6763 (  0.4%)  Simplify the CFG
 1.5000 (  0.4%)   0.0160 (  0.6%)   1.5160 (  0.4%)   1.6653 (  0.4%)  Simplify the CFG
 1.6280 (  0.4%)   0.0160 (  0.6%)   1.6440 (  0.4%)   1.5298 (  0.4%)  Remove redundant instructions
 1.5160 (  0.4%)   0.0160 (  0.6%)   1.5320 (  0.4%)   1.4801 (  0.4%)  Module Verifier
 1.4280 (  0.4%)   0.0240 (  0.9%)   1.4520 (  0.4%)   1.4439 (  0.4%)  Module Verifier
 1.3960 (  0.4%)   0.0000 (  0.0%)   1.3960 (  0.4%)   1.4080 (  0.4%)  Module Verifier
 1.4440 (  0.4%)   0.0080 (  0.3%)   1.4520 (  0.4%)   1.3567 (  0.3%)  Simplify the CFG
 1.2880 (  0.3%)   0.0160 (  0.6%)   1.3040 (  0.3%)   1.3025 (  0.3%)  Simplify the CFG
 1.2040 (  0.3%)   0.0200 (  0.7%)   1.2240 (  0.3%)   1.2963 (  0.3%)  Early CSE
 1.0320 (  0.3%)   0.0120 (  0.4%)   1.0440 (  0.3%)   1.1418 (  0.3%)  Sparse Conditional Constant Propagation

and

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
38.5520 (  9.9%)   0.1040 (  3.9%)  38.6560 (  9.9%)  38.7477 (  9.9%)  Global Value Numbering
32.6440 (  8.4%)   0.2720 ( 10.2%)  32.9160 (  8.4%)  32.8803 (  8.4%)  Function Integration/Inlining
26.3040 (  6.8%)   0.2440 (  9.1%)  26.5480 (  6.8%)  26.6897 (  6.8%)  X86 DAG->DAG Instruction Selection
26.3120 (  6.8%)   0.0760 (  2.8%)  26.3880 (  6.7%)  26.2355 (  6.7%)  Combine redundant instructions
17.1440 (  4.4%)   0.0800 (  3.0%)  17.2240 (  4.4%)  17.3240 (  4.4%)  Combine redundant instructions
14.9080 (  3.8%)   0.0480 (  1.8%)  14.9560 (  3.8%)  15.0418 (  3.8%)  Combine redundant instructions
13.6120 (  3.5%)   0.0560 (  2.1%)  13.6680 (  3.5%)  13.8665 (  3.5%)  Combine redundant instructions
11.0240 (  2.8%)   0.0400 (  1.5%)  11.0640 (  2.8%)  11.0079 (  2.8%)  Loop Strength Reduction
10.7280 (  2.8%)   0.0320 (  1.2%)  10.7600 (  2.7%)  10.7927 (  2.7%)  Combine redundant instructions
10.2160 (  2.6%)   0.0400 (  1.5%)  10.2560 (  2.6%)  10.2907 (  2.6%)  Combine redundant instructions
 9.9440 (  2.6%)   0.0200 (  0.7%)   9.9640 (  2.5%)   9.9617 (  2.5%)  Combine redundant instructions
 8.3960 (  2.2%)   0.0320 (  1.2%)   8.4280 (  2.1%)   8.2761 (  2.1%)  Dead Store Elimination
 7.7520 (  2.0%)   0.0440 (  1.6%)   7.7960 (  2.0%)   7.8317 (  2.0%)  Value Propagation
 7.5040 (  1.9%)   0.0360 (  1.3%)   7.5400 (  1.9%)   7.4545 (  1.9%)  SLP Vectorizer
 6.8080 (  1.7%)   0.0240 (  0.9%)   6.8320 (  1.7%)   6.8237 (  1.7%)  Loop Invariant Code Motion
 6.7880 (  1.7%)   0.1160 (  4.3%)   6.9040 (  1.8%)   6.7762 (  1.7%)  Combine redundant instructions
 6.5240 (  1.7%)   0.0880 (  3.3%)   6.6120 (  1.7%)   6.6196 (  1.7%)  Induction Variable Simplification
 6.3320 (  1.6%)   0.0360 (  1.3%)   6.3680 (  1.6%)   6.4497 (  1.6%)  Memory SSA
 6.2720 (  1.6%)   0.0080 (  0.3%)   6.2800 (  1.6%)   6.2596 (  1.6%)  Value Propagation
 5.9040 (  1.5%)   0.0480 (  1.8%)   5.9520 (  1.5%)   6.0199 (  1.5%)  Early CSE w/ MemorySSA
 5.1880 (  1.3%)   0.0080 (  0.3%)   5.1960 (  1.3%)   5.1699 (  1.3%)  Loop Invariant Code Motion
 4.4320 (  1.1%)   0.0040 (  0.1%)   4.4360 (  1.1%)   4.4453 (  1.1%)  Loop Invariant Code Motion
 4.0040 (  1.0%)   0.0200 (  0.7%)   4.0240 (  1.0%)   4.0277 (  1.0%)  Greedy Register Allocator
 3.7680 (  1.0%)   0.0040 (  0.1%)   3.7720 (  1.0%)   3.7186 (  0.9%)  CodeGen Prepare
 3.3680 (  0.9%)   0.0200 (  0.7%)   3.3880 (  0.9%)   3.4240 (  0.9%)  MemCpy Optimization
 3.3920 (  0.9%)   0.0120 (  0.4%)   3.4040 (  0.9%)   3.3959 (  0.9%)  Machine Instruction Scheduler
 3.1280 (  0.8%)   0.0120 (  0.4%)   3.1400 (  0.8%)   3.1442 (  0.8%)  Induction Variable Users
 2.5080 (  0.6%)   0.0360 (  1.3%)   2.5440 (  0.6%)   2.5870 (  0.7%)  SROA
 2.4840 (  0.6%)   0.0320 (  1.2%)   2.5160 (  0.6%)   2.5860 (  0.7%)  Unroll loops
 2.3200 (  0.6%)   0.0160 (  0.6%)   2.3360 (  0.6%)   2.3900 (  0.6%)  Jump Threading
 2.3520 (  0.6%)   0.0200 (  0.7%)   2.3720 (  0.6%)   2.3781 (  0.6%)  Jump Threading
 2.1680 (  0.6%)   0.0240 (  0.9%)   2.1920 (  0.6%)   2.2602 (  0.6%)  Reassociate expressions
 2.1120 (  0.5%)   0.0160 (  0.6%)   2.1280 (  0.5%)   2.0903 (  0.5%)  Loop Load Elimination
 2.0600 (  0.5%)   0.0240 (  0.9%)   2.0840 (  0.5%)   2.0447 (  0.5%)  SROA
 2.0080 (  0.5%)   0.0000 (  0.0%)   2.0080 (  0.5%)   1.9710 (  0.5%)  Loop Vectorization
 1.5600 (  0.4%)   0.0160 (  0.6%)   1.5760 (  0.4%)   1.7095 (  0.4%)  Simplify the CFG
 1.6880 (  0.4%)   0.0200 (  0.7%)   1.7080 (  0.4%)   1.6968 (  0.4%)  Simplify the CFG
 1.4960 (  0.4%)   0.0080 (  0.3%)   1.5040 (  0.4%)   1.5448 (  0.4%)  Remove redundant instructions
 1.4880 (  0.4%)   0.0200 (  0.7%)   1.5080 (  0.4%)   1.4811 (  0.4%)  Module Verifier
 1.4680 (  0.4%)   0.0160 (  0.6%)   1.4840 (  0.4%)   1.4391 (  0.4%)  Module Verifier
 1.4120 (  0.4%)   0.0120 (  0.4%)   1.4240 (  0.4%)   1.4021 (  0.4%)  Module Verifier
 1.3120 (  0.3%)   0.0040 (  0.1%)   1.3160 (  0.3%)   1.3868 (  0.4%)  Simplify the CFG
 1.3240 (  0.3%)   0.0160 (  0.6%)   1.3400 (  0.3%)   1.3271 (  0.3%)  Simplify the CFG
 1.2680 (  0.3%)   0.0200 (  0.7%)   1.2880 (  0.3%)   1.3067 (  0.3%)  Early CSE
 1.1160 (  0.3%)   0.0160 (  0.6%)   1.1320 (  0.3%)   1.1449 (  0.3%)  Sparse Conditional Constant Propagation

There are small differences between their execution times, but none of them to me stand out. The total difference in wall-clock time of these runs is 3.16 seconds.

a.elovikov added inline comments.Feb 13 2018, 1:43 AM
llvm/lib/Transforms/Scalar/JumpThreading.cpp
654

Isn't it strange that we enable LVI without flushing DDT first? Even if it's ok now I'm afraid it might become a source of nasty bugs in the future.

And related question - would it be better to integrate the LVI->disableDT/enableDT calls into the DDT itself (or a wrapper around it)? So that any deferred update to the DT would automatically make DT disabled in LVI and every flush would make it enabled?

brzycki added inline comments.Feb 13 2018, 8:21 AM
llvm/lib/Transforms/Scalar/JumpThreading.cpp
654

Hello @a.elovikov,

I chose to enable/disable around each LVI analysis call in the case that future patches call DDT->flush() within JumpThreading. I could also make this call and feel comfortable with its implications:

if (DDT->pending())
  LVI->disableDT();
else
  LVI->enableDT();
// LVI analysis begins...

As for adding LVI into the DDT class itself, that approach ended up being full of problems. The LazyValue class doesn't actually have the code for enableDT()/disableDT() in the header like traditional LLVM classes. Because the class is broken into two for caching on-demand I get compile-time errors with LLVM unable to find the actual implementation of the disableDT()/enableDT() functions at link time:

lib/IR/CMakeFiles/LLVMCore.dir/Dominators.cpp.o: In function `llvm::DeferredDominance::applyUpdate(llvm::DomTreeBuilder::UpdateKind, llvm::BasicBlock*, llvm::BasicBlock*)':
/work/brzycki/D42717/llvm-project/llvm/lib/IR/Dominators.cpp:569: undefined reference to `llvm::LazyValueInfo::disableDT()'
lib/IR/CMakeFiles/LLVMCore.dir/Dominators.cpp.o: In function `llvm::DeferredDominance::flush()':
/work/brzycki/D42717/llvm-project/llvm/lib/IR/Dominators.cpp:464: undefined reference to `llvm::LazyValueInfo::enableDT()'
lib/IR/CMakeFiles/LLVMCore.dir/Dominators.cpp.o: In function `llvm::DeferredDominance::recalculate(llvm::Function&)':
/work/brzycki/D42717/llvm-project/llvm/lib/IR/Dominators.cpp:481: undefined reference to `llvm::LazyValueInfo::enableDT()'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

If you know of a way to fix these errors I'll give it another try.

dberlin added inline comments.Feb 13 2018, 8:51 AM
llvm/lib/Transforms/Scalar/JumpThreading.cpp
654

Hey @a.elovikov ,

You really don't want DDT to know things about passes. IMHO, if you were going to go that route, LVI checks whether DT is null, and so you could wrap this all in a "UpdatedDTOrNull" object that, when cast to a DominatorTree *, returned the tree if it was up to date, or a null pointer if it was not.

That *looks* nicer code wise.
But one advantage to the way Brian is doing it is that it is easily verifiable and doesn't hide where and when LVI has a disabled DT.

When we hide stuff like that, it tends to regress and nobody notices for a while.

brzycki updated this revision to Diff 134111.Feb 13 2018, 2:11 PM

Rebased to tip, switched to always enabling or disabling before calling LVI analysis. Prevents LVI from being accidentally enabled.

brzycki updated this revision to Diff 134132.Feb 13 2018, 3:17 PM

Use std::swap() in enableDT/DisableDT methods.

IMHO, if you were going to go that route, LVI checks whether DT is null, and so you could wrap this all in a "UpdatedDTOrNull" object that, when cast to a DominatorTree *, returned the tree if it was up to date, or a null pointer if it was not.

It's a bit more subtle than that @dberlin due to the way LazyValueInfo is defined. Because the LazyValueInfoImpl object is cached and re-used it would store a pointer to DT (or nullptr) at the instance of creating the first cachced LVI instance. I ran into this problem with my first attempt to disable DT within LVI and is another reason I decided on the new external methods.

I think this patch is ready to commit, barring any objections.

I might be missing something, but why don't we enable DT back in LVI after the JT pass finishes and flushes the DDT? Is it possible that we end up with disabled DT *after* the JT?

llvm/lib/Analysis/LazyValueInfo.cpp
469

I'd add

assert((!DT || !DisabledDT) && "Both DT and DisabledDT are not nullptr!")
476

Same assert for DT and DisabledDT not being null simultaneously.

I might be missing something, but why don't we enable DT back in LVI after the JT pass finishes and flushes the DDT? Is it possible that we end up with disabled DT *after* the JT?

I missed that, thank you @a.elovikov . I'm adding it to the next patch revision.

llvm/lib/Analysis/LazyValueInfo.cpp
469

Good point, will add asserts.

brzycki updated this revision to Diff 134239.Feb 14 2018, 8:33 AM

Added asserts to enableDT / disableDT to verify both DT and DisabledDT are never both valid pointers. Added a final LVI->enabledDT() at the very end of runImpl to allow subsequent passes to use DT with the cached LVI object.

@a.elovikov and @dberlin does this look good to commit?

@a.elovikov and @dberlin does this look good to commit?

It's looks ok for me but I have not contributed enough to give lgtms.

It's looks ok for me but I have not contributed enough to give lgtms.

Thanks @a.elovikov .
@kuhar , @sebpop , or @dberlin does the patch look ok to commit?

sebpop accepted this revision.Feb 15 2018, 12:51 PM

LGTM.

This revision was automatically updated to reflect the committed changes.