This is an archive of the discontinued LLVM Phabricator instance.

[CaptureTracking] Increase limit but use it for all visited uses.
Closed, Public

Authored by fhahn on May 23 2022, 1:04 PM.

Details

Summary

Currently the MaxUsesToExplore limit only applies to the number of users
per value, not the total number of users to explore.

The current limit of 20 pessimizes IR with opaque pointers in some
cases. Without opaque pointers, pointer def-use chains are generally
deeper, due to the extra bitcasts and index-0 GEPs emitted for struct
accesses.

With opaque pointers the def-use chains are not as deep but wider,
because those bitcasts and index-0 GEPs are no longer emitted.
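
For illustration only, a small C++ source example (made up for this note, not taken from the patch or the test suite); the relevant point is the shape of the pointer def-use chains the two modes produce for Sp, not the exact instructions:

  // With typed pointers, the accesses below go through a bitcast and an
  // index-0 GEP on Sp, so each intermediate value has few users but the
  // chain from Sp down to the memory operations is deep. With opaque
  // pointers those no-op casts/GEPs are gone and the operations use Sp
  // directly, so the chain is shallow but Sp has many more direct users.
  struct S { int A; float B; };

  void escape(void *P); // defined elsewhere; the pointer escapes here

  int example(S *Sp, int N) {
    int Sum = 0;
    for (int I = 0; I < N; ++I)
      Sum += Sp->A; // typed: load via an index-0 GEP; opaque: load directly from Sp
    escape(Sp);     // typed: Sp is first bitcast to i8*; opaque: the call uses Sp directly
    return Sum;
  }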

To improve the situation for opaque pointers, this patch does 2 things:

  1. Apply the limit to the total number of uses visited. The wording of the option's description suggests this may have been the original intention; with the current per-value implementation we can still end up walking a lot of uses in total (see the sketch after this list).
  2. Increase the limit to 100. This is quite arbitrary, but enables a good number of additional optimizations.
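
A minimal sketch of the change in item 1, using simplified stand-in types rather than LLVM's actual classes (this is not the code from the patch):

  #include <set>
  #include <vector>

  // Placeholder types standing in for llvm::Value / llvm::Use.
  struct Value;
  struct Use { Value *User = nullptr; };
  struct Value { std::vector<Use> Uses; };

  // Returns false if the walk is abandoned because the limit was hit
  // (a real caller would then conservatively assume "captured").
  bool walkUses(Value *Root, unsigned Limit) {
    std::vector<Use *> Worklist;
    std::set<Use *> Visited;
    unsigned Count = 0; // shared across the whole walk

    auto AddUses = [&](Value *V) {
      for (Use &U : V->Uses) {
        if (Count++ >= Limit)
          return false; // total budget exhausted, no matter which value U belongs to
        if (Visited.insert(&U).second)
          Worklist.push_back(&U);
      }
      return true;
    };

    if (!AddUses(Root))
      return false;
    while (!Worklist.empty()) {
      Use *U = Worklist.back();
      Worklist.pop_back();
      // ... a real tracker would classify U here (capturing / not capturing) ...
      if (U->User && !AddUses(U->User)) // keep walking through non-capturing users
        return false;
    }
    return true;
  }

Before this patch the count was, roughly speaking, kept per value whose uses were being added, so a wide tree of values could together visit far more than MaxUsesToExplore uses in total.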

These adjustments have a noticeable compile-time impact, though. In part
that is likely due to the additional transformations they enable (and
conversely, the current baseline misses those optimizations after
switching to opaque pointers).

Limit=100:

  • NewPM-O3: +0.15%
  • NewPM-ReleaseThinLTO: +0.86%
  • NewPM-ReleaseLTO-g: +0.44%

https://llvm-compile-time-tracker.com/compare.php?from=8bfccb963b3519393c0266b452a115a4bb46d207&to=818719fad01d472412c963629671a81a8703b25b&stat=instructions

Limit=60:

  • NewPM-O3: +0.14%
  • NewPM-ReleaseThinLTO: +0.41%
  • NewPM-ReleaseLTO-g: +0.21%

https://llvm-compile-time-tracker.com/compare.php?from=aeb19817d66f1a15754163c7f48e01e9ebdd6d45&to=520563fdc146319aae90d06f88d87f2e9e1247b7&stat=instructions

Limit=40:

  • NewPM-O3: +0.11%
  • NewPM-ReleaseThinLTO: +0.12%
  • NewPM-ReleaseLTO-g: +0.09%

https://llvm-compile-time-tracker.com/compare.php?from=aeb19817d66f1a15754163c7f48e01e9ebdd6d45&to=c9182576e9fe3f1c84a71479665aef91a416318c&stat=instructions

I'll add a test if/once we converge on an approach. I'd be more than happy to
discuss alternatives as well.

Diff Detail

Event Timeline

fhahn created this revision. May 23 2022, 1:04 PM
Herald added a project: Restricted Project. · View Herald Transcript · May 23 2022, 1:04 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
fhahn requested review of this revision. May 23 2022, 1:04 PM
Herald added a project: Restricted Project. · View Herald Transcript · May 23 2022, 1:04 PM
fhahn updated this revision to Diff 431461. May 23 2022, 1:04 PM

Update patch to actually use 100 as limit

this makes sense

is there a number you're seeing which roughly matches performance you were getting with typed pointers?

llvm/lib/Analysis/CaptureTracking.cpp
452

does moving this after Visited.insert() below help at all? I'd think that going through the worklist below is the expensive part, not constructing the worklist

Basic idea makes a lot of sense. I'd be tempted to take some cost here, just for the robustness.

llvm/lib/Analysis/CaptureTracking.cpp
452

It also feels odd to count uses we chose not to explore. Maybe we should simply count the number of items we pull off the worklist, rather than counting them as they are added?
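
For illustration, the alternative suggested here, in the same simplified style as the sketch in the summary (reusing its placeholder Value/Use types; again, not the committed code):

  // The budget is charged when an item is pulled off the worklist rather
  // than when uses are added, so uses that never get explored are free.
  bool walkUsesCountPops(Value *Root, unsigned Limit) {
    std::vector<Use *> Worklist;
    std::set<Use *> Visited;

    auto AddUses = [&](Value *V) {
      for (Use &U : V->Uses)
        if (Visited.insert(&U).second)
          Worklist.push_back(&U);
    };

    AddUses(Root);
    unsigned Explored = 0; // uses actually taken off the worklist
    while (!Worklist.empty()) {
      if (Explored++ >= Limit)
        return false; // give up: conservatively treat the pointer as captured
      Use *U = Worklist.back();
      Worklist.pop_back();
      // ... classify U; if it does not capture, continue into its user ...
      if (U->User)
        AddUses(U->User);
    }
    return true;
  }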

fhahn added a comment. May 24 2022, 4:26 AM

is there a number you're seeing which roughly matches performance you were getting with typed pointers?

I'm not sure that data point is all that interesting in isolation. It might be more interesting to look at the impact on the capture-tracking.NumNotCapturedBefore statistic. Below are numbers for MultiSource/SPEC2006/SPEC2017 on X86 with -O3. As there are loads of changes, I removed programs where the change is small (±1-2%) and where the absolute value is low (< ~40) to keep things a bit more compact.

The first table shows current main with opaque pointers disabled vs enabled. Note that there are quite a few notable regressions. With a limit of 60 we still see regressions; going to 100 reduces them further. I also tried a limit of 200, which only slightly reduces the number of regressions beyond that.

Program | No Opaque Pointers | Opaque Pointers | diff
MultiSourc...arks/DOE-ProxyApps-C/CoMD/CoMD | 34 | 36 | 5.9%
MultiSourc...plications/lambda-0.1.3/lambda | 19 | 20 | 5.3%
MultiSourc...e/Applications/obsequi/Obsequi | 30 | 31 | 3.3%
MultiSourc...e/Applications/sqlite3/sqlite3 | 829 | 813 | -1.9%
External/S...C/CINT2006/458.sjeng/458.sjeng | 47 | 46 | -2.1%
External/S...te/520.omnetpp_r/520.omnetpp_r | 3320 | 3241 | -2.4%
External/S...06/483.xalancbmk/483.xalancbmk | 3692 | 3600 | -2.5%
External/S...te/526.blender_r/526.blender_r | 8525 | 8309 | -2.5%
External/S...NT2006/464.h264ref/464.h264ref | 121 | 117 | -3.3%
External/S...NT2017rate/502.gcc_r/502.gcc_r | 5240 | 5051 | -3.6%
MultiSourc.../Applications/JM/ldecod/ldecod | 50 | 48 | -4.0%
External/S...C/CINT2006/445.gobmk/445.gobmk | 376 | 360 | -4.3%
MultiSourc...Benchmarks/7zip/7zip-benchmark | 1132 | 1083 | -4.3%
External/S...06/400.perlbench/400.perlbench | 279 | 266 | -4.7%
MultiSourc...e/Applications/SIBsim4/SIBsim4 | 102 | 97 | -4.9%
MultiSourc.../Applications/JM/lencod/lencod | 111 | 105 | -5.4%
External/S...CINT2017rate/557.xz_r/557.xz_r | 238 | 222 | -6.7%
External/S...rate/511.povray_r/511.povray_r | 547 | 494 | -9.7%
External/S.../CFP2006/453.povray/453.povray | 549 | 495 | -9.8%
External/S...17rate/541.leela_r/541.leela_r | 449 | 385 | -14.3%
External/S...rate/510.parest_r/510.parest_r | 45561 | 37990 | -16.6%
External/S...NT2006/471.omnetpp/471.omnetpp | 12187 | 10133 | -16.9%
MultiSourc.../DOE-ProxyApps-C++/CLAMR/CLAMR | 1004 | 833 | -17.0%
External/S.../CFP2006/447.dealII/447.dealII | 11253 | 9313 | -17.2%
MultiSourc...-ProxyApps-C++/PENNANT/PENNANT | 4003 | 3171 | -20.8%
MultiSourc...ALAC/encode/alacconvert-encode | 106 | 79 | -25.5%
MultiSourc...ALAC/decode/alacconvert-decode | 106 | 79 | -25.5%
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE | 949 | 659 | -30.6%
MultiSourc...e/Applications/ClamAV/clamscan | 398 | 212 | -46.7%
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg | 164 | 65 | -60.4%
MultiSourc...ch/consumer-jpeg/consumer-jpeg | 119 | 21 | -82.4%
MultiSourc...ity-rijndael/security-rijndael | 48 | 0 | -100.0%
Program | Base (No Opaque Pointers) | Patch 60 (Opaque Pointers) | diff
MultiSourc...ity-blowfish/security-blowfish | 1 | 361 | 36000.0%
MultiSourc...sumer-typeset/consumer-typeset | 122 | 165 | 35.2%
MultiSourc...arks/DOE-ProxyApps-C/CoMD/CoMD | 34 | 44 | 29.4%
MultiSourc.../Applications/JM/lencod/lencod | 111 | 136 | 22.5%
MultiSource/Applications/SPASS/SPASS | 80 | 98 | 22.5%
External/S...00.perlbench_r/500.perlbench_r | 560 | 682 | 21.8%
MultiSource/Applications/oggenc/oggenc | 126 | 148 | 17.5%
External/S...NT2006/464.h264ref/464.h264ref | 121 | 142 | 17.4%
External/S...06/400.perlbench/400.perlbench | 279 | 326 | 16.8%
External/SPEC/CINT2006/403.gcc/403.gcc | 968 | 1114 | 15.1%
MultiSourc...hmarks/MallocBench/cfrac/cfrac | 30 | 32 | 6.7%
MultiSourc...enchmarks/mafft/pairlocalalign | 101 | 107 | 5.9%
External/S.../CFP2006/450.soplex/450.soplex | 431 | 455 | 5.6%
MultiSourc...nchmarks/FreeBench/mason/mason | 42 | 44 | 4.8%
MultiSource/Benchmarks/Bullet/bullet | 677 | 707 | 4.4%
External/S...2017rate/525.x264_r/525.x264_r | 159 | 166 | 4.4%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 | 4975 | 5190 | 4.3%
MultiSourc.../Applications/JM/ldecod/ldecod | 50 | 52 | 4.0%
External/S...rate/511.povray_r/511.povray_r | 547 | 562 | 2.7%
External/S...NT2017rate/502.gcc_r/502.gcc_r | 5240 | 5377 | 2.6%
External/S.../CFP2006/453.povray/453.povray | 549 | 563 | 2.6%
MultiSourc...e/Applications/sqlite3/sqlite3 | 829 | 849 | 2.4%
External/S...C/CINT2006/445.gobmk/445.gobmk | 376 | 385 | 2.4%
MultiSourc...e/Applications/ClamAV/clamscan | 398 | 406 | 2.0%
External/S...C/CINT2006/458.sjeng/458.sjeng | 47 | 46 | -2.1%
MultiSource/Benchmarks/Ptrdist/bc/bc | 60 | 55 | -8.3%
External/S.../CFP2006/447.dealII/447.dealII | 11253 | 10244 | -9.0%
External/S...17rate/541.leela_r/541.leela_r | 449 | 385 | -14.3%
External/S...rate/510.parest_r/510.parest_r | 45561 | 38815 | -14.8%
MultiSourc.../DOE-ProxyApps-C++/CLAMR/CLAMR | 1004 | 853 | -15.0%
MultiSourc...-ProxyApps-C++/PENNANT/PENNANT | 4003 | 3171 | -20.8%
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE | 949 | 665 | -29.9%
External/S...te/538.imagick_r/538.imagick_r | 837 | 511 | -38.9%
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg | 164 | 79 | -51.8%
MultiSourc...ch/consumer-jpeg/consumer-jpeg | 119 | 19 | -84.0%
MultiSourc...ity-rijndael/security-rijndael | 48 | 0 | -100.0%
Program | Base (No Opaque Pointers) | Patch 100 (Opaque Pointers) | diff
MultiSourc...ity-blowfish/security-blowfish | 1 | 361 | 36000.0%
MultiSourc...ch/office-ispell/office-ispell | 4 | 39 | 875.0%
External/S.../462.libquantum/462.libquantum | 13 | 61 | 369.2%
External/S...C/CINT2006/456.hmmer/456.hmmer | 191 | 315 | 64.9%
External/SPEC/CINT2006/403.gcc/403.gcc | 968 | 1425 | 47.2%
MultiSourc...arks/DOE-ProxyApps-C/CoMD/CoMD | 34 | 50 | 47.1%
MultiSourc...ch/consumer-lame/consumer-lame | 69 | 101 | 46.4%
MultiSourc...sumer-typeset/consumer-typeset | 122 | 173 | 41.8%
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg | 164 | 216 | 31.7%
MultiSource/Applications/lemon/lemon | 32 | 42 | 31.2%
External/S...rate/511.povray_r/511.povray_r | 547 | 699 | 27.8%
External/S.../CFP2006/453.povray/453.povray | 549 | 701 | 27.7%
MultiSourc.../Applications/JM/lencod/lencod | 111 | 140 | 26.1%
External/S...00.perlbench_r/500.perlbench_r | 560 | 700 | 25.0%
MultiSource/Applications/oggenc/oggenc | 126 | 157 | 24.6%
External/S...C/CINT2006/445.gobmk/445.gobmk | 376 | 466 | 23.9%
MultiSource/Applications/SPASS/SPASS | 80 | 99 | 23.8%
External/S...NT2006/464.h264ref/464.h264ref | 121 | 146 | 20.7%
External/S...06/400.perlbench/400.perlbench | 279 | 326 | 16.8%
MultiSourc...OE-ProxyApps-C/miniGMG/miniGMG | 23 | 26 | 13.0%
MultiSource/Benchmarks/Bullet/bullet | 677 | 748 | 10.5%
MultiSourc.../Applications/JM/ldecod/ldecod | 50 | 55 | 10.0%
External/S.../CFP2006/450.soplex/450.soplex | 431 | 470 | 9.0%
External/S...2017rate/525.x264_r/525.x264_r | 159 | 172 | 8.2%
External/S...NT2017rate/502.gcc_r/502.gcc_r | 5240 | 5609 | 7.0%
External/S...te/526.blender_r/526.blender_r | 8525 | 9075 | 6.5%
External/S...C/CINT2006/458.sjeng/458.sjeng | 47 | 50 | 6.4%
MultiSourc...enchmarks/mafft/pairlocalalign | 101 | 107 | 5.9%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 | 4975 | 5235 | 5.2%
MultiSourc...nchmarks/FreeBench/mason/mason | 42 | 44 | 4.8%
External/S.../CFP2006/447.dealII/447.dealII | 11253 | 10466 | -7.0%
External/S...rate/510.parest_r/510.parest_r | 45561 | 39810 | -12.6%
External/S...17rate/541.leela_r/541.leela_r | 449 | 385 | -14.3%
MultiSourc.../DOE-ProxyApps-C++/CLAMR/CLAMR | 1004 | 853 | -15.0%
MultiSourc...-ProxyApps-C++/PENNANT/PENNANT | 4003 | 3171 | -20.8%
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE | 949 | 665 | -29.9%
External/S...te/538.imagick_r/538.imagick_r | 837 | 537 | -35.8%
nikic added a comment. May 24 2022, 6:13 AM

is there a number you're seeing which roughly matches performance you were getting with typed pointers?

I'm not sure that data point is all that interesting in isolation. It might be more interesting to look at the impact on the capture-tracking.NumNotCapturedBefore statistic. Below are numbers for MultiSource/SPEC2006/SPEC2017 on X86 with -O3. As there are loads of changes, I removed programs where the change is small (±1-2%) and where the absolute value is low (< ~40) to keep things a bit more compact.

The first table shows current main with opaque pointers disabled vs enabled. Note that there are quite a few notable regressions. With a limit of 60 we still see regressions; going to 100 reduces them further. I also tried a limit of 200, which only slightly reduces the number of regressions beyond that.

Looking at just the NumNotCapturedBefore statistic is only meaningful if NumCapturedBefore+NumNotCapturedBefore is approximately the same in both configurations. Is that the case? It would probably be more meaningful to look for changes in the ratio NumNotCapturedBefore/(NumCapturedBefore+NumNotCapturedBefore), i.e. the percentage of queries that succeed.

I do agree with the general change here -- in fact, some time ago I was looking into the same change for the reverse reason: the current implementation of the limit sometimes results in many more uses being visited than is reasonable. I ultimately ended up not pursuing it because it has a significant impact on optimization behavior and I didn't have time to analyze it in detail.

The actual value of the limit matters though -- the current capture tracking implementation is intended to be cheap enough that uncached usage is viable, and we can clearly see that compile-time is quite sensitive to changes to this limit. We shouldn't overshoot in the other direction here. If we want to raise this too high, we'll have to start thinking about using different limits for different callers. E.g. visiting many users should be fine in DSE because results are cached for the whole duration of the pass, while visiting many uses in (non-batch) AA is problematic.
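
To make that concrete, a purely hypothetical sketch of per-caller budgets; the names and numbers below are invented for illustration and are not part of this patch or of the existing API:

  // Hypothetical per-caller budgets: callers that cache capture results
  // for a whole pass (e.g. DSE) could afford a deeper walk than one-off,
  // uncached AA queries.
  enum class CaptureClient { UncachedAAQuery, CachedPassQuery };

  unsigned maxUsesToExploreFor(CaptureClient C) {
    switch (C) {
    case CaptureClient::UncachedAAQuery:
      return 20;  // keep ad-hoc queries cheap
    case CaptureClient::CachedPassQuery:
      return 100; // the result is reused, so a larger budget pays off
    }
    return 20; // unreachable, keeps compilers happy
  }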

PS: I think some of the data in your table is incorrect, e.g. it has the line MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 4 975 5190 4.3%, where the percentage is much smaller than the change.

fhahn added a comment. May 30 2022, 9:23 AM

Looking at just the NumNotCapturedBefore statistic is only meaningful if NumCapturedBefore+NumNotCapturedBefore is approximately the same in both configurations. Is that the case? It would probably be more meaningful to look for changes in the ratio NumNotCapturedBefore/(NumCapturedBefore+NumNotCapturedBefore), i.e. the percentage of queries that succeed.

Good point. Below I added tables with NumCaptured, NumNotcaptured, the sum and NotCaptured/Sum. I've only included SPEC2006/SPEC2017, as those have the largest number of capture queries.

For both limit=80 and limit=100, the total number of queries is quite similar to the baseline without opaque pointers, with most differing < 10%. The largest percentage increase is 531.deepsjeng_r with +30%. With limit=80, there's still a notable regression in the success percentage for 433.milc.

The actual value of the limit matters though -- the current capture tracking implementation is intended to be cheap enough that uncached usage is viable, and we can clearly see that compile-time is quite sensitive to changes to this limit. We shouldn't overshoot in the other direction here. If we want to raise this too high, we'll have to start thinking about using different limits for different callers. E.g. visiting many users should be fine in DSE because results are cached for the whole duration of the pass, while visiting many uses in (non-batch) AA is problematic.

It might be worth exploring different limits for different clients, but I think another major client is GVN (via MemDepAnalysis) and I think exploring a sufficient number of uses there can be quite important (the regression I was investigating was due to missed load elimination by GVN)

Base with opaque pointers disabled

name | NumCaptured | NotCaptured | Sum | NotCaptured/Sum
CFP2006/433.milc/433.milc1602118621346488.10%
CFP2006/444.namd/444.namd1866051187110.27%
CFP2006/447.dealII/447.dealII1808902157820246810.66%
CFP2006/450.soplex/450.soplex993425751250920.59%
CFP2006/453.povray/453.povray2691996423656126.37%
CFP2006/470.lbm/470.lbm5411116567.27%
CFP2006/482.sphinx3/482.sphinx32525650317520.47%
CFP2017rate/508.namd_r/508.namd_r28714233289470.80%
CFP2017rate/510.parest_r/510.parest_r639689663937060829.40%
CFP2017rate/511.povray_r/511.povray_r2641896593607726.77%
CFP2017rate/519.lbm_r/519.lbm_r5411116567.27%
CFP2017rate/526.blender_r/526.blender_r56668013518470186419.26%
CFP2017rate/538.imagick_r/538.imagick_r9665470011036556.75%
CFP2017rate/544.nab_r/544.nab_r678552973147.23%
CINT2006/400.perlbench/400.perlbench1169319641365714.38%
CINT2006/401.bzip2/401.bzip2434664109860.47%
CINT2006/403.gcc/403.gcc21205161493735443.23%
CINT2006/429.mcf/429.mcf720720.00%
CINT2006/445.gobmk/445.gobmk1028743221460929.58%
CINT2006/456.hmmer/456.hmmer6940819775910.56%
CINT2006/458.sjeng/458.sjeng10252270329568.89%
CINT2006/462.libquantum/462.libquantum42919362231.03%
CINT2006/464.h264ref/464.h264ref72848154958834317.54%
CINT2006/471.omnetpp/471.omnetpp386717140384.23%
CINT2006/473.astar/473.astar1279622190132.72%
CINT2006/483.xalancbmk/483.xalancbmk367384003407419.83%
CINT2017rate/500.perlbench_r/500.perlbench_r3033184053873621.70%
CINT2017rate/502.gcc_r/502.gcc_r948395933415417338.49%
CINT2017rate/505.mcf_r/505.mcf_r160889104984.75%
CINT2017rate/520.omnetpp_r/520.omnetpp_r3510877624287018.11%
CINT2017rate/523.xalancbmk_r/523.xalancbmk_r749534255792085.37%
CINT2017rate/525.x264_r/525.x264_r857418631043717.85%
CINT2017rate/531.deepsjeng_r/531.deepsjeng_r10342410582.27%
CINT2017rate/541.leela_r/541.leela_r370340041039.75%
CINT2017rate/557.xz_r/557.xz_r1282792207438.19%

Limit 100 (this patch)

name | NumCaptured | NotCaptured | Sum | NotCaptured/Sum
CFP2006/433.milc/433.milc1586119201350688.26%
CFP2006/444.namd/444.namd2074851207990.25%
CFP2006/447.dealII/447.dealII194571212802158519.86%
CFP2006/450.soplex/450.soplex1060237601436226.18%
CFP2006/453.povray/453.povray25427124773790432.92%
CFP2006/470.lbm/470.lbm5411116567.27%
CFP2006/482.sphinx3/482.sphinx32472743321523.11%
CFP2017rate/508.namd_r/508.namd_r28466236287020.82%
CFP2017rate/510.parest_r/510.parest_r689825691847590099.12%
CFP2017rate/511.povray_r/511.povray_r24913124933740633.40%
CFP2017rate/519.lbm_r/519.lbm_r5411116567.27%
CFP2017rate/526.blender_r/526.blender_r51661616545468207024.26%
CFP2017rate/538.imagick_r/538.imagick_r913701034910171910.17%
CFP2017rate/544.nab_r/544.nab_r75721072864412.40%
CINT2006/400.perlbench/400.perlbench1133824951383318.04%
CINT2006/401.bzip2/401.bzip2421705112662.61%
CINT2006/403.gcc/403.gcc19972237104368254.28%
CINT2006/429.mcf/429.mcf720720.00%
CINT2006/445.gobmk/445.gobmk981473041711842.67%
CINT2006/456.hmmer/456.hmmer66932352904526.00%
CINT2006/458.sjeng/458.sjeng9782323330170.37%
CINT2006/462.libquantum/462.libquantum37626964541.71%
CINT2006/464.h264ref/464.h264ref61720335179523735.19%
CINT2006/471.omnetpp/471.omnetpp391224841605.96%
CINT2006/473.astar/473.astar1214668188235.49%
CINT2006/483.xalancbmk/483.xalancbmk3857159594453013.38%
CINT2017rate/500.perlbench_r/500.perlbench_r28819169374575637.02%
CINT2017rate/502.gcc_r/502.gcc_r967037652317322644.18%
CINT2017rate/505.mcf_r/505.mcf_r159892105184.87%
CINT2017rate/520.omnetpp_r/520.omnetpp_r3516795904475721.43%
CINT2017rate/523.xalancbmk_r/523.xalancbmk_r761695805819747.08%
CINT2017rate/525.x264_r/525.x264_r847321661063920.36%
CINT2017rate/531.deepsjeng_r/531.deepsjeng_r978397137528.87%
CINT2017rate/541.leela_r/541.leela_r400541844239.45%
CINT2017rate/557.xz_r/557.xz_r1357780213736.50%

Limit 80

name | NumCaptured | NotCaptured | Sum | NotCaptured/Sum
CFP2006/433.milc/433.milc18086641844978.60%
CFP2006/444.namd/444.namd2074851207990.25%
CFP2006/447.dealII/447.dealII194239212822155219.87%
CFP2006/450.soplex/450.soplex1063936971433625.79%
CFP2006/453.povray/453.povray25994118563785031.32%
CFP2006/470.lbm/470.lbm5411116567.27%
CFP2006/482.sphinx3/482.sphinx32471743321423.12%
CFP2017rate/508.namd_r/508.namd_r2861875286930.26%
CFP2017rate/510.parest_r/510.parest_r690282690147592969.09%
CFP2017rate/511.povray_r/511.povray_r25405118723727731.85%
CFP2017rate/519.lbm_r/519.lbm_r5411116567.27%
CFP2017rate/526.blender_r/526.blender_r51712018639770351726.50%
CFP2017rate/538.imagick_r/538.imagick_r913258548998738.56%
CFP2017rate/544.nab_r/544.nab_r75811063864412.30%
CINT2006/400.perlbench/400.perlbench1129524931378818.08%
CINT2006/401.bzip2/401.bzip2424702112662.34%
CINT2006/403.gcc/403.gcc20794219394273351.34%
CINT2006/429.mcf/429.mcf720720.00%
CINT2006/445.gobmk/445.gobmk989770061690341.45%
CINT2006/456.hmmer/456.hmmer66722318899025.78%
CINT2006/458.sjeng/458.sjeng10332234326768.38%
CINT2006/462.libquantum/462.libquantum40024364337.79%
CINT2006/464.h264ref/464.h264ref62153326419479434.43%
CINT2006/471.omnetpp/471.omnetpp397717341504.17%
CINT2006/473.astar/473.astar1210668187835.57%
CINT2006/483.xalancbmk/483.xalancbmk3869159384462913.31%
CINT2017rate/500.perlbench_r/500.perlbench_r29096130754217131.00%
CINT2017rate/502.gcc_r/502.gcc_r982447455117279543.14%
CINT2017rate/505.mcf_r/505.mcf_r159892105184.87%
CINT2017rate/520.omnetpp_r/520.omnetpp_r3517495614473521.37%
CINT2017rate/523.xalancbmk_r/523.xalancbmk_r762465768820147.03%
CINT2017rate/525.x264_r/525.x264_r844121091055019.99%
CINT2017rate/531.deepsjeng_r/531.deepsjeng_r978397137528.87%
CINT2017rate/541.leela_r/541.leela_r402441744419.39%
CINT2017rate/557.xz_r/557.xz_r1357777213436.41%

Limit 60

name | NumCaptured | NotCaptured | Sum | NotCaptured/Sum
CFP2006/433.milc/433.milc1981721270226.68%
CFP2006/444.namd/444.namd2074851207990.25%
CFP2006/447.dealII/447.dealII194613211042157179.78%
CFP2006/450.soplex/450.soplex1068935761426525.07%
CFP2006/453.povray/453.povray26712105903730228.39%
CFP2006/470.lbm/470.lbm5411116567.27%
CFP2006/482.sphinx3/482.sphinx32472743321523.11%
CFP2017rate/508.namd_r/508.namd_r2863567287020.23%
CFP2017rate/510.parest_r/510.parest_r690663683907590539.01%
CFP2017rate/511.povray_r/511.povray_r26198106063680428.82%
CFP2017rate/519.lbm_r/519.lbm_r5411116567.27%
CFP2017rate/526.blender_r/526.blender_r52022815590267613023.06%
CFP2017rate/538.imagick_r/538.imagick_r9154584621000078.46%
CFP2017rate/544.nab_r/544.nab_r75721072864412.40%
CINT2006/400.perlbench/400.perlbench1134424911383518.01%
CINT2006/401.bzip2/401.bzip2428698112661.99%
CINT2006/403.gcc/403.gcc20978209144189249.92%
CINT2006/429.mcf/429.mcf720720.00%
CINT2006/445.gobmk/445.gobmk1036762751664237.71%
CINT2006/456.hmmer/456.hmmer6881784766510.23%
CINT2006/458.sjeng/458.sjeng10332322335569.21%
CINT2006/462.libquantum/462.libquantum44819564330.33%
CINT2006/464.h264ref/464.h264ref67363212298859223.96%
CINT2006/471.omnetpp/471.omnetpp398917141604.11%
CINT2006/473.astar/473.astar1280602188231.99%
CINT2006/483.xalancbmk/483.xalancbmk3858959414453013.34%
CINT2017rate/500.perlbench_r/500.perlbench_r29667106054027226.33%
CINT2017rate/502.gcc_r/502.gcc_r995487077417032241.55%
CINT2017rate/505.mcf_r/505.mcf_r159892105184.87%
CINT2017rate/520.omnetpp_r/520.omnetpp_r3520795204472721.28%
CINT2017rate/523.xalancbmk_r/523.xalancbmk_r761875787819747.06%
CINT2017rate/525.x264_r/525.x264_r863620071064318.86%
CINT2017rate/531.deepsjeng_r/531.deepsjeng_r978397137528.87%
CINT2017rate/541.leela_r/541.leela_r400541844239.45%
CINT2017rate/557.xz_r/557.xz_r1357780213736.50%
nikic added a comment. May 30 2022, 2:11 PM

Good point. Below I added tables with NumCaptured, NumNotcaptured, the sum and NotCaptured/Sum. I've only included SPEC2006/SPEC2017, as those have the largest number of capture queries.

Thanks! I've put the data in a spreadsheet for easy comparison: https://docs.google.com/spreadsheets/d/13cwt-aWjAb_a1Ri9vqS4RgCslfDkEwSqYoYpEtuoeeI/edit?usp=sharing For Limit=60 we're doing slightly better than baseline on average (1.1x) with two outlier regressions. For Limit=80 we avoid one outlier and slightly improve the average (1.2x). With Limit=100 we avoid both outliers and again slightly improve the average (1.3x).

It might be worth exploring different limits for different clients, but I think another major client is GVN (via MemDepAnalysis) and I think exploring a sufficient number of uses there can be quite important (the regression I was investigating was due to missed load elimination by GVN)

Unfortunately MemDepAnalysis is exactly the area where cost increases are most problematic :( In part because cross-BB MDA uses completely ridiculous cutoffs -- we should probably go ahead and reduce those, and not wait for the magic bullet of NewGVN or MSSA. But that's neither here nor there.

fhahn added a comment. Jun 2 2022, 5:45 AM

Good point. Below I added tables with NumCaptured, NumNotcaptured, the sum and NotCaptured/Sum. I've only included SPEC2006/SPEC2017, as those have the largest number of capture queries.

Thanks! I've put the data in a spreadsheet for easy comparison: https://docs.google.com/spreadsheets/d/13cwt-aWjAb_a1Ri9vqS4RgCslfDkEwSqYoYpEtuoeeI/edit?usp=sharing For Limit=60 we're doing slightly better than baseline on average (1.1x) with two outlier regressions. For Limit=80 we avoid one outlier and slightly improve the average (1.2x). With Limit=100 we avoid both outliers and again slightly improve the average (1.3x).

I suppose the question is how far we want to go with respect to avoiding any outlier regressions :) I think any of 60, 80 or 100 would be a clear improvement for opaque pointers.

When comparing the numbers on the compile-time-tracker for 60 and 100, the majority of the difference is down to tramp3d-v4: https://llvm-compile-time-tracker.com/compare.php?from=520563fdc146319aae90d06f88d87f2e9e1247b7&to=818719fad01d472412c963629671a81a8703b25b&stat=instructions

But I don't think this is mostly caused by time spent in capture tracking itself; rather, it comes from the additional optimizations (the binary size difference is about 1.2% for LTO). Looking at the stats, there's an increase in the number of dead stores removed and also in the number of instructions removed by GVN.

Given that, I'd be slightly inclined to go for the larger limit out of the options.

It might be worth exploring different limits for different clients, but I think another major client is GVN (via MemDepAnalysis) and I think exploring a sufficient number of uses there can be quite important (the regression I was investigating was due to missed load elimination by GVN)

Unfortunately MemDepAnalysis is exactly the area where cost increases are most problematic :( In part because cross-BB MDA uses completely ridiculous cutoffs -- we should probably go ahead and reduce those, and not wait for the magic bullet of NewGVN or MSSA. But that's neither here nor there.

Yeah that's unfortunate. But I just noticed that MemDepAnalysis appears to use BatchAA for most queries already, so it should also cache some of the capture results, right?

nikic accepted this revision. Jun 2 2022, 6:09 AM

Good point. Below I added tables with NumCaptured, NumNotcaptured, the sum and NotCaptured/Sum. I've only included SPEC2006/SPEC2017, as those have the largest number of capture queries.

Thanks! I've put the data in a spreadsheet for easy comparison: https://docs.google.com/spreadsheets/d/13cwt-aWjAb_a1Ri9vqS4RgCslfDkEwSqYoYpEtuoeeI/edit?usp=sharing For Limit=60 we're doing slightly better than baseline on average (1.1x) with two outlier regressions. For Limit=80 we avoid one outlier and slightly improve the average (1.2x). With Limit=100 we avoid both outliers and again slightly improve the average (1.3x).

I suppose the question is how far we want to go with respect to avoiding any outlier regressions :) I think any of 60, 80 or 100 would be a clear improvement for opaque pointers.

When comparing the numbers on the compile-time-tracker for 60 and 100, the majority of the difference is down to tramp3d-v4: https://llvm-compile-time-tracker.com/compare.php?from=520563fdc146319aae90d06f88d87f2e9e1247b7&to=818719fad01d472412c963629671a81a8703b25b&stat=instructions

But I don't think this is mostly caused by time spent in capture tracking itself; rather, it comes from the additional optimizations (the binary size difference is about 1.2% for LTO). Looking at the stats, there's an increase in the number of dead stores removed and also in the number of instructions removed by GVN.

Given that, I'd be slightly inclined to go for the larger limit out of the options.

Yeah, I think I agree.

It might be worth exploring different limits for different clients, but I think another major client is GVN (via MemDepAnalysis) and I think exploring a sufficient number of uses there can be quite important (the regression I was investigating was due to missed load elimination by GVN)

Unfortunately MemDepAnalysis is exactly the area where cost increases are most problematic :( In part because cross-BB MDA uses completely ridiculous cutoffs -- we should probably go ahead and reduce those, and not wait for the magic bullet of NewGVN or MSSA. But that's neither here nor there.

Yeah that's unfortunate. But I just noticed that MemDepAnalysis appears to use BatchAA for most queries already, so it should also cache some of the capture results, right?

Yes, it does use BatchAA, but only within one query. Each query uses a separate BatchAA instance. So the impact of caching is rather limited.
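
Schematically, why the per-query instance limits the benefit; the caching wrapper below is a made-up stand-in for illustration, not the real BatchAA interface:

  #include <map>

  struct Query {
    int A, B;
    bool operator<(const Query &O) const { return A != O.A ? A < O.A : B < O.B; }
  };

  // Made-up wrapper: results are remembered only for the wrapper's lifetime.
  struct CachingAA {
    std::map<Query, bool> Cache;
    bool expensiveWalk(Query Q) { return Q.A == Q.B; } // placeholder for an alias/capture walk
    bool query(Query Q) {
      auto It = Cache.find(Q);
      if (It != Cache.end())
        return It->second; // cache hit: the walk is skipped
      return Cache[Q] = expensiveWalk(Q);
    }
  };

  // The per-query pattern described above: each top-level query gets a
  // fresh wrapper, so nothing learned in one query helps the next one.
  void perQueryPattern() {
    { CachingAA AA; AA.query({1, 2}); AA.query({1, 2}); } // second lookup is a cache hit
    { CachingAA AA; AA.query({1, 2}); }                   // cache starts empty again
  }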

This revision is now accepted and ready to land. Jun 2 2022, 6:09 AM
fhahn marked an inline comment as done. Jun 2 2022, 1:21 PM
fhahn added inline comments.
llvm/lib/Analysis/CaptureTracking.cpp
452

Good point, I'll adjust that in the commit, thanks!

This revision was landed with ongoing or failed builds. Jun 2 2022, 1:44 PM
This revision was automatically updated to reflect the committed changes.
fhahn marked an inline comment as done.