This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Analysis/
-
Analysis/
2/3
CaptureTracking.cpp
-
test/Transforms/GVN/
-
Transforms/
-
GVN/
-
capture-tracking-limit.ll

Differential D126236

[CaptureTracking] Increase limit but use it for all visited uses.
ClosedPublic

Authored by fhahn on May 23 2022, 1:04 PM.

Download Raw Diff

Details

Reviewers

nikic
aeubanks
reames
jdoerfert

Commits

rG78c6b1488f30: [CaptureTracking] Increase limit and use it for all visited uses.

Summary

Currently the MaxUsesToExplore limit only applies to the number of users
per value, not the total number of users to explore.

The current limit of 20 pessimizes IR with opaque pointers in some
cases. Without opaque pointers, we have deeper pointer def-use chains in
general due to extra bitcasts and geps for structs with index 0.

With opaque pointers the def-use chain is not as deep but wider, due to
bitcasts & 0-geps missing.

To improve the situation for opaque pointers, this patch does 2 things:

Apply the limit to the total number of uses visited. From the wording in the description of the option it seems like this may be the original intention. With the current implementation we could still end up walking a lot of uses.
Increase the limit to 100. This is quite arbitrary, but enables a good number of additional optimizations.

Those adjustments have a noticeable compile-time impact though. In part
that is likely due to additional transformations (and conversely
the current baseline misses optimizations after switching to opaque
pointers).

Limit=100:

NewPM-O3: +0.15%
NewPM-ReleaseThinLTO: +0.86%
NewPM-ReleaseLTO-g: +0.44%

https://llvm-compile-time-tracker.com/compare.php?from=8bfccb963b3519393c0266b452a115a4bb46d207&to=818719fad01d472412c963629671a81a8703b25b&stat=instructions

Limit=60:

NewPM-O3: +0.14%
NewPM-ReleaseThinLTO: +0.41%
NewPM-ReleaseLTO-g: +0.21%

https://llvm-compile-time-tracker.com/compare.php?from=aeb19817d66f1a15754163c7f48e01e9ebdd6d45&to=520563fdc146319aae90d06f88d87f2e9e1247b7&stat=instructions

Limit=40:

NewPM-O3: +0.11%
NewPM-ReleaseThinLTO: +0.12%
NewPM-ReleaseLTO-g: +0.09%

https://llvm-compile-time-tracker.com/compare.php?from=aeb19817d66f1a15754163c7f48e01e9ebdd6d45&to=c9182576e9fe3f1c84a71479665aef91a416318c&stat=instructions

I'll add a test if/once we converge on agreement. I'd be more than happy to
discuss alternatives as well

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

fhahn created this revision.May 23 2022, 1:04 PM

Herald added a project: Restricted Project. · View Herald TranscriptMay 23 2022, 1:04 PM

Herald added a subscriber: hiraditya. · View Herald Transcript

fhahn requested review of this revision.May 23 2022, 1:04 PM

Herald added a project: Restricted Project. · View Herald TranscriptMay 23 2022, 1:04 PM

Update patch to actually use 100 as limit

this makes sense

is there a number you're seeing which roughly matches performance you were getting with typed pointers?

llvm/lib/Analysis/CaptureTracking.cpp
451	does moving this after `Visited.insert()` below help at all? I'd think that going through the worklist below is the expensive part, not constructing the worklist

Basic idea makes a lot of sense. I'd be tempted to take some cost here, just for the robustness.

llvm/lib/Analysis/CaptureTracking.cpp
451	It also feels odd to count uses we chose not to explore. Maybe we should simply count the number of items we pull off the worklist, not the addition at all?

Harbormaster completed remote builds in B165901: Diff 431461.May 23 2022, 1:51 PM

is there a number you're seeing which roughly matches performance you were getting with typed pointers?

I'm not sure if that data point is too interesting in isolation. It might be more interesting to look at the impact on the capture-tracking.NumNotCapturedBefore statistic. Below are numbers for MultiSource/SPEC2006/SPEC2017 on X86 with -O3. As there are loads of changes, I removed programs where the change is small (+-1-2%) and where the absolute value is low (< ~40) to keep things a bit more compact.

The first table shows current main with opaque pointers disabled vs enabled. Note that there are quite a few notable regressions. With a limit of 60, we still see regressions, going to 100 reduces that further. I also tried a limit of 200 and that only slightly reduces the number of regressions further.

Program	No Opaque Pointers	Opaque Pointers	diff
MultiSourc...arks/DOE-ProxyApps-C/CoMD/CoMD	34	36	5.9%
MultiSourc...plications/lambda-0.1.3/lambda	19	20	5.3%
MultiSourc...e/Applications/obsequi/Obsequi	30	31	3.3%
MultiSourc...e/Applications/sqlite3/sqlite3	829	813	-1.9%
External/S...C/CINT2006/458.sjeng/458.sjeng	47	46	-2.1%
External/S...te/520.omnetpp_r/520.omnetpp_r	3320	3241	-2.4%
External/S...06/483.xalancbmk/483.xalancbmk	3692	3600	-2.5%
External/S...te/526.blender_r/526.blender_r	8525	8309	-2.5%
External/S...NT2006/464.h264ref/464.h264ref	121	117	-3.3%
External/S...NT2017rate/502.gcc_r/502.gcc_r	5240	5051	-3.6%
MultiSourc.../Applications/JM/ldecod/ldecod	50	48	-4.0%
External/S...C/CINT2006/445.gobmk/445.gobmk	376	360	-4.3%
MultiSourc...Benchmarks/7zip/7zip-benchmark	1132	1083	-4.3%
External/S...06/400.perlbench/400.perlbench	279	266	-4.7%
MultiSourc...e/Applications/SIBsim4/SIBsim4	102	97	-4.9%
MultiSourc.../Applications/JM/lencod/lencod	111	105	-5.4%
External/S...CINT2017rate/557.xz_r/557.xz_r	238	222	-6.7%
External/S...rate/511.povray_r/511.povray_r	547	494	-9.7%
External/S.../CFP2006/453.povray/453.povray	549	495	-9.8%
External/S...17rate/541.leela_r/541.leela_r	449	385	-14.3%
External/S...rate/510.parest_r/510.parest_r	45561	37990	-16.6%
External/S...NT2006/471.omnetpp/471.omnetpp	12187	10133	-16.9%
MultiSourc.../DOE-ProxyApps-C++/CLAMR/CLAMR	1004	833	-17.0%
External/S.../CFP2006/447.dealII/447.dealII	11253	9313	-17.2%
MultiSourc...-ProxyApps-C++/PENNANT/PENNANT	4003	3171	-20.8%
MultiSourc...ALAC/encode/alacconvert-encode	106	79	-25.5%
MultiSourc...ALAC/decode/alacconvert-decode	106	79	-25.5%
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE	949	659	-30.6%
MultiSourc...e/Applications/ClamAV/clamscan	398	212	-46.7%
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg	164	65	-60.4%
MultiSourc...ch/consumer-jpeg/consumer-jpeg	119	21	-82.4%
MultiSourc...ity-rijndael/security-rijndael	48	0	100.0%

Program	Base (No Opaque Pointers)	Patch 60 (Opaque Pointers)	diff
MultiSourc...ity-blowfish/security-blowfish	1	361	36000.0%
MultiSourc...sumer-typeset/consumer-typeset	122	165	35.2%
MultiSourc...arks/DOE-ProxyApps-C/CoMD/CoMD	34	44	29.4%
MultiSourc.../Applications/JM/lencod/lencod	111	136	22.5%
MultiSource/Applications/SPASS/SPASS	80	98	22.5%
External/S...00.perlbench_r/500.perlbench_r	560	682	21.8%
MultiSource/Applications/oggenc/oggenc	126	148	17.5%
External/S...NT2006/464.h264ref/464.h264ref	121	142	17.4%
External/S...06/400.perlbench/400.perlbench	279	326	16.8%
External/SPEC/CINT2006/403.gcc/403.gcc	968	1114	15.1%
MultiSourc...hmarks/MallocBench/cfrac/cfrac	30	32	6.7%
MultiSourc...enchmarks/mafft/pairlocalalign	101	107	5.9%
External/S.../CFP2006/450.soplex/450.soplex	431	455	5.6%
MultiSourc...nchmarks/FreeBench/mason/mason	42	44	4.8%
MultiSource/Benchmarks/Bullet/bullet	677	707	4.4%
External/S...2017rate/525.x264_r/525.x264_r	159	166	4.4%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 4	975	5190	4.3%
MultiSourc.../Applications/JM/ldecod/ldecod	50	52	4.0%
External/S...rate/511.povray_r/511.povray_r	547	562	2.7%
External/S...NT2017rate/502.gcc_r/502.gcc_r 5	240	5377	2.6%
External/S.../CFP2006/453.povray/453.povray	549	563	2.6%
MultiSourc...e/Applications/sqlite3/sqlite3	829	849	2.4%
External/S...C/CINT2006/445.gobmk/445.gobmk	376	385	2.4%
MultiSourc...e/Applications/ClamAV/clamscan	398	406	2.0%
External/S...C/CINT2006/458.sjeng/458.sjeng	47	46	-2.1%
MultiSource/Benchmarks/Ptrdist/bc/bc	60	55	-8.3%
External/S.../CFP2006/447.dealII/447.dealII	11253	244	-9.0%
External/S...17rate/541.leela_r/541.leela_r	449	385	-14.3%
External/S...rate/510.parest_r/510.parest_r	45561	8815	-14.8%
MultiSourc.../DOE-ProxyApps-C++/CLAMR/CLAMR	1004	853	-15.0%
MultiSourc...-ProxyApps-C++/PENNANT/PENNANT	4003	3171	-20.8%
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE	949	665	-29.9%
External/S...te/538.imagick_r/538.imagick_r	837	511	-38.9%
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg	164	79	-51.8%
MultiSourc...ch/consumer-jpeg/consumer-jpeg	119	19	-84.0%
MultiSourc...ity-rijndael/security-rijndael	48	0	-100.0%

Program	Base (No Opaque Pointers)	Patch 100 (Opaque Pointers)	diff
MultiSourc...ity-blowfish/security-blowfish	1	361	36000.0%
MultiSourc...ch/office-ispell/office-ispell	4	39	875.0%
External/S.../462.libquantum/462.libquantum	13	61	369.2%
External/S...C/CINT2006/456.hmmer/456.hmmer	191	315	64.9%
External/SPEC/CINT2006/403.gcc/403.gcc	968	1425	47.2%
MultiSourc...arks/DOE-ProxyApps-C/CoMD/CoMD	34	50	47.1%
MultiSourc...ch/consumer-lame/consumer-lame	69	101	46.4%
MultiSourc...sumer-typeset/consumer-typeset	122	173	41.8%
MultiSourc.../mediabench/jpeg/jpeg-6a/cjpeg	164	216	31.7%
MultiSource/Applications/lemon/lemon	32	42	31.2%
External/S...rate/511.povray_r/511.povray_r	547	699	27.8%
External/S.../CFP2006/453.povray/453.povray	549	701	27.7%
MultiSourc.../Applications/JM/lencod/lencod	111	140	26.1%
External/S...00.perlbench_r/500.perlbench_r	560	700	25.0%
MultiSource/Applications/oggenc/oggenc	126	157	24.6%
External/S...C/CINT2006/445.gobmk/445.gobmk	376	466	23.9%
MultiSource/Applications/SPASS/SPASS	80	99	23.8%
External/S...NT2006/464.h264ref/464.h264ref	121	146	20.7%
External/S...06/400.perlbench/400.perlbench	279	326	16.8%
MultiSourc...OE-ProxyApps-C/miniGMG/miniGMG	23	26	13.0%
MultiSource/Benchmarks/Bullet/bullet	677	748	10.5%
MultiSourc.../Applications/JM/ldecod/ldecod	50	55	10.0%
External/S.../CFP2006/450.soplex/450.soplex	431	470	9.0%
External/S...2017rate/525.x264_r/525.x264_r	159	172	8.2%
External/S...NT2017rate/502.gcc_r/502.gcc_r 5	240	5609	7.0%
External/S...te/526.blender_r/526.blender_r 8	525	9075	6.5%
External/S...C/CINT2006/458.sjeng/458.sjeng	47	50	6.4%
MultiSourc...enchmarks/mafft/pairlocalalign	101	107	5.9%
MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 4	975	5235	5.2%
MultiSourc...nchmarks/FreeBench/mason/mason	42	44	4.8%
External/S.../CFP2006/447.dealII/447.dealII	11253	10466	-7.0%
External/S...rate/510.parest_r/510.parest_r	45561	39810	-12.6%
External/S...17rate/541.leela_r/541.leela_r	449	385	-14.3%
MultiSourc.../DOE-ProxyApps-C++/CLAMR/CLAMR	1004	853	-15.0%
MultiSourc...-ProxyApps-C++/PENNANT/PENNANT	4003	3171	-20.8%
MultiSourc...OE-ProxyApps-C++/miniFE/miniFE	949	665	-29.9%
External/S...te/538.imagick_r/538.imagick_r	837	537	-35.8%

In D126236#3533958, @fhahn wrote:

is there a number you're seeing which roughly matches performance you were getting with typed pointers?

I'm not sure if that data point is too interesting in isolation. It might be more interesting to look at the impact on the capture-tracking.NumNotCapturedBefore statistic. Below are numbers for MultiSource/SPEC2006/SPEC2017 on X86 with -O3. As there are loads of changes, I removed programs where the change is small (+-1-2%) and where the absolute value is low (< ~40) to keep things a bit more compact.

The first table shows current main with opaque pointers disabled vs enabled. Note that there are quite a few notable regressions. With a limit of 60, we still see regressions, going to 100 reduces that further. I also tried a limit of 200 and that only slightly reduces the number of regressions further.

Looking at just the NumNotCapturedBefore statistic is only meaningful if NumCapturedBefore+NumNotCapturedBefore is approximately the same in both configurations. Is that the case? It would probably be more meaningful to look for changes in the ratio NumNotCapturedBefore/(NumCapturedBefore+NumNotCapturedBefore), i.e. the percentage of queries that succeed.

I do agree with the general change here -- in fact, some time ago I was looking into the same change for the reverse reason: The current implementation of the limit sometimes results in much more uses being visited than is reasonable. I ultimately ended up not pursing it because is has significant impact on optimization behavior and I didn't have time to analyze it in detail.

The actual value of the limit matters though -- the current capture tracking implementation is intended to be cheap enough that uncached usage is viable, and we can clearly see that compile-time is quite sensitive to changes to this limit. We shouldn't overshoot in the other direction here. If we want to raise this too high, we'll have to start thinking about using different limits for different callers. E.g. visiting many users should be fine in DSE because results are cached for the whole duration of the pass, while visiting many uses in (non-batch) AA is problematic.

PS: I think some of the data in your table is incorrect, e.g. it has the line MultiSourc...nchmarks/tramp3d-v4/tramp3d-v4 4 975 5190 4.3%, where the percentage is much smaller than the change.

Looking at just the NumNotCapturedBefore statistic is only meaningful if NumCapturedBefore+NumNotCapturedBefore is approximately the same in both configurations. Is that the case? It would probably be more meaningful to look for changes in the ratio NumNotCapturedBefore/(NumCapturedBefore+NumNotCapturedBefore), i.e. the percentage of queries that succeed.

Good point. Below I added tables with NumCaptured, NumNotcaptured, the sum and NotCaptured/Sum. I've only included SPEC2006/SPEC2017, as those have the largest number of capture queries.

For both limit=80 and limit=100, the total number of queries is quite similar to the baseline without opaque pointers, with most differing < 10%. The largest percentage increase is 531.deepsjeng_r with +30%. With limit=80, there's still a notable regression in the success percentage for 433.milc.

The actual value of the limit matters though -- the current capture tracking implementation is intended to be cheap enough that uncached usage is viable, and we can clearly see that compile-time is quite sensitive to changes to this limit. We shouldn't overshoot in the other direction here. If we want to raise this too high, we'll have to start thinking about using different limits for different callers. E.g. visiting many users should be fine in DSE because results are cached for the whole duration of the pass, while visiting many uses in (non-batch) AA is problematic.

It might be worth exploring different limits for different clients, but I think another major client is GVN (via MemDepAnalysis) and I think exploring a sufficient number of uses there can be quite important (the regression I was investigating was due to missed load elimination by GVN)

Base with opaque pointers disabled

name	NumCaptured	NotCaptured	Sum	NotCaptured/Sum
CFP2006/433.milc/433.milc	1602	11862	13464	88.10%
CFP2006/444.namd/444.namd	18660	51	18711	0.27%
CFP2006/447.dealII/447.dealII	180890	21578	202468	10.66%
CFP2006/450.soplex/450.soplex	9934	2575	12509	20.59%
CFP2006/453.povray/453.povray	26919	9642	36561	26.37%
CFP2006/470.lbm/470.lbm	54	111	165	67.27%
CFP2006/482.sphinx3/482.sphinx3	2525	650	3175	20.47%
CFP2017rate/508.namd_r/508.namd_r	28714	233	28947	0.80%
CFP2017rate/510.parest_r/510.parest_r	639689	66393	706082	9.40%
CFP2017rate/511.povray_r/511.povray_r	26418	9659	36077	26.77%
CFP2017rate/519.lbm_r/519.lbm_r	54	111	165	67.27%
CFP2017rate/526.blender_r/526.blender_r	566680	135184	701864	19.26%
CFP2017rate/538.imagick_r/538.imagick_r	96654	7001	103655	6.75%
CFP2017rate/544.nab_r/544.nab_r	6785	529	7314	7.23%
CINT2006/400.perlbench/400.perlbench	11693	1964	13657	14.38%
CINT2006/401.bzip2/401.bzip2	434	664	1098	60.47%
CINT2006/403.gcc/403.gcc	21205	16149	37354	43.23%
CINT2006/429.mcf/429.mcf	72	0	72	0.00%
CINT2006/445.gobmk/445.gobmk	10287	4322	14609	29.58%
CINT2006/456.hmmer/456.hmmer	6940	819	7759	10.56%
CINT2006/458.sjeng/458.sjeng	1025	2270	3295	68.89%
CINT2006/462.libquantum/462.libquantum	429	193	622	31.03%
CINT2006/464.h264ref/464.h264ref	72848	15495	88343	17.54%
CINT2006/471.omnetpp/471.omnetpp	3867	171	4038	4.23%
CINT2006/473.astar/473.astar	1279	622	1901	32.72%
CINT2006/483.xalancbmk/483.xalancbmk	36738	4003	40741	9.83%
CINT2017rate/500.perlbench_r/500.perlbench_r	30331	8405	38736	21.70%
CINT2017rate/502.gcc_r/502.gcc_r	94839	59334	154173	38.49%
CINT2017rate/505.mcf_r/505.mcf_r	160	889	1049	84.75%
CINT2017rate/520.omnetpp_r/520.omnetpp_r	35108	7762	42870	18.11%
CINT2017rate/523.xalancbmk_r/523.xalancbmk_r	74953	4255	79208	5.37%
CINT2017rate/525.x264_r/525.x264_r	8574	1863	10437	17.85%
CINT2017rate/531.deepsjeng_r/531.deepsjeng_r	1034	24	1058	2.27%
CINT2017rate/541.leela_r/541.leela_r	3703	400	4103	9.75%
CINT2017rate/557.xz_r/557.xz_r	1282	792	2074	38.19%

Limit 100 (this patch)

name	NumCaptured	NotCaptured	Sum	NotCaptured/Sum
CFP2006/433.milc/433.milc	1586	11920	13506	88.26%
CFP2006/444.namd/444.namd	20748	51	20799	0.25%
CFP2006/447.dealII/447.dealII	194571	21280	215851	9.86%
CFP2006/450.soplex/450.soplex	10602	3760	14362	26.18%
CFP2006/453.povray/453.povray	25427	12477	37904	32.92%
CFP2006/470.lbm/470.lbm	54	111	165	67.27%
CFP2006/482.sphinx3/482.sphinx3	2472	743	3215	23.11%
CFP2017rate/508.namd_r/508.namd_r	28466	236	28702	0.82%
CFP2017rate/510.parest_r/510.parest_r	689825	69184	759009	9.12%
CFP2017rate/511.povray_r/511.povray_r	24913	12493	37406	33.40%
CFP2017rate/519.lbm_r/519.lbm_r	54	111	165	67.27%
CFP2017rate/526.blender_r/526.blender_r	516616	165454	682070	24.26%
CFP2017rate/538.imagick_r/538.imagick_r	91370	10349	101719	10.17%
CFP2017rate/544.nab_r/544.nab_r	7572	1072	8644	12.40%
CINT2006/400.perlbench/400.perlbench	11338	2495	13833	18.04%
CINT2006/401.bzip2/401.bzip2	421	705	1126	62.61%
CINT2006/403.gcc/403.gcc	19972	23710	43682	54.28%
CINT2006/429.mcf/429.mcf	72	0	72	0.00%
CINT2006/445.gobmk/445.gobmk	9814	7304	17118	42.67%
CINT2006/456.hmmer/456.hmmer	6693	2352	9045	26.00%
CINT2006/458.sjeng/458.sjeng	978	2323	3301	70.37%
CINT2006/462.libquantum/462.libquantum	376	269	645	41.71%
CINT2006/464.h264ref/464.h264ref	61720	33517	95237	35.19%
CINT2006/471.omnetpp/471.omnetpp	3912	248	4160	5.96%
CINT2006/473.astar/473.astar	1214	668	1882	35.49%
CINT2006/483.xalancbmk/483.xalancbmk	38571	5959	44530	13.38%
CINT2017rate/500.perlbench_r/500.perlbench_r	28819	16937	45756	37.02%
CINT2017rate/502.gcc_r/502.gcc_r	96703	76523	173226	44.18%
CINT2017rate/505.mcf_r/505.mcf_r	159	892	1051	84.87%
CINT2017rate/520.omnetpp_r/520.omnetpp_r	35167	9590	44757	21.43%
CINT2017rate/523.xalancbmk_r/523.xalancbmk_r	76169	5805	81974	7.08%
CINT2017rate/525.x264_r/525.x264_r	8473	2166	10639	20.36%
CINT2017rate/531.deepsjeng_r/531.deepsjeng_r	978	397	1375	28.87%
CINT2017rate/541.leela_r/541.leela_r	4005	418	4423	9.45%
CINT2017rate/557.xz_r/557.xz_r	1357	780	2137	36.50%

Limit 80

name	NumCaptured	NotCaptured	Sum	NotCaptured/Sum
CFP2006/433.milc/433.milc	1808	6641	8449	78.60%
CFP2006/444.namd/444.namd	20748	51	20799	0.25%
CFP2006/447.dealII/447.dealII	194239	21282	215521	9.87%
CFP2006/450.soplex/450.soplex	10639	3697	14336	25.79%
CFP2006/453.povray/453.povray	25994	11856	37850	31.32%
CFP2006/470.lbm/470.lbm	54	111	165	67.27%
CFP2006/482.sphinx3/482.sphinx3	2471	743	3214	23.12%
CFP2017rate/508.namd_r/508.namd_r	28618	75	28693	0.26%
CFP2017rate/510.parest_r/510.parest_r	690282	69014	759296	9.09%
CFP2017rate/511.povray_r/511.povray_r	25405	11872	37277	31.85%
CFP2017rate/519.lbm_r/519.lbm_r	54	111	165	67.27%
CFP2017rate/526.blender_r/526.blender_r	517120	186397	703517	26.50%
CFP2017rate/538.imagick_r/538.imagick_r	91325	8548	99873	8.56%
CFP2017rate/544.nab_r/544.nab_r	7581	1063	8644	12.30%
CINT2006/400.perlbench/400.perlbench	11295	2493	13788	18.08%
CINT2006/401.bzip2/401.bzip2	424	702	1126	62.34%
CINT2006/403.gcc/403.gcc	20794	21939	42733	51.34%
CINT2006/429.mcf/429.mcf	72	0	72	0.00%
CINT2006/445.gobmk/445.gobmk	9897	7006	16903	41.45%
CINT2006/456.hmmer/456.hmmer	6672	2318	8990	25.78%
CINT2006/458.sjeng/458.sjeng	1033	2234	3267	68.38%
CINT2006/462.libquantum/462.libquantum	400	243	643	37.79%
CINT2006/464.h264ref/464.h264ref	62153	32641	94794	34.43%
CINT2006/471.omnetpp/471.omnetpp	3977	173	4150	4.17%
CINT2006/473.astar/473.astar	1210	668	1878	35.57%
CINT2006/483.xalancbmk/483.xalancbmk	38691	5938	44629	13.31%
CINT2017rate/500.perlbench_r/500.perlbench_r	29096	13075	42171	31.00%
CINT2017rate/502.gcc_r/502.gcc_r	98244	74551	172795	43.14%
CINT2017rate/505.mcf_r/505.mcf_r	159	892	1051	84.87%
CINT2017rate/520.omnetpp_r/520.omnetpp_r	35174	9561	44735	21.37%
CINT2017rate/523.xalancbmk_r/523.xalancbmk_r	76246	5768	82014	7.03%
CINT2017rate/525.x264_r/525.x264_r	8441	2109	10550	19.99%
CINT2017rate/531.deepsjeng_r/531.deepsjeng_r	978	397	1375	28.87%
CINT2017rate/541.leela_r/541.leela_r	4024	417	4441	9.39%
CINT2017rate/557.xz_r/557.xz_r	1357	777	2134	36.41%

Limit 60

name	NumCaptured	NotCaptured	Sum	NotCaptured/Sum
CFP2006/433.milc/433.milc	1981	721	2702	26.68%
CFP2006/444.namd/444.namd	20748	51	20799	0.25%
CFP2006/447.dealII/447.dealII	194613	21104	215717	9.78%
CFP2006/450.soplex/450.soplex	10689	3576	14265	25.07%
CFP2006/453.povray/453.povray	26712	10590	37302	28.39%
CFP2006/470.lbm/470.lbm	54	111	165	67.27%
CFP2006/482.sphinx3/482.sphinx3	2472	743	3215	23.11%
CFP2017rate/508.namd_r/508.namd_r	28635	67	28702	0.23%
CFP2017rate/510.parest_r/510.parest_r	690663	68390	759053	9.01%
CFP2017rate/511.povray_r/511.povray_r	26198	10606	36804	28.82%
CFP2017rate/519.lbm_r/519.lbm_r	54	111	165	67.27%
CFP2017rate/526.blender_r/526.blender_r	520228	155902	676130	23.06%
CFP2017rate/538.imagick_r/538.imagick_r	91545	8462	100007	8.46%
CFP2017rate/544.nab_r/544.nab_r	7572	1072	8644	12.40%
CINT2006/400.perlbench/400.perlbench	11344	2491	13835	18.01%
CINT2006/401.bzip2/401.bzip2	428	698	1126	61.99%
CINT2006/403.gcc/403.gcc	20978	20914	41892	49.92%
CINT2006/429.mcf/429.mcf	72	0	72	0.00%
CINT2006/445.gobmk/445.gobmk	10367	6275	16642	37.71%
CINT2006/456.hmmer/456.hmmer	6881	784	7665	10.23%
CINT2006/458.sjeng/458.sjeng	1033	2322	3355	69.21%
CINT2006/462.libquantum/462.libquantum	448	195	643	30.33%
CINT2006/464.h264ref/464.h264ref	67363	21229	88592	23.96%
CINT2006/471.omnetpp/471.omnetpp	3989	171	4160	4.11%
CINT2006/473.astar/473.astar	1280	602	1882	31.99%
CINT2006/483.xalancbmk/483.xalancbmk	38589	5941	44530	13.34%
CINT2017rate/500.perlbench_r/500.perlbench_r	29667	10605	40272	26.33%
CINT2017rate/502.gcc_r/502.gcc_r	99548	70774	170322	41.55%
CINT2017rate/505.mcf_r/505.mcf_r	159	892	1051	84.87%
CINT2017rate/520.omnetpp_r/520.omnetpp_r	35207	9520	44727	21.28%
CINT2017rate/523.xalancbmk_r/523.xalancbmk_r	76187	5787	81974	7.06%
CINT2017rate/525.x264_r/525.x264_r	8636	2007	10643	18.86%
CINT2017rate/531.deepsjeng_r/531.deepsjeng_r	978	397	1375	28.87%
CINT2017rate/541.leela_r/541.leela_r	4005	418	4423	9.45%
CINT2017rate/557.xz_r/557.xz_r	1357	780	2137	36.50%

Good point. Below I added tables with NumCaptured, NumNotcaptured, the sum and NotCaptured/Sum. I've only included SPEC2006/SPEC2017, as those have the largest number of capture queries.

Thanks! I've put the data in a spreadsheet for easy comparison: https://docs.google.com/spreadsheets/d/13cwt-aWjAb_a1Ri9vqS4RgCslfDkEwSqYoYpEtuoeeI/edit?usp=sharing For Limit=60 we're doing slightly better than baseline on average (1.1x) with two outlier regressions. For Limit=80 we avoid one outlier and slightly improve the average (1.2x). With Limit=100 we avoid both outliers and again slightly improve the average (1.3x).

It might be worth exploring different limits for different clients, but I think another major client is GVN (via MemDepAnalysis) and I think exploring a sufficient number of uses there can be quite important (the regression I was investigating was due to missed load elimination by GVN)

Unfortunately MemDepAnalysis is exactly the area where cost increases are most problematic :( In part because cross-BB MDA uses completely ridiculous cutoffs -- we should probably go ahead and reduce those, and not wait for the magic bullet of NewGVN or MSSA. But that's neither here nor there.

In D126236#3546453, @nikic wrote:

Good point. Below I added tables with NumCaptured, NumNotcaptured, the sum and NotCaptured/Sum. I've only included SPEC2006/SPEC2017, as those have the largest number of capture queries.

Thanks! I've put the data in a spreadsheet for easy comparison: https://docs.google.com/spreadsheets/d/13cwt-aWjAb_a1Ri9vqS4RgCslfDkEwSqYoYpEtuoeeI/edit?usp=sharing For Limit=60 we're doing slightly better than baseline on average (1.1x) with two outlier regressions. For Limit=80 we avoid one outlier and slightly improve the average (1.2x). With Limit=100 we avoid both outliers and again slightly improve the average (1.3x).

I suppose the question is how far we want to go with respect to avoiding any outlier regressions :) I think either 60, 80 or 100 are clear improvements for opaque pointers.

When comparing the numbers on the compile-time-tracker for 60 and 100, the majority of the difference is down to tramp3d-v4: https://llvm-compile-time-tracker.com/compare.php?from=520563fdc146319aae90d06f88d87f2e9e1247b7&to=818719fad01d472412c963629671a81a8703b25b&stat=instructions

But I don't think this is mostly caused by time spent in capture-tracking but additional optimizations (binary size difference is about 1.2% for LTO). Looking at the stats, there's an increase in # of dead stores removed and also # of instructions removed by GVN.

Given that, I'd be slightly inclined to go for the larger limit out of the options.

It might be worth exploring different limits for different clients, but I think another major client is GVN (via MemDepAnalysis) and I think exploring a sufficient number of uses there can be quite important (the regression I was investigating was due to missed load elimination by GVN)

Unfortunately MemDepAnalysis is exactly the area where cost increases are most problematic :( In part because cross-BB MDA uses completely ridiculous cutoffs -- we should probably go ahead and reduce those, and not wait for the magic bullet of NewGVN or MSSA. But that's neither here nor there.

Yeah that's unfortunate. But I just noticed that MemDepAnalysis appears to use BatchAA for most queries already, so it should also cache some of the capture results, right?

In D126236#3553122, @fhahn wrote:

In D126236#3546453, @nikic wrote:

Good point. Below I added tables with NumCaptured, NumNotcaptured, the sum and NotCaptured/Sum. I've only included SPEC2006/SPEC2017, as those have the largest number of capture queries.

Thanks! I've put the data in a spreadsheet for easy comparison: https://docs.google.com/spreadsheets/d/13cwt-aWjAb_a1Ri9vqS4RgCslfDkEwSqYoYpEtuoeeI/edit?usp=sharing For Limit=60 we're doing slightly better than baseline on average (1.1x) with two outlier regressions. For Limit=80 we avoid one outlier and slightly improve the average (1.2x). With Limit=100 we avoid both outliers and again slightly improve the average (1.3x).

I suppose the question is how far we want to go with respect to avoiding any outlier regressions :) I think either 60, 80 or 100 are clear improvements for opaque pointers.

When comparing the numbers on the compile-time-tracker for 60 and 100, the majority of the difference is down to tramp3d-v4: https://llvm-compile-time-tracker.com/compare.php?from=520563fdc146319aae90d06f88d87f2e9e1247b7&to=818719fad01d472412c963629671a81a8703b25b&stat=instructions

But I don't think this is mostly caused by time spent in capture-tracking but additional optimizations (binary size difference is about 1.2% for LTO). Looking at the stats, there's an increase in # of dead stores removed and also # of instructions removed by GVN.

Given that, I'd be slightly inclined to go for the larger limit out of the options.

Yeah, I think I agree.

It might be worth exploring different limits for different clients, but I think another major client is GVN (via MemDepAnalysis) and I think exploring a sufficient number of uses there can be quite important (the regression I was investigating was due to missed load elimination by GVN)

Unfortunately MemDepAnalysis is exactly the area where cost increases are most problematic :( In part because cross-BB MDA uses completely ridiculous cutoffs -- we should probably go ahead and reduce those, and not wait for the magic bullet of NewGVN or MSSA. But that's neither here nor there.

Yeah that's unfortunate. But I just noticed that MemDepAnalysis appears to use BatchAA for most queries already, so it should also cache some of the capture results, right?

Yes, it does use BatchAA, but only within one query. Each query uses a separate BatchAA instance. So the impact of caching is rather limited.

This revision is now accepted and ready to land.Jun 2 2022, 6:09 AM

fhahn mentioned this in rG44c86e5cdc62: [GVN] Add test for capture tracking use limit..Jun 2 2022, 12:16 PM

fhahn marked an inline comment as done.Jun 2 2022, 1:21 PM

fhahn added inline comments.

llvm/lib/Analysis/CaptureTracking.cpp
451	Good point, I'll adjust that in the commit, thanks!

This revision was landed with ongoing or failed builds.Jun 2 2022, 1:44 PM

Closed by commit rG78c6b1488f30: [CaptureTracking] Increase limit and use it for all visited uses. (authored by fhahn). · Explain Why

This revision was automatically updated to reflect the committed changes.

fhahn marked an inline comment as done.

fhahn added a commit: rG78c6b1488f30: [CaptureTracking] Increase limit and use it for all visited uses..

Revision Contents

Path

Size

llvm/

lib/

Analysis/

CaptureTracking.cpp

9 lines

test/

Transforms/

GVN/

capture-tracking-limit.ll

2 lines

Diff 433871

llvm/lib/Analysis/CaptureTracking.cpp

Show All 39 Lines

/// The default value for MaxUsesToExplore argument. It's relatively small to		/// The default value for MaxUsesToExplore argument. It's relatively small to
/// keep the cost of analysis reasonable for clients like BasicAliasAnalysis,		/// keep the cost of analysis reasonable for clients like BasicAliasAnalysis,
/// where the results can't be cached.		/// where the results can't be cached.
/// TODO: we should probably introduce a caching CaptureTracking analysis and		/// TODO: we should probably introduce a caching CaptureTracking analysis and
/// use it where possible. The caching version can use much higher limit or		/// use it where possible. The caching version can use much higher limit or
/// don't have this cap at all.		/// don't have this cap at all.
static cl::opt<unsigned>		static cl::opt<unsigned>
DefaultMaxUsesToExplore("capture-tracking-max-uses-to-explore", cl::Hidden,		DefaultMaxUsesToExplore("capture-tracking-max-uses-to-explore", cl::Hidden,
cl::desc("Maximal number of uses to explore."),		cl::desc("Maximal number of uses to explore."),
cl::init(20));		cl::init(100));

unsigned llvm::getDefaultMaxUsesToExploreForCaptureTracking() {		unsigned llvm::getDefaultMaxUsesToExploreForCaptureTracking() {
return DefaultMaxUsesToExplore;		return DefaultMaxUsesToExplore;
}		}

CaptureTracker::~CaptureTracker() = default;		CaptureTracker::~CaptureTracker() = default;

bool CaptureTracker::shouldExplore(const Use *U) { return true; }		bool CaptureTracker::shouldExplore(const Use *U) { return true; }
▲ Show 20 Lines • Show All 381 Lines • ▼ Show 20 Lines	void llvm::PointerMayBeCaptured(const Value V, CaptureTracker Tracker,
if (MaxUsesToExplore == 0)		if (MaxUsesToExplore == 0)
MaxUsesToExplore = DefaultMaxUsesToExplore;		MaxUsesToExplore = DefaultMaxUsesToExplore;

SmallVector<const Use *, 20> Worklist;		SmallVector<const Use *, 20> Worklist;
Worklist.reserve(getDefaultMaxUsesToExploreForCaptureTracking());		Worklist.reserve(getDefaultMaxUsesToExploreForCaptureTracking());
SmallSet<const Use *, 20> Visited;		SmallSet<const Use *, 20> Visited;

auto AddUses = [&](const Value *V) {		auto AddUses = [&](const Value *V) {
unsigned Count = 0;
for (const Use &U : V->uses()) {		for (const Use &U : V->uses()) {
// If there are lots of uses, conservatively say that the value		// If there are lots of uses, conservatively say that the value
// is captured to avoid taking too much compile time.		// is captured to avoid taking too much compile time.
if (Count++ >= MaxUsesToExplore) {		if (Visited.size() >= MaxUsesToExplore) {
		aeubanksUnsubmitted Not Done Reply Inline Actions does moving this after `Visited.insert()` below help at all? I'd think that going through the worklist below is the expensive part, not constructing the worklist aeubanks: does moving this after `Visited.insert()` below help at all? I'd think that going through the…
		reamesUnsubmitted Done Reply Inline Actions It also feels odd to count uses we chose not to explore. Maybe we should simply count the number of items we pull off the worklist, not the addition at all? reames: It also feels odd to count uses we chose not to explore. Maybe we should simply count the…
		fhahnAuthorUnsubmitted Done Reply Inline Actions Good point, I'll adjust that in the commit, thanks! fhahn: Good point, I'll adjust that in the commit, thanks!
Tracker->tooManyUses();		Tracker->tooManyUses();
return false;		return false;
}		}
if (!Visited.insert(&U).second)		if (!Visited.insert(&U).second)
continue;		continue;
if (!Tracker->shouldExplore(&U))		if (!Tracker->shouldExplore(&U))
continue;		continue;
Worklist.push_back(&U);		Worklist.push_back(&U);
▲ Show 20 Lines • Show All 54 Lines • Show Last 20 Lines

llvm/test/Transforms/GVN/capture-tracking-limit.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -aa-pipeline=basic-aa -passes="gvn" -S %s \| FileCheck --check-prefixes=CHECK,LIMIT-TOO-SMALL %s			; RUN: opt -aa-pipeline=basic-aa -passes="gvn" -S %s \| FileCheck --check-prefixes=CHECK,LIMIT %s
	; RUN: opt -aa-pipeline=basic-aa -passes="gvn" -S -capture-tracking-max-uses-to-explore=20 %s \| FileCheck --check-prefixes=CHECK,LIMIT-TOO-SMALL %s			; RUN: opt -aa-pipeline=basic-aa -passes="gvn" -S -capture-tracking-max-uses-to-explore=20 %s \| FileCheck --check-prefixes=CHECK,LIMIT-TOO-SMALL %s
	; RUN: opt -aa-pipeline=basic-aa -passes="gvn" -S -capture-tracking-max-uses-to-explore=21 %s \| FileCheck --check-prefixes=CHECK,LIMIT %s			; RUN: opt -aa-pipeline=basic-aa -passes="gvn" -S -capture-tracking-max-uses-to-explore=21 %s \| FileCheck --check-prefixes=CHECK,LIMIT %s

	define i32 @test1(i32* %p, i1 %C) {			define i32 @test1(i32* %p, i1 %C) {
	; CHECK-LABEL: @test1(			; CHECK-LABEL: @test1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[A:%.*]] = alloca i32, align 4			; CHECK-NEXT: [[A:%.*]] = alloca i32, align 4
	; CHECK-NEXT: call void @dont_capture(i32* [[A]])			; CHECK-NEXT: call void @dont_capture(i32* [[A]])
	▲ Show 20 Lines • Show All 54 Lines • Show Last 20 Lines