CapturesBefore tracker has an overly restrictive dominates check when
the BeforeHere and the capture point are in different basic blocks.
All we need to check is that there is no path from the capture point
to BeforeHere (which is less stricter than the dominates check).
See added testcase in one of the users of CapturesBefore.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
LGTM. Attributor does the same with an "optimistic" version of reachability via AAReachability. I guess that is why no test there changed.
Thanks Johannes. We see these limitations in clients that use this analysis downstream and I was surprised we didn't have a TODO testcase upstream either (i.e. no test cases failing).
This change causes a large compile-time regression on some benchmarks: https://llvm-compile-time-tracker.com/compare.php?from=79d16764dd29aeddb7e6400e6b2d89d31653886c&to=15694fd6ad955c6a16b446a6324364111a49ae8b&stat=instructions
tramp3d-v4 is particularly badly affected, with a 25% instruction count regression in the ThinLTO configuration.
Thanks for pointing that out. I was under the impression the path check is more expensive than the dominates check (and the dominates was done to prevent compile time increase) and our bailouts are also higher (32 basic blocks for the path check). 
It looks like this limitation of checking for dominates exists to avoid compile time regressions of this sort.
Right. It may be possible to salvage this though. One very simple thing to try would be to cache the reachability query. This should make it much cheaper if there are many uses in the same block.
One thing I thought about: Do we have a test where the "nocapture use" has a second operand that captures?
I've applied a couple of fixes to CaptureTracking, the main ones are: https://github.com/llvm/llvm-project/commit/d35366bccae0016418660337ce94e3d7d0ff391e is a potential correctness fix, https://github.com/llvm/llvm-project/commit/f63ab188c63be12871da75bfc5801a7fc752769b is a compile-time improvement.
I also gave caching the reachability query a try: https://llvm-compile-time-tracker.com/compare.php?from=5225c102649b5469d61385e598b744ac8f3dd1da&to=6d1d73dec948a0cbbf0dd6261d5664c4bfce92b1&stat=instructions The result is better, but we still have a 6% regression on tramp3d-v4 with ThinLTO (down from 25%).
One remaining problem in the current CaptureTracking implementation is that the use limit is not fully enforced: It's currently a limit on the the maximum number of direct uses, not transitive uses. If the value has 20 GEP uses, each of which has 20 uses itself, we'll happily explore all 400 of them.
Hi @nikic, some comments below.
Ah, I think your correctness fix is what Johannes was implying by the "multiple args captured" in a single call.
I also gave caching the reachability query a try: https://llvm-compile-time-tracker.com/compare.php?from=5225c102649b5469d61385e598b744ac8f3dd1da&to=6d1d73dec948a0cbbf0dd6261d5664c4bfce92b1&stat=instructions The result is better, but we still have a 6% regression on tramp3d-v4 with ThinLTO (down from 25%).
Wow. this is great. Thank you for working on this! I'm curious if you have already landed the patch or have it for review? Perhaps further improvement will help reduce the compile time regression here.
One remaining problem in the current CaptureTracking implementation is that the use limit is not fully enforced: It's currently a limit on the the maximum number of direct uses, not transitive uses. If the value has 20 GEP uses, each of which has 20 uses itself, we'll happily explore all 400 of them.
Yeah, you're completely right. We should add a TODO/FIXME at the point we're capturing the direct uses for the worklist (there's a chance this triggers runtime degradations by fixing the "transitive uses" bug). 
We also have a comment stating that this max uses restriction can be completely lifted if we cache the capture tracking analysis. The potential downside is that keeping track/adding  incremental updates can be a source of bugs.