This is an archive of the discontinued LLVM Phabricator instance.

[CaptureTracking] Avoid long compilation time on large basic blocks
ClosedPublic

Authored by bruno on Jan 15 2015, 3:06 PM.

Details

Summary

Problem

For large basic blocks, CaptureTracking may become very expensive: for each memory-related instruction considered, it walks the BB top-down to check ordering and dominance between two instructions. In my testcase there are 81782 instructions in a single BB, which leads to a compilation time of ~12min with 'opt -O1'. This triggers in the flow: DeadStoreElimination -> MemDepAnalysis -> CaptureTracking.

Solution

To fix the compile time bloat, the patch changes 'shouldExplore' to do two things:

  1. Add a special case to compute the ordering between instructions when both are in the same basic block. It avoids the use of two expensive functions in this scenario: 'dominates' and 'isPotentiallyReachable'.
  2. Limit the search to a threshold of 1000 instructions. This limit showed no measurable regressions on the test-suite.

This led to a reduction in 'opt -O1' compile time from ~12min to 1s.
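For illustration, the same-BB fast path with the limit could look roughly like the sketch below. This is not the attached diff; the function name, the OrderResult enum, and the overall shape are made up here, and only the idea (a single bounded top-down walk that gives up past the threshold) reflects the patch.

#include <cassert>

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/ErrorHandling.h"

using namespace llvm;

enum class OrderResult { Before, After, Unknown };

// Walk the block top-down, visiting at most Limit instructions, and report
// which of the two (distinct) instructions appears first.  Hitting the limit
// returns Unknown so the caller can fall back to the conservative answer.
static OrderResult orderInSameBB(const Instruction *BeforeHere,
                                 const Instruction *I, unsigned Limit = 1000) {
  const BasicBlock *BB = BeforeHere->getParent();
  assert(BB == I->getParent() && BeforeHere != I &&
         "expected two distinct instructions in the same block");

  unsigned Scanned = 0;
  for (const Instruction &Inst : *BB) {
    if (++Scanned > Limit)
      return OrderResult::Unknown;   // Block too large: give up, stay safe.
    if (&Inst == BeforeHere)
      return OrderResult::Before;    // BeforeHere encountered first.
    if (&Inst == I)
      return OrderResult::After;     // I encountered first.
  }
  llvm_unreachable("both instructions should live in BB");
}

Presumably a caller treats Unknown the same way as the conservative "may be captured" answer, so the threshold only costs precision on oversized blocks.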

Diff Detail

Event Timeline

bruno updated this revision to Diff 18265.Jan 15 2015, 3:06 PM
bruno retitled this revision from to [CaptureTracking] Avoid long compilation time on large basic blocks.
bruno updated this object.
bruno edited the test plan for this revision. (Show Details)
bruno added reviewers: nicholas, hfinkel.
bruno set the repository for this revision to rL LLVM.
bruno added a subscriber: Unknown Object (MLST).
reames added a subscriber: reames.Jan 15 2015, 5:39 PM

Drive by comment - this might be completely wrong.

I wouldn't expect that dominates or isPotentiallyReachable would be expensive when querying two instructions in the same BB. At worst, it's a linear scan of the BB. Even at 81782 instructions, that's pretty fast. How many times is this getting called?

It would seem reasonable to push the search limit down into isPotentiallyReachable. Not dominates, unfortunately; that has to be an exact answer.

However, do you actually need an instruction-level dominance answer when the instructions are in the same BB? Wouldn't isPotentiallyReachable take care of that?

You might be able to push your limit down into isPotentiallyReachable and just not call dominates when the def and use are in the same BB.

You might also consider looking at why those loops are slow. Are there micro-optimizations (e.g. loop structure, iterator use, etc.) which would help a lot? For example, you could consider advancing both iterators on every iteration to help with cases where one is near the end of the BB. (Not sure this is actually a good idea!)
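One possible reading of the "advance both iterators" idea, sketched here purely for illustration (the helper name and exact shape are not from the review): walk forward from both instructions in lockstep, so the scan stops after the smaller of the two forward walks instead of depending on where the earlier instruction sits in the block.

#include <cassert>

#include "llvm/IR/Instruction.h"

using namespace llvm;

// Returns true if A appears before B in their (shared) basic block, walking
// forward from both instructions at the same time.
static bool comesBeforeLockstep(const Instruction *A, const Instruction *B) {
  assert(A->getParent() == B->getParent() && A != B &&
         "expected two distinct instructions in the same block");
  const Instruction *CurA = A, *CurB = B;
  while (true) {
    if (!CurA)
      return false;          // A's walk fell off the end first: B is earlier.
    if (CurA == B)
      return true;           // Walking forward from A reached B: A is earlier.
    CurA = CurA->getNextNode();

    if (!CurB)
      return true;           // B's walk fell off the end first: A is earlier.
    if (CurB == A)
      return false;          // Walking forward from B reached A: B is earlier.
    CurB = CurB->getNextNode();
  }
}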

Taking a step back, you could consider grouping uses by basic block, and then trying to share work across uses in the same block.

Having said all of that, I don't actually have any *objection* to the current approach. :)

bruno added a comment.Jan 16 2015, 9:42 AM

Hi Philip,

Drive by comment - this might be completely wrong.

Thanks for the comments :-)

I wouldn't expect that dominates or isPotentiallyReachable would be expensive when querying two instructions in the same BB. At worst, it's a linear scan of the BB. Even at 81782 instructions, that's pretty fast. How many times is this getting called?

I don't have exact numbers, but many calls pile up, and if you tweak the 'Limit' value in the patch you can see how fast it grows. I can clarify that in the comments, but the problem is not that dominates or isPotentiallyReachable are expensive per se; it's that the accumulated number of calls to them greatly increases the compile time when each call has to walk a large number of instructions.

It would seem reasonable to push the search limit down into isPotentiallyReachable. Not dominates, unfortunately; that has to be an exact answer.

Because 'dominates' needs an exact answer, the patch returns TooExpensive instead of 'true' or 'false' in that case.

However, do you actually need an instruction-level dominance answer when the instructions are in the same BB? Wouldn't isPotentiallyReachable take care of that?

You might be able to push your limit down into isPotentiallyReachable and just not call dominates when the def and use are in the same BB.

Note that what I'm actually trying to achieve here is a fast path for:

if (BeforeHere != I && DT->dominates(BeforeHere, I) &&
    !isPotentiallyReachable(I, BeforeHere, DT))

Hence, if 'BeforeHere' dominates 'I', we don't actually need to do any BB walk like the one done in 'isPotentiallyReachable', and we can prune the search early when we're in the entry block or the block has no successors. Otherwise we must continue the search, because the BB may be in a loop, etc.
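To make that pruning condition concrete, here is a minimal sketch of the check being described (the helper name is made up; only the reasoning is taken from the comment above): once 'BeforeHere' is known to precede 'I' inside the block, 'I' can only get back to 'BeforeHere' through an edge that re-enters the block, so a block with no successors, or the entry block (which may not have predecessors), ends the search immediately.

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// True if, once execution has moved past 'BeforeHere' in its block, control
// can never return to it: either the block never exits to a successor, or it
// is the entry block, which cannot be re-entered (it has no predecessors).
static bool canPruneSameBlockSearch(const Instruction *BeforeHere) {
  const BasicBlock *BB = BeforeHere->getParent();
  return succ_empty(BB) || BB == &BB->getParent()->getEntryBlock();
}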

You might also consider looking at why those loops are slow. Are there micro-optimizations (e.g. loop structure, iterator use, etc.) which would help a lot? For example, you could consider advancing both iterators on every iteration to help with cases where one is near the end of the BB. (Not sure this is actually a good idea!)

Taking a step back, you could consider grouping uses by basic block, and then trying to share work across uses in the same block.

This looks like a really good idea, but it would also require more intrusive changes to the current implementation, with no guarantee that it would actually yield benefits.

Having said all of that, I don't actually have any *objection* to the current approach. :)

Thanks again! :D

We hit a similar issue internally.

I reported it in pr22348, and your patch does speed it up.

BTW, have you looked at any options for speeding up the dominance check for two instructions in the same BB? CCing Daniel in case he knows a nice algorithm :-)

hfinkel edited edge metadata.Jan 27 2015, 1:47 PM

I wonder if there might be a better way to solve this problem. For example, what if, for large blocks, we scan them only once, number the instructions, and then use that ordering to answer the intra-block dominance queries instead of scanning over and over again?
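As a rough illustration of that suggestion (the class and its names are invented here, not part of any posted patch): number the instructions in one pass and answer every subsequent intra-block ordering query with two map lookups.

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

// Number every instruction of a block once; later ordering queries become
// O(1) map lookups instead of repeated linear scans.
class BlockNumbering {
  DenseMap<const Instruction *, unsigned> Positions;

public:
  explicit BlockNumbering(const BasicBlock &BB) {
    unsigned N = 0;
    for (const Instruction &I : BB)
      Positions[&I] = N++;          // Single linear scan of the block.
  }

  // True if A appears strictly before B in the block.
  bool comesBefore(const Instruction *A, const Instruction *B) const {
    return Positions.lookup(A) < Positions.lookup(B);
  }
};

The open question, discussed below, is where such a cache should live and who invalidates it when the block changes.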

lib/Analysis/CaptureTracking.cpp
74

I'd much rather this be a command-line opt (1000 as a default is likely fine) than a static constant.
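For reference, a hidden cl::opt along these lines would look roughly like the following; the option name shown here is only a placeholder, not necessarily what the patch uses.

#include "llvm/Support/CommandLine.h"

static llvm::cl::opt<unsigned> CaptureSearchThreshold(
    "capture-search-threshold", llvm::cl::Hidden, llvm::cl::init(1000),
    llvm::cl::desc("Maximum number of instructions to scan per basic block "
                   "in CaptureTracking"));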

85

"Early compute" sounds odd. Just say "Compute"

bruno updated this revision to Diff 20394.Feb 20 2015, 6:39 AM
bruno updated this object.
bruno edited edge metadata.
bruno removed rL LLVM as the repository for this revision.
bruno added a subscriber: bob.wilson.

The bottleneck here is 'shouldExplore', which is called by llvm::PointerMayBeCaptured. This function
is used by FunctionAttrs, DeadStoreElimination, BasicAliasAnalysis and llvm::PointerMayBeCapturedBefore;
this last one is also called by AliasAnalysis and InlineFunction.

The simple and direct solution where we number the instructions isn't easy to fit into the way capture tracking is currently used in the codebase (and please correct me if I'm wrong):

(1) capture tracking is used through plain function calls, so there is no easy place to store/update a cache of the numbered instructions unless we do that in each CaptureTracker user mentioned above.
(2) an alternative to (1) is to turn it into a pass that internally contains a cache, in which case we would need an interface for invalidation whenever a BB changes.
(3) it isn't clear to me whether numbering a BB each time it gets invalidated by one of the CaptureTracker users will totally remove the compile-time issue for large basic blocks. Although I'm inclined to believe it would, I would like to get numbers first :-)

I would be happy to try out this approach to see where it goes, but given the steps it requires, I unfortunately don't have time to do that right now. The best I can do for now is to use this patch as a short-term solution. I believe we can still track a follow-up enhancement under PR22348.

Uploaded a new version of the patch after Hal's comments.

The bottleneck here is 'shouldExplore', which is called by llvm::PointerMayBeCaptured. This function is used by FunctionAttrs, DeadStoreElimination, BasicAliasAnalysis and llvm::PointerMayBeCapturedBefore; this last one is also called by AliasAnalysis and InlineFunction.

I don't disagree that it might be difficult to re-use the cache between calls to PointerMayBeCaptured (especially as the IR is being mutated -- we'd need to think about how this was laid out and what needed to actively invalidate it), but I thought the expensive part was really the repeated BB scans *within* PointerMayBeCaptured. That could be solved by having a local cache, right?

bruno added a comment.May 5 2015, 11:57 AM

Hi,

Sorry for the delay, folks. Resuming this thread with more information and patches...

I don't disagree that it might be difficult to re-use the cache between calls to PointerMayBeCaptured (especially as the IR is being mutated -- we'd need to think about how this was laid out and what needed to actively invalidate it), but I thought the expensive part was really the repeated BB scans *within* PointerMayBeCaptured. That could be solved by having a local cache, right?

Currently, PointerMayBeCaptured scans the BB a number of times equal to the number of uses of 'BeforeHere', which is capped at 20 before bailing out with Tracker->tooManyUses(). The main issue here is the number of calls *to* PointerMayBeCaptured multiplied by the cost of each basic block scan. In the testcase with 82k instructions, PointerMayBeCaptured is called 130k times, leading to 527k runs of 'shouldExplore'; this currently takes ~12min.

I tried the approach where I locally (within PointerMayBeCaptured) number the instructions in the basic block, using a DenseMap to cache instruction positions/numbers. I experimented with two approaches:

(1) Build the local cache once at the beginning and consult it to look up instruction positions => Takes ~4min.
(2) Build the cache incrementally every time we need to scan an unexplored part of the BB => Takes ~2min.

I've attached the implementation for (2) in case there's any interest.
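A minimal sketch of what approach (2) amounts to, written here only to make the idea concrete (the class and its names are illustrative; this is not the attached implementation): extend the position map lazily, so across all queries each instruction in the block is numbered at most once.

#include "llvm/ADT/DenseMap.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
#include "llvm/Support/ErrorHandling.h"

using namespace llvm;

// Incrementally numbers a basic block: a query only walks the part of the
// block that no earlier query has visited.
class LazyBlockNumbering {
  DenseMap<const Instruction *, unsigned> Positions;
  const Instruction *NextToNumber;   // First instruction not yet numbered.
  unsigned NextIndex = 0;

public:
  explicit LazyBlockNumbering(const BasicBlock &BB)
      : NextToNumber(BB.empty() ? nullptr : &BB.front()) {}

  // Return I's position, extending the numbering only as far as needed.
  unsigned positionOf(const Instruction *I) {
    auto It = Positions.find(I);
    if (It != Positions.end())
      return It->second;             // Numbered by a previous query.
    while (NextToNumber) {
      Positions[NextToNumber] = NextIndex++;
      const Instruction *Cur = NextToNumber;
      NextToNumber = NextToNumber->getNextNode();
      if (Cur == I)
        return NextIndex - 1;        // Stop as soon as we reach I.
    }
    llvm_unreachable("instruction is not in the numbered block");
  }

  bool comesBefore(const Instruction *A, const Instruction *B) {
    return positionOf(A) < positionOf(B);
  }
};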

Considering the reduction from 12min to ~2min this is a good gain, but it still looks to me like a long time to have a user waiting for the compiler to finish, especially under -O1/-O2. I still believe the limited-scan approach is the better fix.

Let me know what you think.

Cheers,

bruno updated this revision to Diff 27593.Jun 12 2015, 1:46 PM
bruno set the repository for this revision to rL LLVM.

Hi Hal,

Updated the patch with your suggestions plus the local DenseMap cache. It reduces the compile time from 12min to 2min. FTR, this approach still checks the entire BB and doesn't change capture tracker precision. Although I still believe we should be able to reduce the compile time further under -O1/-O2 at the cost of some precision, that discussion can be postponed to a forthcoming patch.

Thanks,

hfinkel accepted this revision.Jun 22 2015, 3:01 PM
hfinkel edited edge metadata.

LGTM, thanks for all the work you've put into this!

This revision is now accepted and ready to land.Jun 22 2015, 3:01 PM
This revision was automatically updated to reflect the committed changes.