This is an archive of the discontinued LLVM Phabricator instance.

[MemoryDependenceAnalysis] Fix compile time slowdown
ClosedPublic

Authored by bruno on Sep 29 2014, 3:15 PM.

Details

Summary
  • Problem

One program takes ~3min to compile under -O2. This happens after a certain
function A is inlined ~700 times in a function B, inserting thousands of new
BBs. This leads to 80% of the compilation time spent in
GVN::processNonLocalLoad and
MemoryDependenceAnalysis::getNonLocalPointerDependency, while searching for
nonlocal information for basic blocks.

Usually, to avoid spending a long time to process nonlocal loads, GVN bails out
if it gets more than 100 deps as a result from
MD->getNonLocalPointerDependency. However this only happens *after* all
nonlocal information for BBs have been computed, which is the bottleneck in
this scenario. For instance, there are 8280 times where
getNonLocalPointerDependency returns deps with more than 100 bbs and from
those, 600 times it returns more than 1000 blocks.

  • Solution

Bail out early during the nonlocal info computation whenever we reach a
specified threshold. This patch proposes a 100 BBs threshold, it also
reduces the compile time from 3min to 23s.

  • Testing

The test-suite presented no compile nor execution time regressions.

Some numbers from my machine (x86_64 darwin):

  • 17s under -Oz (which avoids inlining).
  • 1.3s under -O1.
  • 2m51s under -O2 ToT
    • 23s under -O2 w/ Result.size() > 100
  • 1m54s under -O2 w/ Result.size() > 500

As expected, w/ Result.size() > 100, GVN still yields the same outcome as in the 3min
version:
2898 gvn - Number of loads PRE'd
1710 gvn - Number of loads deleted

Diff Detail

Event Timeline

bruno updated this revision to Diff 14190.Sep 29 2014, 3:15 PM
bruno retitled this revision from to [MemoryDependenceAnalysis] Fix compile time slowdown.
bruno updated this object.
bruno edited the test plan for this revision. (Show Details)
bruno added reviewers: morisset, hfinkel, void.
bruno set the repository for this revision to rL LLVM.
bruno added subscribers: Unknown Object (MLST), bob.wilson.
morisset edited edge metadata.Sep 29 2014, 5:00 PM

I don't think I know this code well enough to give the LGTM, but the general approach looks good.
Also, thank you for the detailed testing/explanation.

I just have a few questions/comments:

  • You say that there were no regressions in the test-suite. Was there any compile-time improvement ?
  • From my understanding of your patch, it makes the limit in GVN redundant. If so, could the limit in GVN be removed ? It looks especially brittle to have a magic number in GVN, that just happens to be the same as a named limit in MemoryDependencyAnalysis and does exactly the same job.
  • Have you tried different values for the limit (not super important, it was maybe already tuned before adding to GVN, but I could not find quickly its history in the log) ?

nitpick: it is easier to review changes in Phabricator if all the context is included. This is normally done automatically when using arcanist, but can also be had with git/svn:
git diff -U999999 other-branch
svn diff --diff-cmd=diff -x -U999999
(from http://llvm.org/docs/Phabricator.html)

FWIW: THis looks reasonable to me.
If you can give me the program, i'd love it.

I have been working on some patches to dramatically speed up memdep for
gvn's use case:
Two possible approaches

  1. by using quadtrees to store and lookup the nearest affecting

loads/stores for a given load/store

  1. by using quadtrees to store and lookup the nearest loads/stores to a

given bb. A lot of memdep time is spent doing the walks, not the dependence
checks.

Quadtrees are used because it can do "nearest" lookups *very* quickly, and
what we want is the nearest loads/stores using the DFS entry/exit numbers
of each statement as ordered in the dominator tree as coordinates.

bruno updated this revision to Diff 14243.Sep 30 2014, 12:45 PM
bruno edited edge metadata.

Hi Robin and Daniel,

Thanks for the review.

Regarding Robin's questions:

  • ~1% compile time speed up on sqlite3
  • It looks like it's redundant indeed; However, getNonLocalPointerDepFromBB can bail out inside recursive calls and the final result maybe be false (which is success in this case) instead of true; i.e. we currently have no way, besides the limit, to tell GVN that we're aborting because the search is expensive. I could change the way it works to be more clear about this, but I'm not sure it's worth the effort and I tried to be less intrusive as possible.
  • Because of GVN, I assumed 100 as the low threshold and only tried from 100-1000, I don't have all the numbers anymore, but Limit=500 yields 1m54s and Limit=1000 goes near ~2m50.

Also, regenerated the patch with more context.

Daniel, that's awesome. Thanks for sharing your plans, looking forward for this
speedup. Let me know how much you can get for this testcase, I'm curious :-)

The source in question comes from http://mednafen.sourceforge.net, to
reproduce:

$ cd mednafen
$ clang -flto -O2 -c src/hw_cpu/huc6280/huc6280.cpp -I./include -DHAVE_MKDIR -DLSB_FIRST -DSIZEOF_DOUBLE=8 -o huc.bc

hfinkel accepted this revision.Sep 30 2014, 3:43 PM
hfinkel edited edge metadata.

LGTM too.

lib/Analysis/MemoryDependenceAnalysis.cpp
1139

I'd replace "one hundred" in the comment, and just say "If we do process a large number of blocks..." (the comment could become confusing if the limit is ever adjusted).

This revision is now accepted and ready to land.Sep 30 2014, 3:43 PM
bruno closed this revision.Oct 1 2014, 1:18 PM

Thanks, committed in r218792