This is an archive of the discontinued LLVM Phabricator instance.

[MemoryDependenceAnalysis] Fix compile time slowdown
ClosedPublic

Authored by bruno on Sep 29 2014, 3:15 PM.

Download Raw Diff

Details

Reviewers

hfinkel
morisset
void

Summary

Problem

One program takes ~3min to compile under -O2. This happens after a certain
function A is inlined ~700 times in a function B, inserting thousands of new
BBs. This leads to 80% of the compilation time spent in
GVN::processNonLocalLoad and
MemoryDependenceAnalysis::getNonLocalPointerDependency, while searching for
nonlocal information for basic blocks.

Usually, to avoid spending a long time to process nonlocal loads, GVN bails out
if it gets more than 100 deps as a result from
MD->getNonLocalPointerDependency. However this only happens *after* all
nonlocal information for BBs have been computed, which is the bottleneck in
this scenario. For instance, there are 8280 times where
getNonLocalPointerDependency returns deps with more than 100 bbs and from
those, 600 times it returns more than 1000 blocks.

Solution

Bail out early during the nonlocal info computation whenever we reach a
specified threshold. This patch proposes a 100 BBs threshold, it also
reduces the compile time from 3min to 23s.

Testing

The test-suite presented no compile nor execution time regressions.

Some numbers from my machine (x86_64 darwin):

17s under -Oz (which avoids inlining).
1.3s under -O1.
2m51s under -O2 ToT
- 23s under -O2 w/ Result.size() > 100
1m54s under -O2 w/ Result.size() > 500

As expected, w/ Result.size() > 100, GVN still yields the same outcome as in the 3min
version:
2898 gvn - Number of loads PRE'd
1710 gvn - Number of loads deleted

Diff Detail

Event Timeline

bruno updated this revision to Diff 14190.Sep 29 2014, 3:15 PM

bruno retitled this revision from to [MemoryDependenceAnalysis] Fix compile time slowdown.

bruno updated this object.

bruno edited the test plan for this revision. (Show Details)

bruno added reviewers: morisset, hfinkel, void.

bruno set the repository for this revision to rL LLVM.

bruno added subscribers: Unknown Object (MLST), bob.wilson.

I don't think I know this code well enough to give the LGTM, but the general approach looks good.
Also, thank you for the detailed testing/explanation.

I just have a few questions/comments:

You say that there were no regressions in the test-suite. Was there any compile-time improvement ?
From my understanding of your patch, it makes the limit in GVN redundant. If so, could the limit in GVN be removed ? It looks especially brittle to have a magic number in GVN, that just happens to be the same as a named limit in MemoryDependencyAnalysis and does exactly the same job.
Have you tried different values for the limit (not super important, it was maybe already tuned before adding to GVN, but I could not find quickly its history in the log) ?

nitpick: it is easier to review changes in Phabricator if all the context is included. This is normally done automatically when using arcanist, but can also be had with git/svn:
git diff -U999999 other-branch
svn diff --diff-cmd=diff -x -U999999
(from http://llvm.org/docs/Phabricator.html)

FWIW: THis looks reasonable to me.
If you can give me the program, i'd love it.

I have been working on some patches to dramatically speed up memdep for
gvn's use case:
Two possible approaches

by using quadtrees to store and lookup the nearest affecting

loads/stores for a given load/store

by using quadtrees to store and lookup the nearest loads/stores to a

given bb. A lot of memdep time is spent doing the walks, not the dependence
checks.

Quadtrees are used because it can do "nearest" lookups *very* quickly, and
what we want is the nearest loads/stores using the DFS entry/exit numbers
of each statement as ordered in the dominator tree as coordinates.

Hi Robin and Daniel,

Thanks for the review.

Regarding Robin's questions:

~1% compile time speed up on sqlite3
It looks like it's redundant indeed; However, getNonLocalPointerDepFromBB can bail out inside recursive calls and the final result maybe be false (which is success in this case) instead of true; i.e. we currently have no way, besides the limit, to tell GVN that we're aborting because the search is expensive. I could change the way it works to be more clear about this, but I'm not sure it's worth the effort and I tried to be less intrusive as possible.
Because of GVN, I assumed 100 as the low threshold and only tried from 100-1000, I don't have all the numbers anymore, but Limit=500 yields 1m54s and Limit=1000 goes near ~2m50.

Also, regenerated the patch with more context.

Daniel, that's awesome. Thanks for sharing your plans, looking forward for this
speedup. Let me know how much you can get for this testcase, I'm curious :-)

The source in question comes from http://mednafen.sourceforge.net, to
reproduce:

$ cd mednafen
$ clang -flto -O2 -c src/hw_cpu/huc6280/huc6280.cpp -I./include -DHAVE_MKDIR -DLSB_FIRST -DSIZEOF_DOUBLE=8 -o huc.bc

LGTM too.

lib/Analysis/MemoryDependenceAnalysis.cpp
1139	I'd replace "one hundred" in the comment, and just say "If we do process a large number of blocks..." (the comment could become confusing if the limit is ever adjusted).

This revision is now accepted and ready to land.Sep 30 2014, 3:43 PM

Thanks, committed in r218792

Revision Contents

Path

Size

lib/

Analysis/

MemoryDependenceAnalysis.cpp

22 lines

Diff 14190

lib/Analysis/MemoryDependenceAnalysis.cpp

Context not available.
	// Limit for the number of instructions to scan in a block.	// Limit for the number of instructions to scan in a block.
	static const int BlockScanLimit = 100;	static const int BlockScanLimit = 100;

		// Limit on the number of memdep results to process.
		static const int NumResultsLimit = 100;

	char MemoryDependenceAnalysis::ID = 0;	char MemoryDependenceAnalysis::ID = 0;

	// Register this pass...	// Register this pass...
Context not available.
	while (!Worklist.empty()) {	while (!Worklist.empty()) {
	BasicBlock *BB = Worklist.pop_back_val();	BasicBlock *BB = Worklist.pop_back_val();

		// If we do process more than one hundred blocks it becomes very expensive
		hfinkelUnsubmitted Not Done Reply Inline Actions I'd replace "one hundred" in the comment, and just say "If we do process a large number of blocks..." (the comment could become confusing if the limit is ever adjusted). hfinkel: I'd replace "one hundred" in the comment, and just say "If we do process a large number of…
		// and likely it isn't worth worrying about
		if (Result.size() > NumResultsLimit) {
		Worklist.clear();
		// Sort it now (if needed) so that recursive invocations of
		// getNonLocalPointerDepFromBB and other routines that could reuse the
		// cache value will only see properly sorted cache arrays.
		if (Cache && NumSortedEntries != Cache->size()) {
		SortNonLocalDepInfoCache(*Cache, NumSortedEntries);
		NumSortedEntries = Cache->size();
		}
		// Since we bail out, the "Cache" set won't contain all of the
		// results for the query. This is ok (we can still use it to accelerate
		// specific block queries) but we can't do the fastpath "return all
		// results from the set". Clear out the indicator for this.
		CacheInfo->Pair = BBSkipFirstBlockPair();
		return true;
		}

	// Skip the first block if we have it.	// Skip the first block if we have it.
	if (!SkipFirstBlock) {	if (!SkipFirstBlock) {
	// Analyze the dependency of *Pointer in FromBB. See if we already have	// Analyze the dependency of *Pointer in FromBB. See if we already have
Context not available.