This is an archive of the discontinued LLVM Phabricator instance.

Preallocate ExplodedNode hash table
ClosedPublic

Authored by bcraig on Jun 2 2016, 2:43 PM.

Details

Summary

This depends on http://reviews.llvm.org/D20930

Rehashing the ExplodedNode table is very expensive. The hashing itself is expensive, and iterating over the hash table is highly cache unfriendly. Instead, we guess at the eventual size by using the maximum number of steps allowed. This generally avoids a rehash. It is possible that we still need to rehash if the backlog of work added to the worklist significantly exceeds the number of work items we process. Even if we do need to rehash in that scenario, this change is still a win, as we still have fewer rehashes than we would have had prior to this change.

For small workloads, this will increase the memory used. For large workloads, it will somewhat reduce the memory used. Speed is significantly increased. A large .C file took 3m53.812s to analyze prior to this change. Now it takes 3m38.976s, for a ~6% improvement.
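The effect being described can be sketched with a toy model (this is illustrative, not the analyzer's actual code): a table that doubles when full pays for a "rehash" at every doubling, while a table that reserves worst-case capacity up front pays for none.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of the idea above: a table that doubles its capacity when
// full, counting how many times it has to "rehash". Reserving worst-case
// capacity up front means inserts never trigger a rehash.
struct ToyTable {
  size_t Capacity = 32;
  size_t Size = 0;
  unsigned Rehashes = 0;

  void reserve(size_t N) {
    while (Capacity < N)
      Capacity *= 2; // one up-front growth, no elements to re-bucket yet
  }
  void insert() {
    if (Size == Capacity) {
      Capacity *= 2; // a real table would also re-bucket every element here
      ++Rehashes;
    }
    ++Size;
  }
};
```

With the analyzer's default limit of 150,000 steps, the un-reserved table rehashes 13 times (32 up to 262,144), while the reserved one rehashes zero times; in the real FoldingSet each of those rehashes also re-visits every node, which is the cache-unfriendly cost the patch avoids.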

Diff Detail

Event Timeline

bcraig updated this revision to Diff 59459.Jun 2 2016, 2:43 PM
bcraig retitled this revision from to Preallocate ExplodedNode hash table.
bcraig updated this object.
bcraig added reviewers: zaks.anna, krememek, jordan_rose.
bcraig added a subscriber: cfe-commits.
zaks.anna edited edge metadata.Jun 2 2016, 3:59 PM

Does FoldingSet already have reserve? (I do not see it.)

My understanding is that you propose to allocate all the nodes that would be required to store the maximum number of nodes we allow, lowering the malloc traffic. The current implementation just doubles the size. Is this correct?

Maybe we should just set a higher initial size, instead of going all the way to the max?

bcraig added a comment.Jun 3 2016, 6:23 AM

Does FoldingSet already have reserve? (I do not see it.)

The reserve call would be new. I'm attempting to add it in this llvm review: http://reviews.llvm.org/D20930

My understanding is that you propose to allocate all the nodes that would be required to store the maximum number of nodes we allow, lowering the malloc traffic. The current implementation just doubles the size. Is this correct?

Maybe we should just set a higher initial size, instead of going all the way to the max?

The implementation of FoldingSet has a vector of bucket pointers. Reserve preallocates that vector of bucket pointers to the correct size to avoid rehashing. The allocation of the nodes themselves hasn't changed. With a value of 150,000 steps (the default?), reserve will preallocate 131072 bucket pointers (this gives a node capacity of double that). On a 64-bit system, that's a 1 MB allocation.

Note that the biggest savings are in avoiding the last rehashing. If we shrink the initial size, but still rehash frequently, we'll lose a lot of the benefits.
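The sizing math described above can be reconstructed as follows (an illustrative sketch, not the actual reserve logic from D20930): the set keeps roughly two elements per bucket, and the bucket array is a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative reconstruction of the bucket-count math described above
// (the real logic lives in llvm::FoldingSet's reserve, proposed in
// D20930): grow the power-of-two bucket array until twice the bucket
// count can hold the requested number of elements.
uint64_t bucketsFor(uint64_t EltCount) {
  uint64_t Buckets = 1;
  while (Buckets * 2 < EltCount) // stop once 2 * Buckets >= EltCount
    Buckets *= 2;
  return Buckets;
}
```

Under this model, reserving for 150,000 elements yields 131,072 bucket pointers (1 MB on a 64-bit system), and reserving for shallow mode's 75,000 steps yields half that.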

Ben,

By the way, thanks for the patch! It's a clever idea.

The implementation of FoldingSet has a vector of bucket pointers. Reserve preallocates that vector of
bucket pointers to the correct size to avoid rehashing. The allocation of the nodes themselves hasn't
changed.

My point was that we are now preallocating more.

With a value of 150,000 steps (the default?),

This is the maximum number of steps after which we just give up the analysis. It's very unlikely that the analyzer will execute this many steps on average. Moreover, this number can be reset from the command line.

I would prefer not to use the same constant here. We should use another constant. Maybe we could determine the default setting from the average number of steps that is actually executed?

You can gather statistics on these and see what we get (http://llvm.org/docs/ProgrammersManual.html#the-statistic-class-stats-option):

STATISTIC(NumSteps,
          "The # of steps executed.");
STATISTIC(NumReachedMaxSteps,
          "The # of times we reached the max number of steps.");
STATISTIC(NumPathsExplored,
          "The # of paths explored by the analyzer.");

reserve will preallocate 131072 bucket pointers
(this gives a node capacity of double that). On a 64-bit system, that's a 1 MB allocation.
Note that the biggest savings are in avoiding the last rehashing. If we shrink the initial size, but still
rehash frequently, we'll lose a lot of the benefits.

Please, update patches with full context: http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

I'll try to get those stats. My suspicion is that it is highly language- and project-specific. Specifically, I suspect that C (and maybe ObjC?) won't hit the analysis limit often, but that C++ will hit the limit frequently because of the large number of inlined functions.

I will note that one of the reasons I like using the number of steps (or a number derived from it) is that it scales to the expected worst-case workload. A shallow analysis only gets 75,000 work items per top-level function, so the exploded graph reservation scales down appropriately.

I'll make sure the next diff (and future diffs) include more context.

Specifically, I suspect that C (and maybe ObjC?) won't hit the analysis limit often, but that C++ will hit the
limit frequently because of the large number of inline functions.

It might be valuable to know this. Maybe we should change the default for C++?

(or a number derived from it) is that it scales to the expected worst-case workload.

Sounds good.

bcraig added a comment.Jun 7 2016, 2:02 PM

tl;dr: I'm going to recommend we stick with the current algorithm, but I only have partial data to back that up.

For the pile of LLVM projects that I am currently building (llvm, clang, libcxx, libcxxabi), 18.9% of all analyzed functions hit the maximum step count. For the previously discussed large .C file, 37% of the analyzed functions hit the maximum step count. These are using the stats "AnalysisConsumer - The # of functions and blocks analyzed (as top level with inlining turned on)." and "CoreEngine - The # of times we reached the max number of steps.".

The average number of steps is 34,447. My best guess at a median given the translation unit level statistics is around 8,050 steps.

If we were to use the average number of steps to size the exploded graph, we would get a 256k buffer. We would also cause the (already slowest) ~20% of our functions to rehash (twice!).

There is some data that I wish I had, but don't. I wish I had per-function step counts, so that I could present a histogram of the data. Then we would know what % of functions would be rehashed for a given value.

bcraig updated this revision to Diff 59950.Jun 7 2016, 2:22 PM
bcraig edited edge metadata.

Uploading more context.

For the pile of LLVM projects that I am currently building (llvm, clang, libcxx, libcxxabi), 18.9% of all analyzed
functions hit the maximum step count. For the previously discussed large .C file, 37% of the analyzed functions hit the maximum step count.

These look high to me. I've looked up some statistics that I collected years ago, and I see a similar number for sqlite but much lower numbers for other test cases. I tested with the Preview app (~150 files) and a bunch of preprocessed files we had on hand at that moment; in both cases, the number of files reaching the timeout was about 1%. sqlite is special because all of its source code is in that one file, so we can do much more inlining. Which C test case have you used?

(I do not have any numbers for C++.)

Another wrinkle here is that we support "Analyzer shallow mode" (-analyzer-config mode=shallow), which dials down inlining quite a bit. We have users who analyze during the build (believe it or not), and they use the shallow mode by default.

For the pile of LLVM projects that I am currently building (llvm, clang, libcxx, libcxxabi), 18.9% of all analyzed
functions hit the maximum step count. For the previously discussed large .C file, 37% of the analyzed functions hit the maximum step count.

These look high to me. I've looked up some statistics that I collected years ago, and I see a similar number for sqlite but much lower numbers for other test cases. I tested with the Preview app (~150 files) and a bunch of preprocessed files we had on hand at that moment; in both cases, the number of files reaching the timeout was about 1%. sqlite is special because all of its source code is in that one file, so we can do much more inlining. Which C test case have you used?

(I do not have any numbers for C++.)

Another wrinkle here is that we support "Analyzer shallow mode" (-analyzer-config mode=shallow), which dials down inlining quite a bit. We have users who analyze during the build (believe it or not), and they use the shallow mode by default.

I've run another analysis, this time on a code base that is mostly C (maybe 5% C++). 6% of the functions there hit the max. The average number of steps was 13,439, and the median was 2,867.

Analyzer shallow mode does adjust inlining. It also lowers the maximum number of steps from 150,000 to 75,000. That reduction in the number of steps drops the exploded graph reservation down to half a meg instead of the full megabyte.

A 6% speed improvement could be a big win! Do we have a sense of what the expected increased memory cost (as a % of average use over the lifetime of the process) is? My guess is it would be relatively low. I suspect most analyzer users run relatively few concurrent 'clang' processes -- so this might be well worth it.

I do think we should make sure the user can't shoot themselves in the foot by pre-reserving space for an absurdly high maximum step count. We might want to clamp this reservation to something that is not outrageous even when the maximum step count is huge.

bcraig added a comment.Jun 9 2016, 7:37 AM

A 6% speed improvement could be a big win! Do we have a sense of what the expected increased memory cost (as a % of average use over the lifetime of the process) is? My guess is it would be relatively low. I suspect most analyzer users run relatively few concurrent 'clang' processes -- so this might be well worth it.

If the underlying allocator does a poor job of reusing freed memory, then trivial functions will use about 1 MB more than before and then free the memory immediately. On the other hand, functions that hit the max step count will use about 1 MB less memory than before. The thought experiments get difficult when the underlying allocator is good at reusing freed memory.

I got some memory numbers for the .C file that saw the 6% speedup and has 37% of its functions hitting the step count.
From /usr/bin/time -v,
Before: Maximum resident set size (kbytes): 498688
After: Maximum resident set size (kbytes): 497872

Valgrind's massif tool reported the peak usage as 14696448 (0xE04000) bytes for both the before and after.

I do think we should make sure the user can't shoot themselves in the foot by pre-reserving space for an absurdly high maximum step count. We might want to to clamp this reservation to something that is not outrageous even when the maximum step count is huge.

Sure. I'll switch to reserving the smaller of Steps and 4 million. That means the most memory we will pre-reserve will be 32MB.
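The proposed cap can be sketched like this (the identifiers here are illustrative, not the patch's actual names): reserve for the configured step count, but never for more than 4 million nodes, bounding the up-front allocation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Sketch of the clamp described above. Identifiers are illustrative:
// reserve for the configured step count, but never for more than
// 4 million nodes, so a huge -analyzer-max-loop/step setting cannot
// trigger an outrageous up-front allocation.
const uint64_t MaxReservedNodes = 4000000;

uint64_t reservedNodeCount(uint64_t Steps) {
  return std::min(Steps, MaxReservedNodes);
}
```

With the default 150,000-step limit the clamp has no effect; it only kicks in for user-configured step counts above 4 million.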

If the underlying allocator does a poor job of reusing freed memory, then trivial
functions will use about 1 MB more than before and then free the memory immediately.

You could probably flag some of those functions, especially the ones that do not contain inlinable calls. (Ex: has low complexity (CFG::getNumBlockIDs()) and does not call anything that can be inlined.) That said, it's a "nice to have".

bcraig updated this revision to Diff 60177.Jun 9 2016, 1:12 PM

Capping the pre-reserve space

bcraig added a comment.Jun 9 2016, 1:18 PM

I got better valgrind numbers (symbols are important).
Before: 106,131,538
After: 106,657,666
Diff: 526,128 larger.

Note that these are sampled peaks for heap usage, so they may not be accurate. Regardless, the peak usage increased by less than 0.5%.

zaks.anna accepted this revision.Jun 9 2016, 2:34 PM
zaks.anna edited edge metadata.

LGTM. Thanks!

This revision is now accepted and ready to land.Jun 9 2016, 2:34 PM

Ah, right, please, add a comment explaining what we are doing and why.

bcraig closed this revision.Jun 10 2016, 6:30 AM

Completed: At revision: 272394