This is an archive of the discontinued LLVM Phabricator instance.

[ASTMatchers] Performance optimization for memoization
Needs ReviewPublic

Authored by njames93 on May 19 2020, 4:03 AM.

Details

Reviewers
klimek
sammccall
Summary

A simple optimization to avoid unnecessary copies of BoundNodesTreeBuilders when querying the cache for memoized matchers.
This hasn't been benchmarked as I'm unsure how the memoized matchers were first benchmarked so it shouldn't be merged unless there is a performance improvement.

Diff Detail

Event Timeline

njames93 created this revision.May 19 2020, 4:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 19 2020, 4:03 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
grandinj added inline comments.
clang/lib/ASTMatchers/ASTMatchFinder.cpp
84

note that making fields const tends to disable various move optimisations, so better to mark the whole struct const in code that uses it

Given the extra complexity I'd like to see that it matters - bound nodes tend to be small.

Given the extra complexity I'd like to see that it matters - bound nodes tend to be small.

I put that in the description, but this is where i need help. whats the best way to benchmark the matchers?
Also, do you know how it was benchmarked when MaxMemoizationEntries was decided upon?
There was also comments about some making some micro benchmarks but I don't think that was acted upon.

Given the extra complexity I'd like to see that it matters - bound nodes tend to be small.

I put that in the description, but this is where i need help. whats the best way to benchmark the matchers?
Also, do you know how it was benchmarked when MaxMemoizationEntries was decided upon?
There was also comments about some making some micro benchmarks but I don't think that was acted upon.

I do remember benchmarking it when I wrote it, but that was 10 years ago :)
I mainly did benchmarks on large ASTs, trying various matchers.
Today we can benchmark for example whether running clang-tidy on all of LLVM makes a difference.
Micro-BMs would be nice to add, but are somewhat hard to get right.

Running most of the clang tidy checks on the clang-tidy folder yields these results

=================================BeforePatch===================================

RUN1:
4045.17user 83.93system 11:28.80elapsed 599%CPU (0avgtext+0avgdata 534024maxresident)k
0inputs+0outputs (0major+27584683minor)pagefaults 0swaps

RUN2:
4078.06user 84.99system 11:35.99elapsed 598%CPU (0avgtext+0avgdata 506912maxresident)k
55312inputs+0outputs (663major+27661947minor)pagefaults 0swaps

RUN3:
4040.77user 86.02system 11:28.85elapsed 599%CPU (0avgtext+0avgdata 547096maxresident)k
0inputs+0outputs (0major+27698937minor)pagefaults 0swaps
==================================AfterPatch===================================

RUN1: 
4025.33user 83.32system 11:27.00elapsed 598%CPU (0avgtext+0avgdata 530568maxresident)k
0inputs+0outputs (0major+27689512minor)pagefaults 0swaps

RUN2:
4056.93user 83.36system 11:32.31elapsed 598%CPU (0avgtext+0avgdata 529120maxresident)k
3752inputs+0outputs (19major+27794845minor)pagefaults 0swaps

RUN3:
4029.05user 85.45system 11:26.31elapsed 599%CPU (0avgtext+0avgdata 533508maxresident)k
0inputs+0outputs (0major+27730918minor)pagefaults 0swaps

Shows a consistent improvement but there tests are very noisy and dont focus on just the matching, they also include all the other boilderplate when running clang-tidy over a database. not to mention a small sample size

I decided to do some more stringent benchmarking, just focusing on the matching, not the boilerplate of running clang-tidy.

=================================BeforePatch===================================
Matcher Timings:   116.0756 user  29.1397 system  145.2154 process  145.2168 wall
Matcher Timings:   117.7205 user  29.2475 system  146.9681 process  146.9158 wall
Matcher Timings:   116.8313 user  29.5170 system  146.3483 process  146.2655 wall
Matcher Timings:   117.9491 user  29.0969 system  147.0459 process  146.9678 wall
Matcher Timings:   117.6309 user  29.1864 system  146.8173 process  146.7687 wall

user:    117.2+-0.753
system:  29.24+-0.166
process: 146.5+-0.760
==================================AfterPatch===================================
Matcher Timings:   110.5497 user  28.3316 system  138.8813 process  138.7960 wall
Matcher Timings:   112.5151 user  28.8616 system  141.3767 process  141.3003 wall
Matcher Timings:   116.1578 user  28.9472 system  145.1049 process  145.0785 wall
Matcher Timings:   107.1089 user  27.2752 system  134.3841 process  134.3459 wall
Matcher Timings:   105.9242 user  27.0338 system  132.9580 process  132.9010 wall

user:    110.4+-4.159
system:  28.09+-0.890
process: 138.6+-4.979

This is showing ~5% improvement when running every clang-tidy check on a translation unit (Specifically clang-tools-extra/clang-tidy/modernize/LoopConvertCheck.cpp)

+Sam for additional sanity checking.

clang/lib/ASTMatchers/ASTMatchFinder.cpp
902

Ok, if this actually matters, we should also not use a std::map here, but a llvm::DenseMap (or do we rely on iteration invalidation semantics here?).

njames93 updated this revision to Diff 272470.Jun 22 2020, 9:30 AM

Fix checks failing

sammccall added inline comments.Jul 7 2020, 2:30 AM
clang/lib/ASTMatchers/ASTMatchFinder.cpp
902

Belated +1. 5% seems like this performance *does* matter, so DenseMap::find_as should be yet faster.
It looks like BoundNodesTreeBuilder would need a DenseMapInfo, but all the other members are hashable already.