This is an archive of the discontinued LLVM Phabricator instance.

[clang] Optimized ASTNodeKind::isBaseOf
AbandonedPublic

Authored by alexfh on Oct 2 2014, 4:39 AM.

Details

Reviewers
sbenza
Summary

ASTNodeKind::isBaseOf shows in one of the top lines of the profile when running a large clang-tidy benchmark. This rewrite reduces its cumulative time from 6.14% to 3.7%:

<     11.06s  6.14%: clang::ast_type_traits::ASTNodeKind::isBaseOf
>     5.75s  3.70%: clang::ast_type_traits::ASTNodeKind::isBaseOf

Diff Detail

Event Timeline

alexfh updated this revision to Diff 14317.Oct 2 2014, 4:39 AM
alexfh retitled this revision from to [clang] Optimized ASTNodeKind::isBaseOf.
alexfh updated this object.
alexfh edited the test plan for this revision. (Show Details)
alexfh added a reviewer: klimek.
alexfh added subscribers: sbenza, Unknown Object (MLST).
alexfh updated this revision to Diff 14318.Oct 2 2014, 4:40 AM

Remove .gitignore from the patch.

alexfh updated this object.Oct 2 2014, 4:59 AM
klimek added inline comments.
lib/AST/ASTTypeTraits.cpp
51

For my sanity: please one side-effect per line.

64

For my sanity: please one side-effect per line.

alexfh updated this revision to Diff 14319.Oct 2 2014, 5:20 AM

One side-effect per line.

klimek added inline comments.Oct 2 2014, 5:27 AM
lib/AST/ASTTypeTraits.cpp
47

Also, perhaps declare a constant to std::numeric_limits<unsigned>::max()?

61

Not sure whether we care about thread safety here (and VS <= 2013 still not providing it).

sbenza edited edge metadata.Oct 2 2014, 7:55 AM

I have been looking up different ways to do this more efficiently.
There are two reasons the current one is slow: while() loop and loading cold memory for AllKindInfo array.

This CL is one of the solutions I looked into. It removes the while() loop, but makes the caching behavior way worse.
At the very least, ParentDistances should be an array of uint8 to reduce the cache pressure here.

Other solutions I was thinking of require no external memory. That is, you can tell whether they are related (and the distance) from the node ids themselves.
I haven't benchmarked them to determine which one is better. I was waiting on it until isBaseOf() is the largest improvement I can make to get a better comparison.

A list of solutions:
1 Caching the solution (something like this CL).

It requires a half matrix of 1-bit per pair cell. ~290 types so ~5k of memory.
Distance can be calculated as the difference of their depths. Can be stored in nodes or external. External is around ~1k.
(The current CL creates a matrix of ~330k!)

2 Something similar to dyn_cast<> where each node knows the range of their descendants.

The 'last' id can be saved in a separate array or inside the node.
The calculation becomes two comparisons.
Distance can be calculated as the difference of their depths. Can also be stored in the nodes.
If all stored in nodes, requires no external memory and no loops.

3 Each node gets a unique sequential ID.

Then the real node id is: NI(X) = NI(Parent(X)) << 8 + UID(X)
These Ids can be easily generated at compile time.
It requires a loop to find the answer, but no external memory.
Distance is calculated in the loop.

4 Each node gets a unique prime.

Then the real node id is: NI(X) = NI(Parent(X)) * Prime(X)
The operation becomes a simple modulo operation. No external memory.
Distance comes from somewhere else.

I do not like (1) because we can solve this without the external memory.
(2) is fine, but was not easy to implement right now. Type class enums do not have firstXXX/lastXXX definitions.
I implemented (3) and it is not complicated. It still has the loop, though.
(4) is cute, but it might require some work to fit all the ids in a uint64. There are some deeply nested hierarchies.

bkramer added a subscriber: bkramer.Oct 2 2014, 8:56 AM

I wonder if we care about the performance of getting the distance at all, it's only used to create dynamic matchers afaik. Making the non-distance measurement case fast and leaving the case with distances as-is would help clang-tidy and allows a simpler implementation.

sbenza added a comment.Oct 2 2014, 9:19 AM

bkramer is right.
However, the distance is not even used to *create* the dynamic matchers. It is only used for the completion used by clang-query.
So a cheap and simple isBaseOf() should do it. The distance could be separate.

One important thing this patch does, it splits isBaseOf with and without distance to avoid runtime selection.

As for the caching, I believe it doesn't matter much, as on my test (lib/Sema/SemaOverload.cpp) most of the calls were asking one of a small number of table cells. Here's some data:

  • total calls: ~39M
  • total number of pairs asked: 1646
  • 103 mostly asked cells cover 80% of all calls
  • 219 mostly asked cells cover 90% of all calls
  • 50 mostly asked rows (< 6kB) cover 97% of all calls

The other solutions you've proposed are also interesting (especially #4 which is the most elegant, but requires some care in arranging the primes), but this one is the most straightforward, IMO.

It is still bad that it allocates more than 300k of heap, and then it doesn't even use most of it.

I have a prototype of (4) at http://reviews.llvm.org/differential/diff/14336/
Haven't benchmarked it yet.

In D5577#18, @sbenza wrote:

It is still bad that it allocates more than 300k of heap, and then it doesn't even use most of it.

It's ~90k of heap (after changing unsigned to unsigned char). But if you'll manage to get a decent implementation of (4), it would be nice. I'll leave this patch here in case there is no better solution.

I have a prototype of (4) at http://reviews.llvm.org/differential/diff/14336/
Haven't benchmarked it yet.

alexfh updated this revision to Diff 14337.Oct 2 2014, 11:01 AM
alexfh edited edge metadata.

Use unsigned char for ParentDistances.

klimek resigned from this revision.Jul 3 2015, 6:43 AM
klimek removed a reviewer: klimek.
alexfh abandoned this revision.Dec 3 2022, 2:44 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 3 2022, 2:44 PM