This is an archive of the discontinued LLVM Phabricator instance.

[clang][Syntax] Optimize expandedTokens for token ranges.
ClosedPublic

Authored by usaxena95 on Mar 22 2021, 9:21 AM.

Details

Summary

expandedTokens(SourceRange) used to do a binary search to get the
expanded tokens belonging to a source range. Each binary search uses
isBeforeInTranslationUnit to order two source locations, which is
inherently very slow.
Profiling clangd showed that users like clangd::SelectionTree spend
95% of their time in isBeforeInTranslationUnit. It is also worth
noting that users of expandedTokens(SourceRange) mostly query this
function with ranges provided by the AST. These ranges are
token ranges (starting at the beginning of one token and ending at the
beginning of another token).

Therefore we can avoid the binary search in the majority of cases by
maintaining an index of expanded tokens keyed by their SourceLocations. We
still fall back to binary search for ranges that are not token ranges, but
such instances are rare.
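The idea above can be sketched with simplified stand-in types (SourceLocation as an integer, a toy TokenBufferModel; all names here are illustrative, not clang's real API): an O(1) hash lookup handles token ranges, with a binary-search fallback for everything else.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy stand-ins for clang's types; the real syntax::TokenBuffer stores
// syntax::Token objects whose locations are ordered with
// SourceManager::isBeforeInTranslationUnit rather than operator<.
using SourceLocation = unsigned;

struct Token {
  SourceLocation Loc; // location of the token's first character
};

class TokenBufferModel {
  std::vector<Token> ExpandedTokens; // sorted by Loc
  // Opt-in index: SourceLocation -> position in ExpandedTokens.
  std::unordered_map<SourceLocation, size_t> ExpandedTokIndex;

public:
  explicit TokenBufferModel(std::vector<Token> Toks)
      : ExpandedTokens(std::move(Toks)) {}

  // Idempotent: building the index twice is a no-op.
  void indexExpandedTokens() {
    if (!ExpandedTokIndex.empty() || ExpandedTokens.empty())
      return;
    ExpandedTokIndex.reserve(ExpandedTokens.size());
    for (size_t I = 0, E = ExpandedTokens.size(); I != E; ++I)
      ExpandedTokIndex[ExpandedTokens[I].Loc] = I;
  }

  // Returns the half-open index range [First, Last) of tokens covered by
  // the *token range* [Begin, End]: both locations point at token starts.
  std::pair<size_t, size_t> expandedTokens(SourceLocation Begin,
                                           SourceLocation End) const {
    auto B = ExpandedTokIndex.find(Begin);
    auto E = ExpandedTokIndex.find(End);
    if (B != ExpandedTokIndex.end() && E != ExpandedTokIndex.end())
      return {B->second, E->second + 1}; // O(1) hit, no binary search
    // Fallback for non-token ranges: binary search over sorted locations.
    auto Less = [](const Token &T, SourceLocation L) { return T.Loc < L; };
    size_t First = std::lower_bound(ExpandedTokens.begin(),
                                    ExpandedTokens.end(), Begin, Less) -
                   ExpandedTokens.begin();
    size_t Last = std::lower_bound(ExpandedTokens.begin(),
                                   ExpandedTokens.end(), End, Less) -
                  ExpandedTokens.begin();
    if (Last < ExpandedTokens.size() && ExpandedTokens[Last].Loc == End)
      ++Last; // token range: the token starting at End is included
    return {First, Last};
  }
};
```

Both paths return the same answer; the index only changes the cost per query, which is what makes SelectionTree-style callers (many range queries per buffer) the main beneficiaries.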

In the cached-AST case we see wins even for normal clangd usage: invoking location-based features now has the same performance in batch mode as in normal server mode.


Performance:
~/build/bin/clangd --check=clang/lib/Serialization/ASTReader.cpp
Before: took 2m10s to complete.
Now: took 1m13s to complete.
~45% faster.


For medium-sized buffers (like XRefs.cpp), the iteration is negligible compared to building the ParsedAST. The wins are smaller (4-8%) for smaller buffers.
~/build/bin/clangd -- clang-tools-extra/clangd/XRefs.cpp
Preamble: 5s
ParsedAST: 0.54s
Location Features: 0.51s (5% speedup)
ExpandedToken iteration: 0.00018s (0.03% of ParsedAST)


Huge win for macro-heavy files (like tests):
~/build/bin/clangd --check=clang-tools-extra/clangd/unittests/CodeCompleteTests.cpp
Preamble: 2.3s
ParsedAST: 0.23s
ExpandedToken iteration: 0.002s (0.85% of ParsedAST)
Location features: 1702s to 363s (~80% faster)

Diff Detail

Event Timeline

usaxena95 requested review of this revision.Mar 22 2021, 9:21 AM
usaxena95 created this revision.
Herald added a project: Restricted Project. Mar 22 2021, 9:21 AM
kadircet added inline comments.Mar 22 2021, 11:08 AM
clang/include/clang/Tooling/Syntax/Tokens.h
377

this definitely makes sense, but most of the users of TokenBuffer currently invoke expandedTokens(Range) only once, so I am not sure this is a huge benefit on their side, whereas they'll likely end up paying some short-lived memory consumption (and an extra traversal of all the expanded tokens).

Any reason for not moving this logic to the application? The expandedTokens() endpoint will return all the tokens, no questions asked; then the application can build whatever acceleration structure it wants. If you think there are in-tree use cases that can benefit from this caching, maybe even provide that as a separate helper?

usaxena95 added inline comments.Mar 22 2021, 12:07 PM
clang/include/clang/Tooling/Syntax/Tokens.h
377

this definitely makes sense, but most of the users of TokenBuffers are currently invoking expandedTokens(Range) only once

I didn't follow the point about users invoking it only once. This is not caching the results of expandedTokens(Range); we are precomputing the index to completely avoid a binary search that uses isBeforeInTranslationUnit (which is considerably slower). An extra traversal of the expanded tokens is not noticeable compared to the isBeforeInTranslationUnit latency. Users like SelectionTree (GoToDefinition, Hover, ...) can directly benefit from this even though they query it only once per range.

Umm, if you meant use cases like: build a TokenBuffer -> use expandedTokens(Range) -> destroy the TokenBuffer,
I don't immediately see such use cases, but I think the index is still better than using isBeforeInTranslationUnit.
My mental model is that TokenBuffer is a long-lived object (for example in ParsedAST in clangd). Using it only once and then discarding it is expensive in itself.

Any reasons for not moving this logic to the application?
If you think there are in-tree use cases that can benefit from this caching, maybe even provide that as a separate helper?

This can benefit basically all users of expandedTokens(Range). Since expandedTokens(Range) is provided by TokenBuffer, the optimization belongs in TokenBuffer rather than in each application.
Users that only query non-token ranges will pay an additional cost, though. We could potentially make this precomputation optional for such cases.

kadircet added inline comments.Mar 22 2021, 1:13 PM
clang/include/clang/Tooling/Syntax/Tokens.h
377

I didn't get that users are invoking it only once.

Sorry, that was confusing. I was trying to say that for use cases like Hover/Go-To etc., clangd builds the selection tree only once. You are right that this might actually involve multiple calls to expandedTokens(Range), but it shouldn't be the bottleneck of these features.

This is not caching the results of expandedTokens(Range). We are precomputing these to completely avoid a binary search

right, i was saying that all that precomputation is also probably unneeded, as normally features are only interested in a single token, which should nest properly in AST under normal circumstances and doesn't require calling expandedTokens for all possible ranges.

Doing an extra traversal of expandedTokens is not noticeable as compared to isBeforeInTranslationUnit latency. Users like SelectionTree (GoToDefinition, Hover,..) can directly benefit from this even though they query this only once per some range.

I was suggesting to move this into the application because it felt counter-intuitive: the benchmark you provided is actually running Hover (and other features) on every (spelled) token of the file. If you say it is beneficial even for the normal hover case, sure, let's go (but I'd be happier if you proved me wrong first 😅)

Umm. If you meant use cases like: build token buffer -> use expandedTokens(Range) -> destroy TokenBuffer.
I don't immediately see such use cases. But I think it is still better than using isBeforeInTranslationUnit.
My mental model is that TokenBuffer is a long lived object (example in ParsedAST in clangd). And using it only once and then discarding it is expensive in itself.

That's what clangd does, though (not sure if TokenBuffer has any other users ATM). It literally builds a new ParsedAST after every edit (it also has a cache to skip some work if contents didn't change between requests). But usually all of those requests are user-initiated, so an extra latency of a couple ms on every request isn't really noticeable.
Getting rid of that would still be great, but not if it means slowing down the initial request for quite some time (due to the extra traversal of all the expanded tokens, which might be a huge deal especially on files including some tablegen'd code, as that is usually part of the main file and not the preamble).

This can basically benefit all users of expandedToken(Range).

As stated above, I am just worried that this might hurt use cases in which the application just builds a TokenBuffer and calls this on a few ranges (maybe clangd is the odd one out, but I don't see any other users; there's clang::syntax::computeReplacements, but it is only used in a test).

Since expandedToken(Range) is provided by TokenBuffer, the optimization must also remain in the TokenBuffer as compared to the application.

I don't follow the reasoning here exactly. An application that needs to operate on every token of the expanded file can call the wrapper rather than using TokenBuffer directly.

But the most dominant application in clangd's case is SelectionTree, and it would end up performing this computation all the time (because it wants to support both the batch and the single-invocation use case :/). As you suggested, this could be made optional (or triggered with an explicit endpoint), but that's going to be ugly, as it needs to be supplied when building a ParsedAST, ugh...

(Sorry for drawing a little circle here.) I suppose it is cleanest the way you propose it then (unless your application is actually building ParsedASTs itself without any clangd abstraction like TUScheduler/ClangdServer).
It would still be great if you could check for possible regressions on the common path. The easiest would be to just look at ParsedAST::build latencies on a couple of files; I suppose clang/lib/Serialization/ASTReader.cpp is a good example of the pathological case (winning there would probably mean winning in all the other cases too), but it would be good to see a more "sane" file as well, since anything after building the ParsedAST is guaranteed to be as fast as before.

Although users which only use this for non-token ranges will pay additional cost. We can potentially make this precomputation optional for such cases.

I suppose that to hit the slow case users would have to provide spelled tokens that did not expand to anything, which should be rare, so I don't think that'll be a huge problem.

clang/lib/Tooling/Syntax/Tokens.cpp
667

nit: might look better as a for (size_t I = 0, E = Tokens.size(); I != E; ++I) kind of loop.

Thanks for finding and working on this hotspot. (I'm curious *why* isBeforeInTranslationUnit is slow if you have any insight; it has a sad worst case, but I thought we'd hit the one-element cache often enough in practice.)

I don't see a better way to optimize this, main question is where to do it.
In favor of TokenBuffer initializer (this patch):

  • it produces the simplest API
  • no code duplicated between callers
  • when a TokenBuffer gets reused and it's awkward for its users to share other state, the same cache can be reused

In favor of in the caller (e.g. SelectionTree)

  • users that never call expandedTokens(SourceRange) pay nothing
  • users that call expandedTokens(SourceRange) once pay log(N) * isBeforeInTranslationUnit instead of N, which may come out ahead for large files.

It's a bit unfortunate that TokenBuffer is enough of a grab-bag of features that we can't really reason about which structures to build :-(

Can we get the best of both worlds?
Ideally we'd build this cache only when it's profitable, but building on demand messes with const (and it's unclear whether to do it on the first call).
What about making this optional via a non-const method?
    TokenBuffer::indexExpandedTokens(); // Optimizes future calls to expandedTokens(SourceRange). Idempotent.
It's a slightly weird API, but not actually hard to use (I imagine clangd would always call it when building ParsedAST).
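A minimal sketch of the shape suggested here, with a toy Buffer class standing in for TokenBuffer (the method name follows the suggestion above, but the class itself is hypothetical): the indexing call is non-const and idempotent, while queries stay const and simply use the index when it exists.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Toy model of the proposed opt-in API; not clang's real TokenBuffer.
class Buffer {
  std::vector<unsigned> TokenLocs; // expanded-token start locations, sorted
  std::unordered_map<unsigned, size_t> Index; // empty until opted in

public:
  explicit Buffer(std::vector<unsigned> Locs) : TokenLocs(std::move(Locs)) {}

  // Non-const and idempotent: callers opt in once, typically right after
  // building the buffer (e.g. while constructing clangd's ParsedAST).
  void indexExpandedTokens() {
    if (!Index.empty() || TokenLocs.empty())
      return; // second and later calls are no-ops
    Index.reserve(TokenLocs.size());
    for (size_t I = 0, E = TokenLocs.size(); I != E; ++I)
      Index[TokenLocs[I]] = I;
  }

  // Query methods remain const; they can consult Index when populated.
  bool isIndexed() const { return !Index.empty(); }
  size_t indexSize() const { return Index.size(); }
};
```

Callers that never opt in pay nothing, which is exactly the "best of both worlds" property the comment asks for.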

clang/include/clang/Tooling/Syntax/Tokens.h
377

Long thread here :-)

Kadir: one wrinkle that's maybe missed here is that clangd's SelectionTree calls this method *many* times to establish bounds of many AST nodes. I fully believe this is dominating SelectionTree build time (though it would be nice to see benchmarks).

clang/lib/Tooling/Syntax/Tokens.cpp
665

I think it's worth reserve()ing the map size here

sammccall added inline comments.Mar 23 2021, 5:24 AM
clang/include/clang/Tooling/Syntax/Tokens.h
377

the benchmark you provided is actually running Hover (and other features) on every (spelled) token of the file

I finally understood this. You're right, but we want to reuse some of the clangd code in batch mode (for code review workflows), which means exactly this!

If you say that it is beneficial even for the normal hover case

I'd love to have numbers, but my understanding is: parse >> selectiontree > hover. So this should be a >50% speedup for hover in the cached AST case. Though that case is fast anyway, so it's not critically important.

usaxena95 edited the summary of this revision. (Show Details)Mar 24 2021, 10:48 AM
usaxena95 marked 6 inline comments as done.
usaxena95 edited the summary of this revision. (Show Details)

Made the optimization opt-in.
Created TokenBuffer::indexExpandedTokens() for creating the index.

Made some more measurements on other files. Added to description.

Herald added a project: Restricted Project. Mar 24 2021, 10:48 AM
Herald added a subscriber: arphaman.
usaxena95 edited the summary of this revision. (Show Details)Mar 24 2021, 10:50 AM

Remove unintended changes.

sammccall accepted this revision.Mar 25 2021, 9:47 AM

Just doc nits - I think maybe there's some confusion on what a token range is.
Code looks good though!

clang-tools-extra/clangd/ParsedAST.cpp
426

nit: the first sentence is just repeating the function's doc - I'd just say "Makes
SelectionTree build much faster"

clang/include/clang/Tooling/Syntax/Tokens.h
196

Hmm, the primary doc (first sentence) should probably be about the effect rather than implementation details.

Maybe just "Builds a cache to make expandedToken(SourceRange) faster."?

201–202

token range was correct and critical to mention here, please revert

It means that R.endLocation() is the location of the last token included in the range. Sometimes endLocation() means other things: e.g. the end of a closed character range, or one past the end of a half-open token or character range.
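This closed token-range convention can be illustrated with a toy model (TokRange and tokensIn are hypothetical helpers, not clang API): both endpoints are inclusive, so the end names the last token in the range rather than one past it.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model of a *token range*: both endpoints are inclusive, mirroring
// how R.endLocation() points at the beginning of the last included token.
// A half-open character range would instead point one past the end.
struct TokRange {
  size_t BeginTok, EndTok; // indices into a token stream, both inclusive
};

std::vector<std::string> tokensIn(const std::vector<std::string> &Toks,
                                  TokRange R) {
  // EndTok is included, hence the +1 when converting to iterators.
  return {Toks.begin() + R.BeginTok, Toks.begin() + R.EndTok + 1};
}
```

For the token stream of "int x = 42;", a range with BeginTok at "x" and EndTok at "42" covers all three tokens "x", "=", "42".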

203

can you drop "for 'token ranges'" here? I'm not sure what it means, and all ranges passed in here must be token ranges per contract.

376

again this comment is confusing. I'd suggest just dropping it - it's fairly covered by indexExpandedTokens()'s comment.

This revision is now accepted and ready to land.Mar 25 2021, 9:47 AM
usaxena95 marked 4 inline comments as done.

Addressed comments.

This revision was landed with ongoing or failed builds.Mar 25 2021, 10:54 AM
This revision was automatically updated to reflect the committed changes.
usaxena95 marked an inline comment as done.