Add index-while-building support to Clang
AcceptedPublic

Authored by nathawes on Oct 18 2017, 6:09 AM.

Details

Summary

Adds a new -index-store-path option that causes Clang to additionally collect and output source code indexing information to the supplied path. This is done by wrapping the FrontendAction otherwise setup by the invocation with a WrappingIndexRecordAction. This action simply delegates to the wrapped action, but additionally multiplexes in its own IndexASTConsumer to collect symbol information from the AST and tracks the source file and module dependencies of the translation unit (via the IndexDependencyProvider class).

When the action completes, it then writes this information out to the supplied index store path in the form of a unit file, which stores the dependency information of the translation unit, and record files, that store the symbol and symbol occurrences seen in each source file. These are written out in the LLVM Bitstream format.

For a better (and more detailed) description of these changes, see the design document at: https://docs.google.com/document/d/1cH2sTpgSnJZCkZtJl1aY-rzy4uGPcrI-6RrUpdATO2Q/edit?usp=sharing

and the mailing list discussion 'RFC: Adding index-while-building support to Clang'

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
nathawes updated this revision to Diff 126065.Dec 7 2017, 4:02 PM

Worked through the comments from @ioeric and split the code for writing out the collected indexing data into a separate patch.

@malaperle, to clarify we are not suggesting that you write your own parser, the suggestion is to use clang in 'fast-scan' mode to get the structure of the declarations of a single file, see CXTranslationUnit_SingleFileParse (along with enabling skipping of bodies). We have found clang is super fast when you only try to get the structure of a file like this.

Thank you, that sounds very useful. I will try that and get some measurements.

We can make convenient APIs to provide the syntactic structure of declarations based on their location.

Perhaps just for the end-loc since it's pretty much guaranteed to be needed by everyone. But if it's very straightforward, perhaps that's not needed. I'll try and see.

But let's say we added the end-loc, is it enough ? If you want to implement the 'peek the definition' like Eclipse, then it is not enough, you also need to figure out if there are documentation comments associated with the declaration and also show those. Also what if you want to highlight the type signature of a function, then just storing the location of the closing brace of its body is not enough. There can be any arbitrary things you may want to get from the structure of the declaration (e.g. the parameter ranges), but we could provide an API to gather any syntactic structure info you may want.

That's a very good point. I guess in the back of my mind, I have the worry that one cannot extend what is stored, either for a different performance trade-off or for additional things. The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend. But perhaps it's better to not overthink this for now.

Thanks a lot for the changes! Some more comments inlined.

Please mark addressed comments as done so that reviewers could know what to look :) Thanks!

include/clang/Frontend/CompilerInstance.h
187

nit: LLVM variable names start with upper-case letters.

include/clang/Index/IndexingAction.h
31

This should be removed?

Some forward declarations above are not used as well.

lib/Driver/Job.cpp
301

Could you share this code with line 278 above, which already has a nice comment?

lib/Index/FileIndexRecord.cpp
39 ↗(On Diff #121833)

I would leave the sorting to the point where records are hashed to avoid making the record stateful. Consider changing getDeclOccurrences to getOccurrencesSortedByOffset; this should make the behavior more explicit.

lib/Index/FileIndexRecord.h
51 ↗(On Diff #126065)

s/isSystem/IsSystem/

Also, I wonder if we can filter out system decls proactively and avoid creating file index record for them. We could also avoid propogating IsSystem here.

lib/Index/IndexingAction.cpp
397

IsSystemFileCache &SysrootPath? What is this parameter?

486

Please document this class. This can be easily confused with IndexActionBase which has a similar name. Same for IndexAction/IndexRecordAction and WrappingIndexRecordAction/WrappingIndexRecordAction. I think these pairs share (especially the wrapping actions) some common logics and could probably be merged.

512

This does a lot of stuff... please document the behavior!

536

I thought this could be a state in the WrapperFrontendAction since both derived classes maintain this state, but after a closer look, this seems to depend on both base classes. I'm not a big fun of maintaining states in multi-stage classes (e.g. FrontendAction), which could be confusing and hard to follow; I think IndexRecordActionBase::finish(...) should be able to handle the case where no index consumer is created (i.e. no record/dependency/... is collected).

Also, IndexRecordActionBase (and the existing IndexActionBase ) should really be a component instead of a base class since none of its methods is virtual.

604

nit: no need for braces. Same below.

616

In the previous patch, writeUnitData does several things including handling modules, dependencies, includes and index records, as well as writing data. It might make sense to add an abstract class (UnitDataCollector?) that defines interfaces which make these behavior more explicit. We can then have users pass in an implementation via createIndexDataRecordingAction which would also decouple the data collection from data storage in the library.

651

I'm a bit nervous about propagating the entire FrontendOptions into the index library. I would simply expose getIndexOptionsFromFrontendOptions and have callers parse FrontendOptions and pass in only index-related options.

lib/Index/IndexingContext.h
41

This name is really confusing... Is* is usually used for booleans. Simply call this SystemFileCache.

53

How does this affect the existing cached results? Do you need to invalidate them?

64

I think it would be more straightforward to have context own the cache. If setSysrootPath is the problem, it might make sense to propagate it via the context or, if necessary, create a new cache when a new SysrootPath is set.

nathawes planned changes to this revision.Dec 12 2017, 11:51 AM

Thanks for taking another look @ioeric – I'll work through your comments and update.

nathawes marked 45 inline comments as done.Dec 18 2017, 2:05 PM
nathawes added inline comments.
lib/Index/FileIndexRecord.h
51 ↗(On Diff #126065)

If the -index-ignore-system-symbols flag is set system decls are filtered out in IndexingContext::handleDeclOccurrence and aren't reported to the IndexDataConsumer, so FileIndexRecords won't be created. The IsSystem here is for clients that want index data for system files, but want to be able to distinguish them from regular files.

nathawes updated this revision to Diff 127412.Dec 18 2017, 2:58 PM
nathawes marked an inline comment as done.

I've refactored the indexing/dependency data collection out from the writing with the new IndexUnitDataConsumer class, and made other smaller changes to address the feedback from @ioeric.

nathawes updated this revision to Diff 127568.EditedDec 19 2017, 10:50 AM

Fix out of date header comment in FileIndexData.h

Thanks a lot for further cleaning up the patch! It is now much easier to review. I really appreciate it!

Some more comments on the public APIs and the layering of classes. There are a lot of helper classes in the implementation, so I think it's important to get a clear layering so that they could be easily understood by future contributors :)

Also, with the IndexUnitDataConsumer abstraction, it seems to be feasible now to add some unit tests for createUnitIndexingAction. With all the recent major changes, I think it's important that we have some degree of testing to make sure components actually work together.

include/clang/Frontend/CompilerInstance.h
187

opts and action are still lower-case.

include/clang/Index/DeclOccurrence.h
39

Nit: indentation.

Tip: git-clang-format against the diff base can format all changed lines in your patch.

include/clang/Index/IndexUnitDataConsumer.h
1 ↗(On Diff #127568)

IIUC, this is the index data for a translation unit, as opposed to an AST. If so, consider calling this UnitIndexDataConsumer to match (AST)IndexDataConsumer which is parallel to this. We might want to rename them to be either index::UnitDataConsumer vs index::ASTDataConsumer or index::UnitIndexDataConsumer vs index::ASTIndexDataConsumer . I am inclined to the first pair as index is already implied in the namespace.

67 ↗(On Diff #127568)

Comment? Why do we actually need this?

include/clang/Index/IndexingAction.h
44

We are now mixing functionalities for Unit indexing and AST indexing actions in the same file. We might want to separate these into two headers e..g UnitIndexingAction.h and ASTIndexingAction.h. This would make it easier for users to find the right functions :)

61

Please add documentation for each field. It's not trivial what each field is for, especially some fields seem to be optional and some seem to be mutually exclusive.

62

These pointers suggest the life time of this struct is tied to some other struct, which makes the struct look a bit dangerous to use. Should we also carry a reference or a smart pointer to the underlying object that keeps these pointers valid? Would it be a CompilerInstance (guessing from IndexUnitDataConsumerFactory )?

81

What is the intended user of this function? It's unclear how users could obtain a ConsumerFactory (i.e. UnitDetails) without the functionalities in UnitDataConsumerActionImpl .

(Also see comment in the implementation of createIndexDataRecordingAction.)

95

This is likely only useful for compiler invocation. I would put it in the compiler invocation code.

lib/Driver/Job.cpp
211

nit: Comment should start with an overview of what the function does.

Returns a directory path that is ...

Also, consider calling this getDirAdjacentToModCache. buildDir can be ambiguous.

220

Please clang-format the code. Without indentation, this looks like an no-op statement.

lib/Index/IndexingAction.cpp
93

Use class for interfaces.

93

Does CI here have to be the same instance as the one in createIndexASTConsumer ? Might worth documenting.

148

nit: Move this after Impl->createIndexASTConsumer(CI).

Do we need to reset this flag? Calling CreateASTConsumer multiple times on the same instance seems to be allowed?

256

This seems to be related to files. Maybe FileIndexDataCollector?

263

override

267

Simply begin, if the class is called FileIndexDataCollector .

Similar below to match iterator naming convention.

278

I think this should be public as this is still implementing IndexDataConsumer.

336

I'd simply do:

if FileIncludeFilter  == UnitIndexingOptions::FileIncludeFilterKind::UserOnly)
  if (isSystem...)
    return;
350

Same here. This should be public

367

The naming convention for the callback interfaces is forEach* e.g. forEachFileDependency.

s/visitor/Callback/
(same below).

371

forEachInclude

374

forEachModuleImport

382

This is two classes in one, which is difficult to understand. Could you split it into FileIndexDependencyCollector and FileIndexDependencyProvider and have FileIndexDependencyCollector returns a provider on finish (e.g. Provider consume();; you might want to copy/move the collected data into the provider). It would be easier to justify the behavior (e.g. what happens when you access the provider while collector is still working?)

386

What does Entries contain? What files are added?

514

Instead of passing ParentUnitConsumer, consider checking the Mod before calling the function.

517

Non-factory static method is often a code smell. Any reason not to make these static methods private members? With that, you wouldn't need to pass along so many parameters. You could make them const if you don't want members to be modified.

524

Why is this overload public while others are private? Aren't they all used only in this class?

544

Any reason to close the anonymous namespace here? Shouldn't outlined definitions of UnitDataConsumerActionImpl's methods also in the anonymous namespace?

771

I think the inheritance of IndexUnitDataConsumer and the creation of factory should be in user code (e.g. implementation for on-disk persist-index-data should come from the compiler invocation code ExecuteCompilerInvocation.cpp or at least a separate file in the library that compiler invocation can use), and the user should only use createUnitIndexingAction by providing a factory. Currently, createUnitIndexingAction and createIndexDataRecordingAction are mostly identical except for the code that implements IndexUnitDataConsumer and creates the factory.

The current createIndexDataRecordingAction would probably only used by the compiler invocation, and we can keep the generalized createUnitIndexingAction in the public APIs.

778

The UnitInfo is ignored? What do we actually need it for?

782

Base doesn't seem to be a very meaningful name here.

ioeric added inline comments.Dec 19 2017, 3:08 PM
include/clang/Index/IndexUnitDataConsumer.h
1 ↗(On Diff #127568)

Sorry, asking you to also rename IndexDataConsumer is probably too much and out of the scope of this patch. I'm fine with UnitIndexDataConsumer or UnitDataConsumer or something similar for now without touching IndexDataConsumer :)

ioeric requested changes to this revision.Jan 3 2018, 5:54 AM

(I think I forgot to update the patch status :)

This revision now requires changes to proceed.Jan 3 2018, 5:54 AM
nathawes planned changes to this revision.Jan 18 2018, 2:36 PM
nathawes marked 32 inline comments as done.

@ioeric I should have an updated patch up shortly with your inline comments addressed + new tests. Thanks again for reviewing!

include/clang/Index/IndexUnitDataConsumer.h
67 ↗(On Diff #127568)

From here, my understanding is that it's an optimization to avoid the vtable being included in multiple translation units. I'm not sure if that's actually a problem, I was just following IndexDataConsumer's lead. Added a comment.

include/clang/Index/IndexingAction.h
81

Sorry, I'm not sure what you mean here. Users shouldn't need to know anything about UnitDataConsumerActionImpl, they just need to provide a lambda/function reference that takes a CompilerInstance& and a UnitDetails and returns an IndexUnitDataConsumer (or UnitIndexDataConsumer once I rename it). This gets called once per translation unit to get a distinct data consumer for each unit, i.e. for the main translation unit as well as for each of its dependent modules that the main unit's data consumer says should be indexed via shouldIndexModuleDependency(...).

95

There's another public index:: API for writing out index data for individual clang module files in the follow up patch that takes a RecordingOptions and is used externally, from Swift. This function's useful on the Swift side to get the RecordingOptions from FrontendOptions it has already set up.

lib/Index/IndexingAction.cpp
148

Oops. Yes, we do :-)

517

Sorry, there's missing context – they're used from another public API that's in the follow-up patch. I'll bring that over and make these top-level static functions, since they don't belong exclusively to IndexDataConsumerActionImpl.

524

Same as above – this is called from a public index:: API in the follow-up patch.

771

IndexUnitDataRecorder here is just a stub I added when I split the patch up – the follow-up revision has it in a separate file. I'll move the separate files to this patch and stub out the method bodies with TODOs instead.

I've made createIndexDataRecordingAction call createUnitIndexingAction to remove the duplication, and pulled it, RecordingOptions and getRecordingOptionsFromFrontendOptions to a new header (RecordingAction.h) that ExecuteComilerInvocation.cpp uses. Does that sound ok?

778

It should be passed to IndexUnitDataRecorder to write out info about the unit itself. This was just me splitting the patch badly.

nathawes updated this revision to Diff 130496.Jan 18 2018, 3:01 PM
nathawes marked 6 inline comments as done.
  • Applied the various refactorings suggested by @ioeric
  • Extended c-index-test with a new option to print out the collected unit indexing data, and
  • Added tests for the unit indexing functionality using the new option
  • Fixed formatting
ioeric accepted this revision.Jan 19 2018, 4:10 AM

Nice! Thanks for making the refactoring and adding tests! I think this is good to go now.

I'm not very familiar with code outside of the index library (Driver, Basic etc), but the changes seem reasonable to me. Feel free to get another pair of eyes for them ;)

include/clang/Index/RecordingAction.h
42

Add a FIXME that this is not implemented yet.

lib/Index/IndexingAction.cpp
771

Sounds good. Thanks for the explanation!

This revision is now accepted and ready to land.Jan 19 2018, 4:10 AM

I'm wondering if there is any further plan for this? ;)

I'm wondering if there is any further plan for this? ;)

I'd like to comment on the amount of data that will be stored but that can be done outside this review. I still have a few things to figure out before reaching a conclusion.

@ioeric I'm working on a few other priorities over the next few weeks, sorry, but should get back to this relatively soon after that.
I would just land it, but I expect some downstream breakage I want to make sure I have time to fix.

@malaperle Sounds good – I'll keep an eye out for it!

@malaperle, to clarify we are not suggesting that you write your own parser, the suggestion is to use clang in 'fast-scan' mode to get the structure of the declarations of a single file, see CXTranslationUnit_SingleFileParse (along with enabling skipping of bodies). We have found clang is super fast when you only try to get the structure of a file like this.

Thank you, that sounds very useful. I will try that and get some measurements.

We can make convenient APIs to provide the syntactic structure of declarations based on their location.

Perhaps just for the end-loc since it's pretty much guaranteed to be needed by everyone. But if it's very straightforward, perhaps that's not needed. I'll try and see.

But let's say we added the end-loc, is it enough ? If you want to implement the 'peek the definition' like Eclipse, then it is not enough, you also need to figure out if there are documentation comments associated with the declaration and also show those. Also what if you want to highlight the type signature of a function, then just storing the location of the closing brace of its body is not enough. There can be any arbitrary things you may want to get from the structure of the declaration (e.g. the parameter ranges), but we could provide an API to gather any syntactic structure info you may want.

That's a very good point. I guess in the back of my mind, I have the worry that one cannot extend what is stored, either for a different performance trade-off or for additional things. The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend. But perhaps it's better to not overthink this for now.

I did a bit more of experimenting. For the end-loc, I changed my prototype so that the end-loc is not stored in the index but rather computed "on the fly" using SourceManager and Lexer only. For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.

With end-loc in index: 3.45s on average (20 samples)
With end-loc computed on the fly: 11.33s on average (20 samples)
I also tried with Xcode but without too much success: it took about 30 secs to reach 45K results and then carried on for a long time and hung (although I didn't try to leave it for hours to see if it finished).

From my perspective, it seems that the extra time is quite substantial and it doesn't seem worth to save an integer per occurrence in this case.

For computing the start/end-loc of function bodies, I tried the SingleFileParseMode and SkipFunctionBodies separately ( as a start). The source I use this on looks like this:

#include "MyClass.h"

MyClass::MyClass() {
}

void MyClass::doOperation() {
}

With SingleFileParseMode, I get several errors:

MyClass.cpp:5:1: error: use of undeclared identifier 'MyClass'
MyClass.cpp:8:6: error: use of undeclared identifier 'MyClass'

Then I cannot obtain any Decl* at the position of doOperation. With SingleFileParseMode, I'm also a bit weary that not processing headers will result in many inaccuracies. From our perspective, we are more wiling to sacrifice disk space in order to have more accuracy and speed. For comparison, the index I worked with containing all end-loc for occurrences and also function start/end is 201M for LLVM/Clang/Clangd which is small to us.

With SkipFunctionBodies alone, I can get the Decl* but FunctionDecl::getSourceRange() doesn't include the body, rather, it stops after the arguments.
It would be very nice if we could do this cheaply but it doesn't seem possible with those two flags alone. What did you have in mind for implementing an "API to gather any syntactic structure info" ?

For computing the start/end-loc of function bodies, I tried the SingleFileParseMode and SkipFunctionBodies separately ( as a start). The source I use this on looks like this:

Given the discussion in https://reviews.llvm.org/D44247, I think we can do without the start/end-loc of function bodies and try some heuristics client-side. We can always revisit this later if necessary.

However, for the end-loc of occurrences, would you be OK with this being added? I think it would be a good compromise in terms of performance, simplicity and index size.

For computing the start/end-loc of function bodies, I tried the SingleFileParseMode and SkipFunctionBodies separately ( as a start). The source I use this on looks like this:

Given the discussion in https://reviews.llvm.org/D44247, I think we can do without the start/end-loc of function bodies and try some heuristics client-side. We can always revisit this later if necessary.

However, for the end-loc of occurrences, would you be OK with this being added? I think it would be a good compromise in terms of performance, simplicity and index size.

@malaperle Just to clarify, what's the particular end-loc we're talking about here? e.g. for a function call, would this be the end of the function's name, or the closing paren?
For the end of the name, couldn't this be derived from the start loc + symbol name length (barring token pastes and escaped new lines in the middle of identifiers, which hopefully aren't too common)?
I can see the value for the closing paren though.

@akyrtzi Are the numbers from Marc-Andre's experiment what you'd expect to see and is there anything else to try? I'm not familiar with those modes at all to comment, sorry. I assume any API to gather syntactic structure info would be based on those modes, right?

@malaperle Just to clarify, what's the particular end-loc we're talking about here? e.g. for a function call, would this be the end of the function's name, or the closing paren?
For the end of the name, couldn't this be derived from the start loc + symbol name length (barring token pastes and escaped new lines in the middle of identifiers, which hopefully aren't too common)?
I can see the value for the closing paren though.

I mean the end of the name referencing the symbol, so that it can be highlighted properly when using the "find references in workspace" feature. There are cases where the name of the symbol itself is not present, for example "MyClass o1, o2;" (o1 and o2 reference the constructor), references to overloaded operators, etc.

@malaperle Just to clarify, what's the particular end-loc we're talking about here? e.g. for a function call, would this be the end of the function's name, or the closing paren?
For the end of the name, couldn't this be derived from the start loc + symbol name length (barring token pastes and escaped new lines in the middle of identifiers, which hopefully aren't too common)?
I can see the value for the closing paren though.

I mean the end of the name referencing the symbol, so that it can be highlighted properly when using the "find references in workspace" feature. There are cases where the name of the symbol itself is not present, for example "MyClass o1, o2;" (o1 and o2 reference the constructor), references to overloaded operators, etc.

Ah, I see – thanks! I was thinking all occurrences whose symbol name didn't actually appear at their location were marked with SymbolRole::Implicit, but that only seems to be true for the ObjC index data.

Hey Marc,

The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend

I don't think "not possible to extend" is quite correct, we can make it so that the format allows optional data to be recorded.

On the topic of recording the end-loc, I agree it's not much data overall, but it will be useful to examine the uses closely and to figure out whether it's really required and whether it is at the same time inadequate for other uses.

I changed my prototype so that the end-loc is not stored in the index but rather computed "on the fly" using SourceManager and Lexer only.

I assume you used SingleFileParseMode+SkipFunctionBodies for this, right ?

For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.

This is an interesting use case, and I can say we have some experience because Xcode has this functionality without requiring the end-loc for every reference.
So what it does is that it has a 'name' to look for (say 'foo' for the variable foo) and if it finds the name in the location then it highlights, otherwise if it doesn't find it (e.g. because it is an implicit reference) then it points to the location but doesn't highlight something. The same thing happens for operator overloads (the operators get highlighted at the reference location).
For implicit references it's most likely there's nothing to highlight so the end-loc will most likely be empty anyway (or same as start-loc ?) to indicate an empty range.

With SingleFileParseMode, I get several errors:

Good point, the parser definitely needs recovery improvements in C++.

With SkipFunctionBodies alone, I can get the Decl* but FunctionDecl::getSourceRange() doesn't include the body

This seems strange, there's an EndRangeLoc field that should have been filled in, not exactly sure if it is a bug or omission.

Going back to the topic of what use cases end-loc covers, note that it still seems inadequate for peek-definition functionality. You can't set it to body-end loc (otherwise occurrences highlighting would highlight the whole body which I think is undesirable) and you still need to include doc-comments if they exist.

Hey Marc,

The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend

I don't think "not possible to extend" is quite correct, we can make it so that the format allows optional data to be recorded.

That would be good. How would one go about asking Clang to generate this extra information? Would a Clang Plugin be suitable for this? I don't know much about those but perhaps that could be one way to extent the basic behavior of "-index_store_path" in this way?

I changed my prototype so that the end-loc is not stored in the index but rather computed "on the fly" using SourceManager and Lexer only.

I assume you used SingleFileParseMode+SkipFunctionBodies for this, right ?

No, sorry the end-locs I meant there is for occurrences. Only the lexer was needed to get the end of the token. So for "MyClass o1, o2;" o1 and o2 get highlighted as references to the MyClass constructor.

For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.

This is an interesting use case, and I can say we have some experience because Xcode has this functionality without requiring the end-loc for every reference.
So what it does is that it has a 'name' to look for (say 'foo' for the variable foo) and if it finds the name in the location then it highlights, otherwise if it doesn't find it (e.g. because it is an implicit reference) then it points to the location but doesn't highlight something.

I think it's useful to highlight something even when the name is not there. For example in "MyClass o1, o2;" it feels natural that o1 and o2 would get highlighted.

The same thing happens for operator overloads (the operators get highlighted at the reference location).

It does? I can only seem to do a textual search. For example, if I look at "FileId::operator<", if I right-click in the middle of "operator<" and do "Find selected symbol in workspace", it seems to start a text based search because there are many results that are semantically unrelated.

For implicit references it's most likely there's nothing to highlight so the end-loc will most likely be empty anyway (or same as start-loc ?) to indicate an empty range.

I think for those cases the end of the token is probably suitable. Can you give examples which implicit references you have in mind? Maybe another one (other than the constructor mentioned above) could be a function call like "passMeAStdString(MyStringRef)", here the "operator std::string" would be called and MyStringRef could be highlighted, I think it would make sense to the user that is gets called by passing this parameter by seeing the highlight.

Going back to the topic of what use cases end-loc covers, note that it still seems inadequate for peek-definition functionality. You can't set it to body-end loc (otherwise occurrences highlighting would highlight the whole body which I think is undesirable) and you still need to include doc-comments if they exist.

I think maybe I wasn't clear, I was thinking about two end-locs: end-locs of occurrences and end-locs of bodies. The end-loc of occurrences would be used for highlight when searching for all occurrences and the end-loc for bodies would be used for the peek definition. I think we can disregard end-locs of bodies for now.

That would be good. How would one go about asking Clang to generate this extra information? Would a Clang Plugin be suitable for this?

That's an interesting idea that we could explore, but I don't have much experience with that mechanism to comment on.

Only the lexer was needed to get the end of the token

Ok, that's interesting, not sure why Xcode is so fast to highlight, did you reuse same SourceManager/Lexer/buffers for occurrences from same file ? We'd definitely add the end-loc if we cannot come up with a mechanism to highlight fast enough without it.

I think it's useful to highlight something even when the name is not there. For example in "MyClass o1, o2;" it feels natural that o1 and o2 would get highlighted.

To clarify, even with implicit references the start loc points to something. In this case the implicit references can have start locs for the o1 and o2 identifiers and the end result for the UI will be the same (o1 and o2 get highlighted) even without having end-locs for all references.

It does? I can only seem to do a textual search.

The example I tried is the following. If you could file a bug report for the test case that did not work as you expected it would be much appreciated!

class Something1 {
public:
    Something1() {}
    ~Something1() {}
    operator int() {
        return 0;
    }

    friend int operator <<(Something1 &p, Something1 &p2) {
        return 0;
    }
};

void foo1(Something1 p1, Something1 p2) {
    p1 << p2;
    p1 << p2;
}

here the "operator std::string" would be called and MyStringRef could be highlighted

Even without end-loc, the start loc could point to MyStringRef and you could highlight it.

That would be good. How would one go about asking Clang to generate this extra information? Would a Clang Plugin be suitable for this?

That's an interesting idea that we could explore, but I don't have much experience with that mechanism to comment on.

Only the lexer was needed to get the end of the token

Ok, that's interesting, not sure why Xcode is so fast to highlight, did you reuse same SourceManager/Lexer/buffers for occurrences from same file ? We'd definitely add the end-loc if we cannot come up with a mechanism to highlight fast enough without it.

I don't think Xcode is quite fast, it's about 10 times slower (although I'm not sure it really finished) than when I use my branch that has the end-loc. I would try end-locs in Xcode if I could, to compare :) So I don't really know where the bottleneck is in Xcode. Comparing oranges to oranges, it's 4 times slower without end-locs compared to with end-locs on my branch. I does use the same SourceManager for the 46K references and I verified that it uses the same buffers, etc.
I'll put the numbers here again for readability.

For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.

With end-loc in index: 3.45s on average (20 samples)
With end-loc computed on the fly: 11.33s on average (20 samples)
I also tried with Xcode but without too much success: it took about 30 secs to reach 45K results and then carried on for a long time and hung (although I didn't try to leave it for hours to see if it finished).

I think it's useful to highlight something even when the name is not there. For example in "MyClass o1, o2;" it feels natural that o1 and o2 would get highlighted.

To clarify, even with implicit references the start loc points to something. In this case the implicit references can have start locs for the o1 and o2 identifiers and the end result for the UI will be the same (o1 and o2 get highlighted) even without having end-locs for all references.

It's the same but slower. IMO, the trade off is not great. It's entirely subjective but I think 4-10 times slower in order to save an integer per occurrence is not worth it from my point of view.

Even without end-loc, the start loc could point to MyStringRef and you could highlight it.

(Same here, it's doable but faster if already in the index.)

It does? I can only seem to do a textual search.

The example I tried is the following. If you could file a bug report for the test case that did not work as you expected it would be much appreciated!

Sure, I'll give that a try and isolate it as much as I can. BTW, does it work for you on the LLVM code base?