This is an archive of the discontinued LLVM Phabricator instance.

Add index-while-building support to Clang
AbandonedPublic

Authored by jkorous on Oct 18 2017, 6:09 AM.

Details

Summary

Adds a new -index-store-path option that causes Clang to additionally collect and output source code indexing information to the supplied path. This is done by wrapping the FrontendAction otherwise setup by the invocation with a WrappingIndexRecordAction. This action simply delegates to the wrapped action, but additionally multiplexes in its own IndexASTConsumer to collect symbol information from the AST and tracks the source file and module dependencies of the translation unit (via the IndexDependencyProvider class).

When the action completes, it then writes this information out to the supplied index store path in the form of a unit file, which stores the dependency information of the translation unit, and record files, that store the symbol and symbol occurrences seen in each source file. These are written out in the LLVM Bitstream format.

For a better (and more detailed) description of these changes, see the design document at: https://docs.google.com/document/d/1cH2sTpgSnJZCkZtJl1aY-rzy4uGPcrI-6RrUpdATO2Q/edit?usp=sharing

and the mailing list discussion 'RFC: Adding index-while-building support to Clang'

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
arphaman added inline comments.Oct 31 2017, 4:07 PM
lib/Index/IndexRecordHasher.cpp
104

Should you hash the return type as well?

205

You can use Qualifiers::Const here or make your own enum instead of raw constants.

nathawes planned changes to this revision.Oct 31 2017, 4:45 PM

Thanks @arphaman! I'll work through your comments and update.

include/clang/Index/IndexDataStoreSymbolUtils.h
14

They're used by IndexRecordWriter below to convert from libIndex's representation of things to the index store's.

nathawes added inline comments.Nov 6 2017, 6:49 PM
lib/Index/IndexRecordHasher.cpp
104

The return type doesn't affect the function's USR, so there's no need to consider it when hashing the function decl. The hashing here is happening per decl occurrence (source offset + role set + Decl) collected by the index AST walker, so changing the return type will still change the record hash when any decl occurrences it contains are hashed in.

nathawes updated this revision to Diff 121833.Nov 6 2017, 7:24 PM

Based on @arphaman's feedback:

  • Pulled the index store related diagnostics out into their own category/diagnostic group
  • Removed the CLANG_PROJECT_INDEX_PATH env var check.
  • Swapped "/" used in a few places as a separator/root with the equivalent llvm::sys::path call.
  • Fixed the typo/convention/documentation issues and simplifications pointed out so far
malaperle added inline comments.
lib/Index/IndexUnitWriter.cpp
213

extra semi-colon (noticed this warning while compiling)

malaperle added inline comments.Nov 7 2017, 8:58 PM
lib/Index/IndexingAction.cpp
559

As a first attempt, I tried to use index::createIndexDataRecordingAction in combination with ASTUnit::LoadFromCompilerInvocationAction but one problem is that right before it calls EndSourceFileAction in LoadFromCompilerInvocationAction, it calls transferASTDataFromCompilerInstance which means that the SourceManager in CompilerInstance is nulled out as it gets "transfered" to the AST. So this line crashes in this case. To be fair, at this point I don't need the ASTUnit so I can look at executing the action differently, but I thought I'd point it out!

hokein added a subscriber: hokein.Nov 8 2017, 4:41 AM
malaperle added inline comments.Nov 8 2017, 8:19 AM
lib/Index/IndexRecordWriter.cpp
156

I'm getting quite a bit of those while indexing Clangd, it looks like it comes from some LLVM/Support headers:
Index: Duplicate USR! c:@N@std@ST>2#NI#Nb@__try_lock_impl
Index: Duplicate USR! c:@N@llvm@ST>1#T@DenseMapInfo
Index: Duplicate USR! c:@N@llvm@ST>1#T@isPodLike
Index: Duplicate USR! c:@N@llvm@N@detail@ST>1#T@unit
Index: Duplicate USR! c:@N@llvm@ST>2#T#T@format_provider
Index: Duplicate USR! c:@N@llvm@ST>2#T#T@format_provider
Index: Duplicate USR! c:@N@llvm@ST>1#T@PointerLikeTypeTraits
Index: Duplicate USR! c:@N@llvm@ST>1#T@simplify_type
Index: Duplicate USR! c:@N@std@ST>1#T@atomic
Index: Duplicate USR! c:@N@llvm@ST>1#T@isPodLike
Index: Duplicate USR! c:@N@llvm@ST>1#T@DenseMapInfo

I think it would be good to have the file name at least in the log. I also assume those duplication are issues that would have to be fixed in USRGenerator (i.e. in separate patches) ?

malaperle added inline comments.Nov 8 2017, 1:40 PM
include/indexstore/IndexStoreCXX.h
75

I know this is an old revision but I thought I should ask for the future patch... Would it be possible to not use "blocks"? This will affect portability of the code. I'm not familiar with blocks but I would think it would be possible to replace with C++11 lambdas, or something else that's standard. I was just playing around with this code and since I used GCC, it did not work.

ioeric added a subscriber: ioeric.Nov 9 2017, 12:58 AM
ioeric added a comment.Nov 9 2017, 6:54 AM

I think this patch should be split into a number of smaller patches to help the review process.

Things like tools/IndexStore, DirectoryWatcher and other components that are not directly needed right now should definitely be in their own patches.
It would be nice to find some way to split the implementation into multiple patches as well.

+1.

This is a lot of work (but great work!) for one patch. Smaller/incremental patches help reviewers understand and (hopefully) capture potential improvement of the design. I would really appreciate it if you could further split the patch.

Some comments/ideas:

  • The lack of tests is a bit concerning.
  • I think the implementation of the index output logic (e.g. IndexUnitWriter and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.
  • I would suggest that you start with a patch that implement the index action and just enough components so that you could test the action.

Thanks!

nathawes planned changes to this revision.Nov 9 2017, 2:46 PM

I think this patch should be split into a number of smaller patches to help the review process.

Things like tools/IndexStore, DirectoryWatcher and other components that are not directly needed right now should definitely be in their own patches.
It would be nice to find some way to split the implementation into multiple patches as well.

+1.

This is a lot of work (but great work!) for one patch. Smaller/incremental patches help reviewers understand and (hopefully) capture potential improvement of the design. I would really appreciate it if you could further split the patch.

Thanks for taking a look @ioeric! I'll have a go at splitting it further.

Some comments/ideas:

  • The lack of tests is a bit concerning.

I moved all the code for reading the index store data into a separate patch (to come after this one) in order to slim this one down for review, and most of the tests went with it because they're based around reading and dumping the stored data for FileCheck. The original version of this patch has them all (https://reviews.llvm.org/D39050?id=118854). The ones that remain here are just those checking that the unit/record files are written out and that the hashing mechanism is producing distinct record files when the symbolic content of the source file changes.

  • I think the implementation of the index output logic (e.g. IndexUnitWriter and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.

The added IndexRecordAction and existing IndexAction use the same functionality from libIndex to collect the indexing data, so I'm not sure mocking the unit writer to unit test IndexRecordAction would add very much value – writing the index data out is the new behavior. The existing tests for IndexAction (under test/Index/Core) are already covering the correctness of the majority of the collected indexing info and the tests coming in the follow-up patch (seen in the original version of this patch) test it's still correct after the write/read round trip.

  • I would suggest that you start with a patch that implement the index action and just enough components so that you could test the action.

Thanks!

phosek added a subscriber: phosek.Nov 9 2017, 10:15 PM
  • I think the implementation of the index output logic (e.g. IndexUnitWriter and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.

The added IndexRecordAction and existing IndexAction use the same functionality from libIndex to collect the indexing data, so I'm not sure mocking the unit writer to unit test IndexRecordAction would add very much value – writing the index data out is the new behavior. The existing tests for IndexAction (under test/Index/Core) are already covering the correctness of the majority of the collected indexing info and the tests coming in the follow-up patch (seen in the original version of this patch) test it's still correct after the write/read round trip.

Thanks for the clarification! I still think it's useful to decouple the IndexAction from the bit format file so that it could be reusable elsewhere. For example, I can see the index action be useful to clangd for building in-memory index.

I also tried applying your original patch locally but couldn't get it to work mostly due to portability issues (e.g. blocks and if (APPLE) in make files). AFAIK, many folks compile clang with GCC and/or without APPLE, so it's important that you get the portability right from the very beginning. Thanks!

Index-while-build is awesome! I'm looking forward to your patches!

akyrtzi edited edge metadata.Nov 10 2017, 5:45 PM

Hey Eric,

  • I think the implementation of the index output logic (e.g. IndexUnitWriter and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.

The added IndexRecordAction and existing IndexAction use the same functionality from libIndex to collect the indexing data, so I'm not sure mocking the unit writer to unit test IndexRecordAction would add very much value – writing the index data out is the new behavior. The existing tests for IndexAction (under test/Index/Core) are already covering the correctness of the majority of the collected indexing info and the tests coming in the follow-up patch (seen in the original version of this patch) test it's still correct after the write/read round trip.

Thanks for the clarification! I still think it's useful to decouple the IndexAction from the bit format file so that it could be reusable elsewhere. For example, I can see the index action be useful to clangd for building in-memory index.

As Nathan mentioned, we believe the indexing action, as it exists in the trunk, is decoupled enough to be useful, for example Marc was already able to use it and write out the indexing data in a completely different format for his fork of clangd. Of course, we are definitely interested in any additional refactorings that would structure things better and we are eager to see and discuss follow-up patches from anyone that is interested in improving the code, but could we treat this as potential follow-up improvements ?

We are eager to provide the functionality so others can start experimenting with it; I'd propose that we discuss ideas for refactoring of the code as follow-up, what do you think ? Getting the initial functionality in and iterating on it, while getting more experience with applying it on various use-cases, is a common operating mindset of the llvm/clang projects.

I also tried applying your original patch locally but couldn't get it to work mostly due to portability issues (e.g. blocks and if (APPLE) in make files). AFAIK, many folks compile clang with GCC and/or without APPLE, so it's important that you get the portability right from the very beginning. Thanks!

Nathan will look into making using blocks optional, providing additional function pointer+context APIs where appropriate and having the common implementation using lambdas.
For the APPLE specific parts it, the only specific darwin-specific part is the part using FSEvents, the other 'if (APPLE)' checks can likely be removed. We would generally need help from people with linux expertise to provide the 'FSEvents' equivalent functionality but this is a small part of the overall feature, it's not important for getting the index-while-building data.

But these things are not part of the current patch, we can discuss again with the follow-up patches that will contain those things.

Index-while-build is awesome! I'm looking forward to your patches!

Hey Eric,

  • I think the implementation of the index output logic (e.g. IndexUnitWriter and bit format file) can be abstracted away (and split into separate patches) so that you can unit-test the action with a custom/mock unit writer; this would also make the action reusable with (potentially) other storage formats.

The added IndexRecordAction and existing IndexAction use the same functionality from libIndex to collect the indexing data, so I'm not sure mocking the unit writer to unit test IndexRecordAction would add very much value – writing the index data out is the new behavior. The existing tests for IndexAction (under test/Index/Core) are already covering the correctness of the majority of the collected indexing info and the tests coming in the follow-up patch (seen in the original version of this patch) test it's still correct after the write/read round trip.

Thanks for the clarification! I still think it's useful to decouple the IndexAction from the bit format file so that it could be reusable elsewhere. For example, I can see the index action be useful to clangd for building in-memory index.

As Nathan mentioned, we believe the indexing action, as it exists in the trunk, is decoupled enough to be useful, for example Marc was already able to use it and write out the indexing data in a completely different format for his fork of clangd. Of course, we are definitely interested in any additional refactorings that would structure things better and we are eager to see and discuss follow-up patches from anyone that is interested in improving the code, but could we treat this as potential follow-up improvements ?

We are eager to provide the functionality so others can start experimenting with it; I'd propose that we discuss ideas for refactoring of the code as follow-up, what do you think ? Getting the initial functionality in and iterating on it, while getting more experience with applying it on various use-cases, is a common operating mindset of the llvm/clang projects.

To be honest, I want this functionality to get in as much as you do, and I'm more than happy to prioritize the code review for it :) But the current patch size makes the reviewing really hard (e.g. I would never have caught the BLOCK issues hadn't I tried running the original patch myself). I'm not sure if it's really a common practice to check in a big chunk of code without careful code review and leave potential improvements as followups. I'm sure @klimek would have thoughts about this.

If the index action is already flexible enough, would you mind splitting the code for the index action out so that we can start reviewing it? Given that the current patch has very few tests, I guess it wouldn't be too much worse to split out the action without proper test.

To be honest, I want this functionality to get in as much as you do, and I'm more than happy to prioritize the code review for it :) But the current patch size makes the reviewing really hard (e.g. I would never have caught the BLOCK issues hadn't I tried running the original patch myself). I'm not sure if it's really a common practice to check in a big chunk of code without careful code review and leave potential improvements as followups. I'm sure @klimek would have thoughts about this.

To be clear, I didn't mean to imply we don't want careful code review, we are really happy for people to point out issues. For example the building problems on linux are serious issues that we will fix and we are grateful for your feedback!

If the index action is already flexible enough, would you mind splitting the code for the index action out so that we can start reviewing it? Given that the current patch has very few tests, I guess it wouldn't be too much worse to split out the action without proper test.

To clarify, the index action Nathan and I are referring to, is the indexing action that exists currently in trunk and is the source of the index symbols, feeding index symbols to an abstract IndexDataConsumer. See here: https://llvm.org/svn/llvm-project/cfe/trunk/include/clang/Index/IndexingAction.h
This is what Marc used to get the index symbols and store them in his own format. Tests for this functionality are in: https://llvm.org/svn/llvm-project/cfe/trunk/test/Index/Core/

If the index action is already flexible enough, would you mind splitting the code for the index action out so that we can start reviewing it? Given that the current patch has very few tests, I guess it wouldn't be too much worse to split out the action without proper test.

To clarify, the index action Nathan and I are referring to, is the indexing action that exists currently in trunk and is the source of the index symbols, feeding index symbols to an abstract IndexDataConsumer. See here: https://llvm.org/svn/llvm-project/cfe/trunk/include/clang/Index/IndexingAction.h
This is what Marc used to get the index symbols and store them in his own format. Tests for this functionality are in: https://llvm.org/svn/llvm-project/cfe/trunk/test/Index/Core/

Ah, sorry, I was referring to IndexRecordAction and its friends (record readers/writers). I didn't notice the newly added index action added and really didn't mean to ask you to refactor the existing code. Apologies for the miscommunication!

What I wanted to proposed is that we could decouple reading/writing of record/unit from the bit file format, so that the record output is not tied to a single output format (e.g. bit format, directory-based) and thus make the compiler more flexible. This might already be the case, but it's not really easy to tell from the current patch...

Hi! I got a bit further in my experiment in integrating this in Clangd. I put some comments (in the first more complete revision). But since the scope of this patch changed, if you feel like we should take the discussions elsewhere, please let me know! Thanks!

include/indexstore/IndexStoreCXX.h
84

From what I understand, this returns the beginning of the occurrence. It would be useful to also have the end of the occurrence. From what I tested in Xcode, when you do "Find Selected Symbol in Workspace", it highlights the symbol name in yellow in the list of matches, so it mush use that LineCol then highlight the matching name. This is works in many situations but others occurrences won't have the name of the symbol. For example:
"MyClass o1, o2;"
If I use "Find Selected Symbol in Workspace" on MyClass constructor, if won't be able to highlight o1 and o2
Do you think it would be possible to add that (EndLineCol)? If not, how would one go about extending libindexstore in order to add additional information per occurrence? It is not obvious to me so far.
We also need other things, for example in the case of definitions, we need the full range of the definition so that users can "peek" at definitions in a pop-up. Without storing the end of the definition in the index, we would need to reparse the file.

374

As part of this dependency tracking mechanism, I haven't found that it could provide information about about the files including a specific header. So given a header (or source file in odd cases), I do not see a way to get all the files that would need to be reindexed if that header changed. Is that something you achieve outside the index? Or perhaps this is something I missed in the code.

377

Could there be a bit of explanation about what's a File dependency versus record and unit? All units and records are file dependencies, right? Are there any files that are neither records or units?

Thanks for the feedback @malaperle!

include/indexstore/IndexStoreCXX.h
84

Our approach to related locations (e.g. name, signature, body, and doc comment start/end locs) has been to not include them in the index and derive them from the start location later. There's less data to collect and write out during the build that way, and deriving the other locations isn't that costly usually, as in most cases you 1) don't need to type check or even preprocess to get the related locations, as with finding the end of o1 and o2 in your example, and 2) usually only need to derive locations for a single or small number of occurrences, like when 'peeking' at a definition.

Are there cases where you think this approach won't work/perform well enough for the indexer-backed queries clangd needs to support?

374

The unit files store the path of the header/source files they depend on as 'File' dependencies. So any unit file with 'File' dependency on header/source file that was modified may need to be re-indexed.

To support finding which specific files include or are included by a given header (rather than which units somehow transitively include it) we also store the file-to-file inclusions in the unit file (retrieved via IndexUnitReader's foreachInclude method below).

377

I'll rename this to SourceFile and add some comments to explain. Unit file dependencies separate the source dependencies into 'File' dependencies and 'Record' dependencies. The 'File' dependencies track the paths of the header/source files seen in the translation unit, while the 'Record' dependencies track which record files have the symbolic content seen in those source files – the header/source file path doesn't appear anywhere in the record file. This separation lets us depend on a single record file corresponding to multiple source files (e.g. when two source files have the same symbolic content), and on a single source file corresponding multiple record files (e.g. when a single header is included multiple times with different preprocessor contexts changing its symbolic content).

ioeric edited edge metadata.Nov 28 2017, 3:16 PM

First round of comments. Mostly around indexing actions and file records; I haven't started reviewing the data writing and storing code. I think it might make sense to split the index writing and storing logics into a separate patch, which should be possible if writeUnitData is abstracted into an interface.

include/clang/Frontend/CompilerInstance.h
188

It might make sense to define an alias for std::function<std::unique_ptr<FrontendAction>(const FrontendOptions &opts, std::unique_ptr<FrontendAction> action)>, which is used multiple times.

include/clang/Frontend/FrontendOptions.h
262

It might make sense to also have documentations for these options here.

include/indexstore/IndexStoreCXX.h
84

I agree that we should try to keep the serialized symbol size minimal, but I think it's worthwhile to also store end locations for occurrences because 1) it's cheap, 2) it's not necessary easy to compute without parsing for occurrences like a::b::c or a::b<X>, 3) it would be useful for many LSP use cases.

lib/Frontend/CompilerInstance.cpp
1175

nit: no braces around one liners.

lib/FrontendTool/ExecuteCompilerInvocation.cpp
170

Could you comment on what this does? The Act above is already wrapped. Why do we need setGenModuleActionWrapper to createIndexDataRecordingAction again? Also, createIndexDataRecordingAction doesn't seem related to GenModule.

lib/Index/FileIndexRecord.cpp
24

Why?

38

Please comment when this would happen.

40

Why do we need Decls to be sorted by offset? If we want this for printing, it might make sense to just do a sort there.

lib/Index/FileIndexRecord.h
25

Please add documentation.

42

Is this clang-formatted? You might want to run git-clang-format on the whole patch.

lib/Index/IndexingAction.cpp
281

Again, you don't need the full IndexingContext and RecordOptions here.

291

Note that getDecomposedExpansionLoc can also return invalid decomposed loc.

305

Do we want better error handling here?

324

Please provide documentation.

501

Can we get this state from the base class instead of maintaining a another state, which seems to be identical?

521

Just StringRef BuildNumber = RepositoryPath;

698

Please provide a brief documentation for this class.

700

Again, it doesn't seem necessary for this class to have information about all record options. It seems that you only need RecordSystemDependencies here.

710

readability nit: avoid using auto if the return type is short to spell but hard to infer from the value expression. Same else where.

759

Could you add a comment explaining why we are not allowing searching.

766

It's a bit worrying that IndexDataRecorder and IndexContext reference each other. If you only need some information from the IndexingContext, simply pass it into Recorder. In this case, I think you only need the SourceManager from the ASTContext in the recorder to calculate whether a file is a system header. I see you also cache result of IndexingContext::isSystemFile in the indexing context, but I think it would be more sensible for the callers to handle caching for this call.

768

nit: no braces around one liners.

776

nit: redundant empty line

834

Just auto pair = getIndexOptionsFromFrontendOptions(FEOpts); and then use pair.first and pair.second?

Same below.

lib/Index/IndexingContext.h
39

Please define the scope of this class to avoid throwing random states into it, which usually happens to a "context" class.

malaperle added inline comments.Dec 7 2017, 9:53 AM
include/indexstore/IndexStoreCXX.h
84

There's a few reason I think it's better to store the end loc.

  • When doing "find references", computing the end loc of each occurrence will be costly. Imagine having thousands of occurrences and for each of them having to run logic to find the end of the occurrence.
  • The AST and preprocessor are the best tools I know to figure out the proper end loc. Not using them means having to write a mini-preprocessor with some knowledge about the language semantics to cover some cases.
MyClass |o1, o2;

Here, I have to stop at the comma. So it's basically take any alpha-numerical character, right?

bool operator<(const Foo&, const Foo&)
Ret operator()(Params ...params)

No, in those cases, we have to take < and the first ().

  • In the case of body start/end locations, similarly, it can be non-trivial.
void foo() {
  if (0) {
  }
}

We have to count the balanced { } until we finish the body.

#define FUNC_BODY {\
\
}
void foo() FUNC_BODY

Oops, where's the body? We need another special logic for this, etc.

I think overall, it puts a lot of burden on the client of libIndexStore, burden that would be much more work and more inaccurate than using the AST/Preprocessor while indexing.

374

Thanks! I'll play around with this a bit more with this new information.

377

It's more clear now, thanks!

@malaperle, to clarify we are not suggesting that you write your own parser, the suggestion is to use clang in 'fast-scan' mode to get the structure of the declarations of a single file, see CXTranslationUnit_SingleFileParse (along with enabling skipping of bodies). We have found clang is super fast when you only try to get the structure of a file like this. We can make convenient APIs to provide the syntactic structure of declarations based on their location.

But let's say we added the end-loc, is it enough ? If you want to implement the 'peek the definition' like Eclipse, then it is not enough, you also need to figure out if there are documentation comments associated with the declaration and also show those. Also what if you want to highlight the type signature of a function, then just storing the location of the closing brace of its body is not enough. There can be any arbitrary things you may want to get from the structure of the declaration (e.g. the parameter ranges), but we could provide an API to gather any syntactic structure info you may want.

I would encourage you to try CXTranslationUnit_SingleFileParse|CXTranslationUnit_SkipFunctionBodies, you will be pleasantly surprised with how fast this mode is. The c-index-test option is -single-file-parse.

nathawes added inline comments.Dec 7 2017, 3:42 PM
lib/FrontendTool/ExecuteCompilerInvocation.cpp
170

It's to wrap any GenerateModuleActions that get created as needed when/if Act ends up loading any modules, so that we output index data for them too. I'll add a comment.

lib/Index/FileIndexRecord.cpp
40

It's mostly for when we hash them, so that ordering doesn't change the hash, but it's also for printing. The IndexASTConsumer doesn't always report symbol occurrences in source order, due to the preprocessor and a few other cases.
We can sort them when the IndexRecordDataConsumer's finish() is called rather than as they're added to avoid the copying from repeated insert calls if that's the concern.

lib/Index/IndexingAction.cpp
501

I don't see this state in either base class (WrapperFrontendAction and IndexRecordActionBase). WrappingIndexAction and WrappingIndexRecordAction both have this, though. Were you thinking a new intermediate common base class between them and WrapperFrontendAction?

766

Good point. The IndexingContext was actually already calling IsSystemFile before it calls IndexDataRecorder's handleDeclOccurrence and handleModuleOccurrence anyway, so I'll change it to pass that through as an extra param and remove IndexDataRecorder's dependency on the IndexingContext.

nathawes updated this revision to Diff 126065.Dec 7 2017, 4:02 PM

Worked through the comments from @ioeric and split the code for writing out the collected indexing data into a separate patch.

@malaperle, to clarify we are not suggesting that you write your own parser, the suggestion is to use clang in 'fast-scan' mode to get the structure of the declarations of a single file, see CXTranslationUnit_SingleFileParse (along with enabling skipping of bodies). We have found clang is super fast when you only try to get the structure of a file like this.

Thank you, that sounds very useful. I will try that and get some measurements.

We can make convenient APIs to provide the syntactic structure of declarations based on their location.

Perhaps just for the end-loc since it's pretty much guaranteed to be needed by everyone. But if it's very straightforward, perhaps that's not needed. I'll try and see.

But let's say we added the end-loc, is it enough ? If you want to implement the 'peek the definition' like Eclipse, then it is not enough, you also need to figure out if there are documentation comments associated with the declaration and also show those. Also what if you want to highlight the type signature of a function, then just storing the location of the closing brace of its body is not enough. There can be any arbitrary things you may want to get from the structure of the declaration (e.g. the parameter ranges), but we could provide an API to gather any syntactic structure info you may want.

That's a very good point. I guess in the back of my mind, I have the worry that one cannot extend what is stored, either for a different performance trade-off or for additional things. The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend. But perhaps it's better to not overthink this for now.

Thanks a lot for the changes! Some more comments inlined.

Please mark addressed comments as done so that reviewers could know what to look :) Thanks!

include/clang/Frontend/CompilerInstance.h
187

nit: LLVM variable names start with upper-case letters.

include/clang/Index/IndexingAction.h
34

This should be removed?

Some forward declarations above are not used as well.

lib/Driver/Job.cpp
293

Could you share this code with line 278 above, which already has a nice comment?

lib/Index/FileIndexRecord.cpp
40

I would leave the sorting to the point where records are hashed to avoid making the record stateful. Consider changing getDeclOccurrences to getOccurrencesSortedByOffset; this should make the behavior more explicit.

lib/Index/FileIndexRecord.h
52

s/isSystem/IsSystem/

Also, I wonder if we can filter out system decls proactively and avoid creating file index record for them. We could also avoid propogating IsSystem here.

lib/Index/IndexingAction.cpp
362

IsSystemFileCache &SysrootPath? What is this parameter?

451

Please document this class. This can be easily confused with IndexActionBase which has a similar name. Same for IndexAction/IndexRecordAction and WrappingIndexRecordAction/WrappingIndexRecordAction. I think these pairs share (especially the wrapping actions) some common logics and could probably be merged.

477

This does a lot of stuff... please document the behavior!

501

I thought this could be a state in the WrapperFrontendAction since both derived classes maintain this state, but after a closer look, this seems to depend on both base classes. I'm not a big fun of maintaining states in multi-stage classes (e.g. FrontendAction), which could be confusing and hard to follow; I think IndexRecordActionBase::finish(...) should be able to handle the case where no index consumer is created (i.e. no record/dependency/... is collected).

Also, IndexRecordActionBase (and the existing IndexActionBase ) should really be a component instead of a base class since none of its methods is virtual.

569

nit: no need for braces. Same below.

581

In the previous patch, writeUnitData does several things including handling modules, dependencies, includes and index records, as well as writing data. It might make sense to add an abstract class (UnitDataCollector?) that defines interfaces which make these behavior more explicit. We can then have users pass in an implementation via createIndexDataRecordingAction which would also decouple the data collection from data storage in the library.

616

I'm a bit nervous about propagating the entire FrontendOptions into the index library. I would simply expose getIndexOptionsFromFrontendOptions and have callers parse FrontendOptions and pass in only index-related options.

lib/Index/IndexingContext.h
41

This name is really confusing... Is* is usually used for booleans. Simply call this SystemFileCache.

42

I think it would be more straightforward to have context own the cache. If setSysrootPath is the problem, it might make sense to propagate it via the context or, if necessary, create a new cache when a new SysrootPath is set.

53

How does this affect the existing cached results? Do you need to invalidate them?

nathawes planned changes to this revision.Dec 12 2017, 11:51 AM

Thanks for taking another look @ioeric – I'll work through your comments and update.

nathawes marked 45 inline comments as done.Dec 18 2017, 2:05 PM
nathawes added inline comments.
lib/Index/FileIndexRecord.h
52

If the -index-ignore-system-symbols flag is set system decls are filtered out in IndexingContext::handleDeclOccurrence and aren't reported to the IndexDataConsumer, so FileIndexRecords won't be created. The IsSystem here is for clients that want index data for system files, but want to be able to distinguish them from regular files.

nathawes updated this revision to Diff 127412.Dec 18 2017, 2:58 PM
nathawes marked an inline comment as done.

I've refactored the indexing/dependency data collection out from the writing with the new IndexUnitDataConsumer class, and made other smaller changes to address the feedback from @ioeric.

nathawes updated this revision to Diff 127568.EditedDec 19 2017, 10:50 AM

Fix out of date header comment in FileIndexData.h

Thanks a lot for further cleaning up the patch! It is now much easier to review. I really appreciate it!

Some more comments on the public APIs and the layering of classes. There are a lot of helper classes in the implementation, so I think it's important to get a clear layering so that they could be easily understood by future contributors :)

Also, with the IndexUnitDataConsumer abstraction, it seems to be feasible now to add some unit tests for createUnitIndexingAction. With all the recent major changes, I think it's important that we have some degree of testing to make sure components actually work together.

include/clang/Frontend/CompilerInstance.h
187

opts and action are still lower-case.

include/clang/Index/DeclOccurrence.h
38 ↗(On Diff #127412)

Nit: indentation.

Tip: git-clang-format against the diff base can format all changed lines in your patch.

include/clang/Index/IndexUnitDataConsumer.h
1 ↗(On Diff #127568)

IIUC, this is the index data for a translation unit, as opposed to an AST. If so, consider calling this UnitIndexDataConsumer to match (AST)IndexDataConsumer which is parallel to this. We might want to rename them to be either index::UnitDataConsumer vs index::ASTDataConsumer or index::UnitIndexDataConsumer vs index::ASTIndexDataConsumer . I am inclined to the first pair as index is already implied in the namespace.

67 ↗(On Diff #127568)

Comment? Why do we actually need this?

include/clang/Index/IndexingAction.h
48

We are now mixing functionalities for Unit indexing and AST indexing actions in the same file. We might want to separate these into two headers e..g UnitIndexingAction.h and ASTIndexingAction.h. This would make it easier for users to find the right functions :)

65

Please add documentation for each field. It's not trivial what each field is for, especially some fields seem to be optional and some seem to be mutually exclusive.

66

These pointers suggest the life time of this struct is tied to some other struct, which makes the struct look a bit dangerous to use. Should we also carry a reference or a smart pointer to the underlying object that keeps these pointers valid? Would it be a CompilerInstance (guessing from IndexUnitDataConsumerFactory )?

95

What is the intended user of this function? It's unclear how users could obtain a ConsumerFactory (i.e. UnitDetails) without the functionalities in UnitDataConsumerActionImpl .

(Also see comment in the implementation of createIndexDataRecordingAction.)

109

This is likely only useful for compiler invocation. I would put it in the compiler invocation code.

lib/Driver/Job.cpp
211

nit: Comment should start with an overview of what the function does.

Returns a directory path that is ...

Also, consider calling this getDirAdjacentToModCache. buildDir can be ambiguous.

220

Please clang-format the code. Without indentation, this looks like an no-op statement.

lib/Index/IndexingAction.cpp
90

Use class for interfaces.

98

Does CI here have to be the same instance as the one in createIndexASTConsumer ? Might worth documenting.

136

nit: Move this after Impl->createIndexASTConsumer(CI).

Do we need to reset this flag? Calling CreateASTConsumer multiple times on the same instance seems to be allowed?

221

This seems to be related to files. Maybe FileIndexDataCollector?

228

override

232

Simply begin, if the class is called FileIndexDataCollector .

Similar below to match iterator naming convention.

243

I think this should be public as this is still implementing IndexDataConsumer.

301

I'd simply do:

if FileIncludeFilter  == UnitIndexingOptions::FileIncludeFilterKind::UserOnly)
  if (isSystem...)
    return;
315

Same here. This should be public

332

The naming convention for the callback interfaces is forEach* e.g. forEachFileDependency.

s/visitor/Callback/
(same below).

336

forEachInclude

339

forEachModuleImport

347

This is two classes in one, which is difficult to understand. Could you split it into FileIndexDependencyCollector and FileIndexDependencyProvider and have FileIndexDependencyCollector returns a provider on finish (e.g. Provider consume();; you might want to copy/move the collected data into the provider). It would be easier to justify the behavior (e.g. what happens when you access the provider while collector is still working?)

351

What does Entries contain? What files are added?

479

Instead of passing ParentUnitConsumer, consider checking the Mod before calling the function.

482

Non-factory static method is often a code smell. Any reason not to make these static methods private members? With that, you wouldn't need to pass along so many parameters. You could make them const if you don't want members to be modified.

489

Why is this overload public while others are private? Aren't they all used only in this class?

509

Any reason to close the anonymous namespace here? Shouldn't outlined definitions of UnitDataConsumerActionImpl's methods also in the anonymous namespace?

736

I think the inheritance of IndexUnitDataConsumer and the creation of factory should be in user code (e.g. implementation for on-disk persist-index-data should come from the compiler invocation code ExecuteCompilerInvocation.cpp or at least a separate file in the library that compiler invocation can use), and the user should only use createUnitIndexingAction by providing a factory. Currently, createUnitIndexingAction and createIndexDataRecordingAction are mostly identical except for the code that implements IndexUnitDataConsumer and creates the factory.

The current createIndexDataRecordingAction would probably only used by the compiler invocation, and we can keep the generalized createUnitIndexingAction in the public APIs.

743

The UnitInfo is ignored? What do we actually need it for?

747

Base doesn't seem to be a very meaningful name here.

ioeric added inline comments.Dec 19 2017, 3:08 PM
include/clang/Index/IndexUnitDataConsumer.h
1 ↗(On Diff #127568)

Sorry, asking you to also rename IndexDataConsumer is probably too much and out of the scope of this patch. I'm fine with UnitIndexDataConsumer or UnitDataConsumer or something similar for now without touching IndexDataConsumer :)

ioeric requested changes to this revision.Jan 3 2018, 5:54 AM

(I think I forgot to update the patch status :)

This revision now requires changes to proceed.Jan 3 2018, 5:54 AM
nathawes planned changes to this revision.Jan 18 2018, 2:36 PM
nathawes marked 32 inline comments as done.

@ioeric I should have an updated patch up shortly with your inline comments addressed + new tests. Thanks again for reviewing!

include/clang/Index/IndexUnitDataConsumer.h
67 ↗(On Diff #127568)

From here, my understanding is that it's an optimization to avoid the vtable being included in multiple translation units. I'm not sure if that's actually a problem, I was just following IndexDataConsumer's lead. Added a comment.

include/clang/Index/IndexingAction.h
95

Sorry, I'm not sure what you mean here. Users shouldn't need to know anything about UnitDataConsumerActionImpl, they just need to provide a lambda/function reference that takes a CompilerInstance& and a UnitDetails and returns an IndexUnitDataConsumer (or UnitIndexDataConsumer once I rename it). This gets called once per translation unit to get a distinct data consumer for each unit, i.e. for the main translation unit as well as for each of its dependent modules that the main unit's data consumer says should be indexed via shouldIndexModuleDependency(...).

109

There's another public index:: API for writing out index data for individual clang module files in the follow up patch that takes a RecordingOptions and is used externally, from Swift. This function's useful on the Swift side to get the RecordingOptions from FrontendOptions it has already set up.

lib/Index/IndexingAction.cpp
136

Oops. Yes, we do :-)

482

Sorry, there's missing context – they're used from another public API that's in the follow-up patch. I'll bring that over and make these top-level static functions, since they don't belong exclusively to IndexDataConsumerActionImpl.

489

Same as above – this is called from a public index:: API in the follow-up patch.

736

IndexUnitDataRecorder here is just a stub I added when I split the patch up – the follow-up revision has it in a separate file. I'll move the separate files to this patch and stub out the method bodies with TODOs instead.

I've made createIndexDataRecordingAction call createUnitIndexingAction to remove the duplication, and pulled it, RecordingOptions and getRecordingOptionsFromFrontendOptions to a new header (RecordingAction.h) that ExecuteComilerInvocation.cpp uses. Does that sound ok?

743

It should be passed to IndexUnitDataRecorder to write out info about the unit itself. This was just me splitting the patch badly.

nathawes updated this revision to Diff 130496.Jan 18 2018, 3:01 PM
nathawes marked 6 inline comments as done.
  • Applied the various refactorings suggested by @ioeric
  • Extended c-index-test with a new option to print out the collected unit indexing data, and
  • Added tests for the unit indexing functionality using the new option
  • Fixed formatting
ioeric accepted this revision.Jan 19 2018, 4:10 AM

Nice! Thanks for making the refactoring and adding tests! I think this is good to go now.

I'm not very familiar with code outside of the index library (Driver, Basic etc), but the changes seem reasonable to me. Feel free to get another pair of eyes for them ;)

include/clang/Index/RecordingAction.h
42 ↗(On Diff #130496)

Add a FIXME that this is not implemented yet.

lib/Index/IndexingAction.cpp
736

Sounds good. Thanks for the explanation!

This revision is now accepted and ready to land.Jan 19 2018, 4:10 AM

I'm wondering if there is any further plan for this? ;)

I'm wondering if there is any further plan for this? ;)

I'd like to comment on the amount of data that will be stored but that can be done outside this review. I still have a few things to figure out before reaching a conclusion.

@ioeric I'm working on a few other priorities over the next few weeks, sorry, but should get back to this relatively soon after that.
I would just land it, but I expect some downstream breakage I want to make sure I have time to fix.

@malaperle Sounds good – I'll keep an eye out for it!

@malaperle, to clarify we are not suggesting that you write your own parser, the suggestion is to use clang in 'fast-scan' mode to get the structure of the declarations of a single file, see CXTranslationUnit_SingleFileParse (along with enabling skipping of bodies). We have found clang is super fast when you only try to get the structure of a file like this.

Thank you, that sounds very useful. I will try that and get some measurements.

We can make convenient APIs to provide the syntactic structure of declarations based on their location.

Perhaps just for the end-loc since it's pretty much guaranteed to be needed by everyone. But if it's very straightforward, perhaps that's not needed. I'll try and see.

But let's say we added the end-loc, is it enough ? If you want to implement the 'peek the definition' like Eclipse, then it is not enough, you also need to figure out if there are documentation comments associated with the declaration and also show those. Also what if you want to highlight the type signature of a function, then just storing the location of the closing brace of its body is not enough. There can be any arbitrary things you may want to get from the structure of the declaration (e.g. the parameter ranges), but we could provide an API to gather any syntactic structure info you may want.

That's a very good point. I guess in the back of my mind, I have the worry that one cannot extend what is stored, either for a different performance trade-off or for additional things. The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend. But perhaps it's better to not overthink this for now.

I did a bit more of experimenting. For the end-loc, I changed my prototype so that the end-loc is not stored in the index but rather computed "on the fly" using SourceManager and Lexer only. For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.

With end-loc in index: 3.45s on average (20 samples)
With end-loc computed on the fly: 11.33s on average (20 samples)
I also tried with Xcode but without too much success: it took about 30 secs to reach 45K results and then carried on for a long time and hung (although I didn't try to leave it for hours to see if it finished).

From my perspective, it seems that the extra time is quite substantial and it doesn't seem worth to save an integer per occurrence in this case.

For computing the start/end-loc of function bodies, I tried the SingleFileParseMode and SkipFunctionBodies separately ( as a start). The source I use this on looks like this:

#include "MyClass.h"

MyClass::MyClass() {
}

void MyClass::doOperation() {
}

With SingleFileParseMode, I get several errors:

MyClass.cpp:5:1: error: use of undeclared identifier 'MyClass'
MyClass.cpp:8:6: error: use of undeclared identifier 'MyClass'

Then I cannot obtain any Decl* at the position of doOperation. With SingleFileParseMode, I'm also a bit weary that not processing headers will result in many inaccuracies. From our perspective, we are more wiling to sacrifice disk space in order to have more accuracy and speed. For comparison, the index I worked with containing all end-loc for occurrences and also function start/end is 201M for LLVM/Clang/Clangd which is small to us.

With SkipFunctionBodies alone, I can get the Decl* but FunctionDecl::getSourceRange() doesn't include the body, rather, it stops after the arguments.
It would be very nice if we could do this cheaply but it doesn't seem possible with those two flags alone. What did you have in mind for implementing an "API to gather any syntactic structure info" ?

For computing the start/end-loc of function bodies, I tried the SingleFileParseMode and SkipFunctionBodies separately ( as a start). The source I use this on looks like this:

Given the discussion in https://reviews.llvm.org/D44247, I think we can do without the start/end-loc of function bodies and try some heuristics client-side. We can always revisit this later if necessary.

However, for the end-loc of occurrences, would you be OK with this being added? I think it would be a good compromise in terms of performance, simplicity and index size.

For computing the start/end-loc of function bodies, I tried the SingleFileParseMode and SkipFunctionBodies separately ( as a start). The source I use this on looks like this:

Given the discussion in https://reviews.llvm.org/D44247, I think we can do without the start/end-loc of function bodies and try some heuristics client-side. We can always revisit this later if necessary.

However, for the end-loc of occurrences, would you be OK with this being added? I think it would be a good compromise in terms of performance, simplicity and index size.

@malaperle Just to clarify, what's the particular end-loc we're talking about here? e.g. for a function call, would this be the end of the function's name, or the closing paren?
For the end of the name, couldn't this be derived from the start loc + symbol name length (barring token pastes and escaped new lines in the middle of identifiers, which hopefully aren't too common)?
I can see the value for the closing paren though.

@akyrtzi Are the numbers from Marc-Andre's experiment what you'd expect to see and is there anything else to try? I'm not familiar with those modes at all to comment, sorry. I assume any API to gather syntactic structure info would be based on those modes, right?

@malaperle Just to clarify, what's the particular end-loc we're talking about here? e.g. for a function call, would this be the end of the function's name, or the closing paren?
For the end of the name, couldn't this be derived from the start loc + symbol name length (barring token pastes and escaped new lines in the middle of identifiers, which hopefully aren't too common)?
I can see the value for the closing paren though.

I mean the end of the name referencing the symbol, so that it can be highlighted properly when using the "find references in workspace" feature. There are cases where the name of the symbol itself is not present, for example "MyClass o1, o2;" (o1 and o2 reference the constructor), references to overloaded operators, etc.

@malaperle Just to clarify, what's the particular end-loc we're talking about here? e.g. for a function call, would this be the end of the function's name, or the closing paren?
For the end of the name, couldn't this be derived from the start loc + symbol name length (barring token pastes and escaped new lines in the middle of identifiers, which hopefully aren't too common)?
I can see the value for the closing paren though.

I mean the end of the name referencing the symbol, so that it can be highlighted properly when using the "find references in workspace" feature. There are cases where the name of the symbol itself is not present, for example "MyClass o1, o2;" (o1 and o2 reference the constructor), references to overloaded operators, etc.

Ah, I see – thanks! I was thinking all occurrences whose symbol name didn't actually appear at their location were marked with SymbolRole::Implicit, but that only seems to be true for the ObjC index data.

Hey Marc,

The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend

I don't think "not possible to extend" is quite correct, we can make it so that the format allows optional data to be recorded.

On the topic of recording the end-loc, I agree it's not much data overall, but it will be useful to examine the uses closely and to figure out whether it's really required and whether it is at the same time inadequate for other uses.

I changed my prototype so that the end-loc is not stored in the index but rather computed "on the fly" using SourceManager and Lexer only.

I assume you used SingleFileParseMode+SkipFunctionBodies for this, right ?

For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.

This is an interesting use case, and I can say we have some experience because Xcode has this functionality without requiring the end-loc for every reference.
So what it does is that it has a 'name' to look for (say 'foo' for the variable foo) and if it finds the name in the location then it highlights, otherwise if it doesn't find it (e.g. because it is an implicit reference) then it points to the location but doesn't highlight something. The same thing happens for operator overloads (the operators get highlighted at the reference location).
For implicit references it's most likely there's nothing to highlight so the end-loc will most likely be empty anyway (or same as start-loc ?) to indicate an empty range.

With SingleFileParseMode, I get several errors:

Good point, the parser definitely needs recovery improvements in C++.

With SkipFunctionBodies alone, I can get the Decl* but FunctionDecl::getSourceRange() doesn't include the body

This seems strange, there's an EndRangeLoc field that should have been filled in, not exactly sure if it is a bug or omission.

Going back to the topic of what use cases end-loc covers, note that it still seems inadequate for peek-definition functionality. You can't set it to body-end loc (otherwise occurrences highlighting would highlight the whole body which I think is undesirable) and you still need to include doc-comments if they exist.

Hey Marc,

The fact that both clang and clangd have to agree on the format so that index-while-building can be used seems to make it inherently not possible to extend

I don't think "not possible to extend" is quite correct, we can make it so that the format allows optional data to be recorded.

That would be good. How would one go about asking Clang to generate this extra information? Would a Clang Plugin be suitable for this? I don't know much about those but perhaps that could be one way to extent the basic behavior of "-index_store_path" in this way?

I changed my prototype so that the end-loc is not stored in the index but rather computed "on the fly" using SourceManager and Lexer only.

I assume you used SingleFileParseMode+SkipFunctionBodies for this, right ?

No, sorry the end-locs I meant there is for occurrences. Only the lexer was needed to get the end of the token. So for "MyClass o1, o2;" o1 and o2 get highlighted as references to the MyClass constructor.

For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.

This is an interesting use case, and I can say we have some experience because Xcode has this functionality without requiring the end-loc for every reference.
So what it does is that it has a 'name' to look for (say 'foo' for the variable foo) and if it finds the name in the location then it highlights, otherwise if it doesn't find it (e.g. because it is an implicit reference) then it points to the location but doesn't highlight something.

I think it's useful to highlight something even when the name is not there. For example in "MyClass o1, o2;" it feels natural that o1 and o2 would get highlighted.

The same thing happens for operator overloads (the operators get highlighted at the reference location).

It does? I can only seem to do a textual search. For example, if I look at "FileId::operator<", if I right-click in the middle of "operator<" and do "Find selected symbol in workspace", it seems to start a text based search because there are many results that are semantically unrelated.

For implicit references it's most likely there's nothing to highlight so the end-loc will most likely be empty anyway (or same as start-loc ?) to indicate an empty range.

I think for those cases the end of the token is probably suitable. Can you give examples which implicit references you have in mind? Maybe another one (other than the constructor mentioned above) could be a function call like "passMeAStdString(MyStringRef)", here the "operator std::string" would be called and MyStringRef could be highlighted, I think it would make sense to the user that is gets called by passing this parameter by seeing the highlight.

Going back to the topic of what use cases end-loc covers, note that it still seems inadequate for peek-definition functionality. You can't set it to body-end loc (otherwise occurrences highlighting would highlight the whole body which I think is undesirable) and you still need to include doc-comments if they exist.

I think maybe I wasn't clear, I was thinking about two end-locs: end-locs of occurrences and end-locs of bodies. The end-loc of occurrences would be used for highlight when searching for all occurrences and the end-loc for bodies would be used for the peek definition. I think we can disregard end-locs of bodies for now.

That would be good. How would one go about asking Clang to generate this extra information? Would a Clang Plugin be suitable for this?

That's an interesting idea that we could explore, but I don't have much experience with that mechanism to comment on.

Only the lexer was needed to get the end of the token

Ok, that's interesting, not sure why Xcode is so fast to highlight, did you reuse same SourceManager/Lexer/buffers for occurrences from same file ? We'd definitely add the end-loc if we cannot come up with a mechanism to highlight fast enough without it.

I think it's useful to highlight something even when the name is not there. For example in "MyClass o1, o2;" it feels natural that o1 and o2 would get highlighted.

To clarify, even with implicit references the start loc points to something. In this case the implicit references can have start locs for the o1 and o2 identifiers and the end result for the UI will be the same (o1 and o2 get highlighted) even without having end-locs for all references.

It does? I can only seem to do a textual search.

The example I tried is the following. If you could file a bug report for the test case that did not work as you expected it would be much appreciated!

class Something1 {
public:
    Something1() {}
    ~Something1() {}
    operator int() {
        return 0;
    }

    friend int operator <<(Something1 &p, Something1 &p2) {
        return 0;
    }
};

void foo1(Something1 p1, Something1 p2) {
    p1 << p2;
    p1 << p2;
}

here the "operator std::string" would be called and MyStringRef could be highlighted

Even without end-loc, the start loc could point to MyStringRef and you could highlight it.

That would be good. How would one go about asking Clang to generate this extra information? Would a Clang Plugin be suitable for this?

That's an interesting idea that we could explore, but I don't have much experience with that mechanism to comment on.

Only the lexer was needed to get the end of the token

Ok, that's interesting, not sure why Xcode is so fast to highlight, did you reuse same SourceManager/Lexer/buffers for occurrences from same file ? We'd definitely add the end-loc if we cannot come up with a mechanism to highlight fast enough without it.

I don't think Xcode is quite fast, it's about 10 times slower (although I'm not sure it really finished) than when I use my branch that has the end-loc. I would try end-locs in Xcode if I could, to compare :) So I don't really know where the bottleneck is in Xcode. Comparing oranges to oranges, it's 4 times slower without end-locs compared to with end-locs on my branch. I does use the same SourceManager for the 46K references and I verified that it uses the same buffers, etc.
I'll put the numbers here again for readability.

For my little benchmark, I used the LLVM/Clang/Clangd code base which I queried for all references of "std" (the namespace) which is around 46K references in the index.

With end-loc in index: 3.45s on average (20 samples)
With end-loc computed on the fly: 11.33s on average (20 samples)
I also tried with Xcode but without too much success: it took about 30 secs to reach 45K results and then carried on for a long time and hung (although I didn't try to leave it for hours to see if it finished).

I think it's useful to highlight something even when the name is not there. For example in "MyClass o1, o2;" it feels natural that o1 and o2 would get highlighted.

To clarify, even with implicit references the start loc points to something. In this case the implicit references can have start locs for the o1 and o2 identifiers and the end result for the UI will be the same (o1 and o2 get highlighted) even without having end-locs for all references.

It's the same but slower. IMO, the trade off is not great. It's entirely subjective but I think 4-10 times slower in order to save an integer per occurrence is not worth it from my point of view.

Even without end-loc, the start loc could point to MyStringRef and you could highlight it.

(Same here, it's doable but faster if already in the index.)

It does? I can only seem to do a textual search.

The example I tried is the following. If you could file a bug report for the test case that did not work as you expected it would be much appreciated!

Sure, I'll give that a try and isolate it as much as I can. BTW, does it work for you on the LLVM code base?

nathawes updated this revision to Diff 153190.Jun 27 2018, 3:18 PM

Updated to apply on top-of-tree.

gribozavr added inline comments.
include/clang/Frontend/FrontendOptions.h
262

Please end comments with a period.

265

Would it make more sense to flip this boolean to positive? "IndexIncludeSystemSymbols"?

lib/Index/IndexingAction.cpp
107

Please don't duplicate the information from the signature in comments. No need to say that this function returns an IndexASTConsumer (twice, in the first sentence and in the \returns clause), the code already says that.

Also, "The compiler instance used to process the input" does not mean much to me either.

142

No semicolon.

151

No semicolon.

155

No semicolon.

242

Please don't duplicate type information from the signature in the comment.

250

I don't understand... this is not really the user-specified output file.

270

Please don't duplicate type information from the signature in the comment.

lib/Index/IndexingContext.h
40

Please add a period at the end of the comment.

44

DirEntries => IsSystemDirEntry?

46

Triple slashes for doc comments.

46

Unclear how a boolean can keep track of the last check.

Did you mean "Whether the file is a system file or not. This value is a cache." If so, please rename the variable to something like IsSystemFileCache.

test/Index/Core/index-source.mm
2 ↗(On Diff #153190)

No need to specify check-prefixes=CHECK.

test/Index/Core/index-unit.mm
1 ↗(On Diff #153190)

This test is very difficult to read... it is just a dump of random internal data structures... what do you think about converting it to a unit test?

mgrang added inline comments.Mar 6 2019, 10:17 AM
lib/Index/FileIndexData.cpp
31 ↗(On Diff #153190)

Please use range-based llvm::sort instead of std::sort:

llvm::sort(Sorted);

See https://llvm.org/docs/CodingStandards.html#beware-of-non-deterministic-sorting-order-of-equal-elements

jkorous commandeered this revision.Mar 6 2019, 10:44 AM
jkorous edited reviewers, added: nathawes; removed: jkorous.
jkorous abandoned this revision.Mar 6 2019, 10:45 AM

It's time to officially abandon these patches in favor of new push for upstreaming index-while-building.

Current reviews in progress
https://reviews.llvm.org/D58749
https://reviews.llvm.org/D58418

RFC
http://lists.llvm.org/pipermail/cfe-dev/2019-February/061432.html

I'll address comments for this patch in the new set of patches.

@gribozavr I haven't put up this part of code for the new round of review yet. I will keep this on mind.

@mgrang This already landed in edbbe470f66 as clang/lib/Index/FileIndexRecord.cpp but luckily the implementation isn't using sort() at all. Thanks for pointing this out anyway!

akyrtzi added inline comments.Mar 6 2019, 11:36 AM
include/clang/Frontend/FrontendOptions.h
265

@jkorous I noticed this name can be misleading, it may seem as if what this does is "avoid indexing system symbol occurrences" but what it actually does is "avoid indexing symbol occurrences from system files". We should rename it to "IndexIgnoreSystemHeaders" or "IndexIncludeSystemHeaders" per Dmitri's suggestion.