This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Collect symbol occurrences in SymbolCollector
ClosedPublic

Authored by hokein on Aug 7 2018, 6:14 AM.

Details

Summary

SymbolCollector will be used for two cases:

  • collect Symbol type only, used for indexing preamble AST.
  • collect Symbol and SymbolOccurrences, used for indexing main AST.

For finding local references from the AST, we will implement it in other ways.

Event Timeline

hokein created this revision.Aug 7 2018, 6:14 AM
ioeric added a comment.Aug 8 2018, 2:11 AM

2 high-level questions:

  1. What's the reason for having a separate SymbolOccurrenceSlab? Could store occurrences as extra payload of Symbol?
  1. Could we merge SymbolOccurrenceCollector into the existing SymbolCollector? They look a lot alike. Having another index data consumer seems like more overhead on the user side.
hokein added a comment.Aug 9 2018, 2:28 AM

2 high-level questions:

  1. What's the reason for having a separate SymbolOccurrenceSlab? Could store occurrences as extra payload of Symbol?

Storing occurrences in Symbol structure is easy to misuse by users IMO -- if we go through this way, we will end up having a getOccurrences-like method in Symbol structure. Once users get the Symbol instance, it is natural for them to call getOccurrences to get all occurrences of the symbol. However this getOccurrences method doesn't do what users expected (just returning an incomplete set of results or empty). To query the symbol occurrences, we should always use index interface.

Therefore, I think we should try to avoid these confusions in the design.

  1. Could we merge SymbolOccurrenceCollector into the existing SymbolCollector? They look a lot alike. Having another index data consumer seems like more overhead on the user side.

The SymbolOccurrenceCollector has many responsibilities (collecting declaration, definition, code completion information etc), and the code is growing complex now. Merging the SymbolOccurrenceCollector to it will make it more complicated -- we will introduce more option flags like collect-symbol-only, collect-occurrence-only to configure it for our different use cases (we need to the implementation detail clearly in order to make a correct option for SymbolCollector). And I can foresee these two collectors might be run at different point (runWithPreamble vs runWithAST) in dynamic index.

They might use same facilities, but we could always share them.

ioeric added a comment.Aug 9 2018, 3:36 AM

2 high-level questions:

  1. What's the reason for having a separate SymbolOccurrenceSlab? Could store occurrences as extra payload of Symbol?

Storing occurrences in Symbol structure is easy to misuse by users IMO -- if we go through this way, we will end up having a getOccurrences-like method in Symbol structure. Once users get the Symbol instance, it is natural for them to call getOccurrences to get all occurrences of the symbol. However this getOccurrences method doesn't do what users expected (just returning an incomplete set of results or empty). To query the symbol occurrences, we should always use index interface.

Therefore, I think we should try to avoid these confusions in the design.

Hmm, I think this is the same for other symbol payload e.g. definition can be missing for a symbol. And it seems to me that the concern is on the SymbolSlab level: if a slab is for a single TU, users should expect missing information; if a slab is merged from all TUs, then users can expect "complete" information. I think it's reasonable to assume that users of SymbolSlab are aware of this. I think it's probably not worth the overhead of maintaining and using two separate slabs.

  1. Could we merge SymbolOccurrenceCollector into the existing SymbolCollector? They look a lot alike. Having another index data consumer seems like more overhead on the user side.

The SymbolOccurrenceCollector has many responsibilities (collecting declaration, definition, code completion information etc), and the code is growing complex now. Merging the SymbolOccurrenceCollector to it will make it more

Although the existing SymbolCollector supports different options, I think it still has a pretty well-defined responsibility: gather information about symbols. IMO, cross-reference is one of the property of symbol, and I don't see strong reasons to keep them separated.

complicated -- we will introduce more option flags like collect-symbol-only, collect-occurrence-only to configure it for our different use cases (we need to the implementation detail clearly in order to make a correct option for SymbolCollector).

I think these options are reasonable if they turn out to be necessary. And making the SymbolCollector more complicated doesn't seem to be a problem if we are indeed doing more complicated work, but I don't think this would turn into a big problem as logic of xrefs seems pretty isolated. Conversely, I think implementing xrefs in a separate class would likely to cause more duplicate and maintenance, e.g. two sets of options, two sets of initializations or life-time tracking of collectors (they look a lot alike), the same boilerplate factory code in tests, passing around two collectors in user code.

And I can foresee these two collectors might be run at different point (runWithPreamble vs runWithAST) in dynamic index.

With some options, this should be a problem I think?

2 high-level questions:

  1. What's the reason for having a separate SymbolOccurrenceSlab? Could store occurrences as extra payload of Symbol?

Storing occurrences in Symbol structure is easy to misuse by users IMO -- if we go through this way, we will end up having a getOccurrences-like method in Symbol structure. Once users get the Symbol instance, it is natural for them to call getOccurrences to get all occurrences of the symbol. However this getOccurrences method doesn't do what users expected (just returning an incomplete set of results or empty). To query the symbol occurrences, we should always use index interface.

Therefore, I think we should try to avoid these confusions in the design.

Hmm, I think this is the same for other symbol payload e.g. definition can be missing for a symbol. And it seems to me that the concern is on the SymbolSlab level: if a slab is for a single TU, users should expect missing information; if a slab is merged from all TUs, then users can expect "complete" information. I think it's reasonable to assume that users of SymbolSlab are aware of this. I think it's probably not worth the overhead of maintaining and using two separate slabs.

I think it's reasonable to keep occurrences away from Symbol's Detail field. Stashing them together is only fine for the collector API, having any way to directly access occurrences through Symbol will be totally confusing for all the other users.
E.g., the Index::lookup() will not provide occurrences in the Symbol instances it returns, and if the accessors for those will be there it will only add confusion. So +1 to keeping them out of the Symbol class.

On the other hand, SymbolSlab feels like a perfectly reasonable place to store the occurrences in addition to the symbols themselves and it feels we should reuse its memory arena for storing any strings we need to allocate, etc.

  1. Could we merge SymbolOccurrenceCollector into the existing SymbolCollector? They look a lot alike. Having another index data consumer seems like more overhead on the user side.

The SymbolOccurrenceCollector has many responsibilities (collecting declaration, definition, code completion information etc), and the code is growing complex now. Merging the SymbolOccurrenceCollector to it will make it more

Although the existing SymbolCollector supports different options, I think it still has a pretty well-defined responsibility: gather information about symbols. IMO, cross-reference is one of the property of symbol, and I don't see strong reasons to keep them separated.

complicated -- we will introduce more option flags like collect-symbol-only, collect-occurrence-only to configure it for our different use cases (we need to the implementation detail clearly in order to make a correct option for SymbolCollector).

I think these options are reasonable if they turn out to be necessary. And making the SymbolCollector more complicated doesn't seem to be a problem if we are indeed doing more complicated work, but I don't think this would turn into a big problem as logic of xrefs seems pretty isolated. Conversely, I think implementing xrefs in a separate class would likely to cause more duplicate and maintenance, e.g. two sets of options, two sets of initializations or life-time tracking of collectors (they look a lot alike), the same boilerplate factory code in tests, passing around two collectors in user code.

And I can foresee these two collectors might be run at different point (runWithPreamble vs runWithAST) in dynamic index.

With some options, this should be a problem I think?

+1 to merging into the SymbolCollector. Keeping the responsibilities separate inside a single class should be easy, e.g. something like that should be simple enough:

SymbolCollector::handleDeclOccurence(args) {
  this->processForSymbol(args); // handles keeping the Symbol structure up-to-date, i.e. adds definition locations, etc.
  this->processForOccurrences(args); // appends occurrences to a list of xrefs.
};

The main advantage that we get is less clang-specific boilerplate. The less IndexDataConsumers, FrontendActionFactorys, FrontendActions we create, the more focused and concise our code is.
And in that case, SymbolCollector is already handling those responsibilities for us and reusing looks like a good idea.

Hmm, I think this is the same for other symbol payload e.g. definition can be missing for a symbol. And it seems to me that the concern is on the SymbolSlab level: if a slab is for a single TU, users should expect missing information; if a slab is merged from all TUs, then users can expect "complete" information. I think it's reasonable to assume that users of SymbolSlab are aware of this. I think it's probably not worth the overhead of maintaining and using two separate slabs.

My concerns of storing occurrences as an extra payload of Symbol are:

  • SymbolSlab is more like an implementation detail. Users of SymbolIndex are not aware of it, they only get Symbol objects, so it easily confuses users if they see any occurrence-related interface/member in Symbol. And we will write a looong comment explaining its correct behavior. It'd be better if we avoid this confusion in the API level.
  • The fields in Symbol structure are symbol properties, and could be stored in memory. However, occurrences are not, we can't guarantee that.
  • It seems that we are coupling ID, Symbol, SymbolOccurrence together: in the index implementation, we will go through ID=>Symbol=>Occurrences rather than ID=>Occurrences.

I think these options are reasonable if they turn out to be necessary.

I think they are necessary. For collecting all occurrences for local symbols from the AST, we only need symbol occurrence information, other information (e.g. declaration&definition location, #include) should be discarded; Index for code completion should not collect symbol occurrences.

And making the SymbolCollector more complicated doesn't seem to be a problem if we are indeed doing more complicated work, but I don't think this would turn into a big problem as logic of xrefs seems pretty isolated.

If xrefs is quite isolated, I think it is a good signal to have a dedicated class handling it.

I think implementing xrefs in a separate class would likely to cause more duplicate and maintenance, e.g. two sets of options, two sets of initializations or life-time tracking of collectors (they look a lot alike), the same boilerplate factory code in tests, passing around two collectors in user code.

Merging xrefs to SymbolCollector couldn't avoid these problems, I think it is a matter of where we put these code:

  • different initialization of SymbolCollector for different use cases (e.g. setting different flags in SymbolCollectorOptions).
  • for dynamic index, index for xrefs and code completion would be triggered at different point: index for xrefs should happen when AST is ready; index for code completion happens when Preamble is ready; we might end up with two slabs instances in the dynamic index (1 symbol slab + 1 occurrence slab vs. 2 symbol slabs).

The duplication is mainly about AST frontend action boilerplate code. To eliminate it, we could do some refactorings:

  • get rid of the clang ast action code in SymbolCollector, and SymbolOccurrenceCollector
  • introduce an IndexSymbol which is a subclass index::IndexDataConsumer
  • the IndexSymbol has two mode (indexing symbol or indexing occurrence), and dispatch ast information to SymbolCollector/SymbolOccurrenceCollector.
hokein updated this revision to Diff 161927.Aug 22 2018, 5:29 AM

Update the patch based on our offline discussion

  • only one single clang intefaces implementation, and move finding references to current symbol collector;
  • store references in SymbolSlab;
ilya-biryukov added inline comments.Aug 22 2018, 6:34 AM
clangd/index/Index.cpp
139

NIT: remove the lambda? using < is the default.

145

NIT: remove the lambda? Using == is the default.

158

Is this used for debugging?
In that case maybe consider having a user-readable representation instead of the number?

clangd/index/Index.h
46

NIT: having friend decls inside the classes themselves might prove to be more readable.
Not opposed to the current one too, feel free to ignore.

347

Maybe add a comment or remove the empty line?

348

Any store occurences in a file-centric manner?
E.g.

/// Occurences inside a single file.
class FileOccurences {
  StringRef File;
  vector<pair<Point, OccurenceKind>> Locations;
};

// ....
DenseMap<SymbolID, vector<FileOccurences>>  SymbolOccurences;

As discussed previously, this representation is better suited for both merging and serialization.

clangd/index/SymbolCollector.cpp
272

NIT: maybe use early exits and inverted conditions to keep the nesting down?

321

If we any Options here, why have an extra CollectorSymbolOptions?

clangd/index/SymbolCollector.h
59

Could you elaborate on what this option will be used for? How do we know in advance which symbols we're interested in?

hokein updated this revision to Diff 161962.Aug 22 2018, 8:07 AM
hokein marked 6 inline comments as done.

Address review comments.

hokein updated this revision to Diff 161972.Aug 22 2018, 8:44 AM
hokein marked an inline comment as done.

Add one more comment.

hokein added inline comments.Aug 22 2018, 9:02 AM
clangd/index/Index.h
46

These operator implementations seem not as much interesting as members in the structure, putting them to the structure probably adds some noise to readers.

348

The file-centric manner doesn't seem to suite our current model: whenever we update the index for the main AST, we just replace the symbol slab with the new one; and for index merging, we only use the index findOccurrences interfaces.

It would save some memory usage of StringRef File, but AFAIK, the memory usage of current model is relatively small (comparing with the SymbolSlab for code completion) since we only store occurrences in main file (~50KB for CodeComplete.cpp).

I'd leave it as it is now, and we could revisit it later.

clangd/index/SymbolCollector.h
59

This is used for finding references in the AST as a part of the xref implementation, basically the workflow would be:

  1. find SymbolIDs of the symbol under the cursor, using DeclarationAndMacrosFinder
  2. run symbol collector to find all occurrences in the main AST with all SymbolIDs in #1
  3. query the index, to get more occurrences
  4. merge them
ilya-biryukov added inline comments.Aug 23 2018, 5:58 AM
clangd/index/Index.h
46

Ok, LG outside too

348

Isn't the merging model different for the occurrences?
We would actually have to drop all references from the older index when merging if the new one contains locations in the same file.
If the merge if file-centric, the file-based representation makes more sense in the first place.

Apart from simpler merging the code, the file-based representation also buys us more efficient serialization for the static index, arguably efficient enough to stash all the occurrences even into our YAML index.

Postponing till later is also fine, but I'm not sure it buys us much now. These arguments only apply if we think the file-centric approach is a the right final design, though.

clangd/index/SymbolCollector.h
59

Can we instead find all the occurences in DeclarationAndMacrosFinder directly?
Extra run of SymbolCollector means another AST traversal, which is slow by itself, and SymbolCollector s designed for a much more hairy problem, its interface is just not nicely suited for things like only occurrences. The latter seems to be a simpler problem, and we can have a simpler interface to solve it (possibly shared between SymbolCollector and DeclarationAndMacrosFinder). WDYT?

ioeric added inline comments.Aug 24 2018, 1:42 AM
clangd/index/SymbolCollector.h
74

Use llvm::Optional?

ioeric added inline comments.Aug 24 2018, 2:52 AM
clangd/index/SymbolCollector.cpp
241

I don't see a strong reason for the separation of CollectOccurrence and CollectSymbol. There are some pieceis that are only used by one of them, but they seem cheap enough to ignore? Intuitively, it seems to me reference collection could just be a member function of SymbolCollector.

sammccall added inline comments.Aug 24 2018, 8:10 AM
clangd/index/Index.h
310

As discussed offline: the merge of occurrences into SymbolSlab seems problematic to me.
On the consumer side, we have a separation between Symbol APIs and SymbolOccurrence APIs - they don't really interact.
The Symbol type can often only be used with SymbolSlab, and so including occurrences drags them into the mess for consumers that don't care about them.

For producers (index implementations), they will usually have both and they may want to share arena storage. But this probably doesn't matter much, and if it does we can use another mechanism (like allowing SymbolSlabBuilder and SymbolOccurrenceSlab to share UniqueStringSaver)

clangd/index/SymbolCollector.cpp
439

note that here we've done basically all the work needed to record the occurrence.

If you add a DenseMap<Decl*, {SourceLocation, SymbolRole}> then you'll have enough info at the end to fill in the occurrences, like we do with referenceddecls -> references.

clangd/index/SymbolCollector.h
40

Not sure this split is justified.
if IDs goes away (see below), all that's left can be represented in a SymbolOccurenceKind filter (which is 0 to collect no occurrences)

59

Yeah, I don't think we need this.
For "find references in the AST" we have an implementation in XRefs for highlights which we don't need to share.

76

collecting symbols doesn't actually need to be optional I think - it's the core responsibility of this class, and "find occurrences of a decl in an ast" can be implemented more easily in other ways

hokein updated this revision to Diff 162611.Aug 26 2018, 10:24 PM
hokein marked 7 inline comments as done.

Update the patch based on our new discussion

  • SymbolOccurrenceSlab for storing underlying occurrence data
  • reuse SymbolCollector to collect symbol occurrences
hokein retitled this revision from [clangd] Collect symbol occurrences from AST. to [clangd] Collect symbol occurrences in SymbolCollector.Aug 26 2018, 10:42 PM
hokein edited the summary of this revision. (Show Details)

This looks pretty good!

clangd/index/Index.h
400

assert frozen?
looking up in a non-frozen array is probably a mistake.
if we choose to optimize this, it probably won't be possible.

401

return Occurrences.lookup(ID)?

clangd/index/SymbolCollector.cpp
228

nit: toOccurrenceKind

230

If you want to filter out the unsupported bits, maybe just add an explicit AllOccurrenceKinds constant to the header file, and return AllOccurrenceKinds & Roles here? (plus casts)

442

just compute the spelling loc once and reuse?

443

you get the spelling loc on the previous line to check for mainfile - so surely we should be using spelling loc here?

564

nit: const auto& for clarity since we're not mutating

569

so this seems maybe gratuitously inefficient, we're copying the filename then going through the URI conversion dance for each reference - even though the filename is the same for each.

consider splitting out part of getTokenLocation into getTokenRange(SymbolLocation&) and only calling that here.

clangd/index/SymbolCollector.h
71–72

this should be next to OccurrenceFilter, they're very closely related (the name mismatch is a little unfortunate)

112

please move next to ReferencedDecls/ReferencedMacros so the comment applies to this too

unittests/clangd/SymbolCollectorTests.cpp
491

this is cute - if possible, consider adding a matcher factory function for readability here, so you can write EXPECT_THAT(..., HaveRanges(Main.ranges("foo"))

hokein updated this revision to Diff 162854.Aug 28 2018, 7:23 AM
hokein marked 10 inline comments as done.

Address review comments and fix code style.

hokein updated this revision to Diff 162856.Aug 28 2018, 7:24 AM

Minor cleanup.

Harbormaster completed remote builds in B22001: Diff 162856.
hokein added inline comments.Aug 28 2018, 7:25 AM
clangd/index/Index.h
401

The DenseMap::lookup returns a copy of Value (vector) which doesn't suit our use case :( -- we will return an ArrayRef which stores an reference of a local vector object.

unittests/clangd/SymbolCollectorTests.cpp
491

Wrapped this into HaveRanges.

hokein updated this revision to Diff 163512.Aug 31 2018, 5:25 AM

Address review comments in D51279.

sammccall accepted this revision.Aug 31 2018, 5:35 AM
This revision is now accepted and ready to land.Aug 31 2018, 5:35 AM
hokein closed this revision.Aug 31 2018, 5:56 AM

Committed in rL341208.