[clangd] Introduce a "Symbol" class.
ClosedPublic

Authored by hokein on Wed, Dec 6, 7:46 AM.

Details

Summary
  • The "Symbol" class represents a C++ symbol in the codebase, containing all the information of a C++ symbol needed by clangd. clangd will use it in clangd's AST/dynamic index and global/static index (code completion and code navigation).
  • The SymbolCollector (another IndexAction) will be used to recollect the symbols when the source file is changed (for ASTIndex), or to generate all C++ symbols for the whole project.

In the long term (when index-while-building is ready), clangd should share a
same "Symbol" structure and IndexAction with index-while-building, but
for now we want to have some stuff working in clangd.

Diff Detail

There are a very large number of changes, so older changes are hidden. Show Older Changes
ioeric added inline comments.Thu, Dec 7, 1:27 AM
clangd/Symbol.h
26 ↗(On Diff #125729)

0-based or 1-based?

39 ↗(On Diff #125729)

It might make sense to just call this USR to be more explicit.

51 ↗(On Diff #125729)

For functions and classes, should we store both declaration and definition locations?

Thanks for putting this together! Have a bit of a braindump here, happy to discuss further either here or offline.

clangd/Symbol.h
1 ↗(On Diff #125729)

I think that:

  • there's other places in clangd that deal with symbols too, this is specifically for indexing
  • the index interface belongs alongside Symbol

I'd suggest calling this Index.h

1 ↗(On Diff #125729)

I don't think having Symbols be completely self-contained objects and passing them around in standard containers like sets will prove to be ideal.
It means they can't share storage for e.g. location filename, that it's hard for us to arena-allocate them, etc.

I think we could use the concept of a set of symbols which share a lifetime. An initial version might just be

class SymbolSlab {
public:
  using iterator = DenseSet<Symbol>::iterator;
  iterator begin();
  iterator end();
private:
  DenseSet<Symbol> Symbols;
}

But it's easy to add StringPool etc there.
Then this is the natural unit of granularity of large sets of symbols: a dynamic index that deals with one file at a time would operate on (Filename, SymbolSlab) pairs. SymbolCollector would return a SymbolSlab, etc.

Then indexes can be built on top of this using non-owning pointers.

23 ↗(On Diff #125729)

Having every symbol own a copy of the filepath seems wasteful.
It seems likely that the index will have per-file information too, so this representation should likely be a key to that. Hash of the filepath might work?

32 ↗(On Diff #125729)

Let's not add this until we know what's in it.
There's gong to be an overlap between information needed for CC and other use cases, so this structure might not help the user navigate.

37 ↗(On Diff #125729)

We can't load all the symbol occurrences into the memory since they are too large

I've heard this often, but never backed up by data :-)

Naively an array of references for a symbol could be doc ID + offset + length, let's say 16 bytes.

If a source file consisted entirely of references to 1-character symbols separated by punctuation (1 reference per 2 bytes) then the total size of these references would be 8x the size of the source file - in practice much less. That's not very big.

(Maybe there are edge cases with macros/templates, but we can keep them under control)

39 ↗(On Diff #125729)

USRs are large. Can we use a fixed-size hash?

55 ↗(On Diff #125729)

I'd suggest == and a hash function instead, unless we think this ordering is particularly meaningful?

68 ↗(On Diff #125729)

Please pull this into a separate file. Someone providing e.g. symbols from a YAML file shouldn't need to pull in AST stuff.
Mabye IndexFromAST, which would sort nicely next to Index?

72 ↗(On Diff #125729)

do you really mean to copy here?

74 ↗(On Diff #125729)

what is the boolean for? you always return true

Our rough plan would be

  1. Define the Symbol structure.
  2. Design the interfaces of SymbolIndex, ASTIndex.
  3. Combine 1) and 2) together to make global code completion work (we'd use YAML dataset for LLVM project, note that this is not a final solution, it would be hidden in an --experimental flag).
  4. Switch to use the dataset from index-while-building when it is ready.

Thanks for the explanation. On my end, the plan is not quite sequential. The branch I am developing has interfaces for querying, building and a dataset format, let's call it ClangdIndexDataStorage, which is different from index-while-building (libIndexStore). I also have a version that uses libIndexStore through the same interfaces. So with that in mind, there are too main activities:

  1. Work towards the interfaces for using the index (this patch, and Eric's). From my perspective, I will make sure that it can be as compatible as possible with reading the index from disk and the features we want to develop. One important aspect is to have a good balance between memory and performance. In Eclipse CDT and also the branch I work on using ClangdIndexDataStorage, the emphasis was to minimize memory consumption and have a configurable cache size. But different choices could be made here, perhaps I can start a discussion about that separately.
  2. Work on index-while-building or the other format getting adopted in Clangd. The index-while-building (libIndexStore) is promising but also has a few missing pieces. We need a mapping solution (LMDB equivalent). We also need to make sure it's fast enough and contain enough information for the features we will need, etc.
clangd/Symbol.h
23 ↗(On Diff #125729)

How we model it is that a symbol doesn't have a "location", but its occurrence do. One could consider the location of a symbol to be either its declaration occurrence (SymbolRole::Declaration) or its definition (SymbolRole::Definition).
What we do to get the location path is each occurrence has a pointer (a "database" pointer, but it doesn't matter) to a file entry and then we get the path from the entry.

So conceptually, it works a bit like this (although it fetches information on disk).

class IndexOccurrence {
IndexOccurrence *FilePtr;

std::string Occurrence::getPath() {
  return FilePtr->getPath();
}
};
malaperle added inline comments.Thu, Dec 7, 9:44 AM
clangd/Symbol.h
37 ↗(On Diff #125729)

I'd have to break down how much memory it used by what, I'll come back to you on that. Indexing llvm with ClangdIndexDataStorage, which is pretty packed is about 200MB. That's already a lot considering we want to index code bases many times bigger. But I'll try to come up with more precise numbers. I'm open to different strategies.

malaperle added inline comments.Thu, Dec 7, 8:52 PM
clangd/Symbol.h
23 ↗(On Diff #125729)

Oops, wrong type for the field, it should have been:

class IndexOccurrence {
IndexFile *FilePtr;

std::string Occurrence::getPath() {
  return FilePtr->getPath();
}
};
hokein updated this revision to Diff 126115.Fri, Dec 8, 3:29 AM
hokein marked 4 inline comments as done.

Address review comments.

  • Use SymbolSlab, for allowing future space optimization.
  • Fix a Cast issue when building debug unittest.
hokein added a comment.Fri, Dec 8, 3:32 AM

Thanks for the useful comments!

clangd/Symbol.h
1 ↗(On Diff #125729)

+1 It makes sense.

For initial version, the Symbol structure is still owning its fields naively, we could improve it (change to pointer or references) in the future.

23 ↗(On Diff #125729)

Is this relative or absolute?

Whether the file path is relative or absolute depends on the build system, the file path could be relative (for header files) or absolute (for .cc files).

How we model it is that a symbol doesn't have a "location", but its occurrence do.

We will also have a SymbolOccurence structure alongside with Symbol (but it is not in this patch). The "Location" will be a part of SymbolOccurrence.

26 ↗(On Diff #125729)

The offset is equivalent to FileOffset in SourceLocation. I can't find any document about the FileOffset in LLVM, but it is 0-based.

32 ↗(On Diff #125729)

Removed it.

37 ↗(On Diff #125729)

I can see two points here:

  1. For all symbol occurrences of a TU, it is not quite large, and we can keep them in memory.
  2. For all symbol occurrences of a whole project, it's not a good idea to load all of them into memory (For LLVM project, the size of YAML dataset is ~1.2G).
39 ↗(On Diff #125729)

USR is one of the implementations of the Identifier, I would keep the name here -- we might change the implementation (like hash(USR)) afterwards.

We can use hash would make the query symbol from USR a bit harder -- we have to maintain a hash-to-USR lookup table. I think we can do it later when it shows a large memory consumption?

51 ↗(On Diff #125729)

I think the symbol occurrences would include both declaration and definition locations.

The CanonicalLocation is providing a fast/convenient way to find the most interested location of the symbol (e.g. for code navigation, or include the missing path for a symbol), without symbol occurrences.

55 ↗(On Diff #125729)

It was required when using Symbol structure in a set standard library. Removed it since we are using the DenseSet now.

68 ↗(On Diff #125729)

I can see various meaning of "Index" here:

  1. Index in index::IndexDataConsumer, which collects and contructs all symbols by traversing the AST.
  2. Index term using in clangd, specially for build index on top of these collected symbols.

I think we should be consistent the index for 2), and SymbolCollector is more descriptive here.

74 ↗(On Diff #125729)

return true means continue indexing, while false means abort the indexing. Have added the comment.

More comments, but only two major things really:

  • I'd like to clearly separate USR from SymbolID (even if you want to keep using USRs as their basis for now)
  • the file organization (code within files, and names of files) needs some work I think

Everything else is details, this looks good

clangd/Symbol.h
70 ↗(On Diff #126115)

nit: lowercase iterator is the STL convention, following it tends to make template code work better

72 ↗(On Diff #126115)

This is dangerous if called after reads, as it invalidates iterators and pointers.
I don't think we actually indend to support such mutation, so I'd suggest adding an explicit freeze() function. addSymbol() is only valid before freeze(), and reading functions are only valid after. An assert can enforce this.
(This is a cheap version of a builder, which are more painful to write but may also be worth it).

If we choose not to enforce this at all, the requirement shold be heavily documented!

73 ↗(On Diff #126115)

Awkward as it is, don't you want iterator find() here?

23 ↗(On Diff #125729)

Whether the file path is relative or absolute depends on the build system, the file path could be relative (for header files) or absolute (for .cc files).

I'm not convinced this actually works. There's multiple codepaths to the index, how can we ensure we don't end up using inconsistent paths?
e.g. we open up a project that includes a system header using a relative path, and then open up that system header from file->open (editor knows only the absolute path) and do "find references".

I think we need to canonicalize the paths. Absolute is probably easiest.

37 ↗(On Diff #125729)

(This is still a sidebar - not asking for any changes)

The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).
And "it's not a good idea" isn't an assertion that can hold without reasons, assumptions, and data.
If the size turns out to be, say, 120MB for LLVM, and we want to scale to 10x that, and we're spending 500MB/file for ASTs, then it might well be a good trade.

39 ↗(On Diff #125729)

Personally I'm less worried about the name, and more about the type.

I think it's important:

  1. that we have a distinct SymbolID type (for conceptual clarity)
  2. that we convert to/from this type in as few places as possible (for flexibility)
  3. that this type be fixed-size, preferably <=64 bits (for performance)
  4. that long-lived references to symbols be expressed as SymbolIDs not pointers (to avoid lifetime confusion)

I'm happy to defer some performance considerations to later (though I'm almost certain this will matter). But i'm not sure it changes the conclusion:

We can use hash would make the query symbol from USR a bit harder -- we have to maintain a hash-to-USR lookup table.

You only need that if you need a USR. But the SymbolID -> USR conversion should be in one place anyway (point 2) otherwise we'll end up coupled to using USRs forever.
And the SymbolID -> Symbol lookup is something the index is going to provide anyway, right?

51 ↗(On Diff #125729)

I'd be +1 on including both a definition location (if known) and a preferred declaration location, because there's enough use cases that might want definition even if it's not the preferred declaration.

But i'm fine if we want to omit the separate definition for now. In that case, call this CanonicalDeclaration?

68 ↗(On Diff #125729)

SymbolCollector is a fine name for the type, but the file should have Index in the name, or we should create an Index subdirectory. It should be possible to understand which files are part of the index subsystem by scanning the clangd directory.

(Did you see the comment was about separating into a different file, not about renaming the class?)

74 ↗(On Diff #125729)

Oops, sorry, I missed that this was implementing an interface. Please remove the comment or move it to the implementation - it doesn't make sense in the interface. Sorry for the noise!

unittests/clangd/SymbolCollectorTests.cpp
105

If using the same repeatedly, please pull out a MATCHER_P for readability:

UnorderedElementsAre(QName("Foo"), ...)

(From D40548) Here's the interface for querying the index that I am using right now. It's meant to be able to retrieve from any kind of "backend", i.e. in-memory, ClangdIndexDataStore, libIndexStore, etc. I was able to implement "Open Workspace Symbol" (which is close to code completion in concept), Find References and Find Definitions.

using USR = llvm::SmallString<256>;

class ClangdIndexDataOccurrence;

class ClangdIndexDataSymbol {
public:
  virtual index::SymbolKind getKind() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mymethod.
  virtual std::string getName() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mynamespace::myclass::
  virtual std::string getQualifier() = 0;
  virtual std::string getUsr() = 0;

  virtual void foreachOccurrence(index::SymbolRoleSet Roles, llvm::function_ref<bool(ClangdIndexDataOccurrence&)> Receiver) = 0;

  virtual ~ClangdIndexDataSymbol() = default;
};

class ClangdIndexDataOccurrence {
public:
  enum class OccurrenceType : uint16_t {
     OCCURRENCE,
     DEFINITION_OCCURRENCE
   };

  virtual OccurrenceType getKind() const = 0;
  virtual std::string getPath() = 0;
  /// Get the start offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getStartOffset(SourceManager &SM) = 0;
  /// Get the end offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getEndOffset(SourceManager &SM) = 0;
  virtual ~ClangdIndexDataOccurrence() = default;
  //TODO: Add relations

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::OCCURRENCE; }
};

/// An occurrence that also has definition with a body that requires additional
/// locations to keep track of the beginning and end of the body.
class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
public:
  virtual uint32_t getDefStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getDefEndOffset(SourceManager &SM) = 0;

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::DEFINITION_OCCURRENCE; }
};

class ClangdIndexDataProvider {
public:

  virtual void foreachSymbols(StringRef Pattern, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;
  virtual void foreachSymbols(const USR &Usr, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;

  virtual ~ClangdIndexDataProvider() = default;
};

The "Clangd" prefix adds a bit much of clutter so maybe it should be removed. I think the main points are that having generic foreachSymbols/foreachOccurrence with callbacks is well suited to implement multiple features with minimal copying.

clangd/Symbol.h
37 ↗(On Diff #125729)

The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor).

Indeed. I'll try to come up with more realistic numbers. There are other things not accounted for in the 16 bytes mentioned above, like storing roles and relations.

500MB/file for ASTs

What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?

hokein updated this revision to Diff 126156.Fri, Dec 8, 8:14 AM
hokein marked 6 inline comments as done.
  • reorganize files, move to index subdirectory.
  • change symbol ID to a hash value, instead of couple with USR.
hokein added a comment.Fri, Dec 8, 8:19 AM

Reorganizing the source files made all the comments invalid in the latest version :(. Feel free to comment on the old version or the new version.

clangd/Symbol.h
72 ↗(On Diff #126115)

I think the only user to mutate this object is SymbolCollector.

Instead of adding freeze function, I made it as a private method and declare SymbolCollector as a friend class. Does it look better?

23 ↗(On Diff #125729)

Absolute path for .cc file is fine, I was a bit concerned about the path for .h file, especially we might use it in #include, but we can figure out later. Changed to absolute file path.

51 ↗(On Diff #125729)

OK, changed it to CanonicalDeclarationLoc, and added a FIXME for the definition.

68 ↗(On Diff #125729)

Oh, sorry for that. I thought that you were meaning to rename the SymbolCollector class.

I prefer to create an subdirectory index, and put everything related to the index to that directory, instead of having Index word in the top-level clangd filenames.

malaperle added inline comments.Fri, Dec 8, 3:28 PM
clangd/Symbol.h
37 ↗(On Diff #125729)

What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?

Oh I see, the AST must be in memory for fast reparse. I just tried opening 3 files at the same time I it was already around 500MB. Hmm, that's a bit alarming.

Thanks for the restructuring? I want to take another pass today, but wanted to mention some SymbolID things.

clangd/Symbol.h
72 ↗(On Diff #126115)

Hmm, I don't think so.

SymbolCollector being a the only writer is temporary, this is "arena for symbols" and we want to create symbols in other ways (e.g. remote index).

And "friend" to expose one function is a bit iffy.

37 ↗(On Diff #125729)

Right, just that we have to consider RAM usage for the index in the context of clangd's overall requirements - if other parts of clangd use 1G of ram for typical work on a large project, then we shouldn't rule out spending a couple of 100MB on the index if it adds a lot of value.

clangd/index/Index.h
39

This comment doesn't really say anything, and there is a lot to say!
What does it distinguish vs not? How do we guarantee uniqueness?

40

please make symbol ID a real type, constructible from USR

unsigned might be 32 bits, which really isn't OK. Even 64 is pushing it for huge external indexes.

We need these IDs to be stable across executions and versions, which llvm::HashString is not.

I'd suggest using the SHA1 hash (160 bits)

52

nit: remove Loc? It's ambiguous and not really needed

88

You might just want to specialize for SymbolID instead - we should probably mostly use explicit DenseMap<SymbolID, Symbol> as we can avoid constructing the Symbol in a bunch of cases.

clangd/index/SymbolCollector.cpp
31

maybe resolvePath? since it includes symlink resolution

I'm not sure we actually want to resolve symlinks. Let's chat offline.

hokein updated this revision to Diff 126378.Mon, Dec 11, 8:46 AM
hokein marked 5 inline comments as done.

Address comments on SymbolID.

sammccall accepted this revision.Mon, Dec 11, 9:38 AM

\o/

clangd/index/Index.cpp
35

assert frozen? (and in begin())

clangd/index/Index.h
46

nit: make HashValue private?

provide operator== (and use it from DenseMapInfo).

109

nit: you may want to memoize this in a local static variable, rather than compute it each time: DenseMap calls it a lot.

clangd/index/SymbolCollector.cpp
38

Can you spell out here which symbolic link cases we're handling, and what problems we're trying to avoid?

Offline, we talked about the CWD being a symlink. But this is a different case...

clangd/index/SymbolCollector.h
36

What's this for? Seems like we should be able to handle multiple TUs with one collector?

This revision is now accepted and ready to land.Mon, Dec 11, 9:38 AM
hokein updated this revision to Diff 126519.Tue, Dec 12, 3:25 AM
hokein marked 4 inline comments as done.

Address remaining comments.

hokein added inline comments.Tue, Dec 12, 3:27 AM
clangd/index/Index.cpp
35

We may not do the assert "frozen" in these getter methods -- as writers may also have needs to access these APIs (e.g. checking whether SymbolID is already in the slab before constructing and inserting a new Symbol).

clangd/index/SymbolCollector.h
36

We don't have particular usage for this method except for testing. Removed it.

hokein updated this revision to Diff 126538.Tue, Dec 12, 6:10 AM
  • Get rid of clangdIndex library, using the existing clangDaemon library.
  • Remove the getID() method.
sammccall added inline comments.Tue, Dec 12, 6:10 AM
clangd/index/CMakeLists.txt
5 ↗(On Diff #126519)

hmm, I'm not sure whether we actually want this to be a separate library. This means we can't depend on anything from elsewhere in clangd, and may force us to create more components.

e.g. if we want to pass contexts into the index, or if we want to reuse LSP data models from protocol.h.

Maybe we should make this part of the main clangd lib, what do you think?

clangd/index/Index.cpp
35

Right, I'd specifically like not to allow that if possible - it makes it very difficult to understand what the invariants are and what operations are allowed.

I think there's no such need now, right? If we want to add one in future, we can relax this check, or add a builder, or similar - at least we should explicitly consider it.

hokein marked an inline comment as done.Tue, Dec 12, 6:20 AM
hokein added inline comments.
clangd/index/CMakeLists.txt
5 ↗(On Diff #126519)

As discussed offline, made it to the main clangd library.

clangd/index/Index.cpp
35

hmm, SymbolCollector has such need at the moment -- if (Symbols.find(ID) != Symbols.end()) return true; to avoid creating a duplicated Symbol.

hokein marked an inline comment as done.Tue, Dec 12, 7:30 AM

I'm going to submit this patch to unblock the stuff in https://reviews.llvm.org/D40548. Would be happy to address any further comments afterwards.

This revision was automatically updated to reflect the committed changes.
ioeric added inline comments.Tue, Dec 12, 8:29 AM
clangd/index/Index.h
52

warning: class 'DenseMapInfo' was previously declared as a struct [-Wmismatched-tags]

malaperle added inline comments.Tue, Dec 12, 8:58 AM
clangd/Symbol.h
37 ↗(On Diff #125729)

Agreed we have to consider the overall requirements. I think over 1GB of RAM is not good for our use case, whether is comes from the AST or index. I think it's perfectly normal if we have different requirements but we can see discuss how to design things so there are options to use either more RAM or disk space. It seems the AST would be the most important factor for now so perhaps it's something we should start investigating/discussing.