This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

-
clangd/
-
CMakeLists.txt
-
index/
6/9
Index.h
4
Index.cpp
1/2
SymbolCollector.h
1/2
SymbolCollector.cpp
-
unittests/clangd/
-
clangd/
-
CMakeLists.txt
1/1
SymbolCollectorTests.cpp

Differential D40897

[clangd] Introduce a "Symbol" class.
ClosedPublic

Authored by hokein on Dec 6 2017, 7:46 AM.

Details

Reviewers

ioeric
sammccall
ilya-biryukov
malaperle

Commits

rG4c1394d67d77: [clangd] Introduce a "Symbol" class.
rL320486: [clangd] Introduce a "Symbol" class.
rCTE320486: [clangd] Introduce a "Symbol" class.

Summary

The "Symbol" class represents a C++ symbol in the codebase, containing all the information of a C++ symbol needed by clangd. clangd will use it in clangd's AST/dynamic index and global/static index (code completion and code navigation).
The SymbolCollector (another IndexAction) will be used to recollect the symbols when the source file is changed (for ASTIndex), or to generate all C++ symbols for the whole project.

In the long term (when index-while-building is ready), clangd should share the
same "Symbol" structure and IndexAction with index-while-building, but
for now we want to have some stuff working in clangd.

Diff Detail

Repository

rCTE Clang Tools Extra

Build Status

Buildable 13016
Build 13016: arc lint + arc unit

Event Timeline

hokein created this revision. Dec 6 2017, 7:46 AM

Herald added subscribers: mgorny, klimek. Dec 6 2017, 7:46 AM

Harbormaster completed remote builds in B12811: Diff 125729. Dec 6 2017, 7:46 AM

Hi! Have you looked into D40548 ? Maybe we need to coordinate the two a bit.

clangd/Symbol.h
37 ↗	(On Diff #125729)	I think it would be nice to have methods as an interface to get this data instead of storing them directly. So that an index-on-disk could go fetch the data. Especially the occurrences which can take a lot of memory (I'm working on a branch that does that). But perhaps defining that interface is not within the scope of this patch and could be better discussed in D40548 ?

In D40897#946708, @malaperle wrote:

Hi! Have you looked into D40548 ? Maybe we need to coordinate the two a bit.

Hi Marc! Thanks for the input!

Yeah, Eric and I are working closely on a prototype of global code completion. We have implemented the initial version (see github), and the prototype works well for LLVM project (even with a simple implementation), so we plan to split the patch, improve the code, and contribute it back to clangd repo incrementally.

For the prototype, we will load all symbols (without occurrences) into the memory, and build an in-memory index. From our experiment, the dataset of LLVM project in YAML format is ~120MB (~38,000 symbols), which is acceptable in clangd.

Our rough plan would be:

1. Define the Symbol structure.
2. Design the interfaces of SymbolIndex, ASTIndex.
3. Combine 1) and 2) together to make global code completion work (we'd use YAML dataset for LLVM project, note that this is not a final solution, it would be hidden in an --experimental flag).
4. Switch to use the dataset from index-while-building when it is ready.

clangd/Symbol.h
37 ↗	(On Diff #125729)	I agree. We can't load all the symbol occurrences into the memory since they are too large. We need to design interface for the symbol occurrences. We could discuss the interface here, but CodeCompletion is the main thing which this patch focuses on.

malaperle added a reviewer: malaperle. Dec 6 2017, 12:49 PM

ioeric added inline comments. Dec 7 2017, 1:27 AM

clangd/Symbol.h
23 ↗	(On Diff #125729)	Is this relative or absolute?
26 ↗	(On Diff #125729)	0-based or 1-based?
39 ↗	(On Diff #125729)	It might make sense to just call this `USR` to be more explicit.
51 ↗	(On Diff #125729)	For functions and classes, should we store both declaration and definition locations?

sammccall edited edge metadata. Dec 7 2017, 2:28 AM

Thanks for putting this together! Have a bit of a braindump here, happy to discuss further either here or offline.

clangd/Symbol.h
1 ↗	(On Diff #125729)	I think that: (1) there's other places in clangd that deal with symbols too, this is specifically for indexing; (2) the index interface belongs alongside Symbol. I'd suggest calling this Index.h.
1 ↗	(On Diff #125729)	I don't think having `Symbol`s be completely self-contained objects and passing them around in standard containers like `set`s will prove to be ideal. It means they can't share storage for e.g. location filename, that it's hard for us to arena-allocate them, etc. I think we could use the concept of a set of symbols which share a lifetime. An initial version might just be:

  class SymbolSlab {
  public:
    using iterator = DenseSet<Symbol>::iterator;
    iterator begin();
    iterator end();
  private:
    DenseSet<Symbol> Symbols;
  };

But it's easy to add `StringPool` etc. there. Then this is the natural unit of granularity of large sets of symbols: a dynamic index that deals with one file at a time would operate on (Filename, SymbolSlab) pairs. SymbolCollector would return a SymbolSlab, etc. Then indexes can be built on top of this using non-owning pointers.
23 ↗	(On Diff #125729)	Having every symbol own a copy of the filepath seems wasteful. It seems likely that the index will have per-file information too, so this representation should likely be a key to that. Hash of the filepath might work?
32 ↗	(On Diff #125729)	Let's not add this until we know what's in it. There's going to be an overlap between information needed for CC and other use cases, so this structure might not help the user navigate.
37 ↗	(On Diff #125729)	"We can't load all the symbol occurrences into the memory since they are too large" I've heard this often, but never backed up by data :-) Naively an array of references for a symbol could be doc ID + offset + length, let's say 16 bytes. If a source file consisted entirely of references to 1-character symbols separated by punctuation (1 reference per 2 bytes) then the total size of these references would be 8x the size of the source file - in practice much less. That's not very big. (Maybe there are edge cases with macros/templates, but we can keep them under control)
39 ↗	(On Diff #125729)	USRs are large. Can we use a fixed-size hash?
55 ↗	(On Diff #125729)	I'd suggest == and a hash function instead, unless we think this ordering is particularly meaningful?
68 ↗	(On Diff #125729)	Please pull this into a separate file. Someone providing e.g. symbols from a YAML file shouldn't need to pull in AST stuff. Maybe `IndexFromAST`, which would sort nicely next to `Index`?
72 ↗	(On Diff #125729)	do you really mean to copy here?
74 ↗	(On Diff #125729)	what is the boolean for? you always return true
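
The worst-case estimate in the comment on line 37 above can be checked with a few lines of arithmetic. This is only a sketch: the `Reference` layout (doc ID + offset + length in 16 bytes) follows the figure given in the comment, while the field widths, the struct name, and the function name are assumptions made for illustration and are not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reference record: doc ID + offset + length, 16 bytes in total.
struct Reference {
  std::uint64_t DocID;   // assumed 8-byte document identifier
  std::uint32_t Offset;  // start offset within the document
  std::uint32_t Length;  // length of the referenced token
};
static_assert(sizeof(Reference) == 16, "sketch assumes a 16-byte record");

// Worst case described in the comment: one reference per 2 bytes of source
// (a 1-character symbol followed by one punctuation character).
constexpr std::uint64_t worstCaseReferenceBytes(std::uint64_t SourceBytes) {
  return (SourceBytes / 2) * sizeof(Reference);
}
```

Under these assumptions a 1000-byte file yields at most 500 references, or 8000 bytes, which is the 8x factor quoted in the comment.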

ioeric mentioned this in D40548: [clangd] Symbol index interfaces and an in-memory index implementation. Dec 7 2017, 3:45 AM

ioeric added a child revision: D40548: [clangd] Symbol index interfaces and an in-memory index implementation. Dec 7 2017, 3:46 AM

In D40897#946911, @hokein wrote:

Our rough plan would be:

1. Define the Symbol structure.
2. Design the interfaces of SymbolIndex, ASTIndex.
3. Combine 1) and 2) together to make global code completion work (we'd use YAML dataset for LLVM project, note that this is not a final solution, it would be hidden in an --experimental flag).
4. Switch to use the dataset from index-while-building when it is ready.

Thanks for the explanation. On my end, the plan is not quite sequential. The branch I am developing has interfaces for querying, building and a dataset format, let's call it ClangdIndexDataStorage, which is different from index-while-building (libIndexStore). I also have a version that uses libIndexStore through the same interfaces. So with that in mind, there are two main activities:

Work towards the interfaces for using the index (this patch, and Eric's). From my perspective, I will make sure that it can be as compatible as possible with reading the index from disk and the features we want to develop. One important aspect is to have a good balance between memory and performance. In Eclipse CDT and also the branch I work on using ClangdIndexDataStorage, the emphasis was to minimize memory consumption and have a configurable cache size. But different choices could be made here, perhaps I can start a discussion about that separately.
Work on index-while-building or the other format getting adopted in Clangd. The index-while-building (libIndexStore) is promising but also has a few missing pieces. We need a mapping solution (LMDB equivalent). We also need to make sure it's fast enough and contain enough information for the features we will need, etc.

clangd/Symbol.h
23 ↗	(On Diff #125729)	How we model it is that a symbol doesn't have a "location", but its occurrence do. One could consider the location of a symbol to be either its declaration occurrence (SymbolRole::Declaration) or its definition (SymbolRole::Definition). What we do to get the location path is each occurrence has a pointer (a "database" pointer, but it doesn't matter) to a file entry and then we get the path from the entry. So conceptually, it works a bit like this (although it fetches information on disk):

  class IndexOccurrence {
    IndexOccurrence *FilePtr;

    std::string Occurrence::getPath() {
      return FilePtr->getPath();
    }
  };

malaperle added inline comments. Dec 7 2017, 9:44 AM

clangd/Symbol.h
37 ↗	(On Diff #125729)	I'd have to break down how much memory is used by what, I'll come back to you on that. Indexing llvm with ClangdIndexDataStorage, which is pretty packed, is about 200MB. That's already a lot considering we want to index code bases many times bigger. But I'll try to come up with more precise numbers. I'm open to different strategies.

malaperle added inline comments. Dec 7 2017, 8:52 PM

clangd/Symbol.h
23 ↗	(On Diff #125729)	Oops, wrong type for the field, it should have been:

  class IndexOccurrence {
    IndexFile *FilePtr;

    std::string Occurrence::getPath() {
      return FilePtr->getPath();
    }
  };
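
The model described in these two comments (an occurrence holds a pointer to a file entry and derives its path from it) could look roughly like the following compilable sketch. The class names are taken from the comments; the constructors, the in-memory `std::string` storage, and the `const` qualifiers are assumptions added so the example stands alone. The branch discussed in the thread fetches this data from disk instead.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Per-file entry; every occurrence in the same file shares one of these.
class IndexFile {
public:
  explicit IndexFile(std::string Path) : Path(std::move(Path)) {}
  const std::string &getPath() const { return Path; }

private:
  std::string Path;
};

// An occurrence does not own a copy of the path. It only points at the
// file entry, which must outlive the occurrence.
class IndexOccurrence {
public:
  explicit IndexOccurrence(const IndexFile *File) : FilePtr(File) {
    assert(File && "an occurrence must belong to a file");
  }
  const std::string &getPath() const { return FilePtr->getPath(); }

private:
  const IndexFile *FilePtr; // non-owning
};
```

With this layout, two occurrences in the same file return references to the same string, so the path is stored once per file rather than once per occurrence.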

Address review comments.

Use SymbolSlab, for allowing future space optimization.
Fix a Cast issue when building debug unittest.

Harbormaster completed remote builds in B12903: Diff 126115. Dec 8 2017, 3:29 AM

Thanks for the useful comments!

clangd/Symbol.h
1 ↗	(On Diff #125729)	+1 It makes sense. For initial version, the `Symbol` structure is still owning its fields naively, we could improve it (change to pointer or references) in the future.
23 ↗	(On Diff #125729)	"Is this relative or absolute?" Whether the file path is relative or absolute depends on the build system, the file path could be relative (for header files) or absolute (for .cc files). "How we model it is that a symbol doesn't have a "location", but its occurrence do." We will also have a SymbolOccurrence structure alongside Symbol (but it is not in this patch). The "Location" will be a part of SymbolOccurrence.
26 ↗	(On Diff #125729)	The offset is equivalent to FileOffset in `SourceLocation`. I can't find any document about the FileOffset in LLVM, but it is 0-based.
32 ↗	(On Diff #125729)	Removed it.
37 ↗	(On Diff #125729)	I can see two points here: (1) For all symbol occurrences of a TU, it is not quite large, and we can keep them in memory. (2) For all symbol occurrences of a whole project, it's not a good idea to load all of them into memory (For LLVM project, the size of YAML dataset is ~1.2G).
39 ↗	(On Diff #125729)	USR is one of the implementations of the Identifier, I would keep the name here -- we might change the implementation (like hash(USR)) afterwards. We can use `hash` would make the query symbol from USR a bit harder -- we have to maintain a hash-to-USR lookup table. I think we can do it later when it shows a large memory consumption?
51 ↗	(On Diff #125729)	I think the symbol occurrences would include both declaration and definition locations. The `CanonicalLocation` is providing a fast/convenient way to find the most interesting location of the symbol (e.g. for code navigation, or include the missing path for a symbol), without symbol occurrences.
55 ↗	(On Diff #125729)	It was required when using the `Symbol` structure in a standard library set. Removed it since we are using the `DenseSet` now.
68 ↗	(On Diff #125729)	I can see various meanings of "Index" here: 1) `Index` in `index::IndexDataConsumer`, which collects and constructs all symbols by traversing the AST. 2) The `Index` term used in clangd, specifically for building an index on top of these collected symbols. I think we should be consistent about the index for 2), and `SymbolCollector` is more descriptive here.
74 ↗	(On Diff #125729)	return `true` means continue indexing, while false means abort the indexing. Have added the comment.

More comments, but only two major things really:

I'd like to clearly separate USR from SymbolID (even if you want to keep using USRs as their basis for now)
the file organization (code within files, and names of files) needs some work I think

Everything else is details, this looks good

clangd/Symbol.h
23 ↗	(On Diff #125729)	"Whether the file path is relative or absolute depends on the build system, the file path could be relative (for header files) or absolute (for .cc files)." I'm not convinced this actually works. There's multiple codepaths to the index, how can we ensure we don't end up using inconsistent paths? e.g. we open up a project that includes a system header using a relative path, and then open up that system header from file->open (editor knows only the absolute path) and do "find references". I think we need to canonicalize the paths. Absolute is probably easiest.
37 ↗	(On Diff #125729)	(This is still a sidebar - not asking for any changes) The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor). And "it's not a good idea" isn't an assertion that can hold without reasons, assumptions, and data. If the size turns out to be, say, 120MB for LLVM, and we want to scale to 10x that, and we're spending 500MB/file for ASTs, then it might well be a good trade.
39 ↗	(On Diff #125729)	Personally I'm less worried about the name, and more about the type. I think it's important: (1) that we have a distinct SymbolID type (for conceptual clarity); (2) that we convert to/from this type in as few places as possible (for flexibility); (3) that this type be fixed-size, preferably <=64 bits (for performance); (4) that long-lived references to symbols be expressed as SymbolIDs not pointers (to avoid lifetime confusion). I'm happy to defer some performance considerations to later (though I'm almost certain this will matter). But I'm not sure it changes the conclusion: "We can use hash would make the query symbol from USR a bit harder -- we have to maintain a hash-to-USR lookup table." You only need that if you need a USR. But the SymbolID -> USR conversion should be in one place anyway (point 2) otherwise we'll end up coupled to using USRs forever. And the SymbolID -> Symbol lookup is something the index is going to provide anyway, right?
51 ↗	(On Diff #125729)	I'd be +1 on including both a definition location (if known) and a preferred declaration location, because there's enough use cases that might want definition even if it's not the preferred declaration. But i'm fine if we want to omit the separate definition for now. In that case, call this CanonicalDeclaration?
68 ↗	(On Diff #125729)	`SymbolCollector` is a fine name for the type, but the file should have `Index` in the name, or we should create an `Index` subdirectory. It should be possible to understand which files are part of the index subsystem by scanning the clangd directory. (Did you see the comment was about separating into a different file, not about renaming the class?)
74 ↗	(On Diff #125729)	Oops, sorry, I missed that this was implementing an interface. Please remove the comment or move it to the implementation - it doesn't make sense in the interface. Sorry for the noise!
70 ↗	(On Diff #126115)	nit: lowercase `iterator` is the STL convention, following it tends to make template code work better
72 ↗	(On Diff #126115)	This is dangerous if called after reads, as it invalidates iterators and pointers. I don't think we actually intend to support such mutation, so I'd suggest adding an explicit freeze() function. addSymbol() is only valid before freeze(), and reading functions are only valid after. An assert can enforce this. (This is a cheap version of a builder, which are more painful to write but may also be worth it). If we choose not to enforce this at all, the requirement should be heavily documented!
73 ↗	(On Diff #126115)	Awkward as it is, don't you want iterator find() here?
unittests/clangd/SymbolCollectorTests.cpp
105	If using the same repeatedly, please pull out a MATCHER_P for readability: UnorderedElementsAre(QName("Foo"), ...)
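
The freeze() suggestion in the comment on line 72 above can be sketched as follows. This is an illustration under assumptions, not the code that landed: `std::unordered_map` keyed by a string stands in for `llvm::DenseMap<SymbolID, Symbol>`, the `Symbol` struct is reduced to two fields, and the read-side asserts follow the reviewer's proposal (the thread later debates relaxing them for `find`).

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Reduced stand-in for the patch's Symbol; the string ID is an assumption.
struct Symbol {
  std::string ID;
  std::string QualifiedName;
};

class SymbolSlab {
public:
  using const_iterator =
      std::unordered_map<std::string, Symbol>::const_iterator;

  // Writes are only valid before freeze().
  void insert(Symbol S) {
    assert(!Frozen && "insert() is only valid before freeze()");
    std::string Key = S.ID; // copy the key before moving the symbol
    Symbols.emplace(std::move(Key), std::move(S));
  }

  // Irreversible: after this call the slab is read-only.
  void freeze() { Frozen = true; }

  // Reads are only valid after freeze(), so iterators are never invalidated.
  const_iterator begin() const {
    assert(Frozen && "reads are only valid after freeze()");
    return Symbols.begin();
  }
  const_iterator end() const {
    assert(Frozen && "reads are only valid after freeze()");
    return Symbols.end();
  }
  const_iterator find(const std::string &ID) const {
    assert(Frozen && "reads are only valid after freeze()");
    return Symbols.find(ID);
  }

private:
  bool Frozen = false;
  std::unordered_map<std::string, Symbol> Symbols;
};
```

A collector would call insert() while traversing, call freeze() once, and then hand the slab to readers. The asserts turn an accidental late write into an immediate failure in debug builds instead of a dangling iterator.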

(From D40548) Here's the interface for querying the index that I am using right now. It's meant to be able to retrieve from any kind of "backend", i.e. in-memory, ClangdIndexDataStore, libIndexStore, etc. I was able to implement "Open Workspace Symbol" (which is close to code completion in concept), Find References and Find Definitions.

using USR = llvm::SmallString<256>;

class ClangdIndexDataOccurrence;

class ClangdIndexDataSymbol {
public:
  virtual index::SymbolKind getKind() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mymethod.
  virtual std::string getName() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mynamespace::myclass::
  virtual std::string getQualifier() = 0;
  virtual std::string getUsr() = 0;

  virtual void foreachOccurrence(index::SymbolRoleSet Roles, llvm::function_ref<bool(ClangdIndexDataOccurrence&)> Receiver) = 0;

  virtual ~ClangdIndexDataSymbol() = default;
};

class ClangdIndexDataOccurrence {
public:
  enum class OccurrenceType : uint16_t {
     OCCURRENCE,
     DEFINITION_OCCURRENCE
   };

  virtual OccurrenceType getKind() const = 0;
  virtual std::string getPath() = 0;
  /// Get the start offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getStartOffset(SourceManager &SM) = 0;
  /// Get the end offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getEndOffset(SourceManager &SM) = 0;
  virtual ~ClangdIndexDataOccurrence() = default;
  //TODO: Add relations

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::OCCURRENCE; }
};

/// An occurrence that also has definition with a body that requires additional
/// locations to keep track of the beginning and end of the body.
class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
public:
  virtual uint32_t getDefStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getDefEndOffset(SourceManager &SM) = 0;

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::DEFINITION_OCCURRENCE; }
};

class ClangdIndexDataProvider {
public:

  virtual void foreachSymbols(StringRef Pattern, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;
  virtual void foreachSymbols(const USR &Usr, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;

  virtual ~ClangdIndexDataProvider() = default;
};

The "Clangd" prefix adds a bit too much clutter, so maybe it should be removed. I think the main point is that generic foreachSymbols/foreachOccurrence functions with callbacks are well suited to implementing multiple features with minimal copying.
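
To make the callback style concrete, here is a minimal in-memory sketch of a provider with a `foreachSymbols` method. Everything in it is an assumption for illustration: `SymbolRecord` and `InMemoryProvider` are hypothetical names, `std::function` stands in for `llvm::function_ref`, the pattern is matched as a plain substring, and a `false` return from the receiver is taken to mean "stop iterating" (the interface above does not document the meaning of the boolean).

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, reduced symbol record.
struct SymbolRecord {
  std::string Name;
};

class InMemoryProvider {
public:
  explicit InMemoryProvider(std::vector<SymbolRecord> Records)
      : Symbols(std::move(Records)) {}

  // Calls Receiver for each symbol whose name contains Pattern.
  // Assumption: the receiver returns false to stop the iteration early.
  void foreachSymbols(
      const std::string &Pattern,
      const std::function<bool(const SymbolRecord &)> &Receiver) const {
    for (const SymbolRecord &S : Symbols) {
      if (S.Name.find(Pattern) == std::string::npos)
        continue;
      if (!Receiver(S))
        return;
    }
  }

private:
  std::vector<SymbolRecord> Symbols;
};
```

The receiver sees each record by reference, so a feature such as workspace symbol search can pick out only the fields it needs and stop as soon as it has enough results, without the provider building an intermediate container.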

clangd/Symbol.h
37 ↗	(On Diff #125729)	"The YAML dataset is not a good proxy for how big the data is (at least without an effort to estimate constant factor)." Indeed. I'll try to come up with more realistic numbers. There are other things not accounted for in the 16 bytes mentioned above, like storing roles and relations. "500MB/file for ASTs" What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?

Reorganize files, move to index subdirectory.
Change symbol ID to a hash value, instead of coupling it with USR.

Harbormaster completed remote builds in B12912: Diff 126156. Dec 8 2017, 8:14 AM

Reorganizing the source files made all the comments invalid in the latest version :(. Feel free to comment on the old version or the new version.

clangd/Symbol.h
23 ↗	(On Diff #125729)	Absolute path for .cc file is fine, I was a bit concerned about the path for .h file, especially we might use it in `#include`, but we can figure out later. Changed to absolute file path.
51 ↗	(On Diff #125729)	OK, changed it to `CanonicalDeclarationLoc`, and added a FIXME for the definition.
68 ↗	(On Diff #125729)	Oh, sorry for that. I thought that you were meaning to rename the SymbolCollector class. I prefer to create a subdirectory `index`, and put everything related to the index in that directory, instead of having the word `Index` in the top-level clangd filenames.
72 ↗	(On Diff #126115)	I think the only user to mutate this object is SymbolCollector. Instead of adding `freeze` function, I made it as a private method and declare SymbolCollector as a friend class. Does it look better?

malaperle added inline comments. Dec 8 2017, 3:28 PM

clangd/Symbol.h
37 ↗	(On Diff #125729)	"What do you mean? 500MB worth of occurrences per file? Or Preambles perhaps?" Oh I see, the AST must be in memory for fast reparse. I just tried opening 3 files at the same time and it was already around 500MB. Hmm, that's a bit alarming.

Thanks for the restructuring! I want to take another pass today, but wanted to mention some SymbolID things.

clangd/Symbol.h
37 ↗	(On Diff #125729)	Right, just that we have to consider RAM usage for the index in the context of clangd's overall requirements - if other parts of clangd use 1G of ram for typical work on a large project, then we shouldn't rule out spending a couple of 100MB on the index if it adds a lot of value.
72 ↗	(On Diff #126115)	Hmm, I don't think so. SymbolCollector being the only writer is temporary, this is "arena for symbols" and we want to create symbols in other ways (e.g. remote index). And "friend" to expose one function is a bit iffy.
clangd/index/Index.h
39	This comment doesn't really say anything, and there is a lot to say! What does it distinguish vs not? How do we guarantee uniqueness?
40	Please make symbol ID a real type, constructible from USR. `unsigned` might be 32 bits, which really isn't OK. Even 64 is pushing it for huge external indexes. We need these IDs to be stable across executions and versions, which llvm::HashString is not. I'd suggest using the SHA1 hash (160 bits).
52	nit: remove Loc? It's ambiguous and not really needed
88	You might just want to specialize for SymbolID instead - we should probably mostly use explicit DenseMap<SymbolID, Symbol> as we can avoid constructing the Symbol in a bunch of cases.
clangd/index/SymbolCollector.cpp
31	Maybe resolvePath, since it includes symlink resolution? I'm not sure we actually want to resolve symlinks. Let's chat offline.
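
The request for "a real type, constructible from USR" can be illustrated with the sketch below. It is deliberately not the implementation from the patch: to stay self-contained it swaps in a 64-bit FNV-1a hash where the review asks for SHA1 (160 bits), so it shows only the shape of the type (opaque value, equality, a hash hook for containers) and is narrower than what the reviewer considers acceptable for large indexes. The `hash()` accessor and the `std::hash` specialization are assumptions added for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

class SymbolID {
public:
  SymbolID() = default;
  // The only place where a USR is turned into an ID.
  explicit SymbolID(const std::string &USR) : Value(fnv1a(USR)) {}

  bool operator==(const SymbolID &Other) const { return Value == Other.Value; }
  bool operator!=(const SymbolID &Other) const { return !(*this == Other); }

  // Hook for hashed containers; the raw value stays private.
  std::size_t hash() const { return static_cast<std::size_t>(Value); }

private:
  // FNV-1a, 64-bit: a stand-in for SHA1 that is deterministic across runs.
  static std::uint64_t fnv1a(const std::string &S) {
    std::uint64_t H = 14695981039346656037ULL; // FNV offset basis
    for (unsigned char C : S) {
      H ^= C;
      H *= 1099511628211ULL; // FNV prime
    }
    return H;
  }

  std::uint64_t Value = 0;
};

namespace std {
template <> struct hash<SymbolID> {
  size_t operator()(const SymbolID &ID) const noexcept { return ID.hash(); }
};
} // namespace std
```

Because the value depends only on the bytes of the USR, two processes that see the same USR compute the same ID, which is the stability property asked for in the comment. Swapping the hash for SHA1 would change only `fnv1a` and the width of `Value`.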

Address comments on SymbolID.

Harbormaster completed remote builds in B12977: Diff 126378. Dec 11 2017, 8:47 AM

\o/

clangd/index/Index.cpp
35	assert frozen? (and in begin())
clangd/index/Index.h
46	nit: make HashValue private? provide operator== (and use it from DenseMapInfo).
109	nit: you may want to memoize this in a local static variable, rather than compute it each time: DenseMap calls it a lot.
clangd/index/SymbolCollector.cpp
38	Can you spell out here which symbolic link cases we're handling, and what problems we're trying to avoid? Offline, we talked about the CWD being a symlink. But this is a different case...
clangd/index/SymbolCollector.h
36	What's this for? Seems like we should be able to handle multiple TUs with one collector?
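
The memoization nit on Index.h line 109 presumably refers to the sentinel keys that a hash-map traits class has to return on every probe. The sketch below shows the local-static pattern under that assumption. `SymbolIDInfo`, `buildSentinel`, and the counter are hypothetical names for the example, and the traits struct only mimics the `getEmptyKey`/`getTombstoneKey` part of `llvm::DenseMapInfo`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Reduced stand-in for the patch's SymbolID (20-byte hash value).
struct SymbolID {
  std::array<std::uint8_t, 20> HashValue{};
};

// Counts sentinel constructions so the memoization can be observed.
inline int &sentinelBuildCount() {
  static int Count = 0;
  return Count;
}

// Stand-in for a potentially expensive computation such as hashing a string.
inline SymbolID buildSentinel(std::uint8_t Fill) {
  ++sentinelBuildCount();
  SymbolID ID;
  ID.HashValue.fill(Fill);
  return ID;
}

struct SymbolIDInfo {
  // The function-local static is initialized on the first call only;
  // later calls just return the cached object.
  static const SymbolID &getEmptyKey() {
    static const SymbolID EmptyKey = buildSentinel(0xFF);
    return EmptyKey;
  }
  static const SymbolID &getTombstoneKey() {
    static const SymbolID TombstoneKey = buildSentinel(0xFE);
    return TombstoneKey;
  }
};
```

However many times the map asks for the empty key, the sentinel is built once. Since C++11 the initialization of a function-local static is also thread-safe.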

This revision is now accepted and ready to land. Dec 11 2017, 9:38 AM

Address remaining comments.

Harbormaster completed remote builds in B13008: Diff 126519. Dec 12 2017, 3:25 AM

hokein added inline comments. Dec 12 2017, 3:27 AM

clangd/index/Index.cpp
35	We may not do the assert "frozen" in these getter methods -- as writers may also have needs to access these APIs (e.g. checking whether SymbolID is already in the slab before constructing and inserting a new Symbol).
clangd/index/SymbolCollector.h
36	We don't have particular usage for this method except for testing. Removed it.

Get rid of clangdIndex library, using the existing clangDaemon library.
Remove the getID() method.

Harbormaster completed remote builds in B13016: Diff 126538. Dec 12 2017, 6:10 AM

sammccall added inline comments. Dec 12 2017, 6:10 AM

clangd/index/CMakeLists.txt
5 ↗	(On Diff #126519)	hmm, I'm not sure whether we actually want this to be a separate library. This means we can't depend on anything from elsewhere in clangd, and may force us to create more components. e.g. if we want to pass contexts into the index, or if we want to reuse LSP data models from protocol.h. Maybe we should make this part of the main clangd lib, what do you think?
clangd/index/Index.cpp
35	Right, I'd specifically like not to allow that if possible - it makes it very difficult to understand what the invariants are and what operations are allowed. I think there's no such need now, right? If we want to add one in future, we can relax this check, or add a builder, or similar - at least we should explicitly consider it.

hokein marked an inline comment as done. Dec 12 2017, 6:20 AM

hokein added inline comments.

clangd/index/CMakeLists.txt
5 ↗	(On Diff #126519)	As discussed offline, made it to the main clangd library.
clangd/index/Index.cpp
35	hmm, SymbolCollector has such need at the moment -- `if (Symbols.find(ID) != Symbols.end()) return true;` to avoid creating a duplicated Symbol.

I'm going to submit this patch to unblock the stuff in https://reviews.llvm.org/D40548. Would be happy to address any further comments afterwards.

Closed by commit rCTE320486: [clangd] Introduce a "Symbol" class. (authored by hokein). Dec 12 2017, 7:42 AM

This revision was automatically updated to reflect the committed changes.

ioeric added inline comments. Dec 12 2017, 8:29 AM

clangd/index/Index.h
53	warning: class 'DenseMapInfo' was previously declared as a struct [-Wmismatched-tags]

malaperle added inline comments. Dec 12 2017, 8:58 AM

clangd/Symbol.h
37 ↗	(On Diff #125729)	Agreed we have to consider the overall requirements. I think over 1GB of RAM is not good for our use case, whether it comes from the AST or index. I think it's perfectly normal if we have different requirements but we can discuss how to design things so there are options to use either more RAM or disk space. It seems the AST would be the most important factor for now so perhaps it's something we should start investigating/discussing.

sammccall added inline comments. Dec 13 2017, 12:51 AM

clangd/Symbol.h
37 ↗	(On Diff #125729)	Agree, this is another thing we can discuss tonight.
clangd/index/Index.h
53	Fixed in rCTE320554

hokein added inline comments. Dec 13 2017, 1:05 AM

clangd/index/Index.h
53	Thanks for the fix!

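Revision Contents

Path	Size
clangd/CMakeLists.txt	2 lines
clangd/index/Index.h	136 lines
clangd/index/Index.cpp	49 lines
clangd/index/SymbolCollector.h	43 lines
clangd/index/SymbolCollector.cpp	102 lines
unittests/clangd/CMakeLists.txt	1 line
unittests/clangd/SymbolCollectorTests.cpp	110 lines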

Diff 126538

clangd/CMakeLists.txt

Show All 12 Lines	add_clang_library(clangDaemon
FuzzyMatch.cpp		FuzzyMatch.cpp
GlobalCompilationDatabase.cpp		GlobalCompilationDatabase.cpp
JSONExpr.cpp		JSONExpr.cpp
JSONRPCDispatcher.cpp		JSONRPCDispatcher.cpp
Logger.cpp		Logger.cpp
Protocol.cpp		Protocol.cpp
ProtocolHandlers.cpp		ProtocolHandlers.cpp
Trace.cpp		Trace.cpp
		index/Index.cpp
		index/SymbolCollector.cpp

LINK_LIBS		LINK_LIBS
clangAST		clangAST
clangBasic		clangBasic
clangFormat		clangFormat
clangFrontend		clangFrontend
clangIndex		clangIndex
clangLex		clangLex
Show All 12 Lines

clangd/index/Index.h

This file was added.

				//===--- Symbol.h -----------------------------------------------*- C++-*-===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===---------------------------------------------------------------------===//

				#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_INDEX_H
				#define LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_INDEX_H

				#include "clang/Index/IndexSymbol.h"
				#include "llvm/ADT/DenseMap.h"
				#include "llvm/ADT/StringExtras.h"

				#include <array>
				#include <string>

				namespace clang {
				namespace clangd {

				struct SymbolLocation {
				  // The absolute path of the source file where a symbol occurs.
				  std::string FilePath;
				  // The 0-based offset to the first character of the symbol from the beginning
				  // of the source file.
				  unsigned StartOffset;
				  // The 0-based offset to the last character of the symbol from the beginning
				  // of the source file.
				  unsigned EndOffset;
				};

				// The class identifies a particular C++ symbol (class, function, method, etc).
				//
				// As USRs (Unified Symbol Resolution) could be large, especially for functions
				// with long type arguments, SymbolID is using 160-bits SHA1(USR) values to
				// guarantee the uniqueness of symbols while using a relatively small amount of
				// memory (vs storing USRs directly).
				//
				// SymbolID can be used as key in the symbol indexes to lookup the symbol.
				class SymbolID {
				public:
				  SymbolID() = default;
				  SymbolID(llvm::StringRef USR);

				  bool operator==(const SymbolID& Sym) const {
				    return HashValue == Sym.HashValue;
				  }

				private:
				  friend class llvm::DenseMapInfo<clang::clangd::SymbolID>;

				  std::array<uint8_t, 20> HashValue;
				};

				// The class presents a C++ symbol, e.g. class, function.
				//
				// FIXME: instead of having own copy fields for each symbol, we can share
				// storage from SymbolSlab.
				struct Symbol {
				  // The ID of the symbol.
				  SymbolID ID;
				  // The qualified name of the symbol, e.g. Foo::bar.
				  std::string QualifiedName;
				  // The symbol information, like symbol kind.
				  index::SymbolInfo SymInfo;
				  // The location of the canonical declaration of the symbol.
				  //
				  // A C++ symbol could have multiple declarations and one definition (e.g.
				  // a function is declared in ".h" file, and is defined in ".cc" file).
				  // * For classes, the canonical declaration is usually definition.
				  // * For non-inline functions, the canonical declaration is a declaration
				  // (not a definition), which is usually declared in ".h" file.
				  SymbolLocation CanonicalDeclaration;

				  // FIXME: add definition location of the symbol.
				  // FIXME: add all occurrences support.
				  // FIXME: add extra fields for index scoring signals.
				  // FIXME: add code completion information.
				};

				// A symbol container that stores a set of symbols. The container will maintain
				// the lifetime of the symbols.
				//
				// FIXME: Use a space-efficient implementation, a lot of Symbol fields could
				// share the same storage.
				class SymbolSlab {
				sammccall (Done): You might just want to specialize for SymbolID instead - we should probably mostly use explicit DenseMap<SymbolID, Symbol> as we can avoid constructing the Symbol in a bunch of cases.
				public:
				using const_iterator = llvm::DenseMap<SymbolID, Symbol>::const_iterator;

				SymbolSlab() = default;

				const_iterator begin() const;
				const_iterator end() const;
				const_iterator find(const SymbolID& SymID) const;

				// Once called, no more symbols can be added to the SymbolSlab. This
				// operation is irreversible.
				void freeze();

				void insert(Symbol S);

				private:
				bool Frozen = false;

				llvm::DenseMap<SymbolID, Symbol> Symbols;
				};

				sammccall (Done): nit: you may want to memoize this in a local static variable, rather than compute it each time: DenseMap calls it a lot.
				} // namespace clangd
				} // namespace clang

				namespace llvm {

				template <> struct DenseMapInfo<clang::clangd::SymbolID> {
				static inline clang::clangd::SymbolID getEmptyKey() {
				static clang::clangd::SymbolID EmptyKey("EMPTYKEY");
				return EmptyKey;
				}
				static inline clang::clangd::SymbolID getTombstoneKey() {
				static clang::clangd::SymbolID TombstoneKey("TOMBSTONEKEY");
				return TombstoneKey;
				}
				static unsigned getHashValue(const clang::clangd::SymbolID &Sym) {
				return hash_value(
				ArrayRef<uint8_t>(Sym.HashValue.data(), Sym.HashValue.size()));
				}
				static bool isEqual(const clang::clangd::SymbolID &LHS,
				const clang::clangd::SymbolID &RHS) {
				return LHS == RHS;
				}
				};

				} // namespace llvm

				#endif // LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_INDEX_H
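
The pattern in Index.h (a fixed-width digest wrapped in a real type, plus a hashing trait so the type can serve as a map key) can be sketched without LLVM. The following is only an illustration under stated assumptions: ToySymbolID and toyDigest are hypothetical names, std::unordered_map with a std::hash specialization stands in for llvm::DenseMap with DenseMapInfo, and the digest is a simple FNV-1a based placeholder, not SHA1.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for clangd's SymbolID. The real class fills its 20
// bytes with llvm::SHA1::hash(USR); toyDigest below is NOT SHA1.
class ToySymbolID {
public:
  ToySymbolID() = default;
  explicit ToySymbolID(const std::string &USR) : Digest(toyDigest(USR)) {}

  bool operator==(const ToySymbolID &Other) const {
    return Digest == Other.Digest;
  }

private:
  // Declared with 'struct' to match the primary template's class-key.
  friend struct std::hash<ToySymbolID>;

  // Placeholder digest: five 32-bit FNV-1a lanes, each with its own seed.
  static std::array<std::uint8_t, 20> toyDigest(const std::string &USR) {
    std::array<std::uint8_t, 20> Out{};
    for (std::size_t Lane = 0; Lane < 5; ++Lane) {
      std::uint32_t H = 2166136261u + static_cast<std::uint32_t>(Lane);
      for (unsigned char C : USR) {
        H ^= C;
        H *= 16777619u;
      }
      for (std::size_t I = 0; I < 4; ++I)
        Out[Lane * 4 + I] = static_cast<std::uint8_t>(H >> (8 * I));
    }
    return Out;
  }

  std::array<std::uint8_t, 20> Digest{};
};

namespace std {
// Plays the role that DenseMapInfo<SymbolID>::getHashValue plays in the patch.
template <> struct hash<ToySymbolID> {
  size_t operator()(const ToySymbolID &ID) const noexcept {
    // Reuse the leading bytes of the digest as the table hash.
    size_t H = 0;
    std::memcpy(&H, ID.Digest.data(), sizeof(H));
    return H;
  }
};
} // namespace std
```

With this in place, a std::unordered_map<ToySymbolID, std::string> can be keyed by IDs built from USR strings, and two IDs built from the same USR compare equal. Unlike DenseMap, std::unordered_map needs no empty or tombstone keys, so that part of the trait has no counterpart here.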

clangd/index/Index.cpp

This file was added.

				//===--- Index.cpp ------------------------------------------------ C++--===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "Index.h"

				#include "llvm/Support/SHA1.h"

				namespace clang {
				namespace clangd {

				namespace {
				ArrayRef<uint8_t> toArrayRef(StringRef S) {
				return {reinterpret_cast<const uint8_t *>(S.data()), S.size()};
				}
				} // namespace

				SymbolID::SymbolID(llvm::StringRef USR)
				: HashValue(llvm::SHA1::hash(toArrayRef(USR))) {}

				SymbolSlab::const_iterator SymbolSlab::begin() const {
				return Symbols.begin();
				}

				SymbolSlab::const_iterator SymbolSlab::end() const {
				return Symbols.end();
				}

				SymbolSlab::const_iterator SymbolSlab::find(const SymbolID& SymID) const {
				return Symbols.find(SymID);
				sammccall: assert frozen? (and in begin())
				hokein (author): We may not do the assert "frozen" in these getter methods -- as writers may also have needs to access these APIs (e.g. checking whether SymbolID is already in the slab before constructing and inserting a new Symbol).
				sammccall: Right, I'd specifically like not to allow that if possible - it makes it very difficult to understand what the invariants are and what operations are allowed. I think there's no such need now, right? If we want to add one in future, we can relax this check, or add a builder, or similar - at least we should explicitly consider it.
				hokein (author): hmm, SymbolCollector has such need at the moment -- `if (Symbols.find(ID) != Symbols.end()) return true;` to avoid creating a duplicated Symbol.
				}

				void SymbolSlab::freeze() {
				Frozen = true;
				}

				void SymbolSlab::insert(Symbol S) {
				assert(!Frozen &&
				"Can't insert a symbol after the slab has been frozen!");
				Symbols[S.ID] = std::move(S);
				}

				} // namespace clangd
				} // namespace clang
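
The inline discussion above mentions a builder as one possible way to separate the write phase from the read phase. A minimal sketch of that idea follows. It is not what this patch implements: ToySymbol, ToySlab and ToySlabBuilder are hypothetical names, std::map stands in for llvm::DenseMap, and a plain string stands in for SymbolID.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified symbol: a string key instead of SymbolID.
struct ToySymbol {
  std::string ID;
  std::string QualifiedName;
};

// Read-only container: it has no insert(), so no "frozen" flag is needed.
class ToySlab {
public:
  explicit ToySlab(std::map<std::string, ToySymbol> Init)
      : Symbols(std::move(Init)) {}

  // Returns nullptr when the ID is unknown.
  const ToySymbol *find(const std::string &ID) const {
    auto It = Symbols.find(ID);
    return It == Symbols.end() ? nullptr : &It->second;
  }
  std::size_t size() const { return Symbols.size(); }

private:
  std::map<std::string, ToySymbol> Symbols;
};

// Write-side object: supports the duplicate check a collector needs.
class ToySlabBuilder {
public:
  bool contains(const std::string &ID) const { return Symbols.count(ID) != 0; }

  // Returns false and keeps the existing entry if the ID is already present.
  bool insert(ToySymbol S) {
    std::string Key = S.ID;
    return Symbols.emplace(std::move(Key), std::move(S)).second;
  }

  // Consumes the builder; callable only on an rvalue (std::move(B).build()).
  ToySlab build() && { return ToySlab(std::move(Symbols)); }

private:
  std::map<std::string, ToySymbol> Symbols;
};
```

In this arrangement a collector would hold the builder and call contains() or insert() freely, while readers only ever see a ToySlab, so the question of asserting "frozen" in the getters does not arise.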

clangd/index/SymbolCollector.h

This file was added.

				//===--- SymbolCollector.h ---------------------------------------- C++--===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "Index.h"

				#include "clang/Index/IndexDataConsumer.h"
				#include "clang/Index/IndexSymbol.h"

				namespace clang {
				namespace clangd {

				// Collect all symbols from an AST.
				//
				// Clients (e.g. clangd) can use SymbolCollector together with
				// index::indexTopLevelDecls to retrieve all symbols when the source file is
				// changed.
				class SymbolCollector : public index::IndexDataConsumer {
				public:
				SymbolCollector() = default;

				bool
				handleDeclOccurence(const Decl *D, index::SymbolRoleSet Roles,
				ArrayRef<index::SymbolRelation> Relations, FileID FID,
				unsigned Offset,
				index::IndexDataConsumer::ASTNodeInfo ASTNode) override;

				void finish() override;

				SymbolSlab takeSymbols() const { return std::move(Symbols); }

				sammccall (Done): What's this for? Seems like we should be able to handle multiple TUs with one collector?
				hokein (author): We don't have particular usage for this method except for testing. Removed it.
				private:
				// All Symbols collected from the AST.
				SymbolSlab Symbols;
				};

				} // namespace clangd
				} // namespace clang

clangd/index/SymbolCollector.cpp

This file was added.

				//===--- SymbolCollector.cpp -------------------------------------- C++--===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "SymbolCollector.h"

				#include "clang/AST/ASTContext.h"
				#include "clang/AST/Decl.h"
				#include "clang/AST/DeclCXX.h"
				#include "clang/Basic/SourceManager.h"
				#include "clang/Index/IndexSymbol.h"
				#include "clang/Index/USRGeneration.h"
				#include "llvm/Support/MemoryBuffer.h"
				#include "llvm/Support/Path.h"

				namespace clang {
				namespace clangd {

				namespace {
				// Make the Path absolute using the current working directory of the given
				// SourceManager if the Path is not an absolute path.
				//
				// The Path can be a path relative to the build directory, or retrieved from
				// the SourceManager.
				std::string makeAbsolutePath(const SourceManager &SM, StringRef Path) {
				llvm::SmallString<128> AbsolutePath(Path);
				sammccall: maybe resolvePath? since it includes symlink resolution. I'm not sure we actually want to resolve symlinks. Let's chat offline.
				if (std::error_code EC =
				SM.getFileManager().getVirtualFileSystem()->makeAbsolute(
				AbsolutePath))
				llvm::errs() << "Warning: could not make absolute file: '" << EC.message()
				<< "'\n";
				// Handle the symbolic link path case where the current working directory
				// (getCurrentWorkingDirectory) is a symlink. We always want the real
				sammccall (Done): Can you spell out here which symbolic link cases we're handling, and what problems we're trying to avoid? Offline, we talked about the CWD being a symlink. But this is a different case...
				// file path (instead of the symlink path) for the C++ symbols.
				//
				// Consider the following example:
				//
				// src dir: /project/src/foo.h
				// current working directory (symlink): /tmp/build -> /project/src/
				//
				// The file path of Symbol is "/project/src/foo.h" instead of
				// "/tmp/build/foo.h"
				const DirectoryEntry *Dir = SM.getFileManager().getDirectory(
				llvm::sys::path::parent_path(AbsolutePath.str()));
				if (Dir) {
				StringRef DirName = SM.getFileManager().getCanonicalName(Dir);
				SmallVector<char, 128> AbsoluteFilename;
				llvm::sys::path::append(AbsoluteFilename, DirName,
				llvm::sys::path::filename(AbsolutePath.str()));
				return llvm::StringRef(AbsoluteFilename.data(), AbsoluteFilename.size())
				.str();
				}
				return AbsolutePath.str();
				}
				} // namespace

				// Always return true to continue indexing.
				bool SymbolCollector::handleDeclOccurence(
				const Decl *D, index::SymbolRoleSet Roles,
				ArrayRef<index::SymbolRelation> Relations, FileID FID, unsigned Offset,
				index::IndexDataConsumer::ASTNodeInfo ASTNode) {
				// FIXME: collect all symbol references.
				if (!(Roles & static_cast<unsigned>(index::SymbolRole::Declaration) ||
				Roles & static_cast<unsigned>(index::SymbolRole::Definition)))
				return true;

				if (const NamedDecl *ND = llvm::dyn_cast<NamedDecl>(D)) {
				// FIXME: Should we include the internal linkage symbols?
				if (!ND->hasExternalFormalLinkage() || ND->isInAnonymousNamespace())
				return true;

				llvm::SmallVector<char, 128> Buff;
				if (index::generateUSRForDecl(ND, Buff))
				return true;

				std::string USR(Buff.data(), Buff.size());
				auto ID = SymbolID(USR);
				if (Symbols.find(ID) != Symbols.end())
				return true;

				auto &SM = ND->getASTContext().getSourceManager();
				SymbolLocation Location = {
				makeAbsolutePath(SM, SM.getFilename(D->getLocation())),
				SM.getFileOffset(D->getLocStart()), SM.getFileOffset(D->getLocEnd())};
				Symbols.insert({std::move(ID), ND->getQualifiedNameAsString(),
				index::getSymbolInfo(D), std::move(Location)});
				}

				return true;
				}

				void SymbolCollector::finish() {
				Symbols.freeze();
				}

				} // namespace clangd
				} // namespace clang
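
The two steps that makeAbsolutePath performs (make the path absolute, then swap the parent directory for its canonical, symlink-free name while keeping the file name) can be approximated with the C++17 standard library. This is a rough sketch only: resolveThroughParent is a hypothetical helper, it uses std::filesystem and the process working directory rather than clang's FileManager and virtual file system, and it falls back to the unresolved absolute path when the parent directory cannot be canonicalized.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper mirroring the shape of makeAbsolutePath above.
std::string resolveThroughParent(const fs::path &Path) {
  assert(!Path.empty() && "expected a non-empty path");

  // Step 1: make the path absolute relative to the current working directory.
  fs::path Absolute = fs::absolute(Path);

  // Step 2: resolve symlinks in the parent directory only. The file itself
  // does not have to exist, but its parent directory does.
  std::error_code EC;
  fs::path RealDir = fs::canonical(Absolute.parent_path(), EC);
  if (EC)
    return Absolute.string(); // Parent missing or unreadable: keep step 1.

  return (RealDir / Absolute.filename()).string();
}
```

For the layout described in the comment above (a working directory /tmp/build that is a symlink to /project/src/), passing /tmp/build/foo.h to this helper is expected to yield the path under /project/src.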

unittests/clangd/CMakeLists.txt

	add_extra_unittest(ClangdTests			add_extra_unittest(ClangdTests
	ClangdTests.cpp			ClangdTests.cpp
	CodeCompleteTests.cpp			CodeCompleteTests.cpp
	FuzzyMatchTests.cpp			FuzzyMatchTests.cpp
	JSONExprTests.cpp			JSONExprTests.cpp
	TestFS.cpp			TestFS.cpp
	TraceTests.cpp			TraceTests.cpp
				SymbolCollectorTests.cpp
	)			)

	target_link_libraries(ClangdTests			target_link_libraries(ClangdTests
	PRIVATE			PRIVATE
	clangBasic			clangBasic
	clangDaemon			clangDaemon
	clangFormat			clangFormat
	clangFrontend			clangFrontend
	clangSema			clangSema
	clangTooling			clangTooling
	clangToolingCore			clangToolingCore
	LLVMSupport			LLVMSupport
	)			)

unittests/clangd/SymbolCollectorTests.cpp

This file was added.

				//===-- SymbolCollectorTests.cpp -------------------------------- C++ --===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//

				#include "index/SymbolCollector.h"
				#include "clang/Index/IndexingAction.h"
				#include "clang/Basic/FileManager.h"
				#include "clang/Basic/FileSystemOptions.h"
				#include "clang/Basic/VirtualFileSystem.h"
				#include "clang/Frontend/CompilerInstance.h"
				#include "clang/Tooling/Tooling.h"
				#include "llvm/ADT/IntrusiveRefCntPtr.h"
				#include "llvm/ADT/STLExtras.h"
				#include "llvm/ADT/StringRef.h"
				#include "llvm/Support/MemoryBuffer.h"
				#include "gtest/gtest.h"
				#include "gmock/gmock.h"

				#include <memory>
				#include <string>

				using testing::UnorderedElementsAre;
				using testing::Eq;
				using testing::Field;

				// GMock helpers for matching Symbol.
				MATCHER_P(QName, Name, "") { return arg.second.QualifiedName == Name; }

				namespace clang {
				namespace clangd {

				namespace {
				class SymbolIndexActionFactory : public tooling::FrontendActionFactory {
				public:
				SymbolIndexActionFactory() = default;

				clang::FrontendAction *create() override {
				index::IndexingOptions IndexOpts;
				IndexOpts.SystemSymbolFilter =
				index::IndexingOptions::SystemSymbolFilterKind::All;
				IndexOpts.IndexFunctionLocals = false;
				Collector = std::make_shared<SymbolCollector>();
				FrontendAction *Action =
				index::createIndexingAction(Collector, IndexOpts, nullptr).release();
				return Action;
				}

				std::shared_ptr<SymbolCollector> Collector;
				};

				class SymbolCollectorTest : public ::testing::Test {
				public:
				bool runSymbolCollector(StringRef HeaderCode, StringRef MainCode) {
				llvm::IntrusiveRefCntPtr<vfs::InMemoryFileSystem> InMemoryFileSystem(
				new vfs::InMemoryFileSystem);
				llvm::IntrusiveRefCntPtr<FileManager> Files(
				new FileManager(FileSystemOptions(), InMemoryFileSystem));

				const std::string FileName = "symbol.cc";
				const std::string HeaderName = "symbols.h";
				auto Factory = llvm::make_unique<SymbolIndexActionFactory>();

				tooling::ToolInvocation Invocation(
				{"symbol_collector", "-fsyntax-only", "-std=c++11", FileName},
				Factory->create(), Files.get(),
				std::make_shared<PCHContainerOperations>());

				InMemoryFileSystem->addFile(HeaderName, 0,
				llvm::MemoryBuffer::getMemBuffer(HeaderCode));

				std::string Content = "#include\"" + std::string(HeaderName) + "\"";
				Content += "\n" + MainCode.str();
				InMemoryFileSystem->addFile(FileName, 0,
				llvm::MemoryBuffer::getMemBuffer(Content));
				Invocation.run();
				Symbols = Factory->Collector->takeSymbols();
				return true;
				}

				protected:
				SymbolSlab Symbols;
				};

				TEST_F(SymbolCollectorTest, CollectSymbol) {
				const std::string Header = R"(
				class Foo {
				void f();
				};
				void f1();
				inline void f2() {}
				)";
				const std::string Main = R"(
				namespace {
				void ff() {} // ignore
				}
				void f1() {}
				)";
				runSymbolCollector(Header, Main);
				EXPECT_THAT(Symbols, UnorderedElementsAre(QName("Foo"), QName("Foo::f"),
				QName("f1"), QName("f2")));
				sammccall (Done): If using the same repeatedly, please pull out a MATCHER_P for readability: UnorderedElementsAre(QName("Foo"), ...)
				}

				} // namespace
				} // namespace clangd
				} // namespace clang