[clangd] Symbol index interfaces and index-based code completion.
Needs ReviewPublic

Authored by ioeric on Tue, Nov 28, 4:29 AM.

Details

Summary

o Index interfaces to support using different index sources (e.g. AST index, global index) for code completion, cross-reference finding etc. This patch focuses on code completion.
o Implement an AST-based index.
o Add an option to replace sema code completion for qualified-id with index-based completion.
o Implement an initial naive code completion index which matches symbols that have the query string as substring.

FIXME:

  • There is no test yet. Will add them when interfaces are stabilized.
  • This depends on D40897 for the symbol structure and symbol sourcing. Logic around symbol storage in this patch might change depending on how D40897 goes.
ioeric created this revision.Tue, Nov 28, 4:29 AM

Hi Eric! As you might know I'm working on persisted indexing. I was wondering which cases needed the index for code completion? Could you give a small example? I thought the AST alone would be sufficient for that. I'll look at this patch more closely a bit later but I can look at what needs to be added in the persisted index interface/data.

If you'd like to see the work in progress, it's here:
https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype
https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype/clangd/index/README

Hi Eric! As you might know I'm working on persisted indexing. I was wondering which cases needed the index for code completion? Could you give a small example? I thought the AST alone would be sufficient for that. I'll look at this patch more closely a bit later but I can look at what needs to be added in the persisted index interface/data.

Just wanted to chime in to give my perspective on some of the work that's being done here.

The index would allow to implement a project-wide completion (e.g., even if you don't #include "llvm/ADT/DenseMap.h", you'll get completion for DenseMap and the #include will be added automatically upon completing an item).
This is not aiming trying to have persistent indexes, instead trying to figure out how the right interfaces and plug things into ClangdServer properly, so that we can later play around with different implementations of the indexes. (I.e., we'll probably have our internal implementation at Google at some point).

I guess the main thing right now would be to align on how the SymbolIndex interface should look like.

clangd/ClangdLSPServer.h
41

Maybe pass a parameter of type SymbolIndex& instead of a vector, which is used to create CombinedSymbolIndex later?
It seems that ClangdServer does not really care about the specific index implementation that's used (nor should it!)

We could have helper methods to conveniently create CombinedSymbolIndex from a list of indexes, or even create the default index for clangd.

Hi Eric! As you might know I'm working on persisted indexing. I was wondering which cases needed the index for code completion? Could you give a small example? I thought the AST alone would be sufficient for that. I'll look at this patch more closely a bit later but I can look at what needs to be added in the persisted index interface/data.

If you'd like to see the work in progress, it's here:
https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype
https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype/clangd/index/README

Hi Marc! Yes, I was aware of your work. I wanted to upload the prototype early on to get discussions started, especially to get alignment with your work on work space index.

As Ilya said, we would like clangd to support global code completion and other LSP features based on global index (i.e. all symbols in the code base). For example, for code completion on scope specifiers "nx::ny::", we want to be able to complete "nx::ny::X" even if the symbol is not in the AST yet. However, it might not always be feasible to rely on clangd to generate index for a huge code base, so we want to be able to support plugging in additional index sources. The index can be from opened files in clangd, all files in the project (e.g. clangd/ directory), or all files in a huge code base. IIUC, the persistent index you are working on would fall into the project index bucket?

This patch tries to define interfaces for all indexes and how they would be used/merged in clangd, and we would really like your feedback on it. The code completion and AST index changes in the patch are mostly for experimenting with the interfaces.

ioeric updated this revision to Diff 124745.Wed, Nov 29, 7:05 AM

Merged with origin/master

hokein added a subscriber: hokein.Thu, Nov 30, 1:01 AM

Sorry for the delay, I'm looking into this now. I actually don't query (or even store!) any symbol names in the index interface now, only Occurrences queried by USR which is enough for findReferences and findDefinitions. It looks like storing names is the main part in this patch. So our two interfaces have almost no overlap :) But as I want to implement "Open Workspace Symbol" soon, I'll try to add symbol names too and see how it everything fits and make proposals.

Here's how the index "provider" interface looks like now:

class ClangdIndexDataOccurrence {
public:
  enum class OccurrenceType {
     OCCURRENCE,
     DEFINITION_OCCURRENCE
   };

  virtual OccurrenceType getKind() const = 0;
  virtual std::string getPath() = 0;
  virtual uint32_t getStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getEndOffset(SourceManager &SM) = 0;
};

///An occurrence that also has definition with a body that requires additional
///locations to keep track of the beginning and end of the body.
class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
  virtual uint32_t getDefStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getDefEndOffset(SourceManager &SM) = 0;
};

class ClangdIndexDataProvider {
  virtual void foreachOccurrence(const USR& Usr, index::SymbolRoleSet Roles, llvm::function_ref<bool(ClangdIndexDataOccurrence&)> Receiver) = 0;

  // Those are mainly for debug for now. They dump information about file
  //dependencies.
  virtual void dumpIncludedBy(StringRef File) {}
  virtual void dumpInclusions(StringRef File) {}
};

https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype/clangd/index/ClangdIndexDataProvider.h

I've successfully implemented this interface with both ClangdIndexDataStorage (the storage solution I worked on) and libIndexStore (patch D39050, see also https://github.com/MarkZ3/clang-tools-extra/tree/IndexWithLibIndexStore, WARNING: proof of concept code only)

clangd/ASTIndex.h
51 ↗(On Diff #124745)

UID == usr?

Hi Marc, the patch is not ready for review yet. I am still cleaning up the prototype and will let you know when it's ready for review.

ioeric updated this revision to Diff 125928.Thu, Dec 7, 3:45 AM
  • More cleanups and merged with D40897
ioeric retitled this revision from [clangd] Prototyping index support and naive index-based global code completion. WIP to [clangd] Symbol index interfaces and index-based code completion..Thu, Dec 7, 3:56 AM
ioeric edited the summary of this revision. (Show Details)
ioeric added a reviewer: sammccall.
ioeric edited the summary of this revision. (Show Details)Thu, Dec 7, 4:37 AM
ioeric updated this revision to Diff 125960.Thu, Dec 7, 7:49 AM
  • Use IncludeNamespaceLevelDecls option; fix some broken tests.
ioeric updated this revision to Diff 125962.Thu, Dec 7, 7:51 AM

Diff on D40897 instead origin/master!

sammccall added inline comments.Thu, Dec 7, 9:33 AM
clangd/ClangdIndex.h
2

nit: Clangd prefix doesn't do much.
I'd suggest Index.h for the interface (including Symbol), and MemIndex.h for the in-memory implementations?

25

This seems a little premature to me - it's hard to know if this idea makes sense without actually having multiple implementations to mix together. My gut feeling is that sorting by (index weight, symbol score) puts too much emphasis on the weight.

Can we just have a single optional index for now?

37

If you do keep this class, move the label inside the struct? Avoids a lot of std::pair

51

As discussed offline - the lifetime of the contained symbols and the operations on the index derived from it need to be carefully considered.

Involving inheritance here seems likely to make these tricky interactions even trickier.

Would suggest:

  • a concrete class (SymbolCache? FileSymbols?) to hold a collection of symbols from multiple files, with the ability to iterate over them and replace all the symbols from one file at once
  • a concrete class MemIndex that can be constructed from a sequence of symbols and implements Index

You probably want to make them both immutable with snapshot semantics, or have a reader-writer lock that spans both.

73

The AST -> symbol conversion doesn't seem to have much to do with the rest of the class - we can move this to a free function I think, which would give the class a narrower responsibility.

clangd/ClangdLSPServer.h
41

I think the issue is that ClangdServer is going to create and maintain the dynamic index, which should then be merged as a peer with a set of other indexes.

Can we just pass in a single SymbolIndex* StaticIdx, and hard-code the assumption about static index + dynamic index being what we're using? That way we can also hard-code the heuristics for merging them together, instead of data-driving everything.

42

this is kind of two options in one - "should we build the index" and "should we query the index".

That seems OK, but I'd consider renaming this to BuildDynamicSymbolIndex or something to make it clear that building is optional, querying it isn't provided as an option

clangd/CodeComplete.cpp
395

I don't like the idea of doing index lookups from the middle of a sema callback!

Can we just record the CCContext in the Collector instead, and do this work afterwards?

clangd/CodeComplete.h
71

I'm not sure you want to spell out this behavior - merging vs replacing is more of an implementation detail

clangd/SymbolIndex.h
46

Interfaces need doc :-)

50

As discussed, we may want a callback-based API that makes it easier for an implementation to manage the lifetime of symbols without copying them.

Also is there anything to do with errors other than log and treat as no results? If not, just do that instead of returning Expected.

51

This is finding symbols that fuzzy-match a string, right?
We shouldn't conflate this with code completion - code completion is context-sensitive, and this query operation will be used for other operations like navigate-to-symbol.

Suggest fuzzyFind or similar.
(Similarly with CompletionRequest, CompletionSymbol)

As a follow-up, here's the interface for querying the index that I am using right now. It's meant to be able to retrieve from any kind of "backend", i.e. in-memory, ClangdIndexDataStore, libIndexStore, etc. I was able to implement "Open Workspace Symbol" (which is close to code completion in concept), Find References and Find Definitions.

using USR = llvm::SmallString<256>;

class ClangdIndexDataOccurrence;

class ClangdIndexDataSymbol {
public:
  virtual index::SymbolKind getKind() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mymethod.
  virtual std::string getName() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mynamespace::myclass::
  virtual std::string getQualifier() = 0;
  virtual std::string getUsr() = 0;

  virtual void foreachOccurrence(index::SymbolRoleSet Roles, llvm::function_ref<bool(ClangdIndexDataOccurrence&)> Receiver) = 0;

  virtual ~ClangdIndexDataSymbol() = default;
};

class ClangdIndexDataOccurrence {
public:
  enum class OccurrenceType : uint16_t {
     OCCURRENCE,
     DEFINITION_OCCURRENCE
   };

  virtual OccurrenceType getKind() const = 0;
  virtual std::string getPath() = 0;
  /// Get the start offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getStartOffset(SourceManager &SM) = 0;
  /// Get the end offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getEndOffset(SourceManager &SM) = 0;
  virtual ~ClangdIndexDataOccurrence() = default;
  //TODO: Add relations

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::OCCURRENCE; }
};

/// An occurrence that also has definition with a body that requires additional
/// locations to keep track of the beginning and end of the body.
class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
public:
  virtual uint32_t getDefStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getDefEndOffset(SourceManager &SM) = 0;

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::DEFINITION_OCCURRENCE; }
};

class ClangdIndexDataProvider {
public:

  virtual void foreachSymbols(StringRef Pattern, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;
  virtual void foreachSymbols(const USR &Usr, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;

  virtual ~ClangdIndexDataProvider() = default;
};

The "Clangd" prefix adds a bit much of clutter so maybe it should be removed. I think the main points are that having generic foreachSymbols/foreachOccurrence with callbacks is well suited to implement multiple features with minimal copying.

clangd/SymbolIndex.h
51

Similar to completion is "Open Workspace Symbol". So a more generic query could be useful.

ioeric added a comment.Fri, Dec 8, 2:16 AM

As a follow-up, here's the interface for querying the index that I am using right now. It's meant to be able to retrieve from any kind of "backend", i.e. in-memory, ClangdIndexDataStore, libIndexStore, etc. I was able to implement "Open Workspace Symbol" (which is close to code completion in concept), Find References and Find Definitions.

using USR = llvm::SmallString<256>;

class ClangdIndexDataOccurrence;

class ClangdIndexDataSymbol {
public:
  virtual index::SymbolKind getKind() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mymethod.
  virtual std::string getName() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mynamespace::myclass::
  virtual std::string getQualifier() = 0;
  virtual std::string getUsr() = 0;

  virtual void foreachOccurrence(index::SymbolRoleSet Roles, llvm::function_ref<bool(ClangdIndexDataOccurrence&)> Receiver) = 0;

  virtual ~ClangdIndexDataSymbol() = default;
};

class ClangdIndexDataOccurrence {
public:
  enum class OccurrenceType : uint16_t {
     OCCURRENCE,
     DEFINITION_OCCURRENCE
   };

  virtual OccurrenceType getKind() const = 0;
  virtual std::string getPath() = 0;
  /// Get the start offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getStartOffset(SourceManager &SM) = 0;
  /// Get the end offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getEndOffset(SourceManager &SM) = 0;
  virtual ~ClangdIndexDataOccurrence() = default;
  //TODO: Add relations

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::OCCURRENCE; }
};

/// An occurrence that also has definition with a body that requires additional
/// locations to keep track of the beginning and end of the body.
class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
public:
  virtual uint32_t getDefStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getDefEndOffset(SourceManager &SM) = 0;

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::DEFINITION_OCCURRENCE; }
};

class ClangdIndexDataProvider {
public:

  virtual void foreachSymbols(StringRef Pattern, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;
  virtual void foreachSymbols(const USR &Usr, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;

  virtual ~ClangdIndexDataProvider() = default;
};

I think some of the ideas here could be useful. This patch focuses mostly on index interfaces and D40897 emphasizes on the design of symbol structure. The way symbols are stored and used in this patch is likely to change depending on how D40897 goes.

The "Clangd" prefix adds a bit much of clutter so maybe it should be removed. I think the main points are that having generic foreachSymbols/foreachOccurrence with callbacks is well suited to implement multiple features with minimal copying.

Although I'm not sure if foreachSymbols/... would be feasible for all indexes yet, we do plan to switch to callback-based index interfaces, which Sam also proposed in the review comments.

There have been some offline discussions happening around clangd's indexing, and sorry that we have not been able to keep you up to date. I think it might be more efficient if we could meet via VC/Hangouts and sync on our designs. If you don't mind a meeting, I am happy to arrange it via emails.

hokein added a comment.Fri, Dec 8, 4:14 AM

Thanks for the feedback, Marc!

Yeah, I think the ClangdIndexDataSymbol and ClangdIndexDataOccurrence are something similar to what https://reviews.llvm.org/D40897 trying to address, maybe put the discussion there? Before that, I think having a sym meeting is a good idea to keep us in the same page.

I think some of the ideas here could be useful. This patch focuses mostly on index interfaces and D40897 emphasizes on the design of symbol structure. The way symbols are stored and used in this patch is likely to change depending on how D40897 goes.

The "Clangd" prefix adds a bit much of clutter so maybe it should be removed. I think the main points are that having generic foreachSymbols/foreachOccurrence with callbacks is well suited to implement multiple features with minimal copying.

Although I'm not sure if foreachSymbols/... would be feasible for all indexes yet, we do plan to switch to callback-based index interfaces, which Sam also proposed in the review comments.

There have been some offline discussions happening around clangd's indexing, and sorry that we have not been able to keep you up to date. I think it might be more efficient if we could meet via VC/Hangouts and sync on our designs. If you don't mind a meeting, I am happy to arrange it via emails.

ioeric updated this revision to Diff 126379.Mon, Dec 11, 8:55 AM
ioeric marked 8 inline comments as done.
  • Address review comments.
ioeric added inline comments.Mon, Dec 11, 8:55 AM
clangd/ClangdIndex.h
2

I'll do the renaming when D40897 is landed since it defines Symbol and would probably create Index.h as well.

51

The current revision implements a FileSymbols with the snapshot semantics. Not entirely sure if this is what you are looking for. Let me know...

73

Moved the conversion code to ClangdUnit.cpp

clangd/CodeComplete.cpp
395

Good point!