This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Symbol index interfaces and an in-memory index implementation.
ClosedPublic

Authored by ioeric on Nov 28 2017, 4:29 AM.

Details

Summary

o Index interfaces to support using different index sources (e.g. AST index, global index) for code completion, cross-reference finding etc. This patch focuses on code completion.

The following changes in the original patch has been split out.
o Implement an AST-based index.
o Add an option to replace sema code completion for qualified-id with index-based completion.
o Implement an initial naive code completion index which matches symbols that have the query string as substring.

Diff Detail

Event Timeline

ioeric created this revision.Nov 28 2017, 4:29 AM

Hi Eric! As you might know I'm working on persisted indexing. I was wondering which cases needed the index for code completion? Could you give a small example? I thought the AST alone would be sufficient for that. I'll look at this patch more closely a bit later but I can look at what needs to be added in the persisted index interface/data.

If you'd like to see the work in progress, it's here:
https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype
https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype/clangd/index/README

Hi Eric! As you might know I'm working on persisted indexing. I was wondering which cases needed the index for code completion? Could you give a small example? I thought the AST alone would be sufficient for that. I'll look at this patch more closely a bit later but I can look at what needs to be added in the persisted index interface/data.

Just wanted to chime in to give my perspective on some of the work that's being done here.

The index would allow to implement a project-wide completion (e.g., even if you don't #include "llvm/ADT/DenseMap.h", you'll get completion for DenseMap and the #include will be added automatically upon completing an item).
This is not aiming trying to have persistent indexes, instead trying to figure out how the right interfaces and plug things into ClangdServer properly, so that we can later play around with different implementations of the indexes. (I.e., we'll probably have our internal implementation at Google at some point).

I guess the main thing right now would be to align on how the SymbolIndex interface should look like.

clangd/ClangdLSPServer.h
37 ↗(On Diff #124544)

Maybe pass a parameter of type SymbolIndex& instead of a vector, which is used to create CombinedSymbolIndex later?
It seems that ClangdServer does not really care about the specific index implementation that's used (nor should it!)

We could have helper methods to conveniently create CombinedSymbolIndex from a list of indexes, or even create the default index for clangd.

Hi Eric! As you might know I'm working on persisted indexing. I was wondering which cases needed the index for code completion? Could you give a small example? I thought the AST alone would be sufficient for that. I'll look at this patch more closely a bit later but I can look at what needs to be added in the persisted index interface/data.

If you'd like to see the work in progress, it's here:
https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype
https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype/clangd/index/README

Hi Marc! Yes, I was aware of your work. I wanted to upload the prototype early on to get discussions started, especially to get alignment with your work on work space index.

As Ilya said, we would like clangd to support global code completion and other LSP features based on global index (i.e. all symbols in the code base). For example, for code completion on scope specifiers "nx::ny::", we want to be able to complete "nx::ny::X" even if the symbol is not in the AST yet. However, it might not always be feasible to rely on clangd to generate index for a huge code base, so we want to be able to support plugging in additional index sources. The index can be from opened files in clangd, all files in the project (e.g. clangd/ directory), or all files in a huge code base. IIUC, the persistent index you are working on would fall into the project index bucket?

This patch tries to define interfaces for all indexes and how they would be used/merged in clangd, and we would really like your feedback on it. The code completion and AST index changes in the patch are mostly for experimenting with the interfaces.

ioeric updated this revision to Diff 124745.Nov 29 2017, 7:05 AM

Merged with origin/master

hokein added a subscriber: hokein.Nov 30 2017, 1:01 AM

Sorry for the delay, I'm looking into this now. I actually don't query (or even store!) any symbol names in the index interface now, only Occurrences queried by USR which is enough for findReferences and findDefinitions. It looks like storing names is the main part in this patch. So our two interfaces have almost no overlap :) But as I want to implement "Open Workspace Symbol" soon, I'll try to add symbol names too and see how it everything fits and make proposals.

Here's how the index "provider" interface looks like now:

class ClangdIndexDataOccurrence {
public:
  enum class OccurrenceType {
     OCCURRENCE,
     DEFINITION_OCCURRENCE
   };

  virtual OccurrenceType getKind() const = 0;
  virtual std::string getPath() = 0;
  virtual uint32_t getStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getEndOffset(SourceManager &SM) = 0;
};

///An occurrence that also has definition with a body that requires additional
///locations to keep track of the beginning and end of the body.
class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
  virtual uint32_t getDefStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getDefEndOffset(SourceManager &SM) = 0;
};

class ClangdIndexDataProvider {
  virtual void foreachOccurrence(const USR& Usr, index::SymbolRoleSet Roles, llvm::function_ref<bool(ClangdIndexDataOccurrence&)> Receiver) = 0;

  // Those are mainly for debug for now. They dump information about file
  //dependencies.
  virtual void dumpIncludedBy(StringRef File) {}
  virtual void dumpInclusions(StringRef File) {}
};

https://github.com/MarkZ3/clang-tools-extra/blob/IndexFunctionsPrototype/clangd/index/ClangdIndexDataProvider.h

I've successfully implemented this interface with both ClangdIndexDataStorage (the storage solution I worked on) and libIndexStore (patch D39050, see also https://github.com/MarkZ3/clang-tools-extra/tree/IndexWithLibIndexStore, WARNING: proof of concept code only)

clangd/ASTIndex.h
51 ↗(On Diff #124745)

UID == usr?

Hi Marc, the patch is not ready for review yet. I am still cleaning up the prototype and will let you know when it's ready for review.

ioeric updated this revision to Diff 125928.Dec 7 2017, 3:45 AM
  • More cleanups and merged with D40897
ioeric retitled this revision from [clangd] Prototyping index support and naive index-based global code completion. WIP to [clangd] Symbol index interfaces and index-based code completion..Dec 7 2017, 3:56 AM
ioeric edited the summary of this revision. (Show Details)
ioeric added a reviewer: sammccall.
ioeric edited the summary of this revision. (Show Details)Dec 7 2017, 4:37 AM
ioeric updated this revision to Diff 125960.Dec 7 2017, 7:49 AM
  • Use IncludeNamespaceLevelDecls option; fix some broken tests.
ioeric updated this revision to Diff 125962.Dec 7 2017, 7:51 AM

Diff on D40897 instead origin/master!

sammccall added inline comments.Dec 7 2017, 9:33 AM
clangd/ClangdIndex.h
1 ↗(On Diff #125928)

nit: Clangd prefix doesn't do much.
I'd suggest Index.h for the interface (including Symbol), and MemIndex.h for the in-memory implementations?

24 ↗(On Diff #125962)

This seems a little premature to me - it's hard to know if this idea makes sense without actually having multiple implementations to mix together. My gut feeling is that sorting by (index weight, symbol score) puts too much emphasis on the weight.

Can we just have a single optional index for now?

36 ↗(On Diff #125962)

If you do keep this class, move the label inside the struct? Avoids a lot of std::pair

50 ↗(On Diff #125962)

As discussed offline - the lifetime of the contained symbols and the operations on the index derived from it need to be carefully considered.

Involving inheritance here seems likely to make these tricky interactions even trickier.

Would suggest:

  • a concrete class (SymbolCache? FileSymbols?) to hold a collection of symbols from multiple files, with the ability to iterate over them and replace all the symbols from one file at once
  • a concrete class MemIndex that can be constructed from a sequence of symbols and implements Index

You probably want to make them both immutable with snapshot semantics, or have a reader-writer lock that spans both.

72 ↗(On Diff #125962)

The AST -> symbol conversion doesn't seem to have much to do with the rest of the class - we can move this to a free function I think, which would give the class a narrower responsibility.

clangd/ClangdLSPServer.h
38 ↗(On Diff #125962)

this is kind of two options in one - "should we build the index" and "should we query the index".

That seems OK, but I'd consider renaming this to BuildDynamicSymbolIndex or something to make it clear that building is optional, querying it isn't provided as an option

37 ↗(On Diff #124544)

I think the issue is that ClangdServer is going to create and maintain the dynamic index, which should then be merged as a peer with a set of other indexes.

Can we just pass in a single SymbolIndex* StaticIdx, and hard-code the assumption about static index + dynamic index being what we're using? That way we can also hard-code the heuristics for merging them together, instead of data-driving everything.

clangd/CodeComplete.cpp
378 ↗(On Diff #125962)

I don't like the idea of doing index lookups from the middle of a sema callback!

Can we just record the CCContext in the Collector instead, and do this work afterwards?

clangd/CodeComplete.h
71 ↗(On Diff #125962)

I'm not sure you want to spell out this behavior - merging vs replacing is more of an implementation detail

clangd/SymbolIndex.h
45 ↗(On Diff #125962)

Interfaces need doc :-)

49 ↗(On Diff #125962)

As discussed, we may want a callback-based API that makes it easier for an implementation to manage the lifetime of symbols without copying them.

Also is there anything to do with errors other than log and treat as no results? If not, just do that instead of returning Expected.

50 ↗(On Diff #125962)

This is finding symbols that fuzzy-match a string, right?
We shouldn't conflate this with code completion - code completion is context-sensitive, and this query operation will be used for other operations like navigate-to-symbol.

Suggest fuzzyFind or similar.
(Similarly with CompletionRequest, CompletionSymbol)

malaperle edited edge metadata.Dec 7 2017, 9:14 PM

As a follow-up, here's the interface for querying the index that I am using right now. It's meant to be able to retrieve from any kind of "backend", i.e. in-memory, ClangdIndexDataStore, libIndexStore, etc. I was able to implement "Open Workspace Symbol" (which is close to code completion in concept), Find References and Find Definitions.

using USR = llvm::SmallString<256>;

class ClangdIndexDataOccurrence;

class ClangdIndexDataSymbol {
public:
  virtual index::SymbolKind getKind() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mymethod.
  virtual std::string getName() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mynamespace::myclass::
  virtual std::string getQualifier() = 0;
  virtual std::string getUsr() = 0;

  virtual void foreachOccurrence(index::SymbolRoleSet Roles, llvm::function_ref<bool(ClangdIndexDataOccurrence&)> Receiver) = 0;

  virtual ~ClangdIndexDataSymbol() = default;
};

class ClangdIndexDataOccurrence {
public:
  enum class OccurrenceType : uint16_t {
     OCCURRENCE,
     DEFINITION_OCCURRENCE
   };

  virtual OccurrenceType getKind() const = 0;
  virtual std::string getPath() = 0;
  /// Get the start offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getStartOffset(SourceManager &SM) = 0;
  /// Get the end offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getEndOffset(SourceManager &SM) = 0;
  virtual ~ClangdIndexDataOccurrence() = default;
  //TODO: Add relations

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::OCCURRENCE; }
};

/// An occurrence that also has definition with a body that requires additional
/// locations to keep track of the beginning and end of the body.
class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
public:
  virtual uint32_t getDefStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getDefEndOffset(SourceManager &SM) = 0;

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::DEFINITION_OCCURRENCE; }
};

class ClangdIndexDataProvider {
public:

  virtual void foreachSymbols(StringRef Pattern, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;
  virtual void foreachSymbols(const USR &Usr, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;

  virtual ~ClangdIndexDataProvider() = default;
};

The "Clangd" prefix adds a bit much of clutter so maybe it should be removed. I think the main points are that having generic foreachSymbols/foreachOccurrence with callbacks is well suited to implement multiple features with minimal copying.

clangd/SymbolIndex.h
50 ↗(On Diff #125962)

Similar to completion is "Open Workspace Symbol". So a more generic query could be useful.

ioeric added a comment.Dec 8 2017, 2:16 AM

As a follow-up, here's the interface for querying the index that I am using right now. It's meant to be able to retrieve from any kind of "backend", i.e. in-memory, ClangdIndexDataStore, libIndexStore, etc. I was able to implement "Open Workspace Symbol" (which is close to code completion in concept), Find References and Find Definitions.

using USR = llvm::SmallString<256>;

class ClangdIndexDataOccurrence;

class ClangdIndexDataSymbol {
public:
  virtual index::SymbolKind getKind() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mymethod.
  virtual std::string getName() = 0;
  /// For example, for mynamespace::myclass::mymethod, this will be
  /// mynamespace::myclass::
  virtual std::string getQualifier() = 0;
  virtual std::string getUsr() = 0;

  virtual void foreachOccurrence(index::SymbolRoleSet Roles, llvm::function_ref<bool(ClangdIndexDataOccurrence&)> Receiver) = 0;

  virtual ~ClangdIndexDataSymbol() = default;
};

class ClangdIndexDataOccurrence {
public:
  enum class OccurrenceType : uint16_t {
     OCCURRENCE,
     DEFINITION_OCCURRENCE
   };

  virtual OccurrenceType getKind() const = 0;
  virtual std::string getPath() = 0;
  /// Get the start offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getStartOffset(SourceManager &SM) = 0;
  /// Get the end offset of the symbol occurrence. The SourceManager can be
  /// used for implementations that need to convert from a line/column
  /// representation to an offset.
  virtual uint32_t getEndOffset(SourceManager &SM) = 0;
  virtual ~ClangdIndexDataOccurrence() = default;
  //TODO: Add relations

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::OCCURRENCE; }
};

/// An occurrence that also has definition with a body that requires additional
/// locations to keep track of the beginning and end of the body.
class ClangdIndexDataDefinitionOccurrence : public ClangdIndexDataOccurrence {
public:
  virtual uint32_t getDefStartOffset(SourceManager &SM) = 0;
  virtual uint32_t getDefEndOffset(SourceManager &SM) = 0;

  static bool classof(const ClangdIndexDataOccurrence *O) { return O->getKind() == OccurrenceType::DEFINITION_OCCURRENCE; }
};

class ClangdIndexDataProvider {
public:

  virtual void foreachSymbols(StringRef Pattern, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;
  virtual void foreachSymbols(const USR &Usr, llvm::function_ref<bool(ClangdIndexDataSymbol&)> Receiver) = 0;

  virtual ~ClangdIndexDataProvider() = default;
};

I think some of the ideas here could be useful. This patch focuses mostly on index interfaces and D40897 emphasizes on the design of symbol structure. The way symbols are stored and used in this patch is likely to change depending on how D40897 goes.

The "Clangd" prefix adds a bit much of clutter so maybe it should be removed. I think the main points are that having generic foreachSymbols/foreachOccurrence with callbacks is well suited to implement multiple features with minimal copying.

Although I'm not sure if foreachSymbols/... would be feasible for all indexes yet, we do plan to switch to callback-based index interfaces, which Sam also proposed in the review comments.

There have been some offline discussions happening around clangd's indexing, and sorry that we have not been able to keep you up to date. I think it might be more efficient if we could meet via VC/Hangouts and sync on our designs. If you don't mind a meeting, I am happy to arrange it via emails.

hokein added a comment.Dec 8 2017, 4:14 AM

Thanks for the feedback, Marc!

Yeah, I think the ClangdIndexDataSymbol and ClangdIndexDataOccurrence are something similar to what https://reviews.llvm.org/D40897 trying to address, maybe put the discussion there? Before that, I think having a sym meeting is a good idea to keep us in the same page.

I think some of the ideas here could be useful. This patch focuses mostly on index interfaces and D40897 emphasizes on the design of symbol structure. The way symbols are stored and used in this patch is likely to change depending on how D40897 goes.

The "Clangd" prefix adds a bit much of clutter so maybe it should be removed. I think the main points are that having generic foreachSymbols/foreachOccurrence with callbacks is well suited to implement multiple features with minimal copying.

Although I'm not sure if foreachSymbols/... would be feasible for all indexes yet, we do plan to switch to callback-based index interfaces, which Sam also proposed in the review comments.

There have been some offline discussions happening around clangd's indexing, and sorry that we have not been able to keep you up to date. I think it might be more efficient if we could meet via VC/Hangouts and sync on our designs. If you don't mind a meeting, I am happy to arrange it via emails.

ioeric updated this revision to Diff 126379.Dec 11 2017, 8:55 AM
ioeric marked 8 inline comments as done.
  • Address review comments.
ioeric added inline comments.Dec 11 2017, 8:55 AM
clangd/ClangdIndex.h
1 ↗(On Diff #125928)

I'll do the renaming when D40897 is landed since it defines Symbol and would probably create Index.h as well.

50 ↗(On Diff #125962)

The current revision implements a FileSymbols with the snapshot semantics. Not entirely sure if this is what you are looking for. Let me know...

72 ↗(On Diff #125962)

Moved the conversion code to ClangdUnit.cpp

clangd/CodeComplete.cpp
378 ↗(On Diff #125962)

Good point!

ioeric updated this revision to Diff 126686.Dec 13 2017, 1:34 AM

Merged with origin/master and moved index-related files to index/ subdirectory.

I think this patch is too big, and there are several separate pieces here:

  • the index interface, and the implementation of the dynamic symbol index (these probably belong together). We're missing unit tests for the implementation.
  • building and updating the dynamic symbol index in clangd
  • code completion based on a symbol index

As a consequence I'm finding it pretty intimidating to review, I've tried to leave the main high-level comments.

clangd/ClangdIndex.h
24 ↗(On Diff #126379)

nit: maybe call this out as a container? that tells a lot about what it does/doesn't do.

e.g. \brief A container of Symbols from several source files. It can be updated at source-file granularity, replacing all symbols from one file with a new set. ...

clangd/ClangdUnit.cpp
854 ↗(On Diff #126686)

Having clangdunit know how the index is maintained feels like a layering violation.

Can we instead have clangdunit take a callback (UniqueFunction<void(ParsedAST&)> or so) that gets invoked after each successful build? That can be the "hook" where we do indexing.

The common higher layer (ClangdServer) then just has to have the glue code tying the ClangdUnit API to the Index functionality.

clangd/ClangdUnitStore.h
29 ↗(On Diff #126686)

Similarly here, this can take a post-build callback of void(Filename, ParsedAST&) that is translated into per-cppfile callbacks. Then it doesn't need to know about FileSymbols.

(Or ParsedAST*, with nullptr meaning this file is gone)

clangd/CodeComplete.h
50 ↗(On Diff #126686)

I find the config changes pretty confusing.
CodeComplete options is for options we want to offer users. This change doesn't seem motivated by a user requirement, but rather by implementation details.
If we're going to do that, we should be clear about it, rather than just expose every option of clang::CodeCompletionOptions we might want to set ourselves

If we do want to ensure that we can do clangd::CodeCompleteOptions -> clang::CodeCompleteOptions without extra args:

CodeCompleteOptions {
  ...
  const Index* Index = nullptr; // Populated internally by clangd, do not set
}

then we can use the presence of index to determine whether to restrict what we ask Sema for.
(We could also let users override the index used for code completion this way if we want, but I don't see a use case)

clangd/index/FileIndex.h
62 ↗(On Diff #126686)

I think it'd be possible and nice to decouple MemIndex from FileSymbols, such that we can use it for the on-disk index (which is one big slab).

At the moment it seems like an unhappy mix - we don't actually get any incrementality, but we're coupled to the incremental storage.

I think it's worth addressing now as MemIndex should be named differently and move to another file if this is the case.

One way this could work:

class MemIndex {
  shared_ptr<vector<const Symbol*>> Symbols;

  // Symbols must remain accessible as long as S is kept alive.
  build(shared_ptr<vector<const Symbol*>> S) {
    // TODO: build smarter index structures later
    lock_guard
    Symbols = move(S); // releases old symbols
    // TODO: move smart index structures
  }
}

class FileSymbols {
  // The shared_ptr keeps the symbols alive
  shared_ptr<vector<const Symbol *>> allSymbols() {
    struct Snapshot {
      vector<const Symbol*> Pointers;
      vector<shared_ptr<SymbolSlab>> KeepAlive;
    };
    auto Snap = make_shared<Snapshot>();
    for (Slab in FileToSymbols) {
      Snap->KeepAlive.push_back(Slab);
      for (Sym in *Slab)
        Snap.Pointers.push_back(&Sym);
    }
    return shared_ptr<vector<const Symbol *>>(move(Snap), &Snap->Pointers);
  }
}

// In ClangdServer, when getting new symbols for a file
FileSymbols.update(Path, Collector.takeSymbols());
DynamicIndex.build(FileSymbols.allSymbols());

This seems pretty nice, but vector<shared_ptr<SymbolSlab>> would work in practice for the interface too, if you don't like aliasing shared_ptr tricks.

ioeric updated this revision to Diff 126775.Dec 13 2017, 9:57 AM
  • Remove everything except for SymbolIndex interface and MemIndex
  • Added unit tests for MemIndex
sammccall accepted this revision.Dec 14 2017, 12:43 AM

Thanks for the split & refactoring! This looks nice. Just nits, plus some thoughts on the test.

clangd/index/Index.cpp
11 ↗(On Diff #126775)

changes in this file can now be reverted I think

clangd/index/Index.h
129

nit: "... before returning".

Just to be explicit that this is a synchronous API.

131

This is a little vague - Returns true if the result list is complete, false if it was truncated due to MaxCandidateCount?

Might be clearer to invert it - Returns true if the result list was truncated - up to you.

133

context patch has landed, so this should now take const Context& as first parameter

137

USRs will rather be SymbolIDs

clangd/index/MemIndex.cpp
39

you can easily do this now - it shouldn't even be more lines of code.
Up to you of course.

unittests/clangd/IndexTests.cpp
30

nit: given the name, I actually think inheriting from Slab might be clearer.

Though I'm not sure we actually need this, see weak_ptr part of next comment.

42

This is a nice test, the one weakness I see is it's hard to read the cases in isolation as there's quite a lot of action-at-a-distance.
I think we can eliminate the base class without adding much boilerplate, which would make the tests more self-contained:

  • Index can be owned by the test
  • match() is good, but moving Index.build() call into the test, and accepting Index as a param means this interaction is more obvious and match can be a free function.
  • CountedSymbolSlab can more directly be replaced by a weak_ptr, which can tell you whether a shared_ptr is alive
  • generateNumSymbols (already) doesn't need to be a member. Having it optionally emit a weak_ptr to its return value would simplify the tests a little.

So MemIndexSymbolsRecycled would look something like

Index I;
std::weak_ptr<vector<const Symbol*>> Symbols;
I.build(generateNumSymbols(0, 10), &Symbols);
EXPECT_THAT(match(I, Req), ElementsAre("7"));
EXPECT_FALSE(Symbols.expired());
I.build(generateNumSymbols(0,0));
EXPECT_TRUE(Symbols.expired());

WDYT?

63

This is a leak - the returned shared_ptr doesn't own this vector.

(I'm actually fine with this for simplicity if it's documented - but we might be running leak checkers that complain about it?)

In principle you need to make_shared a struct containing both the vector and the slab.

This revision is now accepted and ready to land.Dec 14 2017, 12:43 AM
ioeric updated this revision to Diff 126916.Dec 14 2017, 2:27 AM
ioeric marked 7 inline comments as done.
  • Address review comments
ioeric updated this revision to Diff 126917.Dec 14 2017, 2:36 AM
ioeric marked an inline comment as done.
  • Fix comment for fuzzyFind
sammccall accepted this revision.Dec 14 2017, 2:43 AM
sammccall added inline comments.
unittests/clangd/IndexTests.cpp
30

this could be local to generateNumSymbols, up to you

52

nit: just return? (you can skip spelling out the shared_ptr type with {}, I think)

74

assert symbols is not expired here?

ioeric updated this revision to Diff 126919.Dec 14 2017, 2:48 AM
ioeric marked 3 inline comments as done.
  • Address more comments. # I am going to land it now.
  • Merge remote-tracking branch 'origin/master' into index
ioeric retitled this revision from [clangd] Symbol index interfaces and index-based code completion. to [clangd] Symbol index interfaces and an in-memory index implementation..Dec 14 2017, 2:50 AM
ioeric edited the summary of this revision. (Show Details)
This revision was automatically updated to reflect the committed changes.

Hi Marc, the patch is not ready for review yet. I am still cleaning up the prototype and will let you know when it's ready for review.

I guess it was ready to review since it was submitted? ;)

clangd/index/Index.h
134

I think a more generic std::function would be useful, similar to the indexstore's filter

bool searchSymbols(llvm::function_ref<bool(IndexRecordSymbol, bool &stop)> filter,
                     llvm::function_ref<void(IndexRecordSymbol)> receiver)

Hi Marc, the patch is not ready for review yet. I am still cleaning up the prototype and will let you know when it's ready for review.

I guess it was ready to review since it was submitted? ;)

Sorry! I should've pinged you before landing this. My apologies! I am happy to address any comment you have.

clangd/index/Index.h
134

Do you have an use case in mind which requires different filtering? This could probably be a separate interface.

I think the behavior of fuzzy finding is well-defined, and how filtering is done is an implementation detail. User-specified filter might make implementation less flexible, especially for large indexes that are not managed in memory.

malaperle added inline comments.Dec 14 2017, 11:28 AM
clangd/index/Index.h
134

For example

  • Searching by USR
  • By non-qualified name (useful also for Open Workspace Symbol)
  • Most likely other filters, hmm, by Symbol kind, language?
  • No filter, i.e. retrieve *all* symbols. Useful mainly for development and to get index statistics

There could also be a boolean in the callback return value (or the filter callback) to limit the number of returned values.
I haven't tried locally but it looks like it would be pretty doable to change the FuzzyFindRequest into a std::function.
A big disadvantage of having just one method though is that is would be difficult for an implementation to optimize for a specific type of query. For example, if you search by non-qualified name, an implementation could use a mapping of non-qualified-name-to-USR but it would be difficult to know what to do given only the std::function/filter passed by argument.
In that sense, perhaps it is better to have multiple methods for each use cases. Or perhaps some enum for each kind of query. What do you think?

ioeric added inline comments.Dec 15 2017, 1:47 AM
clangd/index/Index.h
134

Adding new interfaces is an option. It should also be easy to extend the existing interface to cover more use cases.

Searching by USR

We definitely want an interface for this. I can see this being useful for Goto-definition and other features.

By non-qualified name (useful also for Open Workspace Symbol)

I think fuzzyFind can support this, with fuzzy matching optionally turned off if preferred.

Most likely other filters, hmm, by Symbol kind, language?

Similarly, this could be achieved by extending the find request to include more filters e.g. symbol kind.

No filter, i.e. retrieve *all* symbols. Useful mainly for development and to get index statistics

This can be done today by setting an empty fuzzy filter.

sammccall added inline comments.Dec 15 2017, 2:32 AM
clangd/index/Index.h
134

As Eric said, we should be able to add more options here.
But why are we reluctant to add an arbitrary std::function, which is more elegant and flexible?

  • it essentially forbids remote indexes. Because functions aren't serializable, the only possible implementation is to send all the symbols to clangd. For our main codebase, this is would be ~1G for just qnames, and a lot more for full symbols.
  • A slightly more subtle disadvantage is because functions aren't serializable, it's harder to debug them - you can't just log the query.
  • filtering functions are easy to implement (so can be distributed over callsites), but scoring functions are hard (so should be more centralized). They tend to operate on the same data. It's not obvious how to achieve this with a filter API.
malaperle added inline comments.Dec 15 2017, 6:12 AM
clangd/index/Index.h
134

Thanks a lot, that makes sense.

I do think it would be good, perhaps for other kinds of queries, that the query could be stopped halfway by having the Callback return a bool.

std::function<void(const Symbol &)> Callback) const

to

std::function<bool(const Symbol &)> Callback) const

But this can be revisited once we have other options.