This is an archive of the discontinued LLVM Phabricator instance.

Introduce shard storage to auto-index.
ClosedPublic

Authored by kadircet on Nov 8 2018, 9:47 AM.

Event Timeline

kadircet created this revision.Nov 8 2018, 9:47 AM
Eugene.Zelenko added inline comments.
clangd/index/Background.cpp
32

Please run Clang-format. Headers in sections should be in alphabetical order.

kadircet updated this revision to Diff 173281.Nov 9 2018, 1:22 AM
kadircet marked an inline comment as done.
  • clang-format
sammccall added inline comments.Nov 12 2018, 1:21 AM
clangd/index/Background.h
31

nit: prefer the other way around: doc on the interface, the disk implementation can defer to the interface when there's nothing special to say.
That way callers don't have to be confused about what the semantics are in the general case.

39

Why not use the constructor? what does "directory" mean in the general case?

sammccall added inline comments.Nov 12 2018, 1:21 AM
clangd/index/Background.cpp
335

nit: i'd suggest doing the writes *after* updating the index, as the latter is user-facing

407

I don't think this is the right place to be reading shards, vs on startup. It means we're doing extra IO whenever a file is reindexed, and more importantly we don't make the index contents available on startup (if any file needs reindexing, reads may get stuck behind it).

(as discussed, we can also leave reading out of this patch entirely)

clangd/index/Background.h
35

nit: we probably don't need "shard" both in the interface name and the method names.

Consider either BackgroundIndexStorage/storeShard or ShardStorage/store

38

this signature looks surprising to me, expected something like `Expected<pair<IndexFileIn, FileDigest>> retrieveShard(ShardID).

e.g. if we have a stale shard for a file when initializing, maybe we want to use it until we can reindex the file? Current interface doesn't allow this.

As discussed offline, it'd also be fine to leave retrieve out of the initial patch.

65

So we have several possible designs on a collision course:

  1. One instance of BackgroundIndex for clangd, which has a ShardStorage, which is used for every CDB
  2. One instance of BackgroundIndex, which has many ShardStorages, one-per-CDB
  3. Many instances of BackgroundIndex, one per CDB, each with a ShardStorage
  4. One BackgroundIndex for everything, and one ShardStorage for everything (storage in $HOME/.clangd-index)

What are we aiming for here?

The design of ShardStorage in this patch precludes 1, taking unique_ptr here precludes 2. 3 requires us to revisit some aspects of BackgroundIndex (which is possible). With 4 it isn't obvious how we do multi-config, or how users can manage the storage.

(I'd been assuming 1 or 2, but I don't think we ever discussed it)

126

This class can be hidden in the cpp file.

kadircet added inline comments.Nov 12 2018, 2:32 AM
clangd/index/Background.cpp
335

but updating index consumes slabs. Is it worth making a copy?

407

What exactly you mean by startup exactly from BackgroundIndex's perspective. Is it enqueueAll ?

clangd/index/Background.h
38

I thought it would be better to get rid of disk io if the file was out-of-date, but using the stale file and deferring indexing also seems like a good idea. Moving with it.

39

Directory refers to the one specified in CompilationDatabase(which is usually the build directory?), sorry for the inconvenience.
I wasn't sure about where we plan to instantiate BackgroundIndex. If you plan to do that initialization at a point in which we already know the build directory we can move that to constructor, especially only to the constructor of DiskBackedIndexStorage.

65

The aim was more like 2. But I forgot the fact that we could be working with multiple CDBs :D

So it might be better to just pass a factory to BackgroundIndex and let it create one ShardStorage for each CDB(which is distinguished by the build directory?) WDYT?

126

But it needs to be passed into the constructor of BackgroundIndex, possibly by the ClangdServer, do you suggest hiding inside ClangdServer?

sammccall added inline comments.Nov 12 2018, 4:34 AM
clangd/index/Background.cpp
335

Oh, true - probably not. Add a comment?

407

Yes. (Or rather, a task scheduled there)

clangd/index/Background.h
39

tooling::CompileCommand::WorkingDirectory? That doesn't seem especially relevant here.
Or the directory that the CDB was discovered in?

Yes, this seems to be only relevant to DiskBackedIndexStorage

65

Yep, that works for me.

126

We can expose a factory function (e.g. static on ShardStorage)

kadircet added inline comments.Nov 12 2018, 6:25 AM
clangd/index/Background.h
39

I suppose you meant tooling::CompileCommand::Directory rather than WorkingDirectory ? That is the one I was talking about, why do you think it is irrelevant ?

kadircet updated this revision to Diff 173824.Nov 13 2018, 2:29 AM
kadircet marked 18 inline comments as done.
  • Address comments.
  • Move loading from cache out of indexing.
  • Use a factory for creation of index storage.
clangd/index/Background.h
39

Changed logic to use Directory passed to enqueueAll or enqueue, which I assume is the directory that the compilation database lives. And I assumed we will have at most one compilation database per directory.

sammccall added inline comments.Nov 13 2018, 6:32 AM
clangd/index/Background.cpp
53

nit: these micro-optimizations with SmallString seem unlikely to be worthwhile - maybe just take StringRef and return string?

110

absoluteFilePath? (otherwise it's not clear what the path is to)

227

this absolutely can't be done synchronously (it's much slower than calling getAllCompileCommands - this could block clangd for tens of seconds when opening the first file).

Can we drop shard-loading in BackgroundIndex from this patch?

241

you can capture indexstorage by value

clangd/index/Background.h
34

I don't think we should be using a static mutable factory here.

I think I worded the previous comment about exposing DiskBackedIndexStorage badly, what I meant was something like: static unique_ptr<BackgroundIndexStorage> BackgroundIndexStorage::createDiskStorage(StringRef dir)

This would be separate from how the storage strategy is configured on the background index.

34

do we need shared_ptr here?
ISTM either the BackgroundIndex is going to be responsible for caching (in which case it should receive unique_ptr) or the factory is responsible for caching (in which case it can own the functions, and the BackgroundIndex can get a non-owning raw pointer)

38

unused?

42

Prefer llvm::Error over bool, or no return value if you don't think we ever want to handle it.

45

doc stale

48

nit: I think "loadShard" would better emphasize the relationship with storeShard, as load/store is such a common pair. (read/write would similarly work).
Nothing wrong with "retrieve" per se, I just think the pairing is better.

kadircet updated this revision to Diff 173859.Nov 13 2018, 8:17 AM
kadircet marked 13 inline comments as done.
  • Address comments.
  • Deleted shard loading logic from backgroundindex.
kadircet added inline comments.Nov 13 2018, 8:17 AM
clangd/index/Background.cpp
53

It wasn't for optimization purposes actually, sys::path::append only accepts a SmallString and there is no direct conversion between SmallString and std::string(or stringref).

Thanks, mostly interface tweaks now.

clangd/index/Background.cpp
212

I think it would be simpler to just disallow the storage factory from returning null, and instead have it return a no-op implementation that logs.

Again, this moves the complexity behind an abstraction, where the rest of the code is simpler.

clangd/index/Background.h
36

I like that the implementations treat the shard scheme as opaque, but it'd be good for humans to know that this is going to be a filename in practice.

I'd suggest either changing the param name or adding to the interface comment: "Shards of the index are stored and retrieved independently, keyed by shard identifier - in practice this is a source file name"

39

docs

39

Hmm, we're going to attempt to load the shard corresponding to every file that we have in the CDB, even if the index isn't built yet. So "file doesn't exist" is an expected case (where we don't want to log etc), vs e.g. "file was corrupted" is unexpected and should definitely be logged. How do we distinguish these with this API?

One defensible option would be to have implementations handle errors: return Optional<IndexFileIn>, and have the disk version log errors appropriately and return None.

Another would be Expected<Optional<IndexFileIn>> or so, which is a little messy.

43

docs

53

nit: I'd move the description of IndexStorageFactory semantics to above the Using declaration - generally splitting docs to the most specific decl should make them easier to match up to the code.

54

by the directory they are found in?

109

I do think having the factory own and cache the storages would be simpler:

  • it moves more work out of this complicated class with a lot of responsibilities, and puts it somewhere simpler
  • it would mean we can just use a single storage object in tests that don't use multiple CDBs.
109

don't accesses to this map need to be locked?

kadircet updated this revision to Diff 174017.Nov 14 2018, 5:00 AM
kadircet marked 10 inline comments as done.
  • Address comments.
clangd/index/Background.h
39

used the former proposition, but with std::unique_ptr instead of llvm::Optional since IndexFileIn is move-only.

kadircet updated this revision to Diff 174019.Nov 14 2018, 5:03 AM
  • Get rid off getIndexStorage and use IndexStorageCreator directly.

just some nits. Main thing is the LoggingStorage - again I think this may have been a misunderstanding.

clangd/index/Background.cpp
76

ADD_FAILURE() I think

94

nit: move constructor to the top?

98

create_directory returns success if the directory exists, so no need for this check

108

Hmm, I'm not really sure what this is for - does someone constructing BackgroundIndex ever not want to define storage *and* not want to control logging?

344

this doesn't give enough context - the logger is global to clangd.

maybe "Failed to write background-index shard for file {0}: {1}"

clangd/index/Background.h
39

This is fine, though Optional works fine with move-only types.

(your comment still says None)

48

we may also want to provide the standard factory - can go in a later patch though.

unittests/clangd/BackgroundIndexTests.cpp
127

nit: seems a little less unusual to avoid the static:

MemoryShardStorage Storage;
auto MemoryStorageFactory = [&](llvm::StringRef){ return &Storage; }
...
EXPECT_THAT(Storage.contains("root/A.h"));
kadircet marked 9 inline comments as done.Nov 14 2018, 9:00 AM

I fear we still need to expose createDiskStorage, because someone still needs to tell factory to which creator function to use.

kadircet updated this revision to Diff 174052.Nov 14 2018, 9:00 AM
  • Address comments.
  • Move storage related things to new files.
This revision was not accepted when it landed; it landed in state Needs Review.Nov 15 2018, 2:33 AM
This revision was automatically updated to reflect the committed changes.
kadircet reopened this revision.Nov 15 2018, 4:19 AM
kadircet updated this revision to Diff 174187.Nov 15 2018, 4:19 AM
  • Change factory design to use llvm::unique_function.
sammccall accepted this revision.Nov 15 2018, 4:50 AM

Thanks! Looks good, rest of the nits are obvious or up to you.(

clangd/index/Background.cpp
146

nit: I think the moves of the helper functions (digest/digestFile/getAbsoluteFilePath) can be reverted now

clangd/index/Background.h
45

Some docs? Maybe add a comment like

// The factory provides storage for each CDB. 
// It keeps ownership of the storage instances, and should manage caching itself.
// Factory must be threadsafe and never returns nullptr.
50

nit: create rather than get would be clearer that this returns an independent object each time.

106

(nit: this comment seems redundant with the type/name)

clangd/index/BackgroundIndexStorage.cpp
30 ↗(On Diff #174187)

nit: can we put a separator like between the path and digest?
I'd suggest . so the file extension is most obviously preserved.

e.g. Foo.cpp.12345.idx vs Foo.cpp_123456.idx vs `Foo.cpp123456.idx'

51 ↗(On Diff #174187)

nit: just "for index storage"?
it's clearly disk backed if it's going into a directory!
(and the extra qualification may confuse users)

79 ↗(On Diff #174187)

You're ignoring write errors here (and in fact errors may not have occurred yet due to buffering).

OS.close();
return OS.error();
unittests/clangd/BackgroundIndexTests.cpp
80

Consider pulling this class to the top level, and use it any time you don't care about the storage (to avoid the need for DummyStorage)

91

I think this can just be
Storage[ShardIdentifier] = llvm::to_string(Shard)

(need to include ScopedPrinters.h)

101

this one shouldn't be a failure I think - just return nullptr? (If the test cares about it, the test should depend on it more explicitly).

The corrupt index file is a more obvious "should never happen" case.

110

You never actually assert on this value.

(I'm not sure you need to, but if not just remove it)

126

seems clearer just to inline this in the constructor(, but up to you

142

(even if the StringMap was encapsulated in the MemoryStorage, you can still write this assertion through the public interface: ASSERT_NE(Storage.loadShard("root/A.h"), None)

(you could even assert the actual stored symbols, but may not be worth it)

This revision is now accepted and ready to land.Nov 15 2018, 4:50 AM
kadircet updated this revision to Diff 174215.Nov 15 2018, 7:38 AM
kadircet marked 13 inline comments as done.
  • Address comments.
This revision was automatically updated to reflect the committed changes.