This is an archive of the discontinued LLVM Phabricator instance.

[clangd] auto-index stores symbols per-file instead of per-TU.
ClosedPublic

Authored by ioeric on Oct 19 2018, 5:21 AM.

Details

Summary

This allows us to deduplicate header symbols across TUs. File digests
are collects when collecting symbols/refs. And the index store deduplicates
file symbols based on the file digest.

Diff Detail

Repository
rL LLVM

Event Timeline

ioeric created this revision.Oct 19 2018, 5:21 AM
sammccall added inline comments.Oct 19 2018, 8:20 AM
clangd/index/FileIndex.cpp
167 ↗(On Diff #170188)

This is a good FIXME to carry over to FileSymbols::buildIndex(). It may (or may not) make sense to control this behavior with the same param that controls merge behavior, the logic being that it's also a quality/accuracy tradeoff.

clangd/index/FileIndex.h
49 ↗(On Diff #170188)

FileSymbols isn't actually that opinionated about the data it manages:

  • the keys ("filenames") just identify shards so we can tell whether a new shard is an addition or replaces an existing one.
  • the values (tu symbols/refs) are merged using pretty generic logic, we don't look at the details.

I think we should be able to use FileSymbols mostly unmodified, and store digests in parallel, probably in BackgroundIndexer. Is there a strong reason to store "main file" digests separately to header digests?

There are a couple of weak points:

  • The merging makes subtle assumptions: for symbols it picks one copy arbitrarily, assuming there are many copies (merging would be expensive) and they're mostly interchangeable duplicates. We could add a constructor or buildIndex() param to FileSymbols to control this, similar to the IndexType param. (For refs it assumes no duplicates and simply concatenates, which is also fine for our purposes).
  • FileSymbols locks internally to be threadsafe while keeping critical sections small. To keep digests in sync with FileSymbols contents we probably have to lock externally with a second mutex. This is a little ugly but not a real problem I think.
clangd/index/IndexAction.h
31 ↗(On Diff #170188)

these parallel callbacks that always get called together are a bit ridiculous (my fault).
Can you add a FIXME to replace them with a single callback that passes a struct?

ioeric updated this revision to Diff 170440.Oct 22 2018, 9:04 AM
ioeric marked 2 inline comments as done.
  • address review comments
  • minor cleanup
ioeric retitled this revision from [clangd] *Prototype* auto-index stores symbols per-file instead of per-TU. to [clangd] auto-index stores symbols per-file instead of per-TU..Oct 22 2018, 9:04 AM
ioeric edited the summary of this revision. (Show Details)
ioeric added inline comments.
clangd/index/FileIndex.h
49 ↗(On Diff #170188)

Good idea! Thanks!

FileSymbols locks internally to be threadsafe while keeping critical sections small. To keep digests in sync with FileSymbols contents we probably have to lock externally with a second mutex. This is a little ugly but not a real problem I think.

This doesn't seem to be a problem as Background::index() currently assumes single-thread? I can either make it thread-safe now or add a FIXME. WDYT?

sammccall added inline comments.Oct 22 2018, 9:58 AM
clangd/index/Background.h
68 ↗(On Diff #170440)

Keying this with file URIs seems suspicious to me - why this change?

It seems to be motivated by the current implementation where the partitioning is done post-hoc by looking at strings that contain file URIs.
But this seems likely to change in the long term, and producing data that's sharded in the first place. At that point it'll be just as natural and semantically cleaner just to use file paths, I think.

If performance of converting URI -> path is a concern in the meantime, can we just add a cache to that loop?

clangd/index/FileIndex.cpp
121 ↗(On Diff #170440)

I don't think you want to build a new slab and copy all the strings here.

This seems like it's saving data (you're not keeping everything alive!) but in practice that only allows GC of data that changes, and most data doesn't change. So mostly this just introduces another copy of the data.

So I think you just want a vector<Symbol> MergedSymbols, and either push all the pointers in there or use concat(make_pointee_range(SymbolPointers), MergedSymbols)

clangd/index/FileIndex.h
41 ↗(On Diff #170440)

as noted elsewhere I think this is the wrong place for this decl.
The values are somewhat self-explanatory, but the keys need to be documented.

68 ↗(On Diff #170440)

can we pull out something like enum class DuplicateHandling { PickOne, Merge }?

I think that's much clearer at the callsite than /*MergeSymbols=*/true, and not really longer.

49 ↗(On Diff #170188)

Right, currently we're only reading/writing with a single thread. The mutex thing is more for the future. Don't bother with a FIXME, when we add multiple threads we'll need to work out what to lock.

clangd/index/IndexAction.h
12 ↗(On Diff #170440)

This looks like an inverted dependency. The FileDigests struct should be part of symbolcollector, no?

33 ↗(On Diff #170440)

thinking about what we eventually want here:

  • the index action needs to tell the auto-indexer what the file content or digests are
  • the auto-indexer needs to tell the index action which files to index symbols from
  • this decision is in turn based on the digests

What about this design:

  • createStaticIndexingAction accepts a filter-function: "should we index symbols from file X"?
  • SymbolCollector invokes this once per file and caches the result.
  • the signature is something like bool(SourceManager&, FileEntry*)
  • auto-index passes a function that reads the contents from the FileEntry, checks whether the hash is up-to-date to return true/false, and stores the resulting hashes for later

That way the caching/hash stuff never actually leaks outside auto-index that needs it, and we get the interface we need to precisely avoid generating Symbols we don't want.

(That function could just always return true in this patch, if the filtering isn't trivial).

It would mean that auto-index would need to redundantly compute the URI once per file, I don't think that matters.

ioeric updated this revision to Diff 170664.Oct 23 2018, 8:54 AM
ioeric marked 5 inline comments as done.
  • address review comments
  • merged with origin/master
  • Cleanup
clangd/index/IndexAction.h
33 ↗(On Diff #170440)

This indeed sounds like a better design for the long term use. Thanks!

ioeric updated this revision to Diff 171467.Oct 29 2018, 2:53 AM
  • minor cleanup and a friendly ping.
sammccall accepted this revision.Oct 30 2018, 3:15 AM

Great stuff! Just nits (though a few of them; this is a tricky patch!).

clangd/index/Background.cpp
122 ↗(On Diff #171467)

static rather than inline?

135 ↗(On Diff #171467)

Might be cleanest to wrap the cache, the URISchemes&, the hint path, and this function into a tiny class

160 ↗(On Diff #171467)

Maybe add a comment "eventually SymbolCollector may do this for us to avoid all this copying"?

311 ↗(On Diff #171467)

It looks like you never write the FilesToUpdate hashes back to the IndexedFileDigests, so we'll always update headers?

clangd/index/Background.h
55 ↗(On Diff #171467)

private.

75 ↗(On Diff #171467)

There's no need for this to be global to the class, it's going to cause some hassle once there are multiple worker threads.

I think can a cache can be scoped to a single call to update() (and both it and uriToPath() can be hidden in the cpp file.)

clangd/index/FileIndex.cpp
123 ↗(On Diff #171467)

comment is stale

140 ↗(On Diff #171467)

since we're deduplicating above, we could deduplicate here too and remove the deduplicating logic from Dex (MemIndex gets it by mistake). That would avoid duplicated deduplication (!) in the Merge + Dex case, and seems a little cleaner.

Not something for this patch, but maybe add a FIXME.

clangd/index/SymbolCollector.cpp
212 ↗(On Diff #171467)

nit: simplify logic by inverting the if here and sharing the return?

441 ↗(On Diff #171467)

why is this hoisted above shouldIndexFile? or did you mean to use DefLoc below?

442 ↗(On Diff #171467)

just to check - this is basically trivial to do now but would need tests etc?
(fine to defer from this patch)

clangd/index/SymbolCollector.h
78 ↗(On Diff #171467)

I think we should explicitly test this in SymbolCollectorTests - I think it should be straightforward?

unittests/clangd/BackgroundIndexTests.cpp
28 ↗(On Diff #171467)

stale comment?

29 ↗(On Diff #171467)

nit: R"cpp(

33 ↗(On Diff #171467)

naming here seems confusing: A_H seems to refer to the header, but then what does B_H refer to?

40 ↗(On Diff #171467)

these should be raw strings now I think (sorry, the old test was already marginal)

This revision is now accepted and ready to land.Oct 30 2018, 3:15 AM
ioeric updated this revision to Diff 171684.Oct 30 2018, 5:50 AM
ioeric marked 11 inline comments as done.
  • address review comments.
clangd/index/Background.cpp
311 ↗(On Diff #171467)

It's updated in update when slabs are updated.

clangd/index/Background.h
55 ↗(On Diff #171467)

These are also used in some helpers in the cpp file. I can make these private and make those helper members if you think it would be better?

clangd/index/FileIndex.cpp
140 ↗(On Diff #171467)

Added the duduplication for the PickOne case.

clangd/index/SymbolCollector.cpp
442 ↗(On Diff #171467)

Do you mean just do this for macros for now? Yes, changes for macros should be trivial.

clangd/index/SymbolCollector.h
78 ↗(On Diff #171467)

This should be trivial for macros, but we still need some refactoring to make it work for symbols (e.g. addDeclaration doesn't support dropping symbols).

unittests/clangd/BackgroundIndexTests.cpp
33 ↗(On Diff #171467)

Changed to A_CC and B_CC

sammccall accepted this revision.Oct 30 2018, 6:00 AM
sammccall added inline comments.
clangd/index/Background.cpp
311 ↗(On Diff #171467)

In that case, do we need to update IndexedFileDigests here too?

(I just realized there's an edge case: what if there are no symbols or references in a file? but I'm not sure if this is different between mainfile and headers, either)

clangd/index/Background.h
55 ↗(On Diff #171467)

oops, true. I'd suggest splitting the difference and keeping FileDigest public and dropping the FileDigests alias altogether, but up to you.

clangd/index/FileIndex.cpp
140 ↗(On Diff #171467)

Thanks! The FIXME is still warranted, to remove deduping from Dex.

kadircet added inline comments.Oct 31 2018, 2:25 AM
clangd/index/SymbolCollector.cpp
619 ↗(On Diff #171684)

ND.getASTContext().getSourceManager() -> SM ?

ioeric updated this revision to Diff 171901.Oct 31 2018, 6:21 AM
ioeric marked 3 inline comments as done.
  • Merged with multi-threading changes. Added mutex for file digests.

Made some more changes to make the code work in multiple threads (add mutex for digests, take snapshot of digests for each run etc). PTAL

clangd/index/Background.cpp
311 ↗(On Diff #171467)

The edge case seems to justify the update here. We want to make sure that hash for main files are always updated, even when there's no index data from it.

clangd/index/FileIndex.cpp
140 ↗(On Diff #171467)

Done. I removed it in this patch as it seems trivial enough.

sammccall accepted this revision.Oct 31 2018, 4:53 PM
sammccall added inline comments.
clangd/index/Background.cpp
194 ↗(On Diff #171901)

um, this (before your patch) doesn't look even slightly threadsafe. Sorry I didn't catch it in code review.

Now I'm not sure whether it makes sense for you to fix it or not. It seems fine to use two independent locks here, to me - we don't need actual consistency.

unittests/clangd/DexTests.cpp
494 ↗(On Diff #171901)

at this point we should remove the corresponding test for MemIndex (in IndexTests.cpp I think) too

kadircet added inline comments.Nov 1 2018, 4:04 AM
clangd/index/Background.cpp
194 ↗(On Diff #171901)

Ah, sorry for missing on my side as well. But IIUC, the only unsafe part is access to FileHash since FileSymbols::update is already thread-safe ?

If need be I can send out a patch to fix those issues.

sammccall added inline comments.Nov 1 2018, 6:18 AM
clangd/index/Background.cpp
194 ↗(On Diff #171901)

Oops, you're right - I forgot FileSymbols was threadsafe.

Then I think FileHash itself is indeed the only issue and this patch (DigestsMu) fixes it already. Carry on!

kadircet added inline comments.Nov 1 2018, 9:09 AM
clangd/index/Background.cpp
235 ↗(On Diff #171901)

This call is already thread-safe, no need for it to be under lock.

ioeric added inline comments.Nov 5 2018, 8:42 AM
clangd/index/Background.cpp
235 ↗(On Diff #171901)

This is for making sure that the update of digest and index data is atomic. If FileSymbols update is not in the lock, we might get the wrong digest for the indexed symbols in corner cases (e.g. T1 updates digest -> T2 updates digest -> T2 update symbols -> T1 updates symbols ).

ioeric updated this revision to Diff 172732.Nov 6 2018, 2:54 AM
  • Rebase
This revision was automatically updated to reflect the committed changes.