This allows us to deduplicate header symbols across TUs. File digests
are collects when collecting symbols/refs. And the index store deduplicates
file symbols based on the file digest.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
clangd/index/FileIndex.cpp | ||
---|---|---|
167 ↗ | (On Diff #170188) | This is a good FIXME to carry over to FileSymbols::buildIndex(). It may (or may not) make sense to control this behavior with the same param that controls merge behavior, the logic being that it's also a quality/accuracy tradeoff. |
clangd/index/FileIndex.h | ||
49 ↗ | (On Diff #170188) | FileSymbols isn't actually that opinionated about the data it manages:
I think we should be able to use FileSymbols mostly unmodified, and store digests in parallel, probably in BackgroundIndexer. Is there a strong reason to store "main file" digests separately to header digests? There are a couple of weak points:
|
clangd/index/IndexAction.h | ||
31 ↗ | (On Diff #170188) | these parallel callbacks that always get called together are a bit ridiculous (my fault). |
clangd/index/FileIndex.h | ||
---|---|---|
49 ↗ | (On Diff #170188) | Good idea! Thanks!
This doesn't seem to be a problem as Background::index() currently assumes single-thread? I can either make it thread-safe now or add a FIXME. WDYT? |
clangd/index/Background.h | ||
---|---|---|
68 ↗ | (On Diff #170440) | Keying this with file URIs seems suspicious to me - why this change? It seems to be motivated by the current implementation where the partitioning is done post-hoc by looking at strings that contain file URIs. If performance of converting URI -> path is a concern in the meantime, can we just add a cache to that loop? |
clangd/index/FileIndex.cpp | ||
121 ↗ | (On Diff #170440) | I don't think you want to build a new slab and copy all the strings here. This seems like it's saving data (you're not keeping everything alive!) but in practice that only allows GC of data that changes, and most data doesn't change. So mostly this just introduces another copy of the data. So I think you just want a vector<Symbol> MergedSymbols, and either push all the pointers in there or use concat(make_pointee_range(SymbolPointers), MergedSymbols) |
clangd/index/FileIndex.h | ||
41 ↗ | (On Diff #170440) | as noted elsewhere I think this is the wrong place for this decl. |
68 ↗ | (On Diff #170440) | can we pull out something like enum class DuplicateHandling { PickOne, Merge }? I think that's much clearer at the callsite than /*MergeSymbols=*/true, and not really longer. |
49 ↗ | (On Diff #170188) | Right, currently we're only reading/writing with a single thread. The mutex thing is more for the future. Don't bother with a FIXME, when we add multiple threads we'll need to work out what to lock. |
clangd/index/IndexAction.h | ||
12 ↗ | (On Diff #170440) | This looks like an inverted dependency. The FileDigests struct should be part of symbolcollector, no? |
33 ↗ | (On Diff #170440) | thinking about what we eventually want here:
What about this design:
That way the caching/hash stuff never actually leaks outside auto-index that needs it, and we get the interface we need to precisely avoid generating Symbols we don't want. (That function could just always return true in this patch, if the filtering isn't trivial). It would mean that auto-index would need to redundantly compute the URI once per file, I don't think that matters. |
- address review comments
- merged with origin/master
- Cleanup
clangd/index/IndexAction.h | ||
---|---|---|
33 ↗ | (On Diff #170440) | This indeed sounds like a better design for the long term use. Thanks! |
Great stuff! Just nits (though a few of them; this is a tricky patch!).
clangd/index/Background.cpp | ||
---|---|---|
122 ↗ | (On Diff #171467) | static rather than inline? |
135 ↗ | (On Diff #171467) | Might be cleanest to wrap the cache, the URISchemes&, the hint path, and this function into a tiny class |
160 ↗ | (On Diff #171467) | Maybe add a comment "eventually SymbolCollector may do this for us to avoid all this copying"? |
311 ↗ | (On Diff #171467) | It looks like you never write the FilesToUpdate hashes back to the IndexedFileDigests, so we'll always update headers? |
clangd/index/Background.h | ||
55 ↗ | (On Diff #171467) | private. |
75 ↗ | (On Diff #171467) | There's no need for this to be global to the class, it's going to cause some hassle once there are multiple worker threads. I think can a cache can be scoped to a single call to update() (and both it and uriToPath() can be hidden in the cpp file.) |
clangd/index/FileIndex.cpp | ||
123 ↗ | (On Diff #171467) | comment is stale |
140 ↗ | (On Diff #171467) | since we're deduplicating above, we could deduplicate here too and remove the deduplicating logic from Dex (MemIndex gets it by mistake). That would avoid duplicated deduplication (!) in the Merge + Dex case, and seems a little cleaner. Not something for this patch, but maybe add a FIXME. |
clangd/index/SymbolCollector.cpp | ||
212 ↗ | (On Diff #171467) | nit: simplify logic by inverting the if here and sharing the return? |
441 ↗ | (On Diff #171467) | why is this hoisted above shouldIndexFile? or did you mean to use DefLoc below? |
442 ↗ | (On Diff #171467) | just to check - this is basically trivial to do now but would need tests etc? |
clangd/index/SymbolCollector.h | ||
78 ↗ | (On Diff #171467) | I think we should explicitly test this in SymbolCollectorTests - I think it should be straightforward? |
unittests/clangd/BackgroundIndexTests.cpp | ||
28 ↗ | (On Diff #171467) | stale comment? |
29 ↗ | (On Diff #171467) | nit: R"cpp( |
33 ↗ | (On Diff #171467) | naming here seems confusing: A_H seems to refer to the header, but then what does B_H refer to? |
40 ↗ | (On Diff #171467) | these should be raw strings now I think (sorry, the old test was already marginal) |
- address review comments.
clangd/index/Background.cpp | ||
---|---|---|
311 ↗ | (On Diff #171467) | It's updated in update when slabs are updated. |
clangd/index/Background.h | ||
55 ↗ | (On Diff #171467) | These are also used in some helpers in the cpp file. I can make these private and make those helper members if you think it would be better? |
clangd/index/FileIndex.cpp | ||
140 ↗ | (On Diff #171467) | Added the duduplication for the PickOne case. |
clangd/index/SymbolCollector.cpp | ||
442 ↗ | (On Diff #171467) | Do you mean just do this for macros for now? Yes, changes for macros should be trivial. |
clangd/index/SymbolCollector.h | ||
78 ↗ | (On Diff #171467) | This should be trivial for macros, but we still need some refactoring to make it work for symbols (e.g. addDeclaration doesn't support dropping symbols). |
unittests/clangd/BackgroundIndexTests.cpp | ||
33 ↗ | (On Diff #171467) | Changed to A_CC and B_CC |
clangd/index/Background.cpp | ||
---|---|---|
311 ↗ | (On Diff #171467) | In that case, do we need to update IndexedFileDigests here too? (I just realized there's an edge case: what if there are no symbols or references in a file? but I'm not sure if this is different between mainfile and headers, either) |
clangd/index/Background.h | ||
55 ↗ | (On Diff #171467) | oops, true. I'd suggest splitting the difference and keeping FileDigest public and dropping the FileDigests alias altogether, but up to you. |
clangd/index/FileIndex.cpp | ||
140 ↗ | (On Diff #171467) | Thanks! The FIXME is still warranted, to remove deduping from Dex. |
clangd/index/SymbolCollector.cpp | ||
---|---|---|
619 ↗ | (On Diff #171684) | ND.getASTContext().getSourceManager() -> SM ? |
Made some more changes to make the code work in multiple threads (add mutex for digests, take snapshot of digests for each run etc). PTAL
clangd/index/Background.cpp | ||
---|---|---|
311 ↗ | (On Diff #171467) | The edge case seems to justify the update here. We want to make sure that hash for main files are always updated, even when there's no index data from it. |
clangd/index/FileIndex.cpp | ||
140 ↗ | (On Diff #171467) | Done. I removed it in this patch as it seems trivial enough. |
clangd/index/Background.cpp | ||
---|---|---|
194 ↗ | (On Diff #171901) | um, this (before your patch) doesn't look even slightly threadsafe. Sorry I didn't catch it in code review. Now I'm not sure whether it makes sense for you to fix it or not. It seems fine to use two independent locks here, to me - we don't need actual consistency. |
unittests/clangd/DexTests.cpp | ||
494 ↗ | (On Diff #171901) | at this point we should remove the corresponding test for MemIndex (in IndexTests.cpp I think) too |
clangd/index/Background.cpp | ||
---|---|---|
194 ↗ | (On Diff #171901) | Ah, sorry for missing on my side as well. But IIUC, the only unsafe part is access to FileHash since FileSymbols::update is already thread-safe ? If need be I can send out a patch to fix those issues. |
clangd/index/Background.cpp | ||
---|---|---|
194 ↗ | (On Diff #171901) | Oops, you're right - I forgot FileSymbols was threadsafe. Then I think FileHash itself is indeed the only issue and this patch (DigestsMu) fixes it already. Carry on! |
clangd/index/Background.cpp | ||
---|---|---|
235 ↗ | (On Diff #171901) | This call is already thread-safe, no need for it to be under lock. |
clangd/index/Background.cpp | ||
---|---|---|
235 ↗ | (On Diff #171901) | This is for making sure that the update of digest and index data is atomic. If FileSymbols update is not in the lock, we might get the wrong digest for the indexed symbols in corner cases (e.g. T1 updates digest -> T2 updates digest -> T2 update symbols -> T1 updates symbols ). |