Whenever a change happens on a CDB, load shards associated with that
CDB before issuing re-index actions.
Details
Diff Detail
- Repository
- rCTE Clang Tools Extra
- Build Status
Buildable 26560 Build 26559: arc lint + arc unit
Event Timeline
Probably the most important suggestion from my side is trying to clearly separate the enqueue calls (which actually schedule rebuilds of TUs) using clang and the loadShards function.
The index loading should be fast, so I assume we won't win much latency by scheduling the indexing actions in the middle of the shard-loading.
OTOH, separating those two concerns should make the code way more readable (I think, at least).
The concrete proposal is to make loadShards actually only the shards and return a list of TUs that still need to be rebuilt. As the second step, we can enqueue the rebuild of those TUs.
Another important thing that's worth doing is documenting the correctness trade-offs we have in the current model, i.e. when following the include edges of a file (say, foo.h) that had its digest changed, we can potentially:
- load stale symbols for some other file (say, bar.h) previously included by it,
- never update those symbols to fresh ones if the include edge was dropped, i.e. foo.h does not include bar.h anymore.
The described situation can lead to stale symbols for bar.h being kept indefinitely in some cases, but has the advantage of reparsing less TUs when rebuilding the index. Overall it's definitely fine to have this trade-off for the time-being, but would be nice if we could document it.
The rest is mostly NITs and code style comments.
And thanks for the change, it's an impressive piece of work!
clangd/SourceCode.h | ||
---|---|---|
96 ↗ | (On Diff #177005) | We might want to have a guidance on when this function should be used, similar to getRealPath. Also mention why it's different from the above. My understanding is that it's only used for canonicalizing the file path for storing them in the index and nowhere else. |
107 ↗ | (On Diff #177005) | NIT: maybe name it canonicalizePath? A bit shorter. But up to you |
107 ↗ | (On Diff #177005) | Maybe put this function right after getRealPath? They both "canonicalize" paths, so it makes sense for them to live together. |
clangd/index/Background.cpp | ||
62 | Why not vfs->makeAbsolute instead of this function? Maybe worth a comment? | |
207 | Ah, it feels we should move this into a helper function and reuse everywhere. Just complaining, no need to do anything about it... | |
290 |
And yet, can we avoid this altogether? Also, I believe it won't be rare. When processing multiple different TUs, we can easily run into the same header multiple times, possibly with the different contents. | |
324–325 | Does it make sense for it to be part of makeCanonicalPath? | |
386 | This looks like an important optimization. Did we move it to some other place? Why should we remove it? | |
389 | The comment does not mention why should we update it. Is it because of the dependencies? | |
409 | Was this error possible before this change too? | |
421 | This log is not getting called more often after this change, right? | |
449 | NIT: s/io/IO | |
451 | We used to shuffle the ChangedFiles before processing them. I believe the reason for doing that are still relevant. Should we restore this? (Or are we doing it somewhere else now and I missed it?) | |
clangd/index/Background.h | ||
121 | Could you elaborate what are the "sources and headers of the Cmd"? The compile command is typically created for a single source file, so this comment looks a bit confusing. | |
124 | Consider creating a simple struct for this pair. The named field access is much more readable than .first and .second | |
127 | Should this be named VisitedShards? |
clangd/index/Background.cpp | ||
---|---|---|
449 | NIT: maybe rename to AddShardToIndex or AddShard? LoadShard is too similar to IndexStorage::loadShard, which confused me a little when going through the code for the first time. | |
457 | Maybe return a vector<string> with dependencies that need reindexing? We ignore the dependencies that don't need to be reindexed anyway. | |
460 | NIT: s/more smartly/smarter | |
463 | What is Dependency.second and Dependency.first? Maybe use a named struct instead of a pair here or deconstruct into named variables? | |
468 | Maybe use range-based-for instead? Looks simpler: for (auto Dep : Dependencies) BeingReindexed.insert(Dep.first); | |
518 | maybe log the error? would be nice to somehow recover too, but not sure what we can do here. | |
clangd/index/BackgroundIndexStorage.cpp | ||
80 ↗ | (On Diff #177005) | NIT: maybe send in as a separate patch to keep this more focused? (No need to send for review, just commit it) |
clangd/index/Background.cpp | ||
---|---|---|
290 | Well I am open to ideas, but currently there is no way of knowing whether this is the newer version for the file. Because only information we have is the digest is different than what we currently have and this doesn't yield any chronological information. | |
386 | We kind of moved it into loadShards logic by not running indexing for same file multiple times. I needed to delete this check since we might wanna run indexing on a TU, even if it is up-to-date, to get new symbols coming from one of it's includes. | |
389 | Nice catch, that was the case in one of my experiments but currently during the update we store shards for every source file that exists in include graph even if it has no symbols&refs. So it is no longer necessary. | |
409 | Yes and it got fixed by D55650, just deleting the check it will arrive once I rebase. | |
457 | Yes we ignore them for now, but they might change in one of the consecutive checks. And if that happens we don't need to schedule another TU for re-indexing that dependency, since we mark it as already indexed. | |
518 | Well, if this happens we don't have access to file's contents for some reason. I don't think we can do anything. We might actually want to skip loading these files into index, since they are most likely gone and symbols coming from them won't be accessible. WDYT? |
Most comments are NITs, the major one that I'd suggest paying most attention to is about rewriting newer results with an older ones.
clangd/index/Background.cpp | ||
---|---|---|
281 | This means lock/unlock on every iteration of the loop. | |
290 | Do we know which request is "newer" when scheduling it? This would give us a simple invariant of the final state we want to bring the index to: IndexedFileDigests should correspond to the latest hashes seen so far. If not, we have to rebuild the index for the corresponding files. That, in turn, gives us a way to resolve conflicts: we should never replace with symbols built for the latest version (hash) of the file we've seen. Would that work? | |
386 | Thanks for the explanation, removing the optimization makes sense, we should check for dependency hashes in addition to the original file now. | |
443 | NIT: s/enqueues an indexing operation/add to a list of files requiring reindexing | |
446 | NIT: s/VisitedShards/LoadedShards? | |
448 | NIT: Instead of writing this comment, we could accept IndexFileIn && as a parameter, which clearly states we're consuming it (this would require handling the nullptr case in the call sites, but that should be ok). | |
450 | Now that we don't actually schedule the indexing, maybe rename this to something like FilesToIndex? The current name suggests there's some processing on the files going somewhere in the background, which is not the case. | |
457 | NIT: maybe init at the declaration site? i.e. unique_ptr<Slab> SS = Shared->Symbols ? llvm::make_unique<>() : nullptr; | |
465 | NIT: use early exits to reduce nesting. if (!Dependency.NeedsReIndexing || BeingReindexed.count(...)) continue; ... | |
466 | Would it make sense to collect all symbols into a local variable and only update under a lock once the whole loading process is finished? | |
467 | Maybe be a bit more verbose in the message to clearly state which file is a dependency? Something like Enqueuing TU {0} because its dependency {1} needs re-indexing. | |
500 | Should we mark the file as requiring re-indexing at this point? | |
505 | NIT: use early exits to reduce nesting. See LLVM Style Guide. auto U = URI::parse(...); if (!U) continue; auto AbsolutePath = URI::resolve(...); if (!AbsolutePath) continue; ... | |
507 | Why do we need an extra check for self-edges here? Shouldn't InQueue.try_emplace handle this too? | |
518 | Logging the error is enough, thanks! I think the ideal recovery is tracking whenever the files we were missing were added to the filesystem and rebuilding the index when that happens. | |
clangd/index/Background.h | ||
123 | NIT: Provide the default value to avoid UB on uninitialized values. | |
125 | NIT: all the sources that was touched by this Cmd is a bit confusing (e.g. what does "touched by compile command" mean?) Maybe rephrase to something like Loads the shards for a single TU and all of its dependencies | |
134 | Maybe return Expected<vector<..>> instead of using an out parameter? |
Address comments
clangd/index/Background.cpp | ||
---|---|---|
290 | I am not sure if it would work for non-main files. We update the Hash for the main files at each indexing action, so we can safely keep track of the latest versions. But we collect hashes for dependencies while performing the indexing which happens in parallel. Therefore an indexing action triggered earlier might get an up-to-date version of a dependency than an action triggered later-on. | |
500 | All files are marked as requiring re-indexing by default Dependencies.push_back({AbsolutePath, true}). The second element is always true, and we only mark it as false if we are sure file is up-to-date. | |
507 | It is not looking for a self-edge. It is trying to find source information related to current file(which contains symbols, refs and hash). The nesting seems to be confusing it was just to prevent one redundant try_emplace seems like making the code less readable taking it out. | |
clangd/index/Background.h | ||
121 | It seems like comment become out-dated marking as done. | |
134 | Actually this function can no longer fail, therefore directly returning the vector |
clangd/index/Background.cpp | ||
---|---|---|
290 | If updates for each TU are atomic (i.e. all files included in the TU are updated in a single go) we would get the up-to-date index eventually, assuming we rebuild the index when TU dependencies change (we don't schedule rebuilds on changes to included header now, but we're planning to do so at some point). | |
500 | My confusion is coming from the fact that I'm constantly forgetting that we return the queue we're processing afterwards. |
clangd/SourceCode.h | ||
---|---|---|
107 ↗ | (On Diff #180706) | This changes should go away after the rebase, right? |
clangd/index/Background.cpp | ||
---|---|---|
208 | use llvm::StringRef | |
284 | I now see that my previous comment about lock/unlock for each file was wrong in the first place. Why don't we update the digest in the next loop that actually updates the symbols? | |
clangd/index/Background.h | ||
124 | NIT: maybe remove the constructor? plain structs can easily be initialized with init lists and adding a constructor forbids per-field assignment, which is a bit nicer in some cases. |
There seems to be no unexpected changes after rebase
clangd/index/Background.cpp | ||
---|---|---|
208 | If the command fails we want to show the filename but we are moving the Cmd, so we need to keep a copy of the filename. | |
290 | Sure, that assumption seems more relaxed than the previous one, just wanna make sure I got it right: | |
clangd/index/Background.h | ||
124 | Once I assign a default value for NeedsReIndexing, I cannot construct Source objects with init lists, therefore I've added a constructor instead. |
clangd/SourceCode.h | ||
---|---|---|
21 ↗ | (On Diff #180823) | Looks redundant. Remove? |
clangd/index/Background.cpp | ||
208 | Ah, we're moving out of Cmd here... I'd say index should accept a const CompileCommand&? If the goal was to optimize for less copies, then we failed - now we need to copy the filename string. index() is not part of this patch, though, so let's not change it. Could you add a comment that we can't use StringRef because we move from Cmd? It's easy to miss. | |
283 | We still update digests and symbols non-atomically in that case. Could we do both under the same lock, similar to how it's done in loadShards? | |
290 | Exactly. The idea is that eventually we'll start tracking changes and will be able to detect even those cases and rebuild the index accordingly. Just trying to keep us from drifting away from that model too much.
This assumption worked fine for indexing actions running right away, but I think the situation with loading shards is different: the shards we are loading might be old and if we don't want a stale shard (which might be arbitrarily old) to overwrite results of the fresh indexing action. Let me know if I'm missing something here. | |
clangd/index/Background.h | ||
124 | Right, makes sense. You could initialize explicitly, but I can see how constructor might be simpler to use in that particular use-case. |
Address comments
clangd/index/Background.cpp | ||
---|---|---|
283 | Moved update to the end of the loop, but now we might perform unnecessary work for building {symbol,ref}slabs for files that are already up-to-date. It shouldn't be too much of an issue, as a solution we can lock at the entrance and only check if the file was up-to-date, than we update at the exit. Or we can keep a snapshot. | |
290 | Nope, your point seems to be correct |
clangd/URI.h | ||
---|---|---|
131 | Maybe move it into the Backgroud.cpp, e.g. as a private member of BackoundIndex? | |
clangd/index/Background.cpp | ||
283 | LG, thanks. I don't think this extra work should pop up in the profiler, so we're good. | |
311 | NIT: s/we this/this/ | |
312 | An alternative strategy is to also update if the IndexedFileDigests does not contain a value. | |
319 | This might be too verbose even for vlog(), i.e. this would produce thousands of messages for each TU. Maybe omit it? | |
435 | I believe we want to avoid updating the digests for the dependencies if the digest for the main file has changed in the meantime. Note that this might invalidate the assertion in update() that checks all files have digests. | |
586 | Didn't we land the periodic rebuilds? Should this be obsolete now? | |
clangd/index/Background.h | ||
117 | NIT: maybe add a comment that we use this to decide which of the concurrently running indexing operations should take precedence? | |
clangd/index/SymbolCollector.cpp | ||
213 | NIT: look unrelated, maybe land as a separate NFC commit right away? |
As discussed offline, let's remove the parts of the change that were aiming to fix the consistency issues, the issues were there before this patch, the fix is getting complicated and doesn't really solve all of the problems. All of that suggests it's out of scope for this change.
Maybe move it into the Backgroud.cpp, e.g. as a private member of BackoundIndex?
We don't have other use-case for it to the date and it doesn't really good like a good public API at this point, i.e. lacking docs on what it does and why it's useful, which are useful if we move it into a public header.