This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Introduce loading of shards within auto-index
ClosedPublic

Authored by kadircet on Dec 3 2018, 8:04 AM.

Diff Detail

Event Timeline

kadircet created this revision.Dec 3 2018, 8:04 AM
kadircet updated this revision to Diff 177005.Dec 6 2018, 10:30 AM
  • Fix a few problems that come up in the field test.

Probably the most important suggestion from my side is trying to clearly separate the enqueue calls (which actually schedule rebuilds of TUs) using clang and the loadShards function.
The index loading should be fast, so I assume we won't win much latency by scheduling the indexing actions in the middle of the shard-loading.
OTOH, separating those two concerns should make the code way more readable (I think, at least).

The concrete proposal is to make loadShards actually only the shards and return a list of TUs that still need to be rebuilt. As the second step, we can enqueue the rebuild of those TUs.

Another important thing that's worth doing is documenting the correctness trade-offs we have in the current model, i.e. when following the include edges of a file (say, foo.h) that had its digest changed, we can potentially:

  1. load stale symbols for some other file (say, bar.h) previously included by it,
  2. never update those symbols to fresh ones if the include edge was dropped, i.e. foo.h does not include bar.h anymore.

The described situation can lead to stale symbols for bar.h being kept indefinitely in some cases, but has the advantage of reparsing less TUs when rebuilding the index. Overall it's definitely fine to have this trade-off for the time-being, but would be nice if we could document it.

The rest is mostly NITs and code style comments.

And thanks for the change, it's an impressive piece of work!

clangd/SourceCode.h
96 ↗(On Diff #177005)

We might want to have a guidance on when this function should be used, similar to getRealPath. Also mention why it's different from the above.

My understanding is that it's only used for canonicalizing the file path for storing them in the index and nowhere else.
Should we discourage usages outside the index? Maybe we can give a more obscure name for this purpose, e.g. canonicalizeForIndex or similar?

107 ↗(On Diff #177005)

NIT: maybe name it canonicalizePath? A bit shorter. But up to you

107 ↗(On Diff #177005)

Maybe put this function right after getRealPath? They both "canonicalize" paths, so it makes sense for them to live together.

clangd/index/Background.cpp
92

Why not vfs->makeAbsolute instead of this function? Maybe worth a comment?

237

Ah, it feels we should move this into a helper function and reuse everywhere.
It's a small detail, but it's so easy to forget about it...

Just complaining, no need to do anything about it...

320

"should be rare in practice"

And yet, can we avoid this altogether?

Also, I believe it won't be rare. When processing multiple different TUs, we can easily run into the same header multiple times, possibly with the different contents.
E.g. imagine the index for the whole of clang is being built and the user is editing Sema.h at the same time. We'll definitely be running into this case over and over again.

351–352

Does it make sense for it to be part of makeCanonicalPath?
Both call sites call vfs->getAbsolutePath before calling the function?

386

This looks like an important optimization. Did we move it to some other place? Why should we remove it?

414

The comment does not mention why should we update it. Is it because of the dependencies?

434

Was this error possible before this change too?

446

This log is not getting called more often after this change, right?
Could you send the log -> vlog replacement as a separate patch to keep this change more focused?

463

NIT: s/io/IO
NIT: s/AbsolutePaths/absolute paths (otherwise looks like we're referring to a local variable or a field with this name)

465

We used to shuffle the ChangedFiles before processing them. I believe the reason for doing that are still relevant. Should we restore this? (Or are we doing it somewhere else now and I missed it?)

clangd/index/Background.h
120

Could you elaborate what are the "sources and headers of the Cmd"? The compile command is typically created for a single source file, so this comment looks a bit confusing.

123

Consider creating a simple struct for this pair. The named field access is much more readable than .first and .second

126

Should this be named VisitedShards?

ilya-biryukov added inline comments.Dec 17 2018, 6:26 AM
clangd/index/Background.cpp
466

NIT: maybe rename to AddShardToIndex or AddShard? LoadShard is too similar to IndexStorage::loadShard, which confused me a little when going through the code for the first time.

471

Maybe return a vector<string> with dependencies that need reindexing? We ignore the dependencies that don't need to be reindexed anyway.

474

NIT: s/more smartly/smarter

477

What is Dependency.second and Dependency.first? Maybe use a named struct instead of a pair here or deconstruct into named variables?

482

Maybe use range-based-for instead? Looks simpler:

for (auto Dep : Dependencies)
  BeingReindexed.insert(Dep.first);
535

maybe log the error? would be nice to somehow recover too, but not sure what we can do here.

clangd/index/BackgroundIndexStorage.cpp
80 ↗(On Diff #177005)

NIT: maybe send in as a separate patch to keep this more focused? (No need to send for review, just commit it)

kadircet marked 4 inline comments as done.Dec 18 2018, 3:38 AM

Sending out everything related to canonical path at: D55818

kadircet updated this revision to Diff 178675.Dec 18 2018, 7:35 AM
kadircet marked 20 inline comments as done.
  • Address comments
kadircet added inline comments.Dec 18 2018, 8:15 AM
clangd/index/Background.cpp
320

Well I am open to ideas, but currently there is no way of knowing whether this is the newer version for the file. Because only information we have is the digest is different than what we currently have and this doesn't yield any chronological information.

386

We kind of moved it into loadShards logic by not running indexing for same file multiple times. I needed to delete this check since we might wanna run indexing on a TU, even if it is up-to-date, to get new symbols coming from one of it's includes.

414

Nice catch, that was the case in one of my experiments but currently during the update we store shards for every source file that exists in include graph even if it has no symbols&refs. So it is no longer necessary.

434

Yes and it got fixed by D55650, just deleting the check it will arrive once I rebase.

471

Yes we ignore them for now, but they might change in one of the consecutive checks. And if that happens we don't need to schedule another TU for re-indexing that dependency, since we mark it as already indexed.

535

Well, if this happens we don't have access to file's contents for some reason. I don't think we can do anything. We might actually want to skip loading these files into index, since they are most likely gone and symbols coming from them won't be accessible. WDYT?

Most comments are NITs, the major one that I'd suggest paying most attention to is about rewriting newer results with an older ones.

clangd/index/Background.cpp
311

This means lock/unlock on every iteration of the loop.
Could we move collecting the files we need to process into a separate loop that would only require a single lock over the whole loop?

320

Do we know which request is "newer" when scheduling it?
If so, we could keep the map of the latest hashes of the files we've seen (in addition to the IndexedFileDigests, which correspond to the digest for which we built the index IIUC).

This would give us a simple invariant of the final state we want to bring the index to: IndexedFileDigests should correspond to the latest hashes seen so far. If not, we have to rebuild the index for the corresponding files. That, in turn, gives us a way to resolve conflicts: we should never replace with symbols built for the latest version (hash) of the file we've seen.

Would that work?

386

Thanks for the explanation, removing the optimization makes sense, we should check for dependency hashes in addition to the original file now.

457

NIT: s/enqueues an indexing operation/add to a list of files requiring reindexing
(or any other wording with the same meaning, the important part is that it does not enqueue any indexing operations anymore)

463

NIT: s/VisitedShards/LoadedShards?
To better align with the name of the local var used in loadShards()

464

Now that we don't actually schedule the indexing, maybe rename this to something like FilesToIndex? The current name suggests there's some processing on the files going somewhere in the background, which is not the case.

465

NIT: Instead of writing this comment, we could accept IndexFileIn && as a parameter, which clearly states we're consuming it (this would require handling the nullptr case in the call sites, but that should be ok).

474

NIT: maybe init at the declaration site? i.e. unique_ptr<Slab> SS = Shared->Symbols ? llvm::make_unique<>() : nullptr;
Same for RS.

479

NIT: use early exits to reduce nesting.

if (!Dependency.NeedsReIndexing || BeingReindexed.count(...))
  continue;
...
481

Maybe be a bit more verbose in the message to clearly state which file is a dependency? Something like Enqueuing TU {0} because its dependency {1} needs re-indexing.

483

Would it make sense to collect all symbols into a local variable and only update under a lock once the whole loading process is finished?
I guess that means ~the same memory requirements, but much less locks and unlocks.

517

Should we mark the file as requiring re-indexing at this point?
Not sure how that might happen in practice, but still...

522

NIT: use early exits to reduce nesting. See LLVM Style Guide.

auto U = URI::parse(...);
if (!U) 
  continue;
auto AbsolutePath = URI::resolve(...);
if (!AbsolutePath)
  continue;
...
524

Why do we need an extra check for self-edges here? Shouldn't InQueue.try_emplace handle this too?

535

Logging the error is enough, thanks!

I think the ideal recovery is tracking whenever the files we were missing were added to the filesystem and rebuilding the index when that happens.
That probably requires more work than we'd like to put into it, though, so I don't think investing in this now makes any sense.

clangd/index/Background.h
122

NIT: Provide the default value to avoid UB on uninitialized values.

124

NIT: all the sources that was touched by this Cmd is a bit confusing (e.g. what does "touched by compile command" mean?) Maybe rephrase to something like Loads the shards for a single TU and all of its dependencies

133

Maybe return Expected<vector<..>> instead of using an out parameter?

kadircet updated this revision to Diff 180706.Jan 8 2019, 11:37 AM
kadircet marked 23 inline comments as done.

Address comments

clangd/index/Background.cpp
320

I am not sure if it would work for non-main files. We update the Hash for the main files at each indexing action, so we can safely keep track of the latest versions. But we collect hashes for dependencies while performing the indexing which happens in parallel. Therefore an indexing action triggered earlier might get an up-to-date version of a dependency than an action triggered later-on.

517

All files are marked as requiring re-indexing by default Dependencies.push_back({AbsolutePath, true}). The second element is always true, and we only mark it as false if we are sure file is up-to-date.

524

It is not looking for a self-edge. It is trying to find source information related to current file(which contains symbols, refs and hash). The nesting seems to be confusing it was just to prevent one redundant try_emplace seems like making the code less readable taking it out.

clangd/index/Background.h
120

It seems like comment become out-dated marking as done.

133

Actually this function can no longer fail, therefore directly returning the vector

ilya-biryukov added inline comments.Jan 9 2019, 2:38 AM
clangd/index/Background.cpp
320

If updates for each TU are atomic (i.e. all files included in the TU are updated in a single go) we would get the up-to-date index eventually, assuming we rebuild the index when TU dependencies change (we don't schedule rebuilds on changes to included header now, but we're planning to do so at some point).

517

My confusion is coming from the fact that I'm constantly forgetting that we return the queue we're processing afterwards.
Could you put a small comment that file will be returned to the caller as requiring a reindex here?

ilya-biryukov added inline comments.Jan 9 2019, 2:40 AM
clangd/SourceCode.h
107 ↗(On Diff #180706)

This changes should go away after the rebase, right?
Could you please run the rebase to make sure the patch is in its final state.

ilya-biryukov added inline comments.Jan 9 2019, 3:03 AM
clangd/index/Background.cpp
238

use llvm::StringRef

314

I now see that my previous comment about lock/unlock for each file was wrong in the first place.
It feels that's we'd better keep the IndexedFileDigests and IndexedSymbols in sync.

Why don't we update the digest in the next loop that actually updates the symbols?

clangd/index/Background.h
123

NIT: maybe remove the constructor? plain structs can easily be initialized with init lists and adding a constructor forbids per-field assignment, which is a bit nicer in some cases.
Not very important, since it's a private API, but still.

kadircet updated this revision to Diff 180823.Jan 9 2019, 5:17 AM
kadircet marked 8 inline comments as done.

Address comments && rebase

There seems to be no unexpected changes after rebase

clangd/index/Background.cpp
238

If the command fails we want to show the filename but we are moving the Cmd, so we need to keep a copy of the filename.

320

Sure, that assumption seems more relaxed than the previous one, just wanna make sure I got it right:
Your suggested solution assumes: Dependencies of a TU won't change during indexing of that TU
Whereas the previous assumption was: Files won't change between close indexing actions

clangd/index/Background.h
123

Once I assign a default value for NeedsReIndexing, I cannot construct Source objects with init lists, therefore I've added a constructor instead.

ilya-biryukov added inline comments.Jan 9 2019, 6:44 AM
clangd/SourceCode.h
21 ↗(On Diff #180823)

Looks redundant. Remove?

clangd/index/Background.cpp
238

Ah, we're moving out of Cmd here... I'd say index should accept a const CompileCommand&? If the goal was to optimize for less copies, then we failed - now we need to copy the filename string. index() is not part of this patch, though, so let's not change it.

Could you add a comment that we can't use StringRef because we move from Cmd? It's easy to miss.

313

We still update digests and symbols non-atomically in that case. Could we do both under the same lock, similar to how it's done in loadShards?

320

Exactly. The idea is that eventually we'll start tracking changes and will be able to detect even those cases and rebuild the index accordingly. Just trying to keep us from drifting away from that model too much.

files won't change between close indexing actions

This assumption worked fine for indexing actions running right away, but I think the situation with loading shards is different: the shards we are loading might be old and if we don't want a stale shard (which might be arbitrarily old) to overwrite results of the fresh indexing action. Let me know if I'm missing something here.

clangd/index/Background.h
123

Right, makes sense. You could initialize explicitly, but I can see how constructor might be simpler to use in that particular use-case.
Since we now have a constructor, could you remove the default arg? It's not possible to get an undefined value for NeedsReindexing with a constructor.

kadircet updated this revision to Diff 180859.Jan 9 2019, 9:05 AM
kadircet marked 12 inline comments as done.

Address comments

clangd/index/Background.cpp
313

Moved update to the end of the loop, but now we might perform unnecessary work for building {symbol,ref}slabs for files that are already up-to-date. It shouldn't be too much of an issue, as a solution we can lock at the entrance and only check if the file was up-to-date, than we update at the exit. Or we can keep a snapshot.

320

Nope, your point seems to be correct

ilya-biryukov added inline comments.Jan 10 2019, 5:43 AM
clangd/URI.h
131 ↗(On Diff #180859)

Maybe move it into the Backgroud.cpp, e.g. as a private member of BackoundIndex?
We don't have other use-case for it to the date and it doesn't really good like a good public API at this point, i.e. lacking docs on what it does and why it's useful, which are useful if we move it into a public header.

clangd/index/Background.cpp
313

LG, thanks. I don't think this extra work should pop up in the profiler, so we're good.

341

NIT: s/we this/this/

342

An alternative strategy is to also update if the IndexedFileDigests does not contain a value.
That would ensure we populate some version of all files as fast as possible, but only update for the latest version. To me personally this looks like a reasonable right trade-off (and would require updating just a few lines of code)

349

This might be too verbose even for vlog(), i.e. this would produce thousands of messages for each TU. Maybe omit it?

452

I believe we want to avoid updating the digests for the dependencies if the digest for the main file has changed in the meantime.
Otherwise, we might record the stale versions of the dependency digests for the rest of the index lifetime.

Note that this might invalidate the assertion in update() that checks all files have digests.

600

Didn't we land the periodic rebuilds? Should this be obsolete now?

clangd/index/Background.h
117

NIT: maybe add a comment that we use this to decide which of the concurrently running indexing operations should take precedence?

clangd/index/SymbolCollector.cpp
213 ↗(On Diff #180859)

NIT: look unrelated, maybe land as a separate NFC commit right away?
Not a big deal, though, feel free to ignore

As discussed offline, let's remove the parts of the change that were aiming to fix the consistency issues, the issues were there before this patch, the fix is getting complicated and doesn't really solve all of the problems. All of that suggests it's out of scope for this change.

kadircet updated this revision to Diff 181044.Jan 10 2019, 6:18 AM
kadircet marked 10 inline comments as done.

Address comments & revert last changes

Only one important comment about bringing back the comment.

clangd/index/Background.cpp
224

a typo: s/that needs/that need/

344

Let's bring back the comment about rewriting the newer version.

kadircet updated this revision to Diff 181072.Jan 10 2019, 8:35 AM
kadircet marked 2 inline comments as done.

Address comments

ilya-biryukov accepted this revision.Jan 10 2019, 9:00 AM

LGTM. Ship it!

This revision is now accepted and ready to land.Jan 10 2019, 9:00 AM
This revision was automatically updated to reflect the committed changes.