This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Count number of references while merging RefSlabs inside FileIndex
ClosedPublic

Authored by kadircet on Mar 18 2019, 2:29 AM.

Details

Summary

For counting number of references clangd was relying on merging every
duplication of a symbol. Unfortunately this does not apply to FileIndex(and one
of its users' BackgroundIndex), since we get rid of duplication by simply
dropping symbols coming from non-canonical locations. So only one or two(coming
from canonical declaration header and defined source file, if exists)
replications of the same symbol reaches merging step.

This patch changes reference counting logic to rather count number of different
RefSlabs a given SymbolID exists.

Diff Detail

Event Timeline

kadircet created this revision.Mar 18 2019, 2:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 18 2019, 2:29 AM

I'm not sure if FileIndex is the correct layer to make decision about how References is calculated. Currently, we have two use cases in clangd 1) one slab per TU and 2) one slab per file. It seems to me that we would want different strategies for the two use cases, so it might make sense to make this configurable in FileIndex. And in one case, we have References ~= # of TUs, and in the other case, we would have References ~= # of files. This can over-count References in the second case. It might be fine as an approximation (if there is no better solution), but we should definitely document it (e.g. in BackgroundIndex).

clang-tools-extra/clangd/index/FileIndex.cpp
50 ↗(On Diff #191061)

this is a bit of a code smell. FileIndex is not supposed to know anything about clangd-indexer etc.

170 ↗(On Diff #191061)

note that references might not always exist. SymbolCollector can be set up to not collect references.

kadircet marked 2 inline comments as done.Mar 19 2019, 2:10 AM

I'm not sure if FileIndex is the correct layer to make decision about how References is calculated. Currently, we have two use cases in clangd 1) one slab per TU and 2) one slab per file. It seems to me that we would want different strategies for the two use cases, so it might make sense to make this configurable in FileIndex. And in one case, we have References ~= # of TUs, and in the other case, we would have References ~= # of files.

References ~= # of TUs case has nothing to do with FileIndex actually, it only happens if references are counted beforehand(which is currently only happening in clangd-indexer).
FileIndex has two ways to handle duplicates during an index merge:

  • pickone which doesn't count references at all (this is the oneslab per TU case)
  • merge which counts references as number of files this symbol has occured (this is the oneslab per File case)

This can over-count References in the second case. It might be fine as an approximation (if there is no better solution), but we should definitely document it (e.g. in BackgroundIndex).

I don't follow why this can over-count, FileIndex keeps only one RefSlab per each file, so we can't over-count? Did you mean it will be more than #TUs ?

clang-tools-extra/clangd/index/FileIndex.cpp
50 ↗(On Diff #191061)

I've actually just left the comment for review so that we can decide what to do about discrepancy(with my reasoning). It is not like FileIndex is somehow tied to clangd-indexer now?

170 ↗(On Diff #191061)

I thought we wouldn't make any promises about References in such a case, do we?

I don't follow why this can over-count, FileIndex keeps only one RefSlab per each file, so we can't over-count? Did you mean it will be more than #TUs ?

Yes. This is how Symbol::References is described in its documentation. If we want to change/expand the semantic, we need to make it clear in the documentation as well.

clang-tools-extra/clangd/index/FileIndex.cpp
170 ↗(On Diff #191061)

Is this documented somewhere?

References and RefSlab have been two independent things. References existed before RefSlab. Now might be a good time to unify them, but I think we should think about how this is reflected in their definitions (documentation etc) and other producers and consumers of the structures.

kadircet marked an inline comment as done.Mar 19 2019, 3:23 AM
kadircet added inline comments.
clang-tools-extra/clangd/index/FileIndex.cpp
170 ↗(On Diff #191061)

Both of them are only produced by SymbolCollector, which has an option CountReferences with a comment saying // Populate the Symbol.References field..
Unfortunately that's not what it does, instead it always sets Symol.References field to 1 for every symbol collected during indexing of a single TU.
Then it is up to the consumers to merge those 1s. So even though we say:

/// The number of translation units that reference this symbol from their main
/// file. This number is only meaningful if aggregated in an index.

in the comments of Symbol::References it is totally up-to-the consumer who performs the merging.

The option controls whether Symbol Occurences should be collected or not.

I see two possible options:

  • Un-tie SymbolCollector from Symbol::References and rather define it as:
/// This field is for the convenience of an aggregated symbol index
  • Rather change Index structure's(consumers of this information) to store(and build) this information internally if they plan to make use of it ?

WDYT?

ioeric added inline comments.Mar 20 2019, 6:21 AM
clang-tools-extra/clangd/index/FileIndex.cpp
33 ↗(On Diff #191061)

nit: s/every file/every slab/?

There is no "file" in this context.

46 ↗(On Diff #191061)

I think we can trigger this assertion with an incomplete background index. For example, we've seen references of a symbol in some files but we've not parsed the file that contains the decls.

50 ↗(On Diff #191061)

the comment is useful. I just think it's in the wrong place. If we define the reference counting behavior for FileSymbols, FileSymbolswouldn't need to know about clangd-indexer or background-index. This comment can then live in background index instead.

170 ↗(On Diff #191061)

Unfortunately that's not what it does, instead it always sets Symol.References field to 1 for every symbol collected during indexing of a single TU.

Looking at the code, I think Symol.References is indeed only populated (i.e. set to 1) when CountReferences is enabled. Honestly, we should probably get rid of CountReferences; I don't see a reason not to count references.

But the statement count one per TU still stands in SymbolCollector, as it always only handles one TU at a time.

Actually, all I'm asking in the initial comment is to define the References counting behavior when there is no reference ;) I think a reasonable solution is to keep the references untouched when there is no reference, so that References aggregation via mergeSymbol can still work (for one-TU-per-Slab use cases).

And we should probably also document the new behavior e.g. DuplicateHandling in FileIndex.h seems to be a good place.

180 ↗(On Diff #191061)

this FIXME can be removed?

189 ↗(On Diff #191061)

any reason not to count references for PickOne?

ilya-biryukov added inline comments.Mar 20 2019, 7:44 AM
clang-tools-extra/clangd/index/FileIndex.cpp
170 ↗(On Diff #191061)

Honestly, we should probably get rid of CountReferences; I don't see a reason not to count references

We introduced it to avoid getting partial counts in dynamic index. In the absence of the static index, they would skew ranking towards symbols used inside the files you have open, which is not really a good signal in the first place.

I'm not sure if that actually hurts much in practice, though. It could very possibly be the case that the added complexity isn't worth it.

kadircet updated this revision to Diff 191844.Mar 22 2019, 3:33 AM
kadircet marked 11 inline comments as done.
  • Address comments
clang-tools-extra/clangd/index/FileIndex.cpp
46 ↗(On Diff #191061)

You are right, thanks for pointing this out!

50 ↗(On Diff #191061)

Moved into index/background.h

189 ↗(On Diff #191061)

Because PickOne is not populating SymsStorage and is merely passing along pointers to SymbolSlabs. Counting references in that case would imply also making a new copy per each unique symbol.

Do you think it is worth it?

ilya-biryukov added inline comments.Apr 25 2019, 5:43 AM
clang-tools-extra/clangd/index/Background.h
113 ↗(On Diff #191844)

We should agree on and stick to a single behavior in both cases.
I suggest keeping the current behavior for now (translation units). Why can't we implement that?

clang-tools-extra/clangd/index/FileIndex.cpp
52 ↗(On Diff #191844)

The comment suggests there's something cheesy going on.
Why would FileSymbols recompute the number of references? Can we avoid this complicated logic?

kadircet marked 2 inline comments as done.Apr 25 2019, 9:25 AM

It has been a long time since I've proposed that change and even I forgot some of the high level details. Therefore, I wanted to sum up the state again so that we can decide on how to move forward.

Currently we have two different on-disk formats for index in clangd:

  • Static index:
    • Merged single file, since merging is done before persistence, the data on-disk contains correct reference counts.
    • Produced by running SymbolCollector on each TU. Collector doesn't keep state between different TUs, merging is done on the caller-site.
    • No need to count references when loading, doesn't interact with FileSymbols.
  • Background index:
    • Sharded multiple files.
    • Produced by running SymbolCollector on each TU, but instead of merging we separate those symbols into distinct files. Merging is done later on within FileSymbols which has no information about TUs.

FileSymbols is currently being used by BackgroundIndex and FileIndex(Dynamic Idx):

  • In the case of BackgroundIndex, a symbol occurs at most twice(once in the header declaring it and once in the implementation file, if they are distinct), therefore when merging is done symbol references doesn't go above that.
  • FileIndex doesn't perform de-duplication before-hand therefore it can perform reference counting while performing the merge. But currently it doesn't perform any merging.

As a result, changing FileSymbols would only effect BackgroundIndex and nothing else. Since merging occurs using the information in RefSlabs this can also be extended to benefit FileIndex and will most definitely be used
by any sort of indexing that performs merging over multiple files.

For unification purposes I would rather delete the References counting logic in SymbolCollector and perform it only at mergers(clangd-indexer and FileSymbols)

clang-tools-extra/clangd/index/Background.h
113 ↗(On Diff #191844)

Because FileSymbols only knows about file names and has no idea about whether a given file is TU or not. We might add some heuristics on the file extensions, or change SymbolCollector to count number of files(this would require maintaining a cache to de-duplicate references coming from the same files but originating from a different TU).

clang-tools-extra/clangd/index/FileIndex.cpp
52 ↗(On Diff #191844)

This was the idea behind https://github.com/clangd/clangd/issues/23. Please see the main comment for a higher level discussion.

ilya-biryukov added inline comments.Apr 29 2019, 3:30 AM
clang-tools-extra/clangd/index/Background.h
113 ↗(On Diff #191844)

Update from the offline conversation:
It should be possible to produce the same counts as static-indexer, will wait for an updated revision.

kadircet updated this revision to Diff 197288.Apr 30 2019, 3:44 AM
  • Change logic to count references from only main files
ilya-biryukov added inline comments.May 8 2019, 6:01 AM
clang-tools-extra/clangd/index/Background.cpp
482 ↗(On Diff #197288)

NIT: maybe initialize with =false to avoid potential UB.
Digest seems uninitialized too, could also ={} for it.

Sorry if this was discussed before and I'm repeating myself, I vaguely remember something similar but not sure what the conclusion was.

clang-tools-extra/clangd/index/FileIndex.cpp
116 ↗(On Diff #197288)

Let's use a named struct here? .second.second is super-hard to read

125 ↗(On Diff #197288)

Could we avoid changing the merge logic?
I.e. the proposal is to keep the merging code the same and add a loop that counts references at the end.

clang-tools-extra/clangd/index/FileIndex.h
60 ↗(On Diff #197288)

NIT: this comment seems more appropriate at buildIndex, maybe move it there?
(It has the DuplicateHandle parameter, so it's closer to the context.

66 ↗(On Diff #197288)

Maybe call the parameter CountReferences?
It would narrow down what the code can do with it and make the code easier to understand

clang-tools-extra/clangd/unittests/FileIndexTests.cpp
49 ↗(On Diff #197288)

NIT: to avoid warnings, we could use the corresponding suffix at the callsite (e.g. 1u) instead of casts
It's probably ok either way here, but having casts can lead to surprises if one passes negative numbers there

kadircet updated this revision to Diff 198770.May 9 2019, 12:18 AM
kadircet marked 7 inline comments as done.
  • Address comments
ilya-biryukov accepted this revision.May 9 2019, 2:13 AM

LGTM

clang-tools-extra/clangd/index/FileIndex.cpp
101 ↗(On Diff #198770)

NIT: maybe simplify to the following?

if (!Refs) {
  FileToRefs.erase(Path);
  return;
}
...

up to you.

clang-tools-extra/clangd/index/FileIndex.h
63 ↗(On Diff #198770)

NIT: use \p Refs

This revision is now accepted and ready to land.May 9 2019, 2:13 AM
kadircet updated this revision to Diff 198816.May 9 2019, 7:18 AM
kadircet marked 2 inline comments as done.
  • Address comments
This revision was automatically updated to reflect the committed changes.