RelationSlab is a new index data structure, similar to SymbolSlab and
RefSlab. RelationSlab stores relations between Symbols.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
- Build Status
Buildable 29273 Build 29272: arc lint + arc unit
Event Timeline
clang-tools-extra/clangd/index/Index.h | ||
---|---|---|
26 | One thing I'm wondering about is: would it be better to just use clang::index::SymbolRole here (which has Relation___Of entries that cover what we're interested in) instead of having our own enum? |
This mostly looks good, one high level comment:
I believe it makes sense to deduplicate SymbolIDs for RelationSlab.
Up until now, we mostly had only one occurence of a SymbolID in a Slab, but RelationSlab does not follow that assumption.
Also can you add a few tests after the above mentioned check to make sure interning SymbolIDs works as expected?
clang-tools-extra/clangd/index/Index.h | ||
---|---|---|
26 | I totally agree, but that struct is 4 bytes. I am not sure it is worth the trade off when storing the relationslab, but since this patch is not related to serialization let's get rid of RelationKind and make use of clang::index::SymbolRole. When landing serialization part we can either split clang::index::SymbolRole into two parts or add some custom mapping to serialize only relevant bits etc. | |
60 | use capacity instead of size | |
68 | I am not sure comment currently applies to RelationSlab internals, since everything is of value type anyways, there are no references. |
clang-tools-extra/clangd/index/Index.cpp | ||
---|---|---|
35 ↗ | (On Diff #190785) | Use ArrayRef::copy(), for example: https://reviews.llvm.org/D58782 |
clang-tools-extra/clangd/index/Index.h | ||
44 | struct Relation? And in the comments for it, please explain which way the relationship is directed (is the SymbolID in the key the subtype? or is the SymbolID in the ArrayRef the subtype?). | |
89 | Please move all new declarations into Relation.h. |
clang-tools-extra/clangd/index/Index.h | ||
---|---|---|
44 | Ah exactly my thoughts, forget to mention this. I believe current usage is the counter-intuitive one. For example, we will most likely query with something like: So that this valueType can be read like, `SymbolID` is `RelationKind` every `SymbolID inside array` WDYT? |
Just to make sure I understand, do you mean:
(A) When adding a SymbolID to an entry's value, check that it's not already there; or
(B) Try to conserve space by not storing SymbolIDs directly in the entries, but storing an index into a separate list of unique SymbolIDs.
If it's (B), how many bytes should the index be? Are the space gains worth the complexity, given that SymbolID is only 8 bytes to begin with? (As compared to say, the filenames in Ref, which can be much longer, making this sort of optimization more clearly worth it.)
clang-tools-extra/clangd/index/Index.h | ||
---|---|---|
44 | The way I was thinking of it is that getRelations(SymbolID, baseOf) would return all the bases of SymbolID. However, the opposite interpretation is also reasonable, we just need to pick one and document it. I'm happy to go with your suggested one. | |
60 | Note, RefSlab::bytes() (which I where I copied this from) uses size() as well. Should I change that too? |
Address review comments, except for the deduplication which is still under discussion
clang-tools-extra/clangd/index/Index.h | ||
---|---|---|
60 | I'd say yes -- in a separate patch though. Thanks for catching it! | |
clang-tools-extra/clangd/index/Relation.h | ||
1 | "--- Relation.h --------------------" | |
41 | Three slashes for doc comments, please. | |
45 | Lift it up into the clang::clangd namespace? (like Symbol and Ref) | |
68 | No need to repeat the type name being documented. "A mutable container that can ..." |
That was the case, I agree with you we should rather do that while serializing where we have variable length integers and duplication is more severe(all three slabs make use of same SymbolID pool).
clang-tools-extra/clangd/index/Index.h | ||
---|---|---|
44 | It looks like IndexingAPI is also using the interpretation I suggested, so let's move with that one if you don't have any other concerns. |
clang-tools-extra/clangd/index/Index.h | ||
---|---|---|
44 | Already updated to this interpretation :) | |
clang-tools-extra/clangd/index/Relation.h | ||
45 | This comment made me realize that I haven't addressed your previous comment properly: I haven't changed RelationSlab::value_type from std::pair<RelationKey, llvm::ArrayRef<SymbolID>> to Relation. I tried to make that change this time, and ran into a problem: In the rest of the subtypes patch (D58880), one of the things I do is extend the MemIndex constructor so that, in addition to taking a symbol range and a ref range, it takes a relation range. That constructor assumes that the elements of that range have members of some name - either first and second (as currently in D58880), or Key and Value. However, that constructor has two call sites. In MemIndex::build(), we pass in the slabs themselves as the ranges. So, if we make this change, the field names for that call site will be Key and Value. However, for the second call site in FileSymbols::buildIndex(), the ranges that are passed in are DenseMaps, and therefore their elements' field names are necessarily first and second. The same constructor cannot therefore accept both ranges. How do you propose we address this?
|
clang-tools-extra/clangd/index/Relation.h | ||
---|---|---|
1 | (Sorry, I have these comments addressed locally, was just waiting for the resolution of the remaining issue before uploading.) |
As this is the first of a series of patches adding support for relations, and then building type hierarchy subtypes on top (as discussed here), how should we go about landing this -- should we land each patch in the series as soon as it's ready, or should we wait to land them all together?
Ok, cool. In that case, I think this patch is ready to be committed, and would appreciate it if someone could commit it. Thanks!
(Sorry to arrive late at this discussion, I'm just back from leave.
I have some suggestions and would appreciate your thoughts, but if simply this feels too much like going around in circles I'm happy to work out how we can address this after this patch lands instead.
Have discussed offline with @gribozavr and I think we're roughly on the same page. @kadircet is now out on leave)
I think the data model might be overly complicated here.
I can see how the discussion got here: it mirrors the other *Slab types quite closely. But:
- those types were designed to solve specific problems that we don't have here (string deduplication and symbol merging)
- I think the class-parent relation is pretty sparse (maybe 1 edge per 10 symbols?) so lots of options will work fine
- I don't think we yet know what the more resource-critical (denser) relations and queries are, so it's unclear what to optimize for
I think the simplest model that can work here is something like:
- Relation is struct { SymbolID Subject; SymbolRole Relation; SymbolID Object }
- this is a value type, so could be passed around simply as std::vector<Relation>. If RelationSlab is desired for symmetry, it could be an alias or simple wrapper around std::vector<Relation>.
This has a few advantages:
- the Relation value_type is self-contained, has clearer semantics, and works as the result of any type of query: based on subject and/or predicate and/or object. (Similar to how it's convenient that Symbol contains SymbolID).
- if lookup is desired, lookup by subject, subject+predicate, or full-triple is possible by sorting in SPO order with binary search. Return type is simple ArrayRef<Relation>. (Though fancy lookup maybe better belongs in MemIndex/Dex rather than in the slab). For more query types, storing a copies in OSP and/or POS order is pretty cheap too.
- memory efficiency: it costs 20*distinct(subject, predicate, object) bytes, vs 28*distinct(subject, predicate) + 8 * distinct(subject, predicate, object). Unless the average number of objects for a (subject, predicate) pair (that has at least one object) is >2.3, the former is smaller. For the current use case, this average is certainly <1.1.
- simplicity: no arenas, easy mental model.
I see discussion of (something like) this option stalled around the index constructors, but I don't see a fundamental block there. The concept that the index constructor should be templated over would be Iterable<Relation>. LMK if I'm missing something.
@sammccall, thank you for having a look at this.
I have no objection to revising the data model if there's agreement on a better one.
Based on a brief mental survey of C++ IDE features I'm familiar with, I can think of the following additional uses of the relations capability:
- A call hierarchy feature (which is also proposed for LSP, with client and server implementation efforts) would need every caller-callee relationship to be recorded in the index (RelationCalledBy).
- Given a virtual method declaration, a user may want to see a list of implementations (overriders) and navigate to one or more of them. This would need every overrider relationship to be recorded in the index (RelationOverrideOf).
Intuitively, it seems like RelationOverrideOf would be slightly denser than RelationChildOf (though not by much), while RelationCalledBy would be significantly denser. In terms of queries, I believe the key for lookups for both of the above would be a (subject, predicate) pair, just like for subtypes.
Does that change your analysis at all?
I guess I should clear the "Accepted" status until we settle the question of the data model.
Sorry for the slow response here. Override and callgraph are great examples!
As you say, override is probably pretty sparse and it's probably not worth worrying about the storage too much.
If we stored a callgraph we'd definitely need to worry about the representation though. The space-saving hierarchy in this case would be map<relationtype, map<callee, caller>>I guess. Maybe storing one vector<pair<Subject, Predicate>> for each relationship type would work here - querying for a bunch of relationship types is rare.
One thing that strikes me here is that this case is very similar to our existing Ref data - it's basically a subset, but with a symbolid payload instead of location. We could consider just adding the SymbolID to Refs - it'd blow up the size of that by 50%, but we may not do much better with some other representation, and it would avoid adding any new complexity.
Ultimately it may also be that supporting both find references and callgraph (which provide closely overlapping functionality) isn't a good use of index memory.
Note that this was considered and rejected earlier (see here and here), though that discussion did not consider denser relations like callgraph.
Given the additional information and perspectives from the conversation here, I have two suggestions for potential ways forward:
Approach 1: Add SymbolID to Refs
- This would be initially wasteful when only subtypes use it, but if we then build callgraph on top of it as well it will become significantly less wasteful.
- This has the benefit that we don't have duplicate information for find-references and callgraph (they both use Refs).
- This approach probably adds the least amount of complexity overall.
Approach 2: Add a RelationSlab storing (subject, predicate, object) triples, intended for sparse relations
- This would allow us to implement features that require sparse relations, such as subtypes and overrides, without any significant increase to index size.
- If we later want to add dense relations like callgraph, we'd use a different mechanism for them (possibly by adding SymbolID to Refs as in approach 1).
- If we do end up adding SymbolID to Refs for callgraph, and want to start using that for sparse relations too, we can rip out RelationSlab.
- This also adds relatively little complexity, though slightly more than approach 1, and we are throwing away some code if we end up adding SymbolID to Refs eventually.
Any thoughts on which of the approaches to take, or something else altogether? It would be good to have some confidence that the chosen approach will pass code review before I implement it :)
Hi, any update here? I would appreciate some guidance so I can move forward with this.
Hi Nathan, sorry for the stall here, and for repeatedly going over the same issues.
The design space here is pretty complicated.
I think the conclusion of recent offline discussions is:
- refs and relations can be the same thing-ish when seen through a certain lens.
- Unifying them probably isn't space-prohibitive *if* we implement callgraph: it's something like +15% to overall memory usage, and storing callgraph in another way is likely to be similar.
- however unifying the concepts seems likely to be awkward in practice. LSP features seem to want one or the other but not both - symbolID is generally better than location if it's precise enough, and useless if not. We're storing redundant information (the location for the method override ref is the same as the declaration of the related symbol). We risk tying ourselves in knots maintaining a model that doesn't map well onto our problems.
- In terms of storage size, relation-major (map<SymbolRole, vector<pair<SymbolID, SymbolID>>> or so) seems like a quick win. But it's a small enough one that we should try to live without it first.
So if you can stomach it, I think
Approach 2: Add a RelationSlab storing (subject, predicate, object) triples, intended for sparse relations*
is certainly fine with us (having spoken with @kadircet @ilya-biryukov @sammccall @gribozavr - @hokein is on vacation but not nearly as stubborn as I am!)
Implemented discussed design approach ("Add a RelationSlab storing (subject, predicate, object) triples, intended for sparse relations")
Mostly LG from my side, thanks!
clang-tools-extra/clangd/index/Relation.cpp | ||
---|---|---|
20 | I would suggest adding another version which returns all the relations a given Subject participates, but I suppose currently we don't have any use cases. Feel free to add or skip, that can be added when necessary. | |
25 | std::tie | |
clang-tools-extra/clangd/index/Relation.h | ||
32 | nit: could you use std::tie | |
36 | again std::tie | |
53 | I believe RelationSlab should still be immutable, even after the build operation below RelationSlab is mutable. Also are there any use-cases requiring RelationSlab to be mutable? | |
61 | there seems to be some typos in the comment | |
73 | maybe rename to insert to keep symmetry with other slabs? | |
80 | maybe even get rid of sort by inlining it into build? | |
83 | again, we could get rid of this comment if slab was immutable | |
90 | Do we really need this? |
Address review comments
clang-tools-extra/clangd/index/Relation.cpp | ||
---|---|---|
20 | I'd rather add this when a use case arises. | |
clang-tools-extra/clangd/index/Relation.h | ||
53 | I'm not aware of any use-cases requiring it to be mutable. I'm making it immutable as suggested. | |
90 | I think so. The builder needs a way to construct the slab. Aggregate initialization doesn't work because the class already has declared default and copy constructors. |
LGTM, except the duplication issue. Thanks!
clang-tools-extra/clangd/index/Relation.h | ||
---|---|---|
76 | maybe use a set so that we can be sure that there won't be any duplicates. sorry for missing that in the previous iteration. | |
clang-tools-extra/clangd/unittests/IndexTests.cpp | ||
79 ↗ | (On Diff #201427) | could you also add a test case that's inserting duplicate relations? |
Awesome, thank you!
clang-tools-extra/clangd/index/Relation.h | ||
---|---|---|
76 | (you're sorting the array at the end anyway, so sort + unique + erase seems neater) |
btw, I believe you have enough good quality patches to apply for commit access.
Are there any obstacles that is keeping you from applying?
One thing I'm wondering about is: would it be better to just use clang::index::SymbolRole here (which has Relation___Of entries that cover what we're interested in) instead of having our own enum?