This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Add RelationSlab
ClosedPublic

Authored by nridge on Mar 14 2019, 10:28 PM.

Details

Summary

RelationSlab is a new index data structure, similar to SymbolSlab and
RefSlab. RelationSlab stores relations between Symbols.

Event Timeline

nridge created this revision.Mar 14 2019, 10:28 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 14 2019, 10:28 PM
nridge marked an inline comment as done.Mar 14 2019, 10:30 PM
nridge added inline comments.
clang-tools-extra/clangd/index/Index.h
26

One thing I'm wondering about is: would it be better to just use clang::index::SymbolRole here (which has Relation___Of entries that cover what we're interested in) instead of having our own enum?

This mostly looks good, one high level comment:
I believe it makes sense to deduplicate SymbolIDs for RelationSlab.
Up until now, we mostly had only one occurence of a SymbolID in a Slab, but RelationSlab does not follow that assumption.

Also can you add a few tests after the above mentioned check to make sure interning SymbolIDs works as expected?

clang-tools-extra/clangd/index/Index.h
26

I totally agree, but that struct is 4 bytes. I am not sure it is worth the trade off when storing the relationslab, but since this patch is not related to serialization let's get rid of RelationKind and make use of clang::index::SymbolRole.

When landing serialization part we can either split clang::index::SymbolRole into two parts or add some custom mapping to serialize only relevant bits etc.

60

use capacity instead of size

68

I am not sure comment currently applies to RelationSlab internals, since everything is of value type anyways, there are no references.

gribozavr added inline comments.Mar 15 2019, 6:35 AM
clang-tools-extra/clangd/index/Index.cpp
35 ↗(On Diff #190785)

Use ArrayRef::copy(), for example: https://reviews.llvm.org/D58782

clang-tools-extra/clangd/index/Index.h
44

struct Relation? And in the comments for it, please explain which way the relationship is directed (is the SymbolID in the key the subtype? or is the SymbolID in the ArrayRef the subtype?).

89

Please move all new declarations into Relation.h.

kadircet added inline comments.Mar 15 2019, 6:55 AM
clang-tools-extra/clangd/index/Index.h
44

Ah exactly my thoughts, forget to mention this.

I believe current usage is the counter-intuitive one. For example, we will most likely query with something like:
getRelations(SymbolID, baseOf) to get all relations where SymbolID is baseOf something else(which says get children of SymbolID)

So that this valueType can be read like,

`SymbolID` is `RelationKind` every `SymbolID inside array`

WDYT?

I believe it makes sense to deduplicate SymbolIDs for RelationSlab.
Up until now, we mostly had only one occurence of a SymbolID in a Slab, but RelationSlab does not follow that assumption.

Just to make sure I understand, do you mean:

(A) When adding a SymbolID to an entry's value, check that it's not already there; or
(B) Try to conserve space by not storing SymbolIDs directly in the entries, but storing an index into a separate list of unique SymbolIDs.

If it's (B), how many bytes should the index be? Are the space gains worth the complexity, given that SymbolID is only 8 bytes to begin with? (As compared to say, the filenames in Ref, which can be much longer, making this sort of optimization more clearly worth it.)

nridge marked 7 inline comments as done.Mar 17 2019, 12:20 PM
nridge added inline comments.Mar 17 2019, 12:20 PM
clang-tools-extra/clangd/index/Index.h
44

The way I was thinking of it is that getRelations(SymbolID, baseOf) would return all the bases of SymbolID. However, the opposite interpretation is also reasonable, we just need to pick one and document it. I'm happy to go with your suggested one.

60

Note, RefSlab::bytes() (which I where I copied this from) uses size() as well. Should I change that too?

nridge updated this revision to Diff 191036.Mar 17 2019, 12:21 PM
nridge marked an inline comment as done.

Address review comments, except for the deduplication which is still under discussion

gribozavr accepted this revision.Mar 18 2019, 1:53 AM
gribozavr added inline comments.
clang-tools-extra/clangd/index/Index.h
60

I'd say yes -- in a separate patch though. Thanks for catching it!

clang-tools-extra/clangd/index/Relation.h
2

"--- Relation.h --------------------"

42

Three slashes for doc comments, please.

46

Lift it up into the clang::clangd namespace? (like Symbol and Ref)

69

No need to repeat the type name being documented. "A mutable container that can ..."

This revision is now accepted and ready to land.Mar 18 2019, 1:53 AM

If it's (B), how many bytes should the index be? Are the space gains worth the complexity, given that SymbolID is only 8 bytes to begin with? (As compared to say, the filenames in Ref, which can be much longer, making this sort of optimization more clearly worth it.)

That was the case, I agree with you we should rather do that while serializing where we have variable length integers and duplication is more severe(all three slabs make use of same SymbolID pool).

clang-tools-extra/clangd/index/Index.h
44

It looks like IndexingAPI is also using the interpretation I suggested, so let's move with that one if you don't have any other concerns.

nridge marked 8 inline comments as done.Mar 21 2019, 7:01 PM
nridge added inline comments.
clang-tools-extra/clangd/index/Index.h
44

Already updated to this interpretation :)

clang-tools-extra/clangd/index/Relation.h
46

This comment made me realize that I haven't addressed your previous comment properly: I haven't changed RelationSlab::value_type from std::pair<RelationKey, llvm::ArrayRef<SymbolID>> to Relation.

I tried to make that change this time, and ran into a problem:

In the rest of the subtypes patch (D58880), one of the things I do is extend the MemIndex constructor so that, in addition to taking a symbol range and a ref range, it takes a relation range.

That constructor assumes that the elements of that range have members of some name - either first and second (as currently in D58880), or Key and Value.

However, that constructor has two call sites. In MemIndex::build(), we pass in the slabs themselves as the ranges. So, if we make this change, the field names for that call site will be Key and Value. However, for the second call site in FileSymbols::buildIndex(), the ranges that are passed in are DenseMaps, and therefore their elements' field names are necessarily first and second. The same constructor cannot therefore accept both ranges.

How do you propose we address this?

  • Scrap struct Relation, and keep value_type as std::pair<RelationKey, llvm::ArrayRef<SymbolID>>?
  • Keep struct Relation, but make its fields named first and second?
  • Split the constructor of MemIndex into two constructors, to accomodate both sets of field names?
  • Something else?
gribozavr added inline comments.Mar 22 2019, 6:04 AM
clang-tools-extra/clangd/index/Relation.h
2

Not done?

42

Not done?

46

I guess we should scrap it then. Kadir, WDYT?

69

Not done?

nridge marked 2 inline comments as done.Mar 22 2019, 8:04 AM
nridge added inline comments.
clang-tools-extra/clangd/index/Relation.h
2

(Sorry, I have these comments addressed locally, was just waiting for the resolution of the remaining issue before uploading.)

nridge updated this revision to Diff 191886.Mar 22 2019, 8:28 AM

Scrapped 'struct Relation' and addressed other comments

As this is the first of a series of patches adding support for relations, and then building type hierarchy subtypes on top (as discussed here), how should we go about landing this -- should we land each patch in the series as soon as it's ready, or should we wait to land them all together?

Submitting code as it becomes ready is the usual practice here.

Ok, cool. In that case, I think this patch is ready to be committed, and would appreciate it if someone could commit it. Thanks!

(Sorry to arrive late at this discussion, I'm just back from leave.
I have some suggestions and would appreciate your thoughts, but if simply this feels too much like going around in circles I'm happy to work out how we can address this after this patch lands instead.
Have discussed offline with @gribozavr and I think we're roughly on the same page. @kadircet is now out on leave)

I think the data model might be overly complicated here.
I can see how the discussion got here: it mirrors the other *Slab types quite closely. But:

  • those types were designed to solve specific problems that we don't have here (string deduplication and symbol merging)
  • I think the class-parent relation is pretty sparse (maybe 1 edge per 10 symbols?) so lots of options will work fine
  • I don't think we yet know what the more resource-critical (denser) relations and queries are, so it's unclear what to optimize for

I think the simplest model that can work here is something like:

  • Relation is struct { SymbolID Subject; SymbolRole Relation; SymbolID Object }
  • this is a value type, so could be passed around simply as std::vector<Relation>. If RelationSlab is desired for symmetry, it could be an alias or simple wrapper around std::vector<Relation>.

This has a few advantages:

  • the Relation value_type is self-contained, has clearer semantics, and works as the result of any type of query: based on subject and/or predicate and/or object. (Similar to how it's convenient that Symbol contains SymbolID).
  • if lookup is desired, lookup by subject, subject+predicate, or full-triple is possible by sorting in SPO order with binary search. Return type is simple ArrayRef<Relation>. (Though fancy lookup maybe better belongs in MemIndex/Dex rather than in the slab). For more query types, storing a copies in OSP and/or POS order is pretty cheap too.
  • memory efficiency: it costs 20*distinct(subject, predicate, object) bytes, vs 28*distinct(subject, predicate) + 8 * distinct(subject, predicate, object). Unless the average number of objects for a (subject, predicate) pair (that has at least one object) is >2.3, the former is smaller. For the current use case, this average is certainly <1.1.
  • simplicity: no arenas, easy mental model.

I see discussion of (something like) this option stalled around the index constructors, but I don't see a fundamental block there. The concept that the index constructor should be templated over would be Iterable<Relation>. LMK if I'm missing something.

nridge added a comment.EditedMar 28 2019, 6:11 PM

@sammccall, thank you for having a look at this.

I have no objection to revising the data model if there's agreement on a better one.

  • I don't think we yet know what the more resource-critical (denser) relations and queries are, so it's unclear what to optimize for

Based on a brief mental survey of C++ IDE features I'm familiar with, I can think of the following additional uses of the relations capability:

  • A call hierarchy feature (which is also proposed for LSP, with client and server implementation efforts) would need every caller-callee relationship to be recorded in the index (RelationCalledBy).
  • Given a virtual method declaration, a user may want to see a list of implementations (overriders) and navigate to one or more of them. This would need every overrider relationship to be recorded in the index (RelationOverrideOf).

Intuitively, it seems like RelationOverrideOf would be slightly denser than RelationChildOf (though not by much), while RelationCalledBy would be significantly denser. In terms of queries, I believe the key for lookups for both of the above would be a (subject, predicate) pair, just like for subtypes.

Does that change your analysis at all?

nridge requested review of this revision.Apr 1 2019, 9:43 PM
nridge added a reviewer: sammccall.

I guess I should clear the "Accepted" status until we settle the question of the data model.

@sammccall, thank you for having a look at this.

I have no objection to revising the data model if there's agreement on a better one.

  • I don't think we yet know what the more resource-critical (denser) relations and queries are, so it's unclear what to optimize for

Based on a brief mental survey of C++ IDE features I'm familiar with, I can think of the following additional uses of the relations capability:

  • A call hierarchy feature (which is also proposed for LSP, with client and server implementation efforts) would need every caller-callee relationship to be recorded in the index (RelationCalledBy).
  • Given a virtual method declaration, a user may want to see a list of implementations (overriders) and navigate to one or more of them. This would need every overrider relationship to be recorded in the index (RelationOverrideOf).

Intuitively, it seems like RelationOverrideOf would be slightly denser than RelationChildOf (though not by much), while RelationCalledBy would be significantly denser. In terms of queries, I believe the key for lookups for both of the above would be a (subject, predicate) pair, just like for subtypes.

Does that change your analysis at all?

Sorry for the slow response here. Override and callgraph are great examples!
As you say, override is probably pretty sparse and it's probably not worth worrying about the storage too much.

If we stored a callgraph we'd definitely need to worry about the representation though. The space-saving hierarchy in this case would be map<relationtype, map<callee, caller>>I guess. Maybe storing one vector<pair<Subject, Predicate>> for each relationship type would work here - querying for a bunch of relationship types is rare.
One thing that strikes me here is that this case is very similar to our existing Ref data - it's basically a subset, but with a symbolid payload instead of location. We could consider just adding the SymbolID to Refs - it'd blow up the size of that by 50%, but we may not do much better with some other representation, and it would avoid adding any new complexity.
Ultimately it may also be that supporting both find references and callgraph (which provide closely overlapping functionality) isn't a good use of index memory.

nridge added a comment.EditedApr 5 2019, 8:25 PM

One thing that strikes me here is that this case is very similar to our existing Ref data - it's basically a subset, but with a symbolid payload instead of location. We could consider just adding the SymbolID to Refs - it'd blow up the size of that by 50%, but we may not do much better with some other representation, and it would avoid adding any new complexity.

Note that this was considered and rejected earlier (see here and here), though that discussion did not consider denser relations like callgraph.

Given the additional information and perspectives from the conversation here, I have two suggestions for potential ways forward:

Approach 1: Add SymbolID to Refs

  • This would be initially wasteful when only subtypes use it, but if we then build callgraph on top of it as well it will become significantly less wasteful.
  • This has the benefit that we don't have duplicate information for find-references and callgraph (they both use Refs).
  • This approach probably adds the least amount of complexity overall.

Approach 2: Add a RelationSlab storing (subject, predicate, object) triples, intended for sparse relations

  • This would allow us to implement features that require sparse relations, such as subtypes and overrides, without any significant increase to index size.
  • If we later want to add dense relations like callgraph, we'd use a different mechanism for them (possibly by adding SymbolID to Refs as in approach 1).
  • If we do end up adding SymbolID to Refs for callgraph, and want to start using that for sparse relations too, we can rip out RelationSlab.
  • This also adds relatively little complexity, though slightly more than approach 1, and we are throwing away some code if we end up adding SymbolID to Refs eventually.

Any thoughts on which of the approaches to take, or something else altogether? It would be good to have some confidence that the chosen approach will pass code review before I implement it :)

Hi, any update here? I would appreciate some guidance so I can move forward with this.

Hi Nathan, sorry for the stall here, and for repeatedly going over the same issues.
The design space here is pretty complicated.

I think the conclusion of recent offline discussions is:

  • refs and relations can be the same thing-ish when seen through a certain lens.
  • Unifying them probably isn't space-prohibitive *if* we implement callgraph: it's something like +15% to overall memory usage, and storing callgraph in another way is likely to be similar.
  • however unifying the concepts seems likely to be awkward in practice. LSP features seem to want one or the other but not both - symbolID is generally better than location if it's precise enough, and useless if not. We're storing redundant information (the location for the method override ref is the same as the declaration of the related symbol). We risk tying ourselves in knots maintaining a model that doesn't map well onto our problems.
  • In terms of storage size, relation-major (map<SymbolRole, vector<pair<SymbolID, SymbolID>>> or so) seems like a quick win. But it's a small enough one that we should try to live without it first.

So if you can stomach it, I think

Approach 2: Add a RelationSlab storing (subject, predicate, object) triples, intended for sparse relations*

is certainly fine with us (having spoken with @kadircet @ilya-biryukov @sammccall @gribozavr - @hokein is on vacation but not nearly as stubborn as I am!)

nridge planned changes to this revision.Apr 25 2019, 9:57 PM

So if you can stomach it, I think

Approach 2: Add a RelationSlab storing (subject, predicate, object) triples, intended for sparse relations*

is certainly fine with us (having spoken with @kadircet @ilya-biryukov @sammccall @gribozavr - @hokein is on vacation but not nearly as stubborn as I am!)

Yep, I'm happy to move forward with that approach. Thanks for the guidance!

nridge updated this revision to Diff 200735.May 22 2019, 7:27 AM

Implemented discussed design approach ("Add a RelationSlab storing (subject, predicate, object) triples, intended for sparse relations")

Mostly LG from my side, thanks!

clang-tools-extra/clangd/index/Relation.cpp
19

I would suggest adding another version which returns all the relations a given Subject participates, but I suppose currently we don't have any use cases.

Feel free to add or skip, that can be added when necessary.

24

std::tie

clang-tools-extra/clangd/index/Relation.h
31

nit: could you use std::tie

35

again std::tie

52

I believe RelationSlab should still be immutable, even after the build operation below RelationSlab is mutable.
Let's introduce the builder class and move mutating operations and build into it. So that it would be more symmetric to other slabs we have.

Also are there any use-cases requiring RelationSlab to be mutable?

60

there seems to be some typos in the comment

72

maybe rename to insert to keep symmetry with other slabs?
also it suggests a semantics on the operation that we don't care about.

79

maybe even get rid of sort by inlining it into build?

82

again, we could get rid of this comment if slab was immutable

89

Do we really need this?

nridge updated this revision to Diff 201427.May 25 2019, 4:19 PM
nridge marked 19 inline comments as done.

Address review comments

clang-tools-extra/clangd/index/Relation.cpp
19

I'd rather add this when a use case arises.

clang-tools-extra/clangd/index/Relation.h
52

I'm not aware of any use-cases requiring it to be mutable. I'm making it immutable as suggested.

89

I think so. The builder needs a way to construct the slab. Aggregate initialization doesn't work because the class already has declared default and copy constructors.

kadircet accepted this revision.May 27 2019, 12:40 AM

LGTM, except the duplication issue. Thanks!

clang-tools-extra/clangd/index/Relation.h
76

maybe use a set so that we can be sure that there won't be any duplicates. sorry for missing that in the previous iteration.

clang-tools-extra/clangd/unittests/IndexTests.cpp
79

could you also add a test case that's inserting duplicate relations?

This revision is now accepted and ready to land.May 27 2019, 12:40 AM
sammccall accepted this revision.May 27 2019, 12:55 AM

Awesome, thank you!

clang-tools-extra/clangd/index/Relation.h
76

(you're sorting the array at the end anyway, so sort + unique + erase seems neater)

btw, I believe you have enough good quality patches to apply for commit access.

Are there any obstacles that is keeping you from applying?

nridge updated this revision to Diff 202369.May 30 2019, 10:31 PM

Address latest review comments

nridge marked 3 inline comments as done.May 30 2019, 10:33 PM
kadircet accepted this revision.May 31 2019, 12:56 AM

Still LG, thanks for the patch!

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJun 2 2019, 9:54 PM