This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Proof-of-concept query iterators for Dex symbol index
ClosedPublic

Authored by kbobyrev on Jul 19 2018, 7:36 AM.

Details

Summary

This patch introduces three essential types of query iterators: DocumentIterator, AndIterator, OrIterator. It provides a convenient API for query tree generation and serves as a building block for the next generation symbol index - Dex. Currently, many optimizations are missed to improve code readability and to serve as the reference implementation. Potential improvements are briefly mentioned in FIXMEs and will be addressed in the following patches.

Dex RFC in the mailing list: http://lists.llvm.org/pipermail/clangd-dev/2018-July/000022.html

Iterators, their applications and potential extensions are explained in detail in the design proposal:
https://docs.google.com/document/d/1C-A6PGT6TynyaX4PXyExNMiGmJ2jL1UwV91Kyx11gOI/edit#heading=h.903u1zon9nkj

Diff Detail

Event Timeline

kbobyrev created this revision.Jul 19 2018, 7:36 AM

Thanks for sending this early!
Rough interface comments - mostly looks good though!

clang-tools-extra/clangd/index/dex/QueryIterator.h
26 ↗(On Diff #156270)

we should at least use a type alias for a DocID (maybe an opaque one - not sure for now)

39 ↗(On Diff #156270)

or have peek return an InvalidID == size_t{-1}?

43 ↗(On Diff #156270)

name Rank is confusing. Type should be DocID (param doesn't need a name in the interface)

44 ↗(On Diff #156270)

could reasonably call this operator*

53 ↗(On Diff #156270)

is shared_ptr really needed here? Seems like a raw pointer would be fine. I can see that when rebuilding the index we'll have some fun lifetime issues that are sometimes solved with shared_ptr, but I think that would be at the full-index level, not the individual-posting-list level.

(if you want one less indirection, you could have using PostingListRef = ArrayRef<size_t>, but it shouldn't matter)

59 ↗(On Diff #156270)

what's this for?
I don't think it's worth exposing for testing, should be able to get at the needed cases through the public interface

63 ↗(On Diff #156270)

how will you use this? it's not in the interface, and if you know the implementation then creating a new iterator is cheap

76 ↗(On Diff #156270)

again, unless you have a strong reason, don't use shared_ptr, as it makes it hard to reason about ownership and lifetimes.

These are a tree, unique_ptr should be fine. That way a top-level owned iterator is self-contained.

83 ↗(On Diff #156270)

what's this for?

kbobyrev updated this revision to Diff 156483.Jul 20 2018, 7:18 AM
kbobyrev marked 9 inline comments as done.
  • Switched from std::shared_ptr to std::unique_ptr for iterator's children: iterators own their subtrees, the lifetime should depend on the root
  • Store PostingListRefin DocumentIterator: the lifetime of underlying PostingList should be longer than the DocumentIterator anyway, DocumentIterator can not retrieve results from an incomplete inverted index
  • Refined API of different iterators: wipe getIndex(), reset() from DocumentIterator, getChildren from AndIterator and OrIterator
  • Not exposing DocumentIterator, AndIterator, OrIterator to the User API anymore: helper functions like std::unique_ptr<QueryIterator> constructDocumentIterator which return a unique pointer and perform type erasure are provided instead
  • Improved variable naming, applied refactorings where necessary
kbobyrev planned changes to this revision.Jul 20 2018, 7:26 AM

Upcoming changes:

  • Improve debugging experience by providing llvm::raw_ostream &operator<<(llvm::raw_ostream &OS, std::unique_ptr<QueryIterator> Iterator to recursively pretty print queries in human-readable format: e.g. (& [0, 1, 2, 3] ([3, 4, 6, 8] || [0, 1])). This can be later used for debugging and making sure optimizations were correct.
  • Properly document the interface and implementation. The current documentation is just a draft, it should be refined.
  • Think about introducing iterator cost and applying cheap optimizations like pre-sorting children of AndIterator before performing advance(): this way the iterator doesn't spend too much time iterating through long posting lists when it has are short/empty children
clang-tools-extra/clangd/index/dex/QueryIterator.h
39 ↗(On Diff #156270)

Tried to do that, but that creates additional complexity for the AndIterator which stores ReachedEnd anyway. Otherwise I would either have to spend O(Children.size()) to understand that AndIterator reached the end or have inconsistent APIs across the classes, I think it's better to leave it as it is.

44 ↗(On Diff #156270)

As discussed offline, *(*Child) would be probably not very clean.

kbobyrev marked 2 inline comments as done.Jul 20 2018, 7:33 AM
kbobyrev updated this revision to Diff 156725.Jul 23 2018, 1:33 AM
  • Implemented convenient dumping via llvm::raw_ostream &operator<<(llvm::raw_ostream &OS, QueryIterator &Iterator), which dumps iterator tree in human-readable format, e.g. (&& [1, 2, 3] (|| [3, 4, 5] []))
  • Implemented rather straightforward iterator cost for more efficient AndIterator iterations

The functional part is probably done at this point, the only thing to do at this point is to properly document written code leaving only few FIXMEs for the future.

A few more comments about the bits I understand, but waiting mostly on the documentation.

clang-tools-extra/clangd/index/dex/QueryIterator.cpp
257 ↗(On Diff #156725)

you can just inline this.

If you don't, you have to provide a declaration in the header, outside the class definition. (In addition to the friend declaration). GCC errors on this, clang ignores it (GCC is right).

clang-tools-extra/clangd/index/dex/QueryIterator.h
28 ↗(On Diff #156725)

uint32_t should be plenty. These will be a large fraction of the index's memory usage. (Especially if we push actual payloads onto disk later)

41 ↗(On Diff #156725)

this is done I think.

43 ↗(On Diff #156725)

I'm not sure Query actually adds much to this class name. Iterator, AndIterator, etc would read more smoothly.

52 ↗(On Diff #156725)

As discussed offline, please leave this out of the initial patch.
It requires quite some explanation and understanding, and seems likely to delay the review.

68 ↗(On Diff #156725)

nit: newAndIterator or just andIterator? brevity is the soul of wit and all that

Or even better, what about:

Iterator::createAnd(...)
Iterator::createOr(...)
Iterator::create(PostingList*)
68 ↗(On Diff #156725)

nit: pass by value rather than by rvalue-reference unless there's a strong reason.

Here, vector<u_p> isn't copyable, and there are no overloads, and this isn't performance-critical, so it should be fine.

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
19 ↗(On Diff #156725)

please spell these out (local style)

27 ↗(On Diff #156725)

using namespace llvm

67 ↗(On Diff #156725)

nit: just auto And = constructAndIterator({constructDocumentIterator(EmptyList)})

(and below in various places).
This actually seems nicer - the expression hierarchy reflects the iterator hierarchy.

75 ↗(On Diff #156725)

just EXPECT_EQ(to_string(*And), "(&& [])")

You may need to #include "llvm/Support/ScopedPrinters.h"

98 ↗(On Diff #156725)

add a helper function vector<DocID> consume(Iterator&)
(Maybe even to the main file rather than the tests).

Then this section (and some others) can just be EXPECT_THAT(consume(*And), ElementsAre(0u,7u,10u,...))

140 ↗(On Diff #156725)

Hmm, I'm not sure this approach scales - there shouldn't be anything special about composing and/and, compared to and/or, or/not, boost/and, etc.

I'd suggest testing each iterator directly, and then having one test that uses a more complex tree, and dropping this case.

324 ↗(On Diff #156725)

This is too complicated. Try to use each iterator once or twice.
(We'll need to extend this when we add new iterator types)

sammccall added inline comments.Jul 23 2018, 3:32 AM
clang-tools-extra/clangd/index/dex/QueryIterator.cpp
155 ↗(On Diff #156725)

nit: & should be enough?

175 ↗(On Diff #156725)

can this just be a static function?

196 ↗(On Diff #156725)

can we start with the simple/obvious implementation, and add heap later if it actually improves performance?

clang-tools-extra/clangd/index/dex/QueryIterator.h
47 ↗(On Diff #156725)

nit: this seems non-orthogonal, could we either:

  • have advance() and advanceTo() return void
  • or remove reachedEnd() and have owners that need to keep the state (and/or iterators) maintain it separately?
54 ↗(On Diff #156725)

private?

kbobyrev removed a subscriber: ilya-biryukov.
kbobyrev updated this revision to Diff 156997.Jul 24 2018, 3:59 AM
kbobyrev marked 19 inline comments as done.

As discussed offline: I simplified the implementation and cleaned up unit tests. Should look better now.

The next step is to document the implementation.

kbobyrev added inline comments.Jul 24 2018, 3:59 AM
clang-tools-extra/clangd/index/dex/QueryIterator.cpp
175 ↗(On Diff #156725)

Wiped this one. Also, in the future I guess it's better to implement operator<() for iterators, because that's what this essentially is.

196 ↗(On Diff #156725)

Done. It doesn't look simpler to me, but I tried to keep it as simple as possible. I also put a FIXME on top of OrIterator with the trade-offs description.

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
324 ↗(On Diff #156725)

Done. Dropped this one and two other complicated cases.

kbobyrev marked an inline comment as done.Jul 24 2018, 3:59 AM
kbobyrev updated this revision to Diff 157004.Jul 24 2018, 4:33 AM

Refactored unit tests in few places.

This looks really nice.
Iterator implementations can be simplified a bit I think.

clang-tools-extra/clangd/index/dex/Iterator.cpp
50 ↗(On Diff #156997)

(uber-nit: slightly clearer to add the separator first, then you only have to compare against zero.
Or this idiom:

const char* Sep = "";
for (DocID D : Documents) {
  OS << Sep << D;
  Sep = ", ";
}
59 ↗(On Diff #156997)

nit: the code might be slightly more concise (particularly advanceTo) if the current element is represented by a PostingListRef::Iterator rather than an index, up to you.

59 ↗(On Diff #156997)

(DocID is the wrong type for an index)

65 ↗(On Diff #156997)

nit: just take children by value without &&

69 ↗(On Diff #156997)

nit: const auto &

74 ↗(On Diff #156997)

just return so you don't need the if at the end?

88 ↗(On Diff #156997)

It would be nice if the functions in this class could share some logic.

what about:

for (auto& C : children)
  C->advance();
sync();

where sync is

// restores class invariants: ReachedEnd set, or all children in sync
void sync() {
DocID Max = 0;
for (auto &C : Children) {
   if (C->reachedEnd()) {
    ReachedEnd = true;
    return;
  }
  Max = std::max(C->peek(), Max);
}
for (auto &C : Children) {
  C->advanceTo(Max);
   if (C->reachedEnd()) {
    ReachedEnd = true;
    return;
  }
}
}

the trick is then the constructor is just sync();, advanceTo is just "advance all children and sync", etc.

155 ↗(On Diff #156997)

I think this is FIXME: would a min-heap be faster? :-)

181 ↗(On Diff #156997)

I can't follow this one - why is the outer loop needed?

(It looks like you're using DocID as integer again, but I can't tell why the variable is needed at all)

184 ↗(On Diff #156997)

nit: you can drop the extra braces here and in advanceTo (at least, you use that style above)

204 ↗(On Diff #156997)

why not just initialize Result to std::numeric_limits<DocID>::max(), and skip the valid checking?

You know by assertion that you'll find something better.

235 ↗(On Diff #156997)

for (; !It.reachedEnd(); It.advance())

clang-tools-extra/clangd/index/dex/Iterator.h
50 ↗(On Diff #156997)

this doesn't really make sense as static as it operates on an Iterator.
Either a free function or non-static member seems fine

65 ↗(On Diff #156997)

nit nit: move above the template overload so you can document the readable one.

kbobyrev updated this revision to Diff 157203.Jul 25 2018, 2:02 AM
kbobyrev marked 14 inline comments as done.

Address the last round of comments.

Incoming: documentation overhaul.

clang-tools-extra/clangd/index/dex/Iterator.cpp
181 ↗(On Diff #156997)

An artifact from min-heap implementation.

kbobyrev marked an inline comment as done.Jul 25 2018, 2:02 AM
kbobyrev updated this revision to Diff 157206.Jul 25 2018, 2:44 AM

Slightly refactored the code, improved documentation. This patch is ready for review.

kbobyrev updated this revision to Diff 157208.Jul 25 2018, 2:56 AM
kbobyrev retitled this revision from [clangd] Implement query iterators for Dex symbol index to [clangd] Proof-of-concept query iterators for Dex symbol index.
kbobyrev edited the summary of this revision. (Show Details)

Choose better wording in a couple of places.

Implementation is in a good shape. I only have nits ;)

clang-tools-extra/clangd/index/dex/Iterator.cpp
18 ↗(On Diff #157206)

Please put all local classes and helpers in an anonymous namespace.

20 ↗(On Diff #157206)

nit: We could elaborate a bit. This is the most basic or "leaf" iterator.

30 ↗(On Diff #157206)

nit: The complexity is obvious, so I'd probably drop it as this is not public.

39 ↗(On Diff #157206)

Again, this should also be obvious given the short implementation.

68 ↗(On Diff #157206)

I'm not exactly sure what this means. Do you mean "AndIterator iterates through common items among all children"?

75 ↗(On Diff #157206)

nit: ReachedEnd is reset in sync(). Maybe just make it default value for the field?

112 ↗(On Diff #157206)

nit: documentation for ReachedEnd is already covered below.

114 ↗(On Diff #157206)

nit: ReachedEnd &= ...; would keep me from worrying about true ReachedEnd getting overridden.

117 ↗(On Diff #157206)

nit: SyncID or Sync might be clearer.

118 ↗(On Diff #157206)

Maybe add a comment NeedsAdvance indicates whether advanceTo(ID) should be rerun from the beginning.

145 ↗(On Diff #157206)

nit: why "local state"? (All instance fields are local states)

166 ↗(On Diff #157206)

use std::any_of(...)?

176 ↗(On Diff #157206)

The comment says smallest DocID, but here we call it highest :(

178 ↗(On Diff #157206)

nit: no braces for one-liners.

218 ↗(On Diff #157206)

I think this FIXME should live in peek() as it's an optimization for peek().

clang-tools-extra/clangd/index/dex/Iterator.h
10 ↗(On Diff #157206)

I think we should give more context here. For example, it's unclear what Query Tree Nodes are. Posting list is not a widely known concept. What is the idea behind iterators? What problem do they solve? Roughly what iterators are provided? An example like ascii graph you have in the test can be useful here as well.

21 ↗(On Diff #157206)

nit: remove spaces between #includes so that clang-format can sort all #includes properly. same for other files.

34 ↗(On Diff #157206)

nit:

PostingLists are keys ...

Do you mean values? Isn't inverted index a mapping from search tokens to positing lists?

44 ↗(On Diff #157206)

Could you also add documentation for Iterator class?

47 ↗(On Diff #157206)

nit: calls.

52 ↗(On Diff #157206)

What does invalid mean? Would there be any side effect? Would it trigger a crash?

59 ↗(On Diff #157206)

micro nit: under cursor might be ambiguous.

Would peek() be valid when reachedEnd() is true?

85 ↗(On Diff #157206)

nit: a document iterator?

88 ↗(On Diff #157206)

nit: Returns an AND iterator which ... (same below)

AndIterator is the name for internal implementation and not exposed.

98 ↗(On Diff #157206)

Documentation for the templated interfaces? e.g. why are they needed?

clang-tools-extra/unittests/clangd/DexIndexTests.cpp
47 ↗(On Diff #157206)

This tests multiple things in one test case, and it's unclear what they are testing. Also, the long function body makes the list definitions hard to read from where they are used.

I think this can be split into:

  • a test case for iterator dumping.
  • a test case for AND of empty list and non-empty list.
  • a test case for AND of two lists.
  • a test case for AND of three lists.
48 ↗(On Diff #157206)

nit: these can be L0, L1, L2, L3. Save you some typing ;)

66 ↗(On Diff #157206)

Could you split checks for iterator dumping into a separate test? It's relatively independent. And you wouldn't need to update multiple places in case you update elements in posting lists.

165 ↗(On Diff #157206)

s/0/1/ for AND lists?

186 ↗(On Diff #157206)

These two operations are not meaningful if we don't check anything after them.

kbobyrev updated this revision to Diff 157438.Jul 26 2018, 2:24 AM
kbobyrev marked 30 inline comments as done.

Addressed a round of comments: cleaned up the code, improved documentation and properly introduced such terms like Posting List and Query Tree. Tests are now more modular and each specific piece tests a certain part of implementation.

$ ninja check-clang-tools is green.

kbobyrev updated this revision to Diff 157439.Jul 26 2018, 2:26 AM

Typo: "Returns false if ..., false otherwise" ->"Returns false if ..., true otherwise".

ioeric accepted this revision.Jul 26 2018, 3:23 AM

lg! just a few nits.

clang-tools-extra/clangd/index/dex/Iterator.cpp
109 ↗(On Diff #157439)

This is the post-condition, not a precondition right?

To be clearer, ... each child will point to the same element after sync

162 ↗(On Diff #157439)

Or Returns true if all children are exhausted. No need for otherwise if it's trivial.

clang-tools-extra/clangd/index/dex/Iterator.h
12 ↗(On Diff #157439)

nit: Just to match the actual implementation, use "clang::clangd::" to avoid confusion.

15 ↗(On Diff #157439)

"number of references" is too specific. Maybe "by symbol quality"?

25 ↗(On Diff #157439)

Again, replace number of references with quality for generalization.

26 ↗(On Diff #157439)

with the requested *properties*?

61 ↗(On Diff #157439)

nit: drop "some of which are not yet introduced" as it's not relevant to the interface itself.

72 ↗(On Diff #157439)

Just "return true if ...".

There is no need to mention advance() and advanceTo() here as they are already covered in their own documentation.

76 ↗(On Diff #157439)

Just reachedEnd() must be false. which usually implies assertion. Same elsewhere.

127 ↗(On Diff #157439)

An example of usage would be simpler and easier to understand here. Same below.

Example:
This allows createAnd({create(...), create(...)})

This revision is now accepted and ready to land.Jul 26 2018, 3:23 AM
kbobyrev updated this revision to Diff 157450.Jul 26 2018, 3:35 AM
kbobyrev marked 10 inline comments as done.

Address post-lg round of comments.

This revision was automatically updated to reflect the committed changes.