Page MenuHomePhabricator

[clangd] Introduce PostingList interface

Authored by kbobyrev on Sep 12 2018, 6:56 AM.



This patch abstracts PostingList interface and reuses existing implementation. It will be used later to test different PostingList representations.

No functionality change is introduced, this patch is mostly refactoring so that the following patches could focus on functionality while not being too hard to review.

Diff Detail

Event Timeline

kbobyrev created this revision.Sep 12 2018, 6:56 AM
kbobyrev planned changes to this revision.Sep 12 2018, 6:57 AM

There is not enough documentation right now, I should fix that before patch could be reviewed. This is a preview mode.

Also, Index size estimation is incorrect now and will be fixed in the next diff.

kbobyrev updated this revision to Diff 165224.Sep 13 2018, 2:18 AM

Add documentation, don't expose PostingList interface implementations in the public API (similarly to Iterator).

ioeric added inline comments.Sep 13 2018, 4:39 AM
125 ↗(On Diff #165224)

nit: InvertedIndex[TokenToPostingList.first] = buildSparsePostingList(...) ?

23 ↗(On Diff #165224)

I'd probably call this something other than SparsePostingList. It doesn't really do anything fancier than just storing the vector. What if sparse posting lists have their own optimizations one day? So how about something more straightforward? E.g. VectorPostingList or PlainPostingList?

40 ↗(On Diff #165224)

As this is just a helper class that is not exposed, I'd suggest making this a standalone class to make it easier to read.

kbobyrev updated this revision to Diff 165256.Sep 13 2018, 5:15 AM
kbobyrev marked 3 inline comments as done.

At a high level: making posting lists an abstract type adds another layer of indirection, which we can use to solve problems. It also has costs, mostly conceptual complexity.
What problems are we solving?

  • if we want to dynamically use a different representation for different data/scenarios: this would be a good solution, do we have any evidence that this is the case? I thought the tentative conclusion was vbyte was good enough for all cases. What's the deal with dense/sparse?
  • if it's too hard to experiment with different posting list types: how hard is it really? I'm not sure this justifies checking this change in.

I'd suggest as a first step making PostingList a concrete class - this would improve the API and enable experimentation. (Even experimentation with dynamically switching the implementation - it's easier behind a class boundary)

34 ↗(On Diff #165256)

this talks a lot about how a posting list can be implemented, and what it interacts with - I'd suggest removing/reducing that and talking about what it is, instead :-)

54 ↗(On Diff #165256)

why is the caller responsible for choosing the implementation?
It seems like buildPostingList could do this based on inspecting the docs

61 ↗(On Diff #165256)


kbobyrev updated this revision to Diff 165268.Sep 13 2018, 6:11 AM
kbobyrev marked 2 inline comments as done.
kbobyrev updated this revision to Diff 165297.Sep 13 2018, 7:57 AM

Don't create abstractions for now.

kbobyrev updated this revision to Diff 165298.Sep 13 2018, 7:59 AM

Remove artifacts from assertion messages.

kbobyrev marked an inline comment as done.Sep 13 2018, 8:04 AM
sammccall accepted this revision.Sep 13 2018, 9:12 AM


11 ↗(On Diff #165298)

And here

39 ↗(On Diff #165298)

Why this dep? Seems circular

12 ↗(On Diff #165298)

And can be traversed in order using an iterator

This revision is now accepted and ready to land.Sep 13 2018, 9:12 AM
kbobyrev updated this revision to Diff 165318.Sep 13 2018, 9:45 AM
kbobyrev updated this revision to Diff 165320.
kbobyrev marked an inline comment as done.

Wording: traversed *in order*.

kbobyrev marked an inline comment as done.Sep 13 2018, 9:46 AM
kbobyrev added inline comments.
39 ↗(On Diff #165298)

Iterator interface uses DocID, so I guess it should depend on PostingList.h, shouldn't it?

kbobyrev added inline comments.Sep 13 2018, 9:53 AM
39 ↗(On Diff #165298)

FWIW I tried to get rid of this dependency in the first place, but it seemed pretty hard, because the options I saw were either forward declaring (which isn't possible since it's practically typedef) it or introducing another using DocID = uint32_t in Iterator.h, which I decided not to do because of the code duplication. Another option would be to make both PostingList.h and Iterator.h depend on some common file, but semantically DocID seems to belong to the Iterator.h, so I didn't think that would be a good solution, too.

Is there something I could do to resolve that?

kbobyrev updated this revision to Diff 165324.Sep 13 2018, 10:09 AM
kbobyrev edited the summary of this revision. (Show Details)
kbobyrev marked 3 inline comments as done.

Pull DocID to Iterator.h.

This revision was automatically updated to reflect the committed changes.