This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Improve performance of dex by 45-60%
ClosedPublic

Authored by kbobyrev on Jul 22 2021, 1:24 AM.

Details

Summary

Take full advantage of AND's iterator children size estimation: use early reset
in sync() and prevent large overhead. The idea is that the children at the
beginning of the list are smaller and cheaper to advance. Very large children
negate the effect of this performance optimisation and hence should be
advanced only when absolutely necessary. By reducing the number of large
iterators' updates, we increase the performance by a large margin.

This change was tested on a comprehensive query dataset. The performance
boost increases with the average length of the query, on small queries it is
close to 45% but the longer they go the closer it gets to 60% and beyond.

Diff Detail

Event Timeline

kbobyrev created this revision.Jul 22 2021, 1:24 AM
kbobyrev requested review of this revision.Jul 22 2021, 1:24 AM
kbobyrev edited the summary of this revision. (Show Details)Jul 22 2021, 3:22 AM
sammccall accepted this revision.Jul 22 2021, 7:11 AM

Nice find!

clang-tools-extra/clangd/index/dex/Iterator.cpp
91

I stared at this for a while and this change makes me think about the whole algo differently.

Effectively, we're just iterating through the items C of front trying to find a valid sync point.
With the slight optimization that when some other child skips over C to X, we catch up the front to X.

This leads to two possible simplifications (strictly, unrelated to this patch and totally up to you):

  1. The initial setup and outer loop has unclear bounds - it's basically a loop over the items of Children.front() but that's not clear.

Consider rather:

auto &Front = *Children.front(); // smallest
while ((ReachedEnd |= Front->reachedEnd())) {
  auto SyncID = Front.peek();
  bool ValidSync = true;
  for (auto &Child : Children {
    ...
    if (Child->peek() > SyncID) {
      Front->advanceTo(Child->peek()); // XXX
      ValidSync = false;
      break;
    }
    ...
  }
  if (ValidSync)
    return;
}

This seems basically equivalent and I find the flow control easier to reason about.

  1. Is the "slight optimization" important? i.e. is Front->advanceTo(Child->peek()) actually a significant win over Front->advance()? I think the flow control is even clearer in this version:
for (auto &Front = *Children.front() ;(ReachedEnd |= Front->reachedEnd()); Front.advance()) {
  auto SyncID = Front.peek();
  bool ValidSync = true;
  for (auto &Child : Children {
    ...
    if (Child->peek() > SyncID) {
      ValidSync = false;
      break;
    }
    ...
  }
  if (ValidSync)
    return;
}

Because now the outer loop is *entirely* just a loop over the items of Front.

Intuitively the average number that advanceTo() advances by here is between 1 and 2 (we know it's at least 1, but the remaining items are mostly from the longer iterator so we probably don't skip over any from the shorter iterator).
And advanceTo is more complicated, if advanceTo turns out to be 2x as expensive as advance in practice, this isn't a win. So this is maybe worth checking.

115

First, say why:

Now we try again with a new sync point.
We always syncing with the first child as this is the cheapest (smallest). We only call advanceTo on the large posting lists once we have a fairly likely sync point.

This revision is now accepted and ready to land.Jul 22 2021, 7:11 AM
kbobyrev added inline comments.Jul 22 2021, 2:06 PM
clang-tools-extra/clangd/index/dex/Iterator.cpp
91

Re 1: this is really interesting. Even

/// Restores class invariants: each child will point to the same element after
/// sync.
void sync() {
  auto SyncID = Children.front()->peek();
  while (!(ReachedEnd |= Children.front()->reachedEnd())) {
    bool ValidSync = true;
    for (auto &Child : Children) {
      Child->advanceTo(SyncID);
      ReachedEnd |= Child->reachedEnd();
      // If any child reaches end And iterator can not match any other items.
      // In this case, just terminate the process.
      if (ReachedEnd)
        return;
      // If any child goes beyond given ID (i.e. ID is not the common item),
      // all children should be advanced to the next common item.
      if (Child->peek() > SyncID) {
        SyncID = Child->peek();
        ValidSync = false;
      }
    }
    if (ValidSync)
      break;
  }
}

Is 5% slower than what we have right now, even though it's almost equivalent to what is written.

And what you are proposing

void sync() {
  auto &Front = *Children.front(); // smallest
  while (!(ReachedEnd |= Children.front()->reachedEnd())) {
    bool ValidSync = true;
    auto SyncID = Front.peek();
    for (auto &Child : Children) {
      Child->advanceTo(SyncID);
      ReachedEnd |= Child->reachedEnd();
      // If any child reaches end And iterator can not match any other items.
      // In this case, just terminate the process.
      if (ReachedEnd)
        return;
      if (Child->peek() > SyncID) {
        Front.advanceTo(Child->peek()); // XXX
        ValidSync = false;
        break;
      }
    }
    if (ValidSync)
      break;
  }
}

Has exactly the same performance as we have right now (1/100th percent slower).

I gave it some thought and realized that the problem is that the code in suggestion does not cache values as the existing one does and incurs overhead. That made me realize I don't really cache Child->peek() on line 109 now even though I should. Doing that gives another 9% performance compared to the proposed version!

Using this knowledge, I tried to incorporate caching into your proposal and came up with something like this:

void sync() {
  if ((ReachedEnd |= Children.front()->reachedEnd()))
    return;
  auto SyncID = Children.front()->peek();
  while (!ReachedEnd) {
    bool ValidSync = true;
    for (auto &Child : Children) {
      Child->advanceTo(SyncID);
      if ((ReachedEnd |= Child->reachedEnd()))
        return;
      auto Candidate = Child->peek();
      if (Candidate > SyncID) {
        SyncID = Candidate;
        ValidSync = false;
        break;
      }
    }
    if (ValidSync)
      break;
  }
}

Weirdly enough, it still performs 7% slower compared to the current patch with ->peek() caching inside the loop. I'm not sure why this is the case since the code looks almost identical :shrug:

kbobyrev updated this revision to Diff 360968.Jul 22 2021, 2:12 PM

Add another 7% compared to the patched version: cache Child->peek() in the
loop.

kbobyrev updated this revision to Diff 360970.Jul 22 2021, 2:17 PM

Add more comments.

kbobyrev marked an inline comment as done.Jul 22 2021, 2:17 PM
kbobyrev marked an inline comment as not done.
sammccall accepted this revision.Jul 23 2021, 5:27 AM
This revision was automatically updated to reflect the committed changes.