This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang-tools-extra/clangd/index/dex/
-
clangd/
-
index/
-
dex/
1/3
Iterator.cpp

Differential D106528

[clangd] Improve performance of dex by 45-60%
ClosedPublic

Authored by kbobyrev on Jul 22 2021, 1:24 AM.

Download Raw Diff

Details

Reviewers

sammccall

Commits

rGa0987e350ccc: [clangd] Improve performance of dex by 45-60%

Summary

Take full advantage of AND's iterator children size estimation: use early reset
in sync() and prevent large overhead. The idea is that the children at the
beginning of the list are smaller and cheaper to advance. Very large children
negate the effect of this performance optimisation and hence should be
advanced only when absolutely necessary. By reducing the number of large
iterators' updates, we increase the performance by a large margin.

This change was tested on a comprehensive query dataset. The performance
boost increases with the average length of the query, on small queries it is
close to 45% but the longer they go the closer it gets to 60% and beyond.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

kbobyrev created this revision.Jul 22 2021, 1:24 AM

Herald added subscribers: usaxena95, arphaman. · View Herald TranscriptJul 22 2021, 1:24 AM

kbobyrev requested review of this revision.Jul 22 2021, 1:24 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 22 2021, 1:24 AM

Herald added subscribers: cfe-commits, MaskRay, ilya-biryukov. · View Herald Transcript

Harbormaster completed remote builds in B115508: Diff 360726.Jul 22 2021, 2:01 AM

kbobyrev edited the summary of this revision. (Show Details)Jul 22 2021, 3:22 AM

Nice find!

clang-tools-extra/clangd/index/dex/Iterator.cpp
91	I stared at this for a while and this change makes me think about the whole algo differently. Effectively, we're just iterating through the items C of front trying to find a valid sync point. With the slight optimization that when some other child skips over C to X, we catch up the front to X. This leads to two possible simplifications (strictly, unrelated to this patch and totally up to you): The initial setup and outer loop has unclear bounds - it's basically a loop over the items of Children.front() but that's not clear. Consider rather: auto &Front = Children.front(); // smallest while ((ReachedEnd \|= Front->reachedEnd())) { auto SyncID = Front.peek(); bool ValidSync = true; for (auto &Child : Children { ... if (Child->peek() > SyncID) { Front->advanceTo(Child->peek()); // XXX ValidSync = false; break; } ... } if (ValidSync) return; } This seems basically equivalent and I find the flow control easier to reason about. Is the "slight optimization" important? i.e. is `Front->advanceTo(Child->peek())` actually a significant win over `Front->advance()`? I think the flow control is even clearer in this version: for (auto &Front = Children.front() ;(ReachedEnd \|= Front->reachedEnd()); Front.advance()) { auto SyncID = Front.peek(); bool ValidSync = true; for (auto &Child : Children { ... if (Child->peek() > SyncID) { ValidSync = false; break; } ... } if (ValidSync) return; } Because now the outer loop is entirely just a loop over the items of Front. Intuitively the average number that advanceTo() advances by here is between 1 and 2 (we know it's at least 1, but the remaining items are mostly from the longer iterator so we probably don't skip over any from the shorter iterator). And advanceTo is more complicated, if advanceTo turns out to be 2x as expensive as advance in practice, this isn't a win. So this is maybe worth checking.
113	First, say why: Now we try again with a new sync point. We always syncing with the first child as this is the cheapest (smallest). We only call advanceTo on the large posting lists once we have a fairly likely sync point.

This revision is now accepted and ready to land.Jul 22 2021, 7:11 AM

kbobyrev added inline comments.Jul 22 2021, 2:06 PM

clang-tools-extra/clangd/index/dex/Iterator.cpp

Re 1: this is really interesting. Even

/// Restores class invariants: each child will point to the same element after
/// sync.
void sync() {
  auto SyncID = Children.front()->peek();
  while (!(ReachedEnd |= Children.front()->reachedEnd())) {
    bool ValidSync = true;
    for (auto &Child : Children) {
      Child->advanceTo(SyncID);
      ReachedEnd |= Child->reachedEnd();
      // If any child reaches end And iterator can not match any other items.
      // In this case, just terminate the process.
      if (ReachedEnd)
        return;
      // If any child goes beyond given ID (i.e. ID is not the common item),
      // all children should be advanced to the next common item.
      if (Child->peek() > SyncID) {
        SyncID = Child->peek();
        ValidSync = false;
      }
    }
    if (ValidSync)
      break;
  }
}

Is 5% slower than what we have right now, even though it's almost equivalent to what is written.

And what you are proposing

void sync() {
  auto &Front = *Children.front(); // smallest
  while (!(ReachedEnd |= Children.front()->reachedEnd())) {
    bool ValidSync = true;
    auto SyncID = Front.peek();
    for (auto &Child : Children) {
      Child->advanceTo(SyncID);
      ReachedEnd |= Child->reachedEnd();
      // If any child reaches end And iterator can not match any other items.
      // In this case, just terminate the process.
      if (ReachedEnd)
        return;
      if (Child->peek() > SyncID) {
        Front.advanceTo(Child->peek()); // XXX
        ValidSync = false;
        break;
      }
    }
    if (ValidSync)
      break;
  }
}

Has exactly the same performance as we have right now (1/100th percent slower).

I gave it some thought and realized that the problem is that the code in suggestion does not cache values as the existing one does and incurs overhead. That made me realize I don't really cache Child->peek() on line 109 now even though I should. Doing that gives another 9% performance compared to the proposed version!

Using this knowledge, I tried to incorporate caching into your proposal and came up with something like this:

void sync() {
  if ((ReachedEnd |= Children.front()->reachedEnd()))
    return;
  auto SyncID = Children.front()->peek();
  while (!ReachedEnd) {
    bool ValidSync = true;
    for (auto &Child : Children) {
      Child->advanceTo(SyncID);
      if ((ReachedEnd |= Child->reachedEnd()))
        return;
      auto Candidate = Child->peek();
      if (Candidate > SyncID) {
        SyncID = Candidate;
        ValidSync = false;
        break;
      }
    }
    if (ValidSync)
      break;
  }
}

Weirdly enough, it still performs 7% slower compared to the current patch with ->peek() caching inside the loop. I'm not sure why this is the case since the code looks almost identical :shrug:

Add another 7% compared to the patched version: cache Child->peek() in the
loop.

Add more comments.

kbobyrev marked an inline comment as done.Jul 22 2021, 2:17 PM

kbobyrev marked an inline comment as not done.

Harbormaster completed remote builds in B115690: Diff 360970.Jul 22 2021, 4:28 PM

sammccall accepted this revision.Jul 23 2021, 5:27 AM

Closed by commit rGa0987e350ccc: [clangd] Improve performance of dex by 45-60% (authored by kbobyrev). · Explain WhyJul 23 2021, 6:29 AM

This revision was automatically updated to reflect the committed changes.

kbobyrev added a commit: rGa0987e350ccc: [clangd] Improve performance of dex by 45-60%.

kbobyrev mentioned this in D106645: [clangd] Tweak constructor of dex AndIterator for performance..Jul 23 2021, 6:55 AM

Revision Contents

Path

Size

clang-tools-extra/

clangd/

index/

dex/

Iterator.cpp

8 lines

Diff 360968

clang-tools-extra/clangd/index/dex/Iterator.cpp

Show First 20 Lines • Show All 82 Lines • ▼ Show 20 Lines	for (const auto &Child : Children) {
Separator = " ";		Separator = " ";
}		}
OS << ')';		OS << ')';
return OS;		return OS;
}		}

/// Restores class invariants: each child will point to the same element after		/// Restores class invariants: each child will point to the same element after
/// sync.		/// sync.
void sync() {		void sync() {
		sammccallUnsubmitted Not Done Reply Inline Actions I stared at this for a while and this change makes me think about the whole algo differently. Effectively, we're just iterating through the items C of front trying to find a valid sync point. With the slight optimization that when some other child skips over C to X, we catch up the front to X. This leads to two possible simplifications (strictly, unrelated to this patch and totally up to you): The initial setup and outer loop has unclear bounds - it's basically a loop over the items of Children.front() but that's not clear. Consider rather: auto &Front = Children.front(); // smallest while ((ReachedEnd \|= Front->reachedEnd())) { auto SyncID = Front.peek(); bool ValidSync = true; for (auto &Child : Children { ... if (Child->peek() > SyncID) { Front->advanceTo(Child->peek()); // XXX ValidSync = false; break; } ... } if (ValidSync) return; } This seems basically equivalent and I find the flow control easier to reason about. Is the "slight optimization" important? i.e. is `Front->advanceTo(Child->peek())` actually a significant win over `Front->advance()`? I think the flow control is even clearer in this version: for (auto &Front = Children.front() ;(ReachedEnd \|= Front->reachedEnd()); Front.advance()) { auto SyncID = Front.peek(); bool ValidSync = true; for (auto &Child : Children { ... if (Child->peek() > SyncID) { ValidSync = false; break; } ... } if (ValidSync) return; } Because now the outer loop is entirely just a loop over the items of Front. Intuitively the average number that advanceTo() advances by here is between 1 and 2 (we know it's at least 1, but the remaining items are mostly from the longer iterator so we probably don't skip over any from the shorter iterator). And advanceTo is more complicated, if advanceTo turns out to be 2x as expensive as advance in practice, this isn't a win. So this is maybe worth checking. sammccall: I stared at this for a while and this change makes me think about the whole algo differently.
		kbobyrevAuthorUnsubmitted Not Done Reply Inline Actions Re 1: this is really interesting. Even /// Restores class invariants: each child will point to the same element after /// sync. void sync() { auto SyncID = Children.front()->peek(); while (!(ReachedEnd \|= Children.front()->reachedEnd())) { bool ValidSync = true; for (auto &Child : Children) { Child->advanceTo(SyncID); ReachedEnd \|= Child->reachedEnd(); // If any child reaches end And iterator can not match any other items. // In this case, just terminate the process. if (ReachedEnd) return; // If any child goes beyond given ID (i.e. ID is not the common item), // all children should be advanced to the next common item. if (Child->peek() > SyncID) { SyncID = Child->peek(); ValidSync = false; } } if (ValidSync) break; } } Is 5% slower than what we have right now, even though it's almost equivalent to what is written. And what you are proposing void sync() { auto &Front = Children.front(); // smallest while (!(ReachedEnd \|= Children.front()->reachedEnd())) { bool ValidSync = true; auto SyncID = Front.peek(); for (auto &Child : Children) { Child->advanceTo(SyncID); ReachedEnd \|= Child->reachedEnd(); // If any child reaches end And iterator can not match any other items. // In this case, just terminate the process. if (ReachedEnd) return; if (Child->peek() > SyncID) { Front.advanceTo(Child->peek()); // XXX ValidSync = false; break; } } if (ValidSync) break; } } Has exactly the same performance as we have right now (1/100th percent slower). I gave it some thought and realized that the problem is that the code in suggestion does not cache values as the existing one does and incurs overhead. That made me realize I don't really cache `Child->peek()` on line 109 now even though I should. Doing that gives another 9% performance compared to the proposed version! Using this knowledge, I tried to incorporate caching into your proposal and came up with something like this: void sync() { if ((ReachedEnd \|= Children.front()->reachedEnd())) return; auto SyncID = Children.front()->peek(); while (!ReachedEnd) { bool ValidSync = true; for (auto &Child : Children) { Child->advanceTo(SyncID); if ((ReachedEnd \|= Child->reachedEnd())) return; auto Candidate = Child->peek(); if (Candidate > SyncID) { SyncID = Candidate; ValidSync = false; break; } } if (ValidSync) break; } } Weirdly enough, it still performs 7% slower compared to the current patch with `->peek()` caching inside the loop. I'm not sure why this is the case since the code looks almost identical :shrug: kbobyrev:* Re 1: this is really interesting. Even ``` /// Restores class invariants: each child will…
ReachedEnd \|= Children.front()->reachedEnd();		ReachedEnd \|= Children.front()->reachedEnd();
if (ReachedEnd)		if (ReachedEnd)
return;		return;
auto SyncID = Children.front()->peek();		auto SyncID = Children.front()->peek();
// Indicates whether any child needs to be advanced to new SyncID.		// Indicates whether any child needs to be advanced to new SyncID.
bool NeedsAdvance = false;		bool NeedsAdvance = false;
do {		do {
NeedsAdvance = false;		NeedsAdvance = false;
for (auto &Child : Children) {		for (auto &Child : Children) {
Child->advanceTo(SyncID);		Child->advanceTo(SyncID);
ReachedEnd \|= Child->reachedEnd();		ReachedEnd \|= Child->reachedEnd();
// If any child reaches end And iterator can not match any other items.		// If any child reaches end And iterator can not match any other items.
// In this case, just terminate the process.		// In this case, just terminate the process.
if (ReachedEnd)		if (ReachedEnd)
return;		return;
// If any child goes beyond given ID (i.e. ID is not the common item),		// If any child goes beyond given ID (i.e. ID is not the common item),
// all children should be advanced to the next common item.		// all children should be advanced to the next common item.
if (Child->peek() > SyncID) {		auto Candidate = Child->peek();
SyncID = Child->peek();		if (Candidate > SyncID) {
		SyncID = Candidate;
NeedsAdvance = true;		NeedsAdvance = true;
		// Reset and make sure advanceTo happens much less frequently on
		sammccallUnsubmitted Done Reply Inline Actions First, say why: Now we try again with a new sync point. We always syncing with the first child as this is the cheapest (smallest). We only call advanceTo on the large posting lists once we have a fairly likely sync point. sammccall: First, say why: Now we try again with a new sync point. We always syncing with the first…
		// large posting lists. This accounts for 45-60% performance boost.
		break;
}		}
}		}
} while (NeedsAdvance);		} while (NeedsAdvance);
}		}

/// AndIterator owns its children and ensures that all of them point to the		/// AndIterator owns its children and ensures that all of them point to the
/// same element. As soon as one child gets exhausted, AndIterator can no		/// same element. As soon as one child gets exhausted, AndIterator can no
/// longer advance and has reached its end.		/// longer advance and has reached its end.
▲ Show 20 Lines • Show All 323 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Improve performance of dex by 45-60%ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 360968

clang-tools-extra/clangd/index/dex/Iterator.cpp

[clangd] Improve performance of dex by 45-60%
ClosedPublic