This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Implement VByte PostingList compression
ClosedPublic

Authored by kbobyrev on Sep 20 2018, 6:15 AM.

Details

Summary

This patch implements Variable-length Byte compression of PostingLists to sacrifice some performance for lower memory consumption.

PostingList compression and decompression were extensively tested by running a fuzzer for multiple hours and by replaying a significant number of realistic FuzzyFindRequests. AddressSanitizer and UndefinedBehaviorSanitizer were used to ensure correct behaviour.

Performance evaluation was conducted with recent LLVM symbol index (292k symbols) and the collection of user-recorded queries (7751 FuzzyFindRequest JSON dumps):

Metrics                                      | Before | After | Change (%)
Memory consumption (posting lists only), MB  | 54.4   | 23.5  | -60%
Time to process queries, sec                 | 7.70   | 9.4   | +25%
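For context, variable-byte encoding stores each delta between consecutive sorted DocIDs in as few bytes as possible: 7 payload bits per byte, with the high bit marking that more bytes follow. The following sketch illustrates the general scheme only; the function name and signature are not the patch's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append the VByte encoding of Delta to Out: low 7 bits of payload per
// byte, high bit (0x80) set while more bytes follow.
void encodeVByte(uint32_t Delta, std::vector<uint8_t> &Out) {
  do {
    uint8_t Byte = Delta & 0x7f;
    Delta >>= 7;
    Out.push_back(Delta ? (Byte | 0x80) : Byte);
  } while (Delta != 0);
}
```

Small deltas, which dominate real posting lists, take one byte instead of the four a raw DocID occupies, which is where the ~60% memory saving comes from.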

Diff Detail

Event Timeline

kbobyrev created this revision.Sep 20 2018, 6:15 AM
kbobyrev edited the summary of this revision. (Show Details)Sep 20 2018, 6:15 AM
kbobyrev updated this revision to Diff 166278.Sep 20 2018, 6:28 AM
  • Update unit tests with iterator tree string representation to comply with the new format
  • Don't mark constructor explicit (previously it only had one parameter)
  • Fix Limits explanation comment (ID > Limits[I] -> ID >= Limits[I])

Very nice!

I think the data structures can be slightly tighter, and some of the implementation could be easier to follow. But this seems like a nice win.

Right-sizing the vectors seems like an important optimization.

clang-tools-extra/clangd/index/dex/PostingList.cpp
29

nit: we generally use members (DecompressedChunk.begin()) unless actually dealing with arrays or templates, since lookup rules are simpler

39–46

nit: I think this might be clearer with the special/unlikely cases (hit end) inside the if:

if (++InnerIndex == DecompressedChunk.end()) { // end of chunk
  if (++ChunkIndex == Chunks.end()) // end of iterator
    return;
  DecompressedChunk = ChunkIndex->decompress();
  InnerIndex = DecompressedChunk.begin();
}

also I think the indirection via reachedEnd() mostly serves to confuse here, as the other lines deal with the data structures directly. It's not clear (without reading the implementation) what the behavior is when class invariants are violated.

58

this again puts the "normal case" (need to choose a chunk) inside the if(), instead of the exceptional case.

In order to write this more naturally, I think pulling out a private helper advanceToChunk(DocID) might be best here, you can early return from there.

61

ChunkIndex + 1? You've already eliminated the current chunk.

62

This seems unnecessarily two-step (found the chunk... or it could be the first element of the next).
Understandably, because std::*_bound has such a silly API.

You want to find the *last* chunk such that Head <= ID.
So find the first one with Head > ID, and subtract one.

std::lower_bound returns the first element for which its predicate is false.

Therefore:

ChunkIndex = std::lower_bound(ChunkIndex, Chunks.end(), ID,
     [](const Chunk &C, const DocID D) { return C.Head <= D; }) - 1;
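A self-contained sketch of this "last chunk with Head <= ID" search, using the advanceToChunk name suggested earlier and a minimal stand-in Chunk (illustrative only):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using DocID = uint32_t;

struct Chunk {
  DocID Head; // first DocID stored in this chunk
};

// Returns an iterator to the last chunk whose Head is <= ID.
// Precondition: Chunks is sorted by Head and Chunks.front().Head <= ID.
std::vector<Chunk>::const_iterator
advanceToChunk(const std::vector<Chunk> &Chunks, DocID ID) {
  // lower_bound returns the first element for which the predicate is
  // false, i.e. the first chunk with Head > ID; stepping back one gives
  // the last chunk with Head <= ID.
  return std::lower_bound(
             Chunks.begin(), Chunks.end(), ID,
             [](const Chunk &C, DocID D) { return C.Head <= D; }) -
         1;
}
```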
63

(again I'd avoid reachedEnd() here as you haven't reestablished invariants, so it's easier to just deal with the data structures)

76

(this can become an assert)

115

nit: please don't call these indexes if they're actually iterators: CurrentChunk seems fine

116

(again, SmallVector)

121

move to the function where they're used

127

What's the purpose of this? Why can't the caller just construct the Chunk themselves - what does the std::queue buy us?

140

I don't understand this comment. Aren't these bit offsets of payload bytes within a DocID?

162

This appears to be more complicated than necessary.
I'd suggest pulling out the following function, and seeing where it takes you:

// Writes a variable-length integer into the buffer and updates the buffer size.
// If it doesn't fit, returns false and doesn't write to the buffer.
bool encodeVByte(uint32_t V, MutableArrayRef<uint8_t> &Buf);

Personally I find the no-loop implementation much easier to read, just:

if (V < (1<<7)) {
  if (Buf.size() < 1)
    return false;
  Buf[0] = V;
  Buf = Buf.drop_front(1);
  return true;
}
// and 4 more cases

but up to you. Please do try to find a way to reduce the number of constants (masks, limits, offsets, *BytesMask...) if you keep the loop.
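One way to realize the suggested helper, shown with a minimal ByteBuf struct standing in for llvm::MutableArrayRef<uint8_t> so the sketch is self-contained (the continuation-bit convention is an assumption here):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal stand-in for llvm::MutableArrayRef<uint8_t>.
struct ByteBuf {
  uint8_t *Data;
  size_t Size;
  void dropFront(size_t N) { Data += N; Size -= N; }
};

// Writes V into Buf as a variable-length byte sequence and advances Buf
// past the written bytes. Returns false (writing nothing) if V won't fit.
bool encodeVByte(uint32_t V, ByteBuf &Buf) {
  size_t Width = 1; // number of 7-bit groups V occupies
  for (uint32_t Tmp = V; Tmp >= (1u << 7); Tmp >>= 7)
    ++Width;
  if (Buf.Size < Width)
    return false;
  for (size_t I = 0; I < Width; ++I) {
    uint8_t Byte = V & 0x7f;
    V >>= 7;
    Buf.Data[I] = V ? (Byte | 0x80) : Byte; // high bit: more bytes follow
  }
  Buf.dropFront(Width);
  return true;
}
```

Computing the width up front keeps the "does it fit" check in one place, which is the point of the suggested API.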

248

the logical structure seems like a nested loop, I think this would be easier to follow:

for (Current = Head; have more bytes and not enough numbers; Current += delta) {
 delta = 0;
 continuation = true;
 while (continuation) {
  ...
 }
 Result.push_back(Current + delta);
}
262

here I think you're missing a memory optimization probably equal in size to the whole gains achieved by compression :-)

libstdc++ uses a 2x growth factor for std::vector, so we're probably wasting an extra 30% or so of RAM (depending on size distribution; I forget the theory here).
We should shrink to fit. If we were truly desperate we'd iterate over all the numbers and presize the array, but we're probably not.

I think "return std::vector<DocID>(Result); // no move" will shrink it as you want (note that shrink_to_fit() is usually a no-op :-\)
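The copy-to-shrink trick is easy to demonstrate; the helper below is purely illustrative (the exact-size allocation on copy holds on common implementations such as libstdc++ and libc++, not by guarantee of the standard):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return a right-sized copy: copy construction allocates exactly size()
// elements on common implementations, discarding growth-factor slack.
std::vector<uint32_t> rightSized(const std::vector<uint32_t> &V) {
  return std::vector<uint32_t>(V);
}
```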

clang-tools-extra/clangd/index/dex/PostingList.h
41

With the current implementation, this doesn't need to be in the header.
(the layout of vector<chunk> doesn't depend on chunk, you should just need to out-line the default destructor)

(using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)

42

make this a static_assert below the class?

45

return SmallVector<DocID, PayloadSize + 1> to avoid allocations?

52

This seems like a waste of a byte - ensure padding bytes are zeros, then you're done decoding once you hit a zero byte or the end of the chunk.
(Note that a zero byte encodes the integer zero, which is not a legal posting list delta)

59

this isn't a good justification - the performance of MemIndex isn't really relevant.
"Compression saves memory at a small cost in access time, which is still fast enough in practice."

70

this should be Chunks.capacity() (see comment in other file)

74

this may seem picky, but this seems like a waste of 8 bytes (particularly for small posting lists).
I'd suggest just defining a constant (in Chunk) for the estimated entries per chunk (maybe 15 or so?) and using Chunks.size() * Chunk::ApproxEntriesPerChunk as a "good enough" estimate.

clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp
2

For better or worse, adding a fuzzer in the open-source project is pretty high ceremony (CMake stuff, subdirectory, oss-fuzz configuration, following up on bugs).

I'm not sure the maintenance cost is justified here. Can we just run the fuzzer but not check it in?

@sammccall thank you for the comments, I'll improve it. @ilya-biryukov also provided valuable feedback and suggested that the code looks complicated now. We also discussed different compression approach which would require storing Heads and Payloads separately so that binary search over Heads could have better cache locality. That can dramatically improve performance.

For now, I think it's important to simplify the code and I'll start with that. Also, your suggestions will help to save even more memory! Zeroed bytes are an example of that: I started the patch with the general encoding API (so that zeros could be encoded in the list), but there's no good reason to keep that assumption now.

Also, I think I got my measurements wrong yesterday. I measured posting lists only: without compression the size is 55 MB, and with compression (current version, not optimized yet) it's 22 MB. This seems like a huge win. I'm trying to keep myself from being overenthusiastic and to double-check the numbers, but it looks close to what you estimated from the posting list size distribution.

clang-tools-extra/clangd/index/dex/PostingList.cpp
262

Great catch! I have to be careful with std::vectors which are not allocated with their final size in advance.

clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp
2

OK, I'll leave this here until the patch is accepted for continuous testing, but I won't push it in the final version.

kbobyrev updated this revision to Diff 166453.Sep 21 2018, 5:09 AM
kbobyrev marked 9 inline comments as done.

I addressed the easiest issues. I'll try to implement a separate storage structure for Heads and Payloads, which would potentially make the implementation cleaner and easier to understand (and also more maintainable, since it would be much easier to add SIMD speedups and other encoding schemes that way).

Also, I'll refine D52047 a little bit; I believe it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed concern about the memory consumption "benchmark" and suggested a separate binary. While that seems fine to me, I think it's important to keep the performance + memory tracking infrastructure easy to use (scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication), so using this "trick" is OK with me, but I don't have a strong opinion. What do you think, @sammccall?

Also, I'll refine D52047 a little bit; I believe it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed concern about the memory consumption "benchmark" and suggested a separate binary. While that seems fine to me, I think it's important to keep the performance + memory tracking infrastructure easy to use (scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication), so using this "trick" is OK with me, but I don't have a strong opinion. What do you think, @sammccall?

FWIW, I think the "trick" for memory benchmark is fine. I just think we should add proper output to make the trick clear to users, as suggested in the patch comment.


It seems to be omitted in README.md, but you are probably after [[ https://github.com/google/benchmark/blob/1b44120cd16712f3b5decd95dc8ff2813574b273/include/benchmark/benchmark.h#L596-L612 | benchmark::State::SetLabel() ]]

kbobyrev added inline comments.Sep 21 2018, 7:39 AM
clang-tools-extra/clangd/index/dex/PostingList.cpp
29

I thought using std::begin(Container), std::end(Container) is way more robust because the API is essentially the same if the code changes, so I used it everywhere in Dex. Do you think I should change this patch or keep it to keep the codebase more consistent?

121

But they're used both in encodeStream() and decompress(). I tried to move as much static constants to functions where they're used, but these masks are useful for both encoding and decoding. Is there something I should do instead (e.g. make them members of PostingList)?

I addressed the easiest issues. I'll try to implement a separate storage structure for Heads and Payloads, which would potentially make the implementation cleaner and easier to understand (and also more maintainable, since it would be much easier to add SIMD speedups and other encoding schemes that way).

That doesn't sound more maintainable, that sounds like a performance hack that will hurt the layering.
Which is ok :-) but please don't do that until you measure a nontrivial performance improvement from it.

Also, I'll refine D52047 a little bit; I believe it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed concern about the memory consumption "benchmark" and suggested a separate binary. While that seems fine to me, I think it's important to keep the performance + memory tracking infrastructure easy to use (scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication), so using this "trick" is OK with me, but I don't have a strong opinion. What do you think, @sammccall?

I may be missing some context, but you're talking about index idle memory usage right?
Can't this just be a dexp command?
No objection to annotating benchmark runs with memory usage too, but I wouldn't jump through hoops unless there's a strong reason.

clang-tools-extra/clangd/index/dex/PostingList.cpp
121

You're right.
My real objection here is that these decls are hard to understand here, and the code that uses them is also hard to understand.

I think this is because they aren't very powerful abstractions (the detail abstracted is limited, and it's not strongly abstracted) and the names aren't sufficiently good. (I don't have great ones, though not confusing bits and bytes would be a start :-)

My top suggestion is to inline these everywhere they're used. This is bit-twiddling code; not knowing what the bits are makes understanding hard, and encourages you to write the code in a falsely general way.

Failing that, SevenBytesMask -> LowBits and ContinuationBit -> More or HighBit?
FourBytesMask isn't needed, you won't be decoding any invalid data.

kbobyrev updated this revision to Diff 166831.Sep 25 2018, 1:52 AM
kbobyrev marked 17 inline comments as done.
  • Simplify code
  • Disallow empty PostingLists and update tests
kbobyrev edited the summary of this revision. (Show Details)Sep 25 2018, 2:39 AM

Mostly looks good, a few further simplifications...

clang-tools-extra/clangd/index/dex/PostingList.cpp
59

I find "if the position was found" somewhat misleading - even if the ID is not in the chunk we can hit this case.

Maybe extract normalizeCursor() here, with

// If the cursor is at the end of a chunk, place it at the start of the next chunk.
void normalizeCursor() { ... }

This can be shared with advance().

116

Comment the invariants here, e.g.

// If CurrentChunk is valid, then DecompressedChunk is CurrentChunk->decompress()
// and CurrentID is a valid (non-end) iterator into it.
128

This gives the wrong answer for zero. So assert Delta != 0?

128

nit: int or unsigned, not size_t

128

Width = 1 + findLastSet(Delta) / 7
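That width formula can be checked directly; llvm::findLastSet returns the index of the highest set bit, shown here with a portable plain-loop stand-in so the sketch compiles on its own:

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for llvm::findLastSet: index of the highest set bit.
unsigned findLastSet(uint32_t V) {
  unsigned Index = 0;
  while (V >>= 1)
    ++Index;
  return Index;
}

// Number of bytes the VByte encoding of a nonzero Delta occupies:
// one byte per started 7-bit group.
unsigned vbyteWidth(uint32_t Delta) { return 1 + findLastSet(Delta) / 7; }
```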

138

The mask doesn't need to vary, just apply it after shifting.
But really this loop is much clearer if you modify delta in place.

do {
  uint8_t Encoding = Delta & 0x7f;
  Delta >>= 7;
  Payload.front() = Delta ? (0x80 | Encoding) : Encoding;
  Payload = Payload.drop_front();
} while (Delta != 0);
171

Why not just declare a chunk here (or use Result.back() and work in place)?

Result.emplace_back();
DocID Last = Result.back().Head = Documents.front();
MutableArrayRef<uint8_t> RemainingPayload = Result.back().Payload;
for (DocID Doc : Documents.drop_front()) {
  // no need to handle I == 0 special case.
  if (!encodeVByte(Doc - Last, RemainingPayload)) { // didn't fit, flush chunk
    Result.emplace_back();
    Result.back().Head = Doc;
    RemainingPayload = Result.back().Payload;
  }
  Last = Doc;
}

more values, fewer indices

173

PayloadRef doesn't really describe the function of this variable. Suggest RemainingPayload or EmptyPayload or so

192

nit: if the stream is terminated, consumes all bytes and returns None.

clang-tools-extra/clangd/index/dex/PostingList.h
39

mark as an implementation detail so readers aren't confused.

41

(meta-nit: please don't mark comments as done if they're not done - rather explain why you didn't do them!)

41

(using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)

You changed this to SmallVector - what were the measurements?
(SmallVector is going to be bigger than Vector whenever you overrun, so it's worth checking)

46

this is an implementation detail, move to cpp file (or just inline)

50

??

kbobyrev updated this revision to Diff 166843.Sep 25 2018, 4:03 AM
kbobyrev marked 13 inline comments as done and an inline comment as not done.
kbobyrev edited the summary of this revision. (Show Details)

Address a round of comments, fall back to std::vector.

sammccall accepted this revision.Sep 25 2018, 4:35 AM

LG, don't forget about the fuzzer!

clang-tools-extra/clangd/index/dex/PostingList.cpp
39–46

just ++CurrentID; normalizeCursor();

121

this is used only in one place now, inline or use elsewhere

128

meaningful *bits*

no need to say "dividing..." as it just echoes the code. "examining the meaningful bits"?

164

unused

This revision is now accepted and ready to land.Sep 25 2018, 4:35 AM
kbobyrev updated this revision to Diff 166849.Sep 25 2018, 4:43 AM
kbobyrev marked 4 inline comments as done.

Address post-LG comments, remove fuzzer.

clang-tools-extra/clangd/index/dex/PostingList.cpp
192

As discussed offline, when the stream is terminated (i.e. 0 byte indicates the end of the stream) it just returns llvm::None.

clang-tools-extra/clangd/index/dex/PostingList.h
41
Storage type                 | Memory consumption, MB
llvm::SmallVector<Chunk, 1>  | 23.7
llvm::SmallVector<Chunk, 2>  | 25.2
std::vector<Chunk>           | 23.5

It seems like std::vector<Chunk> would be the best option here, falling back to that.

As discussed offline, moving Chunk out of the header seems tedious because of default constructor/destructor failures due to the incomplete Chunk type.

This revision was automatically updated to reflect the committed changes.