This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Implement VByte PostingList compression
ClosedPublic

Authored by kbobyrev on Sep 20 2018, 6:15 AM.

Details

Summary

This patch implements Variable-length Byte compression of PostingLists to sacrifice some performance for lower memory consumption.

PostingList compression and decompression were extensively tested by running a fuzzer for multiple hours and by replaying a significant number of realistic FuzzyFindRequests. AddressSanitizer and UndefinedBehaviorSanitizer were used to ensure correct behaviour.

Performance evaluation was conducted with recent LLVM symbol index (292k symbols) and the collection of user-recorded queries (7751 FuzzyFindRequest JSON dumps):

Metrics                                      | Before | After | Change (%)
Memory consumption (posting lists only), MB  | 54.4   | 23.5  | -60%
Time to process queries, sec                 | 7.70   | 9.4   | +25%
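For context, variable-byte encoding stores each delta between consecutive sorted DocIDs in as few bytes as possible: 7 payload bits per byte, with the high bit marking that more bytes follow. The following sketch illustrates the general scheme only; the function name and signature are not the patch's actual API:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Append the VByte encoding of Delta to Out: low 7 bits of payload per
// byte, high bit (0x80) set while more bytes follow.
void encodeVByte(uint32_t Delta, std::vector<uint8_t> &Out) {
  do {
    uint8_t Byte = Delta & 0x7f;
    Delta >>= 7;
    Out.push_back(Delta ? (Byte | 0x80) : Byte);
  } while (Delta != 0);
}
```

Small deltas, which dominate real posting lists, take one byte instead of the four a raw DocID occupies, which is where the ~60% memory saving comes from.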

Diff Detail

Event Timeline

kbobyrev created this revision.Sep 20 2018, 6:15 AM
kbobyrev edited the summary of this revision. (Show Details)Sep 20 2018, 6:15 AM
kbobyrev updated this revision to Diff 166278.Sep 20 2018, 6:28 AM
  • Update unit tests with iterator tree string representation to comply with the new format
  • Don't mark constructor explicit (previously it only had one parameter)
  • Fix Limits explanation comment (ID > Limits[I] -> ID >= Limits[I])

Very nice!

I think the data structures can be slightly tighter, and some of the implementation could be easier to follow. But this seems like a nice win.

Right-sizing the vectors seems like an important optimization.

clang-tools-extra/clangd/index/dex/PostingList.cpp
29

nit: we generally use members (DecompressedChunk.begin()) unless actually dealing with arrays or templates, since lookup rules are simpler

39–46

nit: I think this might be clearer with the special/unlikely cases (hit end) inside the if:

if (++InnerIndex == DecompressedChunk.end()) { // end of chunk
  if (++ChunkIndex == Chunks.end()) // end of iterator
    return;
  DecompressedChunk = ChunkIndex->decompress();
  InnerIndex = DecompressedChunk.begin();
}

also I think the indirection via reachedEnd() mostly serves to confuse here, as the other lines deal with the data structures directly. It's not clear (without reading the implementation) what the behavior is when class invariants are violated.

58

this again puts the "normal case" (need to choose a chunk) inside the if(), instead of the exceptional case.

In order to write this more naturally, I think pulling out a private helper advanceToChunk(DocID) might be best here, you can early return from there.

61

ChunkIndex + 1? You've already eliminated the current chunk.

62

This seems unnecessarily two-step (found the chunk... or it could be the first element of the next).
Understandably, because std::*_bound has such a silly API.

You want to find the *last* chunk such that Head <= ID.
So find the first one with Head > ID, and subtract one.

std::lower_bound returns the first element for which its predicate is false.

Therefore:

ChunkIndex = std::lower_bound(ChunkIndex, Chunks.end(), ID,
     [](const Chunk &C, const DocID D) { return C.Head <= D; }) - 1;
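A self-contained sketch of this "last chunk with Head <= ID" search, using the advanceToChunk name suggested earlier and a minimal stand-in Chunk (illustrative only):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

using DocID = uint32_t;

struct Chunk {
  DocID Head; // first DocID stored in this chunk
};

// Returns an iterator to the last chunk whose Head is <= ID.
// Precondition: Chunks is sorted by Head and Chunks.front().Head <= ID.
std::vector<Chunk>::const_iterator
advanceToChunk(const std::vector<Chunk> &Chunks, DocID ID) {
  // lower_bound returns the first element for which the predicate is
  // false, i.e. the first chunk with Head > ID; stepping back one gives
  // the last chunk with Head <= ID.
  return std::lower_bound(
             Chunks.begin(), Chunks.end(), ID,
             [](const Chunk &C, DocID D) { return C.Head <= D; }) -
         1;
}
```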
63

(again I'd avoid reachedEnd() here as you haven't reestablished invariants, so it's easier to just deal with the data structures)

76

(this can become an assert)

115

nit: please don't call these indexes if they're actually iterators: CurrentChunk seems fine

116

(again, SmallVector)

121

move to the function where they're used

127

What's the purpose of this? Why can't the caller just construct the Chunk themselves - what does the std::queue buy us?

140

I don't understand this comment. Aren't these bit offsets of payload bytes within a DocID?

162

This appears to be more complicated than necessary.
I'd suggest pulling out the following function, and seeing where it takes you:

// Writes a variable-length integer into the buffer and updates the buffer size.
// If it doesn't fit, returns false and doesn't write to the buffer.
bool encodeVByte(uint32_t V, MutableArrayRef<uint8_t> &Buf);

Personally I find the no-loop implementation much easier to read, just:

if (V < (1<<7)) {
  if (Buf.size() < 1)
    return false;
  Buf[0] = V;
  Buf = Buf.drop_front(1);
  return true;
}
// and 4 more cases

but up to you. Please do try to find a way to reduce the number of constants (masks, limits, offsets, *BytesMask...) if you keep the loop.
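One way to realize the suggested helper, shown with a minimal ByteBuf struct standing in for llvm::MutableArrayRef<uint8_t> so the sketch is self-contained (the continuation-bit convention is an assumption here):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Minimal stand-in for llvm::MutableArrayRef<uint8_t>.
struct ByteBuf {
  uint8_t *Data;
  size_t Size;
  void dropFront(size_t N) { Data += N; Size -= N; }
};

// Writes V into Buf as a variable-length byte sequence and advances Buf
// past the written bytes. Returns false (writing nothing) if V won't fit.
bool encodeVByte(uint32_t V, ByteBuf &Buf) {
  size_t Width = 1; // number of 7-bit groups V occupies
  for (uint32_t Tmp = V; Tmp >= (1u << 7); Tmp >>= 7)
    ++Width;
  if (Buf.Size < Width)
    return false;
  for (size_t I = 0; I < Width; ++I) {
    uint8_t Byte = V & 0x7f;
    V >>= 7;
    Buf.Data[I] = V ? (Byte | 0x80) : Byte; // high bit: more bytes follow
  }
  Buf.dropFront(Width);
  return true;
}
```

Computing the width up front keeps the "does it fit" check in one place, which is the point of the suggested API.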

248

the logical structure seems like a nested loop, I think this would be easier to follow:

for (Current = Head; have more bytes and not enough numbers; Current += delta) {
 delta = 0;
 continuation = true;
 while (continuation) {
  ...
 }
 Result.push_back(Current + delta);
}
262

here I think you're missing a memory optimization probably equal in size to the whole gains achieved by compression :-)

libstdc++ uses a 2x growth factor for std::vector, so we're probably wasting an extra 30% or so of RAM (depending on size distribution; I forget the theory here).
We should shrink to fit. If we were truly desperate we'd iterate over all the numbers and presize the array, but we're probably not.

I think "return std::vector<DocID>(Result); // no move" will shrink it as you want (note that shrink_to_fit() is usually a no-op :-\)
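The copy-to-shrink trick is easy to demonstrate; the helper below is purely illustrative (the exact-size allocation on copy holds on common implementations such as libstdc++ and libc++, not by guarantee of the standard):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Return a right-sized copy: copy construction allocates exactly size()
// elements on common implementations, discarding growth-factor slack.
std::vector<uint32_t> rightSized(const std::vector<uint32_t> &V) {
  return std::vector<uint32_t>(V);
}
```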

clang-tools-extra/clangd/index/dex/PostingList.h
41

With the current implementation, this doesn't need to be in the header.
(the layout of vector<chunk> doesn't depend on chunk, you should just need to out-line the default destructor)

(using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)

42

make this a static_assert below the class?

45

return SmallVector<DocID, PayloadSize + 1> to avoid allocations?

52

This seems like a waste of a byte - ensure padding bytes are zeros, then you're done decoding once you hit a zero byte or the end of the chunk.
(Note that a zero byte encodes the integer zero, which is not a legal posting list delta)

59

this isn't a good justification - the performance of MemIndex isn't really relevant.
"Compression saves memory at a small cost in access time, which is still fast enough in practice."

70

this should be Chunks.capacity() (see comment in other file)

74

this may seem picky, but this seems like a waste of 8 bytes (particularly for small posting lists).
I'd suggest just defining a constant (in Chunk) for the estimated entries per chunk (maybe 15 or so?) and using Chunks.size() * Chunk::ApproxEntriesPerChunk as a "good enough" estimate.

clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp
2

For better or worse, adding a fuzzer in the open-source project is pretty high ceremony (CMake stuff, subdirectory, oss-fuzz configuration, following up on bugs).

I'm not sure the maintenance cost is justified here. Can we just run the fuzzer but not check it in?

@sammccall thank you for the comments, I'll improve it. @ilya-biryukov also provided valuable feedback and suggested that the code looks complicated now. We also discussed different compression approach which would require storing Heads and Payloads separately so that binary search over Heads could have better cache locality. That can dramatically improve performance.

For now, I think it's important to simplify the code and I'll start with that. Also, your suggestions will help to save even more memory! Zeroed bytes are an example of that: I started the patch with the general encoding API (so that zeros could be encoded in the list), but there's no good reason to keep that assumption now.

Also, I think I got my measurements wrong yesterday. I measured posting lists only: without compression the size is 55 MB, and with compression (current version, not optimized yet) it's 22 MB. This seems like a huge win. I'm trying to keep myself from being overenthusiastic and to double-check the numbers, but it looks close to what you estimated from the posting list size distribution.

clang-tools-extra/clangd/index/dex/PostingList.cpp
262

Great catch! I have to be careful with std::vectors which are not allocated with their final size in advance.

clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp
2

OK, I'll leave this here until the patch is accepted for continuous testing, but I won't push it in the final version.

kbobyrev updated this revision to Diff 166453.Sep 21 2018, 5:09 AM
kbobyrev marked 9 inline comments as done.

I addressed the easiest issues. I'll try to implement a separate storage structure for Heads and Payloads, which would potentially make the implementation cleaner and easier to understand (and also more maintainable, since it would be much easier to add SIMD speedups and other encoding schemes that way).

Also, I'll refine D52047 a little bit; I believe it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed concern about the memory consumption "benchmark" and suggested a separate binary. While that seems fine to me, I think it's important to keep the performance + memory tracking infrastructure easy to use (scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication), so using this "trick" is OK with me, but I don't have a strong opinion. What do you think, @sammccall?

Also, I'll refine D52047 a little bit; I believe it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed concern about the memory consumption "benchmark" and suggested a separate binary. While that seems fine to me, I think it's important to keep the performance + memory tracking infrastructure easy to use (scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication), so using this "trick" is OK with me, but I don't have a strong opinion. What do you think, @sammccall?

FWIW, I think the "trick" for memory benchmark is fine. I just think we should add proper output to make the trick clear to users, as suggested in the patch comment.


It seems to be omitted in README.md, but you are probably after [[ https://github.com/google/benchmark/blob/1b44120cd16712f3b5decd95dc8ff2813574b273/include/benchmark/benchmark.h#L596-L612 | benchmark::State::SetLabel() ]]

kbobyrev added inline comments.Sep 21 2018, 7:39 AM
clang-tools-extra/clangd/index/dex/PostingList.cpp
29

I thought using std::begin(Container), std::end(Container) is way more robust because the API is essentially the same if the code changes, so I used it everywhere in Dex. Do you think I should change this patch or keep it to keep the codebase more consistent?

121

But they're used both in encodeStream() and decompress(). I tried to move as much static constants to functions where they're used, but these masks are useful for both encoding and decoding. Is there something I should do instead (e.g. make them members of PostingList)?

I addressed the easiest issues. I'll try to implement a separate storage structure for Heads and Payloads, which would potentially make the implementation cleaner and easier to understand (and also more maintainable, since it would be much easier to add SIMD speedups and other encoding schemes that way).

That doesn't sound more maintainable, that sounds like a performance hack that will hurt the layering.
Which is ok :-) but please don't do that until you measure a nontrivial performance improvement from it.

Also, I'll refine D52047 a little bit; I believe it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed concern about the memory consumption "benchmark" and suggested a separate binary. While that seems fine to me, I think it's important to keep the performance + memory tracking infrastructure easy to use (scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication), so using this "trick" is OK with me, but I don't have a strong opinion. What do you think, @sammccall?

I may be missing some context, but you're talking about index idle memory usage right?
Can't this just be a dexp command?
No objection to annotating benchmark runs with memory usage too, but I wouldn't jump through hoops unless there's a strong reason.

clang-tools-extra/clangd/index/dex/PostingList.cpp
121

You're right.
My real objection here is that these decls are hard to understand here, and the code that uses them is also hard to understand.

I think this is because they aren't very powerful abstractions (the detail abstracted is limited, and it's not strongly abstracted) and the names aren't sufficiently good. (I don't have great ones, though not confusing bits and bytes would be a start :-)

My top suggestion is to inline these everywhere they're used. This is bit-twiddling code; not knowing what the bits are makes understanding hard, and encourages you to write the code in a falsely general way.

Failing that, SevenBytesMask -> LowBits and ContinuationBit -> More or HighBit?
FourBytesMask isn't needed, you won't be decoding any invalid data.

kbobyrev updated this revision to Diff 166831.Sep 25 2018, 1:52 AM
kbobyrev marked 17 inline comments as done.
  • Simplify code
  • Disallow empty PostingLists and update tests
kbobyrev edited the summary of this revision. (Show Details)Sep 25 2018, 2:39 AM

Mostly looks good, a few further simplifications...

clang-tools-extra/clangd/index/dex/PostingList.cpp
59

I find "if the position was found" somewhat misleading - even if the ID is not in the chunk we can hit this case.

Maybe extract normalizeCursor() here, with

// If the cursor is at the end of a chunk, place it at the start of the next chunk.
void normalizeCursor() { ... }

This can be shared with advance().

116

Comment the invariants here, e.g.

// If CurrentChunk is valid, then DecompressedChunk is CurrentChunk->decompress()
// and CurrentID is a valid (non-end) iterator into it.
128

This gives the wrong answer for zero. So assert Delta != 0?

128

nit: int or unsigned, not size_t

128

Width = 1 + findLastSet(Delta) / 7
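That width formula can be checked directly; llvm::findLastSet returns the index of the highest set bit, shown here with a portable plain-loop stand-in so the sketch compiles on its own:

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for llvm::findLastSet: index of the highest set bit.
unsigned findLastSet(uint32_t V) {
  unsigned Index = 0;
  while (V >>= 1)
    ++Index;
  return Index;
}

// Number of bytes the VByte encoding of a nonzero Delta occupies:
// one byte per started 7-bit group.
unsigned vbyteWidth(uint32_t Delta) { return 1 + findLastSet(Delta) / 7; }
```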

138

The mask doesn't need to vary, just apply it after shifting.
But really this loop is much clearer if you modify delta in place.

do {
  uint8_t Encoding = Delta & 0x7f;
  Delta >>= 7;
  Payload.front() = Delta ? (0x80 | Encoding) : Encoding;
  Payload = Payload.drop_front();
} while (Delta != 0);
171

Why not just declare a chunk here (or use Result.back() and work in place)?

Result.emplace_back();
DocID Last = Result.back().Head = Documents.front();
MutableArrayRef<uint8_t> RemainingPayload = Result.back().Payload;
for (DocID Doc : Documents.drop_front()) {
  // no need to handle I == 0 special case.
  if (!encodeVByte(Doc - Last, RemainingPayload)) { // didn't fit, flush chunk
    Result.emplace_back();
    Result.back().Head = Doc;
    RemainingPayload = Result.back().Payload;
  }
  Last = Doc;
}

more values, fewer indices

173

PayloadRef doesn't really describe the function of this variable. Suggest RemainingPayload or EmptyPayload or so

192

nit: if the stream is terminated, consumes all bytes and returns None.

clang-tools-extra/clangd/index/dex/PostingList.h
39

mark as an implementation detail so readers aren't confused.

41

(meta-nit: please don't mark comments as done if they're not done - rather explain why you didn't do them!)

41

(using SmallVector<Chunk, 1> or maybe 2 *might* be a win. I'd expect not though. I'd either stick with std::vector, or measure)

You changed this to SmallVector - what were the measurements?
(SmallVector is going to be bigger than Vector whenever you overrun, so it's worth checking)

46

this is an implementation detail, move to cpp file (or just inline)

50

??

kbobyrev updated this revision to Diff 166843.Sep 25 2018, 4:03 AM
kbobyrev marked 13 inline comments as done and an inline comment as not done.
kbobyrev edited the summary of this revision. (Show Details)

Address a round of comments, fall back to std::vector.

sammccall accepted this revision.Sep 25 2018, 4:35 AM

LG, don't forget about the fuzzer!

clang-tools-extra/clangd/index/dex/PostingList.cpp
39–46

just ++CurrentID; normalizeCursor();

121

this is used only in one place now, inline or use elsewhere

128

meaningful *bits*

no need to say "dividing..." as it just echoes the code. "examining the meaningful bits"?

164

unused

This revision is now accepted and ready to land.Sep 25 2018, 4:35 AM
kbobyrev updated this revision to Diff 166849.Sep 25 2018, 4:43 AM
kbobyrev marked 4 inline comments as done.

Address post-LG comments, remove fuzzer.

clang-tools-extra/clangd/index/dex/PostingList.cpp
192

As discussed offline, when the stream is terminated (i.e. 0 byte indicates the end of the stream) it just returns llvm::None.

clang-tools-extra/clangd/index/dex/PostingList.h
41
Storage type                 | Memory consumption, MB
llvm::SmallVector<Chunk, 1>  | 23.7
llvm::SmallVector<Chunk, 2>  | 25.2
std::vector<Chunk>           | 23.5

It seems like std::vector<Chunk> would be the best option here, falling back to that.

As discussed offline, moving Chunk out of the header seems tedious because of default constructor/destructor failures due to the incomplete Chunk type.

This revision was automatically updated to reflect the committed changes.