This is an archive of the discontinued LLVM Phabricator instance.

Differential D52300

[clangd] Implement VByte PostingList compression
ClosedPublic

Authored by kbobyrev on Sep 20 2018, 6:15 AM.

Details

Reviewers

ioeric
sammccall
ilya-biryukov

Commits

rG6c2f5bd0f1b1: [clangd] Implement VByte PostingList compression
rCTE342965: [clangd] Implement VByte PostingList compression
rL342965: [clangd] Implement VByte PostingList compression

Summary

This patch implements Variable-length Byte compression of PostingLists to sacrifice some performance for lower memory consumption.

PostingList compression and decompression were extensively tested by running a fuzzer for multiple hours and by running a significant number of realistic FuzzyFindRequests. AddressSanitizer and UndefinedBehaviorSanitizer were used to ensure correct behaviour.

Performance evaluation was conducted with a recent LLVM symbol index (292k symbols) and a collection of user-recorded queries (7751 FuzzyFindRequest JSON dumps):

Metrics	Before	After	Change (%)
Memory consumption (posting lists only), MB	54.4	23.5	-60%
Time to process queries, sec	7.70	9.4	+25%
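
To illustrate where the memory saving in the table can come from, here is a minimal standalone sketch of VByte delta encoding. It is not the code from this patch: `appendVByte` and `encodeGaps` are hypothetical names, and plain `std::vector` stands in for the chunked storage the patch uses. The byte layout (7 payload bits per byte, low bits first, high bit set when more bytes follow) follows the convention discussed later in this review.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends the VByte encoding of Delta to Out: 7 payload bits per byte, low
// bits first, high bit set on every byte except the last one.
void appendVByte(uint32_t Delta, std::vector<uint8_t> &Out) {
  do {
    uint8_t Low = Delta & 0x7f;
    Delta >>= 7;
    Out.push_back(static_cast<uint8_t>(Delta ? (Low | 0x80) : Low));
  } while (Delta != 0);
}

// Encodes the gaps between consecutive IDs of a sorted list. The first ID is
// left out, on the assumption that it is stored uncompressed elsewhere.
std::vector<uint8_t> encodeGaps(const std::vector<uint32_t> &SortedIDs) {
  std::vector<uint8_t> Out;
  for (size_t I = 1; I < SortedIDs.size(); ++I)
    appendVByte(SortedIDs[I] - SortedIDs[I - 1], Out);
  return Out;
}
```

For the IDs 5, 130, 131, 20000 the three gaps (125, 1, 19869) take 1 + 1 + 3 bytes, compared with 12 bytes for three raw 32-bit values. The price is that the list has to be decoded sequentially, which is consistent with the slower query times reported above.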

Diff Detail

Repository: rL LLVM

Event Timeline

kbobyrev created this revision. Sep 20 2018, 6:15 AM

Herald added subscribers: kadircet, arphaman, jkorous and 2 others. Sep 20 2018, 6:15 AM

kbobyrev edited the summary of this revision. Sep 20 2018, 6:15 AM

Update unit tests with iterator tree string representation to comply with the new format
Don't mark constructor explicit (previously it only had one parameter)
Fix Limits explanation comment (ID > Limits[I] -> ID >= Limits[I])

Very nice!

I think the data structures can be slightly tighter, and some of the implementation could be easier to follow. But this seems like a nice win.

Right-sizing the vectors seems like an important optimization.

clang-tools-extra/clangd/index/dex/PostingList.cpp
29 ↗	(On Diff #166278)	nit: we generally use members (DecompressedChunk.begin()) unless actually dealing with arrays or templates, since lookup rules are simpler
39 ↗	(On Diff #166278)	nit: I think this might be clearer with the special/unlikely cases (hit end) inside the if: if (++InnerIndex == DecompressedChunks.begin()) { // end of chunk if (++ChunkIndex == Chunks.end()) // end of iterator return; DecompressedChunk = ChunkIndex->decompress(); InnerIndex = DecompressedChunk.begin(); } also I think the indirection via `reachedEnd()` mostly serves to confuse here, as the other lines deal with the data structures directly. It's not clear (without reading the implementation) what the behavior is when class invariants are violated.
58 ↗	(On Diff #166278)	this again puts the "normal case" (need to choose a chunk) inside the if(), instead of the exceptional case. In order to write this more naturally, I think pulling out a private helper `advanceToChunk(DocID)` might be best here, you can early return from there.
61 ↗	(On Diff #166278)	ChunkIndex + 1? You've already eliminated the current chunk.
62 ↗	(On Diff #166278)	This seems unnecessarily two-step (found the chunk... or it could be the first element of the next). Understandably, because std::*_bound has such a silly API. You want to find the last chunk such that Head <= ID. So find the first one with Head > ID, and subtract one. std::lower_bound returns the first element for which its predicate is false. Therefore: ChunkIndex = std::lower_bound(ChunkIndex, Chunks.end(), ID, [](const Chunk &C, const DocID ID) { return C.Head <= ID; }) - 1;
63 ↗	(On Diff #166278)	(again I'd avoid reachedEnd() here as you haven't reestablished invariants, so it's easier to just deal with the data structures)
76 ↗	(On Diff #166278)	(this can become an assert)
115 ↗	(On Diff #166278)	nit: please don't call these indexes if they're actually iterators: CurrentChunk seems fine
116 ↗	(On Diff #166278)	(again, SmallVector)
121 ↗	(On Diff #166278)	move to the function where they're used
127 ↗	(On Diff #166278)	What's the purpose of this? Why can't the caller just construct the Chunk themselves - what does the std::queue buy us?
140 ↗	(On Diff #166278)	I don't understand this comment. Aren't these bit offsets of payload bytes within a DocID?
162 ↗	(On Diff #166278)	This appears to be more complicated than necessary. I'd suggest pulling out the following function, and seeing where it takes you: // Write a variable length into the buffer, and updates the buffer size. // If it doesn't fit, returns false and doesn't write to the buffer. bool encodeVByte(uint32 V, MutableArrayRef<uint8_t>& Buf); Personally I find the no-loop implementation much easier to read, just: if (V < (1<<7)) { if (Buf.size() < 1) return false; Buf[0] = V; Buf = Buf.drop_front(1); return true; } // and 4 more cases but up to you. Please do try to find a way to reduce the number of constants (masks, limits, offsets, *BytesMask...) if you keep the loop.
248 ↗	(On Diff #166278)	the logical structure seems like a nested loop, I think this would be easier to follow: for (Current = Head; have more bytes and not enough numbers; Current += delta) { delta = 0; continuation = true; while (continuation) { ... } Result.push_back(Current + delta;) }
262 ↗	(On Diff #166278)	here I think you're missing a memory optimization probably equal in size to the whole gains achieved by compression :-) libstdc++ uses a 2x growth factor for std::vector, so we're probably wasting an extra 30% or so of ram (depending on size distribution, I forget the theory here). We should shrink to fit. If we were truly desperate we'd iterate over all the numbers and presize the array, but we're probably not. I think `return std::vector<DocID>(Result); // no move, shrink-to-fit` will shrink it as you want (note that `shrink_to_fit()` is usually a no-op :-\)
clang-tools-extra/clangd/index/dex/PostingList.h
41 ↗	(On Diff #166278)	With the current implementation, this doesn't need to be in the header. (the layout of vector<chunk> doesn't depend on chunk, you should just need to out-line the default destructor) (using SmallVector<Chunk, 1> or maybe 2 might be a win. I'd expect not though. I'd either stick with std::vector, or measure)
42 ↗	(On Diff #166278)	make this a static_assert below the class?
45 ↗	(On Diff #166278)	return SmallVector<PayloadSize+1> to avoid allocations?
52 ↗	(On Diff #166278)	This seems like a waste of a byte - ensure padding bytes are zeros, then you're done decoding once you hit a zero byte or the end of the chunk. (Note that a zero byte encodes the integer zero, which is not a legal posting list delta)
59 ↗	(On Diff #166278)	this isn't a good justification - the performance of MemIndex isn't really relevant. "Compression saves memory at a small cost in access time, which is still fast enough in practice."
70 ↗	(On Diff #166278)	this should be Chunks.capacity() (see comment in other file)
74 ↗	(On Diff #166278)	this may seem picky, but this seems like a waste of 8 bytes (particularly for small posting lists). I'd suggest just defining a constant (in Chunk) for the estimated entries per chunk (maybe 15 or so?) and just using `Chunks.size() * Chunks::ApproxEntriesPerChunk` as a "good enough" estimate.
clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp
1 ↗	(On Diff #166278)	For better or worse, adding a fuzzer in the open-source project is pretty high ceremony (CMake stuff, subdirectory, oss-fuzz configuration, following up on bugs). I'm not sure the maintenance cost is justified here. Can we just run the fuzzer but not check it in?
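
The chunk lookup suggested in the comment on line 62 above (find the first chunk with Head > ID, then step back by one) can be sketched in isolation as follows. This is only an illustration under stated assumptions: `findChunk` is a hypothetical helper, chunk heads are modelled as a plain sorted `std::vector<uint32_t>`, and `std::upper_bound` is used in place of `std::lower_bound` with a custom predicate, which should select the same element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the index of the last chunk whose head is <= ID. Heads must be
// sorted in increasing order, be non-empty, and satisfy Heads.front() <= ID.
size_t findChunk(const std::vector<uint32_t> &Heads, uint32_t ID) {
  assert(!Heads.empty() && Heads.front() <= ID);
  // upper_bound yields the first head that is strictly greater than ID...
  auto FirstGreater = std::upper_bound(Heads.begin(), Heads.end(), ID);
  // ...so the element right before it is the last head that is <= ID.
  return static_cast<size_t>(FirstGreater - Heads.begin()) - 1;
}
```

With heads 10, 50, 90, looking up 49 yields chunk 0 and looking up 50 yields chunk 1. The selected chunk only "might contain" the ID, so a second search inside the decompressed chunk is still needed.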

@sammccall thank you for the comments, I'll improve it. @ilya-biryukov also provided valuable feedback and suggested that the code looks complicated now. We also discussed a different compression approach which would require storing Heads and Payloads separately so that binary search over Heads could have better cache locality. That can dramatically improve performance.

For now, I think it's important to simplify the code and I'll start with that. Also, your suggestions will help to save even more memory! Zeroed bytes are an example of that: I started the patch with the general encoding API (so that zeros could be encoded in the list), but there's no good reason to keep that assumption now.

Also, I think I got my measurements wrong yesterday. I measured posting lists only: without compression size is 55 MB and with compression (current version, not optimized yet) it's 22 MB. This seems like a huge win. I try to keep myself from being overenthusiastic and double-check the numbers, but it looks more like something you estimated when we used posting list size distribution.

clang-tools-extra/clangd/index/dex/PostingList.cpp
262 ↗	(On Diff #166278)	Great catch! I have to be careful with `std::vector`s which are not allocated with their final size in advance.
clang-tools-extra/clangd/index/dex/fuzzer/VByteFuzzer.cpp
1 ↗	(On Diff #166278)	OK, I'll leave this here until the patch is accepted for continuous testing, but I won't push it in the final version.
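
The right-sizing trick from the comment on line 262 (return a copy instead of moving) might look like the sketch below. `collectIDs` is a hypothetical function used only for illustration. The C++ standard does not promise that a copy-constructed vector has capacity equal to its size, although common standard library implementations allocate exactly that; likewise, `shrink_to_fit()` is specified as a non-binding request.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Builds a vector by repeated push_back and returns a right-sized copy.
std::vector<uint32_t> collectIDs(uint32_t Count) {
  std::vector<uint32_t> Result;
  for (uint32_t I = 0; I < Count; ++I)
    Result.push_back(I); // Capacity grows geometrically, leaving slack.
  // Deliberately copy rather than move: the copy is typically allocated with
  // just enough room for Result.size() elements (implementation-dependent).
  return std::vector<uint32_t>(Result);
}
```

Whether this actually saves memory is worth measuring with `capacity()` on the target standard library rather than assuming.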

I addressed the easiest issues. I'll try to implement separate storage structure for Heads and Payloads which would potentially make the implementation cleaner and easier to understand (and also more maintainable since that would be way easier to go for SIMD instructions speedups and other encoding schemes if we do that).

Also, I'll refine D52047 a little bit and I believe that it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed their concern with regard to the memory consumption "benchmark" and suggested a separate binary. While this seems fine to me, I think it's important to keep performance + memory tracking infrastructure easy to use (in this sense scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication) and therefore using this "trick" is OK to me, but I don't have a strong opinion about this. What do you think, @sammccall?

In D52300#1241754, @kbobyrev wrote:

Also, I'll refine D52047 a little bit and I believe that it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed their concern with regard to the memory consumption "benchmark" and suggested a separate binary. While this seems fine to me, I think it's important to keep performance + memory tracking infrastructure easy to use (in this sense scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication) and therefore using this "trick" is OK to me, but I don't have a strong opinion about this. What do you think, @sammccall?

FWIW, I think the "trick" for memory benchmark is fine. I just think we should add proper output to make the trick clear to users, as suggested in the patch comment.

In D52300#1241776, @ioeric wrote:

In D52300#1241754, @kbobyrev wrote:

Also, I'll refine D52047 a little bit and I believe that it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed their concern with regard to the memory consumption "benchmark" and suggested a separate binary. While this seems fine to me, I think it's important to keep performance + memory tracking infrastructure easy to use (in this sense scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication) and therefore using this "trick" is OK to me, but I don't have a strong opinion about this. What do you think, @sammccall?

FWIW, I think the "trick" for memory benchmark is fine. I just think we should add proper output to make the trick clear to users, as suggested in the patch comment.

It seems to be omitted in README.md, but you are probably after benchmark::State::SetLabel() (https://github.com/google/benchmark/blob/1b44120cd16712f3b5decd95dc8ff2813574b273/include/benchmark/benchmark.h#L596-L612).

kbobyrev added inline comments. Sep 21 2018, 7:39 AM

clang-tools-extra/clangd/index/dex/PostingList.cpp
29 ↗	(On Diff #166278)	I thought using `std::begin(Container)`, `std::end(Container)` is way more robust because the API is essentially the same if the code changes, so I used it everywhere in Dex. Do you think I should change this patch or keep it to keep the codebase more consistent?
121 ↗	(On Diff #166278)	But they're used both in `encodeStream()` and `decompress()`. I tried to move as much static constants to functions where they're used, but these masks are useful for both encoding and decoding. Is there something I should do instead (e.g. make them members of `PostingList`)?

In D52300#1241754, @kbobyrev wrote:

I addressed the easiest issues. I'll try to implement separate storage structure for Heads and Payloads which would potentially make the implementation cleaner and easier to understand (and also more maintainable since that would be way easier to go for SIMD instructions speedups and other encoding schemes if we do that).

That doesn't sound more maintainable, that sounds like a performance hack that will hurt the layering.
Which is ok :-) but please don't do that until you measure a nontrivial performance improvement from it.

Also, I'll refine D52047 a little bit and I believe that it should be way easier to understand performance + memory consumption once we have these benchmarks in. Both @ioeric and @ilya-biryukov expressed their concern with regard to the memory consumption "benchmark" and suggested a separate binary. While this seems fine to me, I think it's important to keep performance + memory tracking infrastructure easy to use (in this sense scattering different metrics across multiple binaries makes it less accessible and probably introduces some code duplication) and therefore using this "trick" is OK to me, but I don't have a strong opinion about this. What do you think, @sammccall?

I may be missing some context, but you're talking about index idle memory usage right?
Can't this just be a dexp command?
No objection to annotating benchmark runs with memory usage too, but I wouldn't jump through hoops unless there's a strong reason.

clang-tools-extra/clangd/index/dex/PostingList.cpp
121 ↗	(On Diff #166278)	You're right. My real objection here is that these decls are hard to understand here, and the code that uses them is also hard to understand. I think this is because they aren't very powerful abstractions (the detail abstracted is limited, and it's not strongly abstracted) and the names aren't sufficiently good. (I don't have great ones, though not confusing bits and bytes would be a start :-) ) My top suggestion is to inline these everywhere they're used. This is bit twiddling code; not knowing what the bits are can make it hard to understand, and can encourage you to write the code in a falsely general way. Failing that, SevenBytesMask -> LowBits and ContinuationBit -> More or HighBit? FourBytesMask isn't needed, you won't be decoding any invalid data.

Simplify code
Disallow empty PostingLists and update tests

kbobyrev edited the summary of this revision. Sep 25 2018, 2:39 AM

Mostly looks good, a few further simplifications...

clang-tools-extra/clangd/index/dex/PostingList.cpp
59 ↗	(On Diff #166831)	I find "if the position was found" somewhat misleading - even if the ID is not in the chunk we can hit this case. Maybe extract `normalizeCursor()` here, with // If the cursor is at the end of a chunk, place it at the start of the next chunk. void normalizeCursor() { ... } This can be shared with advance().
116 ↗	(On Diff #166831)	Comment the invariants here, e.g. // If CurrentChunk is valid, then DecompressedChunk is CurrentChunk->decompress() // and CurrentID is a valid (non-end) iterator into it.
128 ↗	(On Diff #166831)	This gives the wrong answer for zero. So assert Delta != 0?
128 ↗	(On Diff #166831)	nit: `int` or `unsigned`, not `size_t`
128 ↗	(On Diff #166831)	`Width = 1 + findLastSet(Delta) / 7`
138 ↗	(On Diff #166831)	The mask doesn't need to vary, just apply it after shifting. But really this loop is much clearer if you modify delta in place. do { Encoding = Delta & 0x7f; Delta >>= 7; Payload.front() = Delta ? Encoding : 0x80 | Encoding; Payload = Payload.drop_front(); } while (Delta != 0);
171 ↗	(On Diff #166831)	Why not just declare a chunk here (or use Result.back() and work in place)? Result.emplace_back(); DocID Last = Result.back().Head = Documents.front(); MutableArrayRef<uint8_t> RemainingPayload = Result.back().Payload; for (DocID Doc : Documents.drop_front()) { // no need to handle I == 0 special case. if (!encodeVByte(Doc - Last, RemainingPayload)) { // didn't fit, flush chunk Result.emplace_back(); Result.back().Head = Doc; RemainingPayload = Result.back().Payload; } Last = Doc; } more values, fewer indices
173 ↗	(On Diff #166831)	`PayloadRef` doesn't really describe the function of this variable. Suggest `RemainingPayload` or `EmptyPayload` or so
192 ↗	(On Diff #166831)	nit: if the stream is terminated, consumes all bytes and returns None.
clang-tools-extra/clangd/index/dex/PostingList.h
39 ↗	(On Diff #166831)	mark as an implementation detail so readers aren't confused.
46 ↗	(On Diff #166831)	this is an implementation detail, move to cpp file (or just inline)
50 ↗	(On Diff #166831)	??
41 ↗	(On Diff #166278)	(meta-nit: please don't mark comments as done if they're not done - rather explain why you didn't do them!)
41 ↗	(On Diff #166278)	(using SmallVector<Chunk, 1> or maybe 2 might be a win. I'd expect not though. I'd either stick with std::vector, or measure) You changed this to SmallVector - what were the measurements? (SmallVector is going to be bigger than Vector whenever you overrun, so it's worth checking)
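
The chunk-building loop proposed in the comment on line 171 can be tried out without LLVM types. The sketch below is an approximation under several assumptions: `Chunk`, `vbyteWidth`, `encodeVByte` and `buildChunks` are illustrative stand-ins rather than the code that landed, a `Used` counter replaces `llvm::MutableArrayRef`, and a shift loop replaces `llvm::findLastSet` for the width computation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t PayloadSize = 28; // 32-byte chunk minus a 4-byte head.

struct Chunk {
  uint32_t Head = 0;
  std::array<uint8_t, PayloadSize> Payload{}; // Unused bytes stay zero.
};

// Number of bytes the VByte encoding of Delta occupies (7 bits per byte).
unsigned vbyteWidth(uint32_t Delta) {
  unsigned Width = 0;
  do {
    ++Width;
    Delta >>= 7;
  } while (Delta != 0);
  return Width;
}

// Writes Delta into C.Payload starting at Used and advances Used. If the
// encoding does not fit, returns false and leaves the chunk untouched.
bool encodeVByte(uint32_t Delta, Chunk &C, size_t &Used) {
  assert(Delta != 0 && "0 is not a valid gap in a strictly increasing list");
  if (Used + vbyteWidth(Delta) > PayloadSize)
    return false;
  do {
    uint8_t Low = Delta & 0x7f;
    Delta >>= 7;
    C.Payload[Used++] = static_cast<uint8_t>(Delta ? (Low | 0x80) : Low);
  } while (Delta != 0);
  return true;
}

// Packs a sorted, duplicate-free list of IDs into chunks.
std::vector<Chunk> buildChunks(const std::vector<uint32_t> &Docs) {
  std::vector<Chunk> Result;
  if (Docs.empty())
    return Result;
  Result.emplace_back();
  uint32_t Last = Result.back().Head = Docs.front();
  size_t Used = 0;
  for (size_t I = 1; I < Docs.size(); ++I) {
    if (!encodeVByte(Docs[I] - Last, Result.back(), Used)) {
      // Did not fit: start a new chunk with this ID as its head.
      Result.emplace_back();
      Result.back().Head = Docs[I];
      Used = 0;
    }
    Last = Docs[I];
  }
  return Result;
}
```

For 30 consecutive IDs starting at 0, the first chunk holds the head plus 28 one-byte gaps, and the ID 29 becomes the head of a second chunk.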

Address a round of comments, fallback to std::vector.

LG, don't forget about the fuzzer!

clang-tools-extra/clangd/index/dex/PostingList.cpp
41 ↗	(On Diff #166843)	just `++CurrentID; normalizeCursor();`
131 ↗	(On Diff #166843)	this is used only in one place now, inline or use elsewhere
138 ↗	(On Diff #166843)	meaningful bits no need to say "dividing..." as it just echoes the code. "examining the meaningful bits"?
174 ↗	(On Diff #166843)	unused

This revision is now accepted and ready to land. Sep 25 2018, 4:35 AM

Address post-LG comments, remove fuzzer.

clang-tools-extra/clangd/index/dex/PostingList.cpp

192 ↗	(On Diff #166831)	As discussed offline, when the stream is terminated (i.e. 0 byte indicates the end of the stream) it just returns llvm::None.

clang-tools-extra/clangd/index/dex/PostingList.h

41 ↗	(On Diff #166278)

Storage type	Memory consumption, MB
`llvm::SmallVector<Chunk, 1>`	23.7
`llvm::SmallVector<Chunk, 2>`	25.2
`std::vector<Chunk>`	23.5

It seems like std::vector<Chunk> would be the best option here, falling back to that.

As discussed offline, moving Chunk to header seems tedious because of the default constructor/destructors failures due to incomplete Chunk type.

Closed by commit rL342965: [clangd] Implement VByte PostingList compression (authored by omtcyfz). Sep 25 2018, 4:56 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Sep 25 2018, 4:56 AM

kbobyrev mentioned this in D51689: [clangd] Dense posting lists proof-of-concept. Sep 26 2018, 2:43 AM

Revision Contents

Path	Size
clang-tools-extra/trunk/clangd/index/dex/Dex.cpp	4 lines
clang-tools-extra/trunk/clangd/index/dex/PostingList.h	56 lines
clang-tools-extra/trunk/clangd/index/dex/PostingList.cpp	188 lines
clang-tools-extra/trunk/unittests/clangd/DexTests.cpp	67 lines

Diff 166853

clang-tools-extra/trunk/clangd/index/dex/Dex.cpp

void Dex::buildIndex() {		void Dex::buildIndex() {
for (DocID SymbolRank = 0; SymbolRank < Symbols.size(); ++SymbolRank) {		for (DocID SymbolRank = 0; SymbolRank < Symbols.size(); ++SymbolRank) {
const auto *Sym = Symbols[SymbolRank];		const auto *Sym = Symbols[SymbolRank];
for (const auto &Token : generateSearchTokens(*Sym))		for (const auto &Token : generateSearchTokens(*Sym))
TempInvertedIndex[Token].push_back(SymbolRank);		TempInvertedIndex[Token].push_back(SymbolRank);
}		}

// Convert lists of items to posting lists.		// Convert lists of items to posting lists.
for (const auto &TokenToPostingList : TempInvertedIndex)		for (const auto &TokenToPostingList : TempInvertedIndex)
InvertedIndex.insert({TokenToPostingList.first,		InvertedIndex.insert(
PostingList(move(TokenToPostingList.second))});		{TokenToPostingList.first, PostingList(TokenToPostingList.second)});

vlog("Built Dex with estimated memory usage {0} bytes.",		vlog("Built Dex with estimated memory usage {0} bytes.",
estimateMemoryUsage());		estimateMemoryUsage());
}		}

/// Constructs iterators over tokens extracted from the query and exhausts it		/// Constructs iterators over tokens extracted from the query and exhausts it
/// while applying Callback to each symbol in the order of decreasing quality		/// while applying Callback to each symbol in the order of decreasing quality
/// of the matched symbols.		/// of the matched symbols.

clang-tools-extra/trunk/clangd/index/dex/PostingList.h

	//===--- PostingList.h - Symbol identifiers storage interface --- C++ --===//			//===--- PostingList.h - Symbol identifiers storage interface --- C++ --===//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			///
	// This defines posting list interface: a storage for identifiers of symbols			/// \file
	// which can be characterized by a specific feature (such as fuzzy-find trigram,			/// This defines posting list interface: a storage for identifiers of symbols
	// scope, type or any other Search Token). Posting lists can be traversed in			/// which can be characterized by a specific feature (such as fuzzy-find
	// order using an iterator and are values for inverted index, which maps search			/// trigram, scope, type or any other Search Token). Posting lists can be
	// tokens to corresponding posting lists.			/// traversed in order using an iterator and are values for inverted index,
	//			/// which maps search tokens to corresponding posting lists.
				///
				/// In order to decrease size of Index in-memory representation, Variable Byte
				/// Encoding (VByte) is used for PostingLists compression. An overview of VByte
				/// algorithm can be found in "Introduction to Information Retrieval" book:
				/// https://nlp.stanford.edu/IR-book/html/htmledition/variable-byte-codes-1.html
				///
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_DEX_POSTINGLIST_H			#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_DEX_POSTINGLIST_H
	#define LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_DEX_POSTINGLIST_H			#define LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_DEX_POSTINGLIST_H

	#include "Iterator.h"			#include "Iterator.h"
	#include "llvm/ADT/ArrayRef.h"			#include "llvm/ADT/ArrayRef.h"
				#include "llvm/ADT/SmallVector.h"
	#include <cstdint>			#include <cstdint>
	#include <vector>			#include <vector>

	namespace clang {			namespace clang {
	namespace clangd {			namespace clangd {
	namespace dex {			namespace dex {

	class Iterator;			class Iterator;

				/// NOTE: This is an implementation detail.
				///
				/// Chunk is a fixed-width piece of PostingList which contains the first DocID
				/// in uncompressed format (Head) and delta-encoded Payload. It can be
				/// decompressed upon request.
				struct Chunk {
				/// Keep sizeof(Chunk) == 32.
				static constexpr size_t PayloadSize = 32 - sizeof(DocID);

				llvm::SmallVector<DocID, PayloadSize + 1> decompress() const;

				/// The first element of decompressed Chunk.
				DocID Head;
				/// VByte-encoded deltas.
				std::array<uint8_t, PayloadSize> Payload = std::array<uint8_t, PayloadSize>();
				};
				static_assert(sizeof(Chunk) == 32, "Chunk should take 32 bytes of memory.");

	/// PostingList is the storage of DocIDs which can be inserted to the Query			/// PostingList is the storage of DocIDs which can be inserted to the Query
	/// Tree as a leaf by constructing Iterator over the PostingList object.			/// Tree as a leaf by constructing Iterator over the PostingList object. DocIDs
	// FIXME(kbobyrev): Use VByte algorithm to compress underlying data.			/// are stored in underlying chunks. Compression saves memory at a small cost
				/// in access time, which is still fast enough in practice.
	class PostingList {			class PostingList {
	public:			public:
	explicit PostingList(const std::vector<DocID> &&Documents)			explicit PostingList(llvm::ArrayRef<DocID> Documents);
	: Documents(std::move(Documents)) {}

				/// Constructs DocumentIterator over given posting list. DocumentIterator will
				/// go through the chunks and decompress them on-the-fly when necessary.
	std::unique_ptr<Iterator> iterator() const;			std::unique_ptr<Iterator> iterator() const;

	size_t bytes() const { return Documents.size() * sizeof(DocID); }			/// Returns in-memory size.
				size_t bytes() const {
				return sizeof(Chunk) + Chunks.capacity() * sizeof(Chunk);
				}

	private:			private:
	const std::vector<DocID> Documents;			const std::vector<Chunk> Chunks;
	};			};

	} // namespace dex			} // namespace dex
	} // namespace clangd			} // namespace clangd
	} // namespace clang			} // namespace clang

	#endif // LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_DEX_POSTINGLIST_H			#endif // LLVM_CLANG_TOOLS_EXTRA_CLANGD_INDEX_DEX_POSTINGLIST_H
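
The header above declares `Chunk::decompress()` without showing its body. A minimal sketch of how such a decoder could work is given below, assuming the payload format discussed in the review: little-endian 7-bit groups, high bit set when more bytes follow, and zero bytes as padding after the last gap. It uses `std::vector` where the patch uses `llvm::SmallVector`, assumes `DocID` is a 32-bit unsigned integer, and assumes well-formed input (no encoding longer than 5 bytes); it should not be read as the committed implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Chunk {
  static constexpr size_t PayloadSize = 32 - sizeof(uint32_t);

  uint32_t Head = 0;
  std::array<uint8_t, PayloadSize> Payload{};

  // Returns Head followed by every ID reconstructed from the encoded gaps.
  std::vector<uint32_t> decompress() const {
    std::vector<uint32_t> Result = {Head};
    uint32_t Current = Head;
    size_t I = 0;
    // A zero byte where a new encoding would start marks the padding.
    while (I < PayloadSize && Payload[I] != 0) {
      uint32_t Delta = 0;
      unsigned Shift = 0;
      uint8_t Byte = 0;
      do {
        Byte = Payload[I++];
        Delta |= static_cast<uint32_t>(Byte & 0x7f) << Shift;
        Shift += 7;
      } while ((Byte & 0x80) != 0 && I < PayloadSize);
      Current += Delta;
      Result.push_back(Current);
    }
    return Result;
  }
};
static_assert(sizeof(Chunk) == 32, "Chunk should take 32 bytes of memory.");
```

A zero byte can only appear at the start of an encoding: continuation bytes have the high bit set, and the final byte of a minimal-width encoding carries the most significant non-zero bits. This is why zero works as a terminator once zero gaps are disallowed.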

clang-tools-extra/trunk/clangd/index/dex/PostingList.cpp

	//===--- PostingList.cpp - Symbol identifiers storage interface -----------===//			//===--- PostingList.cpp - Symbol identifiers storage interface -----------===//
	//			//
	// The LLVM Compiler Infrastructure			// The LLVM Compiler Infrastructure
	//			//
	// This file is distributed under the University of Illinois Open Source			// This file is distributed under the University of Illinois Open Source
	// License. See LICENSE.TXT for details.			// License. See LICENSE.TXT for details.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "PostingList.h"			#include "PostingList.h"
	#include "Iterator.h"			#include "Iterator.h"
				#include "llvm/Support/Error.h"
				#include "llvm/Support/MathExtras.h"

	namespace clang {			namespace clang {
	namespace clangd {			namespace clangd {
	namespace dex {			namespace dex {

	namespace {			namespace {

	/// Implements Iterator over std::vector<DocID>. This is the most basic			/// Implements iterator of PostingList chunks. This requires iterating over two
	/// iterator and is simply a wrapper around			/// levels: the first level iterator iterates over the chunks and decompresses
	/// std::vector<DocID>::const_iterator.			/// them on-the-fly when the contents of chunk are to be seen.
	class PlainIterator : public Iterator {			class ChunkIterator : public Iterator {
	public:			public:
	explicit PlainIterator(llvm::ArrayRef<DocID> Documents)			explicit ChunkIterator(llvm::ArrayRef<Chunk> Chunks)
	: Documents(Documents), Index(std::begin(Documents)) {}			: Chunks(Chunks), CurrentChunk(Chunks.begin()) {
				if (!Chunks.empty()) {
				DecompressedChunk = CurrentChunk->decompress();
				CurrentID = DecompressedChunk.begin();
				}
				}

	bool reachedEnd() const override { return Index == std::end(Documents); }			bool reachedEnd() const override { return CurrentChunk == Chunks.end(); }

	/// Advances cursor to the next item.			/// Advances cursor to the next item.
	void advance() override {			void advance() override {
	assert(!reachedEnd() &&			assert(!reachedEnd() &&
	"Posting List iterator can't advance() at the end.");			"Posting List iterator can't advance() at the end.");
	++Index;			++CurrentID;
				normalizeCursor();
	}			}

	/// Applies binary search to advance cursor to the next item with DocID			/// Applies binary search to advance cursor to the next item with DocID
	/// equal or higher than the given one.			/// equal or higher than the given one.
	void advanceTo(DocID ID) override {			void advanceTo(DocID ID) override {
	assert(!reachedEnd() &&			assert(!reachedEnd() &&
	"Posting List iterator can't advance() at the end.");			"Posting List iterator can't advance() at the end.");
	// If current ID is beyond requested one, iterator is already in the right			if (ID <= peek())
	// state.			return;
	if (peek() < ID)			advanceToChunk(ID);
	Index = std::lower_bound(Index, std::end(Documents), ID);			// Try to find ID within current chunk.
				CurrentID = std::lower_bound(CurrentID, std::end(DecompressedChunk), ID);
				normalizeCursor();
	}			}

	DocID peek() const override {			DocID peek() const override {
	assert(!reachedEnd() &&			assert(!reachedEnd() && "Posting List iterator can't peek() at the end.");
	"Posting List iterator can't peek() at the end.");			return *CurrentID;
	return *Index;
	}			}

	float consume() override {			float consume() override {
	assert(!reachedEnd() &&			assert(!reachedEnd() &&
	"Posting List iterator can't consume() at the end.");			"Posting List iterator can't consume() at the end.");
	return DEFAULT_BOOST_SCORE;			return DEFAULT_BOOST_SCORE;
	}			}

	size_t estimateSize() const override { return Documents.size(); }			size_t estimateSize() const override {
				return Chunks.size() * ApproxEntriesPerChunk;
				}

	private:			private:
	llvm::raw_ostream &dump(llvm::raw_ostream &OS) const override {			llvm::raw_ostream &dump(llvm::raw_ostream &OS) const override {
	OS << '[';			OS << '[';
	if (Index != std::end(Documents))			if (CurrentChunk != Chunks.begin() ||
	OS << *Index;			(CurrentID != DecompressedChunk.begin() && !DecompressedChunk.empty()))
	else			OS << "... ";
	OS << "END";			OS << (reachedEnd() ? "END" : std::to_string(*CurrentID));
				if (!reachedEnd() && CurrentID < DecompressedChunk.end() - 1)
				OS << " ...";
	OS << ']';			OS << ']';
	return OS;			return OS;
	}			}

	llvm::ArrayRef<DocID> Documents;			/// If the cursor is at the end of a chunk, place it at the start of the next
	llvm::ArrayRef<DocID>::const_iterator Index;			/// chunk.
				void normalizeCursor() {
				// Invariant is already established if examined chunk is not exhausted.
				if (CurrentID != std::end(DecompressedChunk))
				return;
				// Advance to next chunk if current one is exhausted.
				++CurrentChunk;
				if (CurrentChunk == Chunks.end()) // Reached the end of PostingList.
				return;
				DecompressedChunk = CurrentChunk->decompress();
				CurrentID = DecompressedChunk.begin();
				}

				/// Advances CurrentChunk to the chunk which might contain ID.
				void advanceToChunk(DocID ID) {
				if ((CurrentChunk != Chunks.end() - 1) &&
				((CurrentChunk + 1)->Head <= ID)) {
				// Find the first chunk with Head > ID; the one before it is the last
				// chunk with Head <= ID.
				CurrentChunk = std::lower_bound(
				CurrentChunk + 1, Chunks.end(), ID,
				[](const Chunk &C, const DocID ID) { return C.Head <= ID; });
				--CurrentChunk;
				DecompressedChunk = CurrentChunk->decompress();
				CurrentID = DecompressedChunk.begin();
				}
				}

				llvm::ArrayRef<Chunk> Chunks;
				/// Iterator over chunks.
				/// If CurrentChunk is valid, then DecompressedChunk is
				/// CurrentChunk->decompress() and CurrentID is a valid (non-end) iterator
				/// into it.
				decltype(Chunks)::const_iterator CurrentChunk;
				llvm::SmallVector<DocID, Chunk::PayloadSize + 1> DecompressedChunk;
				/// Iterator over DecompressedChunk.
				decltype(DecompressedChunk)::iterator CurrentID;

				static constexpr size_t ApproxEntriesPerChunk = 15;
	};			};

				static constexpr size_t BitsPerEncodingByte = 7;

				/// Writes a variable length DocID into the buffer and updates the buffer size.
				/// If it doesn't fit, returns false and doesn't write to the buffer.
				bool encodeVByte(DocID Delta, llvm::MutableArrayRef<uint8_t> &Payload) {
				assert(Delta != 0 && "0 is not a valid PostingList delta.");
				// Calculate number of bytes Delta encoding would take by examining the
				// meaningful bits.
				unsigned Width = 1 + llvm::findLastSet(Delta) / BitsPerEncodingByte;
				if (Width > Payload.size())
				return false;

				do {
				uint8_t Encoding = Delta & 0x7f;
				Delta >>= 7;
				Payload.front() = Delta ? Encoding \| 0x80 : Encoding;
				Payload = Payload.drop_front();
				} while (Delta != 0);
				return true;
				}

				/// Use Variable-length Byte (VByte) delta encoding to compress sorted list of
				/// DocIDs. The compression stores deltas (differences) between subsequent
				/// DocIDs and encodes these deltas utilizing the least possible number of
				/// bytes.
				///
				/// Each encoding byte consists of two parts: the first bit (continuation bit)
				/// indicates whether this is the last byte (0 if this byte is the last) of the
				/// current encoding, and the remaining seven bits hold a piece of DocID (payload). DocID contains
				/// 32 bits and therefore it takes up to 5 bytes to encode it (4 full 7-bit
				/// payloads and one 4-bit payload), but in practice it is expected that gaps
				/// (deltas) between subsequent DocIDs are not large enough to require 5 bytes.
				/// In very dense posting lists (with average gaps less than 128) this
				/// representation would be 4 times more efficient than raw DocID array.
				///
				/// PostingList encoding example:
				///
				/// DocIDs 42 47 7000
				/// gaps 5 6958
				/// Encoding (raw number) 00000101 10110110 00101110
				std::vector<Chunk> encodeStream(llvm::ArrayRef<DocID> Documents) {
				assert(!Documents.empty() && "Can't encode empty sequence.");
				std::vector<Chunk> Result;
				Result.emplace_back();
				DocID Last = Result.back().Head = Documents.front();
				llvm::MutableArrayRef<uint8_t> RemainingPayload = Result.back().Payload;
				for (DocID Doc : Documents.drop_front()) {
				if (!encodeVByte(Doc - Last, RemainingPayload)) { // didn't fit, flush chunk
				Result.emplace_back();
				Result.back().Head = Doc;
				RemainingPayload = Result.back().Payload;
				}
				Last = Doc;
				}
				return std::vector<Chunk>(Result); // no move, shrink-to-fit
				}

				/// Reads variable length DocID from the buffer and updates the buffer size. If
				/// the stream is terminated, return None.
				llvm::Optional<DocID> readVByte(llvm::ArrayRef<uint8_t> &Bytes) {
				if (Bytes.empty() \|\| Bytes.front() == 0)
				return llvm::None;
				DocID Result = 0;
				bool HasNextByte = true;
				for (size_t Length = 0; HasNextByte && !Bytes.empty(); ++Length) {
				assert(Length < 5 && "Malformed VByte encoding sequence.");
				// Write meaningful bits to the correct place in the document decoding.
				Result \|= (Bytes.front() & 0x7f) << (BitsPerEncodingByte * Length);
				if ((Bytes.front() & 0x80) == 0)
				HasNextByte = false;
				Bytes = Bytes.drop_front();
				}
				return Result;
				}

	} // namespace			} // namespace

				llvm::SmallVector<DocID, Chunk::PayloadSize + 1> Chunk::decompress() const {
				llvm::SmallVector<DocID, Chunk::PayloadSize + 1> Result{Head};
				llvm::ArrayRef<uint8_t> Bytes(Payload);
				DocID Delta;
				for (DocID Current = Head; !Bytes.empty(); Current += Delta) {
				auto MaybeDelta = readVByte(Bytes);
				if (!MaybeDelta)
				break;
				Delta = *MaybeDelta;
				Result.push_back(Current + Delta);
				}
				return llvm::SmallVector<DocID, Chunk::PayloadSize + 1>{Result};
				}

				PostingList::PostingList(llvm::ArrayRef<DocID> Documents)
				: Chunks(encodeStream(Documents)) {}

	std::unique_ptr<Iterator> PostingList::iterator() const {			std::unique_ptr<Iterator> PostingList::iterator() const {
	return llvm::make_unique<PlainIterator>(Documents);			return llvm::make_unique<ChunkIterator>(Chunks);
	}			}

	} // namespace dex			} // namespace dex
	} // namespace clangd			} // namespace clangd
	} // namespace clang			} // namespace clang

clang-tools-extra/trunk/unittests/clangd/DexTests.cpp

...
TEST(DexIterators, DocumentIterator) {
DocIterator->advanceTo(65);		DocIterator->advanceTo(65);
EXPECT_EQ(DocIterator->peek(), 100U);		EXPECT_EQ(DocIterator->peek(), 100U);
EXPECT_FALSE(DocIterator->reachedEnd());		EXPECT_FALSE(DocIterator->reachedEnd());

DocIterator->advanceTo(420);		DocIterator->advanceTo(420);
EXPECT_TRUE(DocIterator->reachedEnd());		EXPECT_TRUE(DocIterator->reachedEnd());
}		}

TEST(DexIterators, AndWithEmpty) {
const PostingList L0({});
const PostingList L1({0, 5, 7, 10, 42, 320, 9000});

auto AndEmpty = createAnd(L0.iterator());
EXPECT_TRUE(AndEmpty->reachedEnd());

auto AndWithEmpty = createAnd(L0.iterator(), L1.iterator());
EXPECT_TRUE(AndWithEmpty->reachedEnd());

EXPECT_THAT(consumeIDs(*AndWithEmpty), ElementsAre());
}

TEST(DexIterators, AndTwoLists) {		TEST(DexIterators, AndTwoLists) {
const PostingList L0({0, 5, 7, 10, 42, 320, 9000});		const PostingList L0({0, 5, 7, 10, 42, 320, 9000});
const PostingList L1({0, 4, 7, 10, 30, 60, 320, 9000});		const PostingList L1({0, 4, 7, 10, 30, 60, 320, 9000});

auto And = createAnd(L1.iterator(), L0.iterator());		auto And = createAnd(L1.iterator(), L0.iterator());

EXPECT_FALSE(And->reachedEnd());		EXPECT_FALSE(And->reachedEnd());
EXPECT_THAT(consumeIDs(*And), ElementsAre(0U, 7U, 10U, 320U, 9000U));		EXPECT_THAT(consumeIDs(*And), ElementsAre(0U, 7U, 10U, 320U, 9000U));
...
TEST(DexIterators, AndThreeLists) {
EXPECT_EQ(And->peek(), 7U);		EXPECT_EQ(And->peek(), 7U);
And->advanceTo(300);		And->advanceTo(300);
EXPECT_EQ(And->peek(), 320U);		EXPECT_EQ(And->peek(), 320U);
And->advanceTo(100000);		And->advanceTo(100000);

EXPECT_TRUE(And->reachedEnd());		EXPECT_TRUE(And->reachedEnd());
}		}

TEST(DexIterators, OrWithEmpty) {
const PostingList L0({});
const PostingList L1({0, 5, 7, 10, 42, 320, 9000});

auto OrEmpty = createOr(L0.iterator());
EXPECT_TRUE(OrEmpty->reachedEnd());

auto OrWithEmpty = createOr(L0.iterator(), L1.iterator());
EXPECT_FALSE(OrWithEmpty->reachedEnd());

EXPECT_THAT(consumeIDs(*OrWithEmpty),
ElementsAre(0U, 5U, 7U, 10U, 42U, 320U, 9000U));
}

TEST(DexIterators, OrTwoLists) {		TEST(DexIterators, OrTwoLists) {
const PostingList L0({0, 5, 7, 10, 42, 320, 9000});		const PostingList L0({0, 5, 7, 10, 42, 320, 9000});
const PostingList L1({0, 4, 7, 10, 30, 60, 320, 9000});		const PostingList L1({0, 4, 7, 10, 30, 60, 320, 9000});

auto Or = createOr(L0.iterator(), L1.iterator());		auto Or = createOr(L0.iterator(), L1.iterator());

EXPECT_FALSE(Or->reachedEnd());		EXPECT_FALSE(Or->reachedEnd());
EXPECT_EQ(Or->peek(), 0U);		EXPECT_EQ(Or->peek(), 0U);
...
TEST(DexIterators, QueryTree) {
// \|		// \|
// +-------------+----------------------+		// +-------------+----------------------+
// \| \|		// \| \|
// \| \|		// \| \|
// +----------v----------+ +----------v------------+		// +----------v----------+ +----------v------------+
// \|And Iterator: 1, 5, 9\| \|Or Iterator: 0, 1, 3, 5\|		// \|And Iterator: 1, 5, 9\| \|Or Iterator: 0, 1, 3, 5\|
// +----------+----------+ +----------+------------+		// +----------+----------+ +----------+------------+
// \| \|		// \| \|
// +------+-----+ +---------------------+		// +------+-----+ ------------+
// \| \| \| \| \|		// \| \| \| \|
// +-------v-----+ +----+---+ +--v--+ +---v----+ +----v---+		// +-------v-----+ +----+---+ +---v----+ +----v---+
// \|1, 3, 5, 8, 9\| \|Boost: 2\| \|Empty\| \|Boost: 3\| \|Boost: 4\|		// \|1, 3, 5, 8, 9\| \|Boost: 2\| \|Boost: 3\| \|Boost: 4\|
// +-------------+ +----+---+ +-----+ +---+----+ +----+---+		// +-------------+ +----+---+ +---+----+ +----+---+
// \| \| \|		// \| \| \|
// +----v-----+ +-v--+ +---v---+		// +----v-----+ +-v--+ +---v---+
// \|1, 5, 7, 9\| \|1, 5\| \|0, 3, 5\|		// \|1, 5, 7, 9\| \|1, 5\| \|0, 3, 5\|
// +----------+ +----+ +-------+		// +----------+ +----+ +-------+
//		//
const PostingList L0({1, 3, 5, 8, 9});		const PostingList L0({1, 3, 5, 8, 9});
const PostingList L1({1, 5, 7, 9});		const PostingList L1({1, 5, 7, 9});
const PostingList L3({});		const PostingList L2({1, 5});
const PostingList L4({1, 5});		const PostingList L3({0, 3, 5});
const PostingList L5({0, 3, 5});

// Root of the query tree: [1, 5]		// Root of the query tree: [1, 5]
auto Root = createAnd(		auto Root = createAnd(
// Lower And Iterator: [1, 5, 9]		// Lower And Iterator: [1, 5, 9]
createAnd(L0.iterator(), createBoost(L1.iterator(), 2U)),		createAnd(L0.iterator(), createBoost(L1.iterator(), 2U)),
// Lower Or Iterator: [0, 1, 5]		// Lower Or Iterator: [0, 1, 5]
createOr(L3.iterator(), createBoost(L4.iterator(), 3U),		createOr(createBoost(L2.iterator(), 3U), createBoost(L3.iterator(), 4U)));
createBoost(L5.iterator(), 4U)));

EXPECT_FALSE(Root->reachedEnd());		EXPECT_FALSE(Root->reachedEnd());
EXPECT_EQ(Root->peek(), 1U);		EXPECT_EQ(Root->peek(), 1U);
Root->advanceTo(0);		Root->advanceTo(0);
// Advance multiple times. Shouldn't do anything.		// Advance multiple times. Shouldn't do anything.
Root->advanceTo(1);		Root->advanceTo(1);
Root->advanceTo(0);		Root->advanceTo(0);
EXPECT_EQ(Root->peek(), 1U);		EXPECT_EQ(Root->peek(), 1U);
...
}		}

TEST(DexIterators, StringRepresentation) {		TEST(DexIterators, StringRepresentation) {
const PostingList L0({4, 7, 8, 20, 42, 100});		const PostingList L0({4, 7, 8, 20, 42, 100});
const PostingList L1({1, 3, 5, 8, 9});		const PostingList L1({1, 3, 5, 8, 9});
const PostingList L2({1, 5, 7, 9});		const PostingList L2({1, 5, 7, 9});
const PostingList L3({0, 5});		const PostingList L3({0, 5});
const PostingList L4({0, 1, 5});		const PostingList L4({0, 1, 5});
const PostingList L5({});

EXPECT_EQ(llvm::to_string(*(L0.iterator())), "[4]");

auto Nested =
createAnd(createAnd(L1.iterator(), L2.iterator()),
createOr(L3.iterator(), L4.iterator(), L5.iterator()));

EXPECT_EQ(llvm::to_string(*Nested), "(& (\| [5] [1] [END]) (& [1] [1]))");		EXPECT_EQ(llvm::to_string(*(L0.iterator())), "[4 ...]");
		auto It = L0.iterator();
		It->advanceTo(19);
		EXPECT_EQ(llvm::to_string(*It), "[... 20 ...]");
		It->advanceTo(9000);
		EXPECT_EQ(llvm::to_string(*It), "[... END]");
}		}

TEST(DexIterators, Limit) {		TEST(DexIterators, Limit) {
const PostingList L0({3, 6, 7, 20, 42, 100});		const PostingList L0({3, 6, 7, 20, 42, 100});
const PostingList L1({1, 3, 5, 6, 7, 30, 100});		const PostingList L1({1, 3, 5, 6, 7, 30, 100});
const PostingList L2({0, 3, 5, 7, 8, 100});		const PostingList L2({0, 3, 5, 7, 8, 100});

auto DocIterator = createLimit(L0.iterator(), 42);		auto DocIterator = createLimit(L0.iterator(), 42);
...