- With the current implementation, sizeof(std::vector<Chunk>) is added twice to the Dex memory estimate which is incorrect
- Dex logs memory usage estimation before BackingDataSize is set and hence the log report excludes size of the external SymbolSlab which is coupled with Dex instance
Details
Diff Detail
Event Timeline
clang-tools-extra/clangd/index/dex/Dex.cpp | ||
---|---|---|
248–250 | Would InvertedIndex.getMemorySize() be a better estimate? |
clang-tools-extra/clangd/index/dex/Dex.cpp | ||
---|---|---|
248–250 | IIUC this is only precise for the objects which do not make any allocations (e.g. std::vector and std::string memory estimate would not be "correct"). From the documentation
I also have Bytes += InvertedIndex.getMemorySize(); above, so the purpose of this code would be to take into account the size of these "reference objects". However, since PostingList::bytes() also returns underlying std::vector size, this code will probably add these std::vector objects size twice (the first one was in InvertedIndex.getMemorySize()). I should fix that. |
clang-tools-extra/clangd/index/dex/Dex.cpp | ||
---|---|---|
248–250 |
This applies to *pointers* to objects, but the PostingLists in InvertedList are not pointerd? So it seems to me that InvertedIndex.getMemorySize() covers everything in the InvertedIndex. One way to verify is probably check the size calculated with loop against InvertedIndex.getMemorySize(). |
- Prevent sizeof(std::vector<Chunk>) to be counted twice in the memory estimates
- Pull memory usage logging to the caller side: it is currently incorrect because Dex::BackingDataSize is not set by the time vlog(..., estimateMemoryUsage()) is called and hence the log only includes size of the index overhead
I will remove these changes from D52047 and hopefully make it easier to review.
clang-tools-extra/clangd/index/dex/Dex.cpp | ||
---|---|---|
248–250 | As discussed offline, pointers and references are an example of objects which would be incorrectly measured by DesneMap::getMemorySize(), but std::vector and std::string are pointers in a way because they also just store a pointer to the underlying storage which is hidden from DenseMap::getMemorySize(); thus, it would be more precise to use the proposed scheme for memory estimation. |
As discussed offline, using std::string::size() would not be a precise estimate of std::string size, therefore I'm excluding Token::Data size estimate.
clang-tools-extra/clangd/index/dex/Dex.cpp | ||
---|---|---|
248–250 | I see. I was expecting DenseMap to maintain an arena which could provide memory usage of all owned data, but it seems to only consider sizeof(KV-Pair)... If the posting lists (vectors) are not covered, shouldn't we also also add the size of tokens (string) here? |
clang-tools-extra/clangd/index/dex/Dex.cpp | ||
---|---|---|
248–250 | Yes, I was also adding TokenToPostingList.first.Data.size() to the size estimate in the previous revision snapshot, but as @sammccall pointed out, there are multiple string implementations and one of them (IIUC this one is currently pushed in the C++ Standard) utilizes Small Object Optimization (i.e. it behaves like llvm::SmallVector rather than std::vector). Therefore, it does not perform external allocations before some limit is reached. I don't know whether it is possible to know whether the external allocation happened or not and the summary of our discussion with Sam was that we should just omit this. |
Would InvertedIndex.getMemorySize() be a better estimate?