- User Since
- Jul 19 2018, 7:14 AM (35 w, 8 h)
Feb 17 2019
Jan 28 2019
Jan 26 2019
Remove empty Clang SA section for now.
Jan 23 2019
@aprantl if there is still some work left for GSoC/potential contributors, feel free to mention it and then it should be just updated for the future. I couldn't infer whether there is still a chunk of work to be done from the project report, but you might have more information.
Nov 2 2018
The patch is good to go!
Will probably come back to it at some point, the experiments didn't show any significant improvement.
Nov 1 2018
Oct 10 2018
Oct 7 2018
Rebase on top of HEAD.
Sep 28 2018
Also, I'm not sure whether I should update Clang-Tidy and other tools in the scope of this patch. It makes sense to me, but I don't know whether the maintainers of these projects are happy with the change. WDYT?
Sep 27 2018
Also, regarding error handling and llvm::Option vs llvm::Expected: I think the case where the check most likely wouldn't be able to provide useful diagnostics and perform enough analysis is when there are macro expansions within inspected statement SourceRange. It might make sense to completely disable it in this case both because it's hard to do anything about the range transformation and because it seems to be hard to analyze this case. E.g. if the whole statement is expanded from a single macro and then the check would report on every macro usage (while the actual problem is in the macro itself). I don't know whether the check should support macros at all, it might make sense to mention this in the documentation and add few tests if we decide to go this way.
A lot of good improvements and many test cases, thank you!
Pass data from I/O to readIndexFile(StringRef).
Sep 26 2018
Simplify the documentation format.
For anyone interested in the direction of posting list compression, an implementation of Variable length Byte compression (VByte) has landed: D52300.
This is no longer needed then.
As discussed offline, using std::string::size() would not be a precise estimate of std::string size, therefore I'm excluding Token::Data size estimate.
Don't include misc changes elsewhere: focus on adding more benchmarks in this revision.
- Prevent sizeof(std::vector<Chunk>) to be counted twice in the memory estimates
- Pull memory usage logging to the caller side: it is currently incorrect because Dex::BackingDataSize is not set by the time vlog(..., estimateMemoryUsage()) is called and hence the log only includes size of the index overhead
Sep 25 2018
Ah, I see. Thanks for clarifying! I was wondering if we should also rename -yaml-symbol-file flag to something like -static-symbol-file (but, as you said, it's something to do in another patch).
Address post-LG comments, remove fuzzer.
Address a round of comments, fallback to std::vector.
- Simplify code
- Disallow empty PostingLists and update tests
Sep 21 2018
I think this one is addressed in the VByte patch, so I'm closing this revision in order to prevent conflicts between these two.
- Be more explicit about the nature of "benchmarks" with memory tracking logic via State::SetLabel(...).
- Force single iteration for "benchmarks" with memory usage tracking
- Add another "benchmark" with Dex memory overhead over plain SymbolSlab
RestirctForCodeCompletion and !RestrictForCodeCompletion are not mutually exclusive.
Use llvm::Expected<SymbolSlab>, cleanup the patch.
I addressed the easiest issues. I'll try to implement separate storage structure for Heads and Payloads which would potentially make the implementation cleaner and easier to understand (and also more maintainable since that would be way easier to go for SIMD instructions speedups and other encoding schemes if we do that).
@sammccall thank you for the comments, I'll improve it. @ilya-biryukov also provided valuable feedback and suggested that the code looks complicated now. We also discussed different compression approach which would require storing Heads and Payloads separately so that binary search over Heads could have better cache locality. That can dramatically improve performance.
Sep 20 2018
Sorry, I didn't get time to review the patch properly, these are few stylistic comments. Hopefully, I'll be able to give more feedback when I get more time.
- Update unit tests with iterator tree string representation to comply with the new format
- Don't mark constructor explicit (previously it only had one parameter)
- Fix Limits explanation comment (ID > Limits[I] -> ID >= Limits[I])
Remove BuildMem benchmark, which collects data about MemIndex building time (which is essentially nothing and therefore not really interesting).
Add benchmark for SymbolSlab loading from file. This might be useful for RIFF/YAML symbol loader optimizations.
Add symbol index building benchmarks, split loadIndex() into symbolsFromFile() and actual index build.
I thought that we might not want to produce too much information, but I don't have anything against printing the JSON representation of every user request.
Sep 18 2018
Sorry, I missed the patch when it was uploaded. It might make sense to mention the upstream revision in README.LLVM (within benchmark folder in the LLVM source tree), so that it's easy to understand that this change was also applied upstream. I can do that, though, no worries.
I agree; There is another issue that I was looking into: I think that it might be useful to understand the structure of fuzzy find query when using dexp tool and I thought that it would be great if we could dump the iterator tree structure along with the results (which is an extension of D52233). For dexp usecase, it would be great to dump the size and the origin of each piece of iterator tree (e.g. labels), but I also think that it might be useful to have "debugging" format so I couldn't figure out what's the best approach here.
I thought we might be replacing this soon. Of course there's no fundamental reason we can't keep both but I'm curious why this is an area of focus :-)
Yes, we are. Initially, this wasn't an area of focus: I just forgot to update the comment in Iterator.h when moving PostingList to a separate file and slightly changing the format, but then Eric had a good proposal and I thought that it's a good improvement. Also, even though the implementation will be different, the dumping format could be the same, so it's not really implementation-specific.
@ioeric does this format look better?
Rebase on top of master, move logging to symbol index build() caller side.
Sep 17 2018
Sep 14 2018
Fixed the bug with incorrect assumption of Children being sorted (instead of being a heap). This caused a huge overhead (~30%). When I iterate through all children in consume() (like before) it's also worse (~18%).
Move vlog message to the outer build(...) function: otherwise BackingMemorySize is not set to the correct value and log reports index overhead (instead of the complete index + slab) size.
- Start measuring time in ms
- Add Tokens' Data size for more precise memory usage estimation (accounts for ~1MB of Static Index in LLVM)
Sep 13 2018
Pull DocID to Iterator.h.
Wording: traversed *in order*.