Instead of encoding as (start, end), encode as (start, end-start).
This is smaller in the common case when:
- we store the values as VBR
- start and end are in the same file ID and nearby
This reduces clangd/AST.cpp PCH size by ~0.75%.
Differential D124422
[Serialization] Improve encoding of small SourceRanges. sammccall on Apr 25 2022, 4:17 PM. Authored by
Details
Instead of encoding as (start, end), encode as (start, end-start).
This reduces clangd/AST.cpp PCH size by ~0.75%.
Diff Detail
Event TimelineComment Actions Bad news: this isn't ready, there are places where ASTWriter calls AddSourceLocation() twice and ASTReader calls ReadSourceRange(). Good news, there's a bunch of potential places where we can replace AddSourceLocation() pairs with AddSourceRange for further savings. Comment Actions Switch reader/writer to use sourcerange encoding for all "obvious" ranges. Gain is now 1.75% in my test PCH. Comment Actions Overall LGTM, but I didn't look closely if there are cases that break now. Comment Actions Serialization tests do exist, though I think coverage isn't amazing.
There is a limited & self-defeating nature of this patch. Before landing this, I'd like to try an approach of compressing all SourceLocations within a record as a sequence.
This would guarantee efficient representations of nearby and null sourcelocations, while regressing random locations by 1 bit only. This can be done explicitly with an API like: SrcLocSeq Locs(Stream); Locs.Add(Loc1); Locs.Add(Loc2);, or implicitly by changing the behavior of AddSourceLocation. The latter seems easier to try, and would require more machinery but fewer local changes than this patch. Let me try to put something together to see what the gains are... Comment Actions Yeah, that sounds promising. Various source locations inside the same nodes are probably very close. Probably even across nodes.
I suspect there are requirements that will force us to store the array in chunks and allow searching for a particular chunk. However, that's really adding quite a bit of complexity and I'm not sure preamble sizes are such a big deal outside our environment. E.g. even 10% savings are borderline for that kind of complexity. Comment Actions Yeah. The reason to avoid compressing across nodes is it breaks lazy-loading. (Also I'm not sure whether related decls end up proximate in the output). Maybe there's a way around it, but this is much lower-hanging I think.
This is an interesting idea, I wonder whether it will be a win. The delta-encodings are efficient as you note, but we have to represent SourceLocations twice. (value + index). Rough example with made-up numbers:
So we come out ahead (20->17.5) if deduplication halves #locations. If it reduces by a third, it's 20->19 instead.
Yeah, I don't plan to pursue this for now because the complexity/reward scares me, but if we get really squeezed for size we could revisit. |