This is an archive of the discontinued LLVM Phabricator instance.

[Serialization] Improve encoding of small SourceRanges.
AcceptedPublic

Authored by sammccall on Apr 25 2022, 4:17 PM.

Details

Reviewers
ilya-biryukov
Summary

Instead of encoding as (start, end), encode as (start, end-start).
This is smaller in the common case when:

  • we store the values as VBR
  • start and end are in the same file ID and nearby

This reduces clangd/AST.cpp PCH size by ~0.75%.

Diff Detail

Event Timeline

sammccall created this revision.Apr 25 2022, 4:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 25 2022, 4:17 PM
sammccall requested review of this revision.Apr 25 2022, 4:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 25 2022, 4:17 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Bad news: this isn't ready, there are places where ASTWriter calls AddSourceLocation() twice and ASTReader calls ReadSourceRange().
After this change, those are no longer compatible so we have to make them match, and have to audit them all.

Good news, there's a bunch of potential places where we can replace AddSourceLocation() pairs with AddSourceRange for further savings.
Maybe even replacing other "chains" of source locations with delta-encoding, though I might be getting carried away...

sammccall updated this revision to Diff 425089.Apr 25 2022, 6:18 PM

Switch reader/writer to use sourcerange encoding for all "obvious" ranges.

Gain is now 1.75% in my test PCH.

ilya-biryukov accepted this revision.Apr 26 2022, 1:33 AM

Overall LGTM, but I didn't look closely if there are cases that break now.
Do we have any tests to check this?

This revision is now accepted and ready to land.Apr 26 2022, 1:33 AM

Overall LGTM, but I didn't look closely if there are cases that break now.
Do we have any tests to check this?

Serialization tests do exist, though I think coverage isn't amazing.
I'd have a reasonable level of confidence through:

  • the fact that I made the changes systematically rather than finding them through test failures
  • I'll go through and audit all the changes myself again (the failures I did have after finishing this were mostly typos)
  • using it locally for a while (though this will only cover usual C++ nodes)

There is a limited & self-defeating nature of this patch.
To have a clear idiom, we only encode SourceRange when begin/end pair are directly stored.
Not all SourceLocations are paired, e.g. ForStmt has [KW, LParen, RParen] and we benefit from delta-encoding only once.
In the most common types of nodes, there tends to be more complexity around exactly which SourceLocations need to be stored for size reasons. (Probably size of memory representation, but this is 1:1 with serialized). So I suspect the opportunities missed are disproportionately valuable as well.

Before landing this, I'd like to try an approach of compressing all SourceLocations within a record as a sequence.
Something with a state machine like:

  • at the start of each record, Current = 0
  • for each Loc:
    • if Loc is far from current, encode as 33-bit integer Loc<<1 | 1
    • if Loc is close to current, encode as abs(Loc) << 2 | sign(Loc) << 1 | 0
    • if Loc is nonzero, Current = Loc

This would guarantee efficient representations of nearby and null sourcelocations, while regressing random locations by 1 bit only.

This can be done explicitly with an API like: SrcLocSeq Locs(Stream); Locs.Add(Loc1); Locs.Add(Loc2);, or implicitly by changing the behavior of AddSourceLocation. The latter seems easier to try, and would require more machinery but fewer local changes than this patch.

Let me try to put something together to see what the gains are...

Yeah, that sounds promising. Various source locations inside the same nodes are probably very close. Probably even across nodes.
If we are willing to take on a bit more complexity, we could probably get even bigger wins by storing all source locations in a single array having indices into that array in the actual positions where we need the source locations. That way we can have:

  • get the benefits of delta-encoding in the source locations array, probably even across nodes,
  • probably still get quite good encoding for indices inside this array inside the actual AST nodes. Will definitely get those if we delta-encode the indices inside the same node.

I suspect there are requirements that will force us to store the array in chunks and allow searching for a particular chunk.

However, that's really adding quite a bit of complexity and I'm not sure preamble sizes are such a big deal outside our environment. E.g. even 10% savings are borderline for that kind of complexity.

Yeah, that sounds promising. Various source locations inside the same nodes are probably very close. Probably even across nodes.

Yeah. The reason to avoid compressing across nodes is it breaks lazy-loading. (Also I'm not sure whether related decls end up proximate in the output). Maybe there's a way around it, but this is much lower-hanging I think.

If we are willing to take on a bit more complexity, we could probably get even bigger wins by storing all source locations in a single array having indices into that array in the actual positions where we need the source locations.

This is an interesting idea, I wonder whether it will be a win.

The delta-encodings are efficient as you note, but we have to represent SourceLocations twice. (value + index).
I think the #bits we save storing indices vs SourceLocations is similar to the #bits we need to encode the deltas.
If these cancel out then the savings we get are from deduplicating SourceLocations in the array - I suspect there aren't actually that many duplicates, we'd be lucky to halve the array size.

Rough example with made-up numbers:

  • today we store raw SourceLocations, N * ~28 bits
  • maybe simple delta-encoding maybe reduces this by 8 to N * 20 bits average
  • sorted SourceLocation array (no delta-encoding, no deduplication) is N * 28, indices maybe N * 23 (assuming distance of 2^5 between sourcelocations)
  • delta-encoding the array means it takes conservatively N * 7 bits (5 bits average delta + VBR overhead + variance)
  • delta-encoding the indices yields the same 8 bits as before => N * 15
  • deduplcating the halves the entries and reduces the size of indices by 1 bit
  • this yields N/2*7 for the array + N * 14 for indices = N * 17.5

So we come out ahead (20->17.5) if deduplication halves #locations. If it reduces by a third, it's 20->19 instead.
This seems likely to be an improvement, over simple delta, but much smaller than delta vs status quo.

I suspect there are requirements that will force us to store the array in chunks and allow searching for a particular chunk.

However, that's really adding quite a bit of complexity and I'm not sure preamble sizes are such a big deal outside our environment. E.g. even 10% savings are borderline for that kind of complexity.

Yeah, I don't plan to pursue this for now because the complexity/reward scares me, but if we get really squeezed for size we could revisit.