This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Encode Line/Column as a 32-bits integer.
ClosedPublic

Authored by hokein on Oct 17 2018, 3:37 AM.

Details

Summary

This would buy us more memory. Using a 32-bits integer is enough for
most human-readable source code (up to 1M lines and 4K columns).

Previsouly, we used 8 bytes for a position, now 4 bytes, it would save
us 8 bytes for each Ref and each Symbol instance.

For LLVM-project binary index file, we save ~13% memory.

BeforeAfter
412MB355MB

Diff Detail

Event Timeline

hokein created this revision.Oct 17 2018, 3:37 AM

(I think your math is off in the description: 20 bits should be 1M lines, not 4M)
I think this is a win, as I think truncation will be rare and not terrible. We should document the intentions around truncation though.

Incidentally, this means replacing just the StringRef in SymbolLocation::Position with a char* would save another 13%, because that's also 8 bytes.
(Obviously we'd probably replace the other StringRefs too, but it gives a lower bound).

clangd/index/Index.h
37

I think it's enough to say "position is packed into 32 bits to save space".

It'd be useful to spell out the consequences in a second line.

40–50

(not sure these constants are needed as it stands - YAML shouldn't use them, see below)

52

If we overflow the columns, it would be much easier to recognize if the result is always e.g. 255, rather than an essentially random number from 0-255 (from modulo 256 arithmetic).

This would mean replacing the raw fields with setters/getters, which is obviously a more invasive change. What about a compromise: add the setters, and call them from the most important codepaths: SymbolCollector and Serialization.

clangd/index/YAMLSerialization.cpp
108

I don't think there's any reason to pack the YAML encoding.

hokein edited the summary of this revision. (Show Details)Oct 17 2018, 6:24 AM
hokein updated this revision to Diff 170001.Oct 17 2018, 6:30 AM
hokein marked 3 inline comments as done.

Address review comments:

  • handle overflowed cases, and added tests
  • add getter/setters for line/column and clear all call sides

(I think your math is off in the description: 20 bits should be 1M lines, not 4M)

Oops...Update the desccription.

I think this is a win, as I think truncation will be rare and not terrible. We should document the intentions around truncation though.

Incidentally, this means replacing just the StringRef in SymbolLocation::Position with a char* would save another 13%, because that's also 8 bytes.

Yeah, I have a rough patch for it, using char* will save us ~50MB memory, which will lead to ~300 MB memory usage in total.

(Obviously we'd probably replace the other StringRefs too, but it gives a lower bound).

clangd/index/Index.h
40–50

I think we need, for testing, for setters, rather than using magic number.

52

It seems reasonable to use maximum value if we encounter overflows.

This would mean replacing the raw fields with setters/getters, which is obviously a more invasive change. What about a compromise: add the setters, and call them from the most important codepaths: SymbolCollector and Serialization.

Actually, there is not too much usage of the raw fields in the source code, I'd prefer to use all setters/getters in all places (and it would make the code consistent and more safe). How about this plan?

  1. I update the patch to change all raw fields usage to getters/setters, and keep these raw fields public
  2. do a cleanup internally
  3. make these raw fields private
clangd/index/YAMLSerialization.cpp
108

The YAML here is a bit tricky, the previous code could not compile because we can not bind bit-field to Non-const references, I added another traits to keep the original YAML encoding (which is more readable).

sammccall accepted this revision.Oct 17 2018, 11:46 PM

Nice, just nits!

Yeah, I have a rough patch for it, using char* will save us ~50MB memory, which will lead to ~300 MB memory usage in total.

For just the StringRef in SymbolLocation::Position, or for all our strings?
If the latter, that seems too low, as we should save strictly more than this patch (unless I'm missing something about alignment/padding)

clangd/index/Index.h
38

nit: no "is", overflow is a verb

40–50

In that case I'd suggest static constexpr MaxLine = 1 << 20 - 1, MaxColumn = 1 << 12 - 1, as that's what the tests (and potentially some warning checks in the code later?) need, so we encapsulate a little more detail. But up to you.

(Writing 20 twice a couple of lines apart seems fine to me)

41

nit: define the trivial getters here so the compiler can always inline them.
(Setters seem fine out-of-line)

52

SGTM!

66–67

Why only using the getters on one side?

clangd/index/YAMLSerialization.cpp
108

I don't think specializing the traits for pair<uint32,uint32> is OK, specializing for common types we don't own is asking for ODR violations.

prefer to define an custom struct for this (in fact you do, so just map that struct instead of struct->P?)

111–129

please document why this normalization (YamlIO can't directly map bitfields)

116

no need for this assert

This revision is now accepted and ready to land.Oct 17 2018, 11:46 PM
hokein updated this revision to Diff 170053.Oct 18 2018, 2:10 AM
hokein marked 6 inline comments as done.

Address review comments.

Yeah, I have a rough patch for it, using char* will save us ~50MB memory, which will lead to ~300 MB memory usage in total.

For just the StringRef in SymbolLocation::Position, or for all our strings?
If the latter, that seems too low, as we should save strictly more than this patch (unless I'm missing something about alignment/padding)

This is just for the StringRef in Position, I think it matches our estimation.

More details coming:

  • 6.4 million refs, 6.4m * 40 bytes = ~250MB
  • 0.3 million symbols, 0.3m * 248 bytes = ~70MB

The original size of Ref (before any space optimization) is 40 bytes, we save 8 bytes per refs, 8*6.5 = ~50 MB; If we change all StringRef in Symbol to char*, which will save 64 bytes (8*8) per sym, 64*0.3v = ~19 MB, which doesn't gain as much as refs, as refs contributes more than 70% memory.

clangd/index/Index.h
66–67

Oops, I mis-thought the type of the right side LSP position, fixed.

This revision was automatically updated to reflect the committed changes.