This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Add line and column number to the index symbol.
ClosedPublic

Authored by hokein on Apr 11 2018, 2:19 AM.

Details

Summary

LSP is using Line & column as symbol position, clangd needs to transfer file
offset to Line & column when sending results back to LSP client, which is a high
cost, especially for finding workspace symbol -- we have to read the file
content from disk (if it isn't loaded in memory).

Saving these information in the index will make the clangd life eaiser.

Diff Detail

Event Timeline

hokein created this revision.Apr 11 2018, 2:19 AM
sammccall added inline comments.Apr 11 2018, 2:55 AM
clangd/index/Index.h
27

There are 4 of these per symbol, if we can keep line + character to 32 bits we save 16 bytes per symbol.

That looks like it might still be ~10% of non-string size. WDYT?
(12 bits for column and 20 bits for line seems like a reasonable guess)

28

These comments seem pretty obvious, I think
int Line=0, Character=0 // zero-based is enough, but up to you.

I would like a comment explaining why we're storing these redundant representations though.
e.g. // Storing Line/Character allows us to build LSP responses without reading the file content.

32

Column?

LSP calls this "character" but this is nonstandard and I find it very confusing with offset.

39

I don't think we should remove them, we'll just have the same problem in reverse.
Could position have have line/col/offset, and have Position Start, End?

clangd/index/SymbolCollector.cpp
205

I don't think this works, tokens can be split across lines.

I believe you want to compute NameLoc.locWithOffset(TokenLength) and then decompose that into line/col.
(getLocForEndOfToken, confusingly, is different)

hokein updated this revision to Diff 142170.Apr 12 2018, 7:20 AM
hokein marked 5 inline comments as done.

Address review comments.

hokein added inline comments.Apr 12 2018, 7:21 AM
clangd/index/Index.h
39

As discussed offline, we decide to remove them as we don't have real use case of them (we could add them back when needed).
I removed all the references of them in clangd. And remove these fields in a separate patch.

clangd/index/SymbolCollector.cpp
205

Good catch. Done, added a test.

sammccall added inline comments.Apr 12 2018, 7:27 AM
clangd/index/Index.h
27

after some offline discussion, I no longer think this is a good idea.
We should strive to keep memory usage reasonable with the current naive index implementation, but "giant vector of Symbol structs in memory" isn't a case to micro-optimize for in the long run.

32

We should document what this is an offset into: bytes, utf-16 code units, or unicode codepoints. (Or even grid offsets - glyphs and doublewidth are a thing)

Given that we intend to send it over LSP without reading the source, only utf-16 code units is really correct. Unicode codepoints is "nicer" and will give correct results in the BMP, while bytes will be correct for ASCII only.

I'd vote for making this utf-16 code units.

It's OK if the code populating it doesn't get this right (confuses bytes and code units) but add a fixme.

39

After offline discussion: we don't actually plan to do math on these ever, we just send them to LSP clients.
So removing sounds fine. We can add later if there are clear use cases.

45

Not sure about the abbreviation - maybe just start/end? Since offsets are going away.

sammccall accepted this revision.Apr 12 2018, 8:26 AM

LG, with

  • consider reverting the bitpacking stuff
  • comment about utf-16
  • clang-format :)
clangd/index/SymbolCollector.cpp
202

nit: SourceManager is 1-based (or returns 1-based data here).
SourceLocation uses 0-based offsets, not 1-based line/column.

This revision is now accepted and ready to land.Apr 12 2018, 8:26 AM
hokein updated this revision to Diff 142195.Apr 12 2018, 8:48 AM
hokein marked 3 inline comments as done.

Update the patch, address remaining issues.

clangd/index/Index.h
32

Done. Added FIXME.

MaskRay added inline comments.Apr 12 2018, 2:51 PM
clangd/index/Index.h
32

I'd vote for Unicode code points.

I haven't looked into this closely. But UTF-8 vs UTF-16 vs Unicode code points should not be a big issue here. Unless you use emojis or some uncommon characters, the usage of UTF-16 code units in LSP does not have a lot of harm.

//  ๐Ÿ˜น๐Ÿ˜น๐Ÿ‘ป anything weird hidden in line comments can be ignored because they don't affect offset calculation

And Microsoft might change the spec one day https://github.com/Microsoft/language-server-protocol/issues/376

Just my 2 cents. Calculation of line/character for each occurrence may not take a lot of computation. cquery/ccls computes the

clangd/index/Index.h
39

Yes. StartOffset and EndOffset should be removed some day. a line->offset mapping should be sufficient for documents that have stale index.

hokein added inline comments.Apr 13 2018, 1:10 AM
clangd/index/Index.h
32

Yeah, It'd be nicer if LSP spec is changed to use UTF-8. The Column is intended to align with Character of LSP's Position, let's keep it as it is at the moment.

This revision was automatically updated to reflect the committed changes.