This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Fix unicode handling, using UTF-16 where LSP requires it.
ClosedPublic

Authored by sammccall on Apr 24 2018, 4:26 PM.

Details

Summary

The Language Server Protocol unfortunately mandates that locations in files
be represented by line/column pairs, where the "column" is actually an index
into the UTF-16-encoded text of the line.
(This is because VSCode is written in JavaScript, which is UTF-16-native).

Internally clangd treats source files as UTF-8, the One True Encoding, and
generally deals with byte offsets (though there are exceptions).

Before this patch, conversions between offsets and LSP Position pretended
that Position.character was UTF-8 bytes, which is only true for ASCII lines.
Now we examine the text to convert correctly (but don't actually need to
transcode it, due to some nice details of the encodings).
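
The encoding detail being relied on: every 1-, 2-, or 3-byte UTF-8 sequence encodes a codepoint in the Basic Multilingual Plane (one UTF-16 code unit), while every 4-byte sequence encodes a supplementary codepoint (a surrogate pair, i.e. two code units). A minimal sketch of the counting trick, using a hypothetical utf16Length helper rather than the actual SourceCode.cpp code (which iterates codepoints through a callback):

    #include <string_view>

    // Sketch only (hypothetical helper): the UTF-16 width of each codepoint
    // can be read straight off its UTF-8 lead byte, so no transcoding is
    // needed.
    static size_t utf16Length(std::string_view UTF8) {
      size_t Units = 0;
      for (unsigned char C : UTF8) {
        if ((C & 0xC0) == 0x80)
          continue;                   // continuation byte, already counted
        Units += (C >= 0xF0) ? 2 : 1; // 4-byte lead byte => surrogate pair
      }
      return Units;
    }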

The updated functions in SourceCode are the blessed way to interact with
the Position.character field, and anything else is likely to be wrong.
So I also updated the other accesses:

  • CodeComplete needs a "clang-style" line/column, with column in UTF-8 bytes. This is now converted via Position -> offset -> clang line/column (a new function is added to SourceCode.h for the second conversion; see the sketch after this list).
  • getBeginningOfIdentifier skipped backwards in UTF-16 space, which will behave badly when it splits a surrogate pair. Skipping backwards in UTF-8 coordinates gives the lexer a fighting chance of getting this right. While here, I clarified(?) the logic comments, fixed a bug with identifiers containing digits, simplified the signature slightly, and added a test.
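
A rough sketch of the conversion chain from the first bullet. The helper name and signature below are assumptions for illustration, not necessarily what the patch adds to SourceCode.h; the point is that clang-style columns are 1-based and count UTF-8 bytes:

    #include <string_view>
    #include <utility>

    // Step 1 (existing code): LSP Position {0-based line, UTF-16 column}
    // -> byte offset into the UTF-8 buffer.
    // Step 2 (hypothetical helper, sketched here): byte offset ->
    // clang-style 1-based line/column, where the column counts UTF-8 bytes.
    static std::pair<unsigned, unsigned>
    offsetToClangLineColumn(std::string_view Code, size_t Offset) {
      unsigned Line = 1, Column = 1;
      for (size_t I = 0; I < Offset && I < Code.size(); ++I) {
        if (Code[I] == '\n') {
          ++Line;
          Column = 1;
        } else {
          ++Column; // every byte advances the column, even continuation bytes
        }
      }
      return {Line, Column};
    }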

This seems likely to cause problems with editors that have the same bug, and
treat the protocol as if columns are UTF-8 bytes. But we can find and fix those.

Diff Detail

Repository
rL LLVM

Event Timeline

sammccall created this revision. Apr 24 2018, 4:26 PM
sammccall updated this revision to Diff 143836. Apr 24 2018, 4:27 PM

clang-format

sammccall updated this revision to Diff 143838. Apr 24 2018, 4:34 PM

Remove some debugging junk, tweak a comment.

hokein accepted this revision. Apr 25 2018, 2:29 AM

Cool, the code looks good to me (just a few nits), thanks for the descriptive comments!

This seems likely to cause problems with editors that have the same bug, and
treat the protocol as if columns are UTF-8 bytes. But we can find and fix those.

VSCode is fine I think, but we need to fix our internal ycm vim integration.

clangd/SourceCode.cpp
25 ↗(On Diff #143838)

Can we make it static?

The callback type is function<int, int>; is the reason for using a template here mainly to save some keystrokes?

53 ↗(On Diff #143838)

nit: consider naming the parameter U16Units?

72 ↗(On Diff #143838)

Maybe add an assert to ensure iterateCodepoints always returns false (reaches the end of U8)?

137 ↗(On Diff #143838)

nit: it took me a while to understand what the sub-expression Code.substr(LocInfo.second - ColumnInBytes, ColumnInBytes) does; maybe abstract it out with a descriptive name? Also, it is not straightforward to understand what LocInfo.second is without navigating to getDecomposedSpellingLoc.

This revision is now accepted and ready to land. Apr 25 2018, 2:29 AM
sammccall marked 3 inline comments as done. Apr 27 2018, 4:33 AM

Thanks!

clangd/SourceCode.cpp
25 ↗(On Diff #143838)

Added static.

The difference between using a template vs std::function for a lambda is compile-time vs run-time polymorphism: invoking std::function is a virtual call and (AFAIK) compilers don't manage to inline it well.
With the template, we get one copy of the function for each callsite, with the lambda inlined.

Not sure the performance is a big deal here, but this code is at least plausibly hot I guess? And I think there's very little readability cost to using the template in this case.
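
For illustration only (hypothetical names, not the clangd code), the two shapes being compared:

    #include <functional>

    // Illustration: type-erased callback. Each invocation goes through
    // std::function's indirection, which compilers rarely inline well.
    static void forEachErased(int N, const std::function<void(int)> &CB) {
      for (int I = 0; I < N; ++I)
        CB(I);
    }

    // Illustration: templated callback. Each callsite instantiates its own
    // copy of the loop with the lambda's body visible to the optimizer, so
    // it can be inlined.
    template <typename Callback>
    static void forEachTemplated(int N, Callback CB) {
      for (int I = 0; I < N; ++I)
        CB(I);
    }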

72 ↗(On Diff #143838)

I'm not sure there's enough value in this one: it's clear from the local code that this isn't possible, and it doesn't seem likely that a bug would manifest this way (aborting early even though our function returns false, *and returning true* from iterateCodepoints).

The small cost is added noise - I think this needs a new variable, an assert, and a suppression of the "unused variable" warning.

137 ↗(On Diff #143838)

Called this LineSoFar and decomposed LocInfo into named variables.

This revision was automatically updated to reflect the committed changes.
sammccall marked 2 inline comments as done.
benhamilton added inline comments.
clang-tools-extra/trunk/clangd/SourceCode.cpp
38

This is user input, right? Have we actually checked for valid UTF-8, or do we just assume it's valid?

If not, it seems like an assertion is not the right check; we should instead reject it when we're reading the input.

sammccall added inline comments. Apr 30 2018, 1:42 AM
clang-tools-extra/trunk/clangd/SourceCode.cpp
38

Yeah, I wasn't sure about this; offline discussion tentatively concluded we wanted an assert, but I'm happy to switch to something else.

We don't validate the code on the way in, so strings are "bytes of presumed-UTF8". This is usually not a big pain actually. But we could/should certainly make the JSON parser validate the UTF-8. (If we want to go this route, D45753 should be resolved first).

There are two ways the assertion could fire: the code is invalid UTF-8, or there's a bug in the Unicode logic here. I thought the latter was more likely, at least in the short term :) and this is the least invasive way to catch it. And if a developer build (assert-enabled) crashes because an editor feeds it invalid bytes, that's probably better than doing nothing (though not as good as catching the error earlier).

benhamilton added inline comments. Apr 30 2018, 8:57 AM
clang-tools-extra/trunk/clangd/SourceCode.cpp
38

The problem with not validating is that it's easy to cause OOB memory access (and thus security issues) if someone crafts malicious UTF-8 and makes us read off the end of a string.

We should be clear about the status of all strings in the API documentation.

sammccall added inline comments. Apr 30 2018, 9:16 AM
clang-tools-extra/trunk/clangd/SourceCode.cpp
38

You still have to find/write a UTF-8 decoder that doesn't check bounds, which is (hopefully!) the harder part of writing that bug :-)
But I agree in principle; there are more subtle attacks too, like the overlong sequence C0 8A, which is invalid but which non-validating decoders will treat as a newline.
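
For context, a minimal standalone sketch (not the JSON-library patch) of the shortest-form check a validating decoder needs in order to reject inputs like that one:

    #include <cstdint>
    #include <string_view>

    // Decode one UTF-8 sequence starting at I; reject the malformed inputs a
    // naive decoder would accept, including overlong forms like "\xC0\x8A".
    static bool decodeOneChecked(std::string_view S, size_t &I, uint32_t &CP) {
      if (I >= S.size())
        return false;
      uint8_t B0 = S[I];
      int Len;
      if (B0 < 0x80)              { CP = B0;        Len = 1; }
      else if ((B0 >> 5) == 0x06) { CP = B0 & 0x1F; Len = 2; }
      else if ((B0 >> 4) == 0x0E) { CP = B0 & 0x0F; Len = 3; }
      else if ((B0 >> 3) == 0x1E) { CP = B0 & 0x07; Len = 4; }
      else
        return false;             // stray continuation byte or invalid lead
      if (I + Len > S.size())
        return false;             // truncated sequence (the bounds check)
      for (int K = 1; K < Len; ++K) {
        uint8_t B = S[I + K];
        if ((B & 0xC0) != 0x80)
          return false;           // not a continuation byte
        CP = (CP << 6) | (B & 0x3F);
      }
      static const uint32_t MinCP[] = {0, 0x80, 0x800, 0x10000};
      if (CP < MinCP[Len - 1])
        return false;             // overlong: C0 8A would decode to 0x0A
      if ((CP >= 0xD800 && CP <= 0xDFFF) || CP > 0x10FFFF)
        return false;             // UTF-16 surrogate or out of Unicode range
      I += Len;
      return true;
    }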

I have a nearly-finished patch to add real validation to the JSON library, I'll copy you on it.

benhamilton added inline comments. Apr 30 2018, 9:21 AM
clang-tools-extra/trunk/clangd/SourceCode.cpp
38

Seems like this particular decoder isn't checking bounds, eh? ;)

If NDEBUG is set, it will happily set UTF8Length to however many leading 1s there are (valid or not) and pass that to CB(UTF8Length). It's true that the current callbacks passed in won't directly turn that into an OOB memory access, but they will end up returning an invalid UTF-16 code unit length from positionToOffset(), so who knows what that will end up doing.

Thanks, I'm always happy to review Unicode stuff.

sammccall added inline comments. Apr 30 2018, 9:25 AM
clang-tools-extra/trunk/clangd/SourceCode.cpp
38

Ah, yeah - the implicit context here is that we don't do anything with UTF-16 other than send it back to the client. So this is garbage in, garbage out. Definitely not ideal.