This is an archive of the discontinued LLVM Phabricator instance.

[clangd] Escape only necessary characters in JSON output
Needs ReviewPublic

Authored by malaperle-ericsson on Apr 12 2017, 3:11 PM.

Details

Summary

Currently, the strings being output by Clangd are escaled with llvm::yaml::escape. But this is not entirely correct because it escapes a lot of additional characters things. For example, a 2 byte character "é" will be encoded with a \x which is not valid and this error is thrown by the VSCode client:

SyntaxError: Unexpected token x in JSON at position 291
    at StreamMessageReader.onData
    at Socket.<anonymous>
    at emitOne (events.js:96:13)
    at Socket.emit (events.js:188:7)
    at readableAddChunk (_stream_readable.js:176:18)
    at Socket.Readable.push (_stream_readable.js:134:10)
    at Pipe.onread (net.js:543:20)

This change introduces an escaping function that does follow the JSON format.

Event Timeline

malaperle-ericsson set the repository for this revision to rL LLVM.
malaperle-ericsson added a project: Restricted Project.
malaperle-ericsson added a subscriber: cfe-commits.
krasimir edited edge metadata.

Seems that we're starting to hit some YAML/JSON mismatches, or is it that your YAML string support is lacking?

Seems that we're starting to hit some YAML/JSON mismatches, or is it that your YAML string support is lacking?

I don't think so. It seems like JSON and YAML are not completely aligned on escaped characters. See http://yaml.org/spec/1.2/spec.html#id2776092 which specifies unicode 8, 16, and 32 bits are escaped with \x, \u, and \U whereas http://www.json.org/string.gif specifies that unicode 16 bit characters can be encoded with \u but \x and \U are not supported. This leads me to believe that a YAML parser can read JSON but a JSON parser will not necessarily be able to read YAML. I thought about using json cpp but that's a much bigger change

joerg added a subscriber: joerg.Apr 13 2017, 6:28 AM

I'm strongly against this patch. Can you give an actual test case for the problematic behavior?

I'm strongly against this patch. Can you give an actual test case for the problematic behavior?

Sure I can add a test. If you meant more real work scenario, you can juste type "é" in VS Code and it will throw an exception trying to parse \x

klimek added inline comments.Apr 13 2017, 7:18 AM
clangd/Protocol.cpp
27–51

For json we only need the first 2 though, right?

klimek edited edge metadata.Apr 13 2017, 7:21 AM

Seems that we're starting to hit some YAML/JSON mismatches, or is it that your YAML string support is lacking?

I don't think so. It seems like JSON and YAML are not completely aligned on escaped characters. See http://yaml.org/spec/1.2/spec.html#id2776092 which specifies unicode 8, 16, and 32 bits are escaped with \x, \u, and \U whereas http://www.json.org/string.gif specifies that unicode 16 bit characters can be encoded with \u but \x and \U are not supported. This leads me to believe that a YAML parser can read JSON but a JSON parser will not necessarily be able to read YAML. I thought about using json cpp but that's a much bigger change

Here we only talk about what we escape, which should be the minimum required.
If I understand correctly, that's " and \ for JSON and ", \ and all non-printable characters (which unfortunately requires understanding of unicode to solve this fully correctly) in YAML.
Am I missing something?

If I understand correctly, that's " and \ for JSON and ", \ and all non-printable characters (which unfortunately requires understanding of unicode to solve this fully correctly) in YAML.

I'd modify that slightly: let's escape those two and all the known ASCII control characters, but just pass the rest through. I.e. reduce the current \x etc handling to known problematic input and not second guess the rest.

let's escape ... all the known ASCII control characters,

Do you mean encode all of them with \u or keep the two characters representation for those that exist? I think \n is nicer than \u000A and is probably common enough to keep the short version. Same for \r, \t.

Handle other control characters and add test

The short version is perfectly fine as long as works for both JSON and YAML. Less output is always good :)

Once the use of "two characters representation" is clarified, I will update the patch again.

joerg added a comment.Apr 16 2017, 1:01 PM

Just to avoid any confusion: this should be the generic YAML escape routine in llvm/lib/Support, i.e. IMO we don't want to have separate YAML and JSON escape routines.
Effective, change YAMLParser.cpp line 697 to use \u and drop the whole UTF-8 handling part.

Just to avoid any confusion: this should be the generic YAML escape routine in llvm/lib/Support, i.e. IMO we don't want to have separate YAML and JSON escape routines.
Effective, change YAMLParser.cpp line 697 to use \u and drop the whole UTF-8 handling part.

I'm not sure it's possible or desirable to reuse yaml::escape as they are two formats and it's expected that a yaml output might not be able to be read by a json parser. If I look at the yaml::escape routine, there are several escape that are incompatible with JSON like \0, \v, \e, and 32-bit handling via \U.
I think ideally this patch would be a stop-gap solution until jsoncpp is introduced so that one can build json output in a structured way (as opposed to manually appending strings now) and the output will also be escape via the library.

joerg added a comment.Apr 16 2017, 1:24 PM

We already have a couple of case that expect the encoding to be compatible. I'm not very attached to the additional special cases from YAML, but having either a common escape function OR a JSON escape in LLVM/Support is quite important.