Parsing invalid UTF-8 input is now a parse error.
Creating JSON values from invalid UTF-8 now triggers an assertion, and
(in no-assert builds) substitutes the unicode replacement character.
Strings retrieved from json::Value are always valid UTF-8.
Details
- Reviewers
benhamilton
Diff Detail
- Repository
- rL LLVM
- Build Status
Buildable 20197 Build 20197: arc lint + arc unit
Event Timeline
I wonder if you also want to replace surrogate codepoint with U+FFFD, as UTF-8 string should not contain any codepoint between U+D800 and U+DFFF.
Looks good, just missed a few edge cases for 4-byte sequences.
Also need to reject the CESU-8 encoding where UTF-16 surrogate pairs are represented as two 3-byte UTF-8 sequences.
lib/Support/JSON.cpp | ||
---|---|---|
535 | Can we clarify the comment that this logic results in checking for the shortest possible encoding? | |
539 | Also need to check for and reject so-called CESU-8 encoding (where UTF-16 surrogate pairs are "encoded" as separate 3-byte UTF-8 sequences): | |
550 | Also need to handle two more cases which would encode a code point > U+10FFFF, which is not allowed:
| |
570 | if (!LLVM_LIKELY(measureChar(S, I)) { |
Also, IMHO, this logic should be in a separate UTF-8 library so we can use it outside of JSON contexts.
Thanks for the comments!
Not validating codepoints was deliberate but in hindsight arbitrary, I'll
fix that. (Both surrogates and high codepoints).
Regarding a separate library: I agree in principle :-) but am concerned
about the potential scope/generality and don't have cycles to design it
right now, even extracting JSON is a bit of a yak-shave...
Here it's in support and exposed as free functions, so if it really is
useful, it *is* available and we can move it later.
That's the best argument I can come up with, open to better ideas :)
Is it super tricky to at least extract the UTF-8 validating and sanitizing functions into their own header and source files within the Support library, if not a separate library?
Reject invalid codepoints (surrogates, high codepoints).
Add more comments around UTF8 validation.
It turns out ConvertUTF already provides a nice/efficient isLegalUTF8String that I somehow missed before.
Given that, rewriting fixUTF8 to use existing facilities is a little clunky but doable, and that shouldn't be hot.
So this is now a lot simpler, sorry for any wasted time :-/
Now I think the "unicode library" point is moot?
Oops, sorry for losing track of this.
The JSON-in-Support patch has finally landed in r336534 so this can actually go in sometime soon :-)
lib/Support/JSON.cpp | ||
---|---|---|
517 | Hmm.. maybe. I'm slightly leery about this, as these are common Unicode reference functions that are largely unmodified, which people may expect. |
lib/Support/JSON.cpp | ||
---|---|---|
517 | I'm supportive of an inline-able ASCII check if we don't already have one. |
lib/Support/JSON.cpp | ||
---|---|---|
517 | Added one to StringExtras.h (Unicode.h and ConvertUTF.h have weird style and no dependencies, it's hard to work out how to make it fit). |
include/llvm/Support/JSON.h | ||
---|---|---|
302 | Moreover, __builtin_expect doesn't work if it's not the top-level condition, and LLVM_LIKELY and LLVM_UNLIKELY are both using it. |
include/llvm/Support/JSON.h | ||
---|---|---|
302 | Interesting! Naive test suggests this isn't always true for clang, though probably is for GCC: Done in any case. |
Seems like flipping the check is clearer: