This is an archive of the discontinued LLVM Phabricator instance.

[Support] Harden JSON against invalid UTF-8.
ClosedPublic

Authored by sammccall on Apr 30 2018, 10:26 AM.

Details

Reviewers
benhamilton
Summary

Parsing invalid UTF-8 input is now a parse error.
Creating JSON values from invalid UTF-8 now triggers an assertion, and
(in no-assert builds) substitutes the unicode replacement character.
Strings retrieved from json::Value are always valid UTF-8.

Diff Detail

Event Timeline

sammccall created this revision.Apr 30 2018, 10:26 AM
ruiu added a subscriber: ruiu.Apr 30 2018, 10:36 AM

I wonder if you also want to replace surrogate codepoint with U+FFFD, as UTF-8 string should not contain any codepoint between U+D800 and U+DFFF.

benhamilton requested changes to this revision.Apr 30 2018, 10:50 AM

Looks good, just missed a few edge cases for 4-byte sequences.

Also need to reject the CESU-8 encoding where UTF-16 surrogate pairs are represented as two 3-byte UTF-8 sequences.

lib/Support/JSON.cpp
537

Can we clarify the comment that this logic results in checking for the shortest possible encoding?

541

Also need to check for and reject so-called CESU-8 encoding (where UTF-16 surrogate pairs are "encoded" as separate 3-byte UTF-8 sequences):

https://www.unicode.org/reports/tr26/#definitions

552

Also need to handle two more cases which would encode a code point > U+10FFFF, which is not allowed:

  1. First byte == 0xF4 and second byte > 0x8F
  2. First byte > 0xF4
572
if (!LLVM_LIKELY(measureChar(S, I)) {
This revision now requires changes to proceed.Apr 30 2018, 10:50 AM

Also, IMHO, this logic should be in a separate UTF-8 library so we can use it outside of JSON contexts.

Thanks for the comments!
Not validating codepoints was deliberate but in hindsight arbitrary, I'll
fix that. (Both surrogates and high codepoints).

Regarding a separate library: I agree in principle :-) but am concerned
about the potential scope/generality and don't have cycles to design it
right now, even extracting JSON is a bit of a yak-shave...
Here it's in support and exposed as free functions, so if it really is
useful, it *is* available and we can move it later.
That's the best argument I can come up with, open to better ideas :)

Is it super tricky to at least extract the UTF-8 validating and sanitizing functions into their own header and source files within the Support library, if not a separate library?

sammccall updated this revision to Diff 144871.May 2 2018, 6:18 AM
sammccall marked 4 inline comments as done.

Reject invalid codepoints (surrogates, high codepoints).
Add more comments around UTF8 validation.

sammccall updated this revision to Diff 144879.May 2 2018, 7:02 AM

Use existing library code from ConvertUTF for UTF-8 validation.

Is it super tricky to at least extract the UTF-8 validating and sanitizing functions into their own header and source files within the Support library, if not a separate library?

It turns out ConvertUTF already provides a nice/efficient isLegalUTF8String that I somehow missed before.
Given that, rewriting fixUTF8 to use existing facilities is a little clunky but doable, and that shouldn't be hot.

So this is now a lot simpler, sorry for any wasted time :-/
Now I think the "unicode library" point is moot?

Ah, glad we did have a solution for this. Wish I'd known, would have saved everyone time. :)

lib/Support/JSON.cpp
519

Wouldn't it make sense to move this to isLegalUTF8String()?

526

Style: Space between UTF8 and *?

536

Style: Space between UTF8 and *?

sammccall marked 2 inline comments as done.Jul 9 2018, 5:59 AM

Oops, sorry for losing track of this.
The JSON-in-Support patch has finally landed in r336534 so this can actually go in sometime soon :-)

lib/Support/JSON.cpp
519

Hmm.. maybe. I'm slightly leery about this, as these are common Unicode reference functions that are largely unmodified, which people may expect.
Also I think we'd still want to split it into two functions so the ASCII check can be inlined and the utf-8 wrangling outlined. It seems harmless enough here, but maybe I'm just lazy. WDYT?

sammccall updated this revision to Diff 154584.Jul 9 2018, 6:04 AM

Rebase and format

benhamilton added inline comments.Jul 9 2018, 7:56 AM
lib/Support/JSON.cpp
519

I'm supportive of an inline-able ASCII check if we don't already have one.

sammccall updated this revision to Diff 154617.Jul 9 2018, 8:53 AM

Properly rebase this time, and add/use isASCII in StringExtras.

sammccall added inline comments.Jul 9 2018, 8:53 AM
lib/Support/JSON.cpp
519

Added one to StringExtras.h (Unicode.h and ConvertUTF.h have weird style and no dependencies, it's hard to work out how to make it fit).

benhamilton accepted this revision.Jul 9 2018, 9:05 AM

Thanks!

include/llvm/Support/JSON.h
231

Seems like flipping the check is clearer:

if (LLVM_UNLIKELY(!isUTF8(V)) {
243

Ditto on LLVM_UNLIKELY.

404

Ditto ditto.

411

Ditto ditto ditto.

This revision is now accepted and ready to land.Jul 9 2018, 9:05 AM
dexonsmith added inline comments.Jul 9 2018, 11:51 AM
include/llvm/Support/JSON.h
231

Moreover, __builtin_expect doesn't work if it's not the top-level condition, and LLVM_LIKELY and LLVM_UNLIKELY are both using it.

sammccall marked 5 inline comments as done.Jul 10 2018, 4:53 AM
sammccall added inline comments.
include/llvm/Support/JSON.h
231

Interesting! Naive test suggests this isn't always true for clang, though probably is for GCC:
https://godbolt.org/g/vek4jo

Done in any case.

sammccall updated this revision to Diff 154775.Jul 10 2018, 4:54 AM
sammccall marked an inline comment as done.

Make LLVM_[UN]LIKELY top-level

sammccall closed this revision.Jul 11 2018, 12:22 AM

Landed as r336657, not sure why it didn't get linked to this review.