This is an archive of the discontinued LLVM Phabricator instance.

[Clangd] Fixed assertion when processing extended ASCII characters.
Needs ReviewPublic

Authored by AnakinZheng on Feb 17 2020, 10:26 AM.
This revision needs review, but all reviewers have resigned.

Details

Summary

Previously this piece of code asserts when processing extended ASCII. Proposing a fix to extend the ascii handling code to take extended ascii as well.

Diff Detail

Event Timeline

AnakinZheng created this revision.Feb 17 2020, 10:26 AM
kadircet added inline comments.Feb 17 2020, 11:00 AM
clang-tools-extra/clangd/SourceCode.cpp
62

C is one byte long, it is not possible for it to ever satisfy C&0x100.

Yeah I think there must be some confusion about what this code is doing. It's specifically iterating over the unicode codepoints of what are supposed to be UTF-8-encoded input bytes.

The input turns out sometimes not to be UTF-8 (e.g. the file on disk is ISO-8859-1 and clang thinks it's UTF-8 and just loads the bytes). We can't give any sort of right answer in these cases - we don't know the actual encoding and we can't even always detect these cases!

What we can do is strengthen the contract: instead of UB, assert in practice, we can say returns some garbage value but doesn't crash.

Fix the previous stupid mistake.

AnakinZheng marked an inline comment as done.Feb 17 2020, 11:55 AM

Update patch.

clang-tools-extra/clangd/SourceCode.cpp
62

Updated the patch, now it should be correct.

Specifically, we can't change the return value for any valid utf-8 string (see tests).

Instead, where we currently assert length is between 2 and 4, you could change that to an if and treat it (arbitrarily) as a one-utf-8-byte, one-utf-16-code-unit character.

This new behavior should have a test.

kadircet added inline comments.Feb 17 2020, 12:01 PM
clang-tools-extra/clangd/SourceCode.cpp
62

sorry but this is still the same since C is an unsigned char it is always guaranteed to be less than or equal to 255.

As @sammccall explained above, we should rather re-arrange the logic to get rid of the assertion below, e.g. when UTF8Length is less than 2 or more than 4 we can choose to interpret it as ascii, instead of asserting, while commenting the reason for this choice.

AnakinZheng marked an inline comment as done.Feb 17 2020, 12:06 PM
AnakinZheng added inline comments.
clang-tools-extra/clangd/SourceCode.cpp
62

It's unsigned short in the new patch. Ok, I'll try @sammccall 's suggestion.

kadircet added inline comments.Feb 17 2020, 12:17 PM
clang-tools-extra/clangd/SourceCode.cpp
62

It's unsigned short in the new patch.

Sorry I missed that change, but still U8[I] is a one byte element(it is of type char), you can't get away by just casting it to unsigned short.

AnakinZheng marked an inline comment as done.Feb 17 2020, 12:25 PM
AnakinZheng added inline comments.
clang-tools-extra/clangd/SourceCode.cpp
62

Yup, just aware of that, @sammccall 's suggestion looks reasonable, implementing now.

Implement @sammccall 's suggestion and add test.

kadircet added inline comments.Feb 21 2020, 1:36 AM
clang-tools-extra/clangd/SourceCode.cpp
74–77

can we also move this into valid case, and only increment by one on the invalid(extended ascii) case ?

as we are trying to recover from extended ascii, and they are 1 byte in length, not multiple.

78

could you re-arrange this part into

if (LLVM_LIKELY(UTF8Length >= 2 && UTF8Length <=4)) {
      // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
      // Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...)
      if (CB(UTF8Length, UTF8Length >> 1))
        return true;
      continue;
}

vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8);
// If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16
if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2))
   return true;
81

instead of printing the string, it would be better to print hex representation of current character, you can check llvm::utohexstr

82

we are already in a bad shape here, instead of making it more fluctuating can we always CB(1, 1)?

As long as you don't have any special reasoning for it.

clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
64 ↗(On Diff #245036)

I believe in addition to being invalidutf16, this is also extended ascii (representing multiple ÿs).

if we want clangd to work for extended ascii. I suppose the length should rather be 5 instead, as in utf-8 case. Since all of these bytes are representing a different character.

also could you clang-format your changes

Proposing a fix to extend the ascii handling code to take extended ascii as well.

Please reduce the scope of this patch to not crashing on invalid utf-8.

Handling other encodings is in principle possible, but requires more work and a more precise understanding of what we're trying to do.
(In particular, "extended ascii" doesn't have a precise meaning)

clang-tools-extra/clangd/SourceCode.cpp
81

Printing UTF8length isn't useful here., and I don't think current character is enough as it may not be where the error is (e.g. in the 10xxx case).

I'd suggest llvm::toHex(U8) and printing the offset I. Printing the bytes is better than the text as we know we're misinterpreting the bytes!

(In practice, these strings are parts of a single line, so it shouldn't be too long)

81

also I think you want to set a flag and vlog only the first time around the loop.

82

Yeah, this isn't a principled recovery strategy, so CB(1, 1) is the way to go I think.
With a comment indicating that we're not returning anything really useful, like "// Treat this byte as a character, for lack of better options".
(And CB(2, 1) seems silly if you think this text is actually ISO-8859-x - that's a single-byte encoding!)

If we really wanted to do something principled, we could validate the UTF-8 first (llvm::isUTF8 in JSON.h, though it's missing some cases), and if it fails, invoke CB(1, 1) for each character under the assumption that it's ISO-8859-x so one-byte per char and all chars are in BMP. But I'd urge against it, I don't think this will actually end up working reliably end-to-end.

clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
60 ↗(On Diff #245036)

"Extended ASCII" is vague and misleading. Please call these something like "Invalid UTF-8" or "ISO-8859-x misinterpreted as UTF-8" or so

62 ↗(On Diff #245036)

I don't think we should be asserting the exact values here, it's garbage and depends on implementation details.

The most important thing is to call it and verify it doesn't crash.
We could assert that 0 <= lspLength <= strlen, which lspLength generally obeys (for all encodings)

64 ↗(On Diff #245036)

I'm just really confused by this test case - we never interpret or convert it to UTF-16 so why is it called invalidUTF16? what is it supposed to be testing?

Sorry this got stalled, it's important (I hadn't realized we were crashing on *indexing* when boost was included).

I've created D81530 derived from this which addresses the previous round of comments and adds a testcase for indexing.

kadircet resigned from this revision.Jun 11 2020, 2:39 AM
sammccall resigned from this revision.Sep 18 2020, 2:04 AM