Previously this piece of code asserts when processing extended ASCII. Proposing a fix to extend the ascii handling code to take extended ascii as well.
Diff Detail
Event Timeline
clang-tools-extra/clangd/SourceCode.cpp | ||
---|---|---|
61–62 | C is one byte long, it is not possible for it to ever satisfy C&0x100. |
Yeah I think there must be some confusion about what this code is doing. It's specifically iterating over the unicode codepoints of what are supposed to be UTF-8-encoded input bytes.
The input turns out sometimes not to be UTF-8 (e.g. the file on disk is ISO-8859-1 and clang thinks it's UTF-8 and just loads the bytes). We can't give any sort of right answer in these cases - we don't know the actual encoding and we can't even always detect these cases!
What we can do is strengthen the contract: instead of UB, assert in practice, we can say returns some garbage value but doesn't crash.
Update patch.
clang-tools-extra/clangd/SourceCode.cpp | ||
---|---|---|
61–62 | Updated the patch, now it should be correct. |
Specifically, we can't change the return value for any valid utf-8 string (see tests).
Instead, where we currently assert length is between 2 and 4, you could change that to an if and treat it (arbitrarily) as a one-utf-8-byte, one-utf-16-code-unit character.
This new behavior should have a test.
clang-tools-extra/clangd/SourceCode.cpp | ||
---|---|---|
61–62 | sorry but this is still the same since C is an unsigned char it is always guaranteed to be less than or equal to 255. As @sammccall explained above, we should rather re-arrange the logic to get rid of the assertion below, e.g. when UTF8Length is less than 2 or more than 4 we can choose to interpret it as ascii, instead of asserting, while commenting the reason for this choice. |
clang-tools-extra/clangd/SourceCode.cpp | ||
---|---|---|
61–62 | It's unsigned short in the new patch. Ok, I'll try @sammccall 's suggestion. |
clang-tools-extra/clangd/SourceCode.cpp | ||
---|---|---|
61–62 |
Sorry I missed that change, but still U8[I] is a one byte element(it is of type char), you can't get away by just casting it to unsigned short. |
clang-tools-extra/clangd/SourceCode.cpp | ||
---|---|---|
61–62 | Yup, just aware of that, @sammccall 's suggestion looks reasonable, implementing now. |
clang-tools-extra/clangd/SourceCode.cpp | ||
---|---|---|
74–77 | can we also move this into valid case, and only increment by one on the invalid(extended ascii) case ? as we are trying to recover from extended ascii, and they are 1 byte in length, not multiple. | |
78 | could you re-arrange this part into if (LLVM_LIKELY(UTF8Length >= 2 && UTF8Length <=4)) { // A codepoint takes two UTF-16 code unit if it's astral (outside BMP). // Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...) if (CB(UTF8Length, UTF8Length >> 1)) return true; continue; } vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8); // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16 if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2)) return true; | |
81 | instead of printing the string, it would be better to print hex representation of current character, you can check llvm::utohexstr | |
82 | we are already in a bad shape here, instead of making it more fluctuating can we always CB(1, 1)? As long as you don't have any special reasoning for it. | |
clang-tools-extra/clangd/unittests/SourceCodeTests.cpp | ||
64 ↗ | (On Diff #245036) | I believe in addition to being invalidutf16, this is also extended ascii (representing multiple ÿs). if we want clangd to work for extended ascii. I suppose the length should rather be 5 instead, as in utf-8 case. Since all of these bytes are representing a different character. |
Proposing a fix to extend the ascii handling code to take extended ascii as well.
Please reduce the scope of this patch to not crashing on invalid utf-8.
Handling other encodings is in principle possible, but requires more work and a more precise understanding of what we're trying to do.
(In particular, "extended ascii" doesn't have a precise meaning)
clang-tools-extra/clangd/SourceCode.cpp | ||
---|---|---|
81 | Printing UTF8length isn't useful here., and I don't think current character is enough as it may not be where the error is (e.g. in the 10xxx case). I'd suggest llvm::toHex(U8) and printing the offset I. Printing the bytes is better than the text as we know we're misinterpreting the bytes! (In practice, these strings are parts of a single line, so it shouldn't be too long) | |
81 | also I think you want to set a flag and vlog only the first time around the loop. | |
82 | Yeah, this isn't a principled recovery strategy, so CB(1, 1) is the way to go I think. If we really wanted to do something principled, we could validate the UTF-8 first (llvm::isUTF8 in JSON.h, though it's missing some cases), and if it fails, invoke CB(1, 1) for each character under the assumption that it's ISO-8859-x so one-byte per char and all chars are in BMP. But I'd urge against it, I don't think this will actually end up working reliably end-to-end. | |
clang-tools-extra/clangd/unittests/SourceCodeTests.cpp | ||
60 ↗ | (On Diff #245036) | "Extended ASCII" is vague and misleading. Please call these something like "Invalid UTF-8" or "ISO-8859-x misinterpreted as UTF-8" or so |
62 ↗ | (On Diff #245036) | I don't think we should be asserting the exact values here, it's garbage and depends on implementation details. The most important thing is to call it and verify it doesn't crash. |
64 ↗ | (On Diff #245036) | I'm just really confused by this test case - we never interpret or convert it to UTF-16 so why is it called invalidUTF16? what is it supposed to be testing? |
Sorry this got stalled, it's important (I hadn't realized we were crashing on *indexing* when boost was included).
I've created D81530 derived from this which addresses the previous round of comments and adds a testcase for indexing.
C is one byte long, it is not possible for it to ever satisfy C&0x100.