This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang-tools-extra/clangd/
-
clangd/
3/13
SourceCode.cpp

Differential D74731

[Clangd] Fixed assertion when processing extended ASCII characters.
Needs ReviewPublic

Authored by AnakinZheng on Feb 17 2020, 10:26 AM.

Download Raw Diff

This revision needs review, but all reviewers have resigned.

Details

Reviewers

kadircet
sammccall

Summary

Previously this piece of code asserts when processing extended ASCII. Proposing a fix to extend the ascii handling code to take extended ascii as well.

Diff Detail

Event Timeline

AnakinZheng created this revision.Feb 17 2020, 10:26 AM

Herald added subscribers: cfe-commits, usaxena95, arphaman and 3 others. · View Herald TranscriptFeb 17 2020, 10:26 AM

Harbormaster failed remote builds in B46653: Diff 245002!Feb 17 2020, 10:58 AM

kadircet added inline comments.Feb 17 2020, 11:00 AM

clang-tools-extra/clangd/SourceCode.cpp
61–62	`C` is one byte long, it is not possible for it to ever satisfy `C&0x100`.

Yeah I think there must be some confusion about what this code is doing. It's specifically iterating over the unicode codepoints of what are supposed to be UTF-8-encoded input bytes.

The input turns out sometimes not to be UTF-8 (e.g. the file on disk is ISO-8859-1 and clang thinks it's UTF-8 and just loads the bytes). We can't give any sort of right answer in these cases - we don't know the actual encoding and we can't even always detect these cases!

What we can do is strengthen the contract: instead of UB, assert in practice, we can say returns some garbage value but doesn't crash.

Fix the previous stupid mistake.

Update patch.

clang-tools-extra/clangd/SourceCode.cpp
61–62	Updated the patch, now it should be correct.

Specifically, we can't change the return value for any valid utf-8 string (see tests).

Instead, where we currently assert length is between 2 and 4, you could change that to an if and treat it (arbitrarily) as a one-utf-8-byte, one-utf-16-code-unit character.

This new behavior should have a test.

kadircet added inline comments.Feb 17 2020, 12:01 PM

clang-tools-extra/clangd/SourceCode.cpp
61–62	sorry but this is still the same since `C` is an `unsigned char` it is always guaranteed to be less than or equal to 255. As @sammccall explained above, we should rather re-arrange the logic to get rid of the assertion below, e.g. when `UTF8Length` is less than 2 or more than 4 we can choose to interpret it as ascii, instead of asserting, while commenting the reason for this choice.

AnakinZheng marked an inline comment as done.Feb 17 2020, 12:06 PM

AnakinZheng added inline comments.

clang-tools-extra/clangd/SourceCode.cpp
61–62	It's unsigned short in the new patch. Ok, I'll try @sammccall 's suggestion.

kadircet added inline comments.Feb 17 2020, 12:17 PM

clang-tools-extra/clangd/SourceCode.cpp
61–62	It's unsigned short in the new patch. Sorry I missed that change, but still `U8[I]` is a one byte element(it is of type `char`), you can't get away by just casting it to `unsigned short`.

AnakinZheng marked an inline comment as done.Feb 17 2020, 12:25 PM

AnakinZheng added inline comments.

clang-tools-extra/clangd/SourceCode.cpp
61–62	Yup, just aware of that, @sammccall 's suggestion looks reasonable, implementing now.

Implement @sammccall 's suggestion and add test.

kadircet added inline comments.Feb 21 2020, 1:36 AM

clang-tools-extra/clangd/SourceCode.cpp
74–77	can we also move this into valid case, and only increment by one on the invalid(extended ascii) case ? as we are trying to recover from extended ascii, and they are 1 byte in length, not multiple.
78	could you re-arrange this part into if (LLVM_LIKELY(UTF8Length >= 2 && UTF8Length <=4)) { // A codepoint takes two UTF-16 code unit if it's astral (outside BMP). // Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...) if (CB(UTF8Length, UTF8Length >> 1)) return true; continue; } vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8); // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16 if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2)) return true;
81	instead of printing the string, it would be better to print hex representation of current character, you can check `llvm::utohexstr`
82	we are already in a bad shape here, instead of making it more fluctuating can we always `CB(1, 1)`? As long as you don't have any special reasoning for it.
clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
64 ↗	(On Diff #245036)	I believe in addition to being invalidutf16, this is also extended ascii (representing multiple `ÿ`s). if we want clangd to work for extended ascii. I suppose the length should rather be 5 instead, as in utf-8 case. Since all of these bytes are representing a different character.

also could you clang-format your changes

Proposing a fix to extend the ascii handling code to take extended ascii as well.

Please reduce the scope of this patch to not crashing on invalid utf-8.

Handling other encodings is in principle possible, but requires more work and a more precise understanding of what we're trying to do.
(In particular, "extended ascii" doesn't have a precise meaning)

clang-tools-extra/clangd/SourceCode.cpp
81	Printing UTF8length isn't useful here., and I don't think current character is enough as it may not be where the error is (e.g. in the 10xxx case). I'd suggest llvm::toHex(U8) and printing the offset I. Printing the bytes is better than the text as we know we're misinterpreting the bytes! (In practice, these strings are parts of a single line, so it shouldn't be too long)
81	also I think you want to set a flag and vlog only the first time around the loop.
82	Yeah, this isn't a principled recovery strategy, so CB(1, 1) is the way to go I think. With a comment indicating that we're not returning anything really useful, like "// Treat this byte as a character, for lack of better options". (And CB(2, 1) seems silly if you think this text is actually ISO-8859-x - that's a single-byte encoding!) If we really wanted to do something principled, we could validate the UTF-8 first (llvm::isUTF8 in JSON.h, though it's missing some cases), and if it fails, invoke CB(1, 1) for each character under the assumption that it's ISO-8859-x so one-byte per char and all chars are in BMP. But I'd urge against it, I don't think this will actually end up working reliably end-to-end.
clang-tools-extra/clangd/unittests/SourceCodeTests.cpp
60 ↗	(On Diff #245036)	"Extended ASCII" is vague and misleading. Please call these something like "Invalid UTF-8" or "ISO-8859-x misinterpreted as UTF-8" or so
62 ↗	(On Diff #245036)	I don't think we should be asserting the exact values here, it's garbage and depends on implementation details. The most important thing is to call it and verify it doesn't crash. We could assert that 0 <= lspLength <= strlen, which lspLength generally obeys (for all encodings)
64 ↗	(On Diff #245036)	I'm just really confused by this test case - we never interpret or convert it to UTF-16 so why is it called `invalidUTF16`? what is it supposed to be testing?

sammccall mentioned this in D74025: [clangd] Add the missing elaborated types in FindTarget..Jun 9 2020, 8:32 PM

Sorry this got stalled, it's important (I hadn't realized we were crashing on *indexing* when boost was included).

I've created D81530 derived from this which addresses the previous round of comments and adds a testcase for indexing.

sammccall mentioned this in D81530: [clangd] Log rather than assert on bad UTF-8..Jun 9 2020, 9:48 PM

kadircet resigned from this revision.Jun 11 2020, 2:39 AM

Herald added a subscriber: kadircet. · View Herald TranscriptJun 11 2020, 2:39 AM

sammccall resigned from this revision.Sep 18 2020, 2:04 AM

Revision Contents

Path

Size

clang-tools-extra/

clangd/

SourceCode.cpp

4 lines

Diff 245021

clang-tools-extra/clangd/SourceCode.cpp

	Show First 20 Lines • Show All 52 Lines • ▼ Show 20 Lines
	// Iterates over unicode codepoints in the (UTF-8) string. For each,			// Iterates over unicode codepoints in the (UTF-8) string. For each,
	// invokes CB(UTF-8 length, UTF-16 length), and breaks if it returns true.			// invokes CB(UTF-8 length, UTF-16 length), and breaks if it returns true.
	// Returns true if CB returned true, false if we hit the end of string.			// Returns true if CB returned true, false if we hit the end of string.
	template <typename Callback>			template <typename Callback>
	static bool iterateCodepoints(llvm::StringRef U8, const Callback &CB) {			static bool iterateCodepoints(llvm::StringRef U8, const Callback &CB) {
	// A codepoint takes two UTF-16 code unit if it's astral (outside BMP).			// A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
	// Astral codepoints are encoded as 4 bytes in UTF-8, starting with 11110xxx.			// Astral codepoints are encoded as 4 bytes in UTF-8, starting with 11110xxx.
	for (size_t I = 0; I < U8.size();) {			for (size_t I = 0; I < U8.size();) {
	unsigned char C = static_cast<unsigned char>(U8[I]);			unsigned short C = static_cast<unsigned short>(U8[I]);
	if (LLVM_LIKELY(!(C & 0x100))) { // ASCII or extended ASCII character.			if (LLVM_LIKELY(C <= 255)) { // ASCII or extended ASCII character.
				kadircetUnsubmitted Not Done Reply Inline Actions `C` is one byte long, it is not possible for it to ever satisfy `C&0x100`. kadircet: `C` is one byte long, it is not possible for it to ever satisfy `C&0x100`.
				AnakinZhengAuthorUnsubmitted Done Reply Inline Actions Updated the patch, now it should be correct. AnakinZheng: Updated the patch, now it should be correct.
				kadircetUnsubmitted Not Done Reply Inline Actions sorry but this is still the same since `C` is an `unsigned char` it is always guaranteed to be less than or equal to 255. As @sammccall explained above, we should rather re-arrange the logic to get rid of the assertion below, e.g. when `UTF8Length` is less than 2 or more than 4 we can choose to interpret it as ascii, instead of asserting, while commenting the reason for this choice. kadircet: sorry but this is still the same since `C` is an `unsigned char` it is always guaranteed to be…
				AnakinZhengAuthorUnsubmitted Done Reply Inline Actions It's unsigned short in the new patch. Ok, I'll try @sammccall 's suggestion. AnakinZheng: It's unsigned short in the new patch. Ok, I'll try @sammccall 's suggestion.
				kadircetUnsubmitted Not Done Reply Inline Actions It's unsigned short in the new patch. Sorry I missed that change, but still `U8[I]` is a one byte element(it is of type `char`), you can't get away by just casting it to `unsigned short`. kadircet: > It's unsigned short in the new patch. Sorry I missed that change, but still `U8[I]` is a one…
				AnakinZhengAuthorUnsubmitted Done Reply Inline Actions Yup, just aware of that, @sammccall 's suggestion looks reasonable, implementing now. AnakinZheng: Yup, just aware of that, @sammccall 's suggestion looks reasonable, implementing now.
	if (CB(1, 1))			if (CB(1, 1))
	return true;			return true;
	++I;			++I;
	continue;			continue;
	}			}
	// This convenient property of UTF-8 holds for all non-ASCII characters.			// This convenient property of UTF-8 holds for all non-ASCII characters.
	size_t UTF8Length = llvm::countLeadingOnes(C);			size_t UTF8Length = llvm::countLeadingOnes(C);
	// 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.			// 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
	// 11111xxx is not valid UTF-8 at all. Assert because it's probably our bug.			// 11111xxx is not valid UTF-8 at all. Assert because it's probably our bug.
	assert((UTF8Length >= 2 && UTF8Length <= 4) &&			assert((UTF8Length >= 2 && UTF8Length <= 4) &&
	"Invalid UTF-8, or transcoding bug?");			"Invalid UTF-8, or transcoding bug?");
	I += UTF8Length; // Skip over all trailing bytes.			I += UTF8Length; // Skip over all trailing bytes.
	// A codepoint takes two UTF-16 code unit if it's astral (outside BMP).			// A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
	// Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...)			// Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...)
	if (CB(UTF8Length, UTF8Length == 4 ? 2 : 1))			if (CB(UTF8Length, UTF8Length == 4 ? 2 : 1))
				kadircetUnsubmitted Not Done Reply Inline Actions can we also move this into valid case, and only increment by one on the invalid(extended ascii) case ? as we are trying to recover from extended ascii, and they are 1 byte in length, not multiple. kadircet: can we also move this into valid case, and only increment by one on the invalid(extended ascii)…
	return true;			return true;
				kadircetUnsubmitted Not Done Reply Inline Actions could you re-arrange this part into if (LLVM_LIKELY(UTF8Length >= 2 && UTF8Length <=4)) { // A codepoint takes two UTF-16 code unit if it's astral (outside BMP). // Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...) if (CB(UTF8Length, UTF8Length >> 1)) return true; continue; } vlog("File contains invalid UTF-8, or transcoding bug? UTF8Length = {0}, string = {1}", UTF8Length, U8); // If UTF8 length is 1, treat as one UTF8, if above 4, treat as one UTF16 if (CB(UTF8Length, UTF8Length == 1 ? 1 : 2)) return true; kadircet: could you re-arrange this part into ``` if (LLVM_LIKELY(UTF8Length >= 2 && UTF8Length <=4)) {…
	}			}
	return false;			return false;
	}			}
				kadircetUnsubmitted Not Done Reply Inline Actions instead of printing the string, it would be better to print hex representation of current character, you can check `llvm::utohexstr` kadircet: instead of printing the string, it would be better to print hex representation of current…
				sammccallUnsubmitted Not Done Reply Inline Actions Printing UTF8length isn't useful here., and I don't think current character is enough as it may not be where the error is (e.g. in the 10xxx case). I'd suggest llvm::toHex(U8) and printing the offset I. Printing the bytes is better than the text as we know we're misinterpreting the bytes! (In practice, these strings are parts of a single line, so it shouldn't be too long) sammccall: Printing UTF8length isn't useful here., and I don't think current character is enough as it may…
				sammccallUnsubmitted Not Done Reply Inline Actions also I think you want to set a flag and vlog only the first time around the loop. sammccall: also I think you want to set a flag and vlog only the first time around the loop.

				kadircetUnsubmitted Not Done Reply Inline Actions we are already in a bad shape here, instead of making it more fluctuating can we always `CB(1, 1)`? As long as you don't have any special reasoning for it. kadircet: we are already in a bad shape here, instead of making it more fluctuating can we always `CB(1…
				sammccallUnsubmitted Not Done Reply Inline Actions Yeah, this isn't a principled recovery strategy, so CB(1, 1) is the way to go I think. With a comment indicating that we're not returning anything really useful, like "// Treat this byte as a character, for lack of better options". (And CB(2, 1) seems silly if you think this text is actually ISO-8859-x - that's a single-byte encoding!) If we really wanted to do something principled, we could validate the UTF-8 first (llvm::isUTF8 in JSON.h, though it's missing some cases), and if it fails, invoke CB(1, 1) for each character under the assumption that it's ISO-8859-x so one-byte per char and all chars are in BMP. But I'd urge against it, I don't think this will actually end up working reliably end-to-end. sammccall: Yeah, this isn't a principled recovery strategy, so CB(1, 1) is the way to go I think. With a…
	// Returns the byte offset into the string that is an offset of \p Units in			// Returns the byte offset into the string that is an offset of \p Units in
	// the specified encoding.			// the specified encoding.
	// Conceptually, this converts to the encoding, truncates to CodeUnits,			// Conceptually, this converts to the encoding, truncates to CodeUnits,
	// converts back to UTF-8, and returns the length in bytes.			// converts back to UTF-8, and returns the length in bytes.
	static size_t measureUnits(llvm::StringRef U8, int Units, OffsetEncoding Enc,			static size_t measureUnits(llvm::StringRef U8, int Units, OffsetEncoding Enc,
	bool &Valid) {			bool &Valid) {
	Valid = Units >= 0;			Valid = Units >= 0;
	if (Units <= 0)			if (Units <= 0)
	▲ Show 20 Lines • Show All 1,053 Lines • Show Last 20 Lines