This is an archive of the discontinued LLVM Phabricator instance.

@jaafar would you be able to verify this is equivalent to D74731 for your purposes?
If so I'd like to cherry-pick this into the 10.x branch after landing.

Harbormaster completed remote builds in B59736: Diff 269732. Jun 9 2020, 11:06 PM

LGTM, thanks! (I bet you rushed it already for the cherry-pick, but just wanted to remind again that we should :D)

This revision is now accepted and ready to land. Jun 10 2020, 1:51 AM

Closed by commit rGf2c8f6e16d25: [clangd] Log rather than assert on bad UTF-8. (authored by sammccall). Jun 10 2020, 2:42 AM

This revision was automatically updated to reflect the committed changes.

I can confirm that this commit works equally well for the UTF-8 assertion failure. Thank you!

Is there some way to easily modify the source to produce a diagnostic pointing to the issue's source location? I would like to file an appropriate bug in Boost.

In D81530#2085725, @jaafar wrote:

> I can confirm that this commit works equally well for the UTF-8 assertion failure. Thank you!
>
> Is there some way to easily modify the source to produce a diagnostic pointing to the issue's source location? I would like to file an appropriate bug in Boost.

In https://www.boost.org/doc/libs/1_73_0/boost/spirit/home/support/char_encoding/iso8859_1.hpp, the line labeled 161 a1, whose bytes are "/* \xa1 161 a1 */ BOOST_CC_PUNCT,", triggers it because of the "\xa1".

It's not a cut and dried bug, but I do think removing the high bytes from these comments is a good idea, so it's probably worth filing the bug.
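
On the question of pointing at the offending source location: a small standalone scan along the following lines could report the line and byte column of the first byte that does not fit UTF-8. This is only a sketch, not part of the patch. `BadByte`, `leadingOnes`, and `findFirstBadUTF8` are hypothetical names, C++17 is assumed, and the check is deliberately shallow: it validates lead bytes and the `10xxxxxx` shape of trailing bytes, but not overlong forms or surrogates.

```cpp
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical result type: 1-based line and byte column of the first bad byte.
struct BadByte {
  std::size_t Line;
  std::size_t Column;
  unsigned char Value;
};

// Number of leading 1 bits in a byte (the quantity the patch obtains from
// llvm::countLeadingOnes).
static unsigned leadingOnes(unsigned char C) {
  unsigned N = 0;
  for (unsigned Mask = 0x80; Mask != 0 && (C & Mask); Mask >>= 1)
    ++N;
  return N;
}

// Returns the position of the first byte that cannot start a plausible UTF-8
// sequence, or std::nullopt if the scan finds none.
std::optional<BadByte> findFirstBadUTF8(std::string_view Text) {
  std::size_t Line = 1, Column = 1;
  for (std::size_t I = 0; I < Text.size();) {
    unsigned char C = static_cast<unsigned char>(Text[I]);
    if (!(C & 0x80)) { // ASCII byte.
      if (C == '\n') {
        ++Line;
        Column = 1;
      } else {
        ++Column;
      }
      ++I;
      continue;
    }
    unsigned Len = leadingOnes(C);
    bool Ok = Len >= 2 && Len <= 4 && I + Len <= Text.size();
    // Every trailing byte must look like 10xxxxxx.
    for (unsigned K = 1; Ok && K < Len; ++K)
      Ok = (static_cast<unsigned char>(Text[I + K]) & 0xC0) == 0x80;
    if (!Ok)
      return BadByte{Line, Column, C};
    I += Len;
    Column += Len; // Columns are counted in bytes in this sketch.
  }
  return std::nullopt;
}
```

Run over the contents of the Boost header, a scan like this should stop at the `\xa1` byte inside the comment, which gives a concrete line and column to quote in a bug report.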

The C++ standard doesn't say anything about the encoding of characters on disk (the "input encoding" of bytes -> source character set) - it starts with the source character set, I think. So what do implementations do?

clang: from reading the code, clang only supports UTF-8. It supports gcc's -finput-charset flag, but setting it to anything other than UTF-8 is an error!
GCC: supports a variety of input encodings, configured with -finput-charset or locale. As such it's not really reasonable for a header designed to be included to require any particular value, as you can't pass a flag for just that header.

In practice this doesn't actually affect compilation because the bad UTF-8 sequence in the comment is never parsed: clang just skips over it looking for */. It probably mostly affects tools that use line/column coordinates (like clangd by virtue of LSP, and clang diagnostics), and tools that extract comment contents (doxygen et al).
But it still seems that it would be clearer to remove the literal characters from the source file, or write them in UTF-8.
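
For the "write them in UTF-8" option, the conversion from ISO-8859-1 is mechanical, because each ISO-8859-1 byte value equals its Unicode code point. A minimal sketch, assuming the input really is ISO-8859-1 (`latin1ToUTF8` is a hypothetical helper, not something in clang or Boost):

```cpp
#include <string>
#include <string_view>

// Converts ISO-8859-1 bytes to UTF-8. Bytes below 0x80 are ASCII and are
// copied unchanged; bytes 0x80..0xFF are code points U+0080..U+00FF, which
// UTF-8 encodes as two bytes: 110000xx 10xxxxxx.
std::string latin1ToUTF8(std::string_view Latin1) {
  std::string Out;
  Out.reserve(Latin1.size());
  for (char Ch : Latin1) {
    unsigned char C = static_cast<unsigned char>(Ch);
    if (C < 0x80) {
      Out.push_back(static_cast<char>(C));
    } else {
      Out.push_back(static_cast<char>(0xC0 | (C >> 6)));   // Top 2 bits.
      Out.push_back(static_cast<char>(0x80 | (C & 0x3F))); // Low 6 bits.
    }
  }
  return Out;
}
```

With this, the single byte `\xa1` from the comment becomes the two-byte sequence `\xc2\xa1`, which is valid UTF-8 for U+00A1.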

Thanks for the detailed analysis! I have filed https://github.com/boostorg/spirit/issues/612

Revision Contents

Path                                                          Size
clang-tools-extra/clangd/SourceCode.cpp                       23 lines
clang-tools-extra/clangd/unittests/SourceCodeTests.cpp        15 lines
clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp   14 lines

Diff 269773

clang-tools-extra/clangd/SourceCode.cpp

[...]
 namespace clangd {

 // Here be dragons. LSP positions use columns measured in UTF-16 code units!
 // Clangd uses UTF-8 and byte-offsets internally, so conversion is nontrivial.

 // Iterates over unicode codepoints in the (UTF-8) string. For each,
 // invokes CB(UTF-8 length, UTF-16 length), and breaks if it returns true.
 // Returns true if CB returned true, false if we hit the end of string.
+//
+// If the string is not valid UTF-8, we log this error and "decode" the
+// text in some arbitrary way. This is pretty sad, but this tends to happen deep
+// within indexing of headers where clang misdetected the encoding, and
+// propagating the error all the way back up is (probably?) not be worth it.
 template <typename Callback>
 static bool iterateCodepoints(llvm::StringRef U8, const Callback &CB) {
+  bool LoggedInvalid = false;
   // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
   // Astral codepoints are encoded as 4 bytes in UTF-8, starting with 11110xxx.
   for (size_t I = 0; I < U8.size();) {
     unsigned char C = static_cast<unsigned char>(U8[I]);
     if (LLVM_LIKELY(!(C & 0x80))) { // ASCII character.
       if (CB(1, 1))
         return true;
       ++I;
       continue;
     }
     // This convenient property of UTF-8 holds for all non-ASCII characters.
     size_t UTF8Length = llvm::countLeadingOnes(C);
     // 0xxx is ASCII, handled above. 10xxx is a trailing byte, invalid here.
-    // 11111xxx is not valid UTF-8 at all. Assert because it's probably our bug.
-    assert((UTF8Length >= 2 && UTF8Length <= 4) &&
-           "Invalid UTF-8, or transcoding bug?");
+    // 11111xxx is not valid UTF-8 at all, maybe some ISO-8859-*.
+    if (LLVM_UNLIKELY(UTF8Length < 2 || UTF8Length > 4)) {
+      if (!LoggedInvalid) {
+        elog("File has invalid UTF-8 near offset {0}: {1}", I, llvm::toHex(U8));
+        LoggedInvalid = true;
+      }
+      // We can't give a correct result, but avoid returning something wild.
+      // Pretend this is a valid ASCII byte, for lack of better options.
+      // (Too late to get ISO-8859-* right, we've skipped some bytes already).
+      if (CB(1, 1))
+        return true;
+      ++I;
+      continue;
+    }
     I += UTF8Length; // Skip over all trailing bytes.
     // A codepoint takes two UTF-16 code unit if it's astral (outside BMP).
     // Astral codepoints are encoded as 4 bytes in UTF-8 (11110xxx ...)
     if (CB(UTF8Length, UTF8Length == 4 ? 2 : 1))
       return true;
   }
   return false;
 }
[...]
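
To see the fallback in isolation, here is a rough standalone sketch of the same counting rule without the LLVM dependencies. It is an illustration under assumptions, not the clangd implementation: `utf16LengthLenient` and `leadingOnes` are hypothetical names, the callback is replaced by a running total of UTF-16 code units, and logging is reduced to an optional flag.

```cpp
#include <cstddef>
#include <string_view>

// Number of leading 1 bits in a byte (what llvm::countLeadingOnes returns
// for an unsigned char in the patch).
static unsigned leadingOnes(unsigned char C) {
  unsigned N = 0;
  for (unsigned Mask = 0x80; Mask != 0 && (C & Mask); Mask >>= 1)
    ++N;
  return N;
}

// Counts UTF-16 code units for a UTF-8 string. A byte that cannot be a
// UTF-8 lead byte is counted as if it were a single ASCII byte, mirroring
// the fallback in the patch; *SawInvalid (if provided) records that this
// happened.
std::size_t utf16LengthLenient(std::string_view U8,
                               bool *SawInvalid = nullptr) {
  std::size_t Units = 0;
  for (std::size_t I = 0; I < U8.size();) {
    unsigned char C = static_cast<unsigned char>(U8[I]);
    if (!(C & 0x80)) { // ASCII: one byte, one code unit.
      ++Units;
      ++I;
      continue;
    }
    unsigned Len = leadingOnes(C);
    if (Len < 2 || Len > 4) { // Trailing byte or 11111xxx: not a lead byte.
      if (SawInvalid)
        *SawInvalid = true;
      ++Units;
      ++I;
      continue;
    }
    I += Len;                     // Skip the lead byte and its trailing bytes.
    Units += (Len == 4) ? 2 : 1;  // Astral code points need a surrogate pair.
  }
  return Units;
}
```

The result for invalid input is not meaningful as a column, but it never exceeds the byte length of the input, which is the property the new unit test checks.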

clang-tools-extra/clangd/unittests/SourceCodeTests.cpp

[...]
 TEST(SourceCodeTests, lspLength) {
   [...]
   EXPECT_EQ(lspLength("ascii"), 5UL);
   // BMP
   EXPECT_EQ(lspLength("↓"), 1UL);
   EXPECT_EQ(lspLength("¥"), 1UL);
   // astral
   EXPECT_EQ(lspLength("😂"), 1UL);
 }

+TEST(SourceCodeTests, lspLengthBadUTF8) {
+  // Results are not well-defined if source file isn't valid UTF-8.
+  // However we shouldn't crash or return something totally wild.
+  const char *BadUTF8[] = {"\xa0", "\xff\xff\xff\xff\xff"};
+
+  for (OffsetEncoding Encoding :
+       {OffsetEncoding::UTF8, OffsetEncoding::UTF16, OffsetEncoding::UTF32}) {
+    WithContextValue UTF32(kCurrentOffsetEncoding, Encoding);
+    for (const char *Bad : BadUTF8) {
+      EXPECT_GE(lspLength(Bad), 0u);
+      EXPECT_LE(lspLength(Bad), strlen(Bad));
+    }
+  }
+}
+
 // The = → 🡆 below are ASCII (1 byte), BMP (3 bytes), and astral (4 bytes).
 const char File[] = R"(0:0 = 0
 1:0 → 8
 2:0 🡆 18)";
 struct Line {
   unsigned Number;
   unsigned Offset;
   unsigned Length;
[...]

clang-tools-extra/clangd/unittests/SymbolCollectorTests.cpp

[...]
 TEST_F(SymbolCollectorTest, InvalidSourceLoc) {
   const char *Header = R"(
   void operator delete(void*)
     __attribute__((__externally_visible__));)";
   runSymbolCollector(Header, /**/ "");
   EXPECT_THAT(Symbols, Contains(QName("operator delete")));
 }

+TEST_F(SymbolCollectorTest, BadUTF8) {
+  // Extracted from boost/spirit/home/support/char_encoding/iso8859_1.hpp
+  // This looks like UTF-8 and fools clang, but has high-ISO-8859-1 comments.
+  const char *Header = "int PUNCT = 0;\n"
+                       "int types[] = { /* \xa1 */PUNCT };";
+  CollectorOpts.RefFilter = RefKind::All;
+  CollectorOpts.RefsInHeaders = true;
+  runSymbolCollector(Header, "");
+  EXPECT_THAT(Symbols, Contains(QName("types")));
+  EXPECT_THAT(Symbols, Contains(QName("PUNCT")));
+  // Reference is stored, although offset within line is not reliable.
+  EXPECT_THAT(Refs, Contains(Pair(findSymbol(Symbols, "PUNCT").ID, _)));
+}
+
 } // namespace
 } // namespace clangd
 } // namespace clang

[clangd] Log rather than assert on bad UTF-8. (Closed, Public)
