llvm/lib/Support/ConvertUTFWrapper.cpp
148	Wrong constant.
162	Wrong constant. Is this really the function you want to be using from clang? I don't really understand why you'd want to handle byte order marks.
197	Wrong function?

MarcusJohnson91 marked 3 inline comments as done. Jul 25 2021, 1:31 PM

MarcusJohnson91 added inline comments.

llvm/lib/Support/ConvertUTFWrapper.cpp
162	I don't really care about the BOM tbh, I just figured if I was in here, I should flesh out the UTF-32 interface.

Implemented the fixes mentioned and reformatted the patch

Harbormaster completed remote builds in B116096: Diff 361535. Jul 25 2021, 2:53 PM

Anyone got any ideas what happened this time?

The buildbot seems to think your new unittests are broken. Not sure why. (You can run just the unittests with "ninja check-llvm-unit".)

llvm/include/llvm/Support/ConvertUTF.h
127	UNI_UTF32_BYTE_ORDER_MARK_SWAPPED doesn't have the correct bit pattern.
llvm/lib/Support/ConvertUTFWrapper.cpp
162	The BOM handling is actually actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap `L"\uFFFE%s"` or something like that.
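
To make the bit-pattern remark concrete, here is a minimal standalone sketch (not code from the patch) of what a byte-swapped UTF-32 BOM looks like. `byteSwap32`, `matchesMarker` and `checkBomMarkers` are hypothetical names used only for illustration. It also assumes, as one possible reading of the wprintf concern, that comparing the first code unit against the 16-bit pattern 0xFFFE would make a format string that starts with U+FFFE look like a swapped BOM.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration; the patch itself calls llvm::ByteSwap_32.
constexpr std::uint32_t byteSwap32(std::uint32_t V) {
  return ((V & 0x000000FFu) << 24) | ((V & 0x0000FF00u) << 8) |
         ((V & 0x00FF0000u) >> 8) | ((V & 0xFF000000u) >> 24);
}

// U+FEFF stored as a single 32-bit code unit in native byte order.
constexpr std::uint32_t Utf32BomNative = 0x0000FEFFu;
// The same four bytes read with the opposite byte order.
constexpr std::uint32_t Utf32BomSwapped = byteSwap32(Utf32BomNative);

static_assert(Utf32BomSwapped == 0xFFFE0000u,
              "a swapped UTF-32 BOM has its payload in the high 16 bits");

// Hypothetical check: does the first code unit equal the given marker?
constexpr bool matchesMarker(std::uint32_t FirstUnit, std::uint32_t Marker) {
  return FirstUnit == Marker;
}

// Small self-check showing the difference between the two markers.
inline void checkBomMarkers() {
  // U+FFFE as the first character of a wide format string (32-bit wchar_t).
  const std::uint32_t FirstUnitOfFormat = 0x0000FFFEu;
  // Compared against the 16-bit pattern, ordinary data looks like a BOM.
  assert(matchesMarker(FirstUnitOfFormat, 0xFFFEu));
  // Compared against the 32-bit swapped pattern, it does not.
  assert(!matchesMarker(FirstUnitOfFormat, Utf32BomSwapped));
}
```

Even with the right constant, the reviewer's broader point stands: a converter used for format strings probably should not reinterpret its input based on the first code unit at all.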

Dropped the UTF32 BOM stuff

efriedma added inline comments. Jul 27 2021, 4:12 PM

llvm/lib/Support/ConvertUTFWrapper.cpp
176	`SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1` seems like way too much memory.

Harbormaster completed remote builds in B116554: Diff 362198. Jul 27 2021, 4:48 PM

MarcusJohnson91 added inline comments. Jul 27 2021, 7:46 PM

llvm/lib/Support/ConvertUTFWrapper.cpp
176	I copied that from the UTF16 code

Updated the tests

Harbormaster completed remote builds in B116726: Diff 362431. Jul 28 2021, 10:41 AM

MarcusJohnson91 updated this revision to Diff 362511. Jul 28 2021, 1:12 PM

Harbormaster completed remote builds in B116782: Diff 362511. Jul 28 2021, 3:33 PM

The problem seems to be that the conversion function expects the input to be a multiple of 4 bytes long. That doesn't hold up with the way ArrayRef stores things as char that is then cast to char32_t when the string contains ASCII values, like the underscore in the middle of the look of disapproval emoji.

But removing the assert and early return results in even more errors.

Changing the input string to remove the underscore also fails; I'm out of ideas.

The tests work on my machine now. It turns out the big-endian one needs a BOM, which is pretty obvious in hindsight.
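
The two points above (a byte count that must be a multiple of 4, and a big-endian input that needs a BOM) can be sketched as follows. This is an illustrative standalone reader, not the code from the patch; `swapBytes` and `readUTF32Units` are hypothetical names. Without a BOM there is nothing in the bytes that says they are big-endian, so on a little-endian host `00 01 F5 2E` would be read as 0x2EF50100, which is not a legal code point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, not part of LLVM: swap the byte order of one code unit.
inline std::uint32_t swapBytes(std::uint32_t V) {
  return (V << 24) | ((V & 0x0000FF00u) << 8) | ((V & 0x00FF0000u) >> 8) |
         (V >> 24);
}

// Hypothetical helper, not part of LLVM: copy raw bytes into UTF-32 code
// units, honoring an optional leading BOM. Returns false if the byte count
// is not a whole number of 4-byte units.
inline bool readUTF32Units(const char *Bytes, std::size_t Size,
                           std::vector<std::uint32_t> &Units) {
  assert(Bytes != nullptr || Size == 0);
  Units.clear();
  if (Size % sizeof(std::uint32_t) != 0)
    return false;

  // memcpy avoids any alignment requirement on Bytes.
  Units.resize(Size / sizeof(std::uint32_t));
  if (Size != 0)
    std::memcpy(Units.data(), Bytes, Size);

  // A swapped BOM means the data is in the non-native byte order.
  if (!Units.empty() && Units.front() == 0xFFFE0000u)
    for (std::uint32_t &U : Units)
      U = swapBytes(U);

  // Drop a (now native) BOM so it is not converted as text.
  if (!Units.empty() && Units.front() == 0x0000FEFFu)
    Units.erase(Units.begin());
  return true;
}
```

Note that the reviewer questions whether the wrapper should look at BOMs at all; the sketch only illustrates why the big-endian test could not pass without one.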

Harbormaster completed remote builds in B117046: Diff 362882. Jul 29 2021, 3:26 PM

Formatted the diff

Harbormaster completed remote builds in B117063: Diff 362907. Jul 29 2021, 4:07 PM

MarcusJohnson91 updated this revision to Diff 362923. Jul 29 2021, 4:40 PM

Harbormaster completed remote builds in B117075: Diff 362923. Jul 29 2021, 5:17 PM

efriedma added inline comments. Jul 30 2021, 2:20 PM

llvm/lib/Support/ConvertUTFWrapper.cpp
162	Any thoughts on this?
176	I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16. A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8. That sort of thing is impossible in UTF-32: a UTF-32 string is never shorter than its translation to UTF-8. A codepoint in UTF-8 is at most 4 bytes.
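
The sizing argument can be checked with a small sketch. It is illustrative only: `utf8Length`, `maxUTF8Bytes` and `exactUTF8Bytes` are hypothetical helpers, not LLVM functions. Each UTF-32 code unit occupies 4 bytes and encodes to at most 4 UTF-8 bytes, so the output never needs more bytes than the input has, whereas `SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1` reserves roughly 16 bytes per code point.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of UTF-8 bytes needed for one code point.
constexpr std::size_t utf8Length(char32_t C) {
  return C < 0x80 ? 1 : C < 0x800 ? 2 : C < 0x10000 ? 3 : 4;
}

// Hypothetical helper: upper bound on UTF-8 output for SrcByteCount bytes of
// UTF-32 input. (SrcByteCount / 4) code units times at most 4 bytes each.
constexpr std::size_t maxUTF8Bytes(std::size_t SrcByteCount) {
  return SrcByteCount;
}

// Hypothetical helper: exact UTF-8 size of a UTF-32 string, checked against
// the bound above.
inline std::size_t exactUTF8Bytes(const std::u32string &Src) {
  std::size_t Total = 0;
  for (char32_t C : Src)
    Total += utf8Length(C);
  assert(Total <= maxUTF8Bytes(Src.size() * sizeof(char32_t)));
  return Total;
}
```

For comparison, U+0CA0 from the existing UTF-16 tests is one 2-byte UTF-16 code unit but three UTF-8 bytes (`\xe0\xb2\xa0`), which is the growth case that cannot happen when the source is UTF-32.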

It seems like this diff keeps getting reverted?

I've fixed all the issues mentioned, and the tests work now, everything is formatted correctly too.

I've set git up to do full context diffs, but it's not working?

@efriedma it seems like you are commenting on old revisions?

The first comment, about the UTF16 BOM in the UTF32 converter, was fixed a long time ago. As for the second comment, on line 168, I don't even see what you're talking about there.

Harbormaster completed remote builds in B117266: Diff 363216. Jul 30 2021, 3:30 PM

As far as I can tell, the latest version of the diff you uploaded still has the following issues that haven't been addressed:

The BOM handling is actually actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap L"\uFFFE%s" or something like that.
SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 seems like way too much memory.

There is only one function in ConvertUTFWrapper.cpp: convertUTF32ToUTF8String

idk wtf is going on, maybe amending the commit is breaking something?

the diff I see here is correct...

Maybe I should just make a new diff here entirely?

MarcusJohnson91 abandoned this revision. Jul 30 2021, 5:56 PM

If you're having trouble making Arcanist work correctly, you can always just upload "git diff" or "git show" output at https://reviews.llvm.org/differential/diff/create/ .

efriedma mentioned this in D107202: ConvertUTF: convertUTF32ToUTF8String. Aug 2 2021, 2:28 PM

Diff 363216

llvm/include/llvm/Support/ConvertUTF.h

	[... 117 lines not shown ...]
	#define UNI_MAX_UTF32 (UTF32)0x7FFFFFFF			#define UNI_MAX_UTF32 (UTF32)0x7FFFFFFF
	#define UNI_MAX_LEGAL_UTF32 (UTF32)0x0010FFFF			#define UNI_MAX_LEGAL_UTF32 (UTF32)0x0010FFFF

	#define UNI_MAX_UTF8_BYTES_PER_CODE_POINT 4			#define UNI_MAX_UTF8_BYTES_PER_CODE_POINT 4

	#define UNI_UTF16_BYTE_ORDER_MARK_NATIVE 0xFEFF			#define UNI_UTF16_BYTE_ORDER_MARK_NATIVE 0xFEFF
	#define UNI_UTF16_BYTE_ORDER_MARK_SWAPPED 0xFFFE			#define UNI_UTF16_BYTE_ORDER_MARK_SWAPPED 0xFFFE

				#define UNI_UTF32_BYTE_ORDER_MARK_NATIVE 0x0000FEFF
				#define UNI_UTF32_BYTE_ORDER_MARK_SWAPPED 0xFFFE0000
				efriedma (Not Done): UNI_UTF32_BYTE_ORDER_MARK_SWAPPED doesn't have the correct bit pattern.

	typedef enum {			typedef enum {
	conversionOK, /* conversion successful */			conversionOK, /* conversion successful */
	sourceExhausted, /* partial character in source, but hit end */			sourceExhausted, /* partial character in source, but hit end */
	targetExhausted, /* insuff. room in target for conversion */			targetExhausted, /* insuff. room in target for conversion */
	sourceIllegal /* source sequence is illegal/malformed */			sourceIllegal /* source sequence is illegal/malformed */
	} ConversionResult;			} ConversionResult;

	typedef enum {			typedef enum {
	[... 140 lines not shown ...]
	*			*
	* \param [in] Src A buffer of UTF-16 encoded text.			* \param [in] Src A buffer of UTF-16 encoded text.
	* \param [out] Out Converted UTF-8 is stored here on success.			* \param [out] Out Converted UTF-8 is stored here on success.
	* \returns true on success			* \returns true on success
	*/			*/
	bool convertUTF16ToUTF8String(ArrayRef<UTF16> Src, std::string &Out);			bool convertUTF16ToUTF8String(ArrayRef<UTF16> Src, std::string &Out);

	/**			/**
				* Converts a stream of raw bytes assumed to be UTF32 into a UTF8 std::string.
				*
				* \param [in] SrcBytes A buffer of what is assumed to be UTF-32 encoded text.
				* \param [out] Out Converted UTF-8 is stored here on success.
				* \returns true on success
				*/
				bool convertUTF32ToUTF8String(ArrayRef<char> SrcBytes, std::string &Out);

				/**
				* Converts a UTF32 string into a UTF8 std::string.
				*
				* \param [in] Src A buffer of UTF-32 encoded text.
				* \param [out] Out Converted UTF-8 is stored here on success.
				* \returns true on success
				*/
				bool convertUTF32ToUTF8String(ArrayRef<UTF32> Src, std::string &Out);

				/**
	* Converts a UTF-8 string into a UTF-16 string with native endianness.			* Converts a UTF-8 string into a UTF-16 string with native endianness.
	*			*
	* \returns true on success			* \returns true on success
	*/			*/
	bool convertUTF8ToUTF16String(StringRef SrcUTF8,			bool convertUTF8ToUTF16String(StringRef SrcUTF8,
	SmallVectorImpl<UTF16> &DstUTF16);			SmallVectorImpl<UTF16> &DstUTF16);

	#if defined(_WIN32)			#if defined(_WIN32)
	[... 17 lines not shown ...]

llvm/lib/Support/ConvertUTFWrapper.cpp

	[... 135 lines not shown ...]

	bool convertUTF16ToUTF8String(ArrayRef<UTF16> Src, std::string &Out)			bool convertUTF16ToUTF8String(ArrayRef<UTF16> Src, std::string &Out)
	{			{
	return convertUTF16ToUTF8String(			return convertUTF16ToUTF8String(
	llvm::ArrayRef<char>(reinterpret_cast<const char *>(Src.data()),			llvm::ArrayRef<char>(reinterpret_cast<const char *>(Src.data()),
	Src.size() * sizeof(UTF16)), Out);			Src.size() * sizeof(UTF16)), Out);
	}			}

				bool convertUTF32ToUTF8String(ArrayRef<char> SrcBytes, std::string &Out) {
				assert(Out.empty());

				// Avoid OOB by returning early on empty input.
				if (SrcBytes.empty())
				efriedma (Done): Wrong constant.
				return true;

				const UTF32 *Src = reinterpret_cast<const UTF32 *>(SrcBytes.begin());
				const UTF32 *SrcEnd = reinterpret_cast<const UTF32 *>(SrcBytes.end());

				assert((uintptr_t)Src % sizeof(UTF32) == 0);

				// Byteswap if necessary.
				std::vector<UTF32> ByteSwapped;
				if (Src[0] == UNI_UTF32_BYTE_ORDER_MARK_SWAPPED) {
				ByteSwapped.insert(ByteSwapped.end(), Src, SrcEnd);
				for (unsigned I = 0, E = ByteSwapped.size(); I != E; ++I)
				ByteSwapped[I] = llvm::ByteSwap_32(ByteSwapped[I]);
				Src = &ByteSwapped[0];
				efriedma (Done): Wrong constant. Is this really the function you want to be using from clang? I don't really understand why you'd want to handle byte order marks.
				MarcusJohnson91 (Author, Done): I don't really care about the BOM tbh, I just figured if I was in here, I should flesh out the UTF-32 interface.
				efriedma (Not Done): The BOM handling is actually actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap `L"\uFFFE%s"` or something like that.
				efriedma (Not Done): Any thoughts on this?
				SrcEnd = &ByteSwapped[ByteSwapped.size() - 1] + 1;
				}

				// Skip the BOM for conversion.
				if (Src[0] == UNI_UTF32_BYTE_ORDER_MARK_NATIVE)
				Src++;

				// Just allocate enough space up front. We'll shrink it later. Allocate
				// enough that we can fit a null terminator without reallocating.
				Out.resize(SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1);
				UTF8 *Dst = reinterpret_cast<UTF8 *>(&Out[0]);
				UTF8 *DstEnd = Dst + Out.size();

				ConversionResult CR =
				efriedma (Not Done): `SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1` seems like way too much memory.
				MarcusJohnson91 (Author, Done): I copied that from the UTF16 code
				efriedma (Not Done): I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16. A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8. That sort of thing is impossible in UTF-32: a UTF-32 string is never shorter than its translation to UTF-8. A codepoint in UTF-8 is at most 4 bytes.
				ConvertUTF32toUTF8(&Src, SrcEnd, &Dst, DstEnd, strictConversion);
				assert(CR != targetExhausted);

				if (CR != conversionOK) {
				Out.clear();
				return false;
				}

				Out.resize(reinterpret_cast<char *>(Dst) - &Out[0]);
				Out.push_back(0);
				Out.pop_back();
				return true;
				}

				bool convertUTF32ToUTF8String(ArrayRef<UTF32> Src, std::string &Out) {
				return convertUTF32ToUTF8String(
				llvm::ArrayRef<char>(reinterpret_cast<const char *>(Src.data()),
				Src.size() * sizeof(UTF32)),
				Out);
				}

				efriedma (Done): Wrong function?
	bool convertUTF8ToUTF16String(StringRef SrcUTF8,			bool convertUTF8ToUTF16String(StringRef SrcUTF8,
	SmallVectorImpl<UTF16> &DstUTF16) {			SmallVectorImpl<UTF16> &DstUTF16) {
	assert(DstUTF16.empty());			assert(DstUTF16.empty());

	// Avoid OOB by returning early on empty input.			// Avoid OOB by returning early on empty input.
	if (SrcUTF8.empty()) {			if (SrcUTF8.empty()) {
	DstUTF16.push_back(0);			DstUTF16.push_back(0);
	DstUTF16.pop_back();			DstUTF16.pop_back();
	[... 101 lines not shown ...]

llvm/unittests/Support/ConvertUTFTest.cpp

[... 19 lines not shown ...]	TEST(ConvertUTFTest, ConvertUTF16LittleEndianToUTF8String) {
ArrayRef<char> Ref(Src, sizeof(Src) - 1);		ArrayRef<char> Ref(Src, sizeof(Src) - 1);
std::string Result;		std::string Result;
bool Success = convertUTF16ToUTF8String(Ref, Result);		bool Success = convertUTF16ToUTF8String(Ref, Result);
EXPECT_TRUE(Success);		EXPECT_TRUE(Success);
std::string Expected("\xe0\xb2\xa0_\xe0\xb2\xa0");		std::string Expected("\xe0\xb2\xa0_\xe0\xb2\xa0");
EXPECT_EQ(Expected, Result);		EXPECT_EQ(Expected, Result);
}		}

		TEST(ConvertUTFTest, ConvertUTF32LittleEndianToUTF8String) {
		// Src is a crystal ball.
		alignas(UTF32) static const char Src[] = "\x2E\xF5\x01\x00";
		ArrayRef<char> Ref(Src, sizeof(Src) - 1);
		std::string Result;
		bool Success = convertUTF32ToUTF8String(Ref, Result);
		EXPECT_TRUE(Success);
		std::string Expected("\xF0\x9F\x94\xAE");
		EXPECT_EQ(Expected, Result);
		}

TEST(ConvertUTFTest, ConvertUTF16BigEndianToUTF8String) {		TEST(ConvertUTFTest, ConvertUTF16BigEndianToUTF8String) {
// Src is the look of disapproval.		// Src is the look of disapproval.
alignas(UTF16) static const char Src[] = "\xfe\xff\x0c\xa0\x00_\x0c\xa0";		alignas(UTF16) static const char Src[] = "\xfe\xff\x0c\xa0\x00_\x0c\xa0";
ArrayRef<char> Ref(Src, sizeof(Src) - 1);		ArrayRef<char> Ref(Src, sizeof(Src) - 1);
std::string Result;		std::string Result;
bool Success = convertUTF16ToUTF8String(Ref, Result);		bool Success = convertUTF16ToUTF8String(Ref, Result);
EXPECT_TRUE(Success);		EXPECT_TRUE(Success);
std::string Expected("\xe0\xb2\xa0_\xe0\xb2\xa0");		std::string Expected("\xe0\xb2\xa0_\xe0\xb2\xa0");
EXPECT_EQ(Expected, Result);		EXPECT_EQ(Expected, Result);
}		}

		TEST(ConvertUTFTest, ConvertUTF32BigEndianToUTF8String) {
		// Src is a crystal ball.
		alignas(UTF32) static const char Src[] = "\x00\x00\xfe\xff\x00\x01\xF5\x2E";
		ArrayRef<char> Ref(Src, sizeof(Src) - 1);
		std::string Result;
		bool Success = convertUTF32ToUTF8String(Ref, Result);
		EXPECT_TRUE(Success);
		std::string Expected("\xF0\x9F\x94\xAE");
		EXPECT_EQ(Expected, Result);
		}

TEST(ConvertUTFTest, ConvertUTF8ToUTF16String) {		TEST(ConvertUTFTest, ConvertUTF8ToUTF16String) {
// Src is the look of disapproval.		// Src is the look of disapproval.
static const char Src[] = "\xe0\xb2\xa0_\xe0\xb2\xa0";		static const char Src[] = "\xe0\xb2\xa0_\xe0\xb2\xa0";
StringRef Ref(Src, sizeof(Src) - 1);		StringRef Ref(Src, sizeof(Src) - 1);
SmallVector<UTF16, 5> Result;		SmallVector<UTF16, 5> Result;
bool Success = convertUTF8ToUTF16String(Ref, Result);		bool Success = convertUTF8ToUTF16String(Ref, Result);
EXPECT_TRUE(Success);		EXPECT_TRUE(Success);
static const UTF16 Expected[] = {0x0CA0, 0x005f, 0x0CA0, 0};		static const UTF16 Expected[] = {0x0CA0, 0x005f, 0x0CA0, 0};
[... 1,665 lines not shown ...]

This is an archive of the discontinued LLVM Phabricator instance.

ConvertUTF: Created wrapper convertUTF32ToUTF8String
Abandoned, Public

Details

Diff Detail

Unit Tests: Failed

Event Timeline

Revision Contents

Diff 363216

llvm/include/llvm/Support/ConvertUTF.h

llvm/lib/Support/ConvertUTFWrapper.cpp

llvm/unittests/Support/ConvertUTFTest.cpp
