This is an archive of the discontinued LLVM Phabricator instance.

ConvertUTF: convertUTF32ToUTF8String
Abandoned · Public

Authored by MarcusJohnson91 on Jul 30 2021, 4:19 PM.

Details

Reviewers
efriedma
Summary

New revision, since git am got mangled somehow.

Maybe it's the pre-commit hook I created to run clang-format-diff.py that's mangling everything?

Diff Detail

Event Timeline

MarcusJohnson91 requested review of this revision. Jul 30 2021, 4:19 PM
Herald added a project: Restricted Project. Jul 30 2021, 4:19 PM

As far as I can tell, the latest version of the diff you uploaded still has the following issues that haven't been addressed:

  1. The BOM handling is actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap L"\uFFFE%s" or something like that.
  2. SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 seems like way too much memory.

> As far as I can tell, the latest version of the diff you uploaded still has the following issues that haven't been addressed:
>
>   1. The BOM handling is actively a problem if you're planning to use the interface to interpret wprintf format strings. We don't want to byteswap L"\uFFFE%s" or something like that.
>   2. SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 seems like way too much memory.

What BOM handling? There is no BOM function; bytes are swapped in the converter if the byte order isn't correct. Is that what you mean?

I copied SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 from the UTF-16 version.

Are you asking me to change the UTF-16 version too?

> What BOM handling? There is no BOM function; bytes are swapped in the converter if the byte order isn't correct. Is that what you mean?

I mean the behavior when handling strings that contain UNI_UTF32_BYTE_ORDER_MARK_SWAPPED.

I suspect a lot of places don't want the BOM handling to trigger. This includes trying to print diagnostics for wprintf, since the underlying function doesn't have any BOM handling. But I guess it's unlikely to matter in practice.
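To see the concern concretely, here is a minimal standalone sketch, using UTF-16 code units where the collision is most direct (in ConvertUTF.h the UTF-16 swapped-BOM constant 0xFFFE is bit-identical to the code point U+FFFE); maybeSwap below is a hypothetical stand-in for the auto-detection behavior, not code from the patch:

  #include <algorithm>
  #include <cstddef>
  #include <cstdint>

  using UTF16 = uint16_t;

  // In ConvertUTF.h, UNI_UTF16_BYTE_ORDER_MARK_SWAPPED is 0xFFFE, the same
  // bit pattern as the noncharacter U+FFFE, so ordinary data and a
  // byte-swapped BOM are indistinguishable.
  static const UTF16 SwappedBOM16 = 0xFFFE;

  // Hypothetical stand-in for the byteswap-on-swapped-BOM behavior.
  static void maybeSwap(UTF16 *Buf, size_t Len) {
    if (Len > 0 && Buf[0] == SwappedBOM16)
      std::transform(Buf, Buf + Len, Buf, [](UTF16 C) {
        return static_cast<UTF16>((C << 8) | (C >> 8));
      });
  }

  // A format string that merely starts with U+FFFE as data, e.g.
  //   UTF16 Fmt[] = {0xFFFE, 0x0025, 0x0073, 0}; // L"\uFFFE%s"
  // would have every code unit swapped by maybeSwap(), turning '%' (0x0025)
  // into 0x2500 and destroying the "%s" a wprintf diagnostic needs to inspect.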

> I copied SrcBytes.size() * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1 from the UTF-16 version.
>
> Are you asking me to change the UTF-16 version too?

I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16. A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8. That sort of thing is impossible in UTF-32: a UTF-32 string is never shorter than its translation to UTF-8. A codepoint in UTF-8 is at most 4 bytes.
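To make that expansion concrete, a small illustration (the variable names are just for exposition):

  // U+20AC (EURO SIGN) is one 2-byte code unit in UTF-16 but 3 bytes in
  // UTF-8, so UTF-16 -> UTF-8 conversion can grow the buffer by 3/2:
  const char16_t U16[] = u"\u20AC"; // 2 bytes of payload
  const char U8[] = "\xE2\x82\xAC"; // 3 bytes of payload
  // The same code point is 4 bytes in UTF-32, and no code point needs more
  // than 4 UTF-8 bytes, so UTF-32 -> UTF-8 conversion never grows the buffer:
  const char32_t U32[] = U"\u20AC"; // 4 bytes of payload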

> I'm not sure the math is right even for UTF-16, but anyway, UTF-32 is a little different from UTF-16. A 2-byte character in UTF-16 can translate to 3 bytes in UTF-8. That sort of thing is impossible in UTF-32: a UTF-32 string is never shorter than its translation to UTF-8. A codepoint in UTF-8 is at most 4 bytes.

I've written my own Unicode encoder/decoder before; I'm familiar with how it works.

You can store regular ASCII in a UTF-32 string; "Example" as UTF-32 would be 7 * 4 = 28 bytes (not counting the null terminator), whereas it would be just 7 bytes in UTF-8.

It looks like the std::string is being compacted afterwards with Out.resize(reinterpret_cast<char *>(Dst) - &Out[0]);

but maybe a call to Out.shrink_to_fit() at the end is warranted?

The way the math is written now, for "Example", we allocate UNI_MAX_UTF8_BYTES_PER_CODE_POINT * sizeof(UTF32) * 7 = 112 bytes.

> The way the math is written now, for "Example", we allocate UNI_MAX_UTF8_BYTES_PER_CODE_POINT * sizeof(UTF32) * 7 = 112 bytes.

Alright, I'm gonna give it a try and re-run the tests.

Seems like the tests still work.
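For reference, a standalone sketch of the two size bounds discussed above; UNI_MAX_UTF8_BYTES_PER_CODE_POINT is 4 in ConvertUTF.h, and the two function names here are hypothetical:

  #include <cstddef>

  static const size_t UNI_MAX_UTF8_BYTES_PER_CODE_POINT = 4; // as in ConvertUTF.h

  // Bound copied from the UTF-16 path: treats every *byte* of input as if it
  // could become a full 4-byte UTF-8 sequence. For "Example" (7 code points,
  // 28 bytes of UTF-32), this reserves 4 * 4 * 7 = 112 bytes plus a terminator.
  size_t oversizedBound(size_t SrcByteCount) {
    return SrcByteCount * UNI_MAX_UTF8_BYTES_PER_CODE_POINT + 1;
  }

  // Tight bound for UTF-32: each 4-byte code unit encodes one code point, and
  // a code point is at most 4 bytes in UTF-8, so the UTF-8 output can never be
  // longer than the UTF-32 input. For "Example" this reserves 28 + 1 = 29
  // bytes, which Out.resize() then trims to the 7 bytes actually written.
  size_t tightBound(size_t SrcByteCount) {
    return SrcByteCount + 1;
  }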

MarcusJohnson91 updated this revision to Diff 387786 (edited). Nov 16 2021, 4:35 PM

clang-format diff

OK, the tests passed. Can this be merged now?

MarcusJohnson91 abandoned this revision. Nov 21 2021, 6:15 PM

Reposted as D114342.