This is an archive of the discontinued LLVM Phabricator instance.


Differential D138518

Update the list of double width codepoints
Closed, Public

Authored by cor3ntin on Nov 22 2022, 11:32 AM.


Details

Reviewers

aaron.ballman
tahonermann

Commits

rG2903769bf524: Update the list of double width codepoints

Summary

All East Asian Width wide and fullwidth codepoints
are considered double width, as well as emoji and
symbols commonly rendered as emoji.
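
As a rough illustration of the classification described in the summary, the sketch below looks up a code point in a sorted table of inclusive ranges using binary search. `CodepointRange`, `SampleDoubleWidth`, and `sampleWidth` are hypothetical names, not LLVM's API, and the table holds only a handful of ranges copied from the new list in this diff, so it is not a substitute for the real one.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>

// Hypothetical stand-in for an inclusive code point range (not LLVM's type).
struct CodepointRange {
  std::uint32_t Lower;
  std::uint32_t Upper;
};

// A few ranges copied from the new table in this diff, kept in ascending
// order. This is a sample only, not the full list.
static const CodepointRange SampleDoubleWidth[] = {
    {0x1100, 0x115F},   {0x231A, 0x231B},   {0x3250, 0xA48C},
    {0xAC00, 0xD7A3},   {0x17000, 0x187F7}, {0x1B132, 0x1B132},
    {0x1F300, 0x1F64F}, {0x1F900, 0x1F9FF}, {0x1FACE, 0x1FADB},
};

// Binary search: find the first range whose upper bound is not below CP,
// then check that CP is not below that range's lower bound.
static bool inSampleDoubleWidth(std::uint32_t CP) {
  assert(CP <= 0x10FFFF && "not a Unicode code point");
  const CodepointRange *End = std::end(SampleDoubleWidth);
  const CodepointRange *It = std::lower_bound(
      std::begin(SampleDoubleWidth), End, CP,
      [](const CodepointRange &R, std::uint32_t Value) {
        return R.Upper < Value;
      });
  return It != End && It->Lower <= CP;
}

// Width 2 for code points in the sample table, 1 otherwise. Non-printable
// and combining characters are deliberately ignored in this sketch.
static int sampleWidth(std::uint32_t CP) {
  return inSampleDoubleWidth(CP) ? 2 : 1;
}
```

Under this sketch, U+1160 falls just past the narrowed range {0x1100, 0x115F} and is reported as width 1, whereas the previous table's {0x1100, 0x11FF} covered it.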

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

cor3ntin created this revision. Nov 22 2022, 11:32 AM

Herald added a project: Restricted Project. Nov 22 2022, 11:32 AM

Herald added a subscriber: hiraditya.

cor3ntin requested review of this revision. Nov 22 2022, 11:32 AM

Herald added a project: Restricted Project. Nov 22 2022, 11:32 AM

Herald added a subscriber: llvm-commits.

I did that as a drive-by while investigating https://github.com/llvm/llvm-project/issues/54732#issuecomment-1324107610 (which turned out to work correctly after all).

Any ideas on how to test this?

There are some preexisting tests in llvm/unittests/Support/UnicodeTest.cpp, which I might be able to extend with a sampling of Unicode 15 codepoints. I don't know how meaningful that would be, but as we discussed before, the only way to do exhaustive checking here would be to cross-check against an independent implementation.
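
Short of an independent oracle, one property that can still be checked mechanically is the shape of the table: the ranges are described as a sorted list, so each range should be well formed and start strictly after the previous one ends. The sketch below checks that for a hypothetical `CodepointRange` array. The type and function names are illustrative only, and whether LLVM already enforces this elsewhere is not something this thread states.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for an inclusive code point range (not LLVM's type).
struct CodepointRange {
  std::uint32_t Lower;
  std::uint32_t Upper;
};

// Returns true when every range satisfies Lower <= Upper and each range
// begins strictly after the previous range ends.
static bool isSortedAndDisjoint(const CodepointRange *Ranges,
                                std::size_t Count) {
  assert((Ranges != nullptr || Count == 0) && "null table with entries");
  for (std::size_t I = 0; I < Count; ++I) {
    if (Ranges[I].Lower > Ranges[I].Upper)
      return false; // inverted range
    if (I > 0 && Ranges[I - 1].Upper >= Ranges[I].Lower)
      return false; // out of order or overlapping
  }
  return true;
}
```

A check like this catches transcription mistakes in the table, but it says nothing about whether the listed ranges match the Unicode data files.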

Harbormaster completed remote builds in B199031: Diff 477263. Nov 22 2022, 1:22 PM

cor3ntin retitled this revision from "Update the list of double with codepoints" to "Update the list of double width codepoints". Nov 28 2022, 5:10 AM

LGTM! Agreed that testing this would be somewhat meaningless without some sort of oracle we can reference. Probably should add a release note for the fix when landing?

This revision is now accepted and ready to land. Nov 28 2022, 5:27 AM

The release notes have:

Unicode support has been updated to support Unicode 15.0.

New unicode codepoints are supported as appropriate in diagnostics,
C and C++ identifiers, and escape sequences.

That seems sufficient.

This revision was landed with ongoing or failed builds. Nov 28 2022, 6:13 AM

Closed by commit rG2903769bf524: Update the list of double width codepoints (authored by cor3ntin).

This revision was automatically updated to reflect the committed changes.

cor3ntin added a commit: rG2903769bf524: Update the list of double width codepoints.

Thanks, Aaron.
As discussed offline, I added a few tests anyway; it can't hurt.

The tests broke the build on some Windows platforms. I pushed a fix here: https://reviews.llvm.org/rG9fec67483d4c
Sorry to anyone who was impacted by that.

Revision Contents

Path                                      Size
llvm/lib/Support/Unicode.cpp              61 lines
llvm/unittests/Support/UnicodeTest.cpp    5 lines

Diff 478212

llvm/lib/Support/Unicode.cpp

(... 294 lines not shown ...)
 /// The implementation defines it in a way that is expected to be compatible
 /// with a generic Unicode-capable terminal.
 /// \return Character width:
 /// * ErrorNonPrintableCharacter (-1) for non-printable characters (as
 /// identified by isPrintable);
 /// * 0 for non-spacing and enclosing combining marks;
 /// * 2 for CJK characters excluding halfwidth forms;
 /// * 1 for all remaining characters.
-static inline int charWidth(int UCS)
-{
+static inline int charWidth(int UCS) {
   if (!isPrintable(UCS))
     return ErrorNonPrintableCharacter;

   // Sorted list of non-spacing and enclosing combining mark intervals as
   // defined in "3.6 Combination" of
   // https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf
   static const UnicodeCharRange CombiningCharacterRanges[] = {
       {0x0300, 0x036F}, {0x0483, 0x0489}, {0x0591, 0x05BD},
(... 112 lines not shown ...)
       {0x1E4EC, 0x1E4EF}, {0x1E8D0, 0x1E8D6}, {0x1E944, 0x1E94A},
       {0xE0100, 0xE01EF},
   };
   static const UnicodeCharSet CombiningCharacters(CombiningCharacterRanges);

   if (CombiningCharacters.contains(UCS))
     return 0;

+  // We consider double width codepoints any codepoint with
+  // the property East_Asian_Width=F|W
+  // + Misc Symbols and Pictographs (U+1F300...U+1F5FF)
+  // + Supplemental Symbols and Pictographs (U+1F900...U+1F9FF)
   static const UnicodeCharRange DoubleWidthCharacterRanges[] = {
-    // Hangul Jamo
-    { 0x1100, 0x11FF },
-    // Deprecated fullwidth angle brackets
-    { 0x2329, 0x232A },
-    // CJK Misc, CJK Unified Ideographs, Yijing Hexagrams, Yi
-    // excluding U+303F (IDEOGRAPHIC HALF FILL SPACE)
-    { 0x2E80, 0x303E }, { 0x3040, 0xA4CF },
-    // Hangul
-    { 0xAC00, 0xD7A3 }, { 0xD7B0, 0xD7C6 }, { 0xD7CB, 0xD7FB },
-    // CJK Unified Ideographs
-    { 0xF900, 0xFAFF },
-    // Vertical forms
-    { 0xFE10, 0xFE19 },
-    // CJK Compatibility Forms + Small Form Variants
-    { 0xFE30, 0xFE6F },
-    // Fullwidth forms
-    { 0xFF01, 0xFF60 }, { 0xFFE0, 0xFFE6 },
-    // CJK Unified Ideographs
-    { 0x20000, 0x2A6DF }, { 0x2A700, 0x2B81F }, { 0x2F800, 0x2FA1F }
+      {0x1100, 0x115F}, {0x231A, 0x231B}, {0x2329, 0x232A},
+      {0x23E9, 0x23EC}, {0x23F0, 0x23F0}, {0x23F3, 0x23F3},
+      {0x25FD, 0x25FE}, {0x2614, 0x2615}, {0x2648, 0x2653},
+      {0x267F, 0x267F}, {0x2693, 0x2693}, {0x26A1, 0x26A1},
+      {0x26AA, 0x26AB}, {0x26BD, 0x26BE}, {0x26C4, 0x26C5},
+      {0x26CE, 0x26CE}, {0x26D4, 0x26D4}, {0x26EA, 0x26EA},
+      {0x26F2, 0x26F3}, {0x26F5, 0x26F5}, {0x26FA, 0x26FA},
+      {0x26FD, 0x26FD}, {0x2705, 0x2705}, {0x270A, 0x270B},
+      {0x2728, 0x2728}, {0x274C, 0x274C}, {0x274E, 0x274E},
+      {0x2753, 0x2755}, {0x2757, 0x2757}, {0x2795, 0x2797},
+      {0x27B0, 0x27B0}, {0x27BF, 0x27BF}, {0x2B1B, 0x2B1C},
+      {0x2B50, 0x2B50}, {0x2B55, 0x2B55}, {0x2E80, 0x2E99},
+      {0x2E9B, 0x2EF3}, {0x2F00, 0x2FD5}, {0x2FF0, 0x2FFB},
+      {0x3000, 0x303E}, {0x3041, 0x3096}, {0x3099, 0x30FF},
+      {0x3105, 0x312F}, {0x3131, 0x318E}, {0x3190, 0x31E3},
+      {0x31F0, 0x321E}, {0x3220, 0x3247}, {0x3250, 0xA48C},
+      {0xA490, 0xA4C6}, {0xA960, 0xA97C}, {0xAC00, 0xD7A3},
+      {0xF900, 0xFAFF}, {0xFE10, 0xFE19}, {0xFE30, 0xFE52},
+      {0xFE54, 0xFE66}, {0xFE68, 0xFE6B}, {0xFF01, 0xFF60},
+      {0xFFE0, 0xFFE6}, {0x16FE0, 0x16FE4}, {0x16FF0, 0x16FF1},
+      {0x17000, 0x187F7}, {0x18800, 0x18CD5}, {0x18D00, 0x18D08},
+      {0x1AFF0, 0x1AFF3}, {0x1AFF5, 0x1AFFB}, {0x1AFFD, 0x1AFFE},
+      {0x1B000, 0x1B122}, {0x1B132, 0x1B132}, {0x1B150, 0x1B152},
+      {0x1B155, 0x1B155}, {0x1B164, 0x1B167}, {0x1B170, 0x1B2FB},
+      {0x1F004, 0x1F004}, {0x1F0CF, 0x1F0CF}, {0x1F18E, 0x1F18E},
+      {0x1F191, 0x1F19A}, {0x1F200, 0x1F202}, {0x1F210, 0x1F23B},
+      {0x1F240, 0x1F248}, {0x1F250, 0x1F251}, {0x1F260, 0x1F265},
+      {0x1F300, 0x1F64F}, {0x1F680, 0x1F6C5}, {0x1F6CC, 0x1F6CC},
+      {0x1F6D0, 0x1F6D2}, {0x1F6D5, 0x1F6D7}, {0x1F6DC, 0x1F6DF},
+      {0x1F6EB, 0x1F6EC}, {0x1F6F4, 0x1F6FC}, {0x1F7E0, 0x1F7EB},
+      {0x1F7F0, 0x1F7F0}, {0x1F900, 0x1F9FF}, {0x1FA70, 0x1FA7C},
+      {0x1FA80, 0x1FA88}, {0x1FA90, 0x1FABD}, {0x1FABF, 0x1FAC5},
+      {0x1FACE, 0x1FADB}, {0x1FAE0, 0x1FAE8}, {0x1FAF0, 0x1FAF8},
+      {0x20000, 0x2FFFD}, {0x30000, 0x3FFFD}
   };
   static const UnicodeCharSet DoubleWidthCharacters(DoubleWidthCharacterRanges);

   if (DoubleWidthCharacters.contains(UCS))
     return 2;
   return 1;
 }

(... 27 lines not shown, inside: for (size_t i = 0, e = Text.size(); i < e; i += Length) { ...)
     ColumnWidth += Width;
   }
   return ColumnWidth;
 }

 } // namespace unicode
 } // namespace sys
 } // namespace llvm

llvm/unittests/Support/UnicodeTest.cpp

(... 39 lines not shown, inside: TEST(Unicode, columnWidthUTF8) { ...)
   EXPECT_EQ(0, columnWidthUTF8("\314\200")); // 0300 COMBINING GRAVE ACCENT
   EXPECT_EQ(1, columnWidthUTF8("\340\270\201")); // 0E01 THAI CHARACTER KO KAI
   EXPECT_EQ(2, columnWidthUTF8("\344\270\200")); // CJK UNIFIED IDEOGRAPH-4E00

   EXPECT_EQ(4, columnWidthUTF8("\344\270\200\344\270\200"));
   EXPECT_EQ(3, columnWidthUTF8("q\344\270\200"));
   EXPECT_EQ(3, columnWidthUTF8("\314\200\340\270\201\344\270\200"));

+  EXPECT_EQ(2, columnWidthUTF8("\u231A")); // WATCH (emoji)
+  EXPECT_EQ(2, columnWidthUTF8("\U0001FADB")); // PEA POD (Unicode 15 emoji)
+  EXPECT_EQ(2, columnWidthUTF8("\U0001B132")); // HIRAGANA LETTER SMALL KO
+  EXPECT_EQ(2, columnWidthUTF8("\U00017042")); // TANGUT IDEOGRAPH
+
   // Invalid UTF-8 strings, columnWidthUTF8 should error out.
   EXPECT_EQ(-2, columnWidthUTF8("\344"));
   EXPECT_EQ(-2, columnWidthUTF8("\344\270"));
   EXPECT_EQ(-2, columnWidthUTF8("\344\270\033"));
   EXPECT_EQ(-2, columnWidthUTF8("\344\270\300"));
   EXPECT_EQ(-2, columnWidthUTF8("\377\366\355"));

   EXPECT_EQ(-2, columnWidthUTF8("qwer\344"));
(... 371 lines not shown ...)
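
The existing tests spell their inputs as octal-escaped UTF-8 bytes, while the new ones use \u and \U universal character names. As a sketch for relating the two spellings, the hypothetical helper `encodeUTF8` below converts a code point to its UTF-8 bytes, which can then be compared against the octal forms in the tests. It is illustrative only and not part of the patch. How a compiler encodes \u in a narrow string literal depends on its execution character set; whether that is related to the Windows breakage mentioned in the thread is not stated there.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encode one Unicode scalar value as UTF-8 (hypothetical helper, not part
// of LLVM). Surrogates and values above U+10FFFF are rejected by assertion.
static std::string encodeUTF8(std::uint32_t CP) {
  assert(CP <= 0x10FFFF && !(CP >= 0xD800 && CP <= 0xDFFF) &&
         "not a Unicode scalar value");
  std::string Out;
  if (CP < 0x80) {
    // One byte: 0xxxxxxx
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    // Two bytes: 110xxxxx 10xxxxxx
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
  return Out;
}
```

For instance, `encodeUTF8(0x4E00)` yields the same three bytes as the literal "\344\270\200" used in the existing CJK test, and `encodeUTF8(0x231A)` yields "\342\214\232", the byte spelling of the WATCH test input.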