This is an archive of the discontinued LLVM Phabricator instance.

Update Unicode to 15.0
ClosedPublic

Authored by cor3ntin on Sep 13 2022, 1:21 PM.

Details

Summary

Unicode 15.0 adds 4,489 characters, for a total of 149,186 characters.
These additions include 2 new scripts along with 20 new emoji characters,
and 4,193 CJK ideographs.

This change modifies most existing tables, including:

  • XID_Start/XID_Continue in Clang
  • The character name database (used by \N{} in Clang)
  • The list of formattable/printable codepoints
  • The case folding algorithm (which we had not updated since Unicode 9)
  • The list of nonspacing/enclosing marks used by the column width computation algorithm. The rest of the column width algorithm is not updated.

Event Timeline

cor3ntin created this revision. Sep 13 2022, 1:21 PM
cor3ntin requested review of this revision. Sep 13 2022, 1:21 PM
cor3ntin updated this revision to Diff 459877. Sep 13 2022, 2:38 PM

Changelog

shafik added a subscriber: shafik. Sep 13 2022, 4:59 PM

Thank you for doing this work.

llvm/lib/Support/UnicodeCaseFold.cpp
713

Maybe I am misunderstanding the comments but should this be 0xa7be?

llvm/lib/Support/UnicodeNameToCodepoint.cpp
254

You use 15.0 here and 15. above, any reason for the difference?

cor3ntin updated this revision to Diff 459941. Sep 13 2022, 6:01 PM

Use the name "Unicode 15.0" consistently

Thanks for the review

llvm/lib/Support/UnicodeCaseFold.cpp
713

Quirk of the script: the comment for C|1 never makes sense, but the values seem correct (this script is the only thing I have not written myself).
https://github.com/llvm/llvm-project/blob/main/llvm/utils/unicode-case-fold.py#L89
So you have 8 even codepoints mapping to C|1, plus 7 odd codepoints mapping to C|1, which is C, if my math is correct.
I'm a bit reluctant to modify that script.
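
The arithmetic above can be checked with a short sketch. The range bounds LO and HI below are an assumption chosen only so that the range holds 8 even and 7 odd codepoints; they are not copied from the diff, and fold_pair is a hypothetical helper name rather than anything in the generated file.

```python
def fold_pair(c: int) -> int:
    """Mimic `return C | 1;`: an even codepoint maps to the odd codepoint
    right after it, and an odd codepoint maps to itself."""
    return c | 1


# Assumed, illustrative bounds (not taken from the diff under review).
LO, HI = 0xA7B4, 0xA7C2

evens = [c for c in range(LO, HI + 1) if c % 2 == 0]
odds = [c for c in range(LO, HI + 1) if c % 2 == 1]

# Codepoints covered by the range check versus codepoints that actually change.
print(f"even codepoints (folded to C + 1): {len(evens)}")
print(f"odd codepoints (left unchanged):   {len(odds)}")
print(f"total codepoints in range:         {len(evens) + len(odds)}")
```

Only the even codepoints are changed by the fold, which would explain a generated comment that counts 8 characters even though the range check covers more codepoints than that.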

The changes LGTM, but I'd like to wait for @tahonermann to weigh in with the final acceptance.

llvm/lib/Support/UnicodeCaseFold.cpp
713

Heh, I thought it should have been 0xa7bc based on the changed comment above, but after talking to Corentin off-list, it sounds like any time we see return C | 1;, the comment above it is specifying the wrong number of characters in the range. So the issue is that the comment says 8 characters when it should say 14 characters.

We could correct the comment manually, but the next time we run the script we'll get the incorrect comment again. So for right now, I think this code is actually correct. At some point, we should fix that script to output the correct comment though, as it's hard to review the generated changes when the comments are misleading.

Structurally, these changes look like what I would expect. I didn't try to validate any of the code point ranges.

Are there useful tests that could be modified or added in order to validate (probably on a spot check basis) Unicode 15 support for regression purposes? For example, adding a test for \N{} for one of the newly added names or some case folding tests for new characters.

I considered it, and I can add a few if you insist but... I'm not sure adding random tests tells us much except that the specific tested characters are supported.

I wouldn't expect to learn anything from doing so; it would just provide regression protection (for example if the code generation scripts are somehow broken in the future).
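
As a rough illustration of what such spot-check regression protection could look like, here is a minimal sketch. It uses Python's unicodedata module as a stand-in for the generated LLVM tables, so it is not the test added in this revision. spot_check and UNICODE_15_CASES are hypothetical names, and SHAKING FACE (U+1FAE8) is used on the assumption that it is one of the Unicode 15.0 additions; that is worth verifying against the UCD.

```python
import unicodedata


def spot_check(cases):
    """Return (name, expected, actual) for every name that does not resolve
    to the expected codepoint; an empty list means all checks passed."""
    failures = []
    for name, expected in cases:
        try:
            actual = ord(unicodedata.lookup(name))
        except KeyError:
            # The name is unknown to this version of the character database.
            actual = None
        if actual != expected:
            failures.append((name, expected, actual))
    return failures


# Assumed Unicode 15.0 addition, used purely as an example.
UNICODE_15_CASES = [("SHAKING FACE", 0x1FAE8)]

# Only run the Unicode 15 check when the local database is new enough.
major = int(unicodedata.unidata_version.split(".")[0])
if major >= 15:
    print(spot_check(UNICODE_15_CASES))
else:
    print(f"skipped: local database is Unicode {unicodedata.unidata_version}")
```

A handful of fixed cases like this says nothing about overall coverage, but it would fail loudly if a regenerated table silently dropped or shifted the new entries.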

cor3ntin updated this revision to Diff 460679. Sep 16 2022, 2:20 AM

Add tests

tahonermann accepted this revision. Sep 21 2022, 2:26 PM

Thank you, Corentin, and apologies for the delayed review. This looks good to me.

This revision is now accepted and ready to land. Sep 21 2022, 2:26 PM
This revision was landed with ongoing or failed builds. Sep 21 2022, 8:03 PM
This revision was automatically updated to reflect the committed changes.