This is an archive of the discontinued LLVM Phabricator instance.

[libc++][format] Improves width estimate.
ClosedPublic

Authored by Mordante on Feb 21 2023, 8:37 AM.

Details

Reviewers
ldionne
vitaut
tahonermann
Group Reviewers
Restricted Project
Commits
rG68c3d66a97a0: [libc++][format] Improves width estimate.
Summary

As obvious from the paper's title this is an LWG issue and thus retroactively
applied to C++20. This change may the output for certain code points:
1 Considers 8477 extra codepoints as having a width 2 (as of Unicode 15)

(mostly Tangut Ideographs)

2 Change the width of 85 unassigned code points from 2 to 1
3 Change the width of 8 codepoints (in the range U+3248 CIRCLED NUMBER

TEN ON BLACK SQUARE ... U+324F CIRCLED NUMBER EIGHTY ON BLACK
SQUARE) from 2 to 1, because it seems questionable to make an exception
for those without input from Unicode

Note that libc++ already uses Unicode 15, while the Standard requires Unicode 12.
(The last time I checked MSVC STL used Unicode 14.)

So in practice the only notable change is item 3.

Implements

P2675 LWG3780: The Paper
format's width estimation is too approximate and not forward compatible

Benchmark before these changes

Benchmark Time CPU Iterations

BM_ascii_text<char> 3928 ns 3928 ns 178131
BM_unicode_text<char> 75231 ns 75230 ns 9158
BM_cyrillic_text<char> 59837 ns 59834 ns 11529
BM_japanese_text<char> 39842 ns 39832 ns 17501
BM_emoji_text<char> 3931 ns 3930 ns 177750
BM_ascii_text<wchar_t> 4024 ns 4024 ns 174190
BM_unicode_text<wchar_t> 63756 ns 63751 ns 11136
BM_cyrillic_text<wchar_t> 44639 ns 44638 ns 15597
BM_japanese_text<wchar_t> 34425 ns 34424 ns 20283
BM_emoji_text<wchar_t> 3937 ns 3937 ns 177684

Benchmark after these changes

Benchmark Time CPU Iterations

BM_ascii_text<char> 3914 ns 3913 ns 178814
BM_unicode_text<char> 70380 ns 70378 ns 9694
BM_cyrillic_text<char> 51889 ns 51877 ns 13488
BM_japanese_text<char> 41707 ns 41705 ns 16723
BM_emoji_text<char> 3908 ns 3907 ns 177912
BM_ascii_text<wchar_t> 3949 ns 3948 ns 177525
BM_unicode_text<wchar_t> 64591 ns 64587 ns 10649
BM_cyrillic_text<wchar_t> 44089 ns 44078 ns 15721
BM_japanese_text<wchar_t> 39369 ns 39367 ns 17779
BM_emoji_text<wchar_t> 3936 ns 3934 ns 177821

Benchmarks without "if(code_point < (entries[0] >> 14))"

Benchmark Time CPU Iterations

BM_ascii_text<char> 3922 ns 3922 ns 178587
BM_unicode_text<char> 94474 ns 94474 ns 7351
BM_cyrillic_text<char> 69202 ns 69200 ns 10157
BM_japanese_text<char> 42735 ns 42692 ns 16382
BM_emoji_text<char> 3920 ns 3919 ns 178704
BM_ascii_text<wchar_t> 3951 ns 3950 ns 177224
BM_unicode_text<wchar_t> 81003 ns 80988 ns 8668
BM_cyrillic_text<wchar_t> 57020 ns 57018 ns 12048
BM_japanese_text<wchar_t> 39695 ns 39687 ns 17582
BM_emoji_text<wchar_t> 3977 ns 3976 ns 176479

This optimization does carry its weight for the Unicode and Cyrillic
test. For the Japanese tests the gains are minor and for emoji it seems
to have no effect.

Diff Detail

Event Timeline

Mordante created this revision.Feb 21 2023, 8:37 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 21 2023, 8:37 AM
Herald added a subscriber: arichardson. · View Herald Transcript
Mordante updated this revision to Diff 499195.Feb 21 2023, 8:39 AM

Retrigger CI.

Mordante updated this revision to Diff 499521.Feb 22 2023, 8:21 AM

CI fixes.

Mordante updated this revision to Diff 499573.Feb 22 2023, 10:09 AM

Fixes copy-paste error.

Mordante updated this revision to Diff 500421.Feb 25 2023, 5:18 AM

Rebased to trigger CI.

Mordante published this revision for review.Mar 9 2023, 8:50 AM
Mordante added reviewers: ldionne, vitaut, tahonermann.
Herald added a project: Restricted Project. · View Herald TranscriptMar 9 2023, 8:50 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript

I'd like the other folks on the review take a look first since they are more knowledgeable about Unicode.

libcxx/test/libcxx/utilities/format/format.string/format.string.std/code_point_width_estimation.pass.cpp
56–58

The values don't make sense here; the initial value is larger than the condition value so the loop body is never entered.

63–64
libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp
85

The comment seems off here; U+A4D0 had a width of 1 both before and after P2675.

97

The comment seems off here; U+FE70 had a width of 1 both before and after P2675.

147

The comment seems off here; U+A4D0 had a width of 1 both before and after P2675.

160

The comment seems off here; U+FE70 had a width of 1 both before and after P2675.

libcxx/utils/generate_width_estimation_table.py
137–141
152–154
198–243

Is there something we can do to ensure this gets updated with newer Unicode releases? Perhaps pull the copyright notice from somewhere to run a comparison when the table is regenerated?

tahonermann requested changes to this revision.Apr 12 2023, 2:37 PM
This revision now requires changes to proceed.Apr 12 2023, 2:37 PM

Adding Corentin for awareness.

Mordante marked 8 inline comments as done.Apr 14 2023, 9:54 AM

Thanks for the review!

libcxx/test/libcxx/utilities/format/format.string/format.string.std/code_point_width_estimation.pass.cpp
56–58

The second should have started with 0x11 instead of 0x10. Nice catch. Interesting the compiler didn't issue a diagnostic.

libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp
85

Nice catch! This seems to be a bug in the existing implementation. (The same for the other similar remarks.)

libcxx/utils/generate_width_estimation_table.py
198–243

Not at the moment. I have a reminder in my calendar.

But maybe I can make a periodic GitHub action that downloads the EastAsianWidth.txt file and compares whether it's different.
This file start with

# EastAsianWidth-15.0.0.txt
# Date: 2022-05-24, 17:40:20 GMT [KW, LI]

which I assume will change with every Unicode release.

I don't want this in the CI and possibly fail the CI since updating the Unicode files might be non-trivial. I don't expect that for this case, but the grapheme clustering might change the rules and thus need changes in the code. (And I'm not sure how Zach's papers will affect what other parts of the Unicode database we need and how stable these rules are.)

@ldionne do you have an opinion? According to the Standard we should be using the latest Unicode Standard.
(This is part of P2736R2 and was voted in during the last plenary.)

Mordante updated this revision to Diff 513659.Apr 14 2023, 10:06 AM
Mordante marked 2 inline comments as done.

Rebased and addresses review comments.

cor3ntin added inline comments.Apr 14 2023, 10:10 AM
libcxx/utils/generate_width_estimation_table.py
198–243

Given the rate of change (on average once a year) I'm not sure it's worth trying to automate that license file change).
Maybe a "Here is everything that needs to be done to update unicode" document somewhere would be as good / better.

tahonermann accepted this revision.Apr 14 2023, 1:58 PM

The changes all look good to me. If there is a relatively simple way to track making sure that copyright statement gets updated, then great; I don't think that is worth holding up these changes though.

libcxx/test/libcxx/utilities/format/format.string/format.string.std/code_point_width_estimation.pass.cpp
56–58

That is interesting. I checked and none of gcc, clang, MSVC, or icc do. Coverity reports a DEADCODE issue for it though.

libcxx/utils/generate_width_estimation_table.py
198–243

I think such a doc would suffice as well. I agree it wouldn't be worth spending a lot of time to automate.

Mordante added inline comments.Apr 15 2023, 4:18 AM
libcxx/utils/generate_width_estimation_table.py
198–243

The advantage would be that update notifications don't depend on my agenda. If there is a CI job that mails the libcxx maintainers everybody would be aware. I'll discuss it with Louis and see what he thinks.

ldionne accepted this revision.Apr 20 2023, 9:35 AM
ldionne added inline comments.
libcxx/utils/generate_width_estimation_table.py
198–243

One suggestion I'd have here is to make updating this part of our release process. In Contributing.rst, we have:

Post-release check list
=======================

After branching for an LLVM release:

1. Update ``_LIBCPP_VERSION`` in ``libcxx/include/__config``
2. Update the version number in ``libcxx/docs/conf.py``
3. Update ``_LIBCPPABI_VERSION`` in ``libcxxabi/include/cxxabi.h``
4. Update ``_LIBUNWIND_VERSION`` in ``libunwind/include/__libunwind_config.h``
5. Update the list of supported clang versions in ``libcxx/docs/index.rst``
6. Remove the in-progress warning from ``libcxx/docs/ReleaseNotes.rst``

We could potentially also have a pre-release checklist there that would include keeping unicode stuff up-to-date. And I'm sure there are other things we could document there as well. I guess the question is who would be tasked with doing the pre-release checklist -- it should probably not be the LLVM release manager since that's too much work. I guess I could do that and potentially distribute some of the pre-release tasks (e.g. I'd probably ask @Mordante to handle the unicode update).

345
This revision is now accepted and ready to land.Apr 20 2023, 9:35 AM
This revision was landed with ongoing or failed builds.Apr 20 2023, 12:19 PM
This revision was automatically updated to reflect the committed changes.
Mordante marked 4 inline comments as done.

Thanks for the reviews!

libcxx/utils/generate_width_estimation_table.py
198–243

As discussed I will do this is a separate review so other libc++ developers can add their opinion.

libcxx/utils/CMakeLists.txt