This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Implements Unicode grapheme clustering
ClosedPublic

Authored by Mordante on Jun 3 2022, 9:10 AM.

Details

Reviewers
ldionne
vitaut
Group Reviewers
Restricted Project
Commits
rG857a78c04dee: [libc++] Implements Unicode grapheme clustering
Summary

This implements the Grapheme clustering as required by
P1868R2 width: clarifying units of width and precision in std::format

This was omitted in the initial patch, but the paper was marked as completed. This really completes the paper.

Diff Detail

Event Timeline

Mordante created this revision.Jun 3 2022, 9:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 3 2022, 9:10 AM
Mordante updated this revision to Diff 434071.Jun 3 2022, 11:02 AM

Fixes some CI issue.
I also noticed the there is room to compact the data even more. In come cases
Unicode splits one larger range is several subranges to add additional
comments. By merging these ranges the number of entries in the tables is
reduced.

Mordante updated this revision to Diff 434254.Jun 4 2022, 2:23 AM

Try to fix the CI.

Mordante updated this revision to Diff 434255.Jun 4 2022, 2:58 AM

Attempts to fix the CI.

Mordante updated this revision to Diff 434256.Jun 4 2022, 3:36 AM

Fixes a shadow warning.

Mordante updated this revision to Diff 434273.Jun 4 2022, 9:36 AM

Fixes ASAN errors.

Mordante updated this revision to Diff 434327.Jun 5 2022, 7:04 AM

Improves end of input handling. That should fix the CI errors.

Mordante added inline comments.Jun 5 2022, 10:52 AM
libcxx/include/__format/parser_std_format_spec.h
95–96

D125606 "Not specific to this diff but I think it would be cleaner to fold Unicode/non-Unicode handling into __estimate_column_width and have a single check and copy here."

Mordante updated this revision to Diff 434427.Jun 6 2022, 4:22 AM

Use a better way to store the data.
This improves the speed by 7% from the last version, and 4% from the original 6 byte algorithm.

Note the code and comment needs more polish.

Mordante updated this revision to Diff 441926.Jul 3 2022, 1:27 AM

Improve and polish the code. Replace libc++ specific tests with generic tests.

Mordante updated this revision to Diff 441945.Jul 3 2022, 5:45 AM

Fixing GCC build failures.

Mordante updated this revision to Diff 441947.Jul 3 2022, 6:25 AM

Disable a test in GCC since it times out.

Mordante updated this revision to Diff 441966.Jul 3 2022, 10:38 AM
Mordante added a subscriber: STL_MSFT.

Minor cleanups and addresses @STL_MSFT's review comments while upstreaming some of these changes.

Mordante published this revision for review.Jul 3 2022, 10:39 AM
Mordante added reviewers: ldionne, vitaut.
Herald added a project: Restricted Project. · View Herald TranscriptJul 3 2022, 10:39 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
Mordante edited the summary of this revision. (Show Details)Jul 3 2022, 10:42 AM
vitaut requested changes to this revision.Jul 9 2022, 8:30 AM

Mostly looks good but at least one output doesn't look right which suggests an error in width estimation. Also a few nits inline.

libcxx/benchmarks/std_format_spec_string_unicode.bench.cpp
9–10

Looks like this include is no longer needed since you removed the only std::array usage.

244

enigine -> engine

libcxx/include/__format/formatter_output.h
277

Why is __truncate in the __detail namespace while other functions which are also implementation details aren't?

libcxx/include/__format/parser_std_format_spec.h
696

Is it the last or one past the last? Seems like the latter. Also "parsed" seems a bit unclear in this context. Are you referring to grapheme clusterization as parsing? I would also recommend using a more descriptive name like __end_ or __last_.

702

nit: I wouldn't call it an issue. It's how width and precision work and very much expected behavior =).

780

with -> width

808

It is well-defined in the same way the Unicode standard we are referring to is well-defined. I suggest saying that it can change instead, particularly with the change to the Unicode standard.

libcxx/include/__format/unicode.h
74

As a potential future optimization (not in this diff) you might want to consider branchless code point decoding, e.g. https://github.com/fmtlib/fmt/blob/b761f1279e2dfb950a7bf9ff7128d6b03b77b013/include/fmt/format.h#L571-L588.

libcxx/test/std/utilities/format/format.functions/ascii.pass.cpp
32

I think the output should be "*\u00a1**" because U+00A1 has an estimated width of 1: http://eel.is/c++draft/format#string.std-11.

This revision now requires changes to proceed.Jul 9 2022, 8:30 AM
Mordante marked 9 inline comments as done.Jul 9 2022, 12:23 PM

Thanks a lot for your review!

libcxx/include/__format/formatter_output.h
277

Good point there used to be more in __detail, I've removed that inner namespace.

libcxx/include/__format/parser_std_format_spec.h
696

I agree, __last_ would be a better name and consistent with other parts of the code.
I'm removed the word "parsed".

702

I agree it's as specified :-) I removed this entire sentence, it didn't add much information not captures in the other comment.

808

I removed most of this sentence and just mention the specifications in [format.string.std]/11.

libcxx/include/__format/unicode.h
74

This looks very interesting, thanks for the hint!

libcxx/test/std/utilities/format/format.functions/ascii.pass.cpp
32

This test tests the output when using ASCII. When using ASCII the codepoint uses two bytes and thus *\u00a1* is correct. In the Unicode test the column width is estimated at one. I left a comment in that file.

96

@vitaut This is the unicode test where the column width is 1.

Mordante updated this revision to Diff 443456.Jul 9 2022, 12:46 PM
Mordante marked 6 inline comments as done.

Addresses review comments.

vitaut added inline comments.Jul 10 2022, 9:00 AM
libcxx/test/std/utilities/format/format.functions/ascii.pass.cpp
32

Missed that, sorry. It makes sense now.

vitaut added inline comments.Jul 10 2022, 9:51 AM
libcxx/test/libcxx/utilities/format/format.string/format.string.std/extended_grapheme_cluster.h
74

"of" is missing after point

libcxx/test/libcxx/utilities/format/format.string/format.string.std/extended_grapheme_cluster.pass.cpp
30

assumption

80–82

Maybe generate only one of data_utf16 and data_utf32 if the other is not used? Or are they both used?

libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp
56

The template operator()< part make the tests a bit difficult to parse. I would recommend passing the format string as a normal argument and not as a template parameter.

I have some comments. Being a total unicode newbie, this is mostly out of my depth, but the general approach looks sensible to me. I'll have a closer look at https://www.unicode.org/reports/tr29/tr29-35.html#Grapheme_Cluster_Boundaries to provide a more meaningful review, but this basically LGTM via the assumption that the test coverage would catch issues in the implementation.

libcxx/.clang-format
68 ↗(On Diff #441966)

This should be part of a separate patch.

libcxx/include/__format/parser_std_format_spec.h
885

// _LIBCPP_HAS_NO_UNICODE? This is quite a long #if block.

libcxx/utils/generate_extended_grapheme_cluster_table.py
2

According to what you just said during live review, https://www.unicode.org/Public/12.0.0/ucd/auxiliary/GraphemeBreakTest.txt changes from time to time (roughly once a year). Consequently, I think we'll want to make it as easy as possible to update. As discussed, we also don't want to get automatic updates under our feet without manual intervention, because the rules might need to change, etc. Hence, I would suggest:

  1. We download the tables that you use to generate headers and tests and check them into libc++ in a reasonable location. Preferably in a subdirectory with a README.md that explains what those are and where to download them from.
  2. We add the running of the generation scripts to libcxx-generate-files.

That way, the workflow for updating those files is simply to re-download the latest tables online, re-generate the headers+tests using libcxx-generate-files, and check that in. Furthermore, if we ever forget to re-generate the headers and tests, the CI will tell us through our existing mechanisms.

98–99

Those should be in sync!

Also, I would recommend using the latest version of these files, even though the standard uses 12.0 in its bibliography.

Mordante marked 7 inline comments as done.Jul 13 2022, 9:28 AM

Thanks for the reviews!

libcxx/test/libcxx/utilities/format/format.string/format.string.std/extended_grapheme_cluster.pass.cpp
80–82

They are both used. Some of our supported platforms have sizeof(wchar_t) == 16, for example Windows and AIX.

libcxx/test/std/utilities/format/format.functions/unicode.pass.cpp
56

The issue is the string is used as both as char and wchar_t. In the past there was no other way to do this, at that time an approach like STATICALLY-WIDEN failed. However STATICALLY-WIDEN now seems to work with all supported compilers. However I used this approach in a lot of tests so I rather "fix" them in one go. So I'll add a todo to my list.

libcxx/utils/generate_extended_grapheme_cluster_table.py
2

I've given this some more thought. Instead of making these changes in this commit I will work on a follow-up commit to make these changes. I don't see the easy way to update as a must-have for LLVM-15 since I don't intend to update these tables in the release branch. That way there's less risk for this patch to miss the branching point.

98–99

I actually used 12 for both but forgot to update this link from the Microsoft code.

But since we want to use the latest version I use the Unicode links to the latest version. That way we don't need to update the links when we update the tables.

Mordante updated this revision to Diff 444319.Jul 13 2022, 9:56 AM
Mordante marked 2 inline comments as done.

Addresses review comments and updates the Unicode tables to version 14.

ldionne accepted this revision.Jul 20 2022, 8:25 AM
ldionne added inline comments.
libcxx/utils/generate_extended_grapheme_cluster_table.py
2

That sounds reasonable, but let's not forget to do this. It's really important to have an easy way to keep those up-to-date, or they will become stale in no time.

Mordante marked 2 inline comments as done.Jul 20 2022, 8:53 AM

Thanks for the review!

libcxx/utils/generate_extended_grapheme_cluster_table.py
2

I've already created D129668, which most likely needs more work.

This revision was not accepted when it landed; it landed in state Needs Review.Jul 20 2022, 9:38 AM
This revision was automatically updated to reflect the committed changes.
Mordante marked an inline comment as done.

It seems one of the clang-tidy changes flags the current code. I should be able to fix this rather fast.

@abrachet I've committed a fix.

phosek added a subscriber: phosek.Aug 15 2022, 2:39 PM
phosek added inline comments.
libcxx/utils/generate_extended_grapheme_cluster_table.py
160

I noticed a typo in generated files, this can be addresses in some future change.

Mordante marked an inline comment as done.Aug 18 2022, 8:30 AM
Mordante added inline comments.
libcxx/utils/generate_extended_grapheme_cluster_table.py
160

Thanks!
I fixed it in https://reviews.llvm.org/rG750ee8d56d0a9f8e93c032ad55841c46b1fbbbd7 waiting for a future change will probably mean I forget it ;-)