This is an archive of the discontinued LLVM Phabricator instance.

[libc++][format] Adds a UTF transcoder.
ClosedPublic

Authored by Mordante on May 6 2023, 2:52 AM.

Details

Reviewers
ldionne
tahonermann
Group Reviewers
Restricted Project
Commits
rG20341c3ad6f6: [libc++][format] Adds a UTF transcoder.
Summary

This is a preparation for

P2093R14 Formatted output

When the output of print is to the terminal it needs to use the native
API. This means transcoding UTF-8 to UTF-16 on Windows. The encoder's
interface is modeled after

P2728 Unicode in the Library, Part 1: UTF Transcoding

But only the required part for P2093R14 is implemented.

On Windows wchar_t is 16 bits, in order to test on platforms where
wchar_t is 32 bits the transcoder has support for char16_t. It also adds
and UTF-8 to UTF-32 encoder which is useful for other tests.

Note it is possible to use <codecvt> for transcoding, but that header is
deprecated. So rather write new code that is not deprecated; the hard
part, decoding, has already been done. The <codecvt> header also
requires locale support while the new code works without including
<locale>.

Note the current transcoder implementation can be optimized since it
basically does UTF-8 -> UTF-32 -> UTF-16. The first goal is to have a
working implementation. Since it's not part of the ABI it's possible to
do the optimization later.

Depends on D149672

Diff Detail

Event Timeline

Mordante created this revision.May 6 2023, 2:52 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2023, 2:52 AM
Mordante updated this revision to Diff 520068.May 6 2023, 3:55 AM

CI fixes.

Mordante updated this revision to Diff 520079.May 6 2023, 5:51 AM

CI fixes.

Mordante updated this revision to Diff 520084.May 6 2023, 6:37 AM

CI fixes.

Mordante published this revision for review.May 6 2023, 9:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2023, 9:59 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
Mordante updated this revision to Diff 528113.Jun 3 2023, 8:44 AM

Rebased and adjusted to upstream changes.

tahonermann requested changes to this revision.Jun 5 2023, 2:25 PM

The approach looks pretty good though I don't love the encoding inference based on sizeof(). I see the motivation though. I'll think about this some more, but I think we should just infer the encoding based on the character type; char16_t => UTF-16, char32_t => UTF-32, wchar_t => platform dependent.

libcxx/include/print
45–49

The underlying type of char16_t is uint_least16_t, so the sizeof constraint is technically incorrect here. Similarly, a size of 2 for wchar_t doesn't imply UTF-16 if CHAR_BIT is 16 (but that is only the case for a few embedded platforms).

We'll need to see how P2728R0 eventually shakes out, but I think any inference based on character type size should be avoided in favor of an approach that states the desired encoding; e.g., __encode_utf16(...). Constraints could then be added to ensure that the element size is at least 16-bit (sizeof(T) * CHAR_BIT).

52
72
79

basic_string_view or span? I would expect more like the latter.

82–84
This revision now requires changes to proceed.Jun 5 2023, 2:25 PM
Mordante marked 3 inline comments as done.Jun 8 2023, 9:09 AM

Thanks for the review!

libcxx/include/print
45–49

Ah I totally forgot about char16_t not being required to be 16 bits. I see P2728 uses the same assumption, I'll drop Zach a note.

For now I just hard-code it for char, charX_t and use the _LIBCPP_SHORT_WCHAR define in libc++ for wchar_t. This macro is based on the platform, for Windows and AIX 32-bit and it's defined and for others not. Once P2728 lands I will switch to its definitions.

79

The __code_point_view was written for grapheme clustering in <format>. For string formatters I use a basic_string_view as format argument, so using a basic_string_view sounds natural. Since P2728R0 aims at non-character types as code units it would indeed make sense to use a span. Especially since char_traits<int> will be removed in LLVM 18. For now I keep it this way, once P2728R0 accepted I can change it. Then I probably want to model things closer to that paper.

Mordante updated this revision to Diff 529626.Jun 8 2023, 9:09 AM

Addresses review comments.

Mordante updated this revision to Diff 532209.Jun 16 2023, 9:59 AM

Rebased so D150044 can be rebased.
Fixed merge conflicts.
Added header to the modules.

Mordante updated this revision to Diff 532250.Jun 16 2023, 11:35 AM

Rebased to fix CI.

Mordante updated this revision to Diff 533341.Jun 21 2023, 11:08 AM

Rebased to test D150044

@tahonermann, it would be great if you can have another look.

tahonermann requested changes to this revision.Jun 23 2023, 1:44 PM

Sorry for the delay, again. I think this is looking good. I noticed one issue that I think needs to be addressed and otherwise just want to confirm that lack of support for encoding UTF-8 is intentional for now.

libcxx/include/print
40–44

As far as I can tell, the __utf8_code_unit concept is not used, nor is there a UTF-8 overload of __encode(). That is ok if intentional, I'm just mentioning it in case it isn't.

The use of char8_t should be predicated on _LIBCPP_HAS_NO_CHAR8_T not being defined since char8_t support can be disabled with -fno-char8_t.

This revision now requires changes to proceed.Jun 23 2023, 1:44 PM
Mordante marked 3 inline comments as done.Jun 24 2023, 4:26 AM

Sorry for the delay, again. I think this is looking good. I noticed one issue that I think needs to be addressed and otherwise just want to confirm that lack of support for encoding UTF-8 is intentional for now.

No problem, thanks for the review!

libcxx/include/print
40–44

Nice catch. It's indeed not used at the moment. I'll remove it for now. When Zach's paper is approved I will add the concepts based on the wording of the paper.

Mordante updated this revision to Diff 534194.Jun 24 2023, 4:30 AM
Mordante marked an inline comment as done.

Addresses review comment.

tahonermann accepted this revision.Jul 1 2023, 6:34 PM

Looks good to me.

ldionne added inline comments.Jul 4 2023, 1:04 PM
libcxx/include/print
2

Are we going to have a lot of stuff inside <print>? If so, it would probably make sense to introduce this utility under __print/encode.h instead.

45

Can we use #ifndef _LIBCPP_HAS_NO_WIDE_CHARACTERS instead? I'd like to reduce usage of _LIBCPP_IF_WIDE_CHARACTERS and eventually remove it since it's a really unusual macro.

libcxx/test/libcxx/input.output/iostream.format/print.fun/transcoding.pass.cpp
10–11

Does this really time out on GCC 12? What's special about this test that makes it timeout?

19–23

Those don't seem sorted :)

Mordante marked 4 inline comments as done.Jul 5 2023, 9:33 AM

Thanks for the reviews!

libcxx/include/print
2

No print is small, 2 more overloads and it's complete. I want to move the encoding part to a new header once P2728 is voted in. Then the code becomes part of that header.

45

To be honest I feel the macro is easier to read. I don't feel too strongly about it so I change it.

libcxx/test/libcxx/input.output/iostream.format/print.fun/transcoding.pass.cpp
10–11

Actually it's the same GCC12 issue as the other format parts. This patch was written before we started to use // UNSUPPORTED: GCC-ALWAYS_INLINE-FIXME so I copied it from some format header. I've changed it to the GCC-ALWAYS_INLINE-FIXME

19–23

I trust on clang-format ;-) Fixed it.

Mordante updated this revision to Diff 537398.Jul 5 2023, 9:38 AM
Mordante marked 4 inline comments as done.

Rebased and addresses review comments.

ldionne accepted this revision.Jul 11 2023, 8:36 AM

LGTM, thanks @tahonermann for reviewing the hard parts :)

This revision is now accepted and ready to land.Jul 11 2023, 8:36 AM
This revision was automatically updated to reflect the committed changes.