Page MenuHomePhabricator

Please use GitHub pull requests for new patches. Phabricator shutdown timeline

[libc++][format] Implement Unicode support.

Authored by Mordante on May 31 2021, 9:05 AM.



This adds the width estimation functions to the std-format-spec.

Implements parts of:

  • P0645 Text Formatting
  • P1868 width: clarifying units of width and precision in std::format

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Mordante marked 2 inline comments as done.Jul 3 2021, 5:35 AM
Mordante added inline comments.

I tried some different approaches, but they lead to performance degradation. (I also tested the version used in libfmt.)
I think the easiest way to solve this is by improving the formatting,so I went for that route.


Setting it to maximum is less efficient. For example, when a width of 10 is given it's not needed to parse a 1024 code unit input. However it's possible to increase the wanted width + 1 to get the same effect as the current __truncate argument. This still may scan one code unit more than required, but I consider that not an issue.

Mordante updated this revision to Diff 356339.Jul 3 2021, 5:56 AM

Improve the readability of a long return statement, this found some issues in the return statement.
Simplify __estimate_column_width by removing a template argument.
Malformed Unicode now uses 1 column per code unit.

Mordante updated this revision to Diff 356343.Jul 3 2021, 7:27 AM

Fixes benchmark.

Mordante updated this revision to Diff 357282.Jul 8 2021, 10:54 AM

Adds a missing include due to the usage of the new monostate header.

vitaut added inline comments.Jul 11 2021, 9:17 AM

I misinterpreted how the two scanners interact. Makes sense now.

vitaut added inline comments.Jul 11 2021, 9:44 AM

I am not sure I understand what "the formatted output has only a minimum width" means. What is the minimum width?


Why would the estimation algorithm stop prematurely?


nit: I'd add a colon after "formats".

vitaut added inline comments.Jul 11 2021, 9:48 AM

I wonder if it is possible to pull in a subset of <algorithm> because the whole thing is enormous, especially in C++20?

Mordante marked an inline comment as done.Jul 12 2021, 12:44 PM

Thanks for the review!


I think that's now possible since the header has been split recently. I'll have to see whether this project has been completed or still work in progress.


The minimum width is the width specified in the format-spec. For example std::format("{:10}", MyString); As soon as the code has determined MyString results in at least 10 output columns it knows no padding is required. At that point it's a waste of CPU cycles to determine the exact number of output columns MyString will result to. So at that point the algorithm stops prematurely.

I think I'll add some more comment regarding this behavior.


As described in the comment above.

Mordante marked 5 inline comments as done.Jul 15 2021, 1:07 PM
Mordante updated this revision to Diff 359103.Jul 15 2021, 1:18 PM

Addresses review comments
Marked a unit tests as libc++ specific

vitaut added inline comments.Jul 17 2021, 7:17 AM

nit: I would suggest changing the names of both __format_string_traits and __get_format_string_traits because the current naming gives an impressing that they have something to do with format strings while in fact this is an estimation of display width for a string argument. Maybe something along the lines of __[format_]estimate_display_width for the function name?

vitaut added inline comments.Jul 17 2021, 7:20 AM

Or __[format_]get_string_alignment since it decides whether alignment should be applied and returns information about how to apply it.

Mordante updated this revision to Diff 359622.Jul 18 2021, 6:49 AM
Mordante marked 2 inline comments as done.

Based on review comments s/format_string_traits/string_alignment/

Thanks for the review!

vitaut accepted this revision.Jul 18 2021, 7:21 AM


Mordante added inline comments.Jul 20 2021, 10:14 AM

s/_LIBCPP_INLINE_VISIBILITY/_LIBCPP_HIDE_FROM_ABI/ here and all other occurrences in this patch.

Mordante updated this revision to Diff 363833.Aug 3 2021, 12:11 PM

Rebased and minor cleanups.

Mordante updated this revision to Diff 364352.Aug 4 2021, 10:42 PM

Adds UNSUPPORTED: libcpp-has-no-incomplete-format for the test.

Mordante updated this revision to Diff 370726.Sep 4 2021, 3:52 AM

Rebased to trigger CI.

Mordante updated this revision to Diff 372753.Sep 15 2021, 11:09 AM

Rebased to update newer patches in the series.

Mordante updated this revision to Diff 372870.Sep 15 2021, 11:03 PM

Guard against min/max macros.

ldionne requested changes to this revision.Sep 22 2021, 12:55 PM
ldionne added inline comments.
148–149 ↗(On Diff #372870)

Can we make those changes as part of another review? Also, let's set CXX_STANDARD_REQUIRED YES instead -- all supported compilers should support C++20 now.


I don't understand this TODO. We do require C++20 support.


I'm not sure this is useful -- if the library doesn't support unicode, something in the benchmark is going to fail anyway. It would be nice to remove as many libc++ specific things from the benchmark as possible.


The benchmarks should be compiled without assertions anyway -- it seems pretty hacky to undefine this here, no?


If we're going to do something like this, can we instead do namespace format_spec = std::__format_spec or something similar?


This line break is really odd.


Just a general comment -- we generally try to use C++ style comments (//) elsewhere in the library. I don't have the heart to ask you to change them cause it's such a nitpick, but please consider using that style going forward for consistency.


Would it make sense to name those __utf8_char or __utf8_character instead? Just __utf8 seems kind of obtuse to me. (applies below too)


This needs to be inline since it's not a template and it's in a header. This certainly applies elsewhere too, please take a look. Otherwise, those are ODR violations waiting to happen (arguably those ODR violations are benign, but still).


I would suggest splitting this into two files and using the libcpp-has-no-unicode Lit feature to enable/disable them. That sounds a bit cleaner since essentially the whole test is different depending on whether unicode is supported. Your call, this is not a blocking comment.


It just occurred to me that I am confused by the existence of both _LIBCPP_HAS_NO_UNICODE and _LIBCPP_HAS_NO_UNICODE_CHARS -- can you shed some light on this if it's fresh in your mind?

This revision now requires changes to proceed.Sep 22 2021, 12:55 PM
Mordante marked 8 inline comments as done.Sep 23 2021, 10:24 AM
Mordante added subscribers: DanielMcIntosh-IBM, daltenty.

Thanks for the review. I'll update the patch after you land D110338.

148–149 ↗(On Diff #372870)

See D110338.


This TODO was written at a time we didn't require it yet. Hence the TODO to remind me. I remove the test and the comment.


Good catch this is already tested at the start of the file. This benchmark is really intended to test some libcxx implementation details. For example the __column_width_3 function has several different ways to implement it. This had a performance impact. If the benchmarks should be usable for other compilers/standard libraries I would rather add some guards in this test.


I agree I think I used the benchmarks for testing originally. So I removed this hack and the validation.


My clang-format settings are still at 80 columns to avoid merge conflicts in this series. I'll fix this manually.


I'll try to keep it in mind. After we switched to clang-fomrat-13 I still want to find some time to reformat the code using our new settings. Then I probably will look at fixing these comments.


I omitted the inline since constexpr functions are implicitly inline, guarding against ODR violations.

A function or static data member declared with the constexpr or consteval specifier is implicitly an inline function or variable ([dcl.inline]).

Did I misinterpret this rule?
Do you still prefer the inline?


I like your suggestion it indeed sounds like a cleaner solution.


_LIBCPP_HAS_NO_UNICODE is the switch I added so the user can disable Unicode output. It's introduced in libc++13 using a CMake option. (If needed we can look whether this define can/should be renamed.)

_LIBCPP_HAS_NO_UNICODE_CHARS is not fresh in my mind, but grep does miracles ;-)
_LIBCPP_HAS_NO_UNICODE_CHARS is enabled in __config when the IBM compiler is used.
It was added by Howard in 2013, I've no idea whether that's still needed.
@DanielMcIntosh-IBM, @daltenty do you know whether the IBM compiler has proper support for char16_t and char32_t?

Mordante updated this revision to Diff 374899.Sep 24 2021, 10:23 AM
Mordante marked 6 inline comments as done.

Addresses review comments.

Mordante updated this revision to Diff 376298.Sep 30 2021, 11:23 AM

Mark some functions inline. Not strictly required but @ldionne prefers it, as stated in D110515

Mordante marked an inline comment as done.Sep 30 2021, 11:24 AM
ldionne accepted this revision.Oct 1 2021, 1:50 PM
This revision is now accepted and ready to land.Oct 1 2021, 1:50 PM
This revision was automatically updated to reflect the committed changes.