This is an archive of the discontinued LLVM Phabricator instance.

[libc++][format] Implement Unicode support.
ClosedPublic

Authored by Mordante on May 31 2021, 9:05 AM.

Details

Summary

This adds the width estimation functions to the std-format-spec.

Implements parts of:

  • P0645 Text Formatting
  • P1868 width: clarifying units of width and precision in std::format

Event Timeline

Mordante created this revision.May 31 2021, 9:05 AM
Mordante requested review of this revision.May 31 2021, 9:05 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 31 2021, 9:05 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
Mordante updated this revision to Diff 348827.May 31 2021, 9:40 AM

Remove an accidental edit, which caused the code to become invalid.

Mordante updated this revision to Diff 349346.Jun 2 2021, 12:17 PM

Rebase.
Improve comments.

Mordante updated this revision to Diff 349456.Jun 2 2021, 10:47 PM

Temporary disable a failing test on Windows.

Mordante edited the summary of this revision. (Show Details)Jun 5 2021, 4:12 AM
Mordante added reviewers: curdeius, ldionne, miscco, vitaut.
vitaut requested changes to this revision.Jun 18 2021, 7:56 AM
vitaut added inline comments.
libcxx/include/__format/parser_std_format_spec.h
754

I think you mean code units and not characters. Same throughout the diff.

764

16 ->32

772–773

I think a more robust error handling would be to count each invalid code unit as contributing 1 to the width. Terminals normally use replacement characters so it will give the correct result.

829–852

A sequence of ifs would likely be more readable.

913–915

Not sure what you mean by "a full blown Unicode parser" but I don't think it's ever a good idea to throw exception on width estimation.

917

Why not set __threshold to the maximum value to disable it instead of introducing another flag?

936

I recommend returning a struct with meaningful field names.

1079

What does the check __precision != __format::__number_max do here?

Note that __number_max is a valid precision.

This revision now requires changes to proceed.Jun 18 2021, 7:56 AM
Mordante marked 6 inline comments as done.Jun 27 2021, 8:32 AM
Mordante added inline comments.
libcxx/include/__format/parser_std_format_spec.h
754

Correct.

772–773

Not entirely sure what you mean. This algorithm is fast, but simple. The engine first uses this algorithm, when it "fails" it uses a more advanced algorithm. The more advanced algorithm implements the estimation as defined in the Standard.
Basically this algorithm's main purpose is to handle ASCII, when it's not ASCII the Standard's estimation is used.

829–852

Agreed. I originally had a sequence of ifs, but that was a bit slow. I used this to speed things up. However I'll look whether I can use ifs in a smarter fashion to improve readability.

913–915

When I think about a "a full blown Unicode parser" I think of something like the ICU library.
I think here it makes sense to use your previous suggestion:
"I think a more robust error handling would be to count each invalid code unit as contributing 1 to the width. "

917

That can give the issue, where the result can be off by one. I'll see whether I can find a way to improve this code.

1079

This was based on my initial assumption 999.999.999 would be more than anybody would use for a width or precision. Since that assumption's wrong the test here needs to be adjusted.

1082

As pointed out in D103425 the width can be larger than the precision. So this test needs to be removed.

Mordante updated this revision to Diff 355269.Jun 29 2021, 9:25 AM
Mordante marked 4 inline comments as done.
Mordante edited the summary of this revision. (Show Details)

Addresses part of the review comments:

  • Use code units instead of characters in the comments
  • Remove the usage of std::pair
  • Precision == -1 is no precision instead of __number_max
  • Allow a precision less than the width

The other comments will be addressed later, I want to update the rest of the ser
ies before focussing on these issues:

  • Improve the threshold algorithm
  • See whether the long returns can be improved without performance degradations.
Mordante planned changes to this revision.Jun 29 2021, 9:25 AM

Need to address the last review comments.

Mordante updated this revision to Diff 355297.Jun 29 2021, 10:57 AM

Disable a test on Windows, that change was accidentally reverted in the last version.

Mordante planned changes to this revision.Jun 29 2021, 11:06 AM
Mordante marked 2 inline comments as done.Jul 3 2021, 5:35 AM
Mordante added inline comments.
libcxx/include/__format/parser_std_format_spec.h
829–852

I tried some different approaches, but they lead to performance degradation. (I also tested the version used in libfmt.)
I think the easiest way to solve this is by improving the formatting,so I went for that route.

917

Setting it to maximum is less efficient. For example, when a width of 10 is given it's not needed to parse a 1024 code unit input. However it's possible to increase the wanted width + 1 to get the same effect as the current __truncate argument. This still may scan one code unit more than required, but I consider that not an issue.

Mordante updated this revision to Diff 356339.Jul 3 2021, 5:56 AM

Improve the readability of a long return statement, this found some issues in the return statement.
Simplify __estimate_column_width by removing a template argument.
Malformed Unicode now uses 1 column per code unit.

Mordante updated this revision to Diff 356343.Jul 3 2021, 7:27 AM

Fixes benchmark.

Mordante updated this revision to Diff 357282.Jul 8 2021, 10:54 AM

Rebase.
Adds a missing include due to the usage of the new monostate header.

vitaut added inline comments.Jul 11 2021, 9:17 AM
libcxx/include/__format/parser_std_format_spec.h
772–773

I misinterpreted how the two scanners interact. Makes sense now.

vitaut added inline comments.Jul 11 2021, 9:44 AM
libcxx/include/__format/parser_std_format_spec.h
726–728

I am not sure I understand what "the formatted output has only a minimum width" means. What is the minimum width?

734

Why would the estimation algorithm stop prematurely?

753

nit: I'd add a colon after "formats".

vitaut added inline comments.Jul 11 2021, 9:48 AM
libcxx/include/__format/parser_std_format_spec.h
21

I wonder if it is possible to pull in a subset of <algorithm> because the whole thing is enormous, especially in C++20?

Mordante marked an inline comment as done.Jul 12 2021, 12:44 PM

Thanks for the review!

libcxx/include/__format/parser_std_format_spec.h
21

I think that's now possible since the header has been split recently. I'll have to see whether this project has been completed or still work in progress.

726–728

The minimum width is the width specified in the format-spec. For example std::format("{:10}", MyString); As soon as the code has determined MyString results in at least 10 output columns it knows no padding is required. At that point it's a waste of CPU cycles to determine the exact number of output columns MyString will result to. So at that point the algorithm stops prematurely.

I think I'll add some more comment regarding this behavior.

734

As described in the comment above.

Mordante marked 5 inline comments as done.Jul 15 2021, 1:07 PM
Mordante updated this revision to Diff 359103.Jul 15 2021, 1:18 PM

Rebased
Addresses review comments
Marked a unit tests as libc++ specific

vitaut added inline comments.Jul 17 2021, 7:17 AM
libcxx/include/__format/parser_std_format_spec.h
717–719

nit: I would suggest changing the names of both __format_string_traits and __get_format_string_traits because the current naming gives an impressing that they have something to do with format strings while in fact this is an estimation of display width for a string argument. Maybe something along the lines of __[format_]estimate_display_width for the function name?

vitaut added inline comments.Jul 17 2021, 7:20 AM
libcxx/include/__format/parser_std_format_spec.h
717–719

Or __[format_]get_string_alignment since it decides whether alignment should be applied and returns information about how to apply it.

Mordante updated this revision to Diff 359622.Jul 18 2021, 6:49 AM
Mordante marked 2 inline comments as done.

Based on review comments s/format_string_traits/string_alignment/

Thanks for the review!

vitaut accepted this revision.Jul 18 2021, 7:21 AM

LGTM

Mordante added inline comments.Jul 20 2021, 10:14 AM
libcxx/include/__format/parser_std_format_spec.h
830

s/_LIBCPP_INLINE_VISIBILITY/_LIBCPP_HIDE_FROM_ABI/ here and all other occurrences in this patch.

Mordante updated this revision to Diff 363833.Aug 3 2021, 12:11 PM

Rebased and minor cleanups.

Mordante updated this revision to Diff 364352.Aug 4 2021, 10:42 PM

Adds UNSUPPORTED: libcpp-has-no-incomplete-format for the test.

Mordante updated this revision to Diff 370726.Sep 4 2021, 3:52 AM

Rebased to trigger CI.

Mordante updated this revision to Diff 372753.Sep 15 2021, 11:09 AM

Rebased to update newer patches in the series.

Mordante updated this revision to Diff 372870.Sep 15 2021, 11:03 PM

Guard against min/max macros.

ldionne requested changes to this revision.Sep 22 2021, 12:55 PM
ldionne added inline comments.
libcxx/benchmarks/CMakeLists.txt
148–149 ↗(On Diff #372870)

Can we make those changes as part of another review? Also, let's set CXX_STANDARD_REQUIRED YES instead -- all supported compilers should support C++20 now.

libcxx/benchmarks/std_format_spec_string_unicode.bench.cpp
8

I don't understand this TODO. We do require C++20 support.

18–20

I'm not sure this is useful -- if the library doesn't support unicode, something in the benchmark is going to fail anyway. It would be nice to remove as many libc++ specific things from the benchmark as possible.

22

The benchmarks should be compiled without assertions anyway -- it seems pretty hacky to undefine this here, no?

28

If we're going to do something like this, can we instead do namespace format_spec = std::__format_spec or something similar?

85

This line break is really odd.

libcxx/include/__format/parser_std_format_spec.h
758

Just a general comment -- we generally try to use C++ style comments (//) elsewhere in the library. I don't have the heart to ask you to change them cause it's such a nitpick, but please consider using that style going forward for consistency.

809

Would it make sense to name those __utf8_char or __utf8_character instead? Just __utf8 seems kind of obtuse to me. (applies below too)

876

This needs to be inline since it's not a template and it's in a header. This certainly applies elsewhere too, please take a look. Otherwise, those are ODR violations waiting to happen (arguably those ODR violations are benign, but still).

libcxx/test/libcxx/utilities/format/format.string/format.string.std/std_format_spec_string_unicode.pass.cpp
2

I would suggest splitting this into two files and using the libcpp-has-no-unicode Lit feature to enable/disable them. That sounds a bit cleaner since essentially the whole test is different depending on whether unicode is supported. Your call, this is not a blocking comment.

51

It just occurred to me that I am confused by the existence of both _LIBCPP_HAS_NO_UNICODE and _LIBCPP_HAS_NO_UNICODE_CHARS -- can you shed some light on this if it's fresh in your mind?

This revision now requires changes to proceed.Sep 22 2021, 12:55 PM
Mordante marked 8 inline comments as done.Sep 23 2021, 10:24 AM
Mordante added subscribers: DanielMcIntosh-IBM, daltenty.

Thanks for the review. I'll update the patch after you land D110338.

libcxx/benchmarks/CMakeLists.txt
148–149 ↗(On Diff #372870)

See D110338.

libcxx/benchmarks/std_format_spec_string_unicode.bench.cpp
8

This TODO was written at a time we didn't require it yet. Hence the TODO to remind me. I remove the test and the comment.

18–20

Good catch this is already tested at the start of the file. This benchmark is really intended to test some libcxx implementation details. For example the __column_width_3 function has several different ways to implement it. This had a performance impact. If the benchmarks should be usable for other compilers/standard libraries I would rather add some guards in this test.

22

I agree I think I used the benchmarks for testing originally. So I removed this hack and the validation.

85

My clang-format settings are still at 80 columns to avoid merge conflicts in this series. I'll fix this manually.

libcxx/include/__format/parser_std_format_spec.h
758

I'll try to keep it in mind. After we switched to clang-fomrat-13 I still want to find some time to reformat the code using our new settings. Then I probably will look at fixing these comments.

876

I omitted the inline since constexpr functions are implicitly inline, guarding against ODR violations.
http://eel.is/c++draft/dcl.constexpr#1

A function or static data member declared with the constexpr or consteval specifier is implicitly an inline function or variable ([dcl.inline]).

Did I misinterpret this rule?
Do you still prefer the inline?

libcxx/test/libcxx/utilities/format/format.string/format.string.std/std_format_spec_string_unicode.pass.cpp
2

I like your suggestion it indeed sounds like a cleaner solution.

51

_LIBCPP_HAS_NO_UNICODE is the switch I added so the user can disable Unicode output. It's introduced in libc++13 using a CMake option. (If needed we can look whether this define can/should be renamed.)

_LIBCPP_HAS_NO_UNICODE_CHARS is not fresh in my mind, but grep does miracles ;-)
_LIBCPP_HAS_NO_UNICODE_CHARS is enabled in __config when the IBM compiler is used.
It was added by Howard in 2013, I've no idea whether that's still needed.
@DanielMcIntosh-IBM, @daltenty do you know whether the IBM compiler has proper support for char16_t and char32_t?

Mordante updated this revision to Diff 374899.Sep 24 2021, 10:23 AM
Mordante marked 6 inline comments as done.

Addresses review comments.

Mordante updated this revision to Diff 376298.Sep 30 2021, 11:23 AM

Mark some functions inline. Not strictly required but @ldionne prefers it, as stated in D110515

Mordante marked an inline comment as done.Sep 30 2021, 11:24 AM
ldionne accepted this revision.Oct 1 2021, 1:50 PM
This revision is now accepted and ready to land.Oct 1 2021, 1:50 PM
This revision was automatically updated to reflect the committed changes.