This is an archive of the discontinued LLVM Phabricator instance.

[libc++][format] Improves fill character.
ClosedPublic

Authored by Mordante on Feb 24 2023, 9:03 AM.

Details

Reviewers
ldionne
tahonermann
vitaut
Group Reviewers
Restricted Project
Commits
rG5db033e204b2: [libc++][format] Improves fill character.
Summary

The main change is to allow a UCS scalar value as fill character.
Especially for char based formatting this increase the number of valid
characters. Originally this was to be expected ABI breaking, however the
current change does not seem to break the ABI.

Implements

  • P2572 std::format() fill character allowances

Depends on D144499

Diff Detail

Event Timeline

Mordante created this revision.Feb 24 2023, 9:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 24 2023, 9:03 AM
Mordante updated this revision to Diff 500544.Feb 26 2023, 4:00 AM

CI fixes.

Mordante updated this revision to Diff 516052.Apr 22 2023, 3:20 AM

CI fixes.

Mordante updated this revision to Diff 516062.Apr 22 2023, 6:01 AM

CI fixes.

Mordante updated this revision to Diff 516094.Apr 22 2023, 12:07 PM

Try to determine error.

Mordante updated this revision to Diff 516098.Apr 22 2023, 12:45 PM

Fixes UTF-16 bug.

Mordante updated this revision to Diff 517617.Apr 27 2023, 9:53 AM

Tests CI.

Mordante updated this revision to Diff 517630.Apr 27 2023, 10:21 AM

Trigger CI.

Mordante updated this revision to Diff 517641.Apr 27 2023, 10:45 AM

Trigger CI.

Mordante updated this revision to Diff 517657.Apr 27 2023, 11:15 AM

Adds missing paren.

Mordante updated this revision to Diff 517966.Apr 28 2023, 9:51 AM

Rebased and code polishing.

Mordante published this revision for review.Apr 30 2023, 4:32 AM
Mordante added reviewers: ldionne, tahonermann, vitaut.
Mordante added inline comments.
libcxx/include/__format/parser_std_format_spec.h
432–433
Herald added a project: Restricted Project. · View Herald TranscriptApr 30 2023, 4:32 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript

This looks great. I added one comment seeking clarification when an encoding error is present. If the code is as intended, it might help to add a comment to explain what is happening.

libcxx/include/__format/parser_std_format_spec.h
452–457

This seems a little odd to me. When consumption of the fill character fails (due to an encoding issue), an attempt is still made to parse the alignment at the new position before checking for the consumption error and then reporting a parse issue. I'm not sure why that attempt is made since success is going to lead to reporting the fill character failure and failure is going to result in falling through to retry parsing the alignment at the beginning anyway. If the intent is to get to different error messages, that seems reasonable, but it seems this can fallback to trying to parse the alignment at the beginning anyway. Am I missing something?

JMazurkiewicz added inline comments.
libcxx/test/std/utilities/format/format.functions/fill.unicode.pass.cpp
65
Mordante marked 2 inline comments as done.May 13 2023, 3:04 AM

Thanks for the reviews!

libcxx/include/__format/parser_std_format_spec.h
452–457

I wrote this in this way due to historic reasons. A character like 0 is a fill-character when followed by an alignment. Else it's a zero-padding. Since valid elements of the format-spec are not invalid Unicode this is not needed. I adjusted it, but reverted it again. For UTF-32 it makes sense to only test when it's a fill character, and I like to keep the same diagnostic regardless of the encoding used. So I added comment to explain the design.

Mordante updated this revision to Diff 521889.May 13 2023, 3:05 AM
Mordante marked an inline comment as done.

Rebased and addresses review comments.

ldionne accepted this revision.May 16 2023, 9:23 AM
This revision is now accepted and ready to land.May 16 2023, 9:23 AM
ldionne added inline comments.May 16 2023, 9:24 AM
libcxx/include/__format/parser_std_format_spec.h
231

Maybe this could be renamed to __codepoint? That way you'd have __codepoint<_CharT> __fill_;.

Mordante added inline comments.May 17 2023, 11:20 AM
libcxx/include/__format/parser_std_format_spec.h
452–457

@tahonermann are you happy with this comment?

tahonermann added inline comments.May 17 2023, 12:48 PM
libcxx/include/__format/parser_std_format_spec.h
452–457

I agree with the goal of keeping the diagnostics aligned, but I don't think this change does that.

The concern I have is that, when a code point isn't decoded by the call to __view.__consume(), the current location in the view will have been bumped by one code unit (based on my reading of the __consume() implementation for the __code_point_view<char> explicit specialization). There isn't much reason to expect a code point to be successfully decoded at that location. The likely result is that this block isn't entered (because the call to __parse_alignment(*__view.__position() above fails to match an alignment character; note that it doesn't even attempt proper decoding) and execution falls through to the __parse_alignment(*__begin) below which will likely return false thus leading to __parse_fill_align() returning with an indication that the fill-and-align option is not present. I haven't looked into what might happen next.

The UTF-8 and UTF-16 cases are fundamentally different compared to UTF-32 since they are variable length encodings and the latter is a trivial fixed length encoding. I think it is reasonable to throw a distinct error in this case since this problem cannot occur in UTF-32 (a failure to decode a code point at all is subtly different than decoding a code point that is not a UCS scalar value). It is trivially easy to step to the next code unit sequence in UTF-32, but not in UTF-8 or UTF-16.

libcxx/test/std/utilities/format/format.functions/fill.unicode.pass.cpp
85

Suggested tests for ill-formed UTF-8 code unit sequences:

check_exception("???", SV("{:\x80^}"), 42);         // Trailing code unit with no leading one.
check_exception("???", SV("{:\xc0^}"), 42);         // Missing trailing code unit.
check_exception("???", SV("{:\xe0\x80^}"), 42);     // Missing trailing code unit.
check_exception("???", SV("{:\xf0\x80^}"), 42);     // Missing two trailing code units.
check_exception("???", SV("{:\xf0\x80\x80^}"), 42); // Missing trailing code unit.
88–91

These all test lone surrogates. Here is a suggested test for reversed surrogates:

check_exception("???", std::wstring_view{L"{:\xdc00\xd800^}"}, 42);
Mordante updated this revision to Diff 523427.May 18 2023, 10:06 AM
Mordante marked 4 inline comments as done.

Addresses review comments.

Thanks fro the reviews!

libcxx/include/__format/parser_std_format_spec.h
452–457

Good point thanks! I tested with your additional suggested tests and the fill-and-align is indeed not properly detected for some of the new test cases. This results in a replacement-field that doesn't end with a '}'. There the error doesn't mention fill either. So having a different error for UTF-32 and UTF-8/16 seems the way forward. (I don't feel like doing more effort to possibly detect an alignment for an error message to be worth the code size.)

libcxx/test/std/utilities/format/format.functions/fill.unicode.pass.cpp
85

I tested this before changing the code in the header. The first to result in The fill character contains an invalid value, the third "The fill character contains an invalid value. (I didn't test the fourth and fifth.)

88–91

This also results in "The format-spec should consume the input or end with a '}'

tahonermann accepted this revision.May 18 2023, 2:48 PM

This looks good to me now. I added one comment with suggested alternatives for the new error message, but your call on whether you like them better then what you have.

libcxx/include/__format/parser_std_format_spec.h
451

"... contains an ill-formed code unit sequence" seems more accurate to me, but that is probably too technical for the intended audience.

Mordante marked an inline comment as done.May 19 2023, 8:07 AM

Thanks for the review!

libcxx/include/__format/parser_std_format_spec.h
451

"... contains an ill-formed code unit sequence" seems more accurate to me, but that is probably too technical for the intended audience.

Yes, I tried to keep the target audience in mind when I wrote the message. If it was purely internally I would have picked a more accurate message.

This revision was automatically updated to reflect the committed changes.
Mordante marked an inline comment as done.