This is an archive of the discontinued LLVM Phabricator instance.

[libc++][format] Improves Unicode decoders.
ClosedPublic

Authored by Mordante on Feb 19 2023, 5:01 AM.

Details

Summary

During the implementation of P2286 a second Unicode decoder was added.
The original decoder was only used for the width estimation. Changing
an ill-formed Unicode sequence to the replacement character, works
properly for this use case. For P2286 an ill-formed Unicode sequence
needs to be formatted as a sequence of code units. The exact wording in
the Standard as a bit unclear and there was odd example in the WP. This
made it hard to use the same decoder. SG16 determined the odd example in
the WP was a bug and this has been fixed in the WP.

This made it possible to combine the two decoders. The P2286 decoder
kept track of the size of the ill-formed sequence. However this was not
needed since the output algorithm needs to keep track of size of a
well-formed and an ill-formed sequence. So this feature has been
removed.

The error status remains since it's needed for P2286, the grapheme
clustering can ignore this unneeded value. (In general, grapheme
clustering is only has specified behaviour for Unicode. When the string
is in a non-Unicode encoding there are no requirements. Ill-formed
Unicode is a non-Unicode encoding. Still libc++ does a best effort
estimation.)

There UTF-8 decoder accepted several ill-formed sequences:

  • Values in the surrogate range U+D800..U+DFFF.
  • Values encoded in more code units than required, for example 0+0020 in theory can be encoded using 1, 2, 3, or 4 were accepted. This is not allowed by the Unicode Standard.
  • Values larger than U+10FFFF were not always rejected.

Diff Detail

Event Timeline

Mordante created this revision.Feb 19 2023, 5:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 19 2023, 5:01 AM
Mordante updated this revision to Diff 498678.Feb 19 2023, 7:48 AM

GCC 11 fixes.

Mordante updated this revision to Diff 498868.Feb 20 2023, 8:34 AM

Polishing before up for review.

Mordante published this revision for review.Feb 21 2023, 8:24 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 21 2023, 8:24 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
tahonermann added inline comments.Feb 21 2023, 3:28 PM
libcxx/include/__format/unicode.h
139–147

I corrected several of the U+XXXX identifiers in the suggested edit. I also aligned the remarks with the rows that I think they better correspond to.

150

I don't understand this footnote. The full range of code points that are encodeable in a single code unit is U+0000..U+007F.

Thanks for the review!

libcxx/include/__format/unicode.h
150

It seems the * is placed on the wrong line, it should have been at *C2*..DF 80..BF.
Based on the encoding scheme that requires the first code unit to start with 110xxxxx, this allows the values starting from 0xC0. This value is not marked in the Unicode Standard, but I think it's good to point out. Especially since this decoder doesn't use a nested if statement. Instead it decodes the value and tests whether it's in the valid range. This reduces the number of comparisons. IMO this makes the code easier to read.

Mordante updated this revision to Diff 499577.Feb 22 2023, 10:25 AM

Addresses review comments.

tahonermann added inline comments.Feb 24 2023, 9:29 AM
libcxx/include/__format/unicode.h
139–147

Here is another presentation option that avoids the need for those footnotes. If you like this better, great. If not, no problem. The current presentation has the benefit of matching the bold highlighting in the table from the Unicode Standard, but I think the suggested presentation better explains the reason those invalid ranges exist.

Mordante marked 3 inline comments as done.Feb 24 2023, 12:56 PM
Mordante added inline comments.
libcxx/include/__format/unicode.h
139–147

Actually I like this a lot, thanks! This matches the code closer; it does
not validate all ranges, but it rejects the "invalid overlong encoding"

I made a few more changes in the surrounding comments, since they looked odd with the new table.

Mordante updated this revision to Diff 500278.Feb 24 2023, 12:56 PM
Mordante marked an inline comment as done.

Rebased and addresses review comments.

ldionne accepted this revision as: ldionne.Mar 7 2023, 8:41 AM
ldionne added a subscriber: ldionne.

This LGTM at a high level, leaving approval to @tahonermann since he's got much more domain knowledge about unicode.

tahonermann accepted this revision.Mar 7 2023, 6:11 PM

I reviewed more thoroughly and this looks great to me! I added a comment about a possible preexisting issue with wchar_t that it might make sense to address with this change.

libcxx/include/__format/unicode.h
264

It is implementation-defined whether wchar_t is a signed type so I think a cast to an unsigned type is needed here and in other cases below where first is dereferenced.

This would be a good use case for a std::as_unsigned() or std::to_unsigned() function.

Mordante accepted this revision.Mar 8 2023, 8:55 AM
Mordante marked an inline comment as done.

Thanks for the reviews. I'll apply the fixes and land it when the CI is green.

libcxx/include/__format/unicode.h
264

I think it's not needed here, but on line 277 it indeed might be a good idea. For consistency I did it at all places. I think a static_cast is fine, I did the same for char.

This revision is now accepted and ready to land.Mar 8 2023, 8:55 AM
Mordante updated this revision to Diff 503400.Mar 8 2023, 9:00 AM
Mordante marked an inline comment as done.

Rebased and addresses review comments.

This revision was automatically updated to reflect the committed changes.