This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Fix UTF-8 decoding in codecvts. Fix #60177.
AbandonedPublic

Authored by dimztimz on Feb 5 2023, 2:09 PM.

Details

Reviewers
ldionne
Mordante
Group Reviewers
Restricted Project
Summary

This patch fixes one case where the decoding member function in() was returning partial instead of error. Additionally, it adds large testsuite that tests conversions between UTF-8 and other encodings. The testsuite covers this bug.

Diff Detail

Event Timeline

dimztimz created this revision.Feb 5 2023, 2:09 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 5 2023, 2:09 PM
dimztimz requested review of this revision.Feb 5 2023, 2:09 PM
Herald added 1 blocking reviewer(s): Restricted Project. · View Herald TranscriptFeb 5 2023, 2:09 PM
dimztimz updated this revision to Diff 494951.Feb 5 2023, 2:41 PM

Apply clang-format.

dimztimz updated this revision to Diff 494953.Feb 5 2023, 3:06 PM

Apply clang-format second time.

dimztimz updated this revision to Diff 494960.Feb 5 2023, 3:41 PM

Replace non-ASCII characters in strings with escape codes.

Thanks for working on this! I just did a quick scan over the code. I really want to review it after the formatting changes are undone and the CI passes.

libcxx/src/locale.cpp
2024–2052

Can you undo the formatting changes in this hunk? It makes finding the real changes quite hard.
(I know the format CI will probably complain about it, but you can ignore that. We are tuning the CI to not give these unwanted messages in the future.)

libcxx/test/std/localization/codecvt_unicode.h
9

Please add include guards.

30

There's no real reason to use trailing return types here.

36

We normally don't do this, it doesn't improve the readability of the code.

37

Just to improve the readability.

55

Please don't use auto here, this does not match the LLVM coding style.

libcxx/test/std/localization/locale.categories/category.ctype/locale.codecvt/locale.codecvt.members/char16_t_out.pass.cpp
32

I'm not fond of this include path, it feels quite fragile. I think it would be better to move the code to the test/support directory then the suggestion above works. (This is the same location as the test_macros.h reside.

libcxx/test/std/localization/locale.stdcvt/codecvt_utf8_in.pass.cpp
275

This is the preferred style. For the compilers we support this works in C++03 mode.
You could even consider to remove the entire typedef since it's only used once.

dimztimz updated this revision to Diff 495109.Feb 6 2023, 6:51 AM

Fix tests for C++03.

dimztimz marked 4 inline comments as done.Feb 6 2023, 6:56 AM
This comment was removed by dimztimz.
libcxx/test/std/localization/codecvt_unicode.h
37

Just to improve the readability.

dimztimz marked an inline comment as done.Feb 6 2023, 7:00 AM
dimztimz updated this revision to Diff 495210.Feb 6 2023, 10:33 AM

Resolve cosmetic issues.

dimztimz marked 3 inline comments as done.Feb 6 2023, 10:34 AM

@Mordante I think now it is ready to be reviewed. I've undone the clang-format. As for the CI, everything passes except for "Apple back deployment" which I don't know what it is.

dimztimz updated this revision to Diff 495262.EditedFeb 6 2023, 1:22 PM

Patch with full context. I forgot the CLI parametar -U999999 when I was generating my previous patch.

dimztimz marked an inline comment as done.Feb 6 2023, 1:28 PM
dimztimz updated this revision to Diff 496223.Feb 9 2023, 1:19 PM

Test codecvts with char8_t, too. Deal with apple back-deployment and properly mark test with XFAIL.

dimztimz updated this revision to Diff 496225.Feb 9 2023, 1:30 PM

Non-ASCII chars.

dimztimz updated this revision to Diff 496235.Feb 9 2023, 2:14 PM

fix for windows

Sorry for the late review, but I was quite busy last week.
Thanks a lot for fixes. I really like the additional unit tests!

Several minor issues with the patch.

libcxx/test/std/localization/codecvt_unicode.pass.cpp
21 ↗(On Diff #496235)

Please have one declaration per line.

34 ↗(On Diff #496235)

Can you use std::array instead? This is available on all platforms where we support C++03.

70 ↗(On Diff #496235)

What is the difference between this part of the test and the one on line 52? Please add some comments.

111–115 ↗(On Diff #496235)

I think this order of the test improves readability, same for the other tests.

Now the "too small bufffer", gradually grows to the proper size and then we make the "input too small".

Maybe even more readable would be to have a test case there the 3th CP exactly fits in the output.

148 ↗(On Diff #496235)

I think it would be good to test a few more corner cases in this test.

  • input values in the surrogate range (U+D800 to U+DBFF and U+DC00 to U+DFFF)
  • outside the valid range > U+10FFFF
175 ↗(On Diff #496235)

What's the difference between an ASCII byte and an invalid byte?
Both are just invalid due not having the bit pattern 10xxxxxx, right?

706 ↗(On Diff #496235)

Can you make sure all these blocks have comments.
The tests are not to easy to read, without comment I really have hard time to validate the test.

Especially since you use the surrogate values here, are you testing the surrogate values fail, or that the input is malformed in other ways.

dimztimz added inline comments.Feb 13 2023, 4:16 AM
libcxx/test/std/localization/codecvt_unicode.pass.cpp
34 ↗(On Diff #496235)

std::array is not a good fit in this case for three reasons:

  1. There is no inference of size.
  2. Does not play as well with string literals.
  3. Most importantly, in C++03 the member function size() is not constexpr.
70 ↗(On Diff #496235)

This one calls with the full out-buffer, see bellow out, std::end(out).

111–115 ↗(On Diff #496235)

I think this is subjective, you can give arguments for few different orderings.

175 ↗(On Diff #496235)

Well in this test-case there is no difference. But in general, in UTF-8 string if your aim is to fully decode a string then all valid sequences must be treated as valid, and any erroneous bytes between them should be either skipped, replaced with a replacement char, or reported upwards in the call chain (or some combination of these). the ASCII byte breaks the original sequence but creates a new smaller valid sequence. To reach it, once you receive error, you can push your input pointer by one and do another call to in() to check if there is another valid sequence further in the string.

dimztimz updated this revision to Diff 498456.Feb 17 2023, 11:17 AM

Add more tests and comments

dimztimz marked 6 inline comments as done.Feb 17 2023, 11:19 AM

The changes look good to me. It took me a while to convince myself that the intended behavioral change is correct, but I eventually concluded that the changes match the intent in [locale.codecvt.virtuals]p5 (http://eel.is/c++draft/locale.codecvt#virtuals-5).

The tests and test methodology likewise look good to me. One suggestion to consider: In my own testing, I like to test the boundaries of each valid encoding range (see https://github.com/tahonermann/text_view/blob/master/test/test-encodings.cpp#L1333-L1357) to ensure coverage for all well-formed code unit sequences. Likewise, it can be useful to exercise that an error is produced for ill-formed code unit sequences just outside each of those boundaries.

I like @tahonermann's suggestion to test the edge cases.

libcxx/test/std/localization/codecvt_unicode.pass.cpp
111–115 ↗(On Diff #496235)

I agree it's subjective, it's just what feels easier for me.

I'm concerned that the test is not easy to understand, even with the comments.

I'm aware of the problem domain and what you are testing. Even with that knowledge I had issues understanding the test. So I fear it will be worse for people not too familiar with UTF-8 encoding.

175 ↗(On Diff #496235)

Fair point. I think it would be good to mention the ASCII byte is a valid one code point code unit, since that is what actually matters. The test would give the same result when the code unit was the start of a multibyte code unit, right? (Except then the next code unit might be invalid again.)

I like @tahonermann's suggestion to test the edge cases.

That can be done as a separate patch after this one gets accepted. One has to think how to incorporate that testing framework into this one, there is no straightforward way. That takes time. This testsuite is pretty comprehensive on its own. We should massage this one until its ready to be merged, and after than larger changes can be done.

libcxx/test/std/localization/codecvt_unicode.pass.cpp
111–115 ↗(On Diff #496235)

The problem lies in the specification for std::codecvt it is underspecified and hard to understand. Everyone will have the same hard time and there is no way around it. One has to reread the specs multiple times and after that the tests should be easier to read.

Maybe I can add here more comments, what do you think? Or you want me to change the order of the test cases?

175 ↗(On Diff #496235)

I did not understand you here.

That can be done as a separate patch after this one gets accepted. One has to think how to incorporate that testing framework into this one, there is no straightforward way. That takes time. This testsuite is pretty comprehensive on its own. We should massage this one until its ready to be merged, and after than larger changes can be done.

That rationale produces a different response for me. Changing testing frameworks is tricky as it is easy to inadvertently lose coverage in the process. I see that as reason to design the testing framework to suite the eventual needs (when known) up front.

That can be done as a separate patch after this one gets accepted. One has to think how to incorporate that testing framework into this one, there is no straightforward way. That takes time. This testsuite is pretty comprehensive on its own. We should massage this one until its ready to be merged, and after than larger changes can be done.

That rationale produces a different response for me. Changing testing frameworks is tricky as it is easy to inadvertently lose coverage in the process. I see that as reason to design the testing framework to suite the eventual needs (when known) up front.

I find your concerns completely unjustified. You can always send a patch with tests in a completely separate file. I encourage you to do it. I can't do your work.

This patch is supposed to be a bugfix first, and a testsuite second, and a pretty good one too.

I find your concerns completely unjustified.

You are under no obligation to agree with them.

You can always send a patch with tests in a completely separate file. I encourage you to do it. I can't do your work.

I'm not sure what you are attributing as being "my work", nor why you would consider it my obligation. Code review is motivated by a desire to maximize quality. If you think a suggestion is a bad idea, out of scope, something you don't have time for or just don't want to do, that is certainly ok.

This patch is supposed to be a bugfix first, and a testsuite second, and a pretty good one too.

Indeed, and thank you for it. The bug that you have proposed a fix for might have have been avoided had more extensive test coverage been in place. I presume you are a user of these interfaces (I am not) and therefore have a desire for them to work reliably. My suggestion was motivated to help fill in additional testing gaps using the infrastructure you are now offering in the hopes that doing so would identify additional defects or prevent regressions in the future (which I presume you would benefit from). If you don't find that motivating, that is ok. The libc++ maintainers can (and will) determine if they are sufficiently motivated to accept any future burden of maintaining and/or improving what you have offered or whether they would like additional changes first before accepting (I am not a libc++ maintainer).

dimztimz updated this revision to Diff 501599.Mar 1 2023, 11:21 AM

More comments. Test for surrogates in UTF-32.

dimztimz marked 3 inline comments as done.Mar 1 2023, 11:23 AM
dimztimz added inline comments.
libcxx/test/std/localization/codecvt_unicode.pass.cpp
175 ↗(On Diff #496235)

I added additional comments here with my latest patch and I think it explains the situation much better.

Someone should process the issue on Github, its still sitting there tagged as new issue https://github.com/llvm/llvm-project/issues/60177 .

dimztimz updated this revision to Diff 501686.Mar 1 2023, 3:42 PM

Improve surrogate test for UTF-32

I added some minor suggested edits, but otherwise, I think this is fine to accept.

libcxx/test/std/localization/codecvt_unicode.pass.cpp
46 ↗(On Diff #501686)

This depends on the ordinary literal encoding being UTF-8 and that is not guaranteed (note that people are working on Clang's support for non-ASCII based operating systems). The suggested edit avoids that dependency.

100 ↗(On Diff #501686)
158 ↗(On Diff #501686)
174 ↗(On Diff #501686)
301 ↗(On Diff #501686)
336 ↗(On Diff #501686)
387 ↗(On Diff #501686)
454 ↗(On Diff #501686)
507 ↗(On Diff #501686)
570 ↗(On Diff #501686)
713 ↗(On Diff #501686)
748 ↗(On Diff #501686)
806 ↗(On Diff #501686)
889 ↗(On Diff #501686)
942 ↗(On Diff #501686)
991 ↗(On Diff #501686)
1145 ↗(On Diff #501686)
1180 ↗(On Diff #501686)
1225 ↗(On Diff #501686)

@dimztimz do you need anything in order to make progress on this? It looks like there's a few comments to address, then this can be rebased and seems like folks were happy with the patch.

dimztimz abandoned this revision.Oct 1 2023, 4:34 AM