This patch fixes one case where the decoding member function in() was returning partial instead of error. Additionally, it adds large testsuite that tests conversions between UTF-8 and other encodings. The testsuite covers this bug.
Diff Detail
Event Timeline
Thanks for working on this! I just did a quick scan over the code. I really want to review it after the formatting changes are undone and the CI passes.
libcxx/src/locale.cpp | ||
---|---|---|
2024–2052 | Can you undo the formatting changes in this hunk? It makes finding the real changes quite hard. | |
libcxx/test/std/localization/codecvt_unicode.h | ||
9 | Please add include guards. | |
30 | There's no real reason to use trailing return types here. | |
36 | We normally don't do this, it doesn't improve the readability of the code. | |
37 | Just to improve the readability. | |
55 | Please don't use auto here, this does not match the LLVM coding style. | |
libcxx/test/std/localization/locale.categories/category.ctype/locale.codecvt/locale.codecvt.members/char16_t_out.pass.cpp | ||
32 | I'm not fond of this include path, it feels quite fragile. I think it would be better to move the code to the test/support directory then the suggestion above works. (This is the same location as the test_macros.h reside. | |
libcxx/test/std/localization/locale.stdcvt/codecvt_utf8_in.pass.cpp | ||
275 | This is the preferred style. For the compilers we support this works in C++03 mode. |
@Mordante I think now it is ready to be reviewed. I've undone the clang-format. As for the CI, everything passes except for "Apple back deployment" which I don't know what it is.
Patch with full context. I forgot the CLI parametar -U999999 when I was generating my previous patch.
Test codecvts with char8_t, too. Deal with apple back-deployment and properly mark test with XFAIL.
Sorry for the late review, but I was quite busy last week.
Thanks a lot for fixes. I really like the additional unit tests!
Several minor issues with the patch.
libcxx/test/std/localization/codecvt_unicode.pass.cpp | ||
---|---|---|
21 ↗ | (On Diff #496235) | Please have one declaration per line. |
34 ↗ | (On Diff #496235) | Can you use std::array instead? This is available on all platforms where we support C++03. |
70 ↗ | (On Diff #496235) | What is the difference between this part of the test and the one on line 52? Please add some comments. |
111–115 ↗ | (On Diff #496235) | I think this order of the test improves readability, same for the other tests. Now the "too small bufffer", gradually grows to the proper size and then we make the "input too small". Maybe even more readable would be to have a test case there the 3th CP exactly fits in the output. |
148 ↗ | (On Diff #496235) | I think it would be good to test a few more corner cases in this test.
|
175 ↗ | (On Diff #496235) | What's the difference between an ASCII byte and an invalid byte? |
706 ↗ | (On Diff #496235) | Can you make sure all these blocks have comments. Especially since you use the surrogate values here, are you testing the surrogate values fail, or that the input is malformed in other ways. |
libcxx/test/std/localization/codecvt_unicode.pass.cpp | ||
---|---|---|
34 ↗ | (On Diff #496235) | std::array is not a good fit in this case for three reasons:
|
70 ↗ | (On Diff #496235) | This one calls with the full out-buffer, see bellow out, std::end(out). |
111–115 ↗ | (On Diff #496235) | I think this is subjective, you can give arguments for few different orderings. |
175 ↗ | (On Diff #496235) | Well in this test-case there is no difference. But in general, in UTF-8 string if your aim is to fully decode a string then all valid sequences must be treated as valid, and any erroneous bytes between them should be either skipped, replaced with a replacement char, or reported upwards in the call chain (or some combination of these). the ASCII byte breaks the original sequence but creates a new smaller valid sequence. To reach it, once you receive error, you can push your input pointer by one and do another call to in() to check if there is another valid sequence further in the string. |
The changes look good to me. It took me a while to convince myself that the intended behavioral change is correct, but I eventually concluded that the changes match the intent in [locale.codecvt.virtuals]p5 (http://eel.is/c++draft/locale.codecvt#virtuals-5).
The tests and test methodology likewise look good to me. One suggestion to consider: In my own testing, I like to test the boundaries of each valid encoding range (see https://github.com/tahonermann/text_view/blob/master/test/test-encodings.cpp#L1333-L1357) to ensure coverage for all well-formed code unit sequences. Likewise, it can be useful to exercise that an error is produced for ill-formed code unit sequences just outside each of those boundaries.
I like @tahonermann's suggestion to test the edge cases.
libcxx/test/std/localization/codecvt_unicode.pass.cpp | ||
---|---|---|
111–115 ↗ | (On Diff #496235) | I agree it's subjective, it's just what feels easier for me. I'm concerned that the test is not easy to understand, even with the comments. I'm aware of the problem domain and what you are testing. Even with that knowledge I had issues understanding the test. So I fear it will be worse for people not too familiar with UTF-8 encoding. |
175 ↗ | (On Diff #496235) | Fair point. I think it would be good to mention the ASCII byte is a valid one code point code unit, since that is what actually matters. The test would give the same result when the code unit was the start of a multibyte code unit, right? (Except then the next code unit might be invalid again.) |
That can be done as a separate patch after this one gets accepted. One has to think how to incorporate that testing framework into this one, there is no straightforward way. That takes time. This testsuite is pretty comprehensive on its own. We should massage this one until its ready to be merged, and after than larger changes can be done.
libcxx/test/std/localization/codecvt_unicode.pass.cpp | ||
---|---|---|
111–115 ↗ | (On Diff #496235) | The problem lies in the specification for std::codecvt it is underspecified and hard to understand. Everyone will have the same hard time and there is no way around it. One has to reread the specs multiple times and after that the tests should be easier to read. Maybe I can add here more comments, what do you think? Or you want me to change the order of the test cases? |
175 ↗ | (On Diff #496235) | I did not understand you here. |
That can be done as a separate patch after this one gets accepted. One has to think how to incorporate that testing framework into this one, there is no straightforward way. That takes time. This testsuite is pretty comprehensive on its own. We should massage this one until its ready to be merged, and after than larger changes can be done.
That rationale produces a different response for me. Changing testing frameworks is tricky as it is easy to inadvertently lose coverage in the process. I see that as reason to design the testing framework to suite the eventual needs (when known) up front.
I find your concerns completely unjustified. You can always send a patch with tests in a completely separate file. I encourage you to do it. I can't do your work.
This patch is supposed to be a bugfix first, and a testsuite second, and a pretty good one too.
You are under no obligation to agree with them.
You can always send a patch with tests in a completely separate file. I encourage you to do it. I can't do your work.
I'm not sure what you are attributing as being "my work", nor why you would consider it my obligation. Code review is motivated by a desire to maximize quality. If you think a suggestion is a bad idea, out of scope, something you don't have time for or just don't want to do, that is certainly ok.
This patch is supposed to be a bugfix first, and a testsuite second, and a pretty good one too.
Indeed, and thank you for it. The bug that you have proposed a fix for might have have been avoided had more extensive test coverage been in place. I presume you are a user of these interfaces (I am not) and therefore have a desire for them to work reliably. My suggestion was motivated to help fill in additional testing gaps using the infrastructure you are now offering in the hopes that doing so would identify additional defects or prevent regressions in the future (which I presume you would benefit from). If you don't find that motivating, that is ok. The libc++ maintainers can (and will) determine if they are sufficiently motivated to accept any future burden of maintaining and/or improving what you have offered or whether they would like additional changes first before accepting (I am not a libc++ maintainer).
libcxx/test/std/localization/codecvt_unicode.pass.cpp | ||
---|---|---|
175 ↗ | (On Diff #496235) | I added additional comments here with my latest patch and I think it explains the situation much better. |
Someone should process the issue on Github, its still sitting there tagged as new issue https://github.com/llvm/llvm-project/issues/60177 .
I added some minor suggested edits, but otherwise, I think this is fine to accept.
libcxx/test/std/localization/codecvt_unicode.pass.cpp | ||
---|---|---|
46 ↗ | (On Diff #501686) | This depends on the ordinary literal encoding being UTF-8 and that is not guaranteed (note that people are working on Clang's support for non-ASCII based operating systems). The suggested edit avoids that dependency. |
100 ↗ | (On Diff #501686) | |
158 ↗ | (On Diff #501686) | |
174 ↗ | (On Diff #501686) | |
301 ↗ | (On Diff #501686) | |
336 ↗ | (On Diff #501686) | |
387 ↗ | (On Diff #501686) | |
454 ↗ | (On Diff #501686) | |
507 ↗ | (On Diff #501686) | |
570 ↗ | (On Diff #501686) | |
713 ↗ | (On Diff #501686) | |
748 ↗ | (On Diff #501686) | |
806 ↗ | (On Diff #501686) | |
889 ↗ | (On Diff #501686) | |
942 ↗ | (On Diff #501686) | |
991 ↗ | (On Diff #501686) | |
1145 ↗ | (On Diff #501686) | |
1180 ↗ | (On Diff #501686) | |
1225 ↗ | (On Diff #501686) |
@dimztimz do you need anything in order to make progress on this? It looks like there's a few comments to address, then this can be rebased and seems like folks were happy with the patch.
Can you undo the formatting changes in this hunk? It makes finding the real changes quite hard.
(I know the format CI will probably complain about it, but you can ignore that. We are tuning the CI to not give these unwanted messages in the future.)