This is an archive of the discontinued LLVM Phabricator instance.


Differential D143349

[libc++] Fix UTF-8 decoding in codecvts. Fix #60177.
Abandoned, Public

Authored by dimztimz on Feb 5 2023, 2:09 PM.


Details

Reviewers

ldionne
Mordante

Group Reviewers

Restricted Project

Summary

This patch fixes one case where the decoding member function in() was returning partial instead of error. Additionally, it adds a large test suite that tests conversions between UTF-8 and other encodings. The test suite covers this bug.
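As an illustration of the behaviour described in the summary, here is a minimal standalone sketch of the check order for a three-byte sequence. It is not the libc++ code: `result` and `classify_three_byte` are hypothetical names, and the byte tests are copied from the ones visible in the diff to `locale.cpp`. The idea is that each continuation byte is validated as soon as it is available, so a truncated input that is already ill-formed is reported as an error rather than as partial.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for codecvt_base::result.
enum class result { ok, partial, error };

// Classify the bytes [p, p + n) whose first byte is a three-byte lead
// (0xE0..0xEF). Hypothetical helper, for illustration only.
result classify_three_byte(const std::uint8_t* p, std::size_t n) {
  assert(n >= 1 && p[0] >= 0xE0 && p[0] <= 0xEF);
  if (n < 2)
    return result::partial; // only the lead byte is available
  const std::uint8_t c1 = p[0];
  const std::uint8_t c2 = p[1];
  bool c2_ok = false;
  switch (c1) {
  case 0xE0:
    c2_ok = (c2 & 0xE0) == 0xA0; // rejects overlong encodings
    break;
  case 0xED:
    c2_ok = (c2 & 0xE0) == 0x80; // rejects UTF-16 surrogates
    break;
  default:
    c2_ok = (c2 & 0xC0) == 0x80;
    break;
  }
  if (!c2_ok)
    return result::error; // ill-formed even though the input is short
  if (n < 3)
    return result::partial; // well-formed so far, but truncated
  if ((p[2] & 0xC0) != 0x80)
    return result::error;
  return result::ok;
}
```

With the old order (length check first), the two-byte input `0xEA 'z'` would have been classified as partial; with this order it is an error.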

Diff Detail

Event Timeline

dimztimz created this revision. Feb 5 2023, 2:09 PM

Herald added a project: Restricted Project. Feb 5 2023, 2:09 PM

dimztimz requested review of this revision. Feb 5 2023, 2:09 PM

Herald added 1 blocking reviewer(s): Restricted Project. Feb 5 2023, 2:09 PM

Herald added a subscriber: libcxx-commits.

Harbormaster completed remote builds in B211956: Diff 494946. Feb 5 2023, 2:17 PM

Apply clang-format.

Harbormaster completed remote builds in B211959: Diff 494951. Feb 5 2023, 2:46 PM

Apply clang-format second time.

Harbormaster completed remote builds in B211962: Diff 494953. Feb 5 2023, 3:11 PM

Replace non-ASCII characters in strings with escape codes.

Harbormaster completed remote builds in B211969: Diff 494960. Feb 5 2023, 4:08 PM

Thanks for working on this! I just did a quick scan over the code. I really want to review it after the formatting changes are undone and the CI passes.

libcxx/src/locale.cpp
2024–2053	Can you undo the formatting changes in this hunk? It makes finding the real changes quite hard. (I know the format CI will probably complain about it, but you can ignore that. We are tuning the CI to not give these unwanted messages in the future.)
libcxx/test/std/localization/codecvt_unicode.h
8 ↗	(On Diff #494960)	Please add include guards.
29 ↗	(On Diff #494960)	There's no real reason to use trailing return types here.
35 ↗	(On Diff #494960)	We normally don't do this, it doesn't improve the readability of the code.
36 ↗	(On Diff #494960)	Just to improve the readability.
54 ↗	(On Diff #494960)	Please don't use `auto` here, this does not match the LLVM coding style.
libcxx/test/std/localization/locale.categories/category.ctype/locale.codecvt/locale.codecvt.members/char16_t_out.pass.cpp
32 ↗	(On Diff #494960)	I'm not fond of this include path; it feels quite fragile. I think it would be better to move the code to the `test/support` directory; then the suggestion above works. (This is the same location where `test_macros.h` resides.)
libcxx/test/std/localization/locale.stdcvt/codecvt_utf8_in.pass.cpp
275 ↗	(On Diff #494960)	This is the preferred style. For the compilers we support this works in C++03 mode. You could even consider removing the entire `typedef` since it's only used once.

tahonermann added a subscriber: tahonermann. Feb 6 2023, 6:17 AM

Fix tests for C++03.

dimztimz marked 4 inline comments as done. Feb 6 2023, 6:56 AM

This comment was removed by dimztimz.

libcxx/test/std/localization/codecvt_unicode.h
36 ↗	(On Diff #494960)	Just to improve the readability.

dimztimz marked an inline comment as done. Feb 6 2023, 7:00 AM

Harbormaster completed remote builds in B212085: Diff 495109. Feb 6 2023, 9:38 AM

Resolve cosmetic issues.

dimztimz marked 3 inline comments as done. Feb 6 2023, 10:34 AM

@Mordante I think it is now ready to be reviewed. I've undone the clang-format changes. As for the CI, everything passes except for "Apple back deployment", and I don't know what that is.

Patch with full context. I forgot the CLI parameter -U999999 when I was generating my previous patch.

dimztimz marked an inline comment as done. Feb 6 2023, 1:28 PM

Harbormaster completed remote builds in B212199: Diff 495262. Feb 6 2023, 3:22 PM

Test codecvts with char8_t, too. Deal with Apple back-deployment and properly mark the test with XFAIL.

Harbormaster completed remote builds in B212884: Diff 496223. Feb 9 2023, 1:26 PM

Non-ASCII chars.

fix for windows

Harbormaster completed remote builds in B212890: Diff 496235. Feb 9 2023, 3:39 PM

Sorry for the late review, but I was quite busy last week.
Thanks a lot for the fixes. I really like the additional unit tests!

Several minor issues with the patch.

libcxx/test/std/localization/codecvt_unicode.pass.cpp
22	Please have one declaration per line.
35	Can you use `std::array` instead? This is available on all platforms where we support C++03.
71	What is the difference between this part of the test and the one on line 52? Please add some comments.
112–116	I think this order of the test improves readability, same for the other tests. Now the "too small buffer" gradually grows to the proper size and then we make the "input too small". Maybe even more readable would be to have a test case where the 3rd CP exactly fits in the output.
149	I think it would be good to test a few more corner cases in this test: input values in the surrogate range (U+D800 to U+DBFF and U+DC00 to U+DFFF), and values outside the valid range (> U+10FFFF).
176	What's the difference between an ASCII byte and an invalid byte? Both are just invalid due to not having the bit pattern `10xxxxxx`, right?
707	Can you make sure all these blocks have comments? The tests are not too easy to read; without comments I really have a hard time validating the test. Especially since you use the surrogate values here: are you testing that the surrogate values fail, or that the input is malformed in other ways?
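The two corner cases requested for line 149 can be written down as a predicate. This is a minimal sketch, assuming a hypothetical helper `is_scalar_value` that is not part of the patch: a decoder should never produce a value for which it returns false, whether the value lies in the surrogate range or above U+10FFFF.

```cpp
#include <cassert>

// Hypothetical helper: true for Unicode scalar values, i.e. code points
// up to U+10FFFF that are not UTF-16 surrogates.
constexpr bool is_scalar_value(char32_t cp) {
  return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Boundaries of the surrogate range.
static_assert(is_scalar_value(0xD7FF), "last code point before the surrogates");
static_assert(!is_scalar_value(0xD800), "first high surrogate");
static_assert(!is_scalar_value(0xDBFF), "last high surrogate");
static_assert(!is_scalar_value(0xDC00), "first low surrogate");
static_assert(!is_scalar_value(0xDFFF), "last low surrogate");
static_assert(is_scalar_value(0xE000), "first code point after the surrogates");

// Upper end of the code space.
static_assert(is_scalar_value(0x10FFFF), "largest valid code point");
static_assert(!is_scalar_value(0x110000), "first value outside the valid range");
```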

dimztimz added inline comments. Feb 13 2023, 4:16 AM

libcxx/test/std/localization/codecvt_unicode.pass.cpp
35	`std::array` is not a good fit in this case for three reasons: (1) There is no inference of size. (2) It does not play as well with string literals. (3) Most importantly, in C++03 the member function `size()` is not constexpr.
71	This one calls with the full out-buffer, see below: `out, std::end(out)`.
112–116	I think this is subjective; you can give arguments for a few different orderings.
176	Well, in this test case there is no difference. But in general, if your aim is to fully decode a UTF-8 string, then all valid sequences must be treated as valid, and any erroneous bytes between them should be skipped, replaced with a replacement character, or reported upwards in the call chain (or some combination of these). The ASCII byte breaks the original sequence but starts a new, shorter valid sequence. To reach it, once you receive an error, you can advance your input pointer by one and call `in()` again to check whether there is another valid sequence further along in the string.
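The recovery strategy described for line 176 (advance the input pointer by one after an error and call `in()` again) could look roughly like the sketch below. It is an illustration under assumptions, not code from the patch: `decode_lossy` is a hypothetical name, the facet is taken from the classic locale, and each offending byte is replaced with U+FFFD. Note that `std::codecvt<char32_t, char, std::mbstate_t>` is deprecated in C++20, so some toolchains may warn.

```cpp
#include <cassert>
#include <cwchar>
#include <locale>
#include <string>

// Hypothetical helper: decode UTF-8 to UTF-32, inserting U+FFFD for every
// byte at which in() reports an error, then resuming one byte later.
std::u32string decode_lossy(const std::string& s) {
  typedef std::codecvt<char32_t, char, std::mbstate_t> cvt_t;
  const cvt_t& cvt = std::use_facet<cvt_t>(std::locale::classic());

  std::u32string out;
  const char* from = s.data();
  const char* const from_end = from + s.size();
  while (from != from_end) {
    char32_t buf[16];
    std::mbstate_t state = std::mbstate_t();
    const char* from_next = from;
    char32_t* to_next = buf;
    const std::codecvt_base::result res =
        cvt.in(state, from, from_end, from_next, buf, buf + 16, to_next);
    out.append(buf, to_next); // keep whatever was decoded before stopping
    from = from_next;
    if (res == std::codecvt_base::error) {
      out.push_back(U'\uFFFD');
      ++from; // skip the offending byte and try again
    } else if (res == std::codecvt_base::partial && to_next == buf) {
      // No progress with an empty output buffer: the input ends in a
      // truncated sequence. Replace it once and stop.
      out.push_back(U'\uFFFD');
      break;
    }
  }
  return out;
}
```

For example, `decode_lossy("a\xFFz")` keeps `a` and `z` and replaces the byte 0xFF, which can never appear in UTF-8, with U+FFFD.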

Add more tests and comments

dimztimz marked 6 inline comments as done. Feb 17 2023, 11:19 AM

The changes look good to me. It took me a while to convince myself that the intended behavioral change is correct, but I eventually concluded that the changes match the intent in [locale.codecvt.virtuals]p5 (http://eel.is/c++draft/locale.codecvt#virtuals-5).

The tests and test methodology likewise look good to me. One suggestion to consider: In my own testing, I like to test the boundaries of each valid encoding range (see https://github.com/tahonermann/text_view/blob/master/test/test-encodings.cpp#L1333-L1357) to ensure coverage for all well-formed code unit sequences. Likewise, it can be useful to exercise that an error is produced for ill-formed code unit sequences just outside each of those boundaries.
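To make the boundary suggestion concrete, here is a minimal sketch of a table of code points on each side of every UTF-8 length boundary. It is not taken from the linked text_view tests or from this patch; `utf8_length`, `boundary`, `boundaries` and `boundaries_match` are hypothetical names, and a length of 0 stands for "must be rejected".

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of UTF-8 code units needed for cp,
// or 0 if cp is a surrogate or lies above U+10FFFF.
constexpr std::size_t utf8_length(char32_t cp) {
  return cp < 0x80                          ? 1
         : cp < 0x800                       ? 2
         : (cp >= 0xD800 && cp <= 0xDFFF)   ? 0
         : cp < 0x10000                     ? 3
         : cp <= 0x10FFFF                   ? 4
                                            : 0;
}

struct boundary {
  char32_t cp;
  std::size_t expected_length;
};

// First and last code point of each range, plus the values just outside.
const boundary boundaries[] = {
    {0x0000, 1},  {0x007F, 1},   // one code unit
    {0x0080, 2},  {0x07FF, 2},   // two code units
    {0x0800, 3},  {0xD7FF, 3},   // three code units, below the surrogates
    {0xD800, 0},  {0xDFFF, 0},   // surrogates: ill-formed
    {0xE000, 3},  {0xFFFF, 3},   // three code units, above the surrogates
    {0x10000, 4}, {0x10FFFF, 4}, // four code units
    {0x110000, 0},               // beyond the code space: ill-formed
};

// Returns true if every table entry agrees with utf8_length.
bool boundaries_match() {
  for (const boundary& b : boundaries)
    if (utf8_length(b.cp) != b.expected_length)
      return false;
  return true;
}
```

A test driver could feed the encoded form of each entry to `in()` and expect `ok` for the non-zero rows and `error` for the others.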

Harbormaster completed remote builds in B214470: Diff 498456. Feb 17 2023, 4:08 PM

I like @tahonermann's suggestion to test the edge cases.

libcxx/test/std/localization/codecvt_unicode.pass.cpp
112–116	I agree it's subjective, it's just what feels easier for me. I'm concerned that the test is not easy to understand, even with the comments. I'm aware of the problem domain and what you are testing. Even with that knowledge I had issues understanding the test. So I fear it will be worse for people not too familiar with UTF-8 encoding.
176	Fair point. I think it would be good to mention the ASCII byte is a valid one code point code unit, since that is what actually matters. The test would give the same result when the code unit was the start of a multibyte code unit, right? (Except then the next code unit might be invalid again.)

In D143349#4137640, @Mordante wrote:

I like @tahonermann's suggestion to test the edge cases.

That can be done as a separate patch after this one gets accepted. One has to think about how to incorporate that testing framework into this one; there is no straightforward way. That takes time. This testsuite is pretty comprehensive on its own. We should massage this one until it's ready to be merged, and after that larger changes can be done.

libcxx/test/std/localization/codecvt_unicode.pass.cpp
112–116	The problem lies in the specification for `std::codecvt`: it is underspecified and hard to understand. Everyone will have the same hard time and there is no way around it. One has to reread the specs multiple times and after that the tests should be easier to read. Maybe I can add more comments here, what do you think? Or do you want me to change the order of the test cases?
176	I did not understand you here.

That can be done as a separate patch after this one gets accepted. One has to think about how to incorporate that testing framework into this one; there is no straightforward way. That takes time. This testsuite is pretty comprehensive on its own. We should massage this one until it's ready to be merged, and after that larger changes can be done.

That rationale produces a different response for me. Changing testing frameworks is tricky as it is easy to inadvertently lose coverage in the process. I see that as reason to design the testing framework to suit the eventual needs (when known) up front.

In D143349#4142982, @tahonermann wrote:

That can be done as a separate patch after this one gets accepted. One has to think about how to incorporate that testing framework into this one; there is no straightforward way. That takes time. This testsuite is pretty comprehensive on its own. We should massage this one until it's ready to be merged, and after that larger changes can be done.

That rationale produces a different response for me. Changing testing frameworks is tricky as it is easy to inadvertently lose coverage in the process. I see that as reason to design the testing framework to suit the eventual needs (when known) up front.

I find your concerns completely unjustified. You can always send a patch with tests in a completely separate file. I encourage you to do it. I can't do your work.

This patch is supposed to be a bugfix first, and a testsuite second, and a pretty good one too.

In D143349#4143309, @dimztimz wrote:

I find your concerns completely unjustified.

You are under no obligation to agree with them.

You can always send a patch with tests in a completely separate file. I encourage you to do it. I can't do your work.

I'm not sure what you are attributing as being "my work", nor why you would consider it my obligation. Code review is motivated by a desire to maximize quality. If you think a suggestion is a bad idea, out of scope, something you don't have time for or just don't want to do, that is certainly ok.

This patch is supposed to be a bugfix first, and a testsuite second, and a pretty good one too.

Indeed, and thank you for it. The bug that you have proposed a fix for might have been avoided had more extensive test coverage been in place. I presume you are a user of these interfaces (I am not) and therefore have a desire for them to work reliably. My suggestion was motivated to help fill in additional testing gaps using the infrastructure you are now offering in the hopes that doing so would identify additional defects or prevent regressions in the future (which I presume you would benefit from). If you don't find that motivating, that is ok. The libc++ maintainers can (and will) determine if they are sufficiently motivated to accept any future burden of maintaining and/or improving what you have offered or whether they would like additional changes first before accepting (I am not a libc++ maintainer).

More comments. Test for surrogates in UTF-32.

dimztimz marked 3 inline comments as done. Mar 1 2023, 11:23 AM

dimztimz added inline comments.

libcxx/test/std/localization/codecvt_unicode.pass.cpp
176	I added additional comments here with my latest patch and I think it explains the situation much better.

Harbormaster completed remote builds in B216784: Diff 501599. Mar 1 2023, 11:33 AM

Someone should process the issue on GitHub; it's still sitting there tagged as a new issue: https://github.com/llvm/llvm-project/issues/60177

Improve surrogate test for UTF-32

Harbormaster completed remote builds in B216850: Diff 501686. Mar 1 2023, 5:18 PM

I added some minor suggested edits, but otherwise, I think this is fine to accept.

libcxx/test/std/localization/codecvt_unicode.pass.cpp
46	This depends on the ordinary literal encoding being UTF-8 and that is not guaranteed (note that people are working on Clang's support for non-ASCII based operating systems). The suggested edit avoids that dependency.
100, 158, 174, 301, 336, 387, 454, 507, 570, 713, 748, 806, 889, 942, 991, 1145, 1180, 1225	(further inline entries with no comment text)
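The point about the ordinary literal encoding at line 46 can be shown in isolation. The sketch below reuses the replacement literal from the suggested edit; the `static_assert` and the comments are additions for illustration. With `\x` escapes the bytes are fixed by the source text, whereas `\u` and `\U` escapes are converted to whatever the ordinary literal encoding happens to be.

```cpp
#include <cassert>
#include <cstddef>

// U+0062, U+0448, U+AAAA and U+10AAAA spelled as explicit UTF-8 code units,
// so the array contents do not depend on the ordinary literal encoding.
const unsigned char input[] = "b"                 // 1 code unit
                              "\xD1\x88"          // 2 code units: U+0448
                              "\xEA\xAA\xAA"      // 3 code units: U+AAAA
                              "\xF4\x8A\xAA\xAA"; // 4 code units: U+10AAAA

// 1 + 2 + 3 + 4 code units plus the terminating NUL.
static_assert(sizeof(input) == 11, "unexpected number of code units");
```

The literal is split into adjacent pieces so that no hex escape can absorb a following character that happens to be a hex digit.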

@dimztimz do you need anything in order to make progress on this? It looks like there's a few comments to address, then this can be rebased and seems like folks were happy with the patch.

dimztimz abandoned this revision. Oct 1 2023, 4:34 AM

Note this was moved to https://github.com/llvm/llvm-project/pull/68442

Revision Contents

Path	Size
libcxx/src/locale.cpp	72 lines
libcxx/test/std/localization/codecvt_unicode.pass.cpp	1381 lines

Diff 501686

libcxx/src/locale.cpp

[2,015 unchanged lines hidden]
@@ for (; frm_nxt < frm_end && to_nxt < to_end; ++to_nxt)
             return codecvt_base::error;
         uint16_t t = static_cast<uint16_t>(((c1 & 0x1F) << 6) | (c2 & 0x3F));
         if (t > Maxcode)
             return codecvt_base::error;
         *to_nxt = t;
         frm_nxt += 2;
     }
     else if (c1 < 0xF0)
     {
-        if (frm_end-frm_nxt < 3)
+        if (frm_end-frm_nxt < 2)
             return codecvt_base::partial;
         uint8_t c2 = frm_nxt[1];
-        uint8_t c3 = frm_nxt[2];
         switch (c1)
         {
         case 0xE0:
             if ((c2 & 0xE0) != 0xA0)
                 return codecvt_base::error;
             break;
         case 0xED:
             if ((c2 & 0xE0) != 0x80)
                 return codecvt_base::error;
             break;
         default:
             if ((c2 & 0xC0) != 0x80)
                 return codecvt_base::error;
             break;
         }
+        if (frm_end-frm_nxt < 3)
+            return codecvt_base::partial;
+        uint8_t c3 = frm_nxt[2];
         if ((c3 & 0xC0) != 0x80)
             return codecvt_base::error;
         uint16_t t = static_cast<uint16_t>(((c1 & 0x0F) << 12)
                                          | ((c2 & 0x3F) << 6)
                                          | (c3 & 0x3F));
         if (t > Maxcode)
             return codecvt_base::error;
         *to_nxt = t;
         frm_nxt += 3;
     }
     else if (c1 < 0xF5)
     {
-        if (frm_end-frm_nxt < 4)
+        if (frm_end-frm_nxt < 2)
             return codecvt_base::partial;
         uint8_t c2 = frm_nxt[1];
-        uint8_t c3 = frm_nxt[2];
-        uint8_t c4 = frm_nxt[3];
         switch (c1)
         {
         case 0xF0:
             if (!(0x90 <= c2 && c2 <= 0xBF))
                 return codecvt_base::error;
             break;
         case 0xF4:
             if ((c2 & 0xF0) != 0x80)
                 return codecvt_base::error;
             break;
         default:
             if ((c2 & 0xC0) != 0x80)
                 return codecvt_base::error;
             break;
         }
-        if ((c3 & 0xC0) != 0x80 || (c4 & 0xC0) != 0x80)
+        if (frm_end-frm_nxt < 3)
+            return codecvt_base::partial;
+        uint8_t c3 = frm_nxt[2];
+        if ((c3 & 0xC0) != 0x80)
+            return codecvt_base::error;
+        if (frm_end-frm_nxt < 4)
+            return codecvt_base::partial;
+        uint8_t c4 = frm_nxt[3];
+        if ((c4 & 0xC0) != 0x80)
             return codecvt_base::error;
         if (to_end-to_nxt < 2)
             return codecvt_base::partial;
         if ((((c1 & 7UL) << 18) +
             ((c2 & 0x3FUL) << 12) +
             ((c3 & 0x3FUL) << 6) + (c4 & 0x3F)) > Maxcode)
             return codecvt_base::error;
         *to_nxt = static_cast<uint16_t>(
                 0xD800
[52 unchanged lines hidden]
@@ for (; frm_nxt < frm_end && to_nxt < to_end; ++to_nxt)
         uint16_t t = static_cast<uint16_t>(((c1 & 0x1F) << 6) | (c2 & 0x3F));
         if (t > Maxcode)
             return codecvt_base::error;
         *to_nxt = static_cast<uint32_t>(t);
         frm_nxt += 2;
     }
     else if (c1 < 0xF0)
     {
-        if (frm_end-frm_nxt < 3)
+        if (frm_end-frm_nxt < 2)
             return codecvt_base::partial;
         uint8_t c2 = frm_nxt[1];
-        uint8_t c3 = frm_nxt[2];
         switch (c1)
         {
         case 0xE0:
             if ((c2 & 0xE0) != 0xA0)
                 return codecvt_base::error;
             break;
         case 0xED:
             if ((c2 & 0xE0) != 0x80)
                 return codecvt_base::error;
             break;
         default:
             if ((c2 & 0xC0) != 0x80)
                 return codecvt_base::error;
             break;
         }
+        if (frm_end-frm_nxt < 3)
+            return codecvt_base::partial;
+        uint8_t c3 = frm_nxt[2];
         if ((c3 & 0xC0) != 0x80)
             return codecvt_base::error;
         uint16_t t = static_cast<uint16_t>(((c1 & 0x0F) << 12)
                                          | ((c2 & 0x3F) << 6)
                                          | (c3 & 0x3F));
         if (t > Maxcode)
             return codecvt_base::error;
         *to_nxt = static_cast<uint32_t>(t);
         frm_nxt += 3;
     }
     else if (c1 < 0xF5)
     {
-        if (frm_end-frm_nxt < 4)
+        if (frm_end-frm_nxt < 2)
             return codecvt_base::partial;
         uint8_t c2 = frm_nxt[1];
-        uint8_t c3 = frm_nxt[2];
-        uint8_t c4 = frm_nxt[3];
         switch (c1)
         {
         case 0xF0:
             if (!(0x90 <= c2 && c2 <= 0xBF))
                 return codecvt_base::error;
             break;
         case 0xF4:
             if ((c2 & 0xF0) != 0x80)
                 return codecvt_base::error;
             break;
         default:
             if ((c2 & 0xC0) != 0x80)
                 return codecvt_base::error;
             break;
         }
-        if ((c3 & 0xC0) != 0x80 || (c4 & 0xC0) != 0x80)
+        if (frm_end-frm_nxt < 3)
+            return codecvt_base::partial;
+        uint8_t c3 = frm_nxt[2];
+        if ((c3 & 0xC0) != 0x80)
+            return codecvt_base::error;
+        if (frm_end-frm_nxt < 4)
+            return codecvt_base::partial;
+        uint8_t c4 = frm_nxt[3];
+        if ((c4 & 0xC0) != 0x80)
             return codecvt_base::error;
         if (to_end-to_nxt < 2)
             return codecvt_base::partial;
         if ((((c1 & 7UL) << 18) +
             ((c2 & 0x3FUL) << 12) +
             ((c3 & 0x3FUL) << 6) + (c4 & 0x3F)) > Maxcode)
             return codecvt_base::error;
         *to_nxt = static_cast<uint32_t>(
                 0xD800
[209 unchanged lines hidden]
@@ for (; frm_nxt < frm_end && to_nxt < to_end; ++to_nxt)
                                          | (c2 & 0x3F));
         if (t > Maxcode)
             return codecvt_base::error;
         *to_nxt = t;
         frm_nxt += 2;
     }
     else if (c1 < 0xF0)
     {
-        if (frm_end-frm_nxt < 3)
+        if (frm_end-frm_nxt < 2)
             return codecvt_base::partial;
         uint8_t c2 = frm_nxt[1];
-        uint8_t c3 = frm_nxt[2];
         switch (c1)
         {
         case 0xE0:
             if ((c2 & 0xE0) != 0xA0)
                 return codecvt_base::error;
             break;
         case 0xED:
             if ((c2 & 0xE0) != 0x80)
                 return codecvt_base::error;
             break;
         default:
             if ((c2 & 0xC0) != 0x80)
                 return codecvt_base::error;
             break;
         }
+        if (frm_end-frm_nxt < 3)
+            return codecvt_base::partial;
+        uint8_t c3 = frm_nxt[2];
         if ((c3 & 0xC0) != 0x80)
             return codecvt_base::error;
         uint32_t t = static_cast<uint32_t>(((c1 & 0x0F) << 12)
                                          | ((c2 & 0x3F) << 6)
                                          | (c3 & 0x3F));
         if (t > Maxcode)
             return codecvt_base::error;
         *to_nxt = t;
         frm_nxt += 3;
     }
     else if (c1 < 0xF5)
     {
-        if (frm_end-frm_nxt < 4)
+        if (frm_end-frm_nxt < 2)
             return codecvt_base::partial;
         uint8_t c2 = frm_nxt[1];
-        uint8_t c3 = frm_nxt[2];
-        uint8_t c4 = frm_nxt[3];
         switch (c1)
         {
         case 0xF0:
             if (!(0x90 <= c2 && c2 <= 0xBF))
                 return codecvt_base::error;
             break;
         case 0xF4:
             if ((c2 & 0xF0) != 0x80)
                 return codecvt_base::error;
             break;
         default:
             if ((c2 & 0xC0) != 0x80)
                 return codecvt_base::error;
             break;
         }
-        if ((c3 & 0xC0) != 0x80 || (c4 & 0xC0) != 0x80)
+        if (frm_end-frm_nxt < 3)
+            return codecvt_base::partial;
+        uint8_t c3 = frm_nxt[2];
+        if ((c3 & 0xC0) != 0x80)
+            return codecvt_base::error;
+        if (frm_end-frm_nxt < 4)
+            return codecvt_base::partial;
+        uint8_t c4 = frm_nxt[3];
+        if ((c4 & 0xC0) != 0x80)
             return codecvt_base::error;
         uint32_t t = static_cast<uint32_t>(((c1 & 0x07) << 18)
                                          | ((c2 & 0x3F) << 12)
                                          | ((c3 & 0x3F) << 6)
                                          | (c4 & 0x3F));
         if (t > Maxcode)
             return codecvt_base::error;
         *to_nxt = t;
         frm_nxt += 4;
[189 unchanged lines hidden]
@@ for (; frm_nxt < frm_end && to_nxt < to_end; ++to_nxt)
                                          | (c2 & 0x3F));
         if (t > Maxcode)
             return codecvt_base::error;
         *to_nxt = t;
         frm_nxt += 2;
     }
     else if (c1 < 0xF0)
     {
-        if (frm_end-frm_nxt < 3)
+        if (frm_end-frm_nxt < 2)
             return codecvt_base::partial;
         uint8_t c2 = frm_nxt[1];
-        uint8_t c3 = frm_nxt[2];
         switch (c1)
         {
         case 0xE0:
             if ((c2 & 0xE0) != 0xA0)
                 return codecvt_base::error;
             break;
         case 0xED:
             if ((c2 & 0xE0) != 0x80)
                 return codecvt_base::error;
             break;
         default:
             if ((c2 & 0xC0) != 0x80)
                 return codecvt_base::error;
             break;
         }
+        if (frm_end-frm_nxt < 3)
+            return codecvt_base::partial;
+        uint8_t c3 = frm_nxt[2];
         if ((c3 & 0xC0) != 0x80)
             return codecvt_base::error;
         uint16_t t = static_cast<uint16_t>(((c1 & 0x0F) << 12)
                                          | ((c2 & 0x3F) << 6)
                                          | (c3 & 0x3F));
         if (t > Maxcode)
             return codecvt_base::error;
         *to_nxt = t;
[3,898 unchanged lines hidden]

libcxx/test/std/localization/codecvt_unicode.pass.cpp

This file was added.

//===----------------------------------------------------------------------===//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//===----------------------------------------------------------------------===//

// ADDITIONAL_COMPILE_FLAGS: -D_LIBCPP_DISABLE_DEPRECATION_WARNINGS

// XFAIL: use_system_cxx_lib && target={{.+}}-apple-macosx{{10.9|10.10|10.11|10.12|10.13|10.14|10.15|11.0|12.0|13.0}}

#include <algorithm>
#include <locale>
#include <codecvt>
#include <string>
#include <cassert>

#include "test_macros.h"

struct test_offsets_ok {
  size_t in_size;
  size_t out_size;
};

struct test_offsets_partial {
  size_t in_size;
  size_t out_size;
  size_t expected_in_next;
  size_t expected_out_next;
};

template <class CharT>
struct test_offsets_error {
  size_t in_size;
  size_t out_size;
  size_t expected_in_next;
  size_t expected_out_next;
  CharT replace_char;
  size_t replace_pos;
};

#define array_size(x) (sizeof(x) / sizeof(x)[0])

template <class InternT, class ExternT>
void utf8_to_utf32_in_ok(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {
  // UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP
  const unsigned char input[] = "b\u0448\uAAAA\U0010AAAA";

tahonermann (inline comment, Not Done), suggested edit:

    // UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP
  - const unsigned char input[] = "b\u0448\uAAAA\U0010AAAA";
  + const unsigned char input[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";
    const char32_t expected[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};

This depends on the ordinary literal encoding being UTF-8 and that is not guaranteed (note that people are working on Clang's support for non-ASCII based operating systems). The suggested edit avoids that dependency.

  const char32_t expected[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};
  static_assert(array_size(input) == 11, "");
  static_assert(array_size(expected) == 5, "");

  ExternT in[array_size(input)];
  InternT exp[array_size(expected)];
  std::copy(std::begin(input), std::end(input), std::begin(in));
  std::copy(std::begin(expected), std::end(expected), std::begin(exp));
  assert(std::char_traits<ExternT>::length(in) == 10);
  assert(std::char_traits<InternT>::length(exp) == 4);
  test_offsets_ok offsets[] = {{0, 0}, {1, 1}, {3, 2}, {6, 3}, {10, 4}};

  for (auto t : offsets) {
    InternT out[array_size(exp) - 1] = {};
    assert(t.in_size <= array_size(in));
    assert(t.out_size <= array_size(out));
    mbstate_t state = {};
    const ExternT* in_next = nullptr;
    InternT* out_next = nullptr;
    std::codecvt_base::result res = std::codecvt_base::ok;

    res = cvt.in(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);
    assert(res == cvt.ok);
    assert(in_next == in + t.in_size);
    assert(out_next == out + t.out_size);
    assert(std::char_traits<InternT>::compare(out, exp, t.out_size) == 0);
    if (t.out_size < array_size(out))
      assert(out[t.out_size] == 0);
  }

  // Similar tests to above, but we always pass the full output buffer
  for (auto t : offsets) {
    InternT out[array_size(exp)] = {};
    assert(t.in_size <= array_size(in));
    assert(t.out_size <= array_size(out));
    mbstate_t state = {};
    const ExternT* in_next = nullptr;
    InternT* out_next = nullptr;
    std::codecvt_base::result res = std::codecvt_base::ok;

    res = cvt.in(state, in, in + t.in_size, in_next, out, std::end(out), out_next);
    assert(res == cvt.ok);
    assert(in_next == in + t.in_size);
    assert(out_next == out + t.out_size);
    assert(std::char_traits<InternT>::compare(out, exp, t.out_size) == 0);
    if (t.out_size < array_size(out))
      assert(out[t.out_size] == 0);
  }
}

template <class InternT, class ExternT>
void utf8_to_utf32_in_partial(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {
  // UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP
  const unsigned char input[] = "b\u0448\uAAAA\U0010AAAA";

tahonermann (inline comment, Not Done), suggested edit:

    // UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP
  - const unsigned char input[] = "b\u0448\uAAAA\U0010AAAA";
  + const unsigned char input[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";
    const char32_t expected[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};

  const char32_t expected[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};
  static_assert(array_size(input) == 11, "");
  static_assert(array_size(expected) == 5, "");

  ExternT in[array_size(input)];
  InternT exp[array_size(expected)];
  std::copy(std::begin(input), std::end(input), std::begin(in));
  std::copy(std::begin(expected), std::end(expected), std::begin(exp));
  assert(std::char_traits<ExternT>::length(in) == 10);
  assert(std::char_traits<InternT>::length(exp) == 4);

  test_offsets_partial offsets[] = {
      {1, 0, 0, 0}, // no space for first CP

      {3, 1, 1, 1}, // no space for second CP
      {2, 2, 1, 1}, // incomplete second CP

MordanteUnsubmitted

Done

{2, 1, 1, 1}, // incomplete second CP, and no space for it

- {6, 2, 3, 2}, // no space for third CP

- {4, 3, 3, 2}, // incomplete third CP

- {5, 3, 3, 2}, // incomplete third CP

{4, 2, 3, 2}, // incomplete third CP, and no space for it

{5, 2, 3, 2}, // incomplete third CP, and no space for it

+ {6, 2, 3, 2}, // no space for third CP

+ {5, 3, 3, 2}, // incomplete third CP

+ {4, 3, 3, 2}, // incomplete third CP

{10, 3, 6, 3}, // no space for fourth CP

I think this order of the tests improves readability; the same applies to the other tests.

Now the "too small buffer" gradually grows to the proper size, and then we make the "input too small".

Maybe even more readable would be to have a test case where the 3rd CP exactly fits in the output.


dimztimzAuthorUnsubmitted

Done

I think this is subjective; you can give arguments for a few different orderings.


MordanteUnsubmitted

Done

I agree it's subjective, it's just what feels easier for me.

I'm concerned that the test is not easy to understand, even with the comments.

I'm aware of the problem domain and what you are testing. Even with that knowledge I had issues understanding the test. So I fear it will be worse for people not too familiar with UTF-8 encoding.
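For readers less familiar with UTF-8, a minimal sketch of how the four test code points map to bytes may help. `encode_utf8` is a hypothetical helper written only for this illustration; it is not part of the patch, and it assumes its argument is a valid Unicode scalar value.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the patch): encode one scalar value as UTF-8.
inline std::string encode_utf8(char32_t cp) {
  // Precondition: not a surrogate and not above U+10FFFF.
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
  std::string s;
  if (cp < 0x80) {                       // 1 byte:  0xxxxxxx
    s += static_cast<char>(cp);
  } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
    s += static_cast<char>(0xC0 | (cp >> 6));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    s += static_cast<char>(0xE0 | (cp >> 12));
    s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  } else {                               // 4 bytes: 11110xxx followed by three 10xxxxxx
    s += static_cast<char>(0xF0 | (cp >> 18));
    s += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    s += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    s += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return s;
}
```

The four code points used in the tests ('b', U+0448, U+AAAA, U+10AAAA) encode to 1, 2, 3 and 4 bytes, which is where the cumulative input offsets 1, 3, 6 and 10 in the offset tables come from.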


dimztimzAuthorUnsubmitted

Done

The problem lies in the specification of std::codecvt: it is underspecified and hard to understand. Everyone will have the same hard time and there is no way around it. One has to reread the spec multiple times, and after that the tests should be easier to read.

Maybe I can add more comments here, what do you think? Or do you want me to change the order of the test cases?
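One possible way to read the offset rows, as a hedged sketch: assume each row is `{in_size, out_size, expected_in_next, expected_out_next}` (the field names the test loop uses). The hypothetical helper `expected_partial` below recomputes the last two fields from the first two for the UTF-8 to UTF-32 case. It is not part of the patch and only models the "consume whole code points" rule.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of one row of the patch's partial-offset table.
struct partial_offsets {
  std::size_t in_size;
  std::size_t out_size;
  std::size_t expected_in_next;
  std::size_t expected_out_next;
};

// Illustrative model: the converter consumes a code point only if it is
// completely visible in the input AND there is room for it in the output.
inline partial_offsets expected_partial(std::size_t in_size, std::size_t out_size) {
  const std::size_t cp_bytes[] = {1, 2, 3, 4}; // 'b', U+0448, U+AAAA, U+10AAAA
  std::size_t in_next = 0;
  std::size_t out_next = 0;
  for (std::size_t len : cp_bytes) {
    if (in_next + len > in_size)
      break; // incomplete code point
    if (out_next + 1 > out_size)
      break; // no space for it (one UTF-32 code unit per code point)
    in_next += len;
    ++out_next;
  }
  // Same invariants the test loop asserts on every row.
  assert(in_next <= in_size && out_next <= out_size);
  return {in_size, out_size, in_next, out_next};
}
```

For example, `expected_partial(6, 2)` stops before the third code point because the output is full, while `expected_partial(4, 3)` stops at the same place because only one of its three bytes is visible.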


{2, 1, 1, 1}, // incomplete second CP, and no space for it

{6, 2, 3, 2}, // no space for third CP

{4, 3, 3, 2}, // incomplete third CP

{5, 3, 3, 2}, // incomplete third CP

{4, 2, 3, 2}, // incomplete third CP, and no space for it

{5, 2, 3, 2}, // incomplete third CP, and no space for it

{10, 3, 6, 3}, // no space for fourth CP

{7, 4, 6, 3}, // incomplete fourth CP

{8, 4, 6, 3}, // incomplete fourth CP

{9, 4, 6, 3}, // incomplete fourth CP

{7, 3, 6, 3}, // incomplete fourth CP, and no space for it

{8, 3, 6, 3}, // incomplete fourth CP, and no space for it

{9, 3, 6, 3}, // incomplete fourth CP, and no space for it

};

for (auto t : offsets) {

InternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.partial);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<InternT>::compare(out, exp, t.expected_out_next) == 0);

MordanteUnsubmitted

Done

I think it would be good to test a few more corner cases in this test.

- input values in the surrogate range (U+D800 to U+DBFF and U+DC00 to U+DFFF)
- outside the valid range > U+10FFFF
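The two corner cases can be stated as a small predicate. This is only a sketch; `is_surrogate` and `is_scalar_value` are hypothetical names and do not appear in the patch.

```cpp
// Hypothetical helpers for illustration; the patch does not define them.
constexpr bool is_surrogate(char32_t cp) { return cp >= 0xD800 && cp <= 0xDFFF; }

// A Unicode scalar value is any code point up to U+10FFFF that is not a surrogate.
constexpr bool is_scalar_value(char32_t cp) { return cp <= 0x10FFFF && !is_surrogate(cp); }

// Boundary values on both sides of each invalid range.
static_assert(is_scalar_value(0xD7FF) && !is_scalar_value(0xD800), "");
static_assert(!is_scalar_value(0xDFFF) && is_scalar_value(0xE000), "");
static_assert(is_scalar_value(0x10FFFF) && !is_scalar_value(0x110000), "");
```

Under this reading, a decoder should report error for any input that would decode to a value for which `is_scalar_value` is false.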


if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

}

}

template <class InternT, class ExternT>

void utf8_to_utf32_in_error(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const unsigned char input[] = "b\u0448\uD700\U0010AAAA";

tahonermannUnsubmitted

Not Done

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

- const unsigned char input[] = "b\u0448\uD700\U0010AAAA";

+ const unsigned char input[] = "b" "\xD1\x88" "\xED\x9C\x80" "\xF4\x8A\xAA\xAA";

const char32_t expected[] = {'b', 0x0448, 0xD700, 0x10AAAA, 0};


const char32_t expected[] = {'b', 0x0448, 0xD700, 0x10AAAA, 0};

static_assert(array_size(input) == 11, "");

static_assert(array_size(expected) == 5, "");

ExternT in[array_size(input)];

InternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<ExternT>::length(in) == 10);

assert(std::char_traits<InternT>::length(exp) == 4);

// There are 5 classes of errors in UTF-8 decoding

// 1. Missing leading byte

// 2. Missing trailing byte

// 3. Surrogate CP

// 4. Ovelong sequence

tahonermannUnsubmitted

Not Done

// 3. Surrogate CP

- // 4. Ovelong sequence

+ // 4. Overlong sequence

// 5. CP out of Unicode range


// 5. CP out of Unicode range

test_offsets_error<unsigned char> offsets[] = {

MordanteUnsubmitted

Done

What's the difference between an ASCII byte and an invalid byte?
Both are just invalid due to not having the bit pattern 10xxxxxx, right?


dimztimzAuthorUnsubmitted

Done

Well, in this test case there is no difference. But in general, if your aim is to fully decode a UTF-8 string, then all valid sequences must be treated as valid, and any erroneous bytes between them should be either skipped, replaced with a replacement character, or reported upwards in the call chain (or some combination of these). The ASCII byte breaks the original sequence but creates a new, smaller valid sequence. To reach it, once you receive error, you can advance your input pointer by one and call in() again to check whether there is another valid sequence further in the string.
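The recovery strategy described here can be sketched without relying on any particular codecvt implementation. Everything below is hypothetical and written for illustration: `decode_one` stands in for a single-code-point step of in(), and `decode_lossy` is the "emit a replacement character, advance by one byte, try again" loop. As a simplification, range checks run only once a sequence is complete.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class status { ok, partial, error };

// Hypothetical stand-in for one step of in(): decode one code point from [p, end).
inline status decode_one(const unsigned char* p, const unsigned char* end,
                         char32_t& cp, std::size_t& len) {
  assert(p < end); // precondition: at least one byte is visible
  const unsigned char b0 = p[0];
  if (b0 < 0x80) { cp = b0; len = 1; return status::ok; }
  char32_t lowest = 0; // smallest code point that needs this many bytes
  if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; lowest = 0x80; }
  else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; lowest = 0x800; }
  else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; lowest = 0x10000; }
  else return status::error; // trailing byte or always-invalid byte in lead position
  for (std::size_t i = 1; i < len; ++i) {
    if (p + i == end) return status::partial;         // every visible byte was fine
    if ((p[i] & 0xC0) != 0x80) return status::error;  // visible byte is malformed
    cp = (cp << 6) | (p[i] & 0x3F);
  }
  if (cp < lowest) return status::error;                  // overlong
  if (cp > 0x10FFFF) return status::error;                // out of Unicode range
  if (cp >= 0xD800 && cp <= 0xDFFF) return status::error; // surrogate
  return status::ok;
}

// On error: emit U+FFFD, skip exactly one byte, and retry from the next byte.
// A trailing incomplete sequence is left undecoded.
inline std::u32string decode_lossy(const std::string& bytes) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(bytes.data());
  const unsigned char* end = p + bytes.size();
  std::u32string out;
  while (p != end) {
    char32_t cp = 0;
    std::size_t len = 0;
    const status s = decode_one(p, end, cp, len);
    if (s == status::ok) { out.push_back(cp); p += len; }
    else if (s == status::error) { out.push_back(0xFFFD); ++p; }
    else break; // partial: wait for more input
  }
  return out;
}
```

With the input `"b\xD1zq"`, the lead byte 0xD1 is followed by 'z' instead of a trailing byte, so the sketch reports an error at 0xD1, skips it, and then decodes 'z' and 'q' as ordinary one-byte sequences.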


MordanteUnsubmitted

Done

Fair point. I think it would be good to mention that the ASCII byte is a valid single-code-unit code point, since that is what actually matters. The test would give the same result if the code unit were the start of a multibyte sequence, right? (Except then the next code unit might be invalid again.)
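To make the distinction between the replacement bytes concrete, here is a minimal sketch that classifies a single byte by its bit pattern. `byte_kind` and `classify` are hypothetical names, not part of the patch.

```cpp
enum class byte_kind { ascii, trailing, lead2, lead3, lead4, invalid };

// Hypothetical classifier based only on the high bits of one byte.
// Note: 0xC0, 0xC1 and 0xF5..0xF7 match a lead pattern here, but any sequence
// they start decodes to an overlong or out-of-range value.
constexpr byte_kind classify(unsigned char b) {
  return b < 0x80           ? byte_kind::ascii    // 0xxxxxxx
       : (b & 0xC0) == 0x80 ? byte_kind::trailing // 10xxxxxx
       : (b & 0xE0) == 0xC0 ? byte_kind::lead2    // 110xxxxx
       : (b & 0xF0) == 0xE0 ? byte_kind::lead3    // 1110xxxx
       : (b & 0xF8) == 0xF0 ? byte_kind::lead4    // 11110xxx
                            : byte_kind::invalid; // 0xF8..0xFF
}
```

Both 'z' and 0xFF fail the trailing-byte check, so either one breaks the sequence it is placed in; only 'z' is additionally the start (and end) of a valid sequence of its own.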


dimztimzAuthorUnsubmitted

Done

I did not understand you here.


dimztimzAuthorUnsubmitted

Done

I added additional comments here in my latest patch, and I think they explain the situation much better.


// 1. Missing leading byte. We will replace the leading byte with

// non-leading byte, such as a byte that is always invalid or a trailing

// byte.

// replace leading byte with invalid byte

{1, 4, 0, 0, 0xFF, 0},

{3, 4, 1, 1, 0xFF, 1},

{6, 4, 3, 2, 0xFF, 3},

{10, 4, 6, 3, 0xFF, 6},

// replace leading byte with trailing byte

{1, 4, 0, 0, 0b10101010, 0},

{3, 4, 1, 1, 0b10101010, 1},

{6, 4, 3, 2, 0b10101010, 3},

{10, 4, 6, 3, 0b10101010, 6},

// 2. Missing trailing byte. We will replace the trailing byte with

// non-trailing byte, such as a byte that is always invalid or a leading

// byte (simple ASCII byte in our case).

// replace first trailing byte with ASCII byte

{3, 4, 1, 1, 'z', 2},

{6, 4, 3, 2, 'z', 4},

{10, 4, 6, 3, 'z', 7},

// replace first trailing byte with invalid byte

{3, 4, 1, 1, 0xFF, 2},

{6, 4, 3, 2, 0xFF, 4},

{10, 4, 6, 3, 0xFF, 7},

// replace second trailing byte with ASCII byte

{6, 4, 3, 2, 'z', 5},

{10, 4, 6, 3, 'z', 8},

// replace second trailing byte with invalid byte

{6, 4, 3, 2, 0xFF, 5},

{10, 4, 6, 3, 0xFF, 8},

// replace third trailing byte

{10, 4, 6, 3, 'z', 9},

{10, 4, 6, 3, 0xFF, 9},

// 2.1 The following test-cases raise doubt whether error or partial should

// be returned. For example, we have 4-byte sequence with valid leading

// byte. If we hide the last byte we need to return partial. But, if the

// second or third byte, which are visible to the call to codecvt, are

// malformed then error should be returned.

// replace first trailing byte with ASCII byte, also incomplete at end

{5, 4, 3, 2, 'z', 4},

{8, 4, 6, 3, 'z', 7},

{9, 4, 6, 3, 'z', 7},

// replace first trailing byte with invalid byte, also incomplete at end

{5, 4, 3, 2, 0xFF, 4},

{8, 4, 6, 3, 0xFF, 7},

{9, 4, 6, 3, 0xFF, 7},

// replace second trailing byte with ASCII byte, also incomplete at end

{9, 4, 6, 3, 'z', 8},

// replace second trailing byte with invalid byte, also incomplete at end

{9, 4, 6, 3, 0xFF, 8},

// 3. Surrogate CP. We modify the second byte (first trailing) of the 3-byte

// CP U+D700

{6, 4, 3, 2, 0b10100000, 4}, // turn U+D700 into U+D800

{6, 4, 3, 2, 0b10101100, 4}, // turn U+D700 into U+DB00

{6, 4, 3, 2, 0b10110000, 4}, // turn U+D700 into U+DC00

{6, 4, 3, 2, 0b10111100, 4}, // turn U+D700 into U+DF00

// 4. Overlong sequence. The CPs in the input are chosen such that modifying

// just the leading byte is enough to make them overlong, i.e. for the

// 3-byte and 4-byte CP the second byte (first trailing) has enough leading

// zeroes.

{3, 4, 1, 1, 0b11000000, 1}, // make the 2-byte CP overlong

{3, 4, 1, 1, 0b11000001, 1}, // make the 2-byte CP overlong

{6, 4, 3, 2, 0b11100000, 3}, // make the 3-byte CP overlong

{10, 4, 6, 3, 0b11110000, 6}, // make the 4-byte CP overlong

// 5. CP above range

// turn U+10AAAA into U+14AAAA by changing its leading byte

{10, 4, 6, 3, 0b11110101, 6},

// turn U+10AAAA into U+11AAAA by changing its 2nd byte

{10, 4, 6, 3, 0b10011010, 7},

};

for (auto t : offsets) {

InternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

auto old_char = in[t.replace_pos];

in[t.replace_pos] = t.replace_char;

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.error);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<InternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

in[t.replace_pos] = old_char;

}

}

template <class InternT, class ExternT>

void utf8_to_utf32_in(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

utf8_to_utf32_in_ok(cvt);

utf8_to_utf32_in_partial(cvt);

utf8_to_utf32_in_error(cvt);

}

template <class InternT, class ExternT>

void utf32_to_utf8_out_ok(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const char32_t input[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};

const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

tahonermannUnsubmitted

Not Done

const char32_t input[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};

- const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

+ const unsigned char expected[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

static_assert(array_size(input) == 5, "");


static_assert(array_size(input) == 5, "");

static_assert(array_size(expected) == 11, "");

InternT in[array_size(input)];

ExternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<InternT>::length(in) == 4);

assert(std::char_traits<ExternT>::length(exp) == 10);

const test_offsets_ok offsets[] = {{0, 0}, {1, 1}, {2, 3}, {3, 6}, {4, 10}};

for (auto t : offsets) {

ExternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

mbstate_t state = {};

const InternT* in_next = nullptr;

ExternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.out(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.ok);

assert(in_next == in + t.in_size);

assert(out_next == out + t.out_size);

assert(std::char_traits<ExternT>::compare(out, exp, t.out_size) == 0);

if (t.out_size < array_size(out))

assert(out[t.out_size] == 0);

}

}

template <class InternT, class ExternT>

void utf32_to_utf8_out_partial(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const char32_t input[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};

const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

tahonermannUnsubmitted

Not Done

const char32_t input[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};

- const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

+ const unsigned char expected[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

static_assert(array_size(input) == 5, "");


static_assert(array_size(input) == 5, "");

static_assert(array_size(expected) == 11, "");

InternT in[array_size(input)];

ExternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<InternT>::length(in) == 4);

assert(std::char_traits<ExternT>::length(exp) == 10);

const test_offsets_partial offsets[] = {

{1, 0, 0, 0}, // no space for first CP

{2, 1, 1, 1}, // no space for second CP

{2, 2, 1, 1}, // no space for second CP

{3, 3, 2, 3}, // no space for third CP

{3, 4, 2, 3}, // no space for third CP

{3, 5, 2, 3}, // no space for third CP

{4, 6, 3, 6}, // no space for fourth CP

{4, 7, 3, 6}, // no space for fourth CP

{4, 8, 3, 6}, // no space for fourth CP

{4, 9, 3, 6}, // no space for fourth CP

};

for (auto t : offsets) {

ExternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

mbstate_t state = {};

const InternT* in_next = nullptr;

ExternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.out(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.partial);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<ExternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

}

}

template <class InternT, class ExternT>

void utf32_to_utf8_out_error(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const char32_t input[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};

const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

tahonermannUnsubmitted

Not Done

const char32_t input[] = {'b', 0x0448, 0xAAAA, 0x10AAAA, 0};

- const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

+ const unsigned char expected[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

static_assert(array_size(input) == 5, "");


static_assert(array_size(input) == 5, "");

static_assert(array_size(expected) == 11, "");

InternT in[array_size(input)];

ExternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<InternT>::length(in) == 4);

assert(std::char_traits<ExternT>::length(exp) == 10);

test_offsets_error<InternT> offsets[] = {

// Surrogate CP

{4, 10, 0, 0, 0xD800, 0},

{4, 10, 1, 1, 0xDBFF, 1},

{4, 10, 2, 3, 0xDC00, 2},

{4, 10, 3, 6, 0xDFFF, 3},

// CP out of range

{4, 10, 0, 0, 0x00110000, 0},

{4, 10, 1, 1, 0x00110000, 1},

{4, 10, 2, 3, 0x00110000, 2},

{4, 10, 3, 6, 0x00110000, 3}};

for (auto t : offsets) {

ExternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

auto old_char = in[t.replace_pos];

in[t.replace_pos] = t.replace_char;

mbstate_t state = {};

const InternT* in_next = nullptr;

ExternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.out(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.error);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<ExternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

in[t.replace_pos] = old_char;

}

}

template <class InternT, class ExternT>

void utf32_to_utf8_out(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

utf32_to_utf8_out_ok(cvt);

utf32_to_utf8_out_partial(cvt);

utf32_to_utf8_out_error(cvt);

}

template <class InternT, class ExternT>

void test_utf8_utf32_cvt(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

utf8_to_utf32_in(cvt);

utf32_to_utf8_out(cvt);

}

template <class InternT, class ExternT>

void utf8_to_utf16_in_ok(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const unsigned char input[] = "b\u0448\uAAAA\U0010AAAA";

tahonermannUnsubmitted

Not Done

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

- const unsigned char input[] = "b\u0448\uAAAA\U0010AAAA";

+ const unsigned char input[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

const char16_t expected[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};


const char16_t expected[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};

static_assert(array_size(input) == 11, "");

static_assert(array_size(expected) == 6, "");

ExternT in[array_size(input)];

InternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<ExternT>::length(in) == 10);

assert(std::char_traits<InternT>::length(exp) == 5);

test_offsets_ok offsets[] = {{0, 0}, {1, 1}, {3, 2}, {6, 3}, {10, 5}};

for (auto t : offsets) {

InternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.ok);

assert(in_next == in + t.in_size);

assert(out_next == out + t.out_size);

assert(std::char_traits<InternT>::compare(out, exp, t.out_size) == 0);

if (t.out_size < array_size(out))

assert(out[t.out_size] == 0);

}

for (auto t : offsets) {

InternT out[array_size(exp)] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, std::end(out), out_next);

assert(res == cvt.ok);

assert(in_next == in + t.in_size);

assert(out_next == out + t.out_size);

assert(std::char_traits<InternT>::compare(out, exp, t.out_size) == 0);

if (t.out_size < array_size(out))

assert(out[t.out_size] == 0);

}

}

template <class InternT, class ExternT>

void utf8_to_utf16_in_partial(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const unsigned char input[] = "b\u0448\uAAAA\U0010AAAA";

tahonermannUnsubmitted

Not Done

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

- const unsigned char input[] = "b\u0448\uAAAA\U0010AAAA";

+ const unsigned char input[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

const char16_t expected[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};


const char16_t expected[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};

static_assert(array_size(input) == 11, "");

static_assert(array_size(expected) == 6, "");

ExternT in[array_size(input)];

InternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<ExternT>::length(in) == 10);

assert(std::char_traits<InternT>::length(exp) == 5);

test_offsets_partial offsets[] = {

{1, 0, 0, 0}, // no space for first CP

{3, 1, 1, 1}, // no space for second CP

{2, 2, 1, 1}, // incomplete second CP

{2, 1, 1, 1}, // incomplete second CP, and no space for it

{6, 2, 3, 2}, // no space for third CP

{4, 3, 3, 2}, // incomplete third CP

{5, 3, 3, 2}, // incomplete third CP

{4, 2, 3, 2}, // incomplete third CP, and no space for it

{5, 2, 3, 2}, // incomplete third CP, and no space for it

{10, 3, 6, 3}, // no space for fourth CP

{10, 4, 6, 3}, // no space for fourth CP

{7, 5, 6, 3}, // incomplete fourth CP

{8, 5, 6, 3}, // incomplete fourth CP

{9, 5, 6, 3}, // incomplete fourth CP

{7, 3, 6, 3}, // incomplete fourth CP, and no space for it

{8, 3, 6, 3}, // incomplete fourth CP, and no space for it

{9, 3, 6, 3}, // incomplete fourth CP, and no space for it

{7, 4, 6, 3}, // incomplete fourth CP, and no space for it

{8, 4, 6, 3}, // incomplete fourth CP, and no space for it

{9, 4, 6, 3}, // incomplete fourth CP, and no space for it

};

for (auto t : offsets) {

InternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.partial);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<InternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

}

}

template <class InternT, class ExternT>

void utf8_to_utf16_in_error(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const unsigned char input[] = "b\u0448\uD700\U0010AAAA";

tahonermannUnsubmitted

Not Done

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

- const unsigned char input[] = "b\u0448\uD700\U0010AAAA";

+ const unsigned char input[] = "b" "\xD1\x88" "\xED\x9C\x80" "\xF4\x8A\xAA\xAA";

const char16_t expected[] = {'b', 0x0448, 0xD700, 0xDBEA, 0xDEAA, 0};


const char16_t expected[] = {'b', 0x0448, 0xD700, 0xDBEA, 0xDEAA, 0};

static_assert(array_size(input) == 11, "");

static_assert(array_size(expected) == 6, "");

ExternT in[array_size(input)];

InternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<ExternT>::length(in) == 10);

assert(std::char_traits<InternT>::length(exp) == 5);

// There are 5 classes of errors in UTF-8 decoding

// 1. Missing leading byte

// 2. Missing trailing byte

// 3. Surrogate CP

// 4. Overlong sequence

// 5. CP out of Unicode range

test_offsets_error<unsigned char> offsets[] = {

// 1. Missing leading byte. We will replace the leading byte with

// non-leading byte, such as a byte that is always invalid or a trailing

// byte.

// replace leading byte with invalid byte

{1, 5, 0, 0, 0xFF, 0},

{3, 5, 1, 1, 0xFF, 1},

{6, 5, 3, 2, 0xFF, 3},

{10, 5, 6, 3, 0xFF, 6},

// replace leading byte with trailing byte

{1, 5, 0, 0, 0b10101010, 0},

{3, 5, 1, 1, 0b10101010, 1},

{6, 5, 3, 2, 0b10101010, 3},

{10, 5, 6, 3, 0b10101010, 6},

// 2. Missing trailing byte. We will replace the trailing byte with

// non-trailing byte, such as a byte that is always invalid or a leading

// byte (simple ASCII byte in our case).

// replace first trailing byte with ASCII byte

{3, 5, 1, 1, 'z', 2},

{6, 5, 3, 2, 'z', 4},

{10, 5, 6, 3, 'z', 7},

// replace first trailing byte with invalid byte

{3, 5, 1, 1, 0xFF, 2},

{6, 5, 3, 2, 0xFF, 4},

{10, 5, 6, 3, 0xFF, 7},

// replace second trailing byte with ASCII byte

{6, 5, 3, 2, 'z', 5},

{10, 5, 6, 3, 'z', 8},

// replace second trailing byte with invalid byte

{6, 5, 3, 2, 0xFF, 5},

{10, 5, 6, 3, 0xFF, 8},

// replace third trailing byte

{10, 5, 6, 3, 'z', 9},

{10, 5, 6, 3, 0xFF, 9},

// 2.1 The following test-cases raise doubt whether error or partial should

// be returned. For example, we have 4-byte sequence with valid leading

// byte. If we hide the last byte we need to return partial. But, if the

// second or third byte, which are visible to the call to codecvt, are

// malformed then error should be returned.

// replace first trailing byte with ASCII byte, also incomplete at end

{5, 5, 3, 2, 'z', 4},

{8, 5, 6, 3, 'z', 7},

{9, 5, 6, 3, 'z', 7},

// replace first trailing byte with invalid byte, also incomplete at end

{5, 5, 3, 2, 0xFF, 4},

{8, 5, 6, 3, 0xFF, 7},

{9, 5, 6, 3, 0xFF, 7},

// replace second trailing byte with ASCII byte, also incomplete at end

{9, 5, 6, 3, 'z', 8},

// replace second trailing byte with invalid byte, also incomplete at end

{9, 5, 6, 3, 0xFF, 8},

// 3. Surrogate CP. We modify the second byte (first trailing) of the 3-byte

// CP U+D700

{6, 5, 3, 2, 0b10100000, 4}, // turn U+D700 into U+D800

{6, 5, 3, 2, 0b10101100, 4}, // turn U+D700 into U+DB00

{6, 5, 3, 2, 0b10110000, 4}, // turn U+D700 into U+DC00

{6, 5, 3, 2, 0b10111100, 4}, // turn U+D700 into U+DF00

// 4. Overlong sequence. The CPs in the input are chosen such that modifying

// just the leading byte is enough to make them overlong, i.e. for the

// 3-byte and 4-byte CP the second byte (first trailing) has enough leading

// zeroes.

{3, 5, 1, 1, 0b11000000, 1}, // make the 2-byte CP overlong

{3, 5, 1, 1, 0b11000001, 1}, // make the 2-byte CP overlong

{6, 5, 3, 2, 0b11100000, 3}, // make the 3-byte CP overlong

{10, 5, 6, 3, 0b11110000, 6}, // make the 4-byte CP overlong

// 5. CP above range

// turn U+10AAAA into U+14AAAA by changing its leading byte

{10, 5, 6, 3, 0b11110101, 6},

// turn U+10AAAA into U+11AAAA by changing its 2nd byte

{10, 5, 6, 3, 0b10011010, 7},

};

for (auto t : offsets) {

InternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

auto old_char = in[t.replace_pos];

in[t.replace_pos] = t.replace_char;

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.error);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<InternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

in[t.replace_pos] = old_char;

}

}

template <class InternT, class ExternT>

void utf8_to_utf16_in(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

utf8_to_utf16_in_ok(cvt);

utf8_to_utf16_in_partial(cvt);

utf8_to_utf16_in_error(cvt);

}

MordanteUnsubmitted

Done

Can you make sure all these blocks have comments?
The tests are not too easy to read; without comments I really have a hard time validating the test.

Especially since you use the surrogate values here: are you testing that the surrogate values fail, or that the input is malformed in other ways?
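As background for this question, a hedged sketch of how a supplementary code point maps to a UTF-16 surrogate pair. The helper names are hypothetical and not from the patch. It shows that the values 0xDBEA and 0xDEAA in the expected arrays are the well-formed pair for U+10AAAA, which is different from a lone surrogate appearing on its own.

```cpp
#include <cassert>
#include <utility>

constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
constexpr bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Hypothetical helper: split a code point in U+10000..U+10FFFF into a pair.
inline std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
  assert(cp >= 0x10000 && cp <= 0x10FFFF);
  const char32_t v = cp - 0x10000; // 20 significant bits
  return {static_cast<char16_t>(0xD800 + (v >> 10)),     // top 10 bits
          static_cast<char16_t>(0xDC00 + (v & 0x3FF))};  // bottom 10 bits
}

// Hypothetical helper: recombine a well-formed pair into one code point.
inline char32_t from_surrogates(char16_t hi, char16_t lo) {
  assert(is_high_surrogate(hi) && is_low_surrogate(lo));
  return 0x10000 + ((static_cast<char32_t>(hi) - 0xD800) << 10) +
         (static_cast<char32_t>(lo) - 0xDC00);
}
```

A high surrogate that is not followed by a low one, or a low surrogate with no high one before it, has no such mapping; that is the unpaired case an encoder is expected to reject.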


template <class InternT, class ExternT>

void utf16_to_utf8_out_ok(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const char16_t input[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};

const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

tahonermannUnsubmitted

Not Done

const char16_t input[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};

- const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

+ const unsigned char expected[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

static_assert(array_size(input) == 6, "");


static_assert(array_size(input) == 6, "");

static_assert(array_size(expected) == 11, "");

InternT in[array_size(input)];

ExternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<InternT>::length(in) == 5);

assert(std::char_traits<ExternT>::length(exp) == 10);

const test_offsets_ok offsets[] = {{0, 0}, {1, 1}, {2, 3}, {3, 6}, {5, 10}};

for (auto t : offsets) {

ExternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

mbstate_t state = {};

const InternT* in_next = nullptr;

ExternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.out(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.ok);

assert(in_next == in + t.in_size);

assert(out_next == out + t.out_size);

assert(std::char_traits<ExternT>::compare(out, exp, t.out_size) == 0);

if (t.out_size < array_size(out))

assert(out[t.out_size] == 0);

}

}

template <class InternT, class ExternT>

void utf16_to_utf8_out_partial(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const char16_t input[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};

const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

tahonermannUnsubmitted

Not Done

const char16_t input[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};

- const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

+ const unsigned char expected[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

static_assert(array_size(input) == 6, "");


static_assert(array_size(input) == 6, "");

static_assert(array_size(expected) == 11, "");

InternT in[array_size(input)];

ExternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<InternT>::length(in) == 5);

assert(std::char_traits<ExternT>::length(exp) == 10);

const test_offsets_partial offsets[] = {

{1, 0, 0, 0}, // no space for first CP

{2, 1, 1, 1}, // no space for second CP

{2, 2, 1, 1}, // no space for second CP

{3, 3, 2, 3}, // no space for third CP

{3, 4, 2, 3}, // no space for third CP

{3, 5, 2, 3}, // no space for third CP

{5, 6, 3, 6}, // no space for fourth CP

{5, 7, 3, 6}, // no space for fourth CP

{5, 8, 3, 6}, // no space for fourth CP

{5, 9, 3, 6}, // no space for fourth CP

{4, 10, 3, 6}, // incomplete fourth CP

{4, 6, 3, 6}, // incomplete fourth CP, and no space for it

{4, 7, 3, 6}, // incomplete fourth CP, and no space for it

{4, 8, 3, 6}, // incomplete fourth CP, and no space for it

{4, 9, 3, 6}, // incomplete fourth CP, and no space for it

};

for (auto t : offsets) {

ExternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

mbstate_t state = {};

const InternT* in_next = nullptr;

ExternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.out(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.partial);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<ExternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

}

}

template <class InternT, class ExternT>

void utf16_to_utf8_out_error(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP, 3-byte CP and 4-byte CP

const char16_t input[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};

const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

Inline comment from tahonermann (Unsubmitted, Not Done), suggested edit:

- const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

+ const unsigned char expected[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

static_assert(array_size(input) == 6, "");

static_assert(array_size(expected) == 11, "");

InternT in[array_size(input)];

ExternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<InternT>::length(in) == 5);

assert(std::char_traits<ExternT>::length(exp) == 10);

// The only possible error in UTF-16 is an unpaired surrogate code unit.

// So we replace valid code points (scalar values) with a lone surrogate CU.

test_offsets_error<InternT> offsets[] = {

{5, 10, 0, 0, 0xD800, 0},

{5, 10, 0, 0, 0xDBFF, 0},

{5, 10, 0, 0, 0xDC00, 0},

{5, 10, 0, 0, 0xDFFF, 0},

{5, 10, 1, 1, 0xD800, 1},

{5, 10, 1, 1, 0xDBFF, 1},

{5, 10, 1, 1, 0xDC00, 1},

{5, 10, 1, 1, 0xDFFF, 1},

{5, 10, 2, 3, 0xD800, 2},

{5, 10, 2, 3, 0xDBFF, 2},

{5, 10, 2, 3, 0xDC00, 2},

{5, 10, 2, 3, 0xDFFF, 2},

// make the leading surrogate a trailing one

{5, 10, 3, 6, 0xDC00, 3},

{5, 10, 3, 6, 0xDFFF, 3},

// make the trailing surrogate a leading one

{5, 10, 3, 6, 0xD800, 4},

{5, 10, 3, 6, 0xDBFF, 4},

// make the trailing surrogate a BMP char

{5, 10, 3, 6, 'z', 4},

};

for (auto t : offsets) {

ExternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

auto old_char = in[t.replace_pos];

in[t.replace_pos] = t.replace_char;

mbstate_t state = {};

const InternT* in_next = nullptr;

ExternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.out(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.error);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<ExternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

in[t.replace_pos] = old_char;

}

}

template <class InternT, class ExternT>

void utf16_to_utf8_out(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

utf16_to_utf8_out_ok(cvt);

utf16_to_utf8_out_partial(cvt);

utf16_to_utf8_out_error(cvt);

}
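The UTF-16 tests above rely on the code units 0xDBEA 0xDEAA forming the surrogate pair for U+10AAAA, which then takes four UTF-8 bytes. A minimal sketch of that arithmetic follows. The helper names are hypothetical and are not part of the patch or of libc++.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; not taken from the patch.

// True if cu is a leading (high) surrogate, 0xD800..0xDBFF.
constexpr bool is_lead_surrogate(char16_t cu) { return cu >= 0xD800 && cu <= 0xDBFF; }

// True if cu is a trailing (low) surrogate, 0xDC00..0xDFFF.
constexpr bool is_trail_surrogate(char16_t cu) { return cu >= 0xDC00 && cu <= 0xDFFF; }

// Combine a valid lead/trail surrogate pair into one code point.
// The caller is expected to have checked both units first.
constexpr char32_t combine_surrogates(char16_t lead, char16_t trail) {
  return 0x10000 + ((static_cast<char32_t>(lead) - 0xD800) << 10) +
         (static_cast<char32_t>(trail) - 0xDC00);
}

// Number of UTF-8 bytes needed to encode a Unicode scalar value.
constexpr std::size_t utf8_length(char32_t cp) {
  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}
```

With the test input {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA} the lengths are 1 + 2 + 3 + 4, which matches the 10 bytes the tests expect. A lead surrogate that is not followed by a trail surrogate, or a trail surrogate on its own, is what the error cases substitute into the input.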

template <class InternT, class ExternT>

void test_utf8_utf16_cvt(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

utf8_to_utf16_in(cvt);

utf16_to_utf8_out(cvt);

}

template <class InternT, class ExternT>

void utf8_to_ucs2_in_ok(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP and 3-byte CP

const unsigned char input[] = "b\u0448\uAAAA";

Inline comment from tahonermann (Unsubmitted, Not Done), suggested edit:

- const unsigned char input[] = "b\u0448\uAAAA";

+ const unsigned char input[] = "b" "\xD1\x88" "\xEA\xAA\xAA";

const char16_t expected[] = {'b', 0x0448, 0xAAAA, 0};

static_assert(array_size(input) == 7, "");

static_assert(array_size(expected) == 4, "");

ExternT in[array_size(input)];

InternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<ExternT>::length(in) == 6);

assert(std::char_traits<InternT>::length(exp) == 3);

test_offsets_ok offsets[] = {{0, 0}, {1, 1}, {3, 2}, {6, 3}};

for (auto t : offsets) {

InternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.ok);

assert(in_next == in + t.in_size);

assert(out_next == out + t.out_size);

assert(std::char_traits<InternT>::compare(out, exp, t.out_size) == 0);

if (t.out_size < array_size(out))

assert(out[t.out_size] == 0);

}

for (auto t : offsets) {

InternT out[array_size(exp)] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, std::end(out), out_next);

assert(res == cvt.ok);

assert(in_next == in + t.in_size);

assert(out_next == out + t.out_size);

assert(std::char_traits<InternT>::compare(out, exp, t.out_size) == 0);

if (t.out_size < array_size(out))

assert(out[t.out_size] == 0);

}

}

template <class InternT, class ExternT>

void utf8_to_ucs2_in_partial(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP and 3-byte CP

const unsigned char input[] = "b\u0448\uAAAA";

Inline comment from tahonermann (Unsubmitted, Not Done), suggested edit:

- const unsigned char input[] = "b\u0448\uAAAA";

+ const unsigned char input[] = "b" "\xD1\x88" "\xEA\xAA\xAA";

const char16_t expected[] = {'b', 0x0448, 0xAAAA, 0};

static_assert(array_size(input) == 7, "");

static_assert(array_size(expected) == 4, "");

ExternT in[array_size(input)];

InternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<ExternT>::length(in) == 6);

assert(std::char_traits<InternT>::length(exp) == 3);

test_offsets_partial offsets[] = {

{1, 0, 0, 0}, // no space for first CP

{3, 1, 1, 1}, // no space for second CP

{2, 2, 1, 1}, // incomplete second CP

{2, 1, 1, 1}, // incomplete second CP, and no space for it

{6, 2, 3, 2}, // no space for third CP

{4, 3, 3, 2}, // incomplete third CP

{5, 3, 3, 2}, // incomplete third CP

{4, 2, 3, 2}, // incomplete third CP, and no space for it

{5, 2, 3, 2}, // incomplete third CP, and no space for it

};

for (auto t : offsets) {

InternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.partial);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<InternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

}

}

template <class InternT, class ExternT>

void utf8_to_ucs2_in_error(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

const unsigned char input[] = "b\u0448\uD700\U0010AAAA";

Inline comment from tahonermann (Unsubmitted, Not Done), suggested edit:

- const unsigned char input[] = "b\u0448\uD700\U0010AAAA";

+ const unsigned char input[] = "b" "\xD1\x88" "\xED\x9C\x80" "\xF4\x8A\xAA\xAA";

const char16_t expected[] = {'b', 0x0448, 0xD700, 0xDBEA, 0xDEAA, 0};

static_assert(array_size(input) == 11, "");

static_assert(array_size(expected) == 6, "");

ExternT in[array_size(input)];

InternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<ExternT>::length(in) == 10);

assert(std::char_traits<InternT>::length(exp) == 5);

// There are 5 classes of errors in UTF-8 decoding

// 1. Missing leading byte

// 2. Missing trailing byte

// 3. Surrogate CP

// 4. Overlong sequence

// 5. CP out of Unicode range

test_offsets_error<unsigned char> offsets[] = {

// 1. Missing leading byte. We will replace the leading byte with

// a non-leading byte, such as a byte that is always invalid or a trailing

// byte.

// replace leading byte with invalid byte

{1, 5, 0, 0, 0xFF, 0},

{3, 5, 1, 1, 0xFF, 1},

{6, 5, 3, 2, 0xFF, 3},

{10, 5, 6, 3, 0xFF, 6},

// replace leading byte with trailing byte

{1, 5, 0, 0, 0b10101010, 0},

{3, 5, 1, 1, 0b10101010, 1},

{6, 5, 3, 2, 0b10101010, 3},

{10, 5, 6, 3, 0b10101010, 6},

// 2. Missing trailing byte. We will replace the trailing byte with

// a non-trailing byte, such as a byte that is always invalid or a leading

// byte (simple ASCII byte in our case).

// replace first trailing byte with ASCII byte

{3, 5, 1, 1, 'z', 2},

{6, 5, 3, 2, 'z', 4},

{10, 5, 6, 3, 'z', 7},

// replace first trailing byte with invalid byte

{3, 5, 1, 1, 0xFF, 2},

{6, 5, 3, 2, 0xFF, 4},

{10, 5, 6, 3, 0xFF, 7},

// replace second trailing byte with ASCII byte

{6, 5, 3, 2, 'z', 5},

{10, 5, 6, 3, 'z', 8},

// replace second trailing byte with invalid byte

{6, 5, 3, 2, 0xFF, 5},

{10, 5, 6, 3, 0xFF, 8},

// replace third trailing byte

{10, 5, 6, 3, 'z', 9},

{10, 5, 6, 3, 0xFF, 9},

// 2.1 The following test cases raise the question of whether error or partial

// should be returned. For example, take a 4-byte sequence with a valid leading

// byte. If we hide the last byte, partial would normally be returned. But if

// the second or third byte, both visible to the call to codecvt, is malformed,

// then error should be returned.

// replace first trailing byte with ASCII byte, also incomplete at end

{5, 5, 3, 2, 'z', 4},

{8, 5, 6, 3, 'z', 7},

{9, 5, 6, 3, 'z', 7},

// replace first trailing byte with invalid byte, also incomplete at end

{5, 5, 3, 2, 0xFF, 4},

{8, 5, 6, 3, 0xFF, 7},

{9, 5, 6, 3, 0xFF, 7},

// replace second trailing byte with ASCII byte, also incomplete at end

{9, 5, 6, 3, 'z', 8},

// replace second trailing byte with invalid byte, also incomplete at end

{9, 5, 6, 3, 0xFF, 8},

// 3. Surrogate CP. We modify the second byte (first trailing) of the 3-byte

// CP U+D700

{6, 5, 3, 2, 0b10100000, 4}, // turn U+D700 into U+D800

{6, 5, 3, 2, 0b10101100, 4}, // turn U+D700 into U+DB00

{6, 5, 3, 2, 0b10110000, 4}, // turn U+D700 into U+DC00

{6, 5, 3, 2, 0b10111100, 4}, // turn U+D700 into U+DF00

// 4. Overlong sequence. The CPs in the input are chosen such that modifying

// just the leading byte is enough to make them overlong, i.e. for the

// 3-byte and 4-byte CP the second byte (first trailing) has enough leading

// zeroes.

{3, 5, 1, 1, 0b11000000, 1}, // make the 2-byte CP overlong

{3, 5, 1, 1, 0b11000001, 1}, // make the 2-byte CP overlong

{6, 5, 3, 2, 0b11100000, 3}, // make the 3-byte CP overlong

{10, 5, 6, 3, 0b11110000, 6}, // make the 4-byte CP overlong

// 5. CP above range

// turn U+10AAAA into U+14AAAA by changing its leading byte

{10, 5, 6, 3, 0b11110101, 6},

// turn U+10AAAA into U+11AAAA by changing its 2nd byte

{10, 5, 6, 3, 0b10011010, 7},

// Don't replace anything, show full 4-byte CP U+10AAAA

{10, 4, 6, 3, 'b', 0},

{10, 5, 6, 3, 'b', 0},

// Don't replace anything, show incomplete 4-byte CP at the end. It's still

// out of UCS2 range just by seeing the first byte.

{7, 4, 6, 3, 'b', 0}, // incomplete fourth CP

{8, 4, 6, 3, 'b', 0}, // incomplete fourth CP

{9, 4, 6, 3, 'b', 0}, // incomplete fourth CP

{7, 5, 6, 3, 'b', 0}, // incomplete fourth CP

{8, 5, 6, 3, 'b', 0}, // incomplete fourth CP

{9, 5, 6, 3, 'b', 0}, // incomplete fourth CP

};

for (auto t : offsets) {

InternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

auto old_char = in[t.replace_pos];

in[t.replace_pos] = t.replace_char;

mbstate_t state = {};

const ExternT* in_next = nullptr;

InternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.in(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.error);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<InternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

in[t.replace_pos] = old_char;

}

}

template <class InternT, class ExternT>

void utf8_to_ucs2_in(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

utf8_to_ucs2_in_ok(cvt);

utf8_to_ucs2_in_partial(cvt);

utf8_to_ucs2_in_error(cvt);

}
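The error classes listed in utf8_to_ucs2_in_error, and the "error or partial" question in case 2.1, can be illustrated with a small classifier for the first UTF-8 sequence in a buffer. This is only a sketch of general UTF-8 well-formedness under the usual Unicode rules. It is not the libc++ implementation, the names are hypothetical, and it does not model the UCS-2 rule that rejects every 4-byte sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>

// Hypothetical names for illustration; not taken from the patch or libc++.
enum class utf8_status { ok, partial, error };

// Classify the first UTF-8 sequence in [first, last).
// Every visible byte is validated before "partial" is reported, so a
// malformed byte in a truncated sequence yields error, not partial.
inline utf8_status classify_first_sequence(const unsigned char* first,
                                           const unsigned char* last) {
  if (first == last)
    return utf8_status::partial; // nothing to look at yet
  const unsigned char b0 = first[0];
  if (b0 < 0x80)
    return utf8_status::ok; // ASCII
  if (b0 < 0xC2)
    return utf8_status::error; // stray trailing byte, or overlong lead C0/C1
  if (b0 > 0xF4)
    return utf8_status::error; // above U+10FFFF, or a byte that is never valid

  const std::size_t len = b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;

  // Allowed range of the first trailing byte. Four leading bytes narrow it.
  unsigned char lo = 0x80, hi = 0xBF;
  if (b0 == 0xE0)
    lo = 0xA0; // below this the 3-byte form is overlong
  else if (b0 == 0xED)
    hi = 0x9F; // above this the result is a surrogate code point
  else if (b0 == 0xF0)
    lo = 0x90; // below this the 4-byte form is overlong
  else if (b0 == 0xF4)
    hi = 0x8F; // above this the result exceeds U+10FFFF

  const std::size_t avail = static_cast<std::size_t>(last - first);
  for (std::size_t i = 1; i < len; ++i) {
    if (i == avail)
      return utf8_status::partial; // all visible bytes were fine
    const unsigned char b = first[i];
    if (b < lo || b > hi)
      return utf8_status::error;
    lo = 0x80; // later trailing bytes use the full range
    hi = 0xBF;
  }
  return utf8_status::ok;
}

// Convenience overload so a byte sequence can be written inline.
inline utf8_status classify_first_sequence(std::initializer_list<unsigned char> bytes) {
  return classify_first_sequence(bytes.begin(), bytes.end());
}
```

For example, the truncated sequence F4 8A AA classifies as partial, while F4 'z' AA classifies as error even though its fourth byte is also missing. That is the distinction case 2.1 tests for.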

template <class InternT, class ExternT>

void ucs2_to_utf8_out_ok(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP and 3-byte CP

const char16_t input[] = {'b', 0x0448, 0xAAAA, 0};

const unsigned char expected[] = "b\u0448\uAAAA";

Inline comment from tahonermann (Unsubmitted, Not Done), suggested edit:

- const unsigned char expected[] = "b\u0448\uAAAA";

+ const unsigned char expected[] = "b" "\xD1\x88" "\xEA\xAA\xAA";

static_assert(array_size(input) == 4, "");

static_assert(array_size(expected) == 7, "");

InternT in[array_size(input)];

ExternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<InternT>::length(in) == 3);

assert(std::char_traits<ExternT>::length(exp) == 6);

const test_offsets_ok offsets[] = {{0, 0}, {1, 1}, {2, 3}, {3, 6}};

for (auto t : offsets) {

ExternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

mbstate_t state = {};

const InternT* in_next = nullptr;

ExternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.out(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.ok);

assert(in_next == in + t.in_size);

assert(out_next == out + t.out_size);

assert(std::char_traits<ExternT>::compare(out, exp, t.out_size) == 0);

if (t.out_size < array_size(out))

assert(out[t.out_size] == 0);

}

}

template <class InternT, class ExternT>

void ucs2_to_utf8_out_partial(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

// UTF-8 string of 1-byte code point (CP), 2-byte CP and 3-byte CP

const char16_t input[] = {'b', 0x0448, 0xAAAA, 0};

const unsigned char expected[] = "b\u0448\uAAAA";

Inline comment from tahonermann (Unsubmitted, Not Done), suggested edit:

- const unsigned char expected[] = "b\u0448\uAAAA";

+ const unsigned char expected[] = "b" "\xD1\x88" "\xEA\xAA\xAA";

static_assert(array_size(input) == 4, "");

static_assert(array_size(expected) == 7, "");

InternT in[array_size(input)];

ExternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<InternT>::length(in) == 3);

assert(std::char_traits<ExternT>::length(exp) == 6);

const test_offsets_partial offsets[] = {

{1, 0, 0, 0}, // no space for first CP

{2, 1, 1, 1}, // no space for second CP

{2, 2, 1, 1}, // no space for second CP

{3, 3, 2, 3}, // no space for third CP

{3, 4, 2, 3}, // no space for third CP

{3, 5, 2, 3}, // no space for third CP

};

for (auto t : offsets) {

ExternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

mbstate_t state = {};

const InternT* in_next = nullptr;

ExternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.out(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.partial);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<ExternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

}

}

template <class InternT, class ExternT>

void ucs2_to_utf8_out_error(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

const char16_t input[] = {'b', 0x0448, 0xAAAA, 0xDBEA, 0xDEAA, 0};

const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

Inline comment from tahonermann (Unsubmitted, Not Done), suggested edit:

- const unsigned char expected[] = "b\u0448\uAAAA\U0010AAAA";

+ const unsigned char expected[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

static_assert(array_size(input) == 6, "");

static_assert(array_size(expected) == 11, "");

InternT in[array_size(input)];

ExternT exp[array_size(expected)];

std::copy(std::begin(input), std::end(input), std::begin(in));

std::copy(std::begin(expected), std::end(expected), std::begin(exp));

assert(std::char_traits<InternT>::length(in) == 5);

assert(std::char_traits<ExternT>::length(exp) == 10);

test_offsets_error<InternT> offsets[] = {

{5, 10, 0, 0, 0xD800, 0},

{5, 10, 0, 0, 0xDBFF, 0},

{5, 10, 0, 0, 0xDC00, 0},

{5, 10, 0, 0, 0xDFFF, 0},

{5, 10, 1, 1, 0xD800, 1},

{5, 10, 1, 1, 0xDBFF, 1},

{5, 10, 1, 1, 0xDC00, 1},

{5, 10, 1, 1, 0xDFFF, 1},

{5, 10, 2, 3, 0xD800, 2},

{5, 10, 2, 3, 0xDBFF, 2},

{5, 10, 2, 3, 0xDC00, 2},

{5, 10, 2, 3, 0xDFFF, 2},

// don't replace anything, just show the surrogate pair

{5, 10, 3, 6, 'b', 0},

// make the leading surrogate a trailing one

{5, 10, 3, 6, 0xDC00, 3},

{5, 10, 3, 6, 0xDFFF, 3},

// make the trailing surrogate a leading one

{5, 10, 3, 6, 0xD800, 4},

{5, 10, 3, 6, 0xDBFF, 4},

// make the trailing surrogate a BMP char

{5, 10, 3, 6, 'z', 4},

{5, 7, 3, 6, 'b', 0}, // no space for fourth CP

{5, 8, 3, 6, 'b', 0}, // no space for fourth CP

{5, 9, 3, 6, 'b', 0}, // no space for fourth CP

{4, 10, 3, 6, 'b', 0}, // incomplete fourth CP

{4, 7, 3, 6, 'b', 0}, // incomplete fourth CP, and no space for it

{4, 8, 3, 6, 'b', 0}, // incomplete fourth CP, and no space for it

{4, 9, 3, 6, 'b', 0}, // incomplete fourth CP, and no space for it

};

for (auto t : offsets) {

ExternT out[array_size(exp) - 1] = {};

assert(t.in_size <= array_size(in));

assert(t.out_size <= array_size(out));

assert(t.expected_in_next <= t.in_size);

assert(t.expected_out_next <= t.out_size);

auto old_char = in[t.replace_pos];

in[t.replace_pos] = t.replace_char;

mbstate_t state = {};

const InternT* in_next = nullptr;

ExternT* out_next = nullptr;

std::codecvt_base::result res = std::codecvt_base::ok;

res = cvt.out(state, in, in + t.in_size, in_next, out, out + t.out_size, out_next);

assert(res == cvt.error);

assert(in_next == in + t.expected_in_next);

assert(out_next == out + t.expected_out_next);

assert(std::char_traits<ExternT>::compare(out, exp, t.expected_out_next) == 0);

if (t.expected_out_next < array_size(out))

assert(out[t.expected_out_next] == 0);

in[t.replace_pos] = old_char;

}

}

template <class InternT, class ExternT>

void ucs2_to_utf8_out(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

ucs2_to_utf8_out_ok(cvt);

ucs2_to_utf8_out_partial(cvt);

ucs2_to_utf8_out_error(cvt);

}

template <class InternT, class ExternT>

void test_utf8_ucs2_cvt(const std::codecvt<InternT, ExternT, mbstate_t>& cvt) {

utf8_to_ucs2_in(cvt);

ucs2_to_utf8_out(cvt);

}
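The inline suggestions to spell the UTF-8 data with hex escapes can be motivated as follows: a universal character name such as \u0448 in an ordinary (narrow) string literal is encoded in the compiler's literal encoding, which is not guaranteed to be UTF-8, whereas hex escapes fix the bytes regardless of that setting. A minimal sketch follows. The array_size helper here is an assumed stand-in for the one the test file uses, and the variable name is illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Assumed stand-in for the array_size helper used by the test file.
template <class T, std::size_t N>
constexpr std::size_t array_size(const T (&)[N]) {
  return N;
}

// 'b', U+0448, U+AAAA and U+10AAAA written as explicit UTF-8 bytes.
// Each adjacent literal holds one code point, which keeps the hex escapes
// visually grouped. Escapes are processed before the literals are joined.
constexpr unsigned char expected_utf8[] = "b" "\xD1\x88" "\xEA\xAA\xAA" "\xF4\x8A\xAA\xAA";

// 1 + 2 + 3 + 4 bytes of UTF-8, plus the terminating null.
static_assert(array_size(expected_utf8) == 11, "");
```

With this spelling the size checks in the tests hold independently of the literal encoding the compiler is configured with.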

void test_utf8_utf32_codecvts() {

using codecvt_c32 = std::codecvt<char32_t, char, mbstate_t>;

const std::locale& loc = std::locale::classic();

assert(std::has_facet<codecvt_c32>(loc));

const codecvt_c32& cvt = std::use_facet<codecvt_c32>(loc);

test_utf8_utf32_cvt(cvt);

std::codecvt_utf8<char32_t> cvt2;

test_utf8_utf32_cvt(cvt2);

#if !defined(TEST_HAS_NO_WIDE_CHARACTERS) && !defined(TEST_SHORT_WCHAR)

std::codecvt_utf8<wchar_t> cvt3;

test_utf8_utf32_cvt(cvt3);

#endif

#ifndef TEST_HAS_NO_CHAR8_T

using codecvt_c32_c8 = std::codecvt<char32_t, char8_t, mbstate_t>;

assert(std::has_facet<codecvt_c32_c8>(loc));

const codecvt_c32_c8& cvt4 = std::use_facet<codecvt_c32_c8>(loc);

test_utf8_utf32_cvt(cvt4);

#endif

}

void test_utf8_utf16_codecvts() {

using codecvt_c16 = std::codecvt<char16_t, char, mbstate_t>;

const std::locale& loc = std::locale::classic();

assert(std::has_facet<codecvt_c16>(loc));

const codecvt_c16& cvt = std::use_facet<codecvt_c16>(loc);

test_utf8_utf16_cvt(cvt);

std::codecvt_utf8_utf16<char16_t> cvt2;

test_utf8_utf16_cvt(cvt2);

std::codecvt_utf8_utf16<char32_t> cvt3;

test_utf8_utf16_cvt(cvt3);

#ifndef TEST_HAS_NO_WIDE_CHARACTERS

std::codecvt_utf8_utf16<wchar_t> cvt4;

test_utf8_utf16_cvt(cvt4);

#endif

#ifndef TEST_HAS_NO_CHAR8_T

using codecvt_c16_c8 = std::codecvt<char16_t, char8_t, mbstate_t>;

assert(std::has_facet<codecvt_c16_c8>(loc));

const codecvt_c16_c8& cvt5 = std::use_facet<codecvt_c16_c8>(loc);

test_utf8_utf16_cvt(cvt5);

#endif

}

void test_utf8_ucs2_codecvts() {

std::codecvt_utf8<char16_t> cvt;

test_utf8_ucs2_cvt(cvt);

#if !defined(TEST_HAS_NO_WIDE_CHARACTERS) && defined(TEST_SHORT_WCHAR)

std::codecvt_utf8<wchar_t> cvt2;

test_utf8_ucs2_cvt(cvt2);

#endif

}

int main() {

test_utf8_utf32_codecvts();

test_utf8_utf16_codecvts();

test_utf8_ucs2_codecvts();

}


Revision Contents

Diff 501686

libcxx/src/locale.cpp

libcxx/test/std/localization/codecvt_unicode.pass.cpp
