This is an archive of the discontinued LLVM Phabricator instance.

Differential D55746

[libcxx] [test] [re.traits] Correct expected values for invalid UTF-8
Closed, Public

Authored by mgorny on Dec 16 2018, 11:34 AM.

Details

Reviewers

krytarowski
joerg
EricWF
howard.hinnant

Commits

rG916bf993a2ec: [test] [re.traits] Remove asserts failing due to invalid UTF-8
rL349378: [test] [re.traits] Remove asserts failing due to invalid UTF-8
rCXX349378: [test] [re.traits] Remove asserts failing due to invalid UTF-8

Summary

Correct the expected result for the \xDA transformation, and add comments
to both the \xDA and \xFA cases. Both bytes are the first byte
of a multibyte sequence in UTF-8, and are therefore invalid without
a continuation byte. Furthermore, the second one would indicate
a code point outside the valid UTF-8 range.
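
The lead-byte argument in the summary can be illustrated with a small classifier. This is only a sketch under the RFC 3629 definition of UTF-8; the helper name utf8_sequence_length is hypothetical and is not part of libc++ or of this patch.

```cpp
// Hypothetical helper (not from libc++): returns the total length of the
// UTF-8 sequence that `lead` would start, or 0 if `lead` cannot start a
// valid sequence under RFC 3629.
int utf8_sequence_length(unsigned char lead) {
    if (lead <= 0x7F) return 1;                  // ASCII, stands alone
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  // needs 1 continuation byte
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  // needs 2 continuation bytes
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  // needs 3 continuation bytes
    // Continuation bytes (0x80..0xBF), overlong leads (0xC0, 0xC1), and
    // 0xF5..0xFF, which would encode code points above U+10FFFF.
    return 0;
}
```

With this classification, utf8_sequence_length(0xDA) is 2, so a lone \xDA is an incomplete sequence, and utf8_sequence_length(0xFA) is 0, so \xFA is never valid.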

Diff Detail

Repository: rCXX libc++

Event Timeline

mgorny created this revision. Dec 16 2018, 11:34 AM

Herald added a subscriber: ldionne. Dec 16 2018, 11:34 AM

0xDA must start a 2-byte sequence; 0xFA is an invalid sequence.

This revision is now accepted and ready to land. Dec 17 2018, 1:39 AM

I'm not really a fan of this change. Removing a test just because it's failing seems wrong to me.

We should figure out what the behavior of that call should be, and test for that.

Just to be clear - I'm quite willing to believe that the tests are wrong, and should be changed. But std::regex_traits<char>().translate_nocase('\xFA') has some behavior. Should it return 0? 1? '\xFA'? throw an exception? Cause the computer to catch fire? ;-)

Very well. I originally wanted to avoid relying on any specific behavior with invalid input but I suppose you're right. Unless I'm misunderstanding the spec, the behavior should be equivalent to tolower(), and tolower() specifies that the value should be returned unmodified if there's no lowercase representation. Now, I suppose it depends on how you define that but I think it's reasonable to assume that invalid characters have no lowercase representation.

I've tested this behavior on Linux, FreeBSD and NetBSD. I suppose the original change, from behavior equivalent to my current patch to the broken behavior we have right now, was accidental.
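
The claimed equivalence between translate_nocase and the locale's tolower can be checked directly. The following is a minimal sketch, assuming the standard's definition of translate_nocase in terms of the ctype facet of getloc(); the function name translate_nocase_matches_tolower is made up for illustration.

```cpp
#include <locale>
#include <regex>

// Hypothetical check: returns true if translate_nocase agrees with the
// ctype<char> facet's tolower for every char value, using the locale the
// traits object currently holds.
bool translate_nocase_matches_tolower(const std::regex_traits<char>& t) {
    const std::ctype<char>& ct = std::use_facet<std::ctype<char>>(t.getloc());
    for (int i = 0; i < 256; ++i) {
        const char c = static_cast<char>(static_cast<unsigned char>(i));
        if (t.translate_nocase(c) != ct.tolower(c))
            return false;  // the two disagree for this byte
    }
    return true;
}
```

If this returns true for a given locale, then whatever translate_nocase does with \xDA or \xFA is decided entirely by that locale's tolower, which is the point made above.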

mgorny added a child revision: D55767: [libcxx] [test] Add target_info for NetBSD, and XFAIL some of locale tests. Dec 17 2018, 8:04 AM

Amusingly enough, I received this bug report this morning, which appears to be related.

Also, is there a corresponding change that needs to be made for wchar_t?

This updated test fails on Mac OS X. (the assert on line 48 fires)

On Mac OS, using the locale en_US.UTF-8, the call std::regex_traits<char>().translate_nocase('\xDA') returns '\xFA'.

In D55746#1333027, @mclow.lists wrote:

Also, is there a corresponding change that needs to be made for wchar_t?

No. wchar_t normally does not use multi-wchar_t encoding, at least within the tested range.

In D55746#1333035, @mclow.lists wrote:

This updated test fails on Mac OS X. (the assert on line 48 fires)

Well, that's what I suspected. I presume that it worked with the previous version just fine. Therefore, we have different implementations making different assumptions as to what to do with \xDA.

I don't think killing the whole test on some implementations like we're doing right now makes sense. Either we should stop testing the 'undefined behavior', or maybe allow for both values.
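
For illustration, the "allow for both values" option could look roughly like the sketch below. It is not what was committed; the function name is hypothetical, and the two accepted values are simply the ones reported in this review (\xDA unchanged, or \xFA on Mac OS).

```cpp
#include <locale>
#include <regex>
#include <stdexcept>

// Hypothetical relaxed check: accept either result reported in this review
// for the lone lead byte 0xDA under the named locale.
bool translate_nocase_DA_is_acceptable(const char* locale_name) {
    std::regex_traits<char> t;
    try {
        t.imbue(std::locale(locale_name));
    } catch (const std::runtime_error&) {
        // Locale not installed: keep the default locale instead of failing.
    }
    const char r = t.translate_nocase('\xDA');
    return r == '\xDA' || r == '\xFA';
}
```

A test would then assert translate_nocase_DA_is_acceptable("en_US.UTF-8") rather than pinning one exact value.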

In my opinion we shall not test UB as each implementation can probably behave in any way.

If you can convince me that this is in fact UB, then I would agree.
But I'm not seeing any indication that this is UB when reading the standard.

[locale.ctype.members]/5 says that tolower calls do_tolower, which is defined thus:
Returns: The first form returns the corresponding lower-case character if it is known to exist, or its argument if not.

[ It's also possible that different implementations of the locales (part of the C library/OS) are returning different values. If that turns out to be the case, then we should document those differences and move on. ]

In D55746#1333210, @mclow.lists wrote:

[ It's also possible that different implementations of the locales (part of the C library/OS) are returning different values. If that turns out to be the case, then we should document those differences and move on. ]

I think that's the case. I've written a simple test program to check it. Could you try it on Darwin?

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main() {
	unsigned char c = 0xDA, o;

	/* verify with C locale */
	setlocale(LC_ALL, "C");
	o = tolower(c);
	printf("C locale: %02x (%c) -> %02x (%c)\n", c, c, o, o);

	/* ISO-8859-1 */
	setlocale(LC_ALL, "en_US.ISO-8859-1");
	o = tolower(c);
	printf("iso-8859-1 locale: %02x (%c) -> %02x (%c)\n", c, c, o, o);

	/* UTF-8 locale */
	setlocale(LC_ALL, "en_US.UTF-8");
	o = tolower(c);
	printf("utf-8 locale: %02x (%c) -> %02x (%c)\n", c, c, o, o);
}

In D55746#1333237, @mgorny wrote:

In D55746#1333210, @mclow.lists wrote:

[ It's also possible that different implementations of the locales (part of the C library/OS) are returning different values. If that turns out to be the case, then we should document those differences and move on. ]

I think that's the case. I've written a simple test program to check it. Could you try it on Darwin?

C locale: da (⁄) -> da (⁄)
iso-8859-1 locale: da (⁄) -> da (⁄)
utf-8 locale: da (⁄) -> fa (˙)

I have written a similar program, but using t.translate_nocase. All the characters from C0 --> DE are translated on Darwin.
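
The program mentioned above is not included in the review. A minimal sketch of what such a scan might look like follows; the function name changed_by_translate_nocase is an assumption, and the actual program may have differed.

```cpp
#include <locale>
#include <regex>
#include <utility>
#include <vector>

// Hypothetical scan: collect every byte value that translate_nocase changes
// under `loc`, paired with the value it is mapped to.
std::vector<std::pair<unsigned char, unsigned char>>
changed_by_translate_nocase(const std::locale& loc) {
    std::regex_traits<char> t;
    t.imbue(loc);
    std::vector<std::pair<unsigned char, unsigned char>> changed;
    for (int i = 0; i < 256; ++i) {
        const char in = static_cast<char>(static_cast<unsigned char>(i));
        const char out = t.translate_nocase(in);
        if (out != in)
            changed.emplace_back(static_cast<unsigned char>(in),
                                 static_cast<unsigned char>(out));
    }
    return changed;
}
```

Calling it with std::locale("en_US.UTF-8") on each platform and printing the pairs would show whether bytes in the C0 to DE range are translated there; std::locale throws std::runtime_error if the named locale is not installed.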

I'm surprised it doesn't translate ISO-8859-1 characters, but I guess it might not support them at all. In any case, the behavior doesn't look correct to me, but it is what it is, and I guess I can't do anything about it.

So what do you suggest we do about this test?

Let's go back to your original patch (removing assert(t.translate_nocase(L'\xDA') == L'\xFA');) and land that, and I'll put this on my TODO list.

Will do. Thank you.

Closed by commit rCXX349378: [test] [re.traits] Remove asserts failing due to invalid UTF-8 (authored by mgorny). Dec 17 2018, 11:16 AM

This revision was automatically updated to reflect the committed changes.

@mgorny What output do you get on Linux for your test program?

In D55746#1333667, @ldionne wrote:

@mgorny What output do you get on Linux for your test program?

C locale: da (�) -> da (�)
iso-8859-1 locale: da (�) -> fa (�)
utf-8 locale: da (�) -> da (�)

(note the %c forms result in invalid UTF-8 to console)

Curiously enough, on FreeBSD and NetBSD even iso-8859-1 isn't transformed:

C locale: da (�) -> da (�)
iso-8859-1 locale: da (�) -> da (�)
utf-8 locale: da (�) -> da (�)

Revision Contents

Path: test/std/re/re.traits/translate_nocase.pass.cpp
Size: 5 lines

Diff 178497

test/std/re/re.traits/translate_nocase.pass.cpp

(13 unchanged lines not shown)

 // charT translate_nocase(charT c) const;

 // REQUIRES: locale.en_US.UTF-8

 // XFAIL: with_system_cxx_lib=macosx10.7
 // XFAIL: with_system_cxx_lib=macosx10.8
-
-// TODO: investigation needed
-// XFAIL: linux-gnu

 #include <regex>
 #include <cassert>

 #include "test_macros.h"
 #include "platform_support.h"

 int main()
 {
(9 unchanged lines not shown)
         assert(t.translate_nocase('\xFA') == '\xFA');
         t.imbue(std::locale(LOCALE_en_US_UTF_8));
         assert(t.translate_nocase(' ') == ' ');
         assert(t.translate_nocase('A') == 'a');
         assert(t.translate_nocase('\x07') == '\x07');
         assert(t.translate_nocase('.') == '.');
         assert(t.translate_nocase('a') == 'a');
         assert(t.translate_nocase('1') == '1');
-        assert(t.translate_nocase('\xDA') == '\xFA');
-        assert(t.translate_nocase('\xFA') == '\xFA');
     }
     {
         std::regex_traits<wchar_t> t;
         assert(t.translate_nocase(L' ') == L' ');
         assert(t.translate_nocase(L'A') == L'a');
         assert(t.translate_nocase(L'\x07') == L'\x07');
         assert(t.translate_nocase(L'.') == L'.');
         assert(t.translate_nocase(L'a') == L'a');
(14 unchanged lines not shown)