This is an archive of the discontinued LLVM Phabricator instance.

[libcxx] [test] [re.traits] Correct expected values for invalid UTF-8
ClosedPublic

Authored by mgorny on Dec 16 2018, 11:34 AM.

Details

Summary

Correct the expected result for \xDA transformation, and add comments
to both \xDA and \xFA cases. They both form the first character
of multibyte sequence in UTF-8, and therefore are invalid without
a continuation character. Furthermore, the second one would indicate
a code point outside valid UTF-8 range.

Diff Detail

Repository
rCXX libc++

Event Timeline

mgorny created this revision.Dec 16 2018, 11:34 AM
krytarowski accepted this revision.Dec 17 2018, 1:39 AM

0xDA must be 2 byte sequence, 0xFA is invalid sequence.

This revision is now accepted and ready to land.Dec 17 2018, 1:39 AM

I'm not really a fan of this change. Removing a test just because it's failing seems wrong to me.

We should figure out what the behavior of that call should be, and test for that.

mclow.lists added a comment.EditedDec 17 2018, 7:05 AM

Just to be clear - I'm quite willing to believe that the tests are wrong, and should be changed. But std::char_traits<char>().translate_nocase('\xFA') has some behavior. Should it return 0? 1? '\xFA'? throw an exception? Cause the computer to catch fire? ;-)

mgorny updated this revision to Diff 178463.Dec 17 2018, 7:33 AM
mgorny retitled this revision from [libcxx] [test] [re.traits] Remove asserts failing due to invalid UTF-8 to [libcxx] [test] [re.traits] Correct expected values for invalid UTF-8.
mgorny edited the summary of this revision. (Show Details)

Very well. I originally wanted to avoid relying on any specific behavior with invalid input but I suppose you're right. Unless I'm misunderstanding the spec, the behavior should be equivalent to tolower(), and tolower() specifies that the value should be returned unmodified if there's no lowercase representation. Now, I suppose it depends on how you define that but I think it's reasonable to assume that invalid characters have no lowercase representation.

I've tested this behavior on Linux, FreeBSD and NetBSD. I suppose the original change from behavior equivalent to my patch now to the broken behavior we have right now was accidental.

Amusingly enough, I received this bug report this morning, which appears be related.

Also, is there a corresponding change that needs to be made for whcar_t?

This updated test fails on Mac OS X. (the assert on line 48 fires)

On Mac OS, using the locale en_US.UTF-8, the call std::re_traits<char>().translate_nocase('\xDA') returns '\xFA'

Also, is there a corresponding change that needs to be made for whcar_t?

No. wchar_t normally does not use multi-wchar_t encoding, at least within the tested range.

This updated test fails on Mac OS X. (the assert on line 48 fires)

Well, that's what I suspected. I presume that it worked with the previous version just fine. Therefore, we have different implementations making different assumptions as to what to do with \xDA.

I don't think killing the whole test on some implementations like we're doing right now makes sense. Either we should stop testing the 'undefined behavior', or maybe allow for both values.

In my opinion we shall not test UB as each implementation can probably behave in any way.

In my opinion we shall not test UB as each implementation can probably behave in any way.

If you can convince me that this is in fact UB, then I would agree.
But I'm not seeing any indication that this is UB when reading the standard.

[locale.ctype.members]/5 says: that tolower calls do_tolower, which is defined thus:
Returns: The first form returns the corresponding lower-case character if it is known to exist, or its argument if not.

[ It's also possible that different implementations of the locales (part of the C library/OS) are returning different values. If that turns out to be the case, then we should document those differences and move on. ]

[ It's also possible that different implementations of the locales (part of the C library/OS) are returning different values. If that turns out to be the case, then we should document those differences and move on. ]

I think that's the case. I've written a simple test program to check it. Could you try it on Darwin?

#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main() {
	unsigned char c = 0xDA, o;

	/* verify with C locale */
	setlocale(LC_ALL, "C");
	o = tolower(c);
	printf("C locale: %02x (%c) -> %02x (%c)\n", c, c, o, o);

	/* ISO-8859-1 */
	setlocale(LC_ALL, "en_US.ISO-8859-1");
	o = tolower(c);
	printf("iso-8859-1 locale: %02x (%c) -> %02x (%c)\n", c, c, o, o);

	/* UTF-8 locale */
	setlocale(LC_ALL, "en_US.UTF-8");
	o = tolower(c);
	printf("utf-8 locale: %02x (%c) -> %02x (%c)\n", c, c, o, o);
}

[ It's also possible that different implementations of the locales (part of the C library/OS) are returning different values. If that turns out to be the case, then we should document those differences and move on. ]

I think that's the case. I've written a simple test program to check it. Could you try it on Darwin?

C locale: da (⁄) -> da (⁄)
iso-8859-1 locale: da (⁄) -> da (⁄)
utf-8 locale: da (⁄) -> fa (˙)

I have written a similar program, but using t.translate_nocase. All the characters from C0 --> DE are translated on Darwin.

I'm surprised it doesn't translate ISO-8859-1 characters but I guess it might not support them at all. In any case, the behavior doesn't look correct to me but it's as it is and I guess I can't do anything about it.

So what do you suggest we do about this test?

So what do you suggest we do about this test?

Let's go back to your original patch (removing assert(t.translate_nocase(L'\xDA') == L'\xFA');, and land that, and I'll put this on my TODO list.

Will do. Thank you.

This revision was automatically updated to reflect the committed changes.

@mgorny What output do you get on Linux for your test program?

@mgorny What output do you get on Linux for your test program?

C locale: da (�) -> da (�)
iso-8859-1 locale: da (�) -> fa (�)
utf-8 locale: da (�) -> da (�)

(note the %c forms result in invalid UTF-8 to console)

Curious enough, on FreeBSD and NetBSD even iso-8859-1 isn't transformed:

C locale: da (�) -> da (�)
iso-8859-1 locale: da (�) -> da (�)
utf-8 locale: da (�) -> da (�)