Correct the expected result for the \xDA transformation, and add comments
to both the \xDA and \xFA cases. Each forms the first byte of a
multibyte sequence in UTF-8, and is therefore invalid without
a continuation byte. Furthermore, the second one would indicate
a code point outside the valid UTF-8 range.
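As a rough illustration of the claim above, the following sketch classifies a single byte by its role in UTF-8 under the RFC 3629 rules. The names `Utf8Byte` and `classify_utf8_byte` are hypothetical and are not part of the patch or of libc++.

```cpp
#include <cstdint>

// Hypothetical classification of one byte by its role in UTF-8 (RFC 3629).
enum class Utf8Byte { Ascii, Continuation, Lead2, Lead3, Lead4, Invalid };

inline Utf8Byte classify_utf8_byte(std::uint8_t b) {
    if (b < 0x80) return Utf8Byte::Ascii;         // 0xxxxxxx
    if (b < 0xC0) return Utf8Byte::Continuation;  // 10xxxxxx
    if (b < 0xC2) return Utf8Byte::Invalid;       // C0/C1 only start overlong forms
    if (b < 0xE0) return Utf8Byte::Lead2;         // 110xxxxx, needs 1 continuation
    if (b < 0xF0) return Utf8Byte::Lead3;         // 1110xxxx, needs 2 continuations
    if (b < 0xF5) return Utf8Byte::Lead4;         // 11110xxx, up to U+10FFFF
    return Utf8Byte::Invalid;                     // F5..FF never appear in valid UTF-8
}
```

Under these rules \xDA is a two-byte lead that is incomplete on its own, while \xFA cannot appear in valid UTF-8 at all.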
I'm not really a fan of this change. Removing a test just because it's failing seems wrong to me.
We should figure out what the behavior of that call should be, and test for that.
Just to be clear - I'm quite willing to believe that the tests are wrong, and should be changed. But std::regex_traits<char>().translate_nocase('\xFA') has some behavior. Should it return 0? 1? '\xFA'? Throw an exception? Cause the computer to catch fire? ;-)
Very well. I originally wanted to avoid relying on any specific behavior with invalid input, but I suppose you're right. Unless I'm misunderstanding the spec, the behavior should be equivalent to tolower(), and tolower() specifies that the value should be returned unmodified if there's no lowercase representation. Now, I suppose it depends on how you define that, but I think it's reasonable to assume that invalid characters have no lowercase representation.
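A minimal sketch of the "returned unmodified" rule, using the ctype facet of the classic "C" locale, where no byte above 0x7F has a lowercase form. The helper name `lower_in_classic_locale` is made up for illustration; other locales may well behave differently, which is the point of this discussion.

```cpp
#include <locale>

// Hypothetical helper: lowercase one char via the classic "C" locale's ctype facet.
inline char lower_in_classic_locale(char c) {
    const std::ctype<char>& ct =
        std::use_facet<std::ctype<char>>(std::locale::classic());
    // tolower() returns its argument when no lowercase counterpart is known.
    return ct.tolower(c);
}
```

With this helper, 'A' should map to 'a', while '\xDA' should come back unchanged.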
I've tested this behavior on Linux, FreeBSD and NetBSD. I suppose the earlier change, from behavior equivalent to my current patch to the broken behavior we have right now, was accidental.
Amusingly enough, I received this bug report this morning, which appears to be related.
Also, is there a corresponding change that needs to be made for wchar_t?
On Mac OS, using the locale en_US.UTF-8, the call std::regex_traits<char>().translate_nocase('\xDA') returns '\xFA'.
No. wchar_t normally does not use multi-wchar_t encoding, at least within the tested range.
Well, that's what I suspected. I presume that it worked with the previous version just fine. Therefore, we have different implementations making different assumptions as to what to do with \xDA.
I don't think killing the whole test on some implementations like we're doing right now makes sense. Either we should stop testing the 'undefined behavior', or maybe allow for both values.
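One way the "allow for both values" option could look is sketched below. This is only an assumption about how such a check might be written, not what the patch does, and the helper name is hypothetical.

```cpp
#include <regex>

// Hypothetical check: accept either result seen in practice for '\xDA',
// i.e. the byte unchanged or mapped to '\xFA'.
inline bool translate_nocase_da_is_acceptable() {
    std::regex_traits<char> t;  // uses the current global locale
    const char r = t.translate_nocase('\xDA');
    return r == '\xDA' || r == '\xFA';
}
```

A test could then assert on this helper instead of being disabled on some platforms.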
In my opinion we shall not test UB as each implementation can probably behave in any way.
If you can convince me that this is in fact UB, then I would agree.
But I'm not seeing any indication that this is UB when reading the standard.
[locale.ctype.members]/5 says that tolower calls do_tolower, which is defined thus:
Returns: The first form returns the corresponding lower-case character if it is known to exist, or its argument if not.
[ It's also possible that different implementations of the locales (part of the C library/OS) are returning different values. If that turns out to be the case, then we should document those differences and move on. ]
I think that's the case. I've written a simple test program to check it. Could you try it on Darwin?
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

int main()
{
    unsigned char c = 0xDA, o;

    /* verify with C locale */
    setlocale(LC_ALL, "C");
    o = tolower(c);
    printf("C locale: %02x (%c) -> %02x (%c)\n", c, c, o, o);

    /* ISO-8859-1 */
    setlocale(LC_ALL, "en_US.ISO-8859-1");
    o = tolower(c);
    printf("iso-8859-1 locale: %02x (%c) -> %02x (%c)\n", c, c, o, o);

    /* UTF-8 locale */
    setlocale(LC_ALL, "en_US.UTF-8");
    o = tolower(c);
    printf("utf-8 locale: %02x (%c) -> %02x (%c)\n", c, c, o, o);
}
C locale: da (⁄) -> da (⁄)
iso-8859-1 locale: da (⁄) -> da (⁄)
utf-8 locale: da (⁄) -> fa (˙)
I have written a similar program, but using t.translate_nocase. All the characters from \xC0 through \xDE are translated on Darwin.
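The program mentioned here is not shown in the thread. A sketch of what such a check might look like follows; the function name and its shape are assumptions, not the reviewer's actual code.

```cpp
#include <locale>
#include <regex>
#include <vector>

// Hypothetical helper: list the bytes in [0xC0, 0xDE] that
// regex_traits<char>::translate_nocase changes under the given locale.
inline std::vector<unsigned char>
bytes_changed_by_translate_nocase(const std::locale& loc) {
    std::regex_traits<char> t;
    t.imbue(loc);  // translate_nocase consults this locale's ctype facet
    std::vector<unsigned char> changed;
    for (int b = 0xC0; b <= 0xDE; ++b) {
        const char in = static_cast<char>(b);
        if (t.translate_nocase(in) != in)
            changed.push_back(static_cast<unsigned char>(b));
    }
    return changed;
}
```

To reproduce the comparison, one would pass std::locale("en_US.UTF-8"); note that this constructor throws std::runtime_error if the named locale is not installed.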
I'm surprised it doesn't translate ISO-8859-1 characters, but I guess it might not support them at all. In any case, the behavior doesn't look correct to me, but it is what it is, and I guess I can't do anything about it.
So what do you suggest we do about this test?
Let's go back to your original patch (removing assert(t.translate_nocase(L'\xDA') == L'\xFA');) and land that, and I'll put this on my TODO list.
C locale: da (�) -> da (�)
iso-8859-1 locale: da (�) -> fa (�)
utf-8 locale: da (�) -> da (�)
(note the %c forms result in invalid UTF-8 to console)
Curiously enough, on FreeBSD and NetBSD even iso-8859-1 isn't transformed:
C locale: da (�) -> da (�)
iso-8859-1 locale: da (�) -> da (�)
utf-8 locale: da (�) -> da (�)