This is an archive of the discontinued LLVM Phabricator instance.

[libc++][test] Update some wstring_convert tests for MSVC quirks
ClosedPublic

Authored by CaseyCarter on Apr 21 2019, 10:32 AM.

Details

Summary

Because MSVC encodes wchar_t as UTF-16, it rejects wide character/string literals whose escape sequences specify a character value greater than \xffff. UTF-16 wchar_t is clearly non-conforming, given that the standard requires wchar_t to be capable of representing all characters in the supported wide character execution sets, but rejecting e.g. \x40003 is a reasonably sane compromise given that encoding choice: there's an expectation that \xFOO produces a single character in the resulting literal. Consequently L'\x40003' and L"\x40003" are ill-formed literals on MSVC. L'\U00040003' yields only the high surrogate (and produces a warning about ignoring the "second character" in a multi-character literal), while L"\U00040003" is a perfectly valid const wchar_t[3]: a surrogate pair plus the null terminator.
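To make the wchar_t[3] point concrete, here is a minimal sketch (not part of the patch) that counts the code units in L"\U00040003" and computes the UTF-16 surrogate pair for U+40003 by hand. The names wide_literal_units, expected_units, SurrogatePair and to_utf16_surrogates are hypothetical helpers, and the sketch assumes the wide execution encoding is UTF-16 or UTF-32.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Code units in L"\U00040003", excluding the terminating null.
// Assumption: the wide execution encoding is UTF-16 or UTF-32.
inline std::size_t wide_literal_units() {
    return sizeof(L"\U00040003") / sizeof(wchar_t) - 1;
}

// Two units (a surrogate pair) when wchar_t is at most 16 bits wide,
// one unit when wchar_t can hold 0x40003 directly.
inline std::size_t expected_units() {
    return (sizeof(wchar_t) * CHAR_BIT <= 16) ? 2 : 1;
}

// Hypothetical helper: UTF-16 surrogate pair for a code point >= 0x10000.
struct SurrogatePair {
    unsigned high;
    unsigned low;
};

inline SurrogatePair to_utf16_surrogates(unsigned long code_point) {
    const unsigned long v = code_point - 0x10000UL;  // 20 significant bits
    SurrogatePair p;
    p.high = 0xD800u + static_cast<unsigned>(v >> 10);     // top 10 bits
    p.low = 0xDC00u + static_cast<unsigned>(v & 0x3FFUL);  // bottom 10 bits
    return p;
}
```

With a 16-bit wchar_t the literal should hold the pair 0xD8C0, 0xDC03 followed by the null; with a 32-bit wchar_t it should hold the single value 0x40003 followed by the null.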

This change updates these tests to use universal-character-names (UCNs) instead of raw values for the intended character values, which technically makes them portable even to implementations that don't use a Unicode transformation format encoding for their wide character execution character set. The two-character literal L"\u1005e" is awkward - the e looks like part of the UCN's hex encoding, although \u consumes exactly four hex digits - but necessary to compile in '03 mode, since '03 didn't allow UCNs to be used for members of the basic execution character set even in character/string literals.
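A small sketch of how L"\u1005e" is parsed, in case the trailing e is confusing. The helper names ucn_then_e_units and second_unit are hypothetical, and the sketch assumes a UTF-16 or UTF-32 wide execution encoding, where U+1005 occupies a single wchar_t.

```cpp
#include <cassert>
#include <cstddef>

// \u takes exactly four hex digits, so L"\u1005e" is U+1005 followed by the
// ordinary letter 'e', not a five-digit escape.
// Assumption: UTF-16 or UTF-32 wide execution encoding.
inline std::size_t ucn_then_e_units() {
    return sizeof(L"\u1005e") / sizeof(wchar_t) - 1;  // exclude the null
}

// The second element of the literal is the plain 'e'.
inline wchar_t second_unit() {
    return L"\u1005e"[1];
}
```

Writing the e as \u0065 instead would avoid the visual ambiguity, but that is exactly the form the summary says '03 rejects.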

I've also eliminated the extraneous \x00 "bonus null-terminators" in some of the string literals, which don't affect the tested behavior. I'm sorry about using *L"\U00040003" in conversions.string/to_bytes.pass.cpp, but it's correct for platforms with 32-bit wchar_t, *and* doesn't trigger narrowing warnings as did the prior CharT(0x40003).
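For illustration only, a sketch of what *L"\U00040003" evaluates to and of the UTF-8 bytes one would expect a conversion of U+40003 to produce. This is not the test's code and does not use std::wstring_convert; first_unit_of_U40003 and encode_utf8_4byte are hypothetical helpers, and the encoder handles only code points in the four-byte range.

```cpp
#include <cassert>
#include <string>

// Dereferencing the literal yields its first wchar_t code unit. With a 32-bit
// wchar_t that is the whole code point; with a 16-bit wchar_t it is only the
// high surrogate. No integer-to-wchar_t cast is involved.
inline wchar_t first_unit_of_U40003() {
    return *L"\U00040003";
}

// Hand-rolled UTF-8 encoder for code points in [0x10000, 0x10FFFF] only.
// Layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
inline std::string encode_utf8_4byte(unsigned long cp) {
    std::string out;
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return out;
}
```

On a platform with 32-bit wchar_t, std::wstring(1, first_unit_of_U40003()) would be a one-character string holding U+40003, whose UTF-8 form is the four bytes F1 80 80 83.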

Diff Detail

Repository
rL LLVM

Event Timeline

CaseyCarter created this revision. Apr 21 2019, 10:32 AM

I think you failed to paste a bit into the description of this patch (after the first sentence, before the second).

Does the "universal character names" stuff work on old standards? (C++11/03) and old compilers?

CaseyCarter edited the summary of this revision. Apr 22 2019, 6:27 AM
CaseyCarter added a comment. Edited Apr 22 2019, 6:38 AM

I think you failed to paste a bit into the description of this patch (after the first sentence, before the second).

Reworded. Is that more clear?

Does the "universal character names" stuff work on old standards? (C++11/03) and old compilers?

UCNs were in '98. '11 relaxed the restriction that a UCN could not name a member of the basic execution character set or a control character to only apply outside of string and character literals. The tests pass with clang 3.6 (the oldest I have readily available) in '03 mode, and gcc-4.9 in '11 mode. GCC won't compile the test in '03 mode due to the use of default template arguments in <memory> and <string>.

mclow.lists accepted this revision. Apr 22 2019, 7:00 AM

I think you failed to paste a bit into the description of this patch (after the first sentence, before the second).

Reworded. Is that more clear?

Yes, thank you.

This revision is now accepted and ready to land. Apr 22 2019, 7:00 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Apr 22 2019, 12:06 PM