This is an archive of the discontinued LLVM Phabricator instance.

[libcxx] Fix the ctype `is` (pointer version) function for Windows
ClosedPublic

Authored by mstorsjo on Mar 2 2022, 1:03 AM.

Details

Summary

Previously, this test snippet would report incorrect information:

F::mask m;
std::wstring in(L"\u00DA"); // LATIN CAPITAL LETTER U WITH ACUTE
f.is(in.data(), in.data() + 1, &m);
// m & F::lower would be set

The single-character version of the is function wasn't
affected by this issue though.

Define _LIBCPP_CTYPE_MASK_IS_COMPOSITE_ALPHA for Windows,
as the alpha / _ALPHA constant is a mask consisting of
multiple bits set, which avoids setting alpha whenver any
of the bits is set, in the do_is implementation.

On Windows, with the "C" locale, characters outside of ASCII are
interpreted according to the current system code page, which
can consider chars like e.g. 0xDA as an uppercase alphabetical
character.

Due to the differing classification of some characters, the
scan_is and scan_not tests are quite annoying to fix, thus just
ifdef out some of the tests for the "C" locale there - the code gets
tested with the more standard en_US.UTF-8 locale anyway.

Diff Detail

Event Timeline

mstorsjo created this revision.Mar 2 2022, 1:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 1:03 AM
mstorsjo requested review of this revision.Mar 2 2022, 1:03 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2022, 1:03 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
Quuxplusone requested changes to this revision.Mar 2 2022, 5:32 AM
Quuxplusone added inline comments.
libcxx/include/__locale
457

This looks like a bugfix, so "Fix the ... tests" seems like the wrong commit summary.
Before this patch, did f.is(F::upper, L'a') return true instead of false? That'd be a good thing to put in the commit message.

libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/is_1.pass.cpp
110–112

This comment is too vague for my liking. I infer that "can be" means sometimes it is and sometimes it isn't. I'd rather see something very specific, like

// On Windows, this depends on the current codepage.
// If the codepage is CP-437, \x00DA is U+250C BOX DRAWINGS LIGHT DOWN AND RIGHT.
// If the codepage is CP-1252, \x00DA is U+00DA CAPITAL LETTER U WITH ACUTE.

And then I'm not actually sure why the non-Windows behavior claims that U+00DA is not alpha+upper.

This revision now requires changes to proceed.Mar 2 2022, 5:32 AM
mstorsjo updated this revision to Diff 412392.Mar 2 2022, 6:06 AM

Updated the comments as requested (will update subject+description too).

mstorsjo retitled this revision from [libcxx] Fix the locale `is` and `scan_is`/`scan_not` tests for Windows to [libcxx] Fix the ctype `is` (pointer version) function for Windows.Mar 2 2022, 6:07 AM
mstorsjo edited the summary of this revision. (Show Details)
mstorsjo added inline comments.
libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/is_1.pass.cpp
110–112

I think in the non-Windows case, the "C" locale might correspond to plain ASCII or something like that, where \x00DA isn't considered an alphabetical character.

libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/is_1.pass.cpp
110–112

Nit: for readability, I'd prefer linebreaks before each "If", and ideally no other linebreaks.

I'm still confused as to why the comment says "it depends" but the code on lines 114-115 is very clearly assuming that the character is U+00DA CAPITAL LETTER U WITH ACUTE. Is this because the comment is wrong? or because the test runner forces a specific codepage? or because we just luck into a specific codepage and this is super fragile? (Ditto throughout.)

libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/scan_not.pass.cpp
75

Pre-existing, no action required: This test seems silly. All of the expected results are 0.

mstorsjo added inline comments.Mar 2 2022, 1:01 PM
libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/is_1.pass.cpp
110–112

I think this (and a bunch of other similar tests for the "C" locale) might behave differently, if the systemwide default codepage was a more exotic one, yes. Hence, it strictly depends on what it is set to. I dno't think the tests have much say over what the "C" locale is.

Sure, I can unwrap the comment to make it more readable as you wrote originally.

mstorsjo updated this revision to Diff 412543.Mar 2 2022, 2:21 PM

Clarified the reliance on the system's default codepage in the comments, unwrapped the unicode code point explanations as requested.

mstorsjo added inline comments.Mar 2 2022, 2:23 PM
libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/is_1.pass.cpp
110–112

As an alternative, we could also just ifdef out all tests for the non-ASCII chars in the "C" locales on Windows, as the interpretation strictly does depend on what the system's default codepage is.

mstorsjo updated this revision to Diff 412970.Mar 4 2022, 3:21 AM

I investigated the behaviour of these tests on odd system codepages, and it actually doesn't affect it. On Windows, wchars are correctly classified according to what unicode the wchar represents even when the "C" locale is set (contrary to other OSes) - regardless of the system codepage.

Thus, I changed the comments to something clearer - now this should be safe and good to go as is.

Quuxplusone accepted this revision.Mar 4 2022, 8:18 AM

LGTM % nit: remove the word "proper", mainly for unnecessary-editorializing but also because I think the editorializing is wrong? Seems to me like the C locale "properly" shouldn't be using Unicode interpretations! :)

libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/is_1.pass.cpp
111

:)

libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/is_many.pass.cpp
159
libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/scan_is.pass.cpp
67
libcxx/test/std/localization/locale.categories/category.ctype/locale.ctype.byname/scan_not.pass.cpp
67
This revision is now accepted and ready to land.Mar 4 2022, 8:18 AM