Page MenuHomePhabricator

[5/N] [libcxx] Convert paths to/from the right narrow code page for narrow strings on windows

Authored by mstorsjo on Nov 10 2020, 1:33 AM.



On windows, the narrow, char based paths normally don't use utf8, but can use many different native code pages, and this is what system functions that operate on files, taking such paths/file names, interpret them as.

Diff Detail

Event Timeline

mstorsjo created this revision.Nov 10 2020, 1:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptNov 10 2020, 1:33 AM
Herald added 1 blocking reviewer(s): Restricted Project. · View Herald Transcript
mstorsjo requested review of this revision.Nov 10 2020, 1:33 AM
mstorsjo retitled this revision from [libcxx] Convert paths to/from the right narrow code page for narrow strings on windows to [5/N] [libcxx] Convert paths to/from the right narrow code page for narrow strings on windows.Nov 10 2020, 1:35 AM
mstorsjo updated this revision to Diff 305076.Nov 13 2020, 3:33 AM

Updated to define WIN32_LEAN_AND_MEAN and NOMINMAX when including windows.h in operations.cpp.

The Windows details look correct, though I have a couple questions on the code page conversions.


I'm not clear on what AreFileApisANSI does, nor have I found it using code search. In what case would any APIs use the OEM code page when the user's code page is something else?

I'm also wondering whether CP_ACP should be CP_THREAD_ACP, which, by default, will be the user's current code page which could be different than the system code page.

mstorsjo added inline comments.Nov 16 2020, 2:52 PM

See D91133 for a testcase (which passes with MSVC STL) for this and some more - a process can switch between whether the narrow file apis take OEM CP or ACP with SetFileApisToANSI() and SetFileApisToOEM().

The MSVC STL sources don't seem to use CP_THREAD_ACP at least... however they also check for ___lc_codepage_func() == CP_UTF8, which I guess we also should...

mstorsjo added inline comments.Nov 17 2020, 12:05 AM

To follow up on the last bit regarding ___lc_codepage_func(), contrary to SetFileApisToOEM() which operates on the kernel32 level, affecting how all narrow file names are interpreted, one can also do setlocale(LC_ALL, ".utf8");, which makes e.g. fopen() work with utf8 file names, but it has no effect on actual kernel32 -A suffixed file APIs.

But in any case, I guess adding the extra check for ___lc_codepage_func(), like ___lc_codepage_func() == CP_UTF8 ? CP_UTF8 : AreFileApisANSI() ? CP_ACP : CP_OEMCP.

After reading up on CP_THREAD_ACP, I don't think that's relevant. The main point is that if I export a narrow form of a filename from a std::filesystem::path object (or create a path object from a narrow string), I expect the string to be in the same codepage that fopen() or CreateFileA() would accept and interpreted in the same way, and as far as I'm reading about CP_THREAD_ACP, it only affects other bits, but not how kernel32 file APIs map narrow chars to wchar.

amccarth accepted this revision.Nov 17 2020, 1:28 PM


Excellent. Somehow I missed that AreFileApisANSI was a Win32 function. I must have mistyped it in my searches the other day. This looks good.

curdeius accepted this revision.Dec 9 2020, 2:08 PM
curdeius added a subscriber: curdeius.

I take your word on what you explained about codepages as I'm not much familiar with it. Otherwise LGTM with some nits.


Or just use it instead of _Str?


Please add comments to longer #endifs.

mstorsjo updated this revision to Diff 310659.Dec 9 2020, 2:19 PM

Applied the suggested changes.

mstorsjo updated this revision to Diff 310661.Dec 9 2020, 2:20 PM
mstorsjo set the repository for this revision to rG LLVM Github Monorepo.

Set the repository, to allow the CI to run.

curdeius added inline comments.Dec 9 2020, 2:44 PM

It's there a patch where you add _w.reserve(...)? If not, this one seems like a good candidate.
If I'm not mistaken, widening can decrease the number of characters, so that might get tricky (like going through the input twice, once to count the output size, then doing the real conversion), but at least a FIXME note would be great.
BTW, you probably know that, MultiByteToWideChar will return the required output buffer size without doing the conversion if you pass 0 as cchWideChar.

mstorsjo added inline comments.Dec 10 2020, 12:17 AM

I guess I could call reserve() here using the length of the utf8 string as size. As converting from utf8 to wchar in most cases will make the input shorter, so the size allocated by reserve() should be enough in most concievable cases. For strings consisting mostly of ascii chars, the actual difference in length shouldn't be much, so it shouldn't hurt much with such a rough estimate.

Yeah I know that MultiByteToWideChar can count the needed output size, this is used in e.g. size_t __size = __char_to_wide(__str, nullptr, 0); further up in the same patch when operating on the native narrow charset. I'd rather not mix MultiByteToWideChar with libc++'s __widen_from_utf8, especially as the exact length isn't needed beforehand here as that conversion allocates more as needed.

curdeius added inline comments.Dec 10 2020, 12:42 AM

I agree that mixing MultiByteToWideChar with __widen_from_utf8 doesn't seem like a good idea.
But I'll be for what you suggested, so just reserve possibly more.

mstorsjo updated this revision to Diff 310795.Dec 10 2020, 1:27 AM

Added calls to reserve() where suitable.

ldionne requested changes to this revision.Dec 14 2020, 3:28 PM
ldionne added a subscriber: ldionne.

LGTM except nits!


I don't think #ifdefing the static_assert message adds anything. It just makes things harder to read IMO. I know this was done the same way elsewhere, but I think we should actually remove the ifdefs in those other places instead.


Please indent includes like:

#if defined(_LIBCPP_WIN32API`
#   define WIN32_LEAN_AND_MEAN
#   include <stuff>
#   include <more stuff>
This revision now requires changes to proceed.Dec 14 2020, 3:28 PM
ldionne set the repository for this revision to rG LLVM Github Monorepo.Dec 14 2020, 3:28 PM
mstorsjo added inline comments.Dec 14 2020, 11:13 PM

Fair enough, I can remove that bit from this patch, and I'll send a separate patch for removing the other ifdefs, and rebase these patches on top of that (where relevant) once that's merged.


Ok, will change.

mstorsjo updated this revision to Diff 311802.Dec 14 2020, 11:25 PM

Indented preprocessor directives within ifdefs (indented with one space, to match other existing similar cases), removed ifdef in assert message.

mstorsjo updated this revision to Diff 311807.Dec 14 2020, 11:42 PM
mstorsjo set the repository for this revision to rG LLVM Github Monorepo.

Reupload to re-set the repository, to trigger CI

mstorsjo updated this revision to Diff 312438.Dec 17 2020, 3:52 AM
mstorsjo removed rG LLVM Github Monorepo as the repository for this revision.
mstorsjo set the repository for this revision to rG LLVM Github Monorepo.


ldionne accepted this revision.Dec 17 2020, 2:04 PM


This revision is now accepted and ready to land.Dec 17 2020, 2:04 PM