On windows, the narrow, char based paths normally don't use utf8, but can use many different native code pages, and this is what system functions that operate on files, taking such paths/file names, interpret them as.
amccarth EricWF curdeius ldionne
- Group Reviewers
- rGde698ae73444: [libcxx] Convert paths to/from the right narrow code page for narrow strings on…
The Windows details look correct, though I have a couple questions on the code page conversions.
I'm not clear on what AreFileApisANSI does, nor have I found it using code search. In what case would any APIs use the OEM code page when the user's code page is something else?
I'm also wondering whether CP_ACP should be CP_THREAD_ACP, which, by default, will be the user's current code page which could be different than the system code page.
See D91133 for a testcase (which passes with MSVC STL) for this and some more - a process can switch between whether the narrow file apis take OEM CP or ACP with SetFileApisToANSI() and SetFileApisToOEM().
The MSVC STL sources don't seem to use CP_THREAD_ACP at least... however they also check for ___lc_codepage_func() == CP_UTF8, which I guess we also should...
To follow up on the last bit regarding ___lc_codepage_func(), contrary to SetFileApisToOEM() which operates on the kernel32 level, affecting how all narrow file names are interpreted, one can also do setlocale(LC_ALL, ".utf8");, which makes e.g. fopen() work with utf8 file names, but it has no effect on actual kernel32 -A suffixed file APIs.
But in any case, I guess adding the extra check for ___lc_codepage_func(), like ___lc_codepage_func() == CP_UTF8 ? CP_UTF8 : AreFileApisANSI() ? CP_ACP : CP_OEMCP.
After reading up on CP_THREAD_ACP, I don't think that's relevant. The main point is that if I export a narrow form of a filename from a std::filesystem::path object (or create a path object from a narrow string), I expect the string to be in the same codepage that fopen() or CreateFileA() would accept and interpreted in the same way, and as far as I'm reading about CP_THREAD_ACP, it only affects other bits, but not how kernel32 file APIs map narrow chars to wchar.
It's there a patch where you add _w.reserve(...)? If not, this one seems like a good candidate.
I guess I could call reserve() here using the length of the utf8 string as size. As converting from utf8 to wchar in most cases will make the input shorter, so the size allocated by reserve() should be enough in most concievable cases. For strings consisting mostly of ascii chars, the actual difference in length shouldn't be much, so it shouldn't hurt much with such a rough estimate.
Yeah I know that MultiByteToWideChar can count the needed output size, this is used in e.g. size_t __size = __char_to_wide(__str, nullptr, 0); further up in the same patch when operating on the native narrow charset. I'd rather not mix MultiByteToWideChar with libc++'s __widen_from_utf8, especially as the exact length isn't needed beforehand here as that conversion allocates more as needed.
LGTM except nits!
I don't think #ifdefing the static_assert message adds anything. It just makes things harder to read IMO. I know this was done the same way elsewhere, but I think we should actually remove the ifdefs in those other places instead.
Please indent includes like:
#if defined(_LIBCPP_WIN32API` # define WIN32_LEAN_AND_MEAN # include <stuff> #else # include <more stuff> #endif