This is an archive of the discontinued LLVM Phabricator instance.

Default to using the Unicode version of Win32 APIs
ClosedPublic

Authored by aaron.ballman on Jun 23 2016, 8:05 AM.

Details

Summary

We currently default to using the multibyte character versions of Win32 APIs, commonly suffixed with an "A". However, this causes problems because many of the Win32 APIs that we use are for things like file paths, where non-ASCII characters are commonly expected. This patch defaults us to using the "W" versions of the Win32 APIs, which helps to remind us to properly handle non-ASCII input. We can still fall back on the multibyte APIs by explicitly specifying the "A" version of the function, but we no longer run into issues where we call a Win32 API without a suffix and accidentally wind up using the A version on UTF-8 data (which will fail) instead of using the W version on UTF-16 data (which will succeed).

I have tested this patch with MSVC 2015 and LLVM, Clang, and clang-tools-extra all compile cleanly (after a few cleanup patches) and all tests pass. I have not tried other projects as I do not have them locally, but hopefully they: work with this patch as-is, require minimal changes to support this patch, or can override the behavior with their own CMake settings. Before committing, I plan to make sure these projects all continue to compile, but I wanted to make sure the community was okay with this direction before doing that work.

Diff Detail

Event Timeline

aaron.ballman retitled this revision from to Default to using the Unicode version of Win32 APIs.
aaron.ballman updated this object.
rnk accepted this revision.Jun 23 2016, 8:20 AM
rnk edited edge metadata.

Sure, this serves as a reminder that you should always convert from UTF-8 to wide in Windows support code. I still think we should always explicitly call the wide variants, and it seems like you agree. Defining these macros is just a way to ensure we don't accidentally regress.

Can you build asan with this change? I suspect it will be sensitive to it because it wraps the win32 API directly rather than going through Support.

This revision is now accepted and ready to land.Jun 23 2016, 8:20 AM

Should we add overloads of the UTF8 conversion functions that accept
wstrings?

In line with what rnk said, I'm curious about the implications of simply
#undef'ing all the non A/W versions of functions so that you have to make
the choice explicit

In D21643#465579, @rnk wrote:

Sure, this serves as a reminder that you should always convert from UTF-8 to wide in Windows support code. I still think we should always explicitly call the wide variants, and it seems like you agree. Defining these macros is just a way to ensure we don't accidentally regress.

Exactly! This patch was motivated by a recent bug I found where we used an unprefixed function that defaulted to ANSI instead of Wide.

Can you build asan with this change? I suspect it will be sensitive to it because it wraps the win32 API directly rather than going through Support.

I have never successfully built compiler-rt with MSVC. Every time I've tried (the last time was 6+ months ago, so I will try again), the MSVC solution generated by CMake refuses to compile. From what I understand, it basically only works with ninja, which I do not have installed. :-(

If you wouldn't mind giving it a shot, I would be curious to know what the state of it is with this change.

rnk added a comment.Jun 23 2016, 8:28 AM

I have never successfully built compiler-rt with MSVC. Every time I've tried (the last time was 6+ months ago, so I will try again), the MSVC solution generated by CMake refuses to compile. From what I understand, it basically only works with ninja, which I do not have installed. :-(

Etienne and Wei started working on ASan, and they got the ASan build working in MSBuild a few months ago, so I'd give it another shot. Let me know if you can't build it, and I can take a moment to patch in this change.

Should we add overloads of the UTF8 conversion functions that accept
wstrings?

I think we should eventually consider cleaning up our Unicode APIs because they're kind of all over the place. For instance, WindowsSupport.h has a different set of APIs for handling Unicode conversions that compete with the ones exposed by ConvertUTF.h.

In line with what rnk said, I'm curious about the implications of simply
#undef'ing all the non A/W versions of functions so that you have to make
the choice explicit

I'm not certain I follow along. Are you suggesting we explicitly #undef something like RegQueryValueEx so that you must explicitly call RegQueryValueExW or RegQueryValueExA, as needed? That seems like something that will quickly get forgotten about and not really used.

Commit in r273599.