By default, MSVC assumes source files are encoded with the user
locale's encoding, which can cause build failures.
GCC always uses UTF-8 by default and doesn't need a change.
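To illustrate why the assumed source encoding matters, here is a minimal Python sketch. It is not part of the patch. It assumes the user locale's code page is Windows-1252, which is only one possible configuration, and the variable names are hypothetical.

```python
# UTF-8 bytes of a string literal containing U+2192 RIGHTWARDS ARROW,
# as they would be stored in a UTF-8 encoded source file.
source_bytes = "→ ret_type".encode("utf-8")

# Assumption: the user locale's code page is Windows-1252.
# Decoding the same bytes with the wrong encoding yields mojibake.
as_locale = source_bytes.decode("cp1252")

# Decoding with the encoding the file was actually written in.
as_utf8 = source_bytes.decode("utf-8")

print(repr(as_locale))
print(repr(as_utf8))
```

The two decoded strings differ, which is roughly what happens to non-ASCII literals when a compiler reads a UTF-8 file under a different assumed encoding.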
Authored by cor3ntin on Aug 15 2023, 7:49 AM.
Before you land the changes: how would this impact test files that are intentionally not encoded in UTF-8 (assuming we have any to test our failure behavior)? I presume MSVC would load the files up fine and since MSVC isn't compiling them, there's not an issue. But does this option change source editor behavior? e.g., if someone opens the file up, will they get invalid UTF-8 displayed to them? Will saving the file automatically try to encode it as UTF-8?
We have no such test (yet) afaik - but we do have tests that are intentionally not valid UTF-8 (which are never opened by MSVC).
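For reference, a quick way to check whether a given file's contents are valid UTF-8 could look like the sketch below. The helper name `is_valid_utf8` is hypothetical and is not something that exists in the LLVM tree.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

A test input that is intentionally malformed, such as a truncated multi-byte sequence, would be reported as invalid by this check.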
> I presume MSVC would load the files up fine and since MSVC isn't compiling them, there's not an issue.
Nope, the editor has separate options. I believe files are opened in the current locale encoding and saved as such, unless that fails, in which case it falls back to UTF if the right setting is enabled.
This did the trick, however it's a bigger change than what I was hoping to get away with.
I think the path forward would be to make sure the test binaries, which do have a bunch of UTF string literals used at compile time, are compiled with /execution-charset:utf-8.
So I need to figure out how to change the flags of the tests specifically.
I looked at a bunch of the test failures. Most appear to have failed due to a failed attempt to match non-ASCII characters like line drawing characters. It seems that there are some mismatched encoding expectations going on and that non-ASCII characters are sometimes expected to match '?' and sometimes expected to match an escaped representation. For example, the output for test "./ClangdTests.exe/26/38" contains the following (\xE2\x86\x92 corresponds to the UTF-8 representation of "→" (U+2192 RIGHTWARDS ARROW) and "?" is presumably a substituted replacement character).
→ ret_type (aka can_ret_type)
...
-? ret_type (aka can_ret_type)
+\xE2\x86\x92 ret_type (aka can_ret_type)
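Both forms seen in that output can be reproduced with a small Python sketch. This is only an illustration under the assumption that the "?" comes from a lossy conversion to a non-UTF-8 code page (Windows-1252 is used here as a stand-in), and that the escaped form is a byte-wise dump of the UTF-8 encoding. It does not show how the test harness itself produces the output.

```python
arrow = "\u2192"  # U+2192 RIGHTWARDS ARROW

# The UTF-8 encoding of the arrow, printed byte by byte as \xNN escapes.
utf8_bytes = arrow.encode("utf-8")
escaped = "".join(f"\\x{b:02X}" for b in utf8_bytes)

# Assumption: a lossy conversion to Windows-1252, which has no arrow
# character, substitutes "?" for it.
replaced = arrow.encode("cp1252", errors="replace").decode("cp1252")

print(escaped)
print(replaced)
```

If the expected and actual strings go through different conversions like these, the comparison fails even though both started from the same character.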
Since Clang only supports UTF-8 as the execution encoding, perhaps we should do likewise for MSVC and build everything with /utf-8. It looks like you tried that already and the result wasn't good? It might be worth trying again, but with additional options to embed a manifest that sets the active code page to UTF-8 (https://devblogs.microsoft.com/oldnewthing/20220531-00/?p=106697); though that requires at least Windows 10 Version 1903. Do we document a minimum Windows version requirement for building and running LLVM/Clang?
No, we don't (not that I could find anyway). We document a minimum version for Visual Studio, but not for Windows. Functionally, I believe Windows 7 is the floor: https://github.com/llvm/llvm-project/blob/65331da0032ab4253a4bc0ddcb2da67664bd86a9/llvm/include/llvm/Support/Windows/WindowsSupport.h#L28