This is an archive of the discontinued LLVM Phabricator instance.

[Cmake] Make sure MSVC knows LLVM source files are UTF-8 encoded
AcceptedPublic

Authored by cor3ntin on Aug 15 2023, 7:49 AM.

Details

Summary

By default, MSVC assumes source files are encoded with the user
locale's encoding, which can cause build failures.

GCC always use UTF-8 by default and doesn't need change.

Fixes #64668

Diff Detail

Unit TestsFailed

Event Timeline

cor3ntin created this revision.Aug 15 2023, 7:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 15 2023, 7:49 AM
Herald added a subscriber: ekilmer. · View Herald Transcript
cor3ntin requested review of this revision.Aug 15 2023, 7:49 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 15 2023, 7:49 AM
aaron.ballman accepted this revision.Aug 15 2023, 8:22 AM

LGTM! I verified that /source-charset: is supported in our current minimum version of MSVC we support (VS 2019).

This revision is now accepted and ready to land.Aug 15 2023, 8:22 AM

Before you land the changes: how would this impact test files that are intentionally not encoded in UTF-8 (assuming we have any to test our failure behavior)? I presume MSVC would load the files up fine and since MSVC isn't compiling them, there's not an issue. But does this option change source editor behavior? e.g., if someone opens the file up, will they get invalid UTF-8 displayed to them? Will saving the file automatically try to encode it as UTF-8?

Before you land the changes: how would this impact test files that are intentionally not encoded in UTF-8 (assuming we have any to test our failure behavior)?

We have no such test (yet) afaik - but we do have tests that are intentionally not valid utf-8 (which are never opened by MSVC)

I presume MSVC would load the files up fine and since MSVC isn't compiling them, there's not an issue.

But does this option change source editor behavior? e.g., if someone opens the file up, will they get invalid UTF-8 displayed to them? Will saving the file automatically try to encode it as UTF-8?

Nope, the editor has separate options. I believe files are opened in the current locale encoding and save as such, unless that fails in which case it falls back to utf if the right setting is enable.
presence of an editorconfig file or a BOM can also affect that.

Before you land the changes: how would this impact test files that are intentionally not encoded in UTF-8 (assuming we have any to test our failure behavior)?

We have no such test (yet) afaik - but we do have tests that are intentionally not valid utf-8 (which are never opened by MSVC)

I presume MSVC would load the files up fine and since MSVC isn't compiling them, there's not an issue.

But does this option change source editor behavior? e.g., if someone opens the file up, will they get invalid UTF-8 displayed to them? Will saving the file automatically try to encode it as UTF-8?

Nope, the editor has separate options. I believe files are opened in the current locale encoding and save as such, unless that fails in which case it falls back to utf if the right setting is enable.
presence of an editorconfig file or a BOM can also affect that.

Okay, that matches my understanding and reading of MSDN, thank you for verifying!

cor3ntin updated this revision to Diff 550440.Aug 15 2023, 1:01 PM

Trying to set the execution charset too just to see how the bot reacts.

This did the trick, however it's a bigger change than what i was hoping to get away with.

I think the path forward word be to make sure the test binaries, which do have a bunch of UTF string literals used at compile times are compiled with /exec-charset:utf-8.
My guess is that clangd tests currently _happen-to-work_ despite the sources of the test not being utf-encoded because as long as execution and source encoding match some bytes end up in the binaries which are equal to themselves.

So I need to figure out how to change the flags of the tests specifically.

cor3ntin updated this revision to Diff 550787.Aug 16 2023, 9:25 AM

Only set the execution charset of tests

@tahonermann Would you like to take a look at this one? there are weird test failures in clandd that i struggle to explain, and I do not have easy access to a windows machine to investigate

I looked at a bunch of the test failures. Most appear to have failed due to a failed attempt to match non-ASCII characters like line drawing characters. It seems that there are some mismatched encoding expectations going on and that non-ASCII characters are sometimes expected to match '?' and sometimes expected to match an escaped representation. For example, the output for test "./ClangdTests.exe/26/38" contains the following (\xE2\x86\x92 corresponds to the UTF-8 representation of "→" (U+2192 RIGHTWARDS ARROW) and "?" is presumably a substituted replacement character).

→ ret_type (aka can_ret_type)
...
-? ret_type (aka can_ret_type)
+\xE2\x86\x92 ret_type (aka can_ret_type)

Since Clang only supports UTF-8 as the execution encoding, perhaps we should do likewise for MSVC and build everything with /utf-8. It looks like you tried that already and the result wasn't good? It might be worth trying again, but with additional options to embed a manifest that sets the active code page to UTF-8 (https://devblogs.microsoft.com/oldnewthing/20220531-00/?p=106697); though that requires at least Windows 10 Version 1903. Do we document a minimum Windows version requirement for building and running LLVM/Clang?

Do we document a minimum Windows version requirement for building and running LLVM/Clang?

No, we don't (not that I could find anyway). We document a minimum version for Visual Studio, but not for Windows. Functionally, I believe Windows 7 is the floor: https://github.com/llvm/llvm-project/blob/65331da0032ab4253a4bc0ddcb2da67664bd86a9/llvm/include/llvm/Support/Windows/WindowsSupport.h#L28

I looked at a bunch of the test failures. Most appear to have failed due to a failed attempt to match non-ASCII characters like line drawing characters. It seems that there are some mismatched encoding expectations going on and that non-ASCII characters are sometimes expected to match '?' and sometimes expected to match an escaped representation. For example, the output for test "./ClangdTests.exe/26/38" contains the following (\xE2\x86\x92 corresponds to the UTF-8 representation of "→" (U+2192 RIGHTWARDS ARROW) and "?" is presumably a substituted replacement character).

→ ret_type (aka can_ret_type)
...
-? ret_type (aka can_ret_type)
+\xE2\x86\x92 ret_type (aka can_ret_type)

Since Clang only supports UTF-8 as the execution encoding, perhaps we should do likewise for MSVC and build everything with /utf-8. It looks like you tried that already and the result wasn't good? It might be worth trying again, but with additional options to embed a manifest that sets the active code page to UTF-8 (https://devblogs.microsoft.com/oldnewthing/20220531-00/?p=106697); though that requires at least Windows 10 Version 1903. Do we document a minimum Windows version requirement for building and running LLVM/Clang?

Yes, same test failures.
I wonder if there isn't an issue with the Clangd tests being incorrect - but they do pass on linux, which make this hypothesis a bit suspicious.