This is an archive of the discontinued LLVM Phabricator instance.

Differential D157986

[Cmake] Make sure MSVC knows LLVM source files are UTF-8 encoded
Accepted, Public

Authored by cor3ntin on Aug 15 2023, 7:49 AM.

Details

Reviewers

aaron.ballman
tahonermann

Summary

By default, MSVC assumes source files are encoded with the user
locale's encoding, which can cause build failures.

GCC always uses UTF-8 by default and doesn't need a change.

Fixes #64668

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,050 ms	x64 debian > MLIR.Examples/standalone::test.toy

Event Timeline

cor3ntin created this revision. Aug 15 2023, 7:49 AM

Herald added a project: Restricted Project. Aug 15 2023, 7:49 AM

Herald added a subscriber: ekilmer.

cor3ntin requested review of this revision. Aug 15 2023, 7:49 AM

Herald added a project: Restricted Project. Aug 15 2023, 7:49 AM

Herald added a subscriber: llvm-commits.

cor3ntin added a reviewer: aaron.ballman. Aug 15 2023, 7:50 AM

LGTM! I verified that /source-charset: is supported in the minimum version of MSVC we support (VS 2019).

This revision is now accepted and ready to land. Aug 15 2023, 8:22 AM

Before you land the changes: how would this impact test files that are intentionally not encoded in UTF-8 (assuming we have any to test our failure behavior)? I presume MSVC would load the files up fine and since MSVC isn't compiling them, there's not an issue. But does this option change source editor behavior? e.g., if someone opens the file up, will they get invalid UTF-8 displayed to them? Will saving the file automatically try to encode it as UTF-8?

In D157986#4588645, @aaron.ballman wrote:

Before you land the changes: how would this impact test files that are intentionally not encoded in UTF-8 (assuming we have any to test our failure behavior)?

We have no such test (yet) afaik - but we do have tests that are intentionally not valid utf-8 (which are never opened by MSVC)

I presume MSVC would load the files up fine and since MSVC isn't compiling them, there's not an issue.

But does this option change source editor behavior? e.g., if someone opens the file up, will they get invalid UTF-8 displayed to them? Will saving the file automatically try to encode it as UTF-8?

Nope, the editor has separate options. I believe files are opened in the current locale encoding and saved as such, unless that fails, in which case it falls back to UTF if the right setting is enabled.
The presence of an editorconfig file or a BOM can also affect that.

Okay, that matches my understanding and reading of MSDN, thank you for verifying!

Harbormaster completed remote builds in B252643: Diff 550332. Aug 15 2023, 10:13 AM

Trying to set the execution charset too just to see how the bot reacts.

Harbormaster completed remote builds in B252720: Diff 550440. Aug 15 2023, 3:48 PM

This did the trick; however, it's a bigger change than I was hoping to get away with.

I think the path forward would be to make sure the test binaries, which do have a bunch of UTF string literals used at compile time, are compiled with /execution-charset:utf-8.
My guess is that the clangd tests currently happen to work despite the sources of the tests not being UTF-encoded because, as long as the execution and source encodings match, the bytes that end up in the binaries are identical to the bytes in the sources.

So I need to figure out how to change the flags of the tests specifically.
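
As a rough illustration of scoping the flag to test binaries only, here is a minimal sketch using plain CMake target properties. The project name, the target name demo_tests, and the file demo_tests.cpp are hypothetical and not part of this revision; the diff in this revision instead appends to LLVM_COMPILE_FLAGS inside add_unittest.

```cmake
# Hypothetical stand-alone project, for illustration only.
cmake_minimum_required(VERSION 3.13)
project(CharsetDemo CXX)

# Hypothetical test binary; demo_tests.cpp is assumed to exist.
add_executable(demo_tests demo_tests.cpp)

if(MSVC)
  # Encode narrow string literals of this one target as UTF-8,
  # leaving the flags of every other target untouched.
  target_compile_options(demo_tests PRIVATE "/execution-charset:utf-8")
endif()
```

Because the option is PRIVATE to the target, libraries and tools built in the same tree keep whatever execution charset they had before.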

Only set the execution charset of tests

Harbormaster completed remote builds in B252972: Diff 550787. Aug 16 2023, 12:01 PM

@tahonermann Would you like to take a look at this one? There are weird test failures in clangd that I struggle to explain, and I do not have easy access to a Windows machine to investigate.

I looked at a bunch of the test failures. Most appear to have failed due to a failed attempt to match non-ASCII characters like line drawing characters. It seems that there are some mismatched encoding expectations going on and that non-ASCII characters are sometimes expected to match '?' and sometimes expected to match an escaped representation. For example, the output for test "./ClangdTests.exe/26/38" contains the following (\xE2\x86\x92 corresponds to the UTF-8 representation of "→" (U+2192 RIGHTWARDS ARROW) and "?" is presumably a substituted replacement character).

→ ret_type (aka can_ret_type)
...
-? ret_type (aka can_ret_type)
+\xE2\x86\x92 ret_type (aka can_ret_type)

Since Clang only supports UTF-8 as the execution encoding, perhaps we should do likewise for MSVC and build everything with /utf-8. It looks like you tried that already and the result wasn't good? It might be worth trying again, but with additional options to embed a manifest that sets the active code page to UTF-8 (https://devblogs.microsoft.com/oldnewthing/20220531-00/?p=106697); though that requires at least Windows 10 Version 1903. Do we document a minimum Windows version requirement for building and running LLVM/Clang?
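
For reference, the "/utf-8 everywhere plus manifest" idea could look roughly like the sketch below. This is an assumption-laden illustration, not something tried in this revision: the project name, the target demo_tool, and the file utf8.manifest are hypothetical, and the manifest itself (an application manifest setting activeCodePage to UTF-8, as described in the linked blog post) is not shown.

```cmake
# Hypothetical stand-alone project, for illustration only.
cmake_minimum_required(VERSION 3.13)
project(Utf8Demo CXX)

if(MSVC)
  # /utf-8 sets both the source and the execution character set to UTF-8.
  # It must be added before the targets it should apply to are created.
  add_compile_options("/utf-8")
endif()

# Hypothetical executable; main.cpp is assumed to exist.
add_executable(demo_tool main.cpp)

if(MSVC)
  # Assumption: utf8.manifest is a hand-written application manifest that
  # sets activeCodePage to UTF-8. As far as I know, CMake hands .manifest
  # files listed as sources to the linker when building with MSVC.
  target_sources(demo_tool PRIVATE utf8.manifest)
endif()
```

The compile flag only affects how literals are encoded in the binary; the manifest affects which code page the process uses at run time, which is why the two would have to be combined.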

In D157986#4639985, @tahonermann wrote:

Do we document a minimum Windows version requirement for building and running LLVM/Clang?

No, we don't (not that I could find anyway). We document a minimum version for Visual Studio, but not for Windows. Functionally, I believe Windows 7 is the floor: https://github.com/llvm/llvm-project/blob/65331da0032ab4253a4bc0ddcb2da67664bd86a9/llvm/include/llvm/Support/Windows/WindowsSupport.h#L28

In D157986#4639985, @tahonermann wrote:

I looked at a bunch of the test failures. Most appear to have failed due to a failed attempt to match non-ASCII characters like line drawing characters. It seems that there are some mismatched encoding expectations going on and that non-ASCII characters are sometimes expected to match '?' and sometimes expected to match an escaped representation. For example, the output for test "./ClangdTests.exe/26/38" contains the following (\xE2\x86\x92 corresponds to the UTF-8 representation of "→" (U+2192 RIGHTWARDS ARROW) and "?" is presumably a substituted replacement character).

→ ret_type (aka can_ret_type)
...
-? ret_type (aka can_ret_type)
+\xE2\x86\x92 ret_type (aka can_ret_type)

Since Clang only supports UTF-8 as the execution encoding, perhaps we should do likewise for MSVC and build everything with /utf-8. It looks like you tried that already and the result wasn't good? It might be worth trying again, but with additional options to embed a manifest that sets the active code page to UTF-8 (https://devblogs.microsoft.com/oldnewthing/20220531-00/?p=106697); though that requires at least Windows 10 Version 1903. Do we document a minimum Windows version requirement for building and running LLVM/Clang?

Yes, same test failures.
I wonder if there isn't an issue with the Clangd tests being incorrect, but they do pass on Linux, which makes this hypothesis a bit doubtful.

Revision Contents

	Path	Size
	llvm/cmake/modules/AddLLVM.cmake	4 lines
	llvm/cmake/modules/HandleLLVMOptions.cmake	3 lines

Diff 550787

llvm/cmake/modules/AddLLVM.cmake

 function(add_unittest test_suite test_name)
 ...
   if (SUPPORTS_VARIADIC_MACROS_FLAG)
     list(APPEND LLVM_COMPILE_FLAGS "-Wno-variadic-macros")
   endif ()
   # Some parts of gtest rely on this GNU extension, don't warn on it.
   if(SUPPORTS_GNU_ZERO_VARIADIC_MACRO_ARGUMENTS_FLAG)
     list(APPEND LLVM_COMPILE_FLAGS "-Wno-gnu-zero-variadic-macro-arguments")
   endif()

+  if(MSVC)
+    list(APPEND LLVM_COMPILE_FLAGS "/execution-charset:utf-8")
+  endif()
+
   if (NOT DEFINED LLVM_REQUIRES_RTTI)
     set(LLVM_REQUIRES_RTTI OFF)
   endif()

   list(APPEND LLVM_LINK_COMPONENTS Support) # gtest needs it for raw_ostream
   add_llvm_executable(${test_name} IGNORE_EXTERNALIZE_DEBUGINFO NO_INSTALL_RPATH ${ARGN})

   # The runtime benefits of LTO don't outweight the compile time costs for tests.
 ...

llvm/cmake/modules/HandleLLVMOptions.cmake

 add_compile_definitions(
 ...
   )

   # Tell MSVC to use the Unicode version of the Win32 APIs instead of ANSI.
   add_compile_definitions(
     UNICODE
     _UNICODE
   )

+  # Tell MSVC the LLVM sources are UTF-8 encoded.
+  append("/source-charset:utf-8" CMAKE_C_FLAGS CMAKE_CXX_FLAGS)
+
   if (LLVM_WINSYSROOT)
     if (NOT CLANG_CL)
       message(ERROR "LLVM_WINSYSROOT requires clang-cl")
     endif()
     append("/winsysroot${LLVM_WINSYSROOT}" CMAKE_C_FLAGS CMAKE_CXX_FLAGS)
     if (LINKER_IS_LLD_LINK)
       append("/winsysroot:${LLVM_WINSYSROOT}"
         CMAKE_EXE_LINKER_FLAGS CMAKE_MODULE_LINKER_FLAGS
 ...