This is an archive of the discontinued LLVM Phabricator instance.

[llvm] CMake: Force MSVC to read code as UTF-8
ClosedPublic

Authored by kbobyrev on Oct 25 2020, 3:58 AM.

Diff Detail

Event Timeline

kbobyrev created this revision.Oct 25 2020, 3:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 25 2020, 3:58 AM
kbobyrev requested review of this revision.Oct 25 2020, 3:58 AM
kbobyrev updated this revision to Diff 300535.Oct 25 2020, 6:42 AM

Use u8 string literals.

I have face this problem long time ago.
This happens only:

  • on non-English Windows (and could be fixed via system settings: Control Panel -> Clock,Language,and Region -> Region and Language -> Administrative ->Language for non-Unicode programs -> Change system locale -> English).
  • with Visual Studio Compiler

Seems, there are other places in LLVM with this problem (unsure that this is a complete list):
https://github.com/llvm/llvm-project/blob/master/clang-tools-extra/clangd/unittests/SourceCodeTests.cpp#L53
https://github.com/llvm/llvm-project/blob/master/llvm/unittests/Support/JSONTest.cpp#L240

I think we should update cmake rules to explicitly tell the compilers that the source code language is utf8.

https://docs.microsoft.com/en-us/cpp/build/reference/utf-8-set-source-and-executable-character-sets-to-utf-8?view=vs-2019 seems to be the way for MSVC.
I am not sure if there's a way to change that for clang/gcc. I I believe they both require plain ascii or utf-8 anyways.

Using explicit UTF-8 string literals is a possible solution, but it makes the code a bit less readable. Another possible solution is to save the source file using UTF-8 with BOM, but this is confusing outside the Microsoft world (and it's very easy to remove the BOM by mistake).

I think passing /utf-8 to MSVC is the best solution for good interoperability with Clang.

I am not sure if there's a way to change that for clang/gcc. I I believe they both require plain ascii or utf-8 anyways.

Clang enforces UTF-8 everywhere so there's no need for additional configuration. Clang can also accept source files encoded in UTF-8 with BOM. I'm not sure about GCC, I think you need to enforce the encoding manually like in MSVC (see the -finput-charset and -fexec-charset options).

I recommend reading this article about how MSVC interprets the encoding of source files, the casuistic is a bit complex: https://devblogs.microsoft.com/cppblog/new-options-for-managing-character-sets-in-the-microsoft-cc-compiler/

This isn't the only place where we use such characters. While I'm not sure spelling these literally is a hill we want to die on, I think u8 literals are so poorly understood that trying to do this consistently isn't going to be that simple.

As noted, clang is always reading as utf-8. gcc is more complicated but we have no reports of this problem. So I think we should assume this is msvc-only and if we can find a CMake workaround we should take it.

Hmm, I see. From the looks of it, the solution for several projects would be

add_compile_options("$<$<C_COMPILER_ID:MSVC>:/utf-8>")
add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")

But I'm not sure if it makes sense in our case and I don't see many add_compile_options in LLVM. Also, I don't have a way to test it out.

kbobyrev planned changes to this revision.Oct 27 2020, 12:44 AM

Hmm, I see. From the looks of it, the solution for several projects would be

add_compile_options("$<$<C_COMPILER_ID:MSVC>:/utf-8>")
add_compile_options("$<$<CXX_COMPILER_ID:MSVC>:/utf-8>")

But I'm not sure if it makes sense in our case and I don't see many add_compile_options in LLVM. Also, I don't have a way to test it out.

  • I brought up the environment to reproduce this problem and can confirm, that this solution with add_compile_options works.
  • Adding -DCMAKE_CXX_FLAGS="/utf-8" -DCMAKE_C_FLAGS="/utf-8" into cmake command also helps (so, maybe we can update build documentation for MSVC (here? https://llvm.org/docs/CMake.html#microsoft-visual-c ) instead of changes inside llvm/CMakeLists.txt)
This comment was removed by ilya-golovenko.
kbobyrev updated this revision to Diff 303753.Nov 8 2020, 11:20 PM

Force MSVC to read code as UTF-8 via CMake options.

Herald added a project: Restricted Project. · View Herald TranscriptNov 8 2020, 11:20 PM
kbobyrev updated this revision to Diff 303754.Nov 8 2020, 11:20 PM

Rebase on top of master.

kbobyrev retitled this revision from [clangd] Escape Unicode characters to fix Windows builds to [llvm] CMake: Force MSVC to read code as UTF-8.Nov 8 2020, 11:21 PM
kbobyrev edited the summary of this revision. (Show Details)
ArcsinX added inline comments.Nov 8 2020, 11:26 PM
llvm/CMakeLists.txt
600

I think we could replace these two lines with the one line

add_compile_options(/utf-8)

Because we could not get here with non-MSVC compiler

kbobyrev updated this revision to Diff 303942.Nov 9 2020, 11:11 AM
kbobyrev marked an inline comment as done.

Omit CXX_COMPILER_ID:MSVC since we're within MSVC section.

kadircet accepted this revision.Nov 9 2020, 1:37 PM

thanks, lgtm.

This revision is now accepted and ready to land.Nov 9 2020, 1:37 PM
This revision was landed with ongoing or failed builds.Nov 9 2020, 1:47 PM
This revision was automatically updated to reflect the committed changes.

I have double checked that I do not have such errors with Visual Studio 2019 (just built compiler-rt with your patch)

I am not sure is it a bug of VS 2017 or not, but according to official documentation execution-character-set is obsolete since VS 2015 update 2 ( https://docs.microsoft.com/en-us/cpp/preprocessor/execution-character-set?view=msvc-160 ), so it's strange to me why it's still in use inside VS 2017 system headers.