This is an archive of the discontinued LLVM Phabricator instance.

[Tooling] Parse compilation database command lines properly on Windows
ClosedPublic

Authored by zturner on Aug 12 2016, 10:46 AM.

Details

Summary

When a compilation database is used on Windows, the command lines cannot be parsed using the standard GNU style syntax. LLVM provides functions for parsing Windows style command lines, so use them.

This is a break-off from D23409. Just uploading this here for now as a placeholder, will try to work on a test for this in the meantime.

Diff Detail

Event Timeline

zturner updated this revision to Diff 67856.Aug 12 2016, 10:46 AM
zturner retitled this revision from to [Tooling] Parse compilation database command lines properly on Windows.
zturner updated this object.
zturner added reviewers: alexfh, djasper.
zturner added a subscriber: cfe-commits.
rnk added a subscriber: rnk.Aug 12 2016, 10:53 AM

We should convince cmake to emit "arguments" instead of "command" so that we don't have this ambiguity.

I mentioned offline that we could detect CRLF or LF and heuristically decide whether it's a windows compilation database. That seems like a horrible hack, so failing that, yes having CMake do it for us would be better.

zturner added a comment.EditedAug 12 2016, 2:37 PM

This is starting to get fairly difficult, and I'm afraid it may require someone more knowledgable about clang-format's internals than me. I wrote a test to have it use a fixed compilation database, and I was able to make that test pass, but it breaks many other tests. I will go ahead and upload that patch anyway just so you can see what I did.

What's weird is that it doesn't seem to break clang-tidy itself, just the tests. For example, here's what I get when I run the following command:

D:\src\llvmbuild\ninja>"bin\clang-tidy" "--checks=-*,google-readability-casting" "-header-filter=.*" "D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c" "--" "-ID:\src\llvm\tools\clang\tools\extra\test\clang-tidy" "-DTEST_INCLUDE" "-x" "c++"
1 warning generated.
D:\src\llvm\tools\clang\tools\extra\test\clang-tidy/google-readability-casting.c:16:22: warning: redundant cast to the same type [google-readability-casting]
  const char *cpc2 = (const char*)cpc;
                     ^

It appears to work fine.

But when I run the test, it complains about trying to copy some file which seems unrelated to my change and more about the test infrastructure.

D:\src\llvmbuild\ninja>python bin\llvm-lit.py -sv d:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c
-- Testing: 1 tests, 1 threads --
Testing: 0 .. 10.. 20.. 30.. 40.. 50.. 60.. 70.. 80.. 90..
FAIL: Clang Tools :: clang-tidy/google-readability-casting.c (1 of 1)
******************** TEST 'Clang Tools :: clang-tidy/google-readability-casting.c' FAILED ********************
Script:
--
C:/Python27/python.exe D:/src/llvm/tools/clang/tools/extra/test/../test\clang-tidy\check_clang_tidy.py D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c google-readability-casting D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp -- -- -x c
clang-tidy --checks=-*,google-readability-casting D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c -- -x c++ | FileCheck D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c -check-prefix=CHECK-MESSAGES -implicit-check-not='{{warning|error}}:'
cp D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.main_file.cpp
clang-tidy --checks=-*,google-readability-casting -header-filter='.*' D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.main_file.cpp -- -ID:\src\llvm\tools\clang\tools\extra\test\clang-tidy -DTEST_INCLUDE -x c++ | FileCheck D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c -check-prefix=CHECK-MESSAGES -implicit-check-not='{{warning|error}}:'
--
Exit Code: 1

Command Output (stdout):
--
$ "C:/Python27/python.exe" "D:/src/llvm/tools/clang/tools/extra/test/../test\clang-tidy\check_clang_tidy.py" "D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c" "google-readability-casting" "D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp" "--" "--" "-x" "c"
# command output:
Running ['clang-tidy', 'D:\\src\\llvmbuild\\ninja\\tools\\clang\\tools\\extra\\test\\clang-tidy\\Output\\google-readability-casting.c.tmp.c', '-fix', '--checks=-*,google-readability-casting', '--', '-x', 'c']...
------------------------ clang-tidy output -----------------------
1 warning generated.
D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.c:16:22: warning: redundant cast to the same type [google-readability-casting]
  const char *cpc2 = (const char*)cpc;
                     ^
D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.c:16:22: note: FIX-IT applied suggested code changes
  const char *cpc2 = (const char*)cpc;
                     ^
clang-tidy applied 1 of 1 suggested fixes.

------------------------------------------------------------------
------------------------------ Fixes -----------------------------
--- D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.c.orig       2016-08-12 14:29:14.035738300 -0700
+++ D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.c    2016-08-12 14:29:14.133810600 -0700
@@ -13,7 +13,7 @@
 #else

 void f(const char *cpc) {
-  const char *cpc2 = (const char*)cpc;
+  const char *cpc2 = cpc;
   //
   //
   char *pc = (char*)cpc;

------------------------------------------------------------------

$ "clang-tidy" "--checks=-*,google-readability-casting" "D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c" "--" "-x" "c++"
# command stderr:
1 warning generated.

$ "FileCheck" "D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c" "-check-prefix=CHECK-MESSAGES" "-implicit-check-not={{warning|error}}:"
$ "cp" "D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c" "D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.main_file.cpp"
$ "clang-tidy" "--checks=-*,google-readability-casting" "-header-filter=.*" "D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.main_file.cpp" "--" "-ID:\src\llvm\tools\clang\tools\extra\test\clang-tidy" "-DTEST_INCLUDE" "-x" "c++"
# command stderr:
1 error generated.
Error while processing D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.main_file.cpp.

$ "FileCheck" "D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c" "-check-prefix=CHECK-MESSAGES" "-implicit-check-not={{warning|error}}:"
# command stderr:
D:\src\llvm\tools\clang\tools\extra\test\clang-tidy\google-readability-casting.c:17:21: error: expected string not found in input
 // CHECK-MESSAGES: :[[@LINE-1]]:22: warning: redundant cast to the same type [google-readability-casting]
                    ^

Error below
vvvvvvvv

<stdin>:1:1: note: scanning from here
D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.main_file.cpp:11:10: error: 'google-readability-casting.c' file not found [clang-diagnostic-error]
^
<stdin>:1:1: note: with expression "@LINE-1" equal to "16"
D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.main_file.cpp:11:10: error: 'google-readability-casting.c' file not found [clang-diagnostic-error]
^
<stdin>:1:84: note: possible intended match here
D:\src\llvmbuild\ninja\tools\clang\tools\extra\test\clang-tidy\Output\google-readability-casting.c.tmp.main_file.cpp:11:10: error: 'google-readability-casting.c' file not found [clang-diagnostic-error]
                                                                                   ^

error: command failed with exit status: 1

--

********************
Testing Time: 0.66s
********************
Failing Tests (1):
    Clang Tools :: clang-tidy/google-readability-casting.c

  Unexpected Failures: 1

I'm not really sure what to make of this.

Any assistance would be appreciated. I will update the diff with what I came up with.

As a temporary solution, since the patch here could not possibly regress anything since it leaves the code unchanged on all the paths where it is already supported, perhaps this can go in without a test until I have time to understand the codebase better and figure out something better.

zturner updated this revision to Diff 67915.Aug 12 2016, 2:38 PM

Update with changes needed to make fixed compilation database work (which also breaks many other tests)

zturner updated this revision to Diff 67947.Aug 12 2016, 5:53 PM

After much head banging I think I finally figured this out. I'm putting this back to my original revision, unchanged. The problem is not the original patch. The original patch is perfectly fine. The problem is the way i was writing my test. This is just updating the revision back to its original state. I can't make a cross-repo diff, so I will make a new diff with the test.

alexfh added inline comments.Aug 13 2016, 10:35 AM
lib/Tooling/JSONCompilationDatabase.cpp
119

As noted on D23409, this will likely break mingw compatibility in clang on windows. We should either add a sort of auto-detection of the command line format, or extend the JSON compilation database with a way to specify the command line format used (or as Reid suggests, specify a list of arguments, but this should be done in a backward-compatible way).

+cmake people for the feasibility of emitting 'arguments' instead of 'command' into the JSON compilation database.

the feasibility of emitting 'arguments' instead of 'command' into the JSON compilation database.

CMake constructs the command lines internally using string replacement on templates. We never actually know the exact arguments. Therefore providing arguments instead of the whole command would require parsing to be done on the CMake side instead. This is theoretically possible because we do know the shell for which we are generating (Windows cmd versus MSYS sh). However, it may also require a bunch of logic we don't have yet but that LLVM does.

Alternatively, the JSON could have an additional command_shell="..." field that indicates the shell for which the command line is encoded.

rnk added a comment.Aug 15 2016, 11:26 AM

the feasibility of emitting 'arguments' instead of 'command' into the JSON compilation database.

CMake constructs the command lines internally using string replacement on templates. We never actually know the exact arguments. Therefore providing arguments instead of the whole command would require parsing to be done on the CMake side instead. This is theoretically possible because we do know the shell for which we are generating (Windows cmd versus MSYS sh). However, it may also require a bunch of logic we don't have yet but that LLVM does.

Alternatively, the JSON could have an additional command_shell="..." field that indicates the shell for which the command line is encoded.

Bummer. Given that this is hard to do in CMake, then I think we should just tokenize in Clang. Let's use llvm::sys::getProcessTriple() instead of LLVM_ON_WIN32 and check if that is an MSVC environment as a proxy for the shell type.

alexfh edited edge metadata.Aug 16 2016, 9:25 AM
In D23455#515527, @rnk wrote:

the feasibility of emitting 'arguments' instead of 'command' into the JSON compilation database.

CMake constructs the command lines internally using string replacement on templates. We never actually know the exact arguments. Therefore providing arguments instead of the whole command would require parsing to be done on the CMake side instead. This is theoretically possible because we do know the shell for which we are generating (Windows cmd versus MSYS sh). However, it may also require a bunch of logic we don't have yet but that LLVM does.

Alternatively, the JSON could have an additional command_shell="..." field that indicates the shell for which the command line is encoded.

Bummer. Given that this is hard to do in CMake, then I think we should just tokenize in Clang. Let's use llvm::sys::getProcessTriple() instead of LLVM_ON_WIN32 and check if that is an MSVC environment as a proxy for the shell type.

Do I understand correctly that llvm::sys::getProcessTriple() returns basically a compile-time constant? If yes, it won't allow the same clang tool binary to be used with both command line formats. Should we instead allow runtime selection of the command line format? For example, by extending JSON compilation database with the aforementioned command_shell="..." option.

alexfh edited edge metadata.Aug 16 2016, 9:25 AM
alexfh added a subscriber: aaron.ballman.

Adding Aaron, since he seems to have interest in Windows support for clang-tidy.

In D23455#515527, @rnk wrote:

the feasibility of emitting 'arguments' instead of 'command' into the JSON compilation database.

CMake constructs the command lines internally using string replacement on templates. We never actually know the exact arguments. Therefore providing arguments instead of the whole command would require parsing to be done on the CMake side instead. This is theoretically possible because we do know the shell for which we are generating (Windows cmd versus MSYS sh). However, it may also require a bunch of logic we don't have yet but that LLVM does.

Alternatively, the JSON could have an additional command_shell="..." field that indicates the shell for which the command line is encoded.

Bummer. Given that this is hard to do in CMake, then I think we should just tokenize in Clang. Let's use llvm::sys::getProcessTriple() instead of LLVM_ON_WIN32 and check if that is an MSVC environment as a proxy for the shell type.

Do I understand correctly that llvm::sys::getProcessTriple() returns basically a compile-time constant? If yes, it won't allow the same clang tool binary to be used with both command line formats. Should we instead allow runtime selection of the command line format? For example, by extending JSON compilation database with the aforementioned command_shell="..." option.

It does return a compile time constant, but it is a compile time constant representing the platform that clang-tidy is running on. I don't see a need to have a single clang-tidy binary be able to parse both command line formats. In fact, it seems actively harmful.

If you are running on a Windows system with windows command line parsing rules, and someone has generated a command line on linux, then this command line was never intended to be run on Windows in the first place. The JSON compilation database spec already says that the command line represents "the exact command to be run in the environment of the build system". By adding the flexibility to change the environment, this is no longer true. If I try to run a command generated for one build system in the environment of another build system, even if I tokenize it correctly whose to say it will work?

I understand the desire to make sure we get the right change so we don't have to revisit this in the future, but to me this sounds like YAGNI and actually increasing the risk of someone using the software and getting unexpected results, not less risk.

+Daniel dilts who wrote the arguments support

One problem is that getting into the game of parsing arbitrary shell modes is rather ugly.

In D23455#515527, @rnk wrote:

the feasibility of emitting 'arguments' instead of 'command' into the JSON compilation database.

CMake constructs the command lines internally using string replacement on templates. We never actually know the exact arguments. Therefore providing arguments instead of the whole command would require parsing to be done on the CMake side instead. This is theoretically possible because we do know the shell for which we are generating (Windows cmd versus MSYS sh). However, it may also require a bunch of logic we don't have yet but that LLVM does.

Alternatively, the JSON could have an additional command_shell="..." field that indicates the shell for which the command line is encoded.

Bummer. Given that this is hard to do in CMake, then I think we should just tokenize in Clang. Let's use llvm::sys::getProcessTriple() instead of LLVM_ON_WIN32 and check if that is an MSVC environment as a proxy for the shell type.

Do I understand correctly that llvm::sys::getProcessTriple() returns basically a compile-time constant? If yes, it won't allow the same clang tool binary to be used with both command line formats. Should we instead allow runtime selection of the command line format? For example, by extending JSON compilation database with the aforementioned command_shell="..." option.

It does return a compile time constant, but it is a compile time constant representing the platform that clang-tidy is running on. I don't see a need to have a single clang-tidy binary be able to parse both command line formats. In fact, it seems actively harmful.

If you are running on a Windows system with windows command line parsing rules, and someone has generated a command line on linux, then this command line was never intended to be run on Windows in the first place. The JSON compilation database spec already says that the command line represents "the exact command to be run in the environment of the build system". By adding the flexibility to change the environment, this is no longer true. If I try to run a command generated for one build system in the environment of another build system, even if I tokenize it correctly whose to say it will work?

I understand the desire to make sure we get the right change so we don't have to revisit this in the future, but to me this sounds like YAGNI and actually increasing the risk of someone using the software and getting unexpected results, not less risk.

Now I'm less sure. And I might be misunderstanding something here. Folks actually using clang tools on Windows (Aaron?) can tell with more confidence whether Linux-style command line tokenization is of any use on Windows. I had an impression that it allowed clang tools to be used with mingw (and msys shell), but I'm not sure whether it's an important use case for anyone.

alexfh removed a subscriber: aaron.ballman.
In D23455#515527, @rnk wrote:

the feasibility of emitting 'arguments' instead of 'command' into the JSON compilation database.

CMake constructs the command lines internally using string replacement on templates. We never actually know the exact arguments. Therefore providing arguments instead of the whole command would require parsing to be done on the CMake side instead. This is theoretically possible because we do know the shell for which we are generating (Windows cmd versus MSYS sh). However, it may also require a bunch of logic we don't have yet but that LLVM does.

Alternatively, the JSON could have an additional command_shell="..." field that indicates the shell for which the command line is encoded.

Bummer. Given that this is hard to do in CMake, then I think we should just tokenize in Clang. Let's use llvm::sys::getProcessTriple() instead of LLVM_ON_WIN32 and check if that is an MSVC environment as a proxy for the shell type.

Do I understand correctly that llvm::sys::getProcessTriple() returns basically a compile-time constant? If yes, it won't allow the same clang tool binary to be used with both command line formats. Should we instead allow runtime selection of the command line format? For example, by extending JSON compilation database with the aforementioned command_shell="..." option.

It does return a compile time constant, but it is a compile time constant representing the platform that clang-tidy is running on. I don't see a need to have a single clang-tidy binary be able to parse both command line formats. In fact, it seems actively harmful.

If you are running on a Windows system with windows command line parsing rules, and someone has generated a command line on linux, then this command line was never intended to be run on Windows in the first place. The JSON compilation database spec already says that the command line represents "the exact command to be run in the environment of the build system". By adding the flexibility to change the environment, this is no longer true. If I try to run a command generated for one build system in the environment of another build system, even if I tokenize it correctly whose to say it will work?

I understand the desire to make sure we get the right change so we don't have to revisit this in the future, but to me this sounds like YAGNI and actually increasing the risk of someone using the software and getting unexpected results, not less risk.

Now I'm less sure. And I might be misunderstanding something here. Folks actually using clang tools on Windows (Aaron?) can tell with more confidence whether Linux-style command line tokenization is of any use on Windows. I had an impression that it allowed clang tools to be used with mingw (and msys shell), but I'm not sure whether it's an important use case for anyone.

I agree with @zturner's perspective -- if the command was generated on Linux, I would not expect it to work on Windows (and vice versa). Not only are path separators an issue, you run into other things like reserved file names, different -D flags, different triples, etc. I can't think of a situation where I would expect that to work.

rnk added a comment.Aug 17 2016, 11:14 AM

So, I actually went ahead and generated some MSYS makefiles and made a compile_commands.json, and it doesn't work with clang-tidy. You get this kind of output:

[
{
  "directory": "C:/src/test_proj",
  "command": "\"/C/Program Files/mingw-w64/x86_64-6.1.0-win32-seh-rt_v5-rev0/mingw64/bin/g++.exe\"    -Dsomething\\evil -o CMakeFiles/foo.dir/foo.obj -c /C/src/test_proj/foo.cpp",
  "file": "C:/src/test_proj/foo.cpp"
}
]
...
$ clang-tidy foo.cpp
1 error generated.
Error while processing C:\src\test_proj\foo.cpp.
error: error reading '/C/src/test_proj/foo.cpp' [clang-diagnostic-error]

Hypothetically we could make this work, but there are bigger problems here. In that light, I think we should go with this patch. Long term, we should solve this problem by emitting "arguments" or "command_shell" in cmake.

We could also provide a command line option to clang-tidy that lets the user specify the command line syntax of the compilation database. It could choose a smart default (i.e. what we do in this patch after using llvm::sys::getProcessTriple()) but if the command line option is specified it could just force that parsing mode. This way we don't have to get in the business of guessing, which could itself end up having unforeseen edge cases.

alexfh accepted this revision.Aug 17 2016, 12:03 PM
alexfh edited edge metadata.
In D23455#518312, @rnk wrote:

So, I actually went ahead and generated some MSYS makefiles and made a compile_commands.json, and it doesn't work with clang-tidy. You get this kind of output:

[
{
  "directory": "C:/src/test_proj",
  "command": "\"/C/Program Files/mingw-w64/x86_64-6.1.0-win32-seh-rt_v5-rev0/mingw64/bin/g++.exe\"    -Dsomething\\evil -o CMakeFiles/foo.dir/foo.obj -c /C/src/test_proj/foo.cpp",
  "file": "C:/src/test_proj/foo.cpp"
}
]
...
$ clang-tidy foo.cpp
1 error generated.
Error while processing C:\src\test_proj\foo.cpp.
error: error reading '/C/src/test_proj/foo.cpp' [clang-diagnostic-error]

Hypothetically we could make this work, but there are bigger problems here. In that light, I think we should go with this patch. Long term, we should solve this problem by emitting "arguments" or "command_shell" in cmake.

Yep, if we're not breaking anything with this change, let's proceed. LG

This revision is now accepted and ready to land.Aug 17 2016, 12:03 PM

We could also provide a command line option to clang-tidy that lets the user specify the command line syntax of the compilation database. It could choose a smart default (i.e. what we do in this patch after using llvm::sys::getProcessTriple()) but if the command line option is specified it could just force that parsing mode. This way we don't have to get in the business of guessing, which could itself end up having unforeseen edge cases.

If someone is actually interested in this, the option can be added to CommonOptionsParser.

This revision was automatically updated to reflect the committed changes.