This is an archive of the discontinued LLVM Phabricator instance.

[Frontend] Use executable path when creating invocation from cmdline
Changes PlannedPublic

Authored by phosek on Jun 10 2019, 1:05 PM.

Details

Summary

Args[0] isn't necessarily a correct path, it may not even be the path
to the compiler, for example when using tools like ccache, distcc or
goma the first argument in command recorded inside the compilation
database would be this tool, not the compiler. This path is then used
as installation dir inside the driver and is then used to derive other
toolchain specific paths, such as path to libc++ headers (which would
be ../include/c++/v1 relative to the compiler binary). When the first
argument is another tool, or even if it's Clang but the compilation
database contains a relative path, the installation dir is going to be
incorrect and the header resolution will fail which breaks tools like
clangd. So rather than relying blindly on the first argument, we'll use
the executable path (when available) instead.

Diff Detail

Repository
rC Clang

Event Timeline

phosek created this revision.Jun 10 2019, 1:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 10 2019, 1:05 PM

argv[0] does carry important information though, I think this will break a lot of things. It's... concerning that no tests broke.

For example, if it's clang or g++ or clang-cl then that affects how command lines are parsed (regardless of whether the path actually exists).

Additionally I don't think all tools that call createInvocationFromCommandLine carry around all the files you want.

The wrapper tool case is a problem though, I see some options:

  • change the process that generates the compilation database to emit a path to the underlying compiler instead, as tooling currently expects. This may be painful if e.g. goma pretends to be a regular toolchain. An argument can be made it should pretend harder (e.g. symlinking directories around it so discovery works as usual)
  • change the process that generates the CDB to inject the relevant directory paths explicitly as flags
  • make driver aware of the distinction between "invoked-as" and "current binary" and use the right path in the right places

(Incidentally if Driver is going to use the current binary, we shouldn't need to pass it in as a parameter, right?)

There are some potentially-unsolvable cases here: e.g. if your wrapper is magic-remote-build then maybe there's no standard library locally, or no way to find it.

argv[0] does carry important information though, I think this will break a lot of things. It's... concerning that no tests broke.

For example, if it's clang or g++ or clang-cl then that affects how command lines are parsed (regardless of whether the path actually exists).

You're right, I can see how this change can break the driver mode detection.

Additionally I don't think all tools that call createInvocationFromCommandLine carry around all the files you want.

The wrapper tool case is a problem though, I see some options:

  • change the process that generates the compilation database to emit a path to the underlying compiler instead, as tooling currently expects. This may be painful if e.g. goma pretends to be a regular toolchain. An argument can be made it should pretend harder (e.g. symlinking directories around it so discovery works as usual)

I think this is the right solution for compiler wrappers like goma, the problem is that this requires changing any tool that generates compilation database to have explicit notion of compiler launcher and exclude that from the command (today we just prepend the tool to the command). This includes tool like GN and potentially even Ninja which is a non-trivial task. Until that happens, clangd is unusable for projects like Fuchsia (clangd cannot find standard headers and we're getting too many warnings for those).

  • change the process that generates the CDB to inject the relevant directory paths explicitly as flags

This would mean either duplicating the Clang driver logic inside the build system, which is something I'd like to avoid (why use the driver at all in that case?), or have some mechanism through which the driver would have to export all the necessary information (e.g. where to find standard library headers) for the build system to use.

  • make driver aware of the distinction between "invoked-as" and "current binary" and use the right path in the right places

I'll look into this.

(Incidentally if Driver is going to use the current binary, we shouldn't need to pass it in as a parameter, right?)

There are some potentially-unsolvable cases here: e.g. if your wrapper is magic-remote-build then maybe there's no standard library locally, or no way to find it.

Do you think it's worth updating the compilation database documentation to explicitly say that the command has to start with the compiler invocation? Currently it says that "The compile command executed." which is IMHO not sufficiently precise.

One more thing, do you think it's reasonable to use llvm::sys::findProgramByName(Args[0])instead of Args[0] when creating the driver instance? One of the failure modes I ran into is the case where the generated compilation database would contain just the executable name, e.g. clang++. If you invoke that command, everything works as expected because driver resolves the binary to a full path, but when used with clangd, it'll fail because that resolution will never happen, the Dir/InstalledDir will be an empty path and attempt to resolve C++ library headers will end up with /../include/c++/v1 which is invalid.

Alternative would be to make the compilation database specification even more strict and require that the compiler command is not only the first argument, but it's also a (full or relative) path to the compiler executable.

Thanks for pushing on this. I guess this must affect all clang-tools when used with a wrapper, which seems important. (I've just been using bazel and local cmake, so haven't hit this myself).

The wrapper tool case is a problem though, I see some options:

  • change the process that generates the compilation database to emit a path to the underlying compiler instead, as tooling currently expects. This may be painful if e.g. goma pretends to be a regular toolchain. An argument can be made it should pretend harder (e.g. symlinking directories around it so discovery works as usual)

I think this is the right solution for compiler wrappers like goma, the problem is that this requires changing any tool that generates compilation database to have explicit notion of compiler launcher and exclude that from the command (today we just prepend the tool to the command). This includes tool like GN and potentially even Ninja which is a non-trivial task. Until that happens, clangd is unusable for projects like Fuchsia (clangd cannot find standard headers and we're getting too many warnings for those)

Yeah, I can see how putting this in ninja would be a big layering violation. One of the nice things about compile_commands.json is you can get it as a side-effect.

From first principles, a CDB can't hold an arbitrary command that happens to spawn clang, tools need some way to inspect it. We do that by requiring it to be a command-line the driver can parse, which works for clang, gcc, cl, and tools that aim to be CLI-compatible with these. So the issues here are that:

  • gomacc clang++ $flags isn't CLI-compatible with clangd++ $flags: both argv[0] and argv[1] are wrong
  • argv[0] isn't just a string, it's a location to poke around in for include paths
  • [potentially] the driver itself has hard-coded paths in it (see D62804)

One possible fix for the first issue is to invoke is goma-clang++ rather than gomacc clang. Or alternatively goma-clang --driver-mode=g++ which would avoid requiring lots of executables.

Another would be teaching CDB (or createinvocationfromcommandline) to parse out toolchain wrappers. Obviously this seems like a bit of a can of worms. It's (probably?) fine when it's a single argv0 that matches some regex and argv1 is "clang". But what if the wrapper has a generic name, or takes flags, or argv1 is "some-weird-cc"...

The second issue is harder because I think the wrapper makes it fundamentally unclear where to scan for libraries. One answer is near the wrapper itself, but it sounds like goma's answer is that 1) you're likely to have an appropriately configured compiler locally, and 2) it's probably on your PATH. Do you have a sense of how safe these assumptions are?

I'm leaning toward having some logic in JSONCompilationDatabase that pattern matches argv0, and for known wrappers does some transform on argv and sets a new "wrapper" field on the CompileCommand struct. WDYT?

  • change the process that generates the CDB to inject the relevant directory paths explicitly as flags

This would mean either duplicating the Clang driver logic inside the build system, which is something I'd like to avoid (why use the driver at all in that case?), or have some mechanism through which the driver would have to export all the necessary information (e.g. where to find standard library headers) for the build system to use.

  • make driver aware of the distinction between "invoked-as" and "current binary" and use the right path in the right places

I'll look into this.

Thinking about this more - we'd still have the argv[1] issue, and for headers we probably want to search near findProgramByName(argv[1]) rather than near getMainExecutable() (at least clangd doesn't package a standard library). And I do think maybe this is something for CDB to handle rather than driver.

There are some potentially-unsolvable cases here: e.g. if your wrapper is magic-remote-build then maybe there's no standard library locally, or no way to find it.

Do you think it's worth updating the compilation database documentation to explicitly say that the command has to start with the compiler invocation? Currently it says that "The compile command executed." which is IMHO not sufficiently precise.

Yeah, I think it should say that it needs to be a command clang's driver understands, which includes typical invocations of gcc, clang++, clang-cl, etc. And if we add wrapper support (somewhere) we should mention that too.

One more thing, do you think it's reasonable to use llvm::sys::findProgramByName(Args[0])instead of Args[0] when creating the driver instance? One of the failure modes I ran into is the case where the generated compilation database would contain just the executable name, e.g. clang++. If you invoke that command, everything works as expected because driver resolves the binary to a full path, but when used with clangd, it'll fail because that resolution will never happen, the Dir/InstalledDir will be an empty path and attempt to resolve C++ library headers will end up with /../include/c++/v1 which is invalid.

Yes, I think you're right. We need to make sure we only do this for bare names (clang, not ../clang) and that we keep the current behavior if findProgramByName fails.
There's a chance this will change behavior in some cases where we're using a VFS and don't want PATH or the real FS to have any effect, but I think we should try it and see.

Alternative would be to make the compilation database specification even more strict and require that the compiler command is not only the first argument, but it's also a (full or relative) path to the compiler executable.

Changing documentation is obviously going to have a limited, delayed effect. Let's try other approaches first?

phosek planned changes to this revision.Jun 12 2019, 12:14 AM