This is an archive of the discontinued LLVM Phabricator instance.

Extend Compilation Database by an optional field for compiler arguments
Needs ReviewPublic

Authored by Flow on May 7 2020, 7:02 AM.

Details

Summary

Consumers of the compilation database (compile_commands.json) currently resort to fragile hacks to separate the compiler executable(s) from the compiler arguments in the 'command' entry. Some (most?) consumers only care about the compiler arguments and not about the compiler executable invoked. Separating those from the args turns out to be non-trivial if not impossible done reliable with the current format specification. This is because sometimes there are compiler "wrappers", like ccache, involved.

As result, the consumers try to strip the wrappers to determine the actual compiler arguments. See for example

But this approach is obviously fragile, as one never knows all possible wrappers.

Hence this extends the JSON compliation database format by an optional compiler arguments entry. This change is backwards compatible

{
    "directory": "/home/user/code/myproject/build-release",
    "compiler_arguments: "-Itest/9f86d08@@simple_test@exe -Itest -I../test -I../include -fdiagnostics-color=always -DNDEBUG -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -Wpedantic -Werror -std=gnu11 -O3 -pthread -UNDEBUG -MD -MQ 'test/9f86d08@@simple_test@exe/simple_test.c.o' -MF 'test/9f86d08@@simple_test@exe/simple_test.c.o.d' -o 'test/9f86d08@@simple_test@exe/simple_test.c.o' -c ../test/simple_test.c",
    "command": "ccache cc -Itest/9f86d08@@simple_test@exe -Itest -I../test -I../include -fdiagnostics-color=always -DNDEBUG -pipe -D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -Wextra -Wpedantic -Werror -std=gnu11 -O3 -pthread -UNDEBUG -MD -MQ 'test/9f86d08@@simple_test@exe/simple_test.c.o' -MF 'test/9f86d08@@simple_test@exe/simple_test.c.o.d' -o 'test/9f86d08@@simple_test@exe/simple_test.c.o' -c ../test/simple_test.c",
    "file": "../test/simple_test.c"
},

Compilation Database consuming tools could check if the format already includes the new entry, and fall back to the (legacy) "command" entry if not.

Compilation Database producing tools, usually build systems like meson, should be easily able to create the new entry for the compiler arguments, since they have knowledge about what is a compiler argument and what is a compiler executable. The problem is that this semantic is lost in the current format of the compilation database.

Related bug: https://bugs.llvm.org/show_bug.cgi?id=45829

Diff Detail

Event Timeline

Flow created this revision.May 7 2020, 7:02 AM
Flow edited the summary of this revision. (Show Details)May 7 2020, 7:05 AM
Flow added a reviewer: silvas.May 7 2020, 7:07 AM
silvas resigned from this revision.May 7 2020, 1:25 PM
silvas added a reviewer: klimek.

I have not looked at this stuff for a long time. Adding Manuel who is probably more up to date on this.

Adding Sam, as I remember clangd having that problem.

Hi, thanks for sending this! I'm sympathetic to wanting this fixed but a bit skeptical about changing it :-) It would be good to discuss the tradeoffs, and in any case document whether we expect this problem to be solved by build systems, consuming tools, or in the format.

compile_commands.json is a pretty simple and historically-stable format that is an interface boundary for lots of existing tooling, so this is a significant change.
Do we have evidence that this is a *large* problem? e.g. what tools are reasonably commonly deployed that simple unwrapping heuristics won't catch? Is there a reason IWYU can't use JSONCompilationDatabase, where this problem is (AFAIK) solved? (Not a general solution, but most tools consuming this format are clang-based, including IWYU as I understand...)

Increasing the requirements on producers to make life a bit easier for consumers seems directionally wrong given that having compile_commands.json unavailable *still* seems to be a fairly high bar in practice (and a more common cause for tool failure than tools misunderstanding the file, though maybe not among people filing high-quality bug reports against tools).

I understand the desire to encode more structured information in the compilation database to make tools robust, but this is a *long* path and extracting argv0 is far from the most difficult case. This is not solving the problems of "tools must break abstractions to work with this format" - they still often need to identify the actual compiler in order to correctly infer the location of builtin and standard library headers for example (and on mac, poke at xcrun etc, sigh).

This change is backwards compatible

It's only backwards-compatible if producers emit both sets of data, which is not too different from defining a new format alongside. FWIW the Chrome compilation database is several hundred MB, most of it is args... it's not a trivial thing to double the size of emitting, storing, or parsing that.

I've got two issues with the detail of the proposal (though per above I'm not sure it's a good idea in principle):

  • the proposed format still leaves it to the tool to heuristically determine the compiler command, which is needed in several cases
  • it's based on the less-structured command rather than the more-structured arguments, which seems at odds with what it's trying to do

These (and the duplication) could be addressed by adding an index to split the arguments list at instead, I suppose.

Flow updated this revision to Diff 265155.May 20 2020, 12:41 AM

Instead of repeating the complete string of compiler arguments in the new optional field, simply have them provide an index into the respective 'command' or 'arguments' field values.

Flow added a comment.May 20 2020, 1:05 AM

Hi, thanks for sending this! I'm sympathetic to wanting this fixed but a bit skeptical about changing it :-) It would be good to discuss the tradeoffs, and in any case document whether we expect this problem to be solved by build systems, consuming tools, or in the format.

I think this can only be reliably solved if the format provides this additional information. Build systems are usually aware of the compiler command arguments, but the information is lost within the JSON compilation database. Eventually all three involved parties have to produce, carry and consume that additional metadata for the problem to be solved.

Do we have evidence that this is a *large* problem? e.g. what tools are reasonably commonly deployed that simple unwrapping heuristics won't catch?

I think the fact that hard-coded lists like these will never be complete, is probably enough to justify a sound solution.

Is there a reason IWYU can't use JSONCompilationDatabase, where this problem is (AFAIK) solved? (Not a general solution, but most tools consuming this format are clang-based, including IWYU as I understand...)

If you refer to clang/lib/Tooling/JSONCompilationDatabase.cpp, then I would not describe the problem as solved: JSONCompilationDatabase.cpp contains just a hard-coded list of known wrappers.

Increasing the requirements on producers to make life a bit easier for consumers seems directionally wrong given that having compile_commands.json unavailable *still* seems to be a fairly high bar in practice (and a more common cause for tool failure than tools misunderstanding the file, though maybe not among people filing high-quality bug reports against tools).

It is a little more than making life a bit easier for consumers: If your consumer is unable to determine the correct string of compiler arguments, then you may be unable to run a static code analysis tool. And in many cases should be trivial for the producers to add that information, because build systems often hold the compiler arguments in an extra data structure.

This change is backwards compatible

It's only backwards-compatible if producers emit both sets of data, which is not too different from defining a new format alongside. FWIW the Chrome compilation database is several hundred MB, most of it is args... it's not a trivial thing to double the size of emitting, storing, or parsing that.

Good point. I changed the new optional fields to contain an index instead of a full string.

I've got two issues with the detail of the proposal (though per above I'm not sure it's a good idea in principle):

  • the proposed format still leaves it to the tool to heuristically determine the compiler command, which is needed in several cases

Since the fields now contain the index, it should be fairly trivial to extract the compiler commands/exectuables: they substring before the index.

  • it's based on the less-structured command rather than the more-structured arguments, which seems at odds with what it's trying to do

There are now two fields specified, one for 'command' and the other one for 'arguments'.

These (and the duplication) could be addressed by adding an index to split the arguments list at instead, I suppose.

+1

I think we're talking past each other a bit - I'm not yet convinced the problem is important enough to justify much cost, so it's hard to get excited about the merits of the solution.
Similarly you don't seem to be thinking much about the costs of this change or seriously considering the alternatives.

You could try bringing this up on cfe-dev to see if others in the community see this as an important feature.

I think this can only be reliably solved if the format provides this additional information

Yes. My thesis is that it's more important that compile_commands is simple, than that this particular problem is solved reliably.

(To illustrate, I think a more important problem is that tools have a very hard time knowing the standard header search paths, which are "built in" to the driver named as argv0. But we haven't added complexity to the format to solve this).

I think the fact that hard-coded lists like these will never be complete, is probably enough to justify a sound solution.

No, we need to establish that the remaining problem (the gaps left after hard-coded list heuristics) is important enough to justify adding complexity, and that adding the complexity is the best solution (vs e.g. extending the lists, changing the generators, using other heuristics to spot argv1...)