Page MenuHomePhabricator

[clang][cli] Command line round-trip for HeaderSearch options
ClosedPublic

Authored by jansvoboda11 on Jan 12 2021, 1:19 AM.

Details

Summary

This patch implements generation of remaining header search arguments.
It's done manually in C++ as opposed to TableGen, because we need the flexibility and don't anticipate reuse.

This patch also tests the generation of header search options via a round-trip. This way, the code gets exercised whenever Clang is built and tested in asserts mode. All check-clang tests pass.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Thanks for putting your time into the comment, @dexonsmith.

I agree that adding a -round-trip-args flag to enable/disable round-tripping is more sensible than hard-coding it via something like #ifndef NDEBUG.
Providing -round-trip-args-debug-mode to help with debugging would be a great addition too. Thanks for pointing that out.

I would be interested to hear why exactly are you concerned by the current approach being too low level.
From my point of view, having to opt-in into round-tripping a part of CompilerInvocation is a feature and I think it should work relatively well with the current roll-out plan:

  1. Write documentation on porting existing options to the new marshalling system.
  2. Send a message to the mailing list pointing out that we'll be enabling round-tripping in assert builds by default and what that means. Downstream projects can prepare to port their custom options or disable the round-tripping in assert builds.
  3. Some time later, commit this patch. Downstream projects already have enough information how to deal with the change. Because we've enabled round-tripping only for HeaderSearchOptions, adopting it isn't that time-consuming if a downstream project decides to do so.
  4. (Any patches that add a command line option affecting HeaderSearchOptions will from now on be forced to include the generating/marshalling code.)
  5. Slowly commit following patches, each enabling round-tripping for another chunk of CompilerInvocation.
  6. After round-tripping has been enabled for all of CompilerInvocation, simplify the code (remove ParseHS, GenerateHS, ResetHS, etc.) and round-trip everything (when told to do so with -round-trip-args).

I'd be interested in hearing your thoughts on that ^.

To help me understand the reasoning behind your proposal, can you answer these questions?

  1. What is the benefit or "relaxed" round-tripping compared to the selective strict round-tripping in this patch?
  2. What does GeneratedArgs1 != GeneratedArgs2 prove? If we forget to generate some option, it will be in neither GeneratedArgs1 nor GeneratedArgs2. We'll catch that only in "strict" mode anyways. I guess this can help us discover other kinds of lossy generation, but I think the guarantees of "strict" mode are nicer.

Thanks for putting your time into the comment, @dexonsmith.

I agree that adding a -round-trip-args flag to enable/disable round-tripping is more sensible than hard-coding it via something like #ifndef NDEBUG.
Providing -round-trip-args-debug-mode to help with debugging would be a great addition too. Thanks for pointing that out.

I would be interested to hear why exactly are you concerned by the current approach being too low level.
From my point of view, having to opt-in into round-tripping a part of CompilerInvocation is a feature and I think it should work relatively well with the current roll-out plan:

  1. Write documentation on porting existing options to the new marshalling system.
  2. Send a message to the mailing list pointing out that we'll be enabling round-tripping in assert builds by default and what that means. Downstream projects can prepare to port their custom options or disable the round-tripping in assert builds.
  3. Some time later, commit this patch. Downstream projects already have enough information how to deal with the change. Because we've enabled round-tripping only for HeaderSearchOptions, adopting it isn't that time-consuming if a downstream project decides to do so.
  4. (Any patches that add a command line option affecting HeaderSearchOptions will from now on be forced to include the generating/marshalling code.)
  5. Slowly commit following patches, each enabling round-tripping for another chunk of CompilerInvocation.
  6. After round-tripping has been enabled for all of CompilerInvocation, simplify the code (remove ParseHS, GenerateHS, ResetHS, etc.) and round-trip everything (when told to do so with -round-trip-args).

I'd be interested in hearing your thoughts on that ^.

I have three concerns with the approach:

  1. Churn. This adds a lot of code that will eventually be removed / refactored away. One example is that shifting the NDEBUG logic to the driver requires threading a Boolean (or tristate/enum) for round-trip mode through a few layers; once we're finished, we'll end up stripping that argument back out. A second example is that this seems to require splitting up OPTIONS_WITH_MARSHALLING for each option. Once this work is finished, someone will need to clean them up / unify them, or maybe they'll unnecessarily stay split / duplicated.
  2. Boilerplate. Somewhat related to churn; there's a fair bit of additional boilerplate required in this approach. This makes it harder to read / understand / modify the code.
  3. Correctness. I'm not sure ResetHS is quite right. It'll probably work for a normal command-line -cc1 invocation, but perhaps not for all tooling applications, since it's changing the pointer identity of HeaderSearchOptions.

On the other hand, one thing I really like about your approach, which I don't see a way of getting with a top-level check, is the ability to turn on "strict" mode for subsets of the command-line, which helps to hold the line in the face of new options being added (I think this is the feature you're (rightly) calling out). I'm not sure how important it is in practice, as long as we still think getting them all is going to happen in a reasonable time frame. There aren't that many new options being added: filtering out [cli] (your patches), [flang], [[D]driver], reverts, and relandings, there are 39 commits to Options.td in the last three months (3 / week). Some of those are deleting options or changing values, and not important to catch. Even for new options, I imagine most people copy/paste nearby options. I think it's critical to hold the line once we have them all, but until then I'm not sure.

Both approaches seem possible for downstreams to adapt / adopt in their own time, although reverting a one line patch switching from "relaxed" to "strict" would be easier than maintaining all the Parse/Generate/Reset infrastructure after upstream deletes it.

(I'm mostly concerned about #1 and #2; I assume #3 can be fixed somehow, although I haven't thought about how.)

To help me understand the reasoning behind your proposal, can you answer these questions?

  1. What does GeneratedArgs1 != GeneratedArgs2 prove? If we forget to generate some option, it will be in neither GeneratedArgs1 nor GeneratedArgs2. We'll catch that only in "strict" mode anyways. I guess this can help us discover other kinds of lossy generation, but I think the guarantees of "strict" mode are nicer.

Firstly, GeneratedArgs1 == GeneratedArgs2 proves that generate is an exact inversion of parse, for the subset of args that are handled by the generator (eventually, this will be all args). IOW, it's adding coverage (in one direction) that parse+generate "match" for every option that has generation code. Because of the automation provided by the marshalling infrastructure, I'd expect this to mostly catch bugs in hand-written code (custom marshalling code, manual generation code, or end-of-parse cleanup / follow-up code). Secondly, this adds coverage that command-line generation is deterministic.

strict mode additionally uses the GeneratedArgs1 to fill CompilerInvocation, indirectly checking both directions by requiring tests to pass both with and without this round-trip. However, during development people generally only run the tests one way and the failure mode won't be ideal. If a new option has been left out entirely, it'll usually be clear enough what went wrong and how to fix it; most people build with asserts in development, and the option won't work until they add generation code. But failures from a mismatched parse and generate might not get caught until you compare an Asserts build to a NoAsserts build, which may not happen until the bots run. This kind of failure is also difficult to debug / understand, compared to the assertion on once- and twice-generated args. IOW, even though GeneratedArgs1 == GeneratedArgs2 doesn't catch all issues, it's probably an easier failure to respond to when it does fail.

  1. What is the benefit or "relaxed" round-tripping compared to the selective strict round-tripping in this patch?

Mainly, less churn, less boilerplate, and no correctness problem to solve for Reset. I assume you can adapt what you have to also check GeneratedArgs1 == GeneratedArgs2 (it'd be nice to have this at the top level, so you get this check even for options where selective strict mode hasn't yet been enabled), but if that's harder than it sounds then lack of that would miss some coverage and make debugging harder, as described above.

clang/lib/Frontend/CompilerInvocation.cpp
3151–3154

I'm not sure this works correctly. Callers may depend on the identity of the HeaderSearchOptions. It's a shared pointer that could have other references. This will only update / reset one reference.

Create function for argument generation

Move boilerplate from CreateFromArgs to separate function

Fix formatting

Add round-tripping arguments

Preserve pointer identity when round-tripping

Undo changes to string allocation

I have three concerns with the approach:

  1. Churn. This adds a lot of code that will eventually be removed / refactored away. One example is that shifting the NDEBUG logic to the driver requires threading a Boolean (or tristate/enum) for round-trip mode through a few layers; once we're finished, we'll end up stripping that argument back out. A second example is that this seems to require splitting up OPTIONS_WITH_MARSHALLING for each option. Once this work is finished, someone will need to clean them up / unify them, or maybe they'll unnecessarily stay split / duplicated.

I think the driver command line options could be useful even after we're finished, so I'd consider not removing them. For example when working on new Clang feature that requires new command line option, one could implement the parsing first, temporarily disable round-tripping, implement the feature and worry about generation and round-tripping later. I can imagine a situation when downstream projects would prefer to prevent round-tripping from the driver rather than from within CompilerInvocation.

Yes, this requires splitting OPTIONS_WITH_MARSHALLING. I don't think it's a big deal, because we already do that for DiagnosticOpts, LangOpts and CodegenOpts anyway (D84673, D94682). I think it makes more sense to have one parsing function per *Opts, rather than keeping it split up into Parse*Opts and ParseSimpleOpts.

  1. Boilerplate. Somewhat related to churn; there's a fair bit of additional boilerplate required in this approach. This makes it harder to read / understand / modify the code.

Boilerplate was added mainly to ArgList, but we can delete that without even showing up in git blame. Another instance is the lambda setup in ParseHeaderSearchArgs, but I think that's pretty isolated as well. I agree the code may be confusing for people not familiar with this effort, but it will be me deleting the code few weeks from now, so I'm not sure how bad this is. Do you worry this will cause confusion during merge conflicts down stream?

  1. Correctness. I'm not sure ResetHS is quite right. It'll probably work for a normal command-line -cc1 invocation, but perhaps not for all tooling applications, since it's changing the pointer identity of HeaderSearchOptions.

Nice catch! Some clients make changes to AnalyzerOpts before calling CreateFromArgs, and we would discard those changes. I've updated this patch with a solution: we save the original part of CompilerInvocation, run the first parse on a clean instance, generate arguments and use the original part for the second parse.

On the other hand, one thing I really like about your approach, which I don't see a way of getting with a top-level check, is the ability to turn on "strict" mode for subsets of the command-line, which helps to hold the line in the face of new options being added (I think this is the feature you're (rightly) calling out). I'm not sure how important it is in practice, as long as we still think getting them all is going to happen in a reasonable time frame. There aren't that many new options being added: filtering out [cli] (your patches), [flang], [[D]driver], reverts, and relandings, there are 39 commits to Options.td in the last three months (3 / week). Some of those are deleting options or changing values, and not important to catch. Even for new options, I imagine most people copy/paste nearby options. I think it's critical to hold the line once we have them all, but until then I'm not sure.

That's indeed not too many changes in the area we're working in. Though the biggest reason I'm inclined to stick with this solution is that we can verify the manual generation as soon as we finish one Generate*Args. When committing each Generate*Args separately, this ensures the code really works.

If we go only with the "relaxed" checks, we'd be committing code that might not work as expected and we'd find out only when enabling strict mode for the whole -cc1 interface. Finding a bug in that situation might be harder than with the incremental approach.

strict mode additionally uses the GeneratedArgs1 to fill CompilerInvocation, indirectly checking both directions by requiring tests to pass both with and without this round-trip. However, during development people generally only run the tests one way and the failure mode won't be ideal.

So people build without assertions during development? In that case, I agree that erroring out on GeneratedArgs1 != GeneratedArgs2 (in all kinds of builds) would improve the experience. I don't think there's anything preventing us to incorporate this into the current patch.

Thanks for talking through this with me! The direction sounds good.

I think the driver command line options could be useful even after we're finished

You mean the -cc1 options used by the driver right? We shouldn't be adding options to the driver. I agree they should be kept.

Yes, this requires splitting OPTIONS_WITH_MARSHALLING. I don't think it's a big deal, because we already do that for DiagnosticOpts, LangOpts and CodegenOpts anyway (D84673, D94682). I think it makes more sense to have one parsing function per *Opts, rather than keeping it split up into Parse*Opts and ParseSimpleOpts.

Okay, that's a fair point; it probably really is cleaner to have one parse / generate function per option class. In that case, I wonder if we can remove the prefix from the key paths the same way as for DiagnosticOpts...

  1. Boilerplate. Somewhat related to churn; there's a fair bit of additional boilerplate required in this approach. This makes it harder to read / understand / modify the code.

Boilerplate was added mainly to ArgList, but we can delete that without even showing up in git blame. Another instance is the lambda setup in ParseHeaderSearchArgs, but I think that's pretty isolated as well. I agree the code may be confusing for people not familiar with this effort, but it will be me deleting the code few weeks from now, so I'm not sure how bad this is. Do you worry this will cause confusion during merge conflicts down stream?

Yes, it's merge conflicts that I think would be most confusing. I think you've convinced me that splitting up the parsers is independently valuable, so I'm less concerned about the churn now.

So people build without assertions during development? In that case, I agree that erroring out on GeneratedArgs1 != GeneratedArgs2 (in all kinds of builds) would improve the experience. I don't think there's anything preventing us to incorporate this into the current patch.

Sounds great. This also checks that generation is deterministic.

strict mode additionally uses the GeneratedArgs1 to fill CompilerInvocation, indirectly checking both directions by requiring tests to pass both with and without this round-trip. However, during development people generally only run the tests one way and the failure mode won't be ideal.

So people build without assertions during development? In that case, I agree that erroring out on GeneratedArgs1 != GeneratedArgs2 (in all kinds of builds) would improve the experience. I don't think there's anything preventing us to incorporate this into the current patch.

The only issue I have with this if always parsing twice has a noticeable performance impact for any users of Clang. Can we measure the impact? If it's small (< 100ms ?) then that's fine. I'm also concerned about users like cling and clangd.

strict mode additionally uses the GeneratedArgs1 to fill CompilerInvocation, indirectly checking both directions by requiring tests to pass both with and without this round-trip. However, during development people generally only run the tests one way and the failure mode won't be ideal.

So people build without assertions during development? In that case, I agree that erroring out on GeneratedArgs1 != GeneratedArgs2 (in all kinds of builds) would improve the experience. I don't think there's anything preventing us to incorporate this into the current patch.

The only issue I have with this if always parsing twice has a noticeable performance impact for any users of Clang. Can we measure the impact? If it's small (< 100ms ?) then that's fine. I'm also concerned about users like cling and clangd.

I think the idea is only to parse twice when -round-trip-args is on, which the driver will only set by default in asserts mode. Good to measure the impact nevertheless.

Compare once and twice generated arguments, fix re-ordering of UserEntries

jansvoboda11 retitled this revision from [WIP][clang][cli] Command line round-trip for HeaderSearch options to [clang][cli] Command line round-trip for HeaderSearch options.Wed, Jan 27, 4:08 AM
jansvoboda11 edited the summary of this revision. (Show Details)

The only issue I have with this if always parsing twice has a noticeable performance impact for any users of Clang. Can we measure the impact? If it's small (< 100ms ?) then that's fine. I'm also concerned about users like cling and clangd.

I've created a micro-benchmark in D95516. I think we should be fine.

I think the idea is only to parse twice when -round-trip-args is on, which the driver will only set by default in asserts mode. Good to measure the impact nevertheless.

I must've misunderstood you earlier. I've updated the patch so that we round-trip only when cc1 is given -round-trip-args (which happens automatically only in assert builds).
We check that the once- and twice- generated arguments match, and we make sure the CompilerInvocation resulting from the second parse gets used in the rest of compilation.

clang/lib/Frontend/CompilerInvocation.cpp
630

There must be a better way to do this...

dexonsmith added inline comments.Wed, Jan 27, 12:44 PM
clang/include/clang/Driver/Options.td
523–524

Can we make this =true vs. =false, or add a -no-round-trip-args? That will allow:

% clang some args -Xclang -no-round-trip-args

to disable it in asserts builds; much better than:

% clang some args -###
# Search for -round-trip-args
% clang -cc1 copy-paste-args-before copy-past-args-after
clang/lib/Frontend/CompilerInvocation.cpp
557

Hmm, I thought we'd found a way to avoid allocating the prefixed name, by adding it to the options table. This unfortunate, but I guess you can fix it another time.

566

Does this allocated string get used, or does "=some-value" get added on?

Ensure Diags get used only once, add disabling flag, clean up

jansvoboda11 added inline comments.Thu, Jan 28, 7:54 AM
clang/include/clang/Driver/Options.td
523–524

Done.

clang/lib/Frontend/CompilerInvocation.cpp
557

We can avoid the allocation in the generating macro. I'll come back to this later and figure out a way to do the same here.

566

Both. It depends on the kind/class (Joined/Separate/...) of the option.

3151–3154

Resolved by SwapOpts.

dexonsmith requested changes to this revision.Thu, Jan 28, 11:57 AM

This is getting close! A few more comments inline.

clang/lib/Frontend/CompilerInvocation.cpp
570–575

I suggest inlining Equal and defining the lambda directly in the call to std::equal; I think that'll be easier to read than having this up here. (Nice to have it split out fo rcompile-time if RoundTrip is templated, but I have a suggestion about that below.)

577–582
  • I think this would be more clear as a local lambda too.
  • I don't think this is the best destination, since llvm::dbgs() goes nowhere when NDEBUG. Should this go in errs()? Or should -round-trip-args-debug= take a filename for where to write extra stuff?
  • The arguments won't be correctly formatted for copy-paste if there are any characters that should be escaped. Is the logic from -### reusable?
590–594

Instead of templating this, can we take llvm::function_refs to type-erase the functions?

619

I suggest double-checking that the second parse also fails, and reporting that as an error somehow.

640–642

Does this need to be checked for success / error conditions in any way?

649

Might be better to error here if the second parse fails. We'd want to debug the generated args too.

670–671

Another option vs. report_fatal_error would be to create (fatal) clang error diagnostics. That applies here, and to my previous comments. WDYT? (@Bigcheese, any thoughts?)

Also, I think -round-trip-args-debug might not be super discoverable. Maybe we can add a comment in the code where the errors are reported suggesting adding that.

This revision now requires changes to proceed.Thu, Jan 28, 11:57 AM

Improve error handling

jansvoboda11 added inline comments.Fri, Jan 29, 5:30 AM
clang/lib/Frontend/CompilerInvocation.cpp
570–575

That's fair.

577–582

Good point, resolved.

590–594

Yes, that would be better. I had this templated while I was figuring out what their signature should even be. I've extracted typedefs to keep the signature of RoundTrip readable.

619

I think this should never happen, but being defensive here is a good idea.

640–642

Other callers usually check for missing argument values or unknown arguments. If we did that, we'd need to distinguish between bad arguments we've copied from OriginalArgs and errors we created in Generate. We don't have access to MissingArgIndex and MissingArgsCount of OriginalArgs, so it's not that easy. We could serialize and parse OriginalArgs again, but I think that's not a good idea from performance perspective.

I suggest we omit the checks for now, but put them in place as soon as we round-trip all arguments.

649

That's right.

670–671

If we decide to go with custom diagnostics instead of just using llvm::errs and llvm::report_fatal_error, we'd need to instantiate a custom DiagnosticsEngine here in RoundTrip, because we cannot rely on clients passing us reasonable Diags.

One such example is CompilerInvocationTest, where we don't care what diagnostics get emitted (by using TextDiagnosticBuffer consumer). The errors we're emitting here would be still useful to see when testing though.

dexonsmith added inline comments.Fri, Jan 29, 7:50 AM
clang/lib/Frontend/CompilerInvocation.cpp
670–671

I'm not entirely following why this would need a custom Diags. If the client wants to ignore diagnostics that should be up to the client, especially in clang-as-a-library situations. In fact, using errs or dbgs at all is pretty iffy when clang is used as a library.

For CompilerInvocationTest, I imagine we could change the consumer / test fixture / etc. somehow to capture this case.

Create proper diagnostics, add remark tests

clang/lib/Frontend/CompilerInvocation.cpp
670–671

My understanding is that we'd be converting llvm::report_fatal_error into error diagnostics and PrintArgs into notes and remarks.

The hiccup here is that Diags most probably hasn't been configured yet (we're still just parsing the DiagnosticsOptions). And by default, DiagnosticsEngine ignores all remarks.

So my thinking was that to ensure our remarks are not thrown away, we could modify the existing Diags to our liking (too invasive) or set up a custom DiagnosticsEngine instance here in RoundTrip. In the end, the best solution is probably to do this of setup in cc1_main, so that we're emitting diagnostics only when used within compiler.

Fix formatting and comment wording

Don't round-trip when the driver says so. Round-trip all the time when built to do so.

I've changed this patch so that the errors and debugging output goes to DiagnosticsEngine.

Also, I switched back to the original solution, where we always round-trip when Clang is built certain way (this time not when CMake was configured with -DLLVM_ENABLE_ASSERTIONS=ON, but with -DCLANG_ROUND_TRIP_CC1_ARGS=ON). The problem with the previous solution (passing -round-trip-args from the driver to -cc1) is that a lot of tests invoke -cc1 directly, not through the driver. We want to test round-tripping on such tests as well, but shouldn't pollute them with -round-trip-args or force them to go through the driver. I've considered enabling round-tripping through an environment variable, but the problem is that the, the environment variable is not propagated when running lit tests.

dexonsmith accepted this revision.Wed, Feb 3, 1:25 PM

I've changed this patch so that the errors and debugging output goes to DiagnosticsEngine.

Also, I switched back to the original solution, where we always round-trip when Clang is built certain way (this time not when CMake was configured with -DLLVM_ENABLE_ASSERTIONS=ON, but with -DCLANG_ROUND_TRIP_CC1_ARGS=ON). The problem with the previous solution (passing -round-trip-args from the driver to -cc1) is that a lot of tests invoke -cc1 directly, not through the driver. We want to test round-tripping on such tests as well, but shouldn't pollute them with -round-trip-args or force them to go through the driver. I've considered enabling round-tripping through an environment variable, but the problem is that the, the environment variable is not propagated when running lit tests.

LGTM; I think this is fine as an incremental step, as long as the goal is to get this on in asserts builds once it's complete (we don't want another bot configuration with this...). Please add -cc1 flags to override at least by the time you switch back to doing this for asserts builds. I've noted inline how you can do that (even though the clang driver idea doesn't work for this case).

clang/lib/Frontend/CompilerInvocation.cpp
592–594

It'd be good to the flags still available in -cc1 as overrides.

bool RoundTripDefault = false;

// FIXME: Switch to '#ifndef NDEBUG' once it's ready.
#ifdef CLANG_ROUND_TRIP_CC1_ARGS
RoundTripDefault = true;
#endif

bool DoRoundTrip = Args.hasFlag(OPT_round_trip_args, OPT_no_round_trip_args,
                                RoundTripDefault);

But that's most compelling once this is in NDEBUG. If you'd rather leave that until then I'm fine with that.

This revision is now accepted and ready to land.Wed, Feb 3, 1:25 PM
This revision was landed with ongoing or failed builds.Thu, Feb 4, 1:18 AM
This revision was automatically updated to reflect the committed changes.