Page MenuHomePhabricator

[Preprocessor] Implement -fminimize-whitespace.
ClosedPublic

Authored by Meinersbur on Jun 19 2021, 6:32 PM.

Details

Summary

This patch adds the -fminimize-whitespace with the following effects:

  • If combined with -E, remove as much non-line-breaking whitespace as possible
  • If combined with -E -P, removes as much whitespace as possible, including line-breaks.

The motivation is to reduce the amount of insignificant changes in the preprocessed output with source files where only whitespace has been changes (add/remove comments, clang-format, etc.) which is in particular useful with ccache.

A patch for ccache for using this flag has been proposed to ccache as well: https://github.com/ccache/ccache/pull/815, which will use -fnormalize-whitespace when clang-13 has been detected (assuming this will be merged before the release fork), and additionally uses -P in "unify_mode". ccache already had a unify_mode in an older version which was removed because of problems that using the preprocessor itself does not have (such that the custom tokenizer did not recognize C++11 raw strings).

This patch slightly reorganizes which part is responsible for adding newlines that are required for semantics. It is now either startNewLineIfNeeded() or MoveToLine() but never both; this avoids the ShouldUpdateCurrentLine workaround and avoids redundant lines being inserted in some cases. It also fixes a mandatory newline not inserted after a _Pragma("...") that is expanded into a #pragma.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
aaron.ballman added inline comments.Jun 29 2021, 8:36 AM
clang/lib/Frontend/PrintPreprocessedOutput.cpp
135
137
174

Can Tok be const Token & instead?

181
635
Meinersbur marked 6 inline comments as done.
Meinersbur added a comment.EditedJun 29 2021, 1:43 PM

... would add that it's very common for implementers to ask developers to run their code through -E mode to submit preprocessed output in order to reproduce a customer issue with the compiler, and I worry that uses of this flag will have unintended consequences in that scenario.

Why would one add -fnormalize-whitespace for this use case?

The "very long line" example mentioned by @dblaikie is sort of along these lines (sorry for the bad pun).

I can add forced line breaks (after/before 80 cols? configurable?) if requested.

clang/lib/Frontend/PrintPreprocessedOutput.cpp
174

Note that this is a renamed bool HandleFirstTokOnLine(Token &Tok). Renamed because it is now called for every token.

ping

It would be nice if we could get this into clang 13 since the ccache part detects support for -fnormalize-whitespace when detecting clang 13+. Probing the compiler for whether it supports -fnormalize-whitespace on every run would be prohibitively expensive.

... would add that it's very common for implementers to ask developers to run their code through -E mode to submit preprocessed output in order to reproduce a customer issue with the compiler, and I worry that uses of this flag will have unintended consequences in that scenario.

Why would one add -fnormalize-whitespace for this use case?

I don't think they'd *add* it, I worry it's already used by their build system for reasons unknown and the developer replicates it when reporting the bug because they're not looking at what flags can be removed.

The "very long line" example mentioned by @dblaikie is sort of along these lines (sorry for the bad pun).

I can add forced line breaks (after/before 80 cols? configurable?) if requested.

I don't think that's necessary just yet. If there's an issue in practice, it seems like we could address it once we have the real world use in front of us. However, It'd help my confidence if you were able to run this functionality over a large corpus of code to see if any issues do pop out, if that's plausible for you.

That said, I think I've convinced myself that this functionality is reasonable (if a bit novel), but I'd like to see it diagnosed when used outside of a preprocessor invocation because it really serves no purpose there. I also think we should rename it because "normalize whitespace" can be ambiguous depending on what context you're reading the words in.

clang/docs/ClangCommandLineReference.rst
2484

Should this option be ignored when not performing a preprocessing action (and documented as such)?

clang/include/clang/Driver/Options.td
1802

I think I'd feel most comfortable if this didn't use "normalize" in the name. When I hear that, I think it's going to be doing something like converting all (perhaps funky Unicode) notions of whitespace into ASCII space characters. But this option doesn't actually do that -- it's removing as much whitespace from the preprocessed output as possible. Would it be better named as -felide-unnecessary-whitespace or something along those lines?

Meinersbur marked an inline comment as done.Jul 18 2021, 10:05 PM

... would add that it's very common for implementers to ask developers to run their code through -E mode to submit preprocessed output in order to reproduce a customer issue with the compiler, and I worry that uses of this flag will have unintended consequences in that scenario.

Why would one add -fnormalize-whitespace for this use case?

I don't think they'd *add* it, I worry it's already used by their build system for reasons unknown and the developer replicates it when reporting the bug because they're not looking at what flags can be removed.

I made the flag an error if used without -E, so the build system cannot add it.

Flags such as -P, -fuse-line-directives, -frewrite-imports and -frewrite-includes are also silently ignored without -E. This requirement seems exclusive for -felide-unnecessary-whitespace. It also suggests that it is not a problematic scenario.

The "very long line" example mentioned by @dblaikie is sort of along these lines (sorry for the bad pun).

I can add forced line breaks (after/before 80 cols? configurable?) if requested.

I don't think that's necessary just yet. If there's an issue in practice, it seems like we could address it once we have the real world use in front of us. However, It'd help my confidence if you were able to run this functionality over a large corpus of code to see if any issues do pop out, if that's plausible for you.

I ran it (using ccache) over llvm-project and the llvm-test-suite. Is there another large corpus do you have in mind?

I'd like to see it diagnosed when used outside of a preprocessor invocation because it really serves no purpose there.

I made the flag an error if not used without -E.

I also think we should rename it because "normalize whitespace" can be ambiguous depending on what context you're reading the words in.

Changed to your suggested -felide-unnecessary-whitespace, although I think name will not allow use to insert whitespace e.g. for keeping the line length down. If that's not required, I would prefer -fminimize-whitespace over -felide-unnecessary-whitespace`.

clang/docs/ClangCommandLineReference.rst
2484

I changed it to an error if used without -E.

clang/include/clang/Driver/Options.td
1802

I don't think it is ambiguous what whitespace means for the preprocessor. Currently, -E already converts into '\n' and '\n' only.

I don't like the suggested name but still applied it to this patch. I'd prefer -fminimize-whitespace. However, the name would be wrong if we add additional whitespace such as line-breaks to avoid the long line problem, of we unconditionally always emit a space between tokens to not having to call AvoidConcat which would still serve the same goal.

Meinersbur marked an inline comment as done.
  • Rename -fnormalize-whitespace to -felide-unnecessary-whitespace
  • error when used without -E
  • Reabse

I don't think they'd *add* it, I worry it's already used by their build system for reasons unknown and the developer replicates it when reporting the bug because they're not looking at what flags can be removed.

I made the flag an error if used without -E, so the build system cannot add it.

Flags such as -P, -fuse-line-directives, -frewrite-imports and -frewrite-includes are also silently ignored without -E. This requirement seems exclusive for -felide-unnecessary-whitespace. It also suggests that it is not a problematic scenario.

Perhaps it's not a problematic scenario, but I'd consider those cases to also be bugs. Silently ignoring things the user wrote (either in code or on the command line) is typically user-hostile.

The "very long line" example mentioned by @dblaikie is sort of along these lines (sorry for the bad pun).

I can add forced line breaks (after/before 80 cols? configurable?) if requested.

I don't think that's necessary just yet. If there's an issue in practice, it seems like we could address it once we have the real world use in front of us. However, It'd help my confidence if you were able to run this functionality over a large corpus of code to see if any issues do pop out, if that's plausible for you.

I ran it (using ccache) over llvm-project and the llvm-test-suite. Is there another large corpus do you have in mind?

A Linux or BSD distro worth of code would be really nice, but I don't insist. The trouble with llvm-project is that it has a somewhat uniform style for code, and I've found that tools in distro packages tend to exercise more interesting cases sometimes.

I also think we should rename it because "normalize whitespace" can be ambiguous depending on what context you're reading the words in.

Changed to your suggested -felide-unnecessary-whitespace, although I think name will not allow use to insert whitespace e.g. for keeping the line length down. If that's not required, I would prefer -fminimize-whitespace over -felide-unnecessary-whitespace`.

I'd be perfectly happy with -fminimize-whitespace as well if that's your preference.

clang/docs/ClangCommandLineReference.rst
2484

Thanks! Can you add that info to the documentation here?

  • Rename -felide-unnecessary-whitespace to -fminimize-whitespace
  • Add -fminimize-whitespace only vlid to be used to docs
  • Adjust some code comments
  • Rebase

but for my money this seems pretty problematic - will make quoted text in compiler diagnostics weird/difficult to read, etc.

FWIW, clang has a line length limit of 4096 for printing diagnostics.

Meinersbur retitled this revision from [Preprocessor] Implement -fnormalize-whitespace. to [Preprocessor] Implement -fminimize-whitespace..Jul 20 2021, 2:47 PM
Meinersbur edited the summary of this revision. (Show Details)
Meinersbur marked 2 inline comments as done.
  • Error if -fminimize-whitespace is applied to asm files
  • Join lines with escaped newlines
  • Added some FIXMEs for grandfathered bugs

I compiled the Linux kernel (default config, excludes most drivers. Takes ~18mins to compile) and indeed found two problems:

  1. Linux contains assembler-with-cpp files. When compiling with clang, it processes it with -E before passing the result to as. There are different rules on whether whitespace can be completely removed (AvoidConcat) in asm files. I changed ccache to not pass -fminimize-whitespace to assembler files and added an error if one tries.
  1. Line breaks outside of macros are (more) significant in assembler files. I tried to rely more on Tok.getLocation() than Tok.isAtStartOfLine() (which is pretty unreliable) to determine line breaks, which unfortunately caused differences even without -fminimize-whitespace. The patch is using Tok.isAtStartOfLine() again.

The clang-13 release branch will be created on July 27. Is there any hope to get this merged by then?

  • Revert test changes
aaron.ballman accepted this revision.Jul 22 2021, 4:42 AM

I compiled the Linux kernel (default config, excludes most drivers. Takes ~18mins to compile) and indeed found two problems:

Thank you, that's super helpful validation!

The clang-13 release branch will be created on July 27. Is there any hope to get this merged by then?

I think there is, but I'm an eternal optimist. :-) This LGTM, but you should wait a day or two before landing it in case @dblaikie or @rsmith have concerns.

clang/lib/Driver/ToolChains/Clang.cpp
6077–6080

One downside to this approach as opposed to using a switch is that it'll be easy to add new languages and forget to update this. However, we don't do that particularly often and this will loudly tell users about the lack of support, so I think it's fine as-is.

This revision is now accepted and ready to land.Jul 22 2021, 4:42 AM

I think there is, but I'm an eternal optimist. :-) This LGTM, but you should wait a day or two before landing it in case @dblaikie or @rsmith have concerns.

Of course.

clang/lib/Driver/ToolChains/Clang.cpp
6077–6080

I used the types::isABC functions because if a new input type is added, it will either be added to one of the isABC functions. If not, it would be save to not allow whitespace minimization with it until verified.

However, I did not think about the warning for missing enums. I am going to use a switch instead.

  • Introduce types::isDerivedFromC
  • Print input type in error message

Because of how large the switch construct would have been, I created a new function types::isDerivedFromC together with the other functions. These commonly have default-cases so compilers would not warn when Types.def is amended, but isDerivedFromC can be found between other functions that would need to be updated.

aaron.ballman accepted this revision.Jul 23 2021, 4:50 AM

Because of how large the switch construct would have been, I created a new function types::isDerivedFromC together with the other functions. These commonly have default-cases so compilers would not warn when Types.def is amended, but isDerivedFromC can be found between other functions that would need to be updated.

Thank you, I think that turned out cleanly!

clang/lib/Driver/Types.cpp
173

Heh, I had to go look this one up. TIL... :-D

This revision was landed with ongoing or failed builds.Jul 25 2021, 9:39 PM
This revision was automatically updated to reflect the committed changes.

This broke some cases for me, where the output is missing some newlines, breaking the output badly. I’ll try to provide a repro…

This broke some cases for me, where the output is missing some newlines, breaking the output badly. I’ll try to provide a repro…

https://martin.st/temp/repro.tar.xz

clang -target x86_64-windows-gnu -E -P _mkerrcodes.h -Iinclude

The output (at the end) has missing newlines (e.g. around GPG_ERR_ETXTBSY).

Looking into it...

Meinersbur added a comment.EditedJul 27 2021, 5:08 PM

Proposed fix here: D106924

The reproducer had another whitespace difference before and after D104601: typedef __builtin_va_list __gnuc_va_list; instead of typedef __builtin_va_list __gnuc_va_list; (two spaces before "typedef" instead of one). I assume this is not the issue, especially since the source file (include\vadefs.h) actually has two spaces before "typedef".

romanovvlad added a subscriber: romanovvlad.EditedJul 29 2021, 4:30 AM

Hi @Meinersbur,

It seems the patch introduces one more regression. The following test doesn't pass on Windows:

// RUN: %clang -E %s -o %t.ii
// RUN: %clang %t.ii

#include "string.h"

int main() {
  return 0;
}

The following macro from vcruntime.h:

#define _CRT_BEGIN_C_HEADER            \
    __pragma(pack(push, _CRT_PACKING)) \
    extern "C" {

Becomes

#pragma pack(push, 8) extern "C" {

in the preprocessed file.

I'm not an expert in this code, but partially returning old behavior helped with that:

diff --git a/clang/lib/Frontend/PrintPreprocessedOutput.cpp b/clang/lib/Frontend/PrintPreprocessedOutput.cpp
index b725956..b49b247 100644
--- a/clang/lib/Frontend/PrintPreprocessedOutput.cpp
+++ b/clang/lib/Frontend/PrintPreprocessedOutput.cpp
@@ -772,6 +772,7 @@ static void PrintPreprocessedTokens(Preprocessor &PP, Token &Tok,
   bool IsStartOfLine = false;
   char Buffer[256];
   while (1) {
+    Callbacks->MoveToLine(Tok.getLocation(), /*RequireStartOfLine=*/false);
     // Two lines joined with line continuation ('\' as last character on the
     // line) must be emitted as one line even though Tok.getLine() returns two
     // different values. In this situation Tok.isAtStartOfLine() is false even

Could you please take a look?

alexfh added a subscriber: alexfh.Jul 29 2021, 4:07 PM

This commit changes the behavior of clang -E -P even when no -fminimize-whitespace is used. This breaks certain use cases like using clang to preprocess files for flex, which turns out to be sensitive to the presence of line breaks in places where C++ compilers aren't.

An isolated test case:

$ clang-old -E -x c++ -P - -o /tmp/pp.good
#define I(x, ...) \
 x { return X##x; }



#ifndef A
#define A(op, x) I(op, x)
#endif

A(foo, {







})
A(bar, {})
$ cat /tmp/pp.good
foo { return Xfoo; }
bar { return Xbar; }
$ clang-new -E -x c++ -P - -o /tmp/pp.bad
#define I(x, ...) \
 x { return X##x; }



#ifndef A
#define A(op, x) I(op, x)
#endif

A(foo, {







})
A(bar, {})
$ cat /tmp/pp.bad
foo { return Xfoo; }bar { return Xbar; }

Please fix or revert the commit.

Thanks!

This commit changes the behavior of clang -E -P even when no -fminimize-whitespace is used. This breaks certain use cases like using clang to preprocess files for flex, which turns out to be sensitive to the presence of line breaks in places where C++ compilers aren't.

An isolated test case:

$ clang-old -E -x c++ -P - -o /tmp/pp.good
#define I(x, ...) \
 x { return X##x; }



#ifndef A
#define A(op, x) I(op, x)
#endif

A(foo, {







})
A(bar, {})
$ cat /tmp/pp.good
foo { return Xfoo; }
bar { return Xbar; }
$ clang-new -E -x c++ -P - -o /tmp/pp.bad
#define I(x, ...) \
 x { return X##x; }



#ifndef A
#define A(op, x) I(op, x)
#endif

A(foo, {







})
A(bar, {})
$ cat /tmp/pp.bad
foo { return Xfoo; }bar { return Xbar; }

Please fix or revert the commit.

Thanks!

IIUC, c6b0b16c0f55c34f4eaa05184815bbbe97f4b750 was aimed to fix a similar issue, but it doesn't fix this specific case. If providing a proper fix quickly isn't feasible, consider reverting these patches.

@tstellar, FYI there are still issues being reported with this commit (which was made before the branch). One fix is committed and a backport request was made in https://llvm.org/PR51261, but there’s still a couple other outstanding issues here.

Looking at this once again, I'm pretty sure that reverting this is much safer than trying to fix forward. Unless there's a really trivial fix to the outstanding issues, I'll revert this later today.

I am working on a fix, but I wouldn't mind reverting.

I am working on a fix, but I wouldn't mind reverting.

I would be comfortable reverting the feature for Clang 13 and fixing forward on trunk (unless there's a need to revert from trunk temporarily, of course). Giving the feature a full release to bake isn't a bad outcome.

@romanovvlad This is due to -fms-extensions. It Expands __Pragma to #pragma instead of keeping it a __Pragma. This is a one-line fix.

@alexfh This was fixed by D106924

I would suggest to not revert. Will upload a patch for -fms-extensions after some cleanup.

I am working on a fix, but I wouldn't mind reverting.

I would be comfortable reverting the feature for Clang 13 and fixing forward on trunk (unless there's a need to revert from trunk temporarily, of course). Giving the feature a full release to bake isn't a bad outcome.

While this means that ccache can only use this feature when clang-14 is detected, I fully understand.

I am working on a fix, but I wouldn't mind reverting.

I would be comfortable reverting the feature for Clang 13 and fixing forward on trunk (unless there's a need to revert from trunk temporarily, of course). Giving the feature a full release to bake isn't a bad outcome.

While this means that ccache can only use this feature when clang-14 is detected, I fully understand.

Correct -- that would be unfortunate as I believe you were hoping for this to be in Clang 13 for ccache support. My fear of having this in Clang 13 is that very early testing has found several breakages so I worry if we've not had enough time to shake all those out. Having a bit more time might build our confidence. (So the result, to me, should be either that we back this out of the Clang 13 branch if we're still worried or we patch the Clang 13 branch if we think we've got all the cases, but I want to make sure we don't leave it somewhere in the middle for the release branch.)

Patch uploaded here: D107183

Correct -- that would be unfortunate as I believe you were hoping for this to be in Clang 13 for ccache support.

Yes, that would have been the ideal outcome.

@romanovvlad This is due to -fms-extensions. It Expands __Pragma to #pragma instead of keeping it a __Pragma. This is a one-line fix.

@alexfh This was fixed by D106924

You're right. It turns out that my comment https://reviews.llvm.org/D104601#2915122 was a mistake and c6b0b16c0f55c34f4eaa05184815bbbe97f4b750 actually fixes this particular issue.

I would suggest to not revert. Will upload a patch for -fms-extensions after some cleanup.

I'm running more thorough testing of clang with the c6b0b16c0f55c34f4eaa05184815bbbe97f4b750 commit to ensure we're not missing anything else here.

FYI, 13.0.0 rc1 is getting tagged on Monday, so it’d be good to have the branch in a usable state by then, either with all fixes we currently have in main, or reverted.

@mstorsjo I've taken all steps required for the resolution suggested @aaron.ballman. llvm.org/PR51300 suberseedes llvm.org/PR51261 (release-13.x branch is under @tstellard's control). D107183 has been pushed to main to fix the outstanding issue discovered by @romanovvlad (unfortunately without pre-commit review, but given that this might be blocking them, the patch is small and you would like the situation to be resolved by then, I think a post-commit review is acceptable).

@alexfh Happy to fix any additional issues you may find. The -fminimize-whitespace feature unfortunately required some restructuring of the the code since the decision on when to insert newline were done in many different places using different logics.

Patch uploaded here: D107183

The patch resolves the issue. Thanks!

Do we have all the issues fixed in trunk yet or do we need to revert in the release/13.x branch.

Do we have all the issues fixed in trunk yet or do we need to revert in the release/13.x branch.

All known issues have been fixed on trunk. However, @alexfh said they are going to do more thorough testing and @aaron.ballman has not changed their suggestion to revert on release/13.x. Since you just reverted on clang-13.x, I'd say the decision has been made.

RKSimon added inline comments.
clang/lib/Frontend/PrintPreprocessedOutput.cpp
680

@Meinersbur Static analysis is warning that these are both the same - should one be EmittedDirectiveOnThisLine ?

This commit seems to cause some regression for "-save-temps" as there is no new line before the pragma. See the below test case, -E will output
int test();#pragma clang assume_nonnull
It will fail the compilation on the preprocessed output with
error: expected unqualified-id
int test();#pragma clang assume_nonnull end

^

Thanks,

Manman

cat test-pragma.mm
// RUN: %clang -E -o - %s | FileCheck %s

#define CF_ASSUME_NONNULL_BEGIN _Pragma("clang assume_nonnull begin")
#define CF_ASSUME_NONNULL_END _Pragma("clang assume_nonnull end")

CF_ASSUME_NONNULL_BEGIN
int test();

CF_ASSUME_NONNULL_END
// CHECK-NOT: test();#pragma clang assume_nonnull

Meinersbur added a comment.EditedNov 4 2021, 10:58 PM

It will fail the compilation on the preprocessed output with
error: expected unqualified-id
int test();#pragma clang assume_nonnull end

The handler for assume_nonull passes an invalid SourceLocation if embedded in in a _Pragma. Not sure whether this is already a bug, but we should insert semantically relevant newlines even if the line for the #pragma is unknown. Fixed in rG1606022fab2d90ed8ee6d15800ec1c2c293db20e. Thanks for the report.

I looked into why it did work before -fminimize-whitespace introduced since the if (PLoc.isInvalid()) return; bail-out existed as well. There was another function startNewLineIfNeeded(bool ShouldUpdateCurrentLine) that was called regardless of PLoc and I folded the functionality into MoveToLine since having two functions with unclear separation between them (i.e. when should ShouldUpdateCurrentLine be true) made things difficult.

clang/lib/Frontend/PrintPreprocessedOutput.cpp
680

Fixed in rG8f099d17a1bee857ada3c5af032cfcb559252024. Thanks for the report and sorry for not seeing you comment until now.