This is an archive of the discontinued LLVM Phabricator instance.

clang-format: Add preprocessor directive indentation
ClosedPublic

Authored by euhlmann on Jul 27 2017, 12:53 PM.

Details

Summary

This is an implementation for bug 17362 which adds support for indenting preprocessor statements inside if/ifdef/endif. This takes previous work from fmauch (https://github.com/fmauch/clang/tree/preprocessor_indent) and makes it into a full feature.
The context of this patch is that I'm a VMware intern, and I implemented this because VMware needs the feature. As such, some decisions were made based on what VMware wants, and I would appreciate suggestions on expanding this if necessary to use-cases other people may want.

This adds a new enum config option, IndentPPDirectives. Values are:

  • PPDIS_None (in config: None):
#if FOO
#if BAR
#include <foo>
#endif
#endif
  • PPDIS_AfterHash (in config: AfterHash):
#if FOO
#  if BAR
#    include <foo>
#  endif
#endif

This is meant to work whether spaces or tabs are used for indentation. Preprocessor indentation is independent of indentation for non-preprocessor lines.

Preprocessor indentation also attempts to ignore include guards with the checks:

  1. Include guards cover the entire file
  2. Include guards don't have #else
  3. Include guards begin with
#ifndef <var>
#define <var>

This patch allows UnwrappedLineParser::PPBranchLevel to be decremented to -1 (the initial value is -1) so the variable can be used for indent tracking.

Defects:

  • This patch does not handle the case where there's code between the #ifndef and #define but all other conditions hold. This is because when the #define line is parsed, UnwrappedLineParser::Lines doesn't hold the previous code line yet, so we can't detect it. This is out of the scope of this patch.
  • This patch does not handle cases where legitimate lines may be outside an include guard. Examples are #pragma once and #pragma GCC diagnostic, or anything else that does not change the meaning of the file if it's included multiple times.
  • This does not detect when there is a single non-preprocessor line in front of an include-guard-like structure where other conditions hold because ScopedLineState hides the line.
  • Preprocessor indentation throws off TokenAnnotator::setCommentLineLevels so the indentation of comments immediately before indented preprocessor lines is toggled on each run. Fixing this issue appears to be a major change and too much complexity for this patch.

Diff Detail

Repository
rL LLVM

Event Timeline

euhlmann created this revision.Jul 27 2017, 12:53 PM
djasper added inline comments.Jul 31 2017, 10:29 PM
docs/ClangFormatStyleOptions.rst
1182 ↗(On Diff #108484)

I think we can foresee that a bool is not going to be enough here. Make this an enum, which for no can contain the values "no" and "afterhash" and in the long run can get a "beforehash" value.

lib/Format/ContinuationIndenter.cpp
379 ↗(On Diff #108484)

Start comments upper case.

383 ↗(On Diff #108484)

Same here and use full sentences.

Also, this does not seem to be what the example in the style option suggests and I'd rather not do it (subtracting 1).

lib/Format/UnwrappedLineFormatter.cpp
1021 ↗(On Diff #108484)

... or not at all. :)

lib/Format/UnwrappedLineParser.cpp
667 ↗(On Diff #108484)

LLVM style does not use braces for single statement ifs.

699 ↗(On Diff #108484)

In LLVM's codebase, we mostly just leave out the "!= nullptr" and rely on the implicit conversion to bool.

701 ↗(On Diff #108484)

I think you'll need substantially more tests here:

  • An include guard must not have a #else
  • An include guard must span the entire file
lib/Format/UnwrappedLineParser.h
238 ↗(On Diff #108484)

I think this should almost always be PPBranchLevel. Probably the different case is the #else itself, but I think we can handle that easily.

euhlmann added inline comments.Aug 2 2017, 12:23 PM
lib/Format/ContinuationIndenter.cpp
383 ↗(On Diff #108484)

I apologize, the style option documentation was a typo. The patch summary has the intended behavior, and I mentioned there that I count the hash as a column. Part of the reasoning for this is because it would look the same visually with spaces or tabs.

euhlmann added inline comments.Aug 3 2017, 11:09 AM
lib/Format/UnwrappedLineParser.cpp
701 ↗(On Diff #108484)

It sounds like these would require two passes, since we wouldn't know until the end whether an include guard spans the whole file or has a #else. Do we really want two passes? Or do you have any advice on how these tests can be implemented?

djasper added inline comments.Aug 4 2017, 12:55 AM
lib/Format/ContinuationIndenter.cpp
383 ↗(On Diff #108484)

Do you know of a coding style that writes something about this? I think the similarity of spaces vs. tabs is not a strong reason because a source file will either use one or the other. To me:

#if a == 1
# define X
# if b == 2
#   define Y
...

Would look weird. I'd prefer if we kept this simpler and more consistent.

lib/Format/UnwrappedLineParser.cpp
701 ↗(On Diff #108484)

You could store an additional bool that gets set to true if the possible include guard is matched up with an appropriate #endif in the last line of a file. If that bool is true, you can remove 1 from the level of all preprocessor lines.

euhlmann added inline comments.Aug 7 2017, 2:23 PM
lib/Format/UnwrappedLineParser.h
238 ↗(On Diff #108484)

It doesn't look like PPBranchLevel can be used to differentiate between no indent and one level of indent, since it starts at -1 and then stays >=0. For example:

#define A     // -1
#if B         // -1
#  define C   // 0
#  if D       // 0
#    define E // 1
#  endif      // 0
#endif        // 0
#define F     // 0
djasper edited edge metadata.Aug 8 2017, 1:32 AM

Manuel: Can you take a look at the last comment here? Why does PPBranchLevel start at -1?

klimek edited edge metadata.Aug 8 2017, 9:03 AM

Manuel: Can you take a look at the last comment here? Why does PPBranchLevel start at -1?

It's a perhaps too-clever implementation to make sure that we can have a per-branch data structure (indexed from 0) , and then making sure that we never index out of bounds by never going down to -1 again.
I think if we need this info, we can just make it count down to -1 again (or, but that's isomorphic, let it run from 0 and make sure we never index into the data structures at 0 :)

I think if we need this info, we can just make it count down to -1 again (or, but that's isomorphic, let it run from 0 and make sure we never index into the data structures at 0 :)

Should I do one of these things? Let me know how you'd like me to implement this.

klimek added a comment.Aug 9 2017, 1:47 AM

I think if we need this info, we can just make it count down to -1 again (or, but that's isomorphic, let it run from 0 and make sure we never index into the data structures at 0 :)

Should I do one of these things? Let me know how you'd like me to implement this.

So my suggestion would be to let it count down to -1 again and put a check around indexing into the data structures.

euhlmann updated this revision to Diff 110429.EditedAug 9 2017, 10:22 AM
euhlmann edited the summary of this revision. (Show Details)

This addresses Daniel's previous concerns. The option is now an enum, the heuristic for detecting include guards is expanded and has corresponding unit tests, and style issues are resolved.
This does not yet use PPBranchLevel to track indent; I'm updating this review with my current work before addressing that point.

There are minor defects in the implementation, which I've put in the change description and added EXPECT_NE tests for. I don't know how to fix those, and I'm not sure whether they should be considered important. Let me know if you think they need to be fixed.

euhlmann marked 11 inline comments as done.Aug 9 2017, 10:24 AM
euhlmann edited the summary of this revision. (Show Details)Aug 9 2017, 10:33 AM
euhlmann updated this revision to Diff 110602.Aug 10 2017, 9:59 AM
euhlmann edited the summary of this revision. (Show Details)

The patch now uses PPBranchLevel to track indent level. It allows PPBranchLevel to go back down to -1. The existing PPBranchLevel >= 0 checks appear to prevent indexing into structures when PPBranchLevel is -1.

LG from my side.

Daniel's right. We need need substantially more tests!

docs/ClangFormatStyleOptions.rst
1199 ↗(On Diff #110429)

delete outdated bit of description: ", counting the hash as a column"

include/clang/Format/Format.h
1038 ↗(On Diff #110429)

ditto

lib/Format/UnwrappedLineParser.cpp
707 ↗(On Diff #110602)

... has an #else ...

728 ↗(On Diff #110602)

You can use a range based for loop here. Also locals begin with upper case:

for (auto& Line : Lines)  {
742–744 ↗(On Diff #110602)

Lines.size() == 1 is problematic. We must at least support comments before the include guard. I'll comment on the tests and point out a couple of important cases that currently fail.

unittests/Format/FormatTest.cpp
2281 ↗(On Diff #110602)

Experimenting with the patch locally I found that comment indentation is broken in some cases. Please add tests that cover comments. For example:

comment indented at code level as expected:

void f() {
#if 0
  // Comment
  code();
#  define FOO 0
#endif
}

comment not indented at code level when there's a guard:

#ifndef _SOMEFILE_H
#define _SOMEFILE_H
void f() {
#if 0
// Comment
  code();
#  define FOO 0
#endif
}
#endif

The #define FOO 0 is required to trigger this behavior.

2322–2331 ↗(On Diff #110602)

A DRY-er way to say this is:

auto WithGuard = "#ifndef _SOMEFILE_H\n"
                 "#define _SOMEFILE_H\n"
                 "code();\n"
                 "#endif";
ASSERT_EQ(WithGuards, format(WithGuards, Style));
2334 ↗(On Diff #110602)

It would be nice to have a comment that explains why you cannot use verifyFormat.

2394 ↗(On Diff #110602)

The two "Defect:" cases would be more precise and more self-documenting if they used EXPECT_EQ where the "expected" text showed the incorrect results.

2405–2408 ↗(On Diff #110602)

We need to handle comments (like copyrights) before the include guard. There is an analogous problem with trailing blank lines, or trailing comments.

I think we need a small state machine: Init, Start, Open, Closed, NotGuard (terminal). FoundIncludeGuardStart should change from a bool to an enum to track this state. PPMaybeIncludeGuard can then drop it's "mabye". Whether or not it is null depends directly on the state. It does not determine state itself. While parsing, each line updates the state. If we get to eof in the Closed state then we have detected an include guard. ... Or similar logic....

Note that support for comments before the guard opens the door to support for #pragma once. It is tempting to add that, but that is a slippery slope. I would recommend leaving that as a defect that can be addressed later.

mzeren-vmw added inline comments.Aug 15 2017, 3:45 AM
unittests/Format/FormatTest.cpp
2281 ↗(On Diff #110602)

Erik spent some time investigating issues with comment indentation. A couple of insights so far:

a) We need tests for wrapped macros (with trailing \ continuations). These need to include embedded comments.

b) It would be most natural to align comments with the following non-comment line. So a comment may be indented at "preprocessor level" or "code level". If indented at preprocessor level we have to handle the extra space introduced by the leading hash. This may require an extra bit per line to indicate whether the "aligned-to" line is preprocessor or code.

euhlmann added inline comments.Aug 15 2017, 2:32 PM
unittests/Format/FormatTest.cpp
2405–2408 ↗(On Diff #110602)

@djasper @klimek
Do you have any comments on this? I've begun to implement an enum/state machine but I'd like to know if you'd like me to implement a different way.

djasper added inline comments.Aug 16 2017, 1:18 AM
unittests/Format/FormatTest.cpp
2281 ↗(On Diff #110602)

"#if 0" specifically might not be a good candidate for a test case as that should be treated like a comment. We should not try to format anything inside a "#if 0" and I'd be surprised if clang-format touched that at all. "#if A" or something is more appropriate.

2322–2331 ↗(On Diff #110602)

I think _A... is actually not an identifier that is ok to use. Remove the leading "_".

2334 ↗(On Diff #110602)

Or even have a separate function that messes this up in some other way, not altering line breaks so that we can keep these tests short.

2405–2408 ↗(On Diff #110602)

Most of the work here is done in the UnwrappedLineParser, which should skip over comments anyway. So yes, we should allow for these comments, but I don't think the logic needs to get much more complex because of that. I strongly doubt that a state machine is the right mechanism here.

And generally (without specific action to derive from that), I think we should try to err on the side of incorrectly finding header guards instead of incorrectly not recognizing a header guard. I think a not-indented #define somewhere is better than an incorrectly indented header guard.

euhlmann updated this revision to Diff 111433.Aug 16 2017, 4:36 PM
euhlmann marked 13 inline comments as done.
euhlmann edited the summary of this revision. (Show Details)

Allows comments before the include guard opens. However, if there's a single non-comment line before an include-guard-like structure that meets all the other conditions then it will be treated as an include guard because Lines does not yet contain the line before the #ifndef. This looks difficult to fix and I believe it's a rare enough situation to ignore.

This does not yet address avoiding repetition in the tests or comments before indented preprocessor lines being indented.

euhlmann updated this revision to Diff 111606.Aug 17 2017, 7:38 PM
euhlmann marked an inline comment as done.
euhlmann edited the summary of this revision. (Show Details)

Eliminated redundancy in tests by fixing test::messUp which was incorrectly merging lines.

Fixing the issue Mark brought up regarding comments before indented preprocessor lines appears to be a major change and I believe it's too much complexity for this patch.

krasimir added a subscriber: krasimir.

Krasimir: Can you actually give this a round of review? I will also try to do so tomorrow.

krasimir added inline comments.Aug 22 2017, 11:31 AM
lib/Format/ContinuationIndenter.cpp
383 ↗(On Diff #111606)

You can replace Current.Previous with Previous. Also, I'd swap the checks a bit, like:

if (Previous.is(tok::hash) && State.FirstIndent > 0 &&
    Style.IndentPPDirectives == FormatStyle::PPDIS_AfterHash &&
    (State.Line->Type == LT_PreprocessorDirective ||
     State.Line->Type == LT_ImportStatement)) {

That way, the common case Previous.is(tok::hash) == false is handled quickly.

387 ↗(On Diff #111606)

I don't understand this comment. Could you please give an example?

lib/Format/UnwrappedLineParser.cpp
701 ↗(On Diff #111606)

This can easily lead to a pretty bad runtime: consider a file starting with a few hundred lines of comments and having a few hundred #ifdef-s.

I'd say that after we do this MaybeIncludeGuard thing once, we don't repeat it again.

Also, lines could start with a block comment and continue with code:

/* small */ int ten = 10;
732 ↗(On Diff #111606)

Why do we need to subtract one from every preprocessor indent?

737 ↗(On Diff #111606)

Wouldn't this also indent lines continuing macro definitions, as in:

#define A(x) int f(int x) { \
  return x; \
}
760 ↗(On Diff #111606)

Why do we reset PPMaybeIncludeGuard here?

770 ↗(On Diff #111606)

Why do we ++Line->Level here?

787 ↗(On Diff #111606)

Why do we reset PPMaybeIncludeGuard here?

lib/Format/UnwrappedLineParser.h
249 ↗(On Diff #111606)

Please add a comment for PPMaybeIncludeGuard: is it expected to point to the hash token of the #ifdef, or?

euhlmann added inline comments.Aug 22 2017, 3:03 PM
lib/Format/ContinuationIndenter.cpp
387 ↗(On Diff #111606)

When tabs are enabled, extra spaces would be added because of the column offset caused by the # token.

#if FOO
#<tab>if BAR
#<tab><space>define BAZ
#<tab>endif
#endif

Let me know if there's a better way to fix that.

lib/Format/UnwrappedLineParser.cpp
732 ↗(On Diff #111606)

This only knows for sure when something is an include guard at the end of the file. By that time, everything inside the include guard has already been indented one extra level since we didn't know about the guard yet, so it would look like

#ifndef FILE_H
#  define FILE_H
// ...
#  other directives
// ...
#endif

Subtracting one is the action taken when we decide that there's a valid include guard, so it looks like

#ifndef FILE_H
#define FILE_H
// ...
#other directives
// ...
#endif
760 ↗(On Diff #111606)

The idea is if there's any preprocessor lines between the #ifndef and the #define, it doesn't treat it as an include guard. This is the same reason for the other PPMaybeIncludeGuard reset.

770 ↗(On Diff #111606)

This sets correct indent for multi-line macros, so they are indented one level more than the indent of the #define instead of always at level 1.

djasper added inline comments.Aug 23 2017, 7:09 AM
unittests/Format/FormatTest.cpp
2297 ↗(On Diff #111606)

Please just set Style.ColumnLimit to 60 or so in order not to have to wrap the test cases. Makes then much easier to understand.

2314 ↗(On Diff #111606)

It's hard to tell with the line wrap, but this does look wrong, the escaped newline does not seem to be in column 80.

2427 ↗(On Diff #111606)

Write "FIXME" instead of "Defect"

unittests/Format/FormatTestUtils.h
34 ↗(On Diff #111606)

Could we just do:

} else if (MessedUp[i] == '#' && (JustReplacedNewline || i == 0 || MessedUp[i - 1] == '\n') {

?

euhlmann updated this revision to Diff 112432.Aug 23 2017, 1:49 PM
euhlmann marked 7 inline comments as done.
  • Swapped checks in ContinuationIndenter
  • Renamed PPMaybeIncludeGuard to IfNdefCondition and added comment
  • Added another bool member that tracks if include guard detection was rejected once. If the bool gets set, the loop checking previous lines doesn't run again.
  • Set lower ColumnLimit in tests
  • Fixed incorrect column for \ in multi-line macros
  • Resolved comment in FormatTestUtils
euhlmann added inline comments.Aug 23 2017, 1:51 PM
lib/Format/UnwrappedLineParser.cpp
737 ↗(On Diff #111606)

Multi-line macros look this under this patch

#if A
#  define MACRO                        \
    void a(int x) {                    \
      b();                             \
      c();                             \
      d();                             \
      e();                             \
      f();                             \
    }
#endif

With previous clang-format, no preprocessor indentation:

#if A
#define MACRO                          \
  void a(int x) {                      \
    b();                               \
    c();                               \
    d();                               \
    e();                               \
    f();                               \
  }
#endif
mzeren-vmw added inline comments.Aug 23 2017, 2:27 PM
lib/Format/ContinuationIndenter.cpp
383 ↗(On Diff #108484)

@djasper I just noticed that the Google Style Guide has:

#if DISASTER_PENDING      // Correct -- Starts at beginning of line
    DropEverything();
# if NOTIFY               // OK but not required -- Spaces after #
    NotifyClient();
# endif
#endif

Note the single space in # if NOTIFY. Can we correct the Guide to match what we have here?

euhlmann edited the summary of this revision. (Show Details)Aug 23 2017, 2:27 PM
euhlmann marked an inline comment as not done.Aug 23 2017, 3:11 PM
djasper accepted this revision.Aug 24 2017, 7:55 AM

From my side this looks good for now (we can always improve more later). Krasimir, what do you think?

This revision is now accepted and ready to land.Aug 24 2017, 7:55 AM
krasimir edited edge metadata.Aug 24 2017, 8:01 AM

Thank you! I understand this patch better now. Looks good from my side!

krasimir accepted this revision.Aug 24 2017, 8:01 AM

I'm glad this is finally in a state to land. Thanks for the helpful reviews!

@euhlmann: are you planning to commit this?

I'll land this for you, as discussed offline. The best thing is to apply for commit rights after you have a few patches landed.

This revision was automatically updated to reflect the committed changes.