This is an archive of the discontinued LLVM Phabricator instance.

clang-format: Fix formatting of function body followed by semicolon
Needs ReviewPublic

Authored by Typz on Jan 31 2018, 2:54 AM.

Details

Summary

In some case, the heuristic used to identify the type of blocks (in
UnwrappedLineParser) mistakens a Block for a BracedInit, when the
TokenAnnotator eventually finds (correclty) that this is a function
declaration.

This happens with empty function blocks followed by a semicolon.

This causes for exemple the following code to be mistakenly formatted:

void abort(){};

instead of:

void abort() {};

Diff Detail

Event Timeline

Typz created this revision.Jan 31 2018, 2:54 AM
Typz retitled this revision from clang-format: Fix formatting of function blocks followed by semicolon to clang-format: Fix formatting of function body followed by semicolon.Jan 31 2018, 2:54 AM
Typz updated this revision to Diff 132130.Jan 31 2018, 2:55 AM

update commit message

Typz added inline comments.Jan 31 2018, 4:22 AM
lib/Format/TokenAnnotator.cpp
1911

this may actually not be enough in all cases: to completely match the 'standard' behavior we should be able to "split" the UnwrappedLine at both opening and ending brace, in this case.

I think this case is not important enough to fix. Please tell users to just delete the useless semicolon.

In particular, my concern is that this makes the data-flow significantly more complicated. Early on in the development, we had separate token classes for the different states of the analysis (unwrapped line parsing, annotation, formatting) with const members. We had to move away from that as it didn't carry the costs that it had. However, this change changes the behavior from the BlockKind only ever being set inside the UnwrappedLineParser and being consumed later (e.g. in the TokenAnnotator) to a state where it is set in many places and setting and changing the value are inter-mingled. That makes it much harder to understand the invariants. E.g. how can we ensure that the incorrect BK_BracedInit hasn't already be consumed by something (now and in the future).

Typz added a comment.Jan 31 2018, 5:57 AM

There are actually other cases, e.g. with macros "containing" the semicolon:

void abort() {
  FOO()
  BAR()
};

gets reformatted to this (still wrong with this patch, but the space after the parenthesis is added):

void abort(){ FOO() BAR() };

And also this one (though it may be slightly different, there is no semicolon here):

void abort() {
  FOO()
}
uint32_t bar() {}

gets reformatted to:

void abort(){ FOO() BAR() } uint32_t bar() {}

IMO the "real" fix would be to somehow let TokenAnnotator::calculateFormattingInformation split the current UnwrappedLine/AnnotatedLine, so that the 'end of the line' could be unwrapped properly: this would also allow keeping the invariant, and just implement some kind of backtracking. But it may be tricky to get the state right; maybe adding some child lines instead of just adding the lines may be simpler?

Typz added a comment.Jan 31 2018, 6:04 AM

I think this case is not important enough to fix. Please tell users to just delete the useless semicolon.

I would agree if were simple to spot: but often this may manifest itself only with a missing space between the function parameters and the function body, which can easily be overlooked...
Another option may be to create new pass which "removes" that extra semicolon: this way we would both fix it and get things right on next pass.

However, the issue with a function which contains only a macro and which is followed by another function which returns an custom type cannot so easily be fixed:

void abort() {
  FOO()
}
uint32_t bar() {}

(note that this case works fine if the body of the function contains a semicolon or reserved keyword, or if the next function returns a base type [int, void...])

I don't think cases where there is only a semicolon-less macro are particularly frequent/important either. You probably could even add a semicolon in those cases, right? How frequent are such cases in your codebase? Either way, they aren't fixed by this patch, so they aren't a good reason to move forward with it.

I still believe that this patch adds complexity for very little gain. And I am not even sure it is correct. isFunctionDeclarationName/getFunctionParameterList is just yet another heuristic that might go wrong. And it might go wrong in both ways. You might still miss some cases and you might start incorrectly formatting stuff as functions. Fundamentally clang-format just doesn't have enough info to do this correctly.

In D42729#993188, @Typz wrote:

I think this case is not important enough to fix. Please tell users to just delete the useless semicolon.

I would agree if were simple to spot: but often this may manifest itself only with a missing space between the function parameters and the function body, which can easily be overlooked...

".. which can easily be overlooked". If they are overlooked, nobody cares about the space either, so no harm done ;).

Another option may be to create new pass which "removes" that extra semicolon: this way we would both fix it and get things right on next pass.

I am not sure we can do this reliably enough.

However, the issue with a function which contains only a macro and which is followed by another function which returns an custom type cannot so easily be fixed:

void abort() {
  FOO()
}
uint32_t bar() {}

(note that this case works fine if the body of the function contains a semicolon or reserved keyword, or if the next function returns a base type [int, void...])

Typz added a comment.EditedFeb 1 2018, 6:40 AM

I don't think cases where there is only a semicolon-less macro are particularly frequent/important either. You probably could even add a semicolon in those cases, right? How frequent are such cases in your codebase? Either way, they aren't fixed by this patch, so they aren't a good reason to move forward with it.

Adding a semicolon works for indentation and behavior, but leads to compiler warning.
So a proper fix would require to change the macro, which can easily be tricky with macros containings branches or loops...

It should not happen that often, but in reality it actually happens often enough that I found the bug while simply testing clang-format...

I still believe that this patch adds complexity for very little gain. And I am not even sure it is correct. isFunctionDeclarationName/getFunctionParameterList is just yet another heuristic that might go wrong. And it might go wrong in both ways. You might still miss some cases and you might start incorrectly formatting stuff as functions. Fundamentally clang-format just doesn't have enough info to do this correctly.

As such, it does not add such complexity I think, but indeed it is not completely correct: it works only in the case where no line breaks are expected around the body (since it is still parsed as a single UnwrappedLine). So it is a slight improvement, but not a complete fix.

isFunctionDeclarationName() is an heuristic as well, but provides better result and does not break the existing one (since there are actually quite a few tests, and none get broken by this patch :-) )

I think "overall" clang-format has enough information for this case, it is just not available at the right time: Unwrapping is done first, then token and matching parenthesis are analysed and lines anotated, and these token anotations are needed to improve the unwrapping behavior...
I would really want to call isFunctionDeclarationName() from UnwrappedLineParser::calculateBraceTypes(), but parenthesis matching, nesting level and operators identification are not available yet.

".. which can easily be overlooked". If they are overlooked, nobody cares about the space either, so no harm done ;).

We want to use a formatter to ensure the code is consistent: e.g. to ensure things that may be overlooked by a human are actually formatted according to the rules.
We want people to trust the tool to do the right thing, so it is best to avoid as many errors as possible...

  • Of course you find all sorts of errors while testing clang-format on a large-enough codebase. That doesn't mean that users run into them much.
  • We have had about 10k clang-format users internally for several years. The semicolon issue comes up but really rarely and if it does, people happily fix their code not blaming clang-format.

Unrelated, my point remains that setting BlockKind in TokenAnnotator is bad enough that I wouldn't want to do it for reaping this small benefit. And I can't see how you could easily achieve the same thing without doing that.

Typz added a comment.Feb 28 2018, 6:58 AM
  • Of course you find all sorts of errors while testing clang-format on a large-enough codebase. That doesn't mean that users run into them much.
  • We have had about 10k clang-format users internally for several years. The semicolon issue comes up but really rarely and if it does, people happily fix their code not blaming clang-format.

Unrelated, my point remains that setting BlockKind in TokenAnnotator is bad enough that I wouldn't want to do it for reaping this small benefit. And I can't see how you could easily achieve the same thing without doing that.

Just a question though. I there a reason brace matching (and other parts of TokenAnnotations) are not performed before LineUnwrapping? That would probably allow fixing this issue more cleanly (though I am not sure I would have the time to actually perform this probably significant task)...

  • Of course you find all sorts of errors while testing clang-format on a large-enough codebase. That doesn't mean that users run into them much.
  • We have had about 10k clang-format users internally for several years. The semicolon issue comes up but really rarely and if it does, people happily fix their code not blaming clang-format.

Unrelated, my point remains that setting BlockKind in TokenAnnotator is bad enough that I wouldn't want to do it for reaping this small benefit. And I can't see how you could easily achieve the same thing without doing that.

Just a question though. I there a reason brace matching (and other parts of TokenAnnotations) are not performed before LineUnwrapping? That would probably allow fixing this issue more cleanly (though I am not sure I would have the time to actually perform this probably significant task)...

I think this is just the way it has grown. And brace/paren matching is actually done in both places several times by now, I think :(.

Fundamentally, the UnwrappedLineParser had the task of splitting a source file into lines and only implemented what it needed for that. The TokenAnnotator then did a much more elaborate analysis on each line to determine token types and such and distinguish what a "<" actually is (comparison vs. template opener). Having these two things be separate makes it slightly easier for error recovery and such as the state space they can be in is somewhat more limited.