This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/Format/
-
Format/
1
TokenAnnotator.cpp
-
unittests/Format/
-
Format/
-
FormatTest.cpp

Differential D42729

clang-format: Fix formatting of function body followed by semicolon
Needs ReviewPublic

Authored by Typz on Jan 31 2018, 2:54 AM.

Download Raw Diff

Details

Reviewers

krasimir
djasper
klimek

Summary

In some case, the heuristic used to identify the type of blocks (in
UnwrappedLineParser) mistakens a Block for a BracedInit, when the
TokenAnnotator eventually finds (correclty) that this is a function
declaration.

This happens with empty function blocks followed by a semicolon.

This causes for exemple the following code to be mistakenly formatted:

void abort(){};

instead of:

void abort() {};

Diff Detail

Repository

rC Clang

Build Status

Buildable 14455
Build 14455: arc lint + arc unit

Event Timeline

Typz created this revision.Jan 31 2018, 2:54 AM

Harbormaster completed remote builds in B14454: Diff 132129.Jan 31 2018, 2:54 AM

Typz retitled this revision from clang-format: Fix formatting of function blocks followed by semicolon to clang-format: Fix formatting of function body followed by semicolon.Jan 31 2018, 2:54 AM

update commit message

Harbormaster completed remote builds in B14455: Diff 132130.Jan 31 2018, 2:56 AM

Typz added inline comments.Jan 31 2018, 4:22 AM

lib/Format/TokenAnnotator.cpp
1911	this may actually not be enough in all cases: to completely match the 'standard' behavior we should be able to "split" the UnwrappedLine at both opening and ending brace, in this case.

I think this case is not important enough to fix. Please tell users to just delete the useless semicolon.

In particular, my concern is that this makes the data-flow significantly more complicated. Early on in the development, we had separate token classes for the different states of the analysis (unwrapped line parsing, annotation, formatting) with const members. We had to move away from that as it didn't carry the costs that it had. However, this change changes the behavior from the BlockKind only ever being set inside the UnwrappedLineParser and being consumed later (e.g. in the TokenAnnotator) to a state where it is set in many places and setting and changing the value are inter-mingled. That makes it much harder to understand the invariants. E.g. how can we ensure that the incorrect BK_BracedInit hasn't already be consumed by something (now and in the future).

There are actually other cases, e.g. with macros "containing" the semicolon:

void abort() {
  FOO()
  BAR()
};

gets reformatted to this (still wrong with this patch, but the space after the parenthesis is added):

void abort(){ FOO() BAR() };

And also this one (though it may be slightly different, there is no semicolon here):

void abort() {
  FOO()
}
uint32_t bar() {}

gets reformatted to:

void abort(){ FOO() BAR() } uint32_t bar() {}

IMO the "real" fix would be to somehow let TokenAnnotator::calculateFormattingInformation split the current UnwrappedLine/AnnotatedLine, so that the 'end of the line' could be unwrapped properly: this would also allow keeping the invariant, and just implement some kind of backtracking. But it may be tricky to get the state right; maybe adding some child lines instead of just adding the lines may be simpler?

In D42729#993159, @djasper wrote:

I think this case is not important enough to fix. Please tell users to just delete the useless semicolon.

I would agree if were simple to spot: but often this may manifest itself only with a missing space between the function parameters and the function body, which can easily be overlooked...
Another option may be to create new pass which "removes" that extra semicolon: this way we would both fix it and get things right on next pass.

However, the issue with a function which contains only a macro and which is followed by another function which returns an custom type cannot so easily be fixed:

void abort() {
  FOO()
}
uint32_t bar() {}

(note that this case works fine if the body of the function contains a semicolon or reserved keyword, or if the next function returns a base type [int, void...])

I don't think cases where there is only a semicolon-less macro are particularly frequent/important either. You probably could even add a semicolon in those cases, right? How frequent are such cases in your codebase? Either way, they aren't fixed by this patch, so they aren't a good reason to move forward with it.

I still believe that this patch adds complexity for very little gain. And I am not even sure it is correct. isFunctionDeclarationName/getFunctionParameterList is just yet another heuristic that might go wrong. And it might go wrong in both ways. You might still miss some cases and you might start incorrectly formatting stuff as functions. Fundamentally clang-format just doesn't have enough info to do this correctly.

In D42729#993188, @Typz wrote:

In D42729#993159, @djasper wrote:

I think this case is not important enough to fix. Please tell users to just delete the useless semicolon.

I would agree if were simple to spot: but often this may manifest itself only with a missing space between the function parameters and the function body, which can easily be overlooked...

".. which can easily be overlooked". If they are overlooked, nobody cares about the space either, so no harm done ;).

Another option may be to create new pass which "removes" that extra semicolon: this way we would both fix it and get things right on next pass.

I am not sure we can do this reliably enough.

However, the issue with a function which contains only a macro and which is followed by another function which returns an custom type cannot so easily be fixed:
void abort() {
  FOO()
}
uint32_t bar() {}
(note that this case works fine if the body of the function contains a semicolon or reserved keyword, or if the next function returns a base type [int, void...])

I don't think cases where there is only a semicolon-less macro are particularly frequent/important either. You probably could even add a semicolon in those cases, right? How frequent are such cases in your codebase? Either way, they aren't fixed by this patch, so they aren't a good reason to move forward with it.

Adding a semicolon works for indentation and behavior, but leads to compiler warning.
So a proper fix would require to change the macro, which can easily be tricky with macros containings branches or loops...

It should not happen that often, but in reality it actually happens often enough that I found the bug while simply testing clang-format...

I still believe that this patch adds complexity for very little gain. And I am not even sure it is correct. isFunctionDeclarationName/getFunctionParameterList is just yet another heuristic that might go wrong. And it might go wrong in both ways. You might still miss some cases and you might start incorrectly formatting stuff as functions. Fundamentally clang-format just doesn't have enough info to do this correctly.

As such, it does not add such complexity I think, but indeed it is not completely correct: it works only in the case where no line breaks are expected around the body (since it is still parsed as a single UnwrappedLine). So it is a slight improvement, but not a complete fix.

isFunctionDeclarationName() is an heuristic as well, but provides better result and does not break the existing one (since there are actually quite a few tests, and none get broken by this patch :-) )

I think "overall" clang-format has enough information for this case, it is just not available at the right time: Unwrapping is done first, then token and matching parenthesis are analysed and lines anotated, and these token anotations are needed to improve the unwrapping behavior...
I would really want to call isFunctionDeclarationName() from UnwrappedLineParser::calculateBraceTypes(), but parenthesis matching, nesting level and operators identification are not available yet.

".. which can easily be overlooked". If they are overlooked, nobody cares about the space either, so no harm done ;).

We want to use a formatter to ensure the code is consistent: e.g. to ensure things that may be overlooked by a human are actually formatted according to the rules.
We want people to trust the tool to do the right thing, so it is best to avoid as many errors as possible...

Of course you find all sorts of errors while testing clang-format on a large-enough codebase. That doesn't mean that users run into them much.
We have had about 10k clang-format users internally for several years. The semicolon issue comes up but really rarely and if it does, people happily fix their code not blaming clang-format.

Unrelated, my point remains that setting BlockKind in TokenAnnotator is bad enough that I wouldn't want to do it for reaping this small benefit. And I can't see how you could easily achieve the same thing without doing that.

In D42729#994841, @djasper wrote:

Of course you find all sorts of errors while testing clang-format on a large-enough codebase. That doesn't mean that users run into them much.

We have had about 10k clang-format users internally for several years. The semicolon issue comes up but really rarely and if it does, people happily fix their code not blaming clang-format.

Unrelated, my point remains that setting BlockKind in TokenAnnotator is bad enough that I wouldn't want to do it for reaping this small benefit. And I can't see how you could easily achieve the same thing without doing that.

Just a question though. I there a reason brace matching (and other parts of TokenAnnotations) are not performed before LineUnwrapping? That would probably allow fixing this issue more cleanly (though I am not sure I would have the time to actually perform this probably significant task)...

In D42729#1022069, @Typz wrote:

In D42729#994841, @djasper wrote:

Of course you find all sorts of errors while testing clang-format on a large-enough codebase. That doesn't mean that users run into them much.

We have had about 10k clang-format users internally for several years. The semicolon issue comes up but really rarely and if it does, people happily fix their code not blaming clang-format.

Unrelated, my point remains that setting BlockKind in TokenAnnotator is bad enough that I wouldn't want to do it for reaping this small benefit. And I can't see how you could easily achieve the same thing without doing that.

Just a question though. I there a reason brace matching (and other parts of TokenAnnotations) are not performed before LineUnwrapping? That would probably allow fixing this issue more cleanly (though I am not sure I would have the time to actually perform this probably significant task)...

I think this is just the way it has grown. And brace/paren matching is actually done in both places several times by now, I think :(.

Fundamentally, the UnwrappedLineParser had the task of splitting a source file into lines and only implemented what it needed for that. The TokenAnnotator then did a much more elaborate analysis on each line to determine token types and such and distinguish what a "<" actually is (comparison vs. template opener). Having these two things be separate makes it slightly easier for error recovery and such as the state space they can be in is somewhat more limited.

Revision Contents

Path

Size

lib/

Format/

TokenAnnotator.cpp

41 lines

unittests/

Format/

FormatTest.cpp

9 lines

Diff 132130

lib/Format/TokenAnnotator.cpp

Show First 20 Lines • Show All 1,777 Lines • ▼ Show 20 Lines	void TokenAnnotator::annotate(AnnotatedLine &Line) {
else if (Line.startsWith(TT_ObjCProperty))		else if (Line.startsWith(TT_ObjCProperty))
Line.Type = LT_ObjCProperty;		Line.Type = LT_ObjCProperty;

Line.First->SpacesRequiredBefore = 1;		Line.First->SpacesRequiredBefore = 1;
Line.First->CanBreakBefore = Line.First->MustBreakBefore;		Line.First->CanBreakBefore = Line.First->MustBreakBefore;
}		}

// This function heuristically determines whether 'Current' starts the name of a		// This function heuristically determines whether 'Current' starts the name of a
// function declaration.		// function declaration, and returns the first token of parameters list in that
static bool isFunctionDeclarationName(const FormatToken &Current,		// case.
		// @return The token which opens the parameters list (e.g. the opening
		// parenthesis), or nullptr if this is not a function declaration.
		static const FormatToken *getFunctionParameterList(const FormatToken &Current,
const AnnotatedLine &Line) {		const AnnotatedLine &Line) {
auto skipOperatorName = [](const FormatToken Next) -> const FormatToken {		auto skipOperatorName = [](const FormatToken Next) -> const FormatToken {
for (; Next; Next = Next->Next) {		for (; Next; Next = Next->Next) {
if (Next->is(TT_OverloadedOperatorLParen))		if (Next->is(TT_OverloadedOperatorLParen))
return Next;		return Next;
if (Next->is(TT_OverloadedOperator))		if (Next->is(TT_OverloadedOperator))
continue;		continue;
if (Next->isOneOf(tok::kw_new, tok::kw_delete)) {		if (Next->isOneOf(tok::kw_new, tok::kw_delete)) {
// For 'new[]' and 'delete[]'.		// For 'new[]' and 'delete[]'.
if (Next->Next && Next->Next->is(tok::l_square) && Next->Next->Next &&		if (Next->Next && Next->Next->is(tok::l_square) && Next->Next->Next &&
Next->Next->Next->is(tok::r_square))		Next->Next->Next->is(tok::r_square))
Next = Next->Next->Next;		Next = Next->Next->Next;
continue;		continue;
}		}

break;		break;
}		}
return nullptr;		return nullptr;
};		};

// Find parentheses of parameter list.		// Find parentheses of parameter list.
const FormatToken *Next = Current.Next;		const FormatToken *Next = Current.Next;
if (Current.is(tok::kw_operator)) {		if (Current.is(tok::kw_operator)) {
if (Current.Previous && Current.Previous->is(tok::coloncolon))		if (Current.Previous && Current.Previous->is(tok::coloncolon))
return false;		return nullptr;
Next = skipOperatorName(Next);		Next = skipOperatorName(Next);
} else {		} else {
if (!Current.is(TT_StartOfName) \|\| Current.NestingLevel != 0)		if (!Current.is(TT_StartOfName) \|\| Current.NestingLevel != 0)
return false;		return nullptr;
for (; Next; Next = Next->Next) {		for (; Next; Next = Next->Next) {
if (Next->is(TT_TemplateOpener)) {		if (Next->is(TT_TemplateOpener)) {
Next = Next->MatchingParen;		Next = Next->MatchingParen;
} else if (Next->is(tok::coloncolon)) {		} else if (Next->is(tok::coloncolon)) {
Next = Next->Next;		Next = Next->Next;
if (!Next)		if (!Next)
return false;		return nullptr;
if (Next->is(tok::kw_operator)) {		if (Next->is(tok::kw_operator)) {
Next = skipOperatorName(Next->Next);		Next = skipOperatorName(Next->Next);
break;		break;
}		}
if (!Next->is(tok::identifier))		if (!Next->is(tok::identifier))
return false;		return nullptr;
} else if (Next->is(tok::l_paren)) {		} else if (Next->is(tok::l_paren)) {
break;		break;
} else {		} else {
return false;		return nullptr;
}		}
}		}
}		}

// Check whether parameter list can belong to a function declaration.		// Check whether parameter list can belong to a function declaration.
if (!Next \|\| !Next->is(tok::l_paren) \|\| !Next->MatchingParen)		if (!Next \|\| !Next->is(tok::l_paren) \|\| !Next->MatchingParen)
return false;		return nullptr;
// If the lines ends with "{", this is likely an function definition.		// If the lines ends with "{", this is likely an function definition.
if (Line.Last->is(tok::l_brace))		if (Line.Last->is(tok::l_brace))
return true;		return Next;
if (Next->Next == Next->MatchingParen)		if (Next->Next == Next->MatchingParen)
return true; // Empty parentheses.		return Next; // Empty parentheses.
// If there is an &/&& after the r_paren, this is likely a function.		// If there is an &/&& after the r_paren, this is likely a function.
if (Next->MatchingParen->Next &&		if (Next->MatchingParen->Next &&
Next->MatchingParen->Next->is(TT_PointerOrReference))		Next->MatchingParen->Next->is(TT_PointerOrReference))
return true;		return Next;
for (const FormatToken *Tok = Next->Next; Tok && Tok != Next->MatchingParen;		for (const FormatToken *Tok = Next->Next; Tok && Tok != Next->MatchingParen;
Tok = Tok->Next) {		Tok = Tok->Next) {
if (Tok->is(tok::l_paren) && Tok->MatchingParen) {		if (Tok->is(tok::l_paren) && Tok->MatchingParen) {
Tok = Tok->MatchingParen;		Tok = Tok->MatchingParen;
continue;		continue;
}		}
if (Tok->is(tok::kw_const) \|\| Tok->isSimpleTypeSpecifier() \|\|		if (Tok->is(tok::kw_const) \|\| Tok->isSimpleTypeSpecifier() \|\|
Tok->isOneOf(TT_PointerOrReference, TT_StartOfName, tok::ellipsis))		Tok->isOneOf(TT_PointerOrReference, TT_StartOfName, tok::ellipsis))
return true;		return Next;
if (Tok->isOneOf(tok::l_brace, tok::string_literal, TT_ObjCMethodExpr) \|\|		if (Tok->isOneOf(tok::l_brace, tok::string_literal, TT_ObjCMethodExpr) \|\|
Tok->Tok.isLiteral())		Tok->Tok.isLiteral())
return false;		return nullptr;
}		}
return false;		return nullptr;
}		}

bool TokenAnnotator::mustBreakForReturnType(const AnnotatedLine &Line) const {		bool TokenAnnotator::mustBreakForReturnType(const AnnotatedLine &Line) const {
assert(Line.MightBeFunctionDecl);		assert(Line.MightBeFunctionDecl);

if ((Style.AlwaysBreakAfterReturnType == FormatStyle::RTBS_TopLevel \|\|		if ((Style.AlwaysBreakAfterReturnType == FormatStyle::RTBS_TopLevel \|\|
Style.AlwaysBreakAfterReturnType ==		Style.AlwaysBreakAfterReturnType ==
FormatStyle::RTBS_TopLevelDefinitions) &&		FormatStyle::RTBS_TopLevelDefinitions) &&
Show All 22 Lines	void TokenAnnotator::calculateFormattingInformation(AnnotatedLine &Line) {
}		}

Line.First->TotalLength =		Line.First->TotalLength =
Line.First->IsMultiline ? Style.ColumnLimit		Line.First->IsMultiline ? Style.ColumnLimit
: Line.FirstStartColumn + Line.First->ColumnWidth;		: Line.FirstStartColumn + Line.First->ColumnWidth;
FormatToken *Current = Line.First->Next;		FormatToken *Current = Line.First->Next;
bool InFunctionDecl = Line.MightBeFunctionDecl;		bool InFunctionDecl = Line.MightBeFunctionDecl;
while (Current) {		while (Current) {
if (isFunctionDeclarationName(*Current, Line))		if (const FormatToken Tok = getFunctionParameterList(Current, Line)) {
Current->Type = TT_FunctionDeclarationName;		Current->Type = TT_FunctionDeclarationName;

		// Fix block kind, if it was mistakenly identified as a BracedInit
		FormatToken *Next = Tok->MatchingParen->Next;
		if (Next && Next->is(tok::l_brace) && Next->BlockKind == BK_BracedInit)
		Next->BlockKind = BK_Block;
		TypzAuthorUnsubmitted Not Done Reply Inline Actions this may actually not be enough in all cases: to completely match the 'standard' behavior we should be able to "split" the UnwrappedLine at both opening and ending brace, in this case. Typz: this may actually not be enough in all cases: to completely match the 'standard' behavior we…
		}
if (Current->is(TT_LineComment)) {		if (Current->is(TT_LineComment)) {
if (Current->Previous->BlockKind == BK_BracedInit &&		if (Current->Previous->BlockKind == BK_BracedInit &&
Current->Previous->opensScope())		Current->Previous->opensScope())
Current->SpacesRequiredBefore = Style.Cpp11BracedListStyle ? 0 : 1;		Current->SpacesRequiredBefore = Style.Cpp11BracedListStyle ? 0 : 1;
else		else
Current->SpacesRequiredBefore = Style.SpacesBeforeTrailingComments;		Current->SpacesRequiredBefore = Style.SpacesBeforeTrailingComments;

// If we find a trailing comment, iterate backwards to determine whether		// If we find a trailing comment, iterate backwards to determine whether
▲ Show 20 Lines • Show All 1,003 Lines • Show Last 20 Lines

unittests/Format/FormatTest.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

	Show First 20 Lines • Show All 11,380 Lines • ▼ Show 20 Lines

	TEST_F(FormatTest, MunchSemicolonAfterBlocks) {			TEST_F(FormatTest, MunchSemicolonAfterBlocks) {
	verifyFormat("MY_CLASS(C) {\n"			verifyFormat("MY_CLASS(C) {\n"
	" int i;\n"			" int i;\n"
	" int j;\n"			" int j;\n"
	"};");			"};");
	}			}

				TEST_F(FormatTest, MunchSemicolonAfterFunction) {
				verifyFormat("void foo() { int i; };");
				verifyFormat("void foo() {\n"
				" int i;\n"
				" int j;\n"
				"};");
				verifyFormat("void foo() {};");
				}

	TEST_F(FormatTest, ConfigurableContinuationIndentWidth) {			TEST_F(FormatTest, ConfigurableContinuationIndentWidth) {
	FormatStyle TwoIndent = getLLVMStyleWithColumns(15);			FormatStyle TwoIndent = getLLVMStyleWithColumns(15);
	TwoIndent.ContinuationIndentWidth = 2;			TwoIndent.ContinuationIndentWidth = 2;

	EXPECT_EQ("int i =\n"			EXPECT_EQ("int i =\n"
	" longFunction(\n"			" longFunction(\n"
	" arg);",			" arg);",
	format("int i = longFunction(arg);", TwoIndent));			format("int i = longFunction(arg);", TwoIndent));
	▲ Show 20 Lines • Show All 483 Lines • Show Last 20 Lines