This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] Fix misplacement of `*` in declaration of pointer to struct
ClosedPublic

Authored by jackhong12 on Jun 15 2022, 9:28 AM.

Details

Summary

Summary

In this patch, we fix the misplacement of * in a definition of a pointer to a struct. It's caused by wrong annotating about tokens * and &&.

Issue

The result of echo 'struct { int foo; } *foop;' | clang-format should be

struct {
	int foo;
} *foop;

instead of

struct {
	int foo;
} * foop;

Handle star token (*)

After right braces, star tokens are likely pointers to a struct, union, or class instead of binary operators. In this case, function determineStarAmpUsage should return TT_PointerOrReference.

  • example: struct {} *ptr;

Handle ampamp token (&&)

&& tokens, right after right braces, might be references or binary operators. When initializing references to a struct, union, or class, && are references. When utilized in templates, && are likely binary operators. So we check whether there is a TemplateCloser(">") to indicate it's a template or not. If it's a template, && is a binary operator; on the other hand, if not, it is likely a reference operator.

  • reference operator case: struct {} &&ptr={};
  • binary operator cases: enable_if<>{} && ...

Diff Detail

Event Timeline

jackhong12 created this revision.Jun 15 2022, 9:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 15 2022, 9:28 AM
jackhong12 requested review of this revision.Jun 15 2022, 9:28 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 15 2022, 9:28 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
MyDeveloperDay edited reviewers, added: MyDeveloperDay; removed: Restricted Project.Jun 15 2022, 11:24 AM
MyDeveloperDay added a project: Restricted Project.
MyDeveloperDay removed a subscriber: llvm-commits.

what about (may not be useful but compiles)

struct {
    int foo;    
} &&ptr2 = {};

https://godbolt.org/z/rbb8x3hKP

MyDeveloperDay requested changes to this revision.Jun 15 2022, 11:30 AM

Thank you for the patch, just some observations

clang/unittests/Format/TokenAnnotatorTest.cpp
93

Can you add a verifyFormat test that shows what you want? as well

This revision now requires changes to proceed.Jun 15 2022, 11:30 AM

what about (may not be useful but compiles)

struct {
    int foo;    
} &&ptr2 = {};

https://godbolt.org/z/rbb8x3hKP

Let's ask why it was set to a binary operator. What kind of code would want the binary operator after a closing brace? I can make up such code, but we basically have to decide which code is probably more in use, or make the detection more sophisticated.

clang/unittests/Format/TokenAnnotatorTest.cpp
93

I think the annotator test is sufficient. Because it's just about annotating the token, formatting is secondary (and dependent on style - these tests are already in place).

Could you please either directly link to the github issue or better use llvm.org/pr55810 than just #55810 ?

MyDeveloperDay added inline comments.Jun 16 2022, 3:31 AM
clang/unittests/Format/TokenAnnotatorTest.cpp
93

I know where you are coming from, but actually if it wasn't for the enable_if<>{} && ... example in the unit tests then we'd have missed the && case and caused a regression. (that was ONLY covered by the small example code snippet)

Whilst I believe the TokenAnnotator tests are correct to have as well, I think we should be adding formatting examples just to ensure someone doesn't breaking the formatting rule that this depends on. I always follow the Beyoncé rule... "If you like it you should have put a test on it!"

So If you don't want anyone to break your * placement after the } then I see no harm in adding a single verifyFormat(), but while you are at it please also test to ensure the Left/Middle/Right works with your example as you might expect. (just incase there are some rules about that, that could interfere with your setting)

jackhong12 edited the summary of this revision. (Show Details)Jun 16 2022, 3:44 AM
curdeius retitled this revision from [clang-format] Fix misplacemnt of `*` in declartion of pointer to struct to [clang-format] Fix misplacemnt of `*` in declaration of pointer to struct.Jun 16 2022, 7:31 AM
curdeius added inline comments.Jun 16 2022, 7:33 AM
clang/lib/Format/TokenAnnotator.cpp
2348

Instead of checking for r_brace, maybe it would be possible to check PrevToken->MatchingParen for being TT_RecordLBrace?
This way, we'd be precise.
WDYT?

I think it's a good idea. When does MatchingParen bind? The value of PrevToken->MatchingParen is still NULL in determineStarAmpUsage function.

jackhong12 edited the summary of this revision. (Show Details)
  • Add annotator test
  • Add formatting test
  • Handle reference cases by PrevToken->MatchingParen
jackhong12 marked 4 inline comments as done.Jun 17 2022, 6:15 AM

Add Left/Middle/Right alignment test cases.

I think it's a good idea. When does MatchingParen bind? The value of PrevToken->MatchingParen is still NULL in determineStarAmpUsage function.

What has changed, now you are using that.

clang/lib/Format/TokenAnnotator.cpp
2349–2350

Unknown is everything, until some type is assigned. This way it should be clearer.

Also put that check above the other one and add the r_brace back.

jackhong12 added inline comments.Jun 17 2022, 10:12 AM
clang/lib/Format/TokenAnnotator.cpp
2349–2350

There are other problems. Clang-format will split the input into multiple lines first. For instance, struct {\n int n;\n} &&ptr={}; will be separated as struct {, int n; and } &&ptr={};. It only handles the relation in the line. When declaring a struct variable, the value of MatchingParen will always be NULL instead of pointing to the last left brace. So it will not enter that branch in this case.

jackhong12 added inline comments.Jun 17 2022, 10:27 AM
clang/lib/Format/TokenAnnotator.cpp
2349–2350
if (PrevToken->is(tok::r_brace) && Tok.isOneOf(tok::amp, tok::ampamp) &&
    PrevToken->MatchingParen) {
  if (PrevToken->MatchingParen->is(TT_RecordLBrace))
    return TT_PointerOrReference;
  else
    return TT_BinaryOperator;
}

How about this way? Although the branch of TT_PointerOrReference will not be taken, it's clearer for reading.

clang/lib/Format/TokenAnnotator.cpp
2349–2350

Then I think I would prefer a comment which explains the logic instead of adding dead (and thus untested) code.

jackhong12 edited the summary of this revision. (Show Details)

Just like mentioned above, UnwrappedLineParser will split the input into multiple lines. And MatchingParen will be reset before annotating(code). So when defining a struct, union, or class, the MatchingParen of the right brace will be NULL. If we want to access the left matching brace from the right brace, we need to add new member data, which saves the address of the left brace token, in class FormatToken.
In my opinion, it will be too complicated. Instead of MatchingParen, I categorize the tokens by checking whether it is a template because && will be a binary operator only in a template. In the other cases, && is likely a reference operator.

clang/lib/Format/TokenAnnotator.cpp
2338

Thank you I wish more of the clauses were commented like this

HazardyKnusperkeks added inline comments.
clang/lib/Format/TokenAnnotator.cpp
2338

+1

clang/unittests/Format/FormatTest.cpp
10439

Bonus points for the union. It often gets forgotten.

jackhong12 added inline comments.Jun 20 2022, 7:41 AM
clang/lib/Format/TokenAnnotator.cpp
2338

I'm not sure where I also need to add comments.

jackhong12 marked 2 inline comments as done.Jun 21 2022, 4:03 AM

Thanks for your reply! I added comments for each clause.

owenpan accepted this revision.Jun 21 2022, 7:28 PM
owenpan retitled this revision from [clang-format] Fix misplacemnt of `*` in declaration of pointer to struct to [clang-format] Fix misplacement of `*` in declaration of pointer to struct.
owenpan edited the summary of this revision. (Show Details)
owenpan added inline comments.
clang/lib/Format/TokenAnnotator.cpp
2329–2330

Can we have a variable for MatchingLBrace->getPreviousNonComment() so that the function doesn't get called twice?

2334

Nit: Don’t use else after a return.

curdeius edited the summary of this revision. (Show Details)Jun 21 2022, 9:57 PM
jackhong12 marked an inline comment as done.

I have added a variable for MatchingLBrace->getPreviousNonComment().

Thanks. I am not sure which modification will be better, the patch I submitted or the following code.

FormatToken *BeforeLBraceToken = nullptr;
if (MatchingLBrace)
  BeforeLBraceToken = MatchingLBrace->getPreviousNonComment();

if (BeforeLBraceToken && BeforeLBraceToken->is(TT_TemplateCloser))
  return TT_BinaryOperator;

Thanks. I am not sure which modification will be better, the patch I submitted or the following code.

FormatToken *BeforeLBraceToken = nullptr;
if (MatchingLBrace)
  BeforeLBraceToken = MatchingLBrace->getPreviousNonComment();

if (BeforeLBraceToken && BeforeLBraceToken->is(TT_TemplateCloser))
  return TT_BinaryOperator;

How about the following?

if (!MatchingLBrace)
  return TT_PointerOrReference;
const FormatToken *BeforeLBrace = MatchingLBrace->getPreviousNonComment();
if (!BeforeLBrace || BeforeLBrace->isNot(TT_TemplateCloser))
  return TT_PointerOrReference;
return TT_BinaryOperator;

Right. It looks better.

jackhong12 marked 2 inline comments as done.Jun 25 2022, 5:41 AM

Hi. I passed the unit tests on my computer. But the build status here is failed. The build log only shows the issue is related to git reset --hard. So, I don't know where the bug is. Could you give me some hints to fix this issue? Thanks!

Hi. I passed the unit tests on my computer. But the build status here is failed. The build log only shows the issue is related to git reset --hard. So, I don't know where the bug is. Could you give me some hints to fix this issue? Thanks!

If everything is fine on your side, and you can't see a problem related to you on the build bot, just push it. If it really breaks something you will notice. :)

But for now you should wait for @MyDeveloperDay that he at least removes the changes needed, or state what you have to change, or even better accept it.

Got it. Thanks for your reply!

This revision is now accepted and ready to land.Jun 26 2022, 7:52 AM

Sorry, I don't have commit access. @HazardyKnusperkeks, could you help me commit it?

If I want to contribute to LLVM in the future, how do I get the commit permission? Does it depend on the number of patches I submit?

Sorry, I don't have commit access. @HazardyKnusperkeks, could you help me commit it?

If I want to contribute to LLVM in the future, how do I get the commit permission? Does it depend on the number of patches I submit?

https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access

I can push it for you, but if you want to get the commit access, you can wait and do it yourself. Just ask for it, I didn't have to wait long.

This revision was landed with ongoing or failed builds.Jun 29 2022, 12:21 AM
This revision was automatically updated to reflect the committed changes.

Appears that this regressed examples like this:

// before
int i = int{16} * 1024;
// after
int i = int{16}* 1024;

Naively it looks like in all added examples, the token preceding the matching { of the } is one of struct, class, union, enum, or a >. Could we adapt this heuristic to only apply TT_PointerOrReference in cases like this?

I think we cannot identify struct, union, class or enum by the right bracket. Clang-format will split the input into multiple lines. For instance struct Tmp {} *tmp; will be separated as struct Tmp { and } *tmp;. In annotating, we only handle the relation of tokens in a single line, so we cannot know whether } belongs to struct or not.

I think we cannot identify struct, union, class or enum by the right bracket. Clang-format will split the input into multiple lines. For instance struct Tmp {} *tmp; will be separated as struct Tmp { and } *tmp;. In annotating, we only handle the relation of tokens in a single line, so we cannot know whether } belongs to struct or not.

Interesting. Actually, it seems the splitting into multiple lines happens if we're in the struct, etc. case, but not if this was likely a braced list:

% cat test.cc                                
int i = int{16} * 1024;
% ./clang-format -style=google -debug test.cc
Args: ./clang-format -style=google -debug test.cc 
File encoding: UTF8
Language: C++
----
Next [0] Token: int / int, Macro: 0
Next [1] Token: identifier / i, Macro: 0
Next [2] Token: equal / =, Macro: 0
Next [3] Token: int / int, Macro: 0
Next [4] Token: l_brace / {, Macro: 0
Getting Position: 4
Next [5] Token: numeric_constant / 16, Macro: 0
Next [6] Token: r_brace / }, Macro: 0
Next [7] Token: star / *, Macro: 0
Setting Position: 4
Next [5] Token: numeric_constant / 16, Macro: 0
Next [6] Token: r_brace / }, Macro: 0
Next [7] Token: star / *, Macro: 0
Next [8] Token: numeric_constant / 1024, Macro: 0
Next [9] Token: semi / ;, Macro: 0
Next [10] Token: eof / , Macro: 0
Line(0, FSC=0): int[T=108, OC=0] identifier[T=108, OC=4] equal[T=108, OC=6] int[T=108, OC=8] l_brace[T=108, OC=11] numeric_constant[T=108, OC=12] r_brace[T=108, OC=14] star[T=108, OC=16] numeric_constant[T=108, OC=18] semi[T=108, OC=22] 
Line(0, FSC=0): eof[T=108, OC=0] 
Run 0...
Replacements for run 0:
File encoding: UTF8
Language: C++
----
Next [0] Token: int / int, Macro: 0
Next [1] Token: identifier / i, Macro: 0
Next [2] Token: equal / =, Macro: 0
Next [3] Token: int / int, Macro: 0
Next [4] Token: l_brace / {, Macro: 0
Getting Position: 4
Next [5] Token: numeric_constant / 16, Macro: 0
Next [6] Token: r_brace / }, Macro: 0
Next [7] Token: star / *, Macro: 0
Setting Position: 4
Next [5] Token: numeric_constant / 16, Macro: 0
Next [6] Token: r_brace / }, Macro: 0
Next [7] Token: star / *, Macro: 0
Next [8] Token: numeric_constant / 1024, Macro: 0
Next [9] Token: semi / ;, Macro: 0
Next [10] Token: eof / , Macro: 0
Line(0, FSC=0): int[T=108, OC=0] identifier[T=108, OC=4] equal[T=108, OC=6] int[T=108, OC=8] l_brace[T=108, OC=11] numeric_constant[T=108, OC=12] r_brace[T=108, OC=14] star[T=108, OC=16] numeric_constant[T=108, OC=18] semi[T=108, OC=22] 
Line(0, FSC=0): eof[T=108, OC=0] 
Run 0...
Replacements for run 0:
File encoding: UTF8
Language: C++
----
Next [0] Token: int / int, Macro: 0
Next [1] Token: identifier / i, Macro: 0
Next [2] Token: equal / =, Macro: 0
Next [3] Token: int / int, Macro: 0
Next [4] Token: l_brace / {, Macro: 0
Getting Position: 4
Next [5] Token: numeric_constant / 16, Macro: 0
Next [6] Token: r_brace / }, Macro: 0
Next [7] Token: star / *, Macro: 0
Setting Position: 4
Next [5] Token: numeric_constant / 16, Macro: 0
Next [6] Token: r_brace / }, Macro: 0
Next [7] Token: star / *, Macro: 0
Next [8] Token: numeric_constant / 1024, Macro: 0
Next [9] Token: semi / ;, Macro: 0
Next [10] Token: eof / , Macro: 0
Line(0, FSC=0): int[T=108, OC=0] identifier[T=108, OC=4] equal[T=108, OC=6] int[T=108, OC=8] l_brace[T=108, OC=11] numeric_constant[T=108, OC=12] r_brace[T=108, OC=14] star[T=108, OC=16] numeric_constant[T=108, OC=18] semi[T=108, OC=22] 
Line(0, FSC=0): eof[T=108, OC=0] 
Run 0...
AnnotatedTokens(L=0):
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=0 Name=int L=3 PPK=2 FakeLParens=2/ FakeRParens=0 II=0x1a4ea20 Text='int'
 M=0 C=1 T=StartOfName S=1 F=0 B=0 BK=0 P=220 Name=identifier L=5 PPK=2 FakeLParens= FakeRParens=0 II=0x1a8d578 Text='i'
 M=0 C=0 T=BinaryOperator S=1 F=0 B=0 BK=0 P=22 Name=equal L=7 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text='='
 M=0 C=1 T=Unknown S=1 F=0 B=0 BK=0 P=22 Name=int L=11 PPK=2 FakeLParens= FakeRParens=0 II=0x1a4ea20 Text='int'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=2 P=23 Name=l_brace L=12 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text='{'
 M=0 C=1 T=Unknown S=0 F=0 B=0 BK=0 P=59 Name=numeric_constant L=14 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text='16'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=2 P=41 Name=r_brace L=15 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text='}'
 M=0 C=0 T=PointerOrReference S=0 F=0 B=0 BK=0 P=210 Name=star L=16 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text='*'
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=34 Name=numeric_constant L=21 PPK=2 FakeLParens= FakeRParens=1 II=0x0 Text='1024'
 M=0 C=0 T=Unknown S=0 F=0 B=0 BK=0 P=23 Name=semi L=22 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=';'
----
AnnotatedTokens(L=0):
 M=0 C=0 T=Unknown S=1 F=0 B=0 BK=0 P=0 Name=eof L=0 PPK=2 FakeLParens= FakeRParens=0 II=0x0 Text=''
----
Replacements for run 0:
test.cc: 15:+1:""
int i = int{16}* 1024;

So maybe we can base it off that distinction. I'll try and explore this idea in the next few days.

binji added a subscriber: binji.Aug 11 2022, 5:30 PM

This change seems to have also regressed code like this:

// before
bool b = 3 == int{3} && true;
// after
bool b = 3 == int{3}&& true;

I suppose a similar fix to the one done for multiply should be done for && too?

Right, I think we can fix this in the same way. I'll look into it soon.