This is an archive of the discontinued LLVM Phabricator instance.

[clang-format] Correctly annotate UDLs as OverloadedOperator
ClosedPublic

Authored by rymiel on Sep 29 2022, 12:13 AM.

Details

Summary

While the opening parenthesis of an user-defined literal operator was
correctly annotated as OverloadedOperatorLParen, the "" and its suffix
wasn't annotated as OverloadedOperator.

Fixes https://github.com/llvm/llvm-project/issues/58035

Diff Detail

Event Timeline

rymiel created this revision.Sep 29 2022, 12:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 29 2022, 12:13 AM
rymiel updated this revision to Diff 463768.Sep 29 2022, 12:16 AM
rymiel edited the summary of this revision. (Show Details)

Note issue being fixed

rymiel added a comment.EditedSep 29 2022, 12:19 AM

Notes:

As noted above, the left paren was correctly annotated, but the logic in rParenEndsCast goes based off of what is *before* the l paren. This resulted in a paren pair where the left one was OverloadedOperatorLParen and the right one was CastRParen.

This means this bug could really be fixed in two ways, I chose to annotate the UDL with OverloadedOperator, as I felt like that resulted in "more correct tokens".

Edit: While attempting to add token annotator tests for this fix, i noticed there aren't any at all for overloaded operators, and while writing them, I noticed more operators which aren't marked with OverloadedOperator, such as operator[] and UDLs which don't start with an underscore. (see below)

Unless any reviewers have any other opinions, I would leave fixing those out of this patch and leave the tests "incomplete" for now?

Also, UDLs that don't start with an underscore aren't considered a single "string_literal" token, instead becoming a string literal "" and an identifier following it (where as those with an underscore become one token, such as ""_a). I'm unsure if that's the expected case and if both tokens should just be considered part of the operator

rymiel updated this revision to Diff 464145.Sep 29 2022, 11:13 PM

Also add a few format tests

rymiel published this revision for review.Sep 29 2022, 11:20 PM

While this patch fixes the linked bug, but I encountered more questions that I would like second opinions on (see notes above)

Herald added a project: Restricted Project. · View Herald TranscriptSep 29 2022, 11:20 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Looks good. You could add a FIXME before the commented out tests.

Unless any reviewers have any other opinions, I would leave fixing those out of this patch and leave the tests "incomplete" for now?

+1.

Also, UDLs that don't start with an underscore aren't considered a single "string_literal" token, instead becoming a string literal "" and an identifier following it (where as those with an underscore become one token, such as ""_a). I'm unsure if that's the expected case and if both tokens should just be considered part of the operator

The operator "" and the identifier that follows should be two separate tokens regardless if the identifier starts with an underscore.

clang/lib/Format/TokenAnnotator.cpp
1187

We should check for "", not just any string_literal.

rymiel added a comment.Oct 2 2022, 1:58 AM

The operator "" and the identifier that follows should be two separate tokens regardless if the identifier starts with an underscore.

So, after some investigation, it turns out the clang lexer only combines an underscore-less UDL to a single string_literal token if it's a valid standard library UDL (in the given standard library version)

https://github.com/llvm/llvm-project/blob/6f46ff3765dcdc178b9cf52ebd8c03437806798a/clang/lib/Lex/Lexer.cpp#L1960-L1975
https://github.com/llvm/llvm-project/blob/3f18f7c0072b642f5fe88d2fb7bb8ccf69a6c6f5/clang/lib/Lex/LiteralSupport.cpp#L1180-L1185

Which means only single-token string_literals should be considered

So, after some investigation, it turns out the clang lexer only combines an underscore-less UDL to a single string_literal token if it's a valid standard library UDL (in the given standard library version)

And only if there are no spaces between "" and the identifier. Otherwise, it's invalid. See https://en.cppreference.com/w/cpp/language/user_literal.

Which means only single-token string_literals should be considered

We don't want to merge e.g. "" _foo or "" if into a single token ""_foo or ""if. With the former, the user may want clang-format to keep the space. With the latter, I'm not sure if it's ok for clang-format to fix the syntax error.

rymiel updated this revision to Diff 466862.Oct 11 2022, 10:49 AM
rymiel marked an inline comment as done.

Handle UDLs with and without spaces

Thank you @owenpan, I was not actually aware UDLs without spaces were valid.

HazardyKnusperkeks added inline comments.
clang/lib/Format/TokenAnnotator.cpp
1189–1194
This revision is now accepted and ready to land.Oct 12 2022, 12:37 PM
rymiel updated this revision to Diff 467677.Oct 13 2022, 9:08 PM

Full stops in comments

rymiel marked an inline comment as done.Oct 13 2022, 9:08 PM
owenpan added inline comments.Oct 17 2022, 2:59 AM
clang/lib/Format/TokenAnnotator.cpp
1186–1199

We can simply check if the previous token starts with "".

clang/unittests/Format/TokenAnnotatorTest.cpp
413

We need/should not annotate the suffix.

415–423

IMO they are redundant as both _a and _long_literal are identifiers starting with an underscore.

435

Delete.

rymiel added inline comments.Oct 17 2022, 9:50 PM
clang/unittests/Format/TokenAnnotatorTest.cpp
413

Unless I change the logic in rParenEndsCast, the suffix does need to be an OverloadedOperator, since it goes off of the token immediately before left paren (https://reviews.llvm.org/D134853#3822842)

owenpan added inline comments.Oct 18 2022, 2:51 AM
clang/lib/Format/TokenAnnotator.cpp
2126

Perhaps add:

if (Tok.MatchingParen->is(TT_OverloadedOperatorLParen))
  return false;
clang/unittests/Format/TokenAnnotatorTest.cpp
413

After we annotate "" as TT_OverloadedOperator, rParenEndsCast() returns false, at least for your test cases and the example in https://github.com/llvm/llvm-project/issues/58035. If you can come up with test cases that still confuses rParenEndsCast(), we can add a simple check near the beginning of rParenEndsCast(). See line 2126 above.

rymiel updated this revision to Diff 468473.Oct 18 2022, 3:44 AM
rymiel marked 4 inline comments as done.

Do not annotate the trailing identifier of an UDL

rymiel marked 2 inline comments as done.Oct 18 2022, 3:46 AM
rymiel added inline comments.
clang/lib/Format/TokenAnnotator.cpp
2126

Note that in testing, adding isNot(TT_Unknown) did not cause any issues in tests, but that change felt a little too drastic

clang/unittests/Format/TokenAnnotatorTest.cpp
413

Oh, you are right; I did not notice the extra check which disallows identifiers in front of casts

415–423

I wanted to avoid all the tests being single-character names, as "single-character name" was sort of an issue for one format bug; but actually this isn't even single character anyway so you are right

This revision was landed with ongoing or failed builds.Oct 25 2022, 10:12 AM
This revision was automatically updated to reflect the committed changes.
rymiel marked an inline comment as done.