While the opening parenthesis of an user-defined literal operator was
correctly annotated as OverloadedOperatorLParen, the "" and its suffix
wasn't annotated as OverloadedOperator.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Notes:
As noted above, the left paren was correctly annotated, but the logic in rParenEndsCast goes based off of what is *before* the l paren. This resulted in a paren pair where the left one was OverloadedOperatorLParen and the right one was CastRParen.
This means this bug could really be fixed in two ways, I chose to annotate the UDL with OverloadedOperator, as I felt like that resulted in "more correct tokens".
Edit: While attempting to add token annotator tests for this fix, i noticed there aren't any at all for overloaded operators, and while writing them, I noticed more operators which aren't marked with OverloadedOperator, such as operator[] and UDLs which don't start with an underscore. (see below)
Unless any reviewers have any other opinions, I would leave fixing those out of this patch and leave the tests "incomplete" for now?
Also, UDLs that don't start with an underscore aren't considered a single "string_literal" token, instead becoming a string literal "" and an identifier following it (where as those with an underscore become one token, such as ""_a). I'm unsure if that's the expected case and if both tokens should just be considered part of the operator
While this patch fixes the linked bug, but I encountered more questions that I would like second opinions on (see notes above)
+1.
Also, UDLs that don't start with an underscore aren't considered a single "string_literal" token, instead becoming a string literal "" and an identifier following it (where as those with an underscore become one token, such as ""_a). I'm unsure if that's the expected case and if both tokens should just be considered part of the operator
The operator "" and the identifier that follows should be two separate tokens regardless if the identifier starts with an underscore.
clang/lib/Format/TokenAnnotator.cpp | ||
---|---|---|
1189 | We should check for "", not just any string_literal. |
The operator "" and the identifier that follows should be two separate tokens regardless if the identifier starts with an underscore.
So, after some investigation, it turns out the clang lexer only combines an underscore-less UDL to a single string_literal token if it's a valid standard library UDL (in the given standard library version)
https://github.com/llvm/llvm-project/blob/6f46ff3765dcdc178b9cf52ebd8c03437806798a/clang/lib/Lex/Lexer.cpp#L1960-L1975
https://github.com/llvm/llvm-project/blob/3f18f7c0072b642f5fe88d2fb7bb8ccf69a6c6f5/clang/lib/Lex/LiteralSupport.cpp#L1180-L1185
Which means only single-token string_literals should be considered
And only if there are no spaces between "" and the identifier. Otherwise, it's invalid. See https://en.cppreference.com/w/cpp/language/user_literal.
Which means only single-token string_literals should be considered
We don't want to merge e.g. "" _foo or "" if into a single token ""_foo or ""if. With the former, the user may want clang-format to keep the space. With the latter, I'm not sure if it's ok for clang-format to fix the syntax error.
Handle UDLs with and without spaces
Thank you @owenpan, I was not actually aware UDLs without spaces were valid.
clang/lib/Format/TokenAnnotator.cpp | ||
---|---|---|
1193–1198 |
clang/lib/Format/TokenAnnotator.cpp | ||
---|---|---|
1188–1201 | We can simply check if the previous token starts with "". | |
clang/unittests/Format/TokenAnnotatorTest.cpp | ||
427 | We need/should not annotate the suffix. | |
429–437 | IMO they are redundant as both _a and _long_literal are identifiers starting with an underscore. | |
449 | Delete. |
clang/unittests/Format/TokenAnnotatorTest.cpp | ||
---|---|---|
427 | Unless I change the logic in rParenEndsCast, the suffix does need to be an OverloadedOperator, since it goes off of the token immediately before left paren (https://reviews.llvm.org/D134853#3822842) |
clang/lib/Format/TokenAnnotator.cpp | ||
---|---|---|
2120–2123 | Perhaps add: if (Tok.MatchingParen->is(TT_OverloadedOperatorLParen)) return false; | |
clang/unittests/Format/TokenAnnotatorTest.cpp | ||
427 | After we annotate "" as TT_OverloadedOperator, rParenEndsCast() returns false, at least for your test cases and the example in https://github.com/llvm/llvm-project/issues/58035. If you can come up with test cases that still confuses rParenEndsCast(), we can add a simple check near the beginning of rParenEndsCast(). See line 2126 above. |
clang/lib/Format/TokenAnnotator.cpp | ||
---|---|---|
2120–2123 | Note that in testing, adding isNot(TT_Unknown) did not cause any issues in tests, but that change felt a little too drastic | |
clang/unittests/Format/TokenAnnotatorTest.cpp | ||
427 | Oh, you are right; I did not notice the extra check which disallows identifiers in front of casts | |
429–437 | I wanted to avoid all the tests being single-character names, as "single-character name" was sort of an issue for one format bug; but actually this isn't even single character anyway so you are right |
We should check for "", not just any string_literal.