This adds the Unicode 13 data for XID_Start and XID_Continue.
The definition of valid identifier is changed in all C++ modes
as P1949 was accepted by WG21 as a defect report.
Details
- Reviewers
aaron.ballman tahonermann sdowney rsmith
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
1463 | This is overly indented (or your editor snuck in some hard tabs). | |
1503–1511 | Why remove this diagnostic? This looks unintentional. But if it is intentional, then besides justifying it in the commit message (and splitting out this removal into a separate PR), you should remove the enumerator from DiagnosticLexKinds.td. | |
clang/lib/Lex/UnicodeCharSets.h | ||
229 | Ah, here we go. 0x005F is underscore. It should be in the XIDStart table instead. | |
clang/test/Lexer/unicode.c | ||
31 | Why 12 errors? Are these { ( etc. characters not the ASCII versions? If so, this test is needlessly confusing, and should be rewritten with one character per test case, as in lines 43-53 below. (Except line 46, which is also bad. :)) I don't understand why you're disabling the existing tests. In particular, line 43 tests the error message for GREEK QUESTION MARK, which is important to preserve. | |
clang/test/Preprocessor/ucn-pp-identifier.c | ||
31 ↗ | (On Diff #354672) | This test shouldn't need tweaking, should it? |
It would be more user friendly to say which character is not allowed in the diagnostic.
Do we need to have a backwards compatible flag to preserve the old behavior, or do we believe that nobody will be affected by the change? We should make this choice explicitly and state it in the patch description.
Agreed
Do we need to have a backwards compatible flag to preserve the old behavior, or do we believe that nobody will be affected by the change? We should make this choice explicitly and state it in the patch description.
I have been wondering about that.
I came to the conclusion it would probably not be worth it. Until fairly recently Unicode identifiers in GCC were not really usable and therefore not used afaik.
I haven't seen people use emojis or other interesting symbol in non toy code.
I'll try to make that clearer in the commit message
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
1463 | This is intentional | |
1503–1511 | The diagnostic should be removed indeed. | |
clang/test/Lexer/unicode.c | ||
31 | Having 2 sets of completely different diagnostics (for C++ and C) proved a lot more complicated than doing two separate tests. |
Agreed, done!
Do we need to have a backward-compatible flag to preserve the old behavior, or do we believe that nobody will be affected by the change? We should make this choice explicitly and state it in the patch description.
I have been wondering about that.
I came to the conclusion it would probably not be worth it. Until fairly recently Unicode identifiers in GCC were not really usable and therefore not used afaik.
I haven't seen people use emojis or other interesting symbols in non-toy code.
I tried to make that clearer in the commit message
clang/lib/Lex/UnicodeCharSets.h | ||
---|---|---|
229 | These tables should match the Unicode spec exactly. |
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
1444 | Is looking at start chars correct? I was thinking this should only look at the continuation characters because isAllowedInitiallyIDChar handles the start of an identifier. | |
clang/test/CXX/drs/dr2xx.cpp | ||
600 | This means we're going to drop our support of this DR on https://clang.llvm.org/cxx_dr_status.html when that page gets regenerated. What should our status of that DR be with these changes? | |
clang/test/CXX/drs/dr6xx.cpp | ||
721 | Same concern here as above. | |
clang/test/Lexer/unicode.c | ||
40–43 | Then we don't have to do line number math to see what line the diagnostics are attached to. :-) |
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
1444 | Yes, But Maybe it needs a comment. | |
clang/test/CXX/drs/dr2xx.cpp | ||
600 | Good question! | |
clang/test/Lexer/unicode.c | ||
40–43 | It makes me feel dirty but I'll change it! |
Adding @rsmith to hopefully answer the question about how to handle the dr status page for these changes.
clang/www/cxx_dr_status.html | ||
---|---|---|
3 ↗ | (On Diff #366315) | You might have missed this at the top of the file -- this one is a generated file, and that's why I was trying to figure out how to best handle the markings in the test files. make_cxx_dr_status basically looks for those dr comments to generate this content. |
@rsmith: I modified the script locally to support dxx: dup PXXXX - let me know if you think that's a good solution
clang/test/CXX/drs/dr2xx.cpp | ||
---|---|---|
600 | I don't think there is nor will ever be a DR number for this; using the P number here seems like the best option available to us. Ultimately the "sup" marking here serves two purposes:
I think it would be fine to write this as dr248: sup P1949R7, and in the make_cxx_status script, generate text that says "Superseded by P1949R7" like in your edits to cxx_dr_status.html and color DRs superseded by papers as class="na" (because the DR no longer applies due to the paper). (Not saying that's the only option, but I think it would be a reasonable choice.) Separately, please don't remove the tests here; instead update them to show what the new behavior is, and add tests if necessary so that it's clear in what way the DR's behavior has been superseded. |
Not sure if that's a typo: did you mean "sup" rather than "dup"? In this case, it seems like "sup" is right if the old DR resolution no longer applies. (If the old resolution does still apply, but was generalized, then I think we should just leave the DR status table alone for that issue, because that issue's resolution is still implemented.)
clang/include/clang/Basic/DiagnosticLexKinds.td | ||
---|---|---|
117 | The old diagnostic text here was bad in the case where the character was supposed to be part of an identifier. The new diagnostic text is bad in the case where the character is not supposed to be part of an identifier (eg, if you copy-paste a smart-quote into the source file). Is there some way we can phrase this diagnostic so it's reasonable regardless of the programmer's intent? Perhaps the best we can do is to say that if an identifier is immediately followed by one or more "bad" Unicode characters that were probably intended to be identifier characters (things we don't recognize as whitespace or homoglyphs or anything else), then produce the nice "not allowed in identifiers" diagnostic (and maybe even treat the characters as part of the identifier for error recovery purposes), and otherwise diagnose as simply "unexpected character <U+%0>"? I also wonder if perhaps we should treat all unexpected characters as identifier characters for error recovery purposes. | |
clang/lib/Lex/Lexer.cpp | ||
3159 | It's a bit strange that we produce this warning only on characters that aren't allowed in identifiers, and don't warn on identifier continuation characters that aren't valid identifier start characters (such as a standalone combining character). Is there a good reason for that (specifically for the !isAllowedIDChar check above)? |
- Support superseeding DRs with paper in make_cxx_dr_status (ie //drXXX: sup PXXXX)
- Better diagnostic message when a codepoint is identifier continue but not identifier start
- Provide a different diagnostic message for Unicode codepoints that are not identifier part and are not part of a pp-identifier.
- When parsing an identifier, for non-ascii, non-whitespace codepoints, emit an error and then pretend the codepoint is valid for better recovery.
- Restore the previously removed DR tests
clang/include/clang/Basic/DiagnosticLexKinds.td | ||
---|---|---|
117 | I changed the PR so that, at the start of an identifier, if the Unicode character is neither a whitespace, identifier start or identifier continue, I also recover nicely for non-leading invalid Unicode codepoints. I am not sure we should do that for the leading case though. |
LGTM aside from a commenting nit that I managed to miss last time (sorry about that). Btw, would you like to apply for commit privileges now that you've gotten a few patches under your belt (https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access)?
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
1607–1609 | Oops, I missed this one last time! |
Thank you for the patch, I've landed it on your behalf in 4e80636db71a1b6123d15ed1f9eda3979b4292de
Can you roll this back and don't support P1949? For some inexplicable reason ∂, 𝜕 partial derivative symbols are now not supported. Neither as XID_Start nor as XID_Continue. Where is logic in that? It has no sense, it's not some crazy emoji symbols. Moreover, it breaks the old code.
No; P1949 was adopted for C++23, so this is effectively a nonstarter unless WG21 backs the paper out and I'm not aware of any push within the committee to do that.
For some inexplicable reason ∂, 𝜕 partial derivative symbols are now not supported. Neither as XID_Start nor as XID_Continue. Where is logic in that?
According to the Unicode consortium, both of those are mathematical symbols, not identifier characters. WG21 and WG14 now both defer to the Unicode consortium on what constitutes a valid identifier character and only deviate to allow _ for historical reasons.
It's worth noting that Clang has never accepted ∂ as a valid identifier even though we did accept 𝜕: https://godbolt.org/z/bTKoKjjdc
It has no sense, it's not some crazy emoji symbols. Moreover, it breaks the old code.
That's expected and intentional, which is unfortunate for folks in your situation and I'm sorry to hear you're caught out by this.
Clang could support a more relaxed mode via a feature flag to opt into allowing non-conforming identifiers in C++23, but I think we should be cautious about adding feature flags to deviate from the standard here until the standards bodies have had the opportunity to weigh in on the topic, unless there's a significant amount of code breaking in the wild. Thus far, I'm not aware of any system headers or other major third-party libraries that are impacted (do you know of widely deployed libraries that are caught by this?). We could also elect not to implement this paper in older language modes, but as with the feature flag, I think we want to be cautious about that. The issue with either approach is: previous support for Unicode characters lead to mysterious behaviors that can be dangerous (see recent discussions about trojan source as an example).
For reference in reading this, because display is going to be inconsistent for many people:
∂ - U+2202, PARTIAL DIFFERENTIAL
𝜕 - U+1D715, MATHEMATICAL ITALIC PARTIAL DIFFERENTIAL
The first is part of the Mathematical Operators block (U+2200–U+22FF), and has always been excluded from C and C++.
The second is part of the Mathematical Alphanumeric Symbols block (U+1D400–U+1D7FF) and is part one of the blocks that was 'allow listed' in C and C++ because it was unassigned at the time of standardization in the late 90s. It was assigned in Unicode 3.1.0 (March, 2001).
I hope at least GCC will not implement this arbitrary restriction on math symbols. I would argue a partial derivative symbol in C++ has more sense than a skull emoji.
P.S. I would love to see features like std::jthread in Clang instead of removing math symbols from identifiers.
I hope at least GCC will not implement this arbitrary restriction on math symbols. I would argue a partial derivative symbol in C++ has more sense than a skull emoji.
I am quite sure that gcc will also be implementing P1949 as adopted for C++23.
Thank you to Aaron and Steve for their excellent analysis and responses!
@intractabilis, please note that this is not the end of the identifier story. A new Unicode work group recently started meeting as a result of the Trojan Source reporting. Some members of that work group also maintain code bases that were negatively impacted by the changes in P1949. That group is expected to eventually produce guidance to better help language designers determine what should and should not be allowed in identifiers. The concerns are complicated; see Unicode UAX#31, UAX#36, and UAX#39 for detailed analysis of most of the concerns. I think it is unlikely that WG21 or any of the major compiler implementors will deviate from P1949 until new guidance emerges from the Unicode Technical Committee. With luck, that guidance will help to establish conventions that will be followed by most/all programming languages. But don't hold your breath; this will likely take quite a while.
Yes, you are right, it is available in GCC 12 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100977).
Since it is a C++23 feature, can we at least have it only if the user requests C++23 standard compliance? Isn't it strange to force it upon users if they specify -std=c++20? I am going to check how GCC 12 behaves with this regard.
I checked the latest GCC 12. Despite P1949 support, I can use math symbols like 𝜕 and 𝛻 without any problem with any combination of -std=c++xx, xx ≠ 98, xx ≠ 03. Thanks, God, there are reasonable people in the GCC team.
test.cpp:
#include <iostream> int main(int argc, char* argv[]) { auto 𝜕Ω = 4; std::cout << "𝜕Ω = " << 𝜕Ω << std::endl; }
$ g++ --version g++ (GCC) 12.0.1 20220504 (prerelease) Copyright (C) 2022 Free Software Foundation, Inc. This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. $ g++ -std=c++23 -o test test.cpp $ ./test 𝜕Ω = 4
Since it is a C++23 feature, can we at least have it only if the user requests C++23 standard compliance?
WG21 adopted P1949 as a defect report (DR) against prior standards with the intent that the changes be applied retroactively.
Isn't it strange to force it upon users if they specify -std=c++20?
This is how DRs are intended to be handled.
I checked the latest GCC 12. Despite P1949 support, I can use math symbols like 𝜕 and 𝛻 without any problem with any combination of -std=c++xx, xx ≠ 98, xx ≠ 03.
It looks to me like gcc implements an extension that allows additional characters not permitted by P1949 to still be used in non-pedantic mode. Relevant gcc commits include:
- https://github.com/gcc-mirror/gcc/commit/c4d6dcacfca1b804504515496e6d9de176d7f51e
- https://github.com/gcc-mirror/gcc/commit/7abcc9ca20d4e17deabb308b5f483aaccc3dc02c
- https://github.com/gcc-mirror/gcc/commit/c264208e161830a5642ee3125871c23110508462
Thanks, God, there are reasonable people in the GCC team.
Please refrain from such unhelpful comments.
The old diagnostic text here was bad in the case where the character was supposed to be part of an identifier. The new diagnostic text is bad in the case where the character is not supposed to be part of an identifier (eg, if you copy-paste a smart-quote into the source file). Is there some way we can phrase this diagnostic so it's reasonable regardless of the programmer's intent?
Perhaps the best we can do is to say that if an identifier is immediately followed by one or more "bad" Unicode characters that were probably intended to be identifier characters (things we don't recognize as whitespace or homoglyphs or anything else), then produce the nice "not allowed in identifiers" diagnostic (and maybe even treat the characters as part of the identifier for error recovery purposes), and otherwise diagnose as simply "unexpected character <U+%0>"? I also wonder if perhaps we should treat all unexpected characters as identifier characters for error recovery purposes.