This is an archive of the discontinued LLVM Phabricator instance.

Implement P1949
ClosedPublic

Authored by cor3ntin on Jun 26 2021, 6:09 AM.

Details

Summary

This adds the Unicode 13 data for XID_Start and XID_Continue.
The definition of valid identifier is changed in all C++ modes
as P1949 was accepted by WG21 as a defect report.

Diff Detail

Event Timeline

cor3ntin requested review of this revision.Jun 26 2021, 6:09 AM
cor3ntin created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJun 26 2021, 6:09 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
Quuxplusone added inline comments.
clang/lib/Lex/Lexer.cpp
1485

This is overly indented (or your editor snuck in some hard tabs).
Why is '_' treated specially, here and on line 1444 above? Shouldn't it just be included in the XIDStartChars set?

1516–1524

Why remove this diagnostic? This looks unintentional.

But if it is intentional, then besides justifying it in the commit message (and splitting out this removal into a separate PR), you should remove the enumerator from DiagnosticLexKinds.td.

clang/lib/Lex/UnicodeCharSets.h
229

Ah, here we go. 0x005F is underscore. It should be in the XIDStart table instead.

clang/test/Lexer/unicode.c
31

Why 12 errors? Are these { ( etc. characters not the ASCII versions? If so, this test is needlessly confusing, and should be rewritten with one character per test case, as in lines 43-53 below. (Except line 46, which is also bad. :))

I don't understand why you're disabling the existing tests. In particular, line 43 tests the error message for GREEK QUESTION MARK, which is important to preserve.

clang/test/Preprocessor/ucn-pp-identifier.c
31 ↗(On Diff #354672)

This test shouldn't need tweaking, should it?

jfb added a subscriber: jfb.Jun 26 2021, 6:41 AM
jfb added a comment.Jun 26 2021, 6:41 AM

It would be more user friendly to say which character is not allowed in the diagnostic.

Do we need to have a backwards compatible flag to preserve the old behavior, or do we believe that nobody will be affected by the change? We should make this choice explicitly and state it in the patch description.

It would be more user friendly to say which character is not allowed in the diagnostic.

Agreed

Do we need to have a backwards compatible flag to preserve the old behavior, or do we believe that nobody will be affected by the change? We should make this choice explicitly and state it in the patch description.

I have been wondering about that.
I came to the conclusion it would probably not be worth it. Until fairly recently Unicode identifiers in GCC were not really usable and therefore not used afaik.
I haven't seen people use emojis or other interesting symbol in non toy code.

I'll try to make that clearer in the commit message

clang/lib/Lex/Lexer.cpp
1485

This is intentional
C++ treats _ as valid in identifiers, Unicode doesn't. Putting it in the array with the other characters is a sure-fire way to have someone drop it

1516–1524

The diagnostic should be removed indeed.
Because the PR applies the paper as a DR to all versions of C++, it is no longer useful to diagnostic changes between c++ versions

clang/test/Lexer/unicode.c
31

Having 2 sets of completely different diagnostics (for C++ and C) proved a lot more complicated than doing two separate tests.
The GREEK QUESTION MARK is no longer important to preserve - it tests for confusing characters - which cannot happen in C++.
I agree that the test above is not super meaningful.

cor3ntin updated this revision to Diff 354679.Jun 26 2021, 7:34 AM

Clang format, cleanup and better diagnostics

cor3ntin updated this revision to Diff 354681.Jun 26 2021, 7:35 AM

More code format

cor3ntin updated this revision to Diff 354683.Jun 26 2021, 7:38 AM

Improve commit message

cor3ntin updated this revision to Diff 354684.Jun 26 2021, 7:40 AM

Missing EOF

cor3ntin updated this revision to Diff 354973.Jun 28 2021, 11:54 AM

Fix clang tidy check, update cxx_status

cor3ntin updated this revision to Diff 355036.Jun 28 2021, 2:03 PM

Optimize character lookup for ascii

Herald added a project: Restricted Project. · View Herald TranscriptJun 28 2021, 2:03 PM

It would be more user-friendly to say which character is not allowed in the diagnostic.

Agreed, done!

Do we need to have a backward-compatible flag to preserve the old behavior, or do we believe that nobody will be affected by the change? We should make this choice explicitly and state it in the patch description.

I have been wondering about that.
I came to the conclusion it would probably not be worth it. Until fairly recently Unicode identifiers in GCC were not really usable and therefore not used afaik.
I haven't seen people use emojis or other interesting symbols in non-toy code.

I tried to make that clearer in the commit message

cor3ntin added inline comments.Jul 12 2021, 5:02 AM
clang/lib/Lex/UnicodeCharSets.h
229

These tables should match the Unicode spec exactly.
Unicode does not consider _ to be valid at the start of an identifier.
C++ does, for obvious reasons, but these tables do not represent the C++ grammar, but the Unicode Spec.
(Hence the names, XIDContinue)

cor3ntin updated this revision to Diff 364941.Aug 7 2021, 1:32 AM

Update cxx_status

aaron.ballman added inline comments.Aug 13 2021, 7:00 AM
clang/lib/Lex/Lexer.cpp
1463

Is looking at start chars correct? I was thinking this should only look at the continuation characters because isAllowedInitiallyIDChar handles the start of an identifier.

clang/test/CXX/drs/dr2xx.cpp
600

This means we're going to drop our support of this DR on https://clang.llvm.org/cxx_dr_status.html when that page gets regenerated. What should our status of that DR be with these changes?

clang/test/CXX/drs/dr6xx.cpp
723

Same concern here as above.

clang/test/Lexer/unicode.c
40–43

Then we don't have to do line number math to see what line the diagnostics are attached to. :-)

cor3ntin added inline comments.Aug 13 2021, 7:24 AM
clang/lib/Lex/Lexer.cpp
1463

Yes, But Maybe it needs a comment.
Per Unicode spec, XIDContinueChars is a superset of XIDStartChars, and in the code, we don't deduplicate the table to not make them too big.
So we look at both tables, starting with start as it's the most likely to contain the character

clang/test/CXX/drs/dr2xx.cpp
600

Good question!
I will change it to Superseded by P1949. What do you think?

clang/test/Lexer/unicode.c
40–43

It makes me feel dirty but I'll change it!

aaron.ballman added inline comments.
clang/lib/Lex/Lexer.cpp
1463

Ah, that explanation makes sense, thank you! I agree that a comment would be helpful there.

clang/test/CXX/drs/dr2xx.cpp
600

Because we're treating this as a DR, hopefully there's a DR number we could use instead of the paper number. @rsmith -- how should we handle this?

Adding @rsmith to hopefully answer the question about how to handle the dr status page for these changes.

cor3ntin updated this revision to Diff 366286.Aug 13 2021, 8:05 AM

Update dr_cxx_status and add comments

aaron.ballman added inline comments.Aug 13 2021, 11:16 AM
clang/www/cxx_dr_status.html
3

You might have missed this at the top of the file -- this one is a generated file, and that's why I was trying to figure out how to best handle the markings in the test files. make_cxx_dr_status basically looks for those dr comments to generate this content.

cor3ntin added inline comments.Aug 13 2021, 11:43 AM
clang/www/cxx_dr_status.html
3

I did, completely.
As we talked, I was considering hacking the script but I think the simplest solution might be to mark them na.
I'll wait for @rsmith 's opinion!

cor3ntin added a comment.EditedAug 13 2021, 12:20 PM

@rsmith: I modified the script locally to support dxx: dup PXXXX - let me know if you think that's a good solution

rsmith added inline comments.Aug 13 2021, 4:40 PM
clang/test/CXX/drs/dr2xx.cpp
600

I don't think there is nor will ever be a DR number for this; using the P number here seems like the best option available to us.

Ultimately the "sup" marking here serves two purposes:

  • tracking what is done, when it was done, and what's still left to do
  • providing a way to determine why we're not doing what a DR resolution says to do

I think it would be fine to write this as dr248: sup P1949R7, and in the make_cxx_status script, generate text that says "Superseded by P1949R7" like in your edits to cxx_dr_status.html and color DRs superseded by papers as class="na" (because the DR no longer applies due to the paper). (Not saying that's the only option, but I think it would be a reasonable choice.)

Separately, please don't remove the tests here; instead update them to show what the new behavior is, and add tests if necessary so that it's clear in what way the DR's behavior has been superseded.

@rsmith: I modified the script locally to support dxx: dup PXXXX - let me know if you think that's a good solution

Not sure if that's a typo: did you mean "sup" rather than "dup"? In this case, it seems like "sup" is right if the old DR resolution no longer applies. (If the old resolution does still apply, but was generalized, then I think we should just leave the DR status table alone for that issue, because that issue's resolution is still implemented.)

rsmith added inline comments.Aug 13 2021, 5:00 PM
clang/include/clang/Basic/DiagnosticLexKinds.td
117

The old diagnostic text here was bad in the case where the character was supposed to be part of an identifier. The new diagnostic text is bad in the case where the character is not supposed to be part of an identifier (eg, if you copy-paste a smart-quote into the source file). Is there some way we can phrase this diagnostic so it's reasonable regardless of the programmer's intent?

Perhaps the best we can do is to say that if an identifier is immediately followed by one or more "bad" Unicode characters that were probably intended to be identifier characters (things we don't recognize as whitespace or homoglyphs or anything else), then produce the nice "not allowed in identifiers" diagnostic (and maybe even treat the characters as part of the identifier for error recovery purposes), and otherwise diagnose as simply "unexpected character <U+%0>"? I also wonder if perhaps we should treat all unexpected characters as identifier characters for error recovery purposes.

clang/lib/Lex/Lexer.cpp
3237

It's a bit strange that we produce this warning only on characters that aren't allowed in identifiers, and don't warn on identifier continuation characters that aren't valid identifier start characters (such as a standalone combining character). Is there a good reason for that (specifically for the !isAllowedIDChar check above)?

cor3ntin updated this revision to Diff 366425.Aug 14 2021, 4:13 AM
  • Support superseeding DRs with paper in make_cxx_dr_status (ie //drXXX: sup PXXXX)
  • Better diagnostic message when a codepoint is identifier continue but not identifier start
  • Provide a different diagnostic message for Unicode codepoints that are not identifier part and are not part of a pp-identifier.
  • When parsing an identifier, for non-ascii, non-whitespace codepoints, emit an error and then pretend the codepoint is valid for better recovery.
  • Restore the previously removed DR tests
cor3ntin updated this revision to Diff 366426.Aug 14 2021, 4:16 AM

Formatting

@rsmith: I modified the script locally to support dxx: dup PXXXX - let me know if you think that's a good solution

Not sure if that's a typo: did you mean "sup" rather than "dup"? In this case, it seems like "sup" is right if the old DR resolution no longer applies. (If the old resolution does still apply, but was generalized, then I think we should just leave the DR status table alone for that issue, because that issue's resolution is still implemented.)

yep, I meant sup, sorry.
I made the change

cor3ntin marked 2 inline comments as done.Aug 14 2021, 4:32 AM
cor3ntin added inline comments.
clang/include/clang/Basic/DiagnosticLexKinds.td
117

I changed the PR so that, at the start of an identifier, if the Unicode character is neither a whitespace, identifier start or identifier continue,
we display unexpected character <U+%0>.
It's not perfect because we make no effort to distinguish pp-numbers from identifiers, but it seems good enough! Certainly an improvement :)

I also recover nicely for non-leading invalid Unicode codepoints. I am not sure we should do that for the leading case though.

cor3ntin updated this revision to Diff 366430.Aug 14 2021, 4:41 AM

Fix test formatting

cor3ntin marked an inline comment as done.Aug 14 2021, 4:42 AM
aaron.ballman added inline comments.Aug 16 2021, 6:31 AM
clang/lib/Lex/Lexer.cpp
1622–1623
1628
1687–1689
1694–1696
clang/www/make_cxx_dr_status
139
cor3ntin updated this revision to Diff 366860.Aug 17 2021, 4:28 AM
cor3ntin marked an inline comment as done.

Fix Aaron's comment

cor3ntin marked 4 inline comments as done.Aug 17 2021, 4:47 AM
aaron.ballman accepted this revision.Aug 17 2021, 4:53 AM

LGTM aside from a commenting nit that I managed to miss last time (sorry about that). Btw, would you like to apply for commit privileges now that you've gotten a few patches under your belt (https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access)?

clang/lib/Lex/Lexer.cpp
1661–1663

Oops, I missed this one last time!

This revision is now accepted and ready to land.Aug 17 2021, 4:53 AM
cor3ntin updated this revision to Diff 366865.Aug 17 2021, 4:58 AM

Fix comment

aaron.ballman closed this revision.Aug 18 2021, 4:36 AM

Thank you for the patch, I've landed it on your behalf in 4e80636db71a1b6123d15ed1f9eda3979b4292de

Can you roll this back and don't support P1949? For some inexplicable reason ∂, 𝜕 partial derivative symbols are now not supported. Neither as XID_Start nor as XID_Continue. Where is logic in that? It has no sense, it's not some crazy emoji symbols. Moreover, it breaks the old code.

Herald added a project: Restricted Project. · View Herald TranscriptMay 2 2022, 11:36 AM

Can you roll this back and don't support P1949?

No; P1949 was adopted for C++23, so this is effectively a nonstarter unless WG21 backs the paper out and I'm not aware of any push within the committee to do that.

For some inexplicable reason ∂, 𝜕 partial derivative symbols are now not supported. Neither as XID_Start nor as XID_Continue. Where is logic in that?

According to the Unicode consortium, both of those are mathematical symbols, not identifier characters. WG21 and WG14 now both defer to the Unicode consortium on what constitutes a valid identifier character and only deviate to allow _ for historical reasons.

It's worth noting that Clang has never accepted as a valid identifier even though we did accept 𝜕: https://godbolt.org/z/bTKoKjjdc

It has no sense, it's not some crazy emoji symbols. Moreover, it breaks the old code.

That's expected and intentional, which is unfortunate for folks in your situation and I'm sorry to hear you're caught out by this.

Clang could support a more relaxed mode via a feature flag to opt into allowing non-conforming identifiers in C++23, but I think we should be cautious about adding feature flags to deviate from the standard here until the standards bodies have had the opportunity to weigh in on the topic, unless there's a significant amount of code breaking in the wild. Thus far, I'm not aware of any system headers or other major third-party libraries that are impacted (do you know of widely deployed libraries that are caught by this?). We could also elect not to implement this paper in older language modes, but as with the feature flag, I think we want to be cautious about that. The issue with either approach is: previous support for Unicode characters lead to mysterious behaviors that can be dangerous (see recent discussions about trojan source as an example).

Can you roll this back and don't support P1949? For some inexplicable reason ∂, 𝜕 partial derivative symbols are now not supported. Neither as XID_Start nor as XID_Continue. Where is logic in that? It has no sense, it's not some crazy emoji symbols. Moreover, it breaks the old code.

For reference in reading this, because display is going to be inconsistent for many people:

∂ - U+2202, PARTIAL DIFFERENTIAL

𝜕 - U+1D715, MATHEMATICAL ITALIC PARTIAL DIFFERENTIAL

The first is part of the Mathematical Operators block (U+2200–U+22FF), and has always been excluded from C and C++.

The second is part of the Mathematical Alphanumeric Symbols block (U+1D400–U+1D7FF) and is part one of the blocks that was 'allow listed' in C and C++ because it was unassigned at the time of standardization in the late 90s. It was assigned in Unicode 3.1.0 (March, 2001).

I hope at least GCC will not implement this arbitrary restriction on math symbols. I would argue a partial derivative symbol in C++ has more sense than a skull emoji.

P.S. I would love to see features like std::jthread in Clang instead of removing math symbols from identifiers.

I hope at least GCC will not implement this arbitrary restriction on math symbols. I would argue a partial derivative symbol in C++ has more sense than a skull emoji.

I am quite sure that gcc will also be implementing P1949 as adopted for C++23.

Thank you to Aaron and Steve for their excellent analysis and responses!

@intractabilis, please note that this is not the end of the identifier story. A new Unicode work group recently started meeting as a result of the Trojan Source reporting. Some members of that work group also maintain code bases that were negatively impacted by the changes in P1949. That group is expected to eventually produce guidance to better help language designers determine what should and should not be allowed in identifiers. The concerns are complicated; see Unicode UAX#31, UAX#36, and UAX#39 for detailed analysis of most of the concerns. I think it is unlikely that WG21 or any of the major compiler implementors will deviate from P1949 until new guidance emerges from the Unicode Technical Committee. With luck, that guidance will help to establish conventions that will be followed by most/all programming languages. But don't hold your breath; this will likely take quite a while.

I am quite sure that gcc will also be implementing P1949 as adopted for C++23.

Yes, you are right, it is available in GCC 12 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100977).

Since it is a C++23 feature, can we at least have it only if the user requests C++23 standard compliance? Isn't it strange to force it upon users if they specify -std=c++20? I am going to check how GCC 12 behaves with this regard.

I checked the latest GCC 12. Despite P1949 support, I can use math symbols like 𝜕 and 𝛻 without any problem with any combination of -std=c++xx, xx ≠ 98, xx ≠ 03. Thanks, God, there are reasonable people in the GCC team.

test.cpp:

#include <iostream>

int main(int argc, char* argv[]) {
    auto 𝜕Ω = 4;
    std::cout << "𝜕Ω = " << 𝜕Ω << std::endl;
}
$ g++ --version
g++ (GCC) 12.0.1 20220504 (prerelease)
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ g++ -std=c++23 -o test test.cpp
$ ./test
𝜕Ω = 4

Since it is a C++23 feature, can we at least have it only if the user requests C++23 standard compliance?

WG21 adopted P1949 as a defect report (DR) against prior standards with the intent that the changes be applied retroactively.

Isn't it strange to force it upon users if they specify -std=c++20?

This is how DRs are intended to be handled.

I checked the latest GCC 12. Despite P1949 support, I can use math symbols like 𝜕 and 𝛻 without any problem with any combination of -std=c++xx, xx ≠ 98, xx ≠ 03.

It looks to me like gcc implements an extension that allows additional characters not permitted by P1949 to still be used in non-pedantic mode. Relevant gcc commits include:

Thanks, God, there are reasonable people in the GCC team.

Please refrain from such unhelpful comments.

Please refrain from such unhelpful comments.

Absolutely, no problem. It's great that my other comments were helpful.