This is an archive of the discontinued LLVM Phabricator instance.

[clang] Add support for dollar sign in ud_suffix of numeric constants
Needs ReviewPublic

Authored by annara on Mar 26 2023, 12:27 PM.

Details

Summary

According to the standard, $ signs can be used as valid identifiers.

Because the dollar sign is handled specially throughout the Lexer and because the lexing of ud-suffixes is done as an afterthought for numerical constants,
it misses the lexing of ud-suffixes containing $ signs.

A bug related to this problem was reported here.

Diff Detail

Event Timeline

annara created this revision.Mar 26 2023, 12:27 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 26 2023, 12:27 PM
annara requested review of this revision.Mar 26 2023, 12:27 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 26 2023, 12:27 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
annara edited the summary of this revision. (Show Details)Mar 26 2023, 12:36 PM

@cor3ntin, do you have any thoughts about this? CWG 1871 (Non-identifier characters in ud-suffix) is somewhat related. I think the changed behavior makes sense when -fdollars-in-identifiers is in effect (which is the default).

clang/test/CXX/lex/lex.literal/lex.ext/p3.cpp
20–21

If we're going to allow $ in user-defined literal (UDL) suffixes, then I think we should do so for all UDL kinds (and have tests for each kind).

We should also test for $ specified as a UCN (\u0024).

The commit message is certainly not correct, $ is not allowed in identifiers by either the C or C++ standard and its support in compiler is hurting future evolution.
that being said, given that we already do do use it in identifiers, why not in UCNs.

I would like to see a test that ensures 42$ without underscore is ill-formed though, as this would have severe implication for evolution.
if this aimes to fix the github issue, i think we need to understand why the difference of behavior with the euro sign - which should not be (and isn't) allowed either.

GCC seems to allow both in non pedantic mode, which may explain the struggles of clang format users

annara added a comment.EditedMar 27 2023, 2:08 AM

After a second look at the standard, you are right.
It's not in the range of any of the universal-character-names.
That being said, it is allowed as an identifier by default in clang (as @tahonermann) mentioned.
It also works in current versions of clang if you use the UCN in the numeric constant (and leave the user-defined literal operator declaration with a $ sign).
As you stated, GCC also accepts this syntax.

I can make sure dollar signs not part of a ud-sufffix will be rejected, if you think the general idea is acceptable.

Regarding the difference between the dollar and the euro sign. It seems the lexer lexes the euro sign as part of the numeric constant (like any other ud-suffix) and probably fails at a later stage (I'm not sure where yet).
I suspect the dollar sign passes that step because of -fdollars-in-identifiers being true by default.
But, I need to look into it further to give a real answer.

Please let me know if you think it's worth pursuing.

I think it's worth pursuing and investigating as long as

  • We requires some prefix before the dollar ( like _), and `42$ is never valid
  • -Wdollar-in-identifier-extension is emitted
  • We have a better understanding of the current behavior for the euro sign - logically the euro sign should be split in its own token too at the moment so *maybe* there is another bug there!

Thanks for working on this btw.

tahonermann added a comment.EditedMar 27 2023, 7:28 AM

It's not in the range of any of the universal-character-names.

I assume you mean that it isn't included in the range of UCNs that are allowed in identifiers. If so, that is correct; it is neither in XID_Start or XID_Continue (note that the current wording for the C and C++ standards was recently changed for C23 and C++23 to defer to the Unicode standard for what characters are allowed in identifiers; see N2836 and P1949).

Interesting, Clang currently crashes when $ is specified using a UCN in an identifier with -fno-dollars-in-identifiers. It would be nice to correct that issue with this patch. See https://godbolt.org/z/9c7Pc8jbT.

There is also a proposal to WG14 to allow $ in identifiers. See N3046. I don't think that paper has been reviewed by WG14 yet. If approved by WG14, I would expect a similar proposal to be submitted to WG21. Nothing to do here now; just something to be aware of.

There is a proposal to WG21 to add $ (and other characters) to the basic character set. See P2558. If accepted as currently proposed, it would make use of a UCN to name $ outside of character and string literals ill-formed. Again, nothing to do here now; just something to be aware of.

Thanks for all the input!

I'll work on @cor3ntin points and take a look at the crash

rymiel added a subscriber: rymiel.Mar 27 2023, 10:58 PM

It's not in the range of any of the universal-character-names.

I assume you mean that it isn't included in the range of UCNs that are allowed in identifiers. If so, that is correct; it is neither in XID_Start or XID_Continue (note that the current wording for the C and C++ standards was recently changed for C23 and C++23 to defer to the Unicode standard for what characters are allowed in identifiers; see N2836 and P1949).

Interesting, Clang currently crashes when $ is specified using a UCN in an identifier with -fno-dollars-in-identifiers. It would be nice to correct that issue with this patch. See https://godbolt.org/z/9c7Pc8jbT.

There is also a proposal to WG14 to allow $ in identifiers. See N3046. I don't think that paper has been reviewed by WG14 yet. If approved by WG14, I would expect a similar proposal to be submitted to WG21. Nothing to do here now; just something to be aware of.

There is a proposal to WG21 to add $ (and other characters) to the basic character set. See P2558. If accepted as currently proposed, it would make use of a UCN to name $ outside of character and string literals ill-formed. Again, nothing to do here now; just something to be aware of.

I missed that comment @tahonermann
The crash is there https://github.com/llvm/llvm-project/issues/62133
But fundamentally C and C++ having different behavior is nasty, so I wrote a paper https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3124.pdf

I wrote a paper https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3124.pdf

It appears you back ported that paper from the future: "Date: 2024-22-04" :)