This is an archive of the discontinued LLVM Phabricator instance.

[Clang] Add support for Unicode identifiers (UAX31) in C2x mode.
ClosedPublic

Authored by cor3ntin on Jul 23 2022, 3:22 AM.

Details

Summary

This implements
N2836 Identifier Syntax using Unicode Standard Annex 31.

The feature was already implemented for C++,
and the semantics are the same.

Unlike C++ there was, afaict, no decision to
backport the feature in older languages mode,
so C17 and earlier are not modified and the
code point tables for these language modes are conserved.

Diff Detail

Event Timeline

cor3ntin created this revision.Jul 23 2022, 3:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2022, 3:22 AM
cor3ntin requested review of this revision.Jul 23 2022, 3:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 23 2022, 3:22 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
cor3ntin updated this revision to Diff 447051.Jul 23 2022, 3:28 AM

Update a comment

aaron.ballman accepted this revision.Jul 23 2022, 4:53 AM

LGTM aside from a typo fix in a comment. One thing worth thinking about is whether it would be worth it to have a diagnostic for valid C code in older modes that will change behavior in C2x due to this feature, but that can be done as follow-up work.

clang/lib/Lex/Lexer.cpp
1469
This revision is now accepted and ready to land.Jul 23 2022, 4:53 AM
This revision was landed with ongoing or failed builds.Jul 23 2022, 5:08 AM
This revision was automatically updated to reflect the committed changes.

Apologies for the much delayed review (I know this already landed).

I added a comment, but I don't know if there is actually an issue to be fixed. This looks good to me otherwise.

clang/lib/Lex/Lexer.cpp
1490–1492

Perhaps this should permit $ when LangOpts.DollarIdents is true. However, my attempts to produce a test case that illustrates such a need failed. I don't know why; I guess isAllowedInitiallyIDChar() isn't called for all identifiers? Or there is special handling of LangOpts.DollarIdents somewhere? I wonder if that might imply an issue elsewhere; such as a missing call to isAllowedInitiallyIDChar() somewhere?

I did verify that '$' (U+003F) is not in XID_Start: https://util.unicode.org/UnicodeJsps/list-unicodeset.jsp?a=[:XID_Start=Yes:]

cor3ntin added inline comments.Jul 28 2022, 12:52 PM
clang/lib/Lex/Lexer.cpp
1490–1492

Thanks for the review. I think checking for underscore here is actually the issue, it's most likely never true.
isAllowedInitiallyIDChar is never called on ascii codepoints as the fast path uses isAsciiIdentifierStart - which is where $ is checked for.

(LexIdentifierContinue handles ascii characters too and correctly deal with $ )