This is an archive of the discontinued LLVM Phabricator instance.

[clang][lexer] Allow u8 character literal prefixes in C2x
ClosedPublic

Authored by tbaeder on Feb 8 2022, 1:19 AM.

Diff Detail

Event Timeline

tbaeder requested review of this revision.Feb 8 2022, 1:19 AM
tbaeder created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 8 2022, 1:19 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Can you also add a release note for the new feature, and update the clang/www/c_status.html page as well?

clang/test/Lexer/utf8-char-literal.cpp
23

One more test I'd like to see added, just to make sure we're covering 6.4.4.4p9 properly:

_Static_assert(
  _Generic(u8'a',
           default: 0,
           unsigned char : 1),
  "Surprise!");

We expect the type of a u8 character literal to be unsigned char at the moment, which is different from a u8 string literal, which uses char.

However, WG14 is also going to be considering http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm for C2x at our meeting next week.

tahonermann added inline comments.Feb 11 2022, 7:11 PM
clang/lib/Lex/Lexer.cpp
3462–3465

The comment is slightly misleading both before and after this change. Assuming this level of detail is desired, I suggest:

// Identifer (e.g., uber), or
// UTF-8 (C2x/C++17) or UTF-16 (C11/C++11) character literal, or
// UTF-8 or UTF-16 string literal (C11/C++11).
case 'u':
clang/test/Lexer/utf8-char-literal.cpp
23

Good suggestion. I believe the following update will be needed to`Sema::ActOnCharacterConstant() in clang/lib/Sema/SemaExpr.cpp`:

...
else if (Literal.isUTF8() && getLangOpts().C2x)
  Ty = Context.UnsignedCharTy; // u8'x' -> unsigned char in c2x.
else if Literal.isUTF8() && getLangOpts().Char8)
  Ty = Context.Char8Ty; // u8'x' -> char8_t when it exists.
...
28

We should also exercise the preprocessor with something like this:

#if u8'\xff' != 0xff
#error uh oh
#endif

Hmm, this currently fails for C++20 for both Clang and gcc unless -funsigned-char is passed. That seems wrong. https://godbolt.org/z/Tb7z85ToG. MSVC gets this wrong too, but I think for a different reason; see the implementation impact section of P2029 if curious.

tbaeder updated this revision to Diff 408341.Feb 13 2022, 11:49 PM
tbaeder marked 3 inline comments as done.
tahonermann added inline comments.Feb 18 2022, 8:44 AM
clang/lib/Sema/SemaExpr.cpp
3623–3624

Perhaps worth updating this comment? e.g.,

// 'x' -> char in C++; u8'x' -> char in C11-C17 and in C++ without char8_t.
clang/test/Lexer/utf8-char-literal.cpp
19

This is an interesting difference between C and C++ that I was not aware of. Looks right though!

tbaeder updated this revision to Diff 410236.Feb 21 2022, 12:15 AM
tbaeder marked an inline comment as done.Feb 21 2022, 12:54 AM
tbaeder added inline comments.
clang/test/Lexer/utf8-char-literal.cpp
28

This also fails in C2x.

Is there anything missing here?

Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2022, 11:09 PM
aaron.ballman added inline comments.Apr 5 2022, 2:12 PM
clang/lib/Lex/Lexer.cpp
3462–3465
clang/test/Lexer/utf8-char-literal.cpp
3

no need for the new -D

16

You can test with __STDC_VERSION__ > 202000L.

23

However, WG14 is also going to be considering http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm for C2x at our meeting next week.

I have an update on this. We discussed the paper and took a straw poll:

Does WG14 wish to adopt N2653 in C23? 18/0/2 (consensus)

So we should make sure that we all agree this patch is in line with the changes from that paper. I believe your changes agree, but it'd be nice for @tahonermann to confirm.

28

I don't think we need to fix the preprocessor behavior as part of this patch, but it would be good to file an issue for this so we know to track it down at some point.

tbaeder marked 2 inline comments as done.Apr 12 2022, 11:17 PM
tbaeder added inline comments.
clang/test/Lexer/utf8-char-literal.cpp
28
tbaeder updated this revision to Diff 422402.Apr 12 2022, 11:18 PM
tbaeder marked an inline comment as done.
aaron.ballman accepted this revision.Apr 13 2022, 4:12 AM

The changes LGTM (thanks for filing the issue). Please wait a day or so for Tom to sign off though (I've pinged him off-list).

This revision is now accepted and ready to land.Apr 13 2022, 4:12 AM
tahonermann accepted this revision.Apr 13 2022, 8:54 AM

Looks good to me! Thank you for filing the separate issue.

clang/test/Lexer/utf8-char-literal.cpp
23

Confirmed. N2653 technically changes the type of u8 character literals to char8_t, but since that is just a typedef of unsigned char, these changes still align with the semantic intent. Ideally, we would maybe try to reflect the typedef, but 1) the typedef isn't necessarily available, 2) Clang doesn't do similarly for any of the other character (or string) literals, and 3) no one is likely to care anyway.

This revision was landed with ongoing or failed builds.Apr 19 2022, 1:00 AM
This revision was automatically updated to reflect the committed changes.