This is an archive of the discontinued LLVM Phabricator instance.

[clang][preprocessor] Fix unsigned-ness of utf8 char literals
ClosedPublic

Authored by tbaeder on May 5 2022, 3:13 AM.

Details

Summary

UTF8 char literals are always unsigned.

Fixes https://github.com/llvm/llvm-project/issues/54886

Diff Detail

Event Timeline

tbaeder created this revision.May 5 2022, 3:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2022, 3:13 AM
tbaeder requested review of this revision.May 5 2022, 3:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 5 2022, 3:13 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
aaron.ballman added inline comments.May 5 2022, 4:01 AM
clang/test/Lexer/utf8-char-literal.cpp
21

I missed this one before. :-(

36

Uh oh.

tbaeder updated this revision to Diff 427267.May 5 2022, 4:12 AM
tbaeder marked 2 inline comments as done.
aaron.ballman accepted this revision.May 5 2022, 4:46 AM

Presuming precommit CI comes back green, this LGTM! Can you also add a release note for the bug fix when landing, please?

This revision is now accepted and ready to land.May 5 2022, 4:46 AM
tbaeder updated this revision to Diff 427283.May 5 2022, 5:13 AM
tbaeder updated this revision to Diff 427289.May 5 2022, 5:47 AM
tahonermann requested changes to this revision.May 5 2022, 7:33 AM

I think changes are needed to make this behavior dependent on whether char8_t support is active or not.

clang/lib/Lex/PPExpressions.cpp
414–419

I think the check for UTF-8 should also be conditioned on PP.getLangOpts().Char8. When char8_t support is not enabled (as in C++17 or with -fno-char8_t in C++20), UTF-8 character literals still have type char.

else if (!(Literal.isUTF8() && PP.getLangOpts().Char8) && !Literal.isUTF16() && !Literal.isUTF32())
clang/test/Lexer/utf8-char-literal.cpp
3–9

I think we should drop testing for -std=c++1z and add testing of -std=c++17 and -std=c++20. Ideally, the test would then validate the differences in behavior.

36–39

Prior to C++20 (unless -fchar8_t is passed), u8'\xff' should have the same behavior as '\xff'.

This revision now requires changes to proceed.May 5 2022, 7:33 AM
tahonermann added inline comments.May 5 2022, 8:04 AM
clang/lib/Lex/PPExpressions.cpp
414–419

My C++ bias may be showing here; LangOptions.Char8 may not be relevant for C, so this may require additional qualification.

aaron.ballman added a reviewer: Restricted Project.May 5 2022, 8:35 AM

I think changes are needed to make this behavior dependent on whether char8_t support is active or not.

Good catch Tom, I forgot we had options to control that! I agree.

tbaeder updated this revision to Diff 427614.May 6 2022, 6:10 AM
tbaeder marked 2 inline comments as done.
tbaeder added inline comments.
clang/test/Lexer/utf8-char-literal.cpp
39

I'm a little confused with the amount of combinations at this point, so please tell me if the emitted warning here looks wrong.

70

I know indenting the preprocessor directives here isn't according to coding style, but it helps a lot with readability.

aaron.ballman added inline comments.May 6 2022, 6:47 AM
clang/test/Lexer/utf8-char-literal.cpp
37–47

The equality operators seem backwards to what @tahonermann was saying -- I read his comment as:

C++17/14/11: u8'\xff' == '\xff'
C++17/14/11, -fchar8_t: u8'\xff' != '\xff'
C++20 and up: u8'\xff' != '\xff'
C++20 and up, -fno-char8_t: u8'\xff' == '\xff'

Hopefully Tom can clarify if I misunderstood.

65
70

I'm fine with the formatting -- it helps readability, and we don't require our tests to be correctly formatted anyway.

Thanks for your continued work on this, Tim! I think this is close. I did spot one issue and added a few other comments.

clang/lib/Lex/PPExpressions.cpp
417–418

Thanks for breaking the conditions out; that does make this simpler to understand.

I don't think this is right yet though. In C++, if PP.getLangOpts().Char8 is false, then signedness is determined by PP.getLangOpts().CharIsSigned. Perhaps this:

else if (Literal.isUTF8()) {
  if (PP.getLangOpts().CPlusPlus)
    Val.setIsUnsigned(PP.getLangOpts().Char8 ? true : !PP.getLangOpts().CharIsSigned);
  else
    Val.setIsUnsigned(true);
}

The test case didn't catch this because char is always a signed type for the variations that are exercised. We could add a variant that includes -funsigned-char, and then modify the test based on the presence of __CHAR_UNSIGNED__, but that might get pretty awkward.

clang/test/Lexer/utf8-char-literal.cpp
3–4

Does the -fchar8_t option have any effect in C at present?

Gcc maintainers are currently not planning to acknowledge that option in C modes since WG14 did not want to add language dialect concerns for C. This is why N2653 doesn't have wording that includes a feature test macro. The gcc maintainers pushed back on the _CHAR8_T_SOURCE macro mentioned in the "Implementation Experience" section.

I think Clang should follow suit; attempts to use -fchar8_t or -fno-char8_t in C modes should be diagnosed; which means that we don't have to exercise these options with C2x.

7–9

Rather than adding your own CHAR8_T and NO_CHAR8_T macros, you can use the predefined __cpp_char8_t feature test macro.

37–47

Yes, that looks right (as long as the target has a signed char type).

tbaeder updated this revision to Diff 427989.May 9 2022, 12:44 AM
tbaeder marked an inline comment as done.
tbaeder marked 4 inline comments as done.
tbaeder added inline comments.
clang/test/Lexer/utf8-char-literal.cpp
37–47

Are you listing the positive conditions (that should be true) or negative ones? The conditions in the test case need to be false in order for the test case to succeed.

tahonermann added inline comments.May 9 2022, 9:52 AM
clang/test/Lexer/utf8-char-literal.cpp
5–7

The -DCHAR8_T and -DNO_CHAR8_T options can be removed now since the test is no longer dependent on them.

37–47

Yes, these list the expected behavior (e.g., the assertable behavior).

37–62

Something is still not right here. -std=c++17 should behave the same as -std=c++20 -fno-char8_t. Likewise, -std=c++17 -fchar8_t should behave the same as -std=c++20. Basically, if __cpp_char8_t is defined, then u8'\xff' should be unsigned and if that macro isn't defined, then the result should be signed (for an 8-bit signed char type). But the reversed conditions for the C++17 vs C++20 tests above conflict with that expectation. The code changes look right. Is the test actually passing like this?

Since the tests are now conditioned on __cpp_char8_t, I think they should be merged. I suggest:

// UTF-8 character literals are enabled in C++17 and later. If `-fchar8_t` is not enabled
// (as is the case in C++17), then UTF-8 character literals may produce signed or
// unsigned values depending on whether char is a signed type. If `-fchar8_t` is enabled
// (which is the default behavior for C++20), then UTF-8 character literals always
// produce unsigned values. The tests below depend on the target having a signed
// 8-bit char so that '\xff' produces a negative value.
#if __cplusplus >= 201703L
#  if !defined(__cpp_char8_t)
#    if !(u8'\xff' == '\xff')
#      error UTF-8 character value did not match ordinary character literal; this is unexpected
#    endif
#  else
#    if u8'\xff' == '\xff' // expected-warning {{right side of operator converted from negative value to unsigned}}
#      error UTF-8 character value matched ordinary character literal; this is unexpected
#    endif
#  endif
#endif
tbaeder updated this revision to Diff 428303.May 10 2022, 12:49 AM
tbaeder marked 4 inline comments as done.
aaron.ballman added inline comments.May 10 2022, 5:55 AM
clang/test/Lexer/utf8-char-literal.cpp
3–4

Does the -fchar8_t option have any effect in C at present?

Yes, it does: https://godbolt.org/z/1co3YYYf8 (it sets LangOpts.Char8)

tahonermann added inline comments.May 10 2022, 9:04 AM
clang/test/Lexer/utf8-char-literal.cpp
3–4

Oh, that is very wrong. char8_t should be neither a keyword nor a type specifier in C modes; char8_t will be a typedef in C23.

@tbaeder, if you are willing to take on fixing this as well, that would be most appreciated! This doesn't need to be fixed as part of this review though. I filed https://github.com/llvm/llvm-project/issues/55373 to track this issue.

tahonermann added inline comments.May 10 2022, 9:11 AM
clang/test/Lexer/utf8-char-literal.cpp
56–58

The C++ case looks good now, but the condition doesn't look right for the C case. The expectation is that u8'\xff' should not match '\xff' in C23 mode, but the test treats this as an error. If the test is passing, that indicates something is not being validated correctly. Shouldn't unexpected error diagnostics cause the test to fail?

tbaeder added inline comments.May 10 2022, 11:52 PM
clang/test/Lexer/utf8-char-literal.cpp
56–58

I used u8'\xff' != 0xff here because that's the condition you mentioned in the phab review adding the u8 prefix. Using u8'\xff' != '\xff' indeed fails with:

clang/test/Lexer/utf8-char-literal.cpp Line 55: u8 char literal is not unsigned
clang/test/Lexer/utf8-char-literal.cpp Line 54: right side of operator converted from negative value to unsigned: -1 to 18446744073709551615
tahonermann added inline comments.May 11 2022, 7:25 AM
clang/test/Lexer/utf8-char-literal.cpp
56–58

Oh! I missed the use of 0xff vs '\xff'. Sorry if I mislead you. In order to avoid such subtle differences in the test, can we use the same check as for C++?

// UTF-8 character literals are enabled in C23 and later and are always unsigned.
#if __STDC_VERSION__ >= 202000L
#  if u8'\xff' == '\xff' // expected-warning {{right side of operator converted from negative value to unsigned}}
#    error UTF-8 character value matched ordinary character literal; this is unexpected
#  endif
#endif
tbaeder updated this revision to Diff 428861.May 11 2022, 11:48 PM
tbaeder marked 2 inline comments as done.
tahonermann accepted this revision as: tahonermann.May 12 2022, 7:18 AM

Looks good, @tbaeder! Thank you for sticking with me through all these iterations!

This revision is now accepted and ready to land.May 12 2022, 7:18 AM
aaron.ballman accepted this revision.May 12 2022, 8:16 AM

LGTM as well, thank you for this!

This revision was landed with ongoing or failed builds.May 12 2022, 11:05 PM
This revision was automatically updated to reflect the committed changes.