UTF8 char literals are always unsigned.
Details
- Reviewers
aaron.ballman tahonermann - Group Reviewers
Restricted Project - Commits
- rGb91073db6ac3: [clang][preprocessor] Fix unsigned-ness of utf8 char literals
Diff Detail
Unit Tests
Time | Test | |
---|---|---|
6,160 ms | x64 debian > libFuzzer.libFuzzer::fork-ubsan.test |
Event Timeline
Presuming precommit CI comes back green, this LGTM! Can you also add a release note for the bug fix when landing, please?
I think changes are needed to make this behavior dependent on whether char8_t support is active or not.
clang/lib/Lex/PPExpressions.cpp | ||
---|---|---|
413 | I think the check for UTF-8 should also be conditioned on PP.getLangOpts().Char8. When char8_t support is not enabled (as in C++17 or with -fno-char8_t in C++20), UTF-8 character literals still have type char. else if (!(Literal.isUTF8() && PP.getLangOpts().Char8) && !Literal.isUTF16() && !Literal.isUTF32()) | |
clang/test/Lexer/utf8-char-literal.cpp | ||
4 | I think we should drop testing for -std=c++1z and add testing of -std=c++17 and -std=c++20. Ideally, the test would then validate the differences in behavior. | |
31–34 | Prior to C++20 (unless -fchar8_t is passed), u8'\xff' should have the same behavior as '\xff'. |
clang/lib/Lex/PPExpressions.cpp | ||
---|---|---|
413 | My C++ bias may be showing here; LangOptions.Char8 may not be relevant for C, so this may require additional qualification. |
clang/test/Lexer/utf8-char-literal.cpp | ||
---|---|---|
32–42 | The equality operators seem backwards to what @tahonermann was saying -- I read his comment as: C++17/14/11: u8'\xff' == '\xff' Hopefully Tom can clarify if I misunderstood. | |
60 | ||
65 | I'm fine with the formatting -- it helps readability, and we don't require our tests to be correctly formatted anyway. |
Thanks for your continued work on this, Tim! I think this is close. I did spot one issue and added a few other comments.
clang/lib/Lex/PPExpressions.cpp | ||
---|---|---|
416–417 | Thanks for breaking the conditions out; that does make this simpler to understand. I don't think this is right yet though. In C++, if PP.getLangOpts().Char8 is false, then signedness is determined by PP.getLangOpts().CharIsSigned. Perhaps this: else if (Literal.isUTF8()) { if (PP.getLangOpts().CPlusPlus) Val.setIsUnsigned(PP.getLangOpts().Char8 ? true : !PP.getLangOpts().CharIsSigned); else Val.setIsUnsigned(true); } The test case didn't catch this because char is always a signed type for the variations that are exercised. We could add a variant that includes -funsigned-char, and then modify the test based on the presence of __CHAR_UNSIGNED__, but that might get pretty awkward. | |
clang/test/Lexer/utf8-char-literal.cpp | ||
4–5 | Does the -fchar8_t option have any effect in C at present? Gcc maintainers are currently not planning to acknowledge that option in C modes since WG14 did not want to add language dialect concerns for C. This is why N2653 doesn't have wording that includes a feature test macro. The gcc maintainers pushed back on the _CHAR8_T_SOURCE macro mentioned in the "Implementation Experience" section. I think Clang should follow suit; attempts to use -fchar8_t or -fno-char8_t in C modes should be diagnosed; which means that we don't have to exercise these options with C2x. | |
8–10 | Rather than adding your own CHAR8_T and NO_CHAR8_T macros, you can use the predefined __cpp_char8_t feature test macro. | |
32–42 | Yes, that looks right (as long as the target has a signed char type). |
clang/test/Lexer/utf8-char-literal.cpp | ||
---|---|---|
32–42 | Are you listing the positive conditions (that should be true) or negative ones? The conditions in the test case need to be false in order for the test case to succeed. |
clang/test/Lexer/utf8-char-literal.cpp | ||
---|---|---|
6–8 | The -DCHAR8_T and -DNO_CHAR8_T options can be removed now since the test is no longer dependent on them. | |
32–42 | Yes, these list the expected behavior (e.g., the assertable behavior). | |
32–57 | Something is still not right here. -std=c++17 should behave the same as -std=c++20 -fno-char8_t. Likewise, -std=c++17 -fchar8_t should behave the same as -std=c++20. Basically, if __cpp_char8_t is defined, then u8'\xff' should be unsigned and if that macro isn't defined, then the result should be signed (for an 8-bit signed char type). But the reversed conditions for the C++17 vs C++20 tests above conflict with that expectation. The code changes look right. Is the test actually passing like this? Since the tests are now conditioned on __cpp_char8_t, I think they should be merged. I suggest: // UTF-8 character literals are enabled in C++17 and later. If `-fchar8_t` is not enabled // (as is the case in C++17), then UTF-8 character literals may produce signed or // unsigned values depending on whether char is a signed type. If `-fchar8_t` is enabled // (which is the default behavior for C++20), then UTF-8 character literals always // produce unsigned values. The tests below depend on the target having a signed // 8-bit char so that '\xff' produces a negative value. #if __cplusplus >= 201703L # if !defined(__cpp_char8_t) # if !(u8'\xff' == '\xff') # error UTF-8 character value did not match ordinary character literal; this is unexpected # endif # else # if u8'\xff' == '\xff' // expected-warning {{right side of operator converted from negative value to unsigned}} # error UTF-8 character value matched ordinary character literal; this is unexpected # endif # endif #endif |
clang/test/Lexer/utf8-char-literal.cpp | ||
---|---|---|
4–5 |
Yes, it does: https://godbolt.org/z/1co3YYYf8 (it sets LangOpts.Char8) |
clang/test/Lexer/utf8-char-literal.cpp | ||
---|---|---|
4–5 | Oh, that is very wrong. char8_t should be neither a keyword nor a type specifier in C modes; char8_t will be a typedef in C23. @tbaeder, if you are willing to take on fixing this as well, that would be most appreciated! This doesn't need to be fixed as part of this review though. I filed https://github.com/llvm/llvm-project/issues/55373 to track this issue. |
clang/test/Lexer/utf8-char-literal.cpp | ||
---|---|---|
51–53 | The C++ case looks good now, but the condition doesn't look right for the C case. The expectation is that u8'\xff' should not match '\xff' in C23 mode, but the test treats this as an error. If the test is passing, that indicates something is not being validated correctly. Shouldn't unexpected error diagnostics cause the test to fail? |
clang/test/Lexer/utf8-char-literal.cpp | ||
---|---|---|
51–53 | I used u8'\xff' != 0xff here because that's the condition you mentioned in the phab review adding the u8 prefix. Using u8'\xff' != '\xff' indeed fails with: clang/test/Lexer/utf8-char-literal.cpp Line 55: u8 char literal is not unsigned clang/test/Lexer/utf8-char-literal.cpp Line 54: right side of operator converted from negative value to unsigned: -1 to 18446744073709551615 |
clang/test/Lexer/utf8-char-literal.cpp | ||
---|---|---|
51–53 | Oh! I missed the use of 0xff vs '\xff'. Sorry if I mislead you. In order to avoid such subtle differences in the test, can we use the same check as for C++? // UTF-8 character literals are enabled in C23 and later and are always unsigned. #if __STDC_VERSION__ >= 202000L # if u8'\xff' == '\xff' // expected-warning {{right side of operator converted from negative value to unsigned}} # error UTF-8 character value matched ordinary character literal; this is unexpected # endif #endif |
I think the check for UTF-8 should also be conditioned on PP.getLangOpts().Char8. When char8_t support is not enabled (as in C++17 or with -fno-char8_t in C++20), UTF-8 character literals still have type char.