This is an archive of the discontinued LLVM Phabricator instance.

Differential D124996

[clang][preprocessor] Fix unsigned-ness of utf8 char literals
Closed, Public

Authored by tbaeder on May 5 2022, 3:13 AM.

Details

Reviewers

aaron.ballman
tahonermann

Group Reviewers

Restricted Project

Commits

rGb91073db6ac3: [clang][preprocessor] Fix unsigned-ness of utf8 char literals

Summary

UTF8 char literals are always unsigned.

Fixes https://github.com/llvm/llvm-project/issues/54886

Diff Detail

Unit Tests: Failed

	Time	Test
	6,160 ms	x64 debian > libFuzzer.libFuzzer::fork-ubsan.test

Event Timeline

tbaeder created this revision. May 5 2022, 3:13 AM

Herald added a project: Restricted Project. May 5 2022, 3:13 AM

tbaeder requested review of this revision. May 5 2022, 3:13 AM

Herald added a subscriber: cfe-commits.

Harbormaster completed remote builds in B162864: Diff 427251. May 5 2022, 3:44 AM

aaron.ballman added inline comments. May 5 2022, 4:01 AM

clang/test/Lexer/utf8-char-literal.cpp
16	I missed this one before. :-(
31	Uh oh.

tbaeder updated this revision to Diff 427267. May 5 2022, 4:12 AM

tbaeder marked 2 inline comments as done.

Presuming precommit CI comes back green, this LGTM! Can you also add a release note for the bug fix when landing, please?

This revision is now accepted and ready to land. May 5 2022, 4:46 AM

Harbormaster completed remote builds in B162877: Diff 427267. May 5 2022, 5:03 AM

tbaeder updated this revision to Diff 427283. May 5 2022, 5:13 AM

Harbormaster completed remote builds in B162890: Diff 427283. May 5 2022, 5:44 AM

tbaeder updated this revision to Diff 427289. May 5 2022, 5:47 AM

Harbormaster completed remote builds in B162896: Diff 427289. May 5 2022, 6:30 AM

I think changes are needed to make this behavior dependent on whether char8_t support is active or not.

clang/lib/Lex/PPExpressions.cpp
413	I think the check for UTF-8 should also be conditioned on `PP.getLangOpts().Char8`. When `char8_t` support is not enabled (as in C++17 or with `-fno-char8_t` in C++20), UTF-8 character literals still have type `char`.
	else if (!(Literal.isUTF8() && PP.getLangOpts().Char8) && !Literal.isUTF16() && !Literal.isUTF32())
clang/test/Lexer/utf8-char-literal.cpp
4	I think we should drop testing for `-std=c++1z` and add testing of `-std=c++17` and `-std=c++20`. Ideally, the test would then validate the differences in behavior.
31–34	Prior to C++20 (unless `-fchar8_t` is passed), `u8'\xff'` should have the same behavior as `'\xff'`.
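
The distinction drawn in these comments can also be checked at the type level. What follows is a minimal sketch and not part of the patch: `literalTypeIsUnsigned` is a hypothetical helper, and the static assertions only cover the type of the literal in ordinary C++ code, not the preprocessor evaluation that this revision changes. It assumes a compiler that defines `__cpp_char8_t` whenever `char8_t` support is enabled.

```cpp
#include <type_traits>

// Hypothetical helper (not from the patch): reports whether the type of its
// argument is an unsigned type.
template <typename T>
constexpr bool literalTypeIsUnsigned(T) {
  return std::is_unsigned<T>::value;
}

#if __cplusplus >= 201703L // u8 character literals exist from C++17 on
#  if defined(__cpp_char8_t)
// char8_t enabled (C++20 default, or -fchar8_t): u8'a' has type char8_t.
static_assert(std::is_same<decltype(u8'a'), char8_t>::value,
              "u8 character literal should have type char8_t");
static_assert(literalTypeIsUnsigned(u8'a'), "char8_t is unsigned");
#  else
// char8_t disabled (C++17 default, or -fno-char8_t): u8'a' has type char,
// so its signedness follows plain char on the target.
static_assert(std::is_same<decltype(u8'a'), char>::value,
              "u8 character literal should have type char");
static_assert(literalTypeIsUnsigned(u8'a') == std::is_unsigned<char>::value,
              "signedness should follow plain char");
#  endif
#endif
```

Compiling this with `-std=c++17`, `-std=c++17 -fchar8_t`, `-std=c++20`, and `-std=c++20 -fno-char8_t` should exercise both branches.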

This revision now requires changes to proceed. May 5 2022, 7:33 AM

tahonermann added inline comments. May 5 2022, 8:04 AM

clang/lib/Lex/PPExpressions.cpp
413	My C++ bias may be showing here; `LangOptions.Char8` may not be relevant for C, so this may require additional qualification.

In D124996#3493930, @tahonermann wrote:

I think changes are needed to make this behavior dependent on whether char8_t support is active or not.

Good catch Tom, I forgot we had options to control that! I agree.

tbaeder updated this revision to Diff 427614. May 6 2022, 6:10 AM

tbaeder marked 2 inline comments as done.

tbaeder added inline comments.

clang/test/Lexer/utf8-char-literal.cpp
34	I'm a little confused with the amount of combinations at this point, so please tell me if the emitted warning here looks wrong.
65	I know indenting the preprocessor directives here isn't according to coding style, but it helps a lot with readability.

aaron.ballman added inline comments. May 6 2022, 6:47 AM

clang/test/Lexer/utf8-char-literal.cpp
32–42	The equality operators seem backwards to what @tahonermann was saying -- I read his comment as:
	C++17/14/11: u8'\xff' == '\xff'
	C++17/14/11, -fchar8_t: u8'\xff' != '\xff'
	C++20 and up: u8'\xff' != '\xff'
	C++20 and up, -fno-char8_t: u8'\xff' == '\xff'
	Hopefully Tom can clarify if I misunderstood.
60
65	I'm fine with the formatting -- it helps readability, and we don't require our tests to be correctly formatted anyway.

Harbormaster completed remote builds in B163130: Diff 427614. May 6 2022, 6:47 AM

Thanks for your continued work on this, Tim! I think this is close. I did spot one issue and added a few other comments.

clang/lib/Lex/PPExpressions.cpp
416–417	Thanks for breaking the conditions out; that does make this simpler to understand. I don't think this is right yet though. In C++, if `PP.getLangOpts().Char8` is `false`, then signedness is determined by `PP.getLangOpts().CharIsSigned`. Perhaps this:
	else if (Literal.isUTF8()) {
	  if (PP.getLangOpts().CPlusPlus)
	    Val.setIsUnsigned(PP.getLangOpts().Char8 ? true : !PP.getLangOpts().CharIsSigned);
	  else
	    Val.setIsUnsigned(true);
	}
	The test case didn't catch this because `char` is always a signed type for the variations that are exercised. We could add a variant that includes `-funsigned-char`, and then modify the test based on the presence of `__CHAR_UNSIGNED__`, but that might get pretty awkward.
clang/test/Lexer/utf8-char-literal.cpp
4–5	Does the `-fchar8_t` option have any effect in C at present? GCC maintainers are currently not planning to acknowledge that option in C modes since WG14 did not want to add language dialect concerns for C. This is why N2653 doesn't have wording that includes a feature test macro. The GCC maintainers pushed back on the `_CHAR8_T_SOURCE` macro mentioned in the "Implementation Experience" section. I think Clang should follow suit: attempts to use `-fchar8_t` or `-fno-char8_t` in C modes should be diagnosed, which means that we don't have to exercise these options with C2x.
8–10	Rather than adding your own `CHAR8_T` and `NO_CHAR8_T` macros, you can use the predefined `__cpp_char8_t` feature test macro.
32–42	Yes, that looks right (as long as the target has a signed `char` type).
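
The signedness rule proposed above for `PPExpressions.cpp` can be sketched as a standalone function. This is only an illustration under stated assumptions: `LangOptsSketch` and `utf8CharLiteralIsUnsigned` are hypothetical names, and the struct stands in for the three `LangOptions` fields mentioned in the review rather than reproducing the real Clang class.

```cpp
// Hypothetical stand-in for the three clang::LangOptions fields named in the
// review; the real class has many more members.
struct LangOptsSketch {
  bool CPlusPlus;    // compiling as C++ rather than C
  bool Char8;        // char8_t support enabled
  bool CharIsSigned; // plain char is signed on this target
};

// Mirrors the branch structure suggested in the review comment: in C++ the
// result depends on char8_t support and then on plain char signedness; in C
// a u8 character literal is treated as unsigned.
inline bool utf8CharLiteralIsUnsigned(const LangOptsSketch &LO) {
  if (LO.CPlusPlus)
    return LO.Char8 ? true : !LO.CharIsSigned;
  return true;
}
```

Under this sketch, C++ without `char8_t` on a target with a signed `char` is the only combination that yields a signed value, which matches the expectations listed for the test.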

tbaeder updated this revision to Diff 427989. May 9 2022, 12:44 AM

tbaeder marked an inline comment as done.

tbaeder marked 4 inline comments as done.

tbaeder added inline comments.

clang/test/Lexer/utf8-char-literal.cpp
32–42	Are you listing the positive conditions (that should be true) or negative ones? The conditions in the test case need to be false in order for the test case to succeed.

Harbormaster completed remote builds in B163425: Diff 427989. May 9 2022, 1:08 AM

tahonermann added inline comments. May 9 2022, 9:52 AM

clang/test/Lexer/utf8-char-literal.cpp

6–8

The -DCHAR8_T and -DNO_CHAR8_T options can be removed now since the test is no longer dependent on them.

32–42

Yes, these list the expected behavior (e.g., the assertable behavior).

32–57

Something is still not right here. -std=c++17 should behave the same as -std=c++20 -fno-char8_t. Likewise, -std=c++17 -fchar8_t should behave the same as -std=c++20. Basically, if __cpp_char8_t is defined, then u8'\xff' should be unsigned and if that macro isn't defined, then the result should be signed (for an 8-bit signed char type). But the reversed conditions for the C++17 vs C++20 tests above conflict with that expectation. The code changes look right. Is the test actually passing like this?

Since the tests are now conditioned on __cpp_char8_t, I think they should be merged. I suggest:

// UTF-8 character literals are enabled in C++17 and later. If `-fchar8_t` is not enabled
// (as is the case in C++17), then UTF-8 character literals may produce signed or
// unsigned values depending on whether char is a signed type. If `-fchar8_t` is enabled
// (which is the default behavior for C++20), then UTF-8 character literals always
// produce unsigned values. The tests below depend on the target having a signed
// 8-bit char so that '\xff' produces a negative value.
#if __cplusplus >= 201703L
#  if !defined(__cpp_char8_t)
#    if !(u8'\xff' == '\xff')
#      error UTF-8 character value did not match ordinary character literal; this is unexpected
#    endif
#  else
#    if u8'\xff' == '\xff' // expected-warning {{right side of operator converted from negative value to unsigned}}
#      error UTF-8 character value matched ordinary character literal; this is unexpected
#    endif
#  endif
#endif

tbaeder updated this revision to Diff 428303. May 10 2022, 12:49 AM

tbaeder marked 4 inline comments as done.

Harbormaster completed remote builds in B163644: Diff 428303. May 10 2022, 3:23 AM

aaron.ballman added inline comments. May 10 2022, 5:55 AM

clang/test/Lexer/utf8-char-literal.cpp
4–5	> Does the -fchar8_t option have any effect in C at present?
	Yes, it does: https://godbolt.org/z/1co3YYYf8 (it sets `LangOpts.Char8`)

tahonermann added inline comments. May 10 2022, 9:04 AM

clang/test/Lexer/utf8-char-literal.cpp
4–5	Oh, that is very wrong. `char8_t` should be neither a keyword nor a type specifier in C modes; `char8_t` will be a typedef in C23. @tbaeder, if you are willing to take on fixing this as well, that would be most appreciated! This doesn't need to be fixed as part of this review though. I filed https://github.com/llvm/llvm-project/issues/55373 to track this issue.

tahonermann added inline comments. May 10 2022, 9:11 AM

clang/test/Lexer/utf8-char-literal.cpp
51–53	The C++ case looks good now, but the condition doesn't look right for the C case. The expectation is that `u8'\xff'` should not match `'\xff'` in C23 mode, but the test treats this as an error. If the test is passing, that indicates something is not being validated correctly. Shouldn't unexpected error diagnostics cause the test to fail?

tbaeder added inline comments. May 10 2022, 11:52 PM

clang/test/Lexer/utf8-char-literal.cpp

51–53

I used u8'\xff' != 0xff here because that's the condition you mentioned in the phab review adding the u8 prefix. Using u8'\xff' != '\xff' indeed fails with:

clang/test/Lexer/utf8-char-literal.cpp Line 55: u8 char literal is not unsigned
clang/test/Lexer/utf8-char-literal.cpp Line 54: right side of operator converted from negative value to unsigned: -1 to 18446744073709551615
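
The `-1 to 18446744073709551615` conversion in that warning can be reproduced with ordinary integer arithmetic. The sketch below is a hypothetical model and not Clang code: `ppEqualUnsignedVsSigned` is an invented name, and the model assumes a signed 8-bit `char` (so that `'\xff'` is -1) and a 64-bit `intmax_t`, as on the x86_64 target used by the test.

```cpp
#include <cstdint>

// Hypothetical model (not Clang code) of comparing an unsigned left operand
// with a signed right operand: the signed side is converted to the unsigned
// maximum-width type before the comparison.
constexpr bool ppEqualUnsignedVsSigned(std::uintmax_t lhs, std::intmax_t rhs) {
  return lhs == static_cast<std::uintmax_t>(rhs);
}

// '\xff' with a signed 8-bit char is -1; converting it wraps to the largest
// unsigned value, so it no longer equals 0xff.
static_assert(static_cast<std::uintmax_t>(std::intmax_t{-1}) == UINTMAX_MAX,
              "-1 wraps to the maximum unsigned value");
static_assert(!ppEqualUnsignedVsSigned(0xff, -1),
              "unsigned 0xff does not match converted -1");
static_assert(ppEqualUnsignedVsSigned(0xff, 0xff),
              "unsigned 0xff matches the integer literal 0xff");
```

This is why `u8'\xff' != 0xff` is false for an unsigned literal while `u8'\xff' != '\xff'` is true, so the second form triggers the `#error`.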

tahonermann added inline comments. May 11 2022, 7:25 AM

clang/test/Lexer/utf8-char-literal.cpp

51–53

Oh! I missed the use of 0xff vs '\xff'. Sorry if I misled you. In order to avoid such subtle differences in the test, can we use the same check as for C++?

// UTF-8 character literals are enabled in C23 and later and are always unsigned.
#if __STDC_VERSION__ >= 202000L
#  if u8'\xff' == '\xff' // expected-warning {{right side of operator converted from negative value to unsigned}}
#    error UTF-8 character value matched ordinary character literal; this is unexpected
#  endif
#endif

tbaeder updated this revision to Diff 428861. May 11 2022, 11:48 PM

tbaeder marked 2 inline comments as done.

Harbormaster completed remote builds in B164047: Diff 428861. May 12 2022, 1:49 AM

Looks good, @tbaeder! Thank you for sticking with me through all these iterations!

This revision is now accepted and ready to land. May 12 2022, 7:18 AM

LGTM as well, thank you for this!

This revision was landed with ongoing or failed builds. May 12 2022, 11:05 PM

Closed by commit rGb91073db6ac3: [clang][preprocessor] Fix unsigned-ness of utf8 char literals (authored by tbaeder).

This revision was automatically updated to reflect the committed changes.

tbaeder added a commit: rGb91073db6ac3: [clang][preprocessor] Fix unsigned-ness of utf8 char literals.

Revision Contents

	Path	Size
	clang/docs/ReleaseNotes.rst	2 lines
	clang/lib/Lex/PPExpressions.cpp	4 lines
	clang/test/Lexer/utf8-char-literal.cpp	9 lines

Diff 427283

clang/docs/ReleaseNotes.rst

  - Improved ``-O0`` code generation for calls to ``std::move``, ``std::forward``,
    ``std::move_if_noexcept``, ``std::addressof``, and ``std::as_const``. These
    are now treated as compiler builtins and implemented directly, rather than
    instantiating the definition from the standard library.
  - Fixed mangling of nested dependent names such as ``T::a::b``, where ``T`` is a
    template parameter, to conform to the Itanium C++ ABI and be compatible with
    GCC. This breaks binary compatibility with code compiled with earlier versions
    of clang; use the ``-fclang-abi-compat=14`` option to get the old mangling.
+ - Preprocessor character literals with a ``u8`` are now correctly treated as
+   unsigned character literals. This fixes `Issue 54886 <https://github.com/llvm/llvm-project/issues/54886>`_.

  C++20 Feature Support
  ^^^^^^^^^^^^^^^^^^^^^
  - Diagnose consteval and constexpr issues that happen at namespace scope. This
    partially addresses `Issue 51593 <https://github.com/llvm/llvm-project/issues/51593>`_.
  - No longer attempt to evaluate a consteval UDL function call at runtime when
    it is called through a template instantiation. This fixes
    `Issue 54578 <https://github.com/llvm/llvm-project/issues/54578>`_.

clang/lib/Lex/PPExpressions.cpp

  else if (Literal.isUTF32())
    NumBits = TI.getChar32Width();
  else // char or char8_t
    NumBits = TI.getCharWidth();

  // Set the width.
  llvm::APSInt Val(NumBits);
  // Set the value.
  Val = Literal.getValue();
- // Set the signedness. UTF-16 and UTF-32 are always unsigned
+ // Set the signedness. UTF-8, UTF-16 and UTF-32 are always unsigned
  if (Literal.isWide())
    Val.setIsUnsigned(!TargetInfo::isTypeSigned(TI.getWCharType()));
- else if (!Literal.isUTF16() && !Literal.isUTF32())
+ else if (!Literal.isUTF8() && !Literal.isUTF16() && !Literal.isUTF32())
    Val.setIsUnsigned(!PP.getLangOpts().CharIsSigned);

  if (Result.Val.getBitWidth() > Val.getBitWidth()) {
    Result.Val = Val.extend(Result.Val.getBitWidth());
  } else {
    assert(Result.Val.getBitWidth() == Val.getBitWidth() &&
           "intmax_t smaller than char/wchar_t?");
    Result.Val = Val;
  }

  // Consume the token.
  Result.setRange(PeekTok.getLocation());

clang/test/Lexer/utf8-char-literal.cpp

  // RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c++11 -fsyntax-only -verify %s
  // RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c11 -x c -fsyntax-only -verify %s
  // RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c2x -x c -fsyntax-only -verify %s
  // RUN: %clang_cc1 -triple x86_64-apple-darwin -std=c++1z -fsyntax-only -verify %s

  int array0[u'ñ' == u'\xf1'? 1 : -1];
  int array1['\xF1' != u'\xf1'? 1 : -1];
  int array1['ñ' != u'\xf1'? 1 : -1]; // expected-error {{character too large for enclosing character literal type}}
  #if __cplusplus > 201402L
  char a = u8'ñ'; // expected-error {{character too large for enclosing character literal type}}
  char b = u8'\x80'; // ok
  char c = u8'\u0080'; // expected-error {{character too large for enclosing character literal type}}
  char d = u8'\u1234'; // expected-error {{character too large for enclosing character literal type}}
  char e = u8'ሴ'; // expected-error {{character too large for enclosing character literal type}}
  char f = u8'ab'; // expected-error {{Unicode character literals may not contain multiple characters}}
- #elif __STDC_VERSION__ > 202000L
+ #elif __STDC_VERSION__ >= 202000L
  char a = u8'ñ'; // expected-error {{character too large for enclosing character literal type}}
  char b = u8'\x80'; // ok
  char c = u8'\u0080'; // expected-error {{universal character name refers to a control character}}
  char d = u8'\u1234'; // expected-error {{character too large for enclosing character literal type}}
  char e = u8'ሴ'; // expected-error {{character too large for enclosing character literal type}}
  _Static_assert(
    _Generic(u8'a',
             default : 0,
             unsigned char : 1),
    "Surprise!");
  #endif

+ /// Test u8 char literal preprocessor behavior
+ #if __cplusplus > 201402L || __STDC_VERSION__ >= 202000L
+ #if u8'\xff' != 0xff
+ #error u8 char literal is not unsigned
+ #endif
+ #endif

aaron.ballman (Done):

  /// Test u8 char literal preprocessor behavior
- #if __cplusplus > 201402L || __STDC_VERSION__ > 202000L
+ #if __cplusplus > 201402L || __STDC_VERSION__ >= 202000L
  #if u8'\xff' != 0xff

Uh oh.

aaron.ballman (Done):

  #endif
- /// In C2x, 8u char literals are always unsigned
+ /// In C2x, u8 char literals are always unsigned
  #if __STDC_VERSION__ >= 202000L
