This is an archive of the discontinued LLVM Phabricator instance.

Differential D155610

[Clang][Sema] Fix display of characters on static assertion failure
ClosedPublic

Authored by hazohelet on Jul 18 2023, 8:35 AM.

Details

Reviewers

aaron.ballman
tbaeder
cjdb
cor3ntin
tahonermann
hubert.reinterpretcast

Commits

rG2176c5e510e3: [Clang][Sema] Fix display of characters on static assertion failure

Summary

This patch fixes how characters on the LHS or RHS of an == expression are displayed in the notes for a static assertion failure.
It applies C-style escapes when the printed character is a special character, and it adds the numerical value next to the character representation.
It also tries to print multi-byte characters when the user-provided expression has a multi-byte character type.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hazohelet created this revision.Jul 18 2023, 8:35 AM

Herald added a project: Restricted Project. · View Herald TranscriptJul 18 2023, 8:35 AM

hazohelet requested review of this revision.Jul 18 2023, 8:35 AM

I wonder if we want to see whether the character is printable before deciding whether to show the numeric value for it or show a character value? For whitespace characters, printing the character leads to odd diagnostics: https://godbolt.org/z/d3TW6Ee8s and for non-character data (e.g., uses through uint8_t, unsigned char, or signed char) we probably never want to print as a character to begin with because there's no reason to assume the data is textual.

Adding in some other reviewers for a wider selection of opinions.

clang/test/Lexer/cxx1z-trigraphs.cpp
24	I think the original diagnostic was actually more understandable as it relates more closely to what's written in the static assertion. I could imagine something like `evaluates to '?' (63) == '#' (35)` would also be reasonable.

What if you switch from IgnoreParenImpCasts at the top of DiagnoseStaticAssertDetails() to just IgnoreParens()? That would improve cases where we simply compare a character literal to an integer, since the literal should be implicitly cast to an integer.

Harbormaster completed remote builds in B246246: Diff 541556.Jul 18 2023, 2:44 PM

I agree with Aaron that in the current state, the common case diagnostics are made worse. But there is room for improvement!

I think what we want to do here is modify ConvertAPValueToString so that it applies the same escape logic to char as we do in pushEscapedString (in Diagnostics.cpp)
It would make sense that signed/unsigned are treated as integers (in which case they should not be quoted), but char, wchar_t, charN_t should try really hard to present a printable character to the user, unless it's not possible to do so.

So the best way to progress that might be:

put pushEscapedString somewhere we can reuse
make char/char8_t use this function in ConvertAPValueToString - it might be slightly more tricky for wchar_t, you would have to try to convert to utf-8 first.

I think the discussion is getting derailed a bit. The original reason I was talking to Takuya about this is this: https://godbolt.org/z/GjsYrexT3

static_assert('a' == 100);

For code like this, we print it as ''a' == 100', but the AST contains a cast from the LHS char to an integer, and I think we shouldn't ignore that cast.

In D155610#4511547, @tbaeder wrote:

What if you switch from IgnoreParenImpCasts at the top of DiagnoseStaticAssertDetails() to just IgnoreParens()? That would improve cases where we simply compare a character literal to an integer, since the literal should be implicitly cast to an integer.

It results in printing everything including char and bool as integer.
I wanted to print true and false instead of 1 and 0, so ignored the implicit casts when the user-provided expression is bool type.

I think it would look nicer to print the character representation as well when the character is printable.

tahonermann added inline comments.Jul 19 2023, 9:27 AM

clang/test/Lexer/cxx1z-trigraphs.cpp
24	I agree. I would also be ok with printing the integer value as primary with the character as secondary: evaluates to 63 ('?') == 35 ('#')
There are two kinds of non-printable characters:
Control characters (including new-line)
character values that don't correspond to a character (e.g., lone trailing characters or invalid code unit values).
For the first case, I would support printing them as either C escapes or universal-character-names. e.g., evaluates to 0 ('\0') == 1 (\u0001)
For the second case, I would support printing them as C hex escapes. e.g., evaluates to -128 ('\x80') == -123 ('\x85')

I think Tom's suggestion about using escapes and UCNs is great. I have no real opinion on whether the numeric values is the one that's parenthesised.

cor3ntin added inline comments.Jul 19 2023, 11:05 AM

clang/test/Lexer/cxx1z-trigraphs.cpp
24	"For the first case, I would support printing them as either C escapes or universal-character-names." As mentioned before, we should be consistent with what we do for diagnostic messages in general (i.e., `pushEscapedString`). I checked and we already do that: https://godbolt.org/z/doah9YGMT The question is, why do we sometimes not? (Note that in general I don't have an opinion about displaying the value of a character literal _in addition_ to the character itself; it seems like a good thing.)

Thanks everyone for the comments!

clang/test/Lexer/cxx1z-trigraphs.cpp
24	The character escape in the current error message is handled in `CharacterLiteral::print` (https://github.com/llvm/llvm-project/blob/fcb6a9c07cf7a2bc63d364e3b7f60aaadadd57cc/clang/lib/AST/Expr.cpp#L1064-L1083), and the reason for `'\0'` being escaped to `\x00` and `'\u{9}'` escaped to `\t` is that `escapeCStyle` does not escape null character (https://github.com/llvm/llvm-project/blob/c1c86f9eae73786bcdacddaab248817c4f176935/clang/include/clang/Basic/CharInfo.h#L174-L200). `pushEscapedString` does not escape whitespace characters, so we should use `escapeCStyle` if we are to use c-style escape for them.

Address review comments

Print the character representation only when the type of the expressions is char or char8_t
Use pushEscapedString in the printing so that we can reuse its escaping logic
Use escapeCStyle to escape whitespace characters
wchar_t and charN_t are not handled yet

cor3ntin added inline comments.Jul 31 2023, 4:49 AM

clang/lib/Basic/Diagnostic.cpp
839–844	The addition of UCN output is probably not justified. We should be consistent in how we print the value of non-printable code points.
clang/lib/Sema/SemaDeclCXX.cpp
17156–17157	A different way to do that would be to have a 'Escape Whitespaces' parameter on pushEscapedString that would also escape \t, \n, etc (I don't think we want to escape SPACE)

Harbormaster completed remote builds in B249159: Diff 545582.Jul 31 2023, 5:59 AM

hazohelet added inline comments.Jul 31 2023, 10:38 AM

clang/lib/Basic/Diagnostic.cpp
839–844	My motivation here is to print a valid character literal. I think it justifies this change somewhat. I'd like to see what others think about this.
clang/lib/Sema/SemaDeclCXX.cpp
17156–17157	If you mean `\u0020` by SPACE, it won't be escaped by this code.

cor3ntin added inline comments.Aug 1 2023, 6:39 AM

clang/lib/Basic/Diagnostic.cpp
839–844	Why? There is no expectation that diagnostic messages reproduce C++. Consistency between string literals and character literals is a lot more important.

I've been thinking about it and I think I have a cleaner design for the printing of characters:

We need a CharToString(unsigned, Qualtype) -> SmallString method that takes a value and the type.
for char and char8_t we can just return the value.
For wchar_t, char32_t and char16_t, we can use something like ConvertCodePointToUTF8 to convert to UTF-8. If that fails we can escape with \x
If we pass the result of that to the diagnostic engine, escaping of non printing character would happen automatically.

That way we have a nice separation between converting an APValue to string and printing it, which should avoid code duplication quite a bit, and generally make the design nicer.
Then maybe we need to consider whether we want to modify CharacterLiteral::print to be aligned with all of that. I don't know if that's used, for example, for mangling.
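
A minimal sketch of the conversion step proposed above, written as a standalone helper rather than against Clang's own types. The names `charToStringSketch` and `encodeUtf8` are hypothetical, and the `isSingleByte` flag is an assumed stand-in for the `QualType` argument; an actual patch would presumably rely on LLVM utilities such as the `ConvertCodePointToUTF8` mentioned in the comment. Escaping of non-printable characters is intentionally not done here, since the proposal leaves that to the diagnostic engine.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of `cp` to `out`.
// Returns false for surrogates and for values above U+10FFFF.
bool encodeUtf8(std::uint32_t cp, std::string &out) {
  if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
    return false;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return true;
}

// Hypothetical sketch of CharToString: `isSingleByte` stands in for
// "the type is char or char8_t".
std::string charToStringSketch(std::uint32_t value, bool isSingleByte) {
  std::string result;
  if (isSingleByte) {
    // char / char8_t: return the code unit unchanged.
    result += static_cast<char>(value & 0xFF);
    return result;
  }
  // wchar_t / char16_t / char32_t: try to convert to UTF-8 first.
  if (encodeUtf8(value, result))
    return result;
  // Conversion failed: fall back to a \x escape of the raw value.
  char buf[16];
  std::snprintf(buf, sizeof buf, "\\x%X", static_cast<unsigned>(value));
  return buf;
}
```

With this sketch, `charToStringSketch(0x1F30D, false)` yields the four UTF-8 bytes of U+1F30D, while a lone surrogate such as `0xD800` falls back to the text `\xD800`.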

Given there are a bunch of different issues here, i would not mind separate PRs - having the numerical value showed in paren seems valuable on its own.

In D155610#4550346, @cor3ntin wrote:

I've been thinking about it and I think I have a cleaner design for the printing of characters:

We need a CharToString(unsigned, Qualtype) -> SmallString method that takes a value and the type.
for char and char8_t we can just return the value.
For wchar_t, char32_t and char16_t, we can use something like ConvertCodePointToUTF8 to convert to UTF-8. If that fails we can escape with \x
If we pass the result of that to the diagnostic engine, escaping of non printing character would happen automatically.

That way we have a nice separation between converting an APValue to string and printing it, which should avoid code duplication quite a bit, and generally make the design nicer.
Then maybe we need to consider whether we want to modify CharacterLiteral::print to be aligned with all of that. I don't know if that's used, for example, for mangling.

Given there are a bunch of different issues here, i would not mind separate PRs - having the numerical value showed in paren seems valuable on its own.

Thanks for the suggestion. Printing of multibyte character is a bit out of the scope of the original goal of this patch, but I'll give it a try.

clang/lib/Basic/Diagnostic.cpp
839–844	The diagnostic message here looks like `expression evaluates to 'VALUE1 == VALUE2'`. I tend to expect that `VALUE1 == VALUE2` is a syntactically valid expression because of the syntactical element `==`. But if others do not feel the same way, I am okay with something like `'<U+0001>' == '<U+0002>'`.

Address comments from Corentin

Use default pushEscapedString escaping (<U+0001>) instead of UCN representation \u0001
Convert multi-byte characters (wchar_t, char16_t, char32_t) to UTF-8 and print them.
Added CharToString utility function

This is starting to look pretty good!
I'm happy with the general direction, my only concern is that printing a prefix does not seem useful - we are trying to display the value, not how it was produced.

clang/lib/Sema/SemaDeclCXX.cpp
17066	We have similar switches in `StringLiteral::outputString`, `TryPrintAsStringLiteral` (APValue.cpp), and `CharacterLiteral::print`. Sadly they all look at different things, so I don't know if we could refactor all of that. But looking further down, I don't think we should print a prefix here, so we could remove that bit entirely.
17129	Looking at the diagnostics, I don't think it makes sense to print a prefix here. You could just leave that part out.

Harbormaster completed remote builds in B250010: Diff 546782.Aug 3 2023, 6:33 AM

Address comments from Corentin

Remove printing of character type prefix
Added code example in release note
Removed unnecessary static_cast

One concern from my side is that some unicode characters like U+FEFF (I added in test) are invisible, but it may not be a big concern because we also display integer representation in parens.

Harbormaster completed remote builds in B250048: Diff 546832.Aug 3 2023, 8:57 AM

In D155610#4557930, @hazohelet wrote:

One concern from my side is that some unicode characters like U+FEFF (I added in test) are invisible, but it may not be a big concern because we also display integer representation in parens.

This is not an easy problem to solve. pushEscapedString does not escape formatting code points, such as U+FEFF, because doing so breaks a bunch of scripts/emojis.
There are cases where it should probably be escaped, but that fully depends on context, and it would require grapheme clusterization, which is a lot of work for limited value.
I'd rather we don't change that for now.

cor3ntin added inline comments.Aug 3 2023, 9:01 AM

clang/docs/ReleaseNotes.rst
184–218	@aaron.ballman On one hand this is nice, on the other hand maybe too detailed. What do you think?

aaron.ballman added inline comments.Aug 3 2023, 9:08 AM

clang/docs/ReleaseNotes.rst
184–218	I'm happy with it -- better too much detail than too little, but this really helps users see what's been improved and why it matters. That said, I think `0x0A` and `0x1F30D` would arguably be better than printing the values in decimal. For `\n`, perhaps folks remember that it's decimal value 10, but nobody is going to know what `127757` means compared to the hex representation (esp because the value is specified in hex with the prefix printed in the error message). WDYT?

cor3ntin added inline comments.Aug 3 2023, 9:12 AM

clang/docs/ReleaseNotes.rst
184–218	For `wchar_t`, `charN_t` I think that makes sense. for `char`... hard to know, I think this is mostly useful for people who treat char as some kind of integer. I could go either way. using hex consistently seems reasonable

aaron.ballman added inline comments.Aug 3 2023, 9:40 AM

clang/docs/ReleaseNotes.rst
184–218	I don't insist on using hex, but I have a slight preference for using it consistently everywhere. CC @cjdb for more opinions since this relates to user experience of diagnostics.
clang/test/SemaCXX/static-assert.cpp
286–293

hazohelet added inline comments.Aug 4 2023, 4:55 AM

clang/docs/ReleaseNotes.rst
184–218	I generally agree that hex code would be better for characters. I think we still have some arguable points.
1. Should we print the unsigned code point or the (possibly signed) integer? (e.g. `0xFF` vs `-0x01` for `(char)-1`, on targets where `char` is signed)
2. Should we print the hex code when the other subexpression of the `==` expression is not a textual type? (e.g. `0x11` vs `17` for LHS of `(char)17 == 11`)
For 1, I think we should always print unsigned code point for all textual types for consistency. Also we don't want to print `-0x3` for `L'\xFFFD'` on targets where `wchar_t` is signed and 16-bit width (I haven't checked whether that target exists, though).
For 2, I want to see decimal (possibly signed) integer if the other side of the expression is not textual type. Displaying `expression evaluates to ''<FF>' (0xFF) == 255'` for the following code would be highly confusing.
static_assert((char)-1 == (unsigned char)-1);
WDYT?
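
To make the first point concrete, here is a small sketch of printing the unsigned code unit rather than the signed integer. It assumes the value arrives as a sign-extended integer together with the bit width of the character type; `formatCodeUnitHex` is a hypothetical name, not a function from the patch.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: render a (possibly negative) character value as the
// unsigned code unit of a type that is `bitWidth` bits wide.
std::string formatCodeUnitHex(std::int64_t value, unsigned bitWidth) {
  // Keep only the low `bitWidth` bits so that sign extension is discarded.
  std::uint64_t mask = bitWidth >= 64 ? ~std::uint64_t{0}
                                      : (std::uint64_t{1} << bitWidth) - 1;
  std::uint64_t codeUnit = static_cast<std::uint64_t>(value) & mask;
  char buf[24];
  // Print at least two hex digits, e.g. 0x0A.
  std::snprintf(buf, sizeof buf, "0x%02llX",
                static_cast<unsigned long long>(codeUnit));
  return buf;
}
```

Under these assumptions, `formatCodeUnitHex(-1, 8)` gives `0xFF` for `(char)-1`, and `formatCodeUnitHex(-3, 16)` gives `0xFFFD` for the signed 16-bit `wchar_t` case mentioned above.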

aaron.ballman added inline comments.Aug 4 2023, 6:04 AM

clang/docs/ReleaseNotes.rst
184–218	"Should we print the unsigned code point or the (possibly signed) integer? (e.g. 0xFF vs -0x01 for (char)-1, on targets where char is signed)"
Personally, I find -0x01 to be kind of weird and I slightly prefer 0xFF.
"Should we print the hex code when the other subexpression of the == expression is not a textual type? (e.g. 0x11 vs 17 for LHS of (char)17 == 11)"
I don't have a strong opinion on this because I think we can come up with arguments for either approach. My intuition is that we should just use hex values everywhere, but others may have a different opinion.

@abhina.sreeskantharajan, does this patch assume too much about the characters displayable for diagnostic output?

@hubert.reinterpretcast It does not. Unicode characters are only escaped in Diagnostics.cpp, and I think this is what we want.
Currently, LLVM assumes a UTF-8 terminal, except on Windows where we convert to UTF-16 and use the wide Windows APIs (raw_fd_ostream::write_impl).

If we want to extend that, i.e., support EBCDIC (I assume this is your question), we would probably want to modify pushEscapedString (Diagnostics.cpp) to consider a restricted set of characters as printable.
There are some questions about how we should do that: it could be a compile-time configuration, or we need a way to 1/ detect the encoding of the environment, in a way similar to P1885, and 2/ construct the set of printable characters on that platform.
Trying to encode all characters < U+00FF might be a reasonable way to build such a table.

Address comments from Aaron

Use hex code for integer representation of textual types
NFC stylistic changes

Harbormaster completed remote builds in B251631: Diff 548948.Aug 10 2023, 10:08 AM

In D155610#4575579, @cor3ntin wrote:

@hubert.reinterpretcast It does not, Unicode characters are only escaped in Diagnostics.cpp, and I think this is what we want.
Currently, llvm assume UTF-8 terminal, except on Windows where we convert to UTF-16 and use the wide windows APIs (raw_fd_ostream::write_impl).

I am skeptical of the extent to which that assumption is exercised in a problematic manner today. The characters being emitted (aside from the [U+0020, U+007E] fixed message text itself) generally come from the text of the source file, which is generally written using characters that the user can display (even if they are not "basic Latin" characters).

hubert.reinterpretcast added inline comments.Aug 11 2023, 4:28 PM

clang/lib/Sema/SemaDeclCXX.cpp
17129	Why is removing the prefix better? The types can matter (characters outside the basic character set are allowed to have negative `char` values). Also, moving forward, the value of a character need not be the same in the various encodings.

hubert.reinterpretcast added inline comments.Aug 11 2023, 4:49 PM

clang/lib/Sema/SemaDeclCXX.cpp
17119–17125	Add FIXME for `char` and `wchar_t` cases that this assumes Unicode literal encodings.
17131–17134	@aaron.ballman, hex output hides signedness. I think we want hex and decimal.
clang/test/SemaCXX/static-assert.cpp
287	The C++23 escaped string formatting facility would not generate a trailing combining character like this. I recommend following suit. Info on U+0335: https://util.unicode.org/UnicodeJsps/character.jsp?a=0335

hubert.reinterpretcast added inline comments.Aug 11 2023, 8:52 PM

clang/lib/Sema/SemaDeclCXX.cpp

17129

Some fun with signedness (imagine a more realistic example with ISO-8859-1 ordinary character encoding with a signed char type):

$ clang -Xclang -fwchar-type=short -xc++ -<<<$'static_assert(L"\\uFF10"[0] == U\'\\uFF10\');'
<stdin>:1:15: error: static assertion failed due to requirement 'L"\xFF10"[0] == U'\uff10''
    1 | static_assert(L"\uFF10"[0] == U'\uFF10');
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~
<stdin>:1:28: note: expression evaluates to ''０' (0xFF10) == '０' (0xFF10)'
    1 | static_assert(L"\uFF10"[0] == U'\uFF10');
      |               ~~~~~~~~~~~~~^~~~~~~~~~~~
1 error generated.
Return:  0x01:1   Fri Aug 11 23:49:02 2023 EDT

cor3ntin added inline comments.Aug 12 2023, 1:27 AM

clang/lib/Sema/SemaDeclCXX.cpp
17119–17125	If we wanted such fixme, it should be L1689.
17129	Either we care about the actual character (i.e., `'a'`) or its value (i.e., `42`). The motivation for the current patch is to add the value to the diagnostic message. I'm also concerned about mixing things that are and are not lexical elements in the diagnostics.
clang/test/SemaCXX/static-assert.cpp
287	This is way outside the scope of the patch. The diagnostic output facility has no understanding of combining characters or graphemes and does not attempt to match std::print. It probably would be an improvement, but this patch is not trying to modify how all diagnostics are printed (all of that logic is in Diagnostic.cpp).

hubert.reinterpretcast added inline comments.Aug 12 2023, 4:16 PM

clang/lib/Sema/SemaDeclCXX.cpp
17032	It does not seem that the first parameter expects a `CodePoint` argument in all cases. For `Char_S`, `Char_U`, and `Char8`, it seems the function wants to treat the input as a UTF-8 code unit. I suggest changing the argument to be clearly a code unit (and potentially treat it as a code point value as appropriate later in the function). Also: The function should probably be declared as having static linkage. Additionally: The function does not "convert" in the language semantic sense. `WriteCharacterValueDescriptionForDisplay` might be a better name.
17039	For types other than `Char_S`, `Char_U`, and `Char8`, this fails to treat the C1 Controls and Latin-1 Supplement characters as Unicode code points. It looks like test coverage for these cases is missing.
17119–17125	The `ConvertCharToString` has a first parameter called `CodePoint`. With that interface[^1], it is sensible to insert conversion from the applicable literal encoding to a Unicode code point value here (thus my request for a FIXME here). You are probably right that the FIXME belongs elsewhere. If you were thinking what I am thinking, then I am guessing you meant L16894? That is where the `ConvertCharToString` function seems to assume that a `wchar_t` value is directly a "code point value". To generate hex escapes, the function needs to be passed the original value (including for `char`s, e.g., to handle stray code units). Once the interface is updated (i.e., the parameter is renamed), the `ConvertCharToString` function would more clearly be the place to put one or more FIXMEs about encoding assumptions. [^1]: It turns out that the parameter is already not treated consistently as a code point value within the function (and by the caller) and the parameter is just badly named.
17129	Maybe the motivation for the current patch is to add the value, but what it does (for wide characters as defined in C) is to add the character (and obfuscate the value). Observe the status quo (https://godbolt.org/z/Wc6nKvTMn): note: expression evaluates to '-240 == 65296' From the output higher up (with this patch), we see two "identical" characters and values (due to lack of decimal value output). With decimal value output added, it will still be potentially confusing why the two identical characters have different values (without some sort of type annotation). I admit that the confusion arises in the status quo treatment of `signed char` and `unsigned char`. I hope I am using the word correctly when I say that it is ironic that the patch breaks in one context what it seeks to fix in another.
clang/test/SemaCXX/static-assert.cpp
287	This patch is pushing the envelope of what appears in diagnostics. One can also argue that someone writing static_assert(false, "\u0301"); gets what they deserve, but that case does not have a big problem anyway (because the provided message text appears after `:` ). This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).

hubert.reinterpretcast added a reviewer: hubert.reinterpretcast.Aug 12 2023, 4:16 PM

In D155610#4575579, @cor3ntin wrote:

@hubert.reinterpretcast It does not, Unicode characters are only escaped in Diagnostics.cpp, and I think this is what we want.

Thanks @cor3ntin for the insight. I agree that this is a separate concern that also applies to static assert messages.

cor3ntin added inline comments.Aug 14 2023, 5:56 AM

clang/lib/Sema/SemaDeclCXX.cpp
17032	Agreed, `CodeUnit` or `Value` would be more correct (mostly because of numeric escape sequences). But if we are going to change that then `WriteCharValueForDiagnostic` would be better, `Character` implies too much
17039	`escapeCStyle` is one of the things that assume ASCII / UTF, but yes, we might as well reduce to 0x7F just to avoid unnecessary work

I've discussed offline with @hubert.reinterpretcast and agree with him that with the addition of fexec-charset support, the set of characters deemed printable will not be accurate when other encodings are used. This will be similar to the printf/scanf format string validation issue I mentioned in my RFC and would require us to reverse the conversion or keep the original string around to check if the character is printable. I don't think we have finalized a solution on how to handle these issues yet.

In D155610#4586213, @abhina.sreeskantharajan wrote:

I've discussed offline with @hubert.reinterpretcast and agree with him that with the addition of fexec-charset support, the set of characters deemed printable will not be accurate when other encodings are used. This will be similar to the printf/scanf format string validation issue I mentioned in my RFC and would require us to reverse the conversion or keep the original string around to check if the character is printable. I don't think we have finalized a solution on how to handle these issues yet.

Furthermore, it is reasonable for the scope of the current patch to focus on producing UTF-8 (or UTF-16 for Windows) output to the "terminal".

hubert.reinterpretcast added inline comments.Aug 14 2023, 1:17 PM

clang/lib/Sema/SemaDeclCXX.cpp

17039

escapeCStyle is one of the things that assume ASCII / UTF, but yes, we might as well reduce to 0x7F just to avoid unnecessary work

I meant (with a signed char type to trigger the assertion):

<stdin>:1:28: note: expression evaluates to ''<A2>' (0xA2) == '<A2>' (0xA2)'
    1 | static_assert(u"\u00a2"[0] == '<A2>');
      |               ~~~~~~~~~~~~~^~~~~~~~~

should be:

<stdin>:1:28: note: expression evaluates to ''¢' (0xA2) == '<A2>' (0xA2)'
    1 | static_assert(u"\u00a2"[0] == '<A2>');
      |               ~~~~~~~~~~~~~^~~~~~~~~

Address some review comments

Renamed ConvertCharToString to WriteCharValueForDiagnostic
Made the function static
Fixed the printing for unicode 0x80 ~ 0xFF
Added decimal value next to the hex code

Harbormaster completed remote builds in B252696: Diff 550403.Aug 15 2023, 12:37 PM

hubert.reinterpretcast added inline comments.Aug 15 2023, 9:24 PM

clang/lib/Sema/SemaDeclCXX.cpp
17030–17031	Suggest wording tweaks.
17057–17058	Try using `StringRef`.
17060	Since the function interface has been clarified, this part actually doesn't need a FIXME. The FIXME should instead be added to the comment above the function declaration.

Address comments from Hubert

Bring back type prefix
NFC stylistic changes

Harbormaster completed remote builds in B252915: Diff 550707.Aug 16 2023, 7:32 AM

hubert.reinterpretcast added inline comments.Aug 16 2023, 12:56 PM

clang/lib/Sema/SemaDeclCXX.cpp
17080–17086	Minor nit: Braces no longer needed.

hazohelet updated this revision to Diff 552167.Aug 21 2023, 4:56 PM

hazohelet marked an inline comment as done.

Harbormaster completed remote builds in B253951: Diff 552167.Aug 21 2023, 5:23 PM

hubert.reinterpretcast added inline comments.Aug 28 2023, 12:46 PM

clang/test/SemaCXX/static-assert.cpp
287	@cor3ntin, do you have status quo examples for how grapheme-extending characters that are not already "problematic" in their original context are emitted in diagnostics in contexts where they are?

@cor3ntin Gentle ping

tahonermann added inline comments.Sep 6 2023, 1:43 PM

clang/test/SemaCXX/static-assert-cxx26.cpp
304	Is the expected note up to date? I don't see code that would generate the `<U+0001>` output. Am I just missing it? Since U+0001 is a valid, though non-printable, character, I would expect more `'\u0001'`.
clang/test/SemaCXX/static-assert.cpp
274–277	Here too, I find the `'<U+0000>'` presentation surprising; either of `'\0'` or `'\u0000'` would be preferred.

cor3ntin added inline comments.Sep 6 2023, 1:52 PM

clang/test/SemaCXX/static-assert-cxx26.cpp
304	See elsewhere in the discussion. This formatting is pre-existing and managed at the DiagnosticEngine level (pushEscapedString). The reason it's not `\u0001` is 1/ to avoid reusing C++ syntactic elements for something that comes from diagnostics and is not represented as an escape sequence in source, and 2/ `\u00011` is unreadable, and `\U000000001` is also not helpful :)

cor3ntin added inline comments.Sep 6 2023, 1:55 PM

clang/test/SemaCXX/static-assert.cpp
287	Are you looking for this sort of example? https://godbolt.org/z/c79xWr7Me That shows that clang has no understanding of graphemes.

@hazohelet I'm happy with the patch, I just need to make sure Hubert and I agree!

tahonermann added inline comments.Sep 6 2023, 2:05 PM

clang/test/SemaCXX/static-assert-cxx26.cpp
304	Thanks for the explanation. I'm not sure that I agree with the rationale for (1) though. We're already putting the value in single quotes and representing some values with escapes in many of these cases when the value isn't produced by an escape sequence (or even a character/string literal); why exclude `\uXXXX`? I agree with the rationale for (2); we could use `'\u{1}'` in that case.

cor3ntin added inline comments.Sep 6 2023, 3:00 PM

clang/test/SemaCXX/static-assert-cxx26.cpp
304	FYI, afaik the notation in clang predates the existence of \u{} by a few years, and follows Unicode notation (https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html). The oldest instance seems to be https://github.com/llvm/llvm-project/commit/77091b167fd959e1ee0c4dad4ec44de43b6c95db - I followed suit when reworking the generic escaping mechanism that all strings fed to diagnostics go through. I don't care about changing the syntax, but I do hope we are consistent. Ultimately what we are trying to do is to designate a Unicode code point, and whether we do it through C++ syntax or not probably does not matter much as long as it's clear, delimited and consistent!

tahonermann added inline comments.Sep 12 2023, 12:55 PM

clang/test/SemaCXX/static-assert-cxx26.cpp
304	I think the substitution of `<U+XXXX>` by the diagnostic engine itself is perfectly fine and good; particularly when it has no context to suggest a different presentation. In this particular case, where the character is being presented using C++ syntax as a character literal, I would prefer that C++ syntax be used consistently. From an implementation standpoint, I'm suggesting that `WriteCharValueForDiagnostic()` be modified such that, if `escapeCStyle<EscapeChar::Single>()` returns an empty string, that the character be presented in `'\u{XXXX}'` form if the character is one that would otherwise be substituted by the diagnostic engine (e.g., if `isPrintable()` is false). Note that this would be restricted to `char` values <= 0x7F; larger values could still be passed through as invalid code units that the diagnostic engine would then render as, e.g., `'<FC>'`.
clang/test/SemaCXX/static-assert.cpp
287	gcc and MSVC get that case "right" (probably by accident). https://godbolt.org/z/Tjd6xnEon
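
The `'\u{XXXX}'` fallback suggested above for `WriteCharValueForDiagnostic()` can be sketched as follows. This is only an illustration under stated assumptions: `cStyleEscapeSketch` is a hypothetical stand-in for `escapeCStyle<EscapeChar::Single>()`, the printable check is a plain ASCII range test rather than Clang's `isPrintable()`, and `writeCharSketch` is not the function from the patch.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for a C-style escape lookup: returns the short
// escape sequence for `c`, or an empty string if there is none.
std::string cStyleEscapeSketch(unsigned char c) {
  switch (c) {
  case '\\': return "\\\\";
  case '\'': return "\\'";
  case '\a': return "\\a";
  case '\b': return "\\b";
  case '\f': return "\\f";
  case '\n': return "\\n";
  case '\r': return "\\r";
  case '\t': return "\\t";
  case '\v': return "\\v";
  default:   return "";
  }
}

// Hypothetical sketch: quote a `char` value for display in a note.
std::string writeCharSketch(unsigned char c) {
  std::string body = cStyleEscapeSketch(c);
  if (body.empty()) {
    bool isPrintableAscii = c >= 0x20 && c <= 0x7E;
    if (c <= 0x7F && !isPrintableAscii) {
      // Non-printable ASCII without a short escape: use the \u{...} form.
      char buf[16];
      std::snprintf(buf, sizeof buf, "\\u{%X}", static_cast<unsigned>(c));
      body = buf;
    } else {
      // Printable ASCII, or a value above 0x7F that is passed through
      // unchanged for the diagnostic engine to render.
      body = std::string(1, static_cast<char>(c));
    }
  }
  return "'" + body + "'";
}
```

In this sketch a U+0001 value is rendered as `'\u{1}'` and a new-line as `'\n'`, while a byte such as 0xFC is emitted raw, leaving its presentation to whatever consumes the string.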

hubert.reinterpretcast added inline comments.Sep 13 2023, 7:08 PM

clang/lib/Sema/SemaDeclCXX.cpp
17131	To have the diagnostic printer handle separating any potential grapheme (if it is capable of doing so--potentially in the future), we need to isolate the result of `WriteCharValueForDiagnostic` in a separate message substitution.
clang/test/SemaCXX/static-assert.cpp
287	I was more looking for cases where the output grapheme includes elements that were part of the fixed message text (gobbling quotes, etc.). Also, this patch is in the wrong shape for handling this concern in the diagnostic printer because the delimiting of the replacement text happens in this patch.

cor3ntin added inline comments.Sep 16 2023, 4:08 AM

clang/test/SemaCXX/static-assert-cxx26.cpp
304	We should make a decision before asking the author for further changes, to avoid going in circles. As a user, `\u{XXXX}` vs `<U+XXXX>` makes no difference in terms of the amount of information I receive. I'm really not a fan of duplicating code, spreading the logic across multiple places, and having multiple ways to render an invalid `char`. But I'm also concerned about spending too much time on `char` literals in `static_assert`. It might not be a common enough use case to warrant that much scrutiny :) So I would be happy to go with _any_ direction.
clang/test/SemaCXX/static-assert.cpp
287	> Could you craft a message that becomes a grapheme after substitution by the engine? Maybe? You would have to try very hard and `static_assert` diagnostics are not of the right shape.

> This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).

I still don't see it. None of the output produced in this patch is even close to what could be problematic, i.e., this patch only ever produces ASCII or single codepoints that get escaped when they are not printable.

hubert.reinterpretcast added inline comments.Sep 18 2023, 10:17 AM

clang/test/SemaCXX/static-assert.cpp
287	> Could you craft a message that becomes a grapheme after substitution by the engine? Maybe? You would have to try very hard and `static_assert` diagnostics are not of the right shape.

That is what I meant by this patch introducing new situations.

> This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).

> I still don't see it. None of the output produced in this patch is even close to what could be problematic, i.e., this patch only ever produces ASCII or single codepoints that get escaped when they are not printable.

This patch produces single codepoints that are not escaped even when they may combine with a `'` delimiter. This patch also (currently) forms the string with the `'` and the potentially-combining character directly adjacent to each other. The least this patch should do is to emit the potentially-combining character to the diagnostic facility as a separate substitution (that is, if we can agree that the diagnostic facility should consider substitution boundaries as separate text elements; i.e., no graphemes should be formed partially by a substitution). Whether the diagnostic facility should use an escape sequence or a ZWNJ at the boundary can be a different discussion.

cor3ntin added inline comments.Sep 18 2023, 4:06 PM

clang/test/SemaCXX/static-assert.cpp
287	Sure, we could make it a separate message, although we added so much redundant information in the crafted message that it might be tricky. But now that I understand your concern: if the codepoint is a grapheme extender (so we are printing a `char16_t` or something with at least as many bytes), then whether it is rendered as a glyph or escaped would depend on whether it is passed directly to the diag engine, if clang were to start escaping printable lone combining characters (which it currently does not). Sure. At least now I get what you mean. I still don't see that as a reason to rework this patch yet again: there are too many ifs for it to be something users are likely to encounter, and it requires clang features that are just not there and that we have no plans to implement beyond "wouldn't it be nice if". Would you be happy with a FIXME in the code?

I had a chat with @hubert.reinterpretcast and @tahonermann.
We reached consensus on wanting to make sure the codepoint value is formatted in a future-proof way so that if we ever implement escaping of lone combining codepoint, this case would be handled as well.

To do that, we can:

* Expose the escaping mechanism (done by `pushEscapedString` in Diagnostic.cpp) as a new `EscapeStringForDiagnostic` function, and use that to escape the code point (line 16921).

I think that should be the last change we need here. Thanks for being patient with us!
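For readers without the LLVM sources at hand, the escaping behaviour being exposed can be approximated by the following standalone sketch. It is not the code from `Diagnostic.cpp`: the names `decodeUtf8` and `escapeForDiagnostic` are hypothetical, and the printability check is a loud simplification (only C0/C1 controls and DEL are treated as unprintable, whereas the real function consults LLVM's Unicode tables). It only illustrates the three output forms discussed in this review: text passed through, `<U+XXXX>` for unprintable code points, and `<XX>` for invalid code units.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>

// Decode one UTF-8 sequence at S[I]. Returns its length in bytes, or 0 if the
// bytes there do not form a well-formed sequence.
static std::size_t decodeUtf8(std::string_view S, std::size_t I,
                              std::uint32_t &CP) {
  const unsigned char B0 = static_cast<unsigned char>(S[I]);
  const std::size_t Len = B0 < 0x80           ? 1
                          : (B0 >> 5) == 0x06 ? 2
                          : (B0 >> 4) == 0x0E ? 3
                          : (B0 >> 3) == 0x1E ? 4
                                              : 0;
  if (Len == 0 || I + Len > S.size())
    return 0;
  CP = Len == 1 ? B0 : (B0 & (0xFFu >> (Len + 1)));
  for (std::size_t K = 1; K < Len; ++K) {
    const unsigned char B = static_cast<unsigned char>(S[I + K]);
    if ((B & 0xC0) != 0x80)
      return 0; // not a continuation byte
    CP = (CP << 6) | (B & 0x3F);
  }
  static const std::uint32_t MinForLen[] = {0, 0, 0x80, 0x800, 0x10000};
  const bool Overlong = CP < MinForLen[Len];
  const bool Surrogate = CP >= 0xD800 && CP <= 0xDFFF;
  return (Overlong || Surrogate || CP > 0x10FFFF) ? 0 : Len;
}

// Hypothetical stand-in for the escaping helper discussed above.
std::string escapeForDiagnostic(std::string_view S) {
  std::string Out;
  char Buf[16];
  for (std::size_t I = 0; I < S.size();) {
    const unsigned char B = static_cast<unsigned char>(S[I]);
    const bool AsciiPrintable = B >= 0x20 && B < 0x7F;
    const bool AsciiSpace =
        B == '\t' || B == '\n' || B == '\v' || B == '\f' || B == '\r';
    if (AsciiPrintable || AsciiSpace) { // ASCII case: pass through
      Out += static_cast<char>(B);
      ++I;
      continue;
    }
    std::uint32_t CP = 0;
    const std::size_t Len = decodeUtf8(S, I, CP);
    if (Len == 0) { // invalid code unit
      std::snprintf(Buf, sizeof Buf, "<%02X>", static_cast<unsigned>(B));
      Out += Buf;
      ++I;
      continue;
    }
    // ASSUMPTION: only C0/C1 controls and DEL are treated as unprintable here.
    const bool Unprintable = CP < 0x20 || (CP >= 0x7F && CP <= 0x9F);
    if (Unprintable) {
      std::snprintf(Buf, sizeof Buf, "<U+%04X>", static_cast<unsigned>(CP));
      Out += Buf;
    } else {
      Out.append(S.data() + I, Len);
    }
    I += Len;
  }
  return Out;
}
```

Routing the character through one such helper, rather than formatting it ad hoc, is what would let a later change to the escaping policy (for example, escaping lone combining code points) apply to this diagnostic as well.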

Thank you so much to everyone for guiding this patch onto the right track! I'll submit a GitHub PR after making the suggested changes.
Since Phabricator is going to be shut down, I am marking this differential as abandoned.

@hazohelet Please keep this patch on Phab. It's not going to be shut down. The current consensus is that we will reconsider shutting down Phab on November 15.
https://discourse.llvm.org/t/update-on-github-pull-requests/71540/125

In D155610#4652198, @cor3ntin wrote:

> @hazohelet Please keep this patch on Phab. It's not going to be shut down. The current consensus is that we will reconsider shutting down Phab on November 15.
> https://discourse.llvm.org/t/update-on-github-pull-requests/71540/125

Oh I failed to notice that post, thanks for letting me know.

Address comments from Corentin

Harbormaster completed remote builds in B257715: Diff 557524.Oct 1 2023, 11:18 PM

LGTM, thanks!

This revision is now accepted and ready to land.Oct 2 2023, 2:49 AM

Closed by commit rG2176c5e510e3: [Clang][Sema] Fix display of characters on static assertion failure (authored by hazohelet). · Explain WhyOct 3 2023, 10:11 PM

This revision was automatically updated to reflect the committed changes.

hazohelet added a commit: rG2176c5e510e3: [Clang][Sema] Fix display of characters on static assertion failure.

Revision Contents

Path                                          Size
clang/docs/ReleaseNotes.rst                   35 lines
clang/include/clang/Basic/Diagnostic.h        2 lines
clang/lib/Basic/Diagnostic.cpp                11 lines
clang/lib/Sema/SemaDeclCXX.cpp                107 lines
clang/test/Lexer/cxx1z-trigraphs.cpp          2 lines
clang/test/SemaCXX/static-assert-cxx26.cpp    9 lines
clang/test/SemaCXX/static-assert.cpp          26 lines

Diff 557581

clang/docs/ReleaseNotes.rst

Show First 20 Lines • Show All 175 Lines • ▼ Show 20 Lines	- When a non-variadic function is decorated with the ``format`` attribute,
automatic diagnostic to use parameters of types that the format style		automatic diagnostic to use parameters of types that the format style
supports but that are never the result of default argument promotion, such as		supports but that are never the result of default argument promotion, such as
``float``. (`#59824: <https://github.com/llvm/llvm-project/issues/59824>`_)		``float``. (`#59824: <https://github.com/llvm/llvm-project/issues/59824>`_)

Improvements to Clang's diagnostics		Improvements to Clang's diagnostics
-----------------------------------		-----------------------------------
- Clang constexpr evaluator now prints template arguments when displaying		- Clang constexpr evaluator now prints template arguments when displaying
template-specialization function calls.		template-specialization function calls.
- Clang contexpr evaluator now displays notes as well as an error when a constructor		- Clang contexpr evaluator now displays notes as well as an error when a constructor
of a base class is not called in the constructor of its derived class.		of a base class is not called in the constructor of its derived class.
- Clang no longer emits ``-Wmissing-variable-declarations`` for variables declared		- Clang no longer emits ``-Wmissing-variable-declarations`` for variables declared
with the ``register`` storage class.		with the ``register`` storage class.
- Clang's ``-Wtautological-negation-compare`` flag now diagnoses logical		- Clang's ``-Wtautological-negation-compare`` flag now diagnoses logical
tautologies like ``x && !x`` and ``!x || x`` in expressions. This also		tautologies like ``x && !x`` and ``!x || x`` in expressions. This also
makes ``-Winfinite-recursion`` diagnose more cases.		makes ``-Winfinite-recursion`` diagnose more cases.
(`#56035: <https://github.com/llvm/llvm-project/issues/56035>`_).		(`#56035: <https://github.com/llvm/llvm-project/issues/56035>`_).
- Clang constexpr evaluator now diagnoses compound assignment operators against		- Clang constexpr evaluator now diagnoses compound assignment operators against
uninitialized variables as a read of uninitialized object.		uninitialized variables as a read of uninitialized object.
(`#51536 <https://github.com/llvm/llvm-project/issues/51536>`_)		(`#51536 <https://github.com/llvm/llvm-project/issues/51536>`_)
- Clang's ``-Wformat-truncation`` now diagnoses ``snprintf`` call that is known to		- Clang's ``-Wformat-truncation`` now diagnoses ``snprintf`` call that is known to
result in string truncation.		result in string truncation.
(`#64871: <https://github.com/llvm/llvm-project/issues/64871>`_).		(`#64871: <https://github.com/llvm/llvm-project/issues/64871>`_).
Existing warnings that similarly warn about the overflow in ``sprintf``		Existing warnings that similarly warn about the overflow in ``sprintf``
now falls under its own warning group ```-Wformat-overflow`` so that it can		now falls under its own warning group ```-Wformat-overflow`` so that it can
be disabled separately from ``Wfortify-source``.		be disabled separately from ``Wfortify-source``.
These two new warning groups have subgroups ``-Wformat-truncation-non-kprintf``		These two new warning groups have subgroups ``-Wformat-truncation-non-kprintf``
and ``-Wformat-overflow-non-kprintf``, respectively. These subgroups are used when		and ``-Wformat-overflow-non-kprintf``, respectively. These subgroups are used when
the format string contains ``%p`` format specifier.		the format string contains ``%p`` format specifier.
Because Linux kernel's codebase has format extensions for ``%p``, kernel developers		Because Linux kernel's codebase has format extensions for ``%p``, kernel developers
are encouraged to disable these two subgroups by setting ``-Wno-format-truncation-non-kprintf``		are encouraged to disable these two subgroups by setting ``-Wno-format-truncation-non-kprintf``
and ``-Wno-format-overflow-non-kprintf`` in order to avoid false positives on		and ``-Wno-format-overflow-non-kprintf`` in order to avoid false positives on
the kernel codebase.		the kernel codebase.
Also clang no longer emits false positive warnings about the output length of		Also clang no longer emits false positive warnings about the output length of
``%g`` format specifier and about ``%o, %x, %X`` with ``#`` flag.		``%g`` format specifier and about ``%o, %x, %X`` with ``#`` flag.
- Clang now emits ``-Wcast-qual`` for functional-style cast expressions.		- Clang now emits ``-Wcast-qual`` for functional-style cast expressions.
- Clang no longer emits irrelevant notes about unsatisfied constraint expressions		- Clang no longer emits irrelevant notes about unsatisfied constraint expressions
on the left-hand side of ``||`` when the right-hand side constraint is satisfied.		on the left-hand side of ``||`` when the right-hand side constraint is satisfied.
(`#54678: <https://github.com/llvm/llvm-project/issues/54678>`_).		(`#54678: <https://github.com/llvm/llvm-project/issues/54678>`_).
- Clang now prints its 'note' diagnostic in cyan instead of black, to be more compatible		- Clang now prints its 'note' diagnostic in cyan instead of black, to be more compatible
with terminals with dark background colors. This is also more consistent with GCC.		with terminals with dark background colors. This is also more consistent with GCC.
- The fix-it emitted by ``-Wformat`` for scoped enumerations now take the		- The fix-it emitted by ``-Wformat`` for scoped enumerations now take the
enumeration's underlying type into account instead of suggesting a type just		enumeration's underlying type into account instead of suggesting a type just
based on the format string specifier being used.		based on the format string specifier being used.
		cor3ntin: @aaron.ballman On one hand this is nice, on the other hand maybe too detailed. What do you think?
		aaron.ballman: I'm happy with it -- better too much detail than too little, and this really helps users see what's been improved and why it matters. That said, I think `0x0A` and `0x1F30D` would arguably be better than printing the values in decimal. For `\n`, perhaps folks remember that it's decimal value 10, but nobody is going to know what `127757` means compared to the hex representation (esp. because the value is specified in hex with the prefix printed in the error message). WDYT?
		cor3ntin: For `wchar_t`, `charN_t` I think that makes sense. For `char`... hard to know; I think this is mostly useful for people who treat `char` as some kind of integer. I could go either way. Using hex consistently seems reasonable.
		aaron.ballman: I don't insist on using hex, but I have a slight preference for using it consistently everywhere. CC @cjdb for more opinions since this relates to user experience of diagnostics.
		hazohelet (author): I generally agree that hex code would be better for characters. I think we still have some arguable points.
		1. Should we print the unsigned code point or the (possibly signed) integer? (e.g. `0xFF` vs `-0x01` for `(char)-1`, on targets where `char` is signed)
		2. Should we print the hex code when the other subexpression of the `==` expression is not a textual type? (e.g. `0x11` vs `17` for LHS of `(char)17 == 11`)
		For 1, I think we should always print the unsigned code point for all textual types for consistency. Also, we don't want to print `-0x3` for `L'\xFFFD'` on targets where `wchar_t` is signed and 16 bits wide (I haven't checked whether such a target exists, though).
		For 2, I want to see a decimal (possibly signed) integer if the other side of the expression is not a textual type. Displaying `expression evaluates to ''<FF>' (0xFF) == 255'` for the following code would be highly confusing:
		    static_assert((char)-1 == (unsigned char)-1);
		WDYT?
		aaron.ballman:
		> Should we print the unsigned code point or the (possibly signed) integer? (e.g. `0xFF` vs `-0x01` for `(char)-1`, on targets where `char` is signed)
		Personally, I find `-0x01` to be kind of weird and I slightly prefer `0xFF`.
		> Should we print the hex code when the other subexpression of the `==` expression is not a textual type? (e.g. `0x11` vs `17` for LHS of `(char)17 == 11`)
		I don't have a strong opinion on this because I think we can come up with arguments for either approach. My intuition is that we should just use hex values everywhere, but others may have a different opinion.
- Clang now displays an improved diagnostic and a note when a defaulted special		- Clang now displays an improved diagnostic and a note when a defaulted special
member is marked ``constexpr`` in a class with a virtual base class		member is marked ``constexpr`` in a class with a virtual base class
(`#64843: <https://github.com/llvm/llvm-project/issues/64843>`_).		(`#64843: <https://github.com/llvm/llvm-project/issues/64843>`_).
- ``-Wfixed-enum-extension`` and ``-Wmicrosoft-fixed-enum`` diagnostics are no longer		- ``-Wfixed-enum-extension`` and ``-Wmicrosoft-fixed-enum`` diagnostics are no longer
emitted when building as C23, since C23 standardizes support for enums with a		emitted when building as C23, since C23 standardizes support for enums with a
fixed underlying type.		fixed underlying type.
		- When describing the failure of static assertion of `==` expression, clang prints the integer
		representation of the value as well as its character representation when
		the user-provided expression is of character type. If the character is
		non-printable, clang now shows the escpaed character.
		Clang also prints multi-byte characters if the user-provided expression
		is of multi-byte character type.

		Example Code:

		.. code-block:: c++

		static_assert("A\n"[1] == U'🌍');

		BEFORE:

		.. code-block:: text

		source:1:15: error: static assertion failed due to requirement '"A\n"[1] == U'\U0001f30d''
		1 | static_assert("A\n"[1] == U'🌍');
		| ^~~~~~~~~~~~~~~~~
		source:1:24: note: expression evaluates to ''
		' == 127757'
		1 | static_assert("A\n"[1] == U'🌍');
		| ~~~~~~~~~^~~~~~~~

		AFTER:

		.. code-block:: text

		source:1:15: error: static assertion failed due to requirement '"A\n"[1] == U'\U0001f30d''
		1 | static_assert("A\n"[1] == U'🌍');
		| ^~~~~~~~~~~~~~~~~
		source:1:24: note: expression evaluates to ''\n' (0x0A, 10) == U'🌍' (0x1F30D, 127757)'
		1 | static_assert("A\n"[1] == U'🌍');
		| ~~~~~~~~~^~~~~~~~

Bug Fixes in This Version		Bug Fixes in This Version
-------------------------		-------------------------
- Fixed an issue where a class template specialization whose declaration is		- Fixed an issue where a class template specialization whose declaration is
instantiated in one module and whose definition is instantiated in another		instantiated in one module and whose definition is instantiated in another
module may end up with members associated with the wrong declaration of the		module may end up with members associated with the wrong declaration of the
class, which can result in miscompiles in some cases.		class, which can result in miscompiles in some cases.
- Fix crash on use of a variadic overloaded operator.		- Fix crash on use of a variadic overloaded operator.
▲ Show 20 Lines • Show All 357 Lines • Show Last 20 Lines
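The operand format shown in the AFTER example of the release note above, that is, a quoted and possibly escaped character followed by its hex and decimal values, can be reproduced with a small standalone sketch. This is an illustration only and not the code in `SemaDeclCXX.cpp`: `describeCharValue` and `appendUtf8` are hypothetical names, the value is assumed to be already zero-extended and a valid code point, and only a handful of C-style escapes are covered.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: append the UTF-8 encoding of a valid code point.
static void appendUtf8(std::string &Out, std::uint32_t CP) {
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

// Hypothetical helper: build "<prefix>'<char>' (0xHEX, DEC)" for one operand.
// Value is assumed to be zero-extended; Prefix is "", "u8", "u", "U" or "L".
std::string describeCharValue(std::uint32_t Value, const char *Prefix) {
  std::string Out = Prefix;
  Out += '\'';
  switch (Value) {
  // A few C-style escapes; other control characters are emitted raw here,
  // which the real patch avoids by escaping them.
  case '\n': Out += "\\n"; break;
  case '\t': Out += "\\t"; break;
  case '\r': Out += "\\r"; break;
  case '\\': Out += "\\\\"; break;
  case '\'': Out += "\\'"; break;
  default:   appendUtf8(Out, Value); break;
  }
  Out += '\'';
  char Buf[48];
  std::snprintf(Buf, sizeof Buf, " (0x%02X, %u)",
                static_cast<unsigned>(Value), static_cast<unsigned>(Value));
  Out += Buf;
  return Out;
}
```

Printing hex and decimal side by side follows the outcome visible in the AFTER text; the inline discussion above shows that hex-only and decimal-only variants were both considered.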

clang/include/clang/Basic/Diagnostic.h

	Show First 20 Lines • Show All 1,834 Lines • ▼ Show 20 Lines
	/// attribute. The character itself will be not be printed.			/// attribute. The character itself will be not be printed.
	const char ToggleHighlight = 127;			const char ToggleHighlight = 127;

	/// ProcessWarningOptions - Initialize the diagnostic client and process the			/// ProcessWarningOptions - Initialize the diagnostic client and process the
	/// warning options specified on the command line.			/// warning options specified on the command line.
	void ProcessWarningOptions(DiagnosticsEngine &Diags,			void ProcessWarningOptions(DiagnosticsEngine &Diags,
	const DiagnosticOptions &Opts,			const DiagnosticOptions &Opts,
	bool ReportDiags = true);			bool ReportDiags = true);
				void EscapeStringForDiagnostic(StringRef Str, SmallVectorImpl<char> &OutStr);
	} // namespace clang			} // namespace clang

	#endif // LLVM_CLANG_BASIC_DIAGNOSTIC_H			#endif // LLVM_CLANG_BASIC_DIAGNOSTIC_H

clang/lib/Basic/Diagnostic.cpp

Show First 20 Lines • Show All 794 Lines • ▼ Show 20 Lines	FormatDiagnostic(SmallVectorImpl<char> &OutStr) const {
}		}

StringRef Diag =		StringRef Diag =
getDiags()->getDiagnosticIDs()->getDescription(getID());		getDiags()->getDiagnosticIDs()->getDescription(getID());

FormatDiagnostic(Diag.begin(), Diag.end(), OutStr);		FormatDiagnostic(Diag.begin(), Diag.end(), OutStr);
}		}

/// pushEscapedString - Append Str to the diagnostic buffer,		/// EscapeStringForDiagnostic - Append Str to the diagnostic buffer,
/// escaping non-printable characters and ill-formed code unit sequences.		/// escaping non-printable characters and ill-formed code unit sequences.
static void pushEscapedString(StringRef Str, SmallVectorImpl<char> &OutStr) {		void clang::EscapeStringForDiagnostic(StringRef Str,
		SmallVectorImpl<char> &OutStr) {
OutStr.reserve(OutStr.size() + Str.size());		OutStr.reserve(OutStr.size() + Str.size());
auto *Begin = reinterpret_cast<const unsigned char *>(Str.data());		auto *Begin = reinterpret_cast<const unsigned char *>(Str.data());
llvm::raw_svector_ostream OutStream(OutStr);		llvm::raw_svector_ostream OutStream(OutStr);
const unsigned char *End = Begin + Str.size();		const unsigned char *End = Begin + Str.size();
while (Begin != End) {		while (Begin != End) {
// ASCII case		// ASCII case
if (isPrintable(*Begin) || isWhitespace(*Begin)) {		if (isPrintable(*Begin) || isWhitespace(*Begin)) {
OutStream << *Begin;		OutStream << *Begin;
Show All 16 Lines	if (llvm::isLegalUTF8Sequence(Begin, End)) {
"we must be further along in the string now");		"we must be further along in the string now");
if (llvm::sys::unicode::isPrintable(CodepointValue) ||		if (llvm::sys::unicode::isPrintable(CodepointValue) ||
llvm::sys::unicode::isFormatting(CodepointValue)) {		llvm::sys::unicode::isFormatting(CodepointValue)) {
OutStr.append(CodepointBegin, CodepointEnd);		OutStr.append(CodepointBegin, CodepointEnd);
continue;		continue;
}		}
// Unprintable code point.		// Unprintable code point.
OutStream << "<U+" << llvm::format_hex_no_prefix(CodepointValue, 4, true)		OutStream << "<U+" << llvm::format_hex_no_prefix(CodepointValue, 4, true)
<< ">";		<< ">";
continue;		continue;
}		}
// Invalid code unit.		// Invalid code unit.
OutStream << "<" << llvm::format_hex_no_prefix(*Begin, 2, true) << ">";		OutStream << "<" << llvm::format_hex_no_prefix(*Begin, 2, true) << ">";
++Begin;		++Begin;
		cor3ntin: Adding the use of UCNs is probably not justified. We should be consistent in how we print the value of non-printable code points.
		hazohelet (author): My motivation here is to print a valid character literal. I think it justifies this change somewhat. I'd like to see what others think about this.
		cor3ntin: Why? There is no expectation that diagnostic messages reproduce C++. Consistency between string literals and character literals is a lot more important.
		hazohelet (author): The diagnostic message here looks like `expression evaluates to 'VALUE1 == VALUE2'`. I tend to expect that `VALUE1 == VALUE2` is a syntactically valid expression because of the syntactical element `==`. But if others do not feel the same way, I am okay with something like `'<U+0001>' == '<U+0002>'`.
}		}
}		}

void Diagnostic::		void Diagnostic::
FormatDiagnostic(const char *DiagStr, const char *DiagEnd,		FormatDiagnostic(const char *DiagStr, const char *DiagEnd,
SmallVectorImpl<char> &OutStr) const {		SmallVectorImpl<char> &OutStr) const {
// When the diagnostic string is only "%0", the entire string is being given		// When the diagnostic string is only "%0", the entire string is being given
// by an outside source. Remove unprintable characters from this string		// by an outside source. Remove unprintable characters from this string
// and skip all the other string processing.		// and skip all the other string processing.
if (DiagEnd - DiagStr == 2 &&		if (DiagEnd - DiagStr == 2 &&
StringRef(DiagStr, DiagEnd - DiagStr).equals("%0") &&		StringRef(DiagStr, DiagEnd - DiagStr).equals("%0") &&
getArgKind(0) == DiagnosticsEngine::ak_std_string) {		getArgKind(0) == DiagnosticsEngine::ak_std_string) {
const std::string &S = getArgStdStr(0);		const std::string &S = getArgStdStr(0);
pushEscapedString(S, OutStr);		EscapeStringForDiagnostic(S, OutStr);
return;		return;
}		}

/// FormattedArgs - Keep track of all of the arguments formatted by		/// FormattedArgs - Keep track of all of the arguments formatted by
/// ConvertArgToString and pass them into subsequent calls to		/// ConvertArgToString and pass them into subsequent calls to
/// ConvertArgToString, allowing the implementation to avoid redundancies in		/// ConvertArgToString, allowing the implementation to avoid redundancies in
/// obvious cases.		/// obvious cases.
SmallVector<DiagnosticsEngine::ArgumentValue, 8> FormattedArgs;		SmallVector<DiagnosticsEngine::ArgumentValue, 8> FormattedArgs;
▲ Show 20 Lines • Show All 90 Lines • ▼ Show 20 Lines	if (ModifierIs(Modifier, ModifierLen, "diff")) {
}		}
}		}

switch (Kind) {		switch (Kind) {
// ---- STRINGS ----		// ---- STRINGS ----
case DiagnosticsEngine::ak_std_string: {		case DiagnosticsEngine::ak_std_string: {
const std::string &S = getArgStdStr(ArgNo);		const std::string &S = getArgStdStr(ArgNo);
assert(ModifierLen == 0 && "No modifiers for strings yet");		assert(ModifierLen == 0 && "No modifiers for strings yet");
pushEscapedString(S, OutStr);		EscapeStringForDiagnostic(S, OutStr);
break;		break;
}		}
case DiagnosticsEngine::ak_c_string: {		case DiagnosticsEngine::ak_c_string: {
const char *S = getArgCStr(ArgNo);		const char *S = getArgCStr(ArgNo);
assert(ModifierLen == 0 && "No modifiers for strings yet");		assert(ModifierLen == 0 && "No modifiers for strings yet");

// Don't crash if get passed a null pointer by accident.		// Don't crash if get passed a null pointer by accident.
if (!S)		if (!S)
S = "(null)";		S = "(null)";
pushEscapedString(S, OutStr);		EscapeStringForDiagnostic(S, OutStr);
break;		break;
}		}
// ---- INTEGERS ----		// ---- INTEGERS ----
case DiagnosticsEngine::ak_sint: {		case DiagnosticsEngine::ak_sint: {
int64_t Val = getArgSInt(ArgNo);		int64_t Val = getArgSInt(ArgNo);

if (ModifierIs(Modifier, ModifierLen, "select")) {		if (ModifierIs(Modifier, ModifierLen, "select")) {
HandleSelectModifier(*this, (unsigned)Val, Argument, ArgumentLen,		HandleSelectModifier(*this, (unsigned)Val, Argument, ArgumentLen,
▲ Show 20 Lines • Show All 244 Lines • Show Last 20 Lines

clang/lib/Sema/SemaDeclCXX.cpp


Show First 20 Lines • Show All 43 Lines • ▼ Show 20 Lines

#include "clang/Sema/ScopeInfo.h" #include "clang/Sema/ScopeInfo.h"

#include "clang/Sema/SemaInternal.h" #include "clang/Sema/SemaInternal.h"

#include "clang/Sema/Template.h" #include "clang/Sema/Template.h"

#include "llvm/ADT/ArrayRef.h" #include "llvm/ADT/ArrayRef.h"

#include "llvm/ADT/STLExtras.h" #include "llvm/ADT/STLExtras.h"

#include "llvm/ADT/ScopeExit.h" #include "llvm/ADT/ScopeExit.h"

#include "llvm/ADT/SmallString.h" #include "llvm/ADT/SmallString.h"

#include "llvm/ADT/StringExtras.h" #include "llvm/ADT/StringExtras.h"

#include "llvm/Support/ConvertUTF.h"

#include "llvm/Support/SaveAndRestore.h" #include "llvm/Support/SaveAndRestore.h"

#include <map> #include <map>

#include <optional> #include <optional>

#include <set> #include <set>

using namespace clang; using namespace clang;

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

▲ Show 20 Lines • Show All 16,961 Lines • ▼ Show 20 Lines Decl *Sema::ActOnStaticAssertDeclaration(SourceLocation StaticAssertLoc,

SourceLocation RParenLoc) { SourceLocation RParenLoc) {

if (DiagnoseUnexpandedParameterPack(AssertExpr, UPPC_StaticAssertExpression)) if (DiagnoseUnexpandedParameterPack(AssertExpr, UPPC_StaticAssertExpression))

return nullptr; return nullptr;

return BuildStaticAssertDeclaration(StaticAssertLoc, AssertExpr, return BuildStaticAssertDeclaration(StaticAssertLoc, AssertExpr,

AssertMessageExpr, RParenLoc, false); AssertMessageExpr, RParenLoc, false);

} }

static void WriteCharTypePrefix(BuiltinType::Kind BTK, llvm::raw_ostream &OS) {

switch (BTK) {

hubert.reinterpretcast:

AssertMessageExpr, RParenLoc, false);

}

- /// Convert character's code unit value to a string.

- /// The code point needs to be zero-extended to 32-bits.

+ /// Convert character's value, interpreted as a code unit, to a string.

+ /// The value needs to be zero-extended to 32-bits.

static void WriteCharValueForDiagnostic(uint32_t Value, const BuiltinType *BTy,

Suggest wording tweaks.

case BuiltinType::Char_S:

hubert.reinterpretcast:

It does not seem that the first parameter expects a `CodePoint` argument in all cases. For `Char_S`, `Char_U`, and `Char8`, it seems the function wants to treat the input as a UTF-8 code unit.

I suggest changing the argument to be clearly a code unit (and potentially treat it as a code point value as appropriate later in the function).

Also: The function should probably be declared as having static linkage.
Additionally: The function does not "convert" in the language semantic sense. `WriteCharacterValueDescriptionForDisplay` might be a better name.

cor3ntin:

Agreed, `CodeUnit` or `Value` would be more correct (mostly because of numeric escape sequences).
But if we are going to change that, then `WriteCharValueForDiagnostic` would be better; `Character` implies too much.

case BuiltinType::Char_U:

break;

case BuiltinType::Char8:

OS << "u8";

break;

case BuiltinType::Char16:

OS << 'u';

hubert.reinterpretcast:

For types other than `Char_S`, `Char_U`, and `Char8`, this fails to treat the C1 Controls and Latin-1 Supplement characters as Unicode code points. It looks like test coverage for these cases is missing.

cor3ntin:

`escapeCStyle` is one of the things that assume ASCII / UTF, but yes, we might as well reduce to 0x7F just to avoid unnecessary work.

hubert.reinterpretcast:

> `escapeCStyle` is one of the things that assume ASCII / UTF, but yes, we might as well reduce to 0x7F just to avoid unnecessary work

I meant (with a signed char type to trigger the assertion):

<stdin>:1:28: note: expression evaluates to ''<A2>' (0xA2) == '<A2>' (0xA2)'
    1 | static_assert(u"\u00a2"[0] == '<A2>');
      |               ~~~~~~~~~~~~~^~~~~~~~~

should be:

<stdin>:1:28: note: expression evaluates to ''¢' (0xA2) == '<A2>' (0xA2)'
    1 | static_assert(u"\u00a2"[0] == '<A2>');
      |               ~~~~~~~~~~~~~^~~~~~~~~
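The distinction in this comment can be sketched as follows: a value above 0x7F in a narrow `char` is a lone UTF-8 code unit, which cannot be displayed as text, whereas the same value in a `char16_t`, `char32_t`, or `wchar_t` is a code point that can be encoded and shown. The sketch below is a hypothetical standalone illustration (`CharKind` and `renderForNote` are made-up names, C-style escapes are omitted, and `Value` is assumed to be a valid code point); the `<A2>` form mirrors what the diagnostic engine prints for an invalid code unit.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical classification: Narrow covers char and char8_t, Wide covers
// char16_t, char32_t and wchar_t.
enum class CharKind { Narrow, Wide };

// Hypothetical helper: text shown between the quotes in the note.
std::string renderForNote(std::uint32_t Value, CharKind Kind) {
  if (Value <= 0x7F) // ASCII looks the same for every character type
    return std::string(1, static_cast<char>(Value));

  if (Kind == CharKind::Narrow) {
    // A single byte above 0x7F is not valid UTF-8 on its own.
    char Buf[8];
    std::snprintf(Buf, sizeof Buf, "<%02X>",
                  static_cast<unsigned>(Value & 0xFF));
    return Buf;
  }

  // Wide types: treat Value as a code point and encode it as UTF-8.
  std::string Out;
  if (Value < 0x800) {
    Out += static_cast<char>(0xC0 | (Value >> 6));
  } else if (Value < 0x10000) {
    Out += static_cast<char>(0xE0 | (Value >> 12));
    Out += static_cast<char>(0x80 | ((Value >> 6) & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (Value >> 18));
    Out += static_cast<char>(0x80 | ((Value >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((Value >> 6) & 0x3F));
  }
  Out += static_cast<char>(0x80 | (Value & 0x3F));
  return Out;
}
```

With this split, `renderForNote(0xA2, CharKind::Wide)` yields the two bytes of `¢`, while the narrow case yields `<A2>`, matching the expected output quoted above.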


break;

case BuiltinType::Char32:

OS << 'U';

break;

case BuiltinType::WChar_S:

case BuiltinType::WChar_U:

OS << 'L';

break;

default:

llvm_unreachable("Non-character type");

}

/// Convert character's value, interpreted as a code unit, to a string.

/// The value needs to be zero-extended to 32-bits.

/// FIXME: This assumes Unicode literal encodings

static void WriteCharValueForDiagnostic(uint32_t Value, const BuiltinType *BTy,

unsigned TyWidth,

SmallVectorImpl<char> &Str) {

hubert.reinterpretcast:

if (llvm::ConvertCodePointToUTF8(Value, Ptr)) {

- for (char *I = Arr; I != Ptr; ++I)

- OS << *I;

+ OS << StringRef(Arr, Ptr - Arr);

} else {

Try using `StringRef`.

char Arr[UNI_MAX_UTF8_BYTES_PER_CODE_POINT];

char *Ptr = Arr;

hubert.reinterpretcast:

Since the function interface has been clarified, this part actually doesn't need a FIXME. The FIXME should instead be added to the comment above the function declaration.

BuiltinType::Kind K = BTy->getKind();

llvm::raw_svector_ostream OS(Str);

// This should catch Char_S, Char_U, Char8, and use of escaped characters in

// other types.

if (K == BuiltinType::Char_S || K == BuiltinType::Char_U ||

cor3ntin:

We have similar switches in

* `StringLiteral::outputString`
* `TryPrintAsStringLiteral` (APValue.cpp)
* `CharacterLiteral::print`

Sadly they all look at different things, so I don't know if we could refactor all of that.
But looking further down, I don't think we should print a prefix here, so we could remove that bit entirely.

K == BuiltinType::Char8 || Value <= 0x7F) {

StringRef Escaped = escapeCStyle<EscapeChar::Single>(Value);

if (!Escaped.empty())

EscapeStringForDiagnostic(Escaped, Str);

else

OS << static_cast<char>(Value);

return;

}

switch (K) {

case BuiltinType::Char16:

case BuiltinType::Char32:

case BuiltinType::WChar_S:

case BuiltinType::WChar_U: {

if (llvm::ConvertCodePointToUTF8(Value, Ptr))

EscapeStringForDiagnostic(StringRef(Arr, Ptr - Arr), Str);

else

OS << "\\x"

<< llvm::format_hex_no_prefix(Value, TyWidth / 4, /*Upper=*/true);

break;

hubert.reinterpretcastUnsubmitted

Done

case BuiltinType::WChar_U: {

- if (llvm::ConvertCodePointToUTF8(Value, Ptr)) {

+ if (llvm::ConvertCodePointToUTF8(Value, Ptr))

OS << StringRef(Arr, Ptr - Arr);

- } else {

+ else

OS << "\\x"

- << llvm::format_hex_no_prefix(Value, TyWidth / 4,

- /*Upper=*/true);

- }

+ << llvm::format_hex_no_prefix(Value, TyWidth / 4, /*Upper=*/true);

break;

Minor nit: Braces no longer needed.


}

default:

llvm_unreachable("Non-character type is passed");

}
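For readers following the hunk above, here is a minimal standalone sketch of the same decision logic: narrow character types and values up to 0x7F get a C-style escape when one exists and are otherwise emitted as a single byte, while wider types are encoded as UTF-8 or fall back to a `\x` hex escape sized by the type width. It does not use LLVM. `CharKind`, `cEscape`, `encodeUTF8`, `hexNoPrefix` and `writeCharValue` are hypothetical stand-ins for the types and helpers named in the patch, and the follow-up escaping pass that the patch applies via `EscapeStringForDiagnostic` is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the builtin character kinds the patch switches on.
enum class CharKind { Char, Char8, Char16, Char32, WChar };

// Hypothetical stand-in for a single-quote C-style escaper: returns an empty
// string when the value has no C-style escape.
inline std::string cEscape(uint32_t V) {
  switch (V) {
  case '\\': return "\\\\";
  case '\'': return "\\'";
  case '\a': return "\\a";
  case '\b': return "\\b";
  case '\f': return "\\f";
  case '\n': return "\\n";
  case '\r': return "\\r";
  case '\t': return "\\t";
  case '\v': return "\\v";
  default:   return "";
  }
}

// Append the UTF-8 encoding of CP to Out. Returns false for surrogates and
// for values above U+10FFFF, which are not Unicode scalar values.
inline bool encodeUTF8(uint32_t CP, std::string &Out) {
  if (CP > 0x10FFFF || (CP >= 0xD800 && CP <= 0xDFFF))
    return false;
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
  return true;
}

// Upper-case hex with at least MinDigits digits and no "0x" prefix.
inline std::string hexNoPrefix(uint32_t V, unsigned MinDigits) {
  static const char Digits[] = "0123456789ABCDEF";
  std::string S;
  do {
    S.insert(S.begin(), Digits[V & 0xF]);
    V >>= 4;
  } while (V != 0);
  while (S.size() < MinDigits)
    S.insert(S.begin(), '0');
  return S;
}

// Sketch of the control flow only; Value is the zero-extended code unit.
inline std::string writeCharValue(uint32_t Value, CharKind K, unsigned TyWidth) {
  assert(8 <= TyWidth && TyWidth <= 32 && "Unexpected integer width");
  if (K == CharKind::Char || K == CharKind::Char8 || Value <= 0x7F) {
    std::string Escaped = cEscape(Value);
    if (!Escaped.empty())
      return Escaped;
    return std::string(1, static_cast<char>(Value)); // raw byte, unescaped
  }
  std::string Out;
  if (encodeUTF8(Value, Out))
    return Out;
  return "\\x" + hexNoPrefix(Value, TyWidth / 4); // one hex digit per 4 bits
}
```

With these assumptions, `writeCharValue(0xDB93, CharKind::Char16, 16)` yields `\xDB93`, which matches the lone-surrogate case in the patch's `static-assert-cxx26.cpp` test.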

  /// Convert \V to a string we can present to the user in a diagnostic
  /// \T is the type of the expression that has been evaluated into \V
  static bool ConvertAPValueToString(const APValue &V, QualType T,
-                                    SmallVectorImpl<char> &Str) {
+                                    SmallVectorImpl<char> &Str,
+                                    ASTContext &Context) {
    if (!V.hasValue())
      return false;
    switch (V.getKind()) {
    case APValue::ValueKind::Int:
      if (T->isBooleanType()) {
        // Bools are reduced to ints during evaluation, but for
        // diagnostic purposes we want to print them as
        // true or false.
        int64_t BoolValue = V.getInt().getExtValue();
        assert((BoolValue == 0 || BoolValue == 1) &&
               "Bool type, but value is not 0 or 1");
        llvm::raw_svector_ostream OS(Str);
        OS << (BoolValue ? "true" : "false");
-     } else if (T->isCharType()) {
+     } else {
+       llvm::raw_svector_ostream OS(Str);
        // Same is true for chars.
-       Str.push_back('\'');
-       Str.push_back(V.getInt().getExtValue());
-       Str.push_back('\'');
-     } else
+       // We want to print the character representation for textual types
+       const auto *BTy = T->getAs<BuiltinType>();
+       if (BTy) {
+         switch (BTy->getKind()) {
+         case BuiltinType::Char_S:
+         case BuiltinType::Char_U:
+         case BuiltinType::Char8:
+         case BuiltinType::Char16:
+         case BuiltinType::Char32:
+         case BuiltinType::WChar_S:
+         case BuiltinType::WChar_U: {

hubert.reinterpretcastUnsubmitted

Done

Add FIXME for char and wchar_t cases that this assumes Unicode literal encodings.


cor3ntinUnsubmitted

Done

If we wanted such fixme, it should be L1689.


hubert.reinterpretcastUnsubmitted

Done

The ConvertCharToString has a first parameter called CodePoint. With that interface[^1], it is sensible to insert conversion from the applicable literal encoding to a Unicode code point value here (thus my request for a FIXME here).

You are probably right that the FIXME belongs elsewhere. If you were thinking what I am thinking, then I am guessing you meant L16894? That is where the ConvertCharToString function seems to assume that a wchar_t value is directly a "code point value". To generate hex escapes, the function needs to be passed the original value (including for chars, e.g., to handle stray code units). Once the interface is updated (i.e., the parameter is renamed), the ConvertCharToString function would more clearly be the place to put one or more FIXMEs about encoding assumptions.

[^1]: It turns out that the parameter is already not treated consistently as a code point value within the function (and by the caller) and the parameter is just badly named.


unsigned TyWidth = Context.getIntWidth(T);

assert(8 <= TyWidth && TyWidth <= 32 && "Unexpected integer width");

uint32_t CodeUnit = static_cast<uint32_t>(V.getInt().getZExtValue());

WriteCharTypePrefix(BTy->getKind(), OS);

cor3ntinUnsubmitted

Done

Looking at the diagnostics, I don't think it makes sense to print a prefix here. You could just leave that part out.


hubert.reinterpretcastUnsubmitted

Not Done

Why is removing the prefix better? The types can matter (characters outside the basic character set are allowed to have negative char values). Also, moving forward, the value of a character need not be the same in the various encodings.


hubert.reinterpretcastUnsubmitted

Not Done

Some fun with signedness (imagine a more realistic example with ISO-8859-1 ordinary character encoding with a signed char type):

$ clang -Xclang -fwchar-type=short -xc++ -<<<$'static_assert(L"\\uFF10"[0] == U\'\\uFF10\');'
<stdin>:1:15: error: static assertion failed due to requirement 'L"\xFF10"[0] == U'\uff10''
    1 | static_assert(L"\uFF10"[0] == U'\uFF10');
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~
<stdin>:1:28: note: expression evaluates to ''０' (0xFF10) == '０' (0xFF10)'
    1 | static_assert(L"\uFF10"[0] == U'\uFF10');
      |               ~~~~~~~~~~~~~^~~~~~~~~~~~
1 error generated.
Return:  0x01:1   Fri Aug 11 23:49:02 2023 EDT


cor3ntinUnsubmitted

Not Done

Either we care about the actual character, i.e. 'a', or its value (i.e. 42). The motivation for the current patch is to add the value to the diagnostic message.
I'm also concerned about mixing things that are and are not lexical elements in the diagnostics.

hubert.reinterpretcastUnsubmitted

Not Done

Maybe the motivation for the current patch is to add the value, but what it does (for wide characters as defined in C) is to add the character (and obfuscate the value).

Observe the status quo (https://godbolt.org/z/Wc6nKvTMn):

note: expression evaluates to '-240 == 65296'

From the output higher up (with this patch), we see two "identical" characters and values (due to lack of decimal value output). With decimal value output added, it will still be potentially confusing why the two identical characters have different values (without some sort of type annotation).

I admit that the confusion arises in the status quo treatment of signed char and unsigned char. I hope I am using the word correctly when I say that it is ironic that the patch breaks in one context what it seeks to fix in another.
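To make the `-240 == 65296` status quo output concrete, the sketch below reinterprets one 16-bit pattern as a signed value. It assumes, as in the `-fwchar-type=short` example above, that the wide character type is a signed 16-bit integer; `asSigned16` is a hypothetical helper, not something from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Reinterpret a 16-bit code unit as a signed 16-bit value, which is what a
// signed 16-bit wchar_t would hold for that bit pattern.
inline int32_t asSigned16(uint32_t CodeUnit) {
  assert(CodeUnit <= 0xFFFF && "expected a 16-bit code unit");
  // Two's complement: patterns with the top bit set denote negative values.
  return CodeUnit >= 0x8000 ? static_cast<int32_t>(CodeUnit) - 0x10000
                            : static_cast<int32_t>(CodeUnit);
}
```

For U+FF10, the signed 16-bit reading is -240 while the `char32_t` reading of the same character is 65296, so a hex-only rendering (`0xFF10` on both sides) hides the difference that makes the comparison fail.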


OS << '\'';

WriteCharValueForDiagnostic(CodeUnit, BTy, TyWidth, Str);

hubert.reinterpretcastUnsubmitted

Not Done

To have the diagnostic printer handle separating any potential grapheme (if it is capable of doing so--potentially in the future), we need to isolate the result of WriteCharValueForDiagnostic in a separate message substitution.


OS << "' (0x"

<< llvm::format_hex_no_prefix(CodeUnit, /*Width=*/2,

/*Upper=*/true)

hubert.reinterpretcastUnsubmitted

Not Done

@aaron.ballman, hex output hides signedness. I think we want hex and decimal.


<< ", " << V.getInt() << ')';

return true;

}
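The value suffix built in this case block, hex of the zero-extended code unit followed by the signed decimal value, can be sketched without LLVM as follows. `valueSuffix` is a hypothetical helper; it assumes the value fits the stated bit width and mirrors the minimum width of two hex digits used in the hunk.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Build " (0x.., N)": the code unit zero-extended to TyWidth bits in
// upper-case hex (at least two digits), then the signed decimal value.
inline std::string valueSuffix(long long Value, unsigned TyWidth) {
  assert(8 <= TyWidth && TyWidth <= 32 && "Unexpected integer width");
  const uint64_t Mask = (uint64_t{1} << TyWidth) - 1;
  // Masking the two's complement pattern is the zero-extension step.
  const unsigned long long CodeUnit = static_cast<uint64_t>(Value) & Mask;
  char Buf[64];
  std::snprintf(Buf, sizeof(Buf), " (0x%02llX, %lld)", CodeUnit, Value);
  return Buf;
}
```

Under these assumptions `valueSuffix(-123, 8)` gives ` (0x85, -123)`, the form that appears in the updated `static-assert.cpp` expectations, and printing both numbers keeps the sign visible.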

default:

break;

}

        V.getInt().toString(Str);
+     }
      break;
    case APValue::ValueKind::Float:
      V.getFloat().toString(Str);
      break;
    case APValue::ValueKind::LValue:
      if (V.isNullPointer()) {
        llvm::raw_svector_ostream OS(Str);
        OS << "nullptr";
      } else
        return false;
      break;

cor3ntinUnsubmitted

Done

A different way to do that would be to have a 'Escape Whitespaces' parameter on pushEscapedString that would also escape \t, \n, etc (I don't think we want to escape SPACE)


hazoheletAuthorUnsubmitted

Done

If you mean \u0020 by SPACE, it won't be escaped by this code.
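The "escape whitespace" switch floated in this thread could look roughly like the sketch below. It is not the interface of `pushEscapedString`; `escapeForNote` and its `EscapeWhitespace` flag are hypothetical names used only to show the intended behaviour: tab, newline and similar control whitespace become C-style escapes, while SPACE (U+0020) is always passed through.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: optionally rewrite control whitespace as C-style
// escapes. SPACE (U+0020) is never escaped.
inline std::string escapeForNote(const std::string &In, bool EscapeWhitespace) {
  std::string Out;
  for (char C : In) {
    if (EscapeWhitespace) {
      switch (C) {
      case '\t': Out += "\\t"; continue; // continue resumes the for loop
      case '\n': Out += "\\n"; continue;
      case '\r': Out += "\\r"; continue;
      case '\v': Out += "\\v"; continue;
      case '\f': Out += "\\f"; continue;
      default:   break;
      }
    }
    Out += C;
  }
  return Out;
}
```

With the flag set, a tab inside the text is shown as the two characters `\t`, and the space next to it is left untouched.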


    case APValue::ValueKind::ComplexFloat: {
      llvm::raw_svector_ostream OS(Str);
      OS << '(';
      V.getComplexFloatReal().toString(Str);
      OS << " + ";
      V.getComplexFloatImag().toString(Str);
      OS << "i)";
[66 lines not shown; enclosing context: struct {]
        bool Print;
      } DiagSide[2] = {{LHS, Expr::EvalResult(), {}, false},
                       {RHS, Expr::EvalResult(), {}, false}};
      for (unsigned I = 0; I < 2; I++) {
        const Expr *Side = DiagSide[I].Cond;
        Side->EvaluateAsRValue(DiagSide[I].Result, Context, true);
-       DiagSide[I].Print = ConvertAPValueToString(
-           DiagSide[I].Result.Val, Side->getType(), DiagSide[I].ValueString);
+       DiagSide[I].Print =
+           ConvertAPValueToString(DiagSide[I].Result.Val, Side->getType(),
+                                  DiagSide[I].ValueString, Context);
      }
      if (DiagSide[0].Print && DiagSide[1].Print) {
        Diag(Op->getExprLoc(), diag::note_expr_evaluates_to)
            << DiagSide[0].ValueString << Op->getOpcodeStr()
            << DiagSide[1].ValueString << Op->getSourceRange();
      }
[2,036 lines not shown]

clang/test/Lexer/cxx1z-trigraphs.cpp

[15 lines not shown]
  error here;

  // Note, there is intentionally trailing whitespace one line below.
  // ??/ 
  error here;

  #if !ENABLED_TRIGRAPHS
  // expected-error@11 {{}} expected-warning@11 {{trigraph ignored}}
- // expected-error@13 {{failed}} expected-warning@13 {{trigraph ignored}} expected-note@13 {{evaluates to ''?' == '#''}}
+ // expected-error@13 {{failed}} expected-warning@13 {{trigraph ignored}} expected-note@13 {{evaluates to ''?' (0x3F, 63) == '#' (0x23, 35)'}}
aaron.ballmanUnsubmitted

Done

I think the original diagnostic was actually more understandable as it relates more closely to what's written in the static assertion. I could imagine something like `evaluates to '?' (63) == '#' (35)` would also be reasonable.

tahonermannUnsubmitted

Done

I agree. I would also be ok with printing the integer value as primary with the character as secondary:

evaluates to 63 ('?') == 35 ('#')

There are two kinds of non-printable characters:

* Control characters (including new-line)
* Character values that don't correspond to a character (e.g., lone trailing characters or invalid code unit values).

For the first case, I would support printing them as either C escapes or universal-character-names, e.g.,

evaluates to 0 ('\0') == 1 (\u0001)

For the second case, I would support printing them as C hex escapes, e.g.,

evaluates to -128 ('\x80') == -123 ('\x85')

cor3ntinUnsubmitted

Done

> For the first case, I would support printing them as either C escapes or universal-character-names. e.g.,

As mentioned before, we should be consistent with what we do for diagnostic messages in general (i.e. `pushEscapedString`). I checked and we already do that: https://godbolt.org/z/doah9YGMT
The question is, why do we sometimes not?

(Note that in general I don't have an opinion about displaying the value of character literals in addition to the character itself; it seems like a good thing.)

hazoheletAuthorUnsubmitted

Done

The character escape in the current error message is handled in `CharacterLiteral::print` (https://github.com/llvm/llvm-project/blob/fcb6a9c07cf7a2bc63d364e3b7f60aaadadd57cc/clang/lib/AST/Expr.cpp#L1064-L1083), and the reason for `'\0'` being escaped to `\x00` and `'\u{9}'` escaped to `\t` is that `escapeCStyle` does not escape the null character (https://github.com/llvm/llvm-project/blob/c1c86f9eae73786bcdacddaab248817c4f176935/clang/include/clang/Basic/CharInfo.h#L174-L200). `pushEscapedString` does not escape whitespace characters, so we should use `escapeCStyle` if we are to use C-style escapes for them.
  // expected-error@16 {{}}
  // expected-error@20 {{}}
  #else
  // expected-warning@11 {{trigraph converted}}
  // expected-warning@13 {{trigraph converted}}
  // expected-warning@19 {{backslash and newline separated by space}}
  #endif

clang/test/SemaCXX/static-assert-cxx26.cpp

[292 lines not shown; enclosing context: struct Frobble {]
  constexpr int size() const { return 5; }
  constexpr const char *data() const { return "hello"; }
};

Good<Frobble> a; // expected-note {{in instantiation}}
Bad<int> b; // expected-note {{in instantiation}}

}

namespace EscapeInDiagnostic {
static_assert('\u{9}' == (char)1, ""); // expected-error {{failed}} \
// expected-note {{evaluates to ''\t' (0x09, 9) == '<U+0001>' (0x01, 1)'}}
tahonermannUnsubmitted

Not Done

Is the expected note up to date? I don't see code that would generate the `<U+0001>` output. Am I just missing it? Since U+0001 is a valid, though non-printable, character, I would expect `'\u0001'` instead.

cor3ntinUnsubmitted

Not Done

See elsewhere in the discussion. This formatting is pre-existing and managed at the DiagnosticEngine level (pushEscapedString). The reason it's not `\u0001` is 1/ to avoid reusing C++ syntactic elements for something that comes from diagnostics and is not represented as an escape sequence in source, and 2/ `\u00011` is unreadable, and `\U000000001` is also not helpful :)

tahonermannUnsubmitted

Not Done

Thanks for the explanation. I'm not sure that I agree with the rationale for (1) though. We're already putting the value in single quotes and representing some values with escapes in many of these cases when the value isn't produced by an escape sequence (or even a character/string literal); why exclude `\uXXXX`? I agree with the rationale for (2); we could use `'\u{1}'` in that case.

cor3ntinUnsubmitted

Not Done

FYI, afaik the notation in clang predates the existence of \u{} by a few years, and follows Unicode notation (https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html). The oldest instance seems to be https://github.com/llvm/llvm-project/commit/77091b167fd959e1ee0c4dad4ec44de43b6c95db; I followed suit when reworking the generic escaping mechanism that all strings fed to diagnostics go through. I don't care about changing the syntax, but I do hope we are consistent. Ultimately, what we are trying to do is to designate a Unicode code point, and whether we do it through C++ syntax or not probably does not matter much as long as it's clear, delimited and consistent!

tahonermannUnsubmitted

Not Done

I think the substitution of `<U+XXXX>` by the diagnostic engine itself is perfectly fine and good; particularly when it has no context to suggest a different presentation. In this particular case, where the character is being presented using C++ syntax as a character literal, I would prefer that C++ syntax be used consistently. From an implementation standpoint, I'm suggesting that `WriteCharValueForDiagnostic()` be modified such that, if `escapeCStyle<EscapeChar::Single>()` returns an empty string, the character be presented in `'\u{XXXX}'` form if the character is one that would otherwise be substituted by the diagnostic engine (e.g., if `isPrintable()` is false). Note that this would be restricted to `char` values <= 0x7F; larger values could still be passed through as invalid code units that the diagnostic engine would then render as, e.g., `'<FC>'`.

cor3ntinUnsubmitted

Not Done

We should make a decision before forcing the author to make further changes, so as to avoid going in circles. As a user, `\u{XXXX}` vs `<U+XXXX>` makes no difference in terms of the amount of information I receive. I'm really not a fan of duplicating code, spreading the logic in multiple places and having multiple ways to render an invalid `char`. But I'm also concerned about spending too much time on `char` literals in `static_assert`. It might not be a common enough use case to warrant that much scrutiny :) So I would be happy to go with _any_ direction.
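The two presentations being weighed in this thread differ only in syntax, which a short sketch can show side by side. `engineStyle` and `delimitedStyle` are hypothetical helpers written for illustration; neither is clang's implementation, and the second only models the `'\u{XXXX}'` form that was suggested as an alternative.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Engine-style rendering as it appears in the expected notes: <U+XXXX>,
// padded to at least four upper-case hex digits.
inline std::string engineStyle(uint32_t CP) {
  assert(CP <= 0x10FFFF && "not a Unicode code point");
  char Buf[32];
  std::snprintf(Buf, sizeof(Buf), "<U+%04X>", static_cast<unsigned>(CP));
  return Buf;
}

// Delimited-escape rendering suggested in the thread: \u{X...}, no padding.
inline std::string delimitedStyle(uint32_t CP) {
  assert(CP <= 0x10FFFF && "not a Unicode code point");
  char Buf[32];
  std::snprintf(Buf, sizeof(Buf), "\\u{%X}", static_cast<unsigned>(CP));
  return Buf;
}
```

Both forms are delimited, so neither has the readability problem of `\u00011`; the open question in the thread is consistency with the rest of the diagnostics, not information content.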
		static_assert((char8_t)-128 == (char8_t)-123, ""); // expected-error {{failed}} \
		// expected-note {{evaluates to 'u8'<80>' (0x80, 128) == u8'<85>' (0x85, 133)'}}
		static_assert((char16_t)0xFEFF == (char16_t)0xDB93, ""); // expected-error {{failed}} \
		// expected-note {{evaluates to 'u'' (0xFEFF, 65279) == u'\xDB93' (0xDB93, 56211)'}}
		}

clang/test/SemaCXX/static-assert.cpp

[262 lines not shown; enclosing context: namespace Diagnostics {]
  /// Simple things are ignored.
  static_assert(1 == (-(1)), ""); //expected-error {{failed}}

  /// Chars are printed as chars.
  constexpr char getChar() {
    return 'c';
  }
  static_assert(getChar() == 'a', ""); // expected-error {{failed}} \
-   // expected-note {{evaluates to ''c' == 'a''}}
+   // expected-note {{evaluates to ''c' (0x63, 99) == 'a' (0x61, 97)'}}

static_assert((char)9 == '\x61', ""); // expected-error {{failed}} \

// expected-note {{evaluates to ''\t' (0x09, 9) == 'a' (0x61, 97)'}}

static_assert((char)10 == '\0', ""); // expected-error {{failed}} \

// expected-note {{n' (0x0A, 10) == '<U+0000>' (0x00, 0)'}}

// The note above is intended to match "evaluates to '\n' (0x0A, 10) == '<U+0000>' (0x00, 0)'", but if we write it as it is,

// the "\n" cannot be consumed by the diagnostic consumer.

tahonermannUnsubmitted

Not Done

Here too, I find the '<U+0000>' presentation surprising; either of '\0' or '\u0000' would be preferred.


static_assert((signed char)10 == (char)-123, ""); // expected-error {{failed}} \

// expected-note {{evaluates to '10 == '<85>' (0x85, -123)'}}

static_assert((char)-4 == (unsigned char)-8, ""); // expected-error {{failed}} \

// expected-note {{evaluates to ''<FC>' (0xFC, -4) == 248'}}

static_assert((char)-128 == (char)-123, ""); // expected-error {{failed}} \

// expected-note {{evaluates to ''<80>' (0x80, -128) == '<85>' (0x85, -123)'}}

static_assert('\xA0' == (char)'\x20', ""); // expected-error {{failed}} \

// expected-note {{evaluates to ''<A0>' (0xA0, -96) == ' ' (0x20, 32)'}}

static_assert((char16_t)L'ゆ' == L"C̵̭̯̠̎͌ͅť̺"[1], ""); // expected-error {{failed}} \

// expected-note {{evaluates to 'u'ゆ' (0x3086, 12422) == L'̵' (0x335, 821)'}}

hubert.reinterpretcastUnsubmitted

Not Done

The C++23 escaped string formatting facility would not generate a trailing combining character like this. I recommend following suit.

Info on U+0335: https://util.unicode.org/UnicodeJsps/character.jsp?a=0335


cor3ntinUnsubmitted

Not Done

This is way outside the scope of the patch. The diagnostic output facility has no understanding of combining characters or graphemes and does not attempt to match std::print. It probably would be an improvement, but this patch is not trying to modify how all diagnostics are printed (all of that logic is in Diagnostic.cpp).

hubert.reinterpretcastUnsubmitted

Not Done

This patch is pushing the envelope of what appears in diagnostics. One can also argue that someone writing

static_assert(false, "\u0301");

gets what they deserve, but that case does not have a big problem anyway (because the provided message text appears after : ).

This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).


hubert.reinterpretcastUnsubmitted

Not Done

@cor3ntin, do you have status quo examples for how grapheme-extending characters that are not already "problematic" in their original context are emitted in diagnostics in contexts where they are?


cor3ntinUnsubmitted

Not Done

Are you looking for that sort of example? https://godbolt.org/z/c79xWr7Me
That shows that clang has no understanding of graphemes.

tahonermannUnsubmitted

Not Done

gcc and MSVC get that case "right" (probably by accident). https://godbolt.org/z/Tjd6xnEon


hubert.reinterpretcastUnsubmitted

Not Done

I was more looking for cases where the output grapheme includes elements that were part of the fixed message text (gobbling quotes, etc.). Also, this patch is in the wrong shape for handling this concern in the diagnostic printer because the delimiting of the replacement text happens in this patch.


cor3ntinUnsubmitted

Not Done

Could you craft a message that becomes a grapheme after substitution by the engine? Maybe? You would have to try very hard, and static_assert diagnostics are not of the right shape.

> This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).

I still don't see it. None of the output produced in this patch is even close to what could be problematic, i.e. this patch only ever produces ASCII or single code points that get escaped when they are not printable.

hubert.reinterpretcastUnsubmitted

Not Done

> Could you craft a message that becomes a grapheme after substitution by the engine? Maybe? You would have to try very hard, and static_assert diagnostics are not of the right shape.

That is what I meant by this patch introducing new situations.

>> This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).

> I still don't see it. None of the output produced in this patch is even close to what could be problematic, i.e. this patch only ever produces ASCII or single code points that get escaped when they are not printable.

This patch produces single codepoints that are not escaped even when they may combine with a ' delimiter. This patch also (currently) forms the string with the ' and the potentially-combining character directly adjacent to each other. The least this patch should do is to emit the potentially-combining character to the diagnostic facility as a separate substitution (that is, if we can agree that the diagnostic facility should consider substitution boundaries as separate text elements; i.e., no graphemes should be formed partially by a substitution).

Whether the diagnostic facility should use an escape sequence or a ZWNJ at the boundary can be a different discussion.

cor3ntinUnsubmitted

Not Done

Sure, we could make it a separate message, although we added so much redundant information in the crafted message that it might be tricky.

But now I understand your concern: if the code point is a grapheme extender (so we are printing a char16_t or something with at least as many bytes), then whether it is rendered as a glyph or escaped (if clang's behavior were to escape a printable lone combining character, which it currently does not) would depend on whether it is passed directly to the diag engine or not.

Sure. At least now I get what you mean.
I still don't see that as a reason to rework this patch yet again; there are too many ifs for it to be something users are likely to encounter, and it requires clang features that are just not there and that we have no plans to implement beyond "wouldn't it be nice if".

Would you be happy with a FIXME in the code?
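The concern about a lone combining character attaching to the opening quote can be illustrated with a small sketch. It relies on a deliberately crude assumption: only the Combining Diacritical Marks block (U+0300 to U+036F) is treated as combining, whereas a real check would consult the Unicode Grapheme_Extend property. `isCombiningMark` and `quoteCodePoint` are hypothetical names, and the escape-instead-of-print policy shown here is one option raised in the thread, not what the patch does.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Simplifying assumption: only U+0300..U+036F counts as "combining".
inline bool isCombiningMark(uint32_t CP) {
  return CP >= 0x0300 && CP <= 0x036F;
}

// Hypothetical policy: escape a lone combining mark rather than emitting it
// right after the opening quote, where a renderer would attach it to the
// quote. UTF8 is the already-encoded form of CP.
inline std::string quoteCodePoint(uint32_t CP, const std::string &UTF8) {
  assert(!UTF8.empty() && "expected the encoded form of CP");
  if (isCombiningMark(CP)) {
    char Buf[32];
    std::snprintf(Buf, sizeof(Buf), "'<U+%04X>'", static_cast<unsigned>(CP));
    return Buf;
  }
  return "'" + UTF8 + "'";
}
```

For U+0335 (UTF-8 bytes `CC B5`), the character from the `L"..."[1]` test case, this sketch produces `'<U+0335>'` instead of a quote with an overlay stroke through it.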

static_assert(L"＼／"[1] == u'\xFFFD', ""); // expected-error {{failed}} \

// expected-note {{evaluates to 'L'／' (0xFF0F, 65295) == u'�' (0xFFFD, 65533)'}}

static_assert(L"⚾"[0] == U'🌍', ""); // expected-error {{failed}} \

// expected-note {{evaluates to 'L'⚾' (0x26BE, 9918) == U'🌍' (0x1F30D, 127757)'}}

static_assert(U"\a"[0] == (wchar_t)9, ""); // expected-error {{failed}} \

// expected-note {{evaluates to 'U'\a' (0x07, 7) == L'\t' (0x09, 9)'}}

aaron.ballmanUnsubmitted

Done

// expected-note {{evaluates to ''<A0>' (-96) == ' ' (32)'}}

- static_assert((char16_t)L'ゆ' == L"C̵̭̯̠̎͌ͅť̺"[1], ""); // expected-error {{failed}} \

- // expected-note {{evaluates to ''ゆ' (12422) == '̵' (821)'}}

- static_assert(L"＼／"[1] == u'\xFFFD', ""); // expected-error {{failed}} \

- // expected-note {{evaluates to ''／' (65295) == '�' (65533)'}}

- static_assert(L"⚾"[0] == U'🌍', ""); // expected-error {{failed}} \

- // expected-note {{evaluates to ''⚾' (9918) == '🌍' (127757)'}}

- static_assert(U"\a"[0] == (wchar_t)9, ""); // expected-error {{failed}} \

- // expected-note {{evaluates to ''\a' (7) == '\t' (9)'}}

+ static_assert((char16_t)L'ゆ' == L"C̵̭̯̠̎͌ͅť̺"[1], ""); // expected-error {{failed}} \

+ // expected-note {{evaluates to ''ゆ' (12422) == '̵' (821)'}}

+ static_assert(L"＼／"[1] == u'\xFFFD', ""); // expected-error {{failed}} \

+ // expected-note {{evaluates to ''／' (65295) == '�' (65533)'}}

+ static_assert(L"⚾"[0] == U'🌍', ""); // expected-error {{failed}} \

+ // expected-note {{evaluates to ''⚾' (9918) == '🌍' (127757)'}}

+ static_assert(U"\a"[0] == (wchar_t)9, ""); // expected-error {{failed}} \

+ // expected-note {{evaluates to ''\a' (7) == '\t' (9)'}}

/// Bools are printed as bools.


static_assert(L"§"[0] == U'Ö', ""); // expected-error {{failed}} \

// expected-note {{evaluates to 'L'§' (0xA7, 167) == U'Ö' (0xD6, 214)'}}

/// Bools are printed as bools.
constexpr bool invert(bool b) {
  return !b;
}

static_assert(invert(true) || invert(true), ""); // expected-error {{static assertion failed due to requirement 'invert(true) || invert(true)'}}
static_assert(invert(true) == invert(false), ""); // expected-error {{static assertion failed due to requirement 'invert(true) == invert(false)'}} \
[51 lines not shown]


Revision Contents

Diff 557581

clang/docs/ReleaseNotes.rst

clang/include/clang/Basic/Diagnostic.h

clang/lib/Basic/Diagnostic.cpp

clang/lib/Sema/SemaDeclCXX.cpp

clang/test/Lexer/cxx1z-trigraphs.cpp

clang/test/SemaCXX/static-assert-cxx26.cpp

clang/test/SemaCXX/static-assert.cpp
