This is an archive of the discontinued LLVM Phabricator instance.

[Clang][Sema] Fix display of characters on static assertion failure
ClosedPublic

Authored by hazohelet on Jul 18 2023, 8:35 AM.

Details

Summary

This patch fixes the display of characters appearing in LHS or RHS of == expression in notes to static assertion failure.
This applies C-style escape if the printed character is a special character. This also adds a numerical value displayed next to the character representation.
This also tries to print multi-byte characters if the user-provided expression is multi-byte char type.

Diff Detail

Event Timeline

hazohelet created this revision.Jul 18 2023, 8:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 18 2023, 8:35 AM
hazohelet requested review of this revision.Jul 18 2023, 8:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 18 2023, 8:35 AM

I wonder if we want to see whether the character is printable before deciding whether to show the numeric value for it or show a character value? For whitespace characters, printing the character leads to odd diagnostics: https://godbolt.org/z/d3TW6Ee8s and for non-character data (e.g., uses through uint8_t, unsigned char, or signed char) we probably never want to print as a character to begin with because there's no reason to assume the data is textual.

Adding in some other reviewers for a wider selection of opinions.

clang/test/Lexer/cxx1z-trigraphs.cpp
24

I think the original diagnostic was actually more understandable as it relates more closely to what's written in the static assertion. I could imagine something like evaluates to '?' (63) == '#' (35) would also be reasonable.

What if you switch from IgnoreParenImpCasts at the top of DiagnoseStaticAssertDetails() to just IgnoreParens()? That improve cases where we simply compare a character literal to an integer, since the literal should be implicitly cast to an integer.

I agree with Aaron that in the current state, the common case diagnostics are made worse. But there is room for improvement!

I think what we want to do here is modify ConvertAPValueToString so that it applies the same escape logic to char as we do in pushEscapedString (in Diagnostics.cpp)
It would make sense that signed/unsigned are treated as integers (in which case they should not be quoted), but char, wchar_t, charN_t should try really hard to present a printable character to the user, unless it's not possible to do so.

So the best way to progress that might be:

  • put pushEscapedString somewhere we can reuse
  • make char/char8_t use this function in ConvertAPValueToString - it might be slightly more tricky for wchar_t, you would have to try to convert to utf-8 first.

I think the discussion is getting derailed a bit. The original reason I was talking to Takuya about this is this: https://godbolt.org/z/GjsYrexT3

static_assert('a' == 100);

For code like this, we print the it as ''a' == 100', but the AST contains a cast from the LHS char to an integer, and I think we shouldn't ignore that cast.

What if you switch from IgnoreParenImpCasts at the top of DiagnoseStaticAssertDetails() to just IgnoreParens()? That improve cases where we simply compare a character literal to an integer, since the literal should be implicitly cast to an integer.

It results in printing everything including char and bool as integer.
I wanted to print true and false instead of 1 and 0, so ignored the implicit casts when the user-provided expression is bool type.

I think it would look nicer to print the character representation as well when it is not non-printable

tahonermann added inline comments.Jul 19 2023, 9:27 AM
clang/test/Lexer/cxx1z-trigraphs.cpp
24

I agree. I would also be ok with printing the integer value as primary with the character as secondary:

evaluates to 63 ('?') == 35 ('#')

There are two kinds of non-printable characters:

  1. Control characters (including new-line)
  2. character values that don't correspond to a character (e.g., lone trailing characters or invalid code unit values).

For the first case, I would support printing them as either C escapes or universal-character-names. e.g.,

evaluates to 0 ('\0') == 1 (\u0001)

For the second case, I would support printing them as C hex escapes. e.g,

evaluates to -128 ('\x80') == -123 ('\x85')
cjdb added a comment.Jul 19 2023, 10:55 AM

I think Tom's suggestion about using escapes and UCNs is great. I have no real opinion on whether the numeric values is the one that's parenthesised.

cor3ntin added inline comments.Jul 19 2023, 11:05 AM
clang/test/Lexer/cxx1z-trigraphs.cpp
24

For the first case, I would support printing them as either C escapes or universal-character-names. e.g.,

As mentioned before, we should be consistent with what we do for diagnostics messages in general - (ie pushEscapedString`).
I check and we already do that. https://godbolt.org/z/doah9YGMT

Question is, why do we sometimes don't?
Note that in general i don't have an opinion about displaying the value of characters literal _in addition_ of the character itself, it seems like a good thing)

Thanks everyone for the comments!

clang/test/Lexer/cxx1z-trigraphs.cpp
24

The character escape in the current error message is handled in CharacterLiteral::print (https://github.com/llvm/llvm-project/blob/fcb6a9c07cf7a2bc63d364e3b7f60aaadadd57cc/clang/lib/AST/Expr.cpp#L1064-L1083), and the reason for '\0' being escaped to \x00 and '\u{9}' escaped to \t is that escapeCStyle does not escape null character (https://github.com/llvm/llvm-project/blob/c1c86f9eae73786bcdacddaab248817c4f176935/clang/include/clang/Basic/CharInfo.h#L174-L200).
pushEscapedString does not escape whitespace characters, so we should use escapeCStyle if we are to use c-style escape for them.

hazohelet updated this revision to Diff 545582.Jul 31 2023, 4:02 AM

Address review comments

  • Print the character representation only when the type of the expressions is char or char8_t
  • Use pushEscapedString in the printing so that we can reuse its escaping logic
  • Use escapeCStyle to escape whitespace characters
  • wchar_t and charN_t are not handled yet
cor3ntin added inline comments.Jul 31 2023, 4:49 AM
clang/lib/Basic/Diagnostic.cpp
838–858 ↗(On Diff #545582)

The use UCN addition is probably not justified. we should consistent in how we print the value of non-printable code points.

clang/lib/Sema/SemaDeclCXX.cpp
16793–16794

A different way to do that would be to have a 'Escape Whitespaces' parameter on pushEscapedString that would also escape \t, \n, etc (I don't think we want to escape SPACE)

hazohelet added inline comments.Jul 31 2023, 10:38 AM
clang/lib/Basic/Diagnostic.cpp
838–858 ↗(On Diff #545582)

My motivation here is to print a valid character literal. I think it justifies this change somewhat. I'd like to see what others think about this.

clang/lib/Sema/SemaDeclCXX.cpp
16793–16794

If you mean \u0020 by SPACE, it won't be escaped by this code.

cor3ntin added inline comments.Aug 1 2023, 6:39 AM
clang/lib/Basic/Diagnostic.cpp
838–858 ↗(On Diff #545582)

Why? there is no expectation that diagnostics messages reproduce C++. Consistency between string literals and characters literals is a lot more important

I've been thinking about it and I think I have a cleaner design for the printing of characters:

We need a CharToString(unsigned, Qualtype) -> SmallString method that takes a value and the type.
for char and char8_t we can just return the value.
For wchar_t, char32_t and char16_t, we can use something like ConvertCodePointToUTF8 to convert to UTF-8. If that fails we can escape with \x
If we pass the result of that to the diagnostic engine, escaping of non printing character would happen automatically.

That way we have a nice separation between converting an APValue to string and printing it, which should avoid code duplication quite a bit, and generally make the design nicer.
Then maybe we need to consider whether we want to modify CharacterLiteral::print to be aligned with all of that. I don;t know if that's used, for example, for mangling.

Given there are a bunch of different issues here, i would not mind separate PRs - having the numerical value showed in paren seems valuable on its own.

I've been thinking about it and I think I have a cleaner design for the printing of characters:

We need a CharToString(unsigned, Qualtype) -> SmallString method that takes a value and the type.
for char and char8_t we can just return the value.
For wchar_t, char32_t and char16_t, we can use something like ConvertCodePointToUTF8 to convert to UTF-8. If that fails we can escape with \x
If we pass the result of that to the diagnostic engine, escaping of non printing character would happen automatically.

That way we have a nice separation between converting an APValue to string and printing it, which should avoid code duplication quite a bit, and generally make the design nicer.
Then maybe we need to consider whether we want to modify CharacterLiteral::print to be aligned with all of that. I don;t know if that's used, for example, for mangling.

Given there are a bunch of different issues here, i would not mind separate PRs - having the numerical value showed in paren seems valuable on its own.

Thanks for the suggestion. Printing of multibyte character is a bit out of the scope of the original goal of this patch, but I'll give it a try.

clang/lib/Basic/Diagnostic.cpp
838–858 ↗(On Diff #545582)

The diagnostic meesage here looks like expression evaluates to 'VALUE1 == VALUE2'
I tend to expect that VALUE1 == VALUE2 is a syntactically valid expression because of the syntactical element ==. But if others do not feel the same way, I am okay with something like '<U+0001>' == '<U+0002>'

hazohelet updated this revision to Diff 546782.Aug 3 2023, 3:06 AM

Address comments from Corentin

  • Use default pushEscapedString escaping (<U+0001>) instead of UCN representation \u0001
  • Convert multi-byte characters (wchar_t, char16_t, char32_t) to UTF-8 and prints them.
  • Added CharToString utility function

This is starting to look pretty good!
I'm happy with the general direction, my only concern is that printing a prefix does not seem useful - we are trying to display the value, not how it was produced.

clang/lib/Sema/SemaDeclCXX.cpp
16791

We have similar switches in

  • StringLiteral::outputString
  • TryPrintAsStringLiteral (APValue.cpp)
  • CharacterLiteral::print

Sadly they all look at different things so I don't know if we could refactor all of that.
But looking further done, I don;t think we should print a prefix here, so we could remove that bit entirely.

16792

Looking at the diagnostics, I don't think it makes sense to print a prefix here. You could just leave that part out.

hazohelet updated this revision to Diff 546832.Aug 3 2023, 6:44 AM
hazohelet marked 2 inline comments as done.
hazohelet retitled this revision from [Clang][ExprConstant] Print integer instead of character on static assertion failure to [Clang][Sema] Fix display of characters on static assertion failure.
hazohelet edited the summary of this revision. (Show Details)

Address comments from Corentin

  • Remove printing of character type prefix
  • Added code example in release note
  • Removed unnecessary static_cast

One concern from my side is that some unicode characters like U+FEFF (I added in test) are invisible, but it may not be a big concern because we also display integer representation in parens.

cor3ntin added a comment.EditedAug 3 2023, 9:00 AM

One concern from my side is that some unicode characters like U+FEFF (I added in test) are invisible, but it may not be a big concern because we also display integer representation in parens.

This is not an easy problem to solve. pushEscapedString does not escape formatting code points, such as U+FEFF because doing so break a bunch of scripts/emojis.
There are cases where it should probably be escaped but that fully depend on context, and it would require grapheme clusterization, which is a lot of work for limited value.
I'd rather we don't change that for now.

cor3ntin added inline comments.Aug 3 2023, 9:01 AM
clang/docs/ReleaseNotes.rst
308–342

@aaron.ballman One one hand this is nice, on the other hand maybe too detailed. What do you think?

aaron.ballman added inline comments.Aug 3 2023, 9:08 AM
clang/docs/ReleaseNotes.rst
308–342

I'm happy with it -- better too much detail than too little, but this really helps users see what's been improved and why it matters.

That said, I think 0x0A and 0x1F30D would arguably be better than printing the values in decimal. For \n, perhaps folks remember that it's decimal value 10, but nobody is going to know what 127757 means compared to the hex representation (esp because the value is specified in hex with the prefix printed in the error message). WDYT?

cor3ntin added inline comments.Aug 3 2023, 9:12 AM
clang/docs/ReleaseNotes.rst
308–342

For wchar_t, charN_t I think that makes sense.
for char... hard to know, I think this is mostly useful for people who treat char as some kind of integer. I could go either way. using hex consistently seems reasonable

aaron.ballman added inline comments.Aug 3 2023, 9:40 AM
clang/docs/ReleaseNotes.rst
308–342

I don't insist on using hex, but I have a slight preference for using it consistently everywhere. CC @cjdb for more opinions since this relates to user experience of diagnostics.

clang/test/SemaCXX/static-assert.cpp
280–287
hazohelet added inline comments.Aug 4 2023, 4:55 AM
clang/docs/ReleaseNotes.rst
308–342

I generally agree that hex code would be better for characters.
I think we still have some arguable points.

  1. Should we print the unsigned code point or the (possibly signed) integer? (e.g. 0xFF vs -0x01 for (char)-1, on targets where char is signed)
  2. Should we print the hex code when the other subexpression of the == expression is not a textual type? (e.g. 0x11 vs 17 for LHS of (char)17 == 11)

For 1, I think we should always print unsigned code point for all textual types for consistency. Also we don't want to print -0x3 for L'\xFFFD' on targets where wchar_t is signed and 16-bit width (I haven't checked whether that target exists, though).
For 2, I want to see decimal (possibly signed) integer if the other side of the expression is not textual type.
Displaying expression evaluates to ''<FF>' (0xFF) == 255' for the following code would be highly confusing.

static_assert((char)-1 == (unsigned char)-1);

WDYT?

aaron.ballman added inline comments.Aug 4 2023, 6:04 AM
clang/docs/ReleaseNotes.rst
308–342

Should we print the unsigned code point or the (possibly signed) integer? (e.g. 0xFF vs -0x01 for (char)-1, on targets where char is signed)

Personally, I find -0x01 to be kind of weird and I slightly prefer 0xFF.

Should we print the hex code when the other subexpression of the == expression is not a textual type? (e.g. 0x11 vs 17 for LHS of (char)17 == 11)

I don't have a strong opinion on this because I think we can come up with arguments for either approach. My intuition is that we should just use hex values everywhere, but others may have a different opinion.

@abhina.sreeskantharajan, does this patch assume too much about the characters displayable for diagnostic output?

@hubert.reinterpretcast It does not, Unicode characters are only escaped in Diagnostics.cpp, and I think this is what we want.
Currently, llvm assume UTF-8 terminal, except on Windows where we convert to UTF-16 and use the wide windows APIs (raw_fd_ostream::write_impl).

If we want to extend that - IE support EBCDIC, I assume this is your question - we probably would want to modify pushEscapedString (Diagnostics.cpp), to consider a restricted set of characters as printable.
There are some questions in how we should do that, it could be a compile time configuration, or we need a way to 1/ detect the encoding of the environment, in a way similar to P1885 2/ construct the set of printable characters on that platforms.
Trying to encode all characters < U+00FF might be a reasonable way to build such table.

hazohelet updated this revision to Diff 548948.Aug 10 2023, 3:15 AM
hazohelet marked 2 inline comments as done.

Address comments from Aaron

  • Use hex code for integer representation of textual types
  • NFC stylistic changes

@hubert.reinterpretcast It does not, Unicode characters are only escaped in Diagnostics.cpp, and I think this is what we want.
Currently, llvm assume UTF-8 terminal, except on Windows where we convert to UTF-16 and use the wide windows APIs (raw_fd_ostream::write_impl).

I am skeptical of the extent to which that assumption is exercised in a problematic manner today. The characters being emitted (aside from the [U+0020, U+007E] fixed message text itself) generally come from the text of the source file, which is generally written using characters that the user can display (even if they are not "basic Latin" characters).

clang/lib/Sema/SemaDeclCXX.cpp
16792

Why is removing the prefix better? The types can matter (characters outside the basic character set are allowed to have negative char values). Also, moving forward, the value of a character need not be the same in the various encodings.

clang/lib/Sema/SemaDeclCXX.cpp
16782–16788

Add FIXME for char and wchar_t cases that this assumes Unicode literal encodings.

16794–16797

@aaron.ballman, hex output hides signedness. I think we want hex and decimal.

clang/test/SemaCXX/static-assert.cpp
281

The C++23 escaped string formatting facility would not generate a trailing combining character like this. I recommend following suit.

Info on U+0335: https://util.unicode.org/UnicodeJsps/character.jsp?a=0335

clang/lib/Sema/SemaDeclCXX.cpp
16792

Some fun with signedness (imagine a more realistic example with ISO-8859-1 ordinary character encoding with a signed char type):

$ clang -Xclang -fwchar-type=short -xc++ -<<<$'static_assert(L"\\uFF10"[0] == U\'\\uFF10\');'
<stdin>:1:15: error: static assertion failed due to requirement 'L"\xFF10"[0] == U'\uff10''
    1 | static_assert(L"\uFF10"[0] == U'\uFF10');
      |               ^~~~~~~~~~~~~~~~~~~~~~~~~
<stdin>:1:28: note: expression evaluates to ''0' (0xFF10) == '0' (0xFF10)'
    1 | static_assert(L"\uFF10"[0] == U'\uFF10');
      |               ~~~~~~~~~~~~~^~~~~~~~~~~~
1 error generated.
Return:  0x01:1   Fri Aug 11 23:49:02 2023 EDT
cor3ntin added inline comments.Aug 12 2023, 1:27 AM
clang/lib/Sema/SemaDeclCXX.cpp
16782–16788

If we wanted such fixme, it should be L1689.

16792

Either we care about the actual character - ie 'a', or it's value (ie 42). The motivation for the current patch is to add the value to the diagnostic message.
I'm also concerned about mixing things that are are are not lexical elements in the diagnostics

clang/test/SemaCXX/static-assert.cpp
281

This is way outside the scope of the patch. The diagnostic output facility has no understanding of combining characters or graphemes and do not attempt to match std::print. It probably would be an improvement but this patch is not trying to modify how all diagnostics are printed. (all of that logic is in Diagnostic.cpp)

clang/lib/Sema/SemaDeclCXX.cpp
16757

It does not seem that the first parameter expects a CodePoint argument in all cases. For Char_S, Char_U, and Char8, it seems the function wants to treat the input as a UTF-8 code unit.

I suggest changing the argument to be clearly a code unit (and potentially treat it as a code point value as appropriate later in the function).

Also: The function should probably be declared as having static linkage.
Additionally: The function does not "convert" in the language semantic sense. WriteCharacterValueDescriptionForDisplay might be a better name.

16764

For types other than Char_S, Char_U, and Char8, this fails to treat the C1 Controls and Latin-1 Supplement characters as Unicode code points. It looks like test coverage for these cases are missing.

16782–16788

The ConvertCharToString has a first parameter called CodePoint. With that interface[^1], it is sensible to insert conversion from the applicable literal encoding to a Unicode code point value here (thus my request for a FIXME here).

You are probably right that the FIXME belongs elsewhere. If you were thinking what I am thinking, then I am guessing you meant L16894? That is where the ConvertCharToString function seems to assume that a wchar_t value is directly a "code point value". To generate hex escapes, the function needs to be passed the original value (including for chars, e.g., to handle stray code units). Once the interface is updated (i.e., the parameter is renamed), the ConvertCharToString function would more clearly be the place to put one or more FIXMEs about encoding assumptions.

[^1]: It turns out that the parameter is already not treated consistently as a code point value within the function (and by the caller) and the parameter is just badly named.

16792

Maybe the motivation for the current patch is to add the value, but what it does (for wide characters as defined in C) is to add the character (and obfuscate the value).

Observe the status quo (https://godbolt.org/z/Wc6nKvTMn):

note: expression evaluates to '-240 == 65296'

From the output higher up (with this patch), we see two "identical" characters and values (due to lack of decimal value output). With decimal value output added, it will still be potentially confusing why the two identical characters have different values (without some sort of type annotation).

I admit that the confusion arises in the status quo treatment of signed char and unsigned char. I hope I am using the word correctly when I say that it is ironic that the patch breaks in one context what it seeks to fix in another.

clang/test/SemaCXX/static-assert.cpp
281

This patch is pushing the envelope of what appears in diagnostics. One can also argue that someone writing

static_assert(false, "\u0301");

gets what they deserve, but that case does not have a big problem anyway (because the provided message text appears after : ).

This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).

@hubert.reinterpretcast It does not, Unicode characters are only escaped in Diagnostics.cpp, and I think this is what we want.

Thanks @cor3ntin for the insight. I agree that this is a separate concern that also applies to static assert messages.

cor3ntin added inline comments.Aug 14 2023, 5:56 AM
clang/lib/Sema/SemaDeclCXX.cpp
16757

Agreed, CodeUnit or Value would be more correct (mostly because of numeric escape sequences).
But if we are going to change that then WriteCharValueForDiagnostic would be better, Character implies too much

16764

escapeCStyle is one of the things that assume ASCII / UTF, but yes, we might as well reduce to 0x7F just to avoid unnecessary work

I've discussed offline with @hubert.reinterpretcast and agree with him that with the addition of fexec-charset support, the set of characters deemed printable will not be accurate when other encodings are used. This will be similar to the printf/scanf format string validation issue I mentioned in my RFC and would require us to reverse the conversion or keep the original string around to check if the character is printable. I don't think we have finalized a solution on how to handle these issues yet.

I've discussed offline with @hubert.reinterpretcast and agree with him that with the addition of fexec-charset support, the set of characters deemed printable will not be accurate when other encodings are used. This will be similar to the printf/scanf format string validation issue I mentioned in my RFC and would require us to reverse the conversion or keep the original string around to check if the character is printable. I don't think we have finalized a solution on how to handle these issues yet.

Furthermore, it is reasonable for the scope of the current patch to focus on producing UTF-8 (or UTF-16 for Windows) output to the "terminal".

clang/lib/Sema/SemaDeclCXX.cpp
16764

escapeCStyle is one of the things that assume ASCII / UTF, but yes, we might as well reduce to 0x7F just to avoid unnecessary work

I meant (with a signed char type to trigger the assertion):

<stdin>:1:28: note: expression evaluates to ''<A2>' (0xA2) == '<A2>' (0xA2)'
    1 | static_assert(u"\u00a2"[0] == '<A2>');
      |               ~~~~~~~~~~~~~^~~~~~~~~

should be:

<stdin>:1:28: note: expression evaluates to ''¢' (0xA2) == '<A2>' (0xA2)'
    1 | static_assert(u"\u00a2"[0] == '<A2>');
      |               ~~~~~~~~~~~~~^~~~~~~~~
hazohelet marked 16 inline comments as done.

Address some review comments

  • Renamed ConvertCharToString to WriteCharValueForDiagnostic
  • Made the function static
  • Fixed the printing for unicode 0x80 ~ 0xFF
  • Added decimal value next to the hex code
clang/lib/Sema/SemaDeclCXX.cpp
16755–16756

Suggest wording tweaks.

16782–16783

Try using StringRef.

16785

Since the function interface has been clarified, this part actually doesn't need a FIXME. The FIXME should instead be added to the comment above the function declaration.

hazohelet updated this revision to Diff 550707.Aug 16 2023, 5:40 AM
hazohelet marked 3 inline comments as done.

Address comments from Hubert

  • Bring back type prefix
  • NFC stylistic changes
clang/lib/Sema/SemaDeclCXX.cpp
16805–16811

Minor nit: Braces no longer needed.

hazohelet updated this revision to Diff 552167.Aug 21 2023, 4:56 PM
hazohelet marked an inline comment as done.
clang/test/SemaCXX/static-assert.cpp
281

@cor3ntin, do you have status quo examples for how grapheme-extending characters that are not already "problematic" in their original context are emitted in diagnostics in contexts where they are?

tahonermann added inline comments.Sep 6 2023, 1:43 PM
clang/test/SemaCXX/static-assert-cxx26.cpp
304 ↗(On Diff #552167)

Is the expected note up to date? I don't see code that would generate the <U+0001> output. Am I just missing it? Since U+0001 is a valid, though non-printable, character, I would expect more '\u0001'.

clang/test/SemaCXX/static-assert.cpp
268–271

Here too, I find the '<U+0000>' presentation surprising; either of '\0' or '\u0000' would be preferred.

cor3ntin added inline comments.Sep 6 2023, 1:52 PM
clang/test/SemaCXX/static-assert-cxx26.cpp
304 ↗(On Diff #552167)

See elsewhere in the discussion. this formating is pre existing and managed at the DiagnosticEngine level (pushEscapedString). the reason it's not \u0001 is 1/ to avoid reusing c++ syntactic elements for something that comes from diagnostics and is not represented as an escaped sequence in source 2/ \u00011 is unreadable, and \U000000001 is also not helpful :)

cor3ntin added inline comments.Sep 6 2023, 1:55 PM
clang/test/SemaCXX/static-assert.cpp
281

Are you looking for that sort of examples? https://godbolt.org/z/c79xWr7Me
That shows that clang has no understanding of graphemes

@hazohelet I'm happy with the patch, I just need to make sure Hubert and I agree!

tahonermann added inline comments.Sep 6 2023, 2:05 PM
clang/test/SemaCXX/static-assert-cxx26.cpp
304 ↗(On Diff #552167)

Thanks for the explanation. I'm not sure that I agree with the rationale for (1) though. We're already putting the value in single quotes and representing some values with escapes in many of these cases when the value isn't produced by an escape sequence (or even a character/string literal); why exclude \uXXXX? I agree with the rationale for (2); we could use '\u{1}' in that case.

cor3ntin added inline comments.Sep 6 2023, 3:00 PM
clang/test/SemaCXX/static-assert-cxx26.cpp
304 ↗(On Diff #552167)

FYI afaik the notation in clang predates the existence of \u{} by a few years, and follow Unicode notation (https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html).
Oldest instance seems to be https://github.com/llvm/llvm-project/commit/77091b167fd959e1ee0c4dad4ec44de43b6c95db - i followed suite when reworking the generic escaping mechanism all string fed to diagnostics go through.

I don't care about changing the syntax, but i do hope we are consistent. Ultimately what we are trying to do is to designate a unicode codepoint and whether we do it through C++ syntax or not probably does not matter much as long as it's clear, delimited and consistent!

tahonermann added inline comments.Sep 12 2023, 12:55 PM
clang/test/SemaCXX/static-assert-cxx26.cpp
304 ↗(On Diff #552167)

I think the substitution of <U+XXXX> by the diagnostic engine itself is perfectly fine and good; particularly when it has no context to suggest a different presentation. In this particular case, where the character is being presented using C++ syntax as a character literal, I would prefer that C++ syntax be used consistently.

From an implementation standpoint, I'm suggesting that WriteCharValueForDiagnostic() be modified such that, if escapeCStyle<EscapeChar::Single>() returns an empty string, that the character be presented in '\u{XXXX}' form if the character is one that would otherwise be substituted by the diagnostic engine (e.g., if isPrintable() is false). Note that this would be restricted to char values <= 0x7F; larger values could still be passed through as invalid code units that the diagnostic engine would then render as, e.g., '<FC>'.

clang/test/SemaCXX/static-assert.cpp
281

gcc and MSVC get that case "right" (probably by accident). https://godbolt.org/z/Tjd6xnEon

clang/lib/Sema/SemaDeclCXX.cpp
16794

To have the diagnostic printer handle separating any potential grapheme (if it is capable of doing so--potentially in the future), we need to isolate the result of WriteCharValueForDiagnostic in a separate message substitution.

clang/test/SemaCXX/static-assert.cpp
281

I was more looking for cases where the output grapheme includes elements that were part of the fixed message text (gobbling quotes, etc.). Also, this patch is in the wrong shape for handling this concern in the diagnostic printer because the delimiting of the replacement text happens in this patch.

cor3ntin added inline comments.Sep 16 2023, 4:08 AM
clang/test/SemaCXX/static-assert-cxx26.cpp
304 ↗(On Diff #552167)

We should take a decision before forcing the author to do further change as to avoid going in circle.
As a user \u{XXXX} vs <U+XXXX> makes no difference in terms of the amount of information i receive.

I'm really not fan of duplicating code, spreading the logic in multiple places and having multiple ways to render a an invalid char. But I'm also concerned about spending too much time on char literals in static_assert. It might not be a common enough use case to warrant that much scrutiny :)

So I would be happy to go with _any_ direction

clang/test/SemaCXX/static-assert.cpp
281

Could you craft a message that becomes a graphene after substitution by the engine? Maybe? You would have to try very hard and static_assert diagnostics are not of the right shape.

This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).

I still don't see it. None of the output produce in this patch are even close to what i could be problematic. ie this patch is only ever producing ASCII or single codepoints that gets escaped when they are not printable

clang/test/SemaCXX/static-assert.cpp
281

Could you craft a message that becomes a graphene after substitution by the engine? Maybe? You would have to try very hard and static_assert diagnostics are not of the right shape.

That is what I meant by this patch introducing new situations.

This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first).

I still don't see it. None of the output produce in this patch are even close to what i could be problematic. ie this patch is only ever producing ASCII or single codepoints that gets escaped when they are not printable

This patch produces single codepoints that are not escaped even when they may combine with a ' delimiter. This patch also (currently) forms the string with the ' and the potentially-combining character directly adjacent to each other. The least this patch should do is to emit the potentially-combining character to the diagnostic facility as a separate substitution (that is, if we can agree that the diagnostic facility should consider substitution boundaries as separate text elements; i.e., no graphemes should be formed partially by a substitution).

Whether the diagnostic facility should use an escape sequence or a ZWNJ at the boundary can be a different discussion.

cor3ntin added inline comments.Sep 18 2023, 4:06 PM
clang/test/SemaCXX/static-assert.cpp
281

Sure we could make it a separate message, although we added so much redundant information in the crafterdmessage it might be tricky.

But now that I understand your concern it's the fact that if the codepoints is a grapheme extend ( so we are printing a char16_t or something with at least as many bytes), whether or not it would be rendered as a glyph or be escaped, if clang behavior were to escape a printable lone combining character (which it is currently not) would depend on whether it is passed directly or not to the diag engine.

Sure. At least now I get what you mean.
I still don't see that has a reason to rework this patch yet again, there are too many ifs for it to be something users are likely to encounter, and it requires clang features that are just not there and that we do not have plans to implementation beyond "wouldn't it be nice if"

Would you be happy with a fixme in the code?

I had a chat with @hubert.reinterpretcast and @tahonermann.
We reached consensus on wanting to make sure the codepoint value is formatted in a future-proof way so that if we ever implement escaping of lone combining codepoint, this case would be handled as well.

To do that, we can:

  • * Expose the escaping mechanism) (done by pushEscapedString in Diagnostic.cpp) as a new EscapeStringForDiagnostic function. Use that to escape the code point (line 16921)

I think that should be the last change we need here. Thanks for being patient with us!

hazohelet abandoned this revision.Sep 30 2023, 6:41 PM

Thank you so much to everyone for guiding this patch onto the right track! I'll submit GitHub PR after making suggested changes.
Since Phabricator is going to be shutdown, I mark this differential as abandoned.

@hazohelet Please keep this patch on Phab. It's not going to be shutdown. The current consensus is that we will reconsider shutting down phab on November 15
https://discourse.llvm.org/t/update-on-github-pull-requests/71540/125

@hazohelet Please keep this patch on Phab. It's not going to be shutdown. The current consensus is that we will reconsider shutting down phab on November 15
https://discourse.llvm.org/t/update-on-github-pull-requests/71540/125

Oh I failed to notice that post, thanks for letting me know.

hazohelet updated this revision to Diff 557524.Oct 1 2023, 11:14 PM

Address comments from Corentin

cor3ntin accepted this revision.Oct 2 2023, 2:49 AM

LGTM, thanks!

This revision is now accepted and ready to land.Oct 2 2023, 2:49 AM