This patch fixes the display of characters appearing in LHS or RHS of == expression in notes to static assertion failure.
This applies C-style escape if the printed character is a special character. This also adds a numerical value displayed next to the character representation.
This also tries to print multi-byte characters if the user-provided expression is multi-byte char type.
Details
Diff Detail
Event Timeline
I wonder if we want to see whether the character is printable before deciding whether to show the numeric value for it or show a character value? For whitespace characters, printing the character leads to odd diagnostics: https://godbolt.org/z/d3TW6Ee8s and for non-character data (e.g., uses through uint8_t, unsigned char, or signed char) we probably never want to print as a character to begin with because there's no reason to assume the data is textual.
Adding in some other reviewers for a wider selection of opinions.
clang/test/Lexer/cxx1z-trigraphs.cpp | ||
---|---|---|
24 | I think the original diagnostic was actually more understandable as it relates more closely to what's written in the static assertion. I could imagine something like evaluates to '?' (63) == '#' (35) would also be reasonable. |
What if you switch from IgnoreParenImpCasts at the top of DiagnoseStaticAssertDetails() to just IgnoreParens()? That improve cases where we simply compare a character literal to an integer, since the literal should be implicitly cast to an integer.
I agree with Aaron that in the current state, the common case diagnostics are made worse. But there is room for improvement!
I think what we want to do here is modify ConvertAPValueToString so that it applies the same escape logic to char as we do in pushEscapedString (in Diagnostics.cpp)
It would make sense that signed/unsigned are treated as integers (in which case they should not be quoted), but char, wchar_t, charN_t should try really hard to present a printable character to the user, unless it's not possible to do so.
So the best way to progress that might be:
- put pushEscapedString somewhere we can reuse
- make char/char8_t use this function in ConvertAPValueToString - it might be slightly more tricky for wchar_t, you would have to try to convert to utf-8 first.
I think the discussion is getting derailed a bit. The original reason I was talking to Takuya about this is this: https://godbolt.org/z/GjsYrexT3
static_assert('a' == 100);
For code like this, we print the it as ''a' == 100', but the AST contains a cast from the LHS char to an integer, and I think we shouldn't ignore that cast.
It results in printing everything including char and bool as integer.
I wanted to print true and false instead of 1 and 0, so ignored the implicit casts when the user-provided expression is bool type.
I think it would look nicer to print the character representation as well when it is not non-printable
clang/test/Lexer/cxx1z-trigraphs.cpp | ||
---|---|---|
24 | I agree. I would also be ok with printing the integer value as primary with the character as secondary: evaluates to 63 ('?') == 35 ('#') There are two kinds of non-printable characters:
For the first case, I would support printing them as either C escapes or universal-character-names. e.g., evaluates to 0 ('\0') == 1 (\u0001) For the second case, I would support printing them as C hex escapes. e.g, evaluates to -128 ('\x80') == -123 ('\x85') |
I think Tom's suggestion about using escapes and UCNs is great. I have no real opinion on whether the numeric values is the one that's parenthesised.
clang/test/Lexer/cxx1z-trigraphs.cpp | ||
---|---|---|
24 |
As mentioned before, we should be consistent with what we do for diagnostics messages in general - (ie pushEscapedString`). Question is, why do we sometimes don't? |
Thanks everyone for the comments!
clang/test/Lexer/cxx1z-trigraphs.cpp | ||
---|---|---|
24 | The character escape in the current error message is handled in CharacterLiteral::print (https://github.com/llvm/llvm-project/blob/fcb6a9c07cf7a2bc63d364e3b7f60aaadadd57cc/clang/lib/AST/Expr.cpp#L1064-L1083), and the reason for '\0' being escaped to \x00 and '\u{9}' escaped to \t is that escapeCStyle does not escape null character (https://github.com/llvm/llvm-project/blob/c1c86f9eae73786bcdacddaab248817c4f176935/clang/include/clang/Basic/CharInfo.h#L174-L200). |
Address review comments
- Print the character representation only when the type of the expressions is char or char8_t
- Use pushEscapedString in the printing so that we can reuse its escaping logic
- Use escapeCStyle to escape whitespace characters
- wchar_t and charN_t are not handled yet
clang/lib/Basic/Diagnostic.cpp | ||
---|---|---|
838–858 ↗ | (On Diff #545582) | The use UCN addition is probably not justified. we should consistent in how we print the value of non-printable code points. |
clang/lib/Sema/SemaDeclCXX.cpp | ||
16898–16899 | A different way to do that would be to have a 'Escape Whitespaces' parameter on pushEscapedString that would also escape \t, \n, etc (I don't think we want to escape SPACE) |
clang/lib/Basic/Diagnostic.cpp | ||
---|---|---|
838–858 ↗ | (On Diff #545582) | My motivation here is to print a valid character literal. I think it justifies this change somewhat. I'd like to see what others think about this. |
clang/lib/Sema/SemaDeclCXX.cpp | ||
16898–16899 | If you mean \u0020 by SPACE, it won't be escaped by this code. |
clang/lib/Basic/Diagnostic.cpp | ||
---|---|---|
838–858 ↗ | (On Diff #545582) | Why? there is no expectation that diagnostics messages reproduce C++. Consistency between string literals and characters literals is a lot more important |
I've been thinking about it and I think I have a cleaner design for the printing of characters:
We need a CharToString(unsigned, Qualtype) -> SmallString method that takes a value and the type.
for char and char8_t we can just return the value.
For wchar_t, char32_t and char16_t, we can use something like ConvertCodePointToUTF8 to convert to UTF-8. If that fails we can escape with \x
If we pass the result of that to the diagnostic engine, escaping of non printing character would happen automatically.
That way we have a nice separation between converting an APValue to string and printing it, which should avoid code duplication quite a bit, and generally make the design nicer.
Then maybe we need to consider whether we want to modify CharacterLiteral::print to be aligned with all of that. I don;t know if that's used, for example, for mangling.
Given there are a bunch of different issues here, i would not mind separate PRs - having the numerical value showed in paren seems valuable on its own.
Thanks for the suggestion. Printing of multibyte character is a bit out of the scope of the original goal of this patch, but I'll give it a try.
clang/lib/Basic/Diagnostic.cpp | ||
---|---|---|
838–858 ↗ | (On Diff #545582) | The diagnostic meesage here looks like expression evaluates to 'VALUE1 == VALUE2' |
Address comments from Corentin
- Use default pushEscapedString escaping (<U+0001>) instead of UCN representation \u0001
- Convert multi-byte characters (wchar_t, char16_t, char32_t) to UTF-8 and prints them.
- Added CharToString utility function
This is starting to look pretty good!
I'm happy with the general direction, my only concern is that printing a prefix does not seem useful - we are trying to display the value, not how it was produced.
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16839 | We have similar switches in
Sadly they all look at different things so I don't know if we could refactor all of that. | |
16875 | Looking at the diagnostics, I don't think it makes sense to print a prefix here. You could just leave that part out. |
Address comments from Corentin
- Remove printing of character type prefix
- Added code example in release note
- Removed unnecessary static_cast
One concern from my side is that some unicode characters like U+FEFF (I added in test) are invisible, but it may not be a big concern because we also display integer representation in parens.
This is not an easy problem to solve. pushEscapedString does not escape formatting code points, such as U+FEFF because doing so break a bunch of scripts/emojis.
There are cases where it should probably be escaped but that fully depend on context, and it would require grapheme clusterization, which is a lot of work for limited value.
I'd rather we don't change that for now.
clang/docs/ReleaseNotes.rst | ||
---|---|---|
103–137 | @aaron.ballman One one hand this is nice, on the other hand maybe too detailed. What do you think? |
clang/docs/ReleaseNotes.rst | ||
---|---|---|
103–137 | I'm happy with it -- better too much detail than too little, but this really helps users see what's been improved and why it matters. That said, I think 0x0A and 0x1F30D would arguably be better than printing the values in decimal. For \n, perhaps folks remember that it's decimal value 10, but nobody is going to know what 127757 means compared to the hex representation (esp because the value is specified in hex with the prefix printed in the error message). WDYT? |
clang/docs/ReleaseNotes.rst | ||
---|---|---|
103–137 | For wchar_t, charN_t I think that makes sense. |
clang/docs/ReleaseNotes.rst | ||
---|---|---|
103–137 | I generally agree that hex code would be better for characters.
For 1, I think we should always print unsigned code point for all textual types for consistency. Also we don't want to print -0x3 for L'\xFFFD' on targets where wchar_t is signed and 16-bit width (I haven't checked whether that target exists, though). static_assert((char)-1 == (unsigned char)-1); WDYT? |
clang/docs/ReleaseNotes.rst | ||
---|---|---|
103–137 |
Personally, I find -0x01 to be kind of weird and I slightly prefer 0xFF.
I don't have a strong opinion on this because I think we can come up with arguments for either approach. My intuition is that we should just use hex values everywhere, but others may have a different opinion. |
@abhina.sreeskantharajan, does this patch assume too much about the characters displayable for diagnostic output?
@hubert.reinterpretcast It does not, Unicode characters are only escaped in Diagnostics.cpp, and I think this is what we want.
Currently, llvm assume UTF-8 terminal, except on Windows where we convert to UTF-16 and use the wide windows APIs (raw_fd_ostream::write_impl).
If we want to extend that - IE support EBCDIC, I assume this is your question - we probably would want to modify pushEscapedString (Diagnostics.cpp), to consider a restricted set of characters as printable.
There are some questions in how we should do that, it could be a compile time configuration, or we need a way to 1/ detect the encoding of the environment, in a way similar to P1885 2/ construct the set of printable characters on that platforms.
Trying to encode all characters < U+00FF might be a reasonable way to build such table.
Address comments from Aaron
- Use hex code for integer representation of textual types
- NFC stylistic changes
I am skeptical of the extent to which that assumption is exercised in a problematic manner today. The characters being emitted (aside from the [U+0020, U+007E] fixed message text itself) generally come from the text of the source file, which is generally written using characters that the user can display (even if they are not "basic Latin" characters).
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16875 | Why is removing the prefix better? The types can matter (characters outside the basic character set are allowed to have negative char values). Also, moving forward, the value of a character need not be the same in the various encodings. |
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16865–16871 | Add FIXME for char and wchar_t cases that this assumes Unicode literal encodings. | |
16877–16880 | @aaron.ballman, hex output hides signedness. I think we want hex and decimal. | |
clang/test/SemaCXX/static-assert.cpp | ||
281 | The C++23 escaped string formatting facility would not generate a trailing combining character like this. I recommend following suit. Info on U+0335: https://util.unicode.org/UnicodeJsps/character.jsp?a=0335 |
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16875 | Some fun with signedness (imagine a more realistic example with ISO-8859-1 ordinary character encoding with a signed char type): $ clang -Xclang -fwchar-type=short -xc++ -<<<$'static_assert(L"\\uFF10"[0] == U\'\\uFF10\');' <stdin>:1:15: error: static assertion failed due to requirement 'L"\xFF10"[0] == U'\uff10'' 1 | static_assert(L"\uFF10"[0] == U'\uFF10'); | ^~~~~~~~~~~~~~~~~~~~~~~~~ <stdin>:1:28: note: expression evaluates to ''0' (0xFF10) == '0' (0xFF10)' 1 | static_assert(L"\uFF10"[0] == U'\uFF10'); | ~~~~~~~~~~~~~^~~~~~~~~~~~ 1 error generated. Return: 0x01:1 Fri Aug 11 23:49:02 2023 EDT |
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16865–16871 | If we wanted such fixme, it should be L1689. | |
16875 | Either we care about the actual character - ie 'a', or it's value (ie 42). The motivation for the current patch is to add the value to the diagnostic message. | |
clang/test/SemaCXX/static-assert.cpp | ||
281 | This is way outside the scope of the patch. The diagnostic output facility has no understanding of combining characters or graphemes and do not attempt to match std::print. It probably would be an improvement but this patch is not trying to modify how all diagnostics are printed. (all of that logic is in Diagnostic.cpp) |
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16805 | It does not seem that the first parameter expects a CodePoint argument in all cases. For Char_S, Char_U, and Char8, it seems the function wants to treat the input as a UTF-8 code unit. I suggest changing the argument to be clearly a code unit (and potentially treat it as a code point value as appropriate later in the function). Also: The function should probably be declared as having static linkage. | |
16812 | For types other than Char_S, Char_U, and Char8, this fails to treat the C1 Controls and Latin-1 Supplement characters as Unicode code points. It looks like test coverage for these cases are missing. | |
16865–16871 | The ConvertCharToString has a first parameter called CodePoint. With that interface[^1], it is sensible to insert conversion from the applicable literal encoding to a Unicode code point value here (thus my request for a FIXME here). You are probably right that the FIXME belongs elsewhere. If you were thinking what I am thinking, then I am guessing you meant L16894? That is where the ConvertCharToString function seems to assume that a wchar_t value is directly a "code point value". To generate hex escapes, the function needs to be passed the original value (including for chars, e.g., to handle stray code units). Once the interface is updated (i.e., the parameter is renamed), the ConvertCharToString function would more clearly be the place to put one or more FIXMEs about encoding assumptions. [^1]: It turns out that the parameter is already not treated consistently as a code point value within the function (and by the caller) and the parameter is just badly named. | |
16875 | Maybe the motivation for the current patch is to add the value, but what it does (for wide characters as defined in C) is to add the character (and obfuscate the value). Observe the status quo (https://godbolt.org/z/Wc6nKvTMn): note: expression evaluates to '-240 == 65296' From the output higher up (with this patch), we see two "identical" characters and values (due to lack of decimal value output). With decimal value output added, it will still be potentially confusing why the two identical characters have different values (without some sort of type annotation). I admit that the confusion arises in the status quo treatment of signed char and unsigned char. I hope I am using the word correctly when I say that it is ironic that the patch breaks in one context what it seeks to fix in another. | |
clang/test/SemaCXX/static-assert.cpp | ||
281 | This patch is pushing the envelope of what appears in diagnostics. One can also argue that someone writing static_assert(false, "\u0301"); gets what they deserve, but that case does not have a big problem anyway (because the provided message text appears after : ). This patch increases the exposure of the diagnostic output facility to input that it does not handle well. I disagree that it is outside the scope of this patch to insist that it does not generate such inputs to the diagnostic output facility (even if a possible solution is to modify the diagnostic output facility first). |
Thanks @cor3ntin for the insight. I agree that this is a separate concern that also applies to static assert messages.
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16805 | Agreed, CodeUnit or Value would be more correct (mostly because of numeric escape sequences). | |
16812 | escapeCStyle is one of the things that assume ASCII / UTF, but yes, we might as well reduce to 0x7F just to avoid unnecessary work |
I've discussed offline with @hubert.reinterpretcast and agree with him that with the addition of fexec-charset support, the set of characters deemed printable will not be accurate when other encodings are used. This will be similar to the printf/scanf format string validation issue I mentioned in my RFC and would require us to reverse the conversion or keep the original string around to check if the character is printable. I don't think we have finalized a solution on how to handle these issues yet.
Furthermore, it is reasonable for the scope of the current patch to focus on producing UTF-8 (or UTF-16 for Windows) output to the "terminal".
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16812 |
I meant (with a signed char type to trigger the assertion): <stdin>:1:28: note: expression evaluates to ''<A2>' (0xA2) == '<A2>' (0xA2)' 1 | static_assert(u"\u00a2"[0] == '<A2>'); | ~~~~~~~~~~~~~^~~~~~~~~ should be: <stdin>:1:28: note: expression evaluates to ''¢' (0xA2) == '<A2>' (0xA2)' 1 | static_assert(u"\u00a2"[0] == '<A2>'); | ~~~~~~~~~~~~~^~~~~~~~~ |
Address some review comments
- Renamed ConvertCharToString to WriteCharValueForDiagnostic
- Made the function static
- Fixed the printing for unicode 0x80 ~ 0xFF
- Added decimal value next to the hex code
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16803–16804 | Suggest wording tweaks. | |
16830–16831 | Try using StringRef. | |
16833 | Since the function interface has been clarified, this part actually doesn't need a FIXME. The FIXME should instead be added to the comment above the function declaration. |
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16853–16859 | Minor nit: Braces no longer needed. |
clang/test/SemaCXX/static-assert-cxx26.cpp | ||
---|---|---|
304 | Is the expected note up to date? I don't see code that would generate the <U+0001> output. Am I just missing it? Since U+0001 is a valid, though non-printable, character, I would expect more '\u0001'. | |
clang/test/SemaCXX/static-assert.cpp | ||
268–271 | Here too, I find the '<U+0000>' presentation surprising; either of '\0' or '\u0000' would be preferred. |
clang/test/SemaCXX/static-assert-cxx26.cpp | ||
---|---|---|
304 | See elsewhere in the discussion. this formating is pre existing and managed at the DiagnosticEngine level (pushEscapedString). the reason it's not \u0001 is 1/ to avoid reusing c++ syntactic elements for something that comes from diagnostics and is not represented as an escaped sequence in source 2/ \u00011 is unreadable, and \U000000001 is also not helpful :) |
clang/test/SemaCXX/static-assert.cpp | ||
---|---|---|
281 | Are you looking for that sort of examples? https://godbolt.org/z/c79xWr7Me |
clang/test/SemaCXX/static-assert-cxx26.cpp | ||
---|---|---|
304 | Thanks for the explanation. I'm not sure that I agree with the rationale for (1) though. We're already putting the value in single quotes and representing some values with escapes in many of these cases when the value isn't produced by an escape sequence (or even a character/string literal); why exclude \uXXXX? I agree with the rationale for (2); we could use '\u{1}' in that case. |
clang/test/SemaCXX/static-assert-cxx26.cpp | ||
---|---|---|
304 | FYI afaik the notation in clang predates the existence of \u{} by a few years, and follow Unicode notation (https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html). I don't care about changing the syntax, but i do hope we are consistent. Ultimately what we are trying to do is to designate a unicode codepoint and whether we do it through C++ syntax or not probably does not matter much as long as it's clear, delimited and consistent! |
clang/test/SemaCXX/static-assert-cxx26.cpp | ||
---|---|---|
304 | I think the substitution of <U+XXXX> by the diagnostic engine itself is perfectly fine and good; particularly when it has no context to suggest a different presentation. In this particular case, where the character is being presented using C++ syntax as a character literal, I would prefer that C++ syntax be used consistently. From an implementation standpoint, I'm suggesting that WriteCharValueForDiagnostic() be modified such that, if escapeCStyle<EscapeChar::Single>() returns an empty string, that the character be presented in '\u{XXXX}' form if the character is one that would otherwise be substituted by the diagnostic engine (e.g., if isPrintable() is false). Note that this would be restricted to char values <= 0x7F; larger values could still be passed through as invalid code units that the diagnostic engine would then render as, e.g., '<FC>'. | |
clang/test/SemaCXX/static-assert.cpp | ||
281 | gcc and MSVC get that case "right" (probably by accident). https://godbolt.org/z/Tjd6xnEon |
clang/lib/Sema/SemaDeclCXX.cpp | ||
---|---|---|
16877 | To have the diagnostic printer handle separating any potential grapheme (if it is capable of doing so--potentially in the future), we need to isolate the result of WriteCharValueForDiagnostic in a separate message substitution. | |
clang/test/SemaCXX/static-assert.cpp | ||
281 | I was more looking for cases where the output grapheme includes elements that were part of the fixed message text (gobbling quotes, etc.). Also, this patch is in the wrong shape for handling this concern in the diagnostic printer because the delimiting of the replacement text happens in this patch. |
clang/test/SemaCXX/static-assert-cxx26.cpp | ||
---|---|---|
304 | We should take a decision before forcing the author to do further change as to avoid going in circle. I'm really not fan of duplicating code, spreading the logic in multiple places and having multiple ways to render a an invalid char. But I'm also concerned about spending too much time on char literals in static_assert. It might not be a common enough use case to warrant that much scrutiny :) So I would be happy to go with _any_ direction | |
clang/test/SemaCXX/static-assert.cpp | ||
281 | Could you craft a message that becomes a graphene after substitution by the engine? Maybe? You would have to try very hard and static_assert diagnostics are not of the right shape.
I still don't see it. None of the output produce in this patch are even close to what i could be problematic. ie this patch is only ever producing ASCII or single codepoints that gets escaped when they are not printable |
clang/test/SemaCXX/static-assert.cpp | ||
---|---|---|
281 |
That is what I meant by this patch introducing new situations.
This patch produces single codepoints that are not escaped even when they may combine with a ' delimiter. This patch also (currently) forms the string with the ' and the potentially-combining character directly adjacent to each other. The least this patch should do is to emit the potentially-combining character to the diagnostic facility as a separate substitution (that is, if we can agree that the diagnostic facility should consider substitution boundaries as separate text elements; i.e., no graphemes should be formed partially by a substitution). Whether the diagnostic facility should use an escape sequence or a ZWNJ at the boundary can be a different discussion. |
clang/test/SemaCXX/static-assert.cpp | ||
---|---|---|
281 | Sure we could make it a separate message, although we added so much redundant information in the crafterdmessage it might be tricky. But now that I understand your concern it's the fact that if the codepoints is a grapheme extend ( so we are printing a char16_t or something with at least as many bytes), whether or not it would be rendered as a glyph or be escaped, if clang behavior were to escape a printable lone combining character (which it is currently not) would depend on whether it is passed directly or not to the diag engine. Sure. At least now I get what you mean. Would you be happy with a fixme in the code? |
I had a chat with @hubert.reinterpretcast and @tahonermann.
We reached consensus on wanting to make sure the codepoint value is formatted in a future-proof way so that if we ever implement escaping of lone combining codepoint, this case would be handled as well.
To do that, we can:
- * Expose the escaping mechanism) (done by pushEscapedString in Diagnostic.cpp) as a new EscapeStringForDiagnostic function. Use that to escape the code point (line 16921)
I think that should be the last change we need here. Thanks for being patient with us!
Thank you so much to everyone for guiding this patch onto the right track! I'll submit GitHub PR after making suggested changes.
Since Phabricator is going to be shutdown, I mark this differential as abandoned.
@hazohelet Please keep this patch on Phab. It's not going to be shutdown. The current consensus is that we will reconsider shutting down phab on November 15
https://discourse.llvm.org/t/update-on-github-pull-requests/71540/125
@aaron.ballman One one hand this is nice, on the other hand maybe too detailed. What do you think?