This is an archive of the discontinued LLVM Phabricator instance.

[clang][Interp] Implement String- and CharacterLiterals
ClosedPublic

Authored by tbaeder on Oct 6 2022, 8:01 AM.

Diff Detail

Event Timeline

tbaeder created this revision.Oct 6 2022, 8:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 6 2022, 8:01 AM
tbaeder requested review of this revision.Oct 6 2022, 8:01 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 6 2022, 8:01 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript
tbaeder added inline comments.Oct 6 2022, 11:13 PM
clang/lib/AST/Interp/ByteCodeExprGen.cpp
448

The 0 here is unused in emitConst anyway. I have a follow-up NFC patch locally to remove that parameter from emitConst altogether.

aaron.ballman added inline comments.Oct 7 2022, 10:14 AM
clang/lib/AST/Interp/ByteCodeExprGen.cpp
447

Should this be concerned about the sign of the promoted literal?

clang/test/AST/Interp/arrays.cpp
135

I think you need a more coverage for character literals. Some test cases that are interesting: multichar literals ('abcd'), character literals with UCNs ('\uFFFF'), character literals with numeric escapes ('\xFF'). I'm especially interested in seeing whether we handle integer promotions properly, especially when promoting to unsigned int.

tbaeder updated this revision to Diff 466262.Oct 8 2022, 12:20 AM
tbaeder added inline comments.Oct 8 2022, 12:22 AM
clang/lib/AST/Interp/ByteCodeExprGen.cpp
447

Switched to emitConst() that takes into account signedness.

clang/test/AST/Interp/arrays.cpp
135

I added two more test cases but I'm generally not that familiar with character literal edge cases and integer promotion, so if you have concrete test cases in mind, that would be great :)

shafik added inline comments.Oct 9 2022, 4:06 PM
clang/test/AST/Interp/arrays.cpp
128

Also 2["foobar"]

135

We can find GNU documentation of multi-char literals here: https://gcc.gnu.org/onlinedocs/cpp/Implementation-defined-behavior.html#Implementation-defined-behavior and I believe we follow the same scheme.

There are some weird cases like '\8' which compilers seem to treat consistently but generate a diagnostic for.

tbaeder updated this revision to Diff 466461.Oct 10 2022, 4:01 AM
tbaeder marked an inline comment as done.
aaron.ballman added inline comments.
clang/test/AST/Interp/arrays.cpp
135

CC @tahonermann and @cor3ntin as text encoding code owners -- they likely know all the worst test cases to be thinking about in this space.

cor3ntin added inline comments.Oct 10 2022, 7:50 AM
clang/test/AST/Interp/arrays.cpp
135

Most weirdness are taken care of during lexing.
What is interesting to test during evaluation

  • multi character literals ie 'abcd'
  • u8 (utf8) and u literals (utf16)

Note that wide characters literals were removed from C++ so it's no longer a concern

tahonermann added inline comments.Oct 10 2022, 2:06 PM
clang/lib/AST/Interp/ByteCodeExprGen.cpp
445–446

T isn't used and classify() already returns an Optional<PrimType>, so no conversion coercion appear to be intended here. I think this can be simplified to:

if (classify(E))
  this->emitConst(E, E->getValue());
clang/test/AST/Interp/arrays.cpp
135

I believe Corentin meant that wide multicharacter literals were removed from C++. That is true for C++23 (they were removed by the adoption of P2362R3 during the July 2022 WG21 plenary) but that change was not adopted as a DR, so they remain part of the language for prior language modes. However, they were a conditionally supported feature that, as of Clang 14, are diagnosed as an error in all language modes. All that to say that I agree; no test for them needed.

138

This check depends on the source file being UTF-8 encoded. Perhaps compare against 0x0001F60EL instead.

143

As others already noted, additional testing of multicharacter literals and UCNs (including named universal characters like \N{LATIN_CAPITAL_LETTER_E} would be beneficial. Some tests of character escapes like \t wouldn't hurt either.

Clang does not yet support use of -fexec-charset to set the literal encoding (execution character set) to anything other than UTF-8 though work on that has been done (see D93031). If such work was completed, it would be useful to run some tests against a non-UTF-8 encoding. Maybe next year.

cor3ntin added inline comments.Oct 10 2022, 2:29 PM
clang/test/AST/Interp/arrays.cpp
143

Yes, wide multicharacter literals, that's was important information missing, thanks for spotting that.

I have mixed feeling about adding tests for escape sequences. Their replacement doesn't happen during constant evaluation.
We shouldn't replicate the lexing tests here.

but we should compare string literal with byte values. Testing a string literal against another one doesn't ensure the code units are correct if both are equally miss evaluated.

Also we could add explicit tests for null termination here as they are added as part of evaluation in theory - but then again that's also something clang does earlier.

If we want we could consider enabling the byte code interpreter on the existing lexing test files, i actually think that's the better way to deal with the escape sequences tests.

tbaeder marked 3 inline comments as done.Oct 10 2022, 10:29 PM
tbaeder added inline comments.
clang/lib/AST/Interp/ByteCodeExprGen.cpp
445–446

Good point. Since the value of the expression is always an APInt, might was well skip the classify() altogether. emitConst() does the classify anyway.

clang/test/AST/Interp/arrays.cpp
143

I changed the first test that inspects all characters of a string to comparing with integers instead. Do you have a suggestion for what lexing tests to enable the constant interpreter in?

tbaeder updated this revision to Diff 466700.Oct 10 2022, 10:29 PM
tbaeder marked an inline comment as done.
cor3ntin added inline comments.Oct 11 2022, 8:16 AM
clang/test/AST/Interp/arrays.cpp
143

I think good candidates are

Lexer/char-escapes.c
Lexer/char-escapes-delimited.c
Lexer/char-literal.cpp

tahonermann added inline comments.Oct 11 2022, 8:49 AM
clang/test/AST/Interp/arrays.cpp
143

Of those, only Lexer/char-escapes.c does much validation of literal values. I prefer the approach Timm has already taken relative to those tests.

It looks like we don't have an existing set of Sema tests for character and string literals. How about we move this test under clang/test/Sema? That would be the appropriate place to exercise values relative to -fexec-charset support for non-UTF-8 encodings in the future. If that sounds amenable, then I'd like the test split to exercise character and string literals separately.

The character literal tests don't really belong in a test named arrays.cpp as is.

cor3ntin added inline comments.Oct 11 2022, 9:00 AM
clang/test/AST/Interp/arrays.cpp
143

Of those, only Lexer/char-escapes.c does much validation of literal values. I prefer the approach Timm has already taken relative to those tests.

We can do both, I was not arguing against the test we have here, I'm happy with those :)
I'm opposed to duplicate tests for escape sequences here. using the new interpreter on tests that already exist (in addition of the existing run commands) is pretty easy and cheap to do.

I don't have opinions how the new interpreter tests are organized.
Ideally we would have a complete set of test that is equally suitable for both constexpr engines, but maybe that's something that does not need to be done as part of this PR

tbaeder added inline comments.Oct 12 2022, 12:10 AM
clang/test/AST/Interp/arrays.cpp
143

I've added a new run line to test/Lexer/char-escapes.c, which works fine. To summarize, the plan is to add a new test to clang/test/Sema? Or split the char tests out from arrays.cpp and add some for multicharacter char sequences?

tahonermann added inline comments.Oct 12 2022, 11:26 AM
clang/test/AST/Interp/arrays.cpp
143

I'm a bit uncertain regarding the purpose of tests in the Interp directory as opposed to tests under the Sema* directories. Basically, I'm unsure why we wouldn't want just one set of tests that exercise the semantics of the language, at least with regard to constant evaluation. If the interpreter is also intended to (eventually) interpret full programs (e.g., evaluating main()), I think such tests would be appropriate for an Interp directory. But I say that having not closely followed the development and previous reviews of the new interpreter.

At a minimum, I would like to see the character and string literal testing pulled out of arrays.cpp. For the rest, I'll defer to others that have been more involved in these reviews.

aaron.ballman added inline comments.Oct 12 2022, 11:48 AM
clang/test/AST/Interp/arrays.cpp
143

I'm a bit uncertain regarding the purpose of tests in the Interp directory as opposed to tests under the Sema* directories. Basically, I'm unsure why we wouldn't want just one set of tests that exercise the semantics of the language, at least with regard to constant evaluation.

The line is a bit blurry; there's certainly semantic tests happening for the constant expression interpreter. I think where I draw the line is on what the goal of the test is. If the goal is to test new interpreter behavior, it goes in Interp. If the goal is to test semantic behavior (and happens to test the new interpreter as well), then Sema seems more reasonable.

Ultimately, the really important bit is getting the tests *at all*, so thank you both for your advice on this!

At a minimum, I would like to see the character and string literal testing pulled out of arrays.cpp.

I agree, it'd be nice to put them into a file more focused on literals. It's fine to have some string literal and character literal testing in arrays (a string is an array of characters, after all), but it shouldn't be the primary place where that coverage lives.

tbaeder updated this revision to Diff 467399.Oct 13 2022, 1:43 AM

I've moved the tests to literals.cpp, is there anything else left?

cor3ntin accepted this revision.Oct 27 2022, 1:53 AM

Give some time for Aaron/Tom to have the chance to raise further comment but I'm happy with it otherwise.
Thanks for doing that, and sorry I missed you earlier ping.

This revision is now accepted and ready to land.Oct 27 2022, 1:53 AM
aaron.ballman accepted this revision.Oct 27 2022, 5:29 AM

LGTM aside from a testing request.

clang/test/AST/Interp/literals.cpp
342 ↗(On Diff #467399)

I'd like to see the same test for u, U, and u8 prefixes (which should all fail the same was as L).

tbaeder updated this revision to Diff 471144.Oct 27 2022, 5:54 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Oct 28 2022, 4:13 AM

One of your commits doesn't build on windows with clang-cl as host compiler: http://45.33.8.238/win/68965/step_4.txt

Please take a look and revert for now if it takes a while to fix.

Give some time for Aaron/Tom to have the chance to raise further comment but I'm happy with it otherwise.

I'm sorry for the delay getting back to this and I see this has already been pushed (and a fix then pushed). The changes look good to me too.