Details
- Reviewers
aaron.ballman tahonermann - Group Reviewers
Restricted Project - Commits
- rGdbfe446ef3b2: [Clang] Implement CWG2640 Allow more characters in an n-char sequence
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
I dont know the lexer well enough to do better than this review, but the code itself seems fine.
clang/www/cxx_dr_status.html | ||
---|---|---|
15650 | Shouldn't this be 'unreleased' and 'Clang 16'? |
clang/test/CXX/drs/dr26xx.cpp | ||
---|---|---|
41 | Why do we get two errors here? Also, can you add this case: int y = a\N{LOTUS}); to show that we now form an identifier for a\N{LOTUS} instead of trying to expand a as a macro; we should get a new diagnostic about the stray ) instead. |
Avoid duplicating errors in macros.
Because of that we cannot alays recover nicely with a loose matching.
clang/test/CXX/drs/dr26xx.cpp | ||
---|---|---|
41 | I did manage to improve that a bit, took a while. |
Restore a macro test
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3382–3383 | The scenario we want to avoid: There is a tentative parse of a token, no diagnostic is emitted. | |
clang/test/Preprocessor/ucn-pp-identifier.c | ||
132–134 | It doesn't test anything anymore, as the comma is part of the escape sequence now |
LGTM with a request for a comment, but please give @tahonermann a chance to look at the review before landing.
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3382–3383 | Ahhh okay, that makes sense. I think it'd help to add a comment here to explain that because it's a bit subtle. |
This looks really good. I added a suggested edit for a comment that I had a hard time understanding and noted an area of code that I'm not sure is working as expected.
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3382–3388 | I'm having a hard time understanding what this comment is trying to convey. The comment response to Aaron was much more helpful to me. How about this suggested edit? | |
3401 | This is a bit of a tangent, but under what circumstances would CurPtr - StartPtr not be equal to Buffer.size() + 4? Actually, I'm not sure that +4 is correct. It looks like StartPtr is expected to point to N at the beginning of the function, so the initial \ is not included in the range. If N isn't seen, the function returns early. Likewise, if either of the { or } delimiters is not found, the function returns early. I think the goal of this code is to skip the getAndAdvanceChar() machinery when the buffer is being claimed (no need to parse UCNs or trigraphs in the character name), but it looks like, particularly with this DR, that we might always be able to do that. |
Thanks Tom, I replied to your comments
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3382–3388 | It's not that a diagnostic would be emitted twice, is that it would not be emitted at all. | |
3401 | afaict, 4 is correct here because we are one past-the-end. |
Thanks Tom, I replied to your comments
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3401 | I looked into it, I'm sure we could improve but not easily, getAndAdvanceChar does set some flags on the token in the presence of trigraphs/line splicing and we need those flags to be set, this is the reason we do need to call that method. |
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3382–3388 | More evidence of the difficulty I'm having appreciating this comment :) Does this sound right? // Only map a loose match to a code point when in a diagnosing state. // If a mapped code point were to be returned in a non-diagnosing state, // token caching would prevent the opportunity to issue a diagnostic when // the token is later used. Tangent: It might be worth renaming one of Res and Result. I keep getting confused due to the similar names. Perhaps rename Result to Token or ResultToken? | |
3401 | My concern is that, as is, the code always takes the else branch (except when Result is non-null). Assuming a buffer containing "X", the pointers are arranged as follows (where $ is one past the end). \ N { X } $ | | `- CurPtr | `- Buffer `- StartPtr CurPtr - StartPtr is 4, but Buffer.size() + 4 is 5 (Buffer.size() is 1 in this case). I think there might be an easy test to see if this is working as intended. If it isn't, I would expect a diagnostic to be issued if trigraphs are enabled and the buffer contains a trigraph sequence. Something like: \N{LOTUS??>} | |
clang/test/Preprocessor/ucn-pp-identifier.c | ||
144 | How about adding a test for an escaped new line? I think this is currently UB according to the standard ((lex.phases)p2), but I believe you are trying to change that with P2621. int \N{\ LATIN CAPITAL LETTER A WITH GRAVE}; |
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3382–3388 | I'm not sure caching is exactly right. | |
3401 | I can try to add tests
Yes, the if branch sole purpose is to set some state in Result. At the start of the function, StartPtr points to \ There may be a potential refactor here, which is to have getAndAdvanceChar take a bool & ShouldCleanupToken parameter instead of a token so that we don't have to do that dance, but it's a bit outside of the scope of this patch... | |
clang/test/Preprocessor/ucn-pp-identifier.c | ||
144 | I can add a test, sure |
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3401 |
It doesn't look like it does. The first use of StartPtr is at line 3314 and it expects to read N: 3314: char C = getCharAndSize(StartPtr, CharSize); 3315: assert(C == 'N' && "expected \\N{...}"); |
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3327 | I was able to confirm that StartPtr does point to N. See the diagnostic generated at https://godbolt.org/z/qnajcMeso; the diagnostic caret points to N instead of \. <source>:1:2: warning: incomplete delimited universal character name; treating as '\' 'N' '{' identifier [-Wunicode] \N{abc ^ |
- Rebase
- Add/Improve comments
- Add more trigrahs tests
- Fix crash when a trigraph appears at the end of a named escape sequence and trigraphs are disabled
- Fix when getAndAdvanceChar is called - alas there is no way to write tests for that but I did check in a debugger.
- Rename s/Res/Match
There are still some inefficiencies with getAndAdvanceChar, but it calls for a
much bigger refactor than what should be in scope for this patch.
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3401 | Yes, you are right. There was a bug in \u too, probably has been there for ages. |
This looks good. I'm accepting the change despite one remaining issue that was neither introduced nor addressed by this patch; the diagnostic caret will still be in the wrong location for many of the diagnostics issued by tryReadNamedUCN(). Most of them use StartPtr which we've established points to N instead of \. If you want to make an additional update to address that, I'll be happy to review again. I understand if you would rather that be fixed via a separate change.
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3327 | I think this still needs to be addressed. tryReadNumericUCN() handles this case by requiring that callers pass the location of the \ as SlashLoc. I guess this is technically required since the \ character could be written as either \ or the ??/ trigraph. | |
3401 | Thank you; the added comment makes this more clear. Basically, we only want to skip the call to getAndAdvanceChar() when we're sure the token wasn't spelled with trigraphs or line spaces. |
@cor3ntin, I suspect the answer is no, but do your changes perhaps address this assertion failure? https://godbolt.org/z/1bPcPcWzv
int \u1{234};
Stack dump:
clang++: /root/llvm-project/clang/lib/Lex/LiteralSupport.cpp:382: void clang::expandUCNs(llvm::SmallVectorImpl<char>&, llvm::StringRef): Assertion `Value != -1U' failed. PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace, preprocessed source, and associated run script. Stack dump: 0. Program arguments: /opt/compiler-explorer/clang-assertions-trunk/bin/clang++ -gdwarf-4 -g -o /app/output.s -mllvm --x86-asm-syntax=intel -S --gcc-toolchain=/opt/compiler-explorer/gcc-snapshot -fcolor-diagnostics -fno-crash-diagnostics -std=c++20 <source> 1. <source>:1:5: current parser token '\u1{234}' #0 0x0000564e3c70a2bf llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x3fe12bf) #1 0x0000564e3c70826c llvm::sys::CleanupOnSignal(unsigned long) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x3fdf26c) #2 0x0000564e3c645468 CrashRecoverySignalHandler(int) CrashRecoveryContext.cpp:0:0 #3 0x00007f19aebf0420 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x14420) #4 0x00007f19ae6bd00b raise (/lib/x86_64-linux-gnu/libc.so.6+0x4300b) #5 0x00007f19ae69c859 abort (/lib/x86_64-linux-gnu/libc.so.6+0x22859) #6 0x00007f19ae69c729 (/lib/x86_64-linux-gnu/libc.so.6+0x22729) #7 0x00007f19ae6adfd6 (/lib/x86_64-linux-gnu/libc.so.6+0x33fd6) #8 0x0000564e4055e8b3 clang::expandUCNs(llvm::SmallVectorImpl<char>&, llvm::StringRef) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7e358b3) #9 0x0000564e405dd531 clang::Preprocessor::LookUpIdentifierInfo(clang::Token&) const (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7eb4531) #10 0x0000564e4055646c clang::Lexer::LexIdentifierContinue(clang::Token&, char const*) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7e2d46c) #11 0x0000564e4055993d clang::Lexer::LexTokenInternal(clang::Token&, bool) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7e3093d) #12 0x0000564e4055b1c3 clang::Lexer::Lex(clang::Token&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7e321c3) #13 0x0000564e405e0297 clang::Preprocessor::Lex(clang::Token&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7eb7297) #14 0x0000564e3ed354d2 clang::Parser::ConsumeToken() (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x660c4d2) #15 0x0000564e3ed673a3 clang::Parser::ParseDeclarationSpecifiers(clang::DeclSpec&, clang::Parser::ParsedTemplateInfo const&, clang::AccessSpecifier, clang::Parser::DeclSpecContext, clang::Parser::LateParsedAttrList*, clang::ImplicitTypenameContext) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x663e3a3) #16 0x0000564e3ed3814b clang::Parser::ParseDeclOrFunctionDefInternal(clang::ParsedAttributes&, clang::ParsedAttributes&, clang::ParsingDeclSpec&, clang::AccessSpecifier) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x660f14b) #17 0x0000564e3ed38a6f clang::Parser::ParseDeclarationOrFunctionDefinition(clang::ParsedAttributes&, clang::ParsedAttributes&, clang::ParsingDeclSpec*, clang::AccessSpecifier) (.part.0) Parser.cpp:0:0 #18 0x0000564e3ed3ed59 clang::Parser::ParseExternalDeclaration(clang::ParsedAttributes&, clang::ParsedAttributes&, clang::ParsingDeclSpec*) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x6615d59) #19 0x0000564e3ed3f6ad clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&, clang::Sema::ModuleImportState&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x66166ad) #20 0x0000564e3ed3fb54 clang::Parser::ParseFirstTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&, clang::Sema::ModuleImportState&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x6616b54) #21 0x0000564e3ed335ba clang::ParseAST(clang::Sema&, bool, bool) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x660a5ba) #22 0x0000564e3da51088 clang::CodeGenAction::ExecuteAction() (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x5328088) #23 0x0000564e3d2d6599 clang::FrontendAction::Execute() (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x4bad599) #24 0x0000564e3d25ca9e clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x4b33a9e) #25 0x0000564e3d3bbf43 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x4c92f43) #26 0x0000564e39bddf44 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x14b4f44) #27 0x0000564e39bda027 ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) driver.cpp:0:0 #28 0x0000564e3d0c6829 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::'lambda'()>(long) Job.cpp:0:0 #29 0x0000564e3c645c0a llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x3f1cc0a) #30 0x0000564e3d0c70df clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const (.part.0) Job.cpp:0:0 #31 0x0000564e3d09039c clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x496739c) #32 0x0000564e3d090e1d clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x4967e1d) #33 0x0000564e3d09a24c clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x497124c) #34 0x0000564e39bdc6d2 clang_main(int, char**) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x14b36d2) #35 0x00007f19ae69e083 __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24083) #36 0x0000564e39bd52ce _start (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x14ac2ce) clang-16: error: clang frontend command failed with exit code 134 (use -v to see invocation) Compiler returned: 134
Interesting bug.
No, it doesn't. But i see what's going on here, it's pretty easy to fix!
clang/lib/Lex/Lexer.cpp | ||
---|---|---|
3327 | Done (by adding SlashLoc)! |
This has already been merged, but I'm commenting just to note (continued) approval of the most recent changes. Looks great, Corentin!
Spurious whitespace