This is an archive of the discontinued LLVM Phabricator instance.

[Clang] Implement CWG2640 Allow more characters in an n-char sequence

Authored by cor3ntin on Nov 28 2022, 1:46 PM.

Diff Detail

Event Timeline

cor3ntin created this revision.Nov 28 2022, 1:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2022, 1:46 PM
cor3ntin requested review of this revision.Nov 28 2022, 1:46 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 28 2022, 1:46 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
cor3ntin added a reviewer: Restricted Project.Nov 28 2022, 1:46 PM
cor3ntin updated this revision to Diff 478362.Nov 28 2022, 1:50 PM

Update DR web page

I dont know the lexer well enough to do better than this review, but the code itself seems fine.


Shouldn't this be 'unreleased' and 'Clang 16'?

cor3ntin updated this revision to Diff 478373.EditedNov 28 2022, 2:18 PM

Fix DR status

(I didn't know the script supported versions, neat)

Don't forget the release note! :-)






Should you add:

#define z(x) 0
#define a z(
int x = a\N{abc});

from the issue as a test case as well?

cor3ntin updated this revision to Diff 478631.Nov 29 2022, 10:16 AM
  • Add test
  • Add release notes entry
  • use isVerticalWhitespace
aaron.ballman added inline comments.Nov 29 2022, 10:41 AM

Why do we get two errors here?

Also, can you add this case:

int y = a\N{LOTUS});

to show that we now form an identifier for a\N{LOTUS} instead of trying to expand a as a macro; we should get a new diagnostic about the stray ) instead.

cor3ntin updated this revision to Diff 478679.Nov 29 2022, 12:27 PM

Slightly improve error recovery

cor3ntin updated this revision to Diff 478706.Nov 29 2022, 1:18 PM

Avoid duplicating errors in macros.
Because of that we cannot alays recover nicely with a loose matching.

cor3ntin added inline comments.Nov 29 2022, 3:38 PM

I did manage to improve that a bit, took a while.
The tradeoff is slightly subpar recovery in some cases

aaron.ballman added inline comments.Nov 30 2022, 5:05 AM

Spurious whitespace


Why do we only want to do this if we're diagnosing?


Why did this test get removed?

cor3ntin updated this revision to Diff 478914.Nov 30 2022, 5:32 AM

Remove extraneous space and braces

cor3ntin updated this revision to Diff 478919.Nov 30 2022, 5:41 AM

Restore a macro test


The scenario we want to avoid:

There is a tentative parse of a token, no diagnostic is emitted.
The token gets cached and contains the loose match.
Then that cached token gets reused, and because it is well formed, we never get to emit a diagnostic for a loose match.
Only returning a Loose Match when a diagnostic ensure we always emit the diag once
Note that previously we would always emit a diag, but doing so caused the diag to be emitted twice for macros (once when parsing the directive and once more when substituting.


It doesn't test anything anymore, as the comma is part of the escape sequence now
Thinking about it more, we can do a similar test over multiple lines

LGTM with a request for a comment, but please give @tahonermann a chance to look at the review before landing.


Ahhh okay, that makes sense. I think it'd help to add a comment here to explain that because it's a bit subtle.

This revision is now accepted and ready to land.Nov 30 2022, 6:41 AM
cor3ntin updated this revision to Diff 480062.Dec 5 2022, 5:15 AM

Add comment

@tahonermann Do you want to take a look at this?

tahonermann requested changes to this revision.Dec 8 2022, 3:10 PM

This looks really good. I added a suggested edit for a comment that I had a hard time understanding and noted an area of code that I'm not sure is working as expected.


I'm having a hard time understanding what this comment is trying to convey. The comment response to Aaron was much more helpful to me. How about this suggested edit?


This is a bit of a tangent, but under what circumstances would CurPtr - StartPtr not be equal to Buffer.size() + 4? Actually, I'm not sure that +4 is correct. It looks like StartPtr is expected to point to N at the beginning of the function, so the initial \ is not included in the range. If N isn't seen, the function returns early. Likewise, if either of the { or } delimiters is not found, the function returns early.

I think the goal of this code is to skip the getAndAdvanceChar() machinery when the buffer is being claimed (no need to parse UCNs or trigraphs in the character name), but it looks like, particularly with this DR, that we might always be able to do that.

This revision now requires changes to proceed.Dec 8 2022, 3:10 PM

Thanks Tom, I replied to your comments


It's not that a diagnostic would be emitted twice, is that it would not be emitted at all.
By recovering - ie returning a loose match - we prevent the function to be called again on the same buffer.
So if during the first call Diagnose is false (because we were called from a tryXXXX function), then the compilation will continue assuming we were in fact able to parse an identifier and never informs us of an invalid name.


afaict, 4 is correct here because we are one past-the-end.
I do however agree that this whole pattern which is used a few times is probably unnecessary, i do think it would be good to investigate. Not in this patch though, imo

Thanks Tom, I replied to your comments


I looked into it, I'm sure we could improve but not easily, getAndAdvanceChar does set some flags on the token in the presence of trigraphs/line splicing and we need those flags to be set, this is the reason we do need to call that method.
It's not super efficient but it's such an edge case... I'd rather not touch that now

tahonermann added inline comments.Dec 9 2022, 11:41 AM

More evidence of the difficulty I'm having appreciating this comment :)

Does this sound right?

// Only map a loose match to a code point when in a diagnosing state.
// If a mapped code point were to be returned in a non-diagnosing state,
// token caching would prevent the opportunity to issue a diagnostic when
// the token is later used.

Tangent: It might be worth renaming one of Res and Result. I keep getting confused due to the similar names. Perhaps rename Result to Token or ResultToken?


My concern is that, as is, the code always takes the else branch (except when Result is non-null). Assuming a buffer containing "X", the pointers are arranged as follows (where $ is one past the end).

\ N { X } $
  |   |   `- CurPtr
  |   `- Buffer
  `- StartPtr

CurPtr - StartPtr is 4, but Buffer.size() + 4 is 5 (Buffer.size() is 1 in this case).

I think there might be an easy test to see if this is working as intended. If it isn't, I would expect a diagnostic to be issued if trigraphs are enabled and the buffer contains a trigraph sequence. Something like:


How about adding a test for an escaped new line? I think this is currently UB according to the standard ((lex.phases)p2), but I believe you are trying to change that with P2621.

int \N{\
cor3ntin added inline comments.Dec 9 2022, 11:57 AM

I'm not sure caching is exactly right.
There are tentative parses, that should not give false positive answers. I'll try to message a better comment


I can try to add tests

My concern is that, as is, the code always takes the else branch (except when Result is non-null).

Yes, the if branch sole purpose is to set some state in Result.

At the start of the function, StartPtr points to \
And I'll leave a comment, maybe that will clear up future confusions

There may be a potential refactor here, which is to have getAndAdvanceChar take a bool & ShouldCleanupToken parameter instead of a token so that we don't have to do that dance, but it's a bit outside of the scope of this patch...


I can add a test, sure
P2621 is not super relevant (or rather, it standardize behaviors that clang already has)

tahonermann added inline comments.Dec 9 2022, 12:21 PM

At the start of the function, StartPtr points to \

It doesn't look like it does. The first use of StartPtr is at line 3314 and it expects to read N:

3314:   char C = getCharAndSize(StartPtr, CharSize);
3315:   assert(C == 'N' && "expected \\N{...}");
tahonermann added inline comments.Dec 9 2022, 1:56 PM

I was able to confirm that StartPtr does point to N. See the diagnostic generated at; the diagnostic caret points to N instead of \.

<source>:1:2: warning: incomplete delimited universal character name; treating as '\' 'N' '{' identifier [-Wunicode]
cor3ntin updated this revision to Diff 481766.Dec 9 2022, 3:31 PM
  • Rebase
  • Add/Improve comments
  • Add more trigrahs tests
  • Fix crash when a trigraph appears at the end of a named escape sequence and trigraphs are disabled
  • Fix when getAndAdvanceChar is called - alas there is no way to write tests for that but I did check in a debugger.
  • Rename s/Res/Match

There are still some inefficiencies with getAndAdvanceChar, but it calls for a
much bigger refactor than what should be in scope for this patch.

cor3ntin added inline comments.Dec 9 2022, 3:34 PM

Yes, you are right. There was a bug in \u too, probably has been there for ages.
It's unfortunately not testable, any incorrect value would call getCharAndSize which is going to do the right thing.

tahonermann accepted this revision.Dec 12 2022, 9:02 AM

This looks good. I'm accepting the change despite one remaining issue that was neither introduced nor addressed by this patch; the diagnostic caret will still be in the wrong location for many of the diagnostics issued by tryReadNamedUCN(). Most of them use StartPtr which we've established points to N instead of \. If you want to make an additional update to address that, I'll be happy to review again. I understand if you would rather that be fixed via a separate change.


I think this still needs to be addressed. tryReadNumericUCN() handles this case by requiring that callers pass the location of the \ as SlashLoc. I guess this is technically required since the \ character could be written as either \ or the ??/ trigraph.


Thank you; the added comment makes this more clear. Basically, we only want to skip the call to getAndAdvanceChar() when we're sure the token wasn't spelled with trigraphs or line spaces.

This revision is now accepted and ready to land.Dec 12 2022, 9:02 AM

@cor3ntin, I suspect the answer is no, but do your changes perhaps address this assertion failure?

int \u1{234};

Stack dump:

clang++: /root/llvm-project/clang/lib/Lex/LiteralSupport.cpp:382: void clang::expandUCNs(llvm::SmallVectorImpl<char>&, llvm::StringRef): Assertion `Value != -1U' failed.
PLEASE submit a bug report to and include the crash backtrace, preprocessed source, and associated run script.
Stack dump:
0.	Program arguments: /opt/compiler-explorer/clang-assertions-trunk/bin/clang++ -gdwarf-4 -g -o /app/output.s -mllvm --x86-asm-syntax=intel -S --gcc-toolchain=/opt/compiler-explorer/gcc-snapshot -fcolor-diagnostics -fno-crash-diagnostics -std=c++20 <source>
1.	<source>:1:5: current parser token '\u1{234}'
 #0 0x0000564e3c70a2bf llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x3fe12bf)
 #1 0x0000564e3c70826c llvm::sys::CleanupOnSignal(unsigned long) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x3fdf26c)
 #2 0x0000564e3c645468 CrashRecoverySignalHandler(int) CrashRecoveryContext.cpp:0:0
 #3 0x00007f19aebf0420 __restore_rt (/lib/x86_64-linux-gnu/
 #4 0x00007f19ae6bd00b raise (/lib/x86_64-linux-gnu/
 #5 0x00007f19ae69c859 abort (/lib/x86_64-linux-gnu/
 #6 0x00007f19ae69c729 (/lib/x86_64-linux-gnu/
 #7 0x00007f19ae6adfd6 (/lib/x86_64-linux-gnu/
 #8 0x0000564e4055e8b3 clang::expandUCNs(llvm::SmallVectorImpl<char>&, llvm::StringRef) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7e358b3)
 #9 0x0000564e405dd531 clang::Preprocessor::LookUpIdentifierInfo(clang::Token&) const (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7eb4531)
#10 0x0000564e4055646c clang::Lexer::LexIdentifierContinue(clang::Token&, char const*) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7e2d46c)
#11 0x0000564e4055993d clang::Lexer::LexTokenInternal(clang::Token&, bool) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7e3093d)
#12 0x0000564e4055b1c3 clang::Lexer::Lex(clang::Token&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7e321c3)
#13 0x0000564e405e0297 clang::Preprocessor::Lex(clang::Token&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x7eb7297)
#14 0x0000564e3ed354d2 clang::Parser::ConsumeToken() (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x660c4d2)
#15 0x0000564e3ed673a3 clang::Parser::ParseDeclarationSpecifiers(clang::DeclSpec&, clang::Parser::ParsedTemplateInfo const&, clang::AccessSpecifier, clang::Parser::DeclSpecContext, clang::Parser::LateParsedAttrList*, clang::ImplicitTypenameContext) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x663e3a3)
#16 0x0000564e3ed3814b clang::Parser::ParseDeclOrFunctionDefInternal(clang::ParsedAttributes&, clang::ParsedAttributes&, clang::ParsingDeclSpec&, clang::AccessSpecifier) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x660f14b)
#17 0x0000564e3ed38a6f clang::Parser::ParseDeclarationOrFunctionDefinition(clang::ParsedAttributes&, clang::ParsedAttributes&, clang::ParsingDeclSpec*, clang::AccessSpecifier) (.part.0) Parser.cpp:0:0
#18 0x0000564e3ed3ed59 clang::Parser::ParseExternalDeclaration(clang::ParsedAttributes&, clang::ParsedAttributes&, clang::ParsingDeclSpec*) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x6615d59)
#19 0x0000564e3ed3f6ad clang::Parser::ParseTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&, clang::Sema::ModuleImportState&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x66166ad)
#20 0x0000564e3ed3fb54 clang::Parser::ParseFirstTopLevelDecl(clang::OpaquePtr<clang::DeclGroupRef>&, clang::Sema::ModuleImportState&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x6616b54)
#21 0x0000564e3ed335ba clang::ParseAST(clang::Sema&, bool, bool) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x660a5ba)
#22 0x0000564e3da51088 clang::CodeGenAction::ExecuteAction() (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x5328088)
#23 0x0000564e3d2d6599 clang::FrontendAction::Execute() (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x4bad599)
#24 0x0000564e3d25ca9e clang::CompilerInstance::ExecuteAction(clang::FrontendAction&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x4b33a9e)
#25 0x0000564e3d3bbf43 clang::ExecuteCompilerInvocation(clang::CompilerInstance*) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x4c92f43)
#26 0x0000564e39bddf44 cc1_main(llvm::ArrayRef<char const*>, char const*, void*) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x14b4f44)
#27 0x0000564e39bda027 ExecuteCC1Tool(llvm::SmallVectorImpl<char const*>&) driver.cpp:0:0
#28 0x0000564e3d0c6829 void llvm::function_ref<void ()>::callback_fn<clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const::'lambda'()>(long) Job.cpp:0:0
#29 0x0000564e3c645c0a llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x3f1cc0a)
#30 0x0000564e3d0c70df clang::driver::CC1Command::Execute(llvm::ArrayRef<std::optional<llvm::StringRef>>, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char>>*, bool*) const (.part.0) Job.cpp:0:0
#31 0x0000564e3d09039c clang::driver::Compilation::ExecuteCommand(clang::driver::Command const&, clang::driver::Command const*&, bool) const (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x496739c)
#32 0x0000564e3d090e1d clang::driver::Compilation::ExecuteJobs(clang::driver::JobList const&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&, bool) const (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x4967e1d)
#33 0x0000564e3d09a24c clang::driver::Driver::ExecuteCompilation(clang::driver::Compilation&, llvm::SmallVectorImpl<std::pair<int, clang::driver::Command const*>>&) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x497124c)
#34 0x0000564e39bdc6d2 clang_main(int, char**) (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x14b36d2)
#35 0x00007f19ae69e083 __libc_start_main (/lib/x86_64-linux-gnu/
#36 0x0000564e39bd52ce _start (/opt/compiler-explorer/clang-assertions-trunk/bin/clang+++0x14ac2ce)
clang-16: error: clang frontend command failed with exit code 134 (use -v to see invocation)
Compiler returned: 134
cor3ntin updated this revision to Diff 482279.Dec 12 2022, 3:06 PM

Fix caret position
Fix tests (defining a macro named 'a' caused issues)

@cor3ntin, I suspect the answer is no, but do your changes perhaps address this assertion failure?

int \u1{234};

Interesting bug.
No, it doesn't. But i see what's going on here, it's pretty easy to fix!


Done (by adding SlashLoc)!
There is still room for improvement here but if we wanted to clean that up further we'd need a different patch.

This has already been merged, but I'm commenting just to note (continued) approval of the most recent changes. Looks great, Corentin!