This is an archive of the discontinued LLVM Phabricator instance.

Implement delimited escape sequences.
ClosedPublic

Authored by cor3ntin on Jul 9 2021, 2:09 PM.

Details

Summary

This is a proposed C++ paper (P2290R1) that has not been accepted yet

\x{XXXX} \u{XXXX} and \o{OOOO} are accepted in all languages mode
in characters and string literals.

Diff Detail

Event Timeline

cor3ntin requested review of this revision.Jul 9 2021, 2:09 PM
cor3ntin created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2021, 2:09 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
cor3ntin removed a subscriber: cfe-commits.
cor3ntin updated this revision to Diff 359544.Jul 17 2021, 1:33 AM

Fix tests & build

cor3ntin retitled this revision from Implement delimited escape sequences. to [WIP] Implement delimited escape sequences..Jul 18 2021, 11:45 PM

Because this proposal hasn't been accepted by WG21 or WG14, but the syntax is not defined by either standard, I think it's reasonable to support this but with appropriate warnings about it being a Clang extension. Can you add those diagnostics? Also, this is lacking test coverage for C.

clang/include/clang/Basic/DiagnosticLexKinds.td
129

Wouldn't this always be covered by expected '}'?

131

Empty isn't the only problem though -- what happens if the user specifies too many digits? e.g. \xFFFFFFFFFFFFFFFFF (that's 0xFFFF'FFFF'FFFF'FFFF'F)

133
clang/lib/Lex/LiteralSupport.cpp
150–155

Can't this be determined once we've read the } by looking at the string length?

153

This argument doesn't do anything, btw. Same happens in a few other places.

clang/test/Lexer/char-escapes-delimited.cpp
7

This diagnostic highlights the need to tell the user *which* character in the escape sequence is wrong. Consider: \u{0000O111I} as an example.

15

Can you add test cases that show whether digit separators are allowed or prohibited? (FWIW, I sort of think they should be allowed -- at least, I'm not certain I see a reason to disallow them and they do help with readability, especially for longer hex escapes.)

cor3ntin added inline comments.Jul 20 2021, 12:43 PM
clang/include/clang/Basic/DiagnosticLexKinds.td
129

I guess, I didn't think of that

131

That's covered by existing errors for overflowing the char type. There is no limit in the number of digits in the escape per say.

clang/lib/Lex/LiteralSupport.cpp
153

Indeed!

clang/test/Lexer/char-escapes-delimited.cpp
7

Make sense!

cor3ntin updated this revision to Diff 360391.Jul 21 2021, 2:05 AM
  • Format
  • Add warning for extension
  • Add negative tests for delimited sequences split

between 2 concatenated strings, and sequences containing digit
separators

  • Improve diagnostic messages for invalid characters
  • Stop passing unused parameters to diagnostics
clang/lib/Lex/LiteralSupport.cpp
150–155

I suppose we could. Do you think it would be clearer?

aaron.ballman added inline comments.Jul 21 2021, 5:32 AM
clang/include/clang/Basic/DiagnosticLexKinds.td
128–129
131

Ah, good to know, thanks!

clang/lib/Lex/LiteralSupport.cpp
150–155

I thought it could be cleaner if we also had to check lengths, but because we don't, I think this is fine as-is.

180

Comment here explaining why we're breaking when not delimited would be useful for other readers.

186

Can we pass the char directly, or get away with using a StringRef? It'd be nice to avoid the allocation from std::string.

254

Same here as above.

404

Same here as above.

463

This return looks incorrect because we'll miss diagnosing using the extension in this case even though the user used it.

clang/test/Lexer/char-escapes-delimited.c
43 ↗(On Diff #360391)

Can you test that it's not out of range for L'\o{777}'?

59 ↗(On Diff #360391)

I think this should be accepted rather than rejected. Long hex values are somewhat anticipated for this feature (given that you often need 5 or more characters to spell the interesting Unicode hex values), and digit separators help with readability in those cases. ... ... However, I suppose those *can't* work because of character literals. '\x{1234'5}' would be ambiguous to lex. Ouch, okay, so I guess I don't expect that to work, but having a test case with a character literal that tries to use a digit separator would be good.

cor3ntin added inline comments.Jul 21 2021, 5:45 AM
clang/lib/Lex/LiteralSupport.cpp
186

I can use StringRef yep - but I'm not concerned about allocation because of sbo - I used a pattern used elsewhere in the file

463

Indeed!

cor3ntin updated this revision to Diff 360423.Jul 21 2021, 6:32 AM
  • Add tests for short wchar overflow
  • use string ref instead of string
  • fix diagnostic
cor3ntin updated this revision to Diff 360428.Jul 21 2021, 6:44 AM
  • Fix tests
cor3ntin retitled this revision from [WIP] Implement delimited escape sequences. to Implement delimited escape sequences..Jul 21 2021, 6:46 AM
cor3ntin updated this revision to Diff 360431.Jul 21 2021, 6:49 AM
cor3ntin marked 2 inline comments as done.
  • Add comment to explain why we break out during hex escape parsing
cor3ntin updated this revision to Diff 360434.Jul 21 2021, 6:54 AM
cor3ntin marked 4 inline comments as done.
  • Fix diagnostic message
cor3ntin marked an inline comment as done.Jul 21 2021, 6:55 AM
cor3ntin added inline comments.
clang/test/Lexer/char-escapes-delimited.c
59 ↗(On Diff #360391)

I mean, that just explodes

aaron.ballman added inline comments.Jul 21 2021, 7:20 AM
clang/test/Lexer/char-escapes-delimited.c
78–80 ↗(On Diff #360434)

I think the test is a bit off because you couldn't have a trailing digit separator so 12'3' wouldn't be valid anyway. What's the behavior if you use 12'3 instead?

cor3ntin updated this revision to Diff 362034.Jul 27 2021, 8:12 AM
cor3ntin marked an inline comment as not done.
  • Add delimited escape sequences in identifiers
cor3ntin updated this revision to Diff 362079.Jul 27 2021, 9:52 AM

When preprocessing, treat incomplete or empty
delimited UCNs as warning, in the spirit of
[lex.pptoken]/3.1

Overflow is still treated as an error.

Mostly some small nits.

clang/include/clang/Basic/DiagnosticLexKinds.td
132

This describes what the code is, but not why it's an issue. How about: delimited escape sequence cannot be empty or something along those lines?

143–148

I think we can get away with only one diagnostic here.

clang/lib/Lex/Lexer.cpp
3015 ↗(On Diff #362079)
3054 ↗(On Diff #362079)

Should fix this formatting issue. You can also elide the braces per our usual coding conventions.

3060 ↗(On Diff #362079)

Should this be using Diagnose instead?

3077 ↗(On Diff #362079)
3096 ↗(On Diff #362079)
clang/lib/Lex/LiteralSupport.cpp
460–463
465–468
cor3ntin updated this revision to Diff 368754.Aug 25 2021, 3:23 PM
cor3ntin marked 8 inline comments as done.

Address most of the comments

I left warn_delimited_ucn and err_ucn_escape_incomplete
as separate warnings as I think it's easier to understand,
and could not find a way to escape a closing brace inside
a %select{}.

cor3ntin updated this revision to Diff 368756.Aug 25 2021, 3:29 PM

Fix tests

aaron.ballman accepted this revision.Sep 8 2021, 8:57 AM

Because this proposal hasn't been accepted by WG21 or WG14, but the syntax is not defined by either standard, I think it's reasonable to support this but with appropriate warnings about it being a Clang extension. Can you add those diagnostics? Also, this is lacking test coverage for C.

This proposal has now been seen by both WG14 and WG21 and has sentiment for support in both committees. I think the extension diagnostic is still reasonable (we'll want to support this syntax in older language modes when both committees adopt it for a newer language mode).

LGTM, I'm down to only some minor nits with the patch. @rsmith, do you have any concerns you'd like to see addressed before landing this?

clang/include/clang/Basic/DiagnosticLexKinds.td
128

We usually capitalize Clang for this kind of diagnostic.

clang/lib/Lex/Lexer.cpp
3098 ↗(On Diff #368756)
clang/lib/Lex/LiteralSupport.cpp
167
324

Better to use something that can't be confused with a type name (clang::Type), also drops the top-level const.

This revision is now accepted and ready to land.Sep 8 2021, 8:57 AM

Because this proposal hasn't been accepted by WG21 or WG14, but the syntax is not defined by either standard, I think it's reasonable to support this but with appropriate warnings about it being a Clang extension. Can you add those diagnostics? Also, this is lacking test coverage for C.

This proposal has now been seen by both WG14 and WG21 and has sentiment for support in both committees. I think the extension diagnostic is still reasonable (we'll want to support this syntax in older language modes when both committees adopt it for a newer language mode).

LGTM, I'm down to only some minor nits with the patch. @rsmith, do you have any concerns you'd like to see addressed before landing this?

Just to clarify, I don't think this should be allowed in C++20 or C17.
Only C23 and C++23 modes. It make sense to add an extension warning there until the proposal are approved imo

Do you agree?

Because this proposal hasn't been accepted by WG21 or WG14, but the syntax is not defined by either standard, I think it's reasonable to support this but with appropriate warnings about it being a Clang extension. Can you add those diagnostics? Also, this is lacking test coverage for C.

This proposal has now been seen by both WG14 and WG21 and has sentiment for support in both committees. I think the extension diagnostic is still reasonable (we'll want to support this syntax in older language modes when both committees adopt it for a newer language mode).

LGTM, I'm down to only some minor nits with the patch. @rsmith, do you have any concerns you'd like to see addressed before landing this?

Just to clarify, I don't think this should be allowed in C++20 or C17.
Only C23 and C++23 modes. It make sense to add an extension warning there until the proposal are approved imo

Do you agree?

I could agree, but I'm wondering why you don't want to allow in older language modes? Clang typically allows conforming extensions in older language modes (with appropriate diagnostics), and this would qualify. e.g., https://godbolt.org/z/x5Ws6YdGc

cor3ntin updated this revision to Diff 371566.Sep 9 2021, 6:00 AM
cor3ntin marked 4 inline comments as done.

Address Aaron's comments

cor3ntin updated this revision to Diff 371682.Sep 9 2021, 12:05 PM

Fix "clang" case in tests

aaron.ballman closed this revision.Sep 15 2021, 6:55 AM

Thank you for the new feature! I've landed on your behalf in 274adcb866343cb505408d8f401840ea814522c8.