This is an archive of the discontinued LLVM Phabricator instance.

Implement delimited escape sequences.
ClosedPublic

Authored by cor3ntin on Jul 9 2021, 2:09 PM.

Details

Summary

This is a proposed C++ paper (P2290R1) that has not been accepted yet

\x{XXXX} \u{XXXX} and \o{OOOO} are accepted in all languages mode
in characters and string literals.

Diff Detail

Event Timeline

cor3ntin requested review of this revision.Jul 9 2021, 2:09 PM
cor3ntin created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2021, 2:09 PM
Herald added a subscriber: cfe-commits. · View Herald Transcript
cor3ntin removed a subscriber: cfe-commits.
cor3ntin updated this revision to Diff 359544.Jul 17 2021, 1:33 AM

Fix tests & build

cor3ntin retitled this revision from Implement delimited escape sequences. to [WIP] Implement delimited escape sequences..Jul 18 2021, 11:45 PM

Because this proposal hasn't been accepted by WG21 or WG14, but the syntax is not defined by either standard, I think it's reasonable to support this but with appropriate warnings about it being a Clang extension. Can you add those diagnostics? Also, this is lacking test coverage for C.

clang/include/clang/Basic/DiagnosticLexKinds.td
131

Wouldn't this always be covered by expected '}'?

133

Empty isn't the only problem though -- what happens if the user specifies too many digits? e.g. \xFFFFFFFFFFFFFFFFF (that's 0xFFFF'FFFF'FFFF'FFFF'F)

135
clang/lib/Lex/LiteralSupport.cpp
151–156

Can't this be determined once we've read the } by looking at the string length?

154

This argument doesn't do anything, btw. Same happens in a few other places.

clang/test/Lexer/char-escapes-delimited.cpp
6 ↗(On Diff #359544)

This diagnostic highlights the need to tell the user *which* character in the escape sequence is wrong. Consider: \u{0000O111I} as an example.

14 ↗(On Diff #359544)

Can you add test cases that show whether digit separators are allowed or prohibited? (FWIW, I sort of think they should be allowed -- at least, I'm not certain I see a reason to disallow them and they do help with readability, especially for longer hex escapes.)

cor3ntin added inline comments.Jul 20 2021, 12:43 PM
clang/include/clang/Basic/DiagnosticLexKinds.td
131

I guess, I didn't think of that

133

That's covered by existing errors for overflowing the char type. There is no limit in the number of digits in the escape per say.

clang/lib/Lex/LiteralSupport.cpp
154

Indeed!

clang/test/Lexer/char-escapes-delimited.cpp
6 ↗(On Diff #359544)

Make sense!

cor3ntin updated this revision to Diff 360391.Jul 21 2021, 2:05 AM
  • Format
  • Add warning for extension
  • Add negative tests for delimited sequences split

between 2 concatenated strings, and sequences containing digit
separators

  • Improve diagnostic messages for invalid characters
  • Stop passing unused parameters to diagnostics
clang/lib/Lex/LiteralSupport.cpp
151–156

I suppose we could. Do you think it would be clearer?

aaron.ballman added inline comments.Jul 21 2021, 5:32 AM
clang/include/clang/Basic/DiagnosticLexKinds.td
130–131
133

Ah, good to know, thanks!

clang/lib/Lex/LiteralSupport.cpp
151–156

I thought it could be cleaner if we also had to check lengths, but because we don't, I think this is fine as-is.

173

Comment here explaining why we're breaking when not delimited would be useful for other readers.

179

Can we pass the char directly, or get away with using a StringRef? It'd be nice to avoid the allocation from std::string.

289

Same here as above.

421

Same here as above.

497

This return looks incorrect because we'll miss diagnosing using the extension in this case even though the user used it.

clang/test/Lexer/char-escapes-delimited.c
44

Can you test that it's not out of range for L'\o{777}'?

60

I think this should be accepted rather than rejected. Long hex values are somewhat anticipated for this feature (given that you often need 5 or more characters to spell the interesting Unicode hex values), and digit separators help with readability in those cases. ... ... However, I suppose those *can't* work because of character literals. '\x{1234'5}' would be ambiguous to lex. Ouch, okay, so I guess I don't expect that to work, but having a test case with a character literal that tries to use a digit separator would be good.

cor3ntin added inline comments.Jul 21 2021, 5:45 AM
clang/lib/Lex/LiteralSupport.cpp
179

I can use StringRef yep - but I'm not concerned about allocation because of sbo - I used a pattern used elsewhere in the file

497

Indeed!

cor3ntin updated this revision to Diff 360423.Jul 21 2021, 6:32 AM
  • Add tests for short wchar overflow
  • use string ref instead of string
  • fix diagnostic
cor3ntin updated this revision to Diff 360428.Jul 21 2021, 6:44 AM
  • Fix tests
cor3ntin retitled this revision from [WIP] Implement delimited escape sequences. to Implement delimited escape sequences..Jul 21 2021, 6:46 AM
cor3ntin updated this revision to Diff 360431.Jul 21 2021, 6:49 AM
cor3ntin marked 2 inline comments as done.
  • Add comment to explain why we break out during hex escape parsing
cor3ntin updated this revision to Diff 360434.Jul 21 2021, 6:54 AM
cor3ntin marked 4 inline comments as done.
  • Fix diagnostic message
cor3ntin marked an inline comment as done.Jul 21 2021, 6:55 AM
cor3ntin added inline comments.
clang/test/Lexer/char-escapes-delimited.c
60

I mean, that just explodes

aaron.ballman added inline comments.Jul 21 2021, 7:20 AM
clang/test/Lexer/char-escapes-delimited.c
79–81

I think the test is a bit off because you couldn't have a trailing digit separator so 12'3' wouldn't be valid anyway. What's the behavior if you use 12'3 instead?

cor3ntin updated this revision to Diff 362034.Jul 27 2021, 8:12 AM
cor3ntin marked an inline comment as not done.
  • Add delimited escape sequences in identifiers
cor3ntin updated this revision to Diff 362079.Jul 27 2021, 9:52 AM

When preprocessing, treat incomplete or empty
delimited UCNs as warning, in the spirit of
[lex.pptoken]/3.1

Overflow is still treated as an error.

Mostly some small nits.

clang/include/clang/Basic/DiagnosticLexKinds.td
134

This describes what the code is, but not why it's an issue. How about: delimited escape sequence cannot be empty or something along those lines?

146–151

I think we can get away with only one diagnostic here.

clang/lib/Lex/Lexer.cpp
3091
3129

Should fix this formatting issue. You can also elide the braces per our usual coding conventions.

3136

Should this be using Diagnose instead?

3153
3174
clang/lib/Lex/LiteralSupport.cpp
494–501
499–502
cor3ntin updated this revision to Diff 368754.Aug 25 2021, 3:23 PM
cor3ntin marked 8 inline comments as done.

Address most of the comments

I left warn_delimited_ucn and err_ucn_escape_incomplete
as separate warnings as I think it's easier to understand,
and could not find a way to escape a closing brace inside
a %select{}.

cor3ntin updated this revision to Diff 368756.Aug 25 2021, 3:29 PM

Fix tests

aaron.ballman accepted this revision.Sep 8 2021, 8:57 AM

Because this proposal hasn't been accepted by WG21 or WG14, but the syntax is not defined by either standard, I think it's reasonable to support this but with appropriate warnings about it being a Clang extension. Can you add those diagnostics? Also, this is lacking test coverage for C.

This proposal has now been seen by both WG14 and WG21 and has sentiment for support in both committees. I think the extension diagnostic is still reasonable (we'll want to support this syntax in older language modes when both committees adopt it for a newer language mode).

LGTM, I'm down to only some minor nits with the patch. @rsmith, do you have any concerns you'd like to see addressed before landing this?

clang/include/clang/Basic/DiagnosticLexKinds.td
130

We usually capitalize Clang for this kind of diagnostic.

clang/lib/Lex/Lexer.cpp
3174
clang/lib/Lex/LiteralSupport.cpp
166
337–352

Better to use something that can't be confused with a type name (clang::Type), also drops the top-level const.

This revision is now accepted and ready to land.Sep 8 2021, 8:57 AM

Because this proposal hasn't been accepted by WG21 or WG14, but the syntax is not defined by either standard, I think it's reasonable to support this but with appropriate warnings about it being a Clang extension. Can you add those diagnostics? Also, this is lacking test coverage for C.

This proposal has now been seen by both WG14 and WG21 and has sentiment for support in both committees. I think the extension diagnostic is still reasonable (we'll want to support this syntax in older language modes when both committees adopt it for a newer language mode).

LGTM, I'm down to only some minor nits with the patch. @rsmith, do you have any concerns you'd like to see addressed before landing this?

Just to clarify, I don't think this should be allowed in C++20 or C17.
Only C23 and C++23 modes. It make sense to add an extension warning there until the proposal are approved imo

Do you agree?

Because this proposal hasn't been accepted by WG21 or WG14, but the syntax is not defined by either standard, I think it's reasonable to support this but with appropriate warnings about it being a Clang extension. Can you add those diagnostics? Also, this is lacking test coverage for C.

This proposal has now been seen by both WG14 and WG21 and has sentiment for support in both committees. I think the extension diagnostic is still reasonable (we'll want to support this syntax in older language modes when both committees adopt it for a newer language mode).

LGTM, I'm down to only some minor nits with the patch. @rsmith, do you have any concerns you'd like to see addressed before landing this?

Just to clarify, I don't think this should be allowed in C++20 or C17.
Only C23 and C++23 modes. It make sense to add an extension warning there until the proposal are approved imo

Do you agree?

I could agree, but I'm wondering why you don't want to allow in older language modes? Clang typically allows conforming extensions in older language modes (with appropriate diagnostics), and this would qualify. e.g., https://godbolt.org/z/x5Ws6YdGc

cor3ntin updated this revision to Diff 371566.Sep 9 2021, 6:00 AM
cor3ntin marked 4 inline comments as done.

Address Aaron's comments

cor3ntin updated this revision to Diff 371682.Sep 9 2021, 12:05 PM

Fix "clang" case in tests

aaron.ballman closed this revision.Sep 15 2021, 6:55 AM

Thank you for the new feature! I've landed on your behalf in 274adcb866343cb505408d8f401840ea814522c8.