This is an archive of the discontinued LLVM Phabricator instance.

[Clang][C++23] P2071 Named universal character escapes
ClosedPublic

Authored by cor3ntin on Apr 4 2022, 12:43 PM.

Details

Summary

Implements P2071 Named Universal Character Escapes - as an extension in all language mode, the patch not warn in c++23 mode will be done later once this paper is plenary approved (in July).

We add

  • A code generator that transforms UnicodeData.txt and NameAliases.txt to a space efficient data structure that can be queried in O(NameLength)
  • A set of functions in Unicode.h to query that data, including
    • A function to find an exact match of a given Unicode character name
    • A function to perform a loose (ignoring case, space, underscore, medial hyphen) matching
    • A function returning the best matching codepoint for a given string per edit distance
  • Support of \N{} escape sequences in String and character Literals, with loose and typos diagnostics/fixits
  • Support of \N{} as UCN with loose matching diagnostics/fixits.

Loose matching is considered an error to match closely the semantics of P2071.

The generated data contributes to 280kB of data to the binaries.

UnicodeData.txt and NameAliases.txt are not committed to the repository in this patch, and regenerating the data is a manual process.

Diff Detail

Event Timeline

cor3ntin created this revision.Apr 4 2022, 12:43 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 4 2022, 12:43 PM
cor3ntin requested review of this revision.Apr 4 2022, 12:43 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 4 2022, 12:43 PM
cor3ntin updated this revision to Diff 420278.Apr 4 2022, 12:45 PM

Committed the wrong file!

cor3ntin updated this revision to Diff 420286.Apr 4 2022, 1:06 PM

Fix generator

cor3ntin updated this revision to Diff 420319.Apr 4 2022, 3:06 PM

Add Unit tests for nameToCodepoint

cor3ntin updated this revision to Diff 420422.Apr 5 2022, 2:36 AM

Add lexing tests, improve diagnostics.
As a drive by, this also fix the handling of \U{123} which was a non
diagnosed invalid escape sequence, leading to some assertion failure.

cor3ntin updated this revision to Diff 420521.Apr 5 2022, 8:32 AM
  • Support loose matching
  • Modifiy the binary format to make it easier to have a root node

This looks pretty good to me. I added a few comments. I mostly just reviewed the lexer related code; I didn't dive into the name matching code.

clang/include/clang/Basic/DiagnosticLexKinds.td
130–131

I don't see much value in combining these diagnostics since these are distinct features. The ext_delimited_escape_sequence name seems odd for named escape sequences too (even if both features use { and } as delimiters).

clang/lib/Lex/Lexer.cpp
3261–3266

It isn't clear to me why there are two separate if statements here. I would expect the following to suffice. If I'm misunderstanding something, then comments might be helpful.

if (!isUppercase(C) && !isDigit(C) && C != '-' && C != ' ') {
  Invalid = true;
  break;
}

I'm not sure why there is a test for '_' since that character is not present in the grammar for n-char in P2071.

Is it intentional that there is no break statement in the second if statement?

clang/lib/Lex/LiteralSupport.cpp
486–494

I like this use of loose matching to generate a fixit!

clang/test/Lexer/char-escapes-delimited.c
83

Does this test imply that the warn_delimited_ucn_incomplete diagnostic cannot be issued for named escape sequences? I don't see a test that exercises a missing '}' otherwise. Perhaps add a test like:

char x = '\N{LOTUS';
cor3ntin updated this revision to Diff 420946.Apr 6 2022, 10:58 AM
  • Typo fixes for invalid code points

Add a method NearestMatchesForCodepointName that will return the N closest
valid codepoints, by edit distance.
This is used to detect typos in named escape sequences in literals.

  • Address some of Tom's comments:
    • Add test for '\N{ABC'.
    • Handle loose matching of named ucns in identifiers with appropriate diagnostics.
    • Fix detection of the difference betwen empty and incomplete named escape sequences.
cor3ntin updated this revision to Diff 420971.Apr 6 2022, 12:31 PM

Recover from valid but loosly matched character names in identifiers
more gracefully

cor3ntin added inline comments.
clang/include/clang/Basic/DiagnosticLexKinds.td
130–131

I'm used to @aaron.ballman encouraging me to keep the number of diagnostics under control, but I'm fine keeping them separate

clang/lib/Lex/Lexer.cpp
3261–3266

I improved that, what it does should be more clear now. More importantly, I added a diagnostic note when we detect a loose match.

We allow _ because Unicode does.
We first perform a strict match - which fails as no Unicode name contains an underscore, we emit a diagnostic, and then we try a loose matching which does allow _.
This enable us to produces better diagnostics

<source>:2:18: error: 'GREEK_SMALL_LETTER-OMICRON' is not a valid Unicode character name
    const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break space}"; 
                 ^
<source>:2:20: note: characters names in Unicode escape sequences are sensitive to case and whitespaces
    const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break space}"; 
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~
                   GREEK SMALL LETTER OMICRON
<source>:2:54: error: 'zero width no break space' is not a valid Unicode character name
    const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break space}"; 
                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~
<source>:2:54: note: characters names in Unicode escape sequences are sensitive to case and whitespaces
    const char* \N{GREEK_SMALL_LETTER-OMICRON} = "\N{zero width no break space}"; 
                                                     ^~~~~~~~~~~~~~~~~~~~~~~~~
                                                     ZERO WIDTH NO-BREAK SPACE
cor3ntin updated this revision to Diff 421562.Apr 8 2022, 9:55 AM

Add fix hint tests

cor3ntin updated this revision to Diff 421721.Apr 9 2022, 5:18 AM
  • Add tests
  • Cleanups
  • Optimization of nearestMatchesForCodepointName to only allocate

when necessary.

cor3ntin edited the summary of this revision. (Show Details)Apr 9 2022, 5:51 AM
cor3ntin updated this revision to Diff 421723.Apr 9 2022, 6:00 AM

Fix in nameToCodepointLoose

cor3ntin retitled this revision from [Clang][C++23][WIP] P2071 Named universal character escapes to [Clang][C++23] P2071 Named universal character escapes.Apr 9 2022, 6:08 AM
cor3ntin updated this revision to Diff 421768.Apr 9 2022, 11:21 PM

Add missing header in generated code to fix windows CI.

cor3ntin updated this revision to Diff 421769.Apr 10 2022, 12:17 AM

Fix linking on windows.

cor3ntin updated this revision to Diff 421771.Apr 10 2022, 12:41 AM

Formatting

cor3ntin updated this revision to Diff 421787.Apr 10 2022, 6:16 AM

A non-existing name could return an engaged value if the
whole string matched the node's name, even if that node had no
attached value.

cor3ntin updated this revision to Diff 421939.Apr 11 2022, 8:45 AM

Add a test for `\o'

aaron.ballman added reviewers: aaron.ballman, tahonermann, Restricted Project.Apr 11 2022, 9:08 AM
aaron.ballman added subscribers: tstellar, lattner.

Thank you for this functionality! I did a pretty quick pass over it and have some comments.

clang/include/clang/Basic/DiagnosticLexKinds.td
130–131

While Tom's right that they're distinct features, diagnostics aren't tied directly to a feature in Clang; we try to reuse existing diagnostic text as much as possible. My preference is to keep them combined so we reduce the number of strings we build into the executable (and we make life easier for eventual localization of the diagnostics).

clang/lib/Lex/Lexer.cpp
3144

FWIW, I think using llvm::None instead of {} is more clear to readers (I doubt we're consistent with this in the code base though).

(Same comment applies other places that we're making an empty Optional.)

3232
3243–3245

Is this a case where we might want a fixit because the user did \N when they meant to do \n?

(Should we look for \N( and fixit that to \N{?)

3296

I think it'd be more clear to set Res = LooseMatch->CodePoint; before exiting here so that after this block, you can always assume there's a valid code point (this helps if we later want to add more logic after this point, too).

3298–3300
3311

Then this can become return *Res;

3321–3323
3331–3333

I think this should live inside the other tryRead* functions once we know the token is definitely a UCN, in case someone finds a reason to want to call one of those concretely but not call tryReadUCN() to do so.

clang/lib/Lex/LiteralSupport.cpp
238

Can you use "o" instead of constructing a std::string for it?

360

Please spell out the type for Res as it's not obvious from context what it is.

399–406

Might as well clean this up a bit more

487

Please spell the type out.

499

Same here.

500

Just double checking that the assertion here is valid and the function can never return an empty set.

509
563

Spell out the type.

592–596

May as well be consistent with the first branch, esp given that this is a multiline substatement.

clang/test/Lexer/char-escapes-delimited.c
87

Here's another fun test:

char trigraphs = '\N??<DOLLAR SIGN??>`;

and we should test that this magic works in the preprocessor:

if `\N{LATIN SMALL LETTER A}` != 'a'
#error "oh no!"
#endif

Might be worthwhile to have a preprocessor test to show that you can't form a named UCN through token concatenation.

96

This has the potential to break user code that was doing \N: https://godbolt.org/z/rahbsssPo However, it's consistent with how we handle \U, \x, and \u, so it's defensible.

llvm/include/llvm/ADT/StringExtras.h
329 ↗(On Diff #421787)

utohexstr() already exists on line 152 -- any reason we can't reuse that?

llvm/include/llvm/Support/Unicode.h
19

No need to include vector?

77–94

Already in the llvm namespace so no need for fully qualified names.

llvm/lib/Support/UnicodeNameToCodepoint.cpp
32–45

You should rename things to match the usual coding conventions.

38
50

Any particular reason for 64?

51

(not like auto vs node saves you much typing.)

73–81

(We don't typically use top-level const qualification.)

188–190
192–194
196

Please spell out the type (I'll stop commenting on the style nits -- you should do a pass for all that stuff at some point).

258

Plan to remove the commented out code?

365
aaron.ballman added inline comments.Apr 11 2022, 9:08 AM
llvm/lib/Support/UnicodeNameToCodepoint.cpp
380–385
410–412
llvm/lib/Support/UnicodeNameToCodepointGenerated.cpp
5

Looks like the generator needs to add some whitespace or drop the trailing //.

llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp
18–19

Do we have something more direct to point users towards?

340–343

This raises a bit of a question about licensing -- we're making a derivative work from Unicode, so I think https://www.unicode.org/license.txt kicks in and we need to update our own license.txt files accordingly.

@tstellar @lattner -- can one of you weigh in on this? (tl;dr: we use this new tool to download data from Unicode.org and use it to generate llvm/lib/Support/UnicodeNameToCodepointGenerated.cpp which is committed into the source tree. The feature being implemented here is a C++23 feature and we cannot implement it without this data.)

Yeah, we should discuss this, thanks for raising this Aaron. I'm not sure exactly what is being pulled in: @cor3ntin can you please email a summary of the situation to board@llvm.org and we'll discuss it and run it by Heather as needed? Thanks

-Chris

cor3ntin updated this revision to Diff 421956.Apr 11 2022, 10:24 AM
cor3ntin marked 17 inline comments as done.

Address Aaron's comments

cor3ntin updated this revision to Diff 421958.Apr 11 2022, 10:28 AM

Cleanup generated code header comment

cor3ntin added inline comments.Apr 11 2022, 10:51 AM
clang/lib/Lex/LiteralSupport.cpp
238

Why do simple when you can do complicated?....

500

you should always get some result yes, at it find all the characters with the smallest edit distance

llvm/include/llvm/ADT/StringExtras.h
329 ↗(On Diff #421787)

I guess I missed that. Why do we have 2 functions doing the same thing?
How do you want me to clean that up?
I think we should have a separate nfc patch to clean that afterwards

aaron.ballman added inline comments.Apr 11 2022, 10:56 AM
llvm/include/llvm/ADT/StringExtras.h
329 ↗(On Diff #421787)

In this patch, I'd replace your new calls to to_hexString() with calls to utohexstr() and then in an NFC change, I'd get rid of to_hexString() entirely.

lattner removed a subscriber: lattner.Apr 11 2022, 11:10 AM
cor3ntin updated this revision to Diff 436133.Jun 11 2022, 6:20 AM
  • Rebase
  • The generator code is more consistent with LLVM style guides.

@tstellar I saw you say in another path that you got confirmation from lawyers that it's okay to include UnicodeData.txt (which this patch doesn't do) and derived data (which this patch does). Can you confirm? Thanks

cor3ntin updated this revision to Diff 436137.Jun 11 2022, 6:46 AM

Use utohexstr and revert the changes that put to_hexString in StringExtras.
I'll remove to_hexString entierly in a separate NFC change to avoid
duplication and further confusion.

cor3ntin updated this revision to Diff 436142.Jun 11 2022, 7:35 AM

Do not hardcode the list of generated code points ranges in the generator
code to ease maintainance burden.

@tstellar I saw you say in another path that you got confirmation from lawyers that it's okay to include UnicodeData.txt (which this patch doesn't do) and derived data (which this patch does). Can you confirm? Thanks

This is fine, please provide a link to the original data being used.

aaron.ballman added inline comments.Jun 13 2022, 11:42 AM
clang/lib/Lex/Lexer.cpp
3144

I think this comment was missed.

clang/lib/Lex/LiteralSupport.cpp
509

Can the subtraction of two unsigned values cause a wrapping issue here? (If not, can we add an assertion?)

537

Should we assert UcnBegin[0] == '\\' as well?

llvm/lib/Support/UnicodeNameToCodepoint.cpp
32–45

It looks like this comment was missed (it applies to this whole file).

50

Still wondering why 64 bytes specifically.

258

Unanswered question here.

llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp
18–19

Unanswered question here. May be a good place for a link like Tom had mentioned.

cor3ntin added inline comments.Jun 13 2022, 12:32 PM
llvm/lib/Support/UnicodeNameToCodepoint.cpp
50

It's semi arbitrary (aka a nice power of two that fits in a cacheline) - but it's large enough that it fits the 99% of names (the 99th percentile is actually around 46 byte)

llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp
18–19

This comment should no longer apply as I got rid of the generated method - instead only relying on info we find when parsing the file (see line 44).
It doesn't mean that we don't need more reference to the UnicodeData.txt url though

aaron.ballman added inline comments.Jun 13 2022, 12:42 PM
llvm/lib/Support/UnicodeNameToCodepoint.cpp
50

Okay, 64 is fine by me then, but if you wanted to bump down to 46 I also wouldn't be sad, it's your call.

llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp
18–19

Ah, okay -- basically, I'm hoping to have a comment somewhere near the top of this file with a link to the Unicode data.

Btw, I noticed this file is missing its license header, you should add that.

cor3ntin updated this revision to Diff 436661.Jun 13 2022, 11:21 PM
cor3ntin marked an inline comment as done.

Address Aaron's comments

  • {} => llvm::None in Lexer.cpp
  • Fix casing in UnicodeNameToCodepoint.cpp to match the style, and a couple I missed in UnicodeNameMappingGenerator.cpp
  • Fix an underflow in LiteralSupport.cpp
  • Add an assert in ProcessNamedUCNEscape
  • Reserve less space in fullName() + add comment in UnicodeNameToCodepoint.cpp
  • remove unused jamo constants in UnicodeNameToCodepoint.cpp
  • Add license + unicode url to UnicodeNameMappingGenerator.cpp
cor3ntin marked 17 inline comments as done.Jun 13 2022, 11:27 PM
cor3ntin updated this revision to Diff 436682.Jun 14 2022, 12:49 AM
cor3ntin marked 12 inline comments as done.

Regenerate UnicodeNameToCodepointGenerated.cpp to fix the formatting of its
header.

cor3ntin marked an inline comment as done.Jun 14 2022, 12:59 AM
cor3ntin added inline comments.
clang/lib/Lex/Lexer.cpp
3243–3245

I think if we wanted to diagnose \n we should also diagnose \U, which we don't do, Maybe a follow up patch, what do you think?
I can't imagine trying to be smart about \N( would be exercised by many users

cor3ntin updated this revision to Diff 436703.Jun 14 2022, 1:56 AM

Fix more style issue (variable casing)

Some likely final coding style nits. The only thing of substance I'm waiting on is an answer to whether we need to update a license file in order to comply with the Unicode license requirements. @tstellar, any chance you can help there?

clang/lib/Lex/Lexer.cpp
3243–3245

Follow-up (if at all) is fine by me!

llvm/lib/Support/UnicodeNameToCodepoint.cpp
42
46
121
132
184
223
304
llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp
2–4

Something looks amiss here -- no idea how we usually handle > 80 col file names, but maybe removing the utils/UnicodeData/ prefix will suffice?

32
57
60
68
207
350
361–362
380–381
aaron.ballman added inline comments.Jun 15 2022, 5:54 AM
clang/lib/Lex/LiteralSupport.cpp
509–511

I had a discussion about the license file on IRC (but not with @tstellar) and the thinking there was: add the license file. It seems to either be required or would be harmless to add. So if we don't hear something different from @tstellar on this, I think you should add the contents of https://www.unicode.org/license.txt to the top of the generated content for UnicodeNameToCodepointGenerated.cpp so that the license information stays with the derived data to which it applies.

cor3ntin updated this revision to Diff 437160.Jun 15 2022, 7:27 AM
cor3ntin marked 17 inline comments as done.

Address more style issues found by Aaron

cor3ntin updated this revision to Diff 437184.Jun 15 2022, 8:10 AM

Add unicode license notice to generated file derived fromn the unicode data

cor3ntin updated this revision to Diff 437186.Jun 15 2022, 8:15 AM

Fix wrapping in file header

I had a discussion about the license file on IRC (but not with @tstellar) and the thinking there was: add the license file. It seems to either be required or would be harmless to add. So if we don't hear something different from @tstellar on this, I think you should add the contents of https://www.unicode.org/license.txt to the top of the generated content for UnicodeNameToCodepointGenerated.cpp so that the license information stays with the derived data to which it applies.

After more chat with Aaron, I added the license in the generated file, in addition of the llvm license header so there is no confusion, This seems to be consistent with other third party files like the regex support files.
In the long run, i think we should consolidate all the tools handling unicode data in a consistent place, at which point we could commit a license file there, along with the data files themselves.

clang/lib/Lex/LiteralSupport.cpp
509–511

They are unsigned so that wouldn't work!

@tahonermann gentle ping (Aaron told me you might have further comments)

tahonermann accepted this revision as: tahonermann.Jun 24 2022, 3:55 PM

@tahonermann gentle ping (Aaron told me you might have further comments)

I'm sorry for the delay. I ran out of time to do the thorough review I would have liked to do, but I did scroll through everything now and did not find anything concerning; Aaron clearly conducted a thorough review already. It looks great to me, really nice work!

clang/lib/Lex/Lexer.cpp
3261–3266

That's great! Very nice!

This revision is now accepted and ready to land.Jun 24 2022, 3:55 PM
This revision was landed with ongoing or failed builds.Jun 25 2022, 10:03 AM
This revision was automatically updated to reflect the committed changes.

@aaron.ballman @tahonermann Thanks for the review. I landed the change after confirming with Aaron he was happy with it

RKSimon added inline comments.
llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp
46

@cor3ntin Should this be SecondSemiPos ?

Report here: https://pvs-studio.com/en/blog/posts/cpp/1003/ (N39)

cor3ntin added inline comments.Oct 25 2022, 9:39 AM
llvm/utils/UnicodeData/UnicodeNameMappingGenerator.cpp
46

Yes. Nice catch.
There is no case of that happening in the file so it never manifested.
I'll push a commit fixing it.