This is an archive of the discontinued LLVM Phabricator instance.

Support for universal character names in identifiers
ClosedPublic

Authored by jordan_rose on Jan 18 2013, 2:56 PM.

Details

Reviewers
rsmith
Summary

This is a missing piece for C99 conformance.

This patch handles UCNs by adding a '\\' case to LexTokenInternal and LexIdentifier -- if we see a backslash, we tentatively try to read in a UCN. If the UCN is not syntactically well-formed, we fall back to the old treatment: a backslash followed by an identifier beginning with 'u' (or 'U').

Because the spelling of an identifier with UCNs still has the UCN in it, we need to convert that to UTF-8 in Preprocessor::LookUpIdentifierInfo.

Of course, valid code that does not use UCNs will see only a very minimal performance hit (checks after each identifier for non-ASCII characters, checks when converting raw_identifiers to identifiers that they do not contain UCNs, and checks when getting the spelling of an identifier that it does not contain a UCN).

This patch also adds basic support for actual UTF-8 in the source, including treating Unicode whitespace as whitespace.

Diff Detail

Event Timeline

rsmith added inline comments.Jan 18 2013, 5:55 PM
lib/Lex/Lexer.cpp
1598

This FIXME still needs to be addressed, right?

2744–2745

You can skip this in the common case that CurPtr - StartPtr == NumHexDigits + 2

2752

This ' should be a `, right?

It'd be nice to also reference C++'s equivalent "Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed."

2818–2824

Do we diagnose such characters within #if 0 blocks?

jordan_rose added inline comments.Jan 21 2013, 10:46 AM
lib/Lex/Lexer.cpp
1598

I'm not sure. Eli had this taken out in his initial patch, and certainly we now give the proper warning for using a UCN sans underscore in a ud-suffix. But I don't know if it actually works end-to-end yet. I'll double-check.

2752

Grr...darn OS X being "helpful". Good catch.

I'll add the C++ comment.

2818–2824

Hm, we should but this code does not. But I was hitting a reentrancy problem before where emitting the diagnostic required re-lexing. Is there a better way to distinguish "actual parsing" from "lexing for diagnostics" that doesn't include "skipping over #if 0 blocks"?

jordan_rose added inline comments.Jan 21 2013, 12:44 PM
lib/Lex/Lexer.cpp
1598

Looks like no. I started a bit of local work to merge all the identifier reading in LexNumericLiteral, LexUDSuffix, and LexIdentifier, but I guess it can just wait for a separate patch.

jordan_rose updated this revision to Unknown Object (????).Jan 22 2013, 1:41 PM

Addresses most comments from before, and now diagnoses illegal UCNs in #if 0 blocks. This currently uses the presence of a preprocessor as a heuristic to warn even in raw mode.

jordan_rose updated this revision to Unknown Object (????).Jan 23 2013, 11:58 AM

Many more tests.

This is actually now four patches in my git repo, which is how I'm planning to commit it:

  • Unify diagnostics for \x, \u, and \U without any following hex digits.
  • Handle universal character names and Unicode characters outside of literals.
  • As an extension, treat Unicode whitespace characters as whitespace.
  • Add a fixit for \U1234 -> \u1234.
rsmith accepted this revision.Jan 23 2013, 7:23 PM

This looks great, thanks!

lib/Lex/Lexer.cpp
2770

Please replace these UTF-8 dashes with ASCII :)

jordan_rose closed this revision.Jan 24 2013, 12:54 PM

Committed in r173368-71. Thanks, Richard!