This is an archive of the discontinued LLVM Phabricator instance.

Differential D312

Support for universal character names in identifiers
ClosedPublic

Authored by jordan_rose on Jan 18 2013, 2:56 PM.

Details

Reviewers

rsmith

Summary

This is a missing piece for C99 conformance.

This patch handles UCNs by adding a '\\' case to LexTokenInternal and LexIdentifier -- if we see a backslash, we tentatively try to read in a UCN. If the UCN is not syntactically well-formed, we fall back to the old treatment: a backslash followed by an identifier beginning with 'u' (or 'U').
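
The tentative read described above can be sketched outside the lexer as follows. This is a simplified illustration, not the patch's `Lexer::tryReadUCN`: the name `tryParseUCN` and its signature are hypothetical, and the sketch ignores trigraphs, escaped newlines, and diagnostics.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not the patch's Lexer::tryReadUCN).
// `text` must start at the backslash. On success, stores the code point in
// `codePoint`, stores the length of the UCN in `consumed` (6 for \u, 10 for
// \U), and returns true. On failure, returns false and leaves both outputs
// untouched, so the caller can fall back to treating '\' as its own token.
bool tryParseUCN(const std::string &text, std::uint32_t &codePoint,
                 std::size_t &consumed) {
  if (text.size() < 2 || text[0] != '\\')
    return false;

  std::size_t numHexDigits;
  if (text[1] == 'u')
    numHexDigits = 4;
  else if (text[1] == 'U')
    numHexDigits = 8;
  else
    return false;

  if (text.size() < 2 + numHexDigits)
    return false;

  std::uint32_t value = 0;
  for (std::size_t i = 0; i < numHexDigits; ++i) {
    char c = text[2 + i];
    std::uint32_t digit;
    if (c >= '0' && c <= '9')
      digit = c - '0';
    else if (c >= 'a' && c <= 'f')
      digit = c - 'a' + 10;
    else if (c >= 'A' && c <= 'F')
      digit = c - 'A' + 10;
    else
      return false; // Not syntactically well-formed: fall back.
    value = (value << 4) | digit;
  }

  codePoint = value;
  consumed = 2 + numHexDigits;
  return true;
}
```

A caller would try this at each backslash and, on failure, continue lexing as if the backslash were an ordinary unknown character.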

Because the spelling of an identifier with UCNs still has the UCN in it, we need to convert that to UTF-8 in Preprocessor::LookUpIdentifierInfo.
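
As a rough illustration of that conversion step, the sketch below expands well-formed UCNs in an identifier's spelling into UTF-8. The helper names (`hexValue`, `appendUTF8`, `expandUCNs`) are hypothetical, and the code assumes every UCN names a valid, non-surrogate code point; it is not the code in `Preprocessor::LookUpIdentifierInfo`.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Returns the value of a hex digit, or -1 if `c` is not one.
static int hexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Appends the UTF-8 encoding of `cp` to `out`.
// Assumes cp <= 0x10FFFF and that cp is not a surrogate.
static void appendUTF8(std::uint32_t cp, std::string &out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Hypothetical helper: replaces each \uXXXX or \UXXXXXXXX in `spelling`
// with the UTF-8 encoding of its code point. Anything that is not a
// well-formed UCN is copied through unchanged.
std::string expandUCNs(const std::string &spelling) {
  std::string result;
  std::size_t i = 0;
  while (i < spelling.size()) {
    if (spelling[i] == '\\' && i + 1 < spelling.size() &&
        (spelling[i + 1] == 'u' || spelling[i + 1] == 'U')) {
      std::size_t digits = spelling[i + 1] == 'u' ? 4 : 8;
      if (i + 2 + digits <= spelling.size()) {
        std::uint32_t cp = 0;
        bool ok = true;
        for (std::size_t j = 0; j < digits; ++j) {
          int v = hexValue(spelling[i + 2 + j]);
          if (v < 0) { ok = false; break; }
          cp = (cp << 4) | static_cast<std::uint32_t>(v);
        }
        if (ok) {
          appendUTF8(cp, result);
          i += 2 + digits;
          continue;
        }
      }
    }
    result += spelling[i++];
  }
  return result;
}
```

With this normalization, an identifier spelled with a UCN and the same identifier spelled in raw UTF-8 map to the same lookup key.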

Valid code that does not use UCNs should see only a minimal performance hit: a check after each identifier for non-ASCII characters, a check when converting raw_identifiers to identifiers that they do not contain UCNs, and a check when getting the spelling of an identifier that it does not contain a UCN.

This patch also adds basic support for actual UTF-8 in the source, including treating Unicode whitespace as whitespace.
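
For illustration, a whitespace test over decoded code points might look like the sketch below. The function name is hypothetical, and the set shown is the non-ASCII code points with the Unicode White_Space property as I understand current Unicode versions; the exact set this patch recognizes is not visible in this excerpt and may differ.

```cpp
#include <cstdint>

// Hypothetical helper: true for non-ASCII code points that carry the
// Unicode White_Space property. This set is an assumption for the sketch,
// not a copy of the table used by the patch.
bool isUnicodeWhitespace(std::uint32_t cp) {
  return cp == 0x0085 ||                   // NEXT LINE
         cp == 0x00A0 ||                   // NO-BREAK SPACE
         cp == 0x1680 ||                   // OGHAM SPACE MARK
         (cp >= 0x2000 && cp <= 0x200A) || // EN QUAD .. HAIR SPACE
         cp == 0x2028 ||                   // LINE SEPARATOR
         cp == 0x2029 ||                   // PARAGRAPH SEPARATOR
         cp == 0x202F ||                   // NARROW NO-BREAK SPACE
         cp == 0x205F ||                   // MEDIUM MATHEMATICAL SPACE
         cp == 0x3000;                     // IDEOGRAPHIC SPACE
}
```

A lexer could call such a predicate after decoding a UTF-8 sequence at the start of a token, warn, and then skip the character as if it were a space.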

Event Timeline

rsmith added inline comments. Jan 18 2013, 5:55 PM

lib/Lex/Lexer.cpp
1598	This FIXME still needs to be addressed, right?
2744–2745	You can skip this in the common case that CurPtr - StartPtr == NumHexDigits + 2
2752	This ' should be a `, right? It'd be nice to also reference C++'s equivalent "Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed."
2818–2824	Do we diagnose such characters within #if 0 blocks?
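
The C++ rule quoted in the comment on line 2752 can be sketched as a predicate on the decoded code point. The function names are hypothetical. The basic source character set is written out here as I understand it from C++11 (printable ASCII except `$`, `@`, and the backtick, plus a few control characters); treat that as an assumption rather than as the patch's implementation.

```cpp
#include <cstdint>

// Control characters, per the ranges quoted in the review comment.
bool isControlCharacter(std::uint32_t cp) {
  return cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F);
}

// Printable members of the basic source character set (assumed: printable
// ASCII minus '$', '@', and '`'). Its control-character members (tab,
// vertical tab, form feed, newline) are already covered by
// isControlCharacter.
bool isPrintableBasicSourceCharacter(std::uint32_t cp) {
  if (cp < 0x20 || cp > 0x7E)
    return false;
  return cp != '$' && cp != '@' && cp != '`';
}

// Hypothetical check: true if a UCN naming `cp` would be ill-formed when it
// appears outside a character or string literal.
bool isForbiddenUCNOutsideLiteral(std::uint32_t cp) {
  return isControlCharacter(cp) || isPrintableBasicSourceCharacter(cp);
}
```

Under these assumptions `\u0041` (the letter A) is rejected in an identifier, while `\u0024` (the dollar sign) passes this particular check and is left to the separate allowed-identifier-character test.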

jordan_rose added inline comments. Jan 21 2013, 10:46 AM

lib/Lex/Lexer.cpp
1598	I'm not sure. Eli had this taken out in his initial patch, and certainly we now give the proper warning for using a UCN sans underscore in a ud-suffix. But I don't know if it actually works end-to-end yet. I'll double-check.
2752	Grr...darn OS X being "helpful". Good catch. I'll add the C++ comment.
2818–2824	Hm, we should but this code does not. But I was hitting a reentrancy problem before where emitting the diagnostic required re-lexing. Is there a better way to distinguish "actual parsing" from "lexing for diagnostics" that doesn't include "skipping over #if 0 blocks"?

jordan_rose added inline comments. Jan 21 2013, 12:44 PM

lib/Lex/Lexer.cpp
1598	Looks like no. I started a bit of local work to merge all the identifier reading in LexNumericLiteral, LexUDSuffix, and LexIdentifier, but I guess it can just wait for a separate patch.

Addresses most comments from before, and now diagnoses illegal UCNs in #if 0 blocks. This currently uses the presence of a preprocessor as a heuristic to warn even in raw mode.

Many more tests.

This is actually now four patches in my git repo, which is how I'm planning to commit it:

1. Unify diagnostics for \x, \u, and \U without any following hex digits.
2. Handle universal character names and Unicode characters outside of literals.
3. As an extension, treat Unicode whitespace characters as whitespace.
4. Add a fixit for \U1234 -> \u1234.
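
The condition behind the \U1234 -> \u1234 fixit can be sketched as below: a `\U` followed by exactly four hex digits probably meant `\u`. The helper is hypothetical and works on plain strings; the patch itself attaches a `FixItHint` to a note inside `tryReadUCN`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the patch): if `text` starts with "\U"
// followed by exactly four hex digits, stores a copy of `text` with 'U'
// replaced by 'u' in `suggestion` and returns true. Otherwise returns false
// and leaves `suggestion` untouched.
bool suggestShortUCN(const std::string &text, std::string &suggestion) {
  if (text.size() < 2 || text[0] != '\\' || text[1] != 'U')
    return false;

  // Count the hex digits that follow "\U", stopping at eight.
  std::size_t digits = 0;
  while (2 + digits < text.size() && digits < 8 &&
         std::isxdigit(static_cast<unsigned char>(text[2 + digits])))
    ++digits;

  if (digits != 4)
    return false;

  suggestion = text;
  suggestion[1] = 'u';
  return true;
}
```

Restricting the suggestion to exactly four digits keeps the fixit from firing on inputs such as `\U12`, where no single-character change would yield a well-formed UCN.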

This looks great, thanks!

lib/Lex/Lexer.cpp
2770	Please replace these UTF-8 dashes with ASCII :)

Committed in r173368-71. Thanks, Richard!

Revision Contents

Path	Size
include/clang/Basic/ConvertUTF.h	10 lines
include/clang/Basic/DiagnosticLexKinds.td	38 lines
include/clang/Lex/Lexer.h	20 lines
include/clang/Lex/Token.h	8 lines
lib/Lex/Lexer.cpp	319 lines
lib/Lex/LiteralSupport.cpp	4 lines
lib/Lex/Preprocessor.cpp	60 lines
test/CXX/over/over.oper/over.literal/p8.cpp	3 lines
test/CodeGen/ucn-identifiers.c	14 lines
test/FixIt/fixit-unicode.c	12 lines
test/Lexer/unicode.c	6 lines
test/Lexer/utf8-invalid.c	6 lines
test/Preprocessor/ucn-pp-identifier.c	106 lines
test/Sema/ucn-cstring.c	2 lines
test/Sema/ucn-identifiers.c	35 lines

Diff 754

include/clang/Basic/ConvertUTF.h

Context not available.
	ConversionResult ConvertUTF32toUTF16 (	ConversionResult ConvertUTF32toUTF16 (
	const UTF32** sourceStart, const UTF32* sourceEnd,	const UTF32** sourceStart, const UTF32* sourceEnd,
	UTF16** targetStart, UTF16* targetEnd, ConversionFlags flags);	UTF16** targetStart, UTF16* targetEnd, ConversionFlags flags);

	Boolean isLegalUTF8Sequence(const UTF8 *source, const UTF8 *sourceEnd);	Boolean isLegalUTF8Sequence(const UTF8 *source, const UTF8 *sourceEnd);

	Boolean isLegalUTF8String(const UTF8 **source, const UTF8 *sourceEnd);	Boolean isLegalUTF8String(const UTF8 **source, const UTF8 *sourceEnd);

	unsigned getNumBytesForUTF8(UTF8 firstByte);	unsigned getNumBytesForUTF8(UTF8 firstByte);

		static inline ConversionResult convertUTF8Sequence(const UTF8 **source,
		const UTF8 *sourceEnd,
		UTF32 *target,
		ConversionFlags flags) {
		unsigned size = getNumBytesForUTF8(**source);
		if (size > sourceEnd - *source)
		return sourceExhausted;
		return ConvertUTF8toUTF32(source, *source + size, &target, target + 1, flags);
		}

	#ifdef __cplusplus	#ifdef __cplusplus
	}	}

	/*************************************************************************/	/*************************************************************************/
	/* Below are LLVM-specific wrappers of the functions above. */	/* Below are LLVM-specific wrappers of the functions above. */

	#include "llvm/ADT/StringRef.h"	#include "llvm/ADT/StringRef.h"

	namespace clang {	namespace clang {

Context not available.

include/clang/Basic/DiagnosticLexKinds.td

Context not available.
	def err_unterminated_raw_string : Error<	def err_unterminated_raw_string : Error<
	"raw string missing terminating delimiter )%0\"">;	"raw string missing terminating delimiter )%0\"">;
	def warn_cxx98_compat_raw_string_literal : Warning<	def warn_cxx98_compat_raw_string_literal : Warning<
	"raw string literals are incompatible with C++98">,	"raw string literals are incompatible with C++98">,
	InGroup<CXX98Compat>, DefaultIgnore;	InGroup<CXX98Compat>, DefaultIgnore;

	def ext_multichar_character_literal : ExtWarn<	def ext_multichar_character_literal : ExtWarn<
	"multi-character character constant">, InGroup<MultiChar>;	"multi-character character constant">, InGroup<MultiChar>;
	def ext_four_char_character_literal : Extension<	def ext_four_char_character_literal : Extension<
	"multi-character character constant">, InGroup<FourByteMultiChar>;	"multi-character character constant">, InGroup<FourByteMultiChar>;


	// Literal
	def ext_nonstandard_escape : Extension<	// Unicode and UCNs
	"use of non-standard escape character '\\%0'">;	def err_invalid_utf8 : Error<
	def ext_unknown_escape : ExtWarn<"unknown escape sequence '\\%0'">;	"source file is not valid UTF-8">;
	def err_hex_escape_no_digits : Error<"\\x used with no following hex digits">;	def err_non_ascii : Error<
	def err_ucn_escape_no_digits : Error<"\\u used with no following hex digits">;	"non-ASCII characters are not allowed outside of literals and identifiers">;
	def err_ucn_escape_invalid : Error<"invalid universal character">;	def ext_unicode_whitespace : ExtWarn<
	def err_ucn_escape_incomplete : Error<"incomplete universal character name">;	"treating Unicode character as whitespace">,
		InGroup<DiagGroup<"unicode-whitespace">>;

		def err_hex_escape_no_digits : Error<
		"\\%0 used with no following hex digits">;
		def warn_ucn_escape_no_digits : Warning<
		"\\%0 used with no following hex digits; "
		"treating as '\\' followed by identifier">, InGroup<Unicode>;
		def err_ucn_escape_incomplete : Error<
		"incomplete universal character name">;
		def warn_ucn_escape_incomplete : Warning<
		"incomplete universal character name; "
		"treating as '\\' followed by identifier">, InGroup<Unicode>;
		def note_ucn_four_not_eight : Note<"did you mean to use '\\u'?">;

	def err_ucn_escape_basic_scs : Error<	def err_ucn_escape_basic_scs : Error<
	"character '%0' cannot be specified by a universal character name">;	"character '%0' cannot be specified by a universal character name">;
	def err_ucn_control_character : Error<	def err_ucn_control_character : Error<
	"universal character name refers to a control character">;	"universal character name refers to a control character">;
		def err_ucn_escape_invalid : Error<"invalid universal character">;
	def warn_cxx98_compat_literal_ucn_escape_basic_scs : Warning<	def warn_cxx98_compat_literal_ucn_escape_basic_scs : Warning<
	"specifying character '%0' with a universal character name "	"specifying character '%0' with a universal character name "
	"is incompatible with C++98">, InGroup<CXX98Compat>, DefaultIgnore;	"is incompatible with C++98">, InGroup<CXX98Compat>, DefaultIgnore;
	def warn_cxx98_compat_literal_ucn_control_character : Warning<	def warn_cxx98_compat_literal_ucn_control_character : Warning<
	"universal character name referring to a control character "	"universal character name referring to a control character "
	"is incompatible with C++98">, InGroup<CXX98Compat>, DefaultIgnore;	"is incompatible with C++98">, InGroup<CXX98Compat>, DefaultIgnore;


		// Literal
		def ext_nonstandard_escape : Extension<
		"use of non-standard escape character '\\%0'">;
		def ext_unknown_escape : ExtWarn<"unknown escape sequence '\\%0'">;
	def err_invalid_decimal_digit : Error<"invalid digit '%0' in decimal constant">;	def err_invalid_decimal_digit : Error<"invalid digit '%0' in decimal constant">;
	def err_invalid_binary_digit : Error<"invalid digit '%0' in binary constant">;	def err_invalid_binary_digit : Error<"invalid digit '%0' in binary constant">;
	def err_invalid_octal_digit : Error<"invalid digit '%0' in octal constant">;	def err_invalid_octal_digit : Error<"invalid digit '%0' in octal constant">;
	def err_invalid_suffix_integer_constant : Error<	def err_invalid_suffix_integer_constant : Error<
	"invalid suffix '%0' on integer constant">;	"invalid suffix '%0' on integer constant">;
	def err_invalid_suffix_float_constant : Error<	def err_invalid_suffix_float_constant : Error<
	"invalid suffix '%0' on floating constant">;	"invalid suffix '%0' on floating constant">;
	def warn_extraneous_char_constant : Warning<	def warn_extraneous_char_constant : Warning<
	"extraneous characters in character constant ignored">;	"extraneous characters in character constant ignored">;
	def warn_char_constant_too_large : Warning<	def warn_char_constant_too_large : Warning<
Context not available.

include/clang/Lex/Lexer.h

Context not available.

	//===--------------------------------------------------------------------===//	//===--------------------------------------------------------------------===//
	// Internal implementation interfaces.	// Internal implementation interfaces.
	private:	private:

	/// LexTokenInternal - Internal interface to lex a preprocessing token. Called	/// LexTokenInternal - Internal interface to lex a preprocessing token. Called
	/// by Lex.	/// by Lex.
	///	///
	void LexTokenInternal(Token &Result);	void LexTokenInternal(Token &Result);

		/// Given that a token begins with the Unicode character \p C, figure out
		/// what kind of token it is and dispatch to the appropriate lexing helper
		/// function.
		void LexUnicode(Token &Result, uint32_t C, const char *CurPtr);

	/// FormTokenWithChars - When we lex a token, we have identified a span	/// FormTokenWithChars - When we lex a token, we have identified a span
	/// starting at BufferPtr, going to TokEnd that forms the token. This method	/// starting at BufferPtr, going to TokEnd that forms the token. This method
	/// takes that range and assigns it to the token as its location and size. In	/// takes that range and assigns it to the token as its location and size. In
	/// addition, since tokens cannot overlap, this also updates BufferPtr to be	/// addition, since tokens cannot overlap, this also updates BufferPtr to be
	/// TokEnd.	/// TokEnd.
	void FormTokenWithChars(Token &Result, const char *TokEnd,	void FormTokenWithChars(Token &Result, const char *TokEnd,
	tok::TokenKind Kind) {	tok::TokenKind Kind) {
	unsigned TokLen = TokEnd-BufferPtr;	unsigned TokLen = TokEnd-BufferPtr;
	Result.setLength(TokLen);	Result.setLength(TokLen);
	Result.setLocation(getSourceLocation(BufferPtr, TokLen));	Result.setLocation(getSourceLocation(BufferPtr, TokLen));
Context not available.
	bool SkipBlockComment (Token &Result, const char *CurPtr);	bool SkipBlockComment (Token &Result, const char *CurPtr);
	bool SaveLineComment (Token &Result, const char *CurPtr);	bool SaveLineComment (Token &Result, const char *CurPtr);

	bool IsStartOfConflictMarker(const char *CurPtr);	bool IsStartOfConflictMarker(const char *CurPtr);
	bool HandleEndOfConflictMarker(const char *CurPtr);	bool HandleEndOfConflictMarker(const char *CurPtr);

	bool isCodeCompletionPoint(const char *CurPtr) const;	bool isCodeCompletionPoint(const char *CurPtr) const;
	void cutOffLexing() { BufferPtr = BufferEnd; }	void cutOffLexing() { BufferPtr = BufferEnd; }

	bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);	bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);


		/// Read a universal character name.
		///
		/// \param CurPtr The position in the source buffer after the initial '\'.
		/// If the UCN is syntactically well-formed (but not necessarily
		/// valid), this parameter will be updated to point to the
		/// character after the UCN.
		/// \param SlashLoc The position in the source buffer of the '\'.
		/// \param Tok The token being formed. Pass \c NULL to suppress diagnostics
		/// and handle token formation in the caller.
		///
		/// \return The Unicode codepoint specified by the UCN, or 0 if the UCN is
		/// invalid.
		uint32_t tryReadUCN(const char *&CurPtr, const char *SlashLoc, Token *Tok);
	};	};


	} // end namespace clang	} // end namespace clang

	#endif	#endif
Context not available.

include/clang/Lex/Token.h

Context not available.

	/// Flags - Bits we track about this token, members of the TokenFlags enum.	/// Flags - Bits we track about this token, members of the TokenFlags enum.
	unsigned char Flags;	unsigned char Flags;
	public:	public:

	// Various flags set per token:	// Various flags set per token:
	enum TokenFlags {	enum TokenFlags {
	StartOfLine = 0x01, // At start of line or only after whitespace.	StartOfLine = 0x01, // At start of line or only after whitespace.
	LeadingSpace = 0x02, // Whitespace exists before this token.	LeadingSpace = 0x02, // Whitespace exists before this token.
	DisableExpand = 0x04, // This identifier may never be macro expanded.	DisableExpand = 0x04, // This identifier may never be macro expanded.
	NeedsCleaning = 0x08, // Contained an escaped newline or trigraph.	NeedsCleaning = 0x08, // Contained an escaped newline or trigraph.
	LeadingEmptyMacro = 0x10, // Empty macro exists before this token.	LeadingEmptyMacro = 0x10, // Empty macro exists before this token.
	HasUDSuffix = 0x20 // This string or character literal has a ud-suffix.	HasUDSuffix = 0x20, // This string or character literal has a ud-suffix.
		HasUCN = 0x40 // This identifier contains a UCN.
	};	};

	tok::TokenKind getKind() const { return (tok::TokenKind)Kind; }	tok::TokenKind getKind() const { return (tok::TokenKind)Kind; }
	void setKind(tok::TokenKind K) { Kind = K; }	void setKind(tok::TokenKind K) { Kind = K; }

	/// is/isNot - Predicates to check if this token is a specific kind, as in	/// is/isNot - Predicates to check if this token is a specific kind, as in
	/// "if (Tok.is(tok::l_brace)) {...}".	/// "if (Tok.is(tok::l_brace)) {...}".
	bool is(tok::TokenKind K) const { return Kind == (unsigned) K; }	bool is(tok::TokenKind K) const { return Kind == (unsigned) K; }
	bool isNot(tok::TokenKind K) const { return Kind != (unsigned) K; }	bool isNot(tok::TokenKind K) const { return Kind != (unsigned) K; }

Context not available.

	/// \brief Return true if this token has an empty macro before it.	/// \brief Return true if this token has an empty macro before it.
	///	///
	bool hasLeadingEmptyMacro() const {	bool hasLeadingEmptyMacro() const {
	return (Flags & LeadingEmptyMacro) ? true : false;	return (Flags & LeadingEmptyMacro) ? true : false;
	}	}

	/// \brief Return true if this token is a string or character literal which	/// \brief Return true if this token is a string or character literal which
	/// has a ud-suffix.	/// has a ud-suffix.
	bool hasUDSuffix() const { return (Flags & HasUDSuffix) ? true : false; }	bool hasUDSuffix() const { return (Flags & HasUDSuffix) ? true : false; }

		/// Returns true if this token contains a universal character name.
		bool hasUCN() const { return (Flags & HasUCN) ? true : false; }
	};	};

	/// \brief Information about the conditional stack (\#if directives)	/// \brief Information about the conditional stack (\#if directives)
	/// currently active.	/// currently active.
	struct PPConditionalInfo {	struct PPConditionalInfo {
	/// \brief Location where the conditional started.	/// \brief Location where the conditional started.
	SourceLocation IfLoc;	SourceLocation IfLoc;

	/// \brief True if this was contained in a skipping directive, e.g.,	/// \brief True if this was contained in a skipping directive, e.g.,
	/// in a "\#if 0" block.	/// in a "\#if 0" block.
Context not available.

lib/Lex/Lexer.cpp

Context not available.
	// WARNING: `%.*s' is not in NFKC	// WARNING: `%.*s' is not in NFKC
	// WARNING: `%.*s' is not in NFC	// WARNING: `%.*s' is not in NFC
	//	//
	// Other:	// Other:
	// TODO: Options to support:	// TODO: Options to support:
	// -fexec-charset,-fwide-exec-charset	// -fexec-charset,-fwide-exec-charset
	//	//
	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//

	#include "clang/Lex/Lexer.h"	#include "clang/Lex/Lexer.h"
		#include "clang/Basic/ConvertUTF.h"
	#include "clang/Basic/SourceManager.h"	#include "clang/Basic/SourceManager.h"
	#include "clang/Lex/CodeCompletionHandler.h"	#include "clang/Lex/CodeCompletionHandler.h"
	#include "clang/Lex/LexDiagnostic.h"	#include "clang/Lex/LexDiagnostic.h"
	#include "clang/Lex/Preprocessor.h"	#include "clang/Lex/Preprocessor.h"
	#include "llvm/ADT/STLExtras.h"	#include "llvm/ADT/STLExtras.h"
		#include "llvm/ADT/StringExtras.h"
	#include "llvm/ADT/StringSwitch.h"	#include "llvm/ADT/StringSwitch.h"
	#include "llvm/Support/Compiler.h"	#include "llvm/Support/Compiler.h"
	#include "llvm/Support/MemoryBuffer.h"	#include "llvm/Support/MemoryBuffer.h"
	#include <cstring>	#include <cstring>
	using namespace clang;	using namespace clang;

	static void InitCharacterInfo();	static void InitCharacterInfo();

	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//
	// Token Class Implementation	// Token Class Implementation
Context not available.
	/// if an internal buffer is returned.	/// if an internal buffer is returned.
	unsigned Lexer::getSpelling(const Token &Tok, const char *&Buffer,	unsigned Lexer::getSpelling(const Token &Tok, const char *&Buffer,
	const SourceManager &SourceMgr,	const SourceManager &SourceMgr,
	const LangOptions &LangOpts, bool *Invalid) {	const LangOptions &LangOpts, bool *Invalid) {
	assert((int)Tok.getLength() >= 0 && "Token character range is bogus!");	assert((int)Tok.getLength() >= 0 && "Token character range is bogus!");

	const char *TokStart = 0;	const char *TokStart = 0;
	// NOTE: this has to be checked before testing for an IdentifierInfo.	// NOTE: this has to be checked before testing for an IdentifierInfo.
	if (Tok.is(tok::raw_identifier))	if (Tok.is(tok::raw_identifier))
	TokStart = Tok.getRawIdentifierData();	TokStart = Tok.getRawIdentifierData();
	else if (const IdentifierInfo *II = Tok.getIdentifierInfo()) {	else if (!Tok.hasUCN()) {
	// Just return the string from the identifier table, which is very quick.	if (const IdentifierInfo *II = Tok.getIdentifierInfo()) {
	Buffer = II->getNameStart();	// Just return the string from the identifier table, which is very quick.
	return II->getLength();	Buffer = II->getNameStart();
		return II->getLength();
		}
	}	}

	// NOTE: this can be checked even after testing for an IdentifierInfo.	// NOTE: this can be checked even after testing for an IdentifierInfo.
	if (Tok.isLiteral())	if (Tok.isLiteral())
	TokStart = Tok.getLiteralData();	TokStart = Tok.getLiteralData();

	if (TokStart == 0) {	if (TokStart == 0) {
	// Compute the start of the token in the input lexer buffer.	// Compute the start of the token in the input lexer buffer.
	bool CharDataInvalid = false;	bool CharDataInvalid = false;
	TokStart = SourceMgr.getCharacterData(Tok.getLocation(), &CharDataInvalid);	TokStart = SourceMgr.getCharacterData(Tok.getLocation(), &CharDataInvalid);
Context not available.
	}	}

	/// getCharAndSizeSlow - Peek a single 'character' from the specified buffer,	/// getCharAndSizeSlow - Peek a single 'character' from the specified buffer,
	/// get its size, and return it. This is tricky in several cases:	/// get its size, and return it. This is tricky in several cases:
	/// 1. If currently at the start of a trigraph, we warn about the trigraph,	/// 1. If currently at the start of a trigraph, we warn about the trigraph,
	/// then either return the trigraph (skipping 3 chars) or the '?',	/// then either return the trigraph (skipping 3 chars) or the '?',
	/// depending on whether trigraphs are enabled or not.	/// depending on whether trigraphs are enabled or not.
	/// 2. If this is an escaped newline (potentially with whitespace between	/// 2. If this is an escaped newline (potentially with whitespace between
	/// the backslash and newline), implicitly skip the newline and return	/// the backslash and newline), implicitly skip the newline and return
	/// the char after it.	/// the char after it.
	/// 3. If this is a UCN, return it. FIXME: C++ UCN's?
	///	///
	/// This handles the slow/uncommon case of the getCharAndSize method. Here we	/// This handles the slow/uncommon case of the getCharAndSize method. Here we
	/// know that we can accumulate into Size, and that we have already incremented	/// know that we can accumulate into Size, and that we have already incremented
	/// Ptr by Size bytes.	/// Ptr by Size bytes.
	///	///
	/// NOTE: When this method is updated, getCharAndSizeSlowNoWarn (below) should	/// NOTE: When this method is updated, getCharAndSizeSlowNoWarn (below) should
	/// be updated to match.	/// be updated to match.
	///	///
	char Lexer::getCharAndSizeSlow(const char *Ptr, unsigned &Size,	char Lexer::getCharAndSizeSlow(const char *Ptr, unsigned &Size,
	Token *Tok) {	Token *Tok) {
Context not available.
	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//

	/// \brief Routine that indiscriminately skips bytes in the source file.	/// \brief Routine that indiscriminately skips bytes in the source file.
	void Lexer::SkipBytes(unsigned Bytes, bool StartOfLine) {	void Lexer::SkipBytes(unsigned Bytes, bool StartOfLine) {
	BufferPtr += Bytes;	BufferPtr += Bytes;
	if (BufferPtr > BufferEnd)	if (BufferPtr > BufferEnd)
	BufferPtr = BufferEnd;	BufferPtr = BufferEnd;
	IsAtStartOfLine = StartOfLine;	IsAtStartOfLine = StartOfLine;
	}	}

		namespace {
		struct UCNCharRange {
		uint32_t Lower;
		uint32_t Upper;
		};

		// C11 D.1, C++11 [charname.allowed]
		// FIXME: C99 and C++03 each have a different set of allowed UCNs.
		const UCNCharRange UCNAllowedCharRanges[] = {
		// 1
		{ 0x00A8, 0x00A8 }, { 0x00AA, 0x00AA }, { 0x00AD, 0x00AD },
		{ 0x00AF, 0x00AF }, { 0x00B2, 0x00B5 }, { 0x00B7, 0x00BA },
		{ 0x00BC, 0x00BE }, { 0x00C0, 0x00D6 }, { 0x00D8, 0x00F6 },
		{ 0x00F8, 0x00FF },
		// 2
		{ 0x0100, 0x167F }, { 0x1681, 0x180D }, { 0x180F, 0x1FFF },
		// 3
		{ 0x200B, 0x200D }, { 0x202A, 0x202E }, { 0x203F, 0x2040 },
		{ 0x2054, 0x2054 }, { 0x2060, 0x206F },
		// 4
		{ 0x2070, 0x218F }, { 0x2460, 0x24FF }, { 0x2776, 0x2793 },
		{ 0x2C00, 0x2DFF }, { 0x2E80, 0x2FFF },
		// 5
		{ 0x3004, 0x3007 }, { 0x3021, 0x302F }, { 0x3031, 0x303F },
		// 6
		{ 0x3040, 0xD7FF },
		// 7
		{ 0xF900, 0xFD3D }, { 0xFD40, 0xFDCF }, { 0xFDF0, 0xFE44 },
		{ 0xFE47, 0xFFFD },
		// 8
		{ 0x10000, 0x1FFFD }, { 0x20000, 0x2FFFD }, { 0x30000, 0x3FFFD },
		{ 0x40000, 0x4FFFD }, { 0x50000, 0x5FFFD }, { 0x60000, 0x6FFFD },
		{ 0x70000, 0x7FFFD }, { 0x80000, 0x8FFFD }, { 0x90000, 0x9FFFD },
		{ 0xA0000, 0xAFFFD }, { 0xB0000, 0xBFFFD }, { 0xC0000, 0xCFFFD },
		{ 0xD0000, 0xDFFFD }, { 0xE0000, 0xEFFFD }
		};
		}

		static bool isAllowedIDChar(uint32_t c) {
		unsigned LowPoint = 0;
		unsigned HighPoint = llvm::array_lengthof(UCNAllowedCharRanges);

		// Binary search the UCNAllowedCharRanges set.
		while (HighPoint != LowPoint) {
		unsigned MidPoint = (HighPoint + LowPoint) / 2;
		if (c < UCNAllowedCharRanges[MidPoint].Lower)
		HighPoint = MidPoint;
		else if (c > UCNAllowedCharRanges[MidPoint].Upper)
		LowPoint = MidPoint + 1;
		else
		return true;
		}

		return false;
		}

		static bool isAllowedInitiallyIDChar(uint32_t c) {
		// C11 D.2, C++11 [charname.disallowed]
		// FIXME: C99 only forbids "digits", presumably as described in C99 Annex D.
		// FIXME: C++03 does not forbid any initial characters.
		return !(0x0300 <= c && c <= 0x036F) &&
		!(0x1DC0 <= c && c <= 0x1DFF) &&
		!(0x20D0 <= c && c <= 0x20FF) &&
		!(0xFE20 <= c && c <= 0xFE2F);
		}

		static inline bool isASCII(char C) {
		return static_cast<signed char>(C) >= 0;
		}


	void Lexer::LexIdentifier(Token &Result, const char *CurPtr) {	void Lexer::LexIdentifier(Token &Result, const char *CurPtr) {
	// Match [_A-Za-z0-9]*, we have already matched [_A-Za-z$]	// Match [_A-Za-z0-9]*, we have already matched [_A-Za-z$]
	unsigned Size;	unsigned Size;
	unsigned char C = *CurPtr++;	unsigned char C = *CurPtr++;
	while (isIdentifierBody(C))	while (isIdentifierBody(C))
	C = *CurPtr++;	C = *CurPtr++;

	--CurPtr; // Back up over the skipped character.	--CurPtr; // Back up over the skipped character.

	// Fast path, no $,\,? in identifier found. '\' might be an escaped newline	// Fast path, no $,\,? in identifier found. '\' might be an escaped newline
	// or UCN, and ? might be a trigraph for '\', an escaped newline or UCN.	// or UCN, and ? might be a trigraph for '\', an escaped newline or UCN.
	// FIXME: UCNs.
	//	//
	// TODO: Could merge these checks into a CharInfo flag to make the comparison	// TODO: Could merge these checks into a CharInfo flag to make the comparison
	// cheaper	// cheaper
	if (C != '\\' && C != '?' && (C != '$' || !LangOpts.DollarIdents)) {	if (isASCII(C) && C != '\\' && C != '?' &&
		(C != '$' || !LangOpts.DollarIdents)) {
	FinishIdentifier:	FinishIdentifier:
	const char *IdStart = BufferPtr;	const char *IdStart = BufferPtr;
	FormTokenWithChars(Result, CurPtr, tok::raw_identifier);	FormTokenWithChars(Result, CurPtr, tok::raw_identifier);
	Result.setRawIdentifierData(IdStart);	Result.setRawIdentifierData(IdStart);

	// If we are in raw mode, return this identifier raw. There is no need to	// If we are in raw mode, return this identifier raw. There is no need to
	// look up identifier information or attempt to macro expand it.	// look up identifier information or attempt to macro expand it.
	if (LexingRawMode)	if (LexingRawMode)
	return;	return;

Context not available.
	if (C == '$') {	if (C == '$') {
	// If we hit a $ and they are not supported in identifiers, we are done.	// If we hit a $ and they are not supported in identifiers, we are done.
	if (!LangOpts.DollarIdents) goto FinishIdentifier;	if (!LangOpts.DollarIdents) goto FinishIdentifier;

	// Otherwise, emit a diagnostic and continue.	// Otherwise, emit a diagnostic and continue.
	if (!isLexingRawMode())	if (!isLexingRawMode())
	Diag(CurPtr, diag::ext_dollar_in_identifier);	Diag(CurPtr, diag::ext_dollar_in_identifier);
	CurPtr = ConsumeChar(CurPtr, Size, Result);	CurPtr = ConsumeChar(CurPtr, Size, Result);
	C = getCharAndSize(CurPtr, Size);	C = getCharAndSize(CurPtr, Size);
	continue;	continue;
	} else if (!isIdentifierBody(C)) { // FIXME: UCNs.
	// Found end of identifier.	} else if (C == '\\') {
		const char *UCNPtr = CurPtr + Size;
		uint32_t CodePoint = tryReadUCN(UCNPtr, CurPtr, /*Token=*/0);
		if (CodePoint == 0 || !isAllowedIDChar(CodePoint))
		goto FinishIdentifier;

		Result.setFlag(Token::HasUCN);
		if ((UCNPtr - CurPtr == 6 && CurPtr[1] == 'u') ||
		(UCNPtr - CurPtr == 10 && CurPtr[1] == 'U'))
		CurPtr = UCNPtr;
		else
		while (CurPtr != UCNPtr)
		(void)getAndAdvanceChar(CurPtr, Result);

		C = getCharAndSize(CurPtr, Size);
		continue;
		} else if (!isASCII(C)) {
		const char *UnicodePtr = CurPtr;
		UTF32 CodePoint;
		ConversionResult Result = convertUTF8Sequence((const UTF8 **)&UnicodePtr,
		(const UTF8 *)BufferEnd,
		&CodePoint,
		strictConversion);
		if (Result != conversionOK ||
		!isAllowedIDChar(static_cast<uint32_t>(CodePoint)))
		goto FinishIdentifier;

		CurPtr = UnicodePtr;
		C = getCharAndSize(CurPtr, Size);
		continue;
		} else if (!isIdentifierBody(C)) {
	goto FinishIdentifier;	goto FinishIdentifier;
	}	}

	// Otherwise, this character is good, consume it.	// Otherwise, this character is good, consume it.
	CurPtr = ConsumeChar(CurPtr, Size, Result);	CurPtr = ConsumeChar(CurPtr, Size, Result);

	C = getCharAndSize(CurPtr, Size);	C = getCharAndSize(CurPtr, Size);
	while (isIdentifierBody(C)) { // FIXME: UCNs.	while (isIdentifierBody(C)) {
	CurPtr = ConsumeChar(CurPtr, Size, Result);	CurPtr = ConsumeChar(CurPtr, Size, Result);
	C = getCharAndSize(CurPtr, Size);	C = getCharAndSize(CurPtr, Size);
	}	}
	}	}
	}	}

	/// isHexaLiteral - Return true if Start points to a hex constant.	/// isHexaLiteral - Return true if Start points to a hex constant.
	/// in microsoft mode (where this is supposed to be several different tokens).	/// in microsoft mode (where this is supposed to be several different tokens).
	bool Lexer::isHexaLiteral(const char *Start, const LangOptions &LangOpts) {	bool Lexer::isHexaLiteral(const char *Start, const LangOptions &LangOpts) {
	unsigned Size;	unsigned Size;
Context not available.

	bool Lexer::isCodeCompletionPoint(const char *CurPtr) const {	bool Lexer::isCodeCompletionPoint(const char *CurPtr) const {
	if (PP && PP->isCodeCompletionEnabled()) {	if (PP && PP->isCodeCompletionEnabled()) {
	SourceLocation Loc = FileLoc.getLocWithOffset(CurPtr-BufferStart);	SourceLocation Loc = FileLoc.getLocWithOffset(CurPtr-BufferStart);
	return Loc == PP->getCodeCompletionLoc();	return Loc == PP->getCodeCompletionLoc();
	}	}

	return false;	return false;
	}	}

		uint32_t Lexer::tryReadUCN(const char *&StartPtr, const char *SlashLoc,
		Token *Result) {
		assert(LangOpts.CPlusPlus || LangOpts.C99);

		unsigned CharSize;
		char Kind = getCharAndSize(StartPtr, CharSize);

		unsigned NumHexDigits;
		if (Kind == 'u')
		NumHexDigits = 4;
		else if (Kind == 'U')
		NumHexDigits = 8;
		else
		return 0;

		const char *CurPtr = StartPtr + CharSize;
		const char *KindLoc = &CurPtr[-1];

		uint32_t CodePoint = 0;
		for (unsigned i = 0; i < NumHexDigits; ++i) {
		char C = getCharAndSize(CurPtr, CharSize);

		unsigned Value = llvm::hexDigitValue(C);
		if (Value == -1U) {
		if (Result && !isLexingRawMode()) {
		if (i == 0) {
		Diag(BufferPtr, diag::warn_ucn_escape_no_digits)
		<< StringRef(KindLoc, 1);
		} else {
		Diag(BufferPtr, diag::warn_ucn_escape_incomplete);

		// If the user wrote \U1234, suggest a fixit to \u.
		if (i == 4 && NumHexDigits == 8) {
		CharSourceRange URange =
		CharSourceRange::getCharRange(getSourceLocation(KindLoc),
		getSourceLocation(KindLoc + 1));
		Diag(KindLoc, diag::note_ucn_four_not_eight)
		<< FixItHint::CreateReplacement(URange, "u");
		}
		}
		}

		return 0;
		}

		CodePoint <<= 4;
		CodePoint += Value;
		rsmith: You can skip this in the common case that CurPtr - StartPtr == NumHexDigits + 2

		CurPtr += CharSize;
		}

		if (Result) {
		Result->setFlag(Token::HasUCN);
		if (CurPtr - StartPtr == NumHexDigits + 2)
		rsmith: This ' should be a `, right? It'd be nice to also reference C++'s equivalent "Additionally, if the hexadecimal value for a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character or string literal corresponds to a control character (in either of the ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the basic source character set, the program is ill-formed."
		jordan_rose: Grr...darn OS X being "helpful". Good catch. I'll add the C++ comment.
		StartPtr = CurPtr;
		else
		while (StartPtr != CurPtr)
		(void)getAndAdvanceChar(StartPtr, *Result);
		} else {
		StartPtr = CurPtr;
		}

		// C99 6.4.3p2: A universal character name shall not specify a character whose
		// short identifier is less than 00A0 other than 0024 ($), 0040 (@), or
		// 0060 (`), nor one in the range D800 through DFFF inclusive.)
		// C++11 [lex.charset]p2: If the hexadecimal value for a
		// universal-character-name corresponds to a surrogate code point (in the
		// range 0xD800-0xDFFF, inclusive), the program is ill-formed. Additionally,
		// if the hexadecimal value for a universal-character-name outside the
		// c-char-sequence, s-char-sequence, or r-char-sequence of a character or
		// string literal corresponds to a control character (in either of the
		// ranges 0x00–0x1F or 0x7F–0x9F, both inclusive) or to a character in the
		rsmith: Please replace these UTF-8 dashes with ASCII :)
		// basic source character set, the program is ill-formed.
		if (CodePoint < 0xA0) {
		if (CodePoint == 0x24 || CodePoint == 0x40 || CodePoint == 0x60)
		return CodePoint;

		// We don't use isLexingRawMode() here because we need to warn about bad
		// UCNs even when skipping preprocessing tokens in a #if block.
		if (Result && PP) {
		if (CodePoint < 0x20 || CodePoint >= 0x7F)
		Diag(BufferPtr, diag::err_ucn_control_character);
		else {
		char C = static_cast<char>(CodePoint);
		Diag(BufferPtr, diag::err_ucn_escape_basic_scs) << StringRef(&C, 1);
		}
		}

		return 0;

		} else if ((!LangOpts.CPlusPlus || LangOpts.CPlusPlus11) &&
		(CodePoint >= 0xD800 && CodePoint <= 0xDFFF)) {
		// C++03 allows UCNs representing surrogate characters. C99 and C++11 don't.
		// We don't use isLexingRawMode() here because we need to warn about bad
		// UCNs even when skipping preprocessing tokens in a #if block.
		if (Result && PP)
		Diag(BufferPtr, diag::err_ucn_escape_invalid);
		return 0;
		}

		return CodePoint;
		}
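
A minimal standalone sketch of the validation that tryReadUCN performs, for readers who want to experiment with the rules outside the lexer. The names `hexValue`, `parseUCN`, `AllowSurrogates`, and `ucnExamples` are hypothetical and not part of the patch. The sketch assumes the input starts just after the backslash, ignores escaped newlines and trigraphs (which the real code handles through getCharAndSize), and emits no diagnostics. Like the patch, it returns 0 on failure.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: numeric value of a hex digit, or -1 if C is not one.
static int hexValue(char C) {
  if (C >= '0' && C <= '9') return C - '0';
  if (C >= 'a' && C <= 'f') return C - 'a' + 10;
  if (C >= 'A' && C <= 'F') return C - 'A' + 10;
  return -1;
}

// Hypothetical stand-in for tryReadUCN. Text starts just after the backslash.
// Returns the code point, or 0 if Text does not begin with a UCN that is
// permitted outside a character or string literal.
uint32_t parseUCN(const std::string &Text, bool AllowSurrogates = false) {
  if (Text.empty()) return 0;

  std::size_t NumHexDigits;
  if (Text[0] == 'u') NumHexDigits = 4;
  else if (Text[0] == 'U') NumHexDigits = 8;
  else return 0;

  // Not enough characters left for a complete UCN.
  if (Text.size() < 1 + NumHexDigits) return 0;

  uint32_t CodePoint = 0;
  for (std::size_t I = 1; I <= NumHexDigits; ++I) {
    int Value = hexValue(Text[I]);
    if (Value < 0) return 0; // incomplete UCN
    CodePoint = (CodePoint << 4) + static_cast<uint32_t>(Value);
  }

  // Below U+00A0 only '$', '@' and '`' may be spelled as a UCN.
  if (CodePoint < 0xA0) {
    if (CodePoint == 0x24 || CodePoint == 0x40 || CodePoint == 0x60)
      return CodePoint;
    return 0;
  }

  // C99 and C++11 reject surrogates; C++03 allows them.
  if (!AllowSurrogates && CodePoint >= 0xD800 && CodePoint <= 0xDFFF)
    return 0;

  return CodePoint;
}

// Usage examples mirroring cases from the tests in this patch.
void ucnExamples() {
  assert(parseUCN("u00FCber") == 0xFCu);      // \u00FCber
  assert(parseUCN("U000000FCber") == 0xFCu);  // \U000000FCber
  assert(parseUCN("u0061") == 0);             // 'a' is in the basic character set
  assert(parseUCN("uarecool") == 0);          // incomplete
  assert(parseUCN("U00FC") == 0);             // \U needs eight digits
}
```

Passing `AllowSurrogates = true` approximates the C++03 behavior described in the comment above the surrogate check.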

		static bool isUnicodeWhitespace(uint32_t C) {
		return (C == 0x0085 || C == 0x00A0 || C == 0x1680 ||
		C == 0x180E || (C >= 0x2000 && C <= 0x200A) ||
		C == 0x2028 || C == 0x2029 || C == 0x202F ||
		C == 0x205F || C == 0x3000);
		}

		void Lexer::LexUnicode(Token &Result, uint32_t C, const char *CurPtr) {
		if (isUnicodeWhitespace(C)) {
		if (!isLexingRawMode()) {
		CharSourceRange CharRange =
		CharSourceRange::getCharRange(getSourceLocation(),
		getSourceLocation(CurPtr));
		Diag(BufferPtr, diag::ext_unicode_whitespace)
		<< CharRange;
		}

		Result.setFlag(Token::LeadingSpace);
		if (SkipWhitespace(Result, CurPtr))
		return; // KeepWhitespaceMode

		return LexTokenInternal(Result);
		}
		rsmith: Do we diagnose such characters within #if 0 blocks?
		jordan_rose: Hm, we should but this code does not. But I was hitting a reentrancy problem before where emitting the diagnostic required re-lexing. Is there a better way to distinguish "actual parsing" from "lexing for diagnostics" that doesn't include "skipping over #if 0 blocks"?

		if (isAllowedIDChar(C) && isAllowedInitiallyIDChar(C)) {
		MIOpt.ReadToken();
		return LexIdentifier(Result, CurPtr);
		}

		if (!isASCII(*BufferPtr) && !isAllowedIDChar(C)) {
		// Non-ASCII characters tend to creep into source code unintentionally.
		// Instead of letting the parser complain about the unknown token,
		// just drop the character.
		// Note that we can /only/ do this when the non-ASCII character is actually
		// spelled as Unicode, not written as a UCN. The standard requires that
		// we not throw away any possible preprocessor tokens, but there's a
		// loophole in the mapping of Unicode characters to basic character set
		// characters that allows us to map these particular characters to, say,
		// whitespace.
		if (!isLexingRawMode()) {
		CharSourceRange CharRange =
		CharSourceRange::getCharRange(getSourceLocation(),
		getSourceLocation(CurPtr));
		Diag(BufferPtr, diag::err_non_ascii)
		<< FixItHint::CreateRemoval(CharRange);
		}

		BufferPtr = CurPtr;
		return LexTokenInternal(Result);
		}

		// Otherwise, we have an explicit UCN or a character that's unlikely to show
		// up by accident.
		MIOpt.ReadToken();
		FormTokenWithChars(Result, CurPtr, tok::unknown);
		}


	/// LexTokenInternal - This implements a simple C family lexer. It is an	/// LexTokenInternal - This implements a simple C family lexer. It is an
	/// extremely performance critical piece of code. This assumes that the buffer	/// extremely performance critical piece of code. This assumes that the buffer
	/// has a null character at the end of the file. This returns a preprocessing	/// has a null character at the end of the file. This returns a preprocessing
	/// token, not a normal token, as such, it is an internal interface. It assumes	/// token, not a normal token, as such, it is an internal interface. It assumes
	/// that the Flags of result have been cleared before calling this.	/// that the Flags of result have been cleared before calling this.
	void Lexer::LexTokenInternal(Token &Result) {	void Lexer::LexTokenInternal(Token &Result) {
	LexNextToken:	LexNextToken:
	// New token, can't need cleaning yet.	// New token, can't need cleaning yet.
	Result.clearFlag(Token::NeedsCleaning);	Result.clearFlag(Token::NeedsCleaning);
Context not available.
	break;	break;

	case '@':	case '@':
	// Objective C support.	// Objective C support.
	if (CurPtr[-1] == '@' && LangOpts.ObjC1)	if (CurPtr[-1] == '@' && LangOpts.ObjC1)
	Kind = tok::at;	Kind = tok::at;
	else	else
	Kind = tok::unknown;	Kind = tok::unknown;
	break;	break;

		// UCNs (C99 6.4.3, C++11 [lex.charset]p2)
	case '\\':	case '\\':
	// FIXME: UCN's.	if (uint32_t CodePoint = tryReadUCN(CurPtr, BufferPtr, &Result))
	// FALL THROUGH.	return LexUnicode(Result, CodePoint, CurPtr);
	default:
	Kind = tok::unknown;	Kind = tok::unknown;
	break;	break;

		default: {
		if (isASCII(Char)) {
		Kind = tok::unknown;
		break;
		}

		UTF32 CodePoint;

		// We can't just reset CurPtr to BufferPtr because BufferPtr may point to
		// an escaped newline.
		--CurPtr;
		ConversionResult Status = convertUTF8Sequence((const UTF8 **)&CurPtr,
		(const UTF8 *)BufferEnd,
		&CodePoint,
		strictConversion);
		if (Status == conversionOK)
		return LexUnicode(Result, CodePoint, CurPtr);

		// Non-ASCII characters tend to creep into source code unintentionally.
		// Instead of letting the parser complain about the unknown token,
		// just warn that we don't have valid UTF-8, then drop the character.
		if (!isLexingRawMode())
		Diag(CurPtr, diag::err_invalid_utf8);

		BufferPtr = CurPtr+1;
		goto LexNextToken;
		}
	}	}

	// Notify MIOpt that we read a non-whitespace/non-comment token.	// Notify MIOpt that we read a non-whitespace/non-comment token.
	MIOpt.ReadToken();	MIOpt.ReadToken();

	// Update the location of token as well as BufferPtr.	// Update the location of token as well as BufferPtr.
	FormTokenWithChars(Result, CurPtr, Kind);	FormTokenWithChars(Result, CurPtr, Kind);
	return;	return;

	HandleDirective:	HandleDirective:
Context not available.

lib/Lex/LiteralSupport.cpp

Context not available.
	ResultChar = 9;	ResultChar = 9;
	break;	break;
	case 'v':	case 'v':
	ResultChar = 11;	ResultChar = 11;
	break;	break;
	case 'x': { // Hex escape.	case 'x': { // Hex escape.
	ResultChar = 0;	ResultChar = 0;
	if (ThisTokBuf == ThisTokEnd || !isxdigit(*ThisTokBuf)) {	if (ThisTokBuf == ThisTokEnd || !isxdigit(*ThisTokBuf)) {
	if (Diags)	if (Diags)
	Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,	Diag(Diags, Features, Loc, ThisTokBegin, EscapeBegin, ThisTokBuf,
	diag::err_hex_escape_no_digits);	diag::err_hex_escape_no_digits) << "x";
	HadError = 1;	HadError = 1;
	break;	break;
	}	}

	// Hex escapes are a maximal series of hex digits.	// Hex escapes are a maximal series of hex digits.
	bool Overflow = false;	bool Overflow = false;
	for (; ThisTokBuf != ThisTokEnd; ++ThisTokBuf) {	for (; ThisTokBuf != ThisTokEnd; ++ThisTokBuf) {
	int CharVal = llvm::hexDigitValue(ThisTokBuf[0]);	int CharVal = llvm::hexDigitValue(ThisTokBuf[0]);
	if (CharVal == -1) break;	if (CharVal == -1) break;
	// About to shift out a digit?	// About to shift out a digit?
Context not available.
	const LangOptions &Features,	const LangOptions &Features,
	bool in_char_string_literal = false) {	bool in_char_string_literal = false) {
	const char *UcnBegin = ThisTokBuf;	const char *UcnBegin = ThisTokBuf;

	// Skip the '\u' char's.	// Skip the '\u' char's.
	ThisTokBuf += 2;	ThisTokBuf += 2;

	if (ThisTokBuf == ThisTokEnd || !isxdigit(*ThisTokBuf)) {	if (ThisTokBuf == ThisTokEnd || !isxdigit(*ThisTokBuf)) {
	if (Diags)	if (Diags)
	Diag(Diags, Features, Loc, ThisTokBegin, UcnBegin, ThisTokBuf,	Diag(Diags, Features, Loc, ThisTokBegin, UcnBegin, ThisTokBuf,
	diag::err_ucn_escape_no_digits);	diag::err_hex_escape_no_digits) << StringRef(&ThisTokBuf[-1], 1);
	return false;	return false;
	}	}
	UcnLen = (ThisTokBuf[-1] == 'u' ? 4 : 8);	UcnLen = (ThisTokBuf[-1] == 'u' ? 4 : 8);
	unsigned short UcnLenSave = UcnLen;	unsigned short UcnLenSave = UcnLen;
	for (; ThisTokBuf != ThisTokEnd && UcnLenSave; ++ThisTokBuf, UcnLenSave--) {	for (; ThisTokBuf != ThisTokEnd && UcnLenSave; ++ThisTokBuf, UcnLenSave--) {
	int CharVal = llvm::hexDigitValue(ThisTokBuf[0]);	int CharVal = llvm::hexDigitValue(ThisTokBuf[0]);
	if (CharVal == -1) break;	if (CharVal == -1) break;
	UcnVal <<= 4;	UcnVal <<= 4;
	UcnVal |= CharVal;	UcnVal |= CharVal;
	}	}
Context not available.

lib/Lex/Preprocessor.cpp

Context not available.
	// -W*	// -W*
	// -w	// -w
	//	//
	// Messages to emit:	// Messages to emit:
	// "Multiple include guards may be useful for:\n"	// "Multiple include guards may be useful for:\n"
	//	//
	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//

	#include "clang/Lex/Preprocessor.h"	#include "clang/Lex/Preprocessor.h"
	#include "MacroArgs.h"	#include "MacroArgs.h"
		#include "clang/Basic/ConvertUTF.h"
	#include "clang/Basic/FileManager.h"	#include "clang/Basic/FileManager.h"
	#include "clang/Basic/SourceManager.h"	#include "clang/Basic/SourceManager.h"
	#include "clang/Basic/TargetInfo.h"	#include "clang/Basic/TargetInfo.h"
	#include "clang/Lex/CodeCompletionHandler.h"	#include "clang/Lex/CodeCompletionHandler.h"
	#include "clang/Lex/ExternalPreprocessorSource.h"	#include "clang/Lex/ExternalPreprocessorSource.h"
	#include "clang/Lex/HeaderSearch.h"	#include "clang/Lex/HeaderSearch.h"
	#include "clang/Lex/LexDiagnostic.h"	#include "clang/Lex/LexDiagnostic.h"
	#include "clang/Lex/LiteralSupport.h"	#include "clang/Lex/LiteralSupport.h"
	#include "clang/Lex/MacroInfo.h"	#include "clang/Lex/MacroInfo.h"
	#include "clang/Lex/ModuleLoader.h"	#include "clang/Lex/ModuleLoader.h"
	#include "clang/Lex/Pragma.h"	#include "clang/Lex/Pragma.h"
	#include "clang/Lex/PreprocessingRecord.h"	#include "clang/Lex/PreprocessingRecord.h"
	#include "clang/Lex/PreprocessorOptions.h"	#include "clang/Lex/PreprocessorOptions.h"
	#include "clang/Lex/ScratchBuffer.h"	#include "clang/Lex/ScratchBuffer.h"
	#include "llvm/ADT/APFloat.h"	#include "llvm/ADT/APFloat.h"
	#include "llvm/ADT/SmallString.h"	#include "llvm/ADT/SmallString.h"
		#include "llvm/ADT/STLExtras.h"
		#include "llvm/ADT/StringExtras.h"
	#include "llvm/Support/Capacity.h"	#include "llvm/Support/Capacity.h"
	#include "llvm/Support/MemoryBuffer.h"	#include "llvm/Support/MemoryBuffer.h"
	#include "llvm/Support/raw_ostream.h"	#include "llvm/Support/raw_ostream.h"
	using namespace clang;	using namespace clang;

	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//
	ExternalPreprocessorSource::~ExternalPreprocessorSource() { }	ExternalPreprocessorSource::~ExternalPreprocessorSource() { }

	PPMutationListener::~PPMutationListener() { }	PPMutationListener::~PPMutationListener() { }

Context not available.
	setCodeCompletionReached();	setCodeCompletionReached();
	}	}

	/// getSpelling - This method is used to get the spelling of a token into a	/// getSpelling - This method is used to get the spelling of a token into a
	/// SmallVector. Note that the returned StringRef may not point to the	/// SmallVector. Note that the returned StringRef may not point to the
	/// supplied buffer if a copy can be avoided.	/// supplied buffer if a copy can be avoided.
	StringRef Preprocessor::getSpelling(const Token &Tok,	StringRef Preprocessor::getSpelling(const Token &Tok,
	SmallVectorImpl<char> &Buffer,	SmallVectorImpl<char> &Buffer,
	bool *Invalid) const {	bool *Invalid) const {
	// NOTE: this has to be checked before testing for an IdentifierInfo.	// NOTE: this has to be checked before testing for an IdentifierInfo.
	if (Tok.isNot(tok::raw_identifier)) {	if (Tok.isNot(tok::raw_identifier) && !Tok.hasUCN()) {
	// Try the fast path.	// Try the fast path.
	if (const IdentifierInfo *II = Tok.getIdentifierInfo())	if (const IdentifierInfo *II = Tok.getIdentifierInfo())
	return II->getName();	return II->getName();
	}	}

	// Resize the buffer if we need to copy into it.	// Resize the buffer if we need to copy into it.
	if (Tok.needsCleaning())	if (Tok.needsCleaning())
	Buffer.resize(Tok.getLength());	Buffer.resize(Tok.getLength());

	const char *Ptr = Buffer.data();	const char *Ptr = Buffer.data();
Context not available.
	void Preprocessor::EndSourceFile() {	void Preprocessor::EndSourceFile() {
	// Notify the client that we reached the end of the source file.	// Notify the client that we reached the end of the source file.
	if (Callbacks)	if (Callbacks)
	Callbacks->EndOfMainFile();	Callbacks->EndOfMainFile();
	}	}

	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//
	// Lexer Event Handling.	// Lexer Event Handling.
	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//

		static void appendCodePoint(unsigned Codepoint,
		llvm::SmallVectorImpl<char> &Str) {
		char ResultBuf[4];
		char *ResultPtr = ResultBuf;
		bool Res = ConvertCodePointToUTF8(Codepoint, ResultPtr);
		(void)Res;
		assert(Res && "Unexpected conversion failure");
		Str.append(ResultBuf, ResultPtr);
		}

		static void expandUCNs(SmallVectorImpl<char> &Buf, StringRef Input) {
		for (StringRef::iterator I = Input.begin(), E = Input.end(); I != E; ++I) {
		if (*I != '\\') {
		Buf.push_back(*I);
		continue;
		}

		++I;
		assert(*I == 'u' || *I == 'U');

		unsigned NumHexDigits;
		if (*I == 'u')
		NumHexDigits = 4;
		else
		NumHexDigits = 8;

		assert(I + NumHexDigits <= E);

		uint32_t CodePoint = 0;
		for (++I; NumHexDigits != 0; ++I, --NumHexDigits) {
		unsigned Value = llvm::hexDigitValue(*I);
		assert(Value != -1U);

		CodePoint <<= 4;
		CodePoint += Value;
		}

		appendCodePoint(CodePoint, Buf);
		--I;
		}
		}
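
The conversion above can be tried in isolation with the following sketch, which uses only the standard library. `appendUTF8` and `expandUCNsSketch` are hypothetical names. The encoder is a hand-written stand-in for ConvertCodePointToUTF8 and assumes the code point is at most 0x10FFFF. As in the patch, the input is assumed to contain only well-formed UCNs, so malformed input trips an assertion instead of being diagnosed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Append the UTF-8 encoding of CodePoint (assumed <= 0x10FFFF) to Out.
static void appendUTF8(uint32_t CodePoint, std::string &Out) {
  if (CodePoint < 0x80) {
    Out += static_cast<char>(CodePoint);
  } else if (CodePoint < 0x800) {
    Out += static_cast<char>(0xC0 | (CodePoint >> 6));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else if (CodePoint < 0x10000) {
    Out += static_cast<char>(0xE0 | (CodePoint >> 12));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CodePoint >> 18));
    Out += static_cast<char>(0x80 | ((CodePoint >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CodePoint >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CodePoint & 0x3F));
  }
}

// Replace every \uXXXX or \UXXXXXXXX in Input with its UTF-8 encoding.
std::string expandUCNsSketch(const std::string &Input) {
  std::string Out;
  for (std::size_t I = 0; I < Input.size(); ++I) {
    if (Input[I] != '\\') {
      Out += Input[I];
      continue;
    }

    // Step onto the 'u' or 'U' that must follow the backslash.
    ++I;
    assert(I < Input.size() && (Input[I] == 'u' || Input[I] == 'U'));
    std::size_t NumHexDigits = (Input[I] == 'u') ? 4 : 8;
    assert(I + NumHexDigits < Input.size() && "truncated UCN");

    uint32_t CodePoint = 0;
    for (std::size_t D = 0; D < NumHexDigits; ++D) {
      char C = Input[++I];
      uint32_t Value;
      if (C >= '0' && C <= '9') Value = C - '0';
      else if (C >= 'a' && C <= 'f') Value = C - 'a' + 10;
      else if (C >= 'A' && C <= 'F') Value = C - 'A' + 10;
      else { assert(false && "non-hex digit in UCN"); Value = 0; }
      CodePoint = (CodePoint << 4) + Value;
    }
    // I now indexes the last hex digit; the loop increment moves past it.
    appendUTF8(CodePoint, Out);
  }
  return Out;
}
```

With this sketch, the spellings `\u00FCber` and `\U000000FCber` both expand to the same UTF-8 bytes as a literal `über`, which is why the test files in this patch treat all three as the same identifier.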

	/// LookUpIdentifierInfo - Given a tok::raw_identifier token, look up the	/// LookUpIdentifierInfo - Given a tok::raw_identifier token, look up the
	/// identifier information for the token and install it into the token,	/// identifier information for the token and install it into the token,
	/// updating the token kind accordingly.	/// updating the token kind accordingly.
	IdentifierInfo *Preprocessor::LookUpIdentifierInfo(Token &Identifier) const {	IdentifierInfo *Preprocessor::LookUpIdentifierInfo(Token &Identifier) const {
	assert(Identifier.getRawIdentifierData() != 0 && "No raw identifier data!");	assert(Identifier.getRawIdentifierData() != 0 && "No raw identifier data!");

	// Look up this token, see if it is a macro, or if it is a language keyword.	// Look up this token, see if it is a macro, or if it is a language keyword.
	IdentifierInfo *II;	IdentifierInfo *II;
	if (!Identifier.needsCleaning()) {	if (!Identifier.needsCleaning() && !Identifier.hasUCN()) {
	// No cleaning needed, just use the characters from the lexed buffer.	// No cleaning needed, just use the characters from the lexed buffer.
	II = getIdentifierInfo(StringRef(Identifier.getRawIdentifierData(),	II = getIdentifierInfo(StringRef(Identifier.getRawIdentifierData(),
	Identifier.getLength()));	Identifier.getLength()));
	} else {	} else {
	// Cleaning needed, alloca a buffer, clean into it, then use the buffer.	// Cleaning needed, alloca a buffer, clean into it, then use the buffer.
	SmallString<64> IdentifierBuffer;	SmallString<64> IdentifierBuffer;
	StringRef CleanedStr = getSpelling(Identifier, IdentifierBuffer);	StringRef CleanedStr = getSpelling(Identifier, IdentifierBuffer);
	II = getIdentifierInfo(CleanedStr);
		if (Identifier.hasUCN()) {
		SmallString<64> UCNIdentifierBuffer;
		expandUCNs(UCNIdentifierBuffer, CleanedStr);
		II = getIdentifierInfo(UCNIdentifierBuffer);
		} else {
		II = getIdentifierInfo(CleanedStr);
		}
	}	}

	// Update the token info (identifier info and appropriate token kind).	// Update the token info (identifier info and appropriate token kind).
	Identifier.setIdentifierInfo(II);	Identifier.setIdentifierInfo(II);
	Identifier.setKind(II->getTokenID());	Identifier.setKind(II->getTokenID());

	return II;	return II;
	}	}

	void Preprocessor::SetPoisonReason(IdentifierInfo *II, unsigned DiagID) {	void Preprocessor::SetPoisonReason(IdentifierInfo *II, unsigned DiagID) {
Context not available.

test/CXX/over/over.oper/over.literal/p8.cpp

	// RUN: %clang_cc1 -std=c++11 %s -verify			// RUN: %clang_cc1 -std=c++11 %s -verify

	struct string;			struct string;
	namespace std {			namespace std {
	using size_t = decltype(sizeof(int));			using size_t = decltype(sizeof(int));
	}			}

	void operator "" _km(long double); // ok			void operator "" _km(long double); // ok
	string operator "" _i18n(const char*, std::size_t); // ok			string operator "" _i18n(const char*, std::size_t); // ok
	// FIXME: This should be accepted once we support UCNs			template<char...> int operator "" \u03C0(); // ok, UCN for lowercase pi // expected-warning {{reserved}}
	template<char...> int operator "" \u03C0(); // ok, UCN for lowercase pi // expected-error {{expected identifier}}
	float operator ""E(const char *); // expected-error {{invalid suffix on literal}} expected-warning {{reserved}}			float operator ""E(const char *); // expected-error {{invalid suffix on literal}} expected-warning {{reserved}}
	float operator " " B(const char *); // expected-error {{must be '""'}} expected-warning {{reserved}}			float operator " " B(const char *); // expected-error {{must be '""'}} expected-warning {{reserved}}
	string operator "" 5X(const char *, std::size_t); // expected-error {{expected identifier}}			string operator "" 5X(const char *, std::size_t); // expected-error {{expected identifier}}
	double operator "" _miles(double); // expected-error {{parameter}}			double operator "" _miles(double); // expected-error {{parameter}}
	template<char...> int operator "" j(const char*); // expected-error {{parameter}}			template<char...> int operator "" j(const char*); // expected-error {{parameter}}

	float operator ""_E(const char *);			float operator ""_E(const char *);

test/CodeGen/ucn-identifiers.c

This file was added.

				// RUN: %clang_cc1 %s -emit-llvm -o /dev/null
				// RUN: %clang_cc1 %s -emit-llvm -o /dev/null -x c++
				// This file contains UTF-8; please do not fix!


				extern void \u00FCber(int);
				extern void \U000000FCber(int); // redeclaration, no warning

				void goodCalls() {
				\u00FCber(0);
				\u00fcber(1);
				über(2);
				\U000000FCber(3);
				}

test/FixIt/fixit-unicode.c

	// RUN: %clang_cc1 -fsyntax-only %s 2>&1 | FileCheck -strict-whitespace %s			// RUN: %clang_cc1 -fsyntax-only %s 2>&1 | FileCheck -strict-whitespace %s
	// RUN: %clang_cc1 -fsyntax-only -fdiagnostics-parseable-fixits %s 2>&1 | FileCheck -check-prefix=CHECK-MACHINE %s			// RUN: %clang_cc1 -fsyntax-only -fdiagnostics-parseable-fixits %s 2>&1 | FileCheck -check-prefix=CHECK-MACHINE %s

	struct Foo {			struct Foo {
	int bar;			int bar;
	};			};

	// PR13312			// PR13312
	void test1() {			void test1() {
	struct Foo foo;			struct Foo foo;
	(&foo)☃>bar = 42;			foo.bar = 42☃
				// CHECK: error: non-ASCII characters are not allowed outside of literals and identifiers
				// CHECK: {{^ \^}}
	// CHECK: error: expected ';' after expression			// CHECK: error: expected ';' after expression
	// Make sure we emit the fixit right in front of the snowman.			// Make sure we emit the fixit right in front of the snowman.
	// CHECK: {{^ \^}}			// CHECK: {{^ \^}}
	// CHECK: {{^ ;}}			// CHECK: {{^ ;}}

	// CHECK-MACHINE: fix-it:"{{.*}}fixit-unicode.c":{11:9-11:9}:";"			// CHECK-MACHINE: fix-it:"{{.*}}fixit-unicode.c":{[[@LINE-8]]:15-[[@LINE-8]]:15}:";"
	}			}


	int printf(const char *, ...);			int printf(const char *, ...);
	void test2() {			void test2() {
	printf("∆: %d", 1L);			printf("∆: %d", 1L);
	// CHECK: warning: format specifies type 'int' but the argument has type 'long'			// CHECK: warning: format specifies type 'int' but the argument has type 'long'
	// Don't crash emitting a fixit after the delta.			// Don't crash emitting a fixit after the delta.
	// CHECK: printf("			// CHECK: printf("
	// CHECK: : %d", 1L);			// CHECK: : %d", 1L);
	// Unfortunately, we can't actually check the location of the printed fixit,			// Unfortunately, we can't actually check the location of the printed fixit,
	// because different systems will render the delta differently (either as a			// because different systems will render the delta differently (either as a
	// character, or as <U+2206>.) The fixit should line up with the %d regardless.			// character, or as <U+2206>.) The fixit should line up with the %d regardless.

	// CHECK-MACHINE: fix-it:"{{.*}}fixit-unicode.c":{23:16-23:18}:"%ld"			// CHECK-MACHINE: fix-it:"{{.*}}fixit-unicode.c":{[[@LINE-9]]:16-[[@LINE-9]]:18}:"%ld"
	}			}

test/Lexer/unicode.c

This file was added.

				// RUN: %clang_cc1 -fsyntax-only -verify %s

				// This file contains Unicode characters; please do not "fix" them!

				extern int x; // expected-warning {{treating Unicode character as whitespace}}
				extern int　x; // expected-warning {{treating Unicode character as whitespace}}

test/Lexer/utf8-invalid.c

This file was added.

				// RUN: %clang_cc1 -fsyntax-only -verify %s

				// Note: this file contains invalid UTF-8 before the variable name in the
				// next line. Please do not fix!

				extern int Çx; // expected-error{{source file is not valid UTF-8}}

test/Preprocessor/ucn-pp-identifier.c

This file was added.

				// RUN: %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -verify -Wundef
				// RUN: %clang_cc1 %s -fsyntax-only -x c++ -pedantic -verify -Wundef
				// RUN: %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -fdiagnostics-parseable-fixits -Wundef 2>&1 | FileCheck -strict-whitespace %s

				#define \u00FC
				#define a\u00FD() 0
				#ifndef \u00FC
				#error "This should never happen"
				#endif

				#if a\u00FD()
				#error "This should never happen"
				#endif

				#if a\U000000FD()
				#error "This should never happen"
				#endif

				#if \uarecool // expected-warning{{incomplete universal character name; treating as '\' followed by identifier}} expected-error {{invalid token at start of a preprocessor expression}}
				#endif
				#if \uwerecool // expected-warning{{\u used with no following hex digits; treating as '\' followed by identifier}} expected-error {{invalid token at start of a preprocessor expression}}
				#endif
				#if \U0001000 // expected-warning{{incomplete universal character name; treating as '\' followed by identifier}} expected-error {{invalid token at start of a preprocessor expression}}
				#endif

				// Make sure we reject disallowed UCNs
				#define \ufffe // expected-error {{macro names must be identifiers}}
				#define \U10000000 // expected-error {{macro names must be identifiers}}
				#define \u0061 // expected-error {{character 'a' cannot be specified by a universal character name}} expected-error {{macro names must be identifiers}}

				// FIXME: Not clear what our behavior should be here; \u0024 is "$".
				#define a\u0024 // expected-warning {{whitespace}}

				#if \u0110 // expected-warning {{is not defined, evaluates to 0}}
				#endif


				#define \u0110 1 / 0
				#if \u0110 // expected-error {{division by zero in preprocessor expression}}
				#endif

				#define STRINGIZE(X) # X

				extern int check_size[sizeof(STRINGIZE(\u0112)) == 3 ? 1 : -1];

				// Check that we still diagnose disallowed UCNs in #if 0 blocks.
				// C99 5.1.1.2p1 and C++11 [lex.phases]p1 dictate that preprocessor tokens are
				// formed before directives are parsed.
				// expected-error@+4 {{character 'a' cannot be specified by a universal character name}}
				#if 0
				#define \ufffe // okay
				#define \U10000000 // okay
				#define \u0061 // error, but -verify only looks at comments outside #if 0
				#endif


				// A UCN formed by token pasting is undefined in both C99 and C++.
				// Right now we don't do anything special, which causes us to coincidentally
				// accept the first case below but reject the second two.
				#define PASTE(A, B) A ## B
				extern int PASTE(\, u00FD);
				extern int PASTE(\u, 00FD); // expected-warning{{\u used with no following hex digits}}
				extern int PASTE(\u0, 0FD); // expected-warning{{incomplete universal character name}}
				#ifdef __cplusplus
				// expected-error@-3 {{expected unqualified-id}}
				// expected-error@-3 {{expected unqualified-id}}
				#else
				// expected-error@-6 {{expected identifier}}
				// expected-error@-6 {{expected identifier}}
				#endif


				// A UCN produced by line splicing is valid in C99 but undefined in C++.
				// Since undefined behavior can do anything including working as intended,
				// we just accept it in C++ as well.
				#define newline_1_\u00F\
				C 1
				#define newline_2_\u00\
				F\
				C 1
				#define newline_3_\u\
				00\
				FC 1
				#define newline_4_\\
				u00FC 1
				#define newline_5_\\
				u\
				\
				0\
				0\
				F\
				C 1

				#if (newline_1_\u00FC && newline_2_\u00FC && newline_3_\u00FC && \
				newline_4_\u00FC && newline_5_\u00FC)
				#else
				#error "Line splicing failed to produce UCNs"
				#endif
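
To see why the five `newline_N_` macros above all name an identifier ending in `\u00FC`, it may help to model line splicing as a separate pass. The sketch below is only a conceptual illustration: `spliceLines` and `spliceExamples` are hypothetical names, and the function handles only a backslash immediately followed by `\n`. It does not reflect how the lexer in this patch is structured, which deals with escaped newlines lazily while reading characters.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Remove every backslash that is immediately followed by a newline, together
// with that newline (a simplified model of translation phase 2).
std::string spliceLines(const std::string &Src) {
  std::string Out;
  for (std::size_t I = 0; I < Src.size(); ++I) {
    if (Src[I] == '\\' && I + 1 < Src.size() && Src[I + 1] == '\n') {
      ++I; // skip the newline as well
      continue;
    }
    Out += Src[I];
  }
  return Out;
}

// The spliced text of each macro definition contains a complete UCN.
void spliceExamples() {
  // newline_1_: the last hex digit sits on the next line.
  assert(spliceLines("newline_1_\\u00F\\\nC 1") == "newline_1_\\u00FC 1");
  // newline_4_: the first backslash starts the UCN, the second is the splice.
  assert(spliceLines("newline_4_\\\\\nu00FC 1") == "newline_4_\\u00FC 1");
}
```

After splicing, the remaining text can be lexed as if the UCN had been written on one line, which matches the C99 reading described in the comment above.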


				#define capital_u_\U00FC
				// expected-warning@-1 {{incomplete universal character name}} expected-note@-1 {{did you mean to use '\u'?}} expected-warning@-1 {{whitespace}}
				// CHECK: note: did you mean to use '\u'?
				// CHECK-NEXT: #define capital_u_\U00FC
				// CHECK-NEXT: {{^ \^}}
				// CHECK-NEXT: {{^ u}}

test/Sema/ucn-cstring.c

	// RUN: %clang_cc1 %s -verify -fsyntax-only -pedantic			// RUN: %clang_cc1 %s -verify -fsyntax-only -pedantic

	int printf(const char *, ...);			int printf(const char *, ...);

	int main(void) {			int main(void) {
	int a[sizeof("hello \u2192 \u2603 \u2190 world") == 24 ? 1 : -1];			int a[sizeof("hello \u2192 \u2603 \u2190 world") == 24 ? 1 : -1];

	printf("%s (%zd)\n", "hello \u2192 \u2603 \u2190 world", sizeof("hello \u2192 \u2603 \u2190 world"));			printf("%s (%zd)\n", "hello \u2192 \u2603 \u2190 world", sizeof("hello \u2192 \u2603 \u2190 world"));
	printf("%s (%zd)\n", "\U00010400\U0001D12B", sizeof("\U00010400\U0001D12B"));			printf("%s (%zd)\n", "\U00010400\U0001D12B", sizeof("\U00010400\U0001D12B"));
	// Some error conditions...			// Some error conditions...
	printf("%s\n", "\U"); // expected-error{{\u used with no following hex digits}}			printf("%s\n", "\U"); // expected-error{{\U used with no following hex digits}}
	printf("%s\n", "\U00"); // expected-error{{incomplete universal character name}}			printf("%s\n", "\U00"); // expected-error{{incomplete universal character name}}
	printf("%s\n", "\U0001"); // expected-error{{incomplete universal character name}}			printf("%s\n", "\U0001"); // expected-error{{incomplete universal character name}}
	printf("%s\n", "\u0001"); // expected-error{{universal character name refers to a control character}}			printf("%s\n", "\u0001"); // expected-error{{universal character name refers to a control character}}
	return 0;			return 0;
	}			}

test/Sema/ucn-identifiers.c

This file was added.

				// RUN: %clang_cc1 %s -verify -fsyntax-only -pedantic
				// RUN: %clang_cc1 %s -verify -fsyntax-only -x c++ -pedantic

				// This file contains UTF-8; please do not fix!


				extern void \u00FCber(int);
				extern void \U000000FCber(int); // redeclaration, no warning
				#ifdef __cplusplus
				// expected-note@-2 + {{candidate function not viable}}
				#else
				// expected-note@-4 + {{declared here}}
				#endif

				void goodCalls() {
				\u00FCber(0);
				\u00fcber(1);
				über(2);
				\U000000FCber(3);
				}

				void badCalls() {
				\u00FCber(0.5); // expected-warning{{implicit conversion from 'double' to 'int'}}
				\u00fcber = 0; // expected-error{{non-object type 'void (int)' is not assignable}}

				über(1, 2);
				\U000000FCber();
				#ifdef __cplusplus
				// expected-error@-3 {{no matching function}}
				// expected-error@-3 {{no matching function}}
				#else
				// expected-error@-6 {{too many arguments to function call, expected 1, have 2}}
				// expected-error@-6 {{too few arguments to function call, expected 1, have 0}}
				#endif
				}