This is an archive of the discontinued LLVM Phabricator instance.

Differential D143142

[clang][lex] Enable Lexer to grow its buffer
Needs Review · Public

Authored by sunho on Feb 2 2023, 12:28 AM.

Details

Reviewers

v.g.vassilev
aaron.ballman
cor3ntin
tahonermann
davrec

Summary

Change Lexer to use offsets instead of direct pointers into the buffer, so that the Lexer remains functional even if the buffer address is swapped in the middle of lexing.

Incremental input (via clang-repl, cling, etc.) adds code line by line, growing the TU. One of the last elements that needs to support growing is the source code buffer. One of the challenges is that when we grow the buffer, the buffer address can change in practice. Since Lexer holds direct pointers into the buffer, every pointer needs to be updated once the buffer is swapped, including all trivial local variables, which is very challenging to do without sacrificing readability of the code.

This change solves this issue nicely. Since we will only be adding code at the back of the buffer, existing offsets stay constant no matter how many times we grow the buffer, and every access through them into the new buffer remains valid. We do add a number of indirections through BufferStart, but the performance impact on actual compile time turned out to be negligible. The only visible performance trend seems to be a 0.5%~0.7% increase in instruction count.
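The offset argument can be illustrated with a small standalone sketch. `GrowableSource` and `readAfterGrowth` are hypothetical names invented for this example and are not part of Clang or of this patch; the sketch only assumes that the buffer is append-only, as the summary states.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a growable source buffer; not a Clang class.
class GrowableSource {
  std::string Storage;

public:
  explicit GrowableSource(std::string Initial) : Storage(std::move(Initial)) {}

  // Appending may reallocate, so data() can return a new address afterwards.
  void append(const std::string &Code) { Storage += Code; }

  const char *data() const { return Storage.data(); }
  std::size_t size() const { return Storage.size(); }

  // Offsets are resolved against the current start of the storage,
  // wherever it lives after growth.
  char at(std::size_t Offset) const {
    assert(Offset < Storage.size() && "offset past end of buffer");
    return Storage[Offset];
  }
};

// Remember a position as an offset, grow the buffer, then read it back.
inline char readAfterGrowth() {
  GrowableSource Src("int x;");
  std::size_t SemiOffset = 5;         // position of ';' in "int x;"
  Src.append(std::string(4096, ' ')); // large append, may reallocate
  Src.append("int y;");
  return Src.at(SemiOffset);          // the offset still names the same character
}
```

A `const char *` taken from `data()` before the appends would have no such guarantee, which is the situation the patch is trying to avoid inside Lexer.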

NOTE: This is part 1 of https://discourse.llvm.org/t/rfc-flexible-lexer-buffering-for-handling-incomplete-input-in-interactive-c-c/64180

The Debian failure is due to a clang-format issue that is unrelated to this change.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

sunho created this revision. Feb 2 2023, 12:28 AM

Herald added a project: Restricted Project. Feb 2 2023, 12:28 AM

Harbormaster completed remote builds in B211400: Diff 494186. Feb 2 2023, 12:28 AM

Update

sunho retitled this revision from Format to [clang] Change Lexer to use offsets instead of direct pointer. Feb 2 2023, 12:34 AM

sunho edited the summary of this revision.

sunho retitled this revision from [clang] Change Lexer to use offsets instead of direct pointer to [clang][lex] Change Lexer to use offsets instead of direct pointer.

sunho added a child revision: D143144: [clang][lex] Add TryGrowLexerBuffer/SourceFileGrower. Feb 2 2023, 12:45 AM

Harbormaster completed remote builds in B211401: Diff 494187. Feb 2 2023, 1:52 AM

sunho updated this revision to Diff 494968. Feb 5 2023, 6:03 PM

This comment was removed by sunho.

Harbormaster completed remote builds in B211976: Diff 494968. Feb 5 2023, 6:50 PM

sunho updated this revision to Diff 494976. Feb 5 2023, 6:51 PM

This comment was removed by sunho.

Harbormaster completed remote builds in B211982: Diff 494976. Feb 5 2023, 7:48 PM

Update

Harbormaster completed remote builds in B212029: Diff 495038. Feb 6 2023, 2:02 AM

Update

sunho edited the summary of this revision. Feb 6 2023, 2:42 AM

sunho edited the summary of this revision.

sunho edited the summary of this revision. Feb 6 2023, 2:47 AM

sunho edited the summary of this revision.

Harbormaster completed remote builds in B212031: Diff 495040. Feb 6 2023, 2:52 AM

sunho published this revision for review. Feb 6 2023, 3:05 AM

Herald added a project: Restricted Project. Feb 6 2023, 3:05 AM

Herald added a subscriber: cfe-commits.

sunho added a reviewer: v.g.vassilev. Feb 6 2023, 3:06 AM

sunho edited the summary of this revision. Feb 6 2023, 3:26 AM

sunho edited the summary of this revision. Feb 6 2023, 3:28 AM

sunho edited the summary of this revision.

sunho retitled this revision from [clang][lex] Change Lexer to use offsets instead of direct pointer to [clang][lex] Enable Lexer to grow its buffer. Feb 6 2023, 3:38 AM

shafik added a subscriber: shafik. Feb 6 2023, 6:40 PM

shafik added inline comments.

clang/lib/Lex/Lexer.cpp
1203	I wonder whether we really need these pointer gymnastics; maybe making this a member function would eliminate the need for them.
1815	nit

WG21 is meeting all this week, so a bunch of folks who should take a look at this may not get around to it right away.

junaire added a subscriber: junaire. Feb 6 2023, 8:49 PM

sunho edited the summary of this revision. Feb 8 2023, 5:08 PM

sunho added a reviewer: davrec. Feb 8 2023, 5:10 PM

v.g.vassilev added inline comments. Feb 9 2023, 2:45 AM

clang/lib/Lex/Lexer.cpp
211	Is that an outdated comment? If not maybe elaborate why this is wrong.

sunho added inline comments. Feb 9 2023, 9:07 AM

clang/lib/Lex/Lexer.cpp
211	It's indeed an outdated comment. I'll remove it.
1203	Yes, we can change it to offset here.

We should probably add some tests here. Alternatively we can add the tests from https://reviews.llvm.org/D143148 but that'd make this patch bulkier and probably harder to review.

Only had a chance to give it a once over, I will look through more closely later, def by this weekend. Main thing is I think we shouldn't be exposing the buffer pointers after this change, i.e. no public function should return const char *, unless I'm missing something. If that box is checked and performance cost is negligible I'd give this the thumbs up.

clang/include/clang/Lex/Lexer.h
307	I think I'd like this to return `unsigned`; i.e. I think after this patch we should not even be publicly exposing buffer locations as pointers, IIUC. A brief search for uses of `getBufferLocation()` (there aren't many) suggests this would probably be fine and indeed would get rid of some unnecessary pointer arithmetic. And indeed if anything really needs that `const char *` that might be a red flag to investigate further.
609	FWIW it sucks that `uint32_t` is already sprinkled throughout the interface alongside `unsigned`, wish just one was used consistently, but that does not need to be addressed in this patch.

This is an impressive amount of work. I think it makes sense!
Thanks a lot for doing that work.
I only have a few nits after a first review of this.

clang/include/clang/Lex/Lexer.h
89	Should that use `SourceLocation::UIntTy`? Looking at comments in SourceManager, I think there was an attempt at supporting > 2GB files but I don't think it got anywhere. Nevertheless, using `SourceLocation::UIntTy` would arguably be more consistent. It does seem to be a huge undertaking to change it though, I'm not sure it would be worth it at all. There would be far bigger issues with ridiculously large source files anyway.
609	here, `uint32_t` is a codepoint, should arguably be `char32_t` instead. but agreed, not in this patch.
clang/lib/Lex/Lexer.cpp
213
1203	Agreed, that would be nice! (In all places `getBuffer().data()` is used)
1353
1378
1740
1744	Ditto in all similar places, I think it reads easier

tschuett added a subscriber: tschuett. Feb 20 2023, 2:55 AM

tschuett added inline comments.

clang/include/clang/Lex/Lexer.h
89	I am a bit afraid that unsigned has different sizes on different platforms. At least a `using BufferOffsetType = uint64_t;` would be nice.

Suggested a few adjustments in LexTokenInternal you might want to test for perf improvements (orthogonal to this patch, but could boost its numbers :).
And also noted that Lexer::getBuffer() has the same issue as getBufferLocation() re potential for pointer invalidation IIUC. At a minimum we need comments on these functions explaining any risks; better still to remove them from the public interface. If downstream users use these functions and complain, good - they need to be aware of this change.

clang/include/clang/Lex/Lexer.h
285	Same issue as with `getBufferLocation()`, publicly returning it permits possible pointer invalidation. Fortunately I only see it used in a single spot (prior to this patch anyway) which can be easily eliminated IIUC. Yank this function? Or make private/append "Unsafe" to name (and explain in comments)?
307	Looking more closely I see that `getCurrentBufferOffset` returns the unsigned, and this patch already changes some `getBufferLocation` usages to `getCurrentBufferOffset`. So, I say either yank it or make it private or append "Unsafe" and explain in comments.
clang/lib/Lex/Lexer.cpp
2948–2949	indent
2973–2975	indent
3630–3632	for (isHorizontalWhitespace(BufferStart[++CurOffset]);;) ; might save a few instructions? Worth trying since this function is the main perf-critical one.
3739–3752	Spitballing again for possible minor perf improvements:
	if (char Char0 = BufferStart[CurOffset] == '/' && !inKeepCommentMode()) {
	  if (char Char1 = BufferStart[CurOffset + 1] == '/' && LineComment &&
	      (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP)) {
	    if (SkipLineComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))
	      return true; // There is a token to return.
	    goto SkipIgnoredUnits;
	  } else if (Char1 == '*') {
	    if (SkipBlockComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))
	      return true; // There is a token to return.
	    goto SkipIgnoredUnits;
	  }
	} else if (isHorizontalWhitespace(Char0)) {
	  goto SkipHorizontalWhitespace;
	}

davrec added inline comments. Feb 20 2023, 10:28 AM

clang/lib/Lex/Lexer.cpp
3630–3632	^ Ignore, erroneous :)

In general, I think this makes sense. However:

This change solves this issue nicely. Since we will be only adding code at the back of the buffer, the offsets are always constant even if we grow the buffer many times and all the access to new buffer will be valid. We do add a number of indirections to BufferStart, but performance impact on actual compile time turned out to be negligible. The only visible performance trend seems to be 0.5%~0.7% increase in instruction count.

Can you give some performance numbers for how this impacts compile time performance for some large C and C++ projects? https://llvm-compile-time-tracker.com/ might be of help in gathering that data, FWIW.

In terms of whether to use unsigned vs a specific type; I don't have a strong opinion, but it'd be good to at least static_assert properties we care about (like the type being at least 32 bits). We've had efforts in the past to allow for > 2GB source files, so I weakly think it would make sense to use a 64-bit value explicitly up front, but I have no idea how that changes the performance characteristics for the changes either.
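A minimal sketch of the `static_assert` idea, combined with the alias suggested in the inline comment on line 89. The alias name `BufferOffsetType` comes from that comment; the choice of `std::uint32_t` and the helper `fitsInOffset` are assumptions made for illustration, and the patch as posted uses plain `unsigned`.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical alias; the width chosen here is an assumption.
using BufferOffsetType = std::uint32_t;

// Properties the review thread says it cares about, checked at compile time.
static_assert(std::is_unsigned<BufferOffsetType>::value,
              "buffer offsets must not be negative");
static_assert(std::numeric_limits<BufferOffsetType>::digits >= 32,
              "buffer offsets must be at least 32 bits wide");

// Hypothetical helper: can a buffer of this size be addressed by an offset?
inline bool fitsInOffset(std::uint64_t BufferSize) {
  return BufferSize <= std::numeric_limits<BufferOffsetType>::max();
}
```

Switching the alias to `std::uint64_t` would keep both assertions satisfied, so the width question could be revisited later without touching the call sites.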

In terms of whether to expose access to pointers to the underlying buffer through lexer APIs... I sort of agree with @davrec that it would be good to avoid exposing those interfaces given that the pointer values are likely to be invalidated when the buffer grows. My instinct is that Clang developers are unlikely to consider that buffer to be something that can be invalidated, so if we retain an interface to get a pointer to the buffer when you get to the point of actually allowing it to grow, it'd be nice if we can find some way to forcefully grow the buffer to help catch misuses in Clang when running the llvm-lit tests.
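One way to read "forcefully grow the buffer" is a test-only buffer that moves its storage on every append, so that any pointer held across growth becomes stale immediately rather than occasionally. The sketch below is only an illustration of that idea; `RelocatingBuffer` and `movedOnAppend` are hypothetical names and nothing like them appears in the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Hypothetical test-only buffer: every append allocates fresh storage.
class RelocatingBuffer {
  std::unique_ptr<char[]> Storage;
  std::size_t Size = 0;

public:
  explicit RelocatingBuffer(const std::string &Initial) { append(Initial); }

  void append(const std::string &Code) {
    // The new block is allocated while the old one is still alive,
    // so the two addresses are guaranteed to differ.
    std::unique_ptr<char[]> Fresh(new char[Size + Code.size() + 1]);
    if (Size)
      std::memcpy(Fresh.get(), Storage.get(), Size);
    std::memcpy(Fresh.get() + Size, Code.data(), Code.size());
    Size += Code.size();
    Fresh[Size] = '\0';
    Storage = std::move(Fresh); // the old storage is released here
  }

  const char *data() const { return Storage.get(); }
  std::size_t size() const { return Size; }
};

// Returns true if the storage address changed across an append.
inline bool movedOnAppend() {
  RelocatingBuffer B("int x;");
  // Record the address as an integer so the stale pointer is never used.
  std::uintptr_t Before = reinterpret_cast<std::uintptr_t>(B.data());
  B.append("int y;");
  return reinterpret_cast<std::uintptr_t>(B.data()) != Before;
}
```

Run under a sanitizer, a lexer client that kept a `const char *` across `append` would then fail deterministically in tests.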

The changes made (from what I've seen, I haven't reviewed every line) make sense to me. The amount of change does make me a bit nervous though.

In D143142#4142212, @aaron.ballman wrote:

In terms of whether to use unsigned vs a specific type; I don't have a strong opinion, but it'd be good to at least static_assert properties we care about (like the type being at least 32 bits). We've had efforts in the past to allow for > 2GB source files, so I weakly think it would make sense to use a 64-bit value explicitly up front, but I have no idea how that changes the performance characteristics for the changes either.

I like the idea of using a strong type for two reasons: 1) normal type safety stuff, and 2) it would enable a platform dependent and/or configurable type; I don't see any reason for use of a 64-bit offset on an ILP-32 platform, so no point in adding the overhead there.
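A rough sketch of what such a strong type could look like. The names `OffsetRep`, `BufferOffset` and `charAt` are hypothetical, and selecting the width from `sizeof(void *)` is just one possible policy assumed here for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical policy: 64-bit offsets only where pointers are 64-bit,
// so an ILP-32 target pays no extra cost.
using OffsetRep = std::conditional<sizeof(void *) >= 8, std::uint64_t,
                                   std::uint32_t>::type;

// Hypothetical strong type: no implicit conversion from plain integers.
class BufferOffset {
  OffsetRep Value = 0;

public:
  BufferOffset() = default;
  explicit BufferOffset(OffsetRep V) : Value(V) {}

  OffsetRep get() const { return Value; }

  BufferOffset operator+(OffsetRep N) const { return BufferOffset(Value + N); }
  BufferOffset &operator++() {
    ++Value;
    return *this;
  }

  friend bool operator==(BufferOffset A, BufferOffset B) {
    return A.Value == B.Value;
  }
  friend bool operator<(BufferOffset A, BufferOffset B) {
    return A.Value < B.Value;
  }
};

// Reads go through the buffer start, so the offset stays valid if it moves.
inline char charAt(const char *BufferStart, BufferOffset Off) {
  return BufferStart[Off.get()];
}
```

Because the constructor is `explicit`, passing a token length or a diagnostic ID where an offset is expected would fail to compile, which is the "normal type safety stuff" mentioned above.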

In terms of whether to expose access to pointers to the underlying buffer through lexer APIs... I sort of agree with @davrec that it would be good to avoid exposing those interfaces given that the pointer values are likely to be invalidated when the buffer grows. My instinct is that Clang developers are unlikely to consider that buffer to be something that can be invalidated, so if we retain an interface to get a pointer to the buffer when you get to the point of actually allowing it to grow, it'd be nice if we can find some way to forcefully grow the buffer to help catch misuses in Clang when running the llvm-lit tests.

Perhaps such access can be facilitated via a shared_ptr-like handle that dynamically records whether direct access is in use; something like a read-lock. Attempts to grow the buffer could then assert that no such use is outstanding.
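The read-lock idea might be sketched as follows. `GuardedBuffer` and its nested `View` are hypothetical names made up for this example; the sketch uses a simple counter rather than an actual `shared_ptr`, and it is single-threaded only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical buffer that tracks outstanding raw-pointer views.
class GuardedBuffer {
  std::string Storage;
  mutable unsigned ActiveViews = 0;

public:
  // RAII handle: while one exists, the buffer must not grow.
  class View {
    const GuardedBuffer *Owner;

  public:
    explicit View(const GuardedBuffer &B) : Owner(&B) { ++Owner->ActiveViews; }
    View(View &&Other) noexcept : Owner(Other.Owner) { Other.Owner = nullptr; }
    View(const View &) = delete;
    View &operator=(const View &) = delete;
    ~View() {
      if (Owner)
        --Owner->ActiveViews;
    }

    const char *data() const { return Owner->Storage.data(); }
    std::size_t size() const { return Owner->Storage.size(); }
  };

  explicit GuardedBuffer(std::string Initial) : Storage(std::move(Initial)) {}

  View view() const { return View(*this); }
  bool canGrow() const { return ActiveViews == 0; }
  std::size_t size() const { return Storage.size(); }

  void append(const std::string &Code) {
    // Growing while a view is live would invalidate its pointer.
    assert(canGrow() && "buffer grown while a raw view is outstanding");
    Storage += Code;
  }
};
```

Direct pointer access stays available where it is really needed, but a growth attempt made while a `View` is alive trips the assertion in debug builds.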

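Revision Contents

Path	Size
clang/include/clang/Lex/Lexer.h	166 lines
clang/include/clang/Lex/Preprocessor.h	4 lines
clang/lib/Format/FormatTokenLexer.cpp	12 lines
clang/lib/Lex/Lexer.cpp	1359 lines
clang/lib/Lex/PPDirectives.cpp	33 lines
clang/lib/Lex/PPLexerChange.cpp	28 lines
clang/lib/Lex/Pragma.cpp	2 lines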

Diff 495040

clang/include/clang/Lex/Lexer.h

[80 lines not shown]	class Lexer : public PreprocessorLexer {
void anchor() override;		void anchor() override;

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Constant configuration values for this lexer.		// Constant configuration values for this lexer.

// Start of the buffer.		// Start of the buffer.
const char *BufferStart;		const char *BufferStart;

// End of the buffer.		// Size of the buffer.
		cor3ntin: Should that use `SourceLocation::UIntTy`? Looking at comments in SourceManager, I think there was an attempt at supporting > 2GB files but I don't think it got anywhere. Nevertheless, using `SourceLocation::UIntTy` would arguably be more consistent. It does seem to be a huge undertaking to change it though, I'm not sure it would be worth it at all. There would be far bigger issues with ridiculously large source files anyway.
		tschuett: I am a bit afraid that unsigned has different sizes on different platforms. At least a `using BufferOffsetType = uint64_t;` would be nice.
const char *BufferEnd;		unsigned BufferSize;

// Location for start of file.		// Location for start of file.
SourceLocation FileLoc;		SourceLocation FileLoc;

// LangOpts enabled by this language.		// LangOpts enabled by this language.
// Storing LangOptions as reference here is important from performance point		// Storing LangOptions as reference here is important from performance point
// of view. Lack of reference means that LangOptions copy constructor would be		// of view. Lack of reference means that LangOptions copy constructor would be
// called by Lexer(..., const LangOptions &LangOpts,...). Given that local		// called by Lexer(..., const LangOptions &LangOpts,...). Given that local
[22 lines not shown]	class Lexer : public PreprocessorLexer {
/// it returns comments, when it is set to 0 it returns normal tokens only.		/// it returns comments, when it is set to 0 it returns normal tokens only.
unsigned char ExtendedTokenMode;		unsigned char ExtendedTokenMode;

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Context that changes as the file is lexed.		// Context that changes as the file is lexed.
// NOTE: any state that mutates when in raw mode must have save/restore code		// NOTE: any state that mutates when in raw mode must have save/restore code
// in Lexer::isNextPPTokenLParen.		// in Lexer::isNextPPTokenLParen.

// BufferPtr - Current pointer into the buffer. This is the next character		// BufferOffset - Current offset into the buffer. This is the next character
// to be lexed.		// to be lexed.
const char *BufferPtr;		unsigned BufferOffset;

// IsAtStartOfLine - True if the next lexed token should get the "start of		// IsAtStartOfLine - True if the next lexed token should get the "start of
// line" flag set on it.		// line" flag set on it.
bool IsAtStartOfLine;		bool IsAtStartOfLine;

bool IsAtPhysicalStartOfLine;		bool IsAtPhysicalStartOfLine;

bool HasLeadingSpace;		bool HasLeadingSpace;

bool HasLeadingEmptyMacro;		bool HasLeadingEmptyMacro;

/// True if this is the first time we're lexing the input file.		/// True if this is the first time we're lexing the input file.
bool IsFirstTimeLexingFile;		bool IsFirstTimeLexingFile;

// NewLinePtr - A pointer to new line character '\n' being lexed. For '\r\n',		// NewLineOffset - A offset to new line character '\n' being lexed. For
// it also points to '\n.'		// '\r\n', it also points to '\n.'
const char *NewLinePtr;		std::optional<unsigned> NewLineOffset;

// CurrentConflictMarkerState - The kind of conflict marker we are handling.		// CurrentConflictMarkerState - The kind of conflict marker we are handling.
ConflictMarkerKind CurrentConflictMarkerState;		ConflictMarkerKind CurrentConflictMarkerState;

/// Non-empty if this \p Lexer is \p isDependencyDirectivesLexer().		/// Non-empty if this \p Lexer is \p isDependencyDirectivesLexer().
ArrayRef<dependency_directives_scan::Directive> DepDirectives;		ArrayRef<dependency_directives_scan::Directive> DepDirectives;

/// If this \p Lexer is \p isDependencyDirectivesLexer(), it represents the		/// If this \p Lexer is \p isDependencyDirectivesLexer(), it represents the
/// next token to use from the current dependency directive.		/// next token to use from the current dependency directive.
unsigned NextDepDirectiveTokenIndex = 0;		unsigned NextDepDirectiveTokenIndex = 0;

void InitLexer(const char *BufStart, const char *BufPtr, const char *BufEnd);		void InitLexer(const char *BufStart, unsigned BufferOffset,
		unsigned BufferSize);

public:		public:
/// Lexer constructor - Create a new lexer object for the specified buffer		/// Lexer constructor - Create a new lexer object for the specified buffer
/// with the specified preprocessor managing the lexing process. This lexer		/// with the specified preprocessor managing the lexing process. This lexer
/// assumes that the associated file buffer and Preprocessor objects will		/// assumes that the associated file buffer and Preprocessor objects will
/// outlive it, so it doesn't take ownership of either of them.		/// outlive it, so it doesn't take ownership of either of them.
Lexer(FileID FID, const llvm::MemoryBufferRef &InputFile, Preprocessor &PP,		Lexer(FileID FID, const llvm::MemoryBufferRef &InputFile, Preprocessor &PP,
bool IsFirstIncludeOfFile = true);		bool IsFirstIncludeOfFile = true);
[40 lines not shown]	private:
/// Called when the preprocessor is in 'dependency scanning lexing mode' and		/// Called when the preprocessor is in 'dependency scanning lexing mode' and
/// is skipping a conditional block.		/// is skipping a conditional block.
bool LexDependencyDirectiveTokenWhileSkipping(Token &Result);		bool LexDependencyDirectiveTokenWhileSkipping(Token &Result);

/// True when the preprocessor is in 'dependency scanning lexing mode' and		/// True when the preprocessor is in 'dependency scanning lexing mode' and
/// created this \p Lexer for lexing a set of dependency directive tokens.		/// created this \p Lexer for lexing a set of dependency directive tokens.
bool isDependencyDirectivesLexer() const { return !DepDirectives.empty(); }		bool isDependencyDirectivesLexer() const { return !DepDirectives.empty(); }

/// Initializes \p Result with data from \p DDTok and advances \p BufferPtr to		/// Initializes \p Result with data from \p DDTok and advances \p BufferOffset to
/// the position just after the token.		/// the position just after the token.
/// \returns the buffer pointer at the beginning of the token.		/// \returns the buffer pointer at the beginning of the token.
const char *convertDependencyDirectiveToken(		const char *convertDependencyDirectiveToken(
const dependency_directives_scan::Token &DDTok, Token &Result);		const dependency_directives_scan::Token &DDTok, Token &Result);

public:		public:
/// isPragmaLexer - Returns true if this Lexer is being used to lex a pragma.		/// isPragmaLexer - Returns true if this Lexer is being used to lex a pragma.
bool isPragmaLexer() const { return Is_PragmaLexer; }		bool isPragmaLexer() const { return Is_PragmaLexer; }

private:		private:
/// IndirectLex - An indirect call to 'Lex' that can be invoked via		/// IndirectLex - An indirect call to 'Lex' that can be invoked via
/// the PreprocessorLexer interface.		/// the PreprocessorLexer interface.
void IndirectLex(Token &Result) override { Lex(Result); }		void IndirectLex(Token &Result) override { Lex(Result); }

public:		public:
/// LexFromRawLexer - Lex a token from a designated raw lexer (one with no		/// LexFromRawLexer - Lex a token from a designated raw lexer (one with no
/// associated preprocessor object. Return true if the 'next character to		/// associated preprocessor object. Return true if the 'next character to
/// read' pointer points at the end of the lexer buffer, false otherwise.		/// read' pointer points at the end of the lexer buffer, false otherwise.
bool LexFromRawLexer(Token &Result) {		bool LexFromRawLexer(Token &Result) {
assert(LexingRawMode && "Not already in raw mode!");		assert(LexingRawMode && "Not already in raw mode!");
Lex(Result);		Lex(Result);
// Note that lexing to the end of the buffer doesn't implicitly delete the		// Note that lexing to the end of the buffer doesn't implicitly delete the
// lexer when in raw mode.		// lexer when in raw mode.
return BufferPtr == BufferEnd;		return BufferOffset == BufferSize;
}		}

/// isKeepWhitespaceMode - Return true if the lexer should return tokens for		/// isKeepWhitespaceMode - Return true if the lexer should return tokens for
/// every character in the file, including whitespace and comments. This		/// every character in the file, including whitespace and comments. This
/// should only be used in raw mode, as the preprocessor is not prepared to		/// should only be used in raw mode, as the preprocessor is not prepared to
/// deal with the excess tokens.		/// deal with the excess tokens.
bool isKeepWhitespaceMode() const {		bool isKeepWhitespaceMode() const {
return ExtendedTokenMode > 1;		return ExtendedTokenMode > 1;
[26 lines not shown]	public:
/// language options and preprocessor. This controls whether the lexer		/// language options and preprocessor. This controls whether the lexer
/// produces comment and whitespace tokens.		/// produces comment and whitespace tokens.
///		///
/// This requires the lexer to have an associated preprocessor. A standalone		/// This requires the lexer to have an associated preprocessor. A standalone
/// lexer has nothing to reset to.		/// lexer has nothing to reset to.
void resetExtendedTokenMode();		void resetExtendedTokenMode();

/// Gets source code buffer.		/// Gets source code buffer.
StringRef getBuffer() const {		StringRef getBuffer() const { return StringRef(BufferStart, BufferSize); }
		davrec: Same issue as with `getBufferLocation()`, publicly returning it permits possible pointer invalidation. Fortunately I only see it used in a single spot (prior to this patch anyway) which can be easily eliminated IIUC. Yank this function? Or make private/append "Unsafe" to name (and explain in comments)?
return StringRef(BufferStart, BufferEnd - BufferStart);
}

/// ReadToEndOfLine - Read the rest of the current preprocessor line as an		/// ReadToEndOfLine - Read the rest of the current preprocessor line as an
/// uninterpreted string. This switches the lexer out of directive mode.		/// uninterpreted string. This switches the lexer out of directive mode.
void ReadToEndOfLine(SmallVectorImpl<char> *Result = nullptr);		void ReadToEndOfLine(SmallVectorImpl<char> *Result = nullptr);


/// Diag - Forwarding function for diagnostics. This translate a source		/// Diag - Forwarding function for diagnostics. This translate a source
/// position in the current buffer into a SourceLocation object for rendering.		/// position in the current buffer into a SourceLocation object for rendering.
DiagnosticBuilder Diag(const char *Loc, unsigned DiagID) const;		DiagnosticBuilder Diag(unsigned Loc, unsigned DiagID) const;

/// getSourceLocation - Return a source location identifier for the specified		/// getSourceLocation - Return a source location identifier for the specified
/// offset in the current file.		/// offset in the current file.
SourceLocation getSourceLocation(const char *Loc, unsigned TokLen = 1) const;		SourceLocation getSourceLocation(unsigned Loc, unsigned TokLen = 1) const;

/// getSourceLocation - Return a source location for the next character in		/// getSourceLocation - Return a source location for the next character in
/// the current file.		/// the current file.
SourceLocation getSourceLocation() override {		SourceLocation getSourceLocation() override {
return getSourceLocation(BufferPtr);		return getSourceLocation(BufferOffset);
}		}

/// Return the current location in the buffer.		/// Return the current location in the buffer.
const char *getBufferLocation() const { return BufferPtr; }		const char *getBufferLocation() const {
		davrec: I think I'd like this to return `unsigned`; i.e. I think after this patch we should not even be publicly exposing buffer locations as pointers, IIUC. A brief search for uses of `getBufferLocation()` (there aren't many) suggests this would probably be fine and indeed would get rid of some unnecessary pointer arithmetic. And indeed if anything really needs that `const char *` that might be a red flag to investigate further.
		davrec: Looking more closely I see that `getCurrentBufferOffset` returns the unsigned, and this patch already changes some `getBufferLocation` usages to `getCurrentBufferOffset`. So, I say either yank it or make it private or append "Unsafe" and explain in comments.
		assert(BufferOffset <= BufferSize && "Invalid buffer state");
		return BufferStart + BufferOffset;
		}

/// Returns the current lexing offset.		/// Returns the current lexing offset.
unsigned getCurrentBufferOffset() {		unsigned getCurrentBufferOffset() { return BufferOffset; }
assert(BufferPtr >= BufferStart && "Invalid buffer state");
return BufferPtr - BufferStart;
}

/// Set the lexer's buffer pointer to \p Offset.		/// Set the lexer's buffer pointer to \p Offset.
void seek(unsigned Offset, bool IsAtStartOfLine);		void seek(unsigned Offset, bool IsAtStartOfLine);

/// Stringify - Convert the specified string into a C string by i) escaping		/// Stringify - Convert the specified string into a C string by i) escaping
/// '\\' and " characters and ii) replacing newline character(s) with "\\n".		/// '\\' and " characters and ii) replacing newline character(s) with "\\n".
/// If Charify is true, this escapes the ' character instead of ".		/// If Charify is true, this escapes the ' character instead of ".
static std::string Stringify(StringRef Str, bool Charify = false);		static std::string Stringify(StringRef Str, bool Charify = false);
[279 lines not shown]	private:
//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Internal implementation interfaces.		// Internal implementation interfaces.

/// LexTokenInternal - Internal interface to lex a preprocessing token. Called		/// LexTokenInternal - Internal interface to lex a preprocessing token. Called
/// by Lex.		/// by Lex.
///		///
bool LexTokenInternal(Token &Result, bool TokAtPhysicalStartOfLine);		bool LexTokenInternal(Token &Result, bool TokAtPhysicalStartOfLine);

bool CheckUnicodeWhitespace(Token &Result, uint32_t C, const char *CurPtr);		bool CheckUnicodeWhitespace(Token &Result, uint32_t C, unsigned CurOffset);
		davrec: FWIW it sucks that `uint32_t` is already sprinkled throughout the interface alongside `unsigned`, wish just one was used consistently, but that does not need to be addressed in this patch.
		cor3ntin: here, `uint32_t` is a codepoint, should arguably be `char32_t` instead. but agreed, not in this patch.

bool LexUnicodeIdentifierStart(Token &Result, uint32_t C, const char *CurPtr);		bool LexUnicodeIdentifierStart(Token &Result, uint32_t C, unsigned CurOffset);

/// FormTokenWithChars - When we lex a token, we have identified a span		/// FormTokenWithChars - When we lex a token, we have identified a span
/// starting at BufferPtr, going to TokEnd that forms the token. This method		/// starting at BufferPtr, going to TokEnd that forms the token. This method
/// takes that range and assigns it to the token as its location and size. In		/// takes that range and assigns it to the token as its location and size. In
/// addition, since tokens cannot overlap, this also updates BufferPtr to be		/// addition, since tokens cannot overlap, this also updates BufferPtr to be
/// TokEnd.		/// TokEnd.
void FormTokenWithChars(Token &Result, const char *TokEnd,		void FormTokenWithChars(Token &Result, unsigned TokEnd, tok::TokenKind Kind) {
tok::TokenKind Kind) {		unsigned TokLen = TokEnd - BufferOffset;
unsigned TokLen = TokEnd-BufferPtr;
Result.setLength(TokLen);		Result.setLength(TokLen);
Result.setLocation(getSourceLocation(BufferPtr, TokLen));		Result.setLocation(getSourceLocation(BufferOffset, TokLen));
Result.setKind(Kind);		Result.setKind(Kind);
BufferPtr = TokEnd;		BufferOffset = TokEnd;
}		}

/// isNextPPTokenLParen - Return 1 if the next unexpanded token will return a		/// isNextPPTokenLParen - Return 1 if the next unexpanded token will return a
/// tok::l_paren token, 0 if it is something else and 2 if there are no more		/// tok::l_paren token, 0 if it is something else and 2 if there are no more
/// tokens in the buffer controlled by this lexer.		/// tokens in the buffer controlled by this lexer.
unsigned isNextPPTokenLParen();		unsigned isNextPPTokenLParen();

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
[21 lines not shown]	private:
static bool isObviouslySimpleCharacter(char C) {		static bool isObviouslySimpleCharacter(char C) {
return C != '?' && C != '\\';		return C != '?' && C != '\\';
}		}

/// getAndAdvanceChar - Read a single 'character' from the specified buffer,		/// getAndAdvanceChar - Read a single 'character' from the specified buffer,
/// advance over it, and return it. This is tricky in several cases. Here we		/// advance over it, and return it. This is tricky in several cases. Here we
/// just handle the trivial case and fall-back to the non-inlined		/// just handle the trivial case and fall-back to the non-inlined
/// getCharAndSizeSlow method to handle the hard case.		/// getCharAndSizeSlow method to handle the hard case.
inline char getAndAdvanceChar(const char *&Ptr, Token &Tok) {		inline char getAndAdvanceChar(unsigned &Offset, Token &Tok) {
// If this is not a trigraph and not a UCN or escaped newline, return		// If this is not a trigraph and not a UCN or escaped newline, return
// quickly.		// quickly.
if (isObviouslySimpleCharacter(Ptr[0])) return *Ptr++;		if (isObviouslySimpleCharacter(BufferStart[Offset]))
		return BufferStart[Offset++];

unsigned Size = 0;		unsigned Size = 0;
char C = getCharAndSizeSlow(Ptr, Size, &Tok);		char C = getCharAndSizeSlow(Offset, Size, &Tok);
Ptr += Size;		Offset += Size;
return C;		return C;
}		}

/// ConsumeChar - When a character (identified by getCharAndSize) is consumed		/// ConsumeChar - When a character (identified by getCharAndSize) is consumed
/// and added to a given token, check to see if there are diagnostics that		/// and added to a given token, check to see if there are diagnostics that
/// need to be emitted or flags that need to be set on the token. If so, do		/// need to be emitted or flags that need to be set on the token. If so, do
/// it.		/// it.
const char *ConsumeChar(const char *Ptr, unsigned Size, Token &Tok) {		unsigned ConsumeChar(unsigned Offset, unsigned Size, Token &Tok) {
// Normal case, we consumed exactly one token. Just return it.		// Normal case, we consumed exactly one token. Just return it.
if (Size == 1)		if (Size == 1)
return Ptr+Size;		return Offset + Size;

// Otherwise, re-lex the character with a current token, allowing		// Otherwise, re-lex the character with a current token, allowing
// diagnostics to be emitted and flags to be set.		// diagnostics to be emitted and flags to be set.
Size = 0;		Size = 0;
getCharAndSizeSlow(Ptr, Size, &Tok);		getCharAndSizeSlow(Offset, Size, &Tok);
return Ptr+Size;		return Offset + Size;
}		}

/// getCharAndSize - Peek a single 'character' from the specified buffer,		/// getCharAndSize - Peek a single 'character' from the specified buffer,
/// get its size, and return it. This is tricky in several cases. Here we		/// get its size, and return it. This is tricky in several cases. Here we
/// just handle the trivial case and fall-back to the non-inlined		/// just handle the trivial case and fall-back to the non-inlined
/// getCharAndSizeSlow method to handle the hard case.		/// getCharAndSizeSlow method to handle the hard case.
inline char getCharAndSize(const char *Ptr, unsigned &Size) {		inline char getCharAndSize(unsigned Offset, unsigned &Size) {
// If this is not a trigraph and not a UCN or escaped newline, return		// If this is not a trigraph and not a UCN or escaped newline, return
// quickly.		// quickly.
if (isObviouslySimpleCharacter(Ptr[0])) {		if (isObviouslySimpleCharacter(BufferStart[Offset])) {
Size = 1;		Size = 1;
return *Ptr;		return BufferStart[Offset];
}		}

Size = 0;		Size = 0;
return getCharAndSizeSlow(Ptr, Size);		return getCharAndSizeSlow(Offset, Size);
}		}

/// getCharAndSizeSlow - Handle the slow/uncommon case of the getCharAndSize		/// getCharAndSizeSlow - Handle the slow/uncommon case of the getCharAndSize
/// method.		/// method.
char getCharAndSizeSlow(const char *Ptr, unsigned &Size,		char getCharAndSizeSlow(unsigned Offset, unsigned &Size,
Token *Tok = nullptr);		Token *Tok = nullptr);

/// getEscapedNewLineSize - Return the size of the specified escaped newline,		/// getEscapedNewLineSize - Return the size of the specified escaped newline,
/// or 0 if it is not an escaped newline. P[-1] is known to be a "\" on entry		/// or 0 if it is not an escaped newline. P[-1] is known to be a "\" on entry
/// to this function.		/// to this function.
static unsigned getEscapedNewLineSize(const char *P);		static unsigned getEscapedNewLineSize(const char *P);

/// SkipEscapedNewLines - If P points to an escaped newline (or a series of		/// SkipEscapedNewLines - If P points to an escaped newline (or a series of
/// them), skip over them and return the first non-escaped-newline found,		/// them), skip over them and return the first non-escaped-newline found,
/// otherwise return P.		/// otherwise return P.
static const char *SkipEscapedNewLines(const char *P);		static const char *SkipEscapedNewLines(const char *P);

/// getCharAndSizeSlowNoWarn - Same as getCharAndSizeSlow, but never emits a		/// getCharAndSizeSlowNoWarn - Same as getCharAndSizeSlow, but never emits a
/// diagnostic.		/// diagnostic.
static char getCharAndSizeSlowNoWarn(const char *Ptr, unsigned &Size,		static char getCharAndSizeSlowNoWarn(const char *Ptr, unsigned &Size,
const LangOptions &LangOpts);		const LangOptions &LangOpts);

//===--------------------------------------------------------------------===//		//===--------------------------------------------------------------------===//
// Other lexer functions.		// Other lexer functions.

void SetByteOffset(unsigned Offset, bool StartOfLine);		void SetByteOffset(unsigned Offset, bool StartOfLine);

void PropagateLineStartLeadingSpaceInfo(Token &Result);		void PropagateLineStartLeadingSpaceInfo(Token &Result);

const char *LexUDSuffix(Token &Result, const char *CurPtr,		unsigned LexUDSuffix(Token &Result, unsigned CurOffset, bool IsStringLiteral);
bool IsStringLiteral);

// Helper functions to lex the remainder of a token of the specific type.		// Helper functions to lex the remainder of a token of the specific type.

// This function handles both ASCII and Unicode identifiers after		// This function handles both ASCII and Unicode identifiers after
// the first codepoint of the identifyier has been parsed.		// the first codepoint of the identifyier has been parsed.
bool LexIdentifierContinue(Token &Result, const char *CurPtr);		bool LexIdentifierContinue(Token &Result, unsigned CurOffset);

bool LexNumericConstant (Token &Result, const char *CurPtr);		bool LexNumericConstant(Token &Result, unsigned CurOffset);
bool LexStringLiteral (Token &Result, const char *CurPtr,		bool LexStringLiteral(Token &Result, unsigned CurOffset, tok::TokenKind Kind);
tok::TokenKind Kind);		bool LexRawStringLiteral(Token &Result, unsigned CurOffset,
bool LexRawStringLiteral (Token &Result, const char *CurPtr,
tok::TokenKind Kind);
bool LexAngledStringLiteral(Token &Result, const char *CurPtr);
bool LexCharConstant (Token &Result, const char *CurPtr,
tok::TokenKind Kind);		tok::TokenKind Kind);
bool LexEndOfFile (Token &Result, const char *CurPtr);		bool LexAngledStringLiteral(Token &Result, unsigned CurOffset);
bool SkipWhitespace (Token &Result, const char *CurPtr,		bool LexCharConstant(Token &Result, unsigned CurOffset, tok::TokenKind Kind);
		bool LexEndOfFile(Token &Result, unsigned CurOffset);
		bool SkipWhitespace(Token &Result, unsigned CurOffset,
bool &TokAtPhysicalStartOfLine);		bool &TokAtPhysicalStartOfLine);
bool SkipLineComment (Token &Result, const char *CurPtr,		bool SkipLineComment(Token &Result, unsigned CurOffset,
bool &TokAtPhysicalStartOfLine);		bool &TokAtPhysicalStartOfLine);
bool SkipBlockComment (Token &Result, const char *CurPtr,		bool SkipBlockComment(Token &Result, unsigned CurOffset,
bool &TokAtPhysicalStartOfLine);		bool &TokAtPhysicalStartOfLine);
bool SaveLineComment (Token &Result, const char *CurPtr);		bool SaveLineComment(Token &Result, unsigned CurOffset);

bool IsStartOfConflictMarker(const char *CurPtr);		bool IsStartOfConflictMarker(unsigned CurOffset);
bool HandleEndOfConflictMarker(const char *CurPtr);		bool HandleEndOfConflictMarker(unsigned CurOffset);

bool lexEditorPlaceholder(Token &Result, const char *CurPtr);		bool lexEditorPlaceholder(Token &Result, unsigned CurOffset);

bool isCodeCompletionPoint(const char *CurPtr) const;		bool isCodeCompletionPoint(unsigned CurOffset) const;
void cutOffLexing() { BufferPtr = BufferEnd; }		void cutOffLexing() { BufferOffset = BufferSize; }

bool isHexaLiteral(const char *Start, const LangOptions &LangOpts);		bool isHexaLiteral(unsigned Start, const LangOptions &LangOpts);

void codeCompleteIncludedFile(const char *PathStart,		void codeCompleteIncludedFile(unsigned PathStart, unsigned CompletionPoint,
const char *CompletionPoint, bool IsAngled);		bool IsAngled);

std::optional<uint32_t>		std::optional<uint32_t> tryReadNumericUCN(unsigned &StartOffset,
tryReadNumericUCN(const char *&StartPtr, const char *SlashLoc, Token *Result);		unsigned SlashLoc, Token *Result);
std::optional<uint32_t> tryReadNamedUCN(const char *&StartPtr,		std::optional<uint32_t> tryReadNamedUCN(unsigned &StartOffset,
const char *SlashLoc, Token *Result);		unsigned SlashLoc, Token *Result);

/// Read a universal character name.		/// Read a universal character name.
///		///
/// \param StartPtr The position in the source buffer after the initial '\'.		/// \param StartOffset The position in the source buffer after the initial '\'.
/// If the UCN is syntactically well-formed (but not		/// If the UCN is syntactically well-formed (but not
/// necessarily valid), this parameter will be updated to		/// necessarily valid), this parameter will be updated to
/// point to the character after the UCN.		/// point to the character after the UCN.
/// \param SlashLoc The position in the source buffer of the '\'.		/// \param SlashLoc The position in the source buffer of the '\'.
/// \param Result The token being formed. Pass \c nullptr to suppress		/// \param Result The token being formed. Pass \c nullptr to suppress
/// diagnostics and handle token formation in the caller.		/// diagnostics and handle token formation in the caller.
///		///
/// \return The Unicode codepoint specified by the UCN, or 0 if the UCN is		/// \return The Unicode codepoint specified by the UCN, or 0 if the UCN is
/// invalid.		/// invalid.
uint32_t tryReadUCN(const char *&StartPtr, const char *SlashLoc, Token *Result);		uint32_t tryReadUCN(unsigned &StartOffset, unsigned SlashLoc, Token *Result);

/// Try to consume a UCN as part of an identifier at the current		/// Try to consume a UCN as part of an identifier at the current
/// location.		/// location.
/// \param CurPtr Initially points to the range of characters in the source		/// \param CurOffset Initially points to the range of characters in the source
/// buffer containing the '\'. Updated to point past the end of		/// buffer containing the '\'. Updated to point past the end of
/// the UCN on success.		/// the UCN on success.
/// \param Size The number of characters occupied by the '\' (including		/// \param Size The number of characters occupied by the '\' (including
/// trigraphs and escaped newlines).		/// trigraphs and escaped newlines).
/// \param Result The token being produced. Marked as containing a UCN on		/// \param Result The token being produced. Marked as containing a UCN on
/// success.		/// success.
/// \return \c true if a UCN was lexed and it produced an acceptable		/// \return \c true if a UCN was lexed and it produced an acceptable
/// identifier character, \c false otherwise.		/// identifier character, \c false otherwise.
bool tryConsumeIdentifierUCN(const char *&CurPtr, unsigned Size,		bool tryConsumeIdentifierUCN(unsigned &CurOffset, unsigned Size,
Token &Result);		Token &Result);

/// Try to consume an identifier character encoded in UTF-8.		/// Try to consume an identifier character encoded in UTF-8.
/// \param CurPtr Points to the start of the (potential) UTF-8 code unit		/// \param CurOffset Points to the start of the (potential) UTF-8 code unit
/// sequence. On success, updated to point past the end of it.		/// sequence. On success, updated to point past the end of it.
/// \return \c true if a UTF-8 sequence mapping to an acceptable identifier		/// \return \c true if a UTF-8 sequence mapping to an acceptable identifier
/// character was lexed, \c false otherwise.		/// character was lexed, \c false otherwise.
bool tryConsumeIdentifierUTF8Char(const char *&CurPtr);		bool tryConsumeIdentifierUTF8Char(unsigned &CurOffset);
};		};

} // namespace clang		} // namespace clang

#endif // LLVM_CLANG_LEX_LEXER_H		#endif // LLVM_CLANG_LEX_LEXER_H
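The header changes above all follow one pattern: each helper takes an offset and reads through `BufferStart[Offset]` instead of holding a raw `const char *`. The sketch below is a minimal illustration of why that survives buffer growth. It is not the patch's implementation: `OffsetCursor` and `readAcrossGrowth` are hypothetical names, and a `std::string` stands in for the lexer's source buffer.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for an offset-based lexer cursor; not clang::Lexer.
class OffsetCursor {
  const std::string &Buffer; // storage is owned elsewhere and may move as it grows
  unsigned Offset = 0;       // position relative to the start of the buffer

public:
  explicit OffsetCursor(const std::string &Buf) : Buffer(Buf) {}

  bool atEnd() const { return Offset >= Buffer.size(); }

  // Looks up the current storage on every access, similar in spirit to
  // BufferStart[Offset] in the diff.
  char getAndAdvance() {
    assert(!atEnd() && "reading past the end of the buffer");
    return Buffer[Offset++];
  }
};

// Reads everything available, grows the buffer at the back, then keeps reading.
inline std::string readAcrossGrowth() {
  std::string Source = "int x;";
  OffsetCursor Cursor(Source);
  std::string Seen;
  while (!Cursor.atEnd())
    Seen += Cursor.getAndAdvance();

  // Appending only at the back keeps earlier offsets meaningful, even when
  // the string moves its storage to a new address.
  Source.reserve(Source.capacity() * 4 + 1024);
  Source += "\nint y;";
  while (!Cursor.atEnd())
    Seen += Cursor.getAndAdvance();
  return Seen;
}
```

A cursor that cached a `const char *` into the original storage would be left dangling by the `reserve` call. The offset-based cursor only pays one extra indirection per access, which is the trade-off the summary describes.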

clang/include/clang/Lex/Preprocessor.h

Show First 20 Lines • Show All 1,010 Lines • ▼ Show 20 Lines	private:
///		///
/// See comments at the use-site for more context about why it is needed.		/// See comments at the use-site for more context about why it is needed.
bool SkippingExcludedConditionalBlock = false;		bool SkippingExcludedConditionalBlock = false;

/// Keeps track of skipped range mappings that were recorded while skipping		/// Keeps track of skipped range mappings that were recorded while skipping
/// excluded conditional directives. It maps the source buffer pointer at		/// excluded conditional directives. It maps the source buffer pointer at
/// the beginning of a skipped block, to the number of bytes that should be		/// the beginning of a skipped block, to the number of bytes that should be
/// skipped.		/// skipped.
llvm::DenseMap<const char *, unsigned> RecordedSkippedRanges;		llvm::DenseMap<FileID, llvm::DenseMap<unsigned , unsigned>> RecordedSkippedRanges;

void updateOutOfDateIdentifier(IdentifierInfo &II) const;		void updateOutOfDateIdentifier(IdentifierInfo &II) const;

public:		public:
Preprocessor(std::shared_ptr<PreprocessorOptions> PPOpts,		Preprocessor(std::shared_ptr<PreprocessorOptions> PPOpts,
DiagnosticsEngine &diags, LangOptions &opts, SourceManager &SM,		DiagnosticsEngine &diags, LangOptions &opts, SourceManager &SM,
HeaderSearch &Headers, ModuleLoader &TheModuleLoader,		HeaderSearch &Headers, ModuleLoader &TheModuleLoader,
IdentifierInfoLookup *IILookup = nullptr,		IdentifierInfoLookup *IILookup = nullptr,
▲ Show 20 Lines • Show All 1,089 Lines • ▼ Show 20 Lines	private:
IdentifierInfo *Ident__exception_info,		IdentifierInfo *Ident__exception_info,
*Ident___exception_info,		*Ident___exception_info,
*Ident_GetExceptionInfo;		*Ident_GetExceptionInfo;
// __finally		// __finally
IdentifierInfo *Ident__abnormal_termination,		IdentifierInfo *Ident__abnormal_termination,
*Ident___abnormal_termination,		*Ident___abnormal_termination,
*Ident_AbnormalTermination;		*Ident_AbnormalTermination;

const char *getCurLexerEndPos();		unsigned getCurLexerEndPos();
void diagnoseMissingHeaderInUmbrellaDir(const Module &Mod);		void diagnoseMissingHeaderInUmbrellaDir(const Module &Mod);

public:		public:
void PoisonSEHIdentifiers(bool Poison = true); // Borland		void PoisonSEHIdentifiers(bool Poison = true); // Borland

/// Callback invoked when the lexer reads an identifier and has		/// Callback invoked when the lexer reads an identifier and has
/// filled in the tokens IdentifierInfo member.		/// filled in the tokens IdentifierInfo member.
///		///
▲ Show 20 Lines • Show All 586 Lines • Show Last 20 Lines
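The `RecordedSkippedRanges` change replaces a pointer key with a `FileID` plus an offset. A raw pointer identifies both the buffer and the position within it, so an offset key needs the file alongside it to stay unambiguous. The sketch below illustrates that shape under stated assumptions: `FileKey` is a hypothetical stand-in for `clang::FileID`, `SkippedRangeCache` is an invented name, and `std::map` is used in place of `llvm::DenseMap` to keep the example self-contained.

```cpp
#include <cassert>
#include <map>
#include <optional>

using FileKey = int; // hypothetical stand-in for clang::FileID

// Maps (file, start offset of a skipped block) to the number of bytes to skip.
class SkippedRangeCache {
  std::map<FileKey, std::map<unsigned, unsigned>> Ranges;

public:
  void record(FileKey File, unsigned StartOffset, unsigned SkippedBytes) {
    // Sketch-only sanity check; the real code may not impose this.
    assert(SkippedBytes != 0 && "recording an empty skipped range");
    Ranges[File][StartOffset] = SkippedBytes;
  }

  std::optional<unsigned> lookup(FileKey File, unsigned StartOffset) const {
    auto FileIt = Ranges.find(File);
    if (FileIt == Ranges.end())
      return std::nullopt;
    auto RangeIt = FileIt->second.find(StartOffset);
    if (RangeIt == FileIt->second.end())
      return std::nullopt;
    return RangeIt->second;
  }
};
```

The same offset in two different files resolves to two independent entries, which is the property the pointer key previously provided for free.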

clang/lib/Format/FormatTokenLexer.cpp

Show First 20 Lines • Show All 603 Lines • ▼ Show 20 Lines	void FormatTokenLexer::tryParseJSRegexLiteral() {
}		}

RegexToken->setType(TT_RegexLiteral);		RegexToken->setType(TT_RegexLiteral);
// Treat regex literals like other string_literals.		// Treat regex literals like other string_literals.
RegexToken->Tok.setKind(tok::string_literal);		RegexToken->Tok.setKind(tok::string_literal);
RegexToken->TokenText = StringRef(RegexBegin, Offset - RegexBegin);		RegexToken->TokenText = StringRef(RegexBegin, Offset - RegexBegin);
RegexToken->ColumnWidth = RegexToken->TokenText.size();		RegexToken->ColumnWidth = RegexToken->TokenText.size();

resetLexer(SourceMgr.getFileOffset(Lex->getSourceLocation(Offset)));		resetLexer(SourceMgr.getFileOffset(Lex->getSourceLocation(Offset-Lex->getBuffer().data())));
}		}

static auto lexCSharpString(const char *Begin, const char *End, bool Verbatim,		static auto lexCSharpString(const char *Begin, const char *End, bool Verbatim,
bool Interpolated) {		bool Interpolated) {
auto Repeated = [&Begin, End]() {		auto Repeated = [&Begin, End]() {
return Begin + 1 < End && Begin[1] == Begin[0];		return Begin + 1 < End && Begin[1] == Begin[0];
};		};

▲ Show 20 Lines • Show All 104 Lines • ▼ Show 20 Lines	if (LastBreak != StringRef::npos) {
CSharpStringLiteral->IsMultiline = true;		CSharpStringLiteral->IsMultiline = true;
unsigned StartColumn = 0;		unsigned StartColumn = 0;
CSharpStringLiteral->LastLineColumnWidth =		CSharpStringLiteral->LastLineColumnWidth =
encoding::columnWidthWithTabs(LiteralText.substr(LastBreak + 1),		encoding::columnWidthWithTabs(LiteralText.substr(LastBreak + 1),
StartColumn, Style.TabWidth, Encoding);		StartColumn, Style.TabWidth, Encoding);
}		}

assert(Offset < End);		assert(Offset < End);
resetLexer(SourceMgr.getFileOffset(Lex->getSourceLocation(Offset + 1)));		resetLexer(SourceMgr.getFileOffset(Lex->getSourceLocation(Offset + 1 - Lex->getBuffer().data())));
}		}

void FormatTokenLexer::handleTemplateStrings() {		void FormatTokenLexer::handleTemplateStrings() {
FormatToken *BacktickToken = Tokens.back();		FormatToken *BacktickToken = Tokens.back();

if (BacktickToken->is(tok::l_brace)) {		if (BacktickToken->is(tok::l_brace)) {
StateStack.push(LexerState::NORMAL);		StateStack.push(LexerState::NORMAL);
return;		return;
▲ Show 20 Lines • Show All 48 Lines • ▼ Show 20 Lines	void FormatTokenLexer::handleTemplateStrings() {
if (LastBreak != StringRef::npos) {		if (LastBreak != StringRef::npos) {
BacktickToken->IsMultiline = true;		BacktickToken->IsMultiline = true;
unsigned StartColumn = 0; // The template tail spans the entire line.		unsigned StartColumn = 0; // The template tail spans the entire line.
BacktickToken->LastLineColumnWidth =		BacktickToken->LastLineColumnWidth =
encoding::columnWidthWithTabs(LiteralText.substr(LastBreak + 1),		encoding::columnWidthWithTabs(LiteralText.substr(LastBreak + 1),
StartColumn, Style.TabWidth, Encoding);		StartColumn, Style.TabWidth, Encoding);
}		}

SourceLocation loc = Lex->getSourceLocation(Offset);		SourceLocation loc = Lex->getSourceLocation(Offset - Lex->getBuffer().data());
resetLexer(SourceMgr.getFileOffset(loc));		resetLexer(SourceMgr.getFileOffset(loc));
}		}

void FormatTokenLexer::tryParsePythonComment() {		void FormatTokenLexer::tryParsePythonComment() {
FormatToken *HashToken = Tokens.back();		FormatToken *HashToken = Tokens.back();
if (!HashToken->isOneOf(tok::hash, tok::hashhash))		if (!HashToken->isOneOf(tok::hash, tok::hashhash))
return;		return;
// Turn the remainder of this line into a comment.		// Turn the remainder of this line into a comment.
const char *CommentBegin =		const char *CommentBegin =
Lex->getBufferLocation() - HashToken->TokenText.size(); // at "#"		Lex->getBufferLocation() - HashToken->TokenText.size(); // at "#"
size_t From = CommentBegin - Lex->getBuffer().begin();		size_t From = CommentBegin - Lex->getBuffer().begin();
size_t To = Lex->getBuffer().find_first_of('\n', From);		size_t To = Lex->getBuffer().find_first_of('\n', From);
if (To == StringRef::npos)		if (To == StringRef::npos)
To = Lex->getBuffer().size();		To = Lex->getBuffer().size();
size_t Len = To - From;		size_t Len = To - From;
HashToken->setType(TT_LineComment);		HashToken->setType(TT_LineComment);
HashToken->Tok.setKind(tok::comment);		HashToken->Tok.setKind(tok::comment);
HashToken->TokenText = Lex->getBuffer().substr(From, Len);		HashToken->TokenText = Lex->getBuffer().substr(From, Len);
SourceLocation Loc = To < Lex->getBuffer().size()		SourceLocation Loc = To < Lex->getBuffer().size()
? Lex->getSourceLocation(CommentBegin + Len)		? Lex->getSourceLocation(CommentBegin - Lex->getBuffer().data() + Len)
: SourceMgr.getLocForEndOfFile(ID);		: SourceMgr.getLocForEndOfFile(ID);
resetLexer(SourceMgr.getFileOffset(Loc));		resetLexer(SourceMgr.getFileOffset(Loc));
}		}

bool FormatTokenLexer::tryMerge_TMacro() {		bool FormatTokenLexer::tryMerge_TMacro() {
if (Tokens.size() < 4)		if (Tokens.size() < 4)
return false;		return false;
FormatToken *Last = Tokens.back();		FormatToken *Last = Tokens.back();
▲ Show 20 Lines • Show All 115 Lines • ▼ Show 20 Lines
/// from the end of the truncated token. Used for other languages that have		/// from the end of the truncated token. Used for other languages that have
/// different token boundaries, like JavaScript in which a comment ends at a		/// different token boundaries, like JavaScript in which a comment ends at a
/// line break regardless of whether the line break follows a backslash. Also		/// line break regardless of whether the line break follows a backslash. Also
/// used to set the lexer to the end of whitespace if the lexer regards		/// used to set the lexer to the end of whitespace if the lexer regards
/// whitespace and an unrecognized symbol as one token.		/// whitespace and an unrecognized symbol as one token.
void FormatTokenLexer::truncateToken(size_t NewLen) {		void FormatTokenLexer::truncateToken(size_t NewLen) {
assert(NewLen <= FormatTok->TokenText.size());		assert(NewLen <= FormatTok->TokenText.size());
resetLexer(SourceMgr.getFileOffset(Lex->getSourceLocation(		resetLexer(SourceMgr.getFileOffset(Lex->getSourceLocation(
Lex->getBufferLocation() - FormatTok->TokenText.size() + NewLen)));		Lex->getCurrentBufferOffset() - FormatTok->TokenText.size() + NewLen)));
FormatTok->TokenText = FormatTok->TokenText.substr(0, NewLen);		FormatTok->TokenText = FormatTok->TokenText.substr(0, NewLen);
FormatTok->ColumnWidth = encoding::columnWidthWithTabs(		FormatTok->ColumnWidth = encoding::columnWidthWithTabs(
FormatTok->TokenText, FormatTok->OriginalColumn, Style.TabWidth,		FormatTok->TokenText, FormatTok->OriginalColumn, Style.TabWidth,
Encoding);		Encoding);
FormatTok->Tok.setLength(NewLen);		FormatTok->Tok.setLength(NewLen);
}		}

/// Count the length of leading whitespace in a token.		/// Count the length of leading whitespace in a token.
▲ Show 20 Lines • Show All 292 Lines • ▼ Show 20 Lines	if (Start[0] == '\\' && (Start[1] == '\r' || Start[1] == '\n'))
return false;		return false;
size_t Len = Matches[0].size();		size_t Len = Matches[0].size();

// The kind has to be an identifier so we can match it against those defined		// The kind has to be an identifier so we can match it against those defined
// in Keywords. The kind has to be set before the length because the setLength		// in Keywords. The kind has to be set before the length because the setLength
// function checks that the kind is not an annotation.		// function checks that the kind is not an annotation.
Tok.setKind(tok::raw_identifier);		Tok.setKind(tok::raw_identifier);
Tok.setLength(Len);		Tok.setLength(Len);
Tok.setLocation(Lex->getSourceLocation(Start, Len));		Tok.setLocation(Lex->getSourceLocation(Lex->getCurrentBufferOffset(), Len));
Tok.setRawIdentifierData(Start);		Tok.setRawIdentifierData(Start);
Lex->seek(Lex->getCurrentBufferOffset() + Len, /*IsAtStartofline=*/false);		Lex->seek(Lex->getCurrentBufferOffset() + Len, /*IsAtStartofline=*/false);
return true;		return true;
}		}

void FormatTokenLexer::readRawToken(FormatToken &Tok) {		void FormatTokenLexer::readRawToken(FormatToken &Tok) {
// For Verilog, first see if there is a special token, and fall back to the		// For Verilog, first see if there is a special token, and fall back to the
// normal lexer if there isn't one.		// normal lexer if there isn't one.
▲ Show 20 Lines • Show All 41 Lines • Show Last 20 Lines
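Several `FormatTokenLexer` call sites above still compute a `const char *` and convert it with `Ptr - Lex->getBuffer().data()` before calling `getSourceLocation`. The helper below is a hypothetical sketch of that conversion with a bounds check added. `toBufferOffset` is not part of the patch, and `std::string_view` stands in for `llvm::StringRef`.

```cpp
#include <cassert>
#include <string_view>

// Converts a pointer into Buffer to an offset from the buffer start.
// Hypothetical helper; the patch spells this subtraction out at each call site.
inline unsigned toBufferOffset(std::string_view Buffer, const char *Ptr) {
  assert(Ptr >= Buffer.data() && Ptr <= Buffer.data() + Buffer.size() &&
         "pointer does not point into this buffer");
  return static_cast<unsigned>(Ptr - Buffer.data());
}
```

Centralizing the subtraction in one place would also give a single spot to assert that the pointer really belongs to the lexer's buffer, which the inline expressions in the diff do not check.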

clang/lib/Lex/Lexer.cpp

Show First 20 Lines • Show All 70 Lines • ▼ Show 20 Lines

} }

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

// Lexer Class Implementation // Lexer Class Implementation

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

void Lexer::anchor() {} void Lexer::anchor() {}

void Lexer::InitLexer(const char *BufStart, const char *BufPtr, void Lexer::InitLexer(const char *BufStart, unsigned BufOffset,

const char *BufEnd) { unsigned BufSize) {

BufferStart = BufStart; BufferStart = BufStart;

BufferPtr = BufPtr; BufferOffset = BufOffset;

BufferEnd = BufEnd; BufferSize = BufSize;

assert(BufEnd[0] == 0 && assert(BufStart[BufSize] == 0 &&

"We assume that the input buffer has a null character at the end" "We assume that the input buffer has a null character at the end"

" to simplify lexing!"); " to simplify lexing!");

// Check whether we have a BOM in the beginning of the buffer. If yes - act // Check whether we have a BOM in the beginning of the buffer. If yes - act

// accordingly. Right now we support only UTF-8 with and without BOM, so, just // accordingly. Right now we support only UTF-8 with and without BOM, so, just

// skip the UTF-8 BOM if it's present. // skip the UTF-8 BOM if it's present.

if (BufferStart == BufferPtr) { if (BufferOffset == 0) {

// Determine the size of the BOM. // Determine the size of the BOM.

StringRef Buf(BufferStart, BufferEnd - BufferStart); StringRef Buf(BufferStart, BufferSize);

size_t BOMLength = llvm::StringSwitch<size_t>(Buf) size_t BOMLength = llvm::StringSwitch<size_t>(Buf)

.StartsWith("\xEF\xBB\xBF", 3) // UTF-8 BOM .StartsWith("\xEF\xBB\xBF", 3) // UTF-8 BOM

.Default(0); .Default(0);

// Skip the BOM. // Skip the BOM.

BufferPtr += BOMLength; BufferOffset += BOMLength;

} }

Is_PragmaLexer = false; Is_PragmaLexer = false;

CurrentConflictMarkerState = CMK_None; CurrentConflictMarkerState = CMK_None;

// Start of the file is a start of line. // Start of the file is a start of line.

IsAtStartOfLine = true; IsAtStartOfLine = true;

IsAtPhysicalStartOfLine = true; IsAtPhysicalStartOfLine = true;

Show All 11 Lines void Lexer::InitLexer(const char *BufStart, unsigned BufOffset,

// of tokens (e.g. identifiers, thus disabling macro expansion). It is used // of tokens (e.g. identifiers, thus disabling macro expansion). It is used

// to quickly lex the tokens of the buffer, e.g. when handling a "#if 0" block // to quickly lex the tokens of the buffer, e.g. when handling a "#if 0" block

// or otherwise skipping over tokens. // or otherwise skipping over tokens.

LexingRawMode = false; LexingRawMode = false;

// Default to not keeping comments. // Default to not keeping comments.

ExtendedTokenMode = 0; ExtendedTokenMode = 0;

NewLinePtr = nullptr; NewLineOffset = std::nullopt;

} }
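In the new `InitLexer`, the BOM check no longer compares two pointers. It tests `BufferOffset == 0` and then advances the offset by the BOM length. The sketch below isolates that step as a free function. `initialOffsetAfterBOM` is a hypothetical name, and `std::string_view` replaces the `StringRef` and `StringSwitch` used in the real code.

```cpp
#include <cassert>
#include <string_view>

// Returns the offset at which lexing should start: just past a UTF-8 BOM if
// the buffer begins with one, otherwise 0. Hypothetical helper for illustration.
inline unsigned initialOffsetAfterBOM(std::string_view Buffer) {
  constexpr std::string_view UTF8BOM = "\xEF\xBB\xBF";
  if (Buffer.substr(0, UTF8BOM.size()) == UTF8BOM)
    return static_cast<unsigned>(UTF8BOM.size());
  return 0;
}
```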

/// Lexer constructor - Create a new lexer object for the specified buffer /// Lexer constructor - Create a new lexer object for the specified buffer

/// with the specified preprocessor managing the lexing process. This lexer /// with the specified preprocessor managing the lexing process. This lexer

/// assumes that the associated file buffer and Preprocessor objects will /// assumes that the associated file buffer and Preprocessor objects will

/// outlive it, so it doesn't take ownership of either of them. /// outlive it, so it doesn't take ownership of either of them.

Lexer::Lexer(FileID FID, const llvm::MemoryBufferRef &InputFile, Lexer::Lexer(FileID FID, const llvm::MemoryBufferRef &InputFile,

Preprocessor &PP, bool IsFirstIncludeOfFile) Preprocessor &PP, bool IsFirstIncludeOfFile)

: PreprocessorLexer(&PP, FID), : PreprocessorLexer(&PP, FID),

FileLoc(PP.getSourceManager().getLocForStartOfFile(FID)), FileLoc(PP.getSourceManager().getLocForStartOfFile(FID)),

LangOpts(PP.getLangOpts()), LineComment(LangOpts.LineComment), LangOpts(PP.getLangOpts()), LineComment(LangOpts.LineComment),

IsFirstTimeLexingFile(IsFirstIncludeOfFile) { IsFirstTimeLexingFile(IsFirstIncludeOfFile) {

InitLexer(InputFile.getBufferStart(), InputFile.getBufferStart(), InitLexer(InputFile.getBufferStart(), 0, InputFile.getBufferSize());

InputFile.getBufferEnd());

resetExtendedTokenMode(); resetExtendedTokenMode();

} }

/// Lexer constructor - Create a new raw lexer object. This object is only /// Lexer constructor - Create a new raw lexer object. This object is only

/// suitable for calls to 'LexFromRawLexer'. This lexer assumes that the text /// suitable for calls to 'LexFromRawLexer'. This lexer assumes that the text

/// range will outlive it, so it doesn't take ownership of it. /// range will outlive it, so it doesn't take ownership of it.

Lexer::Lexer(SourceLocation fileloc, const LangOptions &langOpts, Lexer::Lexer(SourceLocation fileloc, const LangOptions &langOpts,

const char *BufStart, const char *BufPtr, const char *BufEnd, const char *BufStart, const char *BufPtr, const char *BufEnd,

bool IsFirstIncludeOfFile) bool IsFirstIncludeOfFile)

: FileLoc(fileloc), LangOpts(langOpts), LineComment(LangOpts.LineComment), : FileLoc(fileloc), LangOpts(langOpts), LineComment(LangOpts.LineComment),

IsFirstTimeLexingFile(IsFirstIncludeOfFile) { IsFirstTimeLexingFile(IsFirstIncludeOfFile) {

InitLexer(BufStart, BufPtr, BufEnd); InitLexer(BufStart, BufPtr - BufStart, BufEnd - BufStart);

// We *are* in raw mode. // We *are* in raw mode.

LexingRawMode = true; LexingRawMode = true;

} }

/// Lexer constructor - Create a new raw lexer object. This object is only /// Lexer constructor - Create a new raw lexer object. This object is only

/// suitable for calls to 'LexFromRawLexer'. This lexer assumes that the text /// suitable for calls to 'LexFromRawLexer'. This lexer assumes that the text

/// range will outlive it, so it doesn't take ownership of it. /// range will outlive it, so it doesn't take ownership of it.

Show All 38 Lines Lexer *Lexer::Create_PragmaLexer(SourceLocation SpellingLoc,

llvm::MemoryBufferRef InputFile = SM.getBufferOrFake(SpellingFID); llvm::MemoryBufferRef InputFile = SM.getBufferOrFake(SpellingFID);

Lexer *L = new Lexer(SpellingFID, InputFile, PP); Lexer *L = new Lexer(SpellingFID, InputFile, PP);

// Now that the lexer is created, change the start/end locations so that we // Now that the lexer is created, change the start/end locations so that we

// just lex the subsection of the file that we want. This is lexing from a // just lex the subsection of the file that we want. This is lexing from a

// scratch buffer. // scratch buffer.

const char *StrData = SM.getCharacterData(SpellingLoc); const char *StrData = SM.getCharacterData(SpellingLoc);

L->BufferPtr = StrData; L->BufferStart = InputFile.getBufferStart();

L->BufferEnd = StrData+TokLen; L->BufferOffset =

assert(L->BufferEnd[0] == 0 && "Buffer is not nul terminated!"); StrData - InputFile.getBufferStart(); // FIXME: this is wrong

v.g.vassilev (Unsubmitted)

Not Done

Is that an outdated comment? If not, maybe elaborate on why this is wrong.

sunho (Author, Unsubmitted)

Done

It's indeed an outdated comment. I'll remove it.

L->BufferSize = L->BufferOffset + TokLen;

assert(L->BufferStart[L->BufferSize] == 0 && "Buffer is not nul terminated!");

cor3ntin (Unsubmitted)

Not Done

L->BufferSize = L->BufferOffset + TokLen;

- assert(L->BufferStart[L->BufferSize] == 0 && "Buffer is not nul terminated!");

+ assert(L->BufferStart[L->BufferSize] == 0 && "Buffer is not null-terminated!");

// Set the SourceLocation with the remapping information. This ensures that

// Set the SourceLocation with the remapping information. This ensures that // Set the SourceLocation with the remapping information. This ensures that

// GetMappedTokenLoc will remap the tokens as they are lexed. // GetMappedTokenLoc will remap the tokens as they are lexed.

L->FileLoc = SM.createExpansionLoc(SM.getLocForStartOfFile(SpellingFID), L->FileLoc = SM.createExpansionLoc(SM.getLocForStartOfFile(SpellingFID),

ExpansionLocStart, ExpansionLocStart,

ExpansionLocEnd, TokLen); ExpansionLocEnd, TokLen);

// Ensure that the lexer thinks it is inside a directive, so that end \n will // Ensure that the lexer thinks it is inside a directive, so that end \n will

// return an EOD token. // return an EOD token.

L->ParsingPreprocessorDirective = true; L->ParsingPreprocessorDirective = true;

// This lexer really is for _Pragma. // This lexer really is for _Pragma.

L->Is_PragmaLexer = true; L->Is_PragmaLexer = true;

return L; return L;

} }

void Lexer::seek(unsigned Offset, bool IsAtStartOfLine) { void Lexer::seek(unsigned Offset, bool IsAtStartOfLine) {

this->IsAtPhysicalStartOfLine = IsAtStartOfLine; this->IsAtPhysicalStartOfLine = IsAtStartOfLine;

this->IsAtStartOfLine = IsAtStartOfLine; this->IsAtStartOfLine = IsAtStartOfLine;

assert((BufferStart + Offset) <= BufferEnd); assert(Offset <= BufferSize);

BufferPtr = BufferStart + Offset; BufferOffset = Offset;

} }
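The `seek` change above stores a position as an offset rather than a pointer. A minimal sketch of why that survives buffer growth, assuming a hypothetical `OffsetCursor` type backed by `std::string` (it is not the Clang `Lexer`, and none of these names come from the patch):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a lexer cursor; not the Clang Lexer class.
struct OffsetCursor {
  std::string Buffer;     // owning storage that may reallocate when it grows
  std::size_t Offset = 0; // position kept as an index, never as a pointer

  // Append more source text; the storage address may change here.
  void grow(const std::string &More) { Buffer += More; }

  // Move the cursor, mirroring the bounds check in the offset-based seek().
  void seek(std::size_t NewOffset) {
    assert(NewOffset <= Buffer.size() && "offset out of range");
    Offset = NewOffset;
  }

  // Re-derive the address from the current buffer start on every access.
  // c_str() guarantees a readable NUL at index size().
  char current() const { return Buffer.c_str()[Offset]; }
};
```

Because only appends are allowed, an offset recorded before `grow` still names the same character afterwards, even if the underlying allocation moved.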

template <typename T> static void StringifyImpl(T &Str, char Quote) { template <typename T> static void StringifyImpl(T &Str, char Quote) {

typename T::size_type i = 0, e = Str.size(); typename T::size_type i = 0, e = Str.size();

while (i < e) { while (i < e) {

if (Str[i] == '\\' || Str[i] == Quote) { if (Str[i] == '\\' || Str[i] == Quote) {

Str.insert(Str.begin() + i, '\\'); Str.insert(Str.begin() + i, '\\');

i += 2; i += 2;

(899 unchanged lines hidden) static SourceLocation GetMappedTokenLoc(Preprocessor &PP,

// original _Pragma(...) sequence. // original _Pragma(...) sequence.

CharSourceRange II = SM.getImmediateExpansionRange(FileLoc); CharSourceRange II = SM.getImmediateExpansionRange(FileLoc);

return SM.createExpansionLoc(SpellingLoc, II.getBegin(), II.getEnd(), TokLen); return SM.createExpansionLoc(SpellingLoc, II.getBegin(), II.getEnd(), TokLen);

} }

/// getSourceLocation - Return a source location identifier for the specified /// getSourceLocation - Return a source location identifier for the specified

/// offset in the current file. /// offset in the current file.

SourceLocation Lexer::getSourceLocation(const char *Loc, SourceLocation Lexer::getSourceLocation(unsigned Loc, unsigned TokLen) const {

unsigned TokLen) const { assert(Loc <= BufferSize && "Location out of range for this buffer!");

assert(Loc >= BufferStart && Loc <= BufferEnd &&

"Location out of range for this buffer!");

// In the normal case, we're just lexing from a simple file buffer, return // In the normal case, we're just lexing from a simple file buffer, return

// the file id from FileLoc with the offset specified. // the file id from FileLoc with the offset specified.

unsigned CharNo = Loc-BufferStart; unsigned CharNo = Loc;

if (FileLoc.isFileID()) if (FileLoc.isFileID())

return FileLoc.getLocWithOffset(CharNo); return FileLoc.getLocWithOffset(CharNo);

// Otherwise, this is the _Pragma lexer case, which pretends that all of the // Otherwise, this is the _Pragma lexer case, which pretends that all of the

// tokens are lexed from where the _Pragma was defined. // tokens are lexed from where the _Pragma was defined.

assert(PP && "This doesn't work on raw lexers"); assert(PP && "This doesn't work on raw lexers");

return GetMappedTokenLoc(*PP, FileLoc, CharNo, TokLen); return GetMappedTokenLoc(*PP, FileLoc, CharNo, TokLen);

} }

/// Diag - Forwarding function for diagnostics. This translate a source /// Diag - Forwarding function for diagnostics. This translate a source

/// position in the current buffer into a SourceLocation object for rendering. /// position in the current buffer into a SourceLocation object for rendering.

DiagnosticBuilder Lexer::Diag(const char *Loc, unsigned DiagID) const { DiagnosticBuilder Lexer::Diag(unsigned Loc, unsigned DiagID) const {

return PP->Diag(getSourceLocation(Loc), DiagID); return PP->Diag(getSourceLocation(Loc), DiagID);

} }

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

// Trigraph and Escaped Newline Handling Code. // Trigraph and Escaped Newline Handling Code.

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

/// GetTrigraphCharForLetter - Given a character that occurs after a ?? pair, /// GetTrigraphCharForLetter - Given a character that occurs after a ?? pair,

(19 unchanged lines hidden)

/// whether trigraphs are enabled or not. /// whether trigraphs are enabled or not.

static char DecodeTrigraphChar(const char *CP, Lexer *L, bool Trigraphs) { static char DecodeTrigraphChar(const char *CP, Lexer *L, bool Trigraphs) {

char Res = GetTrigraphCharForLetter(*CP); char Res = GetTrigraphCharForLetter(*CP);

if (!Res) if (!Res)

return Res; return Res;

if (!Trigraphs) { if (!Trigraphs) {

if (L && !L->isLexingRawMode()) if (L && !L->isLexingRawMode())

L->Diag(CP-2, diag::trigraph_ignored); L->Diag(CP - 2 - L->getBuffer().data(), diag::trigraph_ignored);

shafik: I wonder whether we really need these pointer gymnastics; making this a member function might eliminate the need for them.

sunho (author): Yes, we can change it to an offset here.

cor3ntin: Agreed, that would be nice! (In all places where `getBuffer().data()` is used.)

return 0; return 0;

} }

if (L && !L->isLexingRawMode()) if (L && !L->isLexingRawMode())

L->Diag(CP-2, diag::trigraph_converted) << StringRef(&Res, 1); L->Diag(CP - 2 - L->getBuffer().data(), diag::trigraph_converted)

<< StringRef(&Res, 1);

return Res; return Res;

} }
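The review thread above asks whether `DecodeTrigraphChar` could take an offset instead of converting a pointer back with `getBuffer().data()`. One possible shape is sketched below. `decodeTrigraphAt` and `trigraphCharForLetter` are hypothetical names, the buffer is a plain `std::string_view`, and, unlike the function in the diff, `Offset` here names the first `?` rather than the letter after `??`.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// The nine standard trigraph letters; returns 0 for anything else.
static char trigraphCharForLetter(char Letter) {
  switch (Letter) {
  case '=':  return '#';
  case ')':  return ']';
  case '(':  return '[';
  case '!':  return '|';
  case '\'': return '^';
  case '>':  return '}';
  case '/':  return '\\';
  case '<':  return '{';
  case '-':  return '~';
  default:   return 0;
  }
}

// Hypothetical offset-based decoder. Offset is the position of the first '?'.
// When a valid trigraph is found, DiagOffset is set to the offset a
// diagnostic would point at; no pointer arithmetic is needed for that.
static char decodeTrigraphAt(std::string_view Buffer, std::size_t Offset,
                             bool TrigraphsEnabled, std::size_t &DiagOffset) {
  if (Offset + 2 >= Buffer.size() || Buffer[Offset] != '?' ||
      Buffer[Offset + 1] != '?')
    return 0;
  char Res = trigraphCharForLetter(Buffer[Offset + 2]);
  if (!Res)
    return 0;
  DiagOffset = Offset;                // where "ignored"/"converted" would point
  return TrigraphsEnabled ? Res : 0;  // disabled trigraphs decode to nothing
}
```

With this shape the caller passes the offset it already has, and the diagnostic location falls out directly instead of being recovered as `CP - 2 - L->getBuffer().data()`.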

/// getEscapedNewLineSize - Return the size of the specified escaped newline, /// getEscapedNewLineSize - Return the size of the specified escaped newline,

/// or 0 if it is not an escaped newline. P[-1] is known to be a "\" or a /// or 0 if it is not an escaped newline. P[-1] is known to be a "\" or a

/// trigraph equivalent on entry to this function. /// trigraph equivalent on entry to this function.

unsigned Lexer::getEscapedNewLineSize(const char *Ptr) { unsigned Lexer::getEscapedNewLineSize(const char *Ptr) {

unsigned Size = 0; unsigned Size = 0;

(114 unchanged lines hidden)

/// the char after it. /// the char after it.

/// ///

/// This handles the slow/uncommon case of the getCharAndSize method. Here we /// This handles the slow/uncommon case of the getCharAndSize method. Here we

/// know that we can accumulate into Size, and that we have already incremented /// know that we can accumulate into Size, and that we have already incremented

/// Ptr by Size bytes. /// Ptr by Size bytes.

/// ///

/// NOTE: When this method is updated, getCharAndSizeSlowNoWarn (below) should /// NOTE: When this method is updated, getCharAndSizeSlowNoWarn (below) should

/// be updated to match. /// be updated to match.

char Lexer::getCharAndSizeSlow(const char *Ptr, unsigned &Size, char Lexer::getCharAndSizeSlow(unsigned Offset, unsigned &Size, Token *Tok) {

Token *Tok) {

// If we have a slash, look for an escaped newline. // If we have a slash, look for an escaped newline.

if (Ptr[0] == '\\') { if (BufferStart[Offset] == '\\') {

++Size; ++Size;

++Ptr; ++Offset;

Slash: Slash:

// Common case, backslash-char where the char is not whitespace. // Common case, backslash-char where the char is not whitespace.

if (!isWhitespace(Ptr[0])) return '\\'; if (!isWhitespace(BufferStart[Offset]))

return '\\';

// See if we have optional whitespace characters between the slash and // See if we have optional whitespace characters between the slash and

// newline. // newline.

if (unsigned EscapedNewLineSize = getEscapedNewLineSize(Ptr)) { if (unsigned EscapedNewLineSize =

getEscapedNewLineSize(&BufferStart[Offset])) {

cor3ntin (suggested change):

if (unsigned EscapedNewLineSize =

- getEscapedNewLineSize(&BufferStart[Offset])) {

+ getEscapedNewLineSize(BufferStart + Offset)) {

// Remember that this token needs to be cleaned.

// Remember that this token needs to be cleaned. // Remember that this token needs to be cleaned.

if (Tok) Tok->setFlag(Token::NeedsCleaning); if (Tok) Tok->setFlag(Token::NeedsCleaning);

// Warn if there was whitespace between the backslash and newline. // Warn if there was whitespace between the backslash and newline.

if (Ptr[0] != '\n' && Ptr[0] != '\r' && Tok && !isLexingRawMode()) if (BufferStart[Offset] != '\n' && BufferStart[Offset] != '\r' && Tok &&

Diag(Ptr, diag::backslash_newline_space); !isLexingRawMode())

Diag(Offset, diag::backslash_newline_space);

// Found backslash<whitespace><newline>. Parse the char after it. // Found backslash<whitespace><newline>. Parse the char after it.

Size += EscapedNewLineSize; Size += EscapedNewLineSize;

Ptr += EscapedNewLineSize; Offset += EscapedNewLineSize;

// Use slow version to accumulate a correct size field. // Use slow version to accumulate a correct size field.

return getCharAndSizeSlow(Ptr, Size, Tok); return getCharAndSizeSlow(Offset, Size, Tok);

} }

// Otherwise, this is not an escaped newline, just return the slash. // Otherwise, this is not an escaped newline, just return the slash.

return '\\'; return '\\';

} }

// If this is a trigraph, process it. // If this is a trigraph, process it.

if (Ptr[0] == '?' && Ptr[1] == '?') { if (BufferStart[Offset] == '?' && BufferStart[Offset + 1] == '?') {

// If this is actually a legal trigraph (not something like "??x"), emit // If this is actually a legal trigraph (not something like "??x"), emit

// a trigraph warning. If so, and if trigraphs are enabled, return it. // a trigraph warning. If so, and if trigraphs are enabled, return it.

if (char C = DecodeTrigraphChar(Ptr + 2, Tok ? this : nullptr, if (char C = DecodeTrigraphChar(&BufferStart[Offset + 2],

cor3ntin (suggested change):

// a trigraph warning. If so, and if trigraphs are enabled, return it.

- if (char C = DecodeTrigraphChar(&BufferStart[Offset + 2],

+ if (char C = DecodeTrigraphChar(BufferStart + Offset + 2,

Tok ? this : nullptr, LangOpts.Trigraphs)) {

LangOpts.Trigraphs)) { Tok ? this : nullptr, LangOpts.Trigraphs)) {

// Remember that this token needs to be cleaned. // Remember that this token needs to be cleaned.

if (Tok) Tok->setFlag(Token::NeedsCleaning); if (Tok) Tok->setFlag(Token::NeedsCleaning);

Ptr += 3; Offset += 3;

Size += 3; Size += 3;

if (C == '\\') goto Slash; if (C == '\\') goto Slash;

return C; return C;

} }

// If this is neither, return a single character. // If this is neither, return a single character.

++Size; ++Size;

return *Ptr; return BufferStart[Offset];

} }
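To make the offset-based slow path easier to follow, here is a much reduced sketch that handles only backslash-newline splices. It omits trigraphs, diagnostics, and the `Token` flag that `getCharAndSizeSlow` deals with. `charAndSizeAt` is a hypothetical name and the buffer is a `std::string`, whose `c_str()` supplies the NUL terminator the real lexer also relies on.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the logical character at Offset, skipping any backslash-newline
// splices in front of it. Size receives the number of bytes consumed.
// Simplified: no trigraphs, no warnings, no token flags.
static char charAndSizeAt(const std::string &Buffer, std::size_t Offset,
                          unsigned &Size) {
  const char *Base = Buffer.c_str(); // re-derived on each call, never cached
  Size = 0;
  while (Base[Offset] == '\\') {
    std::size_t P = Offset + 1;
    // Optional horizontal whitespace between the backslash and the newline.
    while (Base[P] == ' ' || Base[P] == '\t')
      ++P;
    if (Base[P] != '\n' && Base[P] != '\r')
      break; // a plain backslash, not a splice
    // Treat "\r\n" or "\n\r" as a single newline.
    if ((Base[P + 1] == '\n' || Base[P + 1] == '\r') && Base[P + 1] != Base[P])
      ++P;
    ++P;
    Size += static_cast<unsigned>(P - Offset);
    Offset = P;
  }
  ++Size;
  return Base[Offset];
}
```

The loop only ever advances `Offset` and indexes from the current buffer start, which is the property the patch needs if the buffer address can change between calls.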

/// getCharAndSizeSlowNoWarn - Handle the slow/uncommon case of the /// getCharAndSizeSlowNoWarn - Handle the slow/uncommon case of the

/// getCharAndSizeNoWarn method. Here we know that we can accumulate into Size, /// getCharAndSizeNoWarn method. Here we know that we can accumulate into Size,

/// and that we have already incremented Ptr by Size bytes. /// and that we have already incremented Ptr by Size bytes.

/// ///

/// NOTE: When this method is updated, getCharAndSizeSlow (above) should /// NOTE: When this method is updated, getCharAndSizeSlow (above) should

/// be updated to match. /// be updated to match.

(39 unchanged lines hidden)

} }

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

// Helper methods for lexing. // Helper methods for lexing.

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

/// Routine that indiscriminately sets the offset into the source file. /// Routine that indiscriminately sets the offset into the source file.

void Lexer::SetByteOffset(unsigned Offset, bool StartOfLine) { void Lexer::SetByteOffset(unsigned Offset, bool StartOfLine) {

BufferPtr = BufferStart + Offset; BufferOffset = Offset;

if (BufferPtr > BufferEnd) if (Offset > BufferSize)

BufferPtr = BufferEnd; Offset = BufferSize;

// FIXME: What exactly does the StartOfLine bit mean? There are two // FIXME: What exactly does the StartOfLine bit mean? There are two

// possible meanings for the "start" of the line: the first token on the // possible meanings for the "start" of the line: the first token on the

// unexpanded line, or the first token on the expanded line. // unexpanded line, or the first token on the expanded line.

IsAtStartOfLine = StartOfLine; IsAtStartOfLine = StartOfLine;

IsAtPhysicalStartOfLine = StartOfLine; IsAtPhysicalStartOfLine = StartOfLine;

} }
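In the pointer-based version of `SetByteOffset`, the clamp is applied to the stored `BufferPtr`. In the new-side lines as rendered here, the clamp is written against the `Offset` parameter after it has already been copied into `BufferOffset`, so it reads as if the stored position would not be clamped. Whether that reflects the final patch is not clear from this page. A minimal sketch of clamping the stored member, using a hypothetical `ByteCursor` type:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in holding only the two fields relevant to the clamp.
struct ByteCursor {
  unsigned BufferSize = 0;   // number of bytes before the NUL terminator
  unsigned BufferOffset = 0; // current position

  // Store the requested offset, clamped so it never passes the buffer end.
  void setByteOffset(unsigned Offset) {
    BufferOffset = std::min(Offset, BufferSize);
  }
};
```

Clamping before or during the store keeps the behaviour of the original `if (BufferPtr > BufferEnd) BufferPtr = BufferEnd;`.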

static bool isUnicodeWhitespace(uint32_t Codepoint) { static bool isUnicodeWhitespace(uint32_t Codepoint) {

(94 unchanged lines hidden) static void diagnoseExtensionInIdentifier(DiagnosticsEngine &Diags, uint32_t C,

(void)MathStartChars; (void)MathStartChars;

(void)MathContinueChars; (void)MathContinueChars;

assert((MathStartChars.contains(C) || MathContinueChars.contains(C)) && assert((MathStartChars.contains(C) || MathContinueChars.contains(C)) &&

"Unexpected mathematical notation codepoint"); "Unexpected mathematical notation codepoint");

Diags.Report(Range.getBegin(), diag::ext_mathematical_notation) Diags.Report(Range.getBegin(), diag::ext_mathematical_notation)

<< codepointAsHexString(C) << Range; << codepointAsHexString(C) << Range;

} }

static inline CharSourceRange makeCharRange(Lexer &L, const char *Begin, static inline CharSourceRange makeCharRange(Lexer &L, unsigned Begin,

const char *End) { unsigned End) {

return CharSourceRange::getCharRange(L.getSourceLocation(Begin), return CharSourceRange::getCharRange(L.getSourceLocation(Begin),

L.getSourceLocation(End)); L.getSourceLocation(End));

} }

static void maybeDiagnoseIDCharCompat(DiagnosticsEngine &Diags, uint32_t C, static void maybeDiagnoseIDCharCompat(DiagnosticsEngine &Diags, uint32_t C,

CharSourceRange Range, bool IsFirst) { CharSourceRange Range, bool IsFirst) {

// Check C99 compatibility. // Check C99 compatibility.

if (!Diags.isIgnored(diag::warn_c99_compat_unicode_id, Range.getBegin())) { if (!Diags.isIgnored(diag::warn_c99_compat_unicode_id, Range.getBegin())) {

(119 unchanged lines hidden) Diags.Report(Range.getBegin(), diag::err_character_not_allowed_identifier)

<< FixItHint::CreateRemoval(Range); << FixItHint::CreateRemoval(Range);

} else { } else {

Diags.Report(Range.getBegin(), diag::err_character_not_allowed) Diags.Report(Range.getBegin(), diag::err_character_not_allowed)

<< Range << codepointAsHexString(CodePoint) << Range << codepointAsHexString(CodePoint)

<< FixItHint::CreateRemoval(Range); << FixItHint::CreateRemoval(Range);

} }

bool Lexer::tryConsumeIdentifierUCN(const char *&CurPtr, unsigned Size, bool Lexer::tryConsumeIdentifierUCN(unsigned &CurOffset, unsigned Size,

Token &Result) { Token &Result) {

const char *UCNPtr = CurPtr + Size; unsigned UCNOffset = CurOffset + Size;

uint32_t CodePoint = tryReadUCN(UCNPtr, CurPtr, /*Token=*/nullptr); uint32_t CodePoint = tryReadUCN(UCNOffset, CurOffset, /*Token=*/nullptr);

if (CodePoint == 0) { if (CodePoint == 0) {

return false; return false;

} }

bool IsExtension = false; bool IsExtension = false;

if (!isAllowedIDChar(CodePoint, LangOpts, IsExtension)) { if (!isAllowedIDChar(CodePoint, LangOpts, IsExtension)) {

if (isASCII(CodePoint) || isUnicodeWhitespace(CodePoint)) if (isASCII(CodePoint) || isUnicodeWhitespace(CodePoint))

return false; return false;

if (!isLexingRawMode() && !ParsingPreprocessorDirective && if (!isLexingRawMode() && !ParsingPreprocessorDirective &&

!PP->isPreprocessedOutput()) !PP->isPreprocessedOutput())

diagnoseInvalidUnicodeCodepointInIdentifier( diagnoseInvalidUnicodeCodepointInIdentifier(

PP->getDiagnostics(), LangOpts, CodePoint, PP->getDiagnostics(), LangOpts, CodePoint,

makeCharRange(*this, CurPtr, UCNPtr), makeCharRange(*this, CurOffset, UCNOffset),

/*IsFirst=*/false); /*IsFirst=*/false);

// We got a unicode codepoint that is neither a space nor a // We got a unicode codepoint that is neither a space nor a

// a valid identifier part. // a valid identifier part.

// Carry on as if the codepoint was valid for recovery purposes. // Carry on as if the codepoint was valid for recovery purposes.

} else if (!isLexingRawMode()) { } else if (!isLexingRawMode()) {

if (IsExtension) if (IsExtension)

diagnoseExtensionInIdentifier(PP->getDiagnostics(), CodePoint, diagnoseExtensionInIdentifier(PP->getDiagnostics(), CodePoint,

makeCharRange(*this, CurPtr, UCNPtr)); makeCharRange(*this, CurOffset, UCNOffset));

maybeDiagnoseIDCharCompat(PP->getDiagnostics(), CodePoint, maybeDiagnoseIDCharCompat(PP->getDiagnostics(), CodePoint,

makeCharRange(*this, CurPtr, UCNPtr), makeCharRange(*this, CurOffset, UCNOffset),

/*IsFirst=*/false); /*IsFirst=*/false);

} }

Result.setFlag(Token::HasUCN); Result.setFlag(Token::HasUCN);

if ((UCNPtr - CurPtr == 6 && CurPtr[1] == 'u') || if ((UCNOffset - CurOffset == 6 && BufferStart[CurOffset + 1] == 'u') ||

(UCNPtr - CurPtr == 10 && CurPtr[1] == 'U')) (UCNOffset - CurOffset == 10 && BufferStart[CurOffset + 1] == 'U'))

CurPtr = UCNPtr; CurOffset = UCNOffset;

else else

while (CurPtr != UCNPtr) while (CurOffset != UCNOffset)

(void)getAndAdvanceChar(CurPtr, Result); (void)getAndAdvanceChar(CurOffset, Result);

return true; return true;

} }

bool Lexer::tryConsumeIdentifierUTF8Char(const char *&CurPtr) { bool Lexer::tryConsumeIdentifierUTF8Char(unsigned &CurOffset) {

const char *UnicodePtr = CurPtr; const char *UnicodePtr = &BufferStart[CurOffset];

cor3ntin (suggested change):

bool Lexer::tryConsumeIdentifierUTF8Char(unsigned &CurOffset) {

- const char *UnicodePtr = &BufferStart[CurOffset];

+ const char *UnicodePtr = BufferStart + CurOffset;

llvm::UTF32 CodePoint;

llvm::UTF32 CodePoint; llvm::UTF32 CodePoint;

llvm::ConversionResult Result = llvm::ConversionResult Result =

llvm::convertUTF8Sequence((const llvm::UTF8 **)&UnicodePtr, llvm::convertUTF8Sequence((const llvm::UTF8 **)&UnicodePtr,

(const llvm::UTF8 *)BufferEnd, (const llvm::UTF8 *)&BufferStart[BufferSize],

cor3ntin: Ditto in all similar places; I think it reads easier.

&CodePoint, &CodePoint, llvm::strictConversion);

llvm::strictConversion);

if (Result != llvm::conversionOK) if (Result != llvm::conversionOK)

return false; return false;

bool IsExtension = false; bool IsExtension = false;

if (!isAllowedIDChar(static_cast<uint32_t>(CodePoint), LangOpts, if (!isAllowedIDChar(static_cast<uint32_t>(CodePoint), LangOpts,

IsExtension)) { IsExtension)) {

if (isASCII(CodePoint) || isUnicodeWhitespace(CodePoint)) if (isASCII(CodePoint) || isUnicodeWhitespace(CodePoint))

return false; return false;

if (!isLexingRawMode() && !ParsingPreprocessorDirective && if (!isLexingRawMode() && !ParsingPreprocessorDirective &&

!PP->isPreprocessedOutput()) !PP->isPreprocessedOutput())

diagnoseInvalidUnicodeCodepointInIdentifier( diagnoseInvalidUnicodeCodepointInIdentifier(

PP->getDiagnostics(), LangOpts, CodePoint, PP->getDiagnostics(), LangOpts, CodePoint,

makeCharRange(*this, CurPtr, UnicodePtr), /*IsFirst=*/false); makeCharRange(*this, CurOffset, UnicodePtr - BufferStart),

/*IsFirst=*/false);

// We got a unicode codepoint that is neither a space nor a // We got a unicode codepoint that is neither a space nor a

// a valid identifier part. Carry on as if the codepoint was // a valid identifier part. Carry on as if the codepoint was

// valid for recovery purposes. // valid for recovery purposes.

} else if (!isLexingRawMode()) { } else if (!isLexingRawMode()) {

if (IsExtension) if (IsExtension)

diagnoseExtensionInIdentifier(PP->getDiagnostics(), CodePoint, diagnoseExtensionInIdentifier(PP->getDiagnostics(), CodePoint,

makeCharRange(*this, CurPtr, UnicodePtr)); makeCharRange(*this, CurOffset, UnicodePtr - BufferStart));

maybeDiagnoseIDCharCompat(PP->getDiagnostics(), CodePoint, maybeDiagnoseIDCharCompat(

makeCharRange(*this, CurPtr, UnicodePtr), PP->getDiagnostics(), CodePoint,

makeCharRange(*this, CurOffset, UnicodePtr - BufferStart),

/*IsFirst=*/false); /*IsFirst=*/false);

maybeDiagnoseUTF8Homoglyph(PP->getDiagnostics(), CodePoint, maybeDiagnoseUTF8Homoglyph(

makeCharRange(*this, CurPtr, UnicodePtr)); PP->getDiagnostics(), CodePoint,

makeCharRange(*this, CurOffset, UnicodePtr - BufferStart));

} }

CurPtr = UnicodePtr; CurOffset = UnicodePtr - BufferStart;

return true; return true;

} }
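`tryConsumeIdentifierUTF8Char` shows the pattern used when a callee still needs pointers: derive a pointer from the offset just before the call, then convert the advanced pointer back to an offset. The sketch below illustrates that round trip with a simplified decoder standing in for `llvm::convertUTF8Sequence`. `decodeUTF8` and `consumeUTF8At` are hypothetical names, and the decoder is deliberately not strict (it does not reject overlong or surrogate encodings).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Simplified pointer-advancing decoder; on success Ptr moves past the
// sequence. Handles 1- to 4-byte forms and checks continuation bytes only.
static bool decodeUTF8(const unsigned char *&Ptr, const unsigned char *End,
                       std::uint32_t &CodePoint) {
  if (Ptr == End)
    return false;
  unsigned char Lead = *Ptr;
  unsigned Extra;
  if (Lead < 0x80) { CodePoint = Lead; Extra = 0; }
  else if ((Lead & 0xE0) == 0xC0) { CodePoint = Lead & 0x1F; Extra = 1; }
  else if ((Lead & 0xF0) == 0xE0) { CodePoint = Lead & 0x0F; Extra = 2; }
  else if ((Lead & 0xF8) == 0xF0) { CodePoint = Lead & 0x07; Extra = 3; }
  else return false;
  if (static_cast<std::size_t>(End - Ptr) < Extra + 1u)
    return false; // truncated sequence
  for (unsigned I = 1; I <= Extra; ++I) {
    if ((Ptr[I] & 0xC0) != 0x80)
      return false; // not a continuation byte
    CodePoint = (CodePoint << 6) | (Ptr[I] & 0x3F);
  }
  Ptr += Extra + 1;
  return true;
}

// Offset in, offset out: the pointer exists only for the duration of the call.
static bool consumeUTF8At(const std::string &Buffer, std::size_t &Offset,
                          std::uint32_t &CodePoint) {
  const unsigned char *Start =
      reinterpret_cast<const unsigned char *>(Buffer.data());
  const unsigned char *Ptr = Start + Offset;
  if (!decodeUTF8(Ptr, Start + Buffer.size(), CodePoint))
    return false; // Offset is left untouched on failure
  Offset = static_cast<std::size_t>(Ptr - Start);
  return true;
}
```

Keeping the temporary pointer local to one call means nothing stale is left behind if the buffer is later reallocated.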

bool Lexer::LexUnicodeIdentifierStart(Token &Result, uint32_t C, bool Lexer::LexUnicodeIdentifierStart(Token &Result, uint32_t C,

const char *CurPtr) { unsigned CurOffset) {

bool IsExtension = false; bool IsExtension = false;

if (isAllowedInitiallyIDChar(C, LangOpts, IsExtension)) { if (isAllowedInitiallyIDChar(C, LangOpts, IsExtension)) {

if (!isLexingRawMode() && !ParsingPreprocessorDirective && if (!isLexingRawMode() && !ParsingPreprocessorDirective &&

!PP->isPreprocessedOutput()) { !PP->isPreprocessedOutput()) {

if (IsExtension) if (IsExtension)

diagnoseExtensionInIdentifier(PP->getDiagnostics(), C, diagnoseExtensionInIdentifier(PP->getDiagnostics(), C,

makeCharRange(*this, BufferPtr, CurPtr)); makeCharRange(*this, BufferOffset, CurOffset));

maybeDiagnoseIDCharCompat(PP->getDiagnostics(), C, maybeDiagnoseIDCharCompat(PP->getDiagnostics(), C,

makeCharRange(*this, BufferPtr, CurPtr), makeCharRange(*this, BufferOffset, CurOffset),

/*IsFirst=*/true); /*IsFirst=*/true);

maybeDiagnoseUTF8Homoglyph(PP->getDiagnostics(), C, maybeDiagnoseUTF8Homoglyph(PP->getDiagnostics(), C,

makeCharRange(*this, BufferPtr, CurPtr)); makeCharRange(*this, BufferOffset, CurOffset));

} }

MIOpt.ReadToken(); MIOpt.ReadToken();

return LexIdentifierContinue(Result, CurPtr); return LexIdentifierContinue(Result, CurOffset);

} }

if (!isLexingRawMode() && !ParsingPreprocessorDirective && if (!isLexingRawMode() && !ParsingPreprocessorDirective &&

!PP->isPreprocessedOutput() && !isASCII(*BufferPtr) && !PP->isPreprocessedOutput() && !isASCII(BufferStart[BufferOffset]) &&

!isUnicodeWhitespace(C)) { !isUnicodeWhitespace(C)) {

// Non-ASCII characters tend to creep into source code unintentionally. // Non-ASCII characters tend to creep into source code unintentionally.

// Instead of letting the parser complain about the unknown token, // Instead of letting the parser complain about the unknown token,

// just drop the character. // just drop the character.

// Note that we can /only/ do this when the non-ASCII character is actually // Note that we can /only/ do this when the non-ASCII character is actually

// spelled as Unicode, not written as a UCN. The standard requires that // spelled as Unicode, not written as a UCN. The standard requires that

// we not throw away any possible preprocessor tokens, but there's a // we not throw away any possible preprocessor tokens, but there's a

// loophole in the mapping of Unicode characters to basic character set // loophole in the mapping of Unicode characters to basic character set

// characters that allows us to map these particular characters to, say, // characters that allows us to map these particular characters to, say,

// whitespace. // whitespace.

diagnoseInvalidUnicodeCodepointInIdentifier( diagnoseInvalidUnicodeCodepointInIdentifier(

PP->getDiagnostics(), LangOpts, C, PP->getDiagnostics(), LangOpts, C,

makeCharRange(*this, BufferPtr, CurPtr), /*IsStart*/ true); makeCharRange(*this, BufferOffset, CurOffset), /*IsStart*/ true);

shafik (suggested change, nit):

PP->getDiagnostics(), LangOpts, C,

- makeCharRange(*this, BufferOffset, CurOffset), /*IsStart*/ true);

+ makeCharRange(*this, BufferOffset, CurOffset), /*IsStart=*/true);

BufferOffset = CurOffset;

BufferPtr = CurPtr; BufferOffset = CurOffset;

return false; return false;

} }

// Otherwise, we have an explicit UCN or a character that's unlikely to show // Otherwise, we have an explicit UCN or a character that's unlikely to show

// up by accident. // up by accident.

MIOpt.ReadToken(); MIOpt.ReadToken();

FormTokenWithChars(Result, CurPtr, tok::unknown); FormTokenWithChars(Result, CurOffset, tok::unknown);

return true; return true;

} }

bool Lexer::LexIdentifierContinue(Token &Result, const char *CurPtr) { bool Lexer::LexIdentifierContinue(Token &Result, unsigned CurOffset) {

// Match [_A-Za-z0-9]*, we have already matched an identifier start. // Match [_A-Za-z0-9]*, we have already matched an identifier start.

while (true) { while (true) {

unsigned char C = *CurPtr; unsigned char C = BufferStart[CurOffset];

// Fast path. // Fast path.

if (isAsciiIdentifierContinue(C)) { if (isAsciiIdentifierContinue(C)) {

++CurPtr; ++CurOffset;

continue; continue;

} }

unsigned Size; unsigned Size;

// Slow path: handle trigraph, unicode codepoints, UCNs. // Slow path: handle trigraph, unicode codepoints, UCNs.

C = getCharAndSize(CurPtr, Size); C = getCharAndSize(CurOffset, Size);

if (isAsciiIdentifierContinue(C)) { if (isAsciiIdentifierContinue(C)) {

CurPtr = ConsumeChar(CurPtr, Size, Result); CurOffset = ConsumeChar(CurOffset, Size, Result);

continue; continue;

} }

if (C == '$') { if (C == '$') {

// If we hit a $ and they are not supported in identifiers, we are done. // If we hit a $ and they are not supported in identifiers, we are done.

if (!LangOpts.DollarIdents) if (!LangOpts.DollarIdents)

break; break;

// Otherwise, emit a diagnostic and continue. // Otherwise, emit a diagnostic and continue.

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(CurPtr, diag::ext_dollar_in_identifier); Diag(CurOffset, diag::ext_dollar_in_identifier);

CurPtr = ConsumeChar(CurPtr, Size, Result); CurOffset = ConsumeChar(CurOffset, Size, Result);

continue; continue;

} }

if (C == '\\' && tryConsumeIdentifierUCN(CurPtr, Size, Result)) if (C == '\\' && tryConsumeIdentifierUCN(CurOffset, Size, Result))

continue; continue;

if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurPtr)) if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurOffset))

continue; continue;

// Neither an expected Unicode codepoint nor a UCN. // Neither an expected Unicode codepoint nor a UCN.

break; break;

} }

const char *IdStart = BufferPtr; const char *IdStart = BufferStart + BufferOffset;

FormTokenWithChars(Result, CurPtr, tok::raw_identifier); FormTokenWithChars(Result, CurOffset, tok::raw_identifier);

Result.setRawIdentifierData(IdStart); Result.setRawIdentifierData(IdStart);

// If we are in raw mode, return this identifier raw. There is no need to // If we are in raw mode, return this identifier raw. There is no need to

// look up identifier information or attempt to macro expand it. // look up identifier information or attempt to macro expand it.

if (LexingRawMode) if (LexingRawMode)

return true; return true;

// Fill in Result.IdentifierInfo and update the token kind, // Fill in Result.IdentifierInfo and update the token kind,

// looking up the identifier in the identifier table. // looking up the identifier in the identifier table.

IdentifierInfo *II = PP->LookUpIdentifierInfo(Result); IdentifierInfo *II = PP->LookUpIdentifierInfo(Result);

// Note that we have to call PP->LookUpIdentifierInfo() even for code // Note that we have to call PP->LookUpIdentifierInfo() even for code

// completion, it writes IdentifierInfo into Result, and callers rely on it. // completion, it writes IdentifierInfo into Result, and callers rely on it.

// If the completion point is at the end of an identifier, we want to treat // If the completion point is at the end of an identifier, we want to treat

// the identifier as incomplete even if it resolves to a macro or a keyword. // the identifier as incomplete even if it resolves to a macro or a keyword.

// This allows e.g. 'class^' to complete to 'classifier'. // This allows e.g. 'class^' to complete to 'classifier'.

if (isCodeCompletionPoint(CurPtr)) { if (isCodeCompletionPoint(CurOffset)) {

// Return the code-completion token. // Return the code-completion token.

Result.setKind(tok::code_completion); Result.setKind(tok::code_completion);

// Skip the code-completion char and all immediate identifier characters. // Skip the code-completion char and all immediate identifier characters.

// This ensures we get consistent behavior when completing at any point in // This ensures we get consistent behavior when completing at any point in

// an identifier (i.e. at the start, in the middle, at the end). Note that // an identifier (i.e. at the start, in the middle, at the end). Note that

// only simple cases (i.e. [a-zA-Z0-9_]) are supported to keep the code // only simple cases (i.e. [a-zA-Z0-9_]) are supported to keep the code

// simpler. // simpler.

assert(*CurPtr == 0 && "Completion character must be 0"); assert(BufferStart[CurOffset] == 0 && "Completion character must be 0");

++CurPtr; ++CurOffset;

// Note that code completion token is not added as a separate character // Note that code completion token is not added as a separate character

// when the completion point is at the end of the buffer. Therefore, we need // when the completion point is at the end of the buffer. Therefore, we need

// to check if the buffer has ended. // to check if the buffer has ended.

if (CurPtr < BufferEnd) { if (CurOffset < BufferSize) {

while (isAsciiIdentifierContinue(*CurPtr)) while (isAsciiIdentifierContinue(BufferStart[CurOffset]))

++CurPtr; ++CurOffset;

} }

BufferPtr = CurPtr; BufferOffset = CurOffset;

return true; return true;

} }

// Finally, now that we know we have an identifier, pass this off to the // Finally, now that we know we have an identifier, pass this off to the

// preprocessor, which may macro expand it or something. // preprocessor, which may macro expand it or something.

if (II->isHandleIdentifierCase()) if (II->isHandleIdentifierCase())

return PP->HandleIdentifier(Result); return PP->HandleIdentifier(Result);

return true; return true;

} }
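The fast path of `LexIdentifierContinue` is a loop that indexes the buffer by offset and stops at the first non-identifier byte, relying on the NUL terminator to end the scan. A standalone sketch of just that loop, with hypothetical names (`scanIdentifier`, `isAsciiIdentContinue`) and a `std::string` in place of the lexer's buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// ASCII-only check for [_A-Za-z0-9].
static bool isAsciiIdentContinue(unsigned char C) {
  return C == '_' || (C >= 'a' && C <= 'z') || (C >= 'A' && C <= 'Z') ||
         (C >= '0' && C <= '9');
}

// Returns the offset one past the identifier characters starting at Start.
// The scan stops at the NUL that c_str() guarantees at index size().
static std::size_t scanIdentifier(const std::string &Buffer,
                                  std::size_t Start) {
  const char *Base = Buffer.c_str();
  std::size_t Cur = Start;
  while (isAsciiIdentContinue(static_cast<unsigned char>(Base[Cur])))
    ++Cur;
  return Cur;
}
```

The token's spelling is then the half-open range from `Start` to the returned offset, which corresponds to forming the token from `BufferOffset` to `CurOffset` in the diff.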

/// isHexaLiteral - Return true if Start points to a hex constant. /// isHexaLiteral - Return true if Start points to a hex constant.

/// in microsoft mode (where this is supposed to be several different tokens). /// in microsoft mode (where this is supposed to be several different tokens).

bool Lexer::isHexaLiteral(const char *Start, const LangOptions &LangOpts) { bool Lexer::isHexaLiteral(unsigned Start, const LangOptions &LangOpts) {

unsigned Size; unsigned Size;

char C1 = Lexer::getCharAndSizeNoWarn(Start, Size, LangOpts); char C1 = Lexer::getCharAndSizeNoWarn(&BufferStart[Start], Size, LangOpts);

if (C1 != '0') if (C1 != '0')

return false; return false;

char C2 = Lexer::getCharAndSizeNoWarn(Start + Size, Size, LangOpts); char C2 =

Lexer::getCharAndSizeNoWarn(&BufferStart[Start + Size], Size, LangOpts);

return (C2 == 'x' || C2 == 'X'); return (C2 == 'x' || C2 == 'X');

} }
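`isHexaLiteral` now takes the token's start offset and looks at the first two logical characters. A simplified sketch of the same check follows. It assumes no trigraphs or escaped newlines inside the prefix, which `getCharAndSizeNoWarn` would otherwise handle, and `isHexPrefixAt` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// True if the text at Start begins with "0x" or "0X".
// Simplified: reads raw bytes, so spliced or trigraph-encoded prefixes
// are not recognised.
static bool isHexPrefixAt(std::string_view Buffer, std::size_t Start) {
  if (Start + 1 >= Buffer.size())
    return false; // fewer than two characters available
  return Buffer[Start] == '0' &&
         (Buffer[Start + 1] == 'x' || Buffer[Start + 1] == 'X');
}
```

This is the test that, in Microsoft mode, keeps `0x1234567e+1` from being lexed as a single numeric constant.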

/// LexNumericConstant - Lex the remainder of a integer or floating point /// LexNumericConstant - Lex the remainder of a integer or floating point

/// constant. From[-1] is the first character lexed. Return the end of the /// constant. From[-1] is the first character lexed. Return the end of the

/// constant. /// constant.

bool Lexer::LexNumericConstant(Token &Result, const char *CurPtr) { bool Lexer::LexNumericConstant(Token &Result, unsigned CurOffset) {

unsigned Size; unsigned Size;

char C = getCharAndSize(CurPtr, Size); char C = getCharAndSize(CurOffset, Size);

char PrevCh = 0; char PrevCh = 0;

while (isPreprocessingNumberBody(C)) { while (isPreprocessingNumberBody(C)) {

CurPtr = ConsumeChar(CurPtr, Size, Result); CurOffset = ConsumeChar(CurOffset, Size, Result);

PrevCh = C; PrevCh = C;

C = getCharAndSize(CurPtr, Size); C = getCharAndSize(CurOffset, Size);

} }

// If we fell out, check for a sign, due to 1e+12. If we have one, continue. // If we fell out, check for a sign, due to 1e+12. If we have one, continue.

if ((C == '-' || C == '+') && (PrevCh == 'E' || PrevCh == 'e')) { if ((C == '-' || C == '+') && (PrevCh == 'E' || PrevCh == 'e')) {

// If we are in Microsoft mode, don't continue if the constant is hex. // If we are in Microsoft mode, don't continue if the constant is hex.

// For example, MSVC will accept the following as 3 tokens: 0x1234567e+1 // For example, MSVC will accept the following as 3 tokens: 0x1234567e+1

if (!LangOpts.MicrosoftExt || !isHexaLiteral(BufferPtr, LangOpts)) if (!LangOpts.MicrosoftExt || !isHexaLiteral(BufferOffset, LangOpts))

return LexNumericConstant(Result, ConsumeChar(CurPtr, Size, Result)); return LexNumericConstant(Result, ConsumeChar(CurOffset, Size, Result));

} }

// If we have a hex FP constant, continue. // If we have a hex FP constant, continue.

if ((C == '-' || C == '+') && (PrevCh == 'P' || PrevCh == 'p')) { if ((C == '-' || C == '+') && (PrevCh == 'P' || PrevCh == 'p')) {

// Outside C99 and C++17, we accept hexadecimal floating point numbers as a // Outside C99 and C++17, we accept hexadecimal floating point numbers as a

// not-quite-conforming extension. Only do so if this looks like it's // not-quite-conforming extension. Only do so if this looks like it's

// actually meant to be a hexfloat, and not if it has a ud-suffix. // actually meant to be a hexfloat, and not if it has a ud-suffix.

bool IsHexFloat = true; bool IsHexFloat = true;

if (!LangOpts.C99) { if (!LangOpts.C99) {

if (!isHexaLiteral(BufferPtr, LangOpts)) if (!isHexaLiteral(BufferOffset, LangOpts))

IsHexFloat = false; IsHexFloat = false;

else if (!LangOpts.CPlusPlus17 && else if (!LangOpts.CPlusPlus17 &&

std::find(BufferPtr, CurPtr, '_') != CurPtr) std::find(BufferStart + BufferOffset, BufferStart + CurOffset,

'_') != BufferStart + CurOffset)

IsHexFloat = false; IsHexFloat = false;

} }

if (IsHexFloat) if (IsHexFloat)

return LexNumericConstant(Result, ConsumeChar(CurPtr, Size, Result)); return LexNumericConstant(Result, ConsumeChar(CurOffset, Size, Result));

} }

// If we have a digit separator, continue. // If we have a digit separator, continue.

if (C == '\'' && (LangOpts.CPlusPlus14 || LangOpts.C2x)) { if (C == '\'' && (LangOpts.CPlusPlus14 || LangOpts.C2x)) {

unsigned NextSize; unsigned NextSize;

char Next = getCharAndSizeNoWarn(CurPtr + Size, NextSize, LangOpts); char Next = getCharAndSizeNoWarn(&BufferStart[CurOffset + Size], NextSize,

LangOpts);

if (isAsciiIdentifierContinue(Next)) { if (isAsciiIdentifierContinue(Next)) {

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(CurPtr, LangOpts.CPlusPlus Diag(CurOffset, LangOpts.CPlusPlus

? diag::warn_cxx11_compat_digit_separator ? diag::warn_cxx11_compat_digit_separator

: diag::warn_c2x_compat_digit_separator); : diag::warn_c2x_compat_digit_separator);

CurPtr = ConsumeChar(CurPtr, Size, Result); CurOffset = ConsumeChar(CurOffset, Size, Result);

CurPtr = ConsumeChar(CurPtr, NextSize, Result); CurOffset = ConsumeChar(CurOffset, NextSize, Result);

return LexNumericConstant(Result, CurPtr); return LexNumericConstant(Result, CurOffset);

} }

} }

// If we have a UCN or UTF-8 character (perhaps in a ud-suffix), continue. // If we have a UCN or UTF-8 character (perhaps in a ud-suffix), continue.

if (C == '\\' && tryConsumeIdentifierUCN(CurPtr, Size, Result)) if (C == '\\' && tryConsumeIdentifierUCN(CurOffset, Size, Result))

return LexNumericConstant(Result, CurPtr); return LexNumericConstant(Result, CurOffset);

if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurPtr)) if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurOffset))

return LexNumericConstant(Result, CurPtr); return LexNumericConstant(Result, CurOffset);

// Update the location of token as well as BufferPtr. // Update the location of token as well as BufferPtr.

const char *TokStart = BufferPtr; const char *TokStart = BufferStart + BufferOffset;

FormTokenWithChars(Result, CurPtr, tok::numeric_constant); FormTokenWithChars(Result, CurOffset, tok::numeric_constant);

Result.setLiteralData(TokStart); Result.setLiteralData(TokStart);

return true; return true;

} }

/// LexUDSuffix - Lex the ud-suffix production for user-defined literal suffixes /// LexUDSuffix - Lex the ud-suffix production for user-defined literal suffixes

/// in C++11, or warn on a ud-suffix in C++98. /// in C++11, or warn on a ud-suffix in C++98.

const char *Lexer::LexUDSuffix(Token &Result, const char *CurPtr, unsigned Lexer::LexUDSuffix(Token &Result, unsigned CurOffset,

bool IsStringLiteral) { bool IsStringLiteral) {

assert(LangOpts.CPlusPlus); assert(LangOpts.CPlusPlus);

// Maximally munch an identifier. // Maximally munch an identifier.

unsigned Size; unsigned Size;

char C = getCharAndSize(CurPtr, Size); char C = getCharAndSize(CurOffset, Size);

bool Consumed = false; bool Consumed = false;

if (!isAsciiIdentifierStart(C)) { if (!isAsciiIdentifierStart(C)) {

if (C == '\\' && tryConsumeIdentifierUCN(CurPtr, Size, Result)) if (C == '\\' && tryConsumeIdentifierUCN(CurOffset, Size, Result))

Consumed = true; Consumed = true;

else if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurPtr)) else if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurOffset))

Consumed = true; Consumed = true;

else else

return CurPtr; return CurOffset;

} }

if (!LangOpts.CPlusPlus11) { if (!LangOpts.CPlusPlus11) {

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(CurPtr, Diag(CurOffset,

C == '_' ? diag::warn_cxx11_compat_user_defined_literal C == '_' ? diag::warn_cxx11_compat_user_defined_literal

: diag::warn_cxx11_compat_reserved_user_defined_literal) : diag::warn_cxx11_compat_reserved_user_defined_literal)

<< FixItHint::CreateInsertion(getSourceLocation(CurPtr), " "); << FixItHint::CreateInsertion(getSourceLocation(CurOffset), " ");

return CurPtr; return CurOffset;

} }

// C++11 [lex.ext]p10, [usrlit.suffix]p1: A program containing a ud-suffix // C++11 [lex.ext]p10, [usrlit.suffix]p1: A program containing a ud-suffix

// that does not start with an underscore is ill-formed. As a conforming // that does not start with an underscore is ill-formed. As a conforming

// extension, we treat all such suffixes as if they had whitespace before // extension, we treat all such suffixes as if they had whitespace before

// them. We assume a suffix beginning with a UCN or UTF-8 character is more // them. We assume a suffix beginning with a UCN or UTF-8 character is more

// likely to be a ud-suffix than a macro, however, and accept that. // likely to be a ud-suffix than a macro, however, and accept that.

if (!Consumed) { if (!Consumed) {

bool IsUDSuffix = false; bool IsUDSuffix = false;

if (C == '_') if (C == '_')

IsUDSuffix = true; IsUDSuffix = true;

else if (IsStringLiteral && LangOpts.CPlusPlus14) { else if (IsStringLiteral && LangOpts.CPlusPlus14) {

// In C++1y, we need to look ahead a few characters to see if this is a // In C++1y, we need to look ahead a few characters to see if this is a

// valid suffix for a string literal or a numeric literal (this could be // valid suffix for a string literal or a numeric literal (this could be

// the 'operator""if' defining a numeric literal operator). // the 'operator""if' defining a numeric literal operator).

const unsigned MaxStandardSuffixLength = 3; const unsigned MaxStandardSuffixLength = 3;

char Buffer[MaxStandardSuffixLength] = { C }; char Buffer[MaxStandardSuffixLength] = { C };

unsigned Consumed = Size; unsigned Consumed = Size;

unsigned Chars = 1; unsigned Chars = 1;

while (true) { while (true) {

unsigned NextSize; unsigned NextSize;

char Next = getCharAndSizeNoWarn(CurPtr + Consumed, NextSize, LangOpts); char Next = getCharAndSizeNoWarn(&BufferStart[CurOffset] + Consumed,

NextSize, LangOpts);

if (!isAsciiIdentifierContinue(Next)) { if (!isAsciiIdentifierContinue(Next)) {

// End of suffix. Check whether this is on the allowed list. // End of suffix. Check whether this is on the allowed list.

const StringRef CompleteSuffix(Buffer, Chars); const StringRef CompleteSuffix(Buffer, Chars);

IsUDSuffix = IsUDSuffix =

StringLiteralParser::isValidUDSuffix(LangOpts, CompleteSuffix); StringLiteralParser::isValidUDSuffix(LangOpts, CompleteSuffix);

break; break;

} }

if (Chars == MaxStandardSuffixLength) if (Chars == MaxStandardSuffixLength)

// Too long: can't be a standard suffix. // Too long: can't be a standard suffix.

break; break;

Buffer[Chars++] = Next; Buffer[Chars++] = Next;

Consumed += NextSize; Consumed += NextSize;

} }

} }

if (!IsUDSuffix) { if (!IsUDSuffix) {

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(CurPtr, LangOpts.MSVCCompat Diag(CurOffset, LangOpts.MSVCCompat

? diag::ext_ms_reserved_user_defined_literal ? diag::ext_ms_reserved_user_defined_literal

: diag::ext_reserved_user_defined_literal) : diag::ext_reserved_user_defined_literal)

<< FixItHint::CreateInsertion(getSourceLocation(CurPtr), " "); << FixItHint::CreateInsertion(getSourceLocation(CurOffset), " ");

return CurPtr; return CurOffset;

} }

CurPtr = ConsumeChar(CurPtr, Size, Result); CurOffset = ConsumeChar(CurOffset, Size, Result);

} }

Result.setFlag(Token::HasUDSuffix); Result.setFlag(Token::HasUDSuffix);

while (true) { while (true) {

C = getCharAndSize(CurPtr, Size); C = getCharAndSize(CurOffset, Size);

if (isAsciiIdentifierContinue(C)) { if (isAsciiIdentifierContinue(C)) {

CurPtr = ConsumeChar(CurPtr, Size, Result); CurOffset = ConsumeChar(CurOffset, Size, Result);

} else if (C == '\\' && tryConsumeIdentifierUCN(CurPtr, Size, Result)) { } else if (C == '\\' && tryConsumeIdentifierUCN(CurOffset, Size, Result)) {

} else if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurPtr)) { } else if (!isASCII(C) && tryConsumeIdentifierUTF8Char(CurOffset)) {

} else } else

break; break;

} }

return CurPtr; return CurOffset;

} }

/// LexStringLiteral - Lex the remainder of a string literal, after having lexed /// LexStringLiteral - Lex the remainder of a string literal, after having lexed

/// either " or L" or u8" or u" or U". /// either " or L" or u8" or u" or U".

bool Lexer::LexStringLiteral(Token &Result, const char *CurPtr, bool Lexer::LexStringLiteral(Token &Result, unsigned CurOffset,

tok::TokenKind Kind) { tok::TokenKind Kind) {

const char *AfterQuote = CurPtr; unsigned AfterQuote = CurOffset;

// Does this string contain the \0 character? // Does this string contain the \0 character?

const char *NulCharacter = nullptr; std::optional<unsigned> NulCharacter = std::nullopt;

if (!isLexingRawMode() && if (!isLexingRawMode() &&

(Kind == tok::utf8_string_literal || (Kind == tok::utf8_string_literal ||

Kind == tok::utf16_string_literal || Kind == tok::utf16_string_literal ||

Kind == tok::utf32_string_literal)) Kind == tok::utf32_string_literal))

Diag(BufferPtr, LangOpts.CPlusPlus ? diag::warn_cxx98_compat_unicode_literal Diag(BufferOffset, LangOpts.CPlusPlus

? diag::warn_cxx98_compat_unicode_literal

: diag::warn_c99_compat_unicode_literal); : diag::warn_c99_compat_unicode_literal);

char C = getAndAdvanceChar(CurPtr, Result); char C = getAndAdvanceChar(CurOffset, Result);

while (C != '"') { while (C != '"') {

// Skip escaped characters. Escaped newlines will already be processed by // Skip escaped characters. Escaped newlines will already be processed by

// getAndAdvanceChar. // getAndAdvanceChar.

if (C == '\\') if (C == '\\')

C = getAndAdvanceChar(CurPtr, Result); C = getAndAdvanceChar(CurOffset, Result);

if (C == '\n' || C == '\r' || // Newline. if (C == '\n' || C == '\r' || // Newline.

(C == 0 && CurPtr-1 == BufferEnd)) { // End of file. (C == 0 && CurOffset - 1 == BufferSize)) { // End of file.

if (!isLexingRawMode() && !LangOpts.AsmPreprocessor) if (!isLexingRawMode() && !LangOpts.AsmPreprocessor)

Diag(BufferPtr, diag::ext_unterminated_char_or_string) << 1; Diag(BufferOffset, diag::ext_unterminated_char_or_string) << 1;

FormTokenWithChars(Result, CurPtr-1, tok::unknown); FormTokenWithChars(Result, CurOffset - 1, tok::unknown);

return true; return true;

} }

if (C == 0) { if (C == 0) {

if (isCodeCompletionPoint(CurPtr-1)) { if (isCodeCompletionPoint(CurOffset - 1)) {

if (ParsingFilename) if (ParsingFilename)

codeCompleteIncludedFile(AfterQuote, CurPtr - 1, /*IsAngled=*/false); codeCompleteIncludedFile(AfterQuote, CurOffset - 1,

/*IsAngled=*/false);

else else

PP->CodeCompleteNaturalLanguage(); PP->CodeCompleteNaturalLanguage();

FormTokenWithChars(Result, CurPtr - 1, tok::unknown); FormTokenWithChars(Result, CurOffset - 1, tok::unknown);

cutOffLexing(); cutOffLexing();

return true; return true;

} }

NulCharacter = CurPtr-1; NulCharacter = CurOffset - 1;

} }

C = getAndAdvanceChar(CurPtr, Result); C = getAndAdvanceChar(CurOffset, Result);

} }

// If we are in C++11, lex the optional ud-suffix. // If we are in C++11, lex the optional ud-suffix.

if (LangOpts.CPlusPlus) if (LangOpts.CPlusPlus)

CurPtr = LexUDSuffix(Result, CurPtr, true); CurOffset = LexUDSuffix(Result, CurOffset, true);

// If a nul character existed in the string, warn about it. // If a nul character existed in the string, warn about it.

if (NulCharacter && !isLexingRawMode()) if (NulCharacter && !isLexingRawMode())

Diag(NulCharacter, diag::null_in_char_or_string) << 1; Diag(*NulCharacter, diag::null_in_char_or_string) << 1;

// Update the location of the token as well as the BufferPtr instance var. // Update the location of the token as well as the BufferPtr instance var.

const char *TokStart = BufferPtr; const char *TokStart = BufferStart + BufferOffset;

FormTokenWithChars(Result, CurPtr, Kind); FormTokenWithChars(Result, CurOffset, Kind);

Result.setLiteralData(TokStart); Result.setLiteralData(TokStart);

return true; return true;

} }

/// LexRawStringLiteral - Lex the remainder of a raw string literal, after /// LexRawStringLiteral - Lex the remainder of a raw string literal, after

/// having lexed R", LR", u8R", uR", or UR". /// having lexed R", LR", u8R", uR", or UR".

bool Lexer::LexRawStringLiteral(Token &Result, const char *CurPtr, bool Lexer::LexRawStringLiteral(Token &Result, unsigned CurOffset,

tok::TokenKind Kind) { tok::TokenKind Kind) {

// This function doesn't use getAndAdvanceChar because C++0x [lex.pptoken]p3: // This function doesn't use getAndAdvanceChar because C++0x [lex.pptoken]p3:

// Between the initial and final double quote characters of the raw string, // Between the initial and final double quote characters of the raw string,

// any transformations performed in phases 1 and 2 (trigraphs, // any transformations performed in phases 1 and 2 (trigraphs,

// universal-character-names, and line splicing) are reverted. // universal-character-names, and line splicing) are reverted.

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(BufferPtr, diag::warn_cxx98_compat_raw_string_literal); Diag(BufferOffset, diag::warn_cxx98_compat_raw_string_literal);

unsigned PrefixLen = 0; unsigned PrefixLen = 0;

while (PrefixLen != 16 && isRawStringDelimBody(CurPtr[PrefixLen])) while (PrefixLen != 16 &&

isRawStringDelimBody(BufferStart[CurOffset + PrefixLen]))

++PrefixLen; ++PrefixLen;

// If the last character was not a '(', then we didn't lex a valid delimiter. // If the last character was not a '(', then we didn't lex a valid delimiter.

if (CurPtr[PrefixLen] != '(') { if (BufferStart[CurOffset + PrefixLen] != '(') {

if (!isLexingRawMode()) { if (!isLexingRawMode()) {

const char *PrefixEnd = &CurPtr[PrefixLen]; unsigned PrefixEnd = CurOffset + PrefixLen;

if (PrefixLen == 16) { if (PrefixLen == 16) {

Diag(PrefixEnd, diag::err_raw_delim_too_long); Diag(PrefixEnd, diag::err_raw_delim_too_long);

} else { } else {

Diag(PrefixEnd, diag::err_invalid_char_raw_delim) Diag(PrefixEnd, diag::err_invalid_char_raw_delim)

<< StringRef(PrefixEnd, 1); << StringRef(BufferStart + PrefixEnd, 1);

} }

} }

// Search for the next '"' in hopes of salvaging the lexer. Unfortunately, // Search for the next '"' in hopes of salvaging the lexer. Unfortunately,

// it's possible the '"' was intended to be part of the raw string, but // it's possible the '"' was intended to be part of the raw string, but

// there's not much we can do about that. // there's not much we can do about that.

while (true) { while (true) {

char C = *CurPtr++; char C = BufferStart[CurOffset++];

if (C == '"') if (C == '"')

break; break;

if (C == 0 && CurPtr-1 == BufferEnd) { if (C == 0 && CurOffset - 1 == BufferSize) {

--CurPtr; --CurOffset;

break; break;

} }

} }

FormTokenWithChars(Result, CurPtr, tok::unknown); FormTokenWithChars(Result, CurOffset, tok::unknown);

return true; return true;

} }

// Save prefix and move CurPtr past it // Save prefix and move CurPtr past it

const char *Prefix = CurPtr; unsigned Prefix = CurOffset;

CurPtr += PrefixLen + 1; // skip over prefix and '(' CurOffset += PrefixLen + 1; // skip over prefix and '('

while (true) { while (true) {

char C = *CurPtr++; char C = BufferStart[CurOffset++];

if (C == ')') { if (C == ')') {

// Check for prefix match and closing quote. // Check for prefix match and closing quote.

if (strncmp(CurPtr, Prefix, PrefixLen) == 0 && CurPtr[PrefixLen] == '"') { if (strncmp(&BufferStart[CurOffset], &BufferStart[Prefix], PrefixLen) ==

CurPtr += PrefixLen + 1; // skip over prefix and '"' 0 &&

BufferStart[CurOffset + PrefixLen] == '"') {

CurOffset += PrefixLen + 1; // skip over prefix and '"'

break; break;

} }

} else if (C == 0 && CurPtr-1 == BufferEnd) { // End of file. } else if (C == 0 && CurOffset - 1 == BufferSize) { // End of file.

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(BufferPtr, diag::err_unterminated_raw_string) Diag(BufferOffset, diag::err_unterminated_raw_string)

<< StringRef(Prefix, PrefixLen); << StringRef(BufferStart + Prefix, PrefixLen);

FormTokenWithChars(Result, CurPtr-1, tok::unknown); FormTokenWithChars(Result, CurOffset - 1, tok::unknown);

return true; return true;

} }

} }

// If we are in C++11, lex the optional ud-suffix. // If we are in C++11, lex the optional ud-suffix.

if (LangOpts.CPlusPlus) if (LangOpts.CPlusPlus)

CurPtr = LexUDSuffix(Result, CurPtr, true); CurOffset = LexUDSuffix(Result, CurOffset, true);

// Update the location of token as well as BufferPtr. // Update the location of token as well as BufferPtr.

const char *TokStart = BufferPtr; const char *TokStart = &BufferStart[BufferOffset];

FormTokenWithChars(Result, CurPtr, Kind); FormTokenWithChars(Result, CurOffset, Kind);

Result.setLiteralData(TokStart); Result.setLiteralData(TokStart);

return true; return true;

} }

/// LexAngledStringLiteral - Lex the remainder of an angled string literal, /// LexAngledStringLiteral - Lex the remainder of an angled string literal,

/// after having lexed the '<' character. This is used for #include filenames. /// after having lexed the '<' character. This is used for #include filenames.

bool Lexer::LexAngledStringLiteral(Token &Result, const char *CurPtr) { bool Lexer::LexAngledStringLiteral(Token &Result, unsigned CurOffset) {

// Does this string contain the \0 character? // Does this string contain the \0 character?

const char *NulCharacter = nullptr; std::optional<unsigned> NulCharacter = std::nullopt;

const char *AfterLessPos = CurPtr; unsigned AfterLessPos = CurOffset;

char C = getAndAdvanceChar(CurPtr, Result); char C = getAndAdvanceChar(CurOffset, Result);

while (C != '>') { while (C != '>') {

// Skip escaped characters. Escaped newlines will already be processed by // Skip escaped characters. Escaped newlines will already be processed by

// getAndAdvanceChar. // getAndAdvanceChar.

if (C == '\\') if (C == '\\')

C = getAndAdvanceChar(CurPtr, Result); C = getAndAdvanceChar(CurOffset, Result);

if (isVerticalWhitespace(C) || // Newline. if (isVerticalWhitespace(C) || // Newline.

(C == 0 && (CurPtr - 1 == BufferEnd))) { // End of file. (C == 0 && (CurOffset - 1 == BufferSize))) { // End of file.

// If the filename is unterminated, then it must just be a lone < // If the filename is unterminated, then it must just be a lone <

// character. Return this as such. // character. Return this as such.

FormTokenWithChars(Result, AfterLessPos, tok::less); FormTokenWithChars(Result, AfterLessPos, tok::less);

return true; return true;

} }

if (C == 0) { if (C == 0) {

if (isCodeCompletionPoint(CurPtr - 1)) { if (isCodeCompletionPoint(CurOffset - 1)) {

codeCompleteIncludedFile(AfterLessPos, CurPtr - 1, /*IsAngled=*/true); codeCompleteIncludedFile(AfterLessPos, CurOffset - 1,

/*IsAngled=*/true);

cutOffLexing(); cutOffLexing();

FormTokenWithChars(Result, CurPtr - 1, tok::unknown); FormTokenWithChars(Result, CurOffset - 1, tok::unknown);

return true; return true;

} }

NulCharacter = CurPtr-1; NulCharacter = CurOffset - 1;

} }

C = getAndAdvanceChar(CurPtr, Result); C = getAndAdvanceChar(CurOffset, Result);

} }

// If a nul character existed in the string, warn about it. // If a nul character existed in the string, warn about it.

if (NulCharacter && !isLexingRawMode()) if (NulCharacter && !isLexingRawMode())

Diag(NulCharacter, diag::null_in_char_or_string) << 1; Diag(*NulCharacter, diag::null_in_char_or_string) << 1;

// Update the location of token as well as BufferPtr. // Update the location of token as well as BufferPtr.

const char *TokStart = BufferPtr; const char *TokStart = &BufferStart[BufferOffset];

FormTokenWithChars(Result, CurPtr, tok::header_name); FormTokenWithChars(Result, CurOffset, tok::header_name);

Result.setLiteralData(TokStart); Result.setLiteralData(TokStart);

return true; return true;

} }

void Lexer::codeCompleteIncludedFile(const char *PathStart, void Lexer::codeCompleteIncludedFile(unsigned PathStart,

const char *CompletionPoint, unsigned CompletionPoint, bool IsAngled) {

bool IsAngled) {

// Completion only applies to the filename, after the last slash. // Completion only applies to the filename, after the last slash.

StringRef PartialPath(PathStart, CompletionPoint - PathStart); StringRef PartialPath(BufferStart + PathStart, CompletionPoint - PathStart);

llvm::StringRef SlashChars = LangOpts.MSVCCompat ? "/\\" : "/"; llvm::StringRef SlashChars = LangOpts.MSVCCompat ? "/\\" : "/";

auto Slash = PartialPath.find_last_of(SlashChars); auto Slash = PartialPath.find_last_of(SlashChars);

StringRef Dir = StringRef Dir =

(Slash == StringRef::npos) ? "" : PartialPath.take_front(Slash); (Slash == StringRef::npos) ? "" : PartialPath.take_front(Slash);

const char *StartOfFilename = unsigned StartOfFilename =

(Slash == StringRef::npos) ? PathStart : PathStart + Slash + 1; (Slash == StringRef::npos) ? PathStart : PathStart + Slash + 1;

// Code completion filter range is the filename only, up to completion point. // Code completion filter range is the filename only, up to completion point.

PP->setCodeCompletionIdentifierInfo(&PP->getIdentifierTable().get( PP->setCodeCompletionIdentifierInfo(&PP->getIdentifierTable().get(StringRef(

StringRef(StartOfFilename, CompletionPoint - StartOfFilename))); BufferStart + StartOfFilename, CompletionPoint - StartOfFilename)));

// We should replace the characters up to the closing quote or closest slash, // We should replace the characters up to the closing quote or closest slash,

// if any. // if any.

while (CompletionPoint < BufferEnd) { while (CompletionPoint < BufferSize) {

char Next = *(CompletionPoint + 1); char Next = BufferStart[CompletionPoint + 1];

if (Next == 0 || Next == '\r' || Next == '\n') if (Next == 0 || Next == '\r' || Next == '\n')

break; break;

++CompletionPoint; ++CompletionPoint;

if (Next == (IsAngled ? '>' : '"')) if (Next == (IsAngled ? '>' : '"'))

break; break;

if (llvm::is_contained(SlashChars, Next)) if (llvm::is_contained(SlashChars, Next))

break; break;

} }

PP->setCodeCompletionTokenRange( PP->setCodeCompletionTokenRange(FileLoc.getLocWithOffset(StartOfFilename),

FileLoc.getLocWithOffset(StartOfFilename - BufferStart), FileLoc.getLocWithOffset(CompletionPoint));

FileLoc.getLocWithOffset(CompletionPoint - BufferStart));

PP->CodeCompleteIncludedFile(Dir, IsAngled); PP->CodeCompleteIncludedFile(Dir, IsAngled);

} }

/// LexCharConstant - Lex the remainder of a character constant, after having /// LexCharConstant - Lex the remainder of a character constant, after having

/// lexed either ' or L' or u8' or u' or U'. /// lexed either ' or L' or u8' or u' or U'.

bool Lexer::LexCharConstant(Token &Result, const char *CurPtr, bool Lexer::LexCharConstant(Token &Result, unsigned CurOffset,

tok::TokenKind Kind) { tok::TokenKind Kind) {

// Does this character contain the \0 character? // Does this character contain the \0 character?

const char *NulCharacter = nullptr; std::optional<unsigned> NulCharacter = std::nullopt;

if (!isLexingRawMode()) { if (!isLexingRawMode()) {

if (Kind == tok::utf16_char_constant || Kind == tok::utf32_char_constant) if (Kind == tok::utf16_char_constant || Kind == tok::utf32_char_constant)

Diag(BufferPtr, LangOpts.CPlusPlus Diag(BufferOffset, LangOpts.CPlusPlus

? diag::warn_cxx98_compat_unicode_literal ? diag::warn_cxx98_compat_unicode_literal

: diag::warn_c99_compat_unicode_literal); : diag::warn_c99_compat_unicode_literal);

else if (Kind == tok::utf8_char_constant) else if (Kind == tok::utf8_char_constant)

Diag(BufferPtr, diag::warn_cxx14_compat_u8_character_literal); Diag(BufferOffset, diag::warn_cxx14_compat_u8_character_literal);

} }

char C = getAndAdvanceChar(CurPtr, Result); char C = getAndAdvanceChar(CurOffset, Result);

if (C == '\'') { if (C == '\'') {

if (!isLexingRawMode() && !LangOpts.AsmPreprocessor) if (!isLexingRawMode() && !LangOpts.AsmPreprocessor)

Diag(BufferPtr, diag::ext_empty_character); Diag(BufferOffset, diag::ext_empty_character);

FormTokenWithChars(Result, CurPtr, tok::unknown); FormTokenWithChars(Result, CurOffset, tok::unknown);

return true; return true;

} }

while (C != '\'') { while (C != '\'') {

// Skip escaped characters. // Skip escaped characters.

if (C == '\\') if (C == '\\')

C = getAndAdvanceChar(CurPtr, Result); C = getAndAdvanceChar(CurOffset, Result);

if (C == '\n' || C == '\r' || // Newline. if (C == '\n' || C == '\r' || // Newline.

(C == 0 && CurPtr-1 == BufferEnd)) { // End of file. (C == 0 && CurOffset - 1 == BufferSize)) { // End of file.

if (!isLexingRawMode() && !LangOpts.AsmPreprocessor) if (!isLexingRawMode() && !LangOpts.AsmPreprocessor)

Diag(BufferPtr, diag::ext_unterminated_char_or_string) << 0; Diag(BufferOffset, diag::ext_unterminated_char_or_string) << 0;

FormTokenWithChars(Result, CurPtr-1, tok::unknown); FormTokenWithChars(Result, CurOffset - 1, tok::unknown);

return true; return true;

} }

if (C == 0) { if (C == 0) {

if (isCodeCompletionPoint(CurPtr-1)) { if (isCodeCompletionPoint(CurOffset - 1)) {

PP->CodeCompleteNaturalLanguage(); PP->CodeCompleteNaturalLanguage();

FormTokenWithChars(Result, CurPtr-1, tok::unknown); FormTokenWithChars(Result, CurOffset - 1, tok::unknown);

cutOffLexing(); cutOffLexing();

return true; return true;

} }

NulCharacter = CurPtr-1; NulCharacter = CurOffset - 1;

} }

C = getAndAdvanceChar(CurPtr, Result); C = getAndAdvanceChar(CurOffset, Result);

} }

// If we are in C++11, lex the optional ud-suffix. // If we are in C++11, lex the optional ud-suffix.

if (LangOpts.CPlusPlus) if (LangOpts.CPlusPlus)

CurPtr = LexUDSuffix(Result, CurPtr, false); CurOffset = LexUDSuffix(Result, CurOffset, false);

// If a nul character existed in the character, warn about it. // If a nul character existed in the character, warn about it.

if (NulCharacter && !isLexingRawMode()) if (NulCharacter && !isLexingRawMode())

Diag(NulCharacter, diag::null_in_char_or_string) << 0; Diag(*NulCharacter, diag::null_in_char_or_string) << 0;

// Update the location of token as well as BufferPtr. // Update the location of token as well as BufferPtr.

const char *TokStart = BufferPtr; const char *TokStart = BufferStart + BufferOffset;

FormTokenWithChars(Result, CurPtr, Kind); FormTokenWithChars(Result, CurOffset, Kind);

Result.setLiteralData(TokStart); Result.setLiteralData(TokStart);

return true; return true;

} }

/// SkipWhitespace - Efficiently skip over a series of whitespace characters. /// SkipWhitespace - Efficiently skip over a series of whitespace characters.

/// Update BufferPtr to point to the next non-whitespace character and return. /// Update BufferPtr to point to the next non-whitespace character and return.

/// ///

/// This method forms a token and returns true if KeepWhitespaceMode is enabled. /// This method forms a token and returns true if KeepWhitespaceMode is enabled.

bool Lexer::SkipWhitespace(Token &Result, const char *CurPtr, bool Lexer::SkipWhitespace(Token &Result, unsigned CurOffset,

bool &TokAtPhysicalStartOfLine) { bool &TokAtPhysicalStartOfLine) {

// Whitespace - Skip it, then return the token after the whitespace. // Whitespace - Skip it, then return the token after the whitespace.

bool SawNewline = isVerticalWhitespace(CurPtr[-1]); bool SawNewline = isVerticalWhitespace(BufferStart[CurOffset - 1]);

unsigned char Char = *CurPtr; unsigned char Char = BufferStart[CurOffset];

const char *lastNewLine = nullptr; std::optional<unsigned> lastNewLine = std::nullopt;

auto setLastNewLine = [&](const char *Ptr) { auto setLastNewLine = [&](unsigned Offset) {

lastNewLine = Ptr; lastNewLine = Offset;

if (!NewLinePtr) if (!NewLineOffset)

NewLinePtr = Ptr; NewLineOffset = Offset;

}; };

if (SawNewline) if (SawNewline)

setLastNewLine(CurPtr - 1); setLastNewLine(CurOffset - 1);

// Skip consecutive spaces efficiently. // Skip consecutive spaces efficiently.

while (true) { while (true) {

// Skip horizontal whitespace very aggressively. // Skip horizontal whitespace very aggressively.

while (isHorizontalWhitespace(Char)) while (isHorizontalWhitespace(Char))

Char = *++CurPtr; Char = BufferStart[++CurOffset];

// Otherwise if we have something other than whitespace, we're done. // Otherwise if we have something other than whitespace, we're done.

if (!isVerticalWhitespace(Char)) if (!isVerticalWhitespace(Char))

break; break;

if (ParsingPreprocessorDirective) { if (ParsingPreprocessorDirective) {

// End of preprocessor directive line, let LexTokenInternal handle this. // End of preprocessor directive line, let LexTokenInternal handle this.

BufferPtr = CurPtr; BufferOffset = CurOffset;

return false; return false;

} }

// OK, but handle newline. // OK, but handle newline.

if (*CurPtr == '\n') if (BufferStart[CurOffset] == '\n')

setLastNewLine(CurPtr); setLastNewLine(CurOffset);

SawNewline = true; SawNewline = true;

Char = *++CurPtr; Char = BufferStart[++CurOffset];

} }

// If the client wants us to return whitespace, return it now. // If the client wants us to return whitespace, return it now.

if (isKeepWhitespaceMode()) { if (isKeepWhitespaceMode()) {

FormTokenWithChars(Result, CurPtr, tok::unknown); FormTokenWithChars(Result, CurOffset, tok::unknown);

if (SawNewline) { if (SawNewline) {

IsAtStartOfLine = true; IsAtStartOfLine = true;

IsAtPhysicalStartOfLine = true; IsAtPhysicalStartOfLine = true;

} }

// FIXME: The next token will not have LeadingSpace set. // FIXME: The next token will not have LeadingSpace set.

return true; return true;

} }

// If this isn't immediately after a newline, there is leading space. // If this isn't immediately after a newline, there is leading space.

char PrevChar = CurPtr[-1]; char PrevChar = BufferStart[CurOffset - 1];

bool HasLeadingSpace = !isVerticalWhitespace(PrevChar); bool HasLeadingSpace = !isVerticalWhitespace(PrevChar);

Result.setFlagValue(Token::LeadingSpace, HasLeadingSpace); Result.setFlagValue(Token::LeadingSpace, HasLeadingSpace);

if (SawNewline) { if (SawNewline) {

Result.setFlag(Token::StartOfLine); Result.setFlag(Token::StartOfLine);

TokAtPhysicalStartOfLine = true; TokAtPhysicalStartOfLine = true;

if (NewLinePtr && lastNewLine && NewLinePtr != lastNewLine && PP) { if (NewLineOffset && lastNewLine &&

*NewLineOffset != *lastNewLine && PP) {

if (auto *Handler = PP->getEmptylineHandler()) if (auto *Handler = PP->getEmptylineHandler())

Handler->HandleEmptyline(SourceRange(getSourceLocation(NewLinePtr + 1), Handler->HandleEmptyline(

getSourceLocation(lastNewLine))); SourceRange(getSourceLocation(*NewLineOffset + 1),

getSourceLocation(*lastNewLine)));

} }

} }

BufferPtr = CurPtr; BufferOffset = CurOffset;

return false; return false;

} }

/// We have just read the // characters from input. Skip until we find the /// We have just read the // characters from input. Skip until we find the

/// newline character that terminates the comment. Then update BufferPtr and /// newline character that terminates the comment. Then update BufferPtr and

/// return. /// return.

/// ///

/// If we're in KeepCommentMode or any CommentHandler has inserted /// If we're in KeepCommentMode or any CommentHandler has inserted

/// some tokens, this will store the first token and return true. /// some tokens, this will store the first token and return true.

bool Lexer::SkipLineComment(Token &Result, const char *CurPtr, bool Lexer::SkipLineComment(Token &Result, unsigned CurOffset,

bool &TokAtPhysicalStartOfLine) { bool &TokAtPhysicalStartOfLine) {

// If Line comments aren't explicitly enabled for this language, emit an // If Line comments aren't explicitly enabled for this language, emit an

// extension warning. // extension warning.

if (!LineComment) { if (!LineComment) {

if (!isLexingRawMode()) // There's no PP in raw mode, so can't emit diags. if (!isLexingRawMode()) // There's no PP in raw mode, so can't emit diags.

Diag(BufferPtr, diag::ext_line_comment); Diag(BufferOffset, diag::ext_line_comment);

// Mark them enabled so we only emit one warning for this translation // Mark them enabled so we only emit one warning for this translation

// unit. // unit.

LineComment = true; LineComment = true;

} }

// Scan over the body of the comment. The common case, when scanning, is that // Scan over the body of the comment. The common case, when scanning, is that

// the comment contains normal ascii characters with nothing interesting in // the comment contains normal ascii characters with nothing interesting in

// them. As such, optimize for this case with the inner loop. // them. As such, optimize for this case with the inner loop.

// //

// This loop terminates with CurPtr pointing at the newline (or end of buffer) // This loop terminates with CurPtr pointing at the newline (or end of buffer)

// character that ends the line comment. // character that ends the line comment.

// C++23 [lex.phases] p1 // C++23 [lex.phases] p1

// Diagnose invalid UTF-8 if the corresponding warning is enabled, emitting a // Diagnose invalid UTF-8 if the corresponding warning is enabled, emitting a

// diagnostic only once per entire ill-formed subsequence to avoid // diagnostic only once per entire ill-formed subsequence to avoid

// emitting too many diagnostics (see http://unicode.org/review/pr-121.html). // emitting too many diagnostics (see http://unicode.org/review/pr-121.html).

bool UnicodeDecodingAlreadyDiagnosed = false; bool UnicodeDecodingAlreadyDiagnosed = false;

char C; char C;

while (true) { while (true) {

C = *CurPtr; C = BufferStart[CurOffset];

// Skip over characters in the fast loop. // Skip over characters in the fast loop.

while (isASCII(C) && C != 0 && // Potentially EOF. while (isASCII(C) && C != 0 && // Potentially EOF.

C != '\n' && C != '\r') { // Newline or DOS-style newline. C != '\n' && C != '\r') { // Newline or DOS-style newline.

C = *++CurPtr; C = BufferStart[++CurOffset];

UnicodeDecodingAlreadyDiagnosed = false; UnicodeDecodingAlreadyDiagnosed = false;

} }

if (!isASCII(C)) { if (!isASCII(C)) {

unsigned Length = llvm::getUTF8SequenceSize( unsigned Length = llvm::getUTF8SequenceSize(

(const llvm::UTF8 *)CurPtr, (const llvm::UTF8 *)BufferEnd); (const llvm::UTF8 *)&BufferStart[CurOffset],

(const llvm::UTF8 *)&BufferStart[BufferSize]);

if (Length == 0) { if (Length == 0) {

if (!UnicodeDecodingAlreadyDiagnosed && !isLexingRawMode()) if (!UnicodeDecodingAlreadyDiagnosed && !isLexingRawMode())

Diag(CurPtr, diag::warn_invalid_utf8_in_comment); Diag(CurOffset, diag::warn_invalid_utf8_in_comment);

UnicodeDecodingAlreadyDiagnosed = true; UnicodeDecodingAlreadyDiagnosed = true;

++CurPtr; ++CurOffset;

} else { } else {

UnicodeDecodingAlreadyDiagnosed = false; UnicodeDecodingAlreadyDiagnosed = false;

CurPtr += Length; CurOffset += Length;

} }

continue; continue;

} }

const char *NextLine = CurPtr; unsigned NextLine = CurOffset;

if (C != 0) { if (C != 0) {

// We found a newline, see if it's escaped. // We found a newline, see if it's escaped.

const char *EscapePtr = CurPtr-1; unsigned EscapeOffset = CurOffset - 1;

bool HasSpace = false; bool HasSpace = false;

while (isHorizontalWhitespace(*EscapePtr)) { // Skip whitespace. while (isHorizontalWhitespace(

--EscapePtr; BufferStart[EscapeOffset])) { // Skip whitespace.

--EscapeOffset;

HasSpace = true; HasSpace = true;

} }

if (*EscapePtr == '\\') if (BufferStart[EscapeOffset] == '\\')

// Escaped newline. // Escaped newline.

CurPtr = EscapePtr; CurOffset = EscapeOffset;

else if (EscapePtr[0] == '/' && EscapePtr[-1] == '?' && else if (BufferStart[EscapeOffset] == '/' &&

EscapePtr[-2] == '?' && LangOpts.Trigraphs) BufferStart[EscapeOffset - 1] == '?' &&

BufferStart[EscapeOffset - 2] == '?' && LangOpts.Trigraphs)

// Trigraph-escaped newline. // Trigraph-escaped newline.

CurPtr = EscapePtr-2; CurOffset = EscapeOffset - 2;

else else

break; // This is a newline, we're done. break; // This is a newline, we're done.

// If there was space between the backslash and newline, warn about it. // If there was space between the backslash and newline, warn about it.

if (HasSpace && !isLexingRawMode()) if (HasSpace && !isLexingRawMode())

Diag(EscapePtr, diag::backslash_newline_space); Diag(EscapeOffset, diag::backslash_newline_space);

} }

// Otherwise, this is a hard case. Fall back on getAndAdvanceChar to // Otherwise, this is a hard case. Fall back on getAndAdvanceChar to

// properly decode the character. Read it in raw mode to avoid emitting // properly decode the character. Read it in raw mode to avoid emitting

// diagnostics about things like trigraphs. If we see an escaped newline, // diagnostics about things like trigraphs. If we see an escaped newline,

// we'll handle it below. // we'll handle it below.

const char *OldPtr = CurPtr; unsigned OldOffset = CurOffset;

bool OldRawMode = isLexingRawMode(); bool OldRawMode = isLexingRawMode();

LexingRawMode = true; LexingRawMode = true;

C = getAndAdvanceChar(CurPtr, Result); C = getAndAdvanceChar(CurOffset, Result);

LexingRawMode = OldRawMode; LexingRawMode = OldRawMode;

// If we read only one character, then no special handling is needed. // If we read only one character, then no special handling is needed.

// We're done and can skip forward to the newline. // We're done and can skip forward to the newline.

if (C != 0 && CurPtr == OldPtr+1) { if (C != 0 && CurOffset == OldOffset + 1) {

CurPtr = NextLine; CurOffset = NextLine;

break; break;

} }

// If we read multiple characters, and one of those characters was a \r or // If we read multiple characters, and one of those characters was a \r or

// \n, then we had an escaped newline within the comment. Emit diagnostic // \n, then we had an escaped newline within the comment. Emit diagnostic

// unless the next line is also a // comment. // unless the next line is also a // comment.

if (CurPtr != OldPtr + 1 && C != '/' && if (CurOffset != OldOffset + 1 && C != '/' &&

(CurPtr == BufferEnd + 1 || CurPtr[0] != '/')) { (CurOffset == BufferSize + 1 || BufferStart[CurOffset] != '/')) {

for (; OldPtr != CurPtr; ++OldPtr) for (; OldOffset != CurOffset; ++OldOffset)

if (OldPtr[0] == '\n' || OldPtr[0] == '\r') { if (BufferStart[OldOffset] == '\n' || BufferStart[OldOffset] == '\r') {

// Okay, we found a // comment that ends in a newline, if the next // Okay, we found a // comment that ends in a newline, if the next

// line is also a // comment, but has spaces, don't emit a diagnostic. // line is also a // comment, but has spaces, don't emit a diagnostic.

if (isWhitespace(C)) { if (isWhitespace(C)) {

const char *ForwardPtr = CurPtr; unsigned ForwardOffset = CurOffset;

while (isWhitespace(*ForwardPtr)) // Skip whitespace. while (isWhitespace(BufferStart[ForwardOffset])) // Skip whitespace.

++ForwardPtr; ++ForwardOffset;

if (ForwardPtr[0] == '/' && ForwardPtr[1] == '/') if (BufferStart[ForwardOffset] == '/' &&

BufferStart[ForwardOffset + 1] == '/')

break; break;

} }

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(OldPtr-1, diag::ext_multi_line_line_comment); Diag(OldOffset - 1, diag::ext_multi_line_line_comment);

break; break;

} }

if (C == '\r' || C == '\n' || CurPtr == BufferEnd + 1) { if (C == '\r' || C == '\n' || CurOffset == BufferSize + 1) {

--CurPtr; --CurOffset;

break; break;

} }

if (C == '\0' && isCodeCompletionPoint(CurPtr-1)) { if (C == '\0' && isCodeCompletionPoint(CurOffset - 1)) {

PP->CodeCompleteNaturalLanguage(); PP->CodeCompleteNaturalLanguage();

cutOffLexing(); cutOffLexing();

return false; return false;

} }

// Found but did not consume the newline. Notify comment handlers about the // Found but did not consume the newline. Notify comment handlers about the

// comment unless we're in a #if 0 block. // comment unless we're in a #if 0 block.

if (PP && !isLexingRawMode() && if (PP && !isLexingRawMode() &&

PP->HandleComment(Result, SourceRange(getSourceLocation(BufferPtr), PP->HandleComment(Result, SourceRange(getSourceLocation(BufferOffset),

getSourceLocation(CurPtr)))) { getSourceLocation(CurOffset)))) {

BufferPtr = CurPtr; BufferOffset = CurOffset;

return true; // A token has to be returned. return true; // A token has to be returned.

} }

// If we are returning comments as tokens, return this comment as a token. // If we are returning comments as tokens, return this comment as a token.

if (inKeepCommentMode()) if (inKeepCommentMode())

return SaveLineComment(Result, CurPtr); return SaveLineComment(Result, CurOffset);

// If we are inside a preprocessor directive and we see the end of line, // If we are inside a preprocessor directive and we see the end of line,

// return immediately, so that the lexer can return this as an EOD token. // return immediately, so that the lexer can return this as an EOD token.

if (ParsingPreprocessorDirective || CurPtr == BufferEnd) { if (ParsingPreprocessorDirective || CurOffset == BufferSize) {

BufferPtr = CurPtr; BufferOffset = CurOffset;

return false; return false;

} }

// Otherwise, eat the \n character. We don't care if this is a \n\r or // Otherwise, eat the \n character. We don't care if this is a \n\r or

// \r\n sequence. This is an efficiency hack (because we know the \n can't // \r\n sequence. This is an efficiency hack (because we know the \n can't

// contribute to another token), it isn't needed for correctness. Note that // contribute to another token), it isn't needed for correctness. Note that

// this is ok even in KeepWhitespaceMode, because we would have returned the // this is ok even in KeepWhitespaceMode, because we would have returned the

/// comment above in that mode. /// comment above in that mode.

NewLinePtr = CurPtr++; NewLineOffset = CurOffset++;

// The next returned token is at the start of the line. // The next returned token is at the start of the line.

Result.setFlag(Token::StartOfLine); Result.setFlag(Token::StartOfLine);

TokAtPhysicalStartOfLine = true; TokAtPhysicalStartOfLine = true;

// No leading whitespace seen so far. // No leading whitespace seen so far.

Result.clearFlag(Token::LeadingSpace); Result.clearFlag(Token::LeadingSpace);

BufferPtr = CurPtr; BufferOffset = CurOffset;

return false; return false;

} }
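The conversion in SkipLineComment above follows the pattern described in the summary: every `*CurPtr` becomes `BufferStart[CurOffset]`, so the cursor stays meaningful if the buffer is reallocated while input is appended. The following is a minimal standalone sketch of that idea, not part of the patch. `TinyLexer`, `append`, `skipToEndOfLine` and `demoGrowth` are hypothetical names, and `std::string` stands in for the lexer's buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a lexer; this is not Clang's Lexer class.
struct TinyLexer {
  std::string Buffer;     // owns the text; data() may move when it grows
  std::size_t Offset = 0; // cursor stored as an offset, not as a pointer

  // Append more input at the back, as an incremental front end might.
  void append(const std::string &More) { Buffer += More; }

  // Advance to the newline that ends the current line (or to the end of the
  // buffer), always reading through the current base address.
  void skipToEndOfLine() {
    const char *Start = Buffer.data();
    while (Offset < Buffer.size() && Start[Offset] != '\n' &&
           Start[Offset] != '\r')
      ++Offset;
  }
};

// Scans a line comment, grows the buffer, then reads through the old offset.
inline char demoGrowth() {
  TinyLexer L;
  L.Buffer = "// note\n";
  L.Offset = 2;                     // just past the leading "//"
  L.skipToEndOfLine();              // stops on the '\n' at index 7
  L.append(std::string(4096, 'x')); // growth may move the storage
  assert(L.Offset == 7);            // the offset itself is unchanged
  return L.Buffer[L.Offset];        // still the newline
}
```

A raw `const char *` saved before the `append` call could dangle after it, whereas the offset only needs the current base address at the moment of each read.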

/// If in save-comment mode, package up this Line comment in an appropriate /// If in save-comment mode, package up this Line comment in an appropriate

/// way and return it. /// way and return it.

bool Lexer::SaveLineComment(Token &Result, const char *CurPtr) { bool Lexer::SaveLineComment(Token &Result, unsigned CurOffset) {

// If we're not in a preprocessor directive, just return the // comment // If we're not in a preprocessor directive, just return the // comment

// directly. // directly.

FormTokenWithChars(Result, CurPtr, tok::comment); FormTokenWithChars(Result, CurOffset, tok::comment);

if (!ParsingPreprocessorDirective || LexingRawMode) if (!ParsingPreprocessorDirective || LexingRawMode)

return true; return true;

// If this Line-style comment is in a macro definition, transmogrify it into // If this Line-style comment is in a macro definition, transmogrify it into

// a C-style block comment. // a C-style block comment.

bool Invalid = false; bool Invalid = false;

std::string Spelling = PP->getSpelling(Result, &Invalid); std::string Spelling = PP->getSpelling(Result, &Invalid);

[62 lines elided]

if (*CurPtr != '\n' && *CurPtr != '\r')

return false; return false;

} }

if (TrigraphPos) { if (TrigraphPos) {

// If no trigraphs are enabled, warn that we ignored this trigraph and // If no trigraphs are enabled, warn that we ignored this trigraph and

// ignore this * character. // ignore this * character.

if (!Trigraphs) { if (!Trigraphs) {

if (!L->isLexingRawMode()) if (!L->isLexingRawMode())

L->Diag(TrigraphPos, diag::trigraph_ignored_block_comment); L->Diag(TrigraphPos - L->getBuffer().data(),

diag::trigraph_ignored_block_comment);

return false; return false;

} }

if (!L->isLexingRawMode()) if (!L->isLexingRawMode())

L->Diag(TrigraphPos, diag::trigraph_ends_block_comment); L->Diag(TrigraphPos - L->getBuffer().data(),

diag::trigraph_ends_block_comment);

} }

// Warn about having an escaped newline between the */ characters. // Warn about having an escaped newline between the */ characters.

if (!L->isLexingRawMode()) if (!L->isLexingRawMode())

L->Diag(CurPtr + 1, diag::escaped_newline_block_comment_end); L->Diag(CurPtr + 1 - L->getBuffer().data(),

diag::escaped_newline_block_comment_end);

// If there was space between the backslash and newline, warn about it. // If there was space between the backslash and newline, warn about it.

if (SpacePos && !L->isLexingRawMode()) if (SpacePos && !L->isLexingRawMode())

L->Diag(SpacePos, diag::backslash_newline_space); L->Diag(SpacePos - L->getBuffer().data(), diag::backslash_newline_space);

return true; return true;

} }

#ifdef __SSE2__ #ifdef __SSE2__

#include <emmintrin.h> #include <emmintrin.h>

#elif __ALTIVEC__ #elif __ALTIVEC__

#include <altivec.h> #include <altivec.h>

#undef bool #undef bool

#endif #endif

/// We have just read from input the / and * characters that started a comment. /// We have just read from input the / and * characters that started a comment.

/// Read until we find the * and / characters that terminate the comment. /// Read until we find the * and / characters that terminate the comment.

/// Note that we don't bother decoding trigraphs or escaped newlines in block /// Note that we don't bother decoding trigraphs or escaped newlines in block

/// comments, because they cannot cause the comment to end. The only thing /// comments, because they cannot cause the comment to end. The only thing

/// that can happen is the comment could end with an escaped newline between /// that can happen is the comment could end with an escaped newline between

/// the terminating * and /. /// the terminating * and /.

/// ///

/// If we're in KeepCommentMode or any CommentHandler has inserted /// If we're in KeepCommentMode or any CommentHandler has inserted

/// some tokens, this will store the first token and return true. /// some tokens, this will store the first token and return true.

bool Lexer::SkipBlockComment(Token &Result, const char *CurPtr, bool Lexer::SkipBlockComment(Token &Result, unsigned CurOffset,

bool &TokAtPhysicalStartOfLine) { bool &TokAtPhysicalStartOfLine) {

// Scan one character past where we should, looking for a '/' character. Once // Scan one character past where we should, looking for a '/' character. Once

// we find it, check to see if it was preceded by a *. This common // we find it, check to see if it was preceded by a *. This common

// optimization helps people who like to put a lot of * characters in their // optimization helps people who like to put a lot of * characters in their

// comments. // comments.

// The first character we get with newlines and trigraphs skipped to handle // The first character we get with newlines and trigraphs skipped to handle

// the degenerate /*/ case below correctly if the * has an escaped newline // the degenerate /*/ case below correctly if the * has an escaped newline

// after it. // after it.

unsigned CharSize; unsigned CharSize;

unsigned char C = getCharAndSize(CurPtr, CharSize); unsigned char C = getCharAndSize(CurOffset, CharSize);

CurPtr += CharSize; CurOffset += CharSize;

if (C == 0 && CurPtr == BufferEnd+1) { if (C == 0 && CurOffset == BufferSize + 1) {

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(BufferPtr, diag::err_unterminated_block_comment); Diag(BufferOffset, diag::err_unterminated_block_comment);

--CurPtr; --CurOffset;

// KeepWhitespaceMode should return this broken comment as a token. Since // KeepWhitespaceMode should return this broken comment as a token. Since

// it isn't a well formed comment, just return it as an 'unknown' token. // it isn't a well formed comment, just return it as an 'unknown' token.

if (isKeepWhitespaceMode()) { if (isKeepWhitespaceMode()) {

FormTokenWithChars(Result, CurPtr, tok::unknown); FormTokenWithChars(Result, CurOffset, tok::unknown);

return true; return true;

} }

BufferPtr = CurPtr; BufferOffset = CurOffset;

return false; return false;

} }

// Check to see if the first character after the '/*' is another /. If so, // Check to see if the first character after the '/*' is another /. If so,

// then this slash does not end the block comment, it is part of it. // then this slash does not end the block comment, it is part of it.

if (C == '/') if (C == '/')

C = *CurPtr++; C = BufferStart[CurOffset++];

// C++23 [lex.phases] p1 // C++23 [lex.phases] p1

// Diagnose invalid UTF-8 if the corresponding warning is enabled, emitting a // Diagnose invalid UTF-8 if the corresponding warning is enabled, emitting a

// diagnostic only once per entire ill-formed subsequence to avoid // diagnostic only once per entire ill-formed subsequence to avoid

// emitting too many diagnostics (see http://unicode.org/review/pr-121.html). // emitting too many diagnostics (see http://unicode.org/review/pr-121.html).

bool UnicodeDecodingAlreadyDiagnosed = false; bool UnicodeDecodingAlreadyDiagnosed = false;

while (true) { while (true) {

// Skip over all non-interesting characters until we find end of buffer or a // Skip over all non-interesting characters until we find end of buffer or a

// (probably ending) '/' character. // (probably ending) '/' character.

if (CurPtr + 24 < BufferEnd && if (CurOffset + 24 < BufferSize &&

// If there is a code-completion point avoid the fast scan because it // If there is a code-completion point avoid the fast scan because it

// doesn't check for '\0'. // doesn't check for '\0'.

!(PP && PP->getCodeCompletionFileLoc() == FileLoc)) { !(PP && PP->getCodeCompletionFileLoc() == FileLoc)) {

// While not aligned to a 16-byte boundary. // While not aligned to a 16-byte boundary.

while (C != '/' && (intptr_t)CurPtr % 16 != 0) { while (C != '/' && (intptr_t)(BufferStart + CurOffset) % 16 != 0) {

if (!isASCII(C)) if (!isASCII(C))

goto MultiByteUTF8; goto MultiByteUTF8;

C = *CurPtr++; C = BufferStart[CurOffset++];

} }

if (C == '/') goto FoundSlash; if (C == '/') goto FoundSlash;

#ifdef __SSE2__ #ifdef __SSE2__

__m128i Slashes = _mm_set1_epi8('/'); __m128i Slashes = _mm_set1_epi8('/');

while (CurPtr + 16 < BufferEnd) { while (CurOffset + 16 < BufferSize) {

int Mask = _mm_movemask_epi8(*(const __m128i *)CurPtr); int Mask =

_mm_movemask_epi8(*(const __m128i *)(BufferStart + CurOffset));

if (LLVM_UNLIKELY(Mask != 0)) { if (LLVM_UNLIKELY(Mask != 0)) {

goto MultiByteUTF8; goto MultiByteUTF8;

} }

// look for slashes // look for slashes

int cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(*(const __m128i*)CurPtr, int cmp = _mm_movemask_epi8(_mm_cmpeq_epi8(

Slashes)); *(const __m128i *)(BufferStart + CurOffset), Slashes));

if (cmp != 0) { if (cmp != 0) {

// Adjust the pointer to point directly after the first slash. It's // Adjust the pointer to point directly after the first slash. It's

// not necessary to set C here, it will be overwritten at the end of // not necessary to set C here, it will be overwritten at the end of

// the outer loop. // the outer loop.

CurPtr += llvm::countr_zero<unsigned>(cmp) + 1; CurOffset += llvm::countr_zero<unsigned>(cmp) + 1;

goto FoundSlash; goto FoundSlash;

} }

CurPtr += 16; CurOffset += 16;

} }

#elif __ALTIVEC__ #elif __ALTIVEC__

__vector unsigned char LongUTF = {0x80, 0x80, 0x80, 0x80, 0x80, 0x80, __vector unsigned char LongUTF = {0x80, 0x80, 0x80, 0x80, 0x80, 0x80,

0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80,

0x80, 0x80, 0x80, 0x80}; 0x80, 0x80, 0x80, 0x80};

__vector unsigned char Slashes = { __vector unsigned char Slashes = {

'/', '/', '/', '/', '/', '/', '/', '/', '/', '/', '/', '/', '/', '/', '/', '/',

'/', '/', '/', '/', '/', '/', '/', '/' '/', '/', '/', '/', '/', '/', '/', '/'

}; };

while (CurPtr + 16 < BufferEnd) { while (CurPtr + 16 < BufferEnd) {

if (LLVM_UNLIKELY( if (LLVM_UNLIKELY(

vec_any_ge(*(const __vector unsigned char *)CurPtr, LongUTF))) vec_any_ge(*(const __vector unsigned char *)CurPtr, LongUTF)))

goto MultiByteUTF8; goto MultiByteUTF8;

if (vec_any_eq(*(const __vector unsigned char *)CurPtr, Slashes)) { if (vec_any_eq(*(const __vector unsigned char *)CurPtr, Slashes)) {

break; break;

} }

CurPtr += 16; CurPtr += 16;

} }

#else #else

while (CurPtr + 16 < BufferEnd) { while (CurOffset + 16 < BufferSize) {

bool HasNonASCII = false; bool HasNonASCII = false;

for (unsigned I = 0; I < 16; ++I) for (unsigned I = 0; I < 16; ++I)

HasNonASCII |= !isASCII(CurPtr[I]); HasNonASCII |= !isASCII(BufferStart[CurOffset + I]);

if (LLVM_UNLIKELY(HasNonASCII)) if (LLVM_UNLIKELY(HasNonASCII))

goto MultiByteUTF8; goto MultiByteUTF8;

bool HasSlash = false; bool HasSlash = false;

for (unsigned I = 0; I < 16; ++I) for (unsigned I = 0; I < 16; ++I)

HasSlash |= CurPtr[I] == '/'; HasSlash |= BufferStart[CurOffset + I] == '/';

if (HasSlash) if (HasSlash)

break; break;

CurPtr += 16; CurOffset += 16;

} }

#endif #endif

// It has to be one of the bytes scanned, increment to it and read one. // It has to be one of the bytes scanned, increment to it and read one.

C = *CurPtr++; C = BufferStart[CurOffset++];

} }

// Loop to scan the remainder, warning on invalid UTF-8 // Loop to scan the remainder, warning on invalid UTF-8

// if the corresponding warning is enabled, emitting a diagnostic only once // if the corresponding warning is enabled, emitting a diagnostic only once

// per sequence that cannot be decoded. // per sequence that cannot be decoded.

while (C != '/' && C != '\0') { while (C != '/' && C != '\0') {

if (isASCII(C)) { if (isASCII(C)) {

UnicodeDecodingAlreadyDiagnosed = false; UnicodeDecodingAlreadyDiagnosed = false;

C = *CurPtr++; C = BufferStart[CurOffset++];

continue; continue;

} }

MultiByteUTF8: MultiByteUTF8:

// CurPtr is 1 code unit past C, so to decode // CurPtr is 1 code unit past C, so to decode

// the codepoint, we need to read from the previous position. // the codepoint, we need to read from the previous position.

unsigned Length = llvm::getUTF8SequenceSize( unsigned Length = llvm::getUTF8SequenceSize(

(const llvm::UTF8 *)CurPtr - 1, (const llvm::UTF8 *)BufferEnd); (const llvm::UTF8 *)(BufferStart + CurOffset) - 1,

(const llvm::UTF8 *)(BufferStart + BufferSize));

if (Length == 0) { if (Length == 0) {

if (!UnicodeDecodingAlreadyDiagnosed && !isLexingRawMode()) if (!UnicodeDecodingAlreadyDiagnosed && !isLexingRawMode())

Diag(CurPtr - 1, diag::warn_invalid_utf8_in_comment); Diag(CurOffset - 1, diag::warn_invalid_utf8_in_comment);

UnicodeDecodingAlreadyDiagnosed = true; UnicodeDecodingAlreadyDiagnosed = true;

} else { } else {

UnicodeDecodingAlreadyDiagnosed = false; UnicodeDecodingAlreadyDiagnosed = false;

CurPtr += Length - 1; CurOffset += Length - 1;

} }

C = *CurPtr++; C = BufferStart[CurOffset++];

} }

if (C == '/') { if (C == '/') {

FoundSlash: FoundSlash:

if (CurPtr[-2] == '*') // We found the final */. We're done! if (BufferStart[CurOffset - 2] ==

'*') // We found the final */. We're done!

break; break;

if ((CurPtr[-2] == '\n' || CurPtr[-2] == '\r')) { if ((BufferStart[CurOffset - 2] == '\n' ||

if (isEndOfBlockCommentWithEscapedNewLine(CurPtr - 2, this, BufferStart[CurOffset - 2] == '\r')) {

LangOpts.Trigraphs)) { if (isEndOfBlockCommentWithEscapedNewLine(&BufferStart[CurOffset - 2],

this, LangOpts.Trigraphs)) {

// We found the final */, though it had an escaped newline between the // We found the final */, though it had an escaped newline between the

// * and /. We're done! // * and /. We're done!

break; break;

} }

if (CurPtr[0] == '*' && CurPtr[1] != '/') { if (BufferStart[CurOffset] == '*' && BufferStart[CurOffset + 1] != '/') {

// If this is a /* inside of the comment, emit a warning. Don't do this // If this is a /* inside of the comment, emit a warning. Don't do this

// if this is a /*/, which will end the comment. This misses cases with // if this is a /*/, which will end the comment. This misses cases with

// embedded escaped newlines, but oh well. // embedded escaped newlines, but oh well.

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(CurPtr-1, diag::warn_nested_block_comment); Diag(CurOffset - 1, diag::warn_nested_block_comment);

} }

} else if (C == 0 && CurPtr == BufferEnd+1) { } else if (C == 0 && CurOffset == BufferSize + 1) {

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(BufferPtr, diag::err_unterminated_block_comment); Diag(BufferOffset, diag::err_unterminated_block_comment);

// Note: the user probably forgot a */. We could continue immediately // Note: the user probably forgot a */. We could continue immediately

// after the /*, but this would involve lexing a lot of what really is the // after the /*, but this would involve lexing a lot of what really is the

// comment, which surely would confuse the parser. // comment, which surely would confuse the parser.

--CurPtr; --CurOffset;

// KeepWhitespaceMode should return this broken comment as a token. Since // KeepWhitespaceMode should return this broken comment as a token. Since

// it isn't a well formed comment, just return it as an 'unknown' token. // it isn't a well formed comment, just return it as an 'unknown' token.

if (isKeepWhitespaceMode()) { if (isKeepWhitespaceMode()) {

FormTokenWithChars(Result, CurPtr, tok::unknown); FormTokenWithChars(Result, CurOffset, tok::unknown);

return true; return true;

} }

BufferPtr = CurPtr; BufferOffset = CurOffset;

return false; return false;

} else if (C == '\0' && isCodeCompletionPoint(CurPtr-1)) { } else if (C == '\0' && isCodeCompletionPoint(CurOffset - 1)) {

PP->CodeCompleteNaturalLanguage(); PP->CodeCompleteNaturalLanguage();

cutOffLexing(); cutOffLexing();

return false; return false;

} }

C = *CurPtr++; C = BufferStart[CurOffset++];

} }

// Notify comment handlers about the comment unless we're in a #if 0 block. // Notify comment handlers about the comment unless we're in a #if 0 block.

if (PP && !isLexingRawMode() && if (PP && !isLexingRawMode() &&

PP->HandleComment(Result, SourceRange(getSourceLocation(BufferPtr), PP->HandleComment(Result, SourceRange(getSourceLocation(BufferOffset),

getSourceLocation(CurPtr)))) { getSourceLocation(CurOffset)))) {

BufferPtr = CurPtr; BufferOffset = CurOffset;

return true; // A token has to be returned. return true; // A token has to be returned.

} }

// If we are returning comments as tokens, return this comment as a token. // If we are returning comments as tokens, return this comment as a token.

if (inKeepCommentMode()) { if (inKeepCommentMode()) {

FormTokenWithChars(Result, CurPtr, tok::comment); FormTokenWithChars(Result, CurOffset, tok::comment);

return true; return true;

} }

// It is common for the tokens immediately after a /**/ comment to be // It is common for the tokens immediately after a /**/ comment to be

// whitespace. Instead of going through the big switch, handle it // whitespace. Instead of going through the big switch, handle it

// efficiently now. This is safe even in KeepWhitespaceMode because we would // efficiently now. This is safe even in KeepWhitespaceMode because we would

// have already returned above with the comment as a token. // have already returned above with the comment as a token.

if (isHorizontalWhitespace(*CurPtr)) { if (isHorizontalWhitespace(BufferStart[CurOffset])) {

SkipWhitespace(Result, CurPtr+1, TokAtPhysicalStartOfLine); SkipWhitespace(Result, CurOffset + 1, TokAtPhysicalStartOfLine);

return false; return false;

davrec (inline comment; Unsubmitted, Not Done): indent

} }

// Otherwise, just return so that the next character will be lexed as a token. // Otherwise, just return so that the next character will be lexed as a token.

BufferPtr = CurPtr; BufferOffset = CurOffset;

Result.setFlag(Token::LeadingSpace); Result.setFlag(Token::LeadingSpace);

return false; return false;

} }
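In the fast path of SkipBlockComment above, the alignment loop tests `(intptr_t)(BufferStart + CurOffset) % 16` rather than `CurOffset % 16`. Alignment is a property of the address, so the offset alone is only sufficient when the buffer base happens to be 16-byte aligned. The sketch below illustrates this under stated assumptions; it is not code from the patch, and `bytesUntilAligned16` and `demoAlignment` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes to step forward from Base + Offset to reach the next
// 16-byte aligned address (0 if that address is already aligned).
inline std::size_t bytesUntilAligned16(const char *Base, std::size_t Offset) {
  auto Addr = reinterpret_cast<std::uintptr_t>(Base + Offset);
  return (16 - Addr % 16) % 16;
}

// Compares an aligned base with one shifted by a single byte.
inline bool demoAlignment() {
  alignas(16) static char Storage[64] = {};
  const char *Aligned = Storage;     // base address is a multiple of 16
  const char *Shifted = Storage + 1; // base address is 1 past a multiple of 16
  return bytesUntilAligned16(Aligned, 0) == 0 &&  // offset 0, already aligned
         bytesUntilAligned16(Shifted, 0) == 15 && // same offset, not aligned
         bytesUntilAligned16(Shifted, 15) == 0;   // aligned only at offset 15
}
```

If the buffer can move when it grows, the same offset may be aligned before the move and unaligned after it, which is presumably why the check is recomputed from the current `BufferStart`.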

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

// Primary Lexing Entry Points // Primary Lexing Entry Points

//===----------------------------------------------------------------------===// //===----------------------------------------------------------------------===//

/// ReadToEndOfLine - Read the rest of the current preprocessor line as an /// ReadToEndOfLine - Read the rest of the current preprocessor line as an

/// uninterpreted string. This switches the lexer out of directive mode. /// uninterpreted string. This switches the lexer out of directive mode.

void Lexer::ReadToEndOfLine(SmallVectorImpl<char> *Result) { void Lexer::ReadToEndOfLine(SmallVectorImpl<char> *Result) {

assert(ParsingPreprocessorDirective && ParsingFilename == false && assert(ParsingPreprocessorDirective && ParsingFilename == false &&

"Must be in a preprocessing directive!"); "Must be in a preprocessing directive!");

Token Tmp; Token Tmp;

Tmp.startToken(); Tmp.startToken();

// CurPtr - Cache BufferPtr in an automatic variable. // CurPtr - Cache BufferPtr in an automatic variable.

const char *CurPtr = BufferPtr; unsigned CurOffset = BufferOffset;

while (true) { while (true) {

char Char = getAndAdvanceChar(CurPtr, Tmp); char Char = getAndAdvanceChar(CurOffset, Tmp);

switch (Char) { switch (Char) {

default: default:

davrec (inline comment; Unsubmitted, Not Done): indent

if (Result) if (Result)

Result->push_back(Char); Result->push_back(Char);

break; break;

case 0: // Null. case 0: // Null.

// Found end of file? // Found end of file?

if (CurPtr-1 != BufferEnd) { if (CurOffset - 1 != BufferSize) {

if (isCodeCompletionPoint(CurPtr-1)) { if (isCodeCompletionPoint(CurOffset - 1)) {

PP->CodeCompleteNaturalLanguage(); PP->CodeCompleteNaturalLanguage();

cutOffLexing(); cutOffLexing();

return; return;

} }

// Nope, normal character, continue. // Nope, normal character, continue.

if (Result) if (Result)

Result->push_back(Char); Result->push_back(Char);

break; break;

} }

// FALL THROUGH. // FALL THROUGH.

[[fallthrough]]; [[fallthrough]];

case '\r': case '\r':

case '\n': case '\n':

// Okay, we found the end of the line. First, back up past the \0, \r, \n. // Okay, we found the end of the line. First, back up past the \0, \r, \n.

assert(CurPtr[-1] == Char && "Trigraphs for newline?"); assert(BufferStart[CurOffset - 1] == Char && "Trigraphs for newline?");

BufferPtr = CurPtr-1; BufferOffset = CurOffset - 1;

// Next, lex the character, which should handle the EOD transition. // Next, lex the character, which should handle the EOD transition.

Lex(Tmp); Lex(Tmp);

if (Tmp.is(tok::code_completion)) { if (Tmp.is(tok::code_completion)) {

if (PP) if (PP)

PP->CodeCompleteNaturalLanguage(); PP->CodeCompleteNaturalLanguage();

Lex(Tmp); Lex(Tmp);

} }

assert(Tmp.is(tok::eod) && "Unexpected token!"); assert(Tmp.is(tok::eod) && "Unexpected token!");

// Finally, we're done; // Finally, we're done;

return; return;

} }
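
Note on the `case 0` check above: `CurOffset - 1 != BufferSize` separates a NUL embedded in the source text from the terminating NUL that the lexer expects at the end of its buffer. The following is a minimal standalone sketch of that test, not code from this patch. `NulKind` and `readAndClassify` are hypothetical names, and a `std::string` (whose `c_str()` is NUL-terminated at index `size()`) stands in for the lexer buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical classification of a character just read from the buffer.
enum class NulKind { NotNul, EmbeddedNul, EndOfBuffer };

// Read the character at Offset, advance Offset past it, and classify it.
// Assumption: Text.c_str()[Text.size()] is the terminating NUL, playing the
// role of BufferStart[BufferSize] in the patch.
static NulKind readAndClassify(const std::string &Text, std::size_t &Offset) {
  assert(Offset <= Text.size() && "offset past the terminating NUL");
  char C = Text.c_str()[Offset++];
  if (C != 0)
    return NulKind::NotNul;
  // Offset has already been advanced, so the NUL itself sits at Offset - 1.
  return (Offset - 1 == Text.size()) ? NulKind::EndOfBuffer
                                     : NulKind::EmbeddedNul;
}
```

Because the offset is advanced before the comparison, the test is written against `Offset - 1`, which mirrors the `CurOffset - 1` expressions in the hunk.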

/// LexEndOfFile - CurPtr points to the end of this file. Handle this /// LexEndOfFile - CurOffset points to the end of this file. Handle this

/// condition, reporting diagnostics and handling other edge cases as required. /// condition, reporting diagnostics and handling other edge cases as required.

/// This returns true if Result contains a token, false if PP.Lex should be /// This returns true if Result contains a token, false if PP.Lex should be

/// called again. /// called again.

bool Lexer::LexEndOfFile(Token &Result, const char *CurPtr) { bool Lexer::LexEndOfFile(Token &Result, unsigned CurOffset) {

// If we hit the end of the file while parsing a preprocessor directive, // If we hit the end of the file while parsing a preprocessor directive,

// end the preprocessor directive first. The next token returned will // end the preprocessor directive first. The next token returned will

// then be the end of file. // then be the end of file.

if (ParsingPreprocessorDirective) { if (ParsingPreprocessorDirective) {

// Done parsing the "line". // Done parsing the "line".

ParsingPreprocessorDirective = false; ParsingPreprocessorDirective = false;

// Update the location of token as well as BufferPtr. // Update the location of token as well as BufferOffset.

FormTokenWithChars(Result, CurPtr, tok::eod); FormTokenWithChars(Result, CurOffset, tok::eod);

// Restore comment saving mode, in case it was disabled for directive. // Restore comment saving mode, in case it was disabled for directive.

if (PP) if (PP)

resetExtendedTokenMode(); resetExtendedTokenMode();

return true; // Have a token. return true; // Have a token.

} }

// If we are in raw mode, return this event as an EOF token. Let the caller // If we are in raw mode, return this event as an EOF token. Let the caller

// that put us in raw mode handle the event. // that put us in raw mode handle the event.

if (isLexingRawMode()) { if (isLexingRawMode()) {

Result.startToken(); Result.startToken();

BufferPtr = BufferEnd; BufferOffset = BufferSize;

FormTokenWithChars(Result, BufferEnd, tok::eof); FormTokenWithChars(Result, BufferSize, tok::eof);

return true; return true;

} }

if (PP->isRecordingPreamble() && PP->isInPrimaryFile()) { if (PP->isRecordingPreamble() && PP->isInPrimaryFile()) {

PP->setRecordedPreambleConditionalStack(ConditionalStack); PP->setRecordedPreambleConditionalStack(ConditionalStack);

// If the preamble cuts off the end of a header guard, consider it guarded. // If the preamble cuts off the end of a header guard, consider it guarded.

// The guard is valid for the preamble content itself, and for tools the // The guard is valid for the preamble content itself, and for tools the

// most useful answer is "yes, this file has a header guard". // most useful answer is "yes, this file has a header guard".

[9 lines hidden] while (!ConditionalStack.empty()) {

if (PP->getCodeCompletionFileLoc() != FileLoc) if (PP->getCodeCompletionFileLoc() != FileLoc)

PP->Diag(ConditionalStack.back().IfLoc, PP->Diag(ConditionalStack.back().IfLoc,

diag::err_pp_unterminated_conditional); diag::err_pp_unterminated_conditional);

ConditionalStack.pop_back(); ConditionalStack.pop_back();

} }

// C99 5.1.1.2p2: If the file is non-empty and didn't end in a newline, issue // C99 5.1.1.2p2: If the file is non-empty and didn't end in a newline, issue

// a pedwarn. // a pedwarn.

if (CurPtr != BufferStart && (CurPtr[-1] != '\n' && CurPtr[-1] != '\r')) { if (CurOffset != 0 && (BufferStart[CurOffset - 1] != '\n' &&

BufferStart[CurOffset - 1] != '\r')) {

DiagnosticsEngine &Diags = PP->getDiagnostics(); DiagnosticsEngine &Diags = PP->getDiagnostics();

SourceLocation EndLoc = getSourceLocation(BufferEnd); SourceLocation EndLoc = getSourceLocation(BufferSize);

unsigned DiagID; unsigned DiagID;

if (LangOpts.CPlusPlus11) { if (LangOpts.CPlusPlus11) {

// C++11 [lex.phases] 2.2 p2 // C++11 [lex.phases] 2.2 p2

// Prefer the C++98 pedantic compatibility warning over the generic, // Prefer the C++98 pedantic compatibility warning over the generic,

// non-extension, user-requested "missing newline at EOF" warning. // non-extension, user-requested "missing newline at EOF" warning.

if (!Diags.isIgnored(diag::warn_cxx98_compat_no_newline_eof, EndLoc)) { if (!Diags.isIgnored(diag::warn_cxx98_compat_no_newline_eof, EndLoc)) {

DiagID = diag::warn_cxx98_compat_no_newline_eof; DiagID = diag::warn_cxx98_compat_no_newline_eof;

} else { } else {

DiagID = diag::warn_no_newline_eof; DiagID = diag::warn_no_newline_eof;

} }

} else { } else {

DiagID = diag::ext_no_newline_eof; DiagID = diag::ext_no_newline_eof;

} }

Diag(BufferEnd, DiagID) Diag(BufferSize, DiagID) << FixItHint::CreateInsertion(EndLoc, "\n");

<< FixItHint::CreateInsertion(EndLoc, "\n");

} }

BufferPtr = CurPtr; BufferOffset = CurOffset;

// Finally, let the preprocessor handle this. // Finally, let the preprocessor handle this.

return PP->HandleEndOfFile(Result, isPragmaLexer()); return PP->HandleEndOfFile(Result, isPragmaLexer());

} }

/// isNextPPTokenLParen - Return 1 if the next unexpanded token lexed from /// isNextPPTokenLParen - Return 1 if the next unexpanded token lexed from

/// the specified lexer will return a tok::l_paren token, 0 if it is something /// the specified lexer will return a tok::l_paren token, 0 if it is something

/// else and 2 if there are no more tokens in the buffer controlled by the /// else and 2 if there are no more tokens in the buffer controlled by the

[9 lines hidden] unsigned Lexer::isNextPPTokenLParen() {

} }

// Switch to 'skipping' mode. This will ensure that we can lex a token // Switch to 'skipping' mode. This will ensure that we can lex a token

// without emitting diagnostics, disables macro expansion, and will cause EOF // without emitting diagnostics, disables macro expansion, and will cause EOF

// to return an EOF token instead of popping the include stack. // to return an EOF token instead of popping the include stack.

LexingRawMode = true; LexingRawMode = true;

// Save state that can be changed while lexing so that we can restore it. // Save state that can be changed while lexing so that we can restore it.

const char *TmpBufferPtr = BufferPtr; unsigned TmpBufferOffset = BufferOffset;

bool inPPDirectiveMode = ParsingPreprocessorDirective; bool inPPDirectiveMode = ParsingPreprocessorDirective;

bool atStartOfLine = IsAtStartOfLine; bool atStartOfLine = IsAtStartOfLine;

bool atPhysicalStartOfLine = IsAtPhysicalStartOfLine; bool atPhysicalStartOfLine = IsAtPhysicalStartOfLine;

bool leadingSpace = HasLeadingSpace; bool leadingSpace = HasLeadingSpace;

Token Tok; Token Tok;

Lex(Tok); Lex(Tok);

// Restore state that may have changed. // Restore state that may have changed.

BufferPtr = TmpBufferPtr; BufferOffset = TmpBufferOffset;

ParsingPreprocessorDirective = inPPDirectiveMode; ParsingPreprocessorDirective = inPPDirectiveMode;

HasLeadingSpace = leadingSpace; HasLeadingSpace = leadingSpace;

IsAtStartOfLine = atStartOfLine; IsAtStartOfLine = atStartOfLine;

IsAtPhysicalStartOfLine = atPhysicalStartOfLine; IsAtPhysicalStartOfLine = atPhysicalStartOfLine;

// Restore the lexer back to non-skipping mode. // Restore the lexer back to non-skipping mode.

LexingRawMode = false; LexingRawMode = false;

[21 lines hidden] static const char *FindConflictEnd(const char *CurPtr, const char *BufferEnd,

} }

return nullptr; return nullptr;

} }

/// IsStartOfConflictMarker - If the specified pointer is the start of a version /// IsStartOfConflictMarker - If the specified pointer is the start of a version

/// control conflict marker like '<<<<<<<', recognize it as such, emit an error /// control conflict marker like '<<<<<<<', recognize it as such, emit an error

/// and recover nicely. This returns true if it is a conflict marker and false /// and recover nicely. This returns true if it is a conflict marker and false

/// if not. /// if not.

bool Lexer::IsStartOfConflictMarker(const char *CurPtr) { bool Lexer::IsStartOfConflictMarker(unsigned CurOffset) {

// Only a conflict marker if it starts at the beginning of a line. // Only a conflict marker if it starts at the beginning of a line.

if (CurPtr != BufferStart && if (CurOffset != 0 && BufferStart[CurOffset - 1] != '\n' &&

CurPtr[-1] != '\n' && CurPtr[-1] != '\r') BufferStart[CurOffset - 1] != '\r')

return false; return false;

// Check to see if we have <<<<<<< or >>>>. // Check to see if we have <<<<<<< or >>>>.

if (!StringRef(CurPtr, BufferEnd - CurPtr).startswith("<<<<<<<") && if (!StringRef(BufferStart + CurOffset, BufferSize - CurOffset)

!StringRef(CurPtr, BufferEnd - CurPtr).startswith(">>>> ")) .startswith("<<<<<<<") &&

!StringRef(BufferStart + CurOffset, BufferSize - CurOffset)

.startswith(">>>> "))

return false; return false;

// If we have a situation where we don't care about conflict markers, ignore // If we have a situation where we don't care about conflict markers, ignore

// it. // it.

if (CurrentConflictMarkerState || isLexingRawMode()) if (CurrentConflictMarkerState || isLexingRawMode())

return false; return false;

ConflictMarkerKind Kind = *CurPtr == '<' ? CMK_Normal : CMK_Perforce; ConflictMarkerKind Kind =

BufferStart[CurOffset] == '<' ? CMK_Normal : CMK_Perforce;

// Check to see if there is an ending marker somewhere in the buffer at the // Check to see if there is an ending marker somewhere in the buffer at the

// start of a line to terminate this conflict marker. // start of a line to terminate this conflict marker.

if (FindConflictEnd(CurPtr, BufferEnd, Kind)) { if (FindConflictEnd(&BufferStart[CurOffset], &BufferStart[BufferSize],

Kind)) {

// We found a match. We are really in a conflict marker. // We found a match. We are really in a conflict marker.

// Diagnose this, and ignore to the end of line. // Diagnose this, and ignore to the end of line.

Diag(CurPtr, diag::err_conflict_marker); Diag(CurOffset, diag::err_conflict_marker);

CurrentConflictMarkerState = Kind; CurrentConflictMarkerState = Kind;

// Skip ahead to the end of line. We know this exists because the // Skip ahead to the end of line. We know this exists because the

// end-of-conflict marker starts with \r or \n. // end-of-conflict marker starts with \r or \n.

while (*CurPtr != '\r' && *CurPtr != '\n') { while (BufferStart[CurOffset] != '\r' && BufferStart[CurOffset] != '\n') {

assert(CurPtr != BufferEnd && "Didn't find end of line"); assert(CurOffset != BufferSize && "Didn't find end of line");

++CurPtr; ++CurOffset;

} }

BufferPtr = CurPtr; BufferOffset = CurOffset;

return true; return true;

} }

// No end of conflict marker found. // No end of conflict marker found.

return false; return false;

} }

/// HandleEndOfConflictMarker - If this is a '====' or '||||' or '>>>>', or if /// HandleEndOfConflictMarker - If this is a '====' or '||||' or '>>>>', or if

/// it is '<<<<' and the conflict marker started with a '>>>>' marker, then it /// it is '<<<<' and the conflict marker started with a '>>>>' marker, then it

/// is the end of a conflict marker. Handle it by ignoring up until the end of /// is the end of a conflict marker. Handle it by ignoring up until the end of

/// the line. This returns true if it is a conflict marker and false if not. /// the line. This returns true if it is a conflict marker and false if not.

bool Lexer::HandleEndOfConflictMarker(const char *CurPtr) { bool Lexer::HandleEndOfConflictMarker(unsigned CurOffset) {

// Only a conflict marker if it starts at the beginning of a line. // Only a conflict marker if it starts at the beginning of a line.

if (CurPtr != BufferStart && if (CurOffset != 0 && BufferStart[CurOffset - 1] != '\n' &&

CurPtr[-1] != '\n' && CurPtr[-1] != '\r') BufferStart[CurOffset - 1] != '\r')

return false; return false;

// If we have a situation where we don't care about conflict markers, ignore // If we have a situation where we don't care about conflict markers, ignore

// it. // it.

if (!CurrentConflictMarkerState || isLexingRawMode()) if (!CurrentConflictMarkerState || isLexingRawMode())

return false; return false;

// Check to see if we have the marker (4 characters in a row). // Check to see if we have the marker (4 characters in a row).

for (unsigned i = 1; i != 4; ++i) for (unsigned i = 1; i != 4; ++i)

if (CurPtr[i] != CurPtr[0]) if (BufferStart[CurOffset + i] != BufferStart[CurOffset])

return false; return false;

// If we do have it, search for the end of the conflict marker. This could // If we do have it, search for the end of the conflict marker. This could

// fail if it got skipped with a '#if 0' or something. Note that CurPtr might // fail if it got skipped with a '#if 0' or something. Note that CurOffset might

// be the end of conflict marker. // be the end of conflict marker.

if (const char *End = FindConflictEnd(CurPtr, BufferEnd, if (const char *End =

FindConflictEnd(BufferStart + CurOffset, BufferStart + BufferSize,

CurrentConflictMarkerState)) { CurrentConflictMarkerState)) {

CurPtr = End; CurOffset = End - BufferStart;

// Skip ahead to the end of line. // Skip ahead to the end of line.

while (CurPtr != BufferEnd && *CurPtr != '\r' && *CurPtr != '\n') while (CurOffset != BufferSize && BufferStart[CurOffset] != '\r' &&

++CurPtr; BufferStart[CurOffset] != '\n')

++CurOffset;

BufferPtr = CurPtr; BufferOffset = CurOffset;

// No longer in the conflict marker. // No longer in the conflict marker.

CurrentConflictMarkerState = CMK_None; CurrentConflictMarkerState = CMK_None;

return true; return true;

} }

return false; return false;

} }
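
Several hunks above keep the pointer-based helpers (`FindConflictEnd`, `findPlaceholderEnd`) and convert at the call boundary: offset to pointer with `BufferStart + CurOffset`, and the returned pointer back to an offset with `End - BufferStart`. The following is a minimal standalone sketch of that round trip, assuming C++17 for `std::optional`. `findMarker` and `findMarkerOffset` are hypothetical names, and a plain character array stands in for the lexer buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <optional>

// Hypothetical pointer-based helper in the style of FindConflictEnd: returns
// a pointer to the first occurrence of Marker in [Cur, End), or nullptr.
static const char *findMarker(const char *Cur, const char *End,
                              const char *Marker) {
  std::size_t Len = std::strlen(Marker);
  for (; Cur != End && static_cast<std::size_t>(End - Cur) >= Len; ++Cur)
    if (std::memcmp(Cur, Marker, Len) == 0)
      return Cur;
  return nullptr;
}

// Offset-based wrapper: pointers are formed only for the duration of the
// call, so nothing that outlives it depends on the buffer's address.
static std::optional<std::size_t>
findMarkerOffset(const char *BufferStart, std::size_t BufferSize,
                 std::size_t CurOffset, const char *Marker) {
  assert(CurOffset <= BufferSize && "offset past the end of the buffer");
  const char *End =
      findMarker(BufferStart + CurOffset, BufferStart + BufferSize, Marker);
  if (!End)
    return std::nullopt;
  return static_cast<std::size_t>(End - BufferStart);
}
```

The wrapper returns an offset rather than a pointer, so a caller can store the result across a later buffer swap, which is the property the patch summary relies on.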

static const char *findPlaceholderEnd(const char *CurPtr, static const char *findPlaceholderEnd(const char *CurPtr,

const char *BufferEnd) { const char *BufferEnd) {

if (CurPtr == BufferEnd) if (CurPtr == BufferEnd)

return nullptr; return nullptr;

BufferEnd -= 1; // Scan until the second last character. BufferEnd -= 1; // Scan until the second last character.

for (; CurPtr != BufferEnd; ++CurPtr) { for (; CurPtr != BufferEnd; ++CurPtr) {

if (CurPtr[0] == '#' && CurPtr[1] == '>') if (CurPtr[0] == '#' && CurPtr[1] == '>')

return CurPtr + 2; return CurPtr + 2;

} }

return nullptr; return nullptr;

} }

bool Lexer::lexEditorPlaceholder(Token &Result, const char *CurPtr) { bool Lexer::lexEditorPlaceholder(Token &Result, unsigned CurOffset) {

assert(CurPtr[-1] == '<' && CurPtr[0] == '#' && "Not a placeholder!"); assert(BufferStart[CurOffset - 1] == '<' && BufferStart[CurOffset] == '#' &&

"Not a placeholder!");

if (!PP || !PP->getPreprocessorOpts().LexEditorPlaceholders || LexingRawMode) if (!PP || !PP->getPreprocessorOpts().LexEditorPlaceholders || LexingRawMode)

return false; return false;

const char *End = findPlaceholderEnd(CurPtr + 1, BufferEnd); const char *End =

findPlaceholderEnd(BufferStart + CurOffset + 1, BufferStart + BufferSize);

if (!End) if (!End)

return false; return false;

const char *Start = CurPtr - 1; const char *Start = BufferStart + CurOffset - 1;

if (!LangOpts.AllowEditorPlaceholders) if (!LangOpts.AllowEditorPlaceholders)

Diag(Start, diag::err_placeholder_in_source); Diag(CurOffset - 1, diag::err_placeholder_in_source);

Result.startToken(); Result.startToken();

FormTokenWithChars(Result, End, tok::raw_identifier); FormTokenWithChars(Result, End - BufferStart, tok::raw_identifier);

Result.setRawIdentifierData(Start); Result.setRawIdentifierData(Start);

PP->LookUpIdentifierInfo(Result); PP->LookUpIdentifierInfo(Result);

Result.setFlag(Token::IsEditorPlaceholder); Result.setFlag(Token::IsEditorPlaceholder);

BufferPtr = End; BufferOffset = End - BufferStart;

return true; return true;

} }

bool Lexer::isCodeCompletionPoint(const char *CurPtr) const { bool Lexer::isCodeCompletionPoint(unsigned CurOffset) const {

if (PP && PP->isCodeCompletionEnabled()) { if (PP && PP->isCodeCompletionEnabled()) {

SourceLocation Loc = FileLoc.getLocWithOffset(CurPtr-BufferStart); SourceLocation Loc = FileLoc.getLocWithOffset(CurOffset);

return Loc == PP->getCodeCompletionLoc(); return Loc == PP->getCodeCompletionLoc();

} }

return false; return false;

} }

std::optional<uint32_t> Lexer::tryReadNumericUCN(const char *&StartPtr, std::optional<uint32_t> Lexer::tryReadNumericUCN(unsigned &StartOffset,

const char *SlashLoc, unsigned SlashLoc,

Token *Result) { Token *Result) {

unsigned CharSize; unsigned CharSize;

char Kind = getCharAndSize(StartPtr, CharSize); char Kind = getCharAndSize(StartOffset, CharSize);

assert((Kind == 'u' || Kind == 'U') && "expected a UCN"); assert((Kind == 'u' || Kind == 'U') && "expected a UCN");

unsigned NumHexDigits; unsigned NumHexDigits;

if (Kind == 'u') if (Kind == 'u')

NumHexDigits = 4; NumHexDigits = 4;

else if (Kind == 'U') else if (Kind == 'U')

NumHexDigits = 8; NumHexDigits = 8;

bool Delimited = false; bool Delimited = false;

bool FoundEndDelimiter = false; bool FoundEndDelimiter = false;

unsigned Count = 0; unsigned Count = 0;

bool Diagnose = Result && !isLexingRawMode(); bool Diagnose = Result && !isLexingRawMode();

if (!LangOpts.CPlusPlus && !LangOpts.C99) { if (!LangOpts.CPlusPlus && !LangOpts.C99) {

if (Diagnose) if (Diagnose)

Diag(SlashLoc, diag::warn_ucn_not_valid_in_c89); Diag(SlashLoc, diag::warn_ucn_not_valid_in_c89);

return std::nullopt; return std::nullopt;

} }

const char *CurPtr = StartPtr + CharSize; unsigned CurOffset = StartOffset + CharSize;

const char *KindLoc = &CurPtr[-1]; unsigned KindLoc = CurOffset - 1;

uint32_t CodePoint = 0; uint32_t CodePoint = 0;

while (Count != NumHexDigits || Delimited) { while (Count != NumHexDigits || Delimited) {

char C = getCharAndSize(CurPtr, CharSize); char C = getCharAndSize(CurOffset, CharSize);

if (!Delimited && Count == 0 && C == '{') { if (!Delimited && Count == 0 && C == '{') {

Delimited = true; Delimited = true;

CurPtr += CharSize; CurOffset += CharSize;

continue; continue;

} }

if (Delimited && C == '}') { if (Delimited && C == '}') {

CurPtr += CharSize; CurOffset += CharSize;

FoundEndDelimiter = true; FoundEndDelimiter = true;

break; break;

} }

unsigned Value = llvm::hexDigitValue(C); unsigned Value = llvm::hexDigitValue(C);

if (Value == -1U) { if (Value == -1U) {

if (!Delimited) if (!Delimited)

break; break;

if (Diagnose) if (Diagnose)

Diag(SlashLoc, diag::warn_delimited_ucn_incomplete) Diag(SlashLoc, diag::warn_delimited_ucn_incomplete)

<< StringRef(KindLoc, 1); << StringRef(BufferStart + KindLoc, 1);

return std::nullopt; return std::nullopt;

} }

if (CodePoint & 0xF000'0000) { if (CodePoint & 0xF000'0000) {

if (Diagnose) if (Diagnose)

Diag(KindLoc, diag::err_escape_too_large) << 0; Diag(KindLoc, diag::err_escape_too_large) << 0;

return std::nullopt; return std::nullopt;

} }

CodePoint <<= 4; CodePoint <<= 4;

CodePoint |= Value; CodePoint |= Value;

CurPtr += CharSize; CurOffset += CharSize;

Count++; Count++;

} }

if (Count == 0) { if (Count == 0) {

if (Diagnose) if (Diagnose)

Diag(SlashLoc, FoundEndDelimiter ? diag::warn_delimited_ucn_empty Diag(SlashLoc, FoundEndDelimiter ? diag::warn_delimited_ucn_empty

: diag::warn_ucn_escape_no_digits) : diag::warn_ucn_escape_no_digits)

<< StringRef(KindLoc, 1); << StringRef(BufferStart + KindLoc, 1);

return std::nullopt; return std::nullopt;

} }

if (Delimited && Kind == 'U') { if (Delimited && Kind == 'U') {

if (Diagnose) if (Diagnose)

Diag(SlashLoc, diag::err_hex_escape_no_digits) << StringRef(KindLoc, 1); Diag(SlashLoc, diag::err_hex_escape_no_digits)

<< StringRef(BufferStart + KindLoc, 1);

return std::nullopt; return std::nullopt;

} }

if (!Delimited && Count != NumHexDigits) { if (!Delimited && Count != NumHexDigits) {

if (Diagnose) { if (Diagnose) {

Diag(SlashLoc, diag::warn_ucn_escape_incomplete); Diag(SlashLoc, diag::warn_ucn_escape_incomplete);

// If the user wrote \U1234, suggest a fixit to \u. // If the user wrote \U1234, suggest a fixit to \u.

if (Count == 4 && NumHexDigits == 8) { if (Count == 4 && NumHexDigits == 8) {

CharSourceRange URange = makeCharRange(*this, KindLoc, KindLoc + 1); CharSourceRange URange = makeCharRange(*this, KindLoc, KindLoc + 1);

Diag(KindLoc, diag::note_ucn_four_not_eight) Diag(KindLoc, diag::note_ucn_four_not_eight)

[10 lines hidden] Diag(SlashLoc, PP->getLangOpts().CPlusPlus2b

<< /*delimited*/ 0 << (PP->getLangOpts().CPlusPlus ? 1 : 0); << /*delimited*/ 0 << (PP->getLangOpts().CPlusPlus ? 1 : 0);

} }

if (Result) { if (Result) {

Result->setFlag(Token::HasUCN); Result->setFlag(Token::HasUCN);

// If the UCN contains either a trigraph or a line splicing, // If the UCN contains either a trigraph or a line splicing,

// we need to call getAndAdvanceChar again to set the appropriate flags // we need to call getAndAdvanceChar again to set the appropriate flags

// on Result. // on Result.

if (CurPtr - StartPtr == (ptrdiff_t)(Count + 1 + (Delimited ? 2 : 0))) if (CurOffset - StartOffset == (ptrdiff_t)(Count + 1 + (Delimited ? 2 : 0)))

StartPtr = CurPtr; StartOffset = CurOffset;

else else

while (StartPtr != CurPtr) while (StartOffset != CurOffset)

(void)getAndAdvanceChar(StartPtr, *Result); (void)getAndAdvanceChar(StartOffset, *Result);

} else { } else {

StartPtr = CurPtr; StartOffset = CurOffset;

} }

return CodePoint; return CodePoint;

} }

std::optional<uint32_t> Lexer::tryReadNamedUCN(const char *&StartPtr, std::optional<uint32_t> Lexer::tryReadNamedUCN(

const char *SlashLoc, unsigned &StartOffset, unsigned SlashLoc, Token *Result) {

Token *Result) {

unsigned CharSize; unsigned CharSize;

bool Diagnose = Result && !isLexingRawMode(); bool Diagnose = Result && !isLexingRawMode();

char C = getCharAndSize(StartPtr, CharSize); char C = getCharAndSize(StartOffset, CharSize);

assert(C == 'N' && "expected \\N{...}"); assert(C == 'N' && "expected \\N{...}");

const char *CurPtr = StartPtr + CharSize; unsigned CurOffset = StartOffset + CharSize;

const char *KindLoc = &CurPtr[-1]; unsigned KindLoc = CurOffset - 1;

C = getCharAndSize(CurPtr, CharSize); C = getCharAndSize(CurOffset, CharSize);

if (C != '{') { if (C != '{') {

if (Diagnose) if (Diagnose)

Diag(SlashLoc, diag::warn_ucn_escape_incomplete); Diag(SlashLoc, diag::warn_ucn_escape_incomplete);

return std::nullopt; return std::nullopt;

} }

CurPtr += CharSize; CurOffset += CharSize;

const char *StartName = CurPtr; unsigned StartName = CurOffset;

bool FoundEndDelimiter = false; bool FoundEndDelimiter = false;

llvm::SmallVector<char, 30> Buffer; llvm::SmallVector<char, 30> Buffer;

while (C) { while (C) {

C = getCharAndSize(CurPtr, CharSize); C = getCharAndSize(CurOffset, CharSize);

CurPtr += CharSize; CurOffset += CharSize;

if (C == '}') { if (C == '}') {

FoundEndDelimiter = true; FoundEndDelimiter = true;

break; break;

} }

if (isVerticalWhitespace(C)) if (isVerticalWhitespace(C))

break; break;

Buffer.push_back(C); Buffer.push_back(C);

} }

if (!FoundEndDelimiter || Buffer.empty()) { if (!FoundEndDelimiter || Buffer.empty()) {

if (Diagnose) if (Diagnose)

Diag(SlashLoc, FoundEndDelimiter ? diag::warn_delimited_ucn_empty Diag(SlashLoc, FoundEndDelimiter ? diag::warn_delimited_ucn_empty

: diag::warn_delimited_ucn_incomplete) : diag::warn_delimited_ucn_incomplete)

<< StringRef(KindLoc, 1); << StringRef(BufferStart + KindLoc, 1);

return std::nullopt; return std::nullopt;

} }

StringRef Name(Buffer.data(), Buffer.size()); StringRef Name(Buffer.data(), Buffer.size());

std::optional<char32_t> Match = std::optional<char32_t> Match =

llvm::sys::unicode::nameToCodepointStrict(Name); llvm::sys::unicode::nameToCodepointStrict(Name);

std::optional<llvm::sys::unicode::LooseMatchingResult> LooseMatch; std::optional<llvm::sys::unicode::LooseMatchingResult> LooseMatch;

if (!Match) { if (!Match) {

LooseMatch = llvm::sys::unicode::nameToCodepointLooseMatching(Name); LooseMatch = llvm::sys::unicode::nameToCodepointLooseMatching(Name);

if (Diagnose) { if (Diagnose) {

Diag(StartName, diag::err_invalid_ucn_name) Diag(StartName, diag::err_invalid_ucn_name)

<< StringRef(Buffer.data(), Buffer.size()) << StringRef(Buffer.data(), Buffer.size())

<< makeCharRange(*this, StartName, CurPtr - CharSize); << makeCharRange(*this, StartName, CurOffset - CharSize);

if (LooseMatch) { if (LooseMatch) {

Diag(StartName, diag::note_invalid_ucn_name_loose_matching) Diag(StartName, diag::note_invalid_ucn_name_loose_matching)

<< FixItHint::CreateReplacement( << FixItHint::CreateReplacement(

makeCharRange(*this, StartName, CurPtr - CharSize), makeCharRange(*this, StartName, CurOffset - CharSize),

LooseMatch->Name); LooseMatch->Name);

} }

// We do not offer misspelled character names suggestions here // We do not offer misspelled character names suggestions here

// as the set of what would be a valid suggestion depends on context, // as the set of what would be a valid suggestion depends on context,

// and we should not make invalid suggestions. // and we should not make invalid suggestions.

} }

[10 lines hidden] std::optional<uint32_t> Lexer::tryReadNamedUCN(

if (LooseMatch && Diagnose) if (LooseMatch && Diagnose)

Match = LooseMatch->CodePoint; Match = LooseMatch->CodePoint;

if (Result) { if (Result) {

Result->setFlag(Token::HasUCN); Result->setFlag(Token::HasUCN);

// If the UCN contains either a trigraph or a line splicing, // If the UCN contains either a trigraph or a line splicing,

// we need to call getAndAdvanceChar again to set the appropriate flags // we need to call getAndAdvanceChar again to set the appropriate flags

// on Result. // on Result.

if (CurPtr - StartPtr == (ptrdiff_t)(Buffer.size() + 3)) if (CurOffset - StartOffset == (ptrdiff_t)(Buffer.size() + 3))

StartPtr = CurPtr; StartOffset = CurOffset;

else else

while (StartPtr != CurPtr) while (StartOffset != CurOffset)

(void)getAndAdvanceChar(StartPtr, *Result); (void)getAndAdvanceChar(StartOffset, *Result);

} else { } else {

StartPtr = CurPtr; StartOffset = CurOffset;

} }

return Match ? std::optional<uint32_t>(*Match) : std::nullopt; return Match ? std::optional<uint32_t>(*Match) : std::nullopt;

} }

uint32_t Lexer::tryReadUCN(const char *&StartPtr, const char *SlashLoc, uint32_t Lexer::tryReadUCN(unsigned &StartOffset, unsigned SlashLoc,

Token *Result) { Token *Result) {

unsigned CharSize; unsigned CharSize;

std::optional<uint32_t> CodePointOpt; std::optional<uint32_t> CodePointOpt;

char Kind = getCharAndSize(StartPtr, CharSize); char Kind = getCharAndSize(StartOffset, CharSize);

if (Kind == 'u' || Kind == 'U') if (Kind == 'u' || Kind == 'U')

CodePointOpt = tryReadNumericUCN(StartPtr, SlashLoc, Result); CodePointOpt = tryReadNumericUCN(StartOffset, SlashLoc, Result);

else if (Kind == 'N') else if (Kind == 'N')

CodePointOpt = tryReadNamedUCN(StartPtr, SlashLoc, Result); CodePointOpt = tryReadNamedUCN(StartOffset, SlashLoc, Result);

if (!CodePointOpt) if (!CodePointOpt)

return 0; return 0;

uint32_t CodePoint = *CodePointOpt; uint32_t CodePoint = *CodePointOpt;

// Don't apply C family restrictions to UCNs in assembly mode // Don't apply C family restrictions to UCNs in assembly mode

if (LangOpts.AsmPreprocessor) if (LangOpts.AsmPreprocessor)

[13 lines hidden] uint32_t Lexer::tryReadUCN(unsigned &StartOffset, unsigned SlashLoc,

if (CodePoint < 0xA0) { if (CodePoint < 0xA0) {

if (CodePoint == 0x24 || CodePoint == 0x40 || CodePoint == 0x60) if (CodePoint == 0x24 || CodePoint == 0x40 || CodePoint == 0x60)

return CodePoint; return CodePoint;

// We don't use isLexingRawMode() here because we need to warn about bad // We don't use isLexingRawMode() here because we need to warn about bad

// UCNs even when skipping preprocessing tokens in a #if block. // UCNs even when skipping preprocessing tokens in a #if block.

if (Result && PP) { if (Result && PP) {

if (CodePoint < 0x20 || CodePoint >= 0x7F) if (CodePoint < 0x20 || CodePoint >= 0x7F)

Diag(BufferPtr, diag::err_ucn_control_character); Diag(BufferOffset, diag::err_ucn_control_character);

else { else {

char C = static_cast<char>(CodePoint); char C = static_cast<char>(CodePoint);

Diag(BufferPtr, diag::err_ucn_escape_basic_scs) << StringRef(&C, 1); Diag(BufferOffset, diag::err_ucn_escape_basic_scs) << StringRef(&C, 1);

} }

return 0; return 0;

} else if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF) { } else if (CodePoint >= 0xD800 && CodePoint <= 0xDFFF) {

// C++03 allows UCNs representing surrogate characters. C99 and C++11 don't. // C++03 allows UCNs representing surrogate characters. C99 and C++11 don't.

// We don't use isLexingRawMode() here because we need to diagnose bad // We don't use isLexingRawMode() here because we need to diagnose bad

// UCNs even when skipping preprocessing tokens in a #if block. // UCNs even when skipping preprocessing tokens in a #if block.

if (Result && PP) { if (Result && PP) {

if (LangOpts.CPlusPlus && !LangOpts.CPlusPlus11) if (LangOpts.CPlusPlus && !LangOpts.CPlusPlus11)

Diag(BufferPtr, diag::warn_ucn_escape_surrogate); Diag(BufferOffset, diag::warn_ucn_escape_surrogate);

else else

Diag(BufferPtr, diag::err_ucn_escape_invalid); Diag(BufferOffset, diag::err_ucn_escape_invalid);

} }

return 0; return 0;

} }

return CodePoint; return CodePoint;

} }

bool Lexer::CheckUnicodeWhitespace(Token &Result, uint32_t C, bool Lexer::CheckUnicodeWhitespace(Token &Result, uint32_t C,

const char *CurPtr) { unsigned CurOffset) {

if (!isLexingRawMode() && !PP->isPreprocessedOutput() && if (!isLexingRawMode() && !PP->isPreprocessedOutput() &&

isUnicodeWhitespace(C)) { isUnicodeWhitespace(C)) {

Diag(BufferPtr, diag::ext_unicode_whitespace) Diag(BufferOffset, diag::ext_unicode_whitespace)

<< makeCharRange(*this, BufferPtr, CurPtr); << makeCharRange(*this, BufferOffset, CurOffset);

Result.setFlag(Token::LeadingSpace); Result.setFlag(Token::LeadingSpace);

return true; return true;

} }

return false; return false;

} }

void Lexer::PropagateLineStartLeadingSpaceInfo(Token &Result) { void Lexer::PropagateLineStartLeadingSpaceInfo(Token &Result) {

[41 lines hidden]

/// token, not a normal token, as such, it is an internal interface. It assumes /// token, not a normal token, as such, it is an internal interface. It assumes

/// that the Flags of result have been cleared before calling this. /// that the Flags of result have been cleared before calling this.

bool Lexer::LexTokenInternal(Token &Result, bool TokAtPhysicalStartOfLine) { bool Lexer::LexTokenInternal(Token &Result, bool TokAtPhysicalStartOfLine) {

LexStart: LexStart:

assert(!Result.needsCleaning() && "Result needs cleaning"); assert(!Result.needsCleaning() && "Result needs cleaning");

assert(!Result.hasPtrData() && "Result has not been reset"); assert(!Result.hasPtrData() && "Result has not been reset");

// CurPtr - Cache BufferPtr in an automatic variable. // CurOffset - Cache BufferOffset in an automatic variable.

const char *CurPtr = BufferPtr; unsigned CurOffset = BufferOffset;

// Small amounts of horizontal whitespace is very common between tokens. // Small amounts of horizontal whitespace is very common between tokens.

if (isHorizontalWhitespace(*CurPtr)) { if (isHorizontalWhitespace(BufferStart[CurOffset])) {

do { do {

++CurPtr; ++CurOffset;

} while (isHorizontalWhitespace(*CurPtr)); } while (isHorizontalWhitespace(BufferStart[CurOffset]));

davrec:

    for (isHorizontalWhitespace(BufferStart[++CurOffset]);;)
      ;

might save a few instructions? Worth trying since this function is the main perf-critical one.

davrec: ^ Ignore, erroneous :)
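
For context on the thread above: the retracted `for` puts the whitespace test in the init clause and leaves the loop condition empty, so it would never terminate. The following is a minimal standalone sketch of the offset-based skip, not the patch's code. `isHorizontalWs` and `skipHorizontalWhitespace` are hypothetical names, and a `std::string` stands in for the lexer buffer. The sketch also illustrates the patch's premise that an offset stays valid after the buffer grows, whereas a cached `const char *` could dangle after reallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for clang's isHorizontalWhitespace().
static bool isHorizontalWs(char C) {
  return C == ' ' || C == '\t' || C == '\f' || C == '\v';
}

// Advance Offset past horizontal whitespace, indexing from the buffer start
// on every access instead of caching a pointer into the buffer.
// Assumption: an explicit bounds check is used here; the real lexer relies on
// the terminating NUL, which is not whitespace, to stop the loop.
static std::size_t skipHorizontalWhitespace(const std::string &Buffer,
                                            std::size_t Offset) {
  assert(Offset <= Buffer.size() && "offset past the end of the buffer");
  while (Offset < Buffer.size() && isHorizontalWs(Buffer[Offset]))
    ++Offset;
  return Offset;
}
```

A caller can hold the returned offset, append more input to the buffer (which may move its storage), and keep indexing with the same offset, since code is only ever added at the back.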

// If we are keeping whitespace and other tokens, just return what we just // If we are keeping whitespace and other tokens, just return what we just

// skipped. The next lexer invocation will return the token after the // skipped. The next lexer invocation will return the token after the

// whitespace. // whitespace.

if (isKeepWhitespaceMode()) { if (isKeepWhitespaceMode()) {

FormTokenWithChars(Result, CurPtr, tok::unknown); FormTokenWithChars(Result, CurOffset, tok::unknown);

// FIXME: The next token will not have LeadingSpace set. // FIXME: The next token will not have LeadingSpace set.

return true; return true;

} }

BufferPtr = CurPtr; BufferOffset = CurOffset;

Result.setFlag(Token::LeadingSpace); Result.setFlag(Token::LeadingSpace);

} }

unsigned SizeTmp, SizeTmp2; // Temporaries for use in cases below. unsigned SizeTmp, SizeTmp2; // Temporaries for use in cases below.

// Read a character, advancing over it. // Read a character, advancing over it.

char Char = getAndAdvanceChar(CurPtr, Result); char Char = getAndAdvanceChar(CurOffset, Result);

tok::TokenKind Kind; tok::TokenKind Kind;

if (!isVerticalWhitespace(Char)) if (!isVerticalWhitespace(Char))

NewLinePtr = nullptr; NewLineOffset = std::nullopt;

switch (Char) { switch (Char) {

case 0: // Null. case 0: // Null.

// Found end of file? // Found end of file?

if (CurPtr-1 == BufferEnd) if (CurOffset - 1 == BufferSize)

return LexEndOfFile(Result, CurPtr-1); return LexEndOfFile(Result, CurOffset - 1);

// Check if we are performing code completion. // Check if we are performing code completion.

if (isCodeCompletionPoint(CurPtr-1)) { if (isCodeCompletionPoint(CurOffset - 1)) {

// Return the code-completion token. // Return the code-completion token.

Result.startToken(); Result.startToken();

FormTokenWithChars(Result, CurPtr, tok::code_completion); FormTokenWithChars(Result, CurOffset, tok::code_completion);

return true; return true;

} }

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(CurPtr-1, diag::null_in_file); Diag(CurOffset - 1, diag::null_in_file);

Result.setFlag(Token::LeadingSpace); Result.setFlag(Token::LeadingSpace);

if (SkipWhitespace(Result, CurPtr, TokAtPhysicalStartOfLine)) if (SkipWhitespace(Result, CurOffset, TokAtPhysicalStartOfLine))

return true; // KeepWhitespaceMode return true; // KeepWhitespaceMode

// We know the lexer hasn't changed, so just try again with this lexer. // We know the lexer hasn't changed, so just try again with this lexer.

// (We manually eliminate the tail call to avoid recursion.) // (We manually eliminate the tail call to avoid recursion.)

goto LexNextToken; goto LexNextToken;

case 26: // DOS & CP/M EOF: "^Z". case 26: // DOS & CP/M EOF: "^Z".

// If we're in Microsoft extensions mode, treat this as end of file. // If we're in Microsoft extensions mode, treat this as end of file.

if (LangOpts.MicrosoftExt) { if (LangOpts.MicrosoftExt) {

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(CurPtr-1, diag::ext_ctrl_z_eof_microsoft); Diag(CurOffset - 1, diag::ext_ctrl_z_eof_microsoft);

return LexEndOfFile(Result, CurPtr-1); return LexEndOfFile(Result, CurOffset - 1);

} }

// If Microsoft extensions are disabled, this is just random garbage. // If Microsoft extensions are disabled, this is just random garbage.

Kind = tok::unknown; Kind = tok::unknown;

break; break;

case '\r': case '\r':

if (CurPtr[0] == '\n') if (BufferStart[CurOffset] == '\n')

(void)getAndAdvanceChar(CurPtr, Result); (void)getAndAdvanceChar(CurOffset, Result);

[[fallthrough]]; [[fallthrough]];

case '\n': case '\n':

// If we are inside a preprocessor directive and we see the end of line, // If we are inside a preprocessor directive and we see the end of line,

// we know we are done with the directive, so return an EOD token. // we know we are done with the directive, so return an EOD token.

if (ParsingPreprocessorDirective) { if (ParsingPreprocessorDirective) {

// Done parsing the "line". // Done parsing the "line".

ParsingPreprocessorDirective = false; ParsingPreprocessorDirective = false;

// Restore comment saving mode, in case it was disabled for directive. // Restore comment saving mode, in case it was disabled for directive.

if (PP) if (PP)

resetExtendedTokenMode(); resetExtendedTokenMode();

// Since we consumed a newline, we are back at the start of a line. // Since we consumed a newline, we are back at the start of a line.

IsAtStartOfLine = true; IsAtStartOfLine = true;

IsAtPhysicalStartOfLine = true; IsAtPhysicalStartOfLine = true;

NewLinePtr = CurPtr - 1; NewLineOffset = CurOffset - 1;

Kind = tok::eod; Kind = tok::eod;

break; break;

} }

// No leading whitespace seen so far. // No leading whitespace seen so far.

Result.clearFlag(Token::LeadingSpace); Result.clearFlag(Token::LeadingSpace);

if (SkipWhitespace(Result, CurPtr, TokAtPhysicalStartOfLine)) if (SkipWhitespace(Result, CurOffset, TokAtPhysicalStartOfLine))

return true; // KeepWhitespaceMode return true; // KeepWhitespaceMode

// We only saw whitespace, so just try again with this lexer. // We only saw whitespace, so just try again with this lexer.

// (We manually eliminate the tail call to avoid recursion.) // (We manually eliminate the tail call to avoid recursion.)

goto LexNextToken; goto LexNextToken;

case ' ': case ' ':

case '\t': case '\t':

case '\f': case '\f':

case '\v': case '\v':

SkipHorizontalWhitespace: SkipHorizontalWhitespace:

Result.setFlag(Token::LeadingSpace); Result.setFlag(Token::LeadingSpace);

if (SkipWhitespace(Result, CurPtr, TokAtPhysicalStartOfLine)) if (SkipWhitespace(Result, CurOffset, TokAtPhysicalStartOfLine))

return true; // KeepWhitespaceMode return true; // KeepWhitespaceMode

SkipIgnoredUnits: SkipIgnoredUnits:

CurPtr = BufferPtr; CurOffset = BufferOffset;

// If the next token is obviously a // or /* */ comment, skip it efficiently // If the next token is obviously a // or /* */ comment, skip it efficiently

// too (without going through the big switch stmt). // too (without going through the big switch stmt).

if (CurPtr[0] == '/' && CurPtr[1] == '/' && !inKeepCommentMode() && if (BufferStart[CurOffset] == '/' && BufferStart[CurOffset + 1] == '/' &&

LineComment && (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP)) { !inKeepCommentMode() && LineComment &&

if (SkipLineComment(Result, CurPtr+2, TokAtPhysicalStartOfLine)) (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP)) {

if (SkipLineComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))

return true; // There is a token to return. return true; // There is a token to return.

goto SkipIgnoredUnits; goto SkipIgnoredUnits;

} else if (CurPtr[0] == '/' && CurPtr[1] == '*' && !inKeepCommentMode()) { } else if (BufferStart[CurOffset] == '/' &&

if (SkipBlockComment(Result, CurPtr+2, TokAtPhysicalStartOfLine)) BufferStart[CurOffset + 1] == '*' && !inKeepCommentMode()) {

if (SkipBlockComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))

return true; // There is a token to return. return true; // There is a token to return.

goto SkipIgnoredUnits; goto SkipIgnoredUnits;

} else if (isHorizontalWhitespace(*CurPtr)) { } else if (isHorizontalWhitespace(BufferStart[CurOffset])) {

goto SkipHorizontalWhitespace; goto SkipHorizontalWhitespace;

} }

davrec:

Spitballing again for possible minor perf improvements:

if (char Char0 = BufferStart[CurOffset];
    Char0 == '/' && !inKeepCommentMode()) {
  if (char Char1 = BufferStart[CurOffset + 1];
      Char1 == '/' && LineComment &&
      (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP)) {
    if (SkipLineComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))
      return true; // There is a token to return.
    goto SkipIgnoredUnits;
  } else if (Char1 == '*') {
    if (SkipBlockComment(Result, CurOffset + 2, TokAtPhysicalStartOfLine))
      return true; // There is a token to return.
    goto SkipIgnoredUnits;
  }
} else if (isHorizontalWhitespace(Char0)) {
  goto SkipHorizontalWhitespace;
}
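
As a rough check on the idea in this comment (load each character once, then branch), the dispatch can be modeled as a pure function and compared against the three-way `if`/`else if` chain in the diff. Everything here is an assumption made for illustration: `IgnoredUnit`, `SkipOptions`, and `classifyIgnoredUnit` are hypothetical names, and the two flags stand in for `!inKeepCommentMode()` and for `LineComment && (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP)`. The real lexer jumps to labels rather than returning a value.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type; the lexer itself uses goto instead.
enum class IgnoredUnit { LineComment, BlockComment, HorizontalWhitespace, None };

// Hypothetical flags standing in for the lexer state consulted by the diff.
struct SkipOptions {
  bool KeepComments = false;       // models inKeepCommentMode()
  bool LineCommentsEnabled = true; // models the LineComment/LangOpts check
};

// Classify what starts at BufferStart[Offset], reading each character once.
// BufferStart must point to a NUL-terminated buffer.
inline IgnoredUnit classifyIgnoredUnit(const char *BufferStart,
                                       std::size_t Offset,
                                       const SkipOptions &Opts) {
  const char Char0 = BufferStart[Offset];
  if (Char0 == '/' && !Opts.KeepComments) {
    // Char0 is not the terminator, so Offset + 1 is still inside the buffer.
    const char Char1 = BufferStart[Offset + 1];
    if (Char1 == '/' && Opts.LineCommentsEnabled)
      return IgnoredUnit::LineComment;
    if (Char1 == '*')
      return IgnoredUnit::BlockComment;
    return IgnoredUnit::None;
  }
  if (Char0 == ' ' || Char0 == '\t' || Char0 == '\f' || Char0 == '\v')
    return IgnoredUnit::HorizontalWhitespace;
  return IgnoredUnit::None;
}
```

Under these assumptions the function gives the same classification as the original chain for every input, including `//` with line comments disabled, which falls through to the normal token path. Whether the restructuring saves instructions in practice would need measurement.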

// We only saw whitespace, so just try again with this lexer. // We only saw whitespace, so just try again with this lexer.

// (We manually eliminate the tail call to avoid recursion.) // (We manually eliminate the tail call to avoid recursion.)

goto LexNextToken; goto LexNextToken;

// C99 6.4.4.1: Integer Constants. // C99 6.4.4.1: Integer Constants.

// C99 6.4.4.2: Floating Constants. // C99 6.4.4.2: Floating Constants.

case '0': case '1': case '2': case '3': case '4': case '0': case '1': case '2': case '3': case '4':

case '5': case '6': case '7': case '8': case '9': case '5': case '6': case '7': case '8': case '9':

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

return LexNumericConstant(Result, CurPtr); return LexNumericConstant(Result, CurOffset);

// Identifier (e.g., uber), or // Identifier (e.g., uber), or

// UTF-8 (C2x/C++17) or UTF-16 (C11/C++11) character literal, or // UTF-8 (C2x/C++17) or UTF-16 (C11/C++11) character literal, or

// UTF-8 or UTF-16 string literal (C11/C++11). // UTF-8 or UTF-16 string literal (C11/C++11).

case 'u': case 'u':

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

if (LangOpts.CPlusPlus11 || LangOpts.C11) { if (LangOpts.CPlusPlus11 || LangOpts.C11) {

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

// UTF-16 string literal // UTF-16 string literal

if (Char == '"') if (Char == '"')

return LexStringLiteral(Result, ConsumeChar(CurPtr, SizeTmp, Result), return LexStringLiteral(Result, ConsumeChar(CurOffset, SizeTmp, Result),

tok::utf16_string_literal); tok::utf16_string_literal);

// UTF-16 character constant // UTF-16 character constant

if (Char == '\'') if (Char == '\'')

return LexCharConstant(Result, ConsumeChar(CurPtr, SizeTmp, Result), return LexCharConstant(Result, ConsumeChar(CurOffset, SizeTmp, Result),

tok::utf16_char_constant); tok::utf16_char_constant);

// UTF-16 raw string literal // UTF-16 raw string literal

if (Char == 'R' && LangOpts.CPlusPlus11 && if (Char == 'R' && LangOpts.CPlusPlus11 &&

getCharAndSize(CurPtr + SizeTmp, SizeTmp2) == '"') getCharAndSize(CurOffset + SizeTmp, SizeTmp2) == '"')

return LexRawStringLiteral(Result, return LexRawStringLiteral(

ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), Result,

SizeTmp2, Result), ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result), SizeTmp2,

Result),

tok::utf16_string_literal); tok::utf16_string_literal);

if (Char == '8') { if (Char == '8') {

char Char2 = getCharAndSize(CurPtr + SizeTmp, SizeTmp2); char Char2 = getCharAndSize(CurOffset + SizeTmp, SizeTmp2);

// UTF-8 string literal // UTF-8 string literal

if (Char2 == '"') if (Char2 == '"')

return LexStringLiteral(Result, return LexStringLiteral(

ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), Result,

SizeTmp2, Result), ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result), SizeTmp2,

Result),

tok::utf8_string_literal); tok::utf8_string_literal);

if (Char2 == '\'' && (LangOpts.CPlusPlus17 || LangOpts.C2x)) if (Char2 == '\'' && (LangOpts.CPlusPlus17 || LangOpts.C2x))

return LexCharConstant( return LexCharConstant(

Result, ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), Result,

SizeTmp2, Result), ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result), SizeTmp2,

Result),

tok::utf8_char_constant); tok::utf8_char_constant);

if (Char2 == 'R' && LangOpts.CPlusPlus11) { if (Char2 == 'R' && LangOpts.CPlusPlus11) {

unsigned SizeTmp3; unsigned SizeTmp3;

char Char3 = getCharAndSize(CurPtr + SizeTmp + SizeTmp2, SizeTmp3); char Char3 = getCharAndSize(CurOffset + SizeTmp + SizeTmp2, SizeTmp3);

// UTF-8 raw string literal // UTF-8 raw string literal

if (Char3 == '"') { if (Char3 == '"') {

return LexRawStringLiteral(Result, return LexRawStringLiteral(

ConsumeChar(ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), Result,

ConsumeChar(

ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result),

SizeTmp2, Result), SizeTmp2, Result),

SizeTmp3, Result), SizeTmp3, Result),

tok::utf8_string_literal); tok::utf8_string_literal);

} }

// treat u like the start of an identifier. // treat u like the start of an identifier.

return LexIdentifierContinue(Result, CurPtr); return LexIdentifierContinue(Result, CurOffset);

case 'U': // Identifier (e.g. Uber) or C11/C++11 UTF-32 string literal case 'U': // Identifier (e.g. Uber) or C11/C++11 UTF-32 string literal

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

if (LangOpts.CPlusPlus11 || LangOpts.C11) { if (LangOpts.CPlusPlus11 || LangOpts.C11) {

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

// UTF-32 string literal // UTF-32 string literal

if (Char == '"') if (Char == '"')

return LexStringLiteral(Result, ConsumeChar(CurPtr, SizeTmp, Result), return LexStringLiteral(Result, ConsumeChar(CurOffset, SizeTmp, Result),

tok::utf32_string_literal); tok::utf32_string_literal);

// UTF-32 character constant // UTF-32 character constant

if (Char == '\'') if (Char == '\'')

return LexCharConstant(Result, ConsumeChar(CurPtr, SizeTmp, Result), return LexCharConstant(Result, ConsumeChar(CurOffset, SizeTmp, Result),

tok::utf32_char_constant); tok::utf32_char_constant);

// UTF-32 raw string literal // UTF-32 raw string literal

if (Char == 'R' && LangOpts.CPlusPlus11 && if (Char == 'R' && LangOpts.CPlusPlus11 &&

getCharAndSize(CurPtr + SizeTmp, SizeTmp2) == '"') getCharAndSize(CurOffset + SizeTmp, SizeTmp2) == '"')

return LexRawStringLiteral(Result, return LexRawStringLiteral(

ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), Result,

SizeTmp2, Result), ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result), SizeTmp2,

Result),

tok::utf32_string_literal); tok::utf32_string_literal);

} }

// treat U like the start of an identifier. // treat U like the start of an identifier.

return LexIdentifierContinue(Result, CurPtr); return LexIdentifierContinue(Result, CurOffset);

case 'R': // Identifier or C++0x raw string literal case 'R': // Identifier or C++0x raw string literal

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

if (LangOpts.CPlusPlus11) { if (LangOpts.CPlusPlus11) {

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '"') if (Char == '"')

return LexRawStringLiteral(Result, return LexRawStringLiteral(Result,

ConsumeChar(CurPtr, SizeTmp, Result), ConsumeChar(CurOffset, SizeTmp, Result),

tok::string_literal); tok::string_literal);

} }

// treat R like the start of an identifier. // treat R like the start of an identifier.

return LexIdentifierContinue(Result, CurPtr); return LexIdentifierContinue(Result, CurOffset);

case 'L': // Identifier (Loony) or wide literal (L'x' or L"xyz"). case 'L': // Identifier (Loony) or wide literal (L'x' or L"xyz").

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

// Wide string literal. // Wide string literal.

if (Char == '"') if (Char == '"')

return LexStringLiteral(Result, ConsumeChar(CurPtr, SizeTmp, Result), return LexStringLiteral(Result, ConsumeChar(CurOffset, SizeTmp, Result),

tok::wide_string_literal); tok::wide_string_literal);

// Wide raw string literal. // Wide raw string literal.

if (LangOpts.CPlusPlus11 && Char == 'R' && if (LangOpts.CPlusPlus11 && Char == 'R' &&

getCharAndSize(CurPtr + SizeTmp, SizeTmp2) == '"') getCharAndSize(CurOffset + SizeTmp, SizeTmp2) == '"')

return LexRawStringLiteral(Result, return LexRawStringLiteral(

ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), Result,

SizeTmp2, Result), ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result), SizeTmp2,

Result),

tok::wide_string_literal); tok::wide_string_literal);

// Wide character constant. // Wide character constant.

if (Char == '\'') if (Char == '\'')

return LexCharConstant(Result, ConsumeChar(CurPtr, SizeTmp, Result), return LexCharConstant(Result, ConsumeChar(CurOffset, SizeTmp, Result),

tok::wide_char_constant); tok::wide_char_constant);

// FALL THROUGH, treating L like the start of an identifier. // FALL THROUGH, treating L like the start of an identifier.

[[fallthrough]]; [[fallthrough]];

// C99 6.4.2: Identifiers. // C99 6.4.2: Identifiers.

case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G': case 'A': case 'B': case 'C': case 'D': case 'E': case 'F': case 'G':

case 'H': case 'I': case 'J': case 'K': /*'L'*/case 'M': case 'N': case 'H': case 'I': case 'J': case 'K': /*'L'*/case 'M': case 'N':

case 'O': case 'P': case 'Q': /*'R'*/case 'S': case 'T': /*'U'*/ case 'O': case 'P': case 'Q': /*'R'*/case 'S': case 'T': /*'U'*/

case 'V': case 'W': case 'X': case 'Y': case 'Z': case 'V': case 'W': case 'X': case 'Y': case 'Z':

case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g': case 'a': case 'b': case 'c': case 'd': case 'e': case 'f': case 'g':

case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n': case 'h': case 'i': case 'j': case 'k': case 'l': case 'm': case 'n':

case 'o': case 'p': case 'q': case 'r': case 's': case 't': /*'u'*/ case 'o': case 'p': case 'q': case 'r': case 's': case 't': /*'u'*/

case 'v': case 'w': case 'x': case 'y': case 'z': case 'v': case 'w': case 'x': case 'y': case 'z':

case '_': case '_':

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

return LexIdentifierContinue(Result, CurPtr); return LexIdentifierContinue(Result, CurOffset);

case '$': // $ in identifiers. case '$': // $ in identifiers.

if (LangOpts.DollarIdents) { if (LangOpts.DollarIdents) {

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(CurPtr-1, diag::ext_dollar_in_identifier); Diag(CurOffset - 1, diag::ext_dollar_in_identifier);

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

return LexIdentifierContinue(Result, CurPtr); return LexIdentifierContinue(Result, CurOffset);

} }

Kind = tok::unknown; Kind = tok::unknown;

break; break;

// C99 6.4.4: Character Constants. // C99 6.4.4: Character Constants.

case '\'': case '\'':

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

return LexCharConstant(Result, CurPtr, tok::char_constant); return LexCharConstant(Result, CurOffset, tok::char_constant);

// C99 6.4.5: String Literals. // C99 6.4.5: String Literals.

case '"': case '"':

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

return LexStringLiteral(Result, CurPtr, return LexStringLiteral(Result, CurOffset,

ParsingFilename ? tok::header_name ParsingFilename ? tok::header_name

: tok::string_literal); : tok::string_literal);

// C99 6.4.6: Punctuators. // C99 6.4.6: Punctuators.

case '?': case '?':

Kind = tok::question; Kind = tok::question;

break; break;

case '[': case '[':

[10 lines of context not shown; diff resumes within case ')':]

break; break;

case '{': case '{':

Kind = tok::l_brace; Kind = tok::l_brace;

break; break;

case '}': case '}':

Kind = tok::r_brace; Kind = tok::r_brace;

break; break;

case '.': case '.':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char >= '0' && Char <= '9') { if (Char >= '0' && Char <= '9') {

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

return LexNumericConstant(Result, ConsumeChar(CurPtr, SizeTmp, Result)); return LexNumericConstant(Result,

ConsumeChar(CurOffset, SizeTmp, Result));

} else if (LangOpts.CPlusPlus && Char == '*') { } else if (LangOpts.CPlusPlus && Char == '*') {

Kind = tok::periodstar; Kind = tok::periodstar;

CurPtr += SizeTmp; CurOffset += SizeTmp;

} else if (Char == '.' && } else if (Char == '.' &&

getCharAndSize(CurPtr+SizeTmp, SizeTmp2) == '.') { getCharAndSize(CurOffset + SizeTmp, SizeTmp2) == '.') {

Kind = tok::ellipsis; Kind = tok::ellipsis;

CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), CurOffset = ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result), SizeTmp2,

SizeTmp2, Result); Result);

} else { } else {

Kind = tok::period; Kind = tok::period;

} }

break; break;

case '&': case '&':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '&') { if (Char == '&') {

Kind = tok::ampamp; Kind = tok::ampamp;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else if (Char == '=') { } else if (Char == '=') {

Kind = tok::ampequal; Kind = tok::ampequal;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else { } else {

Kind = tok::amp; Kind = tok::amp;

} }

break; break;

case '*': case '*':

if (getCharAndSize(CurPtr, SizeTmp) == '=') { if (getCharAndSize(CurOffset, SizeTmp) == '=') {

Kind = tok::starequal; Kind = tok::starequal;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else { } else {

Kind = tok::star; Kind = tok::star;

} }

break; break;

case '+': case '+':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '+') { if (Char == '+') {

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::plusplus; Kind = tok::plusplus;

} else if (Char == '=') { } else if (Char == '=') {

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::plusequal; Kind = tok::plusequal;

} else { } else {

Kind = tok::plus; Kind = tok::plus;

} }

break; break;

case '-': case '-':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '-') { // -- if (Char == '-') { // --

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::minusminus; Kind = tok::minusminus;

} else if (Char == '>' && LangOpts.CPlusPlus && } else if (Char == '>' && LangOpts.CPlusPlus &&

getCharAndSize(CurPtr+SizeTmp, SizeTmp2) == '*') { // C++ ->* getCharAndSize(CurOffset + SizeTmp, SizeTmp2) ==

CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), '*') { // C++ ->*

SizeTmp2, Result); CurOffset = ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result), SizeTmp2,

Result);

Kind = tok::arrowstar; Kind = tok::arrowstar;

} else if (Char == '>') { // -> } else if (Char == '>') { // ->

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::arrow; Kind = tok::arrow;

} else if (Char == '=') { // -= } else if (Char == '=') { // -=

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::minusequal; Kind = tok::minusequal;

} else { } else {

Kind = tok::minus; Kind = tok::minus;

} }

break; break;

case '~': case '~':

Kind = tok::tilde; Kind = tok::tilde;

break; break;

case '!': case '!':

if (getCharAndSize(CurPtr, SizeTmp) == '=') { if (getCharAndSize(CurOffset, SizeTmp) == '=') {

Kind = tok::exclaimequal; Kind = tok::exclaimequal;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else { } else {

Kind = tok::exclaim; Kind = tok::exclaim;

} }

break; break;

case '/': case '/':

// 6.4.9: Comments // 6.4.9: Comments

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '/') { // Line comment. if (Char == '/') { // Line comment.

// Even if Line comments are disabled (e.g. in C89 mode), we generally // Even if Line comments are disabled (e.g. in C89 mode), we generally

// want to lex this as a comment. There is one problem with this though, // want to lex this as a comment. There is one problem with this though,

// that in one particular corner case, this can change the behavior of the // that in one particular corner case, this can change the behavior of the

// resultant program. For example, In "foo //**/ bar", C89 would lex // resultant program. For example, In "foo //**/ bar", C89 would lex

// this as "foo / bar" and languages with Line comments would lex it as // this as "foo / bar" and languages with Line comments would lex it as

// "foo". Check to see if the character after the second slash is a '*'. // "foo". Check to see if the character after the second slash is a '*'.

// If so, we will lex that as a "/" instead of the start of a comment. // If so, we will lex that as a "/" instead of the start of a comment.

// However, we never do this if we are just preprocessing. // However, we never do this if we are just preprocessing.

bool TreatAsComment = bool TreatAsComment =

LineComment && (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP); LineComment && (LangOpts.CPlusPlus || !LangOpts.TraditionalCPP);

if (!TreatAsComment) if (!TreatAsComment)

if (!(PP && PP->isPreprocessedOutput())) if (!(PP && PP->isPreprocessedOutput()))

TreatAsComment = getCharAndSize(CurPtr+SizeTmp, SizeTmp2) != '*'; TreatAsComment = getCharAndSize(CurOffset + SizeTmp, SizeTmp2) != '*';

if (TreatAsComment) { if (TreatAsComment) {

if (SkipLineComment(Result, ConsumeChar(CurPtr, SizeTmp, Result), if (SkipLineComment(Result, ConsumeChar(CurOffset, SizeTmp, Result),

TokAtPhysicalStartOfLine)) TokAtPhysicalStartOfLine))

return true; // There is a token to return. return true; // There is a token to return.

// It is common for the tokens immediately after a // comment to be // It is common for the tokens immediately after a // comment to be

// whitespace (indentation for the next line). Instead of going through // whitespace (indentation for the next line). Instead of going through

// the big switch, handle it efficiently now. // the big switch, handle it efficiently now.

goto SkipIgnoredUnits; goto SkipIgnoredUnits;

} }

if (Char == '*') { // /**/ comment. if (Char == '*') { // /**/ comment.

if (SkipBlockComment(Result, ConsumeChar(CurPtr, SizeTmp, Result), if (SkipBlockComment(Result, ConsumeChar(CurOffset, SizeTmp, Result),

TokAtPhysicalStartOfLine)) TokAtPhysicalStartOfLine))

return true; // There is a token to return. return true; // There is a token to return.

// We only saw whitespace, so just try again with this lexer. // We only saw whitespace, so just try again with this lexer.

// (We manually eliminate the tail call to avoid recursion.) // (We manually eliminate the tail call to avoid recursion.)

goto LexNextToken; goto LexNextToken;

} }

if (Char == '=') { if (Char == '=') {

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::slashequal; Kind = tok::slashequal;

} else { } else {

Kind = tok::slash; Kind = tok::slash;

} }

break; break;

case '%': case '%':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '=') { if (Char == '=') {

Kind = tok::percentequal; Kind = tok::percentequal;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else if (LangOpts.Digraphs && Char == '>') { } else if (LangOpts.Digraphs && Char == '>') {

Kind = tok::r_brace; // '%>' -> '}' Kind = tok::r_brace; // '%>' -> '}'

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else if (LangOpts.Digraphs && Char == ':') { } else if (LangOpts.Digraphs && Char == ':') {

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '%' && getCharAndSize(CurPtr+SizeTmp, SizeTmp2) == ':') { if (Char == '%' && getCharAndSize(CurOffset + SizeTmp, SizeTmp2) == ':') {

Kind = tok::hashhash; // '%:%:' -> '##' Kind = tok::hashhash; // '%:%:' -> '##'

CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), CurOffset = ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result),

SizeTmp2, Result); SizeTmp2, Result);

} else if (Char == '@' && LangOpts.MicrosoftExt) {// %:@ -> #@ -> Charize } else if (Char == '@' && LangOpts.MicrosoftExt) { // %:@ -> #@ -> Charize

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(BufferPtr, diag::ext_charize_microsoft); Diag(BufferOffset, diag::ext_charize_microsoft);

Kind = tok::hashat; Kind = tok::hashat;

} else { // '%:' -> '#' } else { // '%:' -> '#'

// We parsed a # character. If this occurs at the start of the line, // We parsed a # character. If this occurs at the start of the line,

// it's actually the start of a preprocessing directive. Callback to // it's actually the start of a preprocessing directive. Callback to

// the preprocessor to handle it. // the preprocessor to handle it.

// TODO: -fpreprocessed mode?? // TODO: -fpreprocessed mode??

if (TokAtPhysicalStartOfLine && !LexingRawMode && !Is_PragmaLexer) if (TokAtPhysicalStartOfLine && !LexingRawMode && !Is_PragmaLexer)

goto HandleDirective; goto HandleDirective;

Kind = tok::hash; Kind = tok::hash;

} }

} else { } else {

Kind = tok::percent; Kind = tok::percent;

} }

break; break;

case '<': case '<':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (ParsingFilename) { if (ParsingFilename) {

return LexAngledStringLiteral(Result, CurPtr); return LexAngledStringLiteral(Result, CurOffset);

} else if (Char == '<') { } else if (Char == '<') {

char After = getCharAndSize(CurPtr+SizeTmp, SizeTmp2); char After = getCharAndSize(CurOffset + SizeTmp, SizeTmp2);

if (After == '=') { if (After == '=') {

Kind = tok::lesslessequal; Kind = tok::lesslessequal;

CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), CurOffset = ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result),

SizeTmp2, Result); SizeTmp2, Result);

} else if (After == '<' && IsStartOfConflictMarker(CurPtr-1)) { } else if (After == '<' && IsStartOfConflictMarker(CurOffset - 1)) {

// If this is actually a '<<<<<<<' version control conflict marker, // If this is actually a '<<<<<<<' version control conflict marker,

// recognize it as such and recover nicely. // recognize it as such and recover nicely.

goto LexNextToken; goto LexNextToken;

} else if (After == '<' && HandleEndOfConflictMarker(CurPtr-1)) { } else if (After == '<' && HandleEndOfConflictMarker(CurOffset - 1)) {

// If this is '<<<<' and we're in a Perforce-style conflict marker, // If this is '<<<<' and we're in a Perforce-style conflict marker,

// ignore it. // ignore it.

goto LexNextToken; goto LexNextToken;

} else if (LangOpts.CUDA && After == '<') { } else if (LangOpts.CUDA && After == '<') {

Kind = tok::lesslessless; Kind = tok::lesslessless;

CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), CurOffset = ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result),

SizeTmp2, Result); SizeTmp2, Result);

} else { } else {

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::lessless; Kind = tok::lessless;

} }

} else if (Char == '=') { } else if (Char == '=') {

char After = getCharAndSize(CurPtr+SizeTmp, SizeTmp2); char After = getCharAndSize(CurOffset + SizeTmp, SizeTmp2);

if (After == '>') { if (After == '>') {

if (LangOpts.CPlusPlus20) { if (LangOpts.CPlusPlus20) {

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(BufferPtr, diag::warn_cxx17_compat_spaceship); Diag(BufferOffset, diag::warn_cxx17_compat_spaceship);

CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), CurOffset = ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result),

SizeTmp2, Result); SizeTmp2, Result);

Kind = tok::spaceship; Kind = tok::spaceship;

break; break;

} }

// Suggest adding a space between the '<=' and the '>' to avoid a // Suggest adding a space between the '<=' and the '>' to avoid a

// change in semantics if this turns up in C++ <=17 mode. // change in semantics if this turns up in C++ <=17 mode.

if (LangOpts.CPlusPlus && !isLexingRawMode()) { if (LangOpts.CPlusPlus && !isLexingRawMode()) {

Diag(BufferPtr, diag::warn_cxx20_compat_spaceship) Diag(BufferOffset, diag::warn_cxx20_compat_spaceship)

<< FixItHint::CreateInsertion( << FixItHint::CreateInsertion(

getSourceLocation(CurPtr + SizeTmp, SizeTmp2), " "); getSourceLocation(CurOffset + SizeTmp, SizeTmp2), " ");

} }

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::lessequal; Kind = tok::lessequal;

} else if (LangOpts.Digraphs && Char == ':') { // '<:' -> '[' } else if (LangOpts.Digraphs && Char == ':') { // '<:' -> '['

if (LangOpts.CPlusPlus11 && if (LangOpts.CPlusPlus11 &&

getCharAndSize(CurPtr + SizeTmp, SizeTmp2) == ':') { getCharAndSize(CurOffset + SizeTmp, SizeTmp2) == ':') {

// C++0x [lex.pptoken]p3: // C++0x [lex.pptoken]p3:

// Otherwise, if the next three characters are <:: and the subsequent // Otherwise, if the next three characters are <:: and the subsequent

// character is neither : nor >, the < is treated as a preprocessor // character is neither : nor >, the < is treated as a preprocessor

// token by itself and not as the first character of the alternative // token by itself and not as the first character of the alternative

// token <:. // token <:.

unsigned SizeTmp3; unsigned SizeTmp3;

char After = getCharAndSize(CurPtr + SizeTmp + SizeTmp2, SizeTmp3); char After = getCharAndSize(CurOffset + SizeTmp + SizeTmp2, SizeTmp3);

if (After != ':' && After != '>') { if (After != ':' && After != '>') {

Kind = tok::less; Kind = tok::less;

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(BufferPtr, diag::warn_cxx98_compat_less_colon_colon); Diag(BufferOffset, diag::warn_cxx98_compat_less_colon_colon);

break; break;

} }

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::l_square; Kind = tok::l_square;

} else if (LangOpts.Digraphs && Char == '%') { // '<%' -> '{' } else if (LangOpts.Digraphs && Char == '%') { // '<%' -> '{'

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::l_brace; Kind = tok::l_brace;

} else if (Char == '#' && /*Not a trigraph*/ SizeTmp == 1 && } else if (Char == '#' && /*Not a trigraph*/ SizeTmp == 1 &&

lexEditorPlaceholder(Result, CurPtr)) { lexEditorPlaceholder(Result, CurOffset)) {

return true; return true;

} else { } else {

Kind = tok::less; Kind = tok::less;

} }

break; break;

case '>': case '>':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '=') { if (Char == '=') {

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::greaterequal; Kind = tok::greaterequal;

} else if (Char == '>') { } else if (Char == '>') {

char After = getCharAndSize(CurPtr+SizeTmp, SizeTmp2); char After = getCharAndSize(CurOffset + SizeTmp, SizeTmp2);

if (After == '=') { if (After == '=') {

CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), CurOffset = ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result),

SizeTmp2, Result); SizeTmp2, Result);

Kind = tok::greatergreaterequal; Kind = tok::greatergreaterequal;

} else if (After == '>' && IsStartOfConflictMarker(CurPtr-1)) { } else if (After == '>' && IsStartOfConflictMarker(CurOffset - 1)) {

// If this is actually a '>>>>' conflict marker, recognize it as such // If this is actually a '>>>>' conflict marker, recognize it as such

// and recover nicely. // and recover nicely.

goto LexNextToken; goto LexNextToken;

} else if (After == '>' && HandleEndOfConflictMarker(CurPtr-1)) { } else if (After == '>' && HandleEndOfConflictMarker(CurOffset - 1)) {

// If this is '>>>>>>>' and we're in a conflict marker, ignore it. // If this is '>>>>>>>' and we're in a conflict marker, ignore it.

goto LexNextToken; goto LexNextToken;

} else if (LangOpts.CUDA && After == '>') { } else if (LangOpts.CUDA && After == '>') {

Kind = tok::greatergreatergreater; Kind = tok::greatergreatergreater;

CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result), CurOffset = ConsumeChar(ConsumeChar(CurOffset, SizeTmp, Result),

SizeTmp2, Result); SizeTmp2, Result);

} else { } else {

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::greatergreater; Kind = tok::greatergreater;

} }

} else { } else {

Kind = tok::greater; Kind = tok::greater;

} }

break; break;

case '^': case '^':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '=') { if (Char == '=') {

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::caretequal; Kind = tok::caretequal;

} else if (LangOpts.OpenCL && Char == '^') { } else if (LangOpts.OpenCL && Char == '^') {

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

Kind = tok::caretcaret; Kind = tok::caretcaret;

} else { } else {

Kind = tok::caret; Kind = tok::caret;

} }

break; break;

case '|': case '|':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '=') { if (Char == '=') {

Kind = tok::pipeequal; Kind = tok::pipeequal;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else if (Char == '|') { } else if (Char == '|') {

// If this is '|||||||' and we're in a conflict marker, ignore it. // If this is '|||||||' and we're in a conflict marker, ignore it.

if (CurPtr[1] == '|' && HandleEndOfConflictMarker(CurPtr-1)) if (BufferStart[CurOffset + 1] == '|' &&

HandleEndOfConflictMarker(CurOffset - 1))

goto LexNextToken; goto LexNextToken;

Kind = tok::pipepipe; Kind = tok::pipepipe;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else { } else {

Kind = tok::pipe; Kind = tok::pipe;

} }

break; break;

case ':': case ':':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (LangOpts.Digraphs && Char == '>') { if (LangOpts.Digraphs && Char == '>') {

Kind = tok::r_square; // ':>' -> ']' Kind = tok::r_square; // ':>' -> ']'

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else if ((LangOpts.CPlusPlus || } else if ((LangOpts.CPlusPlus ||

LangOpts.DoubleSquareBracketAttributes) && LangOpts.DoubleSquareBracketAttributes) &&

Char == ':') { Char == ':') {

Kind = tok::coloncolon; Kind = tok::coloncolon;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else { } else {

Kind = tok::colon; Kind = tok::colon;

} }

break; break;

case ';': case ';':

Kind = tok::semi; Kind = tok::semi;

break; break;

case '=': case '=':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '=') { if (Char == '=') {

// If this is '====' and we're in a conflict marker, ignore it. // If this is '====' and we're in a conflict marker, ignore it.

if (CurPtr[1] == '=' && HandleEndOfConflictMarker(CurPtr-1)) if (BufferStart[CurOffset + 1] == '=' &&

HandleEndOfConflictMarker(CurOffset - 1))

goto LexNextToken; goto LexNextToken;

Kind = tok::equalequal; Kind = tok::equalequal;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else { } else {

Kind = tok::equal; Kind = tok::equal;

} }

break; break;

case ',': case ',':

Kind = tok::comma; Kind = tok::comma;

break; break;

case '#': case '#':

Char = getCharAndSize(CurPtr, SizeTmp); Char = getCharAndSize(CurOffset, SizeTmp);

if (Char == '#') { if (Char == '#') {

Kind = tok::hashhash; Kind = tok::hashhash;

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else if (Char == '@' && LangOpts.MicrosoftExt) { // #@ -> Charize } else if (Char == '@' && LangOpts.MicrosoftExt) { // #@ -> Charize

Kind = tok::hashat; Kind = tok::hashat;

if (!isLexingRawMode()) if (!isLexingRawMode())

Diag(BufferPtr, diag::ext_charize_microsoft); Diag(BufferOffset, diag::ext_charize_microsoft);

CurPtr = ConsumeChar(CurPtr, SizeTmp, Result); CurOffset = ConsumeChar(CurOffset, SizeTmp, Result);

} else { } else {

// We parsed a # character. If this occurs at the start of the line, // We parsed a # character. If this occurs at the start of the line,

// it's actually the start of a preprocessing directive. Callback to // it's actually the start of a preprocessing directive. Callback to

// the preprocessor to handle it. // the preprocessor to handle it.

// TODO: -fpreprocessed mode?? // TODO: -fpreprocessed mode??

if (TokAtPhysicalStartOfLine && !LexingRawMode && !Is_PragmaLexer) if (TokAtPhysicalStartOfLine && !LexingRawMode && !Is_PragmaLexer)

goto HandleDirective; goto HandleDirective;

Kind = tok::hash; Kind = tok::hash;

} }

break; break;

case '@': case '@':

// Objective C support. // Objective C support.

if (CurPtr[-1] == '@' && LangOpts.ObjC) if (BufferStart[CurOffset - 1] == '@' && LangOpts.ObjC)

Kind = tok::at; Kind = tok::at;

else else

Kind = tok::unknown; Kind = tok::unknown;

break; break;

// UCNs (C99 6.4.3, C++11 [lex.charset]p2) // UCNs (C99 6.4.3, C++11 [lex.charset]p2)

case '\\': case '\\':

if (!LangOpts.AsmPreprocessor) { if (!LangOpts.AsmPreprocessor) {

if (uint32_t CodePoint = tryReadUCN(CurPtr, BufferPtr, &Result)) { if (uint32_t CodePoint = tryReadUCN(CurOffset, BufferOffset, &Result)) {

if (CheckUnicodeWhitespace(Result, CodePoint, CurPtr)) { if (CheckUnicodeWhitespace(Result, CodePoint, CurOffset)) {

if (SkipWhitespace(Result, CurPtr, TokAtPhysicalStartOfLine)) if (SkipWhitespace(Result, CurOffset, TokAtPhysicalStartOfLine))

return true; // KeepWhitespaceMode return true; // KeepWhitespaceMode

// We only saw whitespace, so just try again with this lexer. // We only saw whitespace, so just try again with this lexer.

// (We manually eliminate the tail call to avoid recursion.) // (We manually eliminate the tail call to avoid recursion.)

goto LexNextToken; goto LexNextToken;

} }

return LexUnicodeIdentifierStart(Result, CodePoint, CurPtr); return LexUnicodeIdentifierStart(Result, CodePoint, CurOffset);

} }

Kind = tok::unknown; Kind = tok::unknown;

break; break;

default: { default: {

if (isASCII(Char)) { if (isASCII(Char)) {

Kind = tok::unknown; Kind = tok::unknown;

break; break;

} }

llvm::UTF32 CodePoint; llvm::UTF32 CodePoint;

// We can't just reset CurPtr to BufferPtr because BufferPtr may point to // We can't just reset CurPtr to BufferPtr because BufferPtr may point to

// an escaped newline. // an escaped newline.

--CurPtr; --CurOffset;

llvm::ConversionResult Status = const char *CurPtr = BufferStart + CurOffset;

llvm::convertUTF8Sequence((const llvm::UTF8 **)&CurPtr, llvm::ConversionResult Status = llvm::convertUTF8Sequence(

(const llvm::UTF8 *)BufferEnd, (const llvm::UTF8 **)&CurPtr,

&CodePoint, (const llvm::UTF8 *)(BufferStart + BufferSize), &CodePoint,

llvm::strictConversion); llvm::strictConversion);

CurOffset = CurPtr - BufferStart;

if (Status == llvm::conversionOK) { if (Status == llvm::conversionOK) {

if (CheckUnicodeWhitespace(Result, CodePoint, CurPtr)) { if (CheckUnicodeWhitespace(Result, CodePoint, CurOffset)) {

if (SkipWhitespace(Result, CurPtr, TokAtPhysicalStartOfLine)) if (SkipWhitespace(Result, CurOffset, TokAtPhysicalStartOfLine))

return true; // KeepWhitespaceMode return true; // KeepWhitespaceMode

// We only saw whitespace, so just try again with this lexer. // We only saw whitespace, so just try again with this lexer.

// (We manually eliminate the tail call to avoid recursion.) // (We manually eliminate the tail call to avoid recursion.)

goto LexNextToken; goto LexNextToken;

} }

return LexUnicodeIdentifierStart(Result, CodePoint, CurPtr); return LexUnicodeIdentifierStart(Result, CodePoint, CurOffset);

} }

if (isLexingRawMode() || ParsingPreprocessorDirective || if (isLexingRawMode() || ParsingPreprocessorDirective ||

PP->isPreprocessedOutput()) { PP->isPreprocessedOutput()) {

++CurPtr; ++CurOffset;

Kind = tok::unknown; Kind = tok::unknown;

break; break;

} }

// Non-ASCII characters tend to creep into source code unintentionally. // Non-ASCII characters tend to creep into source code unintentionally.

// Instead of letting the parser complain about the unknown token, // Instead of letting the parser complain about the unknown token,

// just diagnose the invalid UTF-8, then drop the character. // just diagnose the invalid UTF-8, then drop the character.

Diag(CurPtr, diag::err_invalid_utf8); Diag(CurOffset, diag::err_invalid_utf8);

BufferPtr = CurPtr+1; BufferOffset = CurOffset + 1;

// We're pretending the character didn't exist, so just try again with // We're pretending the character didn't exist, so just try again with

// this lexer. // this lexer.

// (We manually eliminate the tail call to avoid recursion.) // (We manually eliminate the tail call to avoid recursion.)

goto LexNextToken; goto LexNextToken;

} }

// Notify MIOpt that we read a non-whitespace/non-comment token. // Notify MIOpt that we read a non-whitespace/non-comment token.

MIOpt.ReadToken(); MIOpt.ReadToken();

// Update the location of token as well as BufferPtr. // Update the location of token as well as BufferPtr.

FormTokenWithChars(Result, CurPtr, Kind); FormTokenWithChars(Result, CurOffset, Kind);

return true; return true;

HandleDirective: HandleDirective:

// We parsed a # character and it's the start of a preprocessing directive. // We parsed a # character and it's the start of a preprocessing directive.

FormTokenWithChars(Result, CurPtr, tok::hash); FormTokenWithChars(Result, CurOffset, tok::hash);

PP->HandleDirective(Result); PP->HandleDirective(Result);

if (PP->hadModuleLoaderFatalFailure()) { if (PP->hadModuleLoaderFatalFailure()) {

// With a fatal failure in the module loader, we abort parsing. // With a fatal failure in the module loader, we abort parsing.

assert(Result.is(tok::eof) && "Preprocessor did not set tok:eof"); assert(Result.is(tok::eof) && "Preprocessor did not set tok:eof");

return true; return true;

} }

// We parsed the directive; lex a token with the new state. // We parsed the directive; lex a token with the new state.

return false; return false;

LexNextToken: LexNextToken:

Result.clearFlag(Token::NeedsCleaning); Result.clearFlag(Token::NeedsCleaning);

goto LexStart; goto LexStart;

} }

const char *Lexer::convertDependencyDirectiveToken( const char *Lexer::convertDependencyDirectiveToken(

const dependency_directives_scan::Token &DDTok, Token &Result) { const dependency_directives_scan::Token &DDTok, Token &Result) {

const char *TokPtr = BufferStart + DDTok.Offset; const char *TokPtr = BufferStart + DDTok.Offset;

Result.startToken(); Result.startToken();

Result.setLocation(getSourceLocation(TokPtr)); Result.setLocation(getSourceLocation(TokPtr - BufferStart));

Result.setKind(DDTok.Kind); Result.setKind(DDTok.Kind);

Result.setFlag((Token::TokenFlags)DDTok.Flags); Result.setFlag((Token::TokenFlags)DDTok.Flags);

Result.setLength(DDTok.Length); Result.setLength(DDTok.Length);

BufferPtr = TokPtr + DDTok.Length; BufferOffset = TokPtr + DDTok.Length - BufferStart;

return TokPtr; return TokPtr;

} }

bool Lexer::LexDependencyDirectiveToken(Token &Result) { bool Lexer::LexDependencyDirectiveToken(Token &Result) {

assert(isDependencyDirectivesLexer()); assert(isDependencyDirectivesLexer());

using namespace dependency_directives_scan; using namespace dependency_directives_scan;

while (NextDepDirectiveTokenIndex == DepDirectives.front().Tokens.size()) { while (NextDepDirectiveTokenIndex == DepDirectives.front().Tokens.size()) {

if (DepDirectives.front().Kind == pp_eof) if (DepDirectives.front().Kind == pp_eof)

return LexEndOfFile(Result, BufferEnd); return LexEndOfFile(Result, BufferSize);

if (DepDirectives.front().Kind == tokens_present_before_eof) if (DepDirectives.front().Kind == tokens_present_before_eof)

MIOpt.ReadToken(); MIOpt.ReadToken();

NextDepDirectiveTokenIndex = 0; NextDepDirectiveTokenIndex = 0;

DepDirectives = DepDirectives.drop_front(); DepDirectives = DepDirectives.drop_front();

} }

const dependency_directives_scan::Token &DDTok = const dependency_directives_scan::Token &DDTok =

DepDirectives.front().Tokens[NextDepDirectiveTokenIndex++]; DepDirectives.front().Tokens[NextDepDirectiveTokenIndex++];

if (NextDepDirectiveTokenIndex > 1 || DDTok.Kind != tok::hash) { if (NextDepDirectiveTokenIndex > 1 || DDTok.Kind != tok::hash) {

// Read something other than a preprocessor directive hash. // Read something other than a preprocessor directive hash.

MIOpt.ReadToken(); MIOpt.ReadToken();

} }

if (ParsingFilename && DDTok.is(tok::less)) { if (ParsingFilename && DDTok.is(tok::less)) {

BufferPtr = BufferStart + DDTok.Offset; BufferOffset = DDTok.Offset;

LexAngledStringLiteral(Result, BufferPtr + 1); LexAngledStringLiteral(Result, BufferOffset + 1);

if (Result.isNot(tok::header_name)) if (Result.isNot(tok::header_name))

return true; return true;

// Advance the index of lexed tokens. // Advance the index of lexed tokens.

while (true) { while (true) {

const dependency_directives_scan::Token &NextTok = const dependency_directives_scan::Token &NextTok =

DepDirectives.front().Tokens[NextDepDirectiveTokenIndex]; DepDirectives.front().Tokens[NextDepDirectiveTokenIndex];

if (BufferStart + NextTok.Offset >= BufferPtr) if (NextTok.Offset >= BufferOffset)

break; break;

++NextDepDirectiveTokenIndex; ++NextDepDirectiveTokenIndex;

} }

return true; return true;

} }

const char *TokPtr = convertDependencyDirectiveToken(DDTok, Result); const char *TokPtr = convertDependencyDirectiveToken(DDTok, Result);

Show All 12 Lines bool Lexer::LexDependencyDirectiveToken(Token &Result) {

} }

if (Result.isLiteral()) { if (Result.isLiteral()) {

Result.setLiteralData(TokPtr); Result.setLiteralData(TokPtr);

return true; return true;

} }

if (Result.is(tok::colon) && if (Result.is(tok::colon) &&

(LangOpts.CPlusPlus || LangOpts.DoubleSquareBracketAttributes)) { (LangOpts.CPlusPlus || LangOpts.DoubleSquareBracketAttributes)) {

// Convert consecutive colons to 'tok::coloncolon'. // Convert consecutive colons to 'tok::coloncolon'.

if (*BufferPtr == ':') { if (BufferStart[BufferOffset] == ':') {

assert(DepDirectives.front().Tokens[NextDepDirectiveTokenIndex].is( assert(DepDirectives.front().Tokens[NextDepDirectiveTokenIndex].is(

tok::colon)); tok::colon));

++NextDepDirectiveTokenIndex; ++NextDepDirectiveTokenIndex;

Result.setKind(tok::coloncolon); Result.setKind(tok::coloncolon);

} }

return true; return true;

} }

if (Result.is(tok::eod)) if (Result.is(tok::eod))

▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines case pp_endif:

if (!NestedIfs) { if (!NestedIfs) {

Stop = true; Stop = true;

} else { } else {

--NestedIfs; --NestedIfs;

} }

break; break;

case pp_eof: case pp_eof:

NextDepDirectiveTokenIndex = 0; NextDepDirectiveTokenIndex = 0;

return LexEndOfFile(Result, BufferEnd); return LexEndOfFile(Result, BufferSize);

} }

} while (!Stop); } while (!Stop);

const dependency_directives_scan::Token &DDTok = const dependency_directives_scan::Token &DDTok =

DepDirectives.front().Tokens.front(); DepDirectives.front().Tokens.front();

assert(DDTok.is(tok::hash)); assert(DDTok.is(tok::hash));

NextDepDirectiveTokenIndex = 1; NextDepDirectiveTokenIndex = 1;

convertDependencyDirectiveToken(DDTok, Result); convertDependencyDirectiveToken(DDTok, Result);

return false; return false;

} }
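
The Lexer.cpp changes above all rely on one property: an offset into the buffer stays meaningful when the buffer is reallocated, while a raw pointer does not. The following is a minimal sketch of that property, not code from the patch. `GrowableBuffer` and `OffsetCursor` are hypothetical names used only for illustration; they merely mirror the `CurPtr` to `CurOffset` and `BufferStart[...]` pattern seen in the diff.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for a source buffer that can grow; not a Clang class.
class GrowableBuffer {
  std::string Storage;

public:
  explicit GrowableBuffer(std::string Initial) : Storage(std::move(Initial)) {}

  // Appending may reallocate, so a pointer obtained earlier from start()
  // can dangle afterwards. Offsets of existing characters do not change
  // because text is only ever added at the back.
  void append(const std::string &More) { Storage += More; }

  const char *start() const { return Storage.data(); }
  std::size_t size() const { return Storage.size(); }
};

// Hypothetical cursor that stores an offset instead of a pointer.
struct OffsetCursor {
  const GrowableBuffer &Buf;
  unsigned Offset = 0;

  explicit OffsetCursor(const GrowableBuffer &B) : Buf(B) {}

  // Re-derive the address from the current buffer start on every access.
  char peek() const {
    assert(Offset < Buf.size() && "offset past end of buffer");
    return Buf.start()[Offset];
  }

  void advance() { ++Offset; }
};
```

The cost of this design is the extra indirection through the buffer start on every access, which is the overhead the summary describes as negligible in practice.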

clang/lib/Lex/PPDirectives.cpp

Show First 20 Lines • Show All 489 Lines • ▼ Show 20 Lines	void Preprocessor::SkipExcludedConditionalBlock(SourceLocation HashTokenLoc,
Token Tok;		Token Tok;
SourceLocation endLoc;		SourceLocation endLoc;

/// Keeps track and caches skipped ranges and also retrieves a prior skipped		/// Keeps track and caches skipped ranges and also retrieves a prior skipped
/// range if the same block is re-visited.		/// range if the same block is re-visited.
struct SkippingRangeStateTy {		struct SkippingRangeStateTy {
Preprocessor &PP;		Preprocessor &PP;

const char *BeginPtr = nullptr;		std::optional<unsigned> BeginOffset;
unsigned *SkipRangePtr = nullptr;		unsigned *SkipRangePtr = nullptr;

SkippingRangeStateTy(Preprocessor &PP) : PP(PP) {}		SkippingRangeStateTy(Preprocessor &PP) : PP(PP) {}

void beginLexPass() {		void beginLexPass() {
if (BeginPtr)		if (BeginOffset)
return; // continue skipping a block.		return; // continue skipping a block.

// Initiate a skipping block and adjust the lexer if we already skipped it		// Initiate a skipping block and adjust the lexer if we already skipped it
// before.		// before.
BeginPtr = PP.CurLexer->getBufferLocation();		BeginOffset = PP.CurLexer->getCurrentBufferOffset();
SkipRangePtr = &PP.RecordedSkippedRanges[BeginPtr];		SkipRangePtr = &PP.RecordedSkippedRanges[PP.CurLexer->getFileID()][*BeginOffset];
if (*SkipRangePtr) {		if (*SkipRangePtr) {
PP.CurLexer->seek(PP.CurLexer->getCurrentBufferOffset() + *SkipRangePtr,		PP.CurLexer->seek(PP.CurLexer->getCurrentBufferOffset() + *SkipRangePtr,
/*IsAtStartOfLine*/ true);		/*IsAtStartOfLine*/ true);
}		}
}		}

void endLexPass(const char *Hashptr) {		void endLexPass(unsigned HashOffset) {
if (!BeginPtr) {		if (!BeginOffset) {
// Not doing normal lexing.		// Not doing normal lexing.
assert(PP.CurLexer->isDependencyDirectivesLexer());		assert(PP.CurLexer->isDependencyDirectivesLexer());
return;		return;
}		}

// Finished skipping a block, record the range if it's first time visited.		// Finished skipping a block, record the range if it's first time visited.
if (!*SkipRangePtr) {		if (!*SkipRangePtr) {
*SkipRangePtr = Hashptr - BeginPtr;		*SkipRangePtr = HashOffset - *BeginOffset;
}		}
assert(*SkipRangePtr == Hashptr - BeginPtr);		assert(*SkipRangePtr == HashOffset - *BeginOffset);
BeginPtr = nullptr;		BeginOffset = std::nullopt;
SkipRangePtr = nullptr;		SkipRangePtr = nullptr;
}		}
} SkippingRangeState(*this);		} SkippingRangeState(*this);

while (true) {		while (true) {
if (CurLexer->isDependencyDirectivesLexer()) {		if (CurLexer->isDependencyDirectivesLexer()) {
CurLexer->LexDependencyDirectiveTokenWhileSkipping(Tok);		CurLexer->LexDependencyDirectiveTokenWhileSkipping(Tok);
} else {		} else {
Show All 32 Lines	while (true) {

// We just parsed a # character at the start of a line, so we're in		// We just parsed a # character at the start of a line, so we're in
// directive mode. Tell the lexer this so any newlines we see will be		// directive mode. Tell the lexer this so any newlines we see will be
// converted into an EOD token (this terminates the macro).		// converted into an EOD token (this terminates the macro).
CurPPLexer->ParsingPreprocessorDirective = true;		CurPPLexer->ParsingPreprocessorDirective = true;
if (CurLexer) CurLexer->SetKeepWhitespaceMode(false);		if (CurLexer) CurLexer->SetKeepWhitespaceMode(false);

assert(Tok.is(tok::hash));		assert(Tok.is(tok::hash));
const char *Hashptr = CurLexer->getBufferLocation() - Tok.getLength();		unsigned HashOffset = CurLexer->getCurrentBufferOffset() - Tok.getLength();
assert(CurLexer->getSourceLocation(Hashptr) == Tok.getLocation());		assert(CurLexer->getSourceLocation(HashOffset) == Tok.getLocation());

// Read the next token, the directive flavor.		// Read the next token, the directive flavor.
LexUnexpandedToken(Tok);		LexUnexpandedToken(Tok);

// If this isn't an identifier directive (e.g. is "# 1\n" or "#\n", or		// If this isn't an identifier directive (e.g. is "# 1\n" or "#\n", or
// something bogus), skip it.		// something bogus), skip it.
if (Tok.isNot(tok::raw_identifier)) {		if (Tok.isNot(tok::raw_identifier)) {
CurPPLexer->ParsingPreprocessorDirective = false;		CurPPLexer->ParsingPreprocessorDirective = false;
▲ Show 20 Lines • Show All 58 Lines • ▼ Show 20 Lines	if (Directive.startswith("if")) {
PPConditionalInfo CondInfo;		PPConditionalInfo CondInfo;
CondInfo.WasSkipping = true; // Silence bogus warning.		CondInfo.WasSkipping = true; // Silence bogus warning.
bool InCond = CurPPLexer->popConditionalLevel(CondInfo);		bool InCond = CurPPLexer->popConditionalLevel(CondInfo);
(void)InCond; // Silence warning in no-asserts mode.		(void)InCond; // Silence warning in no-asserts mode.
assert(!InCond && "Can't be skipping if not in a conditional!");		assert(!InCond && "Can't be skipping if not in a conditional!");

// If we popped the outermost skipping block, we're done skipping!		// If we popped the outermost skipping block, we're done skipping!
if (!CondInfo.WasSkipping) {		if (!CondInfo.WasSkipping) {
SkippingRangeState.endLexPass(Hashptr);		SkippingRangeState.endLexPass(HashOffset);
// Restore the value of LexingRawMode so that trailing comments		// Restore the value of LexingRawMode so that trailing comments
// are handled correctly, if we've reached the outermost block.		// are handled correctly, if we've reached the outermost block.
CurPPLexer->LexingRawMode = false;		CurPPLexer->LexingRawMode = false;
endLoc = CheckEndOfDirective("endif");		endLoc = CheckEndOfDirective("endif");
CurPPLexer->LexingRawMode = true;		CurPPLexer->LexingRawMode = true;
if (Callbacks)		if (Callbacks)
Callbacks->Endif(Tok.getLocation(), CondInfo.IfLoc);		Callbacks->Endif(Tok.getLocation(), CondInfo.IfLoc);
break;		break;
} else {		} else {
DiscardUntilEndOfDirective();		DiscardUntilEndOfDirective();
}		}
} else if (Sub == "lse") { // "else".		} else if (Sub == "lse") { // "else".
// #else directive in a skipping conditional. If not in some other		// #else directive in a skipping conditional. If not in some other
// skipping conditional, and if #else hasn't already been seen, enter it		// skipping conditional, and if #else hasn't already been seen, enter it
// as a non-skipping conditional.		// as a non-skipping conditional.
PPConditionalInfo &CondInfo = CurPPLexer->peekConditionalLevel();		PPConditionalInfo &CondInfo = CurPPLexer->peekConditionalLevel();

if (!CondInfo.WasSkipping)		if (!CondInfo.WasSkipping)
SkippingRangeState.endLexPass(Hashptr);		SkippingRangeState.endLexPass(HashOffset);

// If this is a #else with a #else before it, report the error.		// If this is a #else with a #else before it, report the error.
if (CondInfo.FoundElse)		if (CondInfo.FoundElse)
Diag(Tok, diag::pp_err_else_after_else);		Diag(Tok, diag::pp_err_else_after_else);

// Note that we've seen a #else in this conditional.		// Note that we've seen a #else in this conditional.
CondInfo.FoundElse = true;		CondInfo.FoundElse = true;

Show All 11 Lines	if (Directive.startswith("if")) {
break;		break;
} else {		} else {
DiscardUntilEndOfDirective(); // C99 6.10p4.		DiscardUntilEndOfDirective(); // C99 6.10p4.
}		}
} else if (Sub == "lif") { // "elif".		} else if (Sub == "lif") { // "elif".
PPConditionalInfo &CondInfo = CurPPLexer->peekConditionalLevel();		PPConditionalInfo &CondInfo = CurPPLexer->peekConditionalLevel();

if (!CondInfo.WasSkipping)		if (!CondInfo.WasSkipping)
SkippingRangeState.endLexPass(Hashptr);		SkippingRangeState.endLexPass(HashOffset);

// If this is a #elif with a #else before it, report the error.		// If this is a #elif with a #else before it, report the error.
if (CondInfo.FoundElse)		if (CondInfo.FoundElse)
Diag(Tok, diag::pp_err_elif_after_else) << PED_Elif;		Diag(Tok, diag::pp_err_elif_after_else) << PED_Elif;

// If this is in a skipping block or if we're already handled this #if		// If this is in a skipping block or if we're already handled this #if
// block, don't bother parsing the condition.		// block, don't bother parsing the condition.
if (CondInfo.WasSkipping || CondInfo.FoundNonSkip) {		if (CondInfo.WasSkipping || CondInfo.FoundNonSkip) {
Show All 28 Lines	if (Directive.startswith("if")) {
}		}
} else if (Sub == "lifdef" || // "elifdef"		} else if (Sub == "lifdef" || // "elifdef"
Sub == "lifndef") { // "elifndef"		Sub == "lifndef") { // "elifndef"
bool IsElifDef = Sub == "lifdef";		bool IsElifDef = Sub == "lifdef";
PPConditionalInfo &CondInfo = CurPPLexer->peekConditionalLevel();		PPConditionalInfo &CondInfo = CurPPLexer->peekConditionalLevel();
Token DirectiveToken = Tok;		Token DirectiveToken = Tok;

if (!CondInfo.WasSkipping)		if (!CondInfo.WasSkipping)
SkippingRangeState.endLexPass(Hashptr);		SkippingRangeState.endLexPass(HashOffset);

// Warn if using `#elifdef` & `#elifndef` in not C2x & C++2b mode even		// Warn if using `#elifdef` & `#elifndef` in not C2x & C++2b mode even
// if this branch is in a skipping block.		// if this branch is in a skipping block.
unsigned DiagID;		unsigned DiagID;
if (LangOpts.CPlusPlus)		if (LangOpts.CPlusPlus)
DiagID = LangOpts.CPlusPlus2b ? diag::warn_cxx2b_compat_pp_directive		DiagID = LangOpts.CPlusPlus2b ? diag::warn_cxx2b_compat_pp_directive
: diag::ext_cxx2b_pp_directive;		: diag::ext_cxx2b_pp_directive;
else		else
▲ Show 20 Lines • Show All 1,527 Lines • ▼ Show 20 Lines	if (Imported) {
// that it's part of the corresponding module.		// that it's part of the corresponding module.
} else {		} else {
// We hit an error processing the import. Bail out.		// We hit an error processing the import. Bail out.
if (hadModuleLoaderFatalFailure()) {		if (hadModuleLoaderFatalFailure()) {
// With a fatal failure in the module loader, we abort parsing.		// With a fatal failure in the module loader, we abort parsing.
Token &Result = IncludeTok;		Token &Result = IncludeTok;
assert(CurLexer && "#include but no current lexer set!");		assert(CurLexer && "#include but no current lexer set!");
Result.startToken();		Result.startToken();
CurLexer->FormTokenWithChars(Result, CurLexer->BufferEnd, tok::eof);		CurLexer->FormTokenWithChars(Result, CurLexer->BufferSize, tok::eof);
CurLexer->cutOffLexing();		CurLexer->cutOffLexing();
}		}
return {ImportAction::None};		return {ImportAction::None};
}		}
}		}

// The #included file will be considered to be a system header if either it is		// The #included file will be considered to be a system header if either it is
// in a system include directory, or if the #includer is a system include		// in a system include directory, or if the #includer is a system include
▲ Show 20 Lines • Show All 1,190 Lines • Show Last 20 Lines
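
The `SkippingRangeStateTy` change in this file replaces a `const char *BeginPtr` with a `std::optional<unsigned> BeginOffset` and keys the recorded ranges by file and offset. A plausible reason for the `std::optional` is that offset 0 is a legitimate position, so it cannot double as the "not skipping" sentinel the way `nullptr` could. The sketch below is a simplified model under that assumption, not the patch's code: `SkipRangeTracker` and `FileKey` are hypothetical names, and a nested `std::map` stands in for whatever container `RecordedSkippedRanges` actually uses.

```cpp
#include <cassert>
#include <map>
#include <optional>

// Hypothetical, simplified model of the skipped-range cache.
struct SkipRangeTracker {
  using FileKey = int; // stand-in for a file identifier
  using Cache = std::map<FileKey, std::map<unsigned, unsigned>>;

  Cache &Recorded;
  std::optional<unsigned> BeginOffset; // empty means "not currently skipping"
  unsigned *SkipRange = nullptr;

  explicit SkipRangeTracker(Cache &R) : Recorded(R) {}

  // Starts a skipping pass at CurOffset and returns the offset at which
  // lexing should resume. On a revisit this jumps past the cached range.
  unsigned begin(FileKey File, unsigned CurOffset) {
    if (BeginOffset)
      return CurOffset; // already inside a skipping pass
    BeginOffset = CurOffset;
    SkipRange = &Recorded[File][CurOffset]; // value-initialized to 0
    return CurOffset + *SkipRange;
  }

  // Ends the pass at the '#' that terminates the skipped block.
  void end(unsigned HashOffset) {
    if (!BeginOffset)
      return;
    if (!*SkipRange)
      *SkipRange = HashOffset - *BeginOffset; // first visit: record length
    assert(*SkipRange == HashOffset - *BeginOffset);
    BeginOffset = std::nullopt;
    SkipRange = nullptr;
  }
};
```

Keying by file as well as offset matters here because two different buffers can both contain a skipped block at the same offset, whereas the old pointer key was unique across buffers by construction.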

clang/lib/Lex/PPLexerChange.cpp

Show First 20 Lines • Show All 258 Lines • ▼ Show 20 Lines
}		}

/// Determine the location to use as the end of the buffer for a lexer.		/// Determine the location to use as the end of the buffer for a lexer.
///		///
/// If the file ends with a newline, form the EOF token on the newline itself,		/// If the file ends with a newline, form the EOF token on the newline itself,
/// rather than "on the line following it", which doesn't exist. This makes		/// rather than "on the line following it", which doesn't exist. This makes
/// diagnostics relating to the end of file include the last file that the user		/// diagnostics relating to the end of file include the last file that the user
/// actually typed, which is goodness.		/// actually typed, which is goodness.
const char *Preprocessor::getCurLexerEndPos() {		unsigned Preprocessor::getCurLexerEndPos() {
const char *EndPos = CurLexer->BufferEnd;		unsigned EndPos = CurLexer->BufferSize;
if (EndPos != CurLexer->BufferStart &&		if (EndPos != 0 &&
(EndPos[-1] == '\n' || EndPos[-1] == '\r')) {		(CurLexer->BufferStart[EndPos-1] == '\n' || CurLexer->BufferStart[EndPos-1] == '\r')) {
--EndPos;		--EndPos;

// Handle \n\r and \r\n:		// Handle \n\r and \r\n:
if (EndPos != CurLexer->BufferStart &&		if (EndPos != 0 &&
(EndPos[-1] == '\n' || EndPos[-1] == '\r') &&		(CurLexer->BufferStart[EndPos-1] == '\n' || CurLexer->BufferStart[EndPos-1] == '\r') &&
EndPos[-1] != EndPos[0])		CurLexer->BufferStart[EndPos-1] != CurLexer->BufferStart[EndPos])
--EndPos;		--EndPos;
}		}

return EndPos;		return EndPos;
}		}
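
The pointer-to-offset conversion in getCurLexerEndPos() can be tried in isolation. The following is a minimal sketch, not code from this revision: `endPosOffset` is a hypothetical helper, and `std::string_view` stands in for the lexer's `BufferStart`/`BufferSize` pair. It indexes from the start of the buffer rather than from an end pointer, so the result is a position that stays meaningful if the underlying storage is later reallocated.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of clang): mirrors the offset-based logic of
// Preprocessor::getCurLexerEndPos() on a plain string_view.
std::size_t endPosOffset(std::string_view Buf) {
  std::size_t EndPos = Buf.size();
  if (EndPos != 0 && (Buf[EndPos - 1] == '\n' || Buf[EndPos - 1] == '\r')) {
    --EndPos;

    // Handle \n\r and \r\n: step back once more only if the two
    // line-ending bytes differ, so "\n\n" drops a single newline.
    if (EndPos != 0 && (Buf[EndPos - 1] == '\n' || Buf[EndPos - 1] == '\r') &&
        Buf[EndPos - 1] != Buf[EndPos])
      --EndPos;
  }
  return EndPos;
}
```

Under these assumptions, `endPosOffset("abc\r\n")` yields the offset of the `\r`, while `endPosOffset("abc\n\n")` yields the offset of the second `\n`, since two identical bytes are treated as two separate line endings.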

static void collectAllSubModulesWithUmbrellaHeader(		static void collectAllSubModulesWithUmbrellaHeader(
const Module &Mod, SmallVectorImpl<const Module *> &SubMods) {		const Module &Mod, SmallVectorImpl<const Module *> &SubMods) {
[54 lines not shown]	bool Preprocessor::HandleEndOfFile(Token &Result, bool isEndOfMacro) {
if ((LeavingSubmodule || IncludeMacroStack.empty()) &&		if ((LeavingSubmodule || IncludeMacroStack.empty()) &&
!BuildingSubmoduleStack.empty() &&		!BuildingSubmoduleStack.empty() &&
BuildingSubmoduleStack.back().IsPragma) {		BuildingSubmoduleStack.back().IsPragma) {
Diag(BuildingSubmoduleStack.back().ImportLoc,		Diag(BuildingSubmoduleStack.back().ImportLoc,
diag::err_pp_module_begin_without_module_end);		diag::err_pp_module_begin_without_module_end);
Module *M = LeaveSubmodule(/*ForPragma*/true);		Module *M = LeaveSubmodule(/*ForPragma*/true);

Result.startToken();		Result.startToken();
const char *EndPos = getCurLexerEndPos();		unsigned EndPos = getCurLexerEndPos();
CurLexer->BufferPtr = EndPos;		CurLexer->BufferOffset = EndPos;
CurLexer->FormTokenWithChars(Result, EndPos, tok::annot_module_end);		CurLexer->FormTokenWithChars(Result, EndPos, tok::annot_module_end);
Result.setAnnotationEndLoc(Result.getLocation());		Result.setAnnotationEndLoc(Result.getLocation());
Result.setAnnotationValue(M);		Result.setAnnotationValue(M);
return true;		return true;
}		}

// See if this file had a controlling macro.		// See if this file had a controlling macro.
if (CurPPLexer) { // Not ending a macro, ignore it.		if (CurPPLexer) { // Not ending a macro, ignore it.
[77 lines not shown]	bool Preprocessor::HandleEndOfFile(Token &Result, bool isEndOfMacro) {
if (!IncludeMacroStack.empty()) {		if (!IncludeMacroStack.empty()) {

// If we lexed the code-completion file, act as if we reached EOF.		// If we lexed the code-completion file, act as if we reached EOF.
if (isCodeCompletionEnabled() && CurPPLexer &&		if (isCodeCompletionEnabled() && CurPPLexer &&
SourceMgr.getLocForStartOfFile(CurPPLexer->getFileID()) ==		SourceMgr.getLocForStartOfFile(CurPPLexer->getFileID()) ==
CodeCompletionFileLoc) {		CodeCompletionFileLoc) {
assert(CurLexer && "Got EOF but no current lexer set!");		assert(CurLexer && "Got EOF but no current lexer set!");
Result.startToken();		Result.startToken();
CurLexer->FormTokenWithChars(Result, CurLexer->BufferEnd, tok::eof);		CurLexer->FormTokenWithChars(Result, CurLexer->BufferSize, tok::eof);
CurLexer.reset();		CurLexer.reset();

CurPPLexer = nullptr;		CurPPLexer = nullptr;
recomputeCurLexerKind();		recomputeCurLexerKind();
return true;		return true;
}		}

if (!isEndOfMacro && CurPPLexer &&		if (!isEndOfMacro && CurPPLexer &&
[19 lines not shown]	if (!isEndOfMacro && CurPPLexer) {
ExitedFromPredefinesFile = (PredefinesFileID == ExitedFID);		ExitedFromPredefinesFile = (PredefinesFileID == ExitedFID);
}		}

if (LeavingSubmodule) {		if (LeavingSubmodule) {
// We're done with this submodule.		// We're done with this submodule.
Module *M = LeaveSubmodule(/*ForPragma*/false);		Module *M = LeaveSubmodule(/*ForPragma*/false);

// Notify the parser that we've left the module.		// Notify the parser that we've left the module.
const char *EndPos = getCurLexerEndPos();		unsigned EndPos = getCurLexerEndPos();
Result.startToken();		Result.startToken();
CurLexer->BufferPtr = EndPos;		CurLexer->BufferOffset = EndPos;
CurLexer->FormTokenWithChars(Result, EndPos, tok::annot_module_end);		CurLexer->FormTokenWithChars(Result, EndPos, tok::annot_module_end);
Result.setAnnotationEndLoc(Result.getLocation());		Result.setAnnotationEndLoc(Result.getLocation());
Result.setAnnotationValue(M);		Result.setAnnotationValue(M);
}		}

bool FoundPCHThroughHeader = false;		bool FoundPCHThroughHeader = false;
if (CurPPLexer && creatingPCHWithThroughHeader() &&		if (CurPPLexer && creatingPCHWithThroughHeader() &&
isPCHThroughHeader(		isPCHThroughHeader(
[35 lines not shown]	if (!IncludeMacroStack.empty()) {
} else {		} else {
// Client should lex another token unless we generated an EOM.		// Client should lex another token unless we generated an EOM.
return LeavingSubmodule;		return LeavingSubmodule;
}		}
}		}

// If this is the end of the main file, form an EOF token.		// If this is the end of the main file, form an EOF token.
assert(CurLexer && "Got EOF but no current lexer set!");		assert(CurLexer && "Got EOF but no current lexer set!");
const char *EndPos = getCurLexerEndPos();		unsigned EndPos = getCurLexerEndPos();
Result.startToken();		Result.startToken();
CurLexer->BufferPtr = EndPos;		CurLexer->BufferOffset = EndPos;
CurLexer->FormTokenWithChars(Result, EndPos, tok::eof);		CurLexer->FormTokenWithChars(Result, EndPos, tok::eof);

if (isCodeCompletionEnabled()) {		if (isCodeCompletionEnabled()) {
// Inserting the code-completion point increases the source buffer by 1,		// Inserting the code-completion point increases the source buffer by 1,
// but the main FileID was created before inserting the point.		// but the main FileID was created before inserting the point.
// Compensate by reducing the EOF location by 1, otherwise the location		// Compensate by reducing the EOF location by 1, otherwise the location
// will point to the next FileID.		// will point to the next FileID.
// FIXME: This is hacky, the code-completion point should probably be		// FIXME: This is hacky, the code-completion point should probably be
[333 lines not shown]

clang/lib/Lex/Pragma.cpp

[885 lines not shown]	if (Tok.isNot(tok::eod))
Diag(Tok.getLocation(), diag::ext_pp_extra_tokens_at_eol)		Diag(Tok.getLocation(), diag::ext_pp_extra_tokens_at_eol)
<< "pragma hdrstop";		<< "pragma hdrstop";

if (creatingPCHWithPragmaHdrStop() &&		if (creatingPCHWithPragmaHdrStop() &&
SourceMgr.isInMainFile(Tok.getLocation())) {		SourceMgr.isInMainFile(Tok.getLocation())) {
assert(CurLexer && "no lexer for #pragma hdrstop processing");		assert(CurLexer && "no lexer for #pragma hdrstop processing");
Token &Result = Tok;		Token &Result = Tok;
Result.startToken();		Result.startToken();
CurLexer->FormTokenWithChars(Result, CurLexer->BufferEnd, tok::eof);		CurLexer->FormTokenWithChars(Result, CurLexer->BufferSize, tok::eof);
CurLexer->cutOffLexing();		CurLexer->cutOffLexing();
}		}
if (usingPCHWithPragmaHdrStop())		if (usingPCHWithPragmaHdrStop())
SkippingUntilPragmaHdrStop = false;		SkippingUntilPragmaHdrStop = false;
}		}

/// AddPragmaHandler - Add the specified pragma handler to the preprocessor.		/// AddPragmaHandler - Add the specified pragma handler to the preprocessor.
/// If 'Namespace' is non-null, then it is a token required to exist on the		/// If 'Namespace' is non-null, then it is a token required to exist on the
[1,259 lines not shown]

Revision Contents

Diff 495040

clang/include/clang/Lex/Lexer.h

clang/include/clang/Lex/Preprocessor.h

clang/lib/Format/FormatTokenLexer.cpp

clang/lib/Lex/Lexer.cpp

clang/lib/Lex/PPDirectives.cpp

clang/lib/Lex/PPLexerChange.cpp

clang/lib/Lex/Pragma.cpp
