This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/
-
include/clang/Lex/
-
clang/
-
Lex/
1
Preprocessor.h
-
TokenLexer.h
-
lib/Lex/
-
Lex/
4/16
Preprocessor.cpp

Differential D59885

[Lex] Allow to consume tokens while preprocessing
ClosedPublic

Authored by ilya-biryukov on Mar 27 2019, 9:20 AM.

Download Raw Diff

Details

Reviewers

rsmith

Commits

rZORG96a8fbb12176: [Lex] Allow to consume tokens while preprocessing
rG96a8fbb12176: [Lex] Allow to consume tokens while preprocessing
rCTE361007: [Lex] Allow to consume tokens while preprocessing
rG929af6736127: [Lex] Allow to consume tokens while preprocessing
rL361007: [Lex] Allow to consume tokens while preprocessing
rC361007: [Lex] Allow to consume tokens while preprocessing

Summary

By adding a hook to consume all tokens produced by the preprocessor.
The intention of this change is to make it possible to consume the
expanded tokens without re-runnig the preprocessor with minimal changes
to the preprocessor and minimal performance penalty when preprocessing
without recording the tokens.

The added hook is very low-level and reconstructing the expanded token
stream requires more work in the client code, the actual algorithm to
collect the tokens using this hook can be found in the follow-up change.

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 31286
Build 31285: arc lint + arc unit

Event Timeline

ilya-biryukov created this revision.Mar 27 2019, 9:20 AM

Herald added a project: Restricted Project. · View Herald TranscriptMar 27 2019, 9:20 AM

Harbormaster completed remote builds in B29690: Diff 192455.Mar 27 2019, 9:20 AM

ilya-biryukov added a child revision: D59887: [Syntax] Introduce TokenBuffer, start clangToolingSyntax library.Mar 27 2019, 9:25 AM

ilya-biryukov added a parent revision: D59814: [Testing] Move clangd::Annotations to llvm testing support.

ilya-biryukov removed a parent revision: D59814: [Testing] Move clangd::Annotations to llvm testing support.

I'm happy to assess the performance costs of this change if needed (the function is obviously on the hot path), but I hope this won't add any significant performance costs.
Also happy to consider alternative approaches to collect the tokens when preprocessing, the current implementation is complicated (search for 'TokenCollector' in D59887) and my intuition is that moving parts of it to the preprocessor should simplify things considerably.

riccibruno added a subscriber: riccibruno.Mar 27 2019, 10:09 AM

Pass an enum indicating where the token comes from.
The enum is ad-hoc at the moment, will require some thought to turn
it into a reasonable abstraction.

The consumer of the token stream actually needs to be able to distinguish
tokens which are part of the final expanded token stream vs those that
aren't (macro directives, intermediate macro argument expansions, etc)

Harbormaster completed remote builds in B30110: Diff 193901.Apr 5 2019, 9:32 AM

ilya-biryukov added inline comments.Apr 16 2019, 7:01 AM

clang/include/clang/Lex/Preprocessor.h
120	This is supposed to provide enough information to distinguish token from an intermediate macro expansion from a token of a final expanded stream. It's not very well thought-through, though, the final set of enum values will definitely end up being different.

Revamp TokenSource, make it more principled

Harbormaster completed remote builds in B31145: Diff 197278.Apr 30 2019, 2:23 AM

I'd like to understand more about the intended use cases of this functionality. What information do clients want?

clang/lib/Lex/Preprocessor.cpp
870–900	This is a lot of extra stuff to be doing in the main `Lex` loop. Adding one (perfectly-predicted) branch on `OnToken` seems like it should be fine, but this seems like a bit more added complexity than I'd prefer. I'd like some performance measurements of `-cc1 -Eonly` to see if this makes any real difference.
888–890	This isn't a directive; these are normal tokens.
947–948	This seems like it's going to be extremely hard to use. If you just want the expanded token stream, then removing all the other calls to `OnToken` and only calling it here when `LexLevel == 0` will give you that.

In D59885#1485159, @rsmith wrote:

I'd like to understand more about the intended use cases of this functionality. What information do clients want?

We are aiming to collect all expanded tokens, all raw tokens and how they relate to each other (i.e. which raw tokens form macro calls, #include directives, etc that produced particular subrange of expanded tokens).
We go through all this trouble to allow mapping expanded tokens (which is what you "see" in the AST, even though AST does not store them) back to the raw tokens that produced them (which you can finally map to textual offsets, the final output in various tooling-related scenarios).
That information is used specifically to map (an allowed subset of) changes in the expanded token stream to textual changes.

We could actually get most of this information merely by looking at source locations. The only bit that we're missing is how to map from a macro call tokens into all expanded tokens that were produced by it.
I'll play around with reporting only the expanded token stream here and using source locations to produce the rest of the information we need.

clang/lib/Lex/Preprocessor.cpp
870–900	Happy to do the measurements. Do you have a good collection of input files for this in mind?
888–890	Sorry if it's a silly question, just wanted to clarify I'm not missing anything here. These tokens are the name of the module, "import-suffix" and the semicolon that follows it? And they end up being used to build the AST for `ImportDecl`?

ilya-biryukov marked an inline comment as done.May 2 2019, 7:04 AM

ilya-biryukov added inline comments.

clang/lib/Lex/Preprocessor.cpp
947–948	I've tried `LexLevel == 0 && IsNewToken` and it almost works. The only trouble left is dealing with the delayed parsing. E.g. I'm getting the callbacks for the same tokens multiple times in cases like parsing of method bodies. Any ideas on how to best tackle this? The example is: struct X { int method() { return 10; } int a = 10; }; Seeing multiple callbacks for `{ return 10; }` and `= 10;`, want to see only the first one.

Simplify the interface, get only the expanded token stream
An attempt to not report tokens from delayed parsing twice, almost works

Harbormaster completed remote builds in B31286: Diff 197778.May 2 2019, 7:38 AM

ilya-biryukov marked an inline comment as not done.May 2 2019, 8:16 AM

ilya-biryukov added inline comments.

clang/lib/Lex/Preprocessor.cpp
947–948	It almost works now, see the update revision. Almost no tokens get reported by the callback twice now, except the following case. Input: struct X { int method() { return method(); } }; After the body of the class, we get a an extra callback for `)` token from one of the functions under `ParsedLexedMethodDef`. This seems to be happening only for parenthesis.

ilya-biryukov marked an inline comment as not done.May 2 2019, 10:13 AM

ilya-biryukov added inline comments.

clang/lib/Lex/Preprocessor.cpp
947–948	More precisely, this happens from a following sequence of calls: Lex() {LexLevel=0} CachingLex() {LexLevel=1} // which does not have a token in `CachedTokens`, so it recurses into ... Lex() {LexLevel=1} // which results in a call to ... CurTokenLexer->Lex() At this point `CurTokenLexer` returns a token of a pre-lexed method definition, but the current implementation forgets that this was the case by the time the token is processed at the the top of the callstack (`Lex() {LexLevel=0}`). So the token ends up being reported by a callback.

Propagate whether the token came from a token stream through CachingLex.

Herald added subscribers: jsji, kbarton, nemanjai. · View Herald TranscriptMay 8 2019, 8:10 AM

Harbormaster completed remote builds in B31616: Diff 198662.May 8 2019, 8:11 AM

I've managed to get what I needed by returning a flag from Lex.
@rsmith, could you take a look? Is there a better way to do the equivalent?

Get rid of an accidental change in how LexAfterModuleImport is handled

Harbormaster completed remote builds in B31617: Diff 198663.May 8 2019, 8:16 AM

rsmith added inline comments.May 8 2019, 11:41 AM

clang/lib/Lex/Preprocessor.cpp
870–900	I think any large header should be fine for testing; I don't expect this to scale worse in the presence of lots of macro expansion or anything like that.
888–890	Right. LexAfterModuleImport is a pass-through lexing layer that takes special action on the ObjC `@import module.name;` and C++20 `import "header";` declarations, which are just normal phase 4 tokens.
947–948	Yeah, the parser messes with the token stream to support C++'s out-of-order parsing rule (among other things). Ignoring that here seems right. You might also want to make sure that calls to `EnterToken` are handled properly. The extra paren likely comes from such a call.

Here are some numbers from running clang -cc1 -Eonly on SemaExpr.cpp, it includes a whole ton of headers, including a rather large TreeTransform.h. This was run on my machine, so keep in mind it's rather noisy. Minimal times should be somewhat representative, though.

Before:
  Time (mean ± σ):     159.5 ms ±   7.3 ms    [User: 137.3 ms, System: 22.1 ms]
  Range (min … max):   148.9 ms … 199.5 ms    100 runs

After (no callback specified):
  Time (mean ± σ):     165.0 ms ±   7.5 ms    [User: 142.8 ms, System: 22.2 ms]
  Range (min … max):   155.0 ms … 196.7 ms    100 runs

And two more measurements with callbacks non-empty callbacks:

Counting tokens:
  Time (mean ± σ):     169.0 ms ±   9.5 ms    [User: 147.3 ms, System: 21.7 ms]
  Range (min … max):   158.3 ms … 206.0 ms    100 runs

Saving tokens to a vector:
  Time (mean ± σ):     191.0 ms ±   5.5 ms    [User: 154.1 ms, System: 36.8 ms]
  Range (min … max):   180.4 ms … 221.0 ms    100 runs

Overall, the performance cost in empty-callback case seems to be lower than 5%. This is significant, but hopefully acceptable.

In D59885#1496734, @ilya-biryukov wrote:

Overall, the performance cost in empty-callback case seems to be lower than 5%. This is significant, but hopefully acceptable.

5% is a lot to pay for something that we won't be using at all during normal preprocessing / compilation. But I think we can simplify this a bit, and that might help.

clang/lib/Lex/PPCaching.cpp
64 ↗	(On Diff #198663)	This change seems redundant: `LexLevel` is always `1` here, so this token would never be reported anyway. And with that gone I think you can remove the `Report` parameter entirely.
clang/lib/Lex/Preprocessor.cpp
873	Doesn't this always set `Report` to the same value as `IsNewToken`? (The only case we set `Report` to `false` is when we call `CachingLex` and it sets `IsNewToken` to `false`, and `CachingLex` can't be recursively called twice, so its recursive call to `Lex` can't set `Report` to `false`..)

I'll try to explore bringing the overhead down.
The fact that CachingLex is happening at LexLevel==1 might help, thanks for pointing that out!

clang/lib/Lex/PPCaching.cpp
64 ↗	(On Diff #198663)	It is reported by the caller of `CachingLex` (the one that has `LexLevel == 0`) and it needs to know where the reported token comes from.
clang/lib/Lex/Preprocessor.cpp
873	We can have `IsNewToken = true && Report = false` when the recursive call to `Lex()` inside `CachingLex()` sets `Report = CurTokenLexer->isMacroExpansion()`. This happens when we call `CachingLex` and the resulting token comes from a token stream for out-of-order parsing.

Thinking about this some more: distinguishing between "macro expansion" and other cases seems like a proxy for "this token came from inside the preprocessor / lexer" versus "this token was provided by the user", which is also exactly what IsNewToken is supposed to capture. I don't think we need two separate flags here.

I also suspect that a significant amount of the cost is the additional function call on the hot path (the 1-parameter Lex wrapper function). I wonder if there's something that we could do to remove that from the hot path without duplicating the whole Lex function. Maybe we could add a flag to Tokens to say that they're a replay of prior tokens, rather than being newly-created tokens, and make TokenLexer set that flag on tokens it produces if its token sequence came from a "replay" context (that's largely the same as your current "not a macro expansion" check, but there are probably cases where the preprocessor itself generates tokens for which that's wrong; we should have a flag on EnterTokenStream to indicate which mode we're in). We can then also change CachingLex to set the flag, and get rid of the IsNewToken flag too.

The suggested approach looks promising. The difference seems to be within the noise levels on my machine:

Before the change (baseline upstream revision):
  Time (mean ± σ):     159.1 ms ±   8.7 ms    [User: 138.3 ms, System: 20.6 ms]
  Range (min … max):   150.4 ms … 196.2 ms    100 runs

After (no callback specified):
  Time (mean ± σ):     160.4 ms ±   7.6 ms    [User: 138.5 ms, System: 21.7 ms]
  Range (min … max):   151.0 ms … 191.5 ms    100 runs

I'm sending out a prototype I used for measures for review. The flag is currently set by the preprocessor, but I guess it would make more sense to do the following before landing this:

flip the flag, i.e. instead of reporting whether the token is "new" (the common case), report whether it's "cached", i.e. coming from CachedTokens or from a non-macro-expansion token stream.
set this flag only inside CachingLex and TokenLexer.

The revision also removes IsNewToken from CachingLex and sets the flag in the token instead.
Any other suggestions?

Use a flag inside clang::Token instead

Harbormaster completed remote builds in B31892: Diff 199485.May 14 2019, 11:32 AM

Remove the now redundant 'public:' spec.

Harbormaster completed remote builds in B31893: Diff 199486.May 14 2019, 11:34 AM

In D59885#1501774, @ilya-biryukov wrote:

The suggested approach looks promising. The difference seems to be within the noise levels on my machine:

:) Awesome!

I guess it would make more sense to do the following before landing this:

flip the flag, i.e. instead of reporting whether the token is "new" (the common case), report whether it's "cached", i.e. coming from CachedTokens or from a non-macro-expansion token stream.

set this flag only inside CachingLex and TokenLexer.

The revision also removes IsNewToken from CachingLex and sets the flag in the token instead.
Any other suggestions?

Notionally, I think we should separate out "produces new tokens" from "is macro expansion" in TokenLexer. Please take a quick look at the various callers of EnterTokenStream (particularly the ones inside Lex; I'm not too worried about the ones in Parse) and see which (if any) of them actually intend to inject new "real" phase-4-of-translation tokens. If there are any (and maybe even if there aren't?), please add a flag on EnterTokenStream to specify whether the TokenLexer believes it's creating new tokens or replaying tokens that have been seen before.

We'll need to decide what to do about the annotation tokens that we create when we enter / leave / import modules (Preprocessor::EnterAnnotationToken). In some sense, those are "new" tokens -- and that's how the parser will interpret them -- but they didn't come from the original source file. I think either answer could be OK there, but I'm inclined to treat them as "new" because that gives more information to the token consumer.

Will do. Also think reporting annotation tokens is ok, one can easily tell them apart and filter them by looking at the corresponding flags in clang::Token, will put it into the docs, though

I've added a new parameter to EnterToken and EnterTokenStream. There are a bunch of interesting cases.

Preprocessor::AnnotateCachedTokens is one. It consumes some CachedTokens (which are always considered re-injected) and replaces them with a single annotation token (which we normally don't consider re-injected).
So we end up with an annotation token that is re-injected (and therefore never reported). That's ok for our current use-cases, so I won't try to fix it in this change. If this actually turns out important at any moment, can fix in a follow-up.

Another interesting case is parser "tweaking" tokens and re-introducing them, e.g. FixDigraph that turns a digraph token into two or various recoveries turning :: into : or introducing ephemeral ; and } tokens.
I've decided to mark those as re-injected for the time being, that is fine for now and I'm happy to reconsider later if it turns out we need them.

Properly mark tokens as reinjected where necessary.

Herald added a subscriber: eraman. · View Herald TranscriptMay 15 2019, 10:43 AM

Harbormaster completed remote builds in B31938: Diff 199637.May 15 2019, 10:43 AM

Improve comment, remove redundant braces

clang/lib/Lex/Preprocessor.cpp
947	Could probably remove the `IsReinjected` check from here. It's fine to check this on the client side.

Harbormaster completed remote builds in B31939: Diff 199639.May 15 2019, 10:52 AM

rsmith added inline comments.May 15 2019, 11:30 AM

clang/lib/Lex/PPDirectives.cpp
1521 ↗	(On Diff #199639)	I think it'd be more useful to treat this as a new token. But that's not a strong preference.
clang/lib/Lex/Pragma.cpp
370 ↗	(On Diff #199639)	I think this case is a reinjection; we've copied some tokens inside __pragma out into a duplicate position in the token stream. But I guess it doesn't matter because the tokens never escape the outer Lex function anyway.
clang/lib/Lex/Preprocessor.cpp
1128	I think this should always be `false`: the tokens we're producing here have never been emitted at `LexLevel` 0 before.

Address comments

Harbormaster completed remote builds in B31977: Diff 199775.May 16 2019, 2:51 AM

ilya-biryukov added inline comments.May 16 2019, 2:51 AM

clang/lib/Lex/PPDirectives.cpp
1521 ↗	(On Diff #199639)	Ah, there were too many changes and I missed this one. Definitely agree, clients can filter out annotation tokens themselves, if needed.
clang/lib/Lex/Pragma.cpp
370 ↗	(On Diff #199639)	Yeah, my logic is that it's not a re-injection in the sense that they were never the phase 4 tokens before.
clang/lib/Lex/Preprocessor.cpp
1128	Ah, totally, we lexed the original tokens in this function... Fixed, thanks!

Thanks, LGTM!

clang/lib/Lex/Pragma.cpp
370 ↗	(On Diff #199639)	Yeah, I think you're right. As a weird example, if we had a #pragma clang produce_token x pragma that just produced the given token, it'd be reasonable for it to leave the `Reinjected` flag alone, and it should come out as `false` in this case.

This revision is now accepted and ready to land.May 16 2019, 11:41 AM

Fix compilation of a clang-tidy check, add a FIXME to stop reporting those tokens

Harbormaster completed remote builds in B32053: Diff 199991.May 17 2019, 1:46 AM

Closed by commit rC361007: [Lex] Allow to consume tokens while preprocessing (authored by ibiryukov). · Explain WhyMay 17 2019, 2:31 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path

Size

clang/

include/

clang/

Lex/

Preprocessor.h

10 lines

TokenLexer.h

4 lines

lib/

Lex/

Preprocessor.cpp

11 lines

Diff 197778

clang/include/clang/Lex/Preprocessor.h

Show All 27 Lines
#include "clang/Lex/ModuleLoader.h"		#include "clang/Lex/ModuleLoader.h"
#include "clang/Lex/ModuleMap.h"		#include "clang/Lex/ModuleMap.h"
#include "clang/Lex/PPCallbacks.h"		#include "clang/Lex/PPCallbacks.h"
#include "clang/Lex/Token.h"		#include "clang/Lex/Token.h"
#include "clang/Lex/TokenLexer.h"		#include "clang/Lex/TokenLexer.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/FoldingSet.h"		#include "llvm/ADT/FoldingSet.h"
		#include "llvm/ADT/FunctionExtras.h"
#include "llvm/ADT/None.h"		#include "llvm/ADT/None.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/PointerUnion.h"		#include "llvm/ADT/PointerUnion.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"		#include "llvm/ADT/StringRef.h"
#include "llvm/ADT/TinyPtrVector.h"		#include "llvm/ADT/TinyPtrVector.h"
#include "llvm/ADT/iterator_range.h"		#include "llvm/ADT/iterator_range.h"
#include "llvm/Support/Allocator.h"		#include "llvm/Support/Allocator.h"
#include "llvm/Support/Casting.h"		#include "llvm/Support/Casting.h"
#include "llvm/Support/Registry.h"		#include "llvm/Support/Registry.h"
#include <cassert>		#include <cassert>
#include <cstddef>		#include <cstddef>
#include <cstdint>		#include <cstdint>
#include <memory>
#include <map>		#include <map>
		#include <memory>
#include <string>		#include <string>
#include <utility>		#include <utility>
#include <vector>		#include <vector>

namespace llvm {		namespace llvm {

template<unsigned InternalLen> class SmallString;		template<unsigned InternalLen> class SmallString;

▲ Show 20 Lines • Show All 50 Lines • ▼ Show 20 Lines	enum MacroUse {
MU_Define = 1,		MU_Define = 1,

// macro name specified in #undef		// macro name specified in #undef
MU_Undef = 2		MU_Undef = 2
};		};

/// Engages in a tight little dance with the lexer to efficiently		/// Engages in a tight little dance with the lexer to efficiently
/// preprocess tokens.		/// preprocess tokens.
///		///
		ilya-biryukovAuthorUnsubmitted Not Done Reply Inline Actions This is supposed to provide enough information to distinguish token from an intermediate macro expansion from a token of a final expanded stream. It's not very well thought-through, though, the final set of enum values will definitely end up being different. ilya-biryukov: This is supposed to provide enough information to distinguish token from an intermediate macro…
/// Lexers know only about tokens within a single source file, and don't		/// Lexers know only about tokens within a single source file, and don't
/// know anything about preprocessor-level issues like the \#include stack,		/// know anything about preprocessor-level issues like the \#include stack,
/// token expansion, etc.		/// token expansion, etc.
class Preprocessor {		class Preprocessor {
friend class VAOptDefinitionContext;		friend class VAOptDefinitionContext;
friend class VariadicMacroScopeGuard;		friend class VariadicMacroScopeGuard;

		llvm::unique_function<void(const clang::Token &)> OnToken;
std::shared_ptr<PreprocessorOptions> PPOpts;		std::shared_ptr<PreprocessorOptions> PPOpts;
DiagnosticsEngine *Diags;		DiagnosticsEngine *Diags;
LangOptions &LangOpts;		LangOptions &LangOpts;
const TargetInfo *Target = nullptr;		const TargetInfo *Target = nullptr;
const TargetInfo *AuxTarget = nullptr;		const TargetInfo *AuxTarget = nullptr;
FileManager &FileMgr;		FileManager &FileMgr;
SourceManager &SourceMgr;		SourceManager &SourceMgr;
std::unique_ptr<ScratchBuffer> ScratchBuf;		std::unique_ptr<ScratchBuffer> ScratchBuf;
▲ Show 20 Lines • Show All 857 Lines • ▼ Show 20 Lines	public:
void addPPCallbacks(std::unique_ptr<PPCallbacks> C) {		void addPPCallbacks(std::unique_ptr<PPCallbacks> C) {
if (Callbacks)		if (Callbacks)
C = llvm::make_unique<PPChainedCallbacks>(std::move(C),		C = llvm::make_unique<PPChainedCallbacks>(std::move(C),
std::move(Callbacks));		std::move(Callbacks));
Callbacks = std::move(C);		Callbacks = std::move(C);
}		}
/// \}		/// \}

		/// Register a function that would be called on each token in the final
		/// expanded token stream.
		void setTokenWatcher(llvm::unique_function<void(const clang::Token &)> F) {
		OnToken = std::move(F);
		}

bool isMacroDefined(StringRef Id) {		bool isMacroDefined(StringRef Id) {
return isMacroDefined(&Identifiers.get(Id));		return isMacroDefined(&Identifiers.get(Id));
}		}
bool isMacroDefined(const IdentifierInfo *II) {		bool isMacroDefined(const IdentifierInfo *II) {
return II->hasMacroDefinition() &&		return II->hasMacroDefinition() &&
(!getLangOpts().Modules \|\| (bool)getMacroDefinition(II));		(!getLangOpts().Modules \|\| (bool)getMacroDefinition(II));
}		}

▲ Show 20 Lines • Show All 1,308 Lines • Show Last 20 Lines

clang/include/clang/Lex/TokenLexer.h

Show First 20 Lines • Show All 141 Lines • ▼ Show 20 Lines	public:

/// Lex and return a token from this macro stream.		/// Lex and return a token from this macro stream.
bool Lex(Token &Tok);		bool Lex(Token &Tok);

/// isParsingPreprocessorDirective - Return true if we are in the middle of a		/// isParsingPreprocessorDirective - Return true if we are in the middle of a
/// preprocessor directive.		/// preprocessor directive.
bool isParsingPreprocessorDirective() const;		bool isParsingPreprocessorDirective() const;

		/// Returns true iff the TokenLexer is expanding a macro and not replaying a
		/// stream of tokens.
		bool isMacroExpansion() const { return Macro != nullptr; }

private:		private:
void destroy();		void destroy();

/// Return true if the next lex call will pop this macro off the include		/// Return true if the next lex call will pop this macro off the include
/// stack.		/// stack.
bool isAtEnd() const {		bool isAtEnd() const {
return CurTokenIdx == NumTokens;		return CurTokenIdx == NumTokens;
}		}
▲ Show 20 Lines • Show All 84 Lines • Show Last 20 Lines

clang/lib/Lex/Preprocessor.cpp

Show First 20 Lines • Show All 861 Lines • ▼ Show 20 Lines
}		}

void Preprocessor::Lex(Token &Result) {		void Preprocessor::Lex(Token &Result) {
++LexLevel;		++LexLevel;

// We loop here until a lex function returns a token; this avoids recursion.		// We loop here until a lex function returns a token; this avoids recursion.
bool ReturnedToken;		bool ReturnedToken;
bool IsNewToken = true;		bool IsNewToken = true;
		// Whether to call the OnToken callback on this token.
		bool Report;
do {		do {
switch (CurLexerKind) {		switch (CurLexerKind) {
		rsmithUnsubmitted Not Done Reply Inline Actions Doesn't this always set `Report` to the same value as `IsNewToken`? (The only case we set `Report` to `false` is when we call `CachingLex` and it sets `IsNewToken` to `false`, and `CachingLex` can't be recursively called twice, so its recursive call to `Lex` can't set `Report` to `false`..) rsmith: Doesn't this always set `Report` to the same value as `IsNewToken`? (The only case we set…
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions We can have `IsNewToken = true && Report = false` when the recursive call to `Lex()` inside `CachingLex()` sets `Report = CurTokenLexer->isMacroExpansion()`. This happens when we call `CachingLex` and the resulting token comes from a token stream for out-of-order parsing. ilya-biryukov: We can have `IsNewToken = true && Report = false` when the recursive call to `Lex()` inside…
case CLK_Lexer:		case CLK_Lexer:
		Report = true;
ReturnedToken = CurLexer->Lex(Result);		ReturnedToken = CurLexer->Lex(Result);
break;		break;
case CLK_TokenLexer:		case CLK_TokenLexer:
		Report = CurTokenLexer->isMacroExpansion();
ReturnedToken = CurTokenLexer->Lex(Result);		ReturnedToken = CurTokenLexer->Lex(Result);
break;		break;
case CLK_CachingLexer:		case CLK_CachingLexer:
CachingLex(Result, IsNewToken);		CachingLex(Result, IsNewToken);
		Report = IsNewToken;
ReturnedToken = true;		ReturnedToken = true;
break;		break;
case CLK_LexAfterModuleImport:		case CLK_LexAfterModuleImport:
ReturnedToken = LexAfterModuleImport(Result);		LexAfterModuleImport(Result);
		Report = true;
		ReturnedToken = true;
		rsmithUnsubmitted Not Done Reply Inline Actions This isn't a directive; these are normal tokens. rsmith: This isn't a directive; these are normal tokens.
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions Sorry if it's a silly question, just wanted to clarify I'm not missing anything here. These tokens are the name of the module, "import-suffix" and the semicolon that follows it? And they end up being used to build the AST for `ImportDecl`? ilya-biryukov: Sorry if it's a silly question, just wanted to clarify I'm not missing anything here. These…
		rsmithUnsubmitted Not Done Reply Inline Actions Right. LexAfterModuleImport is a pass-through lexing layer that takes special action on the ObjC `@import module.name;` and C++20 `import "header";` declarations, which are just normal phase 4 tokens. rsmith: Right. LexAfterModuleImport is a pass-through lexing layer that takes special action on the…
break;		break;
}		}
} while (!ReturnedToken);		} while (!ReturnedToken);

if (Result.is(tok::code_completion) && Result.getIdentifierInfo()) {		if (Result.is(tok::code_completion) && Result.getIdentifierInfo()) {
// Remember the identifier before code completion token.		// Remember the identifier before code completion token.
setCodeCompletionIdentifierInfo(Result.getIdentifierInfo());		setCodeCompletionIdentifierInfo(Result.getIdentifierInfo());
setCodeCompletionTokenRange(Result.getLocation(), Result.getEndLoc());		setCodeCompletionTokenRange(Result.getLocation(), Result.getEndLoc());
// Set IdenfitierInfo to null to avoid confusing code that handles both		// Set IdenfitierInfo to null to avoid confusing code that handles both
// identifiers and completion tokens.		// identifiers and completion tokens.
		rsmithUnsubmitted Not Done Reply Inline Actions This is a lot of extra stuff to be doing in the main `Lex` loop. Adding one (perfectly-predicted) branch on `OnToken` seems like it should be fine, but this seems like a bit more added complexity than I'd prefer. I'd like some performance measurements of `-cc1 -Eonly` to see if this makes any real difference. rsmith: This is a lot of extra stuff to be doing in the main `Lex` loop. Adding one (perfectly…
		ilya-biryukovAuthorUnsubmitted Not Done Reply Inline Actions Happy to do the measurements. Do you have a good collection of input files for this in mind? ilya-biryukov: Happy to do the measurements. Do you have a good collection of input files for this in mind?
		rsmithUnsubmitted Not Done Reply Inline Actions I think any large header should be fine for testing; I don't expect this to scale worse in the presence of lots of macro expansion or anything like that. rsmith: I think any large header should be fine for testing; I don't expect this to scale worse in the…
Result.setIdentifierInfo(nullptr);		Result.setIdentifierInfo(nullptr);
}		}

// Update ImportSeqState to track our position within a C++20 import-seq		// Update ImportSeqState to track our position within a C++20 import-seq
// if this token is being produced as a result of phase 4 of translation.		// if this token is being produced as a result of phase 4 of translation.
if (getLangOpts().CPlusPlusModules && LexLevel == 1 && IsNewToken) {		if (getLangOpts().CPlusPlusModules && LexLevel == 1 && IsNewToken) {
switch (Result.getKind()) {		switch (Result.getKind()) {
case tok::l_paren: case tok::l_square: case tok::l_brace:		case tok::l_paren: case tok::l_square: case tok::l_brace:
Show All 30 Lines	if (getLangOpts().CPlusPlusModules && LexLevel == 1 && IsNewToken) {
default:		default:
ImportSeqState.handleMisc();		ImportSeqState.handleMisc();
break;		break;
}		}
}		}

LastTokenWasAt = Result.is(tok::at);		LastTokenWasAt = Result.is(tok::at);
--LexLevel;		--LexLevel;
		if (OnToken && LexLevel == 0 && Report)
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions Could probably remove the `IsReinjected` check from here. It's fine to check this on the client side. ilya-biryukov: Could probably remove the `IsReinjected` check from here. It's fine to check this on the client…
		OnToken(Result);
		rsmithUnsubmitted Not Done Reply Inline Actions This seems like it's going to be extremely hard to use. If you just want the expanded token stream, then removing all the other calls to `OnToken` and only calling it here when `LexLevel == 0` will give you that. rsmith: This seems like it's going to be extremely hard to use. If you just want the expanded token…
		ilya-biryukovAuthorUnsubmitted Not Done Reply Inline Actions I've tried `LexLevel == 0 && IsNewToken` and it almost works. The only trouble left is dealing with the delayed parsing. E.g. I'm getting the callbacks for the same tokens multiple times in cases like parsing of method bodies. Any ideas on how to best tackle this? The example is: struct X { int method() { return 10; } int a = 10; }; Seeing multiple callbacks for `{ return 10; }` and `= 10;`, want to see only the first one. ilya-biryukov: I've tried `LexLevel == 0 && IsNewToken` and it almost works. The only trouble left is…
		ilya-biryukovAuthorUnsubmitted Not Done Reply Inline Actions It almost works now, see the update revision. Almost no tokens get reported by the callback twice now, except the following case. Input: struct X { int method() { return method(); } }; After the body of the class, we get a an extra callback for `)` token from one of the functions under `ParsedLexedMethodDef`. This seems to be happening only for parenthesis. ilya-biryukov: It almost works now, see the update revision. Almost no tokens get reported by the callback…
		ilya-biryukovAuthorUnsubmitted Not Done Reply Inline Actions More precisely, this happens from a following sequence of calls: Lex() {LexLevel=0} CachingLex() {LexLevel=1} // which does not have a token in `CachedTokens`, so it recurses into ... Lex() {LexLevel=1} // which results in a call to ... CurTokenLexer->Lex() At this point `CurTokenLexer` returns a token of a pre-lexed method definition, but the current implementation forgets that this was the case by the time the token is processed at the the top of the callstack (`Lex() {LexLevel=0}`). So the token ends up being reported by a callback. ilya-biryukov: More precisely, this happens from a following sequence of calls: ``` Lex() {LexLevel=0}…
		rsmithUnsubmitted Not Done Reply Inline Actions Yeah, the parser messes with the token stream to support C++'s out-of-order parsing rule (among other things). Ignoring that here seems right. You might also want to make sure that calls to `EnterToken` are handled properly. The extra paren likely comes from such a call. rsmith: Yeah, the parser messes with the token stream to support C++'s out-of-order parsing rule (among…
}		}

/// Lex a header-name token (including one formed from header-name-tokens if		/// Lex a header-name token (including one formed from header-name-tokens if
/// \p AllowConcatenation is \c true).		/// \p AllowConcatenation is \c true).
///		///
/// \param FilenameTok Filled in with the next token. On success, this will		/// \param FilenameTok Filled in with the next token. On success, this will
/// be either a header_name token. On failure, it will be whatever other		/// be either a header_name token. On failure, it will be whatever other
/// token was found instead.		/// token was found instead.
▲ Show 20 Lines • Show All 163 Lines • ▼ Show 20 Lines	bool Preprocessor::LexAfterModuleImport(Token &Result) {
}		}

// Allocate a holding buffer for a sequence of tokens and introduce it into		// Allocate a holding buffer for a sequence of tokens and introduce it into
// the token stream.		// the token stream.
auto EnterTokens = [this](ArrayRef<Token> Toks) {		auto EnterTokens = [this](ArrayRef<Token> Toks) {
auto ToksCopy = llvm::make_unique<Token[]>(Toks.size());		auto ToksCopy = llvm::make_unique<Token[]>(Toks.size());
std::copy(Toks.begin(), Toks.end(), ToksCopy.get());		std::copy(Toks.begin(), Toks.end(), ToksCopy.get());
EnterTokenStream(std::move(ToksCopy), Toks.size(),		EnterTokenStream(std::move(ToksCopy), Toks.size(),
/DisableMacroExpansion/ true);		/DisableMacroExpansion/ true);
		rsmithUnsubmitted Not Done Reply Inline Actions I think this should always be `false`: the tokens we're producing here have never been emitted at `LexLevel` 0 before. rsmith: I think this should always be `false`: the tokens we're producing here have never been emitted…
		ilya-biryukovAuthorUnsubmitted Done Reply Inline Actions Ah, totally, we lexed the original tokens in this function... Fixed, thanks! ilya-biryukov: Ah, totally, we lexed the original tokens in this function... Fixed, thanks!
};		};

// Check for a header-name.		// Check for a header-name.
SmallVector<Token, 32> Suffix;		SmallVector<Token, 32> Suffix;
if (Result.is(tok::header_name)) {		if (Result.is(tok::header_name)) {
// Enter the header-name token into the token stream; a Lex action cannot		// Enter the header-name token into the token stream; a Lex action cannot
// both return a token and cache tokens (doing so would corrupt the token		// both return a token and cache tokens (doing so would corrupt the token
// cache if the call to Lex comes from CachingLex / PeekAhead).		// cache if the call to Lex comes from CachingLex / PeekAhead).
▲ Show 20 Lines • Show All 253 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[Lex] Allow to consume tokens while preprocessingClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 197778

clang/include/clang/Lex/Preprocessor.h

clang/include/clang/Lex/TokenLexer.h

clang/lib/Lex/Preprocessor.cpp

[Lex] Allow to consume tokens while preprocessing
ClosedPublic